Predicting the Authenticity of Android Applications Using Classification Techniques

With the rapid growth in the number of Android applications on the Google Play Store, ensuring that apps are legitimate and safe has become increasingly important. In this article, we discuss how to predict the authenticity of Android applications using classification techniques.

Using a dataset of Android app attributes, we explore the data, preprocess it, select relevant features, and then build classification models and assess their performance.

Table of Contents

Android Authenticity: Understanding the Problem

Dataset Description
Approach to the Model

Android Authenticity Prediction using Classification: Step-by-Step Guide

Step 1: Importing Necessary Libraries
Step 2: Data Preprocessing: Cleaning and Preparing the Data
Step 3: Feature Selection
Step 4: Building Classification Models
Step 5: Model Evaluation: Assessing Performance
Step 6: Summarizing and Concluding Insights

Android Authenticity: Understanding the Problem

The problem at hand is binary classification: we aim to build a model that classifies an Android application as either authentic (safe) or non-authentic (potentially malicious). Authenticity in this context means that the application is legitimate and safe for users to download and use.

To achieve this, we will leverage a dataset containing various attributes of Android apps, such as permissions required, minimum supported SDK version, and user ratings. By analyzing these features, our models will learn to identify patterns that distinguish legitimate apps from potentially harmful ones.

Dataset Link - Android_Authenticity

Dataset Description

The dataset contains several columns, including:

name: Name of the application.
MD5: MD5 hash of the application.
Min_SDK: Minimum SDK version required.
Min_Screen: Minimum screen size supported.
Min_OpenGL: Minimum OpenGL ES version required.
Supported_CPU: Supported CPU types.
rating_number: Number of ratings.
rating_count: Count of ratings.
android.permission_camera, android.permission_internet, etc.: Binary indicators for various permissions.
authentic: Target variable indicating whether the application is authentic (1) or not (0).
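
Before modelling, it can help to check column types, missing values, and the balance of the target. The sketch below does this on a few made-up rows that mimic the schema above; the row values and the variable names (sample, missing_per_column, class_share, permission_columns) are assumptions for illustration and do not come from the real dataset.

```python
import pandas as pd

# Hypothetical rows that mimic the documented schema; all values are made up.
sample = pd.DataFrame({
    "name": ["App A", "App B", "App C", "App D"],
    "Min_SDK": [21, 23, 19, 26],
    "Min_OpenGL": ["OpenGL ES 2.0", "OpenGL ES 3.1", "OpenGL ES 3.0", None],
    "android.permission_camera": [0, 1, 1, 0],
    "android.permission_internet": [1, 1, 1, 0],
    "authentic": [1, 0, 1, 1],
})

# Column types show which features still need encoding (text columns).
print(sample.dtypes)

# Missing values per column guide the imputation step.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Share of each class in the target; a skewed split makes accuracy misleading.
class_share = sample["authentic"].value_counts(normalize=True)
print(class_share)

# Permission columns can be picked up by their shared prefix.
permission_columns = [c for c in sample.columns if c.startswith("android.permission_")]
print(permission_columns)
```

The same four checks can be run on the real DataFrame once it is loaded.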

Approach to the Model

Our approach involves several key steps:

Data Preprocessing: We clean and prepare the dataset for model training. This includes handling missing values, encoding categorical features into numerical values, and extracting numeric values from text fields.
Feature Selection: We identify the most relevant features from the dataset that contribute significantly to predicting app authenticity.
Model Training: We train three different classification models: Logistic Regression, Random Forest, and Support Vector Machine (SVM). Each model has its own strengths and weaknesses, offering a comprehensive analysis.
Model Evaluation: We evaluate the performance of each model using metrics like accuracy, classification report, confusion matrix, and ROC curve. Evaluation metrics provide insights into the model's effectiveness in classifying authentic and non-authentic apps.

Android Authenticity Prediction using Classification: Step-by-Step Guide

Step 1: Importing Necessary Libraries

Python

import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Load dataset
file_path = 'E:\\python\\android_authenticity_dataset.csv'
df = pd.read_csv(file_path)

# Display the first few rows of the dataset
print(df.head())

Output:

                        name                               MD5  ... android.permission_storage  authentic
0                 Moon-Brady  cabb8a96352b2131cbc998df3399af42  ...                          0          1
1   Sutton, Ponce and Benton  92d9fe1cfd115d8fd4475779bf7128d7  ...                          0          1
2  Berger, Jordan and Hunter  813a4c42f2a9b09c4d071d0c2e335dc9  ...                          0          1
3    White, Cooper and Young  9bbb79a73e2d028359ecb8b98a94421f  ...                          1          1
4      Ross, Jones and Adams  a82e02d4a3476c7e6dec7ea239611610  ...                          1          1

[5 rows x 21 columns]

Step 2: Data Preprocessing: Cleaning and Preparing the Data

Before training any model, we must ensure the data is clean and suitable for analysis. This involves several steps:

Handling Missing Values: We'll use a technique like forward fill to replace missing values in the dataset.
Encoding Categorical Features: Features like developer name or screen size are categorical (textual). We'll convert them into numerical values using a technique called Label Encoding.
Extracting Numerical Values: Some features might require transformation. For example, the "Min_OpenGL" feature might contain version information like "OpenGL ES 3.1." We'll extract the numeric part (3.1 in this case) for consistency.
Feature Selection: Not all features might be equally important for predicting app authenticity. We'll identify the most relevant features through techniques like correlation analysis or feature importance analysis.

Python

# Handle Missing Values
df.ffill(inplace=True)  # Forward fill missing values

# Encode Categorical Features
label_encoder = LabelEncoder()
categorical_features = ['Min_Screen', 'Supported_CPU', 'Signature', 'Developer', 'Organization', 'Locality', 'Country', 'State']
for feature in categorical_features:
    if df[feature].dtype == 'object':
        df[feature] = label_encoder.fit_transform(df[feature])

# Encode Min_OpenGL (convert OpenGL ES versions to numeric)
# float() rather than int(), so that versions such as "3.1" parse without a ValueError
df['Min_OpenGL'] = df['Min_OpenGL'].apply(lambda x: float(str(x).split(' ')[-1]))  # Extract the numeric part of 'OpenGL ES'
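
If the version strings are less regular than "OpenGL ES 3.1", a regular expression may be more forgiving than splitting on spaces. The following is a minimal sketch on made-up values; the column contents and the helper names (raw, encoders) are assumptions. It also keeps one LabelEncoder per column, so each mapping can be inspected or inverted later.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical raw values; the real data may use other spellings.
raw = pd.DataFrame({
    "Min_OpenGL": ["OpenGL ES 2.0", "OpenGL ES 3.1", None, "OpenGL ES 3"],
    "Supported_CPU": ["arm64-v8a", "x86", "arm64-v8a", "armeabi-v7a"],
})

# Pull the first number (with optional decimals) out of each string.
# Rows with no match, including missing values, become NaN.
extracted = raw["Min_OpenGL"].str.extract(r"(\d+(?:\.\d+)?)", expand=False)
raw["Min_OpenGL_num"] = pd.to_numeric(extracted)

# One encoder per column, stored in a dict so the mapping stays available.
encoders = {}
for column in ["Supported_CPU"]:
    encoder = LabelEncoder()
    raw[column + "_code"] = encoder.fit_transform(raw[column])
    encoders[column] = encoder

print(raw)
print(list(encoders["Supported_CPU"].classes_))
```

Note that LabelEncoder assigns integer codes in sorted order of the labels, which implies an ordering that nominal features such as CPU type do not really have; tree-based models tolerate this better than linear ones.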

Step 3: Feature Selection

Select the relevant features for the model based on their significance.

Min_SDK
Min_Screen
Min_OpenGL
Supported_CPU
rating_number
rating_count
android.permission_camera
android.permission_internet
android.permission_location
android.permission_storage

Python

# Define Features and Target Variable
X = df[['Min_SDK', 'Min_Screen', 'Min_OpenGL', 'Supported_CPU', 'rating_number', 'rating_count', 'android.permission_camera', 'android.permission_internet', 'android.permission_location', 'android.permission_storage']]
y = df['authentic']
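
One simple way to judge the significance of a feature is its correlation with the target, as mentioned in Step 2. The sketch below ranks features by absolute Pearson correlation on synthetic data in which one permission flag is built to track the target; the data, the 10% noise rate, and the names demo and ranking are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in data: one permission flag tracks the target, the rest is noise.
target = rng.integers(0, 2, size=n)
flip = rng.random(n) < 0.1  # roughly 10% of rows disagree with the target
demo = pd.DataFrame({
    "Min_SDK": rng.integers(16, 30, size=n),
    "rating_count": rng.integers(0, 5000, size=n),
    "android.permission_camera": np.where(flip, 1 - target, target),
    "authentic": target,
})

# Absolute Pearson correlation of each feature with the target, strongest first.
ranking = (
    demo.drop(columns="authentic")
    .corrwith(demo["authentic"])
    .abs()
    .sort_values(ascending=False)
)
print(ranking)
```

Correlation only captures linear, one-feature-at-a-time relationships, so a low score here does not prove a feature is useless in combination with others.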

Step 4: Building Classification Models

Now that our data is prepared, we can train classification models. We'll explore three popular algorithms:

Logistic Regression: A basic linear classification algorithm suitable for understanding the relationship between features and authenticity.
Random Forest: An ensemble method that combines multiple decision trees, offering robustness to overfitting and built-in feature-importance estimates.
Support Vector Machine (SVM): A powerful technique for complex datasets, especially when dealing with clear boundaries between classes (authentic vs. non-authentic).

Python

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Split the Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Logistic Regression
logistic_model = LogisticRegression(max_iter=1000)  # Increased max_iter for convergence
logistic_model.fit(X_train, y_train)

# Random Forest
random_forest_model = RandomForestClassifier()
random_forest_model.fit(X_train, y_train)

# Support Vector Machine (SVM)
svm_model = SVC()
svm_model.fit(X_train, y_train)
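
The selected features live on very different scales (rating counts versus 0/1 permission flags), and both SVM and Logistic Regression are sensitive to that. A possible refinement, sketched here on synthetic data from make_classification rather than the real dataset, is to standardize the features inside a pipeline and to stratify the split. The names X_demo, y_demo, and scaled_svm are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the app features (100 rows, 10 columns).
X_demo, y_demo = make_classification(n_samples=100, n_features=10, random_state=42)

# stratify keeps the class ratio the same in the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# The scaler is fit on the training part only, inside the pipeline,
# so no information from the test rows leaks into the scaling.
scaled_svm = make_pipeline(StandardScaler(), SVC())
scaled_svm.fit(X_tr, y_tr)

test_accuracy = scaled_svm.score(X_te, y_te)
print("Scaled SVM accuracy on synthetic data:", test_accuracy)
```

Whether scaling helps on the real data is an empirical question; the sketch only shows where the step would go.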

Step 5: Model Evaluation: Assessing Performance

After training the models, we need to evaluate their effectiveness in predicting app authenticity. We'll use various metrics:

Accuracy: The overall percentage of correctly classified apps (authentic or non-authentic).
Classification Report: Provides a detailed breakdown of the model's performance for each class (authentic/non-authentic) in terms of precision, recall, and F1-score.

Python

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_curve, auc
import matplotlib.pyplot as plt
import seaborn as sns

# Predict and Evaluate Models

## Logistic Regression
y_pred = logistic_model.predict(X_test)
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

## Random Forest
y_pred = random_forest_model.predict(X_test)
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

## Support Vector Machine (SVM)
y_pred = svm_model.predict(X_test)
print("SVM Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Output:

Logistic Regression Accuracy: 0.6
              precision    recall  f1-score   support

           0       0.65      0.65      0.65        17
           1       0.54      0.54      0.54        13

    accuracy                           0.60        30
   macro avg       0.59      0.59      0.59        30
weighted avg       0.60      0.60      0.60        30

Random Forest Accuracy: 0.5666666666666667
              precision    recall  f1-score   support

           0       0.60      0.71      0.65        17
           1       0.50      0.38      0.43        13

    accuracy                           0.57        30
   macro avg       0.55      0.55      0.54        30
weighted avg       0.56      0.57      0.56        30

SVM Accuracy: 0.43333333333333335
              precision    recall  f1-score   support

           0       0.50      0.35      0.41        17
           1       0.39      0.54      0.45        13

    accuracy                           0.43        30
   macro avg       0.44      0.45      0.43        30
weighted avg       0.45      0.43      0.43        30
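
With only 30 test apps, a single split gives noisy accuracy figures. In that split, class 0 has 17 of the 30 apps, so always predicting class 0 would already score 17/30, about 0.57. A sketch of a steadier comparison, assuming synthetic stand-in data and a majority-class baseline (neither is part of the original experiment), uses stratified 5-fold cross-validation:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data; swap in the real X and y to reuse the loop.
X_demo, y_demo = make_classification(n_samples=100, n_features=10, random_state=42)

candidates = {
    "Majority-class baseline": DummyClassifier(strategy="most_frequent"),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVM": SVC(),
}

# Five stratified folds: every row is used for testing exactly once.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=folds, scoring="accuracy")
    for name, model in candidates.items()
}

for name, scores in cv_scores.items():
    print(f"{name}: mean={scores.mean():.2f}, std={scores.std():.2f}")
```

A model is only worth keeping if its mean score sits clearly above the baseline, ideally by more than the spread across folds.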

Step 6: Summarizing and Concluding Insights

1. Check the confusion matrix

Python

# Confusion Matrix
# y_pred still holds the SVM predictions from Step 5, so this matrix describes the SVM model
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)

Output:

Confusion Matrix:
[[ 6 11]
 [ 6  7]]
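
The matrix above belongs to the SVM, since y_pred was last assigned from svm_model in Step 5. As a worked check, the per-class figures in the SVM classification report can be recomputed from its four cells. The variable names below are chosen for illustration.

```python
import numpy as np

# Matrix from the output above: rows are true classes, columns are predictions.
cm = np.array([[6, 11],
               [6, 7]])

# For a 2x2 matrix, ravel() yields TN, FP, FN, TP with class 1 as "positive".
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # 13 / 30
precision_1 = tp / (tp + fp)      # 7 / 18
recall_1 = tp / (tp + fn)         # 7 / 13
precision_0 = tn / (tn + fn)      # 6 / 12
recall_0 = tn / (tn + fp)         # 6 / 17

print(f"accuracy={accuracy:.2f}")
print(f"class 0: precision={precision_0:.2f}, recall={recall_0:.2f}")
print(f"class 1: precision={precision_1:.2f}, recall={recall_1:.2f}")
```

The results (accuracy 0.43; class 0 precision 0.50 and recall 0.35; class 1 precision 0.39 and recall 0.54) agree with the SVM report in Step 5. The 11 false positives are non-authentic apps labelled authentic, which is the costlier mistake for this task.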

2. Plotting the ROC Curve

Python

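# ROC Curve for the SVM model
# Use continuous decision scores; hard 0/1 predictions would collapse the curve to a single point
y_score = svm_model.decision_function(X_test)
fpr, tpr, _ = roc_curve(y_test, y_score)
roc_auc = auc(fpr, tpr)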

plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()

Output: [Figure: Receiver Operating Characteristic curve]
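
An ROC curve is normally drawn from continuous scores, because each threshold on the score gives one point on the curve. The hand-made example below (labels and scores are invented for illustration) shows how the area under the curve changes when hard 0/1 predictions are passed instead.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hand-made labels and scores, purely for illustration.
y_true = np.array([0, 0, 1, 1, 1, 0])
scores = np.array([0.10, 0.45, 0.40, 0.80, 0.60, 0.30])
hard_labels = (scores >= 0.5).astype(int)

# Continuous scores let the curve sweep over many thresholds.
auc_from_scores = roc_auc_score(y_true, scores)

# Hard 0/1 labels leave a single operating point between (0, 0) and (1, 1).
auc_from_labels = roc_auc_score(y_true, hard_labels)

fpr_hard, tpr_hard, _ = roc_curve(y_true, hard_labels)
print("AUC from scores:", round(auc_from_scores, 3))
print("AUC from hard labels:", round(auc_from_labels, 3))
print("Points on the hard-label curve:", list(zip(fpr_hard, tpr_hard)))
```

Here the scores give an AUC of 8/9 and the hard labels 5/6. For the models in this article, scores are available from decision_function on the SVC and from predict_proba on LogisticRegression and RandomForestClassifier (taking the column for class 1).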

3. Visualize the Feature Importance

Plot the feature importances from the Random Forest model to see which features the model relied on most.

Python

# Feature Importance for Random Forest
importances = random_forest_model.feature_importances_
features = X.columns
importance_df = pd.DataFrame({'Feature': features, 'Importance': importances})
importance_df = importance_df.sort_values(by='Importance', ascending=False)

# Plot feature importance
plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=importance_df)
plt.title('Feature Importance')
plt.show()

Output: [Figure: bar chart of Random Forest feature importances]
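
The feature_importances_ attribute is computed from the training data and can favour features with many distinct values. A complementary check is permutation importance measured on held-out data. The sketch below runs it on synthetic data; the dataset, the split, and the names forest and result are assumptions, not part of the original experiment.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data with 2 informative features out of 6.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=2, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

forest = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Shuffle one column at a time on held-out data and record the drop in accuracy.
result = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=0)

# Print features from most to least important.
for index in result.importances_mean.argsort()[::-1]:
    print(
        f"feature_{index}: {result.importances_mean[index]:.3f} "
        f"+/- {result.importances_std[index]:.3f}"
    )
```

A feature whose mean importance is within one standard deviation of zero contributes little to held-out accuracy, whatever its impurity-based score says.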

Conclusion

In this article, we explored how to predict the authenticity of Android applications using classification techniques, training Logistic Regression, Random Forest, and Support Vector Machine models on app requirements, ratings, and permissions.

On the 30-app test set, Logistic Regression reached the highest accuracy (0.60), followed by Random Forest (0.57) and SVM (0.43). Always predicting the majority class would score 17/30, about 0.57, on the same split, so none of the models clearly beats a trivial baseline here, and the Random Forest feature importances should be read with that in mind. The workflow is still a practical starting point for assessing app legitimacy, and future improvements could focus on incorporating additional features or advanced algorithms to enhance prediction accuracy.
