With the exponential growth of Android applications on the Google Play Store, ensuring the legitimacy and safety of these apps has become increasingly important. In this article, we discuss how to predict the authenticity of Android applications using classification techniques.
We use a dataset that includes various attributes of Android apps. We will explore the dataset, preprocess the data, select relevant features, and build classification models to assess their performance.
Android Authenticity: Understanding the Problem
The problem at hand is binary classification: we aim to predict whether an Android application is authentic (legitimate and safe for users to download and use) or non-authentic (potentially malicious).
To achieve this, we will leverage a dataset containing various attributes of Android apps, such as permissions required, minimum supported SDK version, and user ratings. By analyzing these features, our models will learn to identify patterns that distinguish legitimate apps from potentially harmful ones.
Dataset Link - Android_Authenticity
Dataset Description
The dataset contains several columns, such as:
- name: Name of the application.
- MD5: MD5 hash of the application.
- Min_SDK: Minimum SDK version required.
- Min_Screen: Minimum screen size supported.
- Min_OpenGL: Minimum OpenGL ES version required.
- Supported_CPU: Supported CPU types.
- rating_number: Number of ratings.
- rating_count: Count of ratings.
- android.permission_camera, android.permission_internet, etc.: Binary indicators for various permissions.
- authentic: Target variable indicating whether the application is authentic (1) or not (0).
Approach to the Model
Our approach involves several key steps:
- Data Preprocessing: We clean and prepare the dataset for model training. This includes handling missing values and encoding categorical features into numerical values.
- Feature Selection: We identify the most relevant features from the dataset that contribute significantly to predicting app authenticity.
- Model Training: We train three different classification models: Logistic Regression, Random Forest, and Support Vector Machine (SVM). Each model has its own strengths and weaknesses, offering a comprehensive analysis.
- Model Evaluation: We evaluate the performance of each model using metrics like accuracy, classification report, confusion matrix, and ROC curve. Evaluation metrics provide insights into the model's effectiveness in classifying authentic and non-authentic apps.
Android Authenticity Prediction using Classification: Step-by-Step Guide
Step 1: Importing Necessary Libraries and Loading the Dataset
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
# Load dataset
file_path = 'E:\\python\\android_authenticity_dataset.csv'
df = pd.read_csv(file_path)
# Display the first few rows of the dataset
print(df.head())
Output:
name MD5 ... android.permission_storage authentic
0 Moon-Brady cabb8a96352b2131cbc998df3399af42 ... 0 1
1 Sutton, Ponce and Benton 92d9fe1cfd115d8fd4475779bf7128d7 ... 0 1
2 Berger, Jordan and Hunter 813a4c42f2a9b09c4d071d0c2e335dc9 ... 0 1
3 White, Cooper and Young 9bbb79a73e2d028359ecb8b98a94421f ... 1 1
4 Ross, Jones and Adams a82e02d4a3476c7e6dec7ea239611610 ... 1 1
[5 rows x 21 columns]
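Before preprocessing, it can help to check column types, missing values, and how balanced the target classes are. The sketch below uses a small hypothetical frame (`sample`, with made-up values) as a stand-in for the real dataset; the same calls could be run on `df`.

```python
import pandas as pd

# Hypothetical miniature frame standing in for the real dataset (values are made up)
sample = pd.DataFrame({
    "Min_SDK": [21, 23, None, 19],
    "Supported_CPU": ["arm64-v8a", "x86", "arm64-v8a", None],
    "authentic": [1, 0, 1, 1],
})

# Count missing values in each column
missing_per_column = sample.isna().sum()

# Share of each class in the target column
class_balance = sample["authentic"].value_counts(normalize=True)

print(sample.dtypes)
print(missing_per_column)
print(class_balance)
```

A strongly imbalanced target would be a reason to look beyond accuracy when evaluating the models later.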
Step 2: Data Preprocessing: Cleaning and Preparing the Data
Before training any model, we must ensure the data is clean and suitable for analysis. This involves several steps:
- Handling Missing Values: We'll use a technique like forward fill to replace missing values in the dataset.
- Encoding Categorical Features: Features like developer name or screen size are categorical (textual). We'll convert them into numerical values using a technique called Label Encoding.
- Extracting Numerical Values: Some features might require transformation. For example, the "Min_OpenGL" feature might contain version information like "OpenGL ES 3.1." We'll extract the numeric part (3.1 in this case) for consistency.
- Feature Selection: Not all features might be equally important for predicting app authenticity. We'll identify the most relevant features through techniques like correlation analysis or feature importance analysis.
# Handle Missing Values
df.ffill(inplace=True) # Forward fill missing values
# Encode Categorical Features
label_encoder = LabelEncoder()
categorical_features = ['Min_Screen', 'Supported_CPU', 'Signature', 'Developer', 'Organization', 'Locality', 'Country', 'State']
for feature in categorical_features:
    # Only encode columns that still hold text values
    if df[feature].dtype == 'object':
        df[feature] = label_encoder.fit_transform(df[feature])
# Encode Min_OpenGL (convert OpenGL ES versions to numeric)
# float() is used instead of int() so that versions such as '3.1' parse correctly
df['Min_OpenGL'] = df['Min_OpenGL'].apply(lambda x: float(str(x).split(' ')[-1])) # Extract the numeric part of 'OpenGL ES 3.1'
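The extraction step above splits on spaces and converts the last token, which only works when every value ends in a clean version number. A slightly more defensive variant is sketched below with a regular expression. The helper name `parse_opengl_version` and the fallback value `0.0` are assumptions made for illustration, not part of the original workflow.

```python
import re

def parse_opengl_version(text, default=0.0):
    """Return the first numeric version found in text, e.g. 'OpenGL ES 3.1' -> 3.1."""
    match = re.search(r"\d+(?:\.\d+)?", str(text))
    # Fall back to the default when no number is present (e.g. 'unknown' or NaN)
    return float(match.group()) if match else default

examples = ["OpenGL ES 3.1", "OpenGL ES 2", "unknown"]
parsed = [parse_opengl_version(v) for v in examples]
print(parsed)  # [3.1, 2.0, 0.0]
```

If this approach suits the data, it could be applied with `df['Min_OpenGL'].apply(parse_opengl_version)`.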
Step 3: Feature Selection
Select the relevant features for the model based on their significance.
- Min_SDK
- Min_Screen
- Min_OpenGL
- Supported_CPU
- rating_number
- rating_count
- android.permission_camera
- android.permission_internet
- android.permission_location
- android.permission_storage
# Define Features and Target Variable
X = df[['Min_SDK', 'Min_Screen', 'Min_OpenGL', 'Supported_CPU', 'rating_number', 'rating_count', 'android.permission_camera', 'android.permission_internet', 'android.permission_location', 'android.permission_storage']]
y = df['authentic']
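The feature list above is chosen by hand. One simple way to sanity-check such a choice is to rank features by the absolute value of their correlation with the target. The sketch below does this on a hypothetical toy frame (`toy`, with made-up values); on the real data the same idea would be `X.corrwith(y)`.

```python
import pandas as pd

# Hypothetical toy data (made-up values); in the article this would be X and y
toy = pd.DataFrame({
    "Min_SDK": [16, 19, 21, 23, 26, 28],
    "android.permission_camera": [1, 0, 1, 0, 1, 0],
    "authentic": [0, 0, 0, 1, 1, 1],
})

features = toy.drop(columns="authentic")

# Pearson correlation of each feature with the target, ranked by magnitude
ranking = features.corrwith(toy["authentic"]).abs().sort_values(ascending=False)
print(ranking)
```

Correlation only captures linear, one-feature-at-a-time relationships, so it is best treated as a first look rather than a final selection criterion.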
Step 4: Building Classification Models
Now that our data is prepared, we can train classification models. We'll explore three popular algorithms:
- Logistic Regression: A basic linear classification algorithm suitable for understanding the relationship between features and authenticity.
- Random Forest: An ensemble method that combines multiple decision trees, offering robustness and interpretability.
- Support Vector Machine (SVM): A powerful technique for complex datasets, especially when dealing with clear boundaries between classes (authentic vs. non-authentic).
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
# Split the Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Logistic Regression
logistic_model = LogisticRegression(max_iter=1000) # Increased max_iter for convergence
logistic_model.fit(X_train, y_train)
# Random Forest
random_forest_model = RandomForestClassifier()
random_forest_model.fit(X_train, y_train)
# Support Vector Machine (SVM)
svm_model = SVC()
svm_model.fit(X_train, y_train)
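The selected features sit on quite different scales (rating counts versus 0/1 permission flags), and both SVM and Logistic Regression tend to be sensitive to that. A possible refinement, not used in the original code, is to wrap the model in a pipeline with `StandardScaler` and estimate accuracy with stratified cross-validation. The sketch below runs on synthetic stand-in data from `make_classification`; `X_demo` and `y_demo` are placeholders for the article's `X` and `y`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; replace with X and y from Step 3)
X_demo, y_demo = make_classification(n_samples=100, n_features=10, random_state=42)

# Scaling is fitted inside each training fold, which avoids leaking test statistics
scaled_svm = make_pipeline(StandardScaler(), SVC())

# Stratified folds keep the class ratio similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(scaled_svm, X_demo, y_demo, cv=cv, scoring="accuracy")

print("Mean accuracy:", scores.mean())
print("Std deviation:", scores.std())
```

With a small dataset, averaging over several folds usually gives a steadier estimate than a single train/test split.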
Step 5: Model Evaluation: Assessing Performance
After training the models, we need to evaluate their effectiveness in predicting app authenticity. We'll use various metrics:
- Accuracy: The overall percentage of correctly classified apps (authentic or non-authentic).
- Classification Report: Provides a detailed breakdown of the model's performance for each class (authentic/non-authentic) in terms of precision, recall, and F1-score.
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_curve, auc
import matplotlib.pyplot as plt
import seaborn as sns
# Predict and Evaluate Models
## Logistic Regression
y_pred = logistic_model.predict(X_test)
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
## Random Forest
y_pred = random_forest_model.predict(X_test)
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
## Support Vector Machine (SVM)
y_pred = svm_model.predict(X_test)
print("SVM Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
Output:
Logistic Regression Accuracy: 0.6
precision recall f1-score support
0 0.65 0.65 0.65 17
1 0.54 0.54 0.54 13
accuracy 0.60 30
macro avg 0.59 0.59 0.59 30
weighted avg 0.60 0.60 0.60 30
Random Forest Accuracy: 0.5666666666666667
precision recall f1-score support
0 0.60 0.71 0.65 17
1 0.50 0.38 0.43 13
accuracy 0.57 30
macro avg 0.55 0.55 0.54 30
weighted avg 0.56 0.57 0.56 30
SVM Accuracy: 0.43333333333333335
precision recall f1-score support
0 0.50 0.35 0.41 17
1 0.39 0.54 0.45 13
accuracy 0.43 30
macro avg 0.44 0.45 0.43 30
weighted avg 0.45 0.43 0.43 30
Step 6: Summarizing and Concluding Insights
1. Check the confusion matrix
# Confusion Matrix (y_pred currently holds the SVM predictions from Step 5)
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)
Output:
Confusion Matrix:
[[ 6 11]
[ 6 7]]
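The figures in the SVM classification report can be recomputed directly from this matrix, which shows how the two outputs relate. The sketch below assumes the scikit-learn layout (rows are actual classes, columns are predicted classes); the helper name `per_class_metrics` is introduced here for illustration.

```python
# Confusion matrix from the output above
# rows = actual class (0, 1), columns = predicted class (0, 1)
cm = [[6, 11],
      [6, 7]]

def per_class_metrics(cm, k):
    """Return (precision, recall, f1) for class k of a square confusion matrix."""
    tp = cm[k][k]
    predicted_k = sum(row[k] for row in cm)  # column total
    actual_k = sum(cm[k])                    # row total (the "support")
    precision = tp / predicted_k
    recall = tp / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Accuracy is the diagonal divided by the total number of samples
accuracy = (cm[0][0] + cm[1][1]) / sum(map(sum, cm))

for k in (0, 1):
    p, r, f = per_class_metrics(cm, k)
    print(f"class {k}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"accuracy={accuracy:.2f}")
```

The row totals (17 and 13) match the support column of the SVM report, and the rounded values reproduce its precision, recall, and F1 entries.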
2. Plotting the ROC Curve
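# ROC Curve for the SVM model
# Use continuous decision scores; hard 0/1 predictions would collapse the curve to a single threshold
y_score = svm_model.decision_function(X_test)
fpr, tpr, _ = roc_curve(y_test, y_score)
roc_auc = auc(fpr, tpr)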
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()
Output: [Figure: ROC curve plot]
3. Visualize the Feature Importance
The Random Forest model exposes an importance score for each feature, which we sort and plot as a bar chart.
# Feature Importance for Random Forest
importances = random_forest_model.feature_importances_
features = X.columns
importance_df = pd.DataFrame({'Feature': features, 'Importance': importances})
importance_df = importance_df.sort_values(by='Importance', ascending=False)
# Plot feature importance
plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=importance_df)
plt.title('Feature Importance')
plt.show()
Output: [Figure: bar chart of Random Forest feature importances]
Conclusion
In this article, we explored how to predict the authenticity of Android applications using classification techniques, applying Logistic Regression, Random Forest, and Support Vector Machine models to the dataset.
On the 30-app test set, Logistic Regression reached the highest accuracy (0.60), followed by Random Forest (0.57) and SVM (0.43). These scores are close to chance for a binary problem, so the models as trained here are a starting point rather than a reliable detector. The feature importance analysis from the Random Forest shows which app requirements and permissions that model relied on most. Future improvements could focus on a larger dataset, feature scaling and hyperparameter tuning, or incorporating additional features and more advanced algorithms to enhance prediction accuracy.