Binary classification using LightGBM

Last Updated : 6 Sep, 2025

LightGBM (Light Gradient Boosting Machine) is an open-source gradient boosting framework designed for efficient and scalable machine learning. It is widely used for classification tasks, including binary classification, and is optimized for speed and memory usage.

We will implement binary classification using LightGBM:

1. Installing Libraries

We will install LightGBM for classification tasks.

pip install lightgbm

2. Importing Libraries and Dataset

We will import the necessary Python libraries (pandas, numpy, seaborn, matplotlib, scikit-learn and lightgbm) and load the dataset.

You can download the dataset from here.

Python
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt
import lightgbm as lgb
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from lightgbm import LGBMClassifier
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv('/content/diabetes.csv')

2.1. Previewing the Dataset

We will check the first few rows to understand the data structure.

  • df.head() displays the first five rows for a quick preview
Python
df.head()

Output:

Figure: Previewing the Dataset

2.2. Dataset Shape

We will check the dimensions of the dataset.

  • df.shape returns the number of rows and columns
Python
df.shape

Output:

(768, 9)

2.3. Dataset Information

We will check the data types and null values.

  • df.info() shows column data types and counts of non-null values
Python
df.info()

Output:

Figure: Dataset Information

2.4. Descriptive Statistics

We will compute statistical summaries of numeric features.

  • df.describe() shows count, mean, std, min, max and percentiles
Python
df.describe()

Output:

Figure: Descriptive Statistics

3. Exploratory Data Analysis (EDA)

We will analyze patterns, distributions and relationships among features.

3.1. Class Distribution

We will visualize the distribution of the target variable Outcome.

  • value_counts() counts frequency of each class.
  • plt.pie() plots a pie chart; autopct='%1.1f%%' shows percentages.
  • It helps identify class imbalance which can affect model training.
Python
temp = df['Outcome'].value_counts()
plt.pie(temp.values, labels=temp.index.values, autopct='%1.1f%%')
plt.title("Class Distribution")
plt.show()

Output:

Figure: Class Distribution
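
The pie chart gives a visual impression of the class balance; the same information can also be read off numerically. The sketch below uses a small made-up Outcome column rather than the real dataset, and derives a negative-to-positive ratio, which is one common starting value for LightGBM's scale_pos_weight parameter. Whether any reweighting is needed for this dataset is an assumption to validate, not a recommendation.

```python
import pandas as pd

# Hypothetical stand-in for df['Outcome']: 65 negatives, 35 positives.
outcome = pd.Series([0] * 65 + [1] * 35, name="Outcome")

counts = outcome.value_counts()                 # absolute frequency per class
shares = outcome.value_counts(normalize=True)   # relative frequency per class

# Ratio of negatives to positives; a possible value for scale_pos_weight.
neg_pos_ratio = counts[0] / counts[1]

print(shares.round(2).to_dict())
print(round(neg_pos_ratio, 2))
```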

3.2. Correlation Matrix

We will check correlations between features.

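  • df.corr() computes pairwise Pearson correlations between columns.
  • df.corr() > 0.7 turns the matrix into a boolean mask, so the heatmap highlights only the pairs whose correlation exceeds 0.7; annot=True writes each cell's value on the plot.
  • Highly correlated feature pairs may indicate redundancy, while a feature that is almost perfectly correlated with the target may indicate data leakage.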
Python
sb.heatmap(df.corr() > 0.7, cbar=False, annot=True)
plt.show()

Output:

Figure: Correlation Matrix
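
A thresholded heatmap shows where the cutoff is exceeded but not by how much. A minimal sketch of listing the offending pairs directly follows; it runs on synthetic data, and the column names a, b and c are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),   # nearly a copy of "a"
    "c": rng.normal(size=200),                  # independent of both
})

corr = demo.corr()
# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna()        # (row, col) -> correlation
high = pairs[pairs.abs() > 0.7]       # same 0.7 cutoff as the heatmap
print(high)
```

Applied to the real data, the same steps would use df.corr() in place of demo.corr().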

3.3. Feature Distributions

We will visualize the distribution of each feature.

  • plt.figure(figsize=(15, 15)) sets the size of the figure for all subplots.
  • plt.subplot(nrows, ncols, index) specifies the position of the subplot in the grid.
  • sb.histplot(df[col], kde=True) plots a histogram with a KDE (Kernel Density Estimate) to show distribution.
  • plt.tight_layout() automatically adjusts subplot spacing to prevent overlap.
  • Visualizing distributions helps detect skewness, spread and outliers in numerical features.
Python
num_cols = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 
            'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']

plt.figure(figsize=(15, 15))
for i, col in enumerate(num_cols):
    plt.subplot(3, 3, i + 1)
    sb.histplot(df[col], kde=True)

plt.tight_layout()
plt.show()

Output:

Figure: Feature Distributions
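
Skewness can also be quantified rather than judged by eye. The sketch below is an illustration on a synthetic right-skewed column, not on the dataset itself; it compares the sample skewness before and after a log1p transform, one common way to compress a long right tail.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed column standing in for a feature such as Insulin.
demo = pd.DataFrame({"skewed": rng.lognormal(mean=0.0, sigma=1.0, size=1000)})

skew_before = demo["skewed"].skew()
# log1p(x) = log(1 + x) compresses the right tail and is defined at x = 0.
skew_after = np.log1p(demo["skewed"]).skew()

print(f"skew before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
```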

3.4. Count Plots

We will visualize how a discrete feature relates to the target.

  • countplot() displays counts for each category; hue separates by target variable.
  • Helps observe trends between features and target variable.
Python
sb.countplot(data=df, x='Pregnancies', hue='Outcome')
plt.show()

Output:

Figure: Count Plots

Insights from the Diabetes dataset:

  • The dataset is imbalanced: fewer positive cases (Outcome=1) than negative cases.
  • Features like Glucose, BMI and Age show skewed distributions and may influence the target strongly.
  • Higher values of Pregnancies tend to be associated with a higher likelihood of diabetes.
  • Most other features show weak correlations with each other (no multicollinearity issues).
  • Skewed or zero-heavy features (like Insulin and SkinThickness) might benefit from transformations.
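
For the zero-heavy features, one option is to treat 0 as a missing measurement, since a value of 0 for quantities such as glucose, insulin or BMI is not physiologically plausible. That reading is an assumption about the data, not something the file states. The sketch below uses a tiny made-up frame with the dataset's column names; LightGBM can train on NaN values directly, so no imputation is shown.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame reusing the dataset's column names.
demo = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "Insulin": [0, 0, 94, 168],
    "BMI": [33.6, 26.6, 0.0, 28.1],
    "Pregnancies": [6, 1, 0, 0],   # 0 is a legitimate value here
})

# Columns where 0 is assumed to mean "not measured".
zero_as_missing = ["Glucose", "Insulin", "BMI"]

zero_share = (demo[zero_as_missing] == 0).mean()   # fraction of zeros per column

cleaned = demo.copy()
for col in zero_as_missing:
    cleaned[col] = cleaned[col].replace(0, np.nan)

print(zero_share)
print(cleaned.isna().sum())
```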

4. Data Preprocessing

We will prepare the dataset for LightGBM.

4.1. Splitting Features and Target

We will split the dataset into input features and target variable.

  • drop('Outcome', axis=1) removes target column to create features.
  • train_test_split() splits data into training and validation sets.
  • test_size=0.2 reserves 20% of data for validation.
  • random_state=2023 ensures reproducibility.
Python
features = df.drop('Outcome', axis=1)
target = df['Outcome']

X_train, X_val, Y_train, Y_val = train_test_split(
    features, target, test_size=0.2, random_state=2023
)
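
With imbalanced classes, a purely random split can leave the validation set with a different class ratio than the training set. A hedged variation is to pass stratify to train_test_split so both parts keep the same proportions. The sketch uses synthetic features and labels; the 65/35 label split is made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))          # synthetic features
y = np.array([0] * 130 + [1] * 70)     # made-up imbalanced labels

X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, random_state=2023, stratify=y
)

# Share of positives in each part; stratification keeps them close.
print(y_tr.mean(), y_va.mean())
```

On the real data this would correspond to adding stratify=target to the existing call.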

4.2. Feature Scaling

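We will standardize the features so that they share a common scale.

  • StandardScaler() transforms each feature to mean=0, std=1.
  • fit_transform() computes the mean and standard deviation on the training data and transforms it.
  • transform() applies the same scaling to the validation data, so no validation statistics leak into training.
  • Tree-based models such as LightGBM split on feature thresholds and are largely insensitive to this kind of rescaling, so the step is optional here; it matters mainly for distance-based or linear models.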
Python
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
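
Decision trees split on thresholds, so their predictions depend on the ordering of feature values rather than on their scale. One way to check how much scaling matters for a tree-based learner is to train the same tree on raw and on standardized features and compare the predictions. The sketch uses a single scikit-learn DecisionTreeClassifier on synthetic data as a simplified stand-in for the trees inside LightGBM; it is an illustration, not a benchmark.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_tr, X_te, y_tr = X[:300], X[300:], y[:300]

scaler = StandardScaler().fit(X_tr)   # statistics from the training part only

raw_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)
scaled_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(
    scaler.transform(X_tr), y_tr
)

pred_raw = raw_tree.predict(X_te)
pred_scaled = scaled_tree.predict(scaler.transform(X_te))

# Fraction of test points on which the two trees give the same class.
agreement = np.mean(pred_raw == pred_scaled)
print(f"agreement between the two trees: {agreement:.2%}")
```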

5. Dataset Preparation for LightGBM

We will convert arrays into LightGBM dataset objects for training.

  • lgb.Dataset() prepares dataset compatible with LightGBM.
  • label specifies target variable.
  • reference=train_data makes the validation set reuse the feature binning built for the training set.
Python
train_data = lgb.Dataset(X_train, label=Y_train)
test_data = lgb.Dataset(X_val, label=Y_val, reference=train_data)

6. Binary Classification Model Using LightGBM

We will define model parameters and train the classifier.

  • objective='binary' sets the task to binary classification.
  • metric='auc' uses ROC-AUC as the evaluation metric.
  • boosting_type='gbdt' uses the Gradient Boosting Decision Tree algorithm.
  • num_leaves=31 sets the maximum number of leaves per tree.
  • learning_rate=0.05 sets the shrinkage applied to each tree's contribution.
  • feature_fraction=0.9 randomly samples 90% of the features for each tree.
  • lgb.early_stopping(stopping_rounds=10) stops training when the validation AUC has not improved for 10 consecutive rounds. In LightGBM 4.0 and later, early stopping is passed as a callback; the early_stopping_rounds argument of lgb.train() was removed.
Python
params = {
    'objective': 'binary',
    'metric': 'auc',
    'boosting_type': 'gbdt',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9
}

num_round = 100
bst = lgb.train(
    params,
    train_data,
    num_boost_round=num_round,
    valid_sets=[test_data],
    callbacks=[lgb.early_stopping(stopping_rounds=10)]
)

Output:

Figure: Binary Classification Model Using LightGBM

7. Prediction and Evaluation

We will generate predictions and evaluate performance using ROC-AUC.

  • bst.predict() predicts probabilities for each instance.
  • (y > 0.5).astype(int) converts probabilities to binary outcomes.
  • roc_auc_score() computes ROC-AUC score for evaluation.
Python
y_train = bst.predict(X_train)
y_val = bst.predict(X_val)

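y_train_class = (y_train > 0.5).astype(int)
y_val_class = (y_val > 0.5).astype(int)

# ROC-AUC is computed from the predicted probabilities, not the thresholded classes
print("Training ROC-AUC: ", roc_auc_score(Y_train, y_train))
print("Validation ROC-AUC: ", roc_auc_score(Y_val, y_val))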

Output:

Training ROC-AUC: 1.0
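Validation ROC-AUC: 0.6791463194067643

The gap between a training ROC-AUC of 1.0 and a validation ROC-AUC of about 0.68 indicates that the model overfits the training data.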
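
The thresholded predictions (y_train_class, y_val_class) are computed but do not enter the ROC-AUC calculation, which works on probabilities. Threshold-dependent metrics such as accuracy or a confusion matrix are where they would be used. The sketch below uses made-up labels and probabilities purely for illustration; the variable names y_true, y_prob and y_pred are hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

# Made-up labels and predicted probabilities, for illustration only.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.9, 0.3, 0.2])

y_pred = (y_prob > 0.5).astype(int)   # same 0.5 threshold as in the text

auc = roc_auc_score(y_true, y_prob)   # uses probabilities
acc = accuracy_score(y_true, y_pred)  # uses thresholded classes
cm = confusion_matrix(y_true, y_pred) # rows: true class, columns: predicted class

print("ROC-AUC :", auc)
print("Accuracy:", acc)
print(cm)
```

On the real model, Y_val, y_val and y_val_class would take the place of y_true, y_prob and y_pred.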
