E-commerce product recommendations using catboost

E-commerce platforms use product recommendations to enhance customer experience and increase sales. CatBoost is a gradient boosting algorithm designed to efficiently handle categorical data and missing values making it ideal for this task. It works by combining decision trees and correcting errors from previous trees to create a strong predictive model. In this article we’ll use CatBoost to build a product recommendation system predicting items a customer is likely to be interested in based on their browsing and purchase history.

Step 1: Import Libraries

We will import numpy, pandas, matplotlib, seaborn and scikit learn. We will also be using catboost library to implement it and ipywidgets for creating a widget to see recommendation.

Python

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import ipywidgets as widgets
from IPython.display import display

from catboost import CatBoostClassifier, Pool
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

Step 2: Load the Dataset

We'll use the Online Retail dataset for this implementation.

Python

import pandas as pd

df = pd.read_excel('Online-Retail.xlsx')
df.head()

Output:

Step 3: Data Preprocessing

Before we can train a model we need to clean and preprocess the data. We will do:

Drop rows with missing values
Convert InvoiceDate to datetime
Create a new column for total price
Filter out negative quantities

Python

df.dropna(inplace=True)

df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])

df['TotalPrice'] = df['Quantity'] * df['UnitPrice']

df = df[df['Quantity'] > 0]

df.head()

Output:

Screenshot-2025-03-05-135228 — Pre-Processed Dataset

Step 4: Feature Engineering

Feature engineering is a technique where we engineer new features based on existing features. We'll create features like the recency, frequency and monetary value (RFM) for each customer.

reference_date is set to one day after the latest invoice date
Groups data by CustomerID to calculate RFM metrics
Recency: Days since last purchase
Frequency: Number of unique invoices
Monetary: Total spending (TotalPrice)
Renames columns to 'Recency', 'Frequency', 'Monetary'
rfm.head() shows the first few rows of the RFM table

Python

reference_date = df['InvoiceDate'].max() + pd.Timedelta(days=1)

rfm = df.groupby('CustomerID').agg({
    'InvoiceDate': lambda x: (reference_date - x.max()).days,
    'InvoiceNo': 'nunique', 
    'TotalPrice': 'sum' 
})

rfm.columns = ['Recency', 'Frequency', 'Monetary']

rfm.head()

Output:

Screenshot-2025-03-05-135616 — Feature Engineering

Step 5: Train-Test Split

Split the data into training and testing sets. Here we split 80% dataset for training and 20% for testing.

X uses 'Recency', 'Frequency', 'Monetary' as features
y is 1 if 'Monetary' > median, else 0
Data split: 80% train, 20% test
random_state=42 for reproducibility
shape shows rows and columns in each split

Python

X = rfm[['Recency', 'Frequency', 'Monetary']]
y = np.where(rfm['Monetary'] > rfm['Monetary'].median(), 1, 0)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

Output:

((3471, 3), (868, 3), (3471,), (868,))

Step 6: Train the CatBoost Model

Now, we'll train a CatBoost classifier on the training data.

iterations = 100: Defines the number of trees to be built (100 in this case) each improving the model.
learning_rate = 0.1: Controls the contribution of each tree. A lower rate needs more trees to converge, balancing accuracy and training time.
depth = 6: Specifies the depth of each tree. A depth of 6 offers a good trade-off between capturing complexity and avoiding overfitting.
verbose = False: Suppresses training logs to keep the process quiet.

Python

model = CatBoostClassifier(iterations=100, learning_rate=0.1, depth=6, verbose=False)

model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f'Accuracy: {accuracy:.2f}')

Output:

Accuracy: 1.00

Model is working fine but it may be overfitted we can further fine tune it to make model better.

Step 7: Visualize Feature Importance

CatBoost provides feature importance which helps understand which features are contributing the most to the predictions. We are doing this too understand why model is getting overfitted.

Python

feature_importance = model.get_feature_importance()
features = X.columns

importance_df = pd.DataFrame({'Feature': features, 'Importance': feature_importance})

plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=importance_df)
plt.title('Feature Importance')
plt.show()

Output:

We can see that only one feature i.e monetary is contributing maximum to prediction due to which model is getting overfitted. We can fine tune model to deal this issue.

Step 8: Interactive GUI with ipywidgets

We'll create an interactive GUI where users can input recency, frequency and monetary values to get a recommendation.

Imports interactive widgets and display tools
Creates sliders for Recency, Frequency, and Monetary inputs
Defines output area to show prediction result
make_prediction() creates a DataFrame and uses model to predict
Returns "Recommend" if prediction is 1, else "Do not recommend"
update_prediction() updates output when sliders change
Observers trigger prediction on slider value changes
Displays sliders and prediction output in notebook

Python

import ipywidgets as widgets
from IPython.display import display

recency_input = widgets.IntSlider(min=0, max=365, step=1, description='Recency')
frequency_input = widgets.IntSlider(min=0, max=100, step=1, description='Frequency')
monetary_input = widgets.FloatSlider(min=0, max=10000, step=0.01, description='Monetary')

output = widgets.Output()

def make_prediction(recency, frequency, monetary):
    data = pd.DataFrame({'Recency': [recency], 'Frequency': [frequency], 'Monetary': [monetary]})
    prediction = model.predict(data)[0]
    return "Recommend" if prediction == 1 else "Do not recommend"

def update_prediction(change):
    with output:
        output.clear_output()
        prediction = make_prediction(recency_input.value, frequency_input.value, monetary_input.value)
        print(f'Recommendation: {prediction}')

recency_input.observe(update_prediction, names='value')
frequency_input.observe(update_prediction, names='value')
monetary_input.observe(update_prediction, names='value')

display(recency_input, frequency_input, monetary_input, output)

Output:

Our model is working fine helping E-commerce platform to decide which product to list down on their platform by providing powerful, efficient and accurate product recommendations. As businesses continue to embrace machine learning technologies tools like CatBoost will play a important role in delivering personalized shopping experiences.

E-commerce product recommendations using catboost

Step 1: Import Libraries

Step 2: Load the Dataset

Step 3: Data Preprocessing

Step 4: Feature Engineering

Step 5: Train-Test Split

Step 6: Train the CatBoost Model

Step 7: Visualize Feature Importance

Step 8: Interactive GUI with ipywidgets

Explore