CatBoost with Sparse Data

CatBoost is a gradient boosting library that naturally handles sparse data as it has many zero or missing values without extra preprocessing. It supports sparse matrix formats (CSR/CSC), treats missing values smartly and uses efficient encoding for categorical variables. This makes training fast and memory efficient even for huge sparse datasets.

CatBoost with Sparse Data

Native Support for Missing Values: CatBoost automatically treats missing values as a separate category. You don’t need to impute or drop them the algorithm learns the optimal splits for them during training.
Efficient Encoding of Categorical Features: High cardinality categorical features can lead to sparse representations when one hot encoded. CatBoost uses ordered target statistics instead, which avoids huge sparse matrices.
CSR / CSC Matrices: CatBoost supports sparse matrix formats like Compressed Sparse Row (CSR) or Compressed Sparse Column (CSC). This reduces memory usage and speeds up training on sparse data.
Optimized Data Structures: CatBoost uses optimized data structures and gradient calculation techniques that are friendly for sparse inputs avoiding unnecessary computation for zero entries.

Implementation

Step 1: Install Libraries

This command installs the required Python libraries:

catboost for gradient boosting on categorical and sparse data,
scikit learn for machine learning tools,
scipy for handling sparse matrix formats efficiently.

Python

pip install catboost scikit-learn scipy

Step 2: Load the Dataset

This code loads the Amazon Fine Food Reviews dataset from a SQLite database.
It filters the data to keep only reviews with scores 1 or 5 for binary sentiment analysis and prepares the text and target labels for modeling.

Python

import pandas as pd
import sqlite3
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier, Pool

import zipfile
with zipfile.ZipFile("database.sqlite.zip", 'r') as zip_ref:
    zip_ref.extractall(".")

con = sqlite3.connect("database.sqlite")

df = pd.read_sql_query("SELECT * FROM Reviews", con)

con.close()

print(df[['Text', 'Score']].head())
df = df[df['Score'].isin([1, 5])]
df = df.dropna(subset=['Text'])

X_text = df['Text']
y = (df['Score'] == 5).astype(int)

Output:

Step 3: Perform Vectorization

This block converts the review texts into a sparse TF-IDF matrix with up to 5,000 features, creating a high dimensional representation and then prints the shape of the matrix and its sparsity percentage to show how many values are zeros.

Python

vectorizer = TfidfVectorizer(max_features=5000)
X_sparse = vectorizer.fit_transform(X_text)

print(f"Sparse Shape: {X_sparse.shape}")
print(f"Sparsity: {100 * (1.0 - X_sparse.count_nonzero() / (X_sparse.shape[0] * X_sparse.shape[1])):.2f}%")

Output:

Step 4: Train Test Split

This splits the sparse TF-IDF data and labels into training and test sets using an 80-20 split.
The stratify=y ensures both sets keep the same class distribution balanced positive and negative reviews.

Python

X_train, X_test, y_train, y_test = train_test_split(
    X_sparse, y, test_size=0.2, random_state=42, stratify=y
)

Step 5: Train the model

This creates CatBoost Pool objects to handle the sparse training and test data. Then it initializes and trains a CatBoostClassifier for 100 iterations with specified hyperparameters, evaluating its performance on the test set during training.

Python

train_pool = Pool(X_train, y_train)
test_pool = Pool(X_test, y_test)

model = CatBoostClassifier(iterations=100, learning_rate=0.1, depth=6, verbose=10)
model.fit(train_pool, eval_set=test_pool)

Output:

Step 6: Evaluate the model

This calculates and prints the model’s accuracy on the test data, showing how well the CatBoost model predicts the review sentiment.

Python

accuracy = model.score(test_pool)
print(f"Accuracy: {accuracy:.4f}")

Output:

You can download the source from here - CatBoost with Sparse Data

CatBoost with Sparse Data