CatBoost with Sparse Data

Last Updated : 30 Jun, 2025

CatBoost is a gradient boosting library that handles sparse data, meaning data in which most values are zero or missing, without extra preprocessing. It accepts sparse matrix formats (CSR/CSC), handles missing values natively and encodes categorical variables efficiently. This keeps training fast and memory efficient even on very large sparse datasets.


How CatBoost Handles Sparse Data

  1. Native Support for Missing Values: CatBoost handles missing values in numerical features natively. By default they are treated as smaller than every other value of the feature, so a split can separate them from the rest. You don’t need to impute or drop them.
  2. Efficient Encoding of Categorical Features: High-cardinality categorical features produce very wide sparse matrices when one-hot encoded. CatBoost uses ordered target statistics instead, which encodes each category as a single numeric value and avoids those matrices.
  3. CSR / CSC Matrices: CatBoost accepts SciPy sparse matrices such as Compressed Sparse Row (CSR) or Compressed Sparse Column (CSC) as input. Only the non-zero entries are stored, which reduces memory usage and speeds up training on sparse data.
  4. Optimized Data Structures: CatBoost uses data structures and gradient calculations designed for sparse inputs, so it avoids unnecessary computation on zero entries.
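
To make the idea behind ordered target statistics more concrete, here is a minimal pure-Python sketch. It is a simplified illustration, not CatBoost's internal implementation: the function name `ordered_target_stats` and the `prior` and `weight` parameters are assumptions made for this example, and the rows are processed in the order given, whereas CatBoost works with random permutations of the data.

```python
def ordered_target_stats(categories, targets, prior=0.5, weight=1.0):
    """Encode each row using only the rows that come before it."""
    sums = {}    # running sum of targets seen so far, per category
    counts = {}  # running number of rows seen so far, per category
    encoded = []
    for cat, y in zip(categories, targets):
        s = sums.get(cat, 0.0)
        n = counts.get(cat, 0)
        # Smoothed mean of earlier targets; equals the prior when n == 0.
        encoded.append((s + prior * weight) / (n + weight))
        # Update the running statistics only after encoding the current row,
        # so a row never sees its own label.
        sums[cat] = s + y
        counts[cat] = n + 1
    return encoded


cats = ["a", "b", "a", "a", "b"]  # hypothetical categorical column
ys = [1, 0, 1, 0, 1]              # hypothetical binary targets
enc = ordered_target_stats(cats, ys)
print(enc)
```

Each category becomes one numeric column regardless of how many distinct values it has, which is why this approach does not create the wide sparse matrix that one-hot encoding would.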

Implementation

Step 1: Install Libraries

This command installs the required Python libraries:

  • catboost for gradient boosting on categorical and sparse data,
  • scikit-learn for machine learning tools,
  • scipy for handling sparse matrix formats efficiently.
Shell
pip install catboost scikit-learn scipy

Step 2: Load the Dataset

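  • This code loads the Amazon Fine Food Reviews dataset from a SQLite database. It expects the archive database.sqlite.zip to be in the working directory.
  • It keeps only reviews with a score of 1 or 5 for binary sentiment analysis, then prepares the review text and the target labels (1 for a score of 5, 0 for a score of 1) for modeling.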
Python
import sqlite3
import zipfile

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier, Pool

# Extract the SQLite database from the downloaded archive
with zipfile.ZipFile("database.sqlite.zip", 'r') as zip_ref:
    zip_ref.extractall(".")

# Read the Reviews table into a DataFrame
con = sqlite3.connect("database.sqlite")
df = pd.read_sql_query("SELECT * FROM Reviews", con)
con.close()

print(df[['Text', 'Score']].head())

# Keep only 1-star and 5-star reviews and drop rows without text
df = df[df['Score'].isin([1, 5])]
df = df.dropna(subset=['Text'])

X_text = df['Text']
y = (df['Score'] == 5).astype(int)  # 1 = score 5 (positive), 0 = score 1 (negative)

Step 3: Perform Vectorization

This block converts the review texts into a sparse TF-IDF matrix with up to 5,000 features, giving a high-dimensional representation. It then prints the shape of the matrix and its sparsity, that is, the percentage of entries that are zero.

Python
vectorizer = TfidfVectorizer(max_features=5000)
X_sparse = vectorizer.fit_transform(X_text)

print(f"Sparse Shape: {X_sparse.shape}")
n_cells = X_sparse.shape[0] * X_sparse.shape[1]
sparsity = 100 * (1.0 - X_sparse.count_nonzero() / n_cells)
print(f"Sparsity: {sparsity:.2f}%")
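
TfidfVectorizer returns a SciPy sparse matrix, typically in CSR format. The sketch below is a pure-Python illustration of what that layout stores and how the sparsity percentage follows from the number of stored entries. The helper `to_csr` and the tiny matrix are made up for this example; in practice SciPy builds and manages these arrays for you.

```python
def to_csr(dense):
    """Convert a list-of-lists matrix into CSR arrays (data, indices, indptr)."""
    data, indices, indptr = [], [], [0]
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)    # the non-zero value itself
                indices.append(col)   # the column it belongs to
        indptr.append(len(data))      # where the next row starts in data
    return data, indices, indptr


# Hypothetical 3 x 4 TF-IDF-like matrix with 3 non-zero entries
dense = [
    [0.0, 0.7, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.9],
]
data, indices, indptr = to_csr(dense)

n_rows, n_cols = len(dense), len(dense[0])
sparsity = 100 * (1.0 - len(data) / (n_rows * n_cols))

print(data, indices, indptr)
print(f"Sparsity: {sparsity:.2f}%")
```

Row i occupies the slice data[indptr[i]:indptr[i + 1]], so an all-zero row costs one extra integer in indptr and nothing else.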

Step 4: Train Test Split

  • This splits the sparse TF-IDF data and labels into training and test sets using an 80-20 split.
  • The stratify=y argument ensures both sets keep the same proportion of positive and negative reviews as the full dataset.
Python
X_train, X_test, y_train, y_test = train_test_split(
    X_sparse, y, test_size=0.2, random_state=42, stratify=y
)
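
To show what stratification means in practice, here is a small pure-Python sketch that splits each class separately so that both parts keep the same class ratio. It illustrates the idea only and is not how scikit-learn implements it; the function `stratified_split` and the 80/20 label list are assumptions made for this example.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with roughly equal class ratios in both."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for class_indices in by_class.values():
        rng.shuffle(class_indices)
        # Take the same fraction of every class for the test set.
        n_test = round(len(class_indices) * test_size)
        test_idx.extend(class_indices[:n_test])
        train_idx.extend(class_indices[n_test:])
    return train_idx, test_idx


def positive_rate(idx):
    """Share of positive labels among the given row indices."""
    return sum(labels[i] for i in idx) / len(idx)


labels = [1] * 80 + [0] * 20  # hypothetical imbalanced labels
train_idx, test_idx = stratified_split(labels)
print(positive_rate(train_idx), positive_rate(test_idx))
```

Without stratification, a random split of imbalanced data can leave the test set with noticeably more or fewer negative reviews than the training set, which makes the evaluation less reliable.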

Step 5: Train the Model

This creates CatBoost Pool objects that wrap the sparse training and test data. It then initializes a CatBoostClassifier with 100 iterations, a learning rate of 0.1 and a tree depth of 6, and trains it while evaluating on the test set. With verbose=10, progress is logged every 10 iterations.

Python
train_pool = Pool(X_train, y_train)
test_pool = Pool(X_test, y_test)

model = CatBoostClassifier(iterations=100, learning_rate=0.1, depth=6, verbose=10)
model.fit(train_pool, eval_set=test_pool)

Step 6: Evaluate the Model

This calculates and prints the model’s accuracy on the test data, showing how well the CatBoost model predicts the review sentiment.

Python
accuracy = model.score(test_pool)
print(f"Accuracy: {accuracy:.4f}")
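
Accuracy is the fraction of correct predictions, so on data where one class dominates it helps to compare it with the accuracy of always predicting the most frequent class. The sketch below uses made-up labels and predictions to show both numbers; the helper names `accuracy` and `majority_baseline` are assumptions for this example and are not part of CatBoost.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def majority_baseline(y_true):
    """Accuracy obtained by always predicting the most frequent class."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)


y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]  # hypothetical true labels
y_pred = [1, 1, 1, 1, 1, 1, 1, 0, 0, 1]  # hypothetical model predictions

print(f"Accuracy: {accuracy(y_true, y_pred):.4f}")
print(f"Majority baseline: {majority_baseline(y_true):.4f}")
```

In this toy case both values are 0.8, which would mean the model is no better than always predicting the positive class. A model is only useful when its accuracy clearly exceeds that baseline.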

You can download the source from here - CatBoost with Sparse Data
