Dataset Inspection and Visualization

Last Updated : 6 May, 2026

Dataset inspection and visualization are among the first steps in any data science project. They help you understand your data, spot patterns and identify issues before building models. Tools like the Hugging Face Dataset Viewer make this process faster and more interactive.

Hugging Face Dataset Viewer

Hugging Face Dataset Viewer is a built-in tool that allows you to explore datasets directly from the browser without writing code. It simplifies data inspection into a clean and interactive experience.

  • Displays data in a table format
  • Shows 100 rows at a time for easy navigation
  • Supports search and filtering to quickly explore data
  • Provides quick statistics for better understanding
  • Works across text, image, audio and tabular datasets

Step 1: Open the Hugging Face Dataset Hub

  • Go to the Hugging Face Dataset Hub (huggingface.co/datasets)
  • This is the central repository of all available datasets

Step 2: Search for a Dataset

  • Use the search bar (e.g., type imdb, mnist, squad)
  • Click on any dataset that fits your use case
Figure: Searching for a dataset

Step 3: Access the Dataset Viewer

  • Once inside the dataset page, locate the Dataset Viewer tab
  • This is where the interactive table is available

Step 4: Explore Data in Table Format

  • You’ll see rows and columns like a spreadsheet
  • Each row is one data sample
  • Each column is a feature (text, label, image, etc.)
Figure: Dataset in table format

Step 5: Navigate Through Data

  • Scroll down to the bottom of the table
  • Use next/previous buttons to move across pages
  • Each page typically shows 100 rows
Figure: Pagination
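The same page-by-page navigation can be reproduced in code. The sketch below assumes the public Dataset Viewer API at datasets-server.huggingface.co and its /rows endpoint, which returns at most 100 rows per request. The helper names rows_url and fetch_page are illustrative, not part of any library.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL of the public Dataset Viewer API
API_ROOT = "https://datasets-server.huggingface.co"
PAGE_SIZE = 100  # same page size as the viewer table


def rows_url(dataset, config, split, page, page_size=PAGE_SIZE):
    """Build the /rows URL for a zero-based page number."""
    params = {
        "dataset": dataset,
        "config": config,
        "split": split,
        "offset": page * page_size,  # index of the first row on this page
        "length": page_size,
    }
    return f"{API_ROOT}/rows?{urlencode(params)}"


def fetch_page(dataset, config, split, page):
    """Download one page and return its rows as dicts (needs network access)."""
    with urlopen(rows_url(dataset, config, split, page)) as response:
        payload = json.load(response)
    # Each entry is assumed to wrap the actual record under the "row" key
    return [item["row"] for item in payload["rows"]]


# Page 2 (the third page) starts at row 200
print(rows_url("stanfordnlp/imdb", "plain_text", "train", page=2))
```

Calling fetch_page("stanfordnlp/imdb", "plain_text", "train", 0) should then return the rows shown on the first page of the viewer, assuming that dataset, config and split exist under those names.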

Step 6: Use Search Functionality

  • Use the search bar in the viewer
  • Enter keywords (e.g., “good”, “error”)
  • Instantly find matching rows in the dataset

Step 7: Inspect Different Data Types

  • Text: shown directly
  • Images: displayed visually
  • Audio: playable in the viewer
  • Tabular: structured in columns
Figure: Text data shown directly

Step 8: Check Dataset Splits

  • Switch between splits like train, test and validation
  • Helps understand how data is divided
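Splits can also be listed programmatically. This is a minimal sketch that assumes the Dataset Viewer API exposes a /splits endpoint returning a "splits" list with "config" and "split" keys. The sample payload is hand-written for illustration, and the function names are hypothetical.

```python
import json
from collections import defaultdict
from urllib.parse import urlencode
from urllib.request import urlopen


def splits_by_config(payload):
    """Group split names by configuration from a /splits response."""
    grouped = defaultdict(list)
    for entry in payload["splits"]:
        grouped[entry["config"]].append(entry["split"])
    return dict(grouped)


def fetch_splits(dataset):
    """Query the assumed /splits endpoint (needs network access)."""
    query = urlencode({"dataset": dataset})
    url = f"https://datasets-server.huggingface.co/splits?{query}"
    with urlopen(url) as response:
        return splits_by_config(json.load(response))


# Hand-written sample in the shape the endpoint is assumed to return
sample = {
    "splits": [
        {"dataset": "stanfordnlp/imdb", "config": "plain_text", "split": "train"},
        {"dataset": "stanfordnlp/imdb", "config": "plain_text", "split": "test"},
        {"dataset": "stanfordnlp/imdb", "config": "plain_text", "split": "unsupervised"},
    ]
}
print(splits_by_config(sample))
```

Grouping by configuration matters because some datasets ship several configurations, each with its own set of splits.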

Visualization in Hugging Face Dataset

The default Dataset Viewer focuses on structured inspection. To visualize patterns, clusters and trends, you can layer a tool such as Renumics Spotlight on top of Hugging Face datasets.

  • The Dataset Viewer is for quick inspection (tables, filters, search)
  • Visualization tools are for deeper insights (patterns, clusters, errors)
  • Spotlight works directly with Hugging Face datasets
  • The data does not need to be duplicated or converted first
  • The result is an interactive, visual view of the data

Step 1: Install Required Libraries

Run the following command in your terminal:

pip install datasets renumics-spotlight transformers torch

Step 2: Import Required Libraries

Import the libraries needed to load the dataset, run the model and launch the visualization.

Python
import torch
from datasets import load_dataset
from renumics import spotlight
from transformers import ViTForImageClassification, ViTImageProcessor, ViTModel

Step 3: Load a Dataset from Hugging Face

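Load a small sample for visualization: the first 500 images of the CIFAR-100 test split.

Python
# A small slice keeps model inference fast, even on a CPU
ds = load_dataset("cifar100", split="test[:500]")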

Step 4: Add Model Predictions

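Generate a prediction for each sample with a pre-trained ViT image classifier.

Python
model_name = "Ahmed9275/Vit-Cifar100"

# Load the image preprocessor and the classification model from the Hub
processor = ViTImageProcessor.from_pretrained(model_name)
model = ViTForImageClassification.from_pretrained(model_name)

def add_predictions(example):
    # Each sample stores its image as a PIL image in the "img" column
    image = example["img"].convert("RGB")
    inputs = processor(images=image, return_tensors="pt")

    # Inference only, so gradient tracking is not needed
    with torch.no_grad():
        outputs = model(**inputs)

    # The index of the highest logit is the predicted class
    pred = outputs.logits.argmax(dim=-1).item()
    example["prediction"] = pred
    return example

# Adds a "prediction" column to the dataset
ds = ds.map(add_predictions)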
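Once predictions are stored, comparing them with the ground-truth labels is a quick way to find samples worth inspecting. The sketch below works on plain lists; the helper names and the toy values are illustrative only.

```python
def accuracy(predictions, labels):
    """Fraction of samples where the prediction matches the label."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def misclassified(predictions, labels):
    """Return (index, label, prediction) for every wrong prediction."""
    return [
        (i, y, p)
        for i, (p, y) in enumerate(zip(predictions, labels))
        if p != y
    ]


# Toy values standing in for real model output
toy_predictions = [3, 7, 7, 1]
toy_labels = [3, 7, 2, 1]

print(accuracy(toy_predictions, toy_labels))       # 0.75
print(misclassified(toy_predictions, toy_labels))  # [(2, 2, 7)]
```

With the dataset from this tutorial, the inputs would likely be ds["prediction"] and ds["fine_label"]. This assumes that the model's class indices follow the same order as the fine_label column, which is worth checking against the model's id2label mapping.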

Step 5: Add Embeddings

Extract a feature vector (embedding) for each image from the model backbone.

Python
# Load the same checkpoint without the classification head
feature_model = ViTModel.from_pretrained(model_name)

def add_embedding(example):
    image = example["img"].convert("RGB")
    inputs = processor(images=image, return_tensors="pt")

    with torch.no_grad():
        outputs = feature_model(**inputs)

    # Use the hidden state of the first ([CLS]) token as the image embedding
    embedding = outputs.last_hidden_state[:, 0].squeeze().numpy()
    example["embedding"] = embedding
    return example

# Adds an "embedding" column to the dataset
ds = ds.map(add_embedding)
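Embeddings are useful because similar samples tend to have vectors that point in similar directions. As a minimal sketch of that idea, the code below measures cosine similarity between toy vectors. The function names and the three-dimensional vectors are illustrative stand-ins for real ViT embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query, candidates):
    """Return the index of the candidate vector closest to the query."""
    scores = [cosine_similarity(query, c) for c in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy vectors: the first two point in nearly the same direction
toy = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]

# Index is relative to toy[1:], so 0 refers to toy[1]
print(most_similar(toy[0], toy[1:]))
```

The same functions should work on the rows of the "embedding" column, since each row is a flat sequence of floats.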

Step 6: Launch Spotlight

Launch Spotlight on the enriched dataset. The dtype argument tells Spotlight to treat the embedding column as an embedding.

Python
spotlight.show(
    ds,
    dtype={"embedding": spotlight.Embedding}
)

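Output: Spotlight opens in the browser and shows the dataset as an interactive table, together with a similarity map computed from the embedding column.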

If your dataset contains numerical or structured data, a few standard plots help you understand patterns, relationships and overall data behavior:

  • Distribution of values: see how data is spread (e.g., histograms, boxplots)
  • Relationships between features: understand how two variables are connected (e.g., scatter plots)
  • Correlation between columns: identify strong or weak relationships (e.g., heatmaps)
  • Trends over data: observe changes if data has an order (e.g., line charts)
  • Outlier detection: find unusual or extreme values (e.g., boxplots)
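The numbers behind these plots can be computed with the standard library alone. The sketch below is illustrative: it derives histogram bin counts, a Pearson correlation and the outliers a boxplot would flag using the common 1.5 × IQR rule. The function names and sample values are assumptions made for the example.

```python
import math
import statistics


def histogram_counts(values, bins):
    """Count values in equal-width bins, as a histogram would."""
    low, high = min(values), max(values)
    width = (high - low) / bins or 1.0  # avoid zero width for constant data
    counts = [0] * bins
    for v in values:
        index = min(int((v - low) / width), bins - 1)  # max value goes in last bin
        counts[index] += 1
    return counts


def pearson(xs, ys):
    """Pearson correlation, the value shown in a correlation heatmap cell."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def iqr_outliers(values):
    """Values outside the 1.5 * IQR fences that a boxplot typically draws."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]


print(histogram_counts([1, 2, 2, 3, 3, 3, 10], bins=3))
print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))
print(iqr_outliers([10, 12, 11, 13, 12, 11, 95]))
```

For actual charts, the same columns can be passed to a plotting library. The stdlib version is used here only to keep the example free of extra dependencies.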

Advantages

  • You can quickly find missing values, incorrect entries or inconsistent data
  • It helps you clearly understand what your dataset contains and how it is structured
  • Clean, well-understood data tends to improve model accuracy and performance
  • Visual insights make it easier to interpret information compared to raw tables
  • It saves time by reducing issues that usually appear later during model training

Limitations

  • Large datasets can be slow to load and difficult to visualize fully
  • Basic viewers (like Hugging Face Dataset Viewer) provide limited visualization features
  • Some insights may require advanced tools or additional processing
  • Visualizations can sometimes be misleading if not interpreted correctly
  • Not all datasets support filtering, statistics, or advanced interactions in the UI