Exploratory Data Analysis (EDA) with NumPy, Pandas, Matplotlib and Seaborn

Last Updated : 29 Jan, 2026

Exploratory Data Analysis (EDA) is the foundation of every data science project. It is the process of examining datasets to understand their structure, identify patterns, detect anomalies and extract meaningful insights. Before applying any machine learning or statistical models, data must be cleaned, transformed and explored. This is where EDA plays an important role.

EDA helps answer important questions such as:

  • What type of data is present (numerical, categorical, text, dates)?
  • Are there missing or inconsistent values?
  • Are there outliers that could affect analysis?
  • What patterns or relationships exist between variables?

For example, in a student performance dataset, some records may have missing scores or inconsistent subject names (such as “Math” and “Mathematics”). EDA helps identify and fix such issues, ensuring the dataset is ready for analysis and modeling.
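
The student performance example above can be sketched in Pandas. This is a minimal illustration, assuming a small hypothetical table; the column names, the values and the `subject_aliases` mapping are invented for the example and are not part of any real dataset.

```python
import pandas as pd

# Hypothetical student performance records (illustrative values only)
records = pd.DataFrame({
    "student": ["A", "B", "C", "D", "E"],
    "subject": ["Math", "Mathematics", "math ", "Science", "Science"],
    "score": [78, None, 85, 90, None],
})

# Normalize subject labels: trim whitespace, unify case, map known variants
subject_aliases = {
    "math": "Mathematics",
    "mathematics": "Mathematics",
    "science": "Science",
}
cleaned = records["subject"].str.strip().str.lower().map(subject_aliases)

# Keep the original label wherever no alias is defined
records["subject"] = cleaned.fillna(records["subject"])

# Count missing scores per (normalized) subject
missing_per_subject = records["score"].isna().groupby(records["subject"]).sum()

print(records["subject"].unique())
print(missing_per_subject)
```

After normalization, "Math", "Mathematics" and "math " are all counted under a single label, so the missing-score counts per subject are no longer split across spelling variants.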

The most commonly used Python libraries for EDA are NumPy, Pandas, Matplotlib and Seaborn. Each library serves a specific purpose in the EDA workflow.

1. NumPy for Numerical Operations

NumPy is the core library for numerical computing in Python. It is designed to handle large, multi-dimensional arrays efficiently and provides fast mathematical and statistical operations.

  • Handles Large Datasets Efficiently: NumPy lets you work with large, multi-dimensional arrays and matrices of numerical data, and provides functions for mathematical operations such as linear algebra and statistical analysis.
  • Facilitates Data Transformation: Helps in sorting, reshaping and aggregating data.
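
The transformations listed above (sorting, reshaping and aggregating) might look like the following. This is a small sketch on hypothetical scores for three students across two exams; the values and variable names are assumptions made for illustration.

```python
import numpy as np

# Hypothetical scores: 3 students x 2 exams, stored as a flat array
flat_scores = np.array([72, 65, 88, 91, 54, 60])

# Sorting: np.sort returns a sorted copy and leaves the original unchanged
sorted_scores = np.sort(flat_scores)

# Reshaping: 3 rows (students) by 2 columns (exams)
score_table = flat_scores.reshape(3, 2)

# Aggregating: mean per student (across columns), max per exam (across rows)
mean_per_student = score_table.mean(axis=1)
max_per_exam = score_table.max(axis=0)

print(sorted_scores)
print(mean_per_student)
print(max_per_exam)
```

The `axis` argument controls the direction of the aggregation: `axis=1` collapses the columns of each row, while `axis=0` collapses the rows of each column.
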
Python
import numpy as np

# Dataset: Exam scores
scores = np.array([45, 50, 55, 60, 65, 70, 75, 80, 200]) 

# Calculate basic statistics
mean_score = np.mean(scores)
median_score = np.median(scores)
std_dev_score = np.std(scores)

print(f"Mean: {mean_score}, Median: {median_score}, Standard Deviation: {std_dev_score}")

Output
Mean: 77.77777777777777, Median: 65.0, Standard Deviation: 44.541560561838764

This example shows how NumPy quickly computes descriptive statistics and highlights the impact of outliers: the single value 200 pulls the mean (about 77.8) well above the median (65.0) and inflates the standard deviation.
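
One common way to flag such a value numerically is the interquartile range (IQR) rule of thumb. The sketch below reuses the same scores; the 1.5 multiplier is the conventional choice for this rule, not something prescribed by NumPy, and the variable names are illustrative.

```python
import numpy as np

scores = np.array([45, 50, 55, 60, 65, 70, 75, 80, 200])

# Quartiles and interquartile range (IQR)
q1, q3 = np.percentile(scores, [25, 75])
iqr = q3 - q1

# Rule of thumb: flag values more than 1.5 * IQR outside the quartiles
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
is_outlier = (scores < lower_bound) | (scores > upper_bound)
outliers = scores[is_outlier]

# Mean of the remaining values, for comparison with the full mean
trimmed_mean = scores[~is_outlier].mean()

print(f"Bounds: [{lower_bound}, {upper_bound}]")
print(f"Outliers: {outliers}")
print(f"Mean without outliers: {trimmed_mean}")
```

With the flagged value excluded, the mean of the remaining eight scores is 62.5, much closer to the median than the original mean.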

2. Pandas for Data Manipulation

Pandas is built on top of NumPy and is designed for working with structured, tabular data. It introduces two main data structures:

  • Series (1D)
  • DataFrame (2D)

Pandas makes data cleaning, transformation and analysis simple and intuitive.

  • Reading and writing data (CSV, Excel, JSON, SQL)
  • Handling missing values
  • Filtering and slicing data
  • Grouping and aggregation
  • Working with date and time data
Python
import pandas as pd

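# Small dataset with one missing mark
data = {
    "Name": ["A", "B", "C", "D"],
    "Marks": [78, 85, None, 90],
}

df = pd.DataFrame(data)

print(df)                  # full table; None is stored as NaN
print(df.isnull())         # True marks the missing entries
print(df["Marks"].mean())  # mean() skips NaN by default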

Output
  Name  Marks
0    A   78.0
1    B   85.0
2    C    NaN
3    D   90.0
    Name  Marks
0  False  False
1  False  False
2  False   True
3  False  False
84.33333333333333

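This example demonstrates how Pandas identifies missing values and computes summary statistics. Note that mean() skips the missing value by default, so the result is the average of the three available marks.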
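
The other capabilities listed in this section (reading data, filling missing values, filtering, grouping and working with dates) can be sketched together. This is a minimal example that reads a hypothetical CSV from an in-memory string so that it runs without any external file; the column names and values are assumptions.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice this would be a file path
csv_text = """name,subject,marks,exam_date
A,Math,78,2025-01-10
B,Math,85,2025-01-10
C,Math,,2025-01-10
D,Science,90,2025-02-14
E,Science,62,2025-02-14
F,Science,,2025-02-14
"""

# Reading data: parse exam_date as datetime while loading
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["exam_date"])

# Handling missing values: fill each gap with the median of its own subject
subject_median = df.groupby("subject")["marks"].transform("median")
df["marks"] = df["marks"].fillna(subject_median)

# Filtering: keep rows with marks of at least 80
high_scorers = df[df["marks"] >= 80]

# Grouping and aggregation: mean and count of marks per subject
summary = df.groupby("subject")["marks"].agg(["mean", "count"])

# Dates: extract the month from the parsed datetime column
df["exam_month"] = df["exam_date"].dt.month

print(high_scorers)
print(summary)
```

Filling with the per-subject median rather than a single global value is one possible design choice; whether it is appropriate depends on why the values are missing.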

3. Matplotlib for Data Visualization

Matplotlib is a powerful and flexible plotting library used to visualize data in various formats. It helps convert numerical data into meaningful visual representations.

  • Supports line, bar, scatter, histogram and 3D plots
  • Highly customizable
  • Essential for visual EDA
Python
import matplotlib.pyplot as plt

scores = [45, 50, 55, 60, 65, 70, 75, 80, 200]

plt.hist(scores)
plt.xlabel("Scores")
plt.ylabel("Frequency")
plt.title("Distribution of Exam Scores")
plt.show()

Output:

Figure: Visualizing Data with Matplotlib (histogram of exam scores)

This histogram makes the outlier easy to spot: the value 200 appears as an isolated bar far to the right of the other scores.
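
As a variation, the same scores can be drawn with explicit bin edges next to a box plot. This is a sketch rather than a recommended layout: the bin width of 20 and the output file name `exam_scores.png` are arbitrary choices made for illustration.

```python
import matplotlib.pyplot as plt

scores = [45, 50, 55, 60, 65, 70, 75, 80, 200]

# Two panels side by side: histogram on the left, box plot on the right
fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

# Explicit bin edges every 20 points from 40 to 200
bin_edges = list(range(40, 201, 20))
counts, edges, _ = ax_hist.hist(scores, bins=bin_edges, edgecolor="black")
ax_hist.set_xlabel("Scores")
ax_hist.set_ylabel("Frequency")
ax_hist.set_title("Histogram of Exam Scores")

# By default, points beyond 1.5 * IQR from the box are drawn individually
ax_box.boxplot(scores)
ax_box.set_ylabel("Scores")
ax_box.set_title("Box Plot of Exam Scores")

fig.tight_layout()
fig.savefig("exam_scores.png")  # or plt.show() in an interactive session
```

The values returned by `hist` (`counts` and `edges`) give the number of scores in each bin, which is useful when the exact frequencies matter and not only the picture.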

4. Seaborn for Statistical Data Visualization

Seaborn is built on top of Matplotlib and focuses on statistical visualizations. It provides a high-level interface for creating attractive and informative plots with minimal code.

  • Better default aesthetics
  • Built-in support for statistical plots
  • Easy visualization of relationships
Python
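import matplotlib.pyplot as plt
import seaborn as sns

# Load Seaborn's example "tips" dataset (fetched online on first use)
data = sns.load_dataset("tips")

# One box per day showing the distribution of total_bill
sns.boxplot(x="day", y="total_bill", data=data)
plt.show()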

Output:

Figure: Statistical Visualization Using Seaborn (box plot of total_bill by day)

This boxplot helps analyze data distribution and detect outliers across different categories.

Complete EDA Workflow Using NumPy, Pandas and Seaborn

A complete EDA workflow typically starts with numerical analysis using NumPy and Pandas, followed by visualizations using Seaborn that help support data-driven decisions.
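
The workflow described above could be sketched as follows. This is one possible ordering of the steps on a small hypothetical dataset; the column names, the values, the median fill and the output file name `eda_overview.png` are all assumptions made for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical student dataset (illustrative values only)
df = pd.DataFrame({
    "subject": ["Math"] * 4 + ["Science"] * 4,
    "hours_studied": [2.0, 4.0, 6.0, 8.0, 1.0, 3.0, 5.0, 7.0],
    "score": [52.0, 61.0, np.nan, 83.0, 48.0, 58.0, 69.0, 77.0],
})

# Step 1: inspect structure and missing values
print(df.dtypes)
print(df.isna().sum())

# Step 2: simple cleaning, here filling missing scores with the median
df["score"] = df["score"].fillna(df["score"].median())

# Step 3: summary statistics, plus a z-score computed with NumPy
# (ndarray.std() is the population standard deviation, ddof=0)
print(df.describe())
score_values = df["score"].to_numpy()
df["score_z"] = (score_values - score_values.mean()) / score_values.std()

# Step 4: relationship between the numeric variables
correlation = df["hours_studied"].corr(df["score"])
print(f"Correlation between hours studied and score: {correlation:.2f}")

# Step 5: visual checks with Seaborn
fig, (ax_box, ax_scatter) = plt.subplots(1, 2, figsize=(10, 4))
sns.boxplot(data=df, x="subject", y="score", ax=ax_box)
sns.scatterplot(data=df, x="hours_studied", y="score", hue="subject",
                ax=ax_scatter)
fig.tight_layout()
fig.savefig("eda_overview.png")  # or plt.show() in an interactive session
```

On a real dataset each step usually feeds back into the previous ones, for example when a plot reveals an inconsistency that calls for more cleaning.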

Web Scraping For EDA

Web scraping is the automated process of extracting data from websites for analysis. It is useful when datasets are not readily available.
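
The extraction step can be illustrated without any network access. The sketch below parses a hypothetical HTML fragment, standing in for a downloaded page, using Python's standard-library `html.parser` and loads the result into Pandas; the `TableParser` class and the table contents are invented for this example, and a real project would first fetch the page over HTTP.

```python
from html.parser import HTMLParser

import pandas as pd

# Hypothetical HTML fragment standing in for a downloaded page
html_text = """
<table>
  <tr><th>Name</th><th>Marks</th></tr>
  <tr><td>A</td><td>78</td></tr>
  <tr><td>B</td><td>85</td></tr>
  <tr><td>C</td><td></td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every table cell, one list per row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Ignore whitespace between tags; keep only text inside cells
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


parser = TableParser()
parser.feed(html_text)

# First row is the header, the rest are data rows
header, *body = parser.rows
scraped = pd.DataFrame(body, columns=header)

# Empty cells become NaN once the column is converted to numbers
scraped["Marks"] = pd.to_numeric(scraped["Marks"], errors="coerce")
print(scraped)
```

Once the scraped table is a DataFrame, the same checks shown earlier (missing values, summary statistics, plots) apply unchanged.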
