PySpark GroupBy DataFrame with Aggregation or Count

PySpark is the Python API for Apache Spark and is used to process large datasets in a distributed environment. A common operation when working with data is grouping rows by one or more columns. In PySpark this is done with the groupBy() function, after which the values in each group can be counted or aggregated.

In this article, we will explore how to use the groupBy() function in PySpark to count the rows in each group and to apply aggregation functions to grouped data.

Syntax of groupBy()

DataFrame.groupBy(*cols)

Parameters:

cols: The column(s) to group by, given as column names (strings) or Column objects, either as separate arguments or as a single list. If no columns are given, the whole DataFrame is treated as one group.

groupBy() returns a GroupedData object rather than a DataFrame. Calling an aggregation method on it, such as count(), sum(), or agg(), produces the resulting DataFrame. groupby() is an alias for groupBy().

Creating a PySpark DataFrame

Before performing the groupBy() operation, let's create a simple DataFrame containing some student data, including columns like ID, NAME, DEPT, and FEE.

Python

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('GroupByExample').getOrCreate()

data = [
    ["1", "sravan", "IT", 45000],
    ["2", "ojaswi", "CS", 85000],
    ["3", "rohith", "CS", 41000],
    ["4", "sridevi", "IT", 56000],
    ["5", "bobby", "ECE", 45000],
    ["6", "gayatri", "ECE", 49000],
    ["7", "gnanesh", "CS", 45000],
    ["8", "bhanu", "Mech", 21000]
]

columns = ['ID', 'NAME', 'DEPT', 'FEE']

dataframe = spark.createDataFrame(data, columns)

dataframe.show()

Output:

Snapshot of the DataFrame (eight rows with the columns ID, NAME, DEPT, and FEE).

PySpark groupBy with Count

To count the number of rows in each group, call count() on the grouped data. It returns a DataFrame with one row per distinct value of the grouping column and a count column holding the number of rows in that group.

Python

# Grouping by 'DEPT' and counting occurrences
dataframe.groupBy('DEPT').count().show()

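Output: a two-column table (DEPT and count) with one row per department. The order of the rows is not guaranteed.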

Explanation:

groupBy('DEPT'): Groups the data by the DEPT column.
count(): Counts the number of rows for each group (department).
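Because the sample data is small, the expected per-department counts can be cross-checked in plain Python. The sketch below involves no Spark at all; it only reproduces the numbers that groupBy('DEPT').count() should report for the eight student rows, and the variable name dept_counts is chosen for illustration.

```python
from collections import Counter

# The same student rows used to build the DataFrame.
data = [
    ["1", "sravan", "IT", 45000],
    ["2", "ojaswi", "CS", 85000],
    ["3", "rohith", "CS", 41000],
    ["4", "sridevi", "IT", 56000],
    ["5", "bobby", "ECE", 45000],
    ["6", "gayatri", "ECE", 49000],
    ["7", "gnanesh", "CS", 45000],
    ["8", "bhanu", "Mech", 21000],
]

# Count how many rows fall into each department (index 2 is the DEPT column).
dept_counts = Counter(row[2] for row in data)

for dept, n in sorted(dept_counts.items()):
    print(dept, n)
# CS 3
# ECE 2
# IT 2
# Mech 1
```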

PySpark groupBy with Aggregation

You can apply various aggregation functions to your grouped data, such as sum(), max(), min(), mean(), etc.

Python

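# Import the functions module under an alias so that Spark's sum, max, and min
# do not shadow Python's built-in functions of the same name
from pyspark.sql import functions as F

# Grouping by 'DEPT' and applying aggregation functions to 'FEE'
dataframe.groupBy("DEPT").agg(
    F.max("FEE"), F.sum("FEE"),
    F.min("FEE"), F.mean("FEE"),
    F.count("FEE")
).show()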

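Output: a table with one row per department, giving the maximum, total, minimum, mean, and count of FEE for that department.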

Explanation:

groupBy("DEPT"): Groups the data by the DEPT column.
agg(): Applies the aggregation functions (max, sum, min, mean, count) on the FEE column for each group.
