In this article, we will discuss how to join multiple columns in PySpark Dataframe using Python.
Let's create the first dataframe:
# importing module
import pyspark
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
# list of employee data
data = [(1, "sravan"), (2, "ojsawi"), (3, "bobby")]
# specify column names
columns = ['ID1', 'NAME1']
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
dataframe.show()
Output:
Let's create the second dataframe:
# importing module
import pyspark
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
# list of employee data
data = [(1, "sravan"), (2, "ojsawi"),
(3, "bobby"),
(4, "rohith"), (5, "gnanesh")]
# specify column names
columns = ['ID2', 'NAME2']
# creating a dataframe from the lists of data
dataframe1 = spark.createDataFrame(data, columns)
dataframe1.show()
Output:
we can join the multiple columns by using join() function using conditional operator
Syntax: dataframe.join(dataframe1, (dataframe.column1== dataframe1.column1) & (dataframe.column2== dataframe1.column2))
where,
- dataframe is the first dataframe
- dataframe1 is the second dataframe
- column1 is the first matching column in both the dataframes
- column2 is the second matching column in both the dataframes
Example 1: PySpark code to join the two dataframes with multiple columns (id and name)
# importing module
import pyspark
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
# list of employee data
data = [(1, "sravan"), (2, "ojsawi"), (3, "bobby")]
# specify column names
columns = ['ID1', 'NAME1']
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
# list of employee data
data = [(1, "sravan"), (2, "ojsawi"), (3, "bobby"),
(4, "rohith"), (5, "gnanesh")]
# specify column names
columns = ['ID2', 'NAME2']
# creating a dataframe from the lists of data
dataframe1 = spark.createDataFrame(data, columns)
# join based on ID and name column
dataframe.join(dataframe1, (dataframe.ID1 == dataframe1.ID2)
& (dataframe.NAME1 == dataframe1.NAME2)).show()
Output:
Example 2: Join with or operator
# importing module
import pyspark
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
# list of employee data
data = [(1, "sravan"), (2, "ojsawi"), (3, "bobby")]
# specify column names
columns = ['ID1', 'NAME1']
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
# list of employee data
data = [(1, "sravan"), (2, "ojsawi"), (3, "bobby"),
(4, "rohith"), (5, "gnanesh")]
# specify column names
columns = ['ID2', 'NAME2']
# creating a dataframe from the lists of data
dataframe1 = spark.createDataFrame(data, columns)
# join based on ID and name column
dataframe.join(dataframe1, (dataframe.ID1 == dataframe1.ID2)
| (dataframe.NAME1 == dataframe1.NAME2)).show()
Output: