The examples in this guide will help you get the hang of PySpark join types and joining two DataFrames. These operations are essential when you need to process complicated datasets in Spark.
PySpark Join Types | Join Two DataFrames
pyspark.sql.DataFrame.join()
You will need to use the method DataFrame.join() from the pyspark.sql module to join two DataFrames in PySpark.
Syntax:
DataFrame1.join(DataFrame2, on, how)
Parameters:
- DataFrame2: the only required parameter; the other DataFrame you want to join with the first.
- on: this optional parameter can be a join column name, a list of column names, a join expression (Column), or a list of Columns. If it names join columns, those columns must exist in both DataFrames.
- how: the join type. The default is inner, but you can also choose another option such as full, outer, left, right, semi, anti, or cross. If you are familiar with SQL, these correspond to the SQL join types.
It is important to note that join() is only available in PySpark 1.3.0 or newer. Upgrade your PySpark installation if you can’t find this method.
Let’s say we have two DataFrames with the following information (you can also create them with other methods):
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
data1 = [[1, "ITTutorial", 1],
         [2, "Stack Overflow", 3],
         [3, "Quora", 4],
         [4, "Reddit", 2]]
columns1 = ['ID', 'site', 'ranking']
df1 = spark.createDataFrame(data1, columns1)
df1.show()
data2 = [[1, 78, "Python"],
         [3, 23, "JavaScript"],
         [5, 10, "C++"],
         [6, 56, "HTML"]]
columns2 = ['ID', 'traffic', 'toplanguage']
df2 = spark.createDataFrame(data2, columns2)
df2.show()
Inner Join
You can join the two previous DataFrames using the ID column:
df1.join(df2, ['ID']).show()
The statement calls the method join() to perform an equi-join on ‘ID’ and keeps only one copy of the join column. Since this is an inner join, the final result contains only the two rows whose ‘ID’ appears in both DataFrames. It is equivalent to this SQL statement:
SELECT *
FROM df1
INNER JOIN df2
USING (ID)
Outer Join
Using the keyword ‘outer’, you can tell the join() method to carry out an outer join:
df1.join(df2, ['ID'], 'outer').show()
As you can see, it keeps all the rows whether or not their IDs appear in both DataFrames. Missing items are filled with null.
Left Join
Pass ‘left’ as the join type and we have a left join:
df1.join(df2, ['ID'], 'left').show()
The returned DataFrame contains only the IDs from the first (left) DataFrame, filling in the columns from the second DataFrame where there is a match and null otherwise.
Right Join
In a similar manner, we can do a right join with the method DataFrame.join():
df1.join(df2, ['ID'], 'right').show()
This time, the result keeps all the IDs from the second (right) DataFrame instead, filling the missing columns from the first DataFrame with null.
Semi Join
The method join() also supports semi joins:
df1.join(df2, ['ID'], 'semi').show()
Notice that the result contains only the rows (and only the columns) of the first DataFrame whose IDs match IDs in the second DataFrame.
Anti Join
On the other hand, an anti join returns the rows of the first DataFrame whose IDs don’t exist in the second DataFrame:
df1.join(df2, ['ID'], 'anti').show()
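For intuition, semi and anti joins boil down to a membership test on the join keys. A plain-Python sketch (not PySpark) of the same logic, using the IDs from the example DataFrames:

```python
# IDs taken from the example DataFrames above.
left_ids = [1, 2, 3, 4]     # df1
right_ids = {1, 3, 5, 6}    # df2

# Semi join: keep left rows whose key appears on the right.
semi = [i for i in left_ids if i in right_ids]      # [1, 3]

# Anti join: keep left rows whose key does not appear on the right.
anti = [i for i in left_ids if i not in right_ids]  # [2, 4]
```

Unlike the other join types, neither brings any columns over from the right side.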
Conclusion
We hope this guide has given you a good grasp of PySpark join types and joining two DataFrames. With the method DataFrame.join() in the pyspark.sql module, you can carry out almost every type of SQL join in Spark.