A SQL join in Spark combines two (or more) datasets based on certain criteria, producing a new DataFrame. There are several Spark SQL join types worth mastering, and we are going to break them down to help you better understand each one.
Spark SQL Join Types
Spark supports a wide range of join operations, including inner, (full) outer, left, right, anti, and semi joins. The method DataFrame.join() takes a how argument to indicate the join type you want.
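In general, a call to join() has the shape sketched below. Note that df_left and df_right are placeholder names, on accepts a column name, a list of names, or a Boolean column expression, and how selects the join type:
# Call shape only; df_left and df_right stand for any two DataFrames.
# 'how' accepts values such as 'inner', 'outer', 'left', 'right', 'semi', 'anti'.
joined = df_left.join(df_right, on="customer", how="inner")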
To demonstrate our examples, we will use these two related DataFrames:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.getOrCreate()

# Payments: payment ID, customer ID, and amount
schema1 = StructType([
    StructField("payment", IntegerType(), True),
    StructField("customer", IntegerType(), True),
    StructField("amount", IntegerType(), True)
])
data1 = [
    (1, 201, 1400),
    (2, 123, 159),
    (3, 340, 600),
    (4, 201, 4500),
    (5, 150, 120),
    (6, 697, 49)
]
df1 = spark.createDataFrame(data=data1, schema=schema1)
df1.show()
# Customers: customer ID and name
schema2 = StructType([
    StructField("customer", IntegerType(), True),
    StructField("name", StringType(), True)
])
data2 = [
    (201, "John"),
    (340, "Anna"),
    (697, "Bill")
]
df2 = spark.createDataFrame(data=data2, schema=schema2)
df2.show()
Inner Joins
This is the default and most common type of SQL join in Spark. It keeps only the rows whose join key exists in both DataFrames. The method join() also lets you specify which columns to compare when making the join.
cond = [df1.customer == df2.customer]
df1.join(df2, cond).show()
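With the sample data above, the result should look roughly like this (row order is not guaranteed; the output keeps both customer columns because the condition compares two distinct columns):
+-------+--------+------+--------+----+
|payment|customer|amount|customer|name|
+-------+--------+------+--------+----+
|      1|     201|  1400|     201|John|
|      4|     201|  4500|     201|John|
|      3|     340|   600|     340|Anna|
|      6|     697|    49|     697|Bill|
+-------+--------+------+--------+----+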
With the list cond providing the join condition, the final DataFrame contains only the four rows whose customer value appears on both sides.
Outer Joins
This type of join retains the records of both DataFrames, even those without a matching key on the other side.
df1.join(df2, cond, how = 'outer').show()
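On our sample data, the full outer join should produce something like this (row order may vary, and newer Spark versions print NULL in uppercase). Customers 123 and 150 have no match in df2, so their right-side columns are null:
+-------+--------+------+--------+----+
|payment|customer|amount|customer|name|
+-------+--------+------+--------+----+
|      1|     201|  1400|     201|John|
|      2|     123|   159|    null|null|
|      3|     340|   600|     340|Anna|
|      4|     201|  4500|     201|John|
|      5|     150|   120|    null|null|
|      6|     697|    49|     697|Bill|
+-------+--------+------+--------+----+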
The method merges the two DataFrames and fills in null values for records that have no match in the other DataFrame.
Left Joins
All the rows of the left DataFrame are retained regardless of whether the right DataFrame has a matching record. The method join() appends null values when there is no corresponding record on the right.
df1.join(df2, cond, how = 'leftouter').show()
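Because every customer in df2 also appears in df1, the left join on this particular dataset should return the same rows as the full outer join above (row order may vary):
+-------+--------+------+--------+----+
|payment|customer|amount|customer|name|
+-------+--------+------+--------+----+
|      1|     201|  1400|     201|John|
|      2|     123|   159|    null|null|
|      3|     340|   600|     340|Anna|
|      4|     201|  4500|     201|John|
|      5|     150|   120|    null|null|
|      6|     697|    49|     697|Bill|
+-------+--------+------+--------+----+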
As you can see, the returned DataFrame keeps all the rows of the first DataFrame. The columns coming from the second DataFrame are populated with matching entries, or with null values when there is no match.
Right Joins
On the other hand, a right join retains every row of the right DataFrame and appends null values on the left side when there is no match.
df1.join(df2, cond, how = 'rightouter').show()
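Since every customer in df2 has at least one match in df1, the right join here should coincide with the inner join result (row order may vary):
+-------+--------+------+--------+----+
|payment|customer|amount|customer|name|
+-------+--------+------+--------+----+
|      1|     201|  1400|     201|John|
|      4|     201|  4500|     201|John|
|      3|     340|   600|     340|Anna|
|      6|     697|    49|     697|Bill|
+-------+--------+------+--------+----+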
You can see that every row of the customer DataFrame df2 is present in the resulting object. Since each customer in df2 has a match in the left DataFrame, the method join() didn't need to append any null values to the output.
Left-Semi Joins
A left-semi join returns the subset of the left DataFrame that has a match in the second DataFrame (using the condition provided through the on argument). The result contains no columns from the right side.
df1.join(df2, cond, how = 'left_semi').show()
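With our data, the left-semi join should return the four matching rows of df1 and only its columns (row order may vary):
+-------+--------+------+
|payment|customer|amount|
+-------+--------+------+
|      1|     201|  1400|
|      3|     340|   600|
|      4|     201|  4500|
|      6|     697|    49|
+-------+--------+------+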
Left-Anti Joins
A left-anti join works in the opposite way to the left-semi join. It still returns only columns of the first DataFrame, but it selects the rows that have no match on the right side.
df1.join(df2, cond, how = 'left_anti').show()
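On our sample data, only customers 123 and 150 lack a match in df2, so the left-anti join should return just these two rows (row order may vary):
+-------+--------+------+
|payment|customer|amount|
+-------+--------+------+
|      2|     123|   159|
|      5|     150|   120|
+-------+--------+------+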
Conclusion
There are several Spark SQL join types. For the most part, they mimic what join operations in SQL do, which should make the transition a breeze if you come from a relational database management system (RDBMS). In addition to these joins, you can learn more about the equivalent of the SELECT statement with this guide.