Wondering how you can join two DataFrames in Spark? Wonder no more: this guide walks you through it step by step!
Spark Join Two Dataframes: How To
Step 1: Create Two Different DataFrame tables
Let’s first create two DataFrames, “emp” and “dept”.
Emp Table
import spark.implicits._

val emp = Seq(
  (1, "Ross", "10"),
  (2, "Rachel", "20"),
  (3, "Joey", "10"),
  (4, "Chandler", "10"),
  (5, "Phebes", "40"),
  (6, "Monica", "50")
)
val empColumns = Seq("emp_id", "name", "emp_dept_id")
val empDF = emp.toDF(empColumns: _*)
empDF.show(false)
The result is as follows:
+------+--------+-----------+
|emp_id|name |emp_dept_id|
+------+--------+-----------+
|1 |Ross |10 |
|2 |Rachel |20 |
|3 |Joey |10 |
|4 |Chandler|10 |
|5 |Phebes |40 |
|6 |Monica |50 |
+------+--------+-----------+
Dept Table
val dept = Seq(
  ("Finance", 10),
  ("Marketing", 20),
  ("Sales", 30),
  ("IT", 40)
)
val deptColumns = Seq("dept_name", "dept_id")
val deptDF = dept.toDF(deptColumns: _*)
deptDF.show(false)
The output is as follows:
+---------+-------+
|dept_name|dept_id|
+---------+-------+
|Finance |10 |
|Marketing|20 |
|Sales |30 |
|IT |40 |
+---------+-------+
Step 2: Join two DataFrames
- Method #1: Employing Join Operator
Syntax:
join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame
join(right: Dataset[_]): DataFrame
The first syntax takes three parameters: the right dataset, joinExprs, and joinType. The joinExprs parameter supplies the join condition, and joinType selects the kind of join to perform.
The second syntax takes only the right dataset; when no join type is specified, Spark defaults to an inner join.
Now let’s get to the main part!
Running the code:
// Using a join expression
empDF.join(deptDF, empDF("emp_dept_id") === deptDF("dept_id"), "inner")
  .show(false)
This joins the two tables and produces a new DataFrame with the output listed below.
+------+--------+-----------+---------+-------+
|emp_id|name |emp_dept_id|dept_name|dept_id|
+------+--------+-----------+---------+-------+
|1 |Ross |10 |Finance |10 |
|2 |Rachel |20 |Marketing|20 |
|3 |Joey |10 |Finance |10 |
|4 |Chandler|10 |Finance |10 |
|5 |Phebes |40 |IT |40 |
+------+--------+-----------+---------+-------+
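Notice that Monica (emp_dept_id 50) is missing from the result, since no department has dept_id 50. The same joinExprs works with other join types too; as a sketch, passing "left" instead of "inner" keeps every employee row and fills the department columns with null where there is no match:

```scala
// A sketch of the same join with a different join type: "left" (left outer)
// keeps every row from empDF, filling dept columns with null when no match exists.
empDF.join(deptDF, empDF("emp_dept_id") === deptDF("dept_id"), "left")
  .show(false)
// Monica's row (emp_dept_id = 50) now appears, with null in dept_name and dept_id.
```

Other commonly supported join type strings include "right", "full", "left_semi", and "left_anti".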
Optionally, Inner.sql may be used as the join type. To do so, add the import “import org.apache.spark.sql.catalyst.plans.Inner”.
//Using Join expression
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),Inner.sql )
.show(false)
- Method #2: Employing SQL Expression
Alternatively, we can use native SQL syntax to join multiple tables. To do so, we first create a temporary view for each of our DataFrames, and then use spark.sql() to run the SQL expression.
Running the code:
// Using a SQL expression
empDF.createOrReplaceTempView("EMP")
deptDF.createOrReplaceTempView("DEPT")
spark.sql("select * from EMP e, DEPT d " +
  "where e.emp_dept_id = d.dept_id")
  .show(false)
The output is exactly the same as with the method above.
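The comma-separated FROM clause above is the implicit-join style. As a sketch using the same temp views, the equivalent query can also be written with the explicit JOIN ... ON syntax, which many find easier to read:

```scala
// Equivalent query using explicit JOIN ... ON syntax on the same temp views
spark.sql("select * from EMP e " +
  "inner join DEPT d on e.emp_dept_id = d.dept_id")
  .show(false)
```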
The Bottom Line
That’s how you can join two DataFrames in Spark. Hopefully, this article has been of benefit to you. See you next time!