Wondering how you can join two DataFrames in Spark? Wonder no more: this guide walks you through it step by step!
Spark Join Two Dataframes: How To
Step 1: Create Two Different DataFrame tables
Let’s first create two DataFrames, “emp” and “dept”.
Emp Table
import spark.implicits._

val emp = Seq(
  (1, "Ross", "10"),
  (2, "Rachel", "20"),
  (3, "Joey", "10"),
  (4, "Chandler", "10"),
  (5, "Phebes", "40"),
  (6, "Monica", "50")
)
val empColumns = Seq("emp_id", "name", "emp_dept_id")
val empDF = emp.toDF(empColumns: _*)
empDF.show(false)
The result is as follows:
+------+--------+-----------+
|emp_id|name |emp_dept_id|
+------+--------+-----------+
|1 |Ross |10 |
|2 |Rachel |20 |
|3 |Joey |10 |
|4 |Chandler|10 |
|5 |Phebes |40 |
|6 |Monica |50 |
+------+--------+-----------+
Dept Table
val dept = Seq(
  ("Finance", 10),
  ("Marketing", 20),
  ("Sales", 30),
  ("IT", 40)
)
val deptColumns = Seq("dept_name", "dept_id")
val deptDF = dept.toDF(deptColumns: _*)
deptDF.show(false)
The output is as follows:
+---------+-------+
|dept_name|dept_id|
+---------+-------+
|Finance |10 |
|Marketing|20 |
|Sales |30 |
|IT |40 |
+---------+-------+
Step 2: Join two DataFrames
- Method #1: Employing Join Operator
Syntax:
join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame
join(right: Dataset[_]): DataFrame
The first syntax takes three parameters: the right dataset, joinExprs, and joinType. The joinExprs parameter supplies the join condition, and joinType selects the kind of join to perform.
The second syntax takes only the right dataset; when no join type is specified, Spark defaults to an inner join.
Now let’s get to the main part!
Running the code:
// Using a join expression
empDF.join(deptDF, empDF("emp_dept_id") === deptDF("dept_id"), "inner")
  .show(false)
This joins the two tables and produces a new DataFrame with the output listed below.
+------+--------+-----------+---------+-------+
|emp_id|name |emp_dept_id|dept_name|dept_id|
+------+--------+-----------+---------+-------+
|1 |Ross |10 |Finance |10 |
|2 |Rachel |20 |Marketing|20 |
|3 |Joey |10 |Finance |10 |
|4 |Chandler|10 |Finance |10 |
|5 |Phebes |40 |IT |40 |
+------+--------+-----------+---------+-------+
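Notice that Monica (emp_dept_id 50) is missing from the result, since no department has dept_id 50. The same joinExprs works with other join types too; as a sketch, passing "left" instead of "inner" keeps every employee row and fills the department columns with null where there is no match:

```scala
// A sketch of the same join with a different join type: "left" (left outer)
// keeps every row from empDF, filling dept columns with null when no match exists.
empDF.join(deptDF, empDF("emp_dept_id") === deptDF("dept_id"), "left")
  .show(false)
// Monica's row (emp_dept_id = 50) now appears, with null in dept_name and dept_id.
```

Other commonly supported join type strings include "right", "full", "left_semi", and "left_anti".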
Optionally, Inner.sql may be used as the join type. To do so, add the import “import org.apache.spark.sql.catalyst.plans.Inner”.
//Using Join expression
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),Inner.sql )
.show(false)
- Method #2: Employing SQL Expression
Alternatively, we can use native SQL syntax to join multiple tables. To do so, we first create a temporary view for each of our DataFrames, and then use spark.sql() to run the SQL expression.
Running the code:
// Using a SQL expression
empDF.createOrReplaceTempView("EMP")
deptDF.createOrReplaceTempView("DEPT")
spark.sql("select * from EMP e, DEPT d " +
  "where e.emp_dept_id = d.dept_id")
  .show(false)
The output is exactly the same as with the method above.
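The comma-separated FROM clause above is the implicit-join style. As a sketch using the same temp views, the equivalent query can also be written with the explicit JOIN ... ON syntax, which many find easier to read:

```scala
// Equivalent query using explicit JOIN ... ON syntax on the same temp views
spark.sql("select * from EMP e " +
  "inner join DEPT d on e.emp_dept_id = d.dept_id")
  .show(false)
```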
The Bottom Line
That’s how you can join two DataFrames in Spark. Hopefully, this article has been of benefit to you. See you next time!