Knowing how to convert Spark RDD to DataFrame and Dataset is important. Read on to find out the recommended methods of how to do so and enjoy the optimizations of those data structures.
Convert Spark RDD To DataFrame And Dataset
First, we initialize a SparkSession and create an RDD using parallelize – one of the three methods for creating RDDs in Spark. It reads an existing list or array and copies its elements into a distributed dataset. As the data is spread across several nodes, the driver program can run operations in parallel with better performance.
Scala:
scala> import org.apache.spark.sql.SparkSession
scala> val spark = SparkSession
.builder()
.appName("Convert Spark RDD to DataFrame and Dataset")
.getOrCreate()
scala> val rdd = spark.sparkContext.parallelize(
Seq(
("ITTutorial", "Scala", 1),
("Stack Overflow", "Python", 2),
("Quora", "HTML", 3)
)
)
scala> rdd.foreach(println)
(ITTutorial,Scala,1)
(Stack Overflow,Python,2)
(Quora,HTML,3)
Python:
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession \
... .builder \
... .appName("Convert Spark RDD to DataFrame and Dataset") \
... .getOrCreate()
>>> data = [("ITTutorial", "Scala", 1),
... ("Stack Overflow", "Python", 2),
... ("Quora", "HTML", 3)]
>>>
>>> rdd = spark.sparkContext.parallelize(data)
>>> rdd.collect()
[('ITTutorial', 'Scala', 1), ('Stack Overflow', 'Python', 2), ('Quora', 'HTML', 3)]
In both examples above, we start a SparkSession of the same name and use the sparkContext.parallelize() method of a SparkContext object to create an RDD.
We use a sequence in Scala and a list in Python to hold the data. You can see the elements of the RDD thanks to the foreach() and collect() methods.
Convert RDD To DataFrame Using toDF()
The DataFrame API draws inspiration from the Python package pandas. Under the hood, a Spark DataFrame is built on RDDs: SQL and DSL expressions are translated into operations on them.
toDF() is a simple method in both the Scala and Python APIs that you can use to convert RDDs to DataFrames. (In Scala code outside spark-shell, it requires import spark.implicits._ first; spark-shell imports this for you.)
Scala:
scala> rdd.toDF("Site", "Language", "Ranking").show
+--------------+--------+-------+
| Site|Language|Ranking|
+--------------+--------+-------+
| ITTutorial| Scala| 1|
|Stack Overflow| Python| 2|
| Quora| HTML| 3|
+--------------+--------+-------+
Python:
>>> rdd.toDF(['Site', 'Language', 'Ranking']).show()
+--------------+--------+-------+
| Site|Language|Ranking|
+--------------+--------+-------+
| ITTutorial| Scala| 1|
|Stack Overflow| Python| 2|
| Quora| HTML| 3|
+--------------+--------+-------+
You can pass column names to the toDF() method, and the results can be displayed with show().
This method has serious limitations, however. For starters, it only works with RDDs of certain element types: Int, Long, String, and subclasses of scala.Product (such as tuples and case classes). You can't customize the nullable flag of the columns either.
Convert RDD To DataFrame Using createDataFrame()
The createDataFrame() method is more powerful and addresses many problems of toDF(). In addition to a pandas DataFrame or a list, it can create a DataFrame from an RDD as well.
Scala:
val df = spark.createDataFrame(rdd)
Python:
df = spark.createDataFrame(rdd)
The createDataFrame() method also lets you pass a schema, or simply a list of column names, for your DataFrame. For instance:
>>> df = spark.createDataFrame(rdd, ['Site', 'Language', 'Ranking'])
>>> df.show()
+--------------+--------+-------+
| Site|Language|Ranking|
+--------------+--------+-------+
| ITTutorial| Scala| 1|
|Stack Overflow| Python| 2|
| Quora| HTML| 3|
+--------------+--------+-------+
The output shows that df is a DataFrame that holds the same elements as rdd.
Convert RDD To Dataset Using toDS()
While RDD is Spark's core data abstraction, DataFrame and Dataset (which can be seen as extensions of it) have their advantages too.
RDD offers compile-time type safety but no automatic optimization. DataFrame, on the other hand, benefits from automatic optimization but doesn't provide compile-time type safety. Dataset offers the best of both worlds, combining those two features.
Converting an RDD to a Dataset is trivial with toDS() and createDataset(), which are available only in the Scala (and Java) APIs; PySpark has no Dataset API.
scala> rdd.toDS().show
+--------------+------+---+
| _1| _2| _3|
+--------------+------+---+
| ITTutorial| Scala| 1|
|Stack Overflow|Python| 2|
| Quora| HTML| 3|
+--------------+------+---+
scala> spark.createDataset(rdd).show
+--------------+------+---+
| _1| _2| _3|
+--------------+------+---+
| ITTutorial| Scala| 1|
|Stack Overflow|Python| 2|
| Quora| HTML| 3|
+--------------+------+---+
Conclusion
You can convert Spark RDD to DataFrame and Dataset easily with built-in methods. They enable the switch between different data abstractions, allowing you to optimize your program’s performance.