A data frame is something nearly every programmer has to face at one point or another, and Spark users deal with them most of all. That is why so many questions about "Spark Create DataFrame" keep coming up.
We will walk you through all the approaches that we managed to find.
Spark Create DataFrame With RDD
The easiest approach is to create a DataFrame manually from an existing RDD. The first thing to do is initialize an RDD from your collection sequence with the parallelize() function.
val spark: SparkSession = SparkSession.builder()
  .master("local")
  .appName("ittutoria.net")
  .getOrCreate()

import spark.implicits._

val columns = Seq("count", "name")
val data = Seq(("20000", "Java"), ("100000", "Python"), ("3000", "Scala"))
val rdd = spark.sparkContext.parallelize(data)
From then on, there are three branches of solution. Each of them has its own strengths and weaknesses.
The best-known approach, which most programmers will automatically resort to, is toDF(). With a pair RDD like ours, it creates one column per tuple element and defaults to naming them "_1" and "_2".
The reason for these names lies in RDD’s schema-less nature. Converting from this format to DataFrame will only provide you with the default column names.
val dfFromRDD1 = rdd.toDF()
If you don't like these names, you can customize them by passing your own to toDF().
val dfRDD1 = rdd.toDF("count", "name")
Every column defaults to the string type, so you will need to make some tweaks if you want other data types. That is why this method can take quite a while.
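To illustrate those tweaks, here is a minimal sketch of casting one of the default string columns to an integer type after toDF(). It assumes the rdd and column names from the example above; the variable name dfTyped is our own.

```scala
// Cast the string "count" column to IntegerType after the DataFrame is built.
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.IntegerType

val dfTyped = rdd.toDF("count", "name")
  .withColumn("count", col("count").cast(IntegerType))

// printSchema() should now show count as integer rather than string.
dfTyped.printSchema()
```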
SparkSession offers a more flexible solution with its createDataFrame() function. It takes an RDD object as its argument, and you can chain the result with toDF() to apply your column names in one pass.
val dfRDD2 = spark.createDataFrame(rdd).toDF(columns:_*)
You can go further if you first convert the object to RDD[Row]. Doing so lets createDataFrame() take an explicit schema, with column names and types defined up front.
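A sketch of that RDD[Row] variant, assuming the SparkSession and data sequence from the earlier example; the schema keeps both fields as strings to match the source tuples.

```scala
// Build a DataFrame from RDD[Row] plus an explicit schema, so names and
// types are fixed at creation time instead of patched afterward.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}

val schema = StructType(Seq(
  StructField("count", StringType, nullable = true),
  StructField("name",  StringType, nullable = true)
))

val rowRDD = spark.sparkContext
  .parallelize(data)
  .map { case (count, name) => Row(count, name) }

val dfFromRows = spark.createDataFrame(rowRDD, schema)
```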
Spark Create DataFrame With List and Seq Collection
The biggest difference between this approach and the one above is that we work directly on the in-memory collection "data" instead of an RDD. You will be using either a List[T] or a Seq[T].
The easiest approach is just to call toDF() on “data”. Do remember to import spark.implicits._ or it won’t work.
import spark.implicits._
val dfData1 = data.toDF()
The second solution is the same as with RDDs, using createDataFrame(). You only need to change the input.
val dfData2 = spark.createDataFrame(data).toDF(columns:_*)
Spark Create DataFrame With CSV
Neither RDDs nor hand-built collections come up that often in real-world situations. More commonly, you will use Spark's reader API to load delimited files.
val df2 = spark.read.csv("/src/file.csv")
For different types of files, you simply change the part after "read.": for example, "text" for TXT files and "json" for JSON files.
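A quick sketch of those variants; the file paths are placeholders, not files from this article.

```scala
// Same reader API, different formats.
val dfText = spark.read.text("/src/file.txt")   // one "value" column per line
val dfJson = spark.read.json("/src/file.json")  // schema inferred from the JSON
```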
Spark Create DataFrame With RDBMS Database
When you already have a database available, things are less tricky. Just make sure you add the MySQL connector to your pom.xml as a dependency (or the DB2 driver, if that is your database).
val df_mysql = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://localhost:port/db")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("dbtable", "tablename")
  .option("user", "user")
  .option("password", "password")
  .load()
In this article, we have explained the most popular approaches to the Spark Create DataFrame problem. If you have read through the whole thing, we are quite sure this issue will no longer challenge you.
If you find our information helpful, look forward to our next release for even more interesting questions and solutions.