Some applications require you to create a dataset before any data is available. This guide shows you several ways to create an empty DataFrame and an empty RDD in PySpark.
PySpark – Create An Empty DataFrame & RDD
Create An Empty RDD Using SparkContext.emptyRDD()
You can use the function pyspark.SparkContext.emptyRDD() to create an empty resilient distributed dataset (RDD) in PySpark. It takes no arguments and creates an RDD without any elements or partitions.
A SparkSession will need to be created first:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import *
spark = SparkSession.builder.getOrCreate()
Then you can create and assign an empty RDD to an object:
empty_RDD = spark.sparkContext.emptyRDD()
You can verify the type and content of this RDD with the built-in function type() and the method collect():
>>> type(empty_RDD)
<class 'pyspark.rdd.RDD'>
>>> empty_RDD.collect()
[]
Create An Empty RDD Using A List
As you may already know, the function SparkContext.parallelize() can form an RDD from a Python collection (such as a list). If you pass it an empty list, the result is an RDD without any content.
>>> empty_RDD = spark.sparkContext.parallelize([])
>>> type(empty_RDD)
<class 'pyspark.rdd.RDD'>
>>> empty_RDD.collect()
[]
Create An Empty RDD Using A Text File
Text files are another way to create Spark RDDs, and you can take advantage of this to make empty RDDs in PySpark. The method you will need is pyspark.SparkContext.textFile(), which can read text files from your local storage as well as any Hadoop-supported file system.
>>> import os
>>>
>>> with open('data.txt', 'w') as fp:
... pass
...
>>> path = os.path.join(os.getcwd(), 'data.txt')
>>> empty_RDD = spark.sparkContext.textFile(path)
>>> type(empty_RDD)
<class 'pyspark.rdd.RDD'>
>>> empty_RDD.collect()
[]
Create An Empty DataFrame From An RDD
While PySpark has no DataFrame equivalent of the method spark.sparkContext.emptyRDD(), you can make use of SparkSession.createDataFrame() to get the job done. Let's recall its syntax:
SparkSession.createDataFrame(data, schema, samplingRatio, verifySchema)
Where:
- data: the only required parameter. It indicates the source of data for the creation of your Spark DataFrame. You can use a list, a resilient distributed dataset (RDD), or, of course, a pandas DataFrame.
- schema: this parameter determines the names (labels) and data types of the columns. It is optional and defaults to None. If you don't pass an explicit schema to the method createDataFrame(), it will try to infer one from your data. You can specify the schema with a datatype string or any type from pyspark.sql.types.DataType; the recommended type is StructType.
As SparkSession.createDataFrame() needs a source of data to work, you can provide it with an object representing an empty dataset, such as an empty list or an empty RDD. As it turns out, the RDD returned by the method SparkContext.emptyRDD() can be used to create an empty DataFrame too.
>>> schema = StructType([])
>>> empty_RDD = spark.sparkContext.emptyRDD()
>>> empty_df = spark.createDataFrame(empty_RDD, schema = schema)
>>> empty_df.show()
++
||
++
++
Why do we need a schema here? Without one, the method createDataFrame() tries to infer the schema from the data itself, and an empty dataset gives it nothing to infer from, which leads to a ValueError:
>>> empty_df = spark.createDataFrame(empty_RDD)
ValueError: RDD is empty
Create An Empty DataFrame From A List
In a similar fashion, you can simply pass an empty list to the method createDataFrame():
>>> empty_df = spark.createDataFrame([], schema = schema)
>>> empty_df.show()
++
||
++
++
Conclusion
There are several ways to create an empty DataFrame or RDD in PySpark. You can provide general-purpose methods like spark.sparkContext.parallelize() and SparkSession.createDataFrame() with an empty dataset, or rely on the specialized SparkContext.emptyRDD().