Our topic today: how to create a DataFrame from a list in PySpark.
This tutorial is ideal for beginners who have already followed our previous guide on manually creating a single DataFrame and want to level up by building DataFrames from lists.
Here we provide five methods to create DataFrame from lists, along with samples and explanations.
Benefits of Creating a PySpark DataFrame from a List
A list is an ordered collection of elements. Its items can quickly be mapped to named columns and converted into a DataFrame, which makes it easy to iterate over, operate on, or duplicate large amounts of data by first placing it in a list.
For that reason, we strongly suggest you read this article and take notes on the samples showing how to create a PySpark DataFrame from a list.
How to Create DataFrame from List in Pyspark?
Typically, there are three steps to convert the data in a list into a DataFrame:
- First, create the list.
- Next, convert the list into a DataFrame and check the output.
- Finally, build an RDD from the list.
Creating a list
We will create a simple list of departments in a company and the number of employees in each department, as follows:
dept = [("Development",30), ("Sales",5), ("Marketing",15), ("HR",4) ]
Convert the List into a DataFrame: Two Ways Available
- Assign columns from the list to a DataFrame
A schema is the most common way to associate the values in the list with the columns of a DataFrame.
In PySpark, a schema defines the structure of a DataFrame, using the StructType class. It is similar to a table schema in a database: simply put, it specifies the column names (and types) applied during data processing.
deptColumns = ["dept_name", "dept_id"]
deptDF = spark.createDataFrame(data=dept, schema=deptColumns)
deptDF.printSchema()
deptDF.show(truncate=False)
Here, spark.createDataFrame accepts the schema and the data together and generates a DataFrame from them.
root
 |-- dept_name: string (nullable = true)
 |-- dept_id: long (nullable = true)

+-----------+-------+
|dept_name  |dept_id|
+-----------+-------+
|Development|30     |
|Sales      |5      |
|Marketing  |15     |
|HR         |4      |
+-----------+-------+
- Define column names and types using a StructType schema
It is also possible to name and type the columns explicitly with a StructType schema. For instance, for the dept list we declare a string column and an integer column:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

deptSchema = StructType([
    StructField('dept_name', StringType(), True),
    StructField('dept_id', IntegerType(), True)
])

deptDF = spark.createDataFrame(data=dept, schema=deptSchema)
deptDF.printSchema()
deptDF.show(truncate=False)
The output is similar to the above, except that dept_id is now an integer rather than an inferred long.
- Build a DataFrame using a Row-type list
A list of Row objects is another way to build a DataFrame. Here is a sample:
# Wrap each record from the list above in a Row object
from pyspark.sql import Row
dept2 = [Row(dept_name="Development", dept_id=30),
         Row(dept_name="Sales", dept_id=5),
         Row(dept_name="Marketing", dept_id=15),
         Row(dept_name="HR", dept_id=4)]

# Create a DataFrame from the Row-type list; column names come from the Rows
deptDF2 = spark.createDataFrame(data=dept2)
deptDF2.printSchema()
Build an RDD from a list
Convert the list to a Resilient Distributed Dataset (RDD) to run parallel processing and operate across several nodes. Specifically, an RDD in PySpark keeps data in memory as an object that can be shared across related jobs.
# Build an RDD from the dept list created above; operations on it run in parallel
rdd = spark.sparkContext.parallelize(dept)
That concludes the tutorial on how to create a DataFrame from a list in PySpark. Try it now in your Python environment and, if possible, share the results for reference. Let us know if you want further explanation of any code samples!