Your application may require you to move your data from pandas to a Spark cluster, especially when you have to process a huge data set. The following examples show how to convert a pandas DataFrame to a PySpark DataFrame.
Convert Pandas To PySpark DataFrame
The DataFrame abstraction in Spark draws inspiration from the data structure of the same name in pandas. In short, a Spark DataFrame is a Dataset organized into named columns. You can think of it as a pandas/R DataFrame or a relational database table, but with many more optimizations under the hood.
For this reason, it doesn’t come as a surprise that you can convert a pandas DataFrame to a Spark DataFrame using only the built-in tools of PySpark.
In particular, the SparkSession.createDataFrame() method is what you need. It can create a Spark DataFrame from various kinds of data, including a pandas DataFrame.
Syntax: SparkSession.createDataFrame(data, schema, samplingRatio, verifySchema)
Parameters:
- data: the only required parameter. It indicates the source of data for the creation of your Spark DataFrame. You can use a list, a resilient distributed dataset (RDD), or, of course, a pandas DataFrame.
- schema: this parameter determines the names (labels) and data types of the columns. It is optional with the default value None.
If you don’t pass a specific schema to createDataFrame(), the method will try to infer it from your data. To specify a schema, you can use a datatype string or any type from pyspark.sql.types.DataType; the recommended choice is StructType.
- samplingRatio: the ratio of rows createDataFrame() samples when inferring the schema. The default value is None, which prompts the method to use only the first row.
- verifySchema: this parameter has been available since version 2.1.0. It is enabled by default, forcing PySpark to verify every row to make sure its data types match the schema (see the sketch after this list).
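To make these optional parameters more concrete, here is a minimal sketch, not part of the original example, that lets createDataFrame() infer a schema from an RDD of tuples; the row values and the sampling ratio of 1.0 are illustrative assumptions:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A plain RDD of tuples; no schema is supplied, so it has to be inferred.
rdd = spark.sparkContext.parallelize(
    [("ITTutoria", 1), ("Stack Overflow", 2), ("Quora", 3)]
)

# samplingRatio=1.0 makes the method look at every row while inferring the
# column types; verifySchema keeps its default (True), so each row is also
# checked against the resulting schema.
df_inferred = spark.createDataFrame(rdd, samplingRatio=1.0)
df_inferred.printSchema()  # columns are named _1 and _2 because no names were given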
To show you how to use createDataFrame(), we need to create a pandas DataFrame first:
>>> import pandas as pd
>>> df = pd.DataFrame({
... 'Site': ['ITTutoria', 'Stack Overflow', 'Quora'],
... 'Ranking': [1, 2, 3]
... })
>>> df
             Site  Ranking
0       ITTutoria        1
1  Stack Overflow        2
2           Quora        3
The DataFrame() constructor of pandas is called with a dictionary whose keys become the column labels. The result is a DataFrame with two named columns and three rows.
Before converting this DataFrame to Spark, make sure you have initialized a SparkSession first:
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
Convert your pandas DataFrame to a Spark DataFrame:
df_spark = spark.createDataFrame(df)
You can verify its content and schema:
>>> df_spark.show()
+--------------+-------+
|          Site|Ranking|
+--------------+-------+
|     ITTutoria|      1|
|Stack Overflow|      2|
|         Quora|      3|
+--------------+-------+
>>> df_spark.printSchema()
root
|-- Site: string (nullable = true)
|-- Ranking: long (nullable = true)
As you can see, the method correctly picks up both the data and the schema from the pandas DataFrame. Note that Ranking comes out as long because pandas stores those integers as int64, which maps to Spark’s LongType.
Many analysts may want to enable Apache Arrow, an in-memory columnar data format, during this conversion. Disabled by default, it can speed up the transfer of data between Python and the JVM. One of Arrow’s common applications is working with NumPy/pandas data.
To take advantage of this format, you will need to install its Python module and change your configuration so that Spark uses it.
Make sure you have installed PyArrow on your cluster nodes first:
pip install pyarrow
Enable Arrow in your operations by switching the configuration spark.sql.execution.arrow.pyspark.enabled to true before calling createDataFrame() (in Spark 2.x, the key was spark.sql.execution.arrow.enabled):
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
df_spark = spark.createDataFrame(df)
If you want to adjust the schema of your dataset, you can do so during the conversion. For example, the snippet below renames the DataFrame’s first column and stores Ranking as a 32-bit integer instead of a long:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
df_schema = StructType([
    StructField("Website", StringType(), True),
    StructField("Ranking", IntegerType(), True)
])
df_spark = spark.createDataFrame(df, schema=df_schema)
You can print the schema and content to verify:
>>> df_spark.printSchema()
root
|-- Website: string (nullable = true)
|-- Ranking: integer (nullable = true)
>>> df_spark.show()
+--------------+-------+
|       Website|Ranking|
+--------------+-------+
|     ITTutoria|      1|
|Stack Overflow|      2|
|         Quora|      3|
+--------------+-------+
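As mentioned earlier, the schema argument also accepts a datatype string instead of a StructType. The following is a brief sketch of the same renaming, assuming a Spark version recent enough to parse DDL-style schema strings:
# Same conversion as above, but the schema is written as a DDL-style string.
# The names "Website" and "Ranking" are applied to the pandas columns in order.
df_spark = spark.createDataFrame(df, schema="Website string, Ranking int")
df_spark.printSchema()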
Conclusion
It is easy to convert a pandas DataFrame to a PySpark DataFrame. Spark provides a built-in method, createDataFrame(), that takes care of the job and lets you switch to a framework better suited to large-scale analysis.