Creating DataFrames is an essential skill for any Spark developer doing structured data processing. If you aren’t familiar with this task, you have come to the right place. Read on to find out how.
PySpark – Create DataFrame With Examples
createDataFrame()
To create a DataFrame in PySpark from a data source, use the createDataFrame() method:
SparkSession.createDataFrame(data, schema=None, samplingRatio=None, verifySchema=True)
The parameter data is the only required one: a list, a pandas DataFrame, or a Resilient Distributed Dataset (RDD) whose elements are SQL data representations (Row, dict, tuple, int, boolean, etc.). CSV and JSON files are not passed in directly; they are first loaded into a pandas DataFrame, as shown later.
The parameters schema and samplingRatio both have the default value None.
Remember that the schema describes your data structure and determines how the Dataset is created in Spark SQL. It can be explicit (provided by you) or implicit (inferred at runtime).
With the default value, the schema (column names and types) is inferred from the data, which must then be an RDD or list of dict, namedtuple, or Row objects.
When given, the schema should be a list of column names or a DDL-formatted string, while samplingRatio is the fraction of rows PySpark samples to infer the schema. When the schema parameter is a datatype string, the actual data must match it; otherwise, Spark will throw an exception at runtime.
By default, data type verification is enabled. But you can switch it off by passing verifySchema=False.
Before demonstrating the abilities of SparkSession.createDataFrame(), we must create a basic SparkSession – the entry point to all Spark’s functionality.
>>> import pyspark
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.appName(
... 'ITTutoria: Create a PySpark DataFrame').getOrCreate()
Using Row
The Row class (from pyspark.sql.Row) can be used in PySpark to create a set of rows. Instances of this class can be fed into createDataFrame() to create a DataFrame.
>>> from pyspark.sql import Row
>>> df = spark.createDataFrame([
...     Row(a="ITTutoria", b=1, c="Python"),
...     Row(a="Stack Overflow", b=2, c="Python"),
...     Row(a="Quora", b=3, c="Python")
... ])
>>>
>>> df.show()
+--------------+---+------+
| a| b| c|
+--------------+---+------+
| ITTutoria| 1|Python|
|Stack Overflow| 2|Python|
| Quora| 3|Python|
+--------------+---+------+
>>> df.printSchema()
root
|-- a: string (nullable = true)
|-- b: long (nullable = true)
|-- c: string (nullable = true)
As you can see, the DataFrame df now contains three columns and three rows, matching the Row instances passed to createDataFrame(). PySpark also automatically infers the schema (column names and types) from the data. Don’t worry about the column names; you can change them later.
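For instance, toDF() returns a new DataFrame with renamed columns (a quick sketch; the new names here are purely illustrative):
>>> df.toDF('site', 'ranking', 'language').columns
['site', 'ranking', 'language']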
You can also specify the schema explicitly as an argument. For example, to set the data type of the second column to integer, pass schema='a string, b int, c string' while reusing the rows from above:
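>>> df = spark.createDataFrame([
...     Row(a="ITTutoria", b=1, c="Python"),
...     Row(a="Stack Overflow", b=2, c="Python"),
...     Row(a="Quora", b=3, c="Python")
... ], schema='a string, b int, c string')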
>>> df.printSchema()
root
|-- a: string (nullable = true)
|-- b: integer (nullable = true)
|-- c: string (nullable = true)
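The remaining parameters work the same way. Here is a minimal sketch reusing the session and Row import from above (the sample values are only illustrative):
>>> # Skip runtime type verification against the schema
>>> df = spark.createDataFrame([
...     Row(a="ITTutoria", b=1, c="Python")
... ], schema='a string, b int, c string', verifySchema=False)
>>>
>>> # Infer the schema from a 50% sample of an RDD's rows
>>> rdd = spark.sparkContext.parallelize([Row(a="Quora", b=3, c="Python")])
>>> df = spark.createDataFrame(rdd, samplingRatio=0.5)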
Using pandas DataFrames
As DataFrames in PySpark draw inspiration from their pandas counterpart, PySpark also supports importing an existing pandas DataFrame into a session.
>>> import pandas as pd
>>> pandas_df = pd.DataFrame({
... 'Sites': ['ITTutoria', 'Stack Overflow', 'Quora'],
... 'Ranking': [1, 2, 3],
... 'Language': ['Python', 'Python', 'Python']
... })
>>> df = spark.createDataFrame(pandas_df)
>>> df.show()
+--------------+-------+--------+
| Sites|Ranking|Language|
+--------------+-------+--------+
| ITTutoria| 1| Python|
|Stack Overflow| 2| Python|
| Quora| 3| Python|
+--------------+-------+--------+
>>> df.printSchema()
root
|-- Sites: string (nullable = true)
|-- Ranking: long (nullable = true)
|-- Language: string (nullable = true)
Using CSV And JSON Files
CSV and JSON files are common sources of data for analysis in Python, including PySpark.
The read_csv() and read_json() functions from the pandas module are designed to process data from these file formats. You can use them in conjunction with createDataFrame() to create a PySpark DataFrame.
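For the following snippet to work, data.csv is assumed to hold the same three records as before (a hypothetical file):
Site,Ranking,Language
ITTutoria,1,Python
Stack Overflow,2,Python
Quora,3,Python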
>>> df = spark.createDataFrame(pd.read_csv('data.csv'))
>>> df.show()
+--------------+-------+--------+
| Site|Ranking|Language|
+--------------+-------+--------+
| ITTutoria| 1| Python|
|Stack Overflow| 2| Python|
| Quora| 3| Python|
+--------------+-------+--------+
>>> df.printSchema()
root
|-- Site: string (nullable = true)
|-- Ranking: long (nullable = true)
|-- Language: string (nullable = true)
To read from a JSON file:
>>> df = spark.createDataFrame(pd.read_json('data.json'))
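This assumes data.json contains matching records, for example (a hypothetical file):
[
  {"Site": "ITTutoria", "Ranking": 1, "Language": "Python"},
  {"Site": "Stack Overflow", "Ranking": 2, "Language": "Python"},
  {"Site": "Quora", "Ranking": 3, "Language": "Python"}
]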
Conclusion
There are many ways to create a DataFrame in PySpark. You can start from different data sources, including Row objects, pandas DataFrames, and CSV or JSON files, all of which are handled by the createDataFrame() method.