Reading a CSV file into a PySpark DataFrame is a task supported by several modules. This tutorial will show how you can make use of them in different situations.
PySpark Read CSV File Into DataFrame
You can use either the read.csv() method from the pyspark.sql module itself or rely on the read_csv() function from pandas. Regardless of your choice, you must initialize a SparkSession first:
>>> import pyspark
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.appName(
... 'ITTutoria: PySpark Read CSV file into DataFrame').getOrCreate()
Using read.csv()
The PySpark SQL module comes with the read.csv() method for converting one or multiple CSV files in a directory into a Spark DataFrame. (There is also a matching write.csv() method for writing operations, just like how you can write CSV files in plain Python.)
Once you have a SparkSession, the syntax for reading a CSV file is straightforward:
spark.read.csv(path)
path is the file path of your CSV file, which can be an absolute or relative path.
>>> df = spark.read.csv('data.csv')
>>> type(df)
<class 'pyspark.sql.dataframe.DataFrame'>
>>> df.show()
+--------------+-------+--------+
| _c0| _c1| _c2|
+--------------+-------+--------+
| Site|Ranking|Language|
| ITTutoria| 1| Python|
|Stack Overflow| 2| Python|
| Quora| 3| Python|
+--------------+-------+--------+
>>> df.printSchema()
root
|-- _c0: string (nullable = true)
|-- _c1: string (nullable = true)
|-- _c2: string (nullable = true)
The type() function indicates that df is a PySpark DataFrame. From the output, you can see that read.csv() doesn't treat the first line of the CSV file as the header by default.
If you want to set a header or adjust your CSV import, you will need to use the option() function in conjunction with read.csv().
Its job is to customize behaviors of writing and reading in the spark.sql module, including controlling the character set, delimiter character, and header, among other things.
For instance, to set a header, pass True as the option's value. This option controls whether read.csv() uses the first line as column names. It is set to False by default.
>>> df = spark.read.option('header', True).csv('data.csv')
>>> df.show()
+--------------+-------+--------+
| Site|Ranking|Language|
+--------------+-------+--------+
| ITTutoria| 1| Python|
|Stack Overflow| 2| Python|
| Quora| 3| Python|
+--------------+-------+--------+
delimiter and sep are aliases of each other. They determine how each value is separated. The default delimiter is the comma character (,), which you can change to one or more characters.
You can also combine several options in a single options() call:
>>> df = spark.read.options(delimiter=';', header=True).csv('data.csv')
If you need to import data from multiple CSV files, place them in a list like this:
>>> csv_files = ['data1.csv', 'data2.csv']
>>> df = spark.read.csv(csv_files)
You can also select every CSV file in a directory:
>>> df = spark.read.csv('~/*.csv')
Using pandas.read_csv()
Instead of using the built-in csv() method of the pyspark.sql module, you can take advantage of the pandas module to get the job done. Its read_csv() function provides essentially the same functionality.
Example:
>>> import pandas as pd
>>> df = spark.createDataFrame(pd.read_csv('data.csv'))
>>> type(df)
<class 'pyspark.sql.dataframe.DataFrame'>
>>> df.show()
+--------------+-------+--------+
| Site|Ranking|Language|
+--------------+-------+--------+
| ITTutoria| 1| Python|
|Stack Overflow| 2| Python|
| Quora| 3| Python|
+--------------+-------+--------+
>>> df.printSchema()
root
|-- Site: string (nullable = true)
|-- Ranking: long (nullable = true)
|-- Language: string (nullable = true)
As shown in the output, read_csv() automatically reads the first line of the CSV file as the header by default. You can also change this function's behavior through various options.
For instance, the sep parameter is used to specify the delimiter of the CSV file. If it is set to None, Python will use its own parsing engine instead of the default C engine to parse the file. Meanwhile, the usecols and skiprows parameters allow you to retrieve only certain columns and rows from the data.
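A short sketch of the sep, usecols, and skiprows parameters described above; the in-memory sample below stands in for data.csv and uses ';' as its separator:

```python
# Sketch: pandas read_csv options. The StringIO buffer is a stand-in
# for a semicolon-separated data.csv on disk.
import io
import pandas as pd

csv_text = 'Site;Ranking;Language\nITTutoria;1;Python\nQuora;3;Python\n'

# sep sets the delimiter; usecols keeps only the named columns
pdf = pd.read_csv(io.StringIO(csv_text), sep=';', usecols=['Site', 'Ranking'])

# skiprows drops physical lines by index; line 1 is the first data row
pdf2 = pd.read_csv(io.StringIO(csv_text), sep=';', skiprows=[1])
```

Either result can then be handed to spark.createDataFrame() exactly as in the example above.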
Conclusion
Reading a CSV file into a PySpark DataFrame is possible with both the pyspark.sql and pandas modules. They offer simple functions created for the task of importing comma-separated values, with several options you can adjust.