Reading a CSV file into a DataFrame is officially supported in Spark, just like many other data sources. Check out the details here and learn how you can accomplish this task in both Scala (the default Spark shell) and PySpark (another popular API).
Spark Read CSV File Into DataFrame
CSV Data Sources And Spark SQL
Spark SQL comes with many reading and writing operations for data sources. They work with the DataFrame interface, which can be used to generate a temporary view or operate with relational transformations.
Many file formats are supported for data sources, including CSV, JSON, text, and ORC files.
A Simple Read Operation
The function designed for reading CSV files into a DataFrame is spark.read.csv(), which can handle one or multiple files at once. You can also use the option() function to control how the files are read.
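Because csv() accepts more than one path, several files can be loaded into a single DataFrame in one call. Here is a minimal Scala sketch; the file names are hypothetical:
// Hypothetical file names: csv() accepts one or more paths
val df = spark.read.csv("homes_2020.csv", "homes_2021.csv")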
Let’s say you have a CSV file named homes.csv with the following content, which records a few details about three home sales.
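Sell,List,Living,Rooms,Beds,Baths
142,160,28,10,5,3
175,180,18,8,4,1
129,132,13,6,3,1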
Here are two examples of using Scala (spark-shell) and Python (pyspark) to import this CSV file into a Spark DataFrame.
Scala:
scala> val data_source = "homes.csv"
data_source: String = homes.csv
scala> val df = spark.read.csv(data_source)
df: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 4 more fields]
scala> df.show()
+----+----+------+-----+----+-----+
| _c0| _c1|   _c2|  _c3| _c4|  _c5|
+----+----+------+-----+----+-----+
|Sell|List|Living|Rooms|Beds|Baths|
| 142| 160|    28|   10|   5|    3|
| 175| 180|    18|    8|   4|    1|
| 129| 132|    13|    6|   3|    1|
+----+----+------+-----+----+-----+
scala> df.printSchema()
root
|-- _c0: string (nullable = true)
|-- _c1: string (nullable = true)
|-- _c2: string (nullable = true)
|-- _c3: string (nullable = true)
|-- _c4: string (nullable = true)
|-- _c5: string (nullable = true)
As you can see, reading a CSV file in Scala is quite simple. A variable holds the path to the data source and is passed to the read.csv() function, and the show() and printSchema() functions display the content and schema of the resulting DataFrame.
Note that by default, the read.csv() function doesn’t parse the first line of the CSV file as a header, which is why the columns are named _c0, _c1, and so on.
Python:
>>> data_source = 'homes.csv'
>>> df = spark.read.csv(data_source)
>>> df.show()
+----+----+------+-----+----+-----+
| _c0| _c1|   _c2|  _c3| _c4|  _c5|
+----+----+------+-----+----+-----+
|Sell|List|Living|Rooms|Beds|Baths|
| 142| 160|    28|   10|   5|    3|
| 175| 180|    18|    8|   4|    1|
| 129| 132|    13|    6|   3|    1|
+----+----+------+-----+----+-----+
>>> df.printSchema()
root
|-- _c0: string (nullable = true)
|-- _c1: string (nullable = true)
|-- _c2: string (nullable = true)
|-- _c3: string (nullable = true)
|-- _c4: string (nullable = true)
|-- _c5: string (nullable = true)
The PySpark module is a great alternative if you want to take advantage of Spark’s computation capabilities from Python. It offers the same functions and, as expected, produces the same results.
Check out this article if you want to import a CSV file through pandas.
CSV Data Source Options
Thanks to the option() function, Spark equips developers with several options they can configure when reading CSV files. Below are the most important settings you should be aware of (they apply to both Scala and Python APIs):
sep or delimiter: these settings determine the character(s) used by Spark to separate values in the data source. The default value is a comma.
If your CSV uses a different separator, for example a semicolon, you can tell Spark to read it correctly:
Scala:
val df = spark.read.option("delimiter", ";").csv(data_source)
Python:
df = spark.read.option("delimiter", ";").csv(data_source)
header: this option determines whether Spark uses the first line of the CSV file as column names. It is a boolean value that is False by default. You can alter this with the option() function:
Scala:
val df = spark.read.option("header", "true").csv(data_source)
Python:
df = spark.read.option("header", True).csv(data_source)
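With the header option enabled, the first line of homes.csv supplies the column names instead of appearing as a data row, so df.show() would print something like:
+----+----+------+-----+----+-----+
|Sell|List|Living|Rooms|Beds|Baths|
+----+----+------+-----+----+-----+
| 142| 160|    28|   10|   5|    3|
| 175| 180|    18|    8|   4|    1|
| 129| 132|    13|    6|   3|    1|
+----+----+------+-----+----+-----+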
Conclusion
Both the Scala and Python APIs provide simple functions to read a CSV file into a Spark DataFrame. You can explore more data source options to customize the import process.
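For instance, the options shown above can be chained in a single read. Here is a minimal Scala sketch, assuming a semicolon-separated file with a header row:
// Chain the header and delimiter options in one read
val df = spark.read
  .option("header", "true")
  .option("delimiter", ";")
  .csv(data_source)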