When your data sets are spread across many text files, knowing how to write a Spark DataFrame to CSV is essential. Read on to explore this topic and get a better idea of how file writing works in Spark.
Spark Write DataFrame to CSV File
We are going to need a sample DataFrame to illustrate the methods in this tutorial. The following snippet will create a SparkSession and a DataFrame in the Python API:
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

# Create (or reuse) a SparkSession; the application name is arbitrary
spark = SparkSession.builder.appName("WriteCSVExample").getOrCreate()

# Column names, types, and whether each column accepts nulls
schema = StructType([
    StructField("Site", StringType(), True),
    StructField("Ranking", IntegerType(), True),
    StructField("Site Type", StringType(), True),
    StructField("Foundation", IntegerType(), True),
    StructField("Top Language", StringType(), True)
])

# Sample rows; None is stored as a null value
data = [
    ("ITTutoria", 1, "Tutorials", 2022, "Python"),
    ("Stack Overflow", 2, "Q&A", 2008, "JavaScript"),
    ("Quora", 3, "Q&A", 2009, None),
    ("Reddit", 4, "Forum", 2005, None)
]

df = spark.createDataFrame(data=data, schema=schema)
We will now try different ways to write this DataFrame df to CSV.
Using dataframe.write.csv()
The Spark SQL module provides the method dataframe.write.csv() which, as its name suggests, writes the content of a DataFrame in CSV format. You will need Spark version 2.0.0 or above to use this option.
Its syntax is quite simple:
DF.write.csv(PATH)
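Where DF is the DataFrame you want to write, and PATH is the output path. Spark treats this path as a directory: it creates the directory and writes one or more CSV part files inside it, rather than producing a single file with that name.
For example, you can write the DataFrame above to a directory named data.csv under the current working directory: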
import os
path = os.path.join(os.getcwd(), 'data.csv')
df.write.csv(path)
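Because the output is a directory, it can help to look up the part files Spark produced. The helper below is a minimal sketch using only the standard library; the function name list_csv_parts is hypothetical, and it assumes uncompressed output, where part files are named part-*.csv:

```python
import glob
import os


def list_csv_parts(output_dir):
    """Return the sorted CSV part files inside a Spark output directory.

    Assumes uncompressed output, where Spark names files part-*.csv.
    Marker files such as _SUCCESS do not match the pattern.
    """
    pattern = os.path.join(output_dir, "part-*.csv")
    return sorted(glob.glob(pattern))


# Usage, assuming `df` and `path` from the example above:
# df.write.csv(path)
# print(list_csv_parts(path))
```

If you need a single part file for a small data set, a common approach is df.coalesce(1).write.csv(path), which moves all rows into one partition before writing.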
Using dataframe.write.format('csv').save()
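The method df.write.save() writes a DataFrame using whichever data source the writer is configured for. If you don't specify one, Spark uses the default source set by spark.sql.sources.default, which is Parquet unless you change it. To write CSV instead, chain format('csv') before save().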
DF.write.format('csv').save(PATH)
As before, DF is the DataFrame you want to write, and PATH is the output path, which Spark creates as a directory holding the CSV part files.
The code just needs some small adjustments:
import os
path = os.path.join(os.getcwd(), 'data.csv')
df.write.format('csv').save(path)
Writing Options
Both the methods above share many parameters you can provide to customize your writing operation.
mode: this string parameter controls how PySpark behaves when the output path you specify already exists. You can set it with df.write.mode(...) or with the mode argument of csv() and save(). There are four modes you can assign to this parameter.
In append mode, PySpark leaves the existing data intact and adds the contents of your DataFrame as new part files next to it. Meanwhile, when mode is set to overwrite, PySpark deletes the existing output and writes the new content in its place.
When mode is set to ignore, PySpark silently skips the write and proceeds to the next statements if it finds existing data at the path. The default mode, however, is errorifexists (also accepted as error). This means PySpark will throw an error when the output path already exists.
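The save modes can be sketched as a small wrapper. This is only an illustration: the function name write_csv and the VALID_MODES tuple are hypothetical, while mode() and csv() are the DataFrameWriter methods discussed in this tutorial. The strings PySpark accepts here are append, overwrite, ignore, and error or errorifexists:

```python
# Save mode strings covered in this tutorial
VALID_MODES = ("append", "overwrite", "ignore", "error", "errorifexists")


def write_csv(df, path, mode="errorifexists"):
    """Write df as CSV to path using one of the save modes above."""
    if mode not in VALID_MODES:
        raise ValueError(f"Unknown save mode: {mode!r}")
    # mode() returns the writer, so the calls can be chained
    df.write.mode(mode).csv(path)


# Usage, assuming `df` and `path` from the earlier examples:
# write_csv(df, path, mode="overwrite")  # replace any existing output
# write_csv(df, path, mode="append")     # add new part files to it
# write_csv(df, path, mode="ignore")     # do nothing if output exists
```

Checking the mode string up front surfaces a typo as a plain Python error before any Spark job is started.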
There are many extra options you can use to control the writing operation. To use them, pass the correct properties to option() or options() on the DataFrame's writer, such as:
df.write.options(delimiter=";", header=True).csv(path)
Some important options you may need to pay attention to:
- sep or delimiter: these settings determine the character(s) used by Spark to separate values in the data source. Their default value is a comma.
- encoding: the character encoding type you want to write into the CSV file. By default, PySpark uses UTF-8 – the most popular encoding defined in the Unicode Standard.
- quote: this parameter determines the character PySpark uses to quote values, for example values that contain the separator. You can only set it to one character, and the default is the double quote (").
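The options above can be combined in a single options() call. The sketch below is one possible combination, not the only way to set them; the function name write_semicolon_csv is hypothetical, and overwrite mode is chosen here only so the example can be rerun:

```python
def write_semicolon_csv(df, path):
    """Write df as semicolon-separated CSV with a header row."""
    (
        df.write
        # encoding and quote are set to their defaults here, for illustration
        .options(sep=";", header=True, encoding="UTF-8", quote='"')
        .mode("overwrite")  # assumption: replacing old output is acceptable
        .csv(path)
    )


# Usage, assuming `df` and `path` from the earlier examples:
# write_semicolon_csv(df, path)
```

The same settings also work with the other writing method, by replacing .csv(path) with .format('csv').save(path).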
Conclusion
The methods dataframe.write.csv() and dataframe.write.format('csv').save() both write a Spark DataFrame to CSV. Remember that they use the same data source options, which give you more control over your writing operation. Interested in reading CSV files instead of writing? Check out this guide.