PySpark SQL provides read.json("path") to read a JSON file into a DataFrame and write.json("path") to write a DataFrame back out as JSON. This blog will teach you how to read JSON files from a directory into a DataFrame and write the DataFrame back.
How to Read JSON file into DataFrame in PySpark
read.format("json").load("path") and read.json("path") are two methods for reading a JSON file into a PySpark DataFrame. By default, the JSON data source expects each line of the input file to be a separate, self-contained JSON record.
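As a quick illustration, the two calls below are equivalent. This is a minimal sketch: the file path and app name are assumptions made for the example.
# Minimal sketch; resources/zipcodes.json is an assumed single-line JSON file
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("ReadJSON").getOrCreate()
# Both calls produce the same DataFrame
df1 = spark.read.json("resources/zipcodes.json")
df2 = spark.read.format("json").load("resources/zipcodes.json")
df1.printSchema()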
Suppose you have a multiline JSON file with the following content:
[{
"RecordNumber": 2,
"Zipcode": 704,
"ZipCodeType": "STANDARD",
"City": "PASEO COSTA DEL SUR",
"State": "PR"
},
{
"RecordNumber": 10,
"Zipcode": 709,
"ZipCodeType": "STANDARD",
"City": "BDA SAN LUIS",
"State": "PR"
}]
Then, use read.option("multiline", "true") to read it:
# Read multiline json file
multiline_df = spark.read.option("multiline","true") \
.json("resources/multiline-zipcode.json")
multiline_df.show()
Read Multiple Files At Once
You can also use read.json() to read multiple files from different paths. Pass the fully qualified file paths as a comma-separated list.
# Read multiple files
df2 = spark.read.json(
['resources/zipcode1.json','resources/zipcode2.json'])
df2.show()
Read Files In The Same Directory
Pass the directory path (or a wildcard pattern) to read all the JSON files in it into a DataFrame:
# Read all JSON files from a folder
df3 = spark.read.json("resources/*.json")
df3.show()
How To Implement PySpark In Databricks
The nullValue option specifies a string in the JSON input that should be treated as null. For instance, if an invalid value such as "2022-14-07" appears in a date column, you can have the reader set that DataFrame field to null.
The dateFormat option sets the input format for DateType columns (the analogous timestampFormat option covers TimestampType). It supports java.text.SimpleDateFormat patterns.
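Here is a minimal sketch of passing these options. The null marker "n/a" and the date pattern are illustrative assumptions, and dateFormat typically only takes effect when you supply a schema that declares a DateType column.
# Illustrative only: option values and path are assumptions
df_opts = (spark.read
    .option("nullValue", "n/a")
    .option("dateFormat", "yyyy-MM-dd")
    .json("resources/zipcodes.json"))
df_opts.show()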
Now let's import the PySpark SQL packages into the environment to write and read DataFrame data in JSON format in Databricks.
# Importing packages
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, BooleanType, DoubleType
# Implementing JSON File in PySpark
spark = SparkSession.builder \
.master("local[1]") \
.appName("PySpark Read JSON") \
.getOrCreate()
# Reading JSON file into dataframe
dataframe = spark.read.json("/FileStore/tables/zipcodes.json")
dataframe.printSchema()
dataframe.show()
# Reading multiline json file
multiline_dataframe = spark.read.option("multiline","true") \
.json("/FileStore/tables/zipcodes.json")
multiline_dataframe.show()
# Writing PySpark dataframe into JSON File
dataframe.write.mode('overwrite').json("/tmp/spark_output/zipcodes.json")
In this example, reading zipcodes.json with spark.read.json("path") creates the dataframe value. The multiline_dataframe value reads records that span multiple lines. Because such records are spread across several lines, you must set the multiline option to true (it defaults to false) to read such files. Finally, dataframe.write.mode().json() writes the PySpark DataFrame to a JSON file.
How To Write A PySpark DataFrame To A JSON File
All you have to do is to run the following command to write a DataFrame:
df2.write.json("/tmp/spark_output/zipcodes.json")
There are several options for writing a JSON file, including dateFormat and nullValue. PySpark also offers several save modes: mode() accepts append, errorifexists, overwrite, and ignore. append adds the data to an existing path; errorifexists, the default, raises an error if the path already exists; overwrite replaces the existing data; and ignore silently skips the write when the path already exists.
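As a short sketch of the four modes (the output paths are assumed for illustration, and each call is meant to be run independently, since errorifexists would fail once the path exists):
# Each line shows one save mode; paths are illustrative
df2.write.mode("append").json("/tmp/spark_output/zipcodes.json")        # add to existing data
df2.write.mode("errorifexists").json("/tmp/spark_output/zipcodes.json") # default: fail if path exists
df2.write.mode("overwrite").json("/tmp/spark_output/zipcodes.json")     # replace existing data
df2.write.mode("ignore").json("/tmp/spark_output/zipcodes.json")        # skip write if path exists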
Conclusion
This tutorial covered various ways to read a JSON file into a PySpark DataFrame, with both multiline and single-line records. Hopefully, it is useful for you.