Most of us have needed to drop one or more columns from a DataFrame at least once.
If that is your situation, this article walks through how to do it. Read on to learn more!
How To Drop One or Multiple Columns From DataFrame In PySpark?
To remove one or several columns (fields) from a DataFrame or Dataset, use the drop() function provided by the PySpark DataFrame.
First, create a DataFrame in PySpark by running the following code:
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession; the app name here is arbitrary
spark = SparkSession.builder.appName("drop-columns-example").getOrCreate()

# Sample rows: first name, middle name, last name, id, location, salary
simpleData = [("John", "", "Simon", "36536", "NewYork", 3110),
    ("Mark", "Raph", "", "40188", "California", 4310),
    ("Ruth", "", "Williams", "42014", "Florida", 1410),
    ("Mai", "Alex", "Jones", "39912", "Florida", 5510),
    ("Jenny", "Michael", "Brown", "34461", "NewYork", 3010)
]
columns = ["firstname", "middlename", "lastname", "id", "location", "salary"]
df = spark.createDataFrame(data=simpleData, schema=columns)
df.printSchema()
This prints the following schema:
root
|-- firstname: string (nullable = true)
|-- middlename: string (nullable = true)
|-- lastname: string (nullable = true)
|-- id: string (nullable = true)
|-- location: string (nullable = true)
|-- salary: long (nullable = true)
1. Syntax – DataFrame drop() In PySpark
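PySpark’s drop() method takes a variable number of column arguments (*cols); self is simply the DataFrame it is called on. It returns a new DataFrame without the given columns and leaves the original unchanged. The sections below explain each usage with examples.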
drop(self, *cols)
2. DataFrame Drop Column
Let’s first look at how to remove a single column from a PySpark DataFrame. The three calls below are equivalent ways to do it.
The second call passes a Column object built with col(), so it requires importing col from pyspark.sql.functions.
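from pyspark.sql.functions import col

# 1. Drop by column name
df.drop("firstname").printSchema()

# 2. Drop by Column object built with col()
df.drop(col("firstname")).printSchema()

# 3. Drop by attribute reference on the DataFrame
df.drop(df.firstname).printSchema()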
In all three cases above, the “firstname” field is removed from the DataFrame. Use whichever form fits your code best.
Output:
root
|-- middlename: string (nullable = true)
|-- lastname: string (nullable = true)
|-- id: string (nullable = true)
|-- location: string (nullable = true)
|-- salary: long (nullable = true)
3. DataFrame Drop Multiple Columns
To drop several columns at once, pass each column name to drop() as a separate string argument, or unpack a tuple or list of names with the * operator. Either way, every listed column is removed from the DataFrame.
# Pass each column name as a separate argument
df.drop("firstname", "middlename", "lastname").printSchema()

# Or unpack a tuple of names with *
cols = ("firstname", "middlename", "lastname")
df.drop(*cols).printSchema()
Both techniques above delete several columns from the DataFrame at once and produce the same result.
Output:
root
|-- id: string (nullable = true)
|-- location: string (nullable = true)
|-- salary: long (nullable = true)
Wrapping It up
That covers the fundamental ways to drop one or multiple columns from a DataFrame in PySpark. Hopefully, this article has been helpful. See you next time!