There are many ways to add columns to a DataFrame in PySpark. In this guide, we will show you how to do so using the select(), withColumn(), and sql() functions.
We also cover other scenarios, such as deriving a column from an existing one and adding several columns at once. Examples of each method are included.
Example Of Adding New Columns To DataFrame
Let’s start with creating a DataFrame in PySpark.
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName('ITtutoria.net') \
.getOrCreate()
data = [('James','Smith','M',3000), ('Anna','Rose','F',4100),
('Robert','Williams','M',6200)
]
columns = ["firstname","lastname","gender","salary"]
df = spark.createDataFrame(data=data, schema = columns)
df.show()
Output
+---------+--------+------+------+
|firstname|lastname|gender|salary|
+---------+--------+------+------+
| James| Smith| M| 3000|
| Anna| Rose| F| 4100|
| Robert|Williams| M| 6200|
+---------+--------+------+------+
Add New Columns With Fixed Value
To add columns to a DataFrame in PySpark, import the lit() function with from pyspark.sql.functions import lit. This function accepts a literal value and returns a Column type. If you want to add a None/NULL value, use lit(None).
In this example, we first add 0.3 as a literal constant to the DataFrame, and then add None in a second step.
# Add new constant column
from pyspark.sql.functions import lit
df.withColumn("bonus_percent", lit(0.3)) \
.show()
+---------+--------+------+------+-------------+
|firstname|lastname|gender|salary|bonus_percent|
+---------+--------+------+------+-------------+
| James| Smith| M| 3000| 0.3|
| Anna| Rose| F| 4100| 0.3|
| Robert|Williams| M| 6200| 0.3|
+---------+--------+------+------+-------------+
# Add New column with NULL
df.withColumn("DEFAULT_COL", lit(None)) \
.show()
Add A Column Based On An Existing Column
The most popular method is to derive a new column from the value of an already-existing column.
#Add column from existing column
df.withColumn("bonus_amount", df.salary*0.3) \
.show()
+---------+--------+------+------+------------+
|firstname|lastname|gender|salary|bonus_amount|
+---------+--------+------+------+------------+
| James| Smith| M| 3000| 900.0|
| Anna| Rose| F| 4100| 1230.0|
| Robert|Williams| M| 6200| 1860.0|
+---------+--------+------+------+------------+
#Add column by concatenating existing columns
from pyspark.sql.functions import concat_ws
df.withColumn("name", concat_ws(",","firstname",'lastname')) \
.show()
+---------+--------+------+------+---------------+
|firstname|lastname|gender|salary| name|
+---------+--------+------+------+---------------+
| James| Smith| M| 3000| James,Smith|
| Anna| Rose| F| 4100| Anna,Rose|
| Robert|Williams| M| 6200|Robert,Williams|
+---------+--------+------+------+---------------+
Add A Column Based On Condition
When adding a literal value or constant based on a condition, combine when() and otherwise() with lit().
from pyspark.sql.functions import when
df.withColumn("grade", \
when((df.salary < 4000), lit("A")) \
.when((df.salary >= 4000) & (df.salary <= 5000), lit("B")) \
.otherwise(lit("C")) \
).show()
+---------+--------+------+------+-----+
|firstname|lastname|gender|salary|grade|
+---------+--------+------+------+-----+
| James| Smith| M| 3000| A|
| Anna| Rose| F| 4100| B|
| Robert|Williams| M| 6200| C|
+---------+--------+------+------+-----+
Add A Column When It Does Not Exist On The DataFrame
Want to add a column only when it does not already exist? First determine whether the requested column name is present in the PySpark DataFrame. With df.columns, you can obtain the list of DataFrame column names. Then create the column only when it is missing from df.columns.
if 'dummy' not in df.columns:
    df = df.withColumn("dummy", lit(None))
Add Multiple Columns Utilizing Map
There are various ways to add multiple columns to a DataFrame in PySpark. You can add a known column set with ease by utilizing select() or chaining withColumn().
However, after several modifications, you might occasionally need to derive more than one column at once. In this scenario, you can utilize functools.reduce() (the PySpark counterpart of Scala's foldLeft()) or an RDD map() function. Below is an example using the map() function.
# Let's assume the DataFrame has just 3 columns: c1, c2, c3
# Apply transformations on these columns and derive multiple columns,
# storing the derived values into c5, c6, c7, c8, c9, c10
df2 = df.rdd.map(lambda row: (row.c1, row.c2, c5, c6, c7, c8, c9, c10))
As you can see from the snippet above, the DataFrame only contains three columns, and we dynamically generate more columns from them using transformations. Inside the map() you can compute the new values with ordinary Python functions, a custom UDF, or the split() function.
Add A Column Utilizing Select()
Most of the presented examples utilize withColumn() to insert columns into the DataFrame. Still, you can use select() for all of the aforementioned examples.
# Add column using select
from pyspark.sql.functions import lit, current_date
df.select("firstname", "salary", lit(0.3).alias("bonus")).show()
df.select("firstname", "salary", (df.salary * 0.3).alias("bonus_amount")).show()
df.select("firstname", "salary", current_date().alias("today_date")).show()
Add A Column Utilizing SQL Expression
Don’t know how to use SQL expressions to create columns in PySpark? You can use the following example as a reference to better understand how it works.
#Add columns to DataFrame using SQL
df.createOrReplaceTempView("PER")
df2 = spark.sql("select firstname, salary, '0.3' as bonus from PER")
df3 = spark.sql("select firstname, salary, salary * 0.3 as bonus_amount from PER")
df4 = spark.sql("select firstname, salary, current_date() as today_date from PER")
df5 = spark.sql("select firstname, salary, " +
    "case when salary < 4000 then 'A' " +
    "else 'B' end as grade from PER")
Conclusion
Learning to add columns to a DataFrame in PySpark is not hard at all. As you have seen, there are many ways to perform this task, and the examples above should help you master each method. Once you know how to add columns, you may also want to learn how to drop columns from a DataFrame.