There are many ways to add columns to a DataFrame in PySpark. In this guide, we will show you how to do so using the select(), withColumn(), and sql() functions.
We also cover other scenarios, such as deriving a column from an existing one and adding several columns at once. Examples of each method are included.
Example Of Adding New Columns To DataFrame
Let’s start with creating a DataFrame in PySpark.
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName('ITtutoria.net') \
.getOrCreate()
data = [('James','Smith','M',3000), ('Anna','Rose','F',4100),
('Robert','Williams','M',6200)
]
columns = ["firstname","lastname","gender","salary"]
df = spark.createDataFrame(data=data, schema = columns)
df.show()
Output
+---------+--------+------+------+
|firstname|lastname|gender|salary|
+---------+--------+------+------+
| James| Smith| M| 3000|
| Anna| Rose| F| 4100|
| Robert|Williams| M| 6200|
+---------+--------+------+------+
Add New Columns With Fixed Value
To add columns to a DataFrame in PySpark, import the lit() function with from pyspark.sql.functions import lit. This function accepts a literal value and returns a Column type. If you want to add a None/NULL value, use lit(None).
In this example, we first add 0.3 as a literal constant to the DataFrame, and then add None in a second step.
# Add new constant column
from pyspark.sql.functions import lit
df.withColumn("bonus_percent", lit(0.3)) \
.show()
+---------+--------+------+------+-------------+
|firstname|lastname|gender|salary|bonus_percent|
+---------+--------+------+------+-------------+
| James| Smith| M| 3000| 0.3|
| Anna| Rose| F| 4100| 0.3|
| Robert|Williams| M| 6200| 0.3|
+---------+--------+------+------+-------------+
# Add New column with NULL
df.withColumn("DEFAULT_COL", lit(None)) \
.show()
Add A Column Based On An Existing Column
The most popular method is to derive a new column from the value of an already-existing column.
#Add column from existing column
df.withColumn("bonus_amount", df.salary*0.3) \
.show()
+---------+--------+------+------+------------+
|firstname|lastname|gender|salary|bonus_amount|
+---------+--------+------+------+------------+
| James| Smith| M| 3000| 900.0|
| Anna| Rose| F| 4100| 1230.0|
| Robert|Williams| M| 6200| 1860.0|
+---------+--------+------+------+------------+
#Add column by concatenating existing columns
from pyspark.sql.functions import concat_ws
df.withColumn("name", concat_ws(",","firstname",'lastname')) \
.show()
+---------+--------+------+------+---------------+
|firstname|lastname|gender|salary| name|
+---------+--------+------+------+---------------+
| James| Smith| M| 3000| James,Smith|
| Anna| Rose| F| 4100| Anna,Rose|
| Robert|Williams| M| 6200|Robert,Williams|
+---------+--------+------+------+---------------+
Add A Column Based On Condition
When adding a literal value or constant based on a condition, combine when() and otherwise() with lit().
from pyspark.sql.functions import when
df.withColumn("grade", \
when((df.salary < 4000), lit("A")) \
.when((df.salary >= 4000) & (df.salary <= 5000), lit("B")) \
.otherwise(lit("C")) \
).show()
+---------+--------+------+------+-----+
|firstname|lastname|gender|salary|grade|
+---------+--------+------+------+-----+
| James| Smith| M| 3000| A|
| Anna| Rose| F| 4100| B|
| Robert|Williams| M| 6200| C|
+---------+--------+------+------+-----+
Add A Column When It Does Not Exist On The DataFrame
Want to add a column only when it does not already exist? First determine whether the requested column name is present in the PySpark DataFrame. With df.columns, you can obtain the list of DataFrame column names. Then create the column only when it is missing from df.columns.
if 'dummy' not in df.columns:
    df = df.withColumn("dummy", lit(None))
Add Multiple Columns Utilizing Map
There are various ways to add multiple columns to a DataFrame in PySpark. You can add a known column set with ease by utilizing select() or chaining withColumn().
However, after several modifications, you might occasionally need to derive more than one column at once. In this scenario, you can utilize functools.reduce() (the PySpark counterpart of Scala's foldLeft()) or an RDD map() function. Below is an example using the map() function.
# Let's assume the DataFrame has just 3 columns: c1, c2, c3
# Apply transformations on these columns and derive multiple columns,
# storing the derived values into c5, c6, c7, c8, c9, c10
df2 = df.rdd.map(lambda row: (row.c1, row.c2, c5, c6, c7, c8, c9, c10))
As you can see from the snippet above, the DataFrame only contains three columns, and we dynamically generate more columns from them using transformations. Inside the map() you can compute the new values with ordinary Python functions, a custom UDF, or the split() function.
Add A Column Utilizing Select()
Most of the presented examples utilize withColumn() to insert columns into the DataFrame. Still, you can use select() for all of the aforementioned examples.
# Add column using select
from pyspark.sql.functions import lit, current_date
df.select("firstname", "salary", lit(0.3).alias("bonus")).show()
df.select("firstname", "salary", (df.salary * 0.3).alias("bonus_amount")).show()
df.select("firstname", "salary", current_date().alias("today_date")).show()
Add A Column Utilizing SQL Expression
Don’t know how to use SQL expressions to create columns in PySpark? You can use the following example as a reference to better understand how it works.
#Add columns to DataFrame using SQL
df.createOrReplaceTempView("PER")
df2 = spark.sql("select firstname, salary, '0.3' as bonus from PER")
df3 = spark.sql("select firstname, salary, salary * 0.3 as bonus_amount from PER")
df4 = spark.sql("select firstname, salary, current_date() as today_date from PER")
df5 = spark.sql("select firstname, salary, " +
    "case when salary < 4000 then 'A' " +
    "else 'B' end as grade from PER")
Conclusion
Learning to add columns to a DataFrame in PySpark is not hard at all. As you have seen, there are many ways to perform this task, and the examples above should help you master each method. Once you know how to add columns, you may also want to learn how to drop columns from a DataFrame.