This tutorial will show you how to use PySpark's withColumnRenamed() to rename a column on a DataFrame. It is a common task, but one that is easy to get wrong without a good understanding of the method.
Use PySpark withColumnRenamed to Rename Column On DataFrame
The DataFrame.withColumnRenamed() method renames a column of an existing DataFrame and returns a new DataFrame. Its syntax is quite self-explanatory:
DataFrame.withColumnRenamed(existing_name, new_name)
Both arguments should be string values.
To demonstrate the capabilities of withColumnRenamed(), we are going to use a DataFrame created from a CSV file. It shows several home sales and their information, such as selling prices, listing prices, numbers of rooms, and taxes.
We must import this CSV file into a DataFrame first:
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.appName(
... 'ITTutoria: PySpark withColumnRenamed').getOrCreate()
>>> df = spark.read.csv("homes.csv", header=True)
>>> df.show()
+----+----+------+-----+----+-----+---+-----+-----+
|Sell|List|Living|Rooms|Beds|Baths|Age|Acres|Taxes|
+----+----+------+-----+----+-----+---+-----+-----+
| 142| 160| 28| 10| 5| 3| 60| 0.28| 3167|
| 175| 180| 18| 8| 4| 1| 12| 0.43| 4033|
| 129| 132| 13| 6| 3| 1| 41| 0.33| 1471|
| 138| 140| 17| 7| 3| 1| 22| 0.46| 3204|
| 232| 240| 25| 8| 4| 3| 5| 2.05| 3613|
+----+----+------+-----+----+-----+---+-----+-----+
Since the SparkSession class serves as the entry point into all Spark functionality, we must create it first with SparkSession.builder. Once you have a basic SparkSession like the one above, you can create DataFrames for your applications from Spark data sources, Hive tables, or existing RDDs.
The read.csv() method creates a DataFrame for that SparkSession from our CSV file's content. Remember that a DataFrame in Spark is simply a collection of rows, and this holds across all of its APIs, including pyspark.
Note: check your environment variables if you can’t import pyspark.
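For reference, a typical fix is to point your shell at the Spark installation before launching Python. The paths and the py4j version below are assumptions; adjust them to match your own install:

```shell
# Assumed install location -- substitute your actual Spark directory
export SPARK_HOME=/opt/spark
# Make the bundled pyspark and py4j packages importable (py4j version varies by release)
export PYTHONPATH="$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.9-src.zip:$PYTHONPATH"
```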
When importing the CSV file, remember to include the header=True option. Otherwise, you will end up with meaningless headers like _c0, _c1, etc.
>>> df = spark.read.csv("homes.csv")
>>> df.show()
+----+----+------+-----+----+-----+---+-----+-----+
| _c0| _c1| _c2| _c3| _c4| _c5|_c6| _c7| _c8|
+----+----+------+-----+----+-----+---+-----+-----+
|Sell|List|Living|Rooms|Beds|Baths|Age|Acres|Taxes|
| 142| 160| 28| 10| 5| 3| 60| 0.28| 3167|
| 175| 180| 18| 8| 4| 1| 12| 0.43| 4033|
This DataFrame has the wrong column labels: the real header row ends up as the first row of data, and the columns receive placeholder names like _c0 and _c1 instead. This is a crucial step, so make sure you get it right.
Now let’s say you want to rename the “Sell” column to “Selling”. Use those names as the arguments for withColumnRenamed():
>>> df = df.withColumnRenamed("Sell", "Selling")
>>> df.show()
+-------+----+------+-----+----+-----+---+-----+-----+
|Selling|List|Living|Rooms|Beds|Baths|Age|Acres|Taxes|
+-------+----+------+-----+----+-----+---+-----+-----+
| 142| 160| 28| 10| 5| 3| 60| 0.28| 3167|
| 175| 180| 18| 8| 4| 1| 12| 0.43| 4033|
It is important to stress that this method only returns a new DataFrame and doesn’t change the existing one. If you don’t assign the returned object to anything, the original DataFrame remains untouched.
>>> df.withColumnRenamed("Sell", "Selling")
DataFrame[Selling: string, List: string, Living: string, Rooms: string, Beds: string, Baths: string, Age: string, Acres: string, Taxes: string]
>>> df.show()
+----+----+------+-----+----+-----+---+-----+-----+
|Sell|List|Living|Rooms|Beds|Baths|Age|Acres|Taxes|
+----+----+------+-----+----+-----+---+-----+-----+
| 142| 160| 28| 10| 5| 3| 60| 0.28| 3167|
| 175| 180| 18| 8| 4| 1| 12| 0.43| 4033|
If you want to rename multiple columns with withColumnRenamed(), just chain it as many times as you want with the dot operator.
>>> df = df.withColumnRenamed("Sell", "Selling").withColumnRenamed("List", "Asking")
>>> df.show()
+-------+------+------+-----+----+-----+---+-----+-----+
|Selling|Asking|Living|Rooms|Beds|Baths|Age|Acres|Taxes|
+-------+------+------+-----+----+-----+---+-----+-----+
| 142| 160| 28| 10| 5| 3| 60| 0.28| 3167|
| 175| 180| 18| 8| 4| 1| 12| 0.43| 4033|
Conclusion
While withColumnRenamed() is the most straightforward way to rename a column on a DataFrame, it is worth knowing the alternatives too. Methods like col() combined with select() offer a more powerful and convenient option when you need to rename several columns at once.