In this post, we'll go through how to drop duplicate columns from a pandas DataFrame.
pandas provides explicit APIs for identifying duplicate rows, but there is no single direct API for finding duplicate columns, so we have to combine a few existing methods to get the job done.
DataFrame Example
To begin, let's make a DataFrame containing duplicate columns.
The columns in our DataFrame will be named Courses, Fee, Duration, Subject, Fee, and Discount; note that Fee appears twice.
# Create a pandas DataFrame from a list of lists
import pandas as pd

technologies = [["Spark", 20000, "30days", "Spark", 20000, 1000],
                ["Pyspark", 23000, "35days", "Pyspark", 23000, 1500],
                ["Pandas", 25000, "40days", "Pandas", 25000, 2000],
                ["Spark", 20000, "30days", "Spark", 20000, 1000]]
columns = ["Courses", "Fee", "Duration", "Subject", "Fee", "Discount"]
df = pd.DataFrame(technologies, columns=columns)
print(df)
Running this code produces the following output:
Courses Fee Duration Subject Fee Discount
0 Spark 20000 30days Spark 20000 1000
1 Pyspark 23000 35days Pyspark 23000 1500
2 Pandas 25000 40days Pandas 25000 2000
3 Spark 20000 30days Spark 20000 1000
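Before dropping anything, it can help to see which column labels actually repeat. A minimal sketch, using the same example data as above:

```python
import pandas as pd

technologies = [["Spark", 20000, "30days", "Spark", 20000, 1000]]
columns = ["Courses", "Fee", "Duration", "Subject", "Fee", "Discount"]
df = pd.DataFrame(technologies, columns=columns)

# Index.duplicated() flags the second and later occurrences of a label,
# so selecting with it lists every repeated column name
dupes = df.columns[df.columns.duplicated()].tolist()
print(dupes)  # ['Fee']
```

Note that this only inspects the labels; Subject is not reported even though its data matches Courses.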
How to Drop Duplicate Columns in pandas DataFrame
#1. Employing DataFrame.loc[] Function
This method drops duplicate columns by matching column names only: any column whose name repeats an earlier one is removed, while the data itself is never compared.
# Remove duplicate columns pandas DataFrame
df2 = df.loc[:,~df.columns.duplicated()]
print(df2)
Output:
Courses Fee Duration Subject Discount
0 Spark 20000 30days Spark 1000
1 Pyspark 23000 35days Pyspark 1500
2 Pandas 25000 40days Pandas 2000
3 Spark 20000 30days Spark 1000
Bear in mind that the Subject and Courses columns are still present even though they hold identical data, because their names differ.
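Index.duplicated() also accepts a keep parameter. Assuming you would rather keep the last occurrence of a repeated name instead of the first, a sketch:

```python
import pandas as pd

technologies = [["Spark", 20000, "30days", "Spark", 20000, 1000],
                ["Pyspark", 23000, "35days", "Pyspark", 23000, 1500]]
columns = ["Courses", "Fee", "Duration", "Subject", "Fee", "Discount"]
df = pd.DataFrame(technologies, columns=columns)

# keep='last' flags earlier occurrences as duplicates instead,
# so the second "Fee" survives and the first is dropped
df2 = df.loc[:, ~df.columns.duplicated(keep='last')]
print(list(df2.columns))  # ['Courses', 'Duration', 'Subject', 'Fee', 'Discount']
```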
#2. Using DataFrame.drop_duplicates()
Another way to remove duplicate columns is df.T.drop_duplicates().T: transpose the DataFrame, drop duplicate rows, and transpose back. This eliminates duplicate columns based on their data, independent of column name.
Running the code:
# Drop duplicate columns
df2 = df.T.drop_duplicates().T
print(df2)
Output:
Courses Fee Duration Discount
0 Spark 20000 30days 1000
1 Pyspark 23000 35days 1500
2 Pandas 25000 40days 2000
3 Spark 20000 30days 1000
This is likely the simplest option. Note that, unlike the column-name approach above, it also removes columns that hold the same data under a different name, which is why Subject is gone from the output.
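One caveat: transposing a DataFrame with mixed dtypes upcasts every column to object, and the dtypes do not come back on their own after the second transpose. A sketch of the problem and one way to recover, using infer_objects():

```python
import pandas as pd

technologies = [["Spark", 20000, "30days", "Spark", 20000, 1000],
                ["Pyspark", 23000, "35days", "Pyspark", 23000, 1500]]
columns = ["Courses", "Fee", "Duration", "Subject", "Fee", "Discount"]
df = pd.DataFrame(technologies, columns=columns)

# The double transpose leaves every surviving column with object dtype
df2 = df.T.drop_duplicates().T
print(df2.dtypes)  # all object

# infer_objects() soft-converts object columns back to better dtypes
df2 = df2.infer_objects()
print(df2["Fee"].dtype)  # int64
```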
#3. Drop Duplicates and Keep the First Columns Using DataFrame.loc
To remove columns that hold identical values, call DataFrame.duplicated() on the transposed DataFrame with no parameters; it uses the defaults keep='first' and subset=None.
After deleting the duplicate columns from our DataFrame, the example below gives back four columns.
Running the code:
# Remove repeated columns in a DataFrame
df2 = df.loc[:,~df.T.duplicated(keep='first')]
print(df2)
Output:
Courses Fee Duration Discount
0 Spark 20000 30days 1000
1 Pyspark 23000 35days 1500
2 Pandas 25000 40days 2000
3 Spark 20000 30days 1000
Pass the keep parameter with the value 'last' if you want to keep the most recent occurrence of each duplicated column instead. Let's look at the example below:
# keep last duplicate columns
df2 = df.loc[:,~df.T.duplicated(keep='last')]
print(df2)
Output:
Duration Subject Fee Discount
0 30days Spark 20000 1000
1 35days Pyspark 23000 1500
2 40days Pandas 25000 2000
3 30days Spark 20000 1000
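DataFrame.duplicated() also accepts keep=False, which marks every occurrence of a duplicate rather than sparing one. Assuming you want to drop all copies of any duplicated column, a sketch:

```python
import pandas as pd

technologies = [["Spark", 20000, "30days", "Spark", 20000, 1000],
                ["Pyspark", 23000, "35days", "Pyspark", 23000, 1500]]
columns = ["Courses", "Fee", "Duration", "Subject", "Fee", "Discount"]
df = pd.DataFrame(technologies, columns=columns)

# keep=False marks every occurrence of a duplicated column,
# so Courses, Subject, and both Fee columns are all removed
df2 = df.loc[:, ~df.T.duplicated(keep=False)]
print(list(df2.columns))  # ['Duration', 'Discount']
```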
Conclusion
That covers the main ways to drop duplicate columns from a pandas DataFrame. Hopefully, this post has been of great help to you.