There are several functions that help remove duplicate rows. Still, is there any way to use distinct to drop duplicate rows in PySpark?
Use Distinct To Drop Duplicate Rows In PySpark
The distinct() function in PySpark removes duplicate rows from a DataFrame, comparing all columns. Note that distinct() triggers a shuffle, so the number of partitions in the resulting DataFrame may differ from the original.
With the dropDuplicates() function, rows can be removed based on one or more chosen columns. This function removes duplicate records while still keeping the first occurrence of each record in the DataFrame.
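At a glance, the calls look like this (a quick sketch, using df for any DataFrame):
df.distinct()                       # compares all columns
df.dropDuplicates()                 # same behavior as distinct()
df.dropDuplicates(["department"])   # compares only the listed columns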
Don’t know how to use these two functions? Do not worry! This guide will show you how, with detailed examples. Follow along, and you will soon master these useful functions in PySpark.
Before we begin, let’s first create a DataFrame that contains duplicate rows as well as duplicate values in some columns. It will serve as the running example for the rest of the guide.
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName('ITtutoria.net').getOrCreate()

# Sample data: ("James", "Sales", 3000) appears twice (a fully duplicate row),
# and several other rows share the same department and salary
data = [
    ("James", "Sales", 3000),
    ("Michael", "Sales", 4600),
    ("Robert", "Sales", 4100),
    ("Maria", "Finance", 3000),
    ("James", "Sales", 3000),
    ("Scott", "Finance", 3300),
    ("Jen", "Finance", 3900),
    ("Jeff", "Marketing", 3000),
    ("Kumar", "Marketing", 2000),
    ("Saif", "Sales", 4100),
]
columns = ["employee_name", "department", "salary"]

df = spark.createDataFrame(data=data, schema=columns)
df.printSchema()
df.show(truncate=False)
Output:
+-------------+----------+------+
|employee_name|department|salary|
+-------------+----------+------+
|James |Sales |3000 |
|Michael |Sales |4600 |
|Robert |Sales |4100 |
|Maria |Finance |3000 |
|James |Sales |3000 |
|Scott |Finance |3300 |
|Jen |Finance |3900 |
|Jeff |Marketing |3000 |
|Kumar |Marketing |2000 |
|Saif |Sales |4100 |
+-------------+----------+------+
In the example table above, you can see that James’s employee record appears in two identical rows, duplicated across all columns. In addition, four rows contain duplicate values in the department and salary fields.
Get Unique Rows (Using All Columns To Compare)
Here we use the DataFrame from the example above. It contains ten rows, two of which are exact duplicates of each other. After eliminating one duplicate row, calling distinct() on the DataFrame should return nine rows.
# distinct() compares all columns and returns a new DataFrame
distinctDF = df.distinct()
print("Distinct count: "+str(distinctDF.count()))
distinctDF.show(truncate=False)
The distinct() function returns a new DataFrame with the duplicate record deleted. You can see the following result.
Output:
Distinct count: 9
+-------------+----------+------+
|employee_name|department|salary|
+-------------+----------+------+
|James |Sales |3000 |
|Michael |Sales |4600 |
|Maria |Finance |3000 |
|Robert |Sales |4100 |
|Saif |Sales |4100 |
|Scott |Finance |3300 |
|Jeff |Marketing |3000 |
|Jen |Finance |3900 |
|Kumar |Marketing |2000 |
+-------------+----------+------+
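By the way, you can confirm the shuffle mentioned earlier by comparing partition counts. A minimal sketch (the exact numbers depend on your cluster and on spark.sql.shuffle.partitions, which defaults to 200):
# Partition count before and after the shuffle caused by distinct()
print("Before: " + str(df.rdd.getNumPartitions()))
print("After : " + str(distinctDF.rdd.getNumPartitions()))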
As an alternative, you can use the dropDuplicates() function without any arguments. It behaves exactly like distinct(), returning a new DataFrame once the duplicate entries are deleted.
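For example, the following lines (also part of the complete listing at the end) return the same nine rows:
# dropDuplicates() without arguments compares all columns
df2 = df.dropDuplicates()
print("Distinct count: " + str(df2.count()))
df2.show(truncate=False)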
Distinct Of Chosen Columns In PySpark
There is no separate method in PySpark that accepts columns for distinct processing (that is, dropping redundant rows based on chosen columns only). However, PySpark offers a variant of the dropDuplicates() function that takes a list of columns and removes duplicates based on them.
As before, dropDuplicates() returns a new DataFrame with the duplicate rows deleted. Here is an example.
# Keep one row per unique (department, salary) pair
dropDisDF = df.dropDuplicates(["department","salary"])
print("Distinct count of department & salary : "+str(dropDisDF.count()))
dropDisDF.show(truncate=False)
Output:
Distinct count of department & salary : 8
+-------------+----------+------+
|employee_name|department|salary|
+-------------+----------+------+
|Jen |Finance |3900 |
|Maria |Finance |3000 |
|Scott |Finance |3300 |
|Michael |Sales |4600 |
|Kumar |Marketing |2000 |
|Robert |Sales |4100 |
|James |Sales |3000 |
|Jeff |Marketing |3000 |
+-------------+----------+------+
Look at the result, and you will see that two duplicate records were removed, leaving eight rows.
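One caveat worth knowing: when deduplicating on a subset of columns, which duplicate row survives is not guaranteed across partitions. If you need control over which row is kept, a common pattern (not covered by the original example) is a window function. Here is a sketch that keeps the alphabetically first employee_name within each (department, salary) pair, using a temporary helper column rn:
from pyspark.sql import Window
from pyspark.sql.functions import row_number, col

# Rank rows within each (department, salary) group and keep the first
w = Window.partitionBy("department", "salary").orderBy(col("employee_name"))
dedupDF = (df.withColumn("rn", row_number().over(w))  # rn is a temporary helper column
             .filter(col("rn") == 1)
             .drop("rn"))
dedupDF.show(truncate=False)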
Code For Getting Distinct Rows
Here is the complete source code for obtaining distinct rows.
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName('ITtutoria.net').getOrCreate()

data = [
    ("James", "Sales", 3000),
    ("Michael", "Sales", 4600),
    ("Robert", "Sales", 4100),
    ("Maria", "Finance", 3000),
    ("James", "Sales", 3000),
    ("Scott", "Finance", 3300),
    ("Jen", "Finance", 3900),
    ("Jeff", "Marketing", 3000),
    ("Kumar", "Marketing", 2000),
    ("Saif", "Sales", 4100),
]
columns = ["employee_name", "department", "salary"]

df = spark.createDataFrame(data=data, schema=columns)
df.printSchema()
df.show(truncate=False)

# Distinct: compare all columns
distinctDF = df.distinct()
print("Distinct count: "+str(distinctDF.count()))
distinctDF.show(truncate=False)

# Drop duplicates: equivalent to distinct()
df2 = df.dropDuplicates()
print("Distinct count: "+str(df2.count()))
df2.show(truncate=False)

# Drop duplicates on selected columns
dropDisDF = df.dropDuplicates(["department","salary"])
print("Distinct count of department & salary : "+str(dropDisDF.count()))
dropDisDF.show(truncate=False)
The Bottom Line
So there you go. This detailed guide has explained how to use distinct() to drop duplicate rows in PySpark. You also discovered how to drop duplicates on chosen columns using the dropDuplicates() function. With our tutorial and thorough examples, you can now start putting these functions to use. Once you have become familiar with removing duplicate rows, do not stop your learning process here. Continue by learning how to drop one or more columns from a DataFrame using this tutorial.
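As a quick preview, dropping columns relies on the drop() method, which accepts one or more column names and, like the functions above, returns a new DataFrame:
# Drop a single column, or several at once
df.drop("salary").show(truncate=False)
df.drop("department", "salary").show(truncate=False)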