Is it possible to identify and count unique values in a list of PySpark columns? Yes, it is. PySpark provides several ways to get a distinct count from a DataFrame. Check out this article for more pointers.
How to Get PySpark Count Distinct from DataFrame
Method 1. Use Distinct().Count()
Remember that count and distinct are two different functions, both of which operate on DataFrames. distinct() eliminates duplicate records by comparing all the columns of each DataFrame row.
count(), on the other hand, returns the number of records in a DataFrame. By chaining the two functions together, we can get the distinct count of a PySpark DataFrame.
Let’s have a look at Example 1:
In this example, we create a DataFrame df containing student info such as Names, Courses, and Marks, including some duplicate rows. Then we apply distinct().count() to count all the distinct rows present in this DataFrame df.
Before we start, let’s build and inspect the DataFrame df:
# importing module
import pyspark
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
# creating sparksession and giving app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
# giving rows value for dataframe
data = [("Rem", "ME", 81),
("Reya", "MCA", 86),
("Jeya", "BA", 61),
("Marea", "Bsc", 66),
("Shriya", "Bsc", 92),
("Rem", "ME", 81),
("Johnny", "MCA", 86),
("Shyim", "BE", 71),
("Kumir", "BTech", 79),
("Marea", "Bsc", 66)]
# giving column names of dataframe
columns = ["Names", "Courses", "Marks"]
# creating a dataframe df
df = spark.createDataFrame(data, columns)
# show df
df.show()
# counting the total number of values in df
print("Total records in df:", df.count())
Output:
+------+-------+-----+
| Names|Courses|Marks|
+------+-------+-----+
|   Rem|     ME|   81|
|  Reya|    MCA|   86|
|  Jeya|     BA|   61|
| Marea|    Bsc|   66|
|Shriya|    Bsc|   92|
|   Rem|     ME|   81|
|Johnny|    MCA|   86|
| Shyim|     BE|   71|
| Kumir|  BTech|   79|
| Marea|    Bsc|   66|
+------+-------+-----+

Total records in df: 10
Here is our DataFrame df, containing 10 records in total. Now, let’s apply distinct().count()
to identify all the value counts available in this DataFrame df.
# applying distinct().count() on df
print('Distinct count in DataFrame df is :', df.distinct().count())
Output:
Distinct count in DataFrame df is : 8
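As a quick sanity check, the same distinct count can be reproduced in plain Python by putting the rows into a set, since distinct() compares entire rows. This is only an illustration of the logic, not a substitute for Spark on large data:

```python
# Rows from the example DataFrame above, including the two duplicate rows
data = [("Rem", "ME", 81),
        ("Reya", "MCA", 86),
        ("Jeya", "BA", 61),
        ("Marea", "Bsc", 66),
        ("Shriya", "Bsc", 92),
        ("Rem", "ME", 81),
        ("Johnny", "MCA", 86),
        ("Shyim", "BE", 71),
        ("Kumir", "BTech", 79),
        ("Marea", "Bsc", 66)]

# A set keeps one copy of each full row, mirroring df.distinct().count()
print("Distinct count:", len(set(data)))  # Distinct count: 8
```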
Method 2. Use CountDistinct()
countDistinct() returns the number of distinct elements in a group made up of the selected columns. In essence, it is a SQL aggregate function that offers distinct value counts over the selected columns.
In the second example of this guideline, we created one DataFrame df containing employee info such as Names, Departments, and Salaries. This DataFrame also includes duplicate values. Then, the next step is to apply countDistinct()
and locate all distinct values present in this DataFrame df.
First, have a look at our DataFrame df:
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
# creating sparksession and giving app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
# giving rows value for dataframe
data = [("Rem", "Sales", 79000),
("Shiv", "IT", 69000),
("Jaya", "IT", 59000),
("Marie", "Management", 64000),
("Ramish", "Sales", 79000),
("Johnny", "Account", 79000),
("Shiv", "IT", 69000),
("Kumur", "IT", 77000),
("Marie", "Management", 64000)]
# giving column names of dataframe
columns = ["Name", "Department", "Salary"]
# creating a dataframe df
df = spark.createDataFrame(data, columns)
# show df
df.show()
# counting the total number of values in df
print("Total records in df:", df.count())
Output:
+------+----------+------+
|  Name|Department|Salary|
+------+----------+------+
|   Rem|     Sales| 79000|
|  Shiv|        IT| 69000|
|  Jaya|        IT| 59000|
| Marie|Management| 64000|
|Ramish|     Sales| 79000|
|Johnny|   Account| 79000|
|  Shiv|        IT| 69000|
| Kumur|        IT| 77000|
| Marie|Management| 64000|
+------+----------+------+

Total records in df: 9
That is our DataFrame df, containing 9 records in total. The next step is to apply countDistinct() and identify all distinct values present in this DataFrame df. Let’s import the function from the module pyspark.sql.functions and apply it.
# importing countDistinct from pyspark.sql.functions
from pyspark.sql.functions import countDistinct
# applying the function countDistinct() on df using select()
df2 = df.select(countDistinct("Name", "Department", "Salary"))
# show df2
df2.show()
Output:
+----------------------------------------+
|count(DISTINCT Name, Department, Salary)|
+----------------------------------------+
| 7|
+----------------------------------------+
We have 7 distinct values present in this DataFrame df. Because countDistinct() is a SQL aggregate function, it returns its result as a column in a new DataFrame, as presented in the output above.
To wrap up our guide, let’s inspect distinct value counts for one specific column.
Let’s count distinct values in the column “Department”:
# importing countDistinct from pyspark.sql.functions
from pyspark.sql.functions import countDistinct
# applying the function countDistinct() on df using select()
df3 = df.select(countDistinct("Department"))
# show df3
df3.show()
Output:
+--------------------------+
|count(DISTINCT Department)|
+--------------------------+
|                         4|
+--------------------------+
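The single-column result can be sanity-checked in plain Python as well: project out just the Department field, then deduplicate. This mirrors what countDistinct("Department") computes, as an illustration only:

```python
# Rows from the example DataFrame above
data = [("Rem", "Sales", 79000),
        ("Shiv", "IT", 69000),
        ("Jaya", "IT", 59000),
        ("Marie", "Management", 64000),
        ("Ramish", "Sales", 79000),
        ("Johnny", "Account", 79000),
        ("Shiv", "IT", 69000),
        ("Kumur", "IT", 77000),
        ("Marie", "Management", 64000)]

# Keep only the Department field, then deduplicate with a set
departments = {department for _, department, _ in data}
print("Distinct departments:", len(departments))  # Distinct departments: 4
```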
FAQs
1. What is PySpark Count?
First, let’s explore what PySpark count means. count() is a PySpark function that returns the number of elements in a DataFrame (or RDD). In other words, it tells you how many rows the DataFrame contains.
Due to its convenience, count() is adopted in many analyses to establish exactly how many records are involved.
Note that count() is a PySpark action: calling it triggers the computation and returns the resulting number to the PySpark driver, whether you run it before or after your data analysis.
2. Can I Count Several PySpark Duplicate Rows?
Yes, you can count duplicate rows in PySpark.
To return only the duplicate rows, group the DataFrame by all of its columns using groupBy(), apply count(), and then filter for the groups whose count is greater than 1.
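As a minimal sketch of that approach, the plain-Python equivalent of grouping by every column and keeping the groups with a count above 1 can be written with collections.Counter (the PySpark version would be along the lines of df.groupBy(df.columns).count().filter("count > 1")):

```python
from collections import Counter

# A few sample rows, with one duplicate pair
rows = [("Rem", "ME", 81),
        ("Reya", "MCA", 86),
        ("Rem", "ME", 81)]

# Tally each full row, as groupBy() over all columns followed by count() would
tally = Counter(rows)

# Keep only rows seen more than once -- the filter("count > 1") step
duplicates = {row: n for row, n in tally.items() if n > 1}
print(duplicates)  # {('Rem', 'ME', 81): 2}
```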
3. Can I Remove All Duplicates from PySpark?
Yes, of course you can! The distinct() PySpark function removes duplicate rows from your DataFrame. Meanwhile, the dropDuplicates() command drops duplicate rows based on one or several selected columns.
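As an illustration of the dropDuplicates() idea, here is a plain-Python sketch that keeps one row per value of a chosen column, roughly what df.dropDuplicates(["Name"]) does. Note that Spark keeps an arbitrary row per key, while this sketch keeps the first one seen:

```python
# Sample rows where "Shiv" appears twice with different departments
rows = [("Shiv", "IT", 69000),
        ("Shiv", "Sales", 50000),
        ("Jaya", "IT", 59000)]

# Keep the first row seen for each Name (column 0)
seen = {}
for row in rows:
    name = row[0]
    if name not in seen:  # first occurrence wins
        seen[name] = row

print(list(seen.values()))  # [('Shiv', 'IT', 69000), ('Jaya', 'IT', 59000)]
```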
4. Can I Count Non-Null PySpark Values?
Again, nothing is impossible! To count the non-null values in a DataFrame column, filter the column with isNotNull() and then apply count(). Note that isNotNull() checks for SQL NULL values, which is not the same as checking for NaN values.
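A plain-Python sketch of the same idea, treating None as SQL NULL. In PySpark this would correspond to something like df.filter(df.Salary.isNotNull()).count(), assuming a Salary column:

```python
# A column of values where None plays the role of SQL NULL
salaries = [79000, None, 59000, 64000, None]

# Count the values that are not null, as isNotNull() plus count() would
non_null = sum(1 for s in salaries if s is not None)
print("Non-null values:", non_null)  # Non-null values: 3
```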
Conclusion
Our article has covered several methods to get a PySpark count distinct from a DataFrame, with detailed examples. Our ITtutoria support team hopes that these guidelines help you solve the problem. For other issues revolving around PySpark DataFrames (such as how to drop columns), you may keep browsing our website.