DataFrames are something every data scientist must work with, as they are the backbone of the profession. However, this data type can also be one of the most challenging to handle.
That is why we prepared this guide on Spark SQL Aggregate Functions. They can simplify the process significantly.
What Are Spark SQL Aggregate Functions?
In Spark SQL, aggregate functions are grouped under the name agg_funcs. There are 28 functions in this collection.
Each of them has specific situations where it shines. What you need to do is learn their syntax and know when each function is appropriate.
Counting And Checking Functions
The first pair of functions share a name: approx_count_distinct(e: Column) & approx_count_distinct(e: Column, rsd: Double). Both return an approximate count of the distinct items in a group. The difference is that the second variant takes an extra parameter, rsd, which sets the maximum relative standard deviation allowed for the estimate.
Then there are collect_list(e: Column) and collect_set(e: Column). The former returns a list of all values in the input column, duplicates included. The latter returns only the distinct values, eliminating any duplicates.
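The difference between the two is easy to see in plain Python. This is an illustrative sketch of the semantics, not Spark code (in Spark, both functions run over a grouped Column):

```python
# Sketch: collect_list keeps duplicates, collect_set keeps each value once.
values = ["a", "b", "a", "c", "b"]

collect_list_result = list(values)        # like collect_list: all values, in order
collect_set_result = sorted(set(values))  # like collect_set: distinct values only

print(collect_list_result)  # ['a', 'b', 'a', 'c', 'b']
print(collect_set_result)   # ['a', 'b', 'c']
```

Note that Spark's collect_set makes no ordering guarantee; the sort above is only for a deterministic printout.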
count(e: Column) may have the simplest job in this list: it counts how many elements there are and returns that value. In comparison, countDistinct(expr: Column, exprs: Column*) only counts each distinct value (or combination of values) once.
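In plain Python terms (again just a sketch of the semantics, not the Spark API), the two counts differ like this:

```python
# Sketch: count vs. countDistinct semantics.
rows = [10, 20, 10, 30, 20, 10]

count_result = len(rows)                # like count: every row is counted
count_distinct_result = len(set(rows))  # like countDistinct: unique values only

print(count_result)           # 6
print(count_distinct_result)  # 3
```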
first(e: Column) returns the input column's first element, whether or not that element is null. If you want to skip nulls, you can use first(e: Column, ignoreNulls: Boolean) with ignoreNulls set to true.
On the other hand, last(e: Column) and last(e: Column, ignoreNulls: Boolean) do the same job for the last element, with the same distinction regarding nulls.
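A minimal Python sketch of the ignoreNulls behavior, using None to stand in for SQL NULL (the function names here mirror Spark's, but this is illustrative code, not the Spark API; in Spark these functions are also non-deterministic unless the rows are ordered):

```python
def first(values, ignore_nulls=False):
    # Return the first element; optionally skip nulls (None).
    for v in values:
        if not ignore_nulls or v is not None:
            return v
    return None

def last(values, ignore_nulls=False):
    # Same idea from the other end of the column.
    return first(reversed(list(values)), ignore_nulls)

col = [None, 5, 7, None]
print(first(col))                     # None
print(first(col, ignore_nulls=True))  # 5
print(last(col))                      # None
print(last(col, ignore_nulls=True))   # 7
```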
grouping(e: Column) indicates whether the input column is aggregated in the result set, returning 1 if it is and 0 if it is not.
If you want to compare the values within a column, you can use either max(e: Column) or min(e: Column).
Statistical Functions
The first function of this type is avg(e: Column), which computes the input column's average value and returns it. mean(e: Column) is an alias that does exactly the same thing.
If you are interested in covariance, you can check out covar_samp(column1: Column, column2: Column) and covar_pop(column1: Column, column2: Column). The former gives you two columns’ sample covariance, while the latter provides population covariance.
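The only difference between the two formulas is the divisor: n - 1 for the sample estimate and n for the population value. A plain-Python sketch of the math these functions compute (the covar helper is a hypothetical name for illustration, not a Spark API):

```python
def covar(xs, ys, sample=True):
    # Covariance of two equal-length sequences.
    # sample=True  -> divide by n - 1 (like covar_samp)
    # sample=False -> divide by n     (like covar_pop)
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    total = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return total / (n - 1) if sample else total / n

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
print(covar(xs, ys, sample=True))   # 10/3, about 3.333
print(covar(xs, ys, sample=False))  # 2.5
```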
People interested in manipulating only one stream of data will like stddev_samp(e: Column) and stddev_pop(e: Column). The former returns the sample standard deviation of the input, while the latter returns the population standard deviation.
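The sample/population distinction is the same one Python's standard statistics module draws, so that module works well as a quick sanity check for the math (this is not Spark code):

```python
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

sample_sd = statistics.stdev(data)       # divides by n - 1, like stddev_samp
population_sd = statistics.pstdev(data)  # divides by n, like stddev_pop

print(population_sd)  # 2.0
print(sample_sd)      # sqrt(32/7), about 2.138
```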
corr(column1: Column, column2: Column) also requires two input columns, and it returns the Pearson correlation coefficient of the two.
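For reference, the Pearson coefficient is the covariance of the two columns divided by the product of their standard deviations, giving a value between -1 and 1. A plain-Python sketch of that formula (the pearson helper is a hypothetical name for illustration, not a Spark API):

```python
import math

def pearson(xs, ys):
    # Pearson correlation coefficient of two equal-length sequences.
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))  # about 1.0 (perfectly correlated)
print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))  # about -1.0 (perfectly anti-correlated)
```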
We know that this might seem intimidating, but it's the basis. Once you master these, you can tackle more advanced topics like join types.
We have shown you most of the Spark SQL Aggregate Functions. Don't be intimidated by the number, as each function can help you tremendously when working with DataFrames.
Once you master all of them, you will be able to handle aggregation tasks much more quickly and confidently.