We are discussing a frequently asked topic: the PySpark where/filter function with conditions.
Spark provides filter() to filter the rows of a DataFrame or a Resilient Distributed Dataset (RDD) on one or more conditions, cleaning out unwanted or malformed records. Only the remaining rows are processed, which speeds up downstream tasks.
In this article, we walk through six methods. Note that the snippets below use Spark's Scala API; the same functions exist in PySpark, so the logic carries over directly.
How to Use Pyspark: Where Filter Function with Conditions?
All examples use the same DataFrame, created as follows:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{ArrayType, StringType, StructType}

val arrayStructureData = Seq(
  Row(Row("Giancarlo","","Kerr"),List("Java","Scala","C++"),"OH","F"),
  Row(Row("Austin","Choi",""),List("Spark","Java","C++"),"NY","M"),
  Row(Row("Hardy","","Cassidy"),List("CSharp","VB"),"OH","F"),
  Row(Row("Jeffer","Ali","Jay"),List("CSharp","VB"),"NY","M"),
  Row(Row("Skylar","Miley","Jr"),List("CSharp","VB"),"NY","M"),
  Row(Row("Kaleb","Daisy","Will"),List("Python","VB"),"OH","M")
)

val arrayStructureSchema = new StructType()
  .add("name", new StructType()
    .add("firstname", StringType)
    .add("middlename", StringType)
    .add("lastname", StringType))
  .add("languages", ArrayType(StringType))
  .add("state", StringType)
  .add("gender", StringType)

val df = spark.createDataFrame(
  spark.sparkContext.parallelize(arrayStructureData),
  arrayStructureSchema)
df.printSchema()
df.show()
Method 1: Use Spark DataFrame filter() Syntaxes
As mentioned, the filter() function in Spark selects the rows of a DataFrame or RDD that satisfy one or more conditions or SQL expressions.
Here are the filter() signatures Spark provides for a DataFrame.
The first signature takes a condition expression, where the column can be referenced with any of the syntaxes $"colname", col("colname"), 'colname, or df("colname"):
1) filter(condition: Column): Dataset[T]
The second signature accepts an SQL expression as a string:
2) filter(conditionExpr: String): Dataset[T] //if you want to use SQL expression
The third signature takes a Scala function that evaluates each row to a Boolean:
3) filter(func: T => Boolean): Dataset[T]
The fourth signature uses the FilterFunction class instead (a short sketch of the third and fourth signatures follows the output below):
4) filter(func: FilterFunction[T]): Dataset[T]
Calling printSchema() and show() on the DataFrame defined above produces:
root
|-- name: struct (nullable = true)
| |-- firstname: string (nullable = true)
| |-- middlename: string (nullable = true)
| |-- lastname: string (nullable = true)
|-- languages: array (nullable = true)
| |-- element: string (containsNull = true)
|-- state: string (nullable = true)
|-- gender: string (nullable = true)
+--------------------+------------------+-----+------+
|                name|         languages|state|gender|
+--------------------+------------------+-----+------+
| [Giancarlo, , Kerr]|[Java, Scala, C++]|   OH|     F|
|    [Austin, Choi, ]|[Spark, Java, C++]|   NY|     M|
|  [Hardy, , Cassidy]|      [CSharp, VB]|   OH|     F|
|  [Jeffer, Ali, Jay]|      [CSharp, VB]|   NY|     M|
| [Skylar, Miley, Jr]|      [CSharp, VB]|   NY|     M|
|[Kaleb, Daisy, Will]|      [Python, VB]|   OH|     M|
+--------------------+------------------+-----+------+
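As promised, here is a minimal sketch of the third and fourth signatures in action. The predicates used below (matching state or gender) are our own illustrative choices, not part of the original example set:
import org.apache.spark.sql.Row
import org.apache.spark.api.java.function.FilterFunction

// Third signature: a Scala function evaluated per row
df.filter((row: Row) => row.getAs[String]("state") == "OH").show(false)

// Fourth signature: a FilterFunction instance (handy when calling from Java)
df.filter(new FilterFunction[Row] {
  override def call(row: Row): Boolean = row.getAs[String]("gender") == "M"
}).show(false)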
Method 2: Use filter() Syntaxes Applied for Column Conditions
Alternatively, you can refer to the column name in any of the equivalent syntaxes below and filter the rows on a column condition.
df.filter(df("state") === "NY").show(false)
The where() operator works exactly like filter() and may feel more natural if you come from SQL. In the Scala DataFrame API, compare values with === (not ==).
import spark.implicits._ // enables the 'state and $"state" syntaxes
import org.apache.spark.sql.functions.col

df.filter('state === "NY").show(false)
df.filter($"state" === "NY").show(false)
df.filter(col("state") === "NY").show(false)
df.where(df("state") === "NY").show(false)
df.where('state === "NY").show(false)
df.where($"state" === "NY").show(false)
df.where(col("state") === "NY").show(false)
Every variant returns the same filtered DataFrame:
+-------------------+------------------+-----+------+
|name               |languages         |state|gender|
+-------------------+------------------+-----+------+
|[Austin, Choi, ]   |[Spark, Java, C++]|NY   |M     |
|[Jeffer, Ali, Jay] |[CSharp, VB]      |NY   |M     |
|[Skylar, Miley, Jr]|[CSharp, VB]      |NY   |M     |
+-------------------+------------------+-----+------+
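If you need to match a column against several values at once, isin() covers that, and =!= is the not-equal counterpart of ===. A small supplementary sketch, beyond the original example set:
import org.apache.spark.sql.functions.col

// Keep rows whose state is any of the listed values
df.filter(col("state").isin("NY", "OH")).show(false)
// Keep rows whose state is NOT "NY"
df.filter(df("state") =!= "NY").show(false)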
Method 3: Apply SQL Expressions to Filter Multiple Rows
For those with an SQL background, it is possible to use SQL expressions with the Pyspark filter function.
This time, we keep only the rows whose gender is female (F):
df.filter("gender == 'F'").show(false)
df.where("gender == 'F'").show(false)
The outcome:
+-------------------+------------------+-----+------+
|name               |languages         |state|gender|
+-------------------+------------------+-----+------+
|[Giancarlo, , Kerr]|[Java, Scala, C++]|OH   |F     |
|[Hardy, , Cassidy] |[CSharp, VB]      |OH   |F     |
+-------------------+------------------+-----+------+
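The same SQL expression also works through a temporary view and spark.sql(). A brief sketch; the view name people is our own choice:
// Register the DataFrame as a temp view and query it with plain SQL
df.createOrReplaceTempView("people")
spark.sql("SELECT * FROM people WHERE gender = 'F' AND state = 'OH'").show(false)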
Method 4: Try Filtering with Various Conditions
Let’s combine several conditions to express more complex filters. We still use filter(), chaining conditions with the logical operators AND (&&), OR (||), and NOT (!). Here we keep the female rows whose languages include VB:
import org.apache.spark.sql.functions.array_contains

df.filter(array_contains(df("languages"), "VB") && df("gender") === "F")
  .show(false)
Now, check the new DataFrame:
+------------------+------------+-----+------+
|name              |languages   |state|gender|
+------------------+------------+-----+------+
|[Hardy, , Cassidy]|[CSharp, VB]|OH   |F     |
+------------------+------------+-----+------+
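For completeness, OR (||) and NOT (!) follow the same pattern. A small illustrative sketch with conditions of our own choosing:
// OR: rows from OH, or rows with gender F
df.filter(df("state") === "OH" || df("gender") === "F").show(false)
// NOT: rows whose state is not OH
df.filter(!(df("state") === "OH")).show(false)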
Method 5: Filter in DataFrame for an Array Column
Spark's array_contains() SQL function checks whether an array column contains a given value, and filter() keeps the rows that match. Here we keep everyone whose languages array contains Java:
import org.apache.spark.sql.functions.array_contains

df.filter(array_contains(df("languages"), "Java"))
  .show(false)
The filter function returns the following DataFrame.
+-------------------+------------------+-----+------+
|name               |languages         |state|gender|
+-------------------+------------------+-----+------+
|[Giancarlo, , Kerr]|[Java, Scala, C++]|OH   |F     |
|[Austin, Choi, ]   |[Spark, Java, C++]|NY   |M     |
+-------------------+------------------+-----+------+
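array_contains() calls can also be combined with the logical operators from Method 4. An assumed example checking two languages at once:
// Rows whose languages array contains both CSharp and VB
df.filter(array_contains(df("languages"), "CSharp") &&
  array_contains(df("languages"), "VB")).show(false)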
Method 6: Filter Applied to Columns with Nested Structure
// Filter with conditions on nested struct columns
df.filter(df("name.lastname") === "Will")
  .show(false)
For a column with a nested structure, use dot notation (here, name.lastname) to reach the inner field. The filter returns the following table:
+--------------------+------------+-----+------+
|name                |languages   |state|gender|
+--------------------+------------+-----+------+
|[Kaleb, Daisy, Will]|[Python, VB]|OH   |M     |
+--------------------+------------+-----+------+
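Nested fields are also reachable via col() and inside SQL expression strings. A quick supplementary sketch:
import org.apache.spark.sql.functions.col

// Dot notation works with col() and can be combined with other conditions
df.filter(col("name.lastname") === "Will" && col("state") === "OH").show(false)
// ...and inside an SQL expression string
df.filter("name.lastname = 'Will'").show(false)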
Conclusion
We have introduced several ways to use the PySpark where/filter function with conditions. Feel free to adapt the samples to your own DataFrames.
Please share our tutorial article if you find it helpful. Many thanks!