Spark's DataFrame where() and filter() functions bring the selection capabilities many SQL developers fall in love with. Check out these examples to get the hang of them.
Spark DataFrame Where Filter
Spark allows you to create a subset of any DataFrame using one or more conditions. Rows can be selected with the filter() and where() functions, which work exactly the same way.
In this example, we are going to import data from a CSV file into a DataFrame. It contains sales information for 10 homes, including the listing price, selling price, number of rooms, and taxes. You can read more about this reading operation in our article.
Scala:
scala> val data_source = "homes.csv"
scala> val df = spark.read.option("header", "true").csv(data_source)
scala> df.printSchema()
root
|-- Sell: string (nullable = true)
|-- List: string (nullable = true)
|-- Rooms: string (nullable = true)
|-- Beds: string (nullable = true)
|-- Baths: string (nullable = true)
|-- Age: string (nullable = true)
|-- Taxes: string (nullable = true)
Python:
>>> df = spark.read.option("header", True).csv('homes.csv')
>>> df.printSchema()
root
|-- Sell: string (nullable = true)
|-- List: string (nullable = true)
|-- Rooms: string (nullable = true)
|-- Beds: string (nullable = true)
|-- Baths: string (nullable = true)
|-- Age: string (nullable = true)
|-- Taxes: string (nullable = true)
Make sure that your DataFrame has the correct schema, including the column names and their data types. Otherwise, you will have a hard time telling the filter() and where() functions which column and value to pay attention to.
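Note that every column above was read as a string, because csv() does not infer data types by default. A minimal sketch of one way to get typed columns, assuming the values really are numeric, is to enable the inferSchema read option:
Scala:
scala> val df = spark.read.option("header", "true").option("inferSchema", "true").csv(data_source)
Python:
>>> df = spark.read.option("header", True).option("inferSchema", True).csv('homes.csv')
The string columns still work in the comparisons below, since Spark casts them implicitly, but explicit numeric types make the filters easier to reason about.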
The basic syntax is the same in the Python and Scala APIs:
DataFrame.filter(condition)
DataFrame.where(condition)
Now let's walk through some common examples. Suppose you want to print only the rows for homes that have 10 rooms. This can be done by comparing the right column against that value.
Scala:
scala> df.filter(df("Rooms") === 10).show
+----+----+-----+----+-----+---+-----+
|Sell|List|Rooms|Beds|Baths|Age|Taxes|
+----+----+-----+----+-----+---+-----+
| 142| 160| 10| 5| 3| 60| 3167|
| 271| 285| 10| 5| 2| 30| 5702|
+----+----+-----+----+-----+---+-----+
Python:
>>> df.filter(df.Rooms == 10).show()
+----+----+-----+----+-----+---+-----+
|Sell|List|Rooms|Beds|Baths|Age|Taxes|
+----+----+-----+----+-----+---+-----+
| 142| 160| 10| 5| 3| 60| 3167|
| 271| 285| 10| 5| 2| 30| 5702|
+----+----+-----+----+-----+---+-----+
As you can see, the Scala and Python APIs of Spark produce exactly the same output. You can replace filter() with where() as well.
Scala:
scala> df.where(df("Rooms") === 10).show
Python:
>>> df.where(df.Rooms == 10).show()
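Both functions also accept a SQL expression string, which reads naturally if you come from a SQL background. A quick sketch of the same 10-room filter written as an expression string:
Scala:
scala> df.filter("Rooms = 10").show
Python:
>>> df.filter("Rooms = 10").show()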
Using the isin() method, you can check whether a column's values match any element of a provided list of values. The filters below will only select houses with exactly 5 or 7 rooms.
Scala:
scala> val listValues = Seq(5, 7)
scala> df.filter(df("Rooms").isin(listValues:_*)).show
+----+----+-----+----+-----+---+-----+
|Sell|List|Rooms|Beds|Baths|Age|Taxes|
+----+----+-----+----+-----+---+-----+
| 138| 140| 7| 3| 1| 22| 3204|
| 135| 140| 7| 4| 3| 9| 3028|
| 89| 90| 5| 3| 1| 43| 2054|
+----+----+-----+----+-----+---+-----+
Python:
>>> listValues = [5, 7]
>>> df.filter(df.Rooms.isin(listValues)).show()
+----+----+-----+----+-----+---+-----+
|Sell|List|Rooms|Beds|Baths|Age|Taxes|
+----+----+-----+----+-----+---+-----+
| 138| 140| 7| 3| 1| 22| 3204|
| 135| 140| 7| 4| 3| 9| 3028|
| 89| 90| 5| 3| 1| 43| 2054|
+----+----+-----+----+-----+---+-----+
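You can also invert an isin() selection by negating the condition: ! in Scala, ~ in Python. A minimal sketch that keeps only the homes with neither 5 nor 7 rooms:
Scala:
scala> df.filter(!df("Rooms").isin(listValues:_*)).show
Python:
>>> df.filter(~df.Rooms.isin(listValues)).show()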
To apply multiple conditions when filtering a DataFrame, you can chain several filter() or where() calls to make your desired selection.
These examples will print homes that have at least 7 rooms and a listing price below $150,000.
Scala:
scala> df.filter(df("Rooms") >= 7).filter(df("List") < 150).show
+----+----+-----+----+-----+---+-----+
|Sell|List|Rooms|Beds|Baths|Age|Taxes|
+----+----+-----+----+-----+---+-----+
| 138| 140| 7| 3| 1| 22| 3204|
| 135| 140| 7| 4| 3| 9| 3028|
+----+----+-----+----+-----+---+-----+
Python:
>>> df.filter(df.Rooms >= 7).filter(df.List < 150).show()
+----+----+-----+----+-----+---+-----+
|Sell|List|Rooms|Beds|Baths|Age|Taxes|
+----+----+-----+----+-----+---+-----+
| 138| 140| 7| 3| 1| 22| 3204|
| 135| 140| 7| 4| 3| 9| 3028|
+----+----+-----+----+-----+---+-----+
There is no limit to how many filter() and where() calls you can chain. As long as each condition is valid and works as a single-condition filter, you can expect them to work in combination as well.
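Chaining is not your only option, either. Conditions can be combined inside a single filter() call with boolean operators: && in Scala, and & or | in Python, where each side needs its own parentheses. A sketch equivalent to the chained example above:
Scala:
scala> df.filter(df("Rooms") >= 7 && df("List") < 150).show
Python:
>>> df.filter((df.Rooms >= 7) & (df.List < 150)).show()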
Conclusion
The Spark DataFrame where() and filter() functions make selecting data based on conditions a breeze. Thanks to them, you can generate a subset of a DataFrame that contains only the data you need to process.