Filtering pandas rows by conditions is a simple technique, yet it is easy to get wrong. This article walks through how to do it properly.
Pandas Filter Rows by Conditions – Tips to Remember
This tutorial shows how to filter pandas rows on multiple conditions through realistic examples. Before we start, have a look at this sample DataFrame, which the demonstrations below keep referring to.
DataFrame – Code:
import pandas as pd

# Sample data: one row per company
data = {
    'Name': ['Linux', 'Yahoo', 'Xpeng', 'Samsung', 'Disney'],
    'Symbol': ['LNX', 'YH', 'XNG', 'SMSNG', 'DN'],
    'Industry': ['Tech', 'Tech', 'Automotive', 'Tech', 'Entertainment'],
    'Shares': [99, 49, 149, 199, 79]
}
df = pd.DataFrame(data)
print(df)
Output:
      Name  Symbol       Industry  Shares
0    Linux     LNX           Tech      99
1    Yahoo      YH           Tech      49
2    Xpeng     XNG     Automotive     149
3  Samsung   SMSNG           Tech     199
4   Disney      DN  Entertainment      79
Now that we have a DataFrame for reference, here are some simple tips to remember while coding:
1. Boolean Indexing Is The Fastest Method
A pandas DataFrame supports boolean indexing, which many consider an effective way to filter rows on multiple conditions. Each condition produces a boolean vector (a Series of True and False values), and the DataFrame keeps only the rows where that vector is True.
Conditions are combined with the operators | (or), & (and), and ~ (not), and each condition is grouped in parentheses ().
In the DataFrame above, let's filter all rows that belong to the "Tech" industry and have at least 99 shares.
df_filtered = df[(df['Industry']=='Tech')&(df['Shares']>=99)]
print(df_filtered)
Output:
      Name  Symbol  Industry  Shares
0    Linux     LNX      Tech      99
3  Samsung   SMSNG      Tech     199
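To make the intermediate step visible, the following is a minimal sketch that stores each condition as a named boolean mask before combining it. The variable names (is_tech, big_holding, and so on) are illustrative assumptions, and the sample DataFrame is rebuilt in a shortened form so the snippet runs on its own.

```python
import pandas as pd

# Shortened copy of the sample DataFrame so the snippet is self-contained
df = pd.DataFrame({
    'Name': ['Linux', 'Yahoo', 'Xpeng', 'Samsung', 'Disney'],
    'Industry': ['Tech', 'Tech', 'Automotive', 'Tech', 'Entertainment'],
    'Shares': [99, 49, 149, 199, 79],
})

# Each comparison yields a boolean Series aligned with the DataFrame's index
is_tech = df['Industry'] == 'Tech'
big_holding = df['Shares'] >= 99

# & keeps rows where both masks are True
both = df[is_tech & big_holding]

# | keeps rows where at least one mask is True
either = df[is_tech | big_holding]

# ~ inverts a mask, keeping every row that is not in the Tech industry
not_tech = df[~is_tech]

print(is_tech.tolist())
print(both['Name'].tolist())
print(either['Name'].tolist())
print(not_tech['Name'].tolist())
```

Naming the masks is only one possible style; writing the conditions inline, as in the example above, behaves the same way.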
2. Always Use the Operators ~, |, and &; Never Use not, or, and
As mentioned above, pandas supports the operators & (and), | (or), and ~ (not). They apply a logical operation element by element across a Series, which lets you bind several conditions together when filtering a DataFrame.
If you use Python's logical keywords (and, or, not) instead, you will get an error.
Let's say we want to filter stocks with between 99 and 149 shares using "and".
df_filtered = df[(df['Shares']>=99) and (df['Shares']<=149)]
print(df_filtered)
What we get will be this error:
ValueError Traceback (most recent call last)
<ipython-input-4-dac68abbe005> in <module>
----> 1 df_filtered = df[(df['Shares']>=99) and (df['Shares']<=149)]
2 print(df_filtered)
Why does the error occur? Python's logical keywords (not, or, and) need a single True or False value for each operand. When an operand is a Series holding many values, pandas cannot reduce it to one boolean, so it raises a ValueError stating that the truth value of a Series is ambiguous.
The solution is to replace "and" with "&", which compares the two Series element by element.
Here is the correct code:
df_filtered = df[(df['Shares']>=99) & (df['Shares']<=149)]
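As a quick check of the behavior described in this tip, here is a minimal sketch that catches the ValueError raised by "and" and then applies the "&" version. It also shows Series.between, an existing pandas method that is inclusive on both ends by default, as one possible alternative for range filters. The variable names are illustrative, and the DataFrame is a shortened copy of the sample data.

```python
import pandas as pd

# Shortened copy of the sample DataFrame so the snippet is self-contained
df = pd.DataFrame({
    'Name': ['Linux', 'Yahoo', 'Xpeng', 'Samsung', 'Disney'],
    'Shares': [99, 49, 149, 199, 79],
})

# Python's `and` asks its left operand for a single truth value,
# which pandas refuses to provide for a multi-element Series
try:
    df[(df['Shares'] >= 99) and (df['Shares'] <= 149)]
    raised = False
except ValueError:
    raised = True

# The element-wise operator & combines the two masks row by row
with_ampersand = df[(df['Shares'] >= 99) & (df['Shares'] <= 149)]

# Series.between covers the same inclusive range in a single call
with_between = df[df['Shares'].between(99, 149)]

print(raised)
print(with_ampersand['Name'].tolist())
print(with_ampersand.equals(with_between))
```

Whether between reads better than two explicit comparisons is a matter of taste; both select the same rows here.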
3. Always Group Conditions Using Parentheses ()
Parentheses are something you should never forget when combining conditions. Without them, Python evaluates the expression according to its operator precedence rules, which can produce an error or an unintended result.
Let's come back to the previous example. Suppose you want to filter stocks with between 149 and 199 shares. What happens if you leave out the parentheses?
df_filtered = df[df['Shares']>=149 & df['Shares']<=199]
print(df_filtered)
You will receive this error:
ValueError Traceback (most recent call last)
<ipython-input-23-545c272b68ba> in <module>
----> 1 df_filtered = df[df['Shares']>=149 & df['Shares']<=199]
2 print(df_filtered)
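Why does the error occur? Because there are no parentheses, df['Shares']>=149 & df['Shares']<=199 is interpreted as df['Shares'] >= (149 & df['Shares']) <= 199, since "&" has higher precedence than >= and <=. Python then treats the result as a chained comparison, which joins its two halves with an implicit "and", so the same ValueError from the previous tip appears.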
Here is the correct code:
df_filtered = df[(df['Shares']>=149) & (df['Shares']<=199)]
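The sketch below illustrates the precedence issue under the same sample data. It first evaluates 149 & df['Shares'] on its own to show that it is a bitwise AND on the integers rather than a condition, then builds the filter from named masks, one possible habit that keeps each comparison separate from the "&". The variable names are illustrative assumptions.

```python
import pandas as pd

# Shortened copy of the sample DataFrame so the snippet is self-contained
df = pd.DataFrame({
    'Name': ['Linux', 'Yahoo', 'Xpeng', 'Samsung', 'Disney'],
    'Shares': [99, 49, 149, 199, 79],
})
shares = df['Shares']

# Without parentheses, & binds first: this is a bitwise AND between the
# integer 149 and every share count, not a logical condition
bitwise = 149 & shares

# Assigning each comparison to its own name evaluates it before & is applied
at_least_149 = shares >= 149
at_most_199 = shares <= 199
in_range = df[at_least_149 & at_most_199]

print(bitwise.tolist())
print(in_range['Name'].tolist())
```

Named masks and explicit parentheses are equivalent here; the parenthesized one-liner above remains perfectly valid.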
A Brief Introduction to Pandas
Origin
First of all, we need to know a thing or two about pandas. It is a software library written for the Python language and designed for data analysis and manipulation.
Its creator, Wes McKinney, formed the ideas that later became pandas while working as a researcher at AQR Capital (from 2008 to 2010). The library is free of charge and released under the three-clause BSD license, and it provides data structures and operations for manipulating numerical tables and time series.
The name "pandas" stems from "panel data", an econometrics term for data sets that contain observations of the same individuals over multiple time periods. It is also a play on the phrase "Python data analysis".
Usage
A data structure serves as the basis for abstract data types (ADTs). The ADT defines the logical form of the data type, while the data structure implements its physical form.
Different data structures suit different applications, and some are highly specialized for particular tasks. For example, relational databases commonly use B-tree indexes to retrieve data, while compiler implementations usually use hash tables to look up identifiers.
Data structures provide a means of managing large amounts of data efficiently, for uses such as large databases and Internet indexing services.
Efficient data structures are often key to designing efficient algorithms. Some design methods and programming languages therefore emphasize data structures, rather than algorithms, as the key organizing factor. Data structures can be used to organize the storage and retrieval of information in both main memory and secondary memory.
Implementation
Data structures are generally based on the computer's ability to fetch and store data at any place in its memory, specified by a pointer (a bit string representing a memory address). The pointer itself can also be stored in memory and manipulated by the program.
Hence, array and record data structures compute the addresses of data items with arithmetic operations, whereas linked data structures store the addresses of data items within the structure itself.
Implementing a data structure usually requires writing a set of procedures that create and manipulate instances of that structure. Its efficiency cannot be analyzed separately from those operations.
This observation motivates the theoretical concept of an abstract data type: a data structure defined indirectly by the operations that may be performed on it and by the mathematical properties of those operations, including their time and space cost.
Conclusion
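This article has covered the essentials of filtering pandas rows by conditions: use boolean indexing, combine conditions with ~, |, and & rather than not, or, and and, and always wrap each condition in parentheses. For methods to implement column names on Pandas, you may turn to this article for more guidance.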