There are two methods you can use to filter a Pandas DataFrame:
- The Expression Way
- The List Comprehension Way
I will describe both of them here, because while the expression way is much easier to understand for someone with little experience in Python, list comprehensions are an amazing Python tool that you should learn to utilize, and this is the perfect opportunity to start.
For today's lesson I will use the following toy-DataFrame:
import pandas as pd
headers = ["Letters", "Even/Odd"]
python_dictionary = {
1: ['A', 'Odd'],
2: ['B', 'Even'],
3: ['C', 'Odd'],
4: ['D', 'Even'],
5: ['E', 'Odd'],
6: ['F', 'Even'],
7: ['G', 'Odd'],
8: ['H', 'Even'],
9: ['I', 'Odd'],
10: ['J', 'Even'],
}
df = pd.DataFrame.from_dict(data=python_dictionary, orient='index')
df.columns = headers
If we were to view or print this DataFrame held in the variable df
it would look exactly like we expect:
Letters Even/Odd
1 A Odd
2 B Even
3 C Odd
4 D Even
5 E Odd
6 F Even
7 G Odd
8 H Even
9 I Odd
10 J Even
Before moving on there are a few things I want to point out about the above code:
- Python Dictionaries and Pandas DataFrames are extremely similar and can be switched between each other at will. Here I am using the
from_dict
method of theDataFrame
object of thepandas
library. - I only added column headers to the values in the dictionary; in this case the keys (1-10) of the dictionary are seen as the index, aka the row index.
- The
orient
parameter of thefrom_dict
method can be either the string "columns" or the string "index". What it is essentially saying is "For each new entry in the dictionary, should Pandas create a new Column (columns
) or a new Row (index
)?" Nine times out of ten I prefer to useorient='index'
even though by defaultorient
is set to"columns"
. If you were to switch the above code to sayorient='columns'
(and erase the naming of the columnsdf.columns = headers
) the resulting DataFrame would look like this:
1 2 3 4 5 6 7 8 9 10
0 A B C D E F G H I J
1 Odd Even Odd Even Odd Even Odd Even Odd Even
Moving on...
The expression way is painfully simple. Let's say we wished to filter for "Odd" in the "Even/Odd" column of the toy-DataFrame above. The expression way would look like this:
df = df[df["Even/Odd"] == "Odd"]
As you already should know by now, in coding, syntax means a lot. The syntax of the expression way is:
dataframe[dataframe[column_name] == looking_for]
And there you have it! A perfectly filtered Pandas DataFrame. The resulting DataFrame will look like this:
Letters Even/Odd
1 A Odd
3 C Odd
5 E Odd
7 G Odd
9 I Odd
List comprehensions are a way to quickly build a list. Usually you would use a list literal or a 'for loop' to build a list, like so:
>>> hard_coded_list = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
>>> built_at_runtime_list = []
>>> for letter in hard_coded_list:
... letter_number = f"{letter}10"
... built_at_runtime_list.append(letter_number)
...
>>> built_at_runtime_list
['A10', 'B10', 'C10', 'D10', 'E10', 'F10', 'G10', 'H10', 'I10', 'J10']
Note on the 4th line there I have used something called an "f-string", that you can read more about here.
Essentially the Python code f"{letter}10"
produces the same result as letter + str(10)
and "{}10".format(letter)
. In VBA it would be the same as letter & CStr(10)
.
List comprehensions allow for the same result to be produced from fewer lines of code:
>>> hard_coded_list = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
>>> list_comp = [f"{letter}10" for letter in hard_coded_list]
>>> list_comp
['A10', 'B10', 'C10', 'D10', 'E10', 'F10', 'G10', 'H10', 'I10', 'J10']
Let's break down what's happening inside that list comprehension in a little more depth:
-
We had a pre-built/hard-coded list helpfully called
hard_coded_list
-
We wanted to tack on the number 10 to each item in the list
-
Instead of building a for loop (which in VBA and many other languages would be better understood as a "For Each" loop) we decided to build a list comprehension
-
I usually like to start out my list comprehensions with the simplest form of one; recreating the list I'm working from, which in this example would look something like this:
[letter for letter in hard_coded_list]
-
Then I would modify the first variable to bring about the result I want in the final list, in this case, stick it in an f-sting:
[f"{letter}10" for letter in hard_coded_list]
As you can begin to deduce from the above, the basic syntax of a list comprehension is:
[result for item in iterable]
This is the same basic concept we had for the for loop:
final_list = []
for item in iterable:
result = edit item
final_list.append(result)
So now that we know what a list comprehension is, we can filter a Pandas DataFrame rather simply by utilizing one.
For comparison's sake, let's filter the toy-DataFrame above, again looking for the word "Odd" in the "Even/Odd" column. The simplest list comprehension, to replicate the values in that column, would look like this:
[value for value in df["Even/Odd"]]
Note the two ]]
at the end; the first one closes off the reference to the column and the second one closes the list comprehension.
Now all we need to do is come up with a result
instead of the simply replicating each value. This is where the magic of filtering in Pandas comes from; the result
we are looking for is a Boolean(True/False).
It would look something like this:
[value == "Odd" for value in df["Even/Odd"]]
The result of this list comprehension is
[True, False, True, False, True, False, True, False, True, False]
This list of Trues & Falses will tell the DataFrame which rows stick around and which rows to filter out (Each True
sticks around obviously).
But how do we apply it to the DataFrame? Like so:
df = df[[value == "Odd" for value in df["Even/Odd"]]]
Note now that we have 3 ]]]
at the end!
We could also break it up to look like this:
list_comp = [value == "Odd" for value in df["Even/Odd"]]
df = df[list_comp]
See, that's not much more difficult than the expression way.
So, with a bit of practice and careful attention to matching your square brackets []
you can filter any Pandas Dataframe with a list comprehension too.