4.3 Filter and Subset
There are two ways to remove rows from a DataFrame
, one is filter
(Section 4.3.1) and the other is subset
(Section 4.3.2). filter
was added earlier to DataFrames.jl
, is more powerful and more consistent with syntax from Julia base, so that is why we start discussing filter
first. subset
is newer and often more convenient.
4.3.1 Filter
From this point on, we start to get into the more powerful features of DataFrames.jl
. To do this, we need to learn some functions, such as select
and filter
. But don’t worry! It might be a relief to know that the general design goal of DataFrames.jl
is to keep the number of functions that a user has to learn to a minimum15.
Like before, we resume from the grades_2020
:
grades_2020()
name | grade_2020 |
---|---|
Sally | 1.0 |
Bob | 5.0 |
Alice | 8.5 |
Hank | 4.0 |
We can filter rows by using filter(source => f::Function, df)
. Note how this function is very similar to the function filter(f::Function, V::Vector)
from Julia Base
module. This is because DataFrames.jl
uses multiple dispatch (see Section 2.3.3) to define a new method of filter
that accepts a DataFrame
as argument.
At first sight, defining and working with a function f
for filtering can be a bit difficult to use in practice. Hold tight, that effort is well-paid, since it is a very powerful way of filtering data. As a simple example, we can create a function equals_alice
that checks whether its input equals “Alice”:
equals_alice(name::String) = name == "Alice"
JDS.equals_alice("Bob")
false
equals_alice("Alice")
true
Equipped with such a function, we can use it as our function f
to filter out all the rows for which name
equals “Alice”:
filter(:name => equals_alice, grades_2020())
name | grade_2020 |
---|---|
Alice | 8.5 |
Note that this doesn’t only work for DataFrame
s, but also for vectors:
filter(equals_alice, ["Alice", "Bob", "Dave"])
["Alice"]
We can make it a bit less verbose by using an anonymous function (see Section 3.2.4.4):
filter(n -> n == "Alice", ["Alice", "Bob", "Dave"])
["Alice"]
which we can also use on grades_2020
:
filter(:name => n -> n == "Alice", grades_2020())
name | grade_2020 |
---|---|
Alice | 8.5 |
To recap, this function call can be read as “for each element in the column :name
, let’s call the element n
, check whether n
equals Alice.” For some people, this is still too verbose. Luckily, Julia has added a partial function application of ==
. The details are not important – just know that you can use it just like any other function:
filter(:name => ==("Alice"), grades_2020())
name | grade_2020 |
---|---|
Alice | 8.5 |
To get all the rows which are not Alice, ==
(equality) can be replaced by !=
(inequality) in all previous examples:
filter(:name => !=("Alice"), grades_2020())
name | grade_2020 |
---|---|
Sally | 1.0 |
Bob | 5.0 |
Hank | 4.0 |
Now, to show why functions are so powerful, we can come up with a slightly more complex filter. In this filter, we want to have the people whose names start with A or B and have a grade above 6:
function complex_filter(name, grade)::Bool
interesting_name = startswith(name, 'A') || startswith(name, 'B')
interesting_grade = 6 < grade
interesting_name && interesting_grade
end
filter([:name, :grade_2020] => complex_filter, grades_2020())
name | grade_2020 |
---|---|
Alice | 8.5 |
4.3.2 Subset
The subset
function was added to make it easier to work with missing values (Section 4.9). In contrast to filter
, subset
works on complete columns instead of rows or single values. If we want to use our earlier defined functions, we should wrap it inside ByRow
:
subset(grades_2020(), :name => ByRow(equals_alice))
name | grade_2020 |
---|---|
Alice | 8.5 |
Also note that the DataFrame
is now the first argument subset(df, args...)
, whereas in filter
it was the second one filter(f, df)
. The reason for this is that Julia defines filter as filter(f, V::Vector)
and DataFrames.jl
chose to maintain consistency with existing Julia functions that were extended to DataFrame
s types by multiple dispatch.
NOTE: Most of native
DataFrames.jl
functions, whichsubset
belongs to, have a consistent function signature that always takes aDataFrame
as first argument.
Just like with filter
, we can also use anonymous functions inside subset
:
subset(grades_2020(), :name => ByRow(name -> name == "Alice"))
name | grade_2020 |
---|---|
Alice | 8.5 |
Or, the partial function application for ==
:
subset(grades_2020(), :name => ByRow(==("Alice")))
name | grade_2020 |
---|---|
Alice | 8.5 |
Ultimately, let’s show the real power of subset
. First, we create a dataset with some missing values:
function salaries()
names = ["John", "Hank", "Karen", "Zed"]
salary = [1_900, 2_800, 2_800, missing]
DataFrame(; names, salary)
end
salaries()
names | salary |
---|---|
John | 1900 |
Hank | 2800 |
Karen | 2800 |
Zed | missing |
This data is about a plausible situation where you want to figure out your colleagues’ salaries, and haven’t figured it out for Zed yet. Even though we don’t want to encourage these practices, we suspect it is an interesting example. Suppose we want to know who earns more than 2000. If we use filter
, without taking the missing
values into account, it will fail:
filter(:salary => >(2_000), salaries())
TypeError: non-boolean (Missing) used in boolean context
Stacktrace:
[1] (::DataFrames.var"#99#100"{Base.Fix2{typeof(>), Int64}})(x::Missing)
@ DataFrames ~/.julia/packages/DataFrames/dgZn3/src/abstractdataframe/abstractdataframe.jl:1178
...
subset
will also fail, but it will fortunately point us towards an easy solution:
subset(salaries(), :salary => ByRow(>(2_000)))
ArgumentError: missing was returned in condition number 1 but only true or false are allowed; pass skipmissing=true to skip missing values
Stacktrace:
[1] _and(x::Missing)
@ DataFrames ~/.julia/packages/DataFrames/dgZn3/src/abstractdataframe/subset.jl:11
...
So, we just need to pass the keyword argument skipmissing=true
:
subset(salaries(), :salary => ByRow(>(2_000)); skipmissing=true)
names | salary |
---|---|
Hank | 2800 |
Karen | 2800 |
Support this project
CC BY-NC-SA 4.0 Jose Storopoli, Rik Huijzer, Lazaro Alonso