Filtering rows in R unexpectedly removes NAs when using subset or dplyr::filter
Your example of the "expected" behavior doesn't actually return what you display in your question. I get:
> df[df$y != 'a',]
x y
NA NA <NA>
3 3 c
This is arguably more wrong than what subset and dplyr::filter return. Remember that in R, NA really is intended to mean "unknown", so df$y != 'a' returns:
> df$y != 'a'
[1] FALSE NA TRUE
So R is being told you definitely don't want the first row, you do want the last row, but whether you want the second row is literally "unknown". As a result, it includes a row of all NAs.
Many people dislike this behavior, but it is what it is.
subset and dplyr::filter make a different default choice, which is to simply drop the NA rows, which arguably is accurate-ish.
But really, the lesson here is that if your data has NAs, you need to code defensively around them at all points, either by using conditions like is.na(df$y) | df$y != 'a', or, as mentioned in the other answer, by using %in%, which is based on match.
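The difference between the three approaches can be sketched in base R. The data frame below is an assumption, reconstructed from the output shown above (the question's original data is not reproduced here), and the variable names are only illustrative.

```r
# Hypothetical data reconstructed from the output above (an assumption).
df <- data.frame(x = 1:3, y = c("a", NA, "c"), stringsAsFactors = FALSE)

cond <- df$y != "a"                  # FALSE NA TRUE

by_bracket <- df[cond, ]             # NA index yields a row of all NAs
by_subset  <- subset(df, y != "a")   # NA is treated as FALSE, row 2 is dropped

# Defensive alternatives that keep the real second row:
keep_isna <- df[is.na(df$y) | df$y != "a", ]
keep_in   <- df[!(df$y %in% "a"), ]  # %in% never returns NA

print(by_bracket)
print(by_subset)
print(keep_isna)
print(keep_in)
```

Both defensive forms return the actual rows 2 and 3 rather than a placeholder row of NAs.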
From ?base::Extract:
When extracting, a numerical, logical or character NA index picks an unknown element and so returns NA
From ?base::subset:
missing values are taken as false [...] For ordinary vectors, the result is simply x[subset & !is.na(subset)]
From ?dplyr::filter:
Unlike base subsetting with [, rows where the condition evaluates to NA are dropped
dplyr filter removing NA when that was not specified
This is the default behavior: R simply does not know whether NA == "" is TRUE or FALSE:
NA == ""
[1] NA
Therefore the third row is not returned.
If you want to include NA as well, there are several workarounds:
df %>% filter(coalesce(col1, "x") != "")
df %>% filter(col1 != "" | is.na(col1))
Personally, I prefer the first way: coalesce substitutes NA with a default value (here "x") and then checks whether the substituted value is equal to "".
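A minimal sketch of both workarounds side by side, assuming dplyr is installed. The data frame and the id column are hypothetical, chosen so that the third row holds the NA as in the question.

```r
library(dplyr)

# Hypothetical data: a value, an empty string, and an NA in the third row.
df <- data.frame(id = 1:3, col1 = c("a", "", NA), stringsAsFactors = FALSE)

dropped       <- df %>% filter(col1 != "")                    # NA row is dropped
kept_coalesce <- df %>% filter(coalesce(col1, "x") != "")     # NA replaced by "x" first
kept_isna     <- df %>% filter(col1 != "" | is.na(col1))      # NA kept explicitly

print(dropped)
print(kept_coalesce)
print(kept_isna)
```

One caveat with the coalesce form: the placeholder must differ from the value being excluded. Using "" as the default here would drop the NA rows again.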
Subsetting in R vs filter(from dplyr) giving different results
If there are NAs, make sure to account for them with is.na, or else filter will remove those rows by default:
library(dplyr)
filter(house2, (datetime >= "2007-02-01 00:00:00" &
                datetime <= "2007-02-03 00:00:00") |
               is.na(datetime))
According to ?filter:
The filter() function is used to subset a data frame, retaining all rows that satisfy your conditions. To be retained, the row must produce a value of TRUE for all conditions. Note that when a condition evaluates to NA the row will be dropped, unlike base subsetting with [.
dplyr::filter() behavior unexpected with NAs
One possibility is the existence of NA elements in those rows. Base R would return an NA row because == with NA returns NA, while filter removes the NA in the logical vector by default:
data[!(data$location_country == "US" & nchar(data$location_admin_level_1) > 2), ]
Now check with filter from dplyr:
library(dplyr)
data %>%
  filter(!(location_country == "US" & nchar(location_admin_level_1) > 2))
If we want to keep the NA rows in filter, use is.na:
data %>%
  filter((!(location_country == "US" & !is.na(location_country) &
            nchar(location_admin_level_1) > 2 &
            !is.na(location_admin_level_1))) |
         is.na(location_country))
The issue is that == returns NA when there is any NA:
with(data, location_country == "US")
#[1] TRUE TRUE FALSE FALSE NA
In base R, an NA in the logical vector returns an NA row because it is neither TRUE nor FALSE, while filter removes it by default, leaving only 2 rows in the filter step (considering only the last expression). To make this TRUE or FALSE, just add an is.na check:
with(data, location_country == "US" & !is.na(location_country))
#[1] TRUE TRUE FALSE FALSE FALSE
This would remove the NA rows. But if we need the NA row, the last element should be TRUE. For that we need |:
with(data, location_country == "US"|is.na(location_country))
#[1] TRUE TRUE FALSE FALSE TRUE
data
data <- data.frame(location_country = c('US', 'US', 'China', 'Canada', NA), location_admin_level_1 = c('hello', 'l', 'w', '321', '2443'))
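As an alternative to spelling out is.na for each column, the whole condition can be passed through %in% TRUE, which maps NA to FALSE. This is a common base R idiom rather than something the answer above uses, so treat it as a sketch. stringsAsFactors = FALSE is added on the assumption that nchar should receive character columns on older R versions.

```r
data <- data.frame(
  location_country = c("US", "US", "China", "Canada", NA),
  location_admin_level_1 = c("hello", "l", "w", "321", "2443"),
  stringsAsFactors = FALSE  # keep strings as character so nchar() works
)

# The last element is NA because the country is unknown.
cond <- with(data, location_country == "US" & nchar(location_admin_level_1) > 2)

# %in% TRUE turns NA into FALSE, so the negation keeps the NA-country row.
is_match <- cond %in% TRUE
kept <- data[!is_match, ]

print(cond)
print(kept)
```

Here kept contains the real fifth row (with its "2443" value) instead of a row of all NAs.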
filter / subset empty cells vs. NA. Why is subset (df, x =='') not the opposite of subset(df, x !=''). Bug in dplyr or base?
We can use a condition with is.na:
subset(df, is.na(x) | x != "")
This is because == and != return NA wherever an element is NA (any comparison with NA returns NA) rather than TRUE or FALSE. subset and filter drop those NA rows, as shown in the documentation of ?subset:
subset - logical expression indicating elements or rows to keep: missing values are taken as false
and in ?filter:
Note that when a condition evaluates to NA the row will be dropped, unlike base subsetting with [.
i.e.
with(df, x != "")
#[1] FALSE FALSE NA NA
with(df, is.na(x) | x != "")
#[1] FALSE FALSE TRUE TRUE
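To see why the == and != subsets are not complements, it can help to count the rows in each piece. The data frame below is an assumption chosen to match the logical vectors printed above (two empty strings followed by two NAs).

```r
# Hypothetical data consistent with the printed output above (an assumption).
df <- data.frame(x = c("", "", NA, NA), stringsAsFactors = FALSE)

eq      <- subset(df, x == "")    # rows where the comparison is TRUE
neq     <- subset(df, x != "")    # rows where the negated comparison is TRUE
na_rows <- subset(df, is.na(x))   # rows where either comparison is NA

# The three pieces together account for every row.
counts <- c(eq = nrow(eq), neq = nrow(neq), na = nrow(na_rows))
print(counts)
print(sum(counts) == nrow(df))
```

The NA rows fall into neither the == subset nor the != subset, which is why the two do not add up to the full data frame.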
Why does dplyr's filter drop NA values from a factor variable?
You could use this:
filter(dat, var1 != 1 | is.na(var1))
var1
1 <NA>
2 3
3 3
4 <NA>
5 2
6 2
7 <NA>
With the is.na(var1) condition added, the NA rows are kept.
Also, just for completeness, dropping NAs is the intended behavior of filter, as you can see from the following:
test_that("filter discards NA", {
  temp <- data.frame(
    i = 1:5,
    x = c(NA, 1L, 1L, 0L, 0L)
  )
  res <- filter(temp, x == 1)
  expect_equal(nrow(res), 2L)
})
The test above was taken from the tests for filter on GitHub.
Remove duplicated rows using dplyr
Note: dplyr now contains the distinct function for this purpose.
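A minimal sketch of distinct, assuming a reasonably recent dplyr (the .keep_all argument is not available in very old releases). The small data frame is hypothetical and fixed by hand so the result does not depend on a random seed.

```r
library(dplyr)

# Hypothetical data with repeated (x, y) combinations.
df <- data.frame(
  x = c(0, 1, 1, 0, 1),
  y = c(1, 0, 1, 1, 0),
  z = 1:5
)

# Keep the first row of each (x, y) combination, with all other columns.
first_rows <- df %>% distinct(x, y, .keep_all = TRUE)

# Keep only the distinct (x, y) combinations themselves.
combos <- df %>% distinct(x, y)

print(first_rows)
print(combos)
```

With .keep_all = TRUE the remaining columns come from the first row of each combination, which matches the group-and-keep-first approach in the original answer.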
Original answer below:
library(dplyr)
set.seed(123)
df <- data.frame(
  x = sample(0:1, 10, replace = TRUE),
  y = sample(0:1, 10, replace = TRUE),
  z = 1:10
)
One approach would be to group, and then only keep the first row:
df %>% group_by(x, y) %>% filter(row_number(z) == 1)
## Source: local data frame [3 x 3]
## Groups: x, y
##
## x y z
## 1 0 1 1
## 2 1 0 2
## 3 1 1 4
(In dplyr 0.2 you won't need the dummy z variable and will just be able to write row_number() == 1)
I've also been thinking about adding a slice() function that would work like:
df %>% group_by(x, y) %>% slice(from = 1, to = 1)
Or maybe a variation of unique() that would let you select which variables to use:
df %>% unique(x, y)