Filtering Rows in R Unexpectedly Removes NAs When Using subset or dplyr::filter

Filtering rows in R unexpectedly removes NAs when using subset or dplyr::filter

Your example of the "expected" behavior doesn't actually return what you display in your question. I get:

> df[df$y != 'a',]
    x    y
NA NA <NA>
3   3    c

This is arguably more wrong than what subset and dplyr::filter return. Remember that in R, NA really is intended to mean "unknown", so df$y != 'a' returns:

> df$y != 'a'
[1] FALSE    NA  TRUE

So R is being told you definitely don't want the first row, you do want the last row, but whether you want the second row is literally "unknown". As a result, it includes a row of all NAs.
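To compare the behaviors side by side, here is a minimal sketch. The data frame is an assumption, reconstructed from the output shown above (x = 1:3 and y = c("a", NA, "c")); the data in the original question may differ.

```r
# Assumed data, reconstructed from the printed output above
df <- data.frame(x = 1:3, y = c("a", NA, "c"), stringsAsFactors = FALSE)

keep <- df$y != "a"                  # FALSE, NA, TRUE

by_bracket <- df[keep, ]             # the NA index produces a row of all NAs
by_subset  <- subset(df, y != "a")   # NA is treated as FALSE, so the row is dropped
by_which   <- df[which(keep), ]      # which() skips NA, so only row 3 is kept

print(by_bracket)
print(by_subset)
```

Wrapping the condition in which() is one way to get subset-like behavior while still using [.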

Many people dislike this behavior, but it is what it is.

subset and dplyr::filter make a different default choice, which is to simply drop the rows where the condition is NA. Arguably that is accurate-ish.

But really, the lesson here is that if your data has NAs, that just means you need to code defensively around that at all points, either by using conditions like is.na(df$y) | df$y != 'a', or as mentioned in the other answer by using %in% which is based on match.
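As a rough illustration of the two defensive conditions just mentioned, the sketch below builds both logical vectors on a small assumed data frame (the same hypothetical x and y as in the output above) and checks that neither contains NA.

```r
# Assumed data: y has one NA
df <- data.frame(x = 1:3, y = c("a", NA, "c"), stringsAsFactors = FALSE)

# Option 1: say explicitly that NA rows should be kept
keep_or <- is.na(df$y) | df$y != "a"   # FALSE, TRUE, TRUE

# Option 2: %in% is built on match() and never returns NA
keep_in <- !(df$y %in% "a")            # FALSE, TRUE, TRUE

print(df[keep_or, ])
print(df[keep_in, ])
```

Both conditions keep the NA row on purpose. If the NA rows should be dropped instead, !is.na(df$y) & df$y != 'a' states that explicitly.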

From ?base::Extract:

When extracting, a numerical, logical or character NA index picks an unknown element and so returns NA

From ?base::subset:

missing values are taken as false [...] For ordinary vectors, the result is simply x[subset & !is.na(subset)]

From ?dplyr::filter:

Unlike base subsetting with [, rows where the condition evaluates to NA are dropped

dplyr filter removing NA when that was not specified

This is the default behavior: R simply does not know whether NA == "" is TRUE or FALSE.

NA == ""
[1] NA

Therefore the third row is not returned.
If you want to include NA as well, there are several workarounds:

df %>% filter(coalesce(col1, "x") != "")
df %>% filter(col1 != "" | is.na(col1))

Personally, I prefer the first way: coalesce substitutes NA with a default value (here "x"), and the comparison then checks that the substituted value is not equal to "". The default can be any value other than "".
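A minimal sketch of both workarounds, assuming a hypothetical data frame whose col1 holds an ordinary value, an empty string and an NA (the original question's data is not shown here):

```r
library(dplyr)

# Hypothetical data: ordinary value, empty string, NA in the third row
df <- data.frame(col1 = c("a", "", NA), stringsAsFactors = FALSE)

dropped <- df %>% filter(col1 != "")                    # NA row is dropped
kept_1  <- df %>% filter(coalesce(col1, "x") != "")    # NA replaced by "x" before comparing
kept_2  <- df %>% filter(col1 != "" | is.na(col1))     # NA kept explicitly

print(dropped)
print(kept_1)
```

The two workarounds return the same rows here. The is.na version has the small advantage of not needing a placeholder value that must be guaranteed never to match the excluded string.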

Subsetting in R vs filter(from dplyr) giving different results

If there are NAs, make sure to account for them with is.na; otherwise filter will remove those rows by default:

library(dplyr)
filter(house2, (datetime >= "2007-02-01 00:00:00" & 
                datetime <= "2007-02-03 00:00:00")|
                is.na(datetime))

According to ?filter:

The filter() function is used to subset a data frame, retaining all rows that satisfy your conditions. To be retained, the row must produce a value of TRUE for all conditions. Note that when a condition evaluates to NA the row will be dropped, unlike base subsetting with [.
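The documented behavior can be checked on a date range with a small sketch. The house2 data frame below is a hypothetical stand-in (the real one is not shown), and the bounds are converted to POSIXct with an explicit time zone, which is an assumption made here to keep the comparison unambiguous.

```r
library(dplyr)

# Hypothetical stand-in for house2: one value before the range,
# one inside it, and one missing timestamp
house2 <- data.frame(
  datetime = as.POSIXct(c("2007-01-31 12:00:00", "2007-02-02 08:30:00", NA),
                        tz = "UTC"),
  value = c(1.2, 3.4, 5.6)
)

lower <- as.POSIXct("2007-02-01 00:00:00", tz = "UTC")
upper <- as.POSIXct("2007-02-03 00:00:00", tz = "UTC")

# The range condition alone drops the NA row
in_range <- filter(house2, datetime >= lower & datetime <= upper)

# Adding is.na() keeps it
in_range_or_na <- filter(house2,
                         (datetime >= lower & datetime <= upper) |
                           is.na(datetime))
```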

dplyr::filter() behavior unexpected with NAs

One possibility is the presence of NA elements in those rows. Base R returns an NA row because == with NA returns NA, while filter by default drops the rows where the logical vector is NA:

data[!(data$location_country == "US" & nchar(data$location_admin_level_1) > 2), ]

Now check with filter from dplyr:

library(dplyr)
data %>%
    filter(!(location_country == "US" & nchar(location_admin_level_1) > 2))

If we want to keep the NA rows in filter, use is.na:

data %>% 
   filter((!(location_country == "US" & !is.na(location_country) &
        nchar(location_admin_level_1) > 2 &
           !is.na(location_admin_level_1)))|
           is.na(location_country))

The issue is that == returns NA wherever there is an NA:

with(data, location_country == "US")
#[1]  TRUE  TRUE FALSE FALSE    NA

In base R, an NA in the logical vector returns an NA row because it is neither TRUE nor FALSE, while in filter it gets removed by default, leaving only 2 rows in the filter step (considering only the last expression). To make this TRUE or FALSE, just add an is.na check:

with(data, location_country == "US" & !is.na(location_country))
#[1]  TRUE  TRUE FALSE FALSE FALSE

This would remove the NA rows. But suppose we need the NA row; then the last element should be TRUE. For that we need |:

with(data, location_country == "US"|is.na(location_country))
#[1]  TRUE  TRUE FALSE FALSE  TRUE

data

data <- data.frame(location_country = c('US', 'US', 'China', 'Canada', NA), location_admin_level_1 = c('hello', 'l', 'w', '321', '2443'))
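Putting the pieces together in base R, the sketch below applies the negated condition to the data above and counts the rows each approach returns. subset is used here as a base stand-in for filter, since both treat an NA condition as "drop the row"; the variable names are illustrative only.

```r
data <- data.frame(
  location_country = c("US", "US", "China", "Canada", NA),
  location_admin_level_1 = c("hello", "l", "w", "321", "2443"),
  stringsAsFactors = FALSE
)

# Negated condition: FALSE, TRUE, TRUE, TRUE, NA
cond <- with(data, !(location_country == "US" &
                       nchar(location_admin_level_1) > 2))

base_rows   <- data[cond, ]          # 4 rows, the last one is all NA
subset_rows <- subset(data, cond)    # 3 rows, the NA condition is dropped

# Keep the row with the missing country explicitly
keep_na <- data[cond | is.na(data$location_country), ]   # 4 real rows
```

Note that base_rows and keep_na both have four rows, but only keep_na retains the actual contents of the fifth row.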

filter / subset empty cells vs. NA. Why is subset (df, x =='') not the opposite of subset(df, x !=''). Bug in dplyr or base?

We can use a condition with is.na:

subset(df, is.na(x) | x != "")

This is because == or != returns NA wherever NA elements are present (any comparison with NA returns NA), rather than TRUE or FALSE. subset and filter remove those NA rows, as shown in the documentation of ?subset:

subset - logical expression indicating elements or rows to keep: missing values are taken as false

and in ?filter

Note that when a condition evaluates to NA the row will be dropped, unlike base subsetting with [.

i.e.

with(df,  x != "")
#[1] FALSE FALSE    NA    NA

with(df, is.na(x) | x != "")
#[1] FALSE FALSE  TRUE  TRUE
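To make the "not the opposite" point concrete, here is a small sketch. The x column is reconstructed from the logical vectors above (two empty strings followed by two NAs); the id column is an assumption added only to make the rows easy to tell apart.

```r
# x reconstructed from the output above; id is a hypothetical helper column
df <- data.frame(id = 1:4, x = c("", "", NA, NA), stringsAsFactors = FALSE)

eq  <- subset(df, x == "")    # rows 1 and 2
neq <- subset(df, x != "")    # no rows: the NA comparisons are dropped

# The two subsets do not partition the data
nrow(eq) + nrow(neq)          # 2, not 4

# Adding is.na() recovers the rows that neither subset returned
with_na <- subset(df, is.na(x) | x != "")   # rows 3 and 4
```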

Why does dplyr's filter drop NA values from a factor variable?

You could use this:

 filter(dat, var1 != 1 | is.na(var1))
  var1
1 <NA>
2    3
3    3
4 <NA>
5    2
6    2
7 <NA>

With this condition, filter won't drop the NA values.

Also, just for completeness, dropping NAs is the intended behavior of filter, as you can see from the following:

test_that("filter discards NA", {
  temp <- data.frame(
    i = 1:5,
    x = c(NA, 1L, 1L, 0L, 0L)
  )
  res <- filter(temp, x == 1)
  expect_equal(nrow(res), 2L)
})

The test above was taken from the tests for filter on GitHub.

Remove duplicated rows using dplyr

Note: dplyr now contains the distinct function for this purpose.

Original answer below:

library(dplyr)
set.seed(123)
df <- data.frame(
  x = sample(0:1, 10, replace = TRUE),
  y = sample(0:1, 10, replace = TRUE),
  z = 1:10
)

One approach would be to group, and then only keep the first row:

df %>% group_by(x, y) %>% filter(row_number(z) == 1)

## Source: local data frame [3 x 3]
## Groups: x, y
## 
##   x y z
## 1 0 1 1
## 2 1 0 2
## 3 1 1 4

(In dplyr 0.2 you won't need the dummy z variable and will just be
able to write row_number() == 1)

I've also been thinking about adding a slice() function that would
work like:

df %>% group_by(x, y) %>% slice(from = 1, to = 1)

Or maybe a variation of unique() that would let you select which
variables to use:

df %>% unique(x, y)
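For reference, a sketch of the distinct approach mentioned in the note at the top of this answer, next to a base R equivalent using duplicated. It assumes a dplyr version recent enough to provide distinct with the .keep_all argument; the sampled values depend on the R version, so no output is shown.

```r
library(dplyr)

set.seed(123)
df <- data.frame(
  x = sample(0:1, 10, replace = TRUE),
  y = sample(0:1, 10, replace = TRUE),
  z = 1:10
)

# Keep the first row for each (x, y) combination, with all columns
first_rows <- df %>% distinct(x, y, .keep_all = TRUE)

# Base R equivalent: drop rows whose (x, y) pair was already seen
base_rows <- df[!duplicated(df[c("x", "y")]), ]
```

Without .keep_all = TRUE, distinct(df, x, y) returns only the x and y columns.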
