Why Does dplyr's filter Drop NA Values from a Factor Variable

Why does dplyr's filter drop NA values from a factor variable?

To keep the NA rows, add an explicit is.na() condition:

 filter(dat, var1 != 1 | is.na(var1))
var1
1 <NA>
2 3
3 3
4 <NA>
5 2
6 2
7 <NA>

With the is.na() condition added, the NA rows are no longer dropped.
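For a reproducible version of the call above, here is a minimal sketch. The data frame `dat` is made up for illustration (the original data was not shown); its values are chosen so that the filtered result matches the printed output.

```r
library(dplyr)

# Hypothetical data: a factor column with some missing values
dat <- data.frame(var1 = factor(c(1, NA, 3, 3, 1, NA, 2, 2, NA)))

# The plain condition drops the NA rows, because NA != 1 evaluates to NA
dropped <- filter(dat, var1 != 1)

# Adding is.na() keeps them
kept <- filter(dat, var1 != 1 | is.na(var1))

nrow(dropped)
nrow(kept)
```

The first call keeps only the rows where the comparison is TRUE, while the second also keeps the rows where var1 is missing.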

For completeness, dropping NAs is the intended behavior of filter(), as the following test shows:

test_that("filter discards NA", {
  temp <- data.frame(
    i = 1:5,
    x = c(NA, 1L, 1L, 0L, 0L)
  )
  res <- filter(temp, x == 1)
  expect_equal(nrow(res), 2L)
})

This test was taken from the tests for filter() in the dplyr repository on GitHub.

Removing NA observations with dplyr::filter()

From @Ben Bolker:

[T]his has nothing specifically to do with dplyr::filter()

From @Marat Talipov:

[A]ny comparison with NA, including NA==NA, will return NA

From a related answer by @farnsy:

The == operator does not treat NA's as you would expect it to.

Think of NA as meaning "I don't know what's there". The correct answer
to 3 > NA is obviously NA because we don't know if the missing value
is larger than 3 or not. Well, it's the same for NA == NA. They are
both missing values but the true values could be quite different, so
the correct answer is "I don't know."

R doesn't know what you are doing in your analysis, so instead of
potentially introducing bugs that would later end up being published
and embarrassing you, it doesn't allow comparison operators to think NA
is a value.
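The behavior described in these quotes can be checked in base R, without dplyr. This is a small sketch; the vector `x` is just an illustrative example.

```r
# Comparisons involving NA propagate the missingness
3 > NA      # NA
NA == NA    # NA

# is.na() is the way to ask "is this value missing?"
x <- c(1, NA, 3)
is.na(x)    # FALSE TRUE FALSE

# A comparison on x yields NA where x is missing
cmp <- x != 1

# Logical OR resolves to TRUE when one side is TRUE, even if the other is NA
keep <- x != 1 | is.na(x)
```

This is why combining a comparison with `| is.na(...)` yields TRUE for the missing entries instead of NA.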

Designing a function so filter does not drop NAs

Try coalesce(), which replaces the NA results of the comparison with TRUE so that those rows are kept:

df %>% filter(coalesce(A != B, TRUE))
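As a rough illustration of what the coalesce() call does, here is a sketch on a made-up data frame. The values in `df` are assumptions chosen for the example; only the column names A and B come from the answer above.

```r
library(dplyr)

# Hypothetical data; the values are illustrative only
df <- data.frame(A = c(1, 2, NA, 4), B = c(1, 3, 3, NA))

# A != B is FALSE, TRUE, NA, NA; coalesce() replaces the NAs with TRUE,
# so rows with a missing value on either side are kept
res <- df %>% filter(coalesce(A != B, TRUE))
nrow(res)
```

Only the first row, where A and B are both present and equal, is filtered out.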

Why do I lose my NAs after count and filter (dplyr)?

From the help file of filter():

...Only rows where the condition evaluates to TRUE are kept...

NA != -1
[1] NA

Since your condition returns NA (and hence not TRUE) for the missing values, you need a second condition joined with OR:

df %>%
  filter(Procedure != -1 | is.na(Procedure))

When filtering with dplyr in R, why do filtered out levels of a variable remain in filtered data?

Factors in R do not automatically drop levels when filtered. You may think this is a silly default (I do), but it's easy to deal with -- just use the droplevels function on the result.

new_data <- data %>%
  filter(y == "yes") %>%
  droplevels()
levels(new_data$y)
## [1] "yes"

If you did this all the time you could define a new function

dfilter <- function(...) droplevels(filter(...))
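A short usage sketch of such a wrapper follows. The data frame `data` and its columns are hypothetical, made up for the example.

```r
library(dplyr)

# Wrapper from the answer: filter, then drop unused factor levels
dfilter <- function(...) droplevels(filter(...))

# Hypothetical data with a two-level factor
data <- data.frame(y = factor(c("yes", "no", "yes")), v = 1:3)

# Plain filter(): both levels are still present
lv_plain <- levels(filter(data, y == "yes")$y)

# dfilter(): only the level that is still used remains
lv_dropped <- levels(dfilter(data, y == "yes")$y)
```

Note that droplevels() applied to a data frame drops unused levels from every factor column, not only the one used in the filter condition.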

How to filter data.frame by a factor that includes NA as level

Check whether the level corresponding to each value of df$a is NA:

df[is.na(levels(df$a)[df$a]),]
a b
6 <NA> 0.1649003
8 <NA> 0.6556045

As Frank pointed out, this also includes observations where the value of df$a itself, not just its level, is NA. I guess the original poster wanted to include these cases. If not, one can do something like

x <- factor(c("A","B", NA), levels=c("A", NA), exclude = NULL)
i <- which(is.na(levels(x)[x]))
i[!is.na(x[i])]

This gives 3: only the element whose level is NA, leaving out the element with the unknown level (B).
