Removing Na Observations with Dplyr::Filter()

Removing NA observations with dplyr::filter()

From @Ben Bolker:

[T]his has nothing specifically to do with dplyr::filter()

From @Marat Talipov:

[A]ny comparison with NA, including NA==NA, will return NA

From a related answer by @farnsy:

The == operator does not treat NA's as you would expect it to.

Think of NA as meaning "I don't know what's there". The correct answer
to 3 > NA is obviously NA because we don't know if the missing value
is larger than 3 or not. Well, it's the same for NA == NA. They are
both missing values but the true values could be quite different, so
the correct answer is "I don't know."

R doesn't know what you are doing in your analysis, so instead of
potentially introducing bugs that would later end up being published
an embarrassing you, it doesn't allow comparison operators to think NA
is a value.

How to filter data without losing NA rows using dplyr

The documentation for dplyr::filter says... "Unlike base subsetting, rows where the condition evaluates to NA are dropped."

NA != "str" evaluates to NA so is dropped by filter.

!grepl("str", NA) returns TRUE, so is kept.

If you want filter to keep NA, you could do filter(is.na(col)|col!="str")

Ignore NA values in filtering with dplyr

We can use and | with is.na

tab %>%
group_by(Groups) %>%
filter(all(Value < 80 |is.na(Value)))
# A tibble: 8 x 3
# Groups: Groups [2]
# Groups Species Value
# <chr> <chr> <int>
#1 Group1 Sp1 1
#2 Group1 Sp1 4
#3 Group1 Sp2 78
#4 Group1 Sp3 NA
#5 Group1 Sp4 NA
#6 Group2 Sp2 3
#7 Group2 Sp3 9
#8 Group2 Sp4 8

The issue in OP's code was that when we wrap all with Value < 80, the comparison returns NA for those values that are NA and now the all also returns NA instead of a logical TRUE/FALSE and in filter, it automatically drops the NA by default

To understand better, check the output of

tab %>% 
group_by(Groups) %>%
mutate(ind = all(Value < 80))

and the difference here

tab %>% 
group_by(Groups) %>%
mutate(ind = all(Value < 80| is.na(Value)))

Or using data.table

library(data.table)
setDT(tab)[, .SD[all(Value < 80 | is.na(Value))], Groups]

Or using base R

tab[with(tab, ave(Value < 80  | is.na(Value), Groups, FUN = all)),]

Update

For the second dataset,

tab1 %>% 
group_by(Groups) %>%
filter(all(Value < 80 |is.na(Value)),
all(na.omit(abs(sp-mrca)) %in% 0:9))

data

tab <- structure(list(Groups = c("Group1", "Group1", "Group1", "Group1", 
"Group1", "Group2", "Group2", "Group2", "Group3", "Group3", "Group3",
"Group3"), Species = c("Sp1", "Sp1", "Sp2", "Sp3", "Sp4", "Sp2",
"Sp3", "Sp4", "Sp1", "Sp3", "Sp3", "Sp3"), Value = c(1L, 4L,
78L, NA, NA, 3L, 9L, 8L, 9L, 10L, 110L, 14L)),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"))



tab1 <- structure(list(Groups = c("Group1", "Group1", "Group1", "Group1",
"Group1", "Group2", "Group2", "Group2", "Group3", "Group3", "Group3",
"Group3"), Species = c("Sp1", "Sp1", "Sp2", "Sp3", "Sp4", "Sp2",
"Sp3", "Sp4", "Sp1", "Sp3", "Sp3", "Sp3"), Value = c(1L, 4L,
78L, NA, NA, 3L, 9L, 8L, 9L, 10L, 110L, 14L), sp = c(3L, 3L,
NA, 3L, 3L, 2L, 2L, NA, 2L, 3L, 3L, 2L), mrca = c(3L, 3L, NA,
12L, 3L, 3L, 40L, NA, 2L, 3L, 2L, 3L)),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"))

Remove rows with all or some NAs (missing values) in data.frame

Also check complete.cases :

> final[complete.cases(final), ]
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
6 ENSG00000221312 0 1 2 3 2

na.omit is nicer for just removing all NA's. complete.cases allows partial selection by including only certain columns of the dataframe:

> final[complete.cases(final[ , 5:6]),]
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
4 ENSG00000207604 0 NA NA 1 2
6 ENSG00000221312 0 1 2 3 2

Your solution can't work. If you insist on using is.na, then you have to do something like:

> final[rowSums(is.na(final[ , 5:6])) == 0, ]
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
4 ENSG00000207604 0 NA NA 1 2
6 ENSG00000221312 0 1 2 3 2

but using complete.cases is quite a lot more clear, and faster.

Remove rows where all variables are NA using dplyr

Since dplyr 0.7.0 new, scoped filtering verbs exists. Using filter_any you can easily filter rows with at least one non-missing column:

# dplyr 0.7.0
dat %>% filter_all(any_vars(!is.na(.)))

Using @hejseb benchmarking algorithm it appears that this solution is as efficient as f4.

UPDATE:

Since dplyr 1.0.0 the above scoped verbs are superseded. Instead the across function family was introduced, which allows to perform a function on multiple (or all) columns. Filtering rows with at least one column being not NA looks now like this:

# dplyr 1.0.0
dat %>% filter(if_any(everything(), ~ !is.na(.)))

Removing NA in dplyr pipe

I don't think desc takes an na.rm argument... I'm actually surprised it doesn't throw an error when you give it one. If you just want to remove NAs, use na.omit (base) or tidyr::drop_na:

outcome.df %>%
na.omit() %>%
group_by(Hospital, State) %>%
arrange(desc(HeartAttackDeath)) %>%
head()

library(tidyr)
outcome.df %>%
drop_na() %>%
group_by(Hospital, State) %>%
arrange(desc(HeartAttackDeath)) %>%
head()

If you only want to remove NAs from the HeartAttackDeath column, filter with is.na, or use tidyr::drop_na:

outcome.df %>%
filter(!is.na(HeartAttackDeath)) %>%
group_by(Hospital, State) %>%
arrange(desc(HeartAttackDeath)) %>%
head()

outcome.df %>%
drop_na(HeartAttackDeath) %>%
group_by(Hospital, State) %>%
arrange(desc(HeartAttackDeath)) %>%
head()

As pointed out at the dupe, complete.cases can also be used, but it's a bit trickier to put in a chain because it takes a data frame as an argument but returns an index vector. So you could use it like this:

outcome.df %>%
filter(complete.cases(.)) %>%
group_by(Hospital, State) %>%
arrange(desc(HeartAttackDeath)) %>%
head()


Related Topics



Leave a reply



Submit