Removing NA observations with dplyr::filter()
From @Ben Bolker:
[T]his has nothing specifically to do with dplyr::filter()
From @Marat Talipov:
[A]ny comparison with NA, including NA==NA, will return NA
From a related answer by @farnsy:
The == operator does not treat NA's as you would expect it to.
Think of NA as meaning "I don't know what's there". The correct answer
to 3 > NA is obviously NA because we don't know if the missing value
is larger than 3 or not. Well, it's the same for NA == NA. They are
both missing values but the true values could be quite different, so
the correct answer is "I don't know."
R doesn't know what you are doing in your analysis, so instead of
potentially introducing bugs that would later end up being published
and embarrassing you, it doesn't allow comparison operators to think NA
is a value.
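The propagation described above can be checked directly in base R. This is a minimal sketch; the vector x and the result names are made up for illustration.

```r
# Hypothetical vector with one missing value
x <- c(1, NA, 3)

eq_result  <- NA == NA     # NA, not TRUE: both values are unknown
gt_result  <- 3 > NA       # NA: we cannot know whether 3 is larger
cmp        <- x == 1       # TRUE NA FALSE: the NA element stays NA
is_missing <- is.na(x)     # FALSE TRUE FALSE: the way to test for NA
```

To test for missingness, use is.na() rather than == NA.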
How to filter data without losing NA rows using dplyr
The documentation for dplyr::filter says: "Unlike base subsetting, rows where the condition evaluates to NA are dropped."
NA != "str" evaluates to NA, so the row is dropped by filter. !grepl("str", NA) returns TRUE, so the row is kept.
If you want filter to keep NA, you could do filter(is.na(col)|col!="str").
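A small sketch of the difference, assuming dplyr is installed. The data frame df and its columns id and col are hypothetical and only serve to illustrate the behaviour.

```r
library(dplyr)

# Hypothetical data: one row has a missing value in `col`
df <- data.frame(id = 1:4, col = c("str", "other", NA, "str"))

# The condition is NA for row 3, so filter() drops that row
dropped <- df %>% filter(col != "str")

# Explicitly allowing NA keeps row 3
kept <- df %>% filter(is.na(col) | col != "str")

# Base subsetting with an NA condition returns an all-NA row instead
base_subset <- df[df$col != "str", ]
```

Here dropped contains only id 2, kept contains ids 2 and 3, and base_subset contains id 2 plus a row of NAs.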
Ignore NA values in filtering with dplyr
We can use | with is.na:
tab %>%
  group_by(Groups) %>%
  filter(all(Value < 80 | is.na(Value)))
# A tibble: 8 x 3
# Groups: Groups [2]
# Groups Species Value
# <chr> <chr> <int>
#1 Group1 Sp1 1
#2 Group1 Sp1 4
#3 Group1 Sp2 78
#4 Group1 Sp3 NA
#5 Group1 Sp4 NA
#6 Group2 Sp2 3
#7 Group2 Sp3 9
#8 Group2 Sp4 8
The issue in OP's code was that when we wrap Value < 80 in all, the comparison returns NA for those values that are NA. If none of the other comparisons is FALSE, all then also returns NA instead of a logical TRUE/FALSE, and filter drops rows where the condition is NA by default.
To understand better, check the output of
tab %>%
  group_by(Groups) %>%
  mutate(ind = all(Value < 80))
and the difference here
tab %>%
  group_by(Groups) %>%
  mutate(ind = all(Value < 80 | is.na(Value)))
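The same effect can be seen without any grouping, using all on plain vectors. This is a minimal base R sketch; the vector vals is made up and loosely mirrors Group1 above.

```r
# all() with a missing value and no FALSE: the result is unknown
v_na <- all(c(TRUE, TRUE, NA))       # NA

# A single FALSE settles the answer regardless of the NA
v_false <- all(c(TRUE, FALSE, NA))   # FALSE

# Hypothetical values similar to Group1
vals <- c(1L, 78L, NA)
v_plain <- all(vals < 80)                 # NA, so filter() would drop the group
v_fixed <- all(vals < 80 | is.na(vals))   # TRUE, so the group is kept
```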
Or using data.table
library(data.table)
setDT(tab)[, .SD[all(Value < 80 | is.na(Value))], Groups]
Or using base R
tab[with(tab, ave(Value < 80 | is.na(Value), Groups, FUN = all)),]
Update
For the second dataset,
tab1 %>%
  group_by(Groups) %>%
  filter(all(Value < 80 | is.na(Value)),
         all(na.omit(abs(sp - mrca)) %in% 0:9))
data
tab <- structure(list(Groups = c("Group1", "Group1", "Group1", "Group1",
"Group1", "Group2", "Group2", "Group2", "Group3", "Group3", "Group3",
"Group3"), Species = c("Sp1", "Sp1", "Sp2", "Sp3", "Sp4", "Sp2",
"Sp3", "Sp4", "Sp1", "Sp3", "Sp3", "Sp3"), Value = c(1L, 4L,
78L, NA, NA, 3L, 9L, 8L, 9L, 10L, 110L, 14L)),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"))
tab1 <- structure(list(Groups = c("Group1", "Group1", "Group1", "Group1",
"Group1", "Group2", "Group2", "Group2", "Group3", "Group3", "Group3",
"Group3"), Species = c("Sp1", "Sp1", "Sp2", "Sp3", "Sp4", "Sp2",
"Sp3", "Sp4", "Sp1", "Sp3", "Sp3", "Sp3"), Value = c(1L, 4L,
78L, NA, NA, 3L, 9L, 8L, 9L, 10L, 110L, 14L), sp = c(3L, 3L,
NA, 3L, 3L, 2L, 2L, NA, 2L, 3L, 3L, 2L), mrca = c(3L, 3L, NA,
12L, 3L, 3L, 40L, NA, 2L, 3L, 2L, 3L)),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"))
Remove rows with all or some NAs (missing values) in data.frame
Also check complete.cases:
> final[complete.cases(final), ]
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
6 ENSG00000221312 0 1 2 3 2
na.omit is nicer for just removing all NA's. complete.cases allows partial selection by including only certain columns of the dataframe:
> final[complete.cases(final[ , 5:6]),]
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
4 ENSG00000207604 0 NA NA 1 2
6 ENSG00000221312 0 1 2 3 2
Your solution can't work. If you insist on using is.na, then you have to do something like:
> final[rowSums(is.na(final[ , 5:6])) == 0, ]
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
4 ENSG00000207604 0 NA NA 1 2
6 ENSG00000221312 0 1 2 3 2
but using complete.cases is quite a lot clearer, and faster.
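Since the final data frame is not shown in full, here is a small self-contained sketch with a hypothetical data frame toy (the gene names and values are invented) showing that the complete.cases and rowSums(is.na(...)) approaches select the same rows.

```r
# Hypothetical stand-in for the data discussed above
toy <- data.frame(
  gene = c("g1", "g2", "g3"),
  mmus = c(NA, 2, NA),
  rnor = c(1, 2, NA),
  cfam = c(2, 2, 1)
)

# Keep rows with no NA in any column
all_cols <- toy[complete.cases(toy), ]

# Keep rows with no NA in selected columns only
some_cols <- toy[complete.cases(toy[, c("rnor", "cfam")]), ]

# The same selection written with is.na
via_is_na <- toy[rowSums(is.na(toy[, c("rnor", "cfam")])) == 0, ]
```

In this sketch all_cols keeps only g2, while some_cols and via_is_na both keep g1 and g2.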
Remove rows where all variables are NA using dplyr
Since dplyr 0.7.0, new scoped filtering verbs exist. Using filter_all with any_vars you can easily keep rows with at least one non-missing column:
# dplyr 0.7.0
dat %>% filter_all(any_vars(!is.na(.)))
Using @hejseb's benchmarking algorithm, it appears that this solution is as efficient as f4.
UPDATE:
Since dplyr 1.0.0 the above scoped verbs are superseded. Instead, the across function family was introduced, which lets you apply a function to multiple (or all) columns. Filtering rows with at least one non-NA column now looks like this:
# dplyr >= 1.0.4 (if_any was added in 1.0.4)
dat %>% filter(if_any(everything(), ~ !is.na(.)))
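For comparison, here is a sketch contrasting if_any with if_all, plus a base R equivalent. It assumes dplyr 1.0.4 or later is installed; the data frame dat below is a made-up example, not the one from the question.

```r
library(dplyr)

# Hypothetical data: row 2 is NA in every column
dat <- data.frame(a = c(1, NA, NA), b = c(NA, NA, 2))

# Keep rows where at least one column is not NA (drops only row 2)
not_all_na <- dat %>% filter(if_any(everything(), ~ !is.na(.)))

# Keep rows where every column is not NA (drops every row here)
no_na <- dat %>% filter(if_all(everything(), ~ !is.na(.)))

# Base R equivalent of the if_any version
base_equiv <- dat[rowSums(!is.na(dat)) > 0, ]
```

if_any removes only the rows that are entirely NA, whereas if_all behaves like na.omit on the selected columns.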
Removing NA in dplyr pipe
I don't think desc takes an na.rm argument... I'm actually surprised it doesn't throw an error when you give it one. If you just want to remove NAs, use na.omit (base) or tidyr::drop_na:
outcome.df %>%
  na.omit() %>%
  group_by(Hospital, State) %>%
  arrange(desc(HeartAttackDeath)) %>%
  head()

library(tidyr)
outcome.df %>%
  drop_na() %>%
  group_by(Hospital, State) %>%
  arrange(desc(HeartAttackDeath)) %>%
  head()
If you only want to remove NAs from the HeartAttackDeath column, filter with is.na, or use tidyr::drop_na:
outcome.df %>%
  filter(!is.na(HeartAttackDeath)) %>%
  group_by(Hospital, State) %>%
  arrange(desc(HeartAttackDeath)) %>%
  head()

outcome.df %>%
  drop_na(HeartAttackDeath) %>%
  group_by(Hospital, State) %>%
  arrange(desc(HeartAttackDeath)) %>%
  head()
As pointed out at the dupe, complete.cases can also be used, but it's a bit trickier to put in a chain because it takes a data frame as an argument but returns an index vector. So you could use it like this:
outcome.df %>%
  filter(complete.cases(.)) %>%
  group_by(Hospital, State) %>%
  arrange(desc(HeartAttackDeath)) %>%
  head()