How to filter data without losing NA rows using dplyr
The documentation for dplyr::filter says: "Unlike base subsetting, rows where the condition evaluates to NA are dropped."

NA != "str" evaluates to NA and so is dropped by filter. !grepl("str", NA) returns TRUE, so the row is kept. If you want filter to keep NA rows, you could do filter(is.na(col) | col != "str").
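A quick check at the console illustrates both behaviours; the column name col and the toy data frame below are made up for illustration:

```r
library(dplyr)

# Comparisons with NA yield NA, which filter() drops:
NA != "str"        # NA
# grepl() returns FALSE for NA input, so its negation is TRUE:
!grepl("str", NA)  # TRUE

# Toy data frame (hypothetical)
df <- data.frame(col = c("str", "other", NA))

df %>% filter(col != "str")               # NA row dropped: only "other"
df %>% filter(is.na(col) | col != "str")  # NA row kept: "other" and NA
```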
Ignore NA values in filtering with dplyr
We can use | together with is.na
tab %>%
group_by(Groups) %>%
filter(all(Value < 80 |is.na(Value)))
# A tibble: 8 x 3
# Groups: Groups [2]
# Groups Species Value
# <chr> <chr> <int>
#1 Group1 Sp1 1
#2 Group1 Sp1 4
#3 Group1 Sp2 78
#4 Group1 Sp3 NA
#5 Group1 Sp4 NA
#6 Group2 Sp2 3
#7 Group2 Sp3 9
#8 Group2 Sp4 8
The issue in the OP's code was that when all
is wrapped around Value < 80
, the comparison returns NA
for the values that are NA
, so all
also returns NA
instead of a logical TRUE/FALSE, and filter
automatically drops NA
by default.
To understand better, check the output of
tab %>%
group_by(Groups) %>%
mutate(ind = all(Value < 80))
and the difference here
tab %>%
group_by(Groups) %>%
mutate(ind = all(Value < 80| is.na(Value)))
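The key is how all() handles NA in base R: it returns NA unless a FALSE is present, and OR-ing with is.na() settles the result. A minimal base-R sketch:

```r
# all() returns NA when the vector contains NA but no FALSE:
all(c(TRUE, TRUE, NA))    # NA    -- could still be all-TRUE, so "don't know"
# A single FALSE decides the answer regardless of any NAs:
all(c(TRUE, FALSE, NA))   # FALSE
# OR-ing with is.na() rewrites the NA positions to TRUE first:
x <- c(TRUE, TRUE, NA)
all(x | is.na(x))         # TRUE
```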
Or using data.table
library(data.table)
setDT(tab)[, .SD[all(Value < 80 | is.na(Value))], Groups]
Or using base R
tab[with(tab, ave(Value < 80 | is.na(Value), Groups, FUN = all)),]
Update
For the second dataset,
tab1 %>%
group_by(Groups) %>%
filter(all(Value < 80 |is.na(Value)),
all(na.omit(abs(sp-mrca)) %in% 0:9))
data
tab <- structure(list(Groups = c("Group1", "Group1", "Group1", "Group1",
"Group1", "Group2", "Group2", "Group2", "Group3", "Group3", "Group3",
"Group3"), Species = c("Sp1", "Sp1", "Sp2", "Sp3", "Sp4", "Sp2",
"Sp3", "Sp4", "Sp1", "Sp3", "Sp3", "Sp3"), Value = c(1L, 4L,
78L, NA, NA, 3L, 9L, 8L, 9L, 10L, 110L, 14L)),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"))
tab1 <- structure(list(Groups = c("Group1", "Group1", "Group1", "Group1",
"Group1", "Group2", "Group2", "Group2", "Group3", "Group3", "Group3",
"Group3"), Species = c("Sp1", "Sp1", "Sp2", "Sp3", "Sp4", "Sp2",
"Sp3", "Sp4", "Sp1", "Sp3", "Sp3", "Sp3"), Value = c(1L, 4L,
78L, NA, NA, 3L, 9L, 8L, 9L, 10L, 110L, 14L), sp = c(3L, 3L,
NA, 3L, 3L, 2L, 2L, NA, 2L, 3L, 3L, 2L), mrca = c(3L, 3L, NA,
12L, 3L, 3L, 40L, NA, 2L, 3L, 2L, 3L)),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"))
How do I group by then filter without losing rows from data in R?
Grouped by 'full_name', filter
rows whose 'type' corresponds to the max
'count' value where 'year' is 2021, and then slice
the top 2 rows ordered by 'year':
library(dplyr)
sample_df %>%
group_by(full_name) %>%
filter(type %in% type[year == 2021][which.max(count[year == 2021])]) %>%
slice_max(order_by= year, n = 2) %>%
ungroup %>%
arrange(factor(full_name, levels = unique(sample_df$full_name)))
-output
# A tibble: 6 x 6
# year full_name type count avg_score max
# <int> <chr> <chr> <chr> <dbl> <dbl>
#1 2021 Jason Valdez Humanities 59 97 99
#2 2020 Jason Valdez Humanities 231 96 98
#3 2021 Eric Collins Sciences 21 92 93
#4 2019 Eric Collins Sciences 831 94 97
#5 2021 Sebastian Goldberg Sciences 41 93 96
#6 2020 Sebastian Goldberg Sciences 476 94 98
dplyr filter function is dropping rows corresponding to empty values
The problem is that (NA != "AA")
is NA
, not TRUE
, so you need to explicitly check for NA
, and include those rows.
library(dplyr, warn.conflicts = FALSE)
data <- data.frame(row=seq(9), sortingvar=c("AA", NA, "BB"))
data %>% filter(sortingvar != "AA")
#> row sortingvar
#> 1 3 BB
#> 2 6 BB
#> 3 9 BB
data %>% filter(sortingvar != "AA" | is.na(sortingvar))
#> row sortingvar
#> 1 2 <NA>
#> 2 3 BB
#> 3 5 <NA>
#> 4 6 BB
#> 5 8 <NA>
#> 6 9 BB
Created on 2022-03-11 by the reprex package (v2.0.1)
Designing a function so filter does not drop NAs
Try coalesce
df %>% filter(coalesce(A != B, TRUE))
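Wrapped into a reusable helper (the name keep_na is made up here), coalesce() replaces an NA condition result with TRUE so filter() keeps those rows; a data frame df with columns A and B is assumed:

```r
library(dplyr)

# Hypothetical helper: treat an NA condition result as TRUE
keep_na <- function(cond) coalesce(cond, TRUE)

df <- data.frame(A = c(1, 2, NA), B = c(1, 3, 3))
df %>% filter(keep_na(A != B))  # keeps row 2 (2 != 3) and row 3 (NA)
```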
Removing NA observations with dplyr::filter()
From @Ben Bolker:
[T]his has nothing specifically to do with dplyr::filter()
From @Marat Talipov:
[A]ny comparison with NA, including NA==NA, will return NA
From a related answer by @farnsy:
The == operator does not treat NA's as you would expect it to.
Think of NA as meaning "I don't know what's there". The correct answer
to 3 > NA is obviously NA because we don't know if the missing value
is larger than 3 or not. Well, it's the same for NA == NA. They are
both missing values but the true values could be quite different, so
the correct answer is "I don't know." R doesn't know what you are doing
in your analysis, so instead of potentially introducing bugs that would
later end up being published and embarrassing you, it doesn't allow
comparison operators to treat NA as a value.
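The quoted rule is easy to verify at the console; is.na() is the reliable way to test for missingness:

```r
NA == NA    # NA   -- two unknowns may or may not be equal
3 > NA      # NA   -- can't compare against an unknown
is.na(NA)   # TRUE -- use is.na(), not ==, to detect missing values
```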
Why does dplyr's filter drop NA values from a factor variable?
You could use this:
filter(dat, var1 != 1 | is.na(var1))
var1
1 <NA>
2 3
3 3
4 <NA>
5 2
6 2
7 <NA>
And it won't drop the NA rows.
Also just for completion, dropping NAs is the intended behavior of filter
as you can see from the following:
test_that("filter discards NA", {
temp <- data.frame(
i = 1:5,
x = c(NA, 1L, 1L, 0L, 0L)
)
res <- filter(temp, x == 1)
expect_equal(nrow(res), 2L)
})
The test above was taken from the tests for filter
on GitHub.
Subset dataframe in R, dplyr filter row values of column A not NA in row of column B
Maybe this will achieve your goal. If all participants have all StudyDay
timepoints, and you just want to require a non-missing value on day 2 or day 4, you can check the Ab
values at those time points in your filter
. In this case, an ID
will be omitted if Ab is NA
on both days 2 and 4 (in this example, ID "D").
Alternatively, if you want to require that both values are available for days 2 and 4, you can use &
(AND) instead of |
(OR).
library(dplyr)
fakedat %>%
group_by(ID) %>%
filter(!is.na(Ab[StudyDay == 2]) | !is.na(Ab[StudyDay == 4]))
If you have multiple days to check for missing values, you can use all
and check Ab for NA
where the StudyDay
is %in%
a vector of required days as follows:
required_vals <- c(2, 4)
fakedat %>%
group_by(ID) %>%
filter(all(!is.na(Ab[StudyDay %in% required_vals])))
Output
ID StudyDay Ab
<chr> <dbl> <dbl>
1 A 1 10
2 A 2 NA
3 A 3 15
4 A 4 10
5 B 1 10
6 B 2 20
7 B 3 10
8 B 4 NA
9 C 1 10
10 C 2 10
11 C 3 NA
12 C 4 30
13 E 1 10
14 E 2 20
15 E 3 10
16 E 4 30
17 F 1 NA
18 F 2 10
19 F 3 NA
20 F 4 20
dplyr filter removing NA when that was not specified
This is the default behavior: R simply does not know if NA == ''
is TRUE
or FALSE
NA == ""
[1] NA
Therefore the third row is not returned.
If you want to include NA
as well, there are several workarounds:
df %>% filter(coalesce(col1, "x") != "")
df %>% filter(col1 != "" | is.na(col1))
Personally, I prefer the first way: coalesce
substitutes NA
with a default value (here "x"
) and then checks if the substituted value is equal to ""
.
In dplyr filter rows with number of NA bigger than one
We can use rowSums
to create a logical expression within filter
library(dplyr)
airquality %>%
filter(rowSums(is.na(.)) > 1)
-output
# Ozone Solar.R Wind Temp Month Day
#1 NA NA 14.3 56 5 5
#2 NA NA 8.0 57 5 27
Or with rowwise
and c_across
airquality %>%
rowwise %>%
filter(sum(is.na(c_across(everything()))) > 1)
# A tibble: 2 x 6
# Rowwise:
# Ozone Solar.R Wind Temp Month Day
# <int> <int> <dbl> <int> <int> <int>
#1 NA NA 14.3 56 5 5
#2 NA NA 8 57 5 27