How to filter data without losing NA rows using dplyr
The documentation for dplyr::filter says: "Unlike base subsetting, rows where the condition evaluates to NA are dropped."

NA != "str" evaluates to NA and so is dropped by filter. !grepl("str", NA) returns TRUE, so the row is kept. If you want filter to keep NA rows, you could do filter(is.na(col) | col != "str").
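A quick check at the console illustrates both behaviours; the column name col and the toy data frame below are made up for illustration:

```r
library(dplyr)

# Comparisons with NA yield NA, which filter() drops:
NA != "str"        # NA
# grepl() returns FALSE for NA input, so its negation is TRUE:
!grepl("str", NA)  # TRUE

# Toy data frame (hypothetical)
df <- data.frame(col = c("str", "other", NA))

df %>% filter(col != "str")               # NA row dropped: only "other"
df %>% filter(is.na(col) | col != "str")  # NA row kept: "other" and NA
```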
Ignore NA values in filtering with dplyr
We can use | together with is.na
tab %>%
group_by(Groups) %>%
filter(all(Value < 80 |is.na(Value)))
# A tibble: 8 x 3
# Groups: Groups [2]
# Groups Species Value
# <chr> <chr> <int>
#1 Group1 Sp1 1
#2 Group1 Sp1 4
#3 Group1 Sp2 78
#4 Group1 Sp3 NA
#5 Group1 Sp4 NA
#6 Group2 Sp2 3
#7 Group2 Sp3 9
#8 Group2 Sp4 8
The issue in the OP's code was that when all
is wrapped around Value < 80
, the comparison returns NA
for the values that are NA
, so all
also returns NA
instead of a logical TRUE/FALSE, and filter
automatically drops NA
by default.
To understand better, check the output of
tab %>%
group_by(Groups) %>%
mutate(ind = all(Value < 80))
and the difference here
tab %>%
group_by(Groups) %>%
mutate(ind = all(Value < 80| is.na(Value)))
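The key is how all() handles NA in base R: it returns NA unless a FALSE is present, and OR-ing with is.na() settles the result. A minimal base-R sketch:

```r
# all() returns NA when the vector contains NA but no FALSE:
all(c(TRUE, TRUE, NA))    # NA    -- could still be all-TRUE, so "don't know"
# A single FALSE decides the answer regardless of any NAs:
all(c(TRUE, FALSE, NA))   # FALSE
# OR-ing with is.na() rewrites the NA positions to TRUE first:
x <- c(TRUE, TRUE, NA)
all(x | is.na(x))         # TRUE
```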
Or using data.table
library(data.table)
setDT(tab)[, .SD[all(Value < 80 | is.na(Value))], Groups]
Or using base R
tab[with(tab, ave(Value < 80 | is.na(Value), Groups, FUN = all)),]
Update
For the second dataset,
tab1 %>%
group_by(Groups) %>%
filter(all(Value < 80 |is.na(Value)),
all(na.omit(abs(sp-mrca)) %in% 0:9))
data
tab <- structure(list(Groups = c("Group1", "Group1", "Group1", "Group1",
"Group1", "Group2", "Group2", "Group2", "Group3", "Group3", "Group3",
"Group3"), Species = c("Sp1", "Sp1", "Sp2", "Sp3", "Sp4", "Sp2",
"Sp3", "Sp4", "Sp1", "Sp3", "Sp3", "Sp3"), Value = c(1L, 4L,
78L, NA, NA, 3L, 9L, 8L, 9L, 10L, 110L, 14L)),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"))
tab1 <- structure(list(Groups = c("Group1", "Group1", "Group1", "Group1",
"Group1", "Group2", "Group2", "Group2", "Group3", "Group3", "Group3",
"Group3"), Species = c("Sp1", "Sp1", "Sp2", "Sp3", "Sp4", "Sp2",
"Sp3", "Sp4", "Sp1", "Sp3", "Sp3", "Sp3"), Value = c(1L, 4L,
78L, NA, NA, 3L, 9L, 8L, 9L, 10L, 110L, 14L), sp = c(3L, 3L,
NA, 3L, 3L, 2L, 2L, NA, 2L, 3L, 3L, 2L), mrca = c(3L, 3L, NA,
12L, 3L, 3L, 40L, NA, 2L, 3L, 2L, 3L)),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"))
How do I group by then filter without losing rows from data in R?
Grouped by 'full_name', filter
rows whose 'type' corresponds to the max
'count' value where 'year' is 2021, and then slice
the top 2 rows ordered by 'year':
library(dplyr)
sample_df %>%
group_by(full_name) %>%
filter(type %in% type[year == 2021][which.max(count[year == 2021])]) %>%
slice_max(order_by= year, n = 2) %>%
ungroup %>%
arrange(factor(full_name, levels = unique(sample_df$full_name)))
-output
# A tibble: 6 x 6
# year full_name type count avg_score max
# <int> <chr> <chr> <chr> <dbl> <dbl>
#1 2021 Jason Valdez Humanities 59 97 99
#2 2020 Jason Valdez Humanities 231 96 98
#3 2021 Eric Collins Sciences 21 92 93
#4 2019 Eric Collins Sciences 831 94 97
#5 2021 Sebastian Goldberg Sciences 41 93 96
#6 2020 Sebastian Goldberg Sciences 476 94 98
dplyr filter function is dropping rows corresponding to empty values
The problem is that (NA != "AA")
is NA
, not TRUE
, so you need to explicitly check for NA
, and include those rows.
library(dplyr, warn.conflicts = FALSE)
data <- data.frame(row=seq(9), sortingvar=c("AA", NA, "BB"))
data %>% filter(sortingvar != "AA")
#> row sortingvar
#> 1 3 BB
#> 2 6 BB
#> 3 9 BB
data %>% filter(sortingvar != "AA" | is.na(sortingvar))
#> row sortingvar
#> 1 2 <NA>
#> 2 3 BB
#> 3 5 <NA>
#> 4 6 BB
#> 5 8 <NA>
#> 6 9 BB
Created on 2022-03-11 by the reprex package (v2.0.1)
Designing a function so filter does not drop NAs
Try coalesce
df %>% filter(coalesce(A != B, TRUE))
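Wrapped into a reusable helper (the name keep_na is made up here), coalesce() replaces an NA condition result with TRUE so filter() keeps those rows; a data frame df with columns A and B is assumed:

```r
library(dplyr)

# Hypothetical helper: treat an NA condition result as TRUE
keep_na <- function(cond) coalesce(cond, TRUE)

df <- data.frame(A = c(1, 2, NA), B = c(1, 3, 3))
df %>% filter(keep_na(A != B))  # keeps row 2 (2 != 3) and row 3 (NA)
```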
Removing NA observations with dplyr::filter()
From @Ben Bolker:
[T]his has nothing specifically to do with dplyr::filter()
From @Marat Talipov:
[A]ny comparison with NA, including NA==NA, will return NA
From a related answer by @farnsy:
The == operator does not treat NA's as you would expect it to.
Think of NA as meaning "I don't know what's there". The correct answer
to 3 > NA is obviously NA because we don't know if the missing value
is larger than 3 or not. Well, it's the same for NA == NA. They are
both missing values but the true values could be quite different, so
the correct answer is "I don't know." R doesn't know what you are doing
in your analysis, so instead of potentially introducing bugs that would
later end up being published and embarrassing you, it doesn't allow
comparison operators to treat NA as a value.
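The quoted rule is easy to verify at the console; is.na() is the reliable way to test for missingness:

```r
NA == NA    # NA   -- two unknowns may or may not be equal
3 > NA      # NA   -- can't compare against an unknown
is.na(NA)   # TRUE -- use is.na(), not ==, to detect missing values
```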
Why does dplyr's filter drop NA values from a factor variable?
You could use this:
filter(dat, var1 != 1 | is.na(var1))
var1
1 <NA>
2 3
3 3
4 <NA>
5 2
6 2
7 <NA>
And it won't drop the NA rows.
Also just for completion, dropping NAs is the intended behavior of filter
as you can see from the following:
test_that("filter discards NA", {
temp <- data.frame(
i = 1:5,
x = c(NA, 1L, 1L, 0L, 0L)
)
res <- filter(temp, x == 1)
expect_equal(nrow(res), 2L)
})
The test above was taken from the tests for filter
on GitHub.
Subset dataframe in R, dplyr filter row values of column A not NA in row of column B
Maybe this will achieve your goal. If all participants have all StudyDay
timepoints, and you just want to require a non-missing value on day 2 or day 4, you can check the Ab
values at those time points in your filter
. In this case, an ID
will be omitted if Ab is NA
on both days 2 and 4 (in this example, ID "D").
Alternatively, if you want to require that both values are available for days 2 and 4, you can use &
(AND) instead of |
(OR).
library(dplyr)
fakedat %>%
group_by(ID) %>%
filter(!is.na(Ab[StudyDay == 2]) | !is.na(Ab[StudyDay == 4]))
If you have multiple days to check for missing values, you can use all
and check Ab for NA
where the StudyDay
is %in%
a vector of required days as follows:
required_vals <- c(2, 4)
fakedat %>%
group_by(ID) %>%
filter(all(!is.na(Ab[StudyDay %in% required_vals])))
Output
ID StudyDay Ab
<chr> <dbl> <dbl>
1 A 1 10
2 A 2 NA
3 A 3 15
4 A 4 10
5 B 1 10
6 B 2 20
7 B 3 10
8 B 4 NA
9 C 1 10
10 C 2 10
11 C 3 NA
12 C 4 30
13 E 1 10
14 E 2 20
15 E 3 10
16 E 4 30
17 F 1 NA
18 F 2 10
19 F 3 NA
20 F 4 20
dplyr filter removing NA when that was not specified
This is the default behavior: R simply does not know if NA == ''
is TRUE
or FALSE
NA == ""
[1] NA
Therefore the third row is not returned.
If you want to include NA
as well, there are several workarounds:
df %>% filter(coalesce(col1, "x") != "")
df %>% filter(col1 != "" | is.na(col1))
Personally, I prefer the first way: coalesce
substitutes NA
with a default value (here "x"
) and then checks if the substituted value is equal to ""
.
In dplyr filter rows with number of NA bigger than one
We can use rowSums
to create a logical expression within filter
library(dplyr)
airquality %>%
filter(rowSums(is.na(.)) > 1)
-output
# Ozone Solar.R Wind Temp Month Day
#1 NA NA 14.3 56 5 5
#2 NA NA 8.0 57 5 27
Or with rowwise
and c_across
airquality %>%
rowwise %>%
filter(sum(is.na(c_across(everything()))) > 1)
# A tibble: 2 x 6
# Rowwise:
# Ozone Solar.R Wind Temp Month Day
# <int> <int> <dbl> <int> <int> <int>
#1 NA NA 14.3 56 5 5
#2 NA NA 8 57 5 27