Using filter() with across() to keep all rows of a data frame that include a missing value for any variable

This is possible as of dplyr 1.0.4. The new if_any() replaces across() for the filtering use case.

library(dplyr)

df <- tribble(
  ~id, ~x, ~y,
    1,  1,  0,
    2,  1,  1,
    3, NA,  1,
    4,  0,  0,
    5,  1, NA
)

df %>%
  filter(if_any(everything(), is.na))
#> # A tibble: 2 x 3
#> id x y
#> <dbl> <dbl> <dbl>
#> 1 3 NA 1
#> 2 5 1 NA

Created on 2021-02-10 by the reprex package (v0.3.0)

See here for more details: https://www.tidyverse.org/blog/2021/02/dplyr-1-0-4-if-any/
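For comparison, the same "keep rows with any missing value" filter can be sketched in base R, which may help when dplyr 1.0.4 or later is not available. This is a minimal illustration, not part of the original answer. It rebuilds the example data as a plain data.frame, and the names rows_with_na and rows_with_na2 are made up here.

```r
# Base R sketch of the same filter; no packages needed.
# The data mirrors the tribble in the answer above.
df <- data.frame(
  id = 1:5,
  x  = c(1, 1, NA, 0, 1),
  y  = c(0, 1, 1, 0, NA)
)

# complete.cases() is TRUE for rows without any NA, so negate it
rows_with_na <- df[!complete.cases(df), ]

# Equivalent: count the NAs per row and keep rows with at least one
rows_with_na2 <- df[rowSums(is.na(df)) > 0, ]

print(rows_with_na)
```

Both forms select the rows with id 3 and 5, matching the if_any() result.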

Filter data.frame rows where all columns are NA but keep rows where only some are NA

We can use base R

teste[rowSums(!is.na(teste)) >0,]
# a b c
#1 1 NA 1
#3 3 3 3
#4 NA 4 4

Or using apply and any

teste[apply(!is.na(teste), 1, any),]

The rowSums condition can also be used within filter

teste %>%
  filter(rowSums(!is.na(.)) > 0)

Or using c_across from dplyr, we can directly remove the rows with all NA

library(dplyr)
teste %>%
  rowwise() %>%
  filter(!all(is.na(c_across(everything()))))
# A tibble: 3 x 3
# Rowwise:
# a b c
# <dbl> <dbl> <dbl>
#1 1 NA 1
#2 3 3 3
#3 NA 4 4

NOTE: filter_all has been superseded (since dplyr 1.0.0) in favor of if_any/if_all inside filter
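As a rough sketch of what the filter_all replacement could look like, the same "drop rows that are entirely NA" filter can be written with if_any (dplyr 1.0.4 or later). The answer does not show how teste was built, so the data frame below is a reconstruction from the printed output, with row 2 assumed to be all NA.

```r
library(dplyr)

# Hypothetical reconstruction of `teste`; row 2 is assumed to be all NA
teste <- tibble(
  a = c(1, NA, 3, NA),
  b = c(NA, NA, 3, 4),
  c = c(1, NA, 3, 4)
)

# Keep rows where at least one column is not NA (needs dplyr >= 1.0.4)
kept <- teste %>%
  filter(if_any(everything(), ~ !is.na(.x)))

# Base R cross-check using the rowSums approach from the answer
kept_base <- teste[rowSums(!is.na(teste)) > 0, ]

print(kept)
```

Unlike the rowwise/c_across version, this returns an ordinary (non-rowwise) tibble, so no ungroup() step is needed afterwards.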

How to use dplyr across to filter NA in multiple columns

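We can use across to loop over the columns 'type' and 'company' and return the rows that don't have any NA in the specified columns. Newer dplyr releases deprecate across inside filter in favor of if_all, which expresses the same condition.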

library(dplyr)
df %>%
  filter(across(c(type, company), ~ !is.na(.)))
# id type company
#1 3 North Alex
#2 NA North BDA

With filter, there are two options, if_any and if_all, that are similar to any_vars/all_vars used with filter_at/filter_all. if_any keeps the rows where at least one of the selected columns is not NA

df %>%
  filter(if_any(c(company, type), ~ !is.na(.)))
# id type company
#1 2 <NA> ADM
#2 3 North Alex
#3 4 South <NA>
#4 NA North BDA
#5 6 <NA> CA

Or using if_all

df %>%
  filter(!if_all(c(company, type), is.na))
# id type company
#1 2 <NA> ADM
#2 3 North Alex
#3 4 South <NA>
#4 NA North BDA
#5 6 <NA> CA

data

df <- structure(list(id = c(1L, 2L, 3L, 4L, NA, 6L), type = c(NA, NA, 
"North", "South", "North", NA), company = c(NA, "ADM", "Alex",
NA, "BDA", "CA")), class = "data.frame", row.names = c(NA, -6L
))
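The three filters in this answer can be compared side by side on the data above. The sketch below is only an illustration (the result names any_present, not_all_missing and both_present are made up here). It shows that if_any with a negated predicate and a negated if_all select the same rows, while if_all with !is.na is the stricter "no NA in either column" filter.

```r
library(dplyr)

# Same data as the structure() call in the answer
df <- data.frame(
  id      = c(1L, 2L, 3L, 4L, NA, 6L),
  type    = c(NA, NA, "North", "South", "North", NA),
  company = c(NA, "ADM", "Alex", NA, "BDA", "CA")
)

# At least one of the two columns is present
any_present <- df %>%
  filter(if_any(c(company, type), ~ !is.na(.x)))

# Same rows, phrased as "not every column is missing" (De Morgan)
not_all_missing <- df %>%
  filter(!if_all(c(company, type), is.na))

# Stricter: both columns must be present
both_present <- df %>%
  filter(if_all(c(company, type), ~ !is.na(.x)))

print(both_present)
```

Only the first row, where both type and company are NA, is dropped by the first two filters; the third keeps just the "Alex" and "BDA" rows.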

Filtering a data frame based on NA in multiple columns

We can get the logical index for both columns, use & and subset the rows.

df1[!is.na(df1$type) & !is.na(df1$company),]
# id type company
#3 3 North Alex
#5 NA North BDA

Or use rowSums on the logical matrix (is.na(df1[-1])) to subset.

df1[!rowSums(is.na(df1[-1])),]
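A self-contained sketch of these base R variants follows. The answer does not show df1, so it is assumed here to hold the same values as the df given in the previous answer, which is consistent with the printed output. Selecting the columns by name instead of df1[-1], and the complete.cases variant, are additions for illustration.

```r
# Assumption: df1 has the same contents as the `df` shown earlier
df1 <- data.frame(
  id      = c(1L, 2L, 3L, 4L, NA, 6L),
  type    = c(NA, NA, "North", "South", "North", NA),
  company = c(NA, "ADM", "Alex", NA, "BDA", "CA")
)

cols <- c("type", "company")

# Explicit logical index, one condition per column
by_and <- df1[!is.na(df1$type) & !is.na(df1$company), ]

# Count NAs per row over the chosen columns; zero means keep
by_rowsums <- df1[rowSums(is.na(df1[cols])) == 0, ]

# complete.cases() restricted to the chosen columns
by_cc <- df1[complete.cases(df1[cols]), ]

print(by_and)
```

Naming the columns scales better than chaining & when more than two columns are involved, and it avoids depending on column positions.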

Remove rows with all or some NAs (missing values) in data.frame

Also check complete.cases:

> final[complete.cases(final), ]
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
6 ENSG00000221312 0 1 2 3 2

na.omit is nicer for just removing all NA's. complete.cases allows partial selection by including only certain columns of the dataframe:

> final[complete.cases(final[ , 5:6]),]
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
4 ENSG00000207604 0 NA NA 1 2
6 ENSG00000221312 0 1 2 3 2

Your solution can't work. If you insist on using is.na, then you have to do something like:

> final[rowSums(is.na(final[ , 5:6])) == 0, ]
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
4 ENSG00000207604 0 NA NA 1 2
6 ENSG00000221312 0 1 2 3 2

but using complete.cases is considerably clearer, and faster.
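To make the difference between na.omit and complete.cases concrete, here is a small sketch on made-up data. The data frame below is a hypothetical stand-in for final, not the gene table from the question, and its values are invented.

```r
# Hypothetical stand-in for `final`; values are made up for illustration
final <- data.frame(
  gene = c("g1", "g2", "g3", "g4"),
  mmus = c(NA, 2, NA, 2),
  rnor = c(NA, 2, 1, 3),
  cfam = c(2, 2, 2, NA)
)

# Drop every row that has an NA in any column
all_complete <- final[complete.cases(final), ]
via_na_omit  <- na.omit(final)

# Only require rnor and cfam to be non-missing
partial <- final[complete.cases(final[, c("rnor", "cfam")]), ]

print(partial)
```

na.omit and the unrestricted complete.cases call keep the same rows; only complete.cases can be limited to a subset of columns, which is why row g3 survives in partial despite its missing mmus value.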

Filter all rows that start with any Latin alphabetic letter in R

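First, [A-z] is not the same as [A-Za-z]: in ASCII, the range A-z also covers the six characters that sit between Z and a ([, \, ], ^, _ and `), so you need to be more careful with character classes. (See Difference between regex [A-z] and [a-zA-Z] and ignore the Java portions.)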

Second, where does field: come in? Do this:

df %>%
  filter(grepl("^[A-Za-z]", roles))
# marks age roles
# 1 20.1 21 Software Eng.
# 2 30.2 22 Software Dev
# 3 40.3 23 Data Analyst
# 4 50.4 24 Data Eng.

(Plus the previous comment about grepl versus grep.)
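The difference between the two character classes can be checked directly. The sketch below uses a made-up roles vector and passes perl = TRUE so that ranges are interpreted by code point rather than by locale. Both choices are assumptions added here for illustration.

```r
# Hypothetical values; the question's real `roles` column is not reproduced
roles <- c("Software Eng.", "_internal", "[draft]", "42nd hire")

# [A-z] spans code points 65 to 122, which includes [ \ ] ^ _ `
loose <- grepl("^[A-z]", roles, perl = TRUE)

# [A-Za-z] matches only the 52 Latin letters
strict <- grepl("^[A-Za-z]", roles, perl = TRUE)

print(data.frame(roles, loose, strict))
```

With [A-z], the strings starting with an underscore or an opening bracket are accepted as well, which is usually not what "starts with a letter" is meant to allow.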

Remove rows where all variables are NA using dplyr

Since dplyr 0.7.0, new scoped filtering verbs exist. Using filter_all with any_vars you can easily keep rows with at least one non-missing column:

# dplyr 0.7.0
dat %>% filter_all(any_vars(!is.na(.)))

Using @hejseb's benchmarking code, it appears that this solution is as efficient as f4.

UPDATE:

Since dplyr 1.0.0 the above scoped verbs are superseded. Instead, the across function family was introduced, which lets you apply a function to multiple (or all) columns. With if_any, which was added in dplyr 1.0.4, keeping rows with at least one non-NA column now looks like this:

# dplyr >= 1.0.4
dat %>% filter(if_any(everything(), ~ !is.na(.)))

