Using filter() with across() to keep all rows of a data frame that include a missing value for any variable
It's now possible with dplyr 1.0.4. The new if_any() replaces across() for the filtering use-case.
library(dplyr)
df <- tribble(~ id, ~ x, ~ y,
1, 1, 0,
2, 1, 1,
3, NA, 1,
4, 0, 0,
5, 1, NA)
df %>%
filter(if_any(everything(), is.na))
#> # A tibble: 2 x 3
#> id x y
#> <dbl> <dbl> <dbl>
#> 1 3 NA 1
#> 2 5 1 NA
Created on 2021-02-10 by the reprex package (v0.3.0)
See here for more details: https://www.tidyverse.org/blog/2021/02/dplyr-1-0-4-if-any/
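As a small sketch of how the two helpers relate, the same df can be split into rows that contain a missing value and rows that do not. The names with_na and without_na are only illustrative, and the `~ !is.na(.x)` lambda is one of several ways to write the negated predicate.

```r
library(dplyr)

df <- tribble(~ id, ~ x, ~ y,
              1, 1, 0,
              2, 1, 1,
              3, NA, 1,
              4, 0, 0,
              5, 1, NA)

# Rows with an NA in at least one column
with_na <- df %>% filter(if_any(everything(), is.na))

# Rows with no NA in any column (the complement of the above)
without_na <- df %>% filter(if_all(everything(), ~ !is.na(.x)))

with_na$id     # 3 5
without_na$id  # 1 2 4
```

Every row lands in exactly one of the two results, because "any column is NA" and "all columns are non-NA" are logical opposites.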
Filter data.frame with all columns NA but keep when some are NA
We can use base R
teste[rowSums(!is.na(teste)) >0,]
# a b c
#1 1 NA 1
#3 3 3 3
#4 NA 4 4
Or using apply and any
teste[apply(!is.na(teste), 1, any),]
The rowSums condition can also be used within filter
teste %>%
filter(rowSums(!is.na(.)) >0)
Or using c_across from dplyr, we can directly remove the rows that are all NA
library(dplyr)
teste %>%
rowwise %>%
filter(!all(is.na(c_across(everything()))))
# A tibble: 3 x 3
# Rowwise:
# a b c
# <dbl> <dbl> <dbl>
#1 1 NA 1
#2 3 3 3
#3 NA 4 4
NOTE: filter_all is superseded as of dplyr 1.0.0; filter() with if_any()/if_all() is the current approach.
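The question's teste data is not shown here, so the sketch below uses a hypothetical reconstruction that is consistent with the printed output (row 2 is entirely NA). It shows that the rowSums and apply variants select the same rows.

```r
# Hypothetical data, reconstructed from the printed output above
teste <- data.frame(a = c(1, NA, 3, NA),
                    b = c(NA, NA, 3, 4),
                    c = c(1, NA, 3, 4))

# Number of non-missing values per row; 0 means the whole row is NA
non_missing <- rowSums(!is.na(teste))

keep_rowsums <- teste[non_missing > 0, ]
keep_apply   <- teste[apply(!is.na(teste), 1, any), ]

rownames(keep_rowsums)               # "1" "3" "4"
identical(keep_rowsums, keep_apply)  # TRUE
```

On wide data the rowSums version is usually the cheaper of the two, since apply calls any() once per row.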
How to use dplyr across to filter NA in multiple columns
We can use across to loop over the columns 'type' and 'company' and return the rows that don't have any NA in the specified columns. (Using across() inside filter() is deprecated as of dplyr 1.0.8; if_all() is the replacement.)
library(dplyr)
df %>%
filter(across(c(type, company), ~ !is.na(.)))
# id type company
#1 3 North Alex
#2 NA North BDA
With filter, there are two options, if_any and if_all, that are similar to the any_vars/all_vars used with filter_at/filter_all. Using if_any:
df %>%
filter(if_any(c(company, type), ~ !is.na(.)))
# id type company
#1 2 <NA> ADM
#2 3 North Alex
#3 4 South <NA>
#4 NA North BDA
#5 6 <NA> CA
Or using if_all
df %>%
filter(!if_all(c(company, type), is.na))
# id type company
#1 2 <NA> ADM
#2 3 North Alex
#3 4 South <NA>
#4 NA North BDA
#5 6 <NA> CA
data
df <- structure(list(id = c(1L, 2L, 3L, 4L, NA, 6L), type = c(NA, NA,
"North", "South", "North", NA), company = c(NA, "ADM", "Alex",
NA, "BDA", "CA")), class = "data.frame", row.names = c(NA, -6L
))
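To tie the variants above together, here is a sketch on the same df. It assumes if_all() as the stand-in for the across() form (both require every selected column to pass), and it checks that the two "at least one non-NA" spellings agree. The result names are only illustrative.

```r
library(dplyr)

df <- structure(list(id = c(1L, 2L, 3L, 4L, NA, 6L),
                     type = c(NA, NA, "North", "South", "North", NA),
                     company = c(NA, "ADM", "Alex", NA, "BDA", "CA")),
                class = "data.frame", row.names = c(NA, -6L))

# Keep rows where both columns are present (stand-in for the across() form)
both_present <- df %>% filter(if_all(c(type, company), ~ !is.na(.x)))

# Keep rows where at least one column is present; two equivalent spellings
any_present_1 <- df %>% filter(if_any(c(type, company), ~ !is.na(.x)))
any_present_2 <- df %>% filter(!if_all(c(type, company), is.na))

both_present$id                          # 3 NA
identical(any_present_1, any_present_2)  # TRUE
```

The last two agree because "not all are NA" is the same statement as "at least one is not NA".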
filtering data frame based on NA on multiple columns
We can get the logical index for both columns, combine them with &, and subset the rows.
df1[!is.na(df1$type) & !is.na(df1$company),]
# id type company
#3 3 North Alex
#5 NA North BDA
Or use rowSums on the logical matrix (is.na(df1[-1])) to subset.
df1[!rowSums(is.na(df1[-1])),]
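The rowSums form works because `!` coerces the per-row NA count to logical: a count of 0 becomes TRUE and any positive count becomes FALSE. A minimal sketch, assuming df1 holds the same values as the df shown in the previous answer:

```r
# Assumption: df1 has the same contents as the df given earlier
df1 <- data.frame(id = c(1L, 2L, 3L, 4L, NA, 6L),
                  type = c(NA, NA, "North", "South", "North", NA),
                  company = c(NA, "ADM", "Alex", NA, "BDA", "CA"))

# Explicit test on each column, combined with &
idx_and <- !is.na(df1$type) & !is.na(df1$company)

# NA count per row over every column except the first (id)
na_count <- rowSums(is.na(df1[-1]))
idx_sum  <- !na_count          # same as na_count == 0

all(idx_and == idx_sum)        # TRUE
df1[idx_sum, "id"]             # 3 NA
```

Writing `na_count == 0` instead of `!na_count` gives the same index and may read more clearly.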
Remove rows with all or some NAs (missing values) in data.frame
Also check complete.cases:
> final[complete.cases(final), ]
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
6 ENSG00000221312 0 1 2 3 2
na.omit is nicer for just removing all NA's. complete.cases allows partial selection by including only certain columns of the dataframe:
> final[complete.cases(final[ , 5:6]),]
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
4 ENSG00000207604 0 NA NA 1 2
6 ENSG00000221312 0 1 2 3 2
Your solution can't work. If you insist on using is.na, then you have to do something like:
> final[rowSums(is.na(final[ , 5:6])) == 0, ]
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
4 ENSG00000207604 0 NA NA 1 2
6 ENSG00000221312 0 1 2 3 2
but using complete.cases is much clearer, and faster.
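Since the full final data frame is not reproduced here, the sketch below uses a small hypothetical stand-in (made-up gene labels and values) to compare the three approaches side by side.

```r
# Hypothetical stand-in for `final`; values are made up for illustration
final <- data.frame(gene = c("g1", "g2", "g3", "g4"),
                    mmul = c(NA, 2, NA, 1),
                    rnor = c(1, 2, NA, NA),
                    cfam = c(2, 2, NA, 3))

# Rows with no NA in any column
all_complete <- final[complete.cases(final), ]

# Rows with no NA in the selected columns only
cols <- c("rnor", "cfam")
some_complete <- final[complete.cases(final[, cols]), ]

# The is.na() route selects the same rows
some_isna <- final[rowSums(is.na(final[, cols])) == 0, ]

all_complete$gene                      # "g2"
some_complete$gene                     # "g1" "g2"
identical(some_complete, some_isna)    # TRUE
rownames(na.omit(final))               # "2"
```

na.omit(final) keeps the same rows as the all-column complete.cases call, though it also attaches an "na.action" attribute that records which rows were dropped.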
Filter all rows that start with any Latin alphabetic letter in R
First, [A-z] is not the same as [A-Za-z]; you need to be more careful with character classes. (See "Difference between regex [A-z] and [a-zA-Z]" and ignore the Java portions.)
Second, where does field: come in? Do this:
df %>%
filter(grepl("^[A-Za-z]", roles))
# marks age roles
# 1 20.1 21 Software Eng.
# 2 30.2 22 Software Dev
# 3 40.3 23 Data Analyst
# 4 50.4 24 Data Eng.
(Plus the previous comment about grepl versus grep.)
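To make both points concrete, here is a sketch on a hypothetical roles vector. It passes perl = TRUE so that ranges are compared by code point (my choice for this example, since range behaviour in R's default engine can depend on the locale). In ASCII, the range A-z also covers the characters [ \ ] ^ _ and the backtick, which sit between Z and a.

```r
# Hypothetical values chosen to expose the difference
roles <- c("Software Eng.", "_internal", "[temp]", "9 to 5", "data analyst")

loose  <- grepl("^[A-z]", roles, perl = TRUE)     # also accepts "_" and "["
strict <- grepl("^[A-Za-z]", roles, perl = TRUE)  # letters only

roles[loose & !strict]   # "_internal" "[temp]"
roles[strict]            # "Software Eng." "data analyst"

# grep() returns positions, grepl() returns a logical vector;
# filter() needs the logical form
grep("^[A-Za-z]", roles, perl = TRUE)   # 1 5
```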
Remove rows where all variables are NA using dplyr
Since dplyr 0.7.0, new scoped filtering verbs exist. Using filter_all with any_vars you can easily keep rows with at least one non-missing column:
# dplyr 0.7.0
dat %>% filter_all(any_vars(!is.na(.)))
Using @hejseb's benchmarking algorithm, it appears that this solution is as efficient as f4.
UPDATE:
Since dplyr 1.0.0 the above scoped verbs are superseded. Instead, the across function family was introduced, which applies a function to multiple (or all) columns. Filtering rows with at least one non-NA column now looks like this:
# dplyr 1.0.0
dat %>% filter(if_any(everything(), ~ !is.na(.)))
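A quick check on a hypothetical dat (the question's data is not shown here) that the if_any() form drops only the all-NA row and matches a base R rowSums filter:

```r
library(dplyr)

# Hypothetical data: row 2 is entirely NA
dat <- tibble(a = c(1, NA, NA),
              b = c(NA, NA, 2))

kept <- dat %>% filter(if_any(everything(), ~ !is.na(.x)))

# Base R equivalent: keep rows with at least one non-missing value
keep_idx <- rowSums(!is.na(dat)) > 0

nrow(kept)         # 2
unname(keep_idx)   # TRUE FALSE TRUE
```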