Filter Multiple Values on a String Column in Dplyr

You need %in% instead of ==:

library(dplyr)
target <- c("Tom", "Lynn")
filter(dat, name %in% target) # equivalently, dat %>% filter(name %in% target)

Produces

  days name
1   88 Lynn
2   11  Tom
3    1  Tom
4  222 Lynn
5    2 Lynn

To understand why, consider what happens here:

dat$name == target
# [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE

Basically, we're recycling the length-two target vector four times to match the length of dat$name. In other words, we are doing:

Lynn == Tom
Tom == Lynn
Chris == Tom
Lisa == Lynn
... and so on, repeating Tom and Lynn until the end of the data frame

We don't get an error here because the sample you provided has 8 rows, which allows clean recycling; I suspect your actual data frame has a number of rows that does not. If the sample had had an odd number of rows, I would have gotten the same error as you. But even when recycling works, this is clearly not what you want. Basically, the statement dat$name == target is equivalent to saying:

return TRUE for every value in an odd position that is equal to "Tom", or every value in an even position that is equal to "Lynn".

It so happens that the last value in your sample data frame is in an even position and equal to "Lynn", hence the one TRUE above.

In contrast, dat$name %in% target says:

for each value in dat$name, check that it exists in target.

Very different. Here is the result:

[1]  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE

Note that your problem has nothing to do with dplyr, just the misuse of ==.
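
The difference between == and %in% can be reproduced in base R alone. The sketch below assumes a reconstruction of the sample data: the name pattern follows the outputs shown above, but the fifth name and several of the days values are made up for illustration.

```r
# Hypothetical reconstruction of the sample data; only the pattern of names matters.
dat <- data.frame(
  days = c(88, 11, 2, 5, 22, 1, 222, 2),
  name = c("Lynn", "Tom", "Chris", "Lisa", "Kyla", "Tom", "Lynn", "Lynn"),
  stringsAsFactors = FALSE
)
target <- c("Tom", "Lynn")

# What == really compares against: target recycled to the length of dat$name
recycled <- rep(target, length.out = nrow(dat))
eq_result <- dat$name == target
same_as_recycled <- identical(eq_result, dat$name == recycled)

# %in% instead tests each name for membership in target
in_result <- dat$name %in% target

print(same_as_recycled)  # TRUE
print(sum(eq_result))    # 1 row matched by position-wise comparison
print(sum(in_result))    # 5 rows matched by membership
```

Building the recycled vector explicitly makes it visible that == is a position-by-position comparison, which is why it matches only one row here.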

Filtering by multiple columns at once in `dplyr`

We could use if_all or if_any, as Anil points out in the comments. For your code this would be:

https://www.tidyverse.org/blog/2021/02/dplyr-1-0-4-if-any/

if_any() and if_all()

"across() is very useful within summarise() and mutate(), but it’s hard to use it with filter() because it is not clear how the results would be combined into one logical vector. So to fill the gap, we’re introducing two new functions if_all() and if_any()."

if_all:

data %>%
  filter(if_all(starts_with("cp"), ~ . > 0.2))
  mt100 cp001 cp002 cp003
  <dbl> <dbl> <dbl> <dbl>
1 0.688 0.402 0.467 0.646
2 0.663 0.757 0.728 0.335
3 0.472 0.533 0.717 0.638

if_any:

data %>%
  filter(if_any(starts_with("cp"), ~ . > 0.2))
  mt100 cp001   cp002 cp003
  <dbl> <dbl>   <dbl> <dbl>
1 0.554 0.970 0.874   0.187
2 0.688 0.402 0.467   0.646
3 0.658 0.850 0.00813 0.542
4 0.663 0.757 0.728   0.335
5 0.472 0.533 0.717   0.638
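
To see the two functions side by side, here is a minimal sketch on a tiny made-up tibble (the values are assumptions, not the data from the question). It requires dplyr 1.0.4 or later.

```r
library(dplyr)

# Hypothetical data: row 1 passes in both cp columns, row 2 in one, row 3 in none
data <- tibble(
  mt100 = c(0.5, 0.6, 0.7),
  cp001 = c(0.9, 0.1, 0.1),
  cp002 = c(0.8, 0.3, 0.1)
)

# if_all(): keep rows where EVERY selected column satisfies the predicate
all_rows <- data %>% filter(if_all(starts_with("cp"), ~ .x > 0.2))

# if_any(): keep rows where AT LEAST ONE selected column satisfies it
any_rows <- data %>% filter(if_any(starts_with("cp"), ~ .x > 0.2))

print(nrow(all_rows))  # 1
print(nrow(any_rows))  # 2
```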

dplyr filter multiple variables (columns) with multiple conditions

Another possible solution:

library(dplyr)

test %>%
  filter(complete.cases(.) & if_all(everything(), ~ !(.x %in% 0:2)))

#>   A B C
#> 1 6 5 6
#> 2 7 7 7

How can I filter multiple columns with dplyr using string matching for the column name?

You could do that using filter_at with ends_with.

library(dplyr)
nyc_crashes %>%
  # Select columns that end with KILLED or INJURED
  filter_at(vars(c(ends_with("KILLED"), ends_with("INJURED"))),
            # Keep rows where any of these variables is >= 1
            any_vars(. >= 1))
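
Since filter_at is superseded in current dplyr, the same filter can probably be written with if_any (dplyr 1.0.4 or later). This is only a sketch: the small data frame below is a made-up stand-in for nyc_crashes, and its column names are assumptions.

```r
library(dplyr)

# Hypothetical stand-in for nyc_crashes; the column names are assumptions
crashes <- tibble(
  ID = 1:4,
  PERSONS_INJURED = c(0, 2, 0, 0),
  PERSONS_KILLED = c(0, 0, 1, 0),
  CYCLISTS_INJURED = c(0, 0, 0, 0)
)

# Keep rows where any column ending in KILLED or INJURED is >= 1
hits <- crashes %>%
  filter(if_any(c(ends_with("KILLED"), ends_with("INJURED")), ~ .x >= 1))

print(hits$ID)  # 2 3
```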

R filtering for strings across several columns

If you want to use stringr and str_detect you can try:

library(stringr)
library(dplyr)

df %>%
  filter(across(A:C, ~ !str_detect(., "[A-Z]")))

Or to filter based on all columns in the data.frame:

df %>%
  filter(across(everything(), ~ !str_detect(., "[A-Z]")))

Edit: As mentioned in the comments, starting with dplyr v. 1.0.4, you can use the new functions if_any or if_all with filter. For example:

df %>%
  filter(if_all(everything(), ~ !str_detect(., "[A-Z]")))

Output

  A B C
1 5 6 7

Filtering multiple string columns based on 2 different criteria - questions about grepl and starts_with

We can use filter with rowwise and c_across. Within each row, c_across collects the columns picked by a selection helper (starts_with), grepl returns a logical vector checking for either "C18" or (|) a value that starts with (^) 153, and any() is TRUE if at least one column in the row matches:

library(dplyr) # version 1.0.0 or later, for c_across()
library(stringr)
df %>%
  # do a row-wise grouping
  rowwise() %>%
  # subset the columns that start with 'DGN' within c_across,
  # apply the grepl condition on that subset, and
  # wrap with any() so a match in any column keeps the row
  filter(any(grepl("C18|^153", c_across(starts_with("DGN")))))

Or with filter_at

df %>%
  # apply any_vars along with grepl in filter_at
  filter_at(vars(starts_with("DGN")), any_vars(grepl("C18|^153", .)))

data

df <- data.frame(ID = 1:3, DGN1 = c("2_C18", 32, "1532"),
                 DGN2 = c("24", "C18_2", "23"))
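
With dplyr 1.0.4 or later, the same condition can likely be expressed with if_any, avoiding both rowwise and the superseded filter_at. The sketch below rebuilds the data above under the hypothetical name dgn and adds one made-up fourth row that should not match, to show the effect of the ^ anchor.

```r
library(dplyr)

# Data from above plus a hypothetical fourth row ("2153" contains 153 but does not start with it)
dgn <- data.frame(
  ID = 1:4,
  DGN1 = c("2_C18", "32", "1532", "99"),
  DGN2 = c("24", "C18_2", "23", "2153"),
  stringsAsFactors = FALSE
)

# Keep rows where any DGN column contains "C18" or starts with "153"
matched <- dgn %>%
  filter(if_any(starts_with("DGN"), ~ grepl("C18|^153", .x)))

print(matched$ID)  # 1 2 3
```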

R function to filter / subset (programmatically) multiple values over one variable

We can use %in% if the number of elements to check is more than 1.

df[df$v2 %in% c('a', 'b'),]
# v1 v2
#1 1 a
#2 2 b

Or if we use subset, the df$ can be removed

subset(df, v2 %in% c('a', 'b'))

Or the dplyr::filter

filter(df, v2 %in% c('a', 'b'))

This can be wrapped in a function

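f1 <- function(dat, col, val) {
  # {{ col }} forwards the unquoted column name to filter();
  # a bare `col` would be looked up as a variable, not as a column of dat
  filter(dat, {{ col }} %in% val)
}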

f1(df, v2, c('a', 'b'))
# v1 v2
#1 1 a
#2 2 b

If we need to use ==, we could loop over the values with lapply to get a list of logical vectors (one per value) and combine them with Reduce and |

df[Reduce(`|`, lapply(letters[1:2], `==`, df$v2)),]
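
As a quick check that the Reduce approach selects the same rows as %in%, here is a base R sketch on a made-up data frame (the name toy and its values are assumptions for illustration).

```r
# Hypothetical data frame for illustration
toy <- data.frame(v1 = 1:4, v2 = c("a", "b", "c", "a"), stringsAsFactors = FALSE)
vals <- c("a", "b")

# Membership test in one step
by_in <- toy$v2 %in% vals

# One == comparison per value, combined element-wise with |
by_reduce <- Reduce(`|`, lapply(vals, function(v) toy$v2 == v))

print(identical(by_in, by_reduce))  # TRUE
print(toy[by_reduce, "v1"])         # 1 2 4
```

One difference to keep in mind: if v2 contains NA, == yields NA for that element while %in% yields FALSE, so the two approaches can diverge on missing values.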

R dplyr filter string condition on multiple columns

You can use filter_at with any_vars to select rows that have at least one value of "X".

library(dplyr)
df %>% filter_at(vars(v2:v5), any_vars(. == 'X'))

# v1 v2 v3 v4 v5
#1 1 A B X C
#2 2 A B C X

However, filter_at has been superseded, so to translate this into across you can do:

df %>% filter(Reduce(`|`, across(v2:v5, ~. == 'X')))

It is also easy in base R:

df[rowSums(df[-1] == 'X') > 0, ]
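
With dplyr 1.0.4 or later, if_any is probably the most direct replacement for filter_at with any_vars. The sketch below uses a made-up data frame (named toy here, with assumed values shaped like the output above) and checks that it agrees with the base R version.

```r
library(dplyr)

# Hypothetical data: rows 1 and 2 contain an "X", row 3 does not
toy <- data.frame(
  v1 = 1:3,
  v2 = c("A", "A", "A"),
  v3 = c("B", "B", "B"),
  v4 = c("X", "C", "C"),
  v5 = c("C", "X", "D"),
  stringsAsFactors = FALSE
)

# dplyr: keep rows where any of v2..v5 equals "X"
with_if_any <- toy %>% filter(if_any(v2:v5, ~ .x == "X"))

# base R: count the "X" matches per row across all columns except the first
with_base <- toy[rowSums(toy[-1] == "X") > 0, ]

print(with_if_any$v1)  # 1 2
print(with_base$v1)    # 1 2
```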

