Filter multiple values on a string column in dplyr
You need %in% instead of ==:
library(dplyr)
target <- c("Tom", "Lynn")
filter(dat, name %in% target) # equivalently, dat %>% filter(name %in% target)
This produces:
days name
1 88 Lynn
2 11 Tom
3 1 Tom
4 222 Lynn
5 2 Lynn
To understand why, consider what happens here:
dat$name == target
# [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
Basically, we're recycling the length-two target vector four times to match the length of dat$name. In other words, we are doing:
Lynn == Tom
Tom == Lynn
Chris == Tom
Lisa == Lynn
... continue repeating Tom and Lynn until end of data frame
In this case we don't get an error. I suspect your actual data frame has a number of rows that doesn't allow recycling, whereas the sample you provided does (8 rows). If the sample had had an odd number of rows, I would have gotten the same error as you. But even when recycling works, this is clearly not what you want. Basically, the statement dat$name == target is equivalent to saying:
return TRUE for every value at an odd position that is equal to "Tom" and every value at an even position that is equal to "Lynn".
It so happens that the last value in your sample data frame is at an even position and equal to "Lynn", hence the one TRUE above.
In contrast, dat$name %in% target says:
for each value in dat$name, check whether it exists in target.
Very different. Here is the result:
[1] TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE
Note that your problem has nothing to do with dplyr, just the misuse of ==.
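The recycling behaviour described above can be checked directly in base R. This is a minimal sketch: the data frame below is a hypothetical reconstruction (the names in rows 3 to 5 and some of the days values are assumptions chosen to be consistent with the outputs shown above), not the data from the question.

```r
# Hypothetical sample data, assumed for illustration only
dat <- data.frame(
  days = c(88, 11, 3, 9, 5, 1, 222, 2),
  name = c("Lynn", "Tom", "Chris", "Lisa", "Kyla", "Tom", "Lynn", "Lynn")
)
target <- c("Tom", "Lynn")

# What == effectively compares against: target recycled to 8 elements
recycled <- rep(target, length.out = nrow(dat))

# Position-by-position comparison against the recycled vector
eq_result <- dat$name == target

# Membership test: is each name found anywhere in target?
in_result <- dat$name %in% target

identical(eq_result, dat$name == recycled)  # the two comparisons are the same
sum(eq_result)  # number of rows kept by ==
sum(in_result)  # number of rows kept by %in%
```

Printing recycled next to dat$name makes it easy to see which pairs == is actually comparing.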
Filtering by multiple columns at once in `dplyr`
We could use if_all or if_any, as Anil points out in the comments. From the dplyr 1.0.4 release notes (https://www.tidyverse.org/blog/2021/02/dplyr-1-0-4-if-any/), section "if_any() and if_all()":
"across() is very useful within summarise() and mutate(), but it’s hard to use it with filter() because it is not clear how the results would be combined into one logical vector. So to fill the gap, we’re introducing two new functions if_all() and if_any()."
For your code this would be:
if_all:
data %>%
filter(if_all(starts_with("cp"), ~ . > 0.2))
mt100 cp001 cp002 cp003
<dbl> <dbl> <dbl> <dbl>
1 0.688 0.402 0.467 0.646
2 0.663 0.757 0.728 0.335
3 0.472 0.533 0.717 0.638
if_any:
data %>%
filter(if_any(starts_with("cp"), ~ . > 0.2))
mt100 cp001 cp002 cp003
<dbl> <dbl> <dbl> <dbl>
1 0.554 0.970 0.874 0.187
2 0.688 0.402 0.467 0.646
3 0.658 0.850 0.00813 0.542
4 0.663 0.757 0.728 0.335
5 0.472 0.533 0.717 0.638
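To see the difference between the two on something small enough to check by eye, here is a minimal sketch. The tibble is hypothetical (it is not the data from the question), and it assumes dplyr 1.0.4 or later.

```r
library(dplyr)  # if_all() / if_any() need dplyr >= 1.0.4

# Hypothetical data, assumed for illustration only
data <- tibble(
  mt100 = c(0.5, 0.6, 0.7),
  cp001 = c(0.9, 0.1, 0.3),
  cp002 = c(0.8, 0.1, 0.1),
  cp003 = c(0.7, 0.1, 0.1)
)

# Keep rows where every cp* column exceeds 0.2
all_rows <- data %>% filter(if_all(starts_with("cp"), ~ .x > 0.2))

# Keep rows where at least one cp* column exceeds 0.2
any_rows <- data %>% filter(if_any(starts_with("cp"), ~ .x > 0.2))

nrow(all_rows)
nrow(any_rows)
```

Only the first row passes if_all, while the first and third rows pass if_any, because the third row has a single cp value above the threshold.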
dplyr filter multiple variables (columns) with multiple conditions
Another possible solution:
library(dplyr)
test %>%
filter(complete.cases(.) & if_all(everything(), ~ !(.x %in% 0:2)))
#> A B C
#> 1 6 5 6
#> 2 7 7 7
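The complete.cases(.) term matters because NA %in% 0:2 is FALSE, so the negated condition alone would keep rows containing NA. A minimal sketch, using a hypothetical test data frame (an assumption, since the original data is not shown here):

```r
library(dplyr)

# Hypothetical data, assumed for illustration only
test <- data.frame(
  A = c(1, NA, 6, 7),
  B = c(3, 4, 5, 7),
  C = c(5, 9, 6, 7)
)

# Without complete.cases(): the row with NA survives,
# because !(NA %in% 0:2) evaluates to TRUE
without_cc <- test %>%
  filter(if_all(everything(), ~ !(.x %in% 0:2)))

# With complete.cases(): rows containing NA are dropped as well
with_cc <- test %>%
  filter(complete.cases(.) & if_all(everything(), ~ !(.x %in% 0:2)))

nrow(without_cc)
nrow(with_cc)
```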
How can I filter multiple columns with dplyr using string matching for the column name?
You could do that using filter_at with ends_with.
library(dplyr)
nyc_crashes %>%
  # Select columns that end with KILLED or INJURED
  filter_at(vars(c(ends_with("KILLED"), ends_with("INJURED"))),
            # Keep rows where any of these variables is >= 1
            any_vars(. >= 1))
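Since filter_at is superseded in current dplyr, the same idea can also be written with if_any. This is a sketch only: the crashes tibble and its column names are hypothetical stand-ins for nyc_crashes, and it assumes dplyr 1.0.4 or later.

```r
library(dplyr)  # if_any() needs dplyr >= 1.0.4

# Hypothetical stand-in for nyc_crashes; column names are assumptions
crashes <- tibble(
  ID               = 1:4,
  PERSONS_KILLED   = c(0, 1, 0, 0),
  PERSONS_INJURED  = c(0, 0, 2, 0),
  CYCLISTS_INJURED = c(0, 0, 1, 0)
)

# Keep rows where any column ending in KILLED or INJURED is >= 1
hits <- crashes %>%
  filter(if_any(c(ends_with("KILLED"), ends_with("INJURED")), ~ .x >= 1))

hits$ID
```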
R filtering for strings across several columns
If you want to use stringr and str_detect, you can try:
library(stringr)
library(dplyr)
df %>%
filter(across(A:C, ~!str_detect(., "[A-Z]")))
Or, to filter based on all columns in the data frame:
df %>%
filter(across(everything(), ~!str_detect(., "[A-Z]")))
Edit: As mentioned in the comments, starting with dplyr 1.0.4 you can use the new functions if_any or if_all with filter (using across directly inside filter has since been deprecated in their favor). For example:
df %>%
filter(if_all(everything(), ~!str_detect(., "[A-Z]")))
Output
A B C
1 5 6 7
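For a self-contained check, here is a small sketch with a hypothetical df (an assumption; the original data is not shown). It also shows that "no column contains an uppercase letter" can be written either as if_all over the negated test or as the negation of if_any.

```r
library(dplyr)    # if_any() / if_all() need dplyr >= 1.0.4
library(stringr)

# Hypothetical data, assumed for illustration only
df <- data.frame(
  A = c("5", "a", "X1"),
  B = c("6", "B", "2"),
  C = c("7", "c", "3")
)

# Every column must be free of uppercase letters
res_all <- df %>%
  filter(if_all(everything(), ~ !str_detect(.x, "[A-Z]")))

# Same condition, phrased as "not (any column has an uppercase letter)"
res_not_any <- df %>%
  filter(!if_any(everything(), ~ str_detect(.x, "[A-Z]")))

res_all
```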
Filtering multiple string columns based on 2 different criteria - questions about grepl and starts_with
We can use filter with rowwise and c_across: we go row by row, pick the columns inside c_across with a selection helper (starts_with), and get a logical output from grepl, which checks for either "C18" or (|) a value that starts with (^) 153:
library(dplyr) #1.0.0
library(stringr)
df %>%
  # do a row-wise grouping
  rowwise() %>%
  # subset the columns that start with 'DGN' within c_across,
  # apply the grepl condition on that subset, and
  # wrap with any() so that one matching column in a row is enough
  filter(any(grepl("C18|^153", c_across(starts_with("DGN")))))
Or with filter_at:
df %>%
  # apply any_vars along with grepl in filter_at
  filter_at(vars(starts_with("DGN")), any_vars(grepl("C18|^153", .)))
Data:
df <- data.frame(ID = 1:3, DGN1 = c("2_C18", 32, "1532"),
DGN2 = c("24", "C18_2", "23"))
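With dplyr 1.0.4 or later, the same row-wise "any column matches" logic can probably be expressed without rowwise by using if_any. A minimal sketch: the fourth row is a hypothetical addition (not in the data above) so that at least one row fails the condition.

```r
library(dplyr)  # if_any() needs dplyr >= 1.0.4

# Data from above plus one hypothetical non-matching row (ID 4)
df <- data.frame(
  ID   = 1:4,
  DGN1 = c("2_C18", "32", "1532", "77"),
  DGN2 = c("24", "C18_2", "23", "9153")
)

# Keep rows where any DGN column contains "C18" or starts with "153"
res <- df %>%
  filter(if_any(starts_with("DGN"), ~ grepl("C18|^153", .x)))

res$ID
```

Row 4 is dropped because "9153" contains 153 but does not start with it.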
R function to filter / subset (programatically) multiple values over one variable
We can use %in% when there is more than one value to check.
df[df$v2 %in% c('a', 'b'),]
# v1 v2
#1 1 a
#2 2 b
Or, if we use subset, the df$ can be dropped:
subset(df, v2 %in% c('a', 'b'))
Or with dplyr::filter:
filter(df, v2 %in% c('a', 'b'))
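This can be wrapped in a function. The column argument needs to be embraced with {{ }} so that filter looks it up in the data rather than in the calling environment:
f1 <- function(dat, col, val) {
  # {{ col }} forwards the unquoted column name into filter's data mask
  filter(dat, {{ col }} %in% val)
}
f1(df, v2, c('a', 'b'))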
# v1 v2
#1 1 a
#2 2 b
If we need to use ==, we could loop over the vector of values with lapply to get a list of comparisons, and combine them using Reduce with |:
df[Reduce(`|`, lapply(letters[1:2], `==`, df$v2)),]
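When the column name arrives as a string rather than as a bare name, one option is the .data pronoun that dplyr exports. The sketch below is an assumption-laden illustration: the df contents and the helper name f2 are hypothetical, and it also checks that the Reduce approach selects the same rows.

```r
library(dplyr)

# Hypothetical data, assumed for illustration only
df <- data.frame(v1 = 1:4, v2 = c("a", "b", "c", "d"))

# f2 is a hypothetical helper: col_name is a string naming the column
f2 <- function(dat, col_name, val) {
  filter(dat, .data[[col_name]] %in% val)
}

res <- f2(df, "v2", c("a", "b"))

# Base R equivalent using == on each value, combined with |
base_res <- df[Reduce(`|`, lapply(c("a", "b"), `==`, df$v2)), ]

res$v1
base_res$v1
```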
R dplyr filter string condition on multiple columns
You can use filter_at with any_vars to select rows that have at least one value of "X".
library(dplyr)
df %>% filter_at(vars(v2:v5), any_vars(. == 'X'))
# v1 v2 v3 v4 v5
#1 1 A B X C
#2 2 A B C X
However, filter_at has been superseded, so to translate this into across you can do:
df %>% filter(Reduce(`|`, across(v2:v5, ~. == 'X')))
It is also easier in base R:
df[rowSums(df[-1] == 'X') > 0, ]
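On dplyr 1.0.4 or later, if_any gives a more direct way to say "at least one of these columns equals X". A minimal sketch with a hypothetical df (the third row is an assumption added so that one row is filtered out):

```r
library(dplyr)  # if_any() needs dplyr >= 1.0.4

# Hypothetical data, assumed for illustration only
df <- data.frame(
  v1 = 1:3,
  v2 = c("A", "A", "A"),
  v3 = c("B", "B", "B"),
  v4 = c("X", "C", "C"),
  v5 = c("C", "X", "D")
)

# Keep rows where any of v2..v5 equals "X"
res <- df %>% filter(if_any(v2:v5, ~ .x == "X"))

# Base R version from above, for comparison
base_res <- df[rowSums(df[-1] == "X") > 0, ]

res$v1
base_res$v1
```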