R Dplyr Filter Based on Matching Search Term with First Words of Any Word in Select Columns


EDIT

To keep rows where any word in Name starts with "bio" (or where Code starts with "15"), we can use a word boundary (\\b) in the pattern:

library(tidyverse)
df %>%
  filter(str_detect(str_to_lower(Name), "\\bbio") | str_detect(Code, "^15"))

Or, equivalently, in base R:

df[sapply(strsplit(df$Name, "\\s+"), function(x) any(grepl("^bio", tolower(x)))) |
     grepl("^15", df$Code), ]
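
To see what the word boundary changes, here is a minimal sketch on a made-up data frame (the rows are illustrative assumptions, not data from the original question). "Probiotics" contains "bio", but no word in that string starts with it, so the row is dropped unless its Code matches.

```r
library(dplyr)
library(stringr)

# Hypothetical sample data, invented for this illustration
df <- data.frame(
  Name = c("Biofuel is good", "Good bioethics", "Probiotics help", "Plain text"),
  Code = c("159403", "201111", "301234", "150000"),
  stringsAsFactors = FALSE
)

# Keep rows where some word starts with "bio" OR Code starts with "15"
res <- df %>%
  filter(str_detect(str_to_lower(Name), "\\bbio") | str_detect(Code, "^15"))

res$Name
```

The last row is kept only because of its Code; dropping the second condition would leave just the first two rows.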

Original Answer

This keeps rows where the first word of Name contains "bio" (word(Name) returns only the first word) or where Code starts with "15".

library(tidyverse)
df %>%
  filter(str_detect(str_to_lower(word(Name)), "bio") | str_detect(Code, "^15"))

#                   Name   Code
#1       Biofuel is good 159403
#2 Bioecological is good 161540
#3    Probiotics is good 159883

Using the same logic but in base R, we can do

df[sapply(strsplit(df$Name, "\\s+"), function(x) grepl("bio", tolower(x[1]))) |
     grepl("^15", df$Code), ]

#                   Name   Code
#1       Biofuel is good 159403
#2 Bioecological is good 161540
#3    Probiotics is good 159883

Here, the string is split on whitespace, the first word of each element is extracted (x[1]) and checked for "bio", and rows whose Code starts with "15" are kept as well.
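
The intermediate steps of the base R version can be sketched separately. The vector names_vec below is a hypothetical stand-in for df$Name, chosen only to show what each step returns.

```r
# Hypothetical stand-in for df$Name
names_vec <- c("Biofuel is good", "Good bioethics", "Probiotics is good")

# Step 1: split each string on runs of whitespace (returns a list)
parts <- strsplit(names_vec, "\\s+")

# Step 2: take the first word of each element
first_words <- sapply(parts, function(x) x[1])

# Step 3: case-insensitive check for "bio" anywhere in that first word
has_bio <- grepl("bio", tolower(first_words))

first_words
has_bio
```

Because the pattern is unanchored, "Probiotics" counts as a match here, whereas "Good bioethics" does not, since only the first word is inspected.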

Select columns based on string match - dplyr::select

Within the dplyr world, try:

select(iris, contains("Sepal"))

See the Selection section in ?select for numerous other helpers like starts_with, ends_with, etc.
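
As a brief illustration of two of those helpers, the sketch below uses the built-in iris data set; the variable names petal_cols and width_cols are just placeholders for this example.

```r
library(dplyr)

# iris ships with R; its columns are Sepal.Length, Sepal.Width,
# Petal.Length, Petal.Width and Species.

# Columns whose names begin with "Petal"
petal_cols <- names(select(iris, starts_with("Petal")))

# Columns whose names end with "Width"
width_cols <- names(select(iris, ends_with("Width")))

petal_cols
width_cols
```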

Filter rows that contain a certain string across all columns (with dplyr)

filter takes a logical vector, so when using across you need to pass the predicate function inside the across call, where it is applied to every selected column:

df %>% filter(across(everything(), ~ !str_detect(., "John")))
   V1  V2  V3
1 A C A R A L
2 A D A M A T
3 A F A V A N
4 A D A L A L
5 A C A Q A X

To keep the rows that do contain "John" in at least one column, use the solution proposed in @ekoam's comment:

df %>% filter(rowSums(across(everything(), ~ str_detect(., "John"))) > 0)
            V1         V2           V3
1   John Smith        A V John Donovan
2          A A John Smith          A R
3          A B        A D John Donovan
4 John Donovan        A O          A V
5          A F John Smith          A Q

Just to make the picture a bit clearer:

df %>% filter(print(across(everything(), ~ !str_detect(., "John"))))
# A tibble: 10 x 3
   V1    V2    V3
   <lgl> <lgl> <lgl>
 1 FALSE TRUE  FALSE
 2 TRUE  FALSE TRUE
 3 TRUE  TRUE  FALSE
 4 TRUE  TRUE  TRUE
 5 FALSE TRUE  TRUE
 6 TRUE  FALSE TRUE
 7 TRUE  TRUE  TRUE
 8 TRUE  TRUE  TRUE
 9 TRUE  TRUE  TRUE
10 TRUE  TRUE  TRUE
   V1  V2  V3
1 A C A R A L
2 A D A M A T
3 A F A V A N
4 A D A L A L
5 A C A Q A X

Notice that filter combines the logical columns with & (AND) row by row, i.e. only rows where every value is TRUE are kept, and rows with at least one FALSE are dropped. Now let's take a look at the code you provided in your comment:

df %>% filter(print(across(everything(), ~ str_detect(., "John"))))
# A tibble: 10 x 3
   V1    V2    V3
   <lgl> <lgl> <lgl>
 1 TRUE  FALSE TRUE
 2 FALSE TRUE  FALSE
 3 FALSE FALSE TRUE
 4 FALSE FALSE FALSE
 5 TRUE  FALSE FALSE
 6 FALSE TRUE  FALSE
 7 FALSE FALSE FALSE
 8 FALSE FALSE FALSE
 9 FALSE FALSE FALSE
10 FALSE FALSE FALSE
[1] V1 V2 V3
<0 rows> (or 0-length row.names)

All the rows have at least one FALSE, thus no rows are selected.
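
The difference between the two behaviours can be reproduced outside of filter. The following is a minimal sketch on a hypothetical three-row data frame (the values are invented, only the shape follows the example above); it builds the logical matrix by hand and combines it per row both ways.

```r
library(stringr)

# Hypothetical data in the same shape as the example above
df <- data.frame(
  V1 = c("John Smith", "A A", "A C"),
  V2 = c("A V", "John Smith", "A R"),
  V3 = c("John Donovan", "A R", "A L"),
  stringsAsFactors = FALSE
)

# One logical column per original column
hits <- sapply(df, function(col) str_detect(col, "John"))

# Row-wise AND: TRUE only if every column matches
all_hit <- rowSums(hits) == ncol(hits)

# Row-wise OR: TRUE if at least one column matches
any_hit <- rowSums(hits) > 0

df[any_hit, ]
```

No row has "John" in all three columns, so all_hit is FALSE everywhere, while any_hit keeps the first two rows.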

Filtering by multiple columns at once in `dplyr`

We could use if_all or if_any, as Anil points out in the comments. Both were introduced in dplyr 1.0.4:

https://www.tidyverse.org/blog/2021/02/dplyr-1-0-4-if-any/

if_any() and if_all()

"across() is very useful within summarise() and mutate(), but it’s hard to use it with filter() because it is not clear how the results would be combined into one logical vector. So to fill the gap, we’re introducing two new functions if_all() and if_any()."

if_all:

data %>%
  filter(if_all(starts_with("cp"), ~ . > 0.2))
  mt100 cp001 cp002 cp003
  <dbl> <dbl> <dbl> <dbl>
1 0.688 0.402 0.467 0.646
2 0.663 0.757 0.728 0.335
3 0.472 0.533 0.717 0.638

if_any:

data %>%
  filter(if_any(starts_with("cp"), ~ . > 0.2))
  mt100 cp001   cp002 cp003
  <dbl> <dbl>   <dbl> <dbl>
1 0.554 0.970 0.874   0.187
2 0.688 0.402 0.467   0.646
3 0.658 0.850 0.00813 0.542
4 0.663 0.757 0.728   0.335
5 0.472 0.533 0.717   0.638
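
Since the original data is not shown, here is a small reproducible sketch with an invented tibble (toy and its values are assumptions made for this illustration). It requires dplyr 1.0.4 or later.

```r
library(dplyr)

# Hypothetical data: one "mt" column and two "cp" columns
toy <- tibble(
  mt100 = c(0.5, 0.6, 0.7),
  cp001 = c(0.9, 0.1, 0.3),
  cp002 = c(0.8, 0.05, 0.1)
)

# Keep rows where ALL cp columns exceed 0.2
all_rows <- toy %>% filter(if_all(starts_with("cp"), ~ .x > 0.2))

# Keep rows where AT LEAST ONE cp column exceeds 0.2
any_rows <- toy %>% filter(if_any(starts_with("cp"), ~ .x > 0.2))

all_rows
any_rows
```

Only the first row passes if_all; the third row passes if_any because cp001 is above the threshold even though cp002 is not.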

dplyr filter with condition on multiple columns

A possible solution for dplyr versions from 0.5.0.9004 up to (but not including) 1.0 is:

# > packageVersion('dplyr')
# [1] ‘0.5.0.9004’

dataset %>%
  filter(!is.na(father), !is.na(mother)) %>%
  filter_at(vars(-father, -mother), all_vars(is.na(.)))

Explanation:

  • vars(-father, -mother): select all columns except father and mother.
  • all_vars(is.na(.)): keep rows where is.na is TRUE for all the selected columns.

Note: use any_vars instead of all_vars to keep rows where is.na is TRUE for at least one of the selected columns.


Update (2020-11-28)

Since dplyr 1.0, the _at functions and vars have been superseded by across, so the following (or similar) is now recommended:

dataset %>%
  filter(across(c(father, mother), ~ !is.na(.x))) %>%
  filter(across(c(-father, -mother), is.na))

For more examples of across and how to rewrite older code with the new approach, see the "Column-wise operations" vignette, or run vignette("colwise") in R after installing the latest version of dplyr.
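
Given the if_all() and if_any() helpers quoted earlier on this page, the same filter can also be written without across inside filter. The sketch below assumes dplyr 1.0.4 or later and uses an invented dataset (the child and sibling columns are hypothetical placeholders for "all other columns").

```r
library(dplyr)

# Hypothetical data: father and mother plus two other columns
dataset <- tibble(
  father  = c(1, NA, 1),
  mother  = c(1, 1, 1),
  child   = c(NA, NA, 2),
  sibling = c(NA, NA, NA)
)

res <- dataset %>%
  filter(
    if_all(c(father, mother), ~ !is.na(.x)),  # both parents present
    if_all(-c(father, mother), is.na)         # every other column missing
  )

res
```

Only the first row survives: the second has a missing father, and the third has a non-missing child value.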

How to filter out a particular phrase from an unstructured data set in R and put it in a new data frame

If you want to add a new column component containing the type descriptions to an existing dataframe df, which contains the large string given in the post, then this should work:

Solution:

library(stringr)
df$component <- str_extract_all(df$v1, '(?<=bi:component name="\\w{1,100}" type=")[^"]+')

This makes use of a positive lookbehind, (?<=...), as well as the fact that \\w matches not only letters but also digits and the underscore, all of which occur in the values of bi:component name. Note that str_extract_all returns a list, so component is a list column.
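
To check the pattern in isolation, here is a minimal sketch on a shortened, single-line version of the input (the string x is an abbreviated assumption, not the full data from the post). It also shows an alternative that uses a capture group with str_match_all instead of a lookbehind, which should give the same result.

```r
library(stringr)

# Shortened, hypothetical version of the input string
x <- paste0(
  '<bi:component name="ROOT" type="ABSOLUTE_LAYOUT_COMPONENT"> ',
  '<bi:component name="CHART_1" type="com_sap_ip_bi_VizFrame"> ',
  '<bi:property name="LEFT_MARGIN" value="31"/>'
)

# Lookbehind version: match only the text after type="
types <- str_extract_all(
  x, '(?<=bi:component name="\\w{1,100}" type=")[^"]+'
)[[1]]

# Capture-group version: second column holds the captured type
types_alt <- str_match_all(
  x, 'bi:component name="\\w+" type="([^"]+)"'
)[[1]][, 2]

types
```

The bi:property element is skipped by both versions because it has no type attribute following bi:component name. To get a plain character column rather than a list column, the matches can be collapsed with paste(types, collapse = ", ").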

Result:

df
v1
1 </bi:data_source_alias>\n <bi:component name="ROOT" type="ABSOLUTE_LAYOUT_COMPONENT">\n <bi:component name="CHART_1" type="com_sap_ip_bi_VizFrame">\n <bi:property name="LEFT_MARGIN" value="31"/>` <bi:property name="TOP_MARGIN" value="64"/>\n<bi:component name="SCORECARD_1" type="com_sap_ip_bi_Scorecard">\n <bi:property name="LEFT_MARGIN" value="9"/>
component
1 ABSOLUTE_LAYOUT_COMPONENT, com_sap_ip_bi_VizFrame, com_sap_ip_bi_Scorecard

Data:

df <- data.frame(
v1 = '</bi:data_source_alias>
<bi:component name="ROOT" type="ABSOLUTE_LAYOUT_COMPONENT">
<bi:component name="CHART_1" type="com_sap_ip_bi_VizFrame">
<bi:property name="LEFT_MARGIN" value="31"/>` <bi:property name="TOP_MARGIN" value="64"/>
<bi:component name="SCORECARD_1" type="com_sap_ip_bi_Scorecard">
<bi:property name="LEFT_MARGIN" value="9"/>'
)

select columns based on multiple strings with dplyr contains()

You can use matches, which takes a regular expression:

mtcars %>%
  select(matches('m|ar')) %>%
  head(2)
#              mpg am gear carb
#Mazda RX4      21  1    4    4
#Mazda RX4 Wag  21  1    4    4

According to the ?select documentation

‘matches(x, ignore.case = TRUE)’: selects all variables whose
name matches the regular expression ‘x’

contains, by contrast, matches a literal string rather than a regular expression:

mtcars %>%
  select(contains('m'))
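
The difference can be checked directly on mtcars. In this sketch the variable names are placeholders, and the last call relies on contains accepting a character vector of several literal strings, which is the documented behaviour in recent tidyselect versions (treat it as an assumption on older installations).

```r
library(dplyr)

# Regular expression: names containing "m" or "ar"
by_regex <- names(select(mtcars, matches("m|ar")))

# Literal string: no column name contains the text "m|ar"
by_literal <- names(select(mtcars, contains("m|ar")))

# Several literal strings: union of the individual matches
by_vector <- names(select(mtcars, contains(c("m", "ar"))))

by_regex
by_literal
by_vector
```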

