R dplyr filter based on matching search term with first words of any word in select columns
EDIT
If we want to match any word in Name that starts with "bio" (not only the first word), we can do
df %>%
filter(str_detect(str_to_lower(Name), "\\bbio") | str_detect(Code, "^15"))
Or the same thing in base R:
df[sapply(strsplit(df$Name, "\\s+"), function(x) any(grepl("^bio", tolower(x)))) |
grepl("^15", df$Code), ]
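To see what the word-boundary anchor changes, here is a minimal base R sketch. The sample strings are made up for illustration and are not the data from the question.

```r
# Hypothetical sample values, not the data from the question
name <- c("Biofuel is good", "Probiotics is good",
          "Cars use biofuel", "Symbiosis matters")

# "\\bbio": "bio" must begin at a word boundary, in any word of the string
starts_a_word <- grepl("\\bbio", tolower(name), perl = TRUE)

# plain "bio": matches anywhere, including the middle of a word
anywhere <- grepl("bio", tolower(name), fixed = TRUE)

data.frame(name, starts_a_word, anywhere)
```

"Probiotics" and "Symbiosis" contain "bio" but no word in those strings starts with it, so only the first and third strings pass the word-boundary test.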
Original Answer
This selects rows where "bio" is present in the first word of Name (word(Name) returns only the first word) or where Code starts with "15".
library(tidyverse)
df %>%
filter(str_detect(str_to_lower(word(Name)), "bio") | str_detect(Code, "^15"))
# Name Code
#1 Biofuel is good 159403
#2 Bioecological is good 161540
#3 Probiotics is good 159883
Using the same logic but in base R, we can do
df[sapply(strsplit(df$Name, "\\s+"), function(x) grepl("bio", tolower(x[1])))
| grepl("^15", df$Code), ]
# Name Code
#1 Biofuel is good 159403
#2 Bioecological is good 161540
#3 Probiotics is good 159883
Here, strsplit splits each string at whitespace, x[1] extracts the first word from each, and grepl checks whether that word contains "bio". The result is combined with OR against the rows whose Code starts with "15".
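The intermediate steps of the base R approach can be looked at one by one. This is only a sketch with made-up strings; the variable names are mine.

```r
# Hypothetical sample values
name <- c("Biofuel is good", "Good probiotics", "Probiotics is good")

# Step 1: split each string at runs of whitespace (gives a list of word vectors)
words <- strsplit(name, "\\s+")

# Step 2: keep only the first word of each string
first_word <- vapply(words, function(x) x[1], character(1))

# Step 3: case-insensitive check for "bio" inside that first word
has_bio <- grepl("bio", tolower(first_word))
```

The second string is rejected because "bio" occurs only in its second word.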
Select columns based on string match - dplyr::select
Within the dplyr world, try:
select(iris,contains("Sepal"))
See the Selection section in ?select for numerous other helpers like starts_with, ends_with, etc.
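As a quick illustration of the other helpers on the same built-in iris data (a sketch, assuming dplyr is installed):

```r
library(dplyr)

# Columns whose names begin with "Sepal"
sepal_cols <- names(select(iris, starts_with("Sepal")))

# Columns whose names end with "Width"
width_cols <- names(select(iris, ends_with("Width")))

sepal_cols
width_cols
```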
Filter rows that contain a certain string across all columns (with dplyr)
filter takes a logical vector, so when using across you need to pass the predicate function to the across call, so that it is applied to all the selected columns:
df %>% filter(across(everything(), ~ !str_detect(., "John")))
V1 V2 V3
1 A C A R A L
2 A D A M A T
3 A F A V A N
4 A D A L A L
5 A C A Q A X
To keep the rows where at least one column contains "John" instead, use the solution proposed in @ekoam's comment:
df %>% filter(rowSums(across(everything(), ~ str_detect(., "John"))) > 0)
V1 V2 V3
1 John Smith A V John Donovan
2 A A John Smith A R
3 A B A D John Donovan
4 John Donovan A O A V
5 A F John Smith A Q
Just to make the picture a bit clearer:
df %>% filter(print(across(everything(), ~ !str_detect(., "John"))))
# A tibble: 10 x 3
V1 V2 V3
<lgl> <lgl> <lgl>
1 FALSE TRUE FALSE
2 TRUE FALSE TRUE
3 TRUE TRUE FALSE
4 TRUE TRUE TRUE
5 FALSE TRUE TRUE
6 TRUE FALSE TRUE
7 TRUE TRUE TRUE
8 TRUE TRUE TRUE
9 TRUE TRUE TRUE
10 TRUE TRUE TRUE
V1 V2 V3
1 A C A R A L
2 A D A M A T
3 A F A V A N
4 A D A L A L
5 A C A Q A X
Notice that filter combines the logical columns row by row with & (and), i.e. only rows where every value is TRUE are selected; rows with at least one FALSE are dropped. Now let's take a look at the code you provided in your comment:
df %>% filter(print(across(everything(), ~ str_detect(., "John"))))
# A tibble: 10 x 3
V1 V2 V3
<lgl> <lgl> <lgl>
1 TRUE FALSE TRUE
2 FALSE TRUE FALSE
3 FALSE FALSE TRUE
4 FALSE FALSE FALSE
5 TRUE FALSE FALSE
6 FALSE TRUE FALSE
7 FALSE FALSE FALSE
8 FALSE FALSE FALSE
9 FALSE FALSE FALSE
10 FALSE FALSE FALSE
[1] V1 V2 V3
<0 rows> (or 0-length row.names)
All the rows have at least one FALSE, thus no rows are selected.
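The same "all columns" versus "any column" logic can be written with if_all() and if_any(), which exist since dplyr 1.0.4. As far as I know, newer dplyr releases (1.0.8 and later) also warn that across() inside filter() is deprecated in favour of these two. The sketch below uses a small made-up tibble rather than the data from the question.

```r
library(dplyr)
library(stringr)

# Hypothetical data with the same column names as in the question
df <- tibble(
  V1 = c("John Smith", "A C", "A B"),
  V2 = c("A V", "A R", "John Smith"),
  V3 = c("John Donovan", "A L", "A D")
)

# Keep rows where at least one column contains "John"
with_john <- df %>%
  filter(if_any(everything(), ~ str_detect(.x, "John")))

# Keep rows where no column contains "John"
# (every column must pass the negated test)
without_john <- df %>%
  filter(if_all(everything(), ~ !str_detect(.x, "John")))
```

Here with_john keeps the first and third rows and without_john keeps only the second.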
Filtering by multiple columns at once in `dplyr`
We could use if_all or if_any, as Anil points out in his comments. They are described in the section "if_any() and if_all()" of this post:
https://www.tidyverse.org/blog/2021/02/dplyr-1-0-4-if-any/
"across() is very useful within summarise() and mutate(), but it’s hard to use it with filter() because it is not clear how the results would be combined into one logical vector. So to fill the gap, we’re introducing two new functions if_all() and if_any()."
For your code this would be:
if_all:
data %>%
filter(if_all(starts_with("cp"), ~ . > 0.2))
mt100 cp001 cp002 cp003
<dbl> <dbl> <dbl> <dbl>
1 0.688 0.402 0.467 0.646
2 0.663 0.757 0.728 0.335
3 0.472 0.533 0.717 0.638
if_any:
data %>%
filter(if_any(starts_with("cp"), ~ . > 0.2))
mt100 cp001 cp002 cp003
<dbl> <dbl> <dbl> <dbl>
1 0.554 0.970 0.874 0.187
2 0.688 0.402 0.467 0.646
3 0.658 0.850 0.00813 0.542
4 0.663 0.757 0.728 0.335
5 0.472 0.533 0.717 0.638
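For a version that can be run without the original data, here is a minimal sketch with invented numbers; only the column naming pattern (mt100, cp...) follows the question.

```r
library(dplyr)

# Hypothetical values chosen so that each row behaves differently
data <- tibble(
  mt100 = c(0.5, 0.6, 0.7),
  cp001 = c(0.1, 0.3, 0.9),
  cp002 = c(0.1, 0.1, 0.8)
)

# Every cp column must exceed 0.2
all_above <- data %>%
  filter(if_all(starts_with("cp"), ~ .x > 0.2))

# At least one cp column must exceed 0.2
any_above <- data %>%
  filter(if_any(starts_with("cp"), ~ .x > 0.2))
```

Row 1 fails both filters, row 2 passes only the if_any filter, and row 3 passes both.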
dplyr filter with condition on multiple columns
A possible dplyr (0.5.0.9004 <= version < 1.0) solution is:
# > packageVersion('dplyr')
# [1] ‘0.5.0.9004’
dataset %>%
filter(!is.na(father), !is.na(mother)) %>%
filter_at(vars(-father, -mother), all_vars(is.na(.)))
Explanation:
vars(-father, -mother): select all columns except father and mother.
all_vars(is.na(.)): keep rows where is.na is TRUE for all the selected columns.
Note: any_vars should be used instead of all_vars if rows where is.na is TRUE for any column are to be kept.
Update (2020-11-28)
As the _at functions and vars have been superseded by the use of across since dplyr 1.0, the following way (or similar) is recommended now:
dataset %>%
filter(across(c(father, mother), ~ !is.na(.x))) %>%
filter(across(c(-father, -mother), is.na))
See more examples of across, and how to rewrite previous code with the new approach, here: Column-wise operations, or type vignette("colwise") in R after installing the latest version of dplyr.
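With dplyr 1.0.4 or later, the same filter can also be sketched with if_all(), which states the "all selected columns" intent explicitly. The data below is invented; the columns child and uncle are hypothetical stand-ins for "every other column".

```r
library(dplyr)

# Hypothetical data: only the first row has both parents present
# and every other column missing
dataset <- tibble(
  father = c(1, NA, 1, 1),
  mother = c(1, 1, 1, NA),
  child  = c(NA, NA, 1, NA),
  uncle  = c(NA, NA, NA, NA)
)

result <- dataset %>%
  filter(
    if_all(c(father, mother), ~ !is.na(.x)),  # both parents present
    if_all(-c(father, mother), is.na)         # all remaining columns are NA
  )
```

Replacing the second if_all with if_any would correspond to the any_vars variant of the older code.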
How to filter out a particular phrase from an unstructured data set in R and put it in a new data frame
If you want to add a new column component containing the type descriptions to an existing dataframe df, which contains the large string given in the post, then this should work:
Solution:
df$component <- str_extract_all(df$v1, '(?<=bi:component name="\\w{1,100}" type=")[^"]+')
This makes use of a positive lookbehind, (?<=...), as well as the fact that \\w matches not only letters but also digits and the underscore, all of which occur in the values of bi:component name.
Result:
df
v1
1 </bi:data_source_alias>\n <bi:component name="ROOT" type="ABSOLUTE_LAYOUT_COMPONENT">\n <bi:component name="CHART_1" type="com_sap_ip_bi_VizFrame">\n <bi:property name="LEFT_MARGIN" value="31"/>` <bi:property name="TOP_MARGIN" value="64"/>\n<bi:component name="SCORECARD_1" type="com_sap_ip_bi_Scorecard">\n <bi:property name="LEFT_MARGIN" value="9"/>
component
1 ABSOLUTE_LAYOUT_COMPONENT, com_sap_ip_bi_VizFrame, com_sap_ip_bi_Scorecard
Data:
df <- data.frame(
v1 = '</bi:data_source_alias>
<bi:component name="ROOT" type="ABSOLUTE_LAYOUT_COMPONENT">
<bi:component name="CHART_1" type="com_sap_ip_bi_VizFrame">
<bi:property name="LEFT_MARGIN" value="31"/>` <bi:property name="TOP_MARGIN" value="64"/>
<bi:component name="SCORECARD_1" type="com_sap_ip_bi_Scorecard">
<bi:property name="LEFT_MARGIN" value="9"/>'
)
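If the lookbehind feels hard to read, a capture group is one possible alternative, and it also makes it easy to put the matches into a new data frame with one row per component. This is a sketch on a shortened, made-up version of the string; the object names x, m and components are mine.

```r
library(stringr)

# Shortened sample string in the same format as the data above
x <- paste0(
  '<bi:component name="ROOT" type="ABSOLUTE_LAYOUT_COMPONENT">',
  '<bi:component name="CHART_1" type="com_sap_ip_bi_VizFrame">',
  '<bi:property name="LEFT_MARGIN" value="31"/>'
)

# str_match_all returns one matrix per input string:
# column 1 is the full match, column 2 is the first capture group
m <- str_match_all(x, 'bi:component name="\\w+" type="([^"]+)"')[[1]]

# New data frame with one row per extracted type
components <- data.frame(component = m[, 2])
components
```

The bi:property element is skipped because the pattern requires the bi:component prefix.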
select columns based on multiple strings with dplyr contains()
You can use matches
mtcars %>%
select(matches('m|ar')) %>%
head(2)
# mpg am gear carb
#Mazda RX4 21 1 4 4
#Mazda RX4 Wag 21 1 4 4
According to the ?select documentation:
‘matches(x, ignore.case = TRUE)’: selects all variables whose name matches the regular expression ‘x’
contains, by contrast, matches a literal string rather than a regular expression:
mtcars %>%
select(contains('m'))
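A short side-by-side sketch on mtcars. The last call assumes a reasonably recent tidyselect, where, as far as I know, the helpers accept a character vector and take the union of the matches; on older versions it may not work.

```r
library(dplyr)

# Regular expression: names containing "m" or "ar"
by_regex <- names(select(mtcars, matches("m|ar")))

# Literal string: names containing "m"
by_literal <- names(select(mtcars, contains("m")))

# Assumption: recent tidyselect lets contains() take several literal strings
by_vector <- names(select(mtcars, contains(c("m", "ar"))))
```

by_regex and by_vector should select the same four columns (mpg, am, gear, carb), while by_literal selects only mpg and am.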