How to filter rows containing a string pattern from a Pandas dataframe
In [3]: df[df['ids'].str.contains("ball")]
Out[3]:
ids vals
0 aball 1
1 bball 2
3 fball 4
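A minimal, self-contained sketch of the same filter follows. The sample data is hypothetical (chosen to reproduce the indices above), and the `None` entry is added only to show why `na=False` can be useful: without it, a missing value yields a missing entry in the mask and boolean indexing fails.

```python
import pandas as pd

# Hypothetical sample data; the None entry stands in for a missing value.
df = pd.DataFrame({
    "ids": ["aball", "bball", "cnut", "fball", None],
    "vals": [1, 2, 3, 4, 5],
})

# na=False treats missing values as non-matches, so the mask is purely boolean.
mask = df["ids"].str.contains("ball", na=False)
filtered = df[mask]

# regex=False matches the pattern as a literal substring instead of a regex.
literal = df[df["ids"].str.contains("ball", na=False, regex=False)]

print(filtered)
```

By default `str.contains` interprets the pattern as a regular expression, so `regex=False` is worth considering when the search string may contain characters such as `.` or `(`.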
Filter rows that contain a certain string across all columns (with dplyr)
filter takes a logical vector, so when you use across you need to pass the predicate function inside the across call so that it is applied to every selected column:
df %>% filter(across(everything(), ~ !str_detect(., "John")))
V1 V2 V3
1 A C A R A L
2 A D A M A T
3 A F A V A N
4 A D A L A L
5 A C A Q A X
To keep the rows where at least one column contains "John" instead, use the solution proposed in @ekoam's comment:
df %>% filter(rowSums(across(everything(), ~ str_detect(., "John"))) > 0)
V1 V2 V3
1 John Smith A V John Donovan
2 A A John Smith A R
3 A B A D John Donovan
4 John Donovan A O A V
5 A F John Smith A Q
Just to make the picture a bit clearer:
df %>% filter(print(across(everything(), ~ !str_detect(., "John"))))
# A tibble: 10 x 3
V1 V2 V3
<lgl> <lgl> <lgl>
1 FALSE TRUE FALSE
2 TRUE FALSE TRUE
3 TRUE TRUE FALSE
4 TRUE TRUE TRUE
5 FALSE TRUE TRUE
6 TRUE FALSE TRUE
7 TRUE TRUE TRUE
8 TRUE TRUE TRUE
9 TRUE TRUE TRUE
10 TRUE TRUE TRUE
V1 V2 V3
1 A C A R A L
2 A D A M A T
3 A F A V A N
4 A D A L A L
5 A C A Q A X
Notice that filter combines the booleans with & (and) by row, i.e. only rows where every value is TRUE are selected; rows with at least one FALSE are not. Now let's take a look at the code you provided in your comment:
df %>% filter(print(across(everything(), ~ str_detect(., "John"))))
# A tibble: 10 x 3
V1 V2 V3
<lgl> <lgl> <lgl>
1 TRUE FALSE TRUE
2 FALSE TRUE FALSE
3 FALSE FALSE TRUE
4 FALSE FALSE FALSE
5 TRUE FALSE FALSE
6 FALSE TRUE FALSE
7 FALSE FALSE FALSE
8 FALSE FALSE FALSE
9 FALSE FALSE FALSE
10 FALSE FALSE FALSE
[1] V1 V2 V3
<0 rows> (or 0-length row.names)
All the rows have at least one FALSE, thus no rows are selected.
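For readers working in pandas rather than dplyr, the same "any column" versus "all columns" distinction can be sketched as follows. The data frame is a hypothetical three-row stand-in for the one in the question; building the per-column boolean table first makes the row-wise reduction explicit.

```python
import pandas as pd

# Hypothetical stand-in for the data frame in the question.
df = pd.DataFrame({
    "V1": ["John Smith", "A A", "A C"],
    "V2": ["A V", "John Smith", "A R"],
    "V3": ["John Donovan", "A R", "A L"],
})

# One boolean column per original column: True where the cell contains "John".
hits = df.apply(lambda col: col.str.contains("John", na=False))

with_john = df[hits.any(axis=1)]      # at least one column matches
without_john = df[~hits.any(axis=1)]  # no column matches
every_col = df[hits.all(axis=1)]      # every column matches (empty here)
```

Reducing with `all(axis=1)` mirrors the row-wise "and" described above, while `any(axis=1)` plays the role of the `rowSums(...) > 0` test.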
Filter row based on a string condition, dplyr filter, contains
The contains function in dplyr is a select helper. Its purpose is to help when using the select function, and the select function is focused on selecting columns, not rows. See the dplyr documentation on select helpers.
filter is the intended mechanism for selecting rows. The function you are probably looking for is grepl, which does pattern matching for text.
So the solution you are looking for is probably:
filtered_df <- filter(df, grepl("background", site_type, ignore.case = TRUE))
I suspect that contains is mostly a wrapper applying grepl to the column names, so the logic is very similar.
References:
- grep R documentation
- high rated question applying exactly this technique
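As a rough pandas counterpart to the case-insensitive grepl call, `str.contains` accepts `case=False`. This is only a sketch: the `site_type` values and the `n` column below are made up for illustration.

```python
import pandas as pd

# Hypothetical data; only the column name site_type comes from the question.
df = pd.DataFrame({
    "site_type": ["Background", "urban background", "Roadside"],
    "n": [1, 2, 3],
})

# case=False makes the match case-insensitive, like ignore.case = TRUE in grepl.
filtered_df = df[df["site_type"].str.contains("background", case=False, na=False)]
```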
Filter rows which do NOT contain a certain string
You can simply use ! before your grepl like this:
df <- df %>% filter(!grepl('first|second', Text))
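In pandas the equivalent negation is the `~` operator applied to the boolean mask. The sketch below uses a hypothetical `Text` column with made-up values.

```python
import pandas as pd

# Hypothetical data mirroring the Text column used above.
df = pd.DataFrame({"Text": ["first item", "second item", "third item"]})

# "first|second" is a regex alternation; ~ inverts the resulting mask.
# With na=False, missing values count as non-matches and are therefore kept.
kept = df[~df["Text"].str.contains("first|second", na=False)]
```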
Is there a function to filter rows that contains a string, but on chosen columns names that contains a given string in R?
You can use filter_at to filter based on a selected set of variables (in this case, those starting with "Actor").
library(tidyverse)
df_ex %>%
filter_at(vars(starts_with("Actor")), any_vars(. %in% outl))
Output
ID Actor1 Actor2 Actor3 Actor4 Leng Genre
1 1 Driver President CEO Priest 12 horror
2 2 Zombie Devil 42 criminal
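The same idea in pandas might look like the sketch below: pick the columns by name first, then test membership and reduce row-wise. The contents of `outl` are not shown in the original answer, so the values here are assumptions, as is the trimmed-down `df_ex`.

```python
import pandas as pd

# Hypothetical, reduced version of df_ex.
df_ex = pd.DataFrame({
    "ID": [1, 2, 3],
    "Actor1": ["Driver", "Zombie", "Nurse"],
    "Actor2": ["President", "Devil", "Pilot"],
    "Leng": [12, 42, 7],
})
outl = ["Driver", "Devil"]  # assumed values; not given in the original

# Select the columns whose names start with "Actor".
actor_cols = [c for c in df_ex.columns if c.startswith("Actor")]

# Keep rows where any of those columns holds a value from outl.
result = df_ex[df_ex[actor_cols].isin(outl).any(axis=1)]
```

Note that `isin` tests for exact equality, like `%in%`; for substring matching one would apply `str.contains` per column instead.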
Filter only rows that contain exact two strings in a column
We can use str_count to create a logical vector in filter:
library(dplyr)
library(stringr)
df %>%
filter(str_count(sp_name, "\\w+") == 2)
-output
sp_name value
1 Xylopia brasiliensis 1
2 Xylosma tweediana 2
Or this can be done with str_detect as well: match a word (\\w+) from the start (^), followed by a space and another word (\\w+) at the end ($) of the string
df %>%
filter(str_detect(sp_name, "^\\w+ \\w+$"))
Or in base R with grepl
subset(df, grepl("^\\w+ \\w+$", sp_name))
sp_name value
1 Xylopia brasiliensis 1
2 Xylosma tweediana 2
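Both approaches carry over to pandas almost directly. The following is a sketch with hypothetical rows added so that some names do not consist of exactly two words.

```python
import pandas as pd

# Hypothetical data; the last two rows are added to show non-matching cases.
df = pd.DataFrame({
    "sp_name": ["Xylopia brasiliensis", "Xylosma tweediana",
                "Xylopia", "Xylosma sp. nov"],
    "value": [1, 2, 3, 4],
})

# Count runs of word characters and keep rows with exactly two.
by_count = df[df["sp_name"].str.count(r"\w+") == 2]

# Require the whole string to be: word, single space, word.
by_match = df[df["sp_name"].str.fullmatch(r"\w+ \w+")]
```

The two tests are not identical in general: the count-based version accepts any separator between the two words, while the full-match version insists on a single space.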
r - Filter rows that contain a string from a vector
We can use grep
df1[grep(paste(v1, collapse="|"), df1$animal),]
Or using dplyr
df1 %>%
filter(grepl(paste(v1, collapse="|"), animal))
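A pandas sketch of the same "join the vector with |" technique is shown below. The data and the contents of `v1` are hypothetical; `re.escape` is an addition not present in the R answer, included because joined strings are interpreted as a regex and may contain special characters.

```python
import re

import pandas as pd

# Hypothetical data and search terms.
df1 = pd.DataFrame({"animal": ["brown dog", "black cat", "c.t fish"]})
v1 = ["dog", "c.t"]

# Escape each term so "." is matched literally, then join into one alternation.
pattern = "|".join(re.escape(s) for s in v1)

result = df1[df1["animal"].str.contains(pattern, na=False)]
```

Without the escaping step, `c.t` would also match "black cat", because `.` matches any character in a regular expression.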
How to filter rows containing specific string values with an AND operator
df[df['ids'].str.contains("ball")]
Would become:
df[df['ids'].str.contains("ball") & df['ids'].str.contains("field")]
If you are into neater code:
contains_balls = df['ids'].str.contains("ball")
contains_fields = df['ids'].str.contains("field")
filtered_df = df[contains_balls & contains_fields]
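When the number of required substrings grows, the masks can be built in a loop and combined in one step. This is a sketch under assumed data; the `required` list is a hypothetical name introduced for illustration.

```python
import operator
from functools import reduce

import pandas as pd

# Hypothetical data.
df = pd.DataFrame({
    "ids": ["ballfield", "fieldball", "aball", "field"],
    "vals": [1, 2, 3, 4],
})
required = ["ball", "field"]  # every one of these must appear

# One boolean mask per required substring (literal match, missing -> False).
masks = [df["ids"].str.contains(s, na=False, regex=False) for s in required]

# AND all masks together element-wise.
filtered_df = df[reduce(operator.and_, masks)]
```

Swapping `operator.and_` for `operator.or_` would give the "contains any of these" variant instead.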