Filter Rows Which Contain a Certain String

How to filter rows containing a string pattern from a Pandas dataframe

In [3]: df[df['ids'].str.contains("ball")]
Out[3]:
     ids  vals
0  aball     1
1  bball     2
3  fball     4
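
For anyone who wants to reproduce this, here is a minimal self-contained sketch. The sample frame is an assumption: rows 0, 1 and 3 mirror the output above, while row 2 is invented, since the original data is not shown. The `regex=False` and `na=False` arguments are optional additions that are not part of the original answer.

```python
import pandas as pd

# Hypothetical sample data: rows 0, 1 and 3 mirror the output above;
# row 2 ("ccat") is invented so that one row gets filtered out.
df = pd.DataFrame({"ids": ["aball", "bball", "ccat", "fball"],
                   "vals": [1, 2, 3, 4]})

# regex=False treats "ball" as a literal substring rather than a pattern;
# na=False makes missing values count as non-matches.
mask = df["ids"].str.contains("ball", regex=False, na=False)
result = df[mask]
print(result)
```

Without `na=False`, a missing value in the column produces a missing value in the mask, which typically raises an error when used for indexing.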

Filter rows that contain a certain string across all columns (with dplyr)

filter takes a logical vector, so when using across you need to pass the predicate function to the across call, so that it is applied to each of the selected columns:

df %>% filter(across(everything(), ~ !str_detect(., "John")))
   V1  V2  V3
1 A C A R A L
2 A D A M A T
3 A F A V A N
4 A D A L A L
5 A C A Q A X

Using the solution proposed in @ekoam's comment, which keeps rows where at least one column contains "John":

df %>% filter(rowSums(across(everything(), ~ str_detect(., "John"))) > 0)
            V1         V2           V3
1   John Smith        A V John Donovan
2          A A John Smith          A R
3          A B        A D John Donovan
4 John Donovan        A O          A V
5          A F John Smith          A Q

To make the picture a bit clearer, print the logical columns that across hands to filter:

df %>% filter(print(across(everything(), ~ !str_detect(., "John"))))
# A tibble: 10 x 3
   V1    V2    V3
   <lgl> <lgl> <lgl>
 1 FALSE TRUE  FALSE
 2 TRUE  FALSE TRUE
 3 TRUE  TRUE  FALSE
 4 TRUE  TRUE  TRUE
 5 FALSE TRUE  TRUE
 6 TRUE  FALSE TRUE
 7 TRUE  TRUE  TRUE
 8 TRUE  TRUE  TRUE
 9 TRUE  TRUE  TRUE
10 TRUE  TRUE  TRUE
   V1  V2  V3
1 A C A R A L
2 A D A M A T
3 A F A V A N
4 A D A L A L
5 A C A Q A X

Notice that filter ANDs the logical columns row by row: only rows where every value is TRUE are kept, and any row with at least one FALSE is dropped. Now let's take a look at the code you provided in your comment:

df %>% filter(print(across(everything(), ~ str_detect(., "John"))))
# A tibble: 10 x 3
   V1    V2    V3
   <lgl> <lgl> <lgl>
 1 TRUE  FALSE TRUE
 2 FALSE TRUE  FALSE
 3 FALSE FALSE TRUE
 4 FALSE FALSE FALSE
 5 TRUE  FALSE FALSE
 6 FALSE TRUE  FALSE
 7 FALSE FALSE FALSE
 8 FALSE FALSE FALSE
 9 FALSE FALSE FALSE
10 FALSE FALSE FALSE
[1] V1 V2 V3
<0 rows> (or 0-length row.names)

All the rows have at least one FALSE, thus no rows are selected.
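
The same any-versus-all distinction can be sketched in pandas, which may help readers coming from the Python side of this page. This is only an analogue of the dplyr behaviour, not the dplyr implementation, and the sample data is hypothetical (loosely modelled on the names in the output above).

```python
import pandas as pd

# Hypothetical data: three rows with "John" in some column, one row without.
df = pd.DataFrame({
    "V1": ["John Smith", "A A", "A B", "A C"],
    "V2": ["A V", "John Smith", "A D", "A R"],
    "V3": ["John Donovan", "A R", "John Donovan", "A L"],
})

# One boolean column per original column: does the cell contain "John"?
hits = df.apply(lambda col: col.str.contains("John", regex=False))

any_john = df[hits.any(axis=1)]    # at least one column matches
no_john = df[~hits.any(axis=1)]    # no column matches
all_john = df[hits.all(axis=1)]    # every column matches (row-wise AND)

print(any_john)
print(no_john)
print(all_john)
```

Here `all_john` is empty for the same reason the last dplyr call returns no rows: no row has "John" in every column.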

Filter row based on a string condition, dplyr filter, contains

The contains function in dplyr is a select helper. Its purpose is to help when using the select function, and select is focused on selecting columns, not rows. See the dplyr documentation on select helpers.

filter is the intended mechanism for selecting rows. The function you are probably looking for is grepl, which does pattern matching on text.

So the solution you are looking for is probably:

filtered_df <- filter(df, grepl("background", site_type, ignore.case = TRUE))

I suspect that contains is mostly a wrapper applying grepl to the column names. So the logic is very similar.
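
For comparison, a case-insensitive match of the same kind can be sketched in pandas. The column name `site_type` follows the R snippet above; the data itself is hypothetical.

```python
import pandas as pd

# Hypothetical data; only the column name comes from the snippet above.
df = pd.DataFrame({"site_type": ["Background", "urban background", "Traffic", None]})

# case=False plays the role of ignore.case = TRUE in grepl;
# na=False treats the missing value as a non-match.
filtered_df = df[df["site_type"].str.contains("background", case=False, na=False)]
print(filtered_df)
```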

Filter rows which do NOT contain a certain string

You can simply use ! before your grepl like this:

df <- df %>% filter(!grepl('first|second', Text))
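
A rough pandas counterpart negates the boolean mask with `~`. This is a sketch on hypothetical data; the column name `Text` follows the R snippet above.

```python
import pandas as pd

# Hypothetical data; the last row has a missing value.
df = pd.DataFrame({"Text": ["first place", "second try", "third time", None]})

# "first|second" is a regular expression (alternation), as in the grepl call.
# na=False marks missing values as non-matches, so after negation they are kept.
matches = df["Text"].str.contains("first|second", na=False)
kept = df[~matches]
print(kept)
```

Whether rows with missing text should be kept or dropped is a design choice; passing `na=True` instead would drop them after the negation.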

Is there a function to filter rows that contain a string, but only in chosen columns whose names contain a given string, in R?

You can use filter_at to filter on a selected set of variables (in this case, those whose names start with "Actor"). Note that %in% tests for exact matches against the values in outl, not for substrings.

library(tidyverse)

df_ex %>%
  filter_at(vars(starts_with("Actor")), any_vars(. %in% outl))

Output

  ID Actor1    Actor2 Actor3 Actor4 Leng    Genre
1 1 Driver President CEO Priest 12 horror
2 2 Zombie Devil 42 criminal
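
The same idea (pick columns by name prefix, then keep rows where any of them holds a wanted value) can be sketched in pandas. Both the data and the contents of `outl` are hypothetical here, since the original question's data is not shown.

```python
import pandas as pd

# Hypothetical data; only the column naming scheme follows the output above.
df_ex = pd.DataFrame({
    "ID": [1, 2, 3],
    "Actor1": ["Driver", "Zombie", "Teacher"],
    "Actor2": ["President", "Devil", "Nurse"],
    "Leng": [12, 42, 30],
})
outl = ["Driver", "Devil"]  # hypothetical list of values to look for

# Select the columns whose names start with "Actor".
actor_cols = df_ex.columns[df_ex.columns.str.startswith("Actor")]

# isin tests exact membership (like %in%); any(axis=1) keeps a row
# if at least one of the selected columns matches.
result = df_ex[df_ex[actor_cols].isin(outl).any(axis=1)]
print(result)
```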

Filter only rows that contain exactly two strings in a column

We can use str_count to create a logical vector in filter:

library(dplyr)
library(stringr)
df %>%
  filter(str_count(sp_name, "\\w+") == 2)

Output

               sp_name value
1 Xylopia brasiliensis     1
2    Xylosma tweediana     2

This can also be done with str_detect: match a word (\\w+) at the start (^) of the string, followed by a single space and another word (\\w+) at the end ($) of the string.

df %>%
  filter(str_detect(sp_name, "^\\w+ \\w+$"))

Or in base R with grepl:

subset(df, grepl("^\\w+ \\w+$", sp_name))
               sp_name value
1 Xylopia brasiliensis     1
2    Xylosma tweediana     2
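
Both approaches have close pandas counterparts, sketched below. The two-word species names come from the output above; the other two rows are invented so that something is filtered out.

```python
import pandas as pd

# Rows 0 and 1 mirror the output above; rows 2 and 3 are hypothetical.
df = pd.DataFrame({
    "sp_name": ["Xylopia brasiliensis", "Xylosma tweediana",
                "Xylopia", "Xylosma tweediana subsp nova"],
    "value": [1, 2, 3, 4],
})

# Approach 1: count the word-like runs and keep rows with exactly two.
by_count = df[df["sp_name"].str.count(r"\w+") == 2]

# Approach 2: require the whole string to be "word space word".
# fullmatch anchors at both ends, so ^ and $ are not needed.
by_match = df[df["sp_name"].str.fullmatch(r"\w+ \w+", na=False)]

print(by_count)
print(by_match)
```

The two approaches can disagree on input with punctuation or repeated spaces, because counting words ignores what separates them while the full match allows only a single space.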

Filter rows that contain a string from a vector

We can use grep

df1[grep(paste(v1, collapse="|"), df1$animal),]

Or using dplyr

df1 %>%
  filter(grepl(paste(v1, collapse="|"), animal))
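
In pandas the same trick works by joining the terms into one alternation pattern. The names `df1`, `v1` and `animal` follow the R snippets above; the data is hypothetical, and escaping the terms with `re.escape` is an addition not present in the original answer.

```python
import re

import pandas as pd

# Hypothetical data and search terms.
df1 = pd.DataFrame({"animal": ["red fox", "grey wolf", "brown bear", "sea otter"]})
v1 = ["fox", "bear"]

# Build "fox|bear". re.escape guards against terms that contain
# regex metacharacters such as "." or "+".
pattern = "|".join(re.escape(term) for term in v1)

result = df1[df1["animal"].str.contains(pattern, na=False)]
print(result)
```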

How to filter rows containing specific string values with an AND operator

df[df['ids'].str.contains("ball")]

Would become:

df[df['ids'].str.contains("ball") & df['ids'].str.contains("field")]

If you are into neater code:

contains_balls = df['ids'].str.contains("ball")
contains_fields = df['ids'].str.contains("field")

filtered_df = df[contains_balls & contains_fields]
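
If the number of required substrings grows, the masks can be combined in a loop instead of being written out one by one. A minimal sketch, assuming hypothetical sample data and a hypothetical `terms` list:

```python
import pandas as pd

# Hypothetical data: only rows 1 and 3 contain both substrings.
df = pd.DataFrame({"ids": ["aball", "ballfield", "cfield", "fieldball"],
                   "vals": [1, 2, 3, 4]})
terms = ["ball", "field"]  # every term must be present

# Start with an all-True mask and AND in one condition per term.
mask = pd.Series(True, index=df.index)
for term in terms:
    mask &= df["ids"].str.contains(term, regex=False, na=False)

filtered_df = df[mask]
print(filtered_df)
```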

