How to Filter Rows Containing a String Pattern from a Pandas Dataframe


In [3]: df[df['ids'].str.contains("ball")]
Out[3]:
     ids  vals
0  aball     1
1  bball     2
3  fball     4
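The snippet above assumes a DataFrame with an `ids` column already exists. A minimal sketch that reproduces the output (the exact values are assumptions reconstructed from the result shown):

```python
import pandas as pd

# Hypothetical frame matching the output above
df = pd.DataFrame({
    "ids": ["aball", "bball", "foo", "fball"],
    "vals": [1, 2, 3, 4],
})

# Keep only rows whose 'ids' value contains the substring "ball"
result = df[df["ids"].str.contains("ball")]
print(result)
```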

How to filter Pandas Dataframe rows which contain any string from a list?

Setup

import pandas as pd

df = pd.DataFrame(dict(
    A=['I need avocado', 'something', 'useless', 'nothing'],
    B=['something', 'I eat margarina', 'eat apple', 'more nothing']
))
includeKeywords = ["apple", "avocado", "bannana"]

Problem

                A                B
0  I need avocado        something  # True: 'avocado' in A
1       something  I eat margarina
2         useless        eat apple  # True: 'apple' in B
3         nothing     more nothing

Solution

  • pandas.DataFrame.stack to make df a Series and enable us to use the pandas.Series.str accessor functions
  • pandas.Series.str.contains with '|'.join(includeKeywords)
  • groupby(level=0) followed by pandas.Series.any, because stacking added a level to the index (older pandas accepted Series.any(level=0) directly, but that API has been removed)

df[df.stack().str.contains('|'.join(includeKeywords)).groupby(level=0).any()]

                A          B
0  I need avocado  something
2         useless  eat apple

Details

This produces a regex search string. In regex, '|' means "or", so the pattern matches 'apple', 'avocado', or 'bannana':

kwstr = '|'.join(includeKeywords)
print(kwstr)

apple|avocado|bannana

Stacking will flatten our DataFrame

df.stack()

0  A     I need avocado
   B          something
1  A          something
   B    I eat margarina
2  A            useless
   B          eat apple
3  A            nothing
   B       more nothing
dtype: object

Fortunately, the pandas.Series.str.contains method can handle regex and it will produce a boolean Series

df.stack().str.contains(kwstr)

0  A     True
   B    False
1  A    False
   B    False
2  A    False
   B     True
3  A    False
   B    False
dtype: bool

At which point we can collapse the inner level by grouping on level=0 and calling pandas.Series.any (older pandas versions accepted Series.any(level=0) directly)

mask = df.stack().str.contains(kwstr).groupby(level=0).any()
mask

0     True
1    False
2     True
3    False
dtype: bool

By reducing over level=0 we preserved the original index in the resulting Series. This makes it perfect for filtering df

df[mask]

                A          B
0  I need avocado  something
2         useless  eat apple
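The same filter can be sketched without stacking at all: apply str.contains to each column and reduce row-wise with any(axis=1). This assumes every column holds strings; the names reuse the Setup above.

```python
import pandas as pd

df = pd.DataFrame(dict(
    A=['I need avocado', 'something', 'useless', 'nothing'],
    B=['something', 'I eat margarina', 'eat apple', 'more nothing']
))
includeKeywords = ["apple", "avocado", "bannana"]
pat = '|'.join(includeKeywords)

# Test every column for the pattern, then keep rows where any column matched
mask = df.apply(lambda col: col.str.contains(pat)).any(axis=1)
result = df[mask]
print(result)
```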

How to filter rows containing specific string values with an AND operator

df[df['ids'].str.contains("ball")]

Would become:

df[df['ids'].str.contains("ball") & df['ids'].str.contains("field")]

If you are into neater code:

contains_balls = df['ids'].str.contains("ball")
contains_fields = df['ids'].str.contains("field")

filtered_df = df[contains_balls & contains_fields]
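A small self-contained sketch of the AND filter, using a hypothetical `ids` column (the values are assumptions chosen to show both conditions in action):

```python
import pandas as pd

# Hypothetical frame: keep rows mentioning both "ball" and "field"
df = pd.DataFrame({"ids": ["ball field", "ballpark", "field", "football field"]})

contains_balls = df["ids"].str.contains("ball")
contains_fields = df["ids"].str.contains("field")

# & requires both boolean masks to be True on the same row
filtered_df = df[contains_balls & contains_fields]
print(filtered_df)
```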

Filter dataframe rows containing a set of strings in Python

Use join with | for regex OR, and \b for word boundaries:

L = ['cat', 'dog']
pat = r'(\b{}\b)'.format('|'.join(L))
df[df["B"].str.contains(pat, case=False, na=False)]
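A runnable sketch of the word-boundary version, with an assumed column `B`; note that \b keeps 'catalog' from matching 'cat', and a non-capturing group (?:...) avoids pandas' warning about capture groups in str.contains:

```python
import pandas as pd

# Hypothetical data demonstrating whole-word matching
df = pd.DataFrame({"B": ["a cat", "catalog", "my DOG", "bird"]})

L = ['cat', 'dog']
# \b anchors match whole words only; (?:...) is a non-capturing group
pat = r'\b(?:{})\b'.format('|'.join(L))
result = df[df["B"].str.contains(pat, case=False, na=False)]
print(result)
```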

Pandas filtering rows with regex pattern present in the row itself

After a bit of modification, here is the result (it needs the re module):

import re

df[df.apply(lambda row: re.compile(row['pattern']).match(row['data']) is not None, axis=1)]
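A self-contained sketch of this row-wise match, assuming hypothetical `pattern` and `data` columns where each row carries its own regex:

```python
import re
import pandas as pd

# Hypothetical frame: each row stores the regex its 'data' must match
df = pd.DataFrame({
    "pattern": [r"\d+", r"[a-z]+", r"\d+"],
    "data": ["123", "456", "789"],
})

# re.match anchors at the start of the string; None means no match
mask = df.apply(
    lambda row: re.compile(row["pattern"]).match(row["data"]) is not None,
    axis=1,
)
result = df[mask]
print(result)
```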

How to drop/delete/filter rows in pandas dataframe based on string pattern condition?

Make a list of the values that should cause a row to be dropped:

drop_values = ['<character>', '|', ...]

and then filter your df with that variable (not the string 'list'):

df = df[~df['your column'].isin(drop_values)]

Note that isin compares whole cell values literally, so no regex escaping is needed; the pipe character can be listed as plain '|'.
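A sketch contrasting the two drop styles, using a hypothetical column name: isin for exact cell matches, and str.contains with re.escape when the listed characters should be dropped as substrings:

```python
import re
import pandas as pd

# Hypothetical data with some cells to drop
df = pd.DataFrame({"your column": ["keep", "<character>", "a|b", "also keep"]})

drop_values = ["<character>", "a|b"]

# Exact-value drop: isin compares whole cell contents, no regex involved
exact = df[~df["your column"].isin(drop_values)]

# Substring drop: escape regex metacharacters such as | before joining
pat = "|".join(re.escape(v) for v in drop_values)
partial = df[~df["your column"].str.contains(pat)]
print(exact)
print(partial)
```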


