Pandas Filtering for Multiple Substrings in Series

Pandas filtering for multiple substrings in series

If you're sticking to pure pandas, for both performance and practicality I think you should use regex for this task. However, you will need to escape any special characters in the substrings first to ensure that they are matched literally (and not used as regex metacharacters).

This is easy to do using re.escape:

>>> import re
>>> esc_lst = [re.escape(s) for s in lst]

These escaped substrings can then be joined using a regex pipe |. Each of the substrings can be checked against a string until one matches (or they have all been tested).

>>> pattern = '|'.join(esc_lst)

The masking stage then becomes a single low-level loop through the rows:

df[col].str.contains(pattern, case=False)

Here's a simple setup to get a sense of performance:

import re
from random import randint, seed

import pandas as pd

seed(321)

# 100 substrings of 5 characters
lst = [''.join([chr(randint(0, 256)) for _ in range(5)]) for _ in range(100)]

# 50000 strings of 20 characters
strings = [''.join([chr(randint(0, 256)) for _ in range(20)]) for _ in range(50000)]

col = pd.Series(strings)
esc_lst = [re.escape(s) for s in lst]
pattern = '|'.join(esc_lst)

The proposed method takes about 1 second (so maybe up to 20 seconds for 1 million rows):

%timeit col.str.contains(pattern, case=False)
1 loop, best of 3: 981 ms per loop

The method in the question took approximately 5 seconds using the same input data.

It's worth noting that these times are 'worst case' in the sense that there were no matches (so all substrings were checked). If there are matches, the timing will improve.
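Putting the pieces together, here is a minimal end-to-end sketch of the escape-join-contains approach. The substrings and column name are made up for illustration; note they deliberately contain regex metacharacters:

```python
import re

import pandas as pd

# hypothetical substrings containing regex metacharacters
lst = ['a.b', 'c+d']
df = pd.DataFrame({'col': ['xA.By', 'nothing here', 'has c+d inside']})

# escape each substring so '.' and '+' match literally, then join with |
pattern = '|'.join(re.escape(s) for s in lst)

mask = df['col'].str.contains(pattern, case=False)
```

Without `re.escape`, the `+` in `'c+d'` would be treated as a regex quantifier rather than a literal character.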

Pandas how to filter for multiple substrings in series

You can use re.escape() to escape the regex metacharacters, so you don't need to hand-escape every string in the word list searchfor (its definition stays unchanged):

import re

searchfor = ['.F1', '.N1', '.FW', '.SP'] # no need to escape each string

pattern = '|'.join(map(re.escape, searchfor)) # use re.escape() with map()

mask = (df["id"].str.contains(pattern))

re.escape() will escape each string for you:

print(pattern)

\.F1|\.N1|\.FW|\.SP

Filter Multiple Values using pandas

You are missing a pair of parentheses: the | operator has higher precedence than == (see docs), so each comparison must be wrapped in its own parentheses to get comparable items on both sides:

df = df.loc[(df['Col2'] == 'High') | (df['Col2'] == 'Medium')]
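A small runnable sketch of the precedence fix (column name Col2 and values assumed for illustration):

```python
import pandas as pd

# toy frame; column name 'Col2' is assumed for illustration
df = pd.DataFrame({'Col2': ['High', 'Low', 'Medium', 'High']})

# | binds more tightly than ==, so each comparison needs its own parentheses
filtered = df.loc[(df['Col2'] == 'High') | (df['Col2'] == 'Medium')]
```

Without the parentheses, Python would try to evaluate `'High' | df['Col2']` first and raise an error.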

Check if multiple substrings are in pandas dataframe

You can use regex, where '|' is an "or" in regular expressions:

l = ['LIMITED','INC','CORP']  
regstr = '|'.join(l)
df['NAME'].str.upper().str.contains(regstr)

MVCE:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'NAME':['Baby CORP.','Baby','Baby INC.','Baby LIMITED']})

In [3]: df
Out[3]:
           NAME
0    Baby CORP.
1          Baby
2     Baby INC.
3  Baby LIMITED

In [4]: l = ['LIMITED','INC','CORP']
...: regstr = '|'.join(l)
...: df['NAME'].str.upper().str.contains(regstr)
...:
Out[4]:
0     True
1    False
2     True
3     True
Name: NAME, dtype: bool

In [5]: regstr
Out[5]: 'LIMITED|INC|CORP'
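As an alternative to upper-casing the column, str.contains accepts a case=False argument; a short sketch using the same toy data:

```python
import pandas as pd

df = pd.DataFrame({'NAME': ['Baby CORP.', 'Baby', 'Baby INC.', 'Baby LIMITED']})
regstr = '|'.join(['LIMITED', 'INC', 'CORP'])

# case=False makes the match case-insensitive without modifying the column
mask = df['NAME'].str.contains(regstr, case=False)
```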

How to filter string in multiple conditions python pandas

Use str.contains with a string with values separated by '|':

print(data[data['columnName'].str.contains("5|five")])

Output:

  columnName
0      5Star
2  five star
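If the column can contain missing values, str.contains returns NaN for those rows, which breaks boolean indexing; passing na=False handles that. A sketch with assumed data:

```python
import pandas as pd

# toy data including a missing value
data = pd.DataFrame({'columnName': ['5Star', None, 'five star']})

# na=False turns missing values into non-matches instead of NaN
mask = data['columnName'].str.contains("5|five", na=False)
```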

Pandas filter dataframe columns through substring match

You can iterate over index axis:

>>> df[df.apply(lambda x: x['Name'].lower() in x['Fname'].lower(), axis=1)]

     Name  Age   Fname
1     Bob   12     Bob
2  Clarke   13  clarke

Note that str.contains takes a constant pattern as its first argument (pat), not a Series, which is why the row-wise apply is needed here.
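For completeness, a self-contained sketch of the row-wise check (toy data assumed, reconstructed loosely from the output shown above):

```python
import pandas as pd

# toy data; names and ages are assumed for illustration
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Clarke'],
                   'Age': [11, 12, 13],
                   'Fname': ['fred', 'Bob', 'clarke']})

# compare each row's Name against that same row's Fname, case-insensitively
mask = df.apply(lambda x: x['Name'].lower() in x['Fname'].lower(), axis=1)
```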

How to filter a pandas dataframe using multiple partial strings?

Use str.contains with | for multiple search elements:

mask = df['Answers'].str.contains(regex_pattern)
final_df = df[mask]

To create the regex pattern from your search elements:

strings_to_find = ["not in","not on","not have"]
regex_pattern = '|'.join(strings_to_find)
regex_pattern
'not in|not on|not have'
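End to end, with a hypothetical Answers column for illustration:

```python
import pandas as pd

# hypothetical data for the Answers column
df = pd.DataFrame({'Answers': ['it is not in stock',
                               'available now',
                               'we do not have it']})

strings_to_find = ["not in", "not on", "not have"]
regex_pattern = '|'.join(strings_to_find)

mask = df['Answers'].str.contains(regex_pattern)
final_df = df[mask]
```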

Filter Pandas Dataframe based on List of substrings

You could use pandas.Series.isin

>>> df.loc[df['type'].isin(substr)]
   year type  value  price
0  2000    A    500  10000
4  2006    C    500  12500
5  2012    A    500  65000
7  2019    D    500  51900
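Note that isin matches whole cell values (exact equality), not substrings, so it only applies when the list entries are complete values. A small sketch with assumed data:

```python
import pandas as pd

# toy frame; isin matches whole cell values, not substrings
df = pd.DataFrame({'year': [2000, 2003, 2006],
                   'type': ['A', 'B', 'C']})

substr = ['A', 'C']
result = df.loc[df['type'].isin(substr)]
```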

Pandas multiple filter str.contains or not contains

My answer to the problem:

import glob

import pandas as pd

for item in glob.glob('D:\\path\\*.change'):
    table = pd.read_csv(item, sep='\t', index_col=None)

    # filtering: drop rows whose 'query' matches any unwanted pattern
    query_table = table[
        (table['query'].str.contains("egg*", regex=True) == False) &
        (table['query'].str.contains(".*phospho*", regex=True) == False) &
        (table['query'].str.contains("vipe", regex=True) == False)]

    # keep rows whose 'template' matches either wanted string
    filtered_table = query_table[
        (query_table['template'].str.contains("ABC1")) |
        (query_table['template'].str.contains("bender"))]
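The separate contains calls can also be collapsed into one |-joined pattern per step, using ~ for negation; a sketch on assumed toy data standing in for one parsed file:

```python
import pandas as pd

# toy frame standing in for one parsed file (values assumed)
table = pd.DataFrame({'query': ['egg white', 'phospho-site', 'vipe1', 'keep me'],
                      'template': ['ABC1', 'bender', 'other', 'ABC1']})

# one pattern per step instead of three separate contains calls
unwanted = table['query'].str.contains('egg|phospho|vipe')
query_table = table[~unwanted]

wanted = query_table['template'].str.contains('ABC1|bender')
filtered_table = query_table[wanted]
```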

