Pandas filtering for multiple substrings in series
If you're sticking to pure pandas, for both performance and practicality I think you should use a regex for this task. However, you will need to escape any special characters in the substrings first, to ensure that they are matched literally (and not treated as regex metacharacters).
This is easy to do using re.escape:
>>> import re
>>> esc_lst = [re.escape(s) for s in lst]
These escaped substrings can then be joined using a regex pipe |. Each substring is checked against a string until one matches (or all of them have been tested).
>>> pattern = '|'.join(esc_lst)
The masking stage then becomes a single low-level loop through the rows:
df[col].str.contains(pattern, case=False)
Here's a simple setup to get a sense of performance:
import re
import pandas as pd
from random import randint, seed

seed(321)
# 100 substrings of 5 characters
lst = [''.join([chr(randint(0, 256)) for _ in range(5)]) for _ in range(100)]
# 50000 strings of 20 characters
strings = [''.join([chr(randint(0, 256)) for _ in range(20)]) for _ in range(50000)]
col = pd.Series(strings)
esc_lst = [re.escape(s) for s in lst]
pattern = '|'.join(esc_lst)
The proposed method takes about 1 second (so maybe up to 20 seconds for 1 million rows):
%timeit col.str.contains(pattern, case=False)
1 loop, best of 3: 981 ms per loop
The method in the question took approximately 5 seconds using the same input data.
It's worth noting that these times are 'worst case' in the sense that there were no matches (so all substrings were checked). If there are matches, then the timing will improve.
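Putting the steps above together, here is a minimal end-to-end sketch (the sample data and substrings are hypothetical, chosen to include regex metacharacters):

```python
import re
import pandas as pd

# Hypothetical data: a column of free-text strings
col = pd.Series(["foo (bar)", "baz qux", "no match here"])

# Substrings to search for; "(bar)" contains regex metacharacters
lst = ["(bar)", "qux"]

# Escape each substring, then join into a single alternation pattern
pattern = "|".join(re.escape(s) for s in lst)

# One vectorized pass over the column
mask = col.str.contains(pattern, case=False)
print(col[mask])
```

Without re.escape, the parentheses in "(bar)" would be interpreted as a regex group and the literal text would not match.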
Pandas how to filter for multiple substrings in series
You can use re.escape() to escape the regex metacharacters, so you don't need to escape every string in the word list searchfor by hand (no need to change the definition of searchfor):
import re
searchfor = ['.F1', '.N1', '.FW', '.SP'] # no need to escape each string
pattern = '|'.join(map(re.escape, searchfor)) # use re.escape() with map()
mask = (df["id"].str.contains(pattern))
re.escape() will escape each string for you:
>>> pattern
'\\.F1|\\.N1|\\.FW|\\.SP'
Filter Multiple Values using pandas
You are missing a pair of parentheses to get comparable items on both sides of the | operator, which has higher precedence than == (see the docs):
df = df.loc[(df['Col2'] == 'High') | (df['Col2'] == 'Medium')]
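To see why the parentheses matter: since | binds tighter than ==, the unparenthesized version parses as a chained comparison around 'High' | df['Col2'], which raises a TypeError. A minimal sketch (the frame and column name 'Col2' are assumptions for illustration):

```python
import pandas as pd

# Hypothetical frame for illustration
df = pd.DataFrame({'Col2': ['High', 'Low', 'Medium', 'Low']})

# Parentheses force both comparisons to run before | combines the masks
filtered = df.loc[(df['Col2'] == 'High') | (df['Col2'] == 'Medium')]
print(filtered)

# Equivalent and often tidier for exact-value matching: Series.isin
filtered_isin = df.loc[df['Col2'].isin(['High', 'Medium'])]
```

Series.isin avoids the precedence pitfall entirely when you are matching whole values rather than substrings.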
Check if multiple substrings are in pandas dataframe
You can use a regex, where '|' means "or" in regular expressions:
l = ['LIMITED','INC','CORP']
regstr = '|'.join(l)
df['NAME'].str.upper().str.contains(regstr)
MVCE:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'NAME':['Baby CORP.','Baby','Baby INC.','Baby LIMITED']})
In [3]: df
Out[3]:
NAME
0 Baby CORP.
1 Baby
2 Baby INC.
3 Baby LIMITED
In [4]: l = ['LIMITED','INC','CORP']
...: regstr = '|'.join(l)
...: df['NAME'].str.upper().str.contains(regstr)
...:
Out[4]:
0 True
1 False
2 True
3 True
Name: NAME, dtype: bool
In [5]: regstr
Out[5]: 'LIMITED|INC|CORP'
How to filter string in multiple conditions python pandas
Use str.contains with a pattern whose alternatives are separated by '|':
print(data[data['columnName'].str.contains("5|five")])
Output:
columnName
0 5Star
2 five star
Pandas filter dataframe columns through substring match
You can iterate over index axis:
>>> df[df.apply(lambda x: x['Name'].lower() in x['Fname'].lower(), axis=1)]
Name Age Fname
1 Bob 12 Bob
2 Clarke 13 clarke
str.contains takes a constant as its first argument pat, not a Series.
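The row-wise apply above can be sketched on a small self-contained frame (the data here is hypothetical, mirroring the answer's columns):

```python
import pandas as pd

# Hypothetical frame with the answer's column layout
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Clarke'],
                   'Age': [11, 12, 13],
                   'Fname': ['al', 'Bob', 'clarke']})

# Row-wise check: is Name (case-insensitively) a substring of Fname?
# apply with axis=1 is needed because str.contains cannot take a Series pattern.
mask = df.apply(lambda x: x['Name'].lower() in x['Fname'].lower(), axis=1)
print(df[mask])
```

Note that apply runs a Python-level loop per row, so this is slower than the vectorized str methods, but it is the straightforward way to compare two columns element-wise.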
How to filter a pandas dataframe using multiple partial strings?
Use str.contains with | to search for multiple elements:
mask = df['Answers'].str.contains(regex_pattern)
final_df = df[mask]
To create the regex pattern if you have the search elements use:
strings_to_find = ["not in","not on","not have"]
regex_pattern = '|'.join(strings_to_find)
regex_pattern
'not in|not on|not have'
Filter Pandas Dataframe based on List of substrings
You could use pandas.Series.isin
>>> df.loc[df['type'].isin(substr)]
year type value price
0 2000 A 500 10000
4 2006 C 500 12500
5 2012 A 500 65000
7 2019 D 500 51900
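One caveat worth noting: isin matches whole cell values exactly, not substrings. If the column held longer strings and you needed true substring matching, the str.contains approach from the other answers would apply. A hedged sketch on hypothetical data:

```python
import re
import pandas as pd

# Hypothetical data; note the last value only *contains* 'A'
df = pd.DataFrame({'type': ['A', 'B', 'C', 'A-extended']})
substr = ['A', 'C']

# isin: exact whole-value match only
exact = df.loc[df['type'].isin(substr)]

# str.contains with an escaped, joined pattern: substring match
partial = df.loc[df['type'].str.contains('|'.join(map(re.escape, substr)))]
```

Choose isin when the list holds complete values (it is faster and avoids regex semantics), and str.contains when the list holds fragments.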
Pandas multiple filter str.contains or not contains
My answer for the problem:
import glob
import pandas as pd

for item in glob.glob('D:\\path\\*.change'):
    table = pd.read_csv(item, sep='\t', index_col=None)
    # FILTERING: drop rows whose query mentions any unwanted term
    query_table = table[
        ~table['query'].str.contains("egg") &
        ~table['query'].str.contains("phospho") &
        ~table['query'].str.contains("vipe")]
filtered_table = query_table[
(query_table['template'].str.contains("ABC1")) |
(query_table['template'].str.contains("bender")) ]
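The two-stage exclude/include filter above can be condensed by joining each term list into a single alternation pattern, so each column is scanned once. A sketch on hypothetical data with the same column names:

```python
import pandas as pd

# Hypothetical table mirroring the answer's columns
table = pd.DataFrame({
    'query':    ['egg white', 'phosphoprotein', 'kinase', 'ligase'],
    'template': ['ABC1',      'bender',         'ABC1',   'other'],
})

# One combined pattern per column: ~ negates the exclusion mask
exclude = table['query'].str.contains('egg|phospho|vipe')
include = table['template'].str.contains('ABC1|bender')
result = table[~exclude & include]
print(result)
```

If any of the terms contained regex metacharacters, you would escape them with re.escape before joining, as in the answers above.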