How to test if a string contains one of the substrings in a list, in pandas?
One option is to use the regex | character to match any of the substrings against the words in your Series s (still using str.contains).
You can construct the regex by joining the words in searchfor with |:
>>> searchfor = ['og', 'at']
>>> s[s.str.contains('|'.join(searchfor))]
0 cat
1 hat
2 dog
3 fog
dtype: object
As @AndyHayden noted in the comments below, take care if your substrings contain special characters such as $ and ^ that you want to match literally. These characters have specific meanings in regular expressions and will affect the matching.
You can make your list of substrings safer by escaping non-alphanumeric characters with re.escape:
>>> import re
>>> matches = ['$money', 'x^y']
>>> safe_matches = [re.escape(m) for m in matches]
>>> safe_matches
['\\$money', 'x\\^y']
The strings in this new list will match each character literally when used with str.contains.
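Putting the two steps together (escape, then join with |), a minimal sketch might look like this. The Series contents are made up for illustration and do not come from the original answer.

```python
import re

import pandas as pd

# Hypothetical sample data, for illustration only.
s = pd.Series(['$money talks', 'x^y = z', 'plain text', 'money'])

matches = ['$money', 'x^y']

# Escape each substring so that $ and ^ are matched literally,
# then join the escaped pieces into a single alternation pattern.
pattern = '|'.join(re.escape(m) for m in matches)

mask = s.str.contains(pattern)
result = s[mask]
print(result)
```

Without the escaping step, $ and ^ would act as anchors, so 'x^y' and '$money' would not be matched as literal text.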
How to test if a string contains one of the substrings stored in a list column in pandas?
You can just use zip and a list comprehension:
df['c'] = [int(any(w in a for w in b)) for a, b in zip(df.a, df.b)]
df
# a b c
#0 Bob Smith is great. [Smith, foo] 1
#1 The Sun is a mass of incandescent gas. [Jones, bar] 0
If you don't care about case:
df['c'] = [any(w.lower() in a for w in b) for a, b in zip(df.a.str.lower(), df.b)]
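For a runnable comparison of the two variants, here is a small sketch. The frame is rebuilt from the output shown above, except that the second list is changed to contain 'sun' (an assumption made here so the case difference is visible).

```python
import pandas as pd

df = pd.DataFrame({
    'a': ['Bob Smith is great.', 'The Sun is a mass of incandescent gas.'],
    # 'sun' is a hypothetical value, not part of the original data
    'b': [['Smith', 'foo'], ['sun', 'bar']],
})

# Case-sensitive: each word must appear in the text exactly as written.
case_sensitive = [any(w in a for w in b) for a, b in zip(df['a'], df['b'])]

# Case-insensitive: lower-case both the text and each word before testing.
case_insensitive = [
    any(w.lower() in a for w in b)
    for a, b in zip(df['a'].str.lower(), df['b'])
]

df['c_sensitive'] = case_sensitive
df['c_insensitive'] = case_insensitive
print(df)
```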
How to test string contains one of the substrings in a list, in pandas?
) is a special regex character. You need to escape it (raw strings keep the backslash intact):
searchfor = [r'og\)', r'at\)']
s[s.str.contains('|'.join(searchfor))]
Output:
0 cat)
1 hat)
2 dog)
3 fog)
dtype: object
pandas dataframe str.contains() AND operation
You can do that as follows:
df[(df['col_name'].str.contains('apple')) & (df['col_name'].str.contains('banana'))]
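If the required substrings are held in a list rather than written out by hand, the same AND logic can be built by combining one mask per substring. This is only a sketch: the sample data and the name required are assumptions, and regex=False is used so each term is treated as literal text.

```python
import operator
from functools import reduce

import pandas as pd

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({'col_name': [
    'apple banana pie',
    'apple tart',
    'banana split',
    'banana and apple',
]})

required = ['apple', 'banana']

# One boolean mask per required substring.
masks = [df['col_name'].str.contains(term, regex=False) for term in required]

# Combine the masks with element-wise AND.
all_present = reduce(operator.and_, masks)

result = df[all_present]
print(result)
```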
Substituting values of a column if it contains a substring of a list
A simpler approach is to use a loop here:
L = ['dog', 'cat', 'panda']
for x in L:
    df.loc[df['column'].str.contains(x), "column"] = x
print (df)
column
0 dog
1 cat
2 I have nothing
3 panda
4
Or use Series.str.extract with Series.fillna to fill the non-matching rows with the original data:
df['column'] = (df['column'].str.extract(f'({"|".join(L)})', expand=False)
.fillna(df['column']))
print (df)
column
0 dog
1 cat
2 I have nothing
3 panda
4
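The input frame is not shown above, so the following self-contained sketch uses made-up rows that would produce the same output with the str.extract approach.

```python
import pandas as pd

# Hypothetical input, chosen to reproduce the output shown above.
df = pd.DataFrame({'column': [
    'I have a dog',
    'my cat is old',
    'I have nothing',
    'red panda',
    '',
]})

L = ['dog', 'cat', 'panda']

# One capture group containing all alternatives: (dog|cat|panda)
pattern = f'({"|".join(L)})'

# Rows without a match become NaN after extract; fillna restores the original text.
df['column'] = df['column'].str.extract(pattern, expand=False).fillna(df['column'])
print(df)
```

One difference worth knowing: if a row contains several of the words, the loop keeps the one that comes last in L, while str.extract keeps the one that appears first in the text.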
Pandas str.contains - Search for multiple values in a string and print the values in a new column
Here is one way:
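import numpy as np

foods = ['apples', 'oranges', 'grapes', 'blueberries']
def matcher(x):
    # Return the first food (in list order) found in x, ignoring case
    for i in foods:
        if i.lower() in x.lower():
            return i
    else:
        # The loop finished without finding any food
        return np.nan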
df['Match'] = df['Text'].apply(matcher)
# Text Match
# 0 I want to buy some apples. apples
# 1 Oranges are good for the health. oranges
# 2 John is eating some grapes. grapes
# 3 This line does not contain any fruit names. NaN
# 4 I bought 2 blueberries yesterday. blueberries
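A vectorized alternative is to let Series.str.extract pull out the matched word. The sketch below rebuilds the frame from the output above. It assumes the entries in foods are lower-case, so lower-casing the extracted text gives the list spelling back.

```python
import re

import pandas as pd

df = pd.DataFrame({'Text': [
    'I want to buy some apples.',
    'Oranges are good for the health.',
    'John is eating some grapes.',
    'This line does not contain any fruit names.',
    'I bought 2 blueberries yesterday.',
]})

foods = ['apples', 'oranges', 'grapes', 'blueberries']

# A single capture group with every food as an alternative.
pattern = '(' + '|'.join(re.escape(f) for f in foods) + ')'

# extract returns the text as it appears in the row (e.g. 'Oranges'),
# so lower-case it to get back the spelling used in foods.
df['Match'] = (
    df['Text']
    .str.extract(pattern, flags=re.IGNORECASE, expand=False)
    .str.lower()
)
print(df)
```

When a row contains more than one food, this version returns the one that appears first in the text, whereas matcher returns the one that comes first in foods.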
Check if a string in a Pandas DataFrame column is in a list of strings
frame = pd.DataFrame({'a' : ['the cat is blue', 'the sky is green', 'the dog is black']})
frame
a
0 the cat is blue
1 the sky is green
2 the dog is black
The str.contains method accepts a regular expression pattern:
mylist = ['dog', 'cat', 'fish']
pattern = '|'.join(mylist)
pattern
'dog|cat|fish'
frame.a.str.contains(pattern)
0 True
1 False
2 True
Name: a, dtype: bool
Because regex patterns are supported, you can also embed flags:
frame = pd.DataFrame({'a' : ['Cat Mr. Nibbles is blue', 'the sky is green', 'the dog is black']})
frame
a
0 Cat Mr. Nibbles is blue
1 the sky is green
2 the dog is black
pattern = '(?i)' + '|'.join(mylist) # a global flag must appear once, at the very start (Python 3.11+ raises an error otherwise)
pattern
'(?i)dog|cat|fish'
frame.a.str.contains(pattern)
0 True # Because of the (?i) flag, 'Cat' is also matched to 'cat'
1 False
2 True
Python Pandas: check if Series contains a string from list
You can loop through the lists simultaneously with zip. Make sure to pass regex=False to str.contains, as . is a regex character.
abbreviation=['n.', 'v.']
col_name=['Noun','Verb']
for a, col in zip(abbreviation, col_name):
    Blaze[col] = np.where(Blaze['Info'].str.contains(a, regex=False),True,False)
Blaze
Out[1]:
Word Info Noun Verb
0 Aam Aam, n. Etym: [D. aam, fr. LL. ama; cf. L. ham... True False
1 aard-vark Aard"-vark`, n. Etym: [D., earth-pig.] (Zoöl.) True False
2 aard-wolf Aard"-wolf`, n. Etym: [D, earth-wolf] (Zoöl.) True False
If required, str.contains also has a case parameter, so you can specify case=False to search case-insensitively.
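To see what regex=False and case=False change in practice, here is a small sketch on made-up strings (the Series below is hypothetical and is not the Blaze data).

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
info = pd.Series([
    'Aam, n. Etym',
    'to run, v. i.',
    'no abbreviation here',
    'N. capitalised',
])

# Literal search: only the exact two characters 'n.' count as a match.
literal = info.str.contains('n.', regex=False)

# Regex search: '.' means "any character", so 'n,' and 'no' match as well.
as_regex = info.str.contains('n.')

# Literal search that ignores case, so 'N.' also matches.
literal_nocase = info.str.contains('n.', regex=False, case=False)

print(pd.DataFrame({
    'literal': literal,
    'as_regex': as_regex,
    'literal_nocase': literal_nocase,
}))
```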
Check if String in List of Strings is in Pandas DataFrame Column
If you need to match whole values against a list, use Series.isin:
df['Match'] = df["Brand"].isin(search_for_these_values)
print (df)
Brand Price Liscence Plate Match
0 Honda Civic 22000 ABC 123 False
1 Toyota Corolla 25000 XYZ 789 False
2 Ford Focus 27000 CBA 321 True
3 Audi A4 35000 ZYX 987 False
4 NaN 29000 DEF 456 False
The solution with match checks for substrings, so its output is different.
An alternative for matching substrings is Series.str.contains with the parameter na=False:
df['Match'] = df["Brand"].str.contains(pattern, na=False)
print (df)
Brand Price Liscence Plate Match
0 Honda Civic 22000 ABC 123 True
1 Toyota Corolla 25000 XYZ 789 True
2 Ford Focus 27000 CBA 321 True
3 Audi A4 35000 ZYX 987 False
4 NaN 29000 DEF 456 False
EDIT:
To test whether each column value is a substring of one of the values in search_for_these_values, you can use a list comprehension that loops over the list, tests each value with in, and uses any to return True on at least one match:
df['Match'] = [any(x in z for z in search_for_these_values)
               if x == x  # x == x is False only for NaN
               else False
               for x in df["Brand"]]
print (df)
Brand Price Liscence Plate Match
0 Honda Civic 22000 ABC 123 False
1 Toyota Corolla 25000 XYZ 789 False
2 Ford Focus 27000 CBA 321 True
3 Audi A4 35000 ZYX 987 True
4 NaN 29000 DEF 456 False
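The three approaches answer different questions, which is why their outputs differ. The sketch below puts them side by side. The contents of search_for_these_values are not shown in the answer, so the list used here is purely an assumption for illustration.

```python
import re

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Brand': ['Honda Civic', 'Toyota Corolla', 'Ford Focus', 'Audi A4', np.nan],
})

# Hypothetical search list; the original values are not given.
search_for_these_values = ['Ford', 'Audi A4 Quattro']

# 1) Exact match: the whole cell must equal one of the list values.
exact = df['Brand'].isin(search_for_these_values)

# 2) A list value occurs inside the cell (na=False turns missing cells into False).
pattern = '|'.join(re.escape(v) for v in search_for_these_values)
cell_contains_value = df['Brand'].str.contains(pattern, na=False)

# 3) The cell occurs inside a list value; non-string cells such as NaN give False.
value_contains_cell = [
    isinstance(x, str) and any(x in z for z in search_for_these_values)
    for x in df['Brand']
]

print(pd.DataFrame({
    'Brand': df['Brand'],
    'exact': exact,
    'cell_contains_value': cell_contains_value,
    'value_contains_cell': value_contains_cell,
}))
```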