How to Test If a String Contains One of the Substrings in a List, in Pandas

How to test if a string contains one of the substrings in a list, in pandas?

One option is to use the regex alternation character | so that str.contains matches any one of the substrings against the values in your Series s.

You can construct the regex by joining the words in searchfor with |:

>>> searchfor = ['og', 'at']
>>> s[s.str.contains('|'.join(searchfor))]
0    cat
1    hat
2    dog
3    fog
dtype: object
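For a self-contained version of the snippet above, the sketch below builds a small Series and applies the joined pattern. The sample values (including the non-matching 'pet') are assumptions chosen to reproduce the output shown; they are not taken from the original question.

```python
import pandas as pd

# Hypothetical sample data; 'pet' is included as a value that should not match.
s = pd.Series(['cat', 'hat', 'dog', 'fog', 'pet'])
searchfor = ['og', 'at']

# 'og|at' matches any string that contains either substring.
pattern = '|'.join(searchfor)
mask = s.str.contains(pattern)

matched = s[mask].tolist()
print(matched)
```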

As @AndyHayden noted, take care if your substrings contain special characters such as $ and ^ that you want to match literally. These characters have special meanings in regular expressions and will change what is matched.

You can make your list of substrings safer by escaping non-alphanumeric characters with re.escape:

>>> import re
>>> matches = ['$money', 'x^y']
>>> safe_matches = [re.escape(m) for m in matches]
>>> safe_matches
['\\$money', 'x\\^y']

The strings in this new list will match each character literally when used with str.contains.
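To see the difference escaping makes, here is a minimal sketch that runs the same search with and without re.escape. The Series contents are hypothetical and exist only to illustrate the behaviour.

```python
import re
import pandas as pd

# Hypothetical sample data containing regex metacharacters.
s = pd.Series(['$money', 'x^y', 'money', 'xy'])
matches = ['$money', 'x^y']

# Escape each substring so that $ and ^ are treated as literal characters.
safe_pattern = '|'.join(re.escape(m) for m in matches)
literal_hits = s[s.str.contains(safe_pattern)].tolist()

# Without escaping, $ and ^ act as anchors, so the literal strings are not found.
unescaped_hits = s[s.str.contains('|'.join(matches))].tolist()

print(literal_hits)
print(unescaped_hits)
```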

How to test if a string contains one of the substrings stored in a list column in pandas?

You can use zip with a list comprehension:

df['c'] = [int(any(w in a for w in b)) for a, b in zip(df.a, df.b)]

df
#                                         a             b  c
# 0                     Bob Smith is great.  [Smith, foo]  1
# 1  The Sun is a mass of incandescent gas.  [Jones, bar]  0

If you don't care about case:

df['c'] = [any(w.lower() in a for w in b) for a, b in zip(df.a.str.lower(), df.b)]
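The two variants can be tried end to end with the sketch below. The first two rows mirror the output shown above; the third row and the column name c_nocase are assumptions added to show where the case-sensitive and case-insensitive checks disagree.

```python
import pandas as pd

# Column 'a' holds text, column 'b' holds a list of candidate substrings per row.
# The third row is hypothetical and only differs from its candidate by case.
df = pd.DataFrame({
    'a': ['Bob Smith is great.',
          'The Sun is a mass of incandescent gas.',
          'the sun is bright'],
    'b': [['Smith', 'foo'], ['Jones', 'bar'], ['Sun']],
})

# Case-sensitive check, row by row.
df['c'] = [int(any(w in a for w in b)) for a, b in zip(df['a'], df['b'])]

# Case-insensitive variant: lower-case both the text and each candidate.
df['c_nocase'] = [
    int(any(w.lower() in a.lower() for w in b))
    for a, b in zip(df['a'], df['b'])
]

print(df[['c', 'c_nocase']])
```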

How to test string contains one of the substrings in a list, in pandas?

) is a special regex character, so you need to escape it. Raw strings keep the backslash intact:

searchfor = [r'og\)', r'at\)']
s[s.str.contains('|'.join(searchfor))]

Output:

0    cat)
1    hat)
2    dog)
3    fog)
dtype: object

pandas dataframe str.contains() AND operation

Combine two str.contains masks with the & operator:

df[(df['col_name'].str.contains('apple')) & (df['col_name'].str.contains('banana'))]
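When the required substrings live in a list rather than being written out by hand, one possible generalisation is to build a mask per term and AND them together. This is a sketch under assumed data; the frame contents and the name required are hypothetical.

```python
from functools import reduce
import operator

import pandas as pd

# Hypothetical data; 'col_name' follows the column name used above.
df = pd.DataFrame({'col_name': ['apple banana', 'apple', 'banana split', 'banana and apple']})
required = ['apple', 'banana']

# Build one boolean mask per term (literal matching), then AND them together.
masks = [df['col_name'].str.contains(term, regex=False) for term in required]
all_present = reduce(operator.and_, masks)

result = df[all_present]
print(result)
```

Only rows containing every term in required survive the filter.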

Substituting values of a column if it contains a substring of a list

A simpler option is to use a loop:

L = ['dog', 'cat', 'panda']

for x in L:
    df.loc[df['column'].str.contains(x), "column"] = x
print(df)
           column
0             dog
1             cat
2  I have nothing
3           panda
4

Alternatively, use Series.str.extract and fill the non-matching rows from the original data with Series.fillna:

df['column'] = (df['column'].str.extract(f'({"|".join(L)})', expand=False)
                            .fillna(df['column']))
print(df)
           column
0             dog
1             cat
2  I have nothing
3           panda
4
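The extract-and-fill approach can be reproduced with the following sketch. The input sentences are assumptions chosen to resemble the output above (the original input frame is not shown), and re.escape is added as a precaution in case list entries contain regex metacharacters.

```python
import re
import pandas as pd

# Hypothetical input; the last row is an empty string.
df = pd.DataFrame({'column': ['I have a dog', 'my cat is old', 'I have nothing', 'a panda bear', '']})
L = ['dog', 'cat', 'panda']

# One capture group holding all alternatives.
pattern = '(' + '|'.join(re.escape(x) for x in L) + ')'

# Rows without a match yield NaN, which fillna replaces with the original text.
df['column'] = df['column'].str.extract(pattern, expand=False).fillna(df['column'])
print(df)
```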

Pandas str.contains - Search for multiple values in a string and print the values in a new column

Here is one way:

import numpy as np

foods = ['apples', 'oranges', 'grapes', 'blueberries']

def matcher(x):
    # Return the first food name found in x (case-insensitive), or NaN if none is found.
    for i in foods:
        if i.lower() in x.lower():
            return i
    return np.nan

df['Match'] = df['Text'].apply(matcher)

#                                           Text        Match
# 0                   I want to buy some apples.       apples
# 1             Oranges are good for the health.      oranges
# 2                  John is eating some grapes.       grapes
# 3  This line does not contain any fruit names.          NaN
# 4            I bought 2 blueberries yesterday.  blueberries
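A vectorised alternative to the apply-based function is sketched below using Series.str.extract with flags=re.IGNORECASE. This is a swapped-in technique rather than the answer's own method, and the three-row frame is a shortened, assumed version of the data above.

```python
import re
import pandas as pd

foods = ['apples', 'oranges', 'grapes', 'blueberries']

# Shortened sample of the 'Text' column shown above.
df = pd.DataFrame({'Text': [
    'I want to buy some apples.',
    'Oranges are good for the health.',
    'This line does not contain any fruit names.',
]})

pattern = '(' + '|'.join(re.escape(f) for f in foods) + ')'

# IGNORECASE lets 'Oranges' match 'oranges'; the extracted text keeps its
# original case, so it is lower-cased afterwards. Non-matching rows stay NaN.
df['Match'] = (df['Text']
               .str.extract(pattern, flags=re.IGNORECASE, expand=False)
               .str.lower())
print(df)
```

One behavioural difference to keep in mind: the loop returns the first entry of foods that occurs anywhere in the text, while str.extract returns whichever alternative appears earliest in the text.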

Check if a string in a Pandas DataFrame column is in a list of strings

frame = pd.DataFrame({'a' : ['the cat is blue', 'the sky is green', 'the dog is black']})

frame
                  a
0   the cat is blue
1  the sky is green
2  the dog is black

The str.contains method accepts a regular expression pattern:

mylist = ['dog', 'cat', 'fish']
pattern = '|'.join(mylist)

pattern
'dog|cat|fish'

frame.a.str.contains(pattern)
0     True
1    False
2     True
Name: a, dtype: bool

Because regex patterns are supported, you can also embed flags. An inline flag such as (?i) must appear once, at the very start of the pattern; Python 3.11 and later raise an error if a global flag appears anywhere else:

frame = pd.DataFrame({'a' : ['Cat Mr. Nibbles is blue', 'the sky is green', 'the dog is black']})

frame
                         a
0  Cat Mr. Nibbles is blue
1         the sky is green
2         the dog is black

pattern = '(?i)' + '|'.join(mylist)

pattern
'(?i)dog|cat|fish'

frame.a.str.contains(pattern)
0     True  # Because of the (?i) flag, 'Cat' is also matched to 'cat'
1    False
2     True
Name: a, dtype: bool
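If embedding a flag in the pattern feels awkward, the case parameter of str.contains gives the same result for this example. A minimal sketch, reusing the frame and list from above:

```python
import pandas as pd

frame = pd.DataFrame({'a': ['Cat Mr. Nibbles is blue', 'the sky is green', 'the dog is black']})
mylist = ['dog', 'cat', 'fish']

# case=False applies case-insensitive matching to the whole pattern.
result = frame['a'].str.contains('|'.join(mylist), case=False)
print(result)
```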

Python Pandas: check if Series contains a string from list

You can loop through the two lists simultaneously with zip. Make sure to pass regex=False to str.contains, because . is a regex metacharacter that matches any character.

abbreviation = ['n.', 'v.']
col_name = ['Noun', 'Verb']
for a, col in zip(abbreviation, col_name):
    Blaze[col] = np.where(Blaze['Info'].str.contains(a, regex=False), True, False)
Blaze
Out[1]:
Word Info Noun Verb
0 Aam Aam, n. Etym: [D. aam, fr. LL. ama; cf. L. ham... True False
1 aard-vark Aard"-vark`, n. Etym: [D., earth-pig.] (Zoöl.) True False
2 aard-wolf Aard"-wolf`, n. Etym: [D, earth-wolf] (Zoöl.) True False

If required, str.contains also has a case parameter, so you can specify case=False to search case-insensitively.
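The effect of regex=False is easy to check on a small example. The Series below is hypothetical and only stands in for the Info column; it is meant to show how 'n.' behaves as a regex versus as a literal substring.

```python
import pandas as pd

# Hypothetical data standing in for the 'Info' column.
info = pd.Series(['Aam, n. Etym: ...', 'run, v. to move fast', 'no abbreviation here'])

# With the default regex=True, '.' matches any character, so 'n.' also
# matches 'n,' and 'no'.
as_regex = info.str.contains('n.')

# With regex=False the pattern is treated as a literal substring.
as_literal = info.str.contains('n.', regex=False)

print(as_regex.tolist())
print(as_literal.tolist())
```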

Check if String in List of Strings is in Pandas DataFrame Column

If you need to match whole values against the list, use Series.isin:

df['Match'] = df["Brand"].isin(search_for_these_values)
print (df)
            Brand  Price  Liscence Plate  Match
0     Honda Civic  22000         ABC 123  False
1  Toyota Corolla  25000         XYZ 789  False
2      Ford Focus  27000         CBA 321   True
3         Audi A4  35000         ZYX 987  False
4             NaN  29000         DEF 456  False

A solution based on pattern matching tests for substrings rather than whole values, so its output is different.

An alternative that matches substrings uses Series.str.contains with the parameter na=False, so that missing values yield False:

df['Match'] = df["Brand"].str.contains(pattern, na=False)
print (df)
            Brand  Price  Liscence Plate  Match
0     Honda Civic  22000         ABC 123   True
1  Toyota Corolla  25000         XYZ 789   True
2      Ford Focus  27000         CBA 321   True
3         Audi A4  35000         ZYX 987  False
4             NaN  29000         DEF 456  False

EDIT:

To test whether each value is itself a substring of any entry in search_for_these_values, use a list comprehension with any, which returns True as soon as one entry contains the value. The x == x check is False only for NaN, so missing values yield False:

df['Match'] = [any(x in z for z in search_for_these_values)
               if x == x
               else False
               for x in df["Brand"]]
print (df)

            Brand  Price  Liscence Plate  Match
0     Honda Civic  22000         ABC 123  False
1  Toyota Corolla  25000         XYZ 789  False
2      Ford Focus  27000         CBA 321   True
3         Audi A4  35000         ZYX 987   True
4             NaN  29000         DEF 456  False
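The three approaches answer slightly different questions, which the sketch below puts side by side. The contents of search_for_these_values are not shown in the text above, so the list and the shortened frame here are assumptions chosen only to make the differences visible.

```python
import pandas as pd

# Hypothetical data; float('nan') stands in for the missing Brand value.
df = pd.DataFrame({'Brand': ['Honda Civic', 'Ford Focus', 'Audi A4', float('nan')]})
search_for_these_values = ['Ford Focus', 'Audi A4 Quattro', 'Honda']

# 1. Exact membership: the whole cell must equal a list entry.
exact = df['Brand'].isin(search_for_these_values)

# 2. A list entry appears somewhere inside the cell.
entry_in_cell = df['Brand'].str.contains('|'.join(search_for_these_values), na=False)

# 3. The cell appears somewhere inside a list entry; x == x is False only for NaN.
cell_in_entry = [
    any(x in z for z in search_for_these_values) if x == x else False
    for x in df['Brand']
]

print(exact.tolist(), entry_in_cell.tolist(), cell_in_entry)
```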

