Check If a String in a Pandas Dataframe Column Is in a List of Strings

Check if a string in a Pandas DataFrame column is in a list of strings

import pandas as pd

frame = pd.DataFrame({'a' : ['the cat is blue', 'the sky is green', 'the dog is black']})

frame
                  a
0   the cat is blue
1  the sky is green
2  the dog is black

The str.contains method accepts a regular expression pattern:

mylist = ['dog', 'cat', 'fish']
pattern = '|'.join(mylist)

pattern
'dog|cat|fish'

frame.a.str.contains(pattern)
0 True
1 False
2 True
Name: a, dtype: bool

Because regex patterns are supported, you can also embed flags:

frame = pd.DataFrame({'a' : ['Cat Mr. Nibbles is blue', 'the sky is green', 'the dog is black']})

frame
                         a
0  Cat Mr. Nibbles is blue
1         the sky is green
2         the dog is black

# Put the global flag once, at the very start of the pattern.
# Repeating (?i) before every alternative raises re.error on Python 3.11+.
pattern = '(?i)' + '|'.join(mylist)

pattern
'(?i)dog|cat|fish'

frame.a.str.contains(pattern)
0 True # Because of the (?i) flag, 'Cat' is also matched to 'cat'
1 False
2 True
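
As an alternative to an inline flag, str.contains also takes a case parameter. This is a minimal sketch that reuses the frame and mylist from above; it is one way to get case-insensitive matching, not the only one:

```python
import pandas as pd

frame = pd.DataFrame({'a': ['Cat Mr. Nibbles is blue',
                            'the sky is green',
                            'the dog is black']})
mylist = ['dog', 'cat', 'fish']
pattern = '|'.join(mylist)

# case=False makes the whole pattern case-insensitive,
# so no (?i) flag has to be embedded in the pattern itself
mask = frame['a'].str.contains(pattern, case=False)
print(mask.tolist())  # [True, False, True]
```

This keeps the pattern free of flags, which helps if the same pattern string is reused elsewhere.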

Check if String in List of Strings is in Pandas DataFrame Column

If you need to match whole values against a list, use Series.isin:

df['Match'] = df["Brand"].isin(search_for_these_values)
print (df)
            Brand  Price  Liscence Plate  Match
0     Honda Civic  22000         ABC 123  False
1  Toyota Corolla  25000         XYZ 789  False
2      Ford Focus  27000         CBA 321   True
3         Audi A4  35000         ZYX 987  False
4             NaN  29000         DEF 456  False
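
For a self-contained version of the isin check, here is a small sketch. The Brand values are copied from the output above, but the contents of search_for_these_values are not shown in the answer, so the list below is purely hypothetical:

```python
import numpy as np
import pandas as pd

# Brand values taken from the printed output; other columns omitted
df = pd.DataFrame({'Brand': ['Honda Civic', 'Toyota Corolla',
                             'Ford Focus', 'Audi A4', np.nan]})

# Hypothetical search list (the original list is not shown)
search_for_these_values = ['Ford Focus', 'Audi A4 Quattro']

# isin compares whole values: 'Audi A4' is not equal to 'Audi A4 Quattro',
# and NaN is not in the list, so both come out False
df['Match'] = df['Brand'].isin(search_for_these_values)
print(df['Match'].tolist())  # [False, False, True, False, False]
```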

A solution that matches against a pattern checks for substrings rather than whole values, so its output is different.

An alternative that matches substrings uses Series.str.contains with the parameter na=False, which treats missing values as non-matches:

df['Match'] = df["Brand"].str.contains(pattern, na=False)
print (df)
            Brand  Price  Liscence Plate  Match
0     Honda Civic  22000         ABC 123   True
1  Toyota Corolla  25000         XYZ 789   True
2      Ford Focus  27000         CBA 321   True
3         Audi A4  35000         ZYX 987  False
4             NaN  29000         DEF 456  False

EDIT:

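To test whether a column value is a substring of any entry in search_for_these_values, you can use a list comprehension: loop over the column, test each value against every list entry with in, and wrap the tests in any so that a single hit returns True. The x == x check is False only for NaN, so missing values come out as False:

df['Match'] = [any(x in z for z in search_for_these_values)
               if x == x
               else False
               for x in df["Brand"]]
print(df)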

            Brand  Price  Liscence Plate  Match
0     Honda Civic  22000         ABC 123  False
1  Toyota Corolla  25000         XYZ 789  False
2      Ford Focus  27000         CBA 321   True
3         Audi A4  35000         ZYX 987   True
4             NaN  29000         DEF 456  False
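
The direction of the in test matters: x in z asks whether the cell text sits inside a list entry, while z in x asks whether a list entry sits inside the cell text. The sketch below contrasts the two. The helper names and the search list are hypothetical, chosen only for illustration:

```python
import numpy as np
import pandas as pd

brands = pd.Series(['Honda Civic', 'Ford Focus', 'Audi A4', np.nan])
search_values = ['Ford Focus', 'Audi A4 Quattro', 'Civic']  # hypothetical

def value_in_any_entry(x):
    """True if the cell text is a substring of at least one list entry."""
    return isinstance(x, str) and any(x in z for z in search_values)

def any_entry_in_value(x):
    """True if at least one list entry is a substring of the cell text."""
    return isinstance(x, str) and any(z in x for z in search_values)

# isinstance(x, str) plays the same role as the x == x check: NaN gives False
cell_inside_entry = brands.map(value_in_any_entry)
entry_inside_cell = brands.map(any_entry_in_value)

print(cell_inside_entry.tolist())  # [False, True, True, False]
print(entry_inside_cell.tolist())  # [True, True, False, False]
```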

How to check if Pandas column has value from list of string?

Use apply with a lambda, like this:

df['Names'].apply(lambda x: any([k in x for k in kw]))

0 True
1 True
2 True
3 True
4 False
Name: Names, dtype: bool
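
One thing to watch with the lambda: k in x raises a TypeError if x is NaN, because a float is not iterable. A possible guard, sketched here with a hypothetical Names column and keyword list, is to check the type first:

```python
import numpy as np
import pandas as pd

# Hypothetical data; the original df and kw are not shown
df = pd.DataFrame({'Names': ['Melanie Smith', 'John Doe', np.nan]})
kw = ['Mel', 'Doe']

# isinstance short-circuits for NaN, so 'k in x' only runs on real strings
has_kw = df['Names'].apply(
    lambda x: isinstance(x, str) and any(k in x for k in kw)
)
print(has_kw.tolist())  # [True, True, False]
```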

Checking if column in dataframe contains any item from list of strings

Pandas generally allows you to filter data frames without resorting to for loops.

This is one approach that should work:

matches = ['beat saber', 'half life', 'walking dead', 'population one']

# matches_regex is a regular expression meaning any of your strings:
# "beat saber|half life|walking dead|population one"
matches_regex = "|".join(matches)

# matches_bools will be a series of booleans indicating whether there was a match
# for each item in the series
matches_bools = hot_quest1.all_text.str.contains(matches_regex, regex=True)

# You can then use that series of booleans to derive a new data frame
# containing only matching rows
matched_rows = hot_quest1[matches_bools]

Here's the documentation for the str.contains method.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html
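
If the text column can contain missing values, the mask returned by str.contains will contain missing values too, and boolean indexing with such a mask typically fails. A minimal sketch, assuming a small made-up hot_quest1 frame, passes na=False so that missing text counts as "no match":

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the hot_quest1 frame from the answer
hot_quest1 = pd.DataFrame({'all_text': ['i love beat saber',
                                        'nothing here',
                                        np.nan,
                                        'half life alyx']})

matches = ['beat saber', 'half life', 'walking dead', 'population one']
matches_regex = '|'.join(matches)

# na=False turns missing text into False, so the mask is purely boolean
matches_bools = hot_quest1['all_text'].str.contains(
    matches_regex, regex=True, na=False
)
matched_rows = hot_quest1[matches_bools]
print(matched_rows.index.tolist())  # [0, 3]
```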

How to test if a string contains one of the substrings in a list, in pandas?

One option is just to use the regex | character to try to match each of the substrings in the words in your Series s (still using str.contains).

You can construct the regex by joining the words in searchfor with |:

>>> searchfor = ['og', 'at']
>>> s[s.str.contains('|'.join(searchfor))]
0 cat
1 hat
2 dog
3 fog
dtype: object

As @AndyHayden noted in the comments, take care if your substrings contain special characters such as $ and ^ that you want to match literally. These characters have specific meanings in regular expressions and will change what is matched.

You can make your list of substrings safer by escaping non-alphanumeric characters with re.escape:

>>> import re
>>> matches = ['$money', 'x^y']
>>> safe_matches = [re.escape(m) for m in matches]
>>> safe_matches
['\\$money', 'x\\^y']

The strings in this new list will match each character literally when used with str.contains.
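
To see the effect end to end, here is a short sketch that joins the escaped strings and passes them to str.contains. The Series contents are made up for illustration:

```python
import re
import pandas as pd

s = pd.Series(['costs $money', 'x^y is a power', 'plain text'])  # hypothetical
matches = ['$money', 'x^y']

# Unescaped, $ and ^ act as anchors, so neither alternative can ever match
raw_mask = s.str.contains('|'.join(matches))

# Escaped, every character is matched literally
safe_pattern = '|'.join(re.escape(m) for m in matches)
safe_mask = s.str.contains(safe_pattern)

print(raw_mask.tolist())   # [False, False, False]
print(safe_mask.tolist())  # [True, True, False]
```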

How to check if string in list of strings is in pandas dataframe column

Although you said a loop might be too slow, it still seems like the most efficient approach given how long the list is. I tried to keep it as simple as possible.
Feel free to modify the print statement to suit your needs.

text = 'Bad Word test for Terrible Word same as Horrible Word and NSFW Word and Bad Word again'
bad_words = ['Bad Word', 'Terrible Word', 'Horrible Word', 'NSFW Word']

length_list = []

for i in bad_words:
    count = text.count(i)
    length_list.append([i, count])

print(length_list)

Output:

[['Bad Word', 2], ['Terrible Word', 1], ['Horrible Word', 1], ['NSFW Word', 1]]

Alternatively, you can print each count as a string:

for i in bad_words:
    count = text.count(i)
    print(i + ' count: ' + str(count))

Output:

Bad Word count: 2
Terrible Word count: 1
Horrible Word count: 1
NSFW Word count: 1
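
If the text lives in a DataFrame column rather than a single string, the same per-word loop can be run with Series.str.count. This is only a sketch: the text column and its contents are hypothetical, and re.escape is used because str.count treats its argument as a regular expression:

```python
import re
import pandas as pd

# Hypothetical column of text, one document per row
df = pd.DataFrame({'text': ['Bad Word test for Terrible Word',
                            'nothing to see',
                            'Bad Word and Bad Word again']})
bad_words = ['Bad Word', 'Terrible Word', 'Horrible Word', 'NSFW Word']

# One column of per-row counts for each bad word
counts = pd.DataFrame(
    {w: df['text'].str.count(re.escape(w)) for w in bad_words}
)

# Total occurrences of each bad word across all rows
totals = counts.sum()
print(totals.to_dict())
```

The loop still runs once per word, but each pass over the rows is handled inside pandas.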

Check if string is in a pandas dataframe

a['Names'].str.contains('Mel') returns a boolean Series with one value per row of a, where True marks the rows whose name contains 'Mel'.

Therefore, you can use

mel_count = a['Names'].str.contains('Mel').sum()
if mel_count > 0:
    print("There are {m} Mels".format(m=mel_count))

Or any(), if you don't care how many records match your query

if a['Names'].str.contains('Mel').any():
    print("Mel is there")
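
For a plain substring like 'Mel' there is no need for regex matching. The sketch below, using a made-up Names column, passes regex=False for a literal test and na=False so that missing names count as non-matches:

```python
import pandas as pd

a = pd.DataFrame({'Names': ['Mel', 'Melissa', 'Bob', None]})  # hypothetical

# Literal substring test; missing names are treated as False
is_mel = a['Names'].str.contains('Mel', regex=False, na=False)

mel_count = int(is_mel.sum())   # number of matching rows
any_mel = bool(is_mel.any())    # whether at least one row matches
print(mel_count, any_mel)  # 2 True
```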

