Check if a string in a Pandas DataFrame column is in a list of strings
frame = pd.DataFrame({'a' : ['the cat is blue', 'the sky is green', 'the dog is black']})
frame
a
0 the cat is blue
1 the sky is green
2 the dog is black
The str.contains method accepts a regular expression pattern:
mylist = ['dog', 'cat', 'fish']
pattern = '|'.join(mylist)
pattern
'dog|cat|fish'
frame.a.str.contains(pattern)
0 True
1 False
2 True
Name: a, dtype: bool
Because regex patterns are supported, you can also embed flags:
frame = pd.DataFrame({'a' : ['Cat Mr. Nibbles is blue', 'the sky is green', 'the dog is black']})
frame
a
0 Cat Mr. Nibbles is blue
1 the sky is green
2 the dog is black
pattern = '(?i)' + '|'.join(mylist) # a global inline flag must lead the pattern (an error elsewhere since Python 3.11)
pattern
'(?i)dog|cat|fish'
frame.a.str.contains(pattern)
0 True # Because of the (?i) flag, 'Cat' is also matched to 'cat'
1 False
2 True
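As an alternative to an inline flag, str.contains also takes a case parameter. The following is a minimal sketch that rebuilds the example frame above; it is one possible way to get case-insensitive matching, not the only one.

```python
import pandas as pd

# Data copied from the example above.
frame = pd.DataFrame({'a': ['Cat Mr. Nibbles is blue',
                            'the sky is green',
                            'the dog is black']})
mylist = ['dog', 'cat', 'fish']
pattern = '|'.join(mylist)

# case=False makes the whole pattern case-insensitive,
# so 'Cat' in the first row is matched by 'cat'.
mask = frame['a'].str.contains(pattern, case=False)
print(mask.tolist())  # [True, False, True]
```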
Check if String in List of Strings is in Pandas DataFrame Column
To match whole values against a list, use Series.isin:
df['Match'] = df["Brand"].isin(search_for_these_values)
print (df)
Brand Price Liscence Plate Match
0 Honda Civic 22000 ABC 123 False
1 Toyota Corolla 25000 XYZ 789 False
2 Ford Focus 27000 CBA 321 True
3 Audi A4 35000 ZYX 987 False
4 NaN 29000 DEF 456 False
A solution with match checks for substrings, so its output is different.
An alternative for matching substrings is Series.str.contains with the parameter na=False, which reports missing values as False:
df['Match'] = df["Brand"].str.contains(pattern, na=False)
print (df)
Brand Price Liscence Plate Match
0 Honda Civic 22000 ABC 123 True
1 Toyota Corolla 25000 XYZ 789 True
2 Ford Focus 27000 CBA 321 True
3 Audi A4 35000 ZYX 987 False
4 NaN 29000 DEF 456 False
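The original df and search list are not shown here, so the following sketch uses hypothetical data (the Brand values and the contents of search_for_these_values are assumptions) to contrast whole-value matching with substring matching.

```python
import numpy as np
import pandas as pd

# Hypothetical data; only the column name 'Brand' comes from the answer above.
df = pd.DataFrame({'Brand': ['Honda Civic', 'Toyota Corolla',
                             'Ford Focus', 'Audi A4', np.nan]})
search_for_these_values = ['Ford Focus', 'Honda']

# isin: True only when the whole cell equals one of the list entries.
exact = df['Brand'].isin(search_for_these_values)

# str.contains: True when any list entry occurs somewhere inside the cell.
# na=False turns the missing value into False instead of NaN.
pattern = '|'.join(search_for_these_values)
partial = df['Brand'].str.contains(pattern, na=False)

print(exact.tolist())    # [False, False, True, False, False]
print(partial.tolist())  # [True, False, True, False, False]
```

'Honda' matches 'Honda Civic' only in the substring test, which is why the two columns differ.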
EDIT:
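To test whether each value is a substring of some entry in search_for_these_values, use a list comprehension that loops over the entries, tests each one with in, and wraps the tests in any so that at least one match returns True. The x == x check is False for NaN, so missing values give False: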
df['Match'] = [any(x in z for z in search_for_these_values)
               if x == x
               else False
               for x in df["Brand"]]
print (df)
Brand Price Liscence Plate Match
0 Honda Civic 22000 ABC 123 False
1 Toyota Corolla 25000 XYZ 789 False
2 Ford Focus 27000 CBA 321 True
3 Audi A4 35000 ZYX 987 True
4 NaN 29000 DEF 456 False
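A self-contained sketch of the same idea follows. The data is hypothetical, and it swaps the x == x NaN test for an isinstance check, which is an assumption on my part about what reads more clearly rather than part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({'Brand': ['Honda Civic', 'Ford Focus', 'Audi A4', np.nan]})
search_for_these_values = ['Ford Focus', 'Audi A4 Quattro']

# For each cell x, ask whether x appears inside any list entry z.
# Non-string cells (such as NaN) short-circuit to False.
match = [isinstance(x, str) and any(x in z for z in search_for_these_values)
         for x in df['Brand']]
print(match)  # [False, True, True, False]
```

Note the direction of the test: 'Audi A4' is found inside the longer entry 'Audi A4 Quattro', which isin would not report.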
How to check if Pandas column has value from list of string?
Use apply with a lambda, like:
df['Names'].apply(lambda x: any([k in x for k in kw]))
0 True
1 True
2 True
3 True
4 False
Name: Names, dtype: bool
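The lambda above raises a TypeError if the column holds a missing value, because k in x is not defined for a float NaN. Here is a hedged sketch with hypothetical data (the names in the column and the kw list are made up) that guards against that case.

```python
import numpy as np
import pandas as pd

# Hypothetical data; 'Names' and kw mirror the identifiers used above.
df = pd.DataFrame({'Names': ['Mel Gibson', 'Bob Smith', np.nan]})
kw = ['Mel', 'Smith']

# Only test strings; anything else (NaN here) is reported as False.
has_kw = df['Names'].apply(
    lambda x: isinstance(x, str) and any(k in x for k in kw)
)
print(has_kw.tolist())  # [True, True, False]
```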
Checking if column in dataframe contains any item from list of strings
Pandas generally allows you to filter data frames without resorting to for loops.
This is one approach that should work:
matches = ['beat saber', 'half life', 'walking dead', 'population one']
# matches_regex is a regular expression meaning any of your strings:
# "beat saber|half life|walking dead|population one"
matches_regex = "|".join(matches)
# matches_bools will be a series of booleans indicating whether there was a match
# for each item in the series
matches_bools = hot_quest1.all_text.str.contains(matches_regex, regex=True)
# You can then use that series of booleans to derive a new data frame
# containing only matching rows
matched_rows = hot_quest1[matches_bools]
Here's the documentation for the str.contains method:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html
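To try the approach end to end, here is a runnable sketch. The contents of hot_quest1 are invented for illustration; only the names hot_quest1, all_text, and matches come from the answer.

```python
import pandas as pd

# Hypothetical stand-in for the hot_quest1 data frame.
hot_quest1 = pd.DataFrame({'all_text': ['i love beat saber',
                                        'nothing relevant here',
                                        'half life alyx rocks']})
matches = ['beat saber', 'half life', 'walking dead', 'population one']
matches_regex = '|'.join(matches)

# Boolean mask, then boolean indexing to keep (or drop) the matching rows.
matches_bools = hot_quest1['all_text'].str.contains(matches_regex, regex=True)
matched_rows = hot_quest1[matches_bools]
unmatched_rows = hot_quest1[~matches_bools]

print(matched_rows['all_text'].tolist())
# ['i love beat saber', 'half life alyx rocks']
```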
How to test if a string contains one of the substrings in a list, in pandas?
One option is just to use the regex | character to try to match each of the substrings in the words in your Series s (still using str.contains).
You can construct the regex by joining the words in searchfor with |:
>>> searchfor = ['og', 'at']
>>> s[s.str.contains('|'.join(searchfor))]
0 cat
1 hat
2 dog
3 fog
dtype: object
As @AndyHayden noted in the comments, take care if your substrings contain special characters such as $ and ^ that you want to match literally. These characters have specific meanings in regular expressions and will affect the matching.
You can make your list of substrings safer by escaping non-alphanumeric characters with re.escape:
>>> import re
>>> matches = ['$money', 'x^y']
>>> safe_matches = [re.escape(m) for m in matches]
>>> safe_matches
['\\$money', 'x\\^y']
The strings in this new list will match each character literally when used with str.contains.
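To see the effect, the sketch below runs the same search with and without escaping. The Series s is hypothetical data chosen so that both special characters appear.

```python
import re
import pandas as pd

# Hypothetical data containing the special characters literally.
s = pd.Series(['$money talks', 'x^y is a power', 'plain text'])
matches = ['$money', 'x^y']

# Unescaped: $ and ^ act as anchors, so neither alternative can match.
unsafe_mask = s.str.contains('|'.join(matches))

# Escaped: every character is matched literally.
safe_pattern = '|'.join(re.escape(m) for m in matches)
safe_mask = s.str.contains(safe_pattern)

print(unsafe_mask.tolist())  # [False, False, False]
print(safe_mask.tolist())    # [True, True, False]
```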
How to check if string in list of strings is in pandas dataframe column
You said that a loop might be too slow, but given the size of the list it still seems like the most efficient way. I tried to keep it as simple as possible.
Feel free to modify the print statement based on your needs.
text = 'Bad Word test for Terrible Word same as Horrible Word and NSFW Word and Bad Word again'
bad_words = ['Bad Word', 'Terrible Word', 'Horrible Word', 'NSFW Word']
length_list = []
for i in bad_words:
    # str.count returns the number of non-overlapping occurrences
    count = text.count(i)
    length_list.append([i, count])
print(length_list)
output:
[['Bad Word', 2], ['Terrible Word', 1], ['Horrible Word', 1], ['NSFW Word', 1]]
Alternatively your output as a string can be:
for i in bad_words:
    count = text.count(i)
    print(i + ' count: ' + str(count))
Output:
Bad Word count: 2
Terrible Word count: 1
Horrible Word count: 1
NSFW Word count: 1
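If the text lives in a DataFrame column rather than a single string, the same counts can be sketched with Series.str.count. The data frame and its 'text' column below are hypothetical; re.escape is used because str.count treats its argument as a regular expression.

```python
import re
import pandas as pd

# Hypothetical column of text, one document per row.
df = pd.DataFrame({'text': ['Bad Word then Bad Word again',
                            'a Terrible Word',
                            'clean']})
bad_words = ['Bad Word', 'Terrible Word']

# Count occurrences per row, then sum over the whole column.
counts = {w: int(df['text'].str.count(re.escape(w)).sum()) for w in bad_words}
print(counts)  # {'Bad Word': 2, 'Terrible Word': 1}
```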
Check if string is in a pandas dataframe
a['Names'].str.contains('Mel')
returns a boolean indicator vector of size len(a), with one value per row.
Therefore, you can use:
mel_count = a['Names'].str.contains('Mel').sum()
if mel_count > 0:
    print("There are {m} Mels".format(m=mel_count))
Or use any() if you don't care how many records match your query:
if a['Names'].str.contains('Mel').any():
    print("Mel is there")
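A self-contained sketch of both checks follows, using a hypothetical data frame a. It also shows that str.contains matches substrings, so 'Melissa' counts as a hit; comparing with == is one way to require the whole value.

```python
import pandas as pd

# Hypothetical data; only the column name 'Names' comes from the answer.
a = pd.DataFrame({'Names': ['Mel', 'Melissa', 'Bob']})

contains_mel = a['Names'].str.contains('Mel')
mel_count = int(contains_mel.sum())     # True counts as 1
any_mel = bool(contains_mel.any())

# Whole-value comparison instead of substring search.
exact_count = int((a['Names'] == 'Mel').sum())

print(mel_count, any_mel, exact_count)  # 2 True 1
```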