Python/Pandas: How to Match List of Strings with a DataFrame column
Here is a readable solution using a dedicated search_func:
def search_func(row):
    matches = [test_value in row["Description"].lower()
               for test_value in row["Text_Search"]]
    if any(matches):
        return "Yes"
    else:
        return "No"
This function is then applied row-wise:
import pandas as pd

# create example data
df = pd.DataFrame({"Description": ["CANSEL SURVEY E PAY", "JX154 TFR?FR xxx8690"],
                   "Employer": ["Cansel Survey Equipment", "Cansel Survey Equipment"]})
print(df)
            Description                 Employer
0   CANSEL SURVEY E PAY  Cansel Survey Equipment
1  JX154 TFR?FR xxx8690  Cansel Survey Equipment
# create text searches and match column
df["Text_Search"] = df["Employer"].str.lower().str.split()
df["Match"] = df.apply(search_func, axis=1)
# show result
print(df)
            Description                 Employer                  Text_Search  Match
0   CANSEL SURVEY E PAY  Cansel Survey Equipment  [cansel, survey, equipment]    Yes
1  JX154 TFR?FR xxx8690  Cansel Survey Equipment  [cansel, survey, equipment]     No
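The same check can also be written without the helper Text_Search column. The following is only a sketch under assumptions: the helper any_word_in and the column name Match_alt are hypothetical names introduced here for illustration, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "Description": ["CANSEL SURVEY E PAY", "JX154 TFR?FR xxx8690"],
    "Employer": ["Cansel Survey Equipment", "Cansel Survey Equipment"],
})


def any_word_in(description, employer):
    """Return True if any lower-cased word of `employer` occurs in `description`."""
    text = description.lower()
    return any(word in text for word in employer.lower().split())


# "Match_alt" is a hypothetical column name, chosen so it does not clash with "Match".
df["Match_alt"] = [
    "Yes" if any_word_in(d, e) else "No"
    for d, e in zip(df["Description"], df["Employer"])
]
print(df[["Description", "Match_alt"]])
```

Iterating over the two columns with zip avoids building a row object for every record, which df.apply(..., axis=1) does; the matching logic itself is unchanged.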
Pandas, finding match(any) between list of strings and df column values(as list) to create new column?
Is this what you're looking for?
If there's a match, the keyword is assigned to a new column:
df['new_col'] = df['type'].str.extract(f"({'|'.join(matches)})")
        type  new_col
0  A23 E I28        E
1  I28 F A23      NaN
2  D41 E F22        E
Edit: str.extract keeps only the first match. To keep every match, use findall and join the results:
df['new_col'] = (df['type']
                 .str.findall(f"({'|'.join(matches)})")
                 .str.join(', ')
                 .replace('', np.nan))
        type  new_col
0  A23 E I28        E
1  I28 F A23      NaN
2  D41 E F22   E, F22
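The answer does not show how matches or df were built, so here is a self-contained sketch of both variants. The keyword list ["E", "F22"] is an assumption, picked only because it is consistent with the output shown above; re.escape is added as a precaution in case a keyword contains regex metacharacters.

```python
import re

import numpy as np
import pandas as pd

df = pd.DataFrame({"type": ["A23 E I28", "I28 F A23", "D41 E F22"]})
matches = ["E", "F22"]  # assumed keyword list, not given in the original answer

# One capture group holding all keywords as alternatives.
pattern = "(" + "|".join(re.escape(m) for m in matches) + ")"

# str.extract returns only the first match per row (NaN when nothing matches).
first = df["type"].str.extract(pattern, expand=False)

# str.findall returns every match; empty results become NaN after joining.
all_found = df["type"].str.findall(pattern).str.join(", ").replace("", np.nan)

print(pd.DataFrame({"type": df["type"], "first": first, "all": all_found}))
```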
Python - Find matching string(s) between DataFrame column (scraped text) and list of strings
You have to add word boundaries ('\b') to your regex pattern. From the re module docs:
\b
Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of word characters. Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string. This means that r'\bfoo\b' matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'.
Besides that, use Series.str.findall (or Series.str.extractall) instead of Series.str.extract to find all the matches, because extract returns only the first one.
This should work:
the_list = ['AI', 'NLP', 'approach', 'AR Cloud', 'Army_Intelligence', 'Artificial general intelligence', 'Artificial tissue', 'artificial_insemination', 'artificial_intelligence', 'augmented intelligence', 'augmented reality', 'authentification', 'automaton', 'Autonomous driving', 'Autonomous vehicles', 'bidirectional brain-machine interfaces', 'Biodegradable', 'biodegradable', 'Biotech', 'biotech', 'biotechnology', 'BMI', 'BMIs', 'body_mass_index', 'bourdon', 'Bradypus_tridactylus', 'cognitive computing', 'commercial UAVs', 'Composite AI', 'connected home', 'conversational systems', 'conversational user interfaces', 'dawdler', 'Decentralized web', 'Deep fakes', 'Deep learning', 'defrayal']
import re  # needed for the re.IGNORECASE flag

pat = r'\b({0})\b'.format('|'.join(the_list))
df['matched text'] = df.text_lemmatized.str.findall(pat, flags=re.IGNORECASE).map(", ".join)
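To see the effect of the word boundaries, here is a small runnable sketch. The three terms are a subset of the_list, the sample sentences are made up for illustration, and re.escape is an addition that guards against terms containing regex metacharacters.

```python
import re

import pandas as pd

terms = ["AI", "NLP", "Deep learning"]  # short illustrative subset of the_list

# Made-up sample text; "maid", "said" and "Paid" all contain the letters "ai".
df = pd.DataFrame({"text_lemmatized": [
    "AI and NLP drive deep learning research",
    "The maid said nothing",
    "Paid trials of ai tools",
]})

# \b on both sides means a term only matches as a whole word or phrase.
pat = r"\b(?:{})\b".format("|".join(map(re.escape, terms)))

found = df["text_lemmatized"].str.findall(pat, flags=re.IGNORECASE)
df["matched text"] = found.map(", ".join)
print(df)
```

Without the \b anchors, the second row would report "ai" twice and the third row would also pick up the "ai" inside "Paid".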
Check if String in List of Strings is in Pandas DataFrame Column
To match whole values against the list, use Series.isin:
df['Match'] = df["Brand"].isin(search_for_these_values)
print (df)
            Brand  Price  Liscence Plate  Match
0     Honda Civic  22000         ABC 123  False
1  Toyota Corolla  25000         XYZ 789  False
2      Ford Focus  27000         CBA 321   True
3         Audi A4  35000         ZYX 987  False
4             NaN  29000         DEF 456  False
The solution with match tests for substrings rather than whole values, so it gives different output.
An alternative for matching substrings is Series.str.contains with the parameter na=False, which treats missing values as non-matches:
df['Match'] = df["Brand"].str.contains(pattern, na=False)
print (df)
            Brand  Price  Liscence Plate  Match
0     Honda Civic  22000         ABC 123   True
1  Toyota Corolla  25000         XYZ 789   True
2      Ford Focus  27000         CBA 321   True
3         Audi A4  35000         ZYX 987  False
4             NaN  29000         DEF 456  False
EDIT:
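To test whether a Brand value is a substring of any search value, use a list comprehension that loops over search_for_these_values, tests each one with in, and wraps the tests in any so that a single True is enough. The x == x check is False only for NaN, so missing values map to False: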
df['Match'] = [any(x in z for z in search_for_these_values)
               if x == x
               else False
               for x in df["Brand"]]
print (df)
            Brand  Price  Liscence Plate  Match
0     Honda Civic  22000         ABC 123  False
1  Toyota Corolla  25000         XYZ 789  False
2      Ford Focus  27000         CBA 321   True
3         Audi A4  35000         ZYX 987   True
4             NaN  29000         DEF 456  False
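The three variants differ in what counts as a match, which is easy to miss because the answer never shows search_for_these_values or pattern. The sketch below puts them side by side. The search values are an assumption, chosen only because they are consistent with the three outputs above, and isinstance(x, str) is swapped in for the x == x test as a more explicit way to skip missing values.

```python
import re

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"Brand": ["Honda Civic", "Toyota Corolla", "Ford Focus", "Audi A4", np.nan]}
)

# Assumed search values, not taken from the original question.
search_for_these_values = ["Honda", "Toy", "Ford Focus", "Audi A4 2019"]

# 1) Whole-value match: the cell must equal one of the search values.
exact = df["Brand"].isin(search_for_these_values)

# 2) Search value inside the cell: the cell must contain one of the search values.
pattern = "|".join(map(re.escape, search_for_these_values))
contains = df["Brand"].str.contains(pattern, na=False)

# 3) Cell inside a search value: the cell must be a substring of a search value.
reverse = pd.Series(
    [
        isinstance(x, str) and any(x in z for z in search_for_these_values)
        for x in df["Brand"]
    ],
    index=df.index,
)

print(pd.DataFrame({"Brand": df["Brand"], "isin": exact,
                    "contains": contains, "reverse": reverse}))
```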
Search for list of strings in pandas column
We can use Series.str.findall with the inline ignore-case regex flag (?i); this way we don't have to import re:
df['Matches'] = df['desc'].str.findall(f'(?i)({"|".join(strings)})')
   itemid                                desc           Matches
0     101                          tea leaves             [tea]
1     201                     baseball gloves        [baseball]
2     221  tea leaves from Onus Green Tea Co.  [tea, Onus, Tea]
To remove duplicates, we cast the matched strings to upper case and make a set:
df['Matches'] = (
    df['desc'].str.findall(f'(?i)({"|".join(strings)})')
              .apply(lambda x: list(set(map(str.upper, x))))
)
   itemid                                desc      Matches
0     101                          tea leaves        [TEA]
1     201                     baseball gloves   [BASEBALL]
2     221  tea leaves from Onus Green Tea Co.  [TEA, ONUS]
Edit to avoid partial matches
We can use word boundaries (\b) for this:
strings = ['\\b' + f + '\\b' for f in strings]
df['Matches'] = df['desc'].str.findall(f'(?i)({"|".join(strings)})')
   itemid                                 desc      Matches
0     101                           tea leaves        [tea]
1     201                      baseball gloves   [baseball]
2     221  teas leaves from Onus Green Tea Co.  [Onus, Tea]
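The pieces above can be combined into one runnable sketch. The list strings is not shown in the answer, so ["tea", "baseball", "onus"] is an assumption that fits the printed output. Two changes are deliberate swaps rather than the answer's own code: re.escape protects against metacharacters (at the cost of importing re), and dict.fromkeys replaces set because a set does not guarantee the order of the results.

```python
import re

import pandas as pd

df = pd.DataFrame({
    "itemid": [101, 201, 221],
    "desc": ["tea leaves", "baseball gloves", "teas leaves from Onus Green Tea Co."],
})
strings = ["tea", "baseball", "onus"]  # assumed search terms

alternatives = "|".join(map(re.escape, strings))

# Case-insensitive search that also matches inside longer words ("teas").
loose = df["desc"].str.findall(f"(?i)(?:{alternatives})")

# Same search, restricted to whole words by the \b anchors.
strict = df["desc"].str.findall(rf"(?i)\b(?:{alternatives})\b")

# Upper-case and de-duplicate while keeping first-seen order.
unique = loose.apply(lambda found: list(dict.fromkeys(s.upper() for s in found)))

print(pd.DataFrame({"loose": loose, "strict": strict, "unique": unique}))
```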
Retrieve match from list of strings and add as column in dataframe
You can use str.findall to extract the fruits into a list and then explode it:
df.assign(fruits = df.text.str.findall('|'.join(fruits))).explode('fruits')
    user                       text   fruits
0    Tom             I love bananas  bananas
1   Dick              I love apples   apples
2  Harry  I love apples and bananas   apples
2  Harry  I love apples and bananas  bananas
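A self-contained version of this approach might look like the sketch below. The fruits list is an assumption based on the output above, and the fourth row ("Sam") is a made-up addition to show what happens when nothing matches: explode turns an empty list into NaN rather than dropping the row.

```python
import pandas as pd

df = pd.DataFrame({
    "user": ["Tom", "Dick", "Harry", "Sam"],  # "Sam" is a hypothetical extra row
    "text": [
        "I love bananas",
        "I love apples",
        "I love apples and bananas",
        "I love pears",
    ],
})
fruits = ["apples", "bananas"]  # assumed list of search terms

# findall yields one list per row; explode gives each list element its own row
# and repeats the original index label for rows that had several matches.
out = df.assign(fruits=df["text"].str.findall("|".join(fruits))).explode("fruits")
print(out)
```

If the repeated index labels are unwanted, out.reset_index(drop=True) renumbers the rows, and out.dropna(subset=["fruits"]) removes the rows without a match.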