Python/Pandas: How to Match List of Strings With a Dataframe Column

Here is a readable solution using a custom search_func:

def search_func(row):
    matches = [test_value in row["Description"].lower()
               for test_value in row["Text_Search"]]

    if any(matches):
        return "Yes"
    else:
        return "No"

This function is then applied row-wise:

import pandas as pd

# create example data
df = pd.DataFrame({"Description": ["CANSEL SURVEY E PAY", "JX154 TFR?FR xxx8690"],
                   "Employer": ["Cansel Survey Equipment", "Cansel Survey Equipment"]})

print(df)
            Description                 Employer
0   CANSEL SURVEY E PAY  Cansel Survey Equipment
1  JX154 TFR?FR xxx8690  Cansel Survey Equipment

# create text searches and match column
df["Text_Search"] = df["Employer"].str.lower().str.split()
df["Match"] = df.apply(search_func, axis=1)

# show result
print(df)
            Description                 Employer                  Text_Search  Match
0   CANSEL SURVEY E PAY  Cansel Survey Equipment  [cansel, survey, equipment]    Yes
1  JX154 TFR?FR xxx8690  Cansel Survey Equipment  [cansel, survey, equipment]     No
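
If the helper Text_Search column is not needed, the same check can probably be written as a plain list comprehension over the two columns. This is only a sketch of the row-wise idea above, using the same example data; it skips apply and the intermediate column, and it assumes neither column contains missing values.

```python
import pandas as pd

df = pd.DataFrame({
    "Description": ["CANSEL SURVEY E PAY", "JX154 TFR?FR xxx8690"],
    "Employer": ["Cansel Survey Equipment", "Cansel Survey Equipment"],
})

# For each row, test whether any lower-cased employer word
# occurs as a substring of the lower-cased description.
df["Match"] = [
    "Yes" if any(word in desc.lower() for word in emp.lower().split()) else "No"
    for desc, emp in zip(df["Description"], df["Employer"])
]

print(df[["Description", "Match"]])
```

Like the original function, this is a substring test, so a short employer word such as "e" would match almost any description.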

Pandas, finding match(any) between list of strings and df column values(as list) to create new column?

Is this what you're looking for?

If there's a match, the first matching keyword is assigned to a new column:

df['new_col'] = df['type'].str.extract(f"({'|'.join(matches)})")
        type new_col
0  A23 E I28       E
1  I28 F A23     NaN
2  D41 E F22       E

Edit:

import numpy as np

df['new_col'] = (df['type']
                 .str.findall(f"({'|'.join(matches)})")
                 .str.join(', ')
                 .replace('', np.nan))
        type new_col
0  A23 E I28       E
1  I28 F A23     NaN
2  D41 E F22  E, F22
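
The answer does not show how matches and df were built, so here is a self-contained sketch with assumed inputs: the keyword list ["E", "F22"] and the three type values are hypothetical, picked only because they are consistent with the output shown above. It also escapes each keyword with re.escape, which is a small addition in case a keyword contains regex metacharacters.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical inputs, chosen to be consistent with the output above.
matches = ["E", "F22"]
df = pd.DataFrame({"type": ["A23 E I28", "I28 F A23", "D41 E F22"]})

# Escape each keyword so regex metacharacters are treated literally.
pattern = "(" + "|".join(re.escape(m) for m in matches) + ")"

# First match only; expand=False returns a Series instead of a DataFrame.
df["first_match"] = df["type"].str.extract(pattern, expand=False)

# Every match, joined into one string; rows without a match become NaN.
df["all_matches"] = (df["type"]
                     .str.findall(pattern)
                     .str.join(", ")
                     .replace("", np.nan))

print(df)
```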

Python - Find matching string(s) between DataFrame column (scraped text) and list of strings

You have to add word boundaries '\b' to your regex pattern. From the re module docs:

\b

Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of word characters. Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string. This means that r'\bfoo\b' matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'.

Besides that, you want to use Series.str.findall (or Series.str.extractall) instead of Series.str.extract to find all the matches.

This should work:

the_list = ['AI', 'NLP', 'approach', 'AR Cloud', 'Army_Intelligence', 'Artificial general intelligence', 'Artificial tissue', 'artificial_insemination', 'artificial_intelligence', 'augmented intelligence', 'augmented reality', 'authentification', 'automaton', 'Autonomous driving', 'Autonomous vehicles', 'bidirectional brain-machine interfaces', 'Biodegradable', 'biodegradable', 'Biotech', 'biotech', 'biotechnology', 'BMI', 'BMIs', 'body_mass_index', 'bourdon', 'Bradypus_tridactylus', 'cognitive computing', 'commercial UAVs', 'Composite AI', 'connected home', 'conversational systems', 'conversational user interfaces', 'dawdler', 'Decentralized web', 'Deep fakes', 'Deep learning', 'defrayal']

import re

pat = r'\b({0})\b'.format('|'.join(the_list))
df['matched text'] = df.text_lemmatized.str.findall(pat, flags=re.IGNORECASE).map(", ".join)
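
One thing to watch: if text_lemmatized contains missing values, findall returns NaN for those rows and .map(", ".join) would raise. The sketch below is a hedged variation that uses .str.join instead, which leaves missing rows as missing. The short keyword list and the sample texts are made up for illustration; only the column name comes from the answer above.

```python
import re

import pandas as pd

# Hypothetical keyword list and texts, for illustration only.
keywords = ["AI", "NLP", "Deep learning"]
df = pd.DataFrame({"text_lemmatized": [
    "ai beats nlp",
    "said the plain text",   # contains "ai" only inside other words
    "deep learning for AI",
    None,                    # missing text
]})

# Escape the keywords, then wrap the whole alternation in word boundaries.
pat = r"\b(?:{})\b".format("|".join(map(re.escape, keywords)))

found = df["text_lemmatized"].str.findall(pat, flags=re.IGNORECASE)
# .str.join keeps missing rows as NaN instead of raising a TypeError.
df["matched text"] = found.str.join(", ")

print(df)
```

Rows with text but no keyword end up as an empty string, while rows with missing text stay NaN, so the two cases remain distinguishable.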

Check if String in List of Strings is in Pandas DataFrame Column

If you need to match whole values against the list, use Series.isin:

df['Match'] = df["Brand"].isin(search_for_these_values)
print(df)
            Brand  Price Liscence Plate  Match
0     Honda Civic  22000        ABC 123  False
1  Toyota Corolla  25000        XYZ 789  False
2      Ford Focus  27000        CBA 321   True
3         Audi A4  35000        ZYX 987  False
4             NaN  29000        DEF 456  False

The solution with match checks a pattern against each value rather than comparing whole values, so its output is different.

An alternative for matching substrings is Series.str.contains with the parameter na=False, so that missing values give False:

df['Match'] = df["Brand"].str.contains(pattern, na=False)
print(df)
            Brand  Price Liscence Plate  Match
0     Honda Civic  22000        ABC 123   True
1  Toyota Corolla  25000        XYZ 789   True
2      Ford Focus  27000        CBA 321   True
3         Audi A4  35000        ZYX 987  False
4             NaN  29000        DEF 456  False

EDIT:

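To test whether each Brand value is a substring of any entry in search_for_these_values, you can use a list comprehension that loops over the entries, tests with in, and uses any to return True when at least one entry matches. The x == x check is False for NaN, so missing values give False: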

df['Match'] = [any(x in z for z in search_for_these_values)
               if x == x
               else False
               for x in df["Brand"]]
print(df)

            Brand  Price Liscence Plate  Match
0     Honda Civic  22000        ABC 123  False
1  Toyota Corolla  25000        XYZ 789  False
2      Ford Focus  27000        CBA 321   True
3         Audi A4  35000        ZYX 987   True
4             NaN  29000        DEF 456  False
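
To see the three checks side by side, here is a small sketch. The contents of search_for_these_values and pattern are not shown in the answer, so the list below is an assumption, chosen so that the three results differ in the same way as the outputs above; building pattern by joining the escaped values with | is likewise assumed.

```python
import re

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Brand": ["Honda Civic", "Toyota Corolla", "Ford Focus", "Audi A4", np.nan],
})
# Hypothetical search list (not from the original question).
search_for_these_values = ["Honda", "Toyota", "Ford Focus", "Audi A4 2010"]

# 1) Whole-value equality: Brand must equal one of the search values.
exact = df["Brand"].isin(search_for_these_values)

# 2) Some search value occurs inside Brand (regex alternation).
pattern = "|".join(map(re.escape, search_for_these_values))
contains = df["Brand"].str.contains(pattern, na=False)

# 3) Brand occurs inside some search value; non-strings (NaN) give False.
inside = [
    isinstance(x, str) and any(x in z for z in search_for_these_values)
    for x in df["Brand"]
]

print(pd.DataFrame({"Brand": df["Brand"], "isin": exact,
                    "contains": contains, "inside": inside}))
```

Which one is right depends on the direction of the containment you want: 2) looks for search values inside the column, 3) looks for column values inside the search values.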

Search for list of strings in pandas column

We can use Series.str.findall with the inline ignore-case flag (?i); this way we don't have to import re:

df['Matches'] = df['desc'].str.findall(f'(?i)({"|".join(strings)})')

   itemid                                desc           Matches
0     101                          tea leaves             [tea]
1     201                     baseball gloves        [baseball]
2     221  tea leaves from Onus Green Tea Co.  [tea, Onus, Tea]

To remove duplicates, we convert the matches to upper case and make a set:

df['Matches'] = (
    df['desc'].str.findall(f'(?i)({"|".join(strings)})')
              .apply(lambda x: list(set(map(str.upper, x))))
)
   itemid                                desc      Matches
0     101                          tea leaves        [TEA]
1     201                     baseball gloves   [BASEBALL]
2     221  tea leaves from Onus Green Tea Co.  [TEA, ONUS]
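
A set does not guarantee any particular order, so the list for row 2 could come out as [ONUS, TEA] as well. If the order of first appearance matters, dict.fromkeys is one possible way to deduplicate while keeping it. In this sketch the strings list is assumed (the answer does not show it); the desc values are taken from the output above.

```python
import pandas as pd

# Hypothetical search strings; the original list is not shown.
strings = ["tea", "baseball", "onus"]
df = pd.DataFrame({
    "itemid": [101, 201, 221],
    "desc": ["tea leaves", "baseball gloves",
             "tea leaves from Onus Green Tea Co."],
})

pattern = f'(?i)({"|".join(strings)})'

# Upper-case every match, then drop repeats while keeping first-seen order
# (dict keys preserve insertion order).
df["Matches"] = (
    df["desc"].str.findall(pattern)
              .apply(lambda found: list(dict.fromkeys(m.upper() for m in found)))
)

print(df)
```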

Edit for partial matches

We can use word boundaries \b to exclude partial matches (for example, tea inside teas):

strings = ['\\b' + f + '\\b' for f in strings]

df['Matches'] = df['desc'].str.findall(f'(?i)({"|".join(strings)})')
   itemid                                 desc      Matches
0     101                           tea leaves        [tea]
1     201                      baseball gloves   [baseball]
2     221  teas leaves from Onus Green Tea Co.  [Onus, Tea]
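
To make the effect of the boundaries visible, this sketch runs the same text with and without them. The strings list is again an assumption, and instead of wrapping every keyword separately it puts one \b on each side of the whole group, which should be equivalent for plain words like these.

```python
import re

import pandas as pd

# Hypothetical search strings; the original list is not shown.
strings = ["tea", "onus"]
desc = pd.Series(["teas leaves from Onus Green Tea Co."])

alternation = "|".join(map(re.escape, strings))

# Without boundaries, "tea" is also found inside "teas".
loose = desc.str.findall(f"(?i)({alternation})")

# With boundaries, only whole words count as matches.
strict = desc.str.findall(rf"(?i)\b({alternation})\b")

print(loose[0])
print(strict[0])
```

Note that \b only works as expected when the keywords start and end with word characters (letters, digits, or underscore).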

Retrieve match from list of strings and add as column in dataframe

You can use str.findall to extract fruits into a list and then explode it:

df.assign(fruits = df.text.str.findall('|'.join(fruits))).explode('fruits')

    user                       text   fruits
0    Tom             I love bananas  bananas
1   Dick              I love apples   apples
2  Harry  I love apples and bananas   apples
2  Harry  I love apples and bananas  bananas
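
Here is a runnable version of that one-liner, as a sketch. The fruits list and the extra user "Jane" (whose text mentions no listed fruit) are assumptions added to show what happens to rows without a match: findall gives an empty list, and explode turns it into NaN.

```python
import pandas as pd

# Hypothetical fruit list; "Jane" is an added row with no match.
fruits = ["apples", "bananas"]
df = pd.DataFrame({
    "user": ["Tom", "Dick", "Harry", "Jane"],
    "text": ["I love bananas", "I love apples",
             "I love apples and bananas", "I love pears"],
})

# One row per (user, fruit) pair; the original index is repeated.
result = (df.assign(fruits=df["text"].str.findall("|".join(fruits)))
            .explode("fruits"))

# Optionally drop the users for whom nothing matched.
matched_only = result.dropna(subset=["fruits"])

print(result)
```

If a fresh 0..n-1 index is preferred after exploding, reset_index(drop=True) can be chained at the end.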

