Python/Pandas: How to Match List of Strings With a Dataframe Column

Python/Pandas: How to Match List of Strings with a DataFrame column

Here is a readable solution using a custom row-wise function, search_func:

def search_func(row):
    matches = [test_value in row["Description"].lower() 
               for test_value in row["Text_Search"]]

    if any(matches):
        return "Yes"
    else:
        return "No"

This function is then applied row-wise:

import pandas as pd

# create example data
df = pd.DataFrame({"Description": ["CANSEL SURVEY E PAY", "JX154 TFR?FR xxx8690"],
                   "Employer": ["Cansel Survey Equipment", "Cansel Survey Equipment"]})

print(df)
    Description             Employer
0   CANSEL SURVEY E PAY     Cansel Survey Equipment
1   JX154 TFR?FR xxx8690    Cansel Survey Equipment

# create text searches and match column
df["Text_Search"] = df["Employer"].str.lower().str.split()
df["Match"] = df.apply(search_func, axis=1)

# show result
print(df)
    Description             Employer                    Text_Search                     Match
0   CANSEL SURVEY E PAY     Cansel Survey Equipment     [cansel, survey, equipment]     Yes
1   JX154 TFR?FR xxx8690    Cansel Survey Equipment     [cansel, survey, equipment]     No

Pandas, finding match(any) between list of strings and df column values(as list) to create new column?

Is this what you're looking for?

If there's a match, the keyword is assigned to a new column:

df['new_col'] = df['type'].str.extract(f"({'|'.join(matches)})")

    type        new_col
0   A23 E I28   E
1   I28 F A23   NaN
2   D41 E F22   E

Edit:

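import numpy as np

# collect every match, join them into one string, and turn empty results into NaN
df['new_col'] = (df['type']
                 .str.findall(f"({'|'.join(matches)})")
                 .str.join(', ')
                 .replace('', np.nan))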

    type    new_col
0   A23 E I28   E
1   I28 F A23   NaN
2   D41 E F22   E, F22
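
For reference, here is a minimal self-contained sketch of both variants. The answer does not show how matches or df were built, so the values below (matches = ['E', 'F22'] and the three type strings) are assumptions chosen to be consistent with the printed output. The column names first_match and all_matches are illustrative, and re.escape is added in case a keyword contains regex metacharacters.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical inputs, chosen to be consistent with the output shown above.
matches = ['E', 'F22']
df = pd.DataFrame({'type': ['A23 E I28', 'I28 F A23', 'D41 E F22']})

# One capture group containing every keyword as an alternative.
pattern = f"({'|'.join(re.escape(m) for m in matches)})"

# First match only; expand=False returns a Series instead of a DataFrame.
df['first_match'] = df['type'].str.extract(pattern, expand=False)

# All matches, joined into one string; rows without a match become NaN.
df['all_matches'] = (df['type']
                     .str.findall(pattern)
                     .str.join(', ')
                     .replace('', np.nan))

print(df)
```

str.extract stops at the first hit in each row, while str.findall keeps scanning, which is why only the second variant reports both E and F22 for the last row.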

Python - Find matching string(s) between DataFrame column (scraped text) and list of strings

You have to add word boundaries '\b' to your regex pattern. From the re module docs:

\b
Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of word characters. Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string. This means that r'\bfoo\b' matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'.

Besides that, you want to use Series.str.findall (or Series.str.extractall) instead of Series.str.extract to find all the matches.

This should work:

the_list = ['AI', 'NLP', 'approach', 'AR Cloud', 'Army_Intelligence', 'Artificial general intelligence', 'Artificial tissue', 'artificial_insemination', 'artificial_intelligence', 'augmented intelligence', 'augmented reality', 'authentification', 'automaton', 'Autonomous driving', 'Autonomous vehicles', 'bidirectional brain-machine interfaces', 'Biodegradable', 'biodegradable', 'Biotech', 'biotech', 'biotechnology', 'BMI', 'BMIs', 'body_mass_index', 'bourdon', 'Bradypus_tridactylus', 'cognitive computing', 'commercial UAVs', 'Composite AI', 'connected home', 'conversational systems', 'conversational user interfaces', 'dawdler', 'Decentralized web', 'Deep fakes', 'Deep learning', 'defrayal']

import re

# wrap the alternation in word boundaries so only whole terms match
pat = r'\b({0})\b'.format('|'.join(the_list))
df['matched text'] = df.text_lemmatized.str.findall(pat, flags=re.IGNORECASE).map(", ".join)
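
To see what the word boundaries change, the following sketch compares a pattern without \b against one with it. The short terms list and the two sentences are hypothetical stand-ins for the_list and df.text_lemmatized, and re.escape is an addition that guards against terms containing regex metacharacters.

```python
import re

import pandas as pd

# Hypothetical stand-ins for the_list and df.text_lemmatized.
terms = ['AI', 'NLP', 'AR Cloud']
df = pd.DataFrame({'text_lemmatized': [
    'AI and NLP power the AR Cloud',
    'she said the chair was painted',
]})

alternation = '|'.join(re.escape(t) for t in terms)
loose = f'({alternation})'         # no boundaries: also matches inside words
strict = rf'\b({alternation})\b'   # boundaries: whole terms only

loose_hits = df['text_lemmatized'].str.findall(loose, flags=re.IGNORECASE)
strict_hits = df['text_lemmatized'].str.findall(strict, flags=re.IGNORECASE)

print(loose_hits.tolist())
print(strict_hits.tolist())
```

Without boundaries, the second sentence produces three spurious hits for AI (inside "said", "chair" and "painted"); with boundaries it produces none.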

Check if String in List of Strings is in Pandas DataFrame Column

To test whether each value exactly equals one of the values in the list, use Series.isin:

df['Match'] = df["Brand"].isin(search_for_these_values)
print (df)
            Brand  Price Liscence Plate  Match
0     Honda Civic  22000        ABC 123  False
1  Toyota Corolla  25000        XYZ 789  False
2      Ford Focus  27000        CBA 321   True
3         Audi A4  35000        ZYX 987  False
4             NaN  29000        DEF 456  False

The solution with match checks for substrings, so its output is different.

An alternative for matching substrings is Series.str.contains with the parameter na=False, which makes missing values return False:

df['Match'] = df["Brand"].str.contains(pattern, na=False)
print (df)
            Brand  Price Liscence Plate  Match
0     Honda Civic  22000        ABC 123   True
1  Toyota Corolla  25000        XYZ 789   True
2      Ford Focus  27000        CBA 321   True
3         Audi A4  35000        ZYX 987  False
4             NaN  29000        DEF 456  False

EDIT:

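To test whether each Brand value is a substring of any of the search strings, use a list comprehension that loops over search_for_these_values, tests each one with in, and wraps the tests in any so that a single hit returns True. The x == x check is False only for NaN, so missing values map to False: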

df['Match'] = [any(x in z for z in search_for_these_values)
               if x == x
               else False
               for x in df["Brand"]]
print (df)

            Brand  Price Liscence Plate  Match
0     Honda Civic  22000        ABC 123  False
1  Toyota Corolla  25000        XYZ 789  False
2      Ford Focus  27000        CBA 321   True
3         Audi A4  35000        ZYX 987   True
4             NaN  29000        DEF 456  False
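
The three approaches can be compared side by side in one runnable sketch. The answer never shows search_for_these_values or pattern, so the list below is hypothetical, picked only because it is consistent with the three outputs above; building pattern with '|'.join and re.escape is likewise an assumption. The NaN guard uses isinstance instead of x == x, which should behave the same for NaN and also copes with pd.NA.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical search list; the original one is not shown.
search_for_these_values = ['Honda', 'Toyota', 'Ford Focus', 'Audi A4 2010']
df = pd.DataFrame({'Brand': ['Honda Civic', 'Toyota Corolla', 'Ford Focus',
                             'Audi A4', np.nan]})

# 1) Exact match: the whole cell equals one of the search values.
exact = df['Brand'].isin(search_for_these_values)

# 2) A search value occurs somewhere inside the cell.
pattern = '|'.join(re.escape(v) for v in search_for_these_values)
cell_contains_value = df['Brand'].str.contains(pattern, na=False)

# 3) The cell occurs somewhere inside a search value (missing cells give False).
value_contains_cell = pd.Series(
    [isinstance(x, str) and any(x in z for z in search_for_these_values)
     for x in df['Brand']],
    index=df.index,
)

print(pd.DataFrame({'Brand': df['Brand'],
                    'isin': exact,
                    'contains': cell_contains_value,
                    'reverse': value_contains_cell}))
```

Which one fits depends on the direction of the containment: isin needs equality, str.contains looks for the search value inside the cell, and the list comprehension looks for the cell inside the search value.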

Search for list of strings in pandas column

We can use Series.str.findall with the inline ignore-case regex flag (?i); this way we don't have to import re:

df['Matches'] = df['desc'].str.findall(f'(?i)({"|".join(strings)})')

   itemid                                desc           Matches
0     101                          tea leaves             [tea]
1     201                     baseball gloves        [baseball]
2     221  tea leaves from Onus Green Tea Co.  [tea, Onus, Tea]

To remove duplicates, we convert the matches to upper case and build a set:

df['Matches'] = (
    df['desc'].str.findall(f'(?i)({"|".join(strings)})')
    .apply(lambda x: list(set(map(str.upper, x))))
)

   itemid                                desc      Matches
0     101                          tea leaves        [TEA]
1     201                     baseball gloves   [BASEBALL]
2     221  tea leaves from Onus Green Tea Co.  [TEA, ONUS]

Edit: avoiding partial matches

We can use word boundaries \b so that only whole words match:

strings = ['\\b' + f + '\\b' for f in strings]

df['Matches'] = df['desc'].str.findall(f'(?i)({"|".join(strings)})')

   itemid                                 desc      Matches
0     101                           tea leaves        [tea]
1     201                      baseball gloves   [baseball]
2     221  teas leaves from Onus Green Tea Co.  [Onus, Tea]
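
Putting the pieces together, a self-contained sketch might look like the following. The strings list, the fourth row and the helper unique_lower are hypothetical; the helper swaps the set-based deduplication for dict.fromkeys so that the order of first appearance is kept, and re.escape is added as a precaution.

```python
import re

import pandas as pd

# Hypothetical search terms and data (the original strings list is not shown).
strings = ['baseball', 'onus', 'tea']
df = pd.DataFrame({
    'itemid': [101, 201, 221, 222],
    'desc': ['tea leaves',
             'baseball gloves',
             'teas leaves from Onus Green Tea Co.',
             'tea leaves from Onus Green Tea Co.'],
})

# (?i) must sit at the start of the pattern; \b keeps "teas" from matching "tea".
alternation = '|'.join(re.escape(s) for s in strings)
pattern = rf'(?i)\b({alternation})\b'


def unique_lower(found):
    """Lower-case the matches and drop repeats, keeping first-seen order."""
    return list(dict.fromkeys(m.lower() for m in found))


df['Matches'] = df['desc'].str.findall(pattern).apply(unique_lower)
print(df)
```

Unlike a set, dict.fromkeys gives the same ordering on every run, which makes the resulting lists easier to compare or test.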

Retrieve match from list of strings and add as column in dataframe

You can use str.findall to extract fruits into a list and then explode it:

df.assign(fruits = df.text.str.findall('|'.join(fruits))).explode('fruits')

    user                        text   fruits
0    Tom              I love bananas  bananas
1   Dick               I love apples   apples
2  Harry   I love apples and bananas   apples
2  Harry   I love apples and bananas  bananas
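
A complete version of this one-liner, as a sketch: the fruits list and the DataFrame are reconstructed from the printed output and are therefore assumptions, and re.escape is an optional safeguard that the answer does not use.

```python
import re

import pandas as pd

# Reconstructed (hypothetical) inputs matching the output shown above.
fruits = ['apples', 'bananas']
df = pd.DataFrame({
    'user': ['Tom', 'Dick', 'Harry'],
    'text': ['I love bananas', 'I love apples', 'I love apples and bananas'],
})

pattern = '|'.join(re.escape(f) for f in fruits)

# findall gives one list per row; explode turns each list item into its own row.
result = df.assign(fruits=df['text'].str.findall(pattern)).explode('fruits')
print(result)
```

explode repeats the original index for rows with several matches and yields NaN for rows with none, so result.reset_index(drop=True) or result.dropna(subset=['fruits']) may be useful follow-ups.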
