Python-Compare Two String Columns in Same Dataframe, Return Matching Result

how to string compare two columns in pandas dataframe?

Just do (assuming `levenshtein_distance` is any edit-distance function, e.g. `Levenshtein.distance` from the python-Levenshtein package):

df['compare'] = [levenshtein_distance(a, b) for a, b in zip(df['a'], df['b'])]

Or, if you want equality comparison:

df['compare'] = (df['a'] == df['b'])
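If you don't have an edit-distance library installed, a minimal pure-Python `levenshtein_distance` (a standard dynamic-programming sketch, not part of the original answer) looks like:

```python
def levenshtein_distance(s, t):
    """Classic dynamic-programming edit distance between two strings."""
    if len(s) < len(t):
        s, t = t, s
    # prev[j] = edits needed to turn the empty prefix into t[:j]
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein_distance("kitten", "sitting"))  # → 3
```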

Compare two dataframe columns for matching strings or are substrings then count in pandas

I'm rewriting this answer based on our discussions in the comments.

Rather than use apply, you can use a list comprehension to the same effect; the following creates a list with the desired calculation for each row:

[sum(all(val in cell for val in row) for cell in dfB['values_list']) for row in dfA['values_list']]

While I originally found this significantly harder to parse than an apply function (and much harder to write), there is a tremendous advantage in speed. Here is your data, with the final two lines to split entries into lists:

import pandas as pd

dfA = pd.DataFrame(["4012, 4065, 4682",
                    "4712, 2339, 5652, 10007",
                    "4618, 8987",
                    "7447, 4615, 4012",
                    "6515",
                    "4065, 2339, 4012"],
                   columns=['values'])

dfB = pd.DataFrame(["6515, 4012, 4618, 8987",
                    "4065, 5116, 2339, 8757, 4012",
                    "1101",
                    "6515",
                    "4012, 4615, 7447",
                    "7447, 6515, 4012, 4615"],
                   columns=['values'])

dfA['values_list'] = dfA['values'].str.split(', ')
dfB['values_list'] = dfB['values'].str.split(', ')

Here is a speed test using the gnarly list comp:

In[0]
%%timeit -n 1000
dfA['overlap_A'] = [sum(all(val in cell for val in row)
                        for cell in dfB['values_list'])
                    for row in dfA['values_list']]

Out[0]
186 µs ± 2.36 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

And here is the same using an apply function, similar to the one used in MrNobody33's answer and in my original (derivative) answer. Note that this function already uses comprehensions internally; rewriting those as explicit for loops would presumably make things even slower:

def check_overlap(row):
    return sum(all(val in cell for val in row['values_list'])
               for cell in dfB['values_list'])

In[1]:
%%timeit -n 1000
dfA['overlap_B'] = dfA.apply(check_overlap, axis=1)

Out[1]:
1.4 ms ± 61.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

That's about 7x faster when not using apply! Note that the resulting output is the same:

                    values                values_list  overlap_A  overlap_B
0         4012, 4065, 4682         [4012, 4065, 4682]          0          0
1  4712, 2339, 5652, 10007  [4712, 2339, 5652, 10007]          0          0
2               4618, 8987               [4618, 8987]          1          1
3         7447, 4615, 4012         [7447, 4615, 4012]          2          2
4                     6515                     [6515]          3          3
5         4065, 2339, 4012         [4065, 2339, 4012]          1          1

how to compare two columns in dataframe and update a column based on matching fields

import pandas as pd

d1 = {
    "a": (1, 4, 7),
    "b": (2, 5, 8),
    "c": (0, 0, 0)
}

d2 = {
    "a_1": (1, 4, 7),
    "b_1": (5, 2, 8)
}

df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)

# Iterate through each entry in a and compare it to a_1;
# use .loc for the assignment to avoid chained-assignment warnings
for i in range(len(df1["a"])):
    for j in range(len(df2["a_1"])):
        if df1["a"][i] == df2["a_1"][j]:
            df1.loc[i, "c"] = df2["b_1"][j]
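The same update can be done without explicit loops by building a lookup Series from a_1 to b_1 and mapping it over a (a vectorized sketch, assuming the values in a_1 are unique):

```python
import pandas as pd

df1 = pd.DataFrame({"a": (1, 4, 7), "b": (2, 5, 8), "c": (0, 0, 0)})
df2 = pd.DataFrame({"a_1": (1, 4, 7), "b_1": (5, 2, 8)})

# Series mapping a_1 -> b_1; rows of df1 with no match keep their old c
lookup = df2.set_index("a_1")["b_1"]
df1["c"] = df1["a"].map(lookup).fillna(df1["c"]).astype(int)

print(df1["c"].tolist())  # → [5, 2, 8]
```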

Python - Findall matching string(s) between two DataFrame columns - sequence item 0: expected str instance, tuple found

You can use

pattern = r'(?i)\b({0})\b'.format("|".join(df["column_text_to_find"].to_list()))
df["column_text_to_search"].str.findall(pattern).str.join('_')

Or, if your "words" to find can contain special chars anywhere in the string:

import re

pattern = r'(?i)(?!\B\w)({0})(?<!\w\B)'.format(
    "|".join(sorted(map(re.escape, df["column_text_to_find"].to_list()),
                    key=len, reverse=True)))
df["column_text_to_search"].str.findall(pattern).str.join('_')

Note the use of

  • (?i) - it enables case insensitive search
  • \b...\b - word boundaries enable whole-word search for natural-language words (if the "words" can contain special chars in arbitrary positions, you cannot rely on word boundaries)
  • (?!\B\w) / (?<!\w\B) - dynamic adaptive word boundaries that only require a word boundary if the neighbouring char in the word to find is a word char
  • "|".join(df["column_text_to_find"].to_list()) - forms an alternation based pattern of values inside the column_text_to_find column.
  • sorted(map(re.escape, df["column_text_to_find"].to_list()), key=len, reverse=True) - sorts the words to find by length in descending order and escapes them for use in regex
  • .findall(pattern) - finds all occurrences of the pattern and
  • .str.join('_') - joins them with _.
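As a quick end-to-end check, here is the word-boundary variant on a small made-up frame (the column names follow the answer; the data is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "column_text_to_find": ["cat", "dog", "bird"],
    "column_text_to_search": ["A cat and a DOG", "no pets here", "Bird, bird, bird"],
})

# Alternation of all words to find, case-insensitive, whole words only
pattern = r'(?i)\b({0})\b'.format("|".join(df["column_text_to_find"].to_list()))
result = df["column_text_to_search"].str.findall(pattern).str.join('_')

print(result.tolist())  # → ['cat_DOG', '', 'Bird_bird_bird']
```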

Compare two python pandas dataframe string columns to identify common string and add the common string to new column

  1. Create a map obj_map whose keys are the lowercased item_cleaned values and whose values are the original item_cleaned strings.
  2. Use a regexp to extract item_cleaned, with flags=re.IGNORECASE.
  3. Then lowercase the extracted part and map it through obj_map to get item_final.

import re
item_cleaned = df2['item_cleaned'].dropna().unique()
obj_map = pd.Series(dict(zip(map(str.lower, item_cleaned), item_cleaned)))

# escape the special characters
re_pat = '(%s)' % '|'.join([re.escape(i) for i in item_cleaned])

df1['item_final'] = df1['item_name'].str.extract(re_pat, flags=re.IGNORECASE)
df1['item_final'] = df1['item_final'].str.lower().map(obj_map)

obj_map

def    Def
ghi    Ghi
abc    Abc
dtype: object

df1

    item_name item_final
0     abc xyz        Abc
1     xuy DEF        Def
2  s GHI lsoe        Ghi
3   p ABc ois        Abc

