how to string compare two columns in pandas dataframe?
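Use a list comprehension over the paired values, where levenshtein_distance is whichever edit-distance function you have available:
df['compare'] = [levenshtein_distance(a, b) for a, b in zip(df['a'], df['b'])]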
Or, if you want equality comparison:
df['compare'] = (df['a'] == df['b'])
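Since levenshtein_distance is not defined in the answer, here is a minimal sketch of both comparisons with a plain dynamic-programming implementation standing in for it. The sample strings and the column names distance and equal are made up for illustration; a library implementation would normally be preferable for large frames.

```python
import pandas as pd

def levenshtein_distance(s, t):
    """Edit distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete a char from s
                            curr[j - 1] + 1,      # insert a char into s
                            prev[j - 1] + cost))  # substitute (or keep)
        prev = curr
    return prev[-1]

# Hypothetical sample data
df = pd.DataFrame({"a": ["kitten", "flaw", "same"],
                   "b": ["sitting", "lawn", "same"]})

# Row-wise distance and exact equality
df["distance"] = [levenshtein_distance(a, b) for a, b in zip(df["a"], df["b"])]
df["equal"] = df["a"] == df["b"]
print(df)
```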
Compare two dataframe columns for matching strings or are substrings then count in pandas
I'm rewriting this answer based on our discussions in the comments.
Rather than use apply, you can use a list comprehension to provide the same effect; the following creates a list with the desired calculation for each row:
[sum(all(val in cell for val in row) for cell in dfB['values_list']) for row in dfA['values_list']]
While I originally found this significantly harder to parse than an apply function (and much harder to write), there is a tremendous advantage in speed. Here is your data, with the final two lines to split entries into lists:
import pandas as pd
dfA = pd.DataFrame(["4012, 4065, 4682",
                    "4712, 2339, 5652, 10007",
                    "4618, 8987",
                    "7447, 4615, 4012",
                    "6515",
                    "4065, 2339, 4012",],
                   columns=['values'])
dfB = pd.DataFrame(["6515, 4012, 4618, 8987",
                    "4065, 5116, 2339, 8757, 4012",
                    "1101",
                    "6515",
                    "4012, 4615, 7447",
                    "7447, 6515, 4012, 4615"],
                   columns=['values'])
dfA['values_list'] = dfA['values'].str.split(', ')
dfB['values_list'] = dfB['values'].str.split(', ')
Here is a speed test using the gnarly list comp:
In[0]
%%timeit -n 1000
dfA['overlap_A'] = [sum(all(val in cell for val in row)
                        for cell in dfB['values_list'])
                    for row in dfA['values_list']]
Out[0]
186 µs ± 2.36 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
And here is the same using an apply function, similar to that used in MrNobody33's answer, and in my original (derivative) answer. Note that this function already uses some comprehensions, and presumably moving things to for loops would make things slower:
def check_overlap(row):
    # Count the dfB rows whose list contains every value of this dfA row
    return sum(all(val in cell for val in row['values_list']) for cell in dfB['values_list'])
In[1]:
%%timeit -n 1000
dfA['overlap_B'] = dfA.apply(check_overlap, axis=1)
Out[1]:
1.4 ms ± 61.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
That's roughly 7.5x faster (186 µs vs. 1.4 ms) when not using apply! Note that the resulting output is the same:
                    values                values_list  overlap_A  overlap_B
0         4012, 4065, 4682         [4012, 4065, 4682]          0          0
1  4712, 2339, 5652, 10007  [4712, 2339, 5652, 10007]          0          0
2               4618, 8987               [4618, 8987]          1          1
3         7447, 4615, 4012         [7447, 4615, 4012]          2          2
4                     6515                     [6515]          3          3
5         4065, 2339, 4012         [4065, 2339, 4012]          1          1
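As a possible variation on the same counting logic (not something from the answer or its comments), each cell can be converted to a set once so that the containment check becomes a subset test. This is only a sketch on the same sample data; the names a_sets, b_sets and the overlap column are made up here, and no timing claim is implied.

```python
import pandas as pd

dfA = pd.DataFrame({"values": ["4012, 4065, 4682",
                               "4712, 2339, 5652, 10007",
                               "4618, 8987",
                               "7447, 4615, 4012",
                               "6515",
                               "4065, 2339, 4012"]})
dfB = pd.DataFrame({"values": ["6515, 4012, 4618, 8987",
                               "4065, 5116, 2339, 8757, 4012",
                               "1101",
                               "6515",
                               "4012, 4615, 7447",
                               "7447, 6515, 4012, 4615"]})

# Build each set once instead of scanning lists repeatedly
a_sets = [set(v.split(", ")) for v in dfA["values"]]
b_sets = [set(v.split(", ")) for v in dfB["values"]]

# For every dfA row, count the dfB rows that contain all of its values
# (a <= b is the subset test for sets)
dfA["overlap"] = [sum(a <= b for b in b_sets) for a in a_sets]
print(dfA)
```

The counts match the overlap_A and overlap_B columns shown above.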
how to compare two columns in dataframe and update a column based on matching fields
import pandas as pd
d1 = {
    "a": (1, 4, 7),
    "b": (2, 5, 8),
    "c": (0, 0, 0)
}
d2 = {
    "a_1": (1, 4, 7),
    "b_1": (5, 2, 8)
}
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)
# Iterate through each entry in a and compare it to a_1
for i in range(len(df1["a"])):
    for j in range(len(df2["a_1"])):
        if df1["a"][i] == df2["a_1"][j]:
            # Write through .loc; chained indexing (df1["c"][i] = ...)
            # may assign to a temporary copy instead of df1
            df1.loc[i, "c"] = df2["b_1"][j]
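The nested loop can also be expressed as a lookup, which avoids comparing every pair of rows. The sketch below is an alternative technique, not the answer's own method: it builds a Series keyed by a_1 (the name lookup is made up here) and maps column a through it. Keeping the last duplicate mirrors the loop, where a later match overwrites an earlier one, and unmatched rows keep their existing c value.

```python
import pandas as pd

df1 = pd.DataFrame({"a": (1, 4, 7), "b": (2, 5, 8), "c": (0, 0, 0)})
df2 = pd.DataFrame({"a_1": (1, 4, 7), "b_1": (5, 2, 8)})

# a_1 -> b_1 lookup; Series.map needs unique keys, so keep the last duplicate
lookup = df2.drop_duplicates("a_1", keep="last").set_index("a_1")["b_1"]

# Map a through the lookup; rows with no match fall back to the old c value
df1["c"] = df1["a"].map(lookup).fillna(df1["c"]).astype(df1["c"].dtype)
print(df1)
```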
Python - Findall matching string(s) between two DataFrame columns - sequence item 0: expected str instance, tuple found
You can use
pattern = r'(?i)\b({0})\b'.format("|".join(df["column_text_to_find"].to_list()))
df["column_text_to_search"].str.findall(pattern).str.join('_')
Or, if your "words" to find can contain special chars anywhere in the string (re.escape requires import re):
import re
pattern = r'(?i)(?!\B\w)({0})(?<!\w\B)'.format("|".join( sorted(map(re.escape, df["column_text_to_find"].to_list()), key=len, reverse=True) ))
df["column_text_to_search"].str.findall(pattern).str.join('_')
Note the use of
- (?i) - it enables case insensitive search
- \b...\b - word boundaries enable whole word search for natural language words (if the "words" can contain special chars in arbitrary positions, you cannot rely on word boundaries)
- (?!\B\w) / (?<!\w\B) - dynamic adaptive word boundaries that only require a word boundary if the neighbouring char in the word to find is a word char
- "|".join(df["column_text_to_find"].to_list()) - forms an alternation based pattern of values inside the column_text_to_find column
- sorted(map(re.escape, df["column_text_to_find"].to_list()), key=len, reverse=True) - sorts the words to find by length in descending order and escapes them for use in regex
- .findall(pattern) - finds all occurrences of the pattern
- .str.join('_') - joins them with _.
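To make the pieces listed above concrete, here is a small run of the second (escaped, adaptive-boundary) pattern. The sample rows and the found column are invented for illustration; only the pattern construction comes from the answer.

```python
import re
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "column_text_to_find": ["apple", "C++", "pie"],
    "column_text_to_search": ["Apple pie recipe",
                              "I like c++ and pineapple",
                              "nothing here"],
})

# Escape each term and try longer terms first
words = sorted(map(re.escape, df["column_text_to_find"].to_list()),
               key=len, reverse=True)
pattern = r"(?i)(?!\B\w)({0})(?<!\w\B)".format("|".join(words))

# One capturing group, so findall returns the matched terms themselves
df["found"] = df["column_text_to_search"].str.findall(pattern).str.join("_")
print(df["found"].tolist())
```

In this run "c++" is found even though it ends in non-word characters, while the "apple" inside "pineapple" is skipped because it does not start at a word boundary.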
Compare two python pandas dataframe string columns to identify common string and add the common string to new column
- create a map obj_map whose keys are the lowercased item_cleaned values and whose values are the original item_cleaned values.
- use a regexp to extract item_cleaned from item_name, with the flag re.IGNORECASE
- then lowercase the extracted part and map it through obj_map to get item_final
import re
item_cleaned = df2['item_cleaned'].dropna().unique()
obj_map = pd.Series(dict(zip(map(str.lower, item_cleaned), item_cleaned)))
# escape the special characters
re_pat = '(%s)' % '|'.join([re.escape(i) for i in item_cleaned])
df1['item_final'] = df1['item_name'].str.extract(re_pat, flags=re.IGNORECASE)
df1['item_final'] = df1['item_final'].str.lower().map(obj_map)
obj_map
def    Def
ghi    Ghi
abc    Abc
dtype: object
df1
    item_name  item_final
0     abc xyz         Abc
1     xuy DEF         Def
2  s GHI lsoe         Ghi
3   p ABc ois         Abc