String edit distance in python
There is a NLTK package which you can use, it uses the Levenshtein edit-distance which should be what you're looking for.
Example:
import nltk
s1 = "abc"
s2 = "ebcd"
nltk.edit_distance(s1, s2) # output: 2
Reference:
https://tedboy.github.io/nlps/generated/generated/nltk.edit_distance.html
Python Edit distance algorithm with dynamic programming and 2D Array - Output is not optimal
Ah, looks like I have found a solution, that I'll have to answer my own question now. (I'm still confused with some parts, and am only answering this question to briefly introduce the new implementation, as to save the time of other kind helpers)
So firstly, I have missed a condition in the original code, that is, what if one of the two string inputs are empty? Then we'll have to insert everything from the other string. Henceforth, the optimal editing distance is just the length of this other string.
if i == 0:
D[i][j] = j
elif j == 0:
D[i][j] = i
Also, regarding the original for-loop of the code, I learnt my mistakes from GeeksforGeeks. If my understanding is correct, they are saying that if two indices (i and j) are consistent, all we need to do is to move diagonally upward on the graph (i-1 and j-1) without adding any counts.
Else if the indices do not match, we move either to the direction of i-1, j-1 or diagonally up dependently. I was right on this, apart from the fact the count is added after the move, whereas I have added them during the move.
I am still a bit unsure with how it worked, however I'll compare the two algorithms below, would be appreciated if someone could explain it further in the comments.
My original for-loop (present in the question)
for j in range(0, m):
for i in range(0, n):
if target[i-1] == source[j-1]:
D[i][j] = min(D[i-1][j-1], D[i-1][j]+1, D[i][j-1]+1)
else:
D[i][j] = min(D[i-1][j-1]+1, D[i-1][j]+1, D[i][j-1]+1)
And below is the new for-loop, whose output is correct after testing:
if target[i-1] == source[j-1]:
D[i][j] = D[i-1][j-1]
else:
D[i][j] = 1 + min(D[i][j-1], D[i-1][j], D[i-1][j-1])
Would be appreciated if someone could further explain how did this work, as I still only have a superfacial understanding of the new code
Final code:
def edit_distance(target, source):
m = len(target)+1
n = len(source)+1
D = [[0 for x in range(n)] for x in range(m)]
for i in range(m):
for j in range(n):
if i == 0:
D[i][j] = j
elif j == 0:
D[i][j] = i
elif target[i-1] == source[j-1]:
D[i][j] = D[i-1][j-1]
else:
D[i][j] = 1 + min(D[i][j-1], D[i-1][j], D[i-1][j-1])
return D[m-1][n-1]
print(edit_distance("distance", "editing"))
# output = 5, which is correct
Levenshtein edit-distance between rows and columns
edit_distance
expects 2 strings, so you have to iterate over the indexes. One option is to apply a lambda that does that on df
:
df.apply(lambda col: [nltk.edit_distance(col.name, i) for i in col.index])
But, instead of filling in a DataFrame, I think it's simpler to first create a dictionary with the values; then build a DataFrame as follows:
df = pd.DataFrame({j: {i: nltk.edit_distance(i,j) for i in rows} for j in columns})
Output:
Band Tree Foot
Hand 1 4 4
Foot 4 4 0
Shoulder 7 7 7
Edit Distance w/ operational weights in Python
You have the base costs for when i = 0
and j = 0
to be j
and i
respectively, which are not multiples of 5. Then you should be multiplying them by 20
since not using the letters is essentially the same as deleting or inserting them for the purposes of edit distance.
So you should try something like this:
str1="algorithms"
str2="alligator"
m=len(str1)
n=len(str2)
def editdistance(str1, str2, m, n):
table=[[0 for x in range(n+1)] for x in range(m+1)]
for i in range(m+1):
for j in range(n+1):
if i==0:
table[i][j]=j*20
elif j==0:
table[i][j]=i*20
elif str1[i-1]==str2[j-1]:
table[i][j]=table[i-1][j-1]
else:
table[i][j] = min(20+table[i][j-1], 20+table[i-1][j], 5+table[i-1][j-1])
return table[m][n]
print(editdistance(str1, str2, m, n))
Calculating Minimum Edit Distance for unequal strings python
For different length strings, cost
and backtrace
indices doesn't match.
Can be implemented minimum edit distance with 2 substitution cost by updating only one numpy m
* n
arr with cost at each step.
As per Algorithm,
Below code will do the job.
def minimumEditDistance(first, second):
#Creating numpy ndarray( initialized with 0 of dimension of size of both strings
matrix = np.zeros((len(first)+1,len(second)+1), dtype=np.int)
# Cross relation loop through each character of each string with each other and
# fill the respective index of matrxi (row,column)
for i in range(len(first)+1):
for j in range(len(second)+1):
#First doing the boundary value analysis, if first or second string is empty so directly adding insertion cost
if i == 0:
matrix[i][j] = j
#Second case
elif j == 0:
matrix[i][j] = i
else:
matrix[i][j] = min(matrix[i][j-1] + 1,
matrix[i-1][j] + 1,
matrix[i-1][j-1] + 2 if first[i-1] != second[j-1] else matrix[i-1][j-1] + 0)
# Adjusted the cost accordinly, insertion = 1, deletion=1 and substitution=2
return matrix[len(first)][len(second)] # Returning the final
Output:
>>>print(minimumEditDistance('levenshtein','levels'))
7
>>>print(minimumEditDistance('levenshtein','levenshtein'))
0
Is there a way to perform edit distance between two string columns in a dataframe?
Using Levenshtein distance from the textdistance module:
from textdistance import levenshtein
# Merge the two columns in one dataframe
df = dataset1[['SAX']].merge(dataset2[['SAX']], left_index=True, right_index=True, suffixes=('_1', '_2'))
# Compute the Levenshtein distance
df['distance'] = df.apply(lambda x: levenshtein.distance(x['SAX_1'], x['SAX_2']), axis=1)
Related Topics
Ssl.Sslerror: Tlsv1 Alert Protocol Version
How to Calculate Mean Values Grouped on Another Column in Pandas
Timeit Versus Timing Decorator
Excluding Directories in Os.Walk
Adding a Background Image to a Plot
Replicating Rows in a Pandas Data Frame by a Column Value
Attributeerror: 'Tensor' Object Has No Attribute 'Numpy'
Function Changes List Values and Not Variable Values in Python
How to Sum Values in a Column That Match a Given Condition Using Pandas
Remove File After Flask Serves It
Pyqt Gui Size on High Resolution Screens
Flask SQLalchemy Query, Specify Column Names
Decorating Class Methods - How to Pass the Instance to the Decorator
How to Scroll Frame Using Mouse Wheel & Adding Horizontal Scrollbar
Get Lat/Long Given Current Point, Distance and Bearing
How to Use the 'JSON' Module to Read in One JSON Object at a Time