Edit Distance in Python

String edit distance in Python

NLTK provides an edit_distance function that computes the Levenshtein edit distance, which should be what you're looking for.

Example:

import nltk
s1 = "abc"
s2 = "ebcd"
nltk.edit_distance(s1, s2) # output: 2

Reference:
https://tedboy.github.io/nlps/generated/generated/nltk.edit_distance.html

Python Edit distance algorithm with dynamic programming and 2D Array - Output is not optimal

It looks like I have found a solution, so I'll answer my own question. (I'm still confused about some parts, and am only answering to briefly introduce the new implementation and save other helpers some time.)

First, I missed a condition in the original code: what if one of the two input strings is empty? Then we have to insert every character of the other string, so the optimal edit distance is simply the length of that other string.

if i == 0:
    D[i][j] = j
elif j == 0:
    D[i][j] = i

Also, regarding the original for-loop, I learnt my mistakes from GeeksforGeeks. If my understanding is correct, they are saying that if the characters at the two indices (target[i-1] and source[j-1]) match, all we need to do is move diagonally up the table (to i-1 and j-1) without adding to the count.

Otherwise, if the characters do not match, we move to i-1, to j-1, or diagonally up, whichever is cheapest. I was right on this, apart from the fact that the count is added after the move, whereas I added it during the move.

I am still a bit unsure how it works, so I'll compare the two loops below. I would appreciate it if someone could explain it further in the comments.

My original for-loop (present in the question)

for j in range(0, m):
    for i in range(0, n):
        if target[i-1] == source[j-1]:
            D[i][j] = min(D[i-1][j-1], D[i-1][j]+1, D[i][j-1]+1)

        else:
            D[i][j] = min(D[i-1][j-1]+1, D[i-1][j]+1, D[i][j-1]+1)

And below is the new for-loop, whose output is correct after testing:

if target[i-1] == source[j-1]:
    D[i][j] = D[i-1][j-1]

else:
    D[i][j] = 1 + min(D[i][j-1], D[i-1][j], D[i-1][j-1])

I would appreciate it if someone could further explain how this works, as I still only have a superficial understanding of the new code.
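One possible reading of why the new loop works, offered as a sketch rather than a definitive explanation: once row 0 and column 0 are initialised, the two recurrences appear to produce the same table, because in a correctly filled table neighbouring cells differ by at most 1, so D[i-1][j-1] is never larger than D[i-1][j]+1 or D[i][j-1]+1. That would suggest the original loop went wrong mainly because it had no base cases and started both loops at 0, so i-1 and j-1 wrapped around to index -1. The helper names old_rule, new_rule and fill_table are hypothetical and exist only for this comparison.

```python
def old_rule(D, i, j, match):
    # Recurrence from the original loop: the +1 is applied per neighbour.
    diag = D[i-1][j-1] if match else D[i-1][j-1] + 1
    return min(diag, D[i-1][j] + 1, D[i][j-1] + 1)


def new_rule(D, i, j, match):
    # Recurrence from the new loop: copy the diagonal on a match,
    # otherwise add 1 to the cheapest of the three neighbours.
    if match:
        return D[i-1][j-1]
    return 1 + min(D[i][j-1], D[i-1][j], D[i-1][j-1])


def fill_table(target, source, rule):
    # Fill the whole table with the base cases in place,
    # using the given recurrence for the interior cells.
    m, n = len(target) + 1, len(source) + 1
    D = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            if i == 0:
                D[i][j] = j
            elif j == 0:
                D[i][j] = i
            else:
                D[i][j] = rule(D, i, j, target[i-1] == source[j-1])
    return D


pairs = [("distance", "editing"), ("abc", "ebcd"), ("", "abc"), ("kitten", "sitting")]
for t, s in pairs:
    # Both recurrences should give identical tables once the base cases exist.
    assert fill_table(t, s, old_rule) == fill_table(t, s, new_rule)

print(fill_table("distance", "editing", new_rule)[-1][-1])
```

If the assertions pass for your own inputs as well, the difference between the two versions lies in the initialisation and loop ranges rather than in where the +1 is added.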

Final code:

def edit_distance(target, source):

    m = len(target)+1
    n = len(source)+1
    D = [[0 for x in range(n)] for x in range(m)]

    for i in range(m):
        for j in range(n):

            if i == 0:
                D[i][j] = j
            elif j == 0:
                D[i][j] = i

            elif target[i-1] == source[j-1]:
                D[i][j] = D[i-1][j-1]
            else:
                D[i][j] = 1 + min(D[i][j-1], D[i-1][j], D[i-1][j-1])

    return D[m-1][n-1]

print(edit_distance("distance", "editing"))
# output = 5, which is correct

Levenshtein edit-distance between rows and columns

edit_distance expects 2 strings, so you have to iterate over the indexes. One option is to apply a lambda that does that on df:

df.apply(lambda col: [nltk.edit_distance(col.name, i) for i in col.index])

But, instead of filling in a DataFrame, I think it's simpler to first create a dictionary with the values; then build a DataFrame as follows:

df = pd.DataFrame({j: {i: nltk.edit_distance(i,j) for i in rows} for j in columns})

Output:

          Band  Tree  Foot
Hand         1     4     4
Foot         4     4     0
Shoulder     7     7     7
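To make the dictionary-first approach concrete without any third-party dependency, here is a minimal sketch. The levenshtein helper is a hypothetical stand-in for nltk.edit_distance with its default costs, and the rows and columns lists are assumed to hold the labels seen in the output above.

```python
def levenshtein(a, b):
    # Hypothetical stand-in for nltk.edit_distance with default costs.
    # Only the previous row of the table is kept in memory.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


rows = ["Hand", "Foot", "Shoulder"]   # assumed row labels
columns = ["Band", "Tree", "Foot"]    # assumed column labels

# Outer keys are column labels, inner keys are row labels.
table = {j: {i: levenshtein(i, j) for i in rows} for j in columns}

for i in rows:
    print(i, [table[j][i] for j in columns])
```

Passing a nested dictionary of this shape to pd.DataFrame uses the outer keys as columns and the inner keys as the index, which is the layout shown above.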

Edit Distance w/ operational weights in Python

Your base costs for i = 0 and j = 0 are j and i respectively, which are not multiples of 5. You should multiply them by 20, since leaving letters unused is essentially the same as deleting or inserting them for the purposes of edit distance.
So you should try something like this:

str1="algorithms"
str2="alligator"
m=len(str1)
n=len(str2)

def editdistance(str1, str2, m, n):
    table=[[0 for x in range(n+1)] for x in range(m+1)]

    for i in range(m+1):
        for j in range(n+1):

            if i==0:
                table[i][j]=j*20

            elif j==0:
                table[i][j]=i*20

            elif str1[i-1]==str2[j-1]:
                table[i][j]=table[i-1][j-1]

            else:
                table[i][j] = min(20+table[i][j-1], 20+table[i-1][j], 5+table[i-1][j-1])


    return table[m][n]

print(editdistance(str1, str2, m, n))
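The same idea can be written with the weights as parameters, which makes it harder for the base cases and the recurrence to drift apart. This is only a sketch: the function name and the parameters ins_cost, del_cost and sub_cost are hypothetical, and their defaults are simply the 20 and 5 used in the code above.

```python
def weighted_edit_distance(source, target, ins_cost=20, del_cost=20, sub_cost=5):
    m, n = len(source), len(target)
    table = [[0] * (n + 1) for _ in range(m + 1)]

    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0:
                # Empty source: insert the first j characters of target.
                table[i][j] = j * ins_cost
            elif j == 0:
                # Empty target: delete the first i characters of source.
                table[i][j] = i * del_cost
            elif source[i - 1] == target[j - 1]:
                # Matching characters cost nothing.
                table[i][j] = table[i - 1][j - 1]
            else:
                table[i][j] = min(ins_cost + table[i][j - 1],
                                  del_cost + table[i - 1][j],
                                  sub_cost + table[i - 1][j - 1])
    return table[m][n]


print(weighted_edit_distance("a", "b"))    # one substitution
print(weighted_edit_distance("", "abc"))   # three insertions
print(weighted_edit_distance("algorithms", "alligator"))
```

With the default weights, the first call costs one substitution (5) and the second costs three insertions (60). Setting all three weights to 1 should reproduce the plain Levenshtein distance.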

Calculating Minimum Edit Distance for unequal strings python

For strings of different lengths, the cost and backtrace indices don't match.

Minimum edit distance with a substitution cost of 2 can be implemented by updating a single two-dimensional NumPy array with the cost at each step.


As per the algorithm, the code below will do the job.

import numpy as np

def minimumEditDistance(first, second):

    # Create a NumPy array of zeros with one extra row and column for the empty prefixes
    matrix = np.zeros((len(first)+1, len(second)+1), dtype=int)

    # Compare every prefix of the first string with every prefix of the second
    # and fill the corresponding cell of the matrix (row, column)
    for i in range(len(first)+1):
        for j in range(len(second)+1):

            # Boundary case: the first string is empty, so the cost is j insertions
            if i == 0:
                matrix[i][j] = j
            # Boundary case: the second string is empty, so the cost is i deletions
            elif j == 0:
                matrix[i][j] = i
            else:
                # Costs: insertion = 1, deletion = 1, substitution = 2
                matrix[i][j] = min(matrix[i][j-1] + 1,
                                   matrix[i-1][j] + 1,
                                   matrix[i-1][j-1] + 2 if first[i-1] != second[j-1] else matrix[i-1][j-1] + 0)
    return matrix[len(first)][len(second)]  # The final cost is in the bottom-right cell

Output:

>>> print(minimumEditDistance('levenshtein','levels'))
7
>>> print(minimumEditDistance('levenshtein','levenshtein'))
0
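As a sanity check on these numbers: with insertion = 1, deletion = 1 and substitution = 2, a substitution never costs less than a deletion followed by an insertion, so the distance should equal len(first) + len(second) minus twice the length of the longest common subsequence. The sketch below uses plain Python lists, and the helper names lcs_length and distance_via_lcs are hypothetical.

```python
def lcs_length(a, b):
    # Length of the longest common subsequence, computed one row at a time.
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def distance_via_lcs(first, second):
    # Every character outside the common subsequence is deleted or inserted.
    return len(first) + len(second) - 2 * lcs_length(first, second)


print(distance_via_lcs('levenshtein', 'levels'))
print(distance_via_lcs('levenshtein', 'levenshtein'))
```

For 'levenshtein' and 'levels' the common subsequence is 'leves' (length 5), giving 11 + 6 - 10 = 7, which agrees with the output above.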

Is there a way to perform edit distance between two string columns in a dataframe?

Using Levenshtein distance from the textdistance module:

from textdistance import levenshtein

# Merge the two columns in one dataframe
df = dataset1[['SAX']].merge(dataset2[['SAX']], left_index=True, right_index=True, suffixes=('_1', '_2'))

# Compute the Levenshtein distance
df['distance'] = df.apply(lambda x: levenshtein.distance(x['SAX_1'], x['SAX_2']), axis=1)

