Ruby Compare Two Strings Similarity Percentage

How to compare two strings and find the percent of similarity?

One way to solve this is to go out from the Levenshtein distance, which will tell you how many operations needed to convert a string to another.

In Ruby, there is a gem you can use for this, called Levenshtein.

To convert the number of operations needed to a percentage, you can go out from 100% is having to write the string all the way from the beginning and have no similarities. That would be the length of the longest string. Another option would be to use the average length of the strings, but in this example, I will use the longest.

Here is a method using the levenshtein gem and getting a percentage of how close they are:

require 'levenshtein'
def distance_percent(first,second)
max_distance = [first,second].max_by(&:length).length
distance = Levenshtein.distance(first,second)
(100.0 / max_distance * distance).round.to_s + "%"
end

Here are some examples of what that method would return for different strings.

string_one = "1234567890"
string_two = "1234567890"
puts distance_percent(string_one, string_two)

# => 0%

This returns 0% since the distance between them is 0.

string_one = "1234512345"
string_two = "6789067890"
puts distance_percent(string_one, string_two)

# => 100%

This will return 100% since there are none of the same characters.

string_one = "This is a string"
string_two = "This is another string"
puts distance_percent(string_one, string_two)

# => 27%

This will return 27% since 27 percent of the string is different from each other.

Ruby compare two strings similarity percentage

I think your question could do with some clarifications, but here's something quick and dirty (calculating as percentage of the longer string as per your clarification above):

def string_difference_percent(a, b)
longer = [a.size, b.size].max
same = a.each_char.zip(b.each_char).count { |a,b| a == b }
(longer - same) / a.size.to_f
end

I'm still not sure how much sense this percent difference you are looking for makes, but this should get you started at least.

It's a bit like Levensthein distance, in that it compares the strings character by character. So if two names differ only by the middle name, they'll actually be very different.

Check to see if two strings are very similar (similar characters, patterns, etc.)


Fuzzy Matching with Damerau-Levenshtein Distance

Any kind of fuzzy matching will be somewhat dependent on how you choose to look at your data. For something like this, you can look at one of the many variants of the Levenshtein distance such as Damerau-Levenshtein. You can adjust MIN_SIMILARITY_PERCENT to tune the similarity index, which is calculated using the edit distance as a percentage of the characters found in the longest word in the pair.

require 'damerau-levenshtein'

class SimilarityIndex
MIN_SIMILARITY_PERCENT = 70.0

attr_reader :similarity_idx, :words

def initialize word1, word2
@words = word1, word2
similar?
end

def edit_distance
DamerauLevenshtein.distance *@words
end

def longest_word_length
@words.max_by(&:length).size
end

def similar?
e = edit_distance
l = longest_word_length.to_f
@similarity_idx = ((1 - (e/l)) * 100).round 2
@similarity_idx >= MIN_SIMILARITY_PERCENT
end
end

You can validate this with some test data. For example:

word_pairs = %w[
www.domain.com
domain.com

www.example.com
foobarbaz.example.com
]

word_pairs.each_slice(2).map do |word1, word2|
s = SimilarityIndex.new word1, word2
{ words: s.words, similarity_idx: s.similarity_idx, similar?: s.similar? }
end

This test data generates the following results:

[{:words=>["www.domain.com", "domain.com"],
:similarity_idx=>71.43,
:similar?=>true},
{:words=>["www.example.com", "foobarbaz.example.com"],
:similarity_idx=>57.14,
:similar?=>false}]

How to do advanced string comparison in Ruby?

You may want to try the Levenshtein string distance algorithm for this. http://rubygems.org/gems/text has an implementation of this along with other helpful string comparison utils.

Compare two strings and match them regardless of order


require 'set'

def is_present?(array, register)
array.map { |s| s.chars.to_set }.include?(register.chars.to_set)
end

array = ["012", "345", "678", "048", "246", "036", "147", "258"]
["012", "210", "102", "246", "174", "014", "247"].each { |s|
puts "#{s} => #{is_present?(array, s)}" }
012 => true
210 => true
102 => true
246 => true
174 => true
014 => false
247 => false

If (unlike the OP's example) leading zeroes may be omitted in the value of register or the elements of array (e.g., s = "12" rather than s = "012"), it would be prudent to pad the string with zeroes to make them all three characters in length:

array.map { |s| s.rjust(3, "0").chars.to_set }.
include?(register.rjust(3, "0").chars.to_set)

Grouping bulk text as into group based on the given similarity percentage

You can do it by using just Ruby plus one of the listed gems.

I chose fuzzy-string-match because I liked the name

Here's how you use the gem:

require 'fuzzystringmatch'

# Create the matcher
jarow = FuzzyStringMatch::JaroWinkler.create( :native )

# Get the distance
jarow.getDistance( "jones", "johnson" )
# => 0.8323809523809523

# Round it
jarow.getDistance( "jones", "johnson" ).round(2)
# => 0.83

Since you're getting a float, you can define the precision you're looking for using the round method.

Now, to group similar results, you can use the group_by methos found on the Enumerable module.

You pass it a block and group_by will iterate over the collection. For each iteration, you return the value you're trying to group for (in this case, the distance) and it'll return a hash with the distances as keys and arrays of strings that matched togehter as values.

require 'fuzzystringmatch'

jarow = FuzzyStringMatch::JaroWinkler.create( :native )

target = "jones"
precision = 2
candidates = [ "Jessica Jones", "Jones", "Johnson", "thompson", "john", "thompsen" ]

distances = candidates.group_by { |candidate|
jarow.getDistance( target, candidate ).round(precision)
}

distances
# => {0.52=>["Jessica Jones"],
# 0.87=>["Jones"],
# 0.68=>["Johnson"],
# 0.55=>["thompson", "thompsen"],
# 0.83=>["john"]}

I hope this helps

Measure the distance between two strings with Ruby?

I found this for you:

def levenshtein_distance(s, t)
m = s.length
n = t.length
return m if n == 0
return n if m == 0
d = Array.new(m+1) {Array.new(n+1)}

(0..m).each {|i| d[i][0] = i}
(0..n).each {|j| d[0][j] = j}
(1..n).each do |j|
(1..m).each do |i|
d[i][j] = if s[i-1] == t[j-1] # adjust index into string
d[i-1][j-1] # no operation required
else
[ d[i-1][j]+1, # deletion
d[i][j-1]+1, # insertion
d[i-1][j-1]+1, # substitution
].min
end
end
end
d[m][n]
end

[ ['fire','water'], ['amazing','horse'], ["bamerindos", "giromba"] ].each do |s,t|
puts "levenshtein_distance('#{s}', '#{t}') = #{levenshtein_distance(s, t)}"
end

That's awesome output: =)

levenshtein_distance('fire', 'water') = 4
levenshtein_distance('amazing', 'horse') = 7
levenshtein_distance('bamerindos', 'giromba') = 9

Source: http://rosettacode.org/wiki/Levenshtein_distance#Ruby

Ruby: How does one test for similarity between two blocks of text?

I believe you're looking for is the Longest Common Substring problem, i.e. the problem of finding, given two strings, of the longest substring they both have in common. The link is to the Wikipedia page which will help you understand the domain and contains a pseudocode example of an algorithm that runs in O(nm) time.

Further, Wikibooks' Algorithm Implementation book has an implementation in Ruby. It includes an lcs_size method that may be all you need. In short, if lcs_size(text1, text2) returns 4 that means text1 and text2 have very little consecutive text in common, probably just a single word, but if it returns, say, 40, they might have an entire sentence in common.

Hope that's helpful!

What is an efficient way to measure similarity between two strings? (Levenshtein Distance makes stack too deep)

Consider a non-recursive version to avoid the excessive call stack overhead. Seth Schroeder has an iterative implementation in Ruby which uses multi-dimensional arrays instead; it appears to be related to the dynamic programming approach for Levenshtein distance (as outlined in the pseudocode for the Wikipedia article). Seth's ruby code is reproduced below:

def levenshtein(s1, s2)
d = {}
(0..s1.size).each do |row|
d[[row, 0]] = row
end
(0..s2.size).each do |col|
d[[0, col]] = col
end
(1..s1.size).each do |i|
(1..s2.size).each do |j|
cost = 0
if (s1[i-1] != s2[j-1])
cost = 1
end
d[[i, j]] = [d[[i - 1, j]] + 1,
d[[i, j - 1]] + 1,
d[[i - 1, j - 1]] + cost
].min
next unless @@damerau
if (i > 1 and j > 1 and s1[i-1] == s2[j-2] and s1[i-2] == s2[j-1])
d[[i, j]] = [d[[i,j]],
d[[i-2, j-2]] + cost
].min
end
end
end
return d[[s1.size, s2.size]]
end


Related Topics



Leave a reply



Submit