How to compare two strings and find the percent of similarity?
One way to solve this is to go out from the Levenshtein distance, which will tell you how many operations needed to convert a string to another.
In Ruby, there is a gem you can use for this, called Levenshtein
.
To convert the number of operations needed to a percentage, you can go out from 100% is having to write the string all the way from the beginning and have no similarities. That would be the length of the longest string. Another option would be to use the average length of the strings, but in this example, I will use the longest.
Here is a method using the levenshtein
gem and getting a percentage of how close they are:
require 'levenshtein'
def distance_percent(first,second)
max_distance = [first,second].max_by(&:length).length
distance = Levenshtein.distance(first,second)
(100.0 / max_distance * distance).round.to_s + "%"
end
Here are some examples of what that method would return for different strings.
string_one = "1234567890"
string_two = "1234567890"
puts distance_percent(string_one, string_two)
# => 0%
This returns 0%
since the distance between them is 0.
string_one = "1234512345"
string_two = "6789067890"
puts distance_percent(string_one, string_two)
# => 100%
This will return 100%
since there are none of the same characters.
string_one = "This is a string"
string_two = "This is another string"
puts distance_percent(string_one, string_two)
# => 27%
This will return 27%
since 27 percent of the string is different from each other.
Ruby compare two strings similarity percentage
I think your question could do with some clarifications, but here's something quick and dirty (calculating as percentage of the longer string as per your clarification above):
def string_difference_percent(a, b)
longer = [a.size, b.size].max
same = a.each_char.zip(b.each_char).count { |a,b| a == b }
(longer - same) / a.size.to_f
end
I'm still not sure how much sense this percent difference you are looking for makes, but this should get you started at least.
It's a bit like Levensthein distance, in that it compares the strings character by character. So if two names differ only by the middle name, they'll actually be very different.
Check to see if two strings are very similar (similar characters, patterns, etc.)
Fuzzy Matching with Damerau-Levenshtein Distance
Any kind of fuzzy matching will be somewhat dependent on how you choose to look at your data. For something like this, you can look at one of the many variants of the Levenshtein distance such as Damerau-Levenshtein. You can adjust MIN_SIMILARITY_PERCENT to tune the similarity index, which is calculated using the edit distance as a percentage of the characters found in the longest word in the pair.
require 'damerau-levenshtein'
class SimilarityIndex
MIN_SIMILARITY_PERCENT = 70.0
attr_reader :similarity_idx, :words
def initialize word1, word2
@words = word1, word2
similar?
end
def edit_distance
DamerauLevenshtein.distance *@words
end
def longest_word_length
@words.max_by(&:length).size
end
def similar?
e = edit_distance
l = longest_word_length.to_f
@similarity_idx = ((1 - (e/l)) * 100).round 2
@similarity_idx >= MIN_SIMILARITY_PERCENT
end
end
You can validate this with some test data. For example:
word_pairs = %w[
www.domain.com
domain.com
www.example.com
foobarbaz.example.com
]
word_pairs.each_slice(2).map do |word1, word2|
s = SimilarityIndex.new word1, word2
{ words: s.words, similarity_idx: s.similarity_idx, similar?: s.similar? }
end
This test data generates the following results:
[{:words=>["www.domain.com", "domain.com"],
:similarity_idx=>71.43,
:similar?=>true},
{:words=>["www.example.com", "foobarbaz.example.com"],
:similarity_idx=>57.14,
:similar?=>false}]
How to do advanced string comparison in Ruby?
You may want to try the Levenshtein string distance algorithm for this. http://rubygems.org/gems/text has an implementation of this along with other helpful string comparison utils.
Compare two strings and match them regardless of order
require 'set'
def is_present?(array, register)
array.map { |s| s.chars.to_set }.include?(register.chars.to_set)
end
array = ["012", "345", "678", "048", "246", "036", "147", "258"]
["012", "210", "102", "246", "174", "014", "247"].each { |s|
puts "#{s} => #{is_present?(array, s)}" }
012 => true
210 => true
102 => true
246 => true
174 => true
014 => false
247 => false
If (unlike the OP's example) leading zeroes may be omitted in the value of register
or the elements of array
(e.g., s = "12"
rather than s = "012"
), it would be prudent to pad the string with zeroes to make them all three characters in length:
array.map { |s| s.rjust(3, "0").chars.to_set }.
include?(register.rjust(3, "0").chars.to_set)
Grouping bulk text as into group based on the given similarity percentage
You can do it by using just Ruby plus one of the listed gems.
I chose fuzzy-string-match
because I liked the name
Here's how you use the gem:
require 'fuzzystringmatch'
# Create the matcher
jarow = FuzzyStringMatch::JaroWinkler.create( :native )
# Get the distance
jarow.getDistance( "jones", "johnson" )
# => 0.8323809523809523
# Round it
jarow.getDistance( "jones", "johnson" ).round(2)
# => 0.83
Since you're getting a float, you can define the precision you're looking for using the round
method.
Now, to group similar results, you can use the group_by
methos found on the Enumerable
module.
You pass it a block and group_by
will iterate over the collection. For each iteration, you return the value you're trying to group for (in this case, the distance) and it'll return a hash with the distances as keys and arrays of strings that matched togehter as values.
require 'fuzzystringmatch'
jarow = FuzzyStringMatch::JaroWinkler.create( :native )
target = "jones"
precision = 2
candidates = [ "Jessica Jones", "Jones", "Johnson", "thompson", "john", "thompsen" ]
distances = candidates.group_by { |candidate|
jarow.getDistance( target, candidate ).round(precision)
}
distances
# => {0.52=>["Jessica Jones"],
# 0.87=>["Jones"],
# 0.68=>["Johnson"],
# 0.55=>["thompson", "thompsen"],
# 0.83=>["john"]}
I hope this helps
Measure the distance between two strings with Ruby?
I found this for you:
def levenshtein_distance(s, t)
m = s.length
n = t.length
return m if n == 0
return n if m == 0
d = Array.new(m+1) {Array.new(n+1)}
(0..m).each {|i| d[i][0] = i}
(0..n).each {|j| d[0][j] = j}
(1..n).each do |j|
(1..m).each do |i|
d[i][j] = if s[i-1] == t[j-1] # adjust index into string
d[i-1][j-1] # no operation required
else
[ d[i-1][j]+1, # deletion
d[i][j-1]+1, # insertion
d[i-1][j-1]+1, # substitution
].min
end
end
end
d[m][n]
end
[ ['fire','water'], ['amazing','horse'], ["bamerindos", "giromba"] ].each do |s,t|
puts "levenshtein_distance('#{s}', '#{t}') = #{levenshtein_distance(s, t)}"
end
That's awesome output: =)
levenshtein_distance('fire', 'water') = 4
levenshtein_distance('amazing', 'horse') = 7
levenshtein_distance('bamerindos', 'giromba') = 9
Source: http://rosettacode.org/wiki/Levenshtein_distance#Ruby
Ruby: How does one test for similarity between two blocks of text?
I believe you're looking for is the Longest Common Substring problem, i.e. the problem of finding, given two strings, of the longest substring they both have in common. The link is to the Wikipedia page which will help you understand the domain and contains a pseudocode example of an algorithm that runs in O(nm) time.
Further, Wikibooks' Algorithm Implementation book has an implementation in Ruby. It includes an lcs_size
method that may be all you need. In short, if lcs_size(text1, text2) returns 4
that means text1
and text2
have very little consecutive text in common, probably just a single word, but if it returns, say, 40
, they might have an entire sentence in common.
Hope that's helpful!
What is an efficient way to measure similarity between two strings? (Levenshtein Distance makes stack too deep)
Consider a non-recursive version to avoid the excessive call stack overhead. Seth Schroeder has an iterative implementation in Ruby which uses multi-dimensional arrays instead; it appears to be related to the dynamic programming approach for Levenshtein distance (as outlined in the pseudocode for the Wikipedia article). Seth's ruby code is reproduced below:
def levenshtein(s1, s2)
d = {}
(0..s1.size).each do |row|
d[[row, 0]] = row
end
(0..s2.size).each do |col|
d[[0, col]] = col
end
(1..s1.size).each do |i|
(1..s2.size).each do |j|
cost = 0
if (s1[i-1] != s2[j-1])
cost = 1
end
d[[i, j]] = [d[[i - 1, j]] + 1,
d[[i, j - 1]] + 1,
d[[i - 1, j - 1]] + cost
].min
next unless @@damerau
if (i > 1 and j > 1 and s1[i-1] == s2[j-2] and s1[i-2] == s2[j-1])
d[[i, j]] = [d[[i,j]],
d[[i-2, j-2]] + cost
].min
end
end
end
return d[[s1.size, s2.size]]
end
Related Topics
How to Loop Over a Hash of Hashes
Why Bundle Install Is Installing Gems in Vendor/Bundle
Which Algorithm Does Ruby's Sort Method Use
Ruby: Character to Ascii from a String
Directory Layout for Pure Ruby Project
Parse CSV File with Header Fields as Attributes for Each Row
What's the Difference Between "Includes" and "Preload" in an Activerecord Query
Change the Context/Binding Inside a Block in Ruby
When to Use Curly Braces VS Parenthesis in Expect Rspec Method
Ruby Invalid Byte Sequence in Utf-8
How to Generate Rdoc for (All Of) Rails
Given the Session Key and Secret, How to Decrypt Rails Cookies
How to Install Nokogiri Ruby Gem with Mkmf.Log Saying Libiconv Not Found