How to Do Fuzzy Substring Matching in Ruby

How to match a string against another string that is almost the same (fuzzy matching)

After reading your question, I think you want something like the Levenshtein distance; as you stated in your comment, you can use it with Postgres.

Quoting its documentation here:
https://www.postgresql.org/docs/9.1/static/fuzzystrmatch.html

test=# SELECT levenshtein('GUMBO', 'GAMBOL');
levenshtein
-------------
2
(1 row)

test=# SELECT levenshtein('GUMBO', 'GAMBOL', 2,1,1);
levenshtein
-------------
3
(1 row)

test=# SELECT levenshtein_less_equal('extensive', 'exhaustive',2);
levenshtein_less_equal
------------------------
3
(1 row)

test=# SELECT levenshtein_less_equal('extensive', 'exhaustive',4);
levenshtein_less_equal
------------------------
4
(1 row)

Then you can build your SQL query with your desired distance:

SELECT *
FROM YourTable
WHERE levenshtein(string, 'AE18BX21') <= 2;
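
If you are querying from Rails, a thin wrapper over that SQL is usually enough. Below is a minimal sketch; the Part model and serial_number column are hypothetical placeholders, and it assumes the fuzzystrmatch extension is enabled in your database so levenshtein() is available:

# Sketch: hypothetical Part model with a serial_number column.
# Requires the Postgres fuzzystrmatch extension (CREATE EXTENSION fuzzystrmatch;).
class Part < ActiveRecord::Base
  # Returns records whose serial_number is within max_distance edits of term.
  def self.fuzzy_by_serial(term, max_distance = 2)
    where("levenshtein(serial_number, ?) <= ?", term, max_distance)
  end
end

# Part.fuzzy_by_serial('AE18BX21', 2)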

Fuzzy String Matching with Rails (Tire) and ElasticSearch

Try creating custom analyzers to get stemming and other features.
Check out my example (it also uses Mongoid and attachments; ignore those parts if you don't need them):

class Document
  include Mongoid::Document
  include Mongoid::Timestamps
  include Tire::Model::Search
  include Tire::Model::Callbacks

  field :filename, type: String
  field :md5, type: String
  field :tags, type: String
  field :size, type: String

  index({md5: 1}, {unique: true})
  validates_uniqueness_of :md5

  DEFAULT_PAGE_SIZE = 10

  # Custom analysis: edge n-grams plus a word delimiter, so partial and
  # slightly mistyped terms still match at search time.
  settings :analysis => {
    :filter => {
      :ngram_filter => {
        :type => "edgeNGram",
        :min_gram => 2,
        :max_gram => 12
      },
      :custom_word_delimiter => {
        :type => "word_delimiter",
        :preserve_original => "true",
        :catenate_all => "true"
      }
    },
    :analyzer => {
      :index_ngram_analyzer => {
        :type => "custom",
        :tokenizer => "standard",
        :filter => ["lowercase", "ngram_filter", "asciifolding", "custom_word_delimiter"]
      },
      :search_ngram_analyzer => {
        :type => "custom",
        :tokenizer => "standard",
        :filter => ["standard", "lowercase", "ngram_filter", "custom_word_delimiter"]
      },
      :suggestions => {
        :tokenizer => "standard",
        :filter => ["suggestions_shingle"]
      }
    }
  } do
    mapping {
      indexes :id, index: :not_analyzed
      indexes :filename, :type => 'string', :store => 'yes', :boost => 100,
              :search_analyzer => :search_ngram_analyzer, :index_analyzer => :index_ngram_analyzer
      indexes :tags, :type => 'string', :store => 'yes',
              :search_analyzer => :search_ngram_analyzer, :index_analyzer => :index_ngram_analyzer
      indexes :attachment, :type => 'attachment',
              :fields => {
                :content_type => {:store => 'yes'},
                :author => {:store => 'yes', :analyzer => 'keyword'},
                :title => {:store => 'yes'},
                :attachment => {:term_vector => 'with_positions_offsets', :boost => 90, :store => 'yes',
                                :search_analyzer => :search_ngram_analyzer, :index_analyzer => :index_ngram_analyzer},
                :date => {:store => 'yes'}
              }
    }
  end

  def to_indexed_json
    self.to_json(:methods => [:attachment])
  end

  # Reads the file from disk and Base64-encodes it for the attachment mapping.
  def attachment
    path_to_file = "#{Rails.application.config.document_library}#{path}/#{filename}"
    Base64.encode64(open(path_to_file) { |file| file.read })
  end

  def self.search(query, options)
    tire.search do
      query { string "#{query}", :default_operator => :AND, :default_field => 'attachment', :fields => ['filename', 'attachment', 'tags'] }
      highlight :attachment
      page = (options[:page] || 1).to_i
      search_size = options[:per_page] || DEFAULT_PAGE_SIZE
      from (page - 1) * search_size
      size search_size
      sort { by :_score, :desc }
      if options[:facet]
        filter :terms, :tags => [options[:facet]]
        facet 'global-tags', :global => true do
          terms :tags
        end
        facet 'current-tags' do
          terms :tags
        end
      end
    end
  end
end
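
Calling that class method is then straightforward. A small usage sketch; the option keys are the ones read inside self.search, results is Tire's accessor on the returned search object, and the "invoices" tag is a placeholder:

# Sketch: the options hash keys match what Document.search reads.
search = Document.search("annual reprt 2012", { page: 1, per_page: 20, facet: "invoices" })
search.results.each do |doc|
  puts doc.filename
end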

Hope it helps,

Fast fuzzy/approximate search in dictionary of strings in Ruby

I wrote a pair of gems, fuzzily and blurrily, which do trigram-based fuzzy matching.
Given your (low) volume of data, Fuzzily will be easier to integrate and about as fast; with either you'd get answers within 5-10 ms on modern hardware.

Given both are trigram-based (which is indexable), not edit-distance-based (which isn't), you'd probably have to do this in two passes:

  • first ask either gem for a set of best matches trigram-wise
  • then compare the results with your input string using Levenshtein
  • and return the minimum for that measure.

In Ruby (as you asked), using Fuzzily plus the Text gem, obtaining the records within the edit distance threshold would look like:

MyRecords.find_by_fuzzy_name(input_string).select { |result|
  Text::Levenshtein.distance(input_string, result.name) < my_distance_threshold
}

This performs a handful of well-optimized database queries and a few Levenshtein comparisons in Ruby.
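
For context, the find_by_fuzzy_name finder above comes from declaring the attribute as fuzzily searchable on the model. A rough sketch of that declaration is below; MyRecords and :name mirror the snippet above, and the gem also needs its trigram table and model set up per its README:

# Sketch: model-side declaration that provides find_by_fuzzy_name.
class MyRecords < ActiveRecord::Base
  fuzzily_searchable :name
end

# MyRecords.find_by_fuzzy_name('almost the same naem')
# => best trigram matches, most similar first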

Caveats:

  • if the "minimal" edit distance you're looking for is high, you'll still be doing lots of Levenshtein comparisons.
  • using trigrams assumes your input text is Latin text or close to it (European languages, basically).
  • there probably are edge cases, since nothing guarantees that "number of matching trigrams" is a great general approximation of "edit distance".

How to do advanced string comparison in Ruby?

You may want to try the Levenshtein string distance algorithm for this. http://rubygems.org/gems/text has an implementation of it, along with other helpful string comparison utilities.
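
For a quick illustration, picking the closest match from a list of candidates with the text gem could look like this (a small sketch; the candidate words are made up):

require 'text'

candidates = %w[receipt recipe recite respite]
input = 'reciept'

# Pick the candidate with the smallest edit distance to the input.
best = candidates.min_by { |word| Text::Levenshtein.distance(input, word) }
puts best  # => "receipt"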


