How to Do Fuzzy Substring Matching in Ruby

How to match a string against another string that is almost the same (fuzzy matching)

After reading your question, I think you want something like the Levenshtein distance; as you stated in your comment, you can use it with Postgres.

Quoting its documentation here:
https://www.postgresql.org/docs/9.1/static/fuzzystrmatch.html

test=# SELECT levenshtein('GUMBO', 'GAMBOL');
levenshtein
-------------
2
(1 row)

test=# SELECT levenshtein('GUMBO', 'GAMBOL', 2,1,1);
levenshtein
-------------
3
(1 row)

test=# SELECT levenshtein_less_equal('extensive', 'exhaustive',2);
levenshtein_less_equal
------------------------
3
(1 row)

test=# SELECT levenshtein_less_equal('extensive', 'exhaustive',4);
levenshtein_less_equal
------------------------
4
(1 row)

Then you can build your SQL query with your desired distance:

SELECT *
FROM YourTable
WHERE levenshtein(string, 'AE18BX21') <= 2;
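
If you are querying from Rails, a thin wrapper over that SQL is usually enough. Below is a minimal sketch; the Part model and serial_number column are hypothetical placeholders, and it assumes the fuzzystrmatch extension is enabled in your database so levenshtein() is available:

# Sketch: hypothetical Part model with a serial_number column.
# Requires the Postgres fuzzystrmatch extension (CREATE EXTENSION fuzzystrmatch;).
class Part < ActiveRecord::Base
  # Returns records whose serial_number is within max_distance edits of term.
  def self.fuzzy_by_serial(term, max_distance = 2)
    where("levenshtein(serial_number, ?) <= ?", term, max_distance)
  end
end

# Part.fuzzy_by_serial('AE18BX21', 2)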

Fuzzy String Matching with Rails (Tire) and ElasticSearch

Try creating custom analyzers to get stemming and other features.
Check out my example (it also uses Mongoid and attachments; ignore those parts if you don't need them):

class Document
  include Mongoid::Document
  include Mongoid::Timestamps
  include Tire::Model::Search
  include Tire::Model::Callbacks

  field :filename, type: String
  field :md5, type: String
  field :tags, type: String
  field :size, type: String

  index({md5: 1}, {unique: true})
  validates_uniqueness_of :md5

  DEFAULT_PAGE_SIZE = 10

  # Custom analysis: edge n-grams plus a word delimiter, so partial and
  # slightly mistyped terms still match at search time.
  settings :analysis => {
    :filter => {
      :ngram_filter => {
        :type => "edgeNGram",
        :min_gram => 2,
        :max_gram => 12
      },
      :custom_word_delimiter => {
        :type => "word_delimiter",
        :preserve_original => "true",
        :catenate_all => "true"
      }
    },
    :analyzer => {
      :index_ngram_analyzer => {
        :type => "custom",
        :tokenizer => "standard",
        :filter => ["lowercase", "ngram_filter", "asciifolding", "custom_word_delimiter"]
      },
      :search_ngram_analyzer => {
        :type => "custom",
        :tokenizer => "standard",
        :filter => ["standard", "lowercase", "ngram_filter", "custom_word_delimiter"]
      },
      :suggestions => {
        :tokenizer => "standard",
        :filter => ["suggestions_shingle"]
      }
    }
  } do
    mapping {
      indexes :id, index: :not_analyzed
      indexes :filename, :type => 'string', :store => 'yes', :boost => 100,
              :search_analyzer => :search_ngram_analyzer, :index_analyzer => :index_ngram_analyzer
      indexes :tags, :type => 'string', :store => 'yes',
              :search_analyzer => :search_ngram_analyzer, :index_analyzer => :index_ngram_analyzer
      indexes :attachment, :type => 'attachment',
              :fields => {
                :content_type => {:store => 'yes'},
                :author => {:store => 'yes', :analyzer => 'keyword'},
                :title => {:store => 'yes'},
                :attachment => {:term_vector => 'with_positions_offsets', :boost => 90, :store => 'yes',
                                :search_analyzer => :search_ngram_analyzer, :index_analyzer => :index_ngram_analyzer},
                :date => {:store => 'yes'}
              }
    }
  end

  def to_indexed_json
    self.to_json(:methods => [:attachment])
  end

  # Reads the file from disk and Base64-encodes it for the attachment mapping.
  def attachment
    path_to_file = "#{Rails.application.config.document_library}#{path}/#{filename}"
    Base64.encode64(open(path_to_file) { |file| file.read })
  end

  def self.search(query, options)
    tire.search do
      query { string "#{query}", :default_operator => :AND, :default_field => 'attachment', :fields => ['filename', 'attachment', 'tags'] }
      highlight :attachment
      page = (options[:page] || 1).to_i
      search_size = options[:per_page] || DEFAULT_PAGE_SIZE
      from (page - 1) * search_size
      size search_size
      sort { by :_score, :desc }
      if options[:facet]
        filter :terms, :tags => [options[:facet]]
        facet 'global-tags', :global => true do
          terms :tags
        end
        facet 'current-tags' do
          terms :tags
        end
      end
    end
  end
end
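
Calling that class method is then straightforward. A small usage sketch; the option keys are the ones read inside self.search, results is Tire's accessor on the returned search object, and the "invoices" tag is a placeholder:

# Sketch: the options hash keys match what Document.search reads.
search = Document.search("annual reprt 2012", { page: 1, per_page: 20, facet: "invoices" })
search.results.each do |doc|
  puts doc.filename
end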

Hope it helps,

Fast fuzzy/approximate search in dictionary of strings in Ruby

I wrote a pair of gems, fuzzily and blurrily, which do trigram-based fuzzy matching.
Given your (low) volume of data, Fuzzily will be easier to integrate and about as fast; with either you'd get answers within 5-10 ms on modern hardware.

Given both are trigram-based (which is indexable), not edit-distance-based (which isn't), you'd probably have to do this in two passes:

  • first ask either gem for a set of best matches trigram-wise
  • then compare the results with your input string using Levenshtein
  • and return the minimum for that measure.

In Ruby (as you asked), using Fuzzily plus the Text gem, obtaining the records within the edit distance threshold would look like:

MyRecords.find_by_fuzzy_name(input_string).select { |result|
  Text::Levenshtein.distance(input_string, result.name) < my_distance_threshold
}

This performs a handful of well-optimized database queries and a few Levenshtein comparisons in Ruby.
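
For context, the find_by_fuzzy_name finder above comes from declaring the attribute as fuzzily searchable on the model. A rough sketch of that declaration is below; MyRecords and :name mirror the snippet above, and the gem also needs its trigram table and model set up per its README:

# Sketch: model-side declaration that provides find_by_fuzzy_name.
class MyRecords < ActiveRecord::Base
  fuzzily_searchable :name
end

# MyRecords.find_by_fuzzy_name('almost the same naem')
# => best trigram matches, most similar first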

Caveats:

  • if the "minimal" edit distance you're looking for is high, you'll still be doing lots of Levenshtein comparisons.
  • using trigrams assumes your input text is Latin text or close to it (European languages, basically).
  • there probably are edge cases, since nothing guarantees that "number of matching trigrams" is a great general approximation of "edit distance".

How to do advanced string comparison in Ruby?

You may want to try the Levenshtein string distance algorithm for this. http://rubygems.org/gems/text has an implementation of it, along with other helpful string comparison utilities.
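
For a quick illustration, picking the closest match from a list of candidates with the text gem could look like this (a small sketch; the candidate words are made up):

require 'text'

candidates = %w[receipt recipe recite respite]
input = 'reciept'

# Pick the candidate with the smallest edit distance to the input.
best = candidates.min_by { |word| Text::Levenshtein.distance(input, word) }
puts best  # => "receipt"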


