How to match string with another string that is almost the same (fuzzy matching)
After reading your question, I think you want something like Levenshtein distance, and as you stated in your comment, for Postgres you could use it.
Quoting its documentation here:
https://www.postgresql.org/docs/9.1/static/fuzzystrmatch.html
test=# SELECT levenshtein('GUMBO', 'GAMBOL');
levenshtein
-------------
2
(1 row)
test=# SELECT levenshtein('GUMBO', 'GAMBOL', 2,1,1);
levenshtein
-------------
3
(1 row)
test=# SELECT levenshtein_less_equal('extensive', 'exhaustive',2);
levenshtein_less_equal
------------------------
3
(1 row)
test=# SELECT levenshtein_less_equal('extensive', 'exhaustive',4);
levenshtein_less_equal
------------------------
4
(1 row)
Then you can build your sql query with your desire distance:
SELECT *
FROM YourTable
WHERE levenshtein(string , 'AE18BX21') <= 2
Fuzzy String Matching with Rails (Tire) and ElasticSearch
Try creating custom analyzers to achieve other stemming features, etc.
Check out my example (this example also uses Mongoid & attachments, don't look at it if you don't need it):
class Document
include Mongoid::Document
include Mongoid::Timestamps
include Tire::Model::Search
include Tire::Model::Callbacks
field :filename, type: String
field :md5, type: String
field :tags, type: String
field :size, type: String
index({md5: 1}, {unique: true})
validates_uniqueness_of :md5
DEFAULT_PAGE_SIZE = 10
settings :analysis => {
:filter => {
:ngram_filter => {
:type => "edgeNGram",
:min_gram => 2,
:max_gram => 12
},
:custom_word_delimiter => {
:type => "word_delimiter",
:preserve_original => "true",
:catenate_all => "true",
}
}, :analyzer => {
:index_ngram_analyzer => {
:type => "custom",
:tokenizer => "standard",
:filter => ["lowercase", "ngram_filter", "asciifolding", "custom_word_delimiter"]
},
:search_ngram_analyzer => {
:type => "custom",
:tokenizer => "standard",
:filter => ["standard", "lowercase", "ngram_filter", "custom_word_delimiter"]
},
:suggestions => {
:tokenizer => "standard",
:filter => ["suggestions_shingle"]
}
}
} do
mapping {
indexes :id, index: :not_analyzed
indexes :filename, :type => 'string', :store => 'yes', :boost => 100, :search_analyzer => :search_ngram_analyzer, :index_analyzer => :index_ngram_analyzer
indexes :tags, :type => 'string', :store => 'yes', :search_analyzer => :search_ngram_analyzer, :index_analyzer => :index_ngram_analyzer
indexes :attachment, :type => 'attachment',
:fields => {
:content_type => {:store => 'yes'},
:author => {:store => 'yes', :analyzer => 'keyword'},
:title => {:store => 'yes'},
:attachment => {:term_vector => 'with_positions_offsets', :boost => 90, :store => 'yes', :search_analyzer => :search_ngram_analyzer, :index_analyzer => :index_ngram_analyzer},
:date => {:store => 'yes'}
}
}
end
def to_indexed_json
self.to_json(:methods => [:attachment])
end
def attachment
path_to_file = "#{Rails.application.config.document_library}#{path}/#{filename}"
Base64.encode64(open(path_to_file) { |file| file.read })
end
def self.search(query, options)
tire.search do
query { string "#{query}", :default_operator => :AND, :default_field => 'attachment', :fields => ['filename', 'attachment', 'tags'] }
highlight :attachment
page = (options[:page] || 1).to_i
search_size = options[:per_page] || DEFAULT_PAGE_SIZE
from (page -1) * search_size
size search_size
sort { by :_score, :desc }
if (options[:facet])
filter :terms, :tags => [options[:facet]]
facet 'global-tags', :global => true do
terms :tags
end
facet 'current-tags' do
terms :tags
end
end
end
end
end
Hope it helps,
Fast fuzzy/approximate search in dictionary of strings in Ruby
I wrote a pair of gems, fuzzily and blurrily which do trigrams-based fuzzy matching.
Given your (low) volume of data Fuzzily will be easier to integrate and about as fast, in with either you'd get answers within 5-10ms on modern hardware.
Given both are trigrams-based (which is indexable), not edit-distance-based (which isn't), you'd probably have to do this in two passes:
- first ask either gem for a set of best matches trigrams-wise
- then compare results with your input string, using Levenstein
- and return the min for that measure.
In Ruby (as you asked), using Fuzzily + the Text gem, obtaining the records withing the edit distance threshold would look like:
MyRecords.find_by_fuzzy_name(input_string).select { |result|
Text::Levenshtein.distance(input_string, result.name)] < my_distance_threshold
}
This performas a handful of well optimized database queries and a few
Caveats:
- if the "minimal" edit distance you're looking for is high, you'll still be doing lots of Levenshteins.
- using trigrams assumes your input text is latin text or close to (european languages basically).
- there probably are edge cases since nothing garantees that "number of matching trigrams" is a great general approximation to "edit distance".
How to do advanced string comparison in Ruby?
You may want to try the Levenshtein string distance algorithm for this. http://rubygems.org/gems/text has an implementation of this along with other helpful string comparison utils.
Related Topics
Is There a Shorthand If (Without Else) Statement in Ruby on Rails
When to Use Self in Module's Methods
How to Declare a Rake Task That Depends on a Parameterized Task
Accessing Headers from Sinatra
Robustly Call a Flaky API: Proper Error Handling with Net::Http
Properly Using Log4R in Ruby Application
Ruby on Rails:How to Implement Cancel Button in Form_Tag
How to Create a Dynamic Regular Expression in Ruby
Ruby Attr_Reader Allows One to Modify String Variable If Using <<
How Do Open a File for Writing Only If It Doesn't Already Exist in Ruby
How to Increment an Integer in Ruby
Finding the Difference Between Strings in Ruby
Best Way to Remove File Extension
Ruby Metaprogramming, How Does Rspec's 'Should' Work
Why Is a String Key for a Hash Frozen
How to Join a Table and Count Records in Rails 3
Changing Type of Activerecord Class in Rails with Single Table Inheritance