Fast Fuzzy/Approximate Search in Dictionary of Strings in Ruby

I wrote a pair of gems, fuzzily and blurrily, which do trigram-based fuzzy matching.
Given your (low) volume of data, Fuzzily will be easier to integrate and about as fast; with either you'd get answers within 5-10 ms on modern hardware.

Since both are trigram-based (which is indexable), not edit-distance-based (which isn't), you'd probably have to do this in two passes:

  • first ask either gem for a set of best matches trigrams-wise
  • then compare those results with your input string, using Levenshtein
  • and return the minimum for that measure.

In Ruby (as you asked), using Fuzzily plus the Text gem, obtaining the records within the edit distance threshold would look like:

MyRecords.find_by_fuzzy_name(input_string).select { |result|
  Text::Levenshtein.distance(input_string, result.name) < my_distance_threshold
}

This performs a handful of well-optimized database queries and a few Levenshtein distance computations in Ruby.
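
If you want the single best match rather than all records under a threshold (the "minimum" mentioned in the list above), a minimal sketch could look like this (same model and gems as in the example; closest_match is just an illustrative name):

require 'text'

# Take the trigram candidates from Fuzzily and keep the record closest
# to the input by Levenshtein distance (nil if there are no candidates).
def closest_match(input_string)
  MyRecords.find_by_fuzzy_name(input_string).min_by do |record|
    Text::Levenshtein.distance(input_string, record.name)
  end
end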

Caveats:

  • if the "minimal" edit distance you're looking for is high, you'll still be doing lots of Levenshteins.
  • using trigrams assumes your input text is Latin text or close to it (European languages, basically).
  • there probably are edge cases, since nothing guarantees that "number of matching trigrams" is a great general approximation of "edit distance".

Objective-C: Fast Fuzzy Search Match

If NSString+Score does what you want but is too slow, you could start by speeding it up. Lines 23 to 28 in -scoreAgainst:fuzziness:options: are setup code that only needs to run once, not on every one of the 200 comparisons. So pull that code out into a setup method and measure again.

Edit:

As an exercise, I forked StringScore, extracted the setup code and made minimal changes to get some performance improvement, then measured it. I used 1000 random words and grouped them three to a string (e.g. "disrupted dot drinking"). For each of these groups I did the setup (as described in the original answer above) and then compared the string to all of the 1000 groups. This takes around 11 seconds on my Core 2 Duo.

So comparing one string to 1000 takes about 11 ms. You only need to compare 1 to 200, so it will probably be well under 10 ms. Would that work for you?

(By the way, nearly half the time is still spent in rangeOfString: finding a single character; this could probably be done much faster, but I didn't want to get into the details of the algorithm.)

Use PostgreSQL full-text search to fuzzy-match all search terms

Ideally, if the user types a small error like:
ofice bmt
the results should still appear.

This could be very hard to do on more than a best-effort basis. If someone enters "Canter", how should the system know whether they meant a shortening of "Canterbury", a misspelling of "cancer" or of "cantor", or whether they really meant a horse's gait? Perhaps you can create a dictionary of common typos for your specific field? Also, without the specific knowledge that time zones are expected and common, "bmt" seems like a misspelling of, well, something.

This works fine; however, in some cases it doesn't, and I believe that's because it interprets the query as English and ignores some words like of, off, in, etc.

Don't just believe, check and see!

select to_tsquery('english','OFF:*&BMT:*');
to_tsquery
------------
'bmt':*

Yes indeed, to_tsquery does omit stop words, even with the :* thingy.

One option is to use 'simple' rather than 'english' as your configuration:

select to_tsquery('simple','OFF:*&BMT:*');
to_tsquery
-------------------
'off':* & 'bmt':*

Another option is to write tsquery directly rather than processing through to_tsquery. Note that in this case, you have to lower-case it yourself:

select 'off:*&bmt:*'::tsquery;
tsquery
-------------------
'off':* & 'bmt':*

Also note that if you do this with 'office:*', you will never get a match in an 'english' configuration, because 'office' in the document gets stemmed to 'offic', while no stemming occurs when you write 'office:*'::tsquery. So you could use 'simple' rather than 'english' to avoid both stemming and stop words. Or you could test each word in the query individually to see if it gets stemmed before deciding to add :* to it.
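
One way to follow that last suggestion (a rough sketch, not from the original answer: the connection settings and the prefix_tsquery helper are made up) is to run each word through to_tsquery first and attach :* to whatever lexeme comes back, dropping stop words that come back empty. With the pg gem in Ruby:

require 'pg'

conn = PG.connect(dbname: 'mydb')   # assumed connection settings

# Stem each word first, so 'office' becomes 'offic:*' and stop words
# such as 'off' are dropped entirely.
def prefix_tsquery(conn, words)
  lexemes = words.map do |word|
    conn.exec_params("select to_tsquery('english', $1)::text", [word])
        .getvalue(0, 0)
        .delete("'")                 # to_tsquery returns e.g. 'offic'
  end
  lexemes.reject(&:empty?).map { |lexeme| "#{lexeme}:*" }.join(' & ')
end

prefix_tsquery(conn, %w[office bmt])   # => "offic:* & bmt:*"
# Cast the result to tsquery yourself, as in the last example above.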

Is there a way to avoid this but still have good performance and fuzzy search? I'm not keen on having to sync my PG with ElasticSearch for this.

What do you mean by fuzzy search? You don't seem to be using that now; you are just using prefix matching, and accidentally using stemming and stop words. How large is your table to be searched, and what kind of performance is acceptable?

If you did use ElasticSearch, how would you then phrase your searches? If you explained how you would phrase the search in ES, maybe someone can help you do the same thing in PostgreSQL. I don't think we can take it as a given that switching to ES will just magically do the right thing.

I could do it by building a list of AND statements in the WHERE clause
with LIKE '% ... %' but that would probably hurt performance and
doesn't support fuzzysearch.

Have you looked into pg_trgm? It can make those types of queries quite fast. Also, LIKE '%...%' is a lot more fuzzy than what you are currently doing, so I don't understand how you would lose that. pg_trgm also provides the '<->' operator, which is even fuzzier and might be your best bet. It can deal with typos fairly well when they are embedded in long strings, but in short strings they can really be a problem.
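
For completeness, here is a rough sketch of the pg_trgm route (the places table and name column are made up; the extension, index type and operators are standard pg_trgm features), again from Ruby with the pg gem:

require 'pg'

conn = PG.connect(dbname: 'mydb')   # assumed connection settings

# One-time setup: enable pg_trgm and add a trigram index on the
# searched column so '%' and '<->' queries can use it.
conn.exec(<<~SQL)
  create extension if not exists pg_trgm;
  create index if not exists places_name_trgm
    on places using gist (name gist_trgm_ops);
SQL

# '%' is the trigram similarity operator and '<->' the trigram distance
# operator; both tolerate small typos such as 'ofice'.
rows = conn.exec_params(<<~SQL, ['ofice bmt'])
  select name
  from places
  where name % $1
  order by name <-> $1
  limit 10
SQL
rows.each { |row| puts row['name'] }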

JavaScript fuzzy search that makes sense

Good question! But my thought is that, rather than trying to modify Damerau-Levenshtein, you might do better to try a different algorithm or to combine/weight the results from two algorithms.

It strikes me that exact or close matches to the "starting prefix" are something Damerau-Levenshtein gives no particular weight to -- but your apparent user expectations would.

I searched for "better than Levenshtein" and, among other things, found this:

http://www.joyofdata.de/blog/comparison-of-string-distance-algorithms/

This mentions a number of "string distance" measures. Three that looked particularly relevant to your requirement would be:

  1. Longest Common Substring distance: Minimum number of symbols that have to be removed in both strings until resulting substrings are identical.

  2. q-gram distance: Sum of absolute differences between N-gram vectors of both strings.

  3. Jaccard distance: 1 minus the quotient of shared N-grams and all observed N-grams.

Maybe you could use a weighted combination (or minimum) of these metrics, with Levenshtein -- common substring, common N-gram or Jaccard will all strongly prefer similar strings -- or perhaps try just using Jaccard?

Depending on the size of your list/database, these algorithms can be moderately expensive. For a fuzzy search I implemented, I used a configurable number of N-grams as "retrieval keys" from the DB, then ran the expensive string-distance measure to sort them in preference order.
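
To make the Jaccard option and the retrieve-then-rank idea concrete, here is a minimal sketch over character bigrams (written in Ruby for consistency with the rest of this page; the same few lines translate directly to JavaScript, and all names are illustrative rather than taken from any particular library):

require 'set'

# Character bigrams of a string, e.g. "hello" -> ["he", "el", "ll", "lo"]
def bigrams(str)
  str.downcase.each_char.each_cons(2).map(&:join)
end

# Jaccard distance: 1 minus the quotient of shared N-grams and all observed N-grams.
def jaccard_distance(a, b)
  x = bigrams(a).to_set
  y = bigrams(b).to_set
  union = x | y
  return 1.0 if union.empty?
  1.0 - (x & y).size.to_f / union.size
end

# Retrieve-then-rank: use the query's bigrams as cheap retrieval keys,
# then sort the surviving candidates by the more expensive distance.
def fuzzy_search(query, dictionary, limit: 10)
  keys = bigrams(query).to_set
  candidates = dictionary.select { |word| bigrams(word).any? { |g| keys.include?(g) } }
  candidates.min_by(limit) { |word| jaccard_distance(query, word) }
end

fuzzy_search("dictionery", %w[dictionary direction fiction diction])
# => the candidates ordered by increasing Jaccard distance to the query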

I wrote some notes on Fuzzy String Search in SQL. See:

  • http://literatejava.com/sql/fuzzy-string-search-sql/

