How to Create a Simple Fuzzy Search with PostgreSQL Only

How can I create a simple fuzzy search with PostgreSQL only?

PostgreSQL's fuzzystrmatch module provides several string comparison functions, such as soundex and metaphone. But for this you will want to use the levenshtein edit distance function.
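
As an aside, here is what the phonetic approach looks like on the same pair of words used below (a sketch; soundex reduces each word to a short phonetic code, so similar-sounding words get similar codes):

test=# SELECT soundex('GUMBO'), soundex('GAMBOL');
 soundex | soundex
---------+---------
 G510    | G514
(1 row)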

Example:

test=# SELECT levenshtein('GUMBO', 'GAMBOL');
 levenshtein
-------------
           2
(1 row)

The 2 is the edit distance between the two words. Apply this against a set of candidate words, sort by the result, and you get exactly the kind of fuzzy matches you're looking for.

Try this sample query (with your own object names and data, of course):

SELECT *
FROM some_table
WHERE levenshtein(code, 'AB123-lHdfj') <= 3
ORDER BY levenshtein(code, 'AB123-lHdfj')
LIMIT 10;

This query says:

Give me the top 10 results of all data from some_table where the edit distance between the code value and the input 'AB123-lHdfj' is at most 3. You will get back all rows where the value of code is within 3 edits of 'AB123-lHdfj', sorted so the closest matches come first.
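
If it helps to inspect the distances, here is a minimal variant of the same (hypothetical) query that also returns the computed distance as a column:

SELECT code, levenshtein(code, 'AB123-lHdfj') AS distance
FROM some_table
WHERE levenshtein(code, 'AB123-lHdfj') <= 3
ORDER BY distance
LIMIT 10;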

Note: if you get an error like:

function levenshtein(character varying, unknown) does not exist

Install the fuzzystrmatch extension using:

test=# CREATE EXTENSION fuzzystrmatch;
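
Afterwards you can confirm the extension is installed with psql's \dx meta-command:

test=# \dx fuzzystrmatch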

Use PostgreSQL full text search to fuzzy match all search terms

Ideally, if the user types a query with a small error, like:
ofice bmt
the results should still appear.

This could be very hard to do on more than a best-effort basis. If someone enters "Canter", how should the system know whether they meant a shortening of Canterbury, a misspelling of "cancer" or "cantor", or really a horse's gait? Perhaps you can create a dictionary of common typos for your specific field? Also, without the specific knowledge that time zones are expected and common, "bmt" seems like a misspelling of, well, something.

This works fine, but in some cases it doesn't, and I believe that's because the query is interpreted as English, so some words like of, off, in, etc. are ignored.

Don't just believe, check and see!

select to_tsquery('english','OFF:*&BMT:*');
to_tsquery
------------
'bmt':*

Yes indeed, to_tsquery does omit stop words, even with the :* thingy.

One option is to use 'simple' rather than 'english' as your configuration:

select to_tsquery('simple','OFF:*&BMT:*');
to_tsquery
-------------------
'off':* & 'bmt':*

Another option is to write the tsquery directly rather than processing it through to_tsquery. Note that in this case you have to lower-case it yourself:

select 'off:*&bmt:*'::tsquery;
tsquery
-------------------
'off':* & 'bmt':*

Also note that if you do this with 'office:*', you will never get a match in an 'english' configuration, because 'office' in the document gets stemmed to 'offic', while no stemming occurs when you write 'office:*'::tsquery. So you could use 'simple' rather than 'english' to avoid both stemming and stop words. Or you could test each word in the query individually to see if it gets stemmed before deciding to add :* to it.
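
To make the stemming pitfall concrete, here is a sketch you can run directly (results shown as comments; they assume the stock 'english' and 'simple' configurations):

select to_tsvector('english','office') @@ 'office:*'::tsquery;  -- f: the document lexeme is 'offic'
select to_tsvector('english','office') @@ 'offic:*'::tsquery;   -- t: the prefix matches the stemmed lexeme
select to_tsvector('simple','office') @@ 'office:*'::tsquery;   -- t: 'simple' does not stem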

Is there a way to avoid this but still have good performance and fuzzy search? I'm not keen on having to sync my PG with ElasticSearch for this.

What do you mean by fuzzy search? You don't seem to be using that now. You are just using prefix matching, and accidentally using stemming and stop words. How large is your table to be searched, and what kind of performance is acceptable?

If you did use ElasticSearch, how would you then phrase your searches? If you explained how you would phrase the search in ES, maybe someone can help you do the same thing in PostgreSQL. I don't think we can take it as a given that switching to ES will just magically do the right thing.

I could do it by building a list of AND statements in the WHERE clause with LIKE '% ... %', but that would probably hurt performance and doesn't support fuzzy search.

Have you looked into pg_trgm? It can make those types of queries quite fast. Also, LIKE '%...%' is a lot more fuzzy than what you are currently doing, so I don't understand how you would lose that. pg_trgm also provides the '<->' distance operator, which is even fuzzier and might be your best bet, as sketched below. It deals with typos fairly well when they are embedded in long strings, but in short strings typos can really be a problem.
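
Here is a minimal pg_trgm sketch (the table and column names are made up; a GiST index is used because, unlike GIN, it can also support the '<->' ordering):

CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX places_name_trgm_idx ON places USING gist (name gist_trgm_ops);

-- Boolean similarity match, tolerant of typos like 'ofice':
SELECT * FROM places WHERE name % 'ofice bmt';

-- Or take the 10 nearest matches by trigram distance:
SELECT * FROM places ORDER BY name <-> 'ofice bmt' LIMIT 10;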

How can I use ~ to fuzzy match two fields of a table?

You could join on the condition that either table's name field is a substring of the other's:

SELECT t1.*, t2.*
FROM table1 t1
INNER JOIN table2 t2
ON t1.name LIKE '%' || t2.name || '%' OR
t2.name LIKE '%' || t1.name || '%';

This approach does not even require regex. We could use regex here if we wanted to ensure that one table's name only counts as a match when it appears in the other's name as a whole word. But maybe you don't even need to do this.
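
If you do want the whole-word behavior, here is what the regex variant might look like, using PostgreSQL's word-boundary escapes \m and \M (a sketch; it assumes the name values contain no regex metacharacters, which real data would need escaped):

SELECT t1.*, t2.*
FROM table1 t1
INNER JOIN table2 t2
ON t1.name ~* ('\m' || t2.name || '\M') OR
t2.name ~* ('\m' || t1.name || '\M');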

Fuzzy search in full text search

I'd suggest that you either use full-text search or trigram similarity matching, but don't try to mix them.

Based on the requirement, I would say that trigram similarity matching is the better fit.

If you don't get a result using the similarity operator %, you have two choices:

  1. Lower the similarity threshold pg_trgm.similarity_threshold (a sketch follows after this list).

  2. Query in a different way so that you get the best matches, however "distant" they are:

    SELECT * FROM product ORDER BY katadi <-> ' pen' LIMIT 10;

    I think that would be the better solution.
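
For completeness, option 1 might look like this (0.3 is the default threshold; the table and column names follow the example above):

SET pg_trgm.similarity_threshold = 0.1;

SELECT * FROM product WHERE katadi % 'pen';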

PostgreSQL Fuzzy Searching multiple words with Levenshtein

Given your data and the following query with wild values for the Levenshtein insertion (10000), deletion (100), and substitution (1) costs:

with sample_data as (
    select 101 as id, 'Sony Entertainment Inc' as name
    union
    select 102, 'Apple Corp'
)
select sample_data.id, sample_data.name, components.part,
       levenshtein(components.part, 'sony', 10000, 100, 1) as ld_sony
from sample_data
inner join (select sd.id,
                   lower(unnest(regexp_split_to_array(sd.name, E'\\s+'))) as part
            from sample_data sd) components
        on components.id = sample_data.id;

The output is as follows:

 id  |          name          |     part      | ld_sony
-----+------------------------+---------------+---------
 101 | Sony Entertainment Inc | sony          |       0
 101 | Sony Entertainment Inc | entertainment |     903
 101 | Sony Entertainment Inc | inc           |   10002
 102 | Apple Corp             | apple         |     104
 102 | Apple Corp             | corp          |       3
(5 rows)
  • Row 1 - no changes (cost 0)
  • Row 2 - 9 deletions and 3 substitutions (9×100 + 3×1 = 903)
  • Row 3 - 1 insertion and 2 substitutions (1×10000 + 2×1 = 10002)
  • Row 4 - 1 deletion and 4 substitutions (1×100 + 4×1 = 104)
  • Row 5 - 3 substitutions (3×1 = 3)
I've found that splitting the words out causes a lot of false positives when you give a threshold. You can order by the Levenshtein distance to position the better matches close to the top, as sketched below. Maybe tweaking the Levenshtein cost variables will help you order the matches better. Sadly, Levenshtein doesn't weight changes early in the string differently from later ones.
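
Along those lines, here is a sketch that orders whole rows by their best (minimum) per-word distance instead of filtering on a threshold, using the same sample data and costs as above:

WITH sample_data AS (
    SELECT 101 AS id, 'Sony Entertainment Inc' AS name
    UNION ALL
    SELECT 102, 'Apple Corp'
)
SELECT id, name,
       min(levenshtein(lower(part), 'sony', 10000, 100, 1)) AS best_ld
FROM sample_data,
     unnest(regexp_split_to_array(name, E'\\s+')) AS part
GROUP BY id, name
ORDER BY best_ld;

-- Sony Entertainment Inc sorts first (best_ld = 0), Apple Corp second (best_ld = 3).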


