Finding Similar Strings with Postgresql Quickly

Finding similar strings with PostgreSQL quickly

The way you have it, similarity between every element and every other element of the table has to be calculated (almost a cross join). If your table has 1000 rows, that's already 1,000,000 (!) similarity calculations, before those can be checked against the condition and sorted. Scales terribly.

Use SET pg_trgm.similarity_threshold and the % operator instead. Both are provided by the pg_trgm module. This way, a trigram GiST index can be used to great effect.

The configuration parameter pg_trgm.similarity_threshold replaced the functions set_limit() and show_limit() in Postgres 9.6. The deprecated functions still work (as of Postgres 13). Also, performance of GIN and GiST indexes improved in many ways since Postgres 9.1.

Try instead:

SET pg_trgm.similarity_threshold = 0.8;  -- Postgres 9.6 or later
  
SELECT similarity(n1.name, n2.name) AS sim, n1.name, n2.name
FROM   names n1
JOIN   names n2 ON n1.name <> n2.name
               AND n1.name % n2.name
ORDER  BY sim DESC;

Faster by orders of magnitude, but still slow.

pg_trgm.similarity_threshold is a "customized" option, which can be handled like any other option. See:

Query a parameter (postgresql.conf setting) like "max_connections"

You may want to restrict the number of possible pairs by adding preconditions (like matching first letters) before cross joining (and support that with a matching functional index). The performance of a cross join deteriorates with O(N²).

This does not work because you cannot refer to output columns in WHERE or HAVING clauses:

WHERE ... sim > 0.8

That's according to the SQL standard (which is handled rather loosely by certain other RDBMS). On the other hand:

ORDER BY sim DESC

Works because output columns can be used in GROUP BY and ORDER BY. See:

PostgreSQL reusing computation result in select query

Test case

I ran a quick test on my old test server to verify my claims.

PostgreSQL 9.1.4. Times taken with EXPLAIN ANALYZE (best of 5).

CREATE TEMP table t AS 
SELECT some_col AS name FROM some_table LIMIT 1000;  -- real life test strings

First round of tests with GIN index:

CREATE INDEX t_gin ON t USING gin(name gin_trgm_ops);  -- round1: with GIN index

Second round of tests with GIST index:

DROP INDEX t_gin;
CREATE INDEX t_gist ON t USING gist(name gist_trgm_ops);

New query:

SELECT set_limit(0.8);

SELECT similarity(n1.name, n2.name) AS sim, n1.name, n2.name
FROM   t n1
JOIN   t n2 ON n1.name <> n2.name
           AND n1.name % n2.name
ORDER  BY sim DESC;

GIN index used, 64 hits: total runtime: 484.022 ms

GIST index used, 64 hits: total runtime: 248.772 ms

Old query:

SELECT (similarity(n1.name, n2.name)) as sim, n1.name, n2.name
FROM   t n1, t n2
WHERE  n1.name != n2.name
AND    similarity(n1.name, n2.name) > 0.8
ORDER  BY sim DESC;

GIN index not used, 64 hits: total runtime: 6345.833 ms

GIST index not used, 64 hits: total runtime: 6335.975 ms

Otherwise identical results. Advice is good. And this is for just 1000 rows!

GIN or GiST?

GIN often provides superior read performance:

Difference between GiST and GIN index

But not in this particular case!

This can be implemented quite efficiently by GiST indexes, but not by
GIN indexes.

Multicolumn index on 3 fields with heterogenous data types

Scalable way to search for (similar) strings in a database

PostgreSQL Full-text search cannot do what you're looking for. However, PostgreSQL trigram similarity can do it.

You first need to install the packages with 'trigram similarity' and 'btree_gist', by executing (once) in your database:

CREATE EXTENSION pg_trgm;
CREATE EXTENSION btree_gist;

I assume you have one table that looks like this one:

CREATE TABLE sentences
(
    sentence_id integer PRIMARY KEY,
    sentence text
) ;

INSERT INTO sentences (sentence_id, sentence)
VALUES
    (1, 'Cangaroos are not my favorite.'),
    (2, 'A vegetable sentence.'),
    (3, 'Cangaroos are evil.'),
    (4, 'Again, some plants in my garden.'),
    (5, 'I once had a cangaroo. Never again.') ;

This table needs a 'trigram index', to allow the PostgreSQL database to 'index by similarity'. This is accomplished by executing:

CREATE INDEX ON sentences USING GIST (sentence gist_trgm_ops, sentence_id) ;

To find the answers you're looking for, you execute:

-- Set the minimum similarity you want to be able to search
SELECT set_limit(0.2) ;

-- And now, select the sentences 'similar' to the input one
SELECT
    similarity(sentence, 'I don''t like cangaroos') AS similarity, 
    sentence_id,
    sentence
FROM
    sentences
WHERE
    /* That's how you choose your sentences:
       % means 'similar to', in the trigram sense */
    sentence % 'I don''t like cangaroos'
ORDER BY
    similarity DESC ;

The result that you get is:

similarity | sentence_id | sentence
-----------+-------------+-------------------------------------
    0.3125 |           3 | Cangaroos are evil.      
    0.2325 |           1 | Cangaroos are not my favorite.
    0.2173 |           5 | I once had a cangaroo. Never again.

Hope this gives you what you want...

PostgreSQL, trigrams and similarity

The concept of trigram similarity relies on having any sentence divided into "trigrams" (sequences of three consecutive letters), and treating the result as a SET (i.e.: the order doesn't matter, and you don't have repeated values). Before the sentence is considered, two blank spaces are added at the beginning, and one at the end, and single spaces are replaced by double ones.

Trigrams are a special case of N-grams.

The trigram set corresponding to "Chateau blanc" is found by finding all sequences of three letters that appear on it:

  chateau  blanc
---                 => '  c'
 ---                => ' ch'
  ---               => 'cha'
   ---              => 'hat'
    ---             => 'ate'
     ---            => 'tea'
      ---           => 'eau'
       ---          => 'au '
        ---         => 'u  '
         ---        => '  b'
          ---       => ' bl'
           ---      => 'bla'
            ---     => 'lan'
             ---    => 'anc'
              ---   => 'nc '

Sorting them, and taking out repetitions gets you:

'  b'
'  c'
' bl'
' ch'
'anc'
'ate'
'au '
'bla'
'cha'
'eau'
'hat'
'lan'
'nc '
'tea'

This can be computed by PostgreSQL by means of the function show_trgm:

SELECT show_trgm('Chateau blanc') AS A

A = [  b,  c, bl, ch,anc,ate,au ,bla,cha,eau,hat,lan,nc ,tea]

... which has 14 trigrams. (Check pg_trgm).

And the trigram set corresponding to "Chateau Cheval Blanc" is:

SELECT show_trgm('Chateau Cheval Blanc') AS B 

B = [  b,  c, bl, ch,anc,ate,au ,bla,cha,che,eau,evl,hat,hev,la ,lan,nc ,tea,vla]

... which has 19 trigrams

If you count how many trigrams have both sets in common, you find that they have the following ones:

A intersect B = 
    [  b,  c, bl, ch,anc,ate,au ,bla,cha,eau,hat,lan,nc ,tea]

and the ones they have in total are:

A union B = 
    [  b,  c, bl, ch,anc,ate,au ,bla,cha,che,eau,evl,hat,hev,la ,lan,nc ,tea,vla]

That is, both sentences have 14 trigrams in common, and 19 in total.

The similarity is computed as:

 similarity = 14 / 19

You can check it with:

SELECT 
    cast(14.0/19.0 as real) AS computed_result, 
    similarity('Chateau blanc', 'chateau cheval blanc') AS function_in_pg

and you'll see that you get: 0.736842

... which explains how similarity is computed, and why you get the values you get.

NOTE: You can compute the intersection and union by means of:

SELECT 
   array_agg(t) AS in_common
FROM
(
    SELECT unnest(show_trgm('Chateau blanc')) AS t 
    INTERSECT 
    SELECT unnest(show_trgm('chateau chevla blanc')) AS t
    ORDER BY t
) AS trigrams_in_common ;

SELECT 
   array_agg(t) AS in_total
FROM
(
    SELECT unnest(show_trgm('Chateau blanc')) AS t 
    UNION 
    SELECT unnest(show_trgm('chateau chevla blanc')) AS t
) AS trigrams_in_total ;

And this is a way to explore the similarity of different pair of sentences:

WITH p AS
(
    SELECT 
      'This is just a sentence I''ve invented'::text AS f1,
      'This is just a sentence I''ve also invented'::text AS f2
),
t1 AS
(
    SELECT unnest(show_trgm(f1)) FROM p
),
t2 AS
(
    SELECT unnest(show_trgm(f2)) FROM p
),
x AS
(
    SELECT
        (SELECT count(*) FROM 
            (SELECT * FROM t1 INTERSECT SELECT * FROM t2) AS s0)::integer AS same,
        (SELECT count(*) FROM 
            (SELECT * FROM t1 UNION     SELECT * FROM t2) AS s0)::integer AS total,
        similarity(f1, f2) AS sim_2
FROM
    p 
)
SELECT
    same, total, same::real/total::real AS sim_1, sim_2
FROM
    x ;

You can check it at Rextester

Postgres word_similarity not comparing words

Function inconsistent with description

Related subject on pgsql-bugs mailing list.

The substring similarity algorithm described by the author compares trigram arrays of a query string and a text. The problem is that a trigram array is optimised (duplicated trigrams are eliminated) and loses information about individual words of the text.

The query illustrates the issue:

with data(t) as (
values
    ('message'),
    ('message s'),
    ('message sag'),
    ('message sag sag'),
    ('message sag sage')
)

select 
    t as "text", 
    show_trgm(t) as "text trigrams", 
    show_trgm('sage') as "string trigrams", 
    cardinality(array_intersect(show_trgm(t), show_trgm('sage'))) as "common trgms"
from data;

       text       |                       text trigrams                       |       string trigrams       | common trgms 
------------------+-----------------------------------------------------------+-----------------------------+--------------
 message          | {"  m"," me",age,ess,"ge ",mes,sag,ssa}                   | {"  s"," sa",age,"ge ",sag} |            3
 message s        | {"  m","  s"," me"," s ",age,ess,"ge ",mes,sag,ssa}       | {"  s"," sa",age,"ge ",sag} |            4
 message sag      | {"  m","  s"," me"," sa","ag ",age,ess,"ge ",mes,sag,ssa} | {"  s"," sa",age,"ge ",sag} |            5
 message sag sag  | {"  m","  s"," me"," sa","ag ",age,ess,"ge ",mes,sag,ssa} | {"  s"," sa",age,"ge ",sag} |            5
 message sag sage | {"  m","  s"," me"," sa","ag ",age,ess,"ge ",mes,sag,ssa} | {"  s"," sa",age,"ge ",sag} |            5
(5 rows)

The trigram arrays in last three rows are the same and contain all trigrams of the query string.

Obviously, the implementation is not consistent with the description of the function (the description was changed in later releases of the documentation):

Returns a number that indicates how similar the first string to the most similar word of the second string. The function searches in the second string a most similar word not a most similar substring.

My function used in the above query:

create or replace function public.array_intersect(anyarray, anyarray)
returns anyarray language sql immutable
as $$
    select case 
        when $1 is null then $2
        else
            array(
                select unnest($1)
                intersect
                select unnest($2)
            )
        end;
$$;

Workaround

You can easily write your own function to get more expected results:

create or replace function my_word_similarity(text, text)
returns real language sql immutable as $$
    select max(similarity($1, word))
    from regexp_split_to_table($2, '[^[:alnum:]]') word
$$;

Compare:

with data(t) as (
values
    ('message'),
    ('message s'),
    ('message sag'),
    ('message sag sag'),
    ('message sag sage')
)

select t, word_similarity('sage', t), my_word_similarity('sage', t)
from data;

        t         | word_similarity | my_word_similarity
------------------+-----------------+--------------------
 message          |             0.6 |                0.3
 message s        |             0.8 |                0.3
 message sag      |               1 |                0.5
 message sag sag  |               1 |                0.5
 message sag sage |               1 |                  1
(5 rows)

New function in Postgres 11+

There is a new function in Postgres 11+ strict_word_similarity() which gives results expected by the author of the question:

with data(t) as (
values
    ('message'),
    ('message s'),
    ('message sag'),
    ('message sag sag'),
    ('message sag sage')
)

select t, word_similarity('sage', t), strict_word_similarity('sage', t)
from data;

        t         | word_similarity | strict_word_similarity
------------------+-----------------+------------------------
 message          |             0.6 |                    0.3
 message s        |             0.8 |             0.36363637
 message sag      |               1 |                    0.5
 message sag sag  |               1 |                    0.5
 message sag sage |               1 |                      1
(5 rows)

PostgreSQL '%' operator

% is the "similarity" operator, provided by the additional module pg_trgm.

It takes text (or other string types) as left and right operand and returns boolean: true if both operands are similar enough, false if not. The threshold is set with the GUC parameter pg_trgm.similarity_threshold.

Finding similar strings with PostgreSQL quickly
Pattern matching with LIKE, SIMILAR TO or regular expressions in PostgreSQL

Not to be confused with the modulo operator %. Same symbol, but the mathematical operator takes numeric types as left and right operand.

In Postgres, operators are defined by the operator name (like %) plus left and right operands. Gory details in the manual chapter Operator Type Resolution. The casual user hardly needs to know any of this. Typically it just works.

Can PostgreSQL index array columns?

PostgreSQL LIKE query performance variations

FTS does not support `LIKE`

The previously accepted answer was incorrect. Full Text Search with its full text indexes is not for the LIKE operator at all, it has its own operators and doesn't work for arbitrary strings. It operates on words based on dictionaries and stemming. It does support prefix matching for words, but not with the LIKE operator:

Get partial match from GIN indexed TSVECTOR column

Trigram index for `LIKE`

Install the additional module pg_trgm which provides operator classes for GIN and GiST trigram indexes to support all LIKE and ILIKE patterns, not just left-anchored ones:

Example index:

CREATE INDEX tbl_col_gin_trgm_idx  ON tbl USING gin  (col gin_trgm_ops);

Or:

CREATE INDEX tbl_col_gist_trgm_idx ON tbl USING gist (col gist_trgm_ops);

Difference between GiST and GIN index

Example query:

SELECT * FROM tbl WHERE col LIKE '%foo%';   -- leading wildcard
SELECT * FROM tbl WHERE col ILIKE '%foo%';  -- works case insensitively as well

Trigrams? What about shorter strings?

Words with less than 3 letters in indexed values still work. The manual:

Each word is considered to have two spaces prefixed and one space
suffixed when determining the set of trigrams contained in the string.

And search patterns with less than 3 letters? The manual:

For both LIKE and regular-expression searches, keep in mind that a
pattern with no extractable trigrams will degenerate to a full-index scan.

Meaning, that index / bitmap index scans still work (query plans for prepared statement won't break), it just won't buy you better performance. Typically no big loss, since 1- or 2-letter strings are hardly selective (more than a few percent of the underlying table matches) and index support would not improve performance to begin with, because a full table scan is faster.

`text_pattern_ops` or `COLLATE "C"` for prefix matching

Update

Since Postgres 9.1, COLLATE "C" is better. See:

Is there a difference between text_pattern_ops and COLLATE "C"?

Original answer

For just left-anchored patterns (no leading wildcard) you get the optimum with a suitable operator class for a btree index: text_pattern_ops or varchar_pattern_ops. Both built-in features of standard Postgres, no additional module needed. Similar performance, but much smaller index.

Example index:

CREATE INDEX tbl_col_text_pattern_ops_idx ON tbl(col text_pattern_ops);

Example query:

SELECT * FROM tbl WHERE col LIKE 'foo%';  -- no leading wildcard

Or, if you should be running your database with the 'C' locale (effectively no locale), then everything is sorted according to byte order anyway and a plain btree index with default operator class does the job.

Similar UTF-8 strings for autocomplete field

You are not using the operator class provided by the pg_trgm module. Create an index like this:

CREATE INDEX label_Lower_unaccent_trgm_idx
ON test_trgm USING gist (lower(unaccent_text(label)) gist_trgm_ops);

Originally, I had a GIN index here, but a GiST is typically better suited for this kind of query because it can return values sorted by similarity. See:

Matching patterns between multiple columns
Finding similar strings with PostgreSQL quickly

Your query has to match the index expression to be able to use it.

SELECT label
FROM   the_table
WHERE  lower(unaccent_text(label)) % 'fil'
ORDER  BY similarity(label, 'fil') DESC;  -- ok to use original string here

However, "filbert" and "filé powder" are not actually very similar to "fil" according to the % operator. I suspect you really want:

SELECT label
FROM   the_table
WHERE  lower(unaccent_text(label)) LIKE 'fil%'  -- !
ORDER  BY similarity(label, 'fil') DESC;  -- ok to use original string here

This finds all strings starting with the search string, and sorts the best matches according to the % operator first.

The expression can use a GIN or GiST index since PostgreSQL 9.1! The manual:

Beginning in PostgreSQL 9.1, these index types also support index
searches for LIKE and ILIKE, for example

If you actually meant to use the % operator:

Try adapting the threshold for the similarity operator %:

SET pg_trgm.similarity_threshold = 0.1;  -- Postgres 9.6 or later
SELECT set_limit(0.1);  -- Postgres 9.5 or older

Or even lower? The default is 0.3. Just to see whether the threshold filters additional matches.

Finding Similar Strings with Postgresql Quickly