Similarity Function in Postgres with Pg_Trgm

Similarity function in Postgres with pg_trgm

You have to install pg_trgm first. On Debian, source this SQL script: /usr/share/postgresql/8.4/contrib/pg_trgm.sql. From the command line:

psql -f /usr/share/postgresql/8.4/contrib/pg_trgm.sql

Or inside a psql shell:

\i /usr/share/postgresql/8.4/contrib/pg_trgm.sql

The script installs into the public schema by default; edit the search_path at the top if you want to install it somewhere else (so that uninstalling or upgrading can be done simply by dropping that schema).
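
Note that on PostgreSQL 9.1 and later the contrib script is superseded by the extension mechanism, so installation is a one-liner (the schema name in the commented-out variant is just an example):

CREATE EXTENSION IF NOT EXISTS pg_trgm;
-- or, to keep it in its own schema (the schema must exist already):
-- CREATE EXTENSION pg_trgm SCHEMA contrib_schema;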

Postgresql - Similarity with trigram (pg_trgm)

The problem here is the % operator. It will return TRUE only if the similarity exceeds the pg_trgm.similarity_threshold parameter, which defaults to 0.3.

SELECT similarity('mariazirita', 'mar');

similarity
════════════
0.23076923
(1 row)

SELECT similarity('mariazirita', 'maria');

similarity
════════════
0.3846154
(1 row)

So you can either lower the threshold or remove the condition with % from the query.
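
For example, lowering the threshold for the current session makes 'mar' pass the % test against 'mariazirita' (on versions before 9.6 you would call set_limit(0.2) instead):

SET pg_trgm.similarity_threshold = 0.2;   -- default is 0.3

SELECT 'mariazirita' % 'mar'::text;       -- now returns true, since 0.23 > 0.2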

How is the similarity calculated in Postgres pg_trgm module

The formula can be found in contrib/pg_trgm/trgm.h (see the macro CALCSML) and is as follows:

nt / (n1 + n2 - nt)

where nt is the number of trigrams the two strings have in common, and n1 and n2 are the trigram counts of the first and second string. In your case that is 3 / (5 + 8 - 3) = 0.3.
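
A minimal sketch that checks the formula against the built-in function, using two arbitrary example strings (cardinality() needs PostgreSQL 9.4 or later):

WITH t AS (
   SELECT show_trgm('word')  AS a,
          show_trgm('words') AS b
)
SELECT cardinality(ARRAY(SELECT unnest(a) INTERSECT SELECT unnest(b))) AS nt,
       cardinality(a) AS n1,
       cardinality(b) AS n2,
       similarity('word', 'words') AS sim   -- equals nt / (n1 + n2 - nt)
FROM t;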

Postgres pg_trgm - why ordering by similarity is very slow

PostgreSQL doesn't use indexes for function calls, only for operators.

The query that orders by similarity() calls that function for every row and then orders the rows.

The query that uses the % operator uses the index and runs the similarity() function only on the rows that match (there are no index-only scans for functions).

If you want to order least similar first (as in the question), restricted to rows with similarity greater than 0.2, you should use the distance operator <->.

Like so:

WHERE "User"."displayName" <-> 'John' < 0.8
ORDER BY "User"."displayName" <-> 'John' DESC

The distance is 1 - similarity, hence the 0.8 limit (distance < 0.8 is the same as similarity > 0.2).
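
A quick sanity check of that relationship, with two arbitrary strings:

SELECT similarity('John', 'Jon')      AS sim,
       'John'::text <-> 'Jon'         AS dist,          -- pg_trgm distance operator
       1 - similarity('John', 'Jon')  AS one_minus_sim; -- same value as dist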

PostgreSQL, trigrams and similarity

The concept of trigram similarity relies on splitting any sentence into "trigrams" (sequences of three consecutive letters) and treating the result as a SET (i.e. the order doesn't matter and there are no repeated values). Before the sentence is processed, two blank spaces are added at the beginning, one at the end, and single spaces are replaced by double ones.

Trigrams are a special case of N-grams.

The trigram set corresponding to "Chateau blanc" is found by taking all sequences of three letters that appear in it (after the padding described above):

'  chateau  blanc '
 ---                => '  c'
  ---               => ' ch'
   ---              => 'cha'
    ---             => 'hat'
     ---            => 'ate'
      ---           => 'tea'
       ---          => 'eau'
        ---         => 'au '
         ---        => 'u  '
          ---       => '  b'
           ---      => ' bl'
            ---     => 'bla'
             ---    => 'lan'
              ---   => 'anc'
               ---  => 'nc '

Sorting them, dropping 'u  ' (pg_trgm actually extracts trigrams word by word, so a trigram spanning the gap between the two words never shows up in its output), and taking out repetitions gets you:

'  b'
'  c'
' bl'
' ch'
'anc'
'ate'
'au '
'bla'
'cha'
'eau'
'hat'
'lan'
'nc '
'tea'

This can be computed by PostgreSQL by means of the function show_trgm:

SELECT show_trgm('Chateau blanc') AS A

A = [  b,  c, bl, ch,anc,ate,au ,bla,cha,eau,hat,lan,nc ,tea]

... which has 14 trigrams. (Check pg_trgm).

And the trigram set corresponding to "Chateau Cheval Blanc" is:

SELECT show_trgm('Chateau Cheval Blanc') AS B 

B = [  b,  c, bl, ch,al ,anc,ate,au ,bla,cha,che,eau,eva,hat,hev,lan,nc ,tea,val]

... which has 19 trigrams.

If you count how many trigrams the two sets have in common, you find that they share the following ones:

A intersect B = 
[  b,  c, bl, ch,anc,ate,au ,bla,cha,eau,hat,lan,nc ,tea]

and the ones they have in total are:

A union B = 
[  b,  c, bl, ch,al ,anc,ate,au ,bla,cha,che,eau,eva,hat,hev,lan,nc ,tea,val]

That is, both sentences have 14 trigrams in common, and 19 in total.

The similarity is computed as:

 similarity = 14 / 19

You can check it with:

SELECT 
cast(14.0/19.0 as real) AS computed_result,
similarity('Chateau blanc', 'chateau cheval blanc') AS function_in_pg

and you'll see that you get: 0.736842

... which explains how similarity is computed, and why you get the values you get.


NOTE: You can compute the intersection and union by means of:

SELECT array_agg(t) AS in_common
FROM (
   SELECT unnest(show_trgm('Chateau blanc')) AS t
   INTERSECT
   SELECT unnest(show_trgm('chateau cheval blanc')) AS t
   ORDER BY t
) AS trigrams_in_common;

SELECT array_agg(t) AS in_total
FROM (
   SELECT unnest(show_trgm('Chateau blanc')) AS t
   UNION
   SELECT unnest(show_trgm('chateau cheval blanc')) AS t
) AS trigrams_in_total;

And this is a way to explore the similarity of different pairs of sentences:

WITH p AS (
   SELECT 'This is just a sentence I''ve invented'::text      AS f1,
          'This is just a sentence I''ve also invented'::text AS f2
),
t1 AS (
   SELECT unnest(show_trgm(f1)) AS t FROM p
),
t2 AS (
   SELECT unnest(show_trgm(f2)) AS t FROM p
),
x AS (
   SELECT
      (SELECT count(*) FROM
         (SELECT * FROM t1 INTERSECT SELECT * FROM t2) AS s0)::integer AS same,
      (SELECT count(*) FROM
         (SELECT * FROM t1 UNION SELECT * FROM t2) AS s0)::integer AS total,
      similarity(f1, f2) AS sim_2
   FROM p
)
SELECT same, total, same::real/total::real AS sim_1, sim_2
FROM x;

You can check it at Rextester

Optimizing a postgres similarity query (pg_trgm + gin index)

I expect much faster results with this approach:

1.

Create a GiST index with 1 column holding concatenated values:

CREATE INDEX users_search_idx ON auth_user
USING gist((username || ' ' || first_name || ' ' || last_name) gist_trgm_ops);

This assumes all 3 columns are defined NOT NULL (you did not specify). Otherwise you need to do more; see the sketch below.
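
If any of the columns can be NULL, one possible sketch is to fold NULLs into empty strings, so a single NULL does not null out the whole concatenation (the expressions in the query then have to use exactly the same COALESCE wrapping to match the index):

CREATE INDEX users_search_idx ON auth_user
USING gist ((coalesce(username, '') || ' ' ||
             coalesce(first_name, '') || ' ' ||
             coalesce(last_name, '')) gist_trgm_ops);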

Why not simplify with concat_ws()?

  • Combine two columns and add into one new column
  • Faster query with pattern-matching on multiple text fields

2.

Use a proper nearest-neighbor query, matching above index:

SELECT username, email, first_name, last_name
, similarity(username , $1) AS s_username
, similarity(first_name, $1) AS s_first_name
, similarity(last_name , $1) AS s_last_name
, row_number() OVER () AS rank -- greatest similarity first
FROM auth_user
WHERE (username || ' ' || first_name || ' ' || last_name) % $1 -- !!
ORDER BY (username || ' ' || first_name || ' ' || last_name) <-> $1 -- !!
LIMIT $2;

Expressions in WHERE and ORDER BY must match the index expression!

In particular ORDER BY rank (like you had it) will always perform poorly for a small LIMIT picking from a much larger pool of qualifying rows, because it cannot use an index directly: The sophisticated expression behind rank has to be calculated for every qualifying row, then all have to be sorted before the small selection of best matches can be returned. This is much, much more expensive than a true nearest-neighbor query that can pick the best results from the index directly without even looking at the rest.

row_number() with empty window definition just reflects the ordering produced by the ORDER BY of the same SELECT.

Related answers:

  • Best index for similarity function
  • Search in 300 million addresses with pg_trgm

As for your item 3., I added an answer to the question you referenced, that should explain it:

  • PostgreSQL GIN index slower than GIST for pg_trgm?

Finding similar strings with PostgreSQL quickly

The way you have it, similarity between every element and every other element of the table has to be calculated (almost a cross join). If your table has 1000 rows, that's already 1,000,000 (!) similarity calculations, before those can be checked against the condition and sorted. Scales terribly.

Use SET pg_trgm.similarity_threshold and the % operator instead. Both are provided by the pg_trgm module. This way, a trigram GiST index can be used to great effect.

The configuration parameter pg_trgm.similarity_threshold replaced the functions set_limit() and show_limit() in Postgres 9.6. The deprecated functions still work (as of Postgres 13). Also, performance of GIN and GiST indexes improved in many ways since Postgres 9.1.

Try instead:

SET pg_trgm.similarity_threshold = 0.8;  -- Postgres 9.6 or later

SELECT similarity(n1.name, n2.name) AS sim, n1.name, n2.name
FROM names n1
JOIN names n2 ON n1.name <> n2.name
AND n1.name % n2.name
ORDER BY sim DESC;

Faster by orders of magnitude, but still slow.

pg_trgm.similarity_threshold is a "customized" option, which can be handled like any other option. See:

  • Query a parameter (postgresql.conf setting) like "max_connections"
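
A few ways to inspect and change it, like any other configuration parameter (if SHOW reports an unrecognized parameter, call any pg_trgm function first so the library gets loaded in the session):

SHOW pg_trgm.similarity_threshold;        -- current value
SET pg_trgm.similarity_threshold = 0.5;   -- change it for the current session
RESET pg_trgm.similarity_threshold;       -- back to the default (0.3)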

You may want to restrict the number of possible pairs by adding preconditions (like matching first letters) before cross joining (and support that with a matching functional index). The performance of a cross join deteriorates with O(N²).
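
A sketch of that idea for the names table from the query above, pairing only names that share the same first letter and backing the precondition with an expression index (the index name is arbitrary):

CREATE INDEX names_first_letter_idx ON names (left(name, 1));

SELECT similarity(n1.name, n2.name) AS sim, n1.name, n2.name
FROM names n1
JOIN names n2 ON left(n1.name, 1) = left(n2.name, 1)  -- precondition prunes most pairs
             AND n1.name <> n2.name
             AND n1.name % n2.name
ORDER BY sim DESC;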

This does not work because you cannot refer to output columns in WHERE or HAVING clauses:

WHERE ... sim > 0.8

That's according to the SQL standard (which is handled rather loosely by certain other RDBMS). On the other hand:

ORDER BY sim DESC

Works because output columns can be used in GROUP BY and ORDER BY. See:

  • PostgreSQL reusing computation result in select query
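
For completeness, the standard workaround is to wrap the query so that the computed value becomes an input column of an outer query; a sketch (for this particular problem the % approach above is still what you want, since this variant computes similarity for every pair):

SELECT *
FROM (
   SELECT similarity(n1.name, n2.name) AS sim,
          n1.name AS name1, n2.name AS name2
   FROM names n1
   JOIN names n2 ON n1.name <> n2.name
) sub
WHERE sim > 0.8
ORDER BY sim DESC;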

Test case

I ran a quick test on my old test server to verify my claims.

PostgreSQL 9.1.4. Times taken with EXPLAIN ANALYZE (best of 5).

CREATE TEMP TABLE t AS
SELECT some_col AS name FROM some_table LIMIT 1000; -- real life test strings

First round of tests with GIN index:

CREATE INDEX t_gin ON t USING gin(name gin_trgm_ops);  -- round1: with GIN index

Second round of tests with GIST index:

DROP INDEX t_gin;
CREATE INDEX t_gist ON t USING gist(name gist_trgm_ops);

New query:

SELECT set_limit(0.8);

SELECT similarity(n1.name, n2.name) AS sim, n1.name, n2.name
FROM t n1
JOIN t n2 ON n1.name <> n2.name
AND n1.name % n2.name
ORDER BY sim DESC;

GIN index used, 64 hits: total runtime: 484.022 ms

GIST index used, 64 hits: total runtime: 248.772 ms

Old query:

SELECT (similarity(n1.name, n2.name)) as sim, n1.name, n2.name
FROM t n1, t n2
WHERE n1.name != n2.name
AND similarity(n1.name, n2.name) > 0.8
ORDER BY sim DESC;

GIN index not used, 64 hits: total runtime: 6345.833 ms

GIST index not used, 64 hits: total runtime: 6335.975 ms

Otherwise identical results. Advice is good. And this is for just 1000 rows!

GIN or GiST?

GIN often provides superior read performance:

  • Difference between GiST and GIN index

But not in this particular case!

This can be implemented quite efficiently by GiST indexes, but not by
GIN indexes.

  • Multicolumn index on 3 fields with heterogenous data types

