Similarity function in Postgres with pg_trgm
You have to install pg_trgm. On Debian, source this SQL script: /usr/share/postgresql/8.4/contrib/pg_trgm.sql. From the command line:
psql -f /usr/share/postgresql/8.4/contrib/pg_trgm.sql
Or inside a psql shell:
\i /usr/share/postgresql/8.4/contrib/pg_trgm.sql
The script installs into the public schema by default; edit the search path at the top if you want to install it somewhere else (so that uninstalling/upgrading can be done simply by dropping the schema).
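On PostgreSQL 9.1 and later (a newer setup than the 8.4 instructions above), the contrib script is no longer needed; the module is installed as an extension:

```sql
-- Installs pg_trgm into the current database (PostgreSQL 9.1+)
CREATE EXTENSION IF NOT EXISTS pg_trgm;
```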
Postgresql - Similarity with trigram (pg_trgm)
The problem here is the % operator. It will return TRUE only if the similarity exceeds the pg_trgm.similarity_threshold parameter, which defaults to 0.3.
SELECT similarity('mariazirita', 'mar');
similarity
════════════
0.23076923
(1 row)
SELECT similarity('mariazirita', 'maria');
similarity
════════════
0.3846154
(1 row)
So you can either lower the threshold or remove the condition with % from the query.
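Lowering the threshold for the current session might look like this (a sketch; the table and column names are assumptions, not from the question):

```sql
-- Lower the threshold so 'mar' (similarity 0.23) still matches
SET pg_trgm.similarity_threshold = 0.2;

SELECT name
FROM   people            -- hypothetical table
WHERE  name % 'mar';
```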
How is the similarity calculated in Postgres pg_trgm module
The formula can be found in contrib/pg_trgm/trgm.h (see the macro CALCSML) and is as follows:
nt / (n1 + n2 - nt)
where nt is the number of trigrams the two strings have in common, and n1 and n2 are the trigram counts of each string. In your case that is 3 / (5 + 8 - 3) = 0.3.
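You can verify the counts that feed the formula with show_trgm() (the strings here are illustrative, not the ones from the question):

```sql
-- n1, n2: trigram counts of each string; sml: what similarity() returns
-- 'foo' has 4 trigrams, 'food' has 5, they share 3: 3 / (4 + 5 - 3) = 0.5
SELECT array_length(show_trgm('foo'),  1) AS n1
     , array_length(show_trgm('food'), 1) AS n2
     , similarity('foo', 'food')          AS sml;
```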
Postgres pg_trgm - why ordering by similarity is very slow
PostgreSQL doesn't use indexes for function calls; it uses indexes only for operators.
The query that orders by similarity() calls that function for every row and then sorts the rows.
The query that uses % uses the index and runs the similarity function only on the rows that match (there are no index-only scans for functions).
If you want to order by least similarity (as in the question) among the rows whose similarity is greater than 0.2, you should use the distance operator <->.
Like so:
WHERE "User"."displayName" <-> 'John' < 0.8
ORDER BY "User"."displayName" <-> 'John' DESC
The distance is 1 - similarity, hence the 0.8.
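For <-> to be fast, it needs a trigram GiST index on the column. A sketch matching the query above (the index name is hypothetical):

```sql
-- KNN ordering with <-> can be served directly by a GiST trigram index
CREATE INDEX user_displayname_trgm_idx ON "User"
USING gist ("displayName" gist_trgm_ops);
```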
PostgreSQL, trigrams and similarity
The concept of trigram similarity relies on dividing any sentence into "trigrams" (sequences of three consecutive characters) and treating the result as a SET (i.e.: the order doesn't matter, and there are no repeated values). Before the sentence is considered, two blank spaces are added at the beginning, one at the end, and single spaces are replaced by double ones (in effect, each word is padded with two leading spaces and one trailing space, and no trigram spans two words).
Trigrams are a special case of N-grams.
The trigram set corresponding to "Chateau blanc" is found by listing all sequences of three consecutive characters that appear in the padded words:

'  chateau ' => '  c', ' ch', 'cha', 'hat', 'ate', 'tea', 'eau', 'au '
'  blanc '   => '  b', ' bl', 'bla', 'lan', 'anc', 'nc '
Sorting them and taking out repetitions gets you:
'  b'
'  c'
' bl'
' ch'
'anc'
'ate'
'au '
'bla'
'cha'
'eau'
'hat'
'lan'
'nc '
'tea'
This can be computed in PostgreSQL by means of the function show_trgm:
SELECT show_trgm('Chateau blanc') AS A
A = [  b,  c, bl, ch,anc,ate,au ,bla,cha,eau,hat,lan,nc ,tea]
... which has 14 trigrams. (Check pg_trgm.)
And the trigram set corresponding to "Chateau Cheval Blanc" is:
SELECT show_trgm('Chateau Cheval Blanc') AS B
B = [  b,  c, bl, ch,al ,anc,ate,au ,bla,cha,che,eau,eva,hat,hev,lan,nc ,tea,val]
... which has 19 trigrams.
If you count how many trigrams the two sets have in common, you find that they share the following ones:
A intersect B =
[  b,  c, bl, ch,anc,ate,au ,bla,cha,eau,hat,lan,nc ,tea]
and the ones they have in total are:
A union B =
[  b,  c, bl, ch,al ,anc,ate,au ,bla,cha,che,eau,eva,hat,hev,lan,nc ,tea,val]
That is, the two sentences have 14 trigrams in common, and 19 in total.
The similarity is computed as:
similarity = 14 / 19
You can check it with:
SELECT
cast(14.0/19.0 as real) AS computed_result,
similarity('Chateau blanc', 'chateau cheval blanc') AS function_in_pg
and you'll see that you get: 0.736842
... which explains how similarity is computed, and why you get the values you get.
NOTE: You can compute the intersection and union by means of:
SELECT
array_agg(t) AS in_common
FROM
(
SELECT unnest(show_trgm('Chateau blanc')) AS t
INTERSECT
SELECT unnest(show_trgm('chateau cheval blanc')) AS t
ORDER BY t
) AS trigrams_in_common ;
SELECT
array_agg(t) AS in_total
FROM
(
SELECT unnest(show_trgm('Chateau blanc')) AS t
UNION
SELECT unnest(show_trgm('chateau cheval blanc')) AS t
) AS trigrams_in_total ;
And this is a way to explore the similarity of different pairs of sentences:
WITH p AS
(
SELECT
'This is just a sentence I''ve invented'::text AS f1,
'This is just a sentence I''ve also invented'::text AS f2
),
t1 AS
(
SELECT unnest(show_trgm(f1)) FROM p
),
t2 AS
(
SELECT unnest(show_trgm(f2)) FROM p
),
x AS
(
SELECT
(SELECT count(*) FROM
(SELECT * FROM t1 INTERSECT SELECT * FROM t2) AS s0)::integer AS same,
(SELECT count(*) FROM
(SELECT * FROM t1 UNION SELECT * FROM t2) AS s0)::integer AS total,
similarity(f1, f2) AS sim_2
FROM
p
)
SELECT
same, total, same::real/total::real AS sim_1, sim_2
FROM
x ;
You can check it at Rextester
Optimizing a postgres similarity query (pg_trgm + gin index)
I expect much faster results with this approach:
1. Create a GiST index with one column holding the concatenated values:
CREATE INDEX users_search_idx ON auth_user
USING gist((username || ' ' || first_name || ' ' || last_name) gist_trgm_ops);
This assumes all three columns are defined NOT NULL (you did not specify); else you need to do more.
Why not simplify with concat_ws()?
- Combine two columns and add into one new column
- Faster query with pattern-matching on multiple text fields
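A sketch of the concat_ws() variant (the function and index names are hypothetical): concat_ws() itself is only STABLE, not IMMUTABLE, so it cannot be used in an index expression directly; a trivial IMMUTABLE wrapper works around that:

```sql
-- Hypothetical IMMUTABLE wrapper, since concat_ws() is only STABLE
-- array_to_string() with two arguments skips NULL elements, like concat_ws()
CREATE OR REPLACE FUNCTION f_concat_ws(text, VARIADIC text[])
  RETURNS text
  LANGUAGE sql IMMUTABLE AS
'SELECT array_to_string($2, $1)';

CREATE INDEX users_search_ws_idx ON auth_user
USING gist (f_concat_ws(' ', username, first_name, last_name) gist_trgm_ops);
```

Queries must then use the same f_concat_ws(...) expression in WHERE and ORDER BY for the index to be applicable.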
2. Use a proper nearest-neighbor query, matching the index above:
SELECT username, email, first_name, last_name
, similarity(username , $1) AS s_username
, similarity(first_name, $1) AS s_first_name
, similarity(last_name , $1) AS s_last_name
, row_number() OVER () AS rank -- greatest similarity first
FROM auth_user
WHERE (username || ' ' || first_name || ' ' || last_name) % $1 -- !!
ORDER BY (username || ' ' || first_name || ' ' || last_name) <-> $1 -- !!
LIMIT $2;
Expressions in WHERE and ORDER BY must match the index expression!
In particular, ORDER BY rank (like you had it) will always perform poorly for a small LIMIT picking from a much larger pool of qualifying rows, because it cannot use an index directly: the sophisticated expression behind rank has to be calculated for every qualifying row, then all rows have to be sorted before the small selection of best matches can be returned. This is much, much more expensive than a true nearest-neighbor query that can pick the best results from the index directly without even looking at the rest.
row_number() with an empty window definition just reflects the ordering produced by the ORDER BY of the same SELECT.
Related answers:
- Best index for similarity function
- Search in 300 million addresses with pg_trgm
As for your item 3., I added an answer to the question you referenced that should explain it:
- PostgreSQL GIN index slower than GIST for pg_trgm?
Finding similar strings with PostgreSQL quickly
The way you have it, the similarity between every element and every other element of the table has to be calculated (almost a cross join). If your table has 1000 rows, that's already 1,000,000 (!) similarity calculations before those can be checked against the condition and sorted. It scales terribly.
Use SET pg_trgm.similarity_threshold and the % operator instead. Both are provided by the pg_trgm module. This way, a trigram GiST index can be used to great effect.
The configuration parameter pg_trgm.similarity_threshold replaced the functions set_limit() and show_limit() in Postgres 9.6. The deprecated functions still work (as of Postgres 13). Also, the performance of GIN and GiST indexes has improved in many ways since Postgres 9.1.
Try instead:
SET pg_trgm.similarity_threshold = 0.8; -- Postgres 9.6 or later
SELECT similarity(n1.name, n2.name) AS sim, n1.name, n2.name
FROM names n1
JOIN names n2 ON n1.name <> n2.name
AND n1.name % n2.name
ORDER BY sim DESC;
Faster by orders of magnitude, but still slow.
pg_trgm.similarity_threshold is a "customized" option, which can be handled like any other option. See:
- Query a parameter (postgresql.conf setting) like "max_connections"
You may want to restrict the number of possible pairs by adding preconditions (like matching first letters) before cross joining (and support that with a matching functional index). The performance of a cross join deteriorates with O(N²).
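A sketch of such a precondition (the first-letter filter is purely illustrative, and the index name is hypothetical):

```sql
-- Hypothetical expression index supporting a first-letter pre-filter
CREATE INDEX names_first_letter_idx ON names (left(name, 1));

SELECT similarity(n1.name, n2.name) AS sim, n1.name, n2.name
FROM   names n1
JOIN   names n2 ON left(n1.name, 1) = left(n2.name, 1)  -- precondition
               AND n1.name <> n2.name
               AND n1.name % n2.name
ORDER  BY sim DESC;
```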
This does not work because you cannot refer to output columns in WHERE or HAVING clauses:
WHERE ... sim > 0.8
That's according to the SQL standard (which is handled rather loosely by certain other RDBMS). On the other hand:
ORDER BY sim DESC
works because output columns can be used in GROUP BY and ORDER BY. See:
- PostgreSQL reusing computation result in select query
Test case
I ran a quick test on my old test server to verify my claims.
PostgreSQL 9.1.4. Times taken with EXPLAIN ANALYZE (best of 5).
CREATE TEMP table t AS
SELECT some_col AS name FROM some_table LIMIT 1000; -- real life test strings
First round of tests with GIN index:
CREATE INDEX t_gin ON t USING gin(name gin_trgm_ops); -- round1: with GIN index
Second round of tests with GIST index:
DROP INDEX t_gin;
CREATE INDEX t_gist ON t USING gist(name gist_trgm_ops);
New query:
SELECT set_limit(0.8);
SELECT similarity(n1.name, n2.name) AS sim, n1.name, n2.name
FROM t n1
JOIN t n2 ON n1.name <> n2.name
AND n1.name % n2.name
ORDER BY sim DESC;
GIN index used, 64 hits: total runtime: 484.022 ms
GIST index used, 64 hits: total runtime: 248.772 ms
Old query:
SELECT (similarity(n1.name, n2.name)) as sim, n1.name, n2.name
FROM t n1, t n2
WHERE n1.name != n2.name
AND similarity(n1.name, n2.name) > 0.8
ORDER BY sim DESC;
GIN index not used, 64 hits: total runtime: 6345.833 ms
GIST index not used, 64 hits: total runtime: 6335.975 ms
Otherwise identical results. The advice is good. And this is for just 1000 rows!
GIN or GiST?
GIN often provides superior read performance:
- Difference between GiST and GIN index
But not in this particular case! Nearest-neighbor ordering can be implemented quite efficiently by GiST indexes, but not by GIN indexes.
- Multicolumn index on 3 fields with heterogenous data types