Finding similar strings with PostgreSQL quickly
The way you have it, similarity between every element and every other element of the table has to be calculated (almost a cross join). If your table has 1000 rows, that's already 1,000,000 (!) similarity calculations, before those can be checked against the condition and sorted. Scales terribly.
Use SET pg_trgm.similarity_threshold
and the %
operator instead. Both are provided by the pg_trgm
module. This way, a trigram GiST index can be used to great effect.
The configuration parameter pg_trgm.similarity_threshold
replaced the functions set_limit()
and show_limit()
in Postgres 9.6. The deprecated functions still work (as of Postgres 13). Also, performance of GIN and GiST indexes improved in many ways since Postgres 9.1.
Try instead:
SET pg_trgm.similarity_threshold = 0.8; -- Postgres 9.6 or later
SELECT similarity(n1.name, n2.name) AS sim, n1.name, n2.name
FROM names n1
JOIN names n2 ON n1.name <> n2.name
AND n1.name % n2.name
ORDER BY sim DESC;
Faster by orders of magnitude, but still slow.
pg_trgm.similarity_threshold
is a "customized" option, which can be handled like any other option. See:
- Query a parameter (postgresql.conf setting) like "max_connections"
You may want to restrict the number of possible pairs by adding preconditions (like matching first letters) before cross joining (and support that with a matching functional index). The performance of a cross join deteriorates with O(N²).
This does not work because you cannot refer to output columns in WHERE
or HAVING
clauses:
WHERE ... sim > 0.8
That's according to the SQL standard (which is handled rather loosely by certain other RDBMS). On the other hand:
ORDER BY sim DESC
Works because output columns can be used in GROUP BY
and ORDER BY
. See:
- PostgreSQL reusing computation result in select query
Test case
I ran a quick test on my old test server to verify my claims.
PostgreSQL 9.1.4. Times taken with EXPLAIN ANALYZE
(best of 5).
CREATE TEMP table t AS
SELECT some_col AS name FROM some_table LIMIT 1000; -- real life test strings
First round of tests with GIN index:
CREATE INDEX t_gin ON t USING gin(name gin_trgm_ops); -- round1: with GIN index
Second round of tests with GIST index:
DROP INDEX t_gin;
CREATE INDEX t_gist ON t USING gist(name gist_trgm_ops);
New query:
SELECT set_limit(0.8);
SELECT similarity(n1.name, n2.name) AS sim, n1.name, n2.name
FROM t n1
JOIN t n2 ON n1.name <> n2.name
AND n1.name % n2.name
ORDER BY sim DESC;
GIN index used, 64 hits: total runtime: 484.022 ms
GIST index used, 64 hits: total runtime: 248.772 ms
Old query:
SELECT (similarity(n1.name, n2.name)) as sim, n1.name, n2.name
FROM t n1, t n2
WHERE n1.name != n2.name
AND similarity(n1.name, n2.name) > 0.8
ORDER BY sim DESC;
GIN index not used, 64 hits: total runtime: 6345.833 ms
GIST index not used, 64 hits: total runtime: 6335.975 ms
Otherwise identical results. Advice is good. And this is for just 1000 rows!
GIN or GiST?
GIN often provides superior read performance:
- Difference between GiST and GIN index
But not in this particular case!
This can be implemented quite efficiently by GiST indexes, but not by
GIN indexes.
- Multicolumn index on 3 fields with heterogenous data types
Scalable way to search for (similar) strings in a database
PostgreSQL Full-text search cannot do what you're looking for. However, PostgreSQL trigram similarity can do it.
You first need to install the packages with 'trigram similarity' and 'btree_gist', by executing (once) in your database:
CREATE EXTENSION pg_trgm;
CREATE EXTENSION btree_gist;
I assume you have one table that looks like this one:
CREATE TABLE sentences
(
sentence_id integer PRIMARY KEY,
sentence text
) ;
INSERT INTO sentences (sentence_id, sentence)
VALUES
(1, 'Cangaroos are not my favorite.'),
(2, 'A vegetable sentence.'),
(3, 'Cangaroos are evil.'),
(4, 'Again, some plants in my garden.'),
(5, 'I once had a cangaroo. Never again.') ;
This table needs a 'trigram index', to allow the PostgreSQL database to 'index by similarity'. This is accomplished by executing:
CREATE INDEX ON sentences USING GIST (sentence gist_trgm_ops, sentence_id) ;
To find the answers you're looking for, you execute:
-- Set the minimum similarity you want to be able to search
SELECT set_limit(0.2) ;
-- And now, select the sentences 'similar' to the input one
SELECT
similarity(sentence, 'I don''t like cangaroos') AS similarity,
sentence_id,
sentence
FROM
sentences
WHERE
/* That's how you choose your sentences:
% means 'similar to', in the trigram sense */
sentence % 'I don''t like cangaroos'
ORDER BY
similarity DESC ;
The result that you get is:
similarity | sentence_id | sentence
-----------+-------------+-------------------------------------
0.3125 | 3 | Cangaroos are evil.
0.2325 | 1 | Cangaroos are not my favorite.
0.2173 | 5 | I once had a cangaroo. Never again.
Hope this gives you what you want...
PostgreSQL, trigrams and similarity
The concept of trigram similarity relies on having any sentence divided into "trigrams" (sequences of three consecutive letters), and treating the result as a SET (i.e.: the order doesn't matter, and you don't have repeated values). Before the sentence is considered, two blank spaces are added at the beginning, and one at the end, and single spaces are replaced by double ones.
Trigrams are a special case of N-grams.
The trigram set corresponding to "Chateau blanc" is found by finding all sequences of three letters that appear on it:
chateau blanc
--- => ' c'
--- => ' ch'
--- => 'cha'
--- => 'hat'
--- => 'ate'
--- => 'tea'
--- => 'eau'
--- => 'au '
--- => 'u '
--- => ' b'
--- => ' bl'
--- => 'bla'
--- => 'lan'
--- => 'anc'
--- => 'nc '
Sorting them, and taking out repetitions gets you:
' b'
' c'
' bl'
' ch'
'anc'
'ate'
'au '
'bla'
'cha'
'eau'
'hat'
'lan'
'nc '
'tea'
This can be computed by PostgreSQL by means of the function show_trgm
:
SELECT show_trgm('Chateau blanc') AS A
A = [ b, c, bl, ch,anc,ate,au ,bla,cha,eau,hat,lan,nc ,tea]
... which has 14 trigrams. (Check pg_trgm).
And the trigram set corresponding to "Chateau Cheval Blanc" is:
SELECT show_trgm('Chateau Cheval Blanc') AS B
B = [ b, c, bl, ch,anc,ate,au ,bla,cha,che,eau,evl,hat,hev,la ,lan,nc ,tea,vla]
... which has 19 trigrams
If you count how many trigrams have both sets in common, you find that they have the following ones:
A intersect B =
[ b, c, bl, ch,anc,ate,au ,bla,cha,eau,hat,lan,nc ,tea]
and the ones they have in total are:
A union B =
[ b, c, bl, ch,anc,ate,au ,bla,cha,che,eau,evl,hat,hev,la ,lan,nc ,tea,vla]
That is, both sentences have 14 trigrams in common, and 19 in total.
The similarity is computed as:
similarity = 14 / 19
You can check it with:
SELECT
cast(14.0/19.0 as real) AS computed_result,
similarity('Chateau blanc', 'chateau cheval blanc') AS function_in_pg
and you'll see that you get: 0.736842
... which explains how similarity is computed, and why you get the values you get.
NOTE: You can compute the intersection and union by means of:
SELECT
array_agg(t) AS in_common
FROM
(
SELECT unnest(show_trgm('Chateau blanc')) AS t
INTERSECT
SELECT unnest(show_trgm('chateau chevla blanc')) AS t
ORDER BY t
) AS trigrams_in_common ;
SELECT
array_agg(t) AS in_total
FROM
(
SELECT unnest(show_trgm('Chateau blanc')) AS t
UNION
SELECT unnest(show_trgm('chateau chevla blanc')) AS t
) AS trigrams_in_total ;
And this is a way to explore the similarity of different pair of sentences:
WITH p AS
(
SELECT
'This is just a sentence I''ve invented'::text AS f1,
'This is just a sentence I''ve also invented'::text AS f2
),
t1 AS
(
SELECT unnest(show_trgm(f1)) FROM p
),
t2 AS
(
SELECT unnest(show_trgm(f2)) FROM p
),
x AS
(
SELECT
(SELECT count(*) FROM
(SELECT * FROM t1 INTERSECT SELECT * FROM t2) AS s0)::integer AS same,
(SELECT count(*) FROM
(SELECT * FROM t1 UNION SELECT * FROM t2) AS s0)::integer AS total,
similarity(f1, f2) AS sim_2
FROM
p
)
SELECT
same, total, same::real/total::real AS sim_1, sim_2
FROM
x ;
You can check it at Rextester
Postgres word_similarity not comparing words
Function inconsistent with description
Related subject on pgsql-bugs mailing list.
The substring similarity algorithm described by the author compares trigram arrays of a query string and a text. The problem is that a trigram array is optimised (duplicated trigrams are eliminated) and loses information about individual words of the text.
The query illustrates the issue:
with data(t) as (
values
('message'),
('message s'),
('message sag'),
('message sag sag'),
('message sag sage')
)
select
t as "text",
show_trgm(t) as "text trigrams",
show_trgm('sage') as "string trigrams",
cardinality(array_intersect(show_trgm(t), show_trgm('sage'))) as "common trgms"
from data;
text | text trigrams | string trigrams | common trgms
------------------+-----------------------------------------------------------+-----------------------------+--------------
message | {" m"," me",age,ess,"ge ",mes,sag,ssa} | {" s"," sa",age,"ge ",sag} | 3
message s | {" m"," s"," me"," s ",age,ess,"ge ",mes,sag,ssa} | {" s"," sa",age,"ge ",sag} | 4
message sag | {" m"," s"," me"," sa","ag ",age,ess,"ge ",mes,sag,ssa} | {" s"," sa",age,"ge ",sag} | 5
message sag sag | {" m"," s"," me"," sa","ag ",age,ess,"ge ",mes,sag,ssa} | {" s"," sa",age,"ge ",sag} | 5
message sag sage | {" m"," s"," me"," sa","ag ",age,ess,"ge ",mes,sag,ssa} | {" s"," sa",age,"ge ",sag} | 5
(5 rows)
The trigram arrays in last three rows are the same and contain all trigrams of the query string.
Obviously, the implementation is not consistent with the description of the function (the description was changed in later releases of the documentation):
Returns a number that indicates how similar the first string to the most similar word of the second string. The function searches in the second string a most similar word not a most similar substring.
My function used in the above query:
create or replace function public.array_intersect(anyarray, anyarray)
returns anyarray language sql immutable
as $$
select case
when $1 is null then $2
else
array(
select unnest($1)
intersect
select unnest($2)
)
end;
$$;
Workaround
You can easily write your own function to get more expected results:
create or replace function my_word_similarity(text, text)
returns real language sql immutable as $$
select max(similarity($1, word))
from regexp_split_to_table($2, '[^[:alnum:]]') word
$$;
Compare:
with data(t) as (
values
('message'),
('message s'),
('message sag'),
('message sag sag'),
('message sag sage')
)
select t, word_similarity('sage', t), my_word_similarity('sage', t)
from data;
t | word_similarity | my_word_similarity
------------------+-----------------+--------------------
message | 0.6 | 0.3
message s | 0.8 | 0.3
message sag | 1 | 0.5
message sag sag | 1 | 0.5
message sag sage | 1 | 1
(5 rows)
New function in Postgres 11+
There is a new function in Postgres 11+ strict_word_similarity()
which gives results expected by the author of the question:
with data(t) as (
values
('message'),
('message s'),
('message sag'),
('message sag sag'),
('message sag sage')
)
select t, word_similarity('sage', t), strict_word_similarity('sage', t)
from data;
t | word_similarity | strict_word_similarity
------------------+-----------------+------------------------
message | 0.6 | 0.3
message s | 0.8 | 0.36363637
message sag | 1 | 0.5
message sag sag | 1 | 0.5
message sag sage | 1 | 1
(5 rows)
PostgreSQL '%' operator
%
is the "similarity" operator, provided by the additional module pg_trgm.
It takes text
(or other string types) as left and right operand and returns boolean
: true
if both operands are similar enough, false
if not. The threshold is set with the GUC parameter pg_trgm.similarity_threshold
.
Related:
- Finding similar strings with PostgreSQL quickly
- Pattern matching with LIKE, SIMILAR TO or regular expressions in PostgreSQL
Not to be confused with the modulo operator %
. Same symbol, but the mathematical operator takes numeric types as left and right operand.
In Postgres, operators are defined by the operator name (like %
) plus left and right operands. Gory details in the manual chapter Operator Type Resolution. The casual user hardly needs to know any of this. Typically it just works.
Related:
- Can PostgreSQL index array columns?
PostgreSQL LIKE query performance variations
FTS does not support LIKE
The previously accepted answer was incorrect. Full Text Search with its full text indexes is not for the LIKE
operator at all, it has its own operators and doesn't work for arbitrary strings. It operates on words based on dictionaries and stemming. It does support prefix matching for words, but not with the LIKE
operator:
- Get partial match from GIN indexed TSVECTOR column
Trigram index for LIKE
Install the additional module pg_trgm
which provides operator classes for GIN and GiST trigram indexes to support all LIKE
and ILIKE
patterns, not just left-anchored ones:
Example index:
CREATE INDEX tbl_col_gin_trgm_idx ON tbl USING gin (col gin_trgm_ops);
Or:
CREATE INDEX tbl_col_gist_trgm_idx ON tbl USING gist (col gist_trgm_ops);
- Difference between GiST and GIN index
Example query:
SELECT * FROM tbl WHERE col LIKE '%foo%'; -- leading wildcard
SELECT * FROM tbl WHERE col ILIKE '%foo%'; -- works case insensitively as well
Trigrams? What about shorter strings?
Words with less than 3 letters in indexed values still work. The manual:
Each word is considered to have two spaces prefixed and one space
suffixed when determining the set of trigrams contained in the string.
And search patterns with less than 3 letters? The manual:
For both
LIKE
and regular-expression searches, keep in mind that a
pattern with no extractable trigrams will degenerate to a full-index scan.
Meaning, that index / bitmap index scans still work (query plans for prepared statement won't break), it just won't buy you better performance. Typically no big loss, since 1- or 2-letter strings are hardly selective (more than a few percent of the underlying table matches) and index support would not improve performance to begin with, because a full table scan is faster.
text_pattern_ops
or COLLATE "C"
for prefix matching
Update
Since Postgres 9.1, COLLATE "C"
is better. See:
- Is there a difference between text_pattern_ops and COLLATE "C"?
Original answer
For just left-anchored patterns (no leading wildcard) you get the optimum with a suitable operator class for a btree index: text_pattern_ops
or varchar_pattern_ops
. Both built-in features of standard Postgres, no additional module needed. Similar performance, but much smaller index.
Example index:
CREATE INDEX tbl_col_text_pattern_ops_idx ON tbl(col text_pattern_ops);
Example query:
SELECT * FROM tbl WHERE col LIKE 'foo%'; -- no leading wildcard
Or, if you should be running your database with the 'C' locale (effectively no locale), then everything is sorted according to byte order anyway and a plain btree index with default operator class does the job.
Further reading
- Pattern matching with LIKE, SIMILAR TO or regular expressions in PostgreSQL
- How is LIKE implemented?
- Finding similar strings with PostgreSQL quickly
Similar UTF-8 strings for autocomplete field
You are not using the operator class provided by the pg_trgm
module. Create an index like this:
CREATE INDEX label_Lower_unaccent_trgm_idx
ON test_trgm USING gist (lower(unaccent_text(label)) gist_trgm_ops);
Originally, I had a GIN index here, but a GiST is typically better suited for this kind of query because it can return values sorted by similarity. See:
- Matching patterns between multiple columns
- Finding similar strings with PostgreSQL quickly
Your query has to match the index expression to be able to use it.
SELECT label
FROM the_table
WHERE lower(unaccent_text(label)) % 'fil'
ORDER BY similarity(label, 'fil') DESC; -- ok to use original string here
However, "filbert" and "filé powder" are not actually very similar to "fil" according to the %
operator. I suspect you really want:
SELECT label
FROM the_table
WHERE lower(unaccent_text(label)) LIKE 'fil%' -- !
ORDER BY similarity(label, 'fil') DESC; -- ok to use original string here
This finds all strings starting with the search string, and sorts the best matches according to the %
operator first.
The expression can use a GIN or GiST index since PostgreSQL 9.1! The manual:
Beginning in PostgreSQL 9.1, these index types also support index
searches forLIKE
andILIKE
, for example
If you actually meant to use the %
operator:
Try adapting the threshold for the similarity operator %
:
SET pg_trgm.similarity_threshold = 0.1; -- Postgres 9.6 or later
SELECT set_limit(0.1); -- Postgres 9.5 or older
Or even lower? The default is 0.3. Just to see whether the threshold filters additional matches.
Related Topics
Why Can't I Seem to Force Oracle 11G to Consume More Cpus for a Single SQL Query
Which SQL Query Is Faster? Filter on Join Criteria or Where Clause
SQL Server Deterministic User-Defined Function
How to Count Instances of Character in SQL Column
Scope of Temporary Tables in SQL Server
Know Relationships Between All the Tables of Database in SQL Server
Update If Exists Else Insert in SQL Server 2008
Eliminating Duplicate Values Based on Only One Column of the Table
What Is Easier to Read in Exists Subqueries
MySQL - How to Front Pad Zip Code with "0"
Db Design to Use Sub-Type or Not
Update One Table with Data from Another
MySQL Equivalent of Decode Function in Oracle
How Does a Recursive Cte Run, Line by Line
Access to Result Sets from Within Stored Procedures Transact-SQL SQL Server