PostgreSQL GIN Index Slower Than GiST for pg_trgm

PostgreSQL GIN index slower than GIST for pg_trgm?

Some issues stand out:

First, consider upgrading to a current version of Postgres. At the time of writing that's Postgres 9.6 or Postgres 10 (currently beta). Since Postgres 9.4 there have been multiple improvements for GIN indexes, for the additional module pg_trgm, and for big data in general.

Next, you need much more RAM, in particular a higher work_mem setting. I can tell from this line in the EXPLAIN output:

Heap Blocks: exact=7625 lossy=223807

"lossy" in the details for a Bitmap Heap Scan (with your particular numbers) indicates a dramatic shortage of work_mem. Postgres only collects block addresses in the bitmap index scan instead of row pointers because that's expected to be faster with your low work_mem setting (can't hold exact addresses in RAM). Many more non-qualifying rows have to be filtered in the following Bitmap Heap Scan this way. This related answer has details:

  • “Recheck Cond:” line in query plans with a bitmap index scan

But don't set work_mem too high without considering the whole situation:

  • Optimize simple query using ORDER BY date and text
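As a minimal sketch (the value is purely illustrative and must fit your available RAM and number of concurrent sessions), work_mem can be raised for a single session before re-running the query:

SET work_mem = '256MB';  -- illustrative value; affects this session only
SHOW work_mem;           -- verify the effective setting
-- then re-run EXPLAIN (ANALYZE, BUFFERS); the "lossy" heap blocks should shrink or disappear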

There may be other problems, like index or table bloat or further configuration bottlenecks. But if you fix just these two items, the query should already be much faster.

Also, do you really need to retrieve all 40k rows in the example? You probably want to add a small LIMIT to the query and make it a "nearest-neighbor" search - in which case a GiST index is the better choice after all, because nearest-neighbor ordering is fast with GiST but not available with GIN. Example (and see the sketch after the link):

  • Best index for similarity function
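A minimal sketch of such a query, assuming a GiST trigram index on col (table and column names are placeholders, not from the original question):

SELECT id, col
FROM tbl
WHERE col % 'search term'            -- optional similarity filter
ORDER BY col <-> 'search term'       -- trigram distance; can walk the GiST index
LIMIT 20;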

Difference between GiST and GIN index

I don't think I could explain it better than the manual already does:

In choosing which index type to use, GiST or GIN, consider these
performance differences:

  • GIN index lookups are about three times faster than GiST

  • GIN indexes take about three times longer to build than GiST

  • GIN indexes are moderately slower to update than GiST indexes, but about 10 times slower if fast-update support was disabled [...]

  • GIN indexes are two-to-three times larger than GiST indexes

The link and quote refer to the manual for Postgres 9.4. Its size and performance estimates already seemed slightly outdated; with Postgres 9.4 the odds have shifted substantially in favor of GIN.

The release notes of Postgres 9.4 include:

  • Reduce GIN index size (Alexander Korotkov, Heikki Linnakangas) [...]

  • Improve speed of multi-key GIN lookups (Alexander Korotkov, Heikki
    Linnakangas)

Size and performance estimates have since been removed from the manual.

Note that there are special use cases that require one or the other.
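For pg_trgm specifically, the two flavors are created like this (a sketch; tbl and col are placeholders):

CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- GIN: typically faster lookups, but bigger and slower to build / update
CREATE INDEX tbl_col_trgm_gin_idx ON tbl USING gin (col gin_trgm_ops);

-- GiST: lossy, but supports nearest-neighbor ordering with <->
CREATE INDEX tbl_col_trgm_gist_idx ON tbl USING gist (col gist_trgm_ops);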

One thing you misunderstood: you never get wrong results from a GiST index. The index operates on hash values, which can lead to false positives in the index itself, but that should only become relevant with a very large number of different words in your documents. False positives are eliminated by re-checking the actual row in any case. The manual:

A GiST index is lossy, meaning that the index may produce false
matches, and it is necessary to check the actual table row to
eliminate such false matches. (PostgreSQL does this automatically when needed.)


Why are PostgreSQL Text-Search GiST indexes so much slower than GIN indexes?

Try:

CREATE INDEX posttext_pattern_idx ON Posts (PostText varchar_pattern_ops);

which creates a plain B-tree index suitable for prefix queries (varchar_pattern_ops is a B-tree operator class, so it is not combined with USING GIST). See the PostgreSQL docs on Operator Classes and Operator Families. The @@ operator is only sensible on tsvector term vectors; the B-tree index (with varchar_pattern_ops) will give excellent results with anchored LIKE patterns.
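A sketch of the kind of query that index serves well (the search string is illustrative):

SELECT *
FROM Posts
WHERE PostText LIKE 'some prefix%';   -- anchored pattern: can use the index
-- a pattern with a leading wildcard ('%term%') cannot use this B-tree index;
-- that is where a pg_trgm GIN or GiST index helps instead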

Optimizing a postgres similarity query (pg_trgm + gin index)

I expect much faster results with this approach:

1.

Create a GiST index on a single expression holding the concatenated values:

CREATE INDEX users_search_idx ON auth_user
USING gist((username || ' ' || first_name || ' ' || last_name) gist_trgm_ops);

This assumes all three columns are defined NOT NULL (you did not specify). Otherwise you need to do more (see the sketch below the links).

Why not simplify with concat_ws()? Because concat_ws() is only STABLE, not IMMUTABLE, so it cannot be used directly in an index expression:

  • Combine two columns and add into one new column
  • Faster query with pattern-matching on multiple text fields
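If any of the three columns can be NULL, a sketch of that "more": wrap each column in coalesce(), both in the index expression and in every query that is supposed to use it (replacing the index above):

CREATE INDEX users_search_idx ON auth_user
USING gist ((coalesce(username, '') || ' ' || coalesce(first_name, '') || ' ' || coalesce(last_name, '')) gist_trgm_ops);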

2.

Use a proper nearest-neighbor query, matching the above index:

SELECT username, email, first_name, last_name
, similarity(username , $1) AS s_username
, similarity(first_name, $1) AS s_first_name
, similarity(last_name , $1) AS s_last_name
, row_number() OVER () AS rank -- greatest similarity first
FROM auth_user
WHERE (username || ' ' || first_name || ' ' || last_name) % $1 -- !!
ORDER BY (username || ' ' || first_name || ' ' || last_name) <-> $1 -- !!
LIMIT $2;

Expressions in WHERE and ORDER BY must match the index expression!

In particular ORDER BY rank (like you had it) will always perform poorly for a small LIMIT picking from a much larger pool of qualifying rows, because it cannot use an index directly: The sophisticated expression behind rank has to be calculated for every qualifying row, then all have to be sorted before the small selection of best matches can be returned. This is much, much more expensive than a true nearest-neighbor query that can pick the best results from the index directly without even looking at the rest.

row_number() with an empty window definition just reflects the ordering produced by the ORDER BY of the same SELECT.
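For contrast, a sketch of an ORDER BY rank style formulation (roughly the shape described above; the details are illustrative): every qualifying row gets its similarities computed and sorted before the LIMIT can apply:

SELECT username, email, first_name, last_name
FROM auth_user
WHERE username % $1 OR first_name % $1 OR last_name % $1
ORDER BY greatest(similarity(username  , $1)
                , similarity(first_name, $1)
                , similarity(last_name , $1)) DESC   -- computed for every match, then sorted
LIMIT $2;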

Related answers:

  • Best index for similarity function
  • Search in 300 million addresses with pg_trgm

As for your item 3, I added an answer to the question you referenced that should explain it:

  • PostgreSQL GIN index slower than GIST for pg_trgm?

Quick Search with autocorrect (GIN INDEX and PG_TRGM extension)

First off, your table definition creates two unique indices on (email_address). Don't. Drop the UNIQUE constraint, keep the PK:

CREATE TABLE email (
  email_address text PRIMARY KEY
, person_id     uuid NOT NULL  -- bigint?
);

(Also not sure why you would need uuid for person_id. There aren't nearly enough people in the world to justify more than a bigint.)

Next, since you want to ...

LIMIT the returned results to only the highly similar email addresses,

I suggest a nearest neighbor search. Create a GiST index for this purpose instead of the GIN:

CREATE INDEX email_address_trigram_gist_idx ON email USING gist (email_address gist_trgm_ops);

And use a query like this:

SELECT *, similarity('tesd100@gmail.com', email_address)
FROM email
WHERE email_address % 'tesd100@gmail.com'
ORDER BY email_address <-> 'tesd100@gmail.com' -- note the use of the operator <->
LIMIT 10;

Quoting the manual:

This can be implemented quite efficiently by GiST indexes, but not by
GIN indexes. It will usually beat the first formulation when only a
small number of the closest matches is wanted.

While working with a small LIMIT, there is probably no need to set pg_trgm.similarity_threshold very high because this query gives you the best matches first.
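If you do want to adjust it, the threshold for the % operator is an ordinary setting (0.3 is the default; the value below is illustrative):

SHOW pg_trgm.similarity_threshold;        -- 0.3 by default
SET pg_trgm.similarity_threshold = 0.2;   -- session-level; lower = % matches more loosely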

Related:

  • Search in 300 million addresses with pg_trgm
  • Finding similar strings with PostgreSQL quickly
  • PostgreSQL GIN index slower than GIST for pg_trgm?

How to improve Postgres query with pg_trgm?

I'm assuming you're largely having trouble with (a) optimising the query and indices, and (b) getting the data for a large number of queries in some realistic time. I'll address both, one by one.

  1. Optimising the query and indices (comparing multiple columns)

    I think you have two concerns here:

    • Supporting queries on a concatenated string while avoiding NULLs.

      This can be handled with coalesce(), so the similarity expression should look like:

      similarity(coalesce(col1, '') || ' ' || coalesce(col2, '') || ' ' || coalesce(col3, ''), 'some text')

      And the corresponding index to support this query would be:

      CREATE INDEX trgm_idx ON goods_goods USING GIN ((coalesce(col1, '') || ' ' || coalesce(col2, '') || ' ' || coalesce(col3, '')) gin_trgm_ops);

    • Supporting queries filtering on columns of different types (WHERE name % 'Some text' AND some_id = 123)

      You can create a multicolumn index like this to support an index search (including the plain integer column in a GIN index requires the additional module btree_gin):

      CREATE INDEX trgm_idx ON goods_goods USING GIN (name gin_trgm_ops, some_id);

    If you wish to combine the two above, the corresponding index would be (a matching query sketch follows after this list):

    CREATE INDEX trgm_idx ON goods_goods USING GIN ((coalesce(col1, '') || ' ' || coalesce(col2, '') || ' ' || coalesce(col3, '')) gin_trgm_ops, some_id);

  2. How to get the data for a large number of queries in some realistic time (this addresses the questions you posed in 2, 3 & 4)

    The solutions above for optimising the query and indices will certainly improve query performance. If you want to go further, you could open multiple connections to the database and fire queries in parallel, which gives a certain performance gain; most popular ORMs and connection pools support this. You could also add read replicas of your primary Postgres server to get more compute power and more connections. Finally, you could partition your data and run the queries against individual partitions instead of a single large table, then merge or recompute the results.
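A sketch of a query that can use the combined index from item 1 above (values are placeholders; remember that the plain integer column requires the additional module btree_gin):

SELECT *
FROM goods_goods
WHERE (coalesce(col1, '') || ' ' || coalesce(col2, '') || ' ' || coalesce(col3, '')) % 'some text'
AND some_id = 123;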

Postgres hstore: GIN vs GiST index performance

The documentation states what the situation is "in general".

However, you aren't running PostgreSQL "in general", you are running it on specific hardware with a specific pattern of use.

So - if you care a lot, you'll want to test it yourself. A GiST index always requires re-checking its condition; however, if your queries end up doing further checks anyway, a GIN index might not win much there. There are also all the usual issues around cache usage etc.
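A minimal sketch of such a test on an hstore column (table and column names are placeholders): build one index variant, run your real queries under EXPLAIN (ANALYZE, BUFFERS), then swap in the other variant and compare.

CREATE EXTENSION IF NOT EXISTS hstore;

CREATE INDEX tbl_attrs_gist_idx ON tbl USING gist (attrs);    -- variant 1: GiST
-- CREATE INDEX tbl_attrs_gin_idx ON tbl USING gin (attrs);   -- variant 2: GIN (test separately)

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM tbl WHERE attrs @> 'color=>blue';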

For my usage, on smaller databases with moderate update rates, I've been happy enough with GiST. I've seen a 50% improvement in speed with GIN (across a whole query), but it's not been worth the slower indexing. If I was building a huge archive server it might be different.


