Select Distinct Is Slower Than Expected on My Table in Postgresql

While there is no index skip scan in Postgres yet, you can emulate it:

WITH RECURSIVE cte AS (
   (  -- parentheses required
   SELECT product_id
   FROM   tickers
   ORDER  BY 1
   LIMIT  1
   )
   UNION ALL
   SELECT l.*
   FROM   cte c
   CROSS  JOIN LATERAL (
      SELECT product_id
      FROM   tickers t
      WHERE  t.product_id > c.product_id  -- lateral reference
      ORDER  BY 1
      LIMIT  1
      ) l
   )
TABLE cte;

With an index on (product_id) and only 40 unique product IDs in the table, this should be Fast. With capital F.

The PK index on (product_id, trade_id) is good for it, too!
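For reference, a minimal sketch of such an index (the index name is an assumption; the existing PK index serves just as well):

CREATE INDEX tickers_product_id_idx ON tickers (product_id);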

With only very few rows per product_id (the opposite of your data distribution), DISTINCT / DISTINCT ON would be as fast or faster.
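For comparison, the plain variant, which only pays off with few rows per product_id:

SELECT DISTINCT product_id
FROM   tickers
ORDER  BY 1;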

Work to implement index skip scans is ongoing.

See:

  • Select first row in each GROUP BY group?
  • Optimize GROUP BY query to retrieve latest row per user
  • Is a composite index also good for queries on the first field?

Extremely slow distinct query on indexed column

distinct values ... 300 million rows ... about 400 of them ... column ... indexed.

There are much faster techniques for this. Emulating a loose index scan (a.k.a. skip scan), and assuming my_date is defined NOT NULL (or we can ignore NULL values):

WITH RECURSIVE cte AS (
   SELECT min(my_date) AS my_date
   FROM   my_table
   UNION ALL
   SELECT (SELECT my_date
           FROM   my_table
           WHERE  my_date > cte.my_date
           ORDER  BY my_date
           LIMIT  1)
   FROM   cte
   WHERE  my_date IS NOT NULL
   )
TABLE cte;

Related:

  • Optimize GROUP BY query to retrieve latest record per user

Using the index you mentioned, it should finish in milliseconds.
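That index, roughly (the name is an assumption; a plain b-tree on the column is all the recursive CTE needs):

CREATE INDEX my_table_my_date_idx ON my_table (my_date);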

Oracle DB ... 11 seconds.

Because Oracle has native index skip scans and Postgres does not. There are ongoing efforts to implement similar functionality in Postgres 12.

Currently (Postgres 11), while the index is used to good effect, even in an index-only scan, Postgres cannot skip ahead and has to read index tuples in sequence. Without LIMIT, the complete index has to be scanned. Hence we see in your EXPLAIN output:

Index Only Scan ... rows=298788038

The suggested new query achieves the same while reading only around 400 index tuples (one per distinct value). Big difference.

With LIMIT (and no ORDER BY!) like you tested, Postgres stops as soon as enough rows are retrieved. Increasing the limit has a linear effect. But if the number of rows per distinct value can vary, so does the added cost.
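A hypothetical form of such a test, to illustrate (not taken from the question):

SELECT DISTINCT my_date
FROM   my_table
LIMIT  10;   -- Postgres can stop reading the index once 10 distinct values have been found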

Fastest way to PostgreSQL Distinct and Format

In your second attempt you get distinct dates from the sub-query, convert all of them to a string representation, and then select the distinct ones again. That is rather inefficient. It is better to first extract the distinct years from creation_date in a sub-query and simply cast those to text in the main query:

SELECT year::text
FROM  (
   SELECT DISTINCT extract(year FROM creation_date) AS year
   FROM   acs_objects
   ) AS distinct_years;

If you create an INDEX on the table, the query should run much faster still:

CREATE INDEX really_fast ON acs_objects((extract(year FROM creation_date)));

However, this may impact other uses of your table, in particular if you have many modifying statements (insert, update, delete). And this will only work if creation_date has a data type of date or timestamp (specifically not timestamp with time zone).
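If creation_date should turn out to be timestamp with time zone after all, a common workaround (an assumption, not part of the original answer) is to pin the expression to a fixed time zone, which makes it immutable and therefore indexable - queries then have to repeat the exact same expression to use the index:

CREATE INDEX acs_objects_year_utc_idx
ON acs_objects ((extract(year FROM creation_date AT TIME ZONE 'UTC')));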

The option below looked promising because it does not use a sub-query, but it is in fact much slower, probably because DISTINCT is applied to strings:

SELECT DISTINCT extract(year FROM creation_date)::text
FROM acs_objects;

Postgres slow distinct query for multiple columns

My data doesn't change that frequently, so I ended up creating a materialized view

CREATE MATERIALIZED VIEW tbl1_distinct_view AS
SELECT DISTINCT col1, col2, col3, col4
FROM   tbl1;

that I refresh with a cron job once a day at 6 am:

0 6 * * * psql -U mydb mydb -c 'REFRESH MATERIALIZED VIEW tbl1_distinct_view;'
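One optional refinement (an assumption, not part of the original setup): with a unique index on the view and no NULLs in those columns, the cron job can use REFRESH MATERIALIZED VIEW CONCURRENTLY, so readers are not blocked during the refresh:

CREATE UNIQUE INDEX tbl1_distinct_view_uni
ON tbl1_distinct_view (col1, col2, col3, col4);

-- the cron entry then becomes:
-- 0 6 * * * psql -U mydb mydb -c 'REFRESH MATERIALIZED VIEW CONCURRENTLY tbl1_distinct_view;'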

Slow query with distinct/group by on varchar column with Postgres

There is a Seq Scan on company_industry in both query plans that should really be a (bitmap) index scan. The same goes for Seq Scan on company.

This seems to be an issue of missing indexes - or something else is not right in your db. If something seems wrong, take a backup before you proceed. Check whether cost settings and statistics are valid:

  • Keep PostgreSQL from sometimes choosing a bad query plan

If settings are good, I would then check the relevant indices (as detailed below). Maybe REINDEX can fix it:

REINDEX TABLE company;
REINDEX TABLE company_industry;

Maybe you need to do more:

  • Optimize Postgres query on timestamp range

Also, you can simplify the query:

SELECT c.city_name AS city
FROM company_industry ci
JOIN company c ON c.id = ci.company_id
WHERE ci.industry_id = 288
GROUP BY 1;

Notes

If your PK constraint is on (company_id, industry_id), add another (unique) index on (industry_id, company_id) (reversed order! - see the sketch below). Why?

  • Is a composite index also good for queries on the first field?
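A sketch of that reversed index, assuming the table and column names from the question:

CREATE UNIQUE INDEX company_industry_reverse_idx
ON company_industry (industry_id, company_id);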

Seq Scan on company is equally bothersome. It seems like there is no index on company(id), but your ER diagram indicates a PK, so that cannot be?

The fastest option would be to have a multicolumn index on (id, city_name) - if (and only if) you get index-only scans out of it.
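A sketch of such a covering index (the name is an assumption):

CREATE INDEX company_id_city_name_idx ON company (id, city_name);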

Since you already have the id of the given industry, you don't need to include the table industry at all.

No need for parentheses around the expression(s) in the ON clause.

This is unfortunate:

Unfortunately I do currently not have the liberty of being able to change the database schema to something more normalized.

Your simple schema makes sense for small tables with little redundancy and hardly any strain on available cache memory. But city names are probably highly redundant in big tables. Normalizing would shrink table and index sizes considerably, which is the most important factor for performance.

A de-normalized form with redundant storage can sometimes speed up targeted queries, sometimes not; it depends. But it always affects everything else adversely. Redundant storage eats more of your available cache, so other data has to be evicted sooner. Even if you gain something locally, you may lose overall.

In this particular case it would also be considerably cheaper to get distinct values for a city_id int column, because integer values are smaller and faster to compare than (potentially long) strings. A multicolumn index on (id, city_id) in company would be smaller than the same for (id, city_name) and faster to process. One more join after folding many duplicates is comparatively cheap.
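A rough sketch of what that could look like (names and types are assumptions):

CREATE TABLE city (
   city_id   serial PRIMARY KEY
 , city_name text NOT NULL UNIQUE
);

ALTER TABLE company ADD COLUMN city_id int REFERENCES city(city_id);
-- after migrating the data, company.city_name can be dropped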

If you need top performance, you can always add a MATERIALIZED VIEW for the special purpose with pre-computed results (readily aggregated and with an index on industry_id), but avoid massively redundant data in your prime tables.
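For example, a materialized view along these lines (all names are assumptions), pre-aggregated and indexed on industry_id:

CREATE MATERIALIZED VIEW city_per_industry AS
SELECT ci.industry_id, c.city_name AS city
FROM   company_industry ci
JOIN   company c ON c.id = ci.company_id
GROUP  BY 1, 2;

CREATE INDEX city_per_industry_industry_id_idx ON city_per_industry (industry_id);

-- then: SELECT city FROM city_per_industry WHERE industry_id = 288;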

Slow select - PostgreSQL

Your query is doing a full table scan on the larger table. An obvious speed-up is to add an index on track_event(inboundid, eventid) - a sketch follows after the rewritten query below. Postgres should be able to use the index on your query as written. You can rewrite the query as:

SELECT te.eventid
FROM   track_event te
JOIN   temp_message tm ON te.inboundid = tm.messageid;

which should definitely use the index. (You might need select distinct te.eventid if there are duplicates in the temp_message table.)
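A sketch of the index suggested above (the index name is an assumption):

CREATE INDEX track_event_inboundid_eventid_idx
ON track_event (inboundid, eventid);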

EDIT:

A last rewrite to try is to invert the query:

SELECT (SELECT eventid
        FROM   track_event te
        WHERE  tm.messageid = te.inboundid) AS eventid
FROM   temp_message tm;

This should force the use of the index. If there are non-matches, you might want:

SELECT eventid
FROM  (
   SELECT (SELECT eventid
           FROM   track_event te
           WHERE  tm.messageid = te.inboundid) AS eventid
   FROM   temp_message tm
   ) tm
WHERE  eventid IS NOT NULL;

