Solution for Speeding Up a Slow Select Distinct Query in Postgres

Solution for speeding up a slow SELECT DISTINCT query in Postgres

Your DISTINCT is causing it to sort the output rows in order to find duplicates. If you put an index on the column(s) selected by the query, the database may be able to read them out in index order and save the sort step. A lot will depend on the details of the query and the tables involved; saying you "know the problem is with the DISTINCT" really limits the scope of available answers.
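
As a minimal sketch with hypothetical table and column names, an index matching the DISTINCT column is what lets Postgres read values in index order and possibly skip the sort:

CREATE INDEX some_table_some_column_idx ON some_table (some_column);  -- hypothetical names

SELECT DISTINCT some_column FROM some_table;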

SELECT DISTINCT is slower than expected on my table in PostgreSQL

While there is no index skip scan in Postgres yet, you can emulate it with a recursive CTE:

WITH RECURSIVE cte AS (
   (  -- parentheses required
   SELECT product_id
   FROM   tickers
   ORDER  BY 1
   LIMIT  1
   )
   UNION ALL
   SELECT l.*
   FROM   cte c
   CROSS  JOIN LATERAL (
      SELECT product_id
      FROM   tickers t
      WHERE  t.product_id > c.product_id  -- lateral reference
      ORDER  BY 1
      LIMIT  1
      ) l
   )
TABLE  cte;

With an index on (product_id) and only 40 unique product IDs in the table this should be Fast. With capital F.
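
If such an index does not exist yet, it is a one-liner (the index name is an assumption, the table name tickers is taken from the query above):

CREATE INDEX tickers_product_id_idx ON tickers (product_id);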

The PK index on (product_id, trade_id) is good for it, too!

With only very few rows per product_id (the opposite of your data distribution), DISTINCT / DISTINCT ON would be as fast or faster.

Work to implement index skip scans is ongoing.

See:

  • Select first row in each GROUP BY group?
  • Optimize GROUP BY query to retrieve latest row per user
  • Is a composite index also good for queries on the first field?

Solution for speeding up a slow SELECT DISTINCT query in SQL Server

<Object Database="[BigData]" Schema="[dbo]" Table="[vwDistinct]" Index="[cdxDistinct]" IndexKind="ViewClustered" Storage="RowStore" /> 

shows that your view is indexed. Based on your comment where you state

Around 2.5-3 million rows and there are about 100 distinct values.

The query

select distinct [somecolumn] 
from bigtable

without the view will scan 2.5+ million rows in the table's index to find all the distinct values.

The view, however, will only contain 100 rows. So when the indexed view exists, the engine can scan the view's clustered index instead to find all the distinct values.

The cost is that all inserts and any updates that modify somecolumn will be more expensive.
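
For reference, a sketch of what such an indexed view could look like, reusing the names from the question and the plan fragment above (the actual view definition is not shown in the original, so this is an assumption):

CREATE VIEW dbo.vwDistinct
WITH SCHEMABINDING
AS
SELECT somecolumn, COUNT_BIG(*) AS cnt   -- COUNT_BIG(*) is required in an indexed view with GROUP BY
FROM   dbo.bigtable
GROUP  BY somecolumn;
GO
CREATE UNIQUE CLUSTERED INDEX cdxDistinct ON dbo.vwDistinct (somecolumn);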

Postgres slow distinct query for multiple columns

My data doesn't change that frequently, so I ended up creating a materialized view:

CREATE MATERIALIZED VIEW tbl1_distinct_view AS
SELECT DISTINCT col1, col2, col3, col4
FROM   tbl1;

that I refresh with a cronjob once a day at 6am

0 6 * * * psql -U mydb mydb -c 'REFRESH MATERIALIZED VIEW tbl1_distinct_view;'
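
If readers should not be blocked during the refresh, one option (Postgres 9.4+, and assuming the combination of the four columns is unique) is to add a unique index and refresh concurrently; a sketch with an assumed index name:

CREATE UNIQUE INDEX tbl1_distinct_view_uni
ON tbl1_distinct_view (col1, col2, col3, col4);

REFRESH MATERIALIZED VIEW CONCURRENTLY tbl1_distinct_view;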

Slow query with distinct/group by on varchar column with Postgres

There is a Seq Scan on company_industry in both query plans that should really be a (bitmap) index scan. The same goes for Seq Scan on company.

Seems to be an issue of missing indexes - or something is not right in your db. If something seems wrong, take a backup before you proceed. Check if cost settings and statistics are valid:

  • Keep PostgreSQL from sometimes choosing a bad query plan

If settings are good, I would then check the relevant indices (as detailed below). Maybe REINDEX can fix it:

REINDEX TABLE company;
REINDEX TABLE company_industry;

Maybe you need to do more:

  • Optimize Postgres query on timestamp range

Also, you can simplify the query:

SELECT c.city_name AS city
FROM company_industry ci
JOIN company c ON c.id = ci.company_id
WHERE ci.industry_id = 288
GROUP BY 1;

Notes

If your PK constraint is on (company_id, industry_id) add another (unique) index on (industry_id, company_id) (reversed order!). Why?

  • Is a composite index also good for queries on the first field?
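
A sketch of the suggested reversed index (the index name is an assumption):

CREATE UNIQUE INDEX company_industry_reverse_idx
ON company_industry (industry_id, company_id);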

Seq Scan on company is equally bothersome. It seems like there is no index on company(id), but your ER diagram indicates a PK, so that cannot be?

The fastest option would be to have a multicolumn index on (id, city_name) - if (and only if) you get index-only scans out of it.
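
Again as a sketch with an assumed index name:

CREATE INDEX company_id_city_name_idx ON company (id, city_name);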

Since you already have the id of the given industry, you don't need to include the industry table at all.

No need for parentheses around the expression(s) in the ON clause.

This is unfortunate:

Unfortunately I do currently not have the liberty of being able to change the database schema to something more normalized.

Your simple schema makes sense for small tables with little redundancy and hardly any strain on available cache memory. But city names are probably highly redundant in big tables. Normalizing would shrink table and index sizes considerably, which is the most important factor for performance.

A de-normalized form with redundant storage can sometimes speed up targeted queries and sometimes not; it depends. But it always affects everything else adversely. Redundant storage eats more of your available cache, so other data has to be evicted sooner. Even if you gain something locally, you may lose overall.

In this particular case it would also be considerably cheaper to get distinct values for a city_id int column, because integer values are smaller and faster to compare than (potentially long) strings. A multicolumn index on (id, city_id) in company would be smaller than the same for (id, city_name) and faster to process. One more join after folding many duplicates is comparatively cheap.
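
A rough sketch of that normalization (table and column names are assumptions):

CREATE TABLE city (
   city_id   serial PRIMARY KEY
 , city_name text NOT NULL UNIQUE
);

ALTER TABLE company ADD COLUMN city_id int REFERENCES city;
-- backfill city and company.city_id from the existing company.city_name values,
-- then drop company.city_name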

If you need top performance, you can always add a MATERIALIZED VIEW for the special purpose with pre-computed results (readily aggregated and with an index on industry_id), but avoid massively redundant data in your prime tables.
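
Such a materialized view might look like this (names are assumptions, mirroring the simplified query above):

CREATE MATERIALIZED VIEW industry_city AS
SELECT ci.industry_id, c.city_name AS city
FROM   company_industry ci
JOIN   company c ON c.id = ci.company_id
GROUP  BY 1, 2;

CREATE INDEX industry_city_industry_id_idx ON industry_city (industry_id);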

Slow query in postgres using count distinct

Assuming actual date types.

SELECT d.day, count(DISTINCT o.id) AS users_past_year
FROM  (
   SELECT generate_series(min(order_date), max(order_date), '1 day')::date AS day
   FROM   orders             -- single query
   ) d
LEFT  JOIN (                 -- fold duplicates on same day right away
   SELECT id, order_date
   FROM   orders
   GROUP  BY 1, 2
   ) o ON o.order_date >  d.day - interval '1 year'  -- exclude
      AND o.order_date <= d.day                      -- include
GROUP BY 1
ORDER BY 1;

Folding multiple purchases from the same user on the same day first only makes sense if that's a common thing. Else it will be faster to omit that step and simply left-join to the table orders instead.

It's rather odd that orders.id would be the ID of the user. Should be named something like user_id.

If you are not comfortable with generate_series() in the SELECT list (which works just fine), you can replace that with a LATERAL JOIN in Postgres 9.3+.

FROM  (
   SELECT min(order_date) AS a
        , max(order_date) AS z
   FROM   orders
   ) x
 , generate_series(x.a, x.z, '1 day') AS d(day)
LEFT  JOIN ...

Note that day is type timestamp in this case. Works the same. You may want to cast.

General performance tips

I understand this is a read-only table for a single user. This simplifies things.

You already seem to have an index:

CREATE INDEX orders_mult_idx ON orders (order_date, id);

That's good.

Some things to try:

Basics

Of course, the usual performance advice applies:

https://wiki.postgresql.org/wiki/Slow_Query_Questions

https://wiki.postgresql.org/wiki/Performance_Optimization

Streamline table

Cluster your table using this index once:

CLUSTER orders USING orders_mult_idx;

This should help a bit. It also effectively runs VACUUM FULL on the table, which removes any dead rows and compacts the table if applicable.

Better statistics

ALTER TABLE orders ALTER COLUMN number SET STATISTICS 1000;
ANALYZE orders;

Explanation here:

  • Configuration parameter work_mem in PostgreSQL on Linux

Allocate more RAM

Make sure you have ample resources allocated, in particular for shared_buffers and work_mem. You can raise work_mem temporarily for your session; shared_buffers requires a server restart to change.
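
For example, to raise work_mem for the current session only (the value is just a placeholder):

SET work_mem = '256MB';
-- run the query ...
RESET work_mem;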

Experiment with planner methods

Try disabling nested loops (enable_nestloop) (in your session only). Maybe hash joins are faster. (I would be surprised, though.)

SET enable_nestloop = off;
-- test ...

RESET enable_nestloop;

Temporary table

Since this seems to be a "temporary table" by nature, you could try and make it an actual temporary table saved in RAM only. You need enough RAM to allocate enough temp_buffers. Detailed instructions:

  • How to delete duplicate entries?

Be sure to run ANALYZE manually. Temp tables are not covered by autovacuum.
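
A rough sketch of that approach (sizes and names are assumptions); note that temp_buffers must be set before temporary tables are first used in the session:

SET temp_buffers = '500MB';

CREATE TEMPORARY TABLE orders_tmp AS
SELECT * FROM orders;

CREATE INDEX orders_tmp_mult_idx ON orders_tmp (order_date, id);

ANALYZE orders_tmp;   -- temp tables are not covered by autovacuum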


