SQL -- How Is Distinct So Fast Without an Index

How is SELECT DISTINCT so fast?

Duplicate question deserves a duplicate answer:

To be more accurate, one query is not quicker than the other. More precisely, the amount of time taken until the query completes should be about the same for both queries. The difference is that the query with DISTINCT simply has more rows to return, so it appears to respond faster because you are receiving rows at a faster rate. Under the hood, however, both queries perform the same table scan. The DISTINCT query also maintains a data structure that records what has already been returned and filters out duplicates, so it should actually take longer to complete, but (rows returned)/time is larger because more rows match. (Also note: some viewers add a query result limit, which can make the DISTINCT query appear to run faster, since you hit the result limit and stop.)
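
One way to compare time to completion rather than the rate at which rows show up is to wrap each query in a count, so the engine has to finish all the work while the client only displays a single row. This is a minimal sketch: the table events and the column user_id are hypothetical names, not taken from the question.

```sql
-- Hypothetical table and column names, for illustration only.
-- The outer count(*) returns one row, so a viewer's result limit
-- and the speed at which rows stream in no longer affect the timing.
SELECT count(*)
FROM (SELECT DISTINCT user_id FROM events) AS distinct_rows;
```

Wrap the query you are comparing against in the same way. The optimizer is free to plan a wrapped query differently from the bare one, so treat the timings as a rough guide only.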

What's faster, SELECT DISTINCT or GROUP BY in MySQL?

They are essentially equivalent to each other (in fact this is how some databases implement DISTINCT under the hood).

If one of them is faster, it's going to be DISTINCT. This is because, although the two are the same, a query optimizer would have to catch the fact that your GROUP BY is not taking advantage of any group members, just their keys. DISTINCT makes this explicit, so you can get away with a slightly dumber optimizer.

When in doubt, test!
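
For a test along those lines, the two forms being compared look like this. The table orders and the column customer_id are hypothetical names used only for illustration.

```sql
-- Hypothetical table: orders(customer_id, ...).
-- Both statements return the same set of customer_id values
-- (row order is not guaranteed in either case).
SELECT DISTINCT customer_id FROM orders;

SELECT customer_id FROM orders GROUP BY customer_id;

-- Prefix either statement with EXPLAIN to compare the plans MySQL chooses.
EXPLAIN SELECT DISTINCT customer_id FROM orders;
EXPLAIN SELECT customer_id FROM orders GROUP BY customer_id;
```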

Solution for speeding up a slow SELECT DISTINCT query in Postgres

Your DISTINCT is causing the database to sort the output rows in order to find duplicates. If you put an index on the column(s) selected by the query, the database may be able to read them out in index order and skip the sort step. A lot will depend on the details of the query and the tables involved; saying only that you "know the problem is with the DISTINCT" really limits the scope of available answers.
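
As a rough sketch of that suggestion, assuming a hypothetical table people with a column last_name (the original question's schema is not shown here):

```sql
-- Hypothetical names; substitute the column(s) in your SELECT DISTINCT list.
CREATE INDEX people_last_name_idx ON people (last_name);

-- Inspect the plan to see whether the planner now reads the index
-- in order instead of sorting the rows itself.
EXPLAIN SELECT DISTINCT last_name FROM people;
```

The planner may still prefer a sequential scan if it estimates that to be cheaper, so check the plan rather than assume the index is used.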

SQL : Can WHERE clause increase a SELECT DISTINCT query's speed?

In general, to compare the performance of queries, you use a statement that returns the execution plan of the query it is given, that is, every small step the engine performs to answer your request.

You do not mention your database engine (e.g. PostgreSQL, SQL Server, MySQL), but in PostgreSQL, for example, the statement is the following:

EXPLAIN SELECT DISTINCT secondID FROM table WHERE isTitle = 1;

Going back to your question: since isTitle is not indexed, I think the first thing the engine will do is a full scan of the table to check that attribute, and only then perform the SELECT. Hence, in my opinion, the first query:

SELECT DISTINCT secondID FROM table;

will be faster.
If you want to optimize it, you can create an index on the isTitle column. In that scenario, the query with the WHERE clause will become faster.
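
A minimal sketch of that index, using my_table as a hypothetical stand-in for the table name in the question. The two-column variant is an assumption added for illustration and is not part of the answer above.

```sql
-- my_table is a placeholder name; isTitle and secondID come from the question.
CREATE INDEX my_table_istitle_idx ON my_table (isTitle);

-- Possible variant (an assumption): also include secondID, so the engine
-- may be able to answer the query from the index alone where it supports that.
CREATE INDEX my_table_istitle_secondid_idx ON my_table (isTitle, secondID);

-- Re-check the plan after creating the index.
EXPLAIN SELECT DISTINCT secondID FROM my_table WHERE isTitle = 1;
```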

SQL Distinct keyword bogs down performance?

Yes, as using DISTINCT will (sometimes, according to a comment) cause the results to be ordered. Sorting hundreds of records takes time.

Try GROUP BY on all your columns; it can sometimes lead the query optimiser to choose a more efficient algorithm (at least with Oracle, I noticed a significant performance gain).

Slow distinct query in SQL Server over large dataset

You do misunderstand the index. Even if the query did use the index, it would still do an index scan across 200M entries. That is going to take a long time, plus the time it takes to do the DISTINCT (which causes a sort), and it's a bad thing to run. Seeing a DISTINCT in a query always raises a red flag and causes me to double-check the query. In this case, perhaps you have a normalization issue?

SELECT DISTINCT is slower than expected on my table in PostgreSQL

While there is no index skip scan in Postgres yet, emulate it:

WITH RECURSIVE cte AS (
   (  -- parentheses required
   SELECT product_id
   FROM tickers
   ORDER BY 1
   LIMIT 1
   )
   UNION ALL
   SELECT l.*
   FROM cte c
   CROSS JOIN LATERAL (
      SELECT product_id
      FROM tickers t
      WHERE t.product_id > c.product_id  -- lateral reference
      ORDER BY 1
      LIMIT 1
      ) l
   )
TABLE cte;

With an index on (product_id) and only 40 unique product IDs in the table this should be Fast. With capital F.

The PK index on (product_id, trade_id) is good for it, too!

With only very few rows per product_id (the opposite of your data distribution), DISTINCT / DISTINCT ON would be as fast or faster.
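
To try the comparison yourself, something like the following should work. It is a sketch only: the index name is arbitrary, and the separate index is unnecessary if the PK index on (product_id, trade_id) already exists.

```sql
-- Plain index on the column the emulation walks through
-- (skip this if the PK index on (product_id, trade_id) is in place).
CREATE INDEX tickers_product_id_idx ON tickers (product_id);

-- The straightforward query, timed for comparison with the recursive CTE.
EXPLAIN ANALYZE SELECT DISTINCT product_id FROM tickers;
```

EXPLAIN ANALYZE actually executes the statement and reports the measured run time, so the two approaches can be compared on the same data.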

Work to implement index skip scans is ongoing.

See:

  • Select first row in each GROUP BY group?
  • Optimize GROUP BY query to retrieve latest row per user
  • Is a composite index also good for queries on the first field?

