How is SELECT DISTINCT so fast?
Duplicate question deserves a duplicate answer:
To be more accurate, one query is not quicker than the other: the time taken until the query completes should be about the same for both. The difference is that the query with DISTINCT simply has more rows to return, so it appears to respond faster because you are receiving rows at a higher rate. Under the hood, both perform the same table scan. The DISTINCT query also keeps a data structure of what it has already returned and uses it to filter out duplicates, so it should actually take longer to complete, but (rows returned)/time is larger since there are simply more rows that match. Also note that some viewers add a query result limit, which can make the DISTINCT query appear to run faster because it hits the limit and stops.
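The "data structure storing what has been returned" can be pictured with a small Python sketch. This is only an illustration of hash-based duplicate filtering during a single scan, not how any particular engine implements DISTINCT; the function name `stream_distinct` and the sample data are made up for the example.

```python
from typing import Hashable, Iterable, Iterator


def stream_distinct(rows: Iterable[Hashable]) -> Iterator[Hashable]:
    """Yield each value the first time it is seen during a single scan."""
    seen = set()  # values that have already been returned to the client
    for row in rows:
        if row not in seen:
            seen.add(row)
            yield row  # emitted right away, before the scan has finished


# Hypothetical column values as a table scan would encounter them.
scanned = ["b", "a", "b", "c", "a"]
distinct_rows = list(stream_distinct(scanned))
```

Because rows are yielded as soon as they are first seen, a client can start receiving results long before the scan ends, which is one reason such a query can feel fast.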
What's faster, SELECT DISTINCT or GROUP BY in MySQL?
They are essentially equivalent to each other (in fact this is how some databases implement DISTINCT under the hood).
If one of them is faster, it's going to be DISTINCT. This is because, although the two are the same, a query optimizer would have to catch the fact that your GROUP BY is not taking advantage of any group members, just their keys. DISTINCT makes this explicit, so you can get away with a slightly dumber optimizer.
When in doubt, test!
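One way to test is a small harness that runs both forms, checks that they return the same rows, and times each. The sketch below uses Python's built-in sqlite3 module purely as a convenient stand-in, so its timings say nothing about MySQL; the `orders` table and the `timed` helper are hypothetical.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?)", [(i % 7,) for i in range(1000)]
)


def timed(sql):
    """Run a query and return (rows, elapsed seconds)."""
    start = time.perf_counter()
    rows = conn.execute(sql).fetchall()
    return rows, time.perf_counter() - start


distinct_rows, distinct_secs = timed("SELECT DISTINCT customer_id FROM orders")
grouped_rows, grouped_secs = timed(
    "SELECT customer_id FROM orders GROUP BY customer_id"
)

# Row order is not guaranteed without ORDER BY, so compare sorted results.
same_result = sorted(distinct_rows) == sorted(grouped_rows)
```

On a real system, run the same comparison against your own engine and data volume, and repeat it a few times so caching does not skew the numbers.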
Solution for speeding up a slow SELECT DISTINCT query in Postgres
Your DISTINCT is causing it to sort the output rows in order to find duplicates. If you put an index on the column(s) selected by the query, the database may be able to read them out in index order and save the sort step. A lot will depend on the details of the query and the tables involved; saying you "know the problem is with the DISTINCT" really limits the scope of available answers.
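The effect of such an index can be checked by comparing the query plan before and after creating it. The sketch below does this with Python's built-in sqlite3 module as a stand-in; in Postgres you would run EXPLAIN on the real query instead, and the plan text will look different. The `events` table and the index name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (category TEXT, payload TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(f"cat{i % 5}", "x" * 20) for i in range(500)],
)


def plan(sql):
    """Return SQLite's query plan as one string (detail is the 4th column)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)


query = "SELECT DISTINCT category FROM events"

plan_before = plan(query)  # no index available yet
conn.execute("CREATE INDEX idx_events_category ON events (category)")
plan_after = plan(query)   # the planner can now read values in index order

print(plan_before)
print(plan_after)
```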
SQL : Can WHERE clause increase a SELECT DISTINCT query's speed?
In general, to benchmark query performance, you ask the engine for the execution plan of the query, that is, every small step the engine performs to answer your request.
You do not mention your database engine (e.g. PostgreSQL, SQL Server, MySQL), but in PostgreSQL, for example, the query is the following:
EXPLAIN SELECT DISTINCT secondID FROM table WHERE isTitle = 1;
Going back to your question: since isTitle is not indexed, I think the first thing the engine will do is a full scan of the table to check that attribute and then perform the SELECT. Hence, in my opinion, the first query:
SELECT DISTINCT secondID FROM table;
will be faster.
If you want to optimize it, you can create an index on the isTitle column. In that scenario, the query with the WHERE clause will become faster.
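To see whether the engine actually picks up an index on isTitle, compare the plan for the filtered query before and after creating it. This is a sketch using Python's built-in sqlite3 module as a stand-in for your engine; the table name `items` (used because `table` is a reserved word), the index name, and the sample data are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (secondID INTEGER, isTitle INTEGER)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [(i % 10, 1 if i % 7 == 0 else 0) for i in range(1000)],
)


def plan(sql):
    """Return SQLite's query plan as one string (detail is the 4th column)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)


query = "SELECT DISTINCT secondID FROM items WHERE isTitle = 1"

rows_before = conn.execute(query).fetchall()
plan_before = plan(query)  # has to scan the table to evaluate isTitle

conn.execute("CREATE INDEX idx_items_istitle ON items (isTitle)")

rows_after = conn.execute(query).fetchall()
plan_after = plan(query)   # can look up matching rows through the index

print(plan_before)
print(plan_after)
```

The index changes how the rows are found, not which rows are returned. Whether it pays off in practice depends on how selective isTitle = 1 is in your data.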
SQL Distinct keyword bogs down performance?
Yes, as using DISTINCT will (sometimes, according to a comment) cause results to be ordered. Sorting hundreds of records takes time.
Try GROUP BY on all your columns; it can sometimes lead the query optimiser to choose a more efficient algorithm (at least with Oracle I noticed a significant performance gain).
Slow distinct query in SQL Server over large dataset
You do misunderstand the index. Even if it did use the index, it would still do an index scan across 200M entries. That is going to take a long time, plus the time it takes to do the DISTINCT (which causes a sort), so it is a bad query to run. Seeing a DISTINCT in a query always raises a red flag and makes me double-check the query. In this case, perhaps you have a normalization issue?
SELECT DISTINCT is slower than expected on my table in PostgreSQL
While there is no index skip scan in Postgres yet, emulate it:
WITH RECURSIVE cte AS (
   (  -- parentheses required
   SELECT product_id
   FROM   tickers
   ORDER  BY 1
   LIMIT  1
   )
   UNION ALL
   SELECT l.*
   FROM   cte c
   CROSS  JOIN LATERAL (
      SELECT product_id
      FROM   tickers t
      WHERE  t.product_id > c.product_id  -- lateral reference
      ORDER  BY 1
      LIMIT  1
      ) l
   )
TABLE cte;
With an index on (product_id) and only 40 unique product IDs in the table this should be Fast. With capital F.
The PK index on (product_id, trade_id) is good for it, too!
With only very few rows per product_id (the opposite of your data distribution), DISTINCT / DISTINCT ON would be as fast or faster.
Work to implement index skip scans is ongoing.
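To experiment with the idea outside Postgres, here is a runnable sketch using Python's built-in sqlite3 module. SQLite has no LATERAL, so this variant swaps in a correlated min() subquery to fetch the next larger product_id at each step; it is the same "jump to the next distinct value" technique, not the exact query above. The sample data is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tickers ("
    " product_id INTEGER, trade_id INTEGER,"
    " PRIMARY KEY (product_id, trade_id))"
)
# Few distinct product_ids, many rows each: the case the emulation targets.
conn.executemany(
    "INSERT INTO tickers VALUES (?, ?)",
    [(p, t) for p in (3, 1, 2) for t in range(200)],
)

SKIP_SCAN = """
WITH RECURSIVE cte(product_id) AS (
    SELECT min(product_id) FROM tickers           -- smallest value first
    UNION ALL
    SELECT (SELECT min(t.product_id)              -- next larger value
            FROM tickers t
            WHERE t.product_id > cte.product_id)
    FROM cte
    WHERE cte.product_id IS NOT NULL              -- stop after the last one
)
SELECT product_id FROM cte WHERE product_id IS NOT NULL
"""

skip_scan_ids = [row[0] for row in conn.execute(SKIP_SCAN)]
distinct_ids = [
    row[0]
    for row in conn.execute("SELECT DISTINCT product_id FROM tickers ORDER BY 1")
]
```

Each recursion step is a single index lookup, so the work grows with the number of distinct values rather than with the number of rows.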
See:
- Select first row in each GROUP BY group?
- Optimize GROUP BY query to retrieve latest row per user
- Is a composite index also good for queries on the first field?