SQL Distinct Keyword Bogs Down Performance

SQL Distinct keyword bogs down performance?

Yes. Using DISTINCT will (sometimes, according to a comment) cause the results to be sorted, and sorting hundreds of records takes time.

Try GROUP BY on all your columns instead. It can sometimes lead the query optimiser to choose a more efficient algorithm (at least with Oracle, I noticed a significant performance gain).
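
As a minimal sketch of that rewrite, the snippet below checks that SELECT DISTINCT and GROUP BY over all selected columns return the same rows. It uses Python's built-in sqlite3 module purely as a convenient sandbox; the orders table and its columns are made up for illustration, and whether GROUP BY is actually faster depends on your database and its optimizer.

```python
import sqlite3

# Hypothetical table, used only to illustrate the rewrite.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ann", "Oslo"), ("ann", "Oslo"), ("bob", "Rome"), ("ann", "Rome")],
)

distinct_rows = conn.execute(
    "SELECT DISTINCT customer, city FROM orders"
).fetchall()

# Same question, expressed by grouping on every selected column.
grouped_rows = conn.execute(
    "SELECT customer, city FROM orders GROUP BY customer, city"
).fetchall()

# Neither form guarantees row order, so compare the results as sets.
same_result = set(distinct_rows) == set(grouped_rows)
print(same_result)
```

The two forms are interchangeable in terms of results, so the only thing to compare on a real system is the execution plan and the timing.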

Solution for speeding up a slow SELECT DISTINCT query in Postgres

Your DISTINCT is causing it to sort the output rows in order to find duplicates. If you put an index on the column(s) selected by the query, the database may be able to read them out in index order and skip the sort step. A lot will depend on the details of the query and the tables involved; saying that you "know the problem is with the DISTINCT" really limits the scope of available answers.
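
One way to check whether an index changes how the DISTINCT is evaluated is to compare the query plan before and after creating it. The sketch below does this with SQLite through Python's sqlite3 module, as a stand-in for Postgres (where you would look at EXPLAIN or EXPLAIN ANALYZE output instead). The events table, its columns, and the index name are assumptions made up for the example, and the plan text differs between engines and versions.

```python
import sqlite3

# Hypothetical table: 5000 rows, 50 distinct user_id values.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, payload TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(i % 50, "x" * 20) for i in range(5000)],
)

query = "SELECT DISTINCT user_id FROM events"


def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is a readable plan step.
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


before_rows = sorted(conn.execute(query).fetchall())
plan_before = plan(query)

# Index on exactly the column(s) the DISTINCT query selects.
conn.execute("CREATE INDEX idx_events_user_id ON events (user_id)")

after_rows = sorted(conn.execute(query).fetchall())
plan_after = plan(query)

print(plan_before)
print(plan_after)
```

The result set is identical either way; only the plan steps printed at the end tell you whether the engine chose to use the index.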

In SQL, How does using DISTINCT affect performance?

Yes. Basically, it has to sort the results and then process them again to eliminate the duplicates. This culling could also be done during the sort, but we can only speculate as to how exactly the code works in the background. You could try to improve performance by creating an index composed of all three fields.

Huge performance difference when using GROUP BY vs DISTINCT

The two queries express the same question. Apparently the query optimizer chooses two different execution plans. My guess would be that the distinct approach is executed like:

  • Copy all business_key values to a temporary table
  • Sort the temporary table
  • Scan the temporary table, returning each item that is different from the one before it

The group by could be executed like:

  • Scan the full table, storing each value of business key in a hashtable
  • Return the keys of the hashtable

The first method optimizes for memory usage: it would still perform reasonably well when part of the temporary table has to be swapped out. The second method optimizes for speed, but potentially requires a large amount of memory if there are a lot of different keys.

Since you either have enough memory or few different keys, the second method outperforms the first. It's not unusual to see performance differences of 10x or even 100x between two execution plans.
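
The two guessed strategies can be sketched in a few lines of Python. This is only an illustration of the sort-based and hash-based approaches described above, not how any particular database implements them; the function names and the sample business_keys list are made up.

```python
def distinct_by_sort(values):
    """Sort a copy, then keep each item that differs from the one before it."""
    result = []
    for value in sorted(values):          # copy + sort: O(n log n)
        if not result or value != result[-1]:
            result.append(value)
    return result


def distinct_by_hash(values):
    """Scan once, storing every value as a key of a hash table."""
    seen = {}
    for value in values:                  # single pass: O(n) expected
        seen.setdefault(value, None)      # memory grows with the number of distinct keys
    return list(seen)                     # the keys of the hash table


business_keys = [7, 3, 7, 1, 3, 3, 9]
print(distinct_by_sort(business_keys))    # [1, 3, 7, 9]
print(distinct_by_hash(business_keys))    # [7, 3, 1, 9]
```

Note that the sort-based version returns the keys in sorted order as a side effect, while the hash-based version returns them in first-seen order. That is why a DISTINCT result may or may not appear sorted, depending on the plan chosen.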

SELECT DISTINCT Extremely slow

The problem that you have is the execution plan. My guess is that the "in" clause might be confusing it. You might try:

SELECT count(DISTINCT tmd_logins.userID) AS totalLoginsUniqueLast30Days
FROM tmd_logins
JOIN tmd_users
  ON tmd_logins.userID = tmd_users.userID
JOIN (SELECT DISTINCT userID AS accounts30Days
      FROM tmd_users
      WHERE isPatient = 1
        AND created > '2012-04-29'
        AND computerID IS NULL
     ) t
  ON tmd_logins.userID = t.accounts30Days
WHERE tmd_users.isPatient = 1
  AND loggedIn > '2011-03-25'

That might or might not work. However, I'm wondering about the structure of the query itself. It would seem that UserID should be distinct in a table called tmd_users. If so, then you can wrap all your conditions into one:

SELECT count(DISTINCT tmd_logins.userID) AS totalLoginsUniqueLast30Days
FROM tmd_logins
JOIN tmd_users
  ON tmd_logins.userID = tmd_users.userID
WHERE tmd_users.isPatient = 1
  AND loggedIn > '2011-03-25'
  AND created > '2012-04-29'
  AND computerID IS NULL

If my guess is true, then this should definitely run faster.
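
To see that the two formulations agree when userID is unique in tmd_users, here is a small check using Python's sqlite3 module. The schema is an assumption reconstructed from the column names in the queries (which table holds created, computerID, and loggedIn is a guess), and the sample rows are invented. It says nothing about speed, only that both queries answer the same question under that assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema: userID is the primary key of tmd_users, so it is unique.
conn.executescript("""
CREATE TABLE tmd_users (
    userID INTEGER PRIMARY KEY, isPatient INTEGER,
    created TEXT, computerID TEXT
);
CREATE TABLE tmd_logins (userID INTEGER, loggedIn TEXT);
""")
conn.executemany("INSERT INTO tmd_users VALUES (?, ?, ?, ?)", [
    (1, 1, "2012-05-01", None),     # qualifies
    (2, 1, "2012-01-01", None),     # created too early
    (3, 0, "2012-05-10", None),     # not a patient
    (4, 1, "2012-05-05", "pc-9"),   # computerID is not null
    (5, 1, "2012-05-02", None),     # qualifies
])
conn.executemany("INSERT INTO tmd_logins VALUES (?, ?)", [
    (1, "2012-05-02"), (1, "2012-05-03"), (2, "2012-05-03"),
    (3, "2012-05-11"), (4, "2012-05-06"),
    (5, "2010-01-01"), (5, "2012-05-04"),
])

with_subquery = """
SELECT count(DISTINCT tmd_logins.userID)
FROM tmd_logins
JOIN tmd_users ON tmd_logins.userID = tmd_users.userID
JOIN (SELECT DISTINCT userID AS accounts30Days
      FROM tmd_users
      WHERE isPatient = 1 AND created > '2012-04-29'
        AND computerID IS NULL) t
  ON tmd_logins.userID = t.accounts30Days
WHERE tmd_users.isPatient = 1 AND loggedIn > '2011-03-25'
"""

single_join = """
SELECT count(DISTINCT tmd_logins.userID)
FROM tmd_logins
JOIN tmd_users ON tmd_logins.userID = tmd_users.userID
WHERE tmd_users.isPatient = 1 AND loggedIn > '2011-03-25'
  AND created > '2012-04-29' AND computerID IS NULL
"""

count_with_subquery = conn.execute(with_subquery).fetchone()[0]
count_single = conn.execute(single_join).fetchone()[0]
print(count_with_subquery, count_single)
```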

Slow distinct query in SQL Server over large dataset

You do misunderstand the index. Even if the query did use the index, it would still have to do an index scan across 200M entries. That takes a long time, on top of the time needed for the DISTINCT itself (which causes a sort), so it is an expensive query to run. Seeing a DISTINCT in a query always raises a red flag and makes me double-check the query. In this case, perhaps you have a normalization issue?

optimizing query by removing distinct key word

There are two points about your query that should be mentioned:

  1. Implicit join vs explicit join, which have approximately the same performance.

People often ask if there is a performance difference between implicit and explicit joins. The answer is: “Usually not”.

  2. DISTINCT vs GROUP BY: DISTINCT is optimized for memory usage and GROUP BY for speed, so the latter can outperform the former, but it may require a large amount of memory.

The distinct approach is executed like:

  • Copy all business_key values to a temporary table

  • Sort the temporary table

  • Scan the temporary table, returning each item that is different from the one before it

The group by could be executed like:

  • Scan the full table, storing each value of business key in a hashtable

  • Return the keys of the hashtable

Astute explanations of both points can be found at the links below.

implicit join vs explicit join

distinct vs group by

What is the Performance while fetching data from multiple tables using DISTINCT keyword

When you specify select distinct, the database needs to go through the effort of removing duplicate values.

In a minority of cases, all the columns/expressions in the select may be in an index. If so, Oracle should be smart enough to use the index. In this one case, the select distinct may not have a large impact on performance.

Otherwise, the database needs to aggregate the data. The same algorithms are available for select distinct as for group by. There are many of them, but they will definitely be slower than not using select distinct.

One other factor is that the database will not return any results until it has removed the duplicates from the result set. In other cases, the database can start returning rows as soon as they are available.

You should only use select distinct if you really need it, knowing that you will incur overhead for running the query.

Order by with distinct has an impact on performance

My simplest suggestion is to put an index on profile(joining_time). Then select a certain number of the most recent in a subquery. For instance, if you are pretty confident that the top 20 rows you want are within the most recent 100 records in profile, then you can try this:

SELECT DISTINCT p.id, s.subject, p.joining_time
FROM (SELECT p.id, p.joining_time
      FROM profile p
      ORDER BY p.joining_time
      LIMIT 100
     ) p
INNER JOIN profile_subject ps
  ON p.id = ps.profile_id
LEFT JOIN subject s
  ON ps.subject_id = s.id
ORDER BY p.joining_time
LIMIT 20;

I would also suggest that you remove the DISTINCT keyword. Unless you have duplicate subjects for one profile, it is not necessary. Similarly, it is hard to believe that the LEFT JOIN is necessary. In a well-structured database, there would be no subject_id values in profile_subject that are not in subject. So, try this:

SELECT p.id, s.subject, p.joining_time
FROM (SELECT p.id, p.joining_time
      FROM profile p
      ORDER BY p.joining_time
      LIMIT 100
     ) p
INNER JOIN profile_subject ps
  ON p.id = ps.profile_id
JOIN subject s
  ON ps.subject_id = s.id
ORDER BY p.joining_time
LIMIT 20;
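
As a rough check of the claim that DISTINCT and LEFT JOIN are redundant here, the sketch below runs both shapes of the query against a tiny SQLite database through Python's sqlite3 module. The schema and sample rows are assumptions made up for illustration: no profile has the same subject twice, and every subject_id exists in subject. The inner LIMIT is shrunk to 3 so it actually cuts rows off.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema matching the table and column names in the queries.
conn.executescript("""
CREATE TABLE profile (id INTEGER PRIMARY KEY, joining_time TEXT);
CREATE TABLE subject (id INTEGER PRIMARY KEY, subject TEXT);
CREATE TABLE profile_subject (profile_id INTEGER, subject_id INTEGER);
""")
conn.executemany("INSERT INTO profile VALUES (?, ?)", [
    (1, "2020-01-01"), (2, "2020-01-02"), (3, "2020-01-03"),
    (4, "2020-01-04"), (5, "2020-01-05"),
])
conn.executemany("INSERT INTO subject VALUES (?, ?)",
                 [(10, "math"), (11, "art"), (12, "music")])
conn.executemany("INSERT INTO profile_subject VALUES (?, ?)",
                 [(1, 10), (1, 11), (2, 10), (3, 12), (4, 11), (5, 10)])

# Narrow down to a few profiles first, then join.
template = """
SELECT {distinct} p.id, s.subject, p.joining_time
FROM (SELECT id, joining_time
      FROM profile
      ORDER BY joining_time
      LIMIT 3) p
INNER JOIN profile_subject ps ON p.id = ps.profile_id
{join} subject s ON ps.subject_id = s.id
ORDER BY p.joining_time
LIMIT 20
"""

with_distinct = conn.execute(
    template.format(distinct="DISTINCT", join="LEFT JOIN")).fetchall()
without_distinct = conn.execute(
    template.format(distinct="", join="JOIN")).fetchall()

# Rows that share a joining_time may come back in either order, so sort first.
same_rows = sorted(with_distinct) == sorted(without_distinct)
print(same_rows, len(without_distinct))
```

If the data does contain duplicate (profile_id, subject_id) pairs or orphaned subject_id values, the two queries will differ, which is a quick way to find out whether the DISTINCT was hiding a data problem.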

