SQL Distinct keyword bogs down performance?
Yes, as using DISTINCT will (sometimes, according to a comment) cause the results to be ordered. Sorting hundreds of records takes time.
Try GROUP BY on all your columns. It can sometimes lead the query optimiser to choose a more efficient algorithm (at least with Oracle, I noticed a significant performance gain).
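As a rough illustration of why GROUP BY on every selected column can stand in for DISTINCT, here is a small sketch using Python's built-in sqlite3 module. The orders table and its rows are made up for the example, and SQLite is only a convenient stand-in: it says nothing about which form Oracle or any other engine will run faster.

```python
import sqlite3

# Hypothetical table and data, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ann", "Oslo"), ("bob", "Rome"), ("ann", "Oslo"), ("ann", "Rome")],
)

distinct_rows = conn.execute(
    "SELECT DISTINCT customer, city FROM orders"
).fetchall()

# Grouping by every selected column collapses duplicates the same way.
grouped_rows = conn.execute(
    "SELECT customer, city FROM orders GROUP BY customer, city"
).fetchall()

# Neither query guarantees an order, so compare sorted copies.
same_result = sorted(distinct_rows) == sorted(grouped_rows)
print(same_result)
```

Whether the rewrite actually helps depends on the optimiser, so it is worth comparing the execution plans of both forms on your own data.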
Solution for speeding up a slow SELECT DISTINCT query in Postgres
Your DISTINCT is causing it to sort the output rows in order to find duplicates. If you put an index on the column(s) selected by the query, the database may be able to read them out in index order and save the sort step. A lot will depend on the details of the query and the tables involved; saying only that you "know the problem is with the DISTINCT" really limits the scope of available answers.
In SQL, How does using DISTINCT affect performance?
Yes. Basically, it has to sort the results and then reprocess them to eliminate the duplicates. This cull could also be done during the sort, but we can only speculate as to how exactly the code works in the background. You could try to improve performance by creating an index composed of all three fields.
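A minimal sketch of the composite-index suggestion, using Python's built-in sqlite3 module. The table t, its columns a, b, c and the index name idx_t_abc are assumptions made for the example; the original question's schema is not shown here. The plan text is printed for inspection only, since its wording differs between engines and versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table with three selected fields plus an unrelated column.
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER, payload TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [(i % 3, i % 2, i % 5, "x") for i in range(300)],
)

query = "SELECT DISTINCT a, b, c FROM t"


def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row describes one plan step.
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


before_rows = conn.execute(query).fetchall()
plan_before = plan(query)

# One index composed of all three selected fields.
conn.execute("CREATE INDEX idx_t_abc ON t (a, b, c)")

after_rows = conn.execute(query).fetchall()
plan_after = plan(query)

print(plan_before)  # plan without the index
print(plan_after)   # plan with the index available
```

The index changes how the rows are found, not which rows come back, so the two result sets should match. Compare the two printed plans to see whether the engine chose to read from the index.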
Huge performance difference when using GROUP BY vs DISTINCT
The two queries express the same question. Apparently the query optimizer chooses two different execution plans. My guess would be that the distinct approach is executed like:
- Copy all business_key values to a temporary table
- Sort the temporary table
- Scan the temporary table, returning each item that is different from the one before it
The group by could be executed like:
- Scan the full table, storing each value of business_key in a hashtable
- Return the keys of the hashtable
The first method optimizes for memory usage: it would still perform reasonably well when part of the temporary table has to be swapped out. The second method optimizes for speed, but potentially requires a large amount of memory if there are a lot of different keys.
Since you either have enough memory or few different keys, the second method outperforms the first. It's not unusual to see performance differences of 10x or even 100x between two execution plans.
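The two guessed strategies can be sketched in plain Python. This is only an illustration of the sort-based and hash-based ideas described above, not what any particular database actually does internally. The function names and the sample business_keys list are made up for the example.

```python
def distinct_by_sort(values):
    """Sort a copy, then keep each item that differs from the one before it."""
    ordered = sorted(values)  # stands in for the sorted temporary table
    result = []
    for v in ordered:
        if not result or v != result[-1]:
            result.append(v)
    return result


def distinct_by_hash(values):
    """Single pass over the input, remembering every key in a hash table."""
    seen = {}
    for v in values:
        seen[v] = True  # a dict keeps keys in first-insertion order
    return list(seen)


business_keys = [7, 3, 7, 1, 3, 3, 9]
print(distinct_by_sort(business_keys))  # [1, 3, 7, 9]
print(distinct_by_hash(business_keys))  # [7, 3, 1, 9]
```

The sort-based version needs the whole input sorted before it can emit anything, while the hash-based version needs memory proportional to the number of different keys, which matches the trade-off described above.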
SELECT DISTINCT Extremely slow
The problem that you have is the execution plan. My guess is that the "in" clause might be confusing it. You might try:
SELECT count(DISTINCT tmd_logins.userID) as totalLoginsUniqueLast30Days
FROM tmd_logins join
tmd_users
on tmd_logins.userID = tmd_users.userID join
(SELECT distinct userID as accounts30Days
FROM tmd_users
where isPatient = 1 AND
created > '2012-04-29' AND
computerID is null
) t
on tmd_logins.userID = t.accounts30Days
where tmd_users.isPatient = 1 AND
loggedIn > '2011-03-25'
That might or might not work. However, I'm wondering about the structure of the query itself. It would seem that UserID should be distinct in a table called tmd_users. If so, then you can wrap all your conditions into one:
SELECT count(DISTINCT tmd_logins.userID) as totalLoginsUniqueLast30Days
FROM tmd_logins join
tmd_users
on tmd_logins.userID = tmd_users.userID
where tmd_users.isPatient = 1 AND
loggedIn > '2011-03-25' and
created > '2012-04-29' AND
computerID is null
If my guess is true, then this should definitely run faster.
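To check the reasoning that the subquery is redundant when userID is unique in tmd_users, here is a small sketch using Python's built-in sqlite3 module. The table and column names come from the queries above, but the column types, the PRIMARY KEY and all the rows are assumptions made for the example. It only checks that both forms return the same count on this data and says nothing about speed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema: userID is unique in tmd_users.
conn.execute(
    "CREATE TABLE tmd_users (userID INTEGER PRIMARY KEY, isPatient INTEGER,"
    " created TEXT, computerID TEXT)"
)
conn.execute("CREATE TABLE tmd_logins (userID INTEGER, loggedIn TEXT)")
conn.executemany(
    "INSERT INTO tmd_users VALUES (?, ?, ?, ?)",
    [
        (1, 1, "2012-05-01", None),    # qualifies
        (2, 1, "2012-05-10", "pc-7"),  # computerID is not null
        (3, 0, "2012-05-02", None),    # not a patient
        (4, 1, "2012-01-01", None),    # created too early
        (5, 1, "2012-05-03", None),    # qualifies
        (6, 1, "2012-05-04", None),    # qualifies, but only an old login
    ],
)
conn.executemany(
    "INSERT INTO tmd_logins VALUES (?, ?)",
    [(1, "2012-05-02"), (1, "2012-05-03"), (2, "2012-05-11"),
     (3, "2012-05-03"), (4, "2012-02-01"), (5, "2012-05-04"),
     (6, "2011-01-01")],
)

with_subquery = """
SELECT count(DISTINCT tmd_logins.userID)
FROM tmd_logins
JOIN tmd_users ON tmd_logins.userID = tmd_users.userID
JOIN (SELECT DISTINCT userID AS accounts30Days
      FROM tmd_users
      WHERE isPatient = 1 AND created > '2012-04-29' AND computerID IS NULL
     ) t ON tmd_logins.userID = t.accounts30Days
WHERE tmd_users.isPatient = 1 AND loggedIn > '2011-03-25'
"""

single_join = """
SELECT count(DISTINCT tmd_logins.userID)
FROM tmd_logins
JOIN tmd_users ON tmd_logins.userID = tmd_users.userID
WHERE tmd_users.isPatient = 1 AND loggedIn > '2011-03-25'
  AND created > '2012-04-29' AND computerID IS NULL
"""

count_with_subquery = conn.execute(with_subquery).fetchone()[0]
count_single_join = conn.execute(single_join).fetchone()[0]
print(count_with_subquery, count_single_join)
```

If userID were not unique in tmd_users, the two forms could diverge, so the uniqueness assumption is worth confirming before dropping the subquery.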
Slow distinct query in SQL Server over large dataset
You do misunderstand the index. Even if the query did use the index, it would still do an index scan across 200M entries. That is going to take a long time, plus the time it takes to do the DISTINCT (which causes a sort), so it is an expensive query to run. Seeing a DISTINCT in a query always raises a red flag and makes me double-check the query. In this case, perhaps you have a normalization issue?
optimizing query by removing distinct key word
There are two points about your query that should be mentioned:
- Implicit join vs explicit join: these have approximately the same performance. People often ask if there is a performance difference between implicit and explicit joins. The answer is: "Usually not".
- DISTINCT vs GROUP BY: the distinct approach is optimized for memory usage and the group by approach for speed, so the latter can outperform the former but may require a large amount of memory.
The distinct approach might be executed like:
- Copy all business_key values to a temporary table
- Sort the temporary table
- Scan the temporary table, returning each item that is different from the one before it
The group by could be executed like:
- Scan the full table, storing each value of business_key in a hashtable
- Return the keys of the hashtable
An astute explanation on the links below.
implicit join vs explicit join
distinct vs group by
What is the Performance while fetching data from multiple tables using DISTINCT keyword
When you specify select distinct, the database needs to go through the effort of removing duplicate values.
In a minority of cases, all the columns/expressions in the select may be in an index. If so, Oracle should be smart enough to use the index. In this one case, the select distinct may not have a large impact on performance.
Otherwise, the database needs to aggregate the data. The same algorithms are available for select distinct as for group by. There are many of them, but they will definitely be slower than not using select distinct.
One other factor is that the database will not return any results until it has removed the duplicates from the result set. In other cases, the database can start returning rows when they are available.
You should only use select distinct if you really need it, knowing that you will incur overhead for running the query.
Order by with distinct has an impact on performance
My simplest suggestion is to put an index on profile(joining_time). Then select a certain number of the most recent in a subquery. For instance, if you are pretty confident that the top 20 rows you want are within the most recent 100 records in profile, then you can try this:
SELECT DISTINCT p.id, s.subject, p.joining_time
FROM (SELECT p.id, p.joining_time
FROM profile p
ORDER BY p.joining_time
LIMIT 100
) p INNER JOIN
profile_subject ps
ON p.id = ps.profile_id LEFT JOIN
subject s
ON ps.subject_id = s.id
ORDER BY p.joining_time
LIMIT 20;
I would also suggest that you remove the DISTINCT keyword. Unless you have duplicate subjects for one profile, it is not necessary. Similarly, it is hard to believe that the LEFT JOIN is necessary. In a well-structured database, there would be no subject_id values in profile_subject that are not in subject. So, try this:
SELECT p.id, s.subject, p.joining_time
FROM (SELECT p.id, p.joining_time
FROM profile p
ORDER BY p.joining_time
LIMIT 100
) p INNER JOIN
profile_subject ps
ON p.id = ps.profile_id JOIN
subject s
ON ps.subject_id = s.id
ORDER BY p.joining_time
LIMIT 20;