Huge performance difference when using GROUP BY vs DISTINCT
The two queries express the same question. Apparently the query optimizer chooses two different execution plans. My guess would be that the distinct approach is executed like:
- Copy all business_key values to a temporary table
- Sort the temporary table
- Scan the temporary table, returning each item that is different from the one before it
The group by could be executed like:
- Scan the full table, storing each value of business_key in a hashtable
- Return the keys of the hashtable
The first method optimizes for memory usage: it would still perform reasonably well when part of the temporary table has to be swapped out. The second method optimizes for speed, but potentially requires a large amount of memory if there are a lot of different keys.
Since you either have enough memory or few different keys, the second method outperforms the first. It's not unusual to see performance differences of 10x or even 100x between two execution plans.
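As a rough illustration of the two guessed strategies, here is a minimal Python sketch. It is not what any particular database engine actually does; the function names and the sample list of business_key values are made up for the example.

```python
def distinct_by_sort(values):
    """Sort-based strategy: copy, sort, then keep each value that differs from its predecessor."""
    temp = sorted(values)  # stands in for "copy to a temporary table and sort it"
    result = []
    for v in temp:
        if not result or v != result[-1]:
            result.append(v)
    return result


def distinct_by_hash(values):
    """Hash-based strategy: a single scan that stores every key in a hash table."""
    seen = {}  # a dict is a hash table; it also keeps first-seen order
    for v in values:
        seen[v] = True
    return list(seen)


# Hypothetical business_key values, for illustration only.
business_keys = [7, 3, 7, 1, 3, 3, 9]

print(distinct_by_sort(business_keys))
print(distinct_by_hash(business_keys))
```

The sort-based version needs the whole input sorted before it can return anything, while the hash-based version needs memory proportional to the number of different keys, which is the trade-off described above.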
What's faster, SELECT DISTINCT or GROUP BY in MySQL?
They are essentially equivalent to each other (in fact this is how some databases implement DISTINCT under the hood).
If one of them is faster, it's going to be DISTINCT. This is because, although the two are the same, a query optimizer would have to catch the fact that your GROUP BY is not taking advantage of any group members, just their keys. DISTINCT makes this explicit, so you can get away with a slightly dumber optimizer.
When in doubt, test!
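One way to test is to run both forms and look at the plan the engine reports. The sketch below uses Python's built-in sqlite3 module purely as a stand-in, since it needs no server; the table t and column k are hypothetical. On MySQL you would run EXPLAIN against your own tables instead, and the plans will look different.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (k INTEGER)")  # hypothetical table and column
conn.executemany("INSERT INTO t VALUES (?)", [(i % 10,) for i in range(1000)])

distinct_sql = "SELECT DISTINCT k FROM t"
group_by_sql = "SELECT k FROM t GROUP BY k"

# Both forms should return the same set of keys.
distinct_rows = sorted(conn.execute(distinct_sql).fetchall())
group_by_rows = sorted(conn.execute(group_by_sql).fetchall())

# Print the plan SQLite picks for each form; the last column holds the description.
for sql in (distinct_sql, group_by_sql):
    plan = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    print(sql, "->", [row[-1] for row in plan])
```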
SQL Performance: SELECT DISTINCT versus GROUP BY
The performance difference is probably due to the execution of the subquery in the SELECT clause. I am guessing that it is re-executing this subquery for every row before the distinct is applied. For the group by, it would execute only once per group, after the grouping.
Try replacing it with a join, instead:
select . . .,
       parentcnt
from . . . left outer join
     (SELECT PARENT_ITEM_ID, COUNT(PKID) as parentcnt
      FROM ITEM_PARENTS
      GROUP BY PARENT_ITEM_ID
     ) p
     on items.item_id = p.parent_item_id
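To check that the join form returns the same counts, here is a small sketch using Python's sqlite3 module. The original query is not shown, so the correlated subquery below is only an assumed shape of it, and the table contents are invented. The derived table is grouped by PARENT_ITEM_ID so that it yields one count per parent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (item_id INTEGER PRIMARY KEY);
    CREATE TABLE item_parents (pkid INTEGER PRIMARY KEY, parent_item_id INTEGER);
    INSERT INTO items VALUES (1), (2), (3);
    INSERT INTO item_parents (parent_item_id) VALUES (1), (1), (2);
""")

# Assumed shape of the slow form: a correlated subquery in the SELECT clause.
correlated = conn.execute("""
    SELECT i.item_id,
           (SELECT COUNT(ip.pkid) FROM item_parents ip
             WHERE ip.parent_item_id = i.item_id) AS parentcnt
    FROM items i
    ORDER BY i.item_id
""").fetchall()

# Join form: aggregate once per parent, then join the counts back to items.
joined = conn.execute("""
    SELECT i.item_id, COALESCE(p.parentcnt, 0) AS parentcnt
    FROM items i
    LEFT OUTER JOIN (SELECT parent_item_id, COUNT(pkid) AS parentcnt
                     FROM item_parents
                     GROUP BY parent_item_id) p
      ON i.item_id = p.parent_item_id
    ORDER BY i.item_id
""").fetchall()

print(correlated)
print(joined)
```

Note the COALESCE: with a left outer join, items that have no rows in ITEM_PARENTS get NULL rather than 0, so the count has to be defaulted if you want the two forms to match exactly.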
distinct vs group by which is better
Your experience is interesting. I have not seen the single reducer effect for distinct versus group by. Perhaps there is some subtle difference in the optimizer between the two constructs.
A "famous" example in Hive is:
select count(distinct id)
from mytbl;
versus
select count(*)
from (select distinct id
from mytbl
) t;
The former uses only one reducer, while the latter operates in parallel. I have seen this in my own experience, and it is also documented and discussed (for example, on slides 26 and 27 in this presentation). So, distinct can definitely take advantage of parallelism.
I imagine that as Hive matures, such problems will be fixed. However, it is ironic that Postgres has a similar performance issue with COUNT(DISTINCT), although I think the underlying reason is a little bit different.
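If you rewrite one form into the other, keep in mind that they agree only when id contains no NULLs: COUNT(DISTINCT id) ignores NULLs, while SELECT DISTINCT id keeps one NULL row that COUNT(*) then counts. The sketch below shows this with Python's sqlite3 module as a stand-in for Hive; the sample rows are invented, and it says nothing about reducers or parallelism.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytbl (id INTEGER)")
conn.executemany("INSERT INTO mytbl VALUES (?)", [(1,), (1,), (2,), (3,), (None,)])

# Form 1: COUNT(DISTINCT ...) skips NULL values.
count_distinct = conn.execute("SELECT COUNT(DISTINCT id) FROM mytbl").fetchone()[0]

# Form 2: the subquery keeps a single NULL row, and COUNT(*) counts it.
count_subquery = conn.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT id FROM mytbl) t"
).fetchone()[0]

# Filtering NULLs in the subquery makes the two forms agree again.
count_subquery_no_nulls = conn.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT id FROM mytbl WHERE id IS NOT NULL) t"
).fetchone()[0]

print(count_distinct, count_subquery, count_subquery_no_nulls)
```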
What is difference between distinct and group by (without aggregate function)
GROUP BY lets you use aggregate functions, like AVG, MAX, MIN, SUM, and COUNT. DISTINCT, on the other hand, just removes duplicates.
You can read this answer too: https://stackoverflow.com/a/164544/4227703
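A small sketch of the difference, using Python's sqlite3 module and a hypothetical orders table: without aggregates the two forms return the same rows, but only GROUP BY can also report something about each group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")  # hypothetical table
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 10), ("alice", 30), ("bob", 5)],
)

# Without aggregates, DISTINCT and GROUP BY both just remove duplicates.
distinct_rows = conn.execute(
    "SELECT DISTINCT customer FROM orders ORDER BY customer"
).fetchall()
grouped_rows = conn.execute(
    "SELECT customer FROM orders GROUP BY customer ORDER BY customer"
).fetchall()

# With GROUP BY you can additionally aggregate over each group.
grouped_with_aggregates = conn.execute(
    "SELECT customer, COUNT(*), SUM(amount) FROM orders "
    "GROUP BY customer ORDER BY customer"
).fetchall()

print(distinct_rows)
print(grouped_with_aggregates)
```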
Performance difference between GroupBy and MoreLinq's DistinctBy
There is a very big difference in what they do, and thus the performance difference is expected. GroupBy will create a collection for each key in the original collection before passing it to the Select. DistinctBy only needs to keep a hashset recording whether it has encountered the key before, so it can be much faster.
If DistinctBy is enough for you, always use it; only use GroupBy if you need the elements in each group.
Also, the DistinctBy operator will not work with LINQ to EF, for example.
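The difference in bookkeeping can be sketched in Python. This is only an analogy, not the LINQ or MoreLinq implementation; the function names and the sample data are made up.

```python
from collections import defaultdict


def first_per_key_via_grouping(items, key):
    """GroupBy-style: build a full list of members for every key, then take the first of each."""
    groups = defaultdict(list)  # key -> every item with that key
    for item in items:
        groups[key(item)].append(item)
    return [members[0] for members in groups.values()]


def distinct_by(items, key):
    """DistinctBy-style: remember only which keys were seen, and yield the first item per key."""
    seen = set()
    for item in items:
        k = key(item)
        if k not in seen:
            seen.add(k)
            yield item


# Hypothetical (name, age) records.
people = [("ann", 30), ("bob", 25), ("ann", 41), ("cid", 25)]


def by_name(person):
    return person[0]


print(first_per_key_via_grouping(people, by_name))
print(list(distinct_by(people, by_name)))
```

Both return the first record per name, but the grouping version holds every record in memory, while the distinct version holds only the set of keys and can stream its results.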
SnowFlake's performance on group by vs partition on vs distinct
The first two queries can be executed with the same execution plan, depending on Snowflake's cardinality estimates.
Your third approach will use a window function operator, and it would probably take more time.
As you have the dataset, I would HIGHLY recommend that you run your own tests and observe the execution plans and the performance:
https://docs.snowflake.com/en/user-guide/ui-query-profile.html#how-to-access-query-profile
Actually, I did some tests with the SNOWFLAKE_SAMPLE_DATA database, and I can see that the first two queries are executed with the same execution plan and perform better than the third query.
Using Distinct or Group by in SQL . Which one is better and faster? State it
Both queries have exactly the same execution plan:
select distinct a,b,c from bigTable
select a,b,c from bigTable
group by a,b,c
They are processed the same way, so you can choose the syntax you prefer.
To get better at query tuning, use the "Show execution plan" button in SQL Server Management Studio. It is really useful.