Huge performance difference when using GROUP BY vs DISTINCT

The two queries express the same question. Apparently the query optimizer chooses two different execution plans. My guess would be that the distinct approach is executed like:

  • Copy all business_key values to a temporary table
  • Sort the temporary table
  • Scan the temporary table, returning each item that is different from the one before it

The group by could be executed like:

  • Scan the full table, storing each value of business_key in a hash table
  • Return the keys of the hashtable

The first method optimizes for memory usage: it would still perform reasonably well when part of the temporary table has to be swapped out. The second method optimizes for speed, but potentially requires a large amount of memory if there are a lot of different keys.

Since you apparently have either enough memory or few distinct keys, the second method outperforms the first. It's not unusual to see performance differences of 10x or even 100x between two execution plans.
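The two strategies guessed at above can be sketched in plain Python. This is only an illustration of the sort-based and hash-based ideas, not what any particular database engine actually runs; the function names and the sample data are made up for the example.

```python
def distinct_by_sort(values):
    """Sort-based strategy: copy, sort, then skip adjacent repeats."""
    temp = sorted(values)  # copy into a "temporary table" and sort it
    result = []
    for v in temp:
        # Emit a value only when it differs from the one emitted before it.
        if not result or v != result[-1]:
            result.append(v)
    return result


def distinct_by_hash(values):
    """Hash-based strategy: one scan, remembering each key in a hash table."""
    seen = {}  # a dict is a hash table; it keeps first-seen order
    for v in values:
        seen[v] = True
    return list(seen)


# Hypothetical business_key values, for illustration only.
business_keys = [7, 3, 7, 1, 3, 3, 9, 1]

sorted_result = distinct_by_sort(business_keys)
hashed_result = distinct_by_hash(business_keys)

print(sorted_result)
print(hashed_result)
```

Both functions return the same set of keys. The sort-based one needs room for a full copy of the input but reads it sequentially, while the hash-based one only needs memory proportional to the number of different keys.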

What's faster, SELECT DISTINCT or GROUP BY in MySQL?

They are essentially equivalent to each other (in fact this is how some databases implement DISTINCT under the hood).

If one of them is faster, it's going to be DISTINCT. This is because, although the two are the same, a query optimizer would have to catch the fact that your GROUP BY is not taking advantage of any group members, just their keys. DISTINCT makes this explicit, so you can get away with a slightly dumber optimizer.

When in doubt, test!
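A minimal way to run such a test, sketched here with Python's built-in sqlite3 module as a stand-in (the question is about MySQL, so the numbers will not carry over; the table `t`, the column `business_key`, and the row counts are assumptions for the example):

```python
import sqlite3
import time

# In-memory SQLite database used as a stand-in for the real server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (business_key INTEGER)")
conn.executemany(
    "INSERT INTO t (business_key) VALUES (?)",
    ((i % 1000,) for i in range(200_000)),  # 200,000 rows, 1,000 different keys
)


def timed(sql):
    """Run a query, fetch all rows, and return (rows, elapsed seconds)."""
    start = time.perf_counter()
    rows = conn.execute(sql).fetchall()
    return rows, time.perf_counter() - start


distinct_rows, distinct_secs = timed("SELECT DISTINCT business_key FROM t")
group_rows, group_secs = timed("SELECT business_key FROM t GROUP BY business_key")

print(f"DISTINCT: {len(distinct_rows)} rows in {distinct_secs:.4f}s")
print(f"GROUP BY: {len(group_rows)} rows in {group_secs:.4f}s")
```

On a real system you would run each query several times against production-sized data and compare the execution plans as well as the timings.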

SQL Performance: SELECT DISTINCT versus GROUP BY

The performance difference is probably due to the execution of the subquery in the SELECT clause. I am guessing that it is re-executing this query for every row before the distinct. For the group by, it would execute once after the group by.

Try replacing it with a join, instead:

select . . .,
       parentcnt
from . . . left outer join
     (SELECT PARENT_ITEM_ID, COUNT(PKID) as parentcnt
      FROM ITEM_PARENTS
      GROUP BY PARENT_ITEM_ID
     ) p
     on items.item_id = p.parent_item_id
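To check that the rewrite gives the same answer, here is a small sketch using Python's sqlite3 module. The original query is not shown in full, so the ITEMS and ITEM_PARENTS schemas and the sample rows below are assumptions made purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ITEMS (ITEM_ID INTEGER PRIMARY KEY);
    CREATE TABLE ITEM_PARENTS (PKID INTEGER PRIMARY KEY, PARENT_ITEM_ID INTEGER);
    INSERT INTO ITEMS (ITEM_ID) VALUES (1), (2), (3);
    INSERT INTO ITEM_PARENTS (PKID, PARENT_ITEM_ID) VALUES (10, 1), (11, 1), (12, 2);
""")

# Correlated subquery in the SELECT list: logically evaluated once per row.
correlated = conn.execute("""
    SELECT ITEM_ID,
           (SELECT COUNT(PKID)
              FROM ITEM_PARENTS
             WHERE PARENT_ITEM_ID = ITEMS.ITEM_ID) AS parentcnt
      FROM ITEMS
     ORDER BY ITEM_ID
""").fetchall()

# Join against a derived table that is aggregated once.
# COALESCE turns the NULL for items without children into 0,
# which is what the correlated COUNT returns for them.
joined = conn.execute("""
    SELECT ITEMS.ITEM_ID, COALESCE(p.parentcnt, 0) AS parentcnt
      FROM ITEMS
      LEFT OUTER JOIN (SELECT PARENT_ITEM_ID, COUNT(PKID) AS parentcnt
                         FROM ITEM_PARENTS
                        GROUP BY PARENT_ITEM_ID) p
        ON ITEMS.ITEM_ID = p.PARENT_ITEM_ID
     ORDER BY ITEMS.ITEM_ID
""").fetchall()

print(correlated)
print(joined)
```

Note the one behavioral difference: without the COALESCE, the join version returns NULL rather than 0 for items that have no rows in ITEM_PARENTS.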

distinct vs group by which is better

Your experience is interesting. I have not seen the single reducer effect for distinct versus group by. Perhaps there is some subtle difference in the optimizer between the two constructs.

A "famous" example in Hive is:

select count(distinct id)
from mytbl;

versus

select count(*)
from (select distinct id
from mytbl
) t;

The former uses only one reducer, while the latter runs in parallel. I have seen this in my own experience, and it is also documented and discussed (for example, on slides 26 and 27 in this presentation). So, distinct can definitely take advantage of parallelism.

I imagine that as Hive matures, such problems will be fixed. However, it is ironic that Postgres has a similar performance issue with COUNT(DISTINCT), although I think the underlying reason is a little bit different.
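Why the subquery form of the Hive example can spread across reducers is easier to see in a toy model. The sketch below is plain Python and only illustrates the idea; it is not Hive's actual implementation, and the function names and `num_workers` parameter are hypothetical.

```python
def count_distinct_single(ids):
    """One worker sees every row, like a single reducer."""
    return len(set(ids))


def count_distinct_two_stage(ids, num_workers=4):
    """Stage 1: route each id to a worker by hash and dedupe per worker.
    Stage 2: add up the per-worker counts."""
    partitions = [set() for _ in range(num_workers)]
    for value in ids:
        # Equal ids always land in the same partition, so no id is counted twice.
        partitions[hash(value) % num_workers].add(value)
    return sum(len(p) for p in partitions)


ids = [i % 250 for i in range(10_000)]  # made-up data with 250 different ids

single = count_distinct_single(ids)
two_stage = count_distinct_two_stage(ids)
print(single, two_stage)
```

The loop here runs sequentially, but each partition is independent of the others, so in a distributed engine each one could be deduplicated by a separate reducer before the small per-partition counts are summed.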

What is difference between distinct and group by (without aggregate function)

GROUP BY lets you use aggregate functions, like AVG, MAX, MIN, SUM, and COUNT. DISTINCT, on the other hand, just removes duplicates.

You can also read this answer: https://stackoverflow.com/a/164544/4227703
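A small sketch of the difference, using Python's sqlite3 module (the `orders` table and its rows are made up for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, amount INTEGER);
    INSERT INTO orders (customer, amount) VALUES ('ann', 10), ('ann', 30), ('bob', 5);
""")

# Without an aggregate, both forms just return each customer once.
distinct_customers = conn.execute(
    "SELECT DISTINCT customer FROM orders ORDER BY customer"
).fetchall()
grouped_customers = conn.execute(
    "SELECT customer FROM orders GROUP BY customer ORDER BY customer"
).fetchall()

# Only GROUP BY lets you compute something per group.
totals = conn.execute(
    "SELECT customer, SUM(amount), COUNT(*) FROM orders "
    "GROUP BY customer ORDER BY customer"
).fetchall()

print(distinct_customers)
print(grouped_customers)
print(totals)
```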

Performance difference between GroupBy and MoreLinq's DistinctBy

There is a very big difference in what they do and thus the performance difference is expected.
GroupBy will create a collection for each key in the original collection before passing it to the Select. DistinctBy only needs to keep a hash set recording whether it has encountered the key before, so it can be much faster.

If DistinctBy is enough for you, always use it. Only use GroupBy if you need the elements in each group.

Also, DistinctBy will not work with LINQ to EF, for example.
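The difference in the amount of work can be shown with a rough Python analogue. This is not the LINQ or MoreLinq source, just a sketch of the two approaches as described above; the function names and sample data are hypothetical.

```python
def first_per_key_via_grouping(items, key):
    """GroupBy-style: build a full list of members per key, then take the first."""
    groups = {}
    for item in items:
        groups.setdefault(key(item), []).append(item)  # stores every element
    return [members[0] for members in groups.values()]


def distinct_by(items, key):
    """DistinctBy-style: remember only which keys were already seen."""
    seen = set()
    result = []
    for item in items:
        k = key(item)
        if k not in seen:
            seen.add(k)
            result.append(item)  # keeps only the first element per key
    return result


people = [("ann", 30), ("bob", 25), ("ann", 41), ("cy", 25)]  # made-up data

by_group = first_per_key_via_grouping(people, key=lambda p: p[0])
by_distinct = distinct_by(people, key=lambda p: p[0])
print(by_group)
print(by_distinct)
```

Both return the first element for each key, but the grouping version allocates a list per key and stores every element, while the distinct version stores only the keys.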

SnowFlake's performance on group by vs partition on vs distinct

The first two queries can be executed with the same execution plan, depending on Snowflake's cardinality estimates.

Your third approach will use a window function operator, and it would probably take more time.

As you have the dataset, I would HIGHLY recommend that you run your own tests and compare the execution plans and the performance:

https://docs.snowflake.com/en/user-guide/ui-query-profile.html#how-to-access-query-profile

Actually, I did some tests with the SNOWFLAKE_SAMPLE_DATA database, and I can see that the first two queries are executed with the same execution plan and perform better than the third query.
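The original queries are not reproduced here, so the following is only a guess at the three shapes being compared (DISTINCT, GROUP BY, and a window function with PARTITION BY). It is sketched with Python's sqlite3 module rather than Snowflake, needs SQLite 3.25 or later for window functions, and uses a made-up table `t` with a column `k`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (k INTEGER)")
conn.executemany("INSERT INTO t (k) VALUES (?)", [(i % 5,) for i in range(50)])

# Hypothetical versions of the three approaches; all should return each k once.
queries = {
    "distinct": "SELECT DISTINCT k FROM t ORDER BY k",
    "group_by": "SELECT k FROM t GROUP BY k ORDER BY k",
    "window": """
        SELECT k FROM (
            SELECT k, ROW_NUMBER() OVER (PARTITION BY k ORDER BY k) AS rn
              FROM t
        ) WHERE rn = 1 ORDER BY k
    """,
}

results = {name: conn.execute(sql).fetchall() for name, sql in queries.items()}
for name, rows in results.items():
    print(name, rows)
```

The window version has to number every row in every partition before filtering, which is consistent with the expectation that it does more work than the other two.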

Using Distinct or Group by in SQL . Which one is better and faster? State it

Both queries have exactly the same execution plan.

select distinct a,b,c from bigTable

select a,b,c from bigTable
group by a,b,c

They are processed the same way, so you can choose the syntax you prefer.
To get better at query tuning, just use the "Show execution plan" button in SQL Server Management Studio. It is really useful.
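If you do not have Management Studio at hand, the same kind of check can be scripted. The sketch below uses Python's sqlite3 module and SQLite's EXPLAIN QUERY PLAN as a stand-in, so the plans it prints are SQLite's, not SQL Server's, and they may or may not match each other; the `plan` helper and the sample data are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bigTable (a INTEGER, b INTEGER, c INTEGER)")
conn.executemany(
    "INSERT INTO bigTable (a, b, c) VALUES (?, ?, ?)",
    [(i % 3, i % 4, i % 2) for i in range(120)],  # made-up rows
)


def plan(sql):
    """Return the plan detail strings SQLite reports for a query."""
    # The human-readable detail is the last column of each plan row.
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


distinct_sql = "SELECT DISTINCT a, b, c FROM bigTable"
group_sql = "SELECT a, b, c FROM bigTable GROUP BY a, b, c"

distinct_plan = plan(distinct_sql)
group_plan = plan(group_sql)
print("DISTINCT plan:", distinct_plan)
print("GROUP BY plan:", group_plan)

distinct_rows = conn.execute(distinct_sql).fetchall()
group_rows = conn.execute(group_sql).fetchall()
```

Whatever the plans look like on a given engine, the two queries should return the same set of rows.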


