What's Faster, SELECT DISTINCT or GROUP BY in MySQL

What's faster, SELECT DISTINCT or GROUP BY in MySQL?

The two are essentially equivalent; in fact, some databases implement DISTINCT as a GROUP BY under the hood.

If one of them is faster, it's going to be DISTINCT. Although the two produce the same result, the query optimizer has to notice that your GROUP BY uses only the group keys and none of the group members. DISTINCT makes this explicit, so you can get away with a slightly dumber optimizer.

When in doubt, test!
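As a rough sketch of what "test" could look like, here is a small timing harness. It uses Python's built-in sqlite3 module purely as a stand-in so that it runs anywhere; the table `t`, its data, and the helper `time_query` are made up for illustration. SQLite timings say nothing about MySQL, so for a real answer you would run the same two statements against your own MySQL data.

```python
import sqlite3
import time


def time_query(conn, sql, repeats=5):
    """Run sql several times; return (best elapsed seconds, rows of last run)."""
    best = float("inf")
    rows = []
    for _ in range(repeats):
        start = time.perf_counter()
        rows = conn.execute(sql).fetchall()
        best = min(best, time.perf_counter() - start)
    return best, rows


# Hypothetical test table: 50,000 rows holding 1,000 different values.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i % 1000,) for i in range(50000)])

distinct_time, distinct_rows = time_query(conn, "SELECT DISTINCT x FROM t")
group_time, group_rows = time_query(conn, "SELECT x FROM t GROUP BY x")

print(f"DISTINCT: {distinct_time:.4f}s  GROUP BY: {group_time:.4f}s")
# Both queries must return the same set of values, whatever the timings are.
print(sorted(distinct_rows) == sorted(group_rows))
```

Taking the best of several runs reduces noise from caching and warm-up; the absolute numbers will differ on every machine.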

GROUP BY x vs DISTINCT( x )

You should use GROUP BY if you need aggregate functions, like SUM, MAX etc.

If you only need grouping columns, they are the same (and use the same plan).
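A minimal sketch of both cases, again using SQLite through Python's sqlite3 module as a stand-in for MySQL. The `orders` table and its rows are hypothetical, and the sketch only compares results, not execution plans.

```python
import sqlite3

# Hypothetical table and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ann", 10), ("bob", 5), ("ann", 7), ("cy", 3), ("bob", 1)],
)

# Only the grouping column is needed: both forms return the same rows.
distinct_rows = conn.execute(
    "SELECT DISTINCT customer FROM orders ORDER BY customer"
).fetchall()
grouped_rows = conn.execute(
    "SELECT customer FROM orders GROUP BY customer ORDER BY customer"
).fetchall()

# An aggregate is needed: this is the case that calls for GROUP BY.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()

print(distinct_rows)
print(grouped_rows)
print(totals)
```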

Please note that DISTINCT is not a function, so this clause:

SELECT DISTINCT(id), othercol

which is the same (except for column order) as

SELECT DISTINCT othercol, (id)

or just

SELECT DISTINCT othercol, id

might still give you duplicates on id if there are records with same id but different othercol.
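To see this in action, here is a small sketch using SQLite via Python's sqlite3 module (an assumption made for convenience; the table `t` and its rows are made up). The parentheses around id change nothing: DISTINCT applies to the whole row.

```python
import sqlite3

# Hypothetical data: id 1 appears with two different othercol values.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, othercol TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(1, "a"), (1, "b"), (2, "a"), (2, "a")],
)

with_parens = conn.execute(
    "SELECT DISTINCT(id), othercol FROM t ORDER BY id, othercol"
).fetchall()
without_parens = conn.execute(
    "SELECT DISTINCT id, othercol FROM t ORDER BY id, othercol"
).fetchall()

# Only the exact duplicate row (2, 'a') is collapsed; id 1 still appears twice.
print(with_parens)
print(with_parens == without_parens)
```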

Is there any difference between GROUP BY and DISTINCT

MusiGenesis' response is functionally the correct one with regard to your question as stated; SQL Server is smart enough to realize that if you are using "Group By" and not using any aggregate functions, then what you actually mean is "Distinct", and therefore it generates an execution plan as if you'd simply used "Distinct."

However, I think it's important to note Hank's response as well - cavalier treatment of "Group By" and "Distinct" could lead to some pernicious gotchas down the line if you're not careful. It's not entirely correct to say that this is "not a question about aggregates" because you're asking about the functional difference between two SQL query keywords, one of which is meant to be used with aggregates and one of which is not.

A hammer can work to drive in a screw sometimes, but if you've got a screwdriver handy, why bother?

(for the purposes of this analogy, Hammer : Screwdriver :: GroupBy : Distinct and screw => get list of unique values in a table column)

Which is better: Distinct or Group By

In your example, both queries will generate the same execution plan so their performance will be the same.

However, they each have their own purpose. To make your code easier to understand, you should use DISTINCT to eliminate duplicate rows and GROUP BY to apply aggregate operators (SUM, COUNT, MAX, ...).

distinct vs group by which is better

Your experience is interesting. I have not seen the single reducer effect for distinct versus group by. Perhaps there is some subtle difference in the optimizer between the two constructs.

A "famous" example in Hive is:

select count(distinct id)
from mytbl;

versus

select count(*)
from (select distinct id
from mytbl
) t;

The former uses only one reducer, while the latter operates in parallel. I have seen this in my own experience, and it is also documented and discussed (for example, on slides 26 and 27 in this presentation). So, distinct can definitely take advantage of parallelism.

I imagine that as Hive matures, such problems will be fixed. However, it is ironic that Postgres has a similar performance issue with COUNT(DISTINCT), although I think the underlying reason is a little bit different.
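The two query shapes above return the same number; only the way they execute differs. The sketch below checks just the logical equivalence, using SQLite through Python's sqlite3 module as a stand-in (SQLite has no reducers, so it cannot reproduce the Hive behaviour). The contents of `mytbl` are made up.

```python
import sqlite3

# Hypothetical data: 100 rows holding 7 different id values.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytbl (id INTEGER)")
conn.executemany("INSERT INTO mytbl VALUES (?)", [(i % 7,) for i in range(100)])

# Form 1: the aggregate removes duplicates itself.
direct = conn.execute("SELECT COUNT(DISTINCT id) FROM mytbl").fetchone()[0]

# Form 2: duplicates are removed in a subquery, then the rows are counted.
nested = conn.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT id FROM mytbl) t"
).fetchone()[0]

print(direct, nested)
```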

Huge performance difference when using GROUP BY vs DISTINCT

The two queries express the same question. Apparently the query optimizer chooses two different execution plans. My guess would be that the distinct approach is executed like:

  • Copy all business_key values to a temporary table
  • Sort the temporary table
  • Scan the temporary table, returning each item that is different from the one before it

The group by could be executed like:

  • Scan the full table, storing each value of business_key in a hashtable
  • Return the keys of the hashtable

The first method optimizes for memory usage: it would still perform reasonably well when part of the temporary table has to be swapped out. The second method optimizes for speed, but potentially requires a large amount of memory if there are a lot of different keys.

Since you either have enough memory or few different keys, the second method outperforms the first. It's not unusual to see performance differences of 10x or even 100x between two execution plans.
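The two guessed strategies can be sketched in plain Python. This is only an illustration of the idea, not what any particular database engine does internally; the function names and the sample `business_keys` list are made up.

```python
def distinct_by_sort(values):
    """Sort a copy, then keep each item that differs from the one before it."""
    ordered = sorted(values)  # stands in for "copy to a temporary table and sort"
    result = []
    for value in ordered:
        if not result or value != result[-1]:
            result.append(value)
    return result


def distinct_by_hash(values):
    """Scan once, storing each value in a hash table; return its keys."""
    seen = {}
    for value in values:
        seen[value] = True  # memory grows with the number of different keys
    return list(seen)


# Hypothetical business_key values.
business_keys = [42, 7, 42, 19, 7, 7, 3]

sorted_result = distinct_by_sort(business_keys)
hashed_result = distinct_by_hash(business_keys)

print(sorted_result)
print(hashed_result)
```

Both return the same set of keys. The sort-based version returns them in sorted order, while the hash-based version returns them in first-seen order, which is one reason neither DISTINCT nor GROUP BY should be relied on for ordering without an explicit ORDER BY.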

Which is faster: SELECT DISTINCT or WHERE foo != 0?

Here's an indexed data set of roughly 130,000 rows. The sparse column has values in the range 0-100000. The dense column has values in the range 0-100.

SELECT * FROM my_table;
+----+--------+-------+
| id | sparse | dense |
+----+--------+-------+
|  1 |      0 |     0 |
|  2 |  52863 |    87 |
|  3 |  76503 |    21 |
|  4 |  77783 |    25 |
|  6 |  89359 |    73 |
|  7 |  97772 |    69 |
|  8 |  53429 |    59 |
|  9 |  35206 |    99 |
| 13 |  88062 |    44 |
| 14 |  56312 |    49 |
...

SELECT * FROM my_table WHERE sparse <> 0;
130941 rows in set (0.09 sec)

SELECT * FROM my_table WHERE dense <> 0;
130289 rows in set (0.09 sec)

SELECT DISTINCT sparse FROM my_table;
72844 rows in set (0.27 sec)

SELECT DISTINCT dense FROM my_table;
101 rows in set (0.00 sec)

As you can see, whether or not DISTINCT is faster depends very much on the density of the data.

Obviously, in this instance, the two queries are very different from each other!
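The density effect can be reproduced on synthetic data. The sketch below uses SQLite via Python's sqlite3 module and deterministic, made-up values rather than the original data set, and it compares only how many rows DISTINCT has to produce, not MySQL timings.

```python
import sqlite3

N = 20000  # fewer rows than the data set above, to keep the sketch fast

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table (id INTEGER PRIMARY KEY, sparse INTEGER, dense INTEGER)"
)
# Synthetic values: sparse spreads over 0-100000, dense cycles through 0-100.
conn.executemany(
    "INSERT INTO my_table (sparse, dense) VALUES (?, ?)",
    [((i * 7919) % 100001, i % 101) for i in range(N)],
)

sparse_count = len(conn.execute("SELECT DISTINCT sparse FROM my_table").fetchall())
dense_count = len(conn.execute("SELECT DISTINCT dense FROM my_table").fetchall())

print(sparse_count, dense_count)
```

With these values every sparse entry is different, so DISTINCT returns as many rows as the table holds, while the dense column collapses to 101 rows. The fewer different values there are, the less work DISTINCT has to hand back.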


