Is SQL Group by a Design Flaw

Is SQL GROUP BY a design flaw?

You don't have to group by the exactly the same thing you're selecting, e.g. :

SQL:select priority,count(*) from rule_class
group by priority

PRIORITY COUNT(*)
70 1
50 4
30 1
90 2
10 4

SQL:select decode(priority,50,'Norm','Odd'),count(*) from rule_class
group by priority

DECO COUNT(*)
Odd 1
Norm 4
Odd 1
Odd 2
Odd 4

SQL:select decode(priority,50,'Norm','Odd'),count(*) from rule_class
group by decode(priority,50,'Norm','Odd')

DECO COUNT(*)
Norm 4
Odd 8

What is the underlying purpose of a group by statement in SQL?

An example from Oracle which supports nested aggregate functions

Assume that you have a cube rolling results.

The following query shows us the throws distribution.

select   result
,count(*) as count
from cube_roll
group by result


































RESULTCOUNT
111
223
312
423
515
616

Why do I need to explicitly specify all columns in a SQL GROUP BY clause - why not GROUP BY * ?

It's hard to know exactly what the designers of the SQL language were thinking when they wrote the standard, but here's my opinion.

SQL, as a general rule, requires you to explicitly state your expectations and your intent. The language does not try to "guess what you meant", and automatically fill in the blanks. This is a good thing.

When you write a query the most important consideration is that it yields correct results. If you made a mistake, it's probably better that the SQL parser informs you, rather than making a guess about your intent and returning results that may not be correct. The declarative nature of SQL (where you state what you want to retrieve rather than the steps how to retrieve it) already makes it easy to inadvertently make mistakes. Introducing fuzziniess into the language syntax would not make this better.

In fact, every case I can think of where the language allows for shortcuts has caused problems. Take, for instance, natural joins - where you can omit the names of the columns you want to join on and allow the database to infer them based on column names. Once the column names change (as they naturally do over time) - the semantics of existing queries changes with them. This is bad ... very bad - you really don't want this kind of magic happening behind the scenes in your database code.

One consequence of this design choice, however, is that SQL is a verbose language in which you must explicitly express your intent. This can result in having to write more code than you may like, and gripe about why certain constructs are so verbose ... but at the end of the day - it is what it is.

Is MySQL breaking the standard by allowing selecting columns that are not part of the group by clause?

Standard SQL would reject your query because you can not SELECT non-aggregate fields that are not part of the GROUP BY clause in an aggregate query

This is correct, up to 1992.

But it is plainly wrong, from 2003 and beyond.

From SQL-2003 standard, 6IWD6-02-Foundation-2011-01.pdf, from http://www.wiscorp.com/, paragraph-7.12 (query specification), page 398:


  1. If T is a grouped table, then let G be the set of grouping columns of T. In each ((value expression)) contained
    in ((select list)) , each column reference that references a column of T shall reference some column C that
    is functionally dependent on G or shall be contained in an aggregated argument of a ((set function specification))
    whose aggregation query is QS

Now MYSQL, has implemented this feature by allowing not only columns that are functionally dependent on the grouping columns but allowing all columns. This is causing some problems with users that do not understand how grouping works and get indeterminate results where they don't expect.

But you are right to say that MySQL has added a feature that conflicts with SQL-standards (although you seem to think that for the wrong reason). It's not entirely accurate as they have added a SQL-standard feature but not in the best way (more like the easy way) but it does conflict with the latest standards.

To answer your question, the reason for this MySQL feature (extension) is I suppose to be accordance with latest SQL-standards (2003+). Why they chose to implement it this way (not fully compliant), we can only speculate.

As @Quassnoi and @Johan answered with examples, it's mainly a performance and maintainability issue. But one can't easily change the RDBMS to be clever enough (Skynet excluded) to recognize functionally dependent columns, so MySQL developers made a choice:

We (MySQL) give you (MySQL users) this feature which is in SQL-2003 standards. It improves speed in certain GROUP BY queries but there's a catch. You have to be careful (and not the SQL engine) so columns in the SELECT and HAVING lists are functionally dependent on the GROUP BY columns. If not, you may get indeterminate results.

If you want to disable it, you can set sql_mode to ONLY_FULL_GROUP_BY.

It's all in the MySQL docs: Extensions to GROUP BY (5.5) - although not in the above wording but as in your quote (they even forgot to mention that it's a deviation from standard SQL-2003 while not standard SQL-92). This kind of choices is common I think in all software, other RDBMS included. They are made for performance, backward compatibility and a lot of other reasons. Oracle has the famous '' is the same as NULL for example and SQL-Server has probably some, too.

There is also this blog post by Peter Bouman, where MySQL developers' choice is defended: Debunking GROUP BY myths.

In 2011, as @Mark Byers informed us in a comment (in a related question at DBA.SE), PostgreSQL 9.1 added a new feature (release date: September 2011) designed for this purpose. It is more restrictive than MySQL's implementation and closer to the standard.

Later, in 2015 MySQL announced that in 5.7 version, the behaviour is improved to conform with the standard and actually recognize functional dependencies, (even better than the Postgres implementation). The documentation: MySQL Handling of GROUP BY (5.7) and another blog post by Peter Bouman: MySQL 5.7.5: GROUP BY respects functional dependencies!

group by problem

Use:

   SELECT x.cust_id,
y.price_id AS max,
z.price_id AS min
FROM (SELECT t.cust_id,
MAX(t.price) as max,
MIN(t.price) as min
FROM TABLE t
GROUP BY t.cust_id) x
LEFT JOIN TABLE y ON y.cust_id = x.cust_id
AND y.price = x.max
LEFT JOIN TABLE z ON z.cust_id = x.cust_id
AND z.price = x.min

The problem is that if a cust_id has two records with the same high (or low) price, you'll see duplicates and will need to provide logic to deal with ties.

Arithmetic overflow depending on group by columns

The final value doesn't actually matter. What is likely happening, is that at some point in your SUM you are going over the maximum value (2,147,483,647) or minimum value (-2,147,483,648) for an int and getting the error.

Take this example:

SELECT SUM(V.I)
FROM (VALUES(2147483646),
(2),
(-2006543543))V(I);

This will likely generate the same error:

Arithmetic overflow error converting expression to data type int.

The result of the SUM however, would be 140,940,105 (well below the maximum). This is because if 2147483646 and 2 are summed first, then you get 2147483648, which is larger than the maximum value of an int. If you CAST/CONVERT the value first, you don't get the error:

SELECT SUM(CONVERT(bigint,V.I))
FROM (VALUES(2147483646),
(2),
(-2006543543))V(I);

SQL Group By comparison with Python Group By error

I think SQL should be like below

SELECT Company, AVG(Sales)
FROM CSVDatabase.dbo.sales_info
GROUP BY Company

SQL GROUP BY issue

The reason you get an arbitrary price is that mysql cannot know which columns to select if you GROUP BY something. It knows it needs a price and a date per pid and can fetch the latest date as you requested with max(date) but chooses to return a price that is most efficient for him to retrieve - you didn't provide an aggregate function for that column (your first query is not valid SQL, actually.)

Your second query looks OK, but here is a shorter alternative:

SELECT pid, price, date
FROM ProductPrice p
WHERE date = (SELECT MAX(date) FROM ProductPrice tmp WHERE tmp.pid = p.pid)

But if you access the latest price a lot (which I think you do), I would recommend adding the old column back to your original table to hold the newest value, if you have the option of altering the database structure again.



Related Topics



Leave a reply



Submit