Why Does SQL Force Me to Repeat All Non-Aggregated Fields from My Select Clause in My Group by Clause

Why does SQL force me to repeat all non-aggregated fields from my SELECT clause in my GROUP BY clause?

I tend to agree with you - this is one of many cases where SQL should have slightly smarter defaults to save us all some typing. For example, imagine if this were legal:

Select ClientName, InvoiceAmount, Sum(PaymentAmount) Group By *

where "*" meant "all the non-aggregate fields". If everybody knew that's how it worked, then there would be no confusion. You could sub in a specific list of fields if you wanted to do something tricky, but the splat means "all of 'em" (which in this context means, all the possible ones).

Granted, "*" means something different here than in the SELECT clause, so maybe a different character would work better:

Select ClientName, InvoiceAmount, Sum(PaymentAmount) Group By !

There are a few other areas like that where SQL just isn't as eloquent as it could be. But at this point, it's probably too entrenched to make many big changes like that.

Why do I need to explicitly specify all columns in a SQL GROUP BY clause - why not GROUP BY *?

It's hard to know exactly what the designers of the SQL language were thinking when they wrote the standard, but here's my opinion.

SQL, as a general rule, requires you to explicitly state your expectations and your intent. The language does not try to "guess what you meant", and automatically fill in the blanks. This is a good thing.

When you write a query, the most important consideration is that it yields correct results. If you made a mistake, it's probably better that the SQL parser informs you, rather than making a guess about your intent and returning results that may not be correct. The declarative nature of SQL (where you state what you want to retrieve rather than the steps to retrieve it) already makes it easy to inadvertently make mistakes. Introducing fuzziness into the language syntax would not make this better.

In fact, every case I can think of where the language allows for shortcuts has caused problems. Take, for instance, natural joins, where you can omit the names of the columns you want to join on and allow the database to infer them based on column names. Once the column names change (as they naturally do over time), the semantics of existing queries change with them. This is bad ... very bad - you really don't want this kind of magic happening behind the scenes in your database code.

One consequence of this design choice, however, is that SQL is a verbose language in which you must explicitly express your intent. This can result in having to write more code than you may like, and gripe about why certain constructs are so verbose ... but at the end of the day - it is what it is.

Query without duplicates & aggregate function or the GROUP BY clause issue. - REPEX

I'm trying to query all UNIQUE (invoice_num & invoice_suffix) pairs with the name of the company and the invoice_amt.

I interpret this as saying that you want all pairs that appear only once in the table. If so, a simple way would be:

SELECT r.invoice_num, r.invoice_suffix,
MAX(r.company_name) as company_name,
MAX(r.invoice_amt) as invoice_amt
FROM dbo.distinct_repex r
GROUP BY r.invoice_num, r.invoice_suffix
HAVING COUNT(*) = 1
ORDER BY r.invoice_num;

If you actually want every pair, with each one appearing only once in the result, then you can remove the HAVING clause.

If the COUNT(*) = 1, then MAX() returns the one value. An aggregation function is needed, but either MAX() or MIN() will do.
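As a rough check of the query above, here is a minimal sketch that runs it against an in-memory SQLite database from Python. The sample rows are made up for illustration, and the `dbo.` schema prefix is dropped because SQLite has no such schema; only the table and column names come from the answer.

```python
import sqlite3

# Hypothetical sample data; only the table and column names come from the query above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE distinct_repex ("
    "invoice_num INTEGER, invoice_suffix TEXT, company_name TEXT, invoice_amt REAL)"
)
conn.executemany(
    "INSERT INTO distinct_repex VALUES (?, ?, ?, ?)",
    [
        (100, "A", "Acme", 10.0),
        (100, "A", "Acme", 10.0),  # this pair appears twice, so HAVING filters it out
        (101, "A", "Bolt", 25.0),
        (102, "B", "Cord", 40.0),
    ],
)

unique_pairs = conn.execute(
    """
    SELECT r.invoice_num, r.invoice_suffix,
           MAX(r.company_name) AS company_name,
           MAX(r.invoice_amt) AS invoice_amt
    FROM distinct_repex r
    GROUP BY r.invoice_num, r.invoice_suffix
    HAVING COUNT(*) = 1
    ORDER BY r.invoice_num
    """
).fetchall()

print(unique_pairs)  # [(101, 'A', 'Bolt', 25.0), (102, 'B', 'Cord', 40.0)]
```

Because every surviving group holds exactly one row, MAX() simply passes that row's value through.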

Also note the use of table aliases. This makes the query easier to write and to read.

Is SQL GROUP BY a design flaw?

You don't have to group by exactly the same thing you're selecting, e.g.:

SQL:select priority,count(*) from rule_class
group by priority

PRIORITY COUNT(*)
70 1
50 4
30 1
90 2
10 4

SQL:select decode(priority,50,'Norm','Odd'),count(*) from rule_class
group by priority

DECO COUNT(*)
Odd 1
Norm 4
Odd 1
Odd 2
Odd 4

SQL:select decode(priority,50,'Norm','Odd'),count(*) from rule_class
group by decode(priority,50,'Norm','Odd')

DECO COUNT(*)
Norm 4
Odd 8
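The same contrast can be reproduced outside Oracle. The sketch below uses SQLite from Python and swaps the Oracle-specific `decode` for a standard `CASE` expression, which is a substitution on my part; the row counts mirror the output shown above, and the `ORDER BY` clauses are added only to make the result order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rule_class (priority INTEGER)")

# Same distribution as the output above: 70 x1, 50 x4, 30 x1, 90 x2, 10 x4.
priorities = [70] + [50] * 4 + [30] + [90] * 2 + [10] * 4
conn.executemany("INSERT INTO rule_class VALUES (?)", [(p,) for p in priorities])

# CASE stands in for Oracle's decode(priority, 50, 'Norm', 'Odd').
label = "CASE priority WHEN 50 THEN 'Norm' ELSE 'Odd' END"

# Grouping by the raw column keeps one group per priority value.
by_column = conn.execute(
    f"SELECT {label}, COUNT(*) FROM rule_class GROUP BY priority ORDER BY priority"
).fetchall()

# Grouping by the expression collapses all non-50 priorities into one group.
by_expression = conn.execute(
    f"SELECT {label}, COUNT(*) FROM rule_class GROUP BY {label} ORDER BY 1"
).fetchall()

print(by_column)      # [('Odd', 4), ('Odd', 1), ('Norm', 4), ('Odd', 1), ('Odd', 2)]
print(by_expression)  # [('Norm', 4), ('Odd', 8)]
```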

SQL Query Still having duplicates after group by

SELECT T1.*
FROM eBayorders T1
JOIN
( SELECT `Name`,
`SKU`,
max(`TIME`) AS MAX_TIME
FROM eBayorders
WHERE (`OrderIDAmazon` IS NULL OR `OrderIDAmazon` = "null") AND `Flag` = "True" AND `TYPE` = "GROUP" AND (`Carrier` IS NULL OR `Carrier` = "null") AND LEFT(`SKU`, 1) = "B" AND datediff(now(), `TIME`) < 4 AND (`TrackingInfo` IS NULL OR `TrackingInfo` = "null") AND `STATUS` = "PROCESSING"
GROUP BY `Name`,
`SKU`) AS dedupe ON T1.`Name` = dedupe.`Name`
AND T1.`SKU` = dedupe.`SKU`
AND T1.`Time` = dedupe.`MAX_TIME`
ORDER BY `TIME` ASC LIMIT 7

Your database platform should have complained because your original query had items in the select list which were not present in the group by (generally not allowed). The above should resolve it.

An even better option, if your database supports window functions (MySQL has supported them since version 8.0), would be the following:

SELECT *
FROM
( SELECT *,
row_number() over (partition BY `Name`, `SKU`
ORDER BY `TIME` DESC) AS dedupe_rank
FROM eBayorders
WHERE (`OrderIDAmazon` IS NULL OR `OrderIDAmazon` = "null") AND `Flag` = "True" AND `TYPE` = "GROUP" AND (`Carrier` IS NULL OR `Carrier` = "null") AND LEFT(`SKU`, 1) = "B" AND datediff(now(), `TIME`) < 4 AND (`TrackingInfo` IS NULL OR `TrackingInfo` = "null") AND `STATUS` = "PROCESSING" ) T
WHERE dedupe_rank = 1
ORDER BY T.`TIME` ASC LIMIT 7
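To compare the two approaches, here is a simplified sketch using SQLite from Python (it assumes SQLite 3.25 or later, which is when window functions arrived there). The `orders` table, its three columns, and the rows are hypothetical stand-ins for eBayorders, with the long WHERE filter left out. The sketch orders each partition by time descending so that rank 1 is the latest row, matching MAX(time) in the join version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, cut-down stand-in for the eBayorders table.
conn.execute("CREATE TABLE orders (name TEXT, sku TEXT, time INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("ann", "B1", 1), ("ann", "B1", 3), ("bob", "B2", 2)],
)

# Approach 1: join back to the per-group maximum time.
via_join = conn.execute(
    """
    SELECT t1.name, t1.sku, t1.time
    FROM orders t1
    JOIN (SELECT name, sku, MAX(time) AS max_time
          FROM orders
          GROUP BY name, sku) AS dedupe
      ON t1.name = dedupe.name
     AND t1.sku = dedupe.sku
     AND t1.time = dedupe.max_time
    ORDER BY t1.time
    """
).fetchall()

# Approach 2: rank rows inside each (name, sku) partition, keep the latest.
via_window = conn.execute(
    """
    SELECT name, sku, time
    FROM (SELECT *,
                 ROW_NUMBER() OVER (PARTITION BY name, sku
                                    ORDER BY time DESC) AS dedupe_rank
          FROM orders) t
    WHERE dedupe_rank = 1
    ORDER BY time
    """
).fetchall()

print(via_join)    # [('bob', 'B2', 2), ('ann', 'B1', 3)]
print(via_window)  # [('bob', 'B2', 2), ('ann', 'B1', 3)]
```

One difference worth keeping in mind: if two rows in a group tie on the maximum time, the join returns both, while ROW_NUMBER() keeps exactly one.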

Is the GROUP BY clause in SQL redundant?

Whenever we use an aggregate function in SQL (MIN, MAX, AVG etc), we must always GROUP BY all non-aggregated columns

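This is not true in general. MySQL, for example, doesn't require this when the ONLY_FULL_GROUP_BY SQL mode is disabled, and the SQL standard (since SQL:1999) only requires that non-aggregated columns be functionally dependent on the GROUP BY columns.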

  • Debunking GROUP BY myths

It becomes even more intrusive when we use a function or other calculation in our SELECT statement, as this must also be copied to the GROUP BY clause.

Also not true in general. MySQL (and perhaps other databases too) allows column aliases to be used in the GROUP BY clause:

SELECT (2 * (x + y)) / z + 1 AS a, MyFunction(x, y) AS b, SUM(z)
FROM AnotherTable
GROUP BY a, b
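SQLite happens to accept aliases in GROUP BY as well, so the idea can be tried from Python without a MySQL server. This is only a sketch: the table contents are invented, the expression is simplified to `x + y`, and the user-defined `MyFunction` is left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE AnotherTable (x INTEGER, y INTEGER, z INTEGER)")
conn.executemany(
    "INSERT INTO AnotherTable VALUES (?, ?, ?)",
    [(1, 2, 10), (2, 1, 5), (4, 0, 1)],  # hypothetical rows
)

# The alias "a" is defined in the SELECT list and reused in GROUP BY,
# so the expression x + y is written only once.
grouped = conn.execute(
    "SELECT x + y AS a, SUM(z) FROM AnotherTable GROUP BY a ORDER BY a"
).fetchall()

print(grouped)  # [(3, 15), (4, 1)]
```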

If this is not the case, then what extra functionality does GROUP BY give us?

The only way of specifying what to group by is to use a GROUP BY clause. You cannot necessarily deduce it from the columns mentioned in the SELECT. In fact you don't even have to select all the columns mentioned in the GROUP BY:

SELECT MAX(col2)
FROM foo
GROUP BY col1
HAVING COUNT(*) = 2
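A small sketch of that last query, run against SQLite from Python with made-up rows for the placeholder table `foo`: the grouping column col1 never appears in the output, yet it still decides which rows are aggregated together.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (col1 TEXT, col2 INTEGER)")
conn.executemany(
    "INSERT INTO foo VALUES (?, ?)",
    [
        ("a", 1), ("a", 5),            # group of 2: kept
        ("b", 7),                      # group of 1: dropped by HAVING
        ("c", 2), ("c", 3), ("c", 9),  # group of 3: dropped by HAVING
    ],
)

result = conn.execute(
    "SELECT MAX(col2) FROM foo GROUP BY col1 HAVING COUNT(*) = 2"
).fetchall()

print(result)  # [(5,)]
```

Nothing in the SELECT list mentions col1, so the database could not have inferred the grouping from it.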

Select multiple columns from a table, but group by one

I use this trick to group by one column when I select multiple columns:

SELECT MAX(id) AS id,
Nume,
MAX(intrare) AS intrare,
MAX(iesire) AS iesire,
MAX(intrare-iesire) AS stoc,
MAX(data) AS data
FROM Produse
GROUP BY Nume
ORDER BY Nume

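This works, but keep in mind that each MAX() is computed independently, so the values in one output row may come from different source rows.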
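The following sketch shows what that query returns when a group contains more than one row. It runs on SQLite from Python, and the two sample rows are invented; only the table and column names come from the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Produse ("
    "id INTEGER, Nume TEXT, intrare INTEGER, iesire INTEGER, data TEXT)"
)
conn.executemany(
    "INSERT INTO Produse VALUES (?, ?, ?, ?, ?)",
    [
        (1, "surub", 10, 2, "2020-01-01"),  # hypothetical rows for one product
        (2, "surub", 4, 3, "2020-02-01"),
    ],
)

row = conn.execute(
    """
    SELECT MAX(id) AS id,
           Nume,
           MAX(intrare) AS intrare,
           MAX(iesire) AS iesire,
           MAX(intrare - iesire) AS stoc,
           MAX(data) AS data
    FROM Produse
    GROUP BY Nume
    ORDER BY Nume
    """
).fetchone()

print(row)  # (2, 'surub', 10, 3, 8, '2020-02-01')
```

The result combines id 2 with intrare 10 (from id 1), and stoc 8 is not intrare minus iesire of the returned row. That is fine when each group has one row or the columns are independent, but it is not a way to fetch "the row with the highest id".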

MySQL claims that I can use columns in SELECT that aren't in GROUP BY, but I can't with equal performance

Since it doesn't look like there's a simple answer, I'm going with a cheap solution for the moment.

What I would do would be something like the following:

SELECT o1.* FROM objects o1 WHERE o1.id IN (SELECT o2.id FROM objects o2 WHERE mycondition GROUP BY o2.id)

However, according to how it gets EXPLAINed, the MySQL optimizer views the subquery as being dependent, which is always a really, really nasty performance killer. I think that's a bug in the query optimizer brought about by the fact that it's the same table, even though it's aliased. As such, I'll be using one query to fetch the IDs, and putting them IN the second query that fetches o.*. It gets reasonable performance, and isn't too painful.
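The two-query workaround described above could look roughly like this from application code. The sketch uses Python's sqlite3 module rather than MySQL, and the `kind` and `payload` columns, the sample rows, and the `kind = ?` filter standing in for `mycondition` are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema; "kind = ?" below stands in for mycondition.
conn.execute("CREATE TABLE objects (id INTEGER PRIMARY KEY, kind TEXT, payload TEXT)")
conn.executemany(
    "INSERT INTO objects VALUES (?, ?, ?)",
    [(1, "x", "p1"), (2, "y", "p2"), (3, "x", "p3")],
)

# Query 1: fetch only the matching IDs.
ids = sorted(
    r[0]
    for r in conn.execute(
        "SELECT o2.id FROM objects o2 WHERE o2.kind = ? GROUP BY o2.id", ("x",)
    )
)

# Query 2: fetch the full rows, passing the IDs as bound parameters.
if ids:
    placeholders = ", ".join("?" * len(ids))  # e.g. "?, ?" for two IDs
    rows = conn.execute(
        f"SELECT o1.* FROM objects o1 WHERE o1.id IN ({placeholders}) ORDER BY o1.id",
        ids,
    ).fetchall()
else:
    rows = []  # an empty IN () list is not valid SQL everywhere, so skip the query

print(rows)  # [(1, 'x', 'p1'), (3, 'x', 'p3')]
```

Binding the IDs as parameters instead of formatting them into the SQL string avoids quoting problems, though very long ID lists can hit the driver's parameter limit.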

This question is still open to answers with cleaner solutions that perform as well, if not better :)

Performance of ISNULL() in GROUP BY clause SQL

Non-aggregated columns in SELECT clauses generally must precisely match the ones in GROUP BY clauses. If I were you, and I were dealing with tested production code, I would not make the change you propose.

Edit: the match between non-aggregated SELECT columns and GROUP BY columns is necessary for GROUP BY. If the columns in SELECT are 1:1 dependent on the columns in GROUP BY, it will work. Otherwise the results are ambiguous.


