Do All Columns in a Select List Have to Appear in a Group by Clause

Why do I need to explicitly specify all columns in a SQL GROUP BY clause - why not GROUP BY *?

It's hard to know exactly what the designers of the SQL language were thinking when they wrote the standard, but here's my opinion.

SQL, as a general rule, requires you to explicitly state your expectations and your intent. The language does not try to "guess what you meant", and automatically fill in the blanks. This is a good thing.

When you write a query the most important consideration is that it yields correct results. If you made a mistake, it's probably better that the SQL parser informs you, rather than making a guess about your intent and returning results that may not be correct. The declarative nature of SQL (where you state what you want to retrieve rather than the steps how to retrieve it) already makes it easy to inadvertently make mistakes. Introducing fuzziniess into the language syntax would not make this better.

In fact, every case I can think of where the language allows for shortcuts has caused problems. Take, for instance, natural joins - where you can omit the names of the columns you want to join on and allow the database to infer them based on column names. Once the column names change (as they naturally do over time) - the semantics of existing queries changes with them. This is bad ... very bad - you really don't want this kind of magic happening behind the scenes in your database code.

One consequence of this design choice, however, is that SQL is a verbose language in which you must explicitly express your intent. This can result in having to write more code than you may like, and gripe about why certain constructs are so verbose ... but at the end of the day - it is what it is.

Are the columns used in the order / group by clause(s) mandatory in the select clause?

The tutorial is just plain wrong.

An order by can include any columns in the queried objects whether they are in the select list or not.

A group by goes the other way. Any non-aggregated columns in the select list must be included in the group by.

selected items don't have to appear in the GROUP BY clause or be used in an aggregate function

You were taught right.

According to the SQL Standard when you use GROUP BY the columns that can appear in the SELECT clause fall into three categories:

  1. Columns included in the GROUP BY clause. In this case you have s.sid.
  2. Aggregated columns. In this case you have avg(grade).
  3. Functionally dependent columns of case #1. Since s.sid is the PK of the table, you can include s.name without aggregating it.

So all good.

However, you should know that MySQL 5.7.4 and older do allow you to include other columns in non-aggregated form. This is a bug/feature of MySQL that I personally find error prone. If you do this, MySQL will silently pick one value randomly without aggregating it and without telling you.

This functionality can be turned on by using the ONLY_FULL_GROUP_BY configuration parameter (as @Shawn pointed out in the comments) in newer versions of MySQL, to allow old/bad queries to run. I would try to avoid using it, though.

SQL - How to do a group by without having to pass all the columns from the select?

GROUP BY summarizes data by aggregating a group of rows, returning one row per group. You're using the aggregate function max(), which will return the maximum value from one column for a group of rows.

Let's look at some data. I renamed the column you called "date".

create table sales (
customerId integer not null,
saleId integer not null,
saledate date not null
);


insert into sales values
(1, 10, '2013-05-13'),
(1, 11, '2013-05-14'),
(1, 12, '2013-05-14'),
(1, 13, '2013-05-17'),
(2, 20, '2013-05-11'),
(2, 21, '2013-05-16'),
(2, 31, '2013-05-17'),
(2, 32, '2013-03-01'),
(3, 33, '2013-05-14'),
(3, 35, '2013-05-14');

You said

In another words, the first 2 lines are from the same customer(id 19), i wish he'd get only one record for each client, which would be the record with the max date, in the case, the second record from this list.

select s.customerId, max (s.saledate) 
from sales s
where s.saledate <= '2013-05-16'
group by s.customerId
order by customerId;

customerId max
--
1 2013-05-14
2 2013-05-16
3 2013-05-14

What does that table mean? It means that the latest date on or before May 16 on which customer "1" bought something was May 14; the latest date on or before May 16 on which customer "2" bought something was May 16. If you use this derived table in joins, it will return predictable results with consistent meaning.

Now let's look at a slightly different query. MySQL permits this syntax, and returns the result set below.

select s.customerId, s.saleId, max(s.saledate) max_sale
from sales s
where s.saledate <= '2013-05-16'
group by s.customerId
order by customerId;

customerId saleId max_sale
--
1 10 2013-05-14
2 20 2013-05-16
3 33 2013-05-14

The sale with ID "10" didn't happen on May 14; it happened on May 13. This query has produced a falsehood. Joining this derived table with the table of sales transactions will compound the error.

That's why Firebird correctly raises an error. The solution is to drop saleId from the SELECT clause.

Now, having said all that, you can find the customers who have had no sales since May 16 like this.

select distinct customerId from sales
where customerID not in
(select customerId
from sales
where saledate >= '2013-05-16')

And you can get the right customerId and the "right" saleId like this. (I say "right" saleId, because there could be more than one on the day in question. I just chose the max.)

select sales.customerId, sales.saledate, max(saleId)
from sales
inner join (select customerId, max(saledate) max_date
from sales
where saledate < '2013-05-16'
group by customerId) max_dates
on sales.customerId = max_dates.customerId
and sales.saledate = max_dates.max_date
inner join (select distinct customerId
from sales
where customerID not in
(select customerId
from sales
where saledate >= '2013-05-16')) no_sales
on sales.customerId = no_sales.customerId
group by sales.customerId, sales.saledate

Personally, I find common table expressions make it easier for me to read SQL statements like that without getting lost in the SELECTs.

with no_sales as (
select distinct customerId
from sales
where customerID not in
(select customerId
from sales
where saledate >= '2013-05-16')
),
max_dates as (
select customerId, max(saledate) max_date
from sales
where saledate < '2013-05-16'
group by customerId
)
select sales.customerId, sales.saledate, max(saleId)
from sales
inner join max_dates
on sales.customerId = max_dates.customerId
and sales.saledate = max_dates.max_date
inner join no_sales
on sales.customerId = no_sales.customerId
group by sales.customerId, sales.saledate

must appear in the GROUP BY clause or be used in an aggregate function

Yes, this is a common aggregation problem. Before SQL3 (1999), the selected fields must appear in the GROUP BY clause[*].

To workaround this issue, you must calculate the aggregate in a sub-query and then join it with itself to get the additional columns you'd need to show:

SELECT m.cname, m.wmname, t.mx
FROM (
SELECT cname, MAX(avg) AS mx
FROM makerar
GROUP BY cname
) t JOIN makerar m ON m.cname = t.cname AND t.mx = m.avg
;

cname | wmname | mx
--------+--------+------------------------
canada | zoro | 2.0000000000000000
spain | usopp | 5.0000000000000000

But you may also use window functions, which looks simpler:

SELECT cname, wmname, MAX(avg) OVER (PARTITION BY cname) AS mx
FROM makerar
;

The only thing with this method is that it will show all records (window functions do not group). But it will show the correct (i.e. maxed at cname level) MAX for the country in each row, so it's up to you:

 cname  | wmname |          mx           
--------+--------+------------------------
canada | zoro | 2.0000000000000000
spain | luffy | 5.0000000000000000
spain | usopp | 5.0000000000000000

The solution, arguably less elegant, to show the only (cname, wmname) tuples matching the max value, is:

SELECT DISTINCT /* distinct here matters, because maybe there are various tuples for the same max value */
m.cname, m.wmname, t.avg AS mx
FROM (
SELECT cname, wmname, avg, ROW_NUMBER() OVER (PARTITION BY avg DESC) AS rn
FROM makerar
) t JOIN makerar m ON m.cname = t.cname AND m.wmname = t.wmname AND t.rn = 1
;


cname | wmname | mx
--------+--------+------------------------
canada | zoro | 2.0000000000000000
spain | usopp | 5.0000000000000000

[*]: Interestingly enough, even though the spec sort of allows to select non-grouped fields, major engines seem to not really like it. Oracle and SQLServer just don't allow this at all. Mysql used to allow it by default, but now since 5.7 the administrator needs to enable this option (ONLY_FULL_GROUP_BY) manually in the server configuration for this feature to be supported...

Select multiple columns from a table, but group by one

I use this trick to group by one column when I have a multiple columns selection:

SELECT MAX(id) AS id,
Nume,
MAX(intrare) AS intrare,
MAX(iesire) AS iesire,
MAX(intrare-iesire) AS stoc,
MAX(data) AS data
FROM Produse
GROUP BY Nume
ORDER BY Nume

This works.



Related Topics



Leave a reply



Submit