Why Do We Need Group by with Aggregate Functions

It might be easier if you think of GROUP BY as "for each" for the sake of explanation. The query below:

SELECT EmpID, SUM(MonthlySalary)
FROM Employee
GROUP BY EmpID

is saying:

"Give me the sum of MonthlySalary for each EmpID"

So if your table looked like this:

+-------+---------------+
| empid | MonthlySalary |
+-------+---------------+
| 1     | 200           |
| 2     | 300           |
+-------+---------------+

result:

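+-------+--------------------+
| empid | SUM(MonthlySalary) |
+-------+--------------------+
| 1     | 200                |
| 2     | 300                |
+-------+--------------------+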

SUM wouldn't appear to do anything here, because the sum of one number is that number. On the other hand, if the table looked like this:

+-------+---------------+
| empid | MonthlySalary |
+-------+---------------+
| 1     | 200           |
| 1     | 300           |
| 2     | 300           |
+-------+---------------+

result:

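+-------+--------------------+
| empid | SUM(MonthlySalary) |
+-------+--------------------+
| 1     | 500                |
| 2     | 300                |
+-------+--------------------+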

Then SUM would have a visible effect, because there are two rows with empid 1 to add together. Not sure if this explanation helps or not, but I hope it makes things a little clearer.
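
If you want to try the "for each" reading yourself, here is a minimal sketch using Python's built-in sqlite3 module. The table and data mirror the second example above; the choice of SQLite is only an assumption for convenience, and any SQL database should behave the same way for this query.

```python
import sqlite3

# In-memory database holding the second example table from above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (EmpID INTEGER, MonthlySalary INTEGER)")
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?)",
    [(1, 200), (1, 300), (2, 300)],
)

# "Give me the sum of MonthlySalary for each EmpID."
rows = conn.execute(
    "SELECT EmpID, SUM(MonthlySalary) "
    "FROM Employee "
    "GROUP BY EmpID "
    "ORDER BY EmpID"
).fetchall()

print(rows)  # [(1, 500), (2, 300)]
```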

Does Group by always need an aggregate function?

In an aggregation query, the unaggregated expressions in the SELECT need to be consistent with the expressions in the GROUP BY. All other expressions need to use aggregation functions.

A GROUP BY query does not have to have any aggregation functions in the SELECT.

The most common type of consistency is that the unaggregated expressions are exactly the same. However, the SQL Standard also supports:

  • Expressions based on the expressions in the GROUP BY (and constants).
  • GROUP BY keys that are missing from the SELECT.
  • Any column from a table where the primary key or unique key is in the GROUP BY.

The third involves a concept called "functional dependence" and most databases do not (yet) support this functionality.
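
The first two kinds of consistency, plus a GROUP BY with no aggregate at all, can be sketched with Python's sqlite3 module. The sales table and its columns are made up for illustration. The functional-dependence case is left out because support for it varies by database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table, used only for this illustration.
conn.execute("CREATE TABLE sales (region TEXT, year INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("east", 2020, 10), ("east", 2020, 5), ("east", 2021, 7), ("west", 2021, 3)],
)

# GROUP BY with no aggregate function: behaves like SELECT DISTINCT.
no_agg = conn.execute(
    "SELECT region FROM sales GROUP BY region ORDER BY region"
).fetchall()

# An expression built from a GROUP BY key and a constant.
expr = conn.execute(
    "SELECT year + 1, SUM(amount) FROM sales GROUP BY year ORDER BY year"
).fetchall()

# A GROUP BY key that is missing from the SELECT list.
hidden_key = conn.execute(
    "SELECT SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

print(no_agg)      # [('east',), ('west',)]
print(expr)        # [(2021, 15), (2022, 10)]
print(hidden_key)  # [(22,), (3,)]
```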

If you have a question about how to do something in particular, then ask a new question. Provide sample data, desired results, an explanation of what you want to do, and an appropriate database tag.

Is it necessary to use GROUP BY when we are selecting an aggregate function on a column in addition to a column without an aggregate function?

Why is it necessary to use GROUP BY? You have a query that returns one row per city, even if there is only one city. That makes your query an aggregation query. You can adjust the query so that GROUP BY is not needed. Here are two methods:

SELECT MAX(cityName) as cityName, MAX(highTemperature) As highTemperature
FROM Weather
WHERE cityName = 'Rawalpindi';

Or:

SELECT 'Rawalpindi' as cityName, MAX(highTemperature) As highTemperature
FROM Weather
WHERE cityName = 'Rawalpindi';

Both of these are valid aggregation queries with no GROUP BY. As such, they will return exactly one row -- even if no rows match the WHERE clause. Instead of a bare cityName column, they use either an aggregation function or a constant, so there is no problem.
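
The "exactly one row" behavior is easy to check. Below is a small sketch with Python's sqlite3 module. The temperatures and the helper function names are invented for the example, and 'Karachi' is simply a city that is absent from the sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Weather (cityName TEXT, highTemperature INTEGER)")
conn.executemany(
    "INSERT INTO Weather VALUES (?, ?)",
    [("Rawalpindi", 38), ("Rawalpindi", 41), ("Lahore", 40)],
)

def max_temp_no_group(city):
    # Aggregation query without GROUP BY: always returns exactly one row.
    return conn.execute(
        "SELECT MAX(cityName), MAX(highTemperature) "
        "FROM Weather WHERE cityName = ?",
        (city,),
    ).fetchall()

def max_temp_grouped(city):
    # Same filter with GROUP BY: one row per group, so possibly no rows.
    return conn.execute(
        "SELECT cityName, MAX(highTemperature) "
        "FROM Weather WHERE cityName = ? GROUP BY cityName",
        (city,),
    ).fetchall()

found = max_temp_no_group("Rawalpindi")
missing_no_group = max_temp_no_group("Karachi")
missing_grouped = max_temp_grouped("Karachi")

print(found)             # [('Rawalpindi', 41)]
print(missing_no_group)  # [(None, None)]
print(missing_grouped)   # []
```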

When is GROUP BY required for aggregate functions?

You gave a query as an example not requiring GROUP BY. For the sake of explanation, I'll simplify it as follows.

SELECT MAX(key)
FROM myEntity
WHERE accounts_id = 123

Why doesn't that query require GROUP BY? Because you only expect one row in the result set, describing a particular account.

What if you wanted a result set describing all your accounts with one row per account? Then you would use this:

SELECT accounts_id, MAX(key)
FROM myEntity
GROUP BY accounts_id

See how that goes? You get one row in this result set for each distinct value of accounts_id. By the way, MySQL's query planner knows that

SELECT accounts_id, MAX(key)
FROM myEntity
WHERE accounts_id = 123
GROUP BY accounts_id

is equivalent to the same query omitting the GROUP BY clause.
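
Here is a rough sketch of those three query shapes using Python's sqlite3 module rather than MySQL, so it says nothing about MySQL's planner. The sample rows are invented, and the key column is double-quoted only because KEY is a keyword in some databases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE myEntity (accounts_id INTEGER, "key" INTEGER)')
conn.executemany(
    "INSERT INTO myEntity VALUES (?, ?)",
    [(123, 5), (123, 9), (456, 2)],
)

# One account, one row expected: no GROUP BY needed.
single = conn.execute(
    'SELECT MAX("key") FROM myEntity WHERE accounts_id = 123'
).fetchall()

# All accounts, one row per account: GROUP BY required.
per_account = conn.execute(
    'SELECT accounts_id, MAX("key") FROM myEntity '
    "GROUP BY accounts_id ORDER BY accounts_id"
).fetchall()

# Filter plus GROUP BY: a single group, so a single row again.
with_group = conn.execute(
    'SELECT accounts_id, MAX("key") FROM myEntity '
    "WHERE accounts_id = 123 GROUP BY accounts_id"
).fetchall()

print(single)       # [(9,)]
print(per_account)  # [(123, 9), (456, 2)]
print(with_group)   # [(123, 9)]
```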

One more thing to know: If you have a compound index on (accounts_id, key) in your table, all these queries will be almost miraculously fast because the query planner will satisfy them with a very efficient loose index scan. That's specific to the MAX() and MIN() aggregate functions. Loose index scans can't be used for SUM() or AVG() or similar functions; those require tight index scans.

How to Use Group By clause when we use Aggregate function in the Joins?

GROUP BY performs the aggregation (SUM, MIN, etc.) once for each unique combination of the specified columns. If a selected column appears neither in the GROUP BY clause nor inside an aggregate function, the SQL engine cannot know which of the group's values it should return for that column.
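
As an illustration with a join, here is a minimal sketch using Python's sqlite3 module. The customers and orders tables are hypothetical and not taken from the original question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and data for the illustration.
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (customer_id INTEGER, amount INTEGER);
INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO orders VALUES (1, 10), (1, 15), (2, 7);
""")

# Every selected column is either a GROUP BY key (c.name)
# or wrapped in an aggregate function (SUM(o.amount)).
totals = conn.execute("""
SELECT c.name, SUM(o.amount) AS total
FROM customers c
JOIN orders o ON o.customer_id = c.id
GROUP BY c.name
ORDER BY c.name
""").fetchall()

print(totals)  # [('Ann', 25), ('Bob', 7)]
```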

Why GROUP BY is needed for a nested aggregate function in Oracle

Well, the error message you get says it: ORA-00978: nested group function without GROUP BY

If you check the documentation, there is no explicit mention of this limitation, but if you read the description of the functionality carefully, you should realize that the GROUP BY clause is required for the usage of nested aggregate functions.

You can nest aggregate functions ...
This calculation evaluates the inner aggregate (MAX(salary)) for each group defined by the GROUP BY clause ...,
and aggregates the results again.

So you have two workarounds to simulate the nested aggregation without GROUP BY

A) Add constant GROUP BY

select max(count(*)) from tab group by 42;

Note that you must use NVL if you require a zero result on an empty table (which is what case B returns)

select nvl(max(count(*)),0) max_cnt from tab group by 42

B) Split into two subqueries

with tab2 as (
select count(*) cnt from tab)
select max(cnt) max_cnt
from tab2

must appear in the GROUP BY clause or be used in an aggregate function

Yes, this is a common aggregation problem. Before SQL3 (1999), the selected fields had to appear in the GROUP BY clause[*].

To work around this issue, you must calculate the aggregate in a sub-query and then join it with the original table to get the additional columns you'd need to show:

SELECT m.cname, m.wmname, t.mx
FROM (
SELECT cname, MAX(avg) AS mx
FROM makerar
GROUP BY cname
) t JOIN makerar m ON m.cname = t.cname AND t.mx = m.avg
;

 cname  | wmname |         mx
--------+--------+--------------------
 canada | zoro   | 2.0000000000000000
 spain  | usopp  | 5.0000000000000000
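
For anyone who wants to reproduce this, here is a sketch of the sub-query join using Python's sqlite3 module. The makerar rows are assumed sample data. In particular, luffy's avg of 1.0 is invented, since only the maxima appear in the output above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE makerar (cname TEXT, wmname TEXT, avg REAL)")
# Assumed sample data; luffy's value is made up for the example.
conn.executemany(
    "INSERT INTO makerar VALUES (?, ?, ?)",
    [("canada", "zoro", 2.0), ("spain", "luffy", 1.0), ("spain", "usopp", 5.0)],
)

# Compute the per-country maximum, then join back to pick up wmname.
rows = conn.execute("""
SELECT m.cname, m.wmname, t.mx
FROM (
    SELECT cname, MAX(avg) AS mx
    FROM makerar
    GROUP BY cname
) t
JOIN makerar m ON m.cname = t.cname AND t.mx = m.avg
ORDER BY m.cname
""").fetchall()

print(rows)  # [('canada', 'zoro', 2.0), ('spain', 'usopp', 5.0)]
```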

But you may also use window functions, which look simpler:

SELECT cname, wmname, MAX(avg) OVER (PARTITION BY cname) AS mx
FROM makerar
;

The only thing with this method is that it will show all records (window functions do not group). But it will show the correct (i.e. maxed at cname level) MAX for the country in each row, so it's up to you:

 cname  | wmname |         mx
--------+--------+--------------------
 canada | zoro   | 2.0000000000000000
 spain  | luffy  | 5.0000000000000000
 spain  | usopp  | 5.0000000000000000

The solution, arguably less elegant, to show only the (cname, wmname) tuples matching the max value, is:

SELECT DISTINCT /* distinct here matters, because maybe there are various tuples for the same max value */
m.cname, m.wmname, t.avg AS mx
FROM (
/* rank the rows within each country, highest avg first */
SELECT cname, wmname, avg, ROW_NUMBER() OVER (PARTITION BY cname ORDER BY avg DESC) AS rn
FROM makerar
) t JOIN makerar m ON m.cname = t.cname AND m.wmname = t.wmname AND t.rn = 1
;


 cname  | wmname |         mx
--------+--------+--------------------
 canada | zoro   | 2.0000000000000000
 spain  | usopp  | 5.0000000000000000
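
The two window-function variants can be compared side by side with Python's sqlite3 module. This is a sketch under some assumptions: the linked SQLite library must be 3.25 or newer for window functions, the sample rows are invented (luffy's avg of 1.0 in particular), and the second query filters on rn directly instead of joining back to makerar, which is a simplification of the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE makerar (cname TEXT, wmname TEXT, avg REAL)")
# Assumed sample data; luffy's value is made up for the example.
conn.executemany(
    "INSERT INTO makerar VALUES (?, ?, ?)",
    [("canada", "zoro", 2.0), ("spain", "luffy", 1.0), ("spain", "usopp", 5.0)],
)

# Window function: keeps every row, each annotated with its country's maximum.
windowed = conn.execute("""
SELECT cname, wmname, MAX(avg) OVER (PARTITION BY cname) AS mx
FROM makerar
ORDER BY cname, wmname
""").fetchall()

# Rank rows inside each country and keep only the top-ranked one.
top = conn.execute("""
SELECT cname, wmname, avg AS mx
FROM (
    SELECT cname, wmname, avg,
           ROW_NUMBER() OVER (PARTITION BY cname ORDER BY avg DESC) AS rn
    FROM makerar
) t
WHERE rn = 1
ORDER BY cname
""").fetchall()

print(windowed)  # [('canada', 'zoro', 2.0), ('spain', 'luffy', 5.0), ('spain', 'usopp', 5.0)]
print(top)       # [('canada', 'zoro', 2.0), ('spain', 'usopp', 5.0)]
```

Note that ROW_NUMBER() keeps a single row per country even when several rows tie for the maximum; RANK() would keep all tied rows.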

[*]: Interestingly enough, even though the spec sort of allows selecting non-grouped fields, major engines seem to not really like it. Oracle and SQL Server just don't allow this at all. MySQL used to allow it by default, but since 5.7 the ONLY_FULL_GROUP_BY SQL mode is enabled by default, so the administrator needs to remove it from sql_mode in the server configuration for this feature to be supported.

Is it really necessary to have GROUP BY in the SQL standard

Because they may not always match exactly.

For example, if I want to find out the maximum number of books per category, I could do:

select max(cnt)
from (
select count(*) as cnt
from books
group by category
) t;

In some DBs such as Oracle, you can even do this:

select max(count(*))
from books
group by category;

I don't really need to specify the category column as I don't need it.
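
The derived-table version runs on most databases. Here is a quick sketch with Python's sqlite3 module; the books rows are invented. The nested max(count(*)) form is not shown because SQLite does not accept nested aggregates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, category TEXT)")
# Made-up rows: three books in one category, one in another.
conn.executemany(
    "INSERT INTO books VALUES (?, ?)",
    [("A", "sql"), ("B", "sql"), ("C", "sql"), ("D", "go")],
)

# The inner query groups by category but never selects it;
# the outer query takes the largest of the per-category counts.
max_cnt = conn.execute("""
SELECT MAX(cnt)
FROM (
    SELECT COUNT(*) AS cnt
    FROM books
    GROUP BY category
) t
""").fetchall()

print(max_cnt)  # [(3,)]
```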

A few databases such as Postgres support the use of aliases in the group by clause.

using group by and aggregate function

It is certainly repetitive and therefore undesirable, but it's hard to avoid the repetition. You could use nested queries:

SELECT orderid, totalvalue
FROM (SELECT orderid, SUM(qty * unitprice) AS totalvalue
FROM sales.orderdetails
GROUP BY orderid) AS order_value
WHERE totalvalue > 10000

You'd need to look at your DBMS's optimizer plan to determine whether there's a significant performance penalty for doing it like that, but it avoids repeating the SUM(qty * unitprice) expression. Ideally, the optimizer will push the outer WHERE clause into the inner query as a HAVING clause, but I'd not want to guarantee that it does (and different systems may, probably will, handle it differently).
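
To see that the nested query and the HAVING form return the same rows, here is a small sketch with Python's sqlite3 module. The table is named orderdetails without the sales schema prefix, and the rows are invented for the example. It says nothing about how any particular optimizer plans the two forms.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orderdetails (orderid INTEGER, qty INTEGER, unitprice INTEGER)"
)
# Made-up rows: order 1 totals 11000, order 2 totals 200.
conn.executemany(
    "INSERT INTO orderdetails VALUES (?, ?, ?)",
    [(1, 100, 60), (1, 50, 100), (2, 10, 20)],
)

# Nested query: the SUM expression is written once and filtered by its alias.
nested = conn.execute("""
SELECT orderid, totalvalue
FROM (SELECT orderid, SUM(qty * unitprice) AS totalvalue
      FROM orderdetails
      GROUP BY orderid) AS order_value
WHERE totalvalue > 10000
""").fetchall()

# HAVING form: the SUM expression has to be repeated.
having = conn.execute("""
SELECT orderid, SUM(qty * unitprice) AS totalvalue
FROM orderdetails
GROUP BY orderid
HAVING SUM(qty * unitprice) > 10000
""").fetchall()

print(nested)  # [(1, 11000)]
print(having)  # [(1, 11000)]
```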


