Why Do I Need to Explicitly Specify All Columns in a SQL "Group By" Clause - Why Not "Group by *"

Why do I need to explicitly specify all columns in a SQL GROUP BY clause - why not GROUP BY * ?

It's hard to know exactly what the designers of the SQL language were thinking when they wrote the standard, but here's my opinion.

SQL, as a general rule, requires you to explicitly state your expectations and your intent. The language does not try to "guess what you meant", and automatically fill in the blanks. This is a good thing.

When you write a query the most important consideration is that it yields correct results. If you made a mistake, it's probably better that the SQL parser informs you, rather than making a guess about your intent and returning results that may not be correct. The declarative nature of SQL (where you state what you want to retrieve rather than the steps how to retrieve it) already makes it easy to inadvertently make mistakes. Introducing fuzziniess into the language syntax would not make this better.

In fact, every case I can think of where the language allows for shortcuts has caused problems. Take, for instance, natural joins - where you can omit the names of the columns you want to join on and allow the database to infer them based on column names. Once the column names change (as they naturally do over time) - the semantics of existing queries changes with them. This is bad ... very bad - you really don't want this kind of magic happening behind the scenes in your database code.

One consequence of this design choice, however, is that SQL is a verbose language in which you must explicitly express your intent. This can result in having to write more code than you may like, and gripe about why certain constructs are so verbose ... but at the end of the day - it is what it is.

Reason for Column is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause

Suppose I have the following table T:

a   b
--------
1 abc
1 def
1 ghi
2 jkl
2 mno
2 pqr

And I do the following query:

SELECT a, b
FROM T
GROUP BY a

The output should have two rows, one row where a=1 and a second row where a=2.

But what should the value of b show on each of these two rows? There are three possibilities in each case, and nothing in the query makes it clear which value to choose for b in each group. It's ambiguous.

This demonstrates the single-value rule, which prohibits the undefined results you get when you run a GROUP BY query, and you include any columns in the select-list that are neither part of the grouping criteria, nor appear in aggregate functions (SUM, MIN, MAX, etc.).

Fixing it might look like this:

SELECT a, MAX(b) AS x
FROM T
GROUP BY a

Now it's clear that you want the following result:

a   x
--------
1 ghi
2 pqr

Portable GROUP BY clause rules?

group by is part of the SQL standard. While SQL databases often diverge from the standard, it's good to know the standard, rather than "SQL is what $database does". group by is pretty well standardized across databases, but there's going to be quirks.

Unfortunately the SQL standards are quite expensive, and have gone through several revisions. Fortunately, even an old copy of the standard is still useful. In particular part 2 which has the grammar and syntax rules. I'd suggest you find a copy and read through it.

In addition there are these free resources.

  • Jake Wheat's SQL Overview we can examine the SQL:2016 group by grammar.
  • SQL-99 Complete, Really - Chapter 33 - Searching with Groups

    • Rules For Grouping Columns is particularly useful.

Let's line up some of your derived rules with the standard. I'm using SQL:1999 because that's what I've got.

Yours.

If an aggregate expression is used and no explicit GROUP BY clause exists, a GROUP BY clause with an empty grouping set is implied

SQL:1999.

6.16.2
The argument of COUNT(*) and the argument source of a <general set function> is a table or a group of a grouped table as specified in Subclause 7.10, <having clause>, and Subclause 7.11, <query specification>.


Yours.

The GROUP BY clause is required before HAVING.

This is incorrect. The group by clause must come before the having clause, but neither is required.

SQL:1999.

7.4 <table expression>

<table expression> ::=
<from clause>
[ <where clause> ]
[ <group by clause> ]
[ <having clause> ]

SQL-99 Complete, Really points out that you can have a having clause without a group by clause.

SELECT COUNT(a) FROM Somethings
HAVING COUNT(a) > 5;

This SQL statement means “if there are more than 3 a’s in the whole Table, display how many a’s there are”. As is usual, because there is a set function in the SELECT statement, there is an implied GROUP BY (). Therefore grouping rules apply: the select list in such an SQL statement may contain only single-valued Columns.


Yours.

Columns used in non-aggregate expressions in the SELECT/HAVING/ORDER BY clause must be in the GROUP BY clause

SQL-99 Complete, Really.

You must follow “The Single-Value Rule” – every Column named in the select list must also be a grouping Column, unless it is an argument for one of the set functions.


Yours.

Aliased expressions in the SELECT clause must not be used in the GROUP BY clause.

The GROUP BY clause may contain columns not in the SELECT clause.

SQL-99 Complete, Really.

Each Column in a GROUP BY clause must unambiguously name a Column that belongs to a Table named in the SELECT statement’s FROM clause. The name may be qualified, i.e.: it may be a . Such a Column is called a grouping Column: its values will be grouped for the final result.

SELECT a*5 AS b FROM T GROUP BY b;
illegal: a is not a grouping Column and b isn't evaluated until the select list is; by then it's too late

SELECT a+5 FROM T GROUP BY a+5;
illegal: expression in GROUP BY

Yours.

The ORDER BY clause may reference aliased aggregate expressions in the SELECT clause.

SQL-99 Complete, Really.

The superficial similarity of the GROUP BY clause and the ORDER BY clause often misleads people. The big difference is that grouping is done on the input (that is, the Tables named in the FROM clause), while ordering is done on the output (that is, the Columns named in the select list). So, although you can say “ORDER BY integer” (only in SQL-92 though) and “ORDER BY expression”, it makes no sense to say “GROUP BY integer” or “GROUP BY expression”. On the other hand, grouping Columns don’t have to be in the select list, as sorted Columns must.

Yours.

The HAVING clause must not reference aliases in the SELECT clause

SQL-99 Complete, Really.

Operands in the HAVING clause are subject to the same restrictions as in the select list:

  • Column expressions in both must be single-valued per group.
  • Column references must be unambiguous.
  • If a SELECT statement contains HAVING without a preceding GROUP BY clause, the select list can’t include any references to Columns belonging to a Table named in the FROM clause unless those references are used with a set function.
  • If HAVING includes a subquery, it can’t include outer Column references unless those references are to grouping Columns or are used with a set function.

And so on.

What is the reason not to use select *?

The essence of the quote of not prematurely optimizing is to go for simple and straightforward code and then use a profiler to point out the hot spots, which you can then optimize to be efficient.

When you use select * you're make it impossible to profile, therefore you're not writing clear & straightforward code and you are going against the spirit of the quote. select * is an anti-pattern.


So selecting columns is not a premature optimization. A few things off the top of my head ....

  1. If you specify columns in a SQL statement, the SQL execution engine will error if that column is removed from the table and the query is executed.
  2. You can more easily scan code where that column is being used.
  3. You should always write queries to bring back the least amount of information.
  4. As others mention if you use ordinal column access you should never use select *
  5. If your SQL statement joins tables, select * gives you all columns from all tables in the join

The corollary is that using select * ...

  1. The columns used by the application is opaque
  2. DBA's and their query profilers are unable to help your application's poor performance
  3. The code is more brittle when changes occur
  4. Your database and network are suffering because they are bringing back too much data (I/O)
  5. Database engine optimizations are minimal as you're bringing back all data regardless (logical).

Writing correct SQL is just as easy as writing Select *. So the real lazy person writes proper SQL because they don't want to revisit the code and try to remember what they were doing when they did it. They don't want to explain to the DBA's about every bit of code. They don't want to explain to their clients why the application runs like a dog.

Why is SELECT * considered harmful?

There are really three major reasons:

  • Inefficiency in moving data to the consumer. When you SELECT *, you're often retrieving more columns from the database than your application really needs to function. This causes more data to move from the database server to the client, slowing access and increasing load on your machines, as well as taking more time to travel across the network. This is especially true when someone adds new columns to underlying tables that didn't exist and weren't needed when the original consumers coded their data access.

  • Indexing issues. Consider a scenario where you want to tune a query to a high level of performance. If you were to use *, and it returned more columns than you actually needed, the server would often have to perform more expensive methods to retrieve your data than it otherwise might. For example, you wouldn't be able to create an index which simply covered the columns in your SELECT list, and even if you did (including all columns [shudder]), the next guy who came around and added a column to the underlying table would cause the optimizer to ignore your optimized covering index, and you'd likely find that the performance of your query would drop substantially for no readily apparent reason.

  • Binding Problems. When you SELECT *, it's possible to retrieve two columns of the same name from two different tables. This can often crash your data consumer. Imagine a query that joins two tables, both of which contain a column called "ID". How would a consumer know which was which? SELECT * can also confuse views (at least in some versions SQL Server) when underlying table structures change -- the view is not rebuilt, and the data which comes back can be nonsense. And the worst part of it is that you can take care to name your columns whatever you want, but the next guy who comes along might have no way of knowing that he has to worry about adding a column which will collide with your already-developed names.

But it's not all bad for SELECT *. I use it liberally for these use cases:

  • Ad-hoc queries. When trying to debug something, especially off a narrow table I might not be familiar with, SELECT * is often my best friend. It helps me just see what's going on without having to do a boatload of research as to what the underlying column names are. This gets to be a bigger "plus" the longer the column names get.

  • When * means "a row". In the following use cases, SELECT * is just fine, and rumors that it's a performance killer are just urban legends which may have had some validity many years ago, but don't now:

    SELECT COUNT(*) FROM table;

    in this case, * means "count the rows". If you were to use a column name instead of * , it would count the rows where that column's value was not null. COUNT(*), to me, really drives home the concept that you're counting rows, and you avoid strange edge-cases caused by NULLs being eliminated from your aggregates.

    Same goes with this type of query:

    SELECT a.ID FROM TableA a
    WHERE EXISTS (
    SELECT *
    FROM TableB b
    WHERE b.ID = a.B_ID);

    in any database worth its salt, * just means "a row". It doesn't matter what you put in the subquery. Some people use b's ID in the SELECT list, or they'll use the number 1, but IMO those conventions are pretty much nonsensical. What you mean is "count the row", and that's what * signifies. Most query optimizers out there are smart enough to know this. (Though to be honest, I only know this to be true with SQL Server and Oracle.)



Related Topics



Leave a reply



Submit