Execution Sequence of Group By, Having and Where Clause in SQL Server

Execution sequence of Group By, Having and Where clause in SQL Server?

in order:

FROM & JOINs determine & filter rows

WHERE more filters on the rows

GROUP BY combines those rows into groups

HAVING filters groups

ORDER BY arranges the remaining rows/groups

LIMIT filters on the remaining rows/groups

sql query execution order question for group by and select

You can use least() and greatest() to group by consistently:

select 
least(pickup_state, dropoff_state) pickup,
greatest(pickup_state, dropoff_state) dropoff,
count(*) as no_of_shipment
from t_hr_ship s
where month(date_time) in (1, 2)
group by pickup, dropoff
order by no_of_shipment desc
limit 3;

Note that, unlike other RDBMS, MySQL allows the use of column aliases in the GROUP BY clause (and as well in the ORDER BY clause, but this is common in most RDMS).

Demo on DB Fiddle:


pickup | dropoff | no_of_shipment
:----- | :------ | -------------:
OR | WA | 8
CA | OR | 2
CA | WA | 1

Is there any specific order of execution in SQL query?

There is a logical order to evaluation of the query text, but the database engine can choose what order execute the query components based upon what is most optimal. The logical text parsing ordering is listed below. That is, for example, why you can't use an alias from SELECT clause in a WHERE clause. As far as the query parsing process is concerned, the alias doesn't exist yet.

  1. FROM

  2. ON

  3. OUTER

  4. WHERE

  5. GROUP BY

  6. CUBE | ROLLUP (these are not present in MySQL but are in some other SQL dialects)

  7. HAVING

  8. SELECT

  9. DISTINCT

  10. ORDER BY

  11. LIMIT (or, in MSSQL, TOP)

See the Microsoft documentation (see "Logical Processing Order of the SELECT statement") for more information on this.

How to combine GROUP BY, ORDER BY and HAVING

ORDER BY is always last...

However, you need to pick the fields you ACTUALLY WANT then select only those and group by them. SELECT * and GROUP BY Email will give you RANDOM VALUES for all the fields but Email. Most RDBMS will not even allow you to do this because of the issues it creates, but MySQL is the exception.

SELECT Email, COUNT(*)
FROM user_log
GROUP BY Email
HAVING COUNT(*) > 1
ORDER BY UpdateDate DESC

WHERE vs. HAVING performance with GROUP BY

One of your assumptions is wrong: HAVING is slower than WHERE because it only filters results after accessing and hashing rows.

It's that hashing part that makes HAVING conditions more expensive than WHERE conditions. Hashing requires writing data, which can be more expensive both physically and algorithmically.

Theory

Hashing requires writing as well as reading data. Ideally hashing the data will run in O(n) time. But in practice there will be hash collisions, which slow things down. And in practice not all the data will fit in memory.

Those two problems can be disastrous. In the worst-case, with limited memory, the hashing requires multiple passes and the complexity approaches O(n^2). And writing to disk in the temporary tablespace is orders of magnitude slower than writing to memory.

Those are the kind of performance issues you need to worry about with databases. The constant time to run simple conditions and expressions is usually irrelevant compared to the time to read, write, and join the data.

That might be especially true in your environment. The operation TABLE ACCESS STORAGE FULL implies you are using Exadata. Depending on the platform you might be taking advantage of SQL in silicon. Those high-level conditions may translate perfectly to low-level instructions executed on storage devices. Which means your estimate of the cost of executing a clause may be several orders of magnitude too high.

Practice

Create a sample table with 100,000 rows:

create table customer(id number, status varchar2(100));

insert into customer
select
level,
case
when level <= 15000 then 'Deceased'
when level between 15001 and 50001 then 'Active'
else 'Dormant'
end
from dual
connect by level <= 100000;

begin
dbms_stats.gather_table_stats(user, 'customer');
end;
/

Running the code in a loop shows that the WHERE version is about twice as fast as the HAVING version.

--Run times (in seconds): 0.765, 0.78, 0.765
declare
type string_nt is table of varchar2(100);
type number_nt is table of number;
v_status string_nt;
v_count number_nt;
begin
for i in 1 .. 100 loop
SELECT status, count(status)
bulk collect into v_status, v_count
FROM customer
GROUP BY status
HAVING status != 'Active' AND status != 'Dormant';
end loop;
end;
/

--Run times (in seconds): 0.39, 0.39, 0.39
declare
type string_nt is table of varchar2(100);
type number_nt is table of number;
v_status string_nt;
v_count number_nt;
begin
for i in 1 .. 100 loop
SELECT status, count(status)
bulk collect into v_status, v_count
FROM customer
WHERE status != 'Active' AND status != 'Dormant'
GROUP BY status;
end loop;
end;
/

SQL Order of execution vs Order of writing

SQL is a declarative language, not a procedural language. That means that the SQL compiler and optimizer determine what operations are actually run. These operations typically take the form of a directed acyclic graph (DAG) of operations.

The operators have no obvious relationship to the original query -- except that the results it generates are guaranteed to be the same. In terms of execution there are no clauses, just things like "hash join" and "filter" and "sort" -- or whatever the database implements for the DAG.

You are confusing execution with compilation and probably you just care about scoping rules.

So, to start with SQL has a set of clauses and these are in a very specified order. Your question contains this ordering -- at least for a database that supports those clauses.

The second part is the ordering for identifying identifiers. Basically, this comes down to:

  • Table aliases are defined in the FROM clause. So this can be considered as "first" for scoping purposes.
  • Column aliases are defined in the SELECT clause. By the SQL Standard, column aliases can be used in the ORDER BY. Many databases extend this to the QUALIFY (if supported), HAVING, and GROUP BY clauses. In general, databases do not support them in the WHERE clause.
  • If two tables in the FROM have the same column name, then the column has to be qualified to identify the table. The one exception to this is when the column is a key in a JOIN and the USING clause is used. Then the unqualified column name is fine.
  • If a column alias defined in the SELECT conflicts with a table alias in a clause that supports column aliases, then it is up to the database which to choose.


Related Topics



Leave a reply



Submit