Execution sequence of Group By, Having and Where clause in SQL Server?
in order:
FROM & JOINs determine & filter rows
WHERE more filters on the rows
GROUP BY combines those rows into groups
HAVING filters groups
ORDER BY arranges the remaining rows/groups
LIMIT filters on the remaining rows/groups
sql query execution order question for group by and select
You can use least()
and greatest()
to group by consistently:
select
least(pickup_state, dropoff_state) pickup,
greatest(pickup_state, dropoff_state) dropoff,
count(*) as no_of_shipment
from t_hr_ship s
where month(date_time) in (1, 2)
group by pickup, dropoff
order by no_of_shipment desc
limit 3;
Note that, unlike other RDBMS, MySQL allows the use of column aliases in the GROUP BY
clause (and as well in the ORDER BY
clause, but this is common in most RDMS).
Demo on DB Fiddle:
pickup | dropoff | no_of_shipment
:----- | :------ | -------------:
OR | WA | 8
CA | OR | 2
CA | WA | 1
Is there any specific order of execution in SQL query?
There is a logical order to evaluation of the query text, but the database engine can choose what order execute the query components based upon what is most optimal. The logical text parsing ordering is listed below. That is, for example, why you can't use an alias from SELECT
clause in a WHERE
clause. As far as the query parsing process is concerned, the alias doesn't exist yet.
FROM
ON
OUTER
WHERE
GROUP BY
CUBE | ROLLUP (these are not present in MySQL but are in some other SQL dialects)
HAVING
SELECT
DISTINCT
ORDER BY
LIMIT (or, in MSSQL, TOP)
See the Microsoft documentation (see "Logical Processing Order of the SELECT statement") for more information on this.
How to combine GROUP BY, ORDER BY and HAVING
ORDER BY
is always last...
However, you need to pick the fields you ACTUALLY WANT then select only those and group by them. SELECT *
and GROUP BY Email
will give you RANDOM VALUES for all the fields but Email
. Most RDBMS will not even allow you to do this because of the issues it creates, but MySQL is the exception.
SELECT Email, COUNT(*)
FROM user_log
GROUP BY Email
HAVING COUNT(*) > 1
ORDER BY UpdateDate DESC
WHERE vs. HAVING performance with GROUP BY
One of your assumptions is wrong: HAVING is slower than WHERE because it only filters results after accessing and hashing rows.
It's that hashing part that makes HAVING conditions more expensive than WHERE conditions. Hashing requires writing data, which can be more expensive both physically and algorithmically.
Theory
Hashing requires writing as well as reading data. Ideally hashing the data will run in O(n)
time. But in practice there will be hash collisions, which slow things down. And in practice not all the data will fit in memory.
Those two problems can be disastrous. In the worst-case, with limited memory, the hashing requires multiple passes and the complexity approaches O(n^2)
. And writing to disk in the temporary tablespace is orders of magnitude slower than writing to memory.
Those are the kind of performance issues you need to worry about with databases. The constant time to run simple conditions and expressions is usually irrelevant compared to the time to read, write, and join the data.
That might be especially true in your environment. The operation TABLE ACCESS STORAGE FULL
implies you are using Exadata. Depending on the platform you might be taking advantage of SQL in silicon. Those high-level conditions may translate perfectly to low-level instructions executed on storage devices. Which means your estimate of the cost of executing a clause may be several orders of magnitude too high.
Practice
Create a sample table with 100,000 rows:
create table customer(id number, status varchar2(100));
insert into customer
select
level,
case
when level <= 15000 then 'Deceased'
when level between 15001 and 50001 then 'Active'
else 'Dormant'
end
from dual
connect by level <= 100000;
begin
dbms_stats.gather_table_stats(user, 'customer');
end;
/
Running the code in a loop shows that the WHERE
version is about twice as fast as the HAVING
version.
--Run times (in seconds): 0.765, 0.78, 0.765
declare
type string_nt is table of varchar2(100);
type number_nt is table of number;
v_status string_nt;
v_count number_nt;
begin
for i in 1 .. 100 loop
SELECT status, count(status)
bulk collect into v_status, v_count
FROM customer
GROUP BY status
HAVING status != 'Active' AND status != 'Dormant';
end loop;
end;
/
--Run times (in seconds): 0.39, 0.39, 0.39
declare
type string_nt is table of varchar2(100);
type number_nt is table of number;
v_status string_nt;
v_count number_nt;
begin
for i in 1 .. 100 loop
SELECT status, count(status)
bulk collect into v_status, v_count
FROM customer
WHERE status != 'Active' AND status != 'Dormant'
GROUP BY status;
end loop;
end;
/
SQL Order of execution vs Order of writing
SQL is a declarative language, not a procedural language. That means that the SQL compiler and optimizer determine what operations are actually run. These operations typically take the form of a directed acyclic graph (DAG) of operations.
The operators have no obvious relationship to the original query -- except that the results it generates are guaranteed to be the same. In terms of execution there are no clauses, just things like "hash join" and "filter" and "sort" -- or whatever the database implements for the DAG.
You are confusing execution with compilation and probably you just care about scoping rules.
So, to start with SQL has a set of clauses and these are in a very specified order. Your question contains this ordering -- at least for a database that supports those clauses.
The second part is the ordering for identifying identifiers. Basically, this comes down to:
- Table aliases are defined in the
FROM
clause. So this can be considered as "first" for scoping purposes. - Column aliases are defined in the
SELECT
clause. By the SQL Standard, column aliases can be used in theORDER BY
. Many databases extend this to theQUALIFY
(if supported),HAVING
, andGROUP BY
clauses. In general, databases do not support them in theWHERE
clause. - If two tables in the
FROM
have the same column name, then the column has to be qualified to identify the table. The one exception to this is when the column is a key in aJOIN
and theUSING
clause is used. Then the unqualified column name is fine. - If a column alias defined in the
SELECT
conflicts with a table alias in a clause that supports column aliases, then it is up to the database which to choose.
Related Topics
Is the Like Operator Case-Sensitive with SQL Server
Using SQL Function Generate_Series() in Redshift
In SQL/Mysql, Differencebetween "On" and "Where" in a Join Statement
Easiest Way to Populate a Temp Table with Dates Between and Including 2 Date Parameters
Fastest Check If Row Exists in Postgresql
How to Use SQL's Getdate() and Dateadd() in a Linq to SQL Expression
Support Union Function in Bigquery SQL
Saving Changes After Table Edit in SQL Server Management Studio
Alternatives to Limit and Offset for Paging in Oracle
Which SQL Query Is Faster? Filter on Join Criteria or Where Clause
Postgresql Equivalent for Top N with Ties: Limit "With Ties"
How to Add a Foreign Key to an Existing SQLite Table
How to Convert Comma Separated Nvarchar to Table Records in SQL Server 2005
How Can a Left Outer Join Return More Records Than Exist in the Left Table