General Rules for Simplifying SQL Statements

General rules for simplifying SQL statements

To state it differently: given a (complex) query with JOINs, SUBSELECTs and UNIONs, is it possible (or not) to reduce it to a simpler, equivalent SQL statement that produces the same result by applying some transformation rules?


This answer was written in 2009. Some of the query optimization tricks described here are obsolete by now, others can be made more efficient, yet others still apply. The statements about feature support by different database systems apply to versions that existed at the time of this writing.


That's exactly what optimizers do for a living (not that I'm saying they always do this well).

Since SQL is a set-based language, there is usually more than one way to transform one query into another.

Like this query:

SELECT  *
FROM mytable
WHERE col1 > @value1 OR col2 < @value2

can be transformed into this one (provided that mytable has a primary key):

SELECT  *
FROM mytable
WHERE col1 > @value1
UNION
SELECT *
FROM mytable
WHERE col2 < @value2

or this one:

SELECT  mo.*
FROM (
        SELECT id
        FROM mytable
        WHERE col1 > @value1
        UNION
        SELECT id
        FROM mytable
        WHERE col2 < @value2
     ) mi
JOIN mytable mo
ON mo.id = mi.id

, which looks uglier but can yield better execution plans.

One of the most common things to do is replacing this query:

SELECT  *
FROM mytable
WHERE col IN
(
SELECT othercol
FROM othertable
)

with this one:

SELECT  *
FROM mytable mo
WHERE EXISTS
(
SELECT NULL
FROM othertable o
WHERE o.othercol = mo.col
)

In some RDBMSs (like PostgreSQL 8.4), DISTINCT and GROUP BY use different execution plans, so sometimes it's better to replace the one with the other:

SELECT  mo.grouper,
        (
        SELECT SUM(col)
        FROM mytable mi
        WHERE mi.grouper = mo.grouper
        )
FROM (
        SELECT DISTINCT grouper
        FROM mytable
     ) mo

vs.

SELECT  mo.grouper, SUM(col)
FROM mytable mo
GROUP BY
mo.grouper

In PostgreSQL, DISTINCT sorts and GROUP BY hashes.
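
If you want to check which strategy the planner actually picks on your data, EXPLAIN makes the difference visible (a quick sketch, assuming PostgreSQL and the mytable(grouper, col) table from the example above):

EXPLAIN SELECT DISTINCT grouper FROM mytable;
-- on 8.4 this typically shows a Sort followed by Unique

EXPLAIN SELECT grouper FROM mytable GROUP BY grouper;
-- this typically shows a HashAggregate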

MySQL 5.6 lacks FULL OUTER JOIN, so a query that uses it has to be rewritten as follows:

SELECT  t1.col1, t2.col2
FROM table1 t1
FULL OUTER JOIN
table2 t2
ON t1.id = t2.id

vs.

SELECT  t1.col1, t2.col2
FROM table1 t1
LEFT JOIN
table2 t2
ON t1.id = t2.id
UNION ALL
SELECT NULL, t2.col2
FROM table1 t1
RIGHT JOIN
table2 t2
ON t1.id = t2.id
WHERE t1.id IS NULL

, but see this article in my blog on how to do this more efficiently in MySQL:

  • Emulating FULL OUTER JOIN in MySQL

This hierarchical query in Oracle 11g:

SELECT  DISTINCT(animal_id) AS animal_id
FROM animal
START WITH
animal_id = :id
CONNECT BY
PRIOR animal_id IN (father, mother)
ORDER BY
animal_id

can be transformed to this:

SELECT  DISTINCT(animal_id) AS animal_id
FROM (
        SELECT 0 AS gender, animal_id, father AS parent
        FROM animal
        UNION ALL
        SELECT 1, animal_id, mother
        FROM animal
     )
START WITH
animal_id = :id
CONNECT BY
parent = PRIOR animal_id
ORDER BY
animal_id

, the latter one being more efficient.

See this article in my blog for the execution plan details:

  • Genealogy query on both parents

To find all ranges that overlap the given range, you can use the following query:

SELECT  *
FROM ranges
WHERE end_date >= @start
AND start_date <= @end

, but in SQL Server this more complex query yields the same results faster:

SELECT  *
FROM ranges
WHERE (start_date > @start AND start_date <= @end)
OR (@start BETWEEN start_date AND end_date)

, and believe it or not, I have an article in my blog on this too:

  • Overlapping ranges: SQL Server

SQL Server 2008 also lacks an efficient way to do cumulative aggregates, so this query:

SELECT  mi.id, SUM(mo.value) AS running_sum
FROM mytable mi
JOIN mytable mo
ON mo.id <= mi.id
GROUP BY
mi.id

can be more efficiently rewritten using, Lord help me, cursors (you heard me right: "cursors", "more efficiently" and "SQL Server" in one sentence).

See this article in my blog on how to do it:

  • Flattening timespans: SQL Server
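
For reference, here is a minimal sketch of what the cursor-based running total looks like in T-SQL (an illustration assuming a table mytable(id INT PRIMARY KEY, value INT), not the exact code from the article):

-- Accumulate the running sum in a single ordered pass with a cursor
DECLARE @results TABLE (id INT PRIMARY KEY, running_sum INT);
DECLARE @id INT, @value INT, @running_sum INT = 0;

DECLARE cur CURSOR LOCAL FAST_FORWARD FOR
    SELECT id, value
    FROM mytable
    ORDER BY id;

OPEN cur;
FETCH NEXT FROM cur INTO @id, @value;
WHILE @@FETCH_STATUS = 0
BEGIN
    SET @running_sum = @running_sum + @value;
    INSERT INTO @results (id, running_sum) VALUES (@id, @running_sum);
    FETCH NEXT FROM cur INTO @id, @value;
END;
CLOSE cur;
DEALLOCATE cur;

SELECT id, running_sum FROM @results;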

There is a certain kind of query, commonly met in financial applications, that pulls the effective exchange rate for a currency, like this one in Oracle 11g:

SELECT  TO_CHAR(SUM(xac_amount * rte_rate), 'FM999G999G999G999G999G999D999999')
FROM t_transaction x
JOIN t_rate r
ON (rte_currency, rte_date) IN
(
SELECT xac_currency, MAX(rte_date)
FROM t_rate
WHERE rte_currency = xac_currency
AND rte_date <= xac_date
)

This query can be heavily rewritten to use an equality condition which allows a HASH JOIN instead of NESTED LOOPS:

WITH v_rate AS
(
    SELECT cur_id AS eff_currency, dte_date AS eff_date, rte_rate AS eff_rate
    FROM (
            SELECT cur_id, dte_date,
                   (
                       SELECT MAX(rte_date)
                       FROM t_rate ri
                       WHERE rte_currency = cur_id
                       AND rte_date <= dte_date
                   ) AS rte_effdate
            FROM (
                     SELECT (
                                SELECT MAX(rte_date)
                                FROM t_rate
                            ) - level + 1 AS dte_date
                     FROM dual
                     CONNECT BY
                     level <=
                     (
                         SELECT MAX(rte_date) - MIN(rte_date)
                         FROM t_rate
                     )
                 ) v_date,
                 (
                     SELECT 1 AS cur_id
                     FROM dual
                     UNION ALL
                     SELECT 2 AS cur_id
                     FROM dual
                 ) v_currency
         ) v_eff
    LEFT JOIN
         t_rate
    ON rte_currency = cur_id
    AND rte_date = rte_effdate
)
SELECT TO_CHAR(SUM(xac_amount * eff_rate), 'FM999G999G999G999G999G999D999999')
FROM (
        SELECT xac_currency, TRUNC(xac_date) AS xac_date, SUM(xac_amount) AS xac_amount, COUNT(*) AS cnt
        FROM t_transaction x
        GROUP BY
        xac_currency, TRUNC(xac_date)
     )
JOIN v_rate
ON eff_currency = xac_currency
AND eff_date = xac_date

Despite being bulky as hell, the latter query is six times as fast.

The main idea here is replacing <= with =, which requires building an in-memory calendar table to join with.

  • Converting currencies

SQL query simplification

I still say go with one table, which is the best-practice design (it avoids duplicating identically structured tables unnecessarily).

CREATE TABLE unified_table (
    product_type VARCHAR(20),   -- column types here are illustrative
    price        DECIMAL(10, 2),
    model        VARCHAR(50)
)

Doing so enables this query...

SELECT *
FROM unified_table
WHERE price = (SELECT MAX(price) FROM unified_table)

But, if you can't, or won't, trust the optimiser to deal with the consequences of the UNIONs...

SELECT *
FROM (
        SELECT * FROM pc
        UNION ALL
        SELECT * FROM laptop
        UNION ALL
        SELECT * FROM printer
     ) t1
WHERE price = (
        SELECT MAX(price)
        FROM (
                SELECT price FROM pc
                UNION ALL
                SELECT price FROM laptop
                UNION ALL
                SELECT price FROM printer
             ) t2
     )

The optimiser will understand how to optimise this so as to remove redundant searches.


EDIT:

As a compromise, you can make a unified view, and query that...

CREATE VIEW unified_table AS
SELECT 'pc' AS type, * FROM pc
UNION ALL
SELECT 'laptop' AS type, * FROM laptop
UNION ALL
SELECT 'printer' AS type, * FROM printer

What generic techniques can be applied to optimize SQL queries?

  • Use primary keys
  • Avoid SELECT *
  • Be as specific as you can when building your conditional statements (see the sketch after this list)
  • De-normalisation can often be more efficient
  • Table variables and temporary tables (where available) will often be better than using a large source table
  • Partitioned views
  • Employ indices and constraints
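
As a small illustration of the "avoid SELECT *" and "be specific" points above (table, column and parameter names here are invented):

-- instead of: SELECT * FROM orders WHERE ...
SELECT order_id, order_date, total_amount      -- only the columns you actually need
FROM orders
WHERE customer_id = @customer_id               -- sargable predicate that can use an index on customer_id
AND order_date >= @from_date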

simplifying query that has multiple WITH and multiple subqueries

Does this look any better?

;WITH minmax AS (
SELECT client_id, specimen_source, received_date,
RMin = row_number() over (partition by Client_id
order by received_date, accession_daily_key),
RMax = row_number() over (partition by Client_id
order by received_date desc, accession_daily_key desc)
FROM F_ACCESSION_DAILY
)
SELECT f.client_id,
max(case when rmin=1 then f.specimen_source end),
max(case when rmin=1 then f.received_date end),
max(case when rmax=1 then f.specimen_source end),
max(case when rmax=1 then f.received_date end),
D.MLIS_DATE_ESTABLISHED
FROM minmax f
LEFT JOIN D_CLIENT D ON D.REC_ACTIVE_FLG = 1 AND D.MLIS_DATE_ESTABLISHED IS NOT NULL
WHERE 1 in (f.rmin, f.rmax)
GROUP BY f.client_id, D.MLIS_DATE_ESTABLISHED

Simplifying a group of AND and OR clauses

The only equivalence I can think of as useful here is

(A+B).(A+C') === A+(B.C')

So it becomes

(A+(B.C')) . (B+D')

if B: --> (A+C')
else: --> A . D'

I don't know whether that helps you get anything more efficient or useful.

A      B      C'     D'     f()
TRUE   TRUE   TRUE   TRUE   TRUE
TRUE   TRUE   TRUE   FALSE  TRUE
TRUE   TRUE   FALSE  TRUE   TRUE
TRUE   TRUE   FALSE  FALSE  TRUE
TRUE   FALSE  TRUE   TRUE   TRUE
TRUE   FALSE  TRUE   FALSE  TRUE
TRUE   FALSE  FALSE  TRUE   FALSE
TRUE   FALSE  FALSE  FALSE  FALSE
FALSE  TRUE   TRUE   TRUE   TRUE
FALSE  TRUE   TRUE   FALSE  FALSE
FALSE  TRUE   FALSE  TRUE   TRUE
FALSE  TRUE   FALSE  FALSE  FALSE
FALSE  FALSE  TRUE   TRUE   FALSE
FALSE  FALSE  TRUE   FALSE  FALSE
FALSE  FALSE  FALSE  TRUE   FALSE
FALSE  FALSE  FALSE  FALSE  FALSE

Watch it live in a spreadsheet: google docs
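
Translated back into a WHERE clause (treating A, B, C' and D' as placeholder predicates on invented columns), the distributive-law rewrite looks like this:

-- original form: (A OR B) AND (A OR C') AND (B OR D')
SELECT *
FROM t
WHERE (a = 1 OR b = 1)
AND (a = 1 OR c = 0)
AND (b = 1 OR d = 0)

-- after applying (A+B).(A+C') = A+(B.C')
SELECT *
FROM t
WHERE (a = 1 OR (b = 1 AND c = 0))
AND (b = 1 OR d = 0)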

Doing BNF for a simplified SQL “where” clause

The grammar doesn't care about semantics.

Syntactically, an expression is an expression, nothing more. If you later do some kind of semantic analysis, that is when you'll need to deal with the difference. You might do that in the reduction actions for condition and from_expression but it would be cleaner to just build an AST while parsing and do the semantic analysis later on the tree.

What are some of your most useful database standards?

Putting everybody's input together into one list.

Naming Standards

  • Schemas are named by functional area (Products, Orders, Shipping)
  • No Hungarian Notation: No type names in object names (no strFirstName)
  • Do not use registered keywords for object names
  • No spaces or any special characters in object names (alphanumeric + underscore are the only things allowed)
  • Name objects in a natural way (FirstName instead of NameFirst)
  • Table name should match Primary Key Name and Description field (SalesType – SalesTypeId, SalesTypeDescription; see the DDL sketch after this list)
  • Do not prefix with tbl_ or sp_
  • Name code by object name (CustomerSearch, CustomerGetBalance)
  • CamelCase database object names
  • Column names should be singular
  • Table names may be plural
  • Give business names to all constraints (MustEnterFirstName)
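
A small DDL sketch pulling several of these naming rules together (based on the SalesType example above; the constraint names are invented):

CREATE TABLE SalesType (
    SalesTypeId INT NOT NULL,
    SalesTypeDescription VARCHAR(100) NOT NULL,
    CONSTRAINT PK_SalesType PRIMARY KEY (SalesTypeId),
    CONSTRAINT MustEnterSalesTypeDescription CHECK (SalesTypeDescription <> '')
)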

Data Types

  • Use same variable type across tables (Zip code – numeric in one table and varchar in another is not a good idea)
  • Use NVARCHAR for customer information (name, address(es), etc.); you never know when you may go multinational

In code

  • Keywords always in UPPERCASE
  • Never use implied joins (comma syntax) - always use explicit INNER JOIN / OUTER JOIN (see the example after this list)
  • One JOIN per line
  • One WHERE clause per line
  • No loops – replace with set based logic
  • Use short forms of table names for aliases rather than A, B, C
  • Avoid triggers unless there is no recourse
  • Avoid cursors like the plague (read http://www.sqlservercentral.com/articles/T-SQL/66097/)
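
For example, the implied (comma) join versus the explicit form, one JOIN per line (table names are invented):

-- implied join (avoid)
SELECT c.FirstName, o.OrderDate
FROM Customer c, CustomerOrder o
WHERE c.CustomerId = o.CustomerId

-- explicit join (preferred)
SELECT c.FirstName, o.OrderDate
FROM Customer c
INNER JOIN CustomerOrder o
ON o.CustomerId = c.CustomerId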

Documentation

  • Create database diagrams
  • Create a data dictionary

Normalization and Referential Integrity

  • Use single column primary keys as much as possible. Use unique constraints where required.
  • Referential integrity will be always enforced
  • Avoid ON DELETE CASCADE
  • OLTP must be at least 4NF
  • Evaluate every one-to-many relationship as a potential many-to-many relationship
  • Non user generated Primary Keys
  • Build Insert based models instead of update based
  • PK to FK must be same name (Employee.EmployeeId is the same field as EmployeeSalary.EmployeeId)
  • Except when there is a double join (Person.PersonId joins to PersonRelation.PersonId_Parent and PersonRelation.PersonId_Child)

Maintenance: run periodic scripts to find

  • Schema without table
  • Orphaned records
  • Tables without primary keys (see the sketch after this list)
  • Tables without indexes
  • Non-deterministic UDF
  • Backup, Backup, Backup
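
As an example of such a periodic check, here is a sketch for SQL Server that lists tables without a primary key (it uses the sys catalog views; adapt to your RDBMS):

SELECT s.name AS SchemaName, t.name AS TableName
FROM sys.tables t
JOIN sys.schemas s
ON s.schema_id = t.schema_id
WHERE NOT EXISTS
(
    SELECT NULL
    FROM sys.key_constraints kc
    WHERE kc.parent_object_id = t.object_id
    AND kc.type = 'PK'
)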

Be good

  • Be Consistent
  • Fix errors now
  • Read Joe Celko's SQL Programming Style (ISBN 978-0120887972)

SQL: Two select statements in one query

You can do something like this:

(SELECT name, games, goals
 FROM tblMadrid
 WHERE name = 'ronaldo')
UNION
(SELECT name, games, goals
 FROM tblBarcelona
 WHERE name = 'messi')
ORDER BY goals;

See, for example: https://dev.mysql.com/doc/refman/5.0/en/union.html


