Is There Any General Rule on SQL Query Complexity vs. Performance?

Is there any general rule on SQL query complexity Vs performance?

This depends on the query plan used.

Even without indexes, modern servers can use HASH JOIN and MERGE JOIN algorithms, which are faster than the O(N * M) of a naive nested-loop join.

More specifically, the complexity of a HASH JOIN is O(N + M), where N is the size of the hashed table and M is the size of the lookup table. Hashing and hash lookups have constant complexity.

The complexity of a MERGE JOIN is O(N*log(N) + M*log(M)): the sum of the time needed to sort both tables plus the time needed to scan them.

Say you have this query:

SELECT  T1.name, T2.date
FROM T1, T2
WHERE T1.id=T2.id
AND T1.color='red'
AND T2.type='CAR'

If there are no indexes defined, the engine will select either a HASH JOIN or a MERGE JOIN.
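If you want to check which strategy the optimizer actually chose, you can ask for the execution plan. A minimal example (EXPLAIN is the PostgreSQL/MySQL syntax; SQL Server and Oracle have their own equivalents):

-- shows the chosen join strategy (HASH JOIN, MERGE JOIN, nested loops, ...)
EXPLAIN
SELECT  T1.name, T2.date
FROM T1, T2
WHERE T1.id = T2.id
AND T1.color = 'red'
AND T2.type = 'CAR'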

The HASH JOIN works as follows:

  1. The hashed table is chosen (usually it's the table with fewer records). Say it's t1.

  2. All records from t1 are scanned. If a record holds color='red', it goes into the hash table, with id as the key and name as the value.

  3. All records from t2 are scanned. If a record holds type='CAR', its id is searched in the hash table, and for every hash hit the value of name is returned along with the current value of date.
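Conceptually, this is the same as filtering each table first and then joining the filtered sets on id, which can be spelled out in plain SQL (just a way to read the plan, not a rewrite you need to perform):

SELECT  t1f.name, t2f.date
FROM (SELECT id, name FROM T1 WHERE color = 'red') t1f  -- build side: hashed on id
JOIN (SELECT id, date FROM T2 WHERE type = 'CAR') t2f   -- probe side: looked up in the hash
ON t1f.id = t2f.id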

The MERGE JOIN works as follows:

  1. A copy of t1 (id, name) is created, sorted on id

  2. A copy of t2 (id, date) is created, sorted on id

  3. The pointers are set to the minimal values in both tables:

    >1  2<
     2  3
     2  4
     3  5
  4. The pointers are compared in a loop, and if they match, the records are returned. If they don't match, the pointer with the minimal value is advanced:

    >1  2<  - no match, the left value is less. Advance the left pointer
     2  3
     2  4
     3  5

     1  2<  - match, return the records and advance both pointers
    >2  3
     2  4
     3  5

     1  2   - no match, the left value is less. Advance the left pointer
     2  3<
    >2  4
     3  5

     1  2   - match, return the records and advance both pointers
     2  3<
     2  4
    >3  5

     1  2   - the left pointer is out of range; the query is over
     2  3
     2  4<
     3  5
    >

(This walkthrough is simplified: with duplicate keys, a real merge join has to remember its position so that the second 2 on the left can also match the 2 on the right.)

In such a case, could making the query more complex make it faster, because fewer rows are subjected to the join-level tests?

Sure.

Your query without the WHERE clause:

SELECT  T1.name, T2.date
FROM T1, T2

is simpler, but it returns more results and runs longer.

Computational Complexity of a SQL Query

There are two basic ways that such a simple query would be executed.

The first is to do a full table scan. This would have O(n) performance.

The second is to look up the value in an index, then load the corresponding page and return the results. The index lookup should be O(log(n)); loading the page should be O(1).
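As a hypothetical sketch (the books table name is assumed here; author_id comes from the original question):

-- with this index, the lookup is O(log(n)) plus an O(1) page read
CREATE INDEX idx_books_author ON books (author_id);

-- without the index, this is a full O(n) table scan; with it, an index seek
SELECT *
FROM books
WHERE author_id = 42;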

With a more complicated query, it would be hard to make such a general statement. But any SQL engine is generally going to take one of these two paths. Oh, there is a third option if the table is partitioned on author_id, but you are probably not interested in that.

That said, the power of a database is not in these details. It is in the management of memory. The database will cache the data and index in memory, so you do not have to re-read data pages. The database will take advantage of multiple processors and multiple disks, so you do not have to code this. The database keeps everything consistent, in the face of updates and deletes.

As for your specific question: if the data is in the database, search it there. Loading all the data into an XML file and then doing the search in memory requires a lot of overhead. You would only want to do that if the connection to your database is slow and you are doing many such queries.

General rules for simplifying SQL statements

To state it differently: given a (complex) query with JOINs, SUBSELECTs and UNIONs, is it possible (or not) to reduce it to a simpler, equivalent SQL statement producing the same result by applying some transformation rules?

That's exactly what optimizers do for a living (not that I'm saying they always do this well).

Since SQL is a set-based language, there is usually more than one way to transform one query into another.

Like this query:

SELECT  *
FROM mytable
WHERE col1 > @value1 OR col2 < @value2

can be transformed into this:

SELECT  *
FROM mytable
WHERE col1 > @value1
UNION
SELECT *
FROM mytable
WHERE col2 < @value2

or this:

SELECT  mo.*
FROM (
    SELECT id
    FROM mytable
    WHERE col1 > @value1
    UNION
    SELECT id
    FROM mytable
    WHERE col2 < @value2
) mi
JOIN mytable mo
ON mo.id = mi.id

The last one looks uglier but can yield a better execution plan.

One of the most common things to do is replacing this query:

SELECT  *
FROM mytable
WHERE col IN
(
SELECT othercol
FROM othertable
)

with this one:

SELECT  *
FROM mytable mo
WHERE EXISTS
(
SELECT NULL
FROM othertable o
WHERE o.othercol = mo.col
)

In some RDBMSs (like PostgreSQL), DISTINCT and GROUP BY use different execution plans, so sometimes it's better to replace one with the other:

SELECT  mo.grouper,
    (
    SELECT SUM(col)
    FROM mytable mi
    WHERE mi.grouper = mo.grouper
    )
FROM (
    SELECT DISTINCT grouper
    FROM mytable
) mo

vs.

SELECT  mo.grouper, SUM(col)
FROM mytable mo
GROUP BY
mo.grouper

In PostgreSQL, DISTINCT sorts and GROUP BY hashes.

MySQL lacks FULL OUTER JOIN, so the following query needs to be rewritten:

SELECT  t1.col1, t2.col2
FROM table1 t1
FULL OUTER JOIN
table2 t2
ON t1.id = t2.id

vs.

SELECT  t1.col1, t2.col2
FROM table1 t1
LEFT JOIN
table2 t2
ON t1.id = t2.id
UNION ALL
SELECT NULL, t2.col2
FROM table1 t1
RIGHT JOIN
table2 t2
ON t1.id = t2.id
WHERE t1.id IS NULL

But see this article in my blog on how to do this more efficiently in MySQL:

  • Emulating FULL OUTER JOIN in MySQL

This hierarchical query in Oracle:

SELECT  DISTINCT(animal_id) AS animal_id
FROM animal
START WITH
animal_id = :id
CONNECT BY
PRIOR animal_id IN (father, mother)
ORDER BY
animal_id

can be transformed to this:

SELECT  DISTINCT(animal_id) AS animal_id
FROM (
    SELECT 0 AS gender, animal_id, father AS parent
    FROM animal
    UNION ALL
    SELECT 1, animal_id, mother
    FROM animal
)
START WITH
animal_id = :id
CONNECT BY
parent = PRIOR animal_id
ORDER BY
animal_id

The latter is more performant.

See this article in my blog for the execution plan details:

  • Genealogy query on both parents

To find all ranges that overlap the given range, you can use the following query:

SELECT  *
FROM ranges
WHERE end_date >= @start
AND start_date <= @end

In SQL Server, however, this more complex query yields the same results faster:

SELECT  *
FROM ranges
WHERE (start_date > @start AND start_date <= @end)
OR (@start BETWEEN start_date AND end_date)

And believe it or not, I have an article in my blog on this too:

  • Overlapping ranges: SQL Server

SQL Server also lacks an efficient way to do cumulative aggregates, so this query:

SELECT  mi.id, SUM(mo.value) AS running_sum
FROM mytable mi
JOIN mytable mo
ON mo.id <= mi.id
GROUP BY
mi.id

can be rewritten more efficiently using, Lord help me, cursors (you heard me right: "cursors", "more efficiently" and "SQL Server" in one sentence).
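A minimal sketch of the cursor approach (T-SQL, assuming the mytable(id, value) schema from the query above; on SQL Server 2012 and later, a windowed SUM(value) OVER (ORDER BY id) is the better tool):

DECLARE @id INT, @value INT, @running_sum INT = 0;
DECLARE @results TABLE (id INT, running_sum INT);

DECLARE cur CURSOR FAST_FORWARD FOR
SELECT id, value FROM mytable ORDER BY id;

OPEN cur;
FETCH NEXT FROM cur INTO @id, @value;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- each row is read exactly once: O(N) work instead of the self-join's O(N^2)
    SET @running_sum = @running_sum + @value;
    INSERT INTO @results (id, running_sum) VALUES (@id, @running_sum);
    FETCH NEXT FROM cur INTO @id, @value;
END;
CLOSE cur;
DEALLOCATE cur;

SELECT id, running_sum FROM @results;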

See this article in my blog on how to do it:

  • Flattening timespans: SQL Server

There is a certain kind of query commonly met in financial applications that searches for the effective rate for a currency, like this one in Oracle:

SELECT  TO_CHAR(SUM(xac_amount * rte_rate), 'FM999G999G999G999G999G999D999999')
FROM t_transaction x
JOIN t_rate r
ON (rte_currency, rte_date) IN
    (
    SELECT xac_currency, MAX(rte_date)
    FROM t_rate
    WHERE rte_currency = xac_currency
    AND rte_date <= xac_date
    )

This query can be heavily rewritten to use an equality condition, which allows a HASH JOIN instead of NESTED LOOPS:

WITH v_rate AS
(
    SELECT cur_id AS eff_currency, dte_date AS eff_date, rte_rate AS eff_rate
    FROM (
        SELECT cur_id, dte_date,
            (
            SELECT MAX(rte_date)
            FROM t_rate ri
            WHERE rte_currency = cur_id
            AND rte_date <= dte_date
            ) AS rte_effdate
        FROM (
            SELECT (
                SELECT MAX(rte_date)
                FROM t_rate
                ) - level + 1 AS dte_date
            FROM dual
            CONNECT BY
            level <=
                (
                SELECT MAX(rte_date) - MIN(rte_date)
                FROM t_rate
                )
        ) v_date,
        (
            SELECT 1 AS cur_id
            FROM dual
            UNION ALL
            SELECT 2 AS cur_id
            FROM dual
        ) v_currency
    ) v_eff
    LEFT JOIN
        t_rate
    ON rte_currency = cur_id
    AND rte_date = rte_effdate
)
SELECT TO_CHAR(SUM(xac_amount * eff_rate), 'FM999G999G999G999G999G999D999999')
FROM (
    SELECT xac_currency, TRUNC(xac_date) AS xac_date, SUM(xac_amount) AS xac_amount, COUNT(*) AS cnt
    FROM t_transaction x
    GROUP BY
    xac_currency, TRUNC(xac_date)
)
JOIN v_rate
ON eff_currency = xac_currency
AND eff_date = xac_date

Despite being bulky as hell, the latter query is six times faster.

The main idea here is replacing <= with =, which requires building an in-memory calendar table to JOIN with. See this article in my blog for the details:

  • Converting currencies

What are the pros and cons of performing calculations in SQL vs. in your application?

It depends on a lot of factors - but most crucially:

  • complexity of calculations (prefer doing complex crunching on an app server, since that scales out, rather than on a db server, which scales up)
  • volume of data (if you need to access or aggregate a lot of data, doing it at the db server will save bandwidth, and disk I/O if the aggregates can be computed inside indexes)
  • convenience (SQL is not the best language for complex work: it's not great for procedural work, though very good for set-based work, and its error handling is lousy)

As always, if you do bring the data back to the app-server, minimising the columns and rows will be to your advantage. Making sure the query is tuned and appropriately indexed will help either scenario.

Re your note:

"and then loop through the records"

Looping through records is almost always the wrong thing to do in SQL: writing a set-based operation is preferred.
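For example, a single set-based statement (hypothetical table and columns) does the work of an entire fetch-and-update loop in one pass on the server:

-- one statement instead of fetching each row into the app and writing it back
UPDATE orders
SET total = total * 1.10
WHERE region = 'EU'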

As a general rule, I prefer to keep the database's job to a minimum ("store this data, fetch this data"); however, there are always examples of scenarios where an elegant query at the server can save a lot of bandwidth.

Also consider: if this is computationally expensive, can it be cached somewhere?

If you want an accurate "which is better", code it both ways and compare them (noting that a first draft of either is likely not 100% tuned). But factor typical usage into that: if, in reality, it is being called 5 times (separately) at once, then simulate that; don't compare just a single "1 of these vs. 1 of those".

Complexity of Views - Comprehensive or Building-Blocks?

SQL development in general does not play well with the usual software development paradigm: it is not reusable, does not lead to DRY definitions and is hard to maintain. Worse, most techniques that improve this status quo and lead to better-quality code result in runtime problems. And a SQL runtime problem is nothing like a suboptimal construct in ordinary code: it produces bad plans whose results come back tens or hundreds of times slower than those of an optimal plan.

In other words, when a DRY query definition based on reasonable building blocks results in a table-scan plan that runs in 10 seconds, and an ugly single-use view has a better plan that runs in 10 ms, you forget everything about DRY and go with the ugly but fast view. The differences in run time between a good plan and a bad plan are just too big.

This is why, with SQL development, a good project ends up with a few well-tuned queries that are constantly measured and checked for performance. I'm sad to say it, but in my experience, the more 'healthy' the SQL code was from a classic code point of view (DRY, reusable, maintainable), the more problems it had in real-world production when faced with large data sizes. I really wish there were an easy way to deploy reusable SQL blocks that could be assembled into complex structures; it just doesn't work that way. I know enough about how SQL query optimization works to understand that query optimizers look at the resulting complex block as a whole: they cannot leverage the internal blocks as 'units of work', because they are tasked with optimizing the final end result. And optimizing such complex queries, considering data access paths, I/O costs, data sizes and column-value distribution probabilities, is very, very complex, orders of magnitude more complex than the task that, say, a C# optimizer faces.

My advice would be: keep a few complex views that are tested and tuned. The freedom to compose basic building blocks will quickly be abused, and you'll discover it too late.

Is there a performance difference between CTE, Sub-Query, Temporary Table or Table Variable?

SQL is a declarative language, not a procedural language. That is, you construct a SQL statement to describe the results that you want. You are not telling the SQL engine how to do the work.

As a general rule, it is a good idea to let the SQL engine and SQL optimizer find the best query plan. There are many person-years of effort that go into developing a SQL engine, so let the engineers do what they know how to do.

Of course, there are situations where the query plan is not optimal. Then you want to use query hints, restructure the query, update statistics, use temporary tables, add indexes, and so on to get better performance.

As for your question: the performance of CTEs and subqueries should, in theory, be the same, since both provide the same information to the query optimizer. One difference is that a CTE used more than once could easily be identified and calculated once; the results could then be stored and read multiple times. Unfortunately, SQL Server does not seem to take advantage of this basic optimization method (you might call it common subquery elimination).

Temporary tables are a different matter, because you are providing more guidance on how the query should be run. One major difference is that the optimizer can use statistics from the temporary table to establish its query plan. This can result in performance gains. Also, if you have a complicated CTE (subquery) that is used more than once, then storing it in a temporary table will often give a performance boost. The query is executed only once.
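A minimal T-SQL sketch of that pattern (table and column names are hypothetical):

-- materialize the expensive intermediate result once...
SELECT customer_id, SUM(amount) AS total
INTO #totals
FROM orders
GROUP BY customer_id;

-- ...then reuse it as often as needed; #totals also carries real statistics
SELECT MAX(total) FROM #totals;
SELECT AVG(total) FROM #totals;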

The answer to your question is that you need to play around to get the performance you expect, particularly for complex queries that are run on a regular basis. In an ideal world, the query optimizer would find the perfect execution path. Although it often does, you may be able to find a way to get better performance.

Pre-fetching row counts before query - performance

First, the answer to your question is highly dependent on the database.

I cannot think of a situation where doing a COUNT() before a query would shorten the overall time for both the query and the COUNT().

In general, doing a count will pre-load tables and indexes into the page cache. Assuming the data fits in memory, this will make the subsequent query run faster (although not much faster if you have fast I/O and the database does read-ahead page reading). However, you have just shifted the time frame to the COUNT(), rather than reducing overall time.

To shorten the overall time (including the run time of the COUNT()) would require changing the execution plan. Here are two ways this could theoretically happen:

  1. A database could update statistics as a table is read in, and these statistics, in turn, change the query plan for the main query.
  2. A database could change the execution plan based on whether tables/indexes are already in the page cache.

Although theoretically possible, I am not aware of any database that does either of these.

You could imagine that intermediate results could be stored, but this would violate the dynamic nature of SQL databases: updates and inserts could occur on the tables between the COUNT() and the query, and a database engine could not maintain integrity while keeping such intermediate results.

Doing a COUNT() also has disadvantages relative to speeding up the subsequent query: the query plan for the COUNT() might be quite different from the query plan for the main query. Your example with indexes is one case. Another would be a columnar database, where different vertical partitions of the data do not need to be read.

Yet another case would be a query such as:

select t.*, r.val
from table t left outer join
     ref r
     on t.refID = r.refID

and refID is a unique index on the ref table. This join can be eliminated for a count, since there are no duplicates and all records in t are used. However, the join is clearly needed for this query. Once again, whether a SQL optimizer recognizes and acts on this situation is entirely the decision of the writers of the database, but the join could theoretically be optimized away for the COUNT().
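In other words, for counting purposes the two statements below are equivalent, and an optimizer implementing this join elimination could silently run the second:

-- counting the joined result...
select count(*)
from table t left outer join
     ref r
     on t.refID = r.refID

-- ...is equivalent to counting t alone: the unique refID guarantees no duplicates
select count(*)
from table t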

Is SQL IN bad for performance?

There are several considerations when writing a query using the IN operator that can have an effect on performance.

First, IN clauses are generally rewritten internally by most databases to use the OR logical connective. So col IN ('a','b','c') is rewritten to (col = 'a') OR (col = 'b') OR (col = 'c'). The execution plans for both queries will likely be equivalent, assuming that you have an index on col.

Second, when using either IN or OR with a variable number of arguments, you are forcing the database to re-parse the query and rebuild an execution plan each time the arguments change. Building the execution plan for a query can be an expensive step. Most databases cache the execution plans for the queries they run, using the exact query text as a key; if you execute a similar query with different argument values in the predicate, you will most likely cause the database to spend a significant amount of time parsing and building execution plans. This is why bind variables are strongly recommended as a way to ensure optimal query performance.
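For example, with bind variables the query text, and therefore the plan-cache key, stays constant while the values change (placeholder syntax varies by database and driver; named placeholders are shown here):

-- one cached plan serves every execution, whatever values are bound to :p1..:p3
SELECT  *
FROM mytable
WHERE col IN (:p1, :p2, :p3)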

Third, many databases have a limit on the complexity of the queries they can execute; one of those limits is the number of logical connectives that can be included in the predicate. In your case, a few dozen values are unlikely to reach the built-in limit of the database, but if you expect to pass hundreds or thousands of values to an IN clause, it can definitely happen, in which case the database will simply reject the query.

Fourth, queries that include IN and OR in the predicate cannot always be optimally rewritten in a parallel environment. There are various cases where parallel-server optimizations do not get applied; MSDN has a decent introduction to optimizing queries for parallelism. Generally, though, queries that use the UNION ALL operator are trivially parallelizable in most databases and are preferred to logical connectives (like OR and IN) when possible.
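As a sketch, here is an OR predicate and its UNION ALL rewrite (safe here because a single col value cannot match both branches; with overlapping branches you would need UNION to avoid duplicates):

-- OR form: may inhibit parallel plans in some engines
SELECT  *
FROM mytable
WHERE col = 'a' OR col = 'b'

-- UNION ALL form: each branch can be planned and executed independently
SELECT  *
FROM mytable
WHERE col = 'a'
UNION ALL
SELECT  *
FROM mytable
WHERE col = 'b'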

Does the order of where clauses matter in SQL?

No, that order doesn't matter (or at least: shouldn't matter).

Any decent query optimizer will look at all the parts of the WHERE clause and figure out the most efficient way to satisfy that query.

I know the SQL Server query optimizer will pick a suitable index no matter which order your two conditions appear in, and I assume other RDBMSs will have similar strategies.
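For example, these two queries are interchangeable as far as the optimizer is concerned (the persons table name is hypothetical; the columns come from the indexes listed below):

-- the optimizer normalizes the predicate, so both get the same plan
SELECT  *
FROM persons
WHERE LastName = 'Smith' AND FirstName = 'John'

SELECT  *
FROM persons
WHERE FirstName = 'John' AND LastName = 'Smith'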

What does matter is whether or not you have a suitable index for this!

In the case of SQL Server, it will likely use an index if you have:

  • an index on (LastName, FirstName)
  • an index on (FirstName, LastName)
  • an index on just (LastName), or just (FirstName) (or both)

On the other hand - again for SQL Server - if you use SELECT * to grab all columns from a table, and the table is rather small, then there's a good chance the query optimizer will just do a table (or clustered index) scan instead of using an index (because the lookup into the full data page to get all other columns just gets too expensive very quickly).
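To illustrate, with an index on (LastName, FirstName) the first query below can be answered from the index alone, while the second has to fetch the full rows and may therefore fall back to a scan on a small table (same hypothetical persons table as above):

-- covering: both output columns live in the index itself
SELECT  LastName, FirstName
FROM persons
WHERE LastName = 'Smith'

-- not covering: every other column must be fetched from the data pages
SELECT  *
FROM persons
WHERE LastName = 'Smith'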


