When to Use SQL Sub-Queries Versus a Standard Join

When to use SQL sub-queries versus a standard join?

Subqueries are usually fine unless they are dependent subqueries (also known as correlated subqueries). If you are only using independent subqueries and they use appropriate indexes, they should run quickly. If you have a dependent subquery you might run into performance problems, because a dependent subquery typically needs to be run once for each row in the outer query: if your outer query has 1,000 rows, the subquery will be run 1,000 times. An independent subquery, on the other hand, typically only needs to be evaluated once.

If you're not sure whether a subquery is dependent or independent, here's a rule of thumb: if you can take the subquery, remove it from its context, run it on its own, and get a result set, then it's an independent subquery.

If you get an error because it refers to columns or tables outside of the subquery, then it's a dependent subquery.
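As an illustration of that rule of thumb (a sketch in SQLite via Python, with an invented orders table; the same distinction applies in any engine):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 25.0), (3, 2, 40.0);
""")

# Independent subquery: the inner query (the overall average) runs on its own
# and only needs to be evaluated once.
independent = conn.execute("""
    SELECT id FROM orders
    WHERE amount > (SELECT AVG(amount) FROM orders)
""").fetchall()

# Dependent (correlated) subquery: the inner query references o.customer_id
# from the outer query, so it cannot be run in isolation.
correlated = conn.execute("""
    SELECT id FROM orders AS o
    WHERE amount = (SELECT MAX(amount) FROM orders
                    WHERE customer_id = o.customer_id)
""").fetchall()

print(independent)  # [(3,)]        only the order above the overall average
print(correlated)   # [(2,), (3,)]  the largest order for each customer
```

Cutting the inner queries out confirms the rule: `SELECT AVG(amount) FROM orders` runs fine by itself, while `SELECT MAX(amount) FROM orders WHERE customer_id = o.customer_id` fails with "no such column: o.customer_id".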

The general rule of course has a few exceptions. For example:

  • Many optimizers can take a dependent subquery and find a way to run it efficiently as a JOIN. For example, a NOT EXISTS query might result in an ANTI JOIN query plan, so it will not necessarily be any slower than the equivalent query written with a JOIN.
  • MySQL has a bug where an independent subquery inside an IN expression is incorrectly identified as a dependent subquery, so a suboptimal query plan is used. This is apparently fixed in the newest versions of MySQL.
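To see the anti-join equivalence concretely, here is a sketch (SQLite via Python, invented schema): the correlated NOT EXISTS form and the LEFT JOIN ... IS NULL form return the same rows, and a good optimizer can execute either one as an anti-join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1), (2), (3);
    INSERT INTO orders VALUES (1, 1), (2, 3);
""")

# Correlated NOT EXISTS: customers with no orders.
not_exists = conn.execute("""
    SELECT c.id FROM customers AS c
    WHERE NOT EXISTS (SELECT 1 FROM orders AS o WHERE o.customer_id = c.id)
""").fetchall()

# The same anti-join written explicitly with LEFT JOIN ... IS NULL.
anti_join = conn.execute("""
    SELECT c.id FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.id
    WHERE o.id IS NULL
""").fetchall()

print(not_exists)  # [(2,)]  customer 2 has no orders
print(anti_join)   # [(2,)]  identical result
```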

If performance is an issue then measure your specific queries and see what works best for you.

Join vs. sub-query

Taken from the MySQL manual (13.2.10.11 Rewriting Subqueries as Joins):

A LEFT [OUTER] JOIN can be faster than an equivalent subquery because the server might be able to optimize it better—a fact that is not specific to MySQL Server alone.

So subqueries can be slower than a LEFT [OUTER] JOIN, but in my opinion their strength lies in slightly higher readability.

Why is subquery join much faster than direct join

The problem with your long-running queries is that you lack an index on the page_id column of the comments table. Hence, for each row from the pages table, the server has to check all rows of the comments table. Since you are using LEFT JOIN, this is the only possible join order. What happens in MySQL 5.6 is that when you use a subquery in the FROM clause (a.k.a. a derived table), MySQL will create an index on the temporary table used for the result of the derived table (auto_key0 in the EXPLAIN output). The reason it is faster when you only select one column is that the temporary table will be smaller.

In MySQL 5.7, such derived tables are automatically merged into the main query, if possible. This is done to avoid the extra temporary table. However, it means that you no longer have an index to use for the join. (See this blog post for details.)

You have two options to improve the query time in 5.7:

  1. You can create an index on comments(page_id)
  2. You can prevent the subquery from being merged by rewriting it into a query that cannot be merged. Subqueries with aggregation, LIMIT, or UNION will not be merged (see the blog post for details). One way to do this is to add a LIMIT clause to the subquery. In order not to remove any rows from the result, the limit must be larger than the number of rows in the table.
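A minimal sketch of option 1, assuming a pages/comments schema like the one in the question (shown in SQLite via Python rather than MySQL; the principle carries over):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pages (page_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE comments (comment_id INTEGER PRIMARY KEY, page_id INTEGER, body TEXT);
    INSERT INTO pages VALUES (1, 'Home'), (2, 'About');
    INSERT INTO comments VALUES (1, 1, 'hi'), (2, 1, 'hello'), (3, 2, 'hey');
""")

# Option 1: index the join column, so each page's comments are found with an
# index lookup instead of a full scan of the comments table.
conn.execute("CREATE INDEX idx_comments_page_id ON comments(page_id)")

plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT p.title, COUNT(c.comment_id)
    FROM pages AS p LEFT JOIN comments AS c ON c.page_id = p.page_id
    GROUP BY p.page_id
""").fetchall()
# The plan should now show a SEARCH on comments using idx_comments_page_id
# rather than a SCAN.
```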

In MySQL 8.0, you can also use an optimizer hint to avoid the merging. In your case, that would be something like

SELECT /*+ NO_MERGE(c) */ ... FROM

See slides 34-37 of this presentation for examples of how to use such hints.

Subquery v/s inner join in sql server

Usually joins will work faster than subqueries, but in reality it will depend on the execution plan generated by SQL Server. No matter how you write your query, SQL Server will transform it into an execution plan. If the optimizer is "smart" enough to generate the same plan from both queries, you will get the same performance.
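One way to check is to compare the plans the engine produces for both spellings. SQL Server itself shows the plan via SET SHOWPLAN_ALL ON or the graphical plan in Management Studio; as a sketch of the same idea in a different engine, here is how to pull plans from SQLite via Python (schema invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY);
    CREATE TABLE b (id INTEGER PRIMARY KEY, a_id INTEGER);
""")

def plan(sql):
    # Each EXPLAIN QUERY PLAN row ends with a human-readable step description.
    return " | ".join(r[-1] for r in conn.execute("EXPLAIN QUERY PLAN " + sql))

# The same logical question written as a join and as a subquery.
join_plan = plan("SELECT a.id FROM a INNER JOIN b ON b.a_id = a.id")
sub_plan  = plan("SELECT id FROM a WHERE id IN (SELECT a_id FROM b)")

print(join_plan)
print(sub_plan)
```

If the two printed plans match, the optimizer has transformed both queries into the same procedure and their performance will be the same.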


Subquery vs Traditional join with WHERE clause?

What about a third option?

SELECT ...
FROM Customers AS c
INNER JOIN Classification AS cf
    ON cf.CustomerType = 'Standard'
    AND c.TypeCode = cf.Code
INNER JOIN SalesReps AS s
    ON cf.SalesRepID = s.SalesRepID

Personally, I prefer to use the JOIN syntax to state how the overall set is defined: the foreign keys or other conditions that indicate two rows should be combined into a row in the result set.

The WHERE clause contains the criteria that filter my result set. Arguably, this can become quite bloated and complicated when you are performing a number of joins, but when you think in sets it follows a kind of logic:

  • SELECT what columns I want.
  • JOIN tables to define the set I want to get rows from.
  • Filter out rows WHERE my criteria are not met.

By this logic, I'd always choose your second syntax for consistent readability.
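A small sketch of that logic with invented data (SQLite via Python): the JOIN defines the set through the foreign key, and WHERE filters out the rows that do not meet the criteria.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (Id INTEGER PRIMARY KEY, TypeCode TEXT, Name TEXT);
    CREATE TABLE Classification (Code TEXT, CustomerType TEXT, SalesRepID INTEGER);
    INSERT INTO Customers VALUES (1, 'A', 'Acme'), (2, 'B', 'Globex');
    INSERT INTO Classification VALUES ('A', 'Standard', 10), ('B', 'Premium', 20);
""")

# SELECT the columns wanted; JOIN on the foreign key to define the set;
# filter WHERE the criteria are not met.
rows = conn.execute("""
    SELECT c.Name
    FROM Customers AS c
    INNER JOIN Classification AS cf ON c.TypeCode = cf.Code
    WHERE cf.CustomerType = 'Standard'
""").fetchall()

print(rows)  # [('Acme',)]  only the 'Standard' customer survives the filter
```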

SQL inner join vs subquery

If I had to guess, I would say it's because query 1 is pulling data from both tables, while queries 2 and 3 (which take approximately the same time) are only pulling data from TabA.

One way you could check this is by running the following:

SET STATISTICS TIME ON
SET STATISTICS IO ON

When I ran

SELECT * FROM sys.objects

I saw the following results.

SQL Server parse and compile time: 
CPU time = 0 ms, elapsed time = 104 ms.

(242 row(s) affected)
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'sysschobjs'. Scan count 1, logical reads 10, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'syssingleobjrefs'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'syspalnames'. Scan count 1, logical reads 2, physical reads 1, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 866 ms.

You can take a look at the number of scans, logical reads, and physical reads for each query. Physical reads of course take much longer, since they represent reading from disk into the cache. If all of your reads are logical reads, then your table is completely in cache.

I would be willing to bet that if you look, you will see a lot more logical reads on TabB in query 1 than in queries 2 and 3.

EDIT:

Just out of curiosity I did some tests and blogged the results here.

SQL Query Time Complexity - Join vs Sub Query

Does a join take O(M+N) time? And does a sub-query take O(M*N)?
Am I wrong to think this way?

Yes, with respect, you are wrong to think this way. SQL is declarative. You use it to state the result you want, and the server figures out the best way to deliver that result -- to satisfy your query -- based on available indexes and data structures.

Thousands of person-years -- really! -- of developer effort have gone into figuring out all sorts of algorithms, optimizations, and hacks to reduce the complexity of the processes the servers use to satisfy queries.

As those person-years of experience accumulate, the performance distinction between correlated subqueries and join queries becomes less important.

Your thinking is wrong for a specific reason: you are thinking procedurally, not declaratively. When you assert that a particular type of query can be satisfied in, for example, O(m*n) time, you are making assumptions about the procedure used to satisfy it. Generations of developers have dedicated themselves to making those assumptions wrong.

Certainly it's possible to create tables, indexes, and queries with pathological performance characteristics. It happens all the time. But somebody fixes an index and the problem is solved.
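The "somebody fixes an index" point can be seen directly in a plan. A sketch (SQLite via Python, invented table): the same query flips from a full scan to an index search once the index exists, without the SQL changing at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, grp INTEGER, val REAL)")

def plan():
    rows = conn.execute("EXPLAIN QUERY PLAN SELECT * FROM t WHERE grp = 5")
    return " ".join(r[-1] for r in rows)

before = plan()  # no usable index: a full scan of t
conn.execute("CREATE INDEX idx_t_grp ON t(grp)")
after = plan()   # same SQL, now a search using idx_t_grp

print(before)
print(after)
```

The query text never changed; only the available data structures did, which is exactly why reasoning about a query's complexity from its syntax alone goes wrong.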

Is there any performance difference when select from subqueries?

No, there is no difference whatsoever.

You can easily find that out yourself by looking at the execution plan generated using explain (analyze) select ....

Except for the aliases, the plans should be identical.
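A sketch of that check (in SQLite via Python rather than PostgreSQL's explain (analyze); table invented): the trivial derived table is flattened away, and both statements should yield the same plan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)")

def plan(sql):
    return [r[-1] for r in conn.execute("EXPLAIN QUERY PLAN " + sql)]

direct  = plan("SELECT val FROM t")
wrapped = plan("SELECT val FROM (SELECT val FROM t) AS sub")

# The optimizer flattens the derived table, so the subquery wrapper
# leaves no trace in the plan.
print(direct)
print(wrapped)
```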

LEFT JOIN multiple sub queries

You can do it with conditional aggregation:

SELECT p.PK_Product_ID AS Product_ID,
       SUM(IIF(YEAR(o.OrderDate) = 2013, 1, 0)) AS 2013_Orders,
       SUM(IIF(YEAR(o.OrderDate) = 2013, p.UnitPrice * od.Quantity, 0)) AS 2013_Gross_Value,
       SUM(IIF(YEAR(o.OrderDate) = 2014, 1, 0)) AS 2014_Orders,
       SUM(IIF(YEAR(o.OrderDate) = 2014, p.UnitPrice * od.Quantity, 0)) AS 2014_Gross_Value
FROM (Products AS p
      LEFT JOIN [Order Details] AS od ON od.FK_Product_ID = p.PK_Product_ID)
     LEFT JOIN (SELECT * FROM Orders WHERE YEAR(OrderDate) IN (2013, 2014)) AS o
     ON od.FK_Order_ID = o.PK_Order_ID
GROUP BY p.PK_Product_ID

If you want only the products ordered in the years 2013 and/or 2014 then you can change the LEFT joins to INNER joins.
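The same conditional-aggregation pattern can be sketched with portable CASE expressions in place of Access's IIF (SQLite via Python, data invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (id INTEGER PRIMARY KEY, yr INTEGER, amount REAL);
    INSERT INTO sales VALUES (1, 2013, 10), (2, 2013, 20), (3, 2014, 5);
""")

# One pass over the table; each CASE contributes to exactly one yearly bucket.
row = conn.execute("""
    SELECT SUM(CASE WHEN yr = 2013 THEN 1 ELSE 0 END)      AS orders_2013,
           SUM(CASE WHEN yr = 2013 THEN amount ELSE 0 END) AS value_2013,
           SUM(CASE WHEN yr = 2014 THEN 1 ELSE 0 END)      AS orders_2014,
           SUM(CASE WHEN yr = 2014 THEN amount ELSE 0 END) AS value_2014
    FROM sales
""").fetchone()

print(row)  # (2, 30.0, 1, 5.0)
```

This is why conditional aggregation can replace several LEFT JOINed subqueries: one scan produces every per-year column at once.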


