Subquery V/S Inner Join in SQL Server

Subquery v/s inner join in sql server

Usually joins will work faster than inner queries, but in reality it will depend on the execution plan generated by SQL Server. No matter how you write your query, SQL Server will always transform it on an execution plan. If it is "smart" enough to generate the same plan from both queries, you will get the same result.

Here and here some links to help.

Join vs. sub-query

Taken from the MySQL manual (13.2.10.11 Rewriting Subqueries as Joins):

A LEFT [OUTER] JOIN can be faster than an equivalent subquery because the server might be able to optimize it better—a fact that is not specific to MySQL Server alone.

So subqueries can be slower than LEFT [OUTER] JOIN, but in my opinion their strength is slightly higher readability.

Inner Join with derived table using sub query

The first thing to note is that your queries are not comparable, OUTER APPLY needs to be replaced with CROSS APPLY, or INNER JOIN with LEFT JOIN.

When they are made comparable though, you can see that the query plans for both queries are identical. I have just mocked up a sample DDL:

CREATE TABLE #Employees (EID INT NOT NULL);
INSERT #Employees VALUES (0);
CREATE TABLE #SalaryRate (EID INT NOT NULL, Rate MONEY NOT NULL);
CREATE TABLE #AttendanceDetails (EID INT NOT NULL, DaysAttended INT NOT NULL);

Running the following:

SELECT E.EID,DT.Salary FROM #Employees E
OUTER APPLY
(
SELECT (SR.Rate * AD.DaysAttended) Salary
FROM #SalaryRate SR
INNER JOIN #AttendanceDetails AD on AD.EID=SR.EID
WHERE SR.EID=E.EID
) DT; --Derived Table for outer apply

SELECT E.EID,DT.Salary FROM #Employees E
LEFT JOIN
(
SELECT SR.EID, (SR.Rate * AD.DaysAttended) Salary
FROM #SalaryRate SR
INNER JOIN #AttendanceDetails AD on AD.EID=SR.EID
) DT --Derived Table for inner join
ON DT.EID=E.EID;

Gives the following plan:

Sample Image

And changing to INNER/CROSS:

SELECT E.EID,DT.Salary FROM #Employees E
CROSS APPLY
(
SELECT (SR.Rate * AD.DaysAttended) Salary
FROM #SalaryRate SR
INNER JOIN #AttendanceDetails AD on AD.EID=SR.EID
WHERE SR.EID=E.EID
) DT; --Derived Table for outer apply


SELECT E.EID,DT.Salary FROM #Employees E
INNER JOIN
(
SELECT SR.EID, (SR.Rate * AD.DaysAttended) Salary
FROM #SalaryRate SR
INNER JOIN #AttendanceDetails AD on AD.EID=SR.EID
) DT --Derived Table for inner join
ON DT.EID=E.EID;

Gives the following plan:

Sample Image

These are the plans where there is no data in the outer tables, and only one row in employees, so not really realistic. In the case of the outer apply, SQL Server is able to determine that there is only one row in employees, so it would be beneficial to just do a nested loop join (i.e. row by row lookup) to the outer tables. After putting 1,000 rows in employees, using LEFT JOIN/OUTER APPLY yields the following plan:

Sample Image

You can see here that the join is now a hash match join, which means (in it's simplest terms) that SQL Server has determined that the best plan is to execute the outer query first, hash the results and then lookup from employees. This however does not mean that the subquery as a whole is executed and the results stored, for simplicity purposes you could consider this, but predicates from the outer query can still be still be used, for example, if the subquery were executed and stored internally, the following query would present massive overhead:

SELECT E.EID,DT.Salary FROM #Employees E
LEFT JOIN
(
SELECT SR.EID, (SR.Rate * AD.DaysAttended) Salary
FROM #SalaryRate SR
INNER JOIN #AttendanceDetails AD on AD.EID=SR.EID
) DT --Derived Table for inner join
ON DT.EID=E.EID
WHERE E.EID = 1;

What whould be the point in retrieving all employee rates, storing the results, only to actually look up one employee? Inspection of the execution plan shows that the EID = 1 predicate is passed to the table scan on #AttendanceDetails:

Sample Image

So the answer to the following points is:

  • If I re-write above query using OUTER APPLY, I know for sure the subquery will be executed for each row.
  • Inner Join will execute sub query only once.

It depends. Using APPLY SQL Server will attempt to rewrite the query as a JOIN if possible, as this will yield the optimal plan, so using OUTER APPLY does not guarantee that the query will be executed once for each row. Similarly using LEFT JOIN does not guarantee that the query is executed only once.

SQL is a declarative language, in that you tell it what you want it to do, not how to do it, so you shouldn't rely on specific commands to elicit specific behaviour, instead, if you find performance issues, check the execution plan, and IO statistics to find out how it is doing it, and identify how you can improve your query.

Further more, SQL Server does not matierialise subqueries, usually the definition is expanded out into the main query, so even though you have written:

SELECT E.EID,DT.Salary FROM #Employees E
INNER JOIN
(
SELECT SR.EID, (SR.Rate * AD.DaysAttended) Salary
FROM #SalaryRate SR
INNER JOIN #AttendanceDetails AD on AD.EID=SR.EID
) DT --Derived Table for inner join
ON DT.EID=E.EID;

What is actually executed is more like:

SELECT  e.EID, sr.Rate * ad.DaysAttended AS Salary
FROM #Employees e
INNER JOIN #SalaryRate sr
on e.EID = sr.EID
INNER JOIN #AttendanceDetails ad
ON ad.EID = sr.EID;

Making sense of Inner-Join with Sub-query in SQL Server 2008

Never use commas in the FROM clause. Always use proper, explicit JOIN syntax.

Your query does not return results because no one has exactly the average value, so the salary condition fails. Based on what you are selecting, the subquery is the query you want:

SELECT Dept_ID, AVG(Salary) as Avg_Salary
FROM table123
GROUP BY Dept_ID;

Presumably, the other table brings in the name, so:

SELECT b.Dept_Name, AVG(a.Salary) as Avg_Salary
FROM table123 a JOIN
table246 b
ON a.Dept_ID = b.Dept_Id
GROUP BY b.Dept_Name;

Understanding when to use a subquery over a join

Good Read for Subquery vs Inner Join

https://www.essentialsql.com/subquery-versus-inner-join/

SQL inner join vs subquery

If I had to guess I would say it's because query 1 is pulling the data from both tables. Queries 2 and 3 (aprox the same time) are only pulling data for TabA.

One way you could check this is by running the following:

SET STATISTICS TIME ON
SET STATISTICS IO ON

When I ran

SELECT * FROM sys.objects

I saw the following results.

SQL Server parse and compile time: 
CPU time = 0 ms, elapsed time = 104 ms.

(242 row(s) affected)
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'sysschobjs'. Scan count 1, logical reads 10, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'syssingleobjrefs'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'syspalnames'. Scan count 1, logical reads 2, physical reads 1, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 866 ms.

You can take a look at # of scans, logical reads and physical reads for each query. Physical reads of course take much longer and represent reading from the disk into the cache. If all of your reads are logical reads then your table is completely in cache.

I would be willing to bet if you look you will see a lot more logical reads on TabB on query 1 than on 2 and 3.

EDIT:

Just out of curiosity I did some tests and blogged the results here.

Sub query select statment vs inner join

First, the two queries are not the same. The first filters out any rows that have no matching rows in work.

The equivalent first query uses a left join:

select p.id, p.name, w.id, w.name 
from person p left join
work w
on w.id = p.wid
where p.id in (somenumbers);

Then, the second query can be simplified to:

select p.id, p.name, p.wid,
(select name from work where w.id = p.wid)
from person p
where p.id in (somenumbers);

There is no reason to look up the id in work when it is already present in person.

If you want optimized queries, then you want indexes on person(id, wid, name) and work(id, name).

With these indexes, the two queries should have basically the same performance. The subquery will use the index on work for fetching the rows from work and the where clause will use the index on person. Either query should be fast and scalable.

SQL Server: Perfomance of INNER JOIN on small table vs subquery in IN clause

The execution plans you have supplied both have exactly the same basic strategy.

Join

Sample Image

There is a seek on ActionTable to find rows where ActionName starts with "generate" with a residual predicate on the ActionName LIKE '%action%'. The 7 matching rows are then used to build a hash table.

On the probe side there is a seek on TimeStamp > Scalar Operator(dateadd(day,(-3),getdate())) and matching rows are tested against the hash table to see if the rows should join.

There are two main differences which explain why the IN version executes quicker

IN

Sample Image

  1. The IN version is executing in parallel. There are 4 concurrent threads working on the query execution - not just one.
  2. Related to the parallelism this plan has a bitmap filter. It is able to use this bitmap to eliminate rows early. In the inner join plan 25,959,124 rows are passed to the probe side of the hash join, in the semi join plan the seek still reads 25.9 million rows but only 313 rows are passed out to be evaluated by the join. The remainder are eliminated early by applying the bitmap inside the seek.

It is not readily apparent why the INNER JOIN version does not execute in parallel. You could try adding the hint OPTION(USE HINT('ENABLE_PARALLEL_PLAN_PREFERENCE')) to see if you now get a plan which executes in parallel and contains the bitmap filter.

If you are able to change indexes then, given that the query only returns 309 rows for 7 distinct actions, you may well find that replacing IX_BigTimeSeriesTable_ActionID with a covering index with leading columns [ActionID], [TimeStamp] and then getting a nested loops plan with 7 seeks performs much better than your current queries.

CREATE NONCLUSTERED INDEX [IX_BigTimeSeriesTable_ActionID_TimeStamp]
ON [dbo].[BigTimeSeriesTable] ([ActionID], [TimeStamp])
INCLUDE ([Details], [ID])

Hopefully with that index in place your existing queries will just use it and you will see 7 seeks, each returning an average of 44 rows, to read and return only the exact 309 total required. If not you can try the below

SELECT CA.*
FROM ActionTable A
CROSS APPLY
(
SELECT *
FROM BigTimeSeriesTable B
WHERE B.ActionID = A.ActionID AND B.TimeStamp > DATEADD(DAY, -3, GETDATE())
) CA
WHERE A.ActionName LIKE '%action%'


Related Topics



Leave a reply



Submit