SQL JOIN vs IN performance?
Generally speaking, IN and JOIN are different queries that can yield different results.
SELECT a.*
FROM a
JOIN b
ON a.col = b.col
is not the same as
SELECT a.*
FROM a
WHERE col IN
(
SELECT col
FROM b
)
unless b.col is unique.
However, this is equivalent to the IN query:
SELECT a.*
FROM a
JOIN (
SELECT DISTINCT col
FROM b
) b
ON b.col = a.col
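The difference between the three queries above is easy to check on toy data. This is only a sketch: it uses Python's built-in sqlite3 module as a stand-in (not SQL Server), the table contents are made up for illustration, and the derived table is given the alias b.

```python
import sqlite3

def run_comparison():
    """Return row counts for JOIN, IN, and JOIN-on-DISTINCT over toy tables a and b."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE a (col INTEGER);
        CREATE TABLE b (col INTEGER);
        INSERT INTO a (col) VALUES (1), (2), (3);
        -- b.col is NOT unique: the value 1 appears twice
        INSERT INTO b (col) VALUES (1), (1), (2);
    """)
    queries = {
        "join": "SELECT a.* FROM a JOIN b ON a.col = b.col",
        "in": "SELECT a.* FROM a WHERE col IN (SELECT col FROM b)",
        "join_distinct": """
            SELECT a.* FROM a
            JOIN (SELECT DISTINCT col FROM b) b ON b.col = a.col
        """,
    }
    # Count the rows each query returns
    counts = {name: len(conn.execute(sql).fetchall()) for name, sql in queries.items()}
    conn.close()
    return counts

if __name__ == "__main__":
    print(run_comparison())
```

The plain JOIN returns the row with col = 1 twice because it matches two rows in b, while IN and the JOIN on DISTINCT return each row of a at most once. This only shows the result sets; it says nothing about which plan SQL Server picks.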
If the joining column is UNIQUE and marked as such, both these queries yield the same plan in SQL Server.
If it's not, then IN is faster than JOIN on DISTINCT.
See this article in my blog for performance details: IN vs. JOIN vs. EXISTS
INNER JOIN vs LEFT JOIN performance in SQL Server
A LEFT JOIN is absolutely not faster than an INNER JOIN. In fact, it's slower; by definition, an outer join (LEFT JOIN or RIGHT JOIN) has to do all the work of an INNER JOIN plus the extra work of null-extending the results. It would also be expected to return more rows, further increasing the total execution time simply due to the larger size of the result set.
(And even if a LEFT JOIN were faster in specific situations due to some difficult-to-imagine confluence of factors, it is not functionally equivalent to an INNER JOIN, so you cannot simply go replacing all instances of one with the other!)
Most likely your performance problems lie elsewhere, such as not having a candidate key or foreign key indexed properly. 9 tables is quite a lot to be joining so the slowdown could literally be almost anywhere. If you post your schema, we might be able to provide more details.
Edit:
Reflecting further on this, I could think of one circumstance under which a LEFT JOIN might be faster than an INNER JOIN, and that is when:
- Some of the tables are very small (say, under 10 rows);
- The tables do not have sufficient indexes to cover the query.
Consider this example:
CREATE TABLE #Test1
(
ID int NOT NULL PRIMARY KEY,
Name varchar(50) NOT NULL
)
INSERT #Test1 (ID, Name) VALUES (1, 'One')
INSERT #Test1 (ID, Name) VALUES (2, 'Two')
INSERT #Test1 (ID, Name) VALUES (3, 'Three')
INSERT #Test1 (ID, Name) VALUES (4, 'Four')
INSERT #Test1 (ID, Name) VALUES (5, 'Five')
CREATE TABLE #Test2
(
ID int NOT NULL PRIMARY KEY,
Name varchar(50) NOT NULL
)
INSERT #Test2 (ID, Name) VALUES (1, 'One')
INSERT #Test2 (ID, Name) VALUES (2, 'Two')
INSERT #Test2 (ID, Name) VALUES (3, 'Three')
INSERT #Test2 (ID, Name) VALUES (4, 'Four')
INSERT #Test2 (ID, Name) VALUES (5, 'Five')
SELECT *
FROM #Test1 t1
INNER JOIN #Test2 t2
ON t2.Name = t1.Name
SELECT *
FROM #Test1 t1
LEFT JOIN #Test2 t2
ON t2.Name = t1.Name
DROP TABLE #Test1
DROP TABLE #Test2
If you run this and view the execution plan, you'll see that the INNER JOIN query does indeed cost more than the LEFT JOIN, because it satisfies the two criteria above. It's because SQL Server wants to do a hash match for the INNER JOIN, but does nested loops for the LEFT JOIN; the former is normally much faster, but since the number of rows is so tiny and there's no index to use, the hashing operation turns out to be the most expensive part of the query.
You can see the same effect by writing a program in your favourite programming language to perform a large number of lookups on a list with 5 elements, vs. a hash table with 5 elements. Because of the size, the hash table version is actually slower. But increase it to 50 elements, or 5000 elements, and the list version slows to a crawl, because it's O(N) vs. O(1) for the hashtable.
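Here is a rough sketch of that experiment in Python, with a dict standing in for the hash table. The function name and repeat count are my own choices, and the numbers depend heavily on the language and runtime: CPython hashes small integers very cheaply, so the penalty at 5 elements may or may not show up, but the list version should fall behind clearly as the size grows.

```python
import timeit

def time_lookups(size, repeats=2000):
    """Time `repeats` membership tests against a list and a dict of `size` keys."""
    keys = list(range(size))
    as_list = keys                  # linear scan: O(N) per lookup
    as_dict = dict.fromkeys(keys)   # hash lookup: O(1) on average
    probe = size - 1                # last element, the worst case for the list scan
    list_seconds = timeit.timeit(lambda: probe in as_list, number=repeats)
    dict_seconds = timeit.timeit(lambda: probe in as_dict, number=repeats)
    return list_seconds, dict_seconds

if __name__ == "__main__":
    for size in (5, 50, 5000):
        list_seconds, dict_seconds = time_lookups(size)
        print(f"{size:>5} elements: list {list_seconds:.6f}s, dict {dict_seconds:.6f}s")
```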
But change this query to be on the ID column instead of Name and you'll see a very different story. In that case, it does nested loops for both queries, but the INNER JOIN version is able to replace one of the clustered index scans with a seek - meaning that this will literally be an order of magnitude faster with a large number of rows.
So the conclusion is more or less what I mentioned several paragraphs above; this is almost certainly an indexing or index coverage problem, possibly combined with one or more very small tables. Those are the only circumstances under which SQL Server might sometimes choose a worse execution plan for an INNER JOIN than a LEFT JOIN.
Join vs. sub-query
Taken from the MySQL manual (13.2.10.11 Rewriting Subqueries as Joins):
A LEFT [OUTER] JOIN can be faster than an equivalent subquery because the server might be able to optimize it better—a fact that is not specific to MySQL Server alone.
So subqueries can be slower than LEFT [OUTER] JOIN, but in my opinion their strength is slightly higher readability.
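To make the trade-off concrete, here is a sketch of the kind of rewrite that manual section is about: a NOT IN subquery expressed as a LEFT JOIN with an IS NULL filter. It runs on Python's built-in sqlite3 module rather than MySQL, the table names and rows are invented for illustration, and it only checks that both forms return the same rows, not which one is faster. The id columns are declared NOT NULL on purpose, because NOT IN behaves differently once the subquery can return NULL.

```python
import sqlite3

def missing_ids():
    """Return ids present in table1 but not in table2, computed two ways."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE table1 (id INTEGER NOT NULL PRIMARY KEY);
        CREATE TABLE table2 (id INTEGER NOT NULL PRIMARY KEY);
        INSERT INTO table1 (id) VALUES (1), (2), (3), (4);
        INSERT INTO table2 (id) VALUES (2), (4);
    """)
    # Subquery form: arguably easier to read
    subquery_form = conn.execute(
        "SELECT id FROM table1 WHERE id NOT IN (SELECT id FROM table2) ORDER BY id"
    ).fetchall()
    # Join form: keep only the table1 rows that found no partner in table2
    join_form = conn.execute("""
        SELECT table1.id FROM table1
        LEFT JOIN table2 ON table1.id = table2.id
        WHERE table2.id IS NULL
        ORDER BY table1.id
    """).fetchall()
    conn.close()
    return subquery_form, join_form

if __name__ == "__main__":
    print(missing_ids())
```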
IN vs. JOIN with large rowsets
Update:
This article in my blog summarizes both my answer and my comments on other answers, and shows actual execution plans:
- IN vs. JOIN vs. EXISTS
SELECT *
FROM a
WHERE a.c IN (SELECT d FROM b)
SELECT a.*
FROM a
JOIN b
ON a.c = b.d
These queries are not equivalent. They can yield different results if your table b is not key preserved (i.e. the values of b.d are not unique).
The equivalent of the first query is the following:
SELECT a.*
FROM a
JOIN (
SELECT DISTINCT d
FROM b
) bo
ON a.c = bo.d
If b.d is UNIQUE and marked as such (with a UNIQUE INDEX or UNIQUE CONSTRAINT), then these queries are identical and most probably will use identical plans, since SQL Server is smart enough to take this into account.
SQL Server can employ one of the following methods to run this query:
- If there is an index on a.c, d is UNIQUE and b is relatively small compared to a, then the condition is propagated into the subquery and the plain INNER JOIN is used (with b leading).
- If there is an index on b.d and d is not UNIQUE, then the condition is also propagated and LEFT SEMI JOIN is used. It can also be used for the condition above.
- If there is an index on both b.d and a.c and they are large, then MERGE SEMI JOIN is used.
- If there is no index on any table, then a hash table is built on b and HASH SEMI JOIN is used.
None of these methods reevaluates the whole subquery each time.
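To show why the subquery does not need to be rerun per row, here is a toy sketch of the hash semi join idea in plain Python. It is an illustration of the general technique only, not SQL Server's actual implementation; the function name and the dict-based row format are assumptions made for the example.

```python
def hash_semi_join(a_rows, b_rows, a_key, b_key):
    """Return the rows of a that have at least one match in b, each at most once."""
    # Build phase: a single pass over b, keeping only the distinct non-NULL join values.
    b_values = {row[b_key] for row in b_rows if row[b_key] is not None}
    # Probe phase: a single pass over a. Duplicates in b cannot multiply the
    # output, because each a row is only tested for membership.
    return [
        row for row in a_rows
        if row[a_key] is not None and row[a_key] in b_values
    ]

if __name__ == "__main__":
    a = [{"c": 1}, {"c": 2}, {"c": 3}]
    b = [{"d": 1}, {"d": 1}, {"d": 2}]  # d is not unique
    print(hash_semi_join(a, b, "c", "d"))
```

Table b is read once to build the hash set and table a is read once to probe it, which matches the behaviour of WHERE a.c IN (SELECT d FROM b): a row of a is returned once no matter how many times its value appears in b.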
See this entry in my blog for more detail on how this works:
- Counting missing rows: SQL Server
There are links for all of the big four RDBMSs.
SQL Joins Vs SQL Subqueries (Performance)?
I would EXPECT the first query to be quicker, mainly because you have an equivalence and an explicit JOIN. In my experience IN is a very slow operator, since SQL normally evaluates it as a series of WHERE clauses separated by "OR" (WHERE x=Y OR x=Z OR...).
As with ALL THINGS SQL though, your mileage may vary. The speed will depend a lot on indexes (do you have indexes on both ID columns? That will help a lot...) among other things.
The only REAL way to tell with 100% certainty which is faster is to turn on performance tracking (IO Statistics is especially useful) and run them both. Make sure to clear your cache between runs!
INNER JOIN vs LEFT JOIN performance in SQL
With this statement, the predicates in the WHERE clause would negate the "outerness" of a LEFT JOIN, making it equivalent to an INNER JOIN.
I'd expect the optimizer would recognize that, and choose the most cost efficient plan in either case. I'd expect equivalent execution plans, and I wouldn't expect to observe a performance difference in this query, when the INNER keyword is replaced with LEFT.
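The "outerness" point can be checked on a small example. This sketch uses Python's built-in sqlite3 module and hypothetical orders and customers tables (not the tables from the question), and it only compares result sets; whether a given optimizer produces identical plans still has to be confirmed from the actual execution plans.

```python
import sqlite3

def compare_joins():
    """Run LEFT and INNER JOIN variants over hypothetical orders/customers tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, country TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
        INSERT INTO customers (id, country) VALUES (1, 'NL'), (2, 'US');
        -- order 12 has no matching customer row
        INSERT INTO orders (id, customer_id) VALUES (10, 1), (11, 2), (12, 99);
    """)
    template = """
        SELECT o.id, c.country
        FROM orders o
        {join} JOIN customers c ON c.id = o.customer_id
        {where}
        ORDER BY o.id
    """
    variants = {
        "left_no_filter": ("LEFT", ""),
        # The WHERE predicate on c.country rejects the null-extended rows
        "left_filtered": ("LEFT", "WHERE c.country = 'NL'"),
        "inner_filtered": ("INNER", "WHERE c.country = 'NL'"),
    }
    results = {
        name: conn.execute(template.format(join=join, where=where)).fetchall()
        for name, (join, where) in variants.items()
    }
    conn.close()
    return results

if __name__ == "__main__":
    for name, rows in compare_joins().items():
        print(name, rows)
```

Without the filter, the LEFT JOIN keeps order 12 with a NULL country. Once the WHERE clause tests a column of the outer-joined table, that null-extended row is discarded and the LEFT and INNER versions return the same rows.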
But without looking at the actual execution plans, that's just conjecture. To see the execution plan, we can use EXPLAIN (for MySQL) or SET SHOWPLAN_ALL ON (for SQL Server; SET STATISTICS PROFILE ON and similar options also work).
The actual performance of the different statements can be observed and measured.
I'm not understanding why you would want to use a LEFT JOIN at all.
(And the mixing of the old-school comma syntax for the join operation with the newer JOIN keyword... what's up with that?)
As far as improving performance, we'd really need to make sure that suitable indexes are available, and are being used. We need to look at the execution plans. We may need to make sure that statistics are collected and up-to-date, and we may need to make some tweaks to the SQL to get a more efficient plan.
It's possible that replacing INNER with LEFT may make a difference in performance of the existing query, but it's unlikely to be a cure to a performance problem.
Personally, I'd prefer to investigate performance of an equivalent query, with this as a starting point:
UPDATE poker_hands f
JOIN poker_cards a ON a.card_name = f.r1 AND a.game_value = 14
JOIN poker_cards b ON b.card_name = f.r2 AND b.game_value = 13 AND b.suit = a.suit
JOIN poker_cards c ON c.card_name = f.r3 AND c.game_value = 12 AND c.suit = a.suit
JOIN poker_cards d ON d.card_name = f.r4 AND d.game_value = 11 AND d.suit = a.suit
JOIN poker_cards e ON e.card_name = f.r5 AND e.game_value = 10 AND e.suit = a.suit
SET f.hand_type = 'Royal flush'
INNER JOIN vs IN
The two syntaxes serve different purposes. Using the Join syntax presumes you want something from both the StockToCategory and Category table. If there are multiple entries in the StockToCategory table for each category, the Category table values will be repeated.
Using the IN function presumes that you want only items from the Category whose ID meets some criteria. If a given CategoryId (assuming it is the PK of the Category table) exists multiple times in the StockToCategory table, it will only be returned once.
In your exact example, they will produce the same output; however, IMO the latter syntax makes your intent (only wanting categories) clearer.
Btw, yet a third syntax which is similar to using the IN function:
Select ...
From Category
Where Exists (
Select 1
From StockToCategory
Where StockToCategory.CategoryId = Category.CategoryId
And StockToCategory.Stock = @StockId
)