SQL JOIN vs IN Performance

SQL JOIN vs IN performance?

Generally speaking, IN and JOIN are different queries that can yield different results.

SELECT  a.*
FROM a
JOIN b
ON a.col = b.col

is not the same as

SELECT  a.*
FROM a
WHERE col IN
(
SELECT col
FROM b
)

, unless b.col is unique.

However, this is equivalent to the second (IN) query:

SELECT  a.*
FROM a
JOIN (
SELECT DISTINCT col
FROM b
) b
ON b.col = a.col

If the joining column is UNIQUE and marked as such, both these queries yield the same plan in SQL Server.

If it's not, then IN is faster than JOIN on DISTINCT.
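
The difference in results is easy to check with a small script. This is only a sketch: it uses Python's built-in sqlite3 module as a stand-in engine (an assumption, not SQL Server), the sample rows are made up, and it compares result sets, not performance.

```python
import sqlite3

# Hypothetical sample data; table and column names mirror the queries above.
# SQLite stands in for the database: this checks results, not query plans.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (col INTEGER);
    CREATE TABLE b (col INTEGER);
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO b VALUES (1), (1), (2);   -- b.col is NOT unique: 1 appears twice
""")

def rows(sql):
    """Run a query and return its rows as a sorted list."""
    return sorted(conn.execute(sql).fetchall())

join_rows = rows("SELECT a.* FROM a JOIN b ON a.col = b.col")
in_rows = rows("SELECT a.* FROM a WHERE col IN (SELECT col FROM b)")
distinct_rows = rows("""
    SELECT a.*
    FROM a
    JOIN (SELECT DISTINCT col FROM b) b ON b.col = a.col
""")

print(join_rows)      # the row with col = 1 comes back twice
print(in_rows)        # each matching row of a comes back once
print(distinct_rows)  # same rows as the IN query
```

With duplicates in b.col, the plain JOIN repeats rows of a, while IN and the JOIN on DISTINCT agree.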

See this article in my blog for performance details:

  • IN vs. JOIN vs. EXISTS

INNER JOIN vs LEFT JOIN performance in SQL Server

A LEFT JOIN is absolutely not faster than an INNER JOIN. In fact, it's slower; by definition, an outer join (LEFT JOIN or RIGHT JOIN) has to do all the work of an INNER JOIN plus the extra work of null-extending the results. It would also be expected to return more rows, further increasing the total execution time simply due to the larger size of the result set.

(And even if a LEFT JOIN were faster in specific situations due to some difficult-to-imagine confluence of factors, it is not functionally equivalent to an INNER JOIN, so you cannot simply go replacing all instances of one with the other!)
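
To make the "not functionally equivalent" point concrete, here is a minimal sketch. The orders/customers tables are hypothetical and SQLite (via Python's sqlite3) is used only as a convenient stand-in; the join semantics shown are standard SQL.

```python
import sqlite3

# Hypothetical tables for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Ann');
    INSERT INTO orders VALUES (10, 1), (11, 2);  -- order 11 has no matching customer
""")

QUERY = """
    SELECT o.id, c.name
    FROM orders o
    {join_type} JOIN customers c ON c.id = o.customer_id
    ORDER BY o.id
"""

inner_rows = conn.execute(QUERY.format(join_type="INNER")).fetchall()
left_rows = conn.execute(QUERY.format(join_type="LEFT")).fetchall()

print(inner_rows)  # only orders that have a customer
print(left_rows)   # every order; missing customers appear as None (NULL)
```

Swapping one join type for the other changes which rows come back, regardless of speed.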

Most likely your performance problems lie elsewhere, such as not having a candidate key or foreign key indexed properly. 9 tables is quite a lot to be joining so the slowdown could literally be almost anywhere. If you post your schema, we might be able to provide more details.


Edit:

Reflecting further on this, I could think of one circumstance under which a LEFT JOIN might be faster than an INNER JOIN, and that is when:

  • Some of the tables are very small (say, under 10 rows);
  • The tables do not have sufficient indexes to cover the query.

Consider this example:

CREATE TABLE #Test1
(
ID int NOT NULL PRIMARY KEY,
Name varchar(50) NOT NULL
)
INSERT #Test1 (ID, Name) VALUES (1, 'One')
INSERT #Test1 (ID, Name) VALUES (2, 'Two')
INSERT #Test1 (ID, Name) VALUES (3, 'Three')
INSERT #Test1 (ID, Name) VALUES (4, 'Four')
INSERT #Test1 (ID, Name) VALUES (5, 'Five')

CREATE TABLE #Test2
(
ID int NOT NULL PRIMARY KEY,
Name varchar(50) NOT NULL
)
INSERT #Test2 (ID, Name) VALUES (1, 'One')
INSERT #Test2 (ID, Name) VALUES (2, 'Two')
INSERT #Test2 (ID, Name) VALUES (3, 'Three')
INSERT #Test2 (ID, Name) VALUES (4, 'Four')
INSERT #Test2 (ID, Name) VALUES (5, 'Five')

SELECT *
FROM #Test1 t1
INNER JOIN #Test2 t2
ON t2.Name = t1.Name

SELECT *
FROM #Test1 t1
LEFT JOIN #Test2 t2
ON t2.Name = t1.Name

DROP TABLE #Test1
DROP TABLE #Test2

If you run this and view the execution plan, you'll see that the INNER JOIN query does indeed cost more than the LEFT JOIN, because the tables satisfy the two criteria above. SQL Server wants to do a hash match for the INNER JOIN, but does nested loops for the LEFT JOIN; the former is normally much faster, but since the number of rows is so tiny and there's no index to use, the hashing operation turns out to be the most expensive part of the query.

You can see the same effect by writing a program in your favourite programming language to perform a large number of lookups on a list with 5 elements, vs. a hash table with 5 elements. Because of the size, the hash table version is actually slower. But increase it to 50 elements, or 5000 elements, and the list version slows to a crawl, because it's O(N) vs. O(1) for the hashtable.
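
A rough way to try that experiment yourself, as a sketch in Python. The function name and the sizes are my own choices, and the crossover point depends heavily on the language and runtime, so the 5-element case may or may not come out in favour of the list on your machine.

```python
import timeit

def time_lookups(n, repeats=2_000):
    """Time membership tests against a list and a set holding n integers.

    Returns (list_seconds, set_seconds) for `repeats` lookups each.
    """
    items_list = list(range(n))
    items_set = set(items_list)
    probe = n - 1  # worst case for the list: the last element
    list_time = timeit.timeit(lambda: probe in items_list, number=repeats)
    set_time = timeit.timeit(lambda: probe in items_set, number=repeats)
    return list_time, set_time

for n in (5, 50, 5000):
    list_time, set_time = time_lookups(n)
    print(f"n={n:>5}  list: {list_time:.5f}s  set: {set_time:.5f}s")
```

The list timings should grow roughly in line with n, while the set timings should stay roughly flat.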

But change this query to be on the ID column instead of Name and you'll see a very different story. In that case, it does nested loops for both queries, but the INNER JOIN version is able to replace one of the clustered index scans with a seek - meaning that this will literally be an order of magnitude faster with a large number of rows.

So the conclusion is more or less what I mentioned several paragraphs above; this is almost certainly an indexing or index coverage problem, possibly combined with one or more very small tables. Those are the only circumstances under which SQL Server might sometimes choose a worse execution plan for an INNER JOIN than a LEFT JOIN.

Join vs. sub-query

Taken from the MySQL manual (13.2.10.11 Rewriting Subqueries as Joins):

A LEFT [OUTER] JOIN can be faster than an equivalent subquery because the server might be able to optimize it better—a fact that is not specific to MySQL Server alone.

So subqueries can be slower than LEFT [OUTER] JOIN, but in my opinion their strength is slightly higher readability.

IN vs. JOIN with large rowsets

Update:

This article in my blog summarizes both my answer and my comments on other answers, and shows actual execution plans:

  • IN vs. JOIN vs. EXISTS

SELECT  *
FROM a
WHERE a.c IN (SELECT d FROM b)

SELECT a.*
FROM a
JOIN b
ON a.c = b.d

These queries are not equivalent. They can yield different results if your table b is not key-preserved (i.e. the values of b.d are not unique).

The equivalent of the first query is the following:

SELECT  a.*
FROM a
JOIN (
SELECT DISTINCT d
FROM b
) bo
ON a.c = bo.d

If b.d is UNIQUE and marked as such (with a UNIQUE INDEX or UNIQUE CONSTRAINT), then these queries are identical and most probably will use identical plans, since SQL Server is smart enough to take this into account.

SQL Server can employ one of the following methods to run this query:

  • If there is an index on a.c, d is UNIQUE and b is relatively small compared to a, then the condition is propagated into the subquery and the plain INNER JOIN is used (with b leading)

  • If there is an index on b.d and d is not UNIQUE, then the condition is also propagated and LEFT SEMI JOIN is used. It can also be used for the condition above.

  • If there is an index on both b.d and a.c and they are large, then MERGE SEMI JOIN is used

  • If there is no index on any table, then a hash table is built on b and HASH SEMI JOIN is used.

None of these methods reevaluates the whole subquery each time.
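
As a conceptual illustration of the last method, a hash semi join can be sketched in a few lines of Python. This is not SQL Server's actual implementation: the function name and the row representation (dicts) are assumptions for the example, and NULL handling is ignored.

```python
def hash_semi_join(a_rows, b_rows, a_key, b_key):
    """Return the rows of a whose key value appears in b.

    Each row of a is emitted at most once, no matter how many
    rows of b match it, which is what makes this a *semi* join.
    """
    # Build phase: one pass over b, collecting its key values in a hash set.
    build = {row[b_key] for row in b_rows}
    # Probe phase: one pass over a, one hash lookup per row.
    return [row for row in a_rows if row[a_key] in build]

# Hypothetical sample data matching the a.c / b.d columns used above.
a = [{"c": 1}, {"c": 2}, {"c": 3}]
b = [{"d": 1}, {"d": 1}, {"d": 2}]

result = hash_semi_join(a, b, "c", "d")
print(result)
```

Note that b is read once to build the hash set and a is read once to probe it, so the subquery is never rerun per row of a.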

See this entry in my blog for more detail on how this works:

  • Counting missing rows: SQL Server

There are links for each of the big four RDBMSs.

SQL Joins Vs SQL Subqueries (Performance)?

I would EXPECT the first query to be quicker, mainly because you have an equivalence and an explicit JOIN. In my experience IN can be a slow operator when SQL evaluates it as a series of comparisons separated by "OR" (WHERE x=Y OR x=Z OR...), which is how a literal list of values is typically handled; with a subquery, the optimizer may rewrite it as a semi join instead.

As with ALL THINGS SQL though, your mileage may vary. The speed will depend a lot on indexes (do you have indexes on both ID columns? That will help a lot...) among other things.

The only REAL way to tell with 100% certainty which is faster is to turn on performance tracking (IO Statistics is especially useful) and run them both. Make sure to clear your cache between runs!

INNER JOIN vs LEFT JOIN performance in SQL

With this statement, the predicates in the WHERE clause would negate the "outerness" of a LEFT JOIN, making it equivalent to an INNER JOIN.

I'd expect the optimizer would recognize that, and choose the most cost efficient plan in either case. I'd expect equivalent execution plans, and I wouldn't expect to observe a performance difference in this query, when the INNER keyword is replaced with LEFT.
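
Here is a small sketch of how a WHERE predicate on the outer-joined table cancels the "outerness". The tables t1/t2 and the flag column are hypothetical, and SQLite (via Python's sqlite3) is only a stand-in; it shows that the results match, not what any particular optimizer does with the plan.

```python
import sqlite3

# Hypothetical tables for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE t2 (id INTEGER PRIMARY KEY, t1_id INTEGER, flag INTEGER);
    INSERT INTO t1 VALUES (1, 'One'), (2, 'Two'), (3, 'Three');
    INSERT INTO t2 VALUES (1, 1, 1), (2, 2, 0);   -- no t2 row for t1.id = 3
""")

# The WHERE predicate tests a column of t2. For NULL-extended rows
# t2.flag is NULL, so "t2.flag = 1" is not true and those rows are dropped.
QUERY = """
    SELECT t1.id, t2.flag
    FROM t1
    {join_type} JOIN t2 ON t2.t1_id = t1.id
    WHERE t2.flag = 1
    ORDER BY t1.id
"""

inner_rows = conn.execute(QUERY.format(join_type="INNER")).fetchall()
left_rows = conn.execute(QUERY.format(join_type="LEFT")).fetchall()

print(inner_rows)
print(left_rows)   # identical to inner_rows
```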

But without looking at the actual execution plans, that's just conjecture. To see the execution plan, we can use EXPLAIN (for MySQL) or SET SHOWPLAN_ALL ON (for SQL Server, or STATISTICS PROFILE ON, et al.)
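
For a quick local experiment, the same idea can be tried with SQLite's EXPLAIN QUERY PLAN, shown here through Python's sqlite3 module. This is a stand-in, not the MySQL or SQL Server tooling mentioned above; the tables are hypothetical, and the plan text differs between SQLite versions, so no particular output is assumed.

```python
import sqlite3

# Hypothetical tables for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE t2 (id INTEGER PRIMARY KEY, t1_id INTEGER);
""")

def plan(sql):
    """Return the human-readable 'detail' column of EXPLAIN QUERY PLAN."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

inner_plan = plan("SELECT * FROM t1 INNER JOIN t2 ON t2.t1_id = t1.id")
left_plan = plan("SELECT * FROM t1 LEFT JOIN t2 ON t2.t1_id = t1.id")

for label, steps in (("INNER", inner_plan), ("LEFT", left_plan)):
    print(label)
    for step in steps:
        print("   ", step)
```

Comparing the two listings side by side is the point; the exact wording of each step is engine-specific.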

The actual performance of the different statements can be observed and measured.


I'm not understanding why you would want to use a LEFT JOIN at all.

(And the mixing of the old-school comma syntax for the join operation with the newer JOIN keyword... what's up with that?)


As far as improving performance, we'd really need to make sure that suitable indexes are available, and are being used. We need to look at the execution plans. We may need to make sure that statistics are collected and up-to-date, and we may need to make some tweaks to the SQL to get a more efficient plan.

It's possible that replacing INNER with LEFT may make a difference in performance of the existing query, but it's unlikely to be a cure to a performance problem.

Personally, I'd prefer to investigate performance of an equivalent query, with this as a starting point:

UPDATE poker_hands f
JOIN poker_cards a ON a.card_name = f.r1 AND a.game_value = 14
JOIN poker_cards b ON b.card_name = f.r2 AND b.game_value = 13 AND b.suit = a.suit
JOIN poker_cards c ON c.card_name = f.r3 AND c.game_value = 12 AND c.suit = a.suit
JOIN poker_cards d ON d.card_name = f.r4 AND d.game_value = 11 AND d.suit = a.suit
JOIN poker_cards e ON e.card_name = f.r5 AND e.game_value = 10 AND e.suit = a.suit
SET f.hand_type = 'Royal flush'

INNER JOIN vs IN

The two syntaxes serve different purposes. Using the Join syntax presumes you want something from both the StockToCategory and Category table. If there are multiple entries in the StockToCategory table for each category, the Category table values will be repeated.

Using the IN function presumes that you want only items from the Category whose ID meets some criteria. If a given CategoryId (assuming it is the PK of the Category table) exists multiple times in the StockToCategory table, it will only be returned once.

In your exact example, they will produce the same output; however, IMO the latter syntax makes your intent (only wanting categories) clearer.

Btw, there is yet a third syntax, which is similar to using IN:

Select ...
From Category
Where Exists (
Select 1
From StockToCategory
Where StockToCategory.CategoryId = Category.CategoryId
And StockToCategory.Stock = @StockId
)
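
A quick sketch comparing the three forms. Table and column names follow the answer above; the Name column, the sample rows (including a duplicated link row) and the use of SQLite through Python's sqlite3 are assumptions for illustration, and @StockId is written as the named parameter :StockId.

```python
import sqlite3

# Hypothetical sample data for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Category (CategoryId INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE StockToCategory (Stock INTEGER, CategoryId INTEGER);
    INSERT INTO Category VALUES (1, 'Tools'), (2, 'Toys');
    -- Stock 100 is linked to category 1 twice (deliberate duplicate)
    INSERT INTO StockToCategory VALUES (100, 1), (100, 1), (200, 2);
""")
params = {"StockId": 100}

in_rows = conn.execute("""
    SELECT CategoryId FROM Category
    WHERE CategoryId IN (
        SELECT CategoryId FROM StockToCategory WHERE Stock = :StockId
    )
""", params).fetchall()

exists_rows = conn.execute("""
    SELECT CategoryId FROM Category
    WHERE EXISTS (
        SELECT 1 FROM StockToCategory
        WHERE StockToCategory.CategoryId = Category.CategoryId
          AND StockToCategory.Stock = :StockId
    )
""", params).fetchall()

join_rows = conn.execute("""
    SELECT Category.CategoryId FROM Category
    JOIN StockToCategory ON StockToCategory.CategoryId = Category.CategoryId
    WHERE StockToCategory.Stock = :StockId
""", params).fetchall()

print(in_rows)      # category 1 once
print(exists_rows)  # category 1 once
print(join_rows)    # category 1 once per matching StockToCategory row
```

IN and EXISTS return each category once; the JOIN repeats it when the link table has more than one matching row.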

