JOIN versus EXISTS performance
NOT EXISTS is generally more efficient than a LEFT OUTER JOIN with an IS NULL condition for excluding rows that have no match in the other table, because the optimizer will elect to use an EXCLUSION MERGE JOIN for the NOT EXISTS predicate.
While your second test did not yield impressive results for the data sets you were using, the performance gain of NOT EXISTS over a LEFT JOIN becomes very noticeable as your data volumes increase. Keep in mind that the tables still need to be hash distributed on the columns that participate in the NOT EXISTS comparison, just as they would for the LEFT JOIN, so data skew can impact the performance of the EXCLUSION MERGE JOIN.
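As a rough illustration that the two exclusion forms select the same rows, here is a minimal sketch using Python's built-in sqlite3 module. The customers and orders tables are hypothetical, and SQLite is used only because it is easy to run; it says nothing about which plan the optimizer of your own database will choose.

```python
import sqlite3

# Hypothetical tables for illustration only: customers and their orders.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 3);
""")

# Exclusion written as an outer join plus an IS NULL filter.
left_join_sql = """
    SELECT c.name
    FROM customers c
    LEFT OUTER JOIN orders o ON o.customer_id = c.id
    WHERE o.customer_id IS NULL
    ORDER BY c.name
"""

# The same exclusion written as a correlated NOT EXISTS predicate.
not_exists_sql = """
    SELECT c.name
    FROM customers c
    WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)
    ORDER BY c.name
"""

left_join_rows = conn.execute(left_join_sql).fetchall()
not_exists_rows = conn.execute(not_exists_sql).fetchall()
print(left_join_rows)    # customers without orders
print(not_exists_rows)   # same rows, different formulation
```

Both queries return only the customer with no orders, so any performance difference comes from the execution plan rather than from the result.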
EDIT:
Typically, I would reach for EXISTS as a replacement for IN rather than as a rewrite of a join. This is especially true when the column(s) participating in the logical comparison can be NULL. That's not to say you couldn't use EXISTS in place of an INNER JOIN: instead of an EXCLUSION JOIN you will end up with an INCLUSION JOIN, and the INNER JOIN is in essence an inclusion join to begin with. I'm sure there are some nuances that I am overlooking, but you can find those in the manuals if you wish to take the time to read them.
EXISTS vs JOIN and use of EXISTS clause
EXISTS is used to return a boolean value; JOIN returns a whole other table.
EXISTS is only used to test whether a subquery returns results, and it short circuits as soon as it does. JOIN is used to extend a result set by combining it with additional fields from another table to which there is a relation.
In your example, the queries are semantically equivalent.
In general, use EXISTS when:
- You don't need to return data from the related table
- You have dupes in the related table (JOIN can cause duplicate rows if values are repeated)
- You want to check existence (use instead of a LEFT OUTER JOIN...NULL condition)
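The duplicate-row point can be sketched with Python's built-in sqlite3 module. The authors and books tables are hypothetical, made up purely for this illustration.

```python
import sqlite3

# Hypothetical schema: one author with three books, one author with none.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books (id INTEGER PRIMARY KEY, author_id INTEGER);
    INSERT INTO authors VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO books VALUES (1, 1), (2, 1), (3, 1);
""")

# JOIN repeats the author once per matching book.
join_rows = conn.execute("""
    SELECT a.name
    FROM authors a
    JOIN books b ON b.author_id = a.id
""").fetchall()

# EXISTS only asks whether at least one book exists for the author.
exists_rows = conn.execute("""
    SELECT a.name
    FROM authors a
    WHERE EXISTS (SELECT 1 FROM books b WHERE b.author_id = a.id)
""").fetchall()

print(len(join_rows))   # one row per matching book
print(exists_rows)      # each qualifying author once
```

The JOIN version would need DISTINCT (or a GROUP BY) to get back to one row per author, which is extra work the EXISTS version never has to do.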
If you have proper indexes, most of the time the EXISTS will perform identically to the JOIN. The exception is on very complicated subqueries, where it is normally quicker to use EXISTS.
If your JOIN key is not indexed, it may be quicker to use EXISTS, but you will need to test for your specific circumstance.
JOIN syntax is normally easier to read and clearer as well.
Can an INNER JOIN offer better performance than EXISTS
Generally speaking, INNER JOIN and EXISTS are different things.
The former can return duplicates and columns from both tables; the latter, being a predicate, returns each qualifying record at most once and only from one table.
If you do an inner join on a UNIQUE column, they exhibit the same performance.
If you do an inner join on a recordset with DISTINCT applied (to get rid of the duplicates), EXISTS is usually faster.
IN and EXISTS clauses (with an equijoin correlation) usually employ one of the several SEMI JOIN algorithms, which are usually more efficient than a DISTINCT on one of the tables.
See this article in my blog:
- IN vs. JOIN vs. EXISTS
SQL JOIN vs IN performance?
Generally speaking, IN and JOIN are different queries that can yield different results.
SELECT a.*
FROM a
JOIN b
ON a.col = b.col
is not the same as
SELECT a.*
FROM a
WHERE col IN
(
SELECT col
FROM b
)
unless b.col is unique.
However, the following query is equivalent to the IN query:
SELECT a.*
FROM a
JOIN (
SELECT DISTINCT col
FROM b
) b
ON b.col = a.col
If the joining column is UNIQUE and marked as such, both these queries yield the same plan in SQL Server.
If it's not, then IN is faster than JOIN on DISTINCT.
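To see how the three formulations differ in their results when b.col is not unique, here is a small sketch using Python's built-in sqlite3 module. The table contents are made up for illustration, and the derived table is given the alias b so that the ON clause can refer to it. It demonstrates result sets only, not SQL Server plans or timings.

```python
import sqlite3

# Tables a and b as in the queries above; the rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (col INTEGER);
    CREATE TABLE b (col INTEGER);
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO b VALUES (1), (1), (2);  -- b.col is NOT unique
""")

# Plain JOIN: the a row with col = 1 appears once per matching b row.
plain_join = conn.execute("""
    SELECT a.col FROM a JOIN b ON a.col = b.col ORDER BY a.col
""").fetchall()

# IN: each a row appears at most once.
in_query = conn.execute("""
    SELECT a.col FROM a WHERE a.col IN (SELECT col FROM b) ORDER BY a.col
""").fetchall()

# JOIN on a DISTINCT derived table: same rows as the IN query.
distinct_join = conn.execute("""
    SELECT a.col
    FROM a
    JOIN (SELECT DISTINCT col FROM b) b ON b.col = a.col
    ORDER BY a.col
""").fetchall()

print(plain_join)
print(in_query)
print(distinct_join)
```

The plain JOIN returns an extra row for the repeated value, while the IN query and the JOIN on the DISTINCT derived table agree with each other.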
See this article in my blog for performance details:
- IN vs. JOIN vs. EXISTS
SQL performance on LEFT OUTER JOIN vs NOT EXISTS
Joe's link is a good starting point. Quassnoi covers this too.
In general, if your fields are properly indexed, OR if you expect to filter out more records (i.e. lots of rows EXIST in the subquery), NOT EXISTS will perform better.
EXISTS and NOT EXISTS both short circuit: as soon as a record matches the criteria it is either included or filtered out, and the optimizer moves on to the next record.
LEFT JOIN will join ALL RECORDS regardless of whether they match or not, then filter the joined result afterwards. If your tables are large and/or you have multiple JOIN criteria, this can be very resource intensive.
I normally try to use NOT EXISTS and EXISTS where possible. For SQL Server, IN is semantically equivalent to EXISTS and may be easier to write; NOT IN matches NOT EXISTS only when neither side of the comparison can be NULL. These are among the only operators you will find in SQL Server that are guaranteed to short circuit.
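One caveat is worth checking before swapping NOT EXISTS for NOT IN: under standard SQL three-valued logic, a single NULL in the subquery makes NOT IN evaluate to unknown for every non-matching row. The sketch below uses Python's built-in sqlite3 module with hypothetical tables t1 and t2 to show the effect; SQL Server follows the same NULL semantics, but run it against your own data to confirm.

```python
import sqlite3

# Hypothetical tables: t2.ref is nullable and contains one NULL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (id INTEGER);
    CREATE TABLE t2 (ref INTEGER);
    INSERT INTO t1 VALUES (1), (2), (3);
    INSERT INTO t2 VALUES (1), (NULL);
""")

# NOT IN: "id NOT IN (1, NULL)" is never true, so no rows come back.
not_in_rows = conn.execute("""
    SELECT id FROM t1
    WHERE id NOT IN (SELECT ref FROM t2)
    ORDER BY id
""").fetchall()

# NOT EXISTS: the NULL row simply never matches, so it is ignored.
not_exists_rows = conn.execute("""
    SELECT id FROM t1
    WHERE NOT EXISTS (SELECT 1 FROM t2 WHERE t2.ref = t1.id)
    ORDER BY id
""").fetchall()

print(not_in_rows)
print(not_exists_rows)
```

If the column is declared NOT NULL, or the subquery filters out NULLs explicitly, the two forms return the same rows.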
JOIN or Correlated subquery with exists clause, which one is better
Generally, the EXISTS clause, because you may need DISTINCT for a JOIN to give the expected output, for example if you have multiple Department rows for a ContactInformation row.
In your example above, the SELECT *:
- means different output too, so they are not actually equivalent
- gives less chance of an index being used, because you are pulling all columns out
That said, even with a limited column list they will give the same plan, until you need DISTINCT... which is why I say "EXISTS".