NOT IN vs NOT EXISTS
I always default to NOT EXISTS
.
The execution plans may be the same at the moment but if either column is altered in the future to allow NULL
s the NOT IN
version will need to do more work (even if no NULL
s are actually present in the data) and the semantics of NOT IN
if NULL
s are present are unlikely to be the ones you want anyway.
When neither Products.ProductID
or [Order Details].ProductID
allow NULL
s the NOT IN
will be treated identically to the following query.
SELECT ProductID,
ProductName
FROM Products p
WHERE NOT EXISTS (SELECT *
FROM [Order Details] od
WHERE p.ProductId = od.ProductId)
The exact plan may vary but for my example data I get the following.
A reasonably common misconception seems to be that correlated sub queries are always "bad" compared to joins. They certainly can be when they force a nested loops plan (sub query evaluated row by row) but this plan includes an anti semi join logical operator. Anti semi joins are not restricted to nested loops but can use hash or merge (as in this example) joins too.
/*Not valid syntax but better reflects the plan*/
SELECT p.ProductID,
p.ProductName
FROM Products p
LEFT ANTI SEMI JOIN [Order Details] od
ON p.ProductId = od.ProductId
If [Order Details].ProductID
is NULL
-able the query then becomes
SELECT ProductID,
ProductName
FROM Products p
WHERE NOT EXISTS (SELECT *
FROM [Order Details] od
WHERE p.ProductId = od.ProductId)
AND NOT EXISTS (SELECT *
FROM [Order Details]
WHERE ProductId IS NULL)
The reason for this is that the correct semantics if [Order Details]
contains any NULL
ProductId
s is to return no results. See the extra anti semi join and row count spool to verify this that is added to the plan.
If Products.ProductID
is also changed to become NULL
-able the query then becomes
SELECT ProductID,
ProductName
FROM Products p
WHERE NOT EXISTS (SELECT *
FROM [Order Details] od
WHERE p.ProductId = od.ProductId)
AND NOT EXISTS (SELECT *
FROM [Order Details]
WHERE ProductId IS NULL)
AND NOT EXISTS (SELECT *
FROM (SELECT TOP 1 *
FROM [Order Details]) S
WHERE p.ProductID IS NULL)
The reason for that one is because a NULL
Products.ProductId
should not be returned in the results except if the NOT IN
sub query were to return no results at all (i.e. the [Order Details]
table is empty). In which case it should. In the plan for my sample data this is implemented by adding another anti semi join as below.
The effect of this is shown in the blog post already linked by Buckley. In the example there the number of logical reads increase from around 400 to 500,000.
Additionally the fact that a single NULL
can reduce the row count to zero makes cardinality estimation very difficult. If SQL Server assumes that this will happen but in fact there were no NULL
rows in the data the rest of the execution plan may be catastrophically worse, if this is just part of a larger query, with inappropriate nested loops causing repeated execution of an expensive sub tree for example.
This is not the only possible execution plan for a NOT IN
on a NULL
-able column however. This article shows another one for a query against the AdventureWorks2008
database.
For the NOT IN
on a NOT NULL
column or the NOT EXISTS
against either a nullable or non nullable column it gives the following plan.
When the column changes to NULL
-able the NOT IN
plan now looks like
It adds an extra inner join operator to the plan. This apparatus is explained here. It is all there to convert the previous single correlated index seek on Sales.SalesOrderDetail.ProductID = <correlated_product_id>
to two seeks per outer row. The additional one is on WHERE Sales.SalesOrderDetail.ProductID IS NULL
.
As this is under an anti semi join if that one returns any rows the second seek will not occur. However if Sales.SalesOrderDetail
does not contain any NULL
ProductID
s it will double the number of seek operations required.
Not Exists vs Not In: efficiency
try
SELECT DISTINCT a.SFAccountID, a.SLXID, a.Name
FROM [dbo].[Salesforce_Accounts] a WITH(NOLOCK)
JOIN _SLX_AccountChannel b WITH(NOLOCK)
ON a.SLXID = b.ACCOUNTID
AND b.STATUS IN ('Active','Customer', 'Current')
JOIN [dbo].[Salesforce_Contacts] c WITH(NOLOCK)
ON a.SFAccountID = c.SFAccountID
AND c.Primary__C = 0
LEFT JOIN [dbo].[Salesforce_Contacts] c2 WITH(NOLOCK)
on c2.SFAccountID = a.SFAccountID
AND c2.Primary__c = 1
WHERE c2.SFAccountID is null
NOT EXISTS vs NOT IN
As you have rightly said the two are different things. If the subquery of items to not be IN
contains NULL
no results will be returned because nothing equals NULL
and nothing does not equal NULL
(Not even NULL).
Assuming you are using the two to achieve the same result, there is no difference between the two as long as you handle NULL
values in your IN
statement. The optimiser is clever enough to know that with NULL
values eliminated, or with non nullable columns the two are the same, so use the same ANTI SEMI JOIN
.
Consider these two tables:
CREATE TABLE T (ID INT NOT NULL PRIMARY KEY);
CREATE TABLE T2 (ID INT NOT NULL PRIMARY KEY);
These two queries get exactly the same execution plan:
SELECT *
FROM T
WHERE ID NOT IN (SELECT ID FROM T2);
SELECT *
FROM T
WHERE NOT EXISTS (SELECT ID FROM T2 WHERE T.ID = T2.ID);
because the optimiser knows T2.ID is a non nullable column. With a third table:
CREATE TABLE T3 (ID INT);
where the ID column is neither indexed or nullable these two queries render very different execution plans:
SELECT *
FROM T
WHERE ID NOT IN (SELECT ID FROM T3);
SELECT *
FROM T
WHERE NOT EXISTS (SELECT ID FROM T3 WHERE T.ID = T3.ID);
and NOT EXISTS will be much more efficient. However these two again yield (essentially) the same execution plan:
SELECT *
FROM T
WHERE ID NOT IN (SELECT ID FROM T3 WHERE T3.ID IS NOT NULL);
SELECT *
FROM T
WHERE NOT EXISTS (SELECT ID FROM T3 WHERE T.ID = T3.ID);
All these queries and sample data are on SQL Fiddle
EDIT
To actually answer your question:
Case 1 will be the same performance with NOT IN
or NOT EXISTS
if tracked_session_id
is a non nullable column in data.conversions
, or you add WHERE tracked_Session_id IS NOT NULL
inside the In statement. If the column is not nullable and you don't exclude null values the performance won't be the same, and assuming there are no nulls NOT EXISTS
will perform better, if there are no nulls the result won't be the same so the performances are not comparable.
Case 2 actually surprised me with sample data, I had assumed that this would not be optimised into an ANTI SEMI JOIN
, and had already written an answer saying as much, but just before saving the edit I thought I'd better check, and was surprised to see that this:
SELECT *
FROM T
WHERE ( SELECT COUNT(*)
FROM T3
WHERE T.ID = T3.ID
) = 0;
Is optimised exactly the same as NOT EXISTS
. So it appears the optimiser is even more clever than I thought, it will only generate a different plan if you want the count to be something other than 0.
SQL Fiddle for Case 2
What's the difference between 'not in' and 'not exists'?
I think it serves the same purpose.
not in
can also take literal values whereas not exists
need a query to compare the results with.
EDIT: not exists
could be good to use because it can join
with the outer query & can lead to usage of index, if the criteria uses column that is indexed.
EDIT2: See this question as well.
EDIT3: Let me take the above things back.
See this link. I think, it all depends on how the DB translates this & on database/indexes etc.
Which is faster - NOT IN or NOT EXISTS?
Usually it does not matter if NOT IN
is slower / faster than NOT EXISTS
, because they are NOT equivalent in presence of NULL
. Read:
NOT IN vs NOT EXISTS
In these cases you almost always want NOT EXISTS
, because it has the usually expected behaviour.
If they are equivalent, it is likely that your database already has figured that out and will generate the same execution plan for both.
In the few cases where both options are aquivalent and your database is not able to figure that out, it is better to analyze both execution plans and choose the best options for your specific case.
What's the difference between NOT EXISTS vs. NOT IN vs. LEFT JOIN WHERE IS NULL?
NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: SQL Server
NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: PostgreSQL
NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: Oracle
NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: MySQL
In a nutshell:
NOT IN
is a little bit different: it never matches if there is but a single NULL
in the list.
In
MySQL
,NOT EXISTS
is a little bit less efficientIn
SQL Server
,LEFT JOIN / IS NULL
is less efficientIn
PostgreSQL
,NOT IN
is less efficientIn
Oracle
, all three methods are the same.
not exists' and 'not in', which one has better performance?
Use NOT EXISTS
.
The execution plans may be the same at the moment but if either column is altered in the future to allow NULLs the NOT IN version will need to do more work.
The most important thing to note about NOT EXISTS and NOT IN is that, unlike EXISTS and IN, they are not equivalent in all cases. Specifically, when NULLs are involved they will return different results. To be totally specific, when the subquery returns even one null, NOT IN will not match any rows.
Related Topics
Rails 4 Like Query - Activerecord Adds Quotes
What the Difference Between MySQL and MySQL2 Gem
Error When Trying to Install App With MySQL2 Gem
Correct MySQL Configuration For Ruby on Rails Database.Yml File
Rails 3 Query on Condition of an Association'S Count
How to Make a Row Generator in MySQL
What Is the Expected Behaviour For Multiple Set-Returning Functions in Select Clause
Group by Column and Multiple Rows into One Row Multiple Columns
Storing Json in Database Vs. Having a New Column For Each Key
SQL Server Insert If Not Exists
Query For Array Elements Inside Json Type
How to Do Insert into a Table Records Extracted from Another Table
Ordering by Specific Field Value First
How to Update Selected Rows With Values from a CSV File in Postgres