Not in VS Not Exists

NOT IN vs NOT EXISTS

I always default to NOT EXISTS.

The execution plans may be the same at the moment but if either column is altered in the future to allow NULLs the NOT IN version will need to do more work (even if no NULLs are actually present in the data) and the semantics of NOT IN if NULLs are present are unlikely to be the ones you want anyway.

When neither Products.ProductID or [Order Details].ProductID allow NULLs the NOT IN will be treated identically to the following query.

SELECT ProductID,
ProductName
FROM Products p
WHERE NOT EXISTS (SELECT *
FROM [Order Details] od
WHERE p.ProductId = od.ProductId)

The exact plan may vary but for my example data I get the following.

Neither NULL

A reasonably common misconception seems to be that correlated sub queries are always "bad" compared to joins. They certainly can be when they force a nested loops plan (sub query evaluated row by row) but this plan includes an anti semi join logical operator. Anti semi joins are not restricted to nested loops but can use hash or merge (as in this example) joins too.

/*Not valid syntax but better reflects the plan*/ 
SELECT p.ProductID,
p.ProductName
FROM Products p
LEFT ANTI SEMI JOIN [Order Details] od
ON p.ProductId = od.ProductId

If [Order Details].ProductID is NULL-able the query then becomes

SELECT ProductID,
ProductName
FROM Products p
WHERE NOT EXISTS (SELECT *
FROM [Order Details] od
WHERE p.ProductId = od.ProductId)
AND NOT EXISTS (SELECT *
FROM [Order Details]
WHERE ProductId IS NULL)

The reason for this is that the correct semantics if [Order Details] contains any NULL ProductIds is to return no results. See the extra anti semi join and row count spool to verify this that is added to the plan.

One NULL

If Products.ProductID is also changed to become NULL-able the query then becomes

SELECT ProductID,
ProductName
FROM Products p
WHERE NOT EXISTS (SELECT *
FROM [Order Details] od
WHERE p.ProductId = od.ProductId)
AND NOT EXISTS (SELECT *
FROM [Order Details]
WHERE ProductId IS NULL)
AND NOT EXISTS (SELECT *
FROM (SELECT TOP 1 *
FROM [Order Details]) S
WHERE p.ProductID IS NULL)

The reason for that one is because a NULL Products.ProductId should not be returned in the results except if the NOT IN sub query were to return no results at all (i.e. the [Order Details] table is empty). In which case it should. In the plan for my sample data this is implemented by adding another anti semi join as below.

Both NULL

The effect of this is shown in the blog post already linked by Buckley. In the example there the number of logical reads increase from around 400 to 500,000.

Additionally the fact that a single NULL can reduce the row count to zero makes cardinality estimation very difficult. If SQL Server assumes that this will happen but in fact there were no NULL rows in the data the rest of the execution plan may be catastrophically worse, if this is just part of a larger query, with inappropriate nested loops causing repeated execution of an expensive sub tree for example.

This is not the only possible execution plan for a NOT IN on a NULL-able column however. This article shows another one for a query against the AdventureWorks2008 database.

For the NOT IN on a NOT NULL column or the NOT EXISTS against either a nullable or non nullable column it gives the following plan.

Not EXists

When the column changes to NULL-able the NOT IN plan now looks like

Not In - Null

It adds an extra inner join operator to the plan. This apparatus is explained here. It is all there to convert the previous single correlated index seek on Sales.SalesOrderDetail.ProductID = <correlated_product_id> to two seeks per outer row. The additional one is on WHERE Sales.SalesOrderDetail.ProductID IS NULL.

As this is under an anti semi join if that one returns any rows the second seek will not occur. However if Sales.SalesOrderDetail does not contain any NULL ProductIDs it will double the number of seek operations required.

Not Exists vs Not In: efficiency

try

SELECT DISTINCT a.SFAccountID, a.SLXID, a.Name 
FROM [dbo].[Salesforce_Accounts] a WITH(NOLOCK)
JOIN _SLX_AccountChannel b WITH(NOLOCK)
ON a.SLXID = b.ACCOUNTID
AND b.STATUS IN ('Active','Customer', 'Current')
JOIN [dbo].[Salesforce_Contacts] c WITH(NOLOCK)
ON a.SFAccountID = c.SFAccountID
AND c.Primary__C = 0
LEFT JOIN [dbo].[Salesforce_Contacts] c2 WITH(NOLOCK)
on c2.SFAccountID = a.SFAccountID
AND c2.Primary__c = 1
WHERE c2.SFAccountID is null

NOT EXISTS vs NOT IN

As you have rightly said the two are different things. If the subquery of items to not be IN contains NULL no results will be returned because nothing equals NULL and nothing does not equal NULL (Not even NULL).

Assuming you are using the two to achieve the same result, there is no difference between the two as long as you handle NULL values in your IN statement. The optimiser is clever enough to know that with NULL values eliminated, or with non nullable columns the two are the same, so use the same ANTI SEMI JOIN.

Consider these two tables:

CREATE TABLE T (ID INT NOT NULL PRIMARY KEY);
CREATE TABLE T2 (ID INT NOT NULL PRIMARY KEY);

These two queries get exactly the same execution plan:

SELECT  *
FROM T
WHERE ID NOT IN (SELECT ID FROM T2);

SELECT *
FROM T
WHERE NOT EXISTS (SELECT ID FROM T2 WHERE T.ID = T2.ID);

because the optimiser knows T2.ID is a non nullable column. With a third table:

CREATE TABLE T3 (ID INT);

where the ID column is neither indexed or nullable these two queries render very different execution plans:

SELECT  *
FROM T
WHERE ID NOT IN (SELECT ID FROM T3);

SELECT *
FROM T
WHERE NOT EXISTS (SELECT ID FROM T3 WHERE T.ID = T3.ID);

and NOT EXISTS will be much more efficient. However these two again yield (essentially) the same execution plan:

SELECT  *
FROM T
WHERE ID NOT IN (SELECT ID FROM T3 WHERE T3.ID IS NOT NULL);

SELECT *
FROM T
WHERE NOT EXISTS (SELECT ID FROM T3 WHERE T.ID = T3.ID);

All these queries and sample data are on SQL Fiddle

EDIT

To actually answer your question:

Case 1 will be the same performance with NOT IN or NOT EXISTS if tracked_session_id is a non nullable column in data.conversions, or you add WHERE tracked_Session_id IS NOT NULL inside the In statement. If the column is not nullable and you don't exclude null values the performance won't be the same, and assuming there are no nulls NOT EXISTS will perform better, if there are no nulls the result won't be the same so the performances are not comparable.

Case 2 actually surprised me with sample data, I had assumed that this would not be optimised into an ANTI SEMI JOIN, and had already written an answer saying as much, but just before saving the edit I thought I'd better check, and was surprised to see that this:

SELECT  *
FROM T
WHERE ( SELECT COUNT(*)
FROM T3
WHERE T.ID = T3.ID
) = 0;

Is optimised exactly the same as NOT EXISTS. So it appears the optimiser is even more clever than I thought, it will only generate a different plan if you want the count to be something other than 0.

SQL Fiddle for Case 2

What's the difference between 'not in' and 'not exists'?

I think it serves the same purpose.

not in can also take literal values whereas not exists need a query to compare the results with.

EDIT: not exists could be good to use because it can join with the outer query & can lead to usage of index, if the criteria uses column that is indexed.

EDIT2: See this question as well.

EDIT3: Let me take the above things back.

See this link. I think, it all depends on how the DB translates this & on database/indexes etc.

Which is faster - NOT IN or NOT EXISTS?

Usually it does not matter if NOT IN is slower / faster than NOT EXISTS, because they are NOT equivalent in presence of NULL. Read:

NOT IN vs NOT EXISTS

In these cases you almost always want NOT EXISTS, because it has the usually expected behaviour.

If they are equivalent, it is likely that your database already has figured that out and will generate the same execution plan for both.

In the few cases where both options are aquivalent and your database is not able to figure that out, it is better to analyze both execution plans and choose the best options for your specific case.

What's the difference between NOT EXISTS vs. NOT IN vs. LEFT JOIN WHERE IS NULL?

  • NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: SQL Server

  • NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: PostgreSQL

  • NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: Oracle

  • NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: MySQL

In a nutshell:

NOT IN is a little bit different: it never matches if there is but a single NULL in the list.

  • In MySQL, NOT EXISTS is a little bit less efficient

  • In SQL Server, LEFT JOIN / IS NULL is less efficient

  • In PostgreSQL, NOT IN is less efficient

  • In Oracle, all three methods are the same.

not exists' and 'not in', which one has better performance?

Use NOT EXISTS.

The execution plans may be the same at the moment but if either column is altered in the future to allow NULLs the NOT IN version will need to do more work.

The most important thing to note about NOT EXISTS and NOT IN is that, unlike EXISTS and IN, they are not equivalent in all cases. Specifically, when NULLs are involved they will return different results. To be totally specific, when the subquery returns even one null, NOT IN will not match any rows.



Related Topics



Leave a reply



Submit