Is there a way to rewrite EXCEPT statements into NOT IN statements in SQL?
This might help you understand the difference between EXCEPT and NOT IN.
The EXCEPT operator returns all distinct rows from the left-hand query that do not exist in the right-hand query.
NOT IN, on the other hand, returns all rows from the left-hand table whose value is not present in the right-hand set, but it does not remove duplicate rows from the result.
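The duplicate-handling difference described above can be checked with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for SQL Server, and the tables lhs and rhs are hypothetical names invented for the illustration. On NOT NULL columns, adding DISTINCT to the NOT IN query is one way to get the same rows as EXCEPT.

```python
import sqlite3

# Assumption: SQLite stands in for SQL Server; lhs/rhs are made-up tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lhs (id INTEGER NOT NULL);
    CREATE TABLE rhs (id INTEGER NOT NULL);
    INSERT INTO lhs VALUES (1), (1), (2), (3);
    INSERT INTO rhs VALUES (3);
""")

# EXCEPT collapses the duplicate 1 on the left-hand side.
except_rows = conn.execute(
    "SELECT id FROM lhs EXCEPT SELECT id FROM rhs ORDER BY id"
).fetchall()

# NOT IN keeps both copies of 1.
not_in_rows = conn.execute(
    "SELECT id FROM lhs WHERE id NOT IN (SELECT id FROM rhs) ORDER BY id"
).fetchall()

# DISTINCT brings NOT IN in line with EXCEPT when no NULLs are involved.
distinct_not_in_rows = conn.execute(
    "SELECT DISTINCT id FROM lhs WHERE id NOT IN (SELECT id FROM rhs) ORDER BY id"
).fetchall()

print(except_rows)           # [(1,), (2,)]
print(not_in_rows)           # [(1,), (1,), (2,)]
print(distinct_not_in_rows)  # [(1,), (2,)]
```

This equivalence only holds while neither column contains NULLs: EXCEPT treats two NULLs as equal, whereas NOT IN does not.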
NOT IN vs NOT EXISTS
I always default to NOT EXISTS.
The execution plans may be the same at the moment but if either column is altered in the future to allow NULLs the NOT IN version will need to do more work (even if no NULLs are actually present in the data) and the semantics of NOT IN if NULLs are present are unlikely to be the ones you want anyway.
When neither Products.ProductID nor [Order Details].ProductID allows NULLs the NOT IN will be treated identically to the following query.
SELECT ProductID,
       ProductName
FROM   Products p
WHERE  NOT EXISTS (SELECT *
                   FROM   [Order Details] od
                   WHERE  p.ProductId = od.ProductId)
The exact plan may vary but for my example data I get the following.
A reasonably common misconception seems to be that correlated sub queries are always "bad" compared to joins. They certainly can be when they force a nested loops plan (sub query evaluated row by row) but this plan includes an anti semi join logical operator. Anti semi joins are not restricted to nested loops but can use hash or merge (as in this example) joins too.
/*Not valid syntax but better reflects the plan*/
SELECT p.ProductID,
       p.ProductName
FROM   Products p
       LEFT ANTI SEMI JOIN [Order Details] od
         ON p.ProductId = od.ProductId
If [Order Details].ProductID is NULL-able the query then becomes
SELECT ProductID,
       ProductName
FROM   Products p
WHERE  NOT EXISTS (SELECT *
                   FROM   [Order Details] od
                   WHERE  p.ProductId = od.ProductId)
       AND NOT EXISTS (SELECT *
                       FROM   [Order Details]
                       WHERE  ProductId IS NULL)
The reason for this is that the correct semantics if [Order Details] contains any NULL ProductIds is to return no results. See the extra anti semi join and row count spool that are added to the plan to verify this.
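The "any NULL returns no results" behaviour is easy to reproduce. The following is a minimal sketch using Python's sqlite3 module as a stand-in for SQL Server; the table OrderDetails (no space) and the sample rows are assumptions made for the illustration, not the Northwind data.

```python
import sqlite3

# Assumption: SQLite stands in for SQL Server; OrderDetails and the rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Products (ProductID INTEGER NOT NULL, ProductName TEXT);
    CREATE TABLE OrderDetails (ProductID INTEGER NULL);
    INSERT INTO Products VALUES (1, 'Chai'), (2, 'Chang'), (3, 'Tofu');
    INSERT INTO OrderDetails VALUES (1);
""")

NOT_IN = """
    SELECT ProductID FROM Products
    WHERE ProductID NOT IN (SELECT ProductID FROM OrderDetails)
    ORDER BY ProductID
"""
PLAIN_NOT_EXISTS = """
    SELECT ProductID FROM Products p
    WHERE NOT EXISTS (SELECT * FROM OrderDetails od
                      WHERE p.ProductID = od.ProductID)
    ORDER BY ProductID
"""
# NOT EXISTS plus the extra "no NULL on the inner side" check.
NOT_EXISTS_PAIR = """
    SELECT ProductID FROM Products p
    WHERE NOT EXISTS (SELECT * FROM OrderDetails od
                      WHERE p.ProductID = od.ProductID)
      AND NOT EXISTS (SELECT * FROM OrderDetails
                      WHERE ProductID IS NULL)
    ORDER BY ProductID
"""

before = conn.execute(NOT_IN).fetchall()            # no NULLs yet
conn.execute("INSERT INTO OrderDetails VALUES (NULL)")
after_not_in = conn.execute(NOT_IN).fetchall()      # one NULL empties the result
after_pair = conn.execute(NOT_EXISTS_PAIR).fetchall()
after_plain = conn.execute(PLAIN_NOT_EXISTS).fetchall()

print(before)        # [(2,), (3,)]
print(after_not_in)  # []
print(after_pair)    # []
print(after_plain)   # [(2,), (3,)]
```

The plain NOT EXISTS keeps returning the unordered products, which is usually the result people expect.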
If Products.ProductID is also changed to become NULL-able the query then becomes
SELECT ProductID,
       ProductName
FROM   Products p
WHERE  NOT EXISTS (SELECT *
                   FROM   [Order Details] od
                   WHERE  p.ProductId = od.ProductId)
       AND NOT EXISTS (SELECT *
                       FROM   [Order Details]
                       WHERE  ProductId IS NULL)
       AND NOT EXISTS (SELECT *
                       FROM   (SELECT TOP 1 *
                               FROM   [Order Details]) S
                       WHERE  p.ProductID IS NULL)
The reason for that one is that a NULL Products.ProductId should not be returned in the results unless the NOT IN sub query returns no results at all (i.e. the [Order Details] table is empty), in which case it should. In the plan for my sample data this is implemented by adding another anti semi join as below.
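The handling of a NULL on the outer side can be sketched the same way. This example again uses Python's sqlite3 module as a stand-in for SQL Server, so TOP 1 is replaced by LIMIT 1; the helper compare, the table OrderDetails and the sample rows are hypothetical.

```python
import sqlite3

NOT_IN = """
    SELECT ProductName FROM Products
    WHERE ProductID NOT IN (SELECT ProductID FROM OrderDetails)
    ORDER BY ProductName
"""
# Three-part rewrite from the text, with LIMIT 1 substituted for TOP 1.
REWRITTEN = """
    SELECT ProductName FROM Products p
    WHERE NOT EXISTS (SELECT * FROM OrderDetails od
                      WHERE p.ProductID = od.ProductID)
      AND NOT EXISTS (SELECT * FROM OrderDetails
                      WHERE ProductID IS NULL)
      AND NOT EXISTS (SELECT * FROM (SELECT * FROM OrderDetails LIMIT 1) S
                      WHERE p.ProductID IS NULL)
    ORDER BY ProductName
"""

def compare(order_detail_ids):
    """Run both queries against a fresh database (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Products (ProductID INTEGER NULL, ProductName TEXT);
        CREATE TABLE OrderDetails (ProductID INTEGER NULL);
        INSERT INTO Products VALUES (1, 'Chai'), (2, 'Chang'), (NULL, 'Unkeyed');
    """)
    conn.executemany("INSERT INTO OrderDetails VALUES (?)",
                     [(i,) for i in order_detail_ids])
    not_in = conn.execute(NOT_IN).fetchall()
    rewritten = conn.execute(REWRITTEN).fetchall()
    conn.close()
    return not_in, rewritten

# Non-empty inner table: the NULL-keyed product is filtered out.
print(compare([1]))  # ([('Chang',)], [('Chang',)])
# Empty inner table: every product comes back, including the NULL-keyed one.
print(compare([]))
```

With an empty inner table both queries return all three products, which is the special case the third NOT EXISTS exists to preserve.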
The effect of this is shown in the blog post already linked by Buckley. In the example there the number of logical reads increases from around 400 to 500,000.
Additionally the fact that a single NULL can reduce the row count to zero makes cardinality estimation very difficult. If SQL Server assumes that this will happen but in fact there are no NULL rows in the data, the rest of the execution plan may be catastrophically worse if this is just part of a larger query, with inappropriate nested loops causing repeated execution of an expensive sub tree for example.
This is not the only possible execution plan for a NOT IN on a NULL-able column however. This article shows another one for a query against the AdventureWorks2008 database.
For the NOT IN on a NOT NULL column or the NOT EXISTS against either a nullable or non nullable column it gives the following plan.
When the column changes to NULL-able the NOT IN plan now looks like
It adds an extra inner join operator to the plan. This apparatus is explained here. It is all there to convert the previous single correlated index seek on Sales.SalesOrderDetail.ProductID = <correlated_product_id> to two seeks per outer row. The additional one is on WHERE Sales.SalesOrderDetail.ProductID IS NULL.
As this is under an anti semi join if that one returns any rows the second seek will not occur. However if Sales.SalesOrderDetail does not contain any NULL ProductIDs it will double the number of seek operations required.
Alternate to 'Except' in SQL with performance
Is the combination (TrnId,flgStatus) unique?
Then you might switch to EXCEPT ALL, similar to UNION ALL which might be more efficient than UNION because it avoids the DISTINCT operation.
Another solution which accesses the base table only once:
SELECT TrnId
FROM TableA
WHERE flgStatus IN (0, 3)
GROUP BY TrnId
HAVING MIN(flgStatus) = 3
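To see why the single-pass GROUP BY form can replace an EXCEPT, here is a small sketch run through Python's sqlite3 module. The original question's query is not shown here, so the EXCEPT below (TrnIds with status 3 minus TrnIds with status 0) is an assumed reconstruction, and the sample rows are invented.

```python
import sqlite3

# Assumption: SQLite stands in for the real database; rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TableA (TrnId INTEGER NOT NULL, flgStatus INTEGER NOT NULL);
    INSERT INTO TableA VALUES
        (10, 0), (10, 3),   -- has both statuses: excluded
        (20, 3),            -- only status 3: kept
        (30, 0),            -- only status 0: excluded
        (40, 3), (40, 5);   -- status 3 plus an unrelated status: kept
""")

# Assumed shape of the original query: reads TableA twice.
except_ids = conn.execute("""
    SELECT TrnId FROM TableA WHERE flgStatus = 3
    EXCEPT
    SELECT TrnId FROM TableA WHERE flgStatus = 0
    ORDER BY TrnId
""").fetchall()

# Single pass: among statuses 0 and 3, a minimum of 3 means no 0 exists.
group_ids = conn.execute("""
    SELECT TrnId
    FROM TableA
    WHERE flgStatus IN (0, 3)
    GROUP BY TrnId
    HAVING MIN(flgStatus) = 3
    ORDER BY TrnId
""").fetchall()

print(except_ids)  # [(20,), (40,)]
print(group_ids)   # [(20,), (40,)]
```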
Alternative for Except All in SQL Server
EXCEPT ALL is not supported by SQL Server. With the tables
t1
a | b
--+--
1 | 1
1 | 1
1 | 1
1 | 2
1 | 2
1 | 3
and
t2
a | b
--+--
1 | 1
1 | 2
1 | 4
the query
select a, b from t1
except all
select a, b from t2
order by a, b;
would return
a | b
--+--
1 | 1
1 | 1
1 | 2
1 | 3
because t1 contains two more (1|1) rows, one more (1|2) row and one more (1|3) row than t2.
To achieve the same in SQL Server, number the rows:
select a, b from
(
select a, b, row_number() over (partition by a, b order by a) as rn from t1
except
select a, b, row_number() over (partition by a, b order by a) as rn from t2
) evaluated
order by a, b;
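The row-numbering emulation can be tried out on the sample tables above. This is a sketch using Python's sqlite3 module rather than SQL Server; it assumes the bundled SQLite is version 3.25 or later, since that is when window functions such as row_number() became available.

```python
import sqlite3

# Assumption: SQLite (3.25+) stands in for SQL Server.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (a INTEGER, b INTEGER);
    CREATE TABLE t2 (a INTEGER, b INTEGER);
    INSERT INTO t1 VALUES (1, 1), (1, 1), (1, 1), (1, 2), (1, 2), (1, 3);
    INSERT INTO t2 VALUES (1, 1), (1, 2), (1, 4);
""")

# Numbering the duplicates makes every row distinct, so a plain EXCEPT
# removes exactly one t1 copy per matching t2 row.
rows = conn.execute("""
    select a, b from
    (
      select a, b, row_number() over (partition by a, b order by a) as rn from t1
      except
      select a, b, row_number() over (partition by a, b order by a) as rn from t2
    ) evaluated
    order by a, b
""").fetchall()

print(rows)  # [(1, 1), (1, 1), (1, 2), (1, 3)]
```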
Transact SQL using EXCEPT vs INTERSECT
Checking for rows to exist across multiple keys works much better with a WHERE NOT EXISTS correlated subquery:
SELECT *
FROM Table1 T1
WHERE NOT EXISTS (
    SELECT 1
    FROM Table2 T2
    WHERE T2.FIRST_NAME = T1.FIRST_NAME
      AND T2.LAST_NAME = T1.LAST_NAME
      AND T2.DATE_OF_BIRTH = T1.DATE_OF_BIRTH
)
If your database is actually configured to use case-sensitive collation, you should use the COLLATE option to enforce case-insensitive comparisons. It's significantly more efficient. There should be an equivalent case-insensitive collation whatever your configuration.
SELECT *
FROM Table1 T1
WHERE NOT EXISTS (
    SELECT 1
    FROM Table2 T2
    WHERE T2.FIRST_NAME = T1.FIRST_NAME COLLATE SQL_Latin1_General_CP1_CI_AS
      AND T2.LAST_NAME = T1.LAST_NAME COLLATE SQL_Latin1_General_CP1_CI_AS
      AND T2.DATE_OF_BIRTH = T1.DATE_OF_BIRTH
)
If you have an index on Table1 (FIRST_NAME, LAST_NAME, DATE_OF_BIRTH) and Table2 (FIRST_NAME, LAST_NAME, DATE_OF_BIRTH), you should have even better performance.
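The effect of the COLLATE clause on a multi-key NOT EXISTS can be illustrated with a rough sketch in Python's sqlite3 module. SQLite has no SQL_Latin1_General_CP1_CI_AS collation, so its built-in NOCASE collation is substituted as an assumption, and the sample names are invented.

```python
import sqlite3

# Assumption: SQLite stands in for SQL Server; NOCASE replaces the CI_AS collation.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (FIRST_NAME TEXT, LAST_NAME TEXT, DATE_OF_BIRTH TEXT);
    CREATE TABLE Table2 (FIRST_NAME TEXT, LAST_NAME TEXT, DATE_OF_BIRTH TEXT);
    INSERT INTO Table1 VALUES
        ('Ann', 'Lee', '1980-01-01'),
        ('Bob', 'Ray', '1975-05-05'),
        ('Cy',  'Poe', '1990-09-09');
    INSERT INTO Table2 VALUES
        ('ANN', 'LEE', '1980-01-01'),   -- same person, different case
        ('Bob', 'Ray', '1975-05-06');   -- same name, different birth date
""")

QUERY = """
    SELECT FIRST_NAME FROM Table1 T1
    WHERE NOT EXISTS (
        SELECT 1 FROM Table2 T2
        WHERE T2.FIRST_NAME = T1.FIRST_NAME {collate}
          AND T2.LAST_NAME = T1.LAST_NAME {collate}
          AND T2.DATE_OF_BIRTH = T1.DATE_OF_BIRTH
    )
    ORDER BY FIRST_NAME
"""

# SQLite's default comparison is case-sensitive, so 'ANN' does not match 'Ann'.
case_sensitive = conn.execute(QUERY.format(collate="")).fetchall()
# With a case-insensitive collation the Ann row finds its match and drops out.
case_insensitive = conn.execute(QUERY.format(collate="COLLATE NOCASE")).fetchall()

print(case_sensitive)    # [('Ann',), ('Bob',), ('Cy',)]
print(case_insensitive)  # [('Bob',), ('Cy',)]
```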
Is Except operator computational expensive
In this case, you should be using NOT EXISTS. It is one of the most performant operations in SQL Server:
SELECT *
FROM big_table b
WHERE NOT EXISTS (
    SELECT 1
    FROM small_table s
    WHERE s.id = b.id)