SQL - IN vs. NOT IN
When it comes to performance, you should always profile your code (i.e. run your queries a few thousand times and measure each loop's performance using some kind of stopwatch; see the sample).
But here I highly recommend using the first query, because it will be easier to maintain. The logic is that you need all records except 9 and 10. If you add the value 11 to your table and use the second query, your application's logic will break, which will of course lead to a bug.
Edit: I remember this was tagged as php, which is why I provided the sample in PHP, but I might be mistaken. I guess it won't be hard to rewrite that sample in the language you're using.
<> vs NOT IN
SELECT something
FROM someTable
WHERE idcode NOT IN (SELECT ids FROM tmpIdTable)
checks against any value in the list.
However, NOT IN is not NULL-tolerant. If the sub-query returns a set of values that contains NULL, no records are returned at all. (This is because NOT IN is internally expanded to idcode <> 'foo' AND idcode <> 'bar' AND idcode <> NULL and so on, which can never succeed because any comparison to NULL yields UNKNOWN, preventing the whole expression from ever becoming TRUE.)
A nicer, NULL-tolerant variant would be this:
SELECT something
FROM someTable
WHERE NOT EXISTS (SELECT ids FROM tmpIdTable WHERE ids = someTable.idcode)
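To see the NULL behaviour described above in action, here is a minimal sketch using Python's built-in sqlite3 module. The table contents are made up for illustration, and SQLite is only a stand-in for whatever database you actually use, though the three-valued logic involved is standard SQL.

```python
import sqlite3

# In-memory database with hypothetical sample data; one NULL in tmpIdTable.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE someTable (idcode TEXT, something TEXT);
    CREATE TABLE tmpIdTable (ids TEXT);
    INSERT INTO someTable VALUES ('foo', 'first'), ('baz', 'second');
    INSERT INTO tmpIdTable VALUES ('foo'), (NULL);
""")

# NOT IN: the NULL in the sub-query makes every comparison FALSE or UNKNOWN.
not_in = conn.execute("""
    SELECT something FROM someTable
    WHERE idcode NOT IN (SELECT ids FROM tmpIdTable)
""").fetchall()

# NOT EXISTS: the NULL row simply never matches, so 'baz' survives.
not_exists = conn.execute("""
    SELECT something FROM someTable
    WHERE NOT EXISTS (SELECT ids FROM tmpIdTable WHERE ids = someTable.idcode)
""").fetchall()

print(not_in)      # []
print(not_exists)  # [('second',)]
```

Removing the NULL row from tmpIdTable makes both queries return the same single row.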
EDIT: I initially assumed that this:
SELECT something
FROM someTable
WHERE idcode <> (SELECT ids FROM tmpIdTable)
would check against the first value only. It turns out that this assumption is wrong, at least for SQL Server, where it actually triggers this error:
Msg 512, Level 16, State 1, Line 1
Subquery returned more than 1 value. This is not permitted when the subquery follows =, !=, <, <= , >, >= or when the subquery is used as an expression.
IN vs OR in the SQL WHERE clause
I assume you want to know the performance difference between the following:
WHERE foo IN ('a', 'b', 'c')
WHERE foo = 'a' OR foo = 'b' OR foo = 'c'
According to the manual for MySQL, if the values are constant, IN sorts the list and then uses a binary search. I would imagine that OR evaluates them one by one in no particular order. So IN is faster in some circumstances.
The best way to know is to profile both on your database with your specific data to see which is faster.
I tried both on a MySQL table with 1000000 rows. When the column is indexed there is no discernible difference in performance: both are nearly instant. When the column is not indexed I got these results:
SELECT COUNT(*) FROM t_inner WHERE val IN (1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000);
1 row fetched in 0.0032 (1.2679 seconds)
SELECT COUNT(*) FROM t_inner WHERE val = 1000 OR val = 2000 OR val = 3000 OR val = 4000 OR val = 5000 OR val = 6000 OR val = 7000 OR val = 8000 OR val = 9000;
1 row fetched in 0.0026 (1.7385 seconds)
So in this case the method using OR is about 37% slower (1.74 seconds versus 1.27 seconds). Adding more terms makes the difference larger. Results may vary on other databases and on other data.
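If you want to repeat this kind of measurement yourself, a small timing harness is enough. The sketch below uses Python's sqlite3 module with a smaller, made-up table (100,000 rows, no index), so the absolute numbers say nothing about MySQL; SQLite's optimizer may also treat the two forms more alike than MySQL does. The helper name timed is hypothetical.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t_inner (val INTEGER)")  # deliberately no index
conn.executemany("INSERT INTO t_inner VALUES (?)",
                 ((i,) for i in range(100_000)))

values = list(range(1000, 10000, 1000))  # 1000, 2000, ..., 9000
in_sql = ("SELECT COUNT(*) FROM t_inner WHERE val IN (%s)"
          % ", ".join(str(v) for v in values))
or_sql = ("SELECT COUNT(*) FROM t_inner WHERE "
          + " OR ".join("val = %d" % v for v in values))

def timed(sql, repeats=5):
    """Run sql several times; return (count, best wall-clock seconds)."""
    best = float("inf")
    count = None
    for _ in range(repeats):
        start = time.perf_counter()
        count = conn.execute(sql).fetchone()[0]
        best = min(best, time.perf_counter() - start)
    return count, best

in_count, in_time = timed(in_sql)
or_count, or_time = timed(or_sql)
print(f"IN: {in_count} rows, best of 5 = {in_time:.4f}s")
print(f"OR: {or_count} rows, best of 5 = {or_time:.4f}s")
```

Taking the best of several runs reduces noise from caching and other processes; both queries must of course return the same count for the comparison to mean anything.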
NOT vs <> Operator in sql server
NOT is a negation and <> is a comparison operator; both are ISO standard, and there is no performance difference between them for your example.
Is there any performance difference between IN and NOT IN Operators?
Like most performance questions, it depends.
If there are no indexes then they should be roughly comparable.
If you have an index on the limiting column then IN will likely be faster than NOT IN, as IN can use an index seek while NOT IN will require a table scan.
Of course, this depends on the data: if there are very few distinct values of col1 and it is indexed, then NOT IN could end up using an index seek rather than a table scan.
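One way to check which access path you actually get is to ask the database for its plan. The sketch below does this with SQLite's EXPLAIN QUERY PLAN through Python's sqlite3 module; the table t, its columns, and the index name are hypothetical, and SQLite's planner is only a stand-in for your own engine, where you would use its equivalent plan viewer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, col1 INTEGER, payload TEXT)")
conn.executemany("INSERT INTO t (col1, payload) VALUES (?, ?)",
                 ((i % 1000, "x") for i in range(10_000)))
conn.execute("CREATE INDEX idx_t_col1 ON t (col1)")

def plan(sql):
    """Return the human-readable plan steps for sql, joined into one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column is the detail text

in_plan = plan("SELECT payload FROM t WHERE col1 IN (1, 2, 3)")
not_in_plan = plan("SELECT payload FROM t WHERE col1 NOT IN (1, 2, 3)")

print("IN:    ", in_plan)      # typically a SEARCH using idx_t_col1
print("NOT IN:", not_in_plan)  # typically a full SCAN of t
```

The exact wording of the plan text differs between SQLite versions, so treat the output as a hint about seek versus scan rather than a stable format.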
NOT IN vs NOT EXISTS
I always default to NOT EXISTS.
The execution plans may be the same at the moment, but if either column is altered in the future to allow NULLs, the NOT IN version will need to do more work (even if no NULLs are actually present in the data), and the semantics of NOT IN when NULLs are present are unlikely to be the ones you want anyway.
When neither Products.ProductID nor [Order Details].ProductID allows NULLs, the NOT IN will be treated identically to the following query.
SELECT ProductID,
ProductName
FROM Products p
WHERE NOT EXISTS (SELECT *
FROM [Order Details] od
WHERE p.ProductId = od.ProductId)
The exact plan may vary but for my example data I get the following.
A reasonably common misconception seems to be that correlated sub queries are always "bad" compared to joins. They certainly can be when they force a nested loops plan (sub query evaluated row by row) but this plan includes an anti semi join logical operator. Anti semi joins are not restricted to nested loops but can use hash or merge (as in this example) joins too.
/*Not valid syntax but better reflects the plan*/
SELECT p.ProductID,
p.ProductName
FROM Products p
LEFT ANTI SEMI JOIN [Order Details] od
ON p.ProductId = od.ProductId
If [Order Details].ProductID is NULL-able the query then becomes
SELECT ProductID,
ProductName
FROM Products p
WHERE NOT EXISTS (SELECT *
FROM [Order Details] od
WHERE p.ProductId = od.ProductId)
AND NOT EXISTS (SELECT *
FROM [Order Details]
WHERE ProductId IS NULL)
The reason for this is that the correct semantics, if [Order Details] contains any NULL ProductIds, is to return no results. To verify this, see the extra anti semi join and row count spool that are added to the plan.
If Products.ProductID is also changed to become NULL-able the query then becomes
SELECT ProductID,
ProductName
FROM Products p
WHERE NOT EXISTS (SELECT *
FROM [Order Details] od
WHERE p.ProductId = od.ProductId)
AND NOT EXISTS (SELECT *
FROM [Order Details]
WHERE ProductId IS NULL)
AND NOT EXISTS (SELECT *
FROM (SELECT TOP 1 *
FROM [Order Details]) S
WHERE p.ProductID IS NULL)
The reason for that one is that a NULL Products.ProductId should not be returned in the results unless the NOT IN sub query returns no results at all (i.e. the [Order Details] table is empty), in which case it should. In the plan for my sample data this is implemented by adding another anti semi join as below.
The effect of this is shown in the blog post already linked by Buckley. In the example there, the number of logical reads increases from around 400 to 500,000.
Additionally, the fact that a single NULL can reduce the row count to zero makes cardinality estimation very difficult. If SQL Server assumes that this will happen but in fact there are no NULL rows in the data, the rest of the execution plan may be catastrophically worse if this is just part of a larger query, with inappropriate nested loops causing repeated execution of an expensive sub tree, for example.
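The NULL semantics that force these extra plan operators can be checked on a toy data set. The sketch below uses Python's sqlite3 module, with a hypothetical OrderDetails table standing in for [Order Details] and LIMIT 1 standing in for T-SQL's TOP 1; it compares NOT IN against the expanded triple NOT EXISTS form in the three situations discussed above. It says nothing about SQL Server's actual plans, only about the results both forms must return.

```python
import sqlite3

NOT_IN = """
    SELECT ProductID FROM Products
    WHERE ProductID NOT IN (SELECT ProductID FROM OrderDetails)
"""
EXPANDED = """
    SELECT ProductID FROM Products p
    WHERE NOT EXISTS (SELECT * FROM OrderDetails od
                      WHERE p.ProductID = od.ProductID)
      AND NOT EXISTS (SELECT * FROM OrderDetails WHERE ProductID IS NULL)
      AND NOT EXISTS (SELECT * FROM (SELECT * FROM OrderDetails LIMIT 1) s
                      WHERE p.ProductID IS NULL)
"""

def run(product_ids, order_ids):
    """Load both tables, then return (NOT IN rows, expanded rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Products (ProductID INTEGER)")
    conn.execute("CREATE TABLE OrderDetails (ProductID INTEGER)")
    conn.executemany("INSERT INTO Products VALUES (?)",
                     [(p,) for p in product_ids])
    conn.executemany("INSERT INTO OrderDetails VALUES (?)",
                     [(o,) for o in order_ids])
    # Sort by repr so rows containing None can be ordered alongside integers.
    not_in = sorted(conn.execute(NOT_IN).fetchall(), key=repr)
    expanded = sorted(conn.execute(EXPANDED).fetchall(), key=repr)
    conn.close()
    return not_in, expanded

products = [1, 2, None]
print(run(products, [1]))        # only product 2 qualifies
print(run(products, [1, None]))  # a NULL in the sub query: no rows at all
print(run(products, []))         # empty sub query: every row, including NULL
```

In each case the two queries agree, which is the point of the rewrite: the second and third NOT EXISTS clauses exist only to reproduce what NOT IN does when NULLs are possible.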
This is not the only possible execution plan for a NOT IN on a NULL-able column however. This article shows another one for a query against the AdventureWorks2008 database.
For the NOT IN on a NOT NULL column or the NOT EXISTS against either a nullable or non nullable column it gives the following plan.
When the column changes to NULL-able the NOT IN plan now looks like
It adds an extra inner join operator to the plan. This apparatus is explained here. It is all there to convert the previous single correlated index seek on Sales.SalesOrderDetail.ProductID = <correlated_product_id> to two seeks per outer row. The additional one is on WHERE Sales.SalesOrderDetail.ProductID IS NULL.
As this is under an anti semi join, if that one returns any rows the second seek will not occur. However, if Sales.SalesOrderDetail does not contain any NULL ProductIDs it will double the number of seek operations required.
What is the difference between NOT and != operators in SQL?
NOT negates the following condition, so it can be used with various operators. != is the non-standard alternative for the <> operator, which means "not equal".
e.g.
NOT (a LIKE 'foo%')
NOT ( (a,b) OVERLAPS (x,y) )
NOT (a BETWEEN x AND y)
NOT (a IS NULL)
Except for the OVERLAPS operator, the above could also be written as:
a NOT LIKE 'foo%'
a NOT BETWEEN x AND y
a IS NOT NULL
In some situations it might be easier to understand if you negate a complete expression rather than rewriting it to mean the opposite.
NOT can however be used with <>, but that wouldn't make much sense: NOT (a <> b) is the same as a = b. Similarly you could use NOT to negate the equality operator: NOT (a = b) is the same as a <> b.
Not Exists vs Not In: efficiency
Try:
SELECT DISTINCT a.SFAccountID, a.SLXID, a.Name
FROM [dbo].[Salesforce_Accounts] a WITH(NOLOCK)
JOIN _SLX_AccountChannel b WITH(NOLOCK)
ON a.SLXID = b.ACCOUNTID
AND b.STATUS IN ('Active','Customer', 'Current')
JOIN [dbo].[Salesforce_Contacts] c WITH(NOLOCK)
ON a.SFAccountID = c.SFAccountID
AND c.Primary__C = 0
LEFT JOIN [dbo].[Salesforce_Contacts] c2 WITH(NOLOCK)
on c2.SFAccountID = a.SFAccountID
AND c2.Primary__c = 1
WHERE c2.SFAccountID is null
SQL WHERE NOT...IN vs WHERE ... NOT IN
Using WHERE ... NOT IN is the form I've come across the most, and it is generally preferred in industry because it makes the statement more readable for others who aren't accustomed to seeing WHERE NOT ... IN.
MINUS vs NOT in where clause
Consider these two rows:
1, 2, a, b, x, y
1, 2, u, v, c, d
The MINUS operation will not return the pair (1, 2) but your query will. The c, d values may appear with the same 1, 2 but in a different row from the a, b values.
The fundamental distinction is that MINUS operates at the set level, while your NOT condition only works on one row at a time (the same row with the "required" values in the other columns).
Now: You CAN make your query a bit more efficient (although you can't avoid reading the base table twice). Use a NOT IN condition:
select field1, field2 from tab where field3 = 'a' and field4 = 'b'
and (field1, field2) not in
(select field1, field2 from tab where field5 = 'c' and field6 = 'd');
Note (see spencer7593's comment below): As in all cases when NULLs may be present, NOT IN is not a good solution. Rather, a NOT EXISTS condition should be used. I won't elaborate, since it seems out of scope for the question asked (which was why the "NOT" solution is different from the "MINUS" solution).
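The two sample rows above are enough to reproduce the difference. The sketch below uses Python's sqlite3 module, where EXCEPT plays the role of Oracle's MINUS. The row-level query is an assumption about what the original question looked like (a NOT applied to the field5/field6 condition on the same row), and the row-value NOT IN form needs SQLite 3.15 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tab (field1, field2, field3, field4, field5, field6)")
conn.executemany("INSERT INTO tab VALUES (?, ?, ?, ?, ?, ?)",
                 [(1, 2, "a", "b", "x", "y"),
                  (1, 2, "u", "v", "c", "d")])

# Assumed shape of the original query: NOT is evaluated per row.
row_level = conn.execute("""
    SELECT field1, field2 FROM tab
    WHERE field3 = 'a' AND field4 = 'b'
      AND NOT (field5 = 'c' AND field6 = 'd')
""").fetchall()

# Set-level difference; EXCEPT is SQLite's spelling of MINUS.
set_level = conn.execute("""
    SELECT field1, field2 FROM tab WHERE field3 = 'a' AND field4 = 'b'
    EXCEPT
    SELECT field1, field2 FROM tab WHERE field5 = 'c' AND field6 = 'd'
""").fetchall()

# The NOT IN rewrite from the answer, using a row value on the left side.
not_in = conn.execute("""
    SELECT field1, field2 FROM tab
    WHERE field3 = 'a' AND field4 = 'b'
      AND (field1, field2) NOT IN
          (SELECT field1, field2 FROM tab WHERE field5 = 'c' AND field6 = 'd')
""").fetchall()

print(row_level)  # [(1, 2)]
print(set_level)  # []
print(not_in)     # []
```

The row-level query keeps (1, 2) because the c, d values sit on a different row, while the set-level forms remove the pair wherever it occurs.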