SQL: When it comes to NOT IN and NOT EQUAL TO, which is more efficient and why?
In PostgreSQL there's usually a fairly small difference at reasonable list lengths, though IN is much cleaner conceptually. Very long AND ... <> ... lists and very long NOT IN lists both perform terribly, with AND much worse than NOT IN.
In both cases, if the lists are long enough for you to even be asking the question, you should be doing an anti-join or subquery exclusion test over a value list instead.
WITH excluded(item) AS (
VALUES('item1'), ('item2'), ('item3'), ('item4'),('item5')
)
SELECT *
FROM thetable t
WHERE NOT EXISTS(SELECT 1 FROM excluded e WHERE t.item = e.item);
or:
WITH excluded(item) AS (
VALUES('item1'), ('item2'), ('item3'), ('item4'),('item5')
)
SELECT *
FROM thetable t
LEFT OUTER JOIN excluded e ON (t.item = e.item)
WHERE e.item IS NULL;
(On modern Pg versions both will produce the same query plan anyway).
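The equivalence of the two formulations above can be checked directly. This is a minimal sketch using Python's built-in sqlite3 as a stand-in for PostgreSQL (the table and column names follow the examples above; the item values are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE thetable(item TEXT)")
conn.executemany("INSERT INTO thetable VALUES (?)",
                 [(f"item{i}",) for i in range(1, 9)])

# Anti-join expressed as NOT EXISTS over a VALUES-based CTE.
not_exists = conn.execute("""
    WITH excluded(item) AS (
        VALUES ('item1'), ('item2'), ('item3'), ('item4'), ('item5')
    )
    SELECT item FROM thetable t
    WHERE NOT EXISTS (SELECT 1 FROM excluded e WHERE t.item = e.item)
    ORDER BY item
""").fetchall()

# Same anti-join expressed as LEFT OUTER JOIN ... WHERE ... IS NULL.
left_join = conn.execute("""
    WITH excluded(item) AS (
        VALUES ('item1'), ('item2'), ('item3'), ('item4'), ('item5')
    )
    SELECT t.item FROM thetable t
    LEFT OUTER JOIN excluded e ON t.item = e.item
    WHERE e.item IS NULL
    ORDER BY t.item
""").fetchall()

print(not_exists)               # [('item6',), ('item7',), ('item8',)]
print(not_exists == left_join)  # True
```

Both queries exclude exactly the listed values and return identical rows.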
If the value list is long enough (many tens of thousands of items) then query parsing may start having a significant cost. At this point you should consider creating a TEMPORARY table, COPYing the data to exclude into it, possibly creating an index on it, then using one of the above approaches on the temp table instead of the CTE.
Demo:
CREATE UNLOGGED TABLE exclude_test(id integer primary key);
INSERT INTO exclude_test(id) SELECT generate_series(1,50000);
CREATE TABLE exclude AS SELECT x AS item FROM generate_series(1,40000,4) x;
where exclude is the list of values to omit.
I then compared the following approaches on the same data, with all results in milliseconds:
- NOT IN list: 3424.596
- AND ... list: 80173.823
- VALUES-based JOIN exclusion: 20.727
- VALUES-based subquery exclusion: 20.495
- Table-based JOIN, no index on ex-list: 25.183
- Subquery table-based, no index on ex-list: 23.985
... making the CTE-based approach over three thousand times faster than the AND list and 130 times faster than the NOT IN list.
Code here: https://gist.github.com/ringerc/5755247 (shield your eyes, ye who follow this link).
For this data set size adding an index on the exclusion list made no difference.
Notes:
- IN list generated with: SELECT 'IN (' || string_agg(item::text, ',' ORDER BY item) || ')' FROM exclude;
- AND list generated with: SELECT string_agg(item::text, ' AND item <> ') FROM exclude;
- Subquery- and join-based table exclusion were much the same across repeated runs.
- Examination of the plan shows that Pg translates NOT IN to <> ALL.
So... you can see that there's a truly huge gap between both NOT IN and AND lists vs doing a proper join. What surprised me was how fast doing it with a CTE over a VALUES list was: parsing the VALUES list took almost no time at all, performing the same as or slightly faster than the table approach in most tests.
It'd be nice if PostgreSQL could automatically recognise a preposterously long IN clause or chain of similar AND conditions and switch to a smarter approach, like doing a hashed join or implicitly turning it into a CTE node. Right now it doesn't know how to do that.
See also:
- this handy blog post Magnus Hagander wrote on the topic
Is it better to use equal or not equal when making a query?
Performance is one reason to use =. There are two components to this. First is indexing: = comparisons provide for more powerful indexing capabilities. The second is partitioning: although unlikely on a column that takes just a handful of values, = is better for resolving partitions.
Another reason is semantic. The presence of NULLs can be confusing. Consider the two comparisons:
where col = 'x'
where col <> 'x'
Both of these where clauses filter out rows where col is NULL. This makes total sense with the =. However, even after you know the rules, it is a bit confusing with the <>. Intuitively, we think "NULL is not equal to 'x', so it should be true". In fact, NULL means an unknown value, and an unknown value could be equal to 'x', so the comparison could be true; in fact, it returns NULL, which is filtered out.
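This NULL-filtering behaviour is easy to demonstrate. A minimal sketch using Python's sqlite3 (the three-valued logic here is standard SQL, not specific to any one database; table and column names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t(col TEXT)")
conn.executemany("INSERT INTO t VALUES (?)", [("x",), ("y",), (None,)])

eq  = conn.execute("SELECT col FROM t WHERE col = 'x'").fetchall()
neq = conn.execute("SELECT col FROM t WHERE col <> 'x'").fetchall()

# Both comparisons evaluate to NULL for the NULL row, so that row is
# filtered out in both cases -- it never appears in either result.
print(eq)   # [('x',)]
print(neq)  # [('y',)]
```

Note the NULL row appears in neither result: `<>` does not act as the complement of `=` over NULLs.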
Difference between NOT IN and equals vs. IN and not equals
NOT IN is going to give you the wrong results if id is nullable (which I hope it is not, otherwise it has a terrible name).
Why would you choose IN over EXISTS when it has been proven time and time again that EXISTS is more efficient (or at least no less efficient), since it can short-circuit? IN has to materialize the entire set.
SELECT * -- stop doing this
FROM dbo.usagerecords AS UR
WHERE EXISTS
(
SELECT 1 FROM dbo.pipelinerate AS pr
WHERE pr.id = ur.usagerateid
AND pr.name <> 'No Usage'
);
You can also express your other query like this:
SELECT * -- again, stop doing this
FROM dbo.usagerecords AS UR
WHERE NOT EXISTS
(
SELECT 1 FROM dbo.pipelinerate AS pr
WHERE pr.id = ur.usagerateid
AND pr.name = 'No Usage'
);
But I have no idea which, if either, gets the correct results. This is why we typically ask for sample data and desired results.
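The two rewrites above really can disagree, which is why the sample data matters. Here is a sketch with invented rows (sqlite3 standing in; a usage record whose usagerateid matches no pipelinerate row is the interesting case):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pipelinerate(id INTEGER, name TEXT)")
conn.execute("CREATE TABLE usagerecords(usagerateid INTEGER)")
conn.executemany("INSERT INTO pipelinerate VALUES (?, ?)",
                 [(1, "Standard"), (2, "No Usage")])
# Record 3 references a rate id with no row in pipelinerate at all.
conn.executemany("INSERT INTO usagerecords VALUES (?)", [(1,), (2,), (3,)])

exists_neq = conn.execute("""
    SELECT usagerateid FROM usagerecords ur
    WHERE EXISTS (SELECT 1 FROM pipelinerate pr
                  WHERE pr.id = ur.usagerateid AND pr.name <> 'No Usage')
    ORDER BY usagerateid
""").fetchall()

not_exists_eq = conn.execute("""
    SELECT usagerateid FROM usagerecords ur
    WHERE NOT EXISTS (SELECT 1 FROM pipelinerate pr
                      WHERE pr.id = ur.usagerateid AND pr.name = 'No Usage')
    ORDER BY usagerateid
""").fetchall()

# The orphaned record (3) fails the EXISTS test but passes NOT EXISTS:
print(exists_neq)     # [(1,)]
print(not_exists_eq)  # [(1,), (3,)]
```

So the EXISTS form requires a matching non-"No Usage" rate, while the NOT EXISTS form merely requires the absence of a "No Usage" rate; orphaned records pass the latter only.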
Your use of SELECT * is likely to have a greater negative impact on performance than whether you use IN or EXISTS. FWIW.
SQL Performance With Not Equals before Equals and vice versa
Query optimization has very little to do with the syntax of your query and a lot to do with the RDBMS query optimizer.
All of the things you suggest will probably make no difference whatsoever, as the optimizer will pull them apart and build what it feels is the best query. Specifically:
- Doesn't matter.
- Doesn't matter.
- No performance impact, but note that COUNT(id) <> COUNT(*) if there are NULLs in the id column; for a primary key there won't be any NULLs.
- I can't see how you could build this query with an IN, but in any event it will not impact performance.
- Indexes impact speed dramatically: for this query, indexes on recipientId, recipientView and sourceUserId will have dramatic impacts.
What you should do is not take my word for it. Set up each of the queries and look at the execution plan from the RDBMS. If they are the same there, then they are the same query.
SQL - IN vs. NOT IN
When it comes to performance you should always profile your code (i.e. run your queries a few thousand times and measure each loop's performance using some kind of stopwatch).
But here I highly recommend using the first query for better maintainability. The logic is that you need all records except 9 and 10. If you add value 11 to your table and use the second query, the logic of your application will be broken, which will of course lead to a bug.
Edit: I remember this was tagged as php, which is why I provided a sample in php, but I might be mistaken. I guess it won't be hard to rewrite that sample in the language you're using.
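The maintenance point above can be sketched concretely. Assuming a table of ids 1-10 where the intent is "all records except 9 and 10" (invented data, sqlite3 standing in for any database):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t(id INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(1, 11)])

# Two queries that are equivalent today:
q_not_in = "SELECT id FROM t WHERE id NOT IN (9, 10)"        # states the intent
q_in     = "SELECT id FROM t WHERE id IN (1,2,3,4,5,6,7,8)"  # enumerates a snapshot

assert conn.execute(q_not_in).fetchall() == conn.execute(q_in).fetchall()

# A new row appears; only the NOT IN version still means "all but 9 and 10".
conn.execute("INSERT INTO t VALUES (11)")
print((11,) in conn.execute(q_not_in).fetchall())  # True
print((11,) in conn.execute(q_in).fetchall())      # False -- silently stale
```

The enumerated IN list silently drops the new row, which is exactly the future bug the answer warns about.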
Minus or not equal to? Which is better?
The first one is better, since it involves only a single scan and does not contain any INs or NOT INs. Go for the first one.
Should I use != or for not equal in T-SQL?
Technically they function the same if you’re using SQL Server AKA T-SQL. If you're using it in stored procedures there is no performance reason to use one over the other. It then comes down to personal preference. I prefer to use <> as it is ANSI compliant.
You can find links to the various ANSI standards at...
http://en.wikipedia.org/wiki/SQL
Performance differences between equal (=) and IN with one literal value
There is no difference between those two statements, and the optimiser will transform the IN to an = when IN has just one element in it.
Though when you have a question like this, just run both statements, run their execution plan and see the differences. Here - you won't find any.
After a big search online, I found a document on SQL to support this (I assume it applies to all DBMS):
If there is only one value inside the parentheses, this commend [sic] is equivalent to:
WHERE "column_name" = 'value1'
Here is the execution plan of both queries in Oracle (most DBMS will process this the same):
EXPLAIN PLAN FOR
select * from dim_employees t
where t.identity_number = '123456789'
Plan hash value: 2312174735
-----------------------------------------------------
| Id | Operation | Name |
-----------------------------------------------------
| 0 | SELECT STATEMENT | |
| 1 | TABLE ACCESS BY INDEX ROWID| DIM_EMPLOYEES |
| 2 | INDEX UNIQUE SCAN | SYS_C0029838 |
-----------------------------------------------------
And for IN():
EXPLAIN PLAN FOR
select * from dim_employees t
where t.identity_number in('123456789');
Plan hash value: 2312174735
-----------------------------------------------------
| Id | Operation | Name |
-----------------------------------------------------
| 0 | SELECT STATEMENT | |
| 1 | TABLE ACCESS BY INDEX ROWID| DIM_EMPLOYEES |
| 2 | INDEX UNIQUE SCAN | SYS_C0029838 |
-----------------------------------------------------
As you can see, both are identical. This is on an indexed column. Same goes for an unindexed column (just full table scan).
NOT IN vs NOT EXISTS
I always default to NOT EXISTS.
The execution plans may be the same at the moment, but if either column is altered in the future to allow NULLs, the NOT IN version will need to do more work (even if no NULLs are actually present in the data), and the semantics of NOT IN if NULLs are present are unlikely to be the ones you want anyway.
When neither Products.ProductID nor [Order Details].ProductID allows NULLs, the NOT IN will be treated identically to the following query.
SELECT ProductID,
ProductName
FROM Products p
WHERE NOT EXISTS (SELECT *
FROM [Order Details] od
WHERE p.ProductId = od.ProductId)
The exact plan may vary but for my example data I get the following.
A reasonably common misconception seems to be that correlated subqueries are always "bad" compared to joins. They certainly can be when they force a nested loops plan (subquery evaluated row by row), but this plan includes an anti semi join logical operator. Anti semi joins are not restricted to nested loops but can use hash or merge (as in this example) joins too.
/*Not valid syntax but better reflects the plan*/
SELECT p.ProductID,
p.ProductName
FROM Products p
LEFT ANTI SEMI JOIN [Order Details] od
ON p.ProductId = od.ProductId
If [Order Details].ProductID is NULL-able, the query then becomes
SELECT ProductID,
ProductName
FROM Products p
WHERE NOT EXISTS (SELECT *
FROM [Order Details] od
WHERE p.ProductId = od.ProductId)
AND NOT EXISTS (SELECT *
FROM [Order Details]
WHERE ProductId IS NULL)
The reason for this is that the correct semantics, if [Order Details] contains any NULL ProductIds, is to return no results. See the extra anti semi join and row count spool that is added to the plan to verify this.
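This "a single NULL empties the result" behaviour is the crux of the NOT IN pitfall, and it can be shown in a few lines. A sketch with invented tables (sqlite3 standing in; the NULL semantics are standard SQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products(productid INTEGER)")
conn.execute("CREATE TABLE order_details(productid INTEGER)")
conn.executemany("INSERT INTO products VALUES (?)", [(1,), (2,), (3,)])
# One NULL in the subquery's column is enough to poison NOT IN.
conn.executemany("INSERT INTO order_details VALUES (?)", [(1,), (None,)])

not_in = conn.execute("""
    SELECT productid FROM products
    WHERE productid NOT IN (SELECT productid FROM order_details)
""").fetchall()

not_exists = conn.execute("""
    SELECT productid FROM products p
    WHERE NOT EXISTS (SELECT 1 FROM order_details od
                      WHERE od.productid = p.productid)
    ORDER BY productid
""").fetchall()

# 2 NOT IN (1, NULL) evaluates to UNKNOWN, never TRUE, so no row passes.
print(not_in)      # []
print(not_exists)  # [(2,), (3,)]
```

NOT EXISTS is unaffected because each correlated comparison against NULL simply finds no match, rather than contaminating the whole predicate.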
If Products.ProductID is also changed to become NULL-able, the query then becomes
SELECT ProductID,
ProductName
FROM Products p
WHERE NOT EXISTS (SELECT *
FROM [Order Details] od
WHERE p.ProductId = od.ProductId)
AND NOT EXISTS (SELECT *
FROM [Order Details]
WHERE ProductId IS NULL)
AND NOT EXISTS (SELECT *
FROM (SELECT TOP 1 *
FROM [Order Details]) S
WHERE p.ProductID IS NULL)
The reason for that one is that a NULL Products.ProductId should not be returned in the results except if the NOT IN subquery were to return no results at all (i.e. the [Order Details] table is empty), in which case it should. In the plan for my sample data this is implemented by adding another anti semi join as below.
The effect of this is shown in the blog post already linked by Buckley. In the example there the number of logical reads increases from around 400 to 500,000.
Additionally, the fact that a single NULL can reduce the row count to zero makes cardinality estimation very difficult. If SQL Server assumes that this will happen but in fact there were no NULL rows in the data, the rest of the execution plan may be catastrophically worse if this is just part of a larger query, with inappropriate nested loops causing repeated execution of an expensive subtree, for example.
This is not the only possible execution plan for a NOT IN on a NULL-able column, however. This article shows another one for a query against the AdventureWorks2008 database.
For the NOT IN on a NOT NULL column, or the NOT EXISTS against either a nullable or non-nullable column, it gives the following plan.
When the column changes to NULL-able, the NOT IN plan now looks like
It adds an extra inner join operator to the plan. This apparatus is explained here. It is all there to convert the previous single correlated index seek on Sales.SalesOrderDetail.ProductID = <correlated_product_id> into two seeks per outer row. The additional one is on WHERE Sales.SalesOrderDetail.ProductID IS NULL.
As this is under an anti semi join, if the first seek returns any rows the second seek will not occur. However, if Sales.SalesOrderDetail does not contain any NULL ProductIDs, it will double the number of seek operations required.
Postgres NOT IN performance
A huge IN list is very inefficient. PostgreSQL should ideally identify it and turn it into a relation that it does an anti-join on, but at this point the query planner doesn't know how to do that, and the planning time required to identify this case would cost every query that uses NOT IN sensibly, so it'd have to be a very low cost check. See this earlier, much more detailed answer on the topic.
As David Aldridge wrote, this is best solved by turning it into an anti-join. I'd write it as a join over a VALUES list simply because PostgreSQL is extremely fast at parsing VALUES lists into relations, but the effect is the same:
SELECT entityid
FROM entity e
LEFT JOIN level1entity l1 ON l1.level1id = e.level1_level1id
LEFT JOIN level2entity l2 ON l2.level2id = l1.level2_level2id
LEFT OUTER JOIN (
VALUES
(1377776),(1377792),(1377793),(1377794),(1377795),(1377796)
) ex(ex_entityid) ON (entityid = ex_entityid)
WHERE l2.userid = 'a987c246-65e5-48f6-9d2d-a7bcb6284c8f'
AND ex_entityid IS NULL;
For a sufficiently large set of values you might even be better off creating a temporary table, COPYing the values into it, creating a PRIMARY KEY on it, and joining on that.
More possibilities explored here:
https://stackoverflow.com/a/17038097/398670