PostgreSQL: NOT IN versus EXCEPT performance difference (edited #2)
Since you are running with the default configuration, try bumping up work_mem. Most likely, the subquery ends up getting spooled to disk because you only allow for 1MB of work memory. Try 10 or 20MB.
Is there any performance difference when selecting from subqueries?
No, there is no difference whatsoever.
You can easily find that out for yourself by looking at the execution plan generated using explain (analyze) select ...
Except for the aliases the plans should be identical.
PostgreSQL performance, using ILIKE with just two percentages versus not at all
Is there a performance difference if you run a SELECT statement with ILIKE '%%' versus without it at all?
The two queries:
select *
from some_table
where some_column ILIKE '%'
and
select *
from some_table
will return different results.
The first one is equivalent to where some_column is not null, so it will never return rows where some_column is null, but the second one will.
So it's not only about performance, but also about correctness.
Performance-wise they will most likely be identical, doing a Seq Scan in both cases.
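A quick way to convince yourself of the NULL behaviour is a throwaway script. This is only a sketch: it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, and since SQLite has no ILIKE it uses LIKE instead (an assumption of this example, not something from the answer). The three sample rows are made up for illustration.

```python
import sqlite3

# In-memory SQLite database as a stand-in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE some_table (id INTEGER PRIMARY KEY, some_column TEXT)")
conn.executemany(
    "INSERT INTO some_table (some_column) VALUES (?)",
    [("abc",), ("",), (None,)],  # a value, an empty string, and a NULL
)

# Pattern match against '%': NULL never matches, the empty string does.
with_pattern = conn.execute(
    "SELECT id FROM some_table WHERE some_column LIKE '%' ORDER BY id"
).fetchall()

# The explicit NOT NULL filter the answer says it is equivalent to.
not_null = conn.execute(
    "SELECT id FROM some_table WHERE some_column IS NOT NULL ORDER BY id"
).fetchall()

# No filter at all: the NULL row comes back too.
unfiltered = conn.execute("SELECT id FROM some_table ORDER BY id").fetchall()

print(with_pattern, not_null, unfiltered)
```

The first two result sets contain the same rows, while the unfiltered query also returns the row whose some_column is NULL.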
SQL: When it comes to NOT IN and NOT EQUAL TO, which is more efficient and why?
In PostgreSQL there's usually a fairly small difference at reasonable list lengths, though IN is much cleaner conceptually. Very long AND ... <> ... lists and very long NOT IN lists both perform terribly, with AND much worse than NOT IN.
In both cases, if they're long enough for you to even be asking the question you should be doing an anti-join or subquery exclusion test over a value list instead.
WITH excluded(item) AS (
VALUES('item1'), ('item2'), ('item3'), ('item4'),('item5')
)
SELECT *
FROM thetable t
WHERE NOT EXISTS(SELECT 1 FROM excluded e WHERE t.item = e.item);
or:
WITH excluded(item) AS (
VALUES('item1'), ('item2'), ('item3'), ('item4'),('item5')
)
SELECT *
FROM thetable t
LEFT OUTER JOIN excluded e ON (t.item = e.item)
WHERE e.item IS NULL;
(On modern Pg versions both will produce the same query plan anyway).
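If you want to check that the two formulations really return the same rows, something like the following works. It is a sketch only: it runs the two queries from above against an in-memory SQLite database via Python's sqlite3 module (a stand-in, not PostgreSQL), and the eight-row thetable is invented for the example. It says nothing about plans or timings.

```python
import sqlite3

# SQLite stand-in for PostgreSQL (assumption); thetable contents are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE thetable (item TEXT)")
conn.executemany(
    "INSERT INTO thetable (item) VALUES (?)",
    [(f"item{n}",) for n in range(1, 9)],  # item1 .. item8
)

# The exclusion list as a CTE over VALUES, shared by both queries.
EXCLUDED_CTE = """
WITH excluded(item) AS (
    VALUES ('item1'), ('item2'), ('item3'), ('item4'), ('item5')
)
"""

# Subquery exclusion test.
not_exists_rows = conn.execute(
    EXCLUDED_CTE
    + """
    SELECT t.item FROM thetable t
    WHERE NOT EXISTS (SELECT 1 FROM excluded e WHERE t.item = e.item)
    ORDER BY t.item
    """
).fetchall()

# Anti-join written as LEFT OUTER JOIN ... IS NULL.
left_join_rows = conn.execute(
    EXCLUDED_CTE
    + """
    SELECT t.item FROM thetable t
    LEFT OUTER JOIN excluded e ON (t.item = e.item)
    WHERE e.item IS NULL
    ORDER BY t.item
    """
).fetchall()

print(not_exists_rows)
print(left_join_rows)
```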
If the value list is long enough (many tens of thousands of items) then query parsing may start having a significant cost. At this point you should consider creating a TEMPORARY table, COPYing the data to exclude into it, possibly creating an index on it, then using one of the above approaches on the temp table instead of the CTE.
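The temp table variant looks roughly like this. Again a sketch under assumptions: SQLite through Python's sqlite3 stands in for PostgreSQL, executemany stands in for the COPY step (SQLite has no COPY), the PRIMARY KEY plays the role of the optional index, and the row counts are scaled down to toy sizes.

```python
import sqlite3

# SQLite stand-in for PostgreSQL (assumption); sizes are toy values.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE exclude_test (id INTEGER PRIMARY KEY)")
conn.executemany(
    "INSERT INTO exclude_test (id) VALUES (?)",
    [(n,) for n in range(1, 51)],  # ids 1 .. 50
)

# Temporary table holding the values to exclude; the primary key gives it an index.
conn.execute("CREATE TEMPORARY TABLE excluded (item INTEGER PRIMARY KEY)")

# Bulk load of the exclusion list. In PostgreSQL this is where COPY would go.
conn.executemany(
    "INSERT INTO excluded (item) VALUES (?)",
    [(n,) for n in range(1, 41, 4)],  # 1, 5, 9, ..., 37
)

# Same anti-join as before, now against the temp table instead of a CTE.
remaining = [
    row[0]
    for row in conn.execute(
        """
        SELECT t.id FROM exclude_test t
        WHERE NOT EXISTS (SELECT 1 FROM excluded e WHERE e.item = t.id)
        ORDER BY t.id
        """
    )
]

print(len(remaining), remaining[:5])
```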
Demo:
CREATE UNLOGGED TABLE exclude_test(id integer primary key);
INSERT INTO exclude_test(id) SELECT generate_series(1,50000);
CREATE TABLE exclude AS SELECT x AS item FROM generate_series(1,40000,4) x;
where exclude is the list of values to omit.
I then compare the following approaches on the same data with all results in milliseconds:
- NOT IN list: 3424.596
- AND ... list: 80173.823
- VALUES based JOIN exclusion: 20.727
- VALUES based subquery exclusion: 20.495
- Table-based JOIN, no index on ex-list: 25.183
- Subquery table based, no index on ex-list: 23.985
... making the CTE-based approach over three thousand times faster than the AND list and roughly 165 times faster than the NOT IN list.
Code here: https://gist.github.com/ringerc/5755247 (shield your eyes, ye who follow this link).
For this data set size adding an index on the exclusion list made no difference.
Notes:
- IN list generated with SELECT 'IN (' || string_agg(item::text, ',' ORDER BY item) || ')' from exclude;
- AND list generated with SELECT string_agg(item::text, ' AND item <> ') from exclude;
- Subquery and join based table exclusion were much the same across repeated runs.
- Examination of the plan shows that Pg translates NOT IN to <> ALL
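For anyone who would rather build the two literal lists client-side, here is a rough Python equivalent of the string_agg queries from the notes. The helper names build_not_in and build_and_chain are hypothetical, the tiny table is invented, and SQLite (via sqlite3) is only a stand-in used to confirm that both predicates select the same rows; it makes no claim about timings.

```python
import sqlite3


def build_not_in(items):
    """Build 'item NOT IN (1,5,...)' from an iterable of integers (hypothetical helper)."""
    return "item NOT IN (" + ",".join(str(i) for i in sorted(items)) + ")"


def build_and_chain(items):
    """Build 'item <> 1 AND item <> 5 ...' from an iterable of integers (hypothetical helper)."""
    return " AND ".join(f"item <> {i}" for i in items)


# Toy exclusion list with the same stride as generate_series(1, 40000, 4), cut short.
exclude = list(range(1, 20, 4))  # 1, 5, 9, 13, 17

# SQLite stand-in for PostgreSQL (assumption); the values come from integers
# generated here, so interpolating them into the SQL text is safe in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE thetable (item INTEGER)")
conn.executemany("INSERT INTO thetable VALUES (?)", [(n,) for n in range(1, 21)])


def run(predicate):
    """Return the items that survive the given WHERE predicate."""
    sql = f"SELECT item FROM thetable WHERE {predicate} ORDER BY item"
    return [row[0] for row in conn.execute(sql)]


not_in_rows = run(build_not_in(exclude))
and_rows = run(build_and_chain(exclude))
print(not_in_rows == and_rows, len(not_in_rows))
```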
So... you can see that there's a truly huge gap between both IN and AND lists vs doing a proper join. What surprised me was how fast doing it with a CTE using a VALUES list was ... parsing the VALUES list took almost no time at all, performing the same or slightly faster than the table approach in most tests.
It'd be nice if PostgreSQL could automatically recognise a preposterously long IN clause or chain of similar AND conditions and switch to a smarter approach like doing a hashed join or implicitly turning it into a CTE node. Right now it doesn't know how to do that.
See also:
- this handy blog post Magnus Hagander wrote on the topic
Postgres NOT IN performance
A huge IN list is very inefficient. PostgreSQL should ideally identify it and turn it into a relation that it does an anti-join on, but at this point the query planner doesn't know how to do that, and the planning time required to identify this case would cost every query that uses NOT IN sensibly, so it'd have to be a very low cost check. See this earlier much more detailed answer on the topic.
As David Aldridge wrote, this is best solved by turning it into an anti-join. I'd write it as a join over a VALUES list simply because PostgreSQL is extremely fast at parsing VALUES lists into relations, but the effect is the same:
SELECT entityid
FROM entity e
LEFT JOIN level1entity l1 ON l1.level1id = e.level1_level1id
LEFT JOIN level2entity l2 ON l2.level2id = l1.level2_level2id
LEFT OUTER JOIN (
VALUES
(1377776),(1377792),(1377793),(1377794),(1377795),(1377796)
) ex(ex_entityid) ON (entityid = ex_entityid)
WHERE l2.userid = 'a987c246-65e5-48f6-9d2d-a7bcb6284c8f'
AND ex_entityid IS NULL;
For a sufficiently large set of values you might even be better off creating a temporary table, COPYing the values into it, creating a PRIMARY KEY on it, and joining on that.
More possibilities explored here:
https://stackoverflow.com/a/17038097/398670
Performance differences between equal (=) and IN with one literal value
There is no difference between those two statements, and the optimiser will transform the IN to the = when IN has just one element in it.
Though when you have a question like this, just run both statements, look at their execution plans and see the differences. Here you won't find any.
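As a rough illustration of "run both and compare", the script below does it against an in-memory SQLite database using Python's sqlite3 module. That is a stand-in chosen so the example is runnable anywhere, not the DBMS from the question; the table definition and sample rows are assumptions, and EXPLAIN QUERY PLAN is SQLite's own plan command, whose output will look different from PostgreSQL's or Oracle's.

```python
import sqlite3

# SQLite stand-in (assumption); table layout and rows are made up for the example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dim_employees (id INTEGER PRIMARY KEY, identity_number TEXT UNIQUE)"
)
conn.executemany(
    "INSERT INTO dim_employees (identity_number) VALUES (?)",
    [("123456789",), ("987654321",)],
)

EQ_QUERY = "SELECT id FROM dim_employees WHERE identity_number = '123456789'"
IN_QUERY = "SELECT id FROM dim_employees WHERE identity_number IN ('123456789')"

# Both statements should return exactly the same rows.
rows_eq = conn.execute(EQ_QUERY).fetchall()
rows_in = conn.execute(IN_QUERY).fetchall()
print(rows_eq, rows_in)

# Print each plan so the two can be compared by eye; the last column is the detail text.
for label, sql in (("=", EQ_QUERY), ("IN", IN_QUERY)):
    for row in conn.execute("EXPLAIN QUERY PLAN " + sql):
        print(label, "->", row[-1])
```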
After a big search online, I found a document on SQL to support this (I assume it applies to all DBMS):
If there is only one value inside the parenthesis, this commend [sic] is equivalent to,
WHERE "column_name" = 'value1'
Here is the execution plan of both queries in Oracle (most DBMS will process this the same):
EXPLAIN PLAN FOR
select * from dim_employees t
where t.identity_number = '123456789'
Plan hash value: 2312174735
-----------------------------------------------------
| Id | Operation | Name |
-----------------------------------------------------
| 0 | SELECT STATEMENT | |
| 1 | TABLE ACCESS BY INDEX ROWID| DIM_EMPLOYEES |
| 2 | INDEX UNIQUE SCAN | SYS_C0029838 |
-----------------------------------------------------
And for IN():
EXPLAIN PLAN FOR
select * from dim_employees t
where t.identity_number in('123456789');
Plan hash value: 2312174735
-----------------------------------------------------
| Id | Operation | Name |
-----------------------------------------------------
| 0 | SELECT STATEMENT | |
| 1 | TABLE ACCESS BY INDEX ROWID| DIM_EMPLOYEES |
| 2 | INDEX UNIQUE SCAN | SYS_C0029838 |
-----------------------------------------------------
As you can see, both are identical. This is on an indexed column. Same goes for an unindexed column (just full table scan).