How to Perform a SQL 'Not In' Query Faster

How can I perform a SQL 'NOT IN' query faster?

You can use a left outer join, or a not exists clause.

Left outer join:

select E.EmailAddress
from EMAIL E left outer join BLACKLIST B on (E.EmailAddress = B.EmailAddress)
where B.EmailAddress is null;

Not Exists:

select E.EmailAddress
from EMAIL E where not exists
(select EmailAddress from BLACKLIST B where B.EmailAddress = E.EmailAddress)

Both are quite generic SQL solutions (don't depend on a specific DB engine). I would say that the latter is a little bit more performant (not by much though). But definitely more performant than the not in one.

As commenters stated, you can also try creating an index on BLACKLIST(EmailAddress), that should help speed up the execution of your query.

Is there a way to make an SQL NOT IN query faster?

A NOT IN can be rewritten as a NOT EXISTS query which is very often faster (unfortunately the Postgres optimizer isn't smart enough to detect this).

SELECT COUNT(DISTINCT l1.mobile_number) 
FROM log_entries as l1
WHERE l1.created_at >= '2015-04-09 00:00:00'
AND l1.created_at <= '2015-04-09 23:59:59'
AND NOT EXISTS (SELECT *
FROM log_entries l2
WHERE l2.created_at < '2015-04-09 00:00:00'
AND l2.mobile_number = l1.mobile_number);

An index on (mobile_number, created_at) should further improve the performance.


A side note: created_at <= '2015-04-09 23:59:59' will not include rows with fractional seconds, e.g. 2015-04-09 23:59:59.789. When dealing with timestamps it's better to use a "lower than" with the "next day" instead of a "lower or equal" with the day in question.

So better use: created_at < '2015-04-10 00:00:00' instead to also "catch" rows on that day with fractional seconds.

SQL (NOT IN) Query takes forever to execute

Please try the following. I simplified the inner queries by eliminating unnecessary IN, UNION, and EXCEPT clauses.

WITH frm 
AS (SELECT product_id AS PId,
Min(Cast(product_startdate AS DATETIME)) AS PStartDate
FROM products
WHERE product_status IN ( 'F', 'R', 'M' )
GROUP BY product_id),
firstcount
AS (SELECT pid,
pstartdate,
(SELECT Count(*)
FROM products
WHERE product_status IN ( 'OR', 'OP' )
AND product_comments LIKE '%CANCELLED%'
AND product_id = pid) AS v_count
FROM frm),
secondcount
AS (SELECT pid,
pstartdate,
CASE
WHEN v_count = 0 THEN (SELECT Count(*)
FROM products
WHERE product_status IN ( 'F', 'R', 'M' )
AND product_startdate != '.'
AND product_id = pid)
ELSE v_count
END AS v_count
FROM firstcount)
INSERT INTO products_del
(product_id,
product_startdate,
productdel_status)
SELECT pid,
pstartdate,
CASE
WHEN v_count != 0 THEN 'UNKNOWN'
ELSE NULL
END
FROM secondcount

SELECT *
FROM products_del

Which is faster - NOT IN or NOT EXISTS?

Usually it does not matter if NOT IN is slower / faster than NOT EXISTS, because they are NOT equivalent in presence of NULL. Read:

NOT IN vs NOT EXISTS

In these cases you almost always want NOT EXISTS, because it has the usually expected behaviour.

If they are equivalent, it is likely that your database already has figured that out and will generate the same execution plan for both.

In the few cases where both options are aquivalent and your database is not able to figure that out, it is better to analyze both execution plans and choose the best options for your specific case.

NOT IN query performance issue with large data

You can write it with NOT EXISTS instead, although these queries will have different results if any value of id is NULL (in which case, NOT IN probably yields not the answer you want, so NOT EXISTS is better from that perspective as well.)

select id,number from tmp_t where not exists 
(select 1 from tmp_t a where a.id=tmp_t.number);

But your formulation is also efficient as long as work_mem is large enough.



Related Topics



Leave a reply



Submit