What Is the Most Efficient Way to Write a Select Statement with a "Not In" Subquery

What is the most efficient way to write a select statement with a not in subquery?

"Most efficient" will differ depending on table sizes, indexes, and so on. In other words, it depends on the specific case you're dealing with.

There are three ways I commonly use to accomplish what you want, depending on the situation.

1. Your example works fine if Orders.order_id is indexed and HeldOrders is fairly small.

2. Another method is the "correlated subquery" which is a slight variation of what you have...

SELECT *
FROM Orders o
WHERE o.Order_ID NOT IN (SELECT h.Order_ID
                         FROM HeldOrders h
                         WHERE h.Order_ID = o.Order_ID)

Note the addition of the WHERE clause inside the subquery. This tends to work better when HeldOrders has a large number of rows; Order_ID needs to be indexed in both tables.

3. Another method I use sometimes is left outer join...

SELECT *
FROM Orders o
LEFT OUTER JOIN HeldOrders h ON h.order_id = o.order_id
WHERE h.order_id IS NULL

When using the left outer join, h.order_id will hold the matching o.order_id value when there is a matching row, and NULL when there isn't. Checking for NULL in the WHERE clause filters the result down to everything that doesn't have a match.

Each of these variations can work more or less efficiently in various scenarios.
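
When the columns involved are non-nullable, all three approaches return the same rows. Here is a minimal runnable sketch (Python with an in-memory SQLite database and made-up data; the table names follow the answer above):

```python
import sqlite3

# Made-up data: four orders, two of which are held.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Orders (order_id INTEGER PRIMARY KEY);
    CREATE TABLE HeldOrders (order_id INTEGER NOT NULL);
    INSERT INTO Orders VALUES (1), (2), (3), (4);
    INSERT INTO HeldOrders VALUES (2), (4);
""")

# 1. Plain NOT IN subquery
not_in = con.execute(
    "SELECT order_id FROM Orders "
    "WHERE order_id NOT IN (SELECT order_id FROM HeldOrders)").fetchall()

# 2. Correlated subquery (existence check)
not_exists = con.execute(
    "SELECT o.order_id FROM Orders o "
    "WHERE NOT EXISTS (SELECT 1 FROM HeldOrders h "
    "                  WHERE h.order_id = o.order_id)").fetchall()

# 3. Left outer join, keeping only the unmatched rows
left_join = con.execute(
    "SELECT o.order_id FROM Orders o "
    "LEFT OUTER JOIN HeldOrders h ON h.order_id = o.order_id "
    "WHERE h.order_id IS NULL").fetchall()

# With no NULLs involved, all three agree: orders 1 and 3 are not held.
for rows in (not_in, not_exists, left_join):
    print(sorted(rows))
```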

SQL select where not in subquery returns no results

Update:

These articles in my blog describe the differences between the methods in more detail:

  • NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: SQL Server
  • NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: PostgreSQL
  • NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: Oracle
  • NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: MySQL

There are three ways to do such a query:

  • LEFT JOIN / IS NULL:

    SELECT *
    FROM common
    LEFT JOIN table1 t1 ON t1.common_id = common.common_id
    WHERE t1.common_id IS NULL
  • NOT EXISTS:

    SELECT *
    FROM common
    WHERE NOT EXISTS
          (SELECT NULL
           FROM table1 t1
           WHERE t1.common_id = common.common_id)
  • NOT IN:

    SELECT *
    FROM common
    WHERE common_id NOT IN
          (SELECT common_id
           FROM table1 t1)

When table1.common_id is not nullable, all these queries are semantically the same.

When it is nullable, NOT IN behaves differently, since IN (and therefore NOT IN) returns NULL when a value does not match anything in a list containing a NULL.

This may be confusing but may become more obvious if we recall the alternate syntax for this:

common_id = ANY
    (SELECT common_id
     FROM table1 t1)

The result of = ANY is the boolean sum (OR) of all the comparisons within the list; NOT IN negates it, turning it into a boolean product of inequalities (equivalent to <> ALL). A single NULL comparison yields NULL, which renders the whole product NULL too.

We can never say definitively that common_id is not equal to anything in the list, since at least one of the values is NULL.

Suppose we have these data:

common

--
1
3

table1

--
NULL
1
2

LEFT JOIN / IS NULL and NOT EXISTS will return 3; NOT IN will return nothing, since the condition always evaluates to either FALSE or NULL.
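
The difference is easy to reproduce. A small sketch in Python with an in-memory SQLite database, using exactly the sample data above:

```python
import sqlite3

# common = {1, 3}; table1 = {NULL, 1, 2} -- the sample data above.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE common (common_id INTEGER);
    CREATE TABLE table1 (common_id INTEGER);
    INSERT INTO common VALUES (1), (3);
    INSERT INTO table1 VALUES (NULL), (1), (2);
""")

not_in = con.execute(
    "SELECT common_id FROM common "
    "WHERE common_id NOT IN (SELECT common_id FROM table1)").fetchall()

not_exists = con.execute(
    "SELECT common_id FROM common c "
    "WHERE NOT EXISTS (SELECT NULL FROM table1 t1 "
    "                  WHERE t1.common_id = c.common_id)").fetchall()

print(not_in)      # [] -- the NULL in table1 makes every NOT IN test non-TRUE
print(not_exists)  # [(3,)]
```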

In MySQL, for a non-nullable column, LEFT JOIN / IS NULL and NOT IN are a little bit (several percent) more efficient than NOT EXISTS. If the column is nullable, NOT EXISTS is the most efficient (again, not by much).

In Oracle, all three queries yield the same plan (an ANTI JOIN).

In SQL Server, NOT IN / NOT EXISTS are more efficient, since LEFT JOIN / IS NULL cannot be optimized to an ANTI JOIN by its optimizer.

In PostgreSQL, LEFT JOIN / IS NULL and NOT EXISTS are more efficient than NOT IN, since they are optimized into an Anti Join, while NOT IN uses a hashed subplan (or even a plain subplan if the subquery is too large to hash).

How to write the query without using a subquery?

You can use a LEFT JOIN with an IS NULL condition:

SELECT store_product.*
FROM store_product
LEFT JOIN import_pack_file_element
       ON store_product.product_id = import_pack_file_element.entity_id
      AND import_pack_file_element.import_pack_file_id IN (135)
      AND import_pack_file_element.status = 'DONE'
WHERE import_pack_file_element.entity_id IS NULL
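
One detail worth noting: the extra filters on import_pack_file_element belong in the ON clause, not the WHERE clause. Moved to WHERE, they would reject the NULL-extended rows and silently turn the LEFT JOIN into an inner join. A minimal sketch (Python with SQLite, hypothetical data):

```python
import sqlite3

# Hypothetical data: product 1 was imported and done, product 2 was not.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE store_product (product_id INTEGER);
    CREATE TABLE import_pack_file_element (
        entity_id INTEGER, import_pack_file_id INTEGER, status TEXT);
    INSERT INTO store_product VALUES (1), (2);
    INSERT INTO import_pack_file_element VALUES (1, 135, 'DONE');
""")

# Extra filters in the ON clause: the anti-join works as intended.
on_clause = con.execute("""
    SELECT sp.product_id FROM store_product sp
    LEFT JOIN import_pack_file_element e ON sp.product_id = e.entity_id
        AND e.import_pack_file_id IN (135) AND e.status = 'DONE'
    WHERE e.entity_id IS NULL""").fetchall()

# Same filters in the WHERE clause: the NULL-extended rows fail them,
# so nothing is returned at all.
where_clause = con.execute("""
    SELECT sp.product_id FROM store_product sp
    LEFT JOIN import_pack_file_element e ON sp.product_id = e.entity_id
    WHERE e.entity_id IS NULL
        AND e.import_pack_file_id IN (135) AND e.status = 'DONE'""").fetchall()

print(on_clause)     # [(2,)] -- product 2 was never imported
print(where_clause)  # []
```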

Is it better / more efficient to use sub queries or SELECT statements within the WHERE clause (in MS Access)

Your single composite query already looks optimal to me; I doubt you can make it simpler or more efficient.

Judicious use of indexes in your table should ensure that the query runs pretty fast.

Your last query is called a Correlated subquery.

It is sometimes useful, but it can be very slow: the subquery must be executed for each record in ScoresTable, because its result depends on the value of each individual record in ScoresTable.

This is rather difficult for the database engine to optimise.

If you are interested in finding out details about how the query planner optimises your queries, have a look at these articles, they'll show you what's under the hood:

  • Use Microsoft Jet's ShowPlan to write more efficient queries
  • Access 2002 Desktop Developer's Handbook, Chapter 15: Application Optimization

NOT IN vs NOT EXISTS

I always default to NOT EXISTS.

The execution plans may be the same at the moment, but if either column is altered in the future to allow NULLs, the NOT IN version will need to do more work (even if no NULLs are actually present in the data), and the semantics of NOT IN when NULLs are present are unlikely to be the ones you want anyway.

When neither Products.ProductID nor [Order Details].ProductID allows NULLs, the NOT IN is treated identically to the following query.

SELECT ProductID,
       ProductName
FROM Products p
WHERE NOT EXISTS (SELECT *
                  FROM [Order Details] od
                  WHERE p.ProductId = od.ProductId)

The exact plan may vary but for my example data I get the following.

[Execution plan: neither column nullable]

A reasonably common misconception seems to be that correlated subqueries are always "bad" compared to joins. They certainly can be when they force a nested loops plan (subquery evaluated row by row), but this plan includes an anti semi join logical operator. Anti semi joins are not restricted to nested loops; they can use hash or merge (as in this example) joins too.

/* Not valid syntax, but better reflects the plan */
SELECT p.ProductID,
       p.ProductName
FROM Products p
LEFT ANTI SEMI JOIN [Order Details] od
    ON p.ProductId = od.ProductId

If [Order Details].ProductID is NULL-able, the query then becomes

SELECT ProductID,
       ProductName
FROM Products p
WHERE NOT EXISTS (SELECT *
                  FROM [Order Details] od
                  WHERE p.ProductId = od.ProductId)
  AND NOT EXISTS (SELECT *
                  FROM [Order Details]
                  WHERE ProductId IS NULL)

The reason for this is that the correct semantics, if [Order Details] contains any NULL ProductIds, is to return no results. See the extra anti semi join and row count spool added to the plan to verify this.

[Execution plan: one nullable column]

If Products.ProductID is also changed to become NULL-able, the query then becomes

SELECT ProductID,
       ProductName
FROM Products p
WHERE NOT EXISTS (SELECT *
                  FROM [Order Details] od
                  WHERE p.ProductId = od.ProductId)
  AND NOT EXISTS (SELECT *
                  FROM [Order Details]
                  WHERE ProductId IS NULL)
  AND NOT EXISTS (SELECT *
                  FROM (SELECT TOP 1 *
                        FROM [Order Details]) S
                  WHERE p.ProductID IS NULL)

The reason for that one is that a NULL Products.ProductId should not be returned in the results, except when the NOT IN subquery returns no results at all (i.e. the [Order Details] table is empty), in which case it should. In the plan for my sample data this is implemented by adding another anti semi join, as below.

[Execution plan: both columns nullable]

The effect of this is shown in the blog post already linked by Buckley. In the example there, the number of logical reads increases from around 400 to 500,000.

Additionally, the fact that a single NULL can reduce the row count to zero makes cardinality estimation very difficult. If SQL Server assumes this will happen but in fact there are no NULL rows in the data, the rest of the execution plan may be catastrophically worse; if this is just part of a larger query, inappropriate nested loops can cause repeated execution of an expensive subtree, for example.

This is not the only possible execution plan for a NOT IN on a NULL-able column, however. This article shows another one for a query against the AdventureWorks2008 database.

For a NOT IN on a NOT NULL column, or a NOT EXISTS against either a nullable or non-nullable column, it gives the following plan.

[Execution plan: NOT EXISTS / NOT IN on a NOT NULL column]

When the column is changed to NULL-able, the NOT IN plan now looks like this:

[Execution plan: NOT IN on a nullable column]

It adds an extra inner join operator to the plan. This apparatus is explained here. It is all there to convert the previous single correlated index seek on Sales.SalesOrderDetail.ProductID = <correlated_product_id> into two seeks per outer row. The additional one is on WHERE Sales.SalesOrderDetail.ProductID IS NULL.

As this is under an anti semi join, if the first seek returns any rows the second seek will not occur. However, if Sales.SalesOrderDetail does not contain any NULL ProductIDs, it will double the number of seek operations required.

MySQL and making subquery more efficient

1) Simpler queries are easier for the query engine to interpret and to produce an efficient plan for.

If you pay careful attention to the following part of your query, you may realise something a little "weird" is going on. This is a clue that the approach is perhaps a little too complicated.

...(
    list.user_id NOT IN (
        SELECT user_id
        FROM status
        /* Note the sub-query cannot ever return a user_id different
           to the one checked with "NOT IN" above */
        WHERE user_id = list.user_id
          AND season_id = rg.incorrect_status_id)
)

The query filters on list.user_id not being in a result set that cannot contain user_ids other than list.user_id. Of course, the subquery could return zero results. So it basically boils down to a simple existence check.

So for a start, you should rather write:

...(
    NOT EXISTS (
        SELECT *
        FROM status
        WHERE user_id = list.user_id
          AND season_id = rg.incorrect_status_id)
)

2) Be clear about what joins the tables together (this refers back to point 1 as well).

Your query selects from 3 tables without specifying any join conditions:

FROM rg, list, status

This results in a cross join, producing a result set that is the Cartesian product of all possible row combinations. If your WHERE clause were simple, the query engine might be able to implicitly promote certain filter conditions into join conditions, but that's not the case here. So even if, for example, you have a very small number of rows in each table:


status      20
rg         100
list      1000

Your intermediate result set (before the WHERE is applied) would need 1000 * 100 * 20 = 2,000,000 rows!
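
The arithmetic is easy to verify. A quick sketch (Python with SQLite, synthetic single-column tables of the sizes above):

```python
import sqlite3

# Three tables with the row counts from the example above.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE status (id INTEGER);
    CREATE TABLE rg (id INTEGER);
    CREATE TABLE list (id INTEGER);
""")
con.executemany("INSERT INTO status VALUES (?)", [(i,) for i in range(20)])
con.executemany("INSERT INTO rg VALUES (?)", [(i,) for i in range(100)])
con.executemany("INSERT INTO list VALUES (?)", [(i,) for i in range(1000)])

# A comma-separated FROM with no join conditions is a cross join.
(n,) = con.execute("SELECT COUNT(*) FROM rg, list, status").fetchone()
print(n)  # 2000000
```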

It helps tremendously to make it clear with join conditions how the rows of each table are intended to match up. Not only does it make the query easier to read and understand, but it also helps avoid overlooking join conditions which can be the bane of performance considerations.

Note that when specifying join conditions, some rows might not have matches, and this is where knowing and understanding the different types of joins is extremely important. Particularly in your case, most of the complexity in your WHERE clause seems to come from trying to resolve when rows do or do not match. See this answer for some useful information.

Your FROM/WHERE clause should probably look more like the following. (Difficult to be certain because you haven't stated your table relationships or expected input/output of your query. But it should set you on the right track.)

FROM rg
    /* Assumes rg rows form the base of the query, and that rg rows are not
       to be excluded due to non-matches in list or status. */
LEFT OUTER JOIN status
    ON status.season_id = rg.required_status_id
LEFT OUTER JOIN list
    ON status.user_id = list.user_id
WHERE rg.incorrect_status_id IS NULL
    /* As Barmar commented, it may also be useful to break this
       OR condition out as a separate query UNION to the above. */
    OR (rg.incorrect_status_id IS NOT NULL
        AND NOT EXISTS (
            SELECT *
            FROM status
            WHERE user_id = list.user_id
              AND season_id = rg.incorrect_status_id))

Note that this query is very clear about the distinction between how the tables are joined, and what is used to filter the joined result set.

3) Finally and very importantly, even the best queries are of little benefit without the correct indexes!

A good query with bad indexes (or conversely, a bad query with good indexes) is going to be inefficient either way. Computers are fast enough that you might not notice on small databases, but you should experiment with candidate indexes to find the best combination for your data and workload.

In the above query you likely need indexes on the following. (Some may already be covered by Primary Key constraints.)

status.season_id
status.user_id
list.user_id
rg.required_status_id
rg.incorrect_status_id

What is the most performant way to rewrite a correlated subquery in the SELECT clause?

Here is one way, with conditional aggregation (Rextester):

select user_id
      ,MAX(case when '2017-11-17' - visit_date <= 30
                then 1 else 0 end) as last_30
      ,MAX(case when '2017-11-17' - visit_date between 31 and 60
                then 1 else 0 end) as between_31_60
      ,MAX(case when '2017-11-17' - visit_date between 61 and 90
                then 1 else 0 end) as between_61_90
from visits
group by user_id
order by user_id
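
A runnable sketch of the same conditional-aggregation pattern (Python with an in-memory SQLite database and made-up visits; SQLite lacks direct date subtraction, so julianday computes the day difference here):

```python
import sqlite3

# Made-up visits relative to the reference date 2017-11-17.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE visits (user_id INTEGER, visit_date TEXT);
    INSERT INTO visits VALUES
        (1, '2017-11-10'),   -- 7 days back  -> last_30
        (1, '2017-09-20'),   -- 58 days back -> between_31_60
        (2, '2017-08-25');   -- 84 days back -> between_61_90
""")

rows = con.execute("""
    SELECT user_id,
           MAX(CASE WHEN julianday('2017-11-17') - julianday(visit_date)
                         BETWEEN 0 AND 30 THEN 1 ELSE 0 END) AS last_30,
           MAX(CASE WHEN julianday('2017-11-17') - julianday(visit_date)
                         BETWEEN 31 AND 60 THEN 1 ELSE 0 END) AS between_31_60,
           MAX(CASE WHEN julianday('2017-11-17') - julianday(visit_date)
                         BETWEEN 61 AND 90 THEN 1 ELSE 0 END) AS between_61_90
    FROM visits
    GROUP BY user_id
    ORDER BY user_id""").fetchall()

print(rows)  # [(1, 1, 1, 0), (2, 0, 0, 1)]
```

A single pass over the table replaces one correlated subquery per bucket, which is the point of the rewrite.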

