Sql: Performance Comparison for Exclusion (Join VS Not In)

SQL: Performance comparison for exclusion (Join vs Not in)

Depends on the RDBMS. For Microsoft SQL Server NOT EXISTS is preferred to the OUTER JOIN as it can use the more efficient Anti-Semi join.

For Oracle Minus is apparently preferred to NOT EXISTS (where suitable)

You would need to look at the execution plans and decide.

SQL performance on LEFT OUTER JOIN vs NOT EXISTS

Joe's link is a good starting point. Quassnoi covers this too.

In general, if your fields are properly indexed, OR if you expect to filter out more records (i.e. have a lots of rows EXIST in the subquery) NOT EXISTS will perform better.

EXISTS and NOT EXISTS both short circuit - as soon as a record matches the criteria it's either included or filtered out and the optimizer moves on to the next record.

LEFT JOIN will join ALL RECORDS regardless of whether they match or not, then filter out all non-matching records. If your tables are large and/or you have multiple JOIN criteria, this can be very very resource intensive.

I normally try to use NOT EXISTS and EXISTS where possible. For SQL Server, IN and NOT IN are semantically equivalent and may be easier to write. These are among the only operators you will find in SQL Server that are guaranteed to short circuit.

SQL query 'NOT IN' vs 'Join'

I would actually recommend not exists for this:

select count(*)
from tbl_a a
where
a.mailDatId =
and not exists (
select 1
from tbl_b as b
where b.isStandard = 1 and b.serviceTypeId = substring(a.PIMBRecord,3,3)
)

Rationale:

  • NOT IN is not null-safe, while NOT EXISTS is

  • JOINs usually better fit the use case when you want to return something from the other table - which is not the case here

For performance, you want an index on tbl_b(serviceTypeId, isStandard)

What's the difference between NOT EXISTS vs. NOT IN vs. LEFT JOIN WHERE IS NULL?

  • NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: SQL Server

  • NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: PostgreSQL

  • NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: Oracle

  • NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: MySQL

In a nutshell:

NOT IN is a little bit different: it never matches if there is but a single NULL in the list.

  • In MySQL, NOT EXISTS is a little bit less efficient

  • In SQL Server, LEFT JOIN / IS NULL is less efficient

  • In PostgreSQL, NOT IN is less efficient

  • In Oracle, all three methods are the same.

Does EXCEPT execute faster than a JOIN when the table columns are the same

There is no way anyone can tell you that EXCEPT will always or never out-perform an equivalent OUTER JOIN. The optimizer will choose an appropriate execution plan regardless of how you write your intent.

That said, here is my guideline:


Use EXCEPT when at least one of the following is true:

  1. The query is more readable (this will almost always be true).
  2. Performance is improved.

And BOTH of the following are true:

  1. The query produces semantically identical results, and you can demonstrate this through sufficient regression testing, including all edge cases.
  2. Performance is not degraded (again, in all edge cases, as well as environmental changes such as clearing buffer pool, updating statistics, clearing plan cache, and restarting the service).

It is important to note that it can be a challenge to write an equivalent EXCEPT query as the JOIN becomes more complex and/or you are relying on duplicates in part of the columns but not others. Writing a NOT EXISTS equivalent, while slightly less readable than EXCEPT should be far more trivial to accomplish - and will often lead to a better plan (but note that I would never say ALWAYS or NEVER, except in the way I just did).

In this blog post I demonstrate at least one case where EXCEPT is outperformed by both a properly constructed LEFT OUTER JOIN and of course by an equivalent NOT EXISTS variation.

Is there any performance benefit if we exclude null rows in where clause in Select query

No, it will not help -- although it could make things slightly (probably unmeasurably) worse.

The condition SampleCol = 'Test' is exactly the comparison you want to make. So, the database has to make this comparison, in some fashion, for every row that is returned.

There are basically two situations. Without an index, your query needs to do a full table scan. Two comparisons on each row (one for NULL and one for the value) take longer than a single comparison. To be honest, some databases might optimize this just to the equality comparison, so the two could be equal. I don't think SQL Server does this elimination but it might.

With an index, SQL Server will use an index for the = comparison. It might then do an additional comparison against NULL (even though that is redundant). You run into a bigger issue here, though: The more complicated the predicate the more likely the optimizer gets confused and doesn't use an index.

There is a third case where your column is used for partitioning. I do not know if the redundant comparison would have an impact on partition pruning.

You want your where comparisons to be simple. In general, you want to let the optimizer do its work. On very rare occasions, you might want to give the optimizer some help, but that is very, very, very rare -- and generally involves functions that are much more expensive to run than simple comparisons.

JOIN versus EXISTS performance

NOT EXISTS is more efficient than using a LEFT OUTER JOIN to exclude records that are missing from the participating table using an IS NULL condition because the optimizer will elect to use an EXCLUSION MERGE JOIN with the NOT EXISTS predicate.

While your second test did not yield impressive results for the data sets you were using the performance increase from NOT EXISTS over a LEFT JOIN is very noticeable as your data volumes increase. Keep in mind that the tables will need to be hash distributed by the columns that participate in the NOT EXISTS join just like they would in the LEFT JOIN. Therefore, data skew can impact the performance of the EXCLUSION MERGE JOIN.

EDIT:

Typically, I would defer to EXISTS as a replacement for IN instead of using it for re-writing a join solution. This is especially true when the column(s) participating in the logical comparison can be NULL. That's not to say you couldn't use EXISTS in place of an INNER JOIN. Instead of an EXCLUSION JOIN you will end up with an INCLUSION JOIN. The INNER JOIN is in essence an inclusion join to begin with. I'm sure there are some nuances that I am overlooking but you can find those in the manuals if you wish to take the time to read them.

PostgreSQL: NOT IN versus EXCEPT performance difference (edited #2)

Since you are running with the default configuration, try bumping up work_mem. Most likely, the subquery ends up getting spooled to disk because you only allow for 1Mb of work memory. Try 10 or 20mb.

Difference between filtering queries in JOIN and WHERE?

The answer is NO difference, but:

I will always prefer to do the following.

  • Always keep the Join Conditions in ON clause
  • Always put the filter's in where clause

This makes the query more readable.

So I will use this query:

SELECT value
FROM table1
INNER JOIN table2
ON table1.id = table2.id
WHERE table1.id = 1

However when you are using OUTER JOIN'S there is a big difference in keeping the filter in the ON condition and Where condition.

Logical Query Processing

The following list contains a general form of a query, along with step numbers assigned according to the order in which the different clauses are logically processed.

(5) SELECT (5-2) DISTINCT (5-3) TOP(<top_specification>) (5-1) <select_list>
(1) FROM (1-J) <left_table> <join_type> JOIN <right_table> ON <on_predicate>
| (1-A) <left_table> <apply_type> APPLY <right_table_expression> AS <alias>
| (1-P) <left_table> PIVOT(<pivot_specification>) AS <alias>
| (1-U) <left_table> UNPIVOT(<unpivot_specification>) AS <alias>
(2) WHERE <where_predicate>
(3) GROUP BY <group_by_specification>
(4) HAVING <having_predicate>
(6) ORDER BY <order_by_list>;

Flow diagram logical query processing

Sample Image

  • (1) FROM: The FROM phase identifies the query’s source tables and
    processes table operators. Each table operator applies a series of
    sub phases. For example, the phases involved in a join are (1-J1)
    Cartesian product, (1-J2) ON Filter, (1-J3) Add Outer Rows. The FROM
    phase generates virtual table VT1.

  • (1-J1) Cartesian Product: This phase performs a Cartesian product
    (cross join) between the two tables involved in the table operator,
    generating VT1-J1.

  • (1-J2) ON Filter: This phase filters the rows from VT1-J1 based on
    the predicate that appears in the ON clause (<on_predicate>). Only
    rows for which the predicate evaluates to TRUE are inserted into
    VT1-J2.
  • (1-J3) Add Outer Rows: If OUTER JOIN is specified (as opposed to
    CROSS JOIN or INNER JOIN), rows from the preserved table or tables
    for which a match was not found are added to the rows from VT1-J2 as
    outer rows, generating VT1-J3.
  • (2) WHERE: This phase filters the rows from VT1 based on the
    predicate that appears in the WHERE clause (). Only
    rows for which the predicate evaluates to TRUE are inserted into VT2.
  • (3) GROUP BY: This phase arranges the rows from VT2 in groups based
    on the column list specified in the GROUP BY clause, generating VT3.
    Ultimately, there will be one result row per group.
  • (4) HAVING: This phase filters the groups from VT3 based on the
    predicate that appears in the HAVING clause (<having_predicate>).
    Only groups for which the predicate evaluates to TRUE are inserted
    into VT4.
  • (5) SELECT: This phase processes the elements in the SELECT clause,
    generating VT5.
  • (5-1) Evaluate Expressions: This phase evaluates the expressions in
    the SELECT list, generating VT5-1.
  • (5-2) DISTINCT: This phase removes duplicate rows from VT5-1,
    generating VT5-2.
  • (5-3) TOP: This phase filters the specified top number or percentage
    of rows from VT5-2 based on the logical ordering defined by the ORDER
    BY clause, generating the table VT5-3.
  • (6) ORDER BY: This phase sorts the rows from VT5-3 according to the
    column list specified in the ORDER BY clause, generating the cursor
    VC6.

it is referred from book "T-SQL Querying (Developer Reference)"



Related Topics



Leave a reply



Submit