Performance of SQL "Exists" Usage Variants

The truth about EXISTS is that the SELECT list of its subquery is not evaluated. You could try:

SELECT * 
  FROM tableA 
 WHERE EXISTS (SELECT 1/0 
                 FROM tableB 
                WHERE tableA.x = tableB.y)

...and you might expect a divide-by-zero error, but you won't get one because the expression is never evaluated. This is why my habit is to specify NULL in an EXISTS, to demonstrate that the SELECT list can be ignored:

SELECT * 
  FROM tableA 
 WHERE EXISTS (SELECT NULL
                 FROM tableB 
                WHERE tableA.x = tableB.y)

All that matters in an EXISTS subquery is the FROM clause and everything after it: WHERE, GROUP BY, HAVING, etc.
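To illustrate that the clauses after FROM do change the result while the SELECT list does not, here is a minimal sketch. The sample rows are made up for illustration; only the tableA.x and tableB.y names come from the queries above. The multi-row INSERT syntax may need adjusting on some databases.

```sql
-- Hypothetical sample data; only the table and column names come from the queries above.
CREATE TABLE tableA (x INT);
CREATE TABLE tableB (y INT);

INSERT INTO tableA (x) VALUES (1), (2), (3);
INSERT INTO tableB (y) VALUES (1), (2), (2);

-- The SELECT list is irrelevant: this matches the rows with x = 1 and x = 2,
-- exactly as SELECT 1 or SELECT * would in its place.
SELECT *
  FROM tableA
 WHERE EXISTS (SELECT NULL
                 FROM tableB
                WHERE tableA.x = tableB.y);

-- The HAVING clause is not ignored: only x = 2 has more than one
-- matching row in tableB, so only that row qualifies.
SELECT *
  FROM tableA
 WHERE EXISTS (SELECT NULL
                 FROM tableB
                WHERE tableA.x = tableB.y
                GROUP BY tableB.y
               HAVING COUNT(*) > 1);
```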

This question wasn't tagged with a specific database, and it should be, because vendors handle things differently. So test, and check the explain/execution plans to confirm. Behavior can also change between versions.

SQL Server IN vs. EXISTS Performance

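EXISTS can be faster because once the engine has found a hit, it can quit looking: the condition has already proved true.

With IN, the engine may collect all the results from the subquery before further processing. The optimizer often produces the same plan for both forms, though, so compare the execution plans before assuming a difference.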
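A rough way to check this on your own data is to run both forms with statistics enabled and compare the output and the execution plans. This is only a sketch: the tables and rows are placeholders reusing the tableA/tableB names from earlier, and the SET STATISTICS statements are specific to SQL Server.

```sql
-- Placeholder tables; skip the CREATE/INSERT statements if you test against real ones.
CREATE TABLE tableA (x INT);
CREATE TABLE tableB (y INT);
INSERT INTO tableA (x) VALUES (1), (2), (3);
INSERT INTO tableB (y) VALUES (1), (2), (2);

-- SQL Server: report I/O and timing per statement.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- EXISTS form: one matching row per outer row is enough.
SELECT a.*
  FROM tableA AS a
 WHERE EXISTS (SELECT NULL
                 FROM tableB AS b
                WHERE b.y = a.x);

-- IN form: returns the same rows as the EXISTS form.
SELECT a.*
  FROM tableA AS a
 WHERE a.x IN (SELECT b.y
                 FROM tableB AS b);

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

If the two statements show the same plan and the same reads, the choice between them is a matter of readability rather than speed for that query.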

Slow query when used as EXISTS subquery

I've seen this myself.

My guess is that EXISTS is better in a WHERE clause because it gives a semi-join, which is set-based and exactly what you need.

In an IF, this isn't clear to the optimiser. That is, there is nothing to semi-join to. This should hopefully behave the same (badly, that is):

SELECT 1 WHERE EXISTS (SELECT I.InsuranceID
    FROM Insurance I
    INNER JOIN JobDetail JD ON I.AccountID = JD.AccountID
    WHERE I.InsuranceLookupID IS NULL
    AND JD.JobID = 28)

You could do this, though:

SELECT SIGN(COUNT(*))
FROM Insurance I
INNER JOIN JobDetail JD ON I.AccountID = JD.AccountID
WHERE I.InsuranceLookupID IS NULL
AND JD.JobID = 28
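If the check has to drive an IF, one option is to store the result in a variable first. The following is a T-SQL sketch under assumptions: the column types in the CREATE TABLE statements, the variable name and the PRINT messages are all placeholders, and only the table and column names come from the queries above.

```sql
-- Assumed minimal schema; the real tables will have more columns and different types.
CREATE TABLE Insurance (InsuranceID INT, AccountID INT, InsuranceLookupID INT NULL);
CREATE TABLE JobDetail (JobID INT, AccountID INT);

-- Hypothetical variable holding 1 if at least one row matches, otherwise 0.
DECLARE @HasMatch INT;

SELECT @HasMatch = SIGN(COUNT(*))
FROM Insurance I
INNER JOIN JobDetail JD ON I.AccountID = JD.AccountID
WHERE I.InsuranceLookupID IS NULL
AND JD.JobID = 28;

-- Branch on the stored flag instead of on IF EXISTS (...).
IF @HasMatch = 1
    PRINT 'At least one matching row';
ELSE
    PRINT 'No matching rows';
```

Note that COUNT(*) has to count every matching row, so this is worth comparing against the IF EXISTS version on real data rather than assuming it wins.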

It is optimised in some circumstances:

What's the best to check if item exist or not: Select Count(ID)OR Exist(...)?

Not sure what confuses the optimiser...

Where does the practice exists (select 1 from ...) come from?

The main part of your question is - "where did this myth come from?"

So to answer that, I guess one of the first performance hints people learn with SQL is that SELECT * is inefficient in most situations. The fact that it isn't inefficient in this specific situation is hence somewhat counterintuitive, so it's not surprising that people are skeptical about it. Some simple research or experiments should be enough to banish most myths, although human history kinda shows that myths are quite hard to banish.

MySQL: where exists VS where id in [performance]

Gordon has a good answer. The fact is that performance depends on a lot of different factors including database design/schema and volume of data.

As a rough guide, the EXISTS subquery is going to execute once for every row in replays, whereas the IN subquery is going to execute once to get its results, and then those results will be searched for every row in replays.

So with EXISTS, the better the indexing/access path, the faster it will run. Without relevant index(es) it will just read through rows until it finds a match, and it will do so for every single row in replays. For the rows with no matches it would end up reading the entire players table each time. Even the rows with matches could read through a significant number of players before finding a match.

With IN, the smaller the result set from the subquery, the faster it will run. For rows without a match it only needs to quickly check the small set of subquery rows to reach that answer. That said, you don't get the benefit of indexes (if it works this way), so for a large result set from the subquery it has to read every row in the subselect before deciding that there is no match.

That said, database optimisers are pretty clever and don't always evaluate queries exactly the way you ask them to, which is why checking execution plans and testing yourself is important to figure out the best approach. It's not unusual to expect a certain execution path only to find that the optimiser has chosen a different method of execution based on how it expects the data to look.
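As a sketch of how to compare the two approaches in MySQL, the statements below ask for the plan of each form. Only the table names replays and players come from the discussion above; the columns, the index name and the join condition are assumptions made for illustration.

```sql
-- Hypothetical schema: the column names are placeholders.
CREATE TABLE players (id INT NOT NULL, name VARCHAR(50));
CREATE TABLE replays (id INT NOT NULL, player_id INT);

-- Index on the column the EXISTS subquery probes, so each probe
-- can be an index lookup instead of a scan of players.
CREATE INDEX idx_players_id ON players (id);

-- Plan for the EXISTS form.
EXPLAIN
SELECT r.*
  FROM replays r
 WHERE EXISTS (SELECT NULL
                 FROM players p
                WHERE p.id = r.player_id);

-- Plan for the IN form.
EXPLAIN
SELECT r.*
  FROM replays r
 WHERE r.player_id IN (SELECT p.id
                         FROM players p);
```

Run the EXPLAIN statements against realistic data volumes, since the optimiser's choice depends on table sizes and statistics.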

JOIN versus EXISTS performance

NOT EXISTS is more efficient than using a LEFT OUTER JOIN to exclude records that are missing from the participating table using an IS NULL condition because the optimizer will elect to use an EXCLUSION MERGE JOIN with the NOT EXISTS predicate.

While your second test did not yield impressive results for the data sets you were using, the performance increase from NOT EXISTS over a LEFT JOIN becomes very noticeable as your data volumes increase. Keep in mind that the tables will need to be hash distributed by the columns that participate in the NOT EXISTS join, just like they would in the LEFT JOIN. Therefore, data skew can impact the performance of the EXCLUSION MERGE JOIN.
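For reference, the two formulations being compared look like this. The customers and orders tables and their columns are hypothetical, chosen only to make the sketch self-contained. Which physical join the optimizer picks is vendor-specific, so check the plan on your own system.

```sql
-- Hypothetical tables and sample rows for illustration only.
CREATE TABLE customers (customer_id INT);
CREATE TABLE orders (order_id INT, customer_id INT);

INSERT INTO customers (customer_id) VALUES (1), (2), (3);
INSERT INTO orders (order_id, customer_id) VALUES (10, 1), (11, 1);

-- LEFT OUTER JOIN with IS NULL: keep customers for which the join found no order.
SELECT c.customer_id
  FROM customers c
  LEFT OUTER JOIN orders o
    ON o.customer_id = c.customer_id
 WHERE o.customer_id IS NULL;

-- NOT EXISTS: the same customers (2 and 3 here), stated as an exclusion.
SELECT c.customer_id
  FROM customers c
 WHERE NOT EXISTS (SELECT NULL
                     FROM orders o
                    WHERE o.customer_id = c.customer_id);
```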

EDIT:

Typically, I would defer to EXISTS as a replacement for IN instead of using it for re-writing a join solution. This is especially true when the column(s) participating in the logical comparison can be NULL. That's not to say you couldn't use EXISTS in place of an INNER JOIN. Instead of an EXCLUSION JOIN you will end up with an INCLUSION JOIN. The INNER JOIN is in essence an inclusion join to begin with. I'm sure there are some nuances that I am overlooking but you can find those in the manuals if you wish to take the time to read them.
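The point about NULL is easiest to see with the negated forms. In this minimal sketch the parent and child tables and their rows are hypothetical; the behaviour shown follows from standard SQL three-valued logic.

```sql
-- Hypothetical tables; child.parent_id is nullable.
CREATE TABLE parent (id INT);
CREATE TABLE child (parent_id INT);

INSERT INTO parent (id) VALUES (1), (2), (3);
INSERT INTO child (parent_id) VALUES (1), (NULL);

-- NOT IN: returns no rows at all. Comparing any id with the NULL in the
-- subquery result is UNKNOWN, so the predicate is never TRUE.
SELECT p.id
  FROM parent p
 WHERE p.id NOT IN (SELECT c.parent_id
                      FROM child c);

-- NOT EXISTS: returns 2 and 3, because the NULL row simply never matches.
SELECT p.id
  FROM parent p
 WHERE NOT EXISTS (SELECT NULL
                     FROM child c
                    WHERE c.parent_id = p.id);
```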
