MySQL EXISTS vs IN -- correlated subquery vs subquery?
This is an RDBMS-agnostic answer, but may help nonetheless. In my understanding, the correlated (aka dependent) subquery is perhaps the most often falsely accused culprit for bad performance.
The problem (as it is most often described) is that it processes the inner query for every row of the outer query. Therefore, if the outer query returns 1,000 rows, and the inner query returns 10,000, then your query has to slog through 10,000,000 rows (outer×inner) to produce a result. Compared to the 11,000 rows (outer+inner) from a non-correlated query over the same result sets, that ain't good.
However, this is just the worst-case scenario. In many cases, the DBMS will be able to exploit indexes to drastically reduce the row count. Even if only the inner query can use an index, each pass over the 10,000 inner rows becomes ~13 seeks (about log2 of 10,000), which drops the total down to 13,000 (1,000 outer rows × 13 seeks).
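The arithmetic behind those three figures can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes an index lookup costs about log2(n) steps, which is a simplification and not how any particular engine's B-tree is costed.

```python
import math

outer_rows = 1_000    # rows returned by the outer query
inner_rows = 10_000   # rows returned by the inner query

# Worst case: the inner set is scanned in full for every outer row.
nested_scan = outer_rows * inner_rows

# Non-correlated case: each set is read once.
single_pass = outer_rows + inner_rows

# Indexed inner query: assume roughly log2(n) steps per lookup (an assumption).
seeks_per_lookup = int(math.log2(inner_rows))
indexed_lookups = outer_rows * seeks_per_lookup

print(nested_scan, single_pass, seeks_per_lookup, indexed_lookups)
# prints: 10000000 11000 13 13000
```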
The EXISTS operator can stop processing rows after the first match, cutting down the query cost further, especially when most outer rows match at least one inner row.
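To get a feel for how much the early exit saves, here is a toy model in Python. It is not how any database executes a query; `count_probes` is a hypothetical helper that simply counts how many inner rows a naive nested scan touches, with and without stopping at the first match.

```python
def count_probes(outer, inner, stop_at_first_match):
    """Count inner rows examined by a naive nested scan (toy model only)."""
    probes = 0
    for o in outer:
        for i in inner:
            probes += 1
            if i == o and stop_at_first_match:
                break  # EXISTS-style: one hit is enough for this outer row
    return probes

outer = list(range(100))
inner = list(range(100))  # every outer row has exactly one match

full_scan = count_probes(outer, inner, stop_at_first_match=False)
early_exit = count_probes(outer, inner, stop_at_first_match=True)
print(full_scan, early_exit)
# prints: 10000 5050
```

In this toy setup the early exit roughly halves the work; the saving shrinks when few outer rows have a match, since those rows still scan the whole inner set.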
In some rare cases, I have seen SQL Server 2008 R2 optimise correlated subqueries to a merge join (which traverses both sets only once, the best possible scenario) where a suitable index can be found in both inner and outer queries.
The real culprit for bad performance is not necessarily correlated subqueries, but nested scans.
Subqueries with EXISTS vs IN - MySQL
An EXPLAIN plan would have shown you exactly why you should use EXISTS. Usually the question comes up as EXISTS vs COUNT(*). EXISTS is faster. Why?
With regard to the challenges presented by NULL: when the subquery returns NULL, with IN the entire predicate becomes NULL, so you need to handle that as well. But using EXISTS, it's merely a false. Much easier to cope with. Simply put, IN can't compare anything with NULL, but EXISTS can. E.g.:
Exists (Select * from yourtable where bla = 'blabla');
You get true/false the moment one hit is found/matched. In this case IN sort of takes the position of the COUNT(*), selecting ALL matching rows based on the WHERE, because it's comparing all values.
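The NULL behaviour described above can be reproduced with a small script. The sketch below uses Python's built-in sqlite3 module as a stand-in, so it shows standard three-valued logic rather than anything MySQL-specific; the tables `table1` and `table2` are made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER);
    CREATE TABLE table2 (id INTEGER);
    INSERT INTO table1 VALUES (1), (2), (3), (4), (5);
    INSERT INTO table2 VALUES (1), (2), (3), (NULL);
""")

# 4 is not in table2, but table2 contains a NULL, so IN yields NULL, not false.
in_result = conn.execute("SELECT 4 IN (SELECT id FROM table2)").fetchone()[0]

# EXISTS only asks whether a row came back, so the answer is a plain 0 (false).
exists_result = conn.execute(
    "SELECT EXISTS (SELECT 1 FROM table2 WHERE id = 4)"
).fetchone()[0]

# The practical consequence: NOT IN returns nothing, NOT EXISTS returns 4 and 5.
not_in_rows = conn.execute(
    "SELECT id FROM table1 WHERE id NOT IN (SELECT id FROM table2) ORDER BY id"
).fetchall()
not_exists_rows = conn.execute(
    "SELECT id FROM table1 t1 WHERE NOT EXISTS "
    "(SELECT 1 FROM table2 t2 WHERE t2.id = t1.id) ORDER BY id"
).fetchall()

print(in_result, exists_result, not_in_rows, not_exists_rows)
# prints: None 0 [] [(4,), (5,)]
```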
But don't forget this either:
- EXISTS executes faster than IN when the subquery result is very large.
- IN gets ahead of EXISTS when the subquery result is very small.
References for more details:
- subquery using IN.
- IN - subquery optimization
- Join vs. sub-query.
What do I have to SELECT in a WHERE EXIST clause?
It doesn't matter. A good practice is to use SELECT 1 to indicate it is a non-data-returning subquery.
The select list is not evaluated and doesn't matter. In SQL Server you can even put SELECT 1/0 in the EXISTS subquery and it will not throw a divide-by-zero error.
Related: What is easier to read in EXISTS subqueries?
https://dba.stackexchange.com/questions/159413/exists-select-1-vs-exists-select-one-or-the-other
For the non-believers:
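-- Two table variables: @table1 holds 1..5, @table2 holds 1..3
DECLARE @table1 TABLE (id INT);
DECLARE @table2 TABLE (id INT);

INSERT INTO @table1
VALUES (1), (2), (3), (4), (5);

INSERT INTO @table2
VALUES (1), (2), (3);

-- The 1/0 sits in the select list of the EXISTS subquery, which is not evaluated
SELECT *
FROM @table1 t1
WHERE EXISTS (
    SELECT 1/0
    FROM @table2 t2
    WHERE t1.id = t2.id);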
EXISTS subquery: SELECT 1 or SELECT * FROM X performant in Postgres?
Per the documentation:
Since the result depends only on whether any rows are returned, and
not on the contents of those rows, the output list of the subquery is
normally unimportant.
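A quick way to convince yourself of this is to run the same EXISTS query with different select lists and compare the results. The sketch below does that with Python's built-in sqlite3 module rather than Postgres, so it only illustrates the semantics, not the Postgres planner; `table1` and `table2` are illustrative tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER);
    CREATE TABLE table2 (id INTEGER);
    INSERT INTO table1 VALUES (1), (2), (3), (4), (5);
    INSERT INTO table2 VALUES (1), (2), (3);
""")

results = {}
for select_list in ("1", "*", "NULL"):
    # Only the select list of the subquery changes between runs.
    query = (
        "SELECT id FROM table1 t1 WHERE EXISTS "
        f"(SELECT {select_list} FROM table2 t2 WHERE t2.id = t1.id) "
        "ORDER BY id"
    )
    results[select_list] = [row[0] for row in conn.execute(query)]

print(results)
# prints: {'1': [1, 2, 3], '*': [1, 2, 3], 'NULL': [1, 2, 3]}
```

Even SELECT NULL makes EXISTS true, because the test is whether a row exists, not what the row contains. Note that SQLite cannot be used to repeat the SELECT 1/0 experiment, since it returns NULL for division by zero instead of raising an error.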
Join vs. sub-query
Taken from the MySQL manual (13.2.10.11 Rewriting Subqueries as Joins):
A LEFT [OUTER] JOIN can be faster than an equivalent subquery because the server might be able to optimize it better—a fact that is not specific to MySQL Server alone.
So subqueries can be slower than LEFT [OUTER] JOIN, but in my opinion their strength is slightly higher readability.
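For illustration, one commonly cited rewrite of this kind turns a NOT IN subquery into a LEFT JOIN with an IS NULL filter. The sketch below checks that the two forms return the same rows, using Python's built-in sqlite3 module as a stand-in; it says nothing about which form is faster on MySQL, and it assumes the joined column contains no NULLs. The tables `t1` and `t2` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (id INTEGER);
    CREATE TABLE t2 (id INTEGER);
    INSERT INTO t1 VALUES (1), (2), (3), (4), (5);
    INSERT INTO t2 VALUES (1), (2), (3);
""")

# Subquery form: rows of t1 with no counterpart in t2.
subquery_rows = conn.execute(
    "SELECT id FROM t1 WHERE id NOT IN (SELECT id FROM t2) ORDER BY id"
).fetchall()

# Join form: unmatched t1 rows come back with t2.id set to NULL.
left_join_rows = conn.execute(
    "SELECT t1.id FROM t1 LEFT JOIN t2 ON t1.id = t2.id "
    "WHERE t2.id IS NULL ORDER BY t1.id"
).fetchall()

print(subquery_rows, left_join_rows)
# prints: [(4,), (5,)] [(4,), (5,)]
```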
JOIN or Correlated subquery with exists clause, which one is better
Generally, the EXISTS clause, because you may need DISTINCT for a JOIN for it to give the expected output. For example, if you have multiple Department rows for a ContactInformation row.
In your example above, the SELECT *:
- means different output too, so they are not actually equivalent
- gives less chance of an index being used, because you are pulling all columns out
That said, even with a limited column list, they will give the same plan until you need DISTINCT, which is why I say "EXISTS".
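The duplicate-row problem is easy to reproduce. The sketch below uses Python's built-in sqlite3 module; the table names Department and ContactInformation come from the question, while the columns (`id`, `name`, `contact_id`) and the sample rows are assumptions made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ContactInformation (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Department (id INTEGER PRIMARY KEY, contact_id INTEGER, name TEXT);
    INSERT INTO ContactInformation VALUES (1, 'Ann'), (2, 'Bob');
    -- Ann has two departments, Bob has none.
    INSERT INTO Department VALUES (10, 1, 'Sales'), (11, 1, 'Support');
""")

# Plain JOIN: one output row per matching Department row, so Ann appears twice.
join_rows = conn.execute(
    "SELECT c.name FROM ContactInformation c "
    "JOIN Department d ON d.contact_id = c.id ORDER BY c.name"
).fetchall()

# JOIN plus DISTINCT: duplicates are removed after the join.
distinct_rows = conn.execute(
    "SELECT DISTINCT c.name FROM ContactInformation c "
    "JOIN Department d ON d.contact_id = c.id ORDER BY c.name"
).fetchall()

# EXISTS: each ContactInformation row is tested once, so no duplicates arise.
exists_rows = conn.execute(
    "SELECT c.name FROM ContactInformation c WHERE EXISTS "
    "(SELECT 1 FROM Department d WHERE d.contact_id = c.id) ORDER BY c.name"
).fetchall()

print(join_rows, distinct_rows, exists_rows)
# prints: [('Ann',), ('Ann',)] [('Ann',)] [('Ann',)]
```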