Subquery using Exists 1 or Exists *
No, SQL Server is smart and knows it is being used for an EXISTS, and returns NO DATA to the system.
Quoth Microsoft:
http://technet.microsoft.com/en-us/library/ms189259.aspx?ppud=4
The select list of a subquery introduced by EXISTS almost always consists of an asterisk (*). There is no reason to list column names because you are just testing whether rows that meet the conditions specified in the subquery exist.
To check yourself, try running the following:
SELECT whatever
FROM yourtable
WHERE EXISTS( SELECT 1/0
FROM someothertable
WHERE a_valid_clause )
If SQL Server were actually evaluating the SELECT list, it would throw a divide-by-zero error. It doesn't.
EDIT: Note, the SQL Standard actually talks about this.
ANSI SQL 1992 Standard, pg 191 http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt
3) Case:
a) If the <select list> "*" is simply contained in a <subquery> that is immediately contained in an <exists predicate>, then the <select list> is equivalent to a <value expression> that is an arbitrary <literal>.
Exists / not exists: 'select 1' vs 'select field'
Yes, they are the same. exists checks if there is at least one row in the subquery. If so, it evaluates to true. The columns in the subquery don't matter in any way.
According to MSDN, exists:
Specifies a subquery to test for the existence of rows.
And Oracle:
An EXISTS condition tests for existence of rows in a subquery.
The MySQL documentation is perhaps even more explicit:
Traditionally, an EXISTS subquery starts with SELECT *, but it could begin with SELECT 5 or SELECT column1 or anything at all. MySQL ignores the SELECT list in such a subquery, so it makes no difference.
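If you want to see this for yourself, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in engine (an assumption on my part, not SQL Server or MySQL), and the customers and orders tables are made up for illustration. Only the select list inside the EXISTS subquery changes between runs:

```python
import sqlite3

# Hypothetical tables; SQLite stands in for the engines discussed above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 3);
""")


def customers_with_orders(select_list):
    """Run the same EXISTS query, varying only the subquery's select list."""
    sql = f"""
        SELECT c.name
        FROM customers c
        WHERE EXISTS (SELECT {select_list}
                      FROM orders o
                      WHERE o.customer_id = c.id)
        ORDER BY c.name
    """
    return [row[0] for row in conn.execute(sql)]


# Try an asterisk, a literal, a column, and even NULL as the select list.
results = {s: customers_with_orders(s) for s in ("*", "1", "o.id", "NULL")}
for select_list, names in results.items():
    print(select_list, names)
```

Every variant returns the same customers, because EXISTS only asks whether the subquery produces a row.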
EXISTS vs JOIN and use of EXISTS clause
EXISTS is used to return a boolean value; JOIN returns a whole other table.
EXISTS is only used to test if a subquery returns results, and short circuits as soon as it does. JOIN is used to extend a result set by combining it with additional fields from another table to which there is a relation.
In your example, the queries are semantically equivalent.
In general, use EXISTS when:
- You don't need to return data from the related table
- You have dupes in the related table (JOIN can cause duplicate rows if values are repeated)
- You want to check existence (use instead of a LEFT OUTER JOIN...NULL condition)
If you have proper indexes, most of the time the EXISTS will perform identically to the JOIN. The exception is on very complicated subqueries, where it is normally quicker to use EXISTS.
If your JOIN key is not indexed, it may be quicker to use EXISTS, but you will need to test for your specific circumstance.
JOIN syntax is normally easier to read and clearer as well.
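The point about dupes in the related table can be sketched as follows. This is only an illustration: it runs on SQLite through Python's sqlite3 module rather than SQL Server, and the products and order_details tables and their rows are invented for the example.

```python
import sqlite3

# Hypothetical data: product 1 appears three times in order_details.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE order_details (order_id INTEGER, product_id INTEGER);
    INSERT INTO products VALUES (1, 'Chai'), (2, 'Tofu'), (3, 'Ikura');
    INSERT INTO order_details VALUES (100, 1), (101, 1), (102, 1), (103, 2);
""")

# JOIN produces one output row per matching order_details row.
join_rows = conn.execute("""
    SELECT p.name
    FROM products p
    JOIN order_details od ON od.product_id = p.product_id
    ORDER BY p.product_id
""").fetchall()

# EXISTS produces at most one output row per product.
exists_rows = conn.execute("""
    SELECT p.name
    FROM products p
    WHERE EXISTS (SELECT *
                  FROM order_details od
                  WHERE od.product_id = p.product_id)
    ORDER BY p.product_id
""").fetchall()

print("JOIN:  ", join_rows)
print("EXISTS:", exists_rows)
```

With the JOIN you would need a DISTINCT (or a GROUP BY) to get back to one row per product; the EXISTS version never multiplies rows in the first place.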
Where does the practice exists (select 1 from ...) come from?
The main part of your question is - "where did this myth come from?"
So to answer that, I guess one of the first performance hints people learn with SQL is that select * is inefficient in most situations. The fact that it isn't inefficient in this specific situation is hence somewhat counterintuitive. So it's not surprising that people are skeptical about it. But some simple research or experiments should be enough to banish most myths. Although human history kinda shows that myths are quite hard to banish.
NOT IN vs NOT EXISTS
I always default to NOT EXISTS.
The execution plans may be the same at the moment, but if either column is altered in the future to allow NULLs the NOT IN version will need to do more work (even if no NULLs are actually present in the data), and the semantics of NOT IN if NULLs are present are unlikely to be the ones you want anyway.
When neither Products.ProductID nor [Order Details].ProductID allows NULLs, the NOT IN will be treated identically to the following query.
SELECT ProductID,
ProductName
FROM Products p
WHERE NOT EXISTS (SELECT *
FROM [Order Details] od
WHERE p.ProductId = od.ProductId)
The exact plan may vary but for my example data I get the following.
A reasonably common misconception seems to be that correlated sub queries are always "bad" compared to joins. They certainly can be when they force a nested loops plan (sub query evaluated row by row) but this plan includes an anti semi join logical operator. Anti semi joins are not restricted to nested loops but can use hash or merge (as in this example) joins too.
/*Not valid syntax but better reflects the plan*/
SELECT p.ProductID,
p.ProductName
FROM Products p
LEFT ANTI SEMI JOIN [Order Details] od
ON p.ProductId = od.ProductId
If [Order Details].ProductID is NULL-able the query then becomes
SELECT ProductID,
ProductName
FROM Products p
WHERE NOT EXISTS (SELECT *
FROM [Order Details] od
WHERE p.ProductId = od.ProductId)
AND NOT EXISTS (SELECT *
FROM [Order Details]
WHERE ProductId IS NULL)
The reason for this is that the correct semantics if [Order Details] contains any NULL ProductIds is to return no results. See the extra anti semi join and row count spool added to the plan to verify this.
If Products.ProductID is also changed to become NULL-able the query then becomes
SELECT ProductID,
ProductName
FROM Products p
WHERE NOT EXISTS (SELECT *
FROM [Order Details] od
WHERE p.ProductId = od.ProductId)
AND NOT EXISTS (SELECT *
FROM [Order Details]
WHERE ProductId IS NULL)
AND NOT EXISTS (SELECT *
FROM (SELECT TOP 1 *
FROM [Order Details]) S
WHERE p.ProductID IS NULL)
The reason for that one is that a NULL Products.ProductId should not be returned in the results unless the NOT IN sub query returns no results at all (i.e. the [Order Details] table is empty), in which case it should. In the plan for my sample data this is implemented by adding another anti semi join as below.
The effect of this is shown in the blog post already linked by Buckley. In the example there, the number of logical reads increases from around 400 to 500,000.
Additionally, the fact that a single NULL can reduce the row count to zero makes cardinality estimation very difficult. If SQL Server assumes that this will happen but in fact there were no NULL rows in the data, the rest of the execution plan may be catastrophically worse if this is just part of a larger query, with inappropriate nested loops causing repeated execution of an expensive sub tree, for example.
This is not the only possible execution plan for a NOT IN on a NULL-able column, however. This article shows another one for a query against the AdventureWorks2008 database.
For the NOT IN on a NOT NULL column or the NOT EXISTS against either a nullable or non nullable column it gives the following plan.
When the column changes to NULL-able the NOT IN plan now looks like
It adds an extra inner join operator to the plan. This apparatus is explained here. It is all there to convert the previous single correlated index seek on Sales.SalesOrderDetail.ProductID = <correlated_product_id> to two seeks per outer row. The additional one is on WHERE Sales.SalesOrderDetail.ProductID IS NULL.
As this is under an anti semi join, if that one returns any rows the second seek will not occur. However, if Sales.SalesOrderDetail does not contain any NULL ProductIDs it will double the number of seek operations required.
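The semantic difference described above (a single NULL in the sub query makes NOT IN return nothing) is standard SQL behaviour and easy to reproduce. Below is a small sketch using Python's sqlite3 module instead of SQL Server; the lowercase products and order_details tables are simplified, hypothetical stand-ins for the Northwind tables in the answer.

```python
import sqlite3

# Hypothetical data: one order_details row has a NULL product_id.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (product_id INTEGER, name TEXT);
    CREATE TABLE order_details (order_id INTEGER, product_id INTEGER);
    INSERT INTO products VALUES (1, 'Chai'), (2, 'Tofu'), (3, 'Ikura');
    INSERT INTO order_details VALUES (100, 1), (101, NULL);
""")

NOT_IN = """
    SELECT p.name
    FROM products p
    WHERE p.product_id NOT IN (SELECT product_id FROM order_details)
    ORDER BY p.product_id
"""

NOT_EXISTS = """
    SELECT p.name
    FROM products p
    WHERE NOT EXISTS (SELECT *
                      FROM order_details od
                      WHERE od.product_id = p.product_id)
    ORDER BY p.product_id
"""


def names(sql):
    """Return the product names produced by a query."""
    return [row[0] for row in conn.execute(sql)]


# With a NULL in the sub query, NOT IN can never evaluate to true.
with_null_not_in = names(NOT_IN)
with_null_not_exists = names(NOT_EXISTS)

# Remove the NULL row and the two forms agree again.
conn.execute("DELETE FROM order_details WHERE product_id IS NULL")
without_null_not_in = names(NOT_IN)
without_null_not_exists = names(NOT_EXISTS)

print(with_null_not_in, with_null_not_exists)
print(without_null_not_in, without_null_not_exists)
```

This only shows the result sets, not the plan shapes; the extra anti semi joins and seeks discussed above are specific to SQL Server's optimizer.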
SQL Server: IF EXISTS massively slowing down a query
Did you try running the original query with TOP 1? Most likely it will be just as slow.
Sometimes when the optimizer thinks that something is very likely and going to return a vast set of data with little effort (i.e. almost all records are going to get returned), it chooses mostly loop joins because it only needs to get the first one and a loop join is good for only getting a couple records. When that turns out to not be true, it takes forever and a day to get results.
In your case, it sounds like it's very rare, so this choice hurts badly. Try instead doing something like SELECT @count = COUNT(*) FROM ... and then checking if that count is non-zero.
If Exists command in setting a variable
Set the variable in each statement:
IF EXISTS (SELECT PID_GUID FROM PID WHERE EDI_ID = '12874' OR PID = 'ROBERT' OR PID = 'R595')
BEGIN
SELECT @OLDPID = PID_GUID
FROM PID
WHERE EDI_ID = '12874' OR PID = 'ROBERT' OR PID = 'R595'
END
ELSE
SELECT @OLDPID = 'a70600f4-1cff-4284-a2ce-5eb19f47cf19';
Actually, I would be more inclined to use:
DECLARE @OLDPID VARCHAR(36) = 'a70600f4-1cff-4284-a2ce-5eb19f47cf19';
IF EXISTS (SELECT PID_GUID
FROM PID
WHERE EDI_ID = '12874' OR PID = 'ROBERT' OR PID = 'R595'
)
BEGIN
SELECT @OLDPID = PID_GUID
FROM PID
WHERE EDI_ID = '12874' OR PID = 'ROBERT' OR PID = 'R595';
END;
SQL Server: Is SELECTing a literal value faster than SELECTing a field?
For google's sake, I'll update this question with the same answer as this one (Subquery using Exists 1 or Exists *) since (currently) an incorrect answer is marked as accepted. Note the SQL standard actually says that EXISTS via * is identical to a constant.
No. This has been covered a bazillion times. SQL Server is smart and knows it is being used for an EXISTS, and returns NO DATA to the system.
Quoth Microsoft:
http://technet.microsoft.com/en-us/library/ms189259.aspx?ppud=4
The select list of a subquery introduced by EXISTS almost always consists of an asterisk (*). There is no reason to list column names because you are just testing whether rows that meet the conditions specified in the subquery exist.
Also, don't believe me? Try running the following:
SELECT whatever
FROM yourtable
WHERE EXISTS( SELECT 1/0
FROM someothertable
WHERE a_valid_clause )
If SQL Server were actually evaluating the SELECT list, it would throw a divide-by-zero error. It doesn't.
EDIT: Note, the SQL Standard actually talks about this.
ANSI SQL 1992 Standard, pg 191 http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt
3) Case:
a) If the <select list> "*" is simply contained in a <subquery> that is immediately contained in an <exists predicate>, then the <select list> is equivalent to a <value expression> that is an arbitrary <literal>.
Determine if any values satisfy a condition
You can return one row or no rows using this query:
SELECT TOP 1 1 as row_exists
FROM MyTable
WHERE [state] = 12 AND age > 110;
You can return 1 or NULL by using it as a scalar subquery:
SELECT (SELECT TOP 1 1 FROM MyTable WHERE [state] = 12 AND age > 110
) as row_exists;
You can put this into T-SQL using:
IF (EXISTS (SELECT 1 FROM MyTable WHERE [state] = 12 AND age > 110))
BEGIN
. . .
END;
TOP is not needed in an EXISTS subquery.
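For a quick way to watch the difference between "no row" and NULL, here is a rough sketch run on SQLite through Python's sqlite3 module. Note the assumptions: SQLite has no TOP, so LIMIT 1 stands in for TOP 1, the sample rows are invented, and probe is just a helper name for this example.

```python
import sqlite3

# Hypothetical sample data for the MyTable used above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE MyTable (state INTEGER, age INTEGER);
    INSERT INTO MyTable VALUES (12, 45), (12, 111), (7, 120);
""")


def probe(min_age):
    """Return (rows from the plain query, value from the scalar subquery)."""
    # Plain query: yields one row when a match exists, otherwise no rows.
    rows = conn.execute(
        "SELECT 1 AS row_exists FROM MyTable "
        "WHERE state = 12 AND age > ? LIMIT 1",
        (min_age,),
    ).fetchall()
    # Scalar subquery: always yields one row, holding 1 or NULL (None).
    scalar = conn.execute(
        "SELECT (SELECT 1 FROM MyTable "
        "WHERE state = 12 AND age > ? LIMIT 1) AS row_exists",
        (min_age,),
    ).fetchone()[0]
    return rows, scalar


match = probe(110)      # a row with state 12 and age 111 exists
no_match = probe(200)   # nothing is that old

print(match)
print(no_match)
```

The scalar form is handy when the caller always expects exactly one row back; the plain form is closer to what an EXISTS check does internally.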