Subquery Using Exists 1 or Exists *

Subquery using Exists 1 or Exists *

No, SQL Server is smart and knows it is being used for an EXISTS, and returns NO DATA to the system.

Quoth Microsoft:
http://technet.microsoft.com/en-us/library/ms189259.aspx?ppud=4

The select list of a subquery
introduced by EXISTS almost always
consists of an asterisk (*). There is
no reason to list column names because
you are just testing whether rows that
meet the conditions specified in the
subquery exist.

To check yourself, try running the following:

SELECT whatever
FROM yourtable
WHERE EXISTS( SELECT 1/0
FROM someothertable
WHERE a_valid_clause )

If it was actually doing something with the SELECT list, it would throw a div by zero error. It doesn't.

EDIT: Note, the SQL Standard actually talks about this.

ANSI SQL 1992 Standard, pg 191 http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt

3) Case:

a) If the <select list> "*" is simply contained in a <subquery> that
is immediately contained in an <exists predicate>, then the <select list> is
equivalent to a <value expression>
that is an arbitrary <literal>.

Exists / not exists: 'select 1' vs 'select field'

Yes, they are the same. exists checks if there is at least one row in the sub query. If so, it evaluates to true. The columns in the sub query don't matter in any way.

According to MSDN, exists:

Specifies a subquery to test for the existence of rows.

And Oracle:

An EXISTS condition tests for existence of rows in a subquery.

Maybe the MySQL documentation is even more explaining:

Traditionally, an EXISTS subquery starts with SELECT *, but it could begin with SELECT 5 or SELECT column1 or anything at all. MySQL ignores the SELECT list in such a subquery, so it makes no difference.

EXISTS vs JOIN and use of EXISTS clause

EXISTS is used to return a boolean value, JOIN returns a whole other table

EXISTS is only used to test if a subquery returns results, and short circuits as soon as it does. JOIN is used to extend a result set by combining it with additional fields from another table to which there is a relation.

In your example, the queries are semantically equivalent.

In general, use EXISTS when:

  • You don't need to return data from the related table
  • You have dupes in the related table (JOIN can cause duplicate rows if values are repeated)
  • You want to check existence (use instead of LEFT OUTER JOIN...NULL condition)

If you have proper indexes, most of the time the EXISTS will perform identically to the JOIN. The exception is on very complicated subqueries, where it is normally quicker to use EXISTS.

If your JOIN key is not indexed, it may be quicker to use EXISTS but you will need to test for your specific circumstance.

JOIN syntax is easier to read and clearer normally as well.

Where does the practice exists (select 1 from ...) come from?

The main part of your question is - "where did this myth come from?"

So to answer that, I guess one of the first performance hints people learn with sql is that select * is inefficient in most situations. The fact that it isn't inefficient in this specific situation is hence somewhat counter intuitive. So its not surprising that people are skeptical about it. But some simple research or experiments should be enough to banish most myths. Although human history kinda shows that myths are quite hard to banish.

NOT IN vs NOT EXISTS

I always default to NOT EXISTS.

The execution plans may be the same at the moment but if either column is altered in the future to allow NULLs the NOT IN version will need to do more work (even if no NULLs are actually present in the data) and the semantics of NOT IN if NULLs are present are unlikely to be the ones you want anyway.

When neither Products.ProductID or [Order Details].ProductID allow NULLs the NOT IN will be treated identically to the following query.

SELECT ProductID,
ProductName
FROM Products p
WHERE NOT EXISTS (SELECT *
FROM [Order Details] od
WHERE p.ProductId = od.ProductId)

The exact plan may vary but for my example data I get the following.

Neither NULL

A reasonably common misconception seems to be that correlated sub queries are always "bad" compared to joins. They certainly can be when they force a nested loops plan (sub query evaluated row by row) but this plan includes an anti semi join logical operator. Anti semi joins are not restricted to nested loops but can use hash or merge (as in this example) joins too.

/*Not valid syntax but better reflects the plan*/ 
SELECT p.ProductID,
p.ProductName
FROM Products p
LEFT ANTI SEMI JOIN [Order Details] od
ON p.ProductId = od.ProductId

If [Order Details].ProductID is NULL-able the query then becomes

SELECT ProductID,
ProductName
FROM Products p
WHERE NOT EXISTS (SELECT *
FROM [Order Details] od
WHERE p.ProductId = od.ProductId)
AND NOT EXISTS (SELECT *
FROM [Order Details]
WHERE ProductId IS NULL)

The reason for this is that the correct semantics if [Order Details] contains any NULL ProductIds is to return no results. See the extra anti semi join and row count spool to verify this that is added to the plan.

One NULL

If Products.ProductID is also changed to become NULL-able the query then becomes

SELECT ProductID,
ProductName
FROM Products p
WHERE NOT EXISTS (SELECT *
FROM [Order Details] od
WHERE p.ProductId = od.ProductId)
AND NOT EXISTS (SELECT *
FROM [Order Details]
WHERE ProductId IS NULL)
AND NOT EXISTS (SELECT *
FROM (SELECT TOP 1 *
FROM [Order Details]) S
WHERE p.ProductID IS NULL)

The reason for that one is because a NULL Products.ProductId should not be returned in the results except if the NOT IN sub query were to return no results at all (i.e. the [Order Details] table is empty). In which case it should. In the plan for my sample data this is implemented by adding another anti semi join as below.

Both NULL

The effect of this is shown in the blog post already linked by Buckley. In the example there the number of logical reads increase from around 400 to 500,000.

Additionally the fact that a single NULL can reduce the row count to zero makes cardinality estimation very difficult. If SQL Server assumes that this will happen but in fact there were no NULL rows in the data the rest of the execution plan may be catastrophically worse, if this is just part of a larger query, with inappropriate nested loops causing repeated execution of an expensive sub tree for example.

This is not the only possible execution plan for a NOT IN on a NULL-able column however. This article shows another one for a query against the AdventureWorks2008 database.

For the NOT IN on a NOT NULL column or the NOT EXISTS against either a nullable or non nullable column it gives the following plan.

Not EXists

When the column changes to NULL-able the NOT IN plan now looks like

Not In - Null

It adds an extra inner join operator to the plan. This apparatus is explained here. It is all there to convert the previous single correlated index seek on Sales.SalesOrderDetail.ProductID = <correlated_product_id> to two seeks per outer row. The additional one is on WHERE Sales.SalesOrderDetail.ProductID IS NULL.

As this is under an anti semi join if that one returns any rows the second seek will not occur. However if Sales.SalesOrderDetail does not contain any NULL ProductIDs it will double the number of seek operations required.

SQL Server: IF EXISTS massively slowing down a query

Did you try running the original query with TOP 1? most likely it will be just as slow.

Sometimes when the optimizer thinks that something is very likely and going to return a vast set of data with little effort (i.e. almost all records are going to get returned), it chooses mostly loop joins because it only needs to get the first one and a loop join is good for only getting a couple records. When that turns out to not be true, it takes forever and a day to get results.

In your case, it sounds like it's very rare, so this choice hurts badly. Try instead doing something like SELECT @count = COUNT(*) FROM ... and then checking if that count is non-zero.

If Exists command in setting a variable

Set the variable in each statement:

IF EXISTS (SELECT PID_GUID FROM PID WHERE EDI_ID = '12874' OR PID = 'ROBERT' OR PID = 'R595')
BEGIN
SELECT @OLDPID = PID_GUID
FROM PID
WHERE EDI_ID = '12874' OR PID = 'ROBERT' OR PID = 'R595'
End
ELSE
SELECT @OLDPID = 'a70600f4-1cff-4284-a2ce-5eb19f47cf19';

Actually, I would be more inclined to use:

DECLARE @OLDPID VARCHAR(36) = 'a70600f4-1cff-4284-a2ce-5eb19f47cf19';
IF EXISTS (SELECT PID_GUID
FROM PID
WHERE EDI_ID = '12874' OR PID = 'ROBERT' OR PID = 'R595'
)
BEGIN
SELECT @OLDPID = PID_GUID
FROM PID
WHERE EDI_ID = '12874' OR PID = 'ROBERT' OR PID = 'R595';
END;

SQL Server: Is SELECTing a literal value faster than SELECTing a field?

For google's sake, I'll update this question with the same answer as this one (Subquery using Exists 1 or Exists *) since (currently) an incorrect answer is marked as accepted. Note the SQL standard actually says that EXISTS via * is identical to a constant.

No. This has been covered a bazillion times. SQL Server is smart and knows it is being used for an EXISTS, and returns NO DATA to the system.

Quoth Microsoft:
http://technet.microsoft.com/en-us/library/ms189259.aspx?ppud=4

The select list of a subquery
introduced by EXISTS almost always
consists of an asterisk (*). There is
no reason to list column names because
you are just testing whether rows that
meet the conditions specified in the
subquery exist.

Also, don't believe me? Try running the following:

SELECT whatever
FROM yourtable
WHERE EXISTS( SELECT 1/0
FROM someothertable
WHERE a_valid_clause )

If it was actually doing something with the SELECT list, it would throw a div by zero error. It doesn't.

EDIT: Note, the SQL Standard actually talks about this.

ANSI SQL 1992 Standard, pg 191 http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt

     3) Case:

a) If the <select list> "*" is simply contained in a <subquery> that is immediately contained in an <exists predicate>, then the <select list> is equivalent to a <value expression> that is an arbitrary <literal>.

Determine if any values satisfy a condition

You can use a subquery to return 1 or no row using this query:

SELECT TOP 1 1 as row_exists
FROM MyTable
WHERE [state] = 12 AND age > 110;

You can use a subquery to return 1 or NULL using this as a subquery:

SELECT (SELECT TOP 1 1 FROM MyTable WHERE [state] = 12 AND age > 110
) as row_exists;

You can put this into T-SQL using:

IF (EXISTS (SELECT 1 FROM MyTable WHERE [state] = 12 AND age > 110))
BEGIN
. . .
END;

TOP is not needed in an EXISTS subquery.



Related Topics



Leave a reply



Submit