SQL Efficient Way to Join a Table Where All Values Exist

EXISTS vs JOIN and use of EXISTS clause

EXISTS is used to return a boolean value, JOIN returns a whole other table

EXISTS is only used to test if a subquery returns results, and short circuits as soon as it does. JOIN is used to extend a result set by combining it with additional fields from another table to which there is a relation.

In your example, the queries are semantically equivalent.

In general, use EXISTS when:

  • You don't need to return data from the related table
  • You have dupes in the related table (JOIN can cause duplicate rows if values are repeated)
  • You want to check existence (use instead of LEFT OUTER JOIN...NULL condition)

If you have proper indexes, most of the time the EXISTS will perform identically to the JOIN. The exception is on very complicated subqueries, where it is normally quicker to use EXISTS.

If your JOIN key is not indexed, it may be quicker to use EXISTS but you will need to test for your specific circumstance.

JOIN syntax is easier to read and clearer normally as well.

Fastest way to determine if record exists

SELECT TOP 1 products.id FROM products WHERE products.id = ?; will outperform all of your suggestions as it will terminate execution after it finds the first record.

Most efficient way to SELECT rows WHERE the ID EXISTS IN a second table

Summary:

IN and EXISTS performed similarly in all scenarios.. Below are the parameters used to validate..

Execution cost,Time:

Same for both and optimizer produced same plan.

Memory Grant:

Same for both queries

Cpu Time,Logical reads :

Exists seems to outperform IN by little bit margin in terms of CPU Time,though reads are same..

I ran each query 10 times each using below test data set..

  1. A very large subquery result set (100000 rows)
  2. Duplicate rows
  3. Null rows

For all the above scenarios, both IN and EXISTS performed in identical manner.

Some info about Performance V3 database used for testing.
20000 customers having 1000000 orders, so each customer is randomly duplicated (in a range of 10 to 100) in the orders table.

Execution cost,Time:

Below is screenshot of both queries running. Observe each query relative cost.

Sample Image

Memory Cost:

Memory grant for the two queries is also same..I Forced MDOP 1 so as not to spill them to TEMPDB..

Sample Image

CPU Time ,Reads:

For Exists:

Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Customers'. Scan count 1, logical reads 109, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Orders'. Scan count 1, logical reads 3855, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(1 row(s) affected)

SQL Server Execution Times:
CPU time = 469 ms, elapsed time = 595 ms.
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.

For IN:

(20000 row(s) affected)
Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Customers'. Scan count 1, logical reads 109, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Orders'. Scan count 1, logical reads 3855, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(1 row(s) affected)

SQL Server Execution Times:
CPU time = 547 ms, elapsed time = 669 ms.
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.

In each case, the optimizer is smart enough to rearrange the queries.

I tend to use EXISTS only though (my opinion). One use case to use EXISTS is when you don't want to return a second table result set.

Update as per queries from Martin Smith:

I ran the below queries to find the most effective way to get rows from the first table for which a reference exists in the second table.

SELECT DISTINCT c.*
FROM Customers c
JOIN Orders o ON o.custid = c.custid

SELECT c.*
FROM Customers c
INNER JOIN (SELECT DISTINCT custid FROM Orders) AS o ON o.custid = c.custid

SELECT *
FROM Customers C
WHERE EXISTS(SELECT 1 FROM Orders o WHERE o.custid = c.custid)

SELECT *
FROM Customers c
WHERE custid IN (SELECT custid FROM Orders)

All the above queries share the same cost with the exception of 2nd INNER JOIN, Plan being the same for the rest.

Sample Image

Memory Grant:

This query

SELECT DISTINCT c.*
FROM Customers c
JOIN Orders o ON o.custid = c.custid

required memory grant of

Sample Image

This query

SELECT c.*
FROM Customers c
INNER JOIN (SELECT DISTINCT custid FROM Orders) AS o ON o.custid = c.custid

required memory grant of ..

Sample Image

CPU Time,Reads:

For Query :

SELECT DISTINCT c.*
FROM Customers c
JOIN Orders o ON o.custid = c.custid

(20000 row(s) affected)
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Workfile'. Scan count 48, logical reads 1344, physical reads 96, read-ahead reads 1248, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Orders'. Scan count 5, logical reads 3929, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Customers'. Scan count 5, logical reads 322, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 1453 ms, elapsed time = 781 ms.

For Query:

SELECT c.*
FROM Customers c
INNER JOIN (SELECT DISTINCT custid FROM Orders) AS o ON o.custid = c.custid

(20000 row(s) affected)
Table 'Customers'. Scan count 5, logical reads 322, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Orders'. Scan count 5, logical reads 3929, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 1499 ms, elapsed time = 403 ms.

Joining a list of values with table rows in SQL

If in Microsoft SQL Server 2008 or later, then you can use Table Value Constructor

 Select v.valueId, m.name 
From (values (1), (2), (3), (4), (5)) v(valueId)
left Join otherTable m
on m.id = v.valueId

Postgres also has this construction VALUES Lists:

SELECT * FROM (VALUES (1, 'one'), (2, 'two'), (3, 'three')) AS t (num,letter)

Also note the possible Common Table Expression syntax which can be handy to make joins:

WITH my_values(num, str) AS (
VALUES (1, 'one'), (2, 'two'), (3, 'three')
)
SELECT num, txt FROM my_values

With Oracle it's possible, though heavier From ASK TOM:

with id_list as (
select 10 id from dual union all
select 20 id from dual union all
select 25 id from dual union all
select 70 id from dual union all
select 90 id from dual
)
select * from id_list;

Using a left join and checking if the row existed along with another check in where clause

According to this answer, in SQL-Server using NOT EXISTS is more efficient than LEFT JOIN/IS NULL

SELECT  *
FROM Users u
WHERE u.IsActive = 1
AND u.Status <> 'disabled'
AND NOT EXISTS (SELECT 1 FROM Banned b WHERE b.UserID = u.UserID)

EDIT

For the sake of completeness this is how I would do it with a LEFT JOIN:

SELECT  *
FROM Users u
LEFT JOIN Banned b
ON b.UserID = u.UserID
WHERE u.IsActive = 1
AND u.Status <> 'disabled'
AND b.UserID IS NULL -- EXCLUDE ROWS WITH A MATCH IN `BANNED`

How to select rows with no matching entry in another table?

Here's a simple query:

SELECT t1.ID
FROM Table1 t1
LEFT JOIN Table2 t2 ON t1.ID = t2.ID
WHERE t2.ID IS NULL

The key points are:

  1. LEFT JOIN is used; this will return ALL rows from Table1, regardless of whether or not there is a matching row in Table2.

  2. The WHERE t2.ID IS NULL clause; this will restrict the results returned to only those rows where the ID returned from Table2 is null - in other words there is NO record in Table2 for that particular ID from Table1. Table2.ID will be returned as NULL for all records from Table1 where the ID is not matched in Table2.

Find records where join doesn't exist

Use an EXISTS expression:

WHERE NOT EXISTS (
SELECT FROM votes v -- SELECT list can be empty
WHERE v.some_id = base_table.some_id
AND v.user_id = ?
)

The difference

... between NOT EXISTS() (Ⓔ) and NOT IN() (Ⓘ) is twofold:

  1. Performance

    Ⓔ is generally faster. It stops processing the subquery as soon as the first match is found. The manual:

    The subquery will generally only be executed long enough to determine
    whether at least one row is returned, not all the way to completion.

    Ⓘ can also be optimized by the query planner, but to a lesser extent since NULL handling makes it more complex.

  2. Correctness

    If one of the resulting values in the subquery expression is NULL, the result of Ⓘ is NULL, while common logic would expect TRUE - and Ⓔ will return TRUE. The manual:

    If all the per-row results are either unequal or null, with at least
    one null, then the result of NOT IN is null.

Essentially, (NOT) EXISTS is the better choice in most cases.

Example

Your query can look like this:

SELECT *
FROM questions q
WHERE NOT EXISTS (
SELECT FROM votes v
WHERE v.question_id = q.id
AND v.user_id = ?
);

Do not join to votes in the base query. That would void the effort.

Besides NOT EXISTS and NOT IN there are additional syntax options with LEFT JOIN / IS NULL and EXCEPT. See:

  • Select rows which are not present in other table

Find records from one table which don't exist in another

There's several different ways of doing this, with varying efficiency, depending on how good your query optimiser is, and the relative size of your two tables:

This is the shortest statement, and may be quickest if your phone book is very short:

SELECT  *
FROM Call
WHERE phone_number NOT IN (SELECT phone_number FROM Phone_book)

alternatively (thanks to Alterlife)

SELECT *
FROM Call
WHERE NOT EXISTS
(SELECT *
FROM Phone_book
WHERE Phone_book.phone_number = Call.phone_number)

or (thanks to WOPR)

SELECT * 
FROM Call
LEFT OUTER JOIN Phone_Book
ON (Call.phone_number = Phone_book.phone_number)
WHERE Phone_book.phone_number IS NULL

(ignoring that, as others have said, it's normally best to select just the columns you want, not '*')

How to select all records from one table that do not exist in another table?

SELECT t1.name
FROM table1 t1
LEFT JOIN table2 t2 ON t2.name = t1.name
WHERE t2.name IS NULL

Q: What is happening here?

A: Conceptually, we select all rows from table1 and for each row we attempt to find a row in table2 with the same value for the name column. If there is no such row, we just leave the table2 portion of our result empty for that row. Then we constrain our selection by picking only those rows in the result where the matching row does not exist. Finally, We ignore all fields from our result except for the name column (the one we are sure that exists, from table1).

While it may not be the most performant method possible in all cases, it should work in basically every database engine ever that attempts to implement ANSI 92 SQL



Related Topics



Leave a reply



Submit