Can an INNER JOIN offer better performance than EXISTS
Generally speaking, INNER JOIN and EXISTS are different things.
The former can return duplicate rows and columns from both tables; the latter, being a predicate, returns each matching row at most once and only from one table.
If you do an inner join on a UNIQUE column, they exhibit the same performance.
If you do an inner join on a recordset with DISTINCT applied (to get rid of the duplicates), EXISTS is usually faster.
IN and EXISTS clauses (with an equijoin correlation) usually employ one of several SEMI JOIN algorithms, which are usually more efficient than a DISTINCT on one of the tables.
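To make the duplicate-row difference concrete, here is a minimal sketch using Python's built-in sqlite3 module. The customers and orders tables are made up for illustration; they are not from the question, and SQLite's plans will differ from those of other engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables: Ann has two orders, Bob one, Cid none.
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cid');
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 2);
""")

# INNER JOIN: one output row per matching pair, so Ann appears twice.
join_rows = conn.execute("""
    SELECT c.name FROM customers c
    JOIN orders o ON o.customer_id = c.id
    ORDER BY c.name
""").fetchall()

# EXISTS: a predicate on customers, so each customer appears at most once.
exists_rows = conn.execute("""
    SELECT c.name FROM customers c
    WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)
    ORDER BY c.name
""").fetchall()

print(join_rows)    # [('Ann',), ('Ann',), ('Bob',)]
print(exists_rows)  # [('Ann',), ('Bob',)]
```

If orders.customer_id were unique, both queries would return the same rows.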
See this article in my blog:
- IN vs. JOIN vs. EXISTS
How to improve performance of a SQL JOIN with an OR condition
THIS ANSWERS THE ORIGINAL VERSION OF THE QUESTION.
You may be better off with EXISTS:
SELECT FIRST.ID, FIRST.NAME
FROM FIRST_TABLE FIRST
WHERE EXISTS (SELECT 1 FROM SECOND_TABLE SECOND WHERE SECOND.FIRST_NAME = FIRST.NAME) OR
EXISTS (SELECT 1 FROM SECOND_TABLE SECOND WHERE SECOND.LAST_NAME = FIRST.NAME);
Then for performance, you want indexes on SECOND_TABLE(LAST_NAME) and SECOND_TABLE(FIRST_NAME).
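A runnable sketch of the same idea, using Python's sqlite3 module as a stand-in for whatever database the question targets. The sample rows, the index names and the short aliases f and s are my assumptions; the printed plan is SQLite's and only hints at how another optimizer might use the two indexes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE first_table (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE second_table (first_name TEXT, last_name TEXT);
    -- One index per correlated column, as suggested above (names are hypothetical).
    CREATE INDEX idx_second_first_name ON second_table (first_name);
    CREATE INDEX idx_second_last_name ON second_table (last_name);
    INSERT INTO first_table VALUES (1, 'Lee'), (2, 'Kim'), (3, 'Roy');
    INSERT INTO second_table VALUES ('Lee', 'Park'), ('Ada', 'Kim'), ('Lee', 'Kim');
""")

query = """
    SELECT f.id, f.name
    FROM first_table f
    WHERE EXISTS (SELECT 1 FROM second_table s WHERE s.first_name = f.name)
       OR EXISTS (SELECT 1 FROM second_table s WHERE s.last_name = f.name)
    ORDER BY f.id
"""
matches = conn.execute(query).fetchall()
print(matches)  # [(1, 'Lee'), (2, 'Kim')]

# The last column of each EXPLAIN QUERY PLAN row is the human-readable step.
for row in conn.execute("EXPLAIN QUERY PLAN " + query):
    print(row[-1])
```

Each name comes back once even though 'Lee' and 'Kim' both match more than one row in second_table.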
JOIN versus EXISTS performance
NOT EXISTS is more efficient than using a LEFT OUTER JOIN to exclude records that are missing from the participating table using an IS NULL condition because the optimizer will elect to use an EXCLUSION MERGE JOIN with the NOT EXISTS predicate.
While your second test did not yield impressive results for the data sets you were using, the performance increase from NOT EXISTS over a LEFT JOIN becomes very noticeable as your data volumes increase. Keep in mind that the tables will need to be hash distributed by the columns that participate in the NOT EXISTS join, just like they would in the LEFT JOIN. Therefore, data skew can impact the performance of the EXCLUSION MERGE JOIN.
EDIT:
Typically, I would defer to EXISTS as a replacement for IN instead of using it for re-writing a join solution. This is especially true when the column(s) participating in the logical comparison can be NULL. That's not to say you couldn't use EXISTS in place of an INNER JOIN. Instead of an EXCLUSION JOIN you will end up with an INCLUSION JOIN. The INNER JOIN is in essence an inclusion join to begin with. I'm sure there are some nuances that I am overlooking but you can find those in the manuals if you wish to take the time to read them.
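The NULL point is easy to see on a toy data set. The sketch below uses Python's sqlite3 module with hypothetical parent and child tables, so it says nothing about the plans the optimizer in question would pick; it only shows that NOT EXISTS and the LEFT OUTER JOIN ... IS NULL form agree, while NOT IN returns nothing once the subquery yields a NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables; child.parent_id is nullable.
    CREATE TABLE parent (id INTEGER);
    CREATE TABLE child (parent_id INTEGER);
    INSERT INTO parent VALUES (1), (2), (3);
    INSERT INTO child VALUES (1), (NULL);
""")

not_exists = conn.execute("""
    SELECT p.id FROM parent p
    WHERE NOT EXISTS (SELECT 1 FROM child c WHERE c.parent_id = p.id)
    ORDER BY p.id
""").fetchall()

left_join = conn.execute("""
    SELECT p.id FROM parent p
    LEFT OUTER JOIN child c ON c.parent_id = p.id
    WHERE c.parent_id IS NULL
    ORDER BY p.id
""").fetchall()

# x NOT IN (1, NULL) is never true, so every row is filtered out.
not_in = conn.execute("""
    SELECT p.id FROM parent p
    WHERE p.id NOT IN (SELECT c.parent_id FROM child c)
    ORDER BY p.id
""").fetchall()

print(not_exists)  # [(2,), (3,)]
print(left_join)   # [(2,), (3,)]
print(not_in)      # []
```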
EXISTS vs JOIN and use of EXISTS clause
EXISTS is used to return a boolean value; JOIN returns a whole other table.
EXISTS is only used to test if a subquery returns results, and short circuits as soon as it does. JOIN is used to extend a result set by combining it with additional fields from another table to which there is a relation.
In your example, the queries are semantically equivalent.
In general, use EXISTS when:
- You don't need to return data from the related table
- You have dupes in the related table (JOIN can cause duplicate rows if values are repeated)
- You want to check existence (use instead of LEFT OUTER JOIN...NULL condition)
If you have proper indexes, most of the time the EXISTS will perform identically to the JOIN. The exception is on very complicated subqueries, where it is normally quicker to use EXISTS.
If your JOIN key is not indexed, it may be quicker to use EXISTS, but you will need to test for your specific circumstance.
JOIN syntax is normally easier to read and clearer as well.
SQL JOIN vs IN performance?
Generally speaking, IN and JOIN are different queries that can yield different results.
SELECT a.*
FROM a
JOIN b
ON a.col = b.col
is not the same as
SELECT a.*
FROM a
WHERE col IN
(
SELECT col
FROM b
)
, unless b.col is unique.
However, this is the synonym for the second (IN) query:
SELECT a.*
FROM a
JOIN (
SELECT DISTINCT col
FROM b
) b
ON b.col = a.col
If the joining column is UNIQUE and marked as such, both these queries yield the same plan in SQL Server.
If it's not, then IN is faster than JOIN on DISTINCT.
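The row-count difference between the three forms can be checked on toy data. This is a sketch using Python's sqlite3 module, not SQL Server, so it only demonstrates the results and not the plans; the sample values are made up, and the derived table is given the alias b so the ON clause resolves.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (col INTEGER);
    CREATE TABLE b (col INTEGER);
    INSERT INTO a VALUES (1), (2), (3);
    -- b.col is deliberately not unique: the value 1 appears twice.
    INSERT INTO b VALUES (1), (1), (2);
""")

plain_join = conn.execute(
    "SELECT a.* FROM a JOIN b ON a.col = b.col ORDER BY a.col"
).fetchall()

in_query = conn.execute(
    "SELECT a.* FROM a WHERE col IN (SELECT col FROM b) ORDER BY a.col"
).fetchall()

distinct_join = conn.execute("""
    SELECT a.* FROM a
    JOIN (SELECT DISTINCT col FROM b) b ON b.col = a.col
    ORDER BY a.col
""").fetchall()

print(plain_join)     # [(1,), (1,), (2,)]
print(in_query)       # [(1,), (2,)]
print(distinct_join)  # [(1,), (2,)]
```

Only the plain JOIN repeats the row for col = 1; the IN query and the join on the DISTINCT derived table agree.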
See this article in my blog for performance details:
IN vs. JOIN vs. EXISTS
JOIN or Correlated subquery with exists clause, which one is better
Generally, the EXISTS clause, because you may need DISTINCT for a JOIN to give the expected output. For example, if you have multiple Department rows for a ContactInformation row.
In your example above, the SELECT *:
- means different output too, so they are not actually equivalent
- means less chance of an index being used, because you are pulling all columns out
Having said that, even with a limited column list, they will give the same plan, until you need DISTINCT... which is why I say "EXISTS".
Performance optimization with EXISTS
With any performance tuning exercise the devil is in the details. The following is some guesswork based on rules of thumb. You must run some performance benchmarks for yourself, using your actual tables and actual data.
"is there a better way to optimize this than the following index"
Almost certainly, yes.
The benefit of bitmap indexes lies in having several of them. That way, when we issue a query filtering on those columns the optimizer can choose to execute a Star Transformation to find the rows in the intersection of the bitmaps. Even then, bitmap indexes on bivalent columns are not as useful as those on columns with several different values.
One bitmap index on its own, particularly one with only two values, isn't much use. Given the monstrous overheads of maintaining bitmap indexes, and their concurrency issues, you should probably consider other options.
"~40M lines on an Oracle Exadata server"
Oracle have engineered the Exadata appliance for crunching through large volumes of data. This means looking for paths which support hash joins, Bloom filters and similar operations. With Exadata a common tuning technique is to drop an existing index rather than creating a new one. While cheaper than bitmaps, B-Tree indexes still incur resource costs (CPU, storage, memory), so it's worth considering whether using Exadata's brute force offers a lower cost overall. That's what we pay the big bucks for.
However, even Exadata's raw power is a limited resource. So if you're going to run this query a lot (or rather the EXISTS sub-query) you will likely get a benefit from clustering the excluded rows. From your question it seems IS_DELETE is an updated attribute, so you can't use physical organisation at the table level (CTAS, attribute clustering). So a B-tree index on ARCHIVED_ID(IS_DELETE, ID_ARCHIVED) is the primary candidate.
With compound indexes it's usually best to start with the least selective column, and that is true here. You're only interested in the rows where IS_DELETE = 'Y', so leading with that column will reduce the number of blocks the sub-query needs to visit. Leading with ID_ARCHIVED would mean the sub-query has to scan the whole index. Even with Exadata, we should always seek to minimize the work undertaken to get a set of records.
But please benchmark this, or any other index.
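As a starting point for such a benchmark, here is a rough sketch of the compound index and the kind of sub-query it is meant to serve. It runs on SQLite through Python's sqlite3 module, which is nothing like Exadata, so treat the printed plan as illustrative only. The documents table, the index name and the sample rows are my assumptions, as is the NOT EXISTS shape of the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- documents is a hypothetical main table; archived_id follows the answer.
    CREATE TABLE documents (id INTEGER PRIMARY KEY);
    CREATE TABLE archived_id (id_archived INTEGER, is_delete TEXT);
    -- Compound index leading with the flag column, as discussed above.
    CREATE INDEX idx_archived_delete ON archived_id (is_delete, id_archived);
    INSERT INTO documents VALUES (1), (2), (3);
    INSERT INTO archived_id VALUES (1, 'N'), (2, 'Y'), (3, 'N');
""")

query = """
    SELECT d.id FROM documents d
    WHERE NOT EXISTS (
        SELECT 1 FROM archived_id a
        WHERE a.is_delete = 'Y' AND a.id_archived = d.id
    )
    ORDER BY d.id
"""
kept = conn.execute(query).fetchall()
print(kept)  # [(1,), (3,)]

# Print SQLite's plan; the last column of each row describes one step.
for row in conn.execute("EXPLAIN QUERY PLAN " + query):
    print(row[-1])
```

On the real system the equivalent check would be the Oracle execution plan and timings against production-sized data.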
" For the bitmap index, I was thinking it only track Not Null value (so Y), I'm mistaken ?"
'Fraid so. Bitmap indexes have an entry for each indexed field. So, unlike single-column B-tree indexes, they do index null entries.
"I use EXISTS, instead of IN, as Oracle recommend (ability to use index),"
Hmmm, not sure what Oracle recommendation you think you're citing there. Which is better really depends on the data. But basically, if the sub-query is huge and the main query is relatively small, then EXISTS is more efficient. But if the data volumes are reversed (the main query is huge and the sub-query is small), IN is the better choice. Given that your current starting position is that the sub-query returns no rows (because nothing is deleted), and presumably relatively few records will be deleted in the near future, it seems more likely that IN is the construct you should be using.
But again: benchmark each approach and see what works better with your data.