How to Get Better Performance Using a Join or Using Exists

Can an INNER JOIN offer better performance than EXISTS

Generally speaking, INNER JOIN and EXISTS are different things.

The former can return duplicates and columns from both tables. The latter returns each outer record at most once and, being a predicate, returns columns from only one table.

If you do an inner join on a UNIQUE column, they exhibit the same performance.

If you do an inner join on a recordset with DISTINCT applied (to get rid of the duplicates), EXISTS is usually faster.

IN and EXISTS clauses (with an equijoin correlation) usually employ one of the several SEMI JOIN algorithms which are usually more efficient than a DISTINCT on one of the tables.
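To make the difference in result sets concrete, here is a minimal sketch using Python's built-in sqlite3 module. The customers and orders tables and their data are hypothetical, made up purely for illustration, and SQLite only demonstrates the semantics here; it says nothing about how your engine's optimizer will behave.

```python
import sqlite3

# Hypothetical tables: customer 1 has two orders, so the join key is not unique.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 2);
""")

def names(sql):
    """Run a query and return the first column of every row."""
    return [row[0] for row in conn.execute(sql)]

# Plain inner join: one output row per matching (customer, order) pair.
join_rows = names("""
    SELECT c.name FROM customers c
    JOIN orders o ON o.customer_id = c.id
    ORDER BY c.name
""")

# Inner join with DISTINCT applied to get rid of the duplicates.
distinct_rows = names("""
    SELECT DISTINCT c.name FROM customers c
    JOIN orders o ON o.customer_id = c.id
    ORDER BY c.name
""")

# EXISTS: each customer is returned at most once, no DISTINCT needed.
exists_rows = names("""
    SELECT c.name FROM customers c
    WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)
    ORDER BY c.name
""")

print(join_rows)
print(distinct_rows)
print(exists_rows)
```

The plain join repeats Ann once per order, while the DISTINCT join and the EXISTS query agree with each other.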

See this article in my blog:

IN vs. JOIN vs. EXISTS

How to improve performance of a SQL JOIN with an OR condition

THIS ANSWERS THE ORIGINAL VERSION OF THE QUESTION.

You may be better off with exists:

SELECT FIRST.ID, FIRST.NAME
FROM FIRST_TABLE FIRST
WHERE EXISTS (SELECT 1 FROM SECOND_TABLE SECOND WHERE SECOND.FIRST_NAME = FIRST.NAME)
   OR EXISTS (SELECT 1 FROM SECOND_TABLE SECOND WHERE SECOND.LAST_NAME = FIRST.NAME);

Then, for performance, you want indexes on SECOND_TABLE(LAST_NAME) and SECOND_TABLE(FIRST_NAME).
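Here is a rough, runnable sketch of that setup using Python's sqlite3 module. The sample rows and the index names are made up, the aliases are shortened to f and s, and the JOIN with an OR condition is only my assumption about what the original query looked like. It shows the two single-column indexes and how the row counts differ, not how fast either form is on your database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE FIRST_TABLE (ID INTEGER PRIMARY KEY, NAME TEXT);
    CREATE TABLE SECOND_TABLE (FIRST_NAME TEXT, LAST_NAME TEXT);
    -- One single-column index per EXISTS branch (index names are hypothetical).
    CREATE INDEX idx_second_first_name ON SECOND_TABLE (FIRST_NAME);
    CREATE INDEX idx_second_last_name ON SECOND_TABLE (LAST_NAME);
    INSERT INTO FIRST_TABLE VALUES (1, 'Smith'), (2, 'Anna'), (3, 'Zed');
    INSERT INTO SECOND_TABLE VALUES
        ('Anna', 'Jones'), ('Anna', 'Smith'), ('Bob', 'Smith');
""")

# Two EXISTS predicates combined with OR: each FIRST_TABLE row appears at most once.
or_exists_rows = conn.execute("""
    SELECT f.ID, f.NAME
    FROM FIRST_TABLE f
    WHERE EXISTS (SELECT 1 FROM SECOND_TABLE s WHERE s.FIRST_NAME = f.NAME)
       OR EXISTS (SELECT 1 FROM SECOND_TABLE s WHERE s.LAST_NAME = f.NAME)
    ORDER BY f.ID
""").fetchall()

# Assumed original shape: a JOIN with an OR condition, one row per matching pair.
or_join_rows = conn.execute("""
    SELECT f.ID, f.NAME
    FROM FIRST_TABLE f
    JOIN SECOND_TABLE s
      ON s.FIRST_NAME = f.NAME OR s.LAST_NAME = f.NAME
    ORDER BY f.ID
""").fetchall()

print(or_exists_rows)
print(or_join_rows)
```

Note that the two forms are only interchangeable if you do not want the repeated rows that the JOIN produces.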

JOIN versus EXISTS performance

NOT EXISTS is more efficient than using a LEFT OUTER JOIN to exclude records that are missing from the participating table using an IS NULL condition because the optimizer will elect to use an EXCLUSION MERGE JOIN with the NOT EXISTS predicate.

While your second test did not yield impressive results for the data sets you were using, the performance increase from NOT EXISTS over a LEFT JOIN becomes very noticeable as your data volumes increase. Keep in mind that the tables will need to be hash distributed by the columns that participate in the NOT EXISTS join, just as they would be in the LEFT JOIN. Therefore, data skew can impact the performance of the EXCLUSION MERGE JOIN.
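For reference, the two ways of writing the exclusion look like this. This is a small sketch with hypothetical parent and child tables, run through Python's sqlite3 module only to confirm that both forms return the same rows; the join strategy the optimizer picks is platform specific and is not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE parent (id INTEGER PRIMARY KEY);
    CREATE TABLE child (id INTEGER PRIMARY KEY, parent_id INTEGER);
    INSERT INTO parent VALUES (1), (2), (3), (4);
    INSERT INTO child VALUES (10, 1), (11, 1), (12, 3);
""")

# NOT EXISTS: keep parents that have no matching child row.
not_exists_ids = [r[0] for r in conn.execute("""
    SELECT p.id FROM parent p
    WHERE NOT EXISTS (SELECT 1 FROM child c WHERE c.parent_id = p.id)
    ORDER BY p.id
""")]

# LEFT OUTER JOIN ... IS NULL: same intent, tested on a column that is
# never NULL in a real child row (its primary key).
left_join_ids = [r[0] for r in conn.execute("""
    SELECT p.id FROM parent p
    LEFT OUTER JOIN child c ON c.parent_id = p.id
    WHERE c.id IS NULL
    ORDER BY p.id
""")]

print(not_exists_ids)
print(left_join_ids)
```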

EDIT:

Typically, I would defer to EXISTS as a replacement for IN instead of using it for re-writing a join solution. This is especially true when the column(s) participating in the logical comparison can be NULL. That's not to say you couldn't use EXISTS in place of an INNER JOIN. Instead of an EXCLUSION JOIN you will end up with an INCLUSION JOIN. The INNER JOIN is in essence an inclusion join to begin with. I'm sure there are some nuances that I am overlooking but you can find those in the manuals if you wish to take the time to read them.
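The NULL issue is easiest to see with the negated forms. The following sketch uses hypothetical tables a and b and Python's sqlite3 module: a single NULL in the subquery makes NOT IN return nothing, while NOT EXISTS still returns the unmatched rows. The positive forms, IN and EXISTS, agree with each other on this data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (col INTEGER);
    CREATE TABLE b (col INTEGER);
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO b VALUES (1), (NULL);  -- note the NULL in the compared column
""")

def cols(sql):
    """Run a query and return the first column of every row."""
    return [r[0] for r in conn.execute(sql)]

# "col NOT IN (1, NULL)" is never true: it is false for 1 and unknown otherwise.
not_in_rows = cols(
    "SELECT col FROM a WHERE col NOT IN (SELECT col FROM b) ORDER BY col")

# NOT EXISTS only asks whether a matching row exists, so NULLs in b do no harm.
not_exists_rows = cols("""
    SELECT col FROM a
    WHERE NOT EXISTS (SELECT 1 FROM b WHERE b.col = a.col)
    ORDER BY col
""")

in_rows = cols("SELECT col FROM a WHERE col IN (SELECT col FROM b) ORDER BY col")
exists_rows = cols("""
    SELECT col FROM a
    WHERE EXISTS (SELECT 1 FROM b WHERE b.col = a.col)
    ORDER BY col
""")

print(not_in_rows, not_exists_rows)
print(in_rows, exists_rows)
```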

EXISTS vs JOIN and use of EXISTS clause

EXISTS is used to return a boolean value, whereas JOIN returns a whole other table.

EXISTS is only used to test if a subquery returns results, and short circuits as soon as it does. JOIN is used to extend a result set by combining it with additional fields from another table to which there is a relation.

In your example, the queries are semantically equivalent.

In general, use EXISTS when:

You don't need to return data from the related table
You have dupes in the related table (JOIN can cause duplicate rows if values are repeated)
You want to check existence (use instead of LEFT OUTER JOIN...NULL condition)

If you have proper indexes, most of the time the EXISTS will perform identically to the JOIN. The exception is on very complicated subqueries, where it is normally quicker to use EXISTS.

If your JOIN key is not indexed, it may be quicker to use EXISTS but you will need to test for your specific circumstance.

JOIN syntax is also normally clearer and easier to read.

SQL JOIN vs IN performance?

Generally speaking, IN and JOIN are different queries that can yield different results.

SELECT  a.*
FROM    a
JOIN    b
ON      a.col = b.col

is not the same as

SELECT  a.*
FROM    a
WHERE   col IN
        (
        SELECT  col
        FROM    b
        )

, unless b.col is unique.

However, this is a synonym for the second query (the one using IN):

SELECT  a.*
FROM    a
JOIN    (
        SELECT  DISTINCT col
        FROM    b
        ) b
ON      b.col = a.col

If the joining column is UNIQUE and marked as such, both these queries yield the same plan in SQL Server.

If it's not, then IN is faster than JOIN on DISTINCT.
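As a quick check of the equivalences, here is a minimal sketch with hypothetical data in which b.col is deliberately not unique, run through Python's sqlite3 module. It only compares result sets; which form is faster depends on your engine and data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (col INTEGER);
    CREATE TABLE b (col INTEGER);  -- no UNIQUE constraint on purpose
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO b VALUES (1), (1), (2);
""")

def cols(sql):
    """Run a query and return the first column of every row, sorted."""
    return sorted(r[0] for r in conn.execute(sql))

# Plain join: a row of a is repeated once per matching row in b.
plain_join = cols("SELECT a.* FROM a JOIN b ON a.col = b.col")

# IN: each row of a is returned at most once.
in_query = cols("SELECT a.* FROM a WHERE col IN (SELECT col FROM b)")

# Join against a DISTINCT derived table (aliased as b): same rows as IN.
distinct_join = cols("""
    SELECT a.* FROM a
    JOIN (SELECT DISTINCT col FROM b) b ON b.col = a.col
""")

print(plain_join)
print(in_query)
print(distinct_join)
```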

See this article in my blog for performance details:

IN vs. JOIN vs. EXISTS

JOIN or Correlated subquery with exists clause, which one is better

Generally, the EXISTS clause, because a JOIN may need DISTINCT to give the expected output: for example, if you have multiple Department rows for a ContactInformation row.

In your example above, the SELECT *:

means different output too so they are not actually equivalent
less chance of an index being used because you are pulling all columns out

That said, even with a limited column list they will give the same plan, until you need DISTINCT... which is why I say "EXISTS".
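To illustrate the SELECT * point, here is a small sketch using Python's sqlite3 module. Only the table names come from the question; the columns and rows are hypothetical. It shows that the JOIN version returns columns from both tables and one row per Department, while the EXISTS version returns only ContactInformation columns, once per contact.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Column names are assumptions made up for this example.
    CREATE TABLE ContactInformation (ContactId INTEGER PRIMARY KEY, Email TEXT);
    CREATE TABLE Department (
        DeptId INTEGER PRIMARY KEY, ContactId INTEGER, DeptName TEXT);
    INSERT INTO ContactInformation VALUES (1, 'a@example.com'), (2, 'b@example.com');
    INSERT INTO Department VALUES (100, 1, 'Sales'), (101, 1, 'Support');
""")

join_cur = conn.execute("""
    SELECT * FROM ContactInformation ci
    JOIN Department d ON d.ContactId = ci.ContactId
""")
join_columns = [c[0] for c in join_cur.description]
join_row_count = len(join_cur.fetchall())

exists_cur = conn.execute("""
    SELECT * FROM ContactInformation ci
    WHERE EXISTS (SELECT 1 FROM Department d WHERE d.ContactId = ci.ContactId)
""")
exists_columns = [c[0] for c in exists_cur.description]
exists_row_count = len(exists_cur.fetchall())

print(len(join_columns), join_row_count)      # columns from both tables
print(len(exists_columns), exists_row_count)  # columns from one table only
```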

Performance optimization with EXISTS

With any performance tuning exercise the devil is in the details. The following is some guesswork based on rules of thumb. You must run some performance benchmarks for yourself, using your actual tables and actual data.

"is there a better way to optimize this than the following index"

Almost certainly, yes.

The benefit of bitmap indexes lies in having several of them. That way, when we issue a query filtering on those columns, the optimizer can choose to execute a Star Transformation to find the rows in the intersection of the bitmaps. Even then, bitmap indexes on bivalent columns are not as useful as those on columns with several different values.

One bitmap index on its own, particularly one with only two values, isn't much use. Given the monstrous overheads of maintaining bitmap indexes, and their concurrency issues, you should probably consider other options.

"~40M lines on an Oracle Exadata server"

Oracle have engineered the Exadata appliance for crunching through large volumes of data. This means looking for paths which support hash joins, Bloom filters and similar operations. With Exadata, a common tuning technique is to drop an existing index rather than create a new one. While cheaper than bitmaps, B-Tree indexes still incur resource costs (CPU, storage, memory), so it's worth considering whether using Exadata's brute force offers a lower cost overall. That's what we pay the big bucks for.

However, even Exadata's raw power is a limited resource. So if you're going to run this query a lot (or rather the EXISTS sub-query) you will likely get a benefit from clustering the excluded rows. From your question it seems IS_DELETE is an updated attribute, so you can't use physical organisation at the table level (CTAS, attribute clustering). So a B-tree index on ARCHIVED_ID(IS_DELETE, ID_ARCHIVED) is the primary candidate.

With compound indexes it's usually best to start with the least selective column, and that is true here. You're only interested in the rows where IS_DELETE = 'Y', so leading with that column will reduce the number of blocks the sub-query needs to visit. Leading with ID_ARCHIVED would mean the sub-query has to scan the whole index. Even with Exadata, we should always seek to minimize the work undertaken to get a set of records.

But please benchmark this, or any other index.

" For the bitmap index, I was thinking it only track Not Null value (so Y), I'm mistaken ?"

'Fraid so. Bitmap indexes have an entry for each indexed field. So, unlike single-column B-tree indexes, they do index null entries.

"I use EXISTS, instead of IN, as Oracle recommend (ability to use index),"

Hmmm, not sure what Oracle recommendation you think you're citing there. Which is better really depends on the data. Basically, if the sub-query is huge and the main query is relatively small, then EXISTS is more efficient. But if the data volumes are reversed, so that the main query is huge and the sub-query is small, IN is the better choice. Given that your current starting position is that the sub-query returns no rows (because nothing is deleted), and presumably relatively few records will be deleted in the near future, it seems more likely that IN is the construct you should be using.

But again: benchmark each approach and see what works better with your data.
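A benchmark harness does not need to be elaborate. The following is only a sketch of the idea, using Python's sqlite3 module with hypothetical tables (main_rows, archived), a made-up index name and synthetic data. The timings it prints on SQLite tell you nothing about Exadata; the point is the shape of the test: same data, both constructs, same result, measured elapsed time.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema, loosely modelled on the columns discussed above.
    CREATE TABLE main_rows (id INTEGER PRIMARY KEY, payload TEXT);
    CREATE TABLE archived (id_archived INTEGER, is_delete TEXT);
    CREATE INDEX idx_archived_delete ON archived (is_delete, id_archived);
""")
conn.executemany("INSERT INTO main_rows VALUES (?, ?)",
                 ((i, "x") for i in range(20000)))
# Synthetic data: one archived row in every hundred is flagged as deleted.
conn.executemany("INSERT INTO archived VALUES (?, ?)",
                 ((i, "Y" if i % 100 == 0 else "N") for i in range(20000)))

candidates = {
    "EXISTS": """
        SELECT COUNT(*) FROM main_rows m
        WHERE EXISTS (SELECT 1 FROM archived a
                      WHERE a.is_delete = 'Y' AND a.id_archived = m.id)
    """,
    "IN": """
        SELECT COUNT(*) FROM main_rows m
        WHERE m.id IN (SELECT a.id_archived FROM archived a
                       WHERE a.is_delete = 'Y')
    """,
}

results, timings = {}, {}
for label, sql in candidates.items():
    start = time.perf_counter()
    results[label] = conn.execute(sql).fetchone()[0]
    timings[label] = time.perf_counter() - start
    print(f"{label}: {results[label]} rows in {timings[label]:.4f}s")
```

On the real system you would run each candidate several times against production-sized data and compare execution plans as well as elapsed time.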