IN vs. JOIN with large rowsets

Update:

This article on my blog summarizes both my answer and my comments to other answers, and shows actual execution plans:

  • IN vs. JOIN vs. EXISTS

SELECT  *
FROM a
WHERE a.c IN (SELECT d FROM b)

SELECT a.*
FROM a
JOIN b
ON a.c = b.d

These queries are not equivalent. They can yield different results if your table b is not key-preserved (i.e., the values of b.d are not unique).
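A minimal illustration with made-up data: if b.d contains duplicates, the JOIN repeats the matching row of a once per duplicate, while IN returns it only once.

-- Hypothetical tables, for illustration only
CREATE TABLE a (c INT);
CREATE TABLE b (d INT);
INSERT INTO a VALUES (1);
INSERT INTO b VALUES (1), (1);

SELECT a.* FROM a WHERE a.c IN (SELECT d FROM b);  -- returns 1 row
SELECT a.* FROM a JOIN b ON a.c = b.d;             -- returns 2 rows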

The equivalent of the first query is the following:

SELECT  a.*
FROM    a
JOIN    (
        SELECT DISTINCT d
        FROM   b
        ) bo
ON      a.c = bo.d

If b.d is UNIQUE and marked as such (with a UNIQUE INDEX or UNIQUE CONSTRAINT), then these queries are identical and most probably will use identical plans, since SQL Server is smart enough to take this into account.
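For instance, a sketch of declaring that uniqueness (the constraint and index names here are made up):

ALTER TABLE b ADD CONSTRAINT uq_b_d UNIQUE (d);
-- or, equivalently:
-- CREATE UNIQUE INDEX ux_b_d ON b (d);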

SQL Server can employ one of the following methods to run this query:

  • If there is an index on a.c, d is UNIQUE, and b is relatively small compared to a, then the condition is propagated into the subquery and a plain INNER JOIN is used (with b leading)

  • If there is an index on b.d and d is not UNIQUE, then the condition is also propagated and a LEFT SEMI JOIN is used. This method can also be used for the case above.

  • If there are indexes on both b.d and a.c and the tables are large, then a MERGE SEMI JOIN is used

  • If there is no index on either table, then a hash table is built on b and a HASH SEMI JOIN is used.

None of these methods reevaluates the whole subquery each time.
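For reference, the same semi-join can also be expressed with EXISTS, which behaves like the IN form when the columns involved are not nullable (a sketch):

SELECT  a.*
FROM    a
WHERE   EXISTS
        (
        SELECT  NULL
        FROM    b
        WHERE   b.d = a.c
        )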

See this entry in my blog for more detail on how this works:

  • Counting missing rows: SQL Server

There are links for all of the big four RDBMSs.

SQL Server: Performance of INNER JOIN on small table vs subquery in IN clause

The execution plans you have supplied both have exactly the same basic strategy.

Join

(execution plan image for the JOIN version)

There is a seek on ActionTable to find rows where ActionName starts with "generate", with a residual predicate of ActionName LIKE '%action%'. The 7 matching rows are then used to build a hash table.

On the probe side, there is a seek on TimeStamp > Scalar Operator(dateadd(day,(-3),getdate())), and matching rows are tested against the hash table to see if they should join.

There are two main differences that explain why the IN version executes quicker:

IN

(execution plan image for the IN version)

  1. The IN version is executing in parallel. There are 4 concurrent threads working on the query execution - not just one.
  2. Related to the parallelism, this plan has a bitmap filter. It is able to use this bitmap to eliminate rows early. In the inner join plan, 25,959,124 rows are passed to the probe side of the hash join; in the semi join plan, the seek still reads 25.9 million rows but only 313 rows are passed out to be evaluated by the join. The remainder are eliminated early by applying the bitmap inside the seek.

It is not readily apparent why the INNER JOIN version does not execute in parallel. You could try adding the hint OPTION(USE HINT('ENABLE_PARALLEL_PLAN_PREFERENCE')) to see if you now get a plan which executes in parallel and contains the bitmap filter.
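A sketch of what that might look like, assuming the join query has the shape implied by the plans above (table and column names are taken from those plans; the hint requires a reasonably recent SQL Server build):

SELECT B.*
FROM ActionTable A
JOIN BigTimeSeriesTable B
  ON B.ActionID = A.ActionID
WHERE A.ActionName LIKE '%action%'
  AND B.TimeStamp > DATEADD(DAY, -3, GETDATE())
OPTION (USE HINT ('ENABLE_PARALLEL_PLAN_PREFERENCE'));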

If you are able to change indexes then, given that the query only returns 309 rows for 7 distinct actions, you may well find that replacing IX_BigTimeSeriesTable_ActionID with a covering index with leading columns [ActionID], [TimeStamp] and then getting a nested loops plan with 7 seeks performs much better than your current queries.

CREATE NONCLUSTERED INDEX [IX_BigTimeSeriesTable_ActionID_TimeStamp]
ON [dbo].[BigTimeSeriesTable] ([ActionID], [TimeStamp])
INCLUDE ([Details], [ID])

Hopefully with that index in place your existing queries will just use it and you will see 7 seeks, each returning an average of 44 rows, to read and return only the exact 309 total required. If not, you can try the query below:

SELECT CA.*
FROM ActionTable A
CROSS APPLY
(
    SELECT *
    FROM BigTimeSeriesTable B
    WHERE B.ActionID = A.ActionID
      AND B.TimeStamp > DATEADD(DAY, -3, GETDATE())
) CA
WHERE A.ActionName LIKE '%action%'

SQL JOIN vs IN performance?

Generally speaking, IN and JOIN are different queries that can yield different results.

SELECT  a.*
FROM a
JOIN b
ON a.col = b.col

is not the same as

SELECT  a.*
FROM    a
WHERE   col IN
        (
        SELECT  col
        FROM    b
        )

unless b.col is unique.

However, this query is equivalent to the first one:

SELECT  a.*
FROM    a
JOIN    (
        SELECT DISTINCT col
        FROM   b
        ) bo
ON      bo.col = a.col

If the joining column is UNIQUE and marked as such, both these queries yield the same plan in SQL Server.

If it's not, then IN is faster than JOIN on DISTINCT.

See this article in my blog for performance details:

  • IN vs. JOIN vs. EXISTS

Join vs. sub-query

Taken from the MySQL manual (13.2.10.11 Rewriting Subqueries as Joins):

A LEFT [OUTER] JOIN can be faster than an equivalent subquery because the server might be able to optimize it better—a fact that is not specific to MySQL Server alone.

So subqueries can be slower than a LEFT [OUTER] JOIN, but in my opinion their strength is slightly better readability.
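The classic rewrite from that manual section looks like this (t1 and t2 are illustrative names; the two forms are equivalent only when t2.id is not nullable):

-- Subquery form: rows in t1 with no match in t2
SELECT * FROM t1 WHERE id NOT IN (SELECT id FROM t2);

-- LEFT JOIN form that the optimizer may handle better
SELECT t1.*
FROM t1
LEFT JOIN t2 ON t1.id = t2.id
WHERE t2.id IS NULL;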

Update using join on big table - performance tips?

Updating all the rows in the table is going to be really expensive. I would suggest re-creating the table:

create table temp_votings as
    select v.*, vv.vote_id
    from votings v join
         voters vv
         on vv.person_id = v.person_id;

For this query, you want an index on voters(person_id, vote_id). I am guessing that person_id might already be the primary key; if so, no additional index is needed.
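If it is not, a sketch of that index (the index name is made up):

create index idx_voters_person_vote on voters (person_id, vote_id);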

Then, you can replace the existing table -- but back it up first:

truncate table votings;

insert into votings ( . . . ) -- list columns here
select . . . -- and the same columns here
from temp_votings;

Speeding up inner joins between a large table and a small table

No, the order does not matter.

Almost all RDBMSs (such as MS Access, MySQL, SQL Server, Oracle, etc.) use a cost-based optimiser based upon column statistics. In most situations, the optimiser will choose a correct plan. In the example you gave, the order will not matter (provided statistics are up to date).

To decide what query strategy to use, the Jet Engine optimizer uses statistics. The following factors are some of the factors that these statistics are based on:

  • The number of records in a table
  • The number of data pages in a table
  • The location of the table
  • Whether indexes are present
  • How unique the indexes are

Note: You cannot view Jet database engine optimization schemes, and you cannot specify how to optimize a query. However, you can use the Database Documenter to determine whether indexes are present and how unique an index is.

Based on these statistics, the Optimizer then selects the best internal query strategy for dealing with a particular query.

The statistics are updated whenever a query is compiled. A query is flagged for compiling when you save any changes to the query (or its underlying tables) and when the database is compacted. If a query is flagged for compiling, the compiling and the updating of statistics occurs the next time that the query is run. Compiling typically takes from one second to four seconds.

If you add a significant number of records to your database, you must open and then save your queries to recompile the queries. For example, if you design and then test a query by using a small set of sample data, you must re-compile the query after additional records are added to the database. When you do this, you want to make sure that optimal query performance is achieved when your application is in use.

Ref.

Might be of interest: ACC: How to Optimize Queries in Microsoft Access 2.0, Microsoft Access 95, and Microsoft Access 97

Tony Toews's Microsoft Access Performance FAQ is worth reading.

There's a caveat to "JOIN order does not matter".

If your RDBMS's cost-based query optimiser times out creating the query plan, then the join order COULD matter. Cost-based optimisers have finite resources (both CPU time and memory) in which to construct a query plan. If they time out during the compilation stage, you will get the best plan found so far.

TL;DR: If you have complex queries that receive a plan compilation timeout (not a query execution timeout), then put your most restrictive joins first. That way, at the point the query plan optimiser times out, the chance that a 'better' plan was already found is increased.

Of course, if you are experiencing query plan compilation timeouts, you should probably simplify your query.
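On SQL Server, one way to check whether a plan compilation timeout occurred (a sketch, reusing the a/b example tables from above) is to request the estimated plan XML and look for the timeout attribute:

SET SHOWPLAN_XML ON;
GO
-- the complex query goes here; with SHOWPLAN_XML ON it is compiled but not executed
SELECT a.* FROM a JOIN b ON a.c = b.d;
GO
SET SHOWPLAN_XML OFF;
GO
-- In the returned plan XML, StatementOptmEarlyAbortReason="TimeOut"
-- indicates the optimiser stopped before exhausting its search space.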

Which provides better performance: one big join or multiple queries?

I agree with everyone who's said a single join will probably be more efficient, even with a lot of tables. It's also less development effort than doing the work in your application code. This assumes the tables are appropriately indexed, with an index on each foreign key column, and (of course) an index on each primary key column.

Your best bet is to try the easiest approach (the big join) first, and see how well it performs. If it performs well, then great - you're done. If it performs poorly, profile the query and look for missing indexes on your tables.

Your option #1 is not likely to perform well, due to the number of network round-trips (as anijhaw mentioned). This is sometimes called the "select N+1" problem - you do one SELECT to get the list of N applications, and then do N SELECTs in a loop to get the customers. This record-at-a-time looping is natural to application programmers; but SQL works much better when you operate on whole sets of data at once.
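A sketch of the difference, with a hypothetical applications/customers schema:

-- Option #1, the "select N+1" anti-pattern (one query, then N more in a loop):
-- SELECT id FROM applications;
-- SELECT * FROM customers WHERE application_id = @id;  -- repeated N times

-- Option #2, one set-based query:
SELECT app.id, cust.*
FROM applications AS app
JOIN customers AS cust
  ON cust.application_id = app.id;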

If option #2 is slow even with good indexing, you may want to look into caching. You can cache in the database (using a summary table or materialized/indexed view), in the application (if there is enough RAM), or in a dedicated caching server such as memcached. Of course, this depends on how up-to-date your query results need to be. If everything has to be fully up-to-date, then any cache would have to be updated whenever the underlying tables are updated - it gets complicated and becomes less useful.
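On SQL Server, the database-side flavour of this could be an indexed view; a minimal sketch using the same hypothetical schema (indexed views require SCHEMABINDING, two-part names, and COUNT_BIG(*) when grouping):

CREATE VIEW dbo.AppCustomerSummary
WITH SCHEMABINDING
AS
SELECT app.id, COUNT_BIG(*) AS customer_count
FROM dbo.applications AS app
JOIN dbo.customers AS cust
  ON cust.application_id = app.id
GROUP BY app.id;
GO
-- The unique clustered index is what materializes the view's result set
CREATE UNIQUE CLUSTERED INDEX ux_AppCustomerSummary
  ON dbo.AppCustomerSummary (id);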

This sounds like a reporting query though, and reporting often doesn't need to be real-time. So caching might be able to help you.

Depending on your DBMS, another thing to think about is the impact of this query on other queries hitting the same database. If your DBMS allows readers to block writers, then this query could prevent updates to the tables if it takes a long time to run. That would be bad. Oracle doesn't have this problem, and neither does SQL Server when run in "read committed snapshot" mode. I don't know about MySQL though.
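For SQL Server, that mode is a database-level setting; a sketch (YourDatabase is a placeholder, and switching the option requires no other active connections in the database):

ALTER DATABASE YourDatabase SET READ_COMMITTED_SNAPSHOT ON;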


