Is Natural Join Any Better Than Select from Where in Terms of Performance

Inner Join vs. Natural Join, speed-wise?

It is going to depend on the DBMS, and to some extent on the tables joined, and the difference is generally not going to be measurable. With an INNER JOIN, there will be two copies of the common columns; with a NATURAL JOIN, there will be just one copy of the common columns.

If the intermediate results have to be written to disk or sorted, there will be more data to be managed, and there is slightly more data to be returned to the client over the client-server connection, so an INNER JOIN might take slightly longer than an NATURAL JOIN. If the common columns are big enough - long characters fields, for example - this size difference could be significant.

The difference in time to prepare the two types of join is completely negligible. Both have to read the information for all the columns in both tables involved in the join, and the processing of that data is substantially similar.

During execution, the copying of the data to form the intermediate results might be slightly more complex for a NATURAL JOIN - because you don't simply take a copy of each row - but you don't necessarily take a full copy of each row for an INNER JOIN either, so the difference is again negligible. And, as before, there is slightly less data to be copied in aggregate for the NATURAL JOIN, so there is a still a small advantage to it.

So, overall, a NATURAL JOIN might be slightly quicker than an INNER JOIN, but the difference is still likely to be negligible.

SQL JOIN: is there a difference between USING, ON or WHERE?

There is no difference in performance.

However, the first style is ANSI-89 and will get your legs broken in some shops. Including mine. The second style is ANSI-92 and is much clearer.

Examples:

Which is the JOIN, which is the filter?

FROM T1,T2,T3....
WHERE T1.ID = T2.ID AND
T1.foo = 'bar' AND T2.fish = 42 AND
T1.ID = T3.ID

FROM T1
INNER JOIN T2 ON T1.ID = T2.ID
INNER JOIN T3 ON T1.ID = T3.ID
WHERE
T1.foo = 'bar' AND T2.fish = 42

If you have OUTER JOINs (=*, *=) then the 2nd style will work as advertised. The first most likely won't and is also deprecated in SQL Server 2005+

The ANSI-92 style is harder to bollix too. With the older style you can easily end up with a Cartesian product (cross join) if you miss a condition. You'll get a syntax error with ANSI-92.

Edit: Some more clarification

  • The reason for not using "join the where" (implicit) is the dodgy results with outer joins.
  • If you use explicit OUTER JOINs + implicit INNER JOINs you'll still get dodgy results + you have inconsistency in usage

It isn't just syntax: it's about having a semantically correct query

Edit, Dec 2011

SQL Server logical query processing order is FROM, ON, JOIN, WHERE...

So if you mix "implicit WHERE inner joins" and "explicit FROM outer joins" you most likely won't get expected results because the query is ambiguous...

Join vs. sub-query

Taken from the MySQL manual (13.2.10.11 Rewriting Subqueries as Joins):

A LEFT [OUTER] JOIN can be faster than an equivalent subquery because the server might be able to optimize it better—a fact that is not specific to MySQL Server alone.

So subqueries can be slower than LEFT [OUTER] JOIN, but in my opinion their strength is slightly higher readability.

Inner join vs Where

No! The same execution plan, look at these two tables:

CREATE TABLE table1 (
id INT,
name VARCHAR(20)
);

CREATE TABLE table2 (
id INT,
name VARCHAR(20)
);

The execution plan for the query using the inner join:

-- with inner join

EXPLAIN PLAN FOR
SELECT * FROM table1 t1
INNER JOIN table2 t2 ON t1.id = t2.id;

SELECT *
FROM TABLE (DBMS_XPLAN.DISPLAY);

-- 0 select statement
-- 1 hash join (access("T1"."ID"="T2"."ID"))
-- 2 table access full table1
-- 3 table access full table2

And the execution plan for the query using a WHERE clause.

-- with where clause

EXPLAIN PLAN FOR
SELECT * FROM table1 t1, table2 t2
WHERE t1.id = t2.id;

SELECT *
FROM TABLE (DBMS_XPLAN.DISPLAY);

-- 0 select statement
-- 1 hash join (access("T1"."ID"="T2"."ID"))
-- 2 table access full table1
-- 3 table access full table2

Natural join in SQL Server

No, and thank the lucky stars

I can't believe that you'd want the engine to guess the JOIN for you

Related links:

  • SQL Server - lack of NATURAL JOIN / x JOIN y USING(field)
  • is NATURAL JOIN any better than SELECT FROM WHERE in terms of performance ?

Edit, to explain why

  • The JOIN (whether USING or ON) is clear and explicit
  • I should be able to name my columns for the entity stored in the table, without worrying about what a column is called in another table, without NATURAL JOIN side effects

Quoting Bill Karwin in this excellent answer:

I never use NATURAL JOIN because I don't like the possibility that the
join could do something I don't intend just because some column name
exists in both tables.

Natural Join -- Relational theory and SQL

a number of points regarding your question (even if I'm afraid I'm not really answering anything you asked),

"On one hand, in relational theory, natural joins are the only joins that should happen (or at least are highly preferred)."

This seems to suggest that you interpret theory as if it proscribes against "other kinds" of joins ... That is not really true. Relational theory does not say "you cannot have antijoins", or "you should never use antijoins", or anything like that. What it DOES say, is that in the relational algebra, a set of primitive operators can be identified, in which natural join is the only "join-like" operator. All other "join-like" operators, can always be expressed equivalently in terms of the primitive operators defined. Cartesian product, for example, is a special case of a natural join (where the set of common attributes is empty), and if you want the cartesian product of two tables that do have an attribute name in common, you can address this using RENAME. Semijoin, for example, is the natural join of the first table with some projection on the second. Antijoin, for example (SEMIMINUS or NOT MATCHING in Date's book), is the relational difference between the first table and a SEMIJOIN of the two. etc. etc.

"On the other hand, in SQL it is advised against using NATURAL JOIN and instead use alternate means (e.g inner join with restriction)."

Where are such things advised ? In the SQL standard ? I don't really think so. It is important to distinguish between the SQL language per se, which is defined by an ISO standard, and some (/any) particular implementation of that language, which is built by some particular vendor. If Microsoft advises its customers to not use NJ in SQL Server 200x, then that advice has a completely different meaning than an advice by someone to not ever use NJ in SQL altogether.

"Natural joins work in true RDBMS. SQL however, fails at completely reproducing the relational model and none of the popular SQL DBMSs are true RDBMS."

While it is true that SQL per se fails to faithfully comply with relational theory, that actually has very little to do with the question of NJ.

Whether an implementation gives good performance for invocations of NJ, is a characteristic of that implementation, not of the language, or of the "degree of trueness" of the 'R' in 'RDBMS'. It is very easy to build a TRDBMS that doesn't use SQL, and that gives ridiculous execution times for NJ. The SQL language per se has everything that is needed to support NJ. If an implementation supports NJ, then NJ will work in that implementation too. Whether it gives good performance, is a characteristic of that implementation, and poor performance of some particular implementation should not be "extrapolated" to other implementations, or be seen as a characteristic of the SQL language per se.

"Good/Better table design should remove/minimise the problems that natural join creates."

Problems that natural join creates ? Controlling the columns that appear in the arguments to a join is easily done by adding explicit projections (and renames if needed) on the columns you want. Much like you also want to avoid SELECT * as much as possible, for basically the same reason ...

INNER JOIN vs LEFT JOIN performance in SQL Server

A LEFT JOIN is absolutely not faster than an INNER JOIN. In fact, it's slower; by definition, an outer join (LEFT JOIN or RIGHT JOIN) has to do all the work of an INNER JOIN plus the extra work of null-extending the results. It would also be expected to return more rows, further increasing the total execution time simply due to the larger size of the result set.

(And even if a LEFT JOIN were faster in specific situations due to some difficult-to-imagine confluence of factors, it is not functionally equivalent to an INNER JOIN, so you cannot simply go replacing all instances of one with the other!)

Most likely your performance problems lie elsewhere, such as not having a candidate key or foreign key indexed properly. 9 tables is quite a lot to be joining so the slowdown could literally be almost anywhere. If you post your schema, we might be able to provide more details.


Edit:

Reflecting further on this, I could think of one circumstance under which a LEFT JOIN might be faster than an INNER JOIN, and that is when:

  • Some of the tables are very small (say, under 10 rows);
  • The tables do not have sufficient indexes to cover the query.

Consider this example:

CREATE TABLE #Test1
(
ID int NOT NULL PRIMARY KEY,
Name varchar(50) NOT NULL
)
INSERT #Test1 (ID, Name) VALUES (1, 'One')
INSERT #Test1 (ID, Name) VALUES (2, 'Two')
INSERT #Test1 (ID, Name) VALUES (3, 'Three')
INSERT #Test1 (ID, Name) VALUES (4, 'Four')
INSERT #Test1 (ID, Name) VALUES (5, 'Five')

CREATE TABLE #Test2
(
ID int NOT NULL PRIMARY KEY,
Name varchar(50) NOT NULL
)
INSERT #Test2 (ID, Name) VALUES (1, 'One')
INSERT #Test2 (ID, Name) VALUES (2, 'Two')
INSERT #Test2 (ID, Name) VALUES (3, 'Three')
INSERT #Test2 (ID, Name) VALUES (4, 'Four')
INSERT #Test2 (ID, Name) VALUES (5, 'Five')

SELECT *
FROM #Test1 t1
INNER JOIN #Test2 t2
ON t2.Name = t1.Name

SELECT *
FROM #Test1 t1
LEFT JOIN #Test2 t2
ON t2.Name = t1.Name

DROP TABLE #Test1
DROP TABLE #Test2

If you run this and view the execution plan, you'll see that the INNER JOIN query does indeed cost more than the LEFT JOIN, because it satisfies the two criteria above. It's because SQL Server wants to do a hash match for the INNER JOIN, but does nested loops for the LEFT JOIN; the former is normally much faster, but since the number of rows is so tiny and there's no index to use, the hashing operation turns out to be the most expensive part of the query.

You can see the same effect by writing a program in your favourite programming language to perform a large number of lookups on a list with 5 elements, vs. a hash table with 5 elements. Because of the size, the hash table version is actually slower. But increase it to 50 elements, or 5000 elements, and the list version slows to a crawl, because it's O(N) vs. O(1) for the hashtable.

But change this query to be on the ID column instead of Name and you'll see a very different story. In that case, it does nested loops for both queries, but the INNER JOIN version is able to replace one of the clustered index scans with a seek - meaning that this will literally be an order of magnitude faster with a large number of rows.

So the conclusion is more or less what I mentioned several paragraphs above; this is almost certainly an indexing or index coverage problem, possibly combined with one or more very small tables. Those are the only circumstances under which SQL Server might sometimes choose a worse execution plan for an INNER JOIN than a LEFT JOIN.



Related Topics



Leave a reply



Submit