Inner Join VS Natural Join VS Using Clause: Are There Any Advantages

Now, apart from the fact that the first form has a duplicated column, is there a real advantage to the other two forms? Or are they just syntactic sugar?

TL;DR NATURAL JOIN is used in a certain style of relational programming that is simpler than the usual SQL style. (Although when embedded in SQL it is burdened with the rest of SQL query syntax.) That's because 1. it directly uses the simple operators of predicate logic, the language of precision in engineering (including software engineering), science (including computer science) and mathematics, and moreover 2. simultaneously and alternatively it directly uses the simple operators of relational algebra.

The common complaint about NATURAL JOIN is that since shared columns aren't explicit, after a schema change inappropriate column pairing may occur. And that may be the case in a particular development environment. But in that case there was a requirement that only certain columns be joined and NATURAL JOIN without PROJECT was not appropriate. So these arguments assume that NATURAL JOIN is being used inappropriately. Moreover the arguers aren't even aware that they are ignoring requirements. Such complaints are specious. (Moreover, sound software engineering design principles lead to not having interfaces with such specifications.)

Another related, misconceived and specious complaint from the same camp is that "NATURAL JOIN does not even take foreign key relationships into account". But any join is there because of the table meanings, not the constraints. Constraints are not needed to query. If a constraint is added then a query remains correct. If a constraint is dropped then a query relying on it becomes wrong and must be changed to a phrasing that doesn't rely on it, a phrasing that would never have had to change. This has nothing to do with NATURAL JOIN.

You have described the difference in effect: just one copy of each common column is returned.

From Is there any rule of thumb to construct SQL query from a human-readable description?:

It turns out that natural language expressions and logical expressions and relational algebra expressions and SQL expressions (a hybrid of the last two) correspond in a rather direct way.

Eg from Codd 1970:

The relation depicted is called component. [...] The meaning of component(x, y,z) is that part x is an immediate component (or subassembly) of part y, and z units of part x are needed to assemble one unit of part y.

From this answer:

Every base table has a statement template, aka predicate, parameterized by column names, by which we put a row in or leave it out.

Plugging a row into a predicate gives a statement aka proposition. The rows that make a true proposition go in a table and the rows that make a false proposition stay out. (So a table states the proposition of each present row and states NOT the proposition of each absent row.)

But every table expression value has a predicate per its expression. The relational model is designed so that if tables T and U hold rows where T(...) and U(...) (respectively) then:

T NATURAL JOIN U holds rows where T(...) AND U(...)
T WHERE condition holds rows where T(...) AND condition
T UNION CORRESPONDING U holds rows where T(...) OR U(...)
T EXCEPT CORRESPONDING U holds rows where T(...) AND NOT U(...)
SELECT DISTINCT columns to keep FROM T holds rows where THERE EXISTS columns to drop SUCH THAT T(...)
etc
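As a minimal sketch of the correspondence above, the following uses Python's standard sqlite3 module as a stand-in DBMS. The tables works_in and located_in, their columns and their predicates are hypothetical names invented for this illustration; they do not come from the quoted answers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical predicate: "employee EMP works in department DEPT"
    CREATE TABLE works_in (emp TEXT, dept TEXT);
    -- Hypothetical predicate: "department DEPT is located in city CITY"
    CREATE TABLE located_in (dept TEXT, city TEXT);
    INSERT INTO works_in VALUES ('ann', 'sales'), ('bob', 'ops');
    INSERT INTO located_in VALUES ('sales', 'oslo'), ('hr', 'rome');
""")

# Rows where "EMP works in DEPT AND DEPT is located in CITY".
both = conn.execute(
    "SELECT * FROM works_in NATURAL JOIN located_in"
).fetchall()

# Rows where "THERE EXISTS a DEPT SUCH THAT
# EMP works in DEPT AND DEPT is located in CITY".
exists_dept = conn.execute(
    "SELECT DISTINCT emp, city FROM works_in NATURAL JOIN located_in"
).fetchall()

print(both)
print(exists_dept)
```

The join's predicate is just the AND of the two table predicates, and dropping the dept column corresponds to existentially quantifying it.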

Whereas reasoning about SQL otherwise is... not "natural":

An SQL SELECT statement can be thought of algebraically as 1. implicitly RENAMEing each column C of a table with (possibly implicit) correlation name T to T.C, then 2. CROSS JOINing, then 3. RESTRICTing per INNER ON, then 4. RESTRICTing per WHERE, then 5. PROJECTing per SELECT, then 6. RENAMEing per SELECT, dropping T.s, then 7. implicitly RENAMEing to drop remaining T.s. Between the T.-RENAMEings algebra operators can also be thought of as logic operators and table names as their predicates: T JOIN ... vs Employee T.EMPLOYEE has name T.NAME ... AND .... But conceptually inside a SELECT statement is a double-RENAME-inducing CROSS JOIN table with T.Cs for column names while outside tables have Cs for column names.

Alternatively an SQL SELECT statement can be thought of logically as 1. introducing FORSOME T IN E around the entire statement per correlation name T and base name or subquery E, then 2. referring to the value of quantified T by using T.C to refer to its C part, then 3. building result rows from T.Cs per FROM etc, then 4. naming the result row columns per the SELECT clause, then 5. leaving the scope of the FORSOMEs. Again the algebra operators are being thought of as logic operators and table names as their predicates. Again though, this conceptually has T.C inside SELECTs but C outside with correlation names coming and going.

These two SQL interpretations are nowhere near as straightforward as just using JOIN or AND, etc, interchangeably. (You don't have to agree that it's simpler, but that perception is why NATURAL JOIN and UNION/EXCEPT CORRESPONDING are there.) (Arguments criticizing this style outside the context of its intended use are specious.)

USING is a kind of middle-ground orphan with one foot in the NATURAL JOIN camp and one in the CROSS JOIN camp. It has no real role in the former because there are no duplicate column names there. In the latter it is more or less just abbreviating JOIN conditions and SELECT clauses.

I can see the disadvantage in the latter forms is that you are expected to have named your primary and foreign keys the same, which is not always practical.

PKs (primary keys), FKs (foreign keys) & other constraints are not needed for querying. (Knowing a column is a function of others allows scalar subqueries, but you can always phrase without.) Moreover any two tables can be meaningfully joined. If you need two columns to have the same name with NATURAL JOIN you rename via SELECT AS.
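A sketch of the renaming approach, run through Python's sqlite3 module as an assumed stand-in DBMS. The country and city tables are hypothetical; the key column is deliberately named differently in each (code vs country_code) and is renamed in a subquery so that NATURAL JOIN pairs the intended columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables whose key columns do not share a name.
    CREATE TABLE country (code TEXT, name TEXT);
    CREATE TABLE city (city_name TEXT, country_code TEXT);
    INSERT INTO country VALUES ('NO', 'Norway'), ('IT', 'Italy');
    INSERT INTO city VALUES ('Oslo', 'NO'), ('Rome', 'IT'), ('Bergen', 'NO');
""")

# Rename country.code to country_code via SELECT ... AS, so the only
# common column name between the two join arguments is country_code.
rows = conn.execute("""
    SELECT city_name, name
    FROM city
    NATURAL JOIN (SELECT code AS country_code, name FROM country) AS c
    ORDER BY city_name
""").fetchall()

print(rows)
```

No key or constraint is declared on either table; the join is justified by what the rows mean, not by a constraint.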

Difference between natural join and inner join

One significant difference between INNER JOIN and NATURAL JOIN is the number of columns returned.

Consider:

TableA                           TableB
+------------+----------+        +---------+----------+
| Column1    | Column2  |        | Column1 | Column3  |
+------------+----------+        +---------+----------+
| 1          | 2        |        | 1       | 3        |
+------------+----------+        +---------+----------+

The INNER JOIN of TableA and TableB on Column1, written with ON, will return:

SELECT * FROM TableA AS a INNER JOIN TableB AS b ON a.Column1 = b.Column1;

+------------+-----------+-----------+-----------+
| a.Column1  | a.Column2 | b.Column1 | b.Column3 |
+------------+-----------+-----------+-----------+
| 1          | 2         | 1         | 3         |
+------------+-----------+-----------+-----------+

(The USING form, SELECT * FROM TableA AS a INNER JOIN TableB AS b USING (Column1);, returns Column1 only once, so its output has three columns rather than four.)

The NATURAL JOIN of TableA and TableB, which joins on the shared Column1, will return:

SELECT * FROM TableA NATURAL JOIN TableB;

+------------+----------+----------+
| Column1    | Column2  | Column3  |
+------------+----------+----------+
| 1          | 2        | 3        |
+------------+----------+----------+

The repeated column is avoided.

(AFAICT from the standard grammar, you can't specify the joining columns in a natural join; the join is strictly name-based. See also Wikipedia.)

(There's a cheat in the inner join output; the a. and b. parts would not be in the column names; you'd just have column1, column2, column1, column3 as the headings.)
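To check the column counts and headings for yourself, here is a small sketch using Python's sqlite3 module. SQLite is an assumption here (the answers above do not name a DBMS for this example), and other systems may label columns differently. The helper headings_and_rows is a hypothetical name introduced for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TableA (Column1 INTEGER, Column2 INTEGER);
    CREATE TABLE TableB (Column1 INTEGER, Column3 INTEGER);
    INSERT INTO TableA VALUES (1, 2);
    INSERT INTO TableB VALUES (1, 3);
""")

def headings_and_rows(sql):
    """Run a query and return (list of column headings, list of rows)."""
    cur = conn.execute(sql)
    return [d[0] for d in cur.description], cur.fetchall()

on_result = headings_and_rows(
    "SELECT * FROM TableA AS a INNER JOIN TableB AS b ON a.Column1 = b.Column1"
)
using_result = headings_and_rows(
    "SELECT * FROM TableA AS a INNER JOIN TableB AS b USING (Column1)"
)
natural_result = headings_and_rows(
    "SELECT * FROM TableA NATURAL JOIN TableB"
)

for label, result in [("ON", on_result), ("USING", using_result),
                      ("NATURAL", natural_result)]:
    print(label, result)
```

In this sketch only the ON form yields four columns; USING and NATURAL JOIN each yield three.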

Inner Join vs. Natural Join, speed-wise?

It is going to depend on the DBMS, and to some extent on the tables joined, and the difference is generally not going to be measurable. With an INNER JOIN, there will be two copies of the common columns; with a NATURAL JOIN, there will be just one copy of the common columns.

If the intermediate results have to be written to disk or sorted, there will be more data to be managed, and there is slightly more data to be returned to the client over the client-server connection, so an INNER JOIN might take slightly longer than a NATURAL JOIN. If the common columns are big enough (long character fields, for example) this size difference could be significant.

The difference in time to prepare the two types of join is completely negligible. Both have to read the information for all the columns in both tables involved in the join, and the processing of that data is substantially similar.

During execution, the copying of the data to form the intermediate results might be slightly more complex for a NATURAL JOIN, because you don't simply take a copy of each row, but you don't necessarily take a full copy of each row for an INNER JOIN either, so the difference is again negligible. And, as before, there is slightly less data to be copied in aggregate for the NATURAL JOIN, so there is still a small advantage to it.

So, overall, a NATURAL JOIN might be slightly quicker than an INNER JOIN, but the difference is still likely to be negligible.

MySQL JOIN ON vs USING?

It is mostly syntactic sugar, but a couple differences are noteworthy:

ON is the more general of the two. One can join tables ON a column, a set of columns and even a condition. For example:

SELECT * FROM world.City JOIN world.Country ON (City.CountryCode = Country.Code) WHERE ...

USING is useful when both tables share a column of the exact same name on which they join. In this case, one may say:

SELECT ... FROM film JOIN film_actor USING (film_id) WHERE ...

An additional nice treat is that one does not need to fully qualify the joining columns:

SELECT film.title, film_id -- film_id is not prefixed
FROM film
JOIN film_actor USING (film_id)
WHERE ...

To illustrate, to do the above with ON, we would have to write:

SELECT film.title, film.film_id -- film.film_id is required here
FROM film
JOIN film_actor ON (film.film_id = film_actor.film_id)
WHERE ...

Notice the film.film_id qualification in the SELECT clause. It would be invalid to just say film_id since that would make for an ambiguity:

ERROR 1052 (23000): Column 'film_id' in field list is ambiguous
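The same qualification rule can be tried without a MySQL server. The sketch below assumes SQLite (through Python's sqlite3 module) behaves analogously for this case; its error text differs from MySQL's ERROR 1052. The two-column film and film_actor tables are simplified, hypothetical stand-ins for the ones in the queries above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified, hypothetical versions of the film tables.
    CREATE TABLE film (film_id INTEGER, title TEXT);
    CREATE TABLE film_actor (film_id INTEGER, actor_id INTEGER);
    INSERT INTO film VALUES (1, 'Alien');
    INSERT INTO film_actor VALUES (1, 10);
""")

# With USING, the unqualified film_id is accepted.
using_rows = conn.execute(
    "SELECT film.title, film_id FROM film JOIN film_actor USING (film_id)"
).fetchall()

# With ON, the unqualified film_id could mean either table's column.
try:
    conn.execute(
        "SELECT film.title, film_id "
        "FROM film JOIN film_actor ON (film.film_id = film_actor.film_id)"
    )
    on_error = None
except sqlite3.OperationalError as exc:
    on_error = str(exc)

print(using_rows)
print(on_error)
```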

As for select *, the joining column appears in the result set twice with ON while it appears only once with USING:

mysql> create table t(i int);insert t select 1;create table t2 select*from t;
Query OK, 0 rows affected (0.11 sec)

Query OK, 1 row affected (0.00 sec)
Records: 1  Duplicates: 0  Warnings: 0

Query OK, 1 row affected (0.19 sec)
Records: 1  Duplicates: 0  Warnings: 0

mysql> select*from t join t2 on t.i=t2.i;
+------+------+
| i    | i    |
+------+------+
|    1 |    1 |
+------+------+
1 row in set (0.00 sec)

mysql> select*from t join t2 using(i);
+------+
| i    |
+------+
|    1 |
+------+
1 row in set (0.00 sec)

mysql>

Natural Join -- Relational theory and SQL

A number of points regarding your question (even if I'm afraid I'm not really answering anything you asked):

"On one hand, in relational theory, natural joins are the only joins that should happen (or at least are highly preferred)."

This seems to suggest that you interpret the theory as if it proscribes "other kinds" of joins ... That is not really true. Relational theory does not say "you cannot have antijoins", or "you should never use antijoins", or anything like that. What it DOES say is that in the relational algebra a set of primitive operators can be identified, in which natural join is the only "join-like" operator. All other "join-like" operators can always be expressed equivalently in terms of the primitive operators defined. Cartesian product, for example, is a special case of a natural join (where the set of common attributes is empty), and if you want the cartesian product of two tables that do have an attribute name in common, you can address this using RENAME. Semijoin is the natural join of the first table with the projection of the second on the common attributes. Antijoin (SEMIMINUS or NOT MATCHING in Date's book) is the relational difference between the first table and a SEMIJOIN of the two. And so on.
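The semijoin and antijoin constructions described above can be sketched in SQL, here run through Python's sqlite3 module as an assumed stand-in DBMS. The supplier and shipment tables and their data are hypothetical. Projection is written as SELECT DISTINCT and relational difference as EXCEPT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables sharing only the attribute sno.
    CREATE TABLE supplier (sno TEXT, sname TEXT);
    CREATE TABLE shipment (sno TEXT, pno TEXT);
    INSERT INTO supplier VALUES ('s1', 'Smith'), ('s2', 'Jones'), ('s3', 'Blake');
    INSERT INTO shipment VALUES ('s1', 'p1'), ('s1', 'p2'), ('s2', 'p1');
""")

# Semijoin: natural join of supplier with the projection of shipment
# on the common attribute (sno).
semijoin = conn.execute("""
    SELECT sno, sname
    FROM supplier
    NATURAL JOIN (SELECT DISTINCT sno FROM shipment) AS shipped
    ORDER BY sno
""").fetchall()

# Antijoin: supplier minus the semijoin of the two.
antijoin = conn.execute("""
    SELECT sno, sname FROM supplier
    EXCEPT
    SELECT sno, sname
    FROM supplier
    NATURAL JOIN (SELECT DISTINCT sno FROM shipment) AS shipped
    ORDER BY sno
""").fetchall()

print(semijoin)
print(antijoin)
```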

"On the other hand, in SQL it is advised against using NATURAL JOIN and instead use alternate means (e.g inner join with restriction)."

Where are such things advised? In the SQL standard? I don't really think so. It is important to distinguish between the SQL language per se, which is defined by an ISO standard, and some (/any) particular implementation of that language, which is built by some particular vendor. If Microsoft advises its customers not to use NJ in SQL Server 200x, then that advice has a completely different meaning than advice from someone never to use NJ in SQL at all.

"Natural joins work in true RDBMS. SQL however, fails at completely reproducing the relational model and none of the popular SQL DBMSs are true RDBMS."

While it is true that SQL per se fails to faithfully comply with relational theory, that actually has very little to do with the question of NJ.

Whether an implementation gives good performance for invocations of NJ, is a characteristic of that implementation, not of the language, or of the "degree of trueness" of the 'R' in 'RDBMS'. It is very easy to build a TRDBMS that doesn't use SQL, and that gives ridiculous execution times for NJ. The SQL language per se has everything that is needed to support NJ. If an implementation supports NJ, then NJ will work in that implementation too. Whether it gives good performance, is a characteristic of that implementation, and poor performance of some particular implementation should not be "extrapolated" to other implementations, or be seen as a characteristic of the SQL language per se.

"Good/Better table design should remove/minimise the problems that natural join creates."

Problems that natural join creates? Controlling the columns that appear in the arguments to a join is easily done by adding explicit projections (and renames if needed) on the columns you want. Much like you also want to avoid SELECT * as much as possible, for basically the same reason ...
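A sketch of what explicit projections buy you, using Python's sqlite3 module as an assumed stand-in DBMS. The customer and purchase tables are hypothetical; both happen to have a name column with unrelated meanings, which is the kind of accidental pairing usually blamed on NATURAL JOIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables: "name" means something different in each.
    CREATE TABLE customer (customer_id INTEGER, name TEXT);
    CREATE TABLE purchase (purchase_id INTEGER, customer_id INTEGER, name TEXT);
    INSERT INTO customer VALUES (1, 'Ann');
    INSERT INTO purchase VALUES (100, 1, 'Laptop');
""")

# Joining the whole tables pairs customer_id AND name, so nothing matches.
unprojected = conn.execute(
    "SELECT * FROM customer NATURAL JOIN purchase"
).fetchall()

# Projecting (and renaming) each argument first leaves customer_id as the
# only common column, which is the pairing that was actually wanted.
projected = conn.execute("""
    SELECT customer_id, customer_name, purchase_id
    FROM (SELECT customer_id, name AS customer_name FROM customer) AS c
    NATURAL JOIN (SELECT purchase_id, customer_id FROM purchase) AS p
""").fetchall()

print(unprojected)
print(projected)
```

The projected form also keeps working if either base table later gains further columns, since only the listed columns take part in the join.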

Relational Algebra - how does natural join work?

Well, you have learned the first important lesson, which is to avoid natural join. It is just lousy syntax, because it does not even take properly declared foreign key relationships into account and the join conditions are hidden -- which makes queries hard to maintain and debug.

A natural join is an inner equijoin with the join conditions on columns with the same names. Natural joins do not even take types into account, so the query can have type conversion errors if your data is really messed up.

If the corresponding inner join on the common column names has no matches, then it returns the empty set. If there are no common column names, then it is the same as a cross join.

The way to think about it is that a natural join (inner natural join) generates the Cartesian product of two tables. When the tables have duplicated column names, then the final result set contains only those Cartesian-product rows where the common column names have the same value.
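Both cases can be checked with a short sketch, here using Python's sqlite3 module as an assumed stand-in DBMS. All table names (colours, sizes, a, b) and their data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- No common column names between these two tables.
    CREATE TABLE colours (colour TEXT);
    CREATE TABLE sizes (size TEXT);
    INSERT INTO colours VALUES ('red'), ('blue');
    INSERT INTO sizes VALUES ('S'), ('M');

    -- One common column name (k) between these two tables.
    CREATE TABLE a (k INTEGER, x TEXT);
    CREATE TABLE b (k INTEGER, y TEXT);
    INSERT INTO a VALUES (1, 'a1'), (2, 'a2');
    INSERT INTO b VALUES (2, 'b2'), (3, 'b3');
""")

# With no common columns, NATURAL JOIN gives the same rows as CROSS JOIN.
natural_no_common = sorted(
    conn.execute("SELECT * FROM colours NATURAL JOIN sizes").fetchall()
)
cross = sorted(
    conn.execute("SELECT * FROM colours CROSS JOIN sizes").fetchall()
)

# With a common column, NATURAL JOIN keeps the Cartesian-product rows
# whose common column values are equal, with that column shown once.
natural_common = conn.execute("SELECT * FROM a NATURAL JOIN b").fetchall()
filtered_product = conn.execute(
    "SELECT a.k, a.x, b.y FROM a CROSS JOIN b WHERE a.k = b.k"
).fetchall()

print(natural_no_common == cross, len(cross))
print(natural_common, filtered_product)
```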
