In MySQL Queries, Why Use Join Instead of Where

In MySQL queries, why use join instead of where?

Any query involving more than one table requires some form of association to link the results from table "A" to table "B". The traditional (ANSI-89) means of doing this is to:

List the tables involved in a comma separated list in the FROM clause

Write the association between the tables in the WHERE clause

SELECT *
  FROM TABLE_A a,
       TABLE_B b
 WHERE a.id = b.id

Here's the query re-written using ANSI-92 JOIN syntax:

SELECT *
  FROM TABLE_A a
  JOIN TABLE_B b ON b.id = a.id

From a Performance Perspective:

Where supported (Oracle 9i+, PostgreSQL 7.2+, MySQL 3.23+, SQL Server 2000+), there is no performance benefit to using either syntax over the other. The optimizer sees them as the same query. But more complex queries can benefit from using ANSI-92 syntax:

Ability to control JOIN order - the order which tables are scanned
Ability to apply filter criteria on a table prior to joining

From a Maintenance Perspective:

There are numerous reasons to use ANSI-92 JOIN syntax over ANSI-89:

More readable, as the JOIN criteria is separate from the WHERE clause
Less likely to miss JOIN criteria
Consistent syntax support for JOIN types other than INNER, making queries easy to use on other databases
WHERE clause only serves as filtration of the cartesian product of the tables joined

From a Design Perspective:

ANSI-92 JOIN syntax is pattern, not anti-pattern:

The purpose of the query is more obvious; the columns used by the application is clear
It follows the modularity rule about using strict typing whenever possible. Explicit is almost universally better.

Conclusion

Short of familiarity and/or comfort, I don't see any benefit to continuing to use the ANSI-89 WHERE clause instead of the ANSI-92 JOIN syntax. Some might complain that ANSI-92 syntax is more verbose, but that's what makes it explicit. The more explicit, the easier it is to understand and maintain.

Use JOIN instead of WHERE OR... IN sub query

The OR operations are probably going to mess with your speed the most, OR's kill index performance.
This converts it to the most succinct join equivalent:

SELECT t1.id
FROM t1 INNER JOIN t2 ON t2.id IN (t1.id2, t1.id3, t1.id4)
WHERE t1.date1 >= (UTC_TIMESTAMP() + INTERVAL - 20 Hour)
    OR t1.date2 >= (UTC_TIMESTAMP() + INTERVAL - 20 Hour)
;

IN is logically an OR, but I think more recent MySQL versions have optimized it a bit; if not, this might have better performance:

SELECT t1.id
FROM t1 
   LEFT JOIN t2 AS t2_2 ON t2_2.id  = t1.id2
   LEFT JOIN t2 AS t2_3 ON t2_3.id  = t1.id3
   LEFT JOIN t2 AS t2_4 ON t2_4.id  = t1.id4
WHERE (t1.date1 >= (UTC_TIMESTAMP() + INTERVAL - 20 Hour)
       OR t1.date2 >= (UTC_TIMESTAMP() + INTERVAL - 20 Hour)
      )
      AND (t2_2.id IS NOT NULL OR t2_3.id IS NOT NULL OR t2_4.id IS NOT NULL)
;

It avoids the OR's on the id fields, so may take advantage of indexing on those fields; but JOINs three times as much, so could still end up being slower.

MySQL join with where clause

You need to put it in the join clause, not the where:

SELECT *
FROM categories
LEFT JOIN user_category_subscriptions ON 
    user_category_subscriptions.category_id = categories.category_id
    and user_category_subscriptions.user_id =1

See, with an inner join, putting a clause in the join or the where is equivalent. However, with an outer join, they are vastly different.

As a join condition, you specify the rowset that you will be joining to the table. This means that it evaluates user_id = 1 first, and takes the subset of user_category_subscriptions with a user_id of 1 to join to all of the rows in categories. This will give you all of the rows in categories, while only the categories that this particular user has subscribed to will have any information in the user_category_subscriptions columns. Of course, all other categories will be populated with null in the user_category_subscriptions columns.

Conversely, a where clause does the join, and then reduces the rowset. So, this does all of the joins and then eliminates all rows where user_id doesn't equal 1. You're left with an inefficient way to get an inner join.

Hopefully this helps!

SQL JOIN - WHERE clause vs. ON clause

They are not the same thing.

Consider these queries:

SELECT *
FROM Orders
LEFT JOIN OrderLines ON OrderLines.OrderID=Orders.ID
WHERE Orders.ID = 12345

and

SELECT *
FROM Orders
LEFT JOIN OrderLines ON OrderLines.OrderID=Orders.ID 
    AND Orders.ID = 12345

The first will return an order and its lines, if any, for order number 12345. The second will return all orders, but only order 12345 will have any lines associated with it.

With an INNER JOIN, the clauses are effectively equivalent. However, just because they are functionally the same, in that they produce the same results, does not mean the two kinds of clauses have the same semantic meaning.

MySQL: JOIN vs WHERE performance

With a minor bit of investigation.

I have sympathy with the point above that readability is important, although I find the join to be readable while sub queries I find less readable (although in this case the sub query is quite simple so not a major issue either way).

Normally I would hope that MySQL would manage to optimise a non correlated sub query away and execute it just as efficiently as if it were a join. This sub query at first glance does appear to be non correlated (ie, the results of it do not depend on the containing query).

However playing on SQL fiddle this doesn't appear to be the case:-

http://www.sqlfiddle.com/#!2/7696c/2

Using the sub query the explain says it is an UNCACHEABLE SUBQUERY which from the manual is :-

A subquery for which the result cannot be cached and must be re-evaluated for each row of the outer query

Doing much the same sub query by specifying the value rather than passing it in as a variable gives a different explain and just describes it as a SUBQUERY . This I suspect is just as efficient as the join.

My feeling is that MySQL is confused by the use of the variable, and has planned the query on the assumption that the value of the variable can change between rows. Hence it needs to re execute the sub query for every row. It hasn't managed to recognise that there is nothing in the query that modifies the value of the variable.

If you want to try yourself here are the details to set up the test:-

CREATE TABLE `table`
(
    id INT,
  PRIMARY KEY id(id)
);

CREATE TABLE another_table
(
    id INT,
    table_id_fk INT,
    PRIMARY KEY id (id),
    INDEX table_id_fk (table_id_fk)
);

INSERT INTO `table`
VALUES
(1),
(2),
(3),
(4),
(5),
(6),
(7),
(8);

INSERT INTO another_table
VALUES
(11,1),
(12,3),
(13,5),
(14,7),
(15,9),
(16,11),
(17,13),
(18,15);

SQL to execute:-

SET @id:=13;

SELECT t.id 
FROM `table` t
WHERE id = (
    SELECT table_id_fk 
    FROM another_table 
    WHERE id = @id
);



SELECT t.id 
FROM `table` t 
JOIN another_table at
ON t.id = at.table_id_fk
WHERE at.id = @id;


SELECT t.id 
FROM `table` t
WHERE id = (
    SELECT table_id_fk 
    FROM another_table 
    WHERE id = 13
);

EXPLAIN SELECT t.id 
FROM `table` t
WHERE id = (
    SELECT table_id_fk 
    FROM another_table 
    WHERE id = @id
);



EXPLAIN SELECT t.id 
FROM `table` t 
JOIN another_table at
ON t.id = at.table_id_fk
WHERE at.id = @id;


EXPLAIN SELECT t.id 
FROM `table` t
WHERE id = (
    SELECT table_id_fk 
    FROM another_table 
    WHERE id = 13
);

Explain results:-

ID  SELECT_TYPE     TABLE   TYPE    POSSIBLE_KEYS   KEY     KEY_LEN     REF     ROWS    EXTRA
1   PRIMARY     t   index   (null)  PRIMARY     4   (null)  8   Using where; Using index
2   UNCACHEABLE SUBQUERY    another_table   const   PRIMARY     PRIMARY     4   const   1   

ID  SELECT_TYPE     TABLE   TYPE    POSSIBLE_KEYS   KEY     KEY_LEN     REF     ROWS    EXTRA
1   SIMPLE  at  const   PRIMARY,table_id_fk     PRIMARY     4   const   1   
1   SIMPLE  t   const   PRIMARY     PRIMARY     4   const   1   Using index

ID  SELECT_TYPE     TABLE   TYPE    POSSIBLE_KEYS   KEY     KEY_LEN     REF     ROWS    EXTRA
1   PRIMARY     t   const   PRIMARY     PRIMARY     4   const   1   Using index
2   SUBQUERY    another_table   const   PRIMARY     PRIMARY     4       1

When should I prefer JOIN over WHERE in MySQL queries?

Explicitly mentioning the join is generally supposed to be better (and easier to read) besides being the ANSI standard, but with modern optimizers, I dont think there is any marked difference in performance in both the versions.

Note: the two queries you mentioned are not equivalent - if you replace the left join with an inner join, they become equivalent, in which case there is no noticeable difference in performance.

An inner join is generally faster than a left join.

INNER JOIN ON vs WHERE clause

INNER JOIN is ANSI syntax that you should use.

It is generally considered more readable, especially when you join lots of tables.

It can also be easily replaced with an OUTER JOIN whenever a need arises.

The WHERE syntax is more relational model oriented.

A result of two tables JOINed is a cartesian product of the tables to which a filter is applied which selects only those rows with joining columns matching.

It's easier to see this with the WHERE syntax.

As for your example, in MySQL (and in SQL generally) these two queries are synonyms.

Also, note that MySQL also has a STRAIGHT_JOIN clause.

Using this clause, you can control the JOIN order: which table is scanned in the outer loop and which one is in the inner loop.

You cannot control this in MySQL using WHERE syntax.

Join vs. sub-query

Taken from the MySQL manual (13.2.10.11 Rewriting Subqueries as Joins):

A LEFT [OUTER] JOIN can be faster than an equivalent subquery because the server might be able to optimize it better—a fact that is not specific to MySQL Server alone.

So subqueries can be slower than LEFT [OUTER] JOIN, but in my opinion their strength is slightly higher readability.

In SQL / MySQL, what is the difference between ON and WHERE in a join statement?

WHERE is a part of the SELECT query as a whole, ON is a part of each individual join.

ON can only refer to the fields of previously used tables.

When there is no actual match against a record in the left table, LEFT JOIN returns one record from the right table with all fields set to NULLS. WHERE clause then evaluates and filter this.

In your query, only the records from gifts without match in 'sentgifts' are returned.

Here's the example

gifts

1   Teddy bear
2   Flowers

sentgifts

1   Alice
1   Bob

---
SELECT  *
FROM    gifts g
LEFT JOIN
        sentgifts sg
ON      g.giftID = sg.giftID

---

1  Teddy bear   1     Alice
1  Teddy bear   1     Bob
2  Flowers      NULL  NULL    -- no match in sentgifts

---
SELECT  *
FROM    gifts g
LEFT JOIN
        sentgifts sg
ON      g.giftID = sg.giftID
WHERE   sg.giftID IS NULL

---

2  Flowers      NULL  NULL    -- no match in sentgifts

As you can see, no actual match can leave a NULL in sentgifts.id, so only the gifts that had not ever been sent are returned.

In MySQL Queries, Why Use Join Instead of Where