Explain Join VS. Left Join and Where Condition Performance Suggestion in More Detail

Explain JOIN vs. LEFT JOIN and WHERE condition performance suggestion in more detail

Consider the following example. We have two tables, DEPARTMENTS and EMPLOYEES.

Some departments do not yet have any employees.

This query uses an inner join that finds the department employee 999 works at, if any, otherwise it shows nothing (not even the employee or his or her name):

select a.department_id, a.department_desc, b.employee_id, b.employee_name
from departments a
join employees b
on a.department_id = b.department_id
where b.employee_id = '999'

This next query uses an outer join (left between departments and employees) and finds the department that employee 999 works for. However it too will not show the employee's ID or his or her name, if they do not work at any departments. That is because of the outer joined table being used in the WHERE clause. If there is no matching department, it will be null (not 999, even though 999 exists in employees).

select a.department_id, a.department_desc, b.employee_id, b.employee_name
from departments a
left join employees b
on a.department_id = b.department_id
where b.employee_id = '999'

But consider this query:

select a.department_id, a.department_desc, b.employee_id, b.employee_name
from departments a
left join employees b
on a.department_id = b.department_id
and b.employee_id= '999'

Now the criteria is in the on clause. So even if this employee works at no departments, he will still be returned (his ID and name). The department columns will be null, but we get a result (the employee side).

You might think you would never want to use the outer joined table in the WHERE clause, but that is not necessarily the case. Normally it is, for the reason described above, though.

Suppose you want all departments with no employees. Then you could run the following, which does use an outer join, and the outer joined table is used in the where clause:

select a.department_id, a.department_desc, b.employee_id
from departments a
left join employees b
on a.department_id = b.department_id
where b.employee_id is null

^^ Shows departments with no employees.

The above is likely the only legitimate reason you would want to use an outer joined table in the WHERE clause rather than the ON clause (which I think is what your question is; the difference between inner and outer joins is an entirely different topic).

A good way to look at is this: You use outer joins to allow nulls. Why would you then use an outer join and say that a field should not be null and should be equal to 'XYZ'? If a value has to be 'XYZ' (not null), then why instruct the database to allow nulls to come back? It's like saying one thing and then overriding it later.

INNER JOIN vs LEFT JOIN performance in SQL Server

A LEFT JOIN is absolutely not faster than an INNER JOIN. In fact, it's slower; by definition, an outer join (LEFT JOIN or RIGHT JOIN) has to do all the work of an INNER JOIN plus the extra work of null-extending the results. It would also be expected to return more rows, further increasing the total execution time simply due to the larger size of the result set.

(And even if a LEFT JOIN were faster in specific situations due to some difficult-to-imagine confluence of factors, it is not functionally equivalent to an INNER JOIN, so you cannot simply go replacing all instances of one with the other!)

Most likely your performance problems lie elsewhere, such as not having a candidate key or foreign key indexed properly. 9 tables is quite a lot to be joining so the slowdown could literally be almost anywhere. If you post your schema, we might be able to provide more details.


Edit:

Reflecting further on this, I could think of one circumstance under which a LEFT JOIN might be faster than an INNER JOIN, and that is when:

  • Some of the tables are very small (say, under 10 rows);
  • The tables do not have sufficient indexes to cover the query.

Consider this example:

CREATE TABLE #Test1
(
ID int NOT NULL PRIMARY KEY,
Name varchar(50) NOT NULL
)
INSERT #Test1 (ID, Name) VALUES (1, 'One')
INSERT #Test1 (ID, Name) VALUES (2, 'Two')
INSERT #Test1 (ID, Name) VALUES (3, 'Three')
INSERT #Test1 (ID, Name) VALUES (4, 'Four')
INSERT #Test1 (ID, Name) VALUES (5, 'Five')

CREATE TABLE #Test2
(
ID int NOT NULL PRIMARY KEY,
Name varchar(50) NOT NULL
)
INSERT #Test2 (ID, Name) VALUES (1, 'One')
INSERT #Test2 (ID, Name) VALUES (2, 'Two')
INSERT #Test2 (ID, Name) VALUES (3, 'Three')
INSERT #Test2 (ID, Name) VALUES (4, 'Four')
INSERT #Test2 (ID, Name) VALUES (5, 'Five')

SELECT *
FROM #Test1 t1
INNER JOIN #Test2 t2
ON t2.Name = t1.Name

SELECT *
FROM #Test1 t1
LEFT JOIN #Test2 t2
ON t2.Name = t1.Name

DROP TABLE #Test1
DROP TABLE #Test2

If you run this and view the execution plan, you'll see that the INNER JOIN query does indeed cost more than the LEFT JOIN, because it satisfies the two criteria above. It's because SQL Server wants to do a hash match for the INNER JOIN, but does nested loops for the LEFT JOIN; the former is normally much faster, but since the number of rows is so tiny and there's no index to use, the hashing operation turns out to be the most expensive part of the query.

You can see the same effect by writing a program in your favourite programming language to perform a large number of lookups on a list with 5 elements, vs. a hash table with 5 elements. Because of the size, the hash table version is actually slower. But increase it to 50 elements, or 5000 elements, and the list version slows to a crawl, because it's O(N) vs. O(1) for the hashtable.

But change this query to be on the ID column instead of Name and you'll see a very different story. In that case, it does nested loops for both queries, but the INNER JOIN version is able to replace one of the clustered index scans with a seek - meaning that this will literally be an order of magnitude faster with a large number of rows.

So the conclusion is more or less what I mentioned several paragraphs above; this is almost certainly an indexing or index coverage problem, possibly combined with one or more very small tables. Those are the only circumstances under which SQL Server might sometimes choose a worse execution plan for an INNER JOIN than a LEFT JOIN.

INNER JOIN vs LEFT JOIN performance in SQL

With this statement, the predicates in the WHERE clause would negate the "outerness" of a LEFT JOIN, making equivalent to an INNER JOIN.

I'd expect the optimizer would recognize that, and choose the most cost efficient plan in either case. I'd expect equivalent execution plans, and I wouldn't expect to observe a performance difference in this query, when the INNER keyword is replaced with LEFT.

But without looking at the actual execution plans, that's just conjecture. To see the execution plan, we can use EXPLAIN (for MySQL) or SET SHOWPLAN_ALL ON (for SQL Server, or STATISTICS PROFILE ON, et al.)

The actual performance of the different statements can be observed and measured.


I'm not understanding why you would want to use a LEFT JOIN at all.

(And the mixing of the old-school comma syntax for the join operation with the newer JOIN keyword... what's up with that?)


As far as improving performance, we'd really need to make sure that suitable indexes are available, and are being used. We need to look at the execution plans. We may need to make sure that statistics are collected and up-to-date, and we may need to make some tweaks to the SQL to get an more efficient plan.

It's possible that replacing INNER with LEFT may make a difference in performance of the existing query, but it's unlikely to be a cure to a performance problem.

Personally, I'd prefer to investigate performance of an equivalent query, with this as a starting point:

UPDATE poker_hands f
JOIN poker_cards a ON a.card_name = f.r1 AND a.game_value = 14
JOIN poker_cards b ON b.card_name = f.r2 AND b.game_value = 13 AND b.suit = a.suit
JOIN poker_cards c ON c.card_name = f.r3 AND c.game_value = 12 AND c.suit = a.suit
JOIN poker_cards d ON d.card_name = f.r4 AND d.game_value = 11 AND d.suit = a.suit
JOIN poker_cards e ON e.card_name = f.r5 AND e.game_value = 10 AND e.suit = a.suit
SET f.hand_type = 'Royal flush'

Postgres LEFT JOIN with WHERE condition


left join two tables with a where condition

It's typically wrong to use a LEFT [OUTER] JOIN and then filter with a WHERE condition, thereby voiding the special feature of a LEFT JOIN to include all rows from the left table unconditionally. Detailed explanation:

  • Explain JOIN vs. LEFT JOIN and WHERE condition performance suggestion in more detail

Put conditions supposed to filter all rows into the WHERE clause (rid = 2), but make conditions to left-join rows from record_table out to be actual join conditions:

SELECT t.start_date, t.end_date  -- adding those
, r.id, r.name, r.date
FROM time_table t
LEFT JOIN record_table r ON r.date >= t.start_date
AND r.date < t.end_date
WHERE t.rid = 2;

As commented, it makes sense to include columns from time_table in the result, but that's my optional addition.

You also need to be clear about lower and upper bounds. The general convention is to include the lower and exclude the upper bound in time (timestamp) ranges. Hence my use of >= and < above.

Related:

  • SQL query on a time series to calculate the average
  • Selecting an average of records grouped by 5 minute periods

Performance should be no problem at all with the right indexes.
You need an index (or PK) on time_table(rid) and another on record_table(date).

mysql join performance IF multiple OR conditions

Try doing two joins and moving the "OR" into the COUNT() function:

For every row, this joins table2 once on FK1, then again on FK2 (if it is not already joined to that row via FK1. Then in the COUNT, we specify that only rows which have either join's rel column as non-null.

SELECT 
label,
COUNT( table2A.rel || table2B.rel ) as freq
FROM
`table1`
LEFT JOIN
table2 as table2A
ON id=table2A.FK_first
LEFT JOIN
table2 as table2B
ON id=table2B.FK_second
AND table2A.FKFirst != table2B.FKFirst
GROUP BY label
ORDER BY freq DESC

Where vs ON in outer join

As Gordon Lindolf pointed out it's not true, Your friend is plain wrong.

I want just to add developers like to think SQL like they think their language of trade (C++, VB, Java), but those are procedural/imperative languages.
When you code SQL you are in another paradigm. You are just describing a function to be applied to a dataset.

Let's get your own example:

Select a.*, b.* 
From A a
Left outer join B on a.id = b.id
Where b.id is NULL;

If a.Id and b.Id are not null columns.

It's semantically equal to

Select a.*, null, ..., null
From A a
where not exists (select * from B b where b.Id = a.Id)

Now try to run those to queries and profile.
In most DBMS I can expect both queries to run in the exact same way.

It happens because the engine decides how to implement your "function" over the dataset.

Note the above example is the equivalent in set mathematics to:

Give me the set A minus the intersection between A and B.

Engines can decide how to implement your query because they have some tricks under its sleeve.
It has metrics about your tables, indexes, etc and can use it to, for example, "make a join" in a diferent order you wrote it.

IMHO engines today are really good at finding the best way to implement the function you describe and rarely needs query hints.
Of course you can end describing your funciton in a way too complicated, affecting how the engines decides to run it.
The art of better describing functions and sets and managins indexes is what we call query tunning.

Mysql - LEFT JOIN way faster than INNER JOIN

The reason for the different plans is that LEFT JOIN will force the join order of your tables to match the order they appear in your query. Without the left join, the optimizer will choose the join order for you, and in this case it will choose the very small table first. (You can see this in your explain by looking at the order the tables are listed.) Once your join order is switched, the index for X changes to KEY d which must have a much larger data set than the compound key.

To fix this, change your select to SELECT STRAIGHT_JOIN *. This is preferred over USE INDEX so that the optimizer can still choose the best key for table X... You might find a better compound key than a,b,c,d, or if your data in X changes dramatically, one of your other keys may be better after a point.

I have to point out, that you normally can't just switch to a LEFT JOIN. The data returned will usually be different!



Related Topics



Leave a reply



Submit