Venn Diagram for Natural Join

Venn diagrams are not very helpful for understanding natural join or inner join. Most Venn diagrams associated with joins on Stack Overflow are parroted worthless misrepresentations--even in cases where a Venn diagram could be useful.

Here are some valid uses of Venn diagrams for SQL natural join:

If we ignore column order, we can have an area be a set whose elements are an associated table's column names. Then the left & right circles' elements are the left & right tables' column names & the union's elements are the result's column names.

If input table columns with the same name have the same type then we can have an area be the set whose elements are the subrow values that appear somewhere in a table for the common columns. Then the left & right circles' elements are the left & right tables' such subrow values & the intersection elements are the result's such subrow values.

But neither diagram nor the pair tells us what the output rows are.
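Here is a minimal sketch of those two diagrams, using Python's built-in sqlite3 module. The tables l and r and their contents are invented for illustration and are not from the original answer. It only checks the two set relationships described above; it says nothing about what the output rows are.

```python
import sqlite3

# Hypothetical tables: the only shared column name is "id".
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE l (id INTEGER, name TEXT);
    CREATE TABLE r (id INTEGER, colour TEXT);
    INSERT INTO l VALUES (1, 'ann'), (2, 'bob'), (3, 'cy');
    INSERT INTO r VALUES (2, 'red'), (3, 'blue'), (4, 'green');
""")

def column_names(sql):
    """Return the set of column names a query produces (order ignored)."""
    return {col[0] for col in conn.execute(sql).description}

def values(sql):
    """Return the set of first-column values a query produces."""
    return {row[0] for row in conn.execute(sql)}

# Diagram 1: elements are column names; the result holds the union.
left_cols = column_names("SELECT * FROM l")
right_cols = column_names("SELECT * FROM r")
join_cols = column_names("SELECT * FROM l NATURAL JOIN r")

# Diagram 2: elements are common-column subrow values (here just id);
# the result holds the intersection.
left_ids = values("SELECT id FROM l")
right_ids = values("SELECT id FROM r")
join_ids = values("SELECT l.id FROM l NATURAL JOIN r")

print(sorted(join_cols))  # ['colour', 'id', 'name']
print(sorted(join_ids))   # [2, 3]
```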

From my answer at CROSS JOIN vs INNER JOIN in SQL Server 2008:

Re Venn diagrams: A Venn diagram with two intersecting circles can illustrate the difference between output rows for INNER, LEFT, RIGHT & FULL JOINs for the same input. And when the ON is unconditionally TRUE, the INNER JOIN result is the same as CROSS JOIN. Also it can illustrate the input & output rows for INTERSECT, UNION & EXCEPT. And when both inputs have the same columns, the INTERSECT result is the same as for standard SQL NATURAL JOIN, and the EXCEPT result is the same as for certain idioms involving LEFT & RIGHT JOIN. But it does not illustrate how (INNER) JOIN works in general. That just seems plausible at first glance. It can identify parts of input and/or output for special cases of ON, PKs (primary keys), FKs (foreign keys) and/or SELECT. All you have to do to see this is to identify what exactly are the elements of the sets represented by the circles. (Which muddled presentations never make clear.) (Remember that in general for joins output rows have different headings from input rows.)

I repeat with emphasis:

But it does not illustrate how (INNER) JOIN works in general.

All you have to do to see this is to identify what exactly are the elements of the sets represented by the circles.
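The special cases named in the quote can be checked with a small sketch like the following. It uses Python's sqlite3 module with two invented one-column tables, and it assumes no nulls and no duplicate rows in the input.

```python
import sqlite3

# Hypothetical tables with the same single column and no nulls/duplicates.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (x INTEGER);
    CREATE TABLE b (x INTEGER);
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO b VALUES (2), (3), (4);
""")

def rows(sql):
    """Run a query and return its rows in a canonical (sorted) order."""
    return sorted(conn.execute(sql).fetchall())

# An ON that is always true makes INNER JOIN behave like CROSS JOIN.
inner_on_true = rows("SELECT * FROM a INNER JOIN b ON 1 = 1")
cross = rows("SELECT * FROM a CROSS JOIN b")

# With identical columns, INTERSECT and NATURAL JOIN agree on this input.
intersect = rows("SELECT x FROM a INTERSECT SELECT x FROM b")
natural = rows("SELECT * FROM a NATURAL JOIN b")

# In general the output heading differs from either input heading:
# each input has one column, this inner join result has two.
inner_headings = [
    col[0]
    for col in conn.execute(
        "SELECT * FROM a INNER JOIN b ON a.x < b.x"
    ).description
]

print(inner_on_true == cross, len(cross))  # True 9
print(intersect == natural, natural)       # True [(2,), (3,)]
print(len(inner_headings))                 # 2
```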

From my comments (using "key" in the sense of "legend") on an answer re its Venn diagram "Figure 1" for inner join:

Figure 1 is a common terrible attempt to explain JOIN. Its key is also complex: It's only for tables as sets & only equijoin & only one [column]; it also represents the input differently than the output. Write it for JOIN in general.

From my comments on What is the difference between “INNER JOIN” and “OUTER JOIN”?:

Venn diagrams show elements in sets. Just try to identify exactly what the sets are and what the elements are in these diagrams. The sets aren't the tables and the elements aren't their rows. Also any two tables can be joined, so PKs & FKs are irrelevant. All bogus. You are doing just what thousands of others have done--got a vague impression you (wrongly) assume makes sense.

Of the answers & comments & their references below only one actually explains how Venn diagrams represent the operators: The circle intersection area represents the set of rows in A JOIN B. The area unique to each circle represents the set of rows you get by taking its table's rows that don't participate in A JOIN B and adding the columns unique to the other table all set to NULL. (And most give a vague bogus correspondence of the circles to A and B.)

So Venn diagrams are relevant for certain cases where tables can reasonably be considered to hold sets of row-valued elements. But in general SQL tables do not hold sets of row-valued elements, while Venn diagrams denote sets.

Re illustrating inner vs outer joins via Venn diagrams:

From my comment on LEFT JOIN vs. LEFT OUTER JOIN in SQL Server:

Re Venn diagrams: If no nulls or duplicate rows are input, so we can take a table to be a set of row-valued values & use normal math =, then the Venn diagrams are OK--taking circles to hold left & right join output tables/sets. But if nulls or duplicate rows are input then it is so difficult to explain just what the circles are sets of & how those sets relate to input & output tables/bags that Venn diagrams are not helpful.
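A small sketch of why nulls and duplicate rows break the set picture, using Python's sqlite3 module and two invented one-column tables. The behaviour shown is what SQLite does; the table contents are assumptions chosen to trigger both problems.

```python
import sqlite3

# Hypothetical inputs that each hold a duplicate row and a NULL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (x INTEGER);
    CREATE TABLE b (x INTEGER);
    INSERT INTO a VALUES (1), (1), (NULL);
    INSERT INTO b VALUES (1), (1), (NULL);
""")

# Join: each duplicate pairs with each duplicate (2 x 2 rows), and the
# NULLs never match because NULL = NULL is not true in SQL.
joined = conn.execute(
    "SELECT a.x, b.x FROM a INNER JOIN b ON a.x = b.x"
).fetchall()

# INTERSECT: duplicates collapse and the NULLs are treated as the same.
intersected = conn.execute(
    "SELECT x FROM a INTERSECT SELECT x FROM b"
).fetchall()

print(len(joined))       # 4
print(len(intersected))  # 2
```

So for this input the "intersection" of the two circles has two elements while the inner join has four rows, none of them involving NULL.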

From my comment on my answer at What is the difference between “INNER JOIN” and “OUTER JOIN”?:

I must admit that, despite my quick phrasing in comments, because SQL involves bags & nulls and SQL culture doesn't have common terminology to name & distinguish between relevant notions, it is non-trivial even to explain clearly how elements of a Venn diagram are 1:1 with output "rows", let alone input "rows". Or what inner or outer joins do, let alone their difference. "value" may or may not include NULL, "row" may be a list of values vs a slot in a table value or variable & "=" may be SQL "=" vs equality.

PS Often diagrams are called Venn diagrams when they are really Euler diagrams.

sql joins as venn diagram

I think your main underlying confusion is that when (for example) only A is highlighted in red, you're taking that to mean "the query only returns data from A", but in fact it means "the query only returns data for those cases where A has a record". The query might still contain data from B. (For cases where B does not have a record, the query will substitute NULL.)

Similarly, the image below that only includes data from the B circle, so why is A included at all in the join statement?

If you mean the image where A is entirely in white, and there's a red crescent-shape for the part of B that doesn't overlap with A, then: the reason that A appears in the query is that A is how it finds the records in B that need to be excluded. (If A didn't appear in the query, then the Venn diagram wouldn't have A, it would only show B, and there'd be no way to distinguish the desired records from the unwanted ones.)

The image makes it seem like circle B is the primary focus of the sql statement, but the sql statement itself, by starting with A (select from A, join B), conveys the opposite impression to me, namely that A would be the focus of the sql statement.

Quite right. For this reason, RIGHT JOINs are relatively uncommon; although a query that uses a LEFT JOIN can nearly always be re-ordered to use a RIGHT JOIN instead (and vice versa), usually people will write their queries with LEFT JOIN and not with RIGHT JOIN.
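As a rough sketch of that re-ordering, here is the "crescent of B" query written both ways with Python's sqlite3 module. The tables a and b and their id column are invented for illustration. SQLite only accepts RIGHT JOIN from version 3.39, so the RIGHT JOIN form is run only when the linked library is new enough.

```python
import sqlite3

# Hypothetical tables; only id 3 appears in both.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER);
    CREATE TABLE b (id INTEGER);
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO b VALUES (3), (4);
""")

# Rows of b that have no match in a, with b written first (LEFT JOIN).
left_form = """
    SELECT b.id FROM b
    LEFT JOIN a ON a.id = b.id
    WHERE a.id IS NULL
"""

# The same request with a written first (RIGHT JOIN).
right_form = """
    SELECT b.id FROM a
    RIGHT JOIN b ON a.id = b.id
    WHERE a.id IS NULL
"""

left_rows = conn.execute(left_form).fetchall()
print(left_rows)  # [(4,)]

# RIGHT JOIN needs SQLite 3.39 or later, so guard the comparison.
if sqlite3.sqlite_version_info >= (3, 39, 0):
    right_rows = conn.execute(right_form).fetchall()
    print(right_rows == left_rows)
```

Either way, a has to appear in the query: it is the only means of telling which rows of b to exclude.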

Joins explained by Venn Diagram with more than one join

I think it is not quite possible to map your example onto these types of diagrams for the following reason:

The diagrams here are the ones used to describe intersections and unions in set theory. To overlap as depicted, all three circles need to contain elements of the same type, which by definition is not possible when we are dealing with three different tables that each contain a different type of (row) object.

If all three tables were joined on the same key, you could identify the values of this key as the elements the sets contain, but since this is not the case in your example, these kinds of pictures are not applicable.

If we do assume that both joins in your example use the same key, then only the green area would be the correct result: the first join restricts you to the intersection of Employees and Employee types, and the second join restricts you to all of Employees. Since both join conditions must be true, you get the intersection of those two sections, which is the green area.

Hope this helps.

Inner Join, Natural Joins and Equi Join

Inner join of A and B combines columns of a row from A and a row from B based on a join predicate. For example, a "sempai" join: SELECT ... FROM people A INNER JOIN people B ON A.age > B.age will pair each person with each person that is their junior; the juniormost people will not be selected from A, and seniormost people will not be selected from B, because there are no matching rows.
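A minimal runnable version of the "sempai" join, using Python's sqlite3 module. The people table's name column and the three sample rows are made up for the example.

```python
import sqlite3

# Hypothetical sample data: three people with distinct ages.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (name TEXT, age INTEGER);
    INSERT INTO people VALUES ('ann', 20), ('bob', 30), ('cy', 40);
""")

# Pair each person (A) with every person who is their junior (B).
pairs = conn.execute("""
    SELECT A.name, B.name
    FROM people A INNER JOIN people B ON A.age > B.age
    ORDER BY A.name, B.name
""").fetchall()
print(pairs)  # [('bob', 'ann'), ('cy', 'ann'), ('cy', 'bob')]

# Who ever shows up on each side of a pair?
seniors = {a for a, _ in pairs}  # ann, the juniormost, is missing
juniors = {b for _, b in pairs}  # cy, the seniormost, is missing
print(sorted(seniors), sorted(juniors))  # ['bob', 'cy'] ['ann', 'bob']
```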

Equi join is a particular join where the join relation is equality. A "sempai" join from the last paragraph is not an equi join; but "same age" join would be. Though typically it would be used for foreign key relationships (equi joins on primary keys), such as SELECT ... FROM person A INNER JOIN bicycle B ON A.bicycle_id = B.id. (Pay no attention to the fact that this is not a proper model, people sometimes have multiple bicycles... a bit of a silly example, I'm sure I could have found a better one.)

A natural join is a special kind of equi join that assumes equality of all shared columns (without explicitly stating the predicate). So for example SELECT ... FROM people A INNER JOIN bicycles B ON A.bicycle_id = B.bicycle_id is equivalent to SELECT ... FROM people A NATURAL JOIN bicycles B, assuming bicycle_id is the only column present in both tables. Most people I know will not use this, for several reasons: it is a more common practice to have the primary key not repeat the table name, i.e. bicycles.id rather than bicycles.bicycles_id; it is possible the foreign key does not reflect the table name (e.g. person.overseer_id rather than person.person_id, for obvious reasons), and (forgotten by me but thankfully remembered by Sudipta Mondal) there might be unrelated columns that are named the same but make zero sense to join on, like creation_time. For these reasons, I have never used NATURAL JOIN in my life.

Equi/natural joins do not necessarily have to be inner.
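The points about natural joins above can be sketched with Python's sqlite3 module. The column names other than bicycle_id and creation_time, and all of the data, are invented for the example.

```python
import sqlite3

# Hypothetical schema: bicycle_id is the only shared column at first.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (name TEXT, bicycle_id INTEGER);
    CREATE TABLE bicycles (bicycle_id INTEGER, model TEXT);
    INSERT INTO people VALUES ('ann', 1), ('bob', 2), ('cy', NULL);
    INSERT INTO bicycles VALUES (1, 'road'), (2, 'city');
""")

def rows(sql):
    """Run a query and return its rows."""
    return conn.execute(sql).fetchall()

# Explicit equi join versus NATURAL JOIN: same rows.
explicit = rows("""
    SELECT A.name, B.model
    FROM people A INNER JOIN bicycles B ON A.bicycle_id = B.bicycle_id
    ORDER BY A.name
""")
natural = rows("""
    SELECT name, model FROM people A NATURAL JOIN bicycles B
    ORDER BY name
""")

# A natural join can also be outer: cy (no bicycle) is kept.
natural_left = rows("""
    SELECT name, model FROM people NATURAL LEFT JOIN bicycles
    ORDER BY name
""")

# Add an unrelated same-named column to both tables, as warned above.
conn.executescript("""
    ALTER TABLE people ADD COLUMN creation_time TEXT;
    ALTER TABLE bicycles ADD COLUMN creation_time TEXT;
    UPDATE people SET creation_time = '2020-01-01';
    UPDATE bicycles SET creation_time = '2021-06-15';
""")

# NATURAL JOIN now also requires equal creation_time, so nothing matches.
natural_after = rows("SELECT name, model FROM people NATURAL JOIN bicycles")

print(explicit == natural)  # True
print(natural_left[-1])     # ('cy', None)
print(natural_after)        # []
```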

What is the difference between INNER JOIN and OUTER JOIN?

Assuming you're joining on columns with no duplicates, which is a very common case:

  • An inner join of A and B gives the result of A intersect B, i.e. the inner part of a Venn diagram intersection.

  • An outer join of A and B gives the results of A union B, i.e. the outer parts of a Venn diagram union.

Examples

Suppose you have two tables, with a single column each, and data as follows:

A    B
-    -
1    3
2    4
3    5
4    6

Note that (1,2) are unique to A, (3,4) are common, and (5,6) are unique to B.

Inner join

An inner join using either of the equivalent queries gives the intersection of the two tables, i.e. the two rows they have in common.

select * from a INNER JOIN b on a.a = b.b;
select a.*, b.* from a,b where a.a = b.b;

a | b
--+--
3 | 3
4 | 4

Left outer join

A left outer join will give all rows in A, plus any common rows in B.

select * from a LEFT OUTER JOIN b on a.a = b.b;
-- Oracle-only legacy (+) outer join syntax:
select a.*, b.* from a,b where a.a = b.b(+);

a |  b
--+-----
1 | null
2 | null
3 |    3
4 |    4

Right outer join

A right outer join will give all rows in B, plus any common rows in A.

select * from a RIGHT OUTER JOIN b on a.a = b.b;
-- Oracle-only legacy (+) outer join syntax:
select a.*, b.* from a,b where a.a(+) = b.b;

a    |  b
-----+----
3    |  3
4    |  4
null |  5
null |  6

Full outer join

A full outer join will give you the union of A and B, i.e. all the rows in A and all the rows in B. If something in A doesn't have a corresponding datum in B, then the B portion is null, and vice versa.

select * from a FULL OUTER JOIN b on a.a = b.b;

 a   |  b
-----+-----
   1 | null
   2 | null
   3 |    3
   4 |    4
null |    6
null |    5
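To try the examples above without a database server, here is a sketch using Python's sqlite3 module with the same two tables. One deliberate substitution: older SQLite builds lack RIGHT and FULL OUTER JOIN, so the right outer join is written as a LEFT OUTER JOIN with the tables swapped, and the full outer join is emulated as a left outer join plus the unmatched rows of b. These are stand-ins for the queries shown above, not the same statements.

```python
import sqlite3

# Same data as the example tables A and B.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (a INTEGER);
    CREATE TABLE b (b INTEGER);
    INSERT INTO a VALUES (1), (2), (3), (4);
    INSERT INTO b VALUES (3), (4), (5), (6);
""")

def rows(sql):
    """Run a query and return its rows."""
    return conn.execute(sql).fetchall()

inner = rows("SELECT * FROM a INNER JOIN b ON a.a = b.b ORDER BY a.a")
left = rows("SELECT * FROM a LEFT OUTER JOIN b ON a.a = b.b ORDER BY a.a")

# Right outer join, rewritten as a left outer join with tables swapped.
right = rows("""
    SELECT a.a, b.b FROM b LEFT OUTER JOIN a ON a.a = b.b
    ORDER BY b.b
""")

# Full outer join, emulated: all of a, plus the rows of b with no match.
full = rows("""
    SELECT a.a, b.b FROM a LEFT OUTER JOIN b ON a.a = b.b
    UNION ALL
    SELECT a.a, b.b FROM b LEFT OUTER JOIN a ON a.a = b.b
    WHERE a.a IS NULL
""")

print(inner)      # [(3, 3), (4, 4)]
print(left)       # [(1, None), (2, None), (3, 3), (4, 4)]
print(right)      # [(3, 3), (4, 4), (None, 5), (None, 6)]
print(len(full))  # 6
```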

Opposite of left outer join in pandas based on one column results in less records than expected

When merging the two tables, pass indicator=True; pandas then adds a _merge column that records whether each row came from the left table only, the right table only, or both.

import pandas as pd

df_right_only = pd.merge(df1, df2, on="Common Column", how="right", indicator=True)

Then use .loc to keep only the rows that come solely from the right table:

df_right_only.loc[df_right_only["_merge"] == "right_only", "Column that you want"]

This approach is known as an anti-join.

How to show only the left part of the table(Not intersecting section) in Left Join in sql?

All those records from customers that have no orders:

SELECT *
FROM Customers c
LEFT JOIN Orders o ON o.CustomerId = c.CustomerId
WHERE o.CustomerId IS NULL

I'm not really sure I would represent your customers/orders relationship with that Venn diagram, because it implies to me that there can be orders that have no customers, but I'll overlook that.

When we left join we get all the customers plus any orders they have made. Customers that have made no orders have a null in the related o.CustomerId, and that is what we look for in the where clause.

It is most reliable to use (one of) the column(s) specified in the join condition when doing this test. Any other column from orders might be null for other reasons (product type not known, for example) unless it is specified as NOT NULL in the table definition. To save looking this up, we rely on the fact that the only way o.CustomerId can be null in this particular query (where it is mentioned in the join) is if there is no matching order row for that customer.
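To see why the join column is the safer one to test, here is a small sketch with Python's sqlite3 module. The Name, OrderId and ProductType columns and the sample rows are hypothetical; only Customers, Orders and CustomerId come from the query above.

```python
import sqlite3

# Hypothetical data: ann has one order with an unknown product type,
# bob has no orders at all.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (CustomerId INTEGER, Name TEXT);
    CREATE TABLE Orders (OrderId INTEGER, CustomerId INTEGER,
                         ProductType TEXT);
    INSERT INTO Customers VALUES (1, 'ann'), (2, 'bob');
    INSERT INTO Orders VALUES (10, 1, NULL);
""")

# Testing the join column: only customers without any order row.
by_join_column = conn.execute("""
    SELECT c.Name FROM Customers c
    LEFT JOIN Orders o ON o.CustomerId = c.CustomerId
    WHERE o.CustomerId IS NULL
""").fetchall()

# Testing another nullable column: ann is wrongly reported as well,
# because her order's ProductType happens to be NULL.
by_other_column = conn.execute("""
    SELECT c.Name FROM Customers c
    LEFT JOIN Orders o ON o.CustomerId = c.CustomerId
    WHERE o.ProductType IS NULL
    ORDER BY c.Name
""").fetchall()

print(by_join_column)   # [('bob',)]
print(by_other_column)  # [('ann',), ('bob',)]
```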

You could also use either of these:

SELECT *
FROM Customers c
WHERE c.CustomerId NOT IN (SELECT CustomerId FROM orders)

SELECT *
FROM Customers c
WHERE NOT EXISTS (SELECT null FROM orders o WHERE o.CustomerID = c.CustomerID)

In most high-end database systems these will all be implemented the same way under the hood; the query compiler will recognise the job they're trying to do and carry them out in the same way. There's a risk that the NOT IN will perform poorly in some databases, particularly older or more naively written ones, and it's perhaps a reasonable rule of thumb to follow that "do not use IN for lists longer than you would happily type in manually". The EXISTS version is called a correlated subquery and is often a fairly succinct and high-performing way of doing things like this; for a long time databases have had specific optimisations for EXISTS that they formerly might not have had for a JOIN-based route, but that's again something that has largely gone away nowadays. Seeing someone exhibit a preference for EXISTS may indicate they've been using SQL a loooong time, as it was frequently the best performing way of answering this type of query in ancient databases.

Of all 3 I find the join method clearest to read and understand. Correlated subqueries always require a bit more mental effort to see how they hook into the bigger picture because they refer to columns that aren't part of their local scope. It's also possible to make a mistake with a correlated subquery more easily than the other forms, and particularly with the IN, by typoing a column from the outer query into the inner query, changing the results it emits.

Use whatever works for you; being able to read and understand all these forms will help you when you read other people's code.
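The three formulations above can be compared side by side with a sketch like this one, using Python's sqlite3 module. The OrderId column and the sample rows are assumptions made for the example. It also shows one difference worth knowing when choosing between them: if the subquery can return a NULL CustomerId, the NOT IN form returns no rows, while the other two are unaffected.

```python
import sqlite3

# Hypothetical data: customer 2 is the only one without an order.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (CustomerId INTEGER);
    CREATE TABLE Orders (OrderId INTEGER, CustomerId INTEGER);
    INSERT INTO Customers VALUES (1), (2), (3);
    INSERT INTO Orders VALUES (10, 1), (11, 1), (12, 3);
""")

QUERIES = {
    "left join": """
        SELECT c.CustomerId FROM Customers c
        LEFT JOIN Orders o ON o.CustomerId = c.CustomerId
        WHERE o.CustomerId IS NULL""",
    "not in": """
        SELECT c.CustomerId FROM Customers c
        WHERE c.CustomerId NOT IN (SELECT CustomerId FROM Orders)""",
    "not exists": """
        SELECT c.CustomerId FROM Customers c
        WHERE NOT EXISTS (SELECT NULL FROM Orders o
                          WHERE o.CustomerId = c.CustomerId)""",
}

def run_all():
    """Run every formulation and collect its rows by name."""
    return {name: conn.execute(sql).fetchall()
            for name, sql in QUERIES.items()}

before = run_all()
print(before)  # all three give [(2,)]

# Now add an order row whose CustomerId is NULL.
conn.execute("INSERT INTO Orders VALUES (13, NULL)")
after = run_all()
print(after["not in"])  # []
```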


