Why Does X[Y] Join of Data.Tables Not Allow a Full Outer Join, or a Left Join

Why does X[Y] join of data.tables not allow a full outer join, or a left join?

To quote from the data.table FAQ 1.11, "What is the difference between X[Y] and merge(X, Y)?":

X[Y] is a join, looking up X's rows using Y (or Y's key if it has one) as an index.

Y[X] is a join, looking up Y's rows using X (or X's key if it has one) as an index.

merge(X,Y) does both ways at the same time. The number of rows of X[Y] and Y[X] usually differ, whereas the number of rows returned by merge(X,Y) and merge(Y,X) is the same.

BUT that misses the main point. Most tasks require something to be done on the data after a join or merge. Why merge all the columns of data, only to use a small subset of them afterwards? You may suggest merge(X[,ColsNeeded1],Y[,ColsNeeded2]), but that requires the programmer to work out which columns are needed. X[Y,j] in data.table does all that in one step for you. When you write X[Y,sum(foo*bar)], data.table automatically inspects the j expression to see which columns it uses. It will subset those columns only; the others are ignored. Memory is only created for the columns that j uses, and Y columns enjoy standard R recycling rules within the context of each group. Let's say foo is in X and bar is in Y (along with 20 other columns in Y). Isn't X[Y,sum(foo*bar)] quicker to program and quicker to run than a merge of everything wastefully followed by a subset?


If you want a left outer join that keeps every row of X:

le <- Y[X]
mallx <- merge(X, Y, all.x = TRUE)
# the column order is different so change to be the same as `merge`
setcolorder(le, names(mallx))
identical(le, mallx)
# [1] TRUE

If you want a full outer join:

# the unique values for the keys over both data sets
unique_keys <- unique(c(X[,t], Y[,t]))
Y[X[J(unique_keys)]]
##    t  b  a
## 1: 1 NA  1
## 2: 2 NA  4
## 3: 3  9  9
## 4: 4 16 16
## 5: 5 25 NA
## 6: 6 36 NA

# The following gives the same rows, with the columns in the order X, Y
X[Y[J(unique_keys)]]

How do I do a full outer join in data tables with multiple keys?

You can have a look at getAnywhere('merge.data.table') to see the source code of the merge method for data.tables -- the method is built using [. Follow the logic for the case of all=TRUE to see what happens when using merge:

outer_join = merge(B, A, all=TRUE, by=c('key_a', 'key_b'))

Essentially this will do:

  1. B[A, nomatch=NA] to left join B to A
  2. B[!A] to anti-join B to A (finding the B rows not found in A, which would be missing from the left join)
  3. rbind the outputs from 1. and 2. to complete the outer join

The last step is what makes this impossible to do "by-reference", the way a join-and-update is often recommended for data.table: we can update existing rows and add or modify columns of a data.table by reference, but we cannot add new rows by reference.
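The three steps listed above are not specific to R. As a rough sketch only, the same decomposition can be written with Python's built-in sqlite3 module. The tables A and B, their value columns val_a and val_b, and all row values are made up for illustration; only the key names key_a and key_b come from the merge() call above.

```python
import sqlite3

# Hypothetical tables A and B sharing the two key columns used in the merge() call.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE A (key_a INTEGER, key_b TEXT, val_a TEXT);
    CREATE TABLE B (key_a INTEGER, key_b TEXT, val_b TEXT);
    INSERT INTO A VALUES (1, 'x', 'a1'), (2, 'y', 'a2');
    INSERT INTO B VALUES (2, 'y', 'b2'), (3, 'z', 'b3');
""")

# Step 1: keep every row of A, with B's columns where both keys match
# (the analogue of B[A, nomatch=NA]).
left_rows = con.execute("""
    SELECT A.key_a, A.key_b, A.val_a, B.val_b
    FROM A LEFT JOIN B ON A.key_a = B.key_a AND A.key_b = B.key_b
""").fetchall()

# Step 2: rows of B that have no partner in A (the analogue of B[!A]).
anti_rows = con.execute("""
    SELECT B.key_a, B.key_b, NULL AS val_a, B.val_b
    FROM B
    WHERE NOT EXISTS (
        SELECT 1 FROM A WHERE A.key_a = B.key_a AND A.key_b = B.key_b
    )
""").fetchall()

# Step 3: stack the two pieces (the analogue of rbind).
outer_join = sorted(left_rows + anti_rows)
print(outer_join)
```

Step 3 builds a new, longer result rather than modifying either input in place, which mirrors why the data.table version cannot be done by reference.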

FULL JOIN query results in Error Code: 1054. Any workaround?

You can generate a list of year-month pairs that are present in one or both tables using union, then left join the two tables with that result:

select *
from (
    select billyear, billmonth from tblhydrobill where meteridnumber = 19
    union
    select billyear, billmonth from tblgasdata where buildingid = 19
) as ym
left join tblhydrobill
    on tblhydrobill.billyear = ym.billyear
    and tblhydrobill.billmonth = ym.billmonth
    and tblhydrobill.meteridnumber = 19
left join tblgasdata
    on tblgasdata.billyear = ym.billyear
    and tblgasdata.billmonth = ym.billmonth
    and tblgasdata.buildingid = 19
order by ym.billyear, ym.billmonth

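Note that it is possible to build the ym list manually. The first select needs column aliases so that ym.billyear and ym.billmonth can be referenced in the joins, e.g.:

from (
    select 2022 as billyear, 1 as billmonth union
    select 2021, 12 union
    select 2021, 11
) as ym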
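To see the union-then-left-join pattern produce rows, here is a small sketch using Python's built-in sqlite3 module instead of MySQL. The table and key column names follow the query above; the measurement columns kwh and gas_m3 and all row values are invented for the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- kwh and gas_m3 are made-up measurement columns for illustration.
    CREATE TABLE tblhydrobill (meteridnumber INTEGER, billyear INTEGER, billmonth INTEGER, kwh REAL);
    CREATE TABLE tblgasdata (buildingid INTEGER, billyear INTEGER, billmonth INTEGER, gas_m3 REAL);
    INSERT INTO tblhydrobill VALUES (19, 2021, 11, 100.0), (19, 2021, 12, 110.0);
    INSERT INTO tblgasdata VALUES (19, 2021, 12, 40.0), (19, 2022, 1, 55.0);
""")

# ym holds every year-month pair present in either table; each table is
# then left joined to it, so no pair is lost.
rows = con.execute("""
    SELECT ym.billyear, ym.billmonth, h.kwh, g.gas_m3
    FROM (
        SELECT billyear, billmonth FROM tblhydrobill WHERE meteridnumber = 19
        UNION
        SELECT billyear, billmonth FROM tblgasdata WHERE buildingid = 19
    ) AS ym
    LEFT JOIN tblhydrobill AS h
        ON h.billyear = ym.billyear AND h.billmonth = ym.billmonth AND h.meteridnumber = 19
    LEFT JOIN tblgasdata AS g
        ON g.billyear = ym.billyear AND g.billmonth = ym.billmonth AND g.buildingid = 19
    ORDER BY ym.billyear, ym.billmonth
""").fetchall()

for row in rows:
    print(row)
```

Months that exist in only one table come back with None (NULL) in the other table's column, which is the behaviour a full join would have given.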

When to use right join or full outer join

Why would you join two tables and keep the rows that do not match of BOTH tables?

The full join has cases where it is useful. One of them is comparing two tables for differences, like an XOR between tables:

SELECT *
FROM t1
FULL JOIN t2
ON t1.id = t2.id
WHERE t1.id IS NULL
OR t2.id IS NULL;

Example:

t1.id ... t2.id
1         NULL
NULL      2

you could also achieve this by using two left joins.

Yes you could:

SELECT t1.*, t2.*
FROM t1
LEFT JOIN t2
ON t1.id = t2.id
WHERE t2.id IS NULL
UNION ALL
SELECT t1.*, t2.*
FROM t2
LEFT JOIN t1
ON t1.id = t2.id
WHERE t1.id IS NULL;

Some SQL dialects do not support FULL OUTER JOIN, so we emulate it that way.
Related: How to do a FULL OUTER JOIN in MySQL?
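As a quick check of the two-left-join emulation, the sketch below runs it through Python's built-in sqlite3 module. The contents of t1 and t2 are made up: id 3 exists in both tables, id 1 only in t1, and id 2 only in t2.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE t1 (id INTEGER);
    CREATE TABLE t2 (id INTEGER);
    INSERT INTO t1 VALUES (1), (3);
    INSERT INTO t2 VALUES (2), (3);
""")

# First branch: rows of t1 with no match in t2.
# Second branch: rows of t2 with no match in t1.
diff = con.execute("""
    SELECT t1.id AS t1_id, t2.id AS t2_id
    FROM t1 LEFT JOIN t2 ON t1.id = t2.id
    WHERE t2.id IS NULL
    UNION ALL
    SELECT t1.id, t2.id
    FROM t2 LEFT JOIN t1 ON t1.id = t2.id
    WHERE t1.id IS NULL
""").fetchall()

print(diff)
```

The shared id 3 is filtered out by both WHERE clauses, leaving only the rows that exist on one side.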


On the other hand, RIGHT JOIN is useful when you have to join more than two tables:

SELECT *
FROM t1
JOIN t2
...
RIGHT JOIN t3
...

Of course, you could argue that you could rewrite it to an equivalent form, either by changing the join order or by using subqueries (inline views). From a developer's perspective, it is always good to have the tools available, even if you don't have to use them.

Full Outer Join with data.tables without knowing keys

merge already uses data.table's optimizations, so there is not much more to do.
In my experience, data.table merge operations are already very fast.

One approach is to merge on integer or factor key columns, which should be considerably faster than merging on character columns.

data.table join is hard to understand

This is a non-equi join:

  • it matches rows with the same x in both tables: b and c in this case
  • it keeps only the rows of DT where DT$y <= X$foo

Perhaps it is easier to understand like this:

DT[X,.(x.x, x.y, x.v, i.x, i.v, i.foo,`y < foo`= x.y < i.foo ), on = .(x = x, y <= foo)]

   x.x x.y x.v i.x i.v i.foo y < foo
1:   c   1   7   c   8     4    TRUE
2:   c   3   8   c   8     4    TRUE
3:   b   1   1   b   7     2    TRUE

Where:

  • x. are the columns of the LHS table (DT)
  • i. are the columns of the RHS table (X). To remember i., think of DT[i, j, by].
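For readers who think in SQL, the same non-equi condition can be sketched as a join with an inequality in the ON clause, run here through Python's built-in sqlite3 module. The table contents are an assumption: they were chosen to be consistent with the three output rows shown above, and the original DT and X may have held different data.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Assumed contents, chosen to reproduce the three rows shown above.
    CREATE TABLE DT (x TEXT, y INTEGER, v INTEGER);
    CREATE TABLE X (x TEXT, v INTEGER, foo INTEGER);
    INSERT INTO DT VALUES ('b', 1, 1), ('b', 3, 2), ('b', 6, 3),
                          ('a', 1, 4), ('a', 3, 5), ('a', 6, 6),
                          ('c', 1, 7), ('c', 3, 8), ('c', 6, 9);
    INSERT INTO X VALUES ('c', 8, 4), ('b', 7, 2);
""")

# Every row of X is kept (as data.table keeps every row of i by default);
# a DT row is attached when x is equal and DT.y <= X.foo.
rows = con.execute("""
    SELECT DT.x, DT.y, DT.v, X.x, X.v, X.foo
    FROM X LEFT JOIN DT ON DT.x = X.x AND DT.y <= X.foo
    ORDER BY X.rowid, DT.rowid
""").fetchall()

for row in rows:
    print(row)
```

The first three columns correspond to the x. columns and the last three to the i. columns in the data.table output.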

How to mix left and right joins with cross reference table

Joins are evaluated left to right. Your query's order of operations is as if you had used parentheses like this:

(A LEFT JOIN CR) RIGHT JOIN B

Therefore it's bound to return all rows in B.

But it will only return matching rows from (A LEFT JOIN CR). That part of the join will include all rows from A, but depending on the join condition of the subsequent right join, some of those A rows may be excluded.
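The effect can be seen on a tiny made-up data set. The sketch below uses Python's built-in sqlite3 module; the column names (id, name, a_id, b_id), the join conditions, and the rows are all assumptions, since the original query is not shown. To stay portable across SQLite versions, the right join is written in its equivalent form, B LEFT JOIN (A LEFT JOIN CR).

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Hypothetical schema: CR cross-references rows of A and B.
    CREATE TABLE A (id INTEGER, name TEXT);
    CREATE TABLE B (id INTEGER, name TEXT);
    CREATE TABLE CR (a_id INTEGER, b_id INTEGER);
    INSERT INTO A VALUES (1, 'a1'), (2, 'a2');
    INSERT INTO B VALUES (10, 'b10'), (20, 'b20');
    INSERT INTO CR VALUES (1, 10);
""")

# (A LEFT JOIN CR) RIGHT JOIN B, rewritten as B LEFT JOIN (A LEFT JOIN CR).
rows = con.execute("""
    SELECT acr.a_name, B.name
    FROM B
    LEFT JOIN (
        SELECT A.name AS a_name, CR.b_id AS b_id
        FROM A LEFT JOIN CR ON CR.a_id = A.id
    ) AS acr ON acr.b_id = B.id
    ORDER BY B.id
""").fetchall()

print(rows)
```

Both rows of B appear in the result, but a2, which has no cross-reference entry, is dropped even though the inner left join had kept it.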

From your description, it sounds like you really want a FULL OUTER JOIN. MySQL does not support this type of outer join. There are ways to simulate it by using a UNION of two joins:

...
A LEFT OUTER JOIN B
UNION
A RIGHT OUTER JOIN B
...

The feature request for MySQL to support FULL OUTER JOIN was filed in 2006. If you need this, you should log into the bug tracker and click "Affects Me" on that feature request.

Joining two incomplete data.tables with the same column names

You can group by ID and get the unique values after omitting NAs, i.e.

library(data.table)

merge(dt1, dt2, all = TRUE)[,
  lapply(.SD, function(i) na.omit(unique(i))),
  by = id][]

#    id v1   v2
# 1:  1  w    a
# 2:  2  x    b
# 3:  3  y    c
# 4:  4  z <NA>

Why is there NULL in the result of a full outer join between two tables?

Your CASE expression has no ELSE, so it defaults to null:

case when a.domain is null then b.domain
when b.domain is null then a.domain
ELSE NULL -- implicitly
end as unique_domains

The value 'example_2.com' has a match, so both a.domain and b.domain equal 'example_2.com' and are not null. Neither WHEN condition matches, so the implicit ELSE NULL applies.

As to "a better way": I'd probably use

select coalesce(a.domain, b.domain) as domain
from domains_1 as a full outer join domains_2 as b on a.domain = b.domain
where a.domain is null or b.domain is null;
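The difference between the CASE expression and COALESCE can be checked side by side. This is a sketch using Python's built-in sqlite3 module; because older SQLite versions lack FULL OUTER JOIN, the full join is emulated with two left joins in a CTE named fj (a name chosen here). Apart from 'example_2.com' appearing in both tables, the row values are made up.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE domains_1 (domain TEXT);
    CREATE TABLE domains_2 (domain TEXT);
    INSERT INTO domains_1 VALUES ('example_1.com'), ('example_2.com');
    INSERT INTO domains_2 VALUES ('example_2.com'), ('example_3.com');
""")

# fj emulates the full outer join; the outer query evaluates the CASE
# without ELSE next to COALESCE, without the final WHERE filter.
rows = con.execute("""
    WITH fj AS (
        SELECT a.domain AS a_domain, b.domain AS b_domain
        FROM domains_1 AS a LEFT JOIN domains_2 AS b ON a.domain = b.domain
        UNION ALL
        SELECT a.domain, b.domain
        FROM domains_2 AS b LEFT JOIN domains_1 AS a ON a.domain = b.domain
        WHERE a.domain IS NULL
    )
    SELECT CASE WHEN a_domain IS NULL THEN b_domain
                WHEN b_domain IS NULL THEN a_domain
           END AS unique_domains,
           COALESCE(a_domain, b_domain) AS domain
    FROM fj
    ORDER BY domain
""").fetchall()

for row in rows:
    print(row)
```

For the matched domain the CASE column is None (NULL) while COALESCE still returns the name, which is why the suggested query filters matched rows out with its WHERE clause instead of relying on CASE.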

