Why does X[Y] join of data.tables not allow a full outer join, or a left join?
To quote from the data.table FAQ 1.11, "What is the difference between X[Y] and merge(X, Y)?":
X[Y] is a join, looking up X's rows using Y (or Y's key if it has one) as an index.
Y[X] is a join, looking up Y's rows using X (or X's key if it has one).
merge(X,Y) does both ways at the same time. The number of rows of X[Y] and Y[X] usually differ, whereas the number of rows returned by merge(X,Y) and merge(Y,X) is the same.
BUT that misses the main point. Most tasks require something to be done on the data after a join or merge. Why merge all the columns of data, only to use a small subset of them afterwards? You may suggest merge(X[,ColsNeeded1],Y[,ColsNeeded2]), but that requires the programmer to work out which columns are needed. X[Y,j] in data.table does all that in one step for you. When you write X[Y,sum(foo*bar)], data.table automatically inspects the j expression to see which columns it uses. It will subset those columns only; the others are ignored. Memory is only created for the columns the j uses, and Y columns enjoy standard R recycling rules within the context of each group. Let's say foo is in X, and bar is in Y (along with 20 other columns in Y). Isn't X[Y,sum(foo*bar)] quicker to program and quicker to run than a merge of everything wastefully followed by a subset?
If you want a left outer join of X[Y]
le <- Y[X]
mallx <- merge(X, Y, all.x = TRUE)
# the column order is different so change to be the same as `merge`
setcolorder(le, names(mallx))
identical(le, mallx)
# [1] TRUE
If you want a full outer join
# the unique values for the keys over both data sets
unique_keys <- unique(c(X[,t], Y[,t]))
Y[X[J(unique_keys)]]
## t b a
## 1: 1 NA 1
## 2: 2 NA 4
## 3: 3 9 9
## 4: 4 16 16
## 5: 5 25 NA
## 6: 6 36 NA
# The following will give the same with the column order X,Y
X[Y[J(unique_keys)]]
How do I do a full outer join in data tables with multiple keys?
You can have a look at getAnywhere('merge.data.table') to see the source code of the merge method for data.tables -- the method is built using [. Follow the logic for the case of all=TRUE to see what happens when using merge:
outer_join = merge(B, A, all=TRUE, by=c('key_a', 'key_b'))
Essentially this will do:
1. B[A, nomatch=NA] to left join B to A
2. B[!A] to anti-join B to A (finding the B rows not found in A, that would be missing from the left join)
3. rbind the outputs from 1. and 2. to complete the outer join
The last step is what makes this impossible to do "by reference", the way a join-and-update is often recommended for data.table -- we can update existing rows and new/existing columns for a data.table by reference, but we cannot add new rows by reference.
FULL JOIN query results in Error Code: 1054. Any workaround?
You can generate a list of year-month pairs that are present in one or both tables using union, then left join the two tables with that result:
select *
from (
select billyear, billmonth from tblhydrobill where meteridnumber = 19
union
select billyear, billmonth from tblgasdata where buildingid = 19
) as ym
left join tblhydrobill on tblhydrobill.billyear = ym.billyear and tblhydrobill.billmonth = ym.billmonth and tblhydrobill.meteridnumber = 19
left join tblgasdata on tblgasdata.billyear = ym.billyear and tblgasdata.billmonth = ym.billmonth and tblgasdata.buildingid = 19
order by ym.billyear, ym.billmonth
Note that it is possible to build the ym list manually, e.g.:
from (
select 2022 as billyear, 1 as billmonth union
select 2021, 12 union
select 2021, 11
) as ym
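The union-then-left-join approach above can be tried end to end with a small sketch. It uses Python's built-in sqlite3 module as a stand-in database. The table and key column names come from the query above, but the hydro_amount and gas_amount columns and all sample rows are assumptions made up for illustration.

```python
import sqlite3

# In-memory database standing in for the real server (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table tblhydrobill (billyear int, billmonth int, meteridnumber int, hydro_amount real);
    create table tblgasdata (billyear int, billmonth int, buildingid int, gas_amount real);
    -- hypothetical sample rows: 2021-11 only in hydro, 2022-01 only in gas
    insert into tblhydrobill values (2021, 11, 19, 50.0), (2021, 12, 19, 60.0);
    insert into tblgasdata values (2021, 12, 19, 30.0), (2022, 1, 19, 35.0);
""")

rows = conn.execute("""
    select ym.billyear, ym.billmonth, h.hydro_amount, g.gas_amount
    from (
        -- every year-month pair present in either table
        select billyear, billmonth from tblhydrobill where meteridnumber = 19
        union
        select billyear, billmonth from tblgasdata where buildingid = 19
    ) as ym
    left join tblhydrobill as h
        on h.billyear = ym.billyear and h.billmonth = ym.billmonth and h.meteridnumber = 19
    left join tblgasdata as g
        on g.billyear = ym.billyear and g.billmonth = ym.billmonth and g.buildingid = 19
    order by ym.billyear, ym.billmonth
""").fetchall()

for row in rows:
    print(row)
```

Each year-month pair appears once, with None (NULL) in the column of whichever table has no row for that month.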
When to use right join or full outer join
Why would you join two tables and keep the rows that do not match of BOTH tables?
The full join has cases where it is useful. One of them is comparing two tables for differences, like an XOR between tables:
SELECT *
FROM t1
FULL JOIN t2
ON t1.id = t2.id
WHERE t1.id IS NULL
OR t2.id IS NULL;
Example:
t1.id ... t2.id
1 NULL
NULL 2
you could also achieve this by using two left joins.
Yes you could:
SELECT t1.*, t2.*
FROM t1
LEFT JOIN t2
ON t1.id = t2.id
WHERE t2.id IS NULL
UNION ALL
SELECT t1.*, t2.*
FROM t2
LEFT JOIN t1
ON t1.id = t2.id
WHERE t1.id IS NULL;
Some SQL dialects do not support FULL OUTER JOIN, and we emulate it that way.
Related: How to do a FULL OUTER JOIN in MySQL?
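As a rough check of the two-left-joins emulation, here is a minimal sketch using Python's built-in sqlite3 module. The sample ids are assumptions chosen so that id 1 exists only in t1, id 2 only in t2, and id 3 in both.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (id INTEGER);
    CREATE TABLE t2 (id INTEGER);
    -- hypothetical data: 3 is shared, 1 and 2 are not
    INSERT INTO t1 VALUES (1), (3);
    INSERT INTO t2 VALUES (2), (3);
""")

diff = conn.execute("""
    SELECT t1.id, t2.id
    FROM t1
    LEFT JOIN t2 ON t1.id = t2.id
    WHERE t2.id IS NULL          -- rows only in t1
    UNION ALL
    SELECT t1.id, t2.id
    FROM t2
    LEFT JOIN t1 ON t1.id = t2.id
    WHERE t1.id IS NULL          -- rows only in t2
""").fetchall()

# The shared id 3 is filtered out; only the unmatched rows remain.
print(diff)
```

The two branches cannot overlap, because each keeps only rows missing from the other table, so UNION ALL is enough here and no deduplication is needed.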
On the other hand, RIGHT JOIN is useful when you have to join more than 2 tables:
SELECT *
FROM t1
JOIN t2
...
RIGHT JOIN t3
...
Of course, you could argue that you could rewrite it to a corresponding form, either by changing the join order or by using subqueries (inline views). From a developer's perspective, it is always good to have the tools, even if you don't have to use them.
Full Outer Join with data.tables without knowing keys
merge already uses data.table optimization, not much to do.
In my experience data.table has blazing fast merge operations.
One approach could be to merge on integer or factor key variables, which should be faster than merging on character keys.
data.table join is hard to understand
This is a non-equi join:
- joins on the same x in both tables: b and c in this case
- keeps only the values of DT where DT$y <= X$foo
Perhaps easier to understand like this :
DT[X,.(x.x, x.y, x.v, i.x, i.v, i.foo,`y < foo`= x.y < i.foo ), on = .(x = x, y <= foo)]
x.x x.y x.v i.x i.v i.foo y < foo
1: c 1 7 c 8 4 TRUE
2: c 3 8 c 8 4 TRUE
3: b 1 1 b 7 2 TRUE
Where:
- x. are the columns of the LHS table (DT)
- i. are the columns of the RHS table (X); to remember i., think about DT[i,j,by].
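For readers more used to SQL, the non-equi join above corresponds roughly to a join whose ON clause mixes an equality with an inequality. The sketch below uses Python's built-in sqlite3 module rather than data.table. The contents of DT and X are assumed sample data, chosen so that the result matches the three rows printed above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DT (x TEXT, y INTEGER, v INTEGER);
    CREATE TABLE X  (x TEXT, v INTEGER, foo INTEGER);
    -- assumed sample data, consistent with the output shown above
    INSERT INTO DT VALUES
        ('b', 1, 1), ('b', 3, 2), ('b', 6, 3),
        ('a', 1, 4), ('a', 3, 5), ('a', 6, 6),
        ('c', 1, 7), ('c', 3, 8), ('c', 6, 9);
    INSERT INTO X VALUES ('c', 8, 4), ('b', 7, 2);
""")

# DT[X, on = .(x = x, y <= foo)] keeps every row of X, hence X LEFT JOIN DT.
matches = conn.execute("""
    SELECT DT.x, DT.y, DT.v, X.x, X.v, X.foo
    FROM X
    LEFT JOIN DT
        ON DT.x = X.x       -- equality part: x = x
       AND DT.y <= X.foo    -- non-equi part: y <= foo
    ORDER BY X.rowid, DT.rowid
""").fetchall()

for row in matches:
    print(row)
```

The first three values in each row play the role of the x. columns (from DT) and the last three the role of the i. columns (from X).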
How to mix left and right joins with cross reference table
Joins are evaluated left to right. Your query's order of operations is as if you used parentheses like this:
(A LEFT JOIN CR) RIGHT JOIN B
Therefore it's bound to return all rows in B.
But it will only return matching rows from (A LEFT JOIN CR). That part of the join will include all rows from A, but depending on the join condition of the subsequent right join, some of those A rows may be excluded.
From your description, it sounds like you really want a FULL OUTER JOIN. MySQL does not support this type of outer join. There are ways to simulate it by using a UNION of two joins:
...
A LEFT OUTER JOIN B
UNION
A RIGHT OUTER JOIN B
...
The feature request of MySQL to support FULL OUTER JOIN was filed in 2006. If you need this, you should go log into the bug tracker and click "Affects Me" on that feature request.
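The UNION-based simulation can be sketched as follows, using Python's built-in sqlite3 module in place of MySQL. The columns and rows of A and B are assumptions for illustration. The right outer join is written as a left join with the tables swapped, which is equivalent and also runs on SQLite versions that lack RIGHT JOIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (id INTEGER, a_val TEXT);
    CREATE TABLE B (id INTEGER, b_val TEXT);
    -- hypothetical data: id 2 matches, 1 is only in A, 3 is only in B
    INSERT INTO A VALUES (1, 'a1'), (2, 'a2');
    INSERT INTO B VALUES (2, 'b2'), (3, 'b3');
""")

full_join = conn.execute("""
    SELECT A.id, A.a_val, B.id, B.b_val
    FROM A LEFT OUTER JOIN B ON A.id = B.id      -- all rows of A
    UNION
    SELECT A.id, A.a_val, B.id, B.b_val
    FROM B LEFT OUTER JOIN A ON A.id = B.id      -- all rows of B (A RIGHT OUTER JOIN B)
""").fetchall()

for row in full_join:
    print(row)
```

The matched row comes out of both branches and UNION collapses it into one. The same deduplication would also collapse rows that are genuinely duplicated in the data, which is worth keeping in mind when the tables have no unique key.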
Joining two incomplete data.tables with the same column names
You can group by ID and get the unique values after omitting NAs, i.e.
library(data.table)
merge(dt1, dt2, all = TRUE)[,
lapply(.SD, function(i)na.omit(unique(i))),
by = id][]
# id v1 v2
#1: 1 w a
#2: 2 x b
#3: 3 y c
#4: 4 z <NA>
Why is there NULL in the result of a full outer join between two tables?
Your CASE expression has no ELSE, so it defaults to null:
case when a.domain is null then b.domain
when b.domain is null then a.domain
ELSE NULL -- implicitly
end as unique_domains
The value 'example_2.com' has a match, so both a.domain and b.domain equal 'example_2.com' and are not null. So neither WHEN matches, and ELSE NULL is applied.
As to "a better way": I'd probably use
select coalesce(a.domain, b.domain) as domain
from domains_1 as a full outer join domains_2 as b on a.domain = b.domain
where a.domain is null or b.domain is null;
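The behaviour of the CASE expression can be checked in isolation with a small sketch using Python's built-in sqlite3 module. The three (a, b) pairs stand in for the a.domain and b.domain values a full outer join could produce. Only 'example_2.com' comes from the question; the other two domain names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Same CASE as above (no ELSE), next to the coalesce alternative.
query = """
    select case when :a is null then :b
                when :b is null then :a
           end as unique_domains,
           coalesce(:a, :b) as domain
"""

pairs = [
    ("example_1.com", None),             # only in domains_1 (hypothetical)
    (None, "example_3.com"),             # only in domains_2 (hypothetical)
    ("example_2.com", "example_2.com"),  # present in both
]
results = [conn.execute(query, {"a": a, "b": b}).fetchone() for a, b in pairs]

for row in results:
    print(row)
```

For the matched pair, CASE yields None (NULL) while coalesce still returns the domain. In the query above, the where clause removes matched rows, so coalesce only ever sees pairs where exactly one side is null.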