Data.Table - Left Outer Join on Multiple Tables


I just committed a new feature in data.table, v1.9.5, that lets us join without setting keys: we can specify the columns to join by directly with the on= argument, without having to call setkey() first.

With that, this is simply:

require(data.table) # v1.9.5+
fruits[tastes, on="FruitID"][colors, on="FruitID"] # no setkey required
#    FruitID      Fruit TasteID  Taste ColorID  Color
# 1:       1      Apple       1 Sweeet       1    Red
# 2:       1      Apple       2   Sour       1    Red
# 3:       1      Apple       1 Sweeet       2 Yellow
# 4:       1      Apple       2   Sour       2 Yellow
# 5:       1      Apple       1 Sweeet       3  Green
# 6:       1      Apple       2   Sour       3  Green
# 7:       2         NA      NA     NA       4 Yellow
# 8:       3 Strawberry       3  Sweet       5    Red
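
The three input tables are not shown here. As a minimal sketch, assuming hypothetical tables whose column names are taken from the output above (the values themselves are illustrative, not the original data), the chained on= join can be run end to end roughly like this:

```r
library(data.table)

# Hypothetical inputs: column names follow the output above, values are assumed.
fruits <- data.table(FruitID = 1:3, Fruit = c("Apple", "Banana", "Strawberry"))
tastes <- data.table(TasteID = 1:3, FruitID = c(1L, 1L, 3L),
                     Taste = c("Sweet", "Sour", "Sweet"))
colors <- data.table(ColorID = 1:5, FruitID = c(1L, 1L, 1L, 2L, 3L),
                     Color = c("Red", "Yellow", "Green", "Yellow", "Red"))

# X[Y, on = ...] looks up each row of Y in X, so every row of Y is kept.
step1 <- fruits[tastes, on = "FruitID"]   # one row per taste
res   <- step1[colors, on = "FruitID"]    # every color row, repeated per matching taste

# FruitID 2 has a color but no taste, so its Fruit and Taste come back as NA.
print(res)
```

In this form, rows of the outer table without a match (here the assumed Banana row) are dropped in the first step, which is why FruitID 2 later shows NA for Fruit.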

Multiple LEFT OUTER JOIN on multiple tables

The join on D is an inner join, the rest are left outer joins:

SELECT *
FROM TABLEA A JOIN
TABLED D
ON D.Z = A.Z LEFT JOIN
TABLEB B
ON A.X = B.X LEFT JOIN
TABLEC C
ON B.Y = C.Y
WHERE MY_COL = @col_val;

I always start chains of joins with the inner joins, followed by the left outer joins. I never use right joins, and full joins only rarely. The inner joins define the rows in the result set, so they come first.
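
For readers working in R, the same ordering can be sketched with base merge(). This is only an illustration under assumed data: the tables A, B, C, D and their values below are hypothetical stand-ins for TABLEA to TABLED, and the WHERE filter is left out.

```r
# Hypothetical stand-ins for TABLEA..TABLED; only the join columns matter here.
A <- data.frame(Z = c(1L, 2L, 3L), X = c(10L, 20L, 30L))
D <- data.frame(Z = c(1L, 2L), d_val = c("d1", "d2"))
B <- data.frame(X = 10L, Y = 100L)
C <- data.frame(Y = 100L, c_val = "c1")

# Inner join first: it decides which rows exist at all (Z = 3 is dropped).
AD <- merge(A, D, by = "Z")

# Left joins afterwards: they only add columns and never remove rows of AD.
ADB <- merge(AD, B, by = "X", all.x = TRUE)
res <- merge(ADB, C, by = "Y", all.x = TRUE)

print(res)
```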

How to make a table left outer join of multiple tables

You have the tables in the wrong order. The FeedTable should be first:

Select ft.*, mt1.key as cola, mt2.key as colb, mt3.key as colc
from FeedTable ft left join
MainTable mt1
on ft.fk1 = mt1.key left join
MainTable mt2
on ft.fk2 = mt2.key left join
MainTable mt3
on ft.fk3 = mt3.key;

Left outer join on multiple tables

Your new left outer join is most likely causing some rows to be returned several times in the result set because of multiple matching rows. Remove your SUM, review the returned rows, and work out exactly which ones you require (maybe restrict the join to only a certain type of t_ind record, if that is applicable?), then adjust your query accordingly.
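
The effect is easy to reproduce outside the database. The sketch below uses R's merge() with two hypothetical tables (orders and items are assumed names, not the tables from the question) to show how a one-to-many left join inflates a sum, and one possible way to avoid it by collapsing the "many" side first:

```r
# Hypothetical data: one row per order, several items per order.
orders <- data.frame(order_id = 1:2, amount = c(100, 50))
items  <- data.frame(order_id = c(1L, 1L, 1L, 2L), item = c("a", "b", "c", "d"))

# The left join repeats each order once per matching item.
joined <- merge(orders, items, by = "order_id", all.x = TRUE)
true_total     <- sum(orders$amount)   # 150
inflated_total <- sum(joined$amount)   # 350: order 1 is counted three times

# One option: reduce items to one row per order before joining.
item_counts <- aggregate(n ~ order_id, data = transform(items, n = 1), FUN = sum)
fixed <- merge(orders, item_counts, by = "order_id", all.x = TRUE)
fixed_total <- sum(fixed$amount)       # 150 again
```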

left outer join with data.table with different names for key variables

From ?data.table::merge

This merge method for data.table behaves very similarly to that of data.frames with one major exception: By default, the columns used to merge the data.tables are the shared key columns rather than the shared columns with the same names. Set the by, or by.x, by.y arguments explicitly to override this default.

So we can use the by arguments to override the keys.

library(data.table)

DT1 = data.table(x1=c("b","c", "a", "b", "a", "b"), x2a=1:6,m1=seq(10,60,by=10))
DT2 = data.table(x1=c("b","d", "c", "b","a","a"),x2b=c(1,4,7,6," "," "),m2=5:10)

## joining a character column to an integer column raises an error,
## so convert x2b first (the blank strings become NA):
DT2$x2b <- as.integer(DT2$x2b)
## Alternative:
## DT2 = data.table(x1=c("b","d", "c", "b","a","a"),x2b=c(1,4,7,6,NA,NA),m2=5:10)

merge(DT1, DT2, by.x=c('x1','x2a'), by.y=c('x1','x2b'), all.x=TRUE)

   x1 x2a m1 m2
1:  a   3 30 NA
2:  a   5 50 NA
3:  b   1 10  5
4:  b   4 40 NA
5:  b   6 60  8
6:  c   2 20 NA
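
As an alternative sketch, the same left join can be written with the on= argument, which accepts a named character vector when the join columns have different names. The data is the same as above, except that x2b is built as an integer column directly; note that the result keeps DT1's row order and names the join column x2b rather than x2a, so it is not identical in layout to the merge() output.

```r
library(data.table)

DT1 <- data.table(x1 = c("b", "c", "a", "b", "a", "b"), x2a = 1:6,
                  m1 = seq(10, 60, by = 10))
DT2 <- data.table(x1 = c("b", "d", "c", "b", "a", "a"),
                  x2b = c(1L, 4L, 7L, 6L, NA, NA), m2 = 5:10)

# Names in the on= vector are columns of DT2 (the outer table);
# values are the matching columns of DT1 (the table inside the brackets).
# Every row of DT1 is kept, so this is a left join from DT1's point of view.
res <- DT2[DT1, on = c(x1 = "x1", x2b = "x2a")]
print(res)
```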

Left join using data.table

You can try this:

# example data
# the key of 'B' is set to the column used for the join
A <- data.table(a = 1:4, b = 12:15)
B <- data.table(a = 2:3, b = 13:14, key = 'a')

B[A]
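
To make the result of B[A] concrete, here is a small sketch with the same two tables, written with on= instead of a key and compared against merge(). The object names res and res2 are only for illustration.

```r
library(data.table)

A <- data.table(a = 1:4, b = 12:15)
B <- data.table(a = 2:3, b = 13:14)

# Same join as B[A] with a key on 'a', but using on= instead of setting a key.
# Every row of A is kept; B's column b is NA where A's value of a has no match.
res <- B[A, on = "a"]
print(res)   # A's clashing column b shows up as i.b

# merge() with all.x = TRUE is another way to write the same left join.
res2 <- merge(A, B, by = "a", all.x = TRUE)
print(res2)  # clashing columns become b.x (from A) and b.y (from B)
```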

Left join on a table with condition on others table

What you need to do is the left outer joins from the b table to the c and d tables first, and then outer join that back to the a table if a value exists in either the c or d conditions columns. Like so:

SELECT a.id a_id, b2.b_id, b2.c_id, b2.d_id
FROM a
LEFT OUTER JOIN (SELECT b.id b_id,
b.a_id,
c.id c_id,
d.id d_id
FROM b
LEFT OUTER JOIN c ON b.c_id = c.id AND c.conditions = 1
LEFT OUTER JOIN d ON b.d_id = d.id AND d.conditions = 1) b2
ON a.id = b2.a_id AND COALESCE(b2.c_id, b2.d_id) IS NOT NULL
ORDER BY a.id, b2.b_id, b2.c_id, b2.d_id;

A_ID B_ID C_ID D_ID
---------- ---------- ---------- ----------
1 1 1
1 2 2
2
3

(Thanks to Alex Poole for spotting the issues with my edited output!)


ETA:

This could also be written as:

SELECT a.id a_id, b.id b_id, c.id c_id, d.id d_id
FROM a
LEFT OUTER JOIN (b
LEFT OUTER JOIN c ON b.c_id = c.id AND c.conditions = 1
LEFT OUTER JOIN d ON b.d_id = d.id AND d.conditions = 1)
ON a.id = b.a_id AND COALESCE(c.id, d.id) IS NOT NULL
ORDER BY a.id, b.id, b.c_id, b.d_id;

which is simpler, but the intent is potentially harder to decipher (and the query therefore harder to maintain in the future). I've added it here as I had no idea this was valid syntax, and you may feel it works better for you.

Chaining multiple data.table::merge operations with data.tables

Multiple data.table joins with the on argument can be chained. Note that without an update operator (":=") in j, this would be a right join, but with ":=" (i.e., adding columns by reference), it becomes a left outer join. For more on left joins, see the post "Left join using data.table".
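
The difference between the two forms can be sketched as follows. The tables dt1, dt2 and dt3 are assumptions inferred from the results printed in this answer, not the original objects, and lookup is a hypothetical subset used only to make the two join types differ.

```r
library(data.table)

# Assumed data, inferred from the printed results; not the original objects.
foods <- c("apples", "bananas", "carrots", "dates")
dt1 <- data.table(food = foods, quantity = 1:4)
dt2 <- data.table(food = foods, status = c("good", "bad", "rotten", "raw"))
dt3 <- data.table(food = foods, rank = c("okay", "good", "better", "best"))

lookup <- dt2[food != "dates"]   # drop one row so the join types differ

# Without := the result has one row per row of lookup (3 rows).
right_style <- dt1[lookup, on = "food"]

# With := the matched values are added to a copy of dt1, which keeps all 4 rows;
# the unmatched food ("dates") gets NA for status.
left_style <- copy(dt1)[lookup, on = "food", status := i.status]
print(left_style)
```

copy() is used here so that dt1 itself is left unchanged; without it, ":=" would add the status column to dt1 by reference.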

Example using example data above with a subset between joins:

dt4 <- dt1[dt2, on="food", `:=`(status = i.status)][
food == "apples"][dt3, on="food", rank := i.rank]

##> dt4
## food quantity status rank
##1: apples 1 good okay

Example adding a new column between joins:

dt4 <- dt1[dt2, on="food", `:=`(status = i.status)][
, new_col := NA][dt3, on="food", rank := i.rank]

##> dt4
## food quantity status new_col rank
##1: apples 1 good NA okay
##2: bananas 2 bad NA good
##3: carrots 3 rotten NA better
##4: dates 4 raw NA best

Example using merge and magrittr pipes:

library(magrittr) # provides %>%

dt4 <- merge(dt1, dt2, by = "food") %>%
  set( , "new_col", NA) %>%
  merge(dt3, by = "food")

##> dt4
## food quantity status new_col rank
##1: apples 1 good NA okay
##2: bananas 2 bad NA good
##3: carrots 3 rotten NA better
##4: dates 4 raw NA best

How to join (merge) data frames (inner, outer, left, right)

By using the merge function and its optional parameters:

Inner join: merge(df1, df2) will work for these examples because R automatically joins the frames by common variable names, but you would most likely want to specify merge(df1, df2, by = "CustomerId") to make sure that you were matching on only the fields you desired. You can also use the by.x and by.y parameters if the matching variables have different names in the different data frames.

Outer join: merge(x = df1, y = df2, by = "CustomerId", all = TRUE)

Left outer: merge(x = df1, y = df2, by = "CustomerId", all.x = TRUE)

Right outer: merge(x = df1, y = df2, by = "CustomerId", all.y = TRUE)

Cross join: merge(x = df1, y = df2, by = NULL)

Just as with the inner join, you would probably want to explicitly pass "CustomerId" to R as the matching variable. I think it's almost always best to explicitly state the identifiers on which you want to merge; it's safer if the input data.frames change unexpectedly and easier to read later on.

You can merge on multiple columns by giving by a vector, e.g., by = c("CustomerId", "OrderId").

If the column names to merge on are not the same, you can specify, e.g., by.x = "CustomerId_in_df1", by.y = "CustomerId_in_df2" where CustomerId_in_df1 is the name of the column in the first data frame and CustomerId_in_df2 is the name of the column in the second data frame. (These can also be vectors if you need to merge on multiple columns.)
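
A short sketch of these calls side by side, using two small hypothetical data frames (the values are assumed for illustration only), shows how the row counts differ between the join types:

```r
# Hypothetical data: CustomerId 7 exists only in df2; 1, 3, 5, 6 only in df1.
df1 <- data.frame(CustomerId = 1:6,
                  Product = c("Toaster", "Toaster", "Toaster",
                              "Radio", "Radio", "Radio"))
df2 <- data.frame(CustomerId = c(2L, 4L, 7L),
                  State = c("Alabama", "Alabama", "Ohio"))

inner <- merge(df1, df2, by = "CustomerId")                # ids 2 and 4
outer <- merge(df1, df2, by = "CustomerId", all = TRUE)    # ids 1 to 7
left  <- merge(df1, df2, by = "CustomerId", all.x = TRUE)  # all rows of df1
right <- merge(df1, df2, by = "CustomerId", all.y = TRUE)  # all rows of df2
cross <- merge(df1, df2, by = NULL)                        # every pairing

counts <- sapply(list(inner = inner, outer = outer, left = left,
                      right = right, cross = cross), nrow)
print(counts)  # inner 2, outer 7, left 6, right 3, cross 18
```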

SQL Left Outer Join 2 Tables with 2 columns and different conditions for each column

This is one way to do it.

SELECT DT.CompID,DT.CountryCode,MT.RegionName
FROM DT
JOIN MT
ON MT.CompID IS NOT NULL AND MT.CompID = DT.CompID
UNION
SELECT DT.CompID,DT.CountryCode,MT.RegionName
FROM DT
JOIN MT
ON MT.CompID IS NULL AND MT.CountryCode = DT.CountryCode
WHERE NOT EXISTS (SELECT TOP 1 1 FROM MT WHERE CompID = DT.CompID)


Try not to hard-code the CompID as part of your logic, as that will not be future-proof.


