Joining Two Data.Tables in R Based on Multiple Keys and Duplicate Entries

Joining two data.tables in r: removing overlap duplicates while keeping duplicates in each separate dataset

You could remove the overlaps coming from y:

l = list(dtx, dty)
dtxy = rbindlist(l, use.names = TRUE)

overlaps = merge(dtx,dty,by=c("ID","date","code"))[,.(ID,date,code,dataset = dataset.y)]

dtresultnew <- overlaps[dtxy,.(ID,date,code,x.dataset,i.dataset),on = .(ID,date,code,dataset)][
is.na(x.dataset),.(ID,date,code,dataset=i.dataset)]

identical(dtresult[order(ID,date,code)],dtresultnew[order(ID,date,code)])
[1] TRUE

Rolling join on data.table with duplicate keys

The reason that t_id = 1 doesn't show up in the output is because a rolling join takes the row where the key-combination occurs last. From the documentation (emphasis mine):

Applies to the last join column, generally a date but can be any
ordered variable, irregular and including gaps. If roll=TRUE and i's
row matches to all but the last x join column, and its value in the
last i join column falls in a gap (including after the last
observation in x for that group), then the prevailing value in x is
rolled forward. This operation is particularly fast using a modified
binary search. The operation is also known as last observation carried
forward (LOCF).

Let's consider somewhat larger datasets:

> DT
t_id airport thisTime
1: 1 a 5.1
2: 4 a 5.1
3: 3 a 5.1
4: 2 d 6.2
5: 5 d 6.2

> DT_LU
f_id airport thisTime
1: 1 a 6
2: 2 a 6
3: 2 a 8
4: 1 b 7
5: 1 c 8
6: 2 d 7
7: 1 d 9

When you perform a rolling join just like in your question:

DT[DT_LU, nomatch=0, roll=Inf]

you get:

   t_id airport thisTime f_id
1: 3 a 6 1
2: 3 a 6 2
3: 3 a 8 2
4: 5 d 7 2
5: 5 d 9 1

As you can see, from both the key combination a, 5.1 and d, 6.2 the last row is used for the joined datatable. Because you use Inf as roll-value, all the future values are incorporated in the resulting datatable. When you use:

DT[DT_LU, nomatch=0, roll=1]

you see that only the first value in the future is included:

   t_id airport thisTime f_id
1: 3 a 6 1
2: 3 a 6 2
3: 5 d 7 2

If you want the f_id's for for all combinations of airport & thisTime where DT$thisTime is lower than DT_LU$thisTime, you can achieve that by creating a new variable (or replacing the existing thisTime) by means of the ceiling function. An example where I create a new variable thisTime2 and then do a normal join with DT_LU:

DT[, thisTime2 := ceiling(thisTime)]
setkey(DT, airport, thisTime2)[DT_LU, nomatch=0]

which gives:

   t_id airport thisTime thisTime2 f_id
1: 1 a 5.1 6 1
2: 4 a 5.1 6 1
3: 3 a 5.1 6 1
4: 1 a 5.1 6 2
5: 4 a 5.1 6 2
6: 3 a 5.1 6 2
7: 2 d 6.2 7 2
8: 5 d 6.2 7 2

Applied to the data you provided:

> dt[, thisTime2 := ceiling(thisTime)]
> setkey(dt, airport, thisTime2)[dt_lookup, nomatch=0]

t_id airport thisTime thisTime2 f_id
1: 1 a 5.1 6 1
2: 3 a 5.1 6 1
3: 1 a 5.1 6 2
4: 3 a 5.1 6 2

When you want to include al the future values instead of only the first one, you need a somewhat different approach for which you will need the i.col functionality (which is not documented yet):

1: First set the key to only the airport columns:

setkey(DT, airport)
setkey(DT_LU, airport)

2: Use the i.col functionality (which is not documented yet) in j to get what you want as follows:

DT1 <- DT_LU[DT, .(tid = i.t_id,
tTime = i.thisTime,
fTime = thisTime[i.thisTime < thisTime],
fid = f_id[i.thisTime < thisTime]),
by=.EACHI]

this gives you:

> DT1
airport tid tTime fTime fid
1: a 1 5.1 6 1
2: a 1 5.1 6 2
3: a 1 5.1 8 2
4: a 4 5.1 6 1
5: a 4 5.1 6 2
6: a 4 5.1 8 2
7: a 3 5.1 6 1
8: a 3 5.1 6 2
9: a 3 5.1 8 2
10: d 2 6.2 7 2
11: d 2 6.2 9 1
12: d 5 6.2 7 2
13: d 5 6.2 9 1

Some explanation: In case when you are joining two datatables where the same columnnames are used, you can refer to the columns of the datatable in i by preceding the columnnames with i.. Now it's possible to compare thisTime from DT with thisTime from DT_LU. With by = .EACHI you assure that all combinations for with the condition holds are included in the resulting datatable.

Alternatively, you can achieve the same with:

DT2 <- DT_LU[DT, .(airport=i.airport,
tid=i.t_id,
tTime=i.thisTime,
fTime=thisTime[i.thisTime < thisTime],
fid=f_id[i.thisTime < thisTime]),
allow.cartesian=TRUE]

which gives the same result:

> identical(DT1, DT2)
[1] TRUE

When you only want to include future values within a certain boundary, you can use:

DT1 <- DT_LU[DT, 
{
idx = i.thisTime < thisTime & thisTime - i.thisTime < 2
.(tid = i.t_id,
tTime = i.thisTime,
fTime = thisTime[idx],
fid = f_id[idx])
},
by=.EACHI]

which gives:

> DT1
airport tid tTime fTime fid
1: a 1 5.1 6 1
2: a 1 5.1 6 2
3: a 4 5.1 6 1
4: a 4 5.1 6 2
5: a 3 5.1 6 1
6: a 3 5.1 6 2
7: d 2 6.2 7 2
8: d 5 6.2 7 2

When you compare that to the previous result, you see that now the rows 3, 6, 9, 10 and 12 have been removed.


Data:

DT <- data.table(t_id = c(1,4,2,3,5),
airport = c("a","a","d","a","d"),
thisTime = c(5.1, 5.1, 6.2, 5.1, 6.2),
key=c("airport","thisTime"))

DT_LU <- data.table(f_id = c(rep(1,4),rep(2,3)),
airport = c("a","b","c","d","a","d","e"),
thisTime = c(6,7,8,9,6,7,8),
key=c("airport","thisTime"))

Combine two data tables in R by a condition referring to two columns

So, we have two solutions that work!

Version 1:
Adapted from Frank's comment above:

 library(dplyr)
final <- dt2[col1 > col2, c("col1", "col2") := .(col2, col1)]
final <- dt1[dt2, on=.(col1, col2)]
final <- select(final, col1, col2, x, y) # select relevant columns
final
col1 col2 x y
1: bb zz 29 34
2: aa bb 130 567
3: cc dd 122 56
4: dd ff 85 101

Version2: This is just a tweak of PritamJ's answer that simplifies a few things and makes this solution more applicable for large data tables. Hope it helps other people as well!

library(dplyr)
dt1$pairs <- paste(dt1$col1, dt1$col2) # creates new column with col1 and col2
merged into one
dt2$pairs <- paste(dt2$col1, dt2$col2) # same here
dt2$revpairs <- paste(dt2$col2, dt2$col1) # creates new column with reverse pairs

f1 <- merge(dt1, dt2, by="pairs") # merge by pairs as they are in dt1
f1 <- select(f1, col1.x, col2.x, x, y) # select by name (easier for big dt)

f2 <- merge(dt1, dt2, by.x = "pairs", by.y = "revpairs") # merge by pairs and reverse pairs
colnames(f2)[ncol(f2)] <- "revpairs" # rename last column because it has the same name as the first, which can cause errors
f2 <- select(f2, col1.x, col2.x, x, y)

final <- bind_rows(f2, f1) # bind the two together
colnames(final)[1:2] <- c("col1", "col2") # this is not necessary, just for clarity
final
col1 col2 x y
1: aa bb 130 567
2: bb zz 29 34
3: dd ff 85 101
4: cc dd 122 56

Join two data.tables by one numeric variable with dynamic tolerance

Here is an approach using foverlaps() from data.table.

tolerance = 5e-6
#create ranges to join on
DT1[, `:=`(min = mz - mz * tolerance,
max = mz + mz * tolerance) ]
DT2[, `:=`(min = iso_mz - iso_mz * tolerance,
max = iso_mz + iso_mz * tolerance) ]
#set keys
setkey(DT1, min, max )
setkey(DT2, min, max )
#perform overlap join, order, remove min-max columns
ans <- setorder( foverlaps( DT2, DT1 ), mz)[, `:=`(min=NULL,max=NULL,i.min=NULL,i.max=NULL)][]

# mz Var1 Var2 iso_mz comp
# 1: 433.2315 433.2310 433.2321 433.2335 m4
# 2: 451.0920 451.0914 451.0926 451.0900 m1
# 3: 490.1672 490.1664 490.1683 490.1651 m2
# 4: 518.2259 518.2251 518.2266 518.2281 m3

#check
all.equal( setcolorder(ans, names(Output)), Output )
[1] TRUE

How do I combine two data-frames based on two columns?

See the documentation on ?merge, which states:

By default the data frames are merged on the columns with names they both have, 
but separate specifications of the columns can be given by by.x and by.y.

This clearly implies that merge will merge data frames based on more than one column. From the final example given in the documentation:

x <- data.frame(k1=c(NA,NA,3,4,5), k2=c(1,NA,NA,4,5), data=1:5)
y <- data.frame(k1=c(NA,2,NA,4,5), k2=c(NA,NA,3,4,5), data=1:5)
merge(x, y, by=c("k1","k2")) # NA's match

This example was meant to demonstrate the use of incomparables, but it illustrates merging using multiple columns as well. You can also specify separate columns in each of x and y using by.x and by.y.

R: How to join dataframes with a key that has duplicate values, and COPY the second frame's values to each duplicate?

You can use left_join by your ID from the dplyr package. You can use the following code:

df_A <- data.frame(ID = c(123, 123, 124, 124, 124, 125, 125, 126),
Sale_Date = c("1/1/2020", "2/1/2021", "3/1/2019", "2/13/2020", "1/1/2022", "2/1/2021", "1/1/2021", "2/1/2021"),
Sale_Price = c(320000, 429000, 190000, 280000, 419000, 300000, 390000, 310000))

df_B <- data.frame(ID = c(123, 124, 125, 126),
Distance = c(1290, 1809, 370, 976))

library(dplyr)
df_joined <- left_join(df_A, df_B, by = "ID")
df_joined

Output:

   ID Sale_Date Sale_Price Distance
1 123 1/1/2020 320000 1290
2 123 2/1/2021 429000 1290
3 124 3/1/2019 190000 1809
4 124 2/13/2020 280000 1809
5 124 1/1/2022 419000 1809
6 125 2/1/2021 300000 370
7 125 1/1/2021 390000 370
8 126 2/1/2021 310000 976


Related Topics



Leave a reply



Submit