R Data.Table Join on Conditionals

Conditional join in data.table?

I suggest using non-equi joins combined with mult = "last" (in order to capture only the most recent EndDate):

dtgrouped2[, c("Amount1", "Amount2") :=   # assign the result to new columns in dtgrouped2
  dt2[dtgrouped2,                         # join
      .(Amount1, Amount2),                # get the columns you need
      on = .(Unique1 = Unique,            # join conditions
             StartDate < MonthNo,
             EndDate >= MonthNo),
      mult = "last"]]                     # always take the latest EndDate
dtgrouped2

#     MonthNo Unique Total Amount1 Amount2
#  1:       1    AAA    10       7       0
#  2:       1    BBB     0      NA      NA
#  3:       2    CCC     3      NA      NA
#  4:       2    DDD     0      NA      NA
#  5:       3    AAA     0       3       2
#  6:       3    BBB    35      NA      NA
#  7:       4    CCC    15      NA      NA
#  8:       4    AAA     0       3       2
#  9:       5    BBB    60      NA      NA
# 10:       5    CCC     0      NA      NA
# 11:       6    DDD   100      NA      NA
# 12:       6    AAA     0      NA      NA

The reason you need to join as dt2[dtgrouped2] (and not the other way around) is that you want to look up dt2 for each row of dtgrouped2, thereby allowing multiple rows of dt2 to be matched against dtgrouped2.
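To see why the direction matters, here is a minimal sketch with hypothetical toy tables (lookup and target are made-up names, not the data from the question): joining as lookup[target] returns exactly one result row per row of target, which is what an assignment back into the target table relies on.

```r
library(data.table)

# Hypothetical toy tables (not the data from the question)
lookup <- data.table(id = c("A", "A"), lo = c(0, 5), hi = c(5, 10), val = c(1, 2))
target <- data.table(id = c("A", "A"), x = c(3, 7))

# lookup[target]: one output row per row of 'target', pulling the
# matching 'val' in from 'lookup' via the non-equi conditions
res <- lookup[target, .(id, x = i.x, val), on = .(id, lo < x, hi >= x)]
res
```

Here each target row falls into exactly one (lo, hi] interval, so the result has the same number of rows as target and can be assigned straight back into it.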

Conditional joins in data.table - left join with column selection

You can also include the selection of variables within the first query.

DT2[DT1, .(x, y, z, t), on = .(a = x, y_start <= y, y_end >= y)]
#    x  y          z    t
# 1: 1 15 -0.6264538   NA
# 2: 2 25  0.1836433 1206
# 3: 3 35 -0.8356286   NA
# 4: 4 45  1.5952808   NA
# 5: 5 55  0.3295078 1686

Regards!

R Data.Table Join on Conditionals

It's a bit ugly but works:

library(data.table)
library(sqldf)

dt <- data.table(num  = c(1, 2, 3, 4, 5, 6),
                 char = c('A', 'A', 'A', 'B', 'B', 'B'),
                 bool = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE))

dt_two <- data.table(num  = c(6, 1, 5, 2, 4, 3),
                     char = c('A', 'A', 'A', 'B', 'B', 'B'),
                     bool = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE))

dt_out_sql <- sqldf('
  SELECT dtone.num,
         dtone.char,
         dtone.bool,
         SUM(dttwo.num) AS SUM,
         MIN(dttwo.num) AS MIN
  FROM dt AS dtone
  INNER JOIN dt_two AS dttwo ON
    (dtone.char = dttwo.char) AND
    (dtone.num >= dttwo.num OR dtone.bool)
  GROUP BY dtone.num, dtone.char, dtone.bool
')

setDT(dt_out_sql)

setkey(dt, char)
setkey(dt_two, char)

dt_out_r <- dt[dt_two,
               list(dtone.num = num,
                    dttwo.num = i.num,
                    char,
                    bool),
               nomatch = 0, allow.cartesian = TRUE
][dtone.num >= dttwo.num | bool,
  list(SUM = sum(dttwo.num),
       MIN = min(dttwo.num)),
  by = list(num = dtone.num,
            char,
            bool)
]

setkey(dt_out_r, num, char, bool)

all.equal(dt_out_sql, dt_out_r, check.attributes = FALSE)

Conditional (inequality) join in data.table

Using non-equi joins:

ans = dt1[dt2, on=.(start <= end, end > start), 
.(i.start, i.end, counts, id, circuit, cndn = i.start < x.start & i.end >= x.end),
allow.cartesian=TRUE
][!cndn %in% TRUE]

The condition start <= end, end >= start (note the >= in both cases) checks whether two intervals overlap at all. The open interval on one side is accomplished by the end > start part (> instead of >=). But it still also picks up intervals of this type:

         dt1: start=================end
dt2: start--------------------------------end ## start < start, end > end

and

         dt1: start=================end
dt2: start----------end ## end == end

The cndn column checks for and removes these cases. Hopefully there aren't many such cases, so we don't materialise unwanted rows unnecessarily.


PS: the solution in this case is still not as straightforward as I'd like, and that's because it requires an OR operation. It is possible to do two conditional joins and then bind them together, though.
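A sketch of that two-join workaround (the tables a and b below are hypothetical, not from the question): an OR across two join conditions can be emulated by running one conditional join per branch, stacking the results, and de-duplicating rows matched by both branches.

```r
library(data.table)

# Hypothetical toy tables: match rows of 'a' to 'b' where
# (b$lo <= a$v) OR (b$flag is TRUE), per id
a <- data.table(id = 1:3, v = c(10, 20, 30))
b <- data.table(id = 1:3, lo = c(5, 25, 25), flag = c(FALSE, FALSE, TRUE))

# branch 1: the inequality condition
j1 <- b[a, .(id, flag, v = i.v), on = .(id, lo <= v), nomatch = 0L]
# branch 2: the flag condition (an equi-join on the flagged subset)
j2 <- b[flag == TRUE][a, .(id, flag, v = i.v), on = .(id), nomatch = 0L]

# stack and de-duplicate (id 3 satisfies both branches; keep it once)
res <- unique(rbind(j1, j2))
```

The de-duplication step is what makes the union behave like an OR rather than double-counting rows that satisfy both conditions.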

Perhaps at some point, we'll have to think about the feasibility of extending joins to these kinds of operations in a more straightforward manner.

Conditional data.table merge with .EACHI

I next want to perform a merge by ID and need to merge only where ValueSmall is greater than or equal to ValueBig. For the matches, I want to grab the max-ranked value in dtBig.

setorder(dtBig, ID, ValueBig, Rank)
dtSmall[, r :=
  dtBig[.SD, on = .(ID, ValueBig <= ValueSmall), mult = "last", x.Rank]
]

    ID ValueSmall r
 1:  A        478 4
 2:  A        862 7
 3:  B        439 4
 4:  B        245 2
 5:  C         71 1
 6:  C        100 1
 7:  D        317 2
 8:  D        519 5
 9:  E        663 5
10:  E        407 1

I imagine it is considerably faster to sort dtBig and take the last matching row rather than to compute the max by .EACHI, but am not entirely sure. If you don't like sorting, just save the previous sort order so it can be restored afterwards.
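A sketch of that save-and-restore idea (dtBig is rebuilt here as a hypothetical stand-in, since the question's data isn't shown): record the current row numbers in a helper column before sorting, and sort back on it afterwards.

```r
library(data.table)

# Hypothetical stand-in for dtBig
dtBig <- data.table(ID = c("B", "A", "A"), ValueBig = c(2, 3, 1), Rank = c(3, 2, 1))

dtBig[, orig_order := .I]            # .I = current row numbers
setorder(dtBig, ID, ValueBig, Rank)  # sort for the mult = "last" lookup
# ... perform the dtSmall[, r := dtBig[.SD, ...]] join here ...
setorder(dtBig, orig_order)          # revert to the original order
dtBig[, orig_order := NULL]          # drop the helper column
```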


Is there a way to aggregate these matches using a function like max or min for these multiple matches?

For this more general problem, .EACHI works, just making sure you're doing it for each row of the target table (dtSmall in this case), so...

dtSmall[, r :=
  dtBig[.SD, on = .(ID, ValueBig <= ValueSmall), max(x.Rank), by = .EACHI]$V1
]

data.table conditional join overwrites columns used for condition- R

I think this is what you need...

df1[df2,
    `:=`(Date_2A = i.Date_2A, Date_2B = i.Date_2B, Date_2B_EXTENDED = i.Date_2B_EXTENDED),
    on = .(ID, date_1 >= Date_2A, date_1 <= Date_2B_EXTENDED)][]

output

#     ID row_unique_identifier     date_1    Date_2A    Date_2B Date_2B_EXTENDED
#  1:  1                     1 2016-11-14 2016-11-14 2016-11-14       2016-11-20
#  2:  1                     2 2016-11-14 2016-11-14 2016-11-14       2016-11-20
#  3:  1                     3 2016-11-14 2016-11-14 2016-11-14       2016-11-20
#  4:  1                     4 2016-11-14 2016-11-14 2016-11-14       2016-11-20
#  5:  1                     5 2016-11-14 2016-11-14 2016-11-14       2016-11-20
#  6:  1                     6 2016-11-14 2016-11-14 2016-11-14       2016-11-20
#  7:  1                     7 2016-11-14 2016-11-14 2016-11-14       2016-11-20
#  8:  1                     8 2017-11-02       <NA>       <NA>             <NA>
#  9:  1                     9 2017-11-17 2017-11-17 2017-11-17       2017-11-23
# 10:  1                    10 2017-11-17 2017-11-17 2017-11-17       2017-11-23
# 11:  1                    11 2017-11-17 2017-11-17 2017-11-17       2017-11-23
# 12:  1                    12 2017-11-17 2017-11-17 2017-11-17       2017-11-23
# 13:  1                    13 2017-11-17 2017-11-17 2017-11-17       2017-11-23
# 14:  1                    14 2017-11-17 2017-11-17 2017-11-17       2017-11-23
# 15:  1                    15 2017-11-17 2017-11-17 2017-11-17       2017-11-23
# 16:  1                    16 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 17:  1                    17 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 18:  1                    18 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 19:  1                    19 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 20:  1                    20 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 21:  1                    21 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 22:  1                    22 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 23:  1                    23 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 24:  1                    24 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 25:  1                    25 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 26:  1                    26 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 27:  1                    27 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 28:  1                    28 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 29:  1                    29 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 30:  1                    30 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 31:  1                    31 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 32:  1                    32 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 33:  1                    33 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 34:  1                    34 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 35:  1                    35 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 36:  1                    36 2018-12-07 2018-12-06 2018-12-07       2018-12-13
# 37:  1                    37 2018-12-07 2018-12-06 2018-12-07       2018-12-13
# 38:  1                    38 2018-12-07 2018-12-06 2018-12-07       2018-12-13
# 39:  1                    39 2018-12-07 2018-12-06 2018-12-07       2018-12-13
# 40:  1                    40 2018-12-07 2018-12-06 2018-12-07       2018-12-13
# 41:  1                    41 2018-12-07 2018-12-06 2018-12-07       2018-12-13
#     ID row_unique_identifier     date_1    Date_2A    Date_2B Date_2B_EXTENDED

data.table conditional Inequality join

When i is a data.table, the columns of i can be referred to in j by using the prefix i., e.g., X[Y, .(val, i.val)]. Here val refers to X's column and i.val Y's. Columns of x can now be referred to using the prefix x. and is particularly useful during joining to refer to x's join columns as they are otherwise masked by i's. For example, X[Y, .(x.a-i.a, b), on="a"].

bDT[aDT, .(col1, col2, i.ExtractDate, x.date_pol, Value),
    on = .(date_pol <= ExtractDate, col1 = col1, col2 = col2),
    mult = "first"]

output

   col1 col2 i.ExtractDate x.date_pol Value
1:    1    A    2017-01-01 2016-05-20     2
2:    1    A    2016-01-01 2015-05-20     3
3:    2    B    2015-01-01 2014-05-20     4
4:    2    B    2014-01-01       <NA>    NA
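The prefix semantics quoted above can be illustrated with a tiny self-contained example (X and Y are made-up tables, not bDT/aDT): with a shared column name val, the bare name resolves to X's column, i.val gives Y's, and x.a recovers X's join column.

```r
library(data.table)

# Made-up tables with a clashing column name 'val'
X <- data.table(a = c(1, 2), val = c(10, 20))
Y <- data.table(a = c(2, 3), val = c(200, 300))

# 'val' is X's column, 'i.val' is Y's, 'x.a' is X's join column
res <- X[Y, .(x.a, val, i.val), on = "a"]
res   # row 1: x.a = 2, val = 20, i.val = 200;  row 2: x.a = NA, val = NA, i.val = 300
```

The second row shows why x.a is useful: Y's a = 3 has no match in X, so X-side columns come back NA, while i.val still carries Y's value.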

R data.table Multiple Conditions Join

Very interesting question, and great use of by = .EACHI! Here's another approach using the new non-equi joins from the current development version, v1.9.7.

Issue: Your use of by=.EACHI is completely justified because the other alternative is to perform a cross join (each row of dtGrid joined to all rows of dtEvents) but that's too exhaustive and is bound to explode very quickly.

However, by = .EACHI is performed along with an equi-join on a dummy column, which results in computing all distances (except that it does one row at a time, and is therefore memory efficient). That is, in your code, for each row of dtGrid, all possible distances to dtEvents are still computed; hence it doesn't scale as well as expected.

Strategy: Then you'd agree that an acceptable improvement is to restrict the number of rows that would result from joining each row of dtGrid to dtEvents.

Let (x_i, y_i) come from dtGrid and (a_j, b_j) come from dtEvents, where 1 <= i <= nrow(dtGrid) and 1 <= j <= nrow(dtEvents). Then, for i = 1, all j satisfying (x1 - a_j)^2 + (y1 - b_j)^2 < 1 need to be extracted. That can only happen when:

(x1 - a_j)^2 < 1 AND (y1 - b_j)^2 < 1

This helps reduce the search space drastically because, instead of looking at all rows in dtEvents for each row in dtGrid, we just have to extract those rows where,

a_j - 1 <= x1 <= a_j + 1 AND b_j - 1 <= y1 <= b_j + 1
# where '1' is the radius

This constraint can be directly translated to a non-equi join, and combined with by = .EACHI as before. The only additional step required is to construct the columns a_j-1, a_j+1, b_j-1, b_j+1 as follows:

foo1 <- function(dt1, dt2) {
  dt2[, `:=`(xm = x - 1, xp = x + 1, ym = y - 1, yp = y + 1)]   ## (1)
  tmp = dt2[dt1, on = .(xm <= x, xp >= x, ym <= y, yp >= y),
            .(sum((i.x - x)^2 + (i.y - y)^2 < 1)), by = .EACHI,
            allow = TRUE, nomatch = 0L
  ][, c("xp", "yp") := NULL]                                    ## (2)
  tmp[]
}

## (1) constructs all the columns necessary for the non-equi join (since expressions are not yet allowed in on=).

## (2) performs a non-equi join that computes distances and checks for all distances that are < 1 on the restricted set of combinations for each row in dtGrid -- hence should be much faster.

Benchmarks:

# Here's your code (modified to ensure identical column names etc..):
foo2 <- function(dt1, dt2) {
  ans = dt2[dt1,
            {
              val = Counter[(x - i.x)^2 + (y - i.y)^2 < 1^2]
              .(xm = i.x, ym = i.y, V1 = sum(val))
            },
            by = .EACHI][, "DummyJoin" := NULL]
  ans[]
}

# on grid size of 100:
system.time(ans1 <- foo1(dtGrid, dtEvents)) # 0.166s
system.time(ans2 <- foo2(dtGrid, dtEvents)) # 1.626s

# on grid size of 200:
system.time(ans1 <- foo1(dtGrid, dtEvents)) # 0.983s
system.time(ans2 <- foo2(dtGrid, dtEvents)) # 31.038s

# on grid size of 300:
system.time(ans1 <- foo1(dtGrid, dtEvents)) # 2.847s
system.time(ans2 <- foo2(dtGrid, dtEvents)) # 151.32s

identical(ans1[V1 != 0L], ans2[V1 != 0L]) # TRUE for all of them

The speedups are ~10x, 32x and 53x respectively.

Note that rows in dtGrid for which the condition is not satisfied for even a single row in dtEvents will not be present in the result (due to nomatch=0L). If you want those rows, you'll also have to add one of the xm/xp/ym/yp columns and check them for NA (= no matches).

This is the reason we had to remove all 0 counts to get identical = TRUE.

HTH

PS: See history for another variation where the entire join is materialised and then the distance is computed and counts generated.


