Conditional join in data.table?
I suggest using a non-equi join combined with mult = "last" (in order to capture only the most recent EndDate):
dtgrouped2[, c("Amount1", "Amount2") :=      # assign the result to new columns in dtgrouped2
             dt2[dtgrouped2,                 # join
                 .(Amount1, Amount2),        # select the columns you need
                 on = .(Unique1 = Unique,    # join conditions
                        StartDate < MonthNo,
                        EndDate >= MonthNo),
                 mult = "last"]]             # always take the latest EndDate
dtgrouped2
# MonthNo Unique Total Amount1 Amount2
# 1: 1 AAA 10 7 0
# 2: 1 BBB 0 NA NA
# 3: 2 CCC 3 NA NA
# 4: 2 DDD 0 NA NA
# 5: 3 AAA 0 3 2
# 6: 3 BBB 35 NA NA
# 7: 4 CCC 15 NA NA
# 8: 4 AAA 0 3 2
# 9: 5 BBB 60 NA NA
# 10: 5 CCC 0 NA NA
# 11: 6 DDD 100 NA NA
# 12: 6 AAA 0 NA NA
The reason you need to join as dt2[dtgrouped2] (and not the other way around) is that you want to look up dt2 for each row of dtgrouped2, hence allowing multiple rows of dt2 to be candidates for each row of dtgrouped2.
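As a self-contained illustration, here is the same update-join pattern on made-up data (these tables only mimic the AAA rows above; they are not the OP's actual data):

```r
library(data.table)

# Made-up lookup table of ranges (illustrative, not the OP's data)
dt2 <- data.table(Unique1   = c("AAA", "AAA"),
                  StartDate = c(0L, 2L),
                  EndDate   = c(2L, 4L),
                  Amount1   = c(7L, 3L),
                  Amount2   = c(0L, 2L))
dtg <- data.table(MonthNo = c(1L, 3L, 5L), Unique = "AAA")

# For each row of dtg, look up the matching range in dt2;
# mult = "last" keeps the match with the largest EndDate
dtg[, c("Amount1", "Amount2") :=
      dt2[dtg, .(Amount1, Amount2),
          on = .(Unique1 = Unique, StartDate < MonthNo, EndDate >= MonthNo),
          mult = "last"]]
```

MonthNo 5 falls outside every range here, so it gets NA in both columns, just like the unmatched rows in the output above.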
Conditional joins in data.table - left join with column selection
You can also include the selection of variables within the first query.
DT2[DT1, .(x, y, z, t), on = .(a = x, y_start <= y, y_end >= y)]
# x y z t
#1: 1 15 -0.6264538 NA
#2: 2 25 0.1836433 1206
#3: 3 35 -0.8356286 NA
#4: 4 45 1.5952808 NA
#5: 5 55 0.3295078 1686
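A runnable sketch of the same idea, with made-up tables (the OP's DT1/DT2 and the t values shown above are not reproduced here):

```r
library(data.table)

# Illustrative tables: DT2 holds ranges (y_start, y_end) keyed by a,
# DT1 holds points (x, y) to look up
DT1 <- data.table(x = 1:3, y = c(15, 25, 35))
DT2 <- data.table(a = 2:3, y_start = c(20, 30), y_end = c(30, 40), t = c(100L, 200L))

# Left join DT1 into DT2, selecting columns directly in j;
# unmatched rows (x = 1 here) get NA for DT2's columns
res <- DT2[DT1, .(x, y, t), on = .(a = x, y_start <= y, y_end >= y)]
```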
Regards!
R Data.Table Join on Conditionals
It's a bit ugly but works:
library(data.table)
library(sqldf)
dt <- data.table(num=c(1, 2, 3, 4, 5, 6),
char=c('A', 'A', 'A', 'B', 'B', 'B'),
bool=c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE))
dt_two <- data.table(
num =c(6, 1, 5, 2, 4, 3),
char=c('A', 'A', 'A', 'B', 'B', 'B'),
bool=c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE))
dt_out_sql <- sqldf('
select dtone.num,
dtone.char,
dtone.bool,
SUM(dttwo.num) as SUM,
MIN(dttwo.num) as MIN
from dt as dtone
INNER join dt_two as dttwo on
(dtone.char = dttwo.char) and
(dtone.num >= dttwo.num OR dtone.bool)
GROUP BY dtone.num, dtone.char, dtone.bool
')
setDT(dt_out_sql)
setkey(dt, char)
setkey(dt_two, char)
dt_out_r <- dt[dt_two,
list(dtone.num = num,
dttwo.num = i.num,
char,
bool) ,
nomatch = 0, allow.cartesian = T
][
dtone.num >= dttwo.num | bool,
list(SUM = sum(dttwo.num),
MIN = min(dttwo.num)),
by = list(num = dtone.num,
char,
bool)
]
setkey(dt_out_r, num, char, bool)
all.equal(dt_out_sql, dt_out_r, check.attributes = FALSE)
Conditional (inequality) join in data.table
Using non-equi joins:
ans = dt1[dt2, on=.(start <= end, end > start),
.(i.start, i.end, counts, id, circuit, cndn = i.start < x.start & i.end >= x.end),
allow.cartesian=TRUE
][!cndn %in% TRUE]
The condition start <= end, end >= start (note the >= in both cases) would check whether two intervals overlap at all. The open interval on one side is accomplished by the end > start part (> instead of >=). But it still also picks up intervals of the type:
dt1: start=================end
dt2: start--------------------------------end ## start < start, end > end
and
dt1: start=================end
dt2: start----------end ## end == end
The cndn column is there to check for and remove these cases. Hopefully those cases aren't numerous, so that we don't materialise unwanted rows unnecessarily.
PS: the solution in this case is still not as straightforward as I'd like, and that's because it requires an OR operation. It is possible to do two conditional joins and then bind them together, though.
Perhaps at some point we'll have to think about the feasibility of extending joins to these kinds of operations in a more straightforward manner.
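The two-joins-plus-bind idea mentioned above can be sketched like this (toy tables; the column names lo/hi_alt are made up for illustration):

```r
library(data.table)

# Toy data: keep (id, val) pairs where  val >= lo  OR  val <= hi_alt --
# an OR that a single non-equi join cannot express
ref <- data.table(id = 1:3, lo = c(10, 20, 30), hi_alt = c(2, 3, 4))
qry <- data.table(val = c(1, 15, 25))

j1 <- ref[qry, on = .(lo <= val),     .(id = x.id, val = i.val), nomatch = 0L]
j2 <- ref[qry, on = .(hi_alt >= val), .(id = x.id, val = i.val), nomatch = 0L]

# bind the two partial results; unique() drops pairs matched by both conditions
ans <- unique(rbindlist(list(j1, j2)))
```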
Conditional data.table merge with .EACHI
I next want to perform a merge by ID and need to merge only where ValueSmall is greater than or equal to ValueBig. For the matches, I want to grab the max ranked value in dtBig.
setorder(dtBig, ID, ValueBig, Rank)
dtSmall[, r :=
dtBig[.SD, on=.(ID, ValueBig <= ValueSmall), mult="last", x.Rank ]
]
ID ValueSmall r
1: A 478 4
2: A 862 7
3: B 439 4
4: B 245 2
5: C 71 1
6: C 100 1
7: D 317 2
8: D 519 5
9: E 663 5
10: E 407 1
I imagine it is considerably faster to sort dtBig and take the last matching row rather than to compute the max by .EACHI, but am not entirely sure. If you don't like sorting, just save the previous sort order so it can be reverted to afterwards.
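A sketch of that save-and-restore idea (illustrative data; the helper column name orig_order is made up):

```r
library(data.table)

dtBig <- data.table(ID = c("B", "A", "A"), ValueBig = c(5, 9, 1), Rank = 1:3)

dtBig[, orig_order := .I]       # remember the current row positions
setorder(dtBig, ID, ValueBig)   # sort for the mult = "last" trick
# ... perform the update join here ...
setorder(dtBig, orig_order)     # revert to the original order
dtBig[, orig_order := NULL]
```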
Is there a way to aggregate these matches using a function like max or min for these multiple matches?
For this more general problem, .EACHI works, just making sure you're doing it for each row of the target table (dtSmall in this case), so...
dtSmall[, r :=
dtBig[.SD, on=.(ID, ValueBig <= ValueSmall), max(x.Rank), by=.EACHI ]$V1
]
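For a runnable check of that pattern, here is a miniature version with made-up rows (not the OP's data):

```r
library(data.table)

# Miniature made-up data: three ranked rows for ID "A"
dtBig   <- data.table(ID = "A", ValueBig = c(100, 400, 800), Rank = c(1L, 4L, 7L))
dtSmall <- data.table(ID = c("A", "A"), ValueSmall = c(478, 862))

# max(x.Rank) over the matches of each dtSmall row, via by = .EACHI
dtSmall[, r := dtBig[.SD, on = .(ID, ValueBig <= ValueSmall),
                     max(x.Rank), by = .EACHI]$V1]
```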
data.table conditional join overwrites columns used for condition - R
I think this is what you need...
df1[ df2,
`:=`( Date_2A = i.Date_2A, Date_2B = i.Date_2B, Date_2B_EXTENDED = i.Date_2B_EXTENDED ),
on = .( ID, date_1 >= Date_2A, date_1 <= Date_2B_EXTENDED ) ][]
output
# ID row_unique_identifier date_1 Date_2A Date_2B Date_2B_EXTENDED
# 1: 1 1 2016-11-14 2016-11-14 2016-11-14 2016-11-20
# 2: 1 2 2016-11-14 2016-11-14 2016-11-14 2016-11-20
# 3: 1 3 2016-11-14 2016-11-14 2016-11-14 2016-11-20
# 4: 1 4 2016-11-14 2016-11-14 2016-11-14 2016-11-20
# 5: 1 5 2016-11-14 2016-11-14 2016-11-14 2016-11-20
# 6: 1 6 2016-11-14 2016-11-14 2016-11-14 2016-11-20
# 7: 1 7 2016-11-14 2016-11-14 2016-11-14 2016-11-20
# 8: 1 8 2017-11-02 <NA> <NA> <NA>
# 9: 1 9 2017-11-17 2017-11-17 2017-11-17 2017-11-23
# 10: 1 10 2017-11-17 2017-11-17 2017-11-17 2017-11-23
# 11: 1 11 2017-11-17 2017-11-17 2017-11-17 2017-11-23
# 12: 1 12 2017-11-17 2017-11-17 2017-11-17 2017-11-23
# 13: 1 13 2017-11-17 2017-11-17 2017-11-17 2017-11-23
# 14: 1 14 2017-11-17 2017-11-17 2017-11-17 2017-11-23
# 15: 1 15 2017-11-17 2017-11-17 2017-11-17 2017-11-23
# 16: 1 16 2018-12-06 2018-12-06 2018-12-07 2018-12-13
# 17: 1 17 2018-12-06 2018-12-06 2018-12-07 2018-12-13
# 18: 1 18 2018-12-06 2018-12-06 2018-12-07 2018-12-13
# 19: 1 19 2018-12-06 2018-12-06 2018-12-07 2018-12-13
# 20: 1 20 2018-12-06 2018-12-06 2018-12-07 2018-12-13
# 21: 1 21 2018-12-06 2018-12-06 2018-12-07 2018-12-13
# 22: 1 22 2018-12-06 2018-12-06 2018-12-07 2018-12-13
# 23: 1 23 2018-12-06 2018-12-06 2018-12-07 2018-12-13
# 24: 1 24 2018-12-06 2018-12-06 2018-12-07 2018-12-13
# 25: 1 25 2018-12-06 2018-12-06 2018-12-07 2018-12-13
# 26: 1 26 2018-12-06 2018-12-06 2018-12-07 2018-12-13
# 27: 1 27 2018-12-06 2018-12-06 2018-12-07 2018-12-13
# 28: 1 28 2018-12-06 2018-12-06 2018-12-07 2018-12-13
# 29: 1 29 2018-12-06 2018-12-06 2018-12-07 2018-12-13
# 30: 1 30 2018-12-06 2018-12-06 2018-12-07 2018-12-13
# 31: 1 31 2018-12-06 2018-12-06 2018-12-07 2018-12-13
# 32: 1 32 2018-12-06 2018-12-06 2018-12-07 2018-12-13
# 33: 1 33 2018-12-06 2018-12-06 2018-12-07 2018-12-13
# 34: 1 34 2018-12-06 2018-12-06 2018-12-07 2018-12-13
# 35: 1 35 2018-12-06 2018-12-06 2018-12-07 2018-12-13
# 36: 1 36 2018-12-07 2018-12-06 2018-12-07 2018-12-13
# 37: 1 37 2018-12-07 2018-12-06 2018-12-07 2018-12-13
# 38: 1 38 2018-12-07 2018-12-06 2018-12-07 2018-12-13
# 39: 1 39 2018-12-07 2018-12-06 2018-12-07 2018-12-13
# 40: 1 40 2018-12-07 2018-12-06 2018-12-07 2018-12-13
# 41: 1 41 2018-12-07 2018-12-06 2018-12-07 2018-12-13
# ID row_unique_identifier date_1 Date_2A Date_2B Date_2B_EXTENDED
data.table conditional Inequality join
When i is a data.table, the columns of i can be referred to in j by using the prefix i., e.g., X[Y, .(val, i.val)]. Here val refers to X's column and i.val to Y's. Columns of x can now be referred to using the prefix x., which is particularly useful during joins to refer to x's join columns, as they are otherwise masked by i's. For example, X[Y, .(x.a - i.a, b), on="a"].
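A quick runnable illustration of the i. and x. prefixes (toy tables, assumed for this example):

```r
library(data.table)

X <- data.table(a = 1:3, b = c(10, 20, 30))
Y <- data.table(a = 2:3, b = c(200, 300))

# x.b comes from X; i.b comes from Y
res <- X[Y, .(a, x.b, i.b), on = "a"]
```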
bDT[aDT, .(col1, col2, i.ExtractDate, x.date_pol, Value),
on = .(date_pol <= ExtractDate, col1 = col1, col2 = col2),
mult = "first"]
output
col1 col2 i.ExtractDate x.date_pol Value
1: 1 A 2017-01-01 2016-05-20 2
2: 1 A 2016-01-01 2015-05-20 3
3: 2 B 2015-01-01 2014-05-20 4
4: 2 B 2014-01-01 <NA> NA
R data.table Multiple Conditions Join
Very interesting question.. and great use of by = .EACHI! Here's another approach using the new non-equi joins from the current development version, v1.9.7.
Issue: Your use of by = .EACHI is completely justified, because the only alternative is to perform a cross join (each row of dtGrid joined to all rows of dtEvents), and that's too exhaustive and bound to explode very quickly.
However, by = .EACHI is performed along with an equi-join on a dummy column, which results in computing all distances (except that it does so one row at a time, and is therefore memory efficient). That is, in your code, for each row of dtGrid, all distances to dtEvents are still computed; hence it doesn't scale as well as expected.
Strategy: Then you'd agree that an acceptable improvement is to restrict the number of rows that result from joining each row of dtGrid to dtEvents.
Let (x_i, y_i) come from dtGrid and (a_j, b_j) come from dtEvents, where 1 <= i <= nrow(dtGrid) and 1 <= j <= nrow(dtEvents). Then i = 1 implies that all j satisfying (x1 - a_j)^2 + (y1 - b_j)^2 < 1 need to be extracted. That can only happen when:
(x1 - a_j)^2 < 1 AND (y1 - b_j)^2 < 1
This helps reduce the search space drastically because, instead of looking at all rows in dtEvents for each row in dtGrid, we just have to extract those rows where
a_j - 1 <= x1 <= a_j + 1 AND b_j - 1 <= y1 <= b_j + 1
# where '1' is the radius
This constraint can be directly translated to a non-equi join and combined with by = .EACHI as before. The only additional step required is to construct the columns a_j-1, a_j+1, b_j-1, b_j+1, as follows:
foo1 <- function(dt1, dt2) {
dt2[, `:=`(xm=x-1, xp=x+1, ym=y-1, yp=y+1)] ## (1)
tmp = dt2[dt1, on=.(xm<=x, xp>=x, ym<=y, yp>=y),
.(sum((i.x-x)^2+(i.y-y)^2<1)), by=.EACHI,
allow=TRUE, nomatch=0L
][, c("xp", "yp") := NULL] ## (2)
tmp[]
}
## (1) constructs all the columns necessary for the non-equi join (since expressions are not yet allowed in the formula for on=).
## (2) performs the non-equi join, which computes the distances and counts those that are < 1, on the restricted set of combinations for each row in dtGrid -- hence it should be much faster.
Benchmarks:
# Here's your code (modified to ensure identical column names etc..):
foo2 <- function(dt1, dt2) {
ans = dt2[dt1,
{
val = Counter[(x - i.x)^2 + (y - i.y)^2 < 1^2];
.(xm=i.x, ym=i.y, V1=sum(val))
},
by=.EACHI][, "DummyJoin" := NULL]
ans[]
}
# on grid size of 100:
system.time(ans1 <- foo1(dtGrid, dtEvents)) # 0.166s
system.time(ans2 <- foo2(dtGrid, dtEvents)) # 1.626s
# on grid size of 200:
system.time(ans1 <- foo1(dtGrid, dtEvents)) # 0.983s
system.time(ans2 <- foo2(dtGrid, dtEvents)) # 31.038s
# on grid size of 300:
system.time(ans1 <- foo1(dtGrid, dtEvents)) # 2.847s
system.time(ans2 <- foo2(dtGrid, dtEvents)) # 151.32s
identical(ans1[V1 != 0L], ans2[V1 != 0L]) # TRUE for all of them
The speedups are ~10x, 32x and 53x respectively.
Note that the rows in dtGrid for which the condition is not satisfied by even a single row in dtEvents will not be present in the result (due to nomatch=0L). If you want those rows, you'll have to also add one of the xm/xp/ym/yp columns and check them for NA (= no matches). This is the reason we had to remove all 0 counts to get identical = TRUE.
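A minimal sketch of that NA check (tiny made-up grid and events; the column names follow foo1 above):

```r
library(data.table)

dtEvents <- data.table(x = c(0.5, 5.0), y = c(0.5, 5.0))
dtGrid   <- data.table(x = c(0, 9), y = c(0, 9))
dtEvents[, `:=`(xm = x - 1, xp = x + 1, ym = y - 1, yp = y + 1)]

# Default nomatch keeps grid rows with no event in their box; their
# join columns (and hence N) come back NA, which we recode to 0
tmp <- dtEvents[dtGrid, on = .(xm <= x, xp >= x, ym <= y, yp >= y),
                .(N = sum((i.x - x)^2 + (i.y - y)^2 < 1)), by = .EACHI,
                allow.cartesian = TRUE]
tmp[is.na(N), N := 0L]
```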
HTH
PS: See history for another variation where the entire join is materialised and then the distance is computed and counts generated.