Conditional Binary Join and Update by Reference Using the data.table Package

Conditional binary join and update by reference using the data.table package

Copying from Arun's updated answer here:

TK[venue_id %in% 1:2, New_id := DFT[.SD, New_id]][]
#    venue_id DFT_id New_id
# 1:        1      1      3
# 2:        2      1      3
# 3:        1      2      4
# 4:        3      2   9401
# 5:        2      3      2
# 6:        3      3    456

His answer gives the details of what is going on.
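For reference, here is a minimal self-contained sketch of the same idiom, with hypothetical data reconstructed to match the output above, and with an explicit on= argument used in place of pre-set keys:

library(data.table)

## Hypothetical lookup table mapping DFT_id to a replacement id
DFT <- data.table(DFT_id = 1:3, New_id = c(3, 4, 2))

## Hypothetical target table; rows with venue_id 3 keep their original New_id
TK <- data.table(venue_id = c(1, 2, 1, 3, 2, 3),
                 DFT_id   = c(1, 1, 2, 2, 3, 3),
                 New_id   = c(0, 0, 0, 9401, 0, 456))

## For the rows selected in `i`, `.SD` is joined against DFT on DFT_id and
## the matching New_id values are assigned back by reference
TK[venue_id %in% 1:2, New_id := DFT[.SD, New_id, on = "DFT_id"]][]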

Join two data frames using the data.table package in R based on dates that are within +/- 3 months

Rolling joins in data.table are extremely useful but can be a little tough to get the hang of. The syntax for rollends is quite different from what you've got there; it's not designed to handle any kind of complex logic, just a simple TRUE/FALSE case.
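For reference, rollends is just a pair of logicals controlling whether values are rolled past the first and last observations; a minimal sketch with hypothetical tables:

library(data.table)

## One value observed in March, one in June, keyed by (id, day)
x <- data.table(id = 1L,
                day = as.Date(c("2019-03-01", "2019-06-01")),
                val = 1:2, key = c("id", "day"))
qry <- data.table(id = 1L,
                  day = as.Date(c("2019-01-15", "2019-04-15", "2019-08-15")))

## LOCF roll; rollends = c(TRUE, TRUE) also rolls the first value backwards
## for dates falling before the first observation
x[qry, roll = TRUE, rollends = c(TRUE, TRUE)]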

At any rate, here's one way to approach this problem. Using month arithmetic as a filtering criterion in combination with the nearest-value requirement makes this a multi-step problem instead of a one-step join (at least any way I can see it).

While the join/filter/copy operation is technically a one-liner, I did my best to add plenty of explanation of the nested operations.

library(data.table)
library(lubridate) ## provides the %m-% / %m+% month-arithmetic operators

## Make a copy of Date2 to use as key, as it will be inaccessible within the joined table
df2[, Date2Copy := Date2]

## Set keys
setkey(df1, ID, Date1)
setkey(df2, ID, Date2Copy)

## Step 3: (read the inner nested steps first!)
## After performing steps 1/2, join the intermediate result table back to `df1`...
df1[
  ## Step 1:
  ## First use the key of `df1` to subset `df2` with a rolling join
  df2[df1, .(ID, Date1, Date2), roll = "nearest"
  ## Step 2:
  ## Then apply the +/- 3 month filtering criteria
  ][between(Date2,
            Date1 %m-% months(3),
            Date1 %m+% months(3))]
  ## Step 3:
  ## ...on the `ID` column and add the intermediate results
  ## for the `Date2` and `Value` columns to `df1` by reference
  , c("Date2", "Value") := .(i.Date2, i.Value), on = .(ID)]

## Results
print(df1)
#    ID      Date1      Date2 Value
# 1:  1 2019-09-09 2019-10-09     7
# 2:  2 2019-09-09       <NA>    NA
# 3:  3 2019-09-09       <NA>    NA
# 4:  4 2019-09-09 2019-10-27    15
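A quick note on the operators used above: %m+% and %m-% come from the lubridate package, and unlike naive month arithmetic they clamp to the last valid day of the target month instead of rolling over:

library(lubridate)
as.Date("2019-01-31") %m+% months(1) ## "2019-02-28", not an invalid Feb 31 rolling into March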

These are my three go-to resources (other than the package documentation) for rolling joins; they've all helped me understand some of the quirks at multiple points over the years.

  • https://r-norberg.blogspot.com/2016/06/understanding-datatable-rolling-joins.html
  • https://www.gormanalysis.com/blog/r-data-table-rolling-joins/
  • http://franknarf1.github.io/r-tutorial/_book/tables.html#tables

Combining chaining and assignment by reference in a data.table

I'm not sure why you think that DT[a == 1][b == 0, c := 2], even if it worked, would be more efficient than DT[a == 1 & b == 0, c := 2].

Either way, the most efficient solution in your case would be to key by both a and b and conduct the assignment by reference while performing a binary join on both:

library(data.table)
DT <- data.table(a = c(1, 1, 1, 2, 2), b = c(0, 2, 0, 1, 1)) ## mock data
setkey(DT, a, b) ## keying by both `a` and `b`
DT[J(1, 0), c := 2] ## Update `c` by reference
DT
#    a b  c
# 1: 1 0  2
# 2: 1 0  2
# 3: 1 2 NA
# 4: 2 1 NA
# 5: 2 1 NA
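On current data.table versions, the same keyed update can also be written as an ad hoc on= join, which skips the setkey() step entirely; a minimal equivalent sketch:

DT <- data.table(a = c(1, 1, 1, 2, 2), b = c(0, 2, 0, 1, 1))
DT[.(1, 0), on = .(a, b), c := 2] ## same update, no keys required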

Update a column of NAs in one data table with the value from a column in another data table

We've created new (and more comprehensive) HTML vignettes for some of the data.table concepts. Have a look here for the other vignettes that we are working on. I'm working on vignettes for joins, which, when done, will hopefully clarify these types of problems better.


The idea is to first setkey() on DT1 on the column tract.

setkey(DT1, tract)

In data.table, a join of the form x[i] requires a key for x, but not necessarily for i. This results in two scenarios:

  • If i also has a key set -- the first key column of i is matched against the first key column of x, the second against the second, and so on.

  • If i doesn't have a key set -- the first column of i is matched against the first key column of x, the second column of i against the second key column of x, and so on.

In this case, since the first column in i is also tract, we'll skip setting a key on i.

Then, we perform a join of the form x[i]. By doing this, for each row of i, the matching row indices in x are computed, and then the join result is materialised. However, we don't want the entire join result as a new data.table. Rather, we want to update DT1's CreditScore column with DT2's on those matching rows.

In data.tables, we can perform that operation while joining, by providing the expression in j, as follows:

DT1[DT2, CreditScore := i.CreditScore]
#          tract CreditScore
# 1: 36067013000         777
# 2: 36083052304         663
# 3: 36083052403         650
# 4: 36091062602         335
# 5: 36107020401         635

The DT1[DT2] part finds the matching rows in DT1 for each row in DT2. And if there's a match, we want DT2's value to be updated in DT1. We accomplish that by using i.CreditScore -- it refers to DT2's CreditScore column (i. is a prefix used to distinguish columns with identical names between the x and i data.tables).
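To make that snippet runnable on its own, here is a hypothetical reconstruction of DT1 and DT2 consistent with the printed result (the real tables come from the question):

library(data.table)

DT1 <- data.table(tract = c(36067013000, 36083052304, 36083052403,
                            36091062602, 36107020401),
                  CreditScore = c(777, NA, 650, NA, 635))
DT2 <- data.table(tract = c(36083052304, 36091062602),
                  CreditScore = c(663, 335))

setkey(DT1, tract)
DT1[DT2, CreditScore := i.CreditScore][]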


Update: As pointed out in the comments, the solution above would also overwrite the non-NA values in DT1. Therefore the way to do it would be:

DT1[is.na(CreditScore), CreditScore := DT2[.(.SD), CreditScore]]

On those rows where CreditScore from DT1 is NA, replace it with the values obtained from the join DT2[.(.SD)], where .SD corresponds to the subset of the data.table containing all the rows where CreditScore is NA.
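Equivalently, on data.table versions that support the on= argument, the same NA-only update can be written without relying on keys; a sketch, assuming both tables carry a tract column:

DT1[is.na(CreditScore), CreditScore := DT2[.SD, CreditScore, on = "tract"]]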

HTH

Update a data.table based on another data table

Here's how I would approach this. First, I will expand DT_new according to the unique x values of both tables using a binary join:

res <- setkey(DT_new, x)[unique(c(x, DT_old$x))]
res
#    x  y  v  z
# 1: b  9  2  9
# 2: c  6 NA  9
# 3: d 10 10  9
# 4: a NA NA NA

Then, I will update the two columns in res by reference using another binary join:

setkey(res, x)[DT_old, `:=`(y = i.y, v = i.v)]
res
#    x  y  v  z
# 1: a  1  1 NA
# 2: b  3  2  9
# 3: c  6  3  9
# 4: d 10 10  9

Following the comments section, it seems that you are trying to join each column by its own condition. There is no simple way of doing such a thing in R or any other language AFAIK. Thus, your own solution could be a good option by itself.

Though, here are some other alternatives, mainly taken from a similar question I asked not long ago.

Using two ifelse statements

setkey(res, x)[DT_old, `:=`(y = ifelse(is.na(y), i.y, y),
                            v = ifelse(is.na(v), i.v, v))]

Two separate conditional joins

setkey(res, x) ; setkey(DT_old, x) ## old data set needs to be keyed too now
res[is.na(y), y := DT_old[.SD, y]]
res[is.na(v), v := DT_old[.SD, v]]

Both will give you what you need.
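On newer data.table versions (1.12.2+), fcoalesce() offers a more compact way to express the same fill-only-the-NAs logic in a single join; a sketch, assuming the column types of the two tables match:

res[DT_old, on = .(x), `:=`(y = fcoalesce(y, i.y),
                            v = fcoalesce(v, i.v))]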


P.S.

If you don't want warnings, you need to define the corresponding column classes correctly; e.g., the v column in DT_new should be defined as v = c(2L, NA_integer_, 10L).
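For instance, a hypothetical DT_new declared with integer columns up front, so the join assignments above don't trigger coercion warnings:

DT_new <- data.table(x = c("b", "c", "d"),
                     y = c(9L, 6L, 10L),
                     v = c(2L, NA_integer_, 10L),
                     z = c(9L, 9L, 9L))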

Conditional keyed join/update and update a flag column for matches

The key (no pun intended), I think, was to realize that the merge was returning NA for the missed IDs, so I should add the flag to unmatched at each step, e.g., at step 1:

unmatched <- dt[.(1:(yr - 1L))
                ][!id %in% existing_ids,
                  .SD[.N], by = id][, flag1 := TRUE]
dt[year == yr, c("id", "flag1") :=
     unmatched[.SD, .(id, flag1), on = "name,surname"]]

In the end, this produces:

> dt[]
     name surname maiden_name id year flag1 flag2
 1: Carol  Clymer              3    1 FALSE FALSE
 2:   Jim   Jones              2    1 FALSE FALSE
 3:   Joe   Smith              1    1 FALSE FALSE
 4:   Joe   Smith              1    1 FALSE FALSE
 5:   Ann  Cotter              4    2    NA    NA
 6: Carol   Klein      Clymer  3    2    NA  TRUE
 7:   Joe   Smith              1    2  TRUE FALSE
 8:   Ann  Cotter              4    3  TRUE FALSE
 9:  Beth   Brown              5    3    NA    NA
10:   Joe   Smith              1    4  TRUE FALSE
11:   Joe   Smith              1    4  TRUE FALSE

One remaining problem is that some flags that should be FALSE have reset to NA; it would be nice to be able to set nomatch = FALSE, but I'm not too worried about this side effect -- the key for me is knowing when each flag is TRUE.

Conditional join in R

What you want is:

setkey(large.tbl, cd)
setkey(key.table, keyz)
key.table[large.tbl, roll = -Inf]

See the description of the roll argument in ?data.table:

Applies to the last join column, generally a date but can be any ordered variable, irregular and including gaps. If roll=TRUE and i's row matches to all but the last x join column, and its value in the last i join column falls in a gap (including after the last observation in x for that group), then the prevailing value in x is rolled forward. This operation is particularly fast using a modified binary search. The operation is also known as last observation carried forward (LOCF). Usually, there should be no duplicates in x's key, the last key column is a date (or time, or datetime) and all the columns of x's key are joined to. A common idiom is to select a contemporaneous regular time series (dts) across a set of identifiers (ids): DT[CJ(ids,dts),roll=TRUE] where DT has a 2-column key (id,date) and CJ stands for cross join. When roll is a positive number, this limits how far values are carried forward. roll=TRUE is equivalent to roll=+Inf. When roll is a negative number, values are rolled backwards; i.e., next observation carried backwards (NOCB). Use -Inf for unlimited roll back. When roll is "nearest", the nearest value is joined to.

(to be fair, I think this could use some elucidation; it's pretty dense)
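To see what roll = -Inf does here, a small self-contained sketch with hypothetical stand-ins for key.table and large.tbl:

library(data.table)

key.table <- data.table(keyz = c(10, 20, 30), label = c("a", "b", "c"))
large.tbl <- data.table(cd = c(5, 10, 12, 25, 35))

setkey(large.tbl, cd)
setkey(key.table, keyz)

## NOCB: each cd is matched to the next keyz at or above it;
## cd = 35 falls after the last keyz, so it gets NA
key.table[large.tbl, roll = -Inf]
#    keyz label
# 1:    5     a
# 2:   10     a
# 3:   12     b
# 4:   25     c
# 5:   35  <NA>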

R data.table join with inequality conditions

The solution is quite fast and straightforward using the dplyr package.

install.packages("dplyr")
library(dplyr)

newdata <- filter(data, X > 0, Y > 0, Z > 0)

dplyr is proving to be one of the easiest and fastest packages for managing data frames. Check out this great tutorial: http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html

The RStudio team have also produced a nice cheat sheet, available here: http://www.rstudio.com/resources/cheatsheets/
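For comparison, the same row filter is a one-liner in data.table syntax as well (a sketch, assuming data can be converted to a data.table):

library(data.table)
newdata <- as.data.table(data)[X > 0 & Y > 0 & Z > 0]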

R data.table binary value for last row in group by condition

Timings for reference:

library(data.table)
#data.table 1.12.3 IN DEVELOPMENT built 2019-05-12 17:04:48 UTC; root using 4 threads (see ?getDTthreads). Latest news: r-datatable.com
set.seed(0L)
nid <- 3e6L
DT <- data.table(id=rep(1L:nid, each=3L))[,
    conversion := sample(c(0L,1L), 1L, replace=TRUE), by=.(id)]
DT0 <- copy(DT)
DT1 <- copy(DT)
DT2 <- copy(DT)
DT3 <- copy(DT)

mtd0 <- function() {
    DT0[DT0[, .I[.N], by=id]$V1, lastconv := conversion]
    DT0[is.na(lastconv), lastconv := 0L]
}

mtd1 <- function() {
    DT1[DT1[, .I[.N], by=id]$V1, lastconv := conversion]
    setnafill(DT1, cols = "lastconv", fill = 0L)
}

mtd2 <- function() {
    DT2[, v := 0]
    DT2[.(DT2[conversion == 1, unique(id)]), on=.(id), mult="last", v := 1]

    #or also
    #DT2[, v := 0L][
    #    DT2[,.(cv=last(conversion)), id], on=.(id), mult="last", v := cv]
}

mtd3 <- function() {
    DT3[, lastconv := as.integer(.I == .I[.N] & conversion == 1), by = id]
}

library(microbenchmark)
microbenchmark(mtd0(), mtd1(), mtd2(), mtd3(), times=3L)

timings:

Unit: milliseconds
   expr       min        lq      mean    median        uq       max neval cld
 mtd0() 1363.1783 1416.1867 1468.9256 1469.1952 1521.7992 1574.4033     3   b
 mtd1() 1349.5333 1365.4653 1378.9350 1381.3974 1393.6358 1405.8743     3   b
 mtd2()  511.5615  515.4728  552.9133  519.3841  573.5892  627.7944     3   a
 mtd3() 3966.8867 4009.1128 4048.9607 4051.3389 4089.9977 4128.6564     3   c

