Data.Table Join and J-Expression Unexpected Behavior

As of data.table 1.9.3, the default behavior has changed: the j-expression is now evaluated once on the join result rather than once per row of i. To get the old by-without-by result, one now has to specify an explicit by=.EACHI:

d = data.table(a = 1:5, value = 2:6, key = "a")

d[J(3), value]
#[1] 4

d[J(3), value, by = .EACHI]
# a value
#1: 3 4

And here's a slightly more complicated example, illustrating the difference:

d = data.table(a = 1:2, b = 1:6, key = 'a')
# a b
#1: 1 1
#2: 1 3
#3: 1 5
#4: 2 2
#5: 2 4
#6: 2 6

# normal join
d[J(c(1,2)), sum(b)]
#[1] 21

# join with a by-without-by, or by-each-i
d[J(c(1,2)), sum(b), by = .EACHI]
# a V1
#1: 1 9
#2: 2 12

# and a more complicated example:
d[J(c(1,2,1)), sum(b), by = .EACHI]
# a V1
#1: 1 9
#2: 2 12
#3: 1 9
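
For contrast, the same duplicated join without by = .EACHI evaluates sum(b) just once, over all nine matched rows (9 + 12 + 9):

d[J(c(1,2,1)), sum(b)]
#[1] 30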

fast join data.table (potential bug, checking before reporting)

This is a scoping issue, similar to the one discussed in data.table FAQ 2.13 (warning: PDF). Because test contains a column named mo, when J(mo) is evaluated, it returns that entire column rather than the value of the mo found in the global environment, which it masks. (This scoping behavior is, of course, quite nice when you want to do something like test[mo < 4]!)

Try this to see what's going on:

test <- data.table(mo=1:5, b=5:1, key=c("mo", "b"))
mo <- 1
test[browser()]
Browse[1]> J(mo)
# mo
# 1: 1
# 2: 2
# 3: 3
# 4: 4
# 5: 5
# Browse[1]>

As suggested in the linked FAQ, a simple solution is to rename the indexing variable:

MO <- 1
test[J(MO)]
# mo b
# 1: 1 5

(This will also work, for reasons discussed in the documentation of i in ?data.table):

mo <- data.table(1)
test[mo]
# mo b
# 1: 1 5

Changing behavior for closure stored in data.table between R 3.4.3 and R 3.6.0

After running git bisect on the r-source, I was able to deduce that it was this commit that caused the change in behavior: https://github.com/wch/r-source/commit/adcf18b773149fa20f289f2c8f2e45e6f7b0dbfe

What fundamentally happened is that when the x values passed to approxfun were already sorted, an internal copy was no longer made. If the data had been randomly sorted, the code would have continued to work! (see snippet below)

The lesson for me is that it's probably best not to mix complicated objects with data.table, as the same environment (and memory) is reused over and over for each "by" group, or else to be very deliberate with data.table::copy.

## should be run under R >= 3.6.0 to see the disparity
library(data.table)

## original sorted x (does not work)
data <- data.table(
  group = c(rep("a", 4), rep("b", 4)),
  x = rep(c(.02, .04, .12, .21), 2),
  y = c(0.0122, 0.01231, 0.01325, 0.01374, 0.01218, 0.01229, 0.0133, 0.01379)
)

dtFuncs <- data[ , {
  print(environment())
  list(
    func = list(stats::approxfun(x, y, rule = 2))
  )
}, by = group]

f <- function(group, x) {
  dtResults <- CJ(group = group, x = x)
  dtResults <- dtResults[ , {
    .g <- group
    f2 <- dtFuncs[group == .g, func][[1]]
    list(x = x, y = f2(x))
  }, by = group]
  dtResults
}

get("y", environment(dtFuncs$func[[1]]))
get("y", environment(dtFuncs$func[[2]]))

x0 <- .07
g <- "a"
all.equal(
  with(data[group == g], approx(x, y, x0, rule = 2)$y),
  f(group = g, x = x0)$y
)

## unsorted x (works)
data <- data.table(
  group = c(rep("a", 4), rep("b", 4)),
  x = rep(c(.02, .04, .12, .21), 2),
  y = c(0.0122, 0.01231, 0.01325, 0.01374, 0.01218, 0.01229, 0.0133, 0.01379)
)
set.seed(10)
data <- data[sample(1:.N, .N)]
dtFuncs <- data[ , {
  print(environment())
  list(
    func = list(stats::approxfun(x, y, rule = 2))
  )
}, by = group]

f <- function(group, x) {
  dtResults <- CJ(group = group, x = x)
  dtResults <- dtResults[ , {
    .g <- group
    f2 <- dtFuncs[group == .g, func][[1]]
    list(x = x, y = f2(x))
  }, by = group]
  dtResults
}

get("y", environment(dtFuncs$func[[1]]))
get("y", environment(dtFuncs$func[[2]]))

x0 <- .07
g <- "a"
all.equal(
  with(data[group == g], approx(x, y, x0, rule = 2)$y),
  f(group = g, x = x0)$y
)

## better approach: maybe safer to avoid mixing objects treated by reference
## (data.table & closures) all together...
fList <- lapply(split(data, by = "group"), function(x){
  with(x, stats::approxfun(x, y, rule = 2))
})
fList
fList[[1]](.07) != fList[[2]](.07)
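
Alternatively, following the data.table::copy remark above, one could force copies of the column vectors before approxfun captures them - a minimal, untested sketch using the same data:

dtFuncs <- data[ , {
  ## copy() detaches x and y from the group buffer that data.table
  ## reuses across "by" groups, so each closure keeps its own vectors
  list(func = list(stats::approxfun(copy(x), copy(y), rule = 2)))
}, by = group]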

.EACHI in data.table?

I've added this to the list here. And hopefully we'll be able to deliver as planned.


The reason is most likely that by=.EACHI is a recent feature (since 1.9.4), but what it does isn't. Let me explain with an example. Suppose we have two data.tables X and Y:

X = data.table(x = c(1,1,1,2,2,5,6), y = 1:7, key = "x")
Y = data.table(x = c(2,6), z = letters[2:1], key = "x")

We know that we can join by doing X[Y]. This is similar to a subset operation, but using a data.table (instead of integers, row names, or logical values). For each row in Y's key columns, it finds and returns the matching rows in X's key columns (plus the remaining columns in Y).

X[Y]
# x y z
# 1: 2 4 b
# 2: 2 5 b
# 3: 6 7 a

Now let's say that, for each row of Y's key columns (here only one key column), we'd like to get the count of matches in X. In versions of data.table < 1.9.4, we could do this by simply specifying .N in j as follows:

# < 1.9.4
X[Y, .N]
# x N
# 1: 2 2
# 2: 6 1

What this implicitly does is, in the presence of j, evaluate the j-expression on each set of matching rows of X (corresponding to each row in Y). This was called by-without-by or implicit-by, because it's as if there's a hidden by.

The issue was that this would always perform a by operation. So, if we wanted to know the number of rows after a join, we'd have to do X[Y][, .N] (or simply nrow(X[Y]) in this case). That is, we couldn't have a j expression in the same call without triggering a by-without-by. As a result, when we did, for example, X[Y, list(z)], it evaluated list(z) using by-without-by and was therefore slightly slower.
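
For example, with the X and Y defined above, counting all rows after the join required chaining (or nrow):

X[Y][, .N]  # join first, then count rows once
# [1] 3
nrow(X[Y])  # equivalent
# [1] 3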

Additionally, data.table users requested that this be made explicit - see this and this for more context.

Hence by=.EACHI was added. Now, when we do:

X[Y, .N]
# [1] 3

it does what it's meant to do (avoiding confusion) - it returns the number of rows resulting from the join.

And,

X[Y, .N, by=.EACHI]

evaluates the j-expression on the matching rows for each row in Y (corresponding to each value of Y's key column here). It's easier to see this using which=TRUE:

X[.(2), which=TRUE] # [1] 4 5
X[.(6), which=TRUE] # [1] 7

If we run .N for each, then we should get 2,1.

X[Y, .N, by=.EACHI]
# x N
# 1: 2 2
# 2: 6 1

So we now have both functionalities.
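
To recap, both behaviours side by side, using the same X and Y:

X[Y, .N]              # .N evaluated once on the join result
# [1] 3
X[Y, .N, by=.EACHI]   # .N evaluated for each row of Y
# x N
# 1: 2 2
# 2: 6 1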

Using := in data.table with subset for i and a vector for update

In data.table versions <= 1.9.2, a join of the form x[i, j=...] - that is, a join where j is also used - was designed to be an implicit-by (or by-without-by) operation. In other words, it calculates j for each value in i. So it won't work as you intend.

This design choice has been changed in the current development version 1.9.3 (which'll at some point be pushed to CRAN versioned 1.9.4), for consistency, after feedback from a lot of users. You can check the discussions here, here and the feature request (FR) here.

So in 1.9.3, this will work as intended (as @BenBarnes points out). That is, by default, x[i, j=...] will first perform the join and then evaluate j once on the result, instead of evaluating j for each row of i. If you'd like the old behaviour instead, you'll have to state by explicitly, as follows:

## v 1.9.3
## performs the join and then calculates/evaluates j
x[i, j]
## explicitly state by to obtain j for each i
x[i, j, by=.EACHI]

When this version hits CRAN, there should also be a provision to use the old version (so that existing code doesn't break), with a warning that this feature will be deprecated in the next release (or something like that - how this'll be done is not finalised yet).

To summarise, your code will work as intended from versions >= 1.9.3.

Note that the .EACHI feature is not yet documented in ?data.table, as this is still a development version. When it's released to CRAN, you'll find the documentation for .EACHI in ?data.table, where all the other special symbols like .I, .N, .GRP, .BY, etc. are also documented.

HTH


Edit: If you have to do this efficiently using <= 1.9.2, then you can do it by first finding the matching indices as follows:

idx = DT[DT.alt, which=TRUE]     # row numbers of DT that match DT.alt
DT[idx, V := rnorm(length(idx))] # update only those rows, by reference

Including all permutations when using data.table[,,by=...]

I'd also go with a cross-join, but would use it in the i-slot of the original call to [.data.table:

keycols <- c("g1", "g2", "g3")                       ## Grouping columns
setkeyv(dat, keycols) ## Set dat's key
ii <- do.call(CJ, sapply(dat[, ..keycols], unique)) ## CJ() to form index
datCollapsed <- dat[ii, list(nv=.N)] ## Aggregate

## Check that it worked
nrow(datCollapsed)
# [1] 625
table(datCollapsed$nv)
#   0   1   2   3   4   5   6
# 135 191 162  82  39  13   3

This approach is referred to as a "by-without-by" and, as documented in ?data.table, it is just as efficient and fast as passing the grouping instructions in via the by argument:

Advanced: Aggregation for a subset of known groups is
particularly efficient when passing those groups in i. When
i is a data.table, DT[i,j] evaluates j for each row
of i. We call this by without by or grouping by i.
Hence, the self join DT[data.table(unique(colA)),j] is
identical to DT[,j,by=colA].
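
Note that this answer predates the 1.9.4 change described in the earlier sections: on current versions of data.table, grouping by i must be requested explicitly, so the aggregation step becomes (using the same dat and ii as above):

datCollapsed <- dat[ii, list(nv=.N), by=.EACHI] ## Aggregate per row of ii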

fuzzyjoin two data frames using data.table

To clarify terminology:

The data.table approach to your problem does not require fuzzyjoin at all [at least not in the sense of inexact matching]. Instead, you just want to join on data.table columns using the non-equal binary operators >=, >, <= and/or <. In data.table terminology these are called "non-equi joins".

You titled your question "fuzzyjoin two data frames using data.table" understandably, because you used library(fuzzyjoin) in your first working attempt. (No problem, just clarifying for readers.)

Solution using data.table non equi joins to compare date columns:

You were very close to a working data.table solution where you had:

dt_final_data <- setDT(df2)[df1,
    on = .(ID, date > start_date, date <= end_date)]

To make it work as you want, simply add a data.table j expression to select the columns you want, in the order you want them, and prefix the problem column with x. (to tell data.table to return that column from the x side of the dt_x[dt_i,] join). For example, as below, calling the column x.date:

dt_final_data <- setDT(df2)[df1,
    .(ID, f_date, ACCNUM, flmNUM, start_date, end_date, x.date, fyear, at, lt),
    on = .(ID, date > start_date, date <= end_date)]

This now gives you the output you are after:

dt_final_data
ID f_date ACCNUM flmNUM start_date end_date x.date fyear at lt
1: 50341 2002-03-08 0001104659-02-000656 2571187 2002-09-07 2003-08-30 2002-12-31 2002 190453.000 173620.000
2: 1067983 2009-11-25 0001047469-09-010426 91207220 2010-05-27 2011-05-19 2010-12-31 2010 372229.000 209295.000
3: 804753 2004-05-14 0001193125-04-088404 4805453 2004-11-13 2005-11-05 2004-12-31 2004 982.265 383.614
4: 1090727 2013-05-22 0000712515-13-000022 13865105 2013-11-21 2014-11-13 2013-12-31 2013 36212.000 29724.000
5: 1467858 2010-02-26 0001193125-10-043035 10640035 2010-08-28 2011-08-20 2010-12-31 2010 138898.000 101739.000
6: 858877 2019-01-31 0001166691-19-000005 19556540 2019-08-02 2020-07-24 <NA> NA NA NA
7: 2488 2016-02-24 0001193125-16-476010 161452982 2016-08-25 2017-08-17 2016-12-31 2016 3321.000 2905.000
8: 1478242 2004-03-12 0001193125-04-039482 4664082 2004-09-11 2005-09-03 <NA> NA NA NA
9: 1467858 2017-02-16 0001555280-17-000044 17618235 2017-08-18 2018-08-10 2017-12-31 2017 212482.000 176282.000
10: 14693 2015-10-28 0001193125-15-356351 151180619 2016-04-28 2017-04-20 2016-04-30 2015 4183.000 2621.000

As above, your result for ID=50341 now has x.date=2002-12-31. In other words, the returned date column now comes from df2.date.

You can of course rename the x.date column in your j expression:

setDT(df2)[df1,
    .(ID,
      f_date,
      ACCNUM,
      flmNUM,
      start_date,
      end_date,
      my_result_date_name = x.date,
      fyear,
      at,
      lt),
    on = .(ID, date > start_date, date <= end_date)]

Why does data.table (currently) rename columns in non-equi joins and return data from a different column:

This explanation from @ScottRitchie sums it up quite nicely:

When performing any join, only one copy of each key column is returned in the result. Currently, the column from i is returned, and labelled with the column name from x, making equi joins consistent with the behaviour of base merge().

The above makes sense if you keep in mind that before version 1.9.8, data.table didn't have non-equi joins.

Through and including the current 1.12.2 release of data.table, this (and several overlapping issues) has been the source of a lot of discussion on the data.table GitHub issues list. For example:
possible inconsistency in non-equi join, returning join columns #3437 and
SQL-like column return for non-equi and rolling joins #2706 are just 2 of many.

However, watch this GitHub issue: continuing from the above discussions, the data.table team is working to make this less confusing in some (hopefully not too distant) future version:
Both columns for rolling and non-equi joins #3093

Unexpected results when trying to left join

Since you are selecting c.name with the left join, you need to GROUP BY c.name instead of j.client_id:

SELECT result = count(j.client_id), c.name
FROM spp_job j
LEFT JOIN spp_client c ON j.client_id = c.id
GROUP BY c.name
ORDER BY result DESC

UPDATE: Since name is not unique (as @lc. said), you may need to group by both id and name for accurate results:

GROUP BY j.client_id, c.name 
ORDER BY result DESC

Find matches to several tables: conditional (full) join using data.table

Here's one way of approaching it:

# clean data
purchase[, PurchaseDate := as.IDate(PurchaseDate)]
df1[, `:=`(ValidFrom = as.IDate(ValidFrom), ValidTo = as.IDate(ValidTo))]
df2[, `:=`(ValidFrom = as.IDate(ValidFrom), ValidTo = as.IDate(ValidTo))]

# initialize
purchase[, matched := FALSE ]

# update joins
purchase[!(matched), matched :=
    df1[.SD, on=.(Name, Surname, ValidFrom <= PurchaseDate, ValidTo >= PurchaseDate),
        .N, by=.EACHI ]$N > 0L
]
purchase[!(matched), matched :=
    df2[.SD, on=.(Name, Surname, ValidFrom <= PurchaseDate, ValidTo >= PurchaseDate),
        .N, by=.EACHI ]$N > 0L
]

I'm keeping df1 and df2 separate since the OP mentioned that their join rules differ in their actual use-case.


How it works

The overall structure is...

DT[, matched := FALSE ]
DT[!(matched), matched := expr1 ]
DT[!(matched), matched := expr2 ]

so we're initializing matched to FALSE; and at each later step, updating only the still-unmatched rows, !(matched).

The expressions start with DT2[.SD, ...], which is just a join against the Subset of Data remaining after filtering with !(matched). Joins like this look up rows of .SD in DT2 according to the on= conditions. In this case, the on= conditions make it a non-equi join.***

When we use by=.EACHI we are grouping by each row of .SD. With .N, by=.EACHI, we get the number of DT2 rows matched for each row of .SD.

Once we have the number of matched rows, we can compare N > 0L to update matched.
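
Here's a minimal toy sketch of that counting pattern, with hypothetical tables lookup and keys:

library(data.table)
lookup <- data.table(k = c(1, 1, 2), v = 1:3)
keys <- data.table(k = 1:3)
lookup[keys, on=.(k), .N, by=.EACHI]  # matches per row of keys
# k N
# 1: 1 2
# 2: 2 1
# 3: 3 0

The unmatched row (k = 3) gets N = 0, which is what makes the N > 0L comparison work.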


*** Unfortunately, as of April 2017 there's an open bug in this pattern of usage that sometimes gives an error about .SD. The workaround is to replace .SD with copy(.SD).
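
With that workaround, the first update above would read:

purchase[!(matched), matched :=
    df1[copy(.SD), on=.(Name, Surname, ValidFrom <= PurchaseDate, ValidTo >= PurchaseDate),
        .N, by=.EACHI ]$N > 0L
]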


