.Eachi in Data.Table

.EACHI in data.table?

I've added this to the list here. And hopefully we'll be able to deliver as planned.

The reason is most likely that by=.EACHI is a recent feature (since 1.9.4), but what it does isn't. Let me explain with an example. Suppose we have two data.tables X and Y:

X = data.table(x = c(1,1,1,2,2,5,6), y = 1:7, key = "x")
Y = data.table(x = c(2,6), z = letters[2:1], key = "x")

We know that we can join by doing X[Y]. this is similar to a subset operation, but using data.tables (instead of integers / row names or logical values). For each row in Y, taking Y's key columns, it finds and returns corresponding matching rows in X's key columns (+ columns in Y) .

X[Y]
#    x y z
# 1: 2 4 b
# 2: 2 5 b
# 3: 6 7 a

Now let's say we'd like to, for each row from Y's key columns (here only one key column), we'd like to get the count of matches in X. In versions of data.table < 1.9.4, we can do this by simply specifying .N in j as follows:

# < 1.9.4
X[Y, .N]
#    x N
# 1: 2 2
# 2: 6 1

What this implicitly does is, in the presence of j, evaluate the j-expression on each matched result of X (corresponding to the row in Y). This was called by-without-by or implicit-by, because it's as if there's a hidden by.

The issue was that this'll always perform a by operation. So, if we wanted to know the number of rows after a join, then we'd have to do: X[Y][ .N] (or simply nrow(X[Y]) in this case). That is, we can't have the j expression in the same call if we don't want a by-without-by. As a result, when we did for example X[Y, list(z)], it evaluated list(z) using by-without-by and was therefore slightly slower.

Additionally data.table users requested this to be explicit - see this and this for more context.

Hence by=.EACHI was added. Now, when we do:

X[Y, .N]
# [1] 3

it does what it's meant to do (avoids confusion). It returns the number of rows resulting from the join.

And,

X[Y, .N, by=.EACHI]

evaluates j-expression on the matching rows for each row in Y (corresponding to value from Y's key columns here). It'd be easier to see this by using which=TRUE.

X[.(2), which=TRUE] # [1] 4 5
X[.(6), which=TRUE] # [1] 7

If we run .N for each, then we should get 2,1.

X[Y, .N, by=.EACHI]
#    x N
# 1: 2 2
# 2: 6 1

So we now have both functionalities. Hope this helps.

`by` and `.EACHI` in data.table

It seems that when doing a right join between two data.tables, we should use by=.EACHI in the by parameter of the join, and not use any variables from the right table (b here), as they won't be accessible in the resulting joined table. Thats why by = .id in the first query doesn't work.

As noted in section 3.5.3 here http://franknarf1.github.io/r-tutorial/_book/tables.html

Beware DT[i,on=,j,by=bycols]. Just to repeat: only by=.EACHI works in
a join. Typing other by= values there will cause i’s columns to become
unavailable

This query helped me understand the above statement a little better:

a[b, .SD, on = .(id)]
# id t  x
# 1:  1 1 11
# 2:  1 2 12
# 3:  2 1 13

The columns from b, besides id, are not accessible in .SD for this join.

I guess that means in a join like the above, by must take either .EACHI, or a column name from the left table (a here) that is not the join variable name (as in the question above shows, id doesn't work right, even though it is in a too). Because using a column name from a seems to work correctly:

a[b, sum(x), on = .(id), by = .(t)]
   t V1
1: 1 24
2: 2 12

Conditional data.table merge with .EACHI

I next want to perform a merge by ID and needs to merge only where ValueSmall is greater than or equal to ValueBig. For the matches, I want to grab the max ranked value in dtBig.

setorder(dtBig, ID, ValueBig, Rank)
dtSmall[, r :=
  dtBig[.SD, on=.(ID, ValueBig <= ValueSmall), mult="last", x.Rank ]
]

    ID ValueSmall r
 1:  A        478 4
 2:  A        862 7
 3:  B        439 4
 4:  B        245 2
 5:  C         71 1
 6:  C        100 1
 7:  D        317 2
 8:  D        519 5
 9:  E        663 5
10:  E        407 1

I imagine it is considerably faster to sort dtBig and take the last matching row rather than to compute the max by .EACHI, but am not entirely sure. If you don't like sorting, just save the previous sort order so it can be reverted to afterwards.

Is there a way to aggregate these matches using a function like max or min for these multiple matches?

For this more general problem, .EACHI works, just making sure you're doing it for each row of the target table (dtSmall in this case), so...

dtSmall[, r :=
  dtBig[.SD, on=.(ID, ValueBig <= ValueSmall), max(x.Rank), by=.EACHI ]$V1
]

How can I access all columns of `i` when using .EACHI in data.table

Indeed, there seem to be no special symbols like x.SD or i.SD to specify all columns (except the ones used in by = ....

However, mget() can be used as a workaround which returns the expected results:

DT2_cols <- paste0("i.", setdiff(names(DT2), key(DT2)))
DT[DT2,{print(as.matrix(.SD) %*% diag(mget(DT2_cols)))},by=.EACHI]

           [,1]        [,2]
[1,]  0.3891785 -0.02190195
[2,] -0.1140867 -0.03317559
          [,1]         [,2]
[1,]  1.850667 -0.009322052
[2,] -3.533068  0.004944318
           [,1]      [,2]
[1,]  0.3706735 1.4268738
[2,] -0.9229703 0.3679482
Empty data.table (0 rows) of 3 cols: A,B,C

(Note that the numbers differ from OP's because using set.seed(1L) I do get different DT and DT2 than posted by the OP.)

Edit: Feature request on GitHub

I just noticed that this feature already has been requested on GitHub.

Use of .BY and .EACHI in the data.table package

.BY is a named list containing the values of the by variables.

Passing an unnamed list to main will work, however a named list will fail (wholly unrelated to data.table

plot(1, main = list(1))
# works....
plot(1, main = list(s=1))
# Error in title(...) : invalid graphics parameter

There is a recent commit to data.table 1.9.3 which fixed a bug to do with naming in `.BY
Closes bug #5415. .BY gets names attribute set properly in april this year.

If you had more than 1 "by" variable, you would want to be able to concatenate some how

perhaps

iris[,plot(Sepal.Length~Sepal.Width,main=do.call(paste,.BY)),by=Species]

will work (unless you have a column called collapse!)

EACHI is completely unrelated to this. Please read the NEWS for data.table 1.9.3 for an understanding of this.

Can someone explain how mult works in data.table when it performs update in joins (using .EACHI and mult)

I'd like to use the first value in the right table to override the value of the left table

Select the first values and update with them alone:

X[unique(Y, by="xx", fromLast=FALSE), on=.(x=xx), y := i.y]

   x  y t
1: a  6 0
2: a  6 1
3: b NA 2
4: c  3 3
5: d  2 4

fromLast= can select the first or last row when dropping dupes.

How multiple matches are handled:

In x[i, mult=], if a row of i has multiple matches, mult determines which matching row(s) of x are selected. This explains the results shown in the OP.

In x[i, v := i.v], if multiple rows of i match to the same row in x, all of the relevant i-rows write to the x-row sequentially, so the last i-row gets the final write. Turn on verbose output to see how many edits are made in an update -- it will exceed the number of x rows in this case (because the rows are edited repeatedly):

options(datatable.verbose=TRUE)
data.table(a=1,b=2)[.(a=1, b=3:4), on=.(a), b := i.b][]
# Assigning to 2 row subset of 1 rows
   a b
1: 1 4

R data.table average if lookup using join

Using by = .EACHI you could do something like the following:

table2[table1, 
       on = .(`individual id`), 
       .(date = i.date, mean_alpha = mean(alpha[date2 <= i.date])),
       by = .EACHI]

#    individual id       date mean_alpha
# 1:             1 2018-01-02        1.0
# 2:             1 2018-01-03        1.0
# 3:             2 2018-01-02        1.5
# 4:             2 2018-01-03        1.5

Edit:

# Assign by reference as a new column
table1[, mean_alpha := table2[table1, 
                              on = .(`individual id`), 
                              mean(alpha[date2 <= i.date]),
                              by = .EACHI][["V1"]]]

Edit 2:

Here is slightly more elegant way suggested by Frank in the comment section.

# In this solution our date columns can't be type character
table1[, date := as.Date(date)]
table2[, date2 := as.Date(date2)]

table1[, mean_alpha := table2[table1, # or equivalently .SD instead of table1
                              on = .(`individual id`, date2 <= date), 
                              mean(alpha), 
                              by = .EACHI][["V1"]]]

Reproducible data

table1 <- fread(
  "individual id | date       
   1             |  2018-01-02
   1             |  2018-01-03
   2             |  2018-01-02
   2             |  2018-01-03", 
  sep ="|"
)
table2 <- fread(
  "individual id | date2       | alpha
   1             |  2018-01-02 |  1     
   1             |  2018-01-04 |  1.5 
   1             |  2018-01-05 |  1   
   2             |  2018-01-01 |  2     
   2             |  2018-01-02 |  1   
   2             |  2018-01-05 |  4",
  sep = "|"
)

Combining by and on to join and create summary column for data.table

I would do

mytesta[, v := mytestb[.SD, on=.(groupa), sum(weighta), by=.EACHI]$V1 ]

In a X[Y] join, we're looking up each row of Y in X.

So if the end goal is to create a new column in Y computed per row, we'll need a join Y[, v := X[Y, ...]] even though Y[X, v := ...] might seem more intuitive at first.

.Eachi in Data.Table