Use of .BY and .EACHI in the data.table package

.BY is a named list containing the values of the by variables.
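
For example (a minimal sketch with made-up data, showing that .BY carries one named value per grouping variable):

library(data.table)
DT <- data.table(g = c("a", "a", "b"), v = 1:3)
DT[, { cat("group value:", .BY$g, "\n"); sum(v) }, by = g]
# group value: a
# group value: b
#    g V1
# 1: a  3
# 2: b  3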

Passing an unnamed list to `main` will work; however, a named list will fail (this is wholly unrelated to data.table):

plot(1, main = list(1))
# works....
plot(1, main = list(s=1))
# Error in title(...) : invalid graphics parameter

There is a recent commit to data.table 1.9.3 (from April this year) which fixed a bug to do with naming in `.BY`: "Closes bug #5415. .BY gets names attribute set properly."

If you had more than one "by" variable, you would want to be able to concatenate them somehow;

perhaps

iris <- as.data.table(iris)
iris[, plot(Sepal.Length ~ Sepal.Width, main = do.call(paste, .BY)), by = Species]

will work (unless you have a column called collapse!)
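
If you're worried about that edge case, a safer variant (my own sketch, not from the original answer) unlists .BY first, so no grouping column can be captured as paste's sep or collapse argument:

iris[, plot(Sepal.Length ~ Sepal.Width,
            main = paste(unlist(.BY), collapse = " ")),
     by = Species]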

.EACHI is completely unrelated to this. Please read the NEWS for data.table 1.9.3 for an understanding of it.

`by` and `.EACHI` in data.table

It seems that when doing a right join between two data.tables, we should use by = .EACHI in the by parameter of the join, and not use any variables from the right table (b here), as they won't be accessible in the resulting joined table. That's why by = .id in the first query doesn't work.

As noted in section 3.5.3 here: http://franknarf1.github.io/r-tutorial/_book/tables.html

Beware DT[i,on=,j,by=bycols]. Just to repeat: only by=.EACHI works in
a join. Typing other by= values there will cause i’s columns to become
unavailable

This query helped me understand the above statement a little better:

a[b, .SD, on = .(id)]
# id t x
# 1: 1 1 11
# 2: 1 2 12
# 3: 2 1 13

The columns from b, besides id, are not accessible in .SD for this join.

I guess that means that in a join like the above, by must be either .EACHI or a column name from the left table (a here) that is not the join variable name (as the question above shows, id doesn't work right, even though it is in a too). Using a column name from a seems to work correctly:

a[b, sum(x), on = .(id), by = .(t)]
#    t V1
# 1: 1 24
# 2: 2 12
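
For contrast, here is the same join with by = .EACHI, using inputs I reconstructed to be consistent with the outputs shown above (the exact definitions of a and b are my assumption):

library(data.table)
a <- data.table(id = c(1, 1, 2), t = c(1, 2, 1), x = c(11, 12, 13))
b <- data.table(id = c(1, 2))

a[b, sum(x), on = .(id), by = .EACHI]  # groups by each row of b
#    id V1
# 1:  1 23
# 2:  2 13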

How is the R data.table `.BY` operator used?

Here's a simple example showing how .BY can be passed through to further arguments like a plot title. Using the built-in mtcars data:

library(data.table)
mtcars <- as.data.table(mtcars)
layout(1:3)
mtcars[, plot(mpg, main = paste("Cylinders:", as.character(.BY))), by = cyl]


.EACHI in data.table?

I've added this to the list here, and hopefully we'll be able to deliver as planned.


The reason is most likely that by=.EACHI is a recent feature (since 1.9.4), but what it does isn't. Let me explain with an example. Suppose we have two data.tables X and Y:

X = data.table(x = c(1,1,1,2,2,5,6), y = 1:7, key = "x")
Y = data.table(x = c(2,6), z = letters[2:1], key = "x")

We know that we can join by doing X[Y]. This is similar to a subset operation, but using a data.table (instead of integers / row names or logical values). For each row in Y's key columns, it finds and returns the corresponding matching rows in X's key columns (plus the remaining columns of Y).

X[Y]
# x y z
# 1: 2 4 b
# 2: 2 5 b
# 3: 6 7 a

Now let's say we'd like to get, for each row of Y's key columns (here only one key column), the count of matches in X. In versions of data.table < 1.9.4, we can do this by simply specifying .N in j as follows:

# < 1.9.4
X[Y, .N]
# x N
# 1: 2 2
# 2: 6 1

What this does implicitly is, in the presence of j, evaluate the j-expression on each matched result of X (corresponding to each row in Y). This was called by-without-by or implicit-by, because it's as if there's a hidden by.

The issue was that this would always perform a by operation. So if we wanted to know the number of rows after a join, we'd have to do X[Y][, .N] (or simply nrow(X[Y]) in this case). That is, we couldn't have a j expression in the same call without triggering a by-without-by. As a result, when we did, for example, X[Y, list(z)], it evaluated list(z) using by-without-by and was therefore slightly slower.
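
For example, with X and Y as defined above, both of these count the rows of the join result:

X[Y][, .N]   # join first, then count rows in a chained call
# [1] 3
nrow(X[Y])   # equivalent and simpler here
# [1] 3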

Additionally, data.table users requested this to be made explicit; see this and this for more context.

Hence by=.EACHI was added. Now, when we do:

X[Y, .N]
# [1] 3

it does what it's meant to do (and avoids confusion): it returns the number of rows resulting from the join.

And,

X[Y, .N, by=.EACHI]

evaluates the j-expression on the matching rows for each row in Y (corresponding to each value of Y's key columns here). It is easier to see this using which = TRUE:

X[.(2), which=TRUE] # [1] 4 5
X[.(6), which=TRUE] # [1] 7

If we run .N for each, then we should get 2 and 1:

X[Y, .N, by=.EACHI]
# x N
# 1: 2 2
# 2: 6 1

So we now have both functionalities.

Row operations in data.table using `by = .I`

UPDATE:

Since data.table version 1.14.3, by = .I has been implemented to work as the OP expected, for row-wise grouping. Note that using by = .I will create a new column in the result called I that contains the row numbers. The row-number column can then be kept or deleted according to preference.
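
For example (a sketch assuming data.table >= 1.14.3, using the same dt as in the benchmarks below):

library(data.table)
dt <- data.table(V0 = LETTERS[c(1,1,2,2,3)], V1 = 1:5, V2 = 3:7, V3 = 5:1)
dt[, .(sdd = sum(.SD)), by = .I, .SDcols = 2:4]  # row-wise grouping
#    I sdd
# 1: 1   9
# 2: 2  10
# 3: 3  11
# 4: 4  12
# 5: 5  13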

The following parts of this answer record an earlier version that pertains to older versions of data.table. I keep it here for reference in case someone is still using a legacy version.


Note: section (3) of this answer was updated in April 2019, due to many changes in data.table over time rendering the original version obsolete. Also, use of the argument with= has been removed from all instances, as it has since been deprecated.

1) Well, one reason not to use it, at least for the rowSums example, is performance and the creation of an unnecessary column. Compare option f2 below, which is almost 4x faster and does not need the rowpos column. (Note that the original question used rowSums as the example function, to which this part of the answer responds; the OP afterwards edited the question to use a different function, for which part 3 of this answer is more relevant.)

dt <- data.table(V0 = LETTERS[c(1,1,2,2,3)], V1 = 1:5, V2 = 3:7, V3 = 5:1)
f1 <- function(dt) {
  dt[, rowpos := .I]
  dt[, sdd := rowSums(.SD[, 2:4]), by = rowpos]
}
f2 <- function(dt) dt[, sdd := rowSums(.SD), .SDcols = 2:4]

library(microbenchmark)
microbenchmark(f1(dt), f2(dt))
# Unit: milliseconds
#    expr      min       lq     mean   median       uq      max neval cld
#  f1(dt) 3.669049 3.732434 4.013946 3.793352 3.972714 5.834608   100   b
#  f2(dt) 1.052702 1.085857 1.154132 1.105301 1.138658 2.825464   100   a

2) On your second question: although dt[, sdd := sum(.SD[, 2:4]), by = .I] does not work, dt[, sdd := sum(.SD[, 2:4]), by = 1:NROW(dt)] works perfectly. Given that, according to ?data.table, ".I is an integer vector equal to seq_len(nrow(x))", one might expect these to be equivalent. The difference, however, is that .I is for use in j, not in by. NB the value of .I is calculated internally in data.table, so it is not available beforehand to be passed in as a parameter value, as in by = .I.

It might also be expected that by = .I would simply throw an error. But it does not, because loading the data.table package creates an object .I in the data.table namespace that is accessible from the global environment, and whose value is NULL. You can test this by typing .I at the command prompt. (Note that the same applies to .SD, .EACHI, .N, .GRP, and .BY.)

.I
# Error: object '.I' not found
library(data.table)
.I
# NULL
data.table::.I
# NULL

The upshot of this is that the behaviour of by = .I is equivalent to by = NULL.
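
To illustrate on a legacy version (pre the by = .I feature described in the update above):

dt[, sum(V1), by = .I]  # .I is NULL here, so this is by = NULL: no grouping
# [1] 15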

3) We have already seen in part 1 that, in the case of rowSums, which already loops row-wise efficiently, there are much faster ways than creating the rowpos column. But what about looping when we don't have a fast row-wise function?

Benchmarking the by = rowpos and by = 1:NROW(dt) versions against a for loop with set() is informative here. We find that looping over set() in a for loop is slower than either of the methods that use data.table's by argument for looping. However, there is a negligible difference in timing between the by loop that creates an additional column and the one that uses seq_len(NROW(dt)). Absent any performance difference, it seems that f.nrow is probably preferable, but only on the basis of being more concise and not creating an unnecessary column:

dt <- data.table(V0 = rep(LETTERS[c(1,1,2,2,3)], 1e3), V1 = 1:5, V2 = 3:7, V3 = 5:1)

f.rowpos <- function() {
  dt[, rowpos := .I]
  dt[, sdd := sum(.SD[, 2:4]), by = rowpos]
}

f.nrow <- function() {
  dt[, sdd := sum(.SD[, 2:4]), by = seq_len(NROW(dt))]
}

f.forset <- function() {
  for (i in seq_len(NROW(dt))) set(dt, i, 'sdd', sum(dt[i, 2:4]))
}

microbenchmark(f.rowpos(), f.nrow(), f.forset(), times = 5)
# Unit: milliseconds
#        expr       min        lq      mean    median        uq       max neval
#  f.rowpos()  559.1115  575.3162  580.2853  578.6865  588.5532  599.7591     5
#    f.nrow()  558.4327  582.4434  584.6893  587.1732  588.6689  606.7282     5
#  f.forset() 1172.6560 1178.8399 1298.4842 1255.4375 1292.7393 1592.7486     5

So, in conclusion: even in situations where there is no optimised function such as rowSums that already operates by row, there are alternatives to a rowpos column that, although not faster, don't require the creation of a redundant column.

Conditional data.table merge with .EACHI

I next want to perform a merge by ID, and need to merge only where ValueSmall is greater than or equal to ValueBig. For the matches, I want to grab the max ranked value in dtBig.

setorder(dtBig, ID, ValueBig, Rank)
dtSmall[, r :=
  dtBig[.SD, on = .(ID, ValueBig <= ValueSmall), mult = "last", x.Rank]
]

#     ID ValueSmall r
#  1:  A        478 4
#  2:  A        862 7
#  3:  B        439 4
#  4:  B        245 2
#  5:  C         71 1
#  6:  C        100 1
#  7:  D        317 2
#  8:  D        519 5
#  9:  E        663 5
# 10:  E        407 1

I imagine it is considerably faster to sort dtBig and take the last matching row than to compute the max by .EACHI, but I am not entirely sure. If you don't like sorting, just save the previous sort order so it can be restored afterwards, as in the sketch below.
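
A sketch of that save-and-restore pattern (the column name old_order is my own choice):

dtBig[, old_order := .I]             # remember the incoming row order
setorder(dtBig, ID, ValueBig, Rank)  # sort for the mult = "last" trick
# ... run the join shown above ...
setorder(dtBig, old_order)           # restore the original order
dtBig[, old_order := NULL]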


Is there a way to aggregate these matches using a function like max or min for these multiple matches?

For this more general problem, .EACHI works; just make sure you're doing it for each row of the target table (dtSmall in this case), so:

dtSmall[, r :=
  dtBig[.SD, on = .(ID, ValueBig <= ValueSmall), max(x.Rank), by = .EACHI]$V1
]

Joining data.table with by argument

If I understood your requirement correctly, there is a direct merge option that you can use:

dx <- data.table(a = c(1,1,2,2), b = 3:6)
dy <- data.table(a = c(1,1,2), c = 7:9)
merge(x = dx, y = dy, by = "a", all = TRUE)
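
With those inputs, every row of dx with a = 1 pairs with every dy row with a = 1, and likewise for a = 2, so the merged result has six rows (my expected output; the exact row order may vary slightly by version):

#    a b c
# 1: 1 3 7
# 2: 1 3 8
# 3: 1 4 7
# 4: 1 4 8
# 5: 2 5 9
# 6: 2 6 9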

It gives the desired output that you mentioned. For more background, see: How to join (merge) data frames (inner, outer, left, right)?

I hope it clears your doubt; if not, I am sorry.

Looking up data in another data.table from j

First, the join columns should be of the same class, so we can either convert main_dt$End to integer, or main_dt$Start and lookup_dt$Year to numeric. I'll choose the first:

main_dt[, End := as.integer(End)]
main_dt
# Start End
# <int> <int>
# 1: 1 2
# 2: 2 2

From here, we can do a joining-assignment:

main_dt[, Amount := lookup_dt[.SD, sum(Amount), on = .(Year >= Start, Year <= End), by = .EACHI]$V1 ]
main_dt
# Start End Amount
# <int> <int> <num>
# 1: 1 2 30
# 2: 2 2 20

If you're somewhat familiar with data.table, note that the .SD referenced here is actually the contents of main_dt, so lookup_dt[.SD, ...] is effectively "main_dt left join lookup_dt". From there, the on= should look normal, and sum(Amount) is what you want to aggregate. The only new thing introduced here is the use of by = .EACHI, which can be confusing; some links for that:

  • https://rdatatable.gitlab.io/data.table/reference/special-symbols.html
  • https://stackoverflow.com/a/27004566/3358272
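
For reference, here are inputs reconstructed to be consistent with the outputs above (an assumption on my part, since the question's data isn't shown):

library(data.table)
main_dt   <- data.table(Start = c(1L, 2L), End = c(2, 2))  # End starts as numeric
lookup_dt <- data.table(Year = 1:2, Amount = c(10, 20))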

R data.table average if lookup using join

Using by = .EACHI you could do something like the following:

table2[table1,
       on = .(`individual id`),
       .(date = i.date, mean_alpha = mean(alpha[date2 <= i.date])),
       by = .EACHI]

#    individual id       date mean_alpha
# 1:             1 2018-01-02        1.0
# 2:             1 2018-01-03        1.0
# 3:             2 2018-01-02        1.5
# 4:             2 2018-01-03        1.5

Edit:

# Assign by reference as a new column
table1[, mean_alpha := table2[table1,
                              on = .(`individual id`),
                              mean(alpha[date2 <= i.date]),
                              by = .EACHI][["V1"]]]

Edit 2:

Here is a slightly more elegant way, suggested by Frank in the comment section.

# In this solution our date columns can't be of type character
table1[, date := as.Date(date)]
table2[, date2 := as.Date(date2)]

table1[, mean_alpha := table2[table1, # or equivalently .SD instead of table1
                              on = .(`individual id`, date2 <= date),
                              mean(alpha),
                              by = .EACHI][["V1"]]]

Reproducible data

table1 <- fread(
  "individual id | date
1 | 2018-01-02
1 | 2018-01-03
2 | 2018-01-02
2 | 2018-01-03",
  sep = "|"
)
table2 <- fread(
  "individual id | date2 | alpha
1 | 2018-01-02 | 1
1 | 2018-01-04 | 1.5
1 | 2018-01-05 | 1
2 | 2018-01-01 | 2
2 | 2018-01-02 | 1
2 | 2018-01-05 | 4",
  sep = "|"
)

