Use of .BY and .EACHI in the data.table package
.BY is a named list containing the values of the by variables.
Passing an unnamed list to main will work; however, a named list will fail (this is wholly unrelated to data.table):
plot(1, main = list(1))
# works
plot(1, main = list(s = 1))
# Error in title(...) : invalid graphics parameter
There is a recent commit to data.table 1.9.3 (April this year) which fixed a bug to do with naming in .BY: "Closes bug #5415. .BY gets names attribute set properly."
If you had more than one by variable, you would want to be able to concatenate them somehow; perhaps
iris[, plot(Sepal.Length ~ Sepal.Width, main = do.call(paste, .BY)), by = Species]
will work (unless you have a column called collapse!).
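A more robust way to build a title from several by columns, without do.call(paste, .BY) risking collision with a column named collapse, is to unlist .BY first. A small sketch with hypothetical grouping columns g1 and g2 (not from the question):

```r
library(data.table)
DT <- data.table(g1 = c("a", "a", "b"), g2 = c(1, 2, 2), v = 1:3)
# unlist(.BY) flattens the named list of group values into a vector,
# so paste()'s collapse argument cannot be captured by a column name
titles <- DT[, .(title = paste(unlist(.BY), collapse = " / ")), by = .(g1, g2)]
titles$title
# [1] "a / 1" "a / 2" "b / 2"
```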
.EACHI is completely unrelated to this. Please read the NEWS for data.table 1.9.3 for an understanding of it.
`by` and `.EACHI` in data.table
It seems that when doing a right join between two data.tables, we should use by = .EACHI in the by parameter of the join, and not use any variables from the right table (b here), as they won't be accessible in the resulting joined table. That's why by = .id in the first query doesn't work.
As noted in section 3.5.3 of http://franknarf1.github.io/r-tutorial/_book/tables.html:
Beware DT[i, on=, j, by=bycols]. Just to repeat: only by=.EACHI works in a join. Typing other by= values there will cause i's columns to become unavailable.
This query helped me understand the above statement a little better:
a[b, .SD, on = .(id)]
# id t x
# 1: 1 1 11
# 2: 1 2 12
# 3: 2 1 13
The columns from b, besides id, are not accessible in .SD for this join.
I guess that means that in a join like the one above, by must take either .EACHI or a column name from the left table (a here) that is not the join variable name (as the question above shows, id doesn't work right, even though it is in a too), because using a column name from a seems to work correctly:
a[b, sum(x), on = .(id), by = .(t)]
#    t V1
# 1: 1 24
# 2: 2 12
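The question's tables a and b are not shown above; the following definitions are a reconstruction chosen to be consistent with the printed outputs, which makes the snippets reproducible:

```r
library(data.table)
# Reconstructed inputs: a carries the join key plus t and x;
# b has one row per id, so the join returns three rows of a
a <- data.table(id = c(1, 1, 2), t = c(1, 2, 1), x = c(11, 12, 13))
b <- data.table(id = c(1, 2))
a[b, .SD, on = .(id)]                # the three rows shown above
a[b, sum(x), on = .(id), by = .(t)]  # t = 1 -> 24, t = 2 -> 12
```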
How is the R data.table `.BY` operator used?
Here's a simple example showing how .BY can be passed through to further arguments like a plot title, using the built-in mtcars data:
library(data.table)
mtcars <- as.data.table(mtcars)
layout(1:3)
mtcars[, plot(mpg, main = paste("Cylinders:", as.character(.BY))), by = cyl]
.EACHI in data.table?
I've added this to the list here. And hopefully we'll be able to deliver as planned.
The reason is most likely that by = .EACHI is a recent feature (since 1.9.4), but what it does isn't. Let me explain with an example. Suppose we have two data.tables, X and Y:
X = data.table(x = c(1,1,1,2,2,5,6), y = 1:7, key = "x")
Y = data.table(x = c(2,6), z = letters[2:1], key = "x")
We know that we can join by doing X[Y]. This is similar to a subset operation, but using a data.table (instead of integers, row names, or logical values). For each row in Y's key columns, it finds and returns the matching rows in X's key columns (plus the columns in Y).
X[Y]
# x y z
# 1: 2 4 b
# 2: 2 5 b
# 3: 6 7 a
Now let's say we'd like to get, for each row of Y's key columns (here only one key column), the count of matches in X. In versions of data.table < 1.9.4, we could do this by simply specifying .N in j as follows:
# < 1.9.4
X[Y, .N]
# x N
# 1: 2 2
# 2: 6 1
What this implicitly does is, in the presence of j, evaluate the j-expression on each matched result of X (corresponding to each row in Y). This was called by-without-by or implicit-by, because it's as if there's a hidden by.
The issue was that this would always perform a by operation. So, if we wanted to know the number of rows after a join, we'd have to do X[Y][, .N] (or simply nrow(X[Y]) in this case). That is, we couldn't have the j expression in the same call if we didn't want a by-without-by. As a result, when we did, for example, X[Y, list(z)], it evaluated list(z) using by-without-by and was therefore slightly slower.
Additionally, data.table users requested that this be made explicit - see this and this for more context. Hence by = .EACHI was added. Now, when we do:
X[Y, .N]
# [1] 3
it does what it's meant to do (avoids confusion). It returns the number of rows resulting from the join.
And,
X[Y, .N, by=.EACHI]
evaluates the j-expression on the matching rows for each row in Y (corresponding to each value from Y's key columns here). It's easier to see this using which = TRUE:
X[.(2), which=TRUE] # [1] 4 5
X[.(6), which=TRUE] # [1] 7
If we run .N for each, we should get 2 and 1:
X[Y, .N, by=.EACHI]
# x N
# 1: 2 2
# 2: 6 1
So we now have both functionalities.
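Any j-expression can be evaluated per row of Y this way, not just .N. For instance, summing X's y column over the matches, using only the X and Y defined above:

```r
library(data.table)
X = data.table(x = c(1,1,1,2,2,5,6), y = 1:7, key = "x")
Y = data.table(x = c(2,6), z = letters[2:1], key = "x")
# For each row of Y, sum X$y over the matching rows of X:
# x == 2 matches rows with y = 4, 5 (sum 9); x == 6 matches y = 7 (sum 7)
X[Y, sum(y), by = .EACHI]
```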
Row operations in data.table using `by = .I`
UPDATE:
Since data.table version 1.14.3, by = .I has been implemented to work as the OP expected, for row-wise grouping. Note that using by = .I will create a new column in the data.table called I that contains the row numbers. The row-number column can then be kept or deleted according to preference.
The following parts of this answer record an earlier version that pertains to older versions of data.table. I keep it here for reference in case someone still uses legacy versions.
Note: section (3) of this answer was updated in April 2019, due to many changes in data.table over time rendering the original version obsolete. Also, use of the argument with= has been removed from all instances, as it has since been deprecated.
1) Well, one reason not to use it, at least for the rowSums example, is performance, along with the creation of an unnecessary column. Compare to option f2 below, which is almost 4x faster and does not need the rowpos column (note that the original question used rowSums as the example function, to which this part of the answer responds; the OP edited the question afterwards to use a different function, for which part 3 of this answer is more relevant):
library(data.table)
dt <- data.table(V0 = LETTERS[c(1,1,2,2,3)], V1 = 1:5, V2 = 3:7, V3 = 5:1)
f1 <- function(dt){
  dt[, rowpos := .I]
  dt[, sdd := rowSums(.SD[, 2:4]), by = rowpos]
}
f2 <- function(dt) dt[, sdd := rowSums(.SD), .SDcols = 2:4]
library(microbenchmark)
microbenchmark(f1(dt),f2(dt))
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# f1(dt) 3.669049 3.732434 4.013946 3.793352 3.972714 5.834608 100 b
# f2(dt) 1.052702 1.085857 1.154132 1.105301 1.138658 2.825464 100 a
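As a quick sanity check (not part of the benchmark), both versions produce the same sdd column when each is run on a fresh copy of the data:

```r
library(data.table)
dt1 <- data.table(V0 = LETTERS[c(1,1,2,2,3)], V1 = 1:5, V2 = 3:7, V3 = 5:1)
dt2 <- copy(dt1)
# f1's approach: group by a rowpos column, rowSums over .SD columns 2:4
dt1[, rowpos := .I]
dt1[, sdd := rowSums(.SD[, 2:4]), by = rowpos]
# f2's approach: rowSums over .SDcols directly, no helper column
dt2[, sdd := rowSums(.SD), .SDcols = 2:4]
identical(dt1$sdd, dt2$sdd)
# [1] TRUE
```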
2) On your second question: although dt[, sdd := sum(.SD[, 2:4]), by = .I] does not work, dt[, sdd := sum(.SD[, 2:4]), by = 1:NROW(dt)] works perfectly. Given that, according to ?data.table, ".I is an integer vector equal to seq_len(nrow(x))", one might expect these to be equivalent. The difference, however, is that .I is for use in j, not in by. NB the value of .I is calculated internally in data.table, so it is not available beforehand to be passed in as a parameter value, as in by = .I.
It might also be expected that by = .I should just throw an error. But this does not occur, because loading the data.table package creates an object .I in the data.table namespace that is accessible from the global environment, and whose value is NULL. You can test this by typing .I at the command prompt. (Note that the same applies to .SD, .EACHI, .N, .GRP, and .BY.)
.I
# Error: object '.I' not found
library(data.table)
.I
# NULL
data.table::.I
# NULL
The upshot of this is that the behaviour of by = .I is equivalent to by = NULL.
3) We have already seen in part 1 that in the case of rowSums, which already loops row-wise efficiently, there are much faster ways than creating the rowpos column. But what about looping when we don't have a fast row-wise function?
Benchmarking the by = rowpos and by = 1:NROW(dt) versions against a for loop with set() is informative here. We find that looping over set in a for loop is slower than either of the methods that use data.table's by argument for looping. However, there is negligible difference in timing between the by loop that creates an additional column and the one that uses seq_len(NROW(dt)). Absent any performance difference, it seems that f.nrow is probably preferable, but only on the basis of being more concise and not creating an unnecessary column:
dt <- data.table(V0 = rep(LETTERS[c(1,1,2,2,3)], 1e3), V1=1:5, V2=3:7, V3=5:1)
f.rowpos <- function() {
dt[, rowpos := .I]
dt[, sdd := sum(.SD[, 2:4]), by = rowpos ]
}
f.nrow <- function() {
dt[, sdd := sum(.SD[, 2:4]), by = seq_len(NROW(dt)) ]
}
f.forset<- function() {
for (i in seq_len(NROW(dt))) set(dt, i, 'sdd', sum(dt[i, 2:4]))
}
microbenchmark(f.rowpos(),f.nrow(), f.forset(), times = 5)
# Unit: milliseconds
# expr min lq mean median uq max neval
# f.rowpos() 559.1115 575.3162 580.2853 578.6865 588.5532 599.7591 5
# f.nrow() 558.4327 582.4434 584.6893 587.1732 588.6689 606.7282 5
# f.forset() 1172.6560 1178.8399 1298.4842 1255.4375 1292.7393 1592.7486 5
So, in conclusion, even in situations where there is no optimised function such as rowSums that already operates by row, there are alternatives to using a rowpos column that, although not faster, don't require the creation of a redundant column.
Conditional data.table merge with .EACHI
I next want to perform a merge by ID, merging only where ValueSmall is greater than or equal to ValueBig. For the matches, I want to grab the maximum Rank value in dtBig.
setorder(dtBig, ID, ValueBig, Rank)
dtSmall[, r :=
dtBig[.SD, on=.(ID, ValueBig <= ValueSmall), mult="last", x.Rank ]
]
#     ID ValueSmall r
#  1:  A        478 4
#  2:  A        862 7
#  3:  B        439 4
#  4:  B        245 2
#  5:  C         71 1
#  6:  C        100 1
#  7:  D        317 2
#  8:  D        519 5
#  9:  E        663 5
# 10:  E        407 1
I imagine it is considerably faster to sort dtBig and take the last matching row than to compute the max by .EACHI, but I am not entirely sure. If you don't like sorting, just save the previous sort order so it can be reverted afterwards.
Is there a way to aggregate these matches using a function like max or min for these multiple matches?
For this more general problem, .EACHI works; just make sure you're doing it for each row of the target table (dtSmall in this case), so...
dtSmall[, r :=
dtBig[.SD, on=.(ID, ValueBig <= ValueSmall), max(x.Rank), by=.EACHI ]$V1
]
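To see the two approaches side by side on data we can actually run, here is a tiny hypothetical example (the ID/ValueBig/Rank values are invented, not the question's data). With dtBig-style data sorted by ID, ValueBig, Rank, mult = "last" picks the same rows that max(x.Rank) with by = .EACHI aggregates to, provided Rank increases with ValueBig within each ID:

```r
library(data.table)
small <- data.table(ID = c("A", "B"), ValueSmall = c(5, 3))
big   <- data.table(ID = c("A", "A", "B"), ValueBig = c(2, 4, 1), Rank = c(1, 2, 1))
setorder(big, ID, ValueBig, Rank)
# Approach 1: last matching row after sorting
r1 <- big[small, on = .(ID, ValueBig <= ValueSmall), mult = "last", x.Rank]
# Approach 2: aggregate the max over all matches, one group per row of small
r2 <- big[small, on = .(ID, ValueBig <= ValueSmall), max(x.Rank), by = .EACHI]$V1
identical(r1, r2)
# [1] TRUE  (both are c(2, 1) here)
```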
Joining data.table with by argument
If I understood your requirement correctly, there is a direct merge option that you can use:
dx <- data.table(a = c(1,1,2,2), b = 3:6)
dy <- data.table(a = c(1,1,2), c = 7:9)
merge(x = dx, y = dy, by = "a", all = TRUE)
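For completeness, since the key is many-to-many here, the merge expands the matches: each of the two a == 1 rows in dx pairs with both a == 1 rows in dy, giving six rows in total. A quick check, using the dx and dy defined above:

```r
library(data.table)
dx <- data.table(a = c(1,1,2,2), b = 3:6)
dy <- data.table(a = c(1,1,2), c = 7:9)
m <- merge(x = dx, y = dy, by = "a", all = TRUE)
nrow(m)
# [1] 6   (4 rows for a == 1, 2 rows for a == 2)
```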
It gives the desired output that you mentioned.
How to join (merge) data frames (inner, outer, left, right)?
I hope this clears your doubt; if not, I am sorry.
Looking up data in another data.table from j
First, the join columns should be of the same class, so we can either convert main_dt$End to integer, or convert main_dt$Start and lookup_dt$Year to numeric. I'll choose the first:
main_dt[, End := as.integer(End)]
main_dt
# Start End
# <int> <int>
# 1: 1 2
# 2: 2 2
From here, we can do a joining-assignment:
main_dt[, Amount := lookup_dt[.SD, sum(Amount), on = .(Year >= Start, Year <= End), by = .EACHI]$V1 ]
main_dt
# Start End Amount
# <int> <int> <num>
# 1: 1 2 30
# 2: 2 2 20
If you're somewhat familiar with data.table, note that the .SD referenced here is actually the contents of main_dt, so lookup_dt[.SD, ...] is effectively "main_dt left join lookup_dt". From there, the on= should look normal, and sum(Amount) is what we want to aggregate. The only new thing introduced here is the use of by = .EACHI, which can be confusing; some links for that:
- https://rdatatable.gitlab.io/data.table/reference/special-symbols.html
- https://stackoverflow.com/a/27004566/3358272
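The question's main_dt and lookup_dt are not reproduced above; one possible pair of inputs consistent with the printed results (a reconstruction, with Amount values chosen to match the output) is:

```r
library(data.table)
main_dt   <- data.table(Start = 1:2, End = c(2, 2))    # End starts out numeric
lookup_dt <- data.table(Year = 1:2, Amount = c(10, 20))
main_dt[, End := as.integer(End)]
# Non-equi join: for each main_dt row, sum Amount over Start <= Year <= End
main_dt[, Amount := lookup_dt[.SD, sum(Amount),
                              on = .(Year >= Start, Year <= End),
                              by = .EACHI]$V1]
main_dt$Amount
# [1] 30 20
```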
R data.table average if lookup using join
Using by = .EACHI, you could do something like the following:
table2[table1,
on = .(`individual id`),
.(date = i.date, mean_alpha = mean(alpha[date2 <= i.date])),
by = .EACHI]
# individual id date mean_alpha
# 1: 1 2018-01-02 1.0
# 2: 1 2018-01-03 1.0
# 3: 2 2018-01-02 1.5
# 4: 2 2018-01-03 1.5
Edit:
# Assign by reference as a new column
table1[, mean_alpha := table2[table1,
on = .(`individual id`),
mean(alpha[date2 <= i.date]),
by = .EACHI][["V1"]]]
Edit 2:
Here is a slightly more elegant way, suggested by Frank in the comment section.
# In this solution our date columns can't be type character
table1[, date := as.Date(date)]
table2[, date2 := as.Date(date2)]
table1[, mean_alpha := table2[table1, # or equivalently .SD instead of table1
on = .(`individual id`, date2 <= date),
mean(alpha),
by = .EACHI][["V1"]]]
Reproducible data
table1 <- fread(
"individual id | date
1 | 2018-01-02
1 | 2018-01-03
2 | 2018-01-02
2 | 2018-01-03",
sep ="|"
)
table2 <- fread(
"individual id | date2 | alpha
1 | 2018-01-02 | 1
1 | 2018-01-04 | 1.5
1 | 2018-01-05 | 1
2 | 2018-01-01 | 2
2 | 2018-01-02 | 1
2 | 2018-01-05 | 4",
sep = "|"
)