.EACHI in data.table?
I've added this to the list here. And hopefully we'll be able to deliver as planned.
The reason is most likely that by=.EACHI
is a recent feature (since 1.9.4), but what it does isn't. Let me explain with an example. Suppose we have two data.tables X
and Y
:
X = data.table(x = c(1,1,1,2,2,5,6), y = 1:7, key = "x")
Y = data.table(x = c(2,6), z = letters[2:1], key = "x")
We know that we can join by doing X[Y]
. this is similar to a subset operation, but using data.tables
(instead of integers / row names or logical values). For each row in Y
, taking Y
's key columns, it finds and returns corresponding matching rows in X
's key columns (+ columns in Y
) .
X[Y]
# x y z
# 1: 2 4 b
# 2: 2 5 b
# 3: 6 7 a
Now let's say we'd like to, for each row from Y
's key columns (here only one key column), we'd like to get the count of matches in X
. In versions of data.table
< 1.9.4, we can do this by simply specifying .N
in j
as follows:
# < 1.9.4
X[Y, .N]
# x N
# 1: 2 2
# 2: 6 1
What this implicitly does is, in the presence of j
, evaluate the j-expression
on each matched result of X
(corresponding to the row in Y
). This was called by-without-by or implicit-by, because it's as if there's a hidden by.
The issue was that this'll always perform a by
operation. So, if we wanted to know the number of rows after a join, then we'd have to do: X[Y][ .N]
(or simply nrow(X[Y])
in this case). That is, we can't have the j
expression in the same call if we don't want a by-without-by
. As a result, when we did for example X[Y, list(z)]
, it evaluated list(z)
using by-without-by
and was therefore slightly slower.
Additionally data.table
users requested this to be explicit - see this and this for more context.
Hence by=.EACHI
was added. Now, when we do:
X[Y, .N]
# [1] 3
it does what it's meant to do (avoids confusion). It returns the number of rows resulting from the join.
And,
X[Y, .N, by=.EACHI]
evaluates j
-expression on the matching rows for each row in Y
(corresponding to value from Y
's key columns here). It'd be easier to see this by using which=TRUE
.
X[.(2), which=TRUE] # [1] 4 5
X[.(6), which=TRUE] # [1] 7
If we run .N
for each, then we should get 2,1.
X[Y, .N, by=.EACHI]
# x N
# 1: 2 2
# 2: 6 1
So we now have both functionalities. Hope this helps.
`by` and `.EACHI` in data.table
It seems that when doing a right join between two data.tables, we should use by=.EACHI
in the by
parameter of the join, and not use any variables from the right table (b
here), as they won't be accessible in the resulting joined table. Thats why by = .id
in the first query doesn't work.
As noted in section 3.5.3 here http://franknarf1.github.io/r-tutorial/_book/tables.html
Beware DT[i,on=,j,by=bycols]. Just to repeat: only by=.EACHI works in
a join. Typing other by= values there will cause i’s columns to become
unavailable
This query helped me understand the above statement a little better:
a[b, .SD, on = .(id)]
# id t x
# 1: 1 1 11
# 2: 1 2 12
# 3: 2 1 13
The columns from b
, besides id
, are not accessible in .SD
for this join.
I guess that means in a join like the above, by
must take either .EACHI
, or a column name from the left table (a
here) that is not the join variable name (as in the question above shows, id
doesn't work right, even though it is in a
too). Because using a column name from a
seems to work correctly:
a[b, sum(x), on = .(id), by = .(t)]
t V1
1: 1 24
2: 2 12
Conditional data.table merge with .EACHI
I next want to perform a merge by ID and needs to merge only where ValueSmall is greater than or equal to ValueBig. For the matches, I want to grab the max ranked value in dtBig.
setorder(dtBig, ID, ValueBig, Rank)
dtSmall[, r :=
dtBig[.SD, on=.(ID, ValueBig <= ValueSmall), mult="last", x.Rank ]
]
ID ValueSmall r
1: A 478 4
2: A 862 7
3: B 439 4
4: B 245 2
5: C 71 1
6: C 100 1
7: D 317 2
8: D 519 5
9: E 663 5
10: E 407 1
I imagine it is considerably faster to sort dtBig and take the last matching row rather than to compute the max by .EACHI, but am not entirely sure. If you don't like sorting, just save the previous sort order so it can be reverted to afterwards.
Is there a way to aggregate these matches using a function like max or min for these multiple matches?
For this more general problem, .EACHI works, just making sure you're doing it for each row of the target table (dtSmall in this case), so...
dtSmall[, r :=
dtBig[.SD, on=.(ID, ValueBig <= ValueSmall), max(x.Rank), by=.EACHI ]$V1
]
How can I access all columns of `i` when using .EACHI in data.table
Indeed, there seem to be no special symbols like x.SD
or i.SD
to specify all columns (except the ones used in by = ...
.
However, mget()
can be used as a workaround which returns the expected results:
DT2_cols <- paste0("i.", setdiff(names(DT2), key(DT2)))
DT[DT2,{print(as.matrix(.SD) %*% diag(mget(DT2_cols)))},by=.EACHI]
[,1] [,2]
[1,] 0.3891785 -0.02190195
[2,] -0.1140867 -0.03317559
[,1] [,2]
[1,] 1.850667 -0.009322052
[2,] -3.533068 0.004944318
[,1] [,2]
[1,] 0.3706735 1.4268738
[2,] -0.9229703 0.3679482
Empty data.table (0 rows) of 3 cols: A,B,C
(Note that the numbers differ from OP's because using set.seed(1L)
I do get different DT
and DT2
than posted by the OP.)
Edit: Feature request on GitHub
I just noticed that this feature already has been requested on GitHub.
Use of .BY and .EACHI in the data.table package
.BY
is a named list
containing the values of the by
variables.
Passing an unnamed list to main
will work, however a named list will fail (wholly unrelated to data.table
plot(1, main = list(1))
# works....
plot(1, main = list(s=1))
# Error in title(...) : invalid graphics parameter
There is a recent commit to data.table 1.9.3 which fixed a bug to do with naming in `.BY
Closes bug #5415. .BY gets names attribute set properly in april this year.
If you had more than 1 "by" variable, you would want to be able to concatenate some how
perhaps
iris[,plot(Sepal.Length~Sepal.Width,main=do.call(paste,.BY)),by=Species]
will work (unless you have a column called collapse
!)
EACHI
is completely unrelated to this. Please read the NEWS for data.table 1.9.3 for an understanding of this.
Can someone explain how mult works in data.table when it performs update in joins (using .EACHI and mult)
I'd like to use the first value in the right table to override the value of the left table
Select the first values and update with them alone:
X[unique(Y, by="xx", fromLast=FALSE), on=.(x=xx), y := i.y]
x y t
1: a 6 0
2: a 6 1
3: b NA 2
4: c 3 3
5: d 2 4
fromLast=
can select the first or last row when dropping dupes.
How multiple matches are handled:
In x[i, mult=]
, if a row of i
has multiple matches, mult
determines which matching row(s) of x
are selected. This explains the results shown in the OP.
In x[i, v := i.v]
, if multiple rows of i
match to the same row in x
, all of the relevant i-rows write to the x-row sequentially, so the last i-row gets the final write. Turn on verbose output to see how many edits are made in an update -- it will exceed the number of x rows in this case (because the rows are edited repeatedly):
options(datatable.verbose=TRUE)
data.table(a=1,b=2)[.(a=1, b=3:4), on=.(a), b := i.b][]
# Assigning to 2 row subset of 1 rows
a b
1: 1 4
R data.table average if lookup using join
Using by = .EACHI
you could do something like the following:
table2[table1,
on = .(`individual id`),
.(date = i.date, mean_alpha = mean(alpha[date2 <= i.date])),
by = .EACHI]
# individual id date mean_alpha
# 1: 1 2018-01-02 1.0
# 2: 1 2018-01-03 1.0
# 3: 2 2018-01-02 1.5
# 4: 2 2018-01-03 1.5
Edit:
# Assign by reference as a new column
table1[, mean_alpha := table2[table1,
on = .(`individual id`),
mean(alpha[date2 <= i.date]),
by = .EACHI][["V1"]]]
Edit 2:
Here is slightly more elegant way suggested by Frank in the comment section.
# In this solution our date columns can't be type character
table1[, date := as.Date(date)]
table2[, date2 := as.Date(date2)]
table1[, mean_alpha := table2[table1, # or equivalently .SD instead of table1
on = .(`individual id`, date2 <= date),
mean(alpha),
by = .EACHI][["V1"]]]
Reproducible data
table1 <- fread(
"individual id | date
1 | 2018-01-02
1 | 2018-01-03
2 | 2018-01-02
2 | 2018-01-03",
sep ="|"
)
table2 <- fread(
"individual id | date2 | alpha
1 | 2018-01-02 | 1
1 | 2018-01-04 | 1.5
1 | 2018-01-05 | 1
2 | 2018-01-01 | 2
2 | 2018-01-02 | 1
2 | 2018-01-05 | 4",
sep = "|"
)
Combining by and on to join and create summary column for data.table
I would do
mytesta[, v := mytestb[.SD, on=.(groupa), sum(weighta), by=.EACHI]$V1 ]
In a X[Y]
join, we're looking up each row of Y
in X
.
So if the end goal is to create a new column in Y
computed per row, we'll need a join Y[, v := X[Y, ...]]
even though Y[X, v := ...]
might seem more intuitive at first.
Related Topics
How to Match Fuzzy Match Strings from Two Datasets
Reasons For Using the Set.Seed Function
Filtering a Data Frame on a Vector
Understanding the Order() Function
How to Assign a Unique Id Number to Each Group of Identical Values in a Column
Standard Evaluation in Dplyr: Summarise a Variable Given as a Character String
How to Convert a Table to a Data Frame
Clang-7: Error: Linker Command Failed With Exit Code 1 For Macos Big Sur
Ggplot, Facet, Piechart: Placing Text in the Middle of Pie Chart Slices
Change Variable Name in For Loop Using R
Select Groups With More Than One Distinct Value
Simpler Population Pyramid in Ggplot2
Workflow For Statistical Analysis and Report Writing
Overlay Histogram With Density Curve