Using data.table i and j arguments in functions
Gavin and Josh are right. This answer is only to add more background. The idea is that not only can you pass variable column names into a function like that, but expressions of column names, using quote()
.
group = quote(car)
mtcars[, list(Total=length(mpg)), by=group][order(group)]
group Total
AMC 1
Cadillac 1
...
Toyota 2
Valiant 1
Volvo 1
Although, admitedly more difficult to start with, it can be more flexible. That's the idea, anyway. Inside functions you need substitute()
, like this :
tableOrder = function(x,.expr) {
.expr = substitute(.expr)
ans = x[,list(Total=length(mpg)),by=.expr]
setkeyv(ans, head(names(ans),-1)) # see below re feature request #1780
ans
}
tableOrder(mtcars, car)
.expr Total
AMC 1
Cadillac 1
Camaro 1
...
Toyota 2
Valiant 1
Volvo 1
tableOrder(mtcars, substring(car,1,1)) # an expression, not just a column name
.expr Total
[1,] A 1
[2,] C 3
[3,] D 3
...
[8,] P 2
[9,] T 2
[10,] V 2
tableOrder(mtcars, list(cyl,gear%%2)) # by two expressions, so head(,-1) above
cyl gear Total
[1,] 4 0 8
[2,] 4 1 3
[3,] 6 0 4
[4,] 6 1 3
[5,] 8 1 14
A new argument keyby
was added in v1.8.0 (July 2012) making it simpler :
tableOrder = function(x,.expr) {
.expr = substitute(.expr)
x[,list(Total=length(mpg)),keyby=.expr]
}
Comments and feedback in the area of i
,j
and by
variable expressions are most welcome. The other thing you can do is have a table where a column contains expressions and then look up which expression to put in i
, j
or by
from that table.
Passing multiple arguments to data.table inside a function
Or using eval
with substitute
:
library(data.table) #Win R-3.5.1 x64 data.table_1.12.2
dt_mtcars <- as.data.table(mtcars)
processFUN <- function(dt, where, select, group) {
out <- dt[i=eval(substitute(where)),
j=eval(substitute(select)),
by=eval(substitute(group))]
return(out)
}
processFUN(dt_mtcars, mpg>20, .(mean_mpg=mean(mpg), median_mpg=median(mpg)), .(cyl, gear))
Some of the earliest references that I can find are
- Aggregating sub totals and grand totals with data.table
- Using data.table i and j arguments in functions
The old faq 1.6 contains reference to this:
http://datatable.r-forge.r-project.org/datatable-faq.pdf
How do I reference a function parameter inside inside a data.table with a column of the same name?
- One possible option is this:
myfunc <- function(dt, t){
env <- environment()
dt <- dt[t==get('t',env)]
mean(dt$b)
}
- Another approach: while perhaps not strictly a "solution" to your current problem, you may find it of interest. Consider
data.table
version>= 1.14.3. In this case, we can useenv
param ofDT[i,j,by,env,...]
, to indicate the datatable column as"t"
, and the function parameter ast
. Notice that this will work on columnt
with function parametert
, even ifdt
contains columns namedcol
andval
myfunc <- function(dt, t){
dt <- dt[col==val, env=list(col="t", val=t)]
mean(dt$b)
}
In both case, usage and output is as below:
Usage
myfunc(dt = foo, t = 3)
Output:
[1] 0.1292877
Input:
set.seed(123)
foo <- data.table(t = c(1,1,2,2,3), b = rnorm(5))
foo
looks like this:
> foo
t b
1: 1 -0.56047565
2: 1 -0.23017749
3: 2 1.55870831
4: 2 0.07050839
5: 3 0.12928774
Evaluating function arguments to pass to data.table
If you expect an object to have a certain structure or hold certain data, then defining a class can really help. And with S3, it's simple.
as.relationship <- function(DT, id, fid, mid) {
out <- DT[, c(id, fid, mid), with = FALSE]
setnames(out, c("id", "fid", "mid"))
setattr(out, "class", c("relationship", class(out)))
out
}
Then you can write a function to work on that class with the safety of knowing where everything is.
f <- function(DT, id, fid, mid) {
relatives <- as.relationship(DT, id, fid, mid)
relatives[
relatives,
on = "fid",
allow.cartesian = TRUE
][
relatives,
on = "mid",
allow.cartesian = TRUE
][
,
{
siblings <- union(i.id, i.id.1)
except_self <- setdiff(siblings, .BY[["id"]])
list(siblist = list(except_self))
},
by = "id"
]
}
This function takes the column names as strings. So you'd call it like this:
f(DT, "id", "fid", "mid")
# id siblist
# 1: 1
# 2: 2
# 3: 3 4
# 4: 4 3
# 5: 12 13,14
# 6: 13 14,15,12
# 7: 14 13,15,12
# 8: 15 13,14
setnames(DT, c("A", "B", "C"))
f(DT, "A", "B", "C")
# id siblist
# 1: 1
# 2: 2
# 3: 3 4
# 4: 4 3
# 5: 12 13,14
# 6: 13 14,15,12
# 7: 14 13,15,12
# 8: 15 13,14
If you're worried about performance, don't be. If you create a data.table
from entire columns of another data.table
, they're smart enough not to actually copy the data. They share it. So there's no real performance penalty to making another object.
A simple reproducible example to pass arguments to data.table in a self-defined function in R
If we are using unquoted arguments, substitute
and eval
uate
zz <- function(data, var, group){
var <- substitute(var)
group <- substitute(group)
setnames(data[, sum(eval(var)), by = group],
c(deparse(group), deparse(var)))[]
# or use
# setnames(data[, sum(eval(var)), by = c(deparse(group))], 2, deparse(var))[]
}
zz(mtcars, mpg, gear)
# gear mpg
#1: 4 294.4
#2: 3 241.6
#3: 5 106.9
Pass variable name as argument inside data.table
Generally, quote and eval will work:
library(data.table)
plus <- function(x, y) {
x + y
}
add_one <- function(data, col) {
expr0 = quote(copy(data)[, z := plus(col, 1)][])
expr = do.call(substitute, list(expr0, list(col = substitute(col))))
cat("Evaluated expression:\n"); print(expr); cat("\n")
eval(expr)
}
set.seed(1)
library(magrittr)
data.table(x = 1:10, y = rnorm(10)) %>%
add_one(y)
which gives
Evaluated expression:
copy(data)[, `:=`(z, plus(y, 1))][]
x y z
1: 1 -0.6264538 0.3735462
2: 2 0.1836433 1.1836433
3: 3 -0.8356286 0.1643714
4: 4 1.5952808 2.5952808
5: 5 0.3295078 1.3295078
6: 6 -0.8204684 0.1795316
7: 7 0.4874291 1.4874291
8: 8 0.7383247 1.7383247
9: 9 0.5757814 1.5757814
10: 10 -0.3053884 0.6946116
data.table .. notation with functions in j
Parroting an answer to a different problem that works here as well. Not the prettiest solution, but variants on this have worked for me numerous times in the past.
Thanks @Frank for a non-parse()
solution here!
I'm well familiar with the old adage "If the answer is parse() you should usually rethink the question.", but I have a hard time coming up with alternatives many times when evaluating within the data.table
calling environment, I'd love to see a robust solution that doesn't execute arbitrary code passed in as a character string. In fact, half the reason I'm posting an answer like this is in hopes that someone can recommend a better option.
test_func3 <- function(data, var, var2) {
expr = substitute(sum(var), list(var=as.symbol(var)))
data[, eval(expr)]
}
test_func3(dt, 'x', 'y')
## [1] 5.472968
Quick disclaimer on hypothetical doomsday scenarios possible with eval(parse(...))
There are far more in depth discussions on the dangers of eval(parse(...))
, but I'll avoid repeating them in full.
Theoretically you could have issues if one of your columns is named something unfortunate like "(system(paste0('kill ',Sys.getpid())))"
(Do not execute that, it will kill your R session on the spot!). This is probably enough of an outside chance to not lose sleep over it unless you plan on putting this in a package on CRAN.
Update:
For the specific case in the comments below where the table is grouped and then sum
is applied to all, .SDcols
is potentially useful. The only way I'm aware of to make sure that this function would return consistent results even if dt
had a column named var3
is to evaluate the arguments within the function environment but outside of the data.table
environment using c()
.
set.seed(42)
dt <- data.table(
x = rnorm(10),
y = rnorm(10),
z = sample(c("a","b","c"),size = 10, replace = TRUE)
)
test_func3 <- function(data, var, var2, var3) {
ListOfColumns = c(var,var2)
GroupColumn <- c(var3)
dt[, lapply(.SD, sum), by= eval(GroupColumn), .SDcols = ListOfColumns]
}
test_func3(dt, 'x', 'y','z')
returns
z x y
1: b 1.0531555 2.121852
2: a 0.3631284 -1.388861
3: c 4.0566838 -2.367558
Passing multiple column names to by in a data.table function
Just create a character vector for by
part of data.table
, it will work:
myFun <- function(df, i, j, by){
df[get(i) == 4, .(Count = .N,
Mean = mean(get(j)),
Median = median(get(j))),
by = c(by, 'am')]
}
myFun(dt, i = 'cyl', j = 'hp', by = 'vs')
#vs am Count Mean Median
#1: 1 1 7 80.57143 66
#2: 1 0 3 84.66667 95
#3: 0 1 1 91.00000 91
Is R data.table documented to pass by reference as argument?
I think what you're being surprised about is actually R behavior, which is why it's not specifically documented in data.table
(maybe it should be anyway, as the implications are more important for data.table
).
You were surprised that the object passed to a function had the same address, but this is the same for base
R as well:
x = 1:10
address(x)
# [1] "0x7fb7d4b6c820"
(function(y) {print(address(y))})(x)
# [1] "0x7fb7d4b6c820"
What's being copied in the function environment is the pointer to x
. Moreover, for base
R, the parent x
is immutable:
foo = function(y) {
print(address(y))
y[1L] = 2L
print(address(y))
}
foo(x)
# [1] "0x7fb7d4b6c820"
# [1] "0x7fb7d4e11d28"
That is, as soon as we try to edit y
, a copy is made. This is related to reference counting -- you can see some work by Luke Tierney on this, e.g. this presentation
The difference for data.table
is that data.table
enables edit permissions for the parent object -- a double-edged sword as I think you know.
Related Topics
How to Add Rtools\Bin to the System Path in R
Get Date Difference in Years (Floating Point)
Calculating Time Difference Between Two Columns
Adding Empty Graphs to Facet_Wrap in Ggplot2
Shinydashboard Some Font Awesome Icons Not Working
How to Annotate a Reference Line at the Same Angle as the Reference Line Itself
How to Cross-Paste All Combinations of Two Vectors (Each-To-Each)
Get Connected Components Using Igraph in R
Make Sequential Numeric Column Names Prefixed with a Letter
What Evaluates to True/False in R
How to Deal with Spaces in Column Names
Ggplot2: Font Style in Label Expression
R: Sample() Command Subject to a Constraint
Can You Specify Different Geoms for Different Facets in a Ggplot
Merging a Large List of Xts Objects
R Random Forest Error - Type of Predictors in New Data Do Not Match