create an expression from a function for data.table to eval
One solution is to put the list(...) within the function output. I tend to use as.quoted, stealing from the way @hadley implements .() in the plyr package.
library(data.table)
library(plyr)
dat <- data.table(x_one=1:10, x_two=1:10, y_one=1:10, y_two=1:10)
myfun <- function(name) {
  one <- paste0(name, '_one')
  two <- paste0(name, '_two')
  out <- paste0(name, '_out')
  as.quoted(paste('list(', out, '=', one, '-', two, ')'))[[1]]
}
dat[, eval(myfun('x')),]
# x_out
# 1: 0
# 2: 0
# 3: 0
# 4: 0
# 5: 0
# 6: 0
# 7: 0
# 8: 0
# 9: 0
#10: 0
To do two columns at once you can adjust your call:
myfun <- function(name) {
  one <- paste0(name, '_one')
  two <- paste0(name, '_two')
  out <- paste0(name, '_out')
  calls <- paste(paste(out, '=', one, '-', two), collapse = ',')
  as.quoted(paste('list(', calls, ')'))[[1]]
}
dat[, eval(myfun(c('x','y'))),]
# x_out y_out
# 1: 0 0
# 2: 0 0
# 3: 0 0
# 4: 0 0
# 5: 0 0
# 6: 0 0
# 7: 0 0
# 8: 0 0
# 9: 0 0
#10: 0 0
As for the reason: in this solution the entire call to list(...) is evaluated within the environment of the data.table. The relevant code within [.data.table is:
if (missing(j)) stop("logical error, j missing")
jsub = substitute(j)
if (is.null(jsub)) return(NULL)
jsubl = as.list.default(jsub)
if (identical(jsubl[[1L]], quote(eval))) {
  jsub = eval(jsubl[[2L]], parent.frame())
  if (is.expression(jsub)) jsub = jsub[[1L]]
}
If, in your case,
j = list(xout = eval(myfun('x')))
then
jsub <- substitute(j)
is
# list(xout = eval(myfun("x")))
and
as.list.default(jsub)
## [[1]]
## list
##
## $xout
## eval(myfun("x"))
so jsubl[[1L]] is list and jsubl[[2L]] is eval(myfun("x")): data.table has not found a call to eval at the top level of j, and so will not deal with it appropriately.
This will work, forcing the second evaluation within the correct environment, the data.table itself:
# using OP myfun
dat[, list(xout = eval(myfun('x'), dat))]
In the same way,
eval(parse(text = 'x_one'), dat)
# [1] 1 2 3 4 5 6 7 8 9 10
works, but
eval(eval(parse(text = 'x_one')), dat)
does not.
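A minimal self-contained sketch of that contrast (using a small stand-in table, since only the idea matters here):

```r
library(data.table)
dat <- data.table(x_one = 1:5, x_two = 1:5)

# envir = dat: 'x_one' is looked up among dat's columns, so this works
eval(parse(text = "x_one"), dat)

# Here the inner eval() is forced first, in the calling environment,
# where no object 'x_one' exists -- it errors before 'dat' is consulted.
res <- tryCatch(eval(eval(parse(text = "x_one")), dat),
                error = conditionMessage)
res
```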
Edit 10/4/13: it is probably safer (but slower) to use .SD as the environment, as it will then be robust to i or by as well, e.g.
dat[, list(xout = eval(myfun('x'), .SD))]
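For example (a hypothetical grouped variant, not from the original post), evaluating against .SD keeps a pre-built expression correct per group:

```r
library(data.table)
dat <- data.table(g = rep(1:2, each = 5), x_one = 1:10, x_two = 0:9)
e <- quote(x_one - x_two)   # a pre-built expression, as myfun() might return

# .SD holds only the current group's rows and columns,
# so eval() against it stays correct under by=
dat[, list(x_out = eval(e, .SD)), by = g]
```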
Edit from Matthew:
+10 to above. I couldn't have explained it better myself. Taking it a step further, what I sometimes do is construct the entire data.table query and then eval that. It can be a bit more robust that way, sometimes. I think of it like SQL; i.e., we often construct a dynamic SQL statement that is sent to the SQL server to be executed. When you are debugging, too, it's also sometimes easier to look at the constructed query and run that at the browser prompt. But, sometimes such a query would be very long, so passing eval into i, j or by can be more efficient by not recomputing the other components. As usual, many ways to skin the cat.
The subtle reasons for considering eval-ing the entire query include:

1. One reason grouping is fast is that it inspects the j expression first. If it's a list, it removes the names, but remembers them. It then evals an unnamed list for each group, then reinstates the names once, at the end, on the final result. One reason other methods can be slow is the recreation of the same column name vector for each and every group, over and over again. The more complex j is, though (e.g. if the expression doesn't start precisely with list), the harder it gets to code up the inspection logic internally. There are lots of tests in this area; e.g., in combination with eval, and verbosity reports if name dropping isn't working. But constructing a "simple" query (the full query) and eval-ing that may be faster and more robust for this reason.

2. With v1.8.2 there's now optimization of j: options(datatable.optimize=Inf). This inspects j and modifies it to optimize mean and the lapply(.SD, ...) idiom, so far. This makes orders of magnitude difference and means there's less for the user to need to know (e.g. a few of the wiki points have gone away now). We could do more of this; e.g., DT[a==10] could be optimized to DT[J(10)] automatically if key(DT)[1]=="a" [Update Sep 2014 - now implemented in v1.9.3]. But again, the internal optimizations get harder to code up if, rather than DT[,mean(a),by=b], it's DT[,list(x=eval(expr)),by=b] where expr contains a call to mean, for example. So eval-ing the entire query may play nicer with datatable.optimize. Turning verbosity on reports what it's doing, and optimization can be turned off if needed; e.g., to test the speed difference it makes.
As per comments, FR#2183 has been added: "Change j=list(xout=eval(...))'s eval to eval within scope of DT". Thanks for highlighting. That's the sort of complex j I mean where the eval is nested in the expression. If j starts with eval, though, that's much simpler and already coded (as shown above) and tested, and should be optimized fine.
If there's one take-away from this then it's: do use DT[..., verbose=TRUE] or options(datatable.verbose=TRUE) to check data.table is still working efficiently when used for dynamic queries involving eval.
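For instance (the exact verbose wording varies by data.table version), verbosity can be switched on per call:

```r
library(data.table)
DT <- data.table(a = 1:6, b = rep(1:2, 3))

# verbose=TRUE prints how j was handled, e.g. whether the
# GForce optimization was applied to mean() by group
res <- DT[, mean(a), by = b, verbose = TRUE]
res
```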
Pass expressions to function to evaluate within data.table to allow for internal optimisation
No need for fancy tools, just use base R metaprogramming features.
my_fun2 = function(my_i, my_j, by, my_data) {
  dtq = substitute(
    my_data[.i, .j, .by],
    list(.i = substitute(my_i), .j = substitute(my_j), .by = substitute(by))
  )
  print(dtq)
  eval(dtq)
}
my_fun2(Species == "setosa", sum(Sepal.Length), my_data=as.data.table(iris))
my_fun2(my_j = "Sepal.Length", my_data=as.data.table(iris))
This way you can be sure that data.table will use all possible optimizations, just as when typing the [ call by hand.
Note that in data.table we are planning to make substitution easier; see the solution proposed in PR Rdatatable/data.table#4304. Then, using the extra env argument, substitution will be handled internally for you:
my_fun3 = function(my_i, my_j, by, my_data) {
  my_data[.i, .j, .by,
          env = list(.i = substitute(my_i), .j = substitute(my_j), .by = substitute(by)),
          verbose = TRUE]
}
my_fun3(Species == "setosa", sum(Sepal.Length), my_data=as.data.table(iris))
#Argument 'j' after substitute: sum(Sepal.Length)
#Argument 'i' after substitute: Species == "setosa"
#...
my_fun3(my_j = "Sepal.Length", my_data=as.data.table(iris))
#Argument 'j' after substitute: Sepal.Length
#...
evaluate expression in data.table
Instead of parse_expr, eval(parse(...)) can be used:
dt[, b := eval(parse(text = expr))]
Or wrap eval around rlang::parse_expr, since the !! does the evaluation in the tidyverse:
dt[, b := eval(rlang::parse_expr(expr)) ]
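Neither dt nor expr is shown in the question; an assumed minimal setup demonstrating both forms (the rlang call is optional and only runs if the package is installed):

```r
library(data.table)
dt <- data.table(a = 1:3)   # assumed data; the question's table isn't shown
expr <- "a * 2"

# base R: parse the string, then eval within j
dt[, b := eval(parse(text = expr))]

# rlang equivalent, if available
if (requireNamespace("rlang", quietly = TRUE)) {
  dt[, b2 := eval(rlang::parse_expr(expr))]
}
dt
```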
Evaluate expression in R data.table
If your actual expressions describe calls to vectorized functions and are repeated many times each, this may be more efficient, since it only parses and evaluates each distinct expression one time:
f <- function(e, .SD) eval(parse(text=e[1]), envir=.SD)
dt[, ans:=f(expr,.SD), by=expr, .SDcols=c("a", "b")]
# expr a b ans
# 1: a + b 1 5 6.0
# 2: a - b 2 6 -4.0
# 3: a * b 3 7 21.0
# 4: a / b 4 8 0.5
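The question's dt isn't reproduced above; an assumed setup that matches the printed output (numeric columns are used so every expression yields a double):

```r
library(data.table)
dt <- data.table(expr = c("a + b", "a - b", "a * b", "a / b"),
                 a = c(1, 2, 3, 4), b = c(5, 6, 7, 8))

f <- function(e, .SD) eval(parse(text = e[1]), envir = .SD)

# by=expr means each distinct expression string is parsed once,
# then evaluated vectorised over that group's rows
dt[, ans := f(expr, .SD), by = expr, .SDcols = c("a", "b")]
dt
```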
eval and quote in data.table
UPDATE (eddi): As of version 1.8.11 this has been fixed and .SD is not needed in cases where the expression can be evaluated in place, as in the OP. Since currently the presence of .SD triggers construction of the full .SD, this will result in much faster speeds in some cases.
What's going on is that calls to eval() are treated differently than you likely imagine in the code that implements [.data.table(). Specifically, [.data.table() contains special evaluation branches for i and j expressions that begin with the symbol eval. When you wrap the call to eval inside of a call to sum(), eval is no longer the first element of the parsed/substituted expression, and the special evaluation branch is skipped.

Here is the bit of code in the monster function displayed by typing getAnywhere("[.data.table") that makes a special allowance for calls to eval() passed in via [.data.table()'s j-argument:
jsub = substitute(j)
...
# Skipping some lines
...
jsubl = as.list.default(jsub)
if (identical(jsubl[[1L]], quote(eval))) {  # The test for eval 'on the outside'
  jsub = eval(jsubl[[2L]], parent.frame(), parent.frame())
  if (is.expression(jsub))
    jsub = jsub[[1L]]
}
As a workaround, either follow the example in data.table FAQ 1.6, or explicitly point eval() towards .SD, the local variable that holds the columns of whatever data.table you are operating on (here d). (For some more explanation of .SD's role, see the first few paragraphs of this answer.)
d[, sum(eval(quoted_a, envir=.SD))]
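Neither d nor quoted_a is defined in the excerpt; a self-contained sketch of both behaviours:

```r
library(data.table)
d <- data.table(a = 1:5)   # stand-in for the OP's data
quoted_a <- quote(a)

# eval at the top level of j: the special branch fires and this works
d[, eval(quoted_a)]

# nested inside sum(), the branch is skipped -- point eval at .SD instead
d[, sum(eval(quoted_a, envir = .SD))]
```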
Unable to use eval(parse(...)) in data.table [,:=] function
This behaviour, bug #5423, is now fixed in commit #1234 in the current development version of data.table v1.9.3. From NEWS:
o   DT[, y := y * eval(parse(text="1*2"))] resulted in error unless eval() is wrapped with parentheses. That is, DT[, y := y * (eval(parse(text="1*2")))]. Now fixed, closes #5423. Thanks to Wet Feet for reporting and to Simon O'Hanlon for identifying the issue here on SO.
Regardless of why you want to do this and the relative merits or not, this is about order of precedence. You need to wrap the whole eval(parse(text="1*2")) in a set of parentheses to ensure that it is evaluated before you try to multiply it by y...
DT[,y:=y*(eval(parse(text="1*2")))]
# x y
# 1: 1 4
# 2: 2 8
# 3: 3 12
# 4: 4 16
# 5: 5 20
# 6: 6 24
# 7: 7 28
# 8: 8 32
# 9: 9 36
#10: 10 40
Without the parentheses, data.table::":=" is trying to multiply y by an expression object, which will result in a syntax error.
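DT isn't defined in the excerpt; an assumed starting table that reproduces the printed result (per the NEWS item above, on data.table >= 1.9.3 the extra parentheses are no longer required):

```r
library(data.table)
DT <- data.table(x = 1:10, y = (1:10) * 2)   # assumed initial data

# parenthesized form, which worked even before the fix
DT[, y := y * (eval(parse(text = "1*2")))]
DT
```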
How to convert r data.table expression into a function for looping
The goal is a function of DT & var to do:
DT[, list(.N), by=var][order(var), list(var, N, Proportion=N/sum(N))]
To do a calculation by a variable and then order by it, use keyby=. So your function can become:
f_tabulate <- function(DT, var) {
  DT[, list(.N), keyby=var][, Proportion := N/sum(N)][]
}
# usage
for (i in c('cyl', 'gear')) print(f_tabulate(mtcars_dt, i))
keyby= and by= can take arguments in many forms, and you don't need to quote or eval a simple vector of variable names. (Other ways include .(var1, var2), list(var1, var2), "var1,var2".) You could also extend to counting by multiple variables...
f_tabulate2 <- function(DT, ...) {
  DT[, list(.N), keyby=c(...)][, Proportion := N/sum(N)][]
}
# usage
f_tabulate2(mtcars_dt, 'cyl', 'gear')
For this operation (excepting the ordering part), you could also use groupingsets():
> groupingsets(mtcars_dt, .N, keyby=c('cyl', 'gear'), sets=list("cyl", "gear"))[,
Proportion := N/nrow(mtcars_dt)][]
cyl gear N Proportion
1: 6 NA 7 0.21875
2: 4 NA 11 0.34375
3: 8 NA 14 0.43750
4: NA 4 12 0.37500
5: NA 3 15 0.46875
6: NA 5 5 0.15625
As a function (and adding ordering back in)...
f_tabulate_all = function(DT, vars){
lvars = as.list(vars)
ocall = as.call(lapply(c("order", vars), as.name))
groupingsets(DT[eval(ocall)], .N, by=vars, sets=as.list(vars))[,
Proportion := N/nrow(DT)][]
}
# usage
f_tabulate_all(mtcars_dt, c('cyl', 'gear'))
The as.name function achieves the same thing as quote, when applied to a string naming a function or other object.
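A quick illustration of that equivalence:

```r
# as.name("order") builds the same symbol object that quote(order) does
identical(as.name("order"), quote(order))   # TRUE

# so a call like order(cyl, gear) can be assembled purely from strings
ocall <- as.call(lapply(c("order", "cyl", "gear"), as.name))
ocall
```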
create a filter expression (i) dynamically in data.table
Building the expression instead of parsing it:
library(data.table)
dd = data.table(x=1:10,y=10:1,z=20:20)
AndIN = function(cond) {
  Reduce(
    function(x, y) call("&", call("(", x), call("(", y)),
    lapply(names(cond), function(var) call("%in%", as.name(var), cond[[var]]))
  )
}
cond = list(x=c(1,3),z=c(12,20))
AndIN(cond)
#(x %in% c(1, 3)) & (z %in% c(12, 20))
dd[eval(AndIN(cond))]
# x y z
#1: 1 10 20
#2: 3 8 20
The calls call("(", x) and call("(", y) may not be necessary.
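Indeed, dropping them gives the same result here, since the calls are assembled structurally rather than re-parsed (a variant sketch, not from the original answer):

```r
library(data.table)
dd <- data.table(x = 1:10, y = 10:1, z = 20)

# same builder without the "(" wrapping
AndIN2 <- function(cond) {
  Reduce(function(x, y) call("&", x, y),
         lapply(names(cond), function(var) call("%in%", as.name(var), cond[[var]])))
}

cond <- list(x = c(1, 3), z = c(12, 20))
AndIN2(cond)
# x %in% c(1, 3) & z %in% c(12, 20)
dd[eval(AndIN2(cond))]
```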