Eval and Quote in Data.Table

eval and quote in data.table

UPDATE (eddi): As of version 1.8.11 this has been fixed and .SD is not needed in cases where the expression can be evaluated in place, as in the OP's example. Since the presence of .SD currently triggers construction of the full .SD, this will result in much faster speeds in some cases.


What's going on is that calls to eval() are treated differently than you likely imagine in the code that implements [.data.table(). Specifically, [.data.table() contains special evaluation branches for i and j expressions that begin with the symbol eval. When you wrap the call to eval inside of a call to sum(), eval is no longer the first element of the parsed/substituted expression, and the special evaluation branch is skipped.

Here is the bit of code in the monster function displayed by typing getAnywhere("[.data.table") that makes a special allowance for calls to eval() passed in via [.data.table()'s j-argument:

jsub = substitute(j)
...
# Skipping some lines
...
jsubl = as.list.default(jsub)
if (identical(jsubl[[1L]], quote(eval))) {  # The test for eval 'on the outside'
    jsub = eval(jsubl[[2L]], parent.frame(), parent.frame())
    if (is.expression(jsub))
        jsub = jsub[[1L]]
}

As a workaround, either follow the example in data.table FAQ 1.6, or explicitly point eval() towards .SD, the local variable that holds the columns of whatever data.table you are operating on (here d).

d[, sum(eval(quoted_a, envir=.SD))]
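Here is a self-contained illustration (a minimal sketch; d and quoted_a are assumed from the question, with quoted_a holding a quoted column name):

library(data.table)
d <- data.table(a = 1:5)
quoted_a <- quote(a)    # assumed: a quoted reference to column a

d[, eval(quoted_a)]     # works: eval is 'on the outside', the special branch fires
# [1] 1 2 3 4 5

d[, sum(eval(quoted_a, envir = .SD))]   # the workaround when eval is nested in sum()
# [1] 15

# (As of 1.8.11, per the update above, d[, sum(eval(quoted_a))] also works
# without pointing eval at .SD.)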

create an expression from a function for data.table to eval

One solution is to put the list(...) within the function output.

I tend to use as.quoted, stealing from the way @hadley implements .() in the plyr package.

library(data.table)
library(plyr)
dat <- data.table(x_one = 1:10, x_two = 1:10, y_one = 1:10, y_two = 1:10)

myfun <- function(name) {
    one <- paste0(name, '_one')
    two <- paste0(name, '_two')
    out <- paste0(name, '_out')
    as.quoted(paste('list(', out, '=', one, '-', two, ')'))[[1]]
}


dat[, eval(myfun('x'))]

# x_out
# 1: 0
# 2: 0
# 3: 0
# 4: 0
# 5: 0
# 6: 0
# 7: 0
# 8: 0
# 9: 0
#10: 0

To do two columns at once, you can adjust your call:

myfun <- function(name) {
    one <- paste0(name, '_one')
    two <- paste0(name, '_two')
    out <- paste0(name, '_out')
    calls <- paste(paste(out, '=', one, '-', two), collapse = ',')
    as.quoted(paste('list(', calls, ')'))[[1]]
}


dat[, eval(myfun(c('x', 'y')))]

# x_out y_out
# 1: 0 0
# 2: 0 0
# 3: 0 0
# 4: 0 0
# 5: 0 0
# 6: 0 0
# 7: 0 0
# 8: 0 0
# 9: 0 0
#10: 0 0

As for the reason: in this solution, the entire list(...) call is evaluated within the data.table, so the column names resolve correctly.

The relevant code within [.data.table is

if (missing(j)) stop("logical error, j missing")
jsub = substitute(j)
if (is.null(jsub)) return(NULL)
jsubl = as.list.default(jsub)
if (identical(jsubl[[1L]], quote(eval))) {
    jsub = eval(jsubl[[2L]], parent.frame())
    if (is.expression(jsub)) jsub = jsub[[1L]]
}

If, in your case,

j = list(xout = eval(myfun('x')))

then

jsub <- substitute(j)

is

#  list(xout = eval(myfun("x")))

and

as.list.default(jsub)
## [[1]]
## list
##
## $xout
## eval(myfun("x"))

So jsubl[[1L]] is list and jsubl[[2L]] is eval(myfun("x")), which means data.table has not found a call to eval and will not deal with it appropriately.

This will work, forcing the second evaluation within the correct environment, the data.table:

# using OP myfun
dat[, list(xout = eval(myfun('x'), dat))]

In the same way,

eval(parse(text = 'x_one'), dat)
# [1] 1 2 3 4 5 6 7 8 9 10

works, but

eval(eval(parse(text = 'x_one')), dat)

does not, because the inner eval() runs first, in the calling frame, where x_one is not defined.

Edit 10/4/13

It is probably safer (but slower) to use .SD as the environment, though, as it will then be robust to i or by as well, e.g.

dat[, list(xout = eval(myfun('x'), .SD))]

Edit from Matthew:

+10 to above. I couldn't have explained it better myself. Taking it a step further, what I sometimes do is construct the entire data.table query and then eval that. It can be a bit more robust that way, sometimes. I think of it like SQL: we often construct a dynamic SQL statement that is sent to the SQL server to be executed. When debugging, it's also sometimes easier to look at the constructed query and run it at the browser prompt. But such a query can be very long, so passing eval into i, j or by can be more efficient by not recomputing the other components. As usual, there are many ways to skin the cat.

The subtle reasons for considering evaling the entire query include :

  1. One reason grouping is fast is that it inspects the j expression first. If it's a list, it removes the names, but remembers them. It then evals an unnamed list for each group, then reinstates the names once, at the end, on the final result. One reason other methods can be slow is the recreation of the same column-name vector for each and every group, over and over again. The more complex the definition of j, though (e.g. if the expression doesn't start precisely with list), the harder it gets to code up the inspection logic internally. There are lots of tests in this area; e.g., in combination with eval, and verbosity reports if name dropping isn't working. But constructing a "simple" query (the full query) and evaling that may be faster and more robust for this reason.

  2. With v1.8.2 there's now optimization of j: options(datatable.optimize=Inf). This inspects j and modifies it to optimize mean and the lapply(.SD,...) idiom, so far. This makes an orders-of-magnitude difference and means there's less for the user to need to know (e.g. a few of the wiki points have gone away now). We could do more of this; e.g., DT[a==10] could be optimized to DT[J(10)] automatically if key(DT)[1]=="a" [Update Sep 2014 - now implemented in v1.9.3]. But again, the internal optimizations get harder to code up if, rather than DT[,mean(a),by=b], it's DT[,list(x=eval(expr)),by=b] where expr contains a call to mean, for example. So evaling the entire query may play nicer with datatable.optimize. Turning verbosity on reports what it's doing, and optimization can be turned off if needed; e.g., to test the speed difference it makes (see the sketch after this list).
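A minimal sketch of point 2 (the table DT is invented here for illustration; the exact verbose wording varies by version):

library(data.table)
options(datatable.optimize = Inf)
DT <- data.table(a = rnorm(10), b = rep(1:2, 5))
DT[, lapply(.SD, mean), by = b, verbose = TRUE]
# the verbose output reports how the lapply(.SD, mean) idiom in j
# was rewritten into an optimized internal form before evaluation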

As per comments, FR#2183 has been added: "Change j=list(xout=eval(...))'s eval to eval within scope of DT". Thanks for highlighting. That's the sort of complex j I mean where the eval is nested in the expression. If j starts with eval, though, that's much simpler and already coded (as shown above) and tested, and should be optimized fine.

If there's one take-away from this, it's: do use DT[..., verbose=TRUE] or options(datatable.verbose=TRUE) to check that data.table is still working efficiently when used for dynamic queries involving eval.
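For instance, reusing dat and myfun from above:

dat[, eval(myfun('x')), verbose = TRUE]
# the verbose trace shows how j was inspected and whether any
# optimizations applied to the dynamically constructed query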

How to evaluate a string to filter an R data.table?

I think eval(parse(text = ...)) will work; you just need some modifications. Try this:

library(data.table)
iris <- data.table(iris)

# Updated so it will have quotes in your string
vars <- '"setosa"'
# Updated so you can change your vars
filter <- paste0('Species == ', vars, ' & Petal.Length >= 4')

res <- iris[eval(parse(text = filter)), list(
    sep.len.tot = sum(Sepal.Length),
    sep.width.total = sum(Sepal.Width)
), by = 'Species']

A few notes: I updated vars so the string carries its own quotes and runs properly, and I also updated filter so you can change vars dynamically.

Finally, for explanatory purposes: the resulting data.table is empty, because no setosa rows have Petal.Length >= 4. So, to see this work, we can just remove the last condition:

filter <- paste0('Species == ', vars)
res2 <- iris[eval(parse(text = filter)), list(
    sep.len.tot = sum(Sepal.Length),
    sep.width.total = sum(Sepal.Width)
), by = 'Species']

res2
#    Species sep.len.tot sep.width.total
# 1:  setosa       250.3           171.4

EDIT:
Per @Frank's comment below, a cleaner approach is to write the whole thing as an expression:

filter <- substitute(Species == vars, list(vars = "setosa"))

res <- iris[eval(filter), list(
    sep.len.tot = sum(Sepal.Length),
    sep.width.total = sum(Sepal.Width)
), by = 'Species']
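The substituted filter also composes nicely into a helper (a hedged sketch; make_filter is a name invented here, and iris is the data.table version created above):

make_filter <- function(sp) substitute(Species == s, list(s = sp))
iris[eval(make_filter("virginica")), .N]
# [1] 50
# eval is 'on the outside' of i, so data.table's special eval branch handles it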

Pass expressions to function to evaluate within data.table to allow for internal optimisation

No need for fancy tools, just use base R metaprogramming features.

my_fun2 = function(my_i, my_j, by, my_data) {
    dtq = substitute(
        my_data[.i, .j, .by],
        list(.i = substitute(my_i), .j = substitute(my_j), .by = substitute(by))
    )
    print(dtq)
    eval(dtq)
}

my_fun2(Species == "setosa", sum(Sepal.Length), my_data=as.data.table(iris))
my_fun2(my_j = "Sepal.Length", my_data=as.data.table(iris))

This way you can be sure that data.table will use all possible optimizations, just as when typing the [ call by hand.


Note that in data.table we are planning to make substitution easier; see the solution proposed in PR Rdatatable/data.table#4304 (this env interface has since shipped, in data.table 1.14.2).

Then, using the extra env argument, the substitution will be handled internally for you:

my_fun3 = function(my_i, my_j, by, my_data) {
    my_data[.i, .j, .by,
            env = list(.i = substitute(my_i), .j = substitute(my_j), .by = substitute(by)),
            verbose = TRUE]
}
my_fun3(Species == "setosa", sum(Sepal.Length), my_data=as.data.table(iris))
#Argument 'j' after substitute: sum(Sepal.Length)
#Argument 'i' after substitute: Species == "setosa"
#...
my_fun3(my_j = "Sepal.Length", my_data=as.data.table(iris))
#Argument 'j' after substitute: Sepal.Length
#...
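The released env interface also accepts character values, which it converts to column symbols (a hedged sketch assuming data.table >= 1.14.2; my_fun4 is a name invented here):

my_fun4 = function(col, my_data) {
    # a character value in env becomes the column name inside j
    my_data[, sum(.col), env = list(.col = col)]
}
my_fun4("Sepal.Length", as.data.table(iris))
# [1] 876.5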

Can eval be called within a data frame with variables defined in that data frame?

Have you noticed that the simpler call data.frame(x, y = x) wouldn't work either?

data.frame() uses standard evaluation, so in your case the quoted expressions will be evaluated in the global environment.

If you name the elements of quotes you'll be able to do tibble(x, !!!quotes) though, because tibble works differently (see the sketch below).
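A minimal sketch of that tibble route (using rlang's !!! splice operator; the names y and z are invented here):

library(tibble)
x <- c(5, 10, 15)
quotes <- alist(y = x + 1, z = x + 2)
tibble(x, !!!quotes)
# tibble's tidy evaluation resolves x for the spliced expressions,
# yielding columns x, y and z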

Technically, the following might be acceptable to you: we cheat by creating a temp value in the global environment, which we then remove on exit.

(I use evalq only to be able to use on.exit)

quotes <- alist(x, x + 1, x + 2)

df <- data.frame(
    x = c(5, 10, 15) ->> .t.e.m.p.,
    evalq({
        on.exit(rm(.t.e.m.p., envir = .GlobalEnv))
        lapply(quotes, eval, list(x = .t.e.m.p.))
    }))

df
#>    x c.5..10..15. c.6..11..16. c.7..12..17.
#> 1  5            5            6            7
#> 2 10           10           11           12
#> 3 15           15           16           17

ls(all.names = TRUE)
#> [1] "df" "quotes"

Created on 2021-05-11 by the reprex package (v0.3.0)

Of course this looks horrible, and using transform, within or tibble is probably a wiser choice (see the sketch below).
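For instance, a minimal two-step base-R sketch that evaluates quotes against the finished data frame instead of during its construction:

d <- data.frame(x = c(5, 10, 15))
data.frame(d, lapply(quotes, eval, envir = d))
# same result, without touching the global environment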

A simple reproducible example to pass arguments to data.table in a self-defined function in R

If we are using unquoted arguments, substitute and then evaluate:

zz <- function(data, var, group) {
    var <- substitute(var)
    group <- substitute(group)
    setnames(data[, sum(eval(var)), by = group],
             c(deparse(group), deparse(var)))[]
    # or use
    # setnames(data[, sum(eval(var)), by = c(deparse(group))], 2, deparse(var))[]
}
zz(as.data.table(mtcars), mpg, gear)
# gear mpg
#1: 4 294.4
#2: 3 241.6
#3: 5 106.9
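A character-input variant is also common (a hedged sketch of my own, not from the original answer; zz_chr uses get() to look columns up by name):

zz_chr <- function(data, var, group) {
    # get() resolves the column named by the string `var` inside j
    setnames(data[, sum(get(var)), by = c(group)], c(group, var))[]
}
zz_chr(as.data.table(mtcars), "mpg", "gear")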

Using data.table and tidy eval together: why group by does not work as expected, why is ~ inserted?

TLDR: Quosures are implemented as formulas because of a bug that affects all versions of R prior to 3.5.1. The special rlang definition for ~ is only available with eval_tidy(). This is why quosures are not as compatible with non-tidyeval functions as we'd like them to be.

Edit: That said, there are probably other challenges to make data masking APIs like data.table compatible with quosures.


Quosures are currently implemented as formulas:

library("rlang")

q <- quo(cat("eval!\n"))

is.call(q)
#> [1] TRUE

as.list(unclass(q))
#> [[1]]
#> `~`
#>
#> [[2]]
#> cat("eval!\n")
#>
#> attr(,".Environment")
#> <environment: R_GlobalEnv>

Compare to ordinary formulas:

f <- ~cat("eval?\n")

is.call(f)
#> [1] TRUE

as.list(unclass(f))
#> [[1]]
#> `~`
#>
#> [[2]]
#> cat("eval?\n")
#>
#> attr(,".Environment")
#> <environment: R_GlobalEnv>

So what's the difference between a quosure and a formula? The former evaluates itself while the latter quotes itself, i.e. it returns itself.

eval_tidy(q)
#> eval!

eval_tidy(f)
#> ~cat("eval?\n")

The self-quoting mechanism is implemented by the ~ primitive:

`~`
#> .Primitive("~")

One important task of this primitive is to record an environment the very first time a formula is evaluated. For instance, the formula in quote(~foo) is not evaluated and does not record an environment, while eval(quote(~foo)) does.

Anyway, when you evaluate a ~ call, the definition for ~ is looked up in the ordinary way and usually finds the ~ primitive. Just like when you compute 1 + 1, the definition for + is looked up and usually .Primitive("+") is found. The reason quosures self-evaluate instead of self-quote is simply that eval_tidy() creates a special definition for ~ in its evaluation environment. You can get hold of this special definition with eval_tidy(quote(`~`)).

So why did we implement quosures as formulas?

  1. It deparses and prints better. This reason is now outdated because we have our own expression deparser where quosures are printed with a leading ^ rather than a leading ~.

  2. Because of a bug in all versions of R prior to 3.5.1, expressions with a class are evaluated on recursive prints. Here is an example of classed call:

    x  <- quote(stop("oh no!"))
    x <- structure(x, class = "some_class")

    The object itself prints fine:

    x
    #> stop("oh no!")
    #> attr(,"class")
    #> [1] "some_class"

    But if you put it in a list it gets evaluated!

    list(x)
    #> [[1]]
    #> Error in print(stop("oh no!")) : oh no!

The eager evaluation bug does not affect formulas because they self-quote. Implementing quosures as formulas protected us from this bug.

Ideally we'll inline a function directly in the quosure. E.g. the first element wouldn't contain the symbol ~ but a function. Here is how you can create such functions:

c <- as.call(list(toupper, "a"))
c
#> (function (x)
#> {
#>     if (!is.character(x))
#>         x <- as.character(x)
#>     .Internal(toupper(x))
#> })("a")

The biggest advantage of inlining functions in calls is that they can be evaluated anywhere. Even in the empty environment!

eval(c, emptyenv())
#> [1] "A"

If we implemented quosures with inlined functions, they could similarly be evaluated anywhere: eval(q) would work, you could unquote quosures inside data.table calls, etc. But did you notice how noisily the inlined call prints because of the inlining? To work around this we'd have to give the call a class and a print method. But remember the R <= 3.5.0 bug: we'd get weird eager evaluations when printing lists of quosures at the console. This is why quosures are still implemented as formulas to this day and are not as compatible with non-tidyeval functions as we'd like.


