Using Data.Table I and J Arguments in Functions

Using data.table i and j arguments in functions

Gavin and Josh are right. This answer is only to add more background. The idea is that not only can you pass variable column names into a function like that, but expressions of column names, using quote().

group = quote(car)
mtcars[, list(Total=length(mpg)), by=group][order(group)]
group Total
AMC 1
Cadillac 1
...
Toyota 2
Valiant 1
Volvo 1

Although, admitedly more difficult to start with, it can be more flexible. That's the idea, anyway. Inside functions you need substitute(), like this :

tableOrder = function(x,.expr) {
.expr = substitute(.expr)
ans = x[,list(Total=length(mpg)),by=.expr]
setkeyv(ans, head(names(ans),-1)) # see below re feature request #1780
ans
}

tableOrder(mtcars, car)
.expr Total
AMC 1
Cadillac 1
Camaro 1
...
Toyota 2
Valiant 1
Volvo 1

tableOrder(mtcars, substring(car,1,1)) # an expression, not just a column name
.expr Total
[1,] A 1
[2,] C 3
[3,] D 3
...
[8,] P 2
[9,] T 2
[10,] V 2

tableOrder(mtcars, list(cyl,gear%%2)) # by two expressions, so head(,-1) above
cyl gear Total
[1,] 4 0 8
[2,] 4 1 3
[3,] 6 0 4
[4,] 6 1 3
[5,] 8 1 14

A new argument keyby was added in v1.8.0 (July 2012) making it simpler :

tableOrder = function(x,.expr) {
.expr = substitute(.expr)
x[,list(Total=length(mpg)),keyby=.expr]
}

Comments and feedback in the area of i,j and by variable expressions are most welcome. The other thing you can do is have a table where a column contains expressions and then look up which expression to put in i, j or by from that table.

Passing multiple arguments to data.table inside a function

Or using eval with substitute:

library(data.table) #Win R-3.5.1 x64 data.table_1.12.2
dt_mtcars <- as.data.table(mtcars)

processFUN <- function(dt, where, select, group) {

out <- dt[i=eval(substitute(where)),
j=eval(substitute(select)),
by=eval(substitute(group))]

return(out)
}

processFUN(dt_mtcars, mpg>20, .(mean_mpg=mean(mpg), median_mpg=median(mpg)), .(cyl, gear))

Some of the earliest references that I can find are

  1. Aggregating sub totals and grand totals with data.table
  2. Using data.table i and j arguments in functions

The old faq 1.6 contains reference to this:
http://datatable.r-forge.r-project.org/datatable-faq.pdf

How do I reference a function parameter inside inside a data.table with a column of the same name?

  1. One possible option is this:
myfunc <- function(dt, t){
env <- environment()
dt <- dt[t==get('t',env)]
mean(dt$b)
}

  1. Another approach: while perhaps not strictly a "solution" to your current problem, you may find it of interest. Consider data.table version>= 1.14.3. In this case, we can use env param of DT[i,j,by,env,...], to indicate the datatable column as "t", and the function parameter as t. Notice that this will work on column t with function parameter t, even if dt contains columns named col and val
myfunc <- function(dt, t){
dt <- dt[col==val, env=list(col="t", val=t)]
mean(dt$b)
}

In both case, usage and output is as below:

Usage

myfunc(dt = foo, t = 3)

Output:

[1] 0.1292877

Input:

set.seed(123)
foo <- data.table(t = c(1,1,2,2,3), b = rnorm(5))

foo looks like this:

> foo
t b
1: 1 -0.56047565
2: 1 -0.23017749
3: 2 1.55870831
4: 2 0.07050839
5: 3 0.12928774

Evaluating function arguments to pass to data.table

If you expect an object to have a certain structure or hold certain data, then defining a class can really help. And with S3, it's simple.

as.relationship <- function(DT, id, fid, mid) {
out <- DT[, c(id, fid, mid), with = FALSE]
setnames(out, c("id", "fid", "mid"))
setattr(out, "class", c("relationship", class(out)))
out
}

Then you can write a function to work on that class with the safety of knowing where everything is.

f <- function(DT, id, fid, mid) {
relatives <- as.relationship(DT, id, fid, mid)
relatives[
relatives,
on = "fid",
allow.cartesian = TRUE
][
relatives,
on = "mid",
allow.cartesian = TRUE
][
,
{
siblings <- union(i.id, i.id.1)
except_self <- setdiff(siblings, .BY[["id"]])
list(siblist = list(except_self))
},
by = "id"
]
}

This function takes the column names as strings. So you'd call it like this:

f(DT, "id", "fid", "mid")
# id siblist
# 1: 1
# 2: 2
# 3: 3 4
# 4: 4 3
# 5: 12 13,14
# 6: 13 14,15,12
# 7: 14 13,15,12
# 8: 15 13,14

setnames(DT, c("A", "B", "C"))
f(DT, "A", "B", "C")
# id siblist
# 1: 1
# 2: 2
# 3: 3 4
# 4: 4 3
# 5: 12 13,14
# 6: 13 14,15,12
# 7: 14 13,15,12
# 8: 15 13,14

If you're worried about performance, don't be. If you create a data.table from entire columns of another data.table, they're smart enough not to actually copy the data. They share it. So there's no real performance penalty to making another object.

A simple reproducible example to pass arguments to data.table in a self-defined function in R

If we are using unquoted arguments, substitute and evaluate

zz <- function(data, var, group){
var <- substitute(var)
group <- substitute(group)
setnames(data[, sum(eval(var)), by = group],
c(deparse(group), deparse(var)))[]
# or use
# setnames(data[, sum(eval(var)), by = c(deparse(group))], 2, deparse(var))[]

}
zz(mtcars, mpg, gear)
# gear mpg
#1: 4 294.4
#2: 3 241.6
#3: 5 106.9

Pass variable name as argument inside data.table

Generally, quote and eval will work:

library(data.table)
plus <- function(x, y) {
x + y
}

add_one <- function(data, col) {
expr0 = quote(copy(data)[, z := plus(col, 1)][])

expr = do.call(substitute, list(expr0, list(col = substitute(col))))
cat("Evaluated expression:\n"); print(expr); cat("\n")

eval(expr)
}

set.seed(1)
library(magrittr)
data.table(x = 1:10, y = rnorm(10)) %>%
add_one(y)

which gives

Evaluated expression:
copy(data)[, `:=`(z, plus(y, 1))][]

x y z
1: 1 -0.6264538 0.3735462
2: 2 0.1836433 1.1836433
3: 3 -0.8356286 0.1643714
4: 4 1.5952808 2.5952808
5: 5 0.3295078 1.3295078
6: 6 -0.8204684 0.1795316
7: 7 0.4874291 1.4874291
8: 8 0.7383247 1.7383247
9: 9 0.5757814 1.5757814
10: 10 -0.3053884 0.6946116

data.table .. notation with functions in j

Parroting an answer to a different problem that works here as well. Not the prettiest solution, but variants on this have worked for me numerous times in the past.

Thanks @Frank for a non-parse() solution here!

I'm well familiar with the old adage "If the answer is parse() you should usually rethink the question.", but I have a hard time coming up with alternatives many times when evaluating within the data.table calling environment, I'd love to see a robust solution that doesn't execute arbitrary code passed in as a character string. In fact, half the reason I'm posting an answer like this is in hopes that someone can recommend a better option.

test_func3 <- function(data, var, var2) {
expr = substitute(sum(var), list(var=as.symbol(var)))
data[, eval(expr)]
}

test_func3(dt, 'x', 'y')
## [1] 5.472968

Quick disclaimer on hypothetical doomsday scenarios possible with eval(parse(...))

There are far more in depth discussions on the dangers of eval(parse(...)), but I'll avoid repeating them in full.

Theoretically you could have issues if one of your columns is named something unfortunate like "(system(paste0('kill ',Sys.getpid())))" (Do not execute that, it will kill your R session on the spot!). This is probably enough of an outside chance to not lose sleep over it unless you plan on putting this in a package on CRAN.


Update:

For the specific case in the comments below where the table is grouped and then sum is applied to all, .SDcols is potentially useful. The only way I'm aware of to make sure that this function would return consistent results even if dt had a column named var3 is to evaluate the arguments within the function environment but outside of the data.table environment using c().

set.seed(42)
dt <- data.table(
x = rnorm(10),
y = rnorm(10),
z = sample(c("a","b","c"),size = 10, replace = TRUE)
)

test_func3 <- function(data, var, var2, var3) {
ListOfColumns = c(var,var2)
GroupColumn <- c(var3)
dt[, lapply(.SD, sum), by= eval(GroupColumn), .SDcols = ListOfColumns]
}

test_func3(dt, 'x', 'y','z')

returns

   z         x         y
1: b 1.0531555 2.121852
2: a 0.3631284 -1.388861
3: c 4.0566838 -2.367558

Passing multiple column names to by in a data.table function

Just create a character vector for by part of data.table, it will work:

myFun <- function(df, i, j, by){

df[get(i) == 4, .(Count = .N,
Mean = mean(get(j)),
Median = median(get(j))),
by = c(by, 'am')]
}

myFun(dt, i = 'cyl', j = 'hp', by = 'vs')

#vs am Count Mean Median
#1: 1 1 7 80.57143 66
#2: 1 0 3 84.66667 95
#3: 0 1 1 91.00000 91

Is R data.table documented to pass by reference as argument?

I think what you're being surprised about is actually R behavior, which is why it's not specifically documented in data.table (maybe it should be anyway, as the implications are more important for data.table).

You were surprised that the object passed to a function had the same address, but this is the same for base R as well:

x = 1:10
address(x)
# [1] "0x7fb7d4b6c820"
(function(y) {print(address(y))})(x)
# [1] "0x7fb7d4b6c820"

What's being copied in the function environment is the pointer to x. Moreover, for base R, the parent x is immutable:

foo = function(y) {
print(address(y))
y[1L] = 2L
print(address(y))
}
foo(x)
# [1] "0x7fb7d4b6c820"
# [1] "0x7fb7d4e11d28"

That is, as soon as we try to edit y, a copy is made. This is related to reference counting -- you can see some work by Luke Tierney on this, e.g. this presentation

The difference for data.table is that data.table enables edit permissions for the parent object -- a double-edged sword as I think you know.



Related Topics



Leave a reply



Submit