How to Write a Function That Calls a Function That Calls Data.Table

How to write a function that calls a function that calls data.table?

This will work:

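plotfoo <- function(data, by) {
  by <- substitute(by)                  # capture the unevaluated argument, e.g. the symbol gear
  do.call(foo, list(quote(data), by))   # builds and evaluates the call foo(data, gear)
}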

plotfoo(DT, gear)
#    by  N
# 1:  4 12
# 2:  3 15
# 3:  5  5

Explanation:

The problem is that your call to foo() in plotfoo() looks like one of the following:

foo(data, eval(by))
foo(data, by)

When foo processes those calls, it dutifully substitutes for the second formal argument (by), getting as by's value the expression eval(by) or the symbol by. But you want by's value to be gear, as in the call foo(data, gear).

do.call() solves this problem by evaluating the elements of its second argument before constructing the call that it then evaluates. As a result, when you pass it by, it evaluates it to its value (the symbol gear) before constructing a call that looks (essentially) like this:

foo(data, gear)
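
For a version that can be run end to end, here is a minimal sketch. The definition of foo is not shown above, so the foo below is a hypothetical stand-in that counts rows per group, and DT is assumed to be mtcars converted to a data.table.

```r
library(data.table)

# Hypothetical stand-in for foo(): captures the unquoted column name
# and counts rows per group. The real foo() is not shown in the answer.
foo <- function(data, by) {
  by_name <- deparse(substitute(by))   # e.g. "gear"
  data[, .N, by = by_name]
}

plotfoo <- function(data, by) {
  by <- substitute(by)                 # the symbol gear, unevaluated
  # do.call() evaluates the list elements first, so the call it builds
  # is foo(data, gear) rather than foo(data, by)
  do.call(foo, list(quote(data), by))
}

DT <- as.data.table(mtcars)            # assumed example data
res <- plotfoo(DT, gear)
print(res)
```

With this stand-in the grouping column is named gear rather than by, because the column name is passed on as a string.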

How to use data.table inside a function call?

A simple fix would be to pass the column name as a string:

fillna <- function(df, var) {
  col <- df[[var]]
  # set() updates df by reference: fill the NA rows of `var` with the column mean
  set(df, i = which(is.na(col)), j = var, value = mean(col, na.rm = TRUE))
  return(df)
}

fillna(DT,"a")

#       a b
# 1: 6.00 4
# 2: 3.00 5
# 3: 1.00 6
# 4: 9.00 7
# 5: 4.75 8
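
Because set() works by reference, the table passed in is itself modified. The sketch below illustrates this and shows one way to leave the caller's table untouched by passing a copy(). The table DT2 and its values are assumptions made up for the illustration.

```r
library(data.table)

fillna <- function(df, var) {
  col <- df[[var]]
  set(df, i = which(is.na(col)), j = var, value = mean(col, na.rm = TRUE))
  df
}

DT2 <- data.table(a = c(6, 3, 1, 9, NA), b = 4:8)   # hypothetical data

# Work on a copy: DT2 keeps its NA, only `filled` is changed
filled <- fillna(copy(DT2), "a")
na_in_original <- anyNA(DT2$a)
na_in_filled <- anyNA(filled$a)

# Work on the table itself: DT2 is updated in place
fillna(DT2, "a")
na_after_inplace <- anyNA(DT2$a)
```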

Calling user defined function from data.table object

Two ways to solve the problem (thanks @chinsoon12):

test[,c:=mapply(f, test[,a],test[,b])]

test[,c:=f(a,b),1L:nrow(test)]
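
As a self-contained sketch of both approaches: the function f is not shown in the answer, so the f below is a hypothetical function that only works on single values, and the column names c_mapply and c_byrow are made up for the comparison.

```r
library(data.table)

# Hypothetical scalar-only function: `if` needs a length-one condition,
# so f() cannot be applied to whole columns directly.
f <- function(a, b) if (a > b) a - b else b - a

test <- data.table(a = 1:5, b = 5:1)

test[, c_mapply := mapply(f, a, b)]            # element-wise via mapply
test[, c_byrow := f(a, b), by = 1:nrow(test)]  # one group per row

print(test)
```

Both columns hold the same values. When f can be written in vectorized form, calling it once on the whole columns is usually preferable to either approach.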

Speed-wise, these two solutions are equivalent:

a<-1:500
b<-500:1

test_1 <- data.table(a,b)
test_2 <- data.table(a,b)

bench <- microbenchmark(
  v_1 = test_1[, c := mapply(f, test_1[, a], test_1[, b])],
  v_2 = test_2[, c := f(a, b), 1L:nrow(test_2)],
  times = 100L
)

summary(bench)
#   expr      min       lq     mean   median       uq      max neval cld
# 1  v_1 91.83598 95.63639 97.82780 96.94672 98.51073 113.2232   100   a
# 2  v_2 91.72392 95.45878 98.92037 96.53573 98.71301 139.9906   100   a

autoplot(bench)

[Figure: benchmark plot]

Apply function to data.table using function's character name and arguments as character vector

Yes, you are missing something (it's not really obvious, but careful debugging of the error identifies the problem). Your function expects named arguments arg1 and arg2, but do.call passes it arguments named y = ... and z = ... (which you have noticed). The solution is to pass the list without names:

> DT[, do.call(func, unname(.SD[, mycols, with = FALSE])), by = x]
    x V1
 1: a  6
 2: a  6
 3: a 11
 4: a 17
 5: a  7
 6: b 15
 7: b 17
 8: b 10
 9: b 11
10: b 10
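
The same idea as a runnable sketch. The data, the body of func, and the variable fname are assumptions for illustration; this variant also selects the columns with .SDcols and converts .SD with as.list() before dropping the names, which is a slightly different spelling of the call above.

```r
library(data.table)

func <- function(arg1, arg2) arg1 + arg2   # hypothetical two-argument function
fname <- "func"                            # function given by its character name
mycols <- c("y", "z")                      # argument columns as a character vector

DT <- data.table(x = rep(c("a", "b"), each = 3),
                 y = 1:6,
                 z = c(10, 20, 30, 40, 50, 60))

# .SDcols restricts .SD to mycols; unname() drops the names y and z so the
# columns are matched to arg1 and arg2 by position.
res <- DT[, do.call(fname, unname(as.list(.SD))), by = x, .SDcols = mycols]
print(res)
```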

Writing functions (procedures) for data.table objects

Yes, the addition, modification, deletion of columns in data.tables is done by reference. In a sense, it is a good thing because a data.table usually holds a lot of data, and it would be very memory and time consuming to reassign it all every time a change to it is made. On the other hand, it is a bad thing because it goes against the no-side-effect functional programming approach that R tries to promote by using pass-by-value by default. With no-side-effect programming, there is little to worry about when you call a function: you can rest assured that your inputs or your environment won't be affected, and you can just focus on the function's output. It's simple, hence comfortable.

Of course it is OK to disregard John Chambers's advice if you know what you are doing. About writing "good" data.table procedures, here are a couple of rules I would consider if I were you, as a way to limit complexity and the number of side-effects:

  • a function should not modify more than one table, i.e., modifying that table should be the only side-effect,
  • if a function modifies a table, then make that table the output of the function. Of course, you won't want to re-assign it: just run do.something.to(table) and not table <- do.something.to(table). If instead the function had another ("real") output, then when calling result <- do.something.to(table), it is easy to imagine how you may focus your attention on the output and forget that calling the function had a side effect on your table.

While "one output / no-side-effect" functions are the norm in R, the above rules allow for "one output or side-effect". If you agree that a side-effect is somehow a form of output, then you'll agree I am not bending the rules too much by loosely sticking to R's one-output functional programming style. Allowing functions to have multiple side-effects would be a little more of a stretch; not that you can't do it, but I would try to avoid it if possible.
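
A minimal sketch of a procedure that follows these two rules, assuming a made-up table tbl and a hypothetical function add_total: it modifies exactly one table by reference and returns that same table invisibly.

```r
library(data.table)

# Hypothetical procedure: its only side effect is adding a column to `dt`,
# and the modified table is also its (invisible) return value.
add_total <- function(dt) {
  dt[, total := x + y]   # := adds the column by reference
  invisible(dt)
}

tbl <- data.table(x = 1:3, y = 4:6)

add_total(tbl)           # no re-assignment needed
print(tbl)
```

Returning the table invisibly keeps the call quiet at the console while still allowing chaining such as add_total(tbl)[, mean(total)].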

r data.table usage in function call

One possibility is to define your own re-leveling function using data.table::setattr that will modify dt in place. Something like:

DTsetlvls <- function(x, newl)
  setattr(x, "levels", c(setdiff(levels(x), newl), rep("other", length(newl))))

Then use it within another predefined function

f <- function(variableName, min.freq) {
  fail.min.f <- dt[, .N, by = variableName][N < min.freq, get(variableName)]
  dt[, DTsetlvls(get(variableName), fail.min.f)]
  invisible()
}

f("type", min.freq)
levels(dt$type)
# [1] "C" "other"

Some other data.table alternatives

f <- function(var, min.freq) {
  fail.min.f <- dt[, .N, by = var][N < min.freq, get(var)]
  dt[get(var) %in% fail.min.f, (var) := "Other"]
  dt[, (var) := factor(get(var))]
}

Or using set/.I

f <- function(var, min.freq) {
  fail.min.f <- dt[, .I[.N < min.freq], by = var]$V1
  set(dt, fail.min.f, var, "other")
  set(dt, NULL, var, factor(dt[[var]]))
}

Or combining with base R (doesn't modify original data set)

f <- function(df, variableName, min.freq) {
  fail.min.f <- df[, .N, by = variableName][N < min.freq, get(variableName)]
  levels(df$type)[fail.min.f] <- "Other"
  df
}

Alternatively, we could stick with characters instead. If type is a character column, you could simply do:

f <- function(var, min.freq) dt[, (var) := if(.N < min.freq) "other", by = var]
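
For reference, here is a self-contained sketch of the set/.I idea on a character column. It differs from the versions above in that the table is passed in as an argument instead of being taken from the global environment; the function name lump_rare and the example data are assumptions.

```r
library(data.table)

# Hypothetical helper: replace values occurring fewer than min.freq times
# with "other", modifying `dt` by reference.
lump_rare <- function(dt, var_name, min.freq) {
  # .I holds row numbers; keep those of groups with fewer than min.freq rows
  rare_rows <- dt[, .I[.N < min.freq], by = var_name]$V1
  if (length(rare_rows)) {             # skip when no value is rare
    set(dt, i = rare_rows, j = var_name, value = "other")
  }
  invisible(dt)
}

dt <- data.table(type = c("A", "B", "C", "C", "C", "C"))   # made-up data
lump_rare(dt, "type", 2)
print(dt)
```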

Is it possible to call a function inside a data.table operation?

You seem to be a bit confused about what you're doing. In data.table, the second argument is an expression (unlike ddply's 3rd argument, which is a function) - and right now you just gave it an anonymous function.

No reproducible data in OP to test, but my guess is you simply want:

dt[, {
  m1 <- nls(form, data = .SD, start = s)
  y.pred <- predict(m1, newdata = data.frame(x = x.range))
  list(x = x.range, y = y.pred)
},
by = list(ID1, ID2, ID3)]
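
Since the original data are not available, here is a hedged, runnable sketch of the same pattern. It swaps nls for lm so that no starting values are needed, and the table, the grouping column id, and x.range are all made up for the example.

```r
library(data.table)

dt <- data.table(id = rep(c("g1", "g2"), each = 4),
                 x  = rep(1:4, 2),
                 y  = c(2, 4, 6, 8, 1, 4, 7, 10))   # g1: y = 2x, g2: y = 3x - 2
x.range <- c(5, 6)                                   # points to predict at

# j is a braced expression evaluated once per group, not a function;
# its last value (a list) becomes the columns of the result.
pred <- dt[, {
  m1 <- lm(y ~ x, data = .SD)          # lm used here instead of nls
  y.pred <- predict(m1, newdata = data.frame(x = x.range))
  list(x = x.range, y = y.pred)
}, by = id]

print(pred)
```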

How to run a function inside data.table?

We need to name the pattern argument if we are not using an anonymous function, so that each column is passed to grepl as x:

my[,lapply(.SD, grepl, pattern = patt)]

Or otherwise with an anonymous function call

my[,lapply(.SD, function(x) grepl(patt, x))]
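
A small sketch showing that the two forms give the same result. The table my and the pattern patt are not shown in the answer, so the values below are assumptions.

```r
library(data.table)

my <- data.table(a = c("apple", "banana"),
                 b = c("cherry", "apricot"))   # made-up data
patt <- "ap"                                    # made-up pattern

# Named argument: each column is passed to grepl() as `x`
res_named <- my[, lapply(.SD, grepl, pattern = patt)]

# Anonymous function: the column is placed explicitly
res_anon <- my[, lapply(.SD, function(x) grepl(patt, x))]

print(res_named)
```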

Use data.table within another function in R

If you want to use non-standard evaluation, you need something like substitute. However, there is absolutely no reason for using parse.

addColumnsError <- function(dt, v1, v2) {
  eval(substitute(dt[, v1 + v2]))
}

addColumnsError(dt, var1, var2)
#[1] 3 6 9 12 15 18 21 24 27 30
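
To run this end to end, a table is needed. The sketch below assumes a dt whose columns var1 and var2 are chosen so that the sums match the output above; the actual data from the question are not shown, and the function name addColumns is only illustrative.

```r
library(data.table)

# substitute() replaces dt, v1 and v2 with the expressions supplied by the
# caller, so the expression that gets evaluated is dt[, var1 + var2].
addColumns <- function(dt, v1, v2) {
  eval(substitute(dt[, v1 + v2]))
}

dt <- data.table(var1 = 1:10, var2 = seq(2, 20, by = 2))   # assumed data

res <- addColumns(dt, var1, var2)
print(res)
```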

