Writing Functions (Procedures) for data.table Objects

Yes, adding, modifying, and deleting columns in a data.table is done by reference. In a sense this is a good thing, because a data.table usually holds a lot of data, and copying all of it every time a change is made would cost a lot of memory and time. On the other hand, it is a bad thing because it goes against the no-side-effect functional programming approach that R tries to promote by using pass-by-value semantics by default. With no-side-effect programming, there is little to worry about when you call a function: you can rest assured that your inputs and your environment won't be affected, and you can just focus on the function's output. It's simple, hence comfortable.

Of course, it is OK to disregard John Chambers's advice if you know what you are doing. As for writing "good" data.table procedures, here are a couple of rules I would consider if I were you, as a way to limit complexity and the number of side effects:

  • a function should not modify more than one table, i.e., modifying that table should be the only side-effect,
  • if a function modifies a table, then make that table the output of the function. Of course, you won't want to re-assign it: just run do.something.to(table) and not table <- do.something.to(table). If instead the function had another ("real") output, then when calling result <- do.something.to(table), it is easy to imagine how you may focus your attention on the output and forget that calling the function had a side effect on your table.

While "one output / no-side-effect" functions are the norm in R, the above rules allow for "one output or side-effect". If you agree that a side-effect is somehow a form of output, then you'll agree I am not bending the rules too much by loosely sticking to R's one-output functional programming style. Allowing functions to have multiple side-effects would be a little more of a stretch; not that you can't do it, but I would try to avoid it if possible.
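To make the two rules concrete, here is a minimal sketch. The helper name add_total and the columns a, b and total are made up for illustration. The function touches exactly one table and hands that same table back invisibly, so the side effect is also the output.

```r
library(data.table)

# Hypothetical helper: modifies exactly one table, by reference,
# and returns that same table invisibly as its only "output".
add_total <- function(dt) {
  stopifnot(is.data.table(dt))
  dt[, total := a + b]  # the single side effect
  invisible(dt)
}

dt <- data.table(a = 1:3, b = 4:6)
add_total(dt)  # no re-assignment needed
names(dt)      # the new column is visible in the caller's table
```

Because the table itself is the return value, the call still chains if needed, e.g. add_total(dt)[total > 5].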

Using by in a function with data.table

We can pass by as a vector of strings:

onewayfn <- function(df, x, weight = NULL, displacement = NULL, by = NULL) {
  # capture the unquoted column arguments as strings
  .x <- deparse(substitute(x))
  .weight <- deparse(substitute(weight))
  .displacement <- deparse(substitute(displacement))
  # .by <- deparse(substitute(by)) # Does not work with multiple variables!

  # columns to sum; drop arguments left at their NULL default
  cols <- c(.weight, .displacement)
  cols <- cols[cols != "NULL"]

  # grouping columns: x plus the character vector passed in by
  .xby <- c(.x, by)
  .xby <- .xby[.xby != "NULL"]

  data.table::data.table(df)[, lapply(.SD, sum, na.rm = TRUE), by = .xby, .SDcols = cols][]
}

-testing

onewayfn(mtcars, cyl, weight = wt, displacement = disp, by = c("am","vs"))

#   cyl am vs     wt   disp
#1:   6  1  0  8.265  465.0
#2:   4  1  1 14.198  628.6
#3:   6  0  1 13.555  818.2
#4:   8  0  0 49.249 4291.4
#5:   4  0  1  8.805  407.6
#6:   4  1  0  2.140  120.3
#7:   8  1  0  6.740  652.0

Another option is to build the call as a string and evaluate it:

onewayfn <- function(df, x, weight = NULL, displacement = NULL, by = NULL) {
  # capture every argument, including by, as unevaluated text
  dfname <- deparse(substitute(df))
  .x <- deparse(substitute(x))
  .weight <- deparse(substitute(weight))
  .displacement <- deparse(substitute(displacement))
  .by <- deparse(substitute(by)) # e.g. "am" or "list(am, vs)"

  # build the .SDcols text, e.g. c("wt","disp")
  cols <- c(.weight, .displacement)
  cols <- cols[cols != "NULL"]
  cols <- paste(dQuote(cols, FALSE), collapse = ",")
  cols <- glue::glue("c({cols})")
  # strip the list( ) wrapper so only the bare column names remain
  .by <- gsub("list\\(|\\)", "", .by)
  .xby <- c(.x, .by)
  .xby <- .xby[.xby != "NULL"]
  # quote each name to get e.g. c('cyl', 'am', 'vs')
  .xby1 <- paste0("c(", gsub("(\\w+)", "'\\1'", toString(.xby)), ")")
  str1 <- glue::glue('data.table::data.table({dfname})[, lapply(.SD, sum, na.rm = TRUE), by = {.xby1}, .SDcols = {cols}][]')
  print(str1)
  eval(parse(text = str1))
}

-testing

onewayfn(mtcars, cyl, weight = wt, displacement = disp, by = list(am, vs))
#data.table::data.table(mtcars)[, lapply(.SD, sum, na.rm = TRUE), by = c('cyl', 'am', 'vs'), .SDcols = c("wt","disp")][]
#   cyl am vs     wt   disp
#1:   6  1  0  8.265  465.0
#2:   4  1  1 14.198  628.6
#3:   6  0  1 13.555  818.2
#4:   8  0  0 49.249 4291.4
#5:   4  0  1  8.805  407.6
#6:   4  1  0  2.140  120.3
#7:   8  1  0  6.740  652.0

onewayfn(mtcars, cyl, weight = wt, displacement = disp, by = am)
#data.table::data.table(mtcars)[, lapply(.SD, sum, na.rm = TRUE), by = c('cyl', 'am'), .SDcols = c("wt","disp")][]
#   cyl am     wt   disp
#1:   6  1  8.265  465.0
#2:   4  1 16.338  748.9
#3:   6  0 13.555  818.2
#4:   8  0 49.249 4291.4
#5:   4  0  8.805  407.6
#6:   8  1  6.740  652.0

Do I need to use copy() with data.table objects inside a function?

The answer linked by @Henrik (https://stackoverflow.com/a/10226454/4468078) explains all the details needed to answer your question.

This (modified) version of your example function does not modify the passed data.table:

library(data.table)
dt <- data.table(id = 1:4, a = LETTERS[1:4])
myfun2 <- function(mydata) {
  x <- mydata[, .(newcolumn = .N), by = id]
  setnames(x, "newcolumn", "Count")
  return(table(x$Count))
}
myfun2(dt)

This does not copy the whole data.table (which would be a waste of RAM and CPU time) but only writes the result of the aggregation into a new data.table which you can modify without side effects (= no changes of the original data.table).

> str(dt)
Classes ‘data.table’ and 'data.frame': 4 obs. of 2 variables:
$ id: int 1 2 3 4
$ a : chr "A" "B" "C" "D"

A data.table is always passed by reference to a function, so you have to be careful not to modify it unless you are absolutely sure you want to do this.

The data.table package was designed exactly for this: modifying data in place, without the usual "COW" ("copy on (first) write") principle, to support efficient data manipulation.

"Dangerous" operations that modify a data.table are mainly:

  • := assignment to modify or create a new column "in-place"
  • all set* functions

If you don't want to modify a data.table you can use just row filters, and column (selection) expressions (i, j, by etc. arguments).

Chaining also prevents modification of the original data.table if you modify "by ref" in the second (or later) link of the chain, because the first step already returns a new data.table:

myfun3 <- function(mydata) {
  # chaining also creates a copy
  return(mydata[id < 3,][, a := "not overwritten outside"])
}

myfun3(dt)
# > str(dt)
# Classes ‘data.table’ and 'data.frame': 4 obs. of 2 variables:
# $ id: int 1 2 3 4
# $ a : chr "A" "B" "C" "D"
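
For the case where you do need := inside a function but must not touch the caller's table, an explicit copy() is the usual way out. The following is only a sketch: the function names flag_in_place and flag_on_copy and the column flag are made up for this illustration.

```r
library(data.table)

dt <- data.table(id = 1:4, a = LETTERS[1:4])

# Modifies the caller's table, because := works by reference
flag_in_place <- function(mydata) {
  mydata[, flag := id > 2]
  invisible(mydata)
}

# Leaves the caller's table untouched by working on an explicit copy
flag_on_copy <- function(mydata) {
  out <- copy(mydata)
  out[, flag := id > 2]
  out[]
}

res <- flag_on_copy(dt)
cols_after_copy <- names(dt)      # dt has not gained a column here
flag_in_place(dt)
cols_after_in_place <- names(dt)  # now dt carries the flag column
```

The copy costs memory proportional to the table, so it is worth paying only when the function really has to add or change columns on its input.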

A simple reproducible example to pass arguments to data.table in a self-defined function in R

If we are using unquoted arguments, substitute and evaluate:

zz <- function(data, var, group) {
  var <- substitute(var)
  group <- substitute(group)
  setnames(data[, sum(eval(var)), by = group],
           c(deparse(group), deparse(var)))[]
  # or use
  # setnames(data[, sum(eval(var)), by = c(deparse(group))], 2, deparse(var))[]
}
# data must be a data.table, so convert mtcars first
zz(as.data.table(mtcars), mpg, gear)
#   gear   mpg
#1:    4 294.4
#2:    3 241.6
#3:    5 106.9
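
If quoted column names are acceptable, the substitute/eval step can be avoided altogether. Here is a sketch of that variant, not the approach from the answer above. The names sum_by, value_col and group_col are hypothetical. It relies on by and .SDcols both accepting character vectors of column names.

```r
library(data.table)

# Hypothetical helper: column names are passed as strings
sum_by <- function(data, value_col, group_col) {
  stopifnot(is.character(value_col), is.character(group_col))
  dt <- as.data.table(data)  # also accepts a plain data.frame
  # by takes the character vector of grouping columns,
  # .SDcols selects the columns to sum
  dt[, lapply(.SD, sum, na.rm = TRUE), by = group_col, .SDcols = value_col]
}

res <- sum_by(mtcars, "mpg", "gear")
res
```

The result keeps the original column names (gear and mpg here), so no setnames() call is needed afterwards.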

Subset data.table to represent the connections of objects

x1 <- c(1,1,1,2,2,2,3,3,3,4,5,6)
x2 <- c(1,2,3,1,2,3,1,2,3,4,6,5)
dt <- data.frame(x1, x2)

dt$x3 <- dt$x1
dt
for (i in 1:nrow(dt)) {
  if (dt$x3[i] != dt$x2[i]) {
    dt$x3[dt$x3 == dt$x2[i]] <- dt$x3[i]
  }
}
setDT(dt)[, id := .GRP, by = x3]
dt

  1. Create x3 as a duplicate of x1.
  2. Iterate through x3 and check whether each element differs from the corresponding element of x2.
  3. If it differs, replace all elements of x3 that equal the x2 element just checked with the current value of x3.
  4. Convert to a data.table with setDT and assign the IDs with .GRP, grouped by x3.
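
The same steps can be wrapped in a small function so that they are reusable. This is just a sketch of the loop above; the name label_groups is made up here. It is a simple relabelling heuristic rather than a general connected-components algorithm.

```r
library(data.table)

# Hypothetical wrapper around the relabelling loop
label_groups <- function(x1, x2) {
  x3 <- x1
  for (i in seq_along(x3)) {
    if (x3[i] != x2[i]) {
      # everything labelled like x2[i] takes over the current label
      x3[x3 == x2[i]] <- x3[i]
    }
  }
  out <- data.table(x1, x2, x3)
  out[, id := .GRP, by = x3][]  # .GRP numbers the groups 1, 2, 3, ...
}

x1 <- c(1,1,1,2,2,2,3,3,3,4,5,6)
x2 <- c(1,2,3,1,2,3,1,2,3,4,6,5)
res <- label_groups(x1, x2)
res$id
```

For the example vectors, the first nine rows end up in group 1, the row with 4 in group 2, and the 5/6 pair in group 3.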

How to chain together a mix of data.table and base r functions?

If I understand correctly, the OP wants to

  • rename column Values_1 to Values (or in OP's words: create new column "Values", which equals "Values_1")
  • drop column Values_2
  • replace all occurrences of XX by HI in column State

Here is what I would do in data.table syntax:

setDT(data)[, State := ifelse(rev(cumsum(rev(Values_1 + Values_2))), State, "XX"), ID][
, Values_2 := NULL][
State == "XX", State := "HI"][]
setnames(data, "Values_1", "Values")
data
    ID Period Values State
 1:  1      1      5    X0
 2:  1      2      0    X1
 3:  1      3      0    X2
 4:  1      4      0    X1
 5:  2      1      1    X0
 6:  2      2      0    HI
 7:  2      3      0    HI
 8:  2      4      0    HI
 9:  3      1      0    X2
10:  3      2      0    X1
11:  3      3      0    X9
12:  3      4      0    X3
13:  4      1      1    X2
14:  4      2      2    X1
15:  4      3      3    X9
16:  4      4      0    HI

setnames() updates by reference, i.e., without copying. There is no need to create a copy of Values_1 and delete Values_1 later on.

Also, [State == "XX", State := "HI"] replaces XX by HI only in affected rows by reference while

[, State := gsub('XX','HI', State)] replaces the whole column.
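
A tiny sketch of the two variants side by side, on made-up data (the table d is hypothetical). Both give the same column, but the first only writes to the matching rows.

```r
library(data.table)

d <- data.table(State = c("X0", "XX", "X1", "XX"))

# sub-assignment: only rows where State == "XX" are written
d1 <- copy(d)[State == "XX", State := "HI"]

# whole-column replacement: every value is passed through gsub()
d2 <- copy(d)[, State := gsub("XX", "HI", State)]

same_result <- identical(d1$State, d2$State)
same_result
```

copy() is used here only so that both variants start from the same untouched input.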

data.table chaining is used where appropriate.

BTW: I wonder why the replacement of XX by HI cannot be done right away in the first statement:

setDT(data)[, State := ifelse(rev(cumsum(rev(Values_1 + Values_2))), State, "HI"), ID][
, Values_2 := NULL][]
setnames(data, "Values_1", "Values")

Pass variable name as argument inside data.table

Generally, quote and eval will work:

library(data.table)
plus <- function(x, y) {
  x + y
}

add_one <- function(data, col) {
  # template expression; col is a placeholder to be substituted
  expr0 = quote(copy(data)[, z := plus(col, 1)][])

  # replace col with the symbol the caller passed in
  expr = do.call(substitute, list(expr0, list(col = substitute(col))))
  cat("Evaluated expression:\n"); print(expr); cat("\n")

  eval(expr)
}

set.seed(1)
library(magrittr)
data.table(x = 1:10, y = rnorm(10)) %>%
  add_one(y)

which gives

Evaluated expression:
copy(data)[, `:=`(z, plus(y, 1))][]

     x          y         z
 1:  1 -0.6264538 0.3735462
 2:  2  0.1836433 1.1836433
 3:  3 -0.8356286 0.1643714
 4:  4  1.5952808 2.5952808
 5:  5  0.3295078 1.3295078
 6:  6 -0.8204684 0.1795316
 7:  7  0.4874291 1.4874291
 8:  8  0.7383247 1.7383247
 9:  9  0.5757814 1.5757814
10: 10 -0.3053884 0.6946116
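
If the column name can be passed as a string instead, a simpler route is possible. This is a sketch, not the approach above. The name add_one_chr is hypothetical, and it uses set() together with [[ ]] instead of building an expression.

```r
library(data.table)

plus <- function(x, y) x + y

# Hypothetical variant: col is a character column name
add_one_chr <- function(data, col) {
  out <- copy(data)  # keep the caller's table unchanged
  # set() adds column z by reference to the copy
  set(out, j = "z", value = plus(out[[col]], 1))
  out[]
}

dt_in <- data.table(x = 1:3, y = c(0.5, 1.5, 2.5))
dt_out <- add_one_chr(dt_in, "y")
dt_out
```

The trade-off is that callers have to quote the column name, but there is no quote/substitute/eval machinery to debug.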

Passing unquoted function arguments to i in data.table

There may be other (better) options, but you can wrap it in tryCatch and use bquote for the unquoted argument:

test.function <- function(my.dt, ...){
  where <- tryCatch(parse(text = paste0(list(...))),
                    error = function (e) parse(text = paste0(list(bquote(...)))))
  my.dt <- my.dt[eval(where), ]
  return(my.dt)
}

tmp <- test.function(x, 'x==3 | sex=="F"')
head(tmp)
   x sex
1: 1   F
2: 2   F
3: 3   F
4: 3   M
5: 3   F

tmp <- test.function(x, x==3 | sex=='F')
head(tmp)
   x sex
1: 1   F
2: 2   F
3: 3   F
4: 3   M
5: 3   F
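
If only the unquoted form is needed, one of those other options is to capture the condition with substitute() and evaluate it against the table's columns using base R's eval(). This is a sketch under that assumption; the function name filter_rows and the table people are made up for illustration.

```r
library(data.table)

# Hypothetical helper: cond is an unquoted filter expression
filter_rows <- function(my.dt, cond) {
  cond <- substitute(cond)  # capture without evaluating
  # look up column names in the table first, then in the caller's frame
  keep <- eval(cond, envir = my.dt, enclos = parent.frame())
  my.dt[which(keep)]
}

people <- data.table(x = c(1, 2, 3, 3, 4),
                     sex = c("F", "M", "M", "F", "M"))
res_rows <- filter_rows(people, x == 3 | sex == "F")
res_rows
```

Using which() drops rows where the condition evaluates to NA, and the result is a new data.table, so the input is not modified.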

