Writing functions (procedures) for data.table objects
Yes, the addition, modification, and deletion of columns in data.tables is done by reference. In a sense, it is a good thing because a data.table usually holds a lot of data, and it would be very memory- and time-consuming to reassign it all every time a change is made to it. On the other hand, it is a bad thing because it goes against the no-side-effect functional programming approach that R tries to promote by using pass-by-value by default. With no-side-effect programming, there is little to worry about when you call a function: you can rest assured that your inputs or your environment won't be affected, and you can just focus on the function's output. It's simple, hence comfortable.
Of course it is OK to disregard John Chambers's advice if you know what you are doing. About writing "good" data.table procedures, here are a couple of rules I would consider if I were you, as a way to limit complexity and the number of side-effects:
- a function should not modify more than one table, i.e., modifying that table should be the only side-effect,
- if a function modifies a table, then make that table the output of the function. Of course, you won't want to re-assign it: just run do.something.to(table) and not table <- do.something.to(table). If instead the function had another ("real") output, then when calling result <- do.something.to(table), it is easy to imagine how you may focus your attention on the output and forget that calling the function had a side effect on your table.
While "one output / no-side-effect" functions are the norm in R, the above rules allow for "one output or side-effect". If you agree that a side-effect is somehow a form of output, then you'll agree I am not bending the rules too much by loosely sticking to R's one-output functional programming style. Allowing functions to have multiple side-effects would be a little more of a stretch; not that you can't do it, but I would try to avoid it if possible.
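To make the two rules concrete, here is a minimal sketch. The function name add_total and the columns a and b are made up for illustration; the only point is that the function has a single side effect (on one table) and returns that same table invisibly, so there is no "real" output to distract from it.

```r
library(data.table)

# Hypothetical helper: its only side effect is adding `total` to `dt`
# by reference, and the modified table is also its (invisible) output.
add_total <- function(dt) {
  stopifnot(is.data.table(dt))
  dt[, total := a + b]
  invisible(dt)
}

dt <- data.table(a = 1:3, b = 4:6)
add_total(dt)   # no re-assignment needed
names(dt)       # `total` is now part of dt
```

Returning the table invisibly keeps the call usable in a chain without printing the table on every call.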
Using by in a function with data.table
We can pass by as a vector of strings:
onewayfn <- function(df, x, weight = NULL, displacement = NULL, by = NULL){
  # capture the unquoted column arguments as strings
  .x <- deparse(substitute(x))
  .weight <- deparse(substitute(weight))
  .displacement <- deparse(substitute(displacement))
  #.by <- deparse(substitute(by)) # Does not work with multiple variables!
  cols <- c(.weight, .displacement)
  cols <- cols[cols != "NULL"]
  .xby <- c(.x, by)
  .xby <- .xby[.xby != "NULL"]
  data.table::data.table(df)[, lapply(.SD, sum, na.rm = TRUE), by = .xby, .SDcols = cols][]
}
-testing
onewayfn(mtcars, cyl, weight = wt, displacement = disp, by = c("am","vs"))
# cyl am vs wt disp
#1: 6 1 0 8.265 465.0
#2: 4 1 1 14.198 628.6
#3: 6 0 1 13.555 818.2
#4: 8 0 0 49.249 4291.4
#5: 4 0 1 8.805 407.6
#6: 4 1 0 2.140 120.3
#7: 8 1 0 6.740 652.0
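If every column can be given as a string, the function gets simpler still. The following is only a sketch under that assumption; sum_by, value_cols and by_cols are names I made up, not part of the question.

```r
library(data.table)

# Hypothetical variant: all column names are passed as character vectors,
# so no substitute()/deparse() is needed.
sum_by <- function(df, value_cols, by_cols) {
  as.data.table(df)[, lapply(.SD, sum, na.rm = TRUE),
                    by = by_cols, .SDcols = value_cols]
}

res <- sum_by(mtcars, value_cols = c("wt", "disp"), by_cols = c("cyl", "am"))
res
```

The trade-off is that the caller has to quote the names, but the body no longer depends on how the arguments were written at the call site.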
Or another option is to evaluate a string:
onewayfn <- function(df, x, weight = NULL, displacement = NULL, by = NULL){
  dfname <- deparse(substitute(df))
  .x <- deparse(substitute(x))
  .weight <- deparse(substitute(weight))
  .displacement <- deparse(substitute(displacement))
  # by = list(am, vs) deparses to "list(am, vs)"; the wrapper is stripped below
  .by <- deparse(substitute(by))
  cols <- c(.weight, .displacement)
  cols <- cols[cols != "NULL"]
  cols <- paste(dQuote(cols, FALSE), collapse=",")
  cols <- glue::glue("c({cols})")
  .by <- gsub("list\\(|\\)", "", .by)
  .xby <- c(.x, .by)
  .xby <- .xby[.xby != "NULL"]
  # quote every column name and wrap the result in c(...)
  .xby1 <- paste0("c(", gsub("(\\w+)", "'\\1'", toString(.xby)), ")")
  str1 <- glue::glue('data.table::data.table({dfname})[, lapply(.SD, sum, na.rm = TRUE), by = {.xby1}, .SDcols = {cols}][]')
  print(str1)
  eval(parse(text = str1))
}
-testing
onewayfn(mtcars, cyl, weight = wt, displacement = disp, by = list(am, vs))
#data.table::data.table(mtcars)[, lapply(.SD, sum, na.rm = TRUE), by = c('cyl', 'am', 'vs'), .SDcols = c("wt","disp")][]
# cyl am vs wt disp
#1: 6 1 0 8.265 465.0
#2: 4 1 1 14.198 628.6
#3: 6 0 1 13.555 818.2
#4: 8 0 0 49.249 4291.4
#5: 4 0 1 8.805 407.6
#6: 4 1 0 2.140 120.3
#7: 8 1 0 6.740 652.0
onewayfn(mtcars, cyl, weight = wt, displacement = disp, by = am)
#data.table::data.table(mtcars)[, lapply(.SD, sum, na.rm = TRUE), by = c('cyl', 'am'), .SDcols = c("wt","disp")][]
# cyl am wt disp
#1: 6 1 8.265 465.0
#2: 4 1 16.338 748.9
#3: 6 0 13.555 818.2
#4: 8 0 49.249 4291.4
#5: 4 0 8.805 407.6
#6: 8 1 6.740 652.0
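As a possible middle ground between the two versions, the unquoted by can also be turned into column names with all.vars(), which avoids building and parsing a string. This is a sketch only; oneway_sum is a name I made up, and it assumes by is either a single bare column name or a list(...) of bare column names.

```r
library(data.table)

# Hypothetical variant: all.vars() extracts the variable names from the
# unevaluated `by` expression, e.g. list(am, vs) gives c("am", "vs").
oneway_sum <- function(df, x, weight, displacement, by = NULL) {
  x_name     <- deparse(substitute(x))
  value_cols <- c(deparse(substitute(weight)),
                  deparse(substitute(displacement)))
  by_names   <- all.vars(substitute(by))   # character(0) when by is NULL
  group_cols <- c(x_name, by_names)
  as.data.table(df)[, lapply(.SD, sum, na.rm = TRUE),
                    by = group_cols, .SDcols = value_cols]
}

res_two  <- oneway_sum(mtcars, cyl, wt, disp, by = list(am, vs))
res_one  <- oneway_sum(mtcars, cyl, wt, disp, by = am)
res_none <- oneway_sum(mtcars, cyl, wt, disp)
```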
Do I need to use copy() with data.table objects inside a function?
The linked answer from @Henrik to https://stackoverflow.com/a/10226454/4468078 does explain all details to answer your question.
This (modified) version of your example function does not modify the passed data.table:
library(data.table)
dt <- data.table(id = 1:4, a = LETTERS[1:4])
myfun2 <- function(mydata) {
  # the aggregation result is a new data.table; mydata itself is untouched
  x <- mydata[, .(newcolumn = .N), by=id]
  setnames(x, "newcolumn", "Count")
  return(table(x$Count))
}
myfun2(dt)
This does not copy the whole data.table (which would be a waste of RAM and CPU time) but only writes the result of the aggregation into a new data.table which you can modify without side effects (= no changes of the original data.table).
> str(dt)
Classes ‘data.table’ and 'data.frame': 4 obs. of 2 variables:
$ id: int 1 2 3 4
$ a : chr "A" "B" "C" "D"
A data.table is always passed by reference to a function, so you have to be careful not to modify it unless you are absolutely sure you want to do this.
The data.table package was designed exactly for this efficient way of modifying data without the usual "COW" ("copy on (first) write") principle to support efficient data manipulation.
"Dangerous" operations that modify a data.table are mainly:
- := assignment to modify or create a new column "in-place"
- all set* functions
If you don't want to modify a data.table you can use just row filters and column (selection) expressions (the i, j, by etc. arguments).
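The difference is easy to see side by side. In this small sketch the function names and the flag column are made up for illustration: the first function uses := directly on its argument, the second one works on an explicit copy().

```r
library(data.table)

# Hypothetical function that modifies its argument by reference
add_flag_by_ref <- function(mydata) {
  mydata[, flag := TRUE]
  invisible(mydata)
}

# Hypothetical function that works on an explicit copy instead
add_flag_on_copy <- function(mydata) {
  out <- copy(mydata)
  out[, flag := TRUE]
  out[]
}

dt1 <- data.table(id = 1:4, a = LETTERS[1:4])

res <- add_flag_on_copy(dt1)
untouched_after_copy <- !("flag" %in% names(dt1))   # original has no flag column

add_flag_by_ref(dt1)
changed_after_ref <- "flag" %in% names(dt1)         # original now has flag
```

So copy() is only needed when the function uses := or a set* function on its argument and the caller's table must stay as it was.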
Chaining also prevents modification of the original data.table if you modify "by ref" in the second (or later) link of the chain; here the row filter in the first link already returns a new data.table:
myfun3 <- function(mydata) {
  # the row filter in the first link returns a new data.table, so := does not touch mydata
  return(mydata[id < 3,][, a := "not overwritten outside"])
}
myfun3(dt)
# > str(dt)
# Classes ‘data.table’ and 'data.frame': 4 obs. of 2 variables:
# $ id: int 1 2 3 4
# $ a : chr "A" "B" "C" "D"
A simple reproducible example to pass arguments to data.table in a self-defined function in R
If we are using unquoted arguments, substitute and evaluate:
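library(data.table)
zz <- function(data, var, group){
  # capture the unquoted column names as symbols
  var <- substitute(var)
  group <- substitute(group)
  setnames(data[, sum(eval(var)), by = group],
           c(deparse(group), deparse(var)))[]
  # or use
  # setnames(data[, sum(eval(var)), by = c(deparse(group))], 2, deparse(var))[]
}
# mtcars is a data.frame, so convert it to a data.table first
zz(as.data.table(mtcars), mpg, gear)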
# gear mpg
#1: 4 294.4
#2: 3 241.6
#3: 5 106.9
Subset data.table to represent the connections of objects
library(data.table)
x1 <- c(1,1,1,2,2,2,3,3,3,4,5,6)
x2 <- c(1,2,3,1,2,3,1,2,3,4,6,5)
dt <- data.frame(x1, x2)
dt$x3 <- dt$x1
dt
for (i in seq_len(nrow(dt))) {
  if (dt$x3[i] != dt$x2[i]) {
    dt$x3[dt$x3 == dt$x2[i]] <- dt$x3[i]
  }
}
setDT(dt)[, id := .GRP, by = x3]
dt
- Create x3 as a duplicate of x1
- Iterate through x3 and check whether each element differs from the corresponding element of x2
- If it differs, replace all elements of x3 that are equal to the x2 element just checked with the current value of x3
- Assign IDs per group of x3 with setDT() and .GRP
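The steps above can also be wrapped in a function, which makes them easier to test. This is just a sketch: label_components is a name I made up, and it keeps the same single pass over the rows, so I have only checked it against the example data and not against arbitrary connection patterns.

```r
library(data.table)

# Hypothetical wrapper around the loop: x1 and x2 are the two ends of each
# connection; the returned `id` numbers the groups of connected objects.
label_components <- function(x1, x2) {
  x3 <- x1
  for (i in seq_along(x3)) {
    if (x3[i] != x2[i]) {
      # relabel everything currently labelled x2[i] with the label of row i
      x3[x3 == x2[i]] <- x3[i]
    }
  }
  data.table(x1, x2, x3)[, id := .GRP, by = x3][]
}

res <- label_components(c(1,1,1,2,2,2,3,3,3,4,5,6),
                        c(1,2,3,1,2,3,1,2,3,4,6,5))
res$id
# 1 1 1 1 1 1 1 1 1 2 3 3
```

Objects 1 to 3 end up in one group, 4 stays alone, and 5 and 6 share a group.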
How to chain together a mix of data.table and base r functions?
If I understand correctly, the OP wants to
- rename column Values_1 to Values (or in OP's words: create new column "Values", which equals "Values_1")
- drop column Values_2
- replace all occurrences of XX by HI in column State
Here is what I would do in data.table syntax:
setDT(data)[, State := ifelse(rev(cumsum(rev(Values_1 + Values_2))), State, "XX"), ID][
, Values_2 := NULL][
State == "XX", State := "HI"][]
setnames(data, "Values_1", "Values")
data
ID Period Values State
1: 1 1 5 X0
2: 1 2 0 X1
3: 1 3 0 X2
4: 1 4 0 X1
5: 2 1 1 X0
6: 2 2 0 HI
7: 2 3 0 HI
8: 2 4 0 HI
9: 3 1 0 X2
10: 3 2 0 X1
11: 3 3 0 X9
12: 3 4 0 X3
13: 4 1 1 X2
14: 4 2 2 X1
15: 4 3 3 X9
16: 4 4 0 HI
setnames() updates by reference, i.e., without copying. There is no need to create a copy of Values_1 and delete Values_1 later on.
Also, [State == "XX", State := "HI"] replaces XX by HI only in affected rows by reference while [, State := gsub('XX','HI', State)] replaces the whole column.
data.table chaining is used where appropriate.
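Both points can be tried on a tiny table. The data below is made up for illustration and is not the OP's data set.

```r
library(data.table)

# Made-up toy data with the same column names as in the question
states <- data.table(ID = c(1, 1, 2),
                     Values_1 = c(5, 0, 1),
                     State = c("X0", "XX", "XX"))

# sub-assign by reference: only rows where State is "XX" are written
states[State == "XX", State := "HI"]

# rename by reference: no copy of the column is made
setnames(states, "Values_1", "Values")

states
```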
BTW: I wonder why the replacement of XX by HI cannot be done right away in the first statement:
setDT(data)[, State := ifelse(rev(cumsum(rev(Values_1 + Values_2))), State, "HI"), ID][
, Values_2 := NULL][]
setnames(data, "Values_1", "Values")
Pass variable name as argument inside data.table
Generally, quote and eval will work:
library(data.table)
plus <- function(x, y) {
  x + y
}
add_one <- function(data, col) {
  # template expression; `col` is a placeholder that gets replaced below
  expr0 = quote(copy(data)[, z := plus(col, 1)][])
  # substitute the unquoted column name the caller passed in for `col`
  expr = do.call(substitute, list(expr0, list(col = substitute(col))))
  cat("Evaluated expression:\n"); print(expr); cat("\n")
  eval(expr)
}
set.seed(1)
library(magrittr)
data.table(x = 1:10, y = rnorm(10)) %>%
  add_one(y)
which gives
Evaluated expression:
copy(data)[, `:=`(z, plus(y, 1))][]
x y z
1: 1 -0.6264538 0.3735462
2: 2 0.1836433 1.1836433
3: 3 -0.8356286 0.1643714
4: 4 1.5952808 2.5952808
5: 5 0.3295078 1.3295078
6: 6 -0.8204684 0.1795316
7: 7 0.4874291 1.4874291
8: 8 0.7383247 1.7383247
9: 9 0.5757814 1.5757814
10: 10 -0.3053884 0.6946116
Passing unquoted function arguments to i in data.table
There may be other (better) options, but you can wrap it in tryCatch and use bquote for the unquoted argument:
test.function <- function(my.dt, ...){
  # try the argument as a string first; on error, fall back to bquote() for an unquoted condition
  where <- tryCatch(parse(text = paste0(list(...))),
                    error = function (e) parse(text = paste0(list(bquote(...)))))
  my.dt <- my.dt[eval(where), ]
  return(my.dt)
}
tmp <- test.function(x, 'x==3 | sex=="F"')
head(tmp)
x sex
1: 1 F
2: 2 F
3: 3 F
4: 3 M
5: 3 F
tmp <- test.function(x, x==3 | sex=='F')
head(tmp)
x sex
1: 1 F
2: 2 F
3: 3 F
4: 3 M
5: 3 F
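One of the other options hinted at above might be to capture the condition with substitute() instead of relying on an error. This is a sketch only: filter_dt, cond and the people table are names I made up, and it assumes the condition is written directly in the call, either unquoted or as a string literal (a variable holding a string is not handled).

```r
library(data.table)

# Hypothetical variant: substitute() returns the unevaluated argument, which
# is either a call (unquoted condition) or a character literal (quoted one).
filter_dt <- function(my.dt, cond) {
  cond <- substitute(cond)
  if (is.character(cond)) {
    cond <- parse(text = cond)[[1]]
  }
  my.dt[eval(cond)]
}

# made-up example data
people <- data.table(x = c(1, 2, 3, 3, 4),
                     sex = c("F", "F", "F", "M", "M"))

r1 <- filter_dt(people, x == 3 | sex == "F")
r2 <- filter_dt(people, 'x == 3 | sex == "F"')
```

Both calls should return the same four rows, and the original table is not modified because a row filter returns a new data.table.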