How to Use Data.Table Within Functions and Loops

How to use data.table within functions and loops?

This might not be the most data.table-like or the fastest solution, but I would streamline the code in this particular loop as follows:

for (nVarGroup in 2:4) {      # Grouped by several categorical values...
  for (nVarMeans in 5:10) {   # ... get means of all numerical parameters
    strGroupConditions <- levels(dt[[nVarGroup]])[-1]
    strVarGroup <- names(dt)[nVarGroup]
    strVarMeans <- names(dt)[nVarMeans]
    # qAction <- quote(mean(get(strVarMeans)))
    # qGroup  <- quote(get(strVarGroup) %in% strGroupConditions)
    # p <- dt[eval(qGroup), .(AVE = eval(qAction), COUNT = .N), by = strVarGroup]
    setkeyv(dt, strVarGroup)
    p <- dt[strGroupConditions, .(AVE = lapply(.SD, mean), COUNT = .N),
            by = strVarGroup, .SDcols = strVarMeans]

    print(sprintf("nVarGroup = %s, nVarMeans = %s", strVarGroup, strVarMeans))
    print(p)
  }
}

I've left the old code as comments for reference.

qAction is replaced by using lapply(.SD, mean) together with the .SDcols parameter.

qGroup for subsetting rows is replaced by the combination of setting a key and providing the vector of desired values as the i parameter.
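
For illustration, here is a minimal sketch of that pattern on a hypothetical table dtx (not part of the original example):

library(data.table)
dtx <- data.table(grp = c("a", "a", "b", "c"), val = 1:4)
setkey(dtx, grp)    # key the grouping column
dtx[c("b", "c")]    # the i vector selects rows whose key matches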


In case of a more complex subsetting expression, I would try to use non-equi (or conditional) joins with the on= syntax.
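
For example, a range filter like lo <= x <= hi can be written as a non-equi join; this is a minimal sketch with hypothetical tables dtx and limits:

library(data.table)
dtx    <- data.table(x = 1:10)
limits <- data.table(lo = 3L, hi = 7L)
dtx[limits, on = .(x >= lo, x <= hi), .(x = x.x)]   # x values with lo <= x <= hi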

Or, follow Matt Dowle's advice to create one expression to be evaluated, "similar to constructing a dynamic SQL statement to send to a server".

Matt suggested creating a helper function

EVAL <- function(...) eval(parse(text = paste0(...)), envir = parent.frame(2))

which can be combined with the "quasi-perl type string interpolation of fn$ from the gsubfn package to improve the readability of the EVAL solution" as suggested by G. Grothendieck.

With this, the code for the loop eventually becomes:

EVAL <- function(...) eval(parse(text = paste0(...)), envir = parent.frame(2))
library(gsubfn)

for (nVarGroup in 2:4) {      # Grouped by several categorical values...
  for (nVarMeans in 5:10) {   # ... get means of all numerical parameters
    strGroupConditions <- levels(dt[[nVarGroup]])[-1]
    strVarGroup <- names(dt)[nVarGroup]
    strVarMeans <- names(dt)[nVarMeans]
    p <- fn$EVAL("dt[$strVarGroup %in% strGroupConditions, .(AVE = mean($strVarMeans), COUNT = .N), by = strVarGroup]")

    print(sprintf("nVarGroup = %s, nVarMeans = %s", strVarGroup, strVarMeans))
    print(p)
  }
}

Now, the data.table statement looks pretty much like a "native" statement, except that $strVarGroup and $strVarMeans are used where the contents of the variables are referenced.


With version 1.1.0 (CRAN release on 2016-08-19), the stringr package gained a string interpolation function, str_interp(), which is an alternative to the gsubfn package here.

With str_interp(), the central statement in the for loop would become

p <- EVAL(stringr::str_interp(
  "dt[${strVarGroup} %in% strGroupConditions, .(AVE = mean(${strVarMeans}), COUNT = .N), by = strVarGroup]"
))

and the call to library(gsubfn) could be removed.

Perform for-loop inside data.table

Another approach:

dt1[, {
  gv <- value
  .SD[, {
    VT <- var.test(value, gv, alternative = "greater", ratio = 1)
    .(F.Stat = VT$statistic, SD.Ratio = sqrt(VT$estimate), P.value = VT$p.value)
  }, by = .(id)]
}, by = .(group)]

Timing code (it took too long to time the OP's approaches, so I left them out) and another approach that strips away the checks in var.test (use with caution):

mtd2 <- function() {
  dt1[, {
    gv <- value
    leng <- .N
    .SD[, {
      VT <- var.test(value, gv, alternative = "greater", ratio = 1)
      .(F.Stat = VT$statistic, SD.Ratio = sqrt(VT$estimate), P.value = VT$p.value)
    }, by = .(id)]
  }, by = .(group)]
}

mtd3 <- function() {
  dt1[, {
    gv <- value
    leng <- .N
    .SD[, {
      # see stats:::var.test.default
      ESTIMATE <- var(value) / var(gv)
      .(F.Stat = ESTIMATE, P.value = 1 - pf(ESTIMATE, .N - 1L, leng - 1L))
    }, by = .(id)]
  }, by = .(group)][, SD.Ratio := sqrt(F.Stat)][]
}

library(bench)
bench::mark(mtd2(), mtd3(), check=FALSE)

timings:

# A tibble: 2 x 14
  expression      min     mean   median      max `itr/sec` mem_alloc  n_gc n_itr total_time result                     memory                   time     gc
  <chr>      <bch:tm> <bch:tm> <bch:tm> <bch:tm>     <dbl> <bch:byt> <dbl> <int>   <bch:tm> <list>                     <list>                   <list>   <list>
1 mtd2()        2.75s    2.75s    2.75s    2.75s     0.364     1.9GB    54     1      2.75s <data.table [10,000 x 5]>  <Rprofmem [82,295 x 3]>  <bch:tm> <tibble [1 x 3]>
2 mtd3()        1.29s    1.29s    1.29s    1.29s     0.774    13.6MB     4     1      1.29s <data.table [10,000 x 5]>  <Rprofmem [2,432 x 3]>   <bch:tm> <tibble [1 x 3]>

data:

library(data.table)
set.seed(1234)   # reproducibility
nr <- 1e6
ng <- 100
dt1 <- data.table(id    = sample(ng, nr, TRUE),
                  group = sample(ng, nr, TRUE),
                  value = rnorm(nr))

Create tables by using data.table and a for loop for multiple columns


  • .data is not used in data.table.
  • You don't need select here, which is why you also don't need .[, .(id, columns[i], group)] in the data.table version.
  • You can use get to retrieve column values based on a string.

Since this is just an example, I have not tried to simplify the loop, so that you can add more complicated steps to it later.

library(data.table)

cols <- c("dv1", "dv2")

test <- function(data, columns, group) {
  for (i in columns) {
    sub1 <- data[, .(mean(get(i), na.rm = TRUE), sd = sd(get(i), na.rm = TRUE)), by = group]
    print(sub1)
  }
}

test(data = df, columns = cols, group = "year")

#    year   V1    sd
# 1: 2014 1.00    NA
# 2: 2015 3.67 1.528
# 3: 2016 6.00 3.000
# 4: <NA> 7.50 0.707

#    year   V1    sd
# 1: 2014 2.00    NA
# 2: 2015 4.67 1.528
# 3: 2016 7.00 3.000
# 4: <NA> 8.50 0.707

How to apply a custom recursive function with data.table and loop over each index group-wise?

Does this use of Reduce do the trick?

tmp <- data.table(
  grp = c(rep(0, 6), rep(1, 6)),
  x = c(10, 20, 30, 40, 50, 60, 1, 2, 3, 4, 5, 6),
  y = c(1, 2, 3, 4, 5, 6, 10, 20, 30, 40, 50, 60)
)
tmp[, z := Reduce(f = function(z, i) z + x[i - 1] - y[i - 1],
                  x = (1:.N)[-1],
                  init = 0,
                  accumulate = TRUE),
    by = grp]

Output:

    grp  x  y    z
 1:   0 10  1    0
 2:   0 20  2    9
 3:   0 30  3   27
 4:   0 40  4   54
 5:   0 50  5   90
 6:   0 60  6  135
 7:   1  1 10    0
 8:   1  2 20   -9
 9:   1  3 30  -27
10:   1  4 40  -54
11:   1  5 50  -90
12:   1  6 60 -135

Take row 4, for example. The value in the z column is 54, which equals the prior row's z value plus the prior row's x value, minus the prior row's y value.
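
Because each step only adds the prior row's x minus the prior row's y, the same column can be cross-checked without Reduce via a grouped cumsum (z2 is a hypothetical helper column, not part of the original answer):

tmp[, z2 := cumsum(c(0, (x - y)[-.N])), by = grp]
all.equal(tmp$z, tmp$z2)   # TRUE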

The function f within Reduce can take any complicated form, including ifelse statements. Here is an example, where I've made a function called func, which is a wrapper around Reduce. Notice that within the Reduce statement, f is a function taking prev (thanks to a suggestion by @r2evans). This function first calculates k, the previous row's s value minus the previous row's t value (akin to your x[i-1] - y[i-1]). Then comes an ifelse statement: if k is greater than 10, the new value in this row will be the previous z value minus the quantity 20 - 4k (i.e. prev - (20 - 4*k)); otherwise it will be the previous z value plus k (which is equal to your original formulation, z[i-1] + x[i-1] - y[i-1]).

func <- function(s, t) {
  Reduce(
    f = function(prev, i) {
      k <- s[i - 1] - t[i - 1]
      ifelse(k > 10, prev - (20 - 4 * k), prev + k)
    },
    x = 2:length(s),
    init = 0,
    accumulate = TRUE
  )
}

You can then assign the value of func(x, y) to z, like this:

tmp[, z := func(x, y), by = .(grp)][]

Output:

    grp  x  y    z
 1:   0 10  1    0
 2:   0 20  2    9
 3:   0 30  3   61
 4:   0 40  4  149
 5:   0 50  5  273
 6:   0 60  6  433
 7:   1  1 10    0
 8:   1  2 20   -9
 9:   1  3 30  -27
10:   1  4 40  -54
11:   1  5 50  -90
12:   1  6 60 -135

Apply a function across groups and columns in data.table and/or dplyr

The sample datasets provided with the question indicate that the names of the columns may differ between datasets, e.g., column b of dt1 and column b2 of dt2 are supposed to be added.

Here are three approaches which should work for an arbitrary number of arbitrarily named pairs of columns:

  1. Working in long format
  2. Update joins using get()
  3. Computing on the language

1. Working in long format

The information on corresponding columns can be provided in a look-up table or translation table:

library(data.table)
lut <- data.table(vars1 = c("a", "b", "c"), vars2 = c("a2", "b2", "c2"))

lut
   vars1 vars2
1:     a    a2
2:     b    b2
3:     c    c2

In cases where column names are treated as data and the column data are of the same type, my first approach is to reshape to long format.

# reshape to long format
mdt1 <- melt(dt1[, rn := .I], measure.vars = lut$vars1)
mdt2 <- melt(dt2[, groupVar := .I], measure.vars = lut$vars2)
# update join to translate variable names
mdt2[lut, on = .(variable = vars2), variable := vars1]
# update join to add corresponding values of both tables
mdt1[mdt2, on = .(groupVar, variable), value := x.value + i.value]
# reshape back to wide format
dt3 <- dcast(mdt1, rn + groupVar ~ ...)[, rn := NULL][]
dt3
    groupVar  a  b  c
 1:        1 11 22 33
 2:        1 12 23 34
 3:        1 13 24 35
 4:        2 24 35 46
 5:        2 25 36 47
 6:        2 26 37 48
 7:        3 37 48 59
 8:        3 38 49 60
 9:        3 39 50 61
10:        3 40 51 62

2. Update joins using get()

Giving it a second thought, here is an approach which is similar to the OP's proposed for loop and requires much less coding:

vars1 <- c("a", "b", "c")
vars2 <- c("a2", "b2", "c2")
dt2[, groupVar := .I]

for (iv in seq_along(vars1)) {
  dt1[dt2, on = .(groupVar),
      (vars1[iv]) := get(paste0("x.", vars1[iv])) + get(paste0("i.", vars2[iv]))][]
}

dt1[]
     a  b  c groupVar
 1: 11 22 33        1
 2: 12 23 34        1
 3: 13 24 35        1
 4: 24 35 46        2
 5: 25 36 47        2
 6: 26 37 48        2
 7: 37 48 59        3
 8: 38 49 60        3
 9: 39 50 61        3
10: 40 51 62        3

Note that dt1 is updated by reference, i.e., without copying.
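
If you want to convince yourself of this, data.table::address() can be used to verify that no copy is made; a quick sketch (not part of the original answer):

x <- data.table(a = 1:3)
adr <- address(x)             # memory address before the update
x[, a := a + 1L]              # := updates the column by reference
identical(adr, address(x))    # TRUE: still the same object, no copy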

Prepending the variable names vars1[iv] with "x." and vars2[iv] with "i." on the right-hand side of := ensures that the right columns from dt1 and dt2, respectively, are picked in case of duplicated column names. See the Advanced: section on the j parameter in help("data.table").
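
Here is a minimal sketch of the x. and i. prefixes, with two hypothetical tables that share a column name v:

A <- data.table(id = 1:3, v = c(10, 20, 30))
B <- data.table(id = 1:3, v = c(1, 2, 3))
A[B, on = .(id), .(id, from_A = x.v, from_B = i.v)]   # x.v is A's v, i.v is B's v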

3. Computing on the language

This follows Matt Dowle's advice to create one expression to be evaluated, "similar to constructing a dynamic SQL statement to send to a server". See here for another use case.

library(glue)     # literal string interpolation
library(magrittr) # piping used to improve readability

EVAL <- function(...) eval(parse(text = paste0(...)), envir = parent.frame(2))

data.table(vars1 = c("a", "b", "c"), vars2 = c("a2", "b2", "c2")) %>%
  glue_data("{vars1} = x.{vars1} + i.{vars2}") %>%
  glue_collapse(sep = ", ") %>%
  {glue("dt1[dt2[, groupVar := .I], on = .(groupVar), `:=`({.})][]")} %>%
  EVAL()
     a  b  c groupVar
 1: 11 22 33        1
 2: 12 23 34        1
 3: 13 24 35        1
 4: 24 35 46        2
 5: 25 36 47        2
 6: 26 37 48        2
 7: 37 48 59        3
 8: 38 49 60        3
 9: 39 50 61        3
10: 40 51 62        3

It starts with a look-up table which is created on-the-fly and subsequently manipulated to form a complete data.table statement

dt1[dt2[, groupVar := .I], on = .(groupVar), `:=`(a = x.a + i.a2, b = x.b + i.b2, c = x.c + i.c2)][]

as a character string. This string is then evaluated and executed in one go; no for loops required.

As the helper function EVAL() already uses paste0(), the call to glue() can be omitted:

data.table(vars1 = c("a", "b", "c"), vars2 = c("a2", "b2", "c2")) %>%
  glue_data("{vars1} = x.{vars1} + i.{vars2}") %>%
  glue_collapse(sep = ", ") %>%
  {EVAL("dt1[dt2[, groupVar := .I], on = .(groupVar), `:=`(", ., ")][]")}

Note that the dot . and the curly brackets {} are used with different meanings in different contexts here, which may appear somewhat confusing.

Apply multiple functions to multiple columns of data table in R

With another lapply:

f_lapply <- function(DT, col_by, col_func, col_new, func) {
  temp <- DT[, mget(unique(c(col_by, col_func)))]

  ## loop to remove:
  # for (i in 1:length(func)) {
  #   temp[, eval(col_new[i]) := do.call(func[[i]], lapply(col_func, function(x) get(x))), by = mget(col_by)]
  # }

  temp[, (col_new) := lapply(func, function(y) do.call(y, lapply(col_func, function(x) get(x)))), by = mget(col_by)]

  temp <- unique(temp[, mget(c(col_new, col_by))])
  return(temp)
}

DT <- data.table(iris)

col_by <- c("Species")
col_func <- c("Petal.Length", "Petal.Width")
col_new <- c("PL.mean", "PL.max")
func <- list(function(x, y) return(mean(x[y == max(y)])),
             function(x, y) return(max(x[y == max(y)])))

f_lapply(DT, col_by, col_func, col_new, func)
#>     PL.mean PL.max    Species
#>       <num>  <num>     <fctr>
#> 1: 1.600000    1.6     setosa
#> 2: 4.800000    4.8 versicolor
#> 3: 5.933333    6.1  virginica

About 25% more efficient on this example:

microbenchmark::microbenchmark(f_lapply(DT, col_by, col_func, col_new, func),
                               f(DT, col_by, col_func, col_new, func))
Unit: milliseconds
                                          expr    min     lq     mean  median      uq     max neval
 f_lapply(DT, col_by, col_func, col_new, func) 2.1065 2.2694 2.791411 2.41730 2.58115 13.4727   100
        f(DT, col_by, col_func, col_new, func) 2.8480 3.0575 3.672728 3.19515 3.46075 17.1140   100

How to convert r data.table expression into a function for looping

The goal is a function of DT & var to do:

DT[, list(.N), by=var][order(var), list(var, N, Proportion=N/sum(N))]

To do a calculation by a variable and then order by it, use keyby=. So your function can become:

f_tabulate <- function(DT, var) {
  DT[, list(.N), keyby = var][, Proportion := N / sum(N)][]
}
# usage
for (i in c('cyl', 'gear')) print(f_tabulate(mtcars_dt, i))

keyby= and by= can take arguments in many forms, and you don't need to quote or eval a simple vector of variable names. (Other ways include .(var1, var2), list(var1, var2), and "var1,var2"; see the sketch after the next code block.) You could also extend to counting by multiple variables...

f_tabulate2 <- function(DT, ...) {
  DT[, list(.N), keyby = c(...)][, Proportion := N / sum(N)][]
}
# usage
f_tabulate2(mtcars_dt, 'cyl', 'gear')
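
To illustrate the argument forms mentioned above, here is a quick sketch (assuming mtcars_dt <- as.data.table(mtcars)); all four lines produce the same grouped counts:

mtcars_dt <- as.data.table(mtcars)
mtcars_dt[, .N, by = .(cyl, gear)]       # list form
mtcars_dt[, .N, by = list(cyl, gear)]    # same
mtcars_dt[, .N, by = "cyl,gear"]         # comma-separated string
mtcars_dt[, .N, by = c("cyl", "gear")]   # character vector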

For this operation (excepting the ordering part), you could also use groupingsets():

groupingsets(mtcars_dt, .N, by = c('cyl', 'gear'), sets = list("cyl", "gear"))[,
  Proportion := N / nrow(mtcars_dt)][]
   cyl gear  N Proportion
1:   6   NA  7    0.21875
2:   4   NA 11    0.34375
3:   8   NA 14    0.43750
4:  NA    4 12    0.37500
5:  NA    3 15    0.46875
6:  NA    5  5    0.15625

As a function (and adding the ordering back in)...

f_tabulate_all <- function(DT, vars) {
  ocall <- as.call(lapply(c("order", vars), as.name))
  groupingsets(DT[eval(ocall)], .N, by = vars, sets = as.list(vars))[,
    Proportion := N / nrow(DT)][]
}
# usage
f_tabulate_all(mtcars_dt, c('cyl', 'gear'))

The as.name function achieves the same thing as quote when applied to a string naming a function or other object.
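
A quick illustration of that equivalence, and of the call that ocall holds (just a check, nothing new):

identical(as.name("order"), quote(order))            # TRUE: both yield the symbol `order`
as.call(lapply(c("order", "cyl", "gear"), as.name))  # the call order(cyl, gear)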

vectorizing functions that include iterative loops inside a data.table

First, the if conditional requires a single logical, not a vector, so if myscalar has a length over 1 (or 0), it will fail. Further, the if should really involve a vectorized comparison of sorts, because one pattern is likely to match some of the inputs but not all of them.

For this, one might think of ifelse as a vectorized if/else alternative, but I think a better fit is a reductive method: try each pattern on the not-yet-matched inputs, and stop processing once everything has a match.


library(stringr)

my_postcode_fun <- function(myvector) {
  allpatterns <- c(
    "[[:alpha:]][[:alpha:]][[:digit:]][[:digit:]][[:space:]][[:digit:]][[:alpha:]][[:alpha:]]",
    # this is AA00 0AA
    "[[:alpha:]][[:digit:]][[:digit:]][[:space:]][[:digit:]][[:alpha:]][[:alpha:]]"
    # this is A00 0AA
  ) # these are the patterns I'm looking for

  post <- rep(NA_character_, length(myvector))

  for (ptn in allpatterns) {
    isna <- is.na(post)
    if (!any(isna)) break
    post[isna] <- str_extract(myvector[isna], regex(ptn))
  }

  isna <- is.na(post)
  if (any(!isna)) {
    myvector[!isna] <- mapply(sub, post[!isna], "", myvector[!isna])
  }
  list(postcode = post, leftover = myvector)
}

testdata <- data.table(Address = c("1 Some Street, sometown, AA00 0AA",
                                   "1 Some Street, sometown, A00 0AA",
                                   "1 Some Street, sometown, "))

testdata[, c("post", "left") := my_postcode_fun(Address)][]
#                              Address     post                      left
#                               <char>   <char>                    <char>
# 1: 1 Some Street, sometown, AA00 0AA AA00 0AA 1 Some Street, sometown,
# 2:  1 Some Street, sometown, A00 0AA  A00 0AA 1 Some Street, sometown,
# 3:          1 Some Street, sometown,     <NA> 1 Some Street, sometown,

(The post and left strings can definitely be cleaned up, perhaps with trimws, but that's a different task.)

Quick walk-through:

  • We start by generating a post vector as long as the input myvector, but all NA; this is an intentional starting point.
  • For each ptn in the patterns, we first only want to look at those elements of post that are NA (yes, the first pass is always all-true, but the point of this state machine is that it doesn't need to know whether it's on the first or nth pass). For this, we define isna as a variable tracking which elements of post are NA, meaning which elements of myvector have not yet had a match.
  • Quick check: if everything has a match (i.e., !any(isna)), then stop processing. This is a nice break point, since it means that if we have 1M inputs and the first pattern matches all of them, then we do not have to proceed with any of the other patterns.
  • str_extract the remaining inputs with this ptn, and store the result in the not-yet-matched elements of post.
  • After the for loop, we recompute isna one more time so that we can remove the postcode from the original input. sub is not vectorized on its pattern (just on its input x= vector), so I use mapply to vectorize it in a sense (there are other methods for doing so; see the sketch after this list).
    • Note: there is a very slight risk here: if an input has two apparent postcodes in it (however unlikely), and they contain the same matching substring, then the first will be extracted into post, and sub will remove the first occurrence. Neither of these steps will acknowledge or remove the second. I doubt it's likely, but I wanted to identify this possibility.
  • BTW: while I named the function output with postcode and leftover (as in your question), those names are dropped in this case by the c("post","left") := within data.table. I named them differently to demonstrate this. The function could easily have returned just list(post, leftover) and it would work just as well (though I think the names are good and declarative, and if this function is ever used outside of data.table, they may be quite helpful).
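
As for the "other methods" of vectorizing the removal step mentioned above, one alternative (my suggestion, not part of the original answer) is stringr::str_remove(), which is vectorised over both the string and the pattern; fixed() keeps the extracted postcodes from being interpreted as regular expressions:

library(stringr)
posts <- c("AA00 0AA", "A00 0AA")
addrs <- c("1 Some Street, AA00 0AA", "2 Other Road, A00 0AA")
str_remove(addrs, fixed(posts))   # drops each postcode from its own address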

