How to use data.table within functions and loops?
This might not be the most data.table-like or the fastest solution, but I would streamline the code in this particular loop as follows:
for(nVarGroup in 2:4) { # Grouped by several categorical values...
for(nVarMeans in 5:10) { # ... get means of all numerical parameters
strGroupConditions <- levels(dt[[nVarGroup]])[-1]
strVarGroup <- names(dt)[nVarGroup]
strVarMeans <- names(dt)[nVarMeans]
# qAction <- quote(mean(get(strVarMeans)))
# qGroup <- quote(get(strVarGroup) %in% strGroupConditions)
# p <- dt[eval(qGroup), .(AVE = eval(qAction), COUNT = .N), by = strVarGroup]
setkeyv(dt, strVarGroup)
p <- dt[strGroupConditions, .(AVE = lapply(.SD, mean), COUNT = .N), by = strVarGroup,
.SDcols = strVarMeans]
print(sprintf("nVarGroup = %s, nVarMeans = %s", strVarGroup, strVarMeans))
print(p)
}
}
I've left the old code as comments for reference.
qAction is replaced by using lapply(.SD, mean) together with the .SDcols parameter.
qGroup for subsetting rows is replaced by the combination of setting a key and providing the vector of desired values as the i parameter.
In case of a more complex subsetting expression, I would try to use non-equi (or conditional) joins using the on= syntax.
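As a minimal sketch of the two replacements described above, assuming a hypothetical toy table dt_toy (the table and its column names are not from the question): the key plus a character vector in i does the row subsetting, and .SDcols selects the column to average.

```r
library(data.table)

# Hypothetical toy table; the column names are assumptions for illustration
dt_toy <- data.table(
  grp = c("a", "b", "b", "c", "c", "c"),
  val = c(1, 2, 4, 3, 6, 9)
)
strVarGroup <- "grp"
strVarMeans <- "val"
keep <- c("b", "c")              # group values to keep

setkeyv(dt_toy, strVarGroup)     # key the table on the grouping column
# Keyed subset in i, mean of the .SDcols column(s) plus a row count in j
res <- dt_toy[keep,
              c(lapply(.SD, mean), list(COUNT = .N)),
              by = strVarGroup, .SDcols = strVarMeans]
print(res)
```

Wrapping the pieces in c(..., list(...)) is one way to get a plain numeric column for the mean rather than a list column.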
Or, follow Matt Dowle's advice to create one expression to be evaluated, "similar to constructing a dynamic SQL statement to send to a server".
Matt suggested creating a helper function
EVAL <- function(...) eval(parse(text = paste0(...)), envir = parent.frame(2))
which can be combined with the "quasi-perl type string interpolation of fn$ from the gsubfn package to improve the readability of the EVAL solution" as suggested by G. Grothendieck.
With this, the code for the loop eventually becomes:
EVAL <- function(...) eval(parse(text = paste0(...)), envir = parent.frame(2))
library(gsubfn)
for(nVarGroup in 2:4) { # Grouped by several categorical values...
for(nVarMeans in 5:10) { # ... get means of all numerical parameters
strGroupConditions = levels(dt[[nVarGroup]])[-1]
strVarGroup = names(dt)[nVarGroup]
strVarMeans = names(dt)[nVarMeans]
p <- fn$EVAL("dt[$strVarGroup %in% strGroupConditions, .(AVE=mean($strVarMeans), COUNT=.N), by = strVarGroup]" )
print(sprintf("nVarGroup = %s, nVarMeans = %s", strVarGroup, strVarMeans))
print(p)
}
}
Now, the data.table statement looks pretty much like a "native" statement, except that $strVarGroup and $strVarMeans are used where the contents of variables are referenced.
With version 1.1.0 (CRAN release on 2016-08-19), the stringr package has gained a string interpolation function str_interp(), which is an alternative to the gsubfn package here.
With str_interp(), the central statement in the for loop would become
p <- EVAL(stringr::str_interp(
"dt[${strVarGroup} %in% strGroupConditions, .(AVE=mean(${strVarMeans}), COUNT=.N), by = strVarGroup]"
))
and the call to library(gsubfn) could be removed.
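If neither gsubfn nor stringr is available, the same "build one statement as a string, then evaluate it" idea can be sketched with base R's sprintf(). This is only an illustration on a hypothetical toy table dt_toy, not the code from the answer above:

```r
library(data.table)

# Hypothetical toy table; names are assumptions for illustration
dt_toy <- data.table(grp = c("a", "a", "b"), val = c(1, 3, 10))
strVarGroup <- "grp"
strVarMeans <- "val"

# Interpolate the column names into the statement text
stmt <- sprintf("dt_toy[, .(AVE = mean(%s), COUNT = .N), by = %s]",
                strVarMeans, strVarGroup)
print(stmt)

# Parse and evaluate the finished statement in the calling environment
res <- eval(parse(text = stmt))
print(res)
```

As with any eval(parse()) construct, the interpolated strings should come from a trusted source such as names(dt).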
Perform for-loop inside data.table
Another approach:
dt1[, {
gv <- value
.SD[, {
VT <- var.test(value, gv, alternative="greater", ratio=1)
.(F.Stat=VT$statistic, SD.Ratio=sqrt(VT$estimate), P.value=VT$p.value)
},
by=.(id)]
}, by=.(group)]
Timing code (it took too long to time OP's approaches, so I took them out) and another approach that strips away the checks in var.test (use with caution):
mtd2 <- function() {
dt1[, {
gv <- value
leng <- .N
.SD[, {
VT <- var.test(value, gv, alternative="greater", ratio=1)
.(F.Stat=VT$statistic, SD.Ratio=sqrt(VT$estimate), P.value=VT$p.value)
},
by=.(id)]
}, by=.(group)]
}
mtd3 <- function() {
dt1[, {
gv <- value
leng <- .N
.SD[, {
#see stats:::var.test.default
ESTIMATE <- var(value) / var(gv)
.(F.Stat=ESTIMATE, P.value=1 - pf(ESTIMATE, .N - 1L, leng - 1L))
},
by=.(id)]
}, by=.(group)][, SD.Ratio:=sqrt(F.Stat)][]
}
library(bench)
bench::mark(mtd2(), mtd3(), check=FALSE)
timings:
# A tibble: 2 x 14
expression min mean median max `itr/sec` mem_alloc n_gc n_itr total_time result memory time gc
<chr> <bch:tm> <bch:tm> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <bch:tm> <list> <list> <list> <list>
1 mtd2() 2.75s 2.75s 2.75s 2.75s 0.364 1.9GB 54 1 2.75s <data.table [10,000 x 5]> <Rprofmem [82,295 x 3]> <bch:tm> <tibble [1 x 3]>
2 mtd3() 1.29s 1.29s 1.29s 1.29s 0.774 13.6MB 4 1 1.29s <data.table [10,000 x 5]> <Rprofmem [2,432 x 3]> <bch:tm> <tibble [1 x 3]>
data:
library(data.table)
set.seed(1234) # reproducibility
nr <- 1e6
ng <- 100
dt1 <- data.table(id=sample(ng, nr, TRUE),
group=sample(ng, nr, TRUE),
value=rnorm(nr))
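The shortcut in mtd3 rests on the fact that, for ratio=1 and alternative="greater", the F statistic of var.test is the ratio of the two sample variances and the p-value is the upper tail of the F distribution. A small check on simulated vectors (the vectors x and y here are made up for illustration) can confirm this before trusting the stripped-down version:

```r
set.seed(42)
x <- rnorm(30, sd = 2)   # hypothetical sample 1
y <- rnorm(50)           # hypothetical sample 2

# Full test with all of var.test's argument checks
vt <- var.test(x, y, alternative = "greater", ratio = 1)

# Manual computation, as in mtd3
f_manual <- var(x) / var(y)
p_manual <- 1 - pf(f_manual, length(x) - 1L, length(y) - 1L)

print(c(F = unname(vt$statistic), F_manual = f_manual))
print(c(p = vt$p.value, p_manual = p_manual))
```

The manual version skips the handling of missing or non-finite values that var.test performs, which is why it should be used with caution.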
Create tables by using data.table and a for loop for multiple columns
- .data is not used in data.table.
- You don't need select here, and that is why you also don't need .[, .(id, columns[i], group)] in the data.table version.
- You can use get to get column values based on a string.
Since this is just an example I have not tried to simplify the loop so that you can add more complicated stuff in there later.
library(data.table)
cols <- c("dv1", "dv2")
test <- function(data, columns, group) {
for(i in columns) {
sub1 <- data[, .(mean(get(i), na.rm=TRUE), sd=sd(get(i), na.rm=TRUE)), by=group]
print(sub1)
}
}
test(data=df, columns=cols, group="year")
# year V1 sd
#1: 2014 1.00 NA
#2: 2015 3.67 1.528
#3: 2016 6.00 3.000
#4: <NA> 7.50 0.707
# year V1 sd
#1: 2014 2.00 NA
#2: 2015 4.67 1.528
#3: 2016 7.00 3.000
#4: <NA> 8.50 0.707
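For comparison, here is a small sketch of get() next to the .SDcols route on a hypothetical toy table (df_toy and its values are made up, since the question's data is not reproduced here). Both pick the column by its name held in a string:

```r
library(data.table)

# Hypothetical data; not the data from the question
df_toy <- data.table(year = c(2014, 2015, 2015), dv1 = c(1, 2, 4))
col <- "dv1"

# Variant 1: look the column up by name inside j
res_get <- df_toy[, .(m = mean(get(col), na.rm = TRUE)), by = "year"]

# Variant 2: restrict .SD to the wanted column(s)
res_sd <- df_toy[, lapply(.SD, mean, na.rm = TRUE), by = "year", .SDcols = col]

print(res_get)
print(res_sd)
```

The .SDcols variant keeps the original column name in the result, while get() needs an explicit name in j.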
How to apply a custom recursive function with data.table and loop over each index group-wise?
Does this use of Reduce do the trick?
tmp = data.table(
grp = c(rep(0,6), rep(1,6)),
x=c(10,20,30,40,50,60,1,2,3,4,5,6),
y=c(1,2,3,4,5,6, 10,20,30,40,50,60)
)
tmp[, z:=Reduce(f=function(z,i) z + x[i-1] - y[i-1],
x=(1:.N)[-1],
init=0,
accumulate = T)
,by=grp
]
Output:
grp x y z
1: 0 10 1 0
2: 0 20 2 9
3: 0 30 3 27
4: 0 40 4 54
5: 0 50 5 90
6: 0 60 6 135
7: 1 1 10 0
8: 1 2 20 -9
9: 1 3 30 -27
10: 1 4 40 -54
11: 1 5 50 -90
12: 1 6 60 -135
Take for example, row 4. The value in the z column is 54, which is equal to the prior row's z-value + prior row's x-value, minus prior row's y-value.
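To see the recurrence in isolation, here is a base R sketch on plain vectors (xs and ys are hypothetical stand-ins for one group's x and y). For this simple additive rule, a shifted cumsum gives the same values, which can serve as a cross-check:

```r
xs <- c(10, 20, 30, 40)
ys <- c(1, 2, 3, 4)

# z[1] = 0; z[i] = z[i-1] + xs[i-1] - ys[i-1]
z <- Reduce(
  f = function(prev, i) prev + xs[i - 1] - ys[i - 1],
  x = seq_along(xs)[-1],
  init = 0,
  accumulate = TRUE
)
print(z)

# Cross-check: running sum of the differences, shifted by one position
z_check <- c(0, head(cumsum(xs - ys), -1))
print(z_check)
```

The cumsum shortcut only applies while the update rule is a plain sum; once the rule depends on the previous value in a non-additive way, Reduce (or a loop) is needed.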
The function f within Reduce can take any complicated form, including ifelse statements. Here is an example, where I've made a function called func, which is a wrapper around Reduce. Notice that within the Reduce statement, f is a function taking prev (thanks to a suggestion by @r2evans), and this function first calculates the previous row's s value minus the previous row's t value (this is akin to your x[-1]-y[-1]). Then there is an ifelse statement. If the difference between the prior row's s and t values (i.e. k) is >10, then the new value in this row will be the previous z value minus (20-4k) (i.e. prev-(20-4k)); otherwise it will be the previous z value + k (which is equal to your original formulation: z[i-1]+x[i-1]-y[i-1]).
func <- function(s,t) {
Reduce(
f=function(prev,i) {
k=s[i-1] - t[i-1]
ifelse(k>10, prev -(20-4*k), prev+k)
},
x=2:length(s),
init=0,
accumulate = TRUE
)
}
You can then assign the value of func(x,y) to z, like this:
tmp[, z:=func(x,y), by=.(grp)][]
Output:
grp x y z
1: 0 10 1 0
2: 0 20 2 9
3: 0 30 3 61
4: 0 40 4 149
5: 0 50 5 273
6: 0 60 6 433
7: 1 1 10 0
8: 1 2 20 -9
9: 1 3 30 -27
10: 1 4 40 -54
11: 1 5 50 -90
12: 1 6 60 -135
Apply a function across groups and columns in data.table and/or dplyr
The sample datasets provided with the question indicate that the names of the columns may differ between datasets, e.g., column b of dt1 and column b2 of dt2 are supposed to be added.
Here are three approaches which should work for an arbitrary number of arbitrarily named pairs of columns:
- Working in long format
- EDIT: Update joins using get()
- EDIT 2: Computing on the language
1. Working in long format
The information on corresponding columns can be provided in a look-up table or translation table:
library(data.table)
lut <- data.table(vars1 = c("a", "b", "c"), vars2 = c("a2", "b2", "c2"))
lut
vars1 vars2
1: a a2
2: b b2
3: c c2
In cases where column names are treated as data and the column data are of the same data type my first approach is to reshape to long format.
# reshape to long format
mdt1 <- melt(dt1[, rn := .I], measure.vars = lut$vars1)
mdt2 <- melt(dt2[, groupVar := .I], measure.vars = lut$vars2)
# update join to translate variable names
mdt2[lut, on = .(variable = vars2), variable := vars1]
# update join to add corresponding values of both tables
mdt1[mdt2, on = .(groupVar, variable), value := x.value + i.value]
# reshape back to wide format
dt3 <- dcast(mdt1, rn + groupVar ~ ...)[, rn := NULL][]
dt3
groupVar a b c
1: 1 11 22 33
2: 1 12 23 34
3: 1 13 24 35
4: 2 24 35 46
5: 2 25 36 47
6: 2 26 37 48
7: 3 37 48 59
8: 3 38 49 60
9: 3 39 50 61
10: 3 40 51 62
2. Update joins using get()
On second thought, here is an approach which is similar to OP's proposed for loop and requires much less coding:
vars1 <- c("a", "b", "c")
vars2 <- c("a2", "b2", "c2")
dt2[, groupVar := .I]
for (iv in seq_along(vars1)) {
dt1[dt2, on = .(groupVar),
(vars1[iv]) := get(paste0("x.", vars1[iv])) + get(paste0("i.", vars2[iv]))][]
}
dt1[]
a b c groupVar
1: 11 22 33 1
2: 12 23 34 1
3: 13 24 35 1
4: 24 35 46 2
5: 25 36 47 2
6: 26 37 48 2
7: 37 48 59 3
8: 38 49 60 3
9: 39 50 61 3
10: 40 51 62 3
Note that dt1 is updated by reference, i.e., without copying.
Prepending the variable names vars1[iv] by "x." and vars2[iv] by "i." on the right hand side of := is to ensure that the right columns from dt1 and dt2, resp., are picked in case of duplicated column names. See the Advanced: section on the j parameter in help("data.table").
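A minimal sketch of why the prefixes matter, using two hypothetical tables A and B that share the column name v (these are not the question's dt1 and dt2): inside the update join, x.v refers to the column of the table being updated and i.v to the column of the table passed in i.

```r
library(data.table)

# Hypothetical tables with a duplicated column name "v"
A <- data.table(id = 1:3, v = c(10, 20, 30))
B <- data.table(id = 1:3, v = c(1, 2, 3))

# Update A by reference: x.v is A's column, i.v is B's column
A[B, on = .(id), v := x.v + i.v]
print(A)
```

Without the prefixes, a bare v on the right hand side would be ambiguous to the reader, so spelling out x. and i. documents the intent.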
3. Computing on the language
This follows Matt Dowle's advice to create one expression to be evaluated, "similar to constructing a dynamic SQL statement to send to a server". See here for another use case.
library(glue) # literal string interpolation
library(magrittr) # piping used to improve readability
EVAL <- function(...) eval(parse(text = paste0(...)), envir = parent.frame(2))
data.table(vars1 = c("a", "b", "c"), vars2 = c("a2", "b2", "c2")) %>%
glue_data("{vars1} = x.{vars1} + i.{vars2}") %>%
glue_collapse( sep = ", ") %>%
{glue("dt1[dt2[, groupVar := .I], on = .(groupVar), `:=`({.})][]")} %>%
EVAL()
a b c groupVar
1: 11 22 33 1
2: 12 23 34 1
3: 13 24 35 1
4: 24 35 46 2
5: 25 36 47 2
6: 26 37 48 2
7: 37 48 59 3
8: 38 49 60 3
9: 39 50 61 3
10: 40 51 62 3
It starts with a look-up table which is created on-the-fly and subsequently manipulated to form a complete data.table statement
dt1[dt2[, groupVar := .I], on = .(groupVar), `:=`(a = x.a + i.a2, b = x.b + i.b2, c = x.c + i.c2)][]
as a character string. This string is then evaluated and executed in one go; no for loops required.
As the helper function EVAL() already uses paste0(), the call to glue() can be omitted:
data.table(vars1 = c("a", "b", "c"), vars2 = c("a2", "b2", "c2")) %>%
glue_data("{vars1} = x.{vars1} + i.{vars2}") %>%
glue_collapse( sep = ", ") %>%
{EVAL("dt1[dt2[, groupVar := .I], on = .(groupVar), `:=`(", ., ")][]")}
Note that the dot . and curly brackets {} are used with different meanings in different contexts, which may appear somewhat confusing.
Apply multiple functions to multiple columns of data table in R
With another lapply:
f_lapply = function(DT, col_by, col_func, col_new, func){
temp = DT[, mget(unique(c(col_by,col_func)))]
##loop to remove
# for(i in 1:length(func)){
# temp[ ,eval(col_new[i]) := do.call(func[[i]], lapply(col_func, function(x) get(x))), by = mget(col_by)]
# }
##
temp[ ,(col_new) := lapply(func, function(y) do.call(y, lapply(col_func, function(x) get(x)))), by = mget(col_by)]
temp = unique(temp[,mget(c(col_new, col_by))])
return(temp)
}
DT = data.table(iris)
col_by = c("Species")
col_func = c("Petal.Length","Petal.Width")
col_new = c("PL.mean","PL.max")
func = list(function(x,y) return(mean(x[y == max(y)])), function(x,y) return(max(x[y == max(y)])))
f_lapply(DT, col_by, col_func,col_new,func)
#> PL.mean PL.max Species
#> <num> <num> <fctr>
#> 1: 1.600000 1.6 setosa
#> 2: 4.800000 4.8 versicolor
#> 3: 5.933333 6.1 virginica
About 25% more efficient on this example:
microbenchmark::microbenchmark(f_lapply(DT, col_by, col_func,col_new,func),
f(DT, col_by, col_func,col_new,func))
Unit: milliseconds
expr min lq mean median uq max neval
f_lapply(DT, col_by, col_func, col_new, func) 2.1065 2.2694 2.791411 2.41730 2.58115 13.4727 100
f(DT, col_by, col_func, col_new, func) 2.8480 3.0575 3.672728 3.19515 3.46075 17.1140 100
How to convert r data.table expression into a function for looping
The goal is a function of DT & var to do:
DT[, list(.N), by=var][order(var), list(var, N, Proportion=N/sum(N))]
To do a calculation by a variable and then order by it, use keyby=. So your function can become:
f_tabulate <-
function(DT, var) {
DT[, list(.N), keyby=var][, Proportion := N/sum(N)][]
}
# usage
for (i in c('cyl', 'gear')) print(f_tabulate(mtcars_dt, i))
keyby= and by= can take arguments in many forms, and you don't need to quote or eval a simple vector of variable names. (Other ways include .(var1, var2), list(var1, var2), "var1,var2".) You could also extend to counting by multiple variables...
f_tabulate2 <-
function(DT, ...) {
DT[, list(.N), keyby=c(...)][, Proportion := N/sum(N)][]
}
# usage
f_tabulate2(mtcars_dt, 'cyl', 'gear')
For this operation (excepting the ordering part), you could also use groupingsets():
> groupingsets(mtcars_dt, .N, keyby=c('cyl', 'gear'), sets=list("cyl", "gear"))[,
Proportion := N/nrow(mtcars_dt)][]
cyl gear N Proportion
1: 6 NA 7 0.21875
2: 4 NA 11 0.34375
3: 8 NA 14 0.43750
4: NA 4 12 0.37500
5: NA 3 15 0.46875
6: NA 5 5 0.15625
As a function (and adding ordering back in)...
f_tabulate_all = function(DT, vars){
lvars = as.list(vars)
ocall = as.call(lapply(c("order", vars), as.name))
groupingsets(DT[eval(ocall)], .N, by=vars, sets=as.list(vars))[,
Proportion := N/nrow(DT)][]
}
# usage
f_tabulate_all(mtcars_dt, c('cyl', 'gear'))
The as.name function achieves the same thing as quote when applied to a string naming a function or other object.
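To make the as.name/as.call step concrete, here is a base R sketch that builds the same kind of order() call from strings and evaluates it against the built-in mtcars data frame (used here only for illustration, instead of mtcars_dt):

```r
vars <- c("cyl", "gear")

# Turn each string into a symbol and assemble the call order(cyl, gear)
ocall <- as.call(lapply(c("order", vars), as.name))
print(ocall)

# The constructed call is identical to the quoted literal
same_call <- identical(ocall, quote(order(cyl, gear)))

# Evaluate the call with the data frame's columns in scope
idx <- eval(ocall, mtcars)
head(mtcars[idx, vars])
```

Inside DT[eval(ocall)] the evaluation happens in the scope of the data.table's columns in the same way.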
vectorizing functions that include iterative loops inside a data.table
First, the if conditional requires a single logical, not a vector, and if myscalar has a length over 1 (or just 0) then it will fail. Further, the if should really be a vectorized comparison in a sense, because you are likely to have one pattern match some inputs but not all, etc.
For this, one might think of ifelse as a vectorized if/else alternative, but I think a different method works better: a reductive one, where we try each pattern on any not-yet-matched inputs, and stop processing when everything has a match.
library(data.table)
library(stringr) # for str_extract() and regex()
my_postcode_fun <- function(myvector){
allpatterns <- c("[[:alpha:]][[:alpha:]][[:digit:]][[:digit:]][[:space:]][[:digit:]][[:alpha:]][[:alpha:]]",
# this is AA00 0AA
"[[:alpha:]][[:digit:]][[:digit:]][[:space:]][[:digit:]][[:alpha:]][[:alpha:]]"
# this is A00 0AA
)# these are the patterns I'm looking for
post <- rep(NA_character_, length(myvector))
for (ptn in allpatterns) {
isna <- is.na(post)
if (!any(isna)) break
post[isna] <- str_extract(myvector[isna], regex(ptn))
}
isna <- is.na(post)
if (any(!isna)) {
myvector[!isna] <- mapply(sub, post[!isna], "", myvector[!isna])
}
list(postcode = post, leftover = myvector)
}
testdata <- data.table(Address = c("1 Some Street, sometown, AA00 0AA",
"1 Some Street, sometown, A00 0AA",
"1 Some Street, sometown, "))
testdata[, c("post","left") := my_postcode_fun(Address)][]
# Address post left
# <char> <char> <char>
# 1: 1 Some Street, sometown, AA00 0AA AA00 0AA 1 Some Street, sometown,
# 2: 1 Some Street, sometown, A00 0AA A00 0AA 1 Some Street, sometown,
# 3: 1 Some Street, sometown, <NA> 1 Some Street, sometown,
(The post and left strings can definitely be cleaned up, perhaps with trimws, but that's a different task.)
Quick walk-through:
- We start by generating a post vector as long as the input myvector, but all NA; this is an intentional starting point.
- For each ptn in the patterns, first we only want to look at those in post that are NA (yes, the first pass is always all-true, but the point of this state-machine is that it doesn't need to know if it's on the first or nth pass). For this, we define isna as a variable tracking which of post are NA, meaning which of myvector have not yet had a match.
- Quick check: if everything has a match (i.e., !any(isna)), then stop processing. This is a nice break point, since it means that if we have 1M inputs and the first pattern matches for all of them, then we do not have to proceed with any of the other patterns.
- str_extract the remaining inputs with this ptn, and store in the not-yet-matched post output.
- After the for loop, we populate isna one more time so that we can remove the postcode from the original input. sub is not vectorized on its pattern (just on its input x= vector), so I use mapply to vectorize it in a sense (there are other methods for doing so).
  - Note: there is a very slight risk here: if an input has two apparent postcodes in it (however unlikely), and they contain the same matching substring, then the first will be extracted into post, and sub will remove the first occurrence. Neither of these steps will acknowledge or remove the second. I doubt it's likely, but I wanted to identify this possibility.
- BTW: while I named the function output with postcode and leftover (as in your question), those names are being dropped in this case by the c("post","left") := within data.table. I named them differently to demonstrate this. The function could easily have returned just list(post, leftover) and it works just as well (though I think the names are good, declarative, and if this function is ever used outside of data.table, the names may be quite helpful).
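The reductive loop from the walk-through can also be sketched without stringr, using base R's regexpr() and regmatches(). This is an illustrative variant under that assumption, not the function from the answer; extract_first and the sample inputs are hypothetical:

```r
# Try each pattern in turn, but only on inputs that have no match yet
extract_first <- function(x, patterns) {
  out <- rep(NA_character_, length(x))
  for (p in patterns) {
    todo <- is.na(out)
    if (!any(todo)) break              # everything matched: stop early
    m <- regexpr(p, x[todo])           # -1 where the pattern does not match
    hit <- m != -1
    out[todo][hit] <- regmatches(x[todo], m)
  }
  out
}

patterns <- c(
  "[[:alpha:]]{2}[[:digit:]]{2} [[:digit:]][[:alpha:]]{2}",  # AA00 0AA
  "[[:alpha:]][[:digit:]]{2} [[:digit:]][[:alpha:]]{2}"      # A00 0AA
)
res <- extract_first(c("x AA00 0AA", "y A00 0AA", "none"), patterns)
print(res)
```

Ordering the patterns from most to least specific matters here, because a shorter pattern could otherwise match part of a longer postcode.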