Create a Formula in a Data.Table Environment in R

create a formula in a data.table environment in R

lm can accept a character string as the formula so combine that with .SD like this:

> x[, as.list(coef(lm("a ~ b", .SD))), by = id]
   id (Intercept)  b
1:  1          21 -1
2:  2          21 -1
3:  3          21 -1
4:  4          21 -1
5:  5          21 -1

Convert variables in data.table to formula

One way to do this is to convert Formulae to actual one-sided formulae and then functions which are in turn evaluated inside of lst() which allows for the sequential building of objects. This relies on the metaprogramming functionality of the tidyverse framework rather than data.table.

library(dplyr)
library(purrr)

df <- data.frame(VarName = LETTERS[1:5],
                 Formulae = c("1+1", "A+3", "B*10", "A+C", "D/2"))

lst(!!!map(set_names(df$Formulae, df$VarName),
           ~ quo(
             as_mapper(reformulate(.x))()
           )))
$A
[1] 2

$B
[1] 5

$C
[1] 50

$D
[1] 52

$E
[1] 26

Or alternatively:

lst(!!!set_names(df$Formulae, df$VarName) %>% map(str2lang))

As noted in the comments below these require that the formulae are in a sequential order.

using lm(my_formula) inside [.data.table's j

The way lm works it looks for the variables used in the environment of the formula supplied. Since you create your formula in the global environment, it's not going to look in the j-expression environment, so the only way to make the exact expression lm(frm) work would be to add the appropriate variables to the correct environment:

DT[, {assign('x', x, environment(frm));
      assign('y', y, environment(frm));
      lm(frm)}]

Now obviously this is not a very good solution, and both Arun's and Josh's suggestions are much better and I'm just putting it here for the understanding of the problem at hand.

edit Another (possibly more perverted, and quite fragile) way would be to change the environment of the formula at hand (I do it permanently here, but you could revert it back, or copy it and then do it):

DT[, {setattr(frm, '.Environment', get('SDenv', parent.frame(2))); lm(frm)}]

Btw a funny thing is happening here - whenever you use get in j-expression, all of the variables get constructed (so don't use it if you can avoid it), and this is why I don't need to also use x and y in some way for data.table to know that those variables are needed.

Data table aggregations with vector functions, take 2

> dt[ , as.list(quantile(x)),by=y]
   y 0%   25%  50%   75% 100%
1: a  1  3.25  5.5  7.75   10
2: b 11 13.25 15.5 17.75   20

I tried using rbind, but that failed to generate the by-y arrangement I was thinking you wanted. The trick with as.list (vs. list) is that it constructs a multi-element list wehn givne a vector, whereas list only puts the vector into a single element list.

as.list acts like sapply(x, list):

> dt[ , sapply(quantile(x), list), by=y]
   y 0%   25%  50%   75% 100%
1: a  1  3.25  5.5  7.75   10
2: b 11 13.25 15.5 17.75   20

Your target solution:

> ddply(dt,~y,function(dtbit) quantile(dtbit$x))
  y 0%   25%  50%   75% 100%
1 a  1  3.25  5.5  7.75   10
2 b 11 13.25 15.5 17.75   20

I was kind of proud of that solution, but mindful of fortunes::fortune("Liaw-Baron principle") ............

Lastly, by what we could call the 'Liaw-Baron principle', every question that can be asked has in fact
already been asked.
-- Dirk Eddelbuettel (citing Andy Liaw's and Jonathan Baron's opinion on unique questions on R-help)
R-help (January 2006)

.... I did a search on: [r] data.table as.list, and find that I am by no means the first to post this strategy on SO:

Tabulate a data frame in R

Using ave() with function which returns a vector

create a formula in a data.table environment in R

I don't really know if this question would be considered a duplicate, but I am particularly grateful to @G.Grothedieck for the last one. It may be where I picked up the strategy. There were about 125 hits to that search and I've only gone through the first 20 to gather those examples, so there may be some more pearls that I haven't uncovered.

what data structure does model formula operator in R create?

Don't know if this helps, but: it's a language object — i.e. R parses the input but doesn't try to evaluate any of the components — with class "formula"

> f <- a ~ b + (c + d)
> str(f)
Class 'formula'  language a ~ b + (c + d)
  ..- attr(*, ".Environment")=<environment: R_GlobalEnv>

If you want to work with these objects, you need to know that it is essentially stored as a tree, where the parent node, an operator or function (~, +, () , can be extracted as the first element and the child nodes (as many as the 'arity' of the function/operator) can be extracted as elements 2..n, i.e.

f[[1]] is ~
f[[2]] is a (the first argument, i.e. the LHS of the formula)
f[[3]] is b + (c+d)
f[[3]][[1]] is +
f[[3]][[2]] is b

... and so on.

The chapter on Expressions in Hadley Wickham's Advanced R gives a more complete description.

This is also discussed (more opaquely) in the R Language Manual, e.g.

Expression objects
Direct manipulation of language objects

@user2554330 points out that formulas also typically have associated environments; that is, they carry along a pointer to the environment in which they were created

R Dynamically build list in data.table (or ddply)

Another way is to use .SDcols to group the columns for which you'd like to perform the same operations together. Let's say that you require columns a,d,e to be summed by type where as, b,g should have mean taken and c,f its median, then,

# constructing an example data.table:
set.seed(45)
dt <- data.table(type=rep(c("hello","bye","ok"), each=3), a=sample(9), 
                 b = rnorm(9), c=runif(9), d=sample(9), e=sample(9), 
                 f = runif(9), g=rnorm(9))

#     type a          b         c d e         f          g
# 1: hello 6 -2.5566166 0.7485015 9 6 0.5661358 -2.2066521
# 2: hello 3  1.1773119 0.6559926 3 3 0.4586280 -0.8376586
# 3: hello 2 -0.1015588 0.2164430 1 7 0.9299597  1.7216593
# 4:   bye 8 -0.2260640 0.3924327 8 2 0.1271187  0.4360063
# 5:   bye 7 -1.0720503 0.3256450 7 8 0.5774691  0.7571990
# 6:   bye 5 -0.7131021 0.4855804 6 9 0.2687791  1.5398858
# 7:    ok 1 -0.4680549 0.8476840 2 4 0.5633317  1.5393945
# 8:    ok 4  0.4183264 0.4402595 4 1 0.7592801  2.1829996
# 9:    ok 9 -1.4817436 0.5080116 5 5 0.2357030 -0.9953758

# 1) set key
setkey(dt, "type")

# 2) group col-ids by similar operations
id1 <- which(names(dt) %in% c("a", "d", "e"))
id2 <- which(names(dt) %in% c("b","g"))
id3 <- which(names(dt) %in% c("c","f"))

# 3) now use these ids in with .SDcols parameter
dt1 <- dt[, lapply(.SD, sum), by="type", .SDcols=id1]
dt2 <- dt[, lapply(.SD, mean), by="type", .SDcols=id2]
dt3 <- dt[, lapply(.SD, median), by="type", .SDcols=id3]

# 4) merge them.
dt1[dt2[dt3]]

#     type  a  d  e          b          g         c         f
# 1:   bye 20 21 19 -0.6704055  0.9110304 0.3924327 0.2687791
# 2: hello 11 13 16 -0.4936211 -0.4408838 0.6559926 0.5661358
# 3:    ok 14 11 10 -0.5104907  0.9090061 0.5080116 0.5633317

If/when you have many many column, making a list like the one you've might be cumbersome.

Specifying variables used in regression-formula using a string

You can compute on the language:

do.regr <- function(rhs) {
  rhs <- parse(text = rhs)[[1]]
  varname <- as.symbol(varname)
  eval(bquote(dt[, lmtest::coeftest(x = lm(.(varname) ~ .(rhs)),
                        vcov. = sandwich::NeweyWest(x = lm(.(varname) ~ .(rhs))) ), by = "d"]))
}

do.regr("b+c")
#works

The only disadvantage is that you couldn't use data.table's dot alias because bquote would (try to) substitute it. You could use substitute instead of bquote if this becomes an issue.

R: Using get and data.table within a user defined function

You don't have to call get on dt (Based on my experience, get is most often used to refer to a column using string) and you can supply character vector to by or keyby:

tw_tab <- function(dt,v1,v2){

    #set up variables as charaters
    v1<-as.character(substitute(v1))
    v2<-as.character(substitute(v2))

    #function
    tmp <- dt[,.(N=.N), keyby = c(v1, v2)]
    tmp[,total:=sum(N), by= c(v1)]
    tmp[,percent:=percent(N/total)]
    dcast(tmp, paste(v1, '~', v2), value.var="percent")
}

#test function
tw_tab(test, y, z)
#    y     A     B     C
# 1: 1 60.0% 40.0%    NA
# 2: 2    NA 16.7% 83.3%

Here is also a solution using xtabs and prop.table:

tw_tab <- function(x, v1, v2){
    fm <- bquote(~ .(substitute(v1)) + .(substitute(v2)))
    res <- prop.table(xtabs(formula = fm, data = x), 1)
    res <- as.data.frame.matrix(res)
    res[] <- lapply(res, scales::percent)
    return(res)
}

tw_tab(test, y, z)
#     A     B     C
# 1 60% 40.0%  0.0%
# 2  0% 16.7% 83.3%

Create a Formula in a Data.Table Environment in R