create a formula in a data.table environment in R
lm
can accept a character string as the formula so combine that with .SD
like this:
> x[, as.list(coef(lm("a ~ b", .SD))), by = id]
id (Intercept) b
1: 1 21 -1
2: 2 21 -1
3: 3 21 -1
4: 4 21 -1
5: 5 21 -1
Convert variables in data.table to formula
One way to do this is to convert Formulae
to actual one-sided formulae and then functions which are in turn evaluated inside of lst()
which allows for the sequential building of objects. This relies on the metaprogramming functionality of the tidyverse
framework rather than data.table
.
library(dplyr)
library(purrr)
df <- data.frame(VarName = LETTERS[1:5],
Formulae = c("1+1", "A+3", "B*10", "A+C", "D/2"))
lst(!!!map(set_names(df$Formulae, df$VarName),
~ quo(
as_mapper(reformulate(.x))()
)))
$A
[1] 2
$B
[1] 5
$C
[1] 50
$D
[1] 52
$E
[1] 26
Or alternatively:
lst(!!!set_names(df$Formulae, df$VarName) %>% map(str2lang))
As noted in the comments below these require that the formulae are in a sequential order.
using lm(my_formula) inside [.data.table's j
The way lm
works it looks for the variables used in the environment of the formula supplied. Since you create your formula in the global environment, it's not going to look in the j
-expression environment, so the only way to make the exact expression lm(frm)
work would be to add the appropriate variables to the correct environment:
DT[, {assign('x', x, environment(frm));
assign('y', y, environment(frm));
lm(frm)}]
Now obviously this is not a very good solution, and both Arun's and Josh's suggestions are much better and I'm just putting it here for the understanding of the problem at hand.
edit Another (possibly more perverted, and quite fragile) way would be to change the environment of the formula at hand (I do it permanently here, but you could revert it back, or copy it and then do it):
DT[, {setattr(frm, '.Environment', get('SDenv', parent.frame(2))); lm(frm)}]
Btw a funny thing is happening here - whenever you use get
in j
-expression, all of the variables get constructed (so don't use it if you can avoid it), and this is why I don't need to also use x
and y
in some way for data.table
to know that those variables are needed.
Data table aggregations with vector functions, take 2
> dt[ , as.list(quantile(x)),by=y]
y 0% 25% 50% 75% 100%
1: a 1 3.25 5.5 7.75 10
2: b 11 13.25 15.5 17.75 20
I tried using rbind
, but that failed to generate the by-y arrangement I was thinking you wanted. The trick with as.list
(vs. list
) is that it constructs a multi-element list wehn givne a vector, whereas list
only puts the vector into a single element list.
as.list
acts like sapply(x, list)
:
> dt[ , sapply(quantile(x), list), by=y]
y 0% 25% 50% 75% 100%
1: a 1 3.25 5.5 7.75 10
2: b 11 13.25 15.5 17.75 20
Your target solution:
> ddply(dt,~y,function(dtbit) quantile(dtbit$x))
y 0% 25% 50% 75% 100%
1 a 1 3.25 5.5 7.75 10
2 b 11 13.25 15.5 17.75 20
I was kind of proud of that solution, but mindful of fortunes::fortune("Liaw-Baron principle")
............
Lastly, by what we could call the 'Liaw-Baron principle', every question that can be asked has in fact
already been asked.
-- Dirk Eddelbuettel (citing Andy Liaw's and Jonathan Baron's opinion on unique questions on R-help)
R-help (January 2006)
.... I did a search on: [r] data.table as.list
, and find that I am by no means the first to post this strategy on SO:
Tabulate a data frame in R
Using ave() with function which returns a vector
create a formula in a data.table environment in R
I don't really know if this question would be considered a duplicate, but I am particularly grateful to @G.Grothedieck for the last one. It may be where I picked up the strategy. There were about 125 hits to that search and I've only gone through the first 20 to gather those examples, so there may be some more pearls that I haven't uncovered.
what data structure does model formula operator in R create?
Don't know if this helps, but: it's a language
object — i.e. R parses the input but doesn't try to evaluate any of the components — with class "formula"
> f <- a ~ b + (c + d)
> str(f)
Class 'formula' language a ~ b + (c + d)
..- attr(*, ".Environment")=<environment: R_GlobalEnv>
If you want to work with these objects, you need to know that it is essentially stored as a tree, where the parent node, an operator or function (~
, +
, (
) , can be extracted as the first element and the child nodes (as many as the 'arity' of the function/operator) can be extracted as elements 2..n, i.e.
f[[1]]
is~
f[[2]]
isa
(the first argument, i.e. the LHS of the formula)f[[3]]
isb + (c+d)
f[[3]][[1]]
is+
f[[3]][[2]]
isb
... and so on.
The chapter on Expressions in Hadley Wickham's Advanced R gives a more complete description.
This is also discussed (more opaquely) in the R Language Manual, e.g.
- Expression objects
- Direct manipulation of language objects
@user2554330 points out that formulas also typically have associated environments; that is, they carry along a pointer to the environment in which they were created
R Dynamically build list in data.table (or ddply)
Another way is to use .SDcols
to group the columns for which you'd like to perform the same operations together. Let's say that you require columns a,d,e
to be summed by type
where as, b,g
should have mean
taken and c,f
its median, then,
# constructing an example data.table:
set.seed(45)
dt <- data.table(type=rep(c("hello","bye","ok"), each=3), a=sample(9),
b = rnorm(9), c=runif(9), d=sample(9), e=sample(9),
f = runif(9), g=rnorm(9))
# type a b c d e f g
# 1: hello 6 -2.5566166 0.7485015 9 6 0.5661358 -2.2066521
# 2: hello 3 1.1773119 0.6559926 3 3 0.4586280 -0.8376586
# 3: hello 2 -0.1015588 0.2164430 1 7 0.9299597 1.7216593
# 4: bye 8 -0.2260640 0.3924327 8 2 0.1271187 0.4360063
# 5: bye 7 -1.0720503 0.3256450 7 8 0.5774691 0.7571990
# 6: bye 5 -0.7131021 0.4855804 6 9 0.2687791 1.5398858
# 7: ok 1 -0.4680549 0.8476840 2 4 0.5633317 1.5393945
# 8: ok 4 0.4183264 0.4402595 4 1 0.7592801 2.1829996
# 9: ok 9 -1.4817436 0.5080116 5 5 0.2357030 -0.9953758
# 1) set key
setkey(dt, "type")
# 2) group col-ids by similar operations
id1 <- which(names(dt) %in% c("a", "d", "e"))
id2 <- which(names(dt) %in% c("b","g"))
id3 <- which(names(dt) %in% c("c","f"))
# 3) now use these ids in with .SDcols parameter
dt1 <- dt[, lapply(.SD, sum), by="type", .SDcols=id1]
dt2 <- dt[, lapply(.SD, mean), by="type", .SDcols=id2]
dt3 <- dt[, lapply(.SD, median), by="type", .SDcols=id3]
# 4) merge them.
dt1[dt2[dt3]]
# type a d e b g c f
# 1: bye 20 21 19 -0.6704055 0.9110304 0.3924327 0.2687791
# 2: hello 11 13 16 -0.4936211 -0.4408838 0.6559926 0.5661358
# 3: ok 14 11 10 -0.5104907 0.9090061 0.5080116 0.5633317
If/when you have many many column, making a list like the one you've might be cumbersome.
Specifying variables used in regression-formula using a string
You can compute on the language:
do.regr <- function(rhs) {
rhs <- parse(text = rhs)[[1]]
varname <- as.symbol(varname)
eval(bquote(dt[, lmtest::coeftest(x = lm(.(varname) ~ .(rhs)),
vcov. = sandwich::NeweyWest(x = lm(.(varname) ~ .(rhs))) ), by = "d"]))
}
do.regr("b+c")
#works
The only disadvantage is that you couldn't use data.table's dot alias because bquote
would (try to) substitute it. You could use substitute
instead of bquote
if this becomes an issue.
R: Using get and data.table within a user defined function
You don't have to call get
on dt
(Based on my experience, get
is most often used to refer to a column using string) and you can supply character vector to by
or keyby
:
tw_tab <- function(dt,v1,v2){
#set up variables as charaters
v1<-as.character(substitute(v1))
v2<-as.character(substitute(v2))
#function
tmp <- dt[,.(N=.N), keyby = c(v1, v2)]
tmp[,total:=sum(N), by= c(v1)]
tmp[,percent:=percent(N/total)]
dcast(tmp, paste(v1, '~', v2), value.var="percent")
}
#test function
tw_tab(test, y, z)
# y A B C
# 1: 1 60.0% 40.0% NA
# 2: 2 NA 16.7% 83.3%
Here is also a solution using xtabs
and prop.table
:
tw_tab <- function(x, v1, v2){
fm <- bquote(~ .(substitute(v1)) + .(substitute(v2)))
res <- prop.table(xtabs(formula = fm, data = x), 1)
res <- as.data.frame.matrix(res)
res[] <- lapply(res, scales::percent)
return(res)
}
tw_tab(test, y, z)
# A B C
# 1 60% 40.0% 0.0%
# 2 0% 16.7% 83.3%
Related Topics
Reorder Rows Using Custom Order
Change Color of Leaflet Marker
Embedding a Miniature Plot Within a Plot
Subtract a Constant Vector from Each Row in a Matrix in R
Check If Each Row of a Data Frame Is Contained in Another Data Frame
Dataframe Create New Column Based on Other Columns
Check If R Is Running in Rstudio
How to Create a Pivot Table in R with Multiple (3+) Variables
Display Y-Axis for Each Subplot When Faceting
Avoid Wasting Space When Placing Multiple Aligned Plots Onto One Page
How to Install R Package from Private Repo Using Devtools Install_Github
Logistic Regression - Defining Reference Level in R
Display Only Months in Daterangeinput or Dateinput for a Shiny App [R Programming]
How to Draw Gridlines Using Abline() That Are Behind the Data
Implementation of Parallel Coordinates
Multiple Functions in a Single Tapply or Aggregate Statement