Why is apply() method slower than a for loop in R?
As Chase said: Use the power of vectorization. You're comparing two bad solutions here.
To clarify why your apply solution is slower:
Within the for loop, you actually use the vectorized indices of the matrix, meaning there is no type conversion going on. I'm glossing over the details a bit here, but basically the internal calculation ignores the dimensions: they're just kept as an attribute and returned along with the vector representing the matrix. To illustrate:
> x <- 1:10
> attr(x,"dim") <- c(5,2)
> y <- matrix(1:10,ncol=2)
> all.equal(x,y)
[1] TRUE
Now, when you use apply, the matrix is split up internally into 100,000 row vectors, every row vector (i.e. a single number) is put through the function, and in the end the results are combined into an appropriate form. apply reckons a vector is best in this case, and thus has to concatenate the results of all rows. This takes time.
Also, sapply first uses as.vector(unlist(...)) to convert everything to a vector, and in the end tries to simplify the answer into a suitable form. This also takes time, hence sapply might be slower here as well. Yet, it isn't on my machine.
If apply were a solution here (and it isn't), you could compare:
> system.time(loop_million <- mash(million))
user system elapsed
0.75 0.00 0.75
> system.time(sapply_million <- matrix(unlist(sapply(million,squish,simplify=F))))
user system elapsed
0.25 0.00 0.25
> system.time(sapply2_million <- matrix(sapply(million,squish)))
user system elapsed
0.34 0.00 0.34
> all.equal(loop_million,sapply_million)
[1] TRUE
> all.equal(loop_million,sapply2_million)
[1] TRUE
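To see what vectorization buys you over apply, here is a minimal sketch. It uses element-wise squaring as a stand-in operation, since the question's mash() and squish() are not shown in this answer:

```r
# A large two-column matrix, as a stand-in for the question's data.
million <- matrix(rnorm(2e5), ncol = 2)

# apply() splits the matrix into rows, calls the function once per row,
# and reassembles the results -- lots of per-row overhead.
apply_version <- t(apply(million, 1, function(row) row^2))

# The vectorized version operates on the underlying vector in one pass,
# with the dim attribute carried along untouched.
vectorized_version <- million^2

all.equal(apply_version, vectorized_version)  # TRUE
```

Timing both (e.g. with system.time()) shows the vectorized form is orders of magnitude faster, for exactly the reasons described above.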
Fit many glm models: improve speed
The IRLS algorithm typically used for fitting GLMs requires a matrix inversion/decomposition at each iteration. fastglm offers several different options for the decomposition, and the default choice is a slower but more stable one (QR with column pivoting). If your only interest is speed, then one of the two available Cholesky-type decompositions will improve speed dramatically, which is more advisable than just changing the number of IRLS iterations. Another notable difference between fastglm and standard IRLS implementations is its careful use of half-steps to prevent divergence (IRLS can diverge in practice in a number of cases).
The method argument of fastglm allows one to change the decomposition: option 2 gives the vanilla Cholesky decomposition and option 3 gives a slightly more stable version of it. On my computer, the timings for your provided example are:
> system.time(m_glm <- glm(fo, data=df, family = binomial))
user system elapsed
23.206 0.429 23.689
> system.time(m_speedglm <- speedglm(fo, data= df, family = binomial()))
user system elapsed
15.448 0.283 15.756
> system.time(m_fastglm <- fastglm(x, y, family = binomial(), method = 2))
user system elapsed
2.159 0.055 2.218
> system.time(m_fastglm <- fastglm(x, y, family = binomial(), method = 3))
user system elapsed
2.247 0.065 2.337
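Note that unlike glm(), fastglm() takes a numeric design matrix and a response vector rather than a formula. A minimal sketch of how the x and y above can be built (the data here is made up for illustration; substitute your own data frame and formula):

```r
set.seed(1)
df <- data.frame(y = rbinom(1000, 1, 0.5),
                 x1 = rnorm(1000),
                 x2 = rnorm(1000))
fo <- y ~ x1 + x2

# model.matrix() expands the formula's right-hand side (intercept column,
# factor dummies, interactions) into the numeric matrix fastglm expects.
x <- model.matrix(fo, data = df)
y <- df$y

if (requireNamespace("fastglm", quietly = TRUE)) {
  m_fastglm <- fastglm::fastglm(x, y, family = binomial(), method = 2)
  print(coef(m_fastglm))
}
```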
With regard to using broom with fastglm objects, I can look into that.
Another note about decompositions: when fastglm uses the QR decomposition, it works with the design matrix directly. Although speedglm technically offers a QR decomposition, it works by first computing $X^TX$ and decomposing that, which is less numerically stable than a QR decomposition of $X$ itself.
How to fit multiple interaction models in a loop?
Here is a sort of functional-programming approach. You create your data, and as long as your Y is the first column, this code will take all the remaining variables (no matter how many) and construct models on their combinations. Finally, since you've done it in this framework, you can call broom's tidy and confint_tidy to extract the results into an easily filterable dataset.
library(dplyr)
library(tidyr)
library(purrr)

DF <- data_frame(Y = rpois(100, 5),
                 A = rnorm(100),
                 C = rnorm(100),
                 M = rnorm(100))
formula_frame <- bind_rows(data_frame(V1 = names(DF[,-1])),
as_data_frame(t(combn(names(DF[,-1]),2)))) %>%
rowwise() %>%
mutate(formula_text = paste0("Y ~", if_else(is.na(V2),
V1,
paste(V1,V2, sep = "*"))),
formula_obj = list(as.formula(formula_text))) %>%
ungroup()
formula_frame %>%
mutate(fits = map(formula_obj, ~glm(.x, family = "poisson", data = DF) %>%
(function(X) bind_cols(broom::tidy(X), broom::confint_tidy(X))))) %>%
unnest(fits) %>%
select(-formula_obj)
# A tibble: 18 x 10
V1 V2 formula_text term estimate std.error statistic p.value conf.low conf.high
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A NA Y ~A (Intercept) 1.63 0.0443 36.8 6.92e-297 1.54 1.72
2 A NA Y ~A A 0.0268 0.0444 0.602 5.47e- 1 -0.0603 0.114
3 C NA Y ~C (Intercept) 1.63 0.0443 36.8 5.52e-296 1.54 1.72
4 C NA Y ~C C 0.0326 0.0466 0.699 4.84e- 1 -0.0587 0.124
5 M NA Y ~M (Intercept) 1.63 0.0454 35.8 1.21e-280 1.53 1.71
6 M NA Y ~M M -0.0291 0.0460 -0.634 5.26e- 1 -0.119 0.0615
7 A C Y ~A*C (Intercept) 1.62 0.0446 36.4 5.64e-290 1.54 1.71
8 A C Y ~A*C A 0.00814 0.0459 0.178 8.59e- 1 -0.0816 0.0982
9 A C Y ~A*C C 0.0410 0.0482 0.850 3.96e- 1 -0.0532 0.136
10 A C Y ~A*C A:C 0.0650 0.0474 1.37 1.70e- 1 -0.0270 0.158
11 A M Y ~A*M (Intercept) 1.62 0.0458 35.5 1.21e-275 1.53 1.71
12 A M Y ~A*M A 0.0232 0.0451 0.514 6.07e- 1 -0.0653 0.112
13 A M Y ~A*M M -0.0260 0.0464 -0.561 5.75e- 1 -0.116 0.0655
14 A M Y ~A*M A:M -0.00498 0.0480 -0.104 9.17e- 1 -0.0992 0.0887
15 C M Y ~C*M (Intercept) 1.60 0.0472 34.0 1.09e-253 1.51 1.70
16 C M Y ~C*M C 0.0702 0.0506 1.39 1.65e- 1 -0.0291 0.169
17 C M Y ~C*M M -0.0333 0.0479 -0.695 4.87e- 1 -0.127 0.0611
18 C M Y ~C*M C:M 0.0652 0.0377 1.73 8.39e- 2 -0.0102 0.138
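Because everything ends up in one tidy tibble, filtering the results is a one-liner. A self-contained sketch with a single model (using tidy(fit, conf.int = TRUE), which returns the same columns as the tidy()/confint_tidy() pair in one call):

```r
library(dplyr)
library(broom)

DF <- tibble(Y = rpois(100, 5), A = rnorm(100), C = rnorm(100))
fit <- glm(Y ~ A * C, family = "poisson", data = DF)

# One row per term, with conf.low/conf.high columns included.
results <- tidy(fit, conf.int = TRUE)

# e.g. keep only the interaction terms:
filter(results, grepl(":", term))
```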
lapply: Fitting thousands of mixed models and being able to extract lsmeans
Your original setup would work if you add one line to modelSeq():
modelSeq <- function (x, dat) {
    environment(x) <- environment()
    return(tryCatch(lmer(x, data = dat), error = function(e) NULL))
}
This sets the environment of the formula to that of the function body, making it possible to find the dataset named dat.
A similar example:
library(emmeans)

fitm <- function(formula, data, ...) {
    environment(formula) <- environment()
    lm(formula, data = data, ...)
}
fl <- list(breaks ~ tension, breaks ~ wool + tension, breaks ~ wool*tension)
md <- lapply(fl, fitm, data = warpbreaks[c(1,2,3,5,8,13,21,34,54), ])
lapply(md, function(m) emmeans(m, "tension"))
Which produces:
NOTE: Results may be misleading due to involvement in interactions
[[1]]
tension emmean SE df lower.CL upper.CL
L 41.2 6.64 6 24.91 57.4
M 17.0 16.27 6 -22.82 56.8
H 26.0 11.51 6 -2.16 54.2
Confidence level used: 0.95
[[2]]
tension emmean SE df lower.CL upper.CL
L 41.6 8.91 5 18.73 64.5
M 17.7 19.41 5 -32.21 67.6
H 26.0 12.59 5 -6.38 58.4
Results are averaged over the levels of: wool
Confidence level used: 0.95
[[3]]
tension emmean SE df lower.CL upper.CL
L 41.1 10.9 4 10.9 71.3
M nonEst NA NA NA NA
H 26.0 14.1 4 -13.0 65.0
Results are averaged over the levels of: wool
Confidence level used: 0.95
BTW, you don't need the lsmeans package; it is just a front-end for emmeans. In fact, the lsmeans function itself is in emmeans; it just runs emmeans and re-labels the results.
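If you then want all the estimated marginal means in a single data frame rather than a list of printed tables, you can coerce each emmeans result with as.data.frame() and bind them, tagging each row with its model. A minimal sketch (rebuilt here with lm and the full warpbreaks data so it stands alone):

```r
fl <- list(breaks ~ tension, breaks ~ wool + tension)
md <- lapply(fl, lm, data = warpbreaks)

if (requireNamespace("emmeans", quietly = TRUE)) {
  # as.data.frame() turns each emmGrid into a plain data frame
  ems <- lapply(md, function(m) as.data.frame(emmeans::emmeans(m, "tension")))
  # one combined frame, with a column recording which model each row came from
  combined <- do.call(rbind, Map(cbind, model = seq_along(ems), ems))
  print(combined)
}
```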
Applying multiple model formulas to groups of data
You could mutate list columns in place, using mutate_at (or mutate_if). This saves several iterations and makes the code pipeable and more compact.
library(dplyr)
library(tidyr)
library(purrr)
library(broom)
lin_mod = function(formula) {
    function(data, ...) {
        map(data, ~lm(formula, data = .x))
    }
}

list_model <- list(cyl_model = hwy ~ cyl,
                   displ_model = hwy ~ displ,
                   full_model = hwy ~ cyl + displ) %>%
    lapply(lin_mod)

ggplot2::mpg %>%
    group_by(manufacturer) %>%
    nest() %>%
    mutate_at(.vars = "data", .funs = list_model) %>%
    mutate_at(.vars = vars(ends_with("model")), .funs = ~map(.x, augment)) %>%
    mutate_at(.vars = vars(ends_with("model")), .funs = ~map(.x, ".resid")) %>%
    unnest()
How to succinctly write a formula with many variables from a data frame?
There is a special identifier that one can use in a formula to refer to all the variables: the . identifier.
y <- c(1,4,6)
d <- data.frame(y = y, x1 = c(4,-1,3), x2 = c(3,9,8), x3 = c(4,-4,-2))
mod <- lm(y ~ ., data = d)
You can also do things like this, to use all variables but one (in this case x3 is excluded):
mod <- lm(y ~ . - x3, data = d)
Technically, . means all variables not already mentioned in the formula. For example, in
lm(y ~ x1 * x2 + ., data = d)
. would only reference x3, as x1 and x2 are already in the formula.
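You can see exactly what . expands to by inspecting the terms object built from the formula and data:

```r
y <- c(1, 4, 6)
d <- data.frame(y = y, x1 = c(4, -1, 3), x2 = c(3, 9, 8), x3 = c(4, -4, -2))

# term.labels lists the expanded right-hand side of the formula
attr(terms(y ~ ., data = d), "term.labels")
# "x1" "x2" "x3"

attr(terms(y ~ x1 * x2 + ., data = d), "term.labels")
# "x1" "x2" "x3" "x1:x2"
```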
lapply function to pass single and + arguments to LM
I've generally found it more robust and easier to understand to use reformulate to construct formulas via string manipulation, rather than trying to use substitute() to modify an expression, e.g.
model_combinations <- c('.', 'Long', 'Lat', 'Elev', 'Lat+Elev')
model_formulas <- lapply(model_combinations, reformulate, response = "Y")
lm_models <- lapply(model_formulas, lm, data = climatol_ann)
Because reformulate works at the string level, it doesn't have a problem if the elements are themselves non-atomic (e.g. Lat+Elev). The only tricky situation is if your data argument or variables are constructed in some environment that can't easily be found, but passing an explicit data argument usually avoids problems.
(You can also use as.formula(paste(...)) or as.formula(sprintf(...)); reformulate() is just a convenient wrapper.)