Fit Many Formulae at Once: Faster Options Than lapply

Why is the apply() method slower than a for loop in R?

As Chase said: Use the power of vectorization. You're comparing two bad solutions here.

To clarify why your apply solution is slower:

Within the for loop, you actually use the vectorized indices of the matrix, meaning there is no type conversion going on. I'm glossing over the details here, but essentially the internal calculation ignores the dimensions: they are just kept as an attribute and returned along with the vector representing the matrix. To illustrate:

> x <- 1:10
> attr(x,"dim") <- c(5,2)
> y <- matrix(1:10,ncol=2)
> all.equal(x,y)
[1] TRUE
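
Linear (vectorized) indexing and [row, col] indexing address the same underlying vector, which is why the loop can work on the matrix as if it were a plain vector. A small illustration using the y defined above:

# element 7 of the underlying vector 1:10 sits at row 2, column 2
y[7]     # 7
y[2, 2]  # 7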

Now, when you use apply, the matrix is split up internally into 100,000 row vectors, every row vector (i.e. a single number) is put through the function, and in the end the results are combined into an appropriate form. The apply function reckons a vector is best in this case, and thus has to concatenate the results of all the rows. That takes time.

The sapply function likewise first uses as.vector(unlist(...)) to convert anything to a vector, and at the end tries to simplify the answer into a suitable form. That also takes time, so sapply might be slower here as well. Yet, it's not on my machine.
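
(The mash and squish used in the timings below are not defined in this excerpt; what follows is a minimal sketch of what they plausibly look like, purely an assumption so the comparison can be reproduced, not the question's actual code.)

# Hypothetical stand-ins for the question's objects (assumptions):
# a large matrix, a scalar function, and a for-loop wrapper that
# applies it element-wise using linear indexing.
million <- matrix(rnorm(1e6), ncol = 1)

squish <- function(x) max(min(x, 1), -1)  # clamp one value to [-1, 1]

mash <- function(m) {
  out <- m
  for (i in seq_along(m)) {  # linear indexing over the underlying vector
    out[i] <- squish(m[i])
  }
  out
}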

If apply were a solution here (and it isn't), you could compare:

> system.time(loop_million <- mash(million))
user system elapsed
0.75 0.00 0.75
> system.time(sapply_million <- matrix(unlist(sapply(million,squish,simplify=F))))
user system elapsed
0.25 0.00 0.25
> system.time(sapply2_million <- matrix(sapply(million,squish)))
user system elapsed
0.34 0.00 0.34
> all.equal(loop_million,sapply_million)
[1] TRUE
> all.equal(loop_million,sapply2_million)
[1] TRUE
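
For comparison, a genuinely vectorized version (assuming the clamping squish sketched above) needs no per-element R function calls at all and beats every loop and apply variant:

# one whole-matrix operation instead of a million function calls
system.time(vec_million <- pmax(pmin(million, 1), -1))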

Fit many glm models: improve speed

The IRLS algorithm typically used for fitting GLMs requires matrix inversion/decomposition at each iteration. fastglm offers several different options for the decomposition, and the default choice is a slower but more stable one (QR with column pivoting). If your only interest is speed, then either of the two available Cholesky-type decompositions will improve it dramatically, which is more advisable than just changing the number of IRLS iterations. Another notable difference between fastglm and standard IRLS implementations is its careful use of half-steps to prevent divergence (IRLS can diverge in practice in a number of cases).

The method argument of fastglm lets you change the decomposition: option 2 gives the vanilla Cholesky decomposition and option 3 a slightly more stable version of it. On my computer, the timings for your provided example are:

> system.time(m_glm <- glm(fo, data=df, family = binomial))
user system elapsed
23.206 0.429 23.689

> system.time(m_speedglm <- speedglm(fo, data= df, family = binomial()))
user system elapsed
15.448 0.283 15.756

> system.time(m_fastglm <- fastglm(x, y, family = binomial(), method = 2))
user system elapsed
2.159 0.055 2.218

> system.time(m_fastglm <- fastglm(x, y, family = binomial(), method = 3))
user system elapsed
2.247 0.065 2.337
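
(fastglm works on a design matrix and a numeric response vector rather than a formula; a sketch of how the x and y above would be built from the question's fo and df:)

library(fastglm)

# build the design matrix and response from the formula and data
x <- model.matrix(fo, data = df)
y <- model.response(model.frame(fo, data = df))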

With regard to using broom with fastglm objects, I can look into that.

Another note about decompositions: when fastglm uses the QR decomposition, it works with the design matrix directly. Although speedglm technically offers a QR decomposition, it first computes $X^TX$ and decomposes that, which is less numerically stable than a QR decomposition of $X$ itself, since the condition number of $X^TX$ is the square of that of $X$.

How to fit multiple interaction models in a loop?

Here is a sort of functional-programming approach. You create your data, and as long as Y is the first column, this code takes all the remaining variables (no matter how many) and builds models on their single and pairwise combinations.

Finally, since you've done it in this framework, you can call broom's tidy and confint_tidy to extract the results into an easy-to-filter dataset.

library(dplyr)
library(tidyr)
library(purrr)

DF <- data_frame(Y = rpois(100, 5),
                 A = rnorm(100),
                 C = rnorm(100),
                 M = rnorm(100))

# one row per single predictor, plus one row per pair of predictors
formula_frame <- bind_rows(data_frame(V1 = names(DF[,-1])),
                           as_data_frame(t(combn(names(DF[,-1]), 2)))) %>%
  rowwise() %>%
  mutate(formula_text = paste0("Y ~", if_else(is.na(V2),
                                              V1,
                                              paste(V1, V2, sep = "*"))),
         formula_obj = list(as.formula(formula_text))) %>%
  ungroup()

formula_frame %>%
  mutate(fits = map(formula_obj, ~ glm(.x, family = "poisson", data = DF) %>%
                      (function(X) bind_cols(broom::tidy(X), broom::confint_tidy(X))))) %>%
  unnest(fits) %>%
  select(-formula_obj)
# A tibble: 18 x 10
V1 V2 formula_text term estimate std.error statistic p.value conf.low conf.high
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A NA Y ~A (Intercept) 1.63 0.0443 36.8 6.92e-297 1.54 1.72
2 A NA Y ~A A 0.0268 0.0444 0.602 5.47e- 1 -0.0603 0.114
3 C NA Y ~C (Intercept) 1.63 0.0443 36.8 5.52e-296 1.54 1.72
4 C NA Y ~C C 0.0326 0.0466 0.699 4.84e- 1 -0.0587 0.124
5 M NA Y ~M (Intercept) 1.63 0.0454 35.8 1.21e-280 1.53 1.71
6 M NA Y ~M M -0.0291 0.0460 -0.634 5.26e- 1 -0.119 0.0615
7 A C Y ~A*C (Intercept) 1.62 0.0446 36.4 5.64e-290 1.54 1.71
8 A C Y ~A*C A 0.00814 0.0459 0.178 8.59e- 1 -0.0816 0.0982
9 A C Y ~A*C C 0.0410 0.0482 0.850 3.96e- 1 -0.0532 0.136
10 A C Y ~A*C A:C 0.0650 0.0474 1.37 1.70e- 1 -0.0270 0.158
11 A M Y ~A*M (Intercept) 1.62 0.0458 35.5 1.21e-275 1.53 1.71
12 A M Y ~A*M A 0.0232 0.0451 0.514 6.07e- 1 -0.0653 0.112
13 A M Y ~A*M M -0.0260 0.0464 -0.561 5.75e- 1 -0.116 0.0655
14 A M Y ~A*M A:M -0.00498 0.0480 -0.104 9.17e- 1 -0.0992 0.0887
15 C M Y ~C*M (Intercept) 1.60 0.0472 34.0 1.09e-253 1.51 1.70
16 C M Y ~C*M C 0.0702 0.0506 1.39 1.65e- 1 -0.0291 0.169
17 C M Y ~C*M M -0.0333 0.0479 -0.695 4.87e- 1 -0.127 0.0611
18 C M Y ~C*M C:M 0.0652 0.0377 1.73 8.39e- 2 -0.0102 0.138
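
Because everything lands in one ordinary tibble, downstream filtering is a one-liner. For example, assuming the final pipeline above were assigned to results:

# keep only the interaction terms (assumes the pipeline's output was
# assigned to `results`)
results %>% filter(grepl(":", term))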

lapply: Fitting thousands of mixed models and being able to extract lsmeans

Your original setup would work if you add one line to modelSeq():

modelSeq <- function (x, dat) {
  environment(x) <- environment()
  return(tryCatch(lmer(x, data = dat), error = function(e) NULL))
}

This sets the environment of the formula to that of the function body, making it possible to find the dataset named dat.
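
A hypothetical usage sketch (the data, grouping factor, and formula list here are illustrative assumptions, not the question's objects):

library(lme4)

# illustrative data and formulas (assumptions)
myData <- data.frame(y = rnorm(60), x1 = rnorm(60), x2 = rnorm(60),
                     g = rep(letters[1:6], each = 10))
fos <- list(y ~ x1 + (1 | g), y ~ x1 + x2 + (1 | g))

# fit each formula; fits that error out are kept as NULL instead of
# aborting the whole loop
fits <- lapply(fos, modelSeq, dat = myData)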

A similar example:

fitm <- function(formula, data, ...) {
  environment(formula) <- environment()
  lm(formula, data = data, ...)
}

library(emmeans)

fl <- list(breaks ~ tension, breaks ~ wool + tension, breaks ~ wool * tension)

md <- lapply(fl, fitm, data = warpbreaks[c(1, 2, 3, 5, 8, 13, 21, 34, 54), ])

lapply(md, function(m) emmeans(m, "tension"))

Which produces:

NOTE: Results may be misleading due to involvement in interactions

[[1]]
tension emmean SE df lower.CL upper.CL
L 41.2 6.64 6 24.91 57.4
M 17.0 16.27 6 -22.82 56.8
H 26.0 11.51 6 -2.16 54.2

Confidence level used: 0.95

[[2]]
tension emmean SE df lower.CL upper.CL
L 41.6 8.91 5 18.73 64.5
M 17.7 19.41 5 -32.21 67.6
H 26.0 12.59 5 -6.38 58.4

Results are averaged over the levels of: wool
Confidence level used: 0.95

[[3]]
tension emmean SE df lower.CL upper.CL
L 41.1 10.9 4 10.9 71.3
M nonEst NA NA NA NA
H 26.0 14.1 4 -13.0 65.0

Results are averaged over the levels of: wool
Confidence level used: 0.95

BTW, you don't need the lsmeans package; it is just a front-end for emmeans. In fact, the lsmeans function itself is in emmeans; it just runs emmeans and re-labels the results.

Applying multiple model formulas to groups of data

You could mutate list columns in place, using mutate_at (or mutate_if). This saves several iterations and makes the code pipeable and more compact.

library(dplyr)
library(tidyr)
library(purrr)
library(broom)

lin_mod <- function(formula) {
  function(data, ...) {
    map(data, ~ lm(formula, data = .x))
  }
}

list_model <- list(cyl_model   = hwy ~ cyl,
                   displ_model = hwy ~ displ,
                   full_model  = hwy ~ cyl + displ) %>%
  lapply(lin_mod)

ggplot2::mpg %>%
  group_by(manufacturer) %>%
  nest() %>%
  mutate_at(.vars = "data", .funs = list_model) %>%
  mutate_at(.vars = vars(ends_with("model")), .funs = ~ map(.x, augment)) %>%
  mutate_at(.vars = vars(ends_with("model")), .funs = ~ map(.x, ".resid")) %>%
  unnest()
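
If the scoped verbs feel opaque, roughly the same result can be had with a plain mutate() plus map(). This is a sketch, not the answer's original code, and uses across() from newer dplyr:

# equivalent sketch with plain mutate() + map() (newer dplyr idiom)
ggplot2::mpg %>%
  group_by(manufacturer) %>%
  nest() %>%
  mutate(cyl_model   = map(data, ~ lm(hwy ~ cyl,         data = .x)),
         displ_model = map(data, ~ lm(hwy ~ displ,       data = .x)),
         full_model  = map(data, ~ lm(hwy ~ cyl + displ, data = .x))) %>%
  mutate(across(ends_with("model"), ~ map(.x, augment)))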

How to succinctly write a formula with many variables from a data frame?

There is a special identifier that one can use in a formula to mean all the variables: the . identifier.

y <- c(1,4,6)
d <- data.frame(y = y, x1 = c(4,-1,3), x2 = c(3,9,8), x3 = c(4,-4,-2))
mod <- lm(y ~ ., data = d)

You can also do things like this, to use all variables but one (in this case x3 is excluded):

mod <- lm(y ~ . - x3, data = d)

Technically, . means all variables not already mentioned in the formula. For example

lm(y ~ x1 * x2 + ., data = d)

where . would then reference only x3, as x1 and x2 are already in the formula.
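
If you need the expanded formula explicitly (for instance to store or print it), you can also build it from the column names. A small sketch using the d defined above:

# build the expanded formula from the column names instead of "."
predictors <- setdiff(names(d), "y")
fo <- reformulate(predictors, response = "y")
fo
# y ~ x1 + x2 + x3
mod <- lm(fo, data = d)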

lapply function to pass single and + arguments to lm

I've generally found it more robust/easier to understand to use reformulate to construct formulas via string manipulations rather than trying to use substitute() to modify an expression, e.g.

model_combinations <- c('.', 'Long', 'Lat', 'Elev', 'Lat+Elev')
model_formulas <- lapply(model_combinations, reformulate, response = "Y")
lm_models <- lapply(model_formulas, lm, data = climatol_ann)

Because reformulate works at a string level, it doesn't have a problem if the elements are themselves non-atomic (e.g. Lat+Elev). The only tricky situation here is if your data argument or variables are constructed in some environment that can't easily be found, but passing an explicit data argument usually avoids problems.

(You can also use as.formula(paste(...)) or as.formula(sprintf(...)); reformulate() is just a convenient wrapper.)
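
A quick illustration of the equivalence:

# reformulate() assembles the formula from strings
reformulate("Lat+Elev", response = "Y")
# Y ~ Lat + Elev

# the same thing by hand
as.formula(paste("Y ~", "Lat + Elev"))
# Y ~ Lat + Elev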


