Is there a faster lm function
Yes, there are:
- R itself has lm.fit(), which is more bare-bones: no formula notation and a much simpler result set.
- Several of our Rcpp-related packages have fastLm() implementations: RcppArmadillo, RcppEigen, RcppGSL.
We have described fastLm() in a number of blog posts and presentations. If you want the fastest route, do not use the formula interface: parsing the formula and preparing the model matrix take more time than the actual regression.
That said, if you are regressing a single vector on a single vector you can simplify this as no matrix package is needed.
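For example, a minimal sketch (made-up data, assuming RcppArmadillo is installed) of fitting the same model three ways, with the model matrix built once so no formula parsing is involved:

library(RcppArmadillo)
set.seed(1)
n <- 1e5
X <- cbind(1, rnorm(n), rnorm(n))          # intercept plus two predictors
y <- drop(X %*% c(1, 2, -1)) + rnorm(n)

coef(lm(y ~ X[, -1]))                      # formula interface: parses formula, builds model matrix
lm.fit(X, y)$coefficients                  # bare-bones base R fit on the prebuilt matrix
fastLmPure(X, y)$coefficients              # RcppArmadillo fit, also skipping the formula machinery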
Faster functions than lm() in R
Apparently, it should not be a problem to run a regression on a dataset of ~100,000 observations.
After receiving helpful comments on the main post, I found that one of the independent variables used in the regression was coded as a character. I checked the data type of every column in the dataframe (df) with:
str(df)
$ var1 : chr "x1" "x2" "x1" "x1"
$ var2 : Factor w/ 2 levels "factor1" "factor2": 1 1 1 0
$ var3 : Factor w/ 2 levels "factorx" "factory": 0 1 1 0
$ var4 : num 1 8 3 2
Changing var1 to a factor variable:
df$var1 <- as.factor(df$var1)
After changing var1 to a factor variable, the regression indeed runs within a few seconds.
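If several columns are affected, a small sketch (generic, not specific to this dataset) converts every character column to a factor in one pass:

df[] <- lapply(df, function(col) if (is.character(col)) factor(col) else col)
str(df)   # var1 should now show up as a Factor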
Why is the built-in lm function so slow in R?
You are overlooking that:
- solve() only returns your parameters, whereas lm() returns you a (very rich) object with many components for subsequent analysis, inference, plots, ...
- the main cost of your lm() call is not the projection but the resolution of the formula y ~ . from which the model matrix needs to be built.
To illustrate Rcpp we wrote a few variants of a function fastLm() doing more of what lm() does (i.e. a bit more than lm.fit() from base R) and measured it. See e.g. this benchmark script, which clearly shows that the dominant cost for smaller data sets is in parsing the formula and building the model matrix.
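As a rough sketch of that comparison (made-up data; timings will vary by machine), you can time the full lm() call against building the model matrix and against the fit step alone:

library(microbenchmark)
set.seed(2)
df <- data.frame(y = rnorm(1e4), x1 = rnorm(1e4), x2 = gl(4, 2500))
X <- model.matrix(y ~ ., data = df)        # built once, outside the benchmark

microbenchmark(
    formula_interface = lm(y ~ ., data = df),           # parse formula + build matrix + fit
    model_matrix_only = model.matrix(y ~ ., data = df), # just the model matrix
    fit_only          = lm.fit(X, df$y),                # just the least-squares fit
    times = 100
)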
In short, you are doing the Right Thing by benchmarking, but you are not doing it all that correctly: you are comparing what is mostly incomparable, a subset of the work against a much larger task.
Optimizing lm() function in a loop
Wide-ranging comments above, but I'll try to answer a few narrower points.
- I seem to get the same (i.e., all.equal() is TRUE) results with .lm.fit and fastLmPure, if I'm careful about random-number seeds:
library(Rcpp)
library(RcppEigen)
library(microbenchmark)
nsim <- 1e3
n <- 1e5
set.seed(101)
dd <- data.frame(Y=rnorm(n))
testfun <- function(fitFn = .lm.fit, seed = NULL) {
    if (!is.null(seed)) set.seed(seed)
    x <- rnorm(n)
    reg2 <- fitFn(as.matrix(x), dd$Y)$residuals
    return(max(abs(reg2) / sd(reg2)))
}
## make sure NOT to use seed=101 (also used to pick y):
## if we have y==x then results are unstable (resids approx. 0)
all.equal(testfun(seed=102), testfun(fastLmPure, seed=102)) ## TRUE
fastLmPure is fastest (but not by very much):
(bm1 <- microbenchmark(testfun(),
                       testfun(lm.fit),
                       testfun(fastLmPure),
                       times = 1000))
Unit: milliseconds
expr min lq mean median uq max
testfun() 6.603822 8.234967 8.782436 8.332270 8.745622 82.54284
testfun(lm.fit) 7.666047 9.334848 10.201158 9.503538 10.742987 99.15058
testfun(fastLmPure) 5.964700 7.358141 7.818624 7.471030 7.782182 86.47498
If you wanted to fit many independent responses, rather than many independent predictors (i.e. if you were varying Y rather than X in the regression), you could provide a matrix for Y in .lm.fit, rather than looping over lots of regressions, which might be a big win. If all you care about are "residuals of random regressions", that might be worth a try. (Unfortunately, providing a matrix that combines many separate X vectors runs a multiple regression, not many univariate regressions ...)
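For instance, a small sketch of the many-responses case (made-up data, no intercept, matching the testfun() setup above): a single .lm.fit call regresses every column of Y on the same x and returns a matrix of residuals, one column per response.

set.seed(1)
n <- 1e4; nresp <- 50
x <- rnorm(n)
Y <- matrix(rnorm(n * nresp), n, nresp)    # 50 independent response vectors
fit <- .lm.fit(as.matrix(x), Y)            # one call fits all 50 univariate regressions
dim(fit$residuals)                         # n x nresp: one residual column per response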
Parallelizing is worthwhile, but will only scale (at best) according to the number of cores you have available. Doing a single run rather than a set of benchmarks because I'm lazy ...
Running 5000 replicates sequentially takes about 40 seconds for me (modern Linux laptop).
system.time(replicate(5000,testfun(fastLmPure), simplify=FALSE))
## user system elapsed
## 38.953 0.072 39.028
Running in parallel on 5 cores takes about 13 seconds, so a 3-fold speedup for 5 cores. This will probably be a bit better if the individual jobs are larger, but obviously will never scale better than the number of cores ... (8 cores didn't do much better).
library(parallel)
system.time(mclapply(1:5000, function(x) testfun(fastLmPure),
mc.cores=5))
## user system elapsed
## 43.225 0.627 12.970
It makes sense to me that parallelizing at a higher/coarser level (across runs rather than within lm fits) will perform better.
I wonder if there are analytical results you could use in terms of the order statistics of a t distribution ... ?
How to make group_by and lm fast?
In theory
Firstly, you can fit a linear model with multiple LHS.
Secondly, explicit data splitting is not the only way (or a recommended way) for group-by regression. See R regression analysis: analyzing data for a certain ethnicity and R: build separate models for each category. So build your model as cbind(x1, x2, x3, x4) ~ day * subject, where subject is a factor variable.
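As a small illustration of the multiple-LHS point (made-up data): lm accepts a matrix response built with cbind, so all responses share one model matrix and one fit.

set.seed(1)
toy <- data.frame(day = rep(1:5, 4), subject = gl(4, 5),
                  x1 = rnorm(20), x2 = rnorm(20))
fit <- lm(cbind(x1, x2) ~ day * subject, data = toy)
coef(fit)       # one column of coefficients per response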
Finally, since you have many factor levels and work with a large dataset, lm is infeasible. Consider using speedglm::speedlm with sparse = TRUE, or MatrixModels::glm4 with sparse = TRUE.
In reality
Neither speedlm nor glm4 is under active development. Their functionality is (in my view) primitive.
Neither speedlm nor glm4 supports multiple LHS as lm does, so you need to fit 4 separate models, x1 ~ day * subject through x4 ~ day * subject, instead.
These two packages have different logic behind sparse = TRUE:
- speedlm first constructs a dense design matrix using the standard model.matrix.default, then uses is.sparse to survey whether it is sparse or not. If TRUE, subsequent computations can use sparse methods.
- glm4 uses model.Matrix to construct the design matrix, so a sparse one can be built directly.
So it is not surprising that speedlm is as bad as lm on this sparsity issue, and glm4 is the one we really want to use.
glm4 does not have a full, useful set of generic functions for analyzing fitted models. You can extract coefficients, fitted values and residuals with coef, fitted and residuals, but you have to compute all statistics (standard errors, t-statistics, F-statistic, etc.) yourself. This is not a big deal for people who know regression theory well, but it is still rather inconvenient.
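If you do need those statistics, here is a minimal sketch (small dense example, textbook least-squares formulas) of computing standard errors, t-statistics and p-values from nothing more than the design matrix, coefficients and residuals:

set.seed(1)
X   <- cbind(1, rnorm(100), rnorm(100))            # small dense design matrix
y   <- drop(X %*% c(1, 2, -1)) + rnorm(100)
fit <- lm.fit(X, y)                                # returns only coefficients, residuals, ...

rdf    <- nrow(X) - ncol(X)                        # residual degrees of freedom
sigma2 <- sum(fit$residuals^2) / rdf               # residual variance estimate
se     <- sqrt(sigma2 * diag(solve(crossprod(X)))) # standard errors
tval   <- fit$coefficients / se                    # t-statistics
pval   <- 2 * pt(abs(tval), rdf, lower.tail = FALSE)
cbind(estimate = fit$coefficients, se = se, t = tval, p = pval)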
glm4 still expects you to use the best model formula so that the sparsest matrix can be constructed. The conventional ~ day * subject is really not a good one. I should probably set up a Q & A on this issue later. Basically, if your formula has an intercept and factors are contrasted, you lose sparsity. This is the one we should use: ~ 0 + subject + day:subject.
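A quick sketch of the difference (small made-up data): build both design matrices with Matrix::sparse.model.matrix and compare the fraction of non-zero entries.

library(Matrix)
set.seed(1)
toy <- data.frame(day = rep(1:5, 100), subject = gl(100, 5))

X1 <- sparse.model.matrix(~ day * subject, data = toy)              # intercept + contrasts
X2 <- sparse.model.matrix(~ 0 + subject + day:subject, data = toy)  # the sparse-friendly form
c(dense_formula = nnzero(X1) / length(X1),
  sparse_formula = nnzero(X2) / length(X2))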
A test with glm4
## use chinsoon12's data in his answer
set.seed(0L)
nSubj <- 200e3
nr <- 1e6
DF <- data.frame(subject = gl(nSubj, 5),
                 day = 3:7,
                 y1 = runif(nr),
                 y2 = rpois(nr, 3),
                 y3 = rnorm(nr),
                 y4 = rnorm(nr, 1, 5))
library(MatrixModels)
fit <- glm4(y1 ~ 0 + subject + day:subject, data = DF, sparse = TRUE)
It takes about 6 ~ 7 seconds on my 1.1GHz Sandy Bridge laptop. Let's extract its coefficients:
b <- coef(fit)
head(b)
# subject1 subject2 subject3 subject4 subject5 subject6
# 0.4378952 0.3582956 -0.2597528 0.8141229 1.3337102 -0.2168463
tail(b)
#subject199995:day subject199996:day subject199997:day subject199998:day
# -0.09916175 -0.15653402 -0.05435883 -0.02553316
#subject199999:day subject200000:day
# 0.02322640 -0.09451542
You can do B <- matrix(b, ncol = 2), so that the 1st column is the intercept and the 2nd is the slope.
My thoughts: We probably need better packages for large regression
Using glm4 here does not give an attractive advantage over chinsoon12's data.table solution, since it also basically just gives you the regression coefficients. It is also a bit slower than the data.table method, because it computes fitted values and residuals as well.
Analysis of simple regression does not require a proper model fitting routine. I have a few answers on how to do fancy stuff with this kind of regression, like Fast pairwise simple linear regression between all variables in a data frame, where details on how to compute all statistics are also given. But as I wrote this answer, I was thinking about large regression problems in general. We might need better packages, otherwise there is no end to case-by-case coding.
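The kind of thing I mean, as a rough sketch (made-up data; column names are hypothetical): for a single predictor, the group-wise slope and intercept have closed forms, so a data.table group-by computes them directly with no fitting routine at all.

library(data.table)
set.seed(1)
DT <- data.table(subject = gl(1000, 5), day = rep(3:7, 1000), y = rnorm(5000))

res <- DT[, {
    b <- cov(day, y) / var(day)        # slope: cov(x, y) / var(x)
    a <- mean(y) - b * mean(day)       # intercept
    .(intercept = a, slope = b)
}, by = subject]
head(res)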
In reply to OP
speedglm::speedlm(x1 ~ 0 + subject + day:subject, data = df, sparse = TRUE) gives Error: cannot allocate vector of size 74.5 Gb. Yes, because its sparse logic is poor (see above).
MatrixModels::glm4(x1 ~ day * subject, data = df, sparse = TRUE) gives Error in Cholesky(crossprod(from), LDL = FALSE) : internal_chm_factor: Cholesky factorization failed. This is because you have only one datum for some subject: you need at least two data points to fit a line. Here is an example (in dense settings):
dat <- data.frame(t = c(1:5, 1:9, 1),
                  f = rep(gl(3, 1, labels = letters[1:3]), c(5, 9, 1)),
                  y = rnorm(15))
Level "c" of f
only has one datum / row.
X <- model.matrix(~ 0 + f + t:f, dat)
XtX <- crossprod(X)
chol(XtX)
#Error in chol.default(XtX) :
# the leading minor of order 6 is not positive definite
Cholesky factorization is unable to resolve a rank-deficient model. If we use lm's QR factorization, we will see an NA coefficient.
lm(y ~ 0 + f + t:f, dat)
#Coefficients:
# fa fb fc fa:t fb:t fc:t
# 0.49893 0.52066 -1.90779 -0.09415 -0.03512 NA
We can only estimate an intercept for level "c", not a slope.
Note that if you use the data.table solution, you end up with 0 / 0 when computing the slope for this level, and the final result is NaN.
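One workaround, sketched on the dat example above: drop factor levels with fewer than two observations before building the sparse design matrix; then the Cholesky factorization goes through.

keep <- names(which(table(dat$f) >= 2))              # levels with at least 2 rows
dat2 <- dat[dat$f %in% keep, , drop = FALSE]
dat2$f <- droplevels(dat2$f)
chol(crossprod(model.matrix(~ 0 + f + t:f, dat2)))   # succeeds now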
Update: fast solution now available
Check out Fast group-by simple linear regression and general paired simple linear regression.
Faster anova of regression models
We could hack stats:::anova.lmlist so that it works for lists produced by .lm.fit (notice the dot) and RcppArmadillo::fastLm. Please check it against stats:::anova.lmlist, in case I deleted stuff you need!
anova2 <- function(object, ...) {
    objects <- list(object, ...)
    ns <- vapply(objects, function(x) length(x$residuals), 0L)
    stopifnot(!any(ns != ns[1L]))
    resdf <- vapply(objects, df.residual, 0L)
    resdev <- vapply(objects, function(x) crossprod(residuals(x)), 0)
    bigmodel <- order(resdf)[1L]
    dfs <- c(NA, -diff(resdf))[bigmodel]
    ssq <- c(NA, -diff(resdev))[bigmodel]
    df.scale <- resdf[bigmodel]
    scale <- resdev[bigmodel]/resdf[bigmodel]
    fstat <- (ssq/dfs)/scale
    p <- pf(fstat, abs(dfs), df.scale, lower.tail=FALSE)
    return(c(F=fstat, p=p))
}
These wrappers make .lm.fit and RcppArmadillo::fastLm a little more convenient:
lm_fun <- function(fo, dat) {
    X <- model.matrix(fo, dat)
    fit <- .lm.fit(X, dat[[all.vars(fo)[1]]])
    fit$df.residual <- dim(X)[1] - dim(X)[2]
    return(fit)
}
fastLm_fun <- function(fo, dat) {
    fit <- RcppArmadillo::fastLm(model.matrix(fo, dat), dat[[all.vars(fo)[1]]])
    return(fit)
}
Use it
fit1 <- lm_fun(y ~ x1 + x3, data)
fit2 <- lm_fun(y ~ x2 + x3 + x4, data)
anova2(fit1, fit2)
# F p
# 0.3609728 0.5481032
## Or use `fastLm_fun` here, but it's slower.
Compare
anova(lm(y ~ x1 + x3, data), lm(y ~ x2 + x3 + x4, data))
# Analysis of Variance Table
#
# Model 1: y ~ x1 + x3
# Model 2: y ~ x2 + x3 + x4
# Res.Df RSS Df Sum of Sq F Pr(>F)
# 1 997 1003.7
# 2 996 1003.4 1 0.36365 0.361 0.5481
lm_fun (which outperforms fastLm_fun) combined with anova2 appears to be around 60% faster than the conventional approach:
microbenchmark::microbenchmark(
    conventional = {
        fit1 <- lm(y ~ x1 + x3, data)
        fit2 <- lm(y ~ x2 + x3 + x4, data)
        anova(fit1, fit2)
    },
    anova2 = {  ## using `.lm.fit`
        fit1 <- lm_fun(y ~ x1 + x3, data)
        fit2 <- lm_fun(y ~ x2 + x3 + x4, data)
        anova2(fit1, fit2)
    },
    anova2_Fast = {  ## using `RcppArmadillo::fastLm`
        fit1 <- fastLm_fun(y ~ x1 + x3, data)
        fit2 <- fastLm_fun(y ~ x2 + x3 + x4, data)
        anova2(fit1, fit2)
    },
    anova_Donald = {
        fit1 <- lm_fun(y ~ x1 + x3, data)
        fit2 <- lm_fun(y ~ x2 + x3 + x4, data)
        anova_Donald(fit1, fit2)
    },
    times = 1000L,
    control = list(warmup = 100L)
)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# conventional 2.885718 2.967053 3.205947 3.018668 3.090954 40.15720 1000 d
# anova2 1.180683 1.218121 1.285131 1.233335 1.267833 23.81955 1000 a
# anova2_Fast 1.961897 2.012756 2.179458 2.037854 2.087893 26.65279 1000 c
# anova_Donald 1.368699 1.409198 1.561751 1.430562 1.472148 33.12247 1000 b
Data:
set.seed(42)
data <- data.frame(y=rnorm(1000), x1=rnorm(1000), x2=rnorm(1000), x3=rnorm(1000),
                   x4=rnorm(1000))
Optimize the time to run fixed effects in an R lm() model
When you have that many factor levels, constructing the model matrix in lm() will take up most of the time. One way around this is to use a sparse matrix, as glmnet does; two packages, SparseM and MatrixModels, allow fitting lm on a sparse matrix:
set.seed(111)
x <- data.frame(
    a = sample(1:1000, 1000000, replace=T),
    cityfips = sample(1:250, 1000000, replace=T),
    d = sample(1:4, 1000000, replace=T)
)
library(SparseM)
library(MatrixModels)
library(Matrix)
system.time(f_lm <- lm( a~as.factor(cityfips) + d , x ) )
user system elapsed
75.720 2.494 79.365
system.time(f_sparseM <- slm(a~as.factor(cityfips) + d , x ))
user system elapsed
5.373 3.952 10.646
system.time(f_modelMatrix <- glm4(a~as.factor(cityfips) + d ,data=x,sparse=TRUE))
user system elapsed
1.878 0.335 2.219
The closest I can find is glm4 in MatrixModels, but you can see below that the coefficients are the same as in the fit using lm:
all.equal(as.numeric(f_sparseM$coefficients),as.numeric(f_lm$coefficients))
[1] TRUE
all.equal(as.numeric(f_lm$coefficients),as.numeric(coefficients(f_modelMatrix)))
[1] TRUE
One other option besides glm4 in MatrixModels is to use lm.fit (as pointed out by @BenBolker):
lm.fit(x = Matrix::sparse.model.matrix(~ as.factor(cityfips) + d, data = x), y = x$a)
This gives you a list, as lm.fit() normally does, and you cannot apply functions such as summary() etc. to it.
Authors of both packages warn about them being experimental, so there might still be some differences compared to stats::lm; take care to check.
Fast adjusted r-squared extraction
The following function computes the adjusted R-squared from an object returned by .lm.fit and the response vector y.
adj_r2_lmfit <- function(object, y){
    ypred <- y - resid(object)
    mss <- sum((ypred - mean(ypred))^2)
    rss <- sum(resid(object)^2)
    rdf <- length(resid(object)) - object$rank
    r.squared <- mss/(mss + rss)
    adj.r.squared <- 1 - (1 - r.squared)*(NROW(y) - 1)/rdf
    adj.r.squared
}
mtmatrix <- as.matrix(mtcars)   # matrix version of mtcars for .lm.fit

tstlm <- lm(cyl ~ hp + wt, data = mtcars)
tstlmf <- .lm.fit(cbind(1, mtmatrix[, c("hp","wt")]), mtmatrix[, "cyl"])

summary(tstlm)$adj.r.squared
#[1] 0.7753073
adj_r2_lmfit(tstlmf, mtmatrix[, "cyl"])
#[1] 0.7753073
Faster way to run a regression on large Data
Your regression is fitting slowly because of the name variable. Fitting a factor with 5903 levels will add 5903 columns to your design matrix: it is like trying to fit 5903 separate variables. Your design matrix will have dimensions 63000 x 5908, which will (1) take up a lot of memory and (2) make lm work hard to produce its estimates (hence the 40 min fitting time).
You have a few options:
- Keep your design as is, and wait (or find a slightly faster lm)
- Throw out the name variable, in which case lm will fit almost instantly
- Fit a mixed effects model, with name as a random effect, using lmer or another package. lmer in particular uses a sparse design matrix for the random effects, taking advantage of the fact that each observation can only have one of the 5903 names (so most of the matrix is empty).
Of the three, the third option is likely the most principled way forward. A random effect will account for individual-level variation across observations, and also pool information among different individuals to help give better estimates for dogs that don't have a lot of observations. On top of that, it will compute quickly thanks to the sparse design matrix.
A simple model for your dataset might look something like this:
library(lme4)
## fit margin with name as a random intercept (one random effect per dog)
mygrey <- lmer(margin ~ (1|name) + box + distance + trate + grade + trackid,
               data = gtrain)
If you want to go that route, I recommend reading more about mixed effects models so that you can choose the model structure that makes sense for your data. Here are two good resources:
- Guide for using lmer: https://stats.stackexchange.com/questions/13166/rs-lmer-cheat-sheet
- More theoretical discussion of random/mixed effects: https://stats.stackexchange.com/a/151800/
Faster way to run a million regression models
Every time you call myresults <- rbindlist(list(myresults, ...)), you're copying the entirety of myresults, modifying the copy, and then having the name point to the copy. The most common cause of inefficient looping in R is growing an object like this. You know the exact dimensions of the result (ncol(data) by 3), so just create it up front. Then use data.table to assign by reference (no copying).
See if this helps improve efficiency:
library(data.table)

#### initialize results file ###
myresults <- data.table(
    id       = character(length(data)),
    estimate = numeric(length(data)),
    pvalue   = numeric(length(data))
)

#### Run regression against same covariates and outcome for each column in data ##
for (i in seq_along(data)) {
    id <- colnames(data)[i]
    mydata <- cbind(cov, outcome, data[, id])
    colnames(mydata)[ncol(mydata)] <- id  # I can't figure out how to not have to do this
    fit <- glm(formula(paste0("outcome ~ as.factor(cov1) + ", id)), data = mydata)
    set(
        myresults,
        i = i,
        j = c("id", "estimate", "pvalue"),
        value = list(
            id       = id,
            estimate = signif(coef(summary(fit))[id, "Estimate"], digits = 4),
            pvalue   = signif(coef(summary(fit))[id, "Pr(>|t|)"], digits = 4)
        )
    )
}
I also replaced for (i in 1:ncol(data)) with for (i in seq_along(data)), because the first form misbehaves when data has no columns. You might think that will never happen, but writing loops that way is a bad habit.
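A quick illustration of that edge case:

data0 <- data.frame()    # zero columns
1:ncol(data0)            # c(1, 0): the loop body would still run, with bad indices
seq_along(data0)         # integer(0): the loop body is skipped, as it should be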