Is there a faster lm function
Yes, there are:
- R itself has lm.fit(), which is more bare-bones: no formula notation and a much simpler result set.
- Several of our Rcpp-related packages have fastLm() implementations: RcppArmadillo, RcppEigen, RcppGSL.
We have described fastLm() in a number of blog posts and presentations. If you want the fastest route, do not use the formula interface: parsing the formula and preparing the model matrix take more time than the actual regression.
That said, if you are regressing a single vector on a single vector you can simplify this as no matrix package is needed.
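For example, a minimal sketch (made-up data, assuming RcppArmadillo is installed) of fitting the same model three ways, with the model matrix built once so no formula parsing is involved:

library(RcppArmadillo)
set.seed(1)
n <- 1e5
X <- cbind(1, rnorm(n), rnorm(n))          # intercept plus two predictors
y <- drop(X %*% c(1, 2, -1)) + rnorm(n)

coef(lm(y ~ X[, -1]))                      # formula interface: parses formula, builds model matrix
lm.fit(X, y)$coefficients                  # bare-bones base R fit on the prebuilt matrix
fastLmPure(X, y)$coefficients              # RcppArmadillo fit, also skipping the formula machinery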
Faster functions than lm() in R
Apparently, it should not be a problem to run a regression on a dataset of ~100,000 observations.
After receiving helpful comments on the main post, I found that one of the independent variables used in the regression was coded as a character. I checked the data type of every column in the dataframe (df) with:
str(df)
$ var1 : chr "x1" "x2" "x1" "x1"
$ var2 : Factor w/ 2 levels "factor1" "factor2": 1 1 1 0
$ var3 : Factor w/ 2 levels "factorx" "factory": 0 1 1 0
$ var4 : num 1 8 3 2
Changing var1 to a factor variable:
df$var1 <- as.factor(df$var1)
After changing var1 to a factor variable, the regression indeed runs within a few seconds.
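If several columns are affected, a small sketch (generic, not specific to this dataset) converts every character column to a factor in one pass:

df[] <- lapply(df, function(col) if (is.character(col)) factor(col) else col)
str(df)   # var1 should now show up as a Factor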
Why is the built-in lm function so slow in R?
You are overlooking that:
- solve() only returns your parameters, whereas lm() returns you a (very rich) object with many components for subsequent analysis, inference, plots, ...
- the main cost of your lm() call is not the projection but the resolution of the formula y ~ . from which the model matrix needs to be built.
To illustrate Rcpp we wrote a few variants of a function fastLm() doing more of what lm() does (i.e. a bit more than lm.fit() from base R) and measured it. See e.g. this benchmark script, which clearly shows that the dominant cost for smaller data sets is in parsing the formula and building the model matrix.
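As a rough sketch of that comparison (made-up data; timings will vary by machine), you can time the full lm() call against building the model matrix and against the fit step alone:

library(microbenchmark)
set.seed(2)
df <- data.frame(y = rnorm(1e4), x1 = rnorm(1e4), x2 = gl(4, 2500))
X <- model.matrix(y ~ ., data = df)        # built once, outside the benchmark

microbenchmark(
    formula_interface = lm(y ~ ., data = df),           # parse formula + build matrix + fit
    model_matrix_only = model.matrix(y ~ ., data = df), # just the model matrix
    fit_only          = lm.fit(X, df$y),                # just the least-squares fit
    times = 100
)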
In short, you are doing the Right Thing by benchmarking, but you are not doing it all that correctly: you are comparing what is mostly incomparable, a subset of the work against a much larger task.
Optimizing lm() function in a loop
Wide-ranging comments above, but I'll try to answer a few narrower points.
- I seem to get the same (i.e., all.equal() is TRUE) results with .lm.fit and fastLmPure, if I'm careful about random-number seeds:
library(Rcpp)
library(RcppEigen)
library(microbenchmark)
nsim <- 1e3
n <- 1e5
set.seed(101)
dd <- data.frame(Y=rnorm(n))
testfun <- function(fitFn = .lm.fit, seed = NULL) {
    if (!is.null(seed)) set.seed(seed)
    x <- rnorm(n)
    reg2 <- fitFn(as.matrix(x), dd$Y)$residuals
    return(max(abs(reg2) / sd(reg2)))
}
## make sure NOT to use seed=101 (also used to pick y):
## if we have y==x then results are unstable (resids approx. 0)
all.equal(testfun(seed=102), testfun(fastLmPure, seed=102)) ## TRUE
fastLmPure is fastest (but not by very much):
(bm1 <- microbenchmark(testfun(),
                       testfun(lm.fit),
                       testfun(fastLmPure),
                       times = 1000))
Unit: milliseconds
expr min lq mean median uq max
testfun() 6.603822 8.234967 8.782436 8.332270 8.745622 82.54284
testfun(lm.fit) 7.666047 9.334848 10.201158 9.503538 10.742987 99.15058
testfun(fastLmPure) 5.964700 7.358141 7.818624 7.471030 7.782182 86.47498
If you wanted to fit many independent responses, rather than many independent predictors (i.e. if you were varying Y rather than X in the regression), you could provide a matrix for Y in .lm.fit, rather than looping over lots of regressions, which might be a big win. If all you care about are "residuals of random regressions", that might be worth a try. (Unfortunately, providing a matrix that combines many separate X vectors runs a multiple regression, not many univariate regressions ...)
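For instance, a small sketch of the many-responses case (made-up data, no intercept, matching the testfun() setup above): a single .lm.fit call regresses every column of Y on the same x and returns a matrix of residuals, one column per response.

set.seed(1)
n <- 1e4; nresp <- 50
x <- rnorm(n)
Y <- matrix(rnorm(n * nresp), n, nresp)    # 50 independent response vectors
fit <- .lm.fit(as.matrix(x), Y)            # one call fits all 50 univariate regressions
dim(fit$residuals)                         # n x nresp: one residual column per response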
Parallelizing is worthwhile, but will only scale (at best) according to the number of cores you have available. Doing a single run rather than a set of benchmarks because I'm lazy ...
Running 5000 replicates sequentially takes about 40 seconds for me (modern Linux laptop).
system.time(replicate(5000,testfun(fastLmPure), simplify=FALSE))
## user system elapsed
## 38.953 0.072 39.028
Running in parallel on 5 cores takes about 13 seconds, so a 3-fold speedup for 5 cores. This will probably be a bit better if the individual jobs are larger, but obviously will never scale better than the number of cores ... (8 cores didn't do much better).
library(parallel)
system.time(mclapply(1:5000, function(x) testfun(fastLmPure),
mc.cores=5))
## user system elapsed
## 43.225 0.627 12.970
It makes sense to me that parallelizing at a higher/coarser level (across runs rather than within lm fits) will perform better.
I wonder if there are analytical results you could use in terms of the order statistics of a t distribution ... ?
How to make group_by and lm fast?
In theory
Firstly, you can fit a linear model with multiple LHS.
Secondly, explicit data splitting is not the only way (or a recommended way) for group-by regression. See R regression analysis: analyzing data for a certain ethnicity and R: build separate models for each category. So build your model as cbind(x1, x2, x3, x4) ~ day * subject, where subject is a factor variable.
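As a small illustration of the multiple-LHS point (made-up data): lm accepts a matrix response built with cbind, so all responses share one model matrix and one fit.

set.seed(1)
toy <- data.frame(day = rep(1:5, 4), subject = gl(4, 5),
                  x1 = rnorm(20), x2 = rnorm(20))
fit <- lm(cbind(x1, x2) ~ day * subject, data = toy)
coef(fit)       # one column of coefficients per response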
Finally, since you have many factor levels and work with a large dataset, lm is infeasible. Consider using speedglm::speedlm with sparse = TRUE, or MatrixModels::glm4 with sparse = TRUE.
In reality
Neither speedlm nor glm4 is under active development. Their functionality is (in my view) primitive.
Neither speedlm nor glm4 supports multiple LHS as lm does, so you need to fit 4 separate models, x1 ~ day * subject through x4 ~ day * subject, instead.
These two packages have different logic behind sparse = TRUE:
- speedlm first constructs a dense design matrix using the standard model.matrix.default, then uses is.sparse to survey whether it is sparse or not. If TRUE, subsequent computations can use sparse methods.
- glm4 uses model.Matrix to construct the design matrix, so a sparse one can be built directly.
So it is not surprising that speedlm is as bad as lm on this sparsity issue, and glm4 is the one we really want to use.
glm4 does not have a full, useful set of generic functions for analyzing fitted models. You can extract coefficients, fitted values and residuals with coef, fitted and residuals, but you have to compute all statistics (standard errors, t-statistics, F-statistic, etc.) yourself. This is not a big deal for people who know regression theory well, but it is still rather inconvenient.
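If you do need those statistics, here is a minimal sketch (small dense example, textbook least-squares formulas) of computing standard errors, t-statistics and p-values from nothing more than the design matrix, coefficients and residuals:

set.seed(1)
X   <- cbind(1, rnorm(100), rnorm(100))            # small dense design matrix
y   <- drop(X %*% c(1, 2, -1)) + rnorm(100)
fit <- lm.fit(X, y)                                # returns only coefficients, residuals, ...

rdf    <- nrow(X) - ncol(X)                        # residual degrees of freedom
sigma2 <- sum(fit$residuals^2) / rdf               # residual variance estimate
se     <- sqrt(sigma2 * diag(solve(crossprod(X)))) # standard errors
tval   <- fit$coefficients / se                    # t-statistics
pval   <- 2 * pt(abs(tval), rdf, lower.tail = FALSE)
cbind(estimate = fit$coefficients, se = se, t = tval, p = pval)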
glm4 still expects you to use the best model formula so that the sparsest matrix can be constructed. The conventional ~ day * subject is really not a good one. I should probably set up a Q & A on this issue later. Basically, if your formula has an intercept and factors are contrasted, you lose sparsity. This is the one we should use: ~ 0 + subject + day:subject.
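A quick sketch of the difference (small made-up data): build both design matrices with Matrix::sparse.model.matrix and compare the fraction of non-zero entries.

library(Matrix)
set.seed(1)
toy <- data.frame(day = rep(1:5, 100), subject = gl(100, 5))

X1 <- sparse.model.matrix(~ day * subject, data = toy)              # intercept + contrasts
X2 <- sparse.model.matrix(~ 0 + subject + day:subject, data = toy)  # the sparse-friendly form
c(dense_formula = nnzero(X1) / length(X1),
  sparse_formula = nnzero(X2) / length(X2))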
A test with glm4
## use chinsoon12's data in his answer
set.seed(0L)
nSubj <- 200e3
nr <- 1e6
DF <- data.frame(subject = gl(nSubj, 5),
                 day = 3:7,
                 y1 = runif(nr),
                 y2 = rpois(nr, 3),
                 y3 = rnorm(nr),
                 y4 = rnorm(nr, 1, 5))
library(MatrixModels)
fit <- glm4(y1 ~ 0 + subject + day:subject, data = DF, sparse = TRUE)
It takes about 6 ~ 7 seconds on my 1.1GHz Sandy Bridge laptop. Let's extract its coefficients:
b <- coef(fit)
head(b)
# subject1 subject2 subject3 subject4 subject5 subject6
# 0.4378952 0.3582956 -0.2597528 0.8141229 1.3337102 -0.2168463
tail(b)
#subject199995:day subject199996:day subject199997:day subject199998:day
# -0.09916175 -0.15653402 -0.05435883 -0.02553316
#subject199999:day subject200000:day
# 0.02322640 -0.09451542
You can do B <- matrix(b, ncol = 2), so that the 1st column is the intercept and the 2nd is the slope.
My thoughts: We probably need better packages for large regression
Using glm4 here does not give an attractive advantage over chinsoon12's data.table solution, since it also basically just gives you the regression coefficients. It is also a bit slower than the data.table method, because it computes fitted values and residuals as well.
Analysis of simple regression does not require a proper model fitting routine. I have a few answers on how to do fancy stuff with this kind of regression, like Fast pairwise simple linear regression between all variables in a data frame, where details on how to compute all statistics are also given. But as I wrote this answer, I was thinking about large regression problems in general. We might need better packages, otherwise there is no end to case-by-case coding.
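The kind of thing I mean, as a rough sketch (made-up data; column names are hypothetical): for a single predictor, the group-wise slope and intercept have closed forms, so a data.table group-by computes them directly with no fitting routine at all.

library(data.table)
set.seed(1)
DT <- data.table(subject = gl(1000, 5), day = rep(3:7, 1000), y = rnorm(5000))

res <- DT[, {
    b <- cov(day, y) / var(day)        # slope: cov(x, y) / var(x)
    a <- mean(y) - b * mean(day)       # intercept
    .(intercept = a, slope = b)
}, by = subject]
head(res)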
In reply to OP
speedglm::speedlm(x1 ~ 0 + subject + day:subject, data = df, sparse = TRUE) gives Error: cannot allocate vector of size 74.5 Gb. Yes, because its sparse logic is poor (see above).
MatrixModels::glm4(x1 ~ day * subject, data = df, sparse = TRUE) gives Error in Cholesky(crossprod(from), LDL = FALSE) : internal_chm_factor: Cholesky factorization failed. This is because you have only one datum for some subject: you need at least two data points to fit a line. Here is an example (in dense settings):
dat <- data.frame(t = c(1:5, 1:9, 1),
                  f = rep(gl(3, 1, labels = letters[1:3]), c(5, 9, 1)),
                  y = rnorm(15))
Level "c" of f
only has one datum / row.
X <- model.matrix(~ 0 + f + t:f, dat)
XtX <- crossprod(X)
chol(XtX)
#Error in chol.default(XtX) :
# the leading minor of order 6 is not positive definite
Cholesky factorization is unable to resolve a rank-deficient model. If we use lm's QR factorization, we will see an NA coefficient.
lm(y ~ 0 + f + t:f, dat)
#Coefficients:
# fa fb fc fa:t fb:t fc:t
# 0.49893 0.52066 -1.90779 -0.09415 -0.03512 NA
We can only estimate an intercept for level "c", not a slope.
Note that if you use the data.table solution, you end up with 0 / 0 when computing the slope for this level, and the final result is NaN.
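One workaround, sketched on the dat example above: drop factor levels with fewer than two observations before building the sparse design matrix; then the Cholesky factorization goes through.

keep <- names(which(table(dat$f) >= 2))              # levels with at least 2 rows
dat2 <- dat[dat$f %in% keep, , drop = FALSE]
dat2$f <- droplevels(dat2$f)
chol(crossprod(model.matrix(~ 0 + f + t:f, dat2)))   # succeeds now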
Update: fast solution now available
Check out Fast group-by simple linear regression and general paired simple linear regression.
Faster anova of regression models
We could hack stats:::anova.lmlist so that it works for lists produced by .lm.fit (notice the dot) and RcppArmadillo::fastLm. Please check it against stats:::anova.lmlist, in case I deleted stuff you need!
anova2 <- function(object, ...) {
    objects <- list(object, ...)
    ns <- vapply(objects, function(x) length(x$residuals), 0L)
    stopifnot(!any(ns != ns[1L]))
    resdf <- vapply(objects, df.residual, 0L)
    resdev <- vapply(objects, function(x) crossprod(residuals(x)), 0)
    bigmodel <- order(resdf)[1L]
    dfs <- c(NA, -diff(resdf))[bigmodel]
    ssq <- c(NA, -diff(resdev))[bigmodel]
    df.scale <- resdf[bigmodel]
    scale <- resdev[bigmodel]/resdf[bigmodel]
    fstat <- (ssq/dfs)/scale
    p <- pf(fstat, abs(dfs), df.scale, lower.tail=FALSE)
    return(c(F=fstat, p=p))
}
These wrappers make .lm.fit and RcppArmadillo::fastLm a little more convenient:
lm_fun <- function(fo, dat) {
    X <- model.matrix(fo, dat)
    fit <- .lm.fit(X, dat[[all.vars(fo)[1]]])
    fit$df.residual <- dim(X)[1] - dim(X)[2]
    return(fit)
}
fastLm_fun <- function(fo, dat) {
    fit <- RcppArmadillo::fastLm(model.matrix(fo, dat), dat[[all.vars(fo)[1]]])
    return(fit)
}
Use it
fit1 <- lm_fun(y ~ x1 + x3, data)
fit2 <- lm_fun(y ~ x2 + x3 + x4, data)
anova2(fit1, fit2)
# F p
# 0.3609728 0.5481032
## Or use `fastLm_fun` here, but it's slower.
Compare
anova(lm(y ~ x1 + x3, data), lm(y ~ x2 + x3 + x4, data))
# Analysis of Variance Table
#
# Model 1: y ~ x1 + x3
# Model 2: y ~ x2 + x3 + x4
# Res.Df RSS Df Sum of Sq F Pr(>F)
# 1 997 1003.7
# 2 996 1003.4 1 0.36365 0.361 0.5481
lm_fun (which outperforms fastLm_fun) combined with anova2 appears to be around 60% faster than the conventional approach:
microbenchmark::microbenchmark(
    conventional = {
        fit1 <- lm(y ~ x1 + x3, data)
        fit2 <- lm(y ~ x2 + x3 + x4, data)
        anova(fit1, fit2)
    },
    anova2 = {  ## using `.lm.fit`
        fit1 <- lm_fun(y ~ x1 + x3, data)
        fit2 <- lm_fun(y ~ x2 + x3 + x4, data)
        anova2(fit1, fit2)
    },
    anova2_Fast = {  ## using `RcppArmadillo::fastLm`
        fit1 <- fastLm_fun(y ~ x1 + x3, data)
        fit2 <- fastLm_fun(y ~ x2 + x3 + x4, data)
        anova2(fit1, fit2)
    },
    anova_Donald = {
        fit1 <- lm_fun(y ~ x1 + x3, data)
        fit2 <- lm_fun(y ~ x2 + x3 + x4, data)
        anova_Donald(fit1, fit2)
    },
    times = 1000L,
    control = list(warmup = 100L)
)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# conventional 2.885718 2.967053 3.205947 3.018668 3.090954 40.15720 1000 d
# anova2 1.180683 1.218121 1.285131 1.233335 1.267833 23.81955 1000 a
# anova2_Fast 1.961897 2.012756 2.179458 2.037854 2.087893 26.65279 1000 c
# anova_Donald 1.368699 1.409198 1.561751 1.430562 1.472148 33.12247 1000 b
Data:
set.seed(42)
data <- data.frame(y=rnorm(1000), x1=rnorm(1000), x2=rnorm(1000), x3=rnorm(1000),
                   x4=rnorm(1000))
Optimize the time to run fixed effects in an R lm() model
When you have that many factor levels, constructing the model matrix in lm() will take up most of the time. One way around this is to use a sparse matrix, as glmnet does; two packages, SparseM and MatrixModels, allow fitting lm on a sparse matrix:
set.seed(111)
x <- data.frame(
    a = sample(1:1000, 1000000, replace=T),
    cityfips = sample(1:250, 1000000, replace=T),
    d = sample(1:4, 1000000, replace=T)
)
library(SparseM)
library(MatrixModels)
library(Matrix)
system.time(f_lm <- lm( a~as.factor(cityfips) + d , x ) )
user system elapsed
75.720 2.494 79.365
system.time(f_sparseM <- slm(a~as.factor(cityfips) + d , x ))
user system elapsed
5.373 3.952 10.646
system.time(f_modelMatrix <- glm4(a~as.factor(cityfips) + d ,data=x,sparse=TRUE))
user system elapsed
1.878 0.335 2.219
The closest I can find is glm4 in MatrixModels, but you can see below that the coefficients are the same as in the fit using lm:
all.equal(as.numeric(f_sparseM$coefficients),as.numeric(f_lm$coefficients))
[1] TRUE
all.equal(as.numeric(f_lm$coefficients),as.numeric(coefficients(f_modelMatrix)))
[1] TRUE
One other option besides glm4 in MatrixModels is to use lm.fit (as pointed out by @BenBolker):
lm.fit(x = Matrix::sparse.model.matrix(~ as.factor(cityfips) + d, data = x), y = x$a)
This gives you a list, as lm.fit() normally does, and you cannot apply functions such as summary() etc. to it.
Authors of both packages warn about them being experimental, so there might still be some differences compared to stats::lm; take care to check.
Fast adjusted r-squared extraction
The following function computes the adjusted R-squared from an object returned by .lm.fit and the response vector y.
adj_r2_lmfit <- function(object, y){
    ypred <- y - resid(object)
    mss <- sum((ypred - mean(ypred))^2)
    rss <- sum(resid(object)^2)
    rdf <- length(resid(object)) - object$rank
    r.squared <- mss/(mss + rss)
    adj.r.squared <- 1 - (1 - r.squared)*(NROW(y) - 1)/rdf
    adj.r.squared
}
mtmatrix <- as.matrix(mtcars)   # matrix version of mtcars for .lm.fit

tstlm <- lm(cyl ~ hp + wt, data = mtcars)
tstlmf <- .lm.fit(cbind(1, mtmatrix[, c("hp","wt")]), mtmatrix[, "cyl"])

summary(tstlm)$adj.r.squared
#[1] 0.7753073
adj_r2_lmfit(tstlmf, mtmatrix[, "cyl"])
#[1] 0.7753073
Faster way to run a regression on large Data
Your regression is fitting slowly because of the name variable. Fitting a factor with 5903 levels will add 5903 columns to your design matrix: it is like trying to fit 5903 separate variables. Your design matrix will have dimensions 63000 x 5908, which will (1) take up a lot of memory and (2) make lm work hard to produce its estimates (hence the 40 min fitting time).
You have a few options:
- Keep your design as is, and wait (or find a slightly faster lm)
- Throw out the name variable, in which case lm will fit almost instantly
- Fit a mixed effects model, with name as a random effect, using lmer or another package. lmer in particular uses a sparse design matrix for the random effects, taking advantage of the fact that each observation can only have one of the 5903 names (so most of the matrix is empty).
Of the three, the third option is likely the most principled way forward. A random effect will account for individual-level variation across observations, and also pool information among different individuals to help give better estimates for dogs that don't have a lot of observations. On top of that, it will compute quickly thanks to the sparse design matrix.
A simple model for your dataset might look something like this:
library(lme4)
## fit margin with name as a random intercept (one random effect per dog)
mygrey <- lmer(margin ~ (1|name) + box + distance + trate + grade + trackid,
               data = gtrain)
If you want to go that route, I recommend reading more about mixed effects models so that you can choose the model structure that makes sense for your data. Here are two good resources:
- Guide for using lmer: https://stats.stackexchange.com/questions/13166/rs-lmer-cheat-sheet
- More theoretical discussion of random/mixed effects: https://stats.stackexchange.com/a/151800/
Faster way to run a million regression models
Every time you call myresults <- rbindlist(list(myresults, ...)), you're copying the entirety of myresults, modifying the copy, and then having the name point to the copy. The most common cause of inefficient looping in R is growing an object like this. You know the exact dimensions of the result (ncol(data) by 3), so just create it up front. Then use data.table to assign by reference (no copying).
See if this helps improve efficiency:
library(data.table)

#### initialize results file ###
myresults <- data.table(
    id       = character(length(data)),
    estimate = numeric(length(data)),
    pvalue   = numeric(length(data))
)

#### Run regression against same covariates and outcome for each column in data ##
for (i in seq_along(data)) {
    id <- colnames(data)[i]
    mydata <- cbind(cov, outcome, data[, id])
    colnames(mydata)[ncol(mydata)] <- id  # I can't figure out how to not have to do this
    fit <- glm(formula(paste0("outcome ~ as.factor(cov1) + ", id)), data = mydata)
    set(
        myresults,
        i = i,
        j = c("id", "estimate", "pvalue"),
        value = list(
            id       = id,
            estimate = signif(coef(summary(fit))[id, "Estimate"], digits = 4),
            pvalue   = signif(coef(summary(fit))[id, "Pr(>|t|)"], digits = 4)
        )
    )
}
I also replaced for (i in 1:ncol(data)) with for (i in seq_along(data)), because the first form misbehaves when data has no columns. You might think that will never happen, but writing loops that way is a bad habit.
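A quick illustration of that edge case:

data0 <- data.frame()    # zero columns
1:ncol(data0)            # c(1, 0): the loop body would still run, with bad indices
seq_along(data0)         # integer(0): the loop body is skipped, as it should be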