Why the built-in lm function is so slow in R?
You are overlooking that
solve()
only returns your parameterslm()
returns you a (very rich) object with many components for subsequent analysis, inference, plots, ...- the main cost of your
lm()
call is not the projection but the resolution of the formulay ~ .
from which the model matrix needs to be built.
To illustrate Rcpp we wrote a few variants of a function fastLm()
doing more of what lm()
does (ie a bit more than lm.fit()
from base R) and measured it. See e.g. this benchmark script which clearly shows that the dominant cost for smaller data sets is in parsing the formula and building the model matrix.
In short, you are doing the Right Thing by using benchmarking but you are doing it not all that correctly in trying to compare what is mostly incomparable: a subset with a much larger task.
Optimizing lm() function in a loop
Wide-ranging comments above, but I'll try to answer a few narrower points.
- I seem to get the same (i.e.,
all.equal()
isTRUE
) results with.lm.fit
andfitLmPure
, if I'm careful about random-number seeds:
library(Rcpp)
library(RcppEigen)
library(microbenchmark)
nsim <- 1e3
n <- 1e5
set.seed(101)
dd <- data.frame(Y=rnorm(n))
testfun <- function(fitFn=.lm.fit, seed=NULL) {
if (!is.null(seed)) set.seed(seed)
x <- rnorm(n)
reg2 <- fitFn(as.matrix(x), dd$Y)$residuals
return(max(abs(reg2) / sd(reg2)))
}
## make sure NOT to use seed=101 - also used to pick y -
## if we have y==x then results are unstable (resids approx. 0)
all.equal(testfun(seed=102), testfun(fastLmPure,seed=102)) ## TRUE
fastLmPure
is fastest (but not by very much):
(bm1 <- microbenchmark(testfun(),
testfun(lm.fit),
testfun(fastLmPure),
times=1000))
Unit: milliseconds
expr min lq mean median uq max
testfun() 6.603822 8.234967 8.782436 8.332270 8.745622 82.54284
testfun(lm.fit) 7.666047 9.334848 10.201158 9.503538 10.742987 99.15058
testfun(fastLmPure) 5.964700 7.358141 7.818624 7.471030 7.782182 86.47498
If you wanted to fit many independent responses, rather than many independent predictors (i.e. if you were varying Y rather than X in the regression), you could provide a matrix for Y in .lm.fit
, rather than looping over lots of regressions, which might be a big win. If all you care about are "residuals of random regressions" that might be worth a try. (Unfortunately, providing a matrix that combines may separate X vectors runs a multiple regression, not many univariate regressions ...)
Parallelizing is worthwhile, but will only scale (at best) according to the number of cores you have available. Doing a single run rather than a set of benchmarks because I'm lazy ...
Running 5000 replicates sequentially takes about 40 seconds for me (modern Linux laptop).
system.time(replicate(5000,testfun(fastLmPure), simplify=FALSE))
## user system elapsed
## 38.953 0.072 39.028
Running in parallel on 5 cores takes about 13 seconds, so a 3-fold speedup for 5 cores. This will probably be a bit better if the individual jobs are larger, but obviously will never scale better than the number of cores ... (8 cores didn't do much better).
library(parallel)
system.time(mclapply(1:5000, function(x) testfun(fastLmPure),
mc.cores=5))
## user system elapsed
## 43.225 0.627 12.970
It makes sense to me that parallelizing at a higher/coarser level (across runs rather than within lm fits) will perform better.
I wonder if there are analytical results you could use in terms of the order statistics of a t distribution ... ?
Why does lm run out of memory while matrix multiplication works fine for coefficients?
None of the answers so far has pointed to the right direction.
The accepted answer by @idr is making confusion between lm
and summary.lm
. lm
computes no diagnostic statistics at all; instead, summary.lm
does. So he is talking about summary.lm
.
@Jake's answer is a fact on the numeric stability of QR factorization and LU / Choleksy factorization. Aravindakshan's answer expands this, by pointing out the amount of floating point operations behind both operations (though as he said, he did not count in the costs for computing matrix cross product). But, do not confuse FLOP counts with memory costs. Actually both method have the same memory usage in LINPACK / LAPACK. Specifically, his argument that QR method costs more RAM to store Q
factor is a bogus one. The compacted storage as explained in lm(): What is qraux returned by QR decomposition in LINPACK / LAPACK clarifies how QR factorization is computed and stored. Speed issue of QR v.s. Chol is detailed in my answer: Why the built-in lm function is so slow in R?, and my answer on faster lm
provides a small routine lm.chol
using Choleksy method, which is 3 times faster than QR method.
@Greg's answer / suggestion for biglm
is good, but it does not answer the question. Since biglm
is mentioned, I would point out that QR decomposition differs in lm
and biglm
. biglm
computes householder reflection so that the resulting R
factor has positive diagonals. See Cholesky factor via QR factorization for details. The reason that biglm
does this, is that the resulting R
will be as same as the Cholesky factor, see QR decomposition and Choleski decomposition in R for information. Also, apart from biglm
, you can use mgcv
. Read my answer: biglm
predict unable to allocate a vector of size xx.x MB for more.
After a summary, it is time to post my answer.
In order to fit a linear model, lm
will
- generates a model frame;
- generates a model matrix;
- call
lm.fit
for QR factorization; - returns the result of QR factorization as well as the model frame in
lmObject
.
You said your input data frame with 5 columns costs 2 GB to store. With 20 factor levels the resulting model matrix has about 25 columns taking 10 GB storage. Now let's see how memory usage grows when we call lm
.
- [global environment] initially you have 2 GB storage for the data frame;
- [
lm
envrionment] then it is copied to a model frame, costing 2 GB; - [
lm
environment] then a model matrix is generated, costing 10 GB; - [
lm.fit
environment] a copy of model matrix is made then overwritten by QR factorization, costing 10 GB; - [
lm
environment] the result oflm.fit
is returned, costing 10 GB; - [global environment] the result of
lm.fit
is further returned bylm
, costing another 10 GB; - [global environment] the model frame is returned by
lm
, costing 2 GB.
So, a total of 46 GB RAM is required, far greater than your available 22 GB RAM.
Actually if lm.fit
can be "inlined" into lm
, we could save 20 GB costs. But there is no way to inline an R function in another R function.
Maybe we can take a small example to see what happens around lm.fit
:
X <- matrix(rnorm(30), 10, 3) # a `10 * 3` model matrix
y <- rnorm(10) ## response vector
tracemem(X)
# [1] "<0xa5e5ed0>"
qrfit <- lm.fit(X, y)
# tracemem[0xa5e5ed0 -> 0xa1fba88]: lm.fit
So indeed, X
is copied when passed into lm.fit
. Let's have a look at what qrfit
has
str(qrfit)
#List of 8
# $ coefficients : Named num [1:3] 0.164 0.716 -0.912
# ..- attr(*, "names")= chr [1:3] "x1" "x2" "x3"
# $ residuals : num [1:10] 0.4 -0.251 0.8 -0.966 -0.186 ...
# $ effects : Named num [1:10] -1.172 0.169 1.421 -1.307 -0.432 ...
# ..- attr(*, "names")= chr [1:10] "x1" "x2" "x3" "" ...
# $ rank : int 3
# $ fitted.values: num [1:10] -0.466 -0.449 -0.262 -1.236 0.578 ...
# $ assign : NULL
# $ qr :List of 5
# ..$ qr : num [1:10, 1:3] -1.838 -0.23 0.204 -0.199 0.647 ...
# ..$ qraux: num [1:3] 1.13 1.12 1.4
# ..$ pivot: int [1:3] 1 2 3
# ..$ tol : num 1e-07
# ..$ rank : int 3
# ..- attr(*, "class")= chr "qr"
# $ df.residual : int 7
Note that the compact QR matrix qrfit$qr$qr
is as large as model matrix X
. It is created inside lm.fit
, but on exit of lm.fit
, it is copied. So in total, we will have 3 "copies" of X
:
- the original one in global environment;
- the one copied into
lm.fit
, the overwritten by QR factorization; - the one returned by
lm.fit
.
In your case, X
is 10 GB, so the memory costs associated with lm.fit
alone is already 30 GB. Let alone other costs associated with lm
.
On the other hand, let's have a look at
solve(crossprod(X), crossprod(X,y))
X
takes 10 GB, but crossprod(X)
is only a 25 * 25
matrix, and crossprod(X,y)
is just a length-25 vector. They are so tiny compared with X
, thus memory usage does not increase at all.
Maybe you are worried that a local copy of X
will be made when crossprod
is called? Not at all! Unlike lm.fit
which performs both read and write to X
, crossprod
only reads X
, so no copy is made. We can verify this with our toy matrix X
by:
tracemem(X)
crossprod(X)
You will see no copying message!
If you want a short summary for all above, here it is:
- memory costs for
lm.fit(X, y)
(or even.lm.fit(X, y)
) is three times as large as that forsolve(crossprod(X), crossprod(X,y))
; - Depending on how much larger the model matrix is than the model frame, memory costs for
lm
is 3 ~ 6 times as large as that forsolve(crossprod(X), crossprod(X,y))
. The lower bound 3 is never reached, while the upper bound 6 is reached when the model matrix is as same as the model frame. This is the case when there is no factor variables or "factor-alike" terms, likebs()
andpoly()
, etc.
How to make group_by and lm fast?
In theory
Firstly, you can fit a linear model with multiple LHS.
Secondly, explicit data splitting is not the only way (or a recommended way) for group-by regression. See R regression analysis: analyzing data for a certain ethnicity and R: build separate models for each category. So build your model as cbind(x1, x2, x3, x4) ~ day * subject
where subject
is a factor variable.
Finally, since you have many factor levels and work with a large dataset, lm
is infeasible. Consider using speedglm::speedlm
with sparse = TRUE
, or MatrixModels::glm4
with sparse = TRUE
.
In reality
Neither speedlm
nor glm4
is under active development. Their functionality is (in my view) primitive.
Neither speedlm
nor glm4
supports multiple LHS as lm
. So you need do 4 separate models x1 ~ day * subject
to x4 ~ day * subject
instead.
These two packages have different logic behind sparse = TRUE
.
speedlm
first constructs a dense design matrix using the standardmodel.matrix.default
, then usesis.sparse
to survey whether it is sparse or not. If TRUE, subsequent computations can use sparse methods.glm4
usesmodel.Matrix
to construct a design matrix, and a sparse one can be built directly.
So, it is not surprising that speedlm
is as bad as lm
in this sparsity issue, and glm4
is the one we really want to use.
glm4
does not have a full, useful set of generic functions for analyzing fitted models. You can extract coefficients, fitted values and residuals by coef
, fitted
and residuals
, but have to compute all statistics (standard error, t-statistic, F-statistic, etc) all by yourself. This is not a big deal for people who know regression theory well, but it is still rather inconvenient.
glm4
still expects you to use the best model formula so that the sparsest matrix can be constructed. The conventional ~ day * subject
is really not a good one. I should probably set up a Q & A on this issue later. Basically, if your formula has intercept and factors are contrasted, you loose sparsity. This is the one we should use: ~ 0 + subject + day:subject
.
A test with glm4
## use chinsoon12's data in his answer
set.seed(0L)
nSubj <- 200e3
nr <- 1e6
DF <- data.frame(subject = gl(nSubj, 5),
day = 3:7,
y1 = runif(nr),
y2 = rpois(nr, 3),
y3 = rnorm(nr),
y4 = rnorm(nr, 1, 5))
library(MatrixModels)
fit <- glm4(y1 ~ 0 + subject + day:subject, data = DF, sparse = TRUE)
It takes about 6 ~ 7 seconds on my 1.1GHz Sandy Bridge laptop. Let's extract its coefficients:
b <- coef(fit)
head(b)
# subject1 subject2 subject3 subject4 subject5 subject6
# 0.4378952 0.3582956 -0.2597528 0.8141229 1.3337102 -0.2168463
tail(b)
#subject199995:day subject199996:day subject199997:day subject199998:day
# -0.09916175 -0.15653402 -0.05435883 -0.02553316
#subject199999:day subject200000:day
# 0.02322640 -0.09451542
You can do B <- matrix(b, ncol = 2)
, so that the 1st column is intercept and the 2nd is slope.
My thoughts: We probably need better packages for large regression
Using glm4
here does not give attractive advantage over chinsoon12's data.table
solution since it also basically just tells you the regression coefficient. It is also a bit slower than data.table
method, because it computes fitted values and residuals.
Analysis of simple regression does not require a proper model fitting routine. I have a few answers on how to do fancy stuff on this kind of regression, like Fast pairwise simple linear regression between all variables in a data frame where details on how to compute all statistics are also given. But as I wrote this answer, I was thinking about something in general regarding large regression problem. We might need better packages otherwise there is no end doing case-to-case coding.
In reply to OP
speedglm::speedlm(x1 ~ 0 + subject + day:subject, data = df, sparse = TRUE)
gives Error: cannot allocate vector of size 74.5 Gb
yeah, because it has a bad sparse
logic.
MatrixModels::glm4(x1 ~ day * subject, data = df, sparse = TRUE)
gives Error in Cholesky(crossprod(from), LDL = FALSE) : internal_chm_factor: Cholesky factorization failed
This is because you have only one datum for some subject
. You need at least two data to fit a line. Here is an example (in dense settings):
dat <- data.frame(t = c(1:5, 1:9, 1),
f = rep(gl(3,1,labels = letters[1:3]), c(5, 9, 1)),
y = rnorm(15))
Level "c" of f
only has one datum / row.
X <- model.matrix(~ 0 + f + t:f, dat)
XtX <- crossprod(X)
chol(XtX)
#Error in chol.default(XtX) :
# the leading minor of order 6 is not positive definite
Cholesky factorization is unable to resolve rank-deficient model. If we use lm
's QR factorization, we will see an NA
coefficient.
lm(y ~ 0 + f + t:f, dat)
#Coefficients:
# fa fb fc fa:t fb:t fc:t
# 0.49893 0.52066 -1.90779 -0.09415 -0.03512 NA
We can only estimate an intercept for level "c", not a slope.
Note that if you use the data.table
solution, you end up with 0 / 0
when computing slope of this level, and the final result is NaN
.
Update: fast solution now available
Check out Fast group-by simple linear regression and general paired simple linear regression.
robust to outliers lm in R
Is there a built in way in
lm
to make it ignore extreme values?
Yes. You need to look at rlm
.
For more reading material, look at the CRAN Task for robust methods. (Josh already gave this link)
R confint method very slow
It's a bit buried in the documentation, but what's slowing you down is the multiple-comparisons correction computations. There's wide variation in the elapsed time for the available methods. See the Confidence-limit and P-value adjustments section of ?summary.ref.grid
for details, and decide which method meets your criteria of being both fast enough and reliable enough for your use case ...
adj <- c("tukey","scheffe","sidak","bonferroni","dunnettx","mvt")
sapply(adj,function(a) system.time(confint(t,adjust=a))["elapsed"])
## tukey.elapsed scheffe.elapsed sidak.elapsed bonferroni.elapsed
9.256 0.195 0.168 0.173
## dunnettx.elapsed mvt.elapsed
14.370 1.821
Related Topics
Can the Value.Var in Dcast Be a List or Have Multiple Value Variables
Knitr Wont Compile PDF: "Error in Tools::File_Path_As_Absolute(Output_File)"
Installing Package - Cannot Open File - Permission Denied
How to Cross-Paste All Combinations of Two Vectors (Each-To-Each)
R Shiny Table Not Rendering HTML
How to Find the Indices of the Top 10,000 Elements in a Symmetric Matrix(12K X 12K) in R
How to Install Roracle Package on Windows 7
Create New Column Based on 4 Values in Another Column
Raw Text Strings for File Paths in R
Crop for Spatialpolygonsdataframe
Relocating Alaska and Hawaii on Thematic Map of the Usa with Ggplot2
Edit Datatable in Shiny with Dropdown Selection for Factor Variables
Propagating Data Within a Vector
How to Deal with Nas in Residuals in a Regression in R
Calculate Euclidean Distance Matrix Using a Big.Matrix Object