Fast Pairwise Simple Linear Regression Between Variables in a Data Frame

Some statistical results / background

(Picture omitted: it shows the statistical derivation, which is reproduced in the Appendix at the bottom of this answer. The link inside the picture points to Function to calculate R2 (R-squared) in R.)


Computational details

The computation involved here boils down to computing the variance-covariance matrix. Once we have it, the results for all pairwise regressions are just element-wise matrix arithmetic.

The variance-covariance matrix can be obtained by the R function cov, but the functions below compute it manually using crossprod. The advantage is that it can obviously benefit from an optimized BLAS library if you have one. Be aware that a significant amount of simplification is made this way. The R function cov has the argument use which allows handling NA, but crossprod does not. I am assuming that your dat has no missing values at all! If you do have missing values, remove them yourself with na.omit(dat).
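
As a quick sanity check (a sketch, assuming dat has been made complete-case first), the crossprod-based unscaled covariance agrees with R's cov up to the factor n - 1:

X <- as.matrix(na.omit(dat))
n <- nrow(X)
m <- colMeans(X)
V <- crossprod(X) - tcrossprod(m * sqrt(n))  ## unscaled covariance: X'X - n * m m'
all.equal(V / (n - 1), cov(X))               ## should be TRUE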

The initial as.matrix call that converts a data frame to a matrix might be an overhead. In principle, if I coded everything up in C / C++, this coercion could be eliminated, and in fact much of the element-wise matrix arithmetic could be merged into a single loop nest. However, I can't really be bothered to do this at the moment (as I have no time).

Some people may argue that the format of the final return is inconvenient. There could be other formats:

  1. a list of data frames, each giving the result of the regression for a particular LHS variable;
  2. a list of data frames, each giving the result of the regression for a particular RHS variable.

This is really opinion-based. Anyway, you can always do a split.data.frame by the "LHS" or "RHS" column yourself on the data frame I return you.
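
For example, with the returned data frame stored in a variable res (a hypothetical name):

res <- pairwise_simpleLM(dat)
by_LHS <- split.data.frame(res, res$LHS)  ## format 1: one data frame per LHS variable
by_RHS <- split.data.frame(res, res$RHS)  ## format 2: one data frame per RHS variable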


R function pairwise_simpleLM

pairwise_simpleLM <- function (dat) {
  ## matrix and its dimension (n: number of data; p: number of variables)
  dat <- as.matrix(dat)
  n <- nrow(dat)
  p <- ncol(dat)
  ## variable summary: mean, (unscaled) covariance and (unscaled) variance
  m <- colMeans(dat)
  V <- crossprod(dat) - tcrossprod(m * sqrt(n))
  d <- diag(V)
  ## R-squared (explained variance) and its complement
  R2 <- (V ^ 2) * tcrossprod(1 / d)
  R2_complement <- 1 - R2
  R2_complement[seq.int(from = 1, by = p + 1, length.out = p)] <- 0
  ## slope and intercept
  beta <- V * rep(1 / d, each = p)
  alpha <- m - beta * rep(m, each = p)
  ## residual sum of squares and standard error
  RSS <- R2_complement * d
  sig <- sqrt(RSS * (1 / (n - 2)))
  ## statistics for slope
  beta.se <- sig * rep(1 / sqrt(d), each = p)
  beta.tv <- beta / beta.se
  beta.pv <- 2 * pt(abs(beta.tv), n - 2, lower.tail = FALSE)
  ## F-statistic and p-value
  F.fv <- (n - 2) * R2 / R2_complement
  F.pv <- pf(F.fv, 1, n - 2, lower.tail = FALSE)
  ## export
  data.frame(LHS = rep(colnames(dat), times = p),
             RHS = rep(colnames(dat), each = p),
             alpha = c(alpha),
             beta = c(beta),
             beta.se = c(beta.se),
             beta.tv = c(beta.tv),
             beta.pv = c(beta.pv),
             sig = c(sig),
             R2 = c(R2),
             F.fv = c(F.fv),
             F.pv = c(F.pv),
             stringsAsFactors = FALSE)
}

Let's compare the result with that of poor(), the "poor man's" looping approach from the question, on its toy dataset.

oo <- poor(dat)
rr <- pairwise_simpleLM(dat)
all.equal(oo, rr)
#[1] TRUE

Let's see its output:

rr[1:3, ]
#  LHS RHS      alpha      beta    beta.se  beta.tv      beta.pv       sig
#1   A   A 0.00000000 1.0000000 0.00000000      Inf 0.000000e+00 0.0000000
#2   B   A 0.05550367 0.6206434 0.04456744 13.92594 5.796437e-25 0.1252402
#3   C   A 0.05809455 1.2215173 0.04790027 25.50126 4.731618e-45 0.1346059
#         R2     F.fv         F.pv
#1 1.0000000      Inf 0.000000e+00
#2 0.6643051 193.9317 5.796437e-25
#3 0.8690390 650.3142 4.731618e-45

When LHS and RHS are the same variable, the regression is meaningless, hence the intercept is 0, the slope is 1, etc.
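
If those degenerate diagonal rows are unwanted, they are easy to drop afterwards (a small sketch):

rr <- rr[rr$LHS != rr$RHS, ]  ## keep only regressions between distinct variables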

What about speed? Still using this toy example:

library(microbenchmark)
microbenchmark("poor_man's" = poor(dat), "fast" = pairwise_simpleLM(dat))
#Unit: milliseconds
#       expr        min         lq       mean     median         uq       max
# poor_man's 127.270928 129.060515 137.813875 133.390722 139.029912 216.24995
#       fast   2.732184   3.025217   3.381613   3.134832   3.313079  10.48108

The gap will grow increasingly wide as we have more variables. For example, with 10 variables we have:

set.seed(0)
X <- matrix(runif(100), 100, 10, dimnames = list(1:100, LETTERS[1:10]))
b <- runif(10)
DAT <- X * b[col(X)] + matrix(rnorm(100 * 10, 0, 0.1), 100, 10)
DAT <- as.data.frame(DAT)
microbenchmark("poor_man's" = poor(DAT), "fast" = pairwise_simpleLM(DAT))
#Unit: milliseconds
#       expr        min         lq       mean     median        uq        max
# poor_man's 548.949161 551.746631 573.009665 556.307448 564.28355 801.645501
#       fast   3.365772   3.578448   3.721131   3.621229   3.77749   6.791786

R function general_paired_simpleLM

general_paired_simpleLM <- function (dat_LHS, dat_RHS) {
  ## matrix and its dimension (n: number of data; p: number of variables)
  dat_LHS <- as.matrix(dat_LHS)
  dat_RHS <- as.matrix(dat_RHS)
  if (nrow(dat_LHS) != nrow(dat_RHS)) stop("'dat_LHS' and 'dat_RHS' don't have same number of rows!")
  n <- nrow(dat_LHS)
  pl <- ncol(dat_LHS)
  pr <- ncol(dat_RHS)
  ## variable summary: mean, (unscaled) covariance and (unscaled) variance
  ml <- colMeans(dat_LHS)
  mr <- colMeans(dat_RHS)
  vl <- colSums(dat_LHS ^ 2) - ml * ml * n
  vr <- colSums(dat_RHS ^ 2) - mr * mr * n
  ##V <- crossprod(dat - rep(m, each = n))  ## cov(u, v) = E[(u - E[u])(v - E[v])]
  V <- crossprod(dat_LHS, dat_RHS) - tcrossprod(ml * sqrt(n), mr * sqrt(n))  ## cov(u, v) = E[uv] - E[u]E[v]
  ## R-squared (explained variance) and its complement
  R2 <- (V ^ 2) * tcrossprod(1 / vl, 1 / vr)
  R2_complement <- 1 - R2
  ## slope and intercept
  beta <- V * rep(1 / vr, each = pl)
  alpha <- ml - beta * rep(mr, each = pl)
  ## residual sum of squares and standard error
  RSS <- R2_complement * vl
  sig <- sqrt(RSS * (1 / (n - 2)))
  ## statistics for slope
  beta.se <- sig * rep(1 / sqrt(vr), each = pl)
  beta.tv <- beta / beta.se
  beta.pv <- 2 * pt(abs(beta.tv), n - 2, lower.tail = FALSE)
  ## F-statistic and p-value
  F.fv <- (n - 2) * R2 / R2_complement
  F.pv <- pf(F.fv, 1, n - 2, lower.tail = FALSE)
  ## export
  data.frame(LHS = rep(colnames(dat_LHS), times = pr),
             RHS = rep(colnames(dat_RHS), each = pl),
             alpha = c(alpha),
             beta = c(beta),
             beta.se = c(beta.se),
             beta.tv = c(beta.tv),
             beta.pv = c(beta.pv),
             sig = c(sig),
             R2 = c(R2),
             F.fv = c(F.fv),
             F.pv = c(F.pv),
             stringsAsFactors = FALSE)
}

Apply this to Example 1 in the question.

general_paired_simpleLM(dat[1:3], dat[4:5])
#  LHS RHS        alpha       beta    beta.se   beta.tv      beta.pv        sig
#1   A   D -0.009212582  0.3450939 0.01171768  29.45071 1.772671e-50 0.09044509
#2   B   D  0.012474593  0.2389177 0.01420516  16.81908 1.201421e-30 0.10964516
#3   C   D -0.005958236  0.4565443 0.01397619  32.66585 1.749650e-54 0.10787785
#4   A   E  0.008650812 -0.4798639 0.01963404 -24.44040 1.738263e-43 0.10656866
#5   B   E  0.012738403 -0.3437776 0.01949488 -17.63426 3.636655e-32 0.10581331
#6   C   E  0.009068106 -0.6430553 0.02183128 -29.45569 1.746439e-50 0.11849472
#         R2      F.fv         F.pv
#1 0.8984818  867.3441 1.772671e-50
#2 0.7427021  282.8815 1.201421e-30
#3 0.9158840 1067.0579 1.749650e-54
#4 0.8590604  597.3333 1.738263e-43
#5 0.7603718  310.9670 3.636655e-32
#6 0.8985126  867.6375 1.746439e-50

Apply this to Example 2 in the question.

general_paired_simpleLM(dat[1:4], dat[5])
#  LHS RHS       alpha       beta    beta.se   beta.tv      beta.pv       sig
#1   A   E 0.008650812 -0.4798639 0.01963404 -24.44040 1.738263e-43 0.1065687
#2   B   E 0.012738403 -0.3437776 0.01949488 -17.63426 3.636655e-32 0.1058133
#3   C   E 0.009068106 -0.6430553 0.02183128 -29.45569 1.746439e-50 0.1184947
#4   D   E 0.066190196 -1.3767586 0.03597657 -38.26820 9.828853e-61 0.1952718
#         R2      F.fv         F.pv
#1 0.8590604  597.3333 1.738263e-43
#2 0.7603718  310.9670 3.636655e-32
#3 0.8985126  867.6375 1.746439e-50
#4 0.9372782 1464.4551 9.828853e-61

Apply this to Example 3 in the question.

general_paired_simpleLM(dat[1], dat[2:5])
#  LHS RHS        alpha       beta    beta.se   beta.tv      beta.pv        sig
#1   A   B  0.112229318  1.0703491 0.07686011  13.92594 5.796437e-25 0.16446951
#2   A   C  0.025628210  0.7114422 0.02789832  25.50126 4.731618e-45 0.10272687
#3   A   D -0.009212582  0.3450939 0.01171768  29.45071 1.772671e-50 0.09044509
#4   A   E  0.008650812 -0.4798639 0.01963404 -24.44040 1.738263e-43 0.10656866
#         R2     F.fv         F.pv
#1 0.6643051 193.9317 5.796437e-25
#2 0.8690390 650.3142 4.731618e-45
#3 0.8984818 867.3441 1.772671e-50
#4 0.8590604 597.3333 1.738263e-43

We can even just do a simple linear regression between two variables:

general_paired_simpleLM(dat[1], dat[2])
#  LHS RHS     alpha     beta    beta.se  beta.tv      beta.pv       sig
#1   A   B 0.1122293 1.070349 0.07686011 13.92594 5.796437e-25 0.1644695
#         R2     F.fv         F.pv
#1 0.6643051 193.9317 5.796437e-25

This means that the simpleLM function in Function to calculate R2 (R-squared) in R is now obsolete.


Appendix: Markdown (needs MathJax support) for the picture

Denote our variables by $x_1$, $x_2$, etc. A pairwise simple linear regression takes the form $$x_i = \alpha_{ij} + \beta_{ij}x_j,$$ where $\alpha_{ij}$ and $\beta_{ij}$ are the intercept and the slope of $x_i \sim x_j$, respectively. We also denote $m_i$ and $v_i$ as the sample mean and **unscaled** sample variance of $x_i$. Here, the unscaled variance is just the sum of squares without dividing by the sample size, that is, $v_i = \sum_{k = 1}^n(x_{ik} - m_i)^2 = (\sum_{k = 1}^nx_{ik}^2) - n m_i^2$. We also denote $V_{ij}$ as the **unscaled** covariance between $x_i$ and $x_j$: $V_{ij} = \sum_{k = 1}^n(x_{ik} - m_i)(x_{jk} - m_j) = (\sum_{k = 1}^nx_{ik}x_{jk}) - nm_im_j$.

Using the results for a simple linear regression given in [Function to calculate R2 (R-squared) in R](https://stackoverflow.com/a/40901487/4891738), we have $$\beta_{ij} = V_{ij} \ / \ v_j,\quad \alpha_{ij} = m_i - \beta_{ij}m_j,\quad r_{ij}^2 = V_{ij}^2 \ / \ (v_iv_j),$$ where $r_{ij}^2$ is the R-squared. Knowing $r_{ij}^2 = 1 - RSS_{ij} \ / \ TSS_{ij}$, where $RSS_{ij}$ and $TSS_{ij} = v_i$ are the residual sum of squares and total sum of squares of $x_i \sim x_j$, we can derive $RSS_{ij}$ and the residual standard error $\sigma_{ij}$ **without actually computing residuals**: $$RSS_{ij} = (1 - r_{ij}^2)v_i,\quad \sigma_{ij} = \sqrt{RSS_{ij} \ / \ (n - 2)}.$$

F-statistic $F_{ij}$ and associated p-value $p_{ij}^F$ can also be obtained from sum of squares: $$F_{ij} = \tfrac{(TSS_{ij} - RSS_{ij}) \ / \ 1}{RSS_{ij} \ / \ (n - 2)} = (n - 2) r_{ij}^2 \ / \ (1 - r_{ij}^2),\quad p_{ij}^F = 1 - \texttt{CDF_F}(F_{ij};\ 1,\ n - 2),$$ where $\texttt{CDF_F}$ denotes the CDF of F-distribution.

The only thing left is the standard error $e_{ij}$, the t-statistic $t_{ij}$ and the associated p-value $p_{ij}^t$ for $\beta_{ij}$, which are $$e_{ij} = \sigma_{ij} \ / \ \sqrt{v_j},\quad t_{ij} = \beta_{ij} \ / \ e_{ij},\quad p_{ij}^t = 2 \times \texttt{CDF_t}(-|t_{ij}|; \ n - 2),$$ where $\texttt{CDF_t}$ denotes the CDF of the t-distribution.
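
As a quick numerical check of these formulas against lm() (a sketch with simulated data; all the names are arbitrary):

set.seed(42)
x <- rnorm(30); y <- 1 + 2 * x + rnorm(30)
n <- length(x)
vx <- sum((x - mean(x)) ^ 2)               ## unscaled variance of x
vy <- sum((y - mean(y)) ^ 2)               ## unscaled variance of y
Vxy <- sum((x - mean(x)) * (y - mean(y)))  ## unscaled covariance
beta <- Vxy / vx
alpha <- mean(y) - beta * mean(x)
r2 <- Vxy ^ 2 / (vx * vy)
sig <- sqrt((1 - r2) * vy / (n - 2))  ## residual standard error
e <- sig / sqrt(vx)                   ## standard error of the slope
c(alpha = alpha, beta = beta, e = e, sig = sig, r2 = r2)
summary(lm(y ~ x))  ## the same numbers appear in the lm summary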

R - Linear regression with variables in different data frames

We may use asplit to split each matrix by row and then construct the linear models by looping over the split elements with Map:

out <- Map(function(a, b, c, d) lm(a ~ b * c + d),
           asplit(A, 1), asplit(B, 1), asplit(C, 1), asplit(D, 1))
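
For completeness, a self-contained illustration (the matrices A, B, C and D below are my own toy stand-ins; the question's actual data is not shown here):

set.seed(1)
A <- matrix(rnorm(40), 4, 10)  ## 4 models, 10 observations each
B <- matrix(rnorm(40), 4, 10)
C <- matrix(rnorm(40), 4, 10)
D <- matrix(rnorm(40), 4, 10)
out <- Map(function(a, b, c, d) lm(a ~ b * c + d),
           asplit(A, 1), asplit(B, 1), asplit(C, 1), asplit(D, 1))
length(out)      ## one fitted model per row
coef(out[[1]])   ## coefficients for the first row's model

Each element of out is an ordinary lm object, so the usual summary, coef and predict methods apply.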

How can I perform and store linear regression models between all continuous variables in a data frame?

Assuming you need pairwise comparisons between all columns of mtcars, you can use the combn() function to generate all pairwise combinations of column names (taken 2 at a time), and fit all the linear models with:

combinations <- combn(colnames(mtcars), 2)

forward <- list()
reverse <- list()

for (i in 1:ncol(combinations)) {
  forward[[i]] <- lm(formula(paste0(combinations[, i][1], "~", combinations[, i][2])), data = mtcars)
  reverse[[i]] <- lm(formula(paste0(combinations[, i][2], "~", combinations[, i][1])), data = mtcars)
}

all <- c(forward, reverse)

all will be your list with all of the linear models together, with both forward and reverse directions of associations between the two variables.
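
One optional refinement (my addition, not part of the original answer): naming the list elements by their formulas makes individual models easier to retrieve:

names(forward) <- apply(combinations, 2, function(v) paste(v[1], "~", v[2]))
names(reverse) <- apply(combinations, 2, function(v) paste(v[2], "~", v[1]))
all <- c(forward, reverse)
coef(all[["mpg ~ cyl"]])  ## look up one model by its formula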

If you want combinations between three variables, you can do combn(colnames(mtcars), 3), and so on.

Fast group-by simple linear regression

We have to do this by writing compiled code. I would present this with Rcpp. Be aware that I am a C programmer and have been using R's conventional C interface. Rcpp is used just to ease handling of lists, strings and attributes, as well as to facilitate immediate testing in R. The code is largely written in C-style. Macros from R's conventional C interface like REAL and INTEGER are still used. See the bottom of this answer for "group_by_simpleLM.cpp".

The R wrapper function group_by_simpleLM has four arguments:

group_by_simpleLM <- function (dat, LHS, RHS, group = NULL) {
    ## .... [TRUNCATED]

  • dat is a data frame. If you feed a matrix or a list, it will stop and complain.
  • LHS is a character vector giving the names for the variables on the left hand side of ~. Multiple LHS variables are supported.
  • RHS is a character vector giving the names for the variables on the right hand side of ~. Only a single, non-factor RHS variable is allowed in simple linear regression. You can provide a vector of variables to RHS, but the function will only retain the first element (with a warning). If that variable is not found in dat (maybe because you've mistyped the name) or it is not a numerical variable, it will give you an informative error message.
  • group is a character vector giving the name of the grouping variable. It is best if this is already a factor in dat; otherwise the function uses match(group, unique(group)) for a fast coercion (illustrated just below this list) and a warning is issued. A factor with unused levels is no harm: group_by_simpleLM_cpp sees this and returns all NaN for such levels. group can be NULL, so that a single regression is done for all data.
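
To illustrate the fast coercion mentioned in the last bullet (a tiny sketch):

g <- c("b", "a", "b", "c")
match(g, unique(g))  ## 1 2 1 3: integer codes, with levels in order of appearance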

The workhorse function group_by_simpleLM_cpp returns a named list of matrices to hold regression results for each response. Each matrix is "wide" with nlevels(group) columns and 5 rows:

  • "alpha" (for intercept);
  • "beta" (for slope);
  • "beta.se" (for standard error of slope);
  • "r2" (for R-squared);
  • "df.resid" (for residual degree of freedom);

For a simple linear regression, these five statistics are sufficient for obtaining all the other statistics, as the sketch below shows.
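
For example, a small helper (my own sketch, not part of the original code) can recover the t-statistic, its p-value and the F-statistic from one returned matrix, using the formulas from the appendix above:

more_stats <- function (mat) {
  ## 'mat' is one matrix returned by group_by_simpleLM, e.g. result$y1
  tv <- mat["beta", ] / mat["beta.se", ]  ## t-statistic of the slope
  pv <- 2 * pt(abs(tv), mat["df.resid", ], lower.tail = FALSE)
  F.fv <- mat["df.resid", ] * mat["r2", ] / (1 - mat["r2", ])
  F.pv <- pf(F.fv, 1, mat["df.resid", ], lower.tail = FALSE)
  rbind(mat, beta.tv = tv, beta.pv = pv, F.fv = F.fv, F.pv = F.pv)
}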

The function watches out for the rank-deficient case where a group has only one datum; the slope cannot be estimated then and NaN is returned. Another special case is when a group has only two data points: the fit is then perfect and you get 0 standard error for the slope.

The function is a fast method for nlme::lmList(LHS ~ RHS | group, dat, pool = FALSE) when group != NULL, and a fast method for lm(LHS ~ RHS, dat) when group = NULL (it may even be faster than general_paired_simpleLM, because it is written in C).

Caution:

  • weighted regression is not handled, because the covariance method is invalid in that case.
  • no check for NA / NaN / Inf / -Inf in dat is made; the functions break in their presence (see the pre-filtering sketch below).
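
A minimal pre-filtering sketch (my addition; the column names here are the toy dataset's, and the selected variables must be numeric):

keep <- rowSums(!is.finite(as.matrix(dat[c("y1", "y2", "x")]))) == 0
dat <- dat[keep, ]  ## drop rows containing NA / NaN / Inf / -Inf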

Examples

library(Rcpp)
sourceCpp("group_by_simpleLM.cpp")

## a toy dataset
set.seed(0)
dat <- data.frame(y1 = rnorm(10), y2 = rnorm(10), x = 1:5,
                  f = gl(2, 5, labels = letters[1:2]),
                  g = sample(gl(2, 5, labels = LETTERS[1:2])))

group-by regression: a fast method of nlme::lmList

group_by_simpleLM(dat, c("y1", "y2"), "x", "f")
#$y1
#                    a          b
#alpha     0.820107094 -2.7164723
#beta     -0.009796302  0.8812007
#beta.se   0.266690568  0.2090644
#r2        0.000449565  0.8555330
#df.resid  3.000000000  3.0000000
#
#$y2
#                  a           b
#alpha     0.1304709  0.06996587
#beta     -0.1616069 -0.14685953
#beta.se   0.2465047  0.24815024
#r2        0.1253142  0.10454374
#df.resid  3.0000000  3.00000000

fit <- nlme::lmList(cbind(y1, y2) ~ x | f, data = dat, pool = FALSE)

## results for level "a"; use `fit[[2]]` to see results for level "b"
lapply(summary(fit[[1]]), "[", c("coefficients", "r.squared"))
#$`Response y1`
#$`Response y1`$coefficients
#                Estimate Std. Error     t value  Pr(>|t|)
#(Intercept)  0.820107094  0.8845125  0.92718537 0.4222195
#x           -0.009796302  0.2666906 -0.03673284 0.9730056
#
#$`Response y1`$r.squared
#[1] 0.000449565
#
#
#$`Response y2`
#$`Response y2`$coefficients
#              Estimate Std. Error    t value  Pr(>|t|)
#(Intercept)  0.1304709  0.8175638  0.1595850 0.8833471
#x           -0.1616069  0.2465047 -0.6555936 0.5588755
#
#$`Response y2`$r.squared
#[1] 0.1253142

dealing with rank-deficiency without breaking down

## with unused level "b"
group_by_simpleLM(dat[1:5, ], "y1", "x", "f")
#$y1
#                    a   b
#alpha     0.820107094 NaN
#beta     -0.009796302 NaN
#beta.se   0.266690568 NaN
#r2        0.000449565 NaN
#df.resid  3.000000000 NaN

## rank-deficient case for level "b"
group_by_simpleLM(dat[1:6, ], "y1", "x", "f")
#$y1
#                    a        b
#alpha     0.820107094 -1.53995
#beta     -0.009796302      NaN
#beta.se   0.266690568      NaN
#r2        0.000449565      NaN
#df.resid  3.000000000  0.00000

more than one grouping variable

When we have more than one grouping variable, group_by_simpleLM cannot handle them directly, but you can use interaction to create a single factor variable first.

dat$fg <- with(dat, interaction(f, g, drop = TRUE, sep = ":"))
group_by_simpleLM(dat, c("y1", "y2"), "x", "fg")
#$y1
#                 a:A        b:A        a:B        b:B
#alpha      1.4750325 -2.7684583 -1.6393289 -1.8513669
#beta      -0.2120782  0.9861509  0.7993313  0.4613999
#beta.se    0.0000000  0.2098876  0.4946167  0.0000000
#r2         1.0000000  0.9566642  0.7231188  1.0000000
#df.resid   0.0000000  1.0000000  1.0000000  0.0000000
#
#$y2
#                 a:A         b:A        a:B        b:B
#alpha      1.0292956 -0.22746944 -1.5096975 0.06876360
#beta      -0.2657021 -0.20650690  0.2547738 0.09172993
#beta.se    0.0000000  0.01945569  0.3483856 0.00000000
#r2         1.0000000  0.99120195  0.3484482 1.00000000
#df.resid   0.0000000  1.00000000  1.0000000 0.00000000

fit <- nlme::lmList(cbind(y1, y2) ~ x | fg, data = dat, pool = FALSE)

## note that the first group a:A only has two values, so df.resid = 0
## my method returns 0 standard error for the slope
## but lm or lmList would return NaN
lapply(summary(fit[[1]]), "[", c("coefficients", "r.squared"))
#$`Response y1`
#$`Response y1`$coefficients
#              Estimate Std. Error t value Pr(>|t|)
#(Intercept)  1.4750325        NaN     NaN      NaN
#x           -0.2120782        NaN     NaN      NaN
#
#$`Response y1`$r.squared
#[1] 1
#
#
#$`Response y2`
#$`Response y2`$coefficients
#              Estimate Std. Error t value Pr(>|t|)
#(Intercept)  1.0292956        NaN     NaN      NaN
#x           -0.2657021        NaN     NaN      NaN
#
#$`Response y2`$r.squared
#[1] 1

no grouping: a fast method of lm

group_by_simpleLM(dat, c("y1", "y2"), "x", NULL)
#$y1
#                ALL
#alpha    -0.9481826
#beta      0.4357022
#beta.se   0.2408162
#r2        0.2903691
#df.resid  8.0000000
#
#$y2
#                ALL
#alpha     0.1002184
#beta     -0.1542332
#beta.se   0.1514935
#r2        0.1147012
#df.resid  8.0000000

fast large simple linear regression

set.seed(0L)
nSubj <- 200e3
nr <- 1e6
DF <- data.frame(subject = gl(nSubj, 5),
                 day = 3:7,
                 y1 = runif(nr),
                 y2 = rpois(nr, 3),
                 y3 = rnorm(nr),
                 y4 = rnorm(nr, 1, 5))

system.time(group_by_simpleLM(DF, paste0("y", 1:4), "day", "subject"))
#   user  system elapsed
#  0.200   0.016   0.219

library(MatrixModels)
system.time(glm4(y1 ~ 0 + subject + day:subject, data = DF, sparse = TRUE))
#   user  system elapsed
#  9.012   0.172   9.266

group_by_simpleLM does all 4 responses almost instantly, while glm4 needs 9s for one response alone!

Note that glm4 can break down in the rank-deficient case, while group_by_simpleLM does not.


Appendix: "group_by_simpleLM.cpp"

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
List group_by_simpleLM_cpp (List Y, NumericVector x, IntegerVector group, CharacterVector group_levels, bool group_unsorted) {

    /* number of data and number of responses */
    int n = x.size(), k = Y.size(), n_groups = group_levels.size();

    /* set up result list */
    List result(k);
    List dimnames = List::create(CharacterVector::create("alpha", "beta", "beta.se", "r2", "df.resid"), group_levels);
    int j; for (j = 0; j < k; j++) {
        NumericMatrix mat(5, n_groups);
        mat.attr("dimnames") = dimnames;
        result[j] = mat;
    }
    result.attr("names") = Y.attr("names");

    /* set up a vector to hold sample size for each group */
    size_t *group_offset = (size_t *)calloc(n_groups + 1, sizeof(size_t));

    /*
      compute group offset: cumsum(group_offset)
      The offset is used in a different way when group is sorted or unsorted.
      In the former case, it is the offset to real x, y values;
      in the latter case, it is the offset to the ordering index indx.
    */
    int *u = INTEGER(group), *u_end = u + n, i;
    if (n_groups > 1) {
        while (u < u_end) group_offset[*u++]++;
        for (i = 0; i < n_groups; i++) group_offset[i + 1] += group_offset[i];
    } else {
        group_offset[1] = n;
        group_unsorted = 0;
    }

    /* local variables & pointers */
    double *xi, *xi_end;    /* pointers to the 1st and the last x value */
    double *yi;             /* pointer to the first y value */
    int gi; double inv_gi;  /* sample size of the i-th group */
    double xi_mean, xi_var; /* mean & variance of x values in the i-th group */
    double yi_mean, yi_var; /* mean & variance of y values in the i-th group */
    double xiyi_cov;        /* covariance between x and y values in the i-th group */
    double beta, r2; int df_resi;
    double *matij;

    /* additional storage and variables when group is unsorted */
    int *indx; double *xb, *xbi, dtmp;
    if (group_unsorted) {
        indx = (int *)malloc(n * sizeof(int));
        xb = (double *)malloc(n * sizeof(double));   // buffer x for caching
        R_orderVector1(indx, n, group, TRUE, FALSE); // Er, how is TRUE & FALSE recognized as Rboolean?
    }

    /* loop through groups */
    for (i = 0; i < n_groups; i++) {
        /* set group size gi */
        gi = group_offset[i + 1] - group_offset[i];
        /* special case for a factor level with no data */
        if (gi == 0) {
            for (j = 0; j < k; j++) {
                /* matrix column for write-back */
                matij = REAL(result[j]) + i * 5;
                matij[0] = R_NaN; matij[1] = R_NaN; matij[2] = R_NaN;
                matij[3] = R_NaN; matij[4] = R_NaN;
            }
            continue;
        }
        /* rank-deficient case */
        if (gi == 1) {
            gi = group_offset[i];
            if (group_unsorted) gi = indx[gi];
            for (j = 0; j < k; j++) {
                /* matrix column for write-back */
                matij = REAL(result[j]) + i * 5;
                matij[0] = REAL(Y[j])[gi];
                matij[1] = R_NaN; matij[2] = R_NaN;
                matij[3] = R_NaN; matij[4] = 0.0;
            }
            continue;
        }
        /* general case where a regression line can be estimated */
        inv_gi = 1 / (double)gi;
        /* compute mean & variance of x values in this group */
        xi_mean = 0.0; xi_var = 0.0;
        if (group_unsorted) {
            /* use u, u_end and xbi */
            xi = REAL(x);
            u = indx + group_offset[i]; /* offset acts on index */
            u_end = u + gi;
            xbi = xb + group_offset[i];
            for (; u < u_end; xbi++, u++) {
                dtmp = xi[*u];
                xi_mean += dtmp;
                xi_var += dtmp * dtmp;
                *xbi = dtmp;
            }
        } else {
            /* use xi and xi_end */
            xi = REAL(x) + group_offset[i]; /* offset acts on values */
            xi_end = xi + gi;
            for (; xi < xi_end; xi++) {
                xi_mean += *xi;
                xi_var += (*xi) * (*xi);
            }
        }
        xi_mean = xi_mean * inv_gi;
        xi_var = xi_var * inv_gi - xi_mean * xi_mean;
        /* loop through responses doing simple linear regression */
        for (j = 0; j < k; j++) {
            /* compute mean & variance of y values, as well as its covariance with x values */
            yi_mean = 0.0; yi_var = 0.0; xiyi_cov = 0.0;
            if (group_unsorted) {
                xbi = xb + group_offset[i]; /* use buffered x values */
                yi = REAL(Y[j]);
                u = indx + group_offset[i]; /* offset acts on index */
                for (; u < u_end; u++, xbi++) {
                    dtmp = yi[*u];
                    yi_mean += dtmp;
                    yi_var += dtmp * dtmp;
                    xiyi_cov += dtmp * (*xbi);
                }
            } else {
                /* set xi and yi */
                xi = REAL(x) + group_offset[i];    /* offset acts on values */
                yi = REAL(Y[j]) + group_offset[i]; /* offset acts on values */
                for (; xi < xi_end; xi++, yi++) {
                    yi_mean += *yi;
                    yi_var += (*yi) * (*yi);
                    xiyi_cov += (*yi) * (*xi);
                }
            }
            yi_mean = yi_mean * inv_gi;
            yi_var = yi_var * inv_gi - yi_mean * yi_mean;
            xiyi_cov = xiyi_cov * inv_gi - xi_mean * yi_mean;
            /* locate the right place to write back regression result */
            matij = REAL(result[j]) + i * 5 + 4;
            /* residual degrees of freedom */
            df_resi = gi - 2; *matij-- = (double)df_resi;
            /* R-squared = squared correlation */
            r2 = (xiyi_cov * xiyi_cov) / (xi_var * yi_var); *matij-- = r2;
            /* standard error of regression slope */
            if (df_resi == 0) *matij-- = 0.0;
            else *matij-- = sqrt((1 - r2) * yi_var / (df_resi * xi_var));
            /* regression slope */
            beta = xiyi_cov / xi_var; *matij-- = beta;
            /* regression intercept */
            *matij = yi_mean - beta * xi_mean;
        }
    }

    if (group_unsorted) {
        free(indx);
        free(xb);
    }
    free(group_offset);
    return result;
}

/*** R
group_by_simpleLM <- function (dat, LHS, RHS, group = NULL) {

  ## basic input validation
  if (missing(dat)) stop("no data provided to 'dat'!")
  if (!is.data.frame(dat)) stop("'dat' must be a data frame!")

  if (missing(LHS)) stop("no 'LHS' provided!")
  if (!is.character(LHS)) stop("'LHS' must be provided as a character vector of variable names!")

  if (missing(RHS)) stop("no 'RHS' provided!")
  if (!is.character(RHS)) stop("'RHS' must be provided as a character vector of variable names!")

  if (!is.null(group)) {

    ## grouping variable provided: a fast method of `nlme::lmList`

    if (!is.character(group)) stop("'group' must be provided as a character vector of variable names!")

    ## ensure that group has length 1, is available in the data frame and is a factor
    if (length(group) > 1L) {
      warning("only one grouping variable allowed for group-by simple linear regression; ignoring all but the 1st variable provided!")
      group <- group[1L]
    }
    grp <- dat[[group]]
    if (is.null(grp)) stop(sprintf("grouping variable '%s' not found in 'dat'!", group))

    if (is.factor(grp)) {
      grp_levels <- levels(grp)
    } else {
      warning("grouping variable is not provided as a factor; fast coercion is made!")
      grp_levels <- unique(grp)
      grp <- match(grp, grp_levels)
      grp_levels <- as.character(grp_levels)
    }

    grp_unsorted <- .Internal(is.unsorted(grp, FALSE))

  } else {

    ## no grouping; a fast method of `lm`
    grp <- 1L; grp_levels <- "ALL"; grp_unsorted <- FALSE

  }

  ## the RHS must have length 1, be available in the data frame and be numeric
  if (length(RHS) > 1L) {
    warning("only one RHS variable allowed for simple linear regression; ignoring all but the 1st variable provided!")
    RHS <- RHS[1L]
  }
  x <- dat[[RHS]]
  if (is.null(x)) stop(sprintf("RHS variable '%s' not found in 'dat'!", RHS))
  if (!is.numeric(x) || is.factor(x)) {
    stop("RHS variable must be 'numeric' for simple linear regression!")
  }
  x <- as.numeric(x)  ## just in case that `x` is an integer

  ## check LHS variables
  nested <- match(RHS, LHS, nomatch = 0L)
  if (nested > 0L) {
    warning(sprintf("RHS variable '%s' found in LHS variables; removing it from LHS", RHS))
    LHS <- LHS[-nested]
  }
  if (length(LHS) == 0L) stop("no usable LHS variables found!")
  missed <- !(LHS %in% names(dat))
  if (any(missed)) {
    warning(sprintf("LHS variables '%s' not found in 'dat'; removing them from LHS", toString(LHS[missed])))
    LHS <- LHS[!missed]
  }
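
  ## NOTE: the original post is cut off at this point. The remainder below is
  ## a reconstruction (an assumption on my part), based on the signature of
  ## 'group_by_simpleLM_cpp' defined above.

  ## collect LHS variables into a named list of numeric vectors
  Y <- vector("list", length(LHS))
  names(Y) <- LHS
  for (i in seq_along(LHS)) {
    y <- dat[[LHS[i]]]
    if (!is.numeric(y) || is.factor(y)) stop(sprintf("LHS variable '%s' must be 'numeric'!", LHS[i]))
    Y[[i]] <- as.numeric(y)
  }

  ## call the C++ workhorse
  group_by_simpleLM_cpp(Y, x, grp, grp_levels, grp_unsorted)
}
*/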

