Fitting Linear Model/Anova by Group

Fitting linear model / ANOVA by group

I think you are looking for by facility in R.

fit <- with(mhw, by(mhw, Quad, function (dat) lm(grain ~ straw, data = dat)))

Since you have 4 levels in Quad, you end up with 4 linear models in fit, i.e., fit is a "by" class object (a type of "list") of length 4.

To get coefficient for each model, you can use

sapply(fit, coef)

To produce model summary, use

lapply(fit, summary)

To export ANOVA table, use

lapply(fit, anova)

As a reproducible example, I am taking the example from ?by:

tmp <- with(warpbreaks,
by(warpbreaks, tension,
function(x) lm(breaks ~ wool, data = x)))

class(tmp)
# [1] "by"

mode(tmp)
# [1] "list"

sapply(tmp, coef)

# L M H
#(Intercept) 44.55556 24.000000 24.555556
#woolB -16.33333 4.777778 -5.777778

lapply(tmp, anova)

#$L
#Analysis of Variance Table
#
#Response: breaks
# Df Sum Sq Mean Sq F value Pr(>F)
#wool 1 1200.5 1200.50 5.6531 0.03023 *
#Residuals 16 3397.8 212.36
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#$M
#Analysis of Variance Table
#
#Response: breaks
# Df Sum Sq Mean Sq F value Pr(>F)
#wool 1 102.72 102.722 1.2531 0.2795
#Residuals 16 1311.56 81.972
#
#$H
#Analysis of Variance Table
#
#Response: breaks
# Df Sum Sq Mean Sq F value Pr(>F)
#wool 1 150.22 150.222 2.3205 0.1472
#Residuals 16 1035.78 64.736

I was aware of this option, but not familiar with it. Thanks to @Roland for providing code for the above reproducible example:

library(nlme)
lapply(lmList(breaks ~ wool | tension, data = warpbreaks), anova)

For your data I think it would be

fit <- lmList(grain ~ straw | Quad, data = mhw)
lapply(fit, anova)

You don't need to install nlme; it comes with R as one of recommended packages.

Linear Regression and group by in R

Here's one way using the lme4 package.

 library(lme4)
d <- data.frame(state=rep(c('NY', 'CA'), c(10, 10)),
year=rep(1:10, 2),
response=c(rnorm(10), rnorm(10)))

xyplot(response ~ year, groups=state, data=d, type='l')

fits <- lmList(response ~ year | state, data=d)
fits
#------------
Call: lmList(formula = response ~ year | state, data = d)
Coefficients:
(Intercept) year
CA -1.34420990 0.17139963
NY 0.00196176 -0.01852429

Degrees of freedom: 20 total; 16 residual
Residual standard error: 0.8201316

Explaining Anova of linear model for correlation

Statistical explanation of ANOVA

For ANOVA, where we are using a linear model between one numerical and one nominal variable the null and alternative hypothesis looks like this:

$H_0$: the group means of the numerical variable grouped by the nominal variable are the same for all groups (i.e. the numerical variable is independent from the nominal variable, they are not "correlated")

$H_1$: at least one group mean differs from the others (i.e. the variables not independent, they are "correlated")

For testing these hypotheses we have to calculate the F test statistic, for that calculation we are using the ANOVA table:

---------------------------------------------------------------------
| Source | SS | df | MS | F | p-value |
|-----------|-----|-----------|-----------------|---------|---------|
| Treatment | SST | dfT = k-1 | MST = SST/(k-1) | MST/MSE | ... |
| Error | SSE | dfE = N-k | MSE = SSE/(N-k) | - | - |
---------------------------------------------------------------------
| Total | SS | df = N-1 | - | - | - |
---------------------------------------------------------------------

where

  • $SST$ is the sum of squared difference between the group means and the overall mean
  • $SSE$ is the sum of squared difference between the individual values and the overall mean
  • $k$ is the number of groups
  • $N$ is the number of observations
  • $MST$ is the mean squared difference between the group means and the overall mean
  • $MSE$ is the mean squared difference between the individual values and the overall mean

So in this sense the F test statistic "compares" the $MST$ and the $MSE$ as the calculation of their ratio is a kind of comparison. For a given significance level alpha (mainly 5%) we can get the value of the F distribution ($F\left(\alpha, df_T, df_E\right)$), and if the $F$ value from the ANOVA table is more then the value of $F$ distribution ($F < F\left(\alpha, df_T, df_E\right)$), we reject $H_0$, otherwise accept it. The $p\text{-value}$ is where $F = F\left(p\text{-value}, df_T, df_E\right)$ (the point where we can't deicide), that's why we can use the decision rule if $p\text{-value} < \text{choosen significance level}$ then $H_0$ rejected.

More statistical details

http://www.itl.nist.gov/div898/handbook/prc/section4/prc431.htm#SST%20and%20SSE

http://www.itl.nist.gov/div898/handbook/prc/section4/prc433.htm

R code for the example in the linked page

lvl1 <- c(6.9, 5.4, 5.8, 4.6, 4.0)
lvl2 <- c(8.3, 6.8, 7.8, 9.2, 6.5)
lvl3 <- c(8.0, 10.5, 8.1, 6.9, 9.3)
group <- gl(3, 5, labels = c("Lvl1","Lvl2", "Lvl3"))
values <- c(lvl1, lvl2, lvl3)
anova(lm(values ~ group))

How to calculate linear models individually in melted Dataframe

Use group_split by Site and then map with lm():

library(tidyverse)

test %>%
group_split(Site) %>%
map(~lm(value ~ Time * Group, data = .))

Output:

[[1]]

Call:
lm(formula = value ~ Time * Group, data = .)

Coefficients:
(Intercept) Time GroupG2 Time:GroupG2
-0.6393 0.5201 3.6533 -1.2188

[[2]]

Call:
lm(formula = value ~ Time * Group, data = .)

Coefficients:
(Intercept) Time GroupG2 Time:GroupG2
-0.38982 0.24745 0.58777 -0.08554

[[3]]

Call:
lm(formula = value ~ Time * Group, data = .)

Coefficients:
(Intercept) Time GroupG2 Time:GroupG2
0.17921 0.02528 2.13208 -0.34299

Add %>% summary() or whatever other post-fitting processes you want, within the call to map():

map(~lm(value ~ Time * Group, data = .) %>% summary())

Overall ANOVA for Linear Models in R

supernova function from supernova package with option type = 1 gives the required output.

library(supernova)
supernova(fm1, type = 1, verbose = FALSE)
Analysis of Variance Table (Type I SS)
Model: Y ~ X1 + X2

SS df MS F PRE p
----- | ---------- - ---------- ----- ------ -----
Model | 153616.587 2 76808.293 1.586 0.3119 .2703
X1 | 37842.327 1 37842.327 0.782 0.1004 .4060
X2 | 115774.260 1 115774.260 2.391 0.2546 .1660
Error | 338940.570 7 48420.081
----- | ---------- - ---------- ----- ------ -----
Total | 492557.157 9 54728.573

How to apply a linear model to specific parts of a data frame?

The by() function in base R provides one solution:

by(td, td$Species, function(df) lm(df[,"Temp"] ~ df[,"Time"]))

If you want different rules for different subsets, you're probably going to need to do it group by group. For example, to do Species-Y for temps of 20-29 (inclusive), you could run:

lm(Temp ~ Time, data = td[td$Species == "Species-Y" & td$Temp >= 20 & td$Temp <= 29, ])

Faster anova of regression models

We could hack stats:::anova.lmlist so that it works for lists produced by .lm.fit (notice the dot) and RcppArmadillo::fastLm. Please check stats:::anova.lmlist, if I didn't delete stuff you need!

anova2 <- function(object, ...) {
objects <- list(object, ...)
ns <- vapply(objects, function(x) length(x$residuals), 0L)
stopifnot(!any(ns != ns[1L]))
resdf <- vapply(objects, df.residual, 0L)
resdev <- vapply(objects, function(x) crossprod(residuals(x)), 0)
bigmodel <- order(resdf)[1L]
dfs <- c(NA, -diff(resdf))[bigmodel]
ssq <- c(NA, -diff(resdev))[bigmodel]
df.scale <- resdf[bigmodel]
scale <- resdev[bigmodel]/resdf[bigmodel]
fstat <- (ssq/dfs)/scale
p <- pf(fstat, abs(dfs), df.scale, lower.tail=FALSE)
return(c(F=fstat, p=p))
}

These wrappers make .lm.fit and RcppArmadillo::fastLm a little more convenient:

lm_fun <- function(fo, dat) {
X <- model.matrix(fo, dat)
fit <- .lm.fit(X, dat[[all.vars(fo)[1]]])
fit$df.residual <- dim(X)[1] - dim(X)[2]
return(fit)
}

fastLm_fun <- function(fo, dat) {
fit <- RcppArmadillo::fastLm(model.matrix(fo, dat), dat[[all.vars(fo)[1]]])
return(fit)
}

Use it

fit1 <- lm_fun(y ~ x1 + x3, data)
fit2 <- lm_fun(y ~ x2 + x3 + x4, data)
anova2(fit1, fit2)
# F p
# 0.3609728 0.5481032

## Or use `fastLm_fun` here, but it's slower.

Compare

anova(lm(y ~ x1 + x3, data), lm(y ~ x2 + x3 + x4, data))
# Analysis of Variance Table
#
# Model 1: y ~ x1 + x3
# Model 2: y ~ x2 + x3 + x4
# Res.Df RSS Df Sum of Sq F Pr(>F)
# 1 997 1003.7
# 2 996 1003.4 1 0.36365 0.361 0.5481

lm_fun (which outperforms fastLm_fun) combined with anova2 appears to be around 60% faster than the conventional approach:

microbenchmark::microbenchmark(
conventional={
fit1 <- lm(y ~ x1 + x3, data)
fit2 <- lm(y ~ x2 + x3 + x4, data)
anova(fit1, fit2)
},
anova2={ ## using `.lm.fit`
fit1 <- lm_fun(y ~ x1 + x3, data)
fit2 <- lm_fun(y ~ x2 + x3 + x4, data)
anova2(fit1, fit2)
},
anova2_Fast={ ## using `RcppArmadillo::fastLm`
fit1 <- fastLm_fun(y ~ x1 + x3, data)
fit2 <- fastLm_fun(y ~ x2 + x3 + x4, data)
anova2(fit1, fit2)
},
anova_Donald={
fit1 <- lm_fun(y ~ x1 + x3, data)
fit2 <- lm_fun(y ~ x2 + x3 + x4, data)
anova_Donald(fit1, fit2)
},
times=1000L,
control=list(warmup=100L)
)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# conventional 2.885718 2.967053 3.205947 3.018668 3.090954 40.15720 1000 d
# anova2 1.180683 1.218121 1.285131 1.233335 1.267833 23.81955 1000 a
# anova2_Fast 1.961897 2.012756 2.179458 2.037854 2.087893 26.65279 1000 c
# anova_Donald 1.368699 1.409198 1.561751 1.430562 1.472148 33.12247 1000 b

Data:

set.seed(42)
data <- data.frame(y=rnorm(1000), x1=rnorm(1000), x2=rnorm(1000), x3=rnorm(1000),
x4=rnorm(1000))

dplyr version of grouping a dataframe then creating regression model on each group

Returning a list from dplyr is not possible yet. If you just need the intercept and slope @jazzurro 's answer is the way, but if you need the whole model you need to do something like

library(dplyr)
models <- df %>% group_by(country) %>% do(mod = lm(BirthRate ~ US., data = .))

Then if you want to perform ANOVA on each fitted model, you can do it using rowwise

models %>% rowwise %>% do(anova(.$mod))

but again the result is coerced to a data frame and is not quite the same as doing lapply(models$mod, anova).

For now (ie until the next version of dplyr) if you need to store the whole result in a list, you can just use dlply from plyr, like plyr::dlply(df, "country", function(d) anova(lm(BirthRate ~ US., data = d))), or of course if you do not absolutely have to use dplyr you can go for @SvenHohenstein 's answer which looks like a better way of doing this anyway.



Related Topics



Leave a reply



Submit