Fitting linear model / ANOVA by group
I think you are looking for by
facility in R.
fit <- with(mhw, by(mhw, Quad, function (dat) lm(grain ~ straw, data = dat)))
Since you have 4 levels in Quad
, you end up with 4 linear models in fit
, i.e., fit
is a "by" class object (a type of "list") of length 4.
To get coefficient for each model, you can use
sapply(fit, coef)
To produce model summary, use
lapply(fit, summary)
To export ANOVA table, use
lapply(fit, anova)
As a reproducible example, I am taking the example from ?by
:
tmp <- with(warpbreaks,
by(warpbreaks, tension,
function(x) lm(breaks ~ wool, data = x)))
class(tmp)
# [1] "by"
mode(tmp)
# [1] "list"
sapply(tmp, coef)
# L M H
#(Intercept) 44.55556 24.000000 24.555556
#woolB -16.33333 4.777778 -5.777778
lapply(tmp, anova)
#$L
#Analysis of Variance Table
#
#Response: breaks
# Df Sum Sq Mean Sq F value Pr(>F)
#wool 1 1200.5 1200.50 5.6531 0.03023 *
#Residuals 16 3397.8 212.36
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#$M
#Analysis of Variance Table
#
#Response: breaks
# Df Sum Sq Mean Sq F value Pr(>F)
#wool 1 102.72 102.722 1.2531 0.2795
#Residuals 16 1311.56 81.972
#
#$H
#Analysis of Variance Table
#
#Response: breaks
# Df Sum Sq Mean Sq F value Pr(>F)
#wool 1 150.22 150.222 2.3205 0.1472
#Residuals 16 1035.78 64.736
I was aware of this option, but not familiar with it. Thanks to @Roland for providing code for the above reproducible example:
library(nlme)
lapply(lmList(breaks ~ wool | tension, data = warpbreaks), anova)
For your data I think it would be
fit <- lmList(grain ~ straw | Quad, data = mhw)
lapply(fit, anova)
You don't need to install nlme
; it comes with R as one of recommended packages.
Linear Regression and group by in R
Here's one way using the lme4
package.
library(lme4)
d <- data.frame(state=rep(c('NY', 'CA'), c(10, 10)),
year=rep(1:10, 2),
response=c(rnorm(10), rnorm(10)))
xyplot(response ~ year, groups=state, data=d, type='l')
fits <- lmList(response ~ year | state, data=d)
fits
#------------
Call: lmList(formula = response ~ year | state, data = d)
Coefficients:
(Intercept) year
CA -1.34420990 0.17139963
NY 0.00196176 -0.01852429
Degrees of freedom: 20 total; 16 residual
Residual standard error: 0.8201316
Explaining Anova of linear model for correlation
Statistical explanation of ANOVA
For ANOVA, where we are using a linear model between one numerical and one nominal variable the null and alternative hypothesis looks like this:
$H_0$: the group means of the numerical variable grouped by the nominal variable are the same for all groups (i.e. the numerical variable is independent from the nominal variable, they are not "correlated")
$H_1$: at least one group mean differs from the others (i.e. the variables not independent, they are "correlated")
For testing these hypotheses we have to calculate the F test statistic, for that calculation we are using the ANOVA table:
---------------------------------------------------------------------
| Source | SS | df | MS | F | p-value |
|-----------|-----|-----------|-----------------|---------|---------|
| Treatment | SST | dfT = k-1 | MST = SST/(k-1) | MST/MSE | ... |
| Error | SSE | dfE = N-k | MSE = SSE/(N-k) | - | - |
---------------------------------------------------------------------
| Total | SS | df = N-1 | - | - | - |
---------------------------------------------------------------------
where
- $SST$ is the sum of squared difference between the group means and the overall mean
- $SSE$ is the sum of squared difference between the individual values and the overall mean
- $k$ is the number of groups
- $N$ is the number of observations
- $MST$ is the mean squared difference between the group means and the overall mean
- $MSE$ is the mean squared difference between the individual values and the overall mean
So in this sense the F test statistic "compares" the $MST$ and the $MSE$ as the calculation of their ratio is a kind of comparison. For a given significance level alpha (mainly 5%) we can get the value of the F distribution ($F\left(\alpha, df_T, df_E\right)$), and if the $F$ value from the ANOVA table is more then the value of $F$ distribution ($F < F\left(\alpha, df_T, df_E\right)$), we reject $H_0$, otherwise accept it. The $p\text{-value}$ is where $F = F\left(p\text{-value}, df_T, df_E\right)$ (the point where we can't deicide), that's why we can use the decision rule if $p\text{-value} < \text{choosen significance level}$ then $H_0$ rejected.
More statistical details
http://www.itl.nist.gov/div898/handbook/prc/section4/prc431.htm#SST%20and%20SSE
http://www.itl.nist.gov/div898/handbook/prc/section4/prc433.htm
R code for the example in the linked page
lvl1 <- c(6.9, 5.4, 5.8, 4.6, 4.0)
lvl2 <- c(8.3, 6.8, 7.8, 9.2, 6.5)
lvl3 <- c(8.0, 10.5, 8.1, 6.9, 9.3)
group <- gl(3, 5, labels = c("Lvl1","Lvl2", "Lvl3"))
values <- c(lvl1, lvl2, lvl3)
anova(lm(values ~ group))
How to calculate linear models individually in melted Dataframe
Use group_split
by Site
and then map
with lm()
:
library(tidyverse)
test %>%
group_split(Site) %>%
map(~lm(value ~ Time * Group, data = .))
Output:
[[1]]
Call:
lm(formula = value ~ Time * Group, data = .)
Coefficients:
(Intercept) Time GroupG2 Time:GroupG2
-0.6393 0.5201 3.6533 -1.2188
[[2]]
Call:
lm(formula = value ~ Time * Group, data = .)
Coefficients:
(Intercept) Time GroupG2 Time:GroupG2
-0.38982 0.24745 0.58777 -0.08554
[[3]]
Call:
lm(formula = value ~ Time * Group, data = .)
Coefficients:
(Intercept) Time GroupG2 Time:GroupG2
0.17921 0.02528 2.13208 -0.34299
Add %>% summary()
or whatever other post-fitting processes you want, within the call to map()
:
map(~lm(value ~ Time * Group, data = .) %>% summary())
Overall ANOVA for Linear Models in R
supernova
function from supernova package with option type = 1
gives the required output.
library(supernova)
supernova(fm1, type = 1, verbose = FALSE)
Analysis of Variance Table (Type I SS)
Model: Y ~ X1 + X2
SS df MS F PRE p
----- | ---------- - ---------- ----- ------ -----
Model | 153616.587 2 76808.293 1.586 0.3119 .2703
X1 | 37842.327 1 37842.327 0.782 0.1004 .4060
X2 | 115774.260 1 115774.260 2.391 0.2546 .1660
Error | 338940.570 7 48420.081
----- | ---------- - ---------- ----- ------ -----
Total | 492557.157 9 54728.573
How to apply a linear model to specific parts of a data frame?
The by()
function in base R provides one solution:
by(td, td$Species, function(df) lm(df[,"Temp"] ~ df[,"Time"]))
If you want different rules for different subsets, you're probably going to need to do it group by group. For example, to do Species-Y for temps of 20-29 (inclusive), you could run:
lm(Temp ~ Time, data = td[td$Species == "Species-Y" & td$Temp >= 20 & td$Temp <= 29, ])
Faster anova of regression models
We could hack stats:::anova.lmlist
so that it works for lists produced by .lm.fit
(notice the dot) and RcppArmadillo::fastLm
. Please check stats:::anova.lmlist
, if I didn't delete stuff you need!
anova2 <- function(object, ...) {
objects <- list(object, ...)
ns <- vapply(objects, function(x) length(x$residuals), 0L)
stopifnot(!any(ns != ns[1L]))
resdf <- vapply(objects, df.residual, 0L)
resdev <- vapply(objects, function(x) crossprod(residuals(x)), 0)
bigmodel <- order(resdf)[1L]
dfs <- c(NA, -diff(resdf))[bigmodel]
ssq <- c(NA, -diff(resdev))[bigmodel]
df.scale <- resdf[bigmodel]
scale <- resdev[bigmodel]/resdf[bigmodel]
fstat <- (ssq/dfs)/scale
p <- pf(fstat, abs(dfs), df.scale, lower.tail=FALSE)
return(c(F=fstat, p=p))
}
These wrappers make .lm.fit
and RcppArmadillo::fastLm
a little more convenient:
lm_fun <- function(fo, dat) {
X <- model.matrix(fo, dat)
fit <- .lm.fit(X, dat[[all.vars(fo)[1]]])
fit$df.residual <- dim(X)[1] - dim(X)[2]
return(fit)
}
fastLm_fun <- function(fo, dat) {
fit <- RcppArmadillo::fastLm(model.matrix(fo, dat), dat[[all.vars(fo)[1]]])
return(fit)
}
Use it
fit1 <- lm_fun(y ~ x1 + x3, data)
fit2 <- lm_fun(y ~ x2 + x3 + x4, data)
anova2(fit1, fit2)
# F p
# 0.3609728 0.5481032
## Or use `fastLm_fun` here, but it's slower.
Compare
anova(lm(y ~ x1 + x3, data), lm(y ~ x2 + x3 + x4, data))
# Analysis of Variance Table
#
# Model 1: y ~ x1 + x3
# Model 2: y ~ x2 + x3 + x4
# Res.Df RSS Df Sum of Sq F Pr(>F)
# 1 997 1003.7
# 2 996 1003.4 1 0.36365 0.361 0.5481
lm_fun
(which outperforms fastLm_fun
) combined with anova2
appears to be around 60% faster than the conventional approach:
microbenchmark::microbenchmark(
conventional={
fit1 <- lm(y ~ x1 + x3, data)
fit2 <- lm(y ~ x2 + x3 + x4, data)
anova(fit1, fit2)
},
anova2={ ## using `.lm.fit`
fit1 <- lm_fun(y ~ x1 + x3, data)
fit2 <- lm_fun(y ~ x2 + x3 + x4, data)
anova2(fit1, fit2)
},
anova2_Fast={ ## using `RcppArmadillo::fastLm`
fit1 <- fastLm_fun(y ~ x1 + x3, data)
fit2 <- fastLm_fun(y ~ x2 + x3 + x4, data)
anova2(fit1, fit2)
},
anova_Donald={
fit1 <- lm_fun(y ~ x1 + x3, data)
fit2 <- lm_fun(y ~ x2 + x3 + x4, data)
anova_Donald(fit1, fit2)
},
times=1000L,
control=list(warmup=100L)
)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# conventional 2.885718 2.967053 3.205947 3.018668 3.090954 40.15720 1000 d
# anova2 1.180683 1.218121 1.285131 1.233335 1.267833 23.81955 1000 a
# anova2_Fast 1.961897 2.012756 2.179458 2.037854 2.087893 26.65279 1000 c
# anova_Donald 1.368699 1.409198 1.561751 1.430562 1.472148 33.12247 1000 b
Data:
set.seed(42)
data <- data.frame(y=rnorm(1000), x1=rnorm(1000), x2=rnorm(1000), x3=rnorm(1000),
x4=rnorm(1000))
dplyr version of grouping a dataframe then creating regression model on each group
Returning a list from dplyr
is not possible yet. If you just need the intercept and slope @jazzurro 's answer is the way, but if you need the whole model you need to do something like
library(dplyr)
models <- df %>% group_by(country) %>% do(mod = lm(BirthRate ~ US., data = .))
Then if you want to perform ANOVA on each fitted model, you can do it using rowwise
models %>% rowwise %>% do(anova(.$mod))
but again the result is coerced to a data frame and is not quite the same as doing lapply(models$mod, anova)
.
For now (ie until the next version of dplyr
) if you need to store the whole result in a list, you can just use dlply
from plyr
, like plyr::dlply(df, "country", function(d) anova(lm(BirthRate ~ US., data = d)))
, or of course if you do not absolutely have to use dplyr
you can go for @SvenHohenstein 's answer which looks like a better way of doing this anyway.
Related Topics
Multiple Boxplots Using Ggplot
Print String and Variable Contents on the Same Line in R
Diagnosing R Package Build Warning: "Latex Errors When Creating PDF Version"
Reading Hdf Files into R and Converting Them to Geotiff Rasters
Align Violin Plots with Dodged Box Plots
Collapsing/Hiding Figures in R Markdown
Daily Time Series with Ts.. How to Specify Start and End
How to Compute Correlations Between All Columns in R and Detect Highly Correlated Variables
Names of R's Available Packages
Control Number of Decimal Places on Xtable Output in R
Sources on S4 Objects, Methods and Programming in R
How to Automatically Include All 2-Way Interactions in a Glm Model in R
Marking Specific Tiles in Geom_Tile()/Geom_Raster()
Create Convex Hull Polygon from Points and Save as Shapefile