Running multiple linear regressions across several columns of a data frame in R
Let's get some tidyverse
power here, along with broom
, and forgo all these loops...
First I'll make a dummy table:
df <- data.frame(
g = runif(50),
pair = sample(x = c("A", "B", "C"), size = 50, replace = TRUE),
V1 = runif(50),
V2 = runif(50),
V3 = runif(50),
V4 = runif(50),
V5 = runif(50),
stringsAsFactors = FALSE
)
That's approximately what your data structure looks like. Now onto the meat of the code:
library(tidyverse)
library(broom)
df %>%
as_tibble %>%
gather(key = "column", value = "value", V1:V5) %>% # first set the data in long format
nest(g, value) %>% # now nest the dependent and independent factors
mutate(model = map(data, ~lm(g ~ value, data = .))) %>% # fit the model using purrr
mutate(tidy_model = map(model, tidy)) %>% # clean the model output with broom
select(-data, -model) %>% # remove the "untidy" parts
unnest() # get it back in a recognizable data frame
Which gives us the following:
# A tibble: 30 x 7
pair column term estimate std.error statistic p.value
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 C V1 (Intercept) 0.470 0.142 3.31 0.00561
2 C V1 value 0.125 0.265 0.472 0.645
3 B V1 (Intercept) 0.489 0.142 3.45 0.00359
4 B V1 value -0.0438 0.289 -0.151 0.882
5 A V1 (Intercept) 0.515 0.111 4.63 0.000279
6 A V1 value -0.00569 0.249 -0.0229 0.982
7 C V2 (Intercept) 0.367 0.147 2.50 0.0265
8 C V2 value 0.377 0.300 1.26 0.231
9 B V2 (Intercept) 0.462 0.179 2.59 0.0206
10 B V2 value 0.0175 0.322 0.0545 0.957
# … with 20 more rows
yep, that's a beaut! Note that I used lm(g ~ value)
instead of lm(value ~ g)
as this is what your text description alluded to.
How to perform linear regression for multiple columns and get a dataframe output with: regression equation and r squared value?
Something like this:
library(tidyverse)
library(broom)
df1 %>%
pivot_longer(
cols = starts_with("X")
) %>%
mutate(name = factor(name)) %>%
group_by(name) %>%
group_split() %>%
map_dfr(.f = function(df){
lm(LH27_20822244_U_Stationary ~ value, data = df) %>%
glance() %>%
# tidy() %>%
add_column(name = unique(df$name), .before=1)
})
Using tidy()
name term estimate std.error statistic p.value
<fct> <chr> <dbl> <dbl> <dbl> <dbl>
1 X20676887_X2LH_S (Intercept) 12.8 2.28 5.62 0.00494
2 X20676887_X2LH_S value 0.393 0.0855 4.59 0.0101
3 X20819831_11LH_S (Intercept) 10.4 3.72 2.79 0.0495
4 X20819831_11LH_S value 0.492 0.142 3.47 0.0256
5 X20822214_X4LH_S (Intercept) 10.5 3.30 3.20 0.0329
6 X20822214_X4LH_S value 0.485 0.126 3.86 0.0182
Using glance()
name r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 X20676887_X2~ 0.841 0.801 0.0350 21.1 0.0101 1 12.8 -19.6 -20.3 0.00490 4 6
2 X20819831_11~ 0.751 0.688 0.0438 12.0 0.0256 1 11.5 -17.0 -17.6 0.00766 4 6
3 X20822214_X4~ 0.788 0.735 0.0403 14.9 0.0182 1 12.0 -17.9 -18.6 0.00651 4 6
Running multiple, simple linear regressions from dataframe in R
Here's one solution using combn
combn(names(DF), 2, function(x){lm(DF[, x])}, simplify = FALSE)
Example:
set.seed(1)
DF <- data.frame(A=rnorm(50, 100, 3),
B=rnorm(50, 100, 3),
C=rnorm(50, 100, 3),
D=rnorm(50, 100, 3),
E=rnorm(50, 100, 3))
Updated: adding @Henrik suggestion (see comments)
# only the coefficients
> results <- combn(names(DF), 2, function(x){coefficients(lm(DF[, x]))}, simplify = FALSE)
> vars <- combn(names(DF), 2)
> names(results) <- vars[1 , ] # adding names to identify variables in the reggression
> results
$A
(Intercept) B
103.66739418 -0.03354243
$A
(Intercept) C
97.88341555 0.02429041
$A
(Intercept) D
122.7606103 -0.2240759
$A
(Intercept) E
99.26387487 0.01038445
$B
(Intercept) C
99.971253525 0.003824755
$B
(Intercept) D
102.65399702 -0.02296721
$B
(Intercept) E
96.83042199 0.03524868
$C
(Intercept) D
80.1872211 0.1931079
$C
(Intercept) E
89.0503893 0.1050202
$D
(Intercept) E
107.84384655 -0.07620397
Running several linear regressions from a single dataframe in R
Just to add an alternative, I would propose going down this route:
library(reshape2)
library(dplyr)
library(broom)
df <- melt(data.frame(x = 1962:2014,
y1 = rnorm(53),
y2 = rnorm(53),
y3 = rnorm(53)),
id.vars = "x")
df %>% group_by(variable) %>% do(tidy(lm(value ~ x, data=.)))
Here, I just melt the data so that all relevant columns are given by groups of rows, to be able to use dplyr
's grouped actions. This gives the following dataframe as output:
Source: local data frame [6 x 6]
Groups: variable [3]
variable term estimate std.error statistic p.value
(fctr) (chr) (dbl) (dbl) (dbl) (dbl)
1 y1 (Intercept) -3.646666114 18.988154862 -0.1920495 0.8484661
2 y1 x 0.001891627 0.009551103 0.1980533 0.8437907
3 y2 (Intercept) -8.939784046 16.206935047 -0.5516024 0.5836297
4 y2 x 0.004545156 0.008152140 0.5575415 0.5795966
5 y3 (Intercept) 21.699503502 16.785586452 1.2927462 0.2019249
6 y3 x -0.010879271 0.008443204 -1.2885240 0.2033785
This is a pretty convenient form to continue working with the coefficients. All that is required is to melt the dataframe so that all columns are rows in the dataset, and then to use dplyr
's group_by
to carry out the regression in all subsets. broom::tidy
puts the regression output into a nice dataframe. See ?broom
for more information.
In case you need to keep the models to do adjustments of some sort (which are implemented for lm
objects), then you can also do the following:
df %>% group_by(variable) %>% do(mod = lm(value ~ x, data=.))
Source: local data frame [3 x 2]
Groups: <by row>
# A tibble: 3 x 2
variable mod
* <fctr> <list>
1 y1 <S3: lm>
2 y2 <S3: lm>
3 y3 <S3: lm>
Here, for each variable, the lm
object is stored in the dataframe. So, if you want to get the model output for the first, you can just access it as you would access any normal dataframe, e.g.
tmp <- df %>% group_by(variable) %>% do(mod = lm(value ~ x, data=.))
tmp[tmp$variable == "y1",]$mod
[[1]]
Call:
lm(formula = value ~ x, data = .)
Coefficients:
(Intercept) x
-1.807255 0.001019
This is convenient if you want to apply some methods to all lm
objects since you can use the fact that tmp$mod
gives you a list of them, which makes it easy to pass to e.g. lapply
.
R code for linear model grouped by multiple columns
This fits a "many models" pattern, where we want to split our data into groups, apply a model to each group, and then extract the results of that model for a summary.
In the tidyverse
+ tidymodels
world, that's accomplished by combining:
tidyr::nest()
, which create columns of dataframes for each grouppurrr::map()
, which allows us to apply a function to each element of a list/column.broom::tidy()
, which constructs a more usable dataframe of model outputtidyr::unnest()
, which takes a dataframe and transforms into simple columns.
We have a couple of extra steps with your data specifically to handle areas where the values are all NA, and the linear model would normally throw errors back.
Code looks like:
# Wrap lm, since it throws errors
lmw <- function(data, formula) {
tryCatch(lm(data = data, formula = formula), error = function(cond) NULL)
}
library(tidyverse)
library(broom)
df %>%
nest(data = c(GHGEventID, SmpTime, TrtIDs, ugN2O, PK)) %>%
mutate(lm_model = purrr::map(data,
~ broom::tidy(lmw(data = .x, ugN2O ~ SmpTime)))) %>%
unnest(lm_model, keep_empty = TRUE) %>%
filter(term != "(Intercept)" | is.na(estimate)) %>%
mutate(LinMod = 60*24*10000/(0.144032*1000000)* estimate)
Output looks like:
# A tibble: 8 x 10
SmpDate Plot Ch data term estimate std.error statistic p.value LinMod
<fct> <int> <int> <list<df[,5]>> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2019-06-04 101 1 [4 × 5] SmpTime 0.00105 0.00561 0.187 0.869 0.105
2 2019-06-04 101 2 [4 × 5] SmpTime -0.00164 0.00680 -0.241 0.832 -0.164
3 2019-06-04 202 1 [4 × 5] NA NA NA NA NA NA
4 2019-06-04 202 2 [4 × 5] NA NA NA NA NA NA
5 2019-06-19 101 1 [4 × 5] SmpTime 0.0189 0.00335 5.63 0.0302 1.89
6 2019-06-19 101 2 [4 × 5] SmpTime 0.00726 0.00778 0.933 0.449 0.726
7 2019-06-19 202 1 [4 × 5] SmpTime 0.00691 0.00565 1.22 0.346 0.690
8 2019-06-19 202 2 [4 × 5] SmpTime 0.0238 0.00508 4.70 0.0424 2.38
More on this pattern, especially using broom
and the tidymodels
approach here: summary dataframe from several multiple regression outputs. (And there are non tidymodels
approaches too, as commenters have already added.)
Related Topics
R Sequence of Dates with Lubridate
Adding Custom Image to Geom_Polygon Fill in Ggplot
Name Columns Within Aggregate in R
How to Save a Data Frame as CSV to a User Selected Location Using Tcltk
Controlling the 'Alpha' Level in a Ggplot2 Legend
Difference Between As.Data.Frame(X) and Data.Frame(X)
Avoid Rbind()/Cbind() Conversion from Numeric to Factor
Add Author Affiliation in R Markdown Beamer Presentation
Loop Over Rows of Dataframe Applying Function with If-Statement
How to Build a Dendrogram from a Directory Tree
Read Lines by Number from a Large File
What Is a Fast Way to Set Debugging Code at a Given Line in a Function
Knitr: Run All Chunks in an Rmarkdown Document
HTML with Multicolumn Table in Markdown Using Knitr
Plot a Legend and Well-Spaced Universal Y-Axis and Main Titles in Grid.Arrange
Make Dataframe of Top N Frequent Terms for Multiple Corpora Using Tm Package in R