Running Multiple Linear Regressions Across Several Columns of a Data Frame in R

Running multiple linear regressions across several columns of a data frame in R

Let's get some tidyverse power here, along with broom, and forgo all these loops...

First I'll make a dummy table:

df <- data.frame(
g = runif(50),
pair = sample(x = c("A", "B", "C"), size = 50, replace = TRUE),
V1 = runif(50),
V2 = runif(50),
V3 = runif(50),
V4 = runif(50),
V5 = runif(50),
stringsAsFactors = FALSE
)

That's approximately what your data structure looks like. Now onto the meat of the code:

library(tidyverse)
library(broom)

df %>%
as_tibble %>%
gather(key = "column", value = "value", V1:V5) %>% # first set the data in long format
nest(g, value) %>% # now nest the dependent and independent factors
mutate(model = map(data, ~lm(g ~ value, data = .))) %>% # fit the model using purrr
mutate(tidy_model = map(model, tidy)) %>% # clean the model output with broom
select(-data, -model) %>% # remove the "untidy" parts
unnest() # get it back in a recognizable data frame

Which gives us the following:

# A tibble: 30 x 7
pair column term estimate std.error statistic p.value
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 C V1 (Intercept) 0.470 0.142 3.31 0.00561
2 C V1 value 0.125 0.265 0.472 0.645
3 B V1 (Intercept) 0.489 0.142 3.45 0.00359
4 B V1 value -0.0438 0.289 -0.151 0.882
5 A V1 (Intercept) 0.515 0.111 4.63 0.000279
6 A V1 value -0.00569 0.249 -0.0229 0.982
7 C V2 (Intercept) 0.367 0.147 2.50 0.0265
8 C V2 value 0.377 0.300 1.26 0.231
9 B V2 (Intercept) 0.462 0.179 2.59 0.0206
10 B V2 value 0.0175 0.322 0.0545 0.957
# … with 20 more rows

yep, that's a beaut! Note that I used lm(g ~ value) instead of lm(value ~ g) as this is what your text description alluded to.

How to perform linear regression for multiple columns and get a dataframe output with: regression equation and r squared value?

Something like this:

library(tidyverse)
library(broom)
df1 %>%
pivot_longer(
cols = starts_with("X")
) %>%
mutate(name = factor(name)) %>%
group_by(name) %>%
group_split() %>%
map_dfr(.f = function(df){
lm(LH27_20822244_U_Stationary ~ value, data = df) %>%
glance() %>%
# tidy() %>%
add_column(name = unique(df$name), .before=1)
})

Using tidy()

  name             term        estimate std.error statistic p.value
<fct> <chr> <dbl> <dbl> <dbl> <dbl>
1 X20676887_X2LH_S (Intercept) 12.8 2.28 5.62 0.00494
2 X20676887_X2LH_S value 0.393 0.0855 4.59 0.0101
3 X20819831_11LH_S (Intercept) 10.4 3.72 2.79 0.0495
4 X20819831_11LH_S value 0.492 0.142 3.47 0.0256
5 X20822214_X4LH_S (Intercept) 10.5 3.30 3.20 0.0329
6 X20822214_X4LH_S value 0.485 0.126 3.86 0.0182

Using glance()

  name          r.squared adj.r.squared  sigma statistic p.value    df logLik   AIC   BIC deviance df.residual  nobs
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 X20676887_X2~ 0.841 0.801 0.0350 21.1 0.0101 1 12.8 -19.6 -20.3 0.00490 4 6
2 X20819831_11~ 0.751 0.688 0.0438 12.0 0.0256 1 11.5 -17.0 -17.6 0.00766 4 6
3 X20822214_X4~ 0.788 0.735 0.0403 14.9 0.0182 1 12.0 -17.9 -18.6 0.00651 4 6

Running multiple, simple linear regressions from dataframe in R

Here's one solution using combn

 combn(names(DF), 2, function(x){lm(DF[, x])}, simplify = FALSE)

Example:

set.seed(1)
DF <- data.frame(A=rnorm(50, 100, 3),
B=rnorm(50, 100, 3),
C=rnorm(50, 100, 3),
D=rnorm(50, 100, 3),
E=rnorm(50, 100, 3))

Updated: adding @Henrik suggestion (see comments)

# only the coefficients
> results <- combn(names(DF), 2, function(x){coefficients(lm(DF[, x]))}, simplify = FALSE)
> vars <- combn(names(DF), 2)
> names(results) <- vars[1 , ] # adding names to identify variables in the reggression
> results
$A
(Intercept) B
103.66739418 -0.03354243

$A
(Intercept) C
97.88341555 0.02429041

$A
(Intercept) D
122.7606103 -0.2240759

$A
(Intercept) E
99.26387487 0.01038445

$B
(Intercept) C
99.971253525 0.003824755

$B
(Intercept) D
102.65399702 -0.02296721

$B
(Intercept) E
96.83042199 0.03524868

$C
(Intercept) D
80.1872211 0.1931079

$C
(Intercept) E
89.0503893 0.1050202

$D
(Intercept) E
107.84384655 -0.07620397

Running several linear regressions from a single dataframe in R

Just to add an alternative, I would propose going down this route:

library(reshape2)
library(dplyr)
library(broom)

df <- melt(data.frame(x = 1962:2014,
y1 = rnorm(53),
y2 = rnorm(53),
y3 = rnorm(53)),
id.vars = "x")

df %>% group_by(variable) %>% do(tidy(lm(value ~ x, data=.)))

Here, I just melt the data so that all relevant columns are given by groups of rows, to be able to use dplyr's grouped actions. This gives the following dataframe as output:

Source: local data frame [6 x 6]
Groups: variable [3]

variable term estimate std.error statistic p.value
(fctr) (chr) (dbl) (dbl) (dbl) (dbl)
1 y1 (Intercept) -3.646666114 18.988154862 -0.1920495 0.8484661
2 y1 x 0.001891627 0.009551103 0.1980533 0.8437907
3 y2 (Intercept) -8.939784046 16.206935047 -0.5516024 0.5836297
4 y2 x 0.004545156 0.008152140 0.5575415 0.5795966
5 y3 (Intercept) 21.699503502 16.785586452 1.2927462 0.2019249
6 y3 x -0.010879271 0.008443204 -1.2885240 0.2033785

This is a pretty convenient form to continue working with the coefficients. All that is required is to melt the dataframe so that all columns are rows in the dataset, and then to use dplyr's group_by to carry out the regression in all subsets. broom::tidy puts the regression output into a nice dataframe. See ?broom for more information.

In case you need to keep the models to do adjustments of some sort (which are implemented for lm objects), then you can also do the following:

df %>% group_by(variable) %>% do(mod = lm(value ~ x, data=.))

Source: local data frame [3 x 2]
Groups: <by row>

# A tibble: 3 x 2
variable mod
* <fctr> <list>
1 y1 <S3: lm>
2 y2 <S3: lm>
3 y3 <S3: lm>

Here, for each variable, the lm object is stored in the dataframe. So, if you want to get the model output for the first, you can just access it as you would access any normal dataframe, e.g.

tmp <- df %>% group_by(variable) %>% do(mod = lm(value ~ x, data=.))
tmp[tmp$variable == "y1",]$mod
[[1]]

Call:
lm(formula = value ~ x, data = .)

Coefficients:
(Intercept) x
-1.807255 0.001019

This is convenient if you want to apply some methods to all lm objects since you can use the fact that tmp$mod gives you a list of them, which makes it easy to pass to e.g. lapply.

R code for linear model grouped by multiple columns

This fits a "many models" pattern, where we want to split our data into groups, apply a model to each group, and then extract the results of that model for a summary.

In the tidyverse + tidymodels world, that's accomplished by combining:

  1. tidyr::nest(), which create columns of dataframes for each group
  2. purrr::map(), which allows us to apply a function to each element of a list/column.
  3. broom::tidy(), which constructs a more usable dataframe of model output
  4. tidyr::unnest(), which takes a dataframe and transforms into simple columns.

We have a couple of extra steps with your data specifically to handle areas where the values are all NA, and the linear model would normally throw errors back.

Code looks like:

# Wrap lm, since it throws errors
lmw <- function(data, formula) {
tryCatch(lm(data = data, formula = formula), error = function(cond) NULL)
}

library(tidyverse)
library(broom)
df %>%
nest(data = c(GHGEventID, SmpTime, TrtIDs, ugN2O, PK)) %>%
mutate(lm_model = purrr::map(data,
~ broom::tidy(lmw(data = .x, ugN2O ~ SmpTime)))) %>%
unnest(lm_model, keep_empty = TRUE) %>%
filter(term != "(Intercept)" | is.na(estimate)) %>%
mutate(LinMod = 60*24*10000/(0.144032*1000000)* estimate)

Output looks like:

# A tibble: 8 x 10
SmpDate Plot Ch data term estimate std.error statistic p.value LinMod
<fct> <int> <int> <list<df[,5]>> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2019-06-04 101 1 [4 × 5] SmpTime 0.00105 0.00561 0.187 0.869 0.105
2 2019-06-04 101 2 [4 × 5] SmpTime -0.00164 0.00680 -0.241 0.832 -0.164
3 2019-06-04 202 1 [4 × 5] NA NA NA NA NA NA
4 2019-06-04 202 2 [4 × 5] NA NA NA NA NA NA
5 2019-06-19 101 1 [4 × 5] SmpTime 0.0189 0.00335 5.63 0.0302 1.89
6 2019-06-19 101 2 [4 × 5] SmpTime 0.00726 0.00778 0.933 0.449 0.726
7 2019-06-19 202 1 [4 × 5] SmpTime 0.00691 0.00565 1.22 0.346 0.690
8 2019-06-19 202 2 [4 × 5] SmpTime 0.0238 0.00508 4.70 0.0424 2.38

More on this pattern, especially using broom and the tidymodels approach here: summary dataframe from several multiple regression outputs. (And there are non tidymodels approaches too, as commenters have already added.)



Related Topics



Leave a reply



Submit