Fitting Several Regression Models With Dplyr

Fitting several regression models with dplyr

As of mid 2020 (and updated to fit dplyr 1.0+ as of 2022-04), tchakravarty's answer will fail. In order to circumvent the new approach of broom and dpylr seem to interact, the following combination of broom::tidy, broom::augment and broom::glance can be used. We just have to use them in conjunvtion with nest_by() and summarize() (previously inside do() and later unnest() the tibble).

library(dplyr)
library(broom)
library(tidyr)

set.seed(42)
df.h = data.frame( 
  hour     = factor(rep(1:24, each = 21)),
  price    = runif(504, min = -10, max = 125),
  wind     = runif(504, min = 0, max = 2500),
  temp     = runif(504, min = - 10, max = 25)  
)

df.h %>%
  nest_by(hour) %>%
  mutate(mod = list(lm(price ~ wind + temp, data = data))) %>%
  summarize(tidy(mod))
# # A tibble: 72 × 6
# # Groups:   hour [24]
#    hour  term        estimate std.error statistic   p.value
#    <fct> <chr>          <dbl>     <dbl>     <dbl>     <dbl>
# 1  1     (Intercept) 87.4       15.8        5.55  0.0000289
# 2  1     wind        -0.0129     0.0120    -1.08  0.296    
# 3  1     temp         0.588      0.849      0.693 0.497    
# 4  2     (Intercept) 92.3       21.6        4.27  0.000466 
# 5  2     wind        -0.0227     0.0134    -1.69  0.107    
# 6  2     temp        -0.216      0.841     -0.257 0.800    
# 7  3     (Intercept) 61.1       18.6        3.29  0.00409  
# 8  3     wind         0.00471    0.0128     0.367 0.718    
# 9  3     temp         0.425      0.964      0.442 0.664    
# 10 4     (Intercept) 31.6       15.3        2.07  0.0529   

df.h %>% 
  nest_by(hour) %>%
  mutate(mod = list(lm(price ~ wind + temp, data = data))) %>% 
  summarize(augment(mod))
# # A tibble: 504 × 10
# # Groups:   hour [24]
#    hour   price  wind   temp .fitted .resid   .hat .sigma  .cooksd .std.resid
#    <fct>  <dbl> <dbl>  <dbl>   <dbl>  <dbl>  <dbl>  <dbl>    <dbl>      <dbl>
#  1 1     113.    288. -1.75     82.7  30.8  0.123    37.8 0.0359       0.877 
#  2 1     117.   2234. 18.4      69.5  47.0  0.201    36.4 0.165        1.40  
#  3 1      28.6  1438.  4.75     71.7 -43.1  0.0539   37.1 0.0265      -1.18  
#  4 1     102.    366.  9.77     88.5  13.7  0.151    38.4 0.00926      0.395 
#  5 1      76.6  2257. -4.69     55.6  21.0  0.245    38.2 0.0448       0.644 
#  6 1      60.1   633. -3.18     77.4 -17.3  0.0876   38.4 0.00749     -0.484 
#  7 1      89.4   376. -4.16     80.1   9.31 0.119    38.5 0.00314      0.264 
#  8 1       8.18 1921. 19.2      74.0 -65.9  0.173    34.4 0.261       -1.93  
#  9 1      78.7   575. -6.11     76.4   2.26 0.111    38.6 0.000170     0.0640
# 10 1      85.2   763. -0.618    77.2   7.94 0.0679   38.6 0.00117      0.219 
# # … with 494 more rows

df.h %>% 
  nest_by(hour) %>%
  mutate(mod = list(lm(price ~ wind + temp, data = data))) %>% 
  summarize(glance(mod))
# # A tibble: 24 × 13
# # Groups:   hour [24]
#    hour  r.squared adj.r.squared sigma statistic p.value    df logLik   AIC
#    <fct>     <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl>
#  1 1        0.0679       -0.0357  37.5     0.655   0.531     2  -104.  217.
#  2 2        0.139         0.0431  42.7     1.45    0.261     2  -107.  222.
#  3 3        0.0142       -0.0953  43.1     0.130   0.879     2  -107.  222.
#  4 4        0.0737       -0.0293  36.7     0.716   0.502     2  -104.  216.
#  5 5        0.213         0.126   37.8     2.44    0.115     2  -104.  217.
#  6 6        0.0813       -0.0208  33.5     0.796   0.466     2  -102.  212.
#  7 7        0.0607       -0.0437  40.7     0.582   0.569     2  -106.  220.
#  8 8        0.153         0.0592  36.3     1.63    0.224     2  -104.  215.
#  9 9        0.166         0.0736  36.5     1.79    0.195     2  -104.  216.
# 10 10       0.110         0.0108  40.0     1.11    0.351     2  -106.  219.
# # … with 14 more rows, and 4 more variables: BIC <dbl>, deviance <dbl>,
# #   df.residual <int>, nobs <int>

Credits to Bob Muenchen's Blog for inspiration on that.

Fitting several regression models by changing only one independent variable within mutate()

I'm not sure if this will be considered more direct/elegant but here is my solution that does not use a double map:

library(tidyverse)
library(broom)

gen_model_expr <- function(var) {
  form = paste("dep ~ cont_a + cont_b +", var)
  tidy(lm(as.formula(form), data = df))
}

grep("cov_", names(df), value = TRUE) %>%
  map(gen_model_expr)

Output (Note that it does not retain the data column):

[[1]]
# A tibble: 4 x 5
  term        estimate std.error statistic      p.value
  <chr>          <dbl>     <dbl>     <dbl>        <dbl>
1 (Intercept)   0.472     0.0812     5.81  0.0000000799
2 cont_a       -0.103     0.0983    -1.05  0.296       
3 cont_b        0.172     0.0990     1.74  0.0848      
4 cov_a        -0.0455    0.0581    -0.783 0.436       

[[2]]
# A tibble: 4 x 5
  term        estimate std.error statistic     p.value
  <chr>          <dbl>     <dbl>     <dbl>       <dbl>
1 (Intercept)   0.415     0.0787     5.27  0.000000846
2 cont_a       -0.0874    0.0984    -0.888 0.377      
3 cont_b        0.181     0.0980     1.84  0.0682     
4 cov_b         0.0482    0.0576     0.837 0.405

EDIT

In the interest of speed performance (inspired by @TimTeaFan), a benchmark test comparing the different ways to retrieve the covariate names is shown below. grep("cov_", names(df), value = TRUE) is the fastest

# A tibble: 4 x 13
  expression                                         min median `itr/sec` mem_alloc
  <bch:expr>                                      <bch:> <bch:>     <dbl> <bch:byt>
1 names(df)[grepl("cov_", names(df))]             7.59µs  8.4µs   101975.        0B
2 grep("cov_", colnames(df), value = TRUE)        8.21µs 8.96µs   103142.        0B
3 grep("cov_", names(df), value = TRUE)           6.96µs 7.43µs   128694.        0B
4 df %>% select(starts_with("cov_")) %>% colnames 1.17ms 1.33ms      636.    5.39KB

How to run many regressions across rows and columns with vectorization

Using tidyr::(un)nest to perform the regressions by groups and a helper function this could be achieved like so:

library(dplyr)
library(broom)
library(tidyr)
library(purrr)

vec <- names(mtcars)[3:9]

lm_help <- function(vec) {
  mtcars %>% 
    tidyr::nest(data = -cyl) %>% 
    mutate(con = vec,
           fit = purrr::map(data, lm, formula = as.formula(paste0("mpg ~ disp + ", vec))),
           tidy = purrr::map(fit, tidy)) %>% 
    select(cyl, con, tidy) %>% 
    tidyr::unnest(tidy) %>% 
    filter(term == "disp")
}

purrr::map(vec, lm_help) %>% 
  bind_rows()
#> # A tibble: 21 x 7
#>      cyl con   term  estimate std.error statistic p.value
#>    <dbl> <chr> <chr>    <dbl>     <dbl>     <dbl>   <dbl>
#>  1     6 disp  disp   0.00361   0.0156     0.232  0.826  
#>  2     4 disp  disp  -0.135     0.0332    -4.07   0.00278
#>  3     8 disp  disp  -0.0196    0.00932   -2.11   0.0568 
#>  4     6 hp    disp   0.00180   0.0202     0.0890 0.933  
#>  5     4 hp    disp  -0.120     0.0369    -3.24   0.0120 
#>  6     8 hp    disp  -0.0186    0.00946   -1.97   0.0746 
#>  7     6 drat  disp   0.0224    0.0292     0.770  0.484  
#>  8     4 drat  disp  -0.133     0.0406    -3.27   0.0114 
#>  9     8 drat  disp  -0.0196    0.00977   -2.01   0.0697 
#> 10     6 wt    disp   0.0191    0.0109     1.75   0.154  
#> # ... with 11 more rows

Running multiple nested regression models in tidyverse workflow

EDITED

Here's an option using nested map calls and avoiding having to unnest the data.

library(dplyr)
library(purrr)
library(broom)

# named vector so we can distinguish list results
formulae <- c(bivariate = mpg ~ cyl, 
              wcontrol = mpg ~ cyl + gear + wt)

map(formulae, function (y) 
  mtcars %>%
  split(.$vs) %>%
  map(~ lm(y, data = .x)) %>%
  map(~ broom::tidy(.)))

Per your update this goes straight to platting the models

library(dplyr)
library(ggplot2)
library(dotwhisker)

map(formulae, function (y) 
  mtcars %>%
    split(.$am) %>%
    purrr::map(~ lm(y, data = .x)) %>%
    dwplot() %>%
    relabel_predictors(c(wt = "Weight", cyl = "Cylinders", gear = "Gears")) +
    theme_bw() + xlab("Coefficient") + ylab("") +
    geom_vline(xintercept = 0, colour = "grey60", linetype = 2) +
    ggtitle(paste("The model is", deparse(y, width.cutoff = 100), collapse=""))  +
    scale_colour_grey(start = .4, end = .8,
                      name = "Transmission",
                      breaks = c("Model 0", "Model 1"),
                      labels = c("Automatic", "Manual"))
)
#> $bivariate

Sample Image

#> 
#> $wcontrol

Sample Image

Multi-step regression with broom and dplyr in R

One option to achieve your desired result would be to add the residuals as a new column to your dataframe after the first stage regression:

library(tidyverse)
library(broom)

# Nest the data frame
df_nested <- df %>% 
  group_by(group) %>% 
  nest()

# Run first stage regression and retrieve residuals
df_fit <- df_nested %>% 
  mutate(
    fit1 = map(data, ~ lm(x ~ z1 + z2, data = .x)),
    resids = map(fit1, residuals),
    data = map2(data, resids, ~ bind_cols(.x, resids = .y))
  )

# Run second stage with residuals as control variable
df_fit %>% 
  mutate(
    fit2 = map(data, ~ tidy(lm(y ~ x + z2 + resids, data = .x)))
  ) %>% 
  unnest(fit2)
#> # A tibble: 16 × 9
#> # Groups:   group [4]
#>    group data        fit1   resids  term    estimate std.error statistic p.value
#>    <dbl> <list>      <list> <list>  <chr>      <dbl>     <dbl>     <dbl>   <dbl>
#>  1     1 <tibble [2… <lm>   <dbl [… (Inter…   0.402      0.524    0.767  0.451  
#>  2     1 <tibble [2… <lm>   <dbl [… x         0.0836     0.912    0.0917 0.928  
#>  3     1 <tibble [2… <lm>   <dbl [… z2        0.161      0.250    0.644  0.527  
#>  4     1 <tibble [2… <lm>   <dbl [… resids   -0.0536     0.942   -0.0569 0.955  
#>  5     2 <tibble [2… <lm>   <dbl [… (Inter…   0.977      0.273    3.58   0.00175
#>  6     2 <tibble [2… <lm>   <dbl [… x        -0.561      0.459   -1.22   0.235  
#>  7     2 <tibble [2… <lm>   <dbl [… z2       -0.351      0.192   -1.82   0.0826 
#>  8     2 <tibble [2… <lm>   <dbl [… resids    0.721      0.507    1.42   0.170  
#>  9     3 <tibble [2… <lm>   <dbl [… (Inter…  -0.710      1.19    -0.598  0.556  
#> 10     3 <tibble [2… <lm>   <dbl [… x         3.61       3.80     0.951  0.352  
#> 11     3 <tibble [2… <lm>   <dbl [… z2       -1.21       1.19    -1.01   0.323  
#> 12     3 <tibble [2… <lm>   <dbl [… resids   -3.67       3.80    -0.964  0.346  
#> 13     4 <tibble [2… <lm>   <dbl [… (Inter…  59.6       40.1      1.49   0.152  
#> 14     4 <tibble [2… <lm>   <dbl [… x       -83.4       56.5     -1.48   0.155  
#> 15     4 <tibble [2… <lm>   <dbl [… z2      -18.7       12.8     -1.45   0.160  
#> 16     4 <tibble [2… <lm>   <dbl [… resids   83.4       56.5      1.48   0.155

Grouped regression with dplyr using different formulas

This needs map2 instead of map as the data column from nest is also a list of data.frame, and thus we need to loop over the corresponding elements of 'fm' list and data (map2 does that)

library(tidyr)
library(purrr)
library(dplyr)
library(broom)
out <- dt %>%
  nest(data = -t)  %>%
  mutate(
    fit = map2(fm, data, ~lm(.x, data = .y)), 
    tfit = map(fit, tidy))

-output

> out
# A tibble: 2 × 4
      t data              fit    tfit            
  <dbl> <list>            <list> <list>          
1     1 <tibble [50 × 4]> <lm>   <tibble [2 × 5]>
2     2 <tibble [50 × 4]> <lm>   <tibble [2 × 5]>

> bind_rows(out$tfit)
# A tibble: 4 × 5
  term        estimate std.error statistic p.value
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)  0.0860      0.128    0.670   0.506 
2 x1           0.262       0.119    2.19    0.0331
3 (Intercept) -0.00285     0.152   -0.0187  0.985 
4 x2          -0.115       0.154   -0.746   0.459

Or may also use

> imap_dfr(fm, ~ lm(.x, data = dt %>% 
                          filter(t == .y)) %>% 
               tidy)
# A tibble: 4 × 5
  term        estimate std.error statistic p.value
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)  0.0860      0.128    0.670   0.506 
2 x1           0.262       0.119    2.19    0.0331
3 (Intercept) -0.00285     0.152   -0.0187  0.985 
4 x2          -0.115       0.154   -0.746   0.459

If we want to have all the combinations of 'fm' for each level of 't', then use crossing

dt %>% 
   nest(data = -t) %>% 
   crossing(fm) %>% 
   mutate(fit = map2(fm, data, ~ lm(.x, data = .y)),
        tfit = map(fit, tidy))

-output

# A tibble: 4 × 5
      t data              fm        fit    tfit            
  <dbl> <list>            <list>    <list> <list>          
1     1 <tibble [50 × 4]> <formula> <lm>   <tibble [2 × 5]>
2     1 <tibble [50 × 4]> <formula> <lm>   <tibble [2 × 5]>
3     2 <tibble [50 × 4]> <formula> <lm>   <tibble [2 × 5]>
4     2 <tibble [50 × 4]> <formula> <lm>   <tibble [2 × 5]>

dplyr get linear regression coefficients

Here is a combination of tidyverse and broom package to get your desired output.

Very handy here is group_split -> you get a list and then you iterate with purrrs map_dfr (by the way with map_dfr you get a dataframe otherwise with map you get a list) your regression lm(... through each list element. Using brooms glance gives the desired output:

library(tidyverse)
library(broom)

mydata %>%
  pivot_longer(starts_with("Site"),
               names_to = ".value",
               names_pattern = "(^Site)") %>% 
  mutate(Site=as.factor(Site)) %>% 
  group_by(Site) %>% 
  group_split() %>% 
  map_dfr(.f = function(df){
    lm(Outcome ~ Age+Gender, data=df) %>% 
      glance() %>% 
      add_column(Site = unique(df$Site), .before = 1)
  })

  Site  r.squared adj.r.squared    sigma statistic  p.value    df logLik    AIC     BIC deviance df.residual  nobs
  <fct>     <dbl>         <dbl>    <dbl>     <dbl>    <dbl> <dbl>  <dbl>  <dbl>   <dbl>    <dbl>       <int> <int>
1 1         0.6           0.44  3.87e- 1  3.75e+ 0 1.01e- 1     2  -1.88   11.8   12.1  7.5 e- 1           5     8
2 2         1             1     2.22e-16  1.01e+31 2.22e-16     2 141.   -275.  -277.   4.93e-32           1     4
3 3         0.351        -0.946 6.97e- 1  2.71e- 1 8.05e- 1     2  -1.46   10.9    8.47 4.86e- 1           1     4

dplyr version of grouping a dataframe then creating regression model on each group

Returning a list from dplyr is not possible yet. If you just need the intercept and slope @jazzurro 's answer is the way, but if you need the whole model you need to do something like

library(dplyr)
models <- df %>% group_by(country) %>% do(mod = lm(BirthRate ~ US., data = .))

Then if you want to perform ANOVA on each fitted model, you can do it using rowwise

models %>% rowwise %>% do(anova(.$mod))

but again the result is coerced to a data frame and is not quite the same as doing lapply(models$mod, anova).

For now (ie until the next version of dplyr) if you need to store the whole result in a list, you can just use dlply from plyr, like plyr::dlply(df, "country", function(d) anova(lm(BirthRate ~ US., data = d))), or of course if you do not absolutely have to use dplyr you can go for @SvenHohenstein 's answer which looks like a better way of doing this anyway.

Fitting Several Regression Models With Dplyr