Fitting Several Regression Models With Dplyr

Fitting several regression models with dplyr

As of mid 2020 (and updated to fit dplyr 1.0+ as of 2022-04), tchakravarty's answer will fail. In order to circumvent the new approach of broom and dpylr seem to interact, the following combination of broom::tidy, broom::augment and broom::glance can be used. We just have to use them in conjunvtion with nest_by() and summarize() (previously inside do() and later unnest() the tibble).

library(dplyr)
library(broom)
library(tidyr)

set.seed(42)
df.h = data.frame(
hour = factor(rep(1:24, each = 21)),
price = runif(504, min = -10, max = 125),
wind = runif(504, min = 0, max = 2500),
temp = runif(504, min = - 10, max = 25)
)

df.h %>%
nest_by(hour) %>%
mutate(mod = list(lm(price ~ wind + temp, data = data))) %>%
summarize(tidy(mod))
# # A tibble: 72 × 6
# # Groups: hour [24]
# hour term estimate std.error statistic p.value
# <fct> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 1 (Intercept) 87.4 15.8 5.55 0.0000289
# 2 1 wind -0.0129 0.0120 -1.08 0.296
# 3 1 temp 0.588 0.849 0.693 0.497
# 4 2 (Intercept) 92.3 21.6 4.27 0.000466
# 5 2 wind -0.0227 0.0134 -1.69 0.107
# 6 2 temp -0.216 0.841 -0.257 0.800
# 7 3 (Intercept) 61.1 18.6 3.29 0.00409
# 8 3 wind 0.00471 0.0128 0.367 0.718
# 9 3 temp 0.425 0.964 0.442 0.664
# 10 4 (Intercept) 31.6 15.3 2.07 0.0529

df.h %>%
nest_by(hour) %>%
mutate(mod = list(lm(price ~ wind + temp, data = data))) %>%
summarize(augment(mod))
# # A tibble: 504 × 10
# # Groups: hour [24]
# hour price wind temp .fitted .resid .hat .sigma .cooksd .std.resid
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 113. 288. -1.75 82.7 30.8 0.123 37.8 0.0359 0.877
# 2 1 117. 2234. 18.4 69.5 47.0 0.201 36.4 0.165 1.40
# 3 1 28.6 1438. 4.75 71.7 -43.1 0.0539 37.1 0.0265 -1.18
# 4 1 102. 366. 9.77 88.5 13.7 0.151 38.4 0.00926 0.395
# 5 1 76.6 2257. -4.69 55.6 21.0 0.245 38.2 0.0448 0.644
# 6 1 60.1 633. -3.18 77.4 -17.3 0.0876 38.4 0.00749 -0.484
# 7 1 89.4 376. -4.16 80.1 9.31 0.119 38.5 0.00314 0.264
# 8 1 8.18 1921. 19.2 74.0 -65.9 0.173 34.4 0.261 -1.93
# 9 1 78.7 575. -6.11 76.4 2.26 0.111 38.6 0.000170 0.0640
# 10 1 85.2 763. -0.618 77.2 7.94 0.0679 38.6 0.00117 0.219
# # … with 494 more rows

df.h %>%
nest_by(hour) %>%
mutate(mod = list(lm(price ~ wind + temp, data = data))) %>%
summarize(glance(mod))
# # A tibble: 24 × 13
# # Groups: hour [24]
# hour r.squared adj.r.squared sigma statistic p.value df logLik AIC
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 0.0679 -0.0357 37.5 0.655 0.531 2 -104. 217.
# 2 2 0.139 0.0431 42.7 1.45 0.261 2 -107. 222.
# 3 3 0.0142 -0.0953 43.1 0.130 0.879 2 -107. 222.
# 4 4 0.0737 -0.0293 36.7 0.716 0.502 2 -104. 216.
# 5 5 0.213 0.126 37.8 2.44 0.115 2 -104. 217.
# 6 6 0.0813 -0.0208 33.5 0.796 0.466 2 -102. 212.
# 7 7 0.0607 -0.0437 40.7 0.582 0.569 2 -106. 220.
# 8 8 0.153 0.0592 36.3 1.63 0.224 2 -104. 215.
# 9 9 0.166 0.0736 36.5 1.79 0.195 2 -104. 216.
# 10 10 0.110 0.0108 40.0 1.11 0.351 2 -106. 219.
# # … with 14 more rows, and 4 more variables: BIC <dbl>, deviance <dbl>,
# # df.residual <int>, nobs <int>

Credits to Bob Muenchen's Blog for inspiration on that.

Fitting several regression models by changing only one independent variable within mutate()

I'm not sure if this will be considered more direct/elegant but here is my solution that does not use a double map:

library(tidyverse)
library(broom)

gen_model_expr <- function(var) {
form = paste("dep ~ cont_a + cont_b +", var)
tidy(lm(as.formula(form), data = df))
}

grep("cov_", names(df), value = TRUE) %>%
map(gen_model_expr)

Output (Note that it does not retain the data column):

[[1]]
# A tibble: 4 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.472 0.0812 5.81 0.0000000799
2 cont_a -0.103 0.0983 -1.05 0.296
3 cont_b 0.172 0.0990 1.74 0.0848
4 cov_a -0.0455 0.0581 -0.783 0.436

[[2]]
# A tibble: 4 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.415 0.0787 5.27 0.000000846
2 cont_a -0.0874 0.0984 -0.888 0.377
3 cont_b 0.181 0.0980 1.84 0.0682
4 cov_b 0.0482 0.0576 0.837 0.405

EDIT

In the interest of speed performance (inspired by @TimTeaFan), a benchmark test comparing the different ways to retrieve the covariate names is shown below. grep("cov_", names(df), value = TRUE) is the fastest

# A tibble: 4 x 13
expression min median `itr/sec` mem_alloc
<bch:expr> <bch:> <bch:> <dbl> <bch:byt>
1 names(df)[grepl("cov_", names(df))] 7.59µs 8.4µs 101975. 0B
2 grep("cov_", colnames(df), value = TRUE) 8.21µs 8.96µs 103142. 0B
3 grep("cov_", names(df), value = TRUE) 6.96µs 7.43µs 128694. 0B
4 df %>% select(starts_with("cov_")) %>% colnames 1.17ms 1.33ms 636. 5.39KB

How to run many regressions across rows and columns with vectorization

Using tidyr::(un)nest to perform the regressions by groups and a helper function this could be achieved like so:

library(dplyr)
library(broom)
library(tidyr)
library(purrr)

vec <- names(mtcars)[3:9]

lm_help <- function(vec) {
mtcars %>%
tidyr::nest(data = -cyl) %>%
mutate(con = vec,
fit = purrr::map(data, lm, formula = as.formula(paste0("mpg ~ disp + ", vec))),
tidy = purrr::map(fit, tidy)) %>%
select(cyl, con, tidy) %>%
tidyr::unnest(tidy) %>%
filter(term == "disp")
}

purrr::map(vec, lm_help) %>%
bind_rows()
#> # A tibble: 21 x 7
#> cyl con term estimate std.error statistic p.value
#> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 6 disp disp 0.00361 0.0156 0.232 0.826
#> 2 4 disp disp -0.135 0.0332 -4.07 0.00278
#> 3 8 disp disp -0.0196 0.00932 -2.11 0.0568
#> 4 6 hp disp 0.00180 0.0202 0.0890 0.933
#> 5 4 hp disp -0.120 0.0369 -3.24 0.0120
#> 6 8 hp disp -0.0186 0.00946 -1.97 0.0746
#> 7 6 drat disp 0.0224 0.0292 0.770 0.484
#> 8 4 drat disp -0.133 0.0406 -3.27 0.0114
#> 9 8 drat disp -0.0196 0.00977 -2.01 0.0697
#> 10 6 wt disp 0.0191 0.0109 1.75 0.154
#> # ... with 11 more rows

Running multiple nested regression models in tidyverse workflow

EDITED

Here's an option using nested map calls and avoiding having to unnest the data.

library(dplyr)
library(purrr)
library(broom)

# named vector so we can distinguish list results
formulae <- c(bivariate = mpg ~ cyl,
wcontrol = mpg ~ cyl + gear + wt)

map(formulae, function (y)
mtcars %>%
split(.$vs) %>%
map(~ lm(y, data = .x)) %>%
map(~ broom::tidy(.)))

Per your update this goes straight to platting the models

library(dplyr)
library(ggplot2)
library(dotwhisker)

map(formulae, function (y)
mtcars %>%
split(.$am) %>%
purrr::map(~ lm(y, data = .x)) %>%
dwplot() %>%
relabel_predictors(c(wt = "Weight", cyl = "Cylinders", gear = "Gears")) +
theme_bw() + xlab("Coefficient") + ylab("") +
geom_vline(xintercept = 0, colour = "grey60", linetype = 2) +
ggtitle(paste("The model is", deparse(y, width.cutoff = 100), collapse="")) +
scale_colour_grey(start = .4, end = .8,
name = "Transmission",
breaks = c("Model 0", "Model 1"),
labels = c("Automatic", "Manual"))
)
#> $bivariate

Sample Image

#> 
#> $wcontrol

Sample Image

Multi-step regression with broom and dplyr in R

One option to achieve your desired result would be to add the residuals as a new column to your dataframe after the first stage regression:

library(tidyverse)
library(broom)

# Nest the data frame
df_nested <- df %>%
group_by(group) %>%
nest()

# Run first stage regression and retrieve residuals
df_fit <- df_nested %>%
mutate(
fit1 = map(data, ~ lm(x ~ z1 + z2, data = .x)),
resids = map(fit1, residuals),
data = map2(data, resids, ~ bind_cols(.x, resids = .y))
)

# Run second stage with residuals as control variable
df_fit %>%
mutate(
fit2 = map(data, ~ tidy(lm(y ~ x + z2 + resids, data = .x)))
) %>%
unnest(fit2)
#> # A tibble: 16 × 9
#> # Groups: group [4]
#> group data fit1 resids term estimate std.error statistic p.value
#> <dbl> <list> <list> <list> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 1 <tibble [2… <lm> <dbl [… (Inter… 0.402 0.524 0.767 0.451
#> 2 1 <tibble [2… <lm> <dbl [… x 0.0836 0.912 0.0917 0.928
#> 3 1 <tibble [2… <lm> <dbl [… z2 0.161 0.250 0.644 0.527
#> 4 1 <tibble [2… <lm> <dbl [… resids -0.0536 0.942 -0.0569 0.955
#> 5 2 <tibble [2… <lm> <dbl [… (Inter… 0.977 0.273 3.58 0.00175
#> 6 2 <tibble [2… <lm> <dbl [… x -0.561 0.459 -1.22 0.235
#> 7 2 <tibble [2… <lm> <dbl [… z2 -0.351 0.192 -1.82 0.0826
#> 8 2 <tibble [2… <lm> <dbl [… resids 0.721 0.507 1.42 0.170
#> 9 3 <tibble [2… <lm> <dbl [… (Inter… -0.710 1.19 -0.598 0.556
#> 10 3 <tibble [2… <lm> <dbl [… x 3.61 3.80 0.951 0.352
#> 11 3 <tibble [2… <lm> <dbl [… z2 -1.21 1.19 -1.01 0.323
#> 12 3 <tibble [2… <lm> <dbl [… resids -3.67 3.80 -0.964 0.346
#> 13 4 <tibble [2… <lm> <dbl [… (Inter… 59.6 40.1 1.49 0.152
#> 14 4 <tibble [2… <lm> <dbl [… x -83.4 56.5 -1.48 0.155
#> 15 4 <tibble [2… <lm> <dbl [… z2 -18.7 12.8 -1.45 0.160
#> 16 4 <tibble [2… <lm> <dbl [… resids 83.4 56.5 1.48 0.155

Grouped regression with dplyr using different formulas

This needs map2 instead of map as the data column from nest is also a list of data.frame, and thus we need to loop over the corresponding elements of 'fm' list and data (map2 does that)

library(tidyr)
library(purrr)
library(dplyr)
library(broom)
out <- dt %>%
nest(data = -t) %>%
mutate(
fit = map2(fm, data, ~lm(.x, data = .y)),
tfit = map(fit, tidy))

-output

> out
# A tibble: 2 × 4
t data fit tfit
<dbl> <list> <list> <list>
1 1 <tibble [50 × 4]> <lm> <tibble [2 × 5]>
2 2 <tibble [50 × 4]> <lm> <tibble [2 × 5]>

> bind_rows(out$tfit)
# A tibble: 4 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.0860 0.128 0.670 0.506
2 x1 0.262 0.119 2.19 0.0331
3 (Intercept) -0.00285 0.152 -0.0187 0.985
4 x2 -0.115 0.154 -0.746 0.459

Or may also use

> imap_dfr(fm, ~ lm(.x, data = dt %>% 
filter(t == .y)) %>%
tidy)
# A tibble: 4 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.0860 0.128 0.670 0.506
2 x1 0.262 0.119 2.19 0.0331
3 (Intercept) -0.00285 0.152 -0.0187 0.985
4 x2 -0.115 0.154 -0.746 0.459

If we want to have all the combinations of 'fm' for each level of 't', then use crossing

dt %>% 
nest(data = -t) %>%
crossing(fm) %>%
mutate(fit = map2(fm, data, ~ lm(.x, data = .y)),
tfit = map(fit, tidy))

-output

# A tibble: 4 × 5
t data fm fit tfit
<dbl> <list> <list> <list> <list>
1 1 <tibble [50 × 4]> <formula> <lm> <tibble [2 × 5]>
2 1 <tibble [50 × 4]> <formula> <lm> <tibble [2 × 5]>
3 2 <tibble [50 × 4]> <formula> <lm> <tibble [2 × 5]>
4 2 <tibble [50 × 4]> <formula> <lm> <tibble [2 × 5]>

dplyr get linear regression coefficients

Here is a combination of tidyverse and broom package to get your desired output.

Very handy here is group_split -> you get a list and then you iterate with purrrs map_dfr (by the way with map_dfr you get a dataframe otherwise with map you get a list) your regression lm(... through each list element. Using brooms glance gives the desired output:

library(tidyverse)
library(broom)

mydata %>%
pivot_longer(starts_with("Site"),
names_to = ".value",
names_pattern = "(^Site)") %>%
mutate(Site=as.factor(Site)) %>%
group_by(Site) %>%
group_split() %>%
map_dfr(.f = function(df){
lm(Outcome ~ Age+Gender, data=df) %>%
glance() %>%
add_column(Site = unique(df$Site), .before = 1)
})
  Site  r.squared adj.r.squared    sigma statistic  p.value    df logLik    AIC     BIC deviance df.residual  nobs
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 1 0.6 0.44 3.87e- 1 3.75e+ 0 1.01e- 1 2 -1.88 11.8 12.1 7.5 e- 1 5 8
2 2 1 1 2.22e-16 1.01e+31 2.22e-16 2 141. -275. -277. 4.93e-32 1 4
3 3 0.351 -0.946 6.97e- 1 2.71e- 1 8.05e- 1 2 -1.46 10.9 8.47 4.86e- 1 1 4

dplyr version of grouping a dataframe then creating regression model on each group

Returning a list from dplyr is not possible yet. If you just need the intercept and slope @jazzurro 's answer is the way, but if you need the whole model you need to do something like

library(dplyr)
models <- df %>% group_by(country) %>% do(mod = lm(BirthRate ~ US., data = .))

Then if you want to perform ANOVA on each fitted model, you can do it using rowwise

models %>% rowwise %>% do(anova(.$mod))

but again the result is coerced to a data frame and is not quite the same as doing lapply(models$mod, anova).

For now (ie until the next version of dplyr) if you need to store the whole result in a list, you can just use dlply from plyr, like plyr::dlply(df, "country", function(d) anova(lm(BirthRate ~ US., data = d))), or of course if you do not absolutely have to use dplyr you can go for @SvenHohenstein 's answer which looks like a better way of doing this anyway.



Related Topics



Leave a reply



Submit