Fitting several regression models with dplyr
As of mid 2020 (and updated to fit dplyr
1.0+ as of 2022-04), tchakravarty's answer will fail. In order to circumvent the new approach of broom
and dpylr
seem to interact, the following combination of broom::tidy
, broom::augment
and broom::glance
can be used. We just have to use them in conjunvtion with nest_by()
and summarize()
(previously inside do()
and later unnest()
the tibble).
library(dplyr)
library(broom)
library(tidyr)
set.seed(42)
df.h = data.frame(
hour = factor(rep(1:24, each = 21)),
price = runif(504, min = -10, max = 125),
wind = runif(504, min = 0, max = 2500),
temp = runif(504, min = - 10, max = 25)
)
df.h %>%
nest_by(hour) %>%
mutate(mod = list(lm(price ~ wind + temp, data = data))) %>%
summarize(tidy(mod))
# # A tibble: 72 × 6
# # Groups: hour [24]
# hour term estimate std.error statistic p.value
# <fct> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 1 (Intercept) 87.4 15.8 5.55 0.0000289
# 2 1 wind -0.0129 0.0120 -1.08 0.296
# 3 1 temp 0.588 0.849 0.693 0.497
# 4 2 (Intercept) 92.3 21.6 4.27 0.000466
# 5 2 wind -0.0227 0.0134 -1.69 0.107
# 6 2 temp -0.216 0.841 -0.257 0.800
# 7 3 (Intercept) 61.1 18.6 3.29 0.00409
# 8 3 wind 0.00471 0.0128 0.367 0.718
# 9 3 temp 0.425 0.964 0.442 0.664
# 10 4 (Intercept) 31.6 15.3 2.07 0.0529
df.h %>%
nest_by(hour) %>%
mutate(mod = list(lm(price ~ wind + temp, data = data))) %>%
summarize(augment(mod))
# # A tibble: 504 × 10
# # Groups: hour [24]
# hour price wind temp .fitted .resid .hat .sigma .cooksd .std.resid
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 113. 288. -1.75 82.7 30.8 0.123 37.8 0.0359 0.877
# 2 1 117. 2234. 18.4 69.5 47.0 0.201 36.4 0.165 1.40
# 3 1 28.6 1438. 4.75 71.7 -43.1 0.0539 37.1 0.0265 -1.18
# 4 1 102. 366. 9.77 88.5 13.7 0.151 38.4 0.00926 0.395
# 5 1 76.6 2257. -4.69 55.6 21.0 0.245 38.2 0.0448 0.644
# 6 1 60.1 633. -3.18 77.4 -17.3 0.0876 38.4 0.00749 -0.484
# 7 1 89.4 376. -4.16 80.1 9.31 0.119 38.5 0.00314 0.264
# 8 1 8.18 1921. 19.2 74.0 -65.9 0.173 34.4 0.261 -1.93
# 9 1 78.7 575. -6.11 76.4 2.26 0.111 38.6 0.000170 0.0640
# 10 1 85.2 763. -0.618 77.2 7.94 0.0679 38.6 0.00117 0.219
# # … with 494 more rows
df.h %>%
nest_by(hour) %>%
mutate(mod = list(lm(price ~ wind + temp, data = data))) %>%
summarize(glance(mod))
# # A tibble: 24 × 13
# # Groups: hour [24]
# hour r.squared adj.r.squared sigma statistic p.value df logLik AIC
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 0.0679 -0.0357 37.5 0.655 0.531 2 -104. 217.
# 2 2 0.139 0.0431 42.7 1.45 0.261 2 -107. 222.
# 3 3 0.0142 -0.0953 43.1 0.130 0.879 2 -107. 222.
# 4 4 0.0737 -0.0293 36.7 0.716 0.502 2 -104. 216.
# 5 5 0.213 0.126 37.8 2.44 0.115 2 -104. 217.
# 6 6 0.0813 -0.0208 33.5 0.796 0.466 2 -102. 212.
# 7 7 0.0607 -0.0437 40.7 0.582 0.569 2 -106. 220.
# 8 8 0.153 0.0592 36.3 1.63 0.224 2 -104. 215.
# 9 9 0.166 0.0736 36.5 1.79 0.195 2 -104. 216.
# 10 10 0.110 0.0108 40.0 1.11 0.351 2 -106. 219.
# # … with 14 more rows, and 4 more variables: BIC <dbl>, deviance <dbl>,
# # df.residual <int>, nobs <int>
Credits to Bob Muenchen's Blog for inspiration on that.
Fitting several regression models by changing only one independent variable within mutate()
I'm not sure if this will be considered more direct/elegant but here is my solution that does not use a double map
:
library(tidyverse)
library(broom)
gen_model_expr <- function(var) {
form = paste("dep ~ cont_a + cont_b +", var)
tidy(lm(as.formula(form), data = df))
}
grep("cov_", names(df), value = TRUE) %>%
map(gen_model_expr)
Output (Note that it does not retain the data column):
[[1]]
# A tibble: 4 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.472 0.0812 5.81 0.0000000799
2 cont_a -0.103 0.0983 -1.05 0.296
3 cont_b 0.172 0.0990 1.74 0.0848
4 cov_a -0.0455 0.0581 -0.783 0.436
[[2]]
# A tibble: 4 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.415 0.0787 5.27 0.000000846
2 cont_a -0.0874 0.0984 -0.888 0.377
3 cont_b 0.181 0.0980 1.84 0.0682
4 cov_b 0.0482 0.0576 0.837 0.405
EDIT
In the interest of speed performance (inspired by @TimTeaFan), a benchmark test comparing the different ways to retrieve the covariate names is shown below. grep("cov_", names(df), value = TRUE)
is the fastest
# A tibble: 4 x 13
expression min median `itr/sec` mem_alloc
<bch:expr> <bch:> <bch:> <dbl> <bch:byt>
1 names(df)[grepl("cov_", names(df))] 7.59µs 8.4µs 101975. 0B
2 grep("cov_", colnames(df), value = TRUE) 8.21µs 8.96µs 103142. 0B
3 grep("cov_", names(df), value = TRUE) 6.96µs 7.43µs 128694. 0B
4 df %>% select(starts_with("cov_")) %>% colnames 1.17ms 1.33ms 636. 5.39KB
How to run many regressions across rows and columns with vectorization
Using tidyr::(un)nest
to perform the regressions by groups and a helper function this could be achieved like so:
library(dplyr)
library(broom)
library(tidyr)
library(purrr)
vec <- names(mtcars)[3:9]
lm_help <- function(vec) {
mtcars %>%
tidyr::nest(data = -cyl) %>%
mutate(con = vec,
fit = purrr::map(data, lm, formula = as.formula(paste0("mpg ~ disp + ", vec))),
tidy = purrr::map(fit, tidy)) %>%
select(cyl, con, tidy) %>%
tidyr::unnest(tidy) %>%
filter(term == "disp")
}
purrr::map(vec, lm_help) %>%
bind_rows()
#> # A tibble: 21 x 7
#> cyl con term estimate std.error statistic p.value
#> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 6 disp disp 0.00361 0.0156 0.232 0.826
#> 2 4 disp disp -0.135 0.0332 -4.07 0.00278
#> 3 8 disp disp -0.0196 0.00932 -2.11 0.0568
#> 4 6 hp disp 0.00180 0.0202 0.0890 0.933
#> 5 4 hp disp -0.120 0.0369 -3.24 0.0120
#> 6 8 hp disp -0.0186 0.00946 -1.97 0.0746
#> 7 6 drat disp 0.0224 0.0292 0.770 0.484
#> 8 4 drat disp -0.133 0.0406 -3.27 0.0114
#> 9 8 drat disp -0.0196 0.00977 -2.01 0.0697
#> 10 6 wt disp 0.0191 0.0109 1.75 0.154
#> # ... with 11 more rows
Running multiple nested regression models in tidyverse workflow
EDITED
Here's an option using nested map
calls and avoiding having to unnest the data.
library(dplyr)
library(purrr)
library(broom)
# named vector so we can distinguish list results
formulae <- c(bivariate = mpg ~ cyl,
wcontrol = mpg ~ cyl + gear + wt)
map(formulae, function (y)
mtcars %>%
split(.$vs) %>%
map(~ lm(y, data = .x)) %>%
map(~ broom::tidy(.)))
Per your update this goes straight to platting the models
library(dplyr)
library(ggplot2)
library(dotwhisker)
map(formulae, function (y)
mtcars %>%
split(.$am) %>%
purrr::map(~ lm(y, data = .x)) %>%
dwplot() %>%
relabel_predictors(c(wt = "Weight", cyl = "Cylinders", gear = "Gears")) +
theme_bw() + xlab("Coefficient") + ylab("") +
geom_vline(xintercept = 0, colour = "grey60", linetype = 2) +
ggtitle(paste("The model is", deparse(y, width.cutoff = 100), collapse="")) +
scale_colour_grey(start = .4, end = .8,
name = "Transmission",
breaks = c("Model 0", "Model 1"),
labels = c("Automatic", "Manual"))
)
#> $bivariate
#>
#> $wcontrol
Multi-step regression with broom and dplyr in R
One option to achieve your desired result would be to add the residuals as a new column to your dataframe after the first stage regression:
library(tidyverse)
library(broom)
# Nest the data frame
df_nested <- df %>%
group_by(group) %>%
nest()
# Run first stage regression and retrieve residuals
df_fit <- df_nested %>%
mutate(
fit1 = map(data, ~ lm(x ~ z1 + z2, data = .x)),
resids = map(fit1, residuals),
data = map2(data, resids, ~ bind_cols(.x, resids = .y))
)
# Run second stage with residuals as control variable
df_fit %>%
mutate(
fit2 = map(data, ~ tidy(lm(y ~ x + z2 + resids, data = .x)))
) %>%
unnest(fit2)
#> # A tibble: 16 × 9
#> # Groups: group [4]
#> group data fit1 resids term estimate std.error statistic p.value
#> <dbl> <list> <list> <list> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 1 <tibble [2… <lm> <dbl [… (Inter… 0.402 0.524 0.767 0.451
#> 2 1 <tibble [2… <lm> <dbl [… x 0.0836 0.912 0.0917 0.928
#> 3 1 <tibble [2… <lm> <dbl [… z2 0.161 0.250 0.644 0.527
#> 4 1 <tibble [2… <lm> <dbl [… resids -0.0536 0.942 -0.0569 0.955
#> 5 2 <tibble [2… <lm> <dbl [… (Inter… 0.977 0.273 3.58 0.00175
#> 6 2 <tibble [2… <lm> <dbl [… x -0.561 0.459 -1.22 0.235
#> 7 2 <tibble [2… <lm> <dbl [… z2 -0.351 0.192 -1.82 0.0826
#> 8 2 <tibble [2… <lm> <dbl [… resids 0.721 0.507 1.42 0.170
#> 9 3 <tibble [2… <lm> <dbl [… (Inter… -0.710 1.19 -0.598 0.556
#> 10 3 <tibble [2… <lm> <dbl [… x 3.61 3.80 0.951 0.352
#> 11 3 <tibble [2… <lm> <dbl [… z2 -1.21 1.19 -1.01 0.323
#> 12 3 <tibble [2… <lm> <dbl [… resids -3.67 3.80 -0.964 0.346
#> 13 4 <tibble [2… <lm> <dbl [… (Inter… 59.6 40.1 1.49 0.152
#> 14 4 <tibble [2… <lm> <dbl [… x -83.4 56.5 -1.48 0.155
#> 15 4 <tibble [2… <lm> <dbl [… z2 -18.7 12.8 -1.45 0.160
#> 16 4 <tibble [2… <lm> <dbl [… resids 83.4 56.5 1.48 0.155
Grouped regression with dplyr using different formulas
This needs map2
instead of map
as the data
column from nest
is also a list
of data.frame
, and thus we need to loop over the corresponding elements of 'fm' list
and data
(map2
does that)
library(tidyr)
library(purrr)
library(dplyr)
library(broom)
out <- dt %>%
nest(data = -t) %>%
mutate(
fit = map2(fm, data, ~lm(.x, data = .y)),
tfit = map(fit, tidy))
-output
> out
# A tibble: 2 × 4
t data fit tfit
<dbl> <list> <list> <list>
1 1 <tibble [50 × 4]> <lm> <tibble [2 × 5]>
2 2 <tibble [50 × 4]> <lm> <tibble [2 × 5]>
> bind_rows(out$tfit)
# A tibble: 4 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.0860 0.128 0.670 0.506
2 x1 0.262 0.119 2.19 0.0331
3 (Intercept) -0.00285 0.152 -0.0187 0.985
4 x2 -0.115 0.154 -0.746 0.459
Or may also use
> imap_dfr(fm, ~ lm(.x, data = dt %>%
filter(t == .y)) %>%
tidy)
# A tibble: 4 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.0860 0.128 0.670 0.506
2 x1 0.262 0.119 2.19 0.0331
3 (Intercept) -0.00285 0.152 -0.0187 0.985
4 x2 -0.115 0.154 -0.746 0.459
If we want to have all the combinations of 'fm' for each level of 't', then use crossing
dt %>%
nest(data = -t) %>%
crossing(fm) %>%
mutate(fit = map2(fm, data, ~ lm(.x, data = .y)),
tfit = map(fit, tidy))
-output
# A tibble: 4 × 5
t data fm fit tfit
<dbl> <list> <list> <list> <list>
1 1 <tibble [50 × 4]> <formula> <lm> <tibble [2 × 5]>
2 1 <tibble [50 × 4]> <formula> <lm> <tibble [2 × 5]>
3 2 <tibble [50 × 4]> <formula> <lm> <tibble [2 × 5]>
4 2 <tibble [50 × 4]> <formula> <lm> <tibble [2 × 5]>
dplyr get linear regression coefficients
Here is a combination of tidyverse
and broom
package to get your desired output.
Very handy here is group_split
-> you get a list and then you iterate with purrr
s map_dfr
(by the way with map_dfr
you get a dataframe otherwise with map
you get a list) your regression lm(...
through each list element. Using broom
s glance gives the desired output:
library(tidyverse)
library(broom)
mydata %>%
pivot_longer(starts_with("Site"),
names_to = ".value",
names_pattern = "(^Site)") %>%
mutate(Site=as.factor(Site)) %>%
group_by(Site) %>%
group_split() %>%
map_dfr(.f = function(df){
lm(Outcome ~ Age+Gender, data=df) %>%
glance() %>%
add_column(Site = unique(df$Site), .before = 1)
})
Site r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 1 0.6 0.44 3.87e- 1 3.75e+ 0 1.01e- 1 2 -1.88 11.8 12.1 7.5 e- 1 5 8
2 2 1 1 2.22e-16 1.01e+31 2.22e-16 2 141. -275. -277. 4.93e-32 1 4
3 3 0.351 -0.946 6.97e- 1 2.71e- 1 8.05e- 1 2 -1.46 10.9 8.47 4.86e- 1 1 4
dplyr version of grouping a dataframe then creating regression model on each group
Returning a list from dplyr
is not possible yet. If you just need the intercept and slope @jazzurro 's answer is the way, but if you need the whole model you need to do something like
library(dplyr)
models <- df %>% group_by(country) %>% do(mod = lm(BirthRate ~ US., data = .))
Then if you want to perform ANOVA on each fitted model, you can do it using rowwise
models %>% rowwise %>% do(anova(.$mod))
but again the result is coerced to a data frame and is not quite the same as doing lapply(models$mod, anova)
.
For now (ie until the next version of dplyr
) if you need to store the whole result in a list, you can just use dlply
from plyr
, like plyr::dlply(df, "country", function(d) anova(lm(BirthRate ~ US., data = d)))
, or of course if you do not absolutely have to use dplyr
you can go for @SvenHohenstein 's answer which looks like a better way of doing this anyway.
Related Topics
Limit Ggplot2 Axes Without Removing Data (Outside Limits): Zoom
Rename Multiple Columns by Names
Expand Rows by Date Range Using Start and End Date
Extracting the Last N Characters from a String in R
Calculate the Mean For Each Column of a Matrix in R
Create a Variable Name With "Paste" in R
Why Do R Objects Not Print in a Function or a "For" Loop
Chopping a String into a Vector of Fixed Width Character Elements
Split Date-Time Column into Date and Time Variables
How to Use an Image as a Point in Ggplot
Levels≪-'( What Sorcery Is This
How to Use Grep()/Gsub() to Find Exact Match
How to Use Facets With a Dual Y-Axis Ggplot
How to Merge Color, Line Style and Shape Legends in Ggplot
Getting Warning: " 'Newdata' Had 1 Row But Variables Found Have 32 Rows" on Predict.Lm