Linear Regression and Group by in R

Linear Regression and group by in R

Here's one way using the lme4 package (plus lattice for the plot).

library(lme4)
library(lattice)  # xyplot() comes from lattice, not lme4

d <- data.frame(state = rep(c('NY', 'CA'), c(10, 10)),
                year = rep(1:10, 2),
                response = c(rnorm(10), rnorm(10)))

xyplot(response ~ year, groups = state, data = d, type = 'l')

fits <- lmList(response ~ year | state, data=d)
fits
#------------
Call: lmList(formula = response ~ year | state, data = d)

Coefficients:
   (Intercept)        year
CA -1.34420990  0.17139963
NY  0.00196176 -0.01852429

Degrees of freedom: 20 total; 16 residual
Residual standard error: 0.8201316
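
As a follow-up (not shown in the original answer), coef() on the lmList object returns the coefficient table above as a plain data frame, which is handy for further processing:

coef(fits)     # per-state intercepts and slopes as a data frame
summary(fits)  # per-coefficient summaries across groups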

Creating a linear regression model for each group in a column

You have some mistakes in the syntax of your functions. A function is usually defined as function(x), and the x is then substituted by the data you want to use it with.

For example, in the linear_model function you defined, if you were to use it alone you would write:

linear_model(data)

However, because you are using it inside lapply it is a bit harder to see. lapply() is just making a loop, applying the linear_model function to each of the data frames you obtain from split(table2, table2$LOCATION).

The same thing happens with my_predict.

Anyway, this should work for you:

linear_model <- function(x) lm(Education ~ TIME, x)

m <- lapply(split(table2, table2$LOCATION), linear_model)

new_df <- data.frame(TIME = c(2019))

my_predict <- function(x) predict(x, new_df)

sapply(m, my_predict)
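
To see the whole pattern run end to end, here is a minimal self-contained example with made-up data (the column names LOCATION, TIME and Education mirror the question; the numbers are invented purely for illustration):

set.seed(1)
table2 <- data.frame(LOCATION = rep(c("AUS", "BEL"), each = 5),
                     TIME = rep(2014:2018, 2),
                     Education = rnorm(10, mean = 30))

m <- lapply(split(table2, table2$LOCATION), linear_model)
sapply(m, my_predict)  # one 2019 prediction per LOCATION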

ANSWER TO THE EDIT

There are probably more efficient ways of looping the prediction, but here is my approach:

pred_data <- list()

for (i in 3:6) {
  linear_model <- function(x) lm(x[, i] ~ TIME, x)
  m <- lapply(split(tableLinR, tableLinR$LOCATION), linear_model)
  new_df <- data.frame(TIME = c(2020, 2021), row.names = c("2020", "2021"))
  my_predict <- function(x) predict(x, new_df)
  pred_data[[colnames(tableLinR)[i]]] <- sapply(m, my_predict)
}

library(reshape2)  # melt()
library(tidyr)     # pivot_wider()

pred_data <- melt(pred_data)
pred_data <- as.data.frame(pivot_wider(pred_data, names_from = L1, values_from = value))

First you create an empty list where the outputs of the loop will be saved. In for (i in 3:6) you put the interval of columns you want predictions for. The result pred_data is a list that you can transform into a data frame in different ways; with melt and pivot_wider you obtain a format similar to your original data.
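
One such tidier alternative (just a sketch, assuming the same tableLinR layout as above) is to loop over column names rather than indices, building each formula with reformulate():

cols <- colnames(tableLinR)[3:6]
pred_list <- lapply(setNames(cols, cols), function(col) {
  # reformulate("TIME", response = col) builds e.g. Education ~ TIME
  fits <- lapply(split(tableLinR, tableLinR$LOCATION),
                 function(x) lm(reformulate("TIME", response = col), data = x))
  sapply(fits, predict, newdata = data.frame(TIME = c(2020, 2021)))
})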

Running single linear regressions across multiple variables, in groups

If I understand correctly, you can use group_by and group_modify, with a map inside that iterates over the predictors.

If you can isolate your predictor variables in advance, it'll make it easier, as with ivs in this solution.

library(tidyverse)

ivs <- colnames(mtcars)[3:ncol(mtcars)]
names(ivs) <- ivs

mtcars %>%
  group_by(cyl) %>%
  group_modify(function(data, key) {
    map_df(ivs, function(iv) {
      frml <- as.formula(paste("mpg", "~", iv))
      lm(frml, data = data) %>% broom::glance()
    }, .id = "iv")
  }) %>%
  select(cyl, iv, r.squared, p.value)

# A tibble: 27 × 4
# Groups:   cyl [3]
     cyl iv    r.squared p.value
   <dbl> <chr>     <dbl>   <dbl>
 1     4 disp    0.648   0.00278
 2     4 hp      0.274   0.0984
 3     4 drat    0.180   0.193
 4     4 wt      0.509   0.0137
 5     4 qsec    0.0557  0.485
 6     4 vs      0.00238 0.887
 7     4 am      0.287   0.0892
 8     4 gear    0.115   0.308
 9     4 carb    0.0378  0.567
10     6 disp    0.0106  0.826
11     6 hp      0.0161  0.786
# ...
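
If you want per-term coefficients rather than model-level fit statistics, the same pipeline works with broom::tidy() swapped in for broom::glance() (a variant of the code above, not part of the original answer):

mtcars %>%
  group_by(cyl) %>%
  group_modify(function(data, key) {
    map_df(ivs, function(iv) {
      lm(as.formula(paste("mpg", "~", iv)), data = data) %>% broom::tidy()
    }, .id = "iv")
  })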

Multiple linear regression by group in a rolling window in R

Use group_modify together with rollapplyr and its by.column = FALSE argument, so that rsd is applied to all columns at once rather than one at a time.

Note that if you use width 3 with two predictors and an intercept, the residuals will necessarily all be zero (three parameters fit three points exactly), so we changed the width to 5.
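
The rsd helper comes from the question and isn't reproduced here; as an assumption consistent with the width note above, a minimal version would fit the regression on each window and return that window's residuals:

# assumed sketch of rsd: fit y ~ x1 + x2 on one rolling window, return residuals
rsd <- function(z) {
  z <- as.data.frame(z)  # by.column = FALSE delivers the window as a matrix
  resid(lm(y ~ x1 + x2, data = z))
}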

library(dplyr, exclude = c("lag", "filter"))
library(zoo)

width <- 5

df %>%
  group_by(Group) %>%
  group_modify(~ {
    cbind(., rollapplyr(.[c("y", "x1", "x2")], width, rsd, fill = NA,
                        by.column = FALSE))
  }) %>%
  ungroup

Is it enough to use group_by in order to run a grouped linear regression?

That won't do what you are hoping. lm() is quite old and knows nothing about the grouping added by newer libraries like dplyr, so group_by() alone has no effect on the model fit.

I believe you can accomplish what you want by adding the family name as an indicator to the model. Something like:

model <- lm(income_children ~ family_name + family_name:income_parents, data = df)

This will effectively create a mini-model per family: the first term gives an intercept per family, and the interaction term gives a per-family slope for income_parents.
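
As a small follow-up (a common variant, not part of the original answer), dropping the global intercept with - 1 makes the per-family coefficients directly readable, one intercept and one slope per family_name level:

model <- lm(income_children ~ family_name + family_name:income_parents - 1, data = df)
coef(model)  # family_name* terms are intercepts; family_name*:income_parents are slopes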

If you want to stick with the many models approach, you can use nest.

library(dplyr)
library(tidyr)  # nest()
library(purrr)  # map()

model_one <- function(data) {
  lm(income_children ~ income_parents, data = data)
}

models <- df %>%
  group_by(family_name) %>%
  nest() %>%
  mutate(model = map(data, model_one))

models
# # A tibble: 9 x 3
# # Groups:   family_name [9]
#   family_name data             model
#   <fct>       <list>           <list>
# 1 Miller      <tibble [1 × 2]> <lm>
# 2 Smith       <tibble [1 × 2]> <lm>
# 3 Clark       <tibble [1 × 2]> <lm>
# 4 Powell      <tibble [1 × 2]> <lm>
# 5 Brown       <tibble [1 × 2]> <lm>
# 6 Jone        <tibble [1 × 2]> <lm>
# 7 Garcia      <tibble [1 × 2]> <lm>
# 8 Williams    <tibble [1 × 2]> <lm>
# 9 Lopez       <tibble [1 × 2]> <lm>

You will notice that this output is not very useful yet. It can be summarized neatly with broom::glance and then unnested. The summaries are degenerate here because each model is fit to a single data point, as the NaN and Inf values below show.

summarized <- models %>%
  mutate(summary = map(model, broom::glance)) %>%
  unnest(summary)

# Drop the still-nested columns for display.
summarized %>% select(-data, -model)
#   family_name r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC deviance df.residual
#   <fct>           <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>  <dbl> <dbl> <dbl>    <dbl>       <int>
# 1 Miller              0             0   NaN        NA      NA     1    Inf  -Inf  -Inf        0           0
# 2 Smith               0             0   NaN        NA      NA     1    Inf  -Inf  -Inf        0           0
# 3 Clark               0             0   NaN        NA      NA     1    Inf  -Inf  -Inf        0           0
# 4 Powell              0             0   NaN        NA      NA     1    Inf  -Inf  -Inf        0           0
# 5 Brown               0             0   NaN        NA      NA     1    Inf  -Inf  -Inf        0           0
# 6 Jone                0             0   NaN        NA      NA     1    Inf  -Inf  -Inf        0           0
# 7 Garcia              0             0   NaN        NA      NA     1    Inf  -Inf  -Inf        0           0
# 8 Williams            0             0   NaN        NA      NA     1    Inf  -Inf  -Inf        0           0
# 9 Lopez               0             0   NaN        NA      NA     1    Inf  -Inf  -Inf        0           0

For more detail, I recommend the Many Models chapter of R for Data Science.

Fitting several regression models with dplyr

As of mid-2020 (and updated to fit dplyr 1.0+ as of 2022-04), tchakravarty's answer will fail. To work with the new way broom and dplyr interact, the following combination of broom::tidy, broom::augment and broom::glance can be used. We just have to use them in conjunction with nest_by() and summarize(), rather than the older pattern of wrapping them in do() and later unnest()-ing the tibble.

library(dplyr)
library(broom)
library(tidyr)

set.seed(42)
df.h <- data.frame(
  hour  = factor(rep(1:24, each = 21)),
  price = runif(504, min = -10, max = 125),
  wind  = runif(504, min = 0, max = 2500),
  temp  = runif(504, min = -10, max = 25)
)

df.h %>%
  nest_by(hour) %>%
  mutate(mod = list(lm(price ~ wind + temp, data = data))) %>%
  summarize(tidy(mod))
# # A tibble: 72 × 6
# # Groups:   hour [24]
#    hour  term        estimate std.error statistic   p.value
#    <fct> <chr>          <dbl>     <dbl>     <dbl>     <dbl>
#  1 1     (Intercept)  87.4      15.8        5.55  0.0000289
#  2 1     wind         -0.0129    0.0120    -1.08  0.296
#  3 1     temp          0.588     0.849      0.693 0.497
#  4 2     (Intercept)  92.3      21.6        4.27  0.000466
#  5 2     wind         -0.0227    0.0134    -1.69  0.107
#  6 2     temp         -0.216     0.841     -0.257 0.800
#  7 3     (Intercept)  61.1      18.6        3.29  0.00409
#  8 3     wind          0.00471   0.0128     0.367 0.718
#  9 3     temp          0.425     0.964      0.442 0.664
# 10 4     (Intercept)  31.6      15.3        2.07  0.0529

df.h %>%
  nest_by(hour) %>%
  mutate(mod = list(lm(price ~ wind + temp, data = data))) %>%
  summarize(augment(mod))
# # A tibble: 504 × 10
# # Groups:   hour [24]
#    hour  price  wind   temp .fitted .resid   .hat .sigma  .cooksd .std.resid
#    <fct> <dbl> <dbl>  <dbl>   <dbl>  <dbl>  <dbl>  <dbl>    <dbl>      <dbl>
#  1 1     113.   288.  -1.75    82.7  30.8  0.123    37.8 0.0359       0.877
#  2 1     117.  2234.  18.4     69.5  47.0  0.201    36.4 0.165        1.40
#  3 1      28.6 1438.   4.75    71.7 -43.1  0.0539   37.1 0.0265      -1.18
#  4 1     102.   366.   9.77    88.5  13.7  0.151    38.4 0.00926      0.395
#  5 1      76.6 2257.  -4.69    55.6  21.0  0.245    38.2 0.0448       0.644
#  6 1      60.1  633.  -3.18    77.4 -17.3  0.0876   38.4 0.00749     -0.484
#  7 1      89.4  376.  -4.16    80.1   9.31 0.119    38.5 0.00314      0.264
#  8 1       8.18 1921. 19.2     74.0 -65.9  0.173    34.4 0.261       -1.93
#  9 1      78.7  575.  -6.11    76.4   2.26 0.111    38.6 0.000170     0.0640
# 10 1      85.2  763.  -0.618   77.2   7.94 0.0679   38.6 0.00117      0.219
# # … with 494 more rows

df.h %>%
  nest_by(hour) %>%
  mutate(mod = list(lm(price ~ wind + temp, data = data))) %>%
  summarize(glance(mod))
# # A tibble: 24 × 13
# # Groups:   hour [24]
#    hour  r.squared adj.r.squared sigma statistic p.value    df logLik   AIC
#    <fct>     <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl>
#  1 1        0.0679       -0.0357  37.5     0.655   0.531     2  -104.  217.
#  2 2        0.139         0.0431  42.7     1.45    0.261     2  -107.  222.
#  3 3        0.0142       -0.0953  43.1     0.130   0.879     2  -107.  222.
#  4 4        0.0737       -0.0293  36.7     0.716   0.502     2  -104.  216.
#  5 5        0.213         0.126   37.8     2.44    0.115     2  -104.  217.
#  6 6        0.0813       -0.0208  33.5     0.796   0.466     2  -102.  212.
#  7 7        0.0607       -0.0437  40.7     0.582   0.569     2  -106.  220.
#  8 8        0.153         0.0592  36.3     1.63    0.224     2  -104.  215.
#  9 9        0.166         0.0736  36.5     1.79    0.195     2  -104.  216.
# 10 10       0.110         0.0108  40.0     1.11    0.351     2  -106.  219.
# # … with 14 more rows, and 4 more variables: BIC <dbl>, deviance <dbl>,
# #   df.residual <int>, nobs <int>

Credits to Bob Muenchen's Blog for inspiration on that.

How to find linear regression between groups using data.table?

With the new data, we can split the data by 'group' into a list, then use combn on the names of the list to get all pairwise combinations. For each pair, extract the list elements (s1, s2) and check for common 'time' values with intersect. If there are common elements, join on 'time', apply lm to the corresponding 'value' columns, and create a data.table with the summarised coefficient along with the group names; otherwise return NA. Finally, rbind the list elements with rbindlist.

library(data.table)
lst1 <- split(dt, dt$group)

rbindlist(combn(names(lst1), 2, FUN = function(x) {
  s1 <- lst1[[x[1]]]
  s2 <- lst1[[x[2]]]
  i1 <- intersect(s1$time, s2$time)
  if (length(i1) > 0) {
    na.omit(s1[s2, on = .(time)][,
      .(group1 = first(s1$group), group2 = first(s2$group),
        regression = lm(i.value ~ value)$coef[2])])
  } else {
    data.table(group1 = first(s1$group), group2 = first(s2$group),
               regression = NA_real_)
  }
}, simplify = FALSE))

-output

    group1 group2  regression
 1:      a      b  0.03033996
 2:      a      c  0.06391242
 3:      a      d -0.09138112
 4:      a      e -0.27738183
 5:      b      c  0.05663270
 6:      b      d  0.05481604
 7:      b      e  0.27789495
 8:      c      d -0.13987978
 9:      c      e  0.16388299
10:      d      e  0.12380720

If we want full combinations, use either expand.grid or CJ (from data.table):

dt2 <- CJ(group1 = names(lst1), group2 = names(lst1))[group1 != group2]
dt2[, rbindlist(Map(function(x, y) {
  s1 <- lst1[[x]]
  s2 <- lst1[[y]]
  i1 <- intersect(s1$time, s2$time)
  if (length(i1) > 0) {
    na.omit(s1[s2, on = .(time)][,
      data.table(group1 = x, group2 = y,
                 regression = lm(i.value ~ value)$coef[2])])
  } else {
    data.table(group1 = x, group2 = y, regression = NA_real_)
  }
}, group1, group2))]

-output

    group1 group2  regression
 1:      a      b  0.03033996
 2:      a      c  0.06391242
 3:      a      d -0.09138112
 4:      a      e -0.27738183
 5:      b      a  0.03247826
 6:      b      c  0.05663270
 7:      b      d  0.05481604
 8:      b      e  0.27789495
 9:      c      a  0.07488082
10:      c      b  0.06198333
11:      c      d -0.13987978
12:      c      e  0.16388299
13:      d      a -0.09295215
14:      d      b  0.05208743
15:      d      c -0.12144302
16:      d      e  0.12380720
17:      e      a -0.25136439
18:      e      b  0.34052322
19:      e      c  0.28677255
20:      e      d  0.21435666

Regression in R with Groups

Quick solution with lmList (package nlme):

library(nlme)
lmList(x ~ y | group, data=df)

Call:
Model: x ~ y | group
Data: df

Coefficients:
   (Intercept)           y
1    0.4786373  0.04978624
2    3.5125369 -0.94751894
3    2.7429958 -0.01208329
4   -5.2231576  2.24589181
5    5.6370824 -0.24223131
6    7.1785581 -0.08077726
7    8.2060808 -0.18283134
8    8.9072851 -0.13090764
9   10.1974577 -0.18514527
10   6.0687105  0.37396911
11   9.0682622  0.23469187
12  15.1081915 -0.29234452
13  17.3147636 -0.30306692
14  13.1352411  0.05873189
15   6.4006623  0.57619151
16  25.4454182 -0.59535396
17  22.0231916 -0.30073768
18  27.7317267 -0.54651597
19  10.9689733  0.45280604
20  23.3495704 -0.14488522

Degrees of freedom: 200 total; 160 residual
Residual standard error: 0.9536226

The data df is borrowed from @Gopala's answer.


