Linear Regression and Group by in R

Linear Regression and group by in R

Here's one way using the lme4 package (plus lattice for the plot).

library(lme4)
library(lattice)  # xyplot() comes from lattice, not lme4

d <- data.frame(state = rep(c('NY', 'CA'), c(10, 10)),
                year = rep(1:10, 2),
                response = c(rnorm(10), rnorm(10)))

xyplot(response ~ year, groups = state, data = d, type = 'l')

fits <- lmList(response ~ year | state, data=d)
fits
#------------
Call: lmList(formula = response ~ year | state, data = d)

Coefficients:
   (Intercept)        year
CA -1.34420990  0.17139963
NY  0.00196176 -0.01852429

Degrees of freedom: 20 total; 16 residual
Residual standard error: 0.8201316
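
As a follow-up (not shown in the original answer), coef() on the lmList object returns the coefficient table above as a plain data frame, which is handy for further processing:

coef(fits)     # per-state intercepts and slopes as a data frame
summary(fits)  # per-coefficient summaries across groups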

Creating a linear regression model for each group in a column

You have some mistakes in the syntax of your functions. A function is usually defined as function(x), and the x is then substituted by the data you want to use it with.

For example, in the linear_model function you defined, if you were to use it alone you would write:

linear_model(data)

However, because you are using it inside lapply it is a bit harder to see. lapply() is just making a loop, applying the linear_model function to each of the data frames you obtain from split(table2, table2$LOCATION).

The same thing happens with my_predict.

Anyway, this should work for you:

linear_model <- function(x) lm(Education ~ TIME, x)

m <- lapply(split(table2, table2$LOCATION), linear_model)

new_df <- data.frame(TIME = c(2019))

my_predict <- function(x) predict(x, new_df)

sapply(m, my_predict)
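
To see the whole pattern run end to end, here is a minimal self-contained example with made-up data (the column names LOCATION, TIME and Education mirror the question; the numbers are invented purely for illustration):

set.seed(1)
table2 <- data.frame(LOCATION = rep(c("AUS", "BEL"), each = 5),
                     TIME = rep(2014:2018, 2),
                     Education = rnorm(10, mean = 30))

m <- lapply(split(table2, table2$LOCATION), linear_model)
sapply(m, my_predict)  # one 2019 prediction per LOCATION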

ANSWER TO THE EDIT

There are probably more efficient ways of looping the prediction, but here is my approach:

pred_data <- list()

for (i in 3:6) {
  linear_model <- function(x) lm(x[, i] ~ TIME, x)
  m <- lapply(split(tableLinR, tableLinR$LOCATION), linear_model)
  new_df <- data.frame(TIME = c(2020, 2021), row.names = c("2020", "2021"))
  my_predict <- function(x) predict(x, new_df)
  pred_data[[colnames(tableLinR)[i]]] <- sapply(m, my_predict)
}

library(reshape2)  # melt()
library(tidyr)     # pivot_wider()

pred_data <- melt(pred_data)
pred_data <- as.data.frame(pivot_wider(pred_data, names_from = L1, values_from = value))

First you create an empty list where the outputs of the loop will be saved. In for (i in 3:6) you put the interval of columns you want predictions for. The result pred_data is a list that you can transform into a data frame in different ways; with melt and pivot_wider you obtain a format similar to your original data.
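
One such tidier alternative (just a sketch, assuming the same tableLinR layout as above) is to loop over column names rather than indices, building each formula with reformulate():

cols <- colnames(tableLinR)[3:6]
pred_list <- lapply(setNames(cols, cols), function(col) {
  # reformulate("TIME", response = col) builds e.g. Education ~ TIME
  fits <- lapply(split(tableLinR, tableLinR$LOCATION),
                 function(x) lm(reformulate("TIME", response = col), data = x))
  sapply(fits, predict, newdata = data.frame(TIME = c(2020, 2021)))
})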

Running single linear regressions across multiple variables, in groups

If I understand correctly, you can use group_by and group_modify, with a map inside that iterates over the predictors.

If you can isolate your predictor variables in advance, it'll make it easier, as with ivs in this solution.

library(tidyverse)

ivs <- colnames(mtcars)[3:ncol(mtcars)]
names(ivs) <- ivs

mtcars %>%
  group_by(cyl) %>%
  group_modify(function(data, key) {
    map_df(ivs, function(iv) {
      frml <- as.formula(paste("mpg", "~", iv))
      lm(frml, data = data) %>% broom::glance()
    }, .id = "iv")
  }) %>%
  select(cyl, iv, r.squared, p.value)

# A tibble: 27 × 4
# Groups:   cyl [3]
     cyl iv    r.squared p.value
   <dbl> <chr>     <dbl>   <dbl>
 1     4 disp    0.648   0.00278
 2     4 hp      0.274   0.0984
 3     4 drat    0.180   0.193
 4     4 wt      0.509   0.0137
 5     4 qsec    0.0557  0.485
 6     4 vs      0.00238 0.887
 7     4 am      0.287   0.0892
 8     4 gear    0.115   0.308
 9     4 carb    0.0378  0.567
10     6 disp    0.0106  0.826
11     6 hp      0.0161  0.786
# ...
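
If you want per-term coefficients rather than model-level fit statistics, the same pipeline works with broom::tidy() swapped in for broom::glance() (a variant of the code above, not part of the original answer):

mtcars %>%
  group_by(cyl) %>%
  group_modify(function(data, key) {
    map_df(ivs, function(iv) {
      lm(as.formula(paste("mpg", "~", iv)), data = data) %>% broom::tidy()
    }, .id = "iv")
  })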

Multiple linear regression by group in a rolling window in R

Use group_modify together with rollapplyr and its by.column = FALSE argument, so that rsd is applied to all columns at once rather than one at a time.

Note that if you use width 3 with two predictors and an intercept, the residuals will necessarily all be zero (three parameters fit three points exactly), so we changed the width to 5.
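
The rsd helper comes from the question and isn't reproduced here; as an assumption consistent with the width note above, a minimal version would fit the regression on each window and return that window's residuals:

# assumed sketch of rsd: fit y ~ x1 + x2 on one rolling window, return residuals
rsd <- function(z) {
  z <- as.data.frame(z)  # by.column = FALSE delivers the window as a matrix
  resid(lm(y ~ x1 + x2, data = z))
}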

library(dplyr, exclude = c("lag", "filter"))
library(zoo)

width <- 5

df %>%
  group_by(Group) %>%
  group_modify(~ {
    cbind(., rollapplyr(.[c("y", "x1", "x2")], width, rsd, fill = NA,
                        by.column = FALSE))
  }) %>%
  ungroup

Is it enough to use group_by in order to run a grouped linear regression?

That won't do what you are hoping. lm() is quite old and knows nothing about the grouping added by newer libraries like dplyr, so group_by() alone has no effect on the model fit.

I believe you can accomplish what you want by adding the family name as an indicator to the model. Something like:

model <- lm(income_children ~ family_name + family_name:income_parents, data = df)

This will effectively create a mini-model per family: the first term gives an intercept per family, and the interaction term gives a per-family slope for income_parents.
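
As a small follow-up (a common variant, not part of the original answer), dropping the global intercept with - 1 makes the per-family coefficients directly readable, one intercept and one slope per family_name level:

model <- lm(income_children ~ family_name + family_name:income_parents - 1, data = df)
coef(model)  # family_name* terms are intercepts; family_name*:income_parents are slopes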

If you want to stick with the many models approach, you can use nest.

library(dplyr)
library(tidyr)  # nest()
library(purrr)  # map()

model_one <- function(data) {
  lm(income_children ~ income_parents, data = data)
}

models <- df %>%
  group_by(family_name) %>%
  nest() %>%
  mutate(model = map(data, model_one))

models
# # A tibble: 9 x 3
# # Groups:   family_name [9]
#   family_name data             model
#   <fct>       <list>           <list>
# 1 Miller      <tibble [1 × 2]> <lm>
# 2 Smith       <tibble [1 × 2]> <lm>
# 3 Clark       <tibble [1 × 2]> <lm>
# 4 Powell      <tibble [1 × 2]> <lm>
# 5 Brown       <tibble [1 × 2]> <lm>
# 6 Jone        <tibble [1 × 2]> <lm>
# 7 Garcia      <tibble [1 × 2]> <lm>
# 8 Williams    <tibble [1 × 2]> <lm>
# 9 Lopez       <tibble [1 × 2]> <lm>

You will notice that this output is not very useful yet. It can be summarized neatly with broom::glance and then unnested. The summaries are degenerate here because each model is fit to a single data point, as the NaN and Inf values below show.

summarized <- models %>%
  mutate(summary = map(model, broom::glance)) %>%
  unnest(summary)

# Drop the still-nested columns for display.
summarized %>% select(-data, -model)
#   family_name r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC deviance df.residual
#   <fct>           <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>  <dbl> <dbl> <dbl>    <dbl>       <int>
# 1 Miller              0             0   NaN        NA      NA     1    Inf  -Inf  -Inf        0           0
# 2 Smith               0             0   NaN        NA      NA     1    Inf  -Inf  -Inf        0           0
# 3 Clark               0             0   NaN        NA      NA     1    Inf  -Inf  -Inf        0           0
# 4 Powell              0             0   NaN        NA      NA     1    Inf  -Inf  -Inf        0           0
# 5 Brown               0             0   NaN        NA      NA     1    Inf  -Inf  -Inf        0           0
# 6 Jone                0             0   NaN        NA      NA     1    Inf  -Inf  -Inf        0           0
# 7 Garcia              0             0   NaN        NA      NA     1    Inf  -Inf  -Inf        0           0
# 8 Williams            0             0   NaN        NA      NA     1    Inf  -Inf  -Inf        0           0
# 9 Lopez               0             0   NaN        NA      NA     1    Inf  -Inf  -Inf        0           0

For more detail, I recommend the Many Models chapter of R for Data Science.

Fitting several regression models with dplyr

As of mid-2020 (and updated to fit dplyr 1.0+ as of 2022-04), tchakravarty's answer will fail. To work with the new way broom and dplyr interact, the following combination of broom::tidy, broom::augment and broom::glance can be used. We just have to use them in conjunction with nest_by() and summarize(), rather than the older pattern of wrapping them in do() and later unnest()-ing the tibble.

library(dplyr)
library(broom)
library(tidyr)

set.seed(42)
df.h <- data.frame(
  hour  = factor(rep(1:24, each = 21)),
  price = runif(504, min = -10, max = 125),
  wind  = runif(504, min = 0, max = 2500),
  temp  = runif(504, min = -10, max = 25)
)

df.h %>%
  nest_by(hour) %>%
  mutate(mod = list(lm(price ~ wind + temp, data = data))) %>%
  summarize(tidy(mod))
# # A tibble: 72 × 6
# # Groups:   hour [24]
#    hour  term        estimate std.error statistic   p.value
#    <fct> <chr>          <dbl>     <dbl>     <dbl>     <dbl>
#  1 1     (Intercept)  87.4      15.8        5.55  0.0000289
#  2 1     wind         -0.0129    0.0120    -1.08  0.296
#  3 1     temp          0.588     0.849      0.693 0.497
#  4 2     (Intercept)  92.3      21.6        4.27  0.000466
#  5 2     wind         -0.0227    0.0134    -1.69  0.107
#  6 2     temp         -0.216     0.841     -0.257 0.800
#  7 3     (Intercept)  61.1      18.6        3.29  0.00409
#  8 3     wind          0.00471   0.0128     0.367 0.718
#  9 3     temp          0.425     0.964      0.442 0.664
# 10 4     (Intercept)  31.6      15.3        2.07  0.0529

df.h %>%
  nest_by(hour) %>%
  mutate(mod = list(lm(price ~ wind + temp, data = data))) %>%
  summarize(augment(mod))
# # A tibble: 504 × 10
# # Groups:   hour [24]
#    hour  price  wind   temp .fitted .resid   .hat .sigma  .cooksd .std.resid
#    <fct> <dbl> <dbl>  <dbl>   <dbl>  <dbl>  <dbl>  <dbl>    <dbl>      <dbl>
#  1 1     113.   288.  -1.75    82.7  30.8  0.123    37.8 0.0359       0.877
#  2 1     117.  2234.  18.4     69.5  47.0  0.201    36.4 0.165        1.40
#  3 1      28.6 1438.   4.75    71.7 -43.1  0.0539   37.1 0.0265      -1.18
#  4 1     102.   366.   9.77    88.5  13.7  0.151    38.4 0.00926      0.395
#  5 1      76.6 2257.  -4.69    55.6  21.0  0.245    38.2 0.0448       0.644
#  6 1      60.1  633.  -3.18    77.4 -17.3  0.0876   38.4 0.00749     -0.484
#  7 1      89.4  376.  -4.16    80.1   9.31 0.119    38.5 0.00314      0.264
#  8 1       8.18 1921. 19.2     74.0 -65.9  0.173    34.4 0.261       -1.93
#  9 1      78.7  575.  -6.11    76.4   2.26 0.111    38.6 0.000170     0.0640
# 10 1      85.2  763.  -0.618   77.2   7.94 0.0679   38.6 0.00117      0.219
# # … with 494 more rows

df.h %>%
  nest_by(hour) %>%
  mutate(mod = list(lm(price ~ wind + temp, data = data))) %>%
  summarize(glance(mod))
# # A tibble: 24 × 13
# # Groups:   hour [24]
#    hour  r.squared adj.r.squared sigma statistic p.value    df logLik   AIC
#    <fct>     <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl>
#  1 1        0.0679       -0.0357  37.5     0.655   0.531     2  -104.  217.
#  2 2        0.139         0.0431  42.7     1.45    0.261     2  -107.  222.
#  3 3        0.0142       -0.0953  43.1     0.130   0.879     2  -107.  222.
#  4 4        0.0737       -0.0293  36.7     0.716   0.502     2  -104.  216.
#  5 5        0.213         0.126   37.8     2.44    0.115     2  -104.  217.
#  6 6        0.0813       -0.0208  33.5     0.796   0.466     2  -102.  212.
#  7 7        0.0607       -0.0437  40.7     0.582   0.569     2  -106.  220.
#  8 8        0.153         0.0592  36.3     1.63    0.224     2  -104.  215.
#  9 9        0.166         0.0736  36.5     1.79    0.195     2  -104.  216.
# 10 10       0.110         0.0108  40.0     1.11    0.351     2  -106.  219.
# # … with 14 more rows, and 4 more variables: BIC <dbl>, deviance <dbl>,
# #   df.residual <int>, nobs <int>

Credits to Bob Muenchen's Blog for inspiration on that.

How to find linear regression between groups using data.table?

With the new data, we can split the data by 'group' into a list, then use combn on the names of the list to get all pairwise combinations. For each pair, extract the list elements (s1, s2) and check for common 'time' values with intersect. If there are common elements, join on 'time', apply lm to the corresponding 'value' columns, and create a data.table with the summarised coefficient along with the group names; otherwise return NA. Finally, rbind the list elements with rbindlist.

library(data.table)
lst1 <- split(dt, dt$group)

rbindlist(combn(names(lst1), 2, FUN = function(x) {
  s1 <- lst1[[x[1]]]
  s2 <- lst1[[x[2]]]
  i1 <- intersect(s1$time, s2$time)
  if (length(i1) > 0) {
    na.omit(s1[s2, on = .(time)][,
      .(group1 = first(s1$group), group2 = first(s2$group),
        regression = lm(i.value ~ value)$coef[2])])
  } else {
    data.table(group1 = first(s1$group), group2 = first(s2$group),
               regression = NA_real_)
  }
}, simplify = FALSE))

-output

    group1 group2  regression
 1:      a      b  0.03033996
 2:      a      c  0.06391242
 3:      a      d -0.09138112
 4:      a      e -0.27738183
 5:      b      c  0.05663270
 6:      b      d  0.05481604
 7:      b      e  0.27789495
 8:      c      d -0.13987978
 9:      c      e  0.16388299
10:      d      e  0.12380720

If we want full combinations, use either expand.grid or CJ (from data.table):

dt2 <- CJ(group1 = names(lst1), group2 = names(lst1))[group1 != group2]
dt2[, rbindlist(Map(function(x, y) {
  s1 <- lst1[[x]]
  s2 <- lst1[[y]]
  i1 <- intersect(s1$time, s2$time)
  if (length(i1) > 0) {
    na.omit(s1[s2, on = .(time)][,
      data.table(group1 = x, group2 = y,
                 regression = lm(i.value ~ value)$coef[2])])
  } else {
    data.table(group1 = x, group2 = y, regression = NA_real_)
  }
}, group1, group2))]

-output

    group1 group2  regression
 1:      a      b  0.03033996
 2:      a      c  0.06391242
 3:      a      d -0.09138112
 4:      a      e -0.27738183
 5:      b      a  0.03247826
 6:      b      c  0.05663270
 7:      b      d  0.05481604
 8:      b      e  0.27789495
 9:      c      a  0.07488082
10:      c      b  0.06198333
11:      c      d -0.13987978
12:      c      e  0.16388299
13:      d      a -0.09295215
14:      d      b  0.05208743
15:      d      c -0.12144302
16:      d      e  0.12380720
17:      e      a -0.25136439
18:      e      b  0.34052322
19:      e      c  0.28677255
20:      e      d  0.21435666

Regression in R with Groups

Quick solution with lmList (package nlme):

library(nlme)
lmList(x ~ y | group, data=df)

Call:
Model: x ~ y | group
Data: df

Coefficients:
   (Intercept)           y
1    0.4786373  0.04978624
2    3.5125369 -0.94751894
3    2.7429958 -0.01208329
4   -5.2231576  2.24589181
5    5.6370824 -0.24223131
6    7.1785581 -0.08077726
7    8.2060808 -0.18283134
8    8.9072851 -0.13090764
9   10.1974577 -0.18514527
10   6.0687105  0.37396911
11   9.0682622  0.23469187
12  15.1081915 -0.29234452
13  17.3147636 -0.30306692
14  13.1352411  0.05873189
15   6.4006623  0.57619151
16  25.4454182 -0.59535396
17  22.0231916 -0.30073768
18  27.7317267 -0.54651597
19  10.9689733  0.45280604
20  23.3495704 -0.14488522

Degrees of freedom: 200 total; 160 residual
Residual standard error: 0.9536226

The data df is borrowed from @Gopala's answer.


