Loop Linear Regression and Saving Coefficients

There are several ways to do this. First, we generate some data for illustration:

set.seed(123)
dat <- expand.grid(year=2000:2010, AgeR=seq(-1,1,0.1))
dat$value <- rnorm(nrow(dat))

We can start with base-R. We split our data by year, fit the model and extract our coefficient. Then we bind everything together.

res <- do.call(rbind,lapply(split(dat, dat$year),function(x){
  fit <- lm(value~exp(AgeR), data=x)
  res <- data.frame(year=unique(x$year),coeff=coef(fit)[2])
  res
}))

We can do the same using data.table:

library(data.table)

res2 <- setDT(dat)[,.(coeff=coef(lm(value~exp(AgeR)))[2]),year]
res2
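
If the standard error and p-value of the slope are wanted as well, the base-R version can be extended as in the following sketch. The helper name slope_row and the result name res3 are made up for illustration, and the data are regenerated so the block runs on its own.

```r
# Regenerate the illustration data so this block is self-contained
set.seed(123)
dat <- expand.grid(year = 2000:2010, AgeR = seq(-1, 1, 0.1))
dat$value <- rnorm(nrow(dat))

# Hypothetical helper: fit one model and return the slope row as a data frame
slope_row <- function(x) {
  fit <- lm(value ~ exp(AgeR), data = x)
  # coefficient matrix with columns Estimate, Std. Error, t value, Pr(>|t|)
  ct <- summary(fit)$coefficients
  data.frame(
    year    = x$year[1],
    coeff   = ct[2, "Estimate"],
    se      = ct[2, "Std. Error"],
    p_value = ct[2, "Pr(>|t|)"]
  )
}

# One row per year, bound into a single data frame
res3 <- do.call(rbind, lapply(split(dat, dat$year), slope_row))
res3
```

Row 2 of the coefficient matrix is the exp(AgeR) term; row 1 is the intercept.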

loop or apply multiple regressions, extract coefficients and p-values into data frame

Here's how I would do it. I shortened your example a little, but that won't matter:

lhs <- c('mpg', 'cyl', 'disp')
rhs <- c('hp', 'drat')

models = list()
for (i in lhs){
  for (j in rhs){
    models[[paste(i, "vs", j)]] <- lm(as.formula(paste(i, "~", j)), data = mtcars)
  }
}

If you want to use apply, you'll need to start with a matrix (or a data frame, which apply coerces to a matrix) holding one combination per row. The difference in runtime will be negligible.

# with apply:
coefs_mat = expand.grid(lhs, rhs)
mods = apply(coefs_mat, 1, function(row) {
  lm(as.formula(paste(row[1], "~", row[2])), data = mtcars)
})
names(mods) = with(coefs_mat, paste(Var1, "vs", Var2))

Both methods give the same results. Now we can pull the coefficients, etc. with broom::tidy:

# get coefs
library(broom)
coefs = lapply(mods, tidy)  # lapply always returns a list; it has no simplify argument
# combine
dplyr::bind_rows(coefs, .id = "mod")
#             mod        term      estimate    std.error  statistic      p.value
# 1     mpg vs hp (Intercept)   30.09886054 1.633921e+00 18.4212465 6.642736e-18
# 2     mpg vs hp          hp   -0.06822828 1.011930e-02 -6.7423885 1.787835e-07
# 3     cyl vs hp (Intercept)    3.00679525 4.254852e-01  7.0667442 7.405351e-08
# 4     cyl vs hp          hp    0.02168354 2.635142e-03  8.2286042 3.477861e-09
# 5    disp vs hp (Intercept)   20.99248341 3.260662e+01  0.6438104 5.245902e-01
# 6    disp vs hp          hp    1.42977003 2.019414e-01  7.0801224 7.142679e-08
# 7   mpg vs drat (Intercept)   -7.52461844 5.476663e+00 -1.3739423 1.796391e-01
# 8   mpg vs drat        drat    7.67823260 1.506705e+00  5.0960421 1.776240e-05
# ... (rows for cyl vs drat and disp vs drat not shown)

We can also pull out model summary stats:

# get summary stats
summ = lapply(mods, glance)  # as above, no simplify argument is needed
dplyr::bind_rows(summ, .id = "mod")
#            mod r.squared adj.r.squared     sigma statistic      p.value df     logLik
# 1    mpg vs hp 0.6024373     0.5891853  3.862962  45.45980 1.787835e-07  2  -87.61931
# 2    cyl vs hp 0.6929688     0.6827344  1.005944  67.70993 3.477861e-09  2  -44.56307
# 3   disp vs hp 0.6255997     0.6131197 77.089503  50.12813 7.142679e-08  2 -183.41236
# 4  mpg vs drat 0.4639952     0.4461283  4.485409  25.96964 1.776240e-05  2  -92.39996
# 5  cyl vs drat 0.4899134     0.4729105  1.296596  28.81354 8.244636e-06  2  -52.68517
# 6 disp vs drat 0.5044038     0.4878839 88.693360  30.53315 5.282022e-06  2 -187.89934
#         AIC       BIC     deviance df.residual
# 1 181.23863 185.63584    447.67431          30
# 2  95.12614  99.52335     30.35771          30
# 3 372.82473 377.22194 178283.74604          30
# 4 190.79993 195.19714    603.56673          30
# 5 111.37033 115.76754     50.43482          30
# 6 381.79868 386.19588 235995.36410          30
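
If broom is not available, roughly the same table can be assembled with base R alone. This is only a sketch: the names pairs, rows and base_res are made up for illustration, and it keeps just the slope, its p-value and R-squared for each model.

```r
lhs <- c("mpg", "cyl", "disp")
rhs <- c("hp", "drat")

# All response/predictor pairs, kept as character rather than factor
pairs <- expand.grid(lhs = lhs, rhs = rhs, stringsAsFactors = FALSE)

# One data-frame row per model: slope, its p-value, and R-squared
rows <- lapply(seq_len(nrow(pairs)), function(k) {
  fit <- lm(reformulate(pairs$rhs[k], response = pairs$lhs[k]), data = mtcars)
  s   <- summary(fit)
  data.frame(
    mod       = paste(pairs$lhs[k], "vs", pairs$rhs[k]),
    slope     = s$coefficients[2, "Estimate"],
    p_value   = s$coefficients[2, "Pr(>|t|)"],
    r_squared = s$r.squared
  )
})

base_res <- do.call(rbind, rows)
base_res
```

reformulate() builds the formula from strings, which avoids pasting text and calling as.formula.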

Regression loop and store coefficients

As @Todd has already suggested, you can just choose the particular results you care about and use postfile to store them as new variables in a new dataset. Note that a forval loop is more direct than your while code, and that xi: is superseded by factor variable notation in recent versions of Stata. (I have not changed that, just in case you are using an older version.) Note also the evaluation of saved results such as _b[_cons] on the fly, and the use of parentheses () around each posted value so that negative values are not misread as subtraction. Some code examples elsewhere store results temporarily in local macros or scalars, which is quite unnecessary.

sysuse auto.dta, clear 
tempname myresults 
postfile `myresults' threshold intercept gradient se using myresults.dta 
quietly forval x = 2000(200)4800 {
    xi: regress price mpg length gear_ratio i.foreign if weight < `x'
    post `myresults' (`x') (`=_b[_cons]') (`=_b[mpg]') (`=_se[mpg]') 
}
postclose `myresults' 
use myresults 
list 

     +---------------------------------------------+
     | thresh~d   intercept    gradient         se |
     |---------------------------------------------|
  1. |     2000    -3699.55   -296.8218   215.0348 |
  2. |     2200   -4175.722   -53.19774   54.51251 |
  3. |     2400   -3918.388   -58.83933   42.19707 |
  4. |     2600   -6143.622   -58.20153   38.28178 |
  5. |     2800   -11159.67   -49.21381   44.82019 |
     |---------------------------------------------|
  6. |     3000   -6636.524   -51.28141   52.96473 |
  7. |     3200   -7410.392   -58.14692   60.55182 |
  8. |     3400   -2193.125   -57.89508   52.78178 |
  9. |     3600   -1824.281   -103.4387   56.49762 |
 10. |     3800   -1192.767   -110.9302    51.6335 |
     |---------------------------------------------|
 11. |     4000     5649.41   -173.9975   74.51212 |
 12. |     4200    5784.363   -147.4454   71.89362 |
 13. |     4400     6494.47   -93.81158   80.81586 |
 14. |     4600     6494.47   -93.81158   80.81586 |
 15. |     4800    5373.041   -95.25342   82.60246 |
     +---------------------------------------------+

statsby (a command, not a function) is just not designed for this problem at all, so it is not a question of whether it works well.

How to Loop/Repeat a Linear Regression in R

You want to run 22,000 linear regressions and extract the coefficients? That's simple to do from a coding standpoint.

set.seed(1)

# number of columns in the Lung and Blood data.frames. 22,000 for you?
n <- 5 

# dummy data
obs <- 50 # observations
Lung <- data.frame(matrix(rnorm(obs*n), ncol=n))
Blood <- data.frame(matrix(rnorm(obs*n), ncol=n))
Age <- sample(20:80, obs)
Gender  <- factor(rbinom(obs, 1, .5))

# run n regressions
my_lms <- lapply(1:n, function(x) lm(Lung[,x] ~ Blood[,x] + Age + Gender))

# extract just coefficients
sapply(my_lms, coef)

# if you need more info, get full summary call. now you can get whatever, like:
summaries <- lapply(my_lms, summary)
# ...coefficients with p values:
lapply(summaries, function(x) x$coefficients[, c(1,4)])
# ...or r-squared values
sapply(summaries, function(x) c(r_sq = x$r.squared, 
                                adj_r_sq = x$adj.r.squared))

The models are stored in a list, where model 3 (with DV Lung[, 3] and IVs Blood[,3] + Age + Gender) is in my_lms[[3]] and so on. You can use apply functions on the list to perform summaries, from which you can extract the numbers you want.
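
With many thousands of columns it may be preferable not to keep every fitted model in memory. The following sketch, under the same dummy-data assumptions as above, returns only the estimate and p-value of the Blood term from each regression; the name coef_table is made up for illustration.

```r
set.seed(1)
n   <- 5    # number of column pairs
obs <- 50   # observations

# Same kind of dummy data as in the example above
Lung   <- data.frame(matrix(rnorm(obs * n), ncol = n))
Blood  <- data.frame(matrix(rnorm(obs * n), ncol = n))
Age    <- sample(20:80, obs)
Gender <- factor(rbinom(obs, 1, .5))

# Fit each model, keep two numbers, and let the lm object be discarded
coef_table <- t(sapply(seq_len(n), function(x) {
  fit <- lm(Lung[, x] ~ Blood[, x] + Age + Gender)
  ct  <- summary(fit)$coefficients
  # row 2 of the coefficient matrix is the Blood[, x] term
  c(column = x, estimate = ct[2, "Estimate"], p_value = ct[2, "Pr(>|t|)"])
}))
coef_table <- as.data.frame(coef_table)
coef_table
```

Each row of coef_table corresponds to one column pair, so row 3 holds the result for Lung[, 3] regressed on Blood[, 3], Age and Gender.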

R loop over linear regression

Here’s an approach using broom::glance() and purrr::map_dfr() to collect model summary stats into a tidy tibble:

library(broom)
library(purrr)

lm.test <- map_dfr(
    set_names(names(df)[-2]),
    ~ glance(lm(
      as.formula(paste("value ~", .x)),
      data = df
     )),
    .id = "predictor"
)

Result:

# A tibble: 4 x 13
  predictor r.squared adj.r.squared sigma statistic p.value    df logLik   AIC
  <chr>         <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl>
1 num           0.131       -0.739   27.4     0.150   0.765     1  -12.5  31.1
2 person1       0.836        0.672   11.9     5.10    0.265     1  -10.0  26.1
3 person2       0.542        0.0831  19.9     1.18    0.474     1  -11.6  29.2
4 person3       0.607        0.215   18.4     1.55    0.431     1  -11.3  28.7
# ... with 4 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>,
#   nobs <int>

NB, you can capture model coefficients with a similar approach using broom::tidy() instead of glance().
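
A minimal sketch of that tidy() variant follows. The data frame df here is hypothetical stand-in data, not the data from the question, and the names predictors and lm.coefs are made up for illustration.

```r
library(broom)
library(purrr)

# Hypothetical data standing in for the asker's df
set.seed(42)
df <- data.frame(
  num     = 1:10,
  value   = rnorm(10),
  person1 = rnorm(10),
  person2 = rnorm(10)
)

# Every column except the response is used as a single predictor in turn
predictors <- setdiff(names(df), "value")

lm.coefs <- map_dfr(
  set_names(predictors),
  ~ tidy(lm(reformulate(.x, response = "value"), data = df)),
  .id = "predictor"
)
lm.coefs
```

The result has two rows per predictor (intercept and slope), with the estimate, standard error, test statistic and p-value in columns.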

Linear regression in R, loop through csv files

Here is how I would run it using a function (you can also use a for loop). The principle is to list all the files in question and go through them one by one, saving the results of each model as we go.

# put all your files in one folder and point to the folder path
path <- "C:/Users/xxx/Desktop"

# list all the CSV files, with directory attached
lst <- list.files(path, pattern = "\\.csv$", full.names = TRUE)

# make a function or loop (I like functions to get structured output)
fun <- function(i){

  # read each csv one at a time
  dat <- read.csv(lst[i])

  # make the model; passing data = dat keeps the coefficient named "columnX"
  mod <- lm(columnY ~ columnX, data = dat)

  # extract the information from the model (press View on any model, choose the desired values and just copy that code)
  intcpt <- mod[["coefficients"]][["(Intercept)"]]
  y <- mod[["coefficients"]][["columnX"]]

  # set into a data frame, with the name of the file
  data.frame(file = lst[i], intcpt, y)
}
temp <- lapply(seq_along(lst), fun) # run the model (the last value in the function becomes one list element per "loop")
results <- do.call("rbind", temp) # from list to data frame
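
To try the pattern without real files, the following sketch writes three small CSV files to a temporary folder and then fits one model per file. The folder name csv_demo, the file names, the simulated data and the helper fit_file are all made up for illustration; the column names columnX and columnY are taken from the example above.

```r
# Hypothetical setup: three small CSV files in a temporary folder
path <- file.path(tempdir(), "csv_demo")
dir.create(path, showWarnings = FALSE)
set.seed(1)
for (k in 1:3) {
  d <- data.frame(columnX = 1:20)
  d$columnY <- 2 + k * d$columnX + rnorm(20)   # true slope is k
  write.csv(d, file.path(path, paste0("file", k, ".csv")), row.names = FALSE)
}

# List only the CSV files, with the directory attached
lst <- list.files(path, pattern = "\\.csv$", full.names = TRUE)

# Fit one model per file and return a one-row data frame
fit_file <- function(f) {
  dat <- read.csv(f)
  mod <- lm(columnY ~ columnX, data = dat)
  data.frame(
    file      = basename(f),
    intercept = coef(mod)[["(Intercept)"]],
    slope     = coef(mod)[["columnX"]]
  )
}

results <- do.call(rbind, lapply(lst, fit_file))
results
```

Iterating over the file paths directly, rather than over indices, means the function does not depend on a global lst.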
