Loop Linear Regression and Saving Coefficients

Loop linear regression and saving coefficients

There are several ways to do this. First, we create some generated data for illustration purposes:

set.seed(123)
dat <- expand.grid(year=2000:2010, AgeR=seq(-1,1,0.1))
dat$value <- rnorm(nrow(dat))

We can start with base-R. We split our data by year, fit the model and extract our coefficient. Then we bind everything together.

res <- do.call(rbind,lapply(split(dat, dat$year),function(x){
fit <- lm(value~exp(AgeR), data=x)
res <- data.frame(year=unique(x$year),coeff=coef(fit)[2])
res
}))

We can do the same using data.table:

library(data.table)

res2 <- setDT(dat)[,.(coeff=coef(lm(value~exp(AgeR)))[2]),year]
res2

loop or apply multiple regressions, extract coefficients and p-values into data frame

Here's how I would do it. I shortened your example a little, but that won't matter:

lhs <- c('mpg', 'cyl', 'disp')
rhs <- c('hp', 'drat')

models = list()
for (i in lhs){
for (j in rhs){
models[[paste(i, "vs", j)]] <- lm(as.formula(paste(i, "~", j)), data = mtcars)
}
}

If you want to use apply, you'll need to start with a matrix. The difference in runtime will be negligible.

# with apply:
coefs_mat = expand.grid(lhs, rhs)
mods = apply(coefs_mat, 1, function(row) {
lm(as.formula(paste(row[1], "~", row[2])), data = mtcars)
})
names(mods) = with(coefs_mat, paste(Var1, "vs", Var2))

Both methods give the same results. Now we can pull the coefficients, etc. with broom::tidy

# get coefs
library(broom)
coefs = lapply(mods, tidy, simplify = F)
# combine
dplyr::bind_rows(coefs, .id = "mod")
# mod term estimate std.error statistic p.value
# 1 mpg vs hp (Intercept) 30.09886054 1.633921e+00 18.4212465 6.642736e-18
# 2 mpg vs hp hp -0.06822828 1.011930e-02 -6.7423885 1.787835e-07
# 3 cyl vs hp (Intercept) 3.00679525 4.254852e-01 7.0667442 7.405351e-08
# 4 cyl vs hp hp 0.02168354 2.635142e-03 8.2286042 3.477861e-09
# 5 disp vs hp (Intercept) 20.99248341 3.260662e+01 0.6438104 5.245902e-01
# 6 disp vs hp hp 1.42977003 2.019414e-01 7.0801224 7.142679e-08
# 7 mpg vs drat (Intercept) -7.52461844 5.476663e+00 -1.3739423 1.796391e-01
# 8 mpg vs drat drat 7.67823260 1.506705e+00 5.0960421 1.776240e-05

We can also pull out model summary stats:

# get summary stats
summ = lapply(mods, glance, simplify = F)
dplyr::bind_rows(summ, .id = "mod")
# mod r.squared adj.r.squared sigma statistic p.value df logLik
# 1 mpg vs hp 0.6024373 0.5891853 3.862962 45.45980 1.787835e-07 2 -87.61931
# 2 cyl vs hp 0.6929688 0.6827344 1.005944 67.70993 3.477861e-09 2 -44.56307
# 3 disp vs hp 0.6255997 0.6131197 77.089503 50.12813 7.142679e-08 2 -183.41236
# 4 mpg vs drat 0.4639952 0.4461283 4.485409 25.96964 1.776240e-05 2 -92.39996
# 5 cyl vs drat 0.4899134 0.4729105 1.296596 28.81354 8.244636e-06 2 -52.68517
# 6 disp vs drat 0.5044038 0.4878839 88.693360 30.53315 5.282022e-06 2 -187.89934
# AIC BIC deviance df.residual
# 1 181.23863 185.63584 447.67431 30
# 2 95.12614 99.52335 30.35771 30
# 3 372.82473 377.22194 178283.74604 30
# 4 190.79993 195.19714 603.56673 30
# 5 111.37033 115.76754 50.43482 30
# 6 381.79868 386.19588 235995.36410 30

Regression loop and store coefficients

As @Todd has already suggested, you can just choose the particular results you care about and use postfile to store them as new variables in a new dataset. Note that a forval loop is more direct than your while code, while using xi: is superseded by factor variable notation in recent versions of Stata. (I have not changed that just in case you are using some older version.) Note evaluation of saved results such as _b[_cons] on the fly and the use of parentheses () to stop negative signs being evaluated. Some code examples elsewhere store results temporarily in local macros or scalars, which is quite unnecessary.

sysuse auto.dta, clear 
tempname myresults
postfile `myresults' threshold intercept gradient se using myresults.dta
quietly forval x = 2000(200)4800 {
xi: regress price mpg length gear_ratio i.foreign if weight < `x'
post `myresults' (`x') (`=_b[_cons]') (`=_b[mpg]') (`=_se[mpg]')
}
postclose `myresults'
use myresults
list

+---------------------------------------------+
| thresh~d intercept gradient se |
|---------------------------------------------|
1. | 2000 -3699.55 -296.8218 215.0348 |
2. | 2200 -4175.722 -53.19774 54.51251 |
3. | 2400 -3918.388 -58.83933 42.19707 |
4. | 2600 -6143.622 -58.20153 38.28178 |
5. | 2800 -11159.67 -49.21381 44.82019 |
|---------------------------------------------|
6. | 3000 -6636.524 -51.28141 52.96473 |
7. | 3200 -7410.392 -58.14692 60.55182 |
8. | 3400 -2193.125 -57.89508 52.78178 |
9. | 3600 -1824.281 -103.4387 56.49762 |
10. | 3800 -1192.767 -110.9302 51.6335 |
|---------------------------------------------|
11. | 4000 5649.41 -173.9975 74.51212 |
12. | 4200 5784.363 -147.4454 71.89362 |
13. | 4400 6494.47 -93.81158 80.81586 |
14. | 4600 6494.47 -93.81158 80.81586 |
15. | 4800 5373.041 -95.25342 82.60246 |
+---------------------------------------------+

statsby (a command, not a function) is just not designed for this problem at all, so it is not a question of whether it works well.

How to Loop/Repeat a Linear Regression in R

You want to run 22,000 linear regressions and extract the coefficients? That's simple to do from a coding standpoint.

set.seed(1)

# number of columns in the Lung and Blood data.frames. 22,000 for you?
n <- 5

# dummy data
obs <- 50 # observations
Lung <- data.frame(matrix(rnorm(obs*n), ncol=n))
Blood <- data.frame(matrix(rnorm(obs*n), ncol=n))
Age <- sample(20:80, obs)
Gender <- factor(rbinom(obs, 1, .5))

# run n regressions
my_lms <- lapply(1:n, function(x) lm(Lung[,x] ~ Blood[,x] + Age + Gender))

# extract just coefficients
sapply(my_lms, coef)

# if you need more info, get full summary call. now you can get whatever, like:
summaries <- lapply(my_lms, summary)
# ...coefficents with p values:
lapply(summaries, function(x) x$coefficients[, c(1,4)])
# ...or r-squared values
sapply(summaries, function(x) c(r_sq = x$r.squared,
adj_r_sq = x$adj.r.squared))

The models are stored in a list, where model 3 (with DV Lung[, 3] and IVs Blood[,3] + Age + Gender) is in my_lms[[3]] and so on. You can use apply functions on the list to perform summaries, from which you can extract the numbers you want.

R loop over linear regression

Here’s an approach using broom::glance() and purrr::map_dfr() to collect model summary stats into a tidy tibble:

library(broom)
library(purrr)

lm.test <- map_dfr(
set_names(names(df)[-2]),
~ glance(lm(
as.formula(paste("value ~", .x)),
data = df
)),
.id = "predictor"
)

Result:

# A tibble: 4 x 13
predictor r.squared adj.r.squared sigma statistic p.value df logLik AIC
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 num 0.131 -0.739 27.4 0.150 0.765 1 -12.5 31.1
2 person1 0.836 0.672 11.9 5.10 0.265 1 -10.0 26.1
3 person2 0.542 0.0831 19.9 1.18 0.474 1 -11.6 29.2
4 person3 0.607 0.215 18.4 1.55 0.431 1 -11.3 28.7
# ... with 4 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>,
# nobs <int>

NB, you can capture model coefficients with a similar approach using broom::tidy() instead of glance().

Linear regression in R, loop through csv files

Here is how I would run it using a function, (you can also use a for loop). the principle is to name all the files in question and go trough them one by one, saving the results of each model as we go.

#put all your files in one folder and point to the folder path
path <- "C:/Users/xxx/Desktop"

#list all the files, with directory attached
lst <- list.files(path, full.names = T)

#make a function or loop (i like functions to get structured output)
fun <- function(i){

#read each csv one at a time
dat <- read.csv(lst[i])

#make the model
mod <- lm(dat$columnY~dat$columnX)

#extract the information from the model (press view on any model and chose the desired values and hjust copy that code)
intcpt <- mod[["coefficients"]][["(Intercept)"]]
y <- mod[["coefficients"]][["columnX"]]

#set into dataframe, with the name of the file
out <- data.frame(lst[i], intcpt, y)
}
temp <- lapply(1:length(lst), fun) #run the model (will take the last thing stated in the fuction and make a list elemnt for each "loop")
results <- do.call("rbind",temp) #from list to dataframe


Related Topics



Leave a reply



Submit