Linear Regression and Storing Results in Data Frame

Linear Regression and storing results in data frame

Here's a vote for the plyr package and ddply().

library(plyr)

plyrFunc <- function(x){
  mod <- lm(b ~ c, data = x)
  return(summary(mod)$coefficients[2, 3])
}

tStats <- ddply(dF, .(a), plyrFunc)
tStats
  a         V1
1 a  1.6124515
2 b -0.1369306
3 c  0.6852483
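
plyr has been retired in favour of dplyr, so if you would rather avoid it, here is a minimal dplyr sketch that returns the same per-group t statistics (assuming the same dF with columns a, b and c):

library(dplyr)

# one row per level of a, holding the t statistic of the slope on c
tStats2 <- dF %>%
  nest_by(a) %>%
  summarise(t_stat = summary(lm(b ~ c, data = data))$coefficients[2, 3],
            .groups = "drop")
tStats2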

how to store results from multiple regressions in a single dataframe in a neat way

You can try using lapply() to loop over each of the Y variables.

cols <- grep('Y\\d+', names(data), value = TRUE)

do.call(rbind, lapply(cols, function(x) {
  model <- lm(reformulate('X', x), data)
  summary <- summary(model)
  data.frame(coef = summary$coefficients[2, 1],
             se   = summary$coefficients[2, 2])
})) -> df

df
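
The snippet above assumes a data frame called data with a predictor X and response columns named Y1, Y2, ...; the values below are just made up so it can be run as-is:

# hypothetical example data with one predictor X and several Y columns
set.seed(1)
data <- data.frame(X  = rnorm(20),
                   Y1 = rnorm(20),
                   Y2 = rnorm(20),
                   Y3 = rnorm(20))

With this input, df ends up with one row per Y column, holding the slope estimate and its standard error.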

R: Store the results of a linear model directly to a data frame

The predict() function does what you want (so does fitted()); predict() has more options (see ?predict.lm; as @Frank said in the comments, it is linked from the See Also section of ?lm).

dd <- data.frame(x=1:4,y=c(2,3,5,8))
dd$est <- predict(lm(y~x,data=dd))
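
One of those extra options is the newdata argument, which lets you store predictions for x values that are not in the original data; a quick sketch:

# predictions at new x values via the newdata argument of predict()
newd <- data.frame(x = c(5, 6))
predict(lm(y ~ x, data = dd), newdata = newd)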

A good general book on modeling in R should tell you this (e.g. Dalgaard's Introductory Statistics with R, Julian Faraway's books) - I'm sure there are also a zillion tutorials online, although I can't point you to a specific one. One hint for finding out what you can do with a model is as follows:

m <- lm(y~x,data=dd) ## fitted model
class(m) ## "lm"
methods(class="lm")
## [1]  add1            alias           anova           case.names      coerce
## [6]  confint         cooks.distance  deviance        dfbeta          dfbetas
## [11] drop1           dummy.coef      effects         extractAIC      family
## [16] formula         hatvalues       influence       initialize      kappa
## [21] labels          logLik          model.frame     model.matrix    nobs
## [26] plot            predict         print           proj            qr
## [31] residuals       rstandard       rstudent        show            simulate
## [36] slotsFromS3     summary         variable.names  vcov

Now you can try to guess whether any of these might be useful (or look up their help files via, e.g., ?confint.lm).
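
For instance, confint() from the list above pairs naturally with coef() if you want the estimates and their 95% confidence intervals in a single data frame; a small sketch using the model m fitted above:

# point estimates plus 95% confidence intervals, one row per coefficient
results <- data.frame(term     = names(coef(m)),
                      estimate = unname(coef(m)),
                      confint(m),
                      row.names = NULL,
                      check.names = FALSE)
results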

How can I perform and store linear regression models between all continuous variables in a data frame?

Assuming you need pairwise comparisons between all columns of mtcars, you can use the combn() function to generate all pairwise combinations of columns (taken 2 at a time) and fit all of the linear models with:

combinations <- combn(colnames(mtcars), 2)

forward <- list()
reverse <- list()

for(i in 1:ncol(combinations)){
  forward[[i]] <- lm(formula(paste0(combinations[, i][1], "~", combinations[, i][2])), data = mtcars)
  reverse[[i]] <- lm(formula(paste0(combinations[, i][2], "~", combinations[, i][1])), data = mtcars)
}

all <- c(forward, reverse)

all will be a list containing all of the linear models, covering both the forward and reverse directions of association between each pair of variables.

If you want combinations between three variables, you can do combn(colnames(mtcars), 3), and so on.
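
If you then want the stored models summarised in a single data frame rather than kept as a list, one possible follow-up sketch (the column names here are only illustrative) is:

# one row per fitted model: its formula, slope estimate and R-squared
model_summary <- do.call(rbind, lapply(all, function(m) {
  data.frame(formula   = deparse(formula(m)),
             slope     = unname(coef(m)[2]),
             r.squared = summary(m)$r.squared)
}))
head(model_summary)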

Running linear models for groups within dataframe and storing outputs in dataframe in R

Use glance instead of tidy:

dt_lm <- dt %>%
  group_by(group) %>%
  do(glance(lm(y ~ x, data = .))) %>%
  select(AIC)

which gives:

Adding missing grouping variables: `group`
# A tibble: 2 x 2
# Groups:   group [2]
  group   AIC
  <chr> <dbl>
1 a      119.
2 b      114.

If you want to store not just the AIC but other metrics as well, simply skip the select part.
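
Note that do() is superseded in current versions of dplyr; a roughly equivalent sketch with group_modify() (same dt, group, y and x as above) would be:

library(dplyr)
library(broom)

dt_lm <- dt %>%
  group_by(group) %>%
  group_modify(~ glance(lm(y ~ x, data = .x))) %>%
  select(group, AIC)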

R Storing regression coefficients in data frame column by group

I'm not 100% sure that this output is what you're after, but, is this on the right track?

library(dplyr)
library(tidyr)

df2 <- df %>%
  spread(QST, VALUE, fill = 0) %>%
  split(.$CLIENT) %>%
  lapply(., function(x) { lm(Q2 ~ ., x[, -c(1, 2)])$coefficients }) %>%
  do.call(rbind, .) %>%
  data.frame(.) %>%
  mutate(CLIENT = rownames(.)) %>%
  gather(QST, COEFFICIENT, -CLIENT) %>%
  arrange(CLIENT)

> df2
CLIENT QST COEFFICIENT
1 A X.Intercept. -1.200000e+01
2 A Q1 1.000000e+00
3 A Q10 NA
4 A Q3 2.000000e+00
5 A Q4 3.000000e+00
6 A Q5 5.000000e-01
7 A Q6 NA
8 A Q7 NA
9 A Q8 NA
10 A Q9 NA
11 B X.Intercept. 5.000000e+00
12 B Q1 -1.326970e-16
13 B Q10 1.666667e+00
14 B Q3 3.726559e-15
15 B Q4 -2.000000e+00
16 B Q5 NA
17 B Q6 NA
18 B Q7 NA
19 B Q8 NA
20 B Q9 NA

Edit:

Running only the splitting component generates a list of wide-format data frames, one for each client:

df %>%
  spread(QST, VALUE, fill = 0) %>%
  split(.$CLIENT)

$A
RESP_ID CLIENT Q1 Q10 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9
1 1 A 4 0 1 4 3 3 2 0 0 0
2 2 A 2 0 2 2 3 2 4 4 3 0
3 3 A 2 0 2 3 3 1 2 4 2 3
4 4 A 3 0 3 4 2 1 0 0 0 0
5 5 A 3 0 4 4 3 0 0 0 0 0

$B
RESP_ID CLIENT Q1 Q10 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9
6 6 B 3 2 3 2 3 2 2 1 3 3
7 7 B 2 0 3 2 2 0 0 0 0 0
8 8 B 3 0 2 4 1 3 3 2 3 0
9 9 B 2 0 1 4 2 1 3 1 2 0
10 10 B 3 2 3 3 3 3 4 2 3 3

Note that all of the zeroes are filling in for questions where your original data had no values, i.e. where a question wasn't answered. See Ben Bolker's answer on that point.

If you now include the code to run the lm on each of those, you get the coefficient values directly, which include the NA values seen above:

> df %>%
+ spread(QST, VALUE, fill = 0) %>%
+ split(.$CLIENT) %>%
+ lapply(., function(x) { lm(Q2 ~ ., x[, -c(1,2)])$coefficients })
$A
(Intercept) Q1 Q10 Q3 Q4 Q5 Q6 Q7 Q8 Q9
6.6666667 2.0000000 NA -1.6666667 -0.6666667 -1.6666667 NA NA NA NA

$B
(Intercept) Q1 Q10 Q3 Q4 Q5 Q6 Q7 Q8 Q9
13.0 -3.0 -0.5 -2.0 NA 2.0 NA NA NA NA

Edit 2:

Just to explore with a more complete dataset, if we use this df:

set.seed(42)
df <-
  expand.grid(RESP_ID = 1:10,
              CLIENT = c("A", "B"),
              QST = paste("Q", 1:10, sep = "")) %>%
  mutate(VALUE = round(runif(200, 1, 4), 0))

and run the same code, we get coefficients without NA values:

> df %>%
+ spread(QST, VALUE, fill = 0) %>%
+ split(.$CLIENT) %>%
+ lapply(., function(x) { lm(Q2 ~ ., x[, -c(1,2)])$coefficients }) %>%
+ do.call(rbind, .) %>%
+ data.frame(.) %>%
+ mutate(CLIENT = rownames(.)) %>%
+ gather(QST, COEFFICIENT, -CLIENT) %>%
+ arrange(CLIENT)
CLIENT QST COEFFICIENT
1 A X.Intercept. 6.50000000
2 A Q1 -4.14285714
3 A Q3 2.50000000
4 A Q4 0.85714286
5 A Q5 1.00000000
6 A Q6 -0.64285714
7 A Q7 -1.21428571
8 A Q8 -1.85714286
9 A Q9 2.50000000
10 A Q10 -0.07142857
11 B X.Intercept. -4.69924812
12 B Q1 -0.86466165
13 B Q3 1.56390977
14 B Q4 1.10150376
15 B Q5 -0.86842105
16 B Q6 0.87593985
17 B Q7 0.57142857
18 B Q8 0.25187970
19 B Q9 0.79699248
20 B Q10 -0.12781955

How to perform linear regression for multiple columns and get a dataframe output with: regression equation and r squared value?

Something like this:

library(tidyverse)
library(broom)
df1 %>%
  pivot_longer(
    cols = starts_with("X")
  ) %>%
  mutate(name = factor(name)) %>%
  group_by(name) %>%
  group_split() %>%
  map_dfr(.f = function(df){
    lm(LH27_20822244_U_Stationary ~ value, data = df) %>%
      glance() %>%
      # tidy() %>%
      add_column(name = unique(df$name), .before = 1)
  })

Using tidy()

  name             term        estimate std.error statistic p.value
  <fct>            <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 X20676887_X2LH_S (Intercept)   12.8      2.28        5.62 0.00494
2 X20676887_X2LH_S value          0.393    0.0855      4.59 0.0101
3 X20819831_11LH_S (Intercept)   10.4      3.72        2.79 0.0495
4 X20819831_11LH_S value          0.492    0.142       3.47 0.0256
5 X20822214_X4LH_S (Intercept)   10.5      3.30        3.20 0.0329
6 X20822214_X4LH_S value          0.485    0.126       3.86 0.0182

Using glance()

  name          r.squared adj.r.squared  sigma statistic p.value    df logLik   AIC   BIC deviance df.residual  nobs
  <fct>             <dbl>         <dbl>  <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
1 X20676887_X2~     0.841         0.801 0.0350      21.1  0.0101     1   12.8 -19.6 -20.3  0.00490           4     6
2 X20819831_11~     0.751         0.688 0.0438      12.0  0.0256     1   11.5 -17.0 -17.6  0.00766           4     6
3 X20822214_X4~     0.788         0.735 0.0403      14.9  0.0182     1   12.0 -17.9 -18.6  0.00651           4     6
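
If the goal is literally a data frame holding a fitted equation and the R-squared for each column, one possible sketch building on the same reshaping (df1 and LH27_20822244_U_Stationary as above; the equation string format is only illustrative) is:

library(tidyverse)
library(broom)

eqn_tbl <- df1 %>%
  pivot_longer(cols = starts_with("X")) %>%
  group_by(name) %>%
  group_modify(function(d, key) {
    fit <- lm(LH27_20822244_U_Stationary ~ value, data = d)
    tibble(equation  = sprintf("y = %.3f + %.3f * x", coef(fit)[1], coef(fit)[2]),
           r.squared = glance(fit)$r.squared)
  }) %>%
  ungroup()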

R: repeat linear regression for all variables and save results in a new data frame

You can try the following code to get the desired output:

data <- structure(list(var1 = c(12L, 3L, 13L, 17L, 9L, 15L, 12L, 3L, 
13L), var2 = c(5L, 2L, 15L, 11L, 13L, 6L, 5L, 2L, 15L), var3 = c(18L,
10L, 14L, 16L, 8L, 20L, 18L, 10L, 14L), var4 = c(19L, 6L, 13L,
18L, 8L, 17L, 19L, 6L, 13L), var5 = c(12L, 13L, 1L, 10L, 7L,
3L, 12L, 13L, 1L), var6 = c(17L, 17L, 17L, 17L, 17L, 17L, 17L,
17L, 17L), var7 = c(11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L
), var8 = c(16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L), var9 = c(18L,
18L, 18L, 18L, 18L, 18L, 18L, 18L, 18L), var10 = c(10L, 10L,
10L, 10L, 10L, 10L, 10L, 10L, 10L)), class = "data.frame", row.names = c(NA,
-9L))

head(data,2)
#> var1 var2 var3 var4 var5 var6 var7 var8 var9 var10
#> 1 12 5 18 19 12 17 11 16 18 10
#> 2 3 2 10 6 13 17 11 16 18 10

x = names(data[,-1])
out <- unlist(lapply(1, function(n) combn(x, 1, FUN=function(row) paste0("var1 ~ ", paste0(row, collapse = "+")))))
out
#> [1] "var1 ~ var2" "var1 ~ var3" "var1 ~ var4" "var1 ~ var5"
#> [5] "var1 ~ var6" "var1 ~ var7" "var1 ~ var8" "var1 ~ var9"
#> [9] "var1 ~ var10"
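
Since combn(x, 1) just iterates over the variable names one at a time, the same vector of formula strings can also be built more directly; an equivalent one-liner is:

# equivalent, more direct construction of the same formula strings
out <- paste0("var1 ~ ", x)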

library(broom)
#> Warning: package 'broom' was built under R version 3.5.3

library(dplyr)
#> Warning: package 'dplyr' was built under R version 3.5.3
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union

# To have the regression coefficients
tmp1 = bind_rows(lapply(out, function(frml) {
  a = tidy(lm(frml, data = data))
  a$frml = frml
  return(a)
}))
head(tmp1)
#> # A tibble: 6 x 6
#>   term        estimate std.error statistic p.value frml
#>   <chr>          <dbl>     <dbl>     <dbl>   <dbl> <chr>
#> 1 (Intercept)    6.46      2.78      2.33  0.0529  var1 ~ var2
#> 2 var2           0.525     0.288     1.82  0.111   var1 ~ var2
#> 3 (Intercept)   -1.50      4.47     -0.335 0.748   var1 ~ var3
#> 4 var3           0.863     0.303     2.85  0.0247  var1 ~ var3
#> 5 (Intercept)    0.649     2.60      0.250 0.810   var1 ~ var4
#> 6 var4           0.766     0.183     4.18  0.00413 var1 ~ var4

# To have the regression results, i.e. R2, AIC, BIC
tmp2 = bind_rows(lapply(out, function(frml) {
  a = glance(lm(frml, data = data))
  a$frml = frml
  return(a)
}))
head(tmp2)
#> # A tibble: 6 x 12
#>   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
#>       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>  <dbl> <dbl> <dbl>
#> 1     0.321         0.224  4.33      3.31 0.111       2  -24.8  55.7  56.3
#> 2     0.537         0.471  3.58      8.12 0.0247      2  -23.1  52.2  52.8
#> 3     0.714         0.673  2.81     17.5  0.00413     2  -20.9  47.9  48.5
#> 4     0.276         0.173  4.47      2.67 0.146       2  -25.1  56.2  56.8
#> 5     0             0      4.92     NA   NA           1  -26.6  57.2  57.6
#> 6     0             0      4.92     NA   NA           1  -26.6  57.2  57.6
#> # ... with 3 more variables: deviance <dbl>, df.residual <int>, frml <chr>

write.csv(tmp1, "Try_lm_coefficients.csv")
write.csv(tmp2, "Try_lm_results.csv")

Created on 2019-11-20 by the reprex package (v0.3.0)


