Linear Regression and Storing Results in Data Frame

Linear Regression and storing results in data frame

Here's a vote for the plyr package and ddply().

library(plyr)

plyrFunc <- function(x){
  mod <- lm(b ~ c, data = x)
  return(summary(mod)$coefficients[2, 3])
}

tStats <- ddply(dF, .(a), plyrFunc)
tStats
  a         V1
1 a  1.6124515
2 b -0.1369306
3 c  0.6852483
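
plyr has been retired in favour of dplyr, so if you would rather avoid it, here is a minimal dplyr sketch that returns the same per-group t statistics (assuming the same dF with columns a, b and c):

library(dplyr)

# one row per level of a, holding the t statistic of the slope on c
tStats2 <- dF %>%
  nest_by(a) %>%
  summarise(t_stat = summary(lm(b ~ c, data = data))$coefficients[2, 3],
            .groups = "drop")
tStats2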

how to store results from multiple regressions in a single dataframe in a neat way

You can try using lapply() to loop over each of the Y variables.

cols <- grep('Y\\d+', names(data), value = TRUE)

do.call(rbind, lapply(cols, function(x) {
  model <- lm(reformulate('X', x), data)
  summary <- summary(model)
  data.frame(coef = summary$coefficients[2, 1],
             se   = summary$coefficients[2, 2])
})) -> df

df
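
The snippet above assumes a data frame called data with a predictor X and response columns named Y1, Y2, ...; the values below are just made up so it can be run as-is:

# hypothetical example data with one predictor X and several Y columns
set.seed(1)
data <- data.frame(X  = rnorm(20),
                   Y1 = rnorm(20),
                   Y2 = rnorm(20),
                   Y3 = rnorm(20))

With this input, df ends up with one row per Y column, holding the slope estimate and its standard error.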

R: Store the results of a linear model directly to a data frame

The predict() function does what you want (so does fitted()); predict() has more options (see ?predict.lm; as @Frank said in the comments, it is linked from the See Also section of ?lm).

dd <- data.frame(x=1:4,y=c(2,3,5,8))
dd$est <- predict(lm(y~x,data=dd))
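
One of those extra options is the newdata argument, which lets you store predictions for x values that are not in the original data; a quick sketch:

# predictions at new x values via the newdata argument of predict()
newd <- data.frame(x = c(5, 6))
predict(lm(y ~ x, data = dd), newdata = newd)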

A good general book on modeling in R should tell you this (e.g. Dalgaard's Introductory Statistics with R, Julian Faraway's books) - I'm sure there are also a zillion tutorials online, although I can't point you to a specific one. One hint for finding out what you can do with a model is as follows:

m <- lm(y~x,data=dd) ## fitted model
class(m) ## "lm"
methods(class="lm")
## [1]  add1            alias           anova           case.names      coerce
## [6]  confint         cooks.distance  deviance        dfbeta          dfbetas
## [11] drop1           dummy.coef      effects         extractAIC      family
## [16] formula         hatvalues       influence       initialize      kappa
## [21] labels          logLik          model.frame     model.matrix    nobs
## [26] plot            predict         print           proj            qr
## [31] residuals       rstandard       rstudent        show            simulate
## [36] slotsFromS3     summary         variable.names  vcov

Now you can try to guess whether any of these might be useful (or look up their help files via, e.g., ?confint.lm).
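
For instance, confint() from the list above pairs naturally with coef() if you want the estimates and their 95% confidence intervals in a single data frame; a small sketch using the model m fitted above:

# point estimates plus 95% confidence intervals, one row per coefficient
results <- data.frame(term     = names(coef(m)),
                      estimate = unname(coef(m)),
                      confint(m),
                      row.names = NULL,
                      check.names = FALSE)
results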

How can I perform and store linear regression models between all continuous variables in a data frame?

Assuming you need pairwise comparisons between all columns of mtcars, you can use the combn() function to generate all pairwise combinations of columns (taken 2 at a time) and fit all of the linear models with:

combinations <- combn(colnames(mtcars), 2)

forward <- list()
reverse <- list()

for(i in 1:ncol(combinations)){
  forward[[i]] <- lm(formula(paste0(combinations[, i][1], "~", combinations[, i][2])), data = mtcars)
  reverse[[i]] <- lm(formula(paste0(combinations[, i][2], "~", combinations[, i][1])), data = mtcars)
}

all <- c(forward, reverse)

all will be a list containing all of the linear models, covering both the forward and reverse directions of association between each pair of variables.

If you want combinations between three variables, you can do combn(colnames(mtcars), 3), and so on.
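
If you then want the stored models summarised in a single data frame rather than kept as a list, one possible follow-up sketch (the column names here are only illustrative) is:

# one row per fitted model: its formula, slope estimate and R-squared
model_summary <- do.call(rbind, lapply(all, function(m) {
  data.frame(formula   = deparse(formula(m)),
             slope     = unname(coef(m)[2]),
             r.squared = summary(m)$r.squared)
}))
head(model_summary)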

Running linear models for groups within dataframe and storing outputs in dataframe in R

Use glance instead of tidy:

dt_lm <- dt %>%
  group_by(group) %>%
  do(glance(lm(y ~ x, data = .))) %>%
  select(AIC)

which gives:

Adding missing grouping variables: `group`
# A tibble: 2 x 2
# Groups:   group [2]
  group   AIC
  <chr> <dbl>
1 a      119.
2 b      114.

If you want to store not just the AIC but other metrics as well, simply skip the select part.
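
Note that do() is superseded in current versions of dplyr; a roughly equivalent sketch with group_modify() (same dt, group, y and x as above) would be:

library(dplyr)
library(broom)

dt_lm <- dt %>%
  group_by(group) %>%
  group_modify(~ glance(lm(y ~ x, data = .x))) %>%
  select(group, AIC)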

R Storing regression coefficients in data frame column by group

I'm not 100% sure that this output is what you're after, but, is this on the right track?

library(dplyr)
library(tidyr)

df2 <- df %>%
  spread(QST, VALUE, fill = 0) %>%
  split(.$CLIENT) %>%
  lapply(., function(x) { lm(Q2 ~ ., x[, -c(1, 2)])$coefficients }) %>%
  do.call(rbind, .) %>%
  data.frame(.) %>%
  mutate(CLIENT = rownames(.)) %>%
  gather(QST, COEFFICIENT, -CLIENT) %>%
  arrange(CLIENT)

> df2
CLIENT QST COEFFICIENT
1 A X.Intercept. -1.200000e+01
2 A Q1 1.000000e+00
3 A Q10 NA
4 A Q3 2.000000e+00
5 A Q4 3.000000e+00
6 A Q5 5.000000e-01
7 A Q6 NA
8 A Q7 NA
9 A Q8 NA
10 A Q9 NA
11 B X.Intercept. 5.000000e+00
12 B Q1 -1.326970e-16
13 B Q10 1.666667e+00
14 B Q3 3.726559e-15
15 B Q4 -2.000000e+00
16 B Q5 NA
17 B Q6 NA
18 B Q7 NA
19 B Q8 NA
20 B Q9 NA

Edit:

Running only the splitting component generates a list of wide-format data frames, one for each client:

df %>%
  spread(QST, VALUE, fill = 0) %>%
  split(.$CLIENT)

$A
RESP_ID CLIENT Q1 Q10 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9
1 1 A 4 0 1 4 3 3 2 0 0 0
2 2 A 2 0 2 2 3 2 4 4 3 0
3 3 A 2 0 2 3 3 1 2 4 2 3
4 4 A 3 0 3 4 2 1 0 0 0 0
5 5 A 3 0 4 4 3 0 0 0 0 0

$B
RESP_ID CLIENT Q1 Q10 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9
6 6 B 3 2 3 2 3 2 2 1 3 3
7 7 B 2 0 3 2 2 0 0 0 0 0
8 8 B 3 0 2 4 1 3 3 2 3 0
9 9 B 2 0 1 4 2 1 3 1 2 0
10 10 B 3 2 3 3 3 3 4 2 3 3

Note that all of the zeroes are filling in for questions where your original data had no values, i.e. where a question wasn't answered. See Ben Bolker's answer on that point.

If you now include the code to run the lm on each of those, you get the coefficient values directly, which include the NA values seen above:

> df %>%
+ spread(QST, VALUE, fill = 0) %>%
+ split(.$CLIENT) %>%
+ lapply(., function(x) { lm(Q2 ~ ., x[, -c(1,2)])$coefficients })
$A
(Intercept) Q1 Q10 Q3 Q4 Q5 Q6 Q7 Q8 Q9
6.6666667 2.0000000 NA -1.6666667 -0.6666667 -1.6666667 NA NA NA NA

$B
(Intercept) Q1 Q10 Q3 Q4 Q5 Q6 Q7 Q8 Q9
13.0 -3.0 -0.5 -2.0 NA 2.0 NA NA NA NA

Edit 2:

Just to explore with a more complete dataset, if we use this df:

set.seed(42)
df <-
  expand.grid(RESP_ID = 1:10,
              CLIENT = c("A", "B"),
              QST = paste("Q", 1:10, sep = "")) %>%
  mutate(VALUE = round(runif(200, 1, 4), 0))

and run the same code, we get coefficients without NA values:

> df %>%
+ spread(QST, VALUE, fill = 0) %>%
+ split(.$CLIENT) %>%
+ lapply(., function(x) { lm(Q2 ~ ., x[, -c(1,2)])$coefficients }) %>%
+ do.call(rbind, .) %>%
+ data.frame(.) %>%
+ mutate(CLIENT = rownames(.)) %>%
+ gather(QST, COEFFICIENT, -CLIENT) %>%
+ arrange(CLIENT)
CLIENT QST COEFFICIENT
1 A X.Intercept. 6.50000000
2 A Q1 -4.14285714
3 A Q3 2.50000000
4 A Q4 0.85714286
5 A Q5 1.00000000
6 A Q6 -0.64285714
7 A Q7 -1.21428571
8 A Q8 -1.85714286
9 A Q9 2.50000000
10 A Q10 -0.07142857
11 B X.Intercept. -4.69924812
12 B Q1 -0.86466165
13 B Q3 1.56390977
14 B Q4 1.10150376
15 B Q5 -0.86842105
16 B Q6 0.87593985
17 B Q7 0.57142857
18 B Q8 0.25187970
19 B Q9 0.79699248
20 B Q10 -0.12781955

How to perform linear regression for multiple columns and get a dataframe output with: regression equation and r squared value?

Something like this:

library(tidyverse)
library(broom)
df1 %>%
  pivot_longer(
    cols = starts_with("X")
  ) %>%
  mutate(name = factor(name)) %>%
  group_by(name) %>%
  group_split() %>%
  map_dfr(.f = function(df){
    lm(LH27_20822244_U_Stationary ~ value, data = df) %>%
      glance() %>%
      # tidy() %>%
      add_column(name = unique(df$name), .before = 1)
  })

Using tidy()

  name             term        estimate std.error statistic p.value
  <fct>            <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 X20676887_X2LH_S (Intercept)   12.8      2.28        5.62 0.00494
2 X20676887_X2LH_S value          0.393    0.0855      4.59 0.0101
3 X20819831_11LH_S (Intercept)   10.4      3.72        2.79 0.0495
4 X20819831_11LH_S value          0.492    0.142       3.47 0.0256
5 X20822214_X4LH_S (Intercept)   10.5      3.30        3.20 0.0329
6 X20822214_X4LH_S value          0.485    0.126       3.86 0.0182

Using glance()

  name          r.squared adj.r.squared  sigma statistic p.value    df logLik   AIC   BIC deviance df.residual  nobs
  <fct>             <dbl>         <dbl>  <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
1 X20676887_X2~     0.841         0.801 0.0350      21.1  0.0101     1   12.8 -19.6 -20.3  0.00490           4     6
2 X20819831_11~     0.751         0.688 0.0438      12.0  0.0256     1   11.5 -17.0 -17.6  0.00766           4     6
3 X20822214_X4~     0.788         0.735 0.0403      14.9  0.0182     1   12.0 -17.9 -18.6  0.00651           4     6
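
If the goal is literally a data frame holding a fitted equation and the R-squared for each column, one possible sketch building on the same reshaping (df1 and LH27_20822244_U_Stationary as above; the equation string format is only illustrative) is:

library(tidyverse)
library(broom)

eqn_tbl <- df1 %>%
  pivot_longer(cols = starts_with("X")) %>%
  group_by(name) %>%
  group_modify(function(d, key) {
    fit <- lm(LH27_20822244_U_Stationary ~ value, data = d)
    tibble(equation  = sprintf("y = %.3f + %.3f * x", coef(fit)[1], coef(fit)[2]),
           r.squared = glance(fit)$r.squared)
  }) %>%
  ungroup()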

R: repeat linear regression for all variables and save results in a new data frame

You can try the following code to get the desired output:

data <- structure(list(var1 = c(12L, 3L, 13L, 17L, 9L, 15L, 12L, 3L, 
13L), var2 = c(5L, 2L, 15L, 11L, 13L, 6L, 5L, 2L, 15L), var3 = c(18L,
10L, 14L, 16L, 8L, 20L, 18L, 10L, 14L), var4 = c(19L, 6L, 13L,
18L, 8L, 17L, 19L, 6L, 13L), var5 = c(12L, 13L, 1L, 10L, 7L,
3L, 12L, 13L, 1L), var6 = c(17L, 17L, 17L, 17L, 17L, 17L, 17L,
17L, 17L), var7 = c(11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L
), var8 = c(16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L), var9 = c(18L,
18L, 18L, 18L, 18L, 18L, 18L, 18L, 18L), var10 = c(10L, 10L,
10L, 10L, 10L, 10L, 10L, 10L, 10L)), class = "data.frame", row.names = c(NA,
-9L))

head(data,2)
#> var1 var2 var3 var4 var5 var6 var7 var8 var9 var10
#> 1 12 5 18 19 12 17 11 16 18 10
#> 2 3 2 10 6 13 17 11 16 18 10

x = names(data[,-1])
out <- unlist(lapply(1, function(n) combn(x, 1, FUN=function(row) paste0("var1 ~ ", paste0(row, collapse = "+")))))
out
#> [1] "var1 ~ var2" "var1 ~ var3" "var1 ~ var4" "var1 ~ var5"
#> [5] "var1 ~ var6" "var1 ~ var7" "var1 ~ var8" "var1 ~ var9"
#> [9] "var1 ~ var10"
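
Since combn(x, 1) just iterates over the variable names one at a time, the same vector of formula strings can also be built more directly; an equivalent one-liner is:

# equivalent, more direct construction of the same formula strings
out <- paste0("var1 ~ ", x)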

library(broom)
#> Warning: package 'broom' was built under R version 3.5.3

library(dplyr)
#> Warning: package 'dplyr' was built under R version 3.5.3
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union

# To have the regression coefficients
tmp1 = bind_rows(lapply(out, function(frml) {
  a = tidy(lm(frml, data = data))
  a$frml = frml
  return(a)
}))
head(tmp1)
#> # A tibble: 6 x 6
#>   term        estimate std.error statistic p.value frml
#>   <chr>          <dbl>     <dbl>     <dbl>   <dbl> <chr>
#> 1 (Intercept)    6.46      2.78      2.33  0.0529  var1 ~ var2
#> 2 var2           0.525     0.288     1.82  0.111   var1 ~ var2
#> 3 (Intercept)   -1.50      4.47     -0.335 0.748   var1 ~ var3
#> 4 var3           0.863     0.303     2.85  0.0247  var1 ~ var3
#> 5 (Intercept)    0.649     2.60      0.250 0.810   var1 ~ var4
#> 6 var4           0.766     0.183     4.18  0.00413 var1 ~ var4

# To have the regression results, i.e. R2, AIC, BIC
tmp2 = bind_rows(lapply(out, function(frml) {
  a = glance(lm(frml, data = data))
  a$frml = frml
  return(a)
}))
head(tmp2)
#> # A tibble: 6 x 12
#>   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
#>       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>  <dbl> <dbl> <dbl>
#> 1     0.321         0.224  4.33      3.31 0.111       2  -24.8  55.7  56.3
#> 2     0.537         0.471  3.58      8.12 0.0247      2  -23.1  52.2  52.8
#> 3     0.714         0.673  2.81     17.5  0.00413     2  -20.9  47.9  48.5
#> 4     0.276         0.173  4.47      2.67 0.146       2  -25.1  56.2  56.8
#> 5     0             0      4.92     NA   NA           1  -26.6  57.2  57.6
#> 6     0             0      4.92     NA   NA           1  -26.6  57.2  57.6
#> # ... with 3 more variables: deviance <dbl>, df.residual <int>, frml <chr>

write.csv(tmp1, "Try_lm_coefficients.csv")
write.csv(tmp2, "Try_lm_results.csv")

Created on 2019-11-20 by the reprex package (v0.3.0)


