Automatically Create Formulas for All Possible Linear Models

Automatically create formulas for all possible linear models

Say we work with this ridiculous example :

DF <- data.frame(Class=1:10,A=1:10,B=1:10,C=1:10)

Then you get the names of the columns

Cols <- names(DF)
Cols <- Cols[! Cols %in% "Class"]
n <- length(Cols)

You construct all possible combinations

id <- unlist(
        lapply(1:n,
              function(i)combn(1:n,i,simplify=FALSE)
        )
      ,recursive=FALSE)

You paste them to formulas

Formulas <- sapply(id,function(i)
              paste("Class~",paste(Cols[i],collapse="+"))
            )

And you loop over them to apply the models.

lapply(Formulas,function(i)
    lm(as.formula(i),data=DF))

Be warned though: if you have more than a handful columns, this will quickly become very heavy on the memory and result in literally thousands of models. You have 2^n - 1 different models with n being the number of columns.

Make very sure that is what you want, in general this kind of model comparison is strongly advised against. Forget about any kind of inference as well when you do this.

automatically create a dynamic string with variable names in the form of lm() function to fit linear models in R

names(st) <- make.names(names(st))
y <- "Life.Exp"
x <- names(st)[!names(st) %in% y]
x
y
mymodel <- as.formula(paste(y, paste(x, collapse="+"), sep="~"))
lm(mymodel, data=st)

How to create many Linear Regression models via a For Loop in R?

Your approach wasn't so bad. This is how I reproduced your work as you described it:

library(rje)   # provides the powerSet function
library(olsrr) # provides the ols_mallows_cp function to calculate the Mallow's Cp values

x <- powerSet(colnames(mtcars[,-1]))
full_model <- lm( mpg ~ ., data=mtcars )

your_models <- lapply( x, function(n) {
    d_i <- mtcars[,c( "mpg", n), drop=FALSE] # use drop=FALSE to make sure it stays a 2d structure
    return( lm( mpg ~ ., data = d_i ) )
})

Cp_vec <- sapply( your_models, function(m) {
    ols_mallows_cp( m, full_model )
})

TenSmallestIndeces <- head( order( Cp_vec ), n=10 )

TenSmallestCp <- head( sort( Cp_vec ), n=10 )

TenSmallestSets <- x[ TenSmallestIndeces ]

## inspect one of your models:
your_models[[ TenSmallestIndeces[1] ]]

It's always preferable to use some sort of apply when collecting from a loop. I frequently use foreach from the foreach package also when building data frames or other 2d structures from a loop.

I create the subset just like you did, and fit the model pretty much the same way, just do it in one go.

Then you just need to understand sort() and order() proberly to look back up in the set you started out with I think.

How to run lm models using all possible combinations of several variables and a factor

I suspect the dredge function in the MuMIn package would help you. You specify a "full" model with all parameters you want to include and then run dredge(fullmodel) to get all combinations nested within the full model.

You should then be able to get the coefficients and AIC values from the results of this.

Something like:

require(MuMIn)
data(iris)

globalmodel <- lm(Sepal.Length ~ Petal.Length + Petal.Width + Species, data = iris)

combinations <- dredge(globalmodel)

print(combinations)

to get the parameter estimates for all models (a bit messy) you can then use

coefTable(combinations)

or to get the coefficients for a particular model you can index that using the row number in the dredge object, e.g.

coefTable(combinations)[1]

to get the coefficients in the model at row 1. This should also print coefficients for factor levels.

See the MuMIn helpfile for more details and ways to extract information.

Hope that helps.

How to succinctly write a formula with many variables from a data frame?

There is a special identifier that one can use in a formula to mean all the variables, it is the . identifier.

y <- c(1,4,6)
d <- data.frame(y = y, x1 = c(4,-1,3), x2 = c(3,9,8), x3 = c(4,-4,-2))
mod <- lm(y ~ ., data = d)

You can also do things like this, to use all variables but one (in this case x3 is excluded):

mod <- lm(y ~ . - x3, data = d)

Technically, . means all variables not already mentioned in the formula. For example

lm(y ~ x1 * x2 + ., data = d)

where . would only reference x3 as x1 and x2 are already in the formula.

different possible combinations of variables for a generalized linear model

This is called dredging:

library(MuMIn)
data(Cement)
fm1 <- lm(y ~ ., data = Cement)
dd <- dredge(fm1)

Global model call: lm(formula = y ~ ., data = Cement)
---
Model selection table 
   (Intrc)    X1      X2      X3      X4 df  logLik  AICc delta weight
4    52.58 1.468  0.6623                  4 -28.156  69.3  0.00  0.566
12   71.65 1.452  0.4161         -0.2365  5 -26.933  72.4  3.13  0.119
8    48.19 1.696  0.6569  0.2500          5 -26.952  72.5  3.16  0.116
10  103.10 1.440                 -0.6140  4 -29.817  72.6  3.32  0.107
14  111.70 1.052         -0.4100 -0.6428  5 -27.310  73.2  3.88  0.081
15  203.60       -0.9234 -1.4480 -1.5570  5 -29.734  78.0  8.73  0.007
16   62.41 1.551  0.5102  0.1019 -0.1441  6 -26.918  79.8 10.52  0.003
13  131.30               -1.2000 -0.7246  4 -35.372  83.7 14.43  0.000
7    72.07        0.7313 -1.0080          4 -40.965  94.9 25.62  0.000
9   117.60                       -0.7382  3 -45.872 100.4 31.10  0.000
3    57.42        0.7891                  3 -46.035 100.7 31.42  0.000
11   94.16        0.3109         -0.4569  4 -45.761 104.5 35.21  0.000
2    81.48 1.869                          3 -48.206 105.1 35.77  0.000
6    72.35 2.312          0.4945          4 -48.005 109.0 39.70  0.000
5   110.20               -1.2560          3 -50.980 110.6 41.31  0.000
1    95.42                                2 -53.168 111.5 42.22  0.000

Is there a way to loop through column names (not numbers) in r for linear models?

If you want the statistics in a table (which might come in handy) you can use the purrr and broom packages. Here's an example using the dataset mtcars:

Code

library(tidyr)
library(purrr)
library(broom)

formula <- lapply(colnames(mtcars)[3:ncol(mtcars)], function(x) as.formula(paste0(x, " ~ cyl")))

names(formula) <- format(formula)

table <- formula %>% map(~aov(.x, mtcars)) %>% map_dfr(tidy, .id="model")

Output

> head(table)
# A tibble: 6 x 7
  model      term         df     sumsq     meansq statistic   p.value
  <chr>      <chr>     <dbl>     <dbl>      <dbl>     <dbl>     <dbl>
1 disp ~ cyl cyl           1 387454.   387454.        131.   1.80e-12
2 disp ~ cyl Residuals    30  88731.     2958.         NA   NA       
3 hp ~ cyl   cyl           1 100984.   100984.         67.7  3.48e- 9
4 hp ~ cyl   Residuals    30  44743.     1491.         NA   NA       
5 drat ~ cyl cyl           1      4.34      4.34       28.8  8.24e- 6
6 drat ~ cyl Residuals    30      4.52      0.151      NA   NA

Try

formula <- lapply(colnames(df)[10:ncol(df)], function(x) as.formula(paste0(x, " ~ block + tillage * residue + Error(subblock)")))

names(formula) <- format(formula)

table <- formula %>% map(~aov(.x, df)) %>% map_dfr(tidy, .id="model")

Automatically Create Formulas for All Possible Linear Models