Logistic Regression: How to Try Every Combination of Predictors in R

Logistic regression: how to try every combination of predictors in R?

I am not sure about the value of this "educational exercise", but for the sake of programming, here would be my approach:

First, let's create some example predictor names. I use 5 predictors as in your example, but for 10, you would obviously need to replace 5 with 10.

X = paste0("x",1:5)
X
[1] "x1" "x2" "x3" "x4" "x5"

Now, we can get the combinations with combn.

For instance, for one variable at a time:

 t(combn(X,1))
     [,1]
[1,] "x1"
[2,] "x2"
[3,] "x3"
[4,] "x4"
[5,] "x5"

Two variables at a time:

> t(combn(X,2))
      [,1] [,2]
 [1,] "x1" "x2"
 [2,] "x1" "x3"
 [3,] "x1" "x4"
 [4,] "x1" "x5"
 [5,] "x2" "x3"
 [6,] "x2" "x4"
 [7,] "x2" "x5"
 [8,] "x3" "x4"
 [9,] "x3" "x5"
[10,] "x4" "x5"

etc.

We can use lapply to call these functions successively with an increasing number of variables to consider, and to catch the results in a list. For instance, have a look at the output of lapply(1:5, function(n) t(combn(X,n))). To turn these combinations into formulas, we can use the following:

out <- unlist(lapply(1:5, function(n) {
  # get combinations
  combinations <- t(combn(X,n))
  # collapse them into usable formulas:
  formulas <- apply(combinations, 1, 
                    function(row) paste0("y ~ ", paste0(row, collapse = "+")))}))

Or equivalently using the FUN argument of combn (as pointed out by user20650):

out <- unlist(lapply(1:5, function(n) combn(X, n, FUN=function(row) paste0("y ~ ", paste0(row, collapse = "+")))))

This gives:

out
 [1] "y ~ x1"             "y ~ x2"             "y ~ x3"             "y ~ x4"             "y ~ x5"            
 [6] "y ~ x1+x2"          "y ~ x1+x3"          "y ~ x1+x4"          "y ~ x1+x5"          "y ~ x2+x3"         
[11] "y ~ x2+x4"          "y ~ x2+x5"          "y ~ x3+x4"          "y ~ x3+x5"          "y ~ x4+x5"         
[16] "y ~ x1+x2+x3"       "y ~ x1+x2+x4"       "y ~ x1+x2+x5"       "y ~ x1+x3+x4"       "y ~ x1+x3+x5"      
[21] "y ~ x1+x4+x5"       "y ~ x2+x3+x4"       "y ~ x2+x3+x5"       "y ~ x2+x4+x5"       "y ~ x3+x4+x5"      
[26] "y ~ x1+x2+x3+x4"    "y ~ x1+x2+x3+x5"    "y ~ x1+x2+x4+x5"    "y ~ x1+x3+x4+x5"    "y ~ x2+x3+x4+x5"   
[31] "y ~ x1+x2+x3+x4+x5"

This can now be passed to your logistic regression function.

Example:

Let's use the mtcars dataset, with mpg as dependent variable.

X = names(mtcars[,-1])
X
 [1] "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear" "carb"

Now, let's use the aforementioned function:

out <- unlist(lapply(1:length(X), function(n) combn(X, n, FUN=function(row) paste0("mpg ~ ", paste0(row, collapse = "+")))))

which gives us a vector of all combinations as formulas.

To run the corresponding models, we can do for instance

mods = lapply(out, function(frml) lm(frml, data=mtcars))

Since you want to capture specific statistics and order the models accordingly, I would use broom::glance. broom::tidy turns lm output into a dataframe (useful if you want to compare coefficients etc) and broom::glance turns e.g. r-squared, sigma, the F-statistic, the logLikelihood, AIC, BIC etc into a dataframe. For instance:

library(broom)
library(dplyr)
tmp = bind_rows(lapply(out, function(frml) {
  a = glance(lm(frml, data=mtcars))
  a$frml = frml
  return(a)
}))

head(tmp)
  r.squared adj.r.squared    sigma statistic      p.value df    logLik      AIC      BIC deviance df.residual       frml
1 0.7261800     0.7170527 3.205902 79.561028 6.112687e-10  2 -81.65321 169.3064 173.7036 308.3342          30  mpg ~ cyl
2 0.7183433     0.7089548 3.251454 76.512660 9.380327e-10  2 -82.10469 170.2094 174.6066 317.1587          30 mpg ~ disp
3 0.6024373     0.5891853 3.862962 45.459803 1.787835e-07  2 -87.61931 181.2386 185.6358 447.6743          30   mpg ~ hp
4 0.4639952     0.4461283 4.485409 25.969645 1.776240e-05  2 -92.39996 190.7999 195.1971 603.5667          30 mpg ~ drat
5 0.7528328     0.7445939 3.045882 91.375325 1.293959e-10  2 -80.01471 166.0294 170.4266 278.3219          30   mpg ~ wt
6 0.1752963     0.1478062 5.563738  6.376702 1.708199e-02  2 -99.29406 204.5881 208.9853 928.6553          30 mpg ~ qsec

which you can sort as you wish.

How to run lm models using all possible combinations of several variables and a factor

I suspect the dredge function in the MuMIn package would help you. You specify a "full" model with all parameters you want to include and then run dredge(fullmodel) to get all combinations nested within the full model.

You should then be able to get the coefficients and AIC values from the results of this.

Something like:

require(MuMIn)
data(iris)

globalmodel <- lm(Sepal.Length ~ Petal.Length + Petal.Width + Species, data = iris)

combinations <- dredge(globalmodel)

print(combinations)

to get the parameter estimates for all models (a bit messy) you can then use

coefTable(combinations)

or to get the coefficients for a particular model you can index that using the row number in the dredge object, e.g.

coefTable(combinations)[1]

to get the coefficients in the model at row 1. This should also print coefficients for factor levels.

See the MuMIn helpfile for more details and ways to extract information.

Hope that helps.

Looping over combinations of regression model terms

With this dummy data:

dat1 <- data.frame(y = rpois(100,5),
x1 = runif(100),
x2 = runif(100),
x3 = runif(100),
z1 = runif(100),
z2 = runif(100)
)

You could get your list of two lm objects this way:

 lapply(dat1[5:6], function(x) lm(dat1$y ~ dat1$x1 + dat1$x2 + dat1$x3 + x))

Which iterates through those two columns and substitutes them as arguments into the lm call.

As Alex notes below, it's preferable to pass the names through the formula, rather than the actual data columns as I have done here.

different possible combinations of variables for a generalized linear model

This is called dredging:

library(MuMIn)
data(Cement)
fm1 <- lm(y ~ ., data = Cement)
dd <- dredge(fm1)

Global model call: lm(formula = y ~ ., data = Cement)
---
Model selection table 
   (Intrc)    X1      X2      X3      X4 df  logLik  AICc delta weight
4    52.58 1.468  0.6623                  4 -28.156  69.3  0.00  0.566
12   71.65 1.452  0.4161         -0.2365  5 -26.933  72.4  3.13  0.119
8    48.19 1.696  0.6569  0.2500          5 -26.952  72.5  3.16  0.116
10  103.10 1.440                 -0.6140  4 -29.817  72.6  3.32  0.107
14  111.70 1.052         -0.4100 -0.6428  5 -27.310  73.2  3.88  0.081
15  203.60       -0.9234 -1.4480 -1.5570  5 -29.734  78.0  8.73  0.007
16   62.41 1.551  0.5102  0.1019 -0.1441  6 -26.918  79.8 10.52  0.003
13  131.30               -1.2000 -0.7246  4 -35.372  83.7 14.43  0.000
7    72.07        0.7313 -1.0080          4 -40.965  94.9 25.62  0.000
9   117.60                       -0.7382  3 -45.872 100.4 31.10  0.000
3    57.42        0.7891                  3 -46.035 100.7 31.42  0.000
11   94.16        0.3109         -0.4569  4 -45.761 104.5 35.21  0.000
2    81.48 1.869                          3 -48.206 105.1 35.77  0.000
6    72.35 2.312          0.4945          4 -48.005 109.0 39.70  0.000
5   110.20               -1.2560          3 -50.980 110.6 41.31  0.000
1    95.42                                2 -53.168 111.5 42.22  0.000

Getting specific combination of interaction as variable in logistic regression with R

If you set the reference for religious to be "Somewhat religious" first. We can look at the results first :

library(broom)
dataset$religious = relevel(dataset$religious,ref="Somewhat religious")

fit0 = glm(family_role_recoded ~ urban_rural*religious,data=dataset,family=binomial())

    # A tibble: 6 x 5
  term                                   estimate std.error statistic    p.value
  <chr>                                     <dbl>     <dbl>     <dbl>      <dbl>
1 (Intercept)                             -0.902      0.181   -4.99      6.03e-7
2 urban_ruralRural                        -0.484      0.532   -0.910     3.63e-1
3 religiousReligious                      -0.0141     0.456   -0.0308    9.75e-1
4 religiousNot religious                   1.47       0.391    3.76      1.67e-4
5 urban_ruralRural:religiousReligious      0.995      1.14     0.876     3.81e-1
6 urban_ruralRural:religiousNot religio…   0.201      0.993    0.203     8.39e-1

You have one of the terms rural/religious. Intuitively, the Urban/Not religious term would be the flip of urban_ruralRural:religiousNot religio. We can also manually define the interaction terms we need:

dataset$Rural_religious = with(dataset,as.numeric(urban_rural=="Rural" & religious=="Religious"))

dataset$Urban_not_religious = with(dataset,as.numeric(urban_rural=="Urban" & religious=="Not religious"))

fit = glm(family_role_recoded ~ 0+urban_rural+religious+Urban_not_religious+Rural_religious,data=dataset,family=binomial())

tidy(fit)
# A tibble: 6 x 5
  term                   estimate std.error statistic     p.value
  <chr>                     <dbl>     <dbl>     <dbl>       <dbl>
1 urban_ruralUrban        -0.902      0.181   -4.99   0.000000603
2 urban_ruralRural        -1.39       0.500   -2.77   0.00556    
3 religiousReligious      -0.0141     0.456   -0.0308 0.975      
4 religiousNot religious   1.67       0.913    1.83   0.0667     
5 Urban_not_religious     -0.201      0.993   -0.203  0.839      
6 Rural_religious          0.995      1.14     0.876  0.381

Linear models in R with different combinations of variables

Here's one way to get all of the combinations of variables using the combn function. It's a bit messy, and uses a loop (perhaps someone can improve on this with mapply):

vars <- c("price","model","size","year","color")
N <- list(1,2,3,4)
COMB <- sapply(N, function(m) combn(x=vars[2:5], m))
COMB2 <- list()
k=0
for(i in seq(COMB)){
    tmp <- COMB[[i]]
    for(j in seq(ncol(tmp))){
        k <- k + 1
        COMB2[[k]] <- formula(paste("price", "~", paste(tmp[,j], collapse=" + ")))
    }
}

Then, you can call these formulas and store the model objects using a list or possibly give unique names with the assign function:

res <- vector(mode="list", length(COMB2))
for(i in seq(COMB2)){
    res[[i]] <- lm(COMB2[[i]], data=data)
}

Logistic Regression: How to Try Every Combination of Predictors in R