Logistic regression: how to try every combination of predictors in R?
I am not sure about the value of this "educational exercise", but for the sake of programming, here would be my approach:
First, let's create some example predictor names. I use 5 predictors as in your example, but for 10, you would obviously need to replace 5 with 10.
X = paste0("x",1:5)
X
[1] "x1" "x2" "x3" "x4" "x5"
Now, we can get the combinations with combn
.
For instance, for one variable at a time:
t(combn(X,1))
[,1]
[1,] "x1"
[2,] "x2"
[3,] "x3"
[4,] "x4"
[5,] "x5"
Two variables at a time:
> t(combn(X,2))
[,1] [,2]
[1,] "x1" "x2"
[2,] "x1" "x3"
[3,] "x1" "x4"
[4,] "x1" "x5"
[5,] "x2" "x3"
[6,] "x2" "x4"
[7,] "x2" "x5"
[8,] "x3" "x4"
[9,] "x3" "x5"
[10,] "x4" "x5"
etc.
We can use lapply
to call these functions successively with an increasing number of variables to consider, and to catch the results in a list. For instance, have a look at the output of lapply(1:5, function(n) t(combn(X,n)))
. To turn these combinations into formulas, we can use the following:
out <- unlist(lapply(1:5, function(n) {
# get combinations
combinations <- t(combn(X,n))
# collapse them into usable formulas:
formulas <- apply(combinations, 1,
function(row) paste0("y ~ ", paste0(row, collapse = "+")))}))
Or equivalently using the FUN
argument of combn
(as pointed out by user20650):
out <- unlist(lapply(1:5, function(n) combn(X, n, FUN=function(row) paste0("y ~ ", paste0(row, collapse = "+")))))
This gives:
out
[1] "y ~ x1" "y ~ x2" "y ~ x3" "y ~ x4" "y ~ x5"
[6] "y ~ x1+x2" "y ~ x1+x3" "y ~ x1+x4" "y ~ x1+x5" "y ~ x2+x3"
[11] "y ~ x2+x4" "y ~ x2+x5" "y ~ x3+x4" "y ~ x3+x5" "y ~ x4+x5"
[16] "y ~ x1+x2+x3" "y ~ x1+x2+x4" "y ~ x1+x2+x5" "y ~ x1+x3+x4" "y ~ x1+x3+x5"
[21] "y ~ x1+x4+x5" "y ~ x2+x3+x4" "y ~ x2+x3+x5" "y ~ x2+x4+x5" "y ~ x3+x4+x5"
[26] "y ~ x1+x2+x3+x4" "y ~ x1+x2+x3+x5" "y ~ x1+x2+x4+x5" "y ~ x1+x3+x4+x5" "y ~ x2+x3+x4+x5"
[31] "y ~ x1+x2+x3+x4+x5"
This can now be passed to your logistic regression function.
Example:
Let's use the mtcars
dataset, with mpg
as dependent variable.
X = names(mtcars[,-1])
X
[1] "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" "carb"
Now, let's use the aforementioned function:
out <- unlist(lapply(1:length(X), function(n) combn(X, n, FUN=function(row) paste0("mpg ~ ", paste0(row, collapse = "+")))))
which gives us a vector of all combinations as formulas.
To run the corresponding models, we can do for instance
mods = lapply(out, function(frml) lm(frml, data=mtcars))
Since you want to capture specific statistics and order the models accordingly, I would use broom::glance
. broom::tidy
turns lm
output into a dataframe (useful if you want to compare coefficients etc) and broom::glance
turns e.g. r-squared, sigma, the F-statistic, the logLikelihood, AIC, BIC etc into a dataframe. For instance:
library(broom)
library(dplyr)
tmp = bind_rows(lapply(out, function(frml) {
a = glance(lm(frml, data=mtcars))
a$frml = frml
return(a)
}))
head(tmp)
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual frml
1 0.7261800 0.7170527 3.205902 79.561028 6.112687e-10 2 -81.65321 169.3064 173.7036 308.3342 30 mpg ~ cyl
2 0.7183433 0.7089548 3.251454 76.512660 9.380327e-10 2 -82.10469 170.2094 174.6066 317.1587 30 mpg ~ disp
3 0.6024373 0.5891853 3.862962 45.459803 1.787835e-07 2 -87.61931 181.2386 185.6358 447.6743 30 mpg ~ hp
4 0.4639952 0.4461283 4.485409 25.969645 1.776240e-05 2 -92.39996 190.7999 195.1971 603.5667 30 mpg ~ drat
5 0.7528328 0.7445939 3.045882 91.375325 1.293959e-10 2 -80.01471 166.0294 170.4266 278.3219 30 mpg ~ wt
6 0.1752963 0.1478062 5.563738 6.376702 1.708199e-02 2 -99.29406 204.5881 208.9853 928.6553 30 mpg ~ qsec
which you can sort as you wish.
How to run lm models using all possible combinations of several variables and a factor
I suspect the dredge
function in the MuMIn package would help you. You specify a "full" model with all parameters you want to include and then run dredge(fullmodel)
to get all combinations nested within the full model.
You should then be able to get the coefficients and AIC values from the results of this.
Something like:
require(MuMIn)
data(iris)
globalmodel <- lm(Sepal.Length ~ Petal.Length + Petal.Width + Species, data = iris)
combinations <- dredge(globalmodel)
print(combinations)
to get the parameter estimates for all models (a bit messy) you can then use
coefTable(combinations)
or to get the coefficients for a particular model you can index that using the row number in the dredge object, e.g.
coefTable(combinations)[1]
to get the coefficients in the model at row 1. This should also print coefficients for factor levels.
See the MuMIn helpfile for more details and ways to extract information.
Hope that helps.
Looping over combinations of regression model terms
With this dummy data:
dat1 <- data.frame(y = rpois(100,5),
x1 = runif(100),
x2 = runif(100),
x3 = runif(100),
z1 = runif(100),
z2 = runif(100)
)
You could get your list of two lm
objects this way:
lapply(dat1[5:6], function(x) lm(dat1$y ~ dat1$x1 + dat1$x2 + dat1$x3 + x))
Which iterates through those two columns and substitutes them as arguments into the lm
call.
As Alex notes below, it's preferable to pass the names through the formula, rather than the actual data columns as I have done here.
different possible combinations of variables for a generalized linear model
This is called dredging:
library(MuMIn)
data(Cement)
fm1 <- lm(y ~ ., data = Cement)
dd <- dredge(fm1)
Global model call: lm(formula = y ~ ., data = Cement)
---
Model selection table
(Intrc) X1 X2 X3 X4 df logLik AICc delta weight
4 52.58 1.468 0.6623 4 -28.156 69.3 0.00 0.566
12 71.65 1.452 0.4161 -0.2365 5 -26.933 72.4 3.13 0.119
8 48.19 1.696 0.6569 0.2500 5 -26.952 72.5 3.16 0.116
10 103.10 1.440 -0.6140 4 -29.817 72.6 3.32 0.107
14 111.70 1.052 -0.4100 -0.6428 5 -27.310 73.2 3.88 0.081
15 203.60 -0.9234 -1.4480 -1.5570 5 -29.734 78.0 8.73 0.007
16 62.41 1.551 0.5102 0.1019 -0.1441 6 -26.918 79.8 10.52 0.003
13 131.30 -1.2000 -0.7246 4 -35.372 83.7 14.43 0.000
7 72.07 0.7313 -1.0080 4 -40.965 94.9 25.62 0.000
9 117.60 -0.7382 3 -45.872 100.4 31.10 0.000
3 57.42 0.7891 3 -46.035 100.7 31.42 0.000
11 94.16 0.3109 -0.4569 4 -45.761 104.5 35.21 0.000
2 81.48 1.869 3 -48.206 105.1 35.77 0.000
6 72.35 2.312 0.4945 4 -48.005 109.0 39.70 0.000
5 110.20 -1.2560 3 -50.980 110.6 41.31 0.000
1 95.42 2 -53.168 111.5 42.22 0.000
Getting specific combination of interaction as variable in logistic regression with R
If you set the reference for religious
to be "Somewhat religious"
first. We can look at the results first :
library(broom)
dataset$religious = relevel(dataset$religious,ref="Somewhat religious")
fit0 = glm(family_role_recoded ~ urban_rural*religious,data=dataset,family=binomial())
# A tibble: 6 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -0.902 0.181 -4.99 6.03e-7
2 urban_ruralRural -0.484 0.532 -0.910 3.63e-1
3 religiousReligious -0.0141 0.456 -0.0308 9.75e-1
4 religiousNot religious 1.47 0.391 3.76 1.67e-4
5 urban_ruralRural:religiousReligious 0.995 1.14 0.876 3.81e-1
6 urban_ruralRural:religiousNot religio… 0.201 0.993 0.203 8.39e-1
You have one of the terms rural/religious
. Intuitively, the Urban/Not religious
term would be the flip of urban_ruralRural:religiousNot religio
. We can also manually define the interaction terms we need:
dataset$Rural_religious = with(dataset,as.numeric(urban_rural=="Rural" & religious=="Religious"))
dataset$Urban_not_religious = with(dataset,as.numeric(urban_rural=="Urban" & religious=="Not religious"))
fit = glm(family_role_recoded ~ 0+urban_rural+religious+Urban_not_religious+Rural_religious,data=dataset,family=binomial())
tidy(fit)
# A tibble: 6 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 urban_ruralUrban -0.902 0.181 -4.99 0.000000603
2 urban_ruralRural -1.39 0.500 -2.77 0.00556
3 religiousReligious -0.0141 0.456 -0.0308 0.975
4 religiousNot religious 1.67 0.913 1.83 0.0667
5 Urban_not_religious -0.201 0.993 -0.203 0.839
6 Rural_religious 0.995 1.14 0.876 0.381
Linear models in R with different combinations of variables
Here's one way to get all of the combinations of variables using the combn
function. It's a bit messy, and uses a loop (perhaps someone can improve on this with mapply
):
vars <- c("price","model","size","year","color")
N <- list(1,2,3,4)
COMB <- sapply(N, function(m) combn(x=vars[2:5], m))
COMB2 <- list()
k=0
for(i in seq(COMB)){
tmp <- COMB[[i]]
for(j in seq(ncol(tmp))){
k <- k + 1
COMB2[[k]] <- formula(paste("price", "~", paste(tmp[,j], collapse=" + ")))
}
}
Then, you can call these formulas and store the model objects using a list
or possibly give unique names with the assign
function:
res <- vector(mode="list", length(COMB2))
for(i in seq(COMB2)){
res[[i]] <- lm(COMB2[[i]], data=data)
}
Related Topics
R: Finding the Intersect of Two Lines
Parallel R on a Windows Cluster
Choose Specific Number with Probability
Select a Sequence of Columns: ':' Works But Not 'Seq'
How to 'Subset' a Named Vector in R
Using Ggplot2 with Columns That Have Spaces in Their Names
As.Date Produces Unexpected Result in a Sequence of Week-Based Dates
Extract Only Quarter from a Date in R
Why Does "Hello" > 0 Return True
How to Configure R-3.0.1 with --Enable-R-Shlib
Replace Column Values with Column Name Using Dplyr's Transmute_All
Changing Names in a List of Dataframes
Adding an Image to a Datatable in R
Recode Multiple Columns Using Dplyr
Why Does 1..99,999 == "1".."99,999" in R, But 100,000 != "100,000"
Shiny Error in Match.Arg(Position):'Arg' Must Be Null or a Character Vector