Stepwise Regression Using P-Values to Drop Variables with Nonsignificant P-Values

Show your boss the following:

set.seed(100)
x1 <- runif(100, 0, 1)
x2 <- as.factor(sample(letters[1:3], 100, replace = TRUE))

y <- x1 + x1*(x2 == "a") + 2*(x2 == "b") + rnorm(100)
summary(lm(y ~ x1*x2))

Which gives:

             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -0.1525     0.3066  -0.498  0.61995    
x1            1.8693     0.6045   3.092  0.00261 ** 
x2b           2.5149     0.4334   5.802 8.77e-08 ***
x2c           0.3089     0.4475   0.690  0.49180    
x1:x2b       -1.1239     0.8022  -1.401  0.16451    
x1:x2c       -1.0497     0.7873  -1.333  0.18566    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Now, based on the p-values, which one would you exclude? x2 is the most significant and the most non-significant variable at the same time.


Edit: To clarify: this example is not the best, as indicated in the comments. The procedure in Stata and SPSS is, AFAIK, also not based on the p-values of the t-test on the coefficients, but on the F-test after removal of one of the variables.
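
For comparison, here is a minimal sketch of such a term-wise F-test in R, using drop1() on the toy model above. Note that drop1() by default only considers terms whose removal respects marginality, so with the full scope only the interaction would normally be dropped first:

# Term-wise F-tests rather than per-coefficient t-tests
fit <- lm(y ~ x1*x2)
drop1(fit, test = "F")   # only x1:x2 can be dropped while x1 and x2 stay in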

I have a function that does exactly that: a selection based on "the p-value", but not the p-value of the t-test on the coefficients or of the sequential anova; it uses the F-tests from drop1() on whole terms. Feel free to use it if it looks useful to you.

#####################################
# Automated model selection
# Author : Joris Meys
# version : 0.2
# date : 12/01/09
#####################################
#CHANGE LOG
# 0.2 : check for empty scopevar vector
#####################################

# Function has.interaction checks whether x is part of a term in terms
# terms is a vector with names of terms from a model
has.interaction <- function(x, terms){
    out <- sapply(terms, function(i){
        sum(1 - (strsplit(x, ":")[[1]] %in% strsplit(i, ":")[[1]])) == 0
    })
    return(sum(out) > 0)
}
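
For example, a quick check of what it returns:

has.interaction("x1", c("x2", "x1:x2"))   # TRUE: x1 appears in the x1:x2 term
has.interaction("x1", c("x2", "x2:x3"))   # FALSE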

# Function model.select
# model   : the lm object of the full model
# keep    : a vector of model terms to keep in the model at all times
# sig     : the significance level for removal of a variable. Can be 0.1 too (see SPSS)
# verbose : if TRUE, prints the F-tests, the dropped variable and the
#           resulting model after each step
model.select <- function(model, keep, sig = 0.05, verbose = FALSE){
    counter <- 1
    # check input
    if(!is(model, "lm")) stop(paste(deparse(substitute(model)), "is not an lm object\n"))

    # calculate scope for the drop1 function
    terms <- attr(model$terms, "term.labels")
    if(missing(keep)){ # set scopevars to all terms
        scopevars <- terms
    } else { # select the scopevars if keep is used
        index <- match(keep, terms)
        # check if everything is specified correctly
        if(sum(is.na(index)) > 0){
            novar <- keep[is.na(index)]
            warning(paste(
                c(novar, "cannot be found in the model",
                  "\nThese terms are ignored in the model selection."),
                collapse = " "))
            index <- as.vector(na.omit(index))
        }
        scopevars <- terms[-index]
    }

    # Backward model selection :
    while(TRUE){
        # extract the test statistics from drop1
        test <- drop1(model, scope = scopevars, test = "F")

        if(verbose){
            cat("-------------STEP ", counter, "-------------\n",
                "The drop statistics : \n")
            print(test)
        }

        pval <- test[, dim(test)[2]]
        names(pval) <- rownames(test)
        pval <- sort(pval, decreasing = TRUE)

        if(sum(is.na(pval)) > 0) stop(paste("Model",
            deparse(substitute(model)),
            "is invalid. Check if all coefficients are estimated."))

        # check if all significant
        if(pval[1] < sig) break # stops the loop if all remaining vars are significant

        # select the variable to drop: skip terms that are still part of
        # a higher-order interaction in the model
        i <- 1
        while(TRUE){
            dropvar <- names(pval)[i]
            check.terms <- terms[-match(dropvar, terms)]
            x <- has.interaction(dropvar, check.terms)
            if(x){i <- i + 1; next} else {break}
        } # end while(TRUE) drop var

        if(pval[i] < sig) break # stops the loop if the variable to remove is significant

        if(verbose){
            cat("\n--------\nTerm dropped in step", counter, ":", dropvar, "\n--------\n\n")
        }

        # update terms, scopevars and model
        scopevars <- scopevars[-match(dropvar, scopevars)]
        terms <- terms[-match(dropvar, terms)]

        formul <- as.formula(paste(".~.-", dropvar))
        model <- update(model, formul)

        if(length(scopevars) == 0){
            warning("All variables are thrown out of the model.\n",
                    "No model could be specified.")
            return()
        }
        counter <- counter + 1
    } # end while(TRUE) main loop
    return(model)
}
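
For example, applied to the toy model from the beginning of this answer (assuming has.interaction() and model.select() above have been sourced):

full <- lm(y ~ x1*x2)
reduced <- model.select(full, sig = 0.05, verbose = TRUE)  # drops terms with the largest F-test p-value
summary(reduced)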

How to eliminate variables with p-value > 0.7 before computing stepwise polynomial regression?

What you probably want is to exclude, for each variable, everything above the highest polynomial degree where p <= .7 (lower degrees should be kept). Assuming you know what you're doing, you could write a function degAna() that analyzes the degrees of each polynomial and apply it to the coefficients matrix obtained from summary().

REG19 <- lm(R10 ~ poly(M1, 3) + poly(M2, 3) + poly(M3, 3) + poly(M4, 3) +
poly(M5, 3) + poly(M6, 3) + poly(M7, 3) + poly(M8, 3) +
poly(M9, 3) + poly(M10, 3), WorkData)

rr <- summary(REG19)$coefficients

A helper function that checks, for a given degree d, which polynomial terms have p <= .7:

degAna <- function(d) {
    out <- as.matrix(rr[grep(paste0(")", d), rownames(rr)), "Pr(>|t|)"] <= .7)
    dimnames(out) <- list(c(gsub("^.*\\((.*)\\,.+", "\\1", rownames(out))), d)
    return(out)
}

Apply degAna to each degree with lapply and bind the results into one matrix:

dM <- do.call(cbind, lapply(1:3, degAna))  # max. degree always 3 as in the example
#         1     2     3
# M1   TRUE  TRUE  TRUE
# M2   TRUE  TRUE  TRUE
# M3  FALSE  TRUE  TRUE
# M4   TRUE  TRUE  TRUE
# M5   TRUE  TRUE  TRUE
# M6   TRUE FALSE  TRUE
# M7   TRUE FALSE FALSE
# M8   TRUE  TRUE  TRUE
# M9   TRUE  TRUE FALSE
# M10  TRUE FALSE  TRUE

Now we need, for each variable, the highest degree where p <= .7:

tM <- apply(dM, 1, function(x) max(which(x != 0)))
tM <- tM[tM > 0] # drops variables for which every degree has p > .7
#  M1  M2  M3  M4  M5  M6  M7  M8  M9 M10
#   3   3   3   3   3   3   1   3   2   3

(Note that apply() will throw a warning if every degree of a variable has p > .7, i.e. if a row is completely FALSE. Since those variables are thrown out in the next line anyway, the warning can be silenced with apply(dM, 1, function(x) suppressWarnings(max(which(x != 0)))).)

With this information we can cobble together a new formula with reformulate:

terms.new <- paste0("poly(", names(tM), ", ", tM, ")")
FO <- reformulate(terms.new, response="R10")
# R10 ~ poly(M1, 3) + poly(M2, 3) + poly(M3, 3) + poly(M4, 3) +
# poly(M5, 3) + poly(M6, 3) + poly(M7, 1) + poly(M8, 3) + poly(M9,
# 2) + poly(M10, 3)

with which we can finally run the desired shortened regression.

REG19.2 <- lm(FO, WorkData)

n <- length(resid(REG19.2))
REG20.2 <- step(REG19.2, direction="backward", k=log(n))
# [...]
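
(Setting k = log(n) makes step() use the BIC penalty instead of its default AIC penalty of k = 2.)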

Simulated Data

set.seed(42)
M1 <- rnorm(1e3)
M2 <- rnorm(1e3)
M3 <- rnorm(1e3)
M4 <- rnorm(1e3)
M5 <- rnorm(1e3)
M6 <- rnorm(1e3)
M7 <- rnorm(1e3)
M8 <- rnorm(1e3)
M9 <- rnorm(1e3)
M10 <- rnorm(1e3)
R10 <- 6 + 5*M1^3 + 4.5*M2^3 + 4*M3^2 + 3.5*M4 + 3*M5 + 2.5*M6 + 2*M7 +
.5*rnorm(1e3, 1, sd=20)
WorkData <- data.frame(M1, M2, M3, M4, M5, M6, M7, M8, M9, M10, R10)

Why can't we use AIC and p-value variable selection within the same model building exercise?

AIC is a model selection criterion that does not favour any particular null hypothesis.

Contrary to this, with a hypothesis test (and with p-values) you choose to reject or not reject a null hypothesis based on whether some (subjective) significance criterion is met. The hypothesis test generally favours the null hypothesis: with a strict significance level you can end up not rejecting the null hypothesis, and thus not selecting the more complex model, even when the more complex model improves the likelihood by a lot.

Selection based on AIC is closely related to Wilks' hypothesis test (which approximates the distribution of the likelihood ratio statistic by a chi-squared distribution). When you compare a full model with a model that has one parameter less, the difference in AIC is related to the log likelihood ratio:

$$\lambda_{LR} = - 2 \log \left(\frac{L_0}{L_1}\right) = -2\left( \log L_0 - \log L_1 \right) = \mathrm{AIC}_0 - \mathrm{AIC}_1 + 2 \sim \chi^2_1 $$

Here the $\sim$ means asymptotically equal in distribution (more often it is read as 'is distributed as').

For instance, when we compare the full model (AIC = 519.54) with the model without the cement variable (AIC = 535.27), the log likelihood ratio is 11.73, and this corresponds to a p-value of 0.0002110562, very similar to the p-value of your ANOVA test, which is 0.0002355.
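
The relation between the AIC difference and the likelihood ratio statistic can be checked numerically; a minimal sketch using R's built-in swiss data (not the data from the question):

full <- lm(Fertility ~ Education + Catholic + Infant.Mortality, data = swiss)
red  <- update(full, . ~ . - Infant.Mortality)          # model with one parameter less

lambda <- as.numeric(2 * (logLik(full) - logLik(red)))  # likelihood ratio statistic
lambda
AIC(red) - AIC(full) + 2                                # same value via the AIC difference
pchisq(lambda, df = 1, lower.tail = FALSE)              # Wilks' chi-squared p-value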

When you select a model based on the lowest AIC, you are effectively using a cutoff of 2 for the log likelihood ratio statistic (if the difference is only one parameter), and this corresponds to a p-value of 0.157, much less strict than common significance levels like 0.05, 0.01 or smaller.
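
That cutoff is easy to verify:

pchisq(2, df = 1, lower.tail = FALSE)   # about 0.157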


The reason that you should not mix AIC-based and F-test-based selection is that they follow different philosophies.

The hypothesis test is subjective, and the philosophy behind a hypothesis test is different from that of model selection. With hypothesis testing you generally place a stronger preference on the model with fewer variables: the smaller model is the null hypothesis, and rejecting the null hypothesis requires strong evidence.

This subjectivity is good practice when your goal is hypothesis testing, for instance when you are testing some scientific theory and want no doubt about the parameters selected in the model. But for model selection, when there is no obvious reason to favour a null hypothesis and a model with fewer parameters, hypothesis testing makes less sense.

Note that, technically speaking, what you did is not wrong. In particular, you can do stepwise model selection based on an F-test (ANOVA) instead of on AIC, and this is done a lot. However, it is a bit superfluous to mix them and first select based on AIC and then add a final step with selection based on an F-test; you could have used F-test-based selection from the start.

The function stepwise from the package StepReg lets you do such stepwise regression based on repeated ANOVA tests, and has parameters that let you change the default p-value of 0.157 (which roughly corresponds to AIC-based selection) to some other level.

Example code:

Note that the p-value for Infant.Mortality is 0.00169; with a selection criterion of 0.0017 the variable is selected, with 0.0016 it is not.

> library(StepReg)
> summary(lm(Fertility ~ Education + Catholic + Infant.Mortality, data = swiss))

Call:
lm(formula = Fertility ~ Education + Catholic + Infant.Mortality,
    data = swiss)

Residuals:
     Min       1Q   Median       3Q      Max
-14.4781  -5.4403  -0.5143   4.1568  15.1187

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      48.67707    7.91908   6.147 2.24e-07 ***
Education        -0.75925    0.11680  -6.501 6.83e-08 ***
Catholic          0.09607    0.02722   3.530  0.00101 ** 
Infant.Mortality  1.29615    0.38699   3.349  0.00169 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.505 on 43 degrees of freedom
Multiple R-squared:  0.6625,    Adjusted R-squared:  0.639
F-statistic: 28.14 on 3 and 43 DF,  p-value: 3.15e-10

> stepwise(Fertility ~ ., data = swiss, sle = 0.0016, select = "SL")$Process
  Step EnteredEffect RemovedEffect DF NumberEffectIn NumberParmsIn                   SL
1    0             1                1              0             1                    1
2    1     Education                1              1             2 3.65861696596238e-07
3    2      Catholic                1              2             3 0.000559833157909886
> stepwise(Fertility ~ ., data = swiss, sle = 0.0017, select = "SL")$Process
  Step    EnteredEffect RemovedEffect DF NumberEffectIn NumberParmsIn                   SL
1    0                1                1              0             1                    1
2    1        Education                1              1             2 3.65861696596238e-07
3    2         Catholic                1              2             3 0.000559833157909886
4    3 Infant.Mortality                1              3             4  0.00169375295474758

Sidenote: the reason that you ended up removing the I(water^2) term is slightly different. It is because the ANOVA is computed by sequentially removing the terms: you were comparing the removal of I(water^2) after Water:Coarse.Aggregate had already been removed.
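
To illustrate the difference between sequential and term-wise tests, a small sketch on the swiss data used above (not the concrete data from the question):

fit <- lm(Fertility ~ Education + Catholic + Infant.Mortality, data = swiss)
anova(fit)              # sequential (Type I) tests: each term is tested after the terms above it
drop1(fit, test = "F")  # each term tested against the model containing all the other terms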

Equivalent of SignifReg function in R for logit regressions?

The package rms has a function fastbw which accepts rms model fits such as logistic regression and Cox proportional hazards models. It is based on Lawless and Singhal (1978), 'a first-order approximation that has greater numerical efficiency than full backward selection'.
E.g.:

Glmfullfit <- Glm(Hypertension ~ ., family = binomial, data = mydata)  # Glm from rms
fastbw(Glmfullfit, rule = 'p', type = 'individual', sls = 0.1)          # retention p-value = 0.1

In SAS, the equivalent is selection method = backward(fast) (I did not try this code because of SAS cloud). Don't use the coefficients from the fastbw results directly, because of the approximation fastbw uses; take the variables that survive and refit them with glm to get the coefficients and other parameters.
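
A minimal sketch of that refit step, using the hypothetical data and variable names from the example above and assuming the fastbw result exposes the kept terms in its names.kept component:

bw   <- fastbw(Glmfullfit, rule = 'p', type = 'individual', sls = 0.1)
kept <- bw$names.kept                                    # assumed component: terms that survived
refit <- glm(reformulate(kept, response = "Hypertension"),
             family = binomial, data = mydata)
summary(refit)                                           # use these coefficients, not fastbw's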


