Logistic Regression - Defining Reference Level in R

Assuming you have class saved as a factor, use the relevel() function:

auth$class <- relevel(auth$class, ref = "YES")
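For illustration, here is a minimal sketch of what relevel() changes. The data frame below is made up (only the names auth and class come from the answer above; x and the values are assumptions), and it assumes class is the outcome of the model.

```r
# Hypothetical stand-in for `auth`; x and the values are invented for illustration
set.seed(1)
auth <- data.frame(
  x     = rnorm(100),
  class = factor(rep(c("NO", "YES"), 50))
)

levels(auth$class)   # "NO" "YES": alphabetical, so "NO" is the reference
auth$class <- relevel(auth$class, ref = "YES")
levels(auth$class)   # "YES" "NO": "YES" is now the reference

# With a factor response, glm() models the probability of not being in the
# first level, which here is the probability of "NO"
fit <- glm(class ~ x, data = auth, family = binomial)
```

relevel() only moves the chosen level to the front; the data values themselves are unchanged.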

Confused with the reference level in logistic regression in R

If P(0) is the probability of 0 and P(1) is the probability of 1, then P(0) = 1 - P(1). Thus, you can always calculate the probability of the reference level, regardless of which level you set as the reference.

For example, predict(model1, type="response") gives you the probability of the non-reference level. 1 - predict(model1, type="response") gives you the probability of the reference level.
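A small sketch of that relationship, using simulated data (the data frame d, the variables x, y, y2 and the model model2 are assumptions made up for this example; only model1 is named above). Refitting with the other level as the reference should return the complementary probabilities.

```r
# Simulated two-level outcome; everything except the name model1 is hypothetical
set.seed(42)
d <- data.frame(x = rnorm(200))
d$y <- factor(ifelse(runif(200) < plogis(0.8 * d$x), "yes", "no"))  # levels: "no", "yes"

model1 <- glm(y ~ x, data = d, family = binomial)
p_yes  <- predict(model1, type = "response")  # P(y == "yes"), the non-reference level
p_no   <- 1 - p_yes                           # P(y == "no"), the reference level

# Refit with "yes" as the reference: the model now predicts P(y == "no")
d$y2   <- relevel(d$y, ref = "yes")
model2 <- glm(y2 ~ x, data = d, family = binomial)

# The two routes to P(y == "no") are expected to agree up to numerical tolerance
all.equal(unname(predict(model2, type = "response")), unname(p_no))
```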

You also asked, "what is glm() to predict in default if we use response other than '0' and '1'." For (binomial) logistic regression to be appropriate, your outcome needs to be a categorical variable with two categories. You can call them whatever you want: 0/1, black/white, because/otherwise, Mal/Serenity, etc. One will be the reference level (for a factor response, glm() takes the first level unless you change it), and the model will give you the probability of the other level. The probability of the reference level is just 1 minus the probability of the other level.

If your outcome has more than two categories, you can use a multinomial logistic regression model, but the principle is similar.

Logistic regression outcome variable predictions in r

By default, R orders the levels of a factor alphabetically. You can set your own order with:

df$Group <- factor(df$Group, levels=c('CON','CI'))

Then CON is used as the reference level in the logistic regression, and you should get the same results as with 0/1 coding (CON = 0, CI = 1).
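To see the equivalence, here is a sketch with simulated data. It assumes Group is the outcome; the predictor x and the column Group01 are hypothetical names introduced for the comparison.

```r
# Simulated data; x and Group01 are invented for illustration
set.seed(7)
df <- data.frame(x = rnorm(100))
df$Group01 <- rbinom(100, 1, plogis(0.5 * df$x))    # 0/1 coding of the outcome
df$Group   <- ifelse(df$Group01 == 1, "CI", "CON")  # same outcome as labels

# Alphabetical order would put "CI" first, so set the order explicitly
df$Group <- factor(df$Group, levels = c("CON", "CI"))

fit_factor <- glm(Group ~ x,   data = df, family = binomial)
fit_01     <- glm(Group01 ~ x, data = df, family = binomial)

# Both fits model the probability of "CI" (coded 1), so the coefficients match
all.equal(coef(fit_factor), coef(fit_01))
```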

Changing reference group for categorical predictor variable in logistic regression

Use the C() function to set the contrasts of the factor in the data frame.

If your data frame is DF and the factor variable is fct, then the following makes the third level the baseline:

DF$fct <- C(DF$fct, contr.treatment, base=3)

(untested).
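As a rough check of the idea, the sketch below compares C() with relevel() on a made-up data frame (the levels a/b/c, the outcome y, and the columns fct_c and fct_r are assumptions for this example). Both approaches should put the same level in the baseline, so the fitted coefficients should agree; the coefficient labels may differ between the two, which is why only the values are compared.

```r
# Hypothetical data: 20 rows per level, with 5, 10 and 15 successes respectively
DF <- data.frame(
  fct = factor(rep(c("a", "b", "c"), each = 20)),
  y   = c(rep(c(1, 0), c(5, 15)), rep(c(1, 0), c(10, 10)), rep(c(1, 0), c(15, 5)))
)

DF$fct_c <- C(DF$fct, contr.treatment, base = 3)  # third level "c" as baseline
DF$fct_r <- relevel(DF$fct, ref = "c")            # same baseline via level order

fit_c <- glm(y ~ fct_c, data = DF, family = binomial)
fit_r <- glm(y ~ fct_r, data = DF, family = binomial)

# Compare values only; the intercept is the log-odds in level "c", log(15/5)
all.equal(unname(coef(fit_c)), unname(coef(fit_r)))
```

Note the difference in mechanism: C() attaches a contrast matrix to the factor and leaves the level order alone, whereas relevel() reorders the levels.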

Is there a way to display the reference category in a regression output in R?

The reference level is the one that is missing from the summary, because the coefficients of the other levels are contrasts against the reference level. The intercept is the expected value for the reference category when all other predictors are zero (here, Petal.Length = 0).

iris <- transform(iris, Species_=factor(Species))  ## create factor

summary(lm(Sepal.Length ~ Petal.Length + Species_, iris))$coefficients
#                    Estimate Std. Error   t value      Pr(>|t|)
# (Intercept)         3.6835266 0.10609608 34.718780 1.968671e-72
# Petal.Length        0.9045646 0.06478559 13.962436 1.121002e-28
# Species_versicolor -1.6009717 0.19346616 -8.275203 7.371529e-14
# Species_virginica  -2.1176692 0.27346121 -7.743947 1.480296e-12

You could remove the intercept to get the missing level displayed, but that does not make much sense. You then get a separate intercept for each level rather than contrasts, whereas what you are interested in is the contrast between the reference level and the other levels.

summary(lm(Sepal.Length ~ 0 + Petal.Length + Species_, iris))$coefficients
#                     Estimate Std. Error   t value     Pr(>|t|)
# Petal.Length       0.9045646 0.06478559 13.962436 1.121002e-28
# Species_setosa     3.6835266 0.10609608 34.718780 1.968671e-72
# Species_versicolor 2.0825548 0.28009598  7.435147 8.171219e-12
# Species_virginica  1.5658574 0.36285224  4.315413 2.921850e-05

If you're not sure, with R's default treatment contrasts the reference level is always the first level of the factor.

levels(iris$Species_)[1]
# [1] "setosa"

To check that, specify a different reference level and see whether it comes first.

iris$Species_ <- relevel(iris$Species_, ref='versicolor')

levels(iris$Species_)[1]
# [1] "versicolor"

It is common to refer to the reference level in a note under the table in the report, and I recommend that you do the same.
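If you want to generate such a note from the fitted model rather than typing it by hand, one possible sketch is below. It relies on the xlevels component that lm() and glm() fits store, and assumes the default treatment contrasts, so that the first stored level is the reference. The names fit, ref_levels and note are made up for this example.

```r
fit <- lm(Sepal.Length ~ Petal.Length + Species, data = iris)

# `xlevels` holds the levels of each factor predictor used in the fit;
# under the default treatment contrasts the first one is the reference
ref_levels <- vapply(fit$xlevels, function(lv) lv[1], character(1))
ref_levels   # Species: "setosa"

# One line of text per factor predictor, ready to paste under a table
note <- paste0("Reference level for ", names(ref_levels), ": ", ref_levels)
note
```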