How to Force R to Use a Specified Factor Level as Reference in a Regression

How to force R to use a specified factor level as reference in a regression?

See the relevel() function. Here is an example:

set.seed(123)
x <- rnorm(100)
DF <- data.frame(x = x,
                 y = 4 + (1.5*x) + rnorm(100, sd = 2),
                 b = gl(5, 20))
head(DF)
str(DF)

m1 <- lm(y ~ x + b, data = DF)
summary(m1)

Now alter the factor b in DF by use of the relevel() function:

DF <- within(DF, b <- relevel(b, ref = 3))
m2 <- lm(y ~ x + b, data = DF)
summary(m2)

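The two models use different reference levels: m1 uses level 1 of b and m2 uses level 3, so the b coefficients are contrasts against different baselines.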

> coef(m1)
(Intercept)           x          b2          b3          b4          b5 
  3.2903239   1.4358520   0.6296896   0.3698343   1.0357633   0.4666219 
> coef(m2)
(Intercept)           x          b1          b2          b4          b5 
 3.66015826  1.43585196 -0.36983433  0.25985529  0.66592898  0.09678759
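Releveling changes only the parameterization, not the fit. A minimal sketch that rebuilds the same simulated data and checks this (DF3, same_fit and shifted are names introduced here purely for illustration):

```r
set.seed(123)
x <- rnorm(100)
DF <- data.frame(x = x,
                 y = 4 + (1.5 * x) + rnorm(100, sd = 2),
                 b = gl(5, 20))
m1 <- lm(y ~ x + b, data = DF)

# Hypothetical copy of DF in which level 3 of b comes first
DF3 <- within(DF, b <- relevel(b, ref = 3))
m2 <- lm(y ~ x + b, data = DF3)

# Same fitted values: only the baseline of the contrasts differs
same_fit <- isTRUE(all.equal(fitted(m1), fitted(m2)))

# The new intercept equals the old intercept plus the old b3 coefficient
shifted <- isTRUE(all.equal(
  unname(coef(m2)["(Intercept)"]),
  unname(coef(m1)["(Intercept)"] + coef(m1)["b3"])
))

same_fit
shifted
```

Both checks should come out TRUE, which matches the printed coefficients: 3.2903239 + 0.3698343 = 3.6601582.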

Is there a way to display the reference category in a regression output in R?

The reference level is the one that is missing from the summary. The coefficients of the other levels are contrasts to the reference level, and the intercept represents the expected value in the reference category (with all other predictors at zero).

iris <- transform(iris, Species_=factor(Species))  ## create factor

summary(lm(Sepal.Length ~ Petal.Length + Species_, iris))$coefficients
#                    Estimate Std. Error   t value      Pr(>|t|)
# (Intercept)         3.6835266 0.10609608 34.718780 1.968671e-72
# Petal.Length        0.9045646 0.06478559 13.962436 1.121002e-28
# Species_versicolor -1.6009717 0.19346616 -8.275203 7.371529e-14
# Species_virginica  -2.1176692 0.27346121 -7.743947 1.480296e-12

You could remove the intercept to get the missing level displayed, but that rarely makes sense. You then get a separate intercept for each level (its expected value at Petal.Length = 0) instead of contrasts, whereas you are usually interested in the contrast between the reference level and the other levels.

summary(lm(Sepal.Length ~ 0 + Petal.Length + Species_, iris))$coefficients
#                     Estimate Std. Error   t value     Pr(>|t|)
# Petal.Length       0.9045646 0.06478559 13.962436 1.121002e-28
# Species_setosa     3.6835266 0.10609608 34.718780 1.968671e-72
# Species_versicolor 2.0825548 0.28009598  7.435147 8.171219e-12
# Species_virginica  1.5658574 0.36285224  4.315413 2.921850e-05

If you're not sure, the reference level is the first level of the factor under R's default treatment contrasts (contr.treatment) for unordered factors.

levels(iris$Species_)[1]
# [1] "setosa"

To check that, specify a different reference level and see whether it comes first.

iris$Species_ <- relevel(iris$Species_, ref='versicolor')

levels(iris$Species_)[1]
# [1] "versicolor"
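Two other ways to read off the reference level, sketched here under the assumption that the default treatment contrasts are in effect. The data frame d and the variables ref_from_fit, cm and ref_from_contrasts are illustrative names; datasets::iris is used so the example starts from an unmodified copy of the data.

```r
d <- transform(datasets::iris,
               Species_ = relevel(factor(Species), ref = "versicolor"))
fit <- lm(Sepal.Length ~ Petal.Length + Species_, data = d)

# Factor levels as the fitted model saw them; the first one is the reference
ref_from_fit <- fit$xlevels$Species_[1]

# Contrast matrix of the factor: the all-zero row belongs to the reference level
cm <- contrasts(d$Species_)
ref_from_contrasts <- rownames(cm)[rowSums(abs(cm)) == 0]

ref_from_fit
ref_from_contrasts
```

Reading the level from the fitted object is handy when the data frame has been changed since the model was fitted.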

It is common to refer to the reference level in a note under the table in the report, and I recommend that you do the same.

Change reference level for variable in R

mode(DATA$COLOR) is "numeric" because R internally stores factors as numeric codes (to save space), plus an associated vector of labels corresponding to the code values. When you print the factor, R automatically substitutes the corresponding label for each code.

f <- factor(c("orange","banana","apple"))
f
## [1] orange banana apple 
## Levels: apple banana orange
str(f)
##  Factor w/ 3 levels "apple","banana",..: 3 2 1
as.integer(f)    ## drop the attributes to get the underlying integer codes
## [1] 3 2 1
attributes(f)
## $levels
## [1] "apple"  "banana" "orange"
## $class
## [1] "factor"

... I need to write R code to return the levels of the COLOR variable ...

levels(DATA$COLOR)

... then determine the current reference level of this variable,

levels(DATA$COLOR)[1]

... and finally set the reference level of this variable to White.

DATA$COLOR <- relevel(DATA$COLOR,"White")
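The three steps can be tried end to end on a small stand-in. The real DATA is not shown in the question, so the data frame below is hypothetical, as are the names old_ref and new_levels.

```r
# Hypothetical data; only the names DATA and COLOR come from the question
DATA <- data.frame(COLOR = factor(c("Red", "White", "Blue", "White", "Red")))

levels(DATA$COLOR)                 # all levels, sorted alphabetically by default
old_ref <- levels(DATA$COLOR)[1]   # current reference level

DATA$COLOR <- relevel(DATA$COLOR, ref = "White")
new_levels <- levels(DATA$COLOR)   # "White" moves to the front, the rest keep their order

old_ref
new_levels
```

If COLOR were stored as a character vector rather than a factor, it would need to be converted with factor() first, because relevel() only accepts unordered factors.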

Changing reference group for categorical predictor variable in logistic regression

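Use the C() function to set the contrasts of the factor in the data frame. Unlike relevel(), this leaves the order of the levels unchanged and stores the contrast coding on the factor itself; base is the index of the level to use as the baseline.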

If your dataframe is DF and the factor variable is fct, then

DF$fct <- C(DF$fct, contr.treatment, base=3)

(untested).
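A quick way to check the effect of the C() call is to inspect the contrast matrix and the coefficient names. This is a sketch on made-up data; DF, fct, y and coef_names are placeholders.

```r
set.seed(1)
# Hypothetical data frame with a five-level factor and a numeric response
DF <- data.frame(fct = gl(5, 20), y = rnorm(100))
DF$fct <- C(DF$fct, contr.treatment, base = 3)

cm <- contrasts(DF$fct)            # row "3" is all zeros, so level 3 is the baseline
coef_names <- names(coef(lm(y ~ fct, data = DF)))

cm
coef_names
levels(DF$fct)[1]                  # still "1": the level order itself is unchanged
```

Because the level order stays the same, levels(DF$fct)[1] no longer tells you the baseline after this call; look at contrasts(DF$fct) instead.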

Change the levels of the categorical predictor in glm in R

You can use the relevel() function to specify which level of the factor is the reference level. Assuming the variable Grupo is already a factor, this should work:

PAIS_PBI$Grupo <- relevel(PAIS_PBI$Grupo, ref = "BAJO")
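A minimal sketch of how this plays out in a logistic regression. The original PAIS_PBI data is not shown, so the data frame below is a simulated stand-in; the outcome column exito and the level names other than "BAJO" are assumptions made for illustration.

```r
set.seed(42)
# Hypothetical stand-in for PAIS_PBI; only Grupo and "BAJO" come from the question
PAIS_PBI <- data.frame(
  Grupo = factor(sample(c("ALTO", "BAJO", "MEDIO"), 200, replace = TRUE)),
  exito = rbinom(200, 1, 0.5)      # hypothetical binary outcome
)
PAIS_PBI$Grupo <- relevel(PAIS_PBI$Grupo, ref = "BAJO")

fit <- glm(exito ~ Grupo, data = PAIS_PBI, family = binomial)
coef_names <- names(coef(fit))     # no "GrupoBAJO": that level is the baseline

coef_names
exp(coef(fit))[-1]                 # odds ratios of each group relative to BAJO
```

The relevel() call has to happen before glm() is fitted; changing the factor afterwards does not alter an existing model object.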

How do I make predictions using an ordered factor coefficient in R?

Just give R a data frame with x values drawn from the levels of the factor ("none", "some", etc.), and it will do the rest.

I changed your setup slightly so that x is an ordered() factor within the data frame (this carries through all of the computations).

d$x <- ordered(d$x, labels = c("none", "some", "more", "a lot"))
m1 <- lm(y ~ x, d)      ## save fitted object
Coefs <- coef(m1)

Now we can predict():

predict(m1, newdata =  data.frame(x=c("none","more"))) 
##         1        2
##  2.993959 6.997342

(There was no need to declare the new x as ordered(); predict() matches the values against the factor levels stored in the fitted model.)

If you want to dig a little bit deeper into the computations, you can look at the model matrix:

model.matrix(~unique(d$x))

For each level of the factor, these are the values R multiplies the coefficients b0, ..., b3 by to generate the prediction (e.g. for level = "none", 1*b0 + (-0.671)*b1 + 0.5*b2 + (-0.224)*b3):

   (Intercept) unique(d$x).L unique(d$x).Q unique(d$x).C
 1           1    -0.6708204           0.5    -0.2236068
 2           1    -0.2236068          -0.5     0.6708204
 3           1     0.2236068          -0.5    -0.6708204
 4           1     0.6708204           0.5     0.2236068

For even more detail, look at ?contr.poly (the contrasts R uses by default for ordered factors), ?poly, or the source code of poly() (although none of these is easy!)
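The hand computation can be checked against predict(). The original d is not shown, so the sketch below simulates a comparable data set; lv, cp, row_more, by_hand and by_predict are illustrative names.

```r
set.seed(1)
# Hypothetical data with a four-level ordered factor
lv <- c("none", "some", "more", "a lot")
d <- data.frame(x = factor(rep(lv, each = 10), levels = lv, ordered = TRUE))
d$y <- 3 + 2 * as.integer(d$x) + rnorm(nrow(d))
m1 <- lm(y ~ x, data = d)

# Polynomial contrasts for four levels; row 3 belongs to "more"
cp <- contr.poly(4)
row_more <- c(1, cp[3, ])          # prepend 1 for the intercept

by_hand <- sum(row_more * coef(m1))
by_predict <- unname(predict(m1, newdata = data.frame(x = "more")))

all.equal(by_hand, by_predict)
```

The rows of contr.poly(4) are the same numbers that appear in the .L, .Q and .C columns of the model matrix above.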
