Why Am I Losing Categorical Data in My Regression Summary?

`lm` summary does not display all factor levels

This issue is raised over and over again, but unfortunately no satisfying answer exists that can serve as an appropriate duplicate target. It looks like I need to write one.


Most people know this is related to "contrasts", but not everyone knows why they are needed or how to interpret the result. We have to look at the model matrix in order to fully digest this.

Suppose we are interested in a model with two factors: ~ f + g (numerical covariates do not matter, so I include none; the response does not appear in the model matrix, so drop it, too). Consider the following reproducible example:

set.seed(0)

f <- sample(gl(3, 4, labels = letters[1:3]))
# [1] c a a b b a c b c b a c
#Levels: a b c

g <- sample(gl(3, 4, labels = LETTERS[1:3]))
# [1] A B A B C B C A C C A B
#Levels: A B C

We start with a model matrix with no contrasts at all:

X0 <- model.matrix(~ f + g, contrasts.arg = list(
        f = contr.treatment(n = 3, contrasts = FALSE),
        g = contr.treatment(n = 3, contrasts = FALSE)))

#   (Intercept) f1 f2 f3 g1 g2 g3
#1            1  0  0  1  1  0  0
#2            1  1  0  0  0  1  0
#3            1  1  0  0  1  0  0
#4            1  0  1  0  0  1  0
#5            1  0  1  0  0  0  1
#6            1  1  0  0  0  1  0
#7            1  0  0  1  0  0  1
#8            1  0  1  0  1  0  0
#9            1  0  0  1  0  0  1
#10           1  0  1  0  0  0  1
#11           1  1  0  0  1  0  0
#12           1  0  0  1  0  1  0

Note that we have:

unname( rowSums(X0[, c("f1", "f2", "f3")]) )
# [1] 1 1 1 1 1 1 1 1 1 1 1 1

unname( rowSums(X0[, c("g1", "g2", "g3")]) )
# [1] 1 1 1 1 1 1 1 1 1 1 1 1

So the dummy columns of each factor sum to the intercept column: f1 + f2 + f3 = g1 + g2 + g3 = (Intercept). In this full specification, 2 columns are not identifiable, and X0 has column rank 1 + 3 + 3 - 2 = 5:

qr(X0)$rank
# [1] 5

So, if we fit a linear model with this X0, 2 of the 7 coefficients will be NA:

y <- rnorm(12)  ## random `y` as a response
lm(y ~ X0 - 1)  ## drop the intercept as `X0` already has one

#Coefficients:
#X0(Intercept)          X0f1          X0f2          X0f3          X0g1
#      0.32118       0.05039      -0.22184            NA      -0.92868
#         X0g2          X0g3
#     -0.48809            NA

What this really implies is that we have to add 2 linear constraints on the 7 parameters in order to get a full-rank model. It does not really matter what these 2 constraints are, but they must be linearly independent. For example, we can do any of the following:

  • drop any 2 columns from X0;
  • add two sum-to-zero constraints on the parameters, e.g. require the coefficients for f1, f2 and f3 to sum to 0, and likewise for g1, g2 and g3;
  • use regularization, for example adding a ridge penalty on f and g.

Note, these three ways end up with three different solutions:

  • contrasts;
  • constrained least squares;
  • linear mixed models or penalized least squares.

The first two are still within the scope of fixed-effects modelling. With "contrasts", we reduce the number of parameters until the model matrix has full rank; the other two do not reduce the number of parameters, but they effectively reduce the effective degrees of freedom.
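As a side note, the sum-to-zero flavour can be emulated inside lm itself via contr.sum, which reparametrizes each factor so that its level effects sum to 0. A minimal sketch, reusing the f, g and y from above (the recovery step at the end is my illustration, not part of the lm output):

fit <- lm(y ~ f + g, contrasts = list(f = "contr.sum", g = "contr.sum"))
coef(fit)  ## (Intercept), f1, f2, g1, g2 -- still 5 identifiable parameters

## the omitted third effect of `f` is minus the sum of the displayed two,
## so all three level effects of `f` sum to 0
f_eff <- coef(fit)[c("f1", "f2")]
c(f_eff, f3 = -sum(f_eff))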


Now, you are certainly after the "contrasts" way. So, remember, we have to drop 2 columns. They can be:

  • one column from f and one column from g, giving the model ~ f + g, with f and g contrasted;
  • the intercept, plus one column from either f or g, giving the model ~ f + g - 1.

Now it should be clear that, within the framework of dropping columns, there is no way to get what you want: you hope to drop only 1 column, but the resulting model matrix would still be rank-deficient.
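For comparison, here is the standard dropping at work. This is nothing more than R's default treatment contrasts applied to the f and g from above; the first level of each factor is dropped:

X <- model.matrix(~ f + g)  ## default treatment contrasts
colnames(X)
# [1] "(Intercept)" "fb"          "fc"          "gB"          "gC"

qr(X)$rank  ## equal to ncol(X): full rank
# [1] 5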

If you really want to have all coefficients there, use constrained least squares, or penalized regression / linear mixed models.


When we have interactions between factors, things are more complicated, but the idea is still the same. Given that this answer is already long enough, I will not go further here.

Why is my summary in R only including some of my variables?

For R to correctly understand a dummy variable, you have to indicate that Pup is a qualitative (dummy) variable by using factor:

> Pup <- factor(Pup)  ## `Calls` and `Pup` come from the question's data
> q <- data.frame(Calls, Pup)
> q1 <- lm(Calls ~ Pup, data = q)
> summary(q1)

Call:
lm(formula = Calls ~ Pup, data = q)

Residuals:
    1     2     3     4     5     6
  2.5 -25.0  10.0 -10.0  25.0  -2.5

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    85.00      15.61   5.444   0.0122 *
PupPost        85.00      22.08   3.850   0.0309 *
PupPre        -32.50      22.08  -1.472   0.2374
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 22.08 on 3 degrees of freedom
Multiple R-squared:  0.9097, Adjusted R-squared:  0.8494
F-statistic:  15.1 on 2 and 3 DF,  p-value: 0.02716

If you want R to show all categories of the dummy variable, then you must remove the intercept from the regression; otherwise you would fall into the dummy variable trap (the full set of dummies would be perfectly collinear with the intercept).

summary(lm(Calls ~ Pup - 1, data = q))

Call:
lm(formula = Calls ~ Pup - 1, data = q)

Residuals:
    1     2     3     4     5     6
  2.5 -25.0  10.0 -10.0  25.0  -2.5

Coefficients:
          Estimate Std. Error t value Pr(>|t|)
PupMiddle    85.00      15.61   5.444  0.01217 *
PupPost     170.00      15.61  10.889  0.00166 **
PupPre       52.50      15.61   3.363  0.04365 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 22.08 on 3 degrees of freedom
Multiple R-squared:  0.9815, Adjusted R-squared:  0.9631
F-statistic: 53.17 on 3 and 3 DF,  p-value: 0.004234
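Note how the two fits agree: the no-intercept coefficients are simply the per-level means of Calls, recoverable from the first fit as 85.00 (the intercept, i.e. level Middle), 85.00 + 85.00 = 170.00 (Post) and 85.00 - 32.50 = 52.50 (Pre).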

gtsummary: reporting categorical variables with missing values counted in the total, not just the available total

Use forcats::fct_explicit_na() to make the missing values explicit, then pass the data frame to tbl_summary(). (In newer forcats releases, fct_explicit_na() has been superseded by forcats::fct_na_value_to_level().) Example below!

library(gtsummary)

gender <- sample(c("Male", "Female", NA), 100, replace = TRUE)
age <- rpois(100, 5)

sample1 <-
  data.frame(gender, age) |>
  dplyr::mutate(
    gender = forcats::fct_explicit_na(gender)
  )

sample1 |>
  tbl_summary(
    statistic = list(
      all_continuous() ~ "{median} ({p25}, {p75})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 2,
    type = all_categorical() ~ "categorical",
    missing_text = "Missing"
  ) |>
  as_kable()  # export as kable to display on SO

Characteristic      N = 100
gender
  Female            33 (33%)
  Male              37 (37%)
  (Missing)         30 (30%)
age                 4.00 (3.00, 6.00)

Is there a way to display the reference category in a regression output in R?

The reference level is the one missing from the summary, because the coefficients of the other levels are contrasts with the reference level; i.e. the intercept actually represents the mean of the reference category.

iris <- transform(iris, Species_=factor(Species))  ## create factor

summary(lm(Sepal.Length ~ Petal.Length + Species_, iris))$coe
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 3.6835266 0.10609608 34.718780 1.968671e-72
# Petal.Length 0.9045646 0.06478559 13.962436 1.121002e-28
# Species_versicolor -1.6009717 0.19346616 -8.275203 7.371529e-14
# Species_virginica -2.1176692 0.27346121 -7.743947 1.480296e-12

You could remove the intercept to get the missing level displayed, but that does not make much sense: you would then just get the mean of each level without a reference, whereas you are interested in the contrasts between the reference level and the other levels.

summary(lm(Sepal.Length ~ 0 + Petal.Length + Species_, iris))$coe
# Estimate Std. Error t value Pr(>|t|)
# Petal.Length 0.9045646 0.06478559 13.962436 1.121002e-28
# Species_setosa 3.6835266 0.10609608 34.718780 1.968671e-72
# Species_versicolor 2.0825548 0.28009598 7.435147 8.171219e-12
# Species_virginica 1.5658574 0.36285224 4.315413 2.921850e-05

If you're not sure which one it is: the reference level is always the first level of the factor.

levels(iris$Species_)[1]
# [1] "setosa"

To verify this, specify a different reference level and check that it is now first.

iris$Species_ <- relevel(iris$Species_, ref='versicolor')

levels(iris$Species_)[1]
# [1] "versicolor"
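To see the effect on the regression output, simply refit the model from above after the relevel; 'versicolor' is now absorbed into the intercept, and rows for 'setosa' and 'virginica' are displayed instead:

summary(lm(Sepal.Length ~ Petal.Length + Species_, iris))$coe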

It is common to refer to the reference level in a note under the table in the report, and I recommend that you do the same.

Why does regression in R delete index 1 of a factor variable?

If R were to create a dummy variable for every level in the factor, the resulting set of variables would be linearly dependent (assuming there is also an intercept term). Therefore, one factor level is chosen as the baseline and has no dummy generated for it.

To illustrate this, let's consider a toy example:

> data <- data.frame(y=c(2, 3, 5, 7, 11, 25), f=as.factor(c('a', 'a', 'b', 'b', 'c', 'c')))
> summary(lm(y ~ f, data))

Call:
lm(formula = y ~ f, data = data)

Residuals:
   1    2    3    4    5    6
-0.5  0.5 -1.0  1.0 -7.0  7.0

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    2.500      4.093   0.611   0.5845
fb             3.500      5.788   0.605   0.5880
fc            15.500      5.788   2.678   0.0752 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.788 on 3 degrees of freedom
Multiple R-squared: 0.7245, Adjusted R-squared: 0.5409
F-statistic: 3.945 on 2 and 3 DF, p-value: 0.1446

As you can see, there are three coefficients (the same as the number of levels in the factor). Here, a has been chosen as the baseline, so (Intercept) refers to the subset of the data where f is a. The coefficients for b and c (fb and fc) are the differences between the baseline intercept and the intercepts of the two other factor levels. Thus the intercept for b is 6 (2.500 + 3.500) and the intercept for c is 18 (2.500 + 15.500).
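These sums are easy to check against the raw per-level means of y, which they must reproduce:

> tapply(data$y, data$f, mean)
   a    b    c
 2.5  6.0 18.0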

If you don't like the automatic choice, you could pick another level as the baseline: How to force R to use a specified factor level as reference in a regression?


