`lm` summary not displaying all factor levels
This issue is raised over and over again, but unfortunately no satisfying answer exists that can serve as a duplicate target. It looks like I need to write one.
Most people know this is related to "contrasts", but not everyone knows why contrasts are needed, or how to understand their result. We have to look at the model matrix in order to fully digest this.
Suppose we are interested in a model with two factors: ~ f + g
(numerical covariates do not matter, so I include none of them; the response does not appear in the model matrix, so I drop it, too). Consider the following reproducible example:
set.seed(0)
f <- sample(gl(3, 4, labels = letters[1:3]))
# [1] c a a b b a c b c b a c
#Levels: a b c
g <- sample(gl(3, 4, labels = LETTERS[1:3]))
# [1] A B A B C B C A C C A B
#Levels: A B C
We start with a model matrix with no contrasts at all:
X0 <- model.matrix(~ f + g, contrasts.arg = list(
  f = contr.treatment(n = 3, contrasts = FALSE),
  g = contr.treatment(n = 3, contrasts = FALSE)))
# (Intercept) f1 f2 f3 g1 g2 g3
#1 1 0 0 1 1 0 0
#2 1 1 0 0 0 1 0
#3 1 1 0 0 1 0 0
#4 1 0 1 0 0 1 0
#5 1 0 1 0 0 0 1
#6 1 1 0 0 0 1 0
#7 1 0 0 1 0 0 1
#8 1 0 1 0 1 0 0
#9 1 0 0 1 0 0 1
#10 1 0 1 0 0 0 1
#11 1 1 0 0 1 0 0
#12 1 0 0 1 0 1 0
Note, we have:
unname( rowSums(X0[, c("f1", "f2", "f3")]) )
# [1] 1 1 1 1 1 1 1 1 1 1 1 1
unname( rowSums(X0[, c("g1", "g2", "g3")]) )
# [1] 1 1 1 1 1 1 1 1 1 1 1 1
So span{f1, f2, f3} = span{g1, g2, g3} = span{(Intercept)}. In this full specification, 2 columns are not identifiable, and X0 will have column rank 1 + 3 + 3 - 2 = 5:
qr(X0)$rank
# [1] 5
So, if we fit a linear model with this X0, 2 coefficients out of the 7 parameters will be NA:
y <- rnorm(12)  ## random `y` as a response
lm(y ~ X0 - 1) ## drop the intercept, as `X0` already contains an intercept column
#X0(Intercept) X0f1 X0f2 X0f3 X0g1
# 0.32118 0.05039 -0.22184 NA -0.92868
# X0g2 X0g3
# -0.48809 NA
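Conversely, R's default treatment contrasts already drop one column per factor, so the default model matrix has exactly 5 columns and full rank. A minimal sketch (using a deterministic variant of f and g, so the result does not depend on the RNG):

```r
## Default treatment contrasts: one column per factor is dropped, so the
## model matrix has 5 columns and full column rank -- no NA coefficients.
f <- gl(3, 4, labels = letters[1:3])               # a a a a b b b b c c c c
g <- gl(3, 1, length = 12, labels = LETTERS[1:3])  # A B C A B C ...
X <- model.matrix(~ f + g)   # columns: (Intercept), f2, f3, g2, g3
ncol(X)       # 5
qr(X)$rank    # 5 -- full rank, unlike the 7-column X0
```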
What this really implies is that we have to add 2 linear constraints on the 7 parameters in order to get a full-rank model. It does not really matter what these 2 constraints are, but they must be linearly independent. For example, we can do any of the following:
- drop any 2 columns from X0;
- add two sum-to-zero constraints on the parameters, e.g. require the coefficients for f1, f2 and f3 to sum to 0, and likewise for g1, g2 and g3;
- use regularization, for example adding a ridge penalty to f and g.
Note, these three ways end up with three different solutions:
- contrasts;
- constrained least squares;
- linear mixed models or penalized least squares.
The first two are still within the scope of fixed-effect modelling. With "contrasts", we reduce the number of parameters until we get a full-rank model matrix; the other two do not reduce the number of parameters, but effectively reduce the effective degrees of freedom.
Now, you are certainly after the "contrasts" way. So, remember, we have to drop 2 columns. They can be:
- one column from f and one column from g, giving the model ~ f + g, with f and g contrasted;
- the intercept, and one column from either f or g, giving the model ~ f + g - 1.
Now it should be clear that, within the framework of dropping columns, there is no way to get what you want, because you are expecting to drop only 1 column. The resulting model matrix would still be rank-deficient.
If you really want to have all coefficients there, use constrained least squares, or penalized regression / linear mixed models.
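As a hedged sketch of the sum-to-zero route: fitting with contr.sum (deviation coding) still estimates one fewer parameter per factor, but the full set of level effects can be recovered afterwards, and within each factor it sums to zero by construction (the data here are made up purely for illustration):

```r
## Sum-to-zero (deviation) coding via contr.sum: the last level's effect
## is implied as minus the sum of the others, so all three can be reported.
set.seed(1)
f <- gl(3, 4, labels = letters[1:3])
g <- gl(3, 1, length = 12, labels = LETTERS[1:3])
y <- rnorm(12)
fit <- lm(y ~ f + g, contrasts = list(f = "contr.sum", g = "contr.sum"))
b <- coef(fit)                                      # (Intercept), f1, f2, g1, g2
f_eff <- c(b["f1"], b["f2"], -(b["f1"] + b["f2"]))  # effect for level c is implied
sum(f_eff)                                          # 0, up to floating-point rounding
```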
Now, when we have an interaction of factors, things are more complicated, but the idea is still the same. Given that my answer is already long enough, I won't continue here.
Why is my summary in R only including some of my variables?
For R to correctly understand a dummy variable, you have to indicate that Pup is a qualitative (dummy) variable by using factor:
> Pup <- factor(Pup)
> q<-data.frame(Calls, Pup)
> q1<-lm(Calls~Pup, data=q)
> summary(q1)
Call:
lm(formula = Calls ~ Pup, data = q)
Residuals:
1 2 3 4 5 6
2.5 -25.0 10.0 -10.0 25.0 -2.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 85.00 15.61 5.444 0.0122 *
PupPost 85.00 22.08 3.850 0.0309 *
PupPre -32.50 22.08 -1.472 0.2374
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 22.08 on 3 degrees of freedom
Multiple R-squared: 0.9097, Adjusted R-squared: 0.8494
F-statistic: 15.1 on 2 and 3 DF, p-value: 0.02716
If you want R to show all categories of the dummy variable, then you must remove the intercept from the regression; otherwise you will fall into the dummy variable trap.
summary(lm(Calls~Pup-1, data=q))
Call:
lm(formula = Calls ~ Pup - 1, data = q)
Residuals:
1 2 3 4 5 6
2.5 -25.0 10.0 -10.0 25.0 -2.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
PupMiddle 85.00 15.61 5.444 0.01217 *
PupPost 170.00 15.61 10.889 0.00166 **
PupPre 52.50 15.61 3.363 0.04365 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 22.08 on 3 degrees of freedom
Multiple R-squared: 0.9815, Adjusted R-squared: 0.9631
F-statistic: 53.17 on 3 and 3 DF, p-value: 0.004234
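The two parameterizations describe the same fit: the no-intercept coefficients are the group means, and each equals the intercept plus the corresponding contrast from the first model. A small check on made-up data with the same group means (the original Calls values are not shown in the question):

```r
## Hypothetical data whose group means match the output above:
## Middle = 85, Post = 170, Pre = 52.5.
d <- data.frame(Calls = c(80, 90, 160, 180, 40, 65),
                Pup = factor(rep(c("Middle", "Post", "Pre"), each = 2)))
b_int   <- coef(lm(Calls ~ Pup, data = d))      # intercept + contrasts
b_noint <- coef(lm(Calls ~ Pup - 1, data = d))  # one mean per level
## Reference-level mean plus each contrast recovers the per-level means:
unname(b_int[1] + c(0, b_int[2], b_int[3]))     # 85, 170, 52.5
unname(b_noint)                                 # 85, 170, 52.5
```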
gtsummary reporting categorical variables with missing from the total not the available total
Use forcats::fct_explicit_na() to make the missing values explicit, then pass the data frame to tbl_summary(). Example below!
library(gtsummary)
gender <- sample(c("Male", "Female", NA), 100, replace = TRUE)
age <- rpois(100,5)
sample1 <-
data.frame(gender, age) |>
dplyr::mutate(
gender = forcats::fct_explicit_na(gender)
)
sample1 |>
tbl_summary(statistic = list(all_continuous()~ "{median} ({p25}, {p75})",
all_categorical() ~"{n} ({p}%)"),
digits = all_continuous()~ 2,
type = all_categorical() ~ "categorical",
missing_text = "Missing"
) |>
as_kable() # export as kable to display on SO
| Characteristic | N = 100 |
|---|---|
| gender | |
| Female | 33 (33%) |
| Male | 37 (37%) |
| (Missing) | 30 (30%) |
| age | 4.00 (3.00, 6.00) |
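Side note: in recent forcats releases fct_explicit_na() is superseded; if you prefer to avoid it, base R's addNA() achieves the same thing. A sketch, with the "(Missing)" label chosen here for illustration:

```r
## Base-R alternative: addNA() turns NA into an explicit factor level,
## which can then be renamed before passing the data to tbl_summary().
set.seed(42)
gender <- factor(sample(c("Male", "Female", NA), 100, replace = TRUE))
gender2 <- addNA(gender)
levels(gender2)[is.na(levels(gender2))] <- "(Missing)"
anyNA(gender2)   # FALSE -- missing values are now a real level
table(gender2)
```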
Is there a way to display the reference category in a regression output in R?
The reference level is the one that is missing from the summary, because the coefficients of the other levels are contrasts to the reference level, i.e. the intercept represents the reference category (in a factor-only model, it is exactly the mean of that category).
iris <- transform(iris, Species_=factor(Species)) ## create factor
summary(lm(Sepal.Length ~ Petal.Length + Species_, iris))$coe
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 3.6835266 0.10609608 34.718780 1.968671e-72
# Petal.Length 0.9045646 0.06478559 13.962436 1.121002e-28
# Species_versicolor -1.6009717 0.19346616 -8.275203 7.371529e-14
# Species_virginica -2.1176692 0.27346121 -7.743947 1.480296e-12
You could remove the intercept to get the missing level displayed, but that does not make much sense. You would then just get each level's own intercept without a reference, whereas you are usually interested in the contrasts between the reference level and the other levels.
summary(lm(Sepal.Length ~ 0 + Petal.Length + Species_, iris))$coe
# Estimate Std. Error t value Pr(>|t|)
# Petal.Length 0.9045646 0.06478559 13.962436 1.121002e-28
# Species_setosa 3.6835266 0.10609608 34.718780 1.968671e-72
# Species_versicolor 2.0825548 0.28009598 7.435147 8.171219e-12
# Species_virginica 1.5658574 0.36285224 4.315413 2.921850e-05
If you're not sure which one it is: the reference level is always the first level of the factor.
levels(iris$Species_)[1]
# [1] "setosa"
To see this, specify a different reference level and check that it is now the first.
iris$Species_ <- relevel(iris$Species_, ref='versicolor')
levels(iris$Species_)[1]
# [1] "versicolor"
It is common to refer to the reference level in a note under the table in the report, and I recommend that you do the same.
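To check this numerically in a factor-only model (no covariate), the intercept is exactly the mean of the reference level, and each other coefficient is that level's mean minus it. A small sketch using the built-in iris data:

```r
## Factor-only model: intercept = mean of the reference level; other
## coefficients = each level's mean minus the reference mean.
fit <- lm(Sepal.Length ~ Species, iris)
m <- tapply(iris$Sepal.Length, iris$Species, mean)
all.equal(unname(coef(fit)[1]), unname(m["setosa"]))                    # TRUE
all.equal(unname(coef(fit)[2]), unname(m["versicolor"] - m["setosa"]))  # TRUE
```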
Why does regression in R delete index 1 of a factor variable?
If R were to create a dummy variable for every level in the factor, the resulting set of variables would be linearly dependent (assuming there is also an intercept term). Therefore, one factor level is chosen as the baseline and has no dummy generated for it.
To illustrate this, let's consider a toy example:
> data <- data.frame(y=c(2, 3, 5, 7, 11, 25), f=as.factor(c('a', 'a', 'b', 'b', 'c', 'c')))
> summary(lm(y ~ f, data))
Call:
lm(formula = y ~ f, data = data)
Residuals:
1 2 3 4 5 6
-0.5 0.5 -1.0 1.0 -7.0 7.0
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.500 4.093 0.611 0.5845
fb 3.500 5.788 0.605 0.5880
fc 15.500 5.788 2.678 0.0752 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.788 on 3 degrees of freedom
Multiple R-squared: 0.7245, Adjusted R-squared: 0.5409
F-statistic: 3.945 on 2 and 3 DF, p-value: 0.1446
As you can see, there are three coefficients (the same as the number of levels in the factor). Here, a has been chosen as the baseline, so (Intercept) refers to the subset of the data where f is a. The coefficients for b and c (fb and fc) are the differences between the baseline intercept and the intercepts for the two other factor levels. Thus the intercept for b is 6 (2.500 + 3.500) and the intercept for c is 18 (2.500 + 15.500).
If you don't like the automatic choice, you could pick another level as the baseline: How to force R to use a specified factor level as reference in a regression?
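For instance, releveling so that b is the baseline changes which contrasts are reported, but describes the same fit (the group means here are a = 2.5, b = 6, c = 18):

```r
## Releveling: with 'b' as the baseline, (Intercept) is the mean of
## group b and the other coefficients are contrasts against it.
data <- data.frame(y = c(2, 3, 5, 7, 11, 25),
                   f = as.factor(c('a', 'a', 'b', 'b', 'c', 'c')))
b <- coef(lm(y ~ relevel(f, ref = 'b'), data = data))
unname(b)   # 6 (intercept), -3.5 (contrast for a), 12 (contrast for c)
```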