`lm` summary not display all factor levels
This issue is raised over and over again, but unfortunately no satisfying answer has been made which can be an appropriate duplicate target. Looks like I need to write one.
Most people know this is related to "contrasts", but not everyone knows why it is needed, and how to understand its result. We have to look at model matrix in order to fully digest this.
Suppose we are interested in a model with two factors: ~ f + g
(numerical covariates do not matter so I include none of them; the response does not appear in model matrix, so drop it, too). Consider the following reproducible example:
set.seed(0)
f <- sample(gl(3, 4, labels = letters[1:3]))
# [1] c a a b b a c b c b a c
#Levels: a b c
g <- sample(gl(3, 4, labels = LETTERS[1:3]))
# [1] A B A B C B C A C C A B
#Levels: A B C
We start with a model matrix with no contrasts at all:
X0 <- model.matrix(~ f + g, contrasts.arg = list(
f = contr.treatment(n = 3, contrasts = FALSE),
g = contr.treatment(n = 3, contrasts = FALSE)))
# (Intercept) f1 f2 f3 g1 g2 g3
#1 1 0 0 1 1 0 0
#2 1 1 0 0 0 1 0
#3 1 1 0 0 1 0 0
#4 1 0 1 0 0 1 0
#5 1 0 1 0 0 0 1
#6 1 1 0 0 0 1 0
#7 1 0 0 1 0 0 1
#8 1 0 1 0 1 0 0
#9 1 0 0 1 0 0 1
#10 1 0 1 0 0 0 1
#11 1 1 0 0 1 0 0
#12 1 0 0 1 0 1 0
Note, we have:
unname( rowSums(X0[, c("f1", "f2", "f3")]) )
# [1] 1 1 1 1 1 1 1 1 1 1 1 1
unname( rowSums(X0[, c("g1", "g2", "g3")]) )
# [1] 1 1 1 1 1 1 1 1 1 1 1 1
So span{f1, f2, f3} = span{g1, g2, g3} = span{(Intercept)}
. In this full specification, 2 columns are not identifiable. X0
will have column rank 1 + 3 + 3 - 2 = 5
:
qr(X0)$rank
# [1] 5
So, if we fit a linear model with this X0
, 2 coefficients out of 7 parameters will be NA
:
y <- rnorm(12) ## random `y` as a response
lm(y ~ X - 1) ## drop intercept as `X` has intercept already
#X0(Intercept) X0f1 X0f2 X0f3 X0g1
# 0.32118 0.05039 -0.22184 NA -0.92868
# X0g2 X0g3
# -0.48809 NA
What this really implies, is that we have to add 2 linear constraints on 7 parameters, in order to get a full rank model. It does not really matter what these 2 constraints are, but there must be 2 linearly independent constrains. For example, we can do either of the following:
- drop any 2 columns from
X0
; - add two sum-to-zero constrains on parameters, like we require coefficients for
f1
,f2
andf3
sum to 0, and the same forg1
,g2
andg3
. - use regularization, for example, adding ridge penalty to
f
andg
.
Note, these three ways end up with three different solutions:
- contrasts;
- constrained least squares;
- linear mixed models or penalized least squares.
The first two are still in the scope of fixed effect modelling. By "contrasts", we reduce the number of parameters until we get a full rank model matrix; while the other two does not reduce the number of parameters, but effectively reduces the effective degree of freedom.
Now, you are certainly after the "contrasts" way. So, remember, we have to drop 2 columns. They can be
- one column from
f
and one column fromg
, giving to a model~ f + g
, withf
andg
contrasted; - intercept, and one column from either
f
org
, giving to a model~ f + g - 1
.
Now you should be clear, that within the framework of dropping columns, there is no way you can get what you want, because you are expecting to drop only 1 column. The resulting model matrix will still be rank-deficient.
If you really want to have all coefficients there, use constrained least squares, or penalized regression / linear mixed models.
Now, when we have interaction of factors, things are more complicated but the idea is still the same. But given that my answer is already long enough, I don't want to continue.
Why summary a linear model in R does not show all the needed levels?
This is simply a consequence of over parameterisation and is nothing to worry about. Your modelling code is simply taking the final level of your factor Density
as the reference levels. The effects of the other levels are simply differences from the reference level.
To see this, in your first model, with "B" as the reference level, the difference between "A" and "M" is -0.05080 - -0.24315 = 0.19235. In your second model, with "A" as the reference level, the coefficient of "M" (ie the estimated difference between "A" and "M") is 0.19235. Exactly the same value.
You can work out the value of any effect you like from either model, and the two values will be identical. You just need to take account of the parametrisation that the model has used.
I'm voting to close this question because I believe it's better suited to stackexchange: it's a stats question, not a programming question.
R's lm() function removes factor levels (aside from intercept estimator) without reporting error or warning
Yes your missingness is probably causing your results.
Consider the following reproducible example.
x1 <- sample(1:5, 1000, replace=T)
x2 <- sample(1:3, 1000, replace=T)
y <- 2*x1 + 3*x2 + rnorm(1000)
#no missings (everything works fine)
lm(y~as.factor(x1) + as.factor(x2))
lm(y~ -1 + as.factor(x1) + as.factor(x2)) # no intercept
#with missings
x1[x2==2]<-NA #create specific missingness
table(x1,useNA = "always")
table(x2,useNA = "always")
table(x1,x2,useNA = "always") #you see the missing pattern
lm(y~ as.factor(x1) + as.factor(x2))
lm(y~ -1 + as.factor(x1) + as.factor(x2))
If you run the code, you will see that for factor x2 two categories are omitted. The first is the ref.cat. the second (x2)2
because of the missings.
Comparing all factor levels to the grand mean: can I tweak contrasts in linear model fitting to show all levels?
This answer shows you how to obtain the following coefficient table:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) -0.02901982 0.4680937 -0.06199574 0.9544655
#A -0.19238543 0.6619845 -0.29061922 0.7902750
#B 0.40884591 0.6619845 0.61760645 0.5805485
#C -0.21646049 0.6619845 -0.32698723 0.7651640
Amazing, isn't it? It mimics what you see from summary(fit)
, i.e.,
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) -0.02901982 0.4680937 -0.06199574 0.9544655
#x1 -0.19238543 0.6619845 -0.29061922 0.7902750
#x2 0.40884591 0.6619845 0.61760645 0.5805485
But now we have all factor levels displayed.
Why lm
summary does not display all factor levels?
In 2016, I answered this Stack Overflow question: `lm` summary not display all factor levels and since then, it has become the target for marking duplicated questions on similar topics.
To recap, the basic idea is that in order to have a full-rank design matrix for least squares fitting, we must apply contrasts to a factor variable. Let's say that the factor has N levels, then no matter what type of contrasts we choose (see ?contrasts
for a list), it reduces the raw N dummy variables to a new set of N - 1 variables. Therefore, only N - 1 coefficients are associated with an N-level factor.
However, we can transform the N - 1 coefficients back to the original N coefficients using the contrasts matrix. The transformation enables us to obtain a coefficient table for all factor levels. I will now demonstrate how to do this, based on OP's reproducible example:
set.seed(1)
y <- rnorm(6, 0, 1)
x <- factor(rep(LETTERS[1:3], each = 2))
fit <- lm(y ~ x, contrasts = list(x = contr.sum))
In this example, the sum-to-zero contrast is applied to factor x
. To know more on how to control contrasts for model fitting, see my answer at How to set contrasts for my variable in regression analysis with R?.
R code walk-through
For a factor variable of N levels subject to sum-to-zero contrasts, we can use the following function to get the N x (N - 1) transformation matrix that maps the (N - 1) coefficients estimated by lm
back to the N coefficients for all levels.
ContrSumMat <- function (fctr, sparse = FALSE) {
if (!is.factor(fctr)) stop("'fctr' is not a factor variable!")
N <- nlevels(fctr)
Cmat <- contr.sum(N, sparse = sparse)
dimnames(Cmat) <- list(levels(fctr), seq_len(N - 1))
Cmat
}
For the example 3-level factor x
, this matrix is:
Cmat <- ContrSumMat(x)
# 1 2
#A 1 0
#B 0 1
#C -1 -1
The fitted model fit
reports 3 - 1 = 2 coefficients for this factor. We can extract them as:
## coefficients After Contrasts
coef_ac <- coef(fit)[2:3]
# x1 x2
#-0.1923854 0.4088459
Therefore, the level-specific coefficients are:
## coefficients Before Contrasts
coef_bc <- (Cmat %*% coef_ac)[, 1]
# A B C
#-0.1923854 0.4088459 -0.2164605
## note that they sum to zero as expected
sum(coef_bc)
#[1] 0
Similarly, we can get the covariance matrix after contrasts:
var_ac <- vcov(fit)[2:3, 2:3]
# x1 x2
#x1 0.4382235 -0.2191118
#x2 -0.2191118 0.4382235
and transform it to the one before contrasts:
var_bc <- Cmat %*% var_ac %*% t(Cmat)
# A B C
#A 0.4382235 -0.2191118 -0.2191118
#B -0.2191118 0.4382235 -0.2191118
#C -0.2191118 -0.2191118 0.4382235
Interpretation:
The model intercept
coef(fit)[1]
is the grand mean.The computed
coef_bc
is the deviation of each level's mean from the grand mean.The diagonal entries of
var_bc
gives the estimated variance of these deviations.
We can then proceed to compute t-statistics and p-values for these coefficients, as follows.
## standard error of point estimate `coef_bc`
std.error_bc <- sqrt(diag(var_bc))
# A B C
#0.6619845 0.6619845 0.6619845
## t-statistics (Null Hypothesis: coef_bc = 0)
t.stats_bc <- coef_bc / std.error_bc
# A B C
#-0.2906192 0.6176065 -0.3269872
## p-values of the t-statistics
p.value_bc <- 2 * pt(abs(t.stats_bc), df = fit$df.residual, lower.tail = FALSE)
# A B C
#0.7902750 0.5805485 0.7651640
## construct a coefficient table that mimics `coef(summary(fit))`
stats.tab_bc <- cbind("Estimate" = coef_bc,
"Std. Error" = std.error_bc,
"t value" = t.stats_bc,
"Pr(>|t|)" = p.value_bc)
# Estimate Std. Error t value Pr(>|t|)
#A -0.1923854 0.6619845 -0.2906192 0.7902750
#B 0.4088459 0.6619845 0.6176065 0.5805485
#C -0.2164605 0.6619845 -0.3269872 0.7651640
We can also augment it by including the result for the grand mean (i.e., the model intercept).
## extract statistics of the intercept
intercept.stats <- coef(summary(fit))[1, , drop = FALSE]
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) -0.02901982 0.4680937 -0.06199574 0.9544655
## augment the coefficient table
stats.tab <- rbind(intercept.stats, stats.tab_bc)
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) -0.02901982 0.4680937 -0.06199574 0.9544655
#A -0.19238543 0.6619845 -0.29061922 0.7902750
#B 0.40884591 0.6619845 0.61760645 0.5805485
#C -0.21646049 0.6619845 -0.32698723 0.7651640
We can also print this table with significance stars.
printCoefmat(stats.tab)
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) -0.02902 0.46809 -0.0620 0.9545
#A -0.19239 0.66199 -0.2906 0.7903
#B 0.40885 0.66199 0.6176 0.5805
#C -0.21646 0.66199 -0.3270 0.7652
Emm? Why are there no stars? Well, in this example all p-values are very large. The stars will show up if p-values are small. Here is a convincing demo:
fake.tab <- stats.tab
fake.tab[, 4] <- fake.tab[, 4] / 100
printCoefmat(fake.tab)
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) -0.02902 0.46809 -0.0620 0.009545 **
#A -0.19239 0.66199 -0.2906 0.007903 **
#B 0.40885 0.66199 0.6176 0.005805 **
#C -0.21646 0.66199 -0.3270 0.007652 **
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Oh, this is so beautiful. For the meaning of these stars, see my answer at: Interpeting R significance codes for ANOVA table?
Closing Remarks
It should be possible to write a function (or even an R package) to perform such table transformation. However, it might take great effort to make such function flexible enough, to handle:
all type of contrasts (this is easy to do);
complicated model terms, like interaction between a factor and other numeric/factor variables (this seems really involving!!).
So, I will stop here for the moment.
Miscellaneous Replies
Are the model scores that I get from the lm's summary still accurate, even though it isn't displaying all levels of the factor?
Yes, they are. lm
conducts accurate least squares fitting.
In addition, the transformation of coefficient table does not affect R-squares, degree of freedom, residuals, fitted values, F-statistics, ANOVA table, etc.
Accessing intercept factor levels for a linear regression
Calculate the dummy coefficients.
library(nlme) # Alfalfa
fm <- lm(Yield ~ ., Alfalfa)
dc <- dummy.coef(fm); dc
## Full coefficients are
##
## (Intercept): 2.033333
## Variety: Cossack Ladak Ranger
## 0.00000000 0.09458333 -0.01916667
## Date: None S1 S20 O7
## 0.0000000 -0.4405556 -0.2066667 -0.0900000
## Block: 1 2 3 4 5 6
## 0.0000000 -0.1808333 -0.0875000 -0.2066667 -0.5016667 -0.6875000
The dummy coefficients with one level, in this case the Intercept, are not factors so extract the ones with more than one level and get the maximum coef from each factor producing a data frame as shown.
max.lev <- function(x) data.frame(level = names(x)[which.max(x)], value = max(x))
do.call("rbind", lapply(dc[lengths(dc) > 1], max.lev))
## level value
## Variety Ladak 0.09458333
## Date None 0.00000000
## Block 1 0.00000000
Related Topics
How to Install an R Package from Source
Convert Data from Long Format to Wide Format With Multiple Measure Columns
Extract the Maximum Value Within Each Group in a Dataframe
Group by Multiple Columns and Sum Other Multiple Columns
How to Sort a Character Vector Where Elements Contain Letters and Numbers
How to Call an Object With the Character Variable of the Same Name
Count Number of Rows in a Data Frame in R Based on Group
Strptime, As.Posixct and As.Date Return Unexpected Na
Ggplot2 - Annotate Outside of Plot
Fastest Way to Add Rows For Missing Time Steps
Dplyr Conditional Summarise Function
Add Column Values Based on Other Columns in Data Frame Using for and If
Duplicate Columns in Spark Dataframe