How to Set Contrasts for My Variable in Regression Analysis with R

How to set contrasts for my variable in regression analysis with R?

There are several confusing things regarding your question. You have used both a ~ b and b ~ a, so what are you looking at exactly?

  • Contrasts apply only to covariates / independent variables, as they relate to the construction of the model matrix. So for a ~ b, contrasts should be applied to b, while for b ~ a, contrasts should be applied to a;
  • Contrasts work only for factor / logical variables, not for numerical variables. So unless you have b as a factor, you can't apply contrasts to it.

Without changing the data type, it is clear that only the model b ~ a is legitimate for further discussion. In the following, I will show how to set contrasts for a.
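
The question's test_mx is not reproduced here. A minimal reconstruction consistent with the str() output shown under Method 2 below would be:

## hypothetical reconstruction of the OP's data, inferred from the
## str() output under Method 2 below
test_mx <- data.frame(a = factor(c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)),
                      b = c(1, 1, 1, 0, 0, 0))

Note that b is perfectly separated by a, so the binomial fits below report enormous coefficients (about ±24.6 and ±49.1) and glm warns that fitted probabilities of 0 or 1 occurred; those numbers are an artifact of the diverging MLE and do not affect the contrast mechanics being demonstrated.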


Method 1: using contrasts argument of glm and lm

We can control the contrasts via the contrasts argument of glm (the same works for lm):

## dropping the first factor level (default)
coef(glm(b ~ a, data = test_mx, family = binomial(),
         contrasts = list(a = contr.treatment(n = 2, base = 1))))
#(Intercept)          a2 
#  -24.56607    49.13214 

## dropping the second factor level
coef(glm(b ~ a, data = test_mx, family = binomial(),
         contrasts = list(a = contr.treatment(n = 2, base = 2))))
#(Intercept)          a1 
#   24.56607   -49.13214 

Here, contr.treatment generates a contrasts matrix:

contr.treatment(n = 2, base = 1)
#  2
#1 0
#2 1

contr.treatment(n = 2, base = 2)
#  1
#1 1
#2 0

and they are passed to glm to effectively change the behaviour of model.matrix.default. Let's compare the model matrices for the two cases:

model.matrix.default( ~ a, test_mx, contrasts.arg =
                      list(a = contr.treatment(n = 2, base = 1)))

#  (Intercept) a2
#1           1  1
#2           1  1
#3           1  1
#4           1  0
#5           1  0
#6           1  0

model.matrix.default( ~ a, test_mx, contrasts.arg =
                      list(a = contr.treatment(n = 2, base = 2)))

#  (Intercept) a1
#1           1  0
#2           1  0
#3           1  0
#4           1  1
#5           1  1
#6           1  1

The second column for a simply flips between 0 and 1, which is what you would expect for a dummy variable.
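
As a side note beyond the original answer: a fitted lm / glm object records the contrasts it used in its contrasts component, so you can always check after the fact. A quick sketch:

## inspect the contrasts stored in a fitted model
fit2 <- glm(b ~ a, data = test_mx, family = binomial(),
            contrasts = list(a = contr.treatment(n = 2, base = 2)))
fit2$contrasts
#$a
#  1
#1 1
#2 0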


Method 2: setting "contrasts" attribute to data frame directly

We can use C or contrasts to set the "contrasts" attribute of a factor (C is only for setting, while contrasts can also be used for viewing, as demonstrated below):

test_mx2 <- test_mx
contrasts(test_mx2$a) <- contr.treatment(n = 2, base = 1)
str(test_mx2)
#'data.frame': 6 obs. of 2 variables:
# $ a: Factor w/ 2 levels "FALSE","TRUE": 2 2 2 1 1 1
# ..- attr(*, "contrasts")= num [1:2, 1] 0 1
# .. ..- attr(*, "dimnames")=List of 2
# .. .. ..$ : chr "FALSE" "TRUE"
# .. .. ..$ : chr "2"
# $ b: num 1 1 1 0 0 0

test_mx3 <- test_mx
contrasts(test_mx3$a) <- contr.treatment(n = 2, base = 2)
str(test_mx3)
#'data.frame': 6 obs. of 2 variables:
# $ a: Factor w/ 2 levels "FALSE","TRUE": 2 2 2 1 1 1
# ..- attr(*, "contrasts")= num [1:2, 1] 1 0
# .. ..- attr(*, "dimnames")=List of 2
# .. .. ..$ : chr "FALSE" "TRUE"
# .. .. ..$ : chr "1"
# $ b: num 1 1 1 0 0 0
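
And, as claimed above, contrasts can also be used for viewing; the printed matrix matches the "contrasts" attribute in the str() output:

contrasts(test_mx2$a)
#      2
#FALSE 0
#TRUE  1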

Now we can fit glm without using the contrasts argument:

coef(glm(b ~ a, data = test_mx2, family = "binomial"))
#(Intercept)          a2 
#  -24.56607    49.13214 

coef(glm(b ~ a, data = test_mx3, family = "binomial"))
#(Intercept)          a1 
#   24.56607   -49.13214 

Method 3: setting options("contrasts") for a global change

Hahaha, @BenBolker mentions yet another option: setting R's global options. For your specific example with a factor involving only two levels, we can make use of ?contr.SAS.

## using R's default contrasts options
options("contrasts")
#$contrasts
#        unordered           ordered 
#"contr.treatment"     "contr.poly" 

coef(glm(b ~ a, data = test_mx, family = "binomial"))
#(Intercept)       aTRUE 
#  -24.56607    49.13214 

options(contrasts = c("contr.SAS", "contr.poly"))
coef(glm(b ~ a, data = test_mx, family = "binomial"))
#(Intercept)      aFALSE 
#   24.56607   -49.13214 

But I believe Ben mentions this just to complete the picture; he would not take this route in practice, as changing global options is bad for writing reproducible R code.
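
If you do take this route, a safer pattern (my addition, not from the original answer) is to capture the old options and restore them afterwards:

## options() returns the previous values, so the global change can be undone
old <- options(contrasts = c("contr.SAS", "contr.poly"))
coef(glm(b ~ a, data = test_mx, family = "binomial"))
options(old)  ## restore the default contrasts options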

Another issue is that contr.SAS simply treats the last factor level as the reference. In your particular case with only 2 levels, this effectively does the "flipping".


Method 4: Manually recoding your factor levels

I had no intention of mentioning this, as it is so trivial, but since I have added "Method 3", I'd better add this one, too.

test_mx4 <- test_mx
test_mx4$a <- factor(test_mx4$a, levels = c("TRUE", "FALSE"))
coef(glm(b ~ a, data = test_mx4, family = "binomial"))
#(Intercept)      aFALSE 
#   24.56607   -49.13214 

test_mx5 <- test_mx
test_mx5$a <- factor(test_mx5$a, levels = c("FALSE", "TRUE"))
coef(glm(b ~ a, data = test_mx5, family = "binomial"))
#(Intercept)       aTRUE 
#  -24.56607    49.13214 
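
A related shortcut, anticipating the relevel() answer further down this page: relevel() changes the reference level directly, which is equivalent to the manual recoding above:

test_mx6 <- test_mx
test_mx6$a <- relevel(test_mx6$a, ref = "TRUE")  # make "TRUE" the reference
coef(glm(b ~ a, data = test_mx6, family = "binomial"))
#(Intercept)      aFALSE 
#   24.56607   -49.13214 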

In R, how to deal with 'contrasts can be applied only to factors with 2 or more levels' when my variables in the model have 2 levels?

I fixed my mistake. What happened was that I was writing my lm() call incorrectly, so it accepted only one independent variable rather than several. Note also that if you have, for instance, example <- lm(test_score ~ study_type, univ_title, data = df), then since data is already supplied by name, the unnamed univ_title is matched positionally to lm's subset argument, and the analysis runs on a subset of the data rather than with univ_title as a predictor. After changing study_type, univ_title to study_type + univ_title, the lm() was fixed, as the sketch below illustrates.
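
A minimal sketch of that pitfall, with made-up data (the names test_score, study_type, univ_title, and df come from the question; the values are invented):

set.seed(1)
df <- data.frame(test_score = rnorm(20),
                 study_type = factor(rep(c("solo", "group"), 10)),
                 univ_title = factor(rep(c("BA", "BSc"), each = 10)))

## WRONG: `data` is given by name, so the unnamed `univ_title` is matched
## positionally to lm's next formal argument, `subset`, instead of
## entering the model as a predictor
# lm(test_score ~ study_type, univ_title, data = df)

## RIGHT: both variables enter the model as predictors
coef(lm(test_score ~ study_type + univ_title, data = df))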

Error in contrasts when defining a linear model in R

If an independent variable (i.e., a right-hand-side variable) is a factor or character that takes only one value, then this type of error occurs.

Example: iris data in R

(model1 <- lm(Sepal.Length ~ Sepal.Width + Species, data = iris))

# Call:
# lm(formula = Sepal.Length ~ Sepal.Width + Species, data = iris)

# Coefficients:
#       (Intercept)        Sepal.Width  Speciesversicolor   Speciesvirginica
#            2.2514             0.8036             1.4587             1.9468

Now, if your data consists of only one species:

(model1 <- lm(Sepal.Length ~ Sepal.Width + Species,
              data = iris[iris$Species == "setosa", ]))
# Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
#   contrasts can be applied only to factors with 2 or more levels

If the variable is numeric (Sepal.Width) but takes only a single value, say 3, then the model runs, but you will get NA as the coefficient of that variable, as follows:

(model2 <- lm(Sepal.Length ~ Sepal.Width + Species,
              data = iris[iris$Sepal.Width == 3, ]))

# Call:
# lm(formula = Sepal.Length ~ Sepal.Width + Species,
#    data = iris[iris$Sepal.Width == 3, ])

# Coefficients:
#       (Intercept)        Sepal.Width  Speciesversicolor   Speciesvirginica
#             4.700                 NA              1.250             2.017

Solution: a variable that takes only one value has no variation for estimating a coefficient, so you need to drop it, regardless of whether it is a numeric, character, or factor variable.

Updated as per comments: since you know that the error can only occur with factor/character variables, you can focus on those and check whether the number of levels of each factor variable is 1 (DROP) or greater than 1 (NODROP).

To see whether each variable is a factor, use the following code:

(l <- sapply(iris, function(x) is.factor(x)))
#Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
#       FALSE        FALSE        FALSE        FALSE         TRUE 

Then you can get a data frame containing only the factor variables:

m <- iris[, l, drop = FALSE]  # drop = FALSE keeps a data frame even if only one column matches

Now find the number of levels of each factor variable; if it is one, that variable needs to be dropped:

n <- sapply(m, function(x) length(levels(x)))
ifelse(n == 1, "DROP", "NODROP")
#  Species 
#"NODROP" 

Note: if a factor variable has only one level, then that is the variable you have to drop.
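
One caveat to add: after subsetting a data frame, a factor keeps its unused levels, so counting levels() would report NODROP even though lm() still fails (lm() drops unused levels before applying contrasts, as the setosa example above shows). Counting distinct observed values is more robust; a sketch:

## flag factor/character predictors with fewer than 2 observed values
d <- iris[iris$Species == "setosa", ]
is_cat <- sapply(d, function(x) is.factor(x) || is.character(x))
sapply(d[is_cat], function(x) length(unique(x)) < 2)
#Species 
#   TRUE 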

Comparing all factor levels to the grand mean: can I tweak contrasts in linear model fitting to show all levels?

This answer shows you how to obtain the following coefficient table:

#               Estimate Std. Error     t value  Pr(>|t|)
#(Intercept) -0.02901982  0.4680937 -0.06199574 0.9544655
#A           -0.19238543  0.6619845 -0.29061922 0.7902750
#B            0.40884591  0.6619845  0.61760645 0.5805485
#C           -0.21646049  0.6619845 -0.32698723 0.7651640

Amazing, isn't it? It mimics what you see from summary(fit), i.e.,

#               Estimate Std. Error     t value  Pr(>|t|)
#(Intercept) -0.02901982  0.4680937 -0.06199574 0.9544655
#x1          -0.19238543  0.6619845 -0.29061922 0.7902750
#x2           0.40884591  0.6619845  0.61760645 0.5805485

But now we have all factor levels displayed.



Why does lm summary not display all factor levels?

In 2016, I answered the Stack Overflow question `lm` summary not display all factor levels, and since then it has become the target for marking duplicate questions on similar topics.

To recap, the basic idea is that in order to have a full-rank design matrix for least squares fitting, we must apply contrasts to a factor variable. Say the factor has N levels; then whatever type of contrasts we choose (see ?contrasts for a list), the contrasts reduce the N raw dummy variables to a new set of N - 1 variables. Therefore, only N - 1 coefficients are associated with an N-level factor.

However, we can transform the N - 1 coefficients back to the original N coefficients using the contrasts matrix. The transformation enables us to obtain a coefficient table for all factor levels. I will now demonstrate how to do this, based on OP's reproducible example:

set.seed(1)
y <- rnorm(6, 0, 1)
x <- factor(rep(LETTERS[1:3], each = 2))
fit <- lm(y ~ x, contrasts = list(x = contr.sum))

In this example, the sum-to-zero contrast is applied to factor x. To know more on how to control contrasts for model fitting, see my answer at How to set contrasts for my variable in regression analysis with R?.



R code walk-through

For a factor variable of N levels subject to sum-to-zero contrasts, we can use the following function to get the N x (N - 1) transformation matrix that maps the (N - 1) coefficients estimated by lm back to the N coefficients for all levels.

ContrSumMat <- function (fctr, sparse = FALSE) {
  if (!is.factor(fctr)) stop("'fctr' is not a factor variable!")
  N <- nlevels(fctr)
  Cmat <- contr.sum(N, sparse = sparse)
  dimnames(Cmat) <- list(levels(fctr), seq_len(N - 1))
  Cmat
}

For the example 3-level factor x, this matrix is:

Cmat <- ContrSumMat(x)
#   1  2
#A  1  0
#B  0  1
#C -1 -1

The fitted model fit reports 3 - 1 = 2 coefficients for this factor. We can extract them as:

## coefficients After Contrasts
coef_ac <- coef(fit)[2:3]
#        x1         x2 
#-0.1923854  0.4088459 

Therefore, the level-specific coefficients are:

## coefficients Before Contrasts
coef_bc <- (Cmat %*% coef_ac)[, 1]
#         A          B          C 
#-0.1923854  0.4088459 -0.2164605 

## note that they sum to zero as expected
sum(coef_bc)
#[1] 0

Similarly, we can get the covariance matrix after contrasts:

var_ac <- vcov(fit)[2:3, 2:3]
#           x1         x2
#x1  0.4382235 -0.2191118
#x2 -0.2191118  0.4382235

and transform it to the one before contrasts:

var_bc <- Cmat %*% var_ac %*% t(Cmat)
#           A          B          C
#A  0.4382235 -0.2191118 -0.2191118
#B -0.2191118  0.4382235 -0.2191118
#C -0.2191118 -0.2191118  0.4382235

Interpretation:

  • The model intercept coef(fit)[1] is the grand mean.

  • The computed coef_bc is the deviation of each level's mean from the grand mean.

  • The diagonal entries of var_bc give the estimated variances of these deviations. (All three claims are verified numerically in the sketch just below.)
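
These claims are easy to verify on this toy example:

## intercept = grand mean (mean of the three group means; design is balanced)
grand_mean <- mean(tapply(y, x, mean))
all.equal(unname(coef(fit)[1]), grand_mean)
#[1] TRUE

## coef_bc = deviations of the group means from the grand mean
all.equal(c(coef_bc), c(tapply(y, x, mean) - grand_mean))
#[1] TRUE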

We can then proceed to compute t-statistics and p-values for these coefficients, as follows.

## standard error of point estimate `coef_bc`
std.error_bc <- sqrt(diag(var_bc))
#        A         B         C 
#0.6619845 0.6619845 0.6619845 

## t-statistics (Null Hypothesis: coef_bc = 0)
t.stats_bc <- coef_bc / std.error_bc
#         A          B          C 
#-0.2906192  0.6176065 -0.3269872 

## p-values of the t-statistics
p.value_bc <- 2 * pt(abs(t.stats_bc), df = fit$df.residual, lower.tail = FALSE)
#        A         B         C 
#0.7902750 0.5805485 0.7651640 

## construct a coefficient table that mimics `coef(summary(fit))`
stats.tab_bc <- cbind("Estimate" = coef_bc,
"Std. Error" = std.error_bc,
"t value" = t.stats_bc,
"Pr(>|t|)" = p.value_bc)
# Estimate Std. Error t value Pr(>|t|)
#A -0.1923854 0.6619845 -0.2906192 0.7902750
#B 0.4088459 0.6619845 0.6176065 0.5805485
#C -0.2164605 0.6619845 -0.3269872 0.7651640

We can also augment it by including the result for the grand mean (i.e., the model intercept).

## extract statistics of the intercept
intercept.stats <- coef(summary(fit))[1, , drop = FALSE]
#               Estimate Std. Error     t value  Pr(>|t|)
#(Intercept) -0.02901982  0.4680937 -0.06199574 0.9544655

## augment the coefficient table
stats.tab <- rbind(intercept.stats, stats.tab_bc)
#               Estimate Std. Error     t value  Pr(>|t|)
#(Intercept) -0.02901982  0.4680937 -0.06199574 0.9544655
#A           -0.19238543  0.6619845 -0.29061922 0.7902750
#B            0.40884591  0.6619845  0.61760645 0.5805485
#C           -0.21646049  0.6619845 -0.32698723 0.7651640

We can also print this table with significance stars.

printCoefmat(stats.tab)
#            Estimate Std. Error t value Pr(>|t|)
#(Intercept) -0.02902    0.46809 -0.0620   0.9545
#A           -0.19239    0.66199 -0.2906   0.7903
#B            0.40885    0.66199  0.6176   0.5805
#C           -0.21646    0.66199 -0.3270   0.7652

Emm? Why are there no stars? Well, in this example all p-values are very large. The stars will show up if p-values are small. Here is a convincing demo:

fake.tab <- stats.tab
fake.tab[, 4] <- fake.tab[, 4] / 100
printCoefmat(fake.tab)
#            Estimate Std. Error t value Pr(>|t|)   
#(Intercept) -0.02902    0.46809 -0.0620 0.009545 **
#A           -0.19239    0.66199 -0.2906 0.007903 **
#B            0.40885    0.66199  0.6176 0.005805 **
#C           -0.21646    0.66199 -0.3270 0.007652 **
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Oh, this is so beautiful. For the meaning of these stars, see my answer at: Interpreting R significance codes for ANOVA table?



Closing Remarks

It should be possible to write a function (or even an R package) to perform such table transformations. However, it might take great effort to make such a function flexible enough to handle:

  • all types of contrasts (this is easy to do);

  • complicated model terms, like interactions between a factor and other numeric/factor variables (this seems really involved!).

So, I will stop here for the moment.



Miscellaneous Replies

Are the model scores that I get from the lm's summary still accurate, even though it isn't displaying all levels of the factor?

Yes, they are. lm conducts accurate least squares fitting.

In addition, the transformation of the coefficient table does not affect the R-squared, degrees of freedom, residuals, fitted values, F-statistic, ANOVA table, etc., as the quick check below illustrates.
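
Refitting the toy example with R's default treatment contrasts and comparing:

fit0 <- lm(y ~ x)  # default treatment contrasts
all.equal(fitted(fit0), fitted(fit))
#[1] TRUE
all.equal(summary(fit0)$r.squared, summary(fit)$r.squared)
#[1] TRUE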

Setting contrasts for part of the factors in a model.matrix

You are using a variable model.factors that is not defined anywhere, so it is not clear what the goal is. If you just want all these variables as covariates, you can do the following (note that model.matrix's argument is named contrasts.arg in full; the abbreviated contrasts= only works through partial matching):

contrasts.list <- list(species = "contr.treatment", sex = "contr.sum")
design.mat <- model.matrix(~ ., contrasts.arg = contrasts.list, data = design.df)

Note that your contrasts.list should only have values for the factor variables. Do not include batch.
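
Since the question's design.df is not shown, here is a hypothetical stand-in (column names from the question, values invented) that makes the snippet above runnable:

## made-up stand-in for the question's design.df
design.df <- data.frame(species = factor(rep(c("cat", "dog"), 4)),
                        sex     = factor(rep(c("F", "M"), each = 4)),
                        batch   = rep(1:2, 4))  # numeric, so no contrasts apply

contrasts.list <- list(species = "contr.treatment", sex = "contr.sum")
design.mat <- model.matrix(~ ., contrasts.arg = contrasts.list, data = design.df)
head(design.mat)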

lm in R: Workaround for 'contrasts' error

Turning Ben Bolker's comment into an answer, with some simulated data as evidence that it works.

Just treat your two-level factor as continuous. For a variable coded 0/1, this gives exactly the same fit as treating it as a factor.

Example:

df.num <- data.frame(a = rnorm(12),
                     b = as.factor(rep(1:4, each = 3)),
                     c = rep(0:1, 6))
df.factor <- df.num
df.factor$c <- factor(df.factor$c)

mod.factor <- lm(a ~ b * c - 1, data = df.factor)
mod.num    <- lm(a ~ b * c - 1, data = df.num)

all(coef(mod.factor) == coef(mod.num))
# [1] TRUE
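
One caveat worth adding: the exact-equality check succeeds here because the 0/1 numeric coding reproduces the factor's treatment dummies bit for bit. With any other two-value coding the coefficients change (it is a different parametrization), but the underlying fit does not:

## recode c as 1/2 instead of 0/1: different coefficients, same fitted model
df.num2 <- df.num
df.num2$c <- df.num2$c + 1
mod.num2 <- lm(a ~ b * c - 1, data = df.num2)
all.equal(fitted(mod.num), fitted(mod.num2))
#[1] TRUE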

How to force R to use a specified factor level as reference in a regression?

See the relevel() function. Here is an example:

set.seed(123)
x <- rnorm(100)
DF <- data.frame(x = x,
                 y = 4 + (1.5 * x) + rnorm(100, sd = 2),
                 b = gl(5, 20))
head(DF)
str(DF)

m1 <- lm(y ~ x + b, data = DF)
summary(m1)

Now alter the factor b in DF by use of the relevel() function:

DF <- within(DF, b <- relevel(b, ref = 3))
m2 <- lm(y ~ x + b, data = DF)
summary(m2)

The two models estimate the same underlying fit; they differ only in which level of b serves as the reference, as the coefficients show:

> coef(m1)
(Intercept)           x          b2          b3          b4          b5 
  3.2903239   1.4358520   0.6296896   0.3698343   1.0357633   0.4666219 
> coef(m2)
(Intercept)           x          b1          b2          b4          b5 
 3.66015826  1.43585196 -0.36983433  0.25985529  0.66592898  0.09678759 
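
As a quick sanity check on the reparametrization, using the outputs above: m2's intercept is m1's intercept plus m1's b3 effect (level 3's effect folds into the new intercept), and the fit itself is unchanged:

coef(m1)[["(Intercept)"]] + coef(m1)[["b3"]]
#[1] 3.660158

all.equal(fitted(m1), fitted(m2))
#[1] TRUE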

How (and why) do you use contrasts?

Contrasts are needed when you fit linear models with factors (i.e. categorical variables) as explanatory variables. The contrast specifies how the levels of the factors will be coded into a family of numeric dummy variables for fitting the model.

Here are some good notes for the different varieties of contrasts used:
http://www.unc.edu/courses/2006spring/ecol/145/001/docs/lectures/lecture26.htm

When the contrasts used are changed, the model remains the same in terms of the underlying joint probability distributions allowed. Only its parametrization changes. The fitted values remain the same as well. Also, once you have the value of the parameters for one choice of contrasts, it is easy to derive what the value of the parameters for another choice of contrasts would have been.

Therefore the choice of contrasts has no statistical consequence. It is purely a matter of making coefficients and hypothesis tests easier to interpret.
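
A small sketch of that last point, using a fresh toy example: fit the same one-factor model under treatment and sum contrasts, then recover the sum-contrast coefficients from the treatment-contrast fit through the two contrast matrices:

set.seed(42)
f <- factor(rep(c("a", "b", "c"), each = 4))
z <- rnorm(12)
m_trt <- lm(z ~ f)                                   # treatment contrasts
m_sum <- lm(z ~ f, contrasts = list(f = contr.sum))  # sum contrasts

## group means equal B %*% coef for B = [1 | contrast matrix], hence:
B_trt <- cbind(1, contr.treatment(3))
B_sum <- cbind(1, contr.sum(3))
all.equal(c(solve(B_sum, B_trt %*% coef(m_trt))), unname(coef(m_sum)))
#[1] TRUE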


