Interpretation of Ordered and Non-Ordered Factors, VS. Numerical Predictors in Model Summary

Interpretation of ordered and non-ordered factors, vs. numerical predictors in model summary

This is not really a mixed-model specific question, but rather a general question about model parameterization in R.

Let's try a simple example.

set.seed(101)
d <- data.frame(x=sample(1:4,size=30,replace=TRUE))
d$y <- rnorm(30,1+2*d$x,sd=0.01)

x as numeric

This just does a linear regression: the x parameter denotes the change in y per unit of change in x; the intercept specifies the expected value of y at x=0.

coef(lm(y~x,d))
## (Intercept) x
## 0.9973078 2.0001922

x as (unordered/regular) factor

coef(lm(y~factor(x),d))
## (Intercept) factor(x)2 factor(x)3 factor(x)4
## 3.001627 1.991260 3.995619 5.999098

The intercept specifies the expected value of y in the baseline level of the factor (x=1); the other parameters specify the difference between the expected value of y when x takes on other values.

x as ordered factor

coef(lm(y~ordered(x),d))
## (Intercept) ordered(x).L ordered(x).Q ordered(x).C
## 5.998121421 4.472505514 0.006109021 -0.003125958

Now the intercept specifies the value of y at the mean factor level (halfway between 2 and 3); the L (linear) parameter gives a measure of the linear trend (not quite sure I can explain the particular value ...), Q and C specify quadratic and cubic terms (which are close to zero in this case because the pattern is linear); if there were more levels the higher-order contrasts would be numbered 5, 6, ...

successive-differences contrasts

coef(lm(y~factor(x),d,contrasts=list(`factor(x)`=MASS::contr.sdif)))
## (Intercept) factor(x)2-1 factor(x)3-2 factor(x)4-3
## 5.998121 1.991260 2.004359 2.003478

This contrast specifies the parameters as the differences between successive levels, which are all a constant value of (approximately) 2.

Interpretation of .L, .Q., .C, .4… for logistic regression

That output indicates that your predictor Year is an "ordered factor" meaning R not only understands observations within that variable to be distinct categories or groups (i.e., a factor) but also that the various categories have a natural order to them where one category is considered larger than another.

In this situation, R's default is to fit a series of polynomial functions or contrasts to the levels of the variable. The first is linear (.L), the second is quadratic (.Q), the third is cubic (.C), and so on. R will fit one fewer polynomial functions than the number of available levels. Thus, your output indicates there are 17 distinct years in your data.

You can probably think of those 17 (counting the intercept) predictors in your output as entirely new variables all based on the order of your original variable because R creates them using special values that make all the new predictors orthogonal (i.e., unrelated, linearly independent, or uncorrelated) to each other.

One way to see the values that got used is to use the model.matrix() function on your model object.

model.matrix(glm(DEPENDENT ~ Year, data = HAVE, family = "binomial"))

If you run the above, you will find a bunch of repeated numbers within each of the new variable columns where the changes in repetition correspond to where your original Year predictor switched categories. The specific values themselves hold no real meaning to you because they were chosen/computed by R to make all the contrasts linearly independent of one another.

Therefore, your model in the R output would be:

logit(p) = -3.57 + -2.21 * Year.L + -0.93 * Year.Q + ... + -0.15 * Year^16

where p is the probability of presence of the characteristic of interest, and the logit transformation is defined as the logged odds where odds = p / (1 - p) and logged odds = ln(odds). Therefore logit(p) = ln(p / (1 - p)).

The interpretation of a particular beta test is then generalized to: Which contrasts contribute significantly to explain any differences between levels in your dependent variable? Because your Year.L predictor is significant and negative, this suggests a linear decreasing trend in logit across years, and because your Year.Q predictor is significant and negative, this suggests a deacceleration trend is detectable in the pattern of logits across years. Third order polynomials model jerk, and fourth order polynomials model jounce (a.k.a., snap). However, I would stop interpreting around this order and higher because it quickly becomes nonsensical to practical folk.

Similarly, to interpret a particular beta estimate is a bit nonsensical to me, but it would be that the odds of switching categories in your outcome at a given level of a particular contrast (e.g., quadratic) as compared to the odds of switching categories in your outcome at the given level of that contrast (e.g., quadratic) less one unit is equal to the odds ratio had by exponentiating the beta estimate. For the quadratic contrast in your example, the odds ratio would be exp(-0.9328) = 0.3935, but I say it's a bit nonsensical because the units have little practical meaning as they were chosen by R to make the predictors linearly independent from one another. Thus I prefer focusing on the interpretation of a given contrast test as opposed to the coefficient in this circumstance.

For further reading, here is a webpage at UCLA's wonderful IDRE that discusses how to interpret odds ratios in logistic regression, and here is a crazy cool but intense stack exchange answer that walks through how R chooses the polynomial contrast weights.

labelling of ordered factor variable

Like many other people, I think you might be misunderstanding the meaning of an "ordered" factor in R. All factors in R are ordered, in a sense; the estimates etc. are typically printed, plotted, etc. in the order of the levels vector. Specifying that a factor is of type ordered has two major effects:

  • it allows you to evaluate inequalities on the levels of the factor (e.g. you can filter(age > "b"))
  • the contrasts are set by default to orthogonal polynomial contrasts, which is where the L (linear) and Q (quadratic) labels come from: see e.g. this CrossValidated answer for more details.

If you want this variable treated in the same way a regular factor (so that the estimates are made for differences of groups from the baseline level, i.e. treatment contrasts), you can:

  • convert back to an unordered factor (e.g. factor(age, ordered=FALSE))
  • specify that you want to use treatment contrasts in your model (in base R you would specify contrasts = list(age = "contr.treatment"))
  • set options(contrasts = c(unordered = "contr.treatment", ordered = "contr.treatment")) (the default for ordered is "contr.poly")

If you have an unordered ("regular") factor and the levels are not in the order you want, you can reset the level order by specifying the levels explicitly, e.g.

mutate(across(age, factor, 
levels = c("0-10 years", "11-20 years", "21-30 years", "30-40 years")))

R sets the factors in alphabetical order by default, which is sometimes not what you want (but I can't think of a case where the order would be 'random' ...)

R's lm showing strange behaviour with ordered factors

Try running this

rest<- lm(y ~ t)
restno <- lm(y ~ tno)
resti <- lm(y ~ ti)

rest$fitted.values
restno$fitted.values
resti$fitted.values

rest$xlevels
restno$xlevels
resti$xlevels

rest$contrasts
restno$contrasts
resti$contrasts

What you will see is, first, that the fitted values are exactly the same for all three models. So the model for ordered is not "wrong."

Second, you will see that the levels are different. In fact, only tno has levels. The others do not since you are treating them as though they are numeric, which you can do since this is a dichotomous variable. You'll also see that the factor is a string while the other two are not.

Third, you will see that tno and ti use "contr.treatment" while t uses "contr.poly" which makes sense for an ordinal variable.

If you run this

restno_poly<- lm(y ~ tno, contrasts = list(tno = "contr.poly"))
restno_poly

You'll get

Call:
lm(formula = y ~ tno, contrasts = list(tno = "contr.poly"))

Coefficients:
(Intercept) tno.L
-3.6082 -0.8038

Likewise

rest_treatment<- lm(y ~ t, contrasts = list(t = "contr.treatment"))

Similarly gives the results you were expecting.

This page explains more.



Related Topics



Leave a reply



Submit