All Levels of a Factor in a Model Matrix in R

By default, model.matrix() drops one level of each factor. To keep every level, reset the contrasts for the factor variables:

model.matrix(~ Fourth + Fifth, data = testFrame,
             contrasts.arg = list(Fourth = contrasts(testFrame$Fourth, contrasts = FALSE),
                                  Fifth = contrasts(testFrame$Fifth, contrasts = FALSE)))

Or, with a little less typing, at the cost of losing the level names (the columns are numbered instead of being labelled with the levels):

model.matrix(~ Fourth + Fifth, data = testFrame,
             contrasts.arg = list(Fourth = diag(nlevels(testFrame$Fourth)),
                                  Fifth = diag(nlevels(testFrame$Fifth))))
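
The snippets above depend on a testFrame that is not shown. As a minimal sketch, assuming a small made-up testFrame with two factor columns, the same idea can be applied to every factor column at once. The data and the helper names factor_cols and full_contrasts are illustrative assumptions, not part of the original answer.

```r
# Hypothetical data; only the column names mirror the snippets above.
testFrame <- data.frame(
  Fourth = factor(c("a", "b", "c", "a")),
  Fifth  = factor(c("x", "y", "x", "y"))
)

# Names of all factor columns in the data frame.
factor_cols <- names(testFrame)[sapply(testFrame, is.factor)]

# One full (identity) contrast matrix per factor, keeping the level names.
full_contrasts <- lapply(testFrame[factor_cols], contrasts, contrasts = FALSE)

mm <- model.matrix(~ Fourth + Fifth, data = testFrame,
                   contrasts.arg = full_contrasts)
colnames(mm)
```

Note that a matrix built this way, with an intercept plus a column for every level, is rank deficient, so it suits methods that tolerate that (for example penalized regression) rather than plain least squares.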

Force model.matrix() in R to use a given set of levels

The best way to achieve this is to specify the levels of each factor explicitly:

d$foodtype1rank1 <- factor(sample(c('noodles', 'rice', 'cabbage', 'pork'), 5, replace = TRUE),
                           levels = c('noodles', 'rice', 'cabbage', 'pork', 'mackerel'))

When you know which values the data can take, this is always good practice.
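
To see the effect, here is a small sketch, assuming a made-up data frame d with fixed values instead of a random sample so the result is reproducible. The level 'mackerel' never occurs in the data, yet it still gets its own column.

```r
# Hypothetical data frame; 'mackerel' is declared as a level but never observed.
d <- data.frame(id = 1:5)
d$foodtype1rank1 <- factor(
  c("noodles", "rice", "cabbage", "pork", "rice"),
  levels = c("noodles", "rice", "cabbage", "pork", "mackerel")
)

# Drop the intercept so that every level appears as a column.
mm <- model.matrix(~ foodtype1rank1 - 1, data = d)

colnames(mm)  # one column per declared level, including mackerel
colSums(mm)   # the mackerel column contains only zeros
```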

R model.matrix column names for factors

That is because if the columns were c("X.Intercept.", "x1A", "x1B", "x2"), you would have perfect multicollinearity: x1A + x1B would be a column of ones, identical to the X.Intercept. column. If, for the sake of interpretation, you prefer to keep x1A and drop the intercept instead, you can use

formula_test <- as.formula("Y ~ -1 + x1 + x2")

giving

names(result_test)
# [1] "x1A" "x1B" "x2"

and

all(rowSums(result_test[, c("x1A", "x1B")]) == 1)
# [1] TRUE

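As for why it is x1A that is dropped rather than x1B: the rule is that the first factor level is the one that goes away. If instead we swap the level labels (note that assigning to levels() relabels the observations; relevel() changes the reference level without altering the data)

levels(data_test$x1) <- c("B", "A")

then "B" is the first level, and rebuilding the matrix gives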

names(result_test)
# [1] "X.Intercept." "x1A" "x2"
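
The collinearity argument can be checked directly. The following is a sketch under assumptions: data_test here is made up, and the matrices come straight from model.matrix(), so the intercept column is named "(Intercept)" rather than "X.Intercept." (the latter is what appears after conversion to a data frame).

```r
# Hypothetical data with a two-level factor and a numeric covariate.
data_test <- data.frame(
  x1 = factor(c("A", "B", "A", "B")),
  x2 = c(0.5, 1.2, -0.3, 2.0)
)

# With an intercept, the first level (A) is absorbed into the intercept.
with_int <- model.matrix(~ x1 + x2, data = data_test)
colnames(with_int)

# Without an intercept, both levels of x1 get a column.
no_int <- model.matrix(~ -1 + x1 + x2, data = data_test)
colnames(no_int)

# Adding the x1A dummy next to the intercept makes the matrix rank deficient:
# four columns, but x1A + x1B equals the intercept column.
full <- cbind(with_int, x1A = 1 - with_int[, "x1B"])
qr(full)$rank
```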

column names of ordered factor in model.matrix in R

I still do not know the meaning of those suffixes; there are probably historical reasons behind them.

Stepping through the model.matrix function in the debugger shows that it calls the internal C routine C_modelmatrix:

.External2(C_modelmatrix, t, data)

For an ordered factor, contr.poly() is used to build the contrast matrix, by way of the make.poly() helper defined inside it. Once the matrix is built, its column names are modified: columns 2 to 4 are given those strange suffixes. The first column is then dropped, and if there are more than four columns, the remaining ones keep the names assigned by make.poly().

contr <- make.poly(n, scores)
if (contrasts) {
    dn <- colnames(contr)
    dn[2:min(4, n)] <- c(".L", ".Q", ".C")[1:min(3, n - 1)]
    colnames(contr) <- dn
    contr[, -1, drop = FALSE]
}

In summary, those suffixes are just labels applied to columns 2 to 4 of the polynomial contrast matrix; they do not refer to particular levels of the ordered factor. For an ordered factor with more than four levels, the columns beyond the fourth are not renamed. See the example below.

head(model.matrix(
  ~ ps,
  model.frame(
    ~ ps,
    data.frame(ps = factor(
      x = sample(x = c('none', '3XLT', '2X', '41X', '3X'), size = 50, replace = TRUE),
      levels = c('3X', '3XLT', '2X', '41X', 'none'),
      ordered = TRUE
    ))
  )
))

# (Intercept) ps.L ps.Q ps.C ps^4
# 1 1 0.0000000 -0.5345225 -4.095972e-16 0.7171372
# 2 1 0.0000000 -0.5345225 -4.095972e-16 0.7171372
# 3 1 0.3162278 -0.2672612 -6.324555e-01 -0.4780914
# 4 1 -0.6324555 0.5345225 -3.162278e-01 0.1195229
# 5 1 0.3162278 -0.2672612 -6.324555e-01 -0.4780914
# 6 1 -0.6324555 0.5345225 -3.162278e-01 0.1195229

The value of contr, as seen in the debugger:

contr <- make.poly(n, scores)
Browse[6]> contr
# ^0 ^1 ^2 ^3 ^4
# [1,] 0.4472136 -0.6324555 0.5345225 -3.162278e-01 0.1195229
# [2,] 0.4472136 -0.3162278 -0.2672612 6.324555e-01 -0.4780914
# [3,] 0.4472136 0.0000000 -0.5345225 -4.095972e-16 0.7171372
# [4,] 0.4472136 0.3162278 -0.2672612 -6.324555e-01 -0.4780914
# [5,] 0.4472136 0.6324555 0.5345225 3.162278e-01 0.1195229

EDIT:
The suffixes L, Q and C in the contrast matrix of an ordered factor stand for the linear, quadratic and cubic terms. Polynomial terms of degree greater than 3 are named by the numeric value of their degree (^4, ^5, and so on).
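
The naming rule can be checked without a debugger by calling contr.poly() directly. This is a small sketch; the variable names cp3, cp6 and ps are only illustrative.

```r
# Column names of the polynomial contrasts for 3 and 6 ordered levels.
cp3 <- contr.poly(3)
cp6 <- contr.poly(6)
colnames(cp3)  # linear and quadratic only
colnames(cp6)  # linear, quadratic, cubic, then degrees 4 and 5

# The contrast columns are orthonormal, so their cross-product is the identity.
round(crossprod(cp6), 10)

# The same suffixes show up in the model matrix of an ordered factor.
ps <- factor(c("lo", "mid", "hi", "mid"),
             levels = c("lo", "mid", "hi"), ordered = TRUE)
colnames(model.matrix(~ ps))
```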

how to force model.matrix to use all levels of 2 categorical variables?

Not a model.matrix solution, but you can get the binary output using mtabulate from the qdapTools package:

library(qdapTools)
mtabulate(as.data.frame(t(d.data)))

Another option is to loop through the column names of 'd.data', call model.matrix separately on each column, cbind the results and change the column names (if required).

d1 <- do.call(cbind, lapply(names(d.data), function(i)
  model.matrix(~ get(i) - 1, d.data)))
colnames(d1) <- sub('.*\\)', '', colnames(d1))
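
Since d.data is not shown, here is a self-contained sketch of the per-column approach, assuming a made-up d.data with two factor columns. It swaps get(i) for reformulate(), which builds the formula from the column name, so the variable name is kept as a prefix in the column labels; that is a variation on the code above, not the original answer's method.

```r
# Hypothetical data with two categorical variables.
d.data <- data.frame(
  color = factor(c("red", "blue", "red")),
  size  = factor(c("S", "M", "L"))
)

# One intercept-free model matrix per column, bound side by side.
d1 <- do.call(cbind, lapply(names(d.data), function(i) {
  # reformulate("color", intercept = FALSE) builds the formula ~color - 1
  model.matrix(reformulate(i, intercept = FALSE), d.data)
}))

colnames(d1)  # every level of both variables has its own column
rowSums(d1)   # each row has exactly one 1 per variable
```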

defining the control value in a model.matrix

The choice is not random: model.matrix leaves out whatever the first level of the factor is. In your examples, observe

# from example 1
levels(factor(c('high','high','control','control','low','low')))
# [1] "control" "high" "low"

# from example 2
levels(factor(c('high','high','med','med','low','low')))
# [1] "high" "low" "med"

By default the levels are sorted alphabetically, so "control" is used as the reference in the first case and "high" in the second. This would not have been a problem if both factors had the same levels. You can fix it either by setting the levels explicitly when you create the factor, or by using the relevel() command. For example

diet <- relevel(diet,"med")
model.matrix(~ diet + sex)

Also, remember that the levels are not really "left out": the default contrast uses a reference level, so the reference level ends up in the intercept term. If you fit a model without an intercept, they are all there

model.matrix(~ diet -1)
# dietmed diethigh dietlow
# 1 0 1 0
# 2 0 1 0
# 3 1 0 0
# 4 1 0 0
# 5 0 0 1
# 6 0 0 1
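
Putting the pieces together, the following sketch uses the diet values from example 2 above and shows how relevel() changes which level is absorbed into the intercept. The names before and after are only illustrative.

```r
diet <- factor(c("high", "high", "med", "med", "low", "low"))

# Alphabetical default: "high" is the first level, hence the reference.
levels(diet)
before <- colnames(model.matrix(~ diet))

# Move "med" to the front; the remaining levels keep their order.
diet <- relevel(diet, ref = "med")
levels(diet)
after <- colnames(model.matrix(~ diet))

before  # columns for low and med; high is the reference
after   # columns for high and low; med is the reference
```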

