All Levels of a Factor in a Model Matrix in R
You need to reset the contrasts for the factor variables:
model.matrix(~ Fourth + Fifth, data=testFrame,
    contrasts.arg=list(Fourth=contrasts(testFrame$Fourth, contrasts=FALSE),
                       Fifth=contrasts(testFrame$Fifth, contrasts=FALSE)))
Or, with a little less typing, but without the proper column names:
model.matrix(~ Fourth + Fifth, data=testFrame,
contrasts.arg=list(Fourth=diag(nlevels(testFrame$Fourth)),
Fifth=diag(nlevels(testFrame$Fifth))))
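As a minimal sketch of the first variant, the contrast list can also be built for every factor column at once. The `testFrame` below is hypothetical (the original data is not shown), and `factor_cols` and `full_contrasts` are illustrative names only:

```r
# Hypothetical data; the original testFrame is not shown in the source.
testFrame <- data.frame(
  Fourth = factor(c("a", "b", "c", "a")),
  Fifth  = factor(c("x", "y", "x", "y"))
)

# Build an identity (full dummy) contrast matrix for every factor column.
factor_cols <- names(testFrame)[sapply(testFrame, is.factor)]
full_contrasts <- lapply(testFrame[factor_cols], contrasts, contrasts = FALSE)

mm <- model.matrix(~ Fourth + Fifth, data = testFrame,
                   contrasts.arg = full_contrasts)

# Every level of both factors now has its own column, next to the intercept.
colnames(mm)
```

Note that the resulting matrix is rank deficient by construction, which is fine for regularised models but not for an ordinary least-squares fit.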
Force model.matrix() in R to use a given set of levels
The best way to achieve this is by explicitly specifying the levels of each factor:
d$foodtype1rank1 <- factor(sample(c('noodles','rice','cabbage','pork'), 5, replace=TRUE),
    levels=c('noodles','rice','cabbage','pork','mackerel'))
When you know the data, this is always good practice.
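To see the effect, here is a small sketch under the assumption that `d` is an ordinary data frame (the `id` column is made up for illustration). A level that never occurs in the data still gets its own column:

```r
# Hypothetical data frame; only the factor construction comes from the answer.
set.seed(1)
d <- data.frame(id = 1:5)
d$foodtype1rank1 <- factor(
  sample(c("noodles", "rice", "cabbage", "pork"), 5, replace = TRUE),
  levels = c("noodles", "rice", "cabbage", "pork", "mackerel")
)

# No intercept, so each declared level gets a column.
mm <- model.matrix(~ foodtype1rank1 - 1, data = d)
colnames(mm)

# "mackerel" is never sampled, so its column exists but is all zeros.
all(mm[, "foodtype1rank1mackerel"] == 0)
```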
R model.matrix column names for factors
That is because if you had c("X.Intercept.", "x1A", "x1B", "x2"), then you would have perfect multicollinearity: x1A + x1B would be a column of ones, just like the X.Intercept. column. If, for the sake of interpretation, you prefer having x1A instead of the intercept, we may use
formula_test <- as.formula("Y ~ -1 + x1 + x2")
giving
names(result_test)
# [1] "x1A" "x1B" "x2"
and
all(rowSums(result_test[, c("x1A", "x1B")]) == 1)
# [1] TRUE
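As for why it is x1A that is dropped rather than x1B, the rule seems to be that the first factor level is the one that goes away. If instead we use
# make "B" the first level without relabelling the observations
data_test$x1 <- factor(data_test$x1, levels = c("B", "A"))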
then this gives
names(result_test)
# [1] "X.Intercept." "x1A" "x2"
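The multicollinearity argument can be checked numerically. This is only a sketch: the `data_test` below is a hypothetical stand-in for the data in the question, and it works on the raw model matrix, so the intercept column is named `(Intercept)` rather than `X.Intercept.`:

```r
# Hypothetical data standing in for data_test, which the source does not show.
data_test <- data.frame(
  x1 = factor(c("A", "B", "A", "B", "A", "B")),
  x2 = c(0.5, 1.2, -0.3, 2.0, 0.7, -1.1)
)

# Default treatment contrasts: the first level, "A", has no column.
mm <- model.matrix(~ x1 + x2, data = data_test)
colnames(mm)

# Append the missing dummy column by hand and compare ranks.
full <- cbind(mm, x1A = as.numeric(data_test$x1 == "A"))
qr(mm)$rank    # equals ncol(mm): full column rank
qr(full)$rank  # unchanged, although a column was added
```

The rank does not grow because x1A + x1B reproduces the intercept column exactly.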
Column names of ordered factor in model.matrix in R
I still do not know the meaning of those suffixes; there are probably some historical reasons behind them.
After debugging the model.matrix function, I found a call to C_modelmatrix inside it:
.External2(C_modelmatrix, t, data)
For an ordered factor, contr.poly is used to get the design matrix, using the make.poly function defined inside it. After getting the design matrix, the column names are modified: columns 2 to 4 are given those strange suffixes. The first column is dropped, and if there are more than 4 columns, the remaining ones keep the names assigned by the make.poly function.
contr <- make.poly(n, scores)
if (contrasts) {
    dn <- colnames(contr)
    dn[2:min(4, n)] <- c(".L", ".Q", ".C")[1:min(3, n - 1)]
    colnames(contr) <- dn
    contr[, -1, drop = FALSE]
}
In summary, those suffixes are just labels given to columns 2 to 4 of the polynomial contrast matrix; they are not tied to particular levels of the ordered factor. For ordered factors with more than 4 levels, the remaining columns are not renamed. See an example below.
head( model.matrix( as.formula( ~ ps ),
model.frame( as.formula( ~ ps ),
data.frame(ps = factor( x = sample(x = c( 'none', '3XLT', '2X', '41X', '3X' ),
size = 50,
replace = TRUE ),
levels = c( '3X', '3XLT', '2X', '41X', 'none' ),
ordered = TRUE ) ) ) ) )
#   (Intercept)       ps.L       ps.Q          ps.C       ps^4
# 1           1  0.0000000 -0.5345225 -4.095972e-16  0.7171372
# 2           1  0.0000000 -0.5345225 -4.095972e-16  0.7171372
# 3           1  0.3162278 -0.2672612 -6.324555e-01 -0.4780914
# 4           1 -0.6324555  0.5345225 -3.162278e-01  0.1195229
# 5           1  0.3162278 -0.2672612 -6.324555e-01 -0.4780914
# 6           1 -0.6324555  0.5345225 -3.162278e-01  0.1195229
The output of contr:
contr <- make.poly(n, scores)
Browse[6]> contr
#             ^0         ^1         ^2            ^3         ^4
# [1,] 0.4472136 -0.6324555  0.5345225 -3.162278e-01  0.1195229
# [2,] 0.4472136 -0.3162278 -0.2672612  6.324555e-01 -0.4780914
# [3,] 0.4472136  0.0000000 -0.5345225 -4.095972e-16  0.7171372
# [4,] 0.4472136  0.3162278 -0.2672612 -6.324555e-01 -0.4780914
# [5,] 0.4472136  0.6324555  0.5345225  3.162278e-01  0.1195229
EDIT:
The suffixes L, Q and C in the contrast matrix of an ordered factor stand for the Linear, Quadratic and Cubic terms. Polynomial terms of degree greater than 3 are named by the numeric degree of the term (for example ^4).
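The naming can be confirmed without a debugger by calling `contr.poly()` directly. A short sketch for a 5-level ordered factor with the default equally spaced scores:

```r
# Polynomial contrasts for an ordered factor with 5 levels.
cp <- contr.poly(5)

# Degrees 1 to 3 get the .L/.Q/.C labels; degree 4 keeps its "^4" name.
colnames(cp)

# The columns are orthonormal, so their cross-product is the identity.
round(crossprod(cp), 10)

# The linear column is just the centred, rescaled level scores 1..5.
cor(cp[, ".L"], 1:5)
```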
How to force model.matrix to use all levels of 2 categorical variables?
Not a model.matrix solution, but you can get the binary output using mtabulate from the qdapTools package:
library(qdapTools)
mtabulate(as.data.frame(t(d.data)))
Or another option would be to loop through the column names of 'd.data', do the model.matrix separately on each column, cbind the results and change the column names (if required).
d1 <- do.call(cbind, lapply(names(d.data), function(i)
    model.matrix(~get(i)-1, d.data)))
colnames(d1) <- sub('.*\\)', '', colnames(d1))
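A variation on the same loop, offered as a sketch rather than as the answer's own method: building each one-variable formula with `reformulate()` keeps the variable name in the column names, so no renaming step is needed. The `d.data` below is hypothetical, since the original data is not shown:

```r
# Hypothetical stand-in for d.data.
d.data <- data.frame(
  colour = factor(c("red", "blue", "red")),
  size   = factor(c("S", "L", "M"))
)

# One intercept-free model matrix per column, bound side by side.
d1 <- do.call(cbind, lapply(names(d.data), function(i) {
  model.matrix(reformulate(i, intercept = FALSE), data = d.data)
}))

# Columns are named <variable><level>, e.g. "colourblue".
d1
```

Because each factor is expanded on its own without an intercept, every level of every variable gets a column, and each row contains exactly one 1 per variable.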
Defining the control value in a model.matrix
The choice is not random. It leaves out whatever the first level of the factor is. In your examples, observe
# from example 1
levels(factor(c('high','high','control','control','low','low')))
# [1] "control" "high" "low"
# from example 2
levels(factor(c('high','high','med','med','low','low')))
# [1] "high" "low" "med"
By default the levels are sorted alphabetically. So in the first case "control" is used as the reference, while in the second case "high" is used as the reference. This wouldn't have been a problem if you had the same levels in both factors. You can adjust that either by setting your factors to have the same levels explicitly when you create them, or by using the relevel() command. For example
diet <- relevel(diet,"med")
model.matrix(~ diet + sex)
Also, remember they are not "left out"; the default contrast is the reference-level (treatment) contrast, so the reference level winds up in the intercept term. If you fit a model without an intercept, they are all there
model.matrix(~ diet -1)
#   dietmed diethigh dietlow
# 1       0        1       0
# 2       0        1       0
# 3       1        0       0
# 4       1        0       0
# 5       0        0       1
# 6       0        0       1
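Putting the pieces together, a self-contained sketch using the values from the second example (the `sex` variable is left out because it is not shown here):

```r
# Values from the second example; levels default to alphabetical order.
diet <- factor(c("high", "high", "med", "med", "low", "low"))
levels(diet)                    # "high" comes first, so it is the reference
colnames(model.matrix(~ diet))

# Move "med" to the front; the other levels keep their relative order.
diet <- relevel(diet, ref = "med")
levels(diet)
colnames(model.matrix(~ diet))  # "med" is now absorbed into the intercept
```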