model.matrix() with na.action=NULL?

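You can adjust the model.matrix object after the fact, using the row names to restore the dropped rows:

MM <- model.matrix(ff,dat)
# re-index by the row names of dat; rows dropped because of NA come back as all-NA rows
MM <- MM[match(rownames(dat),rownames(MM)),]
# refill the numeric column, which is known even where the factor is missing
MM[,"b"] <- dat$b
rownames(MM) <- rownames(dat)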

which gives :

> MM
(Intercept) b fact2 fact3 fact4 fact5
1 1 0.9583010 0 0 0 0
2 1 0.3266986 0 0 0 0
3 NA 1.4992358 NA NA NA NA
4 1 1.2867461 1 0 0 0
5 1 0.5024700 0 1 0 0
6 1 0.9583010 0 1 0 0
7 1 0.3266986 0 0 1 0
8 1 1.4992358 0 0 1 0
9 1 1.2867461 0 0 0 1
10 1 0.5024700 0 0 0 1

Alternatively, you can let contrasts() do the work for you and construct the matrix by hand:

cont <- contrasts(dat$fact)[as.numeric(dat$fact),]
colnames(cont) <- paste("fact",colnames(cont),sep="")
out <- cbind(1,dat$b,cont)
out[is.na(dat$fact),1] <- NA
colnames(out)[1:2]<- c("Intercept","b")
rownames(out) <- rownames(dat)

which gives :

> out
Intercept b fact2 fact3 fact4 fact5
1 1 0.2534288 0 0 0 0
2 1 0.2697760 0 0 0 0
3 NA -0.8236879 NA NA NA NA
4 1 -0.6053445 1 0 0 0
5 1 0.4608907 0 1 0 0
6 1 0.2534288 0 1 0 0
7 1 0.2697760 0 0 1 0
8 1 -0.8236879 0 0 1 0
9 1 -0.6053445 0 0 0 1
10 1 0.4608907 0 0 0 1

In any case, both methods can be incorporated in a function that can deal with more complex formulae. I leave the exercise to the reader (how I loathe that sentence when I meet it in a paper ;-) )
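
As a rough sketch of what such a function could look like (this is not from the original answer), one option is to build the model frame with na.action = na.pass and hand that frame to model.matrix(). The helper name mm_keep_na and the toy data are made up for illustration. Unlike the hand-built matrices above, this approach leaves the intercept at 1 in rows where the factor is missing.

```r
# Hypothetical helper: build a model matrix that keeps rows containing NA.
mm_keep_na <- function(formula, data) {
  # na.pass keeps every row in the model frame instead of dropping incomplete ones
  mf <- model.frame(formula, data = data, na.action = na.pass)
  model.matrix(formula, data = mf)
}

# Toy data (an assumption, not the data from the question)
set.seed(1)
dat <- data.frame(b = rnorm(10), fact = factor(rep(1:5, each = 2)))
dat$fact[3] <- NA

MM <- mm_keep_na(~ b + fact, dat)
nrow(MM)   # one row per row of dat
MM[3, ]    # the dummy columns for fact are NA where the factor is missing
```

Because the formula is an argument, the same helper should also accept interactions or transformed terms, which is the part the two snippets above would need extra work for.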

Using NAs with model.matrix

One solution might be to convert your variable of interest to a factor without excluding NA, so that NA becomes a level of its own:

iris$Species[1] <- NA
mm2 <- model.matrix(~factor(iris$Species, exclude=NULL)-1)
dim(mm2)
# [1] 150   4
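
To see what the extra column holds, a small check along these lines may help. It is only a sketch: it works on a copy (iris2, a name made up here) so that the built-in iris data set stays untouched.

```r
iris2 <- iris                     # copy, so the built-in data set is not modified
iris2$Species[1] <- NA

# exclude = NULL keeps NA as a regular factor level
sp <- factor(iris2$Species, exclude = NULL)
mm <- model.matrix(~ sp - 1)

levels(sp)    # the three species plus NA
colSums(mm)   # number of rows per level; the last column counts the NA rows
mm[1, ]       # row 1 is flagged only in the NA column
```

Since NA is now an ordinary level, the row is no longer treated as missing and so is not dropped.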

Why do I get unused argument (na.action = NULL) error in aggregate?

You don't have to use the column names in aggregate.formula; indexing the columns by position works as well.
Passing na.action=na.pass should cover your na.action requirements.

setNames(
  aggregate( cbind(df[,1], df[,3]) ~ df[,2], df, sum, na.rm=TRUE,
             na.action=na.pass ),
  colnames(df[,c(2,1,3)]) )
#   group  x  other_var
# 1     1 25 -0.7313815
# 2     2 30  0.3231317

Data

(I added NAs)

df <- structure(list(x = 1:10, group = c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 
2L, 1L, 2L), other_var = c(-1.79458090358371, 0.295106071151792,
NA, -0.589487588239041, 0.325944874015228, NA, 0.737254570399201,
0.47849317537615, NA, 0.139020009150021)), row.names = c(NA,
-10L), class = "data.frame")
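
To illustrate why na.action matters here, the following sketch compares the default (na.omit) with na.pass on a smaller toy data frame. The data and the names toy, dropped and kept are assumptions made for this example.

```r
# Toy data with missing values in other_var only
toy <- data.frame(
  x         = 1:6,
  group     = c(1L, 2L, 1L, 2L, 1L, 2L),
  other_var = c(0.5, 1.5, NA, 2.5, 1.0, NA)
)

# Default na.action = na.omit: rows 3 and 6 are removed before summing,
# so their x values are lost as well
dropped <- aggregate(cbind(x, other_var) ~ group, toy, sum, na.rm = TRUE)

# na.pass keeps all rows; na.rm = TRUE is passed on to sum() and
# skips the NA values column by column
kept <- aggregate(cbind(x, other_var) ~ group, toy, sum, na.rm = TRUE,
                  na.action = na.pass)

dropped$x   # sums of x over complete rows only
kept$x      # sums of x over all rows
```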

failed to omit Columns containing NA values with: na.rm=TRUE and na.action=NULL

Let's take a look at your aggregate call

aggregate(data, by = list(data$Role, data$Shift), FUN = mean)

Here you are calculating the average of values across all columns of data by data$Role and data$Shift (which are your grouping variables).

The error is pretty self-explanatory in telling you that you are trying to calculate the mean of non-numeric entries. data$Name, data$Role and data$Shift are all non-numeric columns.

I assume you are after

aggregate(. ~ Role + Shift, data = data[, -1], FUN = mean)
# Role Shift Salary Age
#1 Cook Dinner 1800 25.0
#2 Manager Dinner 2000 41.0
#3 Server Dinner 1650 27.5
#4 Cook Lunch 1200 24.0
#5 Manager Lunch 2200 32.0
#6 Server Lunch 1350 24.0

The . (dot) here denotes all variables except the ones on the RHS of the ~ (tilde). Notice how we exclude data$Name by passing data[, -1] as the data argument to aggregate.

Or using the by syntax

aggregate(data[, c("Salary", "Age")], by = list(data$Role, data$Shift), FUN = "mean")

Here the x argument (the first argument) holds the columns whose values you want to aggregate, and by defines the groups.
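
A small sketch of the by syntax on made-up data (the staff data frame below is hypothetical, not the data from the question). With an unnamed list the grouping columns are called Group.1, Group.2 and so on; naming the list elements gives readable column names instead.

```r
# Hypothetical staff data
staff <- data.frame(
  Name   = c("Ann", "Bob", "Cid", "Dee"),
  Role   = c("Cook", "Cook", "Server", "Server"),
  Shift  = c("Lunch", "Dinner", "Lunch", "Lunch"),
  Salary = c(1200, 1800, 1300, 1500),
  Age    = c(24, 25, 22, 26)
)

# Only the numeric columns go into x; the names in the by list
# become the names of the grouping columns in the result
res <- aggregate(staff[, c("Salary", "Age")],
                 by = list(Role = staff$Role, Shift = staff$Shift),
                 FUN = mean)
res
```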


In response to your comment, to aggregate only by Role

aggregate(cbind(Salary, Age) ~ Role, data = data[, -1], FUN = mean)
# Role Salary Age
#1 Cook 1500 24.50
#2 Manager 2100 36.50
#3 Server 1500 25.75

Remove dependent variable from formula for model.matrix

This is how I usually do this. I'm not aware of a built-in function for this.

df = data.frame(y=rnorm(10), x1=rnorm(10), x2=rnorm(10), x3=rnorm(10))
mymodel = lm(y ~ x1 + x2 + x3, df)
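# as.character() on a two-sided formula gives "~", the response and the right-hand side;
# the third element is the right-hand side
form_vars_only =
  formula(paste("~",strsplit(as.character(formula(mymodel)),"~")[[3]]))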
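
A possible alternative, which is not part of the original answer: the stats package provides delete.response(), which removes the response from a terms object. The sketch below assumes the same kind of lm fit as above; the variable names rhs_terms and X are made up.

```r
set.seed(42)
df <- data.frame(y = rnorm(10), x1 = rnorm(10), x2 = rnorm(10), x3 = rnorm(10))
mymodel <- lm(y ~ x1 + x2 + x3, df)

# Drop the response from the model's terms, then turn the terms back into a formula
rhs_terms <- delete.response(terms(mymodel))
form_vars_only <- formula(rhs_terms)
form_vars_only

# The one-sided formula can be passed straight to model.matrix()
X <- model.matrix(form_vars_only, df)
colnames(X)
```

This avoids string manipulation, which may matter when the right-hand side is long enough for deparsing to split it across several strings.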

Why these two formulas give two different correlograms?

The two functions model.matrix and data.matrix behave differently in several ways, including what happens if there are NA values, and how non-numeric variables are handled. See the help pages.

By default, model.matrix deletes every row that contains an NA. data.matrix keeps those rows, and unless an entire row is NA, its non-missing values still contribute to cor(use = "pairwise.complete.obs"). This explains the different correlation coefficients.

If you have to use model.matrix, you could set the option to pass NA values (see solution here) and handle NA values in cor(use="pairwise.complete.obs").

Get data

library(tidyverse)

df <- data.frame(
idcode = c(1:10),
contract = c(TRUE,FALSE,FALSE,FALSE,NA,NA,TRUE,TRUE,FALSE,TRUE),
score = c (1.17, 5, 7.2, 6.6, 3, 3.8, 7.2, 9.1, 5.4, 2.21),
CEO = c(FALSE,NA,TRUE,TRUE,TRUE,TRUE,TRUE,TRUE,TRUE,TRUE))

Note that logical values should be written without quotation marks (TRUE rather than "TRUE"), but the results will look the same here.

Default behaviour of model.matrix

If there are NA values, model.matrix drops the entire row while data.matrix keeps it. This is due to the default options()$na.action, which is set to na.omit and which only affects model.matrix.

options()$na.action
#[1] "na.omit"

model.matrix(~0 + ., data = df)
#> idcode contractFALSE contractTRUE score CEOTRUE
#> 1 1 0 1 1.17 0
#> 3 3 1 0 7.20 1
#> 4 4 1 0 6.60 1
#> 7 7 0 1 7.20 1
#> 8 8 0 1 9.10 1
#> 9 9 1 0 5.40 1
#> 10 10 0 1 2.21 1
#> attr(,"assign")
#> [1] 1 2 2 3 4
#> attr(,"contrasts")
#> attr(,"contrasts")$contract
#> [1] "contr.treatment"
#>
#> attr(,"contrasts")$CEO
#> [1] "contr.treatment"

data.matrix(df)
#> idcode contract score CEO
#> [1,] 1 2 1.17 1
#> [2,] 2 1 5.00 NA
#> [3,] 3 1 7.20 2
#> [4,] 4 1 6.60 2
#> [5,] 5 NA 3.00 2
#> [6,] 6 NA 3.80 2
#> [7,] 7 2 7.20 2
#> [8,] 8 2 9.10 2
#> [9,] 9 1 5.40 2
#> [10,] 10 2 2.21 2

Behaviour with na.action = "na.pass"

# set na.action options
oldpar <- options()$na.action
options(na.action ="na.pass")

model.matrix(~0 + ., data = df)
#> idcode contractFALSE contractTRUE score CEOTRUE
#> 1 1 0 1 1.17 0
#> 2 2 1 0 5.00 NA
#> 3 3 1 0 7.20 1
#> 4 4 1 0 6.60 1
#> 5 5 NA NA 3.00 1
#> 6 6 NA NA 3.80 1
#> 7 7 0 1 7.20 1
#> 8 8 0 1 9.10 1
#> 9 9 1 0 5.40 1
#> 10 10 0 1 2.21 1
#> attr(,"assign")
#> [1] 1 2 2 3 4
#> attr(,"contrasts")
#> attr(,"contrasts")$contract
#> [1] "contr.treatment"
#>
#> attr(,"contrasts")$CEO
#> [1] "contr.treatment"

data.matrix(df)
#> idcode contract score CEO
#> [1,] 1 2 1.17 1
#> [2,] 2 1 5.00 NA
#> [3,] 3 1 7.20 2
#> [4,] 4 1 6.60 2
#> [5,] 5 NA 3.00 2
#> [6,] 6 NA 3.80 2
#> [7,] 7 2 7.20 2
#> [8,] 8 2 9.10 2
#> [9,] 9 1 5.40 2
#> [10,] 10 2 2.21 2

Compare correlation coefficients

data.matrix(df) %>% cor(use="pairwise.complete.obs") %>% round(digits=3)
#> idcode contract score CEO
#> idcode 1.000 0.312 0.177 0.625
#> contract 0.312 1.000 -0.226 -0.354
#> score 0.177 -0.226 1.000 0.548
#> CEO 0.625 -0.354 0.548 1.000

model.matrix(~0+., data=df) %>% cor(use="pairwise.complete.obs") %>% round(digits=3)
#> idcode contractFALSE contractTRUE score CEOTRUE
#> idcode 1.000 -0.312 0.312 0.177 0.625
#> contractFALSE -0.312 1.000 -1.000 0.226 0.354
#> contractTRUE 0.312 -1.000 1.000 -0.226 -0.354
#> score 0.177 0.226 -0.226 1.000 0.548
#> CEOTRUE 0.625 0.354 -0.354 0.548 1.000

Note that the two functions handle logical variables differently, which is reflected in the correlation matrix: model.matrix creates two dummy variables for contract and one dummy variable for CEO (see the discussion in the comments section to this answer), whereas data.matrix creates a single binary integer variable for each.

Reset default options

options(na.action = oldpar)

Session Info

sessionInfo()
#> R version 4.1.1 (2021-08-10)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Catalina 10.15.7
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> loaded via a namespace (and not attached):
#> [1] knitr_1.33 magrittr_2.0.1 rlang_0.4.11 fastmap_1.1.0
#> [5] fansi_0.5.0 stringr_1.4.0 styler_1.5.1 highr_0.9
#> [9] tools_4.1.1 xfun_0.25 utf8_1.2.2 withr_2.4.2
#> [13] htmltools_0.5.2 ellipsis_0.3.2 yaml_2.2.1 digest_0.6.27
#> [17] tibble_3.1.4 lifecycle_1.0.0 crayon_1.4.1 purrr_0.3.4
#> [21] vctrs_0.3.8 fs_1.5.0 glue_1.4.2 evaluate_0.14
#> [25] rmarkdown_2.10 reprex_2.0.1 stringi_1.7.4 compiler_4.1.1
#> [29] pillar_1.6.2 backports_1.2.1 pkgconfig_2.0.3

Created on 2021-09-19 by the reprex package (v2.0.1)

sparse.model.matrix loses rows in R

I've had some success with changing the na.action to na.pass, which keeps all the rows in my matrix:

options(na.action='na.pass')

Just note that this is a global option, so you probably want to set it back to its original value afterwards, so as not to interfere with the rest of your code.

previous_na_action <- options('na.action')
options(na.action='na.pass')
# Do your stuff...

options(na.action=previous_na_action$na.action)

Solution from this answer.
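
One way to make the save-and-restore step harder to forget is to wrap it in a small helper that uses on.exit(). This is only a sketch; the name with_na_pass is made up, and the example uses base model.matrix() rather than sparse.model.matrix() to stay free of package dependencies.

```r
# Hypothetical helper: evaluate an expression with na.action set to "na.pass",
# then restore the previous option even if the expression fails.
with_na_pass <- function(expr) {
  old <- options(na.action = "na.pass")   # options() returns the old value
  on.exit(options(old), add = TRUE)       # restored when the function exits
  force(expr)                             # the argument is evaluated here, lazily
}

d <- data.frame(a = c(1, NA, 3))
before <- getOption("na.action")

m <- with_na_pass(model.matrix(~ a, d))
nrow(m)                    # all three rows are kept
getOption("na.action")     # back to the previous setting
```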

model.matrix explanation in R

The simplest answer is that the -1 in the formula passed to model.matrix removes the intercept term from the model (the column that data.frame() renames to X.Intercept.).
data.frame(model.matrix( ~ . -1, test_df)) produces:

  categoryMusic categoryNarrative.Film categoryPoetry countryUS usd_goal_real time_int state
1 0 0 1 0 1534 59 0
2 0 1 0 1 30000 60 0
3 1 0 0 1 45000 45 0

and data.frame(model.matrix( ~ . , test_df)) produces:

  X.Intercept. categoryNarrative.Film categoryPoetry countryUS usd_goal_real time_int state
1 1 0 1 0 1534 59 0
2 1 1 0 1 30000 60 0
3 1 0 0 1 45000 45 0

Since there is a categorical variable in the model, you will also notice that the Music level of that variable disappears when the intercept is included: the first level of the variable is absorbed into the intercept as the reference level, and all other levels are measured relative to it.

These are two different ways of parameterizing your model.
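
A quick sketch on made-up data may make the point concrete: the columns differ between the two codings, but a linear model fitted either way gives the same fitted values. The data frame toy and the object names are assumptions for this example.

```r
toy <- data.frame(
  y = c(1, 2, 3, 4, 5, 6),
  g = factor(c("a", "a", "b", "b", "c", "c"))
)

X1 <- model.matrix(~ g, toy)       # intercept plus dummies for all levels but the first
X0 <- model.matrix(~ g - 1, toy)   # no intercept, one dummy per level

colnames(X1)
colnames(X0)

fit1 <- lm(y ~ g, toy)
fit0 <- lm(y ~ g - 1, toy)

# Same fitted values, different meaning of the coefficients
all.equal(fitted(fit1), fitted(fit0))
```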

R code: Error in model.matrix.default(mt, mf, contrasts) : Variable 1 has no levels

Your problem is similar to the one reported here on the randomForest classifier.

Apparently glm checks through the variables in your data and throws an error because X contains only NA values.

You can fix that error in one of two ways:

  1. either drop X completely from your dataset by setting Cancer$X <- NULL before handing it to glm, and leave X out of your formula (glm(diagnosis~.-id, data = Cancer, family = binomial));
  2. or add na.action = na.pass to the glm call (which essentially tells glm not to drop rows because of NA values) while still excluding X in the formula itself (glm(diagnosis~.-id-X, data = Cancer, family = binomial, na.action = na.pass)).

However, please note that you still have to provide the diagnosis variable in a form digestible by glm, meaning a numeric vector with values 0 and 1, a logical vector, or a factor:

"For binomial and quasibinomial families the response can also be specified as a factor (when the first level denotes failure and all others success)" - from the glm-doc

Just define Cancer$diagnosis <- as.factor(Cancer$diagnosis).

On my end, this still leaves some warnings, but I think those are coming from the data or your feature selection. It clears the blocking errors :)
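
The first option can be sketched on made-up data as follows. The data frame toy stands in for Cancer and is purely hypothetical; the all_na check is one possible way to find columns that contain nothing but NA.

```r
# Hypothetical stand-in for the Cancer data: X is entirely NA
set.seed(1)
toy <- data.frame(
  id        = 1:20,
  diagnosis = rep(c("B", "M"), 10),
  radius    = rnorm(20),
  X         = NA
)

# Find and drop columns that contain only NA
all_na <- vapply(toy, function(col) all(is.na(col)), logical(1))
clean  <- toy[, !all_na, drop = FALSE]

# glm with family = binomial accepts a factor response
clean$diagnosis <- as.factor(clean$diagnosis)

fit <- glm(diagnosis ~ . - id, data = clean, family = binomial)
coef(fit)
```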


