How to Create Design Matrix in R

How to generate a specific design matrix in R?

Try this:

num_players <- 4
matches <- t(combn(num_players, 2))
nr <- nrow(matches)

mtx <- matrix(0L, nrow = nr, ncol = num_players)
mtx[ cbind(seq_len(nr), matches[,1]) ] <- 1L
mtx[ cbind(seq_len(nr), matches[,2]) ] <- -1L
mtx
# [,1] [,2] [,3] [,4]
# [1,] 1 -1 0 0
# [2,] 1 0 -1 0
# [3,] 1 0 0 -1
# [4,] 0 1 -1 0
# [5,] 0 1 0 -1
# [6,] 0 0 1 -1

Create a design (numeric) matrix from a dataframe

See https://stackoverflow.com/a/5048727/4530610

In your case it would be:

mat <- model.matrix( ~ gender + age + lincome + party + education, data=df)[,-1]

Is there an existing function to create a design matrix for a subset of the training data in which some contrasts have only one level?

Maybe this was a stupid question. But I realized when I made that iris example (with the actual iris dataset) that model.matrix() doesn't actually throw an error if the factor variable has more than one level (even if it has only one unique value in the hypothetical sample). Because it needs to make binary indicators for factor variables, model.matrix() is left with no potential reference levels in my mini-example.

I used the purged model's "xlevels" object (model$xlevels), looped through the variables with levels, and re-factored those columns like this:

for(i in 1:length(model$xlevels)){
newdata[[names(model$xlevels)[i]]] <- factor(newdata[[names(model$xlevels)[i]]], levels = model$xlevels[[i]])
}

Note that I tried to use the purged model object, as well as model$terms, in the model.matrix()'s "object" input, but to no avail.

If there is a better way, I would appreciate someone responding! Otherwise, I hope this helps.

Design matrix for genotypes in R

Usually the data.table package is pretty performant in these type of cases. Example below:

library(data.table)
#> Warning: package 'data.table' was built under R version 4.1.1

df <- fread(text = "snp id a1 a2 code
snp1 an1 A A 0
snp1 an2 A B 1
snp1 an3 B B -1
snp2 an1 A B 1
snp2 an2 A A 0
snp2 an3 B B -1")

dcast(df, id ~ snp, value.var = "code")
#> id snp1 snp2
#> 1: an1 0 1
#> 2: an2 1 0
#> 3: an3 -1 -1

Created on 2021-10-13 by the reprex package (v2.0.1)

If you need the output as a matrix you could use:

cast <- dcast(df, id ~ snp, value.var = "code")
mat <- as.matrix(cast[, -"id"])
rownames(mat) <- cast$id
mat
#> snp1 snp2
#> an1 0 1
#> an2 1 0
#> an3 -1 -1

For a ~3Gb file you might expect this to run for about 10 seconds:

library(data.table)
#> Warning: package 'data.table' was built under R version 4.1.1

# Setting up larger data
df <- expand.grid(
snp = paste0("snp", 1:10000),
id = paste0("an", 1:10000)
)
df$a1 <- sample(c("A", "B"), nrow(df), replace = TRUE)
df$a2 <- sample(c("A", "B"), nrow(df), replace = TRUE)
df$code <- with(df, dplyr::case_when(
a1 == "A" & a2 == "A" ~ 0,
a1 == "B" & a2 == "B" ~ -1,
TRUE ~ 1
))
setDT(df)

# How big is this data?
format(object.size(df), "Gb")
#> [1] "3 Gb"

# How fast does the function run?
bench::mark(
dcast(df, id ~ snp, value.var = "code")
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 1 x 6
#> expression min median `itr/sec` mem_alloc
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt>
#> 1 dcast(df, id ~ snp, value.var = "code") 9.32s 9.32s 0.107 6.71GB
#> # ... with 1 more variable: gc/sec <dbl>

Created on 2021-10-13 by the reprex package (v2.0.1)

Design Matrix using model.matrix function for gene expression

You need something like

disease <- factor(rep(c(1,2),c(4,6)))
levels(disease) <- c("normal","abnormal")
design <- model.matrix(~disease)

Have you tried reading the limma User's Guide? There are heaps of examples:



Related Topics



Leave a reply



Submit