How to generate a specific design matrix in R?
Try this:
num_players <- 4
matches <- t(combn(num_players, 2))
nr <- nrow(matches)
mtx <- matrix(0L, nrow = nr, ncol = num_players)
mtx[ cbind(seq_len(nr), matches[,1]) ] <- 1L
mtx[ cbind(seq_len(nr), matches[,2]) ] <- -1L
mtx
# [,1] [,2] [,3] [,4]
# [1,] 1 -1 0 0
# [2,] 1 0 -1 0
# [3,] 1 0 0 -1
# [4,] 0 1 -1 0
# [5,] 0 1 0 -1
# [6,] 0 0 1 -1
Create a design (numeric) matrix from a dataframe
See https://stackoverflow.com/a/5048727/4530610
In your case it would be:
mat <- model.matrix( ~ gender + age + lincome + party + education, data=df)[,-1]
Is there an existing function to create a design matrix for a subset of the training data in which some contrasts have only one level?
Maybe this was a stupid question. But I realized when I made that iris example (with the actual iris dataset) that model.matrix() doesn't actually throw an error if the factor variable has more than one level (even if it has only one unique value in the hypothetical sample). Because it needs to make binary indicators for factor variables, model.matrix() is left with no potential reference levels in my mini-example.
I used the purged model's "xlevels" object (model$xlevels), looped through the variables with levels, and re-factored those columns like this:
for(i in 1:length(model$xlevels)){
newdata[[names(model$xlevels)[i]]] <- factor(newdata[[names(model$xlevels)[i]]], levels = model$xlevels[[i]])
}
Note that I tried to use the purged model object, as well as model$terms, in the model.matrix()'s "object" input, but to no avail.
If there is a better way, I would appreciate someone responding! Otherwise, I hope this helps.
Design matrix for genotypes in R
Usually the data.table package is pretty performant in these type of cases. Example below:
library(data.table)
#> Warning: package 'data.table' was built under R version 4.1.1
df <- fread(text = "snp id a1 a2 code
snp1 an1 A A 0
snp1 an2 A B 1
snp1 an3 B B -1
snp2 an1 A B 1
snp2 an2 A A 0
snp2 an3 B B -1")
dcast(df, id ~ snp, value.var = "code")
#> id snp1 snp2
#> 1: an1 0 1
#> 2: an2 1 0
#> 3: an3 -1 -1
Created on 2021-10-13 by the reprex package (v2.0.1)
If you need the output as a matrix you could use:
cast <- dcast(df, id ~ snp, value.var = "code")
mat <- as.matrix(cast[, -"id"])
rownames(mat) <- cast$id
mat
#> snp1 snp2
#> an1 0 1
#> an2 1 0
#> an3 -1 -1
For a ~3Gb file you might expect this to run for about 10 seconds:
library(data.table)
#> Warning: package 'data.table' was built under R version 4.1.1
# Setting up larger data
df <- expand.grid(
snp = paste0("snp", 1:10000),
id = paste0("an", 1:10000)
)
df$a1 <- sample(c("A", "B"), nrow(df), replace = TRUE)
df$a2 <- sample(c("A", "B"), nrow(df), replace = TRUE)
df$code <- with(df, dplyr::case_when(
a1 == "A" & a2 == "A" ~ 0,
a1 == "B" & a2 == "B" ~ -1,
TRUE ~ 1
))
setDT(df)
# How big is this data?
format(object.size(df), "Gb")
#> [1] "3 Gb"
# How fast does the function run?
bench::mark(
dcast(df, id ~ snp, value.var = "code")
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 1 x 6
#> expression min median `itr/sec` mem_alloc
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt>
#> 1 dcast(df, id ~ snp, value.var = "code") 9.32s 9.32s 0.107 6.71GB
#> # ... with 1 more variable: gc/sec <dbl>
Created on 2021-10-13 by the reprex package (v2.0.1)
Design Matrix using model.matrix function for gene expression
You need something like
disease <- factor(rep(c(1,2),c(4,6)))
levels(disease) <- c("normal","abnormal")
design <- model.matrix(~disease)
Have you tried reading the limma User's Guide? There are heaps of examples:
Related Topics
Rolling Window Over Irregular Time Series
Changing Title in Multiplot Ggplot2 Using Grid.Arrange
Ggplot2:Adding Two Errorbars to Each Point in Scatterplot
Calling a Function from a Namespace
Rank Variable by Group (Dplyr)
How to Set Seed for Random Simulations with Foreach and Domc Packages
Shiny Saving Url State Subpages and Tabs
Get Filename and Path of 'Source'D File
Keeping Zero Count Combinations When Aggregating with Data.Table
Increase the API Limit in Ggmap's Geocode Function (In R)
How to Do a Regression of a Series of Variables Without Typing Each Variable Name
About Gforce in Data.Table 1.9.2
Long and Wide Data - When to Use What
R- Converting Data from Fraction to Decimal
Producing a Boxplot in Ggplot2 Using Summary Statistics
Is There a Reason to Prefer Extractor Functions to Accessing Attributes with $