Directly Creating Dummy Variable Set in a Sparse Matrix in R

Thanks for having clarified your question, try this.

Here is sample data with two columns that have three and two levels respectively:

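set.seed(123)
n <- 6
# stringsAsFactors = TRUE is needed in R >= 4.0.0 so that the columns are
# factors and as.integer() below returns their level codes
df <- data.frame(x = sample(c("A", "B", "C"), n, TRUE),
                 y = sample(c("D", "E"), n, TRUE),
                 stringsAsFactors = TRUE)
# (output from R < 3.6.0; newer versions draw a different sample for this seed)
#   x y
# 1 A E
# 2 C E
# 3 B E
# 4 C D
# 5 C E
# 6 A D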

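library(Matrix)
# One sparse indicator matrix per column: row i gets a 1 in the column
# given by its level code
spm <- lapply(df, function(j) sparseMatrix(i = seq_along(j),
                                           j = as.integer(j), x = 1))
# cbind() works on Matrix objects directly; the Matrix-specific cBind()
# is no longer supported in recent versions of Matrix
do.call(cbind, spm)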
# 6 x 5 sparse Matrix of class "dgCMatrix"
#
# [1,] 1 . . . 1
# [2,] . . 1 . 1
# [3,] . 1 . . 1
# [4,] . . 1 1 .
# [5,] . . 1 . 1
# [6,] 1 . . 1 .

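Edit: @user20650 pointed out that the do.call() column-binding step above was sluggish or failed with large data. Here is a more complex but much faster and more memory-efficient approach, which builds the row and column indices for a single sparseMatrix() call: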

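n <- nrow(df)
n_levels <- sapply(df, nlevels)           # number of levels per column
i <- rep(seq_len(n), ncol(df))            # row index, repeated once per column
# column index: level code plus the offset of that variable's first column
j <- unlist(lapply(df, as.integer)) +
  rep(cumsum(c(0, head(n_levels, -1))), each = n)
x <- 1
sparseMatrix(i = i, j = j, x = x)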
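The single-call approach can also carry column labels and a fixed width. The following is a minimal sketch on hypothetical toy data (the names `offsets`, `col_names`, and `m` are not from the original answer); it passes `dims` explicitly so the matrix keeps one column per level even when the last level of the last variable never occurs.

```r
library(Matrix)

# Hypothetical toy data; the columns are explicit factors so that
# as.integer() returns level codes
df <- data.frame(x = factor(c("A", "C", "B", "C", "C", "A")),
                 y = factor(c("E", "E", "E", "D", "E", "D")))

n        <- nrow(df)
n_levels <- sapply(df, nlevels)
offsets  <- cumsum(c(0, head(n_levels, -1)))   # first-column offset per variable

i <- rep(seq_len(n), ncol(df))
j <- unlist(lapply(df, as.integer), use.names = FALSE) + rep(offsets, each = n)

# Column labels of the form "<variable><level>"
col_names <- unlist(Map(function(nm, f) paste0(nm, levels(f)), names(df), df),
                    use.names = FALSE)

m <- sparseMatrix(i = i, j = j, x = 1,
                  dims = c(n, sum(n_levels)),
                  dimnames = list(NULL, col_names))
m
```

Each row of `m` contains exactly one 1 per original column, so its row sums equal `ncol(df)`.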

Generating a sparse matrix for a categorical variable

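We may need to specify contrasts.arg so that every level of each factor gets its own column, instead of one level being dropped as the reference (here z is the data frame of categorical variables). Note that the outer as.matrix() converts the result back to a dense matrix; omit it to keep the result sparse.

as.matrix(sparse.model.matrix(~ . - 1, z, contrasts.arg = lapply(z,
    function(x) contrasts(factor(x), contrasts = FALSE))))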
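As a rough illustration of the call above, here is a small sketch that keeps the result sparse. The data frame `z` below is a hypothetical stand-in (the original `z` is not shown), and `full_contrasts` and `sm` are names introduced only for this example.

```r
library(Matrix)

# Hypothetical stand-in for `z`: two categorical columns
z <- data.frame(x = factor(c("A", "C", "B", "C")),
                y = factor(c("E", "E", "D", "E")))

# Full (identity) contrasts for every column, so no level is dropped
full_contrasts <- lapply(z, function(col) contrasts(factor(col), contrasts = FALSE))

# Same formula as above, without the dense as.matrix() wrapper
sm <- sparse.model.matrix(~ . - 1, data = z, contrasts.arg = full_contrasts)
dim(sm)
colnames(sm)
```

With full contrasts, each row should contain one indicator per variable, so the number of columns equals the total number of levels across `z`.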

R: Generating a sparse matrix with exactly one value per row (dummy coding)

If you wanted to create a random dummy matrix, a quick way would be to create a function like this:

Dummy <- function(nrow, ncol) {
  M <- matrix(0L, nrow = nrow, ncol = ncol)
  M[cbind(seq_len(nrow), sample(ncol, nrow, TRUE))] <- 1L
  M
}

The first line of the function creates a matrix of zeroes. The second line uses matrix indexing (a two-column matrix of row and column positions) to replace exactly one value per row with a one. The third line returns the result. Note that this builds an ordinary dense matrix. I'm not sure how you were planning on creating or using your j vectors, but this is how I would suggest approaching it.

Usage is simple: You just need to specify the number of rows and the number of columns that the final matrix should have.

Example:

set.seed(1) ## for reproducibility (output below is from R < 3.6.0)
Dummy(3, 3)
#      [,1] [,2] [,3]
# [1,]    1    0    0
# [2,]    0    1    0
# [3,]    0    1    0
Dummy(6, 4)
#      [,1] [,2] [,3] [,4]
# [1,]    0    0    0    1
# [2,]    1    0    0    0
# [3,]    0    0    0    1
# [4,]    0    0    0    1
# [5,]    0    0    1    0
# [6,]    0    0    1    0
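Since the question asks for a sparse matrix, the same idea can be sketched with sparseMatrix() from the Matrix package, so the zeroes are never stored. `SparseDummy` is a hypothetical name used only for this illustration, not part of the original answer.

```r
library(Matrix)

# Hypothetical sparse counterpart of Dummy(): one randomly placed 1 per row
SparseDummy <- function(nrow, ncol) {
  sparseMatrix(i = seq_len(nrow),                       # every row exactly once
               j = sample(ncol, nrow, replace = TRUE),  # random column per row
               x = 1,
               dims = c(nrow, ncol))                    # keep empty trailing columns
}

set.seed(1)
m <- SparseDummy(6, 4)
rowSums(m)
```

Passing `dims` matters here: without it, the matrix would only be as wide as the largest column index that happened to be sampled.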

Creating a dummy variable according to data in a matrix in R

How about creating a factor variable? (You can show the underlying integer codes with as.integer.) We use regexec and regmatches to extract the letter codes that occur at the beginning of the Region variable (ignoring letters that occur later) and turn them into a factor:

#  Data with an extra row (row number 11)
df <- read.table( text = " Region x
1 be1 71615
4 be211 54288
5 be112 51158
6 it213 69856
8 it221 71412
9 uk222 79537
11 uk222a 79537
10 de101 94827" , header = TRUE , stringsAsFactors = FALSE )

# regmatches() returns a list here, so flatten it to a character vector
levs <- unlist( regmatches( df$Region , regexec( "^[a-z]+" , df$Region ) ) )

df$Country <- as.integer( factor( levs , levels = unique( levs ) ) )

   Region     x Country
1     be1 71615       1
4   be211 54288       1
5   be112 51158       1
6   it213 69856       2
8   it221 71412       2
9   uk222 79537       3
11 uk222a 79537       3
10  de101 94827       4

unlist( regmatches( df$Region , regexec( "^[a-z]+" , df$Region ) ) )
[1] "be" "be" "be" "it" "it" "uk" "uk" "de"
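If the goal is a set of dummy columns rather than a single integer code, the country factor can be fed to sparseMatrix(). This is only a sketch under assumptions: it re-creates the Region values by hand, uses regexpr (which makes regmatches return a plain character vector) instead of regexec, and the names `prefix`, `country`, and `dummies` are introduced for illustration.

```r
library(Matrix)

region  <- c("be1", "be211", "be112", "it213", "it221", "uk222", "uk222a", "de101")
prefix  <- regmatches(region, regexpr("^[a-z]+", region))  # leading letters only
country <- factor(prefix, levels = unique(prefix))

# One row per region, one column per country, a single 1 per row
dummies <- sparseMatrix(i = seq_along(country),
                        j = as.integer(country),
                        x = 1,
                        dims = c(length(country), nlevels(country)),
                        dimnames = list(region, levels(country)))
dummies
```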

Dummy Variable with increase YEAR

We can use model.matrix to get the dummy coding, and then get the cumsum of each column.

apply(model.matrix(~year-1, dt)[,-1], 2, cumsum)
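To see what that one-liner does, here is a small sketch on a hypothetical stand-in for `dt` (the original data is not shown), assuming one row per year and `year` stored as a factor. The names `dummies` and `stepped` are introduced only for this example.

```r
# Hypothetical stand-in for `dt`: one row per year, `year` stored as a factor
dt <- data.frame(year = factor(2001:2004))

dummies <- model.matrix(~ year - 1, dt)       # one indicator column per year
stepped <- apply(dummies[, -1], 2, cumsum)    # drop the first year, then accumulate
stepped
```

Because each indicator column has a single 1, its cumulative sum is 0 before that year and 1 from that year onward, which gives the "switched on from year X" coding.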

Another option is mtabulate from the qdapTools package:

library(qdapTools)
d1 <- mtabulate(dt$year)[-1]
# based on the example, we can also set the lower triangle to 1
d1[lower.tri(d1)] <- 1

Creating a sparse matrix in R

I tried to reproduce the problem by constructing an object like you described in the question (which I've now edited into the question) and by appending some additional fake rows to it.

library(Matrix)

Likes <- data.frame(userid = c("n1", "n2"),
                    m1 = c(0, 1),
                    m2 = c(0, 0),
                    m3 = c(0, 0),
                    m4 = c(1, 0)
)

I found that running your code on this threw a different error:

sM_Likes <- sparseMatrix(Likes, i=likes$userid, j=1,c(2:ncol(Likes)), x=1)

Error in sparseMatrix(Likes, i = likes$userid, j = 1,
c(2:ncol(Likes)), : exactly one of 'i', 'j', or 'p' must be missing
from call

I mentioned this a couple of times in the comments as what I thought was causing the problem. You corrected the specification of your j argument and now it works :)
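The corrected call itself is not reproduced here. For reference, one possible way to turn a data frame like Likes into a sparse matrix is sketched below; this is an assumption-laden illustration, not the asker's final code, and `dense` and `nz` are names made up for the example. It passes `i`, `j`, and `x` together and leaves `p` missing, which is the combination sparseMatrix() expects.

```r
library(Matrix)

Likes <- data.frame(userid = c("n1", "n2"),
                    m1 = c(0, 1), m2 = c(0, 0), m3 = c(0, 0), m4 = c(1, 0))

dense <- as.matrix(Likes[, -1])               # numeric columns only
nz    <- which(dense != 0, arr.ind = TRUE)    # (row, col) of each non-zero entry

sM_Likes <- sparseMatrix(i = nz[, "row"], j = nz[, "col"], x = dense[nz],
                         dims = dim(dense),
                         dimnames = list(as.character(Likes$userid),
                                         colnames(dense)))
sM_Likes
```

Setting `dimnames` keeps the user ids as row names and the original column names on the sparse result.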

There's also a follow-up question you asked in the comments about column names. I think sparsify() from the mltools package should solve that:

devtools::install_github("ben519/mltools")
library(data.table)  # needed for data.table() below
library(mltools)
dt <- data.table(
  intCol = c(1L, NA_integer_, 3L, 0L),
  realCol = c(NA, 2, NA, NA),
  logCol = c(TRUE, FALSE, TRUE, FALSE),
  ofCol = factor(c("a", "b", NA, "b"), levels = c("a", "b", "c"), ordered = TRUE),
  ufCol = factor(c("a", NA, "c", "b"), ordered = FALSE)
)

sparsify(dt)
sparsify(dt, sparsifyNAs=TRUE)
sparsify(dt[, list(realCol)], naCols="identify")
sparsify(dt[, list(realCol)], naCols="efficient")

