Directly creating dummy variable set in a sparse matrix in R
Thanks for clarifying your question. Try this.
Here is sample data with two columns that have three and two levels respectively:
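set.seed(123)
n <- 6
# stringsAsFactors = TRUE is needed on R >= 4.0.0, where strings are no longer
# converted to factors by default; the code below relies on factor level codes
df <- data.frame(x = sample(c("A", "B", "C"), n, TRUE),
                 y = sample(c("D", "E"), n, TRUE),
                 stringsAsFactors = TRUE)
# Output from the original answer (R >= 3.6.0 changed the default sampler,
# so the same seed may give a different draw):
#   x y
# 1 A E
# 2 C E
# 3 B E
# 4 C D
# 5 C E
# 6 A D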
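library(Matrix)
# One indicator matrix per column: row i gets a 1 in the column of its level code
spm <- lapply(df, function(j) sparseMatrix(i = seq_along(j),
                                           j = as.integer(j), x = 1))
# cBind() is deprecated in current versions of Matrix; cbind() works on sparse matrices
do.call(cbind, spm)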
# 6 x 5 sparse Matrix of class "dgCMatrix"
#
# [1,] 1 . . . 1
# [2,] . . 1 . 1
# [3,] . 1 . . 1
# [4,] . . 1 1 .
# [5,] . . 1 . 1
# [6,] 1 . . 1 .
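Edit: @user20650 pointed out that do.call(cBind, ...) (cbind in current versions of Matrix) was sluggish or failing with large data. So here is a more complex but much faster and more efficient approach, which builds the whole matrix in a single sparseMatrix() call: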
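n <- nrow(df)
nlevels <- sapply(df, nlevels)    # number of levels in each column
i <- rep(seq_len(n), ncol(df))    # row indices, repeated once per column
# column index = level code + total number of levels in the preceding columns
j <- unlist(lapply(df, as.integer)) +
  rep(cumsum(c(0, head(nlevels, -1))), each = n)
x <- 1
sparseMatrix(i = i, j = j, x = x)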
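To make the column-offset arithmetic concrete, here is a minimal self-contained sketch on a small hand-written data frame. The toy data, the `offsets` and `nlev` names, and the `dims`/`dimnames` arguments are illustrative additions, not part of the original answer; passing `dims` explicitly is a precaution for the case where the last level of the last column never occurs.

```r
library(Matrix)

# Hypothetical toy data: two factor columns with 3 and 2 levels
df <- data.frame(x = factor(c("A", "C", "B", "C")),
                 y = factor(c("E", "E", "D", "E")))

n       <- nrow(df)
nlev    <- vapply(df, nlevels, integer(1))
offsets <- cumsum(c(0L, head(nlev, -1)))   # 0 for x, 3 for y

i <- rep(seq_len(n), ncol(df))
j <- unlist(lapply(df, as.integer), use.names = FALSE) +
  rep(offsets, each = n)

# dims and dimnames are optional extras, added here for illustration
m <- sparseMatrix(i = i, j = j, x = 1,
                  dims = c(n, sum(nlev)),
                  dimnames = list(NULL,
                                  unlist(lapply(df, levels), use.names = FALSE)))
m
```

Each row ends up with exactly one 1 per original column, so every row sum equals ncol(df).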
Generating a sparse matrix for a categorical variable
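We may need to specify contrasts.arg so that every level of each factor gets its own column instead of one level being dropped as the reference (z is the data frame from the question):
as.matrix(sparse.model.matrix(~ . - 1, z,
    contrasts.arg = lapply(z, function(x) contrasts(factor(x), contrasts = FALSE))))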
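Since z is not shown here, the following sketch uses a small made-up data frame (the column names `a` and `b` and their values are assumptions) to illustrate what the contrasts.arg argument is meant to achieve: full dummy coding for every factor, not just the first one.

```r
library(Matrix)

# Hypothetical stand-in for z: two factor columns with two levels each
z <- data.frame(a = factor(c("x", "y", "x")),
                b = factor(c("u", "u", "v")))

# Identity "contrasts" keep one indicator column per level
full <- lapply(z, function(col) contrasts(col, contrasts = FALSE))

m <- sparse.model.matrix(~ . - 1, z, contrasts.arg = full)
m
```

With both factors fully coded, each row should contain one 1 for `a` and one 1 for `b`.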
R: Generating a sparse matrix with exactly one value per row (dummy coding)
If you wanted to create a random dummy matrix, a quick way would be to create a function like this:
Dummy <- function(nrow, ncol) {
  M <- matrix(0L, nrow = nrow, ncol = ncol)
  M[cbind(sequence(nrow), sample(ncol, nrow, TRUE))] <- 1L
  M
}
The first line of the function creates a matrix of zeroes. The second line uses matrix indexing to replace exactly one value per row with a one, in a randomly chosen column. The third line returns the result. I'm not sure how you were planning on creating or using your j vectors, but this is how I would suggest approaching it.
Usage is simple: You just need to specify the number of rows and the number of columns that the final matrix should have.
Example:
set.seed(1) ## for reproducibility
Dummy(3, 3)
#      [,1] [,2] [,3]
# [1,]    1    0    0
# [2,]    0    1    0
# [3,]    0    1    0
Dummy(6, 4)
#      [,1] [,2] [,3] [,4]
# [1,]    0    0    0    1
# [2,]    1    0    0    0
# [3,]    0    0    0    1
# [4,]    0    0    0    1
# [5,]    0    0    1    0
# [6,]    0    0    1    0
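Dummy() returns an ordinary dense matrix. If a sparse result is wanted, as the question title suggests, one possible variant is sketched below. The name `SparseDummy` is hypothetical, and the approach simply swaps the dense matrix for Matrix::sparseMatrix() with the same random column choice.

```r
library(Matrix)

# Hypothetical sparse counterpart of Dummy(): exactly one 1 per row
SparseDummy <- function(nrow, ncol) {
  sparseMatrix(i = seq_len(nrow),                       # one entry in every row
               j = sample(ncol, nrow, replace = TRUE),  # random column per row
               x = 1,
               dims = c(nrow, ncol))                    # keep empty trailing columns
}

set.seed(1)
m <- SparseDummy(6, 4)
m
```

Setting dims explicitly matters here: without it, the matrix would only be as wide as the largest column index that happened to be drawn.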
Creating a dummy variable according to data in a matrix in R
How about creating a factor variable (you can show the underlying integer codes with as.integer)? We use regexec and regmatches to extract the letter codes that occur at the beginning of the Region variable (ignoring letters that occur later) and turn them into the factor:
# Data with an extra row (row number 11)
df <- read.table( text = " Region x
1 be1 71615
4 be211 54288
5 be112 51158
6 it213 69856
8 it221 71412
9 uk222 79537
11 uk222a 79537
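10 de101 94827", header = TRUE, stringsAsFactors = FALSE)
# regmatches() returns a list here; unlist() turns it into a character vector
levs <- unlist(regmatches(df$Region, regexec("^[a-z]+", df$Region)))
# levels = unique(levs) numbers the countries in order of first appearance
df$Country <- as.integer(factor(levs, levels = unique(levs)))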
   Region     x Country
1     be1 71615       1
4   be211 54288       1
5   be112 51158       1
6   it213 69856       2
8   it221 71412       2
9   uk222 79537       3
11 uk222a 79537       3
10  de101 94827       4
unlist( regmatches( df$Region , regexec( "^[a-z]+" , df$Region ) ) )
[1] "be" "be" "be" "it" "it" "uk" "uk" "de"
Dummy Variable with increase YEAR
We can use model.matrix to get the dummy coding, and then take the cumsum of each column.
apply(model.matrix(~year-1, dt)[,-1], 2, cumsum)
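The data set dt is not shown here, so the sketch below assumes a data frame with one row per year and year stored as a factor (with a numeric year, model.matrix would produce a single numeric column instead of dummies). The names `dummies` and `cum` are illustrative.

```r
# Hypothetical data: one row per year, with year stored as a factor
dt <- data.frame(year = factor(2000:2003))

dummies <- model.matrix(~ year - 1, dt)   # one indicator column per year
cum <- apply(dummies[, -1], 2, cumsum)    # drop the first year, accumulate down the rows
cum
```

Each column switches to 1 in its own year and stays at 1 for all later rows, which gives the "increasing" dummy pattern.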
Another option is mtabulate from the qdapTools package:
library(qdapTools)
d1 <- mtabulate(dt$year)[-1]
# Based on the example, we can also set the lower triangle to 1
d1[lower.tri(d1)] <- 1
Creating a sparse matrix in R
I tried to reproduce the problem by constructing an object like you described in the question (which I've now edited into the question) and by appending some additional fake rows to it.
library(Matrix)
Likes <- data.frame(userid=c("n1","n2"),
m1=c(0,1),
m2=c(0,0),
m3=c(0,0),
m4=c(1,0)
)
I found that running your code on this threw a different error:
sM_Likes <- sparseMatrix(Likes, i=likes$userid, j=1,c(2:ncol(Likes)), x=1)
Error in sparseMatrix(Likes, i = likes$userid, j = 1,
c(2:ncol(Likes)), : exactly one of 'i', 'j', or 'p' must be missing
from call
I mentioned this a couple of times in the comments as what I thought was causing the problem. You corrected the specification of your j argument and now it works :)
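The corrected call is not reproduced in this answer. For reference, one way to build a sparse matrix from the Likes object is sketched below; it is an assumption about what a working version could look like, not the asker's actual fix. The helper names `dense` and `nz` are illustrative.

```r
library(Matrix)

Likes <- data.frame(userid = c("n1", "n2"),
                    m1 = c(0, 1),
                    m2 = c(0, 0),
                    m3 = c(0, 0),
                    m4 = c(1, 0))

dense <- as.matrix(Likes[, -1])             # numeric part only, without userid
nz    <- which(dense != 0, arr.ind = TRUE)  # (row, col) positions of non-zero cells

sM_Likes <- sparseMatrix(i = nz[, "row"], j = nz[, "col"], x = dense[nz],
                         dims = dim(dense),
                         dimnames = list(as.character(Likes$userid),
                                         colnames(dense)))
sM_Likes
```

Supplying i, j, and x together with dims avoids the "exactly one of 'i', 'j', or 'p' must be missing" error, and dimnames carries the user ids and column names over to the sparse matrix.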
There's also a follow up question you asked in the comments about column names. I think this should solve that:
devtools::install_github("ben519/mltools")
library(data.table)  # provides data.table(), which is used below
library(mltools)
dt <- data.table(
intCol=c(1L, NA_integer_, 3L, 0L),
realCol=c(NA, 2, NA, NA),
logCol=c(TRUE, FALSE, TRUE, FALSE),
ofCol=factor(c("a", "b", NA, "b"), levels=c("a", "b", "c"), ordered=TRUE),
ufCol=factor(c("a", NA, "c", "b"), ordered=FALSE)
)
sparsify(dt)
sparsify(dt, sparsifyNAs=TRUE)
sparsify(dt[, list(realCol)], naCols="identify")
sparsify(dt[, list(realCol)], naCols="efficient")