Running cor() (or any variant) over a sparse matrix in R
This is what I ended up using. Thanks @Joris for all the help!
My x
matrix is quite big. Assuming it's size is n * p
, n=200k
and p=10k
in my case.
The trick is to maintain the sparsity of the operations and perform the calculations on p * p
matrices instead of p*n x n*p
.
Version 1, is more straightforward, but less efficient on time and memory, as the outer product operation is expensive:
sparse.cor2 <- function(x){
n <- nrow(x)
covmat <- (crossprod(x)-2*(colMeans(x) %o% colSums(x))
+n*colMeans(x)%o%colMeans(x))/(n-1)
sdvec <- sqrt(diag(covmat)) # standard deviations of columns
covmat/crossprod(t(sdvec)) # correlation matrix
}
Version 2 is more efficient on time (saves several operations) and on memory. Still requires huge amounts of memory for a p=10k
matrix:
sparse.cor3 <- function(x){
memory.limit(size=10000)
n <- nrow(x)
cMeans <- colMeans(x)
cSums <- colSums(x)
# Calculate the population covariance matrix.
# There's no need to divide by (n-1) as the std. dev is also calculated the same way.
# The code is optimized to minize use of memory and expensive operations
covmat <- tcrossprod(cMeans, (-2*cSums+n*cMeans))
crossp <- as.matrix(crossprod(x))
covmat <- covmat+crossp
sdvec <- sqrt(diag(covmat)) # standard deviations of columns
covmat/crossprod(t(sdvec)) # correlation matrix
}
Timing comparisons (sparse.cor is @Joris latest version):
> X <- sample(0:10,1e7,replace=T,p=c(0.9,rep(0.01,10)))
> x <- Matrix(X,ncol=10)
>
> object.size(x)
11999472 bytes
>
> system.time(corx <- sparse.cor(x))
user system elapsed
0.50 0.06 0.56
> system.time(corx2 <- sparse.cor2(x))
user system elapsed
0.17 0.00 0.17
> system.time(corx3 <- sparse.cor3(x))
user system elapsed
0.13 0.00 0.12
> system.time(correg <-cor(as.matrix(x)))
user system elapsed
0.25 0.03 0.29
> all.equal(c(as.matrix(corx)),c(as.matrix(correg)))
[1] TRUE
> all.equal(c(as.matrix(corx2)),c(as.matrix(correg)))
[1] TRUE
> all.equal(c(as.matrix(corx3)),c(as.matrix(correg)))
[1] TRUE
Much larger x
matrix:
> X <- sample(0:10,1e8,replace=T,p=c(0.9,rep(0.01,10)))
> x <- Matrix(X,ncol=10)
> object.size(x)
120005688 bytes
> system.time(corx2 <- sparse.cor2(x))
user system elapsed
1.47 0.07 1.53
> system.time(corx3 <- sparse.cor3(x))
user system elapsed
1.18 0.09 1.29
> system.time(corx <- sparse.cor(x))
user system elapsed
5.43 1.26 6.71
Is there a method with R function duplicated() for a sparse matrix from {Matrix}?
My thought is to restore this sparse matrix into a list, RowLst
or ColLst
, such that Rowlst[[i]]
or ColLst[[i]]
is the compressed vector for the i-th row or column. Then apply duplicated
on this list.
duplicated.dgCMatrix <- function (dgCMat, MARGIN, include.all.zero.vectors = TRUE) {
MARGIN <- as.integer(MARGIN)
J <- rep(1:ncol(dgCMat), diff(dgCMat@p))
I <- dgCMat@i + 1
x <- dgCMat@x
if (MARGIN == 1L) {
## check duplicated rows
names(x) <- J
if (include.all.zero.vectors) {
RowLst <- split(x, factor(I, levels = 1:nrow(dgCMat)))
} else {
RowLst <- split(x, I) ## will do `factor(I)` internally in `split`
}
result <- duplicated.default(RowLst)
} else if (MARGIN == 2L) {
## check duplicated columns
names(x) <- I
if (include.all.zero.vectors) {
ColLst <- split(x, factor(J, levels = 1:ncol(dgCMat)))
} else {
ColLst <- split(x, J) ## will do `factor(J)` internally in `split`
}
result <- duplicated.default(ColLst)
} else {
warning("invalid MARGIN; return NULL")
result <- NULL
}
result
}
which(duplicated.dgCMatrix(A, 2))
#[1] 7 8
Discussion between 20650 and me reveals something that is worth commenting.
- I did not realize that with the above self-defined function, S3 method dispatching works for the S4 object. So,
which(duplicated(A, 2))
suffices. duplicated.matrix(t(A))
andduplicated.array(A, MARGIN = 2)
return the right result here as well. At first we thought we found a hidden treasure, but by checking their source we see that they both rely onapply
, which would do anas.matrix
on a two-dimensional input object.
OP had a nice spot on his application. The original solution does not take all-zero rows / columns into account. The new version with added argument include.all.zero.vectors
solves this. Basically we control factor levels used for split
, so that an all-zero row / column is assigned an NULL
entry in the list rather than being ignored.
In R how do you write a sparse matrix to a file?
You can use writeMM
and readMM
to Read and write sparse matrix, so no need to coerce it to a matrix.
writeMM(fit$A,file='test.txt')
readMM(file='test.txt')
EDIT within multinomial
, glmnet returns a list of coefficients. SO you need to loop over this list and write each coefficient. Here an example:
library(glmnet)
g4=sample(1:4,100,replace=TRUE)
fit3=glmnet(x,g4,family="multinomial")
lapply(seq_along(fit3$beta),function(x)
writeMM(fit3$beta[[x]],file=paste0('coef.beta',x,'.txt')))
R: Calculate cosine distance between rows of two sparse matrices
According to the documentation, element-by-element (array) multiplication of compatible simple_triplet_matrices
and row_sums
of simple_triplet_matrices
are available. With these operators/functions, the computation is:
cosineDist <- function(A, B){
row_sums(A * B) / sqrt(row_sums(A * A) * row_sums(B * B))
}
Notes:
row_sums(A * B)
computes the dot product of each row inA
and its corresponding row inB
, which is the numerator term in yourcosine
. The result is a vector (not sparse) whose elements are these dot products for each corresponding row in A and B.row_sums(A * A)
computes the squared 2-norm of each row inA
. The result is a vector (not sparse) whose elements are these squared 2-norms for each row in A.- Similarly,
row_sums(B * B)
computes the squared 2-norm of each row inB
. The result is a vector (not sparse) whose elements are these squared 2-norms for each row in B. - The rest of the computation operates on these vectors whose elements are for each row of A and/or B.
In R How do I load a matrix market formated sparse matrix into a dgCMatrix?
I'm not sure why you're having a problem, because you can use as()
to coerce an ngTMatrix
to an ngCMatrix
:
> pm <- as(as.integer(c(2,3,1)), "pMatrix")
3 x 3 sparse Matrix of class "pMatrix"
[1,] . | .
[2,] . . |
[3,] | . .
> pm.t <- as(pm, 'ngTMatrix')
> pm.c <- as(pm.t, 'ngCMatrix')
> class(pm.c)
[1] "ngCMatrix"
attr(,"package")
[1] "Matrix"
> class(pm.t)
[1] "ngTMatrix"
attr(,"package")
[1] "Matrix"
Related Topics
Passing Arguments to Iterated Function Through Apply
Time Difference in Years with Lubridate
Dplyr - Mean for Multiple Columns
How to Increase Size of the Points in Ggplot2, Similar to Cex in Base Plots
Quick/Elegant Way to Construct Mean/Variance Summary Table
How to Delete a Row from a Data.Frame Without Losing the Attributes
Function to Extract Domain Name from Url in R
Partial Animal String Matching in R
Building a Box Plot from All Columns of Data Frame with Column Names on X in Ggplot2
How to Create a Time-Spiral Graph Using R
Shiny: Plot Results in Popup Window
R Text Mining Documents from CSV File (One Row Per Doc)
R Draw Kmeans Clustering with Heatmap
"Long Vectors Not Supported Yet" Error in Rmd But Not in R Script