Running Cor() (Or Any Variant) Over a Sparse Matrix in R

Running cor() (or any variant) over a sparse matrix in R

This is what I ended up using. Thanks @Joris for all the help!

My x matrix is quite big. Assuming it's size is n * p, n=200k and p=10k in my case.

The trick is to maintain the sparsity of the operations and perform the calculations on p * p matrices instead of p*n x n*p.

Version 1, is more straightforward, but less efficient on time and memory, as the outer product operation is expensive:

sparse.cor2 <- function(x){
    n <- nrow(x)

    covmat <- (crossprod(x)-2*(colMeans(x) %o% colSums(x))
              +n*colMeans(x)%o%colMeans(x))/(n-1)

    sdvec <- sqrt(diag(covmat)) # standard deviations of columns
    covmat/crossprod(t(sdvec)) # correlation matrix
}

Version 2 is more efficient on time (saves several operations) and on memory. Still requires huge amounts of memory for a p=10k matrix:

sparse.cor3 <- function(x){
    memory.limit(size=10000)
    n <- nrow(x)

    cMeans <- colMeans(x)
    cSums <- colSums(x)

    # Calculate the population covariance matrix.
    # There's no need to divide by (n-1) as the std. dev is also calculated the same way.
    # The code is optimized to minize use of memory and expensive operations
    covmat <- tcrossprod(cMeans, (-2*cSums+n*cMeans))
    crossp <- as.matrix(crossprod(x))
    covmat <- covmat+crossp

    sdvec <- sqrt(diag(covmat)) # standard deviations of columns
    covmat/crossprod(t(sdvec)) # correlation matrix
}

Timing comparisons (sparse.cor is @Joris latest version):

> X <- sample(0:10,1e7,replace=T,p=c(0.9,rep(0.01,10)))
> x <- Matrix(X,ncol=10)
> 
> object.size(x)
11999472 bytes
> 
> system.time(corx <- sparse.cor(x))
   user  system elapsed 
   0.50    0.06    0.56 
> system.time(corx2 <- sparse.cor2(x))
   user  system elapsed 
   0.17    0.00    0.17 
> system.time(corx3 <- sparse.cor3(x))
   user  system elapsed 
   0.13    0.00    0.12 
> system.time(correg <-cor(as.matrix(x)))
   user  system elapsed 
   0.25    0.03    0.29 
> all.equal(c(as.matrix(corx)),c(as.matrix(correg)))
[1] TRUE
> all.equal(c(as.matrix(corx2)),c(as.matrix(correg)))
[1] TRUE
> all.equal(c(as.matrix(corx3)),c(as.matrix(correg)))
[1] TRUE

Much larger x matrix:

> X <- sample(0:10,1e8,replace=T,p=c(0.9,rep(0.01,10)))
> x <- Matrix(X,ncol=10)
> object.size(x)
120005688 bytes
> system.time(corx2 <- sparse.cor2(x))
   user  system elapsed 
   1.47    0.07    1.53 
> system.time(corx3 <- sparse.cor3(x))
   user  system elapsed 
   1.18    0.09    1.29 
> system.time(corx <- sparse.cor(x))
   user  system elapsed 
   5.43    1.26    6.71

Is there a method with R function duplicated() for a sparse matrix from {Matrix}?

My thought is to restore this sparse matrix into a list, RowLst or ColLst, such that Rowlst[[i]] or ColLst[[i]] is the compressed vector for the i-th row or column. Then apply duplicated on this list.

duplicated.dgCMatrix <- function (dgCMat, MARGIN, include.all.zero.vectors = TRUE) {
  MARGIN <- as.integer(MARGIN)
  J <- rep(1:ncol(dgCMat), diff(dgCMat@p))
  I <- dgCMat@i + 1
  x <- dgCMat@x
  if (MARGIN == 1L) {
    ## check duplicated rows
    names(x) <- J
    if (include.all.zero.vectors) {
      RowLst <- split(x, factor(I, levels = 1:nrow(dgCMat)))
      } else {
      RowLst <- split(x, I)  ## will do `factor(I)` internally in `split`
      }
    result <- duplicated.default(RowLst)
    } else if (MARGIN == 2L) {
    ## check duplicated columns
    names(x) <- I
    if (include.all.zero.vectors) {
      ColLst <- split(x, factor(J, levels = 1:ncol(dgCMat)))
      } else {
      ColLst <- split(x, J)  ## will do `factor(J)` internally in `split`
      }
    result <- duplicated.default(ColLst)
    } else {
    warning("invalid MARGIN; return NULL")
    result <- NULL
    }
  result
  }

which(duplicated.dgCMatrix(A, 2))
#[1] 7 8

Discussion between 20650 and me reveals something that is worth commenting.

I did not realize that with the above self-defined function, S3 method dispatching works for the S4 object. So, which(duplicated(A, 2)) suffices.
duplicated.matrix(t(A)) and duplicated.array(A, MARGIN = 2) return the right result here as well. At first we thought we found a hidden treasure, but by checking their source we see that they both rely on apply, which would do an as.matrix on a two-dimensional input object.

OP had a nice spot on his application. The original solution does not take all-zero rows / columns into account. The new version with added argument include.all.zero.vectors solves this. Basically we control factor levels used for split, so that an all-zero row / column is assigned an NULL entry in the list rather than being ignored.

In R how do you write a sparse matrix to a file?

You can use writeMM and readMM to Read and write sparse matrix, so no need to coerce it to a matrix.

writeMM(fit$A,file='test.txt')
readMM(file='test.txt')

EDIT within multinomial, glmnet returns a list of coefficients. SO you need to loop over this list and write each coefficient. Here an example:

library(glmnet)
g4=sample(1:4,100,replace=TRUE)
fit3=glmnet(x,g4,family="multinomial")
lapply(seq_along(fit3$beta),function(x)
       writeMM(fit3$beta[[x]],file=paste0('coef.beta',x,'.txt')))

R: Calculate cosine distance between rows of two sparse matrices

According to the documentation, element-by-element (array) multiplication of compatible simple_triplet_matrices and row_sums of simple_triplet_matrices are available. With these operators/functions, the computation is:

cosineDist <- function(A, B){
  row_sums(A * B) / sqrt(row_sums(A * A) * row_sums(B * B)) 
}

Notes:

row_sums(A * B) computes the dot product of each row in A and its corresponding row in B, which is the numerator term in your cosine. The result is a vector (not sparse) whose elements are these dot products for each corresponding row in A and B.
row_sums(A * A) computes the squared 2-norm of each row in A. The result is a vector (not sparse) whose elements are these squared 2-norms for each row in A.
Similarly, row_sums(B * B) computes the squared 2-norm of each row in B. The result is a vector (not sparse) whose elements are these squared 2-norms for each row in B.
The rest of the computation operates on these vectors whose elements are for each row of A and/or B.

In R How do I load a matrix market formated sparse matrix into a dgCMatrix?

I'm not sure why you're having a problem, because you can use as() to coerce an ngTMatrix to an ngCMatrix:

> pm <- as(as.integer(c(2,3,1)), "pMatrix")
3 x 3 sparse Matrix of class "pMatrix"

[1,] . | .
[2,] . . |
[3,] | . .

> pm.t <- as(pm, 'ngTMatrix')
> pm.c <- as(pm.t, 'ngCMatrix')

> class(pm.c)
[1] "ngCMatrix"
attr(,"package")
[1] "Matrix"

> class(pm.t)
[1] "ngTMatrix"
attr(,"package")
[1] "Matrix"

Running Cor() (Or Any Variant) Over a Sparse Matrix in R