SVD for sparse matrix in R

The irlba package has a very fast SVD implementation for sparse matrices.
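
For example, a minimal sketch (rsparsematrix() from the Matrix package is used here only to build a random sparse test matrix; irlba() accepts sparse dgCMatrix objects directly):

library(Matrix)
library(irlba)

set.seed(1)
m <- rsparsematrix(5000, 2000, density = 0.01)  # sparse 5000 x 2000
fit <- irlba(m, nv = 10)  # truncated SVD: 10 leading components
fit$d                     # approximate leading singular values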

Problems with SVD of a large matrix using bigmemory and irlba

This should work:

library(bigalgebra)
library(irlba)

## --> CHANGES HERE <--
# Allow products between a big.matrix and a plain numeric vector,
# which is what irlba's matrix-vector multiplications need
setMethod("%*%", signature(x = "big.matrix", y = "numeric"),
          function(x, y) x %*% as.matrix(y))
setMethod("%*%", signature(x = "numeric", y = "big.matrix"),
          function(x, y) t(x) %*% y)
# `[]` drops the big.matrix wrapper so irlba gets a base R matrix back
mult <- function(A, B) (A %*% B)[]

# Reproducible data: a 200 x 1000 matrix
x <- matrix(rnorm(20 * 100 * 100), nrow = 20 * 10)
big <- as.big.matrix(x)
# Computation
irlbaObject <- irlba(big, nv = 10, mult = mult)
# Verification against the full SVD (avoid masking base::svd)
svd_x <- svd(x, nu = 10, nv = 10)
plot(irlbaObject$u, svd_x$u)
plot(irlbaObject$v, svd_x$v)

Note 1: I think the algorithm in irlba has changed and now uses only matrix-vector multiplications.

Note 2: mult is a deprecated argument (it will disappear in future versions).

Note 3: I'm not sure this solution will be fast. If you want a fast algorithm to compute a partial SVD, try the big_randomSVD() function from the bigstatsr package (disclaimer: I'm the author).
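
A minimal sketch of that route (assuming the data is held in bigstatsr's FBM format; as_FBM() converts an in-memory matrix, and the defaults of big_randomSVD() are used):

library(bigstatsr)

# Convert the matrix above to a Filebacked Big Matrix (FBM)
X <- as_FBM(x)
# Partial SVD of the 10 leading components via random projections
svd_big <- big_randomSVD(X, k = 10)
str(svd_big)  # contains u, d, v, as with base svd()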

randomized SVD singular values

Calculation

I reckon that your algorithm is a modification of the algorithm of Martinsson et al. If I understood it correctly, it is meant especially for approximating low-rank matrices, though I might be wrong.

The difference is easily explained by the huge gap between the actual rank of A (500) and the values of k (10) and p (5). Moreover, Martinsson et al. mention that the value for p should actually be larger than the chosen value for k.
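
For concreteness, here is a minimal sketch of this kind of randomized partial SVD; this rsvd() is only a hypothetical stand-in for the function under discussion, with k the target rank and p the oversampling parameter:

# Hypothetical Martinsson-style randomized partial SVD
rsvd <- function(A, k, p) {
  n <- ncol(A)
  Omega <- matrix(rnorm(n * (k + p)), n, k + p)  # random test matrix
  Q <- qr.Q(qr(A %*% Omega))   # orthonormal basis for the sampled range
  B <- crossprod(Q, A)         # small (k + p) x n matrix, i.e. t(Q) %*% A
  s <- svd(B, nu = k, nv = k)  # cheap SVD of the small matrix
  list(u = Q %*% s$u, d = s$d[1:k], v = s$v)
}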

So if we apply your solution, taking their considerations into account, using:

set.seed(10)
A <- matrix(rnorm(500 * 500), 500, 500)  # full rank: 500
B <- matrix(rnorm(500 * 50), 500, 500)   # columns recycle, so rank 50

The timings show that even with a larger p, the randomized algorithm still gives a huge speed-up over the full svd():

> system.time(t1 <- svd(A)$d[1:5])
   user  system elapsed
    0.8     0.0     0.8

> system.time(t2 <- rsvd(A, 10, 5)$d[1:5])
   user  system elapsed
   0.01    0.00    0.02

> system.time(t3 <- rsvd(A, 10, 30)$d[1:5])
   user  system elapsed
   0.04    0.00    0.03

> system.time(t4 <- svd(B)$d[1:5])
   user  system elapsed
   0.55    0.00    0.55

> system.time(t5 <- rsvd(B, 10, 5)$d[1:5])
   user  system elapsed
   0.02    0.00    0.02

> system.time(t6 <- rsvd(B, 10, 30)$d[1:5])
   user  system elapsed
   0.05    0.00    0.05

> system.time(t7 <- rsvd(B, 25, 30)$d[1:5])
   user  system elapsed
   0.06    0.00    0.06

But we see that using a higher p for a lower-rank matrix indeed gives a better approximation. If we also let k approach the rank more closely, the ratio between the approximate and the exact leading singular values becomes approximately 1, while the speed gain is still substantial:

> round(mean(t2/t1),2)
[1] 0.77

> round(mean(t3/t1),2)
[1] 0.82

> round(mean(t5/t4),2)
[1] 0.92

> round(mean(t6/t4),2)
[1] 0.97

> round(mean(t7/t4),2)
[1] 1

So in general, I believe one can conclude that:

  • p should be chosen so that p > k (Martinsson et al. call it l, if I remember correctly);
  • k shouldn't differ too much from rank(A);
  • for low-rank matrices the result is generally better (a quick numerical check follows below).
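
As that quick check, using the hypothetical rsvd() sketched above on the rank-50 matrix B (exact numbers will vary, but with k at the rank and p > k the ratio should be very close to 1):

# k equal to rank(B) and p > k: leading singular values should match
s_exact  <- svd(B)$d[1:5]
s_approx <- rsvd(B, k = 50, p = 60)$d[1:5]
round(mean(s_approx / s_exact), 2)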

Optimization:

As far as I'm concerned, it's a neat way of doing it; I couldn't really find a more optimal approach. The only thing I would note is that the construct t(q) %*% A is advised against: one should use crossprod(q, A) instead, which is supposed to be a tiny bit faster because it avoids explicitly forming the transpose. In your example, though, the difference was nonexistent.
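
A small self-contained check of that claim (not from the original answer; timings will vary by machine):

# crossprod(q, A) computes t(q) %*% A without materializing t(q)
set.seed(1)
q <- matrix(rnorm(500 * 15), 500, 15)
A <- matrix(rnorm(500 * 500), 500, 500)
all.equal(t(q) %*% A, crossprod(q, A))      # TRUE
system.time(for (i in 1:2000) t(q) %*% A)
system.time(for (i in 1:2000) crossprod(q, A))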


