SVD for sparse matrix in R
The irlba package has a very fast SVD implementation for sparse matrices.
Problems with SVD of a large matrix using bigmemory and irlba
This should work:
library(bigalgebra)
library(irlba)
## --> CHANGES HERE <--
setMethod("%*%", signature(x = "big.matrix", y = "numeric"),
function(x, y) x %*% as.matrix(y))
setMethod("%*%", signature(x = "numeric", y = "big.matrix"),
function(x, y) t(x) %*% y)
mult <- function(A, B) (A %*% B)[]
# Repdata
x <- matrix(rnorm(20 * 100 * 100), nrow = 20 * 10)
big <- as.big.matrix(x)
# Computation
irlbaObject <- irlba(big, nv = 10, mult = mult)
# Verification
svd <- svd(x, nu = 10, nv = 10)
plot(irlbaObject$u, svd$u)
plot(irlbaObject$v, svd$v)
Note 1: I think the algo in irlba has changed and now use only matrix-vector multiplications.
Note 2: mult is a deprecated argument (it will disappear in next versions).
Note 3: I'm not sure this solution will be fast. If you want a fast algorithm to compute partial SVD, try function big_randomSVD of package bigstatsr (disclaimer: I'm the author).
randomized SVD singular values
Calculation
I reckon that your algorithm is a modification of the algorithm of Martinsson et al.. If I understood it correctly, this is especially meant for approximations for low rank matrices. I might be wrong though.
The difference is easily explained by the huge difference between the actual rank of A (500) and the values of k (10) and p (5). Plus, Martinsson et al mention that the value for p should actually be larger than the chosen value for k.
So if we apply your solution taking their considerations into account, using :
set.seed(10)
A <- matrix(rnorm(500*500),500,500) # rank 500
B <- matrix(rnorm(500*50),500,500) # rank 50
We find for the timings that the use of a larger p value still results in a huge speed-up compared to the original svd algorithm.
> system.time(t1 <- svd(A)$d[1:5])
user system elapsed
0.8 0.0 0.8
> system.time(t2 <- rsvd(A,10,5)$d[1:5])
user system elapsed
0.01 0.00 0.02
> system.time(t3 <- rsvd(A,10,30)$d[1:5])
user system elapsed
0.04 0.00 0.03
> system.time(t4 <- svd(B)$d[1:5] )
user system elapsed
0.55 0.00 0.55
> system.time(t5 <-rsvd(B,10,5)$d[1:5] )
user system elapsed
0.02 0.00 0.02
> system.time(t6 <-rsvd(B,10,30)$d[1:5] )
user system elapsed
0.05 0.00 0.05
> system.time(t7 <-rsvd(B,25,30)$d[1:5] )
user system elapsed
0.06 0.00 0.06
But we see that using a higher p for a lower rank matrix indeed gives a better approximation. If we let k also approach the rank a bit closer, the difference between the real solution and the approximation becomes appx. 0, while the speed gain is still substantial.
> round(mean(t2/t1),2)
[1] 0.77
> round(mean(t3/t1),2)
[1] 0.82
> round(mean(t5/t4),2)
[1] 0.92
> round(mean(t6/t4),2)
[1] 0.97
> round(mean(t7/t4),2)
[1] 1
So in general I believe that one could conclude that :
- p should be chosen so p > k (Martinsson calls it
l
if I'm right) - k shouldn't be too much different from rank(A)
- For low rank matrices the result is generally better.
Optimalization:
As far as I'm concerned, it's a neat way of doing it. I couldn't really find a more optimal way actually. The only thing I could say is that the construct t(q) %*% A
is advised against. One should use crossprod(q,A)
for that, which is supposed to be a tiny bit faster. But in your example the difference was nonexistent.
Related Topics
Export Both Image and Data from R to an Excel Spreadsheet
How to Fill Histogram with Color Gradient
How to Convert Unix Timestamp (Milliseconds) and Timezone in R
The Representation of an Empty Argument in a "Call"
Extract Data Between a Pattern from a Text File
Create a New Variable Based on the First 7 Characters of Existing Variable
Sum Specific Columns Among Rows
Text Mining R Package & Regex to Handle Replace Smart Curly Quotes
Using Grep to Subset Rows from a Data.Table, Comparing Row Content
Read CSV with Two Headers into a Data.Frame
Overlapping the Predicted Time Series on the Original Series in R
Extent of Boundary of Text in R Plot
Update a Ggplot Using a for Loop (R)
Update Subset of Values in a Dataframe Column
Getsymbols and Using Lapply, Cl, and Merge to Extract Close Prices