How to Convert a Huge List-Of-Vector to a Matrix More Efficiently

How to convert a huge list-of-vector to a matrix more efficiently?

This should be equivalent to your current code, only a lot faster:

output <- matrix(unlist(z), ncol = 10, byrow = TRUE)
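A minimal sketch of why this works, assuming z is a list of numeric vectors that all have length 10 (the toy z below is made up for illustration; the original z is not shown). unlist() concatenates the vectors in order, and byrow = TRUE lays each one out as a row, so the result should match do.call(rbind, z):

```r
# Hypothetical input: 5 vectors of length 10 (a stand-in for the real z)
z <- lapply(1:5, function(i) i * 10 + 1:10)

# Flatten once, then reshape row by row
output <- matrix(unlist(z), ncol = 10, byrow = TRUE)

# Reference result: bind the vectors as rows
reference <- do.call(rbind, z)

identical(output, reference)
dim(output)
```

Note that this shortcut only applies when every vector has the same length; otherwise the values shift across rows.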

Converting each element of a Large list to a matrix in R

You can use lapply :

list_to_matrix <- function(data) {
  lapply(data, as.matrix)
}

data1 <- list_to_matrix(data)

As for your approach, it should work if you take the return line out of the for loop and return once after the loop:

list_to_matrix <- function(data) {
  # seq_along() is safe for empty lists, unlike 1:length(data)
  for (i in seq_along(data)) {
    data[[i]] <- as.matrix(data[[i]])
  }
  return(data)
}
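For illustration, here is how the lapply version behaves on a small made-up list of data frames (the input and its names are hypothetical, not taken from the question):

```r
# Hypothetical input: a named list of small data frames
data <- list(
  first  = data.frame(x = 1:3, y = 4:6),
  second = data.frame(x = 7:8, y = 9:10)
)

list_to_matrix <- function(data) {
  lapply(data, as.matrix)
}

data1 <- list_to_matrix(data)

# Check that every element is now a matrix and that names survive
vapply(data1, is.matrix, logical(1))
dim(data1$first)
```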

Convert a vector of lists with uneven length to a matrix

I guess 'data' should be a list rather than a vector; then the following code works:

t(sapply(data, `length<-`, max(lengths(data))))

NOTE: lengths() (available since R 3.2.0) is a faster replacement for sapply(data, length).

data

data <- list(
  c(349, 364, 393, 356, 357, 394, 334, 394, 343, 365, 349),
  c(390, 336, 752, 377),
  c(670, 757, 405, 343, 1109, 350, 372),
  c(0, 0),
  numeric(0),
  c(1115, 394, 327, 356, 408, 329, 385, 357, 357))
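As a rough walk-through of the one-liner on the sample data: `length<-` pads each vector with NA up to the longest length, sapply() collects the padded vectors as columns, and t() turns them into rows. The intermediate names (max_len, padded) are only for illustration:

```r
data <- list(
  c(349, 364, 393, 356, 357, 394, 334, 394, 343, 365, 349),
  c(390, 336, 752, 377),
  c(670, 757, 405, 343, 1109, 350, 372),
  c(0, 0),
  numeric(0),
  c(1115, 394, 327, 356, 408, 329, 385, 357, 357))

# Length of the longest vector; shorter ones are padded up to this
max_len <- max(lengths(data))

# Pad with NA, then transpose so each original vector becomes one row
padded <- t(sapply(data, `length<-`, max_len))

dim(padded)
padded[5, ]  # the row that came from numeric(0)
```

The empty vector ends up as a row made entirely of NA, so downstream code needs to tolerate missing values.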

How do I make a matrix from a list of vectors in R?

One option is to use do.call():

> do.call(rbind, a)
      [,1] [,2] [,3] [,4] [,5] [,6]
 [1,]    1    1    2    3    4    5
 [2,]    2    1    2    3    4    5
 [3,]    3    1    2    3    4    5
 [4,]    4    1    2    3    4    5
 [5,]    5    1    2    3    4    5
 [6,]    6    1    2    3    4    5
 [7,]    7    1    2    3    4    5
 [8,]    8    1    2    3    4    5
 [9,]    9    1    2    3    4    5
[10,]   10    1    2    3    4    5
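The list a itself is not shown here. A list such as the following would produce a matrix of that shape; this definition is an assumption reconstructed from the printed output, not the original data:

```r
# Assumed input: ten vectors, each holding its own index followed by 1:5
a <- lapply(1:10, function(i) c(i, 1:5))

# rbind() receives the ten vectors as separate arguments, one row each
m <- do.call(rbind, a)
dim(m)
```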

Find a sequence in a matrix as efficiently as possible

If I understand the problem correctly, a single loop through the rows is enough. Here is a way to do this with Rcpp. This version only returns the TRUE/FALSE answer; returning the indices is also doable.

library(Rcpp)

cppFunction('
bool hasSequence(LogicalMatrix m) {
  int nrow = m.nrow(), ncol = m.ncol();

  if (nrow > 0 && ncol > 0) {
    int j = 0;
    for (int i = 0; i < nrow; i++) {
      if (m(i, j)) {
        if (++j >= ncol) {
          return true;
        }
      }
    }
  }
  return false;
}')

a <- matrix(c(F, F, T, T, F, T, F, F, F, F,
T, F, T, T, F, T, T, F, F, F,
T, F, T, T, F, F, F, F, T, T), ncol = 3)

a
hasSequence(a)

To also get the indices, the following function returns a list with at least one element, named 'found' (TRUE or FALSE); if found is TRUE, the list has a second element, named 'indices':

cppFunction('
List findSequence(LogicalMatrix m) {
  int nrow = m.nrow(), ncol = m.ncol();

  IntegerVector indices(ncol);
  if (nrow > 0 && ncol > 0) {
    int j = 0;
    for (int i = 0; i < nrow; i++) {
      if (m(i, j)) {
        indices(j) = i + 1;
        if (++j >= ncol) {
          return List::create(Named("found") = true,
                              Named("indices") = indices);
        }
      }
    }
  }
  return List::create(Named("found") = false);
}')

findSequence(a)
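To check the compiled functions, or to follow the logic without a C++ toolchain, the same scan can be written in plain R. This is only a reference sketch of the loop above (the name find_sequence_r is made up here), and it will be much slower on large matrices:

```r
# Plain R version of the same single pass over the rows.
# j is the column currently being searched; it advances each time a TRUE
# is found, so the matching row indices are strictly increasing.
find_sequence_r <- function(m) {
  if (nrow(m) == 0 || ncol(m) == 0) {
    return(list(found = FALSE))
  }
  indices <- integer(ncol(m))
  j <- 1L
  for (i in seq_len(nrow(m))) {
    if (m[i, j]) {
      indices[j] <- i
      j <- j + 1L
      if (j > ncol(m)) {
        return(list(found = TRUE, indices = indices))
      }
    }
  }
  list(found = FALSE)
}

a <- matrix(c(F, F, T, T, F, T, F, F, F, F,
              T, F, T, T, F, T, T, F, F, F,
              T, F, T, T, F, F, F, F, T, T), ncol = 3)

res <- find_sequence_r(a)
res$found
res$indices
```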

A few links to learn about Rcpp:

  • High performance functions with Rcpp, Hadley Wickham
  • Rcpp for everyone, Masaki E. Tsuda
  • Interfacing R with C/C++, Matteo Fasiolo
  • Rcpp Gallery - Articles and code examples for the Rcpp package

You need to know at least a bit of C (preferably C++, but for basic usage you can think of Rcpp as C with some magic syntax for R data types). The first link explains the basics of Rcpp types (vectors, matrices and lists; how to allocate, use and return them). The other links are good complements.

Rcpp: List - Matrix conversions by reference?? + Optimizing memory allocation when programming with matrices

The amount of memory allocated in both your methods is the same. You can see this from the mem_alloc column when using bench::mark() for benchmarking:

> bench::mark(gsumm(testm,ng,g, fill = FALSE),gsuml(testl,ng,g, fill = FALSE), check = FALSE)
# A tibble: 2 x 13
  expression                           min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result
  <bch:expr>                        <bch:> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>
1 gsumm(testm, ng, g, fill = FALSE) 14.1ms 15.1ms      64.7    7.63MB        0    33     0      510ms <dbl …
2 gsuml(testl, ng, g, fill = FALSE) 12.5ms 15.1ms      67.0    7.68MB     4.19    32     2      478ms <list…
# … with 3 more variables: memory <list>, time <list>, gc <list>

> bench::mark(gsumm(testm,ng,g, fill = TRUE),gsuml(testl,ng,g, fill = TRUE), check = FALSE)
# A tibble: 2 x 13
  expression                          min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result
  <bch:expr>                       <bch:> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>
1 gsumm(testm, ng, g, fill = TRUE) 39.2ms 45.6ms      20.0    83.9MB     20.0     5     5      250ms <dbl …
2 gsuml(testl, ng, g, fill = TRUE) 30.3ms   32ms      26.7      84MB     20.0     8     6      299ms <list…
# … with 3 more variables: memory <list>, time <list>, gc <list>

However, the memory is not only allocated, which is fast anyway, but also initialized with zero everywhere. This is unnecessary in your case and can be avoided by replacing Rcpp::NumericMatrix mat(rows, cols) with Rcpp::NumericMatrix mat = Rcpp::no_init(rows, cols) as well as Rcpp::NumericVector vec(length) with Rcpp::NumericVector vec = Rcpp::no_init(length). If I do this with your code, both functions profit:

> bench::mark(gsumm(testm,ng,g, fill = FALSE),gsuml(testl,ng,g, fill = FALSE), check = FALSE)
# A tibble: 2 x 13
  expression                           min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result
  <bch:expr>                        <bch:> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>
1 gsumm(testm, ng, g, fill = FALSE)   13ms 14.7ms      67.1    7.63MB        0    34     0      507ms <dbl …
2 gsuml(testl, ng, g, fill = FALSE) 12.8ms 14.6ms      67.4    7.68MB     2.04    33     1      489ms <list…
# … with 3 more variables: memory <list>, time <list>, gc <list>

> bench::mark(gsumm(testm,ng,g, fill = TRUE),gsuml(testl,ng,g, fill = TRUE), check = FALSE)
# A tibble: 2 x 13
  expression                          min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result
  <bch:expr>                       <bch:> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>
1 gsumm(testm, ng, g, fill = TRUE) 27.5ms   31ms      26.6    83.9MB     10.7    10     4      375ms <dbl …
2 gsuml(testl, ng, g, fill = TRUE) 24.7ms 26.4ms      36.9      84MB     36.9     9     9      244ms <list…
# … with 3 more variables: memory <list>, time <list>, gc <list>

I am not sure why the matrix version profits more from not initializing the memory, though.
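The no_init replacement described above can be sketched as follows. This is a minimal illustration, not the gsumm/gsuml code from the question; the function name filled_matrix and its arguments are made up. The important point is that memory obtained with no_init holds undefined values, so every element has to be written before it is read:

```r
library(Rcpp)

cppFunction('
NumericMatrix filled_matrix(int rows, int cols, double value) {
  // Allocate without zero-initialization: contents are undefined here
  NumericMatrix mat = no_init(rows, cols);
  // Write every element before anything reads it
  std::fill(mat.begin(), mat.end(), value);
  return mat;
}')

m <- filled_matrix(3, 2, 1.5)
dim(m)
```

Skipping the initialization only pays off when the function overwrites the whole object anyway, as in this sketch.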

Convert list of matrices of the same order to an array

Here's a benchmark test of the different strategies that were suggested. Feel free to update if you have new ideas / strategies.

# packages
require(data.table)
require(tidyr)
require(microbenchmark)

# data
lst <- replicate(100, matrix(rnorm(16), ncol = 4), simplify = FALSE)
# benchmark test
microbenchmark(
  do.call(rbind, lst),
  Reduce(rbind, lst),
  apply(simplify2array(lst), 2, rbind),
  rbindlist(lapply(lst, data.frame)),
  unnest(lapply(lst, data.frame))
)

And the results:

Unit: microseconds
                                expr       min         lq        mean     median        uq       max neval
                 do.call(rbind, lst)    43.290    47.9760    55.63858    52.8845    62.703   101.307   100
                  Reduce(rbind, lst)   542.236   570.7985   620.99652   585.3020   610.518  1871.272   100
apply(simplify2array(lst), 2, rbind)   311.061   345.2010   382.22978   368.6315   388.268  1563.782   100
  rbindlist(lapply(lst, data.frame)) 11827.884 12472.3190 13092.57937 12823.0995 13595.841 15833.736   100
     unnest(lapply(lst, data.frame)) 12371.905 12927.9765 13514.24261 13236.1360 14008.655 16121.143   100

Just out of curiosity, I have performed these benchmark tests for data.frame inputs as well and there the picture is very different:

# packages
require(data.table)
require(tidyr)
require(microbenchmark)
# data
lst <- replicate(100, as.data.frame(matrix(rnorm(16), ncol = 4)), simplify=FALSE)
# benchmark test
microbenchmark(
  do.call(rbind, lst),
  Reduce(rbind, lst),
  apply(simplify2array(lapply(lst, as.matrix)), 2, rbind),
  rbindlist(lst),
  unnest(lst)
)

with results:

Unit: microseconds
                                                   expr       min         lq       mean    median        uq        max neval
                                    do.call(rbind, lst) 12406.716 12944.2660 13746.8552 13571.966 14564.056  16333.128   100
                                     Reduce(rbind, lst) 36316.866 38450.7765 39894.9806 39299.610 40325.395 100949.158   100
apply(simplify2array(lapply(lst, as.matrix)), 2, rbind)  9577.717  9940.9930 10273.8674 10065.059 10291.996  12114.846   100
                                         rbindlist(lst)   324.896   369.0770   397.7828   402.995   426.202    500.732   100
                                            unnest(lst)   926.487   974.9095  1011.7322  1010.834  1033.596   1171.051   100
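The strategies benchmarked above all stack the rows into one 400 x 4 matrix. If the goal is instead a three-dimensional array, as the question title suggests, simplify2array() on its own may already be enough. A small sketch on the same kind of test data (the seed and the names arr and arr2 are only for illustration):

```r
set.seed(1)
lst <- replicate(100, matrix(rnorm(16), ncol = 4), simplify = FALSE)

# Stack the matrices along a third dimension: 4 x 4 x 100
arr <- simplify2array(lst)
dim(arr)

# Slice k of the array corresponds to the k-th matrix in the list
identical(arr[, , 3], lst[[3]])

# The same array built directly from the flattened data
arr2 <- array(unlist(lst), dim = c(4, 4, length(lst)))
identical(arr, arr2)
```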

Faster way to unlist a list of large matrices?

To call a function on all the objects in a list, do.call is my usual first idea, here combined with cbind to bind all the matrices by column.

For n = 10 (with the other answers included for the sake of completeness):

n <- 10
nr <- 24
nc <- 8000
test <- list()
set.seed(1234)
for (i in 1:n){
test[[i]] <- matrix(rnorm(nr*nc),nr,nc)
}

require(data.table)
ori <- function() { matrix( as.numeric( unlist(test) ) ,nr,nc*n) }
Tensibai <- function() { do.call(cbind,test) }
BrodieG <- function() { `attr<-`(do.call(c, test), "dim", c(nr, nc * n)) }
nicola <- function() { setattr(unlist(test),"dim",c(nr,nc*n)) }

library(microbenchmark)
microbenchmark(r1 <- ori(),
r2 <- Tensibai(),
r3 <- BrodieG(),
r4 <- nicola(), times=10)

Results:

Unit: milliseconds
            expr       min        lq     mean    median        uq      max neval cld
     r1 <- ori() 23.834673 24.287391 39.49451 27.066844 29.737964 93.74249    10   a
r2 <- Tensibai() 17.416232 17.706165 18.18665 17.873083 18.192238 21.29512    10   a
 r3 <- BrodieG()  6.009344  6.145045 21.63073  8.690869 10.323845 77.95325    10   a
  r4 <- nicola()  5.912984  6.106273 13.52697  6.273904  6.678156 75.40914    10   a

As for the why (see the comments), @nicola gave the answer: there is less copying than in the original method.

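All methods give the same result. Note that identical() only compares its first two arguments, so each result is checked against r1 in turn:

> all(sapply(list(r2, r3, r4), identical, r1))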
[1] TRUE
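The reason the unlist-based variants can replace cbind is that R stores matrices in column-major order, so concatenating the matrices' data and setting a dim attribute gives the same layout as binding them by column. A tiny sketch with made-up integer matrices:

```r
# Two small 2 x 2 matrices (illustrative data only)
mats <- list(matrix(1:4, 2, 2), matrix(5:8, 2, 2))

# Concatenate the underlying data, then declare the shape
flat <- unlist(mats)
dim(flat) <- c(2, 4)

# Reference result: bind by column
bound <- do.call(cbind, mats)

identical(flat, bound)
```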

Convert a matrix to a list of column-vectors

In the interests of skinning the cat, treat the array as a vector as if it had no dim attribute:

split(x, rep(seq_len(ncol(x)), each = nrow(x)))
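A small check of the split() approach on a made-up matrix, compared with a plain lapply() over the column indices (the names cols and cols2 are only for illustration):

```r
# Illustrative 2 x 3 matrix
x <- matrix(1:6, nrow = 2)

# Group the underlying vector by column index
cols <- split(x, rep(seq_len(ncol(x)), each = nrow(x)))

# Alternative: extract each column explicitly
cols2 <- lapply(seq_len(ncol(x)), function(j) x[, j])

# split() names the elements "1", "2", ...; apart from that the lists match
identical(unname(cols), cols2)
```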

