How to Convert a Huge List-Of-Vector to a Matrix More Efficiently

How to convert a huge list-of-vector to a matrix more efficiently?

This should be equivalent to your current code, only a lot faster:

output <- matrix(unlist(z), ncol = 10, byrow = TRUE)
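As a minimal sketch of the same idea on a toy input: the list `z` below is hypothetical, and `ncol` is derived from the first element rather than hard-coded to 10. It assumes every vector in the list has the same length.

```r
# Hypothetical input: a list of numeric vectors, all of length 3.
z <- list(c(1, 2, 3), c(4, 5, 6), c(7, 8, 9), c(10, 11, 12))

# Flatten the list once, then fill the matrix row by row.
# ncol is taken from the data instead of being hard-coded.
output <- matrix(unlist(z, use.names = FALSE),
                 ncol = length(z[[1]]), byrow = TRUE)

# Each list element ends up as one row of the matrix.
dim(output)
output[2, ]
```

If the vectors differ in length, unlist() still succeeds but the rows no longer line up with the list elements, so this shortcut only applies to equal-length input.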

Converting each element of a Large list to a matrix in R

You can use lapply:

list_to_matrix <- function(data) {
  lapply(data, as.matrix)
}

data1 <- list_to_matrix(data)

Your own approach also works once the return() call is moved out of the for loop, so that it runs only after every element has been converted:

list_to_matrix <- function(data) {
  for (i in seq_along(data)) {
    data[[i]] <- as.matrix(data[[i]])
  }
  return(data)
}
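Either version can be checked on a small input. The sketch below assumes the list holds data frames (the original question does not show its data), and the object names are only illustrative.

```r
# Hypothetical input: a list of small data frames.
data <- list(
  data.frame(x = 1:3, y = 4:6),
  data.frame(x = 7:8, y = 9:10)
)

# Convert every element; the list keeps its length and order.
data1 <- lapply(data, as.matrix)

# Confirm that each element is now a matrix and inspect the dimensions.
vapply(data1, is.matrix, logical(1))
lapply(data1, dim)
```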

Convert a vector of lists with uneven length to a matrix

I assume 'data' is meant to be a list rather than a vector; in that case the following code works:

t(sapply(data, `length<-`, max(lengths(data))))

NOTE: lengths() (available since R 3.2.0) is a faster replacement for sapply(data, length).

data

data = list(
  c(349, 364, 393, 356, 357, 394, 334, 394, 343, 365, 349),
  c(390, 336, 752, 377),
  c(670, 757, 405, 343, 1109, 350, 372),
  c(0, 0),
  numeric(0),
  c(1115, 394, 327, 356, 408, 329, 385, 357, 357))
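To see what the padding step does, here is a small sketch on a hypothetical three-element list `x`. The shorter vectors are extended with NA, and the empty vector becomes a row of NA only.

```r
# Hypothetical small input with uneven lengths, including an empty vector.
x <- list(c(1, 2, 3), c(4, 5), numeric(0))

n <- max(lengths(x))

# `length<-` extends each vector to length n, padding with NA.
# sapply() returns one column per list element, so transpose the result.
padded <- t(sapply(x, `length<-`, n))
padded
```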

How do I make a matrix from a list of vectors in R?

One option is to use do.call():

 > do.call(rbind, a)
      [,1] [,2] [,3] [,4] [,5] [,6]
 [1,]    1    1    2    3    4    5
 [2,]    2    1    2    3    4    5
 [3,]    3    1    2    3    4    5
 [4,]    4    1    2    3    4    5
 [5,]    5    1    2    3    4    5
 [6,]    6    1    2    3    4    5
 [7,]    7    1    2    3    4    5
 [8,]    8    1    2    3    4    5
 [9,]    9    1    2    3    4    5
[10,]   10    1    2    3    4    5
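The answer does not show how `a` was built. A list that would produce output of this shape could look like the following; the construction is an assumption reconstructed from the printed matrix, not the original code.

```r
# Assumed reconstruction of `a`: ten vectors, each a row number followed by 1:5.
a <- lapply(1:10, function(i) c(i, 1:5))

# rbind() receives the ten vectors as separate arguments via do.call().
m <- do.call(rbind, a)
dim(m)
m[3, ]
```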

Find a sequence in a matrix as efficiently as possible

If I understand the problem correctly, a single loop through the rows is enough. Here is a way to do this with Rcpp. This version only returns the true/false answer; returning the indices as well is also possible.

library(Rcpp)

cppFunction('
bool hasSequence(LogicalMatrix m) {
  int nrow = m.nrow(), ncol = m.ncol();
  
  if (nrow > 0 && ncol > 0) {
    int j = 0;
    for (int i = 0; i < nrow; i++) {
      if (m(i, j)) {
        if (++j >= ncol) {
          return true;
        }
      }
    }
  }
  return false;
}')

a <- matrix(c(F, F, T, T, F, T, F, F, F, F,
              T, F, T, T, F, T, T, F, F, F,
              T, F, T, T, F, F, F, F, T, T), ncol = 3)

a
hasSequence(a)

To get the indices as well, the following function returns a list that always contains an element named 'found' (true or false) and, when found is true, a second element named 'indices':

cppFunction('
List findSequence(LogicalMatrix m) {
  int nrow = m.nrow(), ncol = m.ncol();

  IntegerVector indices(ncol);
  if (nrow > 0 && ncol > 0) {
    int j = 0;
    for (int i = 0; i < nrow; i++) {
      if (m(i, j)) {
        indices(j) = i + 1;
        if (++j >= ncol) {
          return List::create(Named("found") = true,
                              Named("indices") = indices);
        }
      }
    }
  }
  return List::create(Named("found") = false);
}')

findSequence(a)
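For cross-checking the compiled function, the same single-pass scan can be sketched in plain R. The name `find_sequence_r` is hypothetical, and this version treats NA as "not TRUE", which may differ from how the C++ code handles NA.

```r
# Plain-R reference version of the single-pass scan (hypothetical name).
find_sequence_r <- function(m) {
  stopifnot(is.logical(m), is.matrix(m))
  indices <- integer(ncol(m))
  j <- 1L                                 # column currently being searched
  if (ncol(m) > 0L) {
    for (i in seq_len(nrow(m))) {
      if (isTRUE(m[i, j])) {              # NA counts as FALSE here (assumption)
        indices[j] <- i                   # 1-based row index, as in R
        j <- j + 1L
        if (j > ncol(m)) {
          return(list(found = TRUE, indices = indices))
        }
      }
    }
  }
  list(found = FALSE)
}

# Small hypothetical test matrix: TRUE in column 1 at row 2, column 2 at row 3.
m <- matrix(c(FALSE, TRUE, FALSE, FALSE,
              TRUE, FALSE, TRUE, FALSE), ncol = 2)
res <- find_sequence_r(m)
res
```

This is much slower than the Rcpp version on large matrices and is only meant as a readable reference.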

A few links to learn about Rcpp:

High performance functions with Rcpp, Hadley Wickham
Rcpp for everyone, Masaki E. Tsuda
Interfacing R with C/C++, Matteo Fasiolo
Rcpp Gallery - Articles and code examples for the Rcpp package

You need to know at least a little C (preferably C++, but for basic usage you can think of Rcpp as C with some magic syntax for R data types). The first link explains the basics of Rcpp types: vectors, matrices and lists, and how to allocate, use and return them. The other links are good complements.

Rcpp: List - Matrix conversions by reference?? + Optimizing memory allocation when programming with matrices

The amount of memory allocated in both your methods is the same. You can see this from the mem_alloc column when using bench::mark() for benchmarking:

> bench::mark(gsumm(testm,ng,g, fill = FALSE),gsuml(testl,ng,g, fill = FALSE), check = FALSE)
# A tibble: 2 x 13
  expression                           min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result
  <bch:expr>                        <bch:> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>
1 gsumm(testm, ng, g, fill = FALSE) 14.1ms 15.1ms      64.7    7.63MB     0       33     0      510ms <dbl …
2 gsuml(testl, ng, g, fill = FALSE) 12.5ms 15.1ms      67.0    7.68MB     4.19    32     2      478ms <list…
# … with 3 more variables: memory <list>, time <list>, gc <list>

> bench::mark(gsumm(testm,ng,g, fill = TRUE),gsuml(testl,ng,g, fill = TRUE), check = FALSE)
# A tibble: 2 x 13
  expression                          min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result
  <bch:expr>                       <bch:> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>
1 gsumm(testm, ng, g, fill = TRUE) 39.2ms 45.6ms      20.0    83.9MB     20.0     5     5      250ms <dbl …
2 gsuml(testl, ng, g, fill = TRUE) 30.3ms   32ms      26.7      84MB     20.0     8     6      299ms <list…
# … with 3 more variables: memory <list>, time <list>, gc <list>

However, the memory is not only allocated, which is fast anyway, but also initialized with zero everywhere. This is unnecessary in your case and can be avoided by replacing Rcpp::NumericMatrix mat(rows, cols) with Rcpp::NumericMatrix mat = Rcpp::no_init(rows, cols) as well as Rcpp::NumericVector vec(length) with Rcpp::NumericVector vec = Rcpp::no_init(length). If I do this with your code, both functions profit:

> bench::mark(gsumm(testm,ng,g, fill = FALSE),gsuml(testl,ng,g, fill = FALSE), check = FALSE)
# A tibble: 2 x 13
  expression                           min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result
  <bch:expr>                        <bch:> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>
1 gsumm(testm, ng, g, fill = FALSE)   13ms 14.7ms      67.1    7.63MB     0       34     0      507ms <dbl …
2 gsuml(testl, ng, g, fill = FALSE) 12.8ms 14.6ms      67.4    7.68MB     2.04    33     1      489ms <list…
# … with 3 more variables: memory <list>, time <list>, gc <list>

> bench::mark(gsumm(testm,ng,g, fill = TRUE),gsuml(testl,ng,g, fill = TRUE), check = FALSE)
# A tibble: 2 x 13
  expression                          min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result
  <bch:expr>                       <bch:> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>
1 gsumm(testm, ng, g, fill = TRUE) 27.5ms   31ms      26.6    83.9MB     10.7    10     4      375ms <dbl …
2 gsuml(testl, ng, g, fill = TRUE) 24.7ms 26.4ms      36.9      84MB     36.9     9     9      244ms <list…
# … with 3 more variables: memory <list>, time <list>, gc <list>

I am not sure why the matrix version profits more from not initializing the memory, though.
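As a minimal sketch of the no_init pattern (not the gsumm/gsuml code from the question, which is not shown here), the hypothetical function below allocates an uninitialized vector and then writes every element. It assumes the Rcpp package and a working C++ toolchain are available.

```r
library(Rcpp)

cppFunction('
NumericVector fill_squares(int n) {
  // no_init(n) allocates the vector without zero-filling it.
  // Every element must be written before it is read; until then
  // the contents are undefined.
  NumericVector out = no_init(n);
  for (int i = 0; i < n; i++) {
    out[i] = (i + 1.0) * (i + 1.0);
  }
  return out;
}')

fill_squares(5)
```

Skipping the initialization is only safe when the function is guaranteed to overwrite the whole object, as in the loop above.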

Convert list of matrices of the same order to an array

Here's a benchmark test of the different strategies that were suggested. Feel free to update if you have new ideas / strategies.

# packages
require(data.table)
require(tidyr)
require(microbenchmark)

# data
lst <- replicate(100, matrix(rnorm(16), ncol = 4), simplify = FALSE)
# benchmark test
microbenchmark(
  do.call(rbind, lst), 
  Reduce(rbind, lst), 
  apply(simplify2array(lst), 2, rbind), 
  rbindlist(lapply(lst, data.frame)), 
  unnest(lapply(lst, data.frame))
  )

And the results:

Unit: microseconds
                                 expr       min         lq        mean     median        uq       max neval  
                  do.call(rbind, lst)    43.290    47.9760    55.63858    52.8845    62.703   101.307   100  
                   Reduce(rbind, lst)   542.236   570.7985   620.99652   585.3020   610.518  1871.272   100  
 apply(simplify2array(lst), 2, rbind)   311.061   345.2010   382.22978   368.6315   388.268  1563.782   100  
   rbindlist(lapply(lst, data.frame)) 11827.884 12472.3190 13092.57937 12823.0995 13595.841 15833.736   100   
      unnest(lapply(lst, data.frame)) 12371.905 12927.9765 13514.24261 13236.1360 14008.655 16121.143   100

Just out of curiosity, I have performed these benchmark tests for data.frame inputs as well and there the picture is very different:

# packages
require(data.table)
require(tidyr)
require(microbenchmark)
# data
lst <- replicate(100, as.data.frame(matrix(rnorm(16), ncol = 4)), simplify=FALSE)
# benchmark test
microbenchmark(
  do.call(rbind, lst), 
  Reduce(rbind, lst), 
  apply(simplify2array(lapply(lst, as.matrix)), 2, rbind), 
  rbindlist(lst), 
  unnest(lst)
)

with results:

Unit: microseconds
                                                    expr       min         lq       mean    median        uq        max neval 
                                     do.call(rbind, lst) 12406.716 12944.2660 13746.8552 13571.966 14564.056  16333.128   100    
                                      Reduce(rbind, lst) 36316.866 38450.7765 39894.9806 39299.610 40325.395 100949.158   100    
 apply(simplify2array(lapply(lst, as.matrix)), 2, rbind)  9577.717  9940.9930 10273.8674 10065.059 10291.996  12114.846   100    
                                          rbindlist(lst)   324.896   369.0770   397.7828   402.995   426.202    500.732   100  
                                             unnest(lst)   926.487   974.9095  1011.7322  1010.834  1033.596   1171.051   100
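To make the shapes involved concrete, here is a small sketch on a hypothetical list of three 2 x 2 matrices. It contrasts stacking the matrices by row, which is what most of the benchmarked calls do, with building a three-dimensional array through simplify2array().

```r
# Hypothetical input: three identical 2 x 2 integer matrices.
lst <- replicate(3, matrix(1:4, ncol = 2), simplify = FALSE)

# Stack the matrices on top of each other: a 6 x 2 matrix.
stacked <- do.call(rbind, lst)
dim(stacked)

# Keep each matrix as its own slice: a 2 x 2 x 3 array.
arr <- simplify2array(lst)
dim(arr)
```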

Faster way to unlist a list of large matrices?

To call a function on all the objects in a list at once, do.call is my usual first idea. Here it is combined with cbind to bind all the matrices by column.

For n = 10 (the other answers are included for the sake of completeness):

n <- 10
nr <- 24
nc <- 8000
test <- list()
set.seed(1234)
for (i in 1:n){
  test[[i]] <- matrix(rnorm(nr*nc),nr,nc)
}

require(data.table)
ori <- function() { matrix( as.numeric( unlist(test) ) ,nr,nc*n) }
Tensibai <- function() { do.call(cbind,test) }
BrodieG <- function() { `attr<-`(do.call(c, test), "dim", c(nr, nc * n)) }
nicola <- function() { setattr(unlist(test),"dim",c(nr,nc*n)) }

library(microbenchmark)
microbenchmark(r1 <- ori(),
               r2 <- Tensibai(),
               r3 <- BrodieG(),
               r4 <- nicola(), times=10)

Results:

Unit: milliseconds
             expr       min        lq     mean    median        uq      max neval cld
      r1 <- ori() 23.834673 24.287391 39.49451 27.066844 29.737964 93.74249    10   a
 r2 <- Tensibai() 17.416232 17.706165 18.18665 17.873083 18.192238 21.29512    10   a
  r3 <- BrodieG()  6.009344  6.145045 21.63073  8.690869 10.323845 77.95325    10   a
   r4 <- nicola()  5.912984  6.106273 13.52697  6.273904  6.678156 75.40914    10   a

As for why this is faster: as @nicola explained in the comments, these methods copy the data fewer times than the original method.

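All methods give the same result. Note that identical() compares only its first two arguments, so each result is compared against r1 in turn:

> all(sapply(list(r2, r3, r4), identical, r1))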
[1] TRUE

Convert a matrix to a list of column-vectors

As yet another way to skin this cat, treat the matrix as a plain vector, as if it had no dim attribute:

 split(x, rep(1:ncol(x), each = nrow(x)))
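A short sketch on a hypothetical 2 x 3 matrix shows the result. Because a matrix is stored column by column, repeating each column number nrow(x) times labels every element with the column it belongs to.

```r
# Hypothetical input: a 2 x 3 integer matrix, filled column by column.
x <- matrix(1:6, nrow = 2)

# Label each element with its column number, then split on that label.
cols <- split(x, rep(seq_len(ncol(x)), each = nrow(x)))
cols

# The list is named by column number: "1", "2", "3".
names(cols)
```

seq_len(ncol(x)) is used here instead of 1:ncol(x) only so that a matrix with zero columns does not produce the sequence 1, 0.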
