Split a Vector into Chunks

How can I split a vector into smaller vectors of size N?

Rust slices already contain the necessary method for that: chunks.

Starting from this:

let src: Vec<u8> = vec![1, 2, 3, 4, 5];

you can get a vector of slices (no copy):

let dst: Vec<&[u8]> = src.chunks(3).collect();

or a vector of vectors (slower, heavier):

let dst: Vec<Vec<u8>> = src.chunks(3).map(|s| s.into()).collect();

playground

In R, split a vector randomly into k chunks?

That should be pretty quick for smaller numbers. But here a more concise way.

k.chunks2 = function(seq.size, n.chunks) {
  break.pts <- sort(sample(1:seq.size, n.chunks - 1, replace = FALSE))
  break.len <- diff(c(0, break.pts, seq.size))
  
  groups <- rep(1:n.chunks, times = break.len)
  return(groups)
}

If you really get a huge number of groups, I think the sort will start to cost you execution time. So you can do something like this (probably can be tweaked to be even faster) to split based on proportions. I am not sure how I feel about this, because as n.chunks gets very large, the proportions will get very small. But it is faster.

k.chunks3 = function(seq.size, n.chunks) {
  props <- runif(n.chunks)
  grp.props <- props / sum(props)
  
  chunk.size <- floor(grp.props[-n.chunks] * seq.size)
  break.len <- c(chunk.size, seq.size - sum(chunk.size))
  
  groups <- rep(1:n.chunks, times = break.len)
  return(groups)
}

Running a benchmark, I think any of these will be fast enough (unit is microseconds).

n <- 1000
y <- 10

microbenchmark::microbenchmark(k.chunks(n, y),
                               k.chunks2(n, y),
                               k.chunks3(n, y))

Unit: microseconds
            expr  min    lq   mean median    uq   max neval
  k.chunks(n, y) 49.9 52.05 59.613  53.45 58.35 251.7   100
 k.chunks2(n, y) 46.1 47.75 51.617  49.25 52.55 107.1   100
 k.chunks3(n, y)  8.1  9.35 11.412  10.80 11.75  44.2   100

But as the numbers get larger, you will notice a meaningful speedup (note the unit is now milliseconds).

n <- 1000000
y <- 100000

microbenchmark::microbenchmark(k.chunks(n, y),
                               k.chunks2(n, y),
                               k.chunks3(n, y))

Unit: milliseconds
            expr     min       lq     mean   median       uq      max neval
  k.chunks(n, y) 46.9910 51.38385 57.83917 54.54310 56.59285 113.5038   100
 k.chunks2(n, y) 17.2184 19.45505 22.72060 20.74595 22.73510  69.5639   100
 k.chunks3(n, y)  7.7354  8.62715 10.32754  9.07045 10.44675  58.2093   100

All said and done, I would probably use my k.chunks2() function.

Split a vector into chunks such that sum of each chunk is approximately constant

So, let's use pruning. I think other solutions are giving a good solution, but not the best one.

First, we want to minimize

Sample Image

where S_n is the cumulative sum of the first n elements.

computeD <- function(p, q, S) {
  n <- length(S)
  S.star <- S[n] / 3
  if (all(p < q)) {
    (S[p] - S.star)^2 + (S[q] - S[p] - S.star)^2 + (S[n] - S[q] - S.star)^2
  } else {
    stop("You shouldn't be here!")
  }
}

I think the other solutions optimize over p and q independently, which won't give a global minima (expected for some particular cases).

optiCut <- function(v) {
  S <- cumsum(v)
  n <- length(v)
  S_star <- S[n] / 3
  # good starting values
  p_star <- which.min((S - S_star)^2)
  q_star <- which.min((S - 2*S_star)^2)
  print(min <- computeD(p_star, q_star, S))
  
  count <- 0
  for (q in 2:(n-1)) {
    S3 <- S[n] - S[q] - S_star
    if (S3*S3 < min) {
      count <- count + 1
      D <- computeD(seq_len(q - 1), q, S)
      ind = which.min(D);
      if (D[ind] < min) {
        # Update optimal values
        p_star = ind;
        q_star = q;
        min = D[ind];
      }
    }
  }
  c(p_star, q_star, computeD(p_star, q_star, S), count)
}

This is as fast as the other solutions because it prunes a lot the iterations based on the condition S3*S3 < min. But, it gives the optimal solution, see optiCut(c(1, 2, 3, 3, 5, 10)).

For the solution with K >= 3, I basically reimplemented trees with nested tibbles, that was fun!

optiCut_K <- function(v, K) {
  
  S <- cumsum(v)
  n <- length(v)
  S_star <- S[n] / K
  # good starting values
  p_vec_first <- sapply(seq_len(K - 1), function(i) which.min((S - i*S_star)^2))
  min_first <- sum((diff(c(0, S[c(p_vec_first, n)])) - S_star)^2)
  
  compute_children <- function(level, ind, val) {
    
    # leaf
    if (level == 1) {
      val <- val + (S[ind] - S_star)^2
      if (val > min_first) {
        return(NULL)
      } else {
        return(val)
      } 
    } 
    
    P_all <- val + (S[ind] - S[seq_len(ind - 1)] - S_star)^2
    inds <- which(P_all < min_first)
    if (length(inds) == 0) return(NULL)
    
    node <- tibble::tibble(
      level = level - 1,
      ind = inds,
      val = P_all[inds]
    )
    node$children <- purrr::pmap(node, compute_children)
    
    node <- dplyr::filter(node, !purrr::map_lgl(children, is.null))
    `if`(nrow(node) == 0, NULL, node)
  }
  
  compute_children(K, n, 0)
}

This gives you all the solution that are least better than the greedy one:

v <- sort(sample(1:1000, 1e5, replace = TRUE))
test <- optiCut_K(v, 9)

You need to unnest this:

full_unnest <- function(tbl) {
  tmp <- try(tidyr::unnest(tbl), silent = TRUE)
  `if`(identical(class(tmp), "try-error"), tbl, full_unnest(tmp))
}
print(test <- full_unnest(test))

And finally, to get the best solution:

test[which.min(test$children), ]

Quickly split a large vector into chunks in R

A speed improvement from the parallel package:

chunks <- parallel::splitIndices(6510321, ncl = ceiling(6510321/2000))

Is it possible to split a vector into groups of 10 with iterators?

There is no such helper method on the Iterator trait directly. However, there are two main ways to do it:

Use the [T]::chunks() method (which can be called on a Vec<T> directly). However, it has a minor difference: it won't produce None, but the last iteration yields a smaller slice. For always-exact slices that omit the final chunk if it's incomplete, see the [T]::chunks_exact() method.

Example:

    let my_vec = (0..25).collect::<Vec<_>>();

    for chunk in my_vec.chunks(10) {
        println!("{:02?}", chunk);
    }

Result:

```none
[00, 01, 02, 03, 04, 05, 06, 07, 08, 09]
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
[20, 21, 22, 23, 24]
```

Use the Itertools::chunks() method from the crate itertools. This crate extends the Iterator trait from the standard library so this chunks() method works with all iterators! Note that the usage is slightly more complicated in order to be that general. This has the same behavior as the method described above: in the last iteration, the chunk will be smaller instead of containing Nones.

Example:

    extern crate itertools;
    use itertools::Itertools;
    
    for chunk in &(0..25).chunks(10) {
        println!("{:02?}", chunk.collect::<Vec<_>>());
    }

Result:

```none
[00, 01, 02, 03, 04, 05, 06, 07, 08, 09]
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
[20, 21, 22, 23, 24]
```

Split a vector into unequal chunks in R

split(example.data, rep(1:4, c(4,2,1,3)))

Split a vector into non-overlapping sub-list with increasing length

Example vector:

x <- letters[1:7]

Solution:

n <- ceiling(0.5 * sqrt(1 + 8 * length(x)) - 0.5)
split(x, rep(1:n, 1:n)[1:length(x)])
#$`1`
#[1] "a"
#
#$`2`
#[1] "b" "c"
#
#$`3`
#[1] "d" "e" "f"
#
#$`4`
#[1] "g"