How can I split a vector into smaller vectors of size N?
Rust slices already contain the necessary method for that: chunks.
Starting from this:
let src: Vec<u8> = vec![1, 2, 3, 4, 5];
you can get a vector of slices (no copy):
let dst: Vec<&[u8]> = src.chunks(3).collect();
or a vector of vectors (slower, heavier):
let dst: Vec<Vec<u8>> = src.chunks(3).map(|s| s.into()).collect();
playground
In R, split a vector randomly into k chunks?
That should be pretty quick for smaller numbers. But here a more concise way.
k.chunks2 = function(seq.size, n.chunks) {
break.pts <- sort(sample(1:seq.size, n.chunks - 1, replace = FALSE))
break.len <- diff(c(0, break.pts, seq.size))
groups <- rep(1:n.chunks, times = break.len)
return(groups)
}
If you really get a huge number of groups, I think the sort
will start to cost you execution time. So you can do something like this (probably can be tweaked to be even faster) to split based on proportions. I am not sure how I feel about this, because as n.chunks
gets very large, the proportions will get very small. But it is faster.
k.chunks3 = function(seq.size, n.chunks) {
props <- runif(n.chunks)
grp.props <- props / sum(props)
chunk.size <- floor(grp.props[-n.chunks] * seq.size)
break.len <- c(chunk.size, seq.size - sum(chunk.size))
groups <- rep(1:n.chunks, times = break.len)
return(groups)
}
Running a benchmark, I think any of these will be fast enough (unit is microseconds).
n <- 1000
y <- 10
microbenchmark::microbenchmark(k.chunks(n, y),
k.chunks2(n, y),
k.chunks3(n, y))
Unit: microseconds
expr min lq mean median uq max neval
k.chunks(n, y) 49.9 52.05 59.613 53.45 58.35 251.7 100
k.chunks2(n, y) 46.1 47.75 51.617 49.25 52.55 107.1 100
k.chunks3(n, y) 8.1 9.35 11.412 10.80 11.75 44.2 100
But as the numbers get larger, you will notice a meaningful speedup (note the unit is now milliseconds).
n <- 1000000
y <- 100000
microbenchmark::microbenchmark(k.chunks(n, y),
k.chunks2(n, y),
k.chunks3(n, y))
Unit: milliseconds
expr min lq mean median uq max neval
k.chunks(n, y) 46.9910 51.38385 57.83917 54.54310 56.59285 113.5038 100
k.chunks2(n, y) 17.2184 19.45505 22.72060 20.74595 22.73510 69.5639 100
k.chunks3(n, y) 7.7354 8.62715 10.32754 9.07045 10.44675 58.2093 100
All said and done, I would probably use my k.chunks2()
function.
Split a vector into chunks such that sum of each chunk is approximately constant
So, let's use pruning. I think other solutions are giving a good solution, but not the best one.
First, we want to minimize
where S_n is the cumulative sum of the first n elements.
computeD <- function(p, q, S) {
n <- length(S)
S.star <- S[n] / 3
if (all(p < q)) {
(S[p] - S.star)^2 + (S[q] - S[p] - S.star)^2 + (S[n] - S[q] - S.star)^2
} else {
stop("You shouldn't be here!")
}
}
I think the other solutions optimize over p and q independently, which won't give a global minima (expected for some particular cases).
optiCut <- function(v) {
S <- cumsum(v)
n <- length(v)
S_star <- S[n] / 3
# good starting values
p_star <- which.min((S - S_star)^2)
q_star <- which.min((S - 2*S_star)^2)
print(min <- computeD(p_star, q_star, S))
count <- 0
for (q in 2:(n-1)) {
S3 <- S[n] - S[q] - S_star
if (S3*S3 < min) {
count <- count + 1
D <- computeD(seq_len(q - 1), q, S)
ind = which.min(D);
if (D[ind] < min) {
# Update optimal values
p_star = ind;
q_star = q;
min = D[ind];
}
}
}
c(p_star, q_star, computeD(p_star, q_star, S), count)
}
This is as fast as the other solutions because it prunes a lot the iterations based on the condition S3*S3 < min
. But, it gives the optimal solution, see optiCut(c(1, 2, 3, 3, 5, 10))
.
For the solution with K >= 3, I basically reimplemented trees with nested tibbles, that was fun!
optiCut_K <- function(v, K) {
S <- cumsum(v)
n <- length(v)
S_star <- S[n] / K
# good starting values
p_vec_first <- sapply(seq_len(K - 1), function(i) which.min((S - i*S_star)^2))
min_first <- sum((diff(c(0, S[c(p_vec_first, n)])) - S_star)^2)
compute_children <- function(level, ind, val) {
# leaf
if (level == 1) {
val <- val + (S[ind] - S_star)^2
if (val > min_first) {
return(NULL)
} else {
return(val)
}
}
P_all <- val + (S[ind] - S[seq_len(ind - 1)] - S_star)^2
inds <- which(P_all < min_first)
if (length(inds) == 0) return(NULL)
node <- tibble::tibble(
level = level - 1,
ind = inds,
val = P_all[inds]
)
node$children <- purrr::pmap(node, compute_children)
node <- dplyr::filter(node, !purrr::map_lgl(children, is.null))
`if`(nrow(node) == 0, NULL, node)
}
compute_children(K, n, 0)
}
This gives you all the solution that are least better than the greedy one:
v <- sort(sample(1:1000, 1e5, replace = TRUE))
test <- optiCut_K(v, 9)
You need to unnest this:
full_unnest <- function(tbl) {
tmp <- try(tidyr::unnest(tbl), silent = TRUE)
`if`(identical(class(tmp), "try-error"), tbl, full_unnest(tmp))
}
print(test <- full_unnest(test))
And finally, to get the best solution:
test[which.min(test$children), ]
Quickly split a large vector into chunks in R
A speed improvement from the parallel package:
chunks <- parallel::splitIndices(6510321, ncl = ceiling(6510321/2000))
Is it possible to split a vector into groups of 10 with iterators?
There is no such helper method on the Iterator
trait directly. However, there are two main ways to do it:
- Use the
[T]::chunks()
method (which can be called on aVec<T>
directly). However, it has a minor difference: it won't produceNone
, but the last iteration yields a smaller slice. For always-exact slices that omit the final chunk if it's incomplete, see the[T]::chunks_exact()
method.
Example:
let my_vec = (0..25).collect::<Vec<_>>();
for chunk in my_vec.chunks(10) {
println!("{:02?}", chunk);
}
Result:
```none
[00, 01, 02, 03, 04, 05, 06, 07, 08, 09]
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
[20, 21, 22, 23, 24]
```
- Use the
Itertools::chunks()
method from the crateitertools
. This crate extends theIterator
trait from the standard library so thischunks()
method works with all iterators! Note that the usage is slightly more complicated in order to be that general. This has the same behavior as the method described above: in the last iteration, the chunk will be smaller instead of containingNone
s.
Example:
extern crate itertools;
use itertools::Itertools;
for chunk in &(0..25).chunks(10) {
println!("{:02?}", chunk.collect::<Vec<_>>());
}
Result:
```none
[00, 01, 02, 03, 04, 05, 06, 07, 08, 09]
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
[20, 21, 22, 23, 24]
```
Split a vector into unequal chunks in R
split(example.data, rep(1:4, c(4,2,1,3)))
Split a vector into non-overlapping sub-list with increasing length
Example vector:
x <- letters[1:7]
Solution:
n <- ceiling(0.5 * sqrt(1 + 8 * length(x)) - 0.5)
split(x, rep(1:n, 1:n)[1:length(x)])
#$`1`
#[1] "a"
#
#$`2`
#[1] "b" "c"
#
#$`3`
#[1] "d" "e" "f"
#
#$`4`
#[1] "g"
Related Topics
Insert Rows For Missing Dates/Times
Split String Column to Create New Binary Columns
Opposite of %In%: Exclude Rows With Values Specified in a Vector
Remove Specific Characters from Column Names in R
Creating Grouped Bar-Plot of Multi-Column Data in R
Change the Class from Factor to Numeric of Many Columns in a Data Frame
Create Counter Within Consecutive Runs of Values
Merge 2 Data Frames in a Loop for Each Column in One of Them
Reshaping Multiple Sets of Measurement Columns (Wide Format) into Single Columns (Long Format)
Drop Unused Factor Levels in a Subsetted Data Frame
How to Replace Na Values With Zeros in an R Dataframe
Convert Data from Long Format to Wide Format With Multiple Measure Columns
How to Use Pivot_Longer to Reshape from Wide-Type Data to Long-Type Data With Multiple Variables
Why Is It Not Advisable to Use Attach() in R, and What Should I Use Instead
Error: Unexpected Symbol/Input/String Constant/Numeric Constant/Special in My Code
How to Subset Matrix to One Column, Maintain Matrix Data Type, Maintain Row/Column Names