Sample from Vector of Varying Length (Including 1)

This is a documented feature:

If x has length 1, is numeric (in the sense of is.numeric) and x >= 1, sampling via sample takes place from 1:x. Note that this convenience feature may lead to undesired behaviour when x is of varying length in calls such as sample(x).

An alternative is to write your own function to avoid the feature:

sample.vec <- function(x, ...) x[sample(length(x), ...)]
sample.vec(10)
# [1] 10
sample.vec(10, 3, replace = TRUE)
# [1] 10 10 10
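To see why this matters when the vector shrinks over time, here is a small sketch that draws elements one at a time until nothing is left. The names pool and picks are made up for this illustration; sample.vec is the helper defined above, repeated so the snippet runs on its own.

```r
# Helper from above: sample positions, then index into x
sample.vec <- function(x, ...) x[sample(length(x), ...)]

pool  <- c(12, 15, 9)   # hypothetical vector that shrinks to length 1
picks <- numeric(0)

while (length(pool) > 0) {
  pick  <- sample.vec(pool, 1)   # always one of the remaining elements
  picks <- c(picks, pick)
  pool  <- pool[pool != pick]    # drop the element just drawn
}

picks   # the three original values in random order
```

With plain sample(pool, 1), the last iteration would draw from 1:x for whichever value x is left, so picks could contain a number that was never in pool.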

Some functions with similar behavior are listed in the question "seq vs seq_along. When will using seq cause unintended results?"

sample() in R unpredictable when vector length is one

From help("sample"):

If x has length 1, is numeric (in the sense of is.numeric) and x >= 1,
sampling via sample takes place from 1:x.

So, when you have remaining = 2, sample(remaining) is equivalent to sample(x = 1:2).
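A quick way to convince yourself of this equivalence is to run both calls from the same seed. This is only a sketch; the seed value and the names r1 and r2 are arbitrary.

```r
remaining <- 2

set.seed(1)
r1 <- sample(remaining)   # length-one numeric: treated as 1:remaining

set.seed(1)
r2 <- sample(x = 1:2)     # the explicit form

identical(r1, r2)         # both are the same permutation of 1:2
```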

Update

From the comments it's clear you are also looking for a way around this behavior. Here is a benchmark comparison of three mentioned alternatives:

library(microbenchmark)

# if remaining is of length one
remaining <- 2

microbenchmark(a = {if ( length(remaining) > 1 ) { sample(remaining) } else { remaining }},
               b = ifelse(length(remaining) > 1, sample(remaining), remaining),
               c = remaining[sample(length(remaining))])

Unit: nanoseconds
 expr  min   lq    mean median     uq   max neval cld
    a  349  489  625.12  628.0  663.5  3283   100 a  
    b 1536 1886 2240.58 2025.0 2165.5 13898   100  b 
    c 4051 4400 5193.41 4679.5 5064.0 38413   100   c

# If remaining is not of length one
remaining <- 1:10
microbenchmark(a = {if ( length(remaining) > 1 ) { sample(remaining) } else { remaining }},
               b = ifelse(length(remaining) > 1, sample(remaining), remaining),
               c = remaining[sample(length(remaining))])

Unit: microseconds
 expr    min      lq     mean median      uq    max neval cld
    a  5.238  5.7970  6.82703  6.251  6.9145 51.264   100  a 
    b 11.663 12.2920 13.14831 12.851 13.3745 34.851   100   b
    c  5.238  5.9715  6.57140  6.426  6.8450 14.667   100  a

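It looks like the suggestion from joran may be the fastest in your case if sample() is called much more often when remaining is of length > 1, although the timings for a and c are then practically identical (they share a cld group). The if () {} else {} approach (a) is clearly faster when remaining is of length one. Note that b is not a drop-in replacement: ifelse() returns a result of the same length as its test, so with a length-one test it returns only the first element of sample(remaining).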
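Speed aside, it is worth checking what each alternative returns. The sketch below reuses the three expressions from the benchmark; the result names res_a, res_b and res_c are made up here.

```r
remaining <- 1:10
set.seed(42)

res_a <- if (length(remaining) > 1) sample(remaining) else remaining
res_b <- ifelse(length(remaining) > 1, sample(remaining), remaining)
res_c <- remaining[sample(length(remaining))]

length(res_a)   # 10: a full permutation
length(res_b)   # 1: ifelse() keeps only as many values as its test has
length(res_c)   # 10: a full permutation

# With a single element, all three return that element unchanged
remaining <- 2
one_a <- if (length(remaining) > 1) sample(remaining) else remaining
one_c <- remaining[sample(length(remaining))]
```

So a and c are interchangeable in terms of results, while b silently truncates whenever remaining has more than one element.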

Why does sample() not work for a single number?

Or do I just need to include an if statement to avoid this.

Yeah, unfortunately. Something like this:

# ... stands for any further arguments you pass to sample(), e.g. size
result = if(length(x) == 1) {x} else {sample(x, ...)}

Creating a sample vector of variable length for metadata

Maybe using paste in Map is another way.

stage <- c(Blast = 2, HSC = 4, LSC = 3)
unlist(Map(function(x, y) paste(x, seq_len(y), sep = "_"), names(stage), stage),
       FALSE, FALSE)
#[1] "Blast_1" "Blast_2" "HSC_1"   "HSC_2"   "HSC_3"   "HSC_4"   "LSC_1"  
#[8] "LSC_2"   "LSC_3"
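For comparison, the same labels can probably be built without Map by repeating each name and numbering within each group. This is an alternative sketch, not the approach above; the name labels is made up, and unname() is used only to keep the counts as a plain vector.

```r
stage <- c(Blast = 2, HSC = 4, LSC = 3)
n <- unname(stage)   # plain counts: 2, 4, 3

# rep() repeats each name n[i] times; sequence() gives 1..n[i] for each group
labels <- paste(rep(names(stage), n), sequence(n), sep = "_")
labels
```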

sampling bug in R?

Have a look at the Details of the sample function:

"If x has length 1, is numeric (in the sense of is.numeric) and x >= 1, sampling via sample takes place from 1:x"

How to sample 1:x where x is a vector of random integers with length greater than 1

Maybe use sapply to loop over vec:

out <- sapply(vec, sample, size = 1)
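Here the length-one behavior is exactly what is wanted, since each element of vec is passed to sample on its own. A runnable sketch with a made-up vec, plus a vapply variant that calls sample.int directly so the intent is explicit:

```r
vec <- c(5, 1, 20, 3)   # hypothetical upper bounds
set.seed(1)

# Each call is sample(x, size = 1) with a single number x, i.e. a draw from 1:x
out <- sapply(vec, sample, size = 1)

# Same idea, spelled out with sample.int() and a fixed return type
out2 <- vapply(vec, function(n) sample.int(n, 1), integer(1))

out
out2
```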

Sample a single value from list of vectors multiple times

When you have a numeric vector of length 1, the sampling happens from 1:x. From ?sample:

If x has length 1, is numeric (in the sense of is.numeric) and x >= 1, sampling via sample takes place from 1:x

So when you do

set.seed(123)
sample(10, 1)
#[1] 3

It is selecting 1 number from 1 to 10. To avoid that, you can check the length of the vector inside sapply:

sapply(groups, function(x) if(length(x) == 1) rep(x, repetition) 
                           else sample(x, repetition, replace = TRUE))

So when the vector has length 1, this returns that single value repeated repetition times.
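A self-contained run of the same idea, assuming a made-up groups list and repetition count (neither comes from the question):

```r
groups <- list(a = c(3, 7, 9), b = 10, c = c(2, 4))   # hypothetical input
repetition <- 5
set.seed(123)

res <- sapply(groups, function(x) {
  if (length(x) == 1) rep(x, repetition)
  else sample(x, repetition, replace = TRUE)
})

res   # one column per group, one row per repetition
```

The column for b contains only 10, whereas a plain sample(10, repetition, replace = TRUE) would have drawn values from 1:10.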

Sampling without replacement from multiple vectors of different length using vector lengths as some sort of weight

This will get you approximately 50 students (depending on how a was split):

new = unlist(lapply(a, function(x) sample(x, round(length(x)/2))))

To get exactly 50 each time, you can do this:

ll = sapply(a, length)   # Get length of each vector in "a"
target = 50
new_ll = 0
while (sum(new_ll) != target)
    new_ll = round(ll * target / sum(ll) + runif(length(ll), -0.5, 0.5))

# seq_along(a) is safe even if a is empty, unlike 1:length(a)
new = unlist(lapply(seq_along(a), function(i) sample(a[[i]], new_ll[i])))

Explanation: Get the length of each vector in a and assign it to ll. This amounts to doing ll[1] = length(vec1); ll[2] = length(vec2) and so on. We need to sample a certain amount from each vector in a such that we get 50 elements (target) in total. This amount is stored in new_ll. It is approximately equal to each vector's length times target / sum(ll), where sum(ll) is the total number of students.

Since rounding does not guarantee that exactly target students are selected each time, we add a little jitter with runif to move the numbers around slightly, and we continue looping until the sum of new_ll is equal to target.

The final line then iterates i from 1 through 10 (or the number of vectors in a) and samples new_ll[i] from each vector a[[i]].
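To try this end to end, here is a sketch with a made-up a: 100 student ids split into 10 groups of unequal size. The group sizes and the seed are assumptions for illustration only, and every group has more than one member, so the length-one behavior of sample does not come into play.

```r
set.seed(2024)

# Hypothetical data: ids 1..100 in 10 groups of differing size
sizes <- c(4, 6, 8, 10, 12, 14, 16, 12, 10, 8)
a <- split(1:100, rep(1:10, times = sizes))

ll <- sapply(a, length)   # length of each vector in a
target <- 50
new_ll <- 0
while (sum(new_ll) != target)
    new_ll <- round(ll * target / sum(ll) + runif(length(ll), -0.5, 0.5))

# Draw new_ll[i] ids without replacement from group i
new <- unlist(lapply(seq_along(a), function(i) sample(a[[i]], new_ll[i])))

length(new)   # equals target
```

If a could contain a numeric vector of length 1, swapping sample for an index-based helper such as x[sample(length(x), n)] would avoid drawing from 1:x.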
