Weighted Random Selection with and Without Replacement

Weighted random sample without replacement in Python

You can use np.random.choice with replace=False as follows:

np.random.choice(vec, size, replace=False, p=P)

where vec is your population and P is the weight vector.

For example:

import numpy as np

vec = [1, 2, 3]
P = [0.5, 0.2, 0.3]
np.random.choice(vec, size=2, replace=False, p=P)

Weighted random selection with and without replacement

One of the fastest ways to draw many samples with replacement from an unchanging list is the alias method. The core intuition is that we can create a set of equal-sized bins for the weighted list that can be indexed very efficiently through bit operations, avoiding a binary search. It turns out that, done correctly, we need to store only two items from the original list per bin, and can thus represent the split with a single percentage.

Let us take the example of five equally weighted choices, (a:1, b:1, c:1, d:1, e:1).

To create the alias lookup:

  1. Normalize the weights such that they sum to 1.0: (a:0.2, b:0.2, c:0.2, d:0.2, e:0.2). This is the probability of choosing each item.

  2. Find the smallest power of 2 greater than or equal to the number of variables, and create this number of partitions, |p|. Each partition represents a probability mass of 1/|p|. In this case, we create 8 partitions, each able to contain 0.125.

  3. Take the variable with the least remaining weight, and place as much of its mass as possible in an empty partition. In this example, we see that a fills the first partition: (p1{a|null,1.0}, p2, p3, p4, p5, p6, p7, p8) with (a:0.075, b:0.2, c:0.2, d:0.2, e:0.2) remaining.

  4. If the partition is not filled, take the variable with the most remaining weight, and fill the rest of the partition with that variable.

Repeat steps 3 and 4 until all of the original weight has been assigned to partitions.

For example, if we run another iteration of 3 and 4, we see

(p1{a|null,1.0}, p2{a|b,0.6}, p3, p4, p5, p6, p7, p8) with (a:0, b:0.15, c:0.2, d:0.2, e:0.2) left to be assigned.

At runtime:

  1. Get a U(0,1) random number, say binary 0.001100000

  2. Bit-shift it left by lg2(|p|) bits to find the partition index. Thus, we shift it by 3, yielding 001.1, or position 1, and thus partition 2.

  3. If the partition is split, use the fractional portion of the shifted random number to decide the split. In this case, the value is 0.5, and 0.5 < 0.6, so return a.

Here is some code and another explanation, but unfortunately it doesn't use the bitshifting technique, nor have I actually verified it.
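Since that code isn't available here and didn't use the bitshifting trick anyway, here is a minimal Python sketch of the scheme described above (my own illustration, not taken from any linked answer): it builds the table with a Vose-style alias construction padded out to a power-of-two number of partitions, and the multiply-and-truncate in the draw is arithmetically the same as the left shift by lg2(|p|).

import random

def build_alias_table(weights):
    """Vose-style alias construction over a power-of-two number of bins."""
    n = len(weights)
    size = 1 << (n - 1).bit_length()     # smallest power of 2 >= n
    total = float(sum(weights))
    # scale so each bin carries probability mass 1/size (one "unit");
    # pad with zero-weight dummies so every bin is defined
    scaled = [w * size / total for w in weights] + [0.0] * (size - n)
    first = list(range(n)) + [0] * (size - n)   # dummies fall back to item 0
    alias = first[:]                     # the "second" item sharing each bin
    prob = [1.0] * size                  # P(bin i yields first[i])
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s] = scaled[s]              # bin s: item s with prob scaled[s]...
        alias[s] = l                     # ...topped up by item l otherwise
        scaled[l] -= 1.0 - scaled[s]     # l donated that mass to bin s
        (small if scaled[l] < 1.0 else large).append(l)
    # leftovers (only floating-point slack) own their whole bin
    return prob, first, alias

def alias_draw(prob, first, alias):
    size = len(prob)
    u = random.random() * size           # same as shifting left by lg2(size)
    i = int(u)                           # integer part -> bin index
    # fractional part decides the split inside the bin
    return first[i] if u - i < prob[i] else alias[i]

# five equally weighted choices a..e, as in the walkthrough above
prob, first, alias = build_alias_table([1, 1, 1, 1, 1])
items = "abcde"
counts = {c: 0 for c in items}
for _ in range(100_000):
    counts[items[alias_draw(prob, first, alias)]] += 1
print(counts)   # all five counts should be close to 20000

Note how each partition stores only two candidate items and one split probability, matching the two-items-per-bin observation above.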

A weighted version of random.choice

Since version 1.7.0, NumPy has a choice function that supports probability distributions.

from numpy.random import choice

draw = choice(list_of_candidates, number_of_items_to_pick,
              p=probability_distribution)

Note that probability_distribution is a sequence in the same order as list_of_candidates. You can also pass the keyword replace=False to change the behavior so that drawn items are not replaced.
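For example, a small sketch with made-up candidates and probabilities (the values below are illustrative only, just to show the call shape):

from numpy.random import choice

list_of_candidates = ['a', 'b', 'c', 'd']
probability_distribution = [0.4, 0.3, 0.2, 0.1]

# three distinct candidates, honoring the weights
draw = choice(list_of_candidates, 3, replace=False, p=probability_distribution)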

Weighted random sample of array items *without replacement*

The following implementation selects k out of n elements, without replacement, with weighted probabilities, in O(n + k log n) by keeping the accumulated weights of the remaining elements in a sum heap:

function sample_without_replacement<T>(population: T[], weights: number[], sampleSize: number) {
  // size = smallest power of two >= weights.length
  let size = 1;
  while (size < weights.length) {
    size = size << 1;
  }

  // construct a sum heap for the weights:
  // internal nodes occupy indices 1..size-1, leaves start at index `size`
  const root = 1;
  const w = [...new Array(size) as number[], ...weights, 0];
  for (let index = size - 1; index >= 1; index--) {
    const leftChild = index << 1;
    const rightChild = leftChild + 1;
    w[index] = (w[leftChild] || 0) + (w[rightChild] || 0);
  }

  // retrieves an element with weight-index r
  // from the part of the heap rooted at index
  const retrieve = (r: number, index: number): T => {
    if (index >= size) {
      // leaf: zero its weight so it cannot be drawn again
      w[index] = 0;
      return population[index - size];
    }

    const leftChild = index << 1;
    const rightChild = leftChild + 1;

    try {
      if (r <= w[leftChild]) {
        return retrieve(r, leftChild);
      } else {
        return retrieve(r - w[leftChild], rightChild);
      }
    } finally {
      // restore the sum-heap invariant on the way back up
      w[index] = w[leftChild] + w[rightChild];
    }
  };

  // and now retrieve sampleSize random elements without replacement
  const result: T[] = [];
  for (let k = 0; k < sampleSize; k++) {
    result.push(retrieve(Math.random() * w[root], root));
  }
  return result;
}

The code is written in TypeScript. You can transpile it to whatever version of ECMAScript you need in the TypeScript playground.

Test code:

const n = 1E7;
const k = n / 2;
const population: number[] = [];
const weight: number[] = [];
for (let i = 0; i < n; i++) {
  population[i] = i;
  weight[i] = i;
}

console.log(`sampling ${k} of ${n} elements without replacement`);
const sample = sample_without_replacement(population, weight, k);
console.log(sample.slice(0, 100)); // logging everything takes forever on some consoles
console.log("Done");

Executed in Chrome, this samples 5 000 000 out of 10 000 000 entries in about 10 seconds.

Speed up random weighted choice without replacement in Python

This is just a comment on jdehesa's answer. The question was whether it is useful to consider the case where only one weight is increased: yes, it is!

Example

import numpy as np
import numba as nb

@nb.njit(parallel=True)
def numba_choice_opt(population, weights, k, wc, b_full_wc_calc, ind, value):
    # Get cumulative weights
    if b_full_wc_calc:
        acc = 0
        for i in range(weights.shape[0]):
            acc += weights[i]
            wc[i] = acc
    # Increase only one weight (faster than recalculating the cumulative weights)
    else:
        weights[ind] += value
        for i in nb.prange(ind, wc.shape[0]):
            wc[i] += value

    # Total of weights
    m = wc[-1]
    # Arrays of sample and sampled indices
    sample = np.empty(k, population.dtype)
    sample_idx = np.full(k, -1, np.int32)
    # Sampling loop
    i = 0
    while i < k:
        # Pick random weight value
        r = m * np.random.rand()
        # Get corresponding index
        idx = np.searchsorted(wc, r, side='right')
        # Check index was not selected before; redraw if it was
        # If not using Numba you can just do `np.isin(idx, sample_idx)`
        already_drawn = False
        for j in range(i):
            if sample_idx[j] == idx:
                already_drawn = True
                break
        if already_drawn:
            continue
        # Save sampled value and index
        sample[i] = population[idx]
        sample_idx[i] = idx
        i += 1
    return sample

Example

np.random.seed(0)
population = np.random.randint(100, size=1_000_000)
weights = np.random.rand(len(population))
k = 10
wc = np.empty_like(weights)

# Initial calculation
%timeit numba_choice_opt(population, weights, k, wc, True, 0, 0)
# 1.41 ms ± 9.21 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

# Increase weight[100] by 3 and recalculate
%timeit numba_choice_opt(population, weights, k, wc, False, 100, 3)
# 213 µs ± 6.06 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# For comparison: please note that it is the memory allocation of wc
# which makes numba_choice so much slower than the initial calculation above
%timeit numba_choice(population, weights, k)
# 4.23 ms ± 64.9 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

Python: Sample N random items from list with weights but without repetition

The normal method to select randomly without repetition is to shuffle the entries and take the first N.

from random import shuffle

N = 1
entrants = ['John', 'Jane', 'Cthulhu']
shuffle(entrants)
print(entrants[:N])

Or more directly

from random import sample

N = 1
entrants = ['John', 'Jane', 'Cthulhu']
print(sample(entrants, N))

However, your requirement of weighted sampling means you'll need more than that; a weighted sketch follows the snippet below.

def unique_sample(population, count):
    shuffle(population)
    unique = set()
    it = iter(population)
    while len(unique) < count:
        elem = next(it)
        if elem not in unique:
            yield elem
            unique.add(elem)
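Not part of the original answer, but one standard way to combine weights with no-repetition is the Efraimidis-Spirakis exponential-keys trick; a minimal sketch, assuming strictly positive weights:

import random

def weighted_sample_without_replacement(population, weights, k):
    # Efraimidis-Spirakis: give each item an Exp(weight)-distributed key
    # and keep the k items with the smallest keys
    keys = [(random.expovariate(1.0) / w, item)
            for item, w in zip(population, weights)]
    return [item for _, item in sorted(keys)[:k]]

entrants = ['John', 'Jane', 'Cthulhu']
weights = [1, 1, 5]   # Cthulhu is five times as likely to be drawn first
print(weighted_sample_without_replacement(entrants, weights, 2))

Sorting all keys is O(n log n); keeping only the k smallest with heapq.nsmallest brings that down to O(n log k).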

How to use weights in numpy.random.choice without replacement to get desired sample

NumPy should raise a ValueError if your sample size is larger than your population size when replace=False. The fact that it doesn't throw an error must be a bug.

Edit:
I misread the number. Now I see the issue. Yes: if you are sampling without replacement and the probability weights of the 1s and 0s are as you set them, then as you sample you will initially be removing the 1s and 0s at the same rate. Eventually you will start running out of 1s to sample (remember: replace=False), even though the weight of the 1s was initially set to be four times larger than the weight of the 0s. So yes, you are right that the weights you passed to np.random.choice do not change as it samples; changing them would require NumPy to re-normalize your weights on every draw, which would be inefficient and isn't really needed for the usual use case. But to answer your question, if you really wanted to do this, you could do something like the following:

import numpy as np
import matplotlib.pyplot as plt

array = np.r_[np.ones(100_000), np.zeros(400_000)]
weights = array.copy()
weights[weights == 1] = 2.5
weights[weights == 0] = 0.625
normalized_weights = weights / weights.sum()
sample_sizes = (1_000, 5_000, 10_000, 50_000, 100_000, 200_000)
means = []
for sample_size in sample_sizes:
    # sample one by one
    means.append(np.mean([np.random.choice(array, 1, False, p=normalized_weights)
                          for _ in range(sample_size)]))
plt.plot(sample_sizes, means, marker="x")
plt.ylabel("Mean")
plt.xlabel("Sample size")

That is, you actually do have to sample the items one by one to achieve the desired effect, unfortunately.

Weighted sampling without replacement in Matlab

I don't think it is possible to avoid some sort of loop, since sampling without replacement means that the samples are no longer independent. Besides, what does the weighting actually mean when sampling without replacement?

In any case, for relatively small sample sizes I don't think you will notice any problem with performance. All the solutions I can think of basically do what you have done, but possibly expand out what is going on inside randsample, as sketched below.
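To make that concrete, here is a sketch (in Python rather than Matlab, and not from the original answer) of the loop such solutions expand to: draw one item by cumulative weight, remove it, and repeat.

import random

def weighted_sample_no_replace(population, weights, k):
    # draw one winner by cumulative weight, remove it, repeat
    population, weights = list(population), list(weights)
    out = []
    for _ in range(k):
        r = random.random() * sum(weights)
        acc = 0.0
        pick = len(weights) - 1          # fallback for float round-off
        for i, w in enumerate(weights):
            acc += w
            if r < acc:
                pick = i
                break
        out.append(population.pop(pick))
        weights.pop(pick)
    return out

print(weighted_sample_no_replace(['a', 'b', 'c', 'd'], [10, 1, 1, 1], 2))

This is O(k*n); the sum-heap and Rcpp approaches elsewhere on this page exist precisely to speed this loop up.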

Faster weighted sampling without replacement

Update:

An Rcpp implementation of the Efraimidis & Spirakis algorithm (thanks to @Hemmo, @Dinrem, @krlmlr and @rtlgrmpf):

library(inline)
library(Rcpp)

src <-
'
int num = as<int>(size), x = as<int>(n);
Rcpp::NumericVector vx = Rcpp::clone<Rcpp::NumericVector>(x);
Rcpp::NumericVector pr = Rcpp::clone<Rcpp::NumericVector>(prob);
Rcpp::NumericVector rnd = rexp(x) / pr;
for(int i = 0; i < vx.size(); ++i) vx[i] = i;
std::partial_sort(vx.begin(), vx.begin() + num, vx.end(), Comp(rnd));
vx = vx[seq(0, num - 1)] + 1;
return vx;
'

incl <-
'
struct Comp{
  Comp(const Rcpp::NumericVector& v) : _v(v) {}
  bool operator ()(int a, int b) { return _v[a] < _v[b]; }
  const Rcpp::NumericVector& _v;
};
'

funFast <- cxxfunction(signature(n = "Numeric", size = "integer", prob = "numeric"),
                       src, plugin = "Rcpp", include = incl)

# See the bottom of the answer for comparison
p <- c(995/1000, rep(1/1000, 5))
n <- 100000
system.time(print(table(replicate(funFast(6, 3, p), n = n)) / n))

      1       2       3       4       5       6
1.00000 0.39996 0.39969 0.39973 0.40180 0.39882
   user  system elapsed
   3.93    0.00    3.96
# In case of:
# Rcpp::IntegerVector vx = Rcpp::clone<Rcpp::IntegerVector>(x);
# i.e. instead of NumericVector
      1       2       3       4       5       6
1.00000 0.40150 0.39888 0.39925 0.40057 0.39980
   user  system elapsed
   1.93    0.00    2.03

Old version:

Let us try a few possible approaches:

Simple rejection sampling with replacement. This is a far simpler function than sample.int.rej offered by @krlmlr, i.e. the sample size is always equal to n. As we will see, it is still really fast assuming a uniform distribution for the weights, but extremely slow in other situations.

fastSampleReject <- function(all, n, w){
  out <- numeric(0)
  while(length(out) < n)
    out <- unique(c(out, sample(all, n, replace = TRUE, prob = w)))
  out[1:n]
}

The algorithm by Wong and Easton (1980). Here is an implementation of this Python version. It is stable and I might be missing something, but it is much slower compared to other functions.

fastSample1980 <- function(all, n, w){
  tws <- w
  for(i in (length(tws) - 1):0)
    tws[1 + i] <- sum(tws[1 + i], tws[1 + 2 * i + 1],
                      tws[1 + 2 * i + 2], na.rm = TRUE)
  out <- numeric(n)
  for(i in 1:n){
    gas <- tws[1] * runif(1)
    k <- 0
    while(gas > w[1 + k]){
      gas <- gas - w[1 + k]
      k <- 2 * k + 1
      if(gas > tws[1 + k]){
        gas <- gas - tws[1 + k]
        k <- k + 1
      }
    }
    wgh <- w[1 + k]
    out[i] <- all[1 + k]
    w[1 + k] <- 0
    while(1 + k >= 1){
      tws[1 + k] <- tws[1 + k] - wgh
      k <- floor((k - 1) / 2)
    }
  }
  out
}

Rcpp implementation of the algorithm by Wong and Easton. Possibly it can be optimized even more since this is my first usable Rcpp function, but anyway it works well.

library(inline)
library(Rcpp)

src <-
'
Rcpp::NumericVector weights = Rcpp::clone<Rcpp::NumericVector>(w);
Rcpp::NumericVector tws = Rcpp::clone<Rcpp::NumericVector>(w);
Rcpp::NumericVector x = Rcpp::NumericVector(all);
int k, num = as<int>(n);
Rcpp::NumericVector out(num);
double gas, wgh;

if((weights.size() - 1) % 2 == 0){
  tws[((weights.size()-1)/2)] += tws[weights.size()-1] + tws[weights.size()-2];
}
else
{
  tws[floor((weights.size() - 1)/2)] += tws[weights.size() - 1];
}

for (int i = (floor((weights.size() - 1)/2) - 1); i >= 0; i--){
  tws[i] += (tws[2 * i + 1]) + (tws[2 * i + 2]);
}
for(int i = 0; i < num; i++){
  gas = as<double>(runif(1)) * tws[0];
  k = 0;
  while(gas > weights[k]){
    gas -= weights[k];
    k = 2 * k + 1;
    if(gas > tws[k]){
      gas -= tws[k];
      k += 1;
    }
  }
  wgh = weights[k];
  out[i] = x[k];
  weights[k] = 0;
  while(k > 0){
    tws[k] -= wgh;
    k = floor((k - 1) / 2);
  }
  tws[0] -= wgh;
}
return out;
'

fun <- cxxfunction(signature(all = "numeric", n = "integer", w = "numeric"),
                   src, plugin = "Rcpp")

Now some results:

library(plyr)  # for ldply

times1 <- ldply(
  1:6,
  function(i) {
    n <- 1024 * (2 ** i)
    p <- runif(2 * n) # Uniform distribution
    p <- p/sum(p)
    data.frame(
      n=n,
      user=c(system.time(sample.int.test(n, p), gcFirst=T)['user.self'],
             system.time(weighted_Random_Sample(1:(2*n), p, n), gcFirst=T)['user.self'],
             system.time(fun(1:(2*n), n, p), gcFirst=T)['user.self'],
             system.time(sample.int.rej(2*n, n, p), gcFirst=T)['user.self'],
             system.time(fastSampleReject(1:(2*n), n, p), gcFirst=T)['user.self'],
             system.time(fastSample1980(1:(2*n), n, p), gcFirst=T)['user.self']),
      id=c("Base", "Reservoir", "Rcpp", "Rejection", "Rejection simple", "1980"))
  },
  .progress='text'
)

times2 <- ldply(
  1:6,
  function(i) {
    n <- 1024 * (2 ** i)
    p <- runif(2 * n - 1)
    p <- p/sum(p)
    p <- c(0.999, 0.001 * p) # Special case
    data.frame(
      n=n,
      user=c(system.time(sample.int.test(n, p), gcFirst=T)['user.self'],
             system.time(weighted_Random_Sample(1:(2*n), p, n), gcFirst=T)['user.self'],
             system.time(fun(1:(2*n), n, p), gcFirst=T)['user.self'],
             system.time(sample.int.rej(2*n, n, p), gcFirst=T)['user.self'],
             system.time(fastSampleReject(1:(2*n), n, p), gcFirst=T)['user.self'],
             system.time(fastSample1980(1:(2*n), n, p), gcFirst=T)['user.self']),
      id=c("Base", "Reservoir", "Rcpp", "Rejection", "Rejection simple", "1980"))
  },
  .progress='text'
)

(Two timing plots, for times1 and times2, appeared here; the raw timings are tabulated below.)

arrange(times1, id)
       n  user               id
1   2048  0.53             1980
2   4096  0.94             1980
3   8192  2.00             1980
4  16384  4.32             1980
5  32768  9.10             1980
6  65536 21.32             1980
7   2048  0.02             Base
8   4096  0.05             Base
9   8192  0.18             Base
10 16384  0.75             Base
11 32768  2.99             Base
12 65536 12.23             Base
13  2048  0.00             Rcpp
14  4096  0.01             Rcpp
15  8192  0.03             Rcpp
16 16384  0.07             Rcpp
17 32768  0.14             Rcpp
18 65536  0.31             Rcpp
19  2048  0.00        Rejection
20  4096  0.00        Rejection
21  8192  0.00        Rejection
22 16384  0.02        Rejection
23 32768  0.02        Rejection
24 65536  0.03        Rejection
25  2048  0.00 Rejection simple
26  4096  0.01 Rejection simple
27  8192  0.00 Rejection simple
28 16384  0.01 Rejection simple
29 32768  0.00 Rejection simple
30 65536  0.05 Rejection simple
31  2048  0.00        Reservoir
32  4096  0.00        Reservoir
33  8192  0.00        Reservoir
34 16384  0.02        Reservoir
35 32768  0.03        Reservoir
36 65536  0.05        Reservoir

arrange(times2, id)
       n  user               id
1   2048  0.43             1980
2   4096  0.93             1980
3   8192  2.00             1980
4  16384  4.36             1980
5  32768  9.08             1980
6  65536 19.34             1980
7   2048  0.01             Base
8   4096  0.04             Base
9   8192  0.18             Base
10 16384  0.75             Base
11 32768  3.11             Base
12 65536 12.04             Base
13  2048  0.01             Rcpp
14  4096  0.02             Rcpp
15  8192  0.03             Rcpp
16 16384  0.08             Rcpp
17 32768  0.15             Rcpp
18 65536  0.33             Rcpp
19  2048  0.00        Rejection
20  4096  0.00        Rejection
21  8192  0.02        Rejection
22 16384  0.02        Rejection
23 32768  0.05        Rejection
24 65536  0.08        Rejection
25  2048  1.43 Rejection simple
26  4096  2.87 Rejection simple
27  8192  6.17 Rejection simple
28 16384 13.68 Rejection simple
29 32768 29.74 Rejection simple
30 65536 73.32 Rejection simple
31  2048  0.00        Reservoir
32  4096  0.00        Reservoir
33  8192  0.02        Reservoir
34 16384  0.02        Reservoir
35 32768  0.02        Reservoir
36 65536  0.04        Reservoir

Obviously we can reject the 1980 function because it is slower than Base in both cases. Rejection simple also gets into trouble in the second case, where a single probability is 0.999.

So there remain Rejection, Rcpp and Reservoir. The last step is checking whether the values themselves are correct. To be sure about them, we will use sample as a benchmark (and also to eliminate any confusion about the observed probabilities, which do not have to coincide with p because of sampling without replacement).

p <- c(995/1000, rep(1/1000, 5))
n <- 100000

system.time(print(table(replicate(sample(1:6, 3, repl = FALSE, prob = p), n = n))/n))
      1       2       3       4       5       6
1.00000 0.39992 0.39886 0.40088 0.39711 0.40323 # Benchmark
   user  system elapsed
   1.90    0.00    2.03

system.time(print(table(replicate(sample.int.rej(2*3, 3, p), n = n))/n))
      1       2       3       4       5       6
1.00000 0.40007 0.40099 0.39962 0.40153 0.39779
   user  system elapsed
  76.02    0.03   77.49 # Slow

system.time(print(table(replicate(weighted_Random_Sample(1:6, p, 3), n = n))/n))
      1       2       3       4       5       6
1.00000 0.49535 0.41484 0.36432 0.36338 0.36211 # Incorrect
   user  system elapsed
   3.64    0.01    3.67

system.time(print(table(replicate(fun(1:6, 3, p), n = n))/n))
      1       2       3       4       5       6
1.00000 0.39876 0.40031 0.40219 0.40039 0.39835
   user  system elapsed
   4.41    0.02    4.47

Notice a few things here. For some reason weighted_Random_Sample returns incorrect values (I have not looked into it at all, but it works correctly assuming a uniform distribution). sample.int.rej is very slow in repeated sampling.

In conclusion, it seems that Rcpp is the optimal choice in the case of repeated sampling, while sample.int.rej is a bit faster otherwise and also easier to use.


