Random Sampling to Give an Exact Sum

Random sampling to give an exact sum

There exist an algorithm for generating such random numbers.

Originally created for MATLAB, there is an R implementation of it:

Surrogate::RandVec

Citation from MATLAB script comment:

%   This generates an n by m array x, each of whose m columns
% contains n random values lying in the interval [a,b], but
% subject to the condition that their sum be equal to s. The
% scalar value s must accordingly satisfy n*a <= s <= n*b. The
% distribution of values is uniform in the sense that it has the
% conditional probability distribution of a uniform distribution
% over the whole n-cube, given that the sum of the x's is s.
%
% The scalar v, if requested, returns with the total
% n-1 dimensional volume (content) of the subset satisfying
% this condition. Consequently if v, considered as a function
% of s and divided by sqrt(n), is integrated with respect to s
% from s = a to s = b, the result would necessarily be the
% n-dimensional volume of the whole cube, namely (b-a)^n.
%
% This algorithm does no "rejecting" on the sets of x's it
% obtains. It is designed to generate only those that satisfy all
% the above conditions and to do so with a uniform distribution.
% It accomplishes this by decomposing the space of all possible x
% sets (columns) into n-1 dimensional simplexes. (Line segments,
% triangles, and tetrahedra, are one-, two-, and three-dimensional
% examples of simplexes, respectively.) It makes use of three
% different sets of 'rand' variables, one to locate values
% uniformly within each type of simplex, another to randomly
% select representatives of each different type of simplex in
% proportion to their volume, and a third to perform random
% permutations to provide an even distribution of simplex choices
% among like types. For example, with n equal to 3 and s set at,
% say, 40% of the way from a towards b, there will be 2 different
% types of simplex, in this case triangles, each with its own
% area, and 6 different versions of each from permutations, for
% a total of 12 triangles, and these all fit together to form a
% particular planar non-regular hexagon in 3 dimensions, with v
% returned set equal to the hexagon's area.
%
% Roger Stafford - Jan. 19, 2006

Example:

test <- Surrogate::RandVec(a=1000, b=100000, s=2000000, n=140, m=1, Seed=sample(1:1000, size = 1))
sum(test$RandVecOutput)
# 2000000
hist(test$RandVecOutput)

Sample Image

Random sampling in R without direct repetition and exact quantity of each number

You can try this. The idea is to create the sorted vector first (seqc). Then for each iteration, the algorithm sample one value out of the possible values (i.e. all except the previous one in the vector).

seqc <- rep(values, each = 92)
vec <- sample(seqc, 1)
seqc <- seqc[-match(vec, seqc)]
for (i in 2:368){
vec[i] <- sample(seqc[seqc != vec[i - 1]], 1)
seqc <- seqc[-match(vec[i], seqc)]
}

output

head(vec)
# [1] "red" "blue" "red" "yellow" "blue" "yellow"

table(vec)
#vec
# blue green red yellow
# 92 92 92 92

It might throw an error, because the algorithm might not work as expected. In that case, rerun it until it works; it usually takes no more than 3 iterations for it to work.

Randomly select values from a given number list to add to a certain value in r

This computes all possible combinations of your input vector, so if this is very long, this might be a problem.

getVal <- function(vec,val) {
comb = combn(vec, 2)
idx = colSums(comb) == val
if (sum(idx)) {
return(comb[,idx][,sample(sum(idx),1)])
}
return(FALSE)
}

vec = (c(1,4,6,9))
val = 10
getVal(vec,val)
>>[1] 1 9

val = 11
>>[1] FALSE
getVal(vec,val)

Random sampling R

You can use the sample function from base R.

All you have to do is sample the rows with replace = FALSE, which means you won't have any overlapping. You can also define the number of samples.

n_groups <- 3
observations_per_group <- 5
size <- n_groups * obersavations_per_group
selected_samples <- sample(seq_len(nrow(NF)), size = size, replace = FALSE)

# Now index those selected rows
NF_1 <- NF[selected_samples, ]

Now, according to your comment, if you want to generate N dataframes, each with a number of samples and also label them accordingly, you can use lapply (which is a function that "applies" a function to a set of values). The "l" in "lapply" means that it returns a list. There are other types of apply functions. You can read more about that (and I highly recommend that you do!) here.

This code should solve your problem, or at least give you a good idea or where to go.

n_groups <- 3
observations_per_group <- 5
size <- observations_per_group * n_groups

# First we'll get the row samples.
selected_samples <- sample(
seq_len(nrow(NF)),
size = size,
replace = FALSE
)

# Now we split them between the number of groups
split_samples <- split(
selected_samples,
rep(1:n_groups, observations_per_group)
)

# For each group (1 to n_groups) we'll define a dataframe with samples
# and store them sequentially in a list.

my_dataframes <- lapply(1:n_groups, function(x) {
# our subset df will be the original df with the list of samples
# for group at position "x" (1, 2, 3.., n_groups)
subset_df <- NF[split_samples[x], ]
return(subset_df)
})

# now, if you need to access the results, you can simply do:
first_df <- my_dataframes[[1]] # use double brackets to access list elements

Generating a list of random numbers, summing to 1

The simplest solution is indeed to take N random values and divide by the sum.

A more generic solution is to use the Dirichlet distribution
which is available in numpy.

By changing the parameters of the distribution you can change the "randomness" of individual numbers

>>> import numpy as np, numpy.random
>>> print np.random.dirichlet(np.ones(10),size=1)
[[ 0.01779975 0.14165316 0.01029262 0.168136 0.03061161 0.09046587
0.19987289 0.13398581 0.03119906 0.17598322]]

>>> print np.random.dirichlet(np.ones(10)/1000.,size=1)
[[ 2.63435230e-115 4.31961290e-209 1.41369771e-212 1.42417285e-188
0.00000000e+000 5.79841280e-143 0.00000000e+000 9.85329725e-005
9.99901467e-001 8.37460207e-246]]

>>> print np.random.dirichlet(np.ones(10)*1000.,size=1)
[[ 0.09967689 0.10151585 0.10077575 0.09875282 0.09935606 0.10093678
0.09517132 0.09891358 0.10206595 0.10283501]]

Depending on the main parameter the Dirichlet distribution will either give vectors where all the values are close to 1./N where N is the length of the vector, or give vectors where most of the values of the vectors will be ~0 , and there will be a single 1, or give something in between those possibilities.

EDIT (5 years after the original answer): Another useful fact about the Dirichlet distribution is that you naturally get it, if you generate a Gamma-distributed set of random variables and then divide them by their sum.

generate random integers with a specific sum

Create your first random number. After that you take the difference between the value of num1 and 100 as the max def of rnd. But to guarantee that their sum is 100, you have to check at the last num if the sum of all nums is 100. If not the value of your last num is the difference that their sum and 100.

And to simply your code and get a clean strcuture, put that code in a loop and instead of single numbers work with an int[5] array.


private int[] CreateRandomNumbersWithSum()
{
int[] nums = new int[5];
int difference = 100;
Random rnd = new Random();

for (int i = 0; i < nums.Length; i++)
{
nums[i] = rnd.Next(0, difference);

difference -= nums[i];
}

int sum = 0;

foreach (var num in nums)
sum += num;

if (sum != 100)
{
nums[4] = 100 - sum;
}

return nums;
}

How to get random number list, with fixed sum and size

Ok, here it is. Let's start with integer problem. Simplest way to proceed is to use statistical distribution which is naturally produced set of numbers summed to a fixed value. And there is such distribution - Multinomial distribution. It has fixed sum equal to n, it provides sampled values from 0 to n. Because requirement is for sampling interval to be arbitrary, first we will shift interval to have minimum at 0, sample and then shift it back. Note, that sampled values might be higher than desired max, thus we have to use rejection technique, where any sample above max will be reject and next attempt will be sampled.

We use multinomial sampling from Python/Numpy. Along with rejection you could add test for uniqueness as well. Code, python 3.7

import numpy as np

def sample_sum_interval(n: int, p, maxv: int):
while True:
q = np.random.multinomial(n, p, size=1)[0]
v = np.where(q > maxv)
if len(v[0]) == 0: # if len(v) > 0, some values are outside the range, reject
# test on unique if len(np.unique(q)) == len(q)
return q
return None

k = 6
min = -8
max = 13
sum = 13

n = sum - k*min # redefined sum
maxv = max - min # redefined max, min would be 0
p = np.full((k), np.float64(1.0)/np.float64(k), dtype=np.float64) # probabilities

q = sample_sum_interval(n, p, maxv) + min # back to original interval
print(q)
print(np.sum(q))
print(np.mean(q))

q = sample_sum_interval(n, p, maxv) + min
print(q)
print(np.sum(q))
print(np.mean(q))

q = sample_sum_interval(n, p, maxv) + min
print(q)
print(np.sum(q))
print(np.mean(q))

Output

[ 5  0 -2  3  3  4]
13
2.1666666666666665
[ 3 3 -1 2 1 5]
13
2.1666666666666665
[-4 0 0 3 10 4]
13
2.1666666666666665

In order to translate it to Javascript, you would need either multinomial sampling, or binomial one, from binomial it is easy to get to multinomial.

UPDATE

ok, here is output when I don't add min to the result, sum would be 61 (13+6*8),
range [0...21]

[11  7  6  8  9 20]
61
10.166666666666666
[ 5 10 14 13 14 5]
61
10.166666666666666
[ 9 12 7 15 7 11]
61
10.166666666666666

Apparently, there is a Javascript library with multinomial sampling, which is modeled after R,
thank you @bluelovers

It should be called in the loop like this:

v = rmultinom(1, n, p);

and then v should be checked to be within range [0...maxv] and accepted or rejected if outside.

UPDATE II

Let me quickly (sorry, really have no time, will get to it tomorrow) describethe idea how I would do it for floats. There is similar distribution with generated bunch of numbers in the [0...1] range called Dirichlet distribution, and it always sums to fixed value of 1. In Python/Numpy it is https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.random.dirichlet.html.

Suppose I sampled n numbers di from Dirichlet and then map them to [min...max] interval:

xi = min + di*(max-min)

Then I sum them all, using property that all di sums to 1:

Sum = n*min + (max - min) = (n-1)*min + max

If Sum is fixed, then it mean we have to redefine maximum value - lets call it sampling maxs.

So sampling procedure would be as following - sample n [0...1] numbers from Dirichlet, map them to [min...maxs] interval, and check if any of those numbers are below desired max (original, not redefined). If it is, you accept, otherwise reject, like in integer case.

Code below

import numpy as np

def my_dirichlet(n: int):
"""
This is equivalent to numpy.random.dirichlet when all alphas are equal to 1
"""
q = np.random.exponential(scale = np.float64(1.0), size=n)
norm = 1.0/np.sum(q)
return norm * q

def sample_sum_interval(n: int, summa: np.float64, minv: np.float64, maxv: np.float64):
maxs = summa - np.float64(n-1)*minv # redefine maximum value of the interval is sum is fixed
alpha = np.full(n, np.float64(1.0), dtype=np.float64)

while True:
q = my_dirichlet(n) # q = np.random.dirichlet(alpha)
q = minv + q*(maxs - minv) # we map it to [minv...maxs]
v = np.where(q > maxv) # but we need it in the [minv...maxv], so accept or reject test
if len(v[0]) == 0: # if len(v) > 0, some values are outside the range, reject, next sample
return q
return None

n = 5
min = np.float64(-2.0)
max = np.float64(3.0)
sum = np.float64(1.0)

q = sample_sum_interval(n, sum, min, max)
print(q)
print(np.sum(q))

q = sample_sum_interval(n, sum, min, max)
print(q)
print(np.sum(q))

q = sample_sum_interval(n, sum, min, max)
print(q)
print(np.sum(q))

I've put standard NumPy Dirichlet sampling as well as custom Dirichlet sampling. Apparently, libRmath.js has exponential distribution sampling, but no Dirichlet, but it could be replaced with user-define code and exponentials. Remember, NumPy operates on vectors with single operator, loops are implicit.

Output:

[-0.57390094 -1.80924001  0.47630282  0.80008638  2.10675174]
1.0000000000000013
[-1.12192892 1.18503129 0.97525135 0.69175429 -0.73010801]
0.9999999999999987
[-0.34803521 0.36499743 -1.165332 0.9433809 1.20498888]
0.9999999999999991


Related Topics



Leave a reply



Submit