Best Way to Pick a Random Subset from a Collection

best way to pick a random subset from a collection?

Jon Bentley discusses this in either 'Programming Pearls' or 'More Programming Pearls'. You need to be careful with your N of M selection process, but I think the code shown works correctly. Rather than randomly shuffle all the items, you can do the random shuffle only shuffling the first N positions - which is a useful saving when N << M.

Knuth also discusses these algorithms - I believe that would be Vol 3 "Sorting and Searching", but my set is packed pending a move of house so I can't formally check that.

Java best way to select random subset from an array of strings with no duplicates

You can use Collections.shuffle which is exactly for these purposes.

For example:

String[] strings = new String[] {
"foo", "bar", "toto", "fox", "koukou", "bar", "Bar"
} ;

// Filter-out duplicates, then shuffle
final List<String> stringsList = Arrays.stream(strings)
.distinct().collect(Collectors.toList());
Collections.shuffle(stringsList);
// Take a sample
final int k = 3; // the sample size
List<String> sample = stringsList.subList(0, k); // this is a view, not another list
System.err.println(sample);

If we did not care about duplicates, things would be a bit simpler and shuffling could be performed on the original array (then, we could just pick first k elements):

Collections.shuffle(Arrays.asList(strings));

Efficient random sampling from a huge list

Elaborating on aix's answer above, to choose k out of a stream of items, read the items one at a time. Keep the first k items in a set S.

Now when reading the m-th item I (m>k now), keep it with probability k/m. If you do keep it, select an item U uniformly at random from S, and replace U with I.

The proof that this yields all subsets of size k with equal probability is based on induction on m. Note that you don't need to know n (the total number of items) in advance, and that S at each step is suitable. The algorithm is "streaming" - it doesn't require storing all items, or making a second pass.

Get a random subset from a result set in java

List<Member> list = new LinkedList<Member>(memberSet);
Collections.shuffle(list);
Set<Member> randomSet = new HashSet<Member>(list.subList(0, 5));

Full example:

public static void main(String... args) {

Set<Member> memberSet = new HashSet<Member>();
for (int i = 0; i < 100; i++)
memberSet.add(new Member(i));

List<Member> list = new LinkedList<Member>(memberSet);
Collections.shuffle(list);
Set<Member> randomSet = new HashSet<Member>(list.subList(0, 5));

System.out.println(randomSet);
}

static class Member {
final int value;
public Member(int value) {
this.value = value;
}
@Override
public String toString() {
return "" + value;
}
}

Sampling a random subset from an array

I suggest shuffling a copy of the array using the Fisher-Yates shuffle and taking a slice:

function getRandomSubarray(arr, size) {
var shuffled = arr.slice(0), i = arr.length, temp, index;
while (i--) {
index = Math.floor((i + 1) * Math.random());
temp = shuffled[index];
shuffled[index] = shuffled[i];
shuffled[i] = temp;
}
return shuffled.slice(0, size);
}

var x = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15];
var fiveRandomMembers = getRandomSubarray(x, 5);

Note that this will not be the most efficient method for getting a small random subset of a large array because it shuffles the whole array unnecessarily. For better performance you could do a partial shuffle instead:

function getRandomSubarray(arr, size) {
var shuffled = arr.slice(0), i = arr.length, min = i - size, temp, index;
while (i-- > min) {
index = Math.floor((i + 1) * Math.random());
temp = shuffled[index];
shuffled[index] = shuffled[i];
shuffled[i] = temp;
}
return shuffled.slice(min);
}

How can you obtain a randomized subset of a list in Java, without using Collections.shuffle?

Shuffling the whole list would be a bit of a waste of resources if what you want is significantly smaller. I'd personally just pick n unique random numbers between 0..size and use the objects at those indexes for the randomised subset.

If you're talking about picking a random subset very close to the size of the whole collection, then chances are you're better of just calling Collections.shuffle() and choosing the first n entries. But if we're talking about ~5 / 50,000, definitely use the above approach.

Pick a unique random subset from a set of unique values

Create a random permutation of the range 0, 1, ..., N - 1 and pick the first M of them; use those as indices into your original vector.

A random permutation is easily made with the standard library by using std::iota together with std::random_shuffle:

std::vector<Heavy> v; // given

std::vector<unsigned int> indices(V.size());
std::iota(indices.begin(), indices.end(), 0);
std::random_shuffle(indices.begin(), indices.end());

// use V[indices[0]], V[indices[1]], ..., V[indices[M-1]]

You can supply random_shuffle with a random number generator of your choice; check the docu­men­tation for details.

Quick way of creating a random subset of an array using IntStream

In this case, you want to select streamSize distinct elements from an input array.

You can use the Random#ints(randomNumberOrigin, randomNumberBound) method:

Returns an effectively unlimited stream of pseudorandom int values, each conforming to the given origin (inclusive) and bound (exclusive).

This returns a stream of random integers in the specified range. To ensure distinct values, distinct() is called and limit(...) allows to only keep the number of elements we want.

Random random = new Random();
Player[] subset = random.ints(0, allPlayers.length)
.distinct()
.limit(subsetSize)
.mapToObj(i -> allPlayers[i])
.toArray(Player[]::new);

I would note however, that, even if this is a one-liner, it is not as efficient as JB Nizet's solution since this continues generating integers until a distinct one is found.

How can I randomly select an item from a list?

Use random.choice():

import random

foo = ['a', 'b', 'c', 'd', 'e']
print(random.choice(foo))

For cryptographically secure random choices (e.g., for generating a passphrase from a wordlist), use secrets.choice():

import secrets

foo = ['battery', 'correct', 'horse', 'staple']
print(secrets.choice(foo))

secrets is new in Python 3.6. On older versions of Python you can use the random.SystemRandom class:

import random

secure_random = random.SystemRandom()
print(secure_random.choice(foo))


Related Topics



Leave a reply



Submit