Cross Platform Numpy.Random.Seed()

Update as of numpy v1.17 (mid-2019):

The results should be the same across platforms, but not across numpy versions.

np.random.seed is described as a "convenience, legacy function"; it and the more recent, recommended alternative np.random.default_rng can no longer be relied on to produce the same results across numpy versions, unless you specifically use the legacy/compatibility API provided by np.random.RandomState. While the RandomState class is guaranteed to provide consistent results, it does not receive algorithmic (or correctness) improvements and is discouraged for use outside of unit testing and backwards compatibility.
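
For illustration, a minimal sketch contrasting the two APIs (the printed values are indicative; only the legacy stream is guaranteed stable across versions):

import numpy as np

# Legacy API: this stream is guaranteed identical on every numpy version.
legacy = np.random.RandomState(42)
print(legacy.rand(3))   # e.g. [0.37454012 0.95071431 0.73199394]

# New API: reproducible for a fixed numpy version and seed, but the stream
# may change between numpy releases as algorithms improve.
rng = np.random.default_rng(42)
print(rng.random(3))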

See NEP 19: Random number generator policy. It's actually a decent read :) The abstract reads:

For the past decade, NumPy has had a strict backwards compatibility policy for the number stream of all of its random number distributions. Unlike other numerical components in numpy, which are usually allowed to return different results when they are modified, if they remain correct, we have obligated the random number distributions to always produce the exact same numbers in every version. The objective of our stream-compatibility guarantee was to provide exact reproducibility for simulations across numpy versions in order to promote reproducible research. However, this policy has made it very difficult to enhance any of the distributions with faster or more accurate algorithms. After a decade of experience and improvements in the surrounding ecosystem of scientific software, we believe that there are now better ways to achieve these objectives. We propose relaxing our strict stream-compatibility policy to remove the obstacles that are in the way of accepting contributions to our random number generation capabilities.

This has been implemented in numpy. As of this writing (numpy version 1.22), numpy.random.default_rng() constructs a new Generator with the default BitGenerator. However, the documentation for np.random.Generator attaches the following guidance:

No Compatibility Guarantee

Generator does not provide a version compatibility guarantee. In particular, as better algorithms evolve the bit stream may change.

Therefore, using np.random.default_rng() will reproduce the same random numbers across platforms for a given numpy version, but not across versions. The best practice for ensuring reproducibility is to preserve your exact environment, e.g. in a Docker container. Short of that, storing randomly generated data and feeding the saved results into downstream workflows also helps with reproducibility, though of course it does not protect you from API changes later in your workflow the way a container would.
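
As a minimal sketch of the save-your-results approach (the file name samples.npy is just an example):

import numpy as np

# Generate the random inputs once under a known numpy version...
rng = np.random.default_rng(12345)
samples = rng.normal(size=1000)

# ...then persist them, so downstream steps load the frozen data rather than
# regenerating it under a possibly different numpy version.
np.save("samples.npy", samples)
loaded = np.load("samples.npy")
assert (samples == loaded).all()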

What is the *right* way to seed random number generation in a python multiprocessing pool?

You were close. Try this instead:

import multiprocessing
import numpy as np

def init():
    global rng
    rng = np.random.default_rng()

def my_fun(_):
    return rng.uniform()

if __name__ == "__main__":
    with multiprocessing.Pool(processes=4, initializer=init) as pool:
        my_list = pool.map(my_fun, range(40))
        print(f"Number of unique values: {len(set(my_list))}")

The recommendation is that, instead of seeding, you should create a new instance of the generator. Here we create one freshly seeded generator for each worker process in the pool.

For reproducible results, add code to init() to pickle each new generator or print its state:

print(rng.__getstate__())

The output is sufficient to reconstruct the generator state. It looks like this:

{'bit_generator': 'PCG64',
 'state': {'state': 319129345033546980483845008489532042435,
           'inc': 198751095538548372906400916761105570237},
 'has_uint32': 0,
 'uinteger': 0}
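
If you capture that dictionary, you can later restore a generator to the exact same point via its bit_generator.state property; a minimal sketch:

import numpy as np

rng = np.random.default_rng()
saved_state = rng.bit_generator.state    # the dict shown above

first = rng.uniform()                    # advance the generator...
rng.bit_generator.state = saved_state    # ...then rewind it
assert rng.uniform() == first            # replays the same draw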

Platform-independent random state in scikit-learn train_test_split

As long as random_state is equal on all platforms and they are all running the same versions of numpy, you should get the exact same splits.

Since random_state is a numpy RandomState instance, I think all of scikit-learn's pseudo-random number streams are frozen, because numpy froze RandomState.

The scikit-learn documentation for random_state shows that it is a numpy.random.RandomState, and numpy's compatibility guarantee for RandomState is spelled out in the numpy documentation.
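
For example (a quick sketch; an integer seed and an explicit RandomState(seed) are interchangeable here, since scikit-learn converts the former into the latter):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Both calls yield the same split, on any platform with the same versions.
_, X_test_a, _, _ = train_test_split(X, y, test_size=0.3, random_state=42)
_, X_test_b, _, _ = train_test_split(
    X, y, test_size=0.3, random_state=np.random.RandomState(42)
)
assert (X_test_a == X_test_b).all()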

Should I use `random.seed` or `numpy.random.seed` to control random number generation in `scikit-learn`?

Should I use np.random.seed or random.seed?

That depends on whether your code uses numpy's random number generator or the one in the standard-library random module.

The random number generators in numpy.random and random have totally separate internal states, so numpy.random.seed() will not affect the random sequences produced by random.random(), and likewise random.seed() will not affect numpy.random.randn() etc. If you are using both random and numpy.random in your code then you will need to separately set the seeds for both.
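
A quick way to convince yourself that the two states are separate:

import random
import numpy as np

random.seed(1)
r1 = random.random()

random.seed(1)
np.random.seed(99)             # reseeding numpy in between...
assert random.random() == r1   # ...does not disturb random's stream

np.random.seed(1)
n1 = np.random.randn()

np.random.seed(1)
random.seed(99)                # ...and vice versa
assert np.random.randn() == n1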

Update

Your question seems to be specifically about scikit-learn's random number generators. As far as I can tell, scikit-learn uses numpy.random throughout, so you should use np.random.seed() rather than random.seed().

One important caveat is that np.random is not threadsafe: if you set a global seed and then launch several subprocesses that generate random numbers with np.random, each subprocess will inherit the RNG state from its parent, meaning you will get identical random variates in every subprocess. The usual way around this problem is to pass a different seed (or numpy.random.RandomState instance) to each subprocess, so that each one has a separate local RNG state.

Since some parts of scikit-learn can run in parallel using joblib, you will see that some classes and functions have an option to pass either a seed or an np.random.RandomState instance (e.g. the random_state= parameter to sklearn.decomposition.MiniBatchSparsePCA). I tend to use a single global seed for a script, then generate new random seeds based on the global seed for any parallel functions.
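
As a sketch of that pattern with the modern API (SeedSequence.spawn, available since numpy 1.17, is the counterpart of handing out per-worker RandomState instances; GLOBAL_SEED and the worker count are placeholders):

import numpy as np

GLOBAL_SEED = 2023  # hypothetical single seed for the whole script

# Derive one statistically independent child seed per parallel worker.
children = np.random.SeedSequence(GLOBAL_SEED).spawn(4)
rngs = [np.random.default_rng(child) for child in children]

# Each worker gets its own generator, so the streams never collide.
print([rng.uniform() for rng in rngs])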

NumPy random seed produces different random numbers

You're confusing RandomState with seed. Your first line constructs an object which you can then use as your random source. For example, we make

>>> rnd = np.random.RandomState(3)
>>> rnd
<mtrand.RandomState object at 0xb17e18cc>

and then

>>> rnd.choice(range(20), (5,))
array([10, 3, 8, 0, 19])
>>> rnd.choice(range(20), (5,))
array([10, 11, 9, 10, 6])
>>> rnd = np.random.RandomState(3)
>>> rnd.choice(range(20), (5,))
array([10, 3, 8, 0, 19])
>>> rnd.choice(range(20), (5,))
array([10, 11, 9, 10, 6])

[I don't understand why your idx1 and idx1S agree, but you didn't actually post a self-contained transcript, so I suspect user error.]

If you want to affect the global state, use seed:

>>> np.random.seed(3)
>>> np.random.choice(range(20),(5,))
array([10, 3, 8, 0, 19])
>>> np.random.choice(range(20),(5,))
array([10, 11, 9, 10, 6])
>>> np.random.seed(3)
>>> np.random.choice(range(20),(5,))
array([10, 3, 8, 0, 19])
>>> np.random.choice(range(20),(5,))
array([10, 11, 9, 10, 6])

Using a specific RandomState object may seem less convenient at first, but it makes a lot of things easier when you want multiple separate entropy streams that you can control independently.
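
For instance, a small sketch of two independent streams side by side:

import numpy as np

stream_a = np.random.RandomState(3)
stream_b = np.random.RandomState(4)

a1 = stream_a.choice(range(20), (5,))
b1 = stream_b.choice(range(20), (5,))

# Rewinding stream A has no effect on stream B.
stream_a = np.random.RandomState(3)
assert (stream_a.choice(range(20), (5,)) == a1).all()
b2 = stream_b.choice(range(20), (5,))   # B simply continues its own sequence
print(b1, b2)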

Why does numpy.random.Generator.choice give different results even with a fixed seed?

I think you might misunderstand the usage of the seed. The following code should always output True:

import numpy
rng = numpy.random.default_rng(0)
control = rng.choice([0, 1], p=[0.5, 0.5])
for i in range(100):
    rng = numpy.random.default_rng(0)
    print(control == rng.choice([0, 1], p=[0.5, 0.5]))
# Always True

When we use the same seed, we get the same sequence of random numbers. This means that in the following:

import numpy
rng = numpy.random.default_rng(0)
out = [rng.choice([0, 1], p=[0.5, 0.5]) for _ in range(10)]

the list out will be identical on every run of the script, even though the individual values within out differ from one another.

How to generate a repeatable random number sequence?

The documentation does not explicitly say that providing a seed will always guarantee the same results, but this is guaranteed by Python's implementation of random, given the algorithm it uses.

According to the documentation, Python uses the Mersenne Twister as the core generator. Once this algorithm is seeded, it takes no external input that would change subsequent calls, so give it the same seed and you will get the same results.

Of course, you can also observe this by setting a seed, generating large lists of random numbers, and verifying that they are the same, but I understand not wanting to trust that alone.
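
For instance, a minimal check along those lines:

import random

def sample(seed, n=5):
    # Reseed the Mersenne Twister and draw a fresh sequence.
    random.seed(seed)
    return [random.random() for _ in range(n)]

# The same seed always reproduces the same sequence...
assert sample(42) == sample(42)
# ...and a different seed yields a different one.
assert sample(42) != sample(43)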

I have not checked other Python implementations besides CPython, but I highly doubt they would implement the random module using an entirely different algorithm.


