cross platform numpy.random.seed()
Update as of numpy v1.17 (mid-2019):
The results should be the same across platforms, but not across numpy versions.
np.random.seed is described as a "convenience, legacy function"; it, and the more recent/recommended alternative np.random.default_rng, can no longer be relied on to produce the same result across numpy versions, unless you specifically use the legacy/compatibility API provided by np.random.RandomState. While RandomState is guaranteed to provide consistent results, it is not updated with algorithmic (or correctness) improvements and is discouraged for use outside of unit testing and backwards compatibility.
See NEP 0019: Random number generator policy. It's actually a decent read :) The abstract reads:
For the past decade, NumPy has had a strict backwards compatibility policy for the number stream of all of its random number distributions. Unlike other numerical components in numpy, which are usually allowed to return different results when they are modified if they remain correct, we have obligated the random number distributions to always produce the exact same numbers in every version. The objective of our stream-compatibility guarantee was to provide exact reproducibility for simulations across numpy versions in order to promote reproducible research. However, this policy has made it very difficult to enhance any of the distributions with faster or more accurate algorithms. After a decade of experience and improvements in the surrounding ecosystem of scientific software, we believe that there are now better ways to achieve these objectives. We propose relaxing our strict stream-compatibility policy to remove the obstacles that are in the way of accepting contributions to our random number generation capabilities.
This has been implemented in numpy. As of this writing (numpy version 1.22), numpy.random.default_rng() constructs a new Generator with the default BitGenerator. But the description of np.random.Generator attaches the following guidance:
No Compatibility Guarantee
Generator does not provide a version compatibility guarantee. In particular, as better algorithms evolve the bit stream may change.
Therefore, using np.random.default_rng() will produce the same random numbers for the same version of numpy across platforms, but not across versions. The best practice for ensuring reproducibility is to preserve your exact environment, e.g. in a docker container. Short of this, storing the results of randomly generated data and using the saved results in downstream workflows can help with reproducibility, though of course this does not save you from API changes later in your workflow the way a docker container would.
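The within-version guarantee can be sketched directly: two generators built from the same seed yield identical streams on the same numpy version (the values shown would differ between numpy versions, per the policy above).

```python
import numpy as np

# Two generators seeded identically produce the same stream,
# on the same numpy version; the stream may change between versions.
rng_a = np.random.default_rng(12345)
rng_b = np.random.default_rng(12345)

draws_a = rng_a.uniform(size=5)
draws_b = rng_b.uniform(size=5)

print(np.array_equal(draws_a, draws_b))  # True
```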
What is the *right* way to seed random number generation in a python multiprocessing pool?
You were close. Try this instead:
import multiprocessing
import numpy as np

def init():
    global rng
    rng = np.random.default_rng()

def my_fun(_):
    return rng.uniform()

if __name__ == "__main__":
    with multiprocessing.Pool(processes=4, initializer=init) as pool:
        my_list = pool.map(my_fun, range(40))
    print(f"Number of unique values: {len(set(my_list))}")
The recommendation is that instead of seeding a shared global state, you create a new instance of the generator. Here we're creating one new, freshly seeded generator in each pool worker.
For reproducible results, add code to init() to pickle each new generator or print its state:
print(rng.__getstate__())
The output is sufficient to reconstruct the generator state. It looks like this:
{'bit_generator': 'PCG64',
'state':
{'state': 319129345033546980483845008489532042435,
'inc': 198751095538548372906400916761105570237},
'has_uint32': 0,
'uinteger': 0}
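Rebuilding a generator from such a dict can be sketched as follows. This is a minimal example assuming the default PCG64 bit generator; the bit_generator.state property is the documented way to capture and restore this state dict.

```python
import numpy as np

rng = np.random.default_rng()
saved = rng.bit_generator.state        # capture the state dict shown above

expected = rng.uniform(size=3)         # advance the original generator

# Rebuild an identical generator from the saved state dict.
# The bit generator class must match saved['bit_generator'] ('PCG64' here).
bg = np.random.PCG64()
bg.state = saved
restored = np.random.Generator(bg)
restored_draws = restored.uniform(size=3)

print(np.array_equal(restored_draws, expected))  # True
```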
Platform-independent random state in scikit-learn train_test_split
As long as random_state is equal on all platforms and they are all running the same version of numpy, you should get the exact same splits. Since random_state is backed by a numpy RandomState instance, I think all of scikit-learn's pseudo-random number generators are frozen, because numpy froze RandomState.
You can check the documentation for random_state here, which as you can see is numpy.random.RandomState. You can check numpy's compatibility guarantee here.
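The frozen RandomState stream can be observed with plain numpy (this sketch does not call scikit-learn itself, but the same mechanism is what makes a fixed random_state reproduce the same shuffle):

```python
import numpy as np

# RandomState's number stream is frozen across numpy versions, so two
# instances built from the same seed always produce the same permutation.
rs1 = np.random.RandomState(42)
rs2 = np.random.RandomState(42)

perm1 = rs1.permutation(10)
perm2 = rs2.permutation(10)

print(np.array_equal(perm1, perm2))  # True
```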
Should I use `random.seed` or `numpy.random.seed` to control random number generation in `scikit-learn`?
Should I use np.random.seed or random.seed?
That depends on whether your code uses numpy's random number generator or the one in random.
The random number generators in numpy.random and random have totally separate internal states, so numpy.random.seed() will not affect the random sequences produced by random.random(), and likewise random.seed() will not affect numpy.random.randn() etc. If you are using both random and numpy.random in your code then you will need to set the seeds for both separately.
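The independence of the two internal states is easy to demonstrate:

```python
import random
import numpy as np

random.seed(0)
np.random.seed(0)
first_stdlib = random.random()

# Re-seeding numpy's generator does NOT rewind the stdlib generator:
np.random.seed(0)
second_stdlib = random.random()
print(first_stdlib != second_stdlib)   # True: the stdlib stream kept advancing

# Re-seeding random itself DOES rewind the stdlib stream:
random.seed(0)
rewound_stdlib = random.random()
print(rewound_stdlib == first_stdlib)  # True
```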
Update
Your question seems to be specifically about scikit-learn's random number generators. As far as I can tell, scikit-learn uses numpy.random throughout, so you should use np.random.seed() rather than random.seed().
One important caveat is that np.random is not threadsafe: if you set a global seed, then launch several subprocesses and generate random numbers within them using np.random, each subprocess will inherit the RNG state from its parent, meaning that you will get identical random variates in each subprocess. The usual way around this problem is to pass a different seed (or numpy.random.RandomState instance) to each subprocess, so that each one has a separate local RNG state.
Since some parts of scikit-learn can run in parallel using joblib, you will see that some classes and functions have an option to pass either a seed or an np.random.RandomState instance (e.g. the random_state= parameter to sklearn.decomposition.MiniBatchSparsePCA). I tend to use a single global seed for a script, then generate new random seeds based on the global seed for any parallel functions.
NumPy random seed produces different random numbers
You're confusing RandomState with seed. Your first line constructs an object which you can then use as your random source. For example, we make
>>> rnd = np.random.RandomState(3)
>>> rnd
<mtrand.RandomState object at 0xb17e18cc>
and then
>>> rnd.choice(range(20), (5,))
array([10, 3, 8, 0, 19])
>>> rnd.choice(range(20), (5,))
array([10, 11, 9, 10, 6])
>>> rnd = np.random.RandomState(3)
>>> rnd.choice(range(20), (5,))
array([10, 3, 8, 0, 19])
>>> rnd.choice(range(20), (5,))
array([10, 11, 9, 10, 6])
[I don't understand why your idx1 and idx1S agree, but you didn't actually post a self-contained transcript, so I suspect user error.]
If you want to affect the global state, use seed:
>>> np.random.seed(3)
>>> np.random.choice(range(20),(5,))
array([10, 3, 8, 0, 19])
>>> np.random.choice(range(20),(5,))
array([10, 11, 9, 10, 6])
>>> np.random.seed(3)
>>> np.random.choice(range(20),(5,))
array([10, 3, 8, 0, 19])
>>> np.random.choice(range(20),(5,))
array([10, 11, 9, 10, 6])
Using a specific RandomState object may seem less convenient at first, but it makes a lot of things easier when you want different entropy streams you can tune independently.
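For instance, keeping one stream per concern means re-seeding one of them never disturbs the others, a minimal sketch with hypothetical names:

```python
import numpy as np

# One RandomState per concern: the streams are seeded and re-seeded
# independently of one another.
rng_data = np.random.RandomState(1)
rng_noise = np.random.RandomState(2)

data = rng_data.choice(range(20), (5,))
noise = rng_noise.normal(size=5)

# Re-creating the data stream reproduces it exactly, no matter how
# much the noise stream has been used in the meantime.
replay = np.random.RandomState(1).choice(range(20), (5,))
print(np.array_equal(data, replay))  # True
```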
Why does numpy.random.Generator.choice give different results even with a fixed seed?
I think you might be misunderstanding how the seed is used. The following code should always output True:
import numpy
rng = numpy.random.default_rng(0)
control = rng.choice([0, 1], p=[0.5, 0.5])
for i in range(100):
    rng = numpy.random.default_rng(0)
    print(control == rng.choice([0, 1], p=[0.5, 0.5]))
# Always True
When we use the same seed, we get the same sequence of random numbers. Which means, given:
import numpy
rng = numpy.random.default_rng(0)
out = [rng.choice([0, 1], p=[0.5, 0.5]) for _ in range(10)]
the list out will be the same every time you run it, even though the individual values within out differ from one another.
How to generate a repeatable random number sequence?
The documentation does not explicitly say that providing a seed will always guarantee the same results, but that is guaranteed by Python's implementation of random, based on the algorithm that is used.
According to the documentation, Python uses the Mersenne Twister as the core generator. Once this algorithm is seeded it takes no external input that could change subsequent calls, so give it the same seed and you will get the same results.
Of course you can also observe this by setting a seed and generating large lists of random numbers and verifying that they are the same, but I understand not wanting to trust that alone.
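Such a check takes only a few lines with the stdlib random module:

```python
import random

# Seed, draw a long stream, re-seed, and draw again:
# the Mersenne Twister state is fully determined by its seed.
random.seed(123)
first = [random.random() for _ in range(1000)]

random.seed(123)
second = [random.random() for _ in range(1000)]

print(first == second)  # True
```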
I have not checked other Python implementations besides CPython, but I highly doubt they would implement the random module using an entirely different algorithm.