random.seed(): What Does It Do?

random.seed(): What does it do?

Pseudo-random number generators work by performing some operation on a value. Generally this value is the previous number generated by the generator. However, the first time you use the generator, there is no previous value.

Seeding a pseudo-random number generator gives it its first "previous" value. Each seed value will correspond to a sequence of generated values for a given random number generator. That is, if you provide the same seed twice, you get the same sequence of numbers twice.

Generally, you want to seed your random number generator with some value that will change each execution of the program. For instance, the current time is a frequently-used seed. The reason why this doesn't happen automatically is so that if you want, you can provide a specific seed to get a known sequence of numbers.
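In Python's built-in random module, this behavior is easy to demonstrate; the seed value 42 below is arbitrary, and the same holds for any fixed seed:

```python
import random

random.seed(42)                                 # fix the generator's starting state
first = [random.randint(0, 99) for _ in range(5)]

random.seed(42)                                 # re-seed with the same value...
second = [random.randint(0, 99) for _ in range(5)]

print(first == second)                          # True: the sequence repeats exactly
```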

What does numpy.random.seed(0) do?

np.random.seed(0) makes the random numbers predictable

>>> numpy.random.seed(0) ; numpy.random.rand(4)
array([ 0.55, 0.72, 0.6 , 0.54])
>>> numpy.random.seed(0) ; numpy.random.rand(4)
array([ 0.55, 0.72, 0.6 , 0.54])

With the seed reset (every time), the same set of numbers will appear every time.

If the random seed is not reset, different numbers appear with every invocation:

>>> numpy.random.rand(4)
array([ 0.42, 0.65, 0.44, 0.89])
>>> numpy.random.rand(4)
array([ 0.96, 0.38, 0.79, 0.53])

Many (pseudo-)random number generators work by starting with a number (the seed), multiplying it by a large number, adding an offset, and then taking the modulo of that sum. The resulting number is then used as the seed to generate the next "random" number. When you set the seed (every time), it does the same thing every time, giving you the same numbers.
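The multiply-add-modulo scheme described above is the classic linear congruential generator (LCG). NumPy actually uses the Mersenne Twister internally, but a minimal LCG sketch shows the principle; the constants here are the well-known Numerical Recipes values, chosen purely for illustration:

```python
def lcg(seed, a=1664525, c=1013904223, m=2**32):
    """Linear congruential generator: each new state feeds the next step."""
    state = seed
    while True:
        state = (a * state + c) % m   # multiply, add offset, take modulo
        yield state / m               # scale into [0, 1)

gen1, gen2 = lcg(0), lcg(0)
print([next(gen1) for _ in range(3)] == [next(gen2) for _ in range(3)])  # True: same seed, same sequence
```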

If you want seemingly random numbers, do not set the seed. If you have code that uses random numbers that you want to debug, however, it can be very helpful to set the seed before each run so that the code does the same thing every time you run it.

To get the most random numbers for each run, call numpy.random.seed(). This will cause numpy to set the seed to a random number obtained from /dev/urandom or its Windows analog or, if neither of those is available, it will use the clock.

For more information on using seeds to generate pseudo-random numbers, see Wikipedia.

What is Random seed in Azure Machine Learning?

What is a Random Seed Integer?

I will not go into any detail regarding what a random seed is in general; there is plenty of material available via a simple web search (see for example this SO thread).

Random seed serves just to initialize the (pseudo)random number generator, mainly in order to make ML examples reproducible.

How to carefully choose a Random Seed from range of integer values? What is the key or strategy to choose it?

Arguably this is already answered implicitly above: you are simply not supposed to choose any particular random seed, and your results should be roughly the same across different random seeds.

Why does Random Seed significantly affect the ML Scoring, Prediction and Quality of the trained model?

Now, to the heart of your question. The answer here (i.e. with the iris dataset) is small-sample effects...

To start with, your reported results across different random seeds are not that different. Nevertheless, I agree that, at first sight, a difference in macro-average precision between 0.90 and 0.94 might seem large; but a closer look reveals that the difference is really not an issue. Why?

Using 20% of your (only) 150-sample dataset leaves you with only 30 samples in your test set (where the evaluation is performed); this split is stratified, i.e. about 10 samples from each class. Now, for datasets of that small size, it is not difficult to imagine that a difference in the correct classification of only 1-2 samples can produce this apparent difference in the reported performance metrics.

Let's try to verify this in scikit-learn using a decision tree classifier (the essence of the issue does not depend on the specific framework or the ML algorithm used):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out a stratified 20% test set (30 samples, ~10 per class)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=321, stratify=y)
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
y_pred = dt.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

Result:

[[10  0  0]
 [ 0  9  1]
 [ 0  0 10]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      0.90      0.95        10
           2       0.91      1.00      0.95        10

   micro avg       0.97      0.97      0.97        30
   macro avg       0.97      0.97      0.97        30
weighted avg       0.97      0.97      0.97        30

Let's repeat the code above, changing only the random_state argument in train_test_split; for random_state=123 we get:

[[10  0  0]
 [ 0  7  3]
 [ 0  2  8]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       0.78      0.70      0.74        10
           2       0.73      0.80      0.76        10

   micro avg       0.83      0.83      0.83        30
   macro avg       0.84      0.83      0.83        30
weighted avg       0.84      0.83      0.83        30

while for random_state=12345 we get:

[[10  0  0]
 [ 0  8  2]
 [ 0  0 10]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      0.80      0.89        10
           2       0.83      1.00      0.91        10

   micro avg       0.93      0.93      0.93        30
   macro avg       0.94      0.93      0.93        30
weighted avg       0.94      0.93      0.93        30

Looking at the absolute numbers of the 3 confusion matrices (in small samples, percentages can be misleading), you should be able to convince yourself that the differences are not that big, and they can arguably be justified by the random element inherent in the whole procedure (here, the exact split of the dataset into training and test sets).

Should your test set be significantly bigger, these discrepancies would be practically negligible...

A last note: I have used the exact same seed numbers as you, but this does not actually mean anything, as in general the random number generators across platforms & languages are not the same, hence the corresponding seeds are not actually compatible. See my own answer in Are random seeds compatible between systems? for a demonstration.

What is suggested seed value to use with random.seed()?

According to the documentation for random.seed:

If x is omitted or None, current system time is used; current system time is also used to initialize the generator when the module is first imported. If randomness sources are provided by the operating system, they are used instead of the system time (see the os.urandom() function for details on availability).

If you don't pass something to seed, it will try to use operating-system provided randomness sources instead of the time, which is always a better bet. This saves you a bit of work, and is about as good as it's going to get. Regarding availability, the docs for os.urandom tell us:

On a UNIX-like system this will query /dev/urandom, and on Windows it will use CryptGenRandom.

Cross-platform random seeds are the big win here; you can safely omit a seed and trust that it will be random enough on almost every platform you'll use Python on. Even if Python falls back to the time, there's probably only a millisecond window (or less) to guess the seed. I don't think you'll run into any trouble using the current time anyway -- even then, it's only a fallback.
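A quick sketch of what this means in practice: calling random.seed() with no argument pulls fresh entropy from the OS (when available), so successive re-seeds start unrelated sequences:

```python
import random

random.seed()                                  # seed from OS entropy (os.urandom)
run1 = [random.random() for _ in range(3)]

random.seed()                                  # fresh entropy, new starting state
run2 = [random.random() for _ in range(3)]

# With overwhelming probability the two runs differ.
print(run1 != run2)
```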

Why different random numbers are generated even after specifying seed value?

Yes, it works as expected. When you call np.random.seed(), it sets the random seed, and the sequence of numbers generated from that point forward will always be the same.

The thing is, you only set the seed once, and then you call np.random.uniform() three times. That means you're getting the next three numbers in the seeded sequence. Of course they're different from one another – you haven't reset the seed in between. But every time you run the program, you'll get the same sequence of three numbers, because you set the seed to the same value before generating them all.

Setting the seed only determines where the sequence of generated numbers starts, because of how pseudo-random number generation (which np.random uses) works: the generator deterministically produces a new number from its current state, then updates that state to produce the next number. It effectively boils down to a really, really long sequence of numbers that will, eventually, repeat itself. When you set the seed, you're jumping to a specific point in that sequence – subsequent calls then simply move forward from there.
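The behavior described above can be demonstrated directly: within one seeded run the three numbers differ from each other, but re-seeding reproduces the whole sequence:

```python
import numpy as np

np.random.seed(0)
first_run = [np.random.uniform() for _ in range(3)]   # three different numbers

np.random.seed(0)                                     # jump back to the same point
second_run = [np.random.uniform() for _ in range(3)]

print(len(set(first_run)))          # 3 -- the numbers within a run are distinct
print(first_run == second_run)      # True -- the seeded sequence is reproducible
```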

How does a random seed work in a function factory in Python?

You pass the seed to the global random object each time you call f(), because all top-level functions in the random module are bound methods of a single hidden instance. This means that by the time f3 is created, the seeds set for f2 and f1 have been superseded – the seeded states are not independent, because they all live in the same global generator. Importing random again inside each f() call does not give you a new state either: import merely binds the name again; the module object itself is only loaded on the first import.

If you want to have a seeded random generator per function, you need to create individual random.Random() instances:

import random
import time

def f(p):
    seeded_random = random.Random(time.time())
    def g():
        return 1 if seeded_random.random() < p else 0
    return g

From the random module documentation:

The functions supplied by this module are actually bound methods of a hidden instance of the random.Random class. You can instantiate your own instances of Random to get generators that don’t share state.
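The distinction between the shared module-level generator and independent instances can be verified directly; this is a sketch assuming CPython's random module, with the seed 123 chosen arbitrarily:

```python
import random

# Two independent Random instances with the same seed: identical,
# non-interfering streams.
gen_a = random.Random(123)
gen_b = random.Random(123)

a_vals = [gen_a.random() for _ in range(3)]  # advancing gen_a...
b_vals = [gen_b.random() for _ in range(3)]  # ...does not touch gen_b's state
print(a_vals == b_vals)                      # True

# The module-level functions, by contrast, all drive one hidden instance,
# so seeding it the same way reproduces the very same stream:
random.seed(123)
shared = [random.random() for _ in range(3)]
print(shared == a_vals)                      # True
```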

What does the number in parentheses in `np.random.seed(number)` mean?

NumPy (like Python's own random module) uses the iterative Mersenne Twister algorithm to generate pseudo-random numbers. The seed is simply where we start iterating.

To be clear, most computers do not have a "true" source of randomness. It is kind of an interesting thing that "randomness" is so valuable to so many applications, and is quite hard to come by (you can buy a specialized device devoted to this purpose). Since it is difficult to make random numbers, but they are nevertheless necessary, many, many, many, many algorithms have been developed to generate numbers that are not random, but nevertheless look as though they are. Algorithms that generate numbers that "look randomish" are called pseudo-random number generators (PRNGs). Since PRNGs are actually deterministic, they can't simply create a number from the aether and have it look randomish. They need an input. It turns out that using some complex operations and modular arithmetic, we can take in an input, and get another number that seems to have little or no relation to the input. Using this intuition, we can simply use the previous output of the PRNG as the next input. We then get a sequence of numbers which, if our PRNG is good, will seem to have no relation to each other.

In order to get our iterative PRNG started, we need an initial input. This initial input is called a "seed". Since the PRNG is deterministic, for a given seed, it will generate an identical sequence of numbers. Usually, there is a default seed that is, itself, sort of randomish. The most common one is the current time. However, the current time isn't a very good random number, so this behavior is known to cause problems sometimes. If you want your program to run in an identical manner each time you run it, you can provide a seed (0 is a popular option, but is entirely arbitrary). Then, you get a sequence of randomish numbers, but if you give your code to someone they can actually entirely recreate the runtime of the program as you witnessed it when you ran it.
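For completeness, newer NumPy code tends to use explicit Generator objects rather than the global seed; the seeding principle is identical, though note that default_rng uses the PCG64 algorithm rather than the Mersenne Twister:

```python
import numpy as np

rng1 = np.random.default_rng(0)   # two generators...
rng2 = np.random.default_rng(0)   # ...seeded identically

# Same seed, same deterministic stream; neither touches the global np.random state.
print(np.array_equal(rng1.random(4), rng2.random(4)))  # True
```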


