R Doesn't Reset the Seed When "L'Ecuyer-Cmrg" Rng Is Used

Resetting R random number generator (rlecuyer) for inner loops using Snow/doSNOW

There are a few things worth mentioning here.
First of all, you're saving .Random.seed and then passing it directly to set.seed,
which wouldn't give you the desired result even if you weren't having problems with the RNG.
The reason for this is that, as documented,
set.seed's seed argument only cares about a single integer,
so if you pass a vector,
it will still use only the 1st value,
and the documentation of .Random.seed states:

.Random.seed is an integer vector whose first element codes the kind of RNG and normal generator.

So all your parallel streams would end up with the same integers because the first value of .Random.seed is always the same for the same RNG kind.

Secondly, the documentation also states:

.Random.seed saves the seed set for the uniform random-number generator, at least for the system generators. It does not necessarily save the state of other generators, and in particular does not save the state of the Box–Muller normal generator. If you want to reproduce work later, call set.seed (preferably with explicit values for kind and normal.kind) rather than set .Random.seed.

Which means you should definitely know what RNG you're using if you're going to be manipulating more internal values.
If you call clusterEvalQ(cl, RNGkind()) after your call to clusterSetupRNGstream,
you'll see that it returns:

[1] "user-supplied" "Inversion"     "Rejection"

So you can't assume that .Random.seed will be enough to save the RNG's state.
In fact, I'm not even sure it plays nicely with doSNOW,
see this discrepancy:

clusterSetupRNGstream(cl, seed = rep(seed, 6))
foo = foreach(irun = 1:2, .combine = list) %dopar% {
list(
.Random.seed,
get(".Random.seed", .GlobalEnv)
)
}
> str(foo)
List of 2
$ :List of 2
..$ : int [1:626] 10403 1 -921191862 -372998484 563067311 -15494811 985677596 1278354290 1696669677 -1382461401 ...
..$ : int 10405
$ :List of 2
..$ : int [1:626] 10403 1 -921191862 -372998484 563067311 -15494811 985677596 1278354290 1696669677 -1382461401 ...
..$ : int 10405

Finally, it's clear from your example that set.seed is not really working in this scenario.
The documentation of clusterSetupRNGstream states that the rlecuyer package is used,
and I don't know enough details about that package to say whether it supports set.seed or not,
but I suppose it doesn't.

The only alternative I can give you is one that I've used before that is a bit more verbose:

RNGkind("L'Ecuyer-CMRG")

cl = makeCluster(2)
registerDoSNOW(cl)

seed = 4711L
set.seed(seed)
# x is a vector of length irun - 1
seeds = Reduce(x = 1:2, init = .Random.seed, accumulate = TRUE, f = function(x, ignored) {
parallel::nextRNGStream(x)
})

erg = foreach(pseed = seeds, .combine = rbind) %dopar% {
RNGkind("L'Ecuyer-CMRG")
assign(".Random.seed", pseed, .GlobalEnv)

# do some random stuff in outer loop
smp = runif(1)

# save current state of RNG
s = get(".Random.seed", .GlobalEnv)

# inner loop, does some more random stuff
idx = numeric(5)
for(ii in seq.int(5)) {
idx[ii] = sample.int(10, 1)
# reset RNG for next loop iteration
assign(".Random.seed", s, .GlobalEnv)
}

c(smp, idx)
}

> print(erg)
[,1] [,2] [,3] [,4] [,5] [,6]
result.1 0.9482966 2 2 2 2 2
result.2 0.1749918 3 3 3 3 3
result.3 0.3263343 1 1 1 1 1

AFAIK, the state of R's own L'Ecuyer-CMRG is indeed held in .Random.seed.

Seed and clusterApply - how to select a specific run?

The clusterSetRNGStream function doesn't support the kind of reproducibility that you want very well. The problem is that it simply initializes each of the cluster workers to draw random numbers from a different stream of random numbers which is reproducible when using clusterApply with a given number of workers. But to execute a particular task, you would have to execute it on the correct worker in order to get the correct stream, and fast forward in that stream, which isn't supported even if you know the exact number of random numbers consumed by each task.

Instead, I suggest that you use the lower level functions to assign a different substream of random numbers to each task. You can do that by generating the task seeds using the nextRNGSubStream function:

library(parallel)
# This is based on the clusterSetRNGStream function from
# the parallel package, copyrighted by The R Core Team
getseeds <- function(ntasks, iseed) {
RNGkind("L'Ecuyer-CMRG")
set.seed(iseed)
seeds <- vector("list", ntasks)
seeds[[1]] <- .Random.seed
for (i in seq_len(ntasks - 1)) {
seeds[[i + 1]] <- nextRNGSubStream(seeds[[i]])
}
seeds
}

Since we're not using clusterSetRNGStream, you need to set the random number generator to "L'Ecuyer-CMRG" when initializing the workers:

cl <- makeCluster(detectCores())
clusterEvalQ(cl, { library(MASS); RNGkind("L'Ecuyer-CMRG") })

The key is to set the value of ".Random.seed" from the worker function in order to use the correct random number substream for each task:

worker <- function(nstart, seed, centers=4) {
assign(".Random.seed", seed, envir=.GlobalEnv)
kmeans(Boston, centers, nstart = nstart)
}

Since we're iterating over both the nstart and seed values, you use clusterMap rather than clusterApply to execute the tasks:

n <- 4
nstarts <- rep(25, n)
seeds <- getseeds(n, 1234)
results <- clusterMap(cl, worker, nstarts, seeds)

To reproduce the results of the fourth task, you specify the fourth seed:

itasks <- c(4)
results <- clusterMap(cl, worker, nstarts[itasks], seeds[itasks])

Using this method, you get reproducible results even when load balancing via the clusterMap .scheduling="dynamic" argument since the results aren't dependent on the worker that executes the task as they are when using clusterSetRNGStream.


Note that you can use the clusterMap MoreArgs argument to specify a value for the centers argument of the worker function:

results <- clusterMap(cl, worker, nstarts, seeds, MoreArgs=list(centers=5))

Parallel processing in R - setting seed with mclapply() vs. pbmclapply()

The issue is that in the utils.R file within the pbmcapply package it runs the following line:

if (isTRUE(mc.set.seed))
mc.set.stream()

If we compare this to what is being called when we run the mclapply() function in the parallel package we see that it runs:

if (mc.set.seed) 
mc.reset.stream()

This affects the results as reset stream will allow the code to be run from the globally set seed, whereas running set stream sets it to the a new random starting value using the initial seed. We can see this in the functions attached below:

mc.reset.stream <- function () 
{
if (RNGkind()[1L] == "L'Ecuyer-CMRG") {
if (!exists(".Random.seed", envir = .GlobalEnv, inherits = FALSE))
sample.int(1L)
# HERE! sets the seed to the global seed variable we set
assign("LEcuyer.seed", get(".Random.seed", envir = .GlobalEnv,
inherits = FALSE), envir = RNGenv)
}
}

mc.set.stream <- function ()
{
if (RNGkind()[1L] == "L'Ecuyer-CMRG") {
assign(".Random.seed", get("LEcuyer.seed", envir = RNGenv),
envir = .GlobalEnv)
}
else {
if (exists(".Random.seed", envir = .GlobalEnv, inherits = FALSE))
rm(".Random.seed", envir = .GlobalEnv, inherits = FALSE)
}
}

I believe this change may be due to an issue with mclapply when you want to call the mclapply function more than once after setting the seed it will use the same random numbers. (i.e. by resetting the r session you should get the same results in the same order with pbmclapply so first time I get 0.143 then 0.064 and then -0.015). This is usually the preferred behaviour so you can call the funciton multiple times. See R doesn't reset the seed when "L'Ecuyer-CMRG" RNG is used? for more information.

The differences between these two implementations can be tested with the following code if you change the line in the .customized_mcparallel funciton definition from mc.set.stream() to mc.reset.stream(). Here I have simplified the function calls in the package to strip out the progress bar and leave in only the calculation (removing error checks also) and the change in setting the random seed. (Additionally note these functions will no longer run on a Windows machine only Linux or MacOS).

library(pbmcapply)
RNGkind("L'Ecuyer-CMRG")
set.seed(1)
pbmclapply <- function() {

pkg <- asNamespace('pbmcapply')
.cleanup <- get('.cleanup', pkg)

progressMonitor <- .customized_mcparallel({

mclapply(1:100, function(i) {
rnorm(1)
}, mc.cores = 2, mc.preschedule = TRUE, mc.set.seed = TRUE,
mc.cleanup = TRUE, mc.allow.recursive = TRUE)
})

# clean up processes on exit
on.exit(.cleanup(progressMonitor$pid), add = T)

# Retrieve the result
results <- suppressWarnings(mccollect(progressMonitor$pid)[[as.character(progressMonitor$pid)]])

return(results)
}

.customized_mcparallel <- function (expr, name, detached = FALSE){
# loading hidden functions
pkg <- asNamespace('parallel')
mcfork <- get('mcfork', pkg)
mc.advance.stream <- get('mc.advance.stream', pkg)
mcexit <- get('mcexit', pkg)
mcinteractive <- get('mcinteractive', pkg)
sendMaster <- get('sendMaster', pkg)
mc.set.stream <- get('mc.set.stream', pkg)
mc.reset.stream <- get('mc.reset.stream', pkg)

f <- mcfork(F)
env <- parent.frame()
mc.advance.stream()
if (inherits(f, "masterProcess")) {

mc.set.stream()
# reset the group process id of the forked process
mcinteractive(FALSE)

sendMaster(try(eval(expr, env), silent = TRUE))
mcexit(0L)
}

f
}

x <- pbmclapply()
y <- do.call(rbind, x)
z <- mean(y)
print(z)

For a complete remedy my best suggestion would be to either reimplement the functions in your own code (I copy pasted with some minor modifications to the functions from pbmcapply) or by forking the package and replacing the mc.set.seed in the utils.R file with mc.reset.seed. I can't think of a simpler solution at the moment, but hopefully this clarifies the issue.

First two values in .Random.seed are always the same with different set.seed()s

Expanding helpful comments from @r2evans and @Dave2e into an answer.

1) .Random.seed[1]

From ?.Random.seed, it says:

".Random.seed is an integer vector whose first element codes the
kind of RNG and normal generator. The lowest two decimal digits are in
0:(k-1) where k is the number of available RNGs. The hundreds
represent the type of normal generator (starting at 0), and the ten
thousands represent the type of discrete uniform sampler."

Therefore the first value doesn't change unless one changes the generator method (RNGkind).

Here is a small demonstration of this for each of the available RNGkinds:

library(tidyverse)

# available RNGkind options
kinds <- c(
"Wichmann-Hill",
"Marsaglia-Multicarry",
"Super-Duper",
"Mersenne-Twister",
"Knuth-TAOCP-2002",
"Knuth-TAOCP",
"L'Ecuyer-CMRG"
)
# test over multiple seeds
seeds <- c(1:3)

f <- function(kind, seed) {
# set seed with simulation parameters
set.seed(seed = seed, kind = kind)
# check value of first element in .Random.seed
return(.Random.seed[1])
}

# run on simulated conditions and compare value over different seeds
expand_grid(kind = kinds, seed = seeds) %>%
pmap(f) %>%
unlist() %>%
matrix(
ncol = length(seeds),
byrow = T,
dimnames = list(kinds, paste0("seed_", seeds))
)

#> seed_1 seed_2 seed_3
#> Wichmann-Hill 10400 10400 10400
#> Marsaglia-Multicarry 10401 10401 10401
#> Super-Duper 10402 10402 10402
#> Mersenne-Twister 10403 10403 10403
#> Knuth-TAOCP-2002 10406 10406 10406
#> Knuth-TAOCP 10404 10404 10404
#> L'Ecuyer-CMRG 10407 10407 10407

Created on 2022-01-06 by the reprex package (v2.0.1)

2) .Random.seed[2]

At least for the default "Mersenne-Twister" method, .Random.seed[2] is an index that indicates the current position in the random set. From the docs:

The ‘seed’ is a 624-dimensional set of 32-bit integers plus a current
position in that set.

This is updated when random processes using the seed are executed. However for other methods it the documentation doesn't mention something like this and there doesn't appear to be a clear trend in the same way.

See below for an example of changes in .Random.seed[2] over iterative random process after set.seed().

library(tidyverse)

# available RNGkind options
kinds <- c(
"Wichmann-Hill",
"Marsaglia-Multicarry",
"Super-Duper",
"Mersenne-Twister",
"Knuth-TAOCP-2002",
"Knuth-TAOCP",
"L'Ecuyer-CMRG"
)

# create function to run random process and report .Random.seed[2]
t <- function(n = 1) {
p <- .Random.seed[2]
runif(n)
p
}

# create function to set seed and iterate a random process
f2 <- function(kind, seed = 1, n = 5) {

set.seed(seed = seed,
kind = kind)

replicate(n, t())
}

# set simulation parameters
trials <- 5
seeds <- 1:2
x <- expand_grid(kind = kinds, seed = seeds, n = trials)

# evaluate and report
x %>%
pmap_dfc(f2) %>%
mutate(n = paste0("trial_", 1:trials)) %>%
pivot_longer(-n, names_to = "row") %>%
pivot_wider(names_from = "n") %>%
select(-row) %>%
bind_cols(x[,1:2], .)

#> # A tibble: 14 x 7
#> kind seed trial_1 trial_2 trial_3 trial_4 trial_5
#> <chr> <int> <int> <int> <int> <int> <int>
#> 1 Wichmann-Hill 1 23415 8457 23504 2.37e4 2.28e4
#> 2 Wichmann-Hill 2 21758 27800 1567 2.58e4 2.37e4
#> 3 Marsaglia-Multicarry 1 1280795612 945095059 14912928 1.34e9 2.23e8
#> 4 Marsaglia-Multicarry 2 -897583247 -1953114152 2042794797 1.39e9 3.71e8
#> 5 Super-Duper 1 1280795612 -1162609806 -1499951595 5.51e8 6.35e8
#> 6 Super-Duper 2 -897583247 224551822 -624310 -2.23e8 8.91e8
#> 7 Mersenne-Twister 1 624 1 2 3 4
#> 8 Mersenne-Twister 2 624 1 2 3 4
#> 9 Knuth-TAOCP-2002 1 166645457 504833754 504833754 5.05e8 5.05e8
#> 10 Knuth-TAOCP-2002 2 967462395 252695483 252695483 2.53e8 2.53e8
#> 11 Knuth-TAOCP 1 1050415712 999978161 999978161 1.00e9 1.00e9
#> 12 Knuth-TAOCP 2 204052929 776729829 776729829 7.77e8 7.77e8
#> 13 L'Ecuyer-CMRG 1 1280795612 -169270483 -442010614 4.71e8 1.80e9
#> 14 L'Ecuyer-CMRG 2 -897583247 -1619336578 -714750745 2.10e9 -9.89e8

Created on 2022-01-06 by the reprex package (v2.0.1)

Here you can see that from the Mersenne-Twister method, .Random.seed[2] increments from it's maximum of 624 back to 1 and increased by the size of the random draw and that this is the same for set.seed(1) and set.seed(2). However the same trend is not seen in the other methods. To illustrate the last point, see that runif(1) increments .Random.seed[2] by 1 while runif(2) increments it by 2:

# create function to run random process and report .Random.seed[2]
t <- function(n = 1) {
p <- .Random.seed[2]
runif(n)
p
}

set.seed(1, kind = "Mersenne-Twister")
replicate(9, t(1))
#> [1] 624 1 2 3 4 5 6 7 8

set.seed(1, kind = "Mersenne-Twister")
replicate(5, t(2))
#> [1] 624 2 4 6 8

Created on 2022-01-06 by the reprex package (v2.0.1)

3) Sequential Randoms

Because the index or state of .Random.seed (apparently for all the RNG methods) advances according to the size of the 'random draw' (number of random values genearted from the .Random.seed), it is possible to generate the same series of random numbers from the same seed in different sized increments. Furthermore, as long as you run the same random process at the same point in the sequence after setting the same seed, it seems that you will get the same result. Observe the following example:

# draw 3 at once
set.seed(1, kind = "Mersenne-Twister")
sample(100, 3, T)
#> [1] 68 39 1

# repeat single draw 3 times
set.seed(1, kind = "Mersenne-Twister")
sample(100, 1)
#> [1] 68
sample(100, 1)
#> [1] 39
sample(100, 1)
#> [1] 1

# draw 1, do something else, draw 1 again
set.seed(1, kind = "Mersenne-Twister")
sample(100, 1)
#> [1] 68
runif(1)
#> [1] 0.5728534
sample(100, 1)
#> [1] 1

Created on 2022-01-06 by the reprex package (v2.0.1)

4) Correlated Randoms

As we saw above, two random processes run at the same point after setting the same seed are expected to give the same result. However, even when you provide constraints on how similar the result can be (e.g. by changing the mean of rnorm() or even by providing different functions) it seems that the results are still perfectly correlated within their respective ranges.

# same function with different constraints
set.seed(1, kind = "Mersenne-Twister")
a <- runif(50, 0, 1)

set.seed(1, kind = "Mersenne-Twister")
b <- runif(50, 10, 100)

plot(a, b)

Sample Image

# different functions
set.seed(1, kind = "Mersenne-Twister")
d <- rnorm(50)

set.seed(1, kind = "Mersenne-Twister")
e <- rlnorm(50)

plot(d, e)

Sample Image

Created on 2022-01-06 by the reprex package (v2.0.1)

How to use the same seed to produce the same output with or without parallel in R?

you are setting a seed for clusterSetRNGStream, which will help you generate and reproduce the same set of random streams for your parallel runs of that function. It will not do what you intended.

you can probably set a seed inside your function to reproduce the output from both the implementations. Something like:

# w/o parallel
set.seed(1)
xxx1 <- sample(1:500, 10)
print(xxx1)
# [1] 133 186 286 452 101 445 467 326 310 3

# w parallel
library("parallel")
cl <- makeCluster(1)
xxx2 <- parLapply(cl, 1, function(x) { set.seed(1); return(sample(1:500, 10)) })[[1]]
stopCluster(cl); rm(cl)
print(xxx2)
# [1] 133 186 286 452 101 445 467 326 310 3

Is set.seed consistent over different versions of R (and Ubuntu)?

Cross-OS consistency: yes

If you installed R on two different operating systems without manually changing defaults or the RProfile, you should get the same results when using set.seed().

Consistency over versions of R: not necessarily

It used to be the case that set.seed() would give the same results across R versions, but that's no longer generally true thanks to a little-announced update in R 3.6.0. So you can get cross version consistency comparing results before R 3.6.0, but if you compare a post-3.6.0 use of set.seed() to a pre-3.6.0 use of set.seed(), you will get different results.

You can see that in the examples below:

R 3.2.0

> set.seed(1999)
> sample(LETTERS, 3)
[1] "T" "N" "L"

R 3.5.3

> set.seed(1999)
> sample(LETTERS, 3)
[1] "T" "N" "L"

R 3.6.0

set.seed(1999)
sample(LETTERS, 3)
[1] "D" "Z" "R"

The reason for the inconsistency is that in R 3.6.0, the default kind of under-the-hood random-number generator was changed. Now, in order to get the results from set.seed() to match, you have to first call the function RNGkind(sample.kind = "Rounding").

R 3.6.0

> RNGkind(sample.kind = "Rounding")
Warning message:
In RNGkind(sample.kind = "Rounding") : non-uniform 'Rounding' sampler used
> set.seed(1999)
> sample(Letters, 3)
[1] "T" "N" "L"

controlling seeds with mclapply

The parallel package comes with special support for the "L'Ecuyer-CMRG" random number generator which was introduced at the same time as parallel. You can read the documentation for that support using:

library(parallel)
?mc.reset.stream

To use it, you first need to enable "L'Ecuyer-CMRG":

RNGkind("L'Ecuyer-CMRG")

After doing that, code such as:

set.seed(1)
mclapply(mylist, foo, mc.cores=3)
mclapply(mylist, foo, mc.cores=3)

will be reproducible, but the two calls to mclapply will return identical results. This is because the state of the random number generator in the master process isn't changed by calling mclapply.

I've used the following function to skip over the random number streams used by the mclapply workers:

skip.streams <- function(n) {
x <- .Random.seed
for (i in seq_len(n))
x <- nextRNGStream(x)
assign('.Random.seed', x, pos=.GlobalEnv)
}

You can use this function to get the behavior that I think you want:

set.seed(1)
mclapply(mylist, foo, mc.cores=3)
skip.streams(3)
mclapply(mylist, foo, mc.cores=3)
skip.streams(3)


Related Topics



Leave a reply



Submit