How to Set Seed for Random Simulations with Foreach and Domc Packages

How to set seed for random simulations with foreach and doMC packages?

My default answer used to be "well then don't do that" (using foreach) as the snow package does this (reliably!) for you.

But as @Spacedman points out, Renaud's new doRNG is what you are looking for if you want to remain with the doFoo / foreach family.

The real key though is a clusterApply-style call to get the seeds set on all nodes. And in a fashion that coordinated across streams. Oh, and did I mention that snow by Tierney, Rossini, Li and Sevcikova has been doing this for you for almost a decade?

Edit: And while you didn't ask about snow, for completeness here is an example from the command-line:

edd@max:~$ r -lsnow -e'cl <- makeSOCKcluster(c("localhost","localhost"));\
clusterSetupRNG(cl);\
print(do.call("rbind", clusterApply(cl, 1:4, \
function(x) { stats::rnorm(1) } )))'
Loading required package: utils
Loading required package: utils
Loading required package: rlecuyer
[,1]
[1,] -1.1406340
[2,] 0.7049582
[3,] -0.4981589
[4,] 0.4821092
edd@max:~$ r -lsnow -e'cl <- makeSOCKcluster(c("localhost","localhost"));\
clusterSetupRNG(cl);\
print(do.call("rbind", clusterApply(cl, 1:4, \
function(x) { stats::rnorm(1) } )))'
Loading required package: utils
Loading required package: utils
Loading required package: rlecuyer
[,1]
[1,] -1.1406340
[2,] 0.7049582
[3,] -0.4981589
[4,] 0.4821092
edd@max:~$

Edit: And for completeness, here is your example combined with what is in the docs for doRNG

> library(foreach)
R> library(doMC)
Loading required package: multicore

Attaching package: ‘multicore’

The following object(s) are masked from ‘package:parallel’:

mclapply, mcparallel, pvec

R> registerDoMC(2)
R> library(doRNG)
R> set.seed(123)
R> a <- foreach(i=1:2,.combine=cbind) %dopar% {rnorm(5)}
R> set.seed(123)
R> b <- foreach(i=1:2,.combine=cbind) %dopar% {rnorm(5)}
R> identical(a,b)
[1] FALSE ## ie standard approach not reproducible
R>
R> seed <- doRNGseed()
R> a <- foreach(i=1:2,combine=cbind) %dorng% { rnorm(5) }
R> b <- foreach(i=1:2,combine=cbind) %dorng% { rnorm(5) }
R> doRNGseed(seed)
R> a1 <- foreach(i=1:2,combine=cbind) %dorng% { rnorm(5) }
R> b1 <- foreach(i=1:2,combine=cbind) %dorng% { rnorm(5) }
R> identical(a,a1) && identical(b,b1)
[1] TRUE ## all is well now with doRNGseed()
R>

Use set.seed() with foreach() in R

This works:

library (foreach)

fn<-function(i)
{
set.seed(i)
y <- rnorm(1)
return(y)
}

x<-foreach(i=1:10) %do% fn(i)
print(x)

Fixing the seed for parallel simulation runs with different number of cores

1st Q: The following seemed to work for me

library(parallel)
library(doParallel)
cl <- makeCluster(5)
registerDoParallel(cl)
seedlist <- c(100, 200, 300, 400, 500)
clusterExport(cl, 'seedlist')
foreach(I=1:5) %dopar% {set.seed(seedlist[I]); runif(1)}

[[1]]
[1] 0.3077661

[[2]]
[1] 0.5337724

[[3]]
[1] 0.9152467

[[4]]
[1] 0.1499731

[[5]]
[1] 0.8336

set.seed(100)
runif(1)
[1] 0.3077661

2nd Q: Seems like a bug but maybe someone else has a better clue

First two values in .Random.seed are always the same with different set.seed()s

Expanding helpful comments from @r2evans and @Dave2e into an answer.

1) .Random.seed[1]

From ?.Random.seed, it says:

".Random.seed is an integer vector whose first element codes the
kind of RNG and normal generator. The lowest two decimal digits are in
0:(k-1) where k is the number of available RNGs. The hundreds
represent the type of normal generator (starting at 0), and the ten
thousands represent the type of discrete uniform sampler."

Therefore the first value doesn't change unless one changes the generator method (RNGkind).

Here is a small demonstration of this for each of the available RNGkinds:

library(tidyverse)

# available RNGkind options
kinds <- c(
"Wichmann-Hill",
"Marsaglia-Multicarry",
"Super-Duper",
"Mersenne-Twister",
"Knuth-TAOCP-2002",
"Knuth-TAOCP",
"L'Ecuyer-CMRG"
)
# test over multiple seeds
seeds <- c(1:3)

f <- function(kind, seed) {
# set seed with simulation parameters
set.seed(seed = seed, kind = kind)
# check value of first element in .Random.seed
return(.Random.seed[1])
}

# run on simulated conditions and compare value over different seeds
expand_grid(kind = kinds, seed = seeds) %>%
pmap(f) %>%
unlist() %>%
matrix(
ncol = length(seeds),
byrow = T,
dimnames = list(kinds, paste0("seed_", seeds))
)

#> seed_1 seed_2 seed_3
#> Wichmann-Hill 10400 10400 10400
#> Marsaglia-Multicarry 10401 10401 10401
#> Super-Duper 10402 10402 10402
#> Mersenne-Twister 10403 10403 10403
#> Knuth-TAOCP-2002 10406 10406 10406
#> Knuth-TAOCP 10404 10404 10404
#> L'Ecuyer-CMRG 10407 10407 10407

Created on 2022-01-06 by the reprex package (v2.0.1)

2) .Random.seed[2]

At least for the default "Mersenne-Twister" method, .Random.seed[2] is an index that indicates the current position in the random set. From the docs:

The ‘seed’ is a 624-dimensional set of 32-bit integers plus a current
position in that set.

This is updated when random processes using the seed are executed. However for other methods it the documentation doesn't mention something like this and there doesn't appear to be a clear trend in the same way.

See below for an example of changes in .Random.seed[2] over iterative random process after set.seed().

library(tidyverse)

# available RNGkind options
kinds <- c(
"Wichmann-Hill",
"Marsaglia-Multicarry",
"Super-Duper",
"Mersenne-Twister",
"Knuth-TAOCP-2002",
"Knuth-TAOCP",
"L'Ecuyer-CMRG"
)

# create function to run random process and report .Random.seed[2]
t <- function(n = 1) {
p <- .Random.seed[2]
runif(n)
p
}

# create function to set seed and iterate a random process
f2 <- function(kind, seed = 1, n = 5) {

set.seed(seed = seed,
kind = kind)

replicate(n, t())
}

# set simulation parameters
trials <- 5
seeds <- 1:2
x <- expand_grid(kind = kinds, seed = seeds, n = trials)

# evaluate and report
x %>%
pmap_dfc(f2) %>%
mutate(n = paste0("trial_", 1:trials)) %>%
pivot_longer(-n, names_to = "row") %>%
pivot_wider(names_from = "n") %>%
select(-row) %>%
bind_cols(x[,1:2], .)

#> # A tibble: 14 x 7
#> kind seed trial_1 trial_2 trial_3 trial_4 trial_5
#> <chr> <int> <int> <int> <int> <int> <int>
#> 1 Wichmann-Hill 1 23415 8457 23504 2.37e4 2.28e4
#> 2 Wichmann-Hill 2 21758 27800 1567 2.58e4 2.37e4
#> 3 Marsaglia-Multicarry 1 1280795612 945095059 14912928 1.34e9 2.23e8
#> 4 Marsaglia-Multicarry 2 -897583247 -1953114152 2042794797 1.39e9 3.71e8
#> 5 Super-Duper 1 1280795612 -1162609806 -1499951595 5.51e8 6.35e8
#> 6 Super-Duper 2 -897583247 224551822 -624310 -2.23e8 8.91e8
#> 7 Mersenne-Twister 1 624 1 2 3 4
#> 8 Mersenne-Twister 2 624 1 2 3 4
#> 9 Knuth-TAOCP-2002 1 166645457 504833754 504833754 5.05e8 5.05e8
#> 10 Knuth-TAOCP-2002 2 967462395 252695483 252695483 2.53e8 2.53e8
#> 11 Knuth-TAOCP 1 1050415712 999978161 999978161 1.00e9 1.00e9
#> 12 Knuth-TAOCP 2 204052929 776729829 776729829 7.77e8 7.77e8
#> 13 L'Ecuyer-CMRG 1 1280795612 -169270483 -442010614 4.71e8 1.80e9
#> 14 L'Ecuyer-CMRG 2 -897583247 -1619336578 -714750745 2.10e9 -9.89e8

Created on 2022-01-06 by the reprex package (v2.0.1)

Here you can see that from the Mersenne-Twister method, .Random.seed[2] increments from it's maximum of 624 back to 1 and increased by the size of the random draw and that this is the same for set.seed(1) and set.seed(2). However the same trend is not seen in the other methods. To illustrate the last point, see that runif(1) increments .Random.seed[2] by 1 while runif(2) increments it by 2:

# create function to run random process and report .Random.seed[2]
t <- function(n = 1) {
p <- .Random.seed[2]
runif(n)
p
}

set.seed(1, kind = "Mersenne-Twister")
replicate(9, t(1))
#> [1] 624 1 2 3 4 5 6 7 8

set.seed(1, kind = "Mersenne-Twister")
replicate(5, t(2))
#> [1] 624 2 4 6 8

Created on 2022-01-06 by the reprex package (v2.0.1)

3) Sequential Randoms

Because the index or state of .Random.seed (apparently for all the RNG methods) advances according to the size of the 'random draw' (number of random values genearted from the .Random.seed), it is possible to generate the same series of random numbers from the same seed in different sized increments. Furthermore, as long as you run the same random process at the same point in the sequence after setting the same seed, it seems that you will get the same result. Observe the following example:

# draw 3 at once
set.seed(1, kind = "Mersenne-Twister")
sample(100, 3, T)
#> [1] 68 39 1

# repeat single draw 3 times
set.seed(1, kind = "Mersenne-Twister")
sample(100, 1)
#> [1] 68
sample(100, 1)
#> [1] 39
sample(100, 1)
#> [1] 1

# draw 1, do something else, draw 1 again
set.seed(1, kind = "Mersenne-Twister")
sample(100, 1)
#> [1] 68
runif(1)
#> [1] 0.5728534
sample(100, 1)
#> [1] 1

Created on 2022-01-06 by the reprex package (v2.0.1)

4) Correlated Randoms

As we saw above, two random processes run at the same point after setting the same seed are expected to give the same result. However, even when you provide constraints on how similar the result can be (e.g. by changing the mean of rnorm() or even by providing different functions) it seems that the results are still perfectly correlated within their respective ranges.

# same function with different constraints
set.seed(1, kind = "Mersenne-Twister")
a <- runif(50, 0, 1)

set.seed(1, kind = "Mersenne-Twister")
b <- runif(50, 10, 100)

plot(a, b)

# different functions
set.seed(1, kind = "Mersenne-Twister")
d <- rnorm(50)

set.seed(1, kind = "Mersenne-Twister")
e <- rlnorm(50)

plot(d, e)

Created on 2022-01-06 by the reprex package (v2.0.1)

Fixing set.seed for an entire session

There are several options, depending on your exact needs. I suspect the first option, the simplest is not sufficient, but my second and third options may be more appropriate, with the third option the most automatable.

Option 1

If you know in advance that the function using/creating random numbers will always draw the same number, and you don't reorder the function calls or insert a new call in between existing ones, then all you need do is set the seed once. Indeed, you probably don't want to keep resetting the seed as you'll just keep on getting the same set of random numbers for each function call.

For example:

> set.seed(1)
> sample(10)
[1] 3 4 5 7 2 8 9 6 10 1
> sample(10)
[1] 3 2 6 10 5 7 8 4 1 9
>
> ## second time round
> set.seed(1)
> sample(10)
[1] 3 4 5 7 2 8 9 6 10 1
> sample(10)
[1] 3 2 6 10 5 7 8 4 1 9

Option 2

If you really want to make sure that a function uses the same seed and you only want to set it once, pass the seed as an argument:

foo <- function(...., seed) {
## set the seed
if (!missing(seed))
set.seed(seed)
## do other stuff
....
}

my.seed <- 42
bar <- foo(...., seed = my.seed)
fbar <- foo(...., seed = my.seed)

(where .... means other args to your function; this is pseudo code).

Option 3

If you want to automate this even more, then you could abuse the options mechanism, which is fine if you are just doing this in a script (for a package you should use your own options object). Then your function can look for this option. E.g.

foo <- function() {
if (!is.null(seed <- getOption("myseed")))
set.seed(seed)
sample(10)
}

Then in use we have:

> getOption("myseed")
NULL
> foo()
[1] 1 2 9 4 8 7 10 6 3 5
> foo()
[1] 6 2 3 5 7 8 1 4 10 9
> options(myseed = 42)
> foo()
[1] 10 9 3 6 4 8 5 1 2 7
> foo()
[1] 10 9 3 6 4 8 5 1 2 7
> foo()
[1] 10 9 3 6 4 8 5 1 2 7
> foo()
[1] 10 9 3 6 4 8 5 1 2 7


Related Topics



Leave a reply



Submit