Questions About Set.Seed() in R

Questions about set.seed() in R

Just for fun:

set.seed.alpha <- function(x) {
require("digest")
hexval <- paste0("0x",digest(x,"crc32"))
intval <- type.convert(hexval) %% .Machine$integer.max
set.seed(intval)
}

So you can do:

set.seed.alpha("hello world")

(in fact x can be any R object, not just an alphanumeric string)

Set.seed issue with sample when changing order of values

Well, first consider the simplier case where you leave off the prob=

set.seed(59) 
sample(c(12,7,5),7,replace = T)
# [1] 5 12 12 5 5 12 5
set.seed(59)
sample(c(5,7,12),7,replace = T)
# [1] 12 5 5 12 12 5 12

Because you have different input, you get a different result. But also note that the sample function is really sampling from the vector indexes, not the actual values of the vector. See how in the second result, you've basically just swapped the 5s and the 12s. The only thing that matters is the length of the input vector. If you try it with

set.seed(59) 
sample(1:3,7,replace = T)
# [1] 3 1 1 3 3 1 3

See how you still get he same "accaaca" pattern (the middle value is never picked). That's what setting the seed will do for you. You really only get the exact same result of all other parameters are identical.

If you change the order of the values in the vector, and swap the probabilities, you won't get the same observations from a pseudorandom number generator like the one R uses. It's simply not "smart" enough to see those are the same statistical distribution. However, if you draw a bunch of sample over and over again, in the long run they will have similar means thanks to the law of large numbers.

set.seed in for loop

First, your code does not do what you think it is doing. The problem is

for (i in 2:length(Dataset[-c(8,11)]))

You are not removing the columns from the loop, only from the length of the data frame. If the data frame has 20 columns, you will run the loop from column 2 to column 18 because you have just reduced the number of columns. Instead you should use i in 2:length(Dataset)[-c(8, 11)]. Since impute will jump over these columns if the are character data, you don't need to exclude them from the loop.

Second, we can test your question about the reproducibility of the results when the seed appears outside the loop. Here is a small example using the iris data set that comes with R:

data(iris)
set.seed(42)
Data <- iris[1:25, -5]
idx <- matrix(replicate(4, sample.int(25, 5)), 20)
idx <- cbind(idx, rep(1:4, each=5))
Data[idx] <- NA
head(Data)
# Sepal.Length Sepal.Width Petal.Length Petal.Width
# 1 NA 3.5 1.4 0.2
# 2 4.9 3.0 1.4 0.2
# 3 4.7 3.2 1.3 NA
# 4 NA NA 1.5 NA
# 5 NA 3.6 NA NA
# 6 5.4 3.9 1.7 0.4

Now we impute the missing values three times:

library(Hmisc)
set.seed(1)
for (i in 1:4) {
Data[, i] <- impute(Data[, i], "random")
}
Data[idx]
# [1] 5.8 4.9 4.6 4.7 5.4 3.4 3.5 3.3 3.6 3.8 1.3 1.5 1.5 1.5 1.5 0.2 0.2 0.2 0.1 0.3
set.seed(1)
for (i in 1:4) {
Data[, i] <- impute(Data[, i], "random")
}
Data[idx]
# [1] 5.8 4.9 4.6 4.7 5.4 3.4 3.5 3.3 3.6 3.8 1.3 1.5 1.5 1.5 1.5 0.2 0.2 0.2 0.1 0.3
set.seed(1)
for (i in 1:4) {
Data[, i] <- impute(Data[, i], "random")
}
Data[idx]
# [1] 5.8 4.9 4.6 4.7 5.4 3.4 3.5 3.3 3.6 3.8 1.3 1.5 1.5 1.5 1.5 0.2 0.2 0.2 0.1 0.3

The imputed values are the same each time.

First two values in .Random.seed are always the same with different set.seed()s

Expanding helpful comments from @r2evans and @Dave2e into an answer.

1) .Random.seed[1]

From ?.Random.seed, it says:

".Random.seed is an integer vector whose first element codes the
kind of RNG and normal generator. The lowest two decimal digits are in
0:(k-1) where k is the number of available RNGs. The hundreds
represent the type of normal generator (starting at 0), and the ten
thousands represent the type of discrete uniform sampler."

Therefore the first value doesn't change unless one changes the generator method (RNGkind).

Here is a small demonstration of this for each of the available RNGkinds:

library(tidyverse)

# available RNGkind options
kinds <- c(
"Wichmann-Hill",
"Marsaglia-Multicarry",
"Super-Duper",
"Mersenne-Twister",
"Knuth-TAOCP-2002",
"Knuth-TAOCP",
"L'Ecuyer-CMRG"
)
# test over multiple seeds
seeds <- c(1:3)

f <- function(kind, seed) {
# set seed with simulation parameters
set.seed(seed = seed, kind = kind)
# check value of first element in .Random.seed
return(.Random.seed[1])
}

# run on simulated conditions and compare value over different seeds
expand_grid(kind = kinds, seed = seeds) %>%
pmap(f) %>%
unlist() %>%
matrix(
ncol = length(seeds),
byrow = T,
dimnames = list(kinds, paste0("seed_", seeds))
)

#> seed_1 seed_2 seed_3
#> Wichmann-Hill 10400 10400 10400
#> Marsaglia-Multicarry 10401 10401 10401
#> Super-Duper 10402 10402 10402
#> Mersenne-Twister 10403 10403 10403
#> Knuth-TAOCP-2002 10406 10406 10406
#> Knuth-TAOCP 10404 10404 10404
#> L'Ecuyer-CMRG 10407 10407 10407

Created on 2022-01-06 by the reprex package (v2.0.1)

2) .Random.seed[2]

At least for the default "Mersenne-Twister" method, .Random.seed[2] is an index that indicates the current position in the random set. From the docs:

The ‘seed’ is a 624-dimensional set of 32-bit integers plus a current
position in that set.

This is updated when random processes using the seed are executed. However for other methods it the documentation doesn't mention something like this and there doesn't appear to be a clear trend in the same way.

See below for an example of changes in .Random.seed[2] over iterative random process after set.seed().

library(tidyverse)

# available RNGkind options
kinds <- c(
"Wichmann-Hill",
"Marsaglia-Multicarry",
"Super-Duper",
"Mersenne-Twister",
"Knuth-TAOCP-2002",
"Knuth-TAOCP",
"L'Ecuyer-CMRG"
)

# create function to run random process and report .Random.seed[2]
t <- function(n = 1) {
p <- .Random.seed[2]
runif(n)
p
}

# create function to set seed and iterate a random process
f2 <- function(kind, seed = 1, n = 5) {

set.seed(seed = seed,
kind = kind)

replicate(n, t())
}

# set simulation parameters
trials <- 5
seeds <- 1:2
x <- expand_grid(kind = kinds, seed = seeds, n = trials)

# evaluate and report
x %>%
pmap_dfc(f2) %>%
mutate(n = paste0("trial_", 1:trials)) %>%
pivot_longer(-n, names_to = "row") %>%
pivot_wider(names_from = "n") %>%
select(-row) %>%
bind_cols(x[,1:2], .)

#> # A tibble: 14 x 7
#> kind seed trial_1 trial_2 trial_3 trial_4 trial_5
#> <chr> <int> <int> <int> <int> <int> <int>
#> 1 Wichmann-Hill 1 23415 8457 23504 2.37e4 2.28e4
#> 2 Wichmann-Hill 2 21758 27800 1567 2.58e4 2.37e4
#> 3 Marsaglia-Multicarry 1 1280795612 945095059 14912928 1.34e9 2.23e8
#> 4 Marsaglia-Multicarry 2 -897583247 -1953114152 2042794797 1.39e9 3.71e8
#> 5 Super-Duper 1 1280795612 -1162609806 -1499951595 5.51e8 6.35e8
#> 6 Super-Duper 2 -897583247 224551822 -624310 -2.23e8 8.91e8
#> 7 Mersenne-Twister 1 624 1 2 3 4
#> 8 Mersenne-Twister 2 624 1 2 3 4
#> 9 Knuth-TAOCP-2002 1 166645457 504833754 504833754 5.05e8 5.05e8
#> 10 Knuth-TAOCP-2002 2 967462395 252695483 252695483 2.53e8 2.53e8
#> 11 Knuth-TAOCP 1 1050415712 999978161 999978161 1.00e9 1.00e9
#> 12 Knuth-TAOCP 2 204052929 776729829 776729829 7.77e8 7.77e8
#> 13 L'Ecuyer-CMRG 1 1280795612 -169270483 -442010614 4.71e8 1.80e9
#> 14 L'Ecuyer-CMRG 2 -897583247 -1619336578 -714750745 2.10e9 -9.89e8

Created on 2022-01-06 by the reprex package (v2.0.1)

Here you can see that from the Mersenne-Twister method, .Random.seed[2] increments from it's maximum of 624 back to 1 and increased by the size of the random draw and that this is the same for set.seed(1) and set.seed(2). However the same trend is not seen in the other methods. To illustrate the last point, see that runif(1) increments .Random.seed[2] by 1 while runif(2) increments it by 2:

# create function to run random process and report .Random.seed[2]
t <- function(n = 1) {
p <- .Random.seed[2]
runif(n)
p
}

set.seed(1, kind = "Mersenne-Twister")
replicate(9, t(1))
#> [1] 624 1 2 3 4 5 6 7 8

set.seed(1, kind = "Mersenne-Twister")
replicate(5, t(2))
#> [1] 624 2 4 6 8

Created on 2022-01-06 by the reprex package (v2.0.1)

3) Sequential Randoms

Because the index or state of .Random.seed (apparently for all the RNG methods) advances according to the size of the 'random draw' (number of random values genearted from the .Random.seed), it is possible to generate the same series of random numbers from the same seed in different sized increments. Furthermore, as long as you run the same random process at the same point in the sequence after setting the same seed, it seems that you will get the same result. Observe the following example:

# draw 3 at once
set.seed(1, kind = "Mersenne-Twister")
sample(100, 3, T)
#> [1] 68 39 1

# repeat single draw 3 times
set.seed(1, kind = "Mersenne-Twister")
sample(100, 1)
#> [1] 68
sample(100, 1)
#> [1] 39
sample(100, 1)
#> [1] 1

# draw 1, do something else, draw 1 again
set.seed(1, kind = "Mersenne-Twister")
sample(100, 1)
#> [1] 68
runif(1)
#> [1] 0.5728534
sample(100, 1)
#> [1] 1

Created on 2022-01-06 by the reprex package (v2.0.1)

4) Correlated Randoms

As we saw above, two random processes run at the same point after setting the same seed are expected to give the same result. However, even when you provide constraints on how similar the result can be (e.g. by changing the mean of rnorm() or even by providing different functions) it seems that the results are still perfectly correlated within their respective ranges.

# same function with different constraints
set.seed(1, kind = "Mersenne-Twister")
a <- runif(50, 0, 1)

set.seed(1, kind = "Mersenne-Twister")
b <- runif(50, 10, 100)

plot(a, b)

Sample Image

# different functions
set.seed(1, kind = "Mersenne-Twister")
d <- rnorm(50)

set.seed(1, kind = "Mersenne-Twister")
e <- rlnorm(50)

plot(d, e)

Sample Image

Created on 2022-01-06 by the reprex package (v2.0.1)

Set seed in R for Normal distribution

I think you may want to post in code review, but the single biggest recommendation I have is to rely more heavily on lapply rather than sequentially named objects (e.g., num1, num2, num3.

I assume at the end of this, you want the results of the simulations. To do so, we will make a list using lapply over each simulation. I have added some of your constants outside of the loop; we need to replicate 3 datasets during each simulation. We make extensive use of lapply to continually make the summary statistics that you need. Map is similar, except it allows us to pass multiple arguments when we need it.

asim<-10
t=3
n = rep(10, t)
N = sum(n)
pv<- rep(NA,asim)

lapply(seq_len(asim), function (val) {
set.seed(val)

data = lapply(n, rnorm, 0, 1)

# SKEWED HEAVY

G=0.5
h=0.5 # h=0 if skewed normal

y = lapply(data, function(x) ((exp(G*x)-1)/G)*(exp((h*x^2)/2)))

g = lapply(y, sort)
ybar = lapply(y, mean)

#BIWEIGHT

med = lapply(g, median)
mad_=mad(unlist(g, use.names = FALSE))
u = Map(function (g_, med_) (g_ - med_) / (9 * mad_), g, med)
idx = lapply(u, function(x) which(abs(x) < 1))

num = Map(function(g_, med_, u_, idx_) (((g_ - med_) ^ 2) * ((1 - (u_^2))^4))[idx_], g, med, u, idx)
den = Map(function(u_, idx_) ((1 - (u_^2)) * (1 - 5 * (u_ ^ 2)))[idx_], u, idx)

data.frame(sim = rep(seq_len(t), lengths(idx)),
num = unlist(num, use.names = FALSE),
den = unlist(den, use.names = FALSE))
}
)

R package submission error concerning set.seed()

If you really, really, want to set the seed inside a function, which I believe you nor anyone should do, save the current seed, do whatever you want, and before exiting the function reset it to the saved value.

old_seed <- .Random.seed
rnorm(1)
#[1] -1.173346

set.seed(42)
rbinom(1, size = 1, prob = 0.5)
#[1] 0

.Random.seed <- old_seed
rnorm(1)
#[1] -1.173346

In a function it could be something like the following, without the message instructions. Note that the function prints nothing, it never calls any pseudo-RNG and always outputs TRUE. The point is to save the seed's current value and reset the seed in on.exit.

f <- function(simul = FALSE){
if(simul){
message("simul is TRUE")
old_seed <- .Random.seed
on.exit(.Random.seed <- old_seed)
# rest of code
} else message("simul is FALSE")
invisible(TRUE)
}

f()
s <- .Random.seed
f(TRUE)
identical(s, .Random.seed)
#[1] TRUE

rm(s)

Argument of set.seed in R

The seed argument in set.seed is a single value, interpreted as an integer (as defined in help(set.seed()). The seed in set.seed produces random values which are unique to that seed (and will be same irrespective of the computer you run and hence ensures reproducibility). So the random values generated by set.seed(1) and set.seed(123) will not be the same but the random values generated by R in your computer using set.seed(1) and by R in my computer using the same seed are the same.

set.seed(1)
x<-rnorm(10,2,1)
> x
[1] 1.373546 2.183643 1.164371 3.595281 2.329508 1.179532 2.487429 2.738325 2.575781 1.694612
set.seed(123)
y<-rnorm(10,2,1)
> y
[1] 1.4395244 1.7698225 3.5587083 2.0705084 2.1292877 3.7150650 2.4609162 0.7349388 1.3131471 1.5543380

> identical(x,y)
[1] FALSE

Different Set.seed sample function results with same versions of R in different OS

You have to set also the kind:

set.seed(1, "Mersenne-Twister", sample.kind="Rounding")
sample(100,3)
#[1] 27 37 57

set.seed(1, "Mersenne-Twister", sample.kind="Rejection")
sample(100,3)
#[1] 68 39 1


Related Topics



Leave a reply



Submit