Random Output Different Between Implementations

Random output different between implementations

There is no required implementation for uniform_int_distribution<>. [rand.dist.general] specifies that:

The algorithms for producing each of the specified distributions are implementation-defined.

All that [rand.dist.uni.int] states is:

A uniform_int_distribution random number distribution produces random integers i, a <= i <= b, distributed according to the constant discrete probability function P(i | a, b) = 1/(b − a + 1).

Each implementation is free to achieve this distribution however it wishes. What you are seeing is apparently three different implementations.
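To illustrate the distinction, here's a minimal sketch: the raw output of a named engine such as std::mt19937 is fully specified by the standard and will match on every conforming implementation, while the distribution's output is allowed to differ between standard libraries.

#include <iostream>
#include <random>

int main()
{
    std::mt19937 engine(42);

    // Raw engine output: the algorithm is fixed by the standard,
    // so this value is identical everywhere.
    std::cout << engine() << '\n';

    // Distribution output: the algorithm is implementation-defined,
    // so this value may differ between standard libraries.
    std::uniform_int_distribution<int> dist(1, 6);
    std::cout << dist(engine) << '\n';
}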

C++ rand and srand get different output on different machines

Yes, rand() has different implementations; there's no requirement for them to be identical.

If you want consistent sequences across implementations and platforms, you can copy the sample implementation from the C standard section 7.20.2. Be sure to rename both rand and srand so they don't collide with the standard library's versions. You might need to adjust the code so the types have the same size and range across the implementations (e.g., use uint32_t from <stdint.h> rather than unsigned int).
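For reference, here is a sketch of that sample implementation with both suggested changes applied (the my_rand/my_srand names are placeholders of my choosing):

#include <cstdint>

// Based on the sample implementation in C standard section 7.20.2,
// renamed to avoid colliding with the library's rand/srand, and using
// a fixed-width type so every platform computes the same sequence.
static std::uint32_t next_state = 1;

int my_rand(void)  // returns values in the range [0, 32767]
{
    next_state = next_state * 1103515245u + 12345u;
    return static_cast<int>((next_state / 65536u) % 32768u);
}

void my_srand(std::uint32_t seed)
{
    next_state = seed;
}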

EDIT: Given the new information from the comments, it looks like the requirements are different from what we thought (and I'm still not 100% clear on what they are).

You want to generate random numbers on two systems consistent with a stored file that you've generated on one system, but you're unable to transfer it to the other due to network issues (the file is about a gigabyte). (Burning it to a DVD, or splitting it and burning it to 2 CDs, isn't an option?)

Suggested solution:

Write a custom generator that generates consistent results on both systems (even if they're not the same results you got before). Once you've done that, use it to re-generate a new 1-gigabyte data file on both systems. The existing file becomes unnecessary, and you don't need to transfer huge amounts of data.
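One option that avoids writing a generator from scratch (a sketch, assuming C++11 is available on both machines): std::mt19937's algorithm is fixed by the standard, so seeding it identically and streaming its output to disk reproduces the same file on any conforming implementation. The file name, seed, and size below are placeholders.

#include <cstdint>
#include <fstream>
#include <random>

int main()
{
    std::mt19937 engine(12345);  // agreed-upon seed on both systems

    std::ofstream out("data.bin", std::ios::binary);

    // ~1 GB of 32-bit words; write the bytes explicitly (low byte first)
    // so the file is identical regardless of platform endianness.
    const std::uint64_t words = (1ull << 30) / 4;
    for (std::uint64_t i = 0; i < words; ++i) {
        std::uint32_t v = static_cast<std::uint32_t>(engine());
        char bytes[4] = { static_cast<char>(v & 0xFF),
                          static_cast<char>((v >> 8) & 0xFF),
                          static_cast<char>((v >> 16) & 0xFF),
                          static_cast<char>((v >> 24) & 0xFF) };
        out.write(bytes, 4);
    }
}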

How to generate the same random number in two different environments?

You can use the Mersenne Twister; it has reproducible output (its algorithm is standardized).
Use the same seed on two machines and you're good to go.

#include <random>
#include <iostream>

int main()
{
    std::mt19937 engine;

    engine.seed(1);

    for (std::size_t n = 0; n < 10; ++n)
    {
        std::cout << engine() << std::endl;
    }
}

You can verify it here: https://godbolt.org/z/j5r6ToGY7. Just select different compilers and check that the output is identical.

Time difference for random number generation implementation in Java vs. C++

I suspect that the performance issue is in the bodies of your carbon(), hydrogen(), nitrogen(), oxygen(), and sulfur() functions. You should show how they produce the random data.

Or it could be in the if (sum < threshold) {} else {} code.

I wanted to keep setting the seed so the results would not be deterministic (closer to being truly random)

Since you're using the result of time(0) as a seed you're not getting particularly random results either way.

Instead of using srand() and rand() you should take a look at the <random> library and choose an engine with the performance/quality characteristics that meet your needs. If your implementation supports it you can even get non-deterministic random data from std::random_device (either to generate seeds or as an engine).

Additionally <random> provides pre-made distributions such as std::uniform_real_distribution<double> which is likely to be better than the average programmer's method of manually computing the distribution you want from the results of rand().
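Putting those pieces together, a minimal sketch might look like this (whether std::random_device is actually non-deterministic depends on the implementation):

#include <iostream>
#include <random>

int main()
{
    std::random_device rd;       // non-deterministic if the implementation supports it
    std::mt19937 engine(rd());   // seed the engine once, not on every call

    // A ready-made distribution instead of hand-rolled rand() arithmetic.
    std::uniform_real_distribution<double> dist(0.0, 1.0);

    for (int i = 0; i < 5; ++i)
        std::cout << dist(engine) << '\n';
}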


Okay, here's how you can eliminate the inner loops from your code and drastically speed it up (in Java or C++).

Your code:

double carbon() {
    if (rand() % 10000 < 107)
        return 13.0033548378;
    else
        return 12.0;
}

picks one of two values with a particular probability. Presumably you intended the first value to be picked about 107 times out of 10000 (although rand() % 10000 has a slight modulo bias, so it doesn't quite give you that). When you run this in a loop and sum the results as in:

for (int i = 0; i < composition[0]; i++) sum += carbon();

you'll essentially get sum += X*13.0033548378 + Y*12.0; where X is the number of times the random number stays under the threshold and Y is (trials-X). It just so happens that you can simulate running a bunch of trials and calculating the number of successes using a binomial distribution, and <random> happens to provide a binomial distribution.

Given a function sum_trials()

std::minstd_rand0 eng; // global random engine

double sum_trials(int trials, double probability, double A, double B) {
    std::binomial_distribution<> dist(trials, probability);
    int successes = dist(eng);
    return successes*A + (trials-successes)*B;
}

You can replace your carbon() loop:

sum += sum_trials(composition[0], 107.0/10000.0, 13.0033548378, 12.0); // carbon trials

I don't have the actual values you're using, but your whole loop will look something like:

for (int i = 0; i < 100000000; i++) {
    double sum = 0;
    sum += sum_trials(composition[0], 107.0/10000.0, 13.0033548378, 12.0); // carbon trials
    sum += sum_trials(composition[1], 107.0/10000.0, 13.0033548378, 12.0); // hydrogen trials
    sum += sum_trials(composition[2], 107.0/10000.0, 13.0033548378, 12.0); // nitrogen trials
    sum += sum_trials(composition[3], 107.0/10000.0, 13.0033548378, 12.0); // oxygen trials
    sum += sum_trials(composition[4], 107.0/10000.0, 13.0033548378, 12.0); // sulfur trials

    if (sum > threshold) {
    } else {
    }
}

Now one thing to note is that inside the function we're constructing distributions over and over with the same data. We can extract that by replacing the function sum_trials() with a function object, which we construct with the appropriate data once before the loop, and then just use the functor repeatedly:

struct sum_trials {
    std::binomial_distribution<> dist;
    double A; double B; int trials;

    sum_trials(int t, double p, double a, double b) : dist{t, p}, A{a}, B{b}, trials{t} {}

    double operator()() {
        int successes = dist(eng);
        return successes * A + (trials - successes) * B;
    }
};

int main() {
    int threshold = 5;
    int composition[5] = { 10, 10, 10, 10, 10 };

    sum_trials carbon   = { composition[0], 107.0/10000.0, 13.0033548378, 12.0 };
    sum_trials hydrogen = { composition[1], 107.0/10000.0, 13.0033548378, 12.0 };
    sum_trials nitrogen = { composition[2], 107.0/10000.0, 13.0033548378, 12.0 };
    sum_trials oxygen   = { composition[3], 107.0/10000.0, 13.0033548378, 12.0 };
    sum_trials sulfur   = { composition[4], 107.0/10000.0, 13.0033548378, 12.0 };

    for (int i = 0; i < 100000000; i++) {
        double sum = 0;

        sum += carbon();
        sum += hydrogen();
        sum += nitrogen();
        sum += oxygen();
        sum += sulfur();

        if (sum > threshold) {
        } else {
        }
    }
}

The original version of the code took about one minute and 30 seconds on my system. The last version here takes 11 seconds.


Here's a functor to generate the oxygen sums using two binomial_distributions: the first draws the number of heaviest-isotope picks, and the second draws the middle isotope from the remaining trials, with its probability renormalized to the remaining probability mass (hence the (47.0-17.0)/(1000.0-17.0) below). Maybe one of the other distributions can do this in one shot, but I don't know.

struct sum_trials2 {
    std::binomial_distribution<> d1;
    std::binomial_distribution<> d2;
    double A; double B; double C;
    int trials;
    double probability2;

    sum_trials2(int t, double p1, double p2, double a, double b, double c)
        : d1{t, p1}, A{a}, B{b}, C{c}, trials{t}, probability2{p2} {}

    double operator()() {
        int X = d1(eng);
        // re-parameterize d2 for the trials that remain after the first draw
        d2.param(std::binomial_distribution<>{trials - X, probability2}.param());
        int Y = d2(eng);

        return X*A + Y*B + (trials - X - Y)*C;
    }
};

sum_trials2 oxygen{composition[3], 17.0/1000.0, (47.0-17.0)/(1000.0-17.0), 17.9999, 16.999, 15.999};

You can further speed this up if you can calculate, up front, the probability that the sum is over your threshold:

int main() {
    std::minstd_rand0 eng;
    // probability_sum_is_over_threshold: computed analytically beforehand
    std::bernoulli_distribution dist(probability_sum_is_over_threshold);

    for (int i = 0; i < 100000000; ++i) {
        if (dist(eng)) {
        } else {
        }
    }
}

Unless the values for the other elements can be negative, the probability that the sum is greater than five is 100%. In that case you don't even need to generate random data; just execute the 'if' branch of your code 100,000,000 times:

int main() {
    for (int i = 0; i < 100000000; ++i) {
        // execute some code
    }
}

Random number generator performance varies between platforms

All of the distributions in the C++ standard library (including uniform_real_distribution) use an implementation-defined algorithm. (The same applies to std::rand, which defers to the C standard's rand function.) Thus, it's natural that there would be performance differences between these distributions in different implementations of the C++ standard library. See also this answer.

You may want to try testing whether there are performance differences in the C++ random engines (such as std::minstd_rand and std::mt19937), which do specify a fixed algorithm in the C++ standard. To do so, generate a random number in the engine directly and not through any C++ distribution such as uniform_int_distribution or uniform_real_distribution.
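As a rough sketch of such a test (the iteration count is arbitrary; accumulating into variables and printing them keeps the compiler from optimizing the loops away):

#include <chrono>
#include <cstdint>
#include <iostream>
#include <random>

// Time raw engine output (algorithm fixed by the standard) separately
// from output drawn through a distribution (implementation-defined).
template <typename F>
double time_ms(F f)
{
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main()
{
    const int N = 10000000;
    std::mt19937 eng(42);

    std::uint64_t sink = 0;
    double raw_ms = time_ms([&] {
        for (int i = 0; i < N; ++i) sink += eng();
    });

    std::uniform_real_distribution<double> dist(0.0, 1.0);
    double sum = 0.0;
    double dist_ms = time_ms([&] {
        for (int i = 0; i < N; ++i) sum += dist(eng);
    });

    std::cout << "raw engine: " << raw_ms << " ms, via distribution: "
              << dist_ms << " ms (" << sink << ", " << sum << ")\n";
}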


I originally thought that this was compiler optimizations omitting the uniform_real_distribution when it wasn't stored or printed, since the variable isn't used and can thus be omitted; but then why doesn't the compiler do the same for std::rand?

I presume the compiler can do this optimization because, in practice, the C++ standard library is implemented as C++ code (largely templates in headers) that is available to the compiler, so the compiler can perform certain optimizations on that code as necessary. This is unlike std::rand, which is only declared to the compiler; its implementation lives in the C runtime and is not available for inlining, limiting the optimizations the compiler can do.

`numpy.random.normal` generates different numbers on different systems

Given that the differences are all so small, the underlying bit generators are presumably doing the same thing; the discrepancy likely comes down to differences in the underlying math library.

NumPy's legacy generator uses sqrt and log from libm, and you can see that it pulls in these symbols by first finding the shared object providing the generator via:

import numpy as np

print(np.random.mtrand.__file__)

then dumping symbols with:

nm -C -gD mtrand.*.so | grep GLIBC

where that mtrand filename comes from the above output.

I get a lot of other symbols in the output as well, but the libm ones (log and sqrt) are what might explain the differences.

At a guess, it's to do with the log implementation, so you could test with:

import numpy as np

np.random.seed(0)

x = 2 * np.random.rand(2, 10**5) - 1
r2 = np.sum(x * x, axis=0)

np.save('test-log.npy', np.log(r2))

and compare between these two systems.


