Random output different between implementations
There is no required implementation for uniform_int_distribution<>. [rand.dist.general] specifies that:
The algorithms for producing each of the specified distributions are implementation-defined.
All that [rand.dist.uni.int] states is:
A uniform_int_distribution random number distribution produces random integers i, a <= i <= b, distributed according to the constant discrete probability function P(i | a, b) = 1/(b − a + 1).
Each implementation is free to achieve this distribution how it wishes. What you are seeing is apparently three different implementations.
C++ rand and srand gets different output on different machines
Yes, rand() has different implementations; there's no requirement for them to be identical.
If you want consistent sequences across implementations and platforms, you can copy the sample implementation from the C standard, section 7.20.2. Be sure to rename both rand and srand so they don't collide with the standard library's versions. You might need to adjust the code so the types have the same size and range across implementations (e.g., use uint32_t from <stdint.h> rather than unsigned int).
EDIT: Given the new information from the comments, it looks like the requirements are different from what we thought (and I'm still not 100% clear on what they are).
You want to generate random numbers on two systems, consistent with a stored file that you generated on one system but can't transfer to the other due to network issues (the file is about a gigabyte). (Burning it to a DVD, or splitting it across two CDs, isn't an option?)
Suggested solution:
Write a custom generator that generates consistent results on both systems (even if they're not the same results you got before). Once you've done that, use it to re-generate a new 1-gigabyte data file on both systems. The existing file becomes unnecessary, and you don't need to transfer huge amounts of data.
how to generate the same random number in two different environments?
You can use the Mersenne Twister; its output is reproducible because the algorithm is standardized. Use the same seed on both machines and you're good to go.
#include <random>
#include <iostream>

int main()
{
    std::mt19937 engine;
    engine.seed(1);
    for (std::size_t n = 0; n < 10; ++n)
    {
        std::cout << engine() << std::endl;
    }
}
You can verify it here: https://godbolt.org/z/j5r6ToGY7. Just select different compilers and check that the output is identical.
Time difference for random number generation implementation in Java vs. C++
I suspect that the performance issue is in the bodies of your carbon(), hydrogen(), nitrogen(), oxygen(), and sulfur() functions. You should show how they produce the random data. Or it could be in the if (sum < threshold) {} else {} code.
I wanted to keep setting the seed so the results would not be deterministic (closer to being truly random)
Since you're using the result of time(0) as a seed, you're not getting particularly random results either way.
Instead of using srand() and rand(), you should take a look at the <random> library and choose an engine with the performance/quality characteristics that meet your needs. If your implementation supports it, you can even get non-deterministic random data from std::random_device (either to generate seeds or as an engine). Additionally, <random> provides pre-made distributions such as std::uniform_real_distribution<double>, which is likely to be better than the average programmer's method of manually computing the desired distribution from the results of rand().
Okay, here's how you can eliminate the inner loops from your code and drastically speed it up (In Java or C++).
Your code:
double carbon() {
    if (rand() % 10000 < 107)
        return 13.0033548378;
    else
        return 12.0;
}
picks one of two values with a particular probability. Presumably you intended the first value to be picked about 107 times out of 10000 (although using % with rand() doesn't quite give you that). When you run this in a loop and sum the results, as in:
for (int i = 0; i < composition[0]; i++) sum += carbon();
you'll essentially get sum += X*13.0033548378 + Y*12.0; where X is the number of times the random number stays under the threshold and Y is (trials - X). It just so happens that you can simulate running a bunch of trials and counting the successes using a binomial distribution, and <random> happens to provide a binomial distribution.
Given a function sum_trials():

std::minstd_rand0 eng; // global random engine

double sum_trials(int trials, double probability, double A, double B) {
    std::binomial_distribution<> dist(trials, probability);
    int successes = dist(eng);
    return successes*A + (trials-successes)*B;
}
you can replace your carbon() loop:
sum += sum_trials(composition[0], 107.0/10000.0, 13.003354378, 12.0); // carbon trials
I don't have the actual values you're using, but your whole loop will look something like:
for (int i = 0; i < 100000000; i++) {
    double sum = 0;
    sum += sum_trials(composition[0], 107.0/10000.0, 13.003354378, 12.0); // carbon trials
    sum += sum_trials(composition[1], 107.0/10000.0, 13.003354378, 12.0); // hydrogen trials
    sum += sum_trials(composition[2], 107.0/10000.0, 13.003354378, 12.0); // nitrogen trials
    sum += sum_trials(composition[3], 107.0/10000.0, 13.003354378, 12.0); // oxygen trials
    sum += sum_trials(composition[4], 107.0/10000.0, 13.003354378, 12.0); // sulfur trials
    if (sum > threshold) {
    } else {
    }
}
Now one thing to note is that inside the function we're constructing distributions over and over with the same data. We can extract that by replacing the function sum_trials() with a function object, which we construct once with the appropriate data before the loop and then just use repeatedly:
struct sum_trials {
    std::binomial_distribution<> dist;
    double A; double B; int trials;

    sum_trials(int t, double p, double a, double b) : dist{t, p}, A{a}, B{b}, trials{t} {}

    double operator() () {
        int successes = dist(eng);
        return successes * A + (trials - successes) * B;
    }
};

int main() {
    int threshold = 5;
    int composition[5] = { 10, 10, 10, 10, 10 };

    sum_trials carbon   = { composition[0], 107.0/10000.0, 13.003354378, 12.0};
    sum_trials hydrogen = { composition[1], 107.0/10000.0, 13.003354378, 12.0};
    sum_trials nitrogen = { composition[2], 107.0/10000.0, 13.003354378, 12.0};
    sum_trials oxygen   = { composition[3], 107.0/10000.0, 13.003354378, 12.0};
    sum_trials sulfur   = { composition[4], 107.0/10000.0, 13.003354378, 12.0};

    for (int i = 0; i < 100000000; i++) {
        double sum = 0;
        sum += carbon();
        sum += hydrogen();
        sum += nitrogen();
        sum += oxygen();
        sum += sulfur();
        if (sum > threshold) {
        } else {
        }
    }
}
The original version of the code took my system about one minute 30 seconds. The last version here takes 11 seconds.
Here's a functor to generate the oxygen sums using two binomial_distributions. Maybe one of the other distributions can do this in one shot but I don't know.
struct sum_trials2 {
    std::binomial_distribution<> d1;
    std::binomial_distribution<> d2;
    double A; double B; double C;
    int trials;
    double probability2;

    sum_trials2(int t, double p1, double p2, double a, double b, double c)
        : d1{t, p1}, A{a}, B{b}, C{c}, trials{t}, probability2{p2} {}

    double operator() () {
        int X = d1(eng);
        d2.param(std::binomial_distribution<>{trials - X, probability2}.param());
        int Y = d2(eng);
        return X*A + Y*B + (trials - X - Y)*C;
    }
};
sum_trials2 oxygen{composition[3], 17.0/1000.0, (47.0-17.0)/(1000.0-17.0), 17.9999, 16.999, 15.999};
You can further speed this up if you can just calculate the probability that the sum is under your threshold:
int main() {
    std::minstd_rand0 eng;
    std::bernoulli_distribution dist(probability_sum_is_over_threshold);
    for (int i = 0; i < 100000000; ++i) {
        if (dist(eng)) {
        } else {
        }
    }
}
Unless the values for the other elements can be negative, the probability that the sum is greater than five is 100%. In that case you don't even need to generate random data; just execute the 'if' branch of your code 100,000,000 times.
int main() {
    for (int i = 0; i < 100000000; ++i) {
        // execute some code
    }
}
Random number generator performance varies between platforms
All of the distributions in the C++ standard library (including uniform_real_distribution) use an implementation-defined algorithm. (The same applies to std::rand, which defers to the C standard's rand function.) Thus, it's natural that there would be performance differences between these distributions across different implementations of the C++ standard library. See also this answer.
You may want to try testing whether there are performance differences among the C++ random engines (such as std::minstd_rand and std::mt19937), which do specify a fixed algorithm in the C++ standard. To do so, generate random numbers from the engine directly, not through any C++ distribution such as uniform_int_distribution or uniform_real_distribution.
I originally thought that this was compiler optimizations omitting the uniform_real_distribution when it wasn't stored / printed as the variable isn't used and thus can be omitted but then why doesn't the compiler do the same for std::rand[?]
I presume the compiler could do this optimization because, in practice, the C++ standard library is implemented as C++ code that's available to the compiler, so the compiler can perform certain optimizations on that code as needed. This is unlike std::rand, which is typically implemented as a function whose definition is not available to the compiler, limiting the optimizations it can do.
`numpy.random.normal` generates different numbers on different systems
Given that the differences are all so small, it suggests that the underlying bit generators are doing the same thing; the discrepancies come down to differences between the underlying math libraries.
NumPy's legacy generator uses sqrt and log from libm, and you can see that it pulls in these symbols by first finding the shared object providing the generator via:

import numpy as np
print(np.random.mtrand.__file__)
then dumping symbols with:

nm -C -gD mtrand.*.so | grep GLIBC

where that mtrand filename comes from the output above.
I get a lot of other symbols in the output as well, but the libm ones might explain the differences.
At a guess it's to do with the log implementation, so you could test with:
import numpy as np
np.random.seed(0)
x = 2 * np.random.rand(2, 10**5) - 1
r2 = np.sum(x * x, axis=0)
np.save('test-log.npy', np.log(r2))
and compare between these two systems.