Rdrand and Rdseed Intrinsics on Various Compilers

RDRAND and RDSEED intrinsics on various compilers?

All the major compilers support Intel's intrinsics for rdrand and rdseed via <immintrin.h>.

Somewhat recent versions of some compilers are needed for rdseed, e.g. GCC 9 (2019) or clang 7 (2018), although those have been stable for a good while by now. If you'd rather use an older compiler, or not enable ISA-extension options like -march=skylake, a library wrapper function instead of the intrinsic is a good choice. (Inline asm is not necessary; I wouldn't recommend it unless you want to play with it.)

#include <immintrin.h>
#include <stdint.h>

// gcc -march=native or haswell or znver1 or whatever, or manually enable -mrdrnd
uint64_t rdrand64(){
    unsigned long long ret;               // not uint64_t: GCC/clang wouldn't compile (see below)
    do{}while( !_rdrand64_step(&ret) );   // retry until success
    return ret;
}

// and equivalent for _rdseed64_step
// and 32 and 16-bit sizes with unsigned and unsigned short.

Some compilers define __RDRND__ when the instruction is enabled at compile time. GCC/clang have done so since they first supported the intrinsic at all; ICC only much later (19.0). And with ICC, -march=ivybridge doesn't imply -mrdrnd or define __RDRND__ until 2021.1.

ICX is LLVM-based and behaves like clang.

MSVC doesn't define any macros; its handling of intrinsics is designed around runtime feature detection only, unlike gcc/clang where the easy way is compile-time CPU feature options.
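
Where the macro exists, a minimal sketch of compile-time gating (GCC/clang, or ICC 19.0+): fail the build, or pick a fallback, rather than emitting an instruction the target might not have.

#include <stdint.h>

#ifdef __RDRND__        // defined by GCC/clang (and ICC 19.0+) under -mrdrnd
#include <immintrin.h>
uint64_t rand64(void) {
    unsigned long long r;
    do {} while (!_rdrand64_step(&r));   // retry until CF=1
    return r;
}
#else
#error "RDRAND not enabled: compile with -mrdrnd or -march=ivybridge (or newer)"
#endif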

Why do{}while() instead of while(){}? It turns out ICC compiles do{}while() to a less-dumb loop, not uselessly peeling a first iteration. Other compilers don't benefit from that hand-holding, and the peeled iteration isn't a correctness problem for ICC anyway.

Why unsigned long long instead of uint64_t? The type has to agree with the pointer type expected by the intrinsic, or C and especially C++ compilers will complain, regardless of the object representations being identical (64-bit unsigned). On Linux, for example, uint64_t is unsigned long, but GCC/clang's immintrin.h defines int _rdrand64_step(unsigned long long*), same as on Windows. So you always need unsigned long long ret with GCC/clang. MSVC is a non-problem, as it can (AFAIK) only target Windows, where unsigned long long is the only 64-bit unsigned type.

But ICC defines the intrinsic as taking unsigned long* when compiling for GNU/Linux, according to my testing on https://godbolt.org/. So to be portable to ICC, you actually need #ifdef __INTEL_COMPILER; even in C++ I don't know a way to use auto or other type-deduction to declare a variable that matches it.
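
A sketch of one way to stay portable, assuming (based only on that Godbolt testing) that ICC's GNU/Linux headers take unsigned long*; the typedef name is mine:

#include <immintrin.h>
#include <stdint.h>

#if defined(__INTEL_COMPILER) && defined(__linux__)
typedef unsigned long rdrand64_t;        // ICC targeting GNU/Linux (per Godbolt)
#else
typedef unsigned long long rdrand64_t;   // GCC, clang, MSVC
#endif

uint64_t rdrand64_portable(void) {
    rdrand64_t ret;
    do {} while (!_rdrand64_step(&ret)); // retry until success
    return ret;
}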



Compiler versions to support intrinsics

Tested on Godbolt; its earliest MSVC is 2015 and its earliest ICC is from 2013, so I can't go back any further. Support for _rdrand16_step / 32 / 64 was introduced at the same time in any given compiler; the 64-bit version requires 64-bit mode.

         CPU                     gcc   clang  MSVC                 ICC
rdrand   Ivy Bridge / Excavator  4.6   3.2    before 2015 (19.10)  before 13.0.1, but 19.0 for -mrdrnd defining __RDRND__; 2021.1 for -march=ivybridge to enable -mrdrnd
rdseed   Broadwell / Zen 1       9.1   7.0    before 2015 (19.10)  before(?) 13.0.1, but 19.0 also added the -mrdrnd and -mrdseed options

Why is random device creation expensive?

The short answer is: it depends on the system and the library implementation.

Types of standard libraries

  • In a fairytale world where you have the shittiest standard library implementation imaginable, random_device is nothing but a superfluous wrapper around std::rand(). I'm not aware of such an implementation, but someone may correct me on this.

  • On a bare-metal system, e.g. an embedded microcontroller, random_device may interact directly with a hardware random number generator or a corresponding CPU feature. That may or may not require an expensive setup, e.g. to configure the hardware, open communication channels, or discard the first N samples.

  • Most likely you are on a hosted platform, meaning a modern operating system with a hardware abstraction layer. Let's consider this case for the rest of this post.

Types of random_device

Your system may have a real hardware random number generator; for example, a TPM (Trusted Platform Module) can act as one. See How does the Trusted Platform Module generate its true random numbers? Any access to this hardware has to go through the operating system; on Windows, for example, this would likely be a Cryptographic Service Provider (CSP).

Or your CPU may have one built in, such as Intel's rdrand and rdseed instructions. In this case a random_device that maps directly to these just has to discover that they are available and check that they are operational. rdrand, for example, can detect hardware failure, at which point the implementation may provide a fallback. See What are the exhaustion characteristics of RDRAND on Ivy Bridge?

However, since these features may not be available, operating systems generally provide an entropy pool to generate random numbers. If these hardware features are available, your OS may use them to feed this pool or provide a fallback once the pool is exhausted. Your standard library will most likely just access this pool through an OS-specific API.

That is what random_device is on all mainstream library implementations at the moment: an access point to the OS facilities for random number generation. So what is the setup overhead of these?

System APIs

  • A traditional POSIX (UNIX) operating system provides random numbers through the pseudo-devices /dev/random and /dev/urandom. So the setup cost is that of opening and closing such a file. I assume this is what your book refers to.

  • Since this API has some downsides, newer APIs have popped up, such as Linux's getrandom. This one has no setup cost, but it may fail if the kernel does not support it, at which point a good library may fall back to /dev/urandom (see the sketch after this list).

  • Windows libraries likely go through its crypto API: either the old CSP API CryptGenRandom or the newer BCryptGenRandom. Both require a handle to a service or algorithm provider, so this may be similar to the /dev/urandom approach.
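
As a rough illustration (a minimal Linux-only sketch that ignores short reads and EINTR; get_random_bytes is my name, not a library API), a library might do something like this:

#include <sys/random.h>   // getrandom(), glibc 2.25+
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
#include <stddef.h>

// Fill buf with len random bytes; returns 0 on success, -1 on failure.
int get_random_bytes(void *buf, size_t len) {
    ssize_t n = getrandom(buf, len, 0);
    if (n == (ssize_t)len) return 0;
    if (n < 0 && errno != ENOSYS) return -1;
    // Kernel predates getrandom(): pay the open/read/close cost instead.
    int fd = open("/dev/urandom", O_RDONLY);
    if (fd < 0) return -1;
    ssize_t got = read(fd, buf, len);
    close(fd);
    return got == (ssize_t)len ? 0 : -1;
}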

Consequences

In all these cases you will need at least one system call to access the RNG, and these are significantly slower than normal function calls. See System calls overhead. And even the rdrand instruction itself is around 150 cycles per instruction. See What is the latency and throughput of the RDRAND instruction on Ivy Bridge? Or worse, see RDRAND and RDSEED intrinsics on various compilers?

A library (or user) may be tempted to reduce the number of system calls by buffering a larger number of random bytes, e.g. with buffered file I/O. This again would make opening and closing the random_device unwise, assuming this discards the buffer.

Additionally, the OS entropy pool has a limited size and can be exhausted, potentially causing the whole system to suffer (either by working with sub-par random numbers or by blocking until entropy is available again). This and the slow performance mean that you should not usually feed the random_device directly into a uniform_int_distribution or something like that. Instead, use it to initialize a pseudo-random number generator, as sketched below.

Of course this has exceptions. If your program needs just 64 random bytes over the course of its entire runtime, it would be foolish to draw the 2.5 kiB of random state needed to initialize a Mersenne Twister, for example. Or if you need the best possible entropy, e.g. to generate cryptographic keys, then by all means use it (or better yet, use a library for this; never reinvent crypto!).
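
The usual pattern looks something like this (my example): pay the random_device cost once to seed a PRNG, then draw everything else from the cheap PRNG.

#include <random>

int roll_die() {
    // Seeded once from the OS entropy source; every later call is cheap.
    static std::mt19937 gen{std::random_device{}()};
    std::uniform_int_distribution<int> dist(1, 6);
    return dist(gen);
}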

Intel DRNG giving only 4 bytes of data instead of 8

Use int _rdrand64_step (unsigned __int64* val) from immintrin.h instead of writing inline asm. You don't need it, and there are many reasons (including this one) to avoid it: https://gcc.gnu.org/wiki/DontUseInlineAsm


In this case, the problem is that you're probably compiling 32-bit code, so of course 64-bit rdrand is not encodeable. But the way you used inline-asm ended up giving you a 32-bit rdrand, and storing garbage from another register for the high half.

gcc -Wall -O3 -m32 -march=ivybridge (and similar for clang) produces (on Godbolt):

In function 'rdrand64_step':
<source>:7:1: warning: unsupported size for integer register

rdrand64_step:
push ebx
rdrand ecx; setc al
mov edx, DWORD PTR [esp+8] # load the pointer arg
movzx eax, al
mov DWORD PTR [edx], ecx
mov DWORD PTR [edx+4], ebx # store garbage in the high half of *rand
pop ebx
ret

I guess you called this function from a caller that happened to have ebx=0. Or else you used a different compiler that did something different. Maybe something else happens after inlining. If you looked at the disassembly of what you actually compiled, you could explain exactly what's going on.


If you'd used the intrinsic, you would have gotten error: '_rdrand64_step' was not declared in this scope, because immintrin.h only declares it in 64-bit mode (and with a -march setting that implies rdrand support, or -mrdrnd; the best option is -march=native if you're building on the target machine).

You'd also get significantly more efficient code for a retry loop, at least with clang:

unsigned long long use_intrinsic(void) {
    unsigned long long rand;
    while(!_rdrand64_step(&rand));   // TODO: retry limit in case RNG is broken.
    return rand;
}

use_intrinsic: # @use_intrinsic
.LBB2_1: # =>This Inner Loop Header: Depth=1
rdrand rax
jae .LBB2_1
ret

That avoids setcc and then testing the result, which is of course redundant. gcc6 has syntax for returning flag results from inline asm, so the carry flag can be used as the loop condition directly. You can also use asm goto and put a jcc inside the asm, jumping to a return 1; target or falling through to a return 0;. (The inline-asm docs have an example of doing this: https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html. See also the inline-assembly tag wiki.)
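
For example, a sketch using GCC 6's flag-output operands ("=@ccc" binds an output to the carry flag on x86); rdrand64_flagout is a hypothetical wrapper, and the intrinsic is still the better choice in real code:

#include <stdint.h>

// Returns 1 if CF was set (success), 0 otherwise; *out receives the random value.
static inline int rdrand64_flagout(unsigned long long *out) {
    int ok;
    asm volatile("rdrand %0" : "=r"(*out), "=@ccc"(ok));
    return ok;
}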

Using your inline-asm, clang (in 64-bit mode) compiles it to:

use_asm:
.LBB1_1:
rdrand rax
setb byte ptr [rsp - 1]
cmp byte ptr [rsp - 1], 0
je .LBB1_1
ret

(clang makes bad decisions for constraints with multiple options that include memory.)

gcc7.2 and ICC17 actually end up with better code from the asm than from the intrinsic. They use cmovc to get a 0 or 1 and then test that. It's pretty dumb, but that's a gcc/ICC missed optimization that will hopefully be fixed.

What are the exhaustion characteristics of RDRAND on Ivy Bridge?

Part 1.
Does it make a difference pulling 16, 32 or 64 bits?

No.

On Ivy Bridge, the CPU cores pull 64 bits over the internal communication links to the DRNG, regardless of the size of the destination register. So if you read 32 bits, it pulls 64 bits and throws away the top half. If you read 16 bits, it pulls 64 and throws away the top 3/4.

This is not described in the instruction documentation because it may not continue to be true in future products. A chip might be designed which stashes and uses the unused parts of the 64 bit word. However there isn't a significant performance imperative to do this today.

For the highest throughput, the most effective strategy is to pull from parallel threads. This is because there is parallelism in the bus hierarchy on chip. Most of the time for the instruction is transit time across the buses. Performing that transit in parallel is going to yield a linear increase in throughput with the number of threads, up to the maximum of 800 MBytes/s. The second thing is to use 64-bit RdRands, because they get more data per instruction.

Part 2.
What does CF=0 mean really?

It means 'random data not available'. This is because the details of why it can't get a number are not available to the CPU core without it going off and reading more registers, which it isn't going to do because there is nothing it can do with the information.

If you sucked the output buffer of the DRNG dry, you would get an underflow (CF=0) but you could expect the next RdRand to succeed, because the DRNG is fast.

If the DRNG failed (e.g. a transistor popped in the entropy source and it no longer was random) then the online health tests would detect this and shut down the DRNG. Then all your RdRand invocations would yield CF=0.

However on Ivy Bridge, you will not be able to underflow the buffer. The DRNG is a little faster than the bus to which it is attached. The effect of pulling more data per unit time (with parallel threads) will be to increase the execution time of each individual RdRand, as contention on the bus causes the instructions to have to wait in line at the DRNG's local bus. You can never pull so fast that the DRNG will underflow. You will asymptotically approach 800 MBytes/s.

This also is not described in the documentation because it may not continue to be true in future products. We can envisage products where the buses are faster and the cores faster and the DRNG would be able to be underflowed. These things are not known yet, so we can't make claims about them.

What will remain true is that the basic loop (try up to 10 times, then report a failure up the stack) given in the software implementor's guide will continue to work in future products, because we've made the claim that it will, and so we will engineer all future products to meet this.
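
A minimal sketch of that loop (the 10-try limit is the guide's recommendation; the function name is mine):

#include <immintrin.h>

// Returns 1 and fills *out on success, 0 if the DRNG stayed unavailable.
int rdrand64_retry(unsigned long long *out) {
    for (int i = 0; i < 10; i++) {
        if (_rdrand64_step(out))
            return 1;   // CF=1: valid random data
    }
    return 0;           // report failure up the stack
}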

So no, CF=0 cannot occur because "the buffers happen to be (transiently) empty when RDRAND is invoked" on Ivy Bridge, but it might occur on future silicon, so design your software to cope.

The Effect of Architecture When Using SSE / AVX Intrinsics

GCC and clang require that you enable all extensions you use. Otherwise it's a compile-time error, like error: inlining failed in call to always_inline ‘__m256d _mm256_mask_loadu_pd(__m256d, __mmask8, const void*)’: target specific option mismatch.

Using -march=native or -march=haswell or whatever is preferred over enabling specific extensions, because that also sets appropriate tuning options. And you don't forget useful ones like -mpopcnt that will let std::bitset::count() inline a popcnt instruction, and make all variable-count shifts more efficient with BMI2 shlx / shrx (1 uop vs. 3).
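
To illustrate (my own tiny example, not from the answer): with -march=haswell or just -mpopcnt, GCC/clang compile this to a single popcnt instruction; without those options it becomes a call to a fallback helper.

#include <bitset>
#include <cstdint>

int count_bits(uint64_t x) {
    return std::bitset<64>(x).count();  // popcnt with -mpopcnt, libcall without
}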


MSVC and ICC do not, and will let you use intrinsics to emit instructions that they couldn't auto-vectorize with.

You should definitely enable AVX if you use AVX intrinsics. Older MSVC without enabling AVX didn't always use vzeroupper automatically where needed, but that's been fixed for a few years. Still, if your whole program can assume AVX support, definitely tell the compiler about it even for MSVC.


For compilers that support GNU extensions (GCC, clang, ICC), you can use stuff like __attribute__((target("avx"))) on specific functions in a compilation unit. Or better, __attribute__((target("arch=haswell"))) to maybe also set tuning options. (That also enables AVX2 and FMA, which you might not want. And I'm not sure if target attributes do set -mtune=xx). See
https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html

__attribute__((target())) will prevent a function from inlining into callers compiled with different target options, so be careful where you put it: use it on a function containing a loop, not on a small helper function that's called inside a loop.

See also
https://gcc.gnu.org/wiki/FunctionMultiVersioning for using different target options on multiple definitions of the same function name, for compiler supported runtime dispatching. But I don't think there's a portable (to MSVC) way to do that.

See specify simd level of a function that compiler can use for more about doing runtime dispatch on GCC/clang.
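
As a quick sketch of the GCC mechanism (my example; target_clones is GCC 6+, and recent clang supports it too): the compiler emits one clone per listed target and dispatches once at load time via an ifunc resolver.

// Two clones of the same function: an AVX2 build and a baseline build.
__attribute__((target_clones("avx2", "default")))
void scale(float *dst, const float *src, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = src[i] * 2.0f;   // auto-vectorized differently in each clone
}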


With MSVC you don't need anything, although like I said I think it's normally a bad idea to use AVX intrinsics without -arch:AVX, so you might be better off putting those in a separate file. But for AVX vs. AVX2 + FMA, or SSE2 vs. SSE4.2, you're fine without anything.

Just #define the macro (AVX2_FUNCTION, or TARGET_HASWELL in the example below) to the empty string instead of __attribute__((target("avx2,fma"))):

#if defined(__GNUC__) && !defined(__INTEL_COMPILER)
// apparently ICC doesn't support target attributes, despite supporting GNU C
#define TARGET_HASWELL __attribute__((target("arch=haswell")))
#else
#define TARGET_HASWELL // empty
// maybe warn if __AVX__ isn't defined for functions where this is used?
// if you need to make sure MSVC uses vzeroupper everywhere needed.
#endif

TARGET_HASWELL
void foo_avx(float *__restrict dst, float *__restrict src)
{
    for (size_t i = 0; i < 1024; i++) {
        __m256 v = _mm256_loadu_ps(src);
        ...
        ...
    }
}

With GCC and clang, the macro expands to the __attribute__((target)) stuff; with MSVC and ICC it doesn't.



ICC pragma:

https://software.intel.com/en-us/cpp-compiler-developer-guide-and-reference-optimization-parameter documents a pragma which you'd want to put before AVX functions to make sure vzeroupper is used properly in functions that use _mm256 intrinsics.

#pragma intel optimization_parameter target_arch=AVX

For ICC, you could #define TARGET_AVX as this, and always use it on a line by itself before the function, where you can put either an __attribute__ or a pragma. You might also want separate macros for defining vs. declaring functions, if ICC doesn't want this on declarations. And a macro to end a block of AVX functions, if you want to have non-AVX functions after them. (For non-ICC compilers, these would be empty.)
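
A sketch of how that could look (untested on real ICC; _Pragma() is the standard C/C++ way to emit a pragma from a macro, and foo_avx256 is just a hypothetical function):

#if defined(__INTEL_COMPILER)
#define TARGET_AVX _Pragma("intel optimization_parameter target_arch=AVX")
#elif defined(__GNUC__)
#define TARGET_AVX __attribute__((target("avx")))
#else
#define TARGET_AVX   // MSVC: nothing; use -arch:AVX or a separate source file
#endif

TARGET_AVX
void foo_avx256(float *dst, const float *src);  // hypothetical AVX function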

Can't use uint64_t with rdrand as it expects unsigned long long, but uint64_t is defined as unsigned long

Use unsigned long long n; you can still return it as uint64_t.

It works fine on the Godbolt compiler explorer with current versions of all 4 major x86 compilers (GCC, clang, ICC, MSVC). Note that _rdrand64_step will only ever work on x86-64 C++ implementations, so that limits the scope of portability concerns a lot.
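
Spelled out, that's essentially the same wrapper as at the top of this page:

#include <immintrin.h>
#include <stdint.h>

uint64_t rdrand64(void) {
    unsigned long long n;                // matches the intrinsic's parameter type
    do {} while (!_rdrand64_step(&n));   // retry until CF=1 (success)
    return n;                            // converts to uint64_t losslessly
}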

All 4 mainstream x86 compilers define _rdrand64_step with a type compatible with unsigned long long, so in this case it's safe to just follow clang's headers.

Unfortunately (or not), gcc/clang's immintrin.h doesn't actually define a __int64 type to match Intel's intrinsics documentation, otherwise you could use that. ICC and MSVC do let you actually use unsigned __int64 n. (https://godbolt.org/z/v4xnc5)


immintrin.h being available at all implies a lot of other things about the compiler environment and type widths, and it's highly unlikely (but not impossible) that some future x86-64 C implementation would make unsigned long long anything other than a qword (uint64_t).

Although if they did, maybe they'd just map Intel's __int64 to a different type, since Intel's docs never use long or long long, just __int64, e.g. AVX2 _mm256_maskload_epi64(__int64 const* mem_addr, __m256i mask). (Or even __m128i* for the movq load intrinsic: __m128i _mm_loadl_epi64(__m128i const* mem_addr).)

Much later, a more sane __m128i _mm_loadu_si64(void const* mem_addr) was introduced (along with the AVX-512 intrinsics).

But still, a C++ implementation with an unsigned long long that wasn't exactly 64 bits would probably break some intrinsics code, so it's not a problem you need to spend any time really worrying about. In this instance, if it were wider, that would still be fine. You'd just be returning the low 64 bits of it, where _rdrand64_step(&n); put the result. (Or you'd get a compile error if that C++ implementation had the intrinsic take unsigned long or however they define uint64_t instead of unsigned long long).

So there's zero chance of silent data corruption / truncation on any hypothetical future C++ implementation. ISO C++ guarantees that unsigned long long is at least a 64-bit type. (Strictly it's specified by value range, and being unsigned its value bits are plain binary, but same difference.)

You don't need portability to a DeathStation 9000, just to any hypothetical future compiler that anyone might actually want to use, which pretty much implies that it would want to be compatible with existing Intel-intrinsics codebases, if it provides that style of intrinsics at all. (Rather than a redesign from scratch with different names and types, in which case you'd have to change 2 lines in this function to get it to work.)


