Near Constant Time Rotate That Does Not Violate the Standards

I've linked to this answer for the full details from several other "rotate" questions, including this community wiki question, which should be kept up to date with best-practices.

I found a blog post about this issue, and it looks like it's finally a solved problem (with new-enough compiler versions).

John Regehr at the University of Utah recommends version "c" of his attempts at making a rotate function. I replaced his assert with a bitwise AND, and found that it still compiles to a single rotate insn.

typedef uint32_t rotwidth_t;  // parameterize for comparing compiler output with various sizes

rotwidth_t rotl (rotwidth_t x, unsigned int n)
{
  const unsigned int mask = (CHAR_BIT*sizeof(x)-1);  // e.g. 31

  assert ( (n<=mask) && "rotate by type width or more");
  n &= mask;  // avoid undef behaviour with NDEBUG.  0 overhead for most types / compilers
  return (x<<n) | (x>>( (-n)&mask ));
}

rotwidth_t rot_const(rotwidth_t x)
{
  return rotl(x, 7);
}

This could be templated on x's type, but for real use it probably makes more sense to have the width in the function name (like rotl32). Usually when you're rotating, you know what width you want, and that matters more than the size of the variable the value currently happens to be stored in.

Also make sure to only use this with unsigned types. Right-shift of a signed type does an arithmetic shift, shifting in copies of the sign bit. (Technically that's implementation-defined behaviour, but every mainstream compiler on two's complement hardware does it that way.)
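For instance, this little test program (a hypothetical example of mine, assuming the usual arithmetic-shift behaviour) shows the difference:

#include <stdint.h>
#include <inttypes.h>
#include <stdio.h>

int main(void)
{
  int32_t  s = INT32_MIN;             // bit pattern 0x80000000
  uint32_t u = UINT32_C(0x80000000);
  printf("%08" PRIx32 "\n", (uint32_t)(s >> 4));  // typically f8000000: sign bits shifted in
  printf("%08" PRIx32 "\n", u >> 4);              // 08000000: zeros shifted in
  return 0;
}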

Pabigot independently came up with the same idea before I did, and posted it at github. His version has C++ static_assert checking to make it a compile-time error to use a rotate count outside the range for the type.

I tested mine with gcc.godbolt.org, with NDEBUG defined, for variable and compile-time-const rotate counts:

  • gcc: optimal code with gcc >= 4.9.0, non-branching neg+shifts+or with earlier.

    (compile-time const count: gcc 4.4.7 is fine)
  • clang: optimal code with clang >= 3.5.0, non-branching neg+shifts+or with earlier.

    (compile-time const rotate count: clang 3.0 is fine)
  • icc 13: optimal code.

    (compile-time const count with -march=native: generates slower shld $7, %edi, %edi. Fine without -march=native)

Newer compiler versions can also handle the commonly-given code from wikipedia (included in the godbolt sample) without generating a branch or cmov. John Regehr's version has the advantage of avoiding undefined behaviour when the rotate count is 0.

There are some caveats with 8 and 16 bit rotates, but compilers seem fine with 32 or 64, when n is uint32_t. See the comments in the code on the godbolt link for some notes from my testing various widths of uint*_t. Hopefully this idiom will be better-recognized by all compilers for more combinations of type widths in the future. Sometimes gcc will uselessly emit an AND insn on the rotate count, even though the x86 ISA defines the rotate insns with that exact AND as the first step.

"optimal" means as efficient as:

# gcc 4.9.2   rotl(unsigned int, unsigned int):
    movl    %edi, %eax
    movl    %esi, %ecx
    roll    %cl, %eax
    ret
# rot_const(unsigned int):
    movl    %edi, %eax
    roll    $7, %eax
    ret

When inlined, the compiler should be able to arrange for values to be in the right registers in the first place, resulting in just a single rotate.

With older compilers, you'll still get ideal code when the rotate count is a compile-time constant. Godbolt lets you test with ARM as a target, and it used a rotate there, too. With variable counts on older compilers, you get a bit of code bloat, but no branches or major performance problems, so this idiom should be safe to use in general.

BTW, I modified John Regehr's original to use CHAR_BIT*sizeof(x), and gcc / clang / icc emit optimal code for uint64_t as well. However, I did notice that changing x to uint64_t while the function return type is still uint32_t makes gcc compile it to shifts/or. So be careful to do the narrowing cast to 32 bits in a separate statement, if you want the low 32 bits of a 64-bit rotate: assign the result to a 64-bit variable, then cast/return it. icc still generates a rotate insn, but gcc and clang don't, for

// generates slow code: cast separately.
uint32_t r = (uint32_t)( (x<<n) | (x>>( -n&(CHAR_BIT*sizeof(x)-1) )) );
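The pattern recommended above (keep the full-width result, then narrow separately) would look like this sketch:

// per the note above: keep the rotate result in a 64-bit variable first
uint64_t wide = (x<<n) | (x>>( -n&(CHAR_BIT*sizeof(x)-1) ));
uint32_t r = (uint32_t)wide;  // narrowing afterwards keeps the single rotate insn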

If anyone can test this with MSVC, it would be useful to know what happens there.

Best practices for circular shift (rotate) operations in C++

See also an earlier version of this answer on another rotate question with some more details about what asm gcc/clang produce for x86.

The most compiler-friendly way to express a rotate in C and C++ that avoids any Undefined Behaviour seems to be John Regehr's implementation. I've adapted it to rotate by the width of the type (using fixed-width types like uint32_t).

#include <stdint.h>   // for uint32_t
#include <limits.h>   // for CHAR_BIT
// #define NDEBUG
#include <assert.h>

static inline uint32_t rotl32 (uint32_t n, unsigned int c)
{
  const unsigned int mask = (CHAR_BIT*sizeof(n) - 1);  // assumes width is a power of 2.

  // assert ( (c<=mask) && "rotate by type width or more");
  c &= mask;
  return (n<<c) | (n>>( (-c)&mask ));
}

static inline uint32_t rotr32 (uint32_t n, unsigned int c)
{
  const unsigned int mask = (CHAR_BIT*sizeof(n) - 1);

  // assert ( (c<=mask) && "rotate by type width or more");
  c &= mask;
  return (n>>c) | (n<<( (-c)&mask ));
}

This works for any unsigned integer type, not just uint32_t, so you could make versions for other sizes.
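For example, a 64-bit version follows the same pattern unchanged (a sketch; the mask adapts automatically via sizeof):

static inline uint64_t rotl64 (uint64_t n, unsigned int c)
{
  const unsigned int mask = (CHAR_BIT*sizeof(n) - 1);  // 63 for uint64_t
  c &= mask;
  return (n<<c) | (n>>( (-c)&mask ));
}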

See also a C++11 template version with lots of safety checks (including a static_assert that the type width is a power of 2), which isn't the case on some 24-bit DSPs or 36-bit mainframes, for example.

I'd recommend only using the template as a back-end for wrappers with names that include the rotate width explicitly. Integer-promotion rules mean that rotl_template(u16 & 0x11UL, 7) would do a 32 or 64-bit rotate, not 16 (depending on the width of unsigned long). Even uint16_t & uint16_t is promoted to signed int by C++'s integer-promotion rules, except on platforms where int is no wider than uint16_t.
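To see the promotion hazard without templates, here's a hypothetical C snippet (example values are mine; assumes int is 32 bits):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
  uint16_t x = 0x8001;
  unsigned int c = 1;
  // Both operands are promoted to (signed) int before the shifts, so the
  // expression is computed in 32 bits: 0x10003, not the 16-bit rotate 0x0003.
  unsigned int r = (x << c) | (x >> ((16 - c) & 15));
  printf("%#x vs %#x\n", r, (unsigned)(uint16_t)r);  // 0x10003 vs 0x3
  return 0;
}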


On x86, this version inlines to a single rol r32, cl (or rol r32, imm8) with compilers that grok it, because the compiler knows that x86 rotate and shift instructions mask the shift-count the same way the C source does.

Compiler support for this UB-avoiding idiom on x86, for uint32_t x and unsigned int n for variable-count shifts:

  • clang: recognized for variable-count rotates since clang3.5, multiple shifts+or insns before that.
  • gcc: recognized for variable-count rotates since gcc4.9, multiple shifts+or insns before that. gcc5 and later optimize away the branch and mask in the wikipedia version, too, using just a ror or rol instruction for variable counts.
  • icc: supported for variable-count rotates since ICC13 or earlier. Constant-count rotates use shld edi,edi,7 which is slower and takes more bytes than rol edi,7 on some CPUs (especially AMD, but also some Intel), when BMI2 isn't available for rorx eax,edi,25 to save a MOV.
  • MSVC: x86-64 CL19: Only recognized for constant-count rotates. (The wikipedia idiom is recognized, but the branch and AND aren't optimized away). Use the _rotl / _rotr intrinsics from <intrin.h> on x86 (including x86-64).

gcc for ARM uses an and r1, r1, #31 for variable-count rotates, but still does the actual rotate with a single instruction: ror r0, r0, r1. So gcc doesn't realize that rotate-counts are inherently modular. As the ARM docs say, "ROR with shift length, n, more than 32 is the same as ROR with shift length n-32". I think gcc gets confused here because left/right shifts on ARM saturate the count, so a shift by 32 or more will clear the register. (Unlike x86, where shifts mask the count the same as rotates). It probably decides it needs an AND instruction before recognizing the rotate idiom, because of how non-circular shifts work on that target.

Current x86 compilers still use an extra instruction to mask a variable count for 8 and 16-bit rotates, probably for the same reason they don't avoid the AND on ARM. This is a missed optimization, because performance doesn't depend on the rotate count on any x86-64 CPU. (Masking of counts was introduced with 286 for performance reasons because it handled shifts iteratively, not with constant-latency like modern CPUs.)

BTW, prefer rotate-right for variable-count rotates, to avoid making the compiler do 32-n to implement a left rotate on architectures like ARM and MIPS that only provide a rotate-right. (This optimizes away with compile-time-constant counts.)
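With the masked rotr32 from above, a variable-count left rotate can even be expressed as a right rotate by the negated count (a sketch relying on well-defined unsigned wraparound; check the codegen for your target):

static inline uint32_t rotl32_via_rotr (uint32_t x, unsigned int c)
{
  return rotr32(x, -c);  // -c wraps mod 2^32; rotr32 masks it down to (32-c)&31
}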

Fun fact: ARM doesn't really have dedicated shift/rotate instructions, it's just MOV with the source operand going through the barrel-shifter in ROR mode: mov r0, r0, ror r1. So a rotate can fold into a register-source operand for an EOR instruction or something.


Make sure you use unsigned types for n and the return value, or else it won't be a rotate. (gcc for x86 targets does arithmetic right shifts, shifting in copies of the sign-bit rather than zeroes, leading to a problem when you OR the two shifted values together. Right-shifts of negative signed integers are implementation-defined behaviour in C.)

Also, make sure the shift count is an unsigned type, because (-n)&31 with a signed type could be one's complement or sign/magnitude, and not the same as the modulo-2^n wraparound you get with unsigned or two's complement. (See the comments on Regehr's blog post.) unsigned int does well on every compiler I've looked at, for every width of x. Some other types actually defeat the idiom-recognition for some compilers, so don't just use the same type as x.


Some compilers provide intrinsics for rotates, which is far better than inline-asm if the portable version doesn't generate good code on the compiler you're targeting. There aren't cross-platform intrinsics for any compilers that I know of. These are some of the x86 options:

  • Intel documents that <immintrin.h> provides _rotl and _rotl64 intrinsics, and _rotr / _rotr64 for rotate-right. MSVC requires <intrin.h>, while gcc requires <x86intrin.h>. An #ifdef takes care of gcc vs. icc. Clang 9.0 also has it, but before that it doesn't seem to provide them anywhere, except in MSVC compatibility mode with -fms-extensions -fms-compatibility -fms-compatibility-version=17.00. And the asm it emits for them sucks (extra masking and a CMOV).
  • MSVC: _rotr8 and _rotr16.
  • gcc and icc (not clang): <x86intrin.h> also provides __rolb/__rorb for 8-bit rotate left/right, __rolw/__rorw (16-bit), __rold/__rord (32-bit), __rolq/__rorq (64-bit, only defined for 64-bit targets). For narrow rotates, the implementation uses __builtin_ia32_rolhi or ...qi, but the 32 and 64-bit rotates are defined using shift/or (with no protection against UB, because the code in ia32intrin.h only has to work on gcc for x86). GNU C appears not to have any cross-platform __builtin_rotate functions the way it does for __builtin_popcount (which expands to whatever's optimal on the target platform, even if it's not a single instruction). Most of the time you get good code from idiom-recognition.
// For real use, probably use a rotate intrinsic for MSVC, or this idiom for other compilers.  This pattern of #ifdefs may be helpful
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
// (MSVC defines _M_X64 / _M_IX86, not the GNU-style macros)

#ifdef _MSC_VER
#include <intrin.h>
#else
#include <x86intrin.h>  // Not just <immintrin.h> for compilers other than icc
#endif

uint32_t rotl32_x86_intrinsic(uint32_t x, unsigned n) {
  //return __builtin_ia32_rorhi(x, 7);  // 16-bit rotate, GNU C
  return _rotl(x, n);  // gcc, icc, msvc.  Intel-defined.
  //return __rold(x, n);  // gcc, icc.
  // can't find anything for clang
}
#endif

Presumably some non-x86 compilers have intrinsics, too, but let's not expand this community-wiki answer to include them all. (Maybe do that in the existing answer about intrinsics).


(The old version of this answer suggested MSVC-specific inline asm (which only works for 32bit x86 code), or http://www.devx.com/tips/Tip/14043 for a C version. The comments are replying to that.)

Inline asm defeats many optimizations, especially MSVC-style because it forces inputs to be stored/reloaded. A carefully-written GNU C inline-asm rotate would allow the count to be an immediate operand for compile-time-constant shift counts, but it still couldn't optimize away entirely if the value to be shifted is also a compile-time constant after inlining. https://gcc.gnu.org/wiki/DontUseInlineAsm.
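For illustration only (and again, see the link above for why not to do this), a GNU C inline-asm rotate might look like the following sketch. The "I" constraint alternative lets a compile-time-constant count become an immediate operand; the uint8_t cast is an assumption on my part that makes %1 print as the byte register %cl when the register alternative is chosen:

#include <stdint.h>

static inline uint32_t rotl32_inline_asm (uint32_t x, unsigned int n)
{
  // "c" = count in ecx/cl, "I" = immediate constant 0..31 (x86 constraints)
  __asm__ ("roll %1, %0" : "+r" (x) : "cI" ((uint8_t)n));
  return x;
}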

ROL / ROR on variable using inline assembly in Objective-C

To do this in standard C, you can do:

var = (var << shift) | (var >> (sizeof(var)*CHAR_BIT - shift));

Most compilers will recognise that pattern and optimise it to a single instruction (if the target supports one) anyway. Beware, though, that it has undefined behaviour when shift is 0; the masked idiom above avoids that.

You can read more here: http://en.wikipedia.org/wiki/Circular_shift#Implementing_circular_shifts

How to perform rotate shift in C

(warning to future readers): Wikipedia's code produces sub-optimal asm (gcc includes a branch or cmov). See Best practices for circular shift (rotate) operations in C++ for efficient UB-free rotates.


From Wikipedia:

unsigned int _rotl(unsigned int value, int shift) {
    if ((shift &= 31) == 0)  // mask the count; the early-out avoids shift-by-32 UB below
        return value;
    return (value << shift) | (value >> (32 - shift));
}

unsigned int _rotr(unsigned int value, int shift) {
    if ((shift &= 31) == 0)
        return value;
    return (value >> shift) | (value << (32 - shift));
}

How do I ask the assembler to give me a full size register?

gcc inline asm is a complicated beast. "r" (2) means allocate an int-sized register and load it with the value 2. If you just need an arbitrary scratch register, you can declare a 64-bit early-clobber dummy output, such as "=&r" (dummy) in the output section, with void *dummy declared earlier. You can consult the gcc manual for more details.
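As an illustration (a hypothetical fragment of mine, not from the question), the dummy output lets the compiler pick the scratch register for you:

#include <stdint.h>

// Sketch (GNU C, x86-64): "=&r" (dummy) is an early-clobber dummy output,
// so the compiler allocates a 64-bit scratch register that is guaranteed
// not to alias the inputs.  'dummy' itself is never read.
static inline uint64_t copy_via_scratch (uint64_t v)
{
  void *dummy;
  uint64_t out;
  __asm__ ("movq %2, %1\n\t"  // scratch = v
           "movq %1, %0"      // out = scratch
           : "=r" (out), "=&r" (dummy)
           : "r" (v));
  return out;
}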

As for the final code snippet: it looks like you want a memory barrier, just as the linked email says. See the manual for an example.
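If a compiler-level barrier is all that's required, the classic GNU C formulation is an empty asm statement with a "memory" clobber (note this only blocks compile-time reordering; it is not a CPU fence):

#define barrier() __asm__ __volatile__("" ::: "memory")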


