Using Bts Assembly Instruction with Gcc Compiler

inline void SetBit(int *array, int bit) {
    asm("bts %1,%0" : "+m" (*array) : "r" (bit));
}

assembly intrinsic for atomic bit test and set (BTS)

For blindly setting a bit, `__sync_fetch_and_or` and `__sync_or_and_fetch` are equally good: with the result discarded, the compiler knows to optimize the fetch away.
On x86, gcc won't use `bts`; instead it will simply emit a `lock or`, which should be fine.
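A minimal sketch of that blind-set case (the helper name is mine): with the result unused, `__sync_fetch_and_or` compiles down to a plain `lock or` on x86.

```c
#include <stdint.h>

/* Blind atomic bit set over an array of 32-bit words (helper name is
   mine): the return value of the builtin is discarded, so the compiler
   drops the fetch half of the RMW and emits `lock or` on x86. */
static inline void set_bit_atomic(uint32_t *array, unsigned bit)
{
    __sync_fetch_and_or(&array[bit / 32], UINT32_C(1) << (bit % 32));
}
```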

x86 - setting a bit using inline assembly

How to use the bts instruction in asm

In your code, this line:

bts %0, %%eax;  

Should be replaced with:

bts %%eax, %0;

Explanation

Given the general form asm( "code" : outputs : inputs : clobbers ), GCC replaces %0, %1, and %2 in the "code" with the registers holding the arguments after the colons. The Intel manual defines BTS with the bit string as the first operand and the bit index as the second, so bts %0, %1 seems right, and that is what your code does. However, GCC inline asm uses AT&T syntax, which reverses the operand order: bts wants the bit to set as the first operand and the destination as the second, hence bts %1, %0.

Better solutions

While your code will work with the suggested correction, there are better options like the following:

uint32_t set_bit_assembly2(uint32_t x, uint32_t n)
{
    asm("bts %1,%0"
        : "+r"(x)
        : "r"(n)
        );
    return x;
}

As pointed out by @DavidWohlferd in the comments, we should use "+r", since x is both read and written by the bts instruction.

Moreover, readability can be improved by using symbolic names:

asm("bts %[bit],%[value]"
    : [value] "+rm"(value)
    : [bit] "r"(bit)
    : "cc");

Yet another possibility is:

uint32_t set_bit_assembly3(uint32_t x, uint32_t n)
{
    asm("bts %1,%0" : "+rm"(x) : "r"(n));
    return x;
}

Further reading:

This page may be of great interest to anyone who wants to use bts: http://lxr.free-electrons.com/source/arch/x86/include/asm/bitops.h#L41

In this post Peter Cordes explains why bts on a memory operand is terrible for performance.

Xcode and the _bittest function

The _bittest and _bittest64 symbols are compiler intrinsics that emit bit-test instructions, specifically x86 bt, to examine the value of a bit at a zero-based index.

With a memory operand, bt has crazy-CISC bit-string behaviour where the bit index can go outside the dword/qword of memory selected by the addressing mode. This is slow, and is why compilers load the operand into a register first. But that bit-string behaviour is what the MSVC intrinsic is for; otherwise it wouldn't need to be an intrinsic.

The following C++ matches the behaviour of register-arg version of the bt instruction, wrapping the shift count at the register width, i.e. effectively looking only at the low bits. (This matches the MSVC intrinsic if b is <32 or <64.) See the updated code and comments for discussion of how to implement the MSVC semantics which let it access outside the pointed-to long or long long.

Also beware that long is a 32-bit type in the x64 Windows ABI, but a 64-bit type in the x86-64 System V ABI (which you're using on OS X, unless you build obsolete 32-bit code). You may want to change your code to int32_t or uint32_t to avoid leaving unused bits in each long, depending on how you're using it.

inline
unsigned char bittest(long const *a, long b)
{
    auto const value{ *a };
    auto const mask{ 1L << (b & 31) };
    auto const masked_value{ value & mask };
    return static_cast<unsigned char>(masked_value != 0);
}

inline
unsigned char bittest64(long long const *a, long long b)
{
    auto const value{ *a };
    auto const mask{ 1LL << (b & 63) };
    auto const masked_value{ value & mask };
    return static_cast<unsigned char>(masked_value != 0);
}

I'm not aware of any GCC or Clang intrinsics with identical functionality. If needed, you could resort to emitting assembly instructions from the function implementations instead, but bt with a memory operand is slow so it's normally best to implement in pure C++ and let the compiler do a good job.

Update:

After discussing the code emitted from the intrinsics, it has become clear that the previously proposed replacement code only covers part of the functionality. In particular, the intrinsics allow indexing bits outside the memory occupied by *a. The following implementations account for that as well.

#include <cstdint>

inline
unsigned char bittest(std::int32_t const *a, std::int32_t b)
{
    auto const bits{ reinterpret_cast<unsigned char const*>(a) };
    auto const value{ bits[b >> 3] };
    auto const mask{ (unsigned char)(1 << (b & 7)) };
    return (value & mask) != 0;
}

inline
unsigned char bittest64(std::int64_t const *a, std::int64_t b)
{
    auto const bits{ reinterpret_cast<unsigned char const*>(a) };
    auto const value{ bits[b >> 3] };
    auto const mask{ (unsigned char)(1 << (b & 7)) };
    return (value & mask) != 0;
}
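As a quick sanity check (a self-contained sketch: `bittest_bytes` restates the byte-addressed code above so it compiles on its own, and it assumes a little-endian target, matching `bt`'s bit ordering), an index past bit 31 reaches into the next word:

```cpp
#include <cstdint>

// Restatement of the byte-addressed version above (name is mine, so the
// snippet stands alone). On a little-endian target, indexing bytes this
// way matches bt's bit numbering, and b may reach past the first word.
inline unsigned char bittest_bytes(std::int32_t const *a, std::int32_t b)
{
    auto const bits = reinterpret_cast<unsigned char const *>(a);
    return (bits[b >> 3] & (1 << (b & 7))) != 0;
}
```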

set a bit with inline asm without btsl instruction

Following Michael's example, I've done clear and toggle as well as set:

unsigned long bitset(unsigned long value, uint8_t shift)
{
    unsigned long tmp;

    asm("mov $0x1, %[tempreg]\n\t"
        "shl %[shift], %[tempreg]\n\t"
        "or %[tempreg], %[val]"
        : [val] "+r"(value),
          [tempreg] "=&r"(tmp)
        : [shift] "cN"(shift));

    return value;
}

unsigned long bitclear(unsigned long value, uint8_t shift)
{
    unsigned long tmp;

    asm("mov $0x1, %[tempreg]\n\t"
        "shl %[shift], %[tempreg]\n\t"
        "not %[tempreg]\n\t"
        "and %[tempreg], %[val]"
        : [val] "+r"(value),
          [tempreg] "=&r"(tmp)
        : [shift] "cN"(shift));

    return value;
}

unsigned long toggle(unsigned long value, uint8_t shift)
{
    unsigned long tmp;

    asm("mov $0x1, %[tempreg]\n\t"
        "shl %[shift], %[tempreg]\n\t"
        "xor %[tempreg], %[val]"
        : [val] "+r"(value),
          [tempreg] "=&r"(tmp)
        : [shift] "cN"(shift));

    return value;
}
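For comparison, the non-atomic case doesn't need inline asm at all: written in plain C (the helper names below are mine), optimizing compilers emit the same shl/or/and/xor sequences, or a single bts/btr/btc, on their own.

```c
#include <stdint.h>

/* Plain-C equivalents of the three asm helpers above (names are mine).
   With optimization enabled, GCC and Clang generate shl+or/and/xor, or a
   single register-operand bts/btr/btc, without any inline asm. */
static inline unsigned long bitset_c(unsigned long value, uint8_t shift)
{
    return value | (1UL << shift);
}

static inline unsigned long bitclear_c(unsigned long value, uint8_t shift)
{
    return value & ~(1UL << shift);
}

static inline unsigned long bittoggle_c(unsigned long value, uint8_t shift)
{
    return value ^ (1UL << shift);
}
```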

bit test and set (BTS) on a tbb atomic variable

A compare_and_swap loop can be used, like this:

// Atomically perform i|=j. Return previous value of i.
int bitTestAndSet( tbb::atomic<int>& i, int j ) {
    int o = i; // Atomic read (o = "old value")
    while( (o|j)!=o ) { // Loop exits if another thread sets the bits
        int k = o;
        o = i.compare_and_swap(k|j,k);
        if( o==k ) break; // Successful swap
    }
    return o;
}

Note that if the while condition succeeds on the first try, there will be only an acquire fence, not a full fence. Whether that matters depends on context.

If there is risk of high contention, then some sort of backoff scheme should be used in the loop. TBB uses a class atomic_backoff for contention management internally, but it's not currently part of the public TBB API.
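For readers without TBB, the same loop can be sketched against C++11 atomics (the helper name is mine; `compare_exchange_weak` reloads the expected value on failure, so the loop re-checks whether the bits are already set before retrying):

```cpp
#include <atomic>

// Portable C++11 sketch of the same CAS loop (helper name is mine, not
// part of TBB). Returns the previous value of `flags`; exits early if
// the bits in `mask` are already set.
inline int bit_test_and_set(std::atomic<int> &flags, int mask)
{
    int old = flags.load(std::memory_order_relaxed);
    while ((old | mask) != old &&
           !flags.compare_exchange_weak(old, old | mask))
    {
        // compare_exchange_weak updated `old` on failure; retry
    }
    return old;
}
```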

There is a second way, if portability is not a concern and you are willing to exploit the undocumented fact that the layout of a tbb::atomic and T are the same on x86 platforms. In that case, just operate on the tbb::atomic using assembly code. The program below demonstrates this technique:

#include <tbb/tbb.h>
#include <cstdio>

inline int SetBit(int array[], int bit) {
    int x = 1, y = 0;
    asm("bts %2,%0\ncmovc %3,%1" : "+m" (*array), "+r"(y) : "r" (bit), "r"(x));
    return y;
}

tbb::atomic<int> Flags;
volatile int Result;

int main() {
    for( int i=0; i<16; ++i ) {
        int k = i*i%32;
        std::printf("bit at %2d was %d. Flags=%8x\n", k, SetBit((int*)&Flags,k), +Flags);
    }
}

what is wrong with my version of _bittestandset

Could you give some further explanation of what's not working? Maybe a simple example of the usage. It's hard to guess what's wrong with the code.

One thing that looks cheesy so far: You execute a bit-test opcode but ignore the result. The bit that you test (and set) ends up in the carry flag after the opcode.

If you want to get the result you need an additional instruction to get the carry flag into some other output register. (SBB EAX, EAX or something like that).

Otherwise - if you don't need the result - it's much cheaper to replace the BTS instruction with three simpler assembler opcodes:

Something along these lines:

; bit-index in cl
; value in ebx
; eax used as a temporary (trashed)
mov eax, 1
shl eax, cl
or  ebx, eax

Since I don't have MinGW with your exact version, an assembly listing of a simple test case could give us some clue about what's going wrong.

Atomic test-and-set in x86: inline asm or compiler-generated lock bts?

IIRC, first-gen Xeon Phi is based on P5 cores (Pentium, and Pentium MMX). cmov wasn't introduced until P6 (aka Pentium Pro). So I think this is normal.

Just let the compiler do its job by writing a normal ternary operator.

Also, cmov is a far worse choice for this than setc, since you want to produce a 0 or 1 based on the carry flag. See my asm code below.

Also note that bts with a memory operand is super-slow, so you don't want it to generate that code anyway, esp. on a CPU that decodes x86 instructions into uops (like a modern Xeon). According to http://agner.org/optimize/, bts m, r is much slower than bts m, i even on P5, so don't do that.

Just ask the compiler for the bit index to be in a register, or better yet, just don't use inline asm for this.


Since the OP apparently wants this to work atomically, the best solution is to use C++11's std::atomic::fetch_or, and leave it up to the compiler to generate lock bts.

std::atomic_flag has a test_and_set function, but IDK if there is a way to pack them tightly. Maybe as bitfields in a struct? Unlikely, though. I also don't see atomic operations for std::bitset.
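One way to get tightly packed test-and-set bits without `std::bitset` is an array of atomic words with `fetch_or` doing the test-and-set; a sketch (the helper name is mine):

```cpp
#include <atomic>

// Packed atomic bit array sketch: each word holds 32 bits, and fetch_or
// both sets the bit and reports whether it was already set.
inline bool test_and_set_bit(std::atomic<unsigned> *words, unsigned bit)
{
    unsigned mask = 1u << (bit % 32);
    return words[bit / 32].fetch_or(mask, std::memory_order_acq_rel) & mask;
}
```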

Unfortunately, current versions of gcc and clang don't generate lock bts from fetch_or, even when the much-faster immediate-operand form is usable. I came up with the following:

#include <atomic>
#include <stdio.h>
#include <stdint.h>

// wastes instructions when the return value isn't used.
// gcc 6.0 has syntax for using flags as output operands

// IDK if lock BTS is better than lock cmpxchg.
// However, gcc doesn't use lock BTS even with -Os
int atomic_bts_asm(std::atomic<unsigned> *x, int bit) {
    int retval = 0; // the compiler still provides a zeroed reg as input even if retval isn't used after the asm :/
    // Letting the compiler do the xor means we can use an m constraint, in case this is inlined where we're storing to already-zeroed memory.
    // It unfortunately doesn't help for overwriting a value that's already known to be 0 or 1.
    asm( // "xor %[rv], %[rv]\n\t"
        "lock bts %[bit], %[x]\n\t"
        "setc %b[rv]\n\t" // hope that the compiler zeroed with xor to avoid a partial-register stall
        : [x] "+m" (*x), [rv] "+rm"(retval)
        : [bit] "ri" (bit));
    return retval;
}

// saves an insn when retval isn't used, but still doesn't avoid the setc
// leads to the less-efficient setc / movzbl sequence when the result is needed :/
int atomic_bts_asm2(std::atomic<unsigned> *x, int bit) {
    uint8_t retval;
    asm("lock bts %[bit], %[x]\n\t"
        "setc %b[rv]\n\t"
        : [x] "+m" (*x), [rv] "=rm"(retval)
        : [bit] "ri" (bit));
    return retval;
}

int atomic_bts(std::atomic<unsigned> *x, unsigned int bit) {
    // bit &= 31; // stops gcc from using shlx?
    unsigned bitmask = 1u << bit;
    //int oldval = x->fetch_or(bitmask, std::memory_order_relaxed);

    int oldval = x->fetch_or(bitmask, std::memory_order_acq_rel);
    // acquire and release semantics are free on x86.
    // Also, any atomic RMW needs a lock prefix, which is a full memory barrier (seq_cst) anyway.

    if (oldval & bitmask)
        return 1;
    else
        return 0;
}

As discussed in What is the best way to set a register to zero in x86 assembly: xor, mov or and?, xor / set-flags / setc is the optimal sequence for all modern CPUs when the result is needed as a 0-or-1 value. I haven't actually considered P5 for that, but setcc is fast on P5 so it should be fine.

Of course, if you want to branch on this instead of storing it, the boundary between inline asm and C is an obstacle. Spending two instructions to store a 0 or 1, only to test/branch on it, would be pretty dumb.

gcc6's flag-operand syntax would certainly be worth looking in to, if it's an option. (Probably not if you need a compiler that targets Intel MIC.)
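For reference, the flag-output syntax looks roughly like this sketch (x86-only; the helper name is mine). The "=@ccc" constraint binds the output directly to the carry flag, so no setc appears inside the asm and the compiler can branch on CF directly when the result feeds a branch:

```cpp
#include <atomic>

// Sketch using GCC6+/Clang flag-output operands: "=@ccc" makes the carry
// flag itself the asm output, so the compiler only materializes it with
// setc when the value is actually stored, and branches on CF otherwise.
// x86-only; helper name is mine.
inline bool atomic_bts_flagout(std::atomic<unsigned> *x, unsigned bit)
{
    bool carry;
    asm volatile("lock btsl %[bit], %[mem]"
                 : [mem] "+m"(*x), "=@ccc"(carry)
                 : [bit] "ri"(bit));
    return carry;
}
```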


