Using Base Pointer Register in C++ Inline Asm


See the bottom of this answer for a collection of links to other inline-asm Q&As.

Your code is broken because you step on the red-zone below RSP (with push) where GCC was keeping a value.


What are you hoping to accomplish with inline asm? If you want to learn inline asm, learn to use it to make efficient code, rather than horrible stuff like this. If you want to write function prologues and push/pop to save/restore registers, you should write whole functions in asm. (Then you can easily use nasm or yasm, rather than the less-preferred-by-most AT&T syntax with GNU assembler directives; see footnote 1.)

GNU inline asm is hard to use, but allows you to mix custom asm fragments into C and C++ while letting the compiler handle register allocation and any saving/restoring if necessary. Sometimes the compiler will be able to avoid the save and restore by giving you a register that's allowed to be clobbered. Without volatile, it can even hoist asm statements out of loops when the input would be the same. (i.e. unless you use volatile, the outputs are assumed to be a "pure" function of the inputs.)

If you're just trying to learn asm in the first place, GNU inline asm is a terrible choice. You have to fully understand almost everything that's going on with the asm, and understand what the compiler needs to know, to write correct input/output constraints and get everything right. Mistakes will lead to clobbering things and hard-to-debug breakage. The function-call ABI is a much simpler boundary between your code and the compiler's code, and one that's much easier to keep track of.



Why this breaks

You compiled with -O0, so gcc's code spills the function parameter from %rdi to a location on the stack. (This could happen in a non-trivial function even with -O3).

Since the target ABI is the x86-64 SysV ABI, it uses the "Red Zone" (128 bytes below %rsp that even asynchronous signal handlers aren't allowed to clobber), instead of wasting an instruction decrementing the stack pointer to reserve space.

It stores the 8B pointer function arg at -8(rsp_at_function_entry). Then your inline asm pushes %rbp, which decrements %rsp by 8 and then writes 8 bytes there, overwriting the spilled copy of &x (the pointer).

When your inline asm is done,

  • gcc reloads -8(%rbp) (which has been overwritten with %rbp) and uses it as the address for a 4B store.
  • Foo returns to main with %rbp = (upper32)|5 (orig value with the low 32 set to 5).
  • main runs leave: %rsp = (upper32)|5
  • main runs ret with %rsp = (upper32)|5, reading the return address from virtual address (void*)(upper32|5), which from your comment is 0x7fff0000000d.

I didn't check with a debugger; one of those steps might be slightly off, but the problem is definitely that you clobber the red zone, leading to gcc's code trashing the stack.

Even adding a "memory" clobber doesn't get gcc to avoid using the red zone, so it looks like allocating your own stack memory from inline asm is just a bad idea. (A memory clobber means you might have written some memory you're allowed to write to, e.g. a global variable or something pointed-to by a global, not that you might have overwritten something you're not supposed to.)

If you want to use scratch space from inline asm, you should probably declare an array as a local variable and use it as an output-only operand (which you never read from).

AFAIK, there's no syntax for declaring that you modify the red-zone, so your only options are:

  • use an "=m" output operand (possibly an array) for scratch space; the compiler will probably fill in that operand with an addressing mode relative to RBP or RSP. You can index into it with constants like 4 + %[tmp] or whatever. You might get an assembler warning from 4 + (%rsp) but not an error.
  • skip over the red-zone with add $-128, %rsp / sub $-128, %rsp around your code. (Necessary if you want to use an unknown amount of extra stack space, e.g. push in a loop, or making a function call. Yet another reason to deref a function pointer in pure C, not inline asm.)
  • compile with -mno-red-zone (I don't think you can enable that on a per-function basis, only per-file)
  • Don't use scratch space in the first place. Tell the compiler what registers you clobber and let it save them.
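As a rough illustration of the first option in the list above, here is a minimal sketch that uses a local array as an "=m" output operand for scratch space. The function name sum_via_scratch and the operand names are made up for this example, and the arithmetic is deliberately pointless: it only shows the operand syntax and the 4 + %[tmp] indexing, assuming x86-64 and AT&T syntax.

```cpp
// Hypothetical helper: adds two ints by staging them in scratch memory
// that the compiler allocates, purely to show the operand syntax.
static inline int sum_via_scratch(int a, int b)
{
    int scratch[2];   // scratch space; the surrounding C++ never reads it
    int result;
    asm ("movl %[a], %[tmp]\n\t"        // scratch[0] = a
         "movl %[b], 4 + %[tmp]\n\t"    // scratch[1] = b
         "movl %[tmp], %[res]\n\t"      // res = scratch[0]
         "addl 4 + %[tmp], %[res]"      // res += scratch[1]
         : [tmp] "=m" (scratch),        // compiler picks the addressing mode
           [res] "=&r" (result)         // early-clobber: written before the asm ends
         : [a] "r" (a), [b] "r" (b)
         : "cc");
    return result;
}
```

Because the compiler owns the allocation, it can place the array in the red zone or anywhere else without the asm ever touching RSP.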


Here's what you should have done:

void Bar(int &x)
{
    int tmp;
    long tmplong;
    asm ("lea  -16 + %[mem1], %%rbp\n\t"
         "imul $10, %%rbp, %q[reg1]\n\t"  // q modifier: 64bit name.
         "add  %k[reg1], %k[reg1]\n\t"    // k modifier: 32bit name
         "movl $5, %[mem1]\n\t"           // some asm instruction writing to mem
         : [mem1] "=m" (tmp), [reg1] "=r" (tmplong)  // tmp vars -> tmp regs / mem for use inside asm
         :
         : "%rbp"  // tell compiler it needs to save/restore %rbp.
         // gcc refuses to let you clobber %rbp with -fno-omit-frame-pointer (the default at -O0)
         // clang lets you, but memory operands still use an offset from %rbp, which will crash!
         // gcc memory operands still reference %rsp, so don't modify it. Declaring a clobber on %rsp does nothing
    );
    x = 5;
}

Note the push/pop of %rbp in the code outside the #APP / #NO_APP section, emitted by gcc. Also note that the scratch memory it gives you is in the red zone. If you compile with -O0, you'll see that it's at a different position from where it spills &x.

To get more scratch regs, it's better to just declare more output operands that are never used by the surrounding non-asm code. That leaves register allocation to the compiler, so it can be different when inlined into different places. Choosing ahead of time and declaring a clobber only makes sense if you need to use a specific register (e.g. shift count in %cl). Of course, an input constraint like "c" (count) gets gcc to put the count in rcx/ecx/cx/cl, so you don't emit a potentially redundant mov %[count], %%ecx.
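To make the "c" constraint point concrete, here is a small sketch of a variable-count rotate. The name rotl32 is invented for this example, and compilers already recognise the plain C++ rotate idiom, so treat this as a demonstration of the constraint rather than something worth shipping.

```cpp
#include <cstdint>

// Rotate left by a variable count. The "c" constraint asks the compiler
// to put `count` in rcx/ecx/cl, so the template needs no mov of its own.
static inline std::uint32_t rotl32(std::uint32_t value, std::uint32_t count)
{
    asm ("roll %%cl, %[val]"
         : [val] "+r" (value)   // read-write operand in any register
         : "c" (count)          // variable rotate counts must be in %cl
         : "cc");               // rol modifies CF/OF
    return value;
}
```

If the count is already in ecx at the call site after inlining, the compiler emits no extra instruction at all.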

If this looks too complicated, don't use inline asm. Either lead the compiler to the asm you want with C that's like the optimal asm, or write a whole function in asm.

When using inline asm, keep it as small as possible: ideally just the one or two instructions that gcc isn't emitting on its own, with input/output constraints to tell it how to get data into / out of the asm statement. This is what it's designed for.

Rule of thumb: if your GNU C inline asm starts or ends with a mov, you're usually doing it wrong and should have used a constraint instead.


Footnotes:

  1. You can use GAS's intel-syntax in inline-asm by building with -masm=intel (in which case your code will only work with that option), or using dialect alternatives so it works with the compiler in Intel or AT&T asm output syntax. But that doesn't change the directives, and GAS's Intel-syntax is not well documented. (It's like MASM, not NASM, though.) I don't really recommend it unless you really hate AT&T syntax.
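For reference, dialect alternatives look roughly like the following sketch. The function name add_dialect is hypothetical; the {att|intel} braces are the documented GNU mechanism, and the compiler keeps whichever half matches the output syntax it is using.

```cpp
// The part before '|' is used for AT&T output, the part after it when
// building with -masm=intel. Operand order differs between the two.
static inline int add_dialect(int a, int b)
{
    asm ("add{l %[b], %[a]| %[a], %[b]}"
         : [a] "+r" (a)
         : [b] "r" (b)
         : "cc");
    return a;
}
```

With the default AT&T output this expands to something like addl %esi, %edi; with -masm=intel it becomes add edi, esi.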


Inline asm links:

  • x86 wiki. (The tag wiki also links to this question, for this collection of links)

  • The inline-assembly tag wiki

  • The manual. Read this. Note that inline asm was designed to wrap single instructions that the compiler doesn't normally emit. That's why it's worded to say things like "the instruction", not "the block of code".

  • A tutorial

  • Looping over arrays with inline assembly Using r constraints for pointers/indices and using your choice of addressing mode, vs. using m constraints to let gcc choose between incrementing pointers vs. indexing arrays.

  • How can I indicate that the memory *pointed* to by an inline ASM argument may be used? (pointer inputs in registers do not imply that the pointed-to memory is read and/or written, so it might not be in sync if you don't tell the compiler).

  • In GNU C inline asm, what're the modifiers for xmm/ymm/zmm for a single operand?. Using %q0 to get %rax vs. %w0 to get %ax. Using %g[scalar] to get %zmm0 instead of %xmm0.

  • Efficient 128-bit addition using carry flag Stephen Canon's answer explains a case where an early-clobber declaration is needed on a read+write operand. Also note that x86/x86-64 inline asm doesn't need to declare a "cc" clobber (the condition codes, aka flags); it's implicit. (gcc6 introduces syntax for using flag conditions as input/output operands. Before that you have to setcc a register that gcc will emit code to test, which is obviously worse.)

  • Questions about the performance of different implementations of strlen: my answer on a question with some badly-used inline asm, with an answer similar to this one.

  • llvm reports: unsupported inline asm: input with type 'void *' matching output with type 'int': Using offsetable memory operands (in x86, all effective addresses are offsettable: you can always add a displacement).

  • When not to use inline asm, with an example of 32b/32b => 32b division and remainder that the compiler can already do with a single div. (The code in the question is an example of how not to use inline asm: many instructions for setup and save/restore that should be left to the compiler by writing proper in/out constraints.)

  • MSVC inline asm vs. GNU C inline asm for wrapping a single instruction, with a correct example of inline asm for 64b/32b=>32bit division. MSVC's design and syntax require a round trip through memory for inputs and outputs, making it terrible for short functions. It's also "never very reliable" according to Ross Ridge's comment on that answer.

  • Using x87 floating point, and commutative operands. Not a great example, because I didn't find a way to get gcc to emit ideal code.

Some of those re-iterate some of the same stuff I explained here. I didn't re-read them to try to avoid redundancy, sorry.
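The flag-output syntax mentioned in the 128-bit addition entry above can be sketched like this, assuming GCC 6 or later (or a clang version that supports flag outputs). The helper name add_carry_out is made up for the example; "=@ccc" is the x86 constraint that reads the carry flag as an output.

```cpp
#include <cstdint>

// Adds b to a and reports the carry-out, without a setcc in the template.
static inline bool add_carry_out(std::uint64_t a, std::uint64_t b,
                                 std::uint64_t* sum)
{
    bool carry;
    asm ("add %[b], %[a]"
         : [a] "+r" (a),
           "=@ccc" (carry)      // output taken from the carry flag
         : [b] "r" (b));
    *sum = a;
    return carry;
}
```

The compiler can then branch on the flag directly or materialise it with setc, whichever suits the surrounding code.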

When writing a function in C/C++ that uses inline assembly (x86-64), is it safe to use any GPR (rax to r15) whenever you want to?

No, it is not correct that "data are actually stored in the main memory, and it is only loaded in the registers when we are performing operations with it". Compilers work very hard to keep data in the registers and only spill to memory if required.

You have to tell the compiler about all the registers you use in the inline assembly, so it can ensure that they are available for you. None of the registers are available to use freely.

To indicate to the compiler that you are modifying the contents of registers, you list them as outputs or in the clobber list.
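A small sketch of what that looks like in practice: the one-operand mul instruction implicitly reads EAX and writes EDX:EAX, so both registers are declared as operands instead of being used behind the compiler's back. The name mul32x32 is invented here, and a compiler generates this instruction from plain C++ anyway; the point is only the constraint list.

```cpp
#include <cstdint>

// 32x32 => 64-bit multiply using the one-operand form of mul.
static inline std::uint64_t mul32x32(std::uint32_t x, std::uint32_t y)
{
    std::uint32_t lo, hi;
    asm ("mull %[y]"
         : "=a" (lo), "=d" (hi)   // results arrive in EAX (low) and EDX (high)
         : "a" (x),               // first factor must be in EAX
           [y] "rm" (y)           // second factor: any register or memory
         : "cc");                 // mul modifies the flags
    return (static_cast<std::uint64_t>(hi) << 32) | lo;
}
```

Nothing is listed as a plain clobber here because every register the instruction touches is already an input or an output.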

inline assembly + pointer management

Your first example is the most correct, but it has the following errors:

  • It uses 32 bit registers instead of 64 bit.
  • 3 registers are changed which are not specified as outputs or clobbers.
  • EAX is loaded with source address, not the size.
  • dst is declared to be an output, when it should be an input.
  • The arguments for the add instruction are the wrong way round, in AT&T syntax the destination register is last.
  • A non-local label is used, which will fail if the asm statement gets duplicated, for example by inlining.

And the following performance issues:

  • The sz parameter is passed by reference. (May also impair optimisations in calling functions)
  • It is then passed into the asm as a memory argument, which requires it is written to memory.
  • Then it is copied to another register.
  • Fixed registers are used instead of letting the compiler choose.

Here is a fixed version, which is no faster than the equivalent C++ with intrinsics:

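#include <cassert>
#include <cstddef>
#include <cstdint>
#include <immintrin.h>

void my_memcpy(const std::uint8_t* in, std::uint8_t* out, const std::size_t sz)
{
    std::size_t count = 0;
    __m256i temp;   // scratch vector register chosen by the compiler

    assert(sz != 0 && sz % 32 == 0);   // the loop body always runs at least once

    __asm__ volatile(
        "1: \n"    // local numeric label: safe if the statement is duplicated
        "vmovntdqa (%[src],%[count]), %[temp] \n"
        "vmovntdq %[temp], (%[dst],%[count]) \n"
        "add $32, %[count] \n"
        "cmp %[sz], %[count] \n"
        "jnz 1b \n"    // keep looping until count == sz
        : [count] "+r"(count), [temp] "=x"(temp)
        : [dst] "r"(out), [src] "r"(in), [sz] "r"(sz)
        : "memory", "cc"
    );
}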

The source and destination parameters are the other way round from memcpy, which is potentially confusing.

The Intel-syntax version you added also fails to use the correct syntax to refer to arguments (e.g. %[dst]).

GCC inline assembly with stack operation

Modifying ESP inside inline-asm should generally be avoided when you have any memory inputs / outputs, so you don't have to disable optimizations or force the compiler to make a stack-frame with EBP some other way. One major advantage is that you (or the compiler) can then use EBP as an extra free register; potentially a significant speedup if you're already having to spill/reload stuff. If you're writing inline asm, presumably this is a hotspot so it's worth spending the extra code-size to use ESP-relative addressing modes.

In x86-64 code, there's an added obstacle to using push/pop safely, because you can't tell the compiler you want to clobber the red-zone below RSP. (You can compile with -mno-red-zone, but there's no way to disable it from the C source.) You can get problems like this where you clobber the compiler's data on the stack. No 32-bit x86 ABI has a red-zone, though, so this only applies to x86-64 System V. (Or non-x86 ISAs with a red-zone.)
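If you really must push/pop inside an asm statement on x86-64 System V, one way to stay out of the compiler's red zone is to step over it first. The sketch below reads RFLAGS that way. The name read_rflags is made up, and it uses lea rather than add/sub to move RSP because lea leaves the flags untouched, which matters when the flags are the thing being read.

```cpp
#include <cstdint>

// Reads RFLAGS with pushfq/pop while leaving the 128-byte red zone intact.
static inline std::uint64_t read_rflags()
{
    std::uint64_t flags;
    asm volatile ("lea -128(%%rsp), %%rsp\n\t"  // skip the red zone (flags unchanged)
                  "pushfq\n\t"                  // push RFLAGS below it
                  "pop %[flags]\n\t"            // pop into a compiler-chosen register
                  "lea 128(%%rsp), %%rsp"       // restore RSP to its entry value
                  : [flags] "=r" (flags));
    return flags;
}
```

RSP has the same value on exit as on entry, which is what the compiler requires, and there are no memory operands whose RSP-relative addresses could go stale in between.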

You only need -fno-omit-frame-pointer for that function if you want to do asm-only stuff like push as a stack data structure, so there's a variable amount of push. Or maybe if optimizing for code-size.

You can always write a whole non-inline function in asm and put it in a separate file, then you have full control. But only do that if your function is large enough to be worth the call/ret overhead, e.g. if it includes a whole loop; don't make the compiler call a short non-looping function inside a C inner loop, destroying all the call-clobbered registers and having to make sure globals are in sync.


It seems you're using push / pop inside inline asm because you don't have enough registers, and need to save/reload something. You don't need to use push/pop for save/restore. Instead, use dummy output operands with "=m" constraints to get the compiler to allocate stack space for you, and use mov to/from those slots. (Of course you're not limited to mov; it can be a win to use a memory source operand for an ALU instruction if you only need the value once or twice.)

This may be slightly worse for code-size, but is usually not worse for performance (and can be better). If that's not good enough, write the whole function (or the whole loop) in asm so you don't have to wrestle with the compiler.

int foo(char *p, int a, int b) {
    int t1,t2;  // dummy output spill slots
    int r1,r2;  // dummy output tmp registers
    int res;

    asm ("# operands: %0  %1  %2  %3  %4  %5  %6  %7  %8\n\t"
         "imull  $123, %[b], %[res]\n\t"
         "mov   %[res], %[spill1]\n\t"
         "mov   %[a], %%ecx\n\t"
         "mov   %[b], %[tmp1]\n\t"  // let the compiler allocate tmp regs, unless you need specific regs e.g. for a shift count
         "mov   %[spill1], %[res]\n\t"
         : [res] "=&r" (res),
           [tmp1] "=&r" (r1), [tmp2] "=&r" (r2),  // early-clobber
           [spill1] "=m" (t1), [spill2] "=&rm" (t2)  // allow spilling to a register if there are spare regs
           , [p] "+&r" (p)
           , "+m" (*(char (*)[]) p) // dummy in/output instead of memory clobber
         : [a] "rmi" (a), [b] "rm" (b)  // a can be an immediate, but b can't
         : "ecx"
    );

    return res;

    // p unused in the rest of the function
    // so it's really just an input to the asm,
    // which the asm is allowed to destroy
}

This compiles to the following asm with gcc7.3 -O3 -m32 on the Godbolt compiler explorer. Note the asm-comment showing what the compiler picked for all the template operands: it picked 12(%esp) for %[spill1] and %edi for %[spill2] (because I used "=&rm" for that operand, so the compiler saved/restored %edi outside the asm, and gave it to us for that dummy operand).

foo(char*, int, int):
pushl %ebp
pushl %edi
pushl %esi
pushl %ebx
subl $16, %esp
movl 36(%esp), %edx
movl %edx, %ebp
#APP
# 19 "/tmp/compiler-explorer-compiler118120-55-w92ge8.v797i/example.cpp" 1
# operands: %eax %ebx %esi 12(%esp) %edi %ebp (%edx) 40(%esp) 44(%esp)
imull $123, 44(%esp), %eax
mov %eax, 12(%esp)
mov 40(%esp), %ecx
mov 44(%esp), %ebx
mov 12(%esp), %eax

# 0 "" 2
#NO_APP
addl $16, %esp
popl %ebx
popl %esi
popl %edi
popl %ebp
ret

Hmm, the dummy memory operand to tell the compiler which memory we modify seems to have resulted in dedicating a register to that, I guess because the p operand is early-clobber so it can't use the same register. I guess you could risk leaving off the early-clobber if you're confident none of the other inputs will use the same register as p. (i.e. that they don't have the same value).

Accessing a register without using inline assembly with gcc

There's a shortcut:

register long rsp asm ("rsp");

Demo:

#include <stdio.h>

void foo(void)
{
    register long rsp asm ("rsp");
    printf("RSP: %lx\n", rsp);
}

int main()
{
    register long rsp asm ("rsp");
    printf("RSP: %lx\n", rsp);
    foo();
    return 0;
}

Gives:

 $ gdb ./a.out 
GNU gdb (Gentoo 7.2 p1) 7.2
...
Reading symbols from /home/user/tmp/a.out...done.
(gdb) break foo
Breakpoint 1 at 0x400538: file t.c, line 7.
(gdb) r
Starting program: /home/user/tmp/a.out
RSP: 7fffffffdb90

Breakpoint 1, foo () at t.c:7
7 printf("RSP: %lx\n", rsp);
(gdb) info registers
....
rsp 0x7fffffffdb80 0x7fffffffdb80
....
(gdb) n
RSP: 7fffffffdb80
8 }

Taken from the Variables in Specified Registers documentation.

Setting a C array as a new call stack (ESP) from inline asm?

Maybe this can work as a huge unsafe hack that only works in toy experiments. If you want to set a new stack, do that in hand-written asm before calling a C function.

This hack could work for the call to foo(), but what about the return 0;? Compiler-generated code will try to pop a return address from the current %esp.

(Or if optimization is disabled, will use leave which sets ESP = EBP before popping a saved EBP. That would switch back to the initial stack. So the behaviour depends on optimization level! You don't want that.)

Use GDB to single-step your code and actually watch reg values change, e.g. with layout reg.

But yes, &myStack + 1 is the address of one-past-the-end of the array, and does result in movl $myStack+1024, %eax as setup for the asm statement (where %0 expands to %eax in the template, because the compiler picked that register for the "rm" operand. You didn't give it the option of an immediate constant or it would have just done that with movl $myStack+1024, %esp).

https://godbolt.org/z/Nz6DgA shows that it "works" and will then immediately crash when it reaches the ret with optimization enabled, because it tries to pop with ESP pointing at one-past-the-end of myStack.

Regarding your comment: "I am currently toying around implementation of threads in kernel level, hence idea is to assign separate stack for each thread, and shift between them"

Especially if main has to actually return, then yes, you need to set up a stack before using it to call anything. Otherwise that final return address will be on the wrong stack!

For example, under Linux the pthread library creates a new thread with a new stack by allocating it with mmap(), then passing that stack address as an operand to clone(). So the new thread is never using the parent's stack, only ever its own stack. The kernel-side thread-stack creation for a new task is, I assume, similar. You allocate a new stack, then use it for the new thread context.

You might put a "return address" at the top so the first function called in the new thread will actually return to a thread-exit / cleanup function. Possibly with some asm for that. Or make the actual thread entry point a function that doesn't return, instead cleaning up the thread context and switching to another thread or calling your scheduler or something.

And your follow-up: "That is kind of what I am going for, having thread using its own stack as opposed to main one. However, I do not have paging enabled and would like to implement stack creation as simple as possible for now (thus C array seemed like easy solution)"

Unfortunately this is too simple and doesn't actually work.

Yes, you can use a C array for a thread stack (if you have exactly one extra thread...), the problem is how you're switching to it.

You're going to need to write a context-switch function at some point which saves one register context and loads another. (Google for examples; you can find a few here on Stack Overflow, and probably something on https://www.osdev.org/.)

Create a new thread context struct in memory with its stack pointer pointing to the top of your thread stack, and its EIP pointing to the thread entry point. Call your context-switch function to switch to that new context.
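The setup step (not the switch itself) can be sketched in plain C++. Everything named here is hypothetical: ThreadContext, make_context and thread_exit_stub are invented for illustration, a real context struct would also hold the call-preserved registers, and the layout has to match whatever your hand-written context-switch routine expects.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Hypothetical saved-context layout for a thread that is not running.
struct ThreadContext {
    std::uintptr_t sp;  // stack pointer to load when switching in
    std::uintptr_t ip;  // address where the thread starts executing
};

// Hypothetical cleanup routine that the entry function "returns" into.
inline void thread_exit_stub() { std::abort(); }

// Build the initial context for a thread that has never run.
// Assumes `size` is a multiple of the pointer size.
inline ThreadContext make_context(void (*entry)(), unsigned char* stack,
                                  std::size_t size, void (*on_return)())
{
    // Stacks grow down, so start from one-past-the-end of the array and
    // reserve one slot for a fake return address.
    unsigned char* top = stack + size - sizeof(std::uintptr_t);
    const std::uintptr_t ret = reinterpret_cast<std::uintptr_t>(on_return);
    std::memcpy(top, &ret, sizeof ret);
    return ThreadContext{reinterpret_cast<std::uintptr_t>(top),
                         reinterpret_cast<std::uintptr_t>(entry)};
}
```

A context-switch routine written in asm would then load sp into the stack pointer and jump to ip; if the entry function ever returns, it pops the fake return address and lands in the cleanup stub.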

From the POV of the C compiler, a context-switch function just looks like any other function call. It returns eventually, and may have modified any globally-reachable C objects. It doesn't matter that it temporarily had ESP pointing somewhere else. "Like any other function call" includes clobbering call-clobbered registers, BTW, so you don't need to save/restore EAX/ECX/EDX. The caller of the context-switch function already assumes they're destroyed.

You should generally hand-write that in asm, not inline asm. Changing ESP from inline asm is fraught with peril, and is officially documented as not supported by GCC.

In the manual's words: "This is because the compiler requires the value of the stack pointer to be the same after an asm statement as it was on entry to the statement."

See also https://gcc.gnu.org/wiki/DontUseInlineAsm

Looping over arrays with inline assembly

Avoid inline asm whenever possible: https://gcc.gnu.org/wiki/DontUseInlineAsm. It blocks many optimizations. But if you really can't hand-hold the compiler into making the asm you want, you should probably write your whole loop in asm so you can unroll and tweak it manually, instead of doing stuff like this.


You can use an r constraint for the index. Use the q modifier to get the name of the 64bit register, so you can use it in an addressing mode. When compiled for 32bit targets, the q modifier selects the name of the 32bit register, so the same code still works.

If you want to choose what kind of addressing mode is used, you'll need to do it yourself, using pointer operands with r constraints.

GNU C inline asm syntax doesn't assume that you read or write memory pointed to by pointer operands. (e.g. maybe you're using an inline-asm "and" instruction on the pointer value itself.) So you need to do something with either a "memory" clobber or memory input/output operands to let it know what memory you modify. A "memory" clobber is easy, but forces everything except locals to be spilled/reloaded. See the Clobbers section in the docs for an example of using a dummy input operand.

Specifically, a "m" (*(const float (*)[]) fptr) will tell the compiler that the entire array object is an input, arbitrary-length. i.e. the asm can't reorder with any stores that use fptr as part of the address (or that use the array it's known to point into). Also works with an "=m" or "+m" constraint (without the const, obviously).

Using a specific size like "m" (*(const float (*)[4]) fptr) lets you tell the compiler what you do/don't read. (Or write). Then it can (if otherwise permitted) sink a store to a later element past the asm statement, and combine it with another store (or do dead-store elimination) of any stores that your inline asm doesn't read.
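A minimal sketch of the fixed-size form, assuming x86-64: the asm reads through a pointer held in a register, and an unnamed dummy "m" operand tells the compiler exactly which two elements are read. The function name load_second is invented for this example.

```cpp
// Loads p[1] from inside the asm. The dummy memory input covers p[0..1],
// so earlier stores to those elements cannot be sunk past the statement
// or treated as dead.
static inline int load_second(const int* p)
{
    int out;
    asm ("movl 4(%[p]), %[out]"
         : [out] "=r" (out)
         : [p] "r" (p),
           "m" (*(const int (*)[2]) p));  // dummy operand, never named in the template
    return out;
}
```

Without the dummy operand (or a "memory" clobber), the compiler would be free to assume the asm only depends on the pointer value, and could reuse a stale result after the array changes.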

(See How can I indicate that the memory *pointed* to by an inline ASM argument may be used? for a whole Q&A about this.)


Another huge benefit to an m constraint is that -funroll-loops can work by generating addresses with constant offsets. Doing the addressing ourselves prevents the compiler from doing a single increment every 4 iterations or something, because every source-level value of i needs to appear in a register.


Here's my version, with some tweaks as noted in comments. This is not optimal, e.g. can't be unrolled efficiently by the compiler.

#include <immintrin.h>
void add_asm1_memclobber(float *x, float *y, float *z, unsigned n) {
    __m128 vectmp;  // let the compiler choose a scratch register
    for(int i=0; i<n; i+=4) {
        __asm__ __volatile__ (
            "movaps   (%[y],%q[idx],4), %[vectmp]\n\t"  // q modifier: 64bit version of a GP reg
            "addps    (%[x],%q[idx],4), %[vectmp]\n\t"
            "movaps   %[vectmp], (%[z],%q[idx],4)\n\t"
            : [vectmp] "=x" (vectmp)  // "=m" (z[i])  // gives worse code if the compiler prepares a reg we don't use
            : [z] "r" (z), [y] "r" (y), [x] "r" (x),
              [idx] "r" (i) // unrolling is impossible this way (without an insn for every increment by 4)
            : "memory"
            // you can avoid a "memory" clobber with dummy input/output operands
        );
    }
}

Godbolt compiler explorer asm output for this and a couple versions below.

Your version needs to declare %xmm0 as clobbered, or you will have a bad time when this is inlined. My version uses a temporary variable as an output-only operand that's never used. This gives the compiler full freedom for register allocation.

If you want to avoid the "memory" clobber, you can use dummy memory input/output operands like "m" (*(const __m128*)&x[i]) to tell the compiler which memory is read and written by your function. This is necessary to ensure correct code-generation if you did something like x[4] = 1.0; right before running that loop. (And even if you didn't write something that simple, inlining and constant propagation can boil it down to that.) And also to make sure the compiler doesn't read from z[] before the loop runs.

In this case, we get horrible results: gcc5.x actually increments 3 extra pointers because it decides to use [reg] addressing modes instead of indexed. It doesn't know that the inline asm never actually references those memory operands using the addressing mode created by the constraint!

# gcc5.4 with dummy constraints like "=m" (*(__m128*)&z[i]) instead of "memory" clobber
.L11:
movaps (%rsi,%rax,4), %xmm0 # y, i, vectmp
addps (%rdi,%rax,4), %xmm0 # x, i, vectmp
movaps %xmm0, (%rdx,%rax,4) # vectmp, z, i

addl $4, %eax #, i
addq $16, %r10 #, ivtmp.19
addq $16, %r9 #, ivtmp.21
addq $16, %r8 #, ivtmp.22
cmpl %eax, %ecx # i, n
ja .L11 #,

r8, r9, and r10 are the extra pointers that the inline asm block doesn't use.

You can use a constraint that tells gcc an entire array of arbitrary length is an input or an output: "m" (*(const char (*)[]) pStr). This casts the pointer to a pointer-to-array (of unspecified size). See How can I indicate that the memory *pointed* to by an inline ASM argument may be used?

If we want to use indexed addressing modes, we will have the base address of all three arrays in registers, and this form of constraint asks for the base address (of the whole array) as an operand, rather than a pointer to the current memory being operated on.

This actually works without any extra pointer or counter increments inside the loop: (avoiding a "memory" clobber, but still not easily unrollable by the compiler).

void add_asm1_dummy_whole_array(const float *restrict x, const float *restrict y,
                                float *restrict z, unsigned n) {
    __m128 vectmp;  // let the compiler choose a scratch register
    for(int i=0; i<n; i+=4) {
        __asm__ __volatile__ (
            "movaps   (%[y],%q[idx],4), %[vectmp]\n\t"  // q modifier: 64bit version of a GP reg
            "addps    (%[x],%q[idx],4), %[vectmp]\n\t"
            "movaps   %[vectmp], (%[z],%q[idx],4)\n\t"
            : [vectmp] "=x" (vectmp)
            , "=m" (*(float (*)[]) z)  // "=m" (z[i])  // gives worse code if the compiler prepares a reg we don't use
            : [z] "r" (z), [y] "r" (y), [x] "r" (x),
              [idx] "r" (i) // unrolling is impossible this way (without an insn for every increment by 4)
            , "m" (*(const float (*)[]) x),
              "m" (*
