Why does printf still work with RAX lower than the number of FP args in XMM registers?

The x86-64 SysV ABI's rules allow implementations that save only the exact number of XMM regs that AL specifies, but current implementations only check for zero vs. non-zero, because that's efficient, especially for the common AL=0 case.

If you pass a number in AL¹ lower than the actual number of XMM register args, or a number higher than 8, you'd be violating the ABI, and it's only this implementation detail that stops your code from breaking. (i.e. it "happens to work", but is not guaranteed by any standard or documentation, and isn't portable to some other real implementations, like older GNU/Linux distros that were built with GCC 4.5 or earlier.)

This Q&A shows a current build of glibc printf which just checks for AL!=0, vs. an old build of glibc which computes a jump target into a sequence of movaps stores. (That Q&A is about that code breaking when AL>8, making the computed jump go somewhere it shouldn't.)

Why does eax contain the number of vector parameters? quotes the ABI doc, and shows ICC code-gen which similarly does a computed jump using the same instructions as old GCC.


Glibc's printf implementation is compiled from C source, normally by GCC. When modern GCC compiles a variadic function like printf, it makes asm that only checks for a zero vs. non-zero AL, dumping all 8 arg-passing XMM registers to an array on the stack if non-zero.

GCC4.5 and earlier actually did use the number in AL to do a computed jump into a sequence of movaps stores, to only actually save as many XMM regs as necessary.

Nate's simple example from comments on Godbolt with GCC4.5 vs. GCC11 shows the same difference as the linked answer with disassembly of old/new glibc (built by GCC), unsurprisingly. This function only ever uses va_arg(v, double);, never integer types, so it doesn't dump the incoming RDI...R9 anywhere, unlike printf. And it's a leaf function so it can use the red-zone (128 bytes below RSP).
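
Something along these lines in C (a hypothetical reconstruction of Nate's example, not his exact source) produces that kind of code-gen, because only va_arg(v, double) is ever used:

#include <stdarg.h>

// Hypothetical reconstruction: only va_arg(v, double) is used, so the
// prologue only needs to save the XMM arg regs, never RDI...R9.
double add_them(int n, ...)
{
    va_list v;
    va_start(v, n);             // forces the incoming XMM args to be dumped
    double sum = 0;
    for (int i = 0; i < n; i++)
        sum += va_arg(v, double);
    va_end(v);
    return sum;
}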

# GCC4.5.3 -O3 -fPIC    to compile like glibc would
add_them:
        movzx   eax, al
        sub     rsp, 48                 # reserve stack space, needed either way
        lea     rdx, 0[0+rax*4]         # each movaps is 4 bytes long
        lea     rax, .L2[rip]           # code pointer to after the last movaps
        lea     rsi, -136[rsp]          # used later by va_arg. test/jz version does the same, but after the movaps stores
        sub     rax, rdx
        lea     rdx, 39[rsp]            # used later by va_arg, test/jz version also does an LEA like this
        jmp     rax                     # AL=0 case jumps to .L2
        movaps  XMMWORD PTR -15[rdx], xmm7    # using RDX as a base makes each movaps 4 bytes long, vs. 5 with RSP
        movaps  XMMWORD PTR -31[rdx], xmm6
        movaps  XMMWORD PTR -47[rdx], xmm5
        movaps  XMMWORD PTR -63[rdx], xmm4
        movaps  XMMWORD PTR -79[rdx], xmm3
        movaps  XMMWORD PTR -95[rdx], xmm2
        movaps  XMMWORD PTR -111[rdx], xmm1
        movaps  XMMWORD PTR -127[rdx], xmm0   # xmm0 last, will be ready for store-forwarding last
.L2:
        lea     rax, 56[rsp]            # first stack arg (if any), I think
        ## rest of the function

vs.

# GCC11.2 -O3 -fPIC
add_them:
        sub     rsp, 48
        test    al, al
        je      .L15                    # only one test&branch macro-fused uop
        movaps  XMMWORD PTR -88[rsp], xmm0    # xmm0 first
        movaps  XMMWORD PTR -72[rsp], xmm1
        movaps  XMMWORD PTR -56[rsp], xmm2
        movaps  XMMWORD PTR -40[rsp], xmm3
        movaps  XMMWORD PTR -24[rsp], xmm4
        movaps  XMMWORD PTR -8[rsp], xmm5
        movaps  XMMWORD PTR 8[rsp], xmm6
        movaps  XMMWORD PTR 24[rsp], xmm7
.L15:
        lea     rax, 56[rsp]            # first stack arg (if any), I think
        lea     rsi, -136[rsp]          # used by va_arg. done after the movaps stores instead of before.
        ...
        lea     rdx, 56[rsp]            # used by va_arg. With a different offset than older GCC, but used somewhat similarly. Redundant with the LEA into RAX; silly compiler.

GCC presumably changed strategy because the computed jump takes more static code size (I-cache footprint), and a test/jz is easier to predict than an indirect jump. Even more importantly, it's fewer uops executed in the common AL=0 (no-XMM) case². And not many more even for the AL=1 worst case (7 dead movaps stores, but no work done computing a branch target).


Related Q&As:

  • Assembly executable doesn't show anything (x64): AL != 0 vs. computed-jump code-gen for glibc printf
  • Why is %eax zeroed before a call to printf? (shows modern GCC code-gen)
  • Why does eax contain the number of vector parameters? (ABI documentation references for why it's like that)
  • mold and lld not linking against libc correctly: discussion of various possible ABI violations and other ways a program could fail to work when calling printf from _start (depending on dynamic-linker hooks to get libc startup functions called).

Semi-related while we're talking about calling-convention violations:

  • glibc scanf Segmentation faults when called from a function that doesn't align RSP (and even more recently, also printf with AL=0, using movaps somewhere other than dumping XMM args to the stack)


Footnote 1: AL, not RAX, is what matters

The x86-64 System V ABI doc specifies that variadic functions must look only at AL for the number of regs; the high 7 bytes of RAX are allowed to hold garbage. mov eax, 3 is an efficient way to set AL, avoiding possible false dependencies from writing a partial register, although it is larger in machine-code size (5 bytes) than mov al,3 (2 bytes). clang typically uses mov al, 3.

Key points from the ABI doc; see Why does eax contain the number of vector parameters? for more context:

The prologue should use %al to avoid unnecessarily saving XMM registers. This is especially important for integer only programs to prevent the initialization of the XMM unit.

(That last point is obsolete: XMM regs are widely used for memcpy/memset and inlined to zero-init small arrays / structs. So much so that Linux uses "eager" FPU save/restore on context switches, not "lazy" where the first use of an XMM reg faults.)

The contents of %al do not need to match exactly the number of registers, but must be an upper bound on the number of vector registers used and is in the range 0–8 inclusive.

This ABI guarantee of AL <= 8 is what allows computed-jump implementations to omit bounds checks. (Similarly, Does the C++ standard allow for an uninitialized bool to crash a program? shows that compilers may assume ABI violations don't happen, e.g. by generating code that crashes if they do.)



Footnote 2: efficiency of the two strategies

Smaller static code-size (I-cache footprint) is always a good thing, and the AL!=0 strategy has that in its favour.

Most importantly, fewer total instructions executed for the AL==0 case. printf isn't the only variadic function; sscanf is not rare, and it never takes FP args (only pointers). If a compiler can see that a variadic function never uses va_arg with an FP type, it omits saving the XMM registers entirely, making this point moot. But the scanf/printf functions are normally implemented as wrappers for the vfscanf / vfprintf calls: the compiler only sees a va_list being passed to another function, so it has to save everything. (I think it's fairly rare for people to write their own variadic functions, so in a lot of programs the only calls to variadic functions will be to library functions.)
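
For example, here's roughly what such a wrapper looks like (a sketch, not glibc's actual source): because the va_list escapes to vfprintf, the compiler can't prove the FP args are unused, so the full register-dump prologue is needed.

#include <stdarg.h>
#include <stdio.h>

int my_printf(const char *fmt, ...)       // hypothetical wrapper
{
    va_list ap;
    va_start(ap, fmt);                    // prologue dumped RDI..R9 and (if AL!=0) XMM0..7
    int ret = vfprintf(stdout, fmt, ap);  // ap escapes: everything must be saved
    va_end(ap);
    return ret;
}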

Out-of-order exec can chew through the dead stores just fine in the non-zero AL<8 cases, thanks to wide pipelines and store buffers, getting started on the real work in parallel with those stores happening.

Computing and doing the indirect jump takes 5 total instructions, not counting the lea rsi, -136[rsp] and lea rdx, 39[rsp]. The test/jz strategy also does those or similar, just after the movaps stores, as setup for the va_arg code which has to figure out when it gets to the end of the register-save area and switch to looking at stack args.

I'm also not counting the sub rsp, 48 either; that's necessary either way, unless you make the XMM-save-area size variable as well, or save only the low half of each XMM reg so 8 x 8 B = 64 bytes would fit in the red-zone. In theory, variadic functions can take a 16-byte __m128d arg in an XMM reg, so GCC uses movaps instead of movlps. (I'm not sure if glibc printf has any conversions that would take one.) And in non-leaf functions like actual printf, you'd always need to reserve more space instead of using the red-zone. (This is one reason for the lea rdx, 39[rsp] in the computed-jump version: every movaps needs to be exactly 4 bytes, so the compiler's recipe for generating that code has to make sure the offsets are in the [-128, +127] range of a [reg+disp8] addressing mode, and not 0, unless GCC was going to use special asm syntax to force a longer instruction there.)
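
For instance, GCC accepts a variadic __m128d arg and va_arg of that type, which is why the full 16-byte slots matter (a minimal sketch, hypothetical function name):

#include <immintrin.h>
#include <stdarg.h>

__m128d first_vec(int n, ...)
{
    va_list v;
    va_start(v, n);
    __m128d x = va_arg(v, __m128d);   // reloads the full 16-byte slot that movaps saved
    va_end(v);
    return x;
}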

Almost all x86-64 CPUs run 16-byte stores as a single micro-fused uop (with only crusty old AMD K8 and Bobcat splitting them into 8-byte halves; see https://agner.org/optimize/), and we'd usually be touching stack space below that 128-byte area anyway. (Also, the computed-jump strategy stores to the bottom itself, so it doesn't avoid touching that cache line.)

So for a function with one XMM arg, the computed-jump version takes 6 total single-uop instructions (5 integer ALU/jump, one movaps) to get the XMM arg saved.

The test/jz version takes 9 total uops (10 instructions but test/jz macro-fuse in 64-bit mode on Intel since Nehalem, AMD since Bulldozer IIRC). 1 macro-fused test-and-branch, and 8 movaps stores.

And that's the best case for the computed-jump version: with more XMM args, it still runs 5 instructions to compute the jump target, but has to run more movaps instructions. The test/jz version is always 9 uops. So the break-even point for dynamic uop count (actually executed, vs. sitting there in memory taking up I-cache footprint) is 4 XMM args, which is probably rare, and the test/jz version has other advantages anyway. Especially in the AL == 0 case, where it's 5 uops vs. 1.

The test/jz branch always goes to the same place for any number of XMM args except zero, making it easier to predict than an indirect branch that's different for printf("%f %f\n", ...) vs "%f\n".

3 of the 5 instructions (not including the jmp) in the computed-jump version form a dependency chain from the incoming AL, making it take that many more cycles before a misprediction can be detected (even though the chain probably started with a mov eax, 1 right before the call). But the "extra" instructions in the dump-everything strategy are just dead stores of some of XMM1..7 that never get reloaded and aren't part of any dependency chain. As long as the store buffer and ROB/RS can absorb them, out-of-order exec can work on them at its leisure.

(To be fair, they will tie up the store-data and store-address execution units for a while, meaning that later stores won't be ready for store-forwarding as soon either. And on CPUs where store-address uops run on the same execution units as loads, later loads can be delayed by those store uops hogging those execution units. Fortunately, modern CPUs have at least 2 load execution units, and Intel from Haswell to Skylake can run store-address uops on any of 3 ports, with simple addressing modes like this. Ice Lake has 2 load / 2 store ports with no overlap.)

The computed-jump version saves XMM0 last, and XMM0 is likely to be the first arg reloaded. (Most variadic functions go through their args in order.) If there are multiple XMM args, the computed-jump way won't be ready to store-forward from that store until a couple cycles later. But for cases with AL=1 it's the only XMM store, with no other work tying up load/store-address execution units, and small numbers of args are probably more common.

Most of these reasons are really minor compared to the advantage of smaller code footprint, and fewer instructions executed for the AL==0 case. It's just fun (for some of us) to think through the up/down sides of the modern simple way, to show that even in its worst case, it's not a problem.

Integer describing number of floating point arguments in xmm registers not passed to rax

The x86-64 System V ABI says the FP register arg count is passed in AL, and that the upper bytes of RAX are allowed to contain garbage. (Same as any narrow integer or FP arg. But see also this Q&A about clang assuming zero- or sign-extension of narrow integer args to 32 bit. This only applies to function args proper, not al.)

Use movzx eax, al to zero-extend AL into RAX. (Writing EAX implicitly zero-extends into RAX, unlike writing an 8 or 16-bit register.)

If there's another integer register you can clobber, use movzx ecx,al so mov-elimination on Intel CPUs can work, making it zero latency and not needing an execution port. Intel's mov-elimination fails when the src and dst are parts of the same register.

There's also zero benefit to using a 64-bit source for conversion to FP. cvtsi2sd xmm0, eax is one byte shorter (no REX prefix), and after zero-extension into EAX you know that the signed 2's complement interpretation of EAX and RAX that cvtsi2sd uses are identical.


On your Mac, clang/LLVM chose to leave garbage in the upper bytes of RAX. LLVM's optimizer is less careful about avoiding false dependencies than gcc's, so it will sometimes write partial registers. (Sometimes even when it doesn't save code size, but in this case it does).

From your results, we can conclude that you used clang on Mac, and gcc or ICC on Ubuntu.

It's easier to look at the compiler-generated asm for a simplified example (new and std::cout::operator<< result in a lot of code).

extern "C" double foo(int, ...);
int main() {
foo(123, 1.0, 2.0);
}

Compiles to this asm on the Godbolt compiler explorer, with gcc and clang -O3:

### clang7.0 -O3
.section .rodata
.LCPI0_0:
        .quad 4607182418800017408       # double 1
.LCPI0_1:
        .quad 4611686018427387904       # double 2

.text
main:                                   # @main
        push    rax                     # align the stack by 16 before a call
        movsd   xmm0, qword ptr [rip + .LCPI0_0]   # xmm0 = mem[0],zero
        movsd   xmm1, qword ptr [rip + .LCPI0_1]   # xmm1 = mem[0],zero
        mov     edi, 123
        mov     al, 2                   # leave the rest of RAX unmodified
        call    foo
        xor     eax, eax                # return 0
        pop     rcx
        ret

GCC emits basically the same thing, but with

## gcc8.2 -O3
        ...
        mov     eax, 2                  # AL = RAX = 2 FP args in regs
        mov     edi, 123
        call    foo
        ...

mov eax,2 instead of mov al,2 avoids a false dependency on the old value of RAX, on CPUs that don't rename AL separately from the rest of RAX. (Only Intel P6-family and Sandybridge do that, not IvyBridge and later. And not any AMD CPUs, or Pentium 4, or Silvermont.)

See How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent for more about how IvB and later are different from Core2 / Nehalem.

printf gets stuck in an infinite loop with AL = 10 on x86-64 Linux with older gcc

tl;dr: do xorl %eax, %eax before call printf.

printf is a varargs function. Here's what the System V AMD64 ABI has to say about varargs functions:

For calls that may call functions that use varargs or stdargs (prototype-less calls or calls to functions containing ellipsis (...) in the declaration) %al is used as hidden argument to specify the number of vector registers used. The contents of %al do not need to match exactly the number of registers, but must be an upper bound on the number of vector registers used and is in the range 0–8 inclusive.

You broke that rule. You'll see that the first time your code calls printf, %al is 10, which is more than the upper bound of 8. On your gNewSense system, here's a disassembly of the beginning of printf:

printf:
        sub     $0xd8,%rsp
        movzbl  %al,%eax                 # rax = al;
        mov     %rdx,0x30(%rsp)
        lea     0x0(,%rax,4),%rdx        # rdx = rax * 4;
        lea     after_movaps(%rip),%rax  # rax = &&after_movaps;
        mov     %rsi,0x28(%rsp)
        mov     %rcx,0x38(%rsp)
        mov     %rdi,%rsi
        sub     %rdx,%rax                # rax -= rdx;
        lea     0xcf(%rsp),%rdx
        mov     %r8,0x40(%rsp)
        mov     %r9,0x48(%rsp)
        jmpq    *%rax                    # goto *rax;
        movaps  %xmm7,-0xf(%rdx)
        movaps  %xmm6,-0x1f(%rdx)
        movaps  %xmm5,-0x2f(%rdx)
        movaps  %xmm4,-0x3f(%rdx)
        movaps  %xmm3,-0x4f(%rdx)
        movaps  %xmm2,-0x5f(%rdx)
        movaps  %xmm1,-0x6f(%rdx)
        movaps  %xmm0,-0x7f(%rdx)
after_movaps:
        # nothing past here is relevant for your problem

A quasi-C translation of the important bits is goto *(&&after_movaps - al * 4); (see Labels as Values). For efficiency, gcc and/or glibc didn't want to save more vector registers than you used, and it also doesn't want to do a bunch of conditional branches. Each instruction to save a vector register is 4 bytes, so it takes the end of the vector register saving instructions, subtracts al * 4 bytes, and jumps there. This results in just enough of the instructions executing. Since you had more than 8, it ended up jumping too far back, and landing before the jump instruction it just took, thus creating an infinite loop.
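
For readers who haven't seen the construct, here is a self-contained demo of GNU C labels-as-values (plain GNU C, not glibc's code):

#include <stdio.h>

int main(void)
{
    void *targets[] = { &&zero, &&one, &&two };   // GNU C: && takes a label's address
    int n = 1;
    goto *targets[n];          // computed goto, like glibc's jmpq *%rax
zero: puts("zero"); return 0;
one:  puts("one");  return 0;
two:  puts("two");  return 0;
}

glibc's old prologue is a more extreme version of this: instead of indexing a table, it does arithmetic on a single label address, relying on each movaps being exactly 4 bytes of machine code.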

As for why it's not reproducible on modern systems, here's a disassembly of the beginning of their printf:

printf:
        sub     $0xd8,%rsp
        mov     %rdi,%r10
        mov     %rsi,0x28(%rsp)
        mov     %rdx,0x30(%rsp)
        mov     %rcx,0x38(%rsp)
        mov     %r8,0x40(%rsp)
        mov     %r9,0x48(%rsp)
        test    %al,%al                  # if(!al)
        je      after_movaps             #     goto after_movaps;
        movaps  %xmm0,0x50(%rsp)
        movaps  %xmm1,0x60(%rsp)
        movaps  %xmm2,0x70(%rsp)
        movaps  %xmm3,0x80(%rsp)
        movaps  %xmm4,0x90(%rsp)
        movaps  %xmm5,0xa0(%rsp)
        movaps  %xmm6,0xb0(%rsp)
        movaps  %xmm7,0xc0(%rsp)
after_movaps:
        # nothing past here is relevant for your problem

A quasi-C translation of the important bits is if(!al) goto after_movaps;. Why did this change? My guess is Spectre. The mitigations for Spectre make indirect jumps really slow, so it's no longer worth doing that trick. Or not; see comments. Instead, they do a much simpler check: if there's any vector registers, then save them all. With this code, your bad value of al isn't a disaster, since it just means the vector registers will be unnecessarily copied.

Calling printf in x86-64 Linux requires RAX = 0?

The value in AL needs to be the number of floating-point parameters being passed in vector registers to a function that takes a variable number of arguments.

This document should help; see section 3.5.7 on variable argument lists.

When a function taking variable-arguments is called, %rax must be set to the total number of floating point parameters passed to the function in vector registers
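
For example, with one double arg, compilers typically emit mov eax, 1 (or mov al, 1) right before the call:

#include <stdio.h>

int main(void)
{
    printf("%f\n", 3.14);   // one FP arg in xmm0, so AL = 1
    return 0;
}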

Why does eax contain the number of vector parameters?

The value is used for optimization, as stated in the ABI document:

The prologue should use %al to avoid unnecessarily saving XMM registers. This is especially important for integer only programs to prevent the initialization of the XMM unit.

3.5.7 Variable Argument Lists - The Register Save Area. System V Application Binary Interface version 1.0

When a function uses va_start, it saves all the parameters passed in registers to the register save area:

To start, any function that is known to use va_start is required to, at the start of the function, save all registers that may have been used to pass arguments onto the stack, into the “register save area”, for future access by va_start and va_arg. This is an obvious step, and I believe pretty standard on any platform with a register calling convention. The registers are saved as integer registers followed by floating point registers...

https://blog.nelhage.com/2010/10/amd64-and-va_arg/
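
For reference, the ABI defines va_list itself like this (from section 3.5.7; the comments are mine):

typedef struct {
    unsigned int gp_offset;    // offset into reg_save_area of next GP arg: 0..48
    unsigned int fp_offset;    // offset of next FP arg: 48..176 (16 bytes per XMM slot)
    void *overflow_arg_area;   // next stack-passed argument
    void *reg_save_area;       // 6 GP regs (48 bytes) then 8 XMM regs (128 bytes)
} va_list[1];

va_arg bumps gp_offset or fp_offset until it runs off the end of the register save area, then switches to overflow_arg_area.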

But saving all 8 vector registers could be slow, so the compiler may choose to optimize that using the value passed in AL:

... As an optimization, during a function call, %rax is required to hold the number of SSE registers used to hold arguments, to allow a varargs caller to avoid touching the FPU at all if there are no floating point arguments.

https://blog.nelhage.com/2010/10/amd64-and-va_arg/

Since you want to save at least the registers actually used, the value can be larger than the real number of used registers. That's why there's this line in the ABI:

The contents of %al do not need to match exactly the number of registers, but must be an upper bound on the number of vector registers used and is in the range 0–8 inclusive.

You can see the effect in the prologue ICC generates:

        sub     rsp, 216                                        #5.1
        mov     QWORD PTR [8+rsp], rsi                          #5.1
        mov     QWORD PTR [16+rsp], rdx                         #5.1
        mov     QWORD PTR [24+rsp], rcx                         #5.1
        mov     QWORD PTR [32+rsp], r8                          #5.1
        mov     QWORD PTR [40+rsp], r9                          #5.1
        movzx   r11d, al                                        #5.1
        lea     rax, QWORD PTR [r11*4]                          #5.1
        lea     r11, QWORD PTR ..___tag_value_varstrings(int, ...).6[rip]  #5.1
        sub     r11, rax                                        #5.1
        lea     rax, QWORD PTR [175+rsp]                        #5.1
        jmp     r11                                             #5.1
        movaps  XMMWORD PTR [-15+rax], xmm7                     #5.1
        movaps  XMMWORD PTR [-31+rax], xmm6                     #5.1
        movaps  XMMWORD PTR [-47+rax], xmm5                     #5.1
        movaps  XMMWORD PTR [-63+rax], xmm4                     #5.1
        movaps  XMMWORD PTR [-79+rax], xmm3                     #5.1
        movaps  XMMWORD PTR [-95+rax], xmm2                     #5.1
        movaps  XMMWORD PTR [-111+rax], xmm1                    #5.1
        movaps  XMMWORD PTR [-127+rax], xmm0                    #5.1
..___tag_value_varstrings(int, ...).6:

It's essentially a Duff's device. The r11 register is loaded with the address after the XMM-saving instructions, and then al*4 is subtracted from it (since each movaps XMMWORD PTR [rax-X], xmmX is 4 bytes long) to jump to the first movaps instruction that should run.
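
The same control flow written as a C switch fall-through, i.e. an actual Duff's device (purely illustrative, hypothetical names; the real code indexes into machine code, not an array):

void save_xmm_args(int al, double save_area[8], const double xmm[8])
{
    switch (al) {                       // no bounds check: the ABI promises al <= 8
    case 8: save_area[7] = xmm[7];      // fall through
    case 7: save_area[6] = xmm[6];      // fall through
    case 6: save_area[5] = xmm[5];      // fall through
    case 5: save_area[4] = xmm[4];      // fall through
    case 4: save_area[3] = xmm[3];      // fall through
    case 3: save_area[2] = xmm[2];      // fall through
    case 2: save_area[1] = xmm[1];      // fall through
    case 1: save_area[0] = xmm[0];      // fall through
    case 0: break;
    }
}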

From what I've seen, other compilers either save all the vector registers or don't save them at all, so they don't care about AL's exact value and just check whether it's zero.

The general-purpose registers are always saved, probably because it's cheaper to just store the 6 registers to memory than to spend time on a condition check, address calculation and jump. As a result, you don't need a parameter for how many integer args were passed in registers.

Here is a similar question to yours. You can find more information in the links below:

  • How do vararg functions find out the number of arguments in machine code?
  • Why is %eax zeroed before a call to printf?
  • Identifying variable args function

GNU as, puts works but printf does not

puts appends a newline implicitly, and stdout is line-buffered (by default on terminals). So the text from printf may just be sitting there in the buffer. Your call to _exit(2) doesn't flush buffers, because it's the exit_group(2) system call, not the exit(3) library function. (See my version of your code below).
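
A minimal C demonstration of that buffering behaviour (run it with stdout on a terminal):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    printf("no newline");   // sits in stdio's user-space buffer
    _exit(0);               // raw exit_group system call: the buffer is lost
    // return 0, or calling exit(), would flush and the text would appear
}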

Your call to printf(3) is also not quite right, because you didn't zero %al before calling a var-args function with no FP arguments. (Good catch @RossRidge, I missed that). xor %eax,%eax is the best way to do that. %al will be non-zero (from puts()'s return value), which is presumably why printf segfaults. I tested on my system, and printf doesn't seem to mind when the stack is misaligned (which it is, since you pushed twice before calling it, unlike puts).


Also, you don't need any push instructions in that code. The first arg goes in %rdi. The first 6 integer args go in registers, the 7th and later go on the stack. You're also neglecting to pop the stack after the functions return, which only works because your function never tries to return after messing up the stack.

The ABI does require aligning the stack by 16B, and a push is one way to do that, which can actually be more efficient than sub $8, %rsp on recent Intel CPUs with a stack engine, and it takes fewer bytes. (See the x86-64 SysV ABI, and other links in the x86 tag wiki).


Improved code:

.text
.global main
main:
        lea     message, %rdi   # or mov $message, %edi if you don't need the code to be position-independent: default code model has all labels in the low 2G, so you can use shorter 32bit instructions
        push    %rbx            # align the stack for another call
        mov     %rdi, %rbx      # save for later
        call    puts

        xor     %eax,%eax       # %al = 0 = number of FP args for var-args functions
        mov     %rbx, %rdi      # or mov %ebx, %edi will normally be safe, since the pointer is known to be pointing to static storage, which will be in the low 2G
        call    printf

        # optionally putchar a '\n', or include it in the string you pass to printf

        #xor    %edi,%edi       # exit with 0 status
        #call   exit            # exit(3) does an fflush and other cleanup

        pop     %rbx            # restore caller's rbx, and restore the stack

        xor     %eax,%eax       # return 0
        ret

.section .rodata                # constants should go in .rodata
message: .asciz "Hello, World!"

lea message, %rdi is cheap, and doing it twice is fewer instructions than the two mov instructions to make use of %rbx. But since we needed to adjust the stack by 8B to strictly follow the ABI's 16B-aligned guarantee, we might as well do it by saving a call-preserved register. mov reg,reg is very cheap and small, so taking advantage of the call-preserved reg is natural.

Using mov %edi, %ebx and stuff like that saves the REX prefix in the machine-code encoding. If you're not sure / don't understand why it's safe to only copy the low 32bits, zeroing the upper 32b, then use 64bit registers. Once you understand what's going on, you'll know when you can save machine-code bytes by using 32bit operand-size.

MOVQ/PINSRQ vs VMOV to populate XMM (one works, the other doesn't)

I just want to post the code that works, after reading the documentation a bit more and not doing it the hard way:

global mul_array_float            ; mul_array_float(float &array1, float *array2)
mul_array_float:
        vmovups xmm0, [rdi]       ; load 4 floats from each array
        vmovups xmm1, [rsi]       ;   (RDI, RSI = the two pointer args)
        vmulps  xmm0, xmm0, xmm1  ; multiply; VEX encoding takes the explicit 3-operand form
        vmovups [rdi], xmm0       ; store the result back through the first pointer (passed by reference)
        ret

If the arrays the function is passed happen to be aligned, there is no speed penalty for the "ups" (unaligned) instructions on modern CPUs. Thanks to Peter Cordes and Jester for their considerations.
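
For comparison, this is roughly the C that matches that asm (a sketch; with restrict and -O3, GCC and clang typically compile it to essentially the same unaligned-load/mulps/store sequence):

void mul_array_float(float *restrict a, const float *restrict b)
{
    for (int i = 0; i < 4; i++)    // fixed 4-wide loop: SLP-vectorizes to one mulps
        a[i] *= b[i];
}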

xmm register sse x64 value inside

You can use one of the movdqa, movdqu, movaps, movups, movapd, movupd instructions to load values into a 128-bit SSE register (xmm) from memory. The movdqa, movaps, movapd forms require a 16-byte-aligned address (and were faster on older CPUs; on Nehalem and later, the unaligned forms are just as fast when the data happens to be aligned).

Incidentally, doing one point at a time with SIMD would require a lot of code changes. A better bet is to do 4 at a time (an XMM register holds 4 lanes of single-precision floats). Then you can (more or less) just replace each scalar instruction with the corresponding vector instruction.
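
With intrinsics, "4 at a time" looks like this (a sketch, hypothetical function name; assumes n is a multiple of 4):

#include <immintrin.h>
#include <stddef.h>

void mul_points(float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);            // unaligned loads: fine on modern CPUs
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(a + i, _mm_mul_ps(va, vb));   // 4 single-precision multiplies at once
    }
}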


