Is Garbage Allowed in High Bits of Parameter and Return Value Registers in the x86-64 SysV ABI

Is garbage allowed in high bits of parameter and return value registers in x86-64 SysV ABI?

It looks like you have two questions here:

  1. Do the high bits of a return value need to be zeroed before returning? (And do the high bits of arguments need to be zeroed before calling?)
  2. What are the costs/benefits associated with this decision?

The answer to the first question is no, there can be garbage in the high bits, and Peter Cordes has already written a very nice answer on the subject.

As for the second question, I suspect that leaving the high bits undefined is better for performance overall. On the one hand, zero-extending values beforehand comes at no additional cost when 32-bit operations are used. On the other hand, zeroing the high bits beforehand is not always necessary: if you allow garbage in the high bits, you can leave it up to the code that receives the values to perform zero-extensions (or sign-extensions) only when they are actually required.
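As a rough illustration of that trade-off (a minimal sketch; the function names and register choices are mine, not from the ABI document): a callee that produces its result with 32-bit instructions gets clean upper bits for free, while an 8-bit result would need an extra instruction if zeroed upper bits were mandated.

; sketch: a 32-bit result zero-extends for free, an 8-bit one would not
compute_u32:                    ; unsigned int f(unsigned int a, unsigned int b)
        lea   eax, [rdi + rsi]  ; any 32-bit write clears bits 63:32 of RAX anyway
        ret

compute_u8:                     ; unsigned char g(unsigned char a, unsigned char b)
        add   dil, sil
        movzx eax, dil          ; only needed if the ABI required clean upper bits
        ret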

But I wanted to highlight another consideration: Security

Information leaks

When the upper bits of a result are not cleared, they may retain fragments of other pieces of information, such as function pointers or addresses in the stack/heap. If there ever exists a mechanism to execute higher-privileged functions and retrieve the full value of rax (or eax) afterwards, then this could introduce an information leak. For example, a system call might leak a pointer from the kernel to user space, leading to a defeat of kernel ASLR. Or an IPC mechanism might leak information about another process' address space that could assist in developing a sandbox breakout.
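To make the scenario concrete, here is a hypothetical sketch (the routine and symbol names are invented, not taken from any real kernel or API): a function that computes an address in RAX but only returns a one-byte status in AL.

get_status:                              ; returns a status byte in AL
        lea   rax, [rel secret_object]   ; RAX now holds an address
        cmp   byte [rax], 0
        setne al                         ; status lives in AL; bits 63:8 still hold the address
        ret                              ; a caller that reads all of RAX sees the leaked pointer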

Of course, one might argue that it is not the responsibility of the ABI to prevent information leaks; it is up to the programmer to implement their code correctly. While I do agree, mandating that the compiler zero the upper bits would still have the effect of eliminating this particular form of information leak.

You shouldn't trust your input

On the other side of things, and more importantly, the compiler should not blindly trust that any received values have their upper bits zeroed out, or else the function may not behave as expected, and this could also lead to exploitable conditions. For example, consider the following:

unsigned char buf[256];
...
void __fastcall write_index(unsigned char index, unsigned char value) {
    buf[index] = value;
}

If we were allowed to assume that index has its upper bits zeroed out, then we could compile the above as:

write_index:                 ;; sil = index, dil = value
        ; movzx esi, sil    ; skipped based on assumptions
        mov [buf + rsi], dil
        ret

But if we could call this function from our own code, we could supply a value in rsi outside the [0, 255] range and write to memory beyond the bounds of the buffer.

Of course, the compiler would not actually generate code like this, since, as mentioned above, it is the responsibility of the callee to zero- or sign-extend its arguments, rather than that of the caller. This, I think, is a very practical reason to have the code that receives a value always assume that there is garbage in the upper bits and explicitly remove it.

(For Intel IvyBridge and later (mov-elimination), compilers would hopefully zero-extend into a different register to at least avoid the latency, if not the front-end throughput cost, of a movzx instruction.)
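A sketch of what that safer codegen could look like, following the suggestion above (the destination register is my choice, for illustration):

write_index:                 ;; sil = index, dil = value, as before
        movzx ecx, sil       ; callee zero-extends its own argument, into a different
                             ; register so mov-elimination can apply on IvyBridge+
        mov   [buf + rcx], dil
        ret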

What is the rationale for making all SSE/AVX registers call-clobbered in the SysV ABI?

IIRC, the stated (or assumed? I forget) rationale is that there's no forward-compatible mechanism for functions to save/restore the full vector register width.¹ And the ABI designers were unwilling to say that only the baseline 128 bits, or the low scalar element (64 bits), of a few registers would be call-preserved, with future upper parts call-clobbered.

You're right that AVX-512 was an opportunity to improve the situation, e.g. by defining XMM28..31 as call-preserved. (Scalar code often benefits from one or two FP variables staying in registers, especially across calls to functions, including math library functions. For example, see the slowdown in an example where a hand-written asm version can't inline, but plain-C functions using sqrt can.)

Yes, this is fairly poor design, and it causes spill/reload slowdowns in loops that mix function calls with (often scalar) FP work, sometimes even introducing store-forwarding latency into the critical path, e.g. in a loop involving log(), or, even worse, a cheap library function like sqrt() if you fail to compile with -fno-math-errno so that GCC can only speculatively inline it.
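A small C example of the pattern being described (not taken from the original answer): with every XMM register call-clobbered, the running total has to be spilled and reloaded around each call.

#include <math.h>

/* `s` lives in an XMM register, but all XMM registers are call-clobbered in
 * the SysV ABI, so the compiler must store `s` to the stack before each call
 * to log() and reload it afterwards, putting store-forwarding latency on the
 * loop-carried dependency chain. */
double sum_logs(const double *a, long n)
{
    double s = 0.0;
    for (long i = 0; i < n; i++)
        s += log(a[i]);
    return s;
}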

Footnote 1: xsave/xrstor and friends are usable from user-space, but that's not efficient/practical inside ordinary functions. And IIRC you need to pass a mask of which parts of the architectural state to save, and the OS has to know about new extensions to the size of the state it saves, so even that doesn't solve the problem of old libraries or other binaries saving/restoring wider registers.

  • What's the advantage of having nonvolatile registers in a calling convention? Windows x64 has 10 call-preserved XMM regs, which is probably too many, leaving only 6 call-clobbered for leaf functions to use without spending extra instructions saving/restoring.

  • Why do SSE instructions preserve the upper 128-bit of the YMM registers? - Intel's AVX design decision to have legacy-SSE instructions leave upper halves unmodified, mostly because of binary-only Windows kernel drivers that manually save/restore a few XMM regs.

    When x86-64 (and SSE2) were new, there was no telling how future SIMD extensions would work, and some code was written to work at the time without an eye to the future. Also, x87 was always treated as call-clobbered, because its stack nature makes it hard for a function to know how many, if any, elements need saving/restoring if it wants to use the full eight st0..st7 registers. So historically, x86 calling conventions didn't have any call-preserved FP registers; perhaps that's why the GCC devs unfortunately didn't consider the value in having a couple.

Why are char and short data stored in 4-byte registers?

When you merely reference a char or short variable in an expression, the language rules say that it is immediately promoted to int.  So, given char c, d; if we say c + d this is the same as saying (int)c + (int)d by the rules of the language.  And also within the expression context printf("%d\n", c); is the same as printf("%d\n", (int)c);

Even if you cast a char variable to char, it will still immediately be promoted to int, so saying (char)c is the same as saying (int)(char)(int)c.  This is the reason we can cast int i; to a shorter type, (unsigned short)i, and get a zero-extended, full-sized int (from the lower 16 bits of i) as a result, or (short)i and get a sign-extended, full-sized int (also from the lower 16 bits) as a result.

This automatic and immediate promotion to int for the shorter data types happens independently of function calling and parameter passing.  So, in printf("%d\n", c); we are passing an int (that happens to be widened from a char) and that's what printf sees.
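A short example of these rules in action (the values are only for illustration):

#include <stdio.h>

int main(void)
{
    unsigned char a = 200, b = 100;

    int wide = a + b;             /* both operands promoted to int: 300, no wrap at 255 */
    unsigned char narrow = a + b; /* the int result 300 is truncated back to 44 */

    int i = 0x12345678;
    printf("%d %d %d\n", wide, narrow, (unsigned short)i);
    /* prints: 300 44 22136 -- the cast keeps the low 16 bits, zero-extended */
    return 0;
}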

But why do char and short need the promotion?

This is by the definition of the language.  We can guess at the rationale, namely that it simplifies the arithmetic operators, and also observe that we need some rules to rely upon, even if those rules could have been different.


From ISO/IEC 9899:201x Committee Draft — April 12, 2011 N1570

EXAMPLE 2
In executing the fragment

char c1, c2;

/* ... */

c1 = c1 + c2;

the "integer promotions" require that the abstract machine promote the value of each variable to int size and then add the two ints and truncate the sum. Provided the addition of two chars can be done without overflow, or with overflow wrapping silently to produce the correct result, the actual execution need only produce the same result, possibly omitting the promotions.

Why does this function push RAX to the stack as the first operation?

The 64-bit ABI requires that the stack is aligned to 16 bytes before a call instruction.

call pushes an 8-byte return address on the stack, which breaks the alignment, so the compiler needs to do something to align the stack again to a multiple of 16 before the next call.

(The ABI design choice of requiring alignment before a call instead of after has the minor advantage that if any args were passed on the stack, this choice makes the first arg 16B-aligned.)

Pushing a don't-care value works well, and can be more efficient than sub rsp, 8 on CPUs with a stack engine. (See the comments).
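A minimal sketch of the idea (the callee name is invented); the clang output further down in this page uses exactly this push/pop pattern:

caller:
        push  rax            ; RSP was 16n-8 after our return address was pushed;
                             ; one dummy push makes it 16n again (sub rsp, 8 also works)
        call  some_function  ; RSP is 16-byte aligned here, as the ABI requires
        pop   rcx            ; undo the dummy push before returning
        ret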

x86-64 calling convention assumes return registers are zeroed?

Writing EAX (or any 32-bit register) clears the upper 32 bits of RAX anyway.
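For illustration (the values are chosen arbitrarily):

mov   rax, -1        ; RAX = 0xFFFFFFFFFFFFFFFF
mov   eax, 5         ; 32-bit write: RAX = 0x0000000000000005, upper half cleared implicitly
mov   rax, -1
mov   al,  5         ; 8-bit write:  RAX = 0xFFFFFFFFFFFFFF05, bits 63:8 untouched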


In assembly, how to handle Windows API with a return value that's documented as less than native register width?

At least for x64, the upper bits are undefined and the caller is responsible for doing zero- or sign-extension if needed. If your code relied on upper bits being zero, I'm afraid it was buggy all along, and you have just been lucky until now.

https://docs.microsoft.com/en-us/cpp/build/x64-calling-convention?view=msvc-160

The state of unused bits in the value returned in RAX or XMM0 is undefined.
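So after calling an API whose result is documented as narrower than 64 bits, extend it yourself before using the full register. A sketch under that assumption (the function and label names are made up, not a real API):

call  SomeApiReturningByte   ; hypothetical: result documented as an 8-bit value in AL
movzx eax, al                ; discard whatever garbage is in bits 63:8 before comparing
cmp   eax, 1
je    handle_one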

Integer describing the number of floating-point arguments in XMM registers is not passed in RAX?

The x86-64 System V ABI says the FP register arg count is passed in AL, and that the upper bytes of RAX are allowed to contain garbage. (The same is true for any narrow integer or FP arg. But see also this Q&A about clang assuming zero- or sign-extension of narrow integer args to 32 bits. That only applies to function args proper, not to al.)

Use movzx eax, al to zero-extend AL into RAX. (Writing EAX implicitly zero-extends into RAX, unlike writing an 8 or 16-bit register.)

If there's another integer register you can clobber, use movzx ecx,al so mov-elimination on Intel CPUs can work, making it zero latency and not needing an execution port. Intel's mov-elimination fails when the src and dst are parts of the same register.

There's also zero benefit to using a 64-bit source for conversion to FP. cvtsi2sd xmm0, eax is one byte shorter (no REX prefix), and after zero-extension into EAX you know that the signed 2's complement interpretation of EAX and RAX that cvtsi2sd uses are identical.
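Putting the two suggestions together, a sketch of turning the AL count into a double inside a variadic function (register choices are illustrative):

movzx    ecx, al        ; zero-extend AL into a *different* register so mov-elimination can apply;
                        ; whatever garbage sits in the rest of RAX is simply ignored
cvtsi2sd xmm0, ecx      ; 32-bit source: no REX prefix, and the value matches the 64-bit one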


On your Mac, clang/LLVM chose to leave garbage in the upper bytes of RAX. LLVM's optimizer is less careful about avoiding false dependencies than gcc's, so it will sometimes write partial registers. (Sometimes even when it doesn't save code size, but in this case it does.)

From your results, we can conclude that you used clang on Mac, and gcc or ICC on Ubuntu.

It's easier to look at the compiler-generated asm for a simplified example (new and std::cout::operator<< result in a lot of code).

extern "C" double foo(int, ...);
int main() {
foo(123, 1.0, 2.0);
}

Compiles to this asm on the Godbolt compiler explorer, with gcc and clang -O3:

### clang7.0 -O3
.section .rodata
.LCPI0_0:
        .quad 4607182418800017408       # double 1
.LCPI0_1:
        .quad 4611686018427387904       # double 2

.text
main:                                   # @main
        push rax                        # align the stack by 16 before a call
        movsd xmm0, qword ptr [rip + .LCPI0_0]   # xmm0 = mem[0],zero
        movsd xmm1, qword ptr [rip + .LCPI0_1]   # xmm1 = mem[0],zero
        mov edi, 123
        mov al, 2                       # leave the rest of RAX unmodified
        call foo
        xor eax, eax                    # return 0
        pop rcx
        ret

GCC emits basically the same thing, but with

 ## gcc8.2 -O3
...
        mov eax, 2                      # AL = RAX = 2 FP args in regs
        mov edi, 123
        call foo
...

mov eax,2 instead of mov al,2 avoids a false dependency on the old value of RAX, on CPUs that don't rename AL separately from the rest of RAX. (Only Intel P6-family and Sandybridge do that, not IvyBridge and later. And not any AMD CPUs, or Pentium 4, or Silvermont.)

See "How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent" for more about how IvB and later are different from Core2 / Nehalem.


