Where is the x86-64 System V ABI documented?

The System V AMD64 psABI document is maintained as LaTeX sources on GitLab; the i386 psABI is a separate GitLab repo. (Both were formerly on GitHub.) Those pages have info on where revisions are discussed.

The x32 ABI (32-bit pointers in long mode) is part of the x86-64 aka AMD64 ABI doc. See Chapter 10: ILP32 Programming Model.

The GitLab repo auto-builds a PDF of the current x86-64 version, but not i386.

See also the x86 tag wiki for other guides / references / links.


The last version on GitHub was x86-64 version 1.0 draft (January 2018). As of July 2022, the current version is still 1.0, with the word "Draft" removed by late 2018.

GitHub also hosts a PDF of i386 ABI version 1.1.

(Note that most non-Linux OSes use an older version of the i386 ABI which only requires 4-byte stack alignment, not 16. GCC ended up depending on the 16-byte alignment from -mpreferred-stack-boundary=4 for its SSE code-gen (perhaps unintentionally), and eventually the ABI got updated for Linux to enshrine that as an official requirement. I attempted a summary in a comment on GCC bug #40838. This breaks backwards compat with some hand-written asm that calls other functions.)

Unofficially, sign- or zero-extending narrow args to 32-bit (according to their signedness) is required for both i386 and amd64, because clang depends on it. Hopefully a future ABI revision will document that. GCC and/or clang now have some options to control that (TODO dig up what they were called), but the default is still the same as of 2022.



Naming: psABI

The Processor Supplement (psABI) docs are designed as a supplement to the less-frequently-updated System V gABI (generic), hosted on SCO's website.


Other links

Also https://refspecs.linuxfoundation.org/ hosts a copy of the gABI from 1997.

https://uclibc.org/specs.html has psABI links for various non-x86 ISAs. (Although for example the ARM one only seems to document the ELF file layout, not the calling convention or process startup state.) https://uclibc.org/docs/psABI-x86_64.pdf is an outdated copy of the x86-64 psABI (0.99.7 from 2014). The version on GitHub has clearer wording of a few things and bugfixes in some examples.


Related: What are the calling conventions for UNIX & Linux system calls (and user-space functions) on i386 and x86-64 describes the system-call calling convention for x86-64 SysV (as well as i386 Linux vs. FreeBSD).

It also summarizes the function calling conventions for integer args.
System calls don't take FP or SSE/AVX vector args, or structs by value, so the function-calling convention is more complicated than the system-call convention.


Agner Fog has a calling conventions guide (covering Windows vs. Sys V, and the various conventions for 32-bit, and tips/tricks for writing functions you can use on either platform). This is a separate PDF from his optimization and microarchitecture guides and instruction tables (which are essential reading if you care about performance.)

Wikipedia has an x86 calling conventions article which describes various conventions, but mostly not in enough detail to use them for anything other than simple integer args. (e.g. no description of struct-packing rules).



Related: C++ ABI

GCC and Clang (on all architectures) use the C++ ABI originally developed for Itanium: https://itanium-cxx-abi.github.io/cxx-abi/. This is relevant, for example, to the requirements a C++ struct/class must meet to be passed in registers (e.g. being an aggregate by some definition), vs. when a struct/class always needs to have an address and get passed by reference, even when it's small enough to pack into 2 registers. These rules depend on things like whether the type has a non-trivial constructor or destructor.

Why does the x86-64 / AMD64 System V ABI mandate a 16 byte stack alignment?

Note that the current version of the i386 System V ABI used on Linux also requires 16-byte stack alignment (see footnote 1 below). See https://sourceforge.net/p/fbc/bugs/659/ for some history, and my comment on https://gcc.gnu.org/bugzilla/show_bug.cgi?id=40838#c91 for an attempt at summarizing the unfortunate history of how i386 GNU/Linux + GCC accidentally got into a situation where a backwards-incompatible change to the i386 System V ABI was the lesser of two evils.

Windows x64 also requires 16-byte stack alignment before a call, presumably for similar motivations as x86-64 System V.

Also, semi-related: x86-64 System V requires that global arrays of 16 bytes and larger be aligned by 16. Same for local arrays of >= 16 bytes or variable size, although that detail is only relevant across functions if you know that you're being passed the address of the start of an array, not a pointer into the middle. (Different memory alignment for different buffer sizes). It doesn't let you make any extra assumptions about an arbitrary int *.


SSE2 is baseline for x86-64, and making the ABI efficient for types like __m128, and for compiler auto-vectorization, was one of the design goals, I think. The ABI has to define how such args are passed as function args, or by reference.

16-byte alignment is sometimes useful for local variables on the stack (especially arrays), and guaranteeing 16-byte alignment means compilers can get it for free whenever it's useful, even if the source doesn't explicitly request it.

If the stack's alignment relative to a 16-byte boundary wasn't known, every function that wanted an aligned local would need an and rsp, -16, plus extra instructions to save/restore rsp around that variable-sized adjustment (the old rsp being at an unknown offset, 0 or -8, from the boundary), e.g. using up rbp for a frame pointer.

Without AVX, memory source operands have to be 16-byte aligned. e.g. paddd xmm0, [rsp+rdi] faults if the memory operand is misaligned. So if alignment isn't known, you'd have to either use movups xmm1, [rsp+rdi] / paddd xmm0, xmm1, or write a loop prologue / epilogue to handle the misaligned elements. For local arrays that the compiler wants to auto-vectorize over, it can simply choose to align them by 16.

Also note that early x86 CPUs (before Nehalem / Bulldozer) had a movups instruction that's slower than movaps even when the pointer does turn out to be aligned. (I.e. unaligned loads/stores on aligned data was extra slow, as well as preventing folding loads into an ALU instruction.) (See Agner Fog's optimization guides, microarch guide, and instruction tables for more about all of the above.)

These factors are why a guarantee is more useful than just "usually" keeping the stack aligned. Being allowed to make code which actually faults on a misaligned stack allows more optimization opportunities.

Aligned arrays also speed up vectorized memcpy / strcmp / whatever functions that can't assume alignment, but instead check for it and can jump straight to their whole-vector loops.

From a recent version of the x86-64 System V ABI (r252):

An array uses the same alignment as its elements, except that a local or global array variable of length at least 16 bytes or a C99 variable-length array variable always has alignment of at least 16 bytes. [4]

[4] The alignment requirement allows the use of SSE instructions when operating on the array. The compiler cannot in general calculate the size of a variable-length array (VLA), but it is expected that most VLAs will require at least 16 bytes, so it is logical to mandate that VLAs have at least a 16-byte alignment.

This is a bit aggressive, and mostly only helps when functions that auto-vectorize can be inlined, but usually there are other locals the compiler can stuff into any gaps so it doesn't waste stack space. And doesn't waste instructions as long as there's a known stack alignment. (Obviously the ABI designers could have left this out if they'd decided not to require 16-byte stack alignment.)



Spill/reload of __m128

Of course, it makes it free to do alignas(16) char buf[1024]; or other cases where the source requests 16-byte alignment.

And there are also __m128 / __m128d / __m128i locals. The compiler may not be able to keep all vector locals in registers (e.g. spilled across a function call, or not enough registers), so it needs to be able to spill/reload them with movaps, or as a memory source operand for ALU instructions, for efficiency reasons discussed above.

Loads/stores that actually are split across a cache-line boundary (64 bytes) have significant latency penalties, and also minor throughput penalties on modern CPUs. The load needs data from 2 separate cache lines, so it takes two accesses to the cache. (And potentially 2 cache misses, but that's rare for stack memory.)

I think movups already had that cost baked in for vectors on older CPUs where it's expensive, but it still sucks. Spanning a 4k page boundary is much worse (on CPUs before Skylake), with a load or store taking ~100 cycles if it touches bytes on both sides of a 4k boundary. (Also needs 2 TLB checks.) Natural alignment makes splits across any wider boundary impossible, so 16-byte alignment was sufficient for everything you can do with SSE2.


max_align_t has 16-byte alignment in the x86-64 System V ABI, because of long double (10-byte/80-bit x87). long double is defined as padded to 16 bytes for some weird reason, unlike in i386 System V where it's only padded to 12 bytes (sizeof(long double) == 12 in 32-bit code). x87 10-byte load/store is quite slow anyway (like 1/3rd the load throughput of double or float on Core2, 1/6th on P4, or 1/8th on K8), but maybe cache-line and page split penalties were so bad on older CPUs that they decided to define it that way. I think on modern CPUs (maybe even Core2) looping over an array of long double would be no slower with packed 10-byte, because the fld m80 would be a bigger bottleneck than a cache-line split every ~6.4 elements.

Actually, the ABI was defined before silicon was available to benchmark on (back in ~2000), but those K8 numbers are the same as K7 (32-bit / 64-bit mode is irrelevant here). Making long double 16-byte does make it possible to copy a single one with movaps, even though you can't do anything with it in XMM registers. (Except manipulate the sign bit with xorps / andps / orps.)

Related: this max_align_t definition means that malloc always returns 16-byte aligned memory in x86-64 code. This lets you get away with using it for SSE aligned loads like _mm_load_ps, but such code can break when compiled for 32-bit where alignof(max_align_t) is only 8. (Use aligned_alloc or whatever.)


Other ABI factors include passing __m128 values on the stack (after xmm0-7 have the first 8 float / vector args). It makes sense to require 16-byte alignment for vectors in memory, so they can be used efficiently by the callee, and stored efficiently by the caller. Maintaining 16-byte stack alignment at all times makes it easy for functions that need to align some arg-passing space by 16.

There are types like __m128 that the ABI guarantees have 16-byte alignment. If you define a local and take its address, and pass that pointer to some other function, that local needs to be sufficiently aligned. So maintaining 16-byte stack alignment goes hand in hand with giving some types 16-byte alignment, which is obviously a good idea.

These days, it's nice that atomic<struct_of_16_bytes> can cheaply get 16-byte alignment, so lock cmpxchg16b doesn't ever cross a cache line boundary. For the really rare case where you have an atomic local with automatic storage, and you pass pointers to it to multiple threads...



Footnote 1: 32-bit Linux

Not all 32-bit platforms broke backwards compatibility with existing binaries and hand-written asm the way Linux did; some like i386 NetBSD still only use the historical 4-byte stack alignment requirement from the original version of the i386 SysV ABI.

The historical 4-byte stack alignment was also insufficient for efficient 8-byte double on modern CPUs. Unaligned fld / fstp are generally efficient except when they cross a cache-line boundary (like other loads/stores), so it's not horrible, but naturally-aligned is nice.

Even before 16-byte alignment was officially part of the ABI, GCC used to enable -mpreferred-stack-boundary=4 (2^4 = 16-bytes) on 32-bit. This currently assumes the incoming stack alignment is 16 bytes (even for cases that will fault if it's not), as well as preserving that alignment. I'm not sure if historical gcc versions used to try to preserve stack alignment without depending on it for correctness of SSE code-gen or alignas(16) objects.

ffmpeg is one well-known example that depends on the compiler to give it stack alignment: what is "stack alignment"?, e.g. on 32-bit Windows.

Modern gcc still emits code at the top of main to align the stack by 16 (even on Linux where the ABI guarantees that the kernel starts the process with an aligned stack), but not at the top of any other function. You could use -mincoming-stack-boundary to tell gcc how aligned it should assume the stack is when generating code.

Ancient gcc4.1 didn't seem to really respect __attribute__((aligned(16))) or (aligned(32)) for automatic storage, i.e. it didn't bother aligning the stack any extra in this example on Godbolt, so old gcc has kind of a checkered past when it comes to stack alignment. I think the change of the official Linux ABI to 16-byte alignment happened as a de-facto change first, not a well-planned change. I haven't turned up anything official on when the change happened, but somewhere between 2005 and 2010 I think, after x86-64 became popular and the x86-64 System V ABI's 16-byte stack alignment proved useful.

At first it was a change to GCC's code-gen to use more alignment than the ABI required (i.e. using a stricter ABI for gcc-compiled code), but later it was written in to the version of the i386 System V ABI maintained at https://github.com/hjl-tools/x86-psABI/wiki/X86-psABI (which is official for Linux at least).


@MichaelPetch and @ThomasJager report that gcc4.5 may have been the first version to have -mpreferred-stack-boundary=4 for 32-bit as well as 64-bit. gcc4.1.2 and gcc4.4.7 on Godbolt appear to behave that way, so maybe the change was backported, or Matt Godbolt configured old gcc with a more modern config.

What registers are preserved through a Linux x86-64 function call?

Here's the complete table of registers and their use from the psABI documentation:

Register     Usage                                          Preserved across function calls
rax          return value                                   No
rbx          callee-saved register                          Yes
rcx          4th integer argument                           No
rdx          3rd integer argument; 2nd return register      No
rsp          stack pointer                                  Yes
rbp          callee-saved; optional frame pointer           Yes
rsi          2nd integer argument                           No
rdi          1st integer argument                           No
r8           5th integer argument                           No
r9           6th integer argument                           No
r10          static chain pointer; temporary                No
r11          temporary                                      No
r12-r15      callee-saved registers                         Yes
xmm0-xmm1    vector arguments; vector return values         No
xmm2-xmm7    vector arguments                               No
xmm8-xmm15   temporaries                                    No

r12, r13, r14, r15, rbx, rsp, rbp are the callee-saved registers - they have a "Yes" in the "Preserved across function calls" column.

Why does the x86-64 System V calling convention pass args in registers instead of just the stack?

instead of putting the first 6 arguments in registers just to move them onto the stack in the function prologue?

I was looking at some code that gcc generated and that's what it always did.

Then you forgot to enable optimization. gcc -O0 spills everything to memory so you can modify them with a debugger while single-stepping. That's obviously horrible for performance, so compilers don't do that unless you force them to by compiling with -O0.

x86-64 System V allows int add(int x, int y) { return x+y; } to compile to lea eax, [rdi + rsi] / ret, which is what compilers actually do, as you can see on the Godbolt compiler explorer.

Stack-args calling conventions are slow and obsolete. RISC machines have been using register-args calling conventions since before x86-64 existed, and on OSes that still care about 32-bit x86 (i.e. Windows), there are better calling conventions like __fastcall and __vectorcall that pass the first 2 integer args in registers.

i386 System V hasn't been replaced because people mostly don't care as much about 32-bit performance on other OSes; we just use 64-bit code with the nicely-designed x86-64 System V calling convention.

For more about the tradeoff between register args and call-preserved vs. call-clobbered registers in calling convention design, see Why not store function parameters in XMM vector registers?, and also Why does Windows64 use a different calling convention from all other OSes on x86-64?.

Is %gs caller- or callee-save for the System V AMD64 ABI?

IIRC, the calling convention / ABI doesn't have much to say about it at all. The FS base is used for thread-local storage on x86-64 System V, but anything you want to do with the GS base (and/or the GS segment-selector value) is presumably up to you to define how your program uses it.

Either in terms of a calling convention, or more normally as set-once early in a thread's lifetime and leave untouched after that. Like how MXCSR and the x87 rounding mode / precision control are handled by having functions leave them untouched except for maybe local changes that are restored before making any further calls. (That is still something you can describe as a calling convention / ABI, but as Nate commented, it's neither call-preserved nor call-clobbered.)

I think the Linux kernel saves/restores user-space's GS selector value on context switch, and maybe also the segment base separately from whatever's in a GDT or LDT entry selected by a non-zero selector value (a base instead set via a system call that writes the MSR, or via wrgsbase if the kernel enables it for user-space use on HW that supports it). If so, it would be viable for user-space to use GS for something, like an alternate TLS.

In practice, you can safely assume that calls to compiler-generated code (or even hand-written) in libraries won't change it. So system calls and context switches are all you need to worry about; I'd recommend testing.

Note that mov to GS with a value that isn't 0 or a valid selector for an LDT or GDT entry will fault, so there's a very limited set of values you can use.

It's also quite slow to write GS (Is a mov to a segmentation register slower than a mov to a general purpose register?), although reading it is fairly efficient (except on P4). Perhaps even more efficient than a thread-local-storage memory location, although GS can't be a source operand for instructions other than stores.


Fun fact: i386 SysV uses GS for TLS; I think x86-64 changed to FS so the kernel-GS (which is special because of swapgs) could just be for finding the task's kernel stack after swapgs, not also having to be the kernel's TLS base for per-core variables.

System V AMD64 ABI floating point varargs order

For printf, the format string indicates the types of the remaining arguments.

The implementation of va_arg knows the type because it's given as an operand of va_arg, so the correct register save-area slot can be deduced from the type.

C++ on x86-64: when are structs/classes passed and returned in registers?

The ABI specification is the x86-64 System V psABI document; older 0.99 drafts and the current version are available from the repos linked in the first answer above.

I assume the reader is familiar with the terminology of the document and can classify the primitive types.


If the object size is larger than two eight-bytes, it is passed in memory:

struct foo
{
    unsigned long long a;
    unsigned long long b;
    unsigned long long c;  // Commenting this out gives: mov rax, rdi
};

unsigned long long foo(struct foo f)
{
    return f.a;            // mov rax, QWORD PTR [rsp+8]
}

If it is non-POD, it is passed in memory:

struct foo
{
    unsigned long long a;
    foo(const struct foo& rhs){}  // Commenting this out gives: mov rax, rdi
};

unsigned long long foo(struct foo f)
{
    return f.a;                   // mov rax, QWORD PTR [rdi]
}

(Copy elision is at work here.)

If it contains unaligned fields, it is passed in memory:

struct __attribute__((packed)) foo  // Removing packed gives: mov rax, rsi
{
    char b;
    unsigned long long a;
};

unsigned long long foo(struct foo f)
{
    return f.a;                     // mov rax, QWORD PTR [rsp+9]
}

If none of the above is true, the fields of the object are considered.

If one of the fields is itself a struct/class, the procedure is applied recursively.

The goal is to classify each of the two eightbytes (8Bs) in the object.

To do that, the classes of the fields making up each 8B are considered.

Note that an integral number of fields always exactly fills one 8B, thanks to the alignment requirements above.

Let C be the class of the 8B computed so far and D be the class of the field under consideration.

Let new_class be pseudo-defined as

cls new_class(cls D, cls C)
{
    if (D == NO_CLASS)
        return C;

    if (D == MEMORY || C == MEMORY)
        return MEMORY;

    if (D == INTEGER || C == INTEGER)
        return INTEGER;

    if (D == X87 || C == X87 || D == X87UP || C == X87UP)
        return MEMORY;

    return SSE;
}

then the class of the 8B is computed as follows:

C = NO_CLASS;

for (field f : fields)
{
    D = get_field_class(f);  // Note: this may recursively apply this procedure
    C = new_class(D, C);
}

Once we have the class of each 8B, say C1 and C2, then

if (C1 == MEMORY || C2 == MEMORY)
    C1 = C2 = MEMORY;

if (C2 == SSEUP && C1 != SSE)
    C2 = SSE;

Note: this is my interpretation of the algorithm given in the ABI document.


Example

struct foo
{
unsigned long long a;
long double b;
};

unsigned long long foo(struct foo f)
{
return f.a;
}

The 8Bs and their fields

First 8B: a
Second 8B: b

a is INTEGER, so the first 8B is INTEGER.
b is X87 and X87UP so the second 8B is MEMORY.
The final class is MEMORY for both 8Bs.


Example

struct foo
{
double a;
long long b;
};

long long foo(struct foo f)
{
return f.b; //mov rax, rdi
}

The 8Bs and their fields

First 8B: a
Second 8B: b

a is SSE, so the first 8B is SSE.

b is INTEGER so the second 8B is INTEGER.

The final classes are the ones just calculated, so the struct is passed in registers (a in xmm0, b in rdi).


Return values

The values are returned according to their classes:

  • MEMORY

    The caller passes a hidden first argument to the function for it to store the result into.

    In C++ this often involves copy elision / return value optimisation.
    This address must also be returned in rax, thereby returning MEMORY-class values "by reference" into a hidden, caller-allocated buffer.

    If the type has class MEMORY, then the caller provides space for the return
    value and passes the address of this storage in %rdi as if it were the first
    argument to the function. In effect, this address becomes a “hidden” first
    argument.
    On return %rax will contain the address that has been passed in by the
    caller in %rdi.

  • INTEGER and POINTER

    The registers rax and rdx as needed.

  • SSE and SSEUP
    The registers xmm0 and xmm1 as needed.

  • X87 and X87UP
    The register st0.


PODs

The technical definition of POD is the one from the C++ standard; the relevant definition from the ABI is reported below.

A de/constructor is trivial if it is an implicitly-declared default de/constructor and if:

   • its class has no virtual functions and no virtual base classes, and

   • all the direct base classes of its class have trivial de/constructors, and

   • for all the nonstatic data members of its class that are of class type (or array thereof), each such class has a trivial de/constructor.


Note that each 8B is classified independently so that each one can be passed accordingly.

Particularly, they may end up on the stack if there are no more parameter registers left.


