C++ on X86-64: When Are Structs/Classes Passed and Returned in Registers

C++ on x86-64: when are structs/classes passed and returned in registers?

The ABI specification is defined here.

A newer version is available here.

I assume the reader is accustomed to the terminology of the document and that they can classify the primitive types.

If the object size is larger than two eight-bytes, it is passed in memory:

struct foo
{
    unsigned long long a;
    unsigned long long b;
    unsigned long long c;               //Commenting this gives mov rax, rdi
};

unsigned long long foo(struct foo f)
{ 
  return f.a;                           //mov     rax, QWORD PTR [rsp+8]
}
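The size rule alone can be checked at compile time. The following is a minimal sketch; `eightbytes` and `exceeds_two_eightbytes` are hypothetical helper names, not something the ABI or any compiler provides, and they only look at size, not at the other rules discussed below.

```cpp
#include <cstddef>

// Number of eightbytes (8-byte chunks) an object of the given size occupies.
constexpr std::size_t eightbytes(std::size_t size) { return (size + 7) / 8; }

// Size test only: true when the object needs more than two eightbytes.
// It says nothing about the other rules (non-POD types, unaligned fields).
template <typename T>
constexpr bool exceeds_two_eightbytes() { return eightbytes(sizeof(T)) > 2; }

struct two_words   { unsigned long long a, b; };     // 16 bytes
struct three_words { unsigned long long a, b, c; };  // 24 bytes

static_assert(!exceeds_two_eightbytes<two_words>(),  "16 bytes: still a register candidate");
static_assert(exceeds_two_eightbytes<three_words>(), "24 bytes: passed in memory");
```

Commenting out the third member of `three_words` makes the second assertion fail, which mirrors the comment in the example above.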

If it is non-POD, it is passed in memory:

struct foo
{
    unsigned long long a;
    foo(const struct foo& rhs){}            //Commenting this gives mov rax, rdi
};

unsigned long long foo(struct foo f)
{
  return f.a;                               //mov     rax, QWORD PTR [rdi]
}

(Copy elision is at work here.)
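A rough way to test for this case in code is with the standard type traits. This is only an approximation of the rule the C++ ABI actually uses, and `trivially_passable` is a hypothetical helper name I made up for the sketch.

```cpp
#include <type_traits>

// Hypothetical helper: approximates "can be copied and destroyed trivially",
// which is what lets the object travel in registers.
template <typename T>
constexpr bool trivially_passable() {
    return std::is_trivially_copy_constructible<T>::value &&
           std::is_trivially_move_constructible<T>::value &&
           std::is_trivially_destructible<T>::value;
}

struct plain { unsigned long long a; };

struct with_copy_ctor {
    unsigned long long a;
    with_copy_ctor(const with_copy_ctor&) {}   // user-provided, so not trivial
};

static_assert(trivially_passable<plain>(),           "register candidate");
static_assert(!trivially_passable<with_copy_ctor>(), "passed in memory");
```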

If it contains unaligned fields, it is passed in memory:

struct __attribute__((packed)) foo         //Removing packed gives mov rax, rsi
{
    char b;
    unsigned long long a;
};

unsigned long long foo(struct foo f)
{
  return f.a;                             //mov     rax, QWORD PTR [rsp+9]
}
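The misalignment is visible with `offsetof`. A small sketch, assuming GCC or Clang (for `__attribute__((packed))`) on x86-64; the struct names and `is_aligned` are mine.

```cpp
#include <cstddef>

struct __attribute__((packed)) packed_foo { char b; unsigned long long a; };
struct natural_foo                        { char b; unsigned long long a; };

// True when a field placed at `offset` respects the given alignment.
constexpr bool is_aligned(std::size_t offset, std::size_t alignment) {
    return offset % alignment == 0;
}

// Packed: a starts at byte 1, which is why the load above reads [rsp+9].
static_assert(offsetof(packed_foo, a) == 1, "a directly follows b");
static_assert(!is_aligned(offsetof(packed_foo, a), alignof(unsigned long long)),
              "a is unaligned, so the struct goes in memory");

// Natural layout: a is padded out to the start of the second eightbyte.
static_assert(offsetof(natural_foo, a) == 8, "a starts the second eightbyte");
static_assert(sizeof(packed_foo) == 9 && sizeof(natural_foo) == 16, "sizes differ");
```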

If none of the above is true, the fields of the object are considered.

If one of the fields is itself a struct/class, the procedure is applied recursively.

The goal is to classify each of the two eight-bytes (8B) in the object.

The classes of the fields in each 8B are considered.

Note that a whole number of fields always exactly occupies one 8B, thanks to the alignment requirement above.

Let C be the class of the 8B and D be the class of the field under consideration.

Let new_class be pseudo-defined as

cls new_class(cls D, cls C)
{
   if (D == C)
      return C;

   if (D == NO_CLASS)
      return C;

   if (C == NO_CLASS)
      return D;

   if (D == MEMORY || C == MEMORY)
      return MEMORY;

   if (D == INTEGER || C == INTEGER)
      return INTEGER;

   if (D == X87 || C == X87 || D == X87UP || C == X87UP)
      return MEMORY;

   return SSE;
}

then the class of the 8B is computed as follows

C = NO_CLASS;

for (field f : fields)
{
    D = get_field_class(f);        //Note this may recursively call this proc
    C = new_class(D, C);
}

Once we have the class of each 8B, say C1 and C2, then

if (C1 == MEMORY || C2 == MEMORY)
    C1 = C2 = MEMORY;

if (C2 == SSEUP && C1 != SSE)
   C2 = SSE;

Note: this is my interpretation of the algorithm given in the ABI document.
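For anyone who wants to play with it, here is a compilable sketch of the merge, the per-8B loop and the final clean-up. The names (`Cls`, `merge`, `classify_eightbyte`, `post_merge`) are mine, field classes are supplied by hand rather than derived from real types, and the merge includes the "both equal" and "either is NO_CLASS" cases as I read them in the ABI text. Treat it as an illustration, not a reimplementation of what a compiler does.

```cpp
#include <cstddef>

enum class Cls { NO_CLASS, INTEGER, SSE, SSEUP, X87, X87UP, MEMORY };

// Merge the class of one field (d) into the running class of its eightbyte (c).
constexpr Cls merge(Cls d, Cls c) {
    return d == c                                   ? c
         : d == Cls::NO_CLASS                       ? c
         : c == Cls::NO_CLASS                       ? d
         : (d == Cls::MEMORY  || c == Cls::MEMORY)  ? Cls::MEMORY
         : (d == Cls::INTEGER || c == Cls::INTEGER) ? Cls::INTEGER
         : (d == Cls::X87   || c == Cls::X87 ||
            d == Cls::X87UP || c == Cls::X87UP)     ? Cls::MEMORY
                                                    : Cls::SSE;
}

// Classify one eightbyte from the classes of the fields that live in it.
constexpr Cls classify_eightbyte(const Cls* fields, std::size_t n) {
    Cls c = Cls::NO_CLASS;
    for (std::size_t i = 0; i < n; ++i) c = merge(fields[i], c);
    return c;
}

struct Pair { Cls first, second; };

// Clean-up once both eightbytes of a 16-byte object are classified.
constexpr Pair post_merge(Pair p) {
    if (p.first == Cls::MEMORY || p.second == Cls::MEMORY)
        return {Cls::MEMORY, Cls::MEMORY};
    if (p.second == Cls::SSEUP && p.first != Cls::SSE)
        p.second = Cls::SSE;
    return p;
}

// Example inputs: an eightbyte holding { int; float; } and one holding { float; float; }.
constexpr Cls int_then_float[] = {Cls::INTEGER, Cls::SSE};
constexpr Cls two_floats[]     = {Cls::SSE, Cls::SSE};

static_assert(classify_eightbyte(int_then_float, 2) == Cls::INTEGER, "INTEGER wins over SSE");
static_assert(classify_eightbyte(two_floats, 2) == Cls::SSE, "equal classes stay as they are");
```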

Example

struct foo
{
    unsigned long long a;
    long double b;
};

unsigned long long foo(struct foo f)
{
  return f.a;
}

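The 8Bs and their fields

First 8B: a
Second 8B: padding (b must be 16-byte aligned)
Third and fourth 8Bs: b

a is INTEGER, so the first 8B is INTEGER.
b is X87 and X87UP, one class per 8B.
The object is 32 bytes, larger than two 8Bs, so the final class is MEMORY for all of its 8Bs.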

Example

struct foo
{
    double a;
    long long b;
};

long long foo(struct foo f)
{
  return f.b;                     //mov rax, rdi
}

The 8Bs and their fields

First 8B: a
Second 8B: b

a is SSE, so the first 8B is SSE.

b is INTEGER so the second 8B is INTEGER.

The final classes are the ones calculated.
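Which 8B a field lands in follows directly from its offset, so the layouts in the two examples can be checked with `offsetof`. A small sketch; `eightbyte_index` and the struct names are hypothetical, and the `long double` part is guarded because its size and alignment are specific to x86-64 System V.

```cpp
#include <cstddef>

struct mixed { double a; long long b; };

// Index of the eightbyte that contains the byte at `offset`.
constexpr std::size_t eightbyte_index(std::size_t offset) { return offset / 8; }

static_assert(eightbyte_index(offsetof(mixed, a)) == 0, "a sits in the first 8B");
static_assert(eightbyte_index(offsetof(mixed, b)) == 1, "b sits in the second 8B");
static_assert(sizeof(mixed) == 16, "exactly two 8Bs");

#if defined(__x86_64__) && !defined(_WIN32)
// The long double example: b is 16-byte aligned, so the struct spans four 8Bs.
struct with_long_double { unsigned long long a; long double b; };
static_assert(offsetof(with_long_double, b) == 16, "8 bytes of padding after a");
static_assert(sizeof(with_long_double) == 32, "four 8Bs");
#endif
```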

Return values

The values are returned according to their classes:

MEMORY
The caller passes a hidden first argument to the function for it to store the result into.

In C++ this often involves copy elision/return value optimisation.
This address must be returned in rax, thereby returning MEMORY classes "by reference" to a hidden, caller-allocated buffer.

If the type has class MEMORY, then the caller provides space for the return
value and passes the address of this storage in %rdi as if it were the first
argument to the function. In effect, this address becomes a “hidden” first
argument.
On return %rax will contain the address that has been passed in by the
caller in %rdi.
INTEGER and POINTER
The registers rax and rdx as needed.
SSE and SSEUP
The registers xmm0 and xmm1 as needed.
X87 AND X87UP
The register st0
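The "as needed" part for INTEGER and SSE can be sketched as picking the next free register of each sequence, one 8B at a time. The names below (`Reg`, `RetRegs`, `pick`, `return_registers`) are hypothetical, and the sketch deliberately ignores SSEUP, X87 and MEMORY.

```cpp
enum class Cls { NO_CLASS, INTEGER, SSE };
enum class Reg { NONE, RAX, RDX, XMM0, XMM1 };

struct RetRegs { Reg first, second; };

// Next free register for one eightbyte, given what earlier eightbytes used.
constexpr Reg pick(Cls c, bool int_used, bool sse_used) {
    return c == Cls::INTEGER ? (int_used ? Reg::RDX  : Reg::RAX)
         : c == Cls::SSE     ? (sse_used ? Reg::XMM1 : Reg::XMM0)
                             : Reg::NONE;
}

// Registers used to return a value whose two eightbytes have classes c1 and c2.
constexpr RetRegs return_registers(Cls c1, Cls c2) {
    return { pick(c1, false, false),
             pick(c2, c1 == Cls::INTEGER, c1 == Cls::SSE) };
}

// struct { double a; long long b; } from the example above: xmm0 and rax.
static_assert(return_registers(Cls::SSE, Cls::INTEGER).first  == Reg::XMM0, "a in xmm0");
static_assert(return_registers(Cls::SSE, Cls::INTEGER).second == Reg::RAX,  "b in rax");
```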

PODs

The technical definition is here.

The definition from the ABI is reported below.

A de/constructor is trivial if it is an implicitly-declared default de/constructor and if:

   • its class has no virtual functions and no virtual base classes, and

   • all the direct base classes of its class have trivial de/constructors, and

   • for all the nonstatic data members of its class that are of class type (or array thereof), each such class has a trivial de/constructor.
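The three bullet points can be observed with the standard type traits. A sketch with made-up struct names; note that the traits follow the C++ standard's notion of triviality, which is close to but not word for word the ABI's wording above.

```cpp
#include <type_traits>

struct trivial_base {};
struct user_dtor { ~user_dtor() {} };                  // user-provided destructor
struct has_virtual { virtual void f() {} unsigned long long a; };
struct derived_from_user_dtor : user_dtor {};          // non-trivial direct base
struct member_with_user_dtor { user_dtor m; };         // non-trivial member
struct all_trivial : trivial_base { unsigned long long a; };

static_assert(std::is_trivially_destructible<all_trivial>::value, "everything implicit");
static_assert(!std::is_trivially_destructible<user_dtor>::value, "not implicitly declared");
static_assert(!std::is_trivially_destructible<derived_from_user_dtor>::value, "base is non-trivial");
static_assert(!std::is_trivially_destructible<member_with_user_dtor>::value, "member is non-trivial");

// Virtual functions make the implicit copy constructor non-trivial.
static_assert(!std::is_trivially_copy_constructible<has_virtual>::value, "virtual functions");
```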

Note that each 8B is classified independently so that each one can be passed accordingly.

Particularly, they may end up on the stack if there are no more parameter registers left.

How is a struct returned by value, in terms of assembly language, if that struct is too large to fit in a register?

The SYSV x86_64 calling conventions (used by everyone except Microsoft) allow structures of up to 16 bytes and INTEGER classification to be returned in the RAX/RDX register pair, while those of SSE classification, also up to 16 bytes, can be returned in the XMM0/XMM1 register pair.

The classification of a struct depends on the types of the fields in the struct, but basically integer and pointer types will be INTEGER while float and double will be SSE.

Larger structs will get MEMORY classification, so will require an extra hidden argument (passed in RDI, so prepended to the existing arguments) specifying a pointer to memory that the return value will be written to. This pointer will be returned in RAX.
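In source terms, the MEMORY case behaves roughly as if the function had been rewritten to take the destination pointer and hand it back. The following is a hand-written approximation of that lowering, not compiler output; `big`, `make_big` and `make_big_lowered` are names invented for the sketch.

```cpp
struct big { unsigned long long a, b, c; };   // 24 bytes, so class MEMORY

// What the function looks like in the source.
constexpr big make_big() { return big{1, 2, 3}; }

// Approximation of what is actually called: the hidden pointer arrives as the
// first argument (RDI) and the same pointer is handed back (RAX).
constexpr big* make_big_lowered(big* hidden) {
    hidden->a = 1;
    hidden->b = 2;
    hidden->c = 3;
    return hidden;
}

// The caller reserves the storage, passes its address, and gets it back.
constexpr bool lowered_matches_source() {
    big slot{};
    big* ret = make_big_lowered(&slot);
    return ret == &slot && slot.a == make_big().a &&
           slot.b == make_big().b && slot.c == make_big().c;
}

static_assert(lowered_matches_source(), "same result, written through the hidden pointer");
```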

This is all detailed in the SYSV x86_64 ABI doc.

How are C structs returned

A calling convention typically does not specifically dictate any code or code sequences, it dictates only state — such as registers and memory, which goes to parameter passing and the stack: where parameters and return values go, what state must be preserved by the call (i.e. some registers and allocated stack memory), and what is scratch (i.e. some registers, and memory below the current stack pointer). It may also dictate things like stack alignment requirements.

The calling convention speaks to state as per above, but only at very specific points in time, namely at the exact boundary when control is transferred from caller to callee, and again when control is transferred back from callee to caller. Thus, the callee has an expectation that the caller has set up all the parameters as expected before its first instruction runs. The caller has the expectation that the callee has set up all the return values (and preserved whatever it must preserve) before the first instruction of its resumption from the call.

For these purposes, the calling convention does not dictate machine code instructions or even sequences of instructions; it only establishes expectation of values and locations at the points of transfer.

C++ What actually happens in assembly when you return a struct from a function?

So I spent hours playing with Godbolt's Compiler Explorer and reading up until I figured out the practical answer.

What I've gathered is this:

If the value fits into a register, it's left in a register as the return value.
If the value fits in 2 registers, it's left in 2 registers.
If the value is larger than this, the caller reserves memory in its own stack and the function writes directly into the caller's stack.

Both G++ & Clang do the same; this was tested on x86_64.

How C structures get passed to function in assembly?

As has been pointed out by others - passing structures by value is generally frowned upon in most cases, but it is allowable by the C language nonetheless. I'll discuss the code you did use even though it isn't how I would have done it.

How structures are passed depends on the ABI / calling convention. There are two primary 64-bit ABIs in use today (there may be others): the 64-bit Microsoft ABI and the x86-64 System V ABI. The 64-bit Microsoft ABI is simple: structures of exactly 1, 2, 4, or 8 bytes are passed like integers of the same size, and every other structure passed by value is copied to memory by the caller and passed by pointer. The x86-64 System V ABI (used by Linux/macOS/BSD) is more complex, as there is a recursive algorithm that is used to determine if a structure can be passed in a combination of general purpose registers / vector registers / x87 FPU stack registers. If it determines the structure can be passed in registers, then the object isn't placed on the stack for the purpose of calling a function. If it doesn't fit in registers per the rules, then it is passed in memory on the stack.

There is a telltale sign that your code isn't using the 64-bit Microsoft ABI as 32 bytes of shadow space weren't reserved by the compiler before making the function call so this is almost certainly a compiler targeting the x86-64 System V ABI. I can generate the same assembly code in your question using the online godbolt compiler with the GCC compiler with optimizations disabled.

Going through the algorithm for passing aggregate types (like structures and unions) is beyond the scope of this answer; you can refer to section 3.2.3 Parameter Passing for the details. What I can say is that this structure is passed on the stack because of a post-cleanup rule that says:

If the size of the aggregate exceeds two eightbytes and the first eightbyte isn’t SSE or any other eightbyte isn’t SSEUP, the whole argument is passed in memory.

It happens to be that your structure would have attempted to have the first two 32-bit int values packed in a 64-bit register and the double placed in a vector register followed by the int being placed in a 64-bit register (because of alignment rules) and the pointer passed in another 64-bit register. Your structure would have exceeded two eightbyte (64-bit) registers and the first eightbyte (64-bit) register isn't an SSE register so the structure is passed on the stack by the compiler.

You have unoptimized code but we can break down the code into chunks. First is building the stack frame and allocating room for the local variable(s). Without optimizations enabled (which is the case here), the structure variable s will be built on the stack and then a copy of that structure will be pushed onto the stack to make the call to print_student_info.

This builds the stack frame and allocates 32 bytes (0x20) for local variables (and maintains 16-byte alignment). Your structure happens to be exactly 32 bytes in size in this case following natural alignment rules:

 6fa:   55                      push   %rbp
 6fb:   48 89 e5                mov    %rsp,%rbp
 6fe:   48 83 ec 20             sub    $0x20,%rsp

Your variable s will start at RBP-0x20 and end at RBP-0x01 (inclusive). The code builds and initializes the s variable (student struct) on the stack. A 32-bit int 0xa (10) for the age field is placed at the beginning of the structure at RBP-0x20. The 32-bit enum for Man is placed in field gen at RBP-0x1c:

 702:   c7 45 e0 0a 00 00 00    movl   $0xa,-0x20(%rbp)
 709:   c7 45 e4 00 00 00 00    movl   $0x0,-0x1c(%rbp)

The constant value 1.30 (type double) is stored in memory by the compiler. You can't move from memory to memory with one instruction on Intel x86 processors so the compiler moved the double value 1.30 from memory location RIP+0x100 to vector register XMM0 then moved the lower 64-bits of XMM0 to the height field on the stack at RBP-0x18:

 710:   f2 0f 10 05 00 01 00    movsd  0x100(%rip),%xmm0        # 818 <_IO_stdin_used+0x48>
 717:   00 
 718:   f2 0f 11 45 e8          movsd  %xmm0,-0x18(%rbp)

The value 3 is placed on the stack for the class field at RBP-0x10:

 71d:   c7 45 f0 03 00 00 00    movl   $0x3,-0x10(%rbp)

Lastly the 64-bit address of the string Tom (in the read only data section of the program) is loaded into RAX and then finally moved into the name field on the stack at RBP-0x08. Although the type for class was only 32-bits (an int type) it was padded to 8 bytes because the following field name has to be naturally aligned on an 8 byte boundary since a pointer is 8 bytes in size.

 724:   48 8d 05 e5 00 00 00    lea    0xe5(%rip),%rax        # 810 <_IO_stdin_used+0x40>
 72b:   48 89 45 f8             mov    %rax,-0x8(%rbp)
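The offsets used by those stores can be cross-checked against the structure layout. The question's actual declaration isn't shown here, so the struct below is my reconstruction from the field names and types mentioned in this answer (with `class` renamed to `class_` because it is a keyword in C++, and a made-up second enumerator); `rbp_offset` is a hypothetical helper.

```cpp
#include <cstddef>

enum gender { Man, Woman };    // only Man (0) appears above; Woman is assumed

struct student {
    int         age;       // stored at RBP-0x20
    enum gender gen;       // stored at RBP-0x1c
    double      height;    // stored at RBP-0x18
    int         class_;    // stored at RBP-0x10, followed by 4 bytes of padding
    const char* name;      // stored at RBP-0x08
};

// Offset from RBP of a field, given that the struct starts at RBP-0x20.
constexpr long rbp_offset(std::size_t field_offset) {
    return static_cast<long>(field_offset) - 0x20;
}

static_assert(sizeof(student) == 32, "matches the 0x20 bytes reserved");
static_assert(rbp_offset(offsetof(student, age))    == -0x20, "age");
static_assert(rbp_offset(offsetof(student, gen))    == -0x1c, "gen");
static_assert(rbp_offset(offsetof(student, height)) == -0x18, "height");
static_assert(rbp_offset(offsetof(student, class_)) == -0x10, "class");
static_assert(rbp_offset(offsetof(student, name))   == -0x08, "name");
```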

At this point we have a structure entirely built on the stack. The compiler then copies it by pushing all 32 bytes (using 4 64-bit pushes) of the structure onto the stack to make the function call:

 72f:   ff 75 f8                pushq  -0x8(%rbp)
 732:   ff 75 f0                pushq  -0x10(%rbp)
 735:   ff 75 e8                pushq  -0x18(%rbp)
 738:   ff 75 e0                pushq  -0x20(%rbp)
 73b:   e8 70 ff ff ff          callq  6b0 <print_student_info>

Then typical stack cleanup and function epilogue:

 740:   48 83 c4 20             add    $0x20,%rsp
 744:   b8 00 00 00 00          mov    $0x0,%eax
 749:   c9                      leaveq

Important Note: The registers used were not for the purpose of passing parameters in this case, but were part of the code that initialized the s variable (struct) on the stack.

Returning Structures

This is dependent on the ABI as well, but I'll focus on the x86-64 System V ABI in this case since that is what your code is using.

By Reference: A pointer to a structure is returned in RAX. Returning pointers to structures is preferred.

By value: A structure in C that is returned by value and is classified MEMORY (as a 32-byte structure like yours would be) forces the compiler to allocate additional space for the return structure in the caller and then the address of that structure is passed as a hidden first parameter in RDI to the function. The called function will place the address that was passed in RDI as a parameter into RAX as the return value when it is finished. Upon return from the function the value in RAX is a pointer to the address where the return structure is stored which is always the same address passed in the hidden first parameter RDI. The ABI discusses this in section 3.2.3 Parameter Passing under the subheading Returning of Values which says:

If the type has class MEMORY, then the caller provides space for the return
value and passes the address of this storage in %rdi as if it were the first
argument to the function. In effect, this address becomes a “hidden” first argument. This storage must not overlap any data visible to the callee through
other names than this argument.
On return %rax will contain the address that has been passed in by the
caller in %rdi.

Get G++ to use a custom calling convention to pass larger structs in registers instead of memory?

You could pass the uint8_t or one of the pointers as a separate arg to describe what you want to the compiler, or stuff it into one of the existing 64-bit members (see below).

Unfortunately no, there aren't compiler options that tweak the C ABI / calling-convention rules to pass structs larger than 16 bytes in registers on x86-64 or other ISAs. The x86-64 System V ABI doesn't do that, and there isn't another calling convention GCC knows about which does. The Windows x64 ABI only passes up to 8-byte objects in registers, not even 16.

Also, you can't override the C++ ABI rule that non-trivially-copyable objects (or whatever the exact criterion is) are passed in memory so they always have an address. (e.g. by value on the stack in x86-64 System V.)

The only options I know of that modify the calling convention are -mabi=ms or whatever to select an existing calling convention GCC knows about. Or ones that affect whether certain registers are call-preserved or call-clobbered, like -fcall-used-reg (GCC manual) and some ABI-affecting options like -fpack-struct[=n] that aren't specifically about the calling convention. (And no, -fpack-struct wouldn't help. Bringing sizeof(A) down from 24 to 17 doesn't let it fit in 2 regs.)

In theory with -fwhole-program or maybe -flto, GCC could invent custom calling conventions, but AFAIK it doesn't. It can take advantage of the fact that another function doesn't clobber certain registers, in terms of inter-procedural optimization (IPO) other than inlining, but not changing how args are passed.

The normal way to handle calling-convention overhead is to make sure small functions inline (e.g. by compiling with -flto to allow cross-file inlining), but this doesn't work if you're taking function pointers or using virtual functions.

It's not number of members, it's total size, so the x32 ABI (with 32-bit pointers/references and size_t) would be able to pass / return that struct packed into two registers. g++ -O3 -mx32.

(x86-64 SysV packs aggregates into up-to-2 registers using the same layout it would in memory, so smaller members means more members fit in 16 bytes.)

Or if you can settle for having a 32-bit size by value, or 48-bit size, you could pack the uint8_t into the upper byte of a uint64_t, or even use bitfield members. But since you have a level of indirection (a reference member) for size_t& __restrict__ dataPos;, that member is basically another pointer; using uint32_t& there wouldn't help since a pointer is still 64 bits. I assume you need that to be a reference for some reason.
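Packing the byte into the top of a `uint64_t` that otherwise holds the size could look like this. It's a sketch under the assumption that the size fits in the low 56 bits; the function and constant names are mine.

```cpp
#include <cstdint>

constexpr std::uint64_t kSizeBits = 56;   // low 56 bits: size, top 8 bits: the byte
constexpr std::uint64_t kSizeMask = (std::uint64_t{1} << kSizeBits) - 1;

// Combine a size (assumed to fit in 56 bits) and an 8-bit value in one word.
constexpr std::uint64_t pack_size_and_byte(std::uint64_t size, std::uint8_t byte) {
    return (std::uint64_t{byte} << kSizeBits) | (size & kSizeMask);
}

constexpr std::uint64_t unpack_size(std::uint64_t word) { return word & kSizeMask; }

constexpr std::uint8_t unpack_byte(std::uint64_t word) {
    return static_cast<std::uint8_t>(word >> kSizeBits);
}

static_assert(unpack_size(pack_size_and_byte(123456, 0xAB)) == 123456, "size round-trips");
static_assert(unpack_byte(pack_size_and_byte(123456, 0xAB)) == 0xAB,   "byte round-trips");
```

With the byte folded in, a struct of one pointer plus this word is 16 bytes and becomes a candidate for two registers.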

You could pack your uint8_t into the upper byte of a pointer. Upcoming HW will have an option to optimize this, ignoring high bits instead of enforcing correct sign-extension from 48-bit or 57-bit. Otherwise you just manually do that with shifts and & with uintptr_t; see "Using the extra 16 bits in 64-bit pointers".

Or since it's easier / more efficient to get data in/out of the bottom of a register on x86-64 (e.g. zero-latency movzx r32, r8), shift the pointer left. That means before deref, you just need an arithmetic right shift to redo sign-extension. This is cheaper than mov r64,imm64 to create a 0xff00000000000000 mask, and as a bonus it sign-extends cheaply so it even works in kernel code.
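A sketch of the shift-left variant, assuming canonical 48-bit addresses (so the top 8 bits carry no information) and an arithmetic right shift for signed integers, which GCC and Clang provide and C++20 guarantees. All names are hypothetical.

```cpp
#include <cstdint>

// Address in the upper 56 bits, tag byte in the low 8 bits.
constexpr std::uint64_t pack_addr(std::uint64_t addr, std::uint8_t tag) {
    return (addr << 8) | tag;
}

// The tag is simply the low byte (a movzx on x86-64).
constexpr std::uint8_t unpack_tag(std::uint64_t word) {
    return static_cast<std::uint8_t>(word);
}

// An arithmetic right shift restores the address and redoes the sign extension.
constexpr std::uint64_t unpack_addr(std::uint64_t word) {
    return static_cast<std::uint64_t>(static_cast<std::int64_t>(word) >> 8);
}

// Pointer wrappers around the integer helpers.
template <typename T>
std::uint64_t pack_ptr(T* p, std::uint8_t tag) {
    return pack_addr(reinterpret_cast<std::uintptr_t>(p), tag);
}

template <typename T>
T* unpack_ptr(std::uint64_t word) {
    return reinterpret_cast<T*>(static_cast<std::uintptr_t>(unpack_addr(word)));
}

// A user-space style and a kernel style address both survive the round trip.
static_assert(unpack_addr(pack_addr(0x00007ffc12345678ULL, 0x5A)) == 0x00007ffc12345678ULL, "user");
static_assert(unpack_addr(pack_addr(0xffff800000001000ULL, 0x5A)) == 0xffff800000001000ULL, "kernel");
```

With 57-bit addresses there are only 7 spare bits at the top, so an 8-bit shift would lose information; that is outside what this sketch covers.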

In theory a compiler can even write a partial register to merge a new low-8 in after left-shifting, to create this data. (But if writing to memory, overlapping qword and byte stores could be even better, not even needing a shift. If you aren't re-reading soon enough to cause a store-forwarding stall.)

(But if you have a CPU with the LAM feature, you can use the high 8 bits and have the CPU ignore those bits.)
