How Do Objects Work in X86 at the Assembly Level

How do objects work in x86 at the assembly level?

Classes are stored exactly the same way as structs, except when they have virtual members. In that case, there's an implicit vtable pointer as the first member (see below).

A struct is stored as a contiguous block of memory (if the compiler doesn't optimize it away or keep the member values in registers). Within a struct object, addresses of its elements increase in the order in which the members were defined (source: http://en.cppreference.com/w/c/language/struct). I linked the C definition, because in C++, struct means class (with public: as the default instead of private:).

Think of a struct or class as a block of bytes that might be too big to fit in a register, but which is copied around as a "value". Assembly language doesn't have a type system; bytes in memory are just bytes and it doesn't take any special instructions to store a double from a floating point register and reload it into an integer register. Or to do an unaligned load and get the last 3 bytes of 1 int and the first byte of the next. A struct is just part of building C's type system on top of blocks of memory, since blocks of memory are useful.

These blocks of bytes can have static (global or static), dynamic (malloc or new), or automatic storage (local variable: temporary on the stack or in registers, in normal C/C++ implementations on normal CPUs). The layout within a block is the same regardless (unless the compiler optimizes away the actual memory for a struct local variable; see the example below of inlining a function that returns a struct.)

A struct or class is the same as any other object. In C and C++ terminology, even an int is an object: http://en.cppreference.com/w/c/language/object. i.e. A contiguous block of bytes that you can memcpy around (except for non-POD types in C++).
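For example, for a trivially-copyable type, copying the bytes is the same as copying the value (a minimal sketch; the struct name is just for illustration):

#include <cstring>

struct Point { int x, y; };   // trivially copyable: just 8 bytes of memory

void copy_point(Point *dst, const Point *src) {
    std::memcpy(dst, src, sizeof(Point));   // same effect as *dst = *src for a type like this
}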

The ABI rules for the system you're compiling for specify when and where padding is inserted to make sure each member has sufficient alignment even if you do something like struct { char a; int b; }; (for example, the x86-64 System V ABI, used on Linux and other non-Windows systems specifies that int is a 32-bit type that gets 4-byte alignment in memory. The ABI is what nails down some stuff that the C and C++ standards leave "implementation dependent", so that all compilers for that ABI can make code that can call each other's functions.)

Note that you can use offsetof(struct_name, member) to find out about struct layout (in C11 and C++11). See also alignof in C++11, or _Alignof in C11.

It's up to the programmer to order struct members well to avoid wasting space on padding, since C rules don't let the compiler sort your struct for you. (e.g. if you have some char members, put them in groups of at least 4, rather than alternating with wider members. Sorting from large to small is an easy rule, remembering that pointers may be 64 or 32-bit on common platforms.)
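To see the layout and padding concretely, here's a small sketch using offsetof and sizeof (the numbers in the comments assume a typical ABI where int is 4 bytes with 4-byte alignment, like x86-64 System V):

#include <cstddef>   // offsetof
#include <cstdio>

struct padded   { char a; int b; char c; };   // 1 + 3 (pad) + 4 + 1 + 3 (pad) = 12 bytes
struct repacked { int b; char a; char c; };   // 4 + 1 + 1 + 2 (pad) = 8 bytes

int main() {
    std::printf("padded:   sizeof=%zu, b at offset %zu\n",
                sizeof(padded), offsetof(padded, b));
    std::printf("repacked: sizeof=%zu, b at offset %zu\n",
                sizeof(repacked), offsetof(repacked, b));
}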

More details of ABIs and so on can be found at https://stackoverflow.com/tags/x86/info. Agner Fog's excellent site includes an ABI guide, along with optimization guides.


Classes (with member functions)

class foo {
    int m_a;
    int m_b;
    void inc_a(void){ m_a++; }
    int inc_b(void);
};

int foo::inc_b(void) { return m_b++; }

compiles to (using http://gcc.godbolt.org/):

foo::inc_b():                    # args: this in RDI
    mov eax, DWORD PTR [rdi+4]   # eax = this->m_b
    lea edx, [rax+1]             # edx = eax+1
    mov DWORD PTR [rdi+4], edx   # this->m_b = edx
    ret

As you can see, the this pointer is passed as an implicit first argument (in rdi, in the SysV AMD64 ABI). m_b is stored at 4 bytes from the start of the struct/class. Note the clever use of lea to implement the post-increment operator, leaving the old value in eax.

No code for inc_a is emitted, since it's defined inside the class declaration. It's treated the same as an inline non-member function. If it were really big and the compiler decided not to inline it, it could emit a stand-alone version of it.
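In other words, a non-virtual member function behaves like a free function that takes the object pointer as an explicit first argument. A rough sketch (hypothetical C-style names, ignoring name mangling):

struct foo_data { int m_a, m_b; };   // same layout as class foo above

// roughly what foo::inc_b() means: `this` becomes an explicit pointer parameter
int foo_inc_b(foo_data *self) {
    return self->m_b++;
}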


Where C++ objects really differ from C structs is when virtual member functions are involved. Each copy of the object has to carry around an extra pointer (to the vtable for its actual type).

class foo {
public:
    int m_a;
    int m_b;
    void inc_a(void){ m_a++; }
    void inc_b(void);
    virtual void inc_v(void);
};

void foo::inc_b(void) { m_b++; }

class bar: public foo {
public:
    virtual void inc_v(void);   // overrides foo::inc_v even for users that access it through a pointer to class foo
};

void foo::inc_v(void) { m_b++; }
void bar::inc_v(void) { m_a++; }

compiles to

; This time I made the functions return void, so the asm is simpler
; The in-memory layout of the class is now:
;   vtable ptr (8B)
;   m_a (4B)
;   m_b (4B)
foo::inc_v():
    add DWORD PTR [rdi+12], 1   # this_2(D)->m_b,
    ret
bar::inc_v():
    add DWORD PTR [rdi+8], 1    # this_2(D)->D.2657.m_a,
    ret

# if you uncheck the hide-directives box, you'll see
    .globl foo::inc_b()
    .set   foo::inc_b(),foo::inc_v()
# since inc_b has the same definition as foo's inc_v, gcc saves space by making one an alias for the other

# you can also see the directives that define the data that goes in the vtables
Fun fact: add m32, imm8 is faster than inc m32 on most Intel CPUs (micro-fusion of the load+ALU uops); one of the rare cases where the old Pentium4 advice to avoid inc still applies. gcc always avoids inc, though, even when it would save code size with no downsides :/ (See the Stack Overflow question "INC instruction vs ADD 1: Does it matter?".)


Virtual function dispatch:

void caller(foo *p){
    p->inc_v();
}

    mov rax, QWORD PTR [rdi]    # p_2(D)->_vptr.foo, p_2(D)->_vptr.foo
    jmp [QWORD PTR [rax]]       # *_3

(This is an optimized tailcall: jmp replacing call/ret).

The mov loads the vtable address from the object into a register. The jmp is a memory-indirect jump, i.e. loading a new RIP value from memory. The jump-target address is vtable[0], i.e. the first function pointer in the vtable. If the call were to a second virtual function (the next vtable slot), the mov wouldn't change but the jmp would use jmp [rax + 8].
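In C-like terms, the dispatch is roughly equivalent to the sketch below (hypothetical names; a real implementation such as the Itanium C++ ABI also has extra vtable slots for RTTI and handles multiple inheritance):

struct foo_plain;                         // plain-struct picture of class foo

struct foo_vtable {
    void (*inc_v)(foo_plain *self);       // vtable[0]; a second virtual function would be at +8
};

struct foo_plain {
    const foo_vtable *vptr;               // hidden first member, set by the constructor
    int m_a;
    int m_b;
};

// roughly what caller(foo *p){ p->inc_v(); } does:
void caller_sketch(foo_plain *p) {
    p->vptr->inc_v(p);                    // load vptr, load vtable[0], indirect call with p as `this`
}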

The order of entries in the vtable presumably matches the order of declaration in the class, so reordering the class declaration in one translation unit would result in virtual functions going to the wrong target. Just like reordering the data members would change the class's ABI.

If the compiler had more information, it could devirtualize the call. e.g. if it could prove that the foo * was always pointing to a bar object, it could inline bar::inc_v().

GCC will even speculatively devirtualize when it can figure out what the type probably is at compile time. In the code below, the compiler can't see any classes that inherit from bar, so it's a good bet that a bar* is pointing to a bar object, rather than some derived class.

void caller_bar(bar *p){
    p->inc_v();
}

# gcc5.5 -O3
caller_bar(bar*):
    mov rax, QWORD PTR [rdi]            # load vtable pointer
    mov rax, QWORD PTR [rax]            # load target function address
    cmp rax, OFFSET FLAT:bar::inc_v()   # check it
    jne .L6                             #,
    add DWORD PTR [rdi+8], 1            # inlined version of bar::inc_v()
    ret
.L6:
    jmp rax                             # otherwise tailcall the derived class's function

Remember, a foo * can actually point to a derived bar object, but a bar * is not allowed to point to a pure foo object.

It is just a bet though; part of the point of virtual functions is that types can be extended without recompiling all the code that operates on the base type. This is why it has to compare the function pointer and fall back to the indirect call (jmp tailcall in this case) if it was wrong. Compiler heuristics decide when to attempt it.

Notice that it's checking the actual function pointer, rather than comparing the vtable pointer. It can still use the inlined bar::inc_v() as long as the derived type didn't override that virtual function. Overriding other virtual functions wouldn't affect this one, but would require a different vtable.

Allowing extension without recompilation is handy for libraries, but also means looser coupling between parts of a big program (i.e. you don't have to include all the headers in every file).

But this imposes some efficiency costs for some uses: C++ virtual dispatch only works through pointers to objects, so you can't have a polymorphic array without hacks, or expensive indirection through an array of pointers (which defeats a lot of hardware and software optimizations: Fastest implementation of simple, virtual, observer-sort of, pattern in c++?).

If you want some kind of polymorphism / dispatch but only for a closed set of types (i.e. all known at compile time), you can do it manually with a union + enum + switch, or with std::variant<D1,D2> to make a union and std::visit to dispatch, or various other ways. See also Contiguous storage of polymorphic types and Fastest implementation of simple, virtual, observer-sort of, pattern in c++?.
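As a hedged sketch of the std::variant approach (C++17; the type names are invented for illustration), each object stores a small type index instead of a vtable pointer, and the whole array can stay contiguous:

#include <cstdio>
#include <variant>

struct Circle { float r; };
struct Square { float side; };

using Shape = std::variant<Circle, Square>;   // closed set of types known at compile time

struct AreaVisitor {
    float operator()(const Circle &c) const { return 3.14159f * c.r * c.r; }
    float operator()(const Square &s) const { return s.side * s.side; }
};

int main() {
    Shape shapes[] = { Circle{1.0f}, Square{2.0f} };   // contiguous storage, no per-object allocation
    float total = 0;
    for (const Shape &s : shapes)
        total += std::visit(AreaVisitor{}, s);         // dispatch on the active alternative
    std::printf("total area = %f\n", total);
}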


Objects aren't always stored in memory at all.

Using a struct doesn't force the compiler to actually put stuff in memory, any more than a small array or a pointer to a local variable does. For example, an inline function that returns a struct by value can still fully optimize.

The as-if rule applies: even if a struct logically has some memory storage, the compiler can make asm that keeps all the needed members in registers (and do transformations that mean that values in registers don't correspond to any value of a variable or temporary in the C++ abstract machine "running" the source code).

struct pair {
    int m_a;
    int m_b;
};

pair addsub(int a, int b) {
    return {a+b, a-b};
}

int foo(int a, int b) {
    pair ab = addsub(a,b);
    return ab.m_a * ab.m_b;
}

That compiles (with g++ 5.4) to:

# The non-inline definition which actually returns a struct
addsub(int, int):
    lea  edx, [rdi+rsi]   # add result
    mov  eax, edi
    sub  eax, esi         # sub result
    # then pack both struct members into a 64-bit register, as required by the x86-64 SysV ABI
    sal  rax, 32
    or   rax, rdx
    ret

# But when inlining, it optimizes away
foo(int, int):
    lea  eax, [rdi+rsi]   # a+b
    sub  edi, esi         # a-b
    imul eax, edi         # (a+b) * (a-b)
    ret

Notice how even returning a struct by value doesn't necessarily put it in memory. The x86-64 SysV ABI passes and returns small structs packed together into registers. Different ABIs make different choices for this.

What happens at assembly level when you have functions with large inputs

Arrays (including strings) are passed by reference in most high level languages. int foo(char*) just gets a pointer value as an arg, and a pointer is typically one machine word (i.e. it fits in a register). In good modern calling conventions, the first few integer/pointer args are typically passed in registers.

In C/C++, you can't pass a bare array by value. Given int arr[16]; func(arr);, the function func only gets a pointer (to the first element).
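A minimal sketch of that decay: inside the function the "array" parameter is really just a pointer, so sizeof no longer reflects the whole array:

#include <cstdio>

void func(int arr[16]) {                              // really int *arr; the [16] is ignored
    std::printf("inside func: %zu\n", sizeof(arr));   // size of a pointer (8 on x86-64)
}

int main() {
    int arr[16];
    std::printf("in caller:   %zu\n", sizeof(arr));   // 64: the whole array
    func(arr);                                        // only the address of arr[0] is passed
}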

In some other higher level languages, arrays might be more like C++ std::vector so the callee might be able to grow/shrink the array and find out its length without a separate arg. That would typically mean there's a "control block".

In C and C++ you can pass structs by value, and then it's up to the calling-convention rules to specify how to pass them.

x86-64 System V, for example, passes structs of 16 bytes or less packed into up to 2 integer registers. Larger structs are copied onto the stack, regardless of how large an array member they contain (What kind of C11 data type is an array according to the AMD64 ABI). (So don't pass giant objects by value to non-inline functions!)

The Windows x64 calling convention passes large structs by hidden reference.

Example:

typedef struct {
    // too big makes the asm output cluttered with loops or memcpy
    // int Big_McLargeHuge[1024*1024];
    int arr[4];
    long long a, b;   //,c,d;
} bigobj;
// total 32 bytes with int=4, long long=8 bytes

int func(bigobj a);
int foo(bigobj a) {
    a.arr[3]++;
    return func(a);
}

source + asm output on the Godbolt compiler explorer.

You can try other architectures on Godbolt with their standard calling conventions, like ARM or AArch64. I picked x86-64 because I happened to know of an interesting difference in the two major calling conventions on that one platform for struct-passing.

x86-64 System V (gcc7.3 -O3): foo has a real by-value copy of its arg (done by its caller) that it can modify, so it does so and uses it as the arg for the tail-call. (If it can't tailcall, it would have to make yet another full copy. This example artificially makes System V look really good).

foo(bigobj):
    add DWORD PTR [rsp+20], 1   # increment the struct member in the arg on the stack
    jmp func(bigobj)            # tailcall func(a)

x86-64 Windows (MSVC CL19 /Ox): note that we address a.arr[3] via RCX, the first integer/pointer arg. So there is a hidden reference, but it's not a const-reference. This function was called by value, but it's modifying the data it got by reference. So the caller has to make a copy, or at least assume that a callee destroyed the arg it got a pointer to. (No copy required if the object is dead after that, but that's only possible for local struct objects, not for passing a pointer to a global or something).

$T1 = 32    ; offset of the tmp copy in this function's stack frame
foo PROC
    sub    rsp, 72                         ; 00000048H ; 32B of shadow space + 32B bigobj + 8 to align
    inc    DWORD PTR [rcx+12]
    movups xmm0, XMMWORD PTR [rcx]         ; load modified `a`
    movups xmm1, XMMWORD PTR [rcx+16]      ; apparently alignment wasn't required
    lea    rcx, QWORD PTR $T1[rsp]
    movaps XMMWORD PTR $T1[rsp], xmm0
    movaps XMMWORD PTR $T1[rsp+16], xmm1   ; store a copy
    call   int __cdecl func(struct bigobj)
    add    rsp, 72                         ; 00000048H
    ret    0
foo ENDP

Making another copy of the object appears to be a missed optimization. I think this would be a valid implementation of foo for the same calling convention:

foo:
    add DWORD PTR [rcx+12], 1   ; more efficient than INC because of the memory dst, on Intel CPUs
    jmp func                    ; tailcall with the pointer still in RCX

x86-64 clang for the SysV ABI also misses the optimization that gcc7.3 found, and does copy like MSVC.

So the ABI difference is less interesting than I thought; in both cases the callee "owns" the arg, even though for Windows it's not guaranteed to be on the stack. I guess this enables dynamic allocation for passing very large objects by value without a stack overflow, but that's kind of pointless. Just don't do it in the first place.


Small objects:

x86-64 System V passes small objects packed into registers. Clang finds a neat optimization if you comment out the long long members so you just have

typedef struct {
    int arr[4];
    // long long a,b; //,c,d;
} bigobj;

# clang6.0 -O3
foo(bigobj):                   # @foo(bigobj)
    movabs rax, 4294967296     # 0x100000000 = 1ULL << 32
    add    rsi, rax
    jmp    func(bigobj)        # TAILCALL

(arr[0..1] is packed into RDI, and arr[2..3] is packed into RSI, the first 2 integer/pointer arg-passing registers in the x86-64 SysV ABI).

gcc unpacks arr[3] into a register by itself where it can increment it.

But clang, instead of unpacking and repacking, increments the high 32 bits of RSI by adding 1ULL<<32.
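In C terms, clang's transformation amounts to something like this (just a sketch of the idea; in the real asm the add happens directly on RSI, not on a value loaded from memory):

#include <cstdint>

// arr[2] (low 32 bits) and arr[3] (high 32 bits) travel packed in one 64-bit register
uint64_t bump_arr3(uint64_t packed) {
    return packed + (1ULL << 32);   // increments the high half; the addend's low half is zero, so arr[2] is untouched
}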

MSVC still passes by hidden reference, and still copies the whole object.

What does an object file 'do' and why is it required?


What exactly is this doing

That is assembling (translating) your assembly (human-readable text) into machine code (machine-readable binary).

why is this needed

Because CPUs cannot execute text, they can only execute machine code.

Why couldn't the computer understand that directly?

First of all, the first lines are directives to the assembler and not instructions. These lines tell the assembler how to assemble, not what to assemble.

Secondly, designing and building a CPU that executes text assembly directly would certainly be possible, but it has no advantage at all: it would be harder to design, harder to build, and slower at executing programs, and programs would probably be 3x or 4x the size of the equivalent machine code. It simply wouldn't be an efficient way to do things.

what does this x86_64 asm code do?

Looks like a virtual function call.

Looks like un-optimized compiler output, given the use of rbp as a frame pointer, and the use of add rax,8 instead of folding the offset into an addressing mode like mov rax, [rax+8]. It also reloads the local from the stack for no reason, when it could have kept the pointer live in a register after loading it earlier.

It looks like it's preparing a this pointer as an arg for the virtual function, as well as calling a virtual function through the vtable (see "How do objects work in x86 at the assembly level" above).

Anyway this code is just an inefficient way of writing

mov  rdi, [rbp+var_1C+4]   # load a pointer to a local from the stack
mov  rax, [rdi]            # load the vtable pointer from the start of the object
call [rax+8]               # index into the vtable

Or maybe setting rdx was intended, rather than just inefficient un-optimized code. In the x86-64 System V ABI, rdx holds the 3rd integer / pointer arg to function calls. rdi holds the first (including this as a hidden first arg). If rsi was set earlier, then perhaps rdx is being passed intentionally, so it's a call like foo->virtual_function(something, foo), where class something *foo is a local variable or function arg.

Otherwise it's just foo->virtual_function().
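A hedged guess at source with that shape (class and function names invented purely for illustration):

struct Widget {                 // hypothetical class
    virtual void first();       // vtable slot 0
    virtual void second();      // vtable slot 1, i.e. [vtable + 8]
};

void example(Widget *p) {
    p->second();                // mov rdi,p ; mov rax,[rdi] ; call [rax+8]  (when not devirtualized)
}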


Or it's something completely different that just happens to look like what a compiler would emit for a virtual function call. You can't know without more context.

How does the assembler handle classes and objects, and how are they stored in RAM and the executable?

The assembler has no clue what a class is; it only assembles machine code, with the occasional macro tossed in. For all intents and purposes, a class is merely a struct with an optional vftable, with all the handling and class 'special features' (virtual dispatch, polymorphism, inheritance, etc.) being done in the intermediate stage, when IR code is created.

Memory would be allocated the same way as for a struct, variable, array, or any other data 'blob' (statically or dynamically, taking alignment, const-ness, and packing into account), except for the support code to handle stack-based and static dtor unwinding (done again at the IR level), ctors, and static initialization (though static initialization can happen for more than class objects).

I suggest you give the dragon book a read-through (the first eight chapters would cover it) to get a clearer picture of how a compiler and assembler work, seeing as these things are not handled by the assembler, but by the compiler front and/or back ends, depending on how the compiler and its IL are structured.

What is a functions list pointer at beginning of structure called c++

That's the "vtable" a.k.a. "virtual method table".

How does linking a node to one another works in assembly language when it comes to binary tree?

First, let's differentiate between what the processor does aka machine code, and assembly language, since assembly languages can vary dramatically in what they can do, whereas the machine code is pretty primitive.

A node would be a struct in C. In machine code, the processor doesn't see structs; it sees constants like offsets and sizes. Let's take the following struct:

typedef struct Node *NodePtr;
typedef struct Node {
    int value;
    NodePtr leftChild, rightChild;
} Node;

In the C language, a type determines a variable's size. Assuming a 32-bit machine, an int will take 4 bytes, and pointers will also take 4 bytes. Thus, the size of one instance of Node is 12 bytes, and the value field is at offset 0 from the beginning of such an instance, while leftChild is at offset 4, and rightChild at offset 8.

(Some assembly languages allow computing offsets of structs, giving access to named constants for the 0, 4, and 8 here, but the processor won't see names, only numbers.)
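To make those numbers concrete, here's a sketch that asks the compiler for the offsets it actually chose (on a 64-bit target the pointers are 8 bytes each, so you'd see offsets 0, 8, 16 and sizeof(Node) = 24 instead):

#include <cstddef>   // offsetof
#include <cstdio>

struct Node {
    int value;
    struct Node *leftChild, *rightChild;
};

int main() {
    std::printf("sizeof(Node) = %zu\n", sizeof(struct Node));
    std::printf("value at %zu, leftChild at %zu, rightChild at %zu\n",
                offsetof(struct Node, value),
                offsetof(struct Node, leftChild),
                offsetof(struct Node, rightChild));
}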

How are the nodes mapped with respect to the memory location?

Memory locations are chosen as either global storage, local storage, or heap storage, in both C and assembly language. The mechanism to take the address of global storage or local storage is processor specific, but heap allocation is done using malloc or similar, just like in C. In machine code, taking the address of a local variable involves computing that variable's offset from the stack (or frame) pointer, where in C we just put & in front of a variable's name.

Because unlike C, you do need to specify the pointer location and the NULL pointer, right?

Just like in C, to initialize a Node we need to initialize each field: the value field with an element of {1,2,3,4}, the leftChild with perhaps NULL, and the rightChild with the address of some other Node. The machine code to do this will reference offset 0, offset 4, and offset 8 of the chosen storage.

We need to know the address value of NULL, but it is almost always the value 0 of the appropriate pointer size.

Will deletion and insertion be the same procedure as in C? Also, extending to an AVL tree, is tree rotation the same procedure as given in the C language too?

Yes, these operations are conceptually the same. It is the field accesses / pointer dereferencing, and the if-thens and whiles, that require translation to their assembly equivalents.
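For instance, building a leaf node and searching the tree look just like the C below; the compiler turns each field access into a load or store at the member's fixed offset, and the if/while into compares and branches (a sketch using the Node struct from above):

#include <cstdlib>   // malloc, NULL

struct Node { int value; struct Node *leftChild, *rightChild; };

struct Node *make_leaf(int v) {
    struct Node *n = (struct Node *)std::malloc(sizeof(struct Node));   // heap storage
    n->value = v;              // store at offset 0
    n->leftChild = NULL;       // store at the leftChild offset (NULL is normally all-zero bits)
    n->rightChild = NULL;      // store at the rightChild offset
    return n;
}

struct Node *find(struct Node *cur, int key) {
    while (cur != NULL) {                             // test a register against 0, conditional branch
        if (key == cur->value)                        // load the int at offset 0, compare
            return cur;
        cur = (key < cur->value) ? cur->leftChild     // load a pointer at the left/right offset
                                 : cur->rightChild;
    }
    return NULL;
}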

Under which cases are machine code and object code generated?

Different compiler toolchains do things differently. As discussed in the comments to your original question, some compilers convert source code directly into machine code, while others convert source code into assembly code (which serves as an intermediate representation), and then run that assembly code through an assembler to generate machine code. My experience is primarily with compilers that do the former, but Peter Cordes correctly pointed out that GCC does the latter. The actual implementation is mostly irrelevant, unless you're working on the compiler itself. Either one will produce the correct result, and neither has any relevant impact on the person who uses the compiler.

There is actually another alternative that fits conceptually somewhere between these two models. Clang (more specifically, LLVM) is an example of this. It compiles source code into an intermediate language, but rather than using architecture-specific assembly language as that intermediate representation, it uses IR (Intermediate Representation). In fact, this is the big innovation of the LLVM compilation model. The toolchain is implemented in three stages: there is a frontend that parses source code into IR code, an optimizer that performs optimization passes on the IR code, and then a backend that converts the IR code into machine code for a specific CPU.

This design allows a variety of frontends to be written for all sorts of source languages, as well as a variety of backends that target any CPU. The optimizer that sits in the middle is, however, the same in all cases, since it works on the intermediate IR code, which is always the same.

Other compilers may do something similar internally (so that it appears from the outside as a single step), or they may compile directly to architecture-specific machine code.

As for your specific question, what is the difference between "object code" and "machine code", there really is no difference. Object code generally refers to what you find in object files, such as ELF on *nix or PE on Windows, but this is actually just machine code.

As the name implies, machine code is machine-specific. Machine code for the x86 will only run on the x86 processor; ARM machine code only runs on the ARM processor; etc.

This is generated as the final stage of output by a compiler. It produces an object file, which contains object code, which is actually just machine code packed into an object file. However, object files often contain more than just machine code—they also contain tables for internal and external symbols, constant data, debugging information, etc. All of this is used by the linker when it generates the final executable image from these object files.

Although you didn't ask specifically about it, assembly code is just the human-readable form of machine code. Machine code is in a binary format that can be executed directly by the processor, whereas assembly code uses mnemonics that are easier for the programmer to read and write. For example, on x86, the object code might contain the byte 0x74, but in assembly language, this would be represented by the mnemonic JE (or, equivalently, JZ). An assembler is a program that converts assembly-language mnemonics into binary machine code.


