What Are the Different Calling Conventions in C/C++ and What Do Each Mean

What are the different calling conventions in C/C++ and what do each mean?

Simple answer: I use cdecl, stdcall, and fastcall. I seldom use fastcall. stdcall is used to call Windows API functions.

Detailed answer (Stolen from Wikipedia):

cdecl - In cdecl, subroutine arguments are passed on the stack, pushed in right-to-left order, and the caller cleans up the stack after the call. Integer values and memory addresses are returned in the EAX register, floating point values in the ST0 x87 register. Registers EAX, ECX, and EDX are caller-saved, and the rest are callee-saved. The x87 floating point registers ST0 to ST7 must be empty (popped or freed) when calling a new function, and ST1 to ST7 must be empty on exiting a function. ST0 must also be empty when not used for returning a value.

syscall - This is similar to cdecl in that arguments are pushed right-to-left. EAX, ECX, and EDX are not preserved. The size of the parameter list in doublewords is passed in AL.

pascal - the parameters are pushed on the stack in left-to-right order (opposite of cdecl), and the callee is responsible for balancing the stack before return.

stdcall - The stdcall[4] calling convention is a variation on the Pascal calling convention in which the callee is responsible for cleaning up the stack, but the parameters are pushed onto the stack in right-to-left order, as in the __cdecl calling convention. Registers EAX, ECX, and EDX are designated for use within the function. Return values are stored in the EAX register.

fastcall - __fastcall convention (aka __msfastcall) passes the first two arguments (evaluated left to right) that fit into ECX and EDX. Remaining arguments are pushed onto the stack from right to left.
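As a rough illustration of how cdecl, stdcall, and fastcall are spelled in source, the sketch below declares one function per convention. The CC_* macro names and the add_* functions are made up for this example. The keywords only take effect with MSVC-compatible compilers targeting 32-bit x86, so the macros are assumed to expand to nothing everywhere else.

```cpp
// Hypothetical macros: apply the keywords only where they are meaningful
// (MSVC-compatible compilers targeting 32-bit x86).
#if defined(_MSC_VER) && defined(_M_IX86)
  #define CC_CDECL    __cdecl     // caller cleans the stack
  #define CC_STDCALL  __stdcall   // callee cleans the stack
  #define CC_FASTCALL __fastcall  // first two args in ECX/EDX, callee cleans
#else
  #define CC_CDECL
  #define CC_STDCALL
  #define CC_FASTCALL
#endif

// Same body, three different ways of receiving the arguments.
int CC_CDECL    add_cdecl(int a, int b)    { return a + b; }
int CC_STDCALL  add_stdcall(int a, int b)  { return a + b; }
int CC_FASTCALL add_fastcall(int a, int b) { return a + b; }
```

The convention only changes how the call is encoded, so from the caller's point of view all three functions return the same result.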

vectorcall - In Visual Studio 2013, Microsoft introduced the __vectorcall calling convention in response to efficiency concerns from game, graphic, video/audio, and codec developers.[7] For IA-32 and x64 code, __vectorcall is similar to __fastcall and the original x64 calling conventions respectively, but extends them to support passing vector arguments using SIMD registers. For x64, when any of the first six arguments are vector types (float, double, __m128, __m256, etc.), they are passed in via the corresponding XMM/YMM registers. Similarly for IA-32, up to six XMM/YMM registers are allocated sequentially for vector type arguments from left to right regardless of position. Additionally, __vectorcall adds support for passing homogeneous vector aggregate (HVA) values, which are composite types consisting solely of up to four identical vector types, using the same six registers. Once the registers have been allocated for vector type arguments, the unused registers are allocated to HVA arguments from left to right regardless of position. Resulting vector type and HVA values are returned using the first four XMM/YMM registers.

safecall - In Delphi and Free Pascal on Microsoft Windows, the safecall calling convention encapsulates COM (Component Object Model) error handling, thus exceptions aren't leaked out to the caller, but are reported in the HRESULT return value, as required by COM/OLE. When calling a safecall function from Delphi code, Delphi also automatically checks the returned HRESULT and raises an exception if necessary.

The safecall calling convention is the same as the stdcall calling convention, except that exceptions are passed back to the caller in EAX as a HResult (instead of in FS:[0]), while the function result is passed by reference on the stack as though it were a final "out" parameter. When calling a Delphi function from Delphi this calling convention will appear just like any other calling convention, because although exceptions are passed back in EAX, they are automatically converted back to proper exceptions by the caller. When using COM objects created in other languages, the HResults will be automatically raised as exceptions, and the result for Get functions is in the result rather than a parameter. When creating COM objects in Delphi with safecall, there is no need to worry about HResults, as exceptions can be raised as normal but will be seen as HResults in other languages.
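The transformation described above can be imitated by hand in C++. This is only an analogy, not what the Delphi compiler actually emits: HResult, kOk, kFail, and the three functions are names invented for the sketch, with kFail assumed to stand in for E_FAIL.

```cpp
#include <cstdint>
#include <stdexcept>

// Assumed stand-ins for the Windows HRESULT type and two well-known codes.
using HResult = std::int32_t;
constexpr HResult kOk   = 0;            // S_OK
constexpr HResult kFail = -2147467259;  // E_FAIL (0x80004005)

// What the programmer conceptually writes: returns a value, may throw.
int parse_positive(int raw) {
    if (raw <= 0) throw std::invalid_argument("value must be positive");
    return raw;
}

// What a safecall-style wrapper exposes: status in the return value,
// result through a trailing out parameter, no exception escapes.
HResult parse_positive_safecall(int raw, int* result) noexcept {
    try {
        *result = parse_positive(raw);
        return kOk;
    } catch (...) {
        return kFail;
    }
}

// What the calling side does: turn a failing status back into an exception.
int call_parse_positive(int raw) {
    int result = 0;
    if (parse_positive_safecall(raw, &result) < 0) {
        throw std::runtime_error("safecall-style call failed");
    }
    return result;
}
```

Code calling call_parse_positive sees an ordinary function that returns a value or throws, while only a status code and an out parameter cross the boundary in between.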

Microsoft X64 Calling Convention - The Microsoft x64 calling convention[12][13] is followed on Windows and pre-boot UEFI (for long mode on x86-64). It uses registers RCX, RDX, R8, R9 for the first four integer or pointer arguments (in that order), and XMM0, XMM1, XMM2, XMM3 are used for floating point arguments. Additional arguments are pushed onto the stack (right to left). Integer return values (similar to x86) are returned in RAX if 64 bits or less. Floating point return values are returned in XMM0. Parameters less than 64 bits long are not zero extended; the high bits are not zeroed.
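The register assignment can be written down as a small lookup. The sketch assumes the positional rule from Microsoft's documentation, which the text above does not spell out: the register is chosen by the argument's position, so a floating point argument in the second position uses XMM1 even if the first argument was an integer. ArgClass and ms_x64_arg_location are hypothetical names, and aggregates and variadic calls are ignored.

```cpp
#include <string>

enum class ArgClass { IntegerOrPointer, FloatingPoint };

// Returns the register used for the argument at zero-based `position`,
// or "stack" once the four register slots are used up.
std::string ms_x64_arg_location(unsigned position, ArgClass cls) {
    static const char* const int_regs[]   = {"RCX", "RDX", "R8", "R9"};
    static const char* const float_regs[] = {"XMM0", "XMM1", "XMM2", "XMM3"};
    if (position >= 4) return "stack";
    return cls == ArgClass::IntegerOrPointer ? int_regs[position]
                                             : float_regs[position];
}
```

For a hypothetical f(int, double, char*, float, int), this gives RCX, XMM1, R8, XMM3, and the stack for the fifth argument.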

When compiling for the x64 architecture in a Windows context (whether using Microsoft or non-Microsoft tools), there is only one calling convention – the one described here, so that stdcall, thiscall, cdecl, fastcall, etc., are now all one and the same.

In the Microsoft x64 calling convention, it is the caller's responsibility to allocate 32 bytes of "shadow space" on the stack right before calling the function (regardless of the actual number of parameters used), and to pop the stack after the call. The shadow space is used to spill RCX, RDX, R8, and R9,[14] but must be made available to all functions, even those with fewer than four parameters.

The registers RAX, RCX, RDX, R8, R9, R10, R11 are considered volatile (caller-saved).[15]

The registers RBX, RBP, RDI, RSI, RSP, R12, R13, R14, and R15 are considered nonvolatile (callee-saved).[15]

For example, a function taking 5 integer arguments will take the first to fourth in registers, and the fifth will be pushed on the top of the shadow space. So when the called function is entered, the stack will be composed of (in ascending order) the return address, followed by the shadow space (32 bytes) followed by the fifth parameter.
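The stack layout in that example comes down to simple arithmetic, sketched below. The function and constant names are made up. Offsets are measured from RSP at the moment the called function is entered, assuming 8-byte slots.

```cpp
#include <cstddef>

constexpr std::size_t kSlotSize = 8;           // every slot is 8 bytes wide
constexpr std::size_t kReturnAddressSize = 8;  // pushed by the call instruction

// Offset from RSP, at function entry, of the stack slot that belongs to the
// argument at zero-based `index`. For index 0..3 this is the shadow-space
// slot where the callee may spill RCX, RDX, R8, or R9.
constexpr std::size_t ms_x64_entry_offset(std::size_t index) {
    return kReturnAddressSize + index * kSlotSize;
}

static_assert(ms_x64_entry_offset(0) == 8,  "shadow slot for RCX");
static_assert(ms_x64_entry_offset(3) == 32, "shadow slot for R9");
static_assert(ms_x64_entry_offset(4) == 40, "fifth argument, just above the shadow space");
```

So the return address sits at [RSP], the 32 bytes of shadow space cover [RSP+8] up to [RSP+40], and the fifth parameter is found at [RSP+40].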

In x86-64, Visual Studio 2008 stores floating point numbers in XMM6 and XMM7 (as well as XMM8 through XMM15); consequently, for x86-64, user-written assembly language routines must preserve XMM6 and XMM7 (as compared to x86 wherein user-written assembly language routines did not need to preserve XMM6 and XMM7). In other words, user-written assembly language routines must be updated to save/restore XMM6 and XMM7 before/after the function when being ported from x86 to x86-64.

What makes the calling convention different?

Platforms generally define one or more "standard" calling conventions. Compilers need to follow those conventions if they want to interoperate with other tools or components on the platform using those conventions, but can use their own different calling conventions internally.

The only real requirement is that any caller and callee need to agree on the conventions for the call between them.

What is the calling convention for extern C in C++?

Let's look at the generated assembly using the Debug build of a 32-bit Visual Studio project (default settings):

Here's my program:

extern "C" int func1(int x);
extern "C" int __stdcall func2(int x);
extern "C" int __cdecl func3(int x);

int main()
{
    int x = 0;
    func1(1);
    func2(2);
    func3(3);
    return 0;
}

func1, func2, and func3 are defined in a separate source file to limit the possibility of automatic inlining.

Let's look at the generated assembly code for main:

    func1(1);
002117E8  push        1  
002117EA  call        _func1 (0211159h)  
002117EF  add         esp,4  
    func2(2);
002117F2  push        2  
002117F4  call        _func2@4 (0211131h)  
    func3(3);
002117F9  push        3  
002117FB  call        _func3 (021107Dh)  
00211800  add         esp,4

For func1 and func3, the call sequence is the same. The argument is pushed onto the stack, the function is called, and then the stack pointer (esp) is adjusted back (popped) to its previous value, as expected for the __cdecl calling convention. In the __cdecl calling convention, the caller is responsible for restoring the stack pointer to its original value after a function call is made.

After the invocation of func2, there is no stack pointer adjustment, which is consistent with the __stdcall calling convention it was declared with. In __stdcall, the called function is responsible for popping its arguments off the stack. Inspecting the assembly of func1 vs func2 shows that func1 ends with:

00211881  ret    // return, no stack adjustment

whereas func2 ends with this assembly:

002118E1  ret         4   // return and pop 4 bytes from stack

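Now before you conclude that "no linkage attribute" implies "__cdecl", keep in mind that Visual Studio projects have a Calling Convention setting under Configuration Properties > C/C++ > Advanced (it corresponds to the /Gd, /Gr, /Gz, and /Gv compiler options, with /Gd, i.e. __cdecl, as the default).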

Let's change that Calling convention setting to __stdcall and see what the resulting assembly looks like:

    func1(1);
003417E8  push        1  
003417EA  call        _func1@4 (034120Dh)  
    func2(2);
003417EF  push        2  
003417F1  call        _func2@4 (0341131h)  
    func3(3);
003417F6  push        3  
003417F8  call        _func3 (034107Dh)  
003417FD  add         esp,4

Suddenly main isn't popping arguments after the invocation of func1 - hence func1 assumed the default calling convention of the project settings. And that's technically your answer.

There are environments where __stdcall being the default is the norm. Driver development for example...

Why are there so many different calling conventions?

The calling conventions you mention were designed over the course of decades for different languages and different hardware. They all had different goals. cdecl supported variable arguments for printf. stdcall resulted in smaller code gen, but no variable arguments. Fastcall could greatly speed up the performance of simple functions with only one or two arguments on older machines (but is rarely a speed up today.)
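The "smaller code gen" point can be put in numbers. The byte counts below are the standard 32-bit x86 encodings: add esp, imm8 takes 3 bytes, ret takes 1 byte, and ret imm16 takes 3 bytes. The function names are invented for this sketch, and it assumes a function with at least one stack argument and a compiler that cleans up after every cdecl call rather than batching the adjustments.

```cpp
#include <cstddef>

constexpr std::size_t kAddEspImm8Bytes = 3;  // add esp, imm8  (83 C4 ib)
constexpr std::size_t kRetBytes        = 1;  // ret            (C3)
constexpr std::size_t kRetImm16Bytes   = 3;  // ret imm16      (C2 iw)

// Bytes spent on returning and stack cleanup for one function that is
// called from `call_sites` different places.
constexpr std::size_t cdecl_cleanup_bytes(std::size_t call_sites) {
    return kRetBytes + call_sites * kAddEspImm8Bytes;  // every caller cleans up
}
constexpr std::size_t stdcall_cleanup_bytes(std::size_t /*call_sites*/) {
    return kRetImm16Bytes;  // the callee cleans up once
}

static_assert(cdecl_cleanup_bytes(10) == 31, "1-byte ret plus 10 * 3 bytes");
static_assert(stdcall_cleanup_bytes(10) == 3, "a single ret imm16");
```

With a single call site the difference is one byte, but it grows by three bytes for every additional caller, which mattered when an API was called from thousands of places.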

Note that when x64 was introduced, on Windows at least, it was designed to have a single calling convention.

Raymond Chen wrote a great series on the history of calling conventions on his blog, The Old New Thing.

Why can functions with different calling conventions still call each other?

"Because in my view two functions with different calling conventions can't call each other"

That's simply an incorrect view. A calling convention is just a set of rules for how arguments are handled across the call. The compiler generates instructions at each call site and within the body of the function that follow whichever convention the function is defined with.

"If the caller thinks the callee should clean the stack, and the callee thinks the caller should clean the stack, that's my problem"

The problem you are thinking of is when the calling convention is omitted, and different translation units are compiled with different default conventions. The declarations in one TU are used in a manner incompatible with the definition in another TU.
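One way to sidestep that mismatch is to spell the convention out in the shared declaration, so it no longer depends on each translation unit's default. The sketch below uses hypothetical names (APP_CALLBACK, Callback, twice, invoke). The keyword is only applied for MSVC-compatible compilers targeting 32-bit x86 and is assumed to be unnecessary elsewhere.

```cpp
// Hypothetical macro: explicit convention where it matters, nothing elsewhere.
#if defined(_MSC_VER) && defined(_M_IX86)
  #define APP_CALLBACK __stdcall
#else
  #define APP_CALLBACK
#endif

// The convention is part of the pointer type, so every translation unit that
// sees this alias generates the same call sequence for it.
using Callback = int (APP_CALLBACK *)(int);

// Defined with the same explicit convention as the pointer type.
int APP_CALLBACK twice(int x) { return 2 * x; }

// The call through `cb` follows the convention recorded in its type.
int invoke(Callback cb, int value) { return cb(value); }
```

Because the convention is part of the function's type, a 32-bit MSVC build will typically reject passing a function declared with a different convention to invoke, instead of silently corrupting the stack at run time.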

What are custom calling conventions?

A calling convention describes how something may call another function. This requires parameters and state to be passed to the other function, so that it can execute and return control correctly. The way in which this is done has to be standardized and specified, so that the compiler knows how to order parameters for consumption by the remote function that's being called. There are several standard calling conventions, but the most common are fastcall, stdcall, and cdecl.

Usually, the term custom calling convention is a bit of a misnomer and refers to one of two things:

1. A non-standard calling convention or one that isn't in widespread use (e.g. if you're building an architecture from scratch).
2. A special optimization that a compiler/linker can perform that uses a one-shot calling convention for the purpose of improving performance.

In the latter case, this causes some values that would otherwise be pushed onto the stack to be stored in registers instead. The compiler will try to make this decision based on how the parameters are being used inside the code. For example, if the parameter is going to be used as a maximum value for a loop index, such that the index is compared against the max on each iteration to see if things should continue, that would be a good case for moving it into a register.

If the optimization is carried out, this typically reduces code size and improves performance.

And how am I affected by these as a developer?

From your standpoint as a developer, you probably don't care; this is an optimization that will happen automatically.

What calling convention does printf() in C use?

printf always uses CDECL in real-world C libraries, because STDCALL is highly inconvenient for variadic functions, and would make it impossible to work correctly in some cases ISO C requires it to.

ISO C says it's well-defined behaviour to pass extra args, like printf("%d\n", 1, 2, 3);

printf must safely ignore them, and behave like printf("%d\n", 1);. This rules out callee-pops conventions like STDCALL. (Which would be inconvenient anyway for any variadic function because ret imm16 to increment [er]SP after popping into [er]IP is only available with an immediate operand, not register. So you'd have to pop the return address, copy it over the highest 4 or 8 bytes of args, and ret from there, even if you could accurately calculate where.)

Mainstream calling conventions don't separately pass the number of args (or their size on the stack in bytes) to variadic functions, or use any kind of sentinel value, so there's no way an implementation of printf could find out how many args were actually passed. The args have to match the format string for as many args as the format string references, but there's no requirement not to pass args beyond that.
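A small variadic function shows the same situation from the callee's side. sum_first is a made-up example, not part of any library: it is told how many arguments to read through its first parameter, much as printf is told through the format string, and it has no way to notice anything passed beyond that.

```cpp
#include <cstdarg>

// Sums the first `count` int arguments. Anything passed beyond that is
// never read; the callee cannot even tell that it is there.
int sum_first(int count, ...) {
    va_list args;
    va_start(args, count);
    int total = 0;
    for (int i = 0; i < count; ++i) {
        total += va_arg(args, int);  // each variadic argument is read as int
    }
    va_end(args);
    return total;
}
```

A call such as sum_first(2, 10, 20, 30) reads 10 and 20 and ignores the 30. That is harmless when the caller removes its own arguments afterwards, but a callee-pops convention would need the callee to know the real argument size.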

That's why Windows C ABIs / implementations always use a convention like CDECL for variadic functions, even if they default to STDCALL for functions with fixed numbers of args. (32-bit FASTCALL is also callee-pops; Windows x64 is not, and current MS documentation sometimes calls it x64 fastcall or "a fastcall convention".)

Also, a definition of the function in x86 32 bit Assembly would be great.

I'm doing this for educational purposes.

Since you're doing this to learn about asm, not necessarily to create a C implementation, printf is a pretty complicated API to implement, and probably not a good choice for pure assembly projects. (Or arguably for any modern design that doesn't have to actually be ISO C; parsing a text format string and going through a variable-length list of arguments has major downsides for simplicity. C++ people argue that a separate function call for each object you want to output is much better for type safety and stuff)

It's usually easier to deal with individual type -> string functions, like a print_int (decimal) vs. print_int_hex vs. print_double (very complicated on its own actually) vs. print_c_string (0 terminated) vs. print_buffer (pointer, length).
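To give a feel for how small one of these single-purpose routines is, here is a C++ sketch of the logic behind an unsigned decimal print function. The name u32_to_decimal and its buffer-based interface are assumptions for the example. An asm version would follow the same divide-by-10 loop and then hand the buffer to whatever output call the platform provides.

```cpp
#include <cstddef>
#include <cstdint>

// Writes `value` in decimal into `out` (which needs room for 11 bytes,
// including the terminating 0) and returns the number of digits written.
std::size_t u32_to_decimal(std::uint32_t value, char* out) {
    char reversed[10];  // a 32-bit value has at most 10 decimal digits
    std::size_t n = 0;
    do {
        reversed[n++] = static_cast<char>('0' + value % 10);  // lowest digit first
        value /= 10;
    } while (value != 0);
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = reversed[n - 1 - i];  // reverse into the caller's buffer
    }
    out[n] = '\0';
    return n;
}
```

Returning the length alongside the 0-terminated buffer lets the caller use either a write(pointer, length) style call or a C-string style call without rescanning the text.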

As a toy project, don't aim too high with your I/O formatting functions.

Provide some simple usable ones at first, that are easy to call from asm. Irvine32 with its WriteDec (unsigned) vs. WriteInt (signed) vs. WriteString is one decent example of a set of output functions for toy programs. Irvine32 notably uses a custom calling convention where all registers are call-preserved (training wheels mode), and the arg is in EAX or EDX (this is very good; stack args are dumb especially for functions that only take one)

Another very similar example is the MARS system-calls for that MIPS simulator. Some of them are poorly designed (or intentionally inconvenient for students?), like its read-string not returning the length in the return-value register, just leaving the characters in the pointed-to buffer with a terminating 0 byte (as a C string). So if you want to know how many you read, you have to loop over them looking for the first 0, i.e. strlen.

These toy APIs don't have cursor movement, input without echo, or any of the other things that make real terminal and keyboard handling way more complicated, nor any way to specify formatting like printf's %020d to pad with leading zeros out to 20 digits.

If you want to write your own input/output functions, you can think about whether you want them to be able to mix easily with code that directly uses lower-level functions, or whether they do their own buffering like C stdio, and should be treated as an opaque I/O layer so programs shouldn't use them and lower-level OS system calls at the same time.

Depending on how sophisticated you want them, your functions could take args to specify width limits, or you could handle that on a case-by-case basis customized for the project. (After all, if you wanted maintainability and easy code reuse, you wouldn't choose asm in the first place. So just implement the I/O details at the place that's doing it, instead of building a flexible mechanism for callers to request any kind of formatting.)
