Function Prologue and Epilogue in C

There are lots of resources out there that explain this:

Function prologue (Wikipedia)
x86 Disassembly/Calling Conventions (WikiBooks)
Considerations for Writing Prolog/Epilog Code (MSDN)

to name a few.

Basically, as you somewhat described, "the stack" serves several purposes in the execution of a program:

Keeping track of where to return to, when calling a function
Storage of local variables in the context of a function call
Passing arguments from calling function to callee.

The prologue is what happens at the beginning of a function. Its responsibility is to set up the stack frame of the called function. The epilogue is the exact opposite: it is what happens last in a function, and its purpose is to restore the stack frame of the calling (parent) function.

In IA-32 (x86) cdecl, the ebp register is used by convention (by compiler-generated code, not by the hardware) to keep track of the function's stack frame. The esp register is used by the processor to point to the most recent addition (the top value) on the stack. (In optimized code, using ebp as a frame pointer is optional; other ways of unwinding the stack for exceptions are possible, so there's no actual requirement to spend instructions setting it up.)

The call instruction does two things: First it pushes the return address onto the stack, then it jumps to the function being called. Immediately after the call, esp points to the return address on the stack. (So on function entry, things are set up so a ret could execute to pop that return address back into EIP. The prologue points ESP somewhere else, which is part of why we need an epilogue.)

Then the prologue is executed:

push  ebp         ; Save the stack-frame base pointer (of the calling function).
mov   ebp, esp    ; Set the stack-frame base pointer to be the current
                  ; location on the stack.
sub   esp, N      ; Grow the stack by N bytes to reserve space for local variables

At this point, we have:

...
ebp + 4:    Return address
ebp + 0:    Calling function's old ebp value
ebp - 4:    (local variables)
...

The epilog:

mov   esp, ebp    ; Put the stack pointer back where it was when this function
                  ; was called.
pop   ebp         ; Restore the calling function's stack frame.
ret               ; Return to the calling function.
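
To make the sequence concrete, here is a minimal sketch in C that simulates the 32-bit stack with an array. The Machine struct and the helper names (machine_init, push32, pop32, sim_call, sim_prologue, sim_epilogue_and_ret) are hypothetical, invented for illustration; on real hardware this work is done by the call, push, mov, sub, pop and ret instructions themselves.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of the x86 stack; every name here is hypothetical. */
enum { STACK_WORDS = 64 };

typedef struct {
    uint32_t mem[STACK_WORDS]; /* simulated stack memory, one slot per 4 bytes */
    uint32_t esp;              /* byte offset of the top of the stack */
    uint32_t ebp;              /* frame base (an opaque value for the caller) */
    uint32_t eip;              /* simulated instruction pointer */
} Machine;

void machine_init(Machine *m, uint32_t ebp, uint32_t eip) {
    m->esp = STACK_WORDS * 4;  /* empty stack: esp sits just past the top slot */
    m->ebp = ebp;
    m->eip = eip;
}

void push32(Machine *m, uint32_t v) {
    assert(m->esp >= 4);                 /* no room left would be an overflow */
    m->esp -= 4;                         /* the stack grows downward */
    m->mem[m->esp / 4] = v;
}

uint32_t pop32(Machine *m) {
    assert(m->esp + 4 <= STACK_WORDS * 4);
    uint32_t v = m->mem[m->esp / 4];
    m->esp += 4;
    return v;
}

/* call: push the return address, then jump to the target. */
void sim_call(Machine *m, uint32_t target, uint32_t return_addr) {
    push32(m, return_addr);
    m->eip = target;
}

/* push ebp / mov ebp, esp / sub esp, N */
void sim_prologue(Machine *m, uint32_t local_bytes) {
    push32(m, m->ebp);
    m->ebp = m->esp;
    m->esp -= local_bytes;
}

/* mov esp, ebp / pop ebp / ret */
void sim_epilogue_and_ret(Machine *m) {
    m->esp = m->ebp;
    m->ebp = pop32(m);
    m->eip = pop32(m);
}
```

After sim_call and sim_prologue, the word at ebp + 4 holds the return address and the word at ebp holds the caller's ebp, matching the layout above. After sim_epilogue_and_ret, esp, ebp and eip are back to what the caller expects.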

Are the prologue and epilogue mandatory when writing assembly functions?

If you do not set up a proper stack frame, it can be hard for a debugger to know what function you are in right now. On ELF targets, you have to manually provide CFI data (cf. this article) if you do not explicitly set up a stack frame. Without CFI data, stack unwinding doesn't work and the debugger might not be able to find out what function you are in. Unless you want to manually add CFI data (which is somewhat tedious and easy to get wrong), I recommend that you accept the minor performance loss and just set up a full stack frame.

What are function epilogues and prologues?

The epilogue and prologue of a function are simply the set of instructions that 'set up' the context for the function when it's called and clean up when it returns.

The prologue typically performs such tasks as:

saves any registers that the function might use (that are required by the platform's standard to be preserved across function calls)
allocates storage on the stack that the function might require for local variables
sets up any pointer (or other linkage) to parameters that might be passed on the stack

The epilogue generally only needs to restore any saved registers and restore the stack pointer such that any memory reserved by the function for its own use is 'freed'.

The exact mechanisms that might be used in a prologue/epilogue are dependent on the CPU architecture, the platform's standard, the arguments and return values of the function, and the particular calling convention the function might be using.

Creating a C function without compiler generated prologue/epilogue & RET instruction?

It's not entirely clear what you want to accomplish. It seems like you want an interrupt handler that does the iret without other pushes and pops by default.

GCC

Using GCC (without NASM) something like this is possible:

/* Make C extern declarations of the ISR entry points */    
extern void isr_test1(void);
extern void isr_test2(void);

/* Define a do nothing ISR stub */
__asm__(".global isr_test1\n"
        "isr_test1:\n\t"
        /* Other stuff here */
        "iret");    

/* Define an ISR stub that makes a call to a C function */
__asm__(".global isr_test2\n"
        "isr_test2:\n\t"
        "cld\n\t"                    /* Set direction flag forward for C functions */
        "pusha\n\t"                  /* Save all the registers */
        /* Other stuff here */
        "call isr_test2_handler\n\t"
        "popa\n\t"                   /* Restore all the registers */
        "iret");

void isr_test2_handler(void)
{
    return;
}

Basic __asm__ statements in GCC can be placed outside of a function. We define labels for our Interrupt Service Routines (ISRs) and make them externally visible with .global (you may not need global visibility, but I show it anyway).

I create a couple of sample interrupt service routines: one that does nothing more than an iret, and another that calls a C handler. The second saves all the registers before the call and restores them afterwards. C functions expect the direction flag to be clear (forward) on entry, so we need a CLD before calling the C function. This sample code works for 32-bit targets. 64-bit can be done by saving the registers individually rather than using PUSHA and POPA.

Note: If using GCC on Windows the function names inside the assembly blocks will likely need to be prepended with an _ (underscore). It would look like:

/* Make C extern declarations of the ISR entry points */    
extern void isr_test1(void);
extern void isr_test2(void);

/* Define a do nothing ISR stub */
__asm__(".global _isr_test1\n"
        "_isr_test1:\n\t"
        /* Other stuff here */
        "iret");    

/* Define an ISR stub that makes a call to a C function */
__asm__(".global _isr_test2\n"
        "_isr_test2:\n\t"
        "cld\n\t"                    /* Set direction flag forward for C functions */
        "pusha\n\t"                  /* Save all the registers */
        /* Other stuff here */
        "call _isr_test2_handler\n\t"
        "popa\n\t"                   /* Restore all the registers */
        "iret");

void isr_test2_handler(void)
{
    return;
}

MSVC/MSVC++

Microsoft's C/C++ compilers support the naked attribute on functions. They describe this attribute as:

The naked storage-class attribute is a Microsoft-specific extension to the C language. For functions declared with the naked storage-class attribute, the compiler generates code without prolog and epilog code. You can use this feature to write your own prolog/epilog code sequences using inline assembler code. Naked functions are particularly useful in writing virtual device drivers.

An example Interrupt Service Routine could be done like this:

__declspec(naked) int isr_test(void)
{
    /* Function body */
    __asm { iret };
}

You'll need to deal with saving and restoring registers and clearing the direction flag yourself, in a similar manner to the GCC example above.

GCC 7.x+ introduced Interrupt Attribute on x86/x86-64 Targets

On GCC 7.0+ you can now use __attribute__((interrupt)) on functions. This attribute was only recently supported on x86 and x86-64 targets:

interrupt

Use this attribute to indicate that the specified function is an interrupt handler or an exception handler (depending on parameters passed to the function, explained further). The compiler generates function entry and exit sequences suitable for use in an interrupt handler when this attribute is present. The IRET instruction, instead of the RET instruction, is used to return from interrupt handlers. All registers, except for the EFLAGS register which is restored by the IRET instruction, are preserved by the compiler. Since GCC doesn’t preserve MPX, SSE, MMX nor x87 states, the GCC option -mgeneral-regs-only should be used to compile interrupt and exception handlers.

This method still has deficiencies. If you ever want your C code to access the contents of a register as they appeared at the time of the interrupt, there is currently no reliable way to do it with this mechanism. This would be handy if you were writing a software interrupt and needed access to the registers to determine what actions to take (e.g. int 0x80 on Linux). Another example would be to allow an interrupt to dump all the register contents to the display for debug purposes.

Function Prologue and Epilogue removed by GCC Optimization

Compilers are getting smart: GCC knew you didn't need a stack frame pointer stored in a register, because whatever you put into your main() function didn't use the stack.

As for rep ret:

Here's the principle. The processor tries to fetch the next few
instructions to be executed, so that it can start the process of
decoding and executing them. It even does this with jump and return
instructions, guessing where the program will head next.

What AMD says here is that, if a ret instruction immediately follows a
conditional jump instruction, their predictor cannot figure out where
the ret instruction is going. The pre-fetching has to stop until the
ret actually executes, and only then will it be able to start looking
ahead again.

The "rep ret" trick apparently works around the problem, and lets the
predictor do its job. The "rep" has no effect on the instruction.

Source: Some forum, google a sentence to find it.

One thing to note: just because there is no prologue doesn't mean there is no stack. You can still push and pop with ease; it's just that complex stack manipulation will be difficult.

Functions that don't have a prologue/epilogue are usually dubbed naked. Hackers like to use them a lot because they don't contaminate the stack when you jmp to them. I must confess I know of no other use for them outside optimization. In Visual Studio it's done via:

__declspec(naked)

Why use ebp in function prologue/epilogue?

There's no requirement to use a stack frame, but there are certainly some advantages:

Firstly, if every function uses this same process, we can use this knowledge to easily determine a sequence of calls (the call stack) by reversing the process. We know that after a call instruction, ESP points to the return address, and that the first thing the called function will do is push the current EBP and then copy ESP into EBP. So, at any point, the data pointed to by EBP is the previous EBP, and the data at EBP+4 is the return address of the last function call. We can therefore print the call stack (assuming 32-bit) using something like (excuse the rusty C++):

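#include <windows.h>
#include <stdio.h>

// Walks the chain of saved EBP values and prints each return address
// together with the module it belongs to. 32-bit x86 only, and only
// valid if every function on the stack set up an EBP frame.
void LogStack(DWORD ebp)
{
    while (ebp != 0)
    {
        DWORD prevEBP = *((DWORD*)ebp);       // [ebp]   = caller's saved EBP
        DWORD retAddr = *((DWORD*)(ebp+4));   // [ebp+4] = return address

        if (retAddr == 0) return;

        HMODULE module = NULL;
        char fileName[MAX_PATH] = "(unknown module)";
        // UNCHANGED_REFCOUNT avoids leaking a module reference per frame.
        if (GetModuleHandleExA(GET_MODULE_HANDLE_EX_FLAG_FROM_ADDRESS |
                               GET_MODULE_HANDLE_EX_FLAG_UNCHANGED_REFCOUNT,
                               (LPCSTR)retAddr, &module))
        {
            GetModuleFileNameA(module, fileName, MAX_PATH);
        }
        printf("0x%08lx: %s\n", (unsigned long)retAddr, fileName);
        ebp = prevEBP;
    }
}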

This will then print out the entire sequence of calls (well, their return addresses) up until that point.
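
The same walk can be sketched without any Windows APIs by modelling each frame as the pair of words found at [ebp] and [ebp+4]. The Frame struct and the walk_frames function are hypothetical names used only for this illustration; a real unwinder reads these words from the live stack rather than from hand-built structs.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of what sits at [ebp] and [ebp+4] in each frame. */
typedef struct Frame {
    const struct Frame *saved_ebp; /* caller's frame, NULL at the outermost frame */
    uintptr_t return_addr;         /* address this function returns to */
} Frame;

/* Follow the saved-ebp links, storing at most max return addresses in out.
   Returns the number of frames visited. */
size_t walk_frames(const Frame *ebp, uintptr_t *out, size_t max) {
    size_t n = 0;
    while (ebp != NULL && ebp->return_addr != 0 && n < max) {
        out[n++] = ebp->return_addr; /* innermost call first */
        ebp = ebp->saved_ebp;        /* step to the caller's frame */
    }
    return n;
}
```

Building three linked Frame values (for example main, f and g, with g innermost) and passing the innermost one to walk_frames yields the return addresses in call-stack order, innermost first.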

Furthermore, since EBP doesn't change unless you explicitly update it (unlike ESP, which changes when you push/pop), it's usually easier to reference data on the stack relative to EBP rather than relative to ESP. With the latter, you have to be aware of any push/pop instructions that might have been executed between the start of the function and the reference.

As others have mentioned, you should avoid using stack addresses below ESP as any calls you make to other functions are likely to overwrite the data at these addresses. You should instead reserve space on the stack for use by your function by the usual:

sub esp, [number of bytes to reserve]

After this, the region of the stack between the initial ESP and ESP - [number of bytes reserved] is safe to use.
Before exiting your function you must release the reserved stack space using a matching:

add esp, [number of bytes reserved]

How can I plant assembly instructions in the prologue and epilogue of function via gcc

Apparently you can use the -finstrument-functions flag to get gcc to generate instrumentation calls

void __cyg_profile_func_enter(void *func, void *callsite); 
void __cyg_profile_func_exit(void *func, void *callsite);

at function entry and exit. I've never used this, but a quick search brings up information and examples.
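
As a rough sketch of how the two hooks might be defined: the hook signatures are the ones quoted above, while the counters and the triple function are hypothetical names made up for this example. It assumes GCC (or a compatible compiler) and that the file is built with -finstrument-functions; without that flag the code still compiles, but the hooks are never called.

```c
#include <stdio.h>

/* Hypothetical counters, only for illustration. */
static int enter_count;
static int exit_count;

/* The hooks themselves must not be instrumented, or they would recurse. */
__attribute__((no_instrument_function))
void __cyg_profile_func_enter(void *func, void *callsite) {
    (void)func;
    (void)callsite;
    enter_count++;
}

__attribute__((no_instrument_function))
void __cyg_profile_func_exit(void *func, void *callsite) {
    (void)func;
    (void)callsite;
    exit_count++;
}

/* An ordinary function: when built with -finstrument-functions, the
   compiler inserts a call to the enter hook at its start and a call
   to the exit hook before it returns. */
int triple(int x) {
    return 3 * x;
}

/* Prints how often the hooks ran; excluded from instrumentation so that
   reporting does not change the counts. */
__attribute__((no_instrument_function))
void report_counts(void) {
    printf("entered %d, exited %d\n", enter_count, exit_count);
}
```

Replacing the counter increments with whatever code you want to run at entry and exit is the closest this mechanism gets to planting code in the prologue and epilogue. The hooks are ordinary calls, so they run inside the function's frame rather than as part of the prologue instructions themselves.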

When is the function epilogue executed?

I don't totally understand the order of execution of the function
epilogue. Will the saved value of ebp be popped into ebp before the
function returns to the address of system()?

Yes. Returning to the return address is the last action that can reasonably be considered an action of the function rather than of its caller.

I have read that
"function epilogue is executed upon termination of the function". At
what point does this function exactly terminate?

What function? You haven't presented one. But in general C terms, a function terminates when it executes a return statement or when execution of the last statement in its body finishes. This is the "termination" to which the document refers.

I don't think it is
before calling system()

Well then, surprise! The whole point is that the function epilogue, which is executed after the function terminates, causes control to be transferred to the entry point of the system() function. Note, however, that interpreting this correctly requires a split perspective. Function termination is function specific, and best defined in terms of the function's source code. The epilogue, on the other hand, has no representation in the source code -- it comprises extra machine instructions inserted by the compiler to implement the function-return semantics of the source language.

because this would mean the overwritten saved
ebp containing 4 crappy bytes would be stored in ebp.

Yes, but it doesn't matter because esp is set correctly. Control then jumps to the entry point of system(), where the function prologue sets esp as the new ebp, and a new esp is set. That function therefore has valid stack bounds, so it runs correctly. Bad Things may happen when system() returns, because the return address is determined by the 4 crappy bytes, but we don't care -- we do all the damage we want to do in the shell that we have induced system() to provide to us, before system() ever returns.
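
A toy simulation may make this easier to follow. The ToyCpu struct, the toy_epilogue function and the slot layout are hypothetical, and the stack is just a small array. The sketch only shows that the epilogue loads whatever happens to be in the saved-ebp and return-address slots, while esp ends up in the right place regardless.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model; every name and value here is hypothetical. */
enum { TOY_SLOTS = 8 };

typedef struct {
    uint32_t slots[TOY_SLOTS]; /* simulated stack words, index 0 = lowest address */
    uint32_t esp;              /* index of the slot esp points at */
    uint32_t ebp;              /* value in ebp (a slot index while the frame is intact) */
    uint32_t eip;              /* simulated instruction pointer */
} ToyCpu;

/* Epilogue: mov esp, ebp / pop ebp / ret */
void toy_epilogue(ToyCpu *c) {
    assert(c->ebp + 1 < TOY_SLOTS); /* the frame base must lie inside the toy stack */
    c->esp = c->ebp;                /* mov esp, ebp */
    c->ebp = c->slots[c->esp];      /* pop ebp: loads whatever the saved slot holds */
    c->esp += 1;
    c->eip = c->slots[c->esp];      /* ret: jumps to whatever the return slot holds */
    c->esp += 1;
}
```

If the saved-ebp slot has been overwritten with junk and the return-address slot with the address of another function, toy_epilogue leaves junk in ebp, the chosen address in eip, and esp pointing just above the return slot. That is the state described above: ebp is garbage, but esp is valid, and the next function's prologue derives a fresh ebp from esp.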

Function epilogue in assembly

You're correct, it's redundant if you know that esp is already pointing at the location where you pushed your caller's ebp.

When gcc compiles a function with -fno-omit-frame-pointer, it does in fact do the optimization you suggest of just popping ebp when it knows that esp is already pointing in the right place.

This is very common in functions that use call-preserved registers (like ebx) which also have to be saved/restored like ebp. Compilers typically do all the saves/restores in the prologue/epilogue before anything like reserving space for a C99 variable-size array. So pop ebx will always leave esp pointing to the right place for pop ebp.

e.g. clang 3.8's output (with -O3 -m32) for this function, on the Godbolt compiler explorer. As is common, compilers don't quite make optimal code:

void extint(int);   // a function that can't inline because the compiler can't see the definition.
int save_reg_framepointer(int a){
  extint(a);
  return a;
}

    # clang3.8
    push    ebp
    mov     ebp, esp                     # stack-frame boilerplate
    push    esi                          # save a call-preserved reg
    push    eax                          # align the stack to 16B
    mov     esi, dword ptr [ebp + 8]     # load `a` into a register that will survive the function call.
    mov     dword ptr [esp], esi         # store the arg for extint.  Doing this with an ebp-relative address would have been slightly more efficient, but just push esi here instead of push eax earlier would make even more sense
    call    extint
    mov     eax, esi                     # return value
    add     esp, 4                       # pop the arg
    pop     esi                          # restore esi
    pop     ebp                          # restore ebp.  Notice the lack of a mov  esp, ebp here, or even a  lea esp, [ebp-4]  before the first pop.
    ret

Of course, a human (borrowing a trick from gcc) can do better:

# hand-written based on tricks from gcc and clang, and avoiding their suckage
call_non_inline_and_return_arg:
    push    ebp
    mov     ebp, esp                     # stack-frame boilerplate if we have to.
    push    esi                          # save a call-preserved reg
    mov     esi, dword [ebp + 8]         # load `a` into a register that will survive the function call
    push    esi                          # replacing push eax / mov
    call    extint
    mov     eax, esi                     # return value.  Could  mov eax, [ebp+8]
    mov     esi, [ebp-4]                 # restore esi without a pop, since we know where we put it, and esp isn't pointing there.
    leave                                # same as mov esp, ebp / pop ebp.  3 uops on recent Intel CPUs
    ret

Since the stack needs to be aligned by 16 before a call (according to the rules of the SystemV i386 ABI, see links in the x86 tag wiki), we might as well save/restore an extra reg, instead of just push [ebp+8] and then (after the call) mov eax, [ebp+8]. Compilers favour saving/restoring call-preserved registers over reloading local data multiple times.

If not for the stack-alignment rules in the current version of the ABI, I might write:

# hand-written: esp alignment not preserved on the call
call_no_stack_align:
    push    ebp
    mov     ebp, esp                     # stack-frame boilerplate if we have to.
    push    dword [ebp + 8]              # function arg.  2 uops for push with a memory operand
    call    extint                       # esp is offset by 12 from before the `call` that called us: return address, ebp, and function arg.
    mov     eax, [ebp+8]                 # return value, which extint won't have modified because it only takes one arg
    leave                                # same as mov esp, ebp / pop ebp.  3 uops on recent Intel CPUs
    ret

gcc will actually use leave instead of mov / pop, in cases where it does need to modify esp before popping ebx. For example, flip Godbolt to gcc (instead of clang), and take out -m32 so we're compiling for x86-64 (where args are passed in registers). This means there's no need to pop args off the stack after a call, so rsp is set correctly to just pop two regs. (push/pop use 8 bytes of stack, but rsp still has to be 16B-aligned before a call in the SysV AMD64 ABI, so gcc actually does a sub rsp, 8 and corresponding add around the call.)
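
The alignment arithmetic behind these padding instructions can be sketched as a small helper. The function name call_alignment_padding and its parameters are hypothetical; the sketch assumes the rule stated above, that the stack pointer must be 16-byte aligned at the point of a call, so that on function entry it sits one return address below a 16-byte boundary.

```c
#include <assert.h>

/* Hypothetical helper: bytes of padding needed so that the stack pointer
   is 16-byte aligned when this function executes a call.
   word_size:    4 for i386, 8 for x86-64 (size of the return address)
   pushed_bytes: bytes this function has pushed or reserved since entry,
                 including stack slots for the callee's arguments */
unsigned call_alignment_padding(unsigned word_size, unsigned pushed_bytes) {
    assert(word_size == 4 || word_size == 8);
    /* Our caller was aligned before its call pushed the return address,
       so on entry we are word_size bytes below a 16-byte boundary. */
    unsigned below_boundary = (word_size + pushed_bytes) % 16;
    return (16 - below_boundary) % 16;
}
```

For the 32-bit examples, pushing ebp, esi and one argument slot gives call_alignment_padding(4, 12), which is 0, so no extra padding is needed. The call_no_stack_align version pushes only ebp and the argument, and call_alignment_padding(4, 8) is 4, which is the misalignment noted in its comment. On x86-64, two pushed registers give call_alignment_padding(8, 16), which is 8, matching the sub rsp, 8 that gcc emits.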

Another missed optimization: with gcc -m32, the variable-length-array function uses an add esp, 16 / leave after the call. The add is totally useless. (Add -m32 to the gcc args on godbolt).
