What Happens If There Is No Exit System Call in an Assembly Program

What happens if there is no exit system call in an assembly program?

The processor does not know where your code ends. It faithfully executes one instruction after another until execution is redirected elsewhere (e.g. by a jump, call, interrupt, system call, or similar). If your code ends without jumping elsewhere, the processor continues executing whatever is in memory after your code. It is fairly unpredictable what exactly happens, but eventually, your code typically crashes because it tries to execute an invalid instruction or tries to access memory that it is not allowed to access. If neither happens and no jump occurs, eventually the processor tries to execute unmapped memory or memory that is marked as “not executable” as code, causing a segmentation violation. On Linux, this raises a SIGSEGV or SIGBUS. When unhandled, these terminate your process and optionally produce core dumps.

Assembly program crashes on call or exit

mov esp,0 sets the stack pointer to 0. Any stack instructions like push/pop or call/ret will crash after you do that.

Pick a different register for your array-count temporary, not the stack pointer! You have 7 other choices, and it looks like you still have EDX unused.

In the normal calling convention, only EAX, ECX, and EDX are call-clobbered (so you can use them without preserving the caller's value). But you're calling ExitProcess instead of returning from main, so you can destroy all the registers. But ESP has to be valid when you call.

call works by pushing a return address onto the stack, like sub esp,4 / mov [esp], next_instruction / jmp ExitProcess. See https://www.felixcloutier.com/x86/CALL.html. As your register-dump shows, ESP=8 before the call, which is why it's trying to store to absolute address 4.

Your code has 2 sections: looping over the array and then finding the average. You can reuse a register for different things in the 2 sections, often vastly reducing register pressure. (i.e. you don't run out of registers.)

Using implicit-length arrays (terminated by a sentinel element like 0) is unusual outside of strings. It's much more common to pass a function a pointer + length, instead of just a pointer.

But anyway, you have an implicit-length array so you have to find its length and remember that when calculating the average. Instead of incrementing a size counter inside the loop, you can calculate it from the pointer you're also incrementing. (Or use the counter as an array index like ary[ecx*4], but pointer-increments are often more efficient.)

Here's what an efficient (scalar) implementation might look like. (With SSE2 for SIMD you could add 4 elements with one instruction...)

It only uses 3 registers total. I could have used ECX instead of ESI (so main could ret without having destroyed any of the registers the caller expected it to preserve, only EAX, ECX, and EDX), but I kept ESI for consistency with your version.

.data
        ;ary        dword   100, -30, 25, 14, 35, -92, 82, 134, 193, 99, 0
        ary     dword   -24, 1, -5, 30, 35, 81, 94, 143, 0

.code
main PROC
;; inputs: static ary of signed dword integers
;; outputs: EAX = array average, EDX = remainder of sum/size
;;          ESI = array count (in elements)
;; clobbers: none (other than the outputs)

                                ; EAX = sum accumulator
                                ; ESI = array pointer
                                ; EDX = array element temporary

        xor     eax, eax        ; sum = 0
        mov     esi, OFFSET ary ; incrementing a pointer is usually efficient, vs. ary[ecx*4] inside a loop or something.  So this is good.
sumloop:                       ; do {
        mov     edx, [esi]
        add     esi, 4          ; p++ (advance the pointer to the next dword)
        add     eax, edx        ; sum += *p++  without checking for 0, because + 0 is a no-op

        test    edx, edx        ; sets FLAGS the same as cmp edx,0
        jnz     sumloop         ; }while(array element != 0);
;;; fall through if the element is 0.
;;; esi points to one past the terminator, i.e. two past the last real element we want to count for the average

        sub     esi, OFFSET ary + 4  ; (end+4) - (start+4) = array size in bytes
        shr     esi, 2          ; esi = array length = (end-start)/element_size

        cdq                     ; sign-extend sum into EDX:EAX as an input for idiv
        idiv     esi            ; EAX = sum/length   EDX = sum%length

        call ExitProcess
main ENDP
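For comparison, here is a rough C sketch of the same logic: sum until the 0 terminator, derive the count from the pointer instead of keeping a separate counter, then divide once. The names average_until_zero and struct average_result are made up for this illustration, and it assumes at least one nonzero element comes before the terminator.

```c
#include <stddef.h>

/* Hypothetical result type for this sketch. */
struct average_result {
    int quotient;    /* sum / count, truncated toward zero like idiv */
    int remainder;   /* sum % count */
    size_t count;    /* number of elements before the 0 terminator */
};

/* Assumes at least one nonzero element precedes the terminator;
   with count == 0 the division would fault, just as idiv would. */
struct average_result average_until_zero(const int *ary)
{
    const int *p = ary;
    int sum = 0;
    int elem;

    do {
        elem = *p++;     /* load the element, then advance the pointer */
        sum += elem;     /* adding the 0 terminator is harmless */
    } while (elem != 0);

    /* p is now one past the terminator, so drop 1 to exclude it */
    size_t count = (size_t)(p - ary) - 1;

    struct average_result r;
    r.quotient = sum / (int)count;
    r.remainder = sum % (int)count;
    r.count = count;
    return r;
}
```

With the second array from the listing above (sum 355 over 8 elements), this gives quotient 44, remainder 3, count 8.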

I used x86's hardware division instruction, instead of a subtraction loop. Your repeated-subtraction loop looked pretty complicated, but manual signed division can be tricky. I don't see where you're handling the possibility of the sum being negative. If your array had a negative sum, repeated subtraction would make it grow until it overflowed. Or in your case, you're breaking out of the loop if sum < count, which will be true on the first iteration for a negative sum.

Note that comments like Set EAX register to 0 are useless. We already know that from reading mov eax,0. sum = 0 describes the semantic meaning, not the architectural effect. There are some tricky x86 instructions where it does make sense to comment about what it even does in this specific case, but mov isn't one of them.

If you just wanted to do repeated subtraction with the assumption that sum is non-negative to start with, it's as simple as this:

;; UNSIGNED division  (or signed with non-negative dividend and positive divisor)
; Inputs: sum(dividend) in EAX,  count(divisor) in ECX
; Outputs: quotient in EDX, remainder in EAX  (reverse of the DIV instruction)
    xor    edx, edx                 ; quotient counter = 0
    cmp    eax, ecx
    jb     subloop_end              ; the quotient = 0 case
repeat_subtraction:                 ; do {
    inc    edx                      ;   quotient++
    sub    eax, ecx                 ;   dividend -= divisor
    cmp    eax, ecx
    jae    repeat_subtraction       ; while( dividend >= divisor );
     ; fall through when eax < ecx (unsigned), leaving EAX = remainder
subloop_end:

Notice how checking for special cases before entering the loop lets us simplify it. See also Why are loops always compiled into "do...while" style (tail jump)?
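The same loop shape can be sketched in C, which makes the "check the special case first, then do { } while" structure easier to see. The names udiv_by_subtraction and struct udiv_result are invented for this example, and it assumes the divisor is nonzero.

```c
#include <stdint.h>

/* Hypothetical result type for this sketch. */
struct udiv_result {
    uint32_t quotient;
    uint32_t remainder;
};

/* Assumes divisor != 0; a zero divisor would loop forever. */
struct udiv_result udiv_by_subtraction(uint32_t dividend, uint32_t divisor)
{
    struct udiv_result r = { 0, dividend };

    if (r.remainder < divisor)
        return r;                      /* the quotient = 0 case, handled before the loop */

    do {
        r.quotient++;                  /* quotient++ */
        r.remainder -= divisor;        /* dividend -= divisor */
    } while (r.remainder >= divisor);  /* loop condition at the bottom */

    return r;
}
```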

sub eax, ecx and cmp eax, ecx in the same loop seems redundant: we could just use sub to set flags, and correct for the overshoot.

    xor    edx, edx                 ; quotient counter = 0
    cmp    eax, ecx
    jb     division_done            ; the quotient = 0 case
repeat_subtraction:                 ; do {
    inc    edx                      ;   quotient++
    sub    eax, ecx                 ;   dividend -= divisor
    jnc    repeat_subtraction       ; while( dividend -= divisor doesn't wrap (carry) );

    add    eax, ecx                 ; correct for the overshoot
    dec    edx
division_done:

(But this isn't actually faster in most cases on most modern x86 CPUs; they can run the inc, cmp, and sub in parallel even if the inputs weren't the same. This would maybe help on AMD Bulldozer-family where the integer cores are pretty narrow.)

Obviously repeated subtraction is total garbage for performance with large numbers. It is possible to implement better algorithms, like one-bit-at-a-time long-division, but the idiv instruction is going to be faster for anything except the case where you know the quotient is 0 or 1, so it takes at most 1 subtraction. (div/idiv is pretty slow compared to any other integer operation, but the dedicated hardware is much faster than looping.)
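As a rough idea of what one-bit-at-a-time long division looks like, here is a C sketch of the usual shift-and-subtract scheme. It always takes 32 iterations regardless of the quotient. The names long_divide_u32 and struct long_div_result are made up for this illustration, and a nonzero divisor is assumed.

```c
#include <stdint.h>

/* Hypothetical result type for this sketch. */
struct long_div_result {
    uint32_t quotient;
    uint32_t remainder;
};

/* Assumes divisor != 0 (the result is meaningless otherwise). */
struct long_div_result long_divide_u32(uint32_t dividend, uint32_t divisor)
{
    uint32_t quotient = 0;
    uint64_t remainder = 0;   /* 64-bit so the shift below cannot overflow */

    for (int bit = 31; bit >= 0; bit--) {
        /* bring down the next dividend bit, most significant first */
        remainder = (remainder << 1) | ((dividend >> bit) & 1u);
        if (remainder >= divisor) {
            remainder -= divisor;            /* the divisor fits at this position */
            quotient |= (uint32_t)1 << bit;  /* so this quotient bit is 1 */
        }
    }

    struct long_div_result r = { quotient, (uint32_t)remainder };
    return r;
}
```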

If you do need to implement signed division manually, normally you record the signs, take the unsigned absolute value, then do unsigned division.

e.g. xor eax, ecx / sets dl gives you dl=0 if EAX and ECX had the same sign, or 1 if they were different (and thus the quotient will be negative). (SF is set according to the sign bit of the result, and XOR produces 1 for different inputs, 0 for same inputs.)
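In C, the record-the-signs approach might look something like the sketch below. The XOR test plays the role of xor / sets, and the remainder takes the sign of the dividend, matching idiv and C's % operator. The names sdiv_via_unsigned and struct sdiv_result are hypothetical, and the unsigned / and % stand in for whatever unsigned division routine you'd actually use.

```c
#include <stdint.h>

/* Hypothetical result type for this sketch. */
struct sdiv_result {
    int32_t quotient;
    int32_t remainder;
};

/* Assumes divisor != 0 and that the quotient fits in int32_t
   (so not INT32_MIN / -1): the same cases where idiv would fault. */
struct sdiv_result sdiv_via_unsigned(int32_t dividend, int32_t divisor)
{
    /* sign bit of the XOR is set only when the signs differ */
    int quotient_negative = (dividend ^ divisor) < 0;
    int remainder_negative = dividend < 0;

    /* unsigned absolute values; 0u - x avoids overflow for INT32_MIN */
    uint32_t ua = dividend < 0 ? 0u - (uint32_t)dividend : (uint32_t)dividend;
    uint32_t ub = divisor < 0 ? 0u - (uint32_t)divisor : (uint32_t)divisor;

    uint32_t uq = ua / ub;   /* unsigned division on the magnitudes */
    uint32_t ur = ua % ub;

    struct sdiv_result r;
    /* widen before negating so a magnitude of 2^31 is handled safely */
    r.quotient = (int32_t)(quotient_negative ? -(int64_t)uq : (int64_t)uq);
    r.remainder = remainder_negative ? -(int32_t)ur : (int32_t)ur;
    return r;
}
```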

Why am I allowed to exit main using ret?

C main is called (indirectly) from CRT startup code, not directly from the kernel.

After main returns, that code calls atexit functions to do stuff like flushing stdio buffers, then passes main's return value to a raw _exit system call. Or exit_group which exits all threads.

You make several wrong assumptions, all I think based on a misunderstanding of how kernels work.

The kernel runs at a different privilege level from user-space (ring 0 vs. ring 3 on x86). Even if user-space knew the right address to jump to, it can't jump into kernel code. (And even if it could, it wouldn't be running with kernel privilege level).
ret isn't magic, it's basically just pop %rip and doesn't let you jump anywhere you couldn't jump to with other instructions. Also doesn't change privilege level¹.
Kernel addresses aren't mapped / accessible when user-space code is running; those page-table entries are marked as supervisor-only. (Or they're not mapped at all in kernels that mitigate the Meltdown vulnerability, so entering the kernel goes through a "wrapper" block of code that changes CR3.)
Virtual memory is how the kernel protects itself from user-space. User-space can't modify page tables directly, only by asking the kernel to do it via mmap and mprotect system calls. (And user-space can't execute privileged instructions like mov cr3, rax to install new page tables. That's the purpose of having ring 0 (kernel mode) vs. ring 3 (user mode).)
The kernel stack is separate from the user-space stack for a process. (In the kernel, there's also a small kernel stack for each task (aka thread) that's used during system calls / interrupts while that user-space thread is running. At least that's how Linux does it, IDK about others.)
The kernel doesn't literally call user-space code; The user-space stack doesn't hold any return address back into the kernel. A kernel->user transition involves swapping stack pointers, as well as changing privilege levels. e.g. with an instruction like iret (interrupt-return).
Plus, leaving a kernel code address anywhere user-space can see it would defeat kernel ASLR.

Footnote 1: (The compiler-generated ret will always be a normal near ret, not a retf that could return through a call gate or something to a privileged cs value. x86 handles privilege levels via the low 2 bits of CS but nevermind that. MacOS / Linux don't set up call gates that user-space can use to call into the kernel; that's done with syscall or int 0x80 instructions.)

In a fresh process (after an execve system call replaced the previous process with this PID with a new one), execution begins at the process entry point (usually labeled _start), not at the C main function directly.

C implementations come with CRT (C RunTime) startup code that has (among other things) a hand-written asm implementation of _start which (indirectly) calls main, passing args to main according to the calling convention.

_start itself is not a function. On process entry, RSP points at argc, and above that on the user-space stack is argv[0], argv[1], etc. (i.e. the char *argv[] array is right there by value, and above that the envp array.) _start loads argc into a register and puts pointers to the argv and envp into registers. (The x86-64 System V ABI that MacOS and Linux both use documents all this, including the process-startup environment and the calling convention.)

If you try to ret from _start, you're just going to pop argc into RIP, and then code-fetch from absolute address 1 or 2 (or other small number) will segfault. For example, Nasm segmentation fault on RET in _start shows an attempt to ret from the process entry point (linked without CRT startup code). It has a hand-written _start that just falls through into main.

When you run gcc main.c, the gcc front-end runs multiple other programs (use gcc -v to show details). This is how the CRT startup code gets linked into your process:

gcc preprocesses (CPP) and compiles+assembles main.c to main.o (or a temporary file). On MacOS, the gcc command is actually clang which has a built-in assembler, but real gcc really does compile to asm and then run as on that. (The C preprocessor is built-in to the compiler, though.)
gcc runs something like ld -dynamic-linker /lib64/ld-linux-x86-64.so.2 -pie /usr/lib/Scrt1.o /usr/lib/gcc/x86_64-pc-linux-gnu/9.1.0/crtbeginS.o main.o -lc -lgcc /usr/lib/gcc/x86_64-pc-linux-gnu/9.1.0/crtendS.o. That's actually simplified a lot, with some of the CRT files left out, and paths canonicalized to remove ../../lib parts. Also, it doesn't run ld directly, it runs collect2 which is a wrapper for ld. But anyway, that statically links in those .o CRT files that contain _start and some other stuff, and dynamically links libc (-lc) and libgcc (for GCC helper functions like implementing __int128 multiply and divide with 64-bit registers, in case your program uses those).

.intel_syntax

.text:

.global _rbp

_rbp:
  mov rax, rbp
  ret;

This is not allowed, ...

The only reason that doesn't assemble is because you tried to declare .text: as a label, instead of using the .text directive. If you remove the trailing : it does assemble with clang (which treats .intel_syntax the same as .intel_syntax noprefix).

For GCC / GAS to assemble it, you'd also need the noprefix to tell it that register names aren't prefixed by %. (Yes you can have Intel op dst, src order but still with %rsp register names. No you shouldn't do this!) And of course GNU/Linux doesn't use leading underscores.

Not that it would always do what you want if you called it, though! If you compiled main without optimization (so -fno-omit-frame-pointer was in effect), then yes you'd get a pointer to the stack slot below the return address.

And you definitely use the value incorrectly. (*p)-4; loads the saved RBP value (*p) and then offsets by four 8-byte void-pointers. (Because that's how C pointer math works; *p has type void* because p has type void **).

I think you're trying to get your own return address and re-run the call instruction (in main's caller) that reached main, eventually leading to a stack overflow from pushing more return addresses. In GNU C, use void * __builtin_return_address (0) to get your own return address.
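A minimal sketch of using that builtin, assuming GCC or Clang (it's a GNU C extension, not ISO C). The wrapper name current_return_address is made up; noinline is there so the wrapper stays a real call with its own return address.

```c
#include <stddef.h>

/* Returns the address this function will return to, i.e. the
   instruction right after the call in whoever called it. */
__attribute__((noinline))
void *current_return_address(void)
{
    return __builtin_return_address(0);   /* level 0 = the current function */
}
```

The value points just past the call instruction, so finding the start of that call still requires knowing its length, as discussed next.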

x86 call rel32 instructions are 5 bytes, but the call that called main was probably an indirect call, using a pointer in a register. So it might be a 2-byte call *%rax or a 3-byte call *%r12; you don't know unless you disassemble your caller. (I'd suggest single-stepping by instructions (GDB / LLDB stepi) off the end of main using a debugger in disassembly mode. If it has any symbol info for main's caller, you'll be able to scroll backward and see what the previous instruction was.)

If not, you might have to try and see what looks sane; x86 machine code can't be unambiguously decoded backwards because it's variable-length. You can't tell the difference between a byte within an instruction (like an immediate or ModRM) vs. the start of an instruction. It all depends on where you start disassembling from. If you try a few byte offsets, usually only one will produce anything that looks sane.

   asm("movq %rax, 0"); //Exit code is 11, so now it should be 0

This is a store of RAX to absolute address 0, in AT&T syntax. This of course segfaults. exit code 11 is from SIGSEGV, which is signal 11. (Use kill -l to see signal numbers).
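If you want to see the signal number from C rather than from your shell or IDE, a parent process can inspect the wait status. This is only a sketch for POSIX systems: the function name signal_that_killed_child is invented, and raise(SIGSEGV) stands in for the faulting store so the example doesn't depend on any particular bad address.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Forks a child that dies from SIGSEGV and returns the number of the
   signal that terminated it, or -1 on error or a normal exit. */
int signal_that_killed_child(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        signal(SIGSEGV, SIG_DFL);   /* make sure the default action applies */
        raise(SIGSEGV);             /* stand-in for the bad store */
        _exit(0);                   /* only reached if the signal didn't kill us */
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;

    return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```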

Perhaps you wanted mov $0, %eax. Although that's still pointless here, you're about to call through your function pointer. In debug mode, the compiler might load it into RAX and step on your value.

Also, writing a register in an asm statement is never safe when you don't tell the compiler which registers you're modifying (using constraints).
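For illustration, here is roughly what telling the compiler about a written register looks like with GNU C extended asm (GCC / Clang, AT&T syntax). Instead of hard-coding RAX, the "=r" output constraint lets the compiler pick a register and know that it gets written. The function name zero_from_asm is hypothetical, and the non-x86 branch is only there so the sketch still compiles on other targets.

```c
/* Returns 0, produced by an asm statement on x86 and by plain C elsewhere. */
int zero_from_asm(void)
{
    int value = 1;
#if defined(__x86_64__) || defined(__i386__)
    /* %0 is whatever register the compiler chose for the "=r" output,
       so it knows exactly which register this statement modifies. */
    __asm__("movl $0, %0" : "=r"(value));
#else
    value = 0;   /* fallback for non-x86 targets */
#endif
    return value;
}
```

If an asm statement really has to write a specific register that isn't an operand, that register belongs in the clobber list instead.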

   printf("Main: %p\n", main);
   printf("&Main: %p\n", &main); //WTF

main and &main are the same thing because main is a function. That's just how C syntax works for function names. main isn't an object that can have its address taken. See also: & operator optional in function pointer assignment.

It's similar for arrays: the bare name of an array can be assigned to a pointer or passed to functions as a pointer arg. But &array is also the same pointer, same as &array[0]. This is true only for arrays like int array[10], not for pointers like int *ptr; in the latter case the pointer object itself has storage space and can have its own address taken.
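These rules are easy to check with a few comparisons. The sketch below uses made-up names (sample_function and the three checker functions) and casts to void * only so that pointers of different types can be compared.

```c
#include <stdbool.h>

static int sample_function(void) { return 0; }

/* A function name and & applied to it give the same function pointer. */
bool function_name_equals_its_address(void)
{
    int (*a)(void) = sample_function;
    int (*b)(void) = &sample_function;
    return a == b;
}

/* array, &array and &array[0] all refer to the same address
   (the types differ, hence the casts). */
bool array_addresses_coincide(void)
{
    int array[10] = {0};
    return (void *)array == (void *)&array
        && (void *)&array == (void *)&array[0];
}

/* A pointer variable is its own object, stored somewhere other
   than the thing it points to. */
bool pointer_object_has_its_own_address(void)
{
    int target = 0;
    int *ptr = &target;
    return (void *)&ptr != (void *)ptr;
}
```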

Exit program x86

main() is called from the normal C runtime initialization functions. Writing main in any language, including asm, is no different from writing any other function.

Execution begins at _start. If you write your own _start, it has nothing to return to, so you need to make an _exit(2) or exit_group(2) system call.

(Or else segfault when execution falls off the end of your code, or if you try to ret it will pop a value off the stack into the program counter (EIP), and probably segfault on code-fetch from that probably-invalid address.)
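To see the raw system call on its own, without a hand-written _start, you can make it from C in a child process and check the status in the parent. This is a Linux-specific sketch: the function name exit_status_from_raw_syscall is invented, and syscall(SYS_exit_group, ...) is the libc wrapper for issuing a system call by number, so nothing in the CRT's normal exit path runs in the child.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Forks a child that leaves via a raw exit_group system call and
   returns the exit status the parent observes, or -1 on error. */
int exit_status_from_raw_syscall(int code)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        syscall(SYS_exit_group, code);   /* does not return */
        for (;;) { }                     /* not reached */
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;

    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```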

When you compile + link with a C compiler, it links in CRT (C RunTime) startup code that provides a _start which initializes libc then calls main. After your main returns, the CRT code that called it runs atexit functions and then passes main's return value to an exit system call.

_start isn't a function, it's the process entry point. Under Linux for example, on entry to _start ESP points at argc, not a return address. (See the i386 System V ABI.)

This question approaches the topic from a different angle, but my answer to another recent question goes into more detail.

As always, single-stepping with a debugger is a good way to see what's going on and test your understanding.
