What happens if there is no exit system call in an assembly program?
The processor does not know where your code ends. It faithfully executes one instruction after another until execution is redirected elsewhere (e.g. by a jump, call, interrupt, system call, or similar). If your code ends without jumping elsewhere, the processor continues executing whatever is in memory after your code. It is fairly unpredictable what exactly happens, but eventually your code typically crashes because it tries to execute an invalid instruction or tries to access memory that it is not allowed to access. If neither happens and no jump occurs, the processor eventually tries to execute unmapped memory, or memory that is marked as “not executable”, as code, causing a segmentation violation. On Linux, this raises a SIGSEGV or SIGBUS. When unhandled, these terminate your process and optionally produce core dumps.
Assembly program crashes on call or exit
mov esp,0 sets the stack pointer to 0. Any stack instructions like push/pop or call/ret will crash after you do that.
Pick a different register for your array-count temporary, not the stack pointer! You have 7 other choices, looks like you still have EDX unused.
In the normal calling convention, only EAX, ECX, and EDX are call-clobbered (so you can use them without preserving the caller's value). But you're calling ExitProcess instead of returning from main, so you can destroy all the registers. But ESP has to be valid when you call.
call works by pushing a return address onto the stack, like sub esp,4 / mov [esp], next_instruction / jmp ExitProcess. See https://www.felixcloutier.com/x86/CALL.html. As your register-dump shows, ESP=8 before the call, which is why it's trying to store to absolute address 4.
Your code has 2 sections: looping over the array and then finding the average. You can reuse a register for different things in the 2 sections, often vastly reducing register pressure. (i.e. you don't run out of registers.)
Using implicit-length arrays (terminated by a sentinel element like 0) is unusual outside of strings. It's much more common to pass a function a pointer + length, instead of just a pointer.
But anyway, you have an implicit-length array so you have to find its length and remember that when calculating the average. Instead of incrementing a size counter inside the loop, you can calculate it from the pointer you're also incrementing. (Or use the counter as an array index like ary[ecx*4], but pointer-increments are often more efficient.)
Here's what an efficient (scalar) implementation might look like. (With SSE2 for SIMD you could add 4 elements with one instruction...)
It only uses 3 registers total. I could have used ECX instead of ESI (so main could ret without having destroyed any of the registers the caller expected it to preserve, only EAX, ECX, and EDX), but I kept ESI for consistency with your version.
.data
;ary dword 100, -30, 25, 14, 35, -92, 82, 134, 193, 99, 0
ary dword -24, 1, -5, 30, 35, 81, 94, 143, 0
.code
main PROC
;; inputs: static ary of signed dword integers
;; outputs: EAX = array average, EDX = remainder of sum/size
;; ESI = array count (in elements)
;; clobbers: none (other than the outputs)
; EAX = sum accumulator
; ESI = array pointer
; EDX = array element temporary
xor eax, eax ; sum = 0
mov esi, OFFSET ary ; incrementing a pointer is usually efficient, vs. ary[ecx*4] inside a loop or something. So this is good.
sumloop: ; do {
mov edx, [esi]
add esi, 4 ; p++ (advance the pointer, not the loaded element)
add eax, edx ; sum += *p++ without checking for 0, because + 0 is a no-op
test edx, edx ; sets FLAGS the same as cmp edx,0
jnz sumloop ; }while(array element != 0);
;;; fall through if the element is 0.
;;; esi points to one past the terminator, i.e. two past the last real element we want to count for the average
sub esi, OFFSET ary + 4 ; (end+4) - (start+4) = array size in bytes
shr esi, 2 ; esi = array length = (end-start)/element_size
cdq ; sign-extend sum into EDX:EAX as an input for idiv
idiv esi ; EAX = sum/length EDX = sum%length
call ExitProcess
main ENDP
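For comparison, here is a rough C sketch of the same idea, assuming a 0-terminated array of int. The name sentinel_average and its out-parameters are made up for illustration; the only point is that the element count falls out of the pointer difference instead of a separate counter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the original code): average of a
 * 0-terminated int array, mirroring the asm loop above. */
static int sentinel_average(const int *ary, int *remainder, size_t *count)
{
    const int *p = ary;
    int sum = 0;
    int elem;

    do {
        elem = *p++;        /* load, then advance the pointer */
        sum += elem;        /* adding the 0 terminator is harmless */
    } while (elem != 0);

    /* p is now one past the terminator, so subtract 1 to drop it. */
    size_t n = (size_t)(p - ary) - 1;
    assert(n != 0);         /* an empty array has no average */

    *count = n;
    *remainder = sum % (int)n;  /* C truncates toward zero, like idiv */
    return sum / (int)n;
}
```

With the second array from the listing (sum 355 over 8 elements) this gives 44 with remainder 3.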
I used x86's hardware division instruction, instead of a subtraction loop. Your repeated-subtraction loop looked pretty complicated, but manual signed division can be tricky. I don't see where you're handling the possibility of the sum being negative. If your array had a negative sum, repeated subtraction would make it grow until it overflowed. Or in your case, you're breaking out of the loop if sum < count, which will be true on the first iteration for a negative sum.
Note that comments like Set EAX register to 0 are useless. We already know that from reading mov eax,0. sum = 0 describes the semantic meaning, not the architectural effect. There are some tricky x86 instructions where it does make sense to comment about what it even does in this specific case, but mov isn't one of them.
If you just wanted to do repeated subtraction with the assumption that sum is non-negative to start with, it's as simple as this:
;; UNSIGNED division (or signed with non-negative dividend and positive divisor)
; Inputs: sum(dividend) in EAX, count(divisor) in ECX
; Outputs: quotient in EDX, remainder in EAX (reverse of the DIV instruction)
xor edx, edx ; quotient counter = 0
cmp eax, ecx
jb subloop_end ; the quotient = 0 case
repeat_subtraction: ; do {
inc edx ; quotient++
sub eax, ecx ; dividend -= divisor
cmp eax, ecx
jae repeat_subtraction ; while( dividend >= divisor );
; fall through when eax < ecx (unsigned), leaving EAX = remainder
subloop_end:
Notice how checking for special cases before entering the loop lets us simplify it. See also Why are loops always compiled into "do...while" style (tail jump)?
Having sub eax, ecx and cmp eax, ecx in the same loop seems redundant: we could just use sub to set flags, and correct for the overshoot.
xor edx, edx ; quotient counter = 0
cmp eax, ecx
jb division_done ; the quotient = 0 case
repeat_subtraction: ; do {
inc edx ; quotient++
sub eax, ecx ; dividend -= divisor
jnc repeat_subtraction ; while( dividend -= divisor doesn't wrap (carry) );
add eax, ecx ; correct for the overshoot
dec edx
division_done:
(But this isn't actually faster in most cases on most modern x86 CPUs; they can run the inc, cmp, and sub in parallel even if the inputs weren't the same. This would maybe help on AMD Bulldozer-family where the integer cores are pretty narrow.)
Obviously repeated subtraction is total garbage for performance with large numbers. It is possible to implement better algorithms, like one-bit-at-a-time long-division, but the idiv instruction is going to be faster for anything except the case where you know the quotient is 0 or 1, so it takes at most 1 subtraction. (div / idiv is pretty slow compared to any other integer operation, but the dedicated hardware is much faster than looping.)
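To make "one-bit-at-a-time long-division" concrete, here is a minimal C sketch of the usual shift-and-subtract scheme for 32-bit unsigned operands. The function name udiv_long is hypothetical, and this is only an illustration of the algorithm, not something you'd use instead of div.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical illustration: restoring (shift-and-subtract) division.
 * Runs a fixed 32 iterations regardless of how large the quotient is. */
static uint32_t udiv_long(uint32_t dividend, uint32_t divisor,
                          uint32_t *remainder)
{
    assert(divisor != 0);

    uint32_t quotient = 0;
    uint64_t rem = 0;   /* 64-bit so the shift below can't lose a bit */

    for (int bit = 31; bit >= 0; bit--) {
        /* Bring down the next dividend bit, MSB first. */
        rem = (rem << 1) | ((dividend >> bit) & 1u);
        if (rem >= divisor) {
            rem -= divisor;
            quotient |= (uint32_t)1 << bit;
        }
    }

    *remainder = (uint32_t)rem;
    return quotient;
}
```

The cost is 32 loop iterations no matter what, versus up to about 4 billion for repeated subtraction with a divisor of 1.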
If you do need to implement signed division manually, normally you record the signs, take the unsigned absolute value, then do unsigned division.
e.g. xor eax, ecx / sets dl gives you dl=0 if EAX and ECX had the same sign, or 1 if they were different (and thus the quotient will be negative). (SF is set according to the sign bit of the result, and XOR produces 1 for different inputs, 0 for same inputs.)
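The record-signs / absolute-value / unsigned-divide recipe can be sketched in C like this. The name sdiv_manual is made up, and the unsigned / and % operators stand in for whatever unsigned division routine you'd actually write. It assumes the usual two's-complement conversion for the most negative quotient, which GCC and clang provide.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch: signed division built on unsigned division.
 * Quotient truncates toward zero and the remainder takes the sign of
 * the dividend, matching idiv and C's / and %. */
static int32_t sdiv_manual(int32_t a, int32_t b, int32_t *remainder)
{
    assert(b != 0);
    assert(!(a == INT32_MIN && b == -1));   /* quotient wouldn't fit */

    /* Sign bit of a XOR b: same idea as xor / sets in asm. */
    int negative_quotient = (a ^ b) < 0;

    /* Unsigned absolute values; 0u - x avoids overflow on INT32_MIN. */
    uint32_t ua = a < 0 ? 0u - (uint32_t)a : (uint32_t)a;
    uint32_t ub = b < 0 ? 0u - (uint32_t)b : (uint32_t)b;

    uint32_t uq = ua / ub;   /* stand-in for a manual unsigned divide */
    uint32_t ur = ua % ub;

    *remainder = a < 0 ? -(int32_t)ur : (int32_t)ur;
    return negative_quotient ? (int32_t)(0u - uq) : (int32_t)uq;
}
```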
Why am I allowed to exit main using ret?
C main is called (indirectly) from CRT startup code, not directly from the kernel.
After main returns, that code calls atexit functions to do stuff like flushing stdio buffers, then passes main's return value to a raw _exit system call. Or exit_group which exits all threads.
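You can observe the difference between the library exit path and a raw _exit from C. This is only a sketch for a POSIX system: child_ran_atexit and on_exit_handler are hypothetical names, and a forked child stands in for "a process whose main returned", since returning from main behaves like calling exit().

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int report_fd = -1;   /* write end of the pipe, set in the child */

static void on_exit_handler(void)
{
    const char mark = 'x';
    ssize_t ignored = write(report_fd, &mark, 1);
    (void)ignored;
}

/* Hypothetical helper: returns 1 if the child's atexit handler ran,
 * 0 if it did not, -1 on error. *exit_status gets the child's status. */
static int child_ran_atexit(int use_raw_exit, int status_code,
                            int *exit_status)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    fflush(NULL);            /* don't let the child re-flush our stdio */
    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {          /* child */
        close(fds[0]);
        report_fd = fds[1];
        atexit(on_exit_handler);
        if (use_raw_exit)
            _exit(status_code);   /* straight to the kernel */
        exit(status_code);        /* runs atexit handlers first */
    }

    close(fds[1]);
    char buf;
    ssize_t n = read(fds[0], &buf, 1);   /* 0 bytes = handler never ran */
    close(fds[0]);

    int wstatus = 0;
    if (waitpid(pid, &wstatus, 0) < 0)
        return -1;
    *exit_status = WIFEXITED(wstatus) ? WEXITSTATUS(wstatus) : -1;
    return n == 1;
}
```

Either way the parent sees the same exit status; only the exit() path gives the handler a chance to run.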
You make several wrong assumptions, all I think based on a misunderstanding of how kernels work.
- The kernel runs at a different privilege level from user-space (ring 0 vs. ring 3 on x86). Even if user-space knew the right address to jump to, it can't jump into kernel code. (And even if it could, it wouldn't be running with kernel privilege level).
- ret isn't magic, it's basically just pop %rip and doesn't let you jump anywhere you couldn't jump to with other instructions. Also doesn't change privilege level (footnote 1).
- Kernel addresses aren't mapped / accessible when user-space code is running; those page-table entries are marked as supervisor-only. (Or they're not mapped at all in kernels that mitigate the Meltdown vulnerability, so entering the kernel goes through a "wrapper" block of code that changes CR3.)
- Virtual memory is how the kernel protects itself from user-space. User-space can't modify page tables directly, only by asking the kernel to do it via mmap and mprotect system calls. (And user-space can't execute privileged instructions like mov cr3, rax to install new page tables. That's the purpose of having ring 0 (kernel mode) vs. ring 3 (user mode).)
- The kernel stack is separate from the user-space stack for a process. (In the kernel, there's also a small kernel stack for each task (aka thread) that's used during system calls / interrupts while that user-space thread is running. At least that's how Linux does it, IDK about others.)
- The kernel doesn't literally call user-space code; the user-space stack doesn't hold any return address back into the kernel. A kernel->user transition involves swapping stack pointers, as well as changing privilege levels, e.g. with an instruction like iret (interrupt-return).
- Plus, leaving a kernel code address anywhere user-space can see it would defeat kernel ASLR.
Footnote 1: The compiler-generated ret will always be a normal near ret, not a retf that could return through a call gate or something to a privileged cs value. (x86 handles privilege levels via the low 2 bits of CS, but never mind that.) MacOS / Linux don't set up call gates that user-space can use to call into the kernel; that's done with syscall or int 0x80 instructions.
In a fresh process (after an execve system call replaced the previous process with this PID with a new one), execution begins at the process entry point (usually labeled _start), not at the C main function directly.
C implementations come with CRT (C RunTime) startup code that has (among other things) a hand-written asm implementation of _start which (indirectly) calls main, passing args to main according to the calling convention.
_start itself is not a function. On process entry, RSP points at argc, and above that on the user-space stack is argv[0], argv[1], etc. (i.e. the char *argv[] array is right there by value, and above that the envp array.) _start loads argc into a register and puts pointers to the argv and envp into registers. (The x86-64 System V ABI that MacOS and Linux both use documents all this, including the process-startup environment and the calling convention.)
If you try to ret from _start, you're just going to pop argc into RIP, and then code-fetch from absolute address 1 or 2 (or other small number) will segfault. For example, Nasm segmentation fault on RET in _start shows an attempt to ret from the process entry point (linked without CRT startup code). It has a hand-written _start that just falls through into main.
When you run gcc main.c, the gcc front-end runs multiple other programs (use gcc -v to show details). This is how the CRT startup code gets linked into your process:
- gcc preprocesses (CPP) and compiles+assembles main.c to main.o (or a temporary file). On MacOS, the gcc command is actually clang which has a built-in assembler, but real gcc really does compile to asm and then run as on that. (The C preprocessor is built-in to the compiler, though.)
- gcc runs something like ld -dynamic-linker /lib64/ld-linux-x86-64.so.2 -pie /usr/lib/Scrt1.o /usr/lib/gcc/x86_64-pc-linux-gnu/9.1.0/crtbeginS.o main.o -lc -lgcc /usr/lib/gcc/x86_64-pc-linux-gnu/9.1.0/crtendS.o. That's actually simplified a lot, with some of the CRT files left out, and paths canonicalized to remove ../../lib parts. Also, it doesn't run ld directly, it runs collect2 which is a wrapper for ld. But anyway, that statically links in those .o CRT files that contain _start and some other stuff, and dynamically links libc (-lc) and libgcc (for GCC helper functions like implementing __int128 multiply and divide with 64-bit registers, in case your program uses those).
.intel_syntax
.text:
.global _rbp
_rbp:
mov rax, rbp
ret;
This is not allowed, ...
The only reason that doesn't assemble is because you tried to declare .text: as a label, instead of using the .text directive. If you remove the trailing : it does assemble with clang (which treats .intel_syntax the same as .intel_syntax noprefix).
For GCC / GAS to assemble it, you'd also need the noprefix to tell it that register names aren't prefixed by %. (Yes you can have Intel op dst, src order but still with %rsp register names. No you shouldn't do this!) And of course GNU/Linux doesn't use leading underscores.
Not that it would always do what you want if you called it, though! If you compiled main without optimization (so -fno-omit-frame-pointer was in effect), then yes you'd get a pointer to the stack slot below the return address.
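And you definitely use the value incorrectly. (*p)-4; loads the saved RBP value (*p) and then offsets that pointer by 4. Because p has type void **, *p has type void*, and GNU C does arithmetic on void* in bytes (ISO C doesn't allow it at all), so this is a 4-byte offset from the saved RBP value, not anything related to the return address.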
I think you're trying to get your own return address and re-run the call instruction (in main's caller) that reached main, eventually leading to a stack overflow from pushing more return addresses. In GNU C, use void * __builtin_return_address (0) to get your own return address.
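A minimal sketch of using that builtin, assuming GCC or clang. The wrapper name my_return_address is hypothetical, and noinline is there so the function really has a caller to return to.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical wrapper: returns the address this call will return to,
 * i.e. the instruction right after the call in the caller. */
__attribute__((noinline))
static void *my_return_address(void)
{
    return __builtin_return_address(0);
}
```

The value points just past the call instruction, so recovering the start of the call still needs the instruction length, which is the problem discussed next.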
x86 call rel32 instructions are 5 bytes, but the call that called main was probably an indirect call, using a pointer in a register. So it might be a 2-byte call *%rax or a 3-byte call *%r12; you don't know unless you disassemble your caller. I'd suggest single-stepping by instructions (GDB / LLDB stepi) off the end of main using a debugger in disassembly mode. If it has any symbol info for main's caller, you'll be able to scroll backward and see what the previous instruction was.
If not, you might have to try and see what looks sane; x86 machine code can't be unambiguously decoded backwards because it's variable-length. You can't tell the difference between a byte within an instruction (like an immediate or ModRM) vs. the start of an instruction. It all depends on where you start disassembling from. If you try a few byte offsets, usually only one will produce anything that looks sane.
asm("movq %rax, 0"); //Exit code is 11, so now it should be 0
This is a store of RAX to absolute address 0, in AT&T syntax. This of course segfaults. Exit code 11 is from SIGSEGV, which is signal 11. (Use kill -l to see signal numbers).
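To see the signal number from the parent's side, here is a small POSIX sketch. The helper name child_termination_signal is made up, and the child uses raise(SIGSEGV) as a deterministic stand-in for an actual bad store.

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that dies from SIGSEGV and return
 * the signal number the kernel reports, 0 if it exited normally,
 * or -1 on error. */
static int child_termination_signal(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {                 /* child */
        signal(SIGSEGV, SIG_DFL);   /* make sure the default action applies */
        raise(SIGSEGV);             /* stand-in for a store to address 0 */
        _exit(0);                   /* not reached */
    }

    int wstatus = 0;
    if (waitpid(pid, &wstatus, 0) < 0)
        return -1;
    return WIFSIGNALED(wstatus) ? WTERMSIG(wstatus) : 0;
}
```

On Linux the returned value is SIGSEGV, which is 11.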
Perhaps you wanted mov $0, %eax. Although that's still pointless here, you're about to call through your function pointer. In debug mode, the compiler might load it into RAX and step on your value.
Also, writing a register in an asm statement is never safe when you don't tell the compiler which registers you're modifying (using constraints).
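For reference, this is roughly what telling the compiler looks like with GNU C extended asm. It's a sketch only: zero_via_asm is a hypothetical name, the asm is AT&T syntax for x86, and other targets fall back to plain C so the example still compiles.

```c
#include <assert.h>

/* Hypothetical example: zero a value through inline asm, using an
 * output constraint so the compiler knows which register is written. */
static int zero_via_asm(void)
{
    int value = 123;
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    /* "=r" lets the compiler pick the register and tells it that the
     * register is overwritten; nothing is modified behind its back. */
    __asm__ ("movl $0, %0" : "=r"(value));
#else
    value = 0;   /* non-x86 fallback, no asm involved */
#endif
    return value;
}
```

If you hard-code a register name inside the template instead, list it in the clobber section (the third colon-separated field) so the compiler doesn't keep a live value there.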
printf("Main: %p\n", main);
printf("&Main: %p\n", &main); //WTF
main and &main are the same thing because main is a function. That's just how C syntax works for function names. main isn't an object that can have its address taken. See also: & operator optional in function pointer assignment.
It's similar for arrays: the bare name of an array can be assigned to a pointer or passed to functions as a pointer arg. But &array is also the same address, same as &array[0] (only the pointer type differs). This is true only for arrays like int array[10], not for pointers like int *ptr; in the latter case the pointer object itself has storage space and can have its own address taken.
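A quick way to convince yourself, as a sketch with made-up names (sample_function and the three checkers are hypothetical):

```c
#include <assert.h>

static int sample_function(void) { return 0; }

/* A function name and & applied to it give the same pointer. */
static int function_name_equals_address(void)
{
    int (*a)(void) = sample_function;
    int (*b)(void) = &sample_function;
    return a == b;
}

/* array, &array and &array[0] are all the same address
 * (the types differ, hence the casts to void * for comparing). */
static int array_addresses_match(void)
{
    int array[10];
    return (void *)array == (void *)&array
        && (void *)&array == (void *)&array[0];
}

/* A pointer variable is its own object: &ptr is where the pointer
 * is stored, not where it points. */
static int pointer_has_own_address(void)
{
    int array[10];
    int *ptr = array;
    return (void *)&ptr != (void *)ptr;
}
```

All three functions return 1.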
Exit program x86
main() is called from the normal C runtime initialization functions. Writing main in any language, including asm, is no different from writing any other function.
Execution begins at _start. If you write your own _start, it has nothing to return to, so you need to make an _exit(2) or exit_group(2) system call.
(Or else segfault when execution falls off the end of your code, or if you try to ret it will pop a value off the stack into the program counter (EIP), and probably segfault on code-fetch from that probably-invalid address.)
When you compile + link with a C compiler, it links in CRT (C RunTime) startup code that provides a _start which initializes libc then calls main. After your main returns, the CRT code that called it runs atexit functions and then passes main's return value to an exit system call.
_start isn't a function, it's the process entry point. Under Linux for example, on entry to _start ESP points at argc, not a return address. (See the i386 System V ABI.)
This question comes at the issue from a different angle, but my answer to another recent question goes into more detail.
As always, single-stepping with a debugger is a good way to see what's going on and test your understanding.