Number of Executed Instructions Different for Hello World Program Nasm Assembly and C



Is the number of instructions executed in program 1) high because the program is linked with system libraries at runtime?

Yep, dynamic linking plus CRT (C runtime) startup files.

I used -static, which reduces the count to about 1/10.

So that just left the CRT start files, which do stuff before calling main, and after.

How can I ensure that the instruction count is only that of the main function in Program 1)?

Measure an empty main, then subtract that number from future measurements.

Unless your instruction counter is smarter, and looks at symbols in the executable for the process it's tracing, it won't be able to tell which code came from where.
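If you'd rather count only a region of your own code than subtract whole-process numbers, Linux's perf_event_open(2) can do the counting from inside the process. The following is a minimal sketch, not something from the question: the helper names count_user_instructions, empty_region and net_user_instructions are hypothetical, and it assumes the hardware counter is available to unprivileged users (it returns -1 otherwise, e.g. in some VMs or with a strict perf_event_paranoid setting).

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: count user-space instructions retired while fn()
 * runs. Returns -1 if the counter can't be opened or read. */
long long count_user_instructions(void (*fn)(void))
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof attr;
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 1;        /* start stopped; enabled explicitly below */
    attr.exclude_kernel = 1;  /* user-space only, like perf's :u modifier */
    attr.exclude_hv = 1;

    /* pid = 0, cpu = -1: this thread, on whatever CPU it runs on */
    int fd = (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0UL);
    if (fd < 0)
        return -1;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    fn();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t count = 0;
    ssize_t n = read(fd, &count, sizeof count);
    close(fd);
    return n == (ssize_t)sizeof count ? (long long)count : -1;
}

/* Baseline: an empty region, so only the measurement overhead is counted. */
void empty_region(void) {}

/* Net count for fn(): measured value minus the empty-region baseline. */
long long net_user_instructions(void (*fn)(void))
{
    long long base = count_user_instructions(empty_region);
    long long with = count_user_instructions(fn);
    if (base < 0 || with < 0)
        return -1;
    return with - base;
}
```

The enable/disable ioctls themselves contribute a few user-space instructions, which is why the sketch still subtracts an empty-region baseline. For whole-process numbers, perf stat -e instructions:u is simpler.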

and which is how Program 2) is reporting for the debugger.

That's because there is no other code in that program. It's not that you somehow helped the debugger ignore some instructions, it's that you made a program without any instructions you didn't put there yourself.

If you want to see what actually happens when you run the gcc output, run gdb a.out, then b _start, r, and single-step. Once you get deep in the call tree, you're probably going to want to use fin to finish execution of the current function, since you don't want to single-step through literally 1 million instructions, or even 10k.


related: How do I determine the number of x86 machine instructions executed in a C program? shows perf stat will count 3 user-space instructions total in a NASM program that does mov eax, 231 / syscall, linked into a static executable.

Why does it take so many instructions to run an empty program?

It's hardly fair to claim that it "does literally nothing". Yes, at the app level you chose to make the whole thing a giant no-op for your microbenchmark, and that's fine. But down beneath the covers at the system level, it's hardly "nothing". You asked Linux to fork off a brand new execution environment, initialize it, and connect it to the environment. You called very few glibc functions, but dynamic linking is non-trivial, and after a million instructions your process was ready to demand-fault printf() and friends, and to efficiently bring in libs you might have linked against or dlopen()'ed.

This is not the sort of microbench that implementors are likely to optimize against. What would be of interest is if you can identify "expensive" aspects of fork/exec that in some use cases are never used, and so might be #ifdef'd out (or have their execution short circuited) in very specific situations. Lazy evaluation of resolv.conf is one example of that, where the overhead is never paid by a process if it never interacts with IP servers.

Hello World program in Nasm x86-64 prints Hello World continuously


Explanation

  • You're using the wrong syscall numbers for x86-64 Linux. Thus your exit() call fails and instead dowork and callback end up in a mutual recursion, causing a loop.
  • For the correct syscall numbers, see arch/x86/entry/syscalls/syscall_64.tbl in the Linux source code (arch/x86/syscalls/syscall_64.tbl in older kernel trees):

1 common write sys_write
231 common exit_group sys_exit_group

  • If you're willing to embrace AT&T x86 assembler syntax and the C preprocessor, you can #include <sys/syscall.h> and use e.g. SYS_write as the syscall number for write. See hello-att.S below. This way you can stop worrying about looking up the syscall numbers.

hello.asm

SECTION .text ; Code section
global _start ; Make label available to linker

_start: ; Standard ld entry point
jmp callback ; Jump to the end to get our current address

dowork:
pop rsi ;
mov rax,1 ; System call number for write
mov rdi,1 ; 1 for stdout
mov rdx,12 ; length of Hello World
syscall ; Switch to the kernel mode

mov rax,231 ; exit_group(0)
xor rdi,rdi ;
syscall ;

callback:
call dowork ; Pushes the address of "Hello World" onto the stack
db 'Hello World',0xA ; The string we want to print

hello-att.S

#include <sys/syscall.h>

.global _start
_start:
jmp callback
dowork:
/* write(1, "Hello World\n", 12) */
pop %rsi /* "Hello World\n" */
mov $SYS_write, %rax
mov $1, %rdi
mov $12, %rdx
syscall

/* exit_group(0) */
mov $SYS_exit_group, %rax
xor %rdi, %rdi
syscall

callback:
call dowork
.ascii "Hello World\n"

Knowing AT&T x86 assembler syntax makes reading Linux kernel and glibc source a lot easier ;)
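The same <sys/syscall.h> constants work from plain C too, which can be a handy way to check a call number before hard-coding it in asm. A minimal sketch, assuming glibc's syscall(2) wrapper; raw_write is a hypothetical helper name, not a libc function:

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/syscall.h>   /* SYS_write, SYS_exit_group, ... */
#include <unistd.h>        /* syscall() */

/* Hypothetical helper: invoke write by its symbolic call number, bypassing
 * the libc write() wrapper without hard-coding 1 (x86-64) or 4 (i386). */
long raw_write(int fd, const void *buf, size_t len)
{
    return syscall(SYS_write, fd, buf, len);
}
```

Calling raw_write(1, "Hello World\n", 12) should behave like the write in hello-att.S. Because SYS_write expands to whatever number matches the target the code is compiled for, the same source stays correct when built with -m32.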

What is a reasonable minimum number of assembly instructions for a small C program including setup?

TL;DR: -static is not the default; use it to make an ELF executable that only runs your _start.

-no-pie -nostdlib will also make a static executable simply because it's non-PIE and there are no dynamic libraries to link.

There also is such a thing as -static-pie where the kernel will load your executable to a randomized base address but not run ld.so first (I think), but that's not what you get with -static.


Just to be clear, we're talking about the dynamic instruction count (how many are actually executed in user-space, perf stat -e instructions:u), not a static count (how many are sitting on disk / in memory as part of the executable). A static count only counts instructions inside loops once, and still counts instructions that never execute.

Or at least that's what I'm answering. That makes metadata in other sections, and code that doesn't execute, irrelevant.

According to gdb, code from ld-linux-x86-64.so.2 is mapped into the program address space. Given that I disabled vdso and am including no libraries, is this file necessary to run the program?

You still built a position-independent executable (PIE). This is an ELF shared object with an entry point, so it's still dynamically linked, and the ld.so ELF interpreter runs on it. There's nothing for it to do because you don't actually use any shared libraries, but 17k user-space instructions sounds about right. I get 32606 or 32607 instructions for your program on my Arch Linux system (glibc 2.31).

ld.so is started as an "interpreter" for your binary in a similar way to how /bin/sh is started to interpret an executable text file that starts with #!/bin/sh. (Although Linux's ELF program loader still does some of the work of mapping program segments into memory according to the program header of the executable, so ld.so doesn't have to do that manually with system calls.)

You can see this by running under gdb ./foo5 and using starti instead of run to stop before the first user-space instruction. You'll see that you're in ld.so's _start.

Reading symbols from ./foo5...
(No debugging symbols found in ./foo5)
Cannot access memory at address 0x1024 ### note this isn't a real address,
### just an offset relative to the base address / start of the file.
### That's another clue this is a PIE
(gdb) starti

Program stopped.
0x00007ffff7fd3100 in _start () from /lib64/ld-linux-x86-64.so.2

You can also run strace ./foo5 to see the system calls it makes, as an indication that there's a bunch of stuff happening:

$ strace ./foo5
execve("./foo5", ["./foo5"], 0x7ffc12394d90 /* 50 vars */) = 0
brk(NULL) = 0x55741b4b7000
arch_prctl(0x3001 /* ARCH_??? */, 0x7ffca69312b0) = -1 EINVAL (Invalid argument)
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f1d4fc4b000
arch_prctl(ARCH_SET_FS, 0x7f1d4fc4ba80) = 0
mprotect(0x557419622000, 4096, PROT_READ) = 0
strace: [ Process PID=303809 runs in 32 bit mode. ]
exit(0) = ?

(Note the "runs in 32 bit mode"; it doesn't, but strace detected that you used the 32-bit int $0x80 ABI instead of the normal syscall ABI that ld.so used.)
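Whether the kernel starts ld.so first is recorded in the executable itself, as a PT_INTERP program header. As a rough sketch of how you could check that yourself (elf64_has_interp is a hypothetical helper name; it assumes a 64-bit ELF file with the same byte order as the host):

```c
#include <elf.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the ELF64 file at `path` has a
 * PT_INTERP program header (so an interpreter such as ld.so runs first),
 * 0 if it has none, and -1 on error or if the file isn't ELF64.
 * Assumes the file's byte order matches the host's. */
int elf64_has_interp(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return -1;

    Elf64_Ehdr eh;
    int result = -1;
    if (fread(&eh, sizeof eh, 1, f) == 1 &&
        memcmp(eh.e_ident, ELFMAG, SELFMAG) == 0 &&
        eh.e_ident[EI_CLASS] == ELFCLASS64 &&
        eh.e_phentsize == sizeof(Elf64_Phdr)) {
        result = 0;
        /* Walk the program header table looking for PT_INTERP. */
        for (unsigned i = 0; i < eh.e_phnum; i++) {
            Elf64_Phdr ph;
            long off = (long)(eh.e_phoff + (Elf64_Off)i * sizeof ph);
            if (fseek(f, off, SEEK_SET) != 0 ||
                fread(&ph, sizeof ph, 1, f) != 1) {
                result = -1;
                break;
            }
            if (ph.p_type == PT_INTERP) {
                result = 1;
                break;
            }
        }
    }
    fclose(f);
    return result;
}
```

I'd expect this to return 1 for the PIE build of foo5 and 0 for a -static build. readelf -l shows the same program headers without writing any code.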


Use -static

-nostdlib used to imply -static, in GCC configured to not make PIEs by default. But modern distros do configure GCC to make PIEs for security reasons. See 32-bit absolute addresses no longer allowed in x86-64 Linux?

$ file foo5
foo5: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=1ac0a9af247fefebde100695805e5b73f06e891c, not stripped

After building with -static, OTOH:

$ file foo5
foo5: ELF 64-bit LSB executable ...
$ perf stat --all-user ./foo5

Performance counter stats for './foo5':

0.03 msec task-clock # 0.151 CPUs utilized
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
1 page-faults # 0.030 M/sec
1,930 cycles # 0.058 GHz
12 instructions # 0.01 insn per cycle
4 branches # 0.121 M/sec
0 branch-misses # 0.00% of all branches

0.000219151 seconds time elapsed

0.000284000 seconds user
0.000000000 seconds sys

(Odd that perf doesn't print :u for the events when you use --all-user. My system has /proc/sys/kernel/perf_event_paranoid = 0 so if I don't use that, it also counts instructions executed inside the kernel. That varies significantly from run to run, but around 60k total for this static executable.)

I only count 11 user-space instructions that execute, but apparently my i7-6700k counts 12 for that event. (There is hardware support for masking user, kernel, or both for any event counter. This is what perf uses.)

GDB also confirms success:

Reading symbols from ./foo5...
(No debugging symbols found in ./foo5)
Cannot access memory at address 0x401024
(gdb) starti
Starting program: /tmp/foo5

Program stopped.
0x0000000000401000 in _start ()
(gdb)

And the disassembly window from layout reg shows:

│  >0x401000 <_start>       call   0x40100e <main>
│ 0x401005 <_start+5> mov eax,0x1
│ 0x40100a <_start+10> xor ebx,ebx
│ 0x40100c <_start+12> int 0x80
│ 0x40100e <main> push rbp
│ 0x40100f <main+1> mov rbp,rsp
│ 0x401012 <main+4> lea rax,[rip+0xfe7] # 0x402000
│ 0x401019 <main+11> mov QWORD PTR [rbp-0x8],rax
│ 0x40101d <main+15> mov eax,0x0
│ 0x401022 <main+20> pop rbp
│ 0x401023 <main+21> ret

You could have compiled with -O2 to optimize your main down to just an xor eax,eax / ret, or not call it at all so only 3 user-space instructions had to execute.

Or, to optimize your user-space instruction count while still using C, see @mosvy's answer about writing _start in C with an inline-asm _exit(2) that can inline into it.

Note that your _start fails to pass argc and argv to main, although it does have RSP properly 16-byte aligned before the function call, because the x86-64 SysV ABI guarantees process entry happens with the stack aligned. You could pass them with a mov load and an LEA. Note that since you don't initialize libc, even if you statically linked libc you couldn't call its functions.

See How Get arguments value using inline assembly in C without Glibc? for some hacks. (Basically stand-alone asm _start written in an asm() statement at global scope, or my answer is a total hack on the calling convention.)
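To make the "mov load and an LEA" concrete, here is the same arithmetic written in C against the entry-time stack layout the x86-64 SysV ABI specifies: argc at [rsp], then the argv pointers, a NULL, then the envp pointers. This is only an illustration; struct entry_args and parse_entry_stack are hypothetical names, and a real _start would still need a bit of asm to get the initial RSP value.

```c
#include <stdint.h>
#include <string.h>

struct entry_args {
    long   argc;
    char **argv;
    char **envp;
};

/* `sp` is the stack pointer value at process entry (what _start sees in
 * RSP). Layout: [sp] = argc, then argc argv pointers, a NULL, then the
 * envp pointers, then another NULL. */
struct entry_args parse_entry_stack(void *sp)
{
    struct entry_args a;
    memcpy(&a.argc, sp, sizeof a.argc);             /* the mov load: argc = [sp] */
    a.argv = (char **)((char *)sp + sizeof(long));  /* the LEA: argv = sp + 8 */
    a.envp = a.argv + a.argc + 1;                   /* skip argv[] and its NULL */
    return a;
}
```

In asm terms that would be roughly mov edi, [rsp] / lea rsi, [rsp+8] before call main, with envp computed from those two if main wants a third argument.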

Hello, world in assembly language with Linux system calls?

How does $ work in NASM, exactly? explains how $ - msg gets NASM to calculate the string length as an assemble-time constant for you, instead of hard-coding it.


I originally wrote the rest of this for SO Docs (topic ID: 1164, example ID: 19078), rewriting a basic less-well-commented example by @runner. This looks like a better place to put it than as part of my answer to another question where I had previously moved it after the SO docs experiment ended.


Making a system call is done by putting arguments into registers, then running int 0x80 (32-bit mode) or syscall (64-bit mode). See What are the calling conventions for UNIX & Linux system calls on i386 and x86-64 and The Definitive Guide to Linux System Calls.

Think of int 0x80 as a way to "call" into the kernel, across the user/kernel privilege boundary. The kernel does stuff according to the values that were in registers when int 0x80 executed, then eventually returns. The return value is in EAX.

When execution reaches the kernel's entry point, it looks at EAX and dispatches to the right system call based on the call number in EAX. Values from other registers are passed as function args to the kernel's handler for that system call. (e.g. eax=4 / int 0x80 will get the kernel to call its sys_write kernel function, implementing the POSIX write system call.)

And see also What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code? - that answer includes a look at the asm in the kernel entry point that is "called" by int 0x80. (Also applies to 32-bit user-space, not just 64-bit where you shouldn't use int 0x80).
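If it helps to picture the dispatch step, you can think of it as a lookup in a table of function pointers indexed by the call number. The following is a toy model only, not kernel code: every toy_* name is made up for illustration, and the handlers just return placeholder values.

```c
#include <errno.h>
#include <stddef.h>

/* Toy model only: not kernel code. All names here are hypothetical. */
typedef long (*toy_handler)(long a, long b, long c);

static long toy_exit(long status, long b, long c)
{
    (void)b; (void)c;
    return status & 0xff;      /* placeholder: just echo the low 8 bits */
}

static long toy_write(long fd, long buf, long len)
{
    (void)fd; (void)buf;
    return len;                /* placeholder: pretend all bytes were written */
}

/* Indexed by call number, using the i386 numbers from the text:
 * 1 = exit, 4 = write. Slots not listed stay NULL. */
static const toy_handler toy_table[] = {
    [1] = toy_exit,
    [4] = toy_write,
};

/* Conceptually what happens on int 0x80: EAX selects the handler,
 * EBX/ECX/EDX become its arguments, and the result goes back in EAX. */
long toy_dispatch(long eax, long ebx, long ecx, long edx)
{
    size_t n = sizeof toy_table / sizeof toy_table[0];
    if (eax < 0 || (size_t)eax >= n || toy_table[eax] == NULL)
        return -ENOSYS;        /* unknown call number */
    return toy_table[eax](ebx, ecx, edx);
}
```

So toy_dispatch(4, 1, 0, 14) stands in for eax=4 / int 0x80 with a 14-byte buffer, and a call number with no handler gets -ENOSYS back, which is also what Linux returns in EAX for an invalid call number.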


If you don't already know low-level Unix systems programming, you might want to just write functions in asm that take args and return a value (or update arrays via a pointer arg) and call them from C or C++ programs. Then you can just worry about learning how to handle registers and memory, without also learning the POSIX system-call API and the ABI for using it. That also makes it very easy to compare your code with compiler output for a C implementation. Compilers usually do a pretty good job at making efficient code, but are rarely perfect.

libc provides wrapper functions for system calls, so compiler-generated code would call write rather than invoking the system call directly with int 0x80 (or sysenter, if you care about performance). In x86-64 code, use syscall for the 64-bit ABI. See also syscalls(2).

System calls are documented in section 2 manual pages, like write(2). See the NOTES section for differences between the libc wrapper function and the underlying Linux system call. Note that the wrapper for sys_exit is _exit(2), not the exit(3) ISO C function that flushes stdio buffers and other cleanup first. There's also an exit_group system call that ends all threads. exit(3) actually uses that, because there's no downside in a single-threaded process.

This code makes 2 system calls:

  • sys_write(1, "Hello, World!\n", sizeof(...));
  • sys_exit(0);

I commented it heavily (to the point where it's starting to obscure the actual code without color syntax highlighting). This is an attempt to point things out to total beginners, not how you should comment your code normally.

section .text ; Executable code goes in the .text section
global _start ; The linker looks for this symbol to set the process entry point, so execution starts here
;;;a name followed by a colon defines a symbol. The global _start directive modifies it so it's a global symbol, not just one that we can CALL or JMP to from inside the asm.
;;; note that _start isn't really a "function". You can't return from it, and the kernel passes argc, argv, and env differently than main() would expect.
_start:
;;; write(1, msg, len);
; Start by moving the arguments into registers, where the kernel will look for them
mov edx,len ; 3rd arg goes in edx: buffer length
mov ecx,msg ; 2nd arg goes in ecx: pointer to the buffer
;Set output to stdout (goes to your terminal, or wherever you redirect or pipe)
mov ebx,1 ; 1st arg goes in ebx: Unix file descriptor. 1 = stdout, which is normally connected to the terminal.

mov eax,4 ; system call number (from SYS_write / __NR_write from unistd_32.h).
int 0x80 ; generate an interrupt, activating the kernel's system-call handling code. 64-bit code uses a different instruction, different registers, and different call numbers.
;; eax = return value, all other registers unchanged.

;;;Second, exit the process. There's nothing to return to, so we can't use a ret instruction (like we could if this was main() or any function with a caller)
;;; If we don't exit, execution continues into whatever bytes are next in the memory page,
;;; typically leading to a segmentation fault because the padding 00 00 decodes to add [eax],al.

;;; _exit(0);
xor ebx,ebx ; first arg = exit status = 0. (will be truncated to 8 bits). Zeroing registers is a special case on x86, and mov ebx,0 would be less efficient.
;; leaving out the zeroing of ebx would mean we exit(1), i.e. with an error status, since ebx still holds 1 from earlier.
mov eax,1 ; put __NR_exit into eax
int 0x80 ;Execute the Linux function

section .rodata ; Section for read-only constants

;; msg is a label, and in this context doesn't need to be msg:. It could be on a separate line.
;; db = Data Bytes: assemble some literal bytes into the output file.
msg db 'Hello, world!',0xa ; ASCII string constant plus a newline (0xa = decimal 10)

;; No terminating zero byte is needed, because we're using write(), which takes a buffer + length instead of an implicit-length string.
;; To make this a C string that we could pass to puts or strlen, we'd need a terminating 0 byte. (e.g. "...", 0xa, 0)

len equ $ - msg ; Define an assemble-time constant (not stored by itself in the output file, but will appear as an immediate operand in insns that use it)
; Calculate len = string length. subtract the address of the start
; of the string from the current position ($)
;; equivalently, we could have put a str_end: label after the string and done len equ str_end - msg

Notice that we don't store the string length in data memory anywhere. It's an assemble-time constant, so it's more efficient to have it as an immediate operand than a load. We could also have pushed the string data onto the stack with four push imm32 instructions (14 bytes of string, rounded up to 16), but bloating the code-size too much isn't a good thing.
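For comparison, the closest C idiom to len equ $ - msg is sizeof on a char array, which is also a compile-time constant. A minimal sketch (MSG_LEN is just an illustrative name):

```c
/* C analogue of NASM's `len equ $ - msg`: the length is a compile-time
 * constant derived from the string itself, never stored as separate data. */
static const char msg[] = "Hello, world!\n";

/* sizeof counts the terminating 0 byte that C adds implicitly; write()
 * doesn't need it, so subtract 1. */
enum { MSG_LEN = sizeof msg - 1 };
```

A call like write(1, msg, MSG_LEN) then passes 14 as an immediate, the same way mov edx,len does in the asm version.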


On Linux, you can save this file as Hello.asm and build a 32-bit executable from it with these commands:

nasm -felf32 Hello.asm                  # assemble as 32-bit code.  Add -Worphan-labels -g -Fdwarf  for debug symbols and warnings
gcc -static -nostdlib -m32 Hello.o -o Hello # link without CRT startup code or libc, making a static binary

See this answer for more details on building assembly into 32 or 64-bit static or dynamically linked Linux executables, for NASM/YASM syntax or GNU AT&T syntax with GNU as directives. (Key point: make sure to use -m32 or equivalent when building 32-bit code on a 64-bit host, or you will have confusing problems at run-time.)


You can trace its execution with strace to see the system calls it makes:

$ strace ./Hello 
execve("./Hello", ["./Hello"], [/* 72 vars */]) = 0
[ Process PID=4019 runs in 32 bit mode. ]
write(1, "Hello, world!\n", 14Hello, world!
) = 14
_exit(0) = ?
+++ exited with 0 +++

Compare this with the trace for a dynamically linked process (like gcc makes from hello.c, or from running strace /bin/ls) to get an idea just how much stuff happens under the hood for dynamic linking and C library startup.

The trace on stderr and the regular output on stdout are both going to the terminal here, so they interfere in the line with the write system call. Redirect or trace to a file if you care. Notice how this lets us easily see the syscall return values without having to add code to print them, and is actually even easier than using a regular debugger (like gdb) to single-step and look at eax for this. See the bottom of the x86 tag wiki for gdb asm tips. (The rest of the tag wiki is full of links to good resources.)

The x86-64 version of this program would be extremely similar, passing the same args to the same system calls, just in different registers and with syscall instead of int 0x80. See the bottom of What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code? for a working example of writing a string and exiting in 64-bit code.


related: A Whirlwind Tutorial on Creating Really Teensy ELF Executables for Linux. The smallest binary file you can run that just makes an exit() system call. That is about minimizing the binary size, not the source size or even just the number of instructions that actually run.


