How to Create Threads Without System Calls in Linux X86 Gas Assembly

Is it possible to create threads without system calls in Linux x86 GAS assembly?

The short answer is that you can't. When you write assembly code it runs sequentially (or with branches) on one and only one logical (i.e. hardware) thread. If you want some of the code to execute on another logical thread (whether on the same core, on a different core on the same CPU or even on a different CPU), you need to have the OS set up the other thread's instruction pointer (CS:EIP) to point to the code you want to run. This implies using system calls to get the OS to do what you want.

User threads won't give you the threading support that you want, because they all run on the same hardware thread.

Edit: Incorporating Ira Baxter's answer with Parlanse. If you ensure that your program has a thread running in each logical thread to begin with, then you can build your own scheduler without relying on the OS. Either way, you need a scheduler to handle hopping from one thread to another. Between calls to the scheduler, there are no special assembly instructions to handle multi-threading. The scheduler itself can't rely on any special assembly, but rather on conventions between parts of the scheduler in each thread.

Either way, whether or not you use the OS, you still have to rely on some scheduler to handle cross-thread execution.
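To make the "user threads plus your own scheduler" point concrete, here is a minimal sketch in C rather than assembly. It assumes glibc's <ucontext.h> functions (getcontext, makecontext, swapcontext); the names run_user_threads and task and the fixed two-task round-robin are illustrative only. Both tasks take turns on the one hardware thread that called the function, and no new kernel thread is created. glibc's swapcontext typically still makes a system call to save and restore the signal mask, which a hand-written assembly switch could avoid.

```c
#include <stddef.h>
#include <ucontext.h>

#define TRACE_MAX 16

static ucontext_t sched_ctx, task_ctx[2];
static char task_stack[2][64 * 1024];
static char trace[TRACE_MAX];
static size_t trace_len;

/* Append one character to the execution trace. */
static void record(char c)
{
    if (trace_len < TRACE_MAX - 1)
        trace[trace_len++] = c;
}

/* A user-level "thread": does a little work, then yields to the scheduler. */
static void task(int id)
{
    for (int step = 0; step < 2; step++) {
        record((char)('A' + id));
        swapcontext(&task_ctx[id], &sched_ctx);   /* cooperative yield */
    }
    /* Returning resumes uc_link, i.e. the scheduler context. */
}

/* Round-robin scheduler. Returns the order in which the tasks ran. */
const char *run_user_threads(void)
{
    trace_len = 0;
    for (int id = 0; id < 2; id++) {
        getcontext(&task_ctx[id]);
        task_ctx[id].uc_stack.ss_sp = task_stack[id];
        task_ctx[id].uc_stack.ss_size = sizeof task_stack[id];
        task_ctx[id].uc_link = &sched_ctx;
        makecontext(&task_ctx[id], (void (*)(void))task, 1, id);
    }

    /* Two rounds in which each task yields, one final round to let it finish. */
    for (int round = 0; round < 3; round++)
        for (int id = 0; id < 2; id++)
            swapcontext(&sched_ctx, &task_ctx[id]);

    trace[trace_len] = '\0';
    return trace;
}
```

Calling run_user_threads() returns the string "ABAB": the tasks interleave, but only because each one voluntarily hands control back. Nothing here runs in parallel.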

x86 Linux assembler get program parameters from _start

On Linux, the familiar argc and argv variables from C are always passed on the stack by the kernel, available even to assembly programs that are completely standalone and don't link with the startup code in the C library. This is documented in the i386 System V ABI, along with other details of the process startup environment (register values, stack alignment).

At the ELF entry point (a.k.a. _start) of an x86 Linux executable:

  1. ESP points to argc
  2. ESP + 4 points to argv[0], the start of the array (i.e. the value you should pass to main as char **argv is lea eax, [esp+4], not mov eax, [esp+4])
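The layout above can also be expressed in C. The following is only a sketch: parse_initial_stack and struct startup_args are hypothetical names, and it assumes every slot on the initial stack is one machine word (true for both the i386 and x86-64 ABIs), with the environment pointers following the NULL that terminates argv.

```c
#include <stdint.h>

/* Hypothetical container for what the kernel leaves on the stack. */
struct startup_args {
    long   argc;
    char **argv;
    char **envp;
};

/* sp is the stack pointer value at the ELF entry point (ESP or RSP). */
struct startup_args parse_initial_stack(void **sp)
{
    struct startup_args s;

    s.argc = (long)(intptr_t)sp[0];   /* first word: argc                  */
    s.argv = (char **)&sp[1];         /* address of the next word: argv    */
    s.envp = s.argv + s.argc + 1;     /* skip argv[argc] == NULL to envp   */
    return s;
}
```

Note that argv is the address of the second slot, not its contents, which is the lea versus mov distinction made above.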

How a Minimal Assembly Program Obtains argc and argv

I'll show how to read argc and argv[0] in GDB.

cmdline-x86.S

#include <sys/syscall.h>

.global _start
_start:
/* Cause a breakpoint trap */
int $0x03

/* exit_group(0) */
mov $SYS_exit_group, %eax
mov $0, %ebx
int $0x80

cmdline-x86.gdb

set confirm off
file cmdline-x86
run
# We'll regain control here after the breakpoint trap
printf "argc: %d\n", *(int*)$esp
printf "argv[0]: %s\n", ((char**)($esp + 4))[0]
quit

Sample Session

$ cc -nostdlib -g3 -m32 cmdline-x86.S -o cmdline-x86
$ gdb -q -x cmdline-x86.gdb cmdline-x86
<...>
Program received signal SIGTRAP, Trace/breakpoint trap.
_start () at cmdline-x86.S:8
8 mov $SYS_exit_group, %eax
argc: 1
argv[0]: /home/scottt/Dropbox/stackoverflow/cmdline-x86

Explanation

  • I placed a software breakpoint (int $0x03) to cause the program to trap back into the debugger right after the ELF entry point (_start).
  • I then used printf in the GDB script to print

    1. argc with the expression *(int*)$esp
    2. argv[0] with the expression ((char**)($esp + 4))[0]

x86-64 version

The differences are minimal:

  • Replace ESP with RSP
  • Change address size from 4 to 8
  • Conform to different Linux syscall calling conventions when we call exit_group(0) to properly terminate the process

cmdline.S

#include <sys/syscall.h>

.global _start
_start:
/* Cause a breakpoint trap */
int $0x03

/* exit_group(0) */
mov $SYS_exit_group, %rax
mov $0, %rdi
syscall

cmdline.gdb

set confirm off
file cmdline
run
printf "argc: %d\n", *(int*)$rsp
printf "argv[0]: %s\n", ((char**)($rsp + 8))[0]
quit

How Regular C Programs Obtain argc and argv

You can disassemble _start from a regular C program to see how it obtains argc and argv from the stack and passes them as it calls __libc_start_main. Using the /bin/true program on my x86-64 machine as an example:

$ gdb -q /bin/true
Reading symbols from /usr/bin/true...Reading symbols from /usr/lib/debug/usr/bin/true.debug...done.
done.
(gdb) disassemble _start
Dump of assembler code for function _start:
0x0000000000401580 <+0>: xor %ebp,%ebp
0x0000000000401582 <+2>: mov %rdx,%r9
0x0000000000401585 <+5>: pop %rsi
0x0000000000401586 <+6>: mov %rsp,%rdx
0x0000000000401589 <+9>: and $0xfffffffffffffff0,%rsp
0x000000000040158d <+13>: push %rax
0x000000000040158e <+14>: push %rsp
0x000000000040158f <+15>: mov $0x404040,%r8
0x0000000000401596 <+22>: mov $0x403fb0,%rcx
0x000000000040159d <+29>: mov $0x4014c0,%rdi
0x00000000004015a4 <+36>: callq 0x401310 <__libc_start_main@plt>
0x00000000004015a9 <+41>: hlt
0x00000000004015aa <+42>: xchg %ax,%ax
0x00000000004015ac <+44>: nopl 0x0(%rax)

The first three arguments to __libc_start_main() are:

  1. RDI: pointer to main()
  2. RSI: argc, you can see how it was the first thing popped off the stack
  3. RDX: argv, the value of RSP right after argc was popped. (ubp_av in the GLIBC source)

The x86 _start is very similar:

Dump of assembler code for function _start:
0x0804842c <+0>: xor %ebp,%ebp
0x0804842e <+2>: pop %esi
0x0804842f <+3>: mov %esp,%ecx
0x08048431 <+5>: and $0xfffffff0,%esp
0x08048434 <+8>: push %eax
0x08048435 <+9>: push %esp
0x08048436 <+10>: push %edx
0x08048437 <+11>: push $0x80485e0
0x0804843c <+16>: push $0x8048570
0x08048441 <+21>: push %ecx
0x08048442 <+22>: push %esi
0x08048443 <+23>: push $0x80483d0
0x08048448 <+28>: call 0x80483b0 <__libc_start_main@plt>
0x0804844d <+33>: hlt
0x0804844e <+34>: xchg %ax,%ax
End of assembler dump.

How can I make Linux system calls from a C/C++ application, without using assembly, and in a cpu-independent manner?

libc already includes the wrapper functions you're looking for. The prototypes for many of them are in #include <unistd.h>, as specified by POSIX.

C is the language of low-level systems programming on Unix (and Linux), so this has been a thing since Unix existed. (Providing wrapper functions in libc is easier than teaching compilers the difference between function calls and system calls, and allows for setting errno on errors. It also allows for tricks like LD_PRELOAD to intercept system calls in user-space.)


The man pages for system calls are in section 2, vs. section 3 for library functions (which might or might not use system calls as part of their implementation: math.h cos(3), ISO C stdio printf(3) and fwrite(3), vs. POSIX write(2)).
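As a small illustration of calling a section 2 wrapper from C, here is a sketch that uses write(2) from <unistd.h>. The function name write_hello is made up for this example. It shows the wrapper convention mentioned above: the wrapper returns -1 on failure and reports the reason through errno.

```c
#include <errno.h>
#include <unistd.h>

/* Write a greeting to fd using the libc wrapper for the write system call.
 * Returns 0 on success, or the errno value set by the wrapper on failure.
 * (A short write is treated as success here to keep the sketch small.) */
int write_hello(int fd)
{
    static const char msg[] = "Hello, world!\n";
    ssize_t n = write(fd, msg, sizeof msg - 1);   /* exclude the trailing NUL */

    if (n == -1)
        return errno;
    return 0;
}
```

For example, write_hello(STDOUT_FILENO) prints the greeting, while write_hello(-1) returns EBADF. No assembly and no CPU-specific code is involved.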

execve(2) is the system call.

execl(3) and friends are also part of libc, and eventually call execve(2). They are convenience wrappers on top of it for constructing the argv array, doing $PATH lookup, and passing along the current process's environment. Thus they're classed as functions, not system calls.

See syscalls(2) for an overview, and a complete list of Linux system calls with links to their man pages. (I've linked the Linux man pages, but there are also POSIX man pages for all of the standard system calls.)


In the unlikely case that you're not linking libc, you can use macros like MUSL's syscall2 / syscall3 / etc. (the number is the arg count) to inline the right asm on whatever platform. You use __NR_write from asm/unistd.h to get system call numbers.

But note that the raw Linux system calls might have small differences from the interface provided by the libc wrappers. For example, they won't check for pthreads cancellation points, and brk / sbrk requires bookkeeping in user-space by libc.
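If libc is still linked and you only want to bypass a specific wrapper, glibc and musl also provide a generic syscall() function, documented in syscall(2). The sketch below uses it; note that this is a different route from the MUSL macros mentioned above, and the names raw_getpid and raw_write are illustrative. It stays CPU-independent because the SYS_* numbers come from <sys/syscall.h> for whatever architecture you compile for.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>   /* SYS_getpid, SYS_write */
#include <unistd.h>        /* syscall() */

/* Invoke getpid by number, without going through the getpid() wrapper. */
long raw_getpid(void)
{
    return syscall(SYS_getpid);
}

/* Invoke write by number. Like the wrappers, syscall() returns -1 and
 * sets errno on failure. */
long raw_write(int fd, const void *buf, unsigned long len)
{
    return syscall(SYS_write, fd, buf, len);
}
```

raw_getpid() should return the same value as getpid(); raw_write() behaves like write(2) but is not a pthreads cancellation point in the way the wrapper is.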

See SYSCALL_INLINE in Android for a portable raw sys_write() inline wrapper using MUSL macros.

But if you are using libc like a normal person for functions like malloc and printf, you should just use its system call wrapper functions.

Hello, world in assembly language with Linux system calls?

How does $ work in NASM, exactly? explains how $ - msg gets NASM to calculate the string length as an assemble-time constant for you, instead of hard-coding it.


I originally wrote the rest of this for SO Docs (topic ID: 1164, example ID: 19078), rewriting a basic less-well-commented example by @runner. This looks like a better place to put it than as part of my answer to another question where I had previously moved it after the SO docs experiment ended.


Making a system call is done by putting arguments into registers, then running int 0x80 (32-bit mode) or syscall (64-bit mode). See also What are the calling conventions for UNIX & Linux system calls on i386 and x86-64 and The Definitive Guide to Linux System Calls.

Think of int 0x80 as a way to "call" into the kernel, across the user/kernel privilege boundary. The kernel does stuff according to the values that were in registers when int 0x80 executed, then eventually returns. The return value is in EAX.

When execution reaches the kernel's entry point, it looks at EAX and dispatches to the right system call based on the call number in EAX. Values from other registers are passed as function args to the kernel's handler for that system call. (e.g. eax=4 / int 0x80 will get the kernel to call its sys_write kernel function, implementing the POSIX write system call.)

And see also What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code? - that answer includes a look at the asm in the kernel entry point that is "called" by int 0x80. (Also applies to 32-bit user-space, not just 64-bit where you shouldn't use int 0x80).


If you don't already know low-level Unix systems programming, you might want to just write functions in asm that take args and return a value (or update arrays via a pointer arg) and call them from C or C++ programs. Then you can just worry about learning how to handle registers and memory, without also learning the POSIX system-call API and the ABI for using it. That also makes it very easy to compare your code with compiler output for a C implementation. Compilers usually do a pretty good job at making efficient code, but are rarely perfect.

libc provides wrapper functions for system calls, so compiler-generated code would call write rather than invoking it directly with int 0x80 (or if you care about performance, sysenter). (In x86-64 code, use syscall for the 64-bit ABI.) See also syscalls(2).

System calls are documented in section 2 manual pages, like write(2). See the NOTES section for differences between the libc wrapper function and the underlying Linux system call. Note that the wrapper for sys_exit is _exit(2), not the exit(3) ISO C function that flushes stdio buffers and other cleanup first. There's also an exit_group system call that ends all threads. exit(3) actually uses that, because there's no downside in a single-threaded process.

This code makes 2 system calls:

  • sys_write(1, "Hello, world!\n", sizeof(...));
  • sys_exit(0);

I commented it heavily (to the point where it's starting to obscure the actual code without color syntax highlighting). This is an attempt to point things out to total beginners, not how you should comment your code normally.

section .text             ; Executable code goes in the .text section
global _start ; The linker looks for this symbol to set the process entry point, so execution starts here
;;;a name followed by a colon defines a symbol. The global _start directive modifies it so it's a global symbol, not just one that we can CALL or JMP to from inside the asm.
;;; note that _start isn't really a "function". You can't return from it, and the kernel passes argc, argv, and env differently than main() would expect.
_start:
;;; write(1, msg, len);
; Start by moving the arguments into registers, where the kernel will look for them
mov edx,len ; 3rd arg goes in edx: buffer length
mov ecx,msg ; 2nd arg goes in ecx: pointer to the buffer
;Set output to stdout (goes to your terminal, or wherever you redirect or pipe)
mov ebx,1 ; 1st arg goes in ebx: Unix file descriptor. 1 = stdout, which is normally connected to the terminal.

mov eax,4 ; system call number (from SYS_write / __NR_write from unistd_32.h).
int 0x80 ; generate an interrupt, activating the kernel's system-call handling code. 64-bit code uses a different instruction, different registers, and different call numbers.
;; eax = return value, all other registers unchanged.

;;;Second, exit the process. There's nothing to return to, so we can't use a ret instruction (like we could if this was main() or any function with a caller)
;;; If we don't exit, execution continues into whatever bytes are next in the memory page,
;;; typically leading to a segmentation fault because the padding 00 00 decodes to add [eax],al.

;;; _exit(0);
xor ebx,ebx ; first arg = exit status = 0. (will be truncated to 8 bits). Zeroing registers is a special case on x86, and mov ebx,0 would be less efficient.
;; leaving out the zeroing of ebx would mean we exit(1), i.e. with an error status, since ebx still holds 1 from earlier.
mov eax,1 ; put __NR_exit into eax
int 0x80 ;Execute the Linux function

section .rodata ; Section for read-only constants

;; msg is a label, and in this context doesn't need to be msg:. It could be on a separate line.
;; db = Data Bytes: assemble some literal bytes into the output file.
msg db 'Hello, world!',0xa ; ASCII string constant plus a newline (0xa = 10)

;; No terminating zero byte is needed, because we're using write(), which takes a buffer + length instead of an implicit-length string.
;; To make this a C string that we could pass to puts or strlen, we'd need a terminating 0 byte. (e.g. "...", 0xa, 0)

len equ $ - msg ; Define an assemble-time constant (not stored by itself in the output file, but will appear as an immediate operand in insns that use it)
; Calculate len = string length. subtract the address of the start
; of the string from the current position ($)
;; equivalently, we could have put a str_end: label after the string and done len equ str_end - msg

Notice that we don't store the string length in data memory anywhere. It's an assemble-time constant, so it's more efficient to have it as an immediate operand than a load. We could also have pushed the string data onto the stack with four push imm32 instructions (14 bytes rounded up to 16), but bloating the code-size too much isn't a good thing.


On Linux, you can save this file as Hello.asm and build a 32-bit executable from it with these commands:

nasm -felf32 Hello.asm                  # assemble as 32-bit code.  Add -Worphan-labels -g -Fdwarf  for debug symbols and warnings
gcc -static -nostdlib -m32 Hello.o -o Hello # link without CRT startup code or libc, making a static binary

See this answer for more details on building assembly into 32 or 64-bit static or dynamically linked Linux executables, for NASM/YASM syntax or GNU AT&T syntax with GNU as directives. (Key point: make sure to use -m32 or equivalent when building 32-bit code on a 64-bit host, or you will have confusing problems at run-time.)


You can trace its execution with strace to see the system calls it makes:

$ strace ./Hello 
execve("./Hello", ["./Hello"], [/* 72 vars */]) = 0
[ Process PID=4019 runs in 32 bit mode. ]
write(1, "Hello, world!\n", 14Hello, world!
) = 14
_exit(0) = ?
+++ exited with 0 +++

Compare this with the trace for a dynamically linked process (like gcc makes from hello.c, or from running strace /bin/ls) to get an idea just how much stuff happens under the hood for dynamic linking and C library startup.

The trace on stderr and the regular output on stdout are both going to the terminal here, so they interfere in the line with the write system call. Redirect or trace to a file if you care. Notice how this lets us easily see the syscall return values without having to add code to print them, and is actually even easier than using a regular debugger (like gdb) to single-step and look at eax for this. See the bottom of the x86 tag wiki for gdb asm tips. (The rest of the tag wiki is full of links to good resources.)

The x86-64 version of this program would be extremely similar, passing the same args to the same system calls, just in different registers and with syscall instead of int 0x80. See the bottom of What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code? for a working example of writing a string and exiting in 64-bit code.
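For a rough idea of the 64-bit convention without leaving C, here is a sketch using GCC/Clang inline assembly. The function name write_via_syscall_insn is made up. On x86-64 it places the call number in RAX and the arguments in RDI, RSI, and RDX, then executes syscall, which clobbers RCX and R11. On any other architecture the sketch falls back to libc's syscall() so that it still compiles.

```c
#include <sys/syscall.h>   /* SYS_write */
#include <unistd.h>        /* syscall() for the non-x86-64 fallback */

long write_via_syscall_insn(int fd, const void *buf, unsigned long len)
{
#if defined(__x86_64__)
    long ret;

    __asm__ volatile ("syscall"
                      : "=a"(ret)                 /* RAX: return value      */
                      : "a"((long)SYS_write),     /* RAX: call number       */
                        "D"((long)fd),            /* RDI: 1st argument      */
                        "S"(buf),                 /* RSI: 2nd argument      */
                        "d"(len)                  /* RDX: 3rd argument      */
                      : "rcx", "r11", "memory");  /* destroyed by syscall   */
    return ret;   /* on failure this is a negative errno value, not -1 */
#else
    return syscall(SYS_write, fd, buf, len);   /* fallback: -1 and errno */
#endif
}
```

A call such as write_via_syscall_insn(1, "Hello, world!\n", 14) returns 14 on success. The two branches differ in how they report errors, which is one of the small wrapper-versus-raw differences noted elsewhere on this page.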


related: A Whirlwind Tutorial on Creating Really Teensy ELF Executables for Linux. The smallest binary file you can run that just makes an exit() system call. That is about minimizing the binary size, not the source size or even just the number of instructions that actually run.

Can we read and fault-inject another thread's program counter?

When you install a signal handler using sigaction() with SA_SIGINFO in the flags, the second parameter the signal handler gets is a pointer to a siginfo_t, and the third parameter is a pointer to a ucontext_t. In Linux, this structure contains, among other things, the set of register values when the kernel interrupted the thread, including the program counter.

#define _POSIX_C_SOURCE 200809L
#define _GNU_SOURCE
#include <signal.h>
#include <ucontext.h>

#if defined(__x86_64__)
#define PROGCOUNTER(ctx) (((ucontext_t *)(ctx))->uc_mcontext.gregs[REG_RIP])
#elif defined(__i386__)
#define PROGCOUNTER(ctx) (((ucontext_t *)(ctx))->uc_mcontext.gregs[REG_EIP])
#else
#error Unsupported architecture.
#endif

void signal_handler(int signum, siginfo_t *info, void *context)
{
const size_t program_counter = PROGCOUNTER(context);

/* Do something ... */

}

As usual, printf() et al. are not async-signal safe, which means it is not safe to use them in a signal handler. If you wish to output the program counter to e.g. standard error, you should not use any of the standard I/O to print to stderr, and instead construct the string to be printed by hand, and use a loop to write() the contents of the string; for example,

#include <stdlib.h>
#include <unistd.h>
#include <errno.h>

static void wrerr(const char *p)
{
const int saved_errno = errno;
const char *q = p;
ssize_t n;

/* Nothing to print? */
if (!p || !*p)
return;

/* Find end of q. strlen() is not async-signal safe. */
while (*q) q++;

/* Write data from p to q. */
while (p < q) {
n = write(STDERR_FILENO, p, (size_t)(q - p));
if (n > 0)
p += n;
else
if (n != -1 || errno != EINTR)
break;
}

errno = saved_errno;
}

Note that you'll want to keep the value of errno unchanged in the signal handler, so that if interrupted after a failed library function, the interrupted thread still sees the correct errno value. (It's mostly a debugging issue, and "good form"; some idiots pooh-pooh this as "it does not happen often enough for me to worry about".)

Your program can examine the /proc/self/maps pseudofile (it is not a real file, but something that the kernel generates on the fly when the file is read) to see the memory regions used by the program, to determine whether the program was running a C library function (very common) or something else when the interrupt was delivered.
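A minimal sketch of that lookup follows. The names find_mapping and struct mapping are hypothetical, and the parsing assumes the usual "start-end perms offset dev inode path" line format of /proc/self/maps. Because it uses stdio, call it from normal code (for example on a program counter value saved by the handler), not from inside the signal handler itself.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

struct mapping {
    uintptr_t start, end;   /* address range [start, end)          */
    char perms[5];          /* e.g. "r-xp"                         */
    char path[256];         /* backing file, or "" if anonymous    */
};

/* Find the memory mapping containing addr. Returns 1 and fills *out
 * if found, 0 otherwise. */
int find_mapping(uintptr_t addr, struct mapping *out)
{
    FILE *f = fopen("/proc/self/maps", "r");
    char line[512];
    int found = 0;

    if (!f)
        return 0;

    while (fgets(line, sizeof line, f)) {
        struct mapping m;

        m.perms[0] = '\0';
        m.path[0] = '\0';
        /* Offset, device, and inode fields are skipped with %*s. */
        if (sscanf(line, "%" SCNxPTR "-%" SCNxPTR " %4s %*s %*s %*s %255[^\n]",
                   &m.start, &m.end, m.perms, m.path) < 3)
            continue;
        if (addr >= m.start && addr < m.end) {
            *out = m;
            found = 1;
            break;
        }
    }
    fclose(f);
    return found;
}
```

If the path of the matching entry names the C library, the thread was inside a library function; if it names your executable, it was in your own code.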

If you wish to interrupt a specific thread in a multi-threaded program, just use pthread_kill(). Otherwise the signal is delivered to one of the threads that has not blocked the signal, more or less at random.


Here is an example program that is tested to work on x86-64 (AMD64) and x86, when compiled with GCC-4.8.4 using -Wall -O2:

#define  _POSIX_C_SOURCE 200809L
#define _GNU_SOURCE
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <errno.h>
#include <signal.h>
#include <ucontext.h>
#include <time.h>
#include <stdio.h>

#if defined(__x86_64__)
#define PROGRAM_COUNTER(mctx) ((mctx).gregs[REG_RIP])
#define STACK_POINTER(mctx) ((mctx).gregs[REG_RSP])
#elif defined(__i386__)
#define PROGRAM_COUNTER(mctx) ((mctx).gregs[REG_EIP])
#define STACK_POINTER(mctx) ((mctx).gregs[REG_ESP])
#else
#error Unsupported hardware architecture.
#endif

#define MAX_SIGNALS 64
#define MCTX(ctx) (((ucontext_t *)ctx)->uc_mcontext)

static void wrerr(const char *p, const char *q)
{
while (p < q) {
ssize_t n = write(STDERR_FILENO, p, (size_t)(q - p));
if (n > 0)
p += n;
else
if (n != -1 || errno != EINTR)
break;
}
}

static const char hexc[16] = "0123456789abcdef";

static inline char *prehex(char *before, size_t value)
{
do {
*(--before) = hexc[value & 15];
value /= (size_t)16;
} while (value);
*(--before) = 'x';
*(--before) = '0';
return before;
}

static volatile sig_atomic_t done = 0;

static void handle_done(int signum)
{
done = signum;
}

static int install_done(const int signum)
{
struct sigaction act;

memset(&act, 0, sizeof act);
sigemptyset(&act.sa_mask);
act.sa_handler = handle_done;
act.sa_flags = 0;
if (sigaction(signum, &act, NULL) == -1)
return errno;

return 0;
}

static size_t jump_target[MAX_SIGNALS] = { 0 };
static size_t jump_stack[MAX_SIGNALS] = { 0 };

static void handle_jump(int signum, siginfo_t *info, void *context)
{
const int saved_errno = errno;
char buffer[128];
char *p = buffer + sizeof buffer;

*(--p) = '\n';
p = prehex(p, STACK_POINTER(MCTX(context)));
*(--p) = ' ';
*(--p) = 'k';
*(--p) = 'c';
*(--p) = 'a';
*(--p) = 't';
*(--p) = 's';
*(--p) = ' ';
*(--p) = ',';
p = prehex(p, PROGRAM_COUNTER(MCTX(context)));
*(--p) = ' ';
*(--p) = '@';
wrerr(p, buffer + sizeof buffer);

if (signum >= 0 && signum < MAX_SIGNALS) {
if (jump_target[signum])
PROGRAM_COUNTER(MCTX(context)) = jump_target[signum];
if (jump_stack[signum])
STACK_POINTER(MCTX(context)) = jump_stack[signum];
}

errno = saved_errno;
}

static int install_jump(const int signum, void *target, size_t stack)
{
struct sigaction act;

if (signum < 0 || signum >= MAX_SIGNALS)
return errno = EINVAL;

jump_target[signum] = (size_t)target;
jump_stack[signum] = (size_t)stack;

memset(&act, 0, sizeof act);
sigemptyset(&act.sa_mask);
act.sa_sigaction = handle_jump;
act.sa_flags = SA_SIGINFO;
if (sigaction(signum, &act, NULL) == -1)
return errno;

return 0;
}

int main(int argc, char *argv[])
{
const struct timespec sec = { .tv_sec = 1, .tv_nsec = 0L };
const int pid = (int)getpid();
ucontext_t ctx;

printf("Run\n");
printf("\tkill -KILL %d\n", pid);
printf("\tkill -TERM %d\n", pid);
printf("\tkill -HUP %d\n", pid);
printf("\tkill -INT %d\n", pid);
printf("or press Ctrl+C to stop this process, or\n");
printf("\tkill -USR1 %d\n", pid);
printf("\tkill -USR2 %d\n", pid);
printf("to send the respective signal to this process.\n");
fflush(stdout);

if (install_done(SIGTERM) ||
install_done(SIGHUP) ||
install_done(SIGINT) ) {
printf("Cannot install signal handlers: %s.\n", strerror(errno));
return EXIT_FAILURE;
}

getcontext(&ctx);

if (install_jump(SIGUSR1, &&usr1_target, STACK_POINTER(MCTX(&ctx))) ||
install_jump(SIGUSR2, &&usr2_target, STACK_POINTER(MCTX(&ctx))) ) {
printf("Cannot install signal handlers: %s.\n", strerror(errno));
return EXIT_FAILURE;
}

/* These are expressions that should evaluate to false, but the compiler
* should not be able to optimize them away. */
if (argv[0][1] == 'A') {
usr1_target:
fputs("USR1\n", stdout);
fflush(stdout);
}

if (argv[0][1] == 'B') {
usr2_target:
fputs("USR2\n", stdout);
fflush(stdout);
}

while (!done) {
putchar('.');
fflush(stdout);
nanosleep(&sec, NULL);
}

fputs("\nAll done.\n", stdout);
fflush(stdout);

return EXIT_SUCCESS;
}

If you save the above as example.c, you can compile it using

gcc -Wall -O2 example.c -o example

and run it

./example

Press Ctrl+C to exit the program. Copy the commands (for sending SIGUSR1 and SIGUSR2 signals), and run them from another window, and you'll see they modify the position for current execution. (The signals cause the program counter/instruction pointer to jump back, into an if clause that should never be executed otherwise.)

There are two sets of signal handlers. handle_done() just sets the done flag. handle_jump() outputs a message to standard error (using low-level I/O), and if specified, updates the program counter (instruction pointer) and stack pointer.

The stack pointer is the tricky part when creating an example program like this. It would be easy if we were satisfied with just crashing the program. However, an example is only useful if it works.

When we arbitrarily change the program counter/instruction pointer, and the interrupt was delivered when in a function call (most C library functions...), the return address is left on the stack. The kernel can deliver the interrupt at any point, so we cannot even assume that the interrupt was delivered when in a function call, either! So, to make sure the test program does not crash, I had to update the program counter/instruction pointer and stack pointer as a pair.

When a jump signal is received, the stack pointer is reset to a value I obtained using getcontext(). This is not guaranteed to be suitable for any jump location; it's just the best I could do for a minimal example. I definitely assume the jump labels are nearby, and not in subscopes where the compiler is likely to mess with the stack, mind you.

It is also important to keep in mind that because we are dealing with details left to the C compiler, we must conform to whatever binary code the compiler produces, not the other way around. For reliable manipulation of a process and its threads, ptrace() is a much better (and honestly, easier) interface. You just set up a parent process, and in the target traced child process, explicitly allow the tracing. I've shown examples here and here (both answers to the same question) on how to start, stop, and single-step individual threads in a target process. The hardest part is understanding the overall scheme, the concepts; the code itself is easier -- and much, much more robust than this signal-handler-context-manipulation way.

For self-introducing register errors (either to program counter/instruction pointer, or to any other register), with the assumption that most of the time that leads to the process crashing, this signal handler context manipulation should be sufficient.


