How to Mmap the Stack for the Clone() System Call on Linux

How to mmap the stack for the clone() system call on Linux?

Joseph, in answer to your last question:

When a user creates a "normal" new process, that's done by fork(). In this case, the kernel doesn't have to worry about creating a new stack at all, because the new process is a complete duplicate of the old one, right down to the stack.

If the user replaces the currently running process using exec(), then the kernel does need to create a new stack - but in this case that's easy, because it gets to start from a blank slate. exec() wipes out the memory space of the process and reinitialises it, so the kernel gets to say "after exec(), the stack always lives HERE".

If, however, we use clone(), then we can say that the new process will share a memory space with the old process (CLONE_VM). In this situation, the kernel can't leave the stack as it was in the calling process (as fork() does), because then our two processes would be stomping on each other's stack. The kernel also can't just put it in a default location (as exec() does), because that location is already taken in this memory space. The only solution is to let the calling process find a place for it, which is why clone() takes a stack argument.

Are the clone system call's arguments stored on the stack or somewhere else?

You forgot to use the CLONE_VM flag:

clone(stack_func, malloc(1024*1024) + (1024*1024), SIGCHLD | CLONE_VM, &a);

CLONE_VM (since Linux 2.0)

If CLONE_VM is set, the calling process and the child process run in the same memory space. In particular, memory writes performed by the calling process or by the child process are also visible in the other process. Moreover, any memory mapping or unmapping performed with mmap(2) or munmap(2) by the child or calling process also affects the other process.

If CLONE_VM is not set, the child process runs in a separate copy of the memory space of the calling process at the time of clone(). Memory writes or file mappings/unmappings performed by one of the processes do not affect the other, as with fork(2).

How can a caller properly use the clone() system call by specifying multiple arguments?

The additional arguments are optional and control extra behaviour of the clone operation. Each one is only used if the corresponding flag is set in the flags argument. If you don't set any of those flags, you don't need to supply the additional arguments.

If you set the CLONE_PARENT_SETTID flag, the child's thread ID will be stored in the location that parent_tid points to in the parent process.

If you set the CLONE_SETTLS flag, the tls argument will be used as the address of the thread-local storage descriptor.

If you set the CLONE_CHILD_SETTID flag, the child's thread ID will be stored in the location that child_tid points to in the child process.

It's done this way for backward compatibility. These arguments weren't in the original clone() system call, but were added in later Linux versions. They're optional so that older code will continue to compile.

Minimal stack size for Linux clone call?

Why doesn't clone segfault between 24 and 583 bytes of stack?

It does, but because it is a separate process, you don't see it. Below 24 bytes, it is not the child that segfaults, but the parent while trying to set up the child. Try using strace -ff to see this happening.

How does child fail silently with too little stack?

When the child dies, the parent is notified. The parent in this case (the one that does the clone() call) doesn't do anything with this notification. It is not "silent" below 24 bytes because then it is the parent that dies, and in that case your shell gets the notification.

What is all that stack space used for?

What is the significance of 24 and 584 bytes? How do they vary on different systems and implementations?

The first 24 (and a bit) are used to set up the function call to child. Because it is a normal function, on completion it will return to the calling function. This means clone has to set up a calling function to return to (one that just cleanly terminates the child).

The 584 (and a bit) apparently is the amount of memory needed for the local variables of the calling function, your function, write and whatever write calls.

The reason I write "(and a bit)" is because there might be a bit of memory before stack that is available and abused by clone or child when running out of room. Try adding a free(stack) after the clone to see the result of that abuse.

Can I calculate a minimum stack requirement? Should I?

In general you should probably not. It requires pretty deep analysis of your functions and the external functions those use. Just like with "normal" programs, I would suggest going for the default (which is 8 MB on Linux, if I recall correctly). Only when you have strict memory requirements (or stack overflow problems) should you start to worry about these things.

Raw clone system call not working correctly

syscall has no special knowledge of clone. This means that when the function tries to return in the newly-created thread, it reads the return address from the switched stack, which is zero. This is more obvious if you write a non-zero bit pattern to the stack and also drop the CLONE_VM, so that the child does not clobber the parent.

Raw Clone system call

I can't say I recommend going with clone() if you can use pthreads. I've had bad experiences with functions such as malloc() in relation to clone().

Have you looked at the man page for documentation?

Here is an example that runs for me. I didn't really examine your code to see why it might be crashing.

#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <linux/sched.h>
#include <stdlib.h>
#include <unistd.h>
#include <assert.h>

// Allow us to round to page size
#define ROUND_UP_TO_MULTIPLE(a,b) \
( ( (a) % (b) == 0) ? (a) : ( (a) + ( (b) - ( (a) % (b) ) ) ) )

struct argsy {
    int threadnum;
};

int fun(void * args) {
    struct argsy * arguments = (struct argsy *) args;
    fprintf(stderr, "hey!, i'm thread %d\n", arguments->threadnum);
    return 0;
}

#define N_THREADS 10
#define PAGESIZE 4096

struct argsy arguments[N_THREADS];

int main() {
    assert(PAGESIZE==getpagesize());

    const int thread_stack_size = 256*PAGESIZE;
    void * base = malloc((((N_THREADS*thread_stack_size+PAGESIZE)/PAGESIZE)*PAGESIZE));
    assert(base);
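    // char * so that the pointer arithmetic below is standard C (arithmetic on void * is a GNU extension)
    char * stack = (char *)ROUND_UP_TO_MULTIPLE((size_t)(base), PAGESIZE);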

    int i = 0;
    for (i = 0; i < N_THREADS; i++) { 
        void * args = &arguments[i];
        arguments[i].threadnum = i;
        clone(&fun, stack+((i+1)*thread_stack_size), 
            CLONE_FILES | CLONE_VM,
            args);
    }

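    // Crude synchronisation: give the children time to run
    sleep(1);

    // Wait not implemented. The flags above carry no SIGCHLD,
    // so waitpid() would need __WCLONE to reap these children.
    return 0;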
}
