How Are User-Level Threads Scheduled/Created, and How Are Kernel-Level Threads Created

How are user-level threads scheduled/created, and how are kernel-level threads created?

The documentation you're reading is generic [not Linux specific] and a bit outdated. And, more to the point, it uses different terminology. That, I believe, is the source of the confusion. So, read on ...

What it calls a "user-level" thread is what I'm calling an [outdated] LWP thread. What it calls a "kernel-level" thread is what is called a native thread in Linux. Under Linux, what is called a "kernel" thread is something else altogether [see below].

Using pthreads creates threads in userspace, and the kernel is not aware of this and views it as a single process only, unaware of how many threads are inside.

This was how userspace threads were done prior to NPTL (the Native POSIX Thread Library). This is also what SunOS/Solaris called an LWP, a lightweight process.

There was one process that multiplexed itself and created threads. IIRC, it was called the thread master process [or some such]. The kernel was not aware of this. The kernel didn't yet understand or provide support for threads.

But, because these "lightweight" threads were switched by code in the userspace-based thread master (aka "lightweight process scheduler") [just a special user program/process], they were very slow to switch context.

Also, before the advent of "native" threads, you might have 10 processes. Each process gets 10% of the CPU. If one of the processes was an LWP that had 10 threads, these threads had to share that 10% and, thus, got only 1% of the CPU each.

All this was replaced by the "native" threads that the kernel's scheduler is aware of. This changeover was done 10-15 years ago.

Now, with the above example, we have 20 threads/processes that each get 5% of the CPU. And, the context switch is much faster.

It is still possible to have an LWP system on top of native threads, but now that is a design choice rather than a necessity.

Further, LWP works great if each thread "cooperates". That is, each thread's loop periodically makes an explicit call to a "context switch" function, voluntarily relinquishing the process slot so another LWP can run.

However, the pre-NPTL implementation in glibc also had to [forcibly] preempt LWP threads (i.e. implement timeslicing). I can't remember the exact mechanism used, but here's an example: the thread master had to set an alarm, go to sleep, wake up, and then send the active thread a signal. The signal handler would effect the context switch. This was messy, ugly, and somewhat unreliable.
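
To make that concrete, here is a minimal sketch of the timeslicing idea, assuming a hypothetical LWP_switch_next function that picks and switches to the next runnable LWP. It is illustrative only; the real pre-NPTL mechanism differed in detail, and doing a context switch from inside a signal handler is exactly the sort of messy, unreliable trick described above:

#include <signal.h>
#include <string.h>
#include <sys/time.h>

extern void LWP_switch_next(void);      // hypothetical: switch to next runnable LWP

static void
preempt_handler(int signo)
{
    (void) signo;

    // the alarm interrupted whichever LWP was running -- force it to
    // give up the processor, just as a voluntary yield would
    LWP_switch_next();
}

// arm a recurring SIGALRM so no single LWP can hog the process's timeslice
void
LWP_enable_timeslicing(void)
{
    struct sigaction sa;
    struct itimerval it;

    memset(&sa,0,sizeof(sa));
    sa.sa_handler = preempt_handler;
    sigaction(SIGALRM,&sa,NULL);

    // fire every 10 ms [an arbitrary quantum chosen for illustration]
    memset(&it,0,sizeof(it));
    it.it_interval.tv_usec = 10000;
    it.it_value.tv_usec = 10000;
    setitimer(ITIMER_REAL,&it,NULL);
}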

Joachim mentioned that the pthread_create function creates a kernel thread

It is [technically] incorrect to call it a kernel thread. pthread_create creates a native thread. This runs in userspace and vies for timeslices on an equal footing with processes. Once created, there is little difference between a thread and a process.

The primary difference is that a process has its own unique address space. A thread, however, is a process that shares its address space with other processes/threads that are part of the same thread group.
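
A tiny demonstration of that sharing (a minimal sketch with error handling omitted): the new thread's write to a global is visible to its creator because both run in the same address space; a fork()ed child writing to its own copy would not have that effect:

#include <pthread.h>
#include <stdio.h>

static int shared = 0;                  // lives in the one shared address space

static void *
bump(void *arg)
{
    (void) arg;
    shared = 42;                        // visible to the creating thread
    return NULL;
}

int
main(void)
{
    pthread_t tid;

    pthread_create(&tid,NULL,bump,NULL);
    pthread_join(tid,NULL);

    // prints 42: the thread wrote to the very memory we read here
    printf("shared = %d\n",shared);

    return 0;
}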

If it doesn't create a kernel-level thread, then how are kernel threads created from userspace programs?

Kernel threads are not userspace threads, NPTL, native, or otherwise. They are created by the kernel via the kernel_thread function. They run as part of the kernel and are not associated with any userspace program/process/thread. They have full access to the machine: devices, the MMU, etc. Kernel threads run at the highest privilege level: ring 0. They also run in the kernel's address space and not in the address space of any user process/thread.

A userspace program/process may not create a kernel thread. Remember, it creates a native thread using pthread_create, which invokes the clone syscall to do so.

Threads are useful to do things, even for the kernel. So, it runs some of its code in various threads. You can see these threads by doing ps ax. Look and you'll see kthreadd, ksoftirqd, kworker, rcu_sched, rcu_bh, watchdog, migration, etc. [they show up in square brackets, e.g. [kthreadd], because they have no command line]. These are kernel threads and not programs/processes.


UPDATE:

You mentioned that the kernel doesn't know about user threads.

Remember that, as mentioned above, there are two "eras".

(1) Before the kernel got thread support (circa 2004?). This used the thread master (which, here, I'll call the LWP scheduler). The kernel just had the fork syscall.

(2) All kernels after that, which do understand threads. There is no thread master; instead we have pthreads and the clone syscall. Now, fork is implemented in terms of clone. clone is similar to fork but takes some arguments, notably a flags argument and a child_stack argument.
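
As a rough illustration, here is a minimal sketch of calling clone directly with approximately the flags pthread_create asks for (the real NPTL call passes several more, such as CLONE_SETTLS and the TID bookkeeping flags). Note that the caller allocates the child's stack and passes its top, and the crude sleep stands in for proper synchronization, since a CLONE_THREAD child cannot be reaped with wait:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define STACK_SIZE 0x100000

static int
child_fn(void *arg)
{
    printf("hello from the clone'd thread, arg=%s\n",(char *) arg);
    return 0;
}

int
main(void)
{
    // the caller allocates the child's stack; the kernel does not
    char *stack = malloc(STACK_SIZE);

    // stacks grow downward on x86, so pass the _top_ of the area
    clone(child_fn,stack + STACK_SIZE,
        CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | CLONE_THREAD,
        "hi");

    sleep(1);                           // crude: give the child time to run

    return 0;
}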

More on this below ...

then, how is it possible for user-level threads to have individual stacks?

There is nothing "magic" about a processor stack. I'll confine discussion [mostly] to x86, but this is applicable to any architecture, even those that don't have a stack register (e.g. 1970s-era IBM mainframes, such as the IBM System/370).

Under x86, the stack pointer is %rsp. The x86 has push and pop instructions. We use these to save and restore things: push %rcx and [later] pop %rcx.

But, suppose the x86 did not have %rsp or push/pop instructions? Could we still have a stack? Sure, by convention. We [as programmers] agree that (e.g.) %rbx is the stack pointer.

In that case, a "push" of %rcx would be [using AT&T assembler]:

subq    $8,%rbx
movq    %rcx,0(%rbx)

And, a "pop" of %rcx would be:

movq    0(%rbx),%rcx
addq    $8,%rbx

To make it easier, I'm going to switch to C "pseudo code". Here are the above push/pop in pseudo code:

// push %rcx
%rbx -= 8;
0(%rbx) = %rcx;

// pop %rcx
%rcx = 0(%rbx);
%rbx += 8;

To create a thread, the LWP scheduler had to create a stack area using malloc. It then had to save this pointer in a per-thread struct, and then kick off the child LWP. The actual code is a bit tricky; assume we have an (e.g.) LWP_create function that is similar to pthread_create:

typedef void *(*LWP_func)(void *);

// per-thread control
typedef struct tsk tsk_t;
struct tsk {
    tsk_t *tsk_next;                    // next task in list
    tsk_t *tsk_prev;                    // previous task in list
    void *tsk_stack;                    // stack base
    u64 tsk_regsave[16];                // saved registers (RAX, RSP, RIP, ...)
};

// list of tasks
typedef struct tsklist tsklist_t;
struct tsklist {
    tsk_t *tsk_next;                    // first task in list
    tsk_t *tsk_prev;                    // last task in list
};

tsklist_t tsklist; // list of tasks

tsk_t *tskcur; // current thread

// LWP_switch -- switch from one task to another
void
LWP_switch(tsk_t *to)
{

    // NOTE: we use (i.e. burn) register values as we do our work. in a real
    // implementation, we'd have to push/pop these in a special way. so, just
    // pretend that we do that ...

    // save all registers into tskcur->tsk_regsave
    tskcur->tsk_regsave[RAX] = %rax;
    // ...

    tskcur = to;

    // restore most registers from tskcur->tsk_regsave
    %rax = tskcur->tsk_regsave[RAX];
    // ...

    // set stack pointer to new task's stack
    %rsp = tskcur->tsk_regsave[RSP];

    // push the task's resume address onto its stack
    push(%rsp,tskcur->tsk_regsave[RIP]);

    // issue "ret" instruction -- it pops the resume address into %rip,
    // so execution continues in the new task
    ret();
}

// LWP_create -- start a new LWP
tsk_t *
LWP_create(LWP_func start_routine,void *arg)
{
    tsk_t *tsknew;

    // get per-thread struct for new task
    tsknew = calloc(1,sizeof(tsk_t));
    append_to_tsklist(tsknew);

    // get new task's stack -- the stack grows downward on x86, so the
    // initial stack pointer is the _top_ of the allocated area
    tsknew->tsk_stack = malloc(0x100000);
    tsknew->tsk_regsave[RSP] = tsknew->tsk_stack + 0x100000;

    // the new task will "resume" at its start routine
    tsknew->tsk_regsave[RIP] = start_routine;

    // give task its argument
    tsknew->tsk_regsave[RDI] = arg;

    // switch to new task
    LWP_switch(tsknew);

    return tsknew;
}

// LWP_destroy -- destroy an LWP
void
LWP_destroy(tsk_t *tsk)
{

    // free the task's stack
    free(tsk->tsk_stack);

    remove_from_tsklist(tsk);

    // free per-thread struct for dead task
    free(tsk);
}
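
Hypothetical usage of the above: two cooperating LWPs whose bodies follow the "periodically make an explicit call to a context switch function" pattern described earlier (do_some_work and the peer argument are invented for illustration):

// hypothetical body for an LWP created with LWP_create(worker,peer)
void *
worker(void *arg)
{
    tsk_t *peer = arg;                  // the other cooperating LWP

    for (int i = 0; i < 3; i++) {
        do_some_work();                 // hypothetical unit of work
        LWP_switch(peer);               // voluntarily relinquish the CPU
    }

    return NULL;
}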

With a kernel that understands threads, we use pthread_create and clone, but we still have to create the new thread's stack. The kernel does not create/assign a stack for a new thread. The clone syscall accepts a child_stack argument. Thus, pthread_create must allocate a stack for the new thread and pass that to clone:

// pthread_create -- start a new native thread
tsk_t *
pthread_create(LWP_func start_routine,void *arg)
{
    tsk_t *tsknew;

    // get per-thread struct for new task
    tsknew = calloc(1,sizeof(tsk_t));
    append_to_tsklist(tsknew);

    // get new task's stack
    tsknew->tsk_stack = malloc(0x100000);

    // start up thread -- clone expects the _top_ of the stack area,
    // since the stack grows downward on x86
    clone(start_routine,tsknew->tsk_stack + 0x100000,CLONE_THREAD,arg);

    return tsknew;
}

// pthread_join -- wait for a native thread to die and clean up after it
void
pthread_join(tsk_t *tsk)
{

    // wait for thread to die ...

    // free the task's stack
    free(tsk->tsk_stack);

    remove_from_tsklist(tsk);

    // free per-thread struct for dead task
    free(tsk);
}

Only a process or main thread is assigned its initial stack by the kernel, usually at a high memory address. So, if the process does not use threads, normally, it just uses that pre-assigned stack.

But, if a thread is created, either an LWP or a native one, the starting process/thread must pre-allocate the area for the proposed thread with malloc. Side note: Using malloc is the normal way, but the thread creator could just have a large pool of global memory: char stack_area[MAXTASK][0x100000]; if it wished to do it that way.

If we had an ordinary program that does not use threads [of any type], it might wish to "override" the default stack it has been given.

That process could decide to use malloc and the above assembler trickery to create a much larger stack if it were running a hugely recursive function.
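
Here is a heavily simplified sketch of that trick, assuming x86-64, GCC inline assembly, and a non-PIE build. deep_recursion is a made-up name for the hugely recursive function; real code would also have to worry about signal handling and about never touching the old stack while switched over:

#include <stdlib.h>

extern void deep_recursion(void);       // hypothetical hugely recursive function

void
run_on_big_stack(void)
{
    size_t size = 0x10000000;           // 256 MB, far beyond the default stack
    char *stack = malloc(size);
    char *top = stack + size;           // stacks grow downward on x86

    asm volatile(
        "mov %%rsp, %%rbx\n\t"          // save the original stack pointer
        "mov %0, %%rsp\n\t"             // switch to the malloc'ed stack
        "call deep_recursion\n\t"       // recurse on the big stack
        "mov %%rbx, %%rsp"              // switch back before continuing
        : : "r"(top) : "rbx", "memory");

    free(stack);
}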

See my answer here: What is the difference between user defined stack and built in stack in use of memory?

What exactly does it mean to schedule a user-level thread to run on an available LWP?

Just like a process is a container for memory, the LWP (= kernel-level thread) is a container for fibers (= user-level threads, essentially).

The kernel's thread scheduler only sees kernel-level threads (LWPs), and it schedules LWPs on and off the CPUs - namely, an LWP has a time slice where it gets to run on a CPU. The user-level thread library (= fiber scheduler) owns the LWPs of that process, and it decides which fibers get to use the timeslices allocated by the kernel's scheduler to those LWPs.

When a fiber decides to yield the CPU, but the LWP's timeslice is not over yet, the fiber scheduler schedules another fiber to run within the LWP on that CPU. But while that other fiber is running, the LWP's timeslice might run out, and the kernel's scheduler will schedule that LWP off the CPU. The fiber scheduler will not have a say on the matter - the fiber scheduler won't even get to run, because it's in userspace and the kernel is not aware of it.

How does a user-level thread come out of execution?

Funny thing (a real coincidence): I'd been formulating the answer to this in my head on my way home yesterday. For real.

The answer is that a user-level thread has to give control back. Only kernel-level threads can be preempted. This giving of control can happen either explicitly - by calling functions like yield() - or implicitly, by calling any other function which knows how to transfer control. Those would most likely be thread-synchronization functions.
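
A minimal sketch of that explicit hand-back using the old (but illustrative) ucontext API: fiber_yield() is the "give control back" call, and every switch happens inside a single kernel-level thread, so the kernel never sees it:

#include <stdio.h>
#include <ucontext.h>

#define STACK_SIZE 0x10000

static ucontext_t main_ctx, fiber_ctx;

// the explicit "give control back" call described above
static void
fiber_yield(void)
{
    swapcontext(&fiber_ctx,&main_ctx);
}

static void
fiber_body(void)
{
    printf("fiber: step 1\n");
    fiber_yield();                      // voluntary, cooperative yield
    printf("fiber: step 2\n");          // resumes here on the next switch-in
}

int
main(void)
{
    static char stack[STACK_SIZE];      // the fiber's own stack, as above

    getcontext(&fiber_ctx);
    fiber_ctx.uc_stack.ss_sp = stack;
    fiber_ctx.uc_stack.ss_size = sizeof(stack);
    fiber_ctx.uc_link = &main_ctx;      // where to go when the fiber returns
    makecontext(&fiber_ctx,fiber_body,0);

    swapcontext(&main_ctx,&fiber_ctx);  // run fiber until it yields
    printf("main: fiber yielded\n");
    swapcontext(&main_ctx,&fiber_ctx);  // resume fiber until it finishes
    printf("main: fiber done\n");

    return 0;
}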

How does a user-level thread talk with a kernel-level thread?

Kernel threads are like a specialized task responsible for doing a specific operation (not meant to last long). They are not threads waiting for incoming requests from user-land threads. Moreover, a system call does not systematically create a kernel thread (see this post for more information and this one for some context about system calls): a kernel thread is started when a background task is required, for example to deal with IO requests (this post shows a good practical case, though the description is a bit deep). Basic system calls just run in the same user thread, but with higher privileges. Note that kernel functions use a dedicated kernel stack: each user-level thread has two stacks on Linux, one for user-land functions and one for kernel-land functions (for the sake of security).

As a result, in practice, I think all the answers are wrong in the usual cases (i.e. assuming the target operation does not require creating a kernel thread). If the target system calls actually do require creating kernel threads, then b) is the correct answer. Indeed, kernel threads are like a one-shot specialized task, as previously stated. Creating/destroying kernel threads is not very expensive, since internally a thread is basically just a relatively lightweight task_struct data structure.

Is the thread created in C# user level or kernel level?

That is Unix terminology; on Windows you'd say "fiber" or "thread". The term "green thread" is also a pretty common way to say "user thread".

It is not up to C# nor the CLR to decide this; it is the CLR host that determines this.

The host is the glue that marries the CLR to the operating system or the host process. Programs that target Silverlight, .NET Compact, .NET Core, Xbox, Windows Phone, HoloLens, etc. always have a custom host to adapt to the target OS. IIS and SQL Server are common examples of unmanaged programs that have a custom host to allow managed code execution: respectively, ASP.NET and CLR stored procedures. Lots of other programs allow scripting in C# with a custom host; AutoCAD is the canonical example.

So the CLR does not create a thread itself; it asks the host to do it. The ICLRTask and ICLRTaskManager interfaces get that job done. The thread pool is a host duty as well, via the ICorThreadpool interface.

So it is formally unknowable whether you'll get a fiber or a thread. Notably, these interfaces were added at the request of the SQL Server team. They were heavily invested in fibers at the time and wanted the option to execute CLR stored procedures on a fiber. They got it all done, but at roughly the same time the multi-core revolution of the early 2000s upset that apple-cart, and they did not actually ship it. I am not aware of any host that uses fibers, although you can never be sure with custom hosting being common.

So it is pretty safe to assume that you'll get a "kernel thread".

Mapping of user-level and kernel-level threads

These days the terms lightweight process (LWP) and thread are used interchangeably.

although this mapping may be indirect and may use a lightweight process (LWP)

I know the above statement is confusing (notice the two "may"s). The one thing I can think of that the above statement signifies is this:

Earlier, when Linux supported only user-level threads, the kernel was unaware of the fact that there were multiple user-level threads, and the way it handled these multiple threads was by associating all of them with a single lightweight process (which the kernel sees as a single scheduling and execution unit) at the kernel level.

So associating a kernel-level thread with each user-level thread is a kind of direct mapping, and associating a single lightweight process with all the user-level threads is an indirect mapping.

User-level threads, kernel-level threads, and fibers

On the Windows platform, threads in user-mode processes (applications) are user-mode threads, and threads in kernel-mode processes are kernel-mode threads. You cannot create a kernel-mode thread in a user-mode process. On Windows, all threads are scheduled by the kernel, directly or indirectly (via how it configures CPU interrupts).

.NET's thread creation ultimately uses the CreateThread API, which is exported from Kernel32.dll.
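
For completeness, a minimal sketch of the Win32 call that managed thread creation ultimately bottoms out in (error handling omitted):

#include <windows.h>
#include <stdio.h>

static DWORD WINAPI
worker(LPVOID arg)
{
    printf("thread scheduled by the kernel, arg=%s\n",(const char *) arg);
    return 0;
}

int
main(void)
{
    // CreateThread is exported from Kernel32.dll, as noted above
    HANDLE h = CreateThread(NULL,0,worker,"hi",0,NULL);

    WaitForSingleObject(h,INFINITE);    // wait for the thread to finish
    CloseHandle(h);

    return 0;
}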


