Where Is the Stack Memory Allocated from for a Linux Process

How is stack memory allocated for threads under the same process in Linux?

The current 'thread' concept in Linux is the NPTL one. NPTL uses clone(), which wraps sys_clone(). Allocating a stack for a new 'thread' is handled in user space (i.e. by libc), not in the kernel (i.e. Linux). A library can allocate a stack using the allocator of its choice (e.g. malloc or mmap) and then call clone(), passing that address as the stack (of course, it needs to pass the top of the allocated region, since stacks grow downwards on most platforms). From the clone(2) man page:

Unlike fork(2), clone() allows the child process to share parts of its execution context with the calling process, such as the memory space, the table of file descriptors, and the table of signal handlers. ...

The main use of clone() is to implement threads: multiple threads of control in a program that run concurrently in a shared memory space.

When the child process is created with clone(), it executes the function fn(arg) ...

The child_stack argument specifies the location of the stack used by the child process ...
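
To make that concrete, here's a minimal sketch of allocating a stack with malloc and handing its top to clone() (not the actual NPTL code; child_fn and STACK_SIZE are names invented for this example):

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

#define STACK_SIZE (1024 * 1024)

static int child_fn(void *arg)
{
    printf("child running on a malloc'd stack\n");
    return 0;
}

int main(void)
{
    char *stack = malloc(STACK_SIZE);
    if (!stack)
        return 1;

    // Pass the *top* of the region: the stack grows downwards on x86.
    pid_t pid = clone(child_fn, stack + STACK_SIZE, CLONE_VM | SIGCHLD, NULL);
    if (pid == -1)
        return 1;

    waitpid(pid, NULL, 0);  // reap the child before freeing its stack
    free(stack);
    return 0;
}

glibc's clone() wrapper takes care of switching the child onto the new stack before calling child_fn.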

If you want to learn more specific details, open the source of your distro's pthread_create implementation and start reading.

For example, pthread_create.c:

int
__pthread_create_2_1 (newthread, attr, start_routine, arg)
...
struct pthread *pd = NULL;
int err = ALLOCATE_STACK (iattr, &pd);
...

and allocatestack.c:

 # define ALLOCATE_STACK(attr, pd) allocate_stack (attr, pd, &stackaddr)

static int
allocate_stack (const struct pthread_attr *attr, struct pthread **pdp,
ALLOCATE_STACK_PARMS)
...

You'll see that stack allocation has some bells and whistles, like caching and reuse of stack regions, and guard pages, but in the end it's just a memory region allocated in user space.
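
You can see the same user-space contract from the other side: pthread_create will happily run a thread on a region you supply yourself. A minimal sketch (the 1 MiB size is an arbitrary choice for this example):

// compile with: gcc stack_demo.c -pthread
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <sys/mman.h>

static void *thread_fn(void *arg)
{
    int local;
    // This local lives inside the mmap'd region we passed in below.
    printf("local at %p\n", (void *)&local);
    return NULL;
}

int main(void)
{
    size_t size = 1024 * 1024;
    void *stack = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (stack == MAP_FAILED)
        return 1;

    pthread_attr_t attr;
    pthread_attr_init(&attr);
    // pthread_attr_setstack takes the *lowest* address of the region.
    pthread_attr_setstack(&attr, stack, size);

    pthread_t t;
    pthread_create(&t, &attr, thread_fn, NULL);
    pthread_join(t, NULL);
    return 0;
}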

Stack memory inside the Linux kernel

As far as I understand it, stack memory allocation is normally dealt with by a form of libc …

This program illustrates that nothing from libc is used for stack memory allocation:

// compile with: gcc -nostdlib nolibc.c -o nolibc
void _start(void)
{
    int a[9999];  // stack space "allocated" just by moving the stack pointer
    a[0] = 0;
    // _exit(a[0]) via the legacy int 0x80 syscall interface (eax=1 is exit)
    asm volatile("mov %0, %%ebx\n\t"
                 "mov $1, %%eax\n\t"
                 "int $0x80"
                 : : "r" (a[0]) : "eax", "ebx");
}

If the allocated stack space is exceeded, a fault occurs, and the kernel's fault handler can allocate more space, unless a limit has been reached or a fixed stack size is in use.
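
The limit in question is RLIMIT_STACK, the same value ulimit -s reports; you can query it programmatically:

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_STACK, &rl) != 0)
        return 1;
    // Typically 8 MiB; rl.rlim_cur is RLIM_INFINITY after "ulimit -s unlimited".
    printf("stack soft limit: %llu bytes\n", (unsigned long long)rl.rlim_cur);
    return 0;
}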

Is stack memory contiguous physically in Linux?

As far as I can see, stack memory is contiguous in the virtual address space, but is stack memory also contiguous physically? And does this have something to do with the stack size limit?

No, stack memory is not necessarily contiguous in the physical address space, and this is not related to the stack size limit; it's related to how the OS manages memory. The OS only allocates a physical page when the corresponding virtual page is accessed for the first time (or for the first time since it was paged out to disk). This is called demand paging, and it helps conserve physical memory.
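
You can watch demand paging happen with a sketch like this (ru_maxrss is in kilobytes on Linux; exact numbers will vary by system):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>

static long max_rss_kb(void)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_maxrss;  // peak resident set size, in kB on Linux
}

int main(void)
{
    size_t len = 64 * 1024 * 1024;
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;
    printf("after mmap:  ~%ld kB resident\n", max_rss_kb());
    memset(p, 1, len);  // the first write to each page faults it in
    printf("after touch: ~%ld kB resident\n", max_rss_kb());
    return 0;
}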

Why do we think that stack memory is always quicker than heap memory? If it's not physically contiguous, how can the stack take better advantage of the cache?

It has nothing to do with the cache. It's just faster to allocate and deallocate memory from the stack than from the heap, because allocating and deallocating on the stack takes only a single instruction (incrementing or decrementing the stack pointer), while allocating or deallocating from the heap involves a lot more work (searching free lists, splitting and coalescing blocks, occasionally asking the kernel for more memory).
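
The asymmetry looks roughly like this (a sketch; the exact instructions depend on the compiler and optimization level):

#include <stdlib.h>

void on_stack(void)
{
    volatile char buf[4096];  // typically a single stack-pointer adjustment
    buf[0] = 1;
}

void on_heap(void)
{
    char *buf = malloc(4096);  // free-list search, bookkeeping, maybe a syscall
    if (buf)
        buf[0] = 1;
    free(buf);                 // more bookkeeping on the way out
}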

Once memory is allocated (whether from the heap or the stack), the time it takes to access it does not depend on whether it's stack or heap memory. It depends on the memory access pattern and whether it's friendly to the cache and memory architecture.

if we want to sort a large amount of numbers, using an array to store the numbers is better than using a list, because every list node may be constructed by malloc, so it may not take good advantage of the cache; that's why I say stack memory is quicker than heap memory.

Using an array is faster, but not because arrays are allocated from the stack; arrays can be allocated from any memory (stack, heap, or anywhere). It's faster because arrays are usually accessed contiguously, one element at a time. When the first element is accessed, a whole cache line containing it and its neighbouring elements is fetched from memory into the L1 cache, so accessing the other elements in that cache line is very efficient; accessing the first element in the cache line is still slow (unless the line was prefetched). Here's the key part: since cache lines are 64 bytes and 64-byte aligned, and both virtual and physical pages are 4 KiB-aligned (a multiple of the cache-line size), any cache line is guaranteed to reside entirely within a single virtual page and a single physical page. This is what makes fetching cache lines efficient. Again, all of this has nothing to do with whether the array was allocated from the stack or the heap; it holds true either way.

On the other hand, since the elements of a linked list are typically not contiguous (not even in the virtual address space), a cache line that contains one element may not contain any other elements, so fetching every single element can be more expensive.
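
In code, the two access patterns look like this (a sketch; struct node is defined just for the example):

#include <stddef.h>

struct node { long value; struct node *next; };

long sum_array(const long *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];  // sequential: one 64-byte cache line serves 8 longs
    return s;
}

long sum_list(const struct node *head)
{
    long s = 0;
    for (; head != NULL; head = head->next)
        s += head->value;  // each node may sit on a different cache line
    return s;
}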

How is Stack memory allocated when using 'push' or 'sub' x86 instructions?

Yes, you're on the right track here, pretty much. sub rsp, X is a kind of "lazy" allocation: the kernel only does anything after a #PF page-fault exception from touching memory above the new RSP, not when you merely modify the register. But you can still consider the memory "allocated", i.e. safe to use.

So, trying to read that segment before writing to it may cause an error.

No, a read won't cause an error. Anonymous pages that have never been written are copy-on-write mapped to the shared physical zero page, whether they're in the BSS, the stack, or mmap(MAP_ANONYMOUS).

Fun fact: in micro-benchmarks, make sure you write to each page of your input arrays, otherwise you're actually looping over the same physical 4k or 2M page of zeros repeatedly and will get L1D cache hits even though you still get TLB misses (and soft page faults)! gcc will optimize malloc+memset(0) into calloc, but std::vector will actually write all the memory whether you want it to or not. memset on global arrays is not optimized away, so that works. (Or use non-zero-initialized arrays; those will be file-backed in the data segment.)
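
In practice that means a warm-up pass like this before the timed loop (a sketch; the size and fill value are arbitrary):

#include <stdlib.h>
#include <string.h>

int main(void)
{
    size_t n = 64 * 1024 * 1024;
    char *buf = malloc(n);
    if (!buf)
        return 1;
    // Non-zero fill: gcc can fold malloc+memset(0) into calloc, which
    // would leave every page backed by the shared zero page again.
    memset(buf, 1, n);
    // ... run the timed loop over buf here ...
    free(buf);
    return 0;
}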


Note: I'm leaving out the difference between mapped and wired, i.e. whether an access will trigger a soft/minor page fault to update the page tables, or whether it's just a TLB miss and the hardware page-table walk will find a mapping (to the zero page).

But stack memory below RSP may not be mapped at all, so touching it without moving RSP first can be an invalid page fault instead of a "minor" page fault to sort out copy-on-write.


Stack memory has an interesting twist: the stack size limit is something like 8 MiB (ulimit -s), but in Linux the initial stack for the first thread of a process is special. For example, I set a breakpoint on _start in a hello-world (dynamically linked) executable and looked at its /proc/<PID>/smaps:

7ffffffde000-7ffffffff000 rw-p 00000000 00:00 0                          [stack]
Size: 132 kB
Rss: 8 kB
Pss: 8 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 8 kB
Referenced: 8 kB
Anonymous: 8 kB
...

Only 8 KiB of the stack has been referenced and is backed by physical pages. That's expected, since the dynamic linker doesn't use a lot of stack.

Only 132 KiB of the stack is even mapped into the process's virtual address space. But special magic stops mmap(NULL, ...) from randomly choosing pages within the 8 MiB of virtual address space that the stack could grow into.

Touching memory below the current stack mapping but within the stack limit causes the kernel to grow the stack mapping (in the page-fault handler).

(But only if RSP is adjusted first: the red zone is only 128 bytes below RSP, so ulimit -s unlimited doesn't make touching memory 1 GiB below RSP grow the stack to there, but it will if you decrement RSP to there and then touch memory.)
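
A sketch of the difference (assuming the default 8 MiB limit; whether the second variant actually faults depends on what happens to be mapped there):

void grows_the_stack(void)
{
    volatile char big[4 * 1024 * 1024];  // compiler moves RSP down first
    big[0] = 1;  // fault address is above the new RSP: the kernel grows the stack
}

void probably_segfaults(void)
{
    char top;
    volatile char *p = &top - 4 * 1024 * 1024;
    *p = 1;  // far below RSP, outside the red zone: an invalid page fault
}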

This only applies to the initial/main thread's stack. pthreads just uses mmap(MAP_ANONYMOUS|MAP_STACK) to map an 8 MiB chunk that can't grow (MAP_STACK is currently a no-op). So thread stacks can't grow after allocation (except manually with MAP_FIXED if there's space below them), and they aren't affected by ulimit -s unlimited.
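
Roughly what that looks like (a simplified sketch of the idea, not glibc's actual allocate_stack code; the 4 KiB guard size is hard-coded for brevity):

#define _GNU_SOURCE
#include <sys/mman.h>

#define THREAD_STACK_SIZE (8 * 1024 * 1024)

void *alloc_thread_stack(void)
{
    void *mem = mmap(NULL, THREAD_STACK_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (mem == MAP_FAILED)
        return NULL;
    // Guard page at the low end: an overflow hits PROT_NONE and faults.
    mprotect(mem, 4096, PROT_NONE);
    return mem;  // the usable stack top is mem + THREAD_STACK_SIZE
}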


This magic that stops other things from choosing addresses in the stack-growth region doesn't exist for mmap(MAP_GROWSDOWN), so do not use it to allocate new thread stacks. (Otherwise you could end up with something using up the virtual address space below the new stack, leaving it unable to grow.) Just allocate the full 8 MiB. See also Where are the stacks for the other threads located in a process virtual address space?

MAP_GROWSDOWN does have a grow-on-demand feature, described in the mmap(2) man page, but there's no growth limit (other than coming close to an existing mapping), so (according to the man page) it's based on a guard-page like Windows uses, not like the primary thread's stack.

Touching memory multiple pages below the bottom of a MAP_GROWSDOWN region might segfault (unlike with Linux's primary-thread stack). Compilers targeting Linux don't generate stack "probes" to make sure each 4k page is touched in order after a big allocation (e.g. local array or alloca), so that's another reason MAP_GROWSDOWN isn't safe for stacks.

Compilers do emit stack probes on Windows.
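
Conceptually, a probe sequence touches each newly allocated page in order, so a guard page is always hit rather than skipped past. A C rendering of the idea (real probes, like Windows' __chkstk, are emitted inline in assembly by the compiler, not called as a function like this):

static void probe_pages(volatile char *new_sp, volatile char *old_sp)
{
    // Touch one byte per 4 KiB page, top-down, from old to new stack pointer.
    for (volatile char *p = old_sp - 4096; p >= new_sp; p -= 4096)
        *p = 0;
}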

(MAP_GROWSDOWN might not even work at all, see @BeeOnRope's comment. It was never very safe to use for anything, because stack clash security vulnerabilities were possible if the mapping grows close to something else. So just don't use MAP_GROWSDOWN for anything ever. I'm leaving in the mention to describe the guard-page mechanism Windows uses, because it's interesting to know that Linux's primary-thread stack design isn't the only one possible.)

Allocating memory in OS

The Linux kernel does have its own fixed-size stack for each thread; you can see it in the source code defining thread_union.

When a syscall is performed on x86/x64, the kernel swaps the GS segment base (and, with page-table isolation, CR3) to refer to its own memory, and then uses the GS segment register to locate cpu_current_top_of_stack, which points to the kernel stack for the calling thread. You can see this in entry_SYSCALL_64.

When the kernel needs to allocate dynamic memory, it uses kmalloc, which is backed by the slab allocator (SLUB by default).


