How to Decide How Much Stack I Can Use After a Call to pthread_attr_setstacksize

Specifying thread stack size in pthreads

Never use pthread_attr_setstack. It has a lot of fatal flaws, the worst of which is that it is impossible to ever free or reuse the stack after a thread has been created using it. (POSIX explicitly states that any attempt to do so results in undefined behavior.)

POSIX provides a much better function, pthread_attr_setstacksize, which allows you to request the stack size you need but leaves the implementation with the responsibility for allocating and deallocating the stack.

Safe thread stack size?

Reducing the thread stack size will not reduce overhead (not in terms of CPU, memory use or performance). Your only limit in this respect is the total available virtual address space given to threads on your platform.

I would use the default stack size until a platform presents problems (if it ever does), and only then work on minimizing stack usage. Be aware that minimizing stack usage has a real performance cost of its own, as you'll need to move data to the heap or devise some per-thread allocation scheme elsewhere.

Hidden overheads may include:

Allocation of large arrays on the stack, whether through VLAs, alloca(), or plain fixed-size automatic arrays.
Code you don't control, or whose consequences you weren't aware of, such as templates, factory classes, etc. However, given that you did not specify C++, this is less likely to be a problem.
Imported code from libraries, headers, etc. These may change between versions and significantly alter their stack usage, or even their thread usage.
Recursion. This can also arise from the points above: consider things like boost::bind, variadic templates, and crazy macros, and then just general recursion that keeps buffers or large objects on the stack.

In addition to setting the stack size, you can manipulate thread priorities, and suspend and resume threads as required where your platform offers such calls (POSIX itself does not define them), which can help the scheduler and system responsiveness. Pthreads also let you set the contention scope; system scope (threads scheduled by the kernel, as with LWPs) and process scope scheduling vary widely in their performance characteristics.

Here are some useful links:

Improving Performance through Threads
linux pthread_suspend

Why do I need to set bigger stacksize than it actually should be?

Whenever you create a pthread, the pthread library must allocate some stack space for it. That doesn't necessarily allocate physical memory for the stack space, it allocates virtual address space for the stack. The default stack size allocated for a thread is implementation-dependent, but if you are then going to allocate a large array on the stack (which is where automatic storage class variables are placed in virtually all C implementations), you need to adjust the space allocated to ensure it's large enough.

Consider: let's say the implementation (in the pthreads library) has decided to allocate 2MB of stack space, by default, for each thread. Then after creating 3 threads, your virtual memory map might look something like this (exact addresses and other details will of course vary):

8600000-8800000           Thread 3 stack
8300000-8500000           Thread 2 stack
8000000-8200000           Thread 1 stack

7000000-8000000           Main thread stack
[...]                     Other program regions (program code, heap, initialized data, library code/data, etc)

A couple of things to note. Stacks grow downward. The stack pointer starts out right at the top of the allocated region, and as you push things onto the stack by calling a sub-routine or allocating space for local variables, the stack pointer is reduced. The kernel will generally not allocate actual physical pages for your stack immediately. That would be wasteful since you might never use them (and something else would probably have to be evicted from RAM to do that). Instead, page map entries for each page in the region are allocated, but marked empty. Then, as you attempt to write into each page, your program will get a page fault. The kernel handles the fault by allocating a physical page for you, mapping it to the right virtual address and updating the page map entry (then automatically resuming your program without your needing to be aware of any of this).

Note also that the stack regions are not immediately contiguous. That is so that the kernel can distinguish when you have exhausted the virtual address space by going too far. That is what leads to a segmentation violation in your scenario: you've blown off the bottom of the stack and advanced into space for which there are no page map entries allocated.

So, when you use pthread_attr_setstacksize, you're telling the library and the kernel that you know exactly how large to make the stack and configuring the memory map accordingly. But since you have only provided enough space to exactly contain the array, you haven't left any room for the stack frame used to invoke your thread function, or for its other local variables (tid, i, mystacksize), or for any padding or other local stack usage.

So, the original author of this code was essentially saying: "I need to ensure there is room in each thread for my big array and then throw in an additional MEGEXTRA bytes for local variables, the calling stack frame and any other overhead." Again, note that this only allocates virtual address space, so it's not wasteful to do this (virtual address space is not generally a precious resource on a 64-bit architecture). In the actual running of the program, you are likely using only one or two pages of that additional space.

One other thing to note: the first part of the stack size calculation (ARRAY_SIZE*sizeof(double)) equals 4 million. In hex, that is 0x3D0900, which is not a multiple of the page size (usually 4K or 0x1000). The result of using that figure is indeterminate. The kernel might expand that to the next page size boundary (0x3D1000), or it might truncate to the previous boundary (0x3D0000), or (according to the Linux man page) it might return an error.

The POSIX specification (http://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_attr_setstacksize.html) says

The stacksize attribute shall define the minimum stack size (in bytes) allocated for the created threads stack.

and says nothing about non-page-aligned sizes, so arguably, expanding the size to the next page boundary is the only correct behavior. But glibc does not seem to make such an adjustment, and the Linux kernel implementation appears to then truncate the size provided.

In any case, it's a good idea not to cut these things too close. Predicting your actual exact stack usage in a real-life program is difficult at best.

How to properly allocate memory for a pthread stack

What do you mean by "monitor"? If you just want to ensure that not too much space is wasted for the stack (which would prevent you from having lots of threads on a 32-bit system or a system with little RAM+swap), you should simply use the pthread_attr_setstacksize function rather than pthread_attr_setstack. This way you are not responsible for allocating the stack yourself. You could also optionally use pthread_attr_setguardsize to ensure a larger zone of guard pages as protection if you're worried the thread will allocate more than one page at a time on the stack, but be aware that this will consume your virtual address space.

If you really want to measure stack usage, pthread_attr_setstack probably is the right tool, but it's not at all straightforward. I would allocate the memory with mmap, and make it read-only. Then install a SIGSEGV handler to mprotect the faulting page writable, increment a counter, and return. That will give you a count of the number of actual pages the thread touches. And since the signal handler will run in the faulting thread (this is guaranteed since it's a synchronous signal), you can keep the count in a thread-local storage variable to perform the counting on multiple threads.

You might actually need to make the last page or two writable before calling pthread_create, though, since the first write attempts will probably happen from the parent thread, and you probably don't want the signal handler running there if you're trying to store the results in thread-local storage.

To address your specific questions at the end:

You do not want MAP_SHARED. That flag is for memory that will be shared between processes. It probably won't hurt in your case, but it's misleading. Use MAP_PRIVATE.
The memory will not be released, and formally, can never be released. POSIX states quite explicitly that it's undefined behavior to ever reuse or free a stack given to a thread, since you cannot determine the lifetime reliably (even after pthread_join returns, it's conceptually possible that the thread is still executing its last few instructions to exit and thus still touching the stack, and it's possible that it remains stalled like that indefinitely). I believe this is not possible on glibc/NPTL due to the way they use a kernel-generated futex wake event on thread exit to signal pthread_join atomically with the thread exit, but NPTL may cache and reuse stacks that you donated to a thread like this (since you're not allowed to reuse/free them yourself anyway). To be sure, you'd have to check the source. As such, I would recommend NOT using pthread_attr_setstack at all in production code. Use pthread_attr_setstacksize. pthread_attr_setstack should be used only for development-time hacks like what you might be doing now.

How to determine Stack size of a Program in linux?

If you simply want the current stack size, you could declare a variable at the top of main(), take its address, and compare it to the address of a variable declared wherever you define "current" to be. The difference should be the approximate amount by which the stack has grown.

If you want to know how much memory is reserved for the stack, you can check /proc/[pid]/maps, which has a region marked as [stack]. For example, my atd process has:

7fff72a41000-7fff72a56000 rw-p 00000000 00:00 0                          [stack]
0175b000-0177c000 rw-p 00000000 00:00 0                                  [heap]

which gives you an idea.

A neat trick that a friend shared with me when I wanted to know the maximum size of stack that my program used was as follows. I'll present it here in case someone finds it useful :)

1) In a function called near the beginning of main(), use alloca() or a very long array to scribble 0xDEADBEEF or some other such unlikely constant over as much of the stack as you expect could be used. This memory will be "freed" when the small function returns.

2) At the end of main, again use alloca() to grab a region of memory and "search" down through it for whatever magic constant you used to scribble (you might try to find the first block of 64 of them or something to skip over regions of memory that may have been allocated but simply never used), and where that pointer lands indicates your maximum stack usage.

Not perfect, but it was useful for what I was doing!