Is Stack Memory Contiguous Physically in Linux

Is stack memory contiguous physically in Linux?


As far as I can see, stack memory is contiguous in the virtual address space, but is it also contiguous physically? And does this have something to do with the stack size limit?

No, stack memory is not necessarily contiguous in the physical address space, and this is not related to the stack size limit. It's related to how the OS manages memory: the OS allocates a physical page only when the corresponding virtual page is accessed for the first time (or for the first time since it was paged out to disk). This is called demand paging, and it helps conserve memory.
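As a rough illustration (my own sketch, not part of the original answer), the program below, assuming Linux with 4 KB pages, maps a large anonymous region and shows that the resident set size only grows once the pages are actually touched:

    #define _DEFAULT_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/resource.h>

    static long rss_kb(void)
    {
        struct rusage ru;
        getrusage(RUSAGE_SELF, &ru);
        return ru.ru_maxrss;            /* peak resident set size, in KB on Linux */
    }

    int main(void)
    {
        size_t len = 64UL * 1024 * 1024;    /* 64 MB of virtual address space */
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        printf("RSS after mmap:  %ld KB\n", rss_kb());  /* pages not yet backed */

        for (size_t i = 0; i < len; i += 4096)
            p[i] = 1;                   /* first touch: one minor fault per page */

        printf("RSS after touch: %ld KB\n", rss_kb());  /* roughly 64 MB larger */
        munmap(p, len);
        return 0;
    }

The first printf runs before any page has been faulted in, so the RSS is still small; the touch loop then faults each page in one by one, and the second printf reflects roughly 64 MB of newly backed memory.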

Why do we think that stack memory is always quicker than heap memory? If it's not physically contiguous, how can the stack take better advantage of the cache?

It has nothing to do with the cache. It's just faster to allocate and deallocate memory from the stack than from the heap, because allocating or deallocating from the stack takes only a single instruction (incrementing or decrementing the stack pointer), whereas there is a lot more work involved in allocating or deallocating memory from the heap.
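To make this concrete, here is a minimal sketch of the difference (my own illustration; on_stack and on_heap are hypothetical names):

    #include <stdlib.h>

    void on_stack(void)
    {
        char buf[4096];     /* reserved by a single stack-pointer adjustment,
                               e.g. a "sub rsp, 4096"-style instruction in the
                               function prologue */
        buf[0] = 0;
    }                       /* released by restoring the stack pointer on return */

    void on_heap(void)
    {
        char *buf = malloc(4096);   /* library call: free-list/arena bookkeeping,
                                       possibly a brk()/mmap() system call */
        if (buf) {
            buf[0] = 0;
            free(buf);              /* more bookkeeping to return the block */
        }
    }

    int main(void)
    {
        on_stack();
        on_heap();
        return 0;
    }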

Now, once memory is allocated (from the heap or the stack), the time it takes to access that memory region does not depend on whether it's stack or heap memory. It depends on the memory access pattern and whether it's friendly to the cache and memory architecture.

If we want to sort a large amount of numbers, using an array to store the numbers is better than using a list, because every list node may be constructed by malloc, so it may not take good advantage of the cache. That's why I say stack memory is quicker than heap memory.

Using an array is faster not because arrays are allocated from the stack; arrays can be allocated from any memory (stack, heap, or anywhere). It's faster because arrays are usually accessed contiguously, one element at a time. When the first element is accessed, a whole cache line containing that element and some of its neighbors is fetched from memory into the L1 cache, so the other elements in that cache line can then be accessed very efficiently. Accessing the first element of each cache line is still slow, though (unless the line was prefetched).

This is the key part: since cache lines are 64 bytes in size and 64-byte aligned, and both virtual and physical pages are 4 KB in size and 4 KB-aligned (so page boundaries always coincide with cache line boundaries), any cache line is guaranteed to reside fully within a single virtual page and a single physical page. This is what makes fetching cache lines efficient. Again, none of this has anything to do with whether the array was allocated from the stack or the heap; it holds true either way.

On the other hand, since the elements of a linked list are typically not contiguous (not even in the virtual address space), a cache line that contains one element may not contain any other elements, so fetching every single element can be more expensive.
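Here is a minimal sketch of both access patterns (my own illustration, not from the answer above); timing it with a tool like perf stat should show far more cache misses on the list traversal, even though both buffers live on the heap:

    #include <stdio.h>
    #include <stdlib.h>

    struct node { int value; struct node *next; };

    int main(void)
    {
        enum { N = 1000000 };

        int *arr = malloc(N * sizeof *arr);
        struct node *head = NULL;
        for (int i = 0; i < N; i++) {
            arr[i] = i;
            struct node *n = malloc(sizeof *n);  /* each node lands wherever
                                                    the allocator puts it */
            n->value = i;
            n->next = head;
            head = n;
        }

        long sum_a = 0, sum_l = 0;
        for (int i = 0; i < N; i++)
            sum_a += arr[i];            /* sequential: each fetched cache line
                                           serves 16 consecutive ints */
        for (struct node *n = head; n; n = n->next)
            sum_l += n->value;          /* pointer chasing: potentially a new
                                           cache line (and page) per node */

        printf("array sum %ld, list sum %ld\n", sum_a, sum_l);

        free(arr);
        while (head) { struct node *n = head->next; free(head); head = n; }
        return 0;
    }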

Are Arrays Contiguous in *Physical* Memory?

Each page of virtual memory is mapped identically onto a page of physical memory; there is no remapping for units smaller than a page. This is inherent in the principle of paging. Assuming 4 KB pages, the top 20 or 52 bits of a 32- or 64-bit address are looked up in the page tables to identify a physical page, and the low 12 bits are used as an offset into that physical page. So if you have two addresses within the same page of virtual memory (i.e. the virtual addresses differ only in their 12 low bits), then they will be located at the same relative offsets in some single page of physical memory. (Assuming the virtual page is backed by physical memory at all; it could of course be swapped out at any instant.)
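This split can be computed directly (a sketch assuming 4 KB pages; on real 64-bit hardware the usable virtual address width is typically narrower than 64 bits):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        int x;
        uintptr_t va = (uintptr_t)&x;
        uintptr_t vpn    = va >> 12;    /* virtual page number: looked up in
                                           the page tables */
        uintptr_t offset = va & 0xFFF;  /* low 12 bits: reused unchanged as
                                           the offset into the physical page */
        printf("vaddr %#lx = page %#lx + offset %#lx\n",
               (unsigned long)va, (unsigned long)vpn, (unsigned long)offset);
        return 0;
    }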

For different virtual pages, there is no guarantee at all about how they are mapped to physical memory. They could easily be mapped to entirely different locations of physical memory (or of course one or both could be swapped out).

So if you allocate a very large array in virtual memory, there is no need for a sufficiently large contiguous block of physical memory to be available; the OS can simply map those pages of virtual memory to any arbitrary pages in physical memory. (Or more likely, it will initially leave the pages unmapped, then allocate physical memory for them in smaller chunks as you touch the pages and trigger page faults.)
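You can observe this from user space via /proc/self/pagemap (a hedged sketch: the pagemap format is documented and stable, but reading physical frame numbers requires root on modern kernels, and 4 KB pages are assumed). Adjacent virtual pages will often map to scattered physical frames:

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/types.h>

    int main(void)
    {
        const size_t page = 4096;
        char *buf = malloc(8 * page);
        memset(buf, 1, 8 * page);       /* touch the pages so they are present */

        FILE *pm = fopen("/proc/self/pagemap", "rb");
        if (!pm) { perror("pagemap"); return 1; }

        for (int i = 0; i < 8; i++) {
            uintptr_t va = (uintptr_t)(buf + i * page);
            uint64_t entry;
            /* one 64-bit entry per virtual page, indexed by page number */
            fseeko(pm, (off_t)(va / page) * 8, SEEK_SET);
            if (fread(&entry, sizeof entry, 1, pm) != 1) break;
            if (entry & (1ULL << 63))   /* bit 63: page present in RAM */
                printf("vpage %#lx -> pfn %#llx\n", (unsigned long)(va / page),
                       (unsigned long long)(entry & ((1ULL << 55) - 1)));
        }
        fclose(pm);
        free(buf);
        return 0;
    }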

This applies to all parts of a process's virtual memory: static code and data, stack, memory dynamically allocated with malloc/sbrk/mmap etc.

Linux does have support for huge pages, in which case the same logic applies but the pages are larger (typically 2 MB or 1 GB on x86-64; the available sizes are fixed by the hardware).
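For example, a huge page can be requested explicitly with mmap's MAP_HUGETLB flag (a sketch assuming x86-64 with 2 MB huge pages; it fails unless huge pages have been reserved, e.g. via /proc/sys/vm/nr_hugepages):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 2UL * 1024 * 1024;     /* one 2 MB huge page */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) { perror("mmap(MAP_HUGETLB)"); return 1; }
        printf("2 MB huge page mapped at %p\n", p);
        munmap(p, len);
        return 0;
    }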

Other than very specialized applications like hardware DMA, there isn't normally any reason for an application programmer to care about how physical memory is arranged behind the scenes.

Why is contiguous memory allocation required in Linux?

Contiguous memory allocation (CMA) is needed for I/O devices that can only work with contiguous ranges of physical memory. On systems with an I/O memory management unit (IOMMU), this would not be an issue, because a buffer that is contiguous in the device address space can be mapped by the IOMMU to non-contiguous regions of physical memory. Also, some devices can do scatter/gather DMA (i.e., can read/write from/to multiple non-contiguous buffers). Ideally, all I/O devices would be designed either to work behind an IOMMU or to be capable of scatter/gather DMA. Unfortunately, this is not the case, and there are devices that require physically contiguous buffers. There are two ways for a device driver to allocate a contiguous buffer:

  • The device driver can allocate a chunk of physical memory at boot-time. This is reliable because most of the physical memory would be available at boot-time. However, if the I/O device is not used, then the allocated physical memory is just wasted.
  • A chunk of physical memory can be allocated on demand, but it may be difficult to find a contiguous free range of the required size. The advantage, though, is that memory is only allocated when needed.

CMA solves this exact problem by providing the advantages of both of these approaches with none of their downsides. The basic idea is to make it possible to migrate allocated physical pages to create enough space for a contiguous buffer.
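From a driver's perspective, CMA sits behind the regular DMA API. The fragment below (a hedged sketch, not a complete driver; my_probe is a hypothetical callback) shows the kind of allocation that a CMA-enabled kernel can satisfy by migrating pages out of the way:

    #include <linux/device.h>
    #include <linux/dma-mapping.h>
    #include <linux/errno.h>

    static void *cpu_addr;
    static dma_addr_t bus_addr;

    /* hypothetical probe function for a device that needs a 4 MB
     * physically contiguous DMA buffer */
    static int my_probe(struct device *dev)
    {
        cpu_addr = dma_alloc_coherent(dev, 4 * 1024 * 1024, &bus_addr,
                                      GFP_KERNEL);
        if (!cpu_addr)
            return -ENOMEM;
        /* program bus_addr (the device-visible address) into the
         * device's DMA registers here */
        return 0;
    }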

Contiguous physical memory from userspace

No, there is not. You need to do this from kernel space.

If you say "we need to do this from User Space" - without anything going on in kernel-space it makes little sense - because a user space program has no way of controlling or even knowing if the underlying memory is contiguous or not.

The only situation where you would need this is if you were working in conjunction with a piece of hardware, or with some other low-level (i.e., kernel) service that has this requirement. So again, you would have to deal with it at that level.

So the answer isn't just "you can't", but "you should never need to".

I have written memory managers that do allow this, but it was always because of some underlying issue at the kernel level, which had to be addressed at the kernel level, generally because some other agent on the bus (a PCI card, the BIOS, or even another computer over an RDMA interface) had the physically contiguous memory requirement. Again, all of this had to be addressed in kernel space.

As for "cache lines": you don't need to worry. You can be assured that each page of your user-space memory is physically contiguous, and each page is much larger than a cache line (no matter what architecture you're talking about).

malloc does not guarantee returning physically contiguous memory


malloc does not guarantee returning physically contiguous memory

Yes.

It guarantees returning virtually contiguous memory

Yes.

Is it especially true when size > 4 KB, because 4 KB is the size of a page (on Linux systems)?

Memory being contiguous does not imply that it will also be page-aligned. The allocated memory can start at any address in the heap, so whatever page size the OS uses does not affect the nature of malloc's allocations.
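A small sketch of this point (my own illustration): malloc returns a block that is suitably aligned for any object type but not necessarily page-aligned, while posix_memalign can request page alignment explicitly:

    #include <stdio.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        long page = sysconf(_SC_PAGESIZE);

        char *a = malloc(3 * page);     /* contiguous virtually, but may start
                                           at any suitably aligned heap address */
        printf("malloc:         offset within page = %ld\n",
               (long)((uintptr_t)a % page));

        void *b;
        if (posix_memalign(&b, page, 3 * page) == 0) {  /* page-aligned start */
            printf("posix_memalign: offset within page = %ld\n",
                   (long)((uintptr_t)b % page));
            free(b);
        }
        free(a);
        return 0;
    }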

Why is a physically contiguous memory region more efficient than a virtually contiguous one?

For large blocks of physically contiguous memory, the kernel can use huge pages, i.e., far fewer page-table entries. For example, mapping 1 GB with 4 KB pages takes 262,144 page-table entries, while 2 MB huge pages need only 512.


