Is Accessing Data in the Heap Faster Than from the Stack

Not inherently... on every architecture I've ever worked on, all the process "memory" can be expected to operate at the same set of speeds, based on which level of CPU cache / RAM / swap file holds the current data, and on any hardware-level synchronisation delays that operations on that memory may trigger to make it visible to other processes, or to incorporate changes from other processes and CPU cores.

The OS (which is responsible for page faulting / swapping) and the hardware (the CPU, which traps on accesses to not-yet-accessed or swapped-out pages) do not even track which pages are "global" vs "stack" vs "heap"... a memory page is a memory page.

While the global vs stack vs heap usage to which memory is put is unknown to the OS and hardware, and all are backed by the same type of memory with the same performance characteristics, there are other subtle considerations (described in detail after this list):

  • allocation - time the program spends "allocating" and "deallocating" memory, including occasional sbrk (or similar) virtual address allocation as heap usage grows
  • access - differences in the CPU instructions the program uses to access globals vs stack vs heap, and the extra indirection via a runtime pointer when using heap-based data
  • layout - certain data structures ("containers" / "collections") are more cache-friendly (hence faster), while general-purpose implementations of some require heap allocations and may be less cache-friendly

Allocation and deallocation

For global data (including C++ namespace data members), the virtual address will typically be calculated and hardcoded at compile time (possibly in absolute terms, or as an offset from a segment register; occasionally it may need tweaking as the process is loaded by the OS).

For stack-based data, the stack-pointer-register-relative address can also be calculated and hardcoded at compile time. Then the stack-pointer-register may be adjusted by the total size of function arguments, local variables, return addresses and saved CPU registers as the function is entered and returns (i.e. at runtime). Adding more stack-based variables will just change the total size used to adjust the stack-pointer-register, rather than having an increasingly detrimental effect.
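
As a sketch of what this means in code (a minimal illustration; actual offsets and instructions depend on the compiler, optimisation level and ABI):

    // Sketch: stack locals and the single frame adjustment (illustrative only).
    int f()
    {
        int a = 1;       // e.g. lives at [stack-pointer + 40]
        int b[8] = {};   // e.g. lives at [stack-pointer + 0]
        // On entry the compiler emits one constant adjustment, something like
        // "sub rsp, 48", covering a, b and housekeeping together. Declaring
        // another local just changes that constant - no per-variable runtime work.
        return a + b[7];
    }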

Both of the above are effectively free of runtime allocation/deallocation overhead, while heap based overheads are very real and may be significant for some applications...

For heap-based data, a runtime heap allocation library must consult and update its internal data structures to track which parts of the block(s) aka pool(s) of heap memory it manages are associated with specific pointers the library has provided to the application, until the application frees or deletes the memory. If there is insufficient virtual address space for heap memory, it may need to call an OS function like sbrk to request more memory (Linux may also call mmap to create backing memory for large memory requests, then unmap that memory on free/delete).
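
As a rough illustration of that bookkeeping, a toy first-fit allocator might look like the following (a deliberately simplified sketch - production allocators such as dlmalloc or jemalloc are far more elaborate, and error handling is omitted):

    #include <cstddef>
    #include <unistd.h>  // sbrk (POSIX)

    // Each block the allocator hands out is preceded by a header recording
    // its size and status - these are the "internal data structures" that
    // must be consulted and updated on every allocation and deallocation.
    struct BlockHeader {
        std::size_t  size;   // payload size in bytes
        bool         free;   // is this block available for reuse?
        BlockHeader* next;   // next block in the allocator's internal list
    };

    BlockHeader* head = nullptr;

    void* toy_malloc(std::size_t n) {
        // Search the internal list for a reusable block: this traversal and
        // update is the runtime overhead described above.
        for (BlockHeader* b = head; b; b = b->next)
            if (b->free && b->size >= n) { b->free = false; return b + 1; }
        // No fit found: extend the heap's virtual address space (the sbrk step).
        auto* b = static_cast<BlockHeader*>(sbrk(sizeof(BlockHeader) + n));
        b->size = n; b->free = false; b->next = head; head = b;
        return b + 1;  // payload starts just past the header
    }

    void toy_free(void* p) {
        // Mark the block reusable; real allocators also coalesce neighbours.
        if (p) static_cast<BlockHeader*>(p)[-1].free = true;
    }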

Access

Because the absolute virtual address, or a segment- or stack-pointer-register-relative address can be calculated at compile time for global and stack based data, runtime access is very fast.

With heap hosted data, the program has to access the data via a runtime-determined pointer holding the virtual memory address on the heap, sometimes with an offset from the pointer to a specific data member applied at runtime. That may take a little longer on some architectures.

For the heap access, both the pointer and the heap memory it addresses must be available to the CPU - the pointer loaded into a register, the pointed-to data fetched through the cache hierarchy - before the data is accessible (so there's more demand on CPU caches, and at scale - more cache misses/faulting overheads).
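
A sketch of the difference (the commented instruction sequences are illustrative pseudo-assembly, not what any particular compiler must emit):

    int g;                        // global

    void touch(int* heap_int)     // heap_int assumed to point at heap memory
    {
        int local;

        g = 1;          // typically one store to a hardcoded address:
                        //   mov [g_address], 1
        local = 1;      // one stack-pointer-relative store:
                        //   mov [rsp + 4], 1
        *heap_int = 1;  // two steps - load the pointer, then store through it:
                        //   mov rax, [rsp + 8]   ; fetch the pointer itself
                        //   mov [rax], 1         ; then touch the heap data
    }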

Note: these costs are often insignificant - not even worth a look or second thought unless you're writing something where latency or throughput are enormously important.

Layout

If successive lines of your source code list global variables, they'll be arranged in adjacent memory locations (albeit with possible padding for alignment purposes). The same is true for stack-based variables listed in the same function. This is great: if you have X bytes of data, you might well find that - for N-byte cache lines - they're packed nicely into memory that can be accessed using X/N or X/N + 1 cache lines. It's quite likely that the other nearby stack content - function arguments, return addresses etc. will be needed by your program around the same time, so the caching is very efficient.

When you use heap based memory, successive calls to the heap allocation library can easily return pointers to memory in different cache lines, especially if the allocation size differs a fair bit (e.g. a three byte allocation followed by a 13 byte allocation) or if there's already been a lot of allocation and deallocation (causing "fragmentation"). This means when you go to access a bunch of small heap-allocated memory, at worst you may need to fault in as many cache lines as you have allocations (in addition to needing to load the memory containing your pointers to the heap). The heap-allocated memory won't share cache lines with your stack-allocated data - no synergies there.

Additionally, the C++ Standard Library doesn't provide more complex data structures - like linked lists, balanced binary trees or hash tables - designed for use in stack-based memory. So, when using the stack programmers tend to do what they can with arrays, which are contiguous in memory, even if it means a little brute-force searching. The cache-efficiency may well make this better overall than heap based data containers where the elements are spread across more cache lines. Of course, stack usage doesn't scale to large numbers of elements, and - without at least a backup option of using heap - creates programs that stop working if given more data to process than expected.
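
As an illustration of that trade-off (a sketch; whether the contiguous version actually wins depends on element counts, access patterns and hardware):

    #include <list>

    // Contiguous storage: 64 ints packed into a few adjacent cache lines,
    // so a linear scan touches very little memory.
    int sum_array()
    {
        int data[64] = {};                    // stack-based, contiguous
        int total = 0;
        for (int x : data) total += x;        // cache-friendly sweep
        return total;
    }

    // Node-based storage: each element is a separate heap allocation, often
    // in a different cache line, so the same scan can touch many more lines.
    int sum_list(const std::list<int>& xs)
    {
        int total = 0;
        for (int x : xs) total += x;          // pointer-chasing traversal
        return total;
    }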

Discussion of your example program

In your example you're contrasting a global variable with a function-local (stack/automatic) variable... there's no heap involved. Heap memory comes from new or malloc/realloc. For heap memory, the performance issue worth noting is that the application itself is keeping track of how much memory is in use at which addresses - that bookkeeping takes some time to update as pointers to memory are handed out by new/malloc/realloc, and some more time to update as the pointers are deleted or freed.

For global variables, the allocation of memory may effectively be done at compile time, while for stack-based variables there's normally a stack pointer that's incremented by the compile-time-calculated sum of the sizes of local variables (and some housekeeping data) each time a function is called. So, when main() is called there may be some time to modify the stack pointer, but it's probably just modified by a different amount (rather than left unmodified when there's no buffer and modified when there is), so there's no difference in runtime performance at all.
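
The question's code isn't reproduced here, but the comparison was along these lines (a hypothetical reconstruction):

    // Hypothetical reconstruction of the kind of code being compared:
    int g_buffer[1000];              // global: address fixed at compile/load time

    int main()
    {
        int local_buffer[1000] = {}; // stack: folded into main()'s single
                                     // stack-pointer adjustment on entry
        // Neither variable involves the heap; only new/malloc/realloc would.
        return g_buffer[0] + local_buffer[0];
    }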

Note

I omit some boring and largely irrelevant details above. For example, some CPUs use "windows" of registers to save the state of one function as they enter a call to another function; some function state will be saved in registers rather than on the stack; some function arguments will be passed in registers rather than on the stack; not all operating systems use virtual addressing; and some non-PC-grade hardware may have a more complex memory architecture with different implications...

Which is faster: Stack allocation or Heap allocation

Stack allocation is much faster since all it really does is move the stack pointer.
Using memory pools, you can get comparable performance out of heap allocation, but that comes with slight added complexity and its own headaches.
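
For illustration, a minimal fixed-size pool might look like this (a sketch only: no growth, no thread safety, no error handling):

    #include <cstddef>

    // Minimal fixed-size object pool: allocation and deallocation are each a
    // couple of pointer operations, which is how pools approach stack-like speed.
    template <typename T, std::size_t N>
    class Pool {
        union Slot { Slot* next; alignas(T) unsigned char storage[sizeof(T)]; };
        Slot  slots_[N];
        Slot* free_ = nullptr;
    public:
        Pool() {
            for (std::size_t i = 0; i < N; ++i) {  // thread slots onto a free list
                slots_[i].next = free_;
                free_ = &slots_[i];
            }
        }
        void* allocate() {                         // pop the free list: O(1)
            if (!free_) return nullptr;            // pool exhausted
            Slot* s = free_;
            free_ = s->next;
            return s->storage;
        }
        void deallocate(void* p) {                 // push back onto the list: O(1)
            Slot* s = static_cast<Slot*>(p);
            s->next = free_;
            free_ = s;
        }
    };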

Also, stack vs. heap is not only a performance consideration; it also tells you a lot about the expected lifetime of objects.

Which Is Faster Accessing The Stack Or The Heap?

Although the standard does not say anything about it, on computers with a uniform memory structure there should be no difference. You may see some differences in access time due to CPU caching effects, because the array p allocated on the stack would be closer (in terms of address) to the data of your function's other local variables (unless the optimizer chooses to place those locals in registers), but overall access time should be the same.

The biggest hit that you are going to take with dynamically allocated memory is the cost of the malloc and free calls themselves.
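
An informal micro-benchmark makes that visible (a sketch; absolute numbers vary wildly with the allocator, compiler flags and platform):

    #include <chrono>
    #include <cstdio>
    #include <cstdlib>

    // Times a million malloc/free pairs: the loop's cost is dominated by the
    // allocator's bookkeeping, not by the single write to the memory.
    int main()
    {
        using clock = std::chrono::steady_clock;
        auto start = clock::now();
        for (int i = 0; i < 1000000; ++i) {
            volatile int* p = static_cast<int*>(std::malloc(sizeof(int)));
            *p = i;   // touch it so the allocation isn't optimised away
            std::free(const_cast<int*>(p));
        }
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                      clock::now() - start).count();
        std::printf("1M malloc/free pairs: %lld us\n", static_cast<long long>(us));
    }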

C++ performance of accessing heap vs. stack objects

You probably won't notice any speed differences between stack and heap (dynamic memory), unless the physical memory is different.

The access for an array is direct, regardless of the memory location. You can confirm this by looking at the assembly language generated by the compiler.
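
For example, with GCC or Clang you can dump the assembly and compare (compile with "g++ -O2 -S access.cpp", or paste the code into https://godbolt.org):

    // Both element accesses compile to the same kind of indexed load; only
    // where the base address comes from differs (a constant vs a register).
    int g[100];

    int from_global(int i)                { return g[i]; }
    int from_pointer(const int* a, int i) { return a[i]; }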

There could be a difference if the OS decides to use virtual memory for your arrays. This means that the OS could page chunks of your array out to the hard drive and swap them back in on demand.

In most applications, if there is a physical difference (in terms of speed) between memory types, it will be negligible, in order of nanoseconds. For more computational intense applications (lots of data or need for speed), this could make a difference.

However, there are other issues that make memory access a non-issue such as:

  • Disk I/O
  • Waiting for User Input
  • Memory paging
  • Sharing of the CPU with other applications or threads

All of the above items have an overhead that is usually an order of magnitude more than an access to a memory device.

The main reason for using dynamic memory instead of stack based is size. Stack memory is mainly used for passing arguments and storing return addresses. Local variables that are not declared static will also be placed on the stack. Most programming environments give the stack area a smaller size. Larger items can be placed on the heap or declared as static (and placed in the same area as globals).

Worry more about correctness than memory performance. Profile when in doubt.

Edit 1: Cache misses
A cache miss is when the processor looks in its data cache and doesn't find the item; the processor must then fetch the item from external memory (a.k.a. reloading the cache).

For most applications, cache misses are negligible in performance, usually measured in small nanosecond values. They are not noticeable unless your program is computationally intensive or processing a huge amount of data.

Branch instructions can take up more execution time than a cache miss. Some conditional branch instructions may force the processor to reload the instruction pipeline and the Program Counter register. (Note: some processors can pull loop code into the instruction cache, reducing the branch penalty.)

You can organize your data to reduce the number of cache misses. Search the web for "data-oriented design" or "data optimization". Also try reducing branches by applying algebra and Boolean algebra, and by factoring invariants out of loops.
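
As one illustration of the data-layout idea, compare array-of-structs with struct-of-arrays (a sketch; the names and the 3-float layout are just examples):

    #include <cstddef>

    // Array-of-structs: summing only the x fields still drags y and z through
    // the cache, wasting most of each fetched cache line.
    struct Particle { float x, y, z; };

    float sum_x_aos(const Particle* p, std::size_t n)
    {
        float s = 0;
        for (std::size_t i = 0; i < n; ++i) s += p[i].x;
        return s;
    }

    // Struct-of-arrays: the x values are contiguous, so every byte brought
    // into the cache is useful - the "data oriented" layout.
    struct Particles { float* x; float* y; float* z; };

    float sum_x_soa(const Particles& p, std::size_t n)
    {
        float s = 0;
        for (std::size_t i = 0; i < n; ++i) s += p.x[i];
        return s;
    }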

Is accessing global data faster than accessing local data?

So the question was

whether it is faster to retrieve data that is close to previously touched
area than to retrieve data that is in completely different place

The answer is yes, it is faster.

TL;DR: DRAM has a buffer (a cache, if you please, though it's not really a cache).

The reason lies in how DRAM works.

  • A SIMM is 1 or 2 ranks that consist of multiple DRAM chips (ICs). [diagram: SIMM scheme]
  • Each IC consists of multiple banks (rows of bytes + a row/column decoder + a row buffer). [diagram: bank scheme]
  • If ICs are numbered 0 through K, banks 0 through M, and rows 0 through N, then rows (0, m, n), (1, m, n) ... (K, m, n) constitute a memory page (data of successive addresses).
  • (a common case) If a given SIMM has 8 ICs per rank and a bank has 1024 columns (each one a byte), a memory page (i.e. the amount of memory buffered at once) is 8KB in size.

With that said, if you access an address that is on the same memory page as the last address requested from that same bank, only the column decoder needs to be engaged, which is ~2 times faster than when the address is on a different page. Note: the 2x difference is only within the DRAM itself and does not describe the overall time to get the data to the CPU, which would still be >100 ns.

There are a lot of details that could be added, but I'm not proficient enough to do that.

P.S. This topic is not widely discussed, and all of the above is just a very short overview of what made sense to me after examining some not-very-well-written material.

comparison of access performance of data in heap and stack

There might be (highly system-dependent) issues about cache locality and read/write misses. If you run your program on stack versus heap data, then it is conceivable (depending on your cache architecture) that you run into more cache misses than if you run it entirely on one continuous region of the stack. Here is a paper on this issue by Andrew Appel (from SML/NJ) and Zhong Shao, where they investigate exactly this, because stack/heap allocation is a live topic for the implementation of functional languages:

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.48.3778

They found some performance problems with write misses but estimated these would be resolved by advances in caching.

So my guess for a contemporary desktop/server machine is that, unless you're running heavily optimized, architecture-specific code that streams data along the cache lines, you won't notice any difference between stack and heap accesses. Things might be different on devices with small caches (like ARM/MIPS microcontrollers), where ignoring the cache can have noticeable performance effects anyway.

Why is allocating heap-memory much faster than allocating stack-memory?

new int[1e7] allocates space for 1e7 int values and doesn't initialize them.

vector<int> v(1e7); creates a vector<int> object on the stack, and the constructor for that object allocates space for 1e7 int values on the heap. It also initializes each int value to 0.

The difference in speed is because of the initialization.

To compare the speed of stack allocation you need to allocate an array on the stack:

int data[10000000];   // 1e7, written as an integral constant (a double like 1e7 isn't a valid array bound)

but be warned: there's a good chance that this will fail because the stack isn't large enough for that big an array.
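
To see exactly what each side pays for, the three variants can be written out (a sketch; n stands in for the question's 1e7):

    #include <cstddef>
    #include <vector>

    int main()
    {
        const std::size_t n = 10000000;  // 1e7

        int* raw    = new int[n];        // allocates, does NOT zero the ints
        int* zeroed = new int[n]();      // allocates AND value-initializes to 0
        std::vector<int> v(n);           // allocates AND zero-initializes

        // Comparing "new int[n]" against vector<int>(n) therefore measures
        // allocation alone versus allocation plus ~40MB of zeroing.
        delete[] raw;
        delete[] zeroed;
    }  // v frees its heap storage here automatically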

What is more efficient stack memory or heap?

Stack is usually more efficient speed-wise, and simple to implement!

I tend to agree with Michael from the Joel on Software site, who says,

It is more efficient to use the stack when it is possible.

When you allocate from the heap, the heap manager has to go through what is sometimes a relatively complex procedure to find a free chunk of memory. Sometimes it has to look around a little bit to find something of the right size.

This is not normally a terrible amount of overhead, but it is definitely more complex work compared to how the stack functions. When you use memory from the stack, the compiler is able to immediately claim a chunk of memory from the stack to use. It's fundamentally a simpler procedure.

However, the size of the stack is limited, so you shouldn't use it for very large things. If you need something greater than around 4k or so, then you should grab it from the heap instead.

Another benefit of using the stack is that it is automatically cleaned up when the current function exits; you don't have to worry about cleaning it up yourself. You have to be much more careful with heap allocations to make sure that they are cleaned up. Using smart pointers that automatically delete heap allocations can help a lot with this.

I sort of hate it when I see code that allocates 2 integers from the heap because the programmer needed a pointer to 2 integers and, on seeing a pointer, automatically assumed the heap was needed. I tend to see this with less experienced coders - this is the type of thing you should use the stack for: just declare an array of 2 integers on the stack.

Quoted from a really good discussion at the Joel on Software site:

stack versus heap: more efficiency?


