Can Malloc_Trim() Release Memory from the Middle of the Heap

malloc_trim(0) Releases Fastbins of Thread Arenas?


The documentation states that malloc_trim(0) can only free memory from the top of the main arena heap, so what is going on here?

Call it "outdated" or "incorrect" documentation. Glibc has no documentation of its own for the malloc_trim function; Linux uses man pages from the man-pages project. The man page of malloc_trim, http://man7.org/linux/man-pages/man3/malloc_trim.3.html, was written as a new page in 2012 by the maintainer of man-pages, probably based on comments in the glibc malloc/malloc.c source code http://code.metager.de/source/xref/gnu/glibc/malloc/malloc.c#675:

676  malloc_trim(size_t pad);
677
678 If possible, gives memory back to the system (via negative
679 arguments to sbrk) if there is unused memory at the `high' end of
680 the malloc pool. You can call this after freeing large blocks of
681 memory to potentially reduce the system-level memory requirements
682 of a program. However, it cannot guarantee to reduce memory. Under
683 some allocation patterns, some large free blocks of memory will be
684 locked between two used chunks, so they cannot be given back to
685 the system.
686
687 The `pad' argument to malloc_trim represents the amount of free
688 trailing space to leave untrimmed. If this argument is zero,
689 only the minimum amount of memory to maintain internal data
690 structures will be left (one page or less). Non-zero arguments
691 can be supplied to maintain enough trailing space to service
692 future expected allocations without having to re-obtain memory
693 from the system.
694
695 Malloc_trim returns 1 if it actually released any memory, else 0.
696 On systems that do not support "negative sbrks", it will always
697 return 0.

The actual implementation in glibc is __malloc_trim, and it has code for iterating over arenas:

http://code.metager.de/source/xref/gnu/glibc/malloc/malloc.c#4552

4552 int
4553 __malloc_trim (size_t s)

4560 mstate ar_ptr = &main_arena;
4561 do
4562 {
4563 (void) mutex_lock (&ar_ptr->mutex);
4564 result |= mtrim (ar_ptr, s);
4565 (void) mutex_unlock (&ar_ptr->mutex);
4566
4567 ar_ptr = ar_ptr->next;
4568 }
4569 while (ar_ptr != &main_arena);

Every arena is trimmed using the mtrim() (mTRIm()) function, which calls malloc_consolidate() to convert all free chunks held in fastbins (they are not coalesced at free() time, since fastbins are meant to be fast) into normal free chunks (which are coalesced with adjacent chunks):

4498  /* Ensure initialization/consolidation */
4499 malloc_consolidate (av);

4111 malloc_consolidate is a specialized version of free() that tears
4112 down chunks held in fastbins.

1581 Fastbins
1591 Chunks in fastbins keep their inuse bit set, so they cannot
1592 be consolidated with other free chunks. malloc_consolidate
1593 releases all chunks in fastbins and consolidates them with
1594 other free chunks.

The problem is that when the worker thread is recreated, it creates a new arena/heap instead of reusing the previous one, so the fastbins of the previous arenas/heaps are never reused.

This is strange. By design, the maximum number of arenas in glibc malloc is limited to cpu_core_count * 8 (on 64-bit platforms) or cpu_core_count * 2 (on 32-bit platforms), or by the environment variable MALLOC_ARENA_MAX / the mallopt parameter M_ARENA_MAX.

You can limit the number of arenas for your application; call malloc_trim() periodically; or call malloc() with a "large" size (it will call malloc_consolidate) and then free() it from your threads just before destroying them (there is a sketch after the following excerpt):

3319 _int_malloc (mstate av, size_t bytes)
3368 if ((unsigned long) (nb) <= (unsigned long) (get_max_fast ()))
// fastbin allocation path
3405 if (in_smallbin_range (nb))
// smallbin path; malloc_consolidate may be called
3437 If this is a large request, consolidate fastbins before continuing.
3438 While it might look excessive to kill all fastbins before
3439 even seeing if there is space available, this avoids
3440 fragmentation problems normally associated with fastbins.
3441 Also, in practice, programs tend to have runs of either small or
3442 large requests, but less often mixtures, so consolidation is not
3443 invoked all that often in most programs. And the programs that
3444 it is called frequently in otherwise tend to fragment.
3445 */
3446
3447 else
3448 {
3449 idx = largebin_index (nb);
3450 if (have_fastchunks (av))
3451 malloc_consolidate (av);
3452 }
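
For illustration, here is a minimal sketch (mine, not taken from the question) of the mitigations mentioned above, assuming glibc: capping the arena count with mallopt(M_ARENA_MAX, ...), forcing consolidation with a "large" request, and trimming explicitly. The cap of 2, the 1 MB size and the pad of 0 are arbitrary choices.

#include <malloc.h>   /* glibc-specific: mallopt, malloc_trim, M_ARENA_MAX */
#include <stdlib.h>

int main(void)
{
    /* Cap the number of malloc arenas before any threads are created. */
    mallopt(M_ARENA_MAX, 2);

    /* ... threads allocate and free memory here ... */

    /* A "large" request makes _int_malloc consolidate this arena's fastbins
       (see the excerpt above). */
    void *p = malloc(1024 * 1024);
    free(p);

    /* Consolidate and trim every arena, returning unused pages to the OS. */
    malloc_trim(0);
    return 0;
}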

PS: there is a comment in the man page of malloc_trim https://github.com/mkerrisk/man-pages/commit/a15b0e60b297e29c825b7417582a33e6ca26bf65:

+.SH NOTES
+This function only releases memory in the main arena.
+.\" malloc/malloc.c::mTRIm():
+.\" return result | (av == &main_arena ? sYSTRIm (pad, av) : 0);

Yes, there is a check for main_arena, but it sits at the very end of the malloc_trim implementation mTRIm(), and it only guards the call to sbrk() with a negative offset. Since 2007 (glibc 2.9 and newer) there has been another method of returning memory to the OS: madvise(MADV_DONTNEED), which is used in all arenas (and is documented neither by the author of the glibc patch nor by the author of the man page). Consolidation is performed for every arena. There is also code for trimming (munmapping) the top chunk of mmap-ed heaps (heap_trim/shrink_heap, called from the slow path of free()), but it is not called from malloc_trim.
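
As a quick experiment (a sketch I wrote for illustration, not the original poster's program), you can watch RSS before and after the trim: a worker thread fills and frees many small chunks in its own arena, then the main thread calls malloc_trim(0). The RSS drop shows memory being returned from the thread arena as well. Sizes and counts are arbitrary; build with gcc -pthread.

#include <malloc.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NCHUNKS 100000

static void *worker(void *arg)
{
    (void) arg;
    void **p = malloc(NCHUNKS * sizeof *p);
    for (int i = 0; i < NCHUNKS; i++)
        p[i] = malloc(64);      /* small chunks come from this thread's arena */
    for (int i = 0; i < NCHUNKS; i++)
        free(p[i]);             /* freed into that arena's fastbins/bins */
    free(p);
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    pthread_join(t, NULL);

    puts("thread done - check RSS (grep VmRSS /proc/<pid>/status), then press Enter");
    getchar();
    printf("malloc_trim(0) returned %d\n", malloc_trim(0));
    puts("check RSS again, then press Enter");
    getchar();
    return 0;
}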

Malloc is using 10x the amount of memory necessary

The full details can be a bit complex, so I'll try to simplify things as much as I can. Also, this is a rough outline and may be slightly inaccurate in places.


Requesting memory from the kernel

malloc uses either sbrk or anonymous mmap to request a contiguous memory area from the kernel. Each area will be a multiple of the machine's page size, typically 4096 bytes. Such a memory area is called an arena in malloc terminology. More on that below.

Any pages so mapped become part of the process's virtual address space. However, even though they have been mapped in, they may not [yet] be backed by a physical RAM page. They are mapped [many-to-one] to the single "zero" page in R/O mode.

When the process tries to write to such a page, it incurs a protection fault, the kernel breaks the mapping to the zero page, allocates a real physical page, remaps to it, and the process is restarted at the fault point. This time the write succeeds. This is similar to demand paging to/from the paging disk.

In other words, page mapping in a process's virtual address space is different than page residency in a physical RAM page/slot. More on this later.
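
A small sketch (mine, for illustration; sizes and error handling kept minimal) makes the distinction visible: the mapping exists right after mmap, but VmRSS only grows once the pages are actually written.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

static void print_rss(const char *when)
{
    char line[256];
    FILE *f = fopen("/proc/self/status", "r");
    if (!f)
        return;
    while (fgets(line, sizeof line, f))
        if (strncmp(line, "VmRSS:", 6) == 0)
            printf("%s -> %s", when, line);
    fclose(f);
}

int main(void)
{
    size_t len = 64u << 20;   /* 64 MiB */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    print_rss("after mmap (mapped, not yet resident)");

    for (size_t i = 0; i < len; i += 4096)
        p[i] = 1;             /* fault each page in */

    print_rss("after touching every page");
    munmap(p, len);
    return 0;
}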


RSS (resident set size)

RSS is not really a measure of how much memory a process allocates or frees, but how many pages in its virtual address space have a physical page in RAM at the present time.

If the system has a paging disk of 128GB, but only has (e.g.) 4GB of RAM, a process's RSS can never exceed 4GB. The process's RSS goes up/down based upon paging in or paging out pages in its virtual address space.

So, because of the zero page mapping at start, a process's RSS may be much lower than the amount of virtual memory it has requested from the system. Also, if another process B "steals" a page slot from a given process A, the RSS for A goes down and goes up for B.

The process "working set" is the minimum number of pages the kernel must keep resident for the process to prevent the process from excessively page faulting to get a physical memory page, based on some measure of "excessively". Each OS has its own ideas about this and it's usually a tunable parameter on a system-wide or per-process basis.

If a process allocates a 3GB array, but only accesses the first 10MB of it, it will have a lower working set than if it randomly/scattershot accessed all parts of the array.

That is, if the RSS is higher [or can be higher] than the working set, the process will run well. If the RSS is below the working set, the process will page fault excessively. This can be either because it has poor "locality of reference" or because other events in the system conspire to "steal" the process's page slots.


malloc and arenas

To cut down on fragmentation, malloc uses multiple arenas. Each arena has a "preferred" allocation size (aka "chunk" size). That is, smaller requests like malloc(32) come from (e.g.) arena A, but larger requests like malloc(1024 * 1024) come from a different arena (e.g.) arena B.

This prevents a small allocation from "burning" the first 32 bytes of the last available chunk in arena B, making it too short to satisfy the next malloc(1M).

Of course, we can't have a separate arena for each requested size, so the "preferred" chunk sizes are typically some power of 2.

When creating a new arena for a given chunk size, malloc doesn't just request an area of the chunk size, but some multiple of it. It does this so it can quickly satisfy subsequent requests of the same size without having to do an mmap for each one. Since the minimum size is 4096, arena A will have 4096/32 chunks or 128 chunks available.


free and munmap

When an application does a free(ptr) [ptr represents a chunk], the chunk is marked as available. free could choose to combine contiguous chunks that are free/available at that time or not.

If the chunk is small enough, free does nothing more, i.e. the chunk is available for reallocation, but free does not try to release the chunk back to the kernel. For larger allocations, free will [try to] do munmap immediately.

munmap can unmap a single page [or even a small number of bytes], even if it comes in the middle of an area that was multiple pages long. If so, the application now has a "hole" in the mapping.
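
For the curious, here is a tiny sketch (mine, not malloc internals) of what such a hole looks like at the mmap level: three pages are mapped and only the middle one is unmapped; the pages on either side remain usable.

#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    long pg = sysconf(_SC_PAGESIZE);
    char *p = mmap(NULL, 3 * pg, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    munmap(p + pg, pg);   /* punch a one-page hole in the middle */

    p[0] = 1;             /* first page still usable */
    p[2 * pg] = 1;        /* third page still usable */

    munmap(p, pg);
    munmap(p + 2 * pg, pg);
    return 0;
}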


malloc_trim and madvise

When free does call munmap and an entire page has been unmapped, the RSS of the process (e.g. A) goes down.

But, consider chunks that are still allocated, or chunks that were marked as free/available but were not unmapped.

They are still part of the process A's RSS. If another process (e.g. B) starts doing lots of allocations, the system may have to page out some of process A's slots to the paging disk [reducing A's RSS] to make room for B [whose RSS goes up].

But, if there is no process B to steal A's page slots, process A's RSS can remain high. Say process A allocated 100MB, used it a while back, but is only actively using 1MB now; the RSS will remain at 100MB.

That's because without the "interference" from process B, the kernel had no reason to steal any page slots from A, so they "remain on the books" in the RSS.

To tell the kernel that a memory area is not likely to be used soon, we need the madvise syscall with MADV_DONTNEED. This tells the kernel that the memory area is low priority and it should [more] aggressively page it out to the paging disk, thereby reducing the process's RSS.

The pages remain mapped in the process's virtual address space, but get farmed out to the paging disk. Remember, page mapping is different than page residency.

If the process accesses the page again, it incurs a page fault and the kernel will pull in the data from paging disk to a physical RAM slot and remap. The RSS goes back up. Classical demand paging.

madvise is what malloc_trim uses to reduce the RSS of the process.
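
Here is a stripped-down sketch (mine, not glibc's actual code) of that mechanism: make a region resident, then hand the physical pages back with madvise(MADV_DONTNEED) while keeping the mapping in place.

#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 16u << 20;   /* 16 MiB, arbitrary */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    memset(p, 1, len);              /* pages become resident: RSS goes up */

    madvise(p, len, MADV_DONTNEED); /* residency dropped: RSS goes down,
                                       but the mapping stays in place */

    volatile char c = p[0];         /* touching again faults in a fresh zero page */
    (void) c;

    munmap(p, len);
    return 0;
}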

Why does malloc_trim() only work with the main arena?

Arenas other than the main one are probably allocated from the system using mmap, so sbrk cannot be used to return that memory to the system. It might be possible to make glibc use mremap to shrink these other arenas. Note also that malloc_trim can only return memory at the end of the arena; if there are empty blocks in the middle of the arena, there is no way to release that memory.

Understanding glibc malloc trimming

Largely for historical reasons, memory for small allocations comes from a pool managed with the brk system call. This is a very old system call — at least as old as Version 6 Unix — and the only thing it can do is change the size of an "arena" whose position in memory is fixed. What that means is, the brk pool cannot shrink past a block that is still allocated.

Your program allocates N blocks of memory and then deallocates N-1 of them. The one block it doesn't deallocate is the one located at the highest address. That is the worst-case scenario for brk: the size can't be reduced at all, even though 99.99% of the pool is unused! If you change your program so that the block it doesn't free is array[0] instead of array[NUM_CHUNKS-1], you should see both RSS and address space shrink upon the final call to free.

When you explicitly call malloc_trim, it attempts to work around this limitation using a Linux extension, madvise(MADV_DONTNEED), which releases the physical RAM, but not the address space (as you observed). I don't know why this only happens upon an explicit call to malloc_trim.

Incidentally, the 8MB mmap segment is for your initial allocation of array.
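
To reproduce the scenario described above, a sketch along these lines (my reconstruction, not the original program; NUM_CHUNKS and CHUNK_SIZE are arbitrary) shows the program break staying put until the highest block is finally freed:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NUM_CHUNKS 100000
#define CHUNK_SIZE 1000   /* small enough to come from the brk pool,
                             large enough to avoid the fastbins */

int main(void)
{
    void **array = malloc(NUM_CHUNKS * sizeof *array);  /* large: served by mmap */

    for (int i = 0; i < NUM_CHUNKS; i++)
        array[i] = malloc(CHUNK_SIZE);

    for (int i = 0; i < NUM_CHUNKS - 1; i++)
        free(array[i]);            /* free everything except the highest block */

    printf("break at %p - RSS and address space still high\n", sbrk(0));

    free(array[NUM_CHUNKS - 1]);   /* now the brk pool can shrink from the top */
    free(array);

    printf("break at %p after freeing the last block\n", sbrk(0));
    return 0;
}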

Why is the heap in Go executable?

The heap is no longer executable.

Code was generated at runtime for function literals prior to Go 1.1, thus requiring an executable heap. Function calls were revamped in Go 1.1 to eliminate the need for an executable heap and to provide other benefits.

Why don't memory allocators actively return freed memory to the OS?


Clarification

First, some clarification. You asked: ... my program a.out terminates already, there is no other process that is maintaining this memory cache - or, is there one?

Everything we are talking about is within the lifetime of a single process: the process always returns all allocated memory when it exits. There is no cache that outlives the process [1]. The memory is returned even without any help from the runtime allocator: the OS simply "takes it back" when the process is terminated. So there is no system-wide leak possible from terminated applications with normal allocations.

Now what Valgrind is reporting is memory that is in use at the moment the process terminated, but before the OS cleans everything up. It works at the runtime library level, and not at the OS level. So it's saying "Hey, when the program finished, there were 72,000 bytes that hadn't been returned to the runtime" but an unstated implication is that "these allocations will be cleaned up shortly by the OS".

The Underlying Questions

The code and Valgrind output shown don't really correlate well with the titular question, so let's break them apart. First, we'll just try to answer the questions you asked about allocators - why they exist and why they generally don't immediately return freed memory to the OS - ignoring the example.

You asked:

1) Why keep them in an internal cache? If it is for speed, how is it faster? Yes, the OS needs to maintain a data structure to keep track of memory allocations, but the maintainer of this cache also needs to do so.

This is sort of two questions in one: one is why bother having a userland runtime allocator at all, and then the other one is (perhaps?) why don't these allocators immediately return memory to the OS when it is freed. They are related, but let's tackle them one at a time.

Why Runtime Allocators Exist

Why not just rely on the OS memory allocation routines?

  • Many operating systems, including most Linux and other Unix-like operating systems, simply don't have an OS system call to allocate and free arbitrary blocks of memory. Unix-alikes offer brk, which only grows or shrinks one contiguous block of memory - you have no way to "free" arbitrary earlier allocations. They also offer mmap, which allows you to independently allocate and free chunks of memory, but these allocate at a PAGE_SIZE granularity, which on Linux is 4096 bytes. So if you want to request 32 bytes, you'll have to waste 4096 - 32 == 4064 bytes if you don't have your own allocator (a small sketch of this follows the list). On these operating systems you practically need a separate memory allocation runtime which turns these coarse-grained tools into something capable of efficiently allocating small blocks.

    Windows is a bit different. It has the HeapAlloc call, which is part of the "OS" and does offer malloc-like capabilities of allocating and freeing arbitrarily sized chunks of memory. With some compilers then, malloc is just implemented as a thin wrapper around HeapAlloc (the performance of this call has improved greatly in recent Windows versions, making this feasible). Still, while HeapAlloc is part of the OS it isn't implemented in the kernel - it is also mostly implemented in a user-mode library, managing a list of free and used blocks, with occasional kernel calls to get chunks of memory from the kernel. So it is mostly malloc in another disguise and any memory it is holding on to is also not available to any other processes.

  • Performance! Even if there were appropriate kernel-level calls to allocate arbitrary blocks of memory, the simple overhead of a round trip to the kernel is usually hundreds of nanoseconds or more. A well-tuned malloc allocation or free, on the other hand, is often only a dozen instructions and may complete in 10 ns or less. On top of that, system calls can't "trust their input" and so must carefully validate parameters passed from user space. In the case of free this means that it must check that the user passed a pointer which is valid! Most runtime free implementations simply crash or silently corrupt memory, since there is no responsibility to protect a process from itself.
  • Closer link to the rest of the language runtime. The functions you use to allocate memory in C++, namely new, malloc and friends, are part of and defined by the language. It is then entirely natural to implement them as part of the runtime that implements the rest of the language, rather than in the OS, which is for the most part language-agnostic. For example, the language may have specific alignment requirements for various objects, which can best be handled by language-aware allocators. Changes to the language or compiler might also imply necessary changes to the allocation routines, and it would be a tough call to hope for the kernel to be updated to accommodate your language features!
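
As promised in the first bullet, here is a sketch of the page-granularity point (my example, not from the answer): asking the kernel directly for 32 bytes still costs an entire page, which is exactly the waste a userland allocator avoids by packing many small objects into each page.

#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    long page = sysconf(_SC_PAGESIZE);          /* typically 4096 */

    /* "Allocate" 32 bytes straight from the kernel: the mapping is still
       rounded up to one full page. */
    void *p = mmap(NULL, 32, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    printf("requested 32 bytes, kernel mapped %ld bytes (%ld wasted)\n",
           page, page - 32);

    munmap(p, 32);   /* unmapping likewise works on whole pages */
    return 0;
}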

Why Not Return Memory to the OS

Your example doesn't show it, but if you wrote a different test you would probably find that after allocating and then freeing a bunch of memory, your process's resident set size and/or virtual size as reported by the OS might not decrease after the free. That is, it seems like the process holds on to the memory even though you have freed it. This is in fact true of many malloc implementations. First, note that this is not a leak per se - the unreturned memory is still available to the process that allocated it, even if not to other processes.

Why do they do that? Here are some reasons:

  1. The kernel API makes it hard. For the old-school brk and sbrk system calls, it simply isn't feasible to return freed memory unless it happens to be at the end of the very last block allocated from brk or sbrk. That's because the abstraction offered by these calls is a single large contiguous region that you can only extend from one end. You can't hand back memory from the middle of it. Rather than trying to support the unusual case where all the freed memory happens to be at the end of the brk region, most allocators don't even bother.

    The mmap call is more flexible (and this discussion generally applies also to Windows where VirtualAlloc is the mmap equivalent), allowing you to at least return memory at a page granularity - but even that is hard! You can't return a page until all allocations that are part of that page are freed. Depending on the size and allocation/free pattern of the application that may be common or uncommon. A case where it works well is for large allocations - greater than a page. Here you're guaranteed to be able to free most of the allocation if it was done via mmap and indeed some modern allocators satisfy large allocations directly from mmap and free them back to the OS with munmap. For glibc (and by extension the C++ allocation operators), you can even control this threshold:

    M_MMAP_THRESHOLD
    For allocations greater than or equal to the limit specified
    (in bytes) by M_MMAP_THRESHOLD that can't be satisfied from
    the free list, the memory-allocation functions employ mmap(2)
    instead of increasing the program break using sbrk(2).

    Allocating memory using mmap(2) has the significant advantage
    that the allocated memory blocks can always be independently
    released back to the system. (By contrast, the heap can be
    trimmed only if memory is freed at the top end.) On the other
    hand, there are some disadvantages to the use of mmap(2):
    deallocated space is not placed on the free list for reuse by
    later allocations; memory may be wasted because mmap(2)
    allocations must be page-aligned; and the kernel must perform
    the expensive task of zeroing out memory allocated via
    mmap(2). Balancing these factors leads to a default setting
    of 128*1024 for the M_MMAP_THRESHOLD parameter.

    So by default, allocations of 128K or more will be allocated by the runtime directly from the OS and freed back to the OS on free. So for these allocations you will sometimes see the behavior you might have expected to be the case for every allocation (a small demonstration follows this list).

  2. Performance! Every kernel call is expensive, as described in the list above. Memory that is freed by a process will often be needed shortly afterwards to satisfy another allocation. Rather than trying to return it to the OS, a relatively heavyweight operation, why not just keep it around on a free list to satisfy future allocations? As pointed out in the man page entry, this also avoids the overhead of zeroing out all the memory returned by the kernel. It also gives the best chance of good cache behavior, since the process is continually re-using the same region of the address space. Finally, it avoids the TLB flushes that would be imposed by munmap (and possibly by shrinking via brk).
  3. The "problem" of not returning memory is the worst for long-lived processes that allocate a bunch of memory at some point, free it and then never allocate that much again. I.e., processes whose allocation high-water mark is larger than their long term typical allocation amount. Most processes just don't follow that pattern, however. Processes often free a lot of memory, but allocate at a rate such that their overall memory use is constant or perhaps increasing. Applications that do have the "big then small" live size pattern could perhaps force the issue with malloc_trim.
  4. Virtual memory helps mitigate the issue. So far I've been throwing around terms like "allocated memory" without really defining what it means. If a program allocates and then frees 2 GB of memory and then sits around doing nothing, is it wasting 2 GB of actual DRAM plugged into your motherboard somewhere? Probably not. It is using 2 GB of virtual address space in your process, sure, but virtual address space is per-process, so that doesn't directly take anything away from other processes. If the process actually wrote to the memory at some point, it would be allocated physical memory (yes, DRAM) - after freeing it, you are - by definition - no longer using it. At this point the OS may reclaim those physical pages for use by someone else.

    Now this still requires you to have swap to absorb the dirty not-used pages, but some allocators are smart: they can issue a madvise(..., MADV_DONTNEED) call which tells the OS "this range doesn't have anything useful, you don't have to preserve its contents in swap". It still leaves the virtual address space mapped in the process and usable later (zero-filled), and so it's more efficient than munmap and a subsequent mmap, but it avoids pointlessly writing freed memory regions out to swap. [2]
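
As a small demonstration of the M_MMAP_THRESHOLD point from item 1 above (a sketch of mine; the threshold and sizes are arbitrary), lowering the threshold makes even a modest allocation go through mmap, so freeing it returns the memory to the OS immediately, while the small allocation stays on glibc's free lists:

#include <malloc.h>
#include <stdlib.h>

int main(void)
{
    /* Lower the threshold so even a 64 KiB request is satisfied via mmap. */
    mallopt(M_MMAP_THRESHOLD, 32 * 1024);

    void *big = malloc(64 * 1024);   /* above threshold: backed by mmap    */
    void *small = malloc(1024);      /* below threshold: from the brk pool */

    free(big);    /* munmap'ed immediately - returned to the OS            */
    free(small);  /* kept on a free list for reuse, not returned           */
    return 0;
}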

The Demonstrated Code

As pointed out in this answer your test with vector<int> isn't really testing anything because an empty, unused std::vector<int> v won't even create the vector object as long as you are using some minimal level of optimization. Even without optimization, no allocation is likely to occur because most vector implementations allocate on first insertion, and not in the constructor. Finally, even if you are using some unusual compiler or library that does an allocation, it will be for a handful of bytes, not the ~72,000 bytes Valgrind is reporting.

You should do something like this to actually see the impact of a vector allocation:

#include <vector>

volatile std::vector<int> *sink;

int main() {
    std::vector<int> v(12345678);
    sink = &v;
}

That results in actual allocation and de-allocation. It isn't going to change the Valgrind output, however, since the vector allocation is correctly freed before the program exits, so there is no issue as far as Valgrind is concerned.

At a high level, Valgrind basically categorizes things into "definite leaks" and "not freed at exit". The former occur when the program no longer has a reference to a pointer to memory that it allocated. It cannot free such memory and so has leaked it. Memory which hasn't been freed at exit may be a "leak" - i.e., objects that should have been freed, but it may also simply be memory that the developer knew would live the length of the program and so doesn't need to be explicitly freed (because of order-of-destruction issues for globals, especially when shared libraries are involved, it may be very hard to reliably free memory associated with global or static objects even if you wanted to).


[1] In some cases some deliberately special allocations may outlive the process, such as shared memory and memory mapped files, but that doesn't relate to plain C++ allocations and you can ignore it for the purposes of this discussion.

[2] Recent Linux kernels also have the Linux-specific MADV_FREE, which seems to have similar semantics to MADV_DONTNEED.


