Linux Allocator Does Not Release Small Chunks of Memory

Huge std::vector std::vector does not release all memory on destruction

Small allocations (up to 128kb by default, I think) are managed by an in-process heap, and are not returned to the OS when they're deallocated; they're returned to the heap for reuse within the process. Larger allocations come directly from the OS (by calling mmap), and are returned to the OS when deallocated.

In your first example, each vector only needs to allocate enough space for a single int. You have a hundred million small allocations, none of which will be returned to the OS.

In the second example, as the vector grows, it will make many allocations of various sizes. Some are smaller than the mmap threshold, these will remain in the process memory; but, since you only do this to one vector, that won't be a huge amount. If you were to use resize or reserve to allocate all the memory for each vector before populating it, then you should find that all the memory is returned to the OS.

Why does the Linux kernel require small short-term memory chunks in odd sizes？

Because most programs require small allocations, for relatively short periods, in a variety of sizes? That's why malloc and friends exist: To subdivide the larger allocations from the OS into smaller pieces with sub-page-size granularity. Want a linked list (commonly needed in OS kernels)? You need to be able to allocate small nodes that contain the value and a pointer to the next node (and possibly a reverse pointer too).

I suspect by "odd sizes" they just mean "arbitrary sizes"; I don't expect the kernel to be unusually heavy on 1, 3, 5, 7, etc. byte allocations, but the allocation sizes are, in many cases, not likely to be consistent enough that a fixed block allocator is broadly applicable. Writing a special block allocator for each possible linked list node size (let alone every other possible size needed for dynamically allocated memory) isn't worth it unless that linked list is absolutely performance critical after all.

Why does allocating large chunks of memory fail when reallocing small chunks doesn't

My guess is, that the memory size of your system is less than the 100 GiB that you are trying to allocate. While Linux does overcommit memory, it still bails out of requests that are way beyond what it can fulfill. That is why the second example fails.

The many small increments of the first example, on the other hand, are way below that threshold. So each one of them succeeds as the kernel knows that you didn't require any of the prior memory yet, so it has no indication that it won't be able to back those 100 additional MiB.

I believe that the threshold for when a memory request from a process fails is relative to the available RAM, and that it can be adjusted (though I don't remember how exactly).

Linux Memory Usage in top when using std::vector versus std::list

The freeing of memory is done on a "chunk" basis. It's quite possible that when you use list, the memory gets fragmented into little tiny bits.

When you allocate using a vector, all elements are stored in one big chunk, so it's easy for the memory freeing code to say "Golly, i've got a very large free region here, I'm going to release it back to the OS". It's also entirely possible that when growing the vector, the memory allocator goes into "large chunk mode", which uses a different allocation method than "small chunk mode" - say for example you allocate more than 1MB, the memory allocation code may see that as a good time to start using a different strategy, and just ask the OS for a "perfect fit" piece of memory. This large block is very easy to release back to he OS when it's being freed.

On the ohter hand if you are adding to a list, you are constantly asking for little bits, so the allocator uses a different strategy of asking for large block and then giving out small portions. It's both difficult and time-consuming to ensure that ALL blocks within a chunk have been freed, so the allocator may well "not bother" - because chances are that there are some regions in there "still in use", and then it can't be freed at all anyways.

I would also add that using "top" as a memory measure isn't a particularly accurate method, and is very unreliable, as it very much depends on what the OS and the runtime library does. Memory belonging to a process may not be "resident", but the process still hasn't freed it - it's just not "present in actual memory" (out in the swap partition instead!)

And to your question "is this defined somewhere", I think it is in the sense that the C/C++ library source cod defines it. But it's not defined in the sense that somewhere it's written that "This is how it's meant to work, and we promise never to hange it". The libraries supplied as glibc and libstdc++ are not going to say that, they will change the internals of malloc, free, new and delete as new technologies and ideas are invented - some may make things better, others may make it worse, for a given scenario.

As has been pointed out in the comments, the memory is not locked to the process. If the kernel feels that the memory is better used for something else [and the kernel is omnipotent here], then it will "steal" the memory from one running process and give it to another. Particularly memory that hasn't been "touched" for a long time.

What does it mean that mmap doesn't work 'when a large chunk becomes “locked” in between smaller ones'?

Consider a long-running logical function of a program that allocates lots and lots of memory. And at the same time that logical function is running, other bits of code allocate very small chunks of memory.

Later on, that long-running function may finish and the massive amount of memory it allocated is all freed. But lots of small bits allocated by other logical functions of the program are still holding memory.

You may have a case where the program requested a huge amount of memory from the operating system and almost all of that memory is free, but nevertheless, no memory can be returned to the operating system by the program's allocator because every chunk it allocated from the system still has a tiny piece in use. This is called "memory fragmentation".

What algorithm to apply for continuious reallocation of small memory chunks?

The best algorithm in this case would be free list allocator + binary search tree.

You are asking one big memory chunk from system, and then allocating memory blocks of fixed size from this chunk. When chunk is full you are allocating another one, and link them into red-black or AVL binary search interval tree (otherwise searching for a chunk during free by iterating over the chunks list becomes a bottleneck)

And in the same time, multi-threading and thread synchronization becomes important. If you will simply use mutexes or trivial spin-locks, you will found you libc malloc works much faster than your custom memory allocator. Solution can be found in Hazard pointer, i.e. each thread have it's own memory allocator (or one allocator per CPU core). This will add another problem - when one thread allocates memory and another releasing it, this will require searching for the exact allocator during free function, and strick locking or lock-free data structures.

The best option - you can use jemalloc, tcmalloc or any another generic propose fast memory allocator to replace your libc (i.e. pdmalloc) default allocator completely or partially.

Does calling free or delete ever release memory back to the system

There isn't much overhead for malloc, so you are unlikely to achieve any run-time savings. There is, however, a good reason to implement an allocator on top of malloc, and that is to be able to trace memory leaks. For example, you can free all memory allocated by the program when it exits, and then check to see if your memory allocator calls balance (i.e. same number of calls to allocate/deallocate).

For your specific implementation, there is no reason to free() since the malloc won't release to system memory and so it will only release memory back to your own allocator.

Another reason for using a custom allocator is that you may be allocating many objects of the same size (i.e you have some data structure that you are allocating a lot). You may want to maintain a separate free list for this type of object, and free/allocate only from this special list. The advantage of this is that it will avoid memory fragmentation.

Why is memory not reusable after allocating/deallocating a number of small objects?

Sorry for posting yet another answer since I am unable to comment; I believe many of the others are quite close to the answer really :-)

Anyway, the culprit is most likely address space fragmentation. I gather you are using Visual C++ on Windows.

The C / C++ runtime memory allocator (invoked by malloc or new) uses the Windows heap to allocate memory. The Windows heap manager has an optimization in which it will hold on to blocks under a certain size limit, in order to be able to reuse them if the application requests a block of similar size later. For larger blocks (I can't remember the exact value, but I guess it's around a megabyte) it will use VirtualAlloc outright.

Other long-running 32-bit applications with a pattern of many small allocations have this problem too; the one that made me aware of the issue is MATLAB - I was using the 'cell array' feature to basically allocate millions of 300-400 byte blocks, causing exactly this issue of address space fragmentation even after freeing them.

A workaround is to use the Windows heap functions (HeapCreate() etc.) to create a private heap, allocate your memory through that (passing a custom C++ allocator to your container classes as needed), and then destroy that heap when you want the memory back - This also has the happy side-effect of being very fast vs delete()ing a zillion blocks in a loop..

Re. "what is remaining in memory" to cause the issue in the first place: Nothing is remaining 'in memory' per se, it's more a case of the freed blocks being marked as free but not coalesced. The heap manager has a table/map of the address space, and it won't allow you to allocate anything which would force it to consolidate the free space into one contiguous block (presumably a performance heuristic).

Why don't memory allocators actively return freed memory to the OS?

Clarification

First, some clarification. You asked: ... my program a.out terminates already, there is no other process that is maintaining this memory cache - or, is there one?

Everything we are talking about is within the lifetime of a single process: the process always returns all allocated memory when it exits. There is no cache that outlives the process¹. The memory is returned even without any help from the runtime allocator: the OS simply "takes it back" when the process is terminated. So there is no system-wide leak possible from terminated applications with normal allocations.

Now what Valgrind is reporting is memory that is in use at the moment the process terminated, but before the OS cleans everything up. It works at the runtime library level, and not at the OS level. So it's saying "Hey, when the program finished, there were 72,000 bytes that hadn't been returned to the runtime" but an unstated implication is that "these allocations will be cleaned up shortly by the OS".

The Underlying Questions

The code and Valgrind output shown doesn't really correlate well with the titular question, so let's break them apart. First we'll just try to answer the questions you asked about allocators: why they exist and why they don't generally don't immediately return freed memory to the OS, ignoring the example.

You asked:

1) Why keep them in an internal cache? If it is for speed, how is it
faster? Yes, the OS needs to maintain a data structure to keep track
of memory allocation, but this the maintainer of this cache also needs
to do so.

This is sort of two questions in one: one is why bother having a userland runtime allocator at all, and then the other one is (perhaps?) why don't these allocators immediately return memory to the OS when it is freed. They are related, but let's tackle them one at a time.

Why Runtime Allocators Exist

Why not just rely on the OS memory allocation routines?

Many operating systems, including most Linux and other Unix-like operating systems, simply don't have an OS system call to allocate and free arbitrary blocks of memory. Unix-alikes offer brk which only grows or shrinks one contiguous block of memory - you have no way to "free" arbitrary earlier allocations. They also offer mmap which allows you to independently allocate and free chunks of memory, but these allocate on a PAGE_SIZE granularity, which on Linux is 4096 bytes. So if you want request 32 bytes, you'll have to waste 4096 - 32 == 4064 bytes if you don't have your own allocator. On these operating systems you practically need a separate memory allocation runtime which turns these coarse-grained tools into something capable of efficiently allocating small blocks.
Windows is a bit different. It has the HeapAlloc call, which is part of the "OS" and does offer malloc-like capabilities of allocating and freeing arbitrarily sized chunks of memory. With some compilers then, malloc is just implemented as a thin wrapper around HeapAlloc (the performance of this call has improved greatly in recent Windows versions, making this feasible). Still, while HeapAlloc is part of the OS it isn't implemented in the kernel - it is also mostly implemented in a user-mode library, managing a list of free and used blocks, with occasional kernel calls to get chunks of memory from the kernel. So it is mostly malloc in another disguise and any memory it is holding on to is also not available to any other processes.
Performance! Even if there were appropriate kernel-level calls to allocate arbitrary blocks of memory, the simple overhead roundtrip to the kernel is usually hundreds of nanoseconds or more. A well-tuned malloc allocation or free, on other hand, is often only a dozen instructions and may complete in 10 ns or less. On top of that, system calls can't "trust their input" and so must carefully validate parameters passed from user-space. In the case of free this means that it much check that the user passed a pointer which is valid! Most runtime free implements simply crash or silently corrupt memory since there is no responsibility to protect a process from itself.
Closer link to the rest of the language runtime. The functions you use to allocate memory in C++, namely new, malloc and friends, are part of an defined by the language. It is then entirely natural to implement them as part of the runtime that implements the rest of the language, rather than the OS which is for the most part language-agnostic. For example, the language may have specific alignment requirements for various objects, which can best be handled by language aware allocators. Changes to the language or compiler might also imply necessary changes to the allocation routines, and it would be a tough call to hope for the kernel to be updated to accommodate your language features!

Why Not Return Memory to the OS

Your example doesn't show it, but you asked and if you wrote a different test you would probably find that after allocating and then freeing a bunch of memory, your processes resident set size and/or virtual size as reported by the OS might not decrease after the free. That is, it seems like the process holds on to the memory even though you have freed it. This is in fact true of many malloc implementations. First, note that this is not a leak per se - the unreturned memory is still available to the process that allocated it, even if not to other processes.

Why do they do that? Here are some reasons:

The kernel API makes it hard. For the old-school brk and sbrk system calls, it simply isn't feasible to return freed memory unless it happens to be at the end of very last block allocated from brk or sbrk. That's because the abstraction offered by these calls is a single large contiguous region that you can only extend from one end. You can't hand back memory from the middle of it. Rather than trying to support the unusual case where all the freed memory happens to be at the end of brk region, most allocators don't even bother.
The mmap call is more flexible (and this discussion generally applies also to Windows where VirtualAlloc is the mmap equivalent), allowing you to at least return memory at a page granularity - but even that is hard! You can't return a page until all allocations that are part of that page are freed. Depending on the size and allocation/free pattern of the application that may be common or uncommon. A case where it works well is for large allocations - greater than a page. Here you're guaranteed to be able to free most of the allocation if it was done via mmap and indeed some modern allocators satisfy large allocations directly from mmap and free them back to the OS with munmap. For glibc (and by extension the C++ allocation operators), you can even control this threshold:
```
M_MMAP_THRESHOLD
  For allocations greater than or equal to the limit specified
  (in bytes) by M_MMAP_THRESHOLD that can't be satisfied from
  the free list, the memory-allocation functions employ mmap(2)
  instead of increasing the program break using sbrk(2).

  Allocating memory using mmap(2) has the significant advantage
  that the allocated memory blocks can always be independently
  released back to the system.  (By contrast, the heap can be
  trimmed only if memory is freed at the top end.)  On the other
  hand, there are some disadvantages to the use of mmap(2):
  deallocated space is not placed on the free list for reuse by
  later allocations; memory may be wasted because mmap(2)
  allocations must be page-aligned; and the kernel must perform
  the expensive task of zeroing out memory allocated via
  mmap(2).  Balancing these factors leads to a default setting
  of 128*1024 for the M_MMAP_THRESHOLD parameter.
```
So by default allocations of 128K or more will be allocated by the runtime directly from the OS and freed back to the OS on free. So sometimes you will see the behavior you might have expected is always the case.
Performance! Every kernel call is expensive, as described in the other list above. Memory that is freed by a process will be needed shortly later to satisfy another allocation. Rather than trying to return it to the OS, a relatively heavyweight operation, why not just keep it around on a free list to satisfy future allocations? As pointed out in the man page entry, this also avoids the overhead of zeroing out all the memory returned by the kernel. It also gives the best chance of good cache behavior since the process is continually re-using the same region of the address space. Finally, it avoids TLB flushes which would be imposed by munmap (and possibly by shrinking via brk).
The "problem" of not returning memory is the worst for long-lived processes that allocate a bunch of memory at some point, free it and then never allocate that much again. I.e., processes whose allocation high-water mark is larger than their long term typical allocation amount. Most processes just don't follow that pattern, however. Processes often free a lot of memory, but allocate at a rate such that their overall memory use is constant or perhaps increasing. Applications that do have the "big then small" live size pattern could perhaps force the issue with malloc_trim.
Virtual memory helps mitigate the issue. So far I've been throwing around terms like "allocated memory" without really defining what it means. If a program allocates and then frees 2 GB of memory and then sits around doing nothing, is it wasting 2 GB of actual DRAM plugged into your motherboard somewhere? Probably not. It is using 2 GB of virtual address space in your process, sure, but virtual address space is per-process, so that doesn't directly take anything away from other processes. If the process actually wrote to the memory at some point, it would be allocated physical memory (yes, DRAM) - after freeing it, you are - by definition - no longer using it. At this point the OS may reclaim those physical pages by use for someone else.
Now this still requires you have swap to absorb the dirty not-used pages, but some allocators are smart: they can issue a madvise(..., MADV_DONTNEED) call which tells the OS "this range doesn't have anything useful, you don't have to preserve its contents in swap". It still leaves the virtual address space mapped in the process and usable later (zero filled) and so it's more efficient than munmap and a subsequent mmap, but it avoid pointlessly swapping freed memory regions to swap.²

The Demonstrated Code

As pointed out in this answer your test with vector<int> isn't really testing anything because an empty, unused std::vector<int> v won't even create the vector object as long as you are using some minimal level of optimization. Even without optimization, no allocation is likely to occur because most vector implementations allocate on first insertion, and not in the constructor. Finally, even if you are using some unusual compiler or library that does an allocation, it will be for a handful of bytes, not the ~72,000 bytes Valgrind is reporting.

You should do something like this to actually see the impact of a vector allocation:

#include <vector>

volatile vector<int> *sink;

int main() {
    std::vector<int> v(12345678);
    sink = &v;
}

That results in actual allocation and de-allocation. It isn't going to change the Valgrind output, however, since the vector allocation is correctly freed before the program exits, so there is no issue as far as Valgrind is concerned.

At a high level, Valgrind basically categorizes things into "definite leaks" and "not freed at exit". The former occur when the program no longer has a reference to a pointer to memory that it allocated. It cannot free such memory and so has leaked it. Memory which hasn't been freed at exit may be a "leak" - i.e., objects that should have been freed, but it may also simply be memory that the developer knew would live the length of the program and so doesn't need to be explicitly freed (because of order-of-destruction issues for globals, especially when shared libraries are involved, it may be very hard to reliably free memory associated with global or static objects even if you wanted to).

¹ In some cases some deliberately special allocations may outlive the process, such as shared memory and memory mapped files, but that doesn't relate to plain C++ allocations and you can ignore it for the purposes of this discussion.

² Recent Linux kernels also have the Linux-specific MADV_FREE which seems to have similar semantics to MADV_DONTNEED.