Understanding Glibc Malloc Trimming

Largely for historical reasons, memory for small allocations comes from a pool managed with the brk system call. This is a very old system call — at least as old as Version 6 Unix — and the only thing it can do is change the size of an "arena" whose position in memory is fixed. What that means is, the brk pool cannot shrink past a block that is still allocated.

Your program allocates N blocks of memory and then deallocates N-1 of them. The one block it doesn't deallocate is the one located at the highest address. That is the worst-case scenario for brk: the size can't be reduced at all, even though 99.99% of the pool is unused! If you change your program so that the block it doesn't free is array[0] instead of array[NUM_CHUNKS-1], you should see both RSS and address space shrink upon the final call to free.
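To make that scenario concrete, here is a minimal sketch of the allocation pattern (the constants and the function name are illustrative, not from the original program):

```c
#include <stdlib.h>

#define NUM_CHUNKS 10000
#define CHUNK_SIZE 1024

/* Allocate NUM_CHUNKS small blocks, then free all but one.
   If keep_highest is nonzero, the survivor is the block at the top
   of the brk pool, so the heap cannot shrink at all; otherwise the
   survivor is array[0] and the top of the heap can retreat when the
   freed blocks above it are coalesced. */
static void *alloc_and_free_all_but_one(int keep_highest)
{
    char *array[NUM_CHUNKS];

    for (int i = 0; i < NUM_CHUNKS; i++)
        array[i] = malloc(CHUNK_SIZE);

    int keep = keep_highest ? NUM_CHUNKS - 1 : 0;
    for (int i = 0; i < NUM_CHUNKS; i++)
        if (i != keep)
            free(array[i]);

    return array[keep];   /* caller frees the survivor */
}
```

With `keep_highest` set, roughly 10 MB of the pool is free yet none of it can be returned via brk; with it clear, the final `free()` of the survivor would let both RSS and address space shrink.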

When you explicitly call malloc_trim, it attempts to work around this limitation using a Linux extension, madvise(MADV_DONTNEED), which releases the physical RAM, but not the address space (as you observed). I don't know why this only happens upon an explicit call to malloc_trim.

Incidentally, the 8MB mmap segment is for your initial allocation of array.

Why does malloc_trim() only work with the main arena?

Arenas other than the main one are probably allocated from the system using mmap, so sbrk cannot be used to return that memory to the system. It could be possible to make glibc use mremap to shrink these other arenas. Note also that malloc_trim can only return memory at the end of the arena; if there are empty blocks in the middle of the arena, there is no way to release that memory.

Impact of Increasing GLIBC malloc() M_MMAP_THRESHOLD to 1GB

This question was answered on the libc-help list:

If you increase M_MMAP_THRESHOLD, you also have to increase the heap size to something like 32 GiB (HEAP_MAX_SIZE in malloc/arena.c). The default of 2 * DEFAULT_MMAP_THRESHOLD_MAX is probably too small (assuming that DEFAULT_MMAP_THRESHOLD_MAX will be 2 GiB). Otherwise you will have substantial fragmentation for allocation requests between 2 GiB and HEAP_MAX_SIZE.
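At runtime the threshold is set with mallopt(); a sketch (the helper name is mine). Recent glibc versions may refuse values too large for any heap to serve, returning 0, which is consistent with the HEAP_MAX_SIZE discussion above; raising the limit itself requires rebuilding glibc:

```c
#include <malloc.h>
#include <stddef.h>

/* Sketch: raise the mmap threshold so allocations below `bytes` are
   served from arena heaps rather than individual mmap segments.
   mallopt() returns 1 on success, 0 on failure. Setting the
   threshold explicitly also disables glibc's automatic dynamic
   adjustment of it. */
static int set_mmap_threshold(size_t bytes)
{
    return mallopt(M_MMAP_THRESHOLD, (int)bytes);
}
```

A value within the default limits (e.g. 32 MiB) should always be accepted; 1 GiB may be rejected on stock glibc.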

malloc_trim(0) Releases Fastbins of Thread Arenas?

The documentation for malloc_trim(0) states that it can only free memory from the top of the main arena's heap, so what is going on here?

It could be called "outdated" or "incorrect" documentation. Glibc has no documentation of its own for the malloc_trim function, and Linux uses man pages from the man-pages project. The man page of malloc_trim, http://man7.org/linux/man-pages/man3/malloc_trim.3.html, was newly written in 2012 by the maintainer of man-pages, probably based on comments from the glibc malloc/malloc.c source code, http://code.metager.de/source/xref/gnu/glibc/malloc/malloc.c#675:

676  malloc_trim(size_t pad);
677
678 If possible, gives memory back to the system (via negative
679 arguments to sbrk) if there is unused memory at the `high' end of
680 the malloc pool. You can call this after freeing large blocks of
681 memory to potentially reduce the system-level memory requirements
682 of a program. However, it cannot guarantee to reduce memory. Under
683 some allocation patterns, some large free blocks of memory will be
684 locked between two used chunks, so they cannot be given back to
685 the system.
686
687 The `pad' argument to malloc_trim represents the amount of free
688 trailing space to leave untrimmed. If this argument is zero,
689 only the minimum amount of memory to maintain internal data
690 structures will be left (one page or less). Non-zero arguments
691 can be supplied to maintain enough trailing space to service
692 future expected allocations without having to re-obtain memory
693 from the system.
694
695 Malloc_trim returns 1 if it actually released any memory, else 0.
696 On systems that do not support "negative sbrks", it will always
697 return 0.

The actual implementation in glibc is __malloc_trim, and it contains code that iterates over all arenas:

http://code.metager.de/source/xref/gnu/glibc/malloc/malloc.c#4552

4552 int
4553 __malloc_trim (size_t s)

4560   mstate ar_ptr = &main_arena;
4561   do
4562     {
4563       (void) mutex_lock (&ar_ptr->mutex);
4564       result |= mtrim (ar_ptr, s);
4565       (void) mutex_unlock (&ar_ptr->mutex);
4566
4567       ar_ptr = ar_ptr->next;
4568     }
4569   while (ar_ptr != &main_arena);

Every arena is trimmed using the mtrim() (mTRIm()) function, which calls malloc_consolidate() to convert all free segments in the fastbins (they are not coalesced at free() time; that is what makes them fast) into normal free chunks, which are coalesced with adjacent chunks:

4498  /* Ensure initialization/consolidation */
4499 malloc_consolidate (av);

4111 malloc_consolidate is a specialized version of free() that tears
4112 down chunks held in fastbins.

1581 Fastbins
1591 Chunks in fastbins keep their inuse bit set, so they cannot
1592 be consolidated with other free chunks. malloc_consolidate
1593 releases all chunks in fastbins and consolidates them with
1594 other free chunks.

The problem is that when the worker thread is recreated, it creates a new arena/heap instead of reusing the previous one, so the fastbins of the old arenas/heaps are never reused.

This is strange. By design, the maximum number of arenas in glibc malloc is limited to cpu_core_count * 8 (on 64-bit platforms) or cpu_core_count * 2 (on 32-bit platforms), or by the environment variable MALLOC_ARENA_MAX / the mallopt parameter M_ARENA_MAX.

You can limit the number of arenas for your application; call malloc_trim() periodically; or, from each thread just before it is destroyed, call malloc() with a "large" size (which triggers malloc_consolidate) and then free() the result:
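A sketch of the first two mitigations (the helper name is mine): cap the arena count with mallopt and trim all arenas, which also consolidates their fastbins:

```c
#include <malloc.h>

/* Cap the number of malloc arenas, then trim. mallopt() returns 1 on
   success. malloc_trim(0) walks every arena, consolidating fastbins
   into normal free chunks and releasing what it can back to the OS. */
static int limit_arenas_and_trim(int max_arenas)
{
    if (mallopt(M_ARENA_MAX, max_arenas) != 1)
        return 0;
    /* Returns 1 if any memory was actually released, 0 otherwise;
       either way, fastbin consolidation has happened in all arenas. */
    malloc_trim(0);
    return 1;
}
```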

3319 _int_malloc (mstate av, size_t bytes)

3368   if ((unsigned long) (nb) <= (unsigned long) (get_max_fast ()))
         /* fastbin allocation path */

3405   if (in_smallbin_range (nb))
         /* smallbin path; malloc_consolidate may be called */

3437   /* If this is a large request, consolidate fastbins before continuing.
3438      While it might look excessive to kill all fastbins before
3439      even seeing if there is space available, this avoids
3440      fragmentation problems normally associated with fastbins.
3441      Also, in practice, programs tend to have runs of either small or
3442      large requests, but less often mixtures, so consolidation is not
3443      invoked all that often in most programs. And the programs that
3444      it is called frequently in otherwise tend to fragment.
3445   */
3446
3447   else
3448     {
3449       idx = largebin_index (nb);
3450       if (have_fastchunks (av))
3451         malloc_consolidate (av);
3452     }

PS: there is a comment in the man page of malloc_trim, https://github.com/mkerrisk/man-pages/commit/a15b0e60b297e29c825b7417582a33e6ca26bf65:

+.SH NOTES
+This function only releases memory in the main arena.
+.\" malloc/malloc.c::mTRIm():
+.\" return result | (av == &main_arena ? sYSTRIm (pad, av) : 0);

Yes, there is a check for main_arena, but it is at the very end of the malloc_trim implementation mTRIm(), and it only guards the call to sbrk() with a negative offset. Since 2007 (glibc 2.9 and newer) there has been another method of returning memory to the OS: madvise(MADV_DONTNEED), which is used in all arenas (and was documented neither by the author of the glibc patch nor by the author of the man page). Consolidation is performed for every arena. There is also code for trimming (munmapping) the top chunk of mmap-ed heaps (heap_trim/shrink_heap, called from the slow path of free()), but it is not called from malloc_trim.

Using glibc malloc hooks in a thread-safe manner

UPDATED

You are right not to trust __malloc_hooks; I have glanced at the code, and they are, staggeringly, not thread safe.

Invoking the inherited hooks directly, rather than restoring and re-entering malloc, deviates from the document you cite a little too much for me to feel comfortable suggesting it.

From http://manpages.sgvulcan.com/malloc_hook.3.php:

Hook variables are not thread-safe so they are deprecated now. Programmers should instead preempt calls to the relevant functions by defining and exporting functions like "malloc" and "free".

The appropriate way to inject debug malloc/realloc/free functions is to provide your own library that exports your 'debug' versions of these functions and then defers to the real ones. C linking is done in explicit order, so if two libraries offer the same function, the first one specified is used. You can also inject your malloc at load time on Unix using the LD_PRELOAD mechanism.

http://linux.die.net/man/3/efence describes Electric Fence, which details both these approaches.

You can use your own locking in these debug functions if that is necessary.
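A minimal sketch of that approach, assuming a Linux/glibc toolchain: an interposing library that exports malloc/free, resolves the real functions with dlsym(RTLD_NEXT, ...), and serializes with its own lock. One known caveat: dlsym() may itself allocate, so production interposers add a re-entrancy guard for the bootstrap case.

```c
/* Build as a shared object and preload it:
 *   gcc -shared -fPIC -o debug_malloc.so debug_malloc.c -ldl
 *   LD_PRELOAD=./debug_malloc.so ./your_program
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stddef.h>
#include <pthread.h>

static void *(*real_malloc)(size_t);
static void (*real_free)(void *);
static pthread_mutex_t debug_lock = PTHREAD_MUTEX_INITIALIZER;

void *malloc(size_t size)
{
    if (!real_malloc)
        real_malloc = (void *(*)(size_t)) dlsym(RTLD_NEXT, "malloc");

    pthread_mutex_lock(&debug_lock);   /* your own locking, if needed */
    void *ptr = real_malloc(size);
    /* ... record/instrument the allocation here ... */
    pthread_mutex_unlock(&debug_lock);
    return ptr;
}

void free(void *ptr)
{
    if (!real_free)
        real_free = (void (*)(void *)) dlsym(RTLD_NEXT, "free");

    pthread_mutex_lock(&debug_lock);
    /* ... record/instrument the free here ... */
    real_free(ptr);
    pthread_mutex_unlock(&debug_lock);
}
```

Because symbol resolution picks the first definition, these exports shadow glibc's, while RTLD_NEXT finds the real implementations further down the link order.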

glibc malloc guard byte wrapper

You are using the size_t located just before the allocated area as the length available. However, it includes the size_t itself. Therefore, here:

    if (p != NULL) {
        size_t *q = p;
        q--;
        size_t s = *q & ~(SIZE_BITS); /* get allocated bytes by masking off info bits */
        char *z = p;
        memset(z, 0, s);              /* zero memory */
        z[s - 1] = '@';               /* place guard char */
    }

you end up partially overwriting the length field of the next region with your guard char. The solution is to subtract the size of the length field itself, i.e. use const size_t s = (((size_t *)p)[-1] & ~(size_t)SIZE_BITS) - sizeof (size_t); instead.

(I verified this works on Embedded GNU C Library 2.15-0ubuntu10.15 on x86-64, for both 64 and 32-bit code (with different size_t sizes).)

I recommend you add at least minimal abstraction, so that porting your code to a different C library or a newer version of GNU C library in the future is not futile. (Version checking would be good, but I was too lazy to find out which versions of GNU C library actually use this layout.)

#include <string.h>
#include <limits.h>
#ifdef __GLIBC__

/* GLIBC stuffs the length just prior to the returned pointer,
 * with flags in the least significant three bits. It includes
 * the length field itself. */
#define USER_LEN(ptr) ((((size_t *)(ptr))[-1] & ~((size_t)7)) - sizeof (size_t))

#else
#error This C library is not supported (yet).
#endif

extern void abort(void);
extern void *__libc_malloc(size_t);
extern void *__libc_realloc(void *, size_t);
extern void __libc_free(void *);

#define CANARY_LEN 1

static void canary_set(void *const ptr, const size_t len)
{
    ((unsigned char *)ptr)[len - CANARY_LEN] = '@';
}

static int canary_ok(const void *const ptr, const size_t len)
{
    return ((const unsigned char *)ptr)[len - CANARY_LEN] == '@';
}

void *malloc(size_t size)
{
    void *ptr;

    ptr = __libc_malloc(size + CANARY_LEN);
    if (ptr) {
        const size_t len = USER_LEN(ptr);
        memset(ptr, 0, len);
        canary_set(ptr, len);
    }
    return ptr;
}

void *realloc(void *ptr, size_t size)
{
    void *newptr;

    if (!ptr)
        return malloc(size);

    if (!canary_ok(ptr, USER_LEN(ptr)))
        abort();

    newptr = __libc_realloc(ptr, size + CANARY_LEN);
    if (!newptr)
        return newptr;

    /* The length must be re-read from newptr: after a successful
     * __libc_realloc the old pointer may no longer be valid. */
    canary_set(newptr, USER_LEN(newptr));

    return newptr;
}

void free(void *ptr)
{
    if (ptr) {
        const size_t len = USER_LEN(ptr);

        if (!canary_ok(ptr, len))
            abort();

        memset(ptr, 0, len);

        __libc_free(ptr);
    }
}

Hope this helps.

Understanding corrupted size vs. prev_size glibc error

OK, so I've managed to overcome this issue.

First of all, a practical cause of "corrupted size vs. prev_size" is quite simple: the chunk control-structure fields of the adjacent following chunk are overwritten due to an out-of-bounds write. If you allocate x bytes for pointer p but end up writing beyond those x bytes through the same pointer, you may get this error, which indicates that the current chunk's recorded size does not match what is found in the next chunk's control structure (because the latter was overwritten).
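As a sketch of the failure mode described above (the function name is mine): glibc stores chunk headers directly after each allocation, so writing even a few bytes past the requested size clobbers the next chunk's size/prev_size fields and triggers the abort on a later free() or malloc().

```c
#include <stdlib.h>
#include <string.h>

/* The in-bounds version is safe; the commented-out variant would
   overwrite the adjacent chunk's control words and typically end in
   "corrupted size vs. prev_size" later on. */
static char *fill_exactly(size_t n)
{
    char *p = malloc(n);
    if (!p)
        return NULL;
    memset(p, 'A', n);          /* in bounds: writes bytes 0..n-1 only */
    /* memset(p, 'A', n + 16);     BUG: out-of-bounds, corrupts the heap */
    return p;
}
```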

As for the cause of this memory corruption: structure mapping done in the Java/JNA layer implied different #pragma-related padding/alignment from what the dll/so was compiled with. This, in turn, caused data to be written beyond the allocated structure's boundary. Disabling that alignment made the issues go away (thousands of executions without a single crash!).


