How to correctly interpose malloc allowing for LD_PRELOAD chaining

Although jemalloc exports __libc_malloc as a symbol, it is intended only for static linking with glibc.

When you forward to __libc_malloc in your shared library, you are still forwarding to the libc implementation.
However, it seems that during startup jemalloc sets the malloc hooks to point to the previous address of malloc(), which in this case is the malloc wrapper in the first library (i.e. yours).
After some internal setup, which currently requires 3 calls to malloc(), jemalloc installs itself as the new malloc via the libc malloc hooks.

Unfortunately, there is no other symbol exported by glibc that you can use to bypass the malloc hooks and use malloc directly, at least in the version I'm using.

You could handle this by setting malloc hooks yourself if you have another malloc replacement to use. However, you have already expressed a desire to "do the right thing" and not use malloc hooks, because they are deprecated.

You can handle this without using malloc hooks by detecting recursive calls and providing a path to some other malloc, for example:

   // Note: a plain counter is not thread-safe; this is an illustration only.
   static unsigned int inMalloc = 0;

   void* handleRecursiveMalloc(size_t size);

   void* malloc(const size_t size)
   {
      if (inMalloc != 0) 
      {
         return handleRecursiveMalloc(size);
      }
      ++inMalloc;
      auto res = mainAllocator->malloc(size); // mainAllocator: your real allocator
      --inMalloc;
      return res;
   }

   void* handleRecursiveMalloc(size_t size)
   {
      // sbrk() returns the previous break, i.e. the start of the new block,
      // and (void*)-1 on failure (never nullptr).
      void* block = sbrk(size);
      if (block == (void*)-1)
      {
         return nullptr; // recursion detected and we could not handle it.
      }
      // we now have a block of size bytes starting at block
      // book-keeping here if required
      //  emergencyAllocSize += size;
      //  numEmergencyAllocations++
      return block;
   }

This is ugly but it works. Your wrapper to malloc is less efficient to the tune of one increment, one decrement and one conditional branch.
It probably doesn't make any difference, but you could use the C++20 attribute [[unlikely]] or gcc's __builtin_expect to mark the recursion branch as unlikely to be taken.
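The recursion guard and the branch hint can be sketched in plain C as follows. This is only a sketch under assumptions: guarded_alloc, backend_alloc, emergency_alloc and reentrant_backend are hypothetical names, the wrapper is deliberately not called malloc so that it can be tried in isolation, and a static arena stands in for the sbrk() path. The counter is not thread-safe.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical: backend_alloc stands in for the real allocator,
 * for example a pointer obtained from dlsym(RTLD_NEXT, "malloc"). */
typedef void *(*alloc_fn)(size_t);

static alloc_fn backend_alloc = NULL;
static unsigned int in_alloc = 0;        /* not thread-safe: illustration only */

/* Small static arena that serves allocations made while we are
 * already inside guarded_alloc(). */
static _Alignas(16) unsigned char emergency_arena[4096];
static size_t emergency_used = 0;
static unsigned int emergency_count = 0;

static void *emergency_alloc(size_t size)
{
    if (size > sizeof emergency_arena)
        return NULL;
    size_t rounded = (size + 15u) & ~(size_t)15u;   /* keep 16-byte alignment */
    if (rounded > sizeof emergency_arena - emergency_used)
        return NULL;                                /* arena exhausted */
    void *p = emergency_arena + emergency_used;
    emergency_used += rounded;
    ++emergency_count;
    return p;
}

void *guarded_alloc(size_t size)
{
    /* Tell the compiler that re-entry is the rare case. */
    if (__builtin_expect(in_alloc != 0, 0))
        return emergency_alloc(size);
    if (backend_alloc == NULL)
        return NULL;
    ++in_alloc;
    void *res = backend_alloc(size);
    --in_alloc;
    return res;
}

/* Example backend that calls back into the wrapper while it is running,
 * similar to an allocator that needs memory during its own start-up. */
static void *reentrant_backend(size_t size)
{
    void *scratch = guarded_alloc(32);   /* served from the emergency arena */
    (void)scratch;
    return malloc(size);
}
```

Memory handed out by the emergency path must never be passed to the real free(), so a matching free wrapper would have to recognise addresses inside emergency_arena and ignore them.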

There is another pitfall to be aware of. If you are forwarding multiple symbols you should check that they are all forwarded safely (typically this means to the same library).
For example:

void* f1 = dlsym(RTLD_NEXT,"malloc");
void* f2 = dlsym(RTLD_NEXT,"malloc_usable_size");
// handle failures...
Dl_info info1;
dladdr(f1,&info1);
Dl_info info2;
dladdr(f2,&info2);
// handle failures...
if (info1.dli_fbase != info2.dli_fbase)
{
    // malloc_usable_size() is provided by a different library than malloc()
    // so we probably shouldn't use it
    f2 = nullptr; 
    // set flags accordingly
}

An example of this in practice is electric-fence.
If you chain:

LD_PRELOAD="mymalloc.so electric-fence.so"

you find that malloc_usable_size() comes from libc while malloc() comes from electric-fence.
Granted, electric-fence is not so common any more.

In this case it would be safer to replace malloc_usable_size() with a dummy function that always returns 0.
For example, the normal libc version of malloc_usable_size(ptr) (see https://code.woboq.org/userspace/glibc/malloc/malloc.c.html) reads the chunk header located just before the allocated block (i.e. at ptr-2*sizeof(size_t)). If you give it a ptr that does not conform to this layout, it could segfault.
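A minimal sketch of that fallback, assuming the forwarding pointer is filled in during initialisation and left NULL when the dladdr() check fails. The names next_usable_size, dummy_usable_size and forwarded_usable_size are hypothetical.

```c
#include <stddef.h>

typedef size_t (*usable_size_fn)(void *);

/* Hypothetical: set during initialisation, or left NULL when the symbol
 * found via dlsym() lives in a different library than malloc(). */
static usable_size_fn next_usable_size = NULL;

/* Dummy replacement: "I do not know how big this block is." */
static size_t dummy_usable_size(void *ptr)
{
    (void)ptr;
    return 0;
}

/* Forward only when a trusted implementation is available. */
size_t forwarded_usable_size(void *ptr)
{
    usable_size_fn fn = next_usable_size ? next_usable_size : dummy_usable_size;
    return fn(ptr);
}
```

Callers that treat the result as "at least the requested size" will see 0 here, which is conservative but never reads memory in front of a block that another allocator laid out.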

See, for example, the question "Is it possible to define a symbol dynamically such that it will be found by dlsym?"

An alternative for the deprecated __malloc_hook functionality of glibc

After trying some things, I finally managed to figure out how to do this.

First of all, in glibc, malloc is defined as a weak symbol, which means that it can be overridden by the application or a shared library. Hence, LD_PRELOAD is not necessarily needed. Instead, I implemented the following function in a shared library:

void*
malloc (size_t size)
{
  [ ... ]
}

Which gets called by the application instead of glibc's malloc.

Now, to be equivalent to the __malloc_hook functionality, a couple of things are still missing.

1.) the caller address

In addition to the original parameters to malloc, glibc's __malloc_hook also provides the address of the calling function, which is actually the return address that malloc returns to. To achieve the same thing, we can use the __builtin_return_address function that is available in gcc. I have not looked into other compilers, because I am limited to gcc anyway, but if you happen to know how to do such a thing portably, please drop me a comment :)

Our malloc function now looks like this:

void*
malloc (size_t size)
{
  void *caller = __builtin_return_address(0);
  [ ... ]
}
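To try the builtin in isolation, here is a small sketch that records the call site without replacing malloc itself. The names traced_alloc, last_caller and last_size are hypothetical, and noinline is used on the assumption that an inlined wrapper would report the caller's caller instead.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical record of the most recent call site. */
static void *last_caller = NULL;
static size_t last_size = 0;

/* noinline keeps a real call frame, so the return address
 * refers to the function that called traced_alloc(). */
__attribute__((noinline))
void *traced_alloc(size_t size)
{
    last_caller = __builtin_return_address(0);
    last_size = size;
    return malloc(size);
}
```

The recorded address points just past the call instruction in the caller, so a tool such as addr2line can map it back to a source line when debug symbols are available.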

2.) accessing glibc's malloc from within your hook

As I am limited to glibc in my application, I chose to use __libc_malloc to access the original malloc implementation. Alternatively, dlsym(RTLD_NEXT, "malloc") can be used, with the possible pitfall that dlsym itself calls calloc on its first call, which can result in infinite recursion and a segfault if calloc is hooked as well.

complete malloc hook

My complete hooking function now looks like this:

extern void *__libc_malloc(size_t size);

int malloc_hook_active = 0;

void*
malloc (size_t size)
{
  void *caller = __builtin_return_address(0);
  if (malloc_hook_active)
    return my_malloc_hook(size, caller);
  return __libc_malloc(size);
}

where my_malloc_hook looks like this:

void*
my_malloc_hook (size_t size, void *caller)
{
  void *result;

  // deactivate hooks for logging
  malloc_hook_active = 0;

  result = malloc(size);

  // do logging
  [ ... ]

  // reactivate hooks
  malloc_hook_active = 1;

  return result;
}

Of course, the hooks for calloc, realloc and free work similarly.

dynamic and static linking

With these functions, dynamic linking works out of the box. Linking the .so file containing the malloc hook implementation will result in all calls to malloc from the application, and also all library calls, being routed through my hook. Static linking is problematic though. I have not yet wrapped my head around it completely, but in static linking malloc is not a weak symbol, resulting in a multiple definition error at link time.

If you need static linking for whatever reason, for example translating function addresses in 3rd party libraries to code lines via debug symbols, then you can link these 3rd party libs statically while still linking the malloc hooks dynamically, avoiding the multiple definition problem. I have not yet found a better workaround for this; if you know one, feel free to leave me a comment.

Here is a short example:

gcc -o test test.c -lmalloc_hook_library -Wl,-Bstatic -l3rdparty -Wl,-Bdynamic

3rdparty will be linked statically, while malloc_hook_library will be linked dynamically, resulting in the expected behaviour, with addresses of functions in 3rdparty translatable via debug symbols in test. Pretty neat, huh?

Conclusion

The techniques above describe a non-deprecated, pretty much equivalent approach to __malloc_hook, but with a couple of mean limitations:

__builtin_return_address is a gcc builtin (clang also provides it)

__libc_malloc only works with glibc

dlsym(RTLD_NEXT, [...]) is a GNU extension in glibc

the linker flags -Wl,-Bstatic and -Wl,-Bdynamic are specific to the GNU binutils.

In other words, this solution is utterly non-portable and alternative solutions would have to be added if the hooks library were to be ported to a non-GNU operating system.

I want to make my own Malloc

There's rather a lot of good literature on implementing malloc and similar things, but I notice that you include C++ here. Are you aware that you can write your own implementation of new and delete in C++? That might be an easy way to do it.

In any case, the characteristics you want are going to depend pretty heavily on your workload, that is, on the pattern of usage over time. If you have only mallocs and no frees, it's easy, obviously. If you have only mallocs of one, or a few different, block sizes, that's also simple.
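As an illustration of the single-block-size case, here is a minimal fixed-size pool with a free list threaded through the unused blocks. BLOCK_SIZE, BLOCK_COUNT, pool_alloc and pool_free are names and values assumed for this sketch only.

```c
#include <stddef.h>

#define BLOCK_SIZE  64      /* assumed fixed request size */
#define BLOCK_COUNT 128     /* assumed pool capacity */

/* A free block stores the link to the next free block inside itself,
 * so the free list needs no extra memory. */
typedef union block {
    union block *next;
    unsigned char payload[BLOCK_SIZE];
} block;

static block pool[BLOCK_COUNT];
static block *free_list = NULL;
static size_t never_used = 0;   /* blocks of pool[] not handed out yet */

void *pool_alloc(void)
{
    if (free_list != NULL) {            /* reuse a freed block first */
        block *b = free_list;
        free_list = b->next;
        return b;
    }
    if (never_used < BLOCK_COUNT)       /* otherwise carve a fresh one */
        return &pool[never_used++];
    return NULL;                        /* pool exhausted */
}

void pool_free(void *p)
{
    if (p == NULL)
        return;
    block *b = p;
    b->next = free_list;                /* push onto the free list */
    free_list = b;
}
```

Both operations are constant time and there is no fragmentation, because every block is interchangeable with every other. Blocks are only aligned for pointers here, which a real pool may need to strengthen.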

In other languages, you get some leverage by having the ability to chain memory together, but C isn't that smart.

The basic implementation of malloc simply allocates a header that contains the data length, an "in use flag", and the malloced memory. Malloc then constructs a new header at the end of its space, allocates the memory, and returns a pointer. When you free, it just resets the in use flag.

The trick is that when you do a lot of malloc and free, you can quickly get a lot of small blobs that aren't in use, but are hard to allocate. So you need some kind of bumpo gc to merge blocks of memory.

You could do a more complicated gc, but remember that takes time; you don't want a free to take up a lot of time.
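The header scheme and the merging step described above can be sketched as follows. This is an assumption-laden toy, not how any production malloc works: hdr_alloc and hdr_free are hypothetical names, the pool is a static array, allocation is first-fit without splitting, and the merge pass walks the whole pool on every free.

```c
#include <stddef.h>

#define POOL_SIZE 4096

typedef struct header {
    _Alignas(16) size_t size;  /* payload bytes that follow this header */
    int in_use;                /* the "in use" flag */
} header;

static _Alignas(16) unsigned char pool[POOL_SIZE];
static size_t pool_end = 0;    /* offset of the first byte never handed out */

static size_t round_up(size_t n) { return (n + 15u) & ~(size_t)15u; }

void *hdr_alloc(size_t size)
{
    if (size > POOL_SIZE)
        return NULL;
    size = round_up(size ? size : 1);

    /* First fit: reuse a free block that is big enough. */
    size_t off = 0;
    while (off < pool_end) {
        header *h = (header *)(pool + off);
        if (!h->in_use && h->size >= size) {
            h->in_use = 1;
            return h + 1;
        }
        off += sizeof(header) + h->size;
    }

    /* Otherwise construct a new header at the end of the used space. */
    if (sizeof(header) + size > POOL_SIZE - pool_end)
        return NULL;
    header *h = (header *)(pool + pool_end);
    h->size = size;
    h->in_use = 1;
    pool_end += sizeof(header) + size;
    return h + 1;
}

void hdr_free(void *p)
{
    if (p == NULL)
        return;
    ((header *)p - 1)->in_use = 0;      /* free just resets the flag */

    /* Merge runs of adjacent free blocks so they can serve larger requests. */
    size_t off = 0;
    while (off < pool_end) {
        header *h = (header *)(pool + off);
        size_t next = off + sizeof(header) + h->size;
        if (!h->in_use && next < pool_end &&
            !((header *)(pool + next))->in_use) {
            h->size += sizeof(header) + ((header *)(pool + next))->size;
            continue;   /* re-check the same block against its new neighbour */
        }
        off = next;
    }
}
```

Without the merge pass, two freed 100-byte blocks sitting next to each other could not satisfy a 200-byte request even though the space is there. The full scan on every free is exactly the kind of cost the paragraph above warns about, so a real allocator would merge only with the immediate neighbours.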

There's a nice paper on Solaris malloc implementations you might find interesting. Here's another on building an alternative malloc, again in Solaris, but the basics are the same. And you should read the Wikipedia article on garbage collection, and follow it to some of the more formal papers.

Update

You know, you really should have a look at generational garbage collectors. The basic idea is that the longer something remains allocated, the more likely it is to stay allocated. This is an extension of the "copying" GC you mention. Basically, you allocate new stuff in one part of your memory pool, call it g0. When you reach a high water mark on that, you look through the allocated blocks and copy the ones that are still in use to another section of memory, call it g1. Then you can just clear the g0 space and start allocating there. Eventually g1 gets to its high water mark and you fix that by clearing g0, and clean up g1 moving stuff to g0, and when you're done, you rename the old g1 as g0 and vice versa and continue.

The trick is that in C especially, the handles you hand out to malloc'ed memory are straight raw pointers; you can't really move things around without some heap big medicine.

Second update

In comments, @unknown asks "Wouldn't moving stuff around just be a memcpy()?" And indeed it would. But consider this timeline:

warning: this is not complete, and untested, just for illustration, for entertainment only, no warranty express or implied

/* basic environment for illustration*/
void * myMemoryHdl ;
unsigned char lotsOfMemory[LOTS]; /* this will be your memory pool*/

You mallocate some memory

/* if we get past this, it succeeded */
if((myMemoryHdl = newMalloc(SIZE)) == NULL)
    exit(-1);

In your implementation of malloc, you create the memory and return a pointer to the buffer.

unsigned char * nextUnused = &lotsOfMemory[0];
int partitionSize = (int)(LOTS/2);
int hwm = (int) (partitionSize/2);
/* So g0 will be the bottom half and g1 the top half to start */
unsigned char * g0 = &lotsOfMemory[0];
unsigned char * g1 = &lotsOfMemory[partitionSize];


void * newMalloc(size_t size){
   void * rtn ;
   if( /* memory COMPLETELY exhausted */)
      return NULL;
   /* otherwise */
   /* add header at nextUnused */
   newHeader(nextUnused);     /* includes some pointers for chaining
                               * and a field with values USED or FREE, 
                               * set to USED */
   nextUnused += HEADERLEN ;  /* this could be niftier */
   rtn = nextUnused ;
   nextUnused += size ;
   return rtn ;               /* hand the payload address to the caller */
}

Some of the things are freed

  void newFree(void * aHandle){
     /* set the flag in the header, using an offset;
      * cast first, because arithmetic on void * is not standard C */
     *((unsigned char *)aHandle - offset) = FREE ;
  }

So now you do all the stuff and you get to your high water mark.

 for( /* each block in your memory pool */ )
    if( /* block header is still marked USED */ ) {
        memcpy(/* block into other partition */);
    }
 /* clear the partition */
 bzero(g0, partitionSize);

Now, go back to the original handle you saved in myMemoryHdl. What does it point to? (Answer: memory you just set to 0x00 with bzero(3).)

That's where the magic comes in. In C at least, the pointer you returned from your malloc is no longer under your control -- you can't move it around after the fact. In C++, with user-defined pointer-like types, you can fix that.
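The same indirection can be approximated in C if the allocator hands out handles instead of raw pointers. The sketch below is only an illustration under assumptions: handle, h_alloc, h_free, h_deref and h_collect are hypothetical names, the two halves play the roles of g0 and g1, and alignment is ignored.

```c
#include <stddef.h>
#include <string.h>

#define HALF_SIZE   1024
#define MAX_HANDLES 32

typedef int handle;                 /* index into the table, not an address */

typedef struct {
    unsigned char *addr;            /* current location, NULL when unused */
    size_t size;
} slot;

static unsigned char space[2][HALF_SIZE];   /* g0 and g1 */
static int active = 0;                      /* half currently allocated from */
static size_t used = 0;                     /* bytes used in the active half */
static slot table[MAX_HANDLES];

handle h_alloc(size_t size)
{
    if (size > HALF_SIZE - used)
        return -1;
    for (handle h = 0; h < MAX_HANDLES; ++h) {
        if (table[h].addr == NULL) {
            table[h].addr = space[active] + used;
            table[h].size = size;
            used += size;
            return h;
        }
    }
    return -1;                              /* no free table slot */
}

void h_free(handle h)
{
    if (h >= 0 && h < MAX_HANDLES)
        table[h].addr = NULL;
}

/* Always go through the table; never keep the result across a collection. */
void *h_deref(handle h)
{
    return (h >= 0 && h < MAX_HANDLES) ? table[h].addr : NULL;
}

/* Copy every live block into the other half and update the table. */
void h_collect(void)
{
    int target = 1 - active;
    size_t new_used = 0;
    for (handle h = 0; h < MAX_HANDLES; ++h) {
        if (table[h].addr != NULL) {
            memcpy(space[target] + new_used, table[h].addr, table[h].size);
            table[h].addr = space[target] + new_used;
            new_used += table[h].size;
        }
    }
    memset(space[active], 0, HALF_SIZE);    /* clear the old half */
    active = target;
    used = new_used;
}
```

Because callers hold only an index, the collector is free to move the bytes and fix up a single table entry. The price is an extra lookup on every access, and any raw pointer obtained from h_deref() becomes stale after h_collect().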
