Are Some Allocators Lazy?

Are some allocators lazy?

On Linux, malloc requests memory from the kernel with sbrk() or mmap(). Either way, your address space is expanded immediately, but Linux does not assign actual pages of physical memory until the first write to the page in question. You can see the address space expansion in the VIRT column of tools like htop, and the actual physical memory usage in the RES column.
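
To make that concrete, here is a small Linux-specific sketch (mine, not from any of the answers below) that watches its own VmSize/VmRSS lines in /proc/self/status, which correspond to htop's VIRT and RES columns. The helper name print_vm is purely illustrative:

#include <cstdio>
#include <cstdlib>
#include <cstring>

static void print_vm(const char* label) {
    // Print the VmSize (VIRT) and VmRSS (RES) lines for this process.
    std::FILE* f = std::fopen("/proc/self/status", "r");
    if (!f) return;
    char line[256];
    while (std::fgets(line, sizeof line, f)) {
        if (std::strncmp(line, "VmSize:", 7) == 0 || std::strncmp(line, "VmRSS:", 6) == 0)
            std::printf("%s%s", label, line);
    }
    std::fclose(f);
}

int main() {
    print_vm("before malloc: ");
    char* p = static_cast<char*>(std::malloc(64 * 1024 * 1024));  // 64MB: VIRT grows immediately
    print_vm("after malloc:  ");
    std::memset(p, 1, 64 * 1024 * 1024);                          // RES only grows once the pages are written
    print_vm("after memset:  ");
    std::free(p);
}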

what is lazy allocation?

Lazy allocation simply means not allocating a resource until it is actually needed. This is common with singleton objects, but strictly speaking, any time a resource is allocated as late as possible, you have an example of lazy allocation.

By delaying allocation of a resource until you actually need it, you can decrease startup time, and even eliminate the allocation entirely if you never actually use the object. In contrast, you could pre-allocate a resource you expect to need later, which can make later execution more efficient at the expense of startup time, and also avoids the possibility of the allocation failing later in program execution.

The following code provides an example of a lazily-allocated singleton:

public class Widget {
    // The single instance; created only when first requested.
    private static Widget singleton;

    public static Widget get() {
        if (singleton == null) {
            singleton = new Widget();   // allocated lazily, on first use
        }
        return singleton;
    }

    private Widget() {
        // ...
    }

    // ...
}

Do note that this code isn't thread-safe. In most cases, access to the get() method should be synchronized in some fashion.
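
For comparison, here is a sketch of the same lazy pattern in C++ (my addition, not part of the original answer): a function-local static is constructed on first use, and since C++11 that initialization is guaranteed to be thread-safe.

#include <iostream>

class Widget {
public:
    static Widget& get() {
        static Widget singleton;   // constructed lazily, the first time get() runs
        return singleton;
    }

private:
    Widget() { std::cout << "Widget constructed\n"; }
    Widget(const Widget&) = delete;
    Widget& operator=(const Widget&) = delete;
};

int main() {
    std::cout << "before first use\n";
    Widget::get();   // construction happens here, not at program startup
    Widget::get();   // later calls return the same instance
}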

A similar (and perhaps more general) concept is lazy initialization.

lazy allocation for C++ object arrays

For a standard Ubuntu 20.04 machine running Linux 5.4.0-51-generic....

We can observe this directly. In the code below, I increased the n value to 1 << 24 (~16 million ints = 64MB for 32-bit int) so it's the dominant factor in overall memory usage. I compiled, ran, and observed memory usage in htop:

#include <unistd.h>

int main() {
    int *foo = new int[1 << 24];   // 64MB of address space, not yet written to
    sleep(100);                    // pause so memory usage can be inspected in htop
}

htop values: VIRT 71416KB / RES 1468KB

The virtual address allocations include the memory allocated by new, but the resident memory size is much smaller - indicating that distinct physical backing memory pages weren't needed yet for all the 64MB allocated.


After changing to int *foo = new int[1 << 24](); - i.e. value-initialisation, which zeroes the ints:

htop values: VIRT 71416KB / RES 57800KB

Requesting that the memory be zeroed resulted in a resident memory value just under the 64MB that was initialised. That won't have been due to memory pressure (I have 64GB of RAM); some algorithm in the kernel must have decided to page out a little of the backing memory after it was zeroed (kswapd, I suspect). The large RES value suggests that each page zeroed was given a distinct page of physical backing memory (as distinct from, say, being mapped to the OS's zero page for copy-on-write allocation of an actual backing page).


With structs:

#include <unistd.h>

struct X {
    int a[1 << 24];     // 64MB array member
};

int main() {
    auto foo = new X;   // default-initialised: the array contents are left uninitialised
    sleep(100);
}

htop values: VIRT 71416KB / RES 1460KB

This shows insufficient RES for the struct's array to have distinct backing pages. Either the virtual memory has been pre-mapped to the OS's zero page, or it's unmapped and will be mapped initially to the zero page when accessed, then given its own physical backing page if written to - I'm not sure which, but in terms of actual physical RAM usage it doesn't make any difference.


After changing to auto foo = new X{}; - i.e. value-initialisation, which zeroes the array:

htop values: VIRT 71416KB / RES 67844KB


You can clearly see that initialising the bytes to 0s resulted in use of backing memory for the arrays.

Addressing your questions:

Will the Linux kernel use lazy memory allocation?

The virtual memory allocation is done when the new is done. Distinct physical backing memory is allocated lazily when an actual write is done to the memory by the user-space code.
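
A quick way to see that, along the lines of the experiments above (a sketch of mine, assuming 4KB pages): touch only some of the pages and watch RES track the pages actually written, not the size passed to new.

#include <unistd.h>

int main() {
    const long n = 1L << 24;                      // 16M ints = 64MB of address space
    int* foo = new int[n];                        // VIRT grows by ~64MB here
    for (long i = 0; i < n / 4; i += 4096 / sizeof(int))
        foo[i] = 1;                               // touch one int per page in the first quarter
    sleep(100);                                   // expect RES to grow by only ~16MB
    delete[] foo;
}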

For the second case, does it work in the same way as when creating static arrays?

#include <unistd.h>

int g_a[1 << 24];               // global array with static storage duration

int f(int i) {
    static int a[1 << 24];      // function-local static array
    return a[i];
}

int main(int argc, const char* argv[]) {
    sleep(20);                  // observe memory before the arrays are touched
    int k = f(2930);
    sleep(20);                  // ...and after
    return argc + k;
}

htop values: VIRT 133MB / RES 1596KB

When this was run, the memory didn't jump after 20 seconds, indicating all the virtual address space was allocated during program loading. The low resident memory shows that the pages were not accessed and zeroed the way they were for new.

Just to address a potential point of confusion: while the Linux kernel will zero out backing memory the first time it's provided to the process, any given call to new won't (in any implementation I've seen) know whether the memory it returns is being recycled from earlier dynamic allocations - which might have had non-zero values written into them before being deleted/freed. Because of this, if you use memory-zeroing forms like new X{} or new int[n](), the memory is unconditionally cleared by user-space code, causing the full amount of backing memory to be assigned and faulted in.

lazy overcommit allocation and calloc

calloc can get guaranteed-zero pages from the OS, and thus avoid having to write zeros in user-space at all. (That's mainly the case for large allocations; otherwise it has to zero whatever it reuses from the free list, if there are free-list entries of the right size.) That's where the laziness comes in.

So your page will be fresh from mmap(MAP_ANONYMOUS), untouched by user-space. Reading it will trigger a soft page fault that copy-on-write maps it to a shared physical page of zeros. (So fun fact, you can get TLB misses but L1d / L2 cache hits when looping read-only over a huge calloc allocation).

Writing that page / one of those pages (either as the first access, or after it's been CoW-mapped to the zero page) will soft page-fault, and Linux's page-fault handler will allocate a new physical page and zero it. So after the page fault, the whole page is generally hot in L1d cache, or at least L2. That holds even with fault-around, where the kernel prepares neighbouring lazily-allocated pages and wires them into the page table to reduce the number of page faults.
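
Here is an illustrative sketch of that behaviour (my own, assuming a glibc-style calloc that mmaps large allocations and default Linux overcommit): it counts minor page faults around a read-only pass and then a write pass over a large calloc'd buffer.

#include <cstdio>
#include <cstdlib>
#include <sys/resource.h>

static long minor_faults() {
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;                  // soft (minor) page faults so far
}

int main() {
    const std::size_t n = 64UL * 1024 * 1024;                   // 64MB
    unsigned char* p = static_cast<unsigned char*>(std::calloc(n, 1));
    if (!p) return 1;

    long start = minor_faults();
    volatile unsigned char sink = 0;
    for (std::size_t i = 0; i < n; i += 4096) sink = p[i];      // reads: map the shared zero page
    long after_read = minor_faults();

    for (std::size_t i = 0; i < n; i += 4096) p[i] = 1;         // writes: allocate + zero real pages
    long after_write = minor_faults();

    std::printf("minor faults: read pass %ld, write pass %ld\n",
                after_read - start, after_write - after_read);
    std::free(p);
}

On a typical system each pass reports up to roughly one fault per touched page (fault-around can lower the counts), but only the write pass forces distinct physical pages into RES.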


But no, you don't generally need to worry about it, other than general performance tuning. If you logically own some memory, you can ask read to put data into it. The libc wrapper isn't doing any special retrying there; all the magic (checking for the target page being present and treating it like a soft or hard page fault) happens inside the kernel's implementation of read, as part of copy_to_user.

(Basically a memcpy from kernel memory to user-space, with permission checking that can make it return -EFAULT if you pass the kernel a pointer that you don't even logically own, i.e. memory that would segfault if you touched it from user-space. Note that you don't get a SIGSEGV from read(0, NULL, 1), just an error. Use strace ./a.out to see the failing system call, as an alternative to actually implementing error checking in your code.)
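
As a tiny sketch of that last point (mine, not from the original answer):

#include <cerrno>
#include <cstdio>
#include <cstring>
#include <unistd.h>

int main() {
    // Run as e.g. `echo x | ./a.out` (or under strace) so the read doesn't
    // block waiting for terminal input before hitting the bad pointer.
    errno = 0;
    ssize_t r = read(0, nullptr, 1);   // NULL: memory this process doesn't own
    std::printf("read returned %zd, errno=%d (%s)\n", r, errno, std::strerror(errno));
    // Typically prints: read returned -1, errno=14 (Bad address) -- and no SIGSEGV.
}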

How to lazily allocate zeroed memory?

There are a few mechanisms to get pre-zeroed memory from the operating system:

mmap(2)'s MAP_ANONYMOUS flag forces the contents to be initialized to zero.
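
A minimal sketch of that (my example; error handling kept terse):

#include <cassert>
#include <sys/mman.h>

int main() {
    const long len = 1 << 20;   // 1MB
    void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    assert(p != MAP_FAILED);

    unsigned char* bytes = static_cast<unsigned char*>(p);
    assert(bytes[0] == 0);      // guaranteed zero, no user-space memset needed
    bytes[0] = 42;              // first write faults in a distinct physical page

    munmap(p, len);
}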

POSIX shared memory segments also come zero-initialized:

  • shm_open(3) provides you with a file descriptor
  • ftruncate(2) the "file" to the size you want
  • mmap(2) the "file" into your address space

The memory comes pre-zeroed:

    This volume of IEEE Std 1003.1-2001 specifies that memory objects have
    initial contents of zero when created. This is consistent with current
    behavior for both files and newly allocated memory. For those
    implementations that use physical memory, it would be possible that such
    implementations could simply use available memory and give it to the
    process uninitialized. This, however, is not consistent with standard
    behavior for the uninitialized data area, the stack, and of course,
    files. Finally, it is highly desirable to set the allocated memory to
    zero for security reasons. Thus, initializing memory objects to zero is
    required.
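
A sketch of the shm_open/ftruncate/mmap sequence from the list above (my example; the segment name "/lazy-zero-demo" is arbitrary, and you may need to link with -lrt on older glibc):

#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    const long len = 1 << 20;                                   // 1MB
    int fd = shm_open("/lazy-zero-demo", O_CREAT | O_RDWR, 0600);
    if (fd == -1) { std::perror("shm_open"); return 1; }
    if (ftruncate(fd, len) != 0) { std::perror("ftruncate"); return 1; }

    void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { std::perror("mmap"); return 1; }

    std::printf("first byte: %d\n", static_cast<unsigned char*>(p)[0]);  // prints 0: pre-zeroed

    munmap(p, len);
    close(fd);
    shm_unlink("/lazy-zero-demo");                              // remove the name again
}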

It appears that this memory is zeroed at first use; see the function shmem_zero_setup() in mm/shmem.c:

/**
 * shmem_zero_setup - setup a shared anonymous mapping
 * @vma: the vma to be mmapped is prepared by do_mmap_pgoff
 */
int shmem_zero_setup(struct vm_area_struct *vma)
{
    struct file *file;
    loff_t size = vma->vm_end - vma->vm_start;

    file = shmem_file_setup("dev/zero", size, vma->vm_flags);
    if (IS_ERR(file))
        return PTR_ERR(file);

    if (vma->vm_file)
        fput(vma->vm_file);
    vma->vm_file = file;
    vma->vm_ops = &shmem_vm_ops;
    return 0;
}

Does malloc lazily create the backing pages for an allocation on Linux (and other platforms)?

Linux does deferred page allocation, also known as 'optimistic memory allocation'. The memory you get back from malloc is not backed by anything, and when you touch it you may actually hit an OOM condition (if there is no physical memory or swap space available for the page), in which case a process is unceremoniously terminated by the OOM killer.

See for example http://www.linuxdevcenter.com/pub/a/linux/2006/11/30/linux-out-of-memory.html
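
As an illustrative sketch of mine (on a 64-bit system; the outcome depends on the vm.overcommit_memory sysctl, so treat this as a demonstration rather than guaranteed behaviour):

#include <cstdio>
#include <cstdlib>

int main() {
    const unsigned long huge = 1UL << 40;        // 1TB of address space
    void* p = std::malloc(huge);
    // With vm.overcommit_memory=1 this returns non-NULL even though the machine
    // cannot possibly back it; with the default heuristic it may be refused.
    std::printf("malloc(1TB) returned %p\n", p);
    // Actually writing through all of it is what would eventually trigger the OOM killer.
    std::free(p);
}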

Why is my custom allocator slower than the default allocator?

The default allocator (std::allocator) is typically implemented as a relatively thin wrapper around new and delete.
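
For reference, a minimal allocator in that spirit looks roughly like this (a sketch of the minimal C++11 allocator interface, not the actual library source; the name NewDeleteAllocator is mine):

#include <cstddef>
#include <list>
#include <new>

// Minimal C++11-style allocator: nothing but a typed pass-through to
// ::operator new / ::operator delete. std::allocator_traits fills in the rest.
template <class T>
struct NewDeleteAllocator {
    using value_type = T;

    NewDeleteAllocator() = default;
    template <class U> NewDeleteAllocator(const NewDeleteAllocator<U>&) {}

    T* allocate(std::size_t n) {
        return static_cast<T*>(::operator new(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t) { ::operator delete(p); }
};

template <class T, class U>
bool operator==(const NewDeleteAllocator<T>&, const NewDeleteAllocator<U>&) { return true; }
template <class T, class U>
bool operator!=(const NewDeleteAllocator<T>&, const NewDeleteAllocator<U>&) { return false; }

int main() {
    std::list<int, NewDeleteAllocator<int>> l;   // every node comes straight from operator new
    for (int i = 0; i < 10; ++i) l.push_back(i);
}

Since each allocate() here is little more than a call into an already heavily optimized operator new, a custom allocator has to do less work per call than this to come out ahead.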

The allocator in your example appears to be a hybrid sub-allocator / bump (increment) allocator. In summary, it allocates a chunk of memory from the system when its current memory is exhausted, then bump-allocates out of an available chunk.
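
Stripped of the details, the chunk/bump idea looks something like this (a sketch with names of my own choosing; it ignores alignment and thread-safety):

#include <cstddef>
#include <utility>
#include <vector>

// Grab a big chunk from the system, then hand out pieces by bumping an offset.
// Exhausted chunks are kept alive so previously returned pointers stay valid.
// Individual frees aren't supported; everything is released when the arena dies.
class BumpArena {
public:
    explicit BumpArena(std::size_t chunkSize) : chunkSize_(chunkSize) {}

    void* allocate(std::size_t n) {
        if (offset_ + n > current_.size()) {                      // current chunk exhausted?
            retired_.push_back(std::move(current_));              // keep it alive
            current_.assign(chunkSize_ > n ? chunkSize_ : n, 0);  // grab a fresh chunk
            offset_ = 0;
        }
        void* p = current_.data() + offset_;                      // bump within the chunk
        offset_ += n;
        return p;
    }

private:
    std::size_t chunkSize_;
    std::size_t offset_ = 0;
    std::vector<unsigned char> current_;
    std::vector<std::vector<unsigned char>> retired_;
};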

Among other things, consider:

  • It's not thread-safe. Concurrent access would eventually corrupt it. This doesn't matter for isolated profiling using a single thread, but is an important consideration nonetheless.
  • It manually manages memory all over the place. For example, Chunk manages memory yet doesn't have a destructor, requiring Chunk::release to be called to
    destroy it (i.e. in ~FixedAllocator()). Avoid manual memory management (even when writing allocators) by utilizing RAII:

    class Chunk
    {
        // private: not required, members of a class are private by default.
        friend class FixedAllocator;

        // Replaced init(...) with a constructor.
        Chunk(size_t blockSize, uchar blocks) :
            pData(new uchar[blockSize * blocks]),
            firstAvailableBlock(0),
            blocksAvailable(blocks)
        {
            uchar* p = pData.get();
            for (uchar i = 0; i != blocks; p += blockSize)
            {
                *p = ++i;
            }
        }

        Chunk(const Chunk& other) = delete; // Disable copy construction.
        Chunk(Chunk&& other) :
            pData(std::move(other.pData)),
            firstAvailableBlock(other.firstAvailableBlock),
            blocksAvailable(other.blocksAvailable)
        {
            other.firstAvailableBlock = 0;
            other.blocksAvailable = 0;
        }

        Chunk& operator=(const Chunk& other) = delete; // Disable copy assignment.
        Chunk& operator=(Chunk&& other)
        {
            pData = std::move(other.pData);
            firstAvailableBlock = other.firstAvailableBlock;
            blocksAvailable = other.blocksAvailable;
            other.firstAvailableBlock = 0;
            other.blocksAvailable = 0;
            return *this;
        }

        //...
        void release()
        {
            pData.reset();
        }
        //...

        std::unique_ptr<uchar[]> pData; // Automatically deleted in the implicitly generated destructor.
        uchar firstAvailableBlock, blocksAvailable;
    };

    // And of course don't forget to update chunk creation:
    //...
    Chunk newChunk(blockSize, blocks);
    chunks.push_back(std::move(newChunk));
    //...
  • Chunk::hasBlock doesn't account for holes. If you were to allocate 10 bytes/5 bytes/10 bytes, then later deallocate the 5 byte block, hasBlock would return false for ranges within the 5 byte block, even though that space is actually available. Properly fixing that requires a system to track allocations.

It's slower because it's doing more overall work than the typical std::allocator implementation.

  • The small object size is set to sizeof(int), which is most likely 4. The size of a std::list node is at least 12 bytes (back pointer: 4-8, forward
    pointer: 4-8, object: 4+). Thus, at least with the list nodes, SmallObjAllocator::allocate() and SmallObjAllocator::deallocate() won't call new or
    delete, instead always calling FixedAllocator::allocate() and FixedAllocator::deallocate().

  • FixedAllocator::allocate() and FixedAllocator::deallocate() are slow. They both execute a linear search, which in the worst case means they iterate over all chunks. Even in the average case, a lot of time is spent in the allocator instead of your program. Optimizing those two functions will yield the most results.

  • The blockSize of your allocator is set to sizeof(int) * 10000 (probably 40KB). Therefore, 10k insertions into an std::list<int> require at least 120KB (sizeof(node) * 10000), so it's likely FixedAllocator resizes at least twice in your example (assuming a doubling resize policy). You can eliminate resizing by setting blockSize high enough that a resize is never required.

    Allocator<int, 100000>(100k) should be more than enough for your example.

Allocators are a very complex subject, and honestly there are too many details to go over to fully explain how to optimize your example without writing a short novel. I recommend reading up on allocator design and studying allocators used in the real world to gain a better understanding of the subject.

See:

  • The Art & Science of Memory Allocation
  • Memory management
  • Memory allocators 101
  • Memory allocators wiki
  • How to allocate memory
  • tcmalloc
  • jemalloc
  • rpmalloc

