Allocating a Large Memory Block in C++

Allocating a large memory block in C++

You forgot one dimension, and the overhead of allocating memory. The shown code allocates memory very inefficiently in the third dimension, resulting in way too much overhead.

float*** a = new float**[N];

This will allocate, roughly 22000 * sizeof(float **), which is rougly 176kb. Negligible.

a[m] = new float*[M - 1];

A single allocation here will be for 44099 * sizeof(float *), but you will grab 22000 of these. 22000 * 44099 * sizeof(float *), or roughly 7.7gb of additional memory. This is where you stopped counting, but your code isn't done yet. It's got a long ways to go.

a[m][n] = new float[2];

This is a single allocation of 8 bytes, but this allocation will be done 22000 * 44099 times. That's another 7.7gb flushed down the drain. You're now over 15 gigs of application-required memory, roughly, that needs to be allocated.

But each allocation does not come free, and new float[2] requires more than 8 bytes. Each individually allocated block must be tracked internally by your C++ library, so that it can be recycled by delete. The most simplistic link-list based implementation of heap allocation requires one forward pointer, one backward pointer, and the count of how many bytes are there in the allocated block. Assuming nothing needs to be padded for alignment purposes, this is at least 24 bytes of overhead per allocation, on a 64-bit platform.

Now, since your third dimension makes 22000 * 44099 allocations, 22000 allocations for the second dimension, and one allocation for the first dimension: if I count on my fingers, this will require (22000 * 44099 + 22000 + 1) * 24, or another 22 gigabytes of memory, just to consume the overhead of the most simple, basic memory allocation scheme.

We're now up to about 38 gigabytes of RAM needed using the most simple, possible, heap allocation tracking, if I did my math right. Your C++ implementation is likely to use a slightly more sophisticated heap allocation logic, with larger overhead.

Get rid of the new float[2]. Compute your matrix's size, and new a single 7.7gb chunk, then calculate where the rest of your pointers should be pointing to. Also, allocate a single chunk of memory for the second dimension of your matrix, and compute the pointers for the first dimension.

Your allocation code should execute exactly three new statements. One for the first dimension pointer, One for the second dimension pointers. And one more for the huge chunk of data that comprises your third dimension.

Allocate memory for different types in one block

If you allocate and free lots of these structures, the savings could add up to something that's significant.

It uses a little less space, since each allocated block has some bookkeeping that records its size. It also may reduce memory fragmentation.

Also, doing this ensures that the a and b arrays are close together in memory. If they're often used together, this can improve cache hits.

Library implementers often do these micro-optimizations because they can't predict how the library will be used, and they want to work as well as possible in all cases. When you're writing your own application, you have a better idea of which code will be in inner loops that are significant for performance tuning.

Allocating large blocks of memory with new

With respect to new in C++/GCC/Linux(32bit)...

It's been a while, and it's implementation dependent, but I believe new will, behind the scenes, invoke malloc(). Malloc(), unless you ask for something exceeding the address space of the process, or outside of specified (ulimit/getrusage) limits, won't fail. Even when your system doesn't have enough RAM+SWAP. For example: malloc(1gig) on a system with 256Meg of RAM + 0 SWAP will, I believe, succeed.

However, when you go use that memory, the kernel supplies the pages through a lazy-allocation mechanism. At that point, when you first read or write to that memory, if the kernel cannot allocate memory pages to your process, it kills your process.

This can be a problem on a shared computer, when your colleague has a slow core leak. Especially when he starts knocking out system processes.

So the fact that you are seeing std::bad_alloc exceptions is "interesting".

Now new will run the constructor on the allocated memory, touching all those memory pages before it returns. Depending on implementation, it might be trapping the out-of-memory signal.

Have you tried this with plain o'l malloc?

Have you tried running the "free" program? Do you have enough memory available?

As others have suggested, have you checked limit/ulimit/getrusage() for hard & soft constraints?

What does your code look like, exactly? I'm guessing new ClassFoo [ N ]. Or perhaps new char [ N ].

What is sizeof(ClassFoo)? What is N?

Allocating 64*288000 (17.58Meg) should be trivial for most modern machines... Are you running on an embedded system or something otherwise special?

Alternatively, are you linking with a custom new allocator? Does your class have its own new allocator?

Does your data structure (class) allocate other objects as part of its constructor?

Has someone tampered with your libraries? Do you have multiple compilers installed? Are you using the wrong include or library paths?

Are you linking against stale object files? Do you simply need to recompile your all your source files?

Can you create a trivial test program? Just a couple lines of code that reproduces the bug? Or is your problem elsewhere, and only showing up here?

For what it's worth, I've allocated over 2gig data blocks with new in 32bit linux under g++. Your problem lies elsewhere.

Allocate a large block instead of many small?

Contrary to some of the comments posted so far, if you are tight on memory, you should allocate a single large block. This is because each malloc() you do has some overhead which is more or less fixed. This overhead will be a few bytes per allocation, so many small allocations could have you losing half your memory to overhead.

If you care a lot about performance (speed), you should also use a single allocation. It will improve locality and cache utilization, and reduce the number of system calls during startup and also teardown.

What is the correct way to allocate and use an untyped memory block in C++?

The global allocation functions

To allocate an arbitrary (untyped) block of memory, the global allocation functions (§3.7.4/2);

void* operator new(std::size_t);
void* operator new[](std::size_t);

Can be used to do this (§3.7.4.1/2).

§3.7.4.1/2

The allocation function attempts to allocate the requested amount of storage. If it is successful, it shall return the address of the start of a block of storage whose length in bytes shall be at least as large as the requested size. There are no constraints on the contents of the allocated storage on return from the allocation function. The order, contiguity, and initial value of storage allocated by successive calls to an allocation function are unspecified. The pointer returned shall be suitably aligned so that it can be converted to a pointer of any complete object type with a fundamental alignment requirement (3.11) and then used to access the object or array in the storage allocated (until the storage is explicitly deallocated by a call to a corresponding deallocation function).

And 3.11 has this to say about a fundamental alignment requirement;

§3.11/2

A fundamental alignment is represented by an alignment less than or equal to the greatest alignment supported by the implementation in all contexts, which is equal to alignof(std::max_align_t).

Just to be sure on the requirement that the allocation functions must behave like this;

§3.7.4/3

Any allocation and/or deallocation functions defined in a C++ program, including the default versions in the library, shall conform to the semantics specified in 3.7.4.1 and 3.7.4.2.

Quotes from C++ WD n4527.

Assuming the 8-byte alignment is less than the fundamental alignment of the platform (and it looks like it is, but this can be verified on the target platform with static_assert(alignof(std::max_align_t) >= 8)) - you can use the global ::operator new to allocate the memory required. Once allocated, the memory can be segmented and used given the size and alignment requirements you have.

An alternative here is the std::aligned_storage and it would be able to give you memory aligned at whatever the requirement is.

typename std::aligned_storage<sizeof(T), alignof(T)>::type buffer[100];

From the question, I assume here that the both the size and alignment of T would be 8.

A sample of what the final memory block could look like is (basic RAII included);

struct DataBlock {
    const std::size_t element_count;
    static constexpr std::size_t element_size = 8;
    void * data = nullptr;
    explicit DataBlock(size_t elements) : element_count(elements)
    {
        data = ::operator new(elements * element_size);
    }
    ~DataBlock()
    {
        ::operator delete(data);
    }
    DataBlock(DataBlock&) = delete; // no copy
    DataBlock& operator=(DataBlock&) = delete; // no assign
    // probably shouldn't move either
    DataBlock(DataBlock&&) = delete;
    DataBlock& operator=(DataBlock&&) = delete;

    template <class T>
    T* get_location(std::size_t index)
    {
        // https://stackoverflow.com/a/6449951/3747990
        // C++ WD n4527 3.9.2/4
        void* t = reinterpret_cast<void*>(reinterpret_cast<unsigned char*>(data) + index*element_size);
        // 5.2.9/13
        return static_cast<T*>(t);

        // C++ WD n4527 5.2.10/7 would allow this to be condensed
        //T* t = reinterpret_cast<T*>(reinterpret_cast<unsigned char*>(data) + index*element_size);
        //return t;
    }
};
// ....
DataBlock block(100);

I've constructed more detailed examples of the DataBlock with suitable template construct and get functions etc., live demo here and here with further error checking etc..

A note on the aliasing

It does look like there are some aliasing issues in the original code (strictly speaking); you allocate memory of one type and cast it to another type.

It may probably work as you expect on your target platform, but you cannot rely on it. The most practical comment I've seen on this is;

"Undefined behaviour has the nasty result of usually doing what you think it should do, until it doesn’t” - hvd.

The code you have probably will work. I think it is better to use the appropriate global allocation functions and be sure that there is no undefined behaviour when allocating and using the memory you require.

Aliasing will still be applicable; once the memory is allocated - aliasing is applicable in how it is used. Once you have an arbitrary block of memory allocated (as above with the global allocation functions) and the lifetime of an object begins (§3.8/1) - aliasing rules apply.

What about `std::allocator`?

Whilst the std::allocator is for homogenous data containers and what your are looking for is akin to heterogeneous allocations, the implementation in your standard library (given the Allocator concept) offers some guidance on raw memory allocations and corresponding construction of the objects required.

Difference in memory block layout allocated by malloc and calloc?

In practice they do the same thing. The advantage of calloc is that good implementations will perform an overflow detection when doing the multiplication needed to determine how much memory you need.

If you do it something like this:

void *
calloc(size_t nmemb, size_t size)
{
    size_t sz = nmemb * size;
    void *res = malloc(sz);

'sz' might not end up being what you expect it to be. malloc will then allocate much less than the caller expected, but the caller can end up treating the returned area as large enough. This leads to heap overflows will all the security implications that usually has.

Allocating a Large Memory Block in C++