Does Linux malloc() behave differently on ARM vs x86?

Does Linux malloc() behave differently on ARM vs x86?

A little background

malloc() doesn't lie; your kernel's virtual memory subsystem does, and this is common practice on most modern operating systems. When you use malloc(), what's really happening is something like this:

  1. The libc implementation of malloc() checks its internal state and tries to optimize your request using a variety of strategies (reusing a preallocated chunk, allocating more memory than requested in advance...). This means the implementation affects performance and slightly changes the amount of memory requested from the kernel, but this is not really relevant when checking the "big numbers", like you're doing in your tests.

  2. If there's no space in a preallocated chunk of memory (remember, chunks of memory are usually pretty small, on the order of 128KB to 1MB), it will ask the kernel for more memory. The actual syscall varies from one kernel to another (mmap(), vm_allocate()...) but its purpose is mostly the same (the sketch after this list shows where this boundary is crossed).

  3. The VM subsystem of the kernel will process the request, and if it finds it to be "acceptable" (more on this subject later), it will create a new entry in the memory map of the requesting task (I'm using UNIX terminology, where task is a process with all its state and threads), and return the starting value of said map entry to malloc().

  4. malloc() will take note of the newly allocated chunk of memory, and will return the appropriate answer to your program.
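
To see step 2 in action, here is a minimal sketch (glibc on Linux assumed; the 128 KiB M_MMAP_THRESHOLD default is glibc-specific) that you can run under strace to watch the syscall boundary being crossed:

/* Sketch (glibc on Linux assumed): small requests are served from the heap,
 * which glibc grows with brk()/mmap() as needed; requests above the mmap
 * threshold (M_MMAP_THRESHOLD, 128 KiB by default) usually get a dedicated
 * mmap(). Run this under strace to see the syscalls behind malloc(). */
#include <stdlib.h>

int main(void)
{
    void *small = malloc(64);           /* served from the heap arena */
    void *large = malloc(1024 * 1024);  /* usually a separate mmap()  */

    free(large);
    free(small);
    return 0;
}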

OK, so now your program has successfully malloc'ed some memory, but the truth is that not a single page (4KB on x86) of physical memory has actually been allocated to your request yet (well, this is an oversimplification, as collaterally some pages could have been used to store info about the state of the memory pool, but it makes it easier to illustrate the point).

So, what happens when you try to access this recently malloc'ed memory? A page fault. Surprisingly, this is a relatively little-known fact, but your system is generating page faults all the time. Your program is interrupted, the kernel takes control, checks that the faulting address corresponds to a valid map entry, takes one or more physical pages and links them to the task's map.

If your program tries to access an address that is not inside any map entry of your task, the kernel will not be able to resolve the fault, and will send it a SIGSEGV signal (or use the equivalent mechanism on non-UNIX systems) pointing out the problem. If the program doesn't handle that signal itself, it will be killed with the infamous Segmentation Fault error.

So physical memory is not allocated when you call malloc(), but when you actually access that memory. This allows the OS to do some nifty tricks like disk paging, ballooning and overcommitting.

This way, when you ask how much memory a specific process is using, you need to look at two different numbers:

  • Virtual Size: The amount of memory that has been requested, even if it's not actually used.

  • Resident Size: The memory the process is really using, backed by physical pages (the sketch below prints both numbers for the current process).
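
The following sketch (Linux assumed; it reads VmSize and VmRSS from /proc/self/status, and the 256 MiB size is arbitrary) prints both numbers before and after a large malloc(), and again after the memory is touched:

/* Minimal sketch (Linux assumed): VmSize grows when malloc() is called,
 * VmRSS only grows once the memory is actually touched. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void print_vm(const char *label)
{
    char line[256];
    FILE *f = fopen("/proc/self/status", "r");

    if (f == NULL)
        return;
    printf("-- %s\n", label);
    while (fgets(line, sizeof line, f))
        if (strncmp(line, "VmSize:", 7) == 0 || strncmp(line, "VmRSS:", 6) == 0)
            fputs(line, stdout);
    fclose(f);
}

int main(void)
{
    size_t len = 256 * 1024 * 1024;   /* 256 MiB, an arbitrary test size */
    char *p;

    print_vm("before malloc");
    p = malloc(len);
    if (p == NULL)
        return 1;
    print_vm("after malloc (virtual size jumps, resident size barely moves)");
    memset(p, 0xAA, len);             /* touching the pages faults them in */
    print_vm("after touching the memory (resident size catches up)");
    printf("p[0] = 0x%02x\n", (unsigned char)p[0]);  /* keep the writes live */
    free(p);
    return 0;
}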

How much overcommit is enough?

In computing, resource management is a complex issue. You have a wide range of strategies, from the strictest capability-based systems to the much more relaxed behavior of kernels like Linux (with vm.overcommit_memory == 0), which will basically allow you to request memory up to the maximum map size allowed for a task (a limit that depends on the architecture).

In the middle, you have OSes like Solaris (mentioned in your article), which limit the amount of virtual memory for a task to the sum of (physical pages + swap disk pages). But don't be fooled by the article you referenced: this is not always a good idea. If you're running a Samba or Apache server with hundreds to thousands of independent processes at the same time (which leads to a lot of wasted virtual memory due to fragmentation), you'll have to configure a ridiculous amount of swap space, or your system will run out of virtual memory while still having plenty of free RAM.

But why does memory overcommit work differently on ARM?

It doesn't. At least it shouldn't, but ARM vendors have an insane tendency to introduce arbitrary changes to the kernels they distribute with their systems.

In your test case, the x86 machine is working as expected. As you're allocating memory in small chunks and have vm.overcommit_memory set to 0, you're allowed to fill your entire virtual address space, which is somewhere around the 3GB mark because you're running on a 32-bit machine (if you tried this on 64 bits, the loop would run until n==N). Obviously, when you try to use that memory, the kernel detects that physical memory is getting scarce and activates the OOM killer countermeasure.

On ARM it should be the same. Since it isn't, two possibilities come to mind:

  1. overcommit_memory is set to the NEVER (2) policy, perhaps because someone has forced it that way in the kernel.

  2. You're reaching the maximum allowed map size for the task.

As you get different values for the malloc phase on each ARM run, I would discard the second option. Make sure overcommit_memory is enabled (value 0) and rerun your test. If you have access to those kernel sources, take a look at them to make sure the kernel honors this sysctl (as I said, some ARM vendors like to do nasty things to their kernels).
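
A quick way to verify the setting from within the environment you are testing is to read the sysctl directly; a minimal sketch (Linux; value meanings per proc(5)):

/* Minimal sketch: print the current overcommit policy before rerunning the
 * test. 0 = heuristic (the default), 1 = always overcommit, 2 = never. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/sys/vm/overcommit_memory", "r");
    int policy;

    if (f == NULL) {
        perror("/proc/sys/vm/overcommit_memory");
        return 1;
    }
    if (fscanf(f, "%d", &policy) == 1)
        printf("vm.overcommit_memory = %d\n", policy);
    fclose(f);
    return 0;
}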

As a reference, I've run demo3 under QEMU emulating versatilepb and on an Efika MX (i.MX515). The first one stopped malloc'ing at the 3 GB mark, as expected on a 32-bit machine, while the other did so earlier, at 2 GB. This may come as a surprise, but if you take a look at its kernel config (https://github.com/genesi/linux-legacy/blob/master/arch/arm/configs/mx51_efikamx_defconfig), you'll see this:

CONFIG_VMSPLIT_2G=y
# CONFIG_VMSPLIT_1G is not set
CONFIG_PAGE_OFFSET=0x80000000

The kernel is configured with a 2GB/2GB split, so the system is behaving as expected.

Does malloc lazily create the backing pages for an allocation on Linux (and other platforms)?

Linux does deferred page allocation, a.k.a. 'optimistic memory allocation'. The memory you get back from malloc() is not backed by anything, and when you touch it you may actually hit an OOM condition (if there is no swap space for the page you request), in which case a process is unceremoniously terminated.

See for example http://www.linuxdevcenter.com/pub/a/linux/2006/11/30/linux-out-of-memory.html
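
If deferred allocation is a problem for your use case (for example, you would rather fail at allocation time than be OOM-killed later), one Linux-specific option is to pre-fault the pages up front; a sketch using mmap() with MAP_POPULATE (the 64 MiB size is arbitrary):

/* Sketch (Linux-specific): MAP_POPULATE asks the kernel to fault the pages in
 * up front, so the program pays the cost at allocation time instead of being
 * surprised (or OOM-killed) when it first touches the memory later. */
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 64 * 1024 * 1024;   /* 64 MiB, an arbitrary example size */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);

    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    /* The pages backing p have already been faulted in at this point. */
    munmap(p, len);
    return 0;
}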

Allocating more memory than there exists using malloc

It is called memory overcommit. You can disable it by running as root:

 echo 2 > /proc/sys/vm/overcommit_memory

and it is not a kernel feature that I like (so I always disable it). See malloc(3), mmap(2) and proc(5).

NB: echo 0 instead of echo 2 often (but not always) works as well. Read the docs (in particular the proc man page I just linked to).

Mindset difference between workstation and embedded programmers

Funny that you mention malloc() specifically in your example.

In every hard real-time, deeply embedded system that I've worked on, memory allocation is managed specially (usually not via the heap, but with fixed memory pools or something similar)... and also, whenever possible, all memory allocation is done up front during initialization. This is surprisingly easier than most people would believe.

malloc() is vulnerable to fragmentation, is non-deterministic, and doesn't discriminate between memory types. With memory pools, you can have pools that are located in / pull from super-fast SRAM, fast DRAM, battery-backed RAM (I've seen it), etc...

There are a hundred other issues (in answer to your original question), but memory allocation is a big one.
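
As a rough illustration of the fixed-pool approach, here is a minimal sketch; the block size, block count and pool_* names are made up for the example, and thread safety is ignored:

/* Minimal sketch of a fixed-block pool: all storage is reserved up front,
 * allocation and release are O(1), and there is no fragmentation. */
#include <stddef.h>

#define BLOCK_SIZE  64
#define BLOCK_COUNT 32

static _Alignas(max_align_t) unsigned char pool_storage[BLOCK_COUNT][BLOCK_SIZE];
static void *pool_free_list[BLOCK_COUNT];
static size_t pool_free_top;

void pool_init(void)
{
    for (size_t i = 0; i < BLOCK_COUNT; i++)
        pool_free_list[i] = pool_storage[i];
    pool_free_top = BLOCK_COUNT;
}

void *pool_alloc(void)                 /* returns NULL when the pool is exhausted */
{
    return pool_free_top ? pool_free_list[--pool_free_top] : NULL;
}

void pool_release(void *block)         /* caller must pass back a pool block */
{
    pool_free_list[pool_free_top++] = block;
}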

Also:

  • Respect for / knowledge of the hardware platform
  • Not automatically assuming the hardware is perfect or even functional
  • Awareness of certain language aspects & features (e.g., exceptions in C++) that can cause things to go sideways quickly
  • Awareness of CPU loading and memory utilization
  • Awareness of interrupts, pre-emption, and the implications on shared data (where absolutely necessary -- the less shared data, the better)
  • Most embedded systems are data/event driven, as opposed to polled; there are exceptions of course
  • Most embedded developers are pretty comfortable with the concept of state machines and stateful behavior/modeling

Different behaviors depending on architecture

You're allocating 1 byte for 4 ints:

bla_thr = malloc(1);

bla_thr->a=1;
bla_thr->b=2;
bla_thr->c=3;
bla_thr->d=4;

This invokes undefined behaviour so anything can happen. The bug is in your code, not libc. If you allocate enough space with:

bla_thr = malloc(sizeof *bla_thr); // == sizeof(struct bla);

it should work. Don't forget to free() the memory afterwards!
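
For completeness, a self-contained version of the corrected code; the definition of struct bla is an assumption (four ints, as implied by the question):

/* Assumed definition of struct bla: four ints, as the question implies. */
#include <stdio.h>
#include <stdlib.h>

struct bla {
    int a, b, c, d;
};

int main(void)
{
    struct bla *bla_thr = malloc(sizeof *bla_thr);   /* == sizeof(struct bla) */

    if (bla_thr == NULL)
        return 1;
    bla_thr->a = 1;
    bla_thr->b = 2;
    bla_thr->c = 3;
    bla_thr->d = 4;
    printf("%d %d %d %d\n", bla_thr->a, bla_thr->b, bla_thr->c, bla_thr->d);
    free(bla_thr);
    return 0;
}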

Do memory allocation functions indicate that the memory content is no longer used?

No.

The cache operation you mention (marking cached memory as unused and discarding without writeback to main memory) is called cacheline invalidation without writeback. This is performed through a special instruction with an operand that may (or may not) indicate the address of the cacheline to be invalidated.

On all architectures I'm familiar with, this instruction is privileged, with good reason in my opinion. This means that usermode code cannot employ the instruction; only the kernel can. The amount of perverted trickery, data loss and denial of service that would be possible otherwise is incredible.

As a result, no memory allocator could do what you propose; they simply don't have (in usermode) the tools to do so.

Architectural Support

  • The x86 and x86-64 architectures have the privileged invd instruction, which invalidates all internal caches without writeback and directs external caches to invalidate themselves also. This is the only instruction capable of invalidating without writeback, and it is a blunt weapon indeed.

    • The non-privileged clflush instruction specifies a victim address, but it writes back before invalidating, so I mention it only in passing.
    • Documentation for all these instructions is in Intel's SDMs, Volume 2.
  • The ARM architecture performs cache invalidation without writeback with a write to coprocessor 15, register 7: MCR p15, 0, <Rd>, c7, <CRm>, <Opcode_2>. A victim cacheline may be specified. Writes to this register are privileged.
  • PowerPC has dcbi, which lets you specify a victim, dci which doesn't and instruction-cache versions of both, but all four are privileged (see page 1400).
  • MIPS has the CACHE instruction which can specify a victim. It was privileged as of MIPS Instruction Set v5.04, but in 6.04 Imagination Technologies muddied the water and it's no longer clear what's privileged and what not.

So this excludes the use of cache invalidation without flushing/writing back in usermode outright.
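
To make the contrast concrete, here is roughly what usermode can do on x86: flush (write back, then invalidate) a line with clflush via the _mm_clflush intrinsic. This is a sketch, not something an allocator should do on free(); the 64-byte line size is an assumption for the example:

/* Sketch of the one unprivileged option mentioned above: clflush via the
 * _mm_clflush intrinsic (x86/x86-64, SSE2). It writes the line back first and
 * then invalidates it; it cannot discard dirty data. */
#include <stddef.h>
#include <emmintrin.h>

#define ASSUMED_LINE_SIZE 64   /* assumption for the example */

static char buffer[256];

void flush_buffer(void)
{
    for (size_t off = 0; off < sizeof buffer; off += ASSUMED_LINE_SIZE)
        _mm_clflush(buffer + off);
    _mm_mfence();   /* order the flushes with respect to later memory operations */
}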

Kernel mode?

However, I'd argue that it's still a bad idea in kernelmode for numerous reasons:

  • Linux's allocator, kmalloc(), allocates out of arenas for different allocation sizes. In particular, it has an arena for each allocation size <= 192 bytes, in steps of 8; this means that objects can potentially be closer to each other than a cacheline, or partially overlap the next one, and using invalidation could thus blow out nearby objects that were rightly in cache and not yet written back. This is wrong.

    • This problem is compounded by the fact that cachelines may be quite large (64 bytes on x86-64), and moreover are not necessarily uniform in size throughout the cache hierarchy. For instance, the Pentium 4 had 64 B L1 cachelines but 128 B L2 cachelines. (The sketch after this list shows how to query these sizes at run time.)
  • It makes the deallocation time linear in the number of cachelines of the object to deallocate.
  • It has a very limited benefit; the size of the L1 cache is usually in the KBs, so a few thousand flushes will fully empty it. Moreover, the cache may already have flushed the data without your prompting, so your invalidation is worse than useless: the memory bandwidth was used, but you no longer have the line in cache, so the next time it is partially written it will need to be refetched.
  • The next time the memory allocator returns that block, which might be soon, the user thereof will suffer a guaranteed cache miss and fetch from main RAM, while he could have had a dirty unflushed line or a clean flushed line instead. The cost of a guaranteed cache miss and fetch from main RAM is much greater than a cache line flush without invalidation tucked in somewhere automatically and intelligently by the caching hardware.
  • The additional code required to loop and flush these lines wastes instruction cache space.
  • A better use for the dozens of cycles taken by aforementioned loop to invalidate cachelines would be to keep doing useful work, while letting the cache and memory subsystem's considerable bandwidth write back your dirty cachelines.

    • My modern Haswell processor has 32 bytes / clock cycle write L1 bandwidth and 25GB/s main RAM bandwidth. I'm sure a couple extra flushable 32-byte cachelines can be squeezed in somewhere in there.
  • Lastly, for short-lived, small allocations like that, there's the option of allocating it on the stack.
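
As noted in the list above, cacheline sizes vary across CPUs and cache levels. A small sketch to query them at run time; the _SC_LEVEL*_ sysconf names are glibc extensions and may report 0 where the information is unavailable:

/* Sketch: query the cacheline sizes mentioned above at run time (glibc). */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long l1 = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    long l2 = sysconf(_SC_LEVEL2_CACHE_LINESIZE);

    printf("L1 data cacheline: %ld bytes\n", l1);
    printf("L2 cacheline:      %ld bytes\n", l2);
    return 0;
}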

Actual memory allocator practice

  • The famed dlmalloc does not invalidate freed memory.
  • glibc does not invalidate freed memory.
  • jemalloc does not invalidate freed memory.
  • musl-libc's malloc() does not invalidate freed memory.

None of them invalidate memory, because they can't. Doing a system call for the sake of invalidating cachelines would be incredibly slow and would cause far more traffic in and out of the cache, just because of the context switch.

System call for different Hardware Architecture?

System calls depend on the operating system as well as on the architecture. In most cases, your program has to be recompiled if the architecture or operating system is different.

For example, the sbrk and brk system calls, which are used by malloc() for dynamic memory allocation, are not available on Windows.
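
As a tiny illustration of such an architecture/OS-dependent call, here is sbrk() on Linux; it is obsolete for application code and is shown only to make the porting point (there is no direct Windows equivalent):

/* Sketch (Linux/POSIX): sbrk() moves the program break, the mechanism brk-based
 * allocators build on. Not available on Windows. */
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    void *before = sbrk(0);        /* current program break */

    if (sbrk(4096) == (void *)-1)  /* grow the data segment by one page */
        return 1;
    printf("break moved from %p to %p\n", before, sbrk(0));
    return 0;
}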

For how malloc is implemented on Windows, refer to Windows memory allocation questions.

There are two kinds of system calls: machine-architecture-independent and machine-architecture-dependent ones.

If you use only the architecture-independent ones, then porting is much less of a worry.

Coming to the answer to your question: it depends on which system calls you used, but recompiling is a must.

Why use _mm_malloc? (as opposed to _aligned_malloc, aligned_alloc, or posix_memalign)

Intel compilers support POSIX (Linux) and non-POSIX (Windows) operating systems, hence cannot rely upon either the POSIX or the Windows function. Thus, a compiler-specific but OS-agnostic solution was chosen.

C11 is a great solution but Microsoft doesn't even support C99 yet, so who knows if they will ever support C11.

Update: Unlike the C11/POSIX/Windows allocation functions, the ICC intrinsics include a deallocation function. This allows this API to use a separate heap manager from the default one. I don't know if/when it actually does that, but it can be useful to support this model.
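
For reference, a minimal usage sketch of the pair; _mm_malloc()/_mm_free() come from <xmmintrin.h> and are also provided by GCC and Clang, and the 64-byte alignment below is just an example:

/* Sketch: the compiler-provided aligned allocation pair. Memory obtained with
 * _mm_malloc() must be released with _mm_free(), not free(). */
#include <xmmintrin.h>

int main(void)
{
    /* 1024 bytes aligned to a 64-byte boundary, e.g. for wide vector loads */
    float *buf = _mm_malloc(1024, 64);

    if (buf == NULL)
        return 1;
    /* ... use buf ... */
    _mm_free(buf);   /* must not be passed to free() */
    return 0;
}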

Disclaimer: I work for Intel but have no special knowledge of these decisions, which happened long before I joined the company.


