Linux Overcommit Heuristic

Why does my memory-hungry code work on macOS but not on Linux, even though the Linux box has more memory?

From the Linux kernel documentation on overcommit, with emphasis added:

The Linux kernel supports the following overcommit handling modes

0 - Heuristic overcommit handling. *Obvious overcommits of
address space are refused.* Used for a typical system. It
ensures a seriously wild allocation fails while allowing
overcommit to reduce swap usage. root is allowed to
allocate slightly more memory in this mode. This is the
default.

1 - Always overcommit. Appropriate for some scientific
applications. Classic example is code using sparse arrays
and just relying on the virtual memory consisting almost
entirely of zero pages.

(...)

You can change to mode #1 with sudo sysctl vm.overcommit_memory=1
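
A quick way to see the heuristic at work is to ask for an obviously absurd amount of virtual memory. Below is a minimal C sketch (the 16 TiB figure is an arbitrary choice assumed to exceed RAM + swap on a typical 64-bit box): under the default mode 0 the request is usually refused outright, while under mode 1 it is granted because no pages are actually touched.

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t huge = (size_t)1 << 44;   /* 16 TiB of virtual address space */
    void *p = malloc(huge);          /* only a reservation, nothing is touched */

    if (p == NULL)
        printf("refused: the overcommit heuristic (or mode 2) said no\n");
    else
        printf("granted: the kernel overcommitted %zu bytes\n", huge);

    free(p);                         /* free(NULL) is also fine */
    return 0;
}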

What is the vm.overcommit_ratio in Linux?

I believe the default for vm.overcommit_memory is 0 and not 2. Is the overcommit_ratio only relevant to mode 2? I assume yes, but I'm not entirely sure.

From https://www.kernel.org/doc/Documentation/vm/overcommit-accounting

0 - Heuristic overcommit handling. Obvious overcommits of address
space are refused. Used for a typical system. It ensures a seriously
wild allocation fails while allowing overcommit to reduce swap
usage. root is allowed to allocate slightly more memory in this
mode. This is the default.

1 - Always overcommit. Appropriate for some scientific applications.
Classic example is code using sparse arrays and just relying on the
virtual memory consisting almost entirely of zero pages.

2 - Don't overcommit. The total address space commit for the system
is not permitted to exceed swap + a configurable amount (default is
50%) of physical RAM. Depending on the amount you use, in most
situations this means a process will not be killed while accessing
pages but will receive errors on memory allocation as appropriate.

Instead of free -g, which rounds down to whole gigabytes (and so can show zero), you might want to use free -m or plain free to get more precise numbers.

This might be interesting as well:

cat /proc/meminfo|grep Commit
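
Committed_AS there is the total amount of memory the kernel has promised to processes so far, and CommitLimit is the ceiling enforced in mode 2, where it is derived from swap plus overcommit_ratio percent of RAM (so yes, vm.overcommit_ratio only matters in mode 2; its default is 50). A minimal C sketch of that arithmetic, reading /proc/meminfo and ignoring huge-page reservations for brevity (meminfo_kb is just a helper written for this example):

#include <stdio.h>
#include <string.h>

/* Return the value in kB of a /proc/meminfo field such as "MemTotal:". */
static long meminfo_kb(const char *key)
{
    FILE *f = fopen("/proc/meminfo", "r");
    char line[128];
    long val = -1;

    if (!f)
        return -1;
    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, key, strlen(key)) == 0) {
            sscanf(line + strlen(key), "%ld", &val);
            break;
        }
    }
    fclose(f);
    return val;
}

int main(void)
{
    long ram_kb  = meminfo_kb("MemTotal:");
    long swap_kb = meminfo_kb("SwapTotal:");
    long ratio   = 50;   /* assumed default vm.overcommit_ratio */

    printf("mode-2 CommitLimit would be ~%ld kB\n", swap_kb + ram_kb * ratio / 100);
    printf("CommitLimit reported by the kernel:  %ld kB\n", meminfo_kb("CommitLimit:"));
    printf("Committed_AS (promises made so far): %ld kB\n", meminfo_kb("Committed_AS:"));
    return 0;
}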

Malloc on Linux without overcommitting

How can I allocate memory on Linux without overcommitting?

That is a loaded question, or at least an incorrect one. The question is based on an incorrect assumption, which makes answering the stated question irrelevant at best, misleading at worst.

Memory overcommitment is a system-wide policy -- because it determines how much virtual memory is made available to processes -- and not something a process can decide for itself.

It is up to the system administrator to determine whether memory is overcommitted or not. In Linux, the policy is quite tunable (see e.g. /proc/sys/vm/overcommit_memory in man 5 proc). There is nothing a process can do during allocation that would affect the memory overcommit policy.
 

The OP also seems interested in making their processes immune to the out-of-memory killer (OOM killer) in Linux. (The OOM killer is a mechanism Linux uses to relieve memory pressure by killing processes, and thus releasing their resources back to the system.)

This too is an incorrect approach, because the OOM killer is a heuristic process, whose purpose is not to "punish or kill badly behaving processes", but to keep the system operational. This facility is also quite tunable in Linux, and the system admin can even tune the likelihood of each process being killed in high memory pressure situations. Other than the amount of memory used by a process, it is not up to the process to affect whether the OOM killer will kill it during out-of-memory situations; it too is a policy issue managed by the system administrator, and not the processes themselves.
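
That per-process tuning knob is /proc/<pid>/oom_score_adj. A minimal sketch, assuming Linux with /proc mounted (the value 500 is an arbitrary illustration): a process may raise its own score, making itself a preferred victim, without privilege, while lowering it generally requires CAP_SYS_RESOURCE.

#include <stdio.h>

int main(void)
{
    /* Range is -1000 (never kill) to +1000 (kill first). */
    FILE *f = fopen("/proc/self/oom_score_adj", "w");
    if (!f) {
        perror("oom_score_adj");
        return 1;
    }
    fprintf(f, "500\n");   /* volunteer this process as an early OOM victim */
    fclose(f);
    return 0;
}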
 

I assumed that the actual question the OP is trying to solve is how to write Linux applications or services that can dynamically respond to memory pressure, other than just dying (due to SIGSEGV or the OOM killer). The answer is that you do not -- you let the system administrator worry about what is important to them, in the workload they have -- unless your application or service uses lots and lots of memory and is therefore likely to be unfairly killed during high memory pressure. (Especially if the dataset is large enough to require enabling much more swap than would otherwise be needed, raising the risk of a swap storm and a late-but-too-strong OOM killer.)

The solution, or at least the approach that works, is to memory-lock the critical parts (or even the entire application/service, if it works on sensitive data that should not be swapped to disk), or to use a memory map with a dedicated backing file. (For the latter, here is an example I wrote in 2011, that manipulates a terabyte-sized data set.)
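
For the memory-locking part, a minimal sketch using mlockall (assuming the process has CAP_IPC_LOCK or a large enough RLIMIT_MEMLOCK): once locked, the pages are faulted in up front and can no longer be swapped out.

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    /* MCL_CURRENT locks everything mapped so far; MCL_FUTURE also locks
     * mappings created later (heap growth, new mmaps, stack growth). */
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
        perror("mlockall");
        return 1;
    }

    /* ... do the memory-critical work here ... */

    munlockall();
    return 0;
}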

The OOM killer can still kill the process, and a SIGSEGV can still occur (due to, say, an internal allocation by a library function that the kernel fails to provide RAM backing for), unless all of the application is locked to RAM; but at least the service/process is no longer unfairly targeted just because it uses lots of memory.

It is possible to catch the SIGSEGV signal (which occurs when there is no memory available to back the virtual memory), but thus far I have not seen a use case that would warrant the code complexity and maintenance effort required.
 

In summary, the proper answer to the stated question is no, don't do that.

How can this process's virtual memory be bigger than physical memory + swap?

As @user3344003 said, this happens because Linux, by default (at least in my case), overcommits memory:

http://engineering.pivotal.io/post/virtual_memory_settings_in_linux_-_the_problem_with_overcommit/

/proc/sys/vm/overcommit_memory

This switch knows 3 different settings:

0: The Linux kernel is free to overcommit memory (this is the default), a heuristic algorithm is applied to figure out if enough memory is available.

1: The Linux kernel will always overcommit memory, and never check if
enough memory is available. This increases the risk of out-of-memory
situations, but also improves memory-intensive workloads.

2: The Linux kernel will not overcommit memory, and only allocate as much memory as defined in overcommit_ratio.

The default value on Debian is 0. In practice this means malloc will rarely fail (the heuristic only refuses obviously excessive requests), and the kernel's OOM killer will get involved when the machine can't actually back the pages it has promised.

Further reading can be done at https://www.etalabs.net/overcommit.html

malloc conditions for failure

  • http://www.win.tue.nl/~aeb/linux/lk/lk-9.html

    Since 2.1.27 there are a sysctl VM_OVERCOMMIT_MEMORY and proc file
    /proc/sys/vm/overcommit_memory with values 1: do overcommit, and 0
    (default): don't. Unfortunately, this does not allow you to tell the
    kernel to be more careful, it only allows you to tell the kernel to be
    less careful. With overcommit_memory set to 1 every malloc() will
    succeed. When set to 0 the old heuristics are used, the kernel still
    overcommits.

You might also wish to look at instrumenting with mallinfo:

  • http://www.gnu.org/software/libc/manual/html_node/Statistics-of-Malloc.html
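
A minimal sketch of such instrumentation (the 1 MiB allocation is only there to make the numbers non-trivial). Note that mallinfo()'s int fields wrap above 2 GiB; glibc 2.33 and later provide mallinfo2() with size_t fields instead.

#include <malloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    void *p = malloc(1 << 20);   /* 1 MiB, just to have something allocated */

    struct mallinfo mi = mallinfo();
    printf("arena    (main heap, bytes):    %d\n", mi.arena);
    printf("hblkhd   (mmap'd space, bytes): %d\n", mi.hblkhd);
    printf("uordblks (in use, bytes):       %d\n", mi.uordblks);
    printf("fordblks (free, bytes):         %d\n", mi.fordblks);

    free(p);
    return 0;
}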

One final link:

  • http://opsmonkey.blogspot.com/2007/01/linux-memory-overcommit.html

In a way, Linux allocates memory the way an airline sells plane
tickets. An airline will sell more tickets than they have actual
seats, in the hopes that some of the passengers don't show up. Memory
in Linux is managed in a similar way, but actually to a much more
serious degree.

lazy overcommit allocation and calloc

calloc can get guaranteed-zero pages from the OS, and thus avoid having to write zeros in user-space at all. (This applies especially to large allocations; for smaller ones it will instead zero a block taken from the allocator's free list, if a free-list entry of the right size exists.) That's where the laziness comes in.

So your page will be fresh from mmap(MAP_ANONYMOUS), untouched by user-space. Reading it will trigger a soft page fault that copy-on-write maps it to a shared physical page of zeros. (Fun fact: because every virtual page then points at the same physical zero page, you can get TLB misses but L1d / L2 cache hits when looping read-only over a huge calloc allocation.)

Writing that page / one of those pages (whether as the first access, or after it has been CoW-mapped to the zero page) will soft page-fault, and Linux's page-fault handler will allocate a new physical page and zero it. (So after the page fault, the whole page is generally hot in L1d cache, or at least L2, even with fault-around, which prepares more pages and wires them into the page table to reduce the number of page faults when neighbouring pages are also lazily allocated.)
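
A minimal sketch that makes this laziness visible, assuming Linux and room for a 4 GiB calloc (the size is arbitrary): resident set size stays small after the allocation and typically also after read-only access, and only grows once the pages are written.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Print the VmRSS line from /proc/self/status. */
static void print_rss(const char *label)
{
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];

    if (!f)
        return;
    while (fgets(line, sizeof line, f))
        if (strncmp(line, "VmRSS:", 6) == 0)
            printf("%s %s", label, line);
    fclose(f);
}

int main(void)
{
    size_t n = (size_t)4 << 30;               /* 4 GiB */
    unsigned char *buf = calloc(n, 1);
    volatile unsigned char sink = 0;

    if (!buf)
        return 1;
    print_rss("after calloc:");               /* tiny: nothing touched yet */

    for (size_t i = 0; i < n; i += 4096)
        sink += buf[i];                       /* reads hit the shared zero page */
    print_rss("after reads: ");               /* typically still tiny */

    for (size_t i = 0; i < n; i += 4096)
        buf[i] = 1;                           /* writes fault in real pages */
    print_rss("after writes:");               /* now on the order of 4 GiB */

    free(buf);
    return 0;
}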


But no, you don't generally need to worry about it, other than general performance tuning. If you logically own some memory, you can ask read to put data into it. The libc wrapper isn't doing any special retrying there; all the magic (checking for the target page being present and treating it like a soft or hard page fault) happens inside the kernel's implementation of read, as part of copy_to_user.

(Basically a memcpy from kernel memory to user-space, with permission checking that can make it return -EFAULT if you pass the kernel a pointer that you don't even logically own. i.e. memory that would segfault if you touched it from user-space. Note that you don't get a SIGSEGV from read(0, NULL, 1), just an error. Use strace ./a.out to see, as an alternative to actually implementing error checking in your hand-written asm.)
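
A minimal sketch of that behaviour: pipe some input in (for example echo hi | ./a.out) and read() reports EFAULT instead of the process receiving SIGSEGV.

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    ssize_t n = read(0, NULL, 1);   /* NULL is never a buffer we logically own */

    if (n < 0)
        printf("read failed: %s (errno=%d)\n", strerror(errno), errno);
    else
        printf("read returned %zd\n", n);
    return 0;
}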

Is it possible to allocate large amount of virtual memory in linux?

Is it possible to allocate large amount of virtual memory in linux?

Possibly. But you may need to configure it to be allowed:

The Linux kernel supports the following overcommit handling modes

0 - Heuristic overcommit handling. Obvious overcommits of address
space are refused. Used for a typical system. It ensures a seriously
wild allocation fails while allowing overcommit to reduce swap
usage. root is allowed to allocate slightly more memory in this
mode. This is the default.

1 - Always overcommit. Appropriate for some scientific applications.
Classic example is code using sparse arrays and just relying on the
virtual memory consisting almost entirely of zero pages.

2 - Don't overcommit. The total address space commit for the system
is not permitted to exceed swap + a configurable amount (default is
50%) of physical RAM. Depending on the amount you use, in most
situations this means a process will not be killed while accessing
pages but will receive errors on memory allocation as appropriate.

Useful for applications that want to guarantee their memory
allocations will be available in the future without having to
initialize every page.

The overcommit policy is set via the sysctl `vm.overcommit_memory'.

So, if you want to allocate more virtual memory than you have physical memory, then you'd want:

# in shell
sysctl -w vm.overcommit_memory=1

RLIMIT_AS The maximum size of the process's virtual memory (address space) in bytes. This limit affects calls to brk(2), mmap(2) and mremap(2), which fail with the error ENOMEM upon exceeding this limit. Also automatic stack expansion will fail (and generate a SIGSEGV that kills the process if no alternate stack has been made available via sigaltstack(2)). Since the value is a long, on machines with a 32-bit long either this limit is at most 2 GiB, or this resource is unlimited.

So, you'd want:

#include <sys/resource.h>

struct rlimit unlimited = {
    .rlim_cur = RLIM_INFINITY,
    .rlim_max = RLIM_INFINITY,   /* raising the hard limit may need privilege */
};
setrlimit(RLIMIT_AS, &unlimited);

Or, if you cannot give the process permission to do this, then you can configure this persistently in /etc/security/limits.conf which will affect all processes (of a user/group).
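
To see the limit doing its job, here is a minimal sketch (the 1 GiB / 2 GiB figures are arbitrary): it lowers the soft RLIMIT_AS to 1 GiB and then attempts a 2 GiB anonymous mapping, which should fail with ENOMEM regardless of the overcommit mode.

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    getrlimit(RLIMIT_AS, &rl);
    rl.rlim_cur = (rlim_t)1 << 30;            /* 1 GiB soft limit */
    if (setrlimit(RLIMIT_AS, &rl) != 0) {
        perror("setrlimit");
        return 1;
    }

    size_t len = (size_t)2 << 30;             /* 2 GiB request */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        printf("mmap failed as expected: %s\n", strerror(errno));  /* ENOMEM */
    else
        munmap(p, len);
    return 0;
}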


Ok, so mmap seems to support ... but it requires a file descriptor. ... could be a win but not if they have to be backed by a file ... I don't like the idea of attaching to a file

You don't need to use a file backed mmap. There's MAP_ANONYMOUS for that.

I did not know what number to put in to request

Then pass a null pointer. Example:

mmap(nullptr, 256ULL << 30 /* 256 GiB */, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)

That said, if you've configured the system as described, then new should work just as well as mmap. It'll probably use malloc which will probably use mmap for large allocations like this.


Bonus hint: you may benefit from using HugeTLB pages.
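
A minimal sketch of an explicit HugeTLB mapping, assuming huge pages have been reserved beforehand (for example with sysctl vm.nr_hugepages=512; the 1 GiB mapping size is arbitrary):

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = (size_t)1 << 30;   /* 1 GiB, backed by 2 MiB huge pages */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

    if (p == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");   /* fails if no huge pages are reserved */
        return 1;
    }

    /* Alternative that needs no reservation: madvise(p, len, MADV_HUGEPAGE)
     * on a normal anonymous mapping asks for transparent huge pages. */

    munmap(p, len);
    return 0;
}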

Unable to allocate array with shape and data type

This is likely due to your system's overcommit handling mode.

In the default mode, 0,

Heuristic overcommit handling. Obvious overcommits of address space are refused. Used for a typical system. It ensures a seriously wild allocation fails while allowing overcommit to reduce swap usage. root is allowed to allocate slightly more memory in this mode. This is the default.

The exact heuristic used is not well explained here, but it is discussed in more detail in the sections above and in the kernel's overcommit-accounting documentation linked earlier.

You can check your current overcommit mode by running

$ cat /proc/sys/vm/overcommit_memory
0

In this case, you're allocating

>>> 156816 * 36 * 53806 / 1024.0**3
282.8939827680588

roughly 283 GiB, and the kernel is saying: well, obviously there's no way I'm going to be able to commit that many physical pages to this, so it refuses the allocation.

If (as root) you run:

$ echo 1 > /proc/sys/vm/overcommit_memory

This will enable the "always overcommit" mode, and you'll find that indeed the system will allow you to make the allocation no matter how large it is (within 64-bit memory addressing at least).

I tested this myself on a machine with 32 GB of RAM. With overcommit mode 0 I also got a MemoryError, but after changing it to 1 it works:

>>> import numpy as np
>>> a = np.zeros((156816, 36, 53806), dtype='uint8')
>>> a.nbytes
303755101056

You can then go ahead and write to any location within the array, and the system will only allocate physical pages when you explicitly write to that page. So you can use this, with care, for sparse arrays.


