What Does '--oom-kill-disable' Do for a Docker Container?

What is docker --kernel-memory

All of Docker's CPU/memory limitation features are implemented with cgroups.
You can find every setting applied by docker run (whether via arguments or defaults) under /sys/fs/cgroup/memory/docker/<container ID> for memory, or /sys/fs/cgroup/cpu/docker/<container ID> for CPU.
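For example, to poke around in there (a minimal sketch: it assumes cgroup v1 with the default cgroupfs layout, and a container named web):

CID=$(docker inspect --format '{{.Id}}' web)

# list every memory knob Docker set up for this container
ls /sys/fs/cgroup/memory/docker/$CID/

# the limit written by --memory
cat /sys/fs/cgroup/memory/docker/$CID/memory.limit_in_bytes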

So for --kernel-memory, the backing file is memory.kmem.limit_in_bytes:

Reading: cat memory.kmem.limit_in_bytes

Writing: echo 2147483648 | sudo tee memory.kmem.limit_in_bytes (2 GiB; note that sudo echo 2147483648 > memory.kmem.limit_in_bytes does not work, because the redirection runs in your unprivileged shell)

There are also the accounting counterparts memory.kmem.usage_in_bytes and memory.kmem.max_usage_in_bytes, which (rather self-explanatorily) show the current usage and the highest usage seen so far.
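Tying those together, a sketch of setting a kernel-memory limit at run time and verifying it in the cgroup (kmemtest, nginx and the 20m value are arbitrary examples, and the flag needs a Docker version that still supports it):

docker run -d --name kmemtest --kernel-memory 20m nginx
CID=$(docker inspect --format '{{.Id}}' kmemtest)
cat /sys/fs/cgroup/memory/docker/$CID/memory.kmem.limit_in_bytes
20971520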

CGroup docs about Kernel Memory

For the actual semantics, I recommend reading the kernel docs for cgroups v1 instead of the Docker docs:

2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM)

With the Kernel memory extension, the Memory Controller is able to
limit the amount of kernel memory used by the system. Kernel memory is
fundamentally different than user memory, since it can't be swapped
out, which makes it possible to DoS the system by consuming too much
of this precious resource.
[..]

The memory used is
accumulated into memory.kmem.usage_in_bytes, or in a separate counter
when it makes sense. (currently only for tcp). The main "kmem" counter
is fed into the main counter, so kmem charges will also be visible
from the user counter.

Currently no soft limit is implemented for kernel memory. It is future
work to trigger slab reclaim when those limits are reached.

and

2.7.2 Common use cases

Because the "kmem" counter is fed to the main user counter, kernel
memory can never be limited completely independently of user memory.
Say "U" is the user limit, and "K" the kernel limit. There are three
possible ways limits can be set:

U != 0, K = unlimited:
This is the standard memcg limitation mechanism already present before kmem
accounting. Kernel memory is completely ignored.

U != 0, K < U:
Kernel memory is a subset of the user memory. This setup is useful in
deployments where the total amount of memory per-cgroup is overcommited.
Overcommiting kernel memory limits is definitely not recommended, since the
box can still run out of non-reclaimable memory.
In this case, the admin could set up K so that the sum of all groups is
never greater than the total memory, and freely set U at the cost of his
QoS.
WARNING: In the current implementation, memory reclaim will NOT be
triggered for a cgroup when it hits K while staying below U, which makes
this setup impractical.

U != 0, K >= U:
Since kmem charges will also be fed to the user counter and reclaim will be
triggered for the cgroup for both kinds of memory. This setup gives the
admin a unified view of memory, and it is also useful for people who just
want to track kernel memory usage.
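Translated into docker run flags, the three cases look roughly like this (the sizes are purely illustrative):

# U != 0, K = unlimited: the classic pre-kmem setup
docker run --memory 2g ...

# U != 0, K < U: kernel memory capped below user memory (the impractical case)
docker run --memory 2g --kernel-memory 512m ...

# U != 0, K >= U: unified view of both kinds of memory
docker run --memory 2g --kernel-memory 2g ...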

Clumsy Attempt at a Conclusion

Given a running container with --memory="2g" --memory-swap="2g" --oom-kill-disable, checking

cat memory.kmem.max_usage_in_bytes
10747904

shows about 10 MB of kernel memory in the normal state. It would make sense to me to limit that, say to 20 MB of kernel memory; the kernel should then fail allocations or constrain the container to protect the host. But since, according to the docs, there is no way to reclaim this memory once the limit is hit, and since the OOM killer starts killing processes on the host even with plenty of free memory (see https://github.com/docker/for-linux/issues/1001), using it is rather impractical for me.

The quoted option to set it >= memory.limit_in_bytes is not really helpful in that scenario either.
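If you only want visibility rather than enforcement, polling the counters is harmless (a sketch; the paths assume cgroup v1 and $CID set as above):

# current and peak kernel-memory usage, refreshed every 5 seconds
watch -n 5 cat \
  /sys/fs/cgroup/memory/docker/$CID/memory.kmem.usage_in_bytes \
  /sys/fs/cgroup/memory/docker/$CID/memory.kmem.max_usage_in_bytes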

Deprecated

--kernel-memory is deprecated as of Docker v20.10, due to the fact that someone (= the Linux kernel developers) realized all of that as well..

What can we do then?

ULimit

The Docker API exposes HostConfig|Ulimit, which sets resource limits on the container's processes: the same classes you know from setrlimit(2) and /etc/security/limits.conf. For docker run the syntax is --ulimit <type>=<soft>:<hard>. Use cat /etc/security/limits.conf or man setrlimit to see the categories. You can try to protect your system from kernel-memory exhaustion by, e.g., capping process creation with --ulimit nproc=500:500, but be careful: nproc is enforced per UID, not per container, so limits across containers sharing a user add up..

To prevent a DoS (intentional or unintentional) I would suggest limiting at least nofile and nproc, as sketched below. Maybe someone can elaborate further..
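A concrete invocation (limited, nginx and the numbers are just example values):

docker run -d --name limited \
  --ulimit nofile=1024:2048 \
  --ulimit nproc=500:500 \
  nginx

# verify from inside the container
docker exec limited bash -c 'ulimit -n; ulimit -u'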

sysctl

docker run --sysctl can set namespaced kernel variables for message queues and shared memory, and also networking, e.g. docker run --sysctl net.ipv4.tcp_max_orphans= for orphaned TCP connections, which defaults on my system to 131072. At a kernel memory usage of 64 kB each: bang, 8 GB on malfunction or DoS. Maybe someone can elaborate further..
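A sketch of the mechanics, using net.core.somaxconn because it is reliably network-namespaced (whether any particular net.* variable can be set per container depends on your kernel):

# read the default inside a throwaway container
docker run --rm alpine sysctl net.core.somaxconn

# override it for this one container; the host value stays untouched
docker run --rm --sysctl net.core.somaxconn=1024 alpine sysctl net.core.somaxconn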

How to disable the OOM killer in Linux?

The OOM killer won't go away. If there is no memory, someone's got to pay. What you can do is set a limit after which memory allocations fail.
That's exactly what setting vm.overcommit_memory to 2 achieves.

From the docs:

The Linux kernel supports the following overcommit handling modes

2 - Don't overcommit. The total address space commit for the system
is not permitted to exceed swap + a configurable amount (default is
50%) of physical RAM. Depending on the amount you use, in most
situations this means a process will not be killed while accessing
pages but will receive errors on memory allocation as appropriate.

Normally, the kernel will happily hand out virtual memory (overcommit). Only when you actually reference a page does the kernel have to map it to a real physical frame. If it can't service that request, some process has to be killed by the OOM killer to make space.

Disabling overcommit means that e.g. malloc(3) will return NULL if the kernel couldn't commit the amount of memory requested. This makes things a bit more predictable, albeit limited (many applications allocate more than they would ever need).
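To try it out (on a test box rather than in production; vm.overcommit_ratio is the "configurable amount" from the quote above):

# current mode; 0 (heuristic overcommit) is the usual default
sysctl vm.overcommit_memory

# switch to strict accounting: commit limit = swap + 50% of RAM
sudo sysctl -w vm.overcommit_memory=2
sudo sysctl -w vm.overcommit_ratio=50

# the resulting CommitLimit and current Committed_AS
grep -i commit /proc/meminfo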


