What Is the Relation Between the `container_memory_working_set_bytes` Metric and the OOM Killer on the Container?

What is the relation between `container_memory_working_set_bytes` metric and OOM-killer on the container?

As you already know, container_memory_working_set_bytes is:

the amount of working set memory, which includes recently accessed memory, dirty memory, and kernel memory. Therefore, the working set is less than or equal to (<=) "usage".

container_memory_working_set_bytes is used for OOM decisions because it excludes cached data (the Linux page cache) that can be evicted under memory pressure.

So, if container_memory_working_set_bytes grows up to the limit, it will lead to an OOM kill.
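If you want to see exactly what that means on a node, you can reproduce the calculation from the cgroup (v1) files that cAdvisor reads. This is only a minimal sketch; the cgroup path is a placeholder and assumes the memory controller is mounted at /sys/fs/cgroup/memory:

CGROUP=/sys/fs/cgroup/memory/kubepods/burstable/podXXXX/YYYY   # placeholder, adjust to your container

usage=$(cat "$CGROUP/memory.usage_in_bytes")
inactive_file=$(awk '/^total_inactive_file /{print $2}' "$CGROUP/memory.stat")
limit=$(cat "$CGROUP/memory.limit_in_bytes")

# working set = usage - inactive_file; the closer it gets to the limit,
# the closer the container is to an OOM kill
echo "working_set=$((usage - inactive_file)) limit=$limit"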

When the Linux kernel checks available memory, it calls vm_enough_memory() to find out how many pages are potentially available.

Then, when the machine is low on memory, old page frames (including cache) are reclaimed, but the kernel may still find that it is unable to free enough pages to satisfy the request. At that point it calls out_of_memory() to kill a process, and it uses oom_score to determine the candidate process to be killed.

So when the working set reaches the limit, it means the kernel cannot find available pages even after reclaiming old pages (including cache), and it will trigger the OOM killer to kill the process.
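If you are curious how the kernel ranks the candidates, every process exposes its current score under /proc, so you can peek at it from the node. A quick sketch (the PID is a placeholder for a process running inside the container):

PID=12345   # placeholder

# badness score the OOM killer uses to pick its victim (higher = killed first)
cat /proc/$PID/oom_score

# adjustment value; the kubelet sets this per container based on the pod's QoS class
cat /proc/$PID/oom_score_adj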

You can find more details in the Linux kernel documentation:

  • https://www.kernel.org/doc/gorman/html/understand/understand016.html
  • https://www.kernel.org/doc/gorman/html/understand/understand013.html

What is the difference between the “container_memory_working_set_bytes” and “container_memory_rss” metrics on the container?

You are right. I will try to address your questions in more detail.

What is the difference between the two metrics?

container_memory_rss equals the value of total_rss from the /sys/fs/cgroup/memory/memory.stat file:

// The amount of anonymous and swap cache memory (includes transparent
// hugepages).
// Units: Bytes.
RSS uint64 `json:"rss"`

The total amount of anonymous and swap cache memory (it includes transparent hugepages), and it equals the value of total_rss from the memory.stat file. This should not be confused with the true resident set size or the amount of physical memory used by the cgroup. rss + file_mapped will give you the resident set size of the cgroup. It does not include memory that is swapped out. It does include memory from shared libraries as long as the pages from those libraries are actually in memory. It does include all stack and heap memory.
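For reference, you can read that value straight from the cgroup. A small sketch, assuming cgroup v1 and a placeholder cgroup path:

CGROUP=/sys/fs/cgroup/memory/kubepods/burstable/podXXXX/YYYY   # placeholder

# total_rss from memory.stat is what cAdvisor exports as container_memory_rss
awk '/^total_rss /{print $2}' "$CGROUP/memory.stat"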


container_memory_working_set_bytes (as already mentioned by Olesya) is the total usage - inactive file. It is an estimate of how much memory cannot be evicted:

// The amount of working set memory, this includes recently accessed memory,
// dirty memory, and kernel memory. Working set is <= "usage".
// Units: Bytes.
WorkingSet uint64 `json:"working_set"`

Working Set is the current size, in bytes, of the Working Set of this process. The Working Set is the set of memory pages touched recently by the threads in the process.
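If you want to see the difference for a single container directly from Prometheus, here is a sketch using the HTTP API; the Prometheus address and pod name are placeholders, and the pod/container label names depend on your kubelet and cAdvisor versions (older ones used pod_name/container_name):

curl -s 'http://localhost:9090/api/v1/query' --data-urlencode \
  'query=container_memory_working_set_bytes{pod="my-pod",container!=""} - container_memory_rss{pod="my-pod",container!=""}'

The difference is roughly the active page cache plus kernel memory that the working set counts but RSS does not.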



Which metric is more proper to monitor memory usage? Some post said both, because when one of those metrics reaches the limit, that container is OOM killed.

If you are limiting the resource usage for your pods, then you should monitor both, as reaching a particular resource limit will cause an OOM kill.
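As a rough sketch of how you could watch how close containers get to their limit, you can divide the working set by the limit cAdvisor reports. The Prometheus address and the 0.9 threshold are placeholders, and containers without a memory limit may report 0 (or an effectively unlimited value) in container_spec_memory_limit_bytes, which is why the > 0 filter is there:

curl -s 'http://localhost:9090/api/v1/query' --data-urlencode \
  'query=container_memory_working_set_bytes{container!=""} / (container_spec_memory_limit_bytes{container!=""} > 0) > 0.9'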

I also recommend this article, which shows an example explaining the assertion below:

You might think that memory utilization is easily tracked with
container_memory_usage_bytes, however, this metric also includes
cached (think filesystem cache) items that can be evicted under memory
pressure. The better metric is container_memory_working_set_bytes as
this is what the OOM killer is watching for.

EDIT:

Adding some additional sources as a supplement:

  • A Deep Dive into Kubernetes Metrics — Part 3 Container Resource Metrics

  • #1744

  • Understanding Kubernetes Memory Metrics

  • Memory_working_set vs Memory_rss in Kubernetes, which one you should monitor?

  • Managing Resources for Containers

  • cAdvisor code

Relationship between container_memory_working_set_bytes, process_resident_memory_bytes and total_rss

So the relationship seems to be like this:

container_memory_working_set_bytes = container_memory_usage_bytes - total_inactive_file

container_memory_usage_bytes, as its name implies, is the total memory used by the container. Since it also includes file cache (i.e. inactive_file), which the OS can release under memory pressure, subtracting the inactive_file gives container_memory_working_set_bytes.

The relationship between container_memory_rss and container_memory_working_set_bytes can be summed up with the following expression:

container_memory_usage_bytes = container_memory_cache + container_memory_rss 

The cache reflects data stored on disk that is currently cached in memory; it contains active + inactive file (mentioned above).

This explains why container_memory_working_set_bytes was higher than container_memory_rss.
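You can sanity-check these expressions on a node by reading the same cgroup file the metrics come from. A sketch assuming cgroup v1 and a placeholder cgroup path; the numbers will not match to the byte, because usage_in_bytes is a fuzzy snapshot and also counts some kernel memory:

CGROUP=/sys/fs/cgroup/memory/kubepods/burstable/podXXXX/YYYY   # placeholder

memstat() { awk -v k="$1" '$1 == k {print $2}' "$CGROUP/memory.stat"; }

cache=$(memstat total_cache); rss=$(memstat total_rss)
active_file=$(memstat total_active_file); inactive_file=$(memstat total_inactive_file)
usage=$(cat "$CGROUP/memory.usage_in_bytes")

echo "cache ~ active_file + inactive_file : $cache ~ $((active_file + inactive_file))"
echo "usage ~ cache + rss                 : $usage ~ $((cache + rss))"
echo "working_set = usage - inactive_file : $((usage - inactive_file))"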

Ref #1

Ref #2

container_memory_rss relation with node memory used

tl;dr

Use container name filter (container!="") to exclude totals:

sum(container_memory_rss{container!=""}) by (instance) / 2^30

Explanation

If you ran the first query grouping results by container name, you would have noticed that most of the usage comes from a container without a name:

sort_desc(sum(container_memory_rss{instance="ip-192-168-104-46"}) by (name)) / 2^30

{} 3.9971389770507812
{name="prometheus"} 0.6084518432617188
{name="cluster-autoscaler"} 0.04230499267578125

Actually, there are several entries without a name, but they all have an id:

sort_desc(sum(container_memory_rss{instance="ip-192-168-104-46"}) by (id)) / 2^30

# these do not have a container name
{id="/"} 1.1889266967773438
{id="/kubepods"} 0.900482177734375
{id="/kubepods/burstable"} 0.6727218627929688
{id="/system.slice/docker.service"} 0.07495498657226562
{id="/system.slice/kubelet.service"} 0.060611724853515625

# and this is an example id of a real container which has a name label
{id="/kubepods/burstable/pod562495f9-afa6-427e-8435-016c2b500c74/e73975d90b66772e2e17ab14c473a2d058c0b9ffecc505739ee1a94032728a78"} 0.6027107238769531

These are accumulated values for each cgroup. cAdvisor takes the stats from cgroups, and if you look at them, you will find familiar entities:

# systemd-cgls -a
├─kubepods
│ ├─podc7dfcc4e-74fc-4469-ad56-c13fe5a9e7d8
│ │ ├─61a1a58e47968e7595f3458a6ded74f9088789a865bda2be431b8c8b07da1c6e
│ │ └─d47601e38a96076dd6e0205f57b0c365d4473cb6051eb0f0e995afb31143279b
│ ├─podfde9b8ca-ce80-4467-ba05-03f02a14d569
│ │ ├─9d3783df65085d54028e2303ccb2e143fecddfb85d7df4467996e82691892176
│ │ └─47702b7977bed65ddc86de92475be8f93b50b06ae8bd99bae9710f0b6f63d8f6
│ ├─burstable
│ │ ├─pod9ff634a5-fd2a-42e2-be27-7e1028e96b67
│ │ │ ├─5fa225aad10bdc1be372859697f53d5517ad28c565c6f1536501543a071cdefc
│ │ │ └─27402fed2e4bb650a6fc41ba073f9994a3fc24782ee366fb8b93a6fd939ba4d3

If you sum up all direct children of, say, kubepods, you will get the same value that kubepods itself has. Because of these totals, sum(container_memory_rss) by (instance) shows several times the actual resource utilisation.
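You can see that double counting on the node itself. A sketch assuming cgroup v1 mounted at /sys/fs/cgroup/memory: the hierarchical total_rss of kubepods is roughly the sum of its direct children, so summing the parent series together with the per-container series counts the same memory again.

# hierarchical rss of the parent cgroup (the id="/kubepods" series)
awk '/^total_rss /{print $2}' /sys/fs/cgroup/memory/kubepods/memory.stat

# sum of its direct children -- roughly the same number again
for d in /sys/fs/cgroup/memory/kubepods/*/; do
  awk '/^total_rss /{print $2}' "$d/memory.stat"
done | awk '{sum += $1} END {print sum}'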

The solution is simply to filter out any values without a container name. You can either do that when querying, as in the example at the top, or configure Prometheus relabelling (metric_relabel_configs) to drop such series at scrape time.


