What is the relation between `container_memory_working_set_bytes` metric and OOM-killer on the container?
As you already know, container_memory_working_set_bytes is:

the amount of working set memory; it includes recently accessed memory, dirty memory, and kernel memory. The working set is therefore less than or equal to (<=) "usage".
container_memory_working_set_bytes is used for OOM decisions because it excludes cached data (the Linux page cache) that can be evicted under memory pressure. So, if container_memory_working_set_bytes grows up to the limit, it will lead to an OOM kill.
You can see this in the kernel: when the Linux kernel checks available memory, it calls vm_enough_memory() to find out how many pages are potentially available. Then, when the machine is low on memory, old page frames, including cache, are reclaimed, but the kernel may still find that it was unable to free enough pages to satisfy the request. At that point it calls out_of_memory() to kill a process, and it uses oom_score to determine the candidate process to be killed.
So when the working set reaches the limit, the kernel cannot find available pages even after reclaiming old pages (including cache), and it triggers the OOM killer to kill the process.
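The relationship between usage, the evictable page cache, and the limit can be sketched with made-up numbers (on a real node these values would come from the container's cgroup files, e.g. memory.usage_in_bytes and the total_inactive_file field of memory.stat):

```shell
# Hypothetical values, in bytes
limit=536870912              # 512 MiB memory limit on the container
usage=530000000              # total usage reported by the cgroup
total_inactive_file=5000000  # evictable page cache

# working set = usage minus the cache the kernel can reclaim
working_set=$((usage - total_inactive_file))
echo "working_set=$working_set"

# The kernel first reclaims inactive file pages; only if the working
# set itself reaches the limit does the OOM killer fire.
if [ "$working_set" -ge "$limit" ]; then
  echo "OOM"
else
  echo "ok"
fi
```

Here usage is close to the limit, but because part of it is reclaimable cache, the working set stays below the limit and no OOM kill happens.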
You can find more details on the Linux kernel documents:
- https://www.kernel.org/doc/gorman/html/understand/understand016.html
- https://www.kernel.org/doc/gorman/html/understand/understand013.html
What is the difference between the “container_memory_working_set_bytes” and “container_memory_rss” metrics on a container?
You are right. I will try to address your questions in more detail.
What is the difference between two metrics?
container_memory_rss equals the value of total_rss from the /sys/fs/cgroup/memory/memory.stat file:
// The amount of anonymous and swap cache memory (includes transparent
// hugepages).
// Units: Bytes.
RSS uint64 `json:"rss"`
The total amount of anonymous and swap cache memory (it includes transparent hugepages), and it equals the value of total_rss from the memory.stat file. This should not be confused with the true resident set size or the amount of physical memory used by the cgroup. rss + file_mapped will give you the resident set size of the cgroup. It does not include memory that is swapped out. It does include memory from shared libraries as long as the pages from those libraries are actually in memory. It does include all stack and heap memory.
container_memory_working_set_bytes (as already mentioned by Olesya) is the total usage - inactive_file. It is an estimate of how much memory cannot be evicted:
// The amount of working set memory, this includes recently accessed memory,
// dirty memory, and kernel memory. Working set is <= "usage".
// Units: Bytes.
WorkingSet uint64 `json:"working_set"`
Working Set is the current size, in bytes, of the Working Set of this process. The Working Set is the set of memory pages touched recently by the threads in the process.
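The difference between the two metrics is easiest to see by plugging concrete (invented) numbers into the formulas above; the field names follow cgroup v1's memory.stat:

```shell
# Hypothetical cgroup counters, in bytes
total_rss=200000000            # anonymous + swap cache memory
total_cache=150000000          # page cache (active + inactive file)
total_inactive_file=90000000   # the evictable part of that cache

# usage is roughly rss + cache
usage=$((total_rss + total_cache))

container_memory_rss=$total_rss
container_memory_working_set_bytes=$((usage - total_inactive_file))

echo "$container_memory_rss"                # 200000000
echo "$container_memory_working_set_bytes"  # 260000000
```

Note that the working set comes out larger than RSS here, because it still counts the active file cache that has not (yet) been classified as evictable.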
Which metric is more appropriate for monitoring memory usage? Some
posts said both, because if either of those metrics reaches the limit,
the container is OOM killed.
If you are limiting the resource usage of your pods, then you should monitor both, as either will cause an OOM kill if it reaches the relevant resource limit.
I also recommend this article, which shows an example explaining the assertion below:
You might think that memory utilization is easily tracked with
container_memory_usage_bytes; however, this metric also includes
cached (think filesystem cache) items that can be evicted under memory
pressure. The better metric is container_memory_working_set_bytes, as
this is what the OOM killer is watching for.
EDIT:
Adding some additional sources as a supplement:
- A Deep Dive into Kubernetes Metrics — Part 3 Container Resource Metrics
- #1744
- Understanding Kubernetes Memory Metrics
- Memory_working_set vs Memory_rss in Kubernetes, which one you should monitor?
- Managing Resources for Containers
- cAdvisor code
relationship between container_memory_working_set_bytes and process_resident_memory_bytes and total_rss
So the relationship looks like this:
container_working_set_in_bytes = container_memory_usage_bytes - total_inactive_file
container_memory_usage_bytes, as its name implies, is the total memory used by the container; but since it also includes the file cache (i.e. inactive_file, which the OS can release under memory pressure), subtracting inactive_file gives container_working_set_in_bytes.
The relationship between container_memory_rss and container_memory_working_set_bytes can be summed up with the following expression:
container_memory_usage_bytes = container_memory_cache + container_memory_rss
cache reflects data stored on disk that is currently cached in memory; it contains active + inactive file pages (mentioned above).
This explains why container_memory_working_set_bytes was higher.
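A quick numeric check of the two identities above, using invented values, shows why the working set can exceed RSS:

```shell
# Hypothetical values, in bytes
container_memory_cache=120000000
container_memory_rss=300000000
total_inactive_file=80000000   # a subset of the cache

# usage = cache + rss
container_memory_usage_bytes=$((container_memory_cache + container_memory_rss))
# working set = usage - inactive file
container_memory_working_set_bytes=$((container_memory_usage_bytes - total_inactive_file))

echo "$container_memory_usage_bytes"        # 420000000
echo "$container_memory_working_set_bytes"  # 340000000
```

The working set (340 MB) is higher than RSS (300 MB) because it keeps the 40 MB of active file cache that the kernel has not marked inactive.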
Ref #1
Ref #2
container_memory_rss relation with node memory used
tl;dr
Use container name filter (container!=""
) to exclude totals:
sum(container_memory_rss{container!=""}) by (instance) / 2^30
Explanation
If you ran the first query grouping results by container name, you would have noticed that most of the usage comes from a container without a name:
sort_desc(sum(container_memory_rss{instance="ip-192-168-104-46"}) by (name)) / 2^30
{} 3.9971389770507812
{name="prometheus"} 0.6084518432617188
{name="cluster-autoscaler"} 0.04230499267578125
Actually, there are several entries without a name, but they all have an id:
sort_desc(sum(container_memory_rss{instance="ip-192-168-104-46"}) by (id)) / 2^30
# these do not have a container name
{id="/"} 1.1889266967773438
{id="/kubepods"} 0.900482177734375
{id="/kubepods/burstable"} 0.6727218627929688
{id="/system.slice/docker.service"} 0.07495498657226562
{id="/system.slice/kubelet.service"} 0.060611724853515625
# and this is an example id of a real container which has a name label
{id="/kubepods/burstable/pod562495f9-afa6-427e-8435-016c2b500c74/e73975d90b66772e2e17ab14c473a2d058c0b9ffecc505739ee1a94032728a78"} 0.6027107238769531
These are accumulated values for each cgroup. cAdvisor takes its stats from cgroups, and if you look at them, you will find familiar entities:
# systemd-cgls -a
├─kubepods
│ ├─podc7dfcc4e-74fc-4469-ad56-c13fe5a9e7d8
│ │ ├─61a1a58e47968e7595f3458a6ded74f9088789a865bda2be431b8c8b07da1c6e
│ │ └─d47601e38a96076dd6e0205f57b0c365d4473cb6051eb0f0e995afb31143279b
│ ├─podfde9b8ca-ce80-4467-ba05-03f02a14d569
│ │ ├─9d3783df65085d54028e2303ccb2e143fecddfb85d7df4467996e82691892176
│ │ └─47702b7977bed65ddc86de92475be8f93b50b06ae8bd99bae9710f0b6f63d8f6
│ ├─burstable
│ │ ├─pod9ff634a5-fd2a-42e2-be27-7e1028e96b67
│ │ │ ├─5fa225aad10bdc1be372859697f53d5517ad28c565c6f1536501543a071cdefc
│ │ │ └─27402fed2e4bb650a6fc41ba073f9994a3fc24782ee366fb8b93a6fd939ba4d3
If you sum up all the direct children of, say, kubepods, you will get the same value that kubepods itself has. Because of these totals, sum(container_memory_rss) by (instance) shows several times the actual resource utilisation.
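Using values of the same shape as the query output above (invented here), the double counting is easy to demonstrate: every byte of a container is counted once per ancestor cgroup.

```shell
# Illustrative per-cgroup RSS values in GiB. id="/" already contains
# /kubepods, which already contains /kubepods/burstable, and so on.
root=1.19       # id="/"
kubepods=0.90   # id="/kubepods"       (already included in root)
burstable=0.67  # id="/kubepods/burstable" (already included in kubepods)

# A naive sum over all series counts the same memory repeatedly
naive=$(awk -v a="$root" -v b="$kubepods" -v c="$burstable" \
  'BEGIN { printf "%.2f", a + b + c }')

echo "naive sum: $naive GiB, actual: $root GiB"
```

The naive sum (2.76 GiB) is more than double the real usage (1.19 GiB), which is exactly the inflation the container name filter removes.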
The solution is simply to filter out any values without a container name. You can either do that when querying, as in the example at the top, or configure Prometheus with relabel_config to drop such metrics at scrape time.
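For the relabelling approach, a sketch of what the scrape config could look like (the job name is a placeholder, and the metric regex is an assumption; adjust both to your setup):

```yaml
scrape_configs:
  - job_name: cadvisor   # hypothetical job name
    # ... kubernetes_sd_configs, TLS settings, etc. ...
    metric_relabel_configs:
      # Drop cAdvisor memory series that carry no container name,
      # i.e. the per-cgroup totals such as id="/kubepods".
      - source_labels: [__name__, container]
        regex: container_memory_.*;
        action: drop
```

The joined source labels look like "container_memory_rss;mycontainer", so the regex only matches series whose container label is empty.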