What Would Cause a Java Process to Greatly Exceed the Xmx or Xss Limit

What would cause a java process to greatly exceed the Xmx or Xss limit?

Turns out this is a glibc problem.

The short answer for me was:

export MALLOC_ARENA_MAX=1

This decreased the process footprint (VIRT in top) by as much as 5x, back to the levels seen on CentOS 5.

Recent versions of glibc have a new feature, "per-thread memory pools":

http://www.centos.org/docs/5/html/5.4/Technical_Notes/glibc.html

The last item in the 1.71.1 log section discusses it (and refers to a non-public bug...).

SGE h_vmem vs java -Xmx -Xms

The problem here appears to be an issue with the combination of the following factors:

  1. The old cluster was RHEL5, and the new one is RHEL6.
  2. RHEL6 includes an update to glibc that changes how malloc allocates memory for multi-threaded programs, which inflates their reported virtual memory usage.
  3. The JVM uses a multi-threaded garbage collector by default.

To fix the problem I've used a combination of the following:

  • Export the MALLOC_ARENA_MAX environment variable, set to a small number (1-10), e.g. in the job script: export MALLOC_ARENA_MAX=1 (see the job-script sketch after the note below)
  • Moderately increase the SGE memory requests, by 10% or so
  • Explicitly set the number of Java GC threads to a low number, e.g. java -XX:ParallelGCThreads=1 ...
  • Increase the SGE thread requests, e.g. qsub -pe pthreads 2

Note that it's unclear whether setting MALLOC_ARENA_MAX all the way down to 1 is the right number, but low values worked well in my testing.
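
Putting these together, a minimal SGE job script might look like the following sketch (the job name, resource values, and jar name are hypothetical placeholders, not values from the original setup):

#!/bin/bash
#$ -N my-java-job                 # hypothetical job name
#$ -pe pthreads 2                 # extra thread slot for the multi-threaded JVM
#$ -l h_vmem=4.4G                 # heap requirement plus ~10% headroom (example value)

export MALLOC_ARENA_MAX=1         # cap glibc per-thread malloc arenas

java -Xmx4g -XX:ParallelGCThreads=1 -jar my-app.jar   # my-app.jar is a placeholder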

Here are the links that led me to these conclusions:

https://www.ibm.com/developerworks/community/blogs/kevgrig/entry/linux_glibc_2_10_rhel_6_malloc_may_show_excessive_virtual_memory_usage?lang=en

What would cause a java process to greatly exceed the Xmx or Xss limit?

http://siddhesh.in/journal/2012/10/24/malloc-per-thread-arenas-in-glibc/

Java consumes memory more than Xmx argument

As the comments and answers have alluded to, there are a number of other factors to take into
account when measuring JVM memory usage. However, I don't think any answer has gone into
nearly enough depth.

JVM Memory Overview

Let's hit the question "I was wondering what is that 30MB used for?" head-on. To do this, here is a simple Java class:

// HelloWorld.java
public class HelloWorld {
    public static void main(String[] args) throws Exception {
        System.out.println("Hello world!");
        Thread.sleep(10000); // wait 10 seconds so we can check memory usage
    }
}

Now compile and run it with heap constraints:

$ nohup java -Xms2m -Xmx2m HelloWorld & # run in background
$ ps aux | awk 'NR==1; /[H]elloWorld/'
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
chaospie 6204 6.0 0.1 2662860 23040 pts/2 Sl 19:15 0:00 java -Xms2m -Xmx2m HelloWorld

Looking at the RSS (Resident Set Size, the amount of physical memory the process is using) above, we see that the JVM
process is using about 23MB of memory. To see why, let's do some analysis. The quickest way to get a good
overview is to turn on Native Memory Tracking (-XX:NativeMemoryTracking=summary) and use the jcmd tool's VM.native_memory command. So,
let's run our app again:

$ nohup java -XX:NativeMemoryTracking=summary -Xms2M -Xmx2M HelloWorld &
[2] 6661
nohup: ignoring input and appending output to 'nohup.out'

$ ps aux | awk 'NR==1; /[H]elloWorld/'
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
chaospie 6661 5.0 0.1 2662860 23104 pts/2 Sl 19:21 0:00 java -XX:NativeMemoryTracking=summary -Xms2M -Xmx2M HelloWorld

$ jcmd 6661 VM.native_memory summary
6661:

Native Memory Tracking:

Total: reserved=1360145KB, committed=61177KB
- Java Heap (reserved=2048KB, committed=2048KB)
(mmap: reserved=2048KB, committed=2048KB)

- Class (reserved=1066093KB, committed=14189KB)
(classes #402)
(malloc=9325KB #146)
(mmap: reserved=1056768KB, committed=4864KB)

- Thread (reserved=20646KB, committed=20646KB)
(thread #21)
(stack: reserved=20560KB, committed=20560KB)
(malloc=62KB #110)
(arena=23KB #40)

- Code (reserved=249632KB, committed=2568KB)
(malloc=32KB #299)
(mmap: reserved=249600KB, committed=2536KB)

- GC (reserved=10467KB, committed=10467KB)
(malloc=10383KB #129)
(mmap: reserved=84KB, committed=84KB)

- Compiler (reserved=132KB, committed=132KB)
(malloc=1KB #21)
(arena=131KB #3)

- Internal (reserved=9453KB, committed=9453KB)
(malloc=9421KB #1402)
(mmap: reserved=32KB, committed=32KB)

- Symbol (reserved=1358KB, committed=1358KB)
(malloc=902KB #86)
(arena=456KB #1)

- Native Memory Tracking (reserved=143KB, committed=143KB)
(malloc=86KB #1363)
(tracking overhead=57KB)

- Arena Chunk (reserved=175KB, committed=175KB)
(malloc=175KB)

Memory Regions

Let's break it down [1]:

  • Java Heap: the heap, where Java objects live, sized by -Xms/-Xmx.
  • Class: this is Metaspace, assuming you are using Java 8.
  • Thread: this shows the number of threads and the overall memory usage of the threads (note that the stack size in this section reflects the -Xss value times the number of threads; you can get the default -Xss value with java -XX:+PrintFlagsFinal -version | grep ThreadStackSize).
  • Code: the code cache - this is used by the JIT (Just In Time compiler) to cache compiled code.
  • GC: space used by the garbage collector.
  • Compiler: space used by the JIT when generating code.
  • Symbol: symbols, field names, method signatures, etc.
  • Native Memory Tracking: memory used by the native memory tracker itself.
  • Arena Chunk: related to malloc arenas [2].

Much more than just the heap!
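
Most of these regions can also be observed from inside the process with the standard java.lang.management API. Here is a minimal sketch (the class name is mine; the exact pool names and figures depend on the JVM):

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

public class MemoryRegions {
    public static void main(String[] args) {
        // Overall heap vs non-heap (non-heap covers Metaspace, Code Cache, etc.)
        System.out.println("Heap:     " + ManagementFactory.getMemoryMXBean().getHeapMemoryUsage());
        System.out.println("Non-heap: " + ManagementFactory.getMemoryMXBean().getNonHeapMemoryUsage());

        // Per-pool breakdown, e.g. Metaspace, Compressed Class Space, Code Cache
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            System.out.println(pool.getName() + ": " + pool.getUsage());
        }
    }
}

Each MemoryUsage printed above carries init/used/committed/max fields, which map closely onto the committed and reserved figures discussed next.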

Reserved, Committed And RSS

Note that each region has a committed and a reserved size. To keep it short,
reserved is what the region can grow to, and committed is what is currently committed for use.
For example, see the Java Heap section: Java Heap (reserved=2048KB, committed=2048KB). reserved corresponds to our
-Xmx value and committed to our -Xms value; in this case they are equal.

Note too that the total committed size does not
reflect the actual usage reported by RSS (or the RES column in top). The reason they differ is that RSS shows
only those memory pages which have been, and still are, in use in physical memory, whereas committed includes memory
which is committed for use but has never been placed in physical memory [3].
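
To watch reserved and committed diverge, re-run the example with the heap bounds pulled apart (a sketch; the exact figures will vary by JVM and platform):

$ nohup java -XX:NativeMemoryTracking=summary -Xms2m -Xmx64m HelloWorld &
$ jcmd $! VM.native_memory summary | grep -A1 "Java Heap"

Here you would expect the heap's reserved size to track -Xmx (around 65536KB) while committed stays near the -Xms value (around 2048KB).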

There is a lot more to this; JVM and OS memory management is a complex topic, but I hope this answers your question at least at a high level.


  1. See https://docs.oracle.com/javase/8/docs/technotes/guides/troubleshoot/tooldescr022.html
  2. From the JVM Native Memory Tracking docs (https://docs.oracle.com/javase/8/docs/technotes/guides/troubleshoot/tooldescr007.html#BABJGHDB):

     Arena is a chunk of memory allocated using malloc. Memory is freed from these chunks in bulk, when exiting a scope or leaving an area of code. These chunks may be reused in other subsystems to hold temporary memory, for example, pre-thread allocations. Arena malloc policy ensures no memory leakage. So Arena is tracked as a whole and not individual objects. Some amount of initial memory cannot be tracked.

  3. Going into the difference between RSS, reserved and committed memory would be too much here; OS memory management is a complex topic, but see this answer for a good overview.

Limit total memory consumption of Java process (in Cloud Foundry)

Unfortunately, there is no way to definitively enforce a memory limit on the JVM. Most of the memory regions are configurable (-Xmx, -Xss, -XX:MaxPermSize, -XX:MaxMetaspaceSize, etc.) but the one you can't control is native memory. Native memory contains a whole host of things, from memory-mapped files to native libraries to JNI code. The best you can do is profile your application, find out where the memory growth is occurring, and either solve the growth or give yourself enough breathing room to survive it.
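
As a sketch of how far the configurable regions can be pinned down (MyApp and the sizes are placeholders; native library and JNI allocations remain uncapped regardless):

$ java -Xmx512m -Xss256k \
       -XX:MaxMetaspaceSize=128m \
       -XX:ReservedCodeCacheSize=64m \
       -XX:MaxDirectMemorySize=64m \
       MyApp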

Certainly unsatisfying, but in the end not much different from other languages and runtimes that have no control over their memory footprint.

How does a large -Xss setting affect server performance?

The answer is complicated. You're also asking the wrong question - make sure you read the entire answer all the way to the end.

Answering your question: How bad is large -Xss?

The amount of RAM a JVM needs is, basically, heap+metaspace+MAX_THREADS*STACK_SIZE.

The heap is simple: that's what the -Xmx parameter is for. Metaspace is a more or less constant (I'm oversimplifying) and not particularly large amount.

Furthermore, assume the usual server setup, where the JVM gets a static amount of memory: it's a server, it has a fixed amount of RAM, and the best option is usually to spend all of it, giving every major process on the system a locked-in, configured amount of RAM. If the JVM is the only major software running on there (e.g. there is a database involved but it runs on another machine) and you have 8GB in the box, then give the JVM ~7GB. Why wouldn't you? Use -Xmx and -Xms, set to the same value, and make it large. If postgres is also running on the box, give the JVM perhaps 3GB and postgres 4GB (depending on how db-heavy your app is, of course). Etcetera.

The point is, if you have both a large stack size and a decently large maximum thread count, say an -Xss of 20MB and a maximum of 100 threads, then those stacks alone claim 2GB of your allocated 7. On a box with 8GB installed and only the JVM as a major consumer of resources, this setting:

java -Xmx7g -Xms7g -Xss20m

would be completely wrong and cause all sorts of trouble - that adds up to 9 GB, and I haven't even started accounting for metaspace yet, or the needs of the kernel. The box doesn't have that much! Instead you should be doing perhaps -Xmx5g -Xms5g -Xss20m.

And now you know what the performance cost is of this: The cost is having to reduce your -Xmx -Xms value from 7 to 5. It gets disastrously worse if you had to knock it down from 3 to 1 because it's a 4GB box - at that point what you're doing is basically impossible unless you first launch a new server with more ram in it.

Actually helping you solve your problem

Forget about all of the above; that's the wrong way to solve this problem. Keep your -Xss nice and low, or don't set it at all.

Instead, take your init code and isolate it, then run it in a separately set-up thread (and then just .join() on that thread to wait for it to complete and to flush all the fields your init code modified back; join() establishes the happens-before relationship needed for that). Use this thread constructor:

Runnable initCode = () -> {
    // your init stuff goes here
};

ThreadGroup tg = Thread.currentThread().getThreadGroup();
// Thread(group, target, name, stackSize): the fourth argument requests a stack size in bytes
Thread initThread = new Thread(tg, initCode, "init", 20L * 1024L * 1024L);
initThread.start();
initThread.join();

But do some research first. The API of Thread is horribly designed and makes all sorts of grave errors. In particular, the stack size number (20MB here) is just a hint, and the javadoc says any VM is free to completely ignore it. Good API design would, of course, have specified that an exception is thrown if the VM cannot honour your requested stack size.

I've done a quick test; adoptopenjdk 11 on a mac seems to have no problem with it.

Here's my test setup:

> cat Test.java
public class Test {
    public static void main(String[] args) throws Exception {
        Runnable r = () -> {
            System.out.println("Allowed stack depth: " + measure());
        };
        r.run();
        r.run();
        Thread t = new Thread(Thread.currentThread().getThreadGroup(), r, "init", 1024L * 1024L);
        t.start();
        t.join();
        r.run();
    }

    public static int measure() {
        int min = 1;
        int max = 50000;
        while (min < max) {
            int mid = (max + min) / 2;
            try {
                attempt(mid);
                if (min == mid) return min;
                min = mid;
            } catch (StackOverflowError e) {
                max = mid;
            }
        }
        return min;
    }

    public static void attempt(int depth) {
        if (depth == 0) return;
        attempt(depth - 1);
    }
}

> javac Test.java; java -Xss200k Test
Allowed stack depth: 2733
Allowed stack depth: 6549
Allowed stack depth: 49999
Allowed stack depth: 6549

You can't just check the size of the stack trace, as the JVM has a hard cap and won't store more than 1024 stack trace elements, hence the binary search for the answer.

I can't quite explain why the value isn't constant (it hops from 2733 to 6549), or even why an -Xss of 150k produces higher numbers - a real "what the heck?" moment, and I'll ask a question about that right after posting this answer - but it does show that the thread made with a larger stack does indeed allow a far deeper method call stack.

Run this test code on the target environment with the target JDK just to be sure it'll work, and then you have your actual solution :)

Java using much more memory than heap size (or size correctly Docker memory limit)

Virtual memory used by a Java process extends far beyond just the Java Heap. The JVM includes many subsystems: the garbage collector, class loading, JIT compilers etc., and all of these subsystems require a certain amount of RAM to function.

The JVM is not the only consumer of RAM. Native libraries (including the standard Java Class Library) may also allocate native memory, and this won't even be visible to Native Memory Tracking. The Java application itself can also use off-heap memory by means of direct ByteBuffers.

So what takes memory in a Java process?

JVM parts (mostly shown by Native Memory Tracking)

1. Java Heap

The most obvious part. This is where Java objects live. Heap takes up to -Xmx amount of memory.

2. Garbage Collector

GC structures and algorithms require additional memory for heap management. These structures include the Mark Bitmap, the Mark Stack (for traversing the object graph), Remembered Sets (for recording inter-region references) and others. Some of them are directly tunable, e.g. -XX:MarkStackSizeMax; others depend on heap layout, e.g. the larger the G1 regions (-XX:G1HeapRegionSize), the smaller the remembered sets.

GC memory overhead varies between GC algorithms. -XX:+UseSerialGC and -XX:+UseShenandoahGC have the smallest overhead. G1 or CMS may easily use around 10% of total heap size.

3. Code Cache

Contains dynamically generated code: JIT-compiled methods, the interpreter and run-time stubs. Its size is limited by -XX:ReservedCodeCacheSize (240M by default). Turn off tiered compilation (-XX:-TieredCompilation) to reduce the amount of compiled code and thus the Code Cache usage.

4. Compiler

The JIT compiler itself also requires memory to do its job. This can again be reduced by switching off tiered compilation or by reducing the number of compiler threads: -XX:CICompilerCount.
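
As a concrete sketch, a JVM invocation that trims both the Code Cache and the compiler's own overhead might look like this (MyApp is a placeholder):

$ java -XX:ReservedCodeCacheSize=64m -XX:-TieredCompilation -XX:CICompilerCount=2 MyApp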

5. Class loading

Class metadata (method bytecodes, symbols, constant pools, annotations etc.) is stored in an off-heap area called Metaspace. The more classes are loaded, the more Metaspace is used. Total usage can be limited by -XX:MaxMetaspaceSize (unlimited by default) and -XX:CompressedClassSpaceSize (1G by default).

6. Symbol tables

The JVM has two main hashtables: the Symbol table contains names, signatures, identifiers etc., and the String table contains references to interned strings. If Native Memory Tracking indicates significant memory usage by the String table, it probably means the application calls String.intern excessively.
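
A small sketch of how this happens (the class name and counts are mine): interning many distinct strings grows the String table, which can then be observed with jcmd <pid> VM.native_memory while the process sleeps.

public class InternDemo {
    public static void main(String[] args) throws Exception {
        for (int i = 0; i < 1_000_000; i++) {
            // each distinct interned string adds an entry to the JVM String table
            ("key-" + i).intern();
        }
        Thread.sleep(60_000); // keep the process alive so it can be inspected with jcmd
    }
}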

7. Threads

Thread stacks also account for RAM usage. The stack size is controlled by -Xss. The default is 1M per thread, but fortunately things are not so bad: the OS allocates memory pages lazily, i.e. on first use, so the actual memory usage is much lower (typically 80-200KB per thread stack). I wrote a script to estimate how much of RSS belongs to Java thread stacks.
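
The lazy allocation is easy to observe with a sketch like the one below (the class name and numbers are mine). Run it with a deliberately large stack, e.g. java -Xss16m StackDemo: the nominal stack space is ~1.6GB, yet RSS (ps -o rss -p <pid>) should grow by only a few tens of megabytes, because idle threads never touch most of their stack pages.

public class StackDemo {
    public static void main(String[] args) throws Exception {
        for (int i = 0; i < 100; i++) {
            Thread t = new Thread(() -> {
                try {
                    Thread.sleep(60_000); // idle: the thread barely touches its stack
                } catch (InterruptedException ignored) {
                }
            });
            t.start();
        }
        Thread.sleep(60_000); // keep the process alive for inspection
    }
}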

There are other JVM parts that allocate native memory, but they do not usually play a big role in total memory consumption.

Direct buffers

An application may explicitly request off-heap memory by calling ByteBuffer.allocateDirect. The default off-heap limit is equal to -Xmx, but it can be overridden with -XX:MaxDirectMemorySize. Direct ByteBuffers are included in the Other section of NMT output (or Internal before JDK 11).
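
A minimal sketch of the limit in action (the class name is mine): run it with -XX:MaxDirectMemorySize=64m and the loop should fail after roughly 64 iterations with java.lang.OutOfMemoryError: Direct buffer memory.

import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public class DirectDemo {
    public static void main(String[] args) {
        List<ByteBuffer> buffers = new ArrayList<>();
        while (true) {
            // each buffer takes 1 MB of off-heap memory, counted against MaxDirectMemorySize
            buffers.add(ByteBuffer.allocateDirect(1024 * 1024));
            System.out.println("Direct MB allocated: " + buffers.size());
        }
    }
}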

The amount of direct memory in use is visible through JMX, e.g. in JConsole or Java Mission Control:

[screenshot: BufferPool MBean]

Besides direct ByteBuffers there can be MappedByteBuffers - files mapped into the virtual memory of the process. NMT does not track them; however, MappedByteBuffers can also take physical memory, and there is no simple way to limit how much they take. You can see the actual usage by looking at the process memory map: pmap -x <pid>

Address           Kbytes     RSS   Dirty Mode  Mapping
...
00007f2b3e557000   39592   32956       0 r--s- some-file-17405-Index.db
00007f2b40c01000   39600   33092       0 r--s- some-file-17404-Index.db
                           ^^^^^ physical memory taken by the mapped files
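
For reference, here is a sketch of how such mappings typically arise (the file name is a placeholder; a single MappedByteBuffer is limited to 2GB):

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MapDemo {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile file = new RandomAccessFile("some-file.db", "r");
             FileChannel channel = file.getChannel()) {
            MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            long sum = 0;
            // touching a page faults it into physical memory: it then counts towards RSS,
            // even though NMT knows nothing about it
            for (int pos = 0; pos < buf.limit(); pos += 4096) {
                sum += buf.get(pos);
            }
            System.out.println(sum); // use the result so the loop is not dead code
        }
    }
}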

Native libraries

JNI code loaded by System.loadLibrary can allocate as much off-heap memory as it wants, with no control from the JVM side. This also applies to the standard Java Class Library; in particular, unclosed Java resources may become a source of native memory leaks. Typical examples are ZipInputStream or DirectoryStream.
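
The defensive pattern is simply deterministic closing, as in this sketch (archive.zip is a placeholder): ZipFile holds native zlib structures that no JVM flag can cap, and try-with-resources guarantees they are released.

import java.util.zip.ZipFile;

public class ZipDemo {
    public static void main(String[] args) throws Exception {
        // try-with-resources releases the native inflater structures promptly,
        // instead of waiting for finalization
        try (ZipFile zip = new ZipFile("archive.zip")) {
            zip.stream().forEach(entry -> System.out.println(entry.getName()));
        }
    }
}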

JVMTI agents, in particular the jdwp debugging agent, can also cause excessive memory consumption.

This answer describes how to profile native memory allocations with async-profiler.

Allocator issues

A process typically requests native memory either directly from the OS (via the mmap system call) or through malloc, the standard libc allocator. In turn, malloc requests big chunks of memory from the OS using mmap and then manages these chunks according to its own allocation algorithm. The problem is that this algorithm can lead to fragmentation and excessive virtual memory usage.

jemalloc, an alternative allocator, often appears smarter than regular libc malloc, so switching to jemalloc may result in a smaller footprint for free.
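
Switching is usually just a matter of preloading the library, along these lines (the library path varies by distribution and is an assumption here, as is the jar name):

$ LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 java -jar my-app.jar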

Conclusion

There is no guaranteed way to estimate full memory usage of a Java process, because there are too many factors to consider.

Total memory = Heap + Code Cache + Metaspace + Symbol tables +
               Other JVM structures + Thread stacks +
               Direct buffers + Mapped files +
               Native libraries + Malloc overhead + ...

It is possible to shrink or limit certain memory areas (like the Code Cache) with JVM flags, but many others are entirely outside the JVM's control.

One possible approach to setting Docker limits would be to watch the actual memory usage in a "normal" state of the process. There are tools and techniques for investigating issues with Java memory consumption: Native Memory Tracking, pmap, jemalloc, async-profiler.

Update

Here is a recording of my presentation Memory Footprint of a Java Process.

In this video, I discuss what may consume memory in a Java process, how to monitor and restrain the size of certain memory areas, and how to profile native memory leaks in a Java application.


