Technique or Utility to Minimize Java "Warm-Up" Time

Technique or utility to minimize Java warm-up time?

"Warm-up" in Java is generally about two things:

(1) Lazy class loading: this can be worked around by forcing the classes to load.

The easy way to do that is to send a fake message. You should be sure that the fake message triggers all the class accesses. For example, if you send an empty message but your program checks whether the message is empty and skips certain work, then those code paths will never be exercised.

Another way to do it is to force class initialization by accessing each class when your program starts, as in the sketch below.
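
A minimal sketch of that idea, assuming you know which classes sit on your critical path (the class names here are hypothetical):

    public class ClassPreloader {
        // Force loading *and* initialization of critical-path classes at startup.
        // Class.forName with initialize=true runs each class's static initializers,
        // so the first real message doesn't pay that cost.
        static void preloadClasses() {
            String[] criticalClasses = {              // hypothetical names
                "com.example.OrderParser",
                "com.example.OrderValidator",
                "com.example.OrderRouter"
            };
            for (String name : criticalClasses) {
                try {
                    Class.forName(name, true, ClassPreloader.class.getClassLoader());
                } catch (ClassNotFoundException e) {
                    throw new IllegalStateException("Missing critical class: " + name, e);
                }
            }
        }
    }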

(2) Runtime optimization: at run time, the Java VM will optimize some parts of the code. This is the major reason why there is a warm-up time at all.

To ease this, you can send a bunch of fake (but realistic-looking) messages so that the optimization can finish before your users arrive.

Another way to help is to make methods easy to inline, for example by using private and final as much as you can. The reason is that the VM then does not need to look up the inheritance table to see which method is actually being called.
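
For illustration, a tiny sketch of what that looks like (the class and methods are hypothetical):

    public final class PriceCalculator {          // final: no subclass can override anything
        private int toTicks(double price) {       // private: never dispatched virtually
            return (int) (price * 100);
        }

        public int spreadInTicks(double bid, double ask) {
            // Each call below can resolve to exactly one method, so the JIT
            // can inline it without consulting an inheritance table.
            return toTicks(ask) - toTicks(bid);
        }
    }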

Hope this helps.

Warming up high throughput Java apps

If you are talking about a high-traffic webapp/website then JIT is a very minor issue. The biggest problem is warming up (populating) all the cache layers you'll need to have, e.g. Ehcache regions being populated from Hibernate. That's because IO-related operations are orders of magnitude slower than anything that happens inside the CPU (that is, unless you are calculating fractals :)
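
A minimal sketch of that kind of cache warming at startup; the repository interface here is a hypothetical stand-in for your DAO layer:

    public class CacheWarmer {
        // Hypothetical repository; any read-through API that populates the
        // cache region (e.g. a Hibernate second-level cache) works the same way.
        interface ProductRepository {
            Iterable<Long> findHotProductIds();
            Object findById(long id);
        }

        private final ProductRepository repository;

        public CacheWarmer(ProductRepository repository) {
            this.repository = repository;
        }

        // Touch the hottest data once before real traffic arrives,
        // so the first users don't pay the IO cost of cold caches.
        public void warmUp() {
            for (long id : repository.findHotProductIds()) {
                repository.findById(id);
            }
        }
    }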

Why does the JVM require warmup?

Which parts of the code should you warm up?

Usually, you don't have to do anything. However, for a low-latency application you should warm up the critical path in your system. You should have unit tests, so I suggest you run those on startup to warm up the code.
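
A minimal sketch of that warm-up step, assuming hypothetical OrderHandler and OrderMessage types standing in for your own:

    public class CriticalPathWarmup {
        // Stand-ins for your own message type and handler.
        interface OrderMessage { }
        interface OrderHandler { void onMessage(OrderMessage m); }

        // Replay a synthetic-but-realistic message through the real code path
        // so the JIT compiles it before the first live message arrives.
        static void warmUp(OrderHandler handler, OrderMessage synthetic) {
            for (int i = 0; i < 20_000; i++) {    // comfortably past the compile thresholds
                handler.onMessage(synthetic);
            }
        }
    }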

Even once your code is warmed up, you have to ensure your CPU caches stay warm as well. You can see a significant slowdown in performance after a blocking operation, e.g. network IO, for up to 50 micro-seconds. Usually this is not a problem, but if you are trying to stay under, say, 50 micro-seconds most of the time, it will be a problem most of the time.

Note: Warmup can allow Escape Analysis to kick in and place some objects on the stack. This means such objects don't need to be optimised away by hand. It is better to memory-profile your application before optimising your code.
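
For illustration, a sketch of the kind of allocation escape analysis can remove once the method is JIT-compiled (the Point class is an assumption):

    public class EscapeDemo {
        static final class Point {
            final int x, y;
            Point(int x, int y) { this.x = x; this.y = y; }
        }

        // Once JIT-compiled, escape analysis can prove 'p' never leaves this
        // method, so the allocation may be placed on the stack or eliminated
        // entirely, producing no garbage.
        static long lengthSquared(int x, int y) {
            Point p = new Point(x, y);
            return (long) p.x * p.x + (long) p.y * p.y;
        }
    }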

Even if I warm up some parts of the code, how long does it remain warm (assuming this term only means how long your class objects remain in memory)?

There is no time limit. It depends on whether the JIT detects that an assumption it made when optimising the code has turned out to be incorrect.
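
A small sketch of such an assumption breaking (all names are hypothetical): the call site is warmed with one implementation, then a second one appears.

    public class DeoptDemo {
        interface Codec { int encode(int x); }
        static final class Fast implements Codec { public int encode(int x) { return x + 1; } }
        static final class Slow implements Codec { public int encode(int x) { return x - 1; } }

        static int run(Codec c, int n) {
            int sum = 0;
            for (int i = 0; i < n; i++) {
                sum += c.encode(i);   // the JIT may assume only one Codec type is ever seen here
            }
            return sum;
        }

        public static void main(String[] args) {
            System.out.println(run(new Fast(), 1_000_000));  // compiled assuming the receiver is always Fast
            System.out.println(run(new Slow(), 1_000_000));  // assumption broken: the method is deoptimized
                                                             // ("made not entrant" under -XX:+PrintCompilation)
        }
    }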

How does it help if I have objects which need to be created each time I receive an event?

If you want low latency or high performance, you should create as few objects as possible. I aim to produce less than 300 KB/sec. At that rate you allocate about 26 GB per day (300 KB × 86,400 seconds), so you can have an Eden space large enough to minor collect just once a day.

Consider, for example, an application that is expected to receive messages over a socket, where the transactions could be New Order, Modify Order, Cancel Order, or Transaction Confirmed.

I suggest you re-use objects as much as possible (as sketched below), though if it's under your allocation budget, it may not be worth worrying about.
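
A minimal sketch of that re-use pattern for the message types above; the field layout is an assumption:

    import java.nio.ByteBuffer;

    // One mutable event per connection, re-filled for every incoming message
    // instead of allocating a fresh object each time.
    final class OrderEvent {
        byte type;        // e.g. NEW, MODIFY, CANCEL, CONFIRMED
        long orderId;
        long price;       // fixed-point ticks rather than a new object per message
        int  quantity;

        void readFrom(ByteBuffer buffer) {
            type     = buffer.get();
            orderId  = buffer.getLong();
            price    = buffer.getLong();
            quantity = buffer.getInt();
        }
    }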

Note that the application is about High Frequency Trading (HFT), so performance is of extreme importance.

You might be interested in our open source software, which is used in HFT systems at various investment banks and hedge funds.

http://chronicle.software/

My production application is used for High Frequency Trading, where every bit of latency can be an issue. It is clear that if you don't warm up your application at startup, the first requests will see high latencies of a few millis.

In particular you might be interested in https://github.com/OpenHFT/Java-Thread-Affinity as this library can help reduce scheduling jitter in your critical threads.

Also, it is said that the critical sections of code which require warmup should be run (with fake messages) at least 12K times for them to work in an optimized manner. Why and how does it work?

Code is compiled using background thread(s). Even though a method might be eligible for compilation to native code, that doesn't mean it has been compiled yet, especially on startup when the compiler is already pretty busy. 12K is not unreasonable, but it could be higher.
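
One hedged way to observe this from inside the process is the standard CompilationMXBean, whose total compilation time keeps rising while the background compiler is still working (runWarmupLoop is a hypothetical warm-up like the one sketched earlier):

    import java.lang.management.CompilationMXBean;
    import java.lang.management.ManagementFactory;

    public class WarmupObserver {
        // Report how long the JIT spent compiling while the warm-up ran.
        static void observe(Runnable runWarmupLoop) {
            CompilationMXBean jit = ManagementFactory.getCompilationMXBean();
            long before = jit.getTotalCompilationTime();   // milliseconds
            runWarmupLoop.run();
            long after = jit.getTotalCompilationTime();
            System.out.println("JIT spent " + (after - before) + " ms compiling during warm-up");
        }
    }

Running with -XX:+PrintCompilation shows the same thing per method, and -Xbatch forces compilation to happen in the foreground instead of in background threads.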

How to reduce the time taken for loops?

I don't think this is related to looping; rather, it is related to the createZipFile() function, which seems to do some initializing/loading the first time it is called.
Consider the following modified example, which produces identical run times in the loop:

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.ArrayList;

    import net.lingala.zip4j.core.ZipFile;
    import net.lingala.zip4j.exception.ZipException;
    import net.lingala.zip4j.model.ZipParameters;
    import net.lingala.zip4j.util.Zip4jConstants;

    public class ZipTiming {
        public static void main(String[] args) throws IOException {
            try {
                // One throw-away run outside the timed loop absorbs the one-time
                // initialization cost of createZipFile().
                long _start = System.currentTimeMillis();
                ZipFile _zipFile = new ZipFile(System.nanoTime() + ".zip");
                ZipParameters _parameters = new ZipParameters();
                _parameters.setCompressionMethod(Zip4jConstants.COMP_STORE);
                _parameters.setCompressionLevel(Zip4jConstants.DEFLATE_LEVEL_FASTEST);
                _parameters.setIncludeRootFolder(false);
                ArrayList<File> _files = new ArrayList<File>();
                for (int j = 1; j < 5; j++) {
                    _files.add(new File("1.jpg"));
                }
                System.out.println("Initializing files: " + (System.currentTimeMillis() - _start));
                _zipFile.createZipFile(_files, _parameters);
                System.out.println("Initial run: " + (System.currentTimeMillis() - _start));

                // Timed runs: these now produce near-identical times.
                for (int i = 0; i < 10; i++) {
                    long start = System.currentTimeMillis();
                    ZipFile zipFile = new ZipFile(System.nanoTime() + ".zip");
                    ZipParameters parameters = new ZipParameters();
                    parameters.setCompressionMethod(Zip4jConstants.COMP_STORE);
                    parameters.setCompressionLevel(Zip4jConstants.DEFLATE_LEVEL_FASTEST);
                    parameters.setIncludeRootFolder(false);
                    ArrayList<File> files = new ArrayList<File>();
                    for (int j = 1; j < 5; j++) {
                        files.add(new File("1.jpg"));
                    }
                    zipFile.createZipFile(files, parameters);

                    // Read the zip back in and delete it, as in the original test.
                    File zippedFile = zipFile.getFile();
                    byte[] buffer = new byte[(int) zippedFile.length()];
                    FileInputStream fis = new FileInputStream(zippedFile);
                    fis.read(buffer);
                    fis.close();
                    zippedFile.delete();
                    System.out.println("Time taken for " + (i + 1) + "th run: " + (System.currentTimeMillis() - start));
                }
            } catch (ZipException e) {
                e.printStackTrace();
            }
        }
    }

no-warmup option in cassandra-stress read or write

From the DataStax Developer Blog, specifically Jonathan Ellis' post titled How not to benchmark Cassandra:

Tests against JVM-based systems like Cassandra need to be long enough to allow the JVM to “warm up” and JIT-compile the bytecode to machine code. cassandra-stress has a “daemon” mode to make it easy to separate the warm-up phase from results that you actually measure; for other workload generators, your best bet is to simply make enough requests that the warmup is negligible.

Essentially, when your queries run on Cassandra normally, the JVM "warm up" period would have passed shortly after your node started. When you run cassandra-stress without allowing it the chance to properly compile everything it needs, your query latencies will be skewed (probably higher).

If you're interested, this question talks a little more about the warm-up period, issues it can cause, as well as solutions for dealing with it: Technique or utility to minimize Java "warm-up" time?

Therefore, specifying "no-warmup" will effectively ruin the results of your test. That is, unless you are specifically trying to test what your latencies would be during the JVM's "warm up" phase.
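
For illustration, a sketch of the difference (option syntax can vary across cassandra-stress versions, so treat this as an assumption):

    # Normal run: the tool performs a warm-up pass before recording latencies.
    cassandra-stress write n=1000000 -rate threads=50

    # With no-warmup, the JVM's warm-up phase is included in the measurements,
    # which skews latencies unless that phase is exactly what you want to study.
    cassandra-stress write n=1000000 no-warmup -rate threads=50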

How to preload used class when JVM starts?

I think you need to be using an ahead-of-time (AOT) Java compiler that compiles classes to native code that can be loaded by the JVM.

One example is the jaotc tool that was introduced in Java 9 as a result of the JEP 295 work. It takes a list of classes (previously compiled to bytecode files) and compiles them to a native code library; e.g. a Linux .so file. You can then tell the JVM about the .so file on the command line, and it will load the AOT compiled code and use it.
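
A minimal sketch of that workflow, assuming a Linux JDK that still ships jaotc (available roughly from JDK 9 until its removal in recent JDKs) and a hypothetical HelloWorld class:

    # Compile to bytecode, then AOT-compile the class to a shared library.
    javac HelloWorld.java
    jaotc --output libHelloWorld.so HelloWorld.class

    # Point the JVM at the library so it loads the AOT-compiled code at startup.
    java -XX:AOTLibrary=./libHelloWorld.so HelloWorld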

The upside of AOT compilation is faster JVM startup. The downside is that an AOT compiler can't do as good a job of optimizing as a JIT compiler, so overall performance suffers in a long-running program.

So your (apparent) need to meet a fast startup requirement may lead to you not meeting a long-term throughput requirement.

Additional references:

  • "The jaotc Command" - Oracle command documentation.
  • "Ahead of Time Compilation (AoT)" - A tutorial on using jaotc.

Java - Repeated function call reduces execution time

I'm sure the JVM might be doing some tricks behind the scenes, but can anybody help me understand what's really going on there?

  1. The massive latency of the first invocation is due to the initialization of the complete lambda runtime subsystem. You pay this only once for the whole application.

  2. The first time your code reaches any given lambda expression, you pay for the linkage of that lambda (initialization of the invokedynamic call site).

  3. After some iterations you'll see additional speedup due to the JIT compiler optimizing your reduction code.

Is there any way to avoid this optimization so I can benchmark true execution time?

You are asking for a contradiction here: the "true" execution time is the one you get after warmup, when all optimizations have been applied. This is the runtime an actual application would experience. The latency of the first few runs is not relevant to the wider picture, unless you are interested in single-shot performance.

For the sake of exploration you can see how your code behaves with JIT compilation disabled: pass -Xint to the java command. There are many more flags which disable various aspects of optimization.
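
A small sketch that makes the phases above visible (the timings you'll see are machine-dependent):

    import java.util.stream.IntStream;

    public class LambdaWarmup {
        public static void main(String[] args) {
            for (int run = 0; run < 10; run++) {
                long start = System.nanoTime();
                long sum = IntStream.range(0, 1_000_000)
                                    .mapToLong(x -> x * 2L)   // lambda linked on first use
                                    .sum();
                long micros = (System.nanoTime() - start) / 1_000;
                // Run 0 pays the lambda-runtime and linkage costs; later runs
                // speed up again once the JIT compiles the hot reduction loop.
                System.out.println("run " + run + ": " + micros + " us (sum=" + sum + ")");
            }
        }
    }

Re-running the same program with java -Xint LambdaWarmup shows what the code costs with JIT compilation disabled.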

Performance Explanation: code runs slower after warm up

Short: The Just In Time Compiler is dumb.

First of all you can use the option -XX:+PrintCompilation to see WHEN the JIT is doing something. Then you will see something like this:

    $ java -XX:+PrintCompilation weird
      168   1     weird$CountByOne::getNext (28 bytes)
      174   1 %   weird::main @ 18 (220 bytes)
      279   1 %   weird::main @ -2 (220 bytes)   made not entrant
    113727636
      280   2 %   weird::main @ 91 (220 bytes)
    106265475
    427228826

So you see that the method main is compiled during the first block and recompiled during the second.

Adding the options -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly will give you more information about what the JIT is producing. Note that -XX:+PrintAssembly requires hsdis-amd64.so, which is not readily available on common Linux distributions; you might have to compile it yourself from the OpenJDK sources.

What you get is a huge chunk of assembler code for getNext and main.

For me, in the first compilation it seems that only the first block in main is actually compiled; you can tell by the line numbers. It contains funny things like this:

    0x00007fa35505fc5b: add    $0x1,%r8           ;*ladd
                                                   ; - weird$CountByOne::getNext@6 (line 12)
                                                   ; - weird::main@28 (line 31)
    0x00007fa35505fc5f: mov    %r8,0x10(%rbx)     ;*putfield i
                                                   ; - weird$CountByOne::getNext@7 (line 12)
                                                   ; - weird::main@28 (line 31)
    0x00007fa35505fc63: add    $0x1,%r14          ;*ladd
                                                   ; - weird::main@31 (line 31)

(Indeed it is very long, due to unrolling and inlining of the loop)

Apparently, during the recompile of main, the second AND third blocks are compiled. The second block looks very similar to the first version. (Again, just an excerpt.)

    0x00007fa35505f05d: add    $0x1,%r8           ;*ladd
                                                   ; - weird$CountByOne::getNext@6 (line 12)
                                                   ; - weird::main@101 (line 42)
    0x00007fa35505f061: mov    %r8,0x10(%rbx)     ;*putfield i
                                                   ; - weird$CountByOne::getNext@7 (line 12)
                                                   ; - weird::main@101 (line 42)
    0x00007fa35505f065: add    $0x1,%r13          ;*ladd

HOWEVER, the third block is compiled differently: without inlining and unrolling.

This time the entire loop looks like this:

    0x00007fa35505f20c: xor    %r10d,%r10d
    0x00007fa35505f20f: xor    %r8d,%r8d          ;*lload
                                                   ; - weird::main@171 (line 53)
    0x00007fa35505f212: mov    %r8d,0x10(%rsp)
    0x00007fa35505f217: mov    %r10,0x8(%rsp)
    0x00007fa35505f21c: mov    %rbp,%rsi
    0x00007fa35505f21f: callq  0x00007fa355037c60 ; OopMap{rbp=Oop off=580}
                                                   ;*invokevirtual getNext
                                                   ; - weird::main@174 (line 53)
                                                   ; {optimized virtual_call}
    0x00007fa35505f224: mov    0x8(%rsp),%r10
    0x00007fa35505f229: add    %rax,%r10          ;*ladd
                                                   ; - weird::main@177 (line 53)
    0x00007fa35505f22c: mov    0x10(%rsp),%r8d
    0x00007fa35505f231: inc    %r8d               ;*iinc
                                                   ; - weird::main@180 (line 52)
    0x00007fa35505f234: cmp    $0x5f5e100,%r8d
    0x00007fa35505f23b: jl     0x00007fa35505f212 ;*if_icmpge
                                                   ; - weird::main@168 (line 52)

My guess is that the JIT identified that this part of the code is not used a lot, since it was using profiling information from the second block's execution, and therefore did not optimize it heavily. Also, the JIT appears to be lazy in the sense that it will not recompile a method once all relevant parts have been compiled. Remember, the first compilation result did not contain native code for the second/third blocks at all, so the JIT had to recompile main.


