Do Any JVM's JIT Compilers Generate Code That Uses Vectorized Floating Point Instructions?

Do any JVM's JIT compilers generate code that uses vectorized floating point instructions?

So, basically, you want your code to run faster. JNI is the answer. I know you said it didn't work for you, but let me show you that you are wrong.

Here's Dot.java:

import java.nio.FloatBuffer;
import org.bytedeco.javacpp.*;
import org.bytedeco.javacpp.annotation.*;

@Platform(include = "Dot.h", compiler = "fastfpu")
public class Dot {
    static { Loader.load(); }

    static float[] a = new float[50], b = new float[50];
    static float dot() {
        float sum = 0;
        for (int i = 0; i < 50; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    static native @MemberGetter FloatPointer ac();
    static native @MemberGetter FloatPointer bc();
    static native @NoException float dotc();

    public static void main(String[] args) {
        FloatBuffer ab = ac().capacity(50).asBuffer();
        FloatBuffer bb = bc().capacity(50).asBuffer();

        // Warm up both implementations so the JIT compiles them
        // before we start timing.
        for (int i = 0; i < 10000000; i++) {
            a[i % 50] = b[i % 50] = dot();
            float sum = dotc();
            ab.put(i % 50, sum);
            bb.put(i % 50, sum);
        }
        long t1 = System.nanoTime();
        for (int i = 0; i < 10000000; i++) {
            a[i % 50] = b[i % 50] = dot();
        }
        long t2 = System.nanoTime();
        for (int i = 0; i < 10000000; i++) {
            float sum = dotc();
            ab.put(i % 50, sum);
            bb.put(i % 50, sum);
        }
        long t3 = System.nanoTime();
        System.out.println("dot(): " + (t2 - t1) / 10000000 + " ns");
        System.out.println("dotc(): " + (t3 - t2) / 10000000 + " ns");
    }
}

and Dot.h:

float ac[50], bc[50];

inline float dotc() {
    float sum = 0;
    for (int i = 0; i < 50; i++) {
        sum += ac[i] * bc[i];
    }
    return sum;
}

We can compile and run that with JavaCPP using this command:

$ java -jar javacpp.jar Dot.java -exec

With an Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz, Fedora 30, GCC 9.1.1, and OpenJDK 8 or 11, I get this kind of output:

dot(): 39 ns
dotc(): 16 ns

That is roughly 2.4 times faster. We need to use direct NIO buffers instead of arrays, but HotSpot can access direct NIO buffers as fast as arrays. On the other hand, manually unrolling the loop does not provide a measurable performance boost in this case.
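For reference, here is a minimal sketch, separate from the benchmark above, of what a pure-Java dot product over direct NIO buffers looks like; HotSpot turns the absolute get(int) calls on a direct buffer into plain memory loads, which is why such code can keep up with array accesses:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

public class DirectDot {
    // Direct buffers live outside the Java heap, so native code can
    // read and write them without copying.
    static final FloatBuffer a = ByteBuffer.allocateDirect(50 * 4)
            .order(ByteOrder.nativeOrder()).asFloatBuffer();
    static final FloatBuffer b = ByteBuffer.allocateDirect(50 * 4)
            .order(ByteOrder.nativeOrder()).asFloatBuffer();

    static float dot() {
        float sum = 0;
        for (int i = 0; i < 50; i++) {
            // Absolute get(int) on a direct buffer JITs to a raw load,
            // comparable to an array access.
            sum += a.get(i) * b.get(i);
        }
        return sum;
    }
}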

Java autovectorization

To answer your question (1), in principle, a Java compiler could optimize in the presence of a non-inlined method1() call, if it analyzed method1() and determined that it doesn't have any side-effects that would affect the auto-vectorization. In particular, the compiler could prove that the method was "const" (no side effects and no reads from global memory) which in general would enable many optimizations at the call site without inlining. It could also perhaps prove more restricted properties, such as not reading or writing to arrays of a certain type, which would also be enough to allow auto-vectorization to proceed in this case.
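As a concrete (hypothetical) illustration, compare these two loops: the first has a straight-line body and is a candidate for auto-vectorization, while the second can only be vectorized if the compiler can prove that method1 neither writes c nor touches a or b:

// Hypothetical illustration of what IPA would have to prove.
public class IpaExample {
    static void plain(float[] a, float[] b, float[] c) {
        for (int i = 0; i < c.length; i++) {
            c[i] = a[i] * b[i]; // straight-line body: a vectorization candidate
        }
    }

    static void withCall(float[] a, float[] b, float[] c) {
        for (int i = 0; i < c.length; i++) {
            c[i] = a[i] * b[i];
            method1(); // opaque call: without IPA, the compiler must assume it
                       // may read or write a, b or c, so it cannot widen the
                       // loads and stores into vector operations
        }
    }

    static void method1() { /* could do anything, as far as the JIT knows */ }
}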

In practice, however, I am not aware of any Java compiler that can do this optimization today. If this answer is to be believed, in HotSpot "a [not-inlined] method call is typically opaque for JIT compiler." Most Java compilers are based in one way or another on HotSpot, so I don't expect there is a sophisticated Java compiler out there that can do this if HotSpot can't.

This answer also covers some reasons why such an interprocedural analysis (IPA) is likely to be both difficult and not particularly useful. In particular, methods about which non-trivial things can be proven are often small enough that they'd be inlined anyway. I'm not sure I totally agree: one could also argue that Java inlines aggressively partly because it doesn't do IPA, so strong IPA would perhaps open up the ability to do less inlining and consequently reduce runtime code footprint and JIT times.

The other method variants you ask about in (2) or (3) don't change anything: the compiler would still need IPA to allow it to vectorize, and as far as I know Java compilers don't have it.

(4) and (5) seem like they should be asked as totally separate questions.

About (6), I don't think it has changed, but it would make a good question for the OpenJDK HotSpot mailing lists: I think you'd get a good answer.

Finally, it's worth noting that even in the absence of IPA and knowing nothing about method1(), a compiler could optimize the math on a, b and c if it could prove none of them had escaped. This seems pretty useless in general though: it would mean that all those variables would have been allocated in this function (or some function inlined into this one), whereas I would imagine that in most realistic scenarios at least one of the three is passed in by the caller.
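A hypothetical shape where that would apply: all three arrays are allocated locally and never passed out, so even an opaque call provably cannot touch them:

// Hypothetical: a, b and c never escape this method, so the compiler can
// prove that an opaque call cannot read or write them.
public class NonEscaping {
    static float compute() {
        float[] a = new float[50], b = new float[50], c = new float[50];
        java.util.Arrays.fill(a, 1f);
        java.util.Arrays.fill(b, 2f);
        method1(); // opaque, but provably irrelevant to a, b and c
        float sum = 0;
        for (int i = 0; i < 50; i++) {
            c[i] = a[i] * b[i];
            sum += c[i];
        }
        return sum;
    }

    static void method1() { }
}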

How to use the Intel AVX in Java?

As far as I know, most current JVM JITs don't support automatic vectorization, or only do so for very simple loops, so you're out of luck.

In Mono's .NET implementation there's Mono.Simd for emitting vector code manually, and later Microsoft introduced System.Numerics.Vectors. Unfortunately there's nothing similar in Java. I don't know whether Java's Vector class is implemented with SIMD, but I don't think it is.

If you want to use CPU-specific features like AVX, then your only choice is JNI. Write your bottleneck part in C or C++ and call it from Java.

There's another solution, using Scala, for running vectorized code without modifying the JVM; you can read about it in How we made the JVM 40x faster.



Update:

Now there's a new Vector API being developed for writing vector code manually:

Provide an initial iteration of an incubator module, jdk.incubator.vector, to express vector computations that reliably compile at runtime to optimal vector hardware instructions on supported CPU architectures and thus achieve superior performance to equivalent scalar computations.

https://openjdk.java.net/jeps/338

  • Vector API Developer Program for Java* Software
  • Oracle and Intel seek to build a Java API for SIMD support
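To give a flavor of it, here is a minimal dot-product sketch against the first incubator round (JDK 16). This code is my own illustration, not from the JEP, and needs --add-modules jdk.incubator.vector to compile and run:

import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class VectorDot {
    // The widest species the CPU supports, e.g. 8 floats on AVX2.
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    static float dot(float[] a, float[] b) {
        FloatVector acc = FloatVector.zero(SPECIES);
        int i = 0;
        int bound = SPECIES.loopBound(a.length);
        for (; i < bound; i += SPECIES.length()) {
            FloatVector va = FloatVector.fromArray(SPECIES, a, i);
            FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
            acc = va.fma(vb, acc);              // acc = va * vb + acc, per lane
        }
        float sum = acc.reduceLanes(VectorOperators.ADD);
        for (; i < a.length; i++) {             // scalar tail
            sum += a[i] * b[i];
        }
        return sum;
    }
}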

Read more:

  • Do any JVM's JIT compilers generate code that uses vectorized floating point instructions?
  • SIMD Vectors/Matrices in Java?
  • What is the state of auto-vectorization in OpenJDK?
  • Vectorized Algorithms in Java

Java best practices for vectorized computations

There are no clear best practices for every case. Whether you should use a pure Java solution (without SIMD instructions) or SIMD-optimized native code through JNI depends on your particular application, specifically the size of your arrays, and any restrictions on the target system.

  1. You may not be allowed to install specific native libraries on the target system, and BLAS may not already be installed. In that case you simply have to use a Java library.
  2. Pure Java libraries tend to perform better for arrays much shorter than about 100 elements; somewhere past that point, native libraries called through JNI start to win. As always, your mileage may vary.

Pertinent benchmarks have been performed (in random order):

  • http://ojalgo.org/performance_ejml.html
  • http://lessthanoptimal.github.io/Java-Matrix-Benchmark/
  • Performance of Java matrix math libraries?

These benchmarks can be as confusing as they are informative. One library may be faster for some operation and slower for some other. Also keep in mind that there may be more than one implementation of BLAS available for your system; I currently have three installed: blas, atlas and openblas. Apart from choosing a Java library that wraps a BLAS implementation, you also have to choose the underlying BLAS implementation.

This answer has a fairly up-to-date list, except that it doesn't mention nd4j, which is rather new. Keep in mind that jeigen depends on eigen, and therefore not on BLAS.

C code to auto-vectorize floating point minimum

It looks like GCC doesn't enable vectorization of reductions unless you specify -ffast-math or -fassociative-math. When I enable those, it vectorizes just fine (using fminf in the inner loop):

ssetest.c:9: note: vect_model_load_cost: aligned.
ssetest.c:9: note: vect_model_load_cost: inside_cost = 1, outside_cost = 0 .
ssetest.c:9: note: vect_model_load_cost: aligned.
ssetest.c:9: note: vect_model_load_cost: inside_cost = 1, outside_cost = 0 .
ssetest.c:9: note: vect_model_simple_cost: inside_cost = 1, outside_cost = 0 .
ssetest.c:9: note: vect_model_store_cost: inside_cost = 1, outside_cost = 0 .
ssetest.c:9: note: Cost model analysis:
  Vector inside of loop cost: 4
  Vector outside of loop cost: 0
  Scalar iteration cost: 4
  Scalar outside cost: 0
  prologue iterations: 0
  epilogue iterations: 0
  Calculated minimum iters for profitability: 1
ssetest.c:9: note: Profitability threshold = 3
ssetest.c:9: note: LOOP VECTORIZED.
ssetest.c:15: note: vectorized 1 loops in function.
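For reference, an invocation that produces this kind of report would look roughly like the following; the reporting flag is an assumption that depends on the GCC version (-ftree-vectorizer-verbose on older releases, -fopt-info-vec-all on newer ones):

$ gcc -O3 -ffast-math -ftree-vectorizer-verbose=2 -c ssetest.c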

Why can't JIT compilers be used to produce a binary?

The JIT compiler compiles code dynamically.

  • It generates different code for different flavours of CPU.
  • It generates different code for different memory models, e.g. for the 64-bit JVM, whether the maximum heap size is < 4 GB, < 24 GB, < 32 GB or more produces different code in each case.
  • It will re-compile code as classes are loaded and unloaded.
  • It will re-optimise code based on how it is used, e.g. if a flag which used to be off is now on, and vice versa.

A static compiler cannot do these things.
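As a hypothetical sketch of that last point: HotSpot profiles branches, compiles the hot path as if a never-taken branch were dead, and deoptimizes and recompiles when that assumption breaks:

// Hypothetical sketch: HotSpot speculates on observed branch behaviour.
public class Speculation {
    static boolean flag = false;

    static long work(long[] data) {
        long sum = 0;
        for (long d : data) {
            if (flag) {
                sum -= d; // never taken while profiling: compiled as an
                          // "uncommon trap" instead of real code
            } else {
                sum += d; // hot path, compiled straight-line
            }
        }
        return sum;
    }
    // If flag is later set to true, hitting the trap deoptimizes the
    // method; it is eventually recompiled with both branches present.
}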

How to find the native instructions generated from a class file

You probably need -XX:+PrintOptoAssembly, which requires a debug build of the JVM. The links to the binary distributions no longer seem to be available, so you might have to build one from source: http://download.java.net/jdk6/6u10/archive/

If you're planning to try this with OpenJDK 7 as well, this might be of interest:
http://wikis.sun.com/display/HotSpotInternals/PrintAssembly
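For what it's worth, on later JVMs -XX:+PrintAssembly also works on product builds, as long as the hsdis disassembler library is on the JVM's library path; a typical invocation (the class name here is a placeholder) looks like:

$ java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly MyClass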


