What Optimizations to Expect from Dalvik and the Android Toolchain

What optimizations can I expect from Dalvik and the Android toolchain?

This is Ben, one of the engineers working on the JIT @ Google. When Bill and I started on this project, the goal was to deliver a working JIT as soon as possible, with minimal impact on resource contention (e.g. memory footprint, CPU cycles taken by the compiler thread), so that it could run on low-end devices as well. Therefore we used a very primitive trace-based model: the compilation unit passed to the JIT compiler is a basic block, sometimes as short as a single instruction. Such traces are stitched together at runtime through a technique called chaining, so that the interpreter and code-cache lookup won't be invoked often. To some degree, the major source of speedup comes from eliminating the repeated interpreter parsing overhead on frequently executed code paths.

That said, we do have quite a few local optimizations implemented with the Froyo JIT:

  • Register allocation (8 registers for v5te target since the JIT produces Thumb code / 16 registers for v7)
  • Scheduling (e.g. redundant load/store elimination for Dalvik registers, load hoisting, store sinking)
  • Redundant null check elimination (if such redundancy can be found in a basic block).
  • Loop formation and optimization for simple counted loops (i.e. no side exits in the loop body). For such loops, array accesses based on extended induction variables are optimized so that null and range checks are only performed in the loop prologue (a sketch follows this list).
  • One entry inline cache per virtual callsite w/ dynamic patching at runtime.
  • Peephole optimizations such as strength reduction on literal operands for mul/div.
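
To make the loop bullet above concrete, here is a hedged sketch (the method and variable names are invented) of the kind of simple counted loop that qualifies: because there is no side exit and the array index is the induction variable, the null check and the range check can both be done once, in the loop prologue.

static int sum(int[] data, int n) {
    int total = 0;
    // With no side exits and i as the induction variable, the JIT can
    // verify data != null and n <= data.length once in the prologue,
    // then run the loop body without per-iteration checks.
    for (int i = 0; i < n; i++) {
        total += data[i];
    }
    return total;
}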

In Gingerbread we added simple inlining for getters/setters. Since the underlying JIT frontend is still simple trace based, a callee that contains branches won't be inlined. But the inline-cache mechanism is in place, so virtual getters/setters can be inlined without problems.
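
As a rough illustration (class and method names are invented), the branch-free getter below is the shape that qualifies for inlining, while the branching variant would not be inlined by a simple trace-based frontend:

class Point {
    private int x;

    // Straight-line body: a candidate for getter inlining.
    int getX() {
        return x;
    }

    // Contains a branch, so a simple trace-based frontend
    // would not inline it.
    int getAbsX() {
        return x < 0 ? -x : x;
    }
}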

We are currently working on enlarging the compilation scope beyond a simple trace so that the compiler has a larger window for code analysis and optimization. Stay tuned.

Any documentation on Android optimisation - JAVAC vs JIT vs Dalvik

For an Android application, your code goes through two compilation/optimization passes. The first is performed by javac, when you compile the Java source to Java bytecode. The second is performed by dx, when you translate the Java bytecode to Dalvik bytecode.

I'm not familiar with any documentation that describes the optimizations performed by either. But both are open source (via OpenJDK, in javac's case), so you could theoretically examine both to get a feel for what optimizations they perform.

However, your examples really have nothing to do with compiler-level optimizations. I don't think knowing how, e.g., dx allocates registers will help you understand how to write performant Java code.

For the two sets of examples you mention, I would suggest trying two things on each.


1. Disassemble

The first thing would be to disassemble the dalvik bytecode, and take a look at any differences. Simply counting the instructions won't necessarily tell you anything, but in some cases it is clear that one is faster than the other.

For example, in your first set of examples, the bytecode for both snippets is nearly identical, with the exception that the 2nd snippet contains an additional instruction. So the first snippet is clearly more performant, although it would be an extremely minor performance difference.

invoke-interface {v0, v1, v2}, Landroid/content/SharedPreferences$Editor;->putString(Ljava/lang/String;Ljava/lang/String;)Landroid/content/SharedPreferences$Editor;
invoke-interface {v0}, Landroid/content/SharedPreferences$Editor;->commit()Z

vs.

invoke-interface {v0, v1, v2}, Landroid/content/SharedPreferences$Editor;->putString(Ljava/lang/String;Ljava/lang/String;)Landroid/content/SharedPreferences$Editor;
move-result-object v0
invoke-interface {v0}, Landroid/content/SharedPreferences$Editor;->commit()Z

This extra instruction makes sense if you take a close look at the two Java snippets. In the first, the return value of putString() is ignored, and commit() is called on the same value that putString() was called on. In the second snippet, commit() is called on the value returned by putString(), which requires an extra operation in the Dalvik bytecode to save off the return value: namely, the extra move-result-object instruction that you see there.

We, as the programmers, know that the return value of the putString method is the same object it is being called on. The compiler, however, can't know or assume this.
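
For reference, here is a plausible reconstruction of the two Java snippets. Only the call pattern is derived from the bytecode above; the variable names are assumptions, and editor stands for a SharedPreferences.Editor.

// Snippet 1: putString()'s return value is ignored and commit() is
// called on the original editor, so no move-result-object is needed.
editor.putString(key, value);
editor.commit();

// Snippet 2: the calls are chained, so the returned editor must be
// captured (the extra move-result-object) before commit() runs on it.
editor.putString(key, value).commit();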


2. Measure

The second thing you can do, as you mention, is to actually profile the snippets and measure their performance. After a while, you will start to get a feel for the kinds of things that improve performance.
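
One concrete way to do that on a device is to bracket the code under test with method tracing and open the resulting .trace file in Traceview. This is only a minimal sketch: the android.os.Debug calls are the real API, but the wrapped snippet is hypothetical.

import android.os.Debug;

class ProfilingSketch {
    static void profile(Runnable snippet) {
        Debug.startMethodTracing("snippet"); // writes snippet.trace to external storage
        snippet.run();                       // the code under test
        Debug.stopMethodTracing();
    }
}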

Are public getters and setters fine for Android if made final? Is final-correctness as important in Java as const-correctness is in C++?

Personally, I would recommend not using final for anything other than constants. In quite a few years of development I really haven't seen a good use for final other than static final ... for constants. The final keyword is particularly vexing when it comes to testing: java.net.URL being a final class forces you to use tools like PowerMock if you really want high test coverage.

When in doubt, develop realistic performance tests and profile the two scenarios. If there is no discernible difference, then don't add restrictions like final to your code. A statement like:

direct field access is about 3x faster than invoking a trivial getter

is quite meaningless without some real numbers. If invoking a trivial getter takes only 20 microseconds, then increasing the cost to 60 microseconds doesn't matter. For mobile applications, getters and setters aren't going to be the cause of your performance problems.
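
If you do want real numbers, a crude harness like the one below at least replaces folklore with a measurement. It is only a sketch with invented names: on-device timing is noisy, and the JIT may inline the getter entirely, so treat the results with suspicion.

class GetterBench {
    private int value = 42;

    int getValue() { return value; }

    // Nanoseconds for 'iterations' reads through the trivial getter.
    long timeGetter(int iterations) {
        long sum = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sum += getValue();
        }
        long elapsed = System.nanoTime() - start;
        if (sum < 0) throw new AssertionError(); // keep the loop live
        return elapsed;
    }

    // The same loop with direct field access, for comparison.
    long timeField(int iterations) {
        long sum = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sum += value;
        }
        long elapsed = System.nanoTime() - start;
        if (sum < 0) throw new AssertionError(); // keep the loop live
        return elapsed;
    }
}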

Always code for correctness first, readability second, and maintainability after that.

Android: why is native code so much faster than Java code

For algorithms that operate over arrays of data, two things significantly change performance between a language like Java and a language like C:

  • Array bound checking: Java will check every access, bmap[i], and confirm i is within the array bounds. If the code tries to access out of bounds, you will get a useful exception. C & C++ do not check anything and just trust your code. The best case response to an out of bounds access is a page fault. A more likely result is "unexpected behavior".

  • Pointers: You can significantly reduce the operations by using pointers.

Take this innocent example of a common filter (similar to blur, but 1D):

for (int i = 0; i < ndata - ncoef; ++i) {
    z[i] = 0;
    for (int k = 0; k < ncoef; ++k) {
        z[i] += coef[k] * d[i + k];
    }
}

When you access an array element such as coef[k], the CPU must:

  • Load the address of the array coef into a register;
  • Load the value of k into a register;
  • Sum them;
  • Fetch the memory at that address.

Every one of those array accesses can be improved, because you know that the indexes are sequential. Neither the compiler nor the JIT can know that the indexes are sequential, so they cannot optimize fully (although they keep trying).

In C++, you would write code more like this:

int d[10000];
int z[10000];
int coef[10];
int ndata = 10000; // assumed to match the array sizes above
int ncoef = 10;
int* zptr;
int* dptr;
int* cptr;
dptr = &(d[0]); // Just being overly explicit here; more likely you would dptr = d;
zptr = &(z[0]); // or zptr = z;
for (int i = 0; i < (ndata - ncoef); ++i) {
    *zptr = 0;
    cptr = coef;  // pointer assignments (not *cptr = coef, which would be a bug)
    dptr = d + i; // reset to the start of the current window
    for (int k = 0; k < ncoef; ++k) {
        *zptr += *cptr * *dptr;
        cptr++;
        dptr++;
    }
    zptr++;
}

When you first do something like this (and succeed in getting it correct), you will be surprised how much faster it can be. All the array address calculations (fetching the index and summing it with the base address) are replaced with a single increment instruction.

For 2D array operations such as a blur on an image, innocent-looking code like data[r][c] (or data[r * width + c] on a flat array) involves two value fetches, a multiply, and a sum. So with 2D arrays, pointers let you remove the multiply operations.
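
The multiply-removal half of the trick carries over to Java as well, by flattening the array and carrying a running index, though the bounds check on every access remains. A sketch with invented names:

class RowMajor {
    // One multiply per element: the index is recomputed as r * cols + c.
    static long sumWithMultiply(int[] data, int rows, int cols) {
        long sum = 0;
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                sum += data[r * cols + c];
            }
        }
        return sum;
    }

    // A running index replaces the per-element multiply with an increment.
    static long sumWithRunningIndex(int[] data, int rows, int cols) {
        long sum = 0;
        int idx = 0;
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                sum += data[idx++];
            }
        }
        return sum;
    }
}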

So the language allows a real reduction in the operations the CPU must perform. The cost is that the C++ code is horrendous to read and debug, and errors in pointers and buffer overflows are food for hackers. But when it comes to raw number-grinding algorithms, the speed improvement is too tempting to ignore.

Alternative language or toolchain for creating Android programs

As you may or may not be aware, the Android toolchain is based on a few simple ideas:

  • Your code is compiled with the plain old Java compiler and linked against the Android stubs (android.jar), which stand in for the system libraries.
  • After being compiled, the code is converted to the dex format. You can actually run this step yourself; just try dx --help (a concrete invocation is sketched after this list). The job of the dx tool is to take Java class files and convert them to dex code, a fairly straightforward translation that involves going from a stack-based to a register-based VM, plus a few other changes.
  • After having this in place, an apk is built using a set of apk tools. It used to be apkbuilder, but this has since been deprecated. You can run this step yourself as well. An APK is simply a bundle of the manifest, the resources, and a single file holding all of the code in dex form (i.e., many .class files compile down to a single .dex, which is quite a bit smaller because of a wrapped web of pointers).
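
To make the dx step concrete, here is a rough command-line sketch; the paths and package name are placeholders, and the flags are the standard ones from javac and the SDK's dx tool:

javac -classpath /path/to/android.jar -d obj src/com/example/MyActivity.java
dx --dex --output=classes.dex obj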

So the Android toolchain isn't really all that complex. The custom build process is handled by Ant build rules, which are defined in an SDK-wide build.xml that you can find in the platform-tools/ directory (iirc). To generate new baseline projects that use this custom build environment, you simply run the android update project command (example below).
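
For instance, assuming an existing project directory and the legacy SDK tools on your PATH:

android update project --path .   # regenerates build.xml and local.properties
ant debug                         # builds a debug-signed apk using the SDK rules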

While I'm not sure this is the response you'd hoped for, I hope it disambiguates the build process. It's not really all that complex a toolchain: the majority of it is off-the-shelf Java, not Android-specific (what makes it Android-specific is the library stubs for the dynamically linked system code). Beyond this, once you have a set of class files, you need only run a few commands to make an executable APK that Android unpacks. I would suspect that any tool targeting the JVM (and capable of linking with the Android-specific dynamically linked API) could perform a similar process of producing class files and using this toolchain to compile the rest of the way, though obviously the automated Ant build process makes it much simpler.

EDIT:

After some more digging, I found this relevant android-developers thread. An unsettling quote:

At this time we simply don't have the resources to support people who want to use their own build system, but we really wish we could. In many ways we try to make it easy on other tools vendor by clearly separating logic to eclipse or ant specific code (hence the multitude of jar files everywhere in the tools and in ADT), but this is not one of them.

However, you may also find this link helpful.


