DaCapo

This past week or so I’ve been trying to get the DaCapo benchmarks running on Shark. It’s a total baptism of fire. ANTLR uses exceptions extensively, so I’ve had to implement exception handling. FOP is multithreaded, so I’ve had to implement slow-path monitor acquisition and release (all of synchronization is now done!). I’ve had to implement safepoints, unresolved field resolution, and unresolved method resolution for invokeinterface. I’ve had to replace the unentered block detection code to cope with the more complex flows introduced by exception handlers. I’ve fixed bugs in the divide-by-zero check, in aload, astore, checkcast and new, and to top it off I implemented lookupswitch for kicks. And I’m only halfway through the set of benchmarks…
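For anyone who hasn’t met lookupswitch: the bytecode carries a default branch offset plus a list of (match, offset) pairs sorted by match value, and it jumps to the offset whose match equals the key, or to the default if none does. Here is a rough sketch of those semantics in Python. It is not Shark’s code (Shark emits LLVM IR), and the function and parameter names are made up for illustration.

```python
from bisect import bisect_left

def lookupswitch(key, pairs, default):
    """Return the branch offset for key.

    pairs is a list of (match, offset) tuples sorted by match,
    as the class file format requires; default is the fallback offset.
    """
    matches = [match for match, _ in pairs]
    # The pairs are sorted, so a binary search finds the candidate slot.
    i = bisect_left(matches, key)
    if i < len(pairs) and pairs[i][0] == key:
        return pairs[i][1]
    return default
```

Because the pairs are sorted, the lookup can be a binary search rather than a linear scan, which is the reason the format insists on the ordering.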

Building Shark

For reference, this is how to reproduce my working environment and get a debuggable Shark built:

svn co http://llvm.org/svn/llvm-project/llvm/trunk llvm
cd llvm
./configure --with-pic --enable-pic
make
cd ..
hg clone http://icedtea.classpath.org/hg/icedtea6
cd icedtea6
curl http://gbenson.net/wp-content/uploads/2008/08/mixtec-hacks.patch | patch -p1
./autogen.sh
LLVM_CONFIG=$(dirname $PWD)/llvm/Debug/bin/llvm-config ./configure --enable-shark
make icedtea-against-ecj

After the initial make icedtea-against-ecj you can use make hotspot to rebuild only HotSpot.

Shark 0.03 released

I just updated icedtea6 hg with the latest Shark. The main reason for this release is that Andrew Haley pointed out that the marked-method stuff I was using to tell compiled methods from interpreted ones didn’t work on amd64. It would have been possible to make it work there, but I didn’t like the idea of having something that needs tweaking for each new platform you build on. Now interpreted methods have the same calling convention as compiled ones, which removes the need to tell them apart at all.

Other new features in this release include support for long, float, and double values, and a massive pile of new bytecodes. Check out the coverage page now, it’s awesome!

Debug option fun

I just extended the -XX:+SharkTraceInstalls debug option to print out a load more stuff: the code size, the number of non-volatile registers used, and so on. If you run with it you’ll get something like this:

[0xd04bd010-0xd04bd1b4): java.lang.String::hashCode (420 bytes code, 32 bytes stack, 1 register)
[0xd04bd1c0-0xd04bd81c): java.lang.String::lastIndexOf (1628 bytes code, 80 bytes stack, 13 registers)
[0xd04bd820-0xd04bdc3c): java.lang.String::equals (1052 bytes code, 48 bytes stack, 5 registers)
[0xd04bdc40-0xd04be2f8): java.lang.String::indexOf (1720 bytes code, 80 bytes stack, 12 registers)
[0xd04be300-0xd04beaf4): java.io.UnixFileSystem::normalize (2036 bytes code, 80 bytes stack, 12 registers)
[0xd04beb00-0xd04c3310): sun.nio.cs.UTF_8$Encoder::encodeArrayLoop (18448 bytes code, 96 bytes stack, 15 registers)
[0xd04c3320-0xd04c348c): java.lang.String::charAt (364 bytes code, 32 bytes stack, 1 register)
[0xd04c3490-0xd04c3530): java.lang.Object::<init> (160 bytes code, 16 bytes stack, 0 registers)
...
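
If you want to total these numbers up, the trace lines are regular enough to parse. The following is a minimal sketch in Python that assumes the output looks exactly like the sample above; the regular expression and the function names are my own guesses at a convenient shape, not anything Shark provides. It also checks that the half-open address range agrees with the reported code size.

```python
import re

# One trace line: [0xSTART-0xEND): Class::method (N bytes code, N bytes stack, N registers)
LINE = re.compile(
    r"\[0x([0-9a-f]+)-0x([0-9a-f]+)\): (\S+) "
    r"\((\d+) bytes code, (\d+) bytes stack, (\d+) registers?\)"
)

def parse_install(line):
    """Parse one trace line into a dict, or return None if it doesn't match."""
    m = LINE.match(line.strip())
    if m is None:
        return None
    return {
        "start": int(m.group(1), 16),
        "end": int(m.group(2), 16),      # exclusive, as the "[...)" suggests
        "method": m.group(3),
        "code_bytes": int(m.group(4)),
        "stack_bytes": int(m.group(5)),
        "registers": int(m.group(6)),
    }

def total_code_bytes(lines):
    """Sum the code size over every line that parses."""
    records = (parse_install(line) for line in lines)
    return sum(r["code_bytes"] for r in records if r is not None)
```

For the first sample line, 0xd04bd1b4 minus 0xd04bd010 is 420, which matches the "420 bytes code" it reports.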

This isn’t (just) because I like debug options. Lately I’ve thought of a couple of optimizations I could do, one to reduce the code size for methods with more than one return, and one to cut the number of registers used. The former probably won’t do a lot other than reducing compile time, but the latter should be well worth it. Maybe not so much on PowerPC (though I already have a couple of methods maxed out on registers), but not all platforms have the luxury of 19 callee-save registers! And, of course, if I’m going to spend time optimizing then I want to see that it worked…

Whilst we’re on the subject of options, I found another funky one: -XX:+VerifyBeforeGC. I’ve already fixed one bug using it!

Shark stuff

The new framewalker stuff is all done now. For interpreted frames you need to write all kinds of fiddly little accessors so the garbage collector can find the method pointer, the local variables, the monitors, the expression stack, and any other objects that may be lying around in there. For compiled frames it’s simple: at any point at which the stack could be walked you just emit a map which says “in a stack frame with such and such a PC, slots 1, 2, 4, 5 and 8 contain pointers to objects”. The tricky bit is the PC: I don’t have access to the real one, so I had to fake one up. It’s all working now, and surviving garbage collections, which is pretty cool! The garbage collector interface was the single biggest thing I was worried about, so it’s nice to have it under my belt, with all the old hacks removed.
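
To make the map idea concrete, here is a toy model in Python. It is only an illustration of the concept: the Frame class, the OOP_MAPS table and the fake PC value 0x10 are all invented for the example, and the real thing lives inside HotSpot’s C++ data structures rather than anything like this.

```python
from dataclasses import dataclass, field

@dataclass
class Frame:
    pc: int                                      # the (fake) PC recorded for this frame
    slots: list = field(default_factory=list)    # raw slot contents

# Hypothetical table: fake PC -> indices of the slots holding object pointers.
OOP_MAPS = {
    0x10: (1, 2, 4, 5, 8),
}

def object_slots(frame, oop_maps):
    """Return (index, value) pairs for the slots a collector must visit."""
    return [(i, frame.slots[i]) for i in oop_maps[frame.pc]]
```

The point is that the collector never has to guess whether a slot holds a pointer or an integer: it looks up the frame’s PC and visits exactly the slots the map lists.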

Since finishing the framewalking stuff I’ve also implemented VM calls, which are the places where compiled code drops into C to do things too complicated to want to write in assembly. Making Shark fail gracefully when it hits unknown bytecodes was an amazing idea, as it shifted the focus from the simple grind of implementing bytecodes to the really critical — and interesting! — things. Doing it this way around means I can get all the infrastructure solid, then spend a week or so churning out the remaining ninety or so bytecodes.
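
The “fail gracefully” approach boils down to something like the sketch below, written in Python purely for illustration. The SUPPORTED set, the exception class and both function names are hypothetical; the assumption is simply that a method containing any bytecode the compiler doesn’t know is left to the interpreter instead of crashing the VM.

```python
class UnsupportedBytecode(Exception):
    """Raised when the compiler meets a bytecode it cannot handle yet."""

# Hypothetical set of bytecodes implemented so far.
SUPPORTED = {"iload", "iadd", "ireturn"}

def compile_method(bytecodes):
    """Pretend to compile a method, bailing out on the first unknown bytecode."""
    for op in bytecodes:
        if op not in SUPPORTED:
            raise UnsupportedBytecode(op)
    return "compiled"

def install(bytecodes):
    """Compile if possible; otherwise fall back to interpreting the method."""
    try:
        return compile_method(bytecodes)
    except UnsupportedBytecode:
        return "interpreted"
```

With this shape, adding a bytecode to the supported set only ever moves methods from the interpreted column to the compiled one, so the infrastructure can be exercised long before coverage is complete.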

In other news, with nearly 150 methods compiled, my simple testcase is now seven times faster with Shark than without.