I have a really bad cold, but I fixed the lusearch bug.
The State Decacher and Other Animals
It’s been a while. Here’s where I am:
Benchmark | Status | Detail
---|---|---
antlr | FAIL | too many open files |
bloat | pass | 699657ms |
chart | pass | 342527ms |
eclipse | FAIL | one method miscompiles, one method won’t compile |
fop | pass | 35198ms |
hsqldb | pass | 178011ms |
jython | pass | 983272ms |
luindex | pass | 140654ms |
lusearch | FAIL | segfault |
pmd | pass | 456881ms |
xalan | pass | 148200ms |
After implementing deoptimization and the remaining bytecodes I’ve been taking some time to rewrite the state cache and decache code. When methods are compiled, the local variables and expression stack mostly end up in registers, but when you enter the VM some of the locals and stack slots need to be accessible. Garbage collection can happen when you invoke Java methods or call VM functions, for example, so object pointers need to be visible. The way this works in Shark is that each method allocates a frame on the call stack at entry, with enough space to store all of its locals and stack slots. At VM entry, whatever slots are needed are written to the frame (“decached”), and on return any object pointers are reloaded (“cached”) in case they changed.
Unfortunately, over time, the cache and decache functions have become ridiculously overcomplicated. The problem is that there are three types of decache and cache — for a Java call, for a VM call, and for deoptimization — each with its own different rules for exactly what needs writing and rereading. The story ends there for cache, but a decache has three separate functions: it must generate the actual code to write the values, it must tell the garbage collector which slots contain objects, and it must describe the frame to the stack trace code. This is all further complicated by the fact that Shark uses a compressed expression stack, where long and double values take up one slot, whereas HotSpot uses an expanded version where they take up two.
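For anyone who finds code clearer than prose, here is a rough sketch of the idea in plain C++. It is only an illustration: the `Frame` type, the slot layout and the pretend VM call are all invented for the example, and it has nothing to do with Shark’s actual classes or its LLVM codegen. The decache and cache steps are marked with comments.

```cpp
#include <cstdio>
#include <vector>

// Illustrative stand-ins; not Shark's real types.
typedef unsigned long Object;      // a pretend object pointer the GC may move
struct Frame {                     // the per-method area on the call stack
  std::vector<Object> slots;       // locals and expression-stack slots
  std::vector<bool>   is_oop;      // which slots hold object pointers
};

// A stand-in for a VM call during which a collection moves objects.
// A real collector only sees pointers that were decached into the frame.
static void vm_call_that_may_gc(Frame& frame) {
  for (size_t i = 0; i < frame.slots.size(); i++)
    if (frame.is_oop[i])
      frame.slots[i] += 0x1000;    // pretend the object moved
}

int main() {
  // Values that normally live in registers while compiled code runs.
  Object local0 = 0x2000;          // an object pointer
  Object stack0 = 42;              // a plain int on the expression stack

  Frame frame;
  frame.slots  = { 0, 0 };
  frame.is_oop = { true, false };

  // Decache: before entering the VM, write the live slots into the
  // frame so the collector can find (and update) the object pointers.
  frame.slots[0] = local0;
  frame.slots[1] = stack0;

  vm_call_that_may_gc(frame);

  // Cache: on return, reload the object pointers from the frame in
  // case the collection moved them; plain values need no reload.
  local0 = frame.slots[0];

  std::printf("local0 = %#lx, stack0 = %lu\n", local0, stack0);
  return 0;
}
```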
I’m pretty sure that the Eclipse failure is a decache failure, and I’m leaning towards the other two failures being decache failures as well, hence the rewrite. The rewrite is in two parts: the first is to move the interface between the compressed and expanded stacks out of the decacher code and into the bit that parses the bytecode, and the second is to abstract everything so that cache and decache share as much code as possible. Currently there is not much sharing between the two, and it’s messy.
The first part is done, and seems pretty stable. The only place where compressed stacks now exist is in the SharkBlock class, and where it was necessary or useful to expose the expanded stack I’ve prefixed the method names with “x”.
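As a tiny illustration of the compressed versus expanded layouts (a sketch only: the `SlotType` tag and `expanded_index` function are made up for the example, rather than being HotSpot’s real `BasicType` or anything in Shark): every value is one slot in the compressed stack, longs and doubles take two in the expanded one, so translating an index just means adding one for every two-word value underneath it.

```cpp
#include <cassert>
#include <vector>

// Made-up tags for this example; HotSpot has its own BasicType enum.
enum SlotType { T_INT, T_OBJECT, T_LONG, T_DOUBLE };

static bool is_two_word(SlotType t) { return t == T_LONG || t == T_DOUBLE; }

// Map a compressed-stack index (one slot per value) to the index of the
// same value in an expanded stack (longs and doubles take two slots).
static int expanded_index(const std::vector<SlotType>& stack, int compressed) {
  int expanded = compressed;
  for (int i = 0; i < compressed; i++)
    if (is_two_word(stack[i]))
      expanded++;                  // each earlier two-word value shifts us up
  return expanded;
}

int main() {
  // Expression stack, bottom to top: int, long, object.
  std::vector<SlotType> stack = { T_INT, T_LONG, T_OBJECT };

  assert(expanded_index(stack, 0) == 0);  // the int
  assert(expanded_index(stack, 1) == 1);  // the long starts at slot 1...
  assert(expanded_index(stack, 2) == 3);  // ...so the object lands at slot 3
  return 0;
}
```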
The second part is a work in progress…
Shark, now with 100% bytecode coverage
I did `jsr`, `ret` and `multianewarray` today; Shark now has 100% bytecode coverage.
DaCapo status
I’ve been working on DaCapo for nearly two weeks now, so I took a bit of time out today to figure out where I am with it:
Benchmark | Status | Detail | Unimplemented bytecodes
---|---|---|---
antlr | FAIL | too many open files | jsr (once) |
bloat | pass | 718149ms | multianewarray (once) |
chart | pass | 337240ms | |
eclipse | FAIL | requires deoptimization | |
fop | pass | 37126ms | |
hsqldb | pass | 178120ms | |
jython | FAIL | requires deoptimization | jsr (21 times) |
luindex | pass | 149362ms | jsr (3 times) |
lusearch | FAIL | segfault | |
pmd | pass | 457936ms | jsr (15 times) |
xalan | pass | 174340ms | jsr (8 times) |
Note that this is a debug build, with no optimization and assertions enabled, so the times are in no way representative.
DaCapo
This past week or so I’ve been trying to get the DaCapo benchmarks running on Shark. It’s a total baptism of fire. ANTLR uses exceptions extensively, so I’ve had to implement exception handling. FOP is multithreaded, so I’ve had to implement slow-path monitor acquisition and release (all of synchronization is now done!). I’ve had to implement safepoints, unresolved field resolution, and unresolved method resolution for `invokeinterface`. I’ve had to replace the unentered block detection code to cope with the more complex flows introduced by exception handlers. I’ve fixed bugs in the divide-by-zero check, in `aload`, `astore`, `checkcast` and `new`, and to top it off I implemented `lookupswitch` for kicks. And I’m only halfway through the set of benchmarks…
Building Shark
For reference, this is how to reproduce my working environment and get a debuggable Shark built:
```
svn co http://llvm.org/svn/llvm-project/llvm/trunk llvm
cd llvm
./configure --with-pic --enable-pic
make
cd ..
hg clone http://icedtea.classpath.org/hg/icedtea6
cd icedtea6
curl http://gbenson.net/wp-content/uploads/2008/08/mixtec-hacks.patch | patch -p1
./autogen.sh
LLVM_CONFIG=$(dirname $PWD)/llvm/Debug/bin/llvm-config ./configure --enable-shark
make icedtea-against-ecj
```
After the initial `make icedtea-against-ecj` you can use `make hotspot` to rebuild only HotSpot.
Shark 0.03 released
I just updated icedtea6 hg with the latest Shark. The main reason for this release is that Andrew Haley pointed out that the marked-method stuff I was using to differentiate compiled methods from interpreted methods didn’t work on amd64, and while it was possible to make it work there I didn’t like the idea of having something that needs tweaking for each new platform you build on. Now interpreted methods have the same calling convention as compiled ones, so there’s no need to differentiate at all.
Other new features in this release include support for long, float, and double values, and a massive pile of new bytecodes. Check out the coverage page now; it’s awesome!
Debug option fun
I just extended the `-XX:+SharkTraceInstalls` debug option to print out a load more stuff: statistics on the code size, the number of non-volatile registers used, and so on. If you run with it you’ll get something like this:
```
[0xd04bd010-0xd04bd1b4): java.lang.String::hashCode (420 bytes code, 32 bytes stack, 1 register)
[0xd04bd1c0-0xd04bd81c): java.lang.String::lastIndexOf (1628 bytes code, 80 bytes stack, 13 registers)
[0xd04bd820-0xd04bdc3c): java.lang.String::equals (1052 bytes code, 48 bytes stack, 5 registers)
[0xd04bdc40-0xd04be2f8): java.lang.String::indexOf (1720 bytes code, 80 bytes stack, 12 registers)
[0xd04be300-0xd04beaf4): java.io.UnixFileSystem::normalize (2036 bytes code, 80 bytes stack, 12 registers)
[0xd04beb00-0xd04c3310): sun.nio.cs.UTF_8$Encoder::encodeArrayLoop (18448 bytes code, 96 bytes stack, 15 registers)
[0xd04c3320-0xd04c348c): java.lang.String::charAt (364 bytes code, 32 bytes stack, 1 register)
[0xd04c3490-0xd04c3530): java.lang.Object::<init> (160 bytes code, 16 bytes stack, 0 registers)
...
```
This isn’t (just) because I like debug options. Lately I’ve thought of a couple of optimizations I could do: one to reduce the code size for methods with more than one return, and one to cut the number of registers used. The former probably won’t do a lot other than reducing compile time, but the latter should be well worth it. Maybe not so much on PowerPC (though even there I already have a couple of methods maxed out on registers), but not all platforms have the luxury of 19 callee-save registers! And, of course, if I’m going to spend time optimizing then I want to see that it worked…
Whilst we’re on the subject of options, I found another funky one: `-XX:+VerifyBeforeGC`. I’ve already fixed one bug using it!
Shark 0.02 released
I just updated icedtea6 hg with the latest Shark. To build it you need a fairly recent svn LLVM (I’m using 54012) and you need to configure IcedTea with `--enable-shark`.
Shark stuff
The new framewalker stuff is all done now. For interpreted frames you need to write all kinds of fiddly little accessors so the garbage collector can find the method pointer, the local variables, the monitors and the expression stack, and any other objects that may be lying around in there, but for compiled frames it’s simple: at any point at which the stack could be walked you just emit a map which says “in a stack frame with such and such a PC, slots 1, 2, 4, 5 and 8 contain pointers to objects”. The tricky bit is the PC: I don’t have access to the real one, so I had to fake one up, but it’s all working now — and surviving garbage collections — which is pretty cool! The garbage collector interface was the single biggest thing I was worried about, so it’s nice to have it under my belt, with all the old hacks removed.
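Conceptually such a map is tiny. Here is a hedged sketch of its shape (the names are invented for the example; HotSpot’s real OopMap machinery is considerably more involved): each record pairs a PC with the list of frame slots that hold object pointers, and the collector visits those slots and nothing else.

```cpp
#include <cstdio>
#include <map>
#include <vector>

typedef unsigned long Address;

// One record per point at which the stack could be walked: the PC
// identifies the call site, the slot list says which frame slots hold
// object pointers at that point.  (Names invented for this example.)
struct OopMap {
  std::vector<int> oop_slots;
};

// All the maps for one compiled method, keyed by PC.
typedef std::map<Address, OopMap> OopMaps;

// What a collector does with them: visit exactly the slots named in
// the map for the frame's PC, ignoring everything else in the frame.
static void walk_frame(const OopMaps& maps, Address pc,
                       unsigned long* frame_slots) {
  const OopMap& map = maps.at(pc);
  for (int slot : map.oop_slots)
    std::printf("  slot %d holds an oop: %#lx\n", slot, frame_slots[slot]);
}

int main() {
  OopMaps maps;
  maps[0xd04bd123] = OopMap{ { 1, 2, 4, 5, 8 } };   // as in the post

  unsigned long frame[9] = { 0, 0x1000, 0x1008, 7, 0x1010,
                             0x1018, 9, 3, 0x1020 };
  std::printf("walking frame at pc %#lx:\n", 0xd04bd123UL);
  walk_frame(maps, 0xd04bd123, frame);
  return 0;
}
```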
Since finishing the framewalking stuff I’ve also implemented VM calls, which are the places where compiled code drops into C to do things too complicated to want to write in assembly. Making Shark fail gracefully when it hits unknown bytecodes was an amazing idea, as it shifted the focus from the simple grind of implementing bytecodes to the really critical — and interesting! — things. Doing it this way around means I can get all the infrastructure solid, then spend a week or so churning out the remaining ninety or so bytecodes.
In other news, with nearly 150 methods compiled, my simple testcase is now seven times faster with Shark than without.