The State Decacher and Other Animals

It’s been a while. Here’s where I am:

Benchmark   Status  Detail
antlr       FAIL    too many open files
bloat       pass    699657ms
chart       pass    342527ms
eclipse     FAIL    one method miscompiles, one method won’t compile
fop         pass    35198ms
hsqldb      pass    178011ms
jython      pass    983272ms
luindex     pass    140654ms
lusearch    FAIL    segfault
pmd         pass    456881ms
xalan       pass    148200ms

After implementing deoptimization and the remaining bytecodes I’ve been taking some time to rewrite state cache and decache. When methods are compiled, the local variables and expression stack mostly end up in registers, but when you enter the VM some of the locals and stack slots need to be accessible. Garbage collection can happen when you invoke Java methods or call VM functions, for example, so object pointers need to be visible. The way this works in Shark is that each method allocates a frame on the call stack at entry, with enough space to store all its locals and stack slots. At VM entry, whatever slots are needed are written to the frame (“decached”), and on return any object pointers are reloaded (“cached”) in case they changed.

Unfortunately, over time, the cache and decache functions have become ridiculously overcomplicated. The problem is that there are three types of decache and cache — for a Java call, for a VM call, and for deoptimization — each with its own different rules for exactly what needs writing and rereading. The story ends there for cache, but a decache has three separate functions: it must generate the actual code to write the values, it must tell the garbage collector which slots contain objects, and it must describe the frame to the stack trace code. This is all further complicated by the fact that Shark uses a compressed expression stack, where long and double values take up one slot, whereas HotSpot uses an expanded version where they take up two.

I’m pretty sure that the Eclipse failure is a decache failure, and I’m leaning towards the other two failures being decache failures as well, hence the rewrite. It’s in two parts: the first is to move the interface between the compressed and expanded stacks from the decache code into the bit that parses the bytecode, and the second is to abstract everything so that cache and decache share as much code as possible. Currently there is not much sharing between the two, and it’s messy.

The first part is done, and seems pretty stable. The only place where compressed stacks now exist is in the SharkBlock class, and where it was necessary or useful to expose the expanded stack I’ve prefixed the method names with “x”.

The second part is a work in progress…

DaCapo status

I’ve been working on DaCapo for nearly two weeks now, so I took a bit of time out today to figure out where I am with it:

Benchmark   Status  Detail                   Unimplemented bytecodes
antlr       FAIL    too many open files      jsr (once)
bloat       pass    718149ms                 multianewarray (once)
chart       pass    337240ms
eclipse     FAIL    requires deoptimization
fop         pass    37126ms
hsqldb      pass    178120ms
jython      FAIL    requires deoptimization  jsr (21 times)
luindex     pass    149362ms                 jsr (3 times)
lusearch    FAIL    segfault
pmd         pass    457936ms                 jsr (15 times)
xalan       pass    174340ms                 jsr (8 times)

Note that this is a debug build, with no optimization and assertions enabled, so the times are in no way representative.

DaCapo

This past week or so I’ve been trying to get the DaCapo benchmarks running on Shark. It’s a total baptism of fire. ANTLR uses exceptions extensively, so I’ve had to implement exception handling. FOP is multithreaded, so I’ve had to implement slow-path monitor acquisition and release (all of synchronization is now done!). I’ve had to implement safepoints, unresolved field resolution, and unresolved method resolution for invokeinterface. I’ve had to replace the unentered block detection code to cope with the more complex flows introduced by exception handlers. I’ve fixed bugs in the divide-by-zero check, and in aload, astore, checkcast and new, and to top it off I implemented lookupswitch for kicks. And I’m only halfway through the set of benchmarks…

Building Shark

For reference, this is how to reproduce my working environment and get a debuggable Shark built:

svn co http://llvm.org/svn/llvm-project/llvm/trunk llvm
cd llvm
./configure --with-pic --enable-pic
make
cd ..
hg clone http://icedtea.classpath.org/hg/icedtea6
cd icedtea6
curl http://gbenson.net/wp-content/uploads/2008/08/mixtec-hacks.patch | patch -p1
./autogen.sh
LLVM_CONFIG=$(dirname $PWD)/llvm/Debug/bin/llvm-config ./configure --enable-shark
make icedtea-against-ecj

After the initial make icedtea-against-ecj you can use make hotspot to rebuild only HotSpot.

Shark 0.03 released

I just updated icedtea6 hg with the latest Shark. The main reason for this release is that Andrew Haley pointed out that the marked-method stuff I was using to differentiate compiled methods and interpreted methods didn’t work on amd64, and while it was possible to make it work there I didn’t like the idea of having something that needs tweaking for each new platform you build on. Now interpreted methods have the same calling convention as compiled ones, which makes the need for differentiation obsolete.

Other new features in this release include support for long, float, and double values, and a massive pile of new bytecodes. Check out the coverage page now, it’s awesome!

Debug option fun

I just extended the -XX:+SharkTraceInstalls debug option to print out a load more stuff: statistics on the code size, the number of non-volatile registers used, and so on. If you run with it you’ll get something like this:

[0xd04bd010-0xd04bd1b4): java.lang.String::hashCode (420 bytes code, 32 bytes stack, 1 register)
[0xd04bd1c0-0xd04bd81c): java.lang.String::lastIndexOf (1628 bytes code, 80 bytes stack, 13 registers)
[0xd04bd820-0xd04bdc3c): java.lang.String::equals (1052 bytes code, 48 bytes stack, 5 registers)
[0xd04bdc40-0xd04be2f8): java.lang.String::indexOf (1720 bytes code, 80 bytes stack, 12 registers)
[0xd04be300-0xd04beaf4): java.io.UnixFileSystem::normalize (2036 bytes code, 80 bytes stack, 12 registers)
[0xd04beb00-0xd04c3310): sun.nio.cs.UTF_8$Encoder::encodeArrayLoop (18448 bytes code, 96 bytes stack, 15 registers)
[0xd04c3320-0xd04c348c): java.lang.String::charAt (364 bytes code, 32 bytes stack, 1 register)
[0xd04c3490-0xd04c3530): java.lang.Object:: (160 bytes code, 16 bytes stack, 0 registers)
...

This isn’t (just) because I like debug options. Lately I’ve thought of a couple of optimizations I could do, one to reduce the code size for methods with more than one return, and one to cut the number of registers used. The former probably won’t do a lot other than reducing compile time, but the latter should be well worth it. Maybe not so much on PowerPC — though I already have a couple of methods maxed out on registers — but not all platforms have the luxury of 19 caller-save registers! And, of course, if I’m going to spend time optimizing then I want to see that it worked…

Whilst we’re on the subject of options I found another funky one: -XX:+VerifyBeforeGC. I’ve already fixed one bug using it!

Shark stuff

The new framewalker stuff is all done now. For interpreted frames you need to write all kinds of fiddly little accessors so the garbage collector can find the method pointer, the local variables, the monitors and the expression stack, and any other objects that may be lying around in there, but for compiled frames it’s simple: at any point at which the stack could be walked you just emit a map which says “in a stack frame with such and such a PC, slots 1, 2, 4, 5 and 8 contain pointers to objects”. The tricky bit is the PC: I don’t have access to the real one, so I had to fake one up, but it’s all working now — and surviving garbage collections — which is pretty cool! The garbage collector interface was the single biggest thing I was worried about, so it’s nice to have it under my belt, with all the old hacks removed.

Since finishing the framewalking stuff I’ve also implemented VM calls, which are the places where compiled code drops into C to do things too complicated to want to write in assembly. Making Shark fail gracefully when it hits unknown bytecodes was an amazing idea, as it shifted the focus from the simple grind of implementing bytecodes to the really critical — and interesting! — things. Doing it this way around means I can get all the infrastructure solid, then spend a week or so churning out the remaining ninety or so bytecodes.

In other news, with nearly 150 methods compiled, my simple testcase is now seven times faster with Shark than without.

New framewalker interface

I got recursive locks working on Friday, which got me back into the framewalker stuff. For HotSpot’s framewalker to see frames as native I need to supply it with something like a program counter that can be used to index into a set of tables that tell it, for example, which stack slots contain pointers for consideration by the garbage collector. It expects this to be in a block of generated code (which won’t really be code at all in Shark), but the core problem is that the “code” you generate goes into a temporary buffer which HotSpot then relocates into its final location, so I can’t simply inline pointers from the buffer into Shark’s output. The final location of the “code” cannot be determined at compile time, and even if it could, it can move at any time as a result of garbage collector activity.

When you invoke a method in zero you start with a methodOop, a pointer to a structure containing (amongst other things) the method’s entry point. The entry point is simply a pointer to the function that you call to execute the method. The address of the final code buffer is also contained within the methodOop, but both the entry point and the code buffer are volatile — they can change at any time — so they need to be read at the same time, in one atomic operation.

What is needed is some way to pass a pointer to the code buffer when calling Shark methods. After a fairly intense thinking session it occurred to me that the entry point is going to be word-aligned, so the bottom two or three bits will always be zero. Code buffer pointers in HotSpot are always word aligned too, so I decided to use the bottom bit as a flag: if the bottom bit is clear then the entry point is a normal pointer-to-a-function entry point, but if it’s set then the “entry point” is really a pointer to the code buffer. The actual entry point can then be read from the code buffer, which in Shark does not contain code but simply whatever data I decide to put in there.

The nice thing about this is that, aside from adding only one or two instructions per method dispatch, it also opens up the possibility of method inlining, something I didn’t think would be possible.