New framewalker interface

I got recursive locks working on Friday, which got me back into the framewalker stuff. For HotSpot’s framewalker to see frames as native, I need to supply it with something like a program counter that can be used to reference into a set of tables that tell it, for example, which stack slots contain pointers for consideration by the garbage collector. It expects this to be in a block of generated code (which won’t really be code at all in Shark), but the core problem is that the “code” you generate goes into a temporary buffer which HotSpot then relocates into its final location, so I can’t simply inline pointers to the buffer in Shark’s output. The final location of the “code” cannot be determined at compile time, and even if it could, it can move at any time as a result of garbage collector activity.
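To make the relocation problem concrete, here is a minimal C sketch. `code_blob`, `install()` and `resolve()` are hypothetical stand-ins, not HotSpot APIs. A pointer taken into the temporary buffer at generation time goes stale once the buffer is copied and recycled. An offset survives the copy, but it is only useful when combined with the blob's current address at run time, and that address is exactly the piece of information that has to be supplied somehow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a block of generated "code" carrying a small
 * table, e.g. which stack slots hold object pointers. Not a HotSpot type. */
typedef struct {
    uint32_t slot_table[4];
} code_blob;

/* Model installation: copy the temporary buffer to its final location,
 * then recycle the temporary buffer (here, simply by zeroing it). */
void install(code_blob *final_loc, code_blob *temp) {
    memcpy(final_loc, temp, sizeof *final_loc);
    memset(temp, 0, sizeof *temp);
}

/* An offset into the blob is position-independent, but it can only be
 * turned back into a pointer given the blob's *current* address. */
const uint32_t *resolve(const code_blob *current, size_t offset) {
    assert(offset + sizeof(uint32_t) <= sizeof *current);
    return (const uint32_t *)((const char *)current + offset);
}
```

Reading through a pointer baked in before `install()` sees the recycled buffer, while `resolve()` on the new location finds the data; the sketch says nothing about how Shark itself learns that new location.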

When you invoke a method in zero you start with a methodOop, a pointer to a structure containing (amongst other things) the method’s entry point. The entry point is simply a pointer to the function that you call to execute the method. The address of the final code buffer is also contained within the methodOop, but both the entry point and the code buffer are volatile — they can change at any time — so they need to be read at the same time, in one atomic operation.

What is needed is some way to pass a pointer to the code buffer when calling Shark methods. After a fairly intense thinking session it occurred to me that the entry point is going to be word-aligned, so the bottom two or three bits will always be zero. Code buffer pointers in HotSpot are always word-aligned too, so I decided to use the bottom bit as a flag: if the bottom bit is clear then the entry point is a normal pointer-to-a-function entry point, but if it’s set then the “entry point” is really a pointer to the code buffer. The actual entry point can then be read from the code buffer, which in Shark does not contain code but simply whatever data I decide to put in there.
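Here is a minimal sketch of the tagged entry point in C. Everything in it is hypothetical: `shark_code_buffer`, `invoke()` and the two demo functions are not zero's real code. It assumes plain entry points really are aligned, which the sketch enforces with the GCC/Clang `aligned` function attribute; on a platform that uses bit 0 of function pointers for its own purposes (ARM Thumb interworking, for example) the trick would need rethinking.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical entry-point signature, for illustration only. */
typedef int (*entry_fn)(int arg);

/* Hypothetical Shark "code buffer": it holds data, not machine code. */
typedef struct {
    entry_fn real_entry;  /* the function that actually runs the method */
    int frame_size;       /* stand-in for framewalker/GC tables */
} shark_code_buffer;

#define CODE_BUFFER_TAG ((uintptr_t)1)

/* Encode a code-buffer pointer as an "entry point" by setting bit 0.
 * A struct containing pointers is pointer-aligned, so bit 0 is free. */
uintptr_t make_shark_entry(const shark_code_buffer *cb) {
    uintptr_t word = (uintptr_t)cb;
    assert((word & CODE_BUFFER_TAG) == 0);
    return word | CODE_BUFFER_TAG;
}

/* Plain entry points are stored untagged; this relies on their alignment. */
uintptr_t make_plain_entry(entry_fn fn) {
    uintptr_t word = (uintptr_t)fn;
    assert((word & CODE_BUFFER_TAG) == 0);
    return word;
}

/* Dispatch on a single word that was read from the method in one load. */
int invoke(uintptr_t entry_word, int arg) {
    if (entry_word & CODE_BUFFER_TAG) {
        const shark_code_buffer *cb =
            (const shark_code_buffer *)(entry_word & ~CODE_BUFFER_TAG);
        /* cb is in hand here, so it could be recorded in the new frame. */
        return cb->real_entry(arg);
    }
    return ((entry_fn)entry_word)(arg);
}

/* Two hypothetical method bodies. The aligned attribute (a GCC/Clang
 * extension) guarantees the word alignment the scheme depends on. */
__attribute__((aligned(8))) int interpreted_add_one(int x) { return x + 1; }
__attribute__((aligned(8))) int shark_times_two(int x) { return x * 2; }
```

Because the caller reads one word and decodes it, the entry point and the code buffer can never be observed out of step with each other; the cost is the test-and-branch at the top of `invoke()`.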

The nice thing about this is that it adds only one or two instructions per method dispatch, and it also opens up the possibility of method inlining, something I didn’t think would be possible.

Speed demon

I got whole method synchronization working in Shark yesterday, at least for simple cases (non-recursive, uncontended locks). It’s a small thing, but it means that everything that Shark needs in a stack frame is actually in the stack frame. When I get back onto the framewalker stuff I’ll be able to start adding whatever extra stuff I need without worrying about messing myself up for the future.

I’ve been using jar on the HotSpot sources as my testcase, and lately it’s been noticeably faster. I did a couple of quick timing runs yesterday, and even with only 62 methods compiled it is already twice as fast with Shark as without. I’m on the verge of massive success…

Synchronization

I figured out the deal with getting the number of monitors from typeflow. I was confused because it only tells you the number of monitors at the start of each block, but what I didn’t realise is that it inserts block boundaries after every monitorenter and monitorexit, so the number of monitors remains constant for the entire block. A quick scan through a method’s blocks will find the one with the most monitors, and that’s the maximum number of monitors you need for your method. I wrote the code to allocate them yesterday; today I write the code to actually lock and unlock things with them.
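A minimal sketch of that scan, assuming a hypothetical, flattened view of the typeflow results (`block_info` and `max_monitors()` are invented for illustration; the real typeflow classes expose far more). Since the count cannot change inside a block, the largest count seen at any block's start is the number of monitor slots the frame needs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-block summary of what typeflow reports. */
typedef struct {
    int start_bci;      /* bytecode index where the block begins */
    int monitor_count;  /* monitors held on entry; constant for the block */
} block_info;

/* Typeflow splits blocks after every monitorenter/monitorexit, so the
 * maximum over block-entry counts is the most monitors ever held. */
int max_monitors(const block_info *blocks, size_t nblocks) {
    int max = 0;
    for (size_t i = 0; i < nblocks; i++) {
        assert(blocks[i].monitor_count >= 0);
        if (blocks[i].monitor_count > max)
            max = blocks[i].monitor_count;
    }
    return max;
}
```

For a method that nests two `synchronized` blocks, the block sitting between the two monitorenters and the two monitorexits starts with a count of 2, so the frame gets two monitor slots.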

As an aside, I’ve seen a few people refer to zero as a VM in its own right, a replacement for HotSpot, but this isn’t the case at all: zero is a port of HotSpot. HotSpot is a largely generic VM with a small number of CPU- and OS-specific files, and when people refer to a port of HotSpot to some processor or operating system, they’re talking about these files. If you look inside an OpenJDK tarball, you’ll find the HotSpot sources in something like these directories:

openjdk/hotspot/src/share
openjdk/hotspot/src/cpu/x86
openjdk/hotspot/src/cpu/sparc
openjdk/hotspot/src/os/linux
openjdk/hotspot/src/os/solaris
openjdk/hotspot/src/os/win32
openjdk/hotspot/src/os_cpu/linux_x86
openjdk/hotspot/src/os_cpu/linux_sparc
openjdk/hotspot/src/os_cpu/solaris_x86
openjdk/hotspot/src/os_cpu/solaris_sparc
openjdk/hotspot/src/os_cpu/win32_x86

Zero adds the following directories to that list:

openjdk/hotspot/src/cpu/zero
openjdk/hotspot/src/os_cpu/linux_zero

So when you build HotSpot with zero, you’re using the following code:

openjdk/hotspot/src/share              444,621 lines
openjdk/hotspot/src/cpu/zero             5,300 lines
openjdk/hotspot/src/os/linux             7,865 lines
openjdk/hotspot/src/os_cpu/linux_zero    1,085 lines

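That’s 98.6% pure, unadulterated HotSpot (share and os/linux are stock HotSpot code; only the 6,385 lines in cpu/zero and os_cpu/linux_zero belong to zero).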
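The 98.6% figure works out if src/share and src/os/linux both count as stock HotSpot, leaving only src/cpu/zero and src/os_cpu/linux_zero as zero-specific. A small C check of that arithmetic, using the line counts from the table (the function names are just for illustration):

```c
#include <assert.h>

/* Line counts from the table above. */
enum {
    SHARE_LINES       = 444621,  /* src/share: stock HotSpot */
    OS_LINUX_LINES    = 7865,    /* src/os/linux: stock HotSpot */
    CPU_ZERO_LINES    = 5300,    /* src/cpu/zero: zero-specific */
    OS_CPU_ZERO_LINES = 1085     /* src/os_cpu/linux_zero: zero-specific */
};

/* part expressed as a percentage of total */
double percent_of(long part, long total) {
    assert(total > 0);
    return 100.0 * (double)part / (double)total;
}

double stock_hotspot_percent(void) {
    long stock = (long)SHARE_LINES + OS_LINUX_LINES;           /* 452,486 */
    long zero_only = (long)CPU_ZERO_LINES + OS_CPU_ZERO_LINES; /* 6,385 */
    return percent_of(stock, stock + zero_only);
}
```

`printf("%.1f%%\n", stock_hotspot_percent());` prints `98.6%`. Counting only src/share as stock would give about 96.9% instead.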

(Shark isn’t a VM either: it’s an optional JIT compiler for the zero port of HotSpot.)

Shark’s framewalker interface

Yesterday I finished my zero bugs, so now I’m back on Shark. Until a couple of weeks ago I’d been implementing bytecodes one by one as they came up, but during some team meetings I had the idea of making unimplemented bytecodes abort only that particular compilation. Previously this would abort the entire VM, but now that I have a fair few of the most common ones implemented already, this change would mean Shark was usable even in its unfinished state, both for general use and for a little bit of benchmarking. Getting the day-to-day grind of implementing bytecodes out of the way means I now need to figure out some of the bigger design issues, however.
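The bail-out idea can be sketched like this, under heavy simplifying assumptions: `compile_method()`, `emit_known_subset()` and the one-entry-per-instruction opcode array are hypothetical (real bytecode has operands, and a real code generator does far more than return a flag). The point is only the control flow: an unimplemented opcode makes that one compilation fail, leaving the method to the interpreter, rather than taking down the VM.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { COMPILE_OK, COMPILE_BAILED_OUT } compile_result;

/* Hypothetical per-instruction code generator: returns false for an
 * opcode it does not implement yet. */
typedef bool (*emit_fn)(unsigned char opcode);

/* A toy generator that only knows four JVM opcodes. */
bool emit_known_subset(unsigned char opcode) {
    switch (opcode) {
    case 0x03:  /* iconst_0 */
    case 0x1a:  /* iload_0 */
    case 0x60:  /* iadd */
    case 0xac:  /* ireturn */
        return true;   /* pretend code was emitted */
    default:
        return false;  /* not implemented yet */
    }
}

/* Give up on this method only; the caller keeps it interpreted. */
compile_result compile_method(const unsigned char *opcodes, size_t len,
                              emit_fn emit) {
    assert(opcodes != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if (!emit(opcodes[i]))
            return COMPILE_BAILED_OUT;
    }
    return COMPILE_OK;
}
```

A method containing `tableswitch` (0xaa) bails out under this toy generator, while one built only from the four known opcodes compiles.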

One that’s been lurking in the background since pretty much day one is the framewalker stuff. There are all kinds of different types of stack frames in HotSpot, but the relevant ones here are interpreted frames and native frames. Each method in the stack will have one or the other of these*, and it’s one of the CPU-specific code’s jobs to provide accessors into the ABI stack frames so that HotSpot can poke around in them. In the very first days of Shark I tried to make Shark frames appear as native frames, but doing that required all kinds of things I couldn’t provide: PCs and suchlike. I put in some pretty nasty hacks to make Shark frames look like interpreter frames and got on with the rest of it.

Fast forward a couple of months and this approach is showing the strain. I’m at the point where the garbage collector is interrogating frames, and the GC needs a lot more than just a method pointer and a BCI. I’m faced with the decision of either trying to store everything into the Shark frames that the C++ interpreter stores (in which case I may as well make them identical, and do away with the distinction) or having another go at implementing Shark frames as native frames.

My gut feeling is to have another go with native frames. Making them interpreter frames is hacky enough, and trying to mimic the C++ interpreter perfectly feels too horrible to contemplate. What I may do first, however, is implement synchronization. It should be possible to handle it without having to resize frames, but it’s not obvious how to extract the number of monitors from the typeflow information so I’ve been avoiding thinking about it. Once I have that done, everything that Shark needs to store will be stored, and I can add the extra stuff to mimic HotSpot’s interpreter/native frames at my leisure.

* This is a simplification, but it’s certainly true of zero as it stands today.

General update

It’s been a while. I’ve been doing loads of little things lately, and that tends to stop me blogging. If I’m working on only one big thing then I tend to write about it as a break, often at the end of the day when I don’t want to start anything else, but hopping from task to task means there’s not one obvious thing to either a) write about or b) take breaks from. Ah well…

Probably the single biggest thing is that I started the process of getting an official OpenJDK porters project for zero. Zero’s been pretty much stable since the end of March so it’s high time I did this, but I’ve been procrastinating about it because I’m not sure how I’ll be able to develop in that environment. I want my local tree to be as close as possible to whatever upstream I’m using, which is currently the ecj-bootstrap side of IcedTea, but I won’t be able to use ecj to build straight from a project repository without either applying and committing the icedtea-ecj.patch (bad) or applying it locally and having a non-buildable repository (also bad). I may have to bite the bullet and use IcedTea to build it, which isn’t as bad as it initially seems since the target I use for rebuilds contains no Java. The initial build will be painful but I’ll just have to do it and be more careful not to blow away my tree!

Of course, the increasing interest in zero means I’m getting more bug reports. A few people have mentioned builds failing with java.lang.IllegalArgumentException: disparate values. I assumed they were all the same, so I investigated (and fixed) it on ia64, but that particular issue was little-endian specific and so doesn’t explain why people have been seeing it on ppc EPEL. It seems that this is the first place in the build where floating point values are used and checked, so this message is not one specific error but a catch-all for generic floating point bugs. Anyway, if you’re seeing this error then please let me know.

I’ve been thinking about TCK runs too. I didn’t fancy the idea of setting up an entire TCK environment on ppc, so I filed it in my head under “Future work”, but it occurred to me that zero works (or should do) on amd64 too, so it shouldn’t be too difficult to drop it into our nightly tester. I’ve not done more than thinking about this as yet since there’s not much point in doing it without copious free time to fix the thousands of bugs it’ll doubtless find — which I don’t have!

All this zero stuff has left Shark on the back-burner for a bit, but work there is progressing too. Lately I’ve been working on making unimplemented stuff abort only that compilation (not the entire VM) so that a) people can use it in its unfinished state, and b) I can run some benchmarks and get some idea of the progress I’m making.