Good news and bad news

Bad news first. The drop in speed between the Zeros in IcedTea6 1.3 and 1.4 doesn’t seem to come from Zero itself. I did a build of IcedTea6 1.4 with everything in ports/hotspot/src/*cpu reverted to 1.3, and the speed loss remained. It must be something to do with the newer HotSpot, or some other patch that got added or changed. I don’t really want to spend any more time on this than I have, so we’ll just have to live with it.

I’ve not come to any conclusions as to the difference in speed between the native-layer C++ interpreter and Zero either. It’s not the unaligned access stuff I mentioned: I ran some benchmarks, but the results were ambiguous. It may be libffi, but again, I don’t want to spend more time on this…

The good news is that I’ve been checking the Zero sources for SCA cover, emailing various people, and there’s only one tiny easily-removable bit I’m unsure about. I spent the morning preparing and submitting the first of the patches that will be required, the core build patch, which will hopefully be well received.

Benchmarks

So, benchmarking. I did a couple of quick SPECjvm98 runs, to see what the times looked like. Please note that neither of these results were produced in compliance with the SPECjvm98 run rules so they are not comparable with SPECjvm98 metrics.

SPECjvm98 times for the interpreters in IcedTea6 1.4
SPECjvm98 times for the interpreters in IcedTea6 1.4

This result would suggest that the lion’s share of the performance difference between the template interpreter and Zero is down to Zero’s use of the C++ interpreter. There’s obviously something else slowing it down as well, though, possibly libffi, possibly the unaligned access stuff Ed Nevill mailed about the other day, or possibly something else entirely.

SPECjvm98 times for IcedTea6 1.3 Zero and IcedTea6 1.4 Zero
SPECjvm98 times for IcedTea6 1.3 Zero and IcedTea6 1.4 Zero

Apparently Robert Schuster was right about the speed loss between IcedTea6 1.3 Zero and IcedTea6 1.4. Most of the times are within 1% of each other, but the mpegaudio benchmark is some 12% slower. Assuming it is Zero (and not, for example, the HotSpot upgrade) there’s not a lot that could account for it. It could simply be the stack overflow checking that the Zero in 1.3 didn’t have, but I wouldn’t have thought that would have had such a pronounced effect. Maybe something’s not getting inlined like I expected; I’m going to have to hunt a little for this one…

State of the world

Well, in case you missed it, Zero passed the TCK! Specifically, the latest OpenJDK packages in Fedora 10 for 32- and 64-bit PowerPC passed the Java SE 6 TCK and are compatible with the Java SE 6 platform. I’ve been working toward this since sometime in November — the sharp-eyed amongst you may have noticed the steady stream of obscure fixes I’ve been committing — and the final 200 hour marathon finished at 5pm on FOSDEM Saturday, less than 24 hours before my talk. It was pretty stressful, and I took the week off to recover!

Needless to say, none of this could have happened without the rest of the OpenJDK team here at Red Hat getting it to pass on the other platforms. Special thanks must go to Lillian for managing the release. She got the blame for a lot of what went wrong, and it’s only fair she should get the credit for what went right.

Of course, all of this wasn’t just so I’d have something exciting to announce at FOSDEM. In a way it validates the decision we took at Red Hat to focus on Zero rather than using Cacao or another VM. By using as much OpenJDK code as possible — Zero builds are 99% HotSpot — we get as much OpenJDK goodness as possible, including the “correctness” of the code. Zero’s speed can make it a standing joke, but I’d like to use these passes to emphasize that Zero isn’t just a neat hack — it’s production quality code that hasn’t been optimized yet. I’ve written fast code and I’ve written correct code, and in my experience it’s easier in the long run to make correct code fast than it is to make fast code correct. The TCK isn’t everything, naturally, but the fact that it’s possible to pass it using Zero builds gives us a firm foundation for future work.

So, what now for me? Well, in the medium term I want to restart work on Shark, but there’s a couple of things for Zero I want to look at while they’re fresh in my mind. The first is speed. As an interpreter Zero will never be “fast”, but in my FOSDEM slides I used some of Andrew Haley‘s benchmarks that show Zero as significantly slower than the template interpreter on x86_64. Furthermore, Robert Schuster mentioned in his talk that the Zero in IcedTea 1.4 was significantly slower than the Zero in IcedTea 1.3. I’m not going to spend a great deal of time on it, but I’d like to do a bit of benchmarking and profiling to check that nothing stupid is happening.

The other thing I want to do for Zero is to get it into upstream HotSpot. This is going to require a lot of non-fun stuff — tidying, a bit of rethinking, and an SCA audit.

Finally, Inside Zero and Shark, the articles I’ve been writing. I didn’t mention it at the time, but I was writing them while my TCK runs were in progress, to keep me sane! I do plan to continue them, but they’ll likely be a little more sporadic now I’m starting the fun stuff again!

Inside Zero and Shark: The call stub and the frame manager

Now that we have all that stack stuff out of the way we can get into the core of Zero itself, the replacements for the parts of HotSpot that were originally written in assembly language.

I’ve mentioned already that the bridge from the VM into Java code is the call stub. In Zero, this is the function StubGenerator::call_stub, in stubGenerator_zero.cpp, and its job is really simple. In the previous article I explained how when a method is entered it finds its arguments at the top of the stack, at the end of the previous method’s frame. Well, for the first frame of all there is no previous frame, so the call stub’s job is to create one. If you look in the crash dump I linked in the previous article, you’ll see that right at the bottom of the stack trace is a short frame that’s different from all the others. This is the entry frame, the frame the call stub made. You can see the code that built it in EntryFrame::build, right at the bottom of stubGenerator_zero.cpp.

Once the entry frame is created, the call stub invokes the method by jumping to its entry point. If the method we’re calling hasn’t been JIT compiled then the entry point will be pointing at one of the interpreter’s method entries. These, along with the call stub, are the bits that are written in assembly language in classic HotSpot.

There are several different method entries — Zero has five, classic HotSpot has fourteen! — but most of this is optimization, and you can do pretty much everything with just two: an entry point for normal (bytecode) methods, and an entry for native (JNI) methods. In this article I’m going to talk about the normal entry.

In the C++ interpreter, the normal entry is split into two parts. The larger of the two is the bytecode interpreter. This is written in C++, the function BytecodeInterpreter::run in bytecodeInterpreter.cpp, and it does the bulk of the work. The other part is the frame manager. In non-Zero HotSpot this is written in assembly language, and it handles the various stack manipulations that cannot be performed from within C++. Zero’s frame manager is, of course, written in C++; it’s the function CppInterpreter::normal_entry in cppInterpreter_zero.cpp.

The frame manager performs tasks for the bytecode interpreter, so you might expect the bytecode interpreter call the frame manager wherever it needs to adjust the stack. It can’t work this way, however; the interleaved stack in classic HotSpot means that once you’re inside the bytecode interpreter the bytecode interpreter’s ABI frame lies on top of the stack, blocking any access to the Java frames beneath. To cope with this, the code is essentially written inside out, with the frame manager calling the bytecode interpreter, and the bytecode interpreter returning to the frame manager whenever it needs something done.

The way it works is this. On entering a method, we start off in the frame manager. The frame manager extends the caller’s frame to accomodate any extra locals, then creates a new frame for the callee. The frame manager then calls the bytecode interpreter with a method_entry message.

Now we’re inside the bytecode interpreter, which executes bytecodes one-by-one until it reaches something it cannot handle. Say it arrives at a method call instruction. In classic HotSpot, the bytecode interpreter’s ABI frame is blocking the top of the stack, so if the bytecode interpreter were to handle the call itself the callee wouldn’t be able to extend its caller’s frame to accomodate its extra locals. The frame manager has to to handle this, so the bytecode interpreter returns with a call_method message.

Now we’re back in the frame manager again; the bytecode interpreter’s frame has been removed, and the Java frame at the top of the stack. The arguments to the call were set up by the bytecode interpreter, so all the frame manager has to do is jump to the callee’s entry point. When the callee returns, the frame manager returns control to the bytecode interpreter by calling it with a method_resume message, and the bytecode interpreter continues from where it was when it issued the call_method.

This process is repeated every time a method call is required. Once the bytecode interpreter is finished with a method, it returns to the frame manager with a return_from_method or a throwing_exception message. The frame manager then removes the method’s frame, copies the result into it’s caller’s frame if necessary, and return to its caller.

The frame manager exists because the bytecode interpreter’s frame blocks the stack in classic HotSpot. In Zero, Java frames live on the Zero stack, which is separate from the ABI stack. Why then does Zero need a separate frame manager? The answer is that it doesn’t — it would be perfectly possible to rewrite the bytecode interpreter to stand alone. That, however, is the issue: you’d have to rewrite the bytecode interpreter, making significant modifications to the existing HotSpot code. That runs counter to the design philosophy of Zero, which aims for it to slot into HotSpot with minimal modification. It could be done, but we didn’t do it.

That pretty well sums up the normal entry. Next time I’ll talk about the other essential method entry, the one that handles JNI methods.