By last Thursday I’d given up on my idea of using alloca() to fake growable blocks of memory in the stack. I had it executing bytecodes, 8 or 9 instructions, but the more I played with it the more I realised how hacky and fragile it was becoming. The extra add is definitely a bug, but weird stuff starts to happen if you do alloca(size - 15) to circumvent it, and I’m not entirely convinced this isn’t because the extra 16 bytes you allocate mask another, PPC-specific issue. It’s difficult to tell. I decided I’d much rather fix OpenJDK to use non-stack locks once than fix alloca() on every single platform.

So, on Friday I junked the bits that used alloca() and reverted the stack class back to big-block-o’-ram mode. The original pre-alloca() version allocated the block in Thread::pd_initialize(), but I wanted to make the Java stack match the runtime stack in terms of size, and the initial thread’s stack isn’t set up when pd_initialize() is called, so I moved it into the call stub. I ended Friday trying to figure out how I could make locking work.

Sometime over the weekend, though, I realised that the call stub is in the stack whenever you’re in Java code: I could grab my big block of ram there, with alloca(), but in a non-hacky way that ought to work, and leave the locking code untouched!
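
A minimal sketch of what I mean, with every type and member invented:

#include <alloca.h>
#include <cstddef>

// Invented stand-in for the thread structure; not HotSpot's.
struct Thread {
  void*  java_stack_base;
  size_t java_stack_size;
};

// The call stub is live C++ on the runtime stack for as long as any Java
// code is running above it, so it can alloca() the whole Java stack once
// on the way in and let everything above reuse that block.
void call_stub(Thread* thread /* , method, args, result... */) {
  thread->java_stack_base = alloca(thread->java_stack_size);
  // ... build the first interpreter frame in that block and call in ...
}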

Suddenly I’m at the System.currentTimeMillis() milestone, on the verge of massive success :)

My generic port is shaping up. I keep wanting to call it the portable interpreter except that’s what people used to call the C++ interpreter so I keep ending up calling it the really portable interpreter or some such thing. All the “CPU-specific” files live in hotspot/src/cpu/zero and hotspot/src/os_cpu/linux_zero so I guess it’s called zero (it was snappier than “nothing”). It’s come to mean zero-assembler in my head although there is some assembler in there, a spinpause and a 64-bit atomic copy, but if I can’t write those for a new platform within a day or two then I probably need sacking!

Anyway, whatever it’s called, it’s shaping up. I have a funky stack class with a slightly odd allocation strategy to allow it to live in alloca()-allocated memory; I’ve written my call stub using it and am now working on the normal entry. I plan to use the same calling convention throughout, the stack-based one that the bytecode interpreter uses. I have no specific native calling convention I need to match, and doing it that way saves writing a load of result convertors.

Ok, back to coding…

Lately I’ve been experimenting with the idea of a generic Linux port of Hotspot. I updated the templater and generated a new set of stubs, fixed the build system to build them, then started filling them in. I got into interpreter code in a couple of days, which is where the fun starts. There are two big problems: handling the stack, and calling native functions. libffi can do the latter, but the stack is looking to be fun. I was trying not to start thinking about it before Christmas but I failed…
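
For the record, this is roughly what libffi buys you: a way to call a native function whose signature is only known at runtime. A minimal standalone example (build with -lffi), not the interpreter’s actual code:

#include <ffi.h>
#include <stdio.h>

int main() {
  // Describe the call: one pointer argument, int return value.
  ffi_cif cif;
  ffi_type* arg_types[1] = { &ffi_type_pointer };
  if (ffi_prep_cif(&cif, FFI_DEFAULT_ABI, 1, &ffi_type_sint, arg_types) != FFI_OK)
    return 1;

  // Make the call: puts("hello"), assembled entirely at runtime.
  const char* s = "hello";
  void* arg_values[1] = { &s };
  ffi_arg result;
  ffi_call(&cif, FFI_FN(puts), &result, arg_values);
  return 0;
}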

My original idea was not to use the runtime stack at all: just allocate a block of memory somewhere and use that as the Java stack. Great. Ok, you need to make the garbage collector know about it, but great. Except that it turns out that synchronization works by exchanging the pointer to a locked object with a pointer to the lock, and the way it distinguishes the two is that the locks live in the runtime stack. I tried to figure out a way of allocating my big block on the stack at the start of the thread, which it turns out is totally possible for created threads but totally not possible for attached threads.
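
In other words, the test that tells a lock from an object is just an address-range check, something along these lines (a sketch with an invented thread type, not HotSpot’s actual code):

#include <cstddef>

struct Thread {
  char*  stack_base;   // highest address of this thread's runtime stack
  size_t stack_size;
};

// A word is taken to be a pointer to a stack lock if it falls inside the
// owning thread's runtime stack; otherwise it's an object pointer.
static bool looks_like_stack_lock(Thread* thread, void* ptr) {
  char* p = (char*) ptr;
  return p < thread->stack_base && p >= thread->stack_base - thread->stack_size;
}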

Now I’m thinking about using alloca() to sneak the Java frames inside the runtime frames. My idea was that if alloca() allocates memory contiguously then, so long as nothing else in your method used it, you could “resize” your Java frame by simply allocating another block. You couldn’t resize your caller’s frame, but you can work around that with a copy (there’s a sketch of the idea after the listings below). However, the code that GCC generates is weird, and I’m hoping it’s simply a bug. On i386 you get this:

mov    bytes_to_allocate,%eax
add    $0xf,%eax
add    $0xf,%eax
shr    $0x4,%eax
shl    $0x4,%eax
sub    %eax,%esp

and on ppc you get this:

lwz    r9,bytes_to_allocate
addi   r9,r9,15
addi   r0,r9,15
rlwinm r0,r0,28,4,31
rlwinm r0,r0,4,0,27
lwz    r9,0(r1)
neg    r0,r0
stwux  r9,r1,r0
lwz    r11,0(r1)
lwz    r0,4(r11)
mtlr   r0

It looks like it’s trying to align the blocks on 16-byte boundaries, and that the second addition is a bug. But where to find a GCC guru at this time of year?
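
For reference, the resizing scheme I have in mind boils down to something like this; it’s a sketch of the idea, not working code, and the rounding shown above is exactly what would break the contiguity it relies on:

#include <alloca.h>
#include <cstddef>
#include <cstdint>

void interpret_one_method(size_t initial_words) {
  // The method's Java frame, carved out of the runtime frame.
  intptr_t* frame = (intptr_t*) alloca(initial_words * sizeof(intptr_t));

  // Later, the expression stack needs n more words.  If nothing else in
  // this function has touched the stack, and alloca() hands back a block
  // contiguous with the last one, this simply extends the frame downwards
  // with no copying.  If the blocks get padded apart, the frame is split
  // and the whole scheme falls over.
  size_t n = 4;  // illustrative
  intptr_t* extension = (intptr_t*) alloca(n * sizeof(intptr_t));
  (void) frame; (void) extension;
}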

It’s been a while. These past two weeks I’ve been working on starting to port the client JIT and fixing an elusive crash on 32-bit. The JIT stuff is early days. I won’t commit every little change I make, but I’m going to drop my work into IcedTea once a week or so just so I’m not working in the dark (disable icedtea-core-build.patch to build it). But I’m more excited about fixing the crash. It was the last known bug, and since avdyk’s news that Eclipse runs, my feelings have shifted from “hmmm, this seems to work” to “wow!”, so if you’re having problems on PPC with any IcedTea newer than f35ffd73f3c4 then I really want to hear about it!

Well, IcedTea now has ppc and ppc64 support out of the box. aph pointed out that, since I committed it the very instant it built, what is there is essentially a record of the bare minimum of what is needed to bring up an interpreter-only OpenJDK on a new platform, and he suggested I write an overview of what I did, to serve as a guide to future porters. So here goes…

Gary’s guide to porting IcedTea

The first thing you need to do is check out a copy of IcedTea and patch the build system so it knows about your platform and will build without JITs on it. These changes are grouped into two patches, icedtea-ports.patch for the former and icedtea-core-build.patch for the latter. What you need to do is remake these patches to include your platform as well as ppc. I suggest you submit them for inclusion at this stage, so you don’t have to repeat this step every time IcedTea is updated with a new OpenJDK build.

Once you’ve done this you need to populate the ports directory with stubs so you can get to the point where libjvm.so compiles and links. If your platform is some sort of Linux then you’re in for a treat, because when I started on the ppc port I had the idea that I would do ppc, s390 and ia64 all at once, from the same codebase, and I wrote a templater to manage it for me. This never happened, but I coded in the templater right until the IcedTea import so you can use it to generate a pretty decent set of stubs.

The templater lives in contrib/templater. There are some notes on it here, but you shouldn’t need them; basically, if you’re porting to a platform other than s390/s390x or ia64, add it to the tables at the top of generate.py, then run:

python contrib/templater/generate.py your_cpu

The templated files will give you a head start, but you’ll have to fix them up a bit. Partly this is because IcedTea will have moved on since my initial import, and partly this is because as I progressed with ppc it became more and more obvious that doing ppc, s390 and ia64 all at once was a pipe dream and I became less and less concerned with getting every #ifdef PPC perfect. There will be some PPC-specific code, and there will be some missing methods. Every time a build fails, stick in an Unimplemented(); and try again.

Eventually you will be at a point where libjvm.so compiles and links, and the ecj-bootstrap part of IcedTea will complete. Looking at the logs this took me two months — 300 man-hours give or take — but with the templater you could be there within a week.

At this point you may get a segfault. Your first Unimplemented() has been hit, which caused another Unimplemented() in the error reporting system. Temporarily “simplify” VMError::report_and_die() as described here and you will hopefully get your very first real live Unimplemented() message. Start implementing…
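
In case that link doesn’t survive, the “simplification” amounts to something like this; the real method has more arguments and state, so treat this as an illustrative stand-in rather than a drop-in replacement:

#include <cstdio>
#include <cstdlib>

// Roughly what the temporary simplification of VMError::report_and_die()
// boils down to: print the bare facts and abort, instead of running the
// full error-reporting machinery, which itself hits Unimplemented() on a
// fresh port.
static void report_and_die_simplified(const char* message,
                                      const char* file, int line) {
  fprintf(stderr, "Internal error: %s (%s:%d)\n",
          message ? message : "(no message)", file, line);
  abort();
}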

The first big bit you’ll hit is ICacheStubGenerator::generate_icache_flush. If you’ve avoided writing assembler thus far there is no getting around the fact that you need to write some now. At this point I implemented enough of an assembler for an unimplemented macro that called report_unimplemented from assembled code in exactly the same way as Unimplemented() does from C. Whenever I hit an Unimplemented() in a code-generating function I simply replaced it with __ unimplemented and continued, and I suggest you do this too.

Surprise! The very next Unimplemented() you hit will be the one you just wrote: the bit that generates the icache flush stub immediately calls it on itself. You really have to write it this time.

After that the next big thing is StubGenerator::generate_call_stub. The code this generates — the call stub — is used whenever C code calls Java. Within the interpreter certain conventions are employed when a method is called: a pointer to the method is in this register, and the stack frame is arranged like so, with the parameters at the end, and a pointer to the parameters is in that register. And so on. The details of this are your interpreter calling convention. The call stub’s job is to take a pointer to a method and an array of arguments and translate them into your interpreter calling convention. It creates what looks like an interpreter stack frame, fills in the relevant registers and jumps into the interpreter.
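
As a concrete (if entirely made-up) illustration, the call stub’s job can be sketched like this; every type and field here is invented, and the real thing is generated code rather than C++:

#include <cstdint>

struct Frame;

struct Method {
  void (*entry)(Frame*);   // the "method entry": code that executes this method
};

struct Frame {
  Method*   method;        // the interpreter expects this in a known place
  intptr_t* locals;        // the parameters become the first locals
  // ... interpreter state, monitors, expression stack ...
};

// Take a method and a flat array of arguments, lay out a frame following
// the interpreter calling convention, then jump to the method's entry.
void call_stub(Method* method, intptr_t* parameters, int /* parameter_count */) {
  Frame frame;
  frame.method = method;
  frame.locals = parameters;   // or copy them into the new frame
  method->entry(&frame);       // "jump into the interpreter"
}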

Before you can write your call stub you need to design yourself an interpreter calling convention. I described mine (more or less) here. The exact detail of this is up to you, but the state-monitors-stack order within the frames is important. Methods need to be able to allocate more monitors and extend the expression stack as necessary. You don’t want to move the interpreter state every time, so you put that at the bottom. You can’t move monitors without a safepoint, so you put those next. And you can move the expression stack whenever you like, so you put that last.
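
Laid out flat, that reasoning gives you a frame that looks something like this, from the fixed end to the growing end (purely illustrative; the exact layout is up to you):

interpreter state   <- read constantly, so it never moves
monitors            <- may only be moved or extended at a safepoint
expression stack    <- may be moved or extended at any time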

Once you’ve designed your calling convention and written your call stub you will be in the interpreter. For me this was another six weeks’ work, but it took much longer than it could have because the C++ interpreter (the code that the call stub was calling) was not released until b20. I had to try and design the calling convention blind, and a lot of stuff simply didn’t make sense.

So, you’re in the interpreter, the C++ one not the template one. Every method in Hotspot is defined by a methodOop, and each methodOop has a method entry which is the address of the code that will execute the method. Your call stub just jumped to the method entry of your first method, java.lang.Object.<clinit>. It’s an interpreted method, so you ended up in the interpreter’s normal (as opposed to native) entry, as generated by InterpreterGenerator::generate_normal_entry. To implement this you need to understand how the C++ interpreter works.

The normal entry in the C++ interpreter goes by the name of the frame manager. The guts of the C++ interpreter is the method BytecodeInterpreter::run, an enormous switch statement that takes care of pretty much everything. What it can’t take care of is all the stack frame stuff, which is where the frame manager comes in. The frame manager does work for BytecodeInterpreter::run, but the relationship between the two is kind of reversed: rather than calling the frame manager to do work, BytecodeInterpreter::run returns to the frame manager with a message asking for some work to be done. The frame manager then does the work and calls BytecodeInterpreter::run again with a message that it did what it was asked to. Interpreting then resumes.
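
Here’s a self-contained toy of that relationship; everything in it is invented, and only the shape (run() returns with a message, the frame manager services it and calls run() again) mirrors the real C++ interpreter:

#include <cstdio>

enum Message { call_method, return_from_method };

struct State { int step; int depth; Message msg; };

// Stand-in for BytecodeInterpreter::run(): the real thing executes
// bytecodes; this just replays the messages it would hand back.
static void run(State* s) {
  static const Message script[] = {
    call_method, call_method, return_from_method,
    return_from_method, return_from_method
  };
  s->msg = script[s->step++];
}

// Stand-in for the frame manager: do the frame work run() asked for,
// then call run() again so interpreting resumes.
static void frame_manager() {
  State s = { 0, 1, call_method };   // one frame, as built by the call stub
  run(&s);
  for (;;) {
    if (s.msg == call_method) {
      printf("push frame, depth now %d\n", ++s.depth);
    } else {                         // return_from_method
      printf("pop frame, depth now %d\n", --s.depth);
      if (s.depth == 0) return;      // back out to the call stub
    }
    run(&s);
  }
}

int main() { frame_manager(); return 0; }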

So you need to implement a frame manager. I recommend judicious use of __ unimplemented here: you don’t need to implement everything at once, and rather than writing a bunch of code that won’t get executed until later you may as well write just what you need. That way stuff gets tested as soon as it’s written.

The first instruction in java.lang.Object.<clinit> invokes a native method, so the next thing you need to write is InterpreterGenerator::generate_native_entry and its associated signature handlers, result convertors and result handlers. Again, don’t try and write them all at once; stub out what you don’t need with __ unimplemented and continue.

Some time around now java -XX:+TraceBytecodes will become your best friend.

The next thing you’ll probably have problems with is System.currentTimeMillis(). This is the first native method that actually returns something, and getting that something into the right place in the expression stack is fiddly.
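
The fiddle is roughly this; a hedged sketch with invented names, and the exact placement has to agree with whatever your expression-stack accessors expect:

#include <cstdint>

typedef int64_t jlong;

// A Java long occupies two expression-stack slots, so after the native
// call returns, make room for both slots and store the 64-bit value
// across them.  sp is the expression stack pointer, growing downwards.
static void push_long_result(jlong value, intptr_t*& sp) {
  sp -= 2;                 // two slots for a long
  *(jlong*) sp = value;    // placement must match the interpreter's accessors
}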

At some point after that you’ll find native methods that are passed objects and that return them. These are pointers. Pay attention: the things you are passing to and from native code are not the pointers themselves but pointers to those pointers — except if the pointer is NULL in which case you pass NULL and not a pointer to that NULL. This tripped me up every single time.
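
Written down, the rule is just this (a sketch with invented type names):

#include <cstddef>

typedef void* oop;       // "object pointer": invented stand-in
typedef void* jobject;   // what the native code actually sees

// Pass a pointer to the slot holding the object pointer (a handle),
// not the object pointer itself; unless the object pointer is NULL,
// in which case pass plain NULL rather than a pointer to a NULL slot.
static jobject to_handle(oop* slot) {
  return (*slot == NULL) ? NULL : (jobject) slot;
}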

Somewhere around 1400 bytecodes everything will go multithreaded. This is where you’ll find out your object locking doesn’t work.

Hello World is a little over 300,000 bytecodes. The great thing about the C++ interpreter is that there are points where implementing one little thing will suddenly have you interpreting orders of magnitude more bytecodes. Once you have one bytecode executing you’ll have dozens. You’ll add the stuff to return from native methods and have hundreds, then you’ll add the stuff to do object locking and have hundreds of thousands. From Hello World to javac and Ant is a pretty small step. And then you’ll be pretty much where PPC IcedTea is today.

I should thank Steve Goldman for tirelessly explaining all this to me and answering all my stupid questions.

I’ve been getting my stuff ready for IcedTea this week. I’ve been bootstrapping it with my ecj-bootstrapped b17/b22 hybrid, and so far it’s gone far more smoothly than I expected. The build system changes between b17 and b22 were far less pervasive than I’d thought, and the fact that my hybrid build seems to run ant without a hitch is gobsmacking. But debugging build scripts is slow and tedious — Hotspot bugs are far more fun to fix!

Some more todos for the list:

Slow signature handler
When Hotspot calls a native method it generates a signature handler that takes the arguments from the Java stack and puts them where the native ABI expects them. Signatures with 13 or fewer arguments can be represented as a 64-bit int, but longer signatures won’t fit. These get handled by the slow signature handler, which I haven’t written (there’s a rough sketch of the fingerprint idea after this list). This is what’s preventing the important enterprise application Slime Volleyball from running.
call_VM
Calls from the interpreter into the VM should use a call_VM macro. I’ve gotten away without this so far because a) I only have two or three calls, and b) most of what call_VM does on other platforms is not necessary on PPC. But I think I’ll need to write it when I start thinking about relocations and JITs.
JNI_FastGetField
There is some stuff to accelerate JNI field accesses that I haven’t written. It’s disabled in debug builds, which is why I hadn’t noticed it until now.
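
And the fingerprint sketch promised above, for the slow signature handler item: the exact bit layout is from memory and may well not match HotSpot’s, but it shows why there’s a limit of around 13 arguments:

#include <cstdint>

// Each parameter type packs into 4 bits of a 64-bit "fingerprint",
// alongside a few bits for the return type and a static flag, so only
// signatures with roughly 13 parameters fit.  Longer ones fall back to
// the slow signature handler, which walks the signature string instead.
enum TypeCode { T_INT = 1, T_LONG, T_FLOAT, T_DOUBLE, T_OBJECT /* ... */ };

static uint64_t fingerprint(const TypeCode* params, int count) {
  uint64_t fp = 0;
  for (int i = 0; i < count; i++)
    fp |= (uint64_t) params[i] << (4 * i);   // 4 bits per parameter
  return fp;                                 // return type and flags omitted
}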