October 2007 – gbenson.net

My first builds ran to completion yesterday. Building is slow: the 32-bit one took nearly four hours and the 64-bit one a little over six, and this is on a quad 2.5GHz G5! I don’t have exclusive use of the machine, though, so I suppose other stuff may have been running in the meantime.

The next step is IcedTea integration. My tree is currently an odd hybrid of b17 with b22’s hotspot/src dropped in, and with all the build system changes between b17 and b23 I’ll basically be starting the port from scratch. This is possibly a good thing, because I want to switch from using ARCH_DATA_MODEL to using setarch, but it’s a pain nonetheless. I plan to wait until the IcedTea guys have upgraded to b23, sync my tree on that, remake all the build system patches and then import everything into IcedTea.

I have plenty to do in the meantime. I’ve been doing an audit of dodgy code while builds were running yesterday, basically looking for Unimplemented()s where they shouldn’t be and of course the ubiquitous XXX_EVIL_EVIL_EVIL. This is the state of my to do list:

Contended locks: The bit that handles locking and unlocking for synchronized native methods cannot cope when the lock is contended.
Relocations: The assembler doesn’t support relocations; if the garbage collector moves generated code then absolute addresses that reference other generated code will become invalid. I’m not sure if this matters currently.
Atomic copies: Copy::conjoint_jlongs_atomic() is not atomic on 32-bit. The other atomic copies are questionable too, but that one definitely doesn’t do what it says on the tin.
Stack args: The part of the signature handler that passes arguments to native methods on the stack is untested on 64-bit.
Unimplemented stuff: A brief and non-exhaustive list: stack overflow checks, stack banging (whatever that is), JVMTI, JDI, profiling, prefetch, a JIT.
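
To make the atomic copies item concrete, here is a minimal sketch of what goes wrong when a 64-bit value is copied as two 32-bit words. It is a plain Python simulation, not HotSpot code: the helper names are hypothetical, and the assumption that the low word is written first is mine.

```python
MASK32 = 0xFFFFFFFF

def halves(value):
    """Split a 64-bit value into its (low, high) 32-bit words."""
    return value & MASK32, (value >> 32) & MASK32

def torn_value(old, new):
    """Value a reader would see after the low word of `new` has been
    copied over `old` but before the high word has.
    (Low-word-first ordering is an assumption for illustration.)"""
    new_low, _ = halves(new)
    _, old_high = halves(old)
    return (old_high << 32) | new_low

old = 0x00000000FFFFFFFF
new = 0x0000000100000000
seen = torn_value(old, new)
print(hex(old), hex(new), hex(seen))
```

Here the reader sees zero, which is neither the old value nor the new one. A copy that really is atomic can never expose that intermediate state.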

I now have Hello World executing to completion, 303,371 instructions on 32-bit and 307,209 on 64. There’s a lot missing, and a lot I plain don’t understand. And it’s very, very slow. I have a copy of javac running on 64-bit as I write. And running. And running. I attached gdb briefly and it was nearing two billion instructions. On 64-bit, that is. 32-bit is crashing in the garbage collector after a mere 300 million bytecodes or so. Debugging GC crashes is something I was dreading.

$ control/build/linux-ppc64/bin/java Hello
Hello world
# To suppress the following error report, specify this argument
# after -XX: or in .hotspotrc:  SuppressErrorAt=/os_linux_ppc.hpp:28

  error: Unimplemented(): aztec/hotspot/src/os_cpu/linux_ppc/vm/os_linux_ppc.hpp:28

Only on 64-bit so far…

My first exception!

Error occurred during initialization of VM
java.lang.ExceptionInInitializerError
        at java.security.SecureClassLoader.<clinit>(SecureClassLoader.java:55)
        at sun.misc.Launcher.<init>(Launcher.java:71)
        at sun.misc.Launcher.<clinit>(Launcher.java:59)
        at java.lang.ClassLoader.initSystemClassLoader(ClassLoader.java:1322)
        at java.lang.ClassLoader.getSystemClassLoader(ClassLoader.java:1304)
Caused by: java.lang.ClassCastException: java.lang.Class cannot be cast to java.lang.String
        at sun.security.util.Debug.<clinit>(Debug.java:45)
        at java.security.SecureClassLoader.<clinit>(SecureClassLoader.java:55)
        at sun.misc.Launcher.<init>(Launcher.java:71)
        at sun.misc.Launcher.<clinit>(Launcher.java:59)
        at java.lang.ClassLoader.initSystemClassLoader(ClassLoader.java:1322)
        at java.lang.ClassLoader.getSystemClassLoader(ClassLoader.java:1304)

I just spent the best part of a day trying to figure out why something was happening differently on 32- and 64-bit PPC. It turned out that one was decoding the string "OpenJDK VM" while the other was decoding "OpenJDK 64-Bit VM". I should feel pleased, since things are working as they should, but it’s given me a headache.

Anyway, the point of this entry is that while I was in there I checked out that double-space. It was trying to insert "Server" or "Client", but of course mine has neither. It occurred to me that having undocumented variants of OpenJDK out there with weird identifiers could be slightly disturbing, so if you’re Googling this in the future trying to figure out what an "OpenJDK Aztec VM" or an "OpenJDK 64-Bit Aztec VM" is then this is the documentation: you’re using my stuff.
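
For the curious, this is roughly how an empty variant name ends up as a double space. It is a sketch only: the function and the exact format are my guess from the strings quoted above, not the code HotSpot actually uses.

```python
def vm_name(variant, is_64bit):
    """Assemble a VM name as "OpenJDK [64-Bit ]<variant> VM".
    The format is an assumption inferred from the strings above."""
    bits = "64-Bit " if is_64bit else ""
    return "OpenJDK " + bits + variant + " VM"

# With no variant the two separators end up side by side.
print(repr(vm_name("", False)))
print(repr(vm_name("Aztec", False)))
print(repr(vm_name("Aztec", True)))
```

With an empty variant the spaces on either side of it meet in the middle; with "Aztec" filled in the names come out as quoted above.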

Whoa, everything just went multithreaded…

Because I keep getting this wrong, here is where the pointers in an interpreterState should point. This is a PPC32 frame with a four-word stack with one item pushed onto it and no monitors:

…680	Back chain
…684	LR save word
…688	Padding
…68c	Padding
…690	Slop
…694	Slop	← `_stack_limit`
…698
…69c	Stack
…6a0	Stack	← `_stack`
…6a4
…6a8	Interpreter state	← `_stack_base`, `_monitor_base`
.
.
.
…6ec

This is the same frame with a monitor allocated:

…680	Back chain
…684	LR save word
…688	Slop
…68c	Slop	← `_stack_limit`
…690
…694	Stack
…698	Stack	← `_stack`
…69c
…6a0	Monitor	← `_stack_base`
…6a4	Monitor
…6a8	Interpreter state	← `_monitor_base`
.
.
.
…6ec
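
The two layouts above can be reproduced with a little arithmetic. This is a minimal sketch, not the real code, and it leans on things I am reading off the diagrams rather than anything stated: a two-word link area, two slop words, two words per monitor, an 18-word interpreter state (…6a8 to …6ec), and frames padded to a multiple of 16 bytes. The function name is hypothetical, and the addresses use only the low digits shown in the tables.

```python
WORD = 4            # PPC32 word size in bytes
LINK_WORDS = 2      # back chain + LR save word
SLOP_WORDS = 2
MONITOR_WORDS = 2   # assumed from the diagram
STATE_WORDS = 18    # assumed: …6a8 through …6ec inclusive
ALIGN = 16          # assumed frame alignment in bytes

def frame_pointers(frame_base, stack_words, monitors, pushed):
    """Return the interpreterState pointers for one frame."""
    words = (LINK_WORDS + SLOP_WORDS + stack_words
             + monitors * MONITOR_WORDS + STATE_WORDS)
    padding = -(words * WORD) % ALIGN          # bytes of padding
    slop = frame_base + LINK_WORDS * WORD + padding
    # _stack_limit is the higher slop word, one below the last slot.
    stack_limit = slop + (SLOP_WORDS - 1) * WORD
    # _stack_base is one word above the first stack slot.
    stack_base = stack_limit + (stack_words + 1) * WORD
    # _stack is the next free slot; the stack grows downwards.
    stack = stack_base - (pushed + 1) * WORD
    monitor_base = stack_base + monitors * MONITOR_WORDS * WORD
    return {"stack_limit": stack_limit, "stack": stack,
            "stack_base": stack_base, "monitor_base": monitor_base}

for monitors in (0, 1):
    ptrs = frame_pointers(0x680, stack_words=4, monitors=monitors, pushed=1)
    print(monitors, {name: hex(addr) for name, addr in ptrs.items()})
```

Under those assumptions the monitor takes exactly the eight bytes that were padding in the first frame, which is why the interpreter state stays at …6a8 in both.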

My landline went dead shortly after I started work today. I managed to keep coding for an hour or so, but my PPC machine is in Boston and there’s only so much you can do blind. It turned out to be a blessing in disguise, however: I’ve been meaning to document the interpreter calling convention for some time, but I got bored every time I tried and went back to coding. Until today…

So, the interpreter calling convention. The interpreter calling convention is basically the register usage and the stack frame layout used within the interpreter, which for the C++ interpreter means everything in between StubRoutines::call_stub() and BytecodeInterpreter::run(). And while not immediately relevant, this includes any code created by a JIT.

The register usage is the simplest bit. There are three symbolically named registers that are valid at all times within the interpreter:

Rmethod: The address of the current methodOop.
Rlocals: The address of the first local variable. Local variables are accessed with negative indices, so the address of the second local variable is Rlocals - wordSize and so on. Note that Rlocals is essentially a stack pointer and is treated as such in the result convertors, so you need to be careful it’s pointing where you expect while methods are returning.
Rstate: The address of the current interpreterState object.

These registers are assigned to registers defined as non-volatile by the PPC ABIs, so they do not need to be saved or restored around calls to native ABI functions.
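
The negative indexing from Rlocals is easy to get backwards, so here is the arithmetic as a tiny sketch. The function name and the example address are made up for illustration; only the "Rlocals - wordSize" rule comes from the description above.

```python
WORD_SIZE = 4  # PPC32; this would be 8 on PPC64

def local_address(rlocals, index, word_size=WORD_SIZE):
    """Address of local variable `index`, counting from zero.
    Locals sit at successively lower addresses below Rlocals."""
    return rlocals - index * word_size

rlocals = 0x6a4  # arbitrary example address
for i in range(3):
    print(i, hex(local_address(rlocals, i)))
```

Local 0 is at Rlocals itself and each later local is one word lower.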

The stack is managed in accordance with the PPC ABIs, with r1 as the frame pointer and with frames laid out in the standard manner:

    | ...                  |  high addresses
+-> | Link area            |
|   +----------------------+
|   | Register save area   |
|   | Local variable space |
|   | Parameter list space |
+---+ Link area            |  low addresses
    +----------------------+

Each method has its own frame, the “local variable space” of which is laid out as follows:

    +----------------------+
    | interpreterState     |  high addresses
    +----------------------+
    | monitor 0            |
    |  ...                 |
    | monitor m            |
    +----------------------+
    | stack slot 0         |
    |  ...                 |
    | stack slot n         |
    +----------------------+
    | slop_factor          |
    +----------------------+
    | padding              |  low addresses
    +----------------------+

slop_factor is a hack. When a method is called the callee’s parameters are pushed onto the caller’s stack. The method pops these off and pushes its return value, if any. It is a given that the Java compiler allocates enough stack slots for the parameters of any method the caller invokes, but nobody seems sure it allocates enough slots for a return value in the event that the return value requires more slots than the parameters. This is dubiously referred to as “the static long no_params() issue”. slop_factor is essentially two extra words above the expression stack to protect what follows from being overwritten in this case. It’s probably unnecessary, but on PPC32 there is a fair chance that a stack overrun will corrupt the return address, just the kind of fun bug where the crash happens long after the cause. I left it in; I value my sanity over some wasted bytes of memory.
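
Here is the worst case as I understand it, written out as arithmetic. It is a sketch with hypothetical names. The one fact it relies on from outside this post is that a Java return value never takes more than two slots (long and double take two, everything else one or none).

```python
SLOP_WORDS = 2  # the two extra words described above

def overrun_words(free_slots, param_slots, return_slots):
    """Words written past the end of the caller's expression stack
    when the callee pops its parameters and pushes its result.
    `free_slots` is what the caller had left after pushing them."""
    return max(0, return_slots - param_slots - free_slots)

def slop_is_enough(free_slots, param_slots, return_slots):
    return overrun_words(free_slots, param_slots, return_slots) <= SLOP_WORDS

# static long no_params() called with a completely full stack:
print(overrun_words(free_slots=0, param_slots=0, return_slots=2))
print(slop_is_enough(free_slots=0, param_slots=0, return_slots=2))
```

If the compiler does leave room for return values then the overrun is always zero and the slop is never touched. If it doesn't, two words cover the worst case.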

The state-monitors-stack ordering is not random. On entry to a method the caller’s frame will be at the top of the stack and Rlocals will be pointing at an expression stack slot within that frame (or at the higher slop word, in frames with no stack). A method’s N parameters are its first N local variables, and in methods with more local variables than parameters this frame (the caller’s frame) may be extended to accommodate them. Having the expression stack the lowest thing in the frame minimises the amount of stuff that needs moving when a frame is extended. Of course, this means that the entire expression stack needs moving every time a monitor is allocated, but synchronized methods pre-allocate a monitor so the monitor list needs extending much less frequently than the expression stack. Having it this way round means the monitors never move, and you can’t move a monitor without a safepoint, so this is nice.
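
The trade-off can be put as a toy model: frames grow towards low addresses, so growing one region means moving every live region below it. The names and sizes here are illustrative only, and slop and padding are ignored because they hold no live data.

```python
# Regions of the local variable space, from high to low addresses.
FRAME_ORDER = ["state", "monitors", "stack"]

def extra_local_words(max_locals, param_words):
    """Words the caller's frame must grow by to hold the callee's
    non-parameter locals."""
    return max(0, max_locals - param_words)

def words_moved(sizes, growing, order=FRAME_ORDER):
    """Live words that must be shifted down when `growing` gets
    bigger: everything at lower addresses than it."""
    below = order[order.index(growing) + 1:]
    return sum(sizes[region] for region in below)

sizes = {"state": 18, "monitors": 2, "stack": 4}  # example sizes in words
print(extra_local_words(max_locals=5, param_words=2))
print(words_moved(sizes, "stack"))      # extending for locals
print(words_moved(sizes, "monitors"))   # allocating a monitor
print(words_moved(sizes, "stack", order=["state", "stack", "monitors"]))
```

With the stack lowest, extending for locals moves nothing and only a monitor allocation shifts the stack. With the other ordering the monitors would have to move on every extension, which is exactly the case to avoid.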

I’m fading now, but one final point is that random frame extension means you cannot unwind frames assuming they’re the same size they were when you created them.

I wrote the frame expander and it was easy. Now I’m up to 102 bytecodes, but this one is asking for a monitor which is the bit I was really dreading.

Being at a point where I’m interpreting bytecodes is really cool: every new thing I implement gets the interpreter a whole load further. I’ve now executed 70 instructions, including calls to both native and non-native methods with both void and non-void return values. It’s currently stopping at the start of the first method to require more locals than it takes parameters: the first method whose frame may need expanding, in other words. This might be slightly complicated by the fact that Rlocals is not used exactly as I thought it would be when I planned the frame-expanding code. I’m hoping this doesn’t matter.