gbenson.net – Page 13

When I’m testing my stuff I don’t use vanilla IcedTea. The standard debug build is an icedtea-against-icedtea build, which takes hours and hours to complete. Instead, I have a set of patches which enable assertions and disable optimization in the icedtea-against-ecj build. This builds much faster. If you are working on zero (or something else Hotspot) then you may find them handy:

cd /path/to/icedtea
wget http://inauspicious.org/files/icedtea/mixtec-patches.patch
patch -p1 < mixtec-patches.patch
make clean
make icedtea-against-ecj

Once you have that, you can use make hotspot to rebuild the JVM only, without going all the way into and out of the class library makefiles.

Note to self: “error while loading shared libraries: libjvm.so: R_PPC_REL24 relocation at 0x0fe4cc80 for symbol ‘memcpy’ out of range” means that something in libjvm.so was not built with -fPIC. If you statically linked some library in there, it’s probably that.

Now that the zero-assembler port is committed I’ve been thinking about dropping the ppc-specific one. The plan is to use zero for our other platforms (s390, s390x, ia64 and arm) and there’s no obvious reason to maintain a separate port for ppc and ppc64. Over the past two days aph and I ported zero to amd64, and I’ve been thinking of ways to integrate zero into the build system so that you can choose to use it on any platform with minimal effort. This will be easier and clearer without a load of ppc-specific stuff knocking around in there. There are a couple of bugs in the ppc-specific port too, bugs I noticed while writing zero, and I just can’t be bothered to fix them.

I did some simple speed tests. These are the times, in seconds, taken by jar cf tools.jar:

	ppc	zero
32-bit	67.6	70.3
64-bit	67.9	69.7

Not bad. A look at the call graph shows most of the method calls are interpreted methods, so I’m not sure it’s entirely representative — I’d expect programs doing a lot of native calls to be hit harder — but I like it. I’m interested to know why 64-bit zero is faster than 32 too — that’s not what you’d expect on ppc — but that can wait until I get round to some profiling. I want to get it running on the other platforms first.
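To put the table in relative terms, here is a small C helper. The function name is made up for illustration, and the only inputs are the four timings above.

```c
/* Percentage by which `candidate` is slower than `baseline`.
   A negative result would mean the candidate is faster. */
double slowdown_percent(double baseline, double candidate)
{
    return (candidate / baseline - 1.0) * 100.0;
}
```

Feeding it the figures from the table, slowdown_percent(67.6, 70.3) comes out at roughly 4.0 for 32-bit and slowdown_percent(67.9, 69.7) at roughly 2.7 for 64-bit.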

Ok, now I have zero running on ppc64, and… the appletviewer works! The appletviewer never worked on 64-bit with my original, ppc-specific port so this is a first!

No assembler.

An hour or so later the zero-assembler Hotspot did compile Hello World too.

On January 18 2008, at 24 minutes and 59 seconds past 3 in the afternoon (Greenwich Mean Time) the zero-assembler Hotspot did run Hello World.

By last Thursday I’d given up on my idea of using alloca() to fake growable blocks of memory in the stack. I had it executing bytecodes, 8 or 9 instructions, but the more I played with it the more I realised how hacky and fragile it was becoming. The extra add is definitely a bug, but weird stuff starts to happen if you do alloca(size - 15) to circumvent it and I’m not entirely convinced this isn’t because the extra 16 bytes you allocate mask another, PPC-specific issue. It’s difficult to tell. I decided I’d much rather fix OpenJDK to use non-stack locks once than fix alloca() on every single platform.

So, on Friday I junked the bits that used alloca() and reverted the stack class back to big-block-o’-ram mode. The original pre-alloca() version allocated the block in Thread::pd_initialize(), but I wanted to make the Java stack match the runtime stack in terms of size, and the initial thread’s stack isn’t set up when pd_initialize() is called, so I moved it into the call stub. I ended Friday trying to figure out how I could make locking work.

Sometime over the weekend, though, I realised that the call stub is in the stack whenever you’re in Java code: I could grab my big block of ram there, with alloca(), but in a non-hacky way that ought to work, and leave the locking code untouched!
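A minimal sketch of that idea in C, not the real call stub: the names call_stub, java_body and fill_and_sum are hypothetical, and <alloca.h> assumes a glibc-style Linux system. The point is only that a block taken with alloca() in the stub's frame stays valid for as long as the code called from the stub is running.

```c
#include <alloca.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for "run some Java code": it is handed the
   block and may use it for as long as it is running. */
typedef intptr_t (*java_body)(intptr_t *block, size_t slots);

/* Hypothetical call stub. The alloca() block lives in this function's
   frame, so it is inside the runtime stack and stays valid until
   body() returns. */
intptr_t call_stub(size_t slots, java_body body)
{
    intptr_t *block = alloca(slots * sizeof *block);
    return body(block, slots);
}

/* Example body: fill the block with 0..slots-1 and return the sum. */
intptr_t fill_and_sum(intptr_t *block, size_t slots)
{
    intptr_t sum = 0;
    for (size_t i = 0; i < slots; i++) {
        block[i] = (intptr_t)i;
        sum += block[i];
    }
    return sum;
}
```

Calling call_stub(4, fill_and_sum) hands the body a four-slot block carved out of the stub's own frame. Anything the body places in that block, locks included, therefore has an address inside the runtime stack.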

Suddenly I’m at the System.currentTimeMillis() milestone, on the verge of massive success :)

My generic port is shaping up. I keep wanting to call it the portable interpreter except that’s what people used to call the C++ interpreter so I keep ending up calling it the really portable interpreter or some such thing. All the “CPU-specific” files live in hotspot/src/cpu/zero and hotspot/src/os_cpu/linux_zero so I guess it’s called zero (it was snappier than “nothing”). It’s come to mean zero-assembler in my head although there is some assembler in there, a spinpause and a 64-bit atomic copy, but if I can’t write those for a new platform within a day or two then I probably need sacking!

Anyway, whatever it’s called, it’s shaping up. I have a funky stack class with a slightly odd allocation strategy to allow it to live in alloca allocated memory, I’ve written my call stub using it and am now working on the normal entry. I plan to use the same calling convention throughout, the stack-based one that the bytecode interpreter uses. I have no specific native calling convention I need to match, and doing it that way saves writing a load of result convertors.
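For illustration only, here is roughly what a stack living in a caller-supplied block and a stack-based calling convention can look like in C. This is a sketch under my own assumptions, not the stack class from the port: zstack, its functions and add_entry are all made-up names.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stack that grows downwards inside a block supplied by
   the caller (an alloca() block, or any other buffer). */
typedef struct {
    intptr_t *limit;  /* lowest usable slot */
    intptr_t *sp;     /* most recently pushed slot, or the block's end if empty */
} zstack;

void zstack_init(zstack *s, intptr_t *block, size_t slots)
{
    s->limit = block;
    s->sp = block + slots;
}

/* Number of slots that can still be pushed. */
size_t zstack_available(const zstack *s)
{
    return (size_t)(s->sp - s->limit);
}

/* Returns 1 on success, 0 if the block is full. */
int zstack_push(zstack *s, intptr_t value)
{
    if (s->sp == s->limit)
        return 0;
    *--s->sp = value;
    return 1;
}

/* The caller must ensure the stack is not empty. */
intptr_t zstack_pop(zstack *s)
{
    return *s->sp++;
}

/* Stack-based convention: an entry point pops its arguments and pushes
   its result, so the caller never needs a per-type result convertor. */
void add_entry(zstack *s)
{
    intptr_t b = zstack_pop(s);
    intptr_t a = zstack_pop(s);
    zstack_push(s, a + b);
}
```

A caller pushes 2 and 3, calls add_entry(), and pops 5. Arguments and results travel through the same stack in both directions, which is the attraction of using one convention throughout.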

Ok, back to coding…

Lately I’ve been experimenting with the idea of a generic Linux port of Hotspot. I updated the templater and generated a new set of stubs, fixed the build system to build them, then started filling them in. I got into interpreter code in a couple of days, which is where the fun starts. There are two big problems: handling the stack, and calling native functions. libffi can do the latter, but the stack is looking to be fun. I was trying not to start thinking about it before Christmas but I failed…

My original idea was not to use the runtime stack at all: just allocate a block of memory somewhere and use that as the Java stack. Great. Ok, you need to make the garbage collector know about it, but great. Except that it turns out that synchronization works by exchanging the pointer to a locked object with a pointer to the lock, and the way it distinguishes the two is that the locks live in the runtime stack. I tried to figure a way of allocating my big block on the stack at the start of the thread, which it turns out is totally possible for created threads but totally not possible for attached threads.
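The "is it in the runtime stack?" test is the part that bites, so here is a hedged sketch of it in C. This is not Hotspot's code: thread_stack and is_in_stack are illustrative names, and the sketch assumes a stack that grows downwards from a known base.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of one thread's runtime stack. */
typedef struct {
    uintptr_t base;  /* one past the highest address; the stack grows down */
    uintptr_t size;  /* size in bytes */
} thread_stack;

/* True if addr lies in [base - size, base). Under the scheme described
   above, a word that passes this test is taken to be a pointer to a
   lock; anything else is taken to be something other than a lock. */
bool is_in_stack(const thread_stack *s, uintptr_t addr)
{
    return addr < s->base && addr >= s->base - s->size;
}
```

A lock placed in a block from malloc() would fail this test even though it is a perfectly good lock, which is why the separately allocated Java stack does not work without changing the locking code.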

Now I’m thinking about using alloca() to sneak the Java frames inside the runtime frames. My idea was that if alloca() allocates memory contiguously then so long as nothing else in your method used it you could “resize” your Java frame by simply allocating another block. You couldn’t resize your caller’s frame, but you can work around that with a copy. However, the code that GCC generates is weird, and I’m hoping it’s simply a bug. On i386 you get this:

mov    bytes_to_allocate,%eax
add    $0xf,%eax
add    $0xf,%eax
shr    $0x4,%eax
shl    $0x4,%eax
sub    %eax,%esp

and on ppc you get this:

lwz    r9,bytes_to_allocate
addi   r9,r9,15
addi   r0,r9,15
rlwinm r0,r0,28,4,31
rlwinm r0,r0,4,0,27
lwz    r9,0(r1)
neg    r0,r0
stwux  r9,r1,r0
lwz    r11,0(r1)
lwz    r0,4(r11)
mtlr   r0

It looks like it’s trying to align the blocks on 16-byte boundaries and that that second addition is a bug. But where to find a GCC guru at this time of year?
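To make the arithmetic concrete, here are the two computations side by side in C. The function names are mine. The first is the usual round-up-to-16 idiom, and the second mirrors what both listings do (add 15, add 15 again, then clear the low four bits with the shift or rotate pair).

```c
#include <stddef.h>

/* Round n up to the next multiple of 16. */
size_t round_up_16(size_t n)
{
    return ((n + 15) >> 4) << 4;
}

/* What the generated code computes: 15 is added twice before the low
   four bits are cleared. */
size_t gcc_alloca_size(size_t n)
{
    return ((n + 15 + 15) >> 4) << 4;
}
```

For a 16-byte request round_up_16() gives 16 but gcc_alloca_size() gives 32. Working through the remainders, the second version allocates an extra 16 bytes for every size except those one more than a multiple of 16 (1, 17, 33 and so on), where the two agree.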