Note to self: “error while loading shared libraries: libjvm.so: R_PPC_REL24 relocation at 0x0fe4cc80 for symbol ‘memcpy’ out of range” means that something in libjvm.so was not built with -fPIC. If you statically linked some library in there, it’s probably that.
Now that the zero-assembler port is committed I’ve been thinking about dropping the ppc-specific one. The plan is to use zero for our other platforms (s390, s390x, ia64 and arm) and there’s no obvious reason to maintain a separate port for ppc and ppc64. These past two days aph and I ported zero to amd64 and I’ve been thinking of ways to integrate zero into the build system so that you can choose to use it on any platform with minimal effort. This will be easier and clearer without a load of ppc-specific stuff knocking around in there. There are a couple of bugs in the ppc-specific port too, bugs I noticed while writing zero, and I just can’t be bothered to fix them.
I did some simple speed tests, the time in seconds taken by jar cf tools.jar:
| | ppc | zero |
|---|---|---|
| 32-bit | 67.6 | 70.3 |
| 64-bit | 67.9 | 69.7 |
Not bad. A look at the call graph shows most of the method calls are interpreted methods, so I’m not sure it’s entirely representative — I’d expect programs doing a lot of native calls to be hit harder — but I like it. I’m interested to know why 64-bit zero is faster than 32 too — that’s not what you’d expect on ppc — but that can wait until I get round to some profiling. I want to get it running on the other platforms first.
Ok, now I have zero running on ppc64, and… the appletviewer works! The appletviewer never worked on 64-bit with my original, ppc-specific port so this is a first!
An hour or so later the zero-assembler Hotspot did compile Hello World too.
On January 18 2008, at 24 minutes and 59 seconds past 3 in the afternoon (Greenwich Mean Time) the zero-assembler Hotspot did run Hello World.
By last Thursday I’d given up on my idea of using alloca() to fake growable blocks of memory in the stack. I had it executing bytecodes, 8 or 9 instructions, but the more I played with it the more I realised how hacky and fragile it was becoming. The extra add is definitely a bug, but weird stuff starts to happen if you do alloca(size - 15) to circumvent it and I’m not entirely convinced this isn’t because the extra 16 bytes you allocate mask another, PPC-specific issue. It’s difficult to tell. I decided I’d much rather fix OpenJDK to use non-stack locks once than fix alloca() on every single platform.
So, on Friday I junked the bits that used alloca() and reverted the stack class back to big-block-o’-ram mode. The original pre-alloca() version allocated the block in Thread::pd_initialize(), but I wanted to make the Java stack match the runtime stack in terms of size, and the initial thread’s stack isn’t set up when pd_initialize() is called, so I moved it into the call stub. I ended Friday trying to figure out how I could make locking work.
Sometime over the weekend, though, I realised that call stub is in the stack whenever you’re in Java code: I could grab my big block of ram there, with alloca(), but in a non-hacky way that ought to work, and leave the locking code untouched!
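To make the idea concrete, here is a minimal sketch in C, not the actual Hotspot code. The names java_stack, stack_push, stack_pop, interpret_add and call_stub are all hypothetical, and the "interpreter" is a stand-in that just adds two numbers. The point is only that the block is grabbed once, with alloca(), in a frame that stays live for as long as Java code is running, so everything in the block sits inside the runtime stack.

```c
#include <alloca.h>
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical Java stack living in a caller-supplied block.
 * It grows downward, from base towards limit. */
typedef struct {
    intptr_t *limit;  /* lowest usable slot */
    intptr_t *base;   /* one past the highest slot */
    intptr_t *sp;     /* most recently pushed slot, or base if empty */
} java_stack;

static void stack_push(java_stack *s, intptr_t value) {
    assert(s->sp > s->limit);  /* overflow check */
    *--s->sp = value;
}

static intptr_t stack_pop(java_stack *s) {
    assert(s->sp < s->base);   /* underflow check */
    return *s->sp++;
}

/* Stand-in for the interpreter: pops two values, pushes their sum,
 * then pops the result. Stack-based convention throughout. */
static intptr_t interpret_add(java_stack *s) {
    intptr_t b = stack_pop(s);
    intptr_t a = stack_pop(s);
    stack_push(s, a + b);
    return stack_pop(s);
}

/* Hypothetical call stub: allocates the whole block in its own frame,
 * which remains on the runtime stack until the "Java code" returns. */
intptr_t call_stub(size_t words, intptr_t a, intptr_t b) {
    intptr_t *block = alloca(words * sizeof(intptr_t));
    java_stack s = { .limit = block, .base = block + words, .sp = block + words };
    stack_push(&s, a);
    stack_push(&s, b);
    return interpret_add(&s);
}
```

Because the block is allocated once per call stub and never resized, this sidesteps the contiguity tricks I gave up on, and anything that checks "is this address in the runtime stack?" should still get the answer it expects.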
Suddenly I’m at the System.currentTimeMillis() milestone, on the verge of massive success :)
My generic port is shaping up. I keep wanting to call it the portable interpreter except that’s what people used to call the C++ interpreter so I keep ending up calling it the really portable interpreter or some such thing. All the “CPU-specific” files live in hotspot/src/cpu/zero and hotspot/src/os_cpu/linux_zero so I guess it’s called zero (it was snappier than “nothing”). It’s come to mean zero-assembler in my head although there is some assembler in there, a spinpause and a 64-bit atomic copy, but if I can’t write those for a new platform within a day or two then I probably need sacking!
Anyway, whatever it’s called, it’s shaping up. I have a funky stack class with a slightly odd allocation strategy to allow it to live in alloca()-allocated memory, I’ve written my call stub using it and am now working on the normal entry. I plan to use the same calling convention throughout, the stack-based one that the bytecode interpreter uses. I have no specific native calling convention I need to match, and doing it that way saves writing a load of result converters.
Ok, back to coding…
Lately I’ve been experimenting with the idea of a generic Linux port of Hotspot. I updated the templater and generated a new set of stubs, fixed the build system to build them, then started filling them in. I got into interpreter code in a couple of days, which is where the fun starts. There are two big problems: handling the stack, and calling native functions. libffi can do the latter, but the stack is looking to be fun. I was trying not to start thinking about it before Christmas but I failed…
My original idea was not to use the runtime stack at all: just allocate a block of memory somewhere and use that as the Java stack. Great. Ok, you need to make the garbage collector know about it, but great. Except that it turns out that synchronization works by exchanging the pointer to a locked object with a pointer to the lock, and the way it distinguishes the two is that the locks live in the runtime stack. I tried to figure a way of allocating my big block on the stack at the start of the thread, which it turns out is totally possible for created threads but totally not possible for attached threads.
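Here is a rough sketch of the kind of address check that gets in the way, in C. It is my own simplification, not Hotspot’s code: address_in_stack and its parameters are hypothetical names, and I’m assuming a stack that grows downward from stack_base.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical check: does addr fall inside a thread's runtime stack?
 * Assumes the stack occupies [stack_base - stack_size, stack_base). */
bool address_in_stack(uintptr_t stack_base, size_t stack_size, uintptr_t addr) {
    return addr < stack_base && addr >= stack_base - stack_size;
}
```

If the value swapped into the object is tested like this, a lock that lives in a separately allocated block fails the test and is taken for something else, which is why the big block has to end up inside the runtime stack somehow.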
Now I’m thinking about using alloca() to sneak the Java frames inside the runtime frames. My idea was that if alloca() allocates memory contiguously then so long as nothing else in your method used it you could “resize” your Java frame by simply allocating another block. You couldn’t resize your caller’s frame, but you can work around that with a copy. However, the code that GCC generates is weird, and I’m hoping it’s simply a bug. On i386 you get this:
    mov bytes_to_allocate,%eax
    add $0xf,%eax
    add $0xf,%eax
    shr $0x4,%eax
    shl $0x4,%eax
    sub %eax,%esp
and on ppc you get this:
    lwz r9,bytes_to_allocate
    addi r9,r9,15
    addi r0,r9,15
    rlwinm r0,r0,28,4,31
    rlwinm r0,r0,4,0,27
    lwz r9,0(r1)
    neg r0,r0
    stwux r9,r1,r0
    lwz r11,0(r1)
    lwz r0,4(r11)
    mtlr r0
It looks like it’s trying to align the blocks on 16-byte boundaries and that that second addition is a bug. But where to find a GCC guru at this time of year?
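The arithmetic in those listings can be written out in C to see what the second addition does. This is just my reading of the generated code, and the function names are made up for illustration: expected_alloca_size is the rounding I’d have expected, observed_alloca_size is what the two adds and two shifts actually compute.

```c
#include <stddef.h>

/* Round up to the next multiple of 16: add 15, then clear the low 4 bits. */
size_t expected_alloca_size(size_t n) {
    return (n + 15) & ~(size_t)15;
}

/* What the listings do: add 15 twice, shift right by 4, shift left by 4. */
size_t observed_alloca_size(size_t n) {
    return ((n + 15 + 15) >> 4) << 4;
}
```

For a request of 16 bytes the first gives 16 and the second gives 32. The two only agree when the request is one more than a multiple of 16; otherwise the generated code takes an extra 16 bytes, so consecutive blocks are still aligned but not as tightly packed as I’d assumed.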
It’s been a while. These past two weeks I’ve been starting to port the client JIT and fixing an elusive crash on 32-bit. The JIT stuff is early days. I won’t commit every little change I make, but I’m going to drop my work into IcedTea once a week or so just so I’m not working in the dark (disable icedtea-core-build.patch to build it). But I’m more excited about fixing the crash. It was the last known bug, and since avdyk’s news that Eclipse runs my feelings have shifted from “hmmm, this seems to work” to “wow!”, so if you’re having problems on PPC with any IcedTea newer than f35ffd73f3c4 then I really want to hear about it!
