gbenson.net – Page 15

Being at a point where I’m interpreting bytecodes is really cool: every new thing I implement gets the interpreter a whole load further. I’ve now executed 70 instructions, including calls to both native and non-native methods with both void and non-void return values. It’s currently stopping at the start of the first method to require more locals than it takes parameters: the first method whose frame may need expanding, in other words. This might be slightly complicated by the fact that Rlocals is not used exactly as I thought it would be when I planned the frame-expanding code. I’m hoping this doesn’t matter.

I just interpreted a bytecode:

$ control/build/linux-ppc/bin/java -XX:+TraceBytecodes
VM option '+TraceBytecodes'

[12474] static void java.lang.Object.<clinit>()
[12474]        1     0  invokestatic 0 <registerNatives> <()V>
[12474]        2     3  return
error: bad message from interpreter: 9

A couple of people have asked me to explain what I mean by “the C++ interpreter” and why bits of it are written in assembler. Here is a brief summary:

OpenJDK is Sun Microsystems’ open-source implementation of the Java platform.
Hotspot is the Java virtual machine of OpenJDK.
Hotspot comprises two interpreters and two JITs.
A running instance of Hotspot comprises one interpreter optionally supplemented by one JIT.
The interpreters are the template interpreter (aka the asm interpreter) and the C++ interpreter (aka the bytecode interpreter).
The JITs are client (aka compiler 1 aka C1) and server (aka compiler 2 aka C2).

The point of intersection between the two interpreters and the two JITs is method dispatch. Every method in Hotspot is defined by a methodOop, and each methodOop has a method entry* which is the address of the code that will execute the method. Different methods have different entries, depending upon whether they are native or not, synchronized or not, JIT-compiled or not, etc etc, but every method is invoked the same way: you put the address of the methodOop in one register, the address of the parameters in another, then jump to the method entry. The way the C++ interpreter differs from the template interpreter is that in the template interpreter everything is done in assembler whereas in the C++ interpreter the non-native entries simply** call BytecodeInterpreter::run(). The C++ interpreter’s entries are therefore much smaller and so easier to port. But they are still written in assembler.

* This is a slight lie: they have two.
** ha ha ha ha ha!
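
To make the dispatch idea concrete, here is a minimal C++ sketch. It is not Hotspot code: `Method`, `EntryPoint`, `invoke` and the two entry functions are hypothetical names, and ordinary function arguments stand in for the two registers.

```cpp
#include <cassert>
#include <cstdint>

struct Method;  // hypothetical stand-in for a methodOop

// Every entry takes the same two "registers": the method and its parameters.
using EntryPoint = intptr_t (*)(Method* method, intptr_t* params);

struct Method {
  EntryPoint entry;    // address of the code that will execute this method
  int        nparams;  // number of parameters
};

// Hypothetical entry for an interpreted method. A real one would set up a
// frame and call the bytecode interpreter; this one just sums its parameters.
intptr_t interpreted_entry(Method* method, intptr_t* params) {
  intptr_t sum = 0;
  for (int i = 0; i < method->nparams; i++)
    sum += params[i];
  return sum;
}

// Hypothetical entry for a native method: returns its first parameter negated.
intptr_t native_entry(Method* method, intptr_t* params) {
  assert(method->nparams >= 1);
  return -params[0];
}

// The caller never needs to know which kind of method it is invoking.
intptr_t invoke(Method* method, intptr_t* params) {
  assert(method != nullptr && method->entry != nullptr);
  return method->entry(method, params);
}
```

Calling `invoke` on a `Method` whose entry is `interpreted_entry` or `native_entry` goes through exactly the same code path, which is the property that lets interpreters and JITs coexist.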

I think I’ve finally settled on the basics of an interpreter calling convention. It’s difficult as I don’t really know what I’ll need in the interpreter and it took a bit of thrashing around while I tried to figure out what to do with the stack, but here goes:

There are two non-volatile registers:

Rmethod is the address of the current methodOop
Rlocals is the address of the first local variable

These are expected to be valid at all times within the interpreter.
In addition there are two volatile registers:

Rnparams is the number of parameters
Rnlocals is the number of local variables (including parameters)

These are only expected to be valid at interpreter entry points. They don’t even need to be in registers (you can read them from the methodOop) but a) they are already in registers in call_stub and b) both are needed in registers in the entry points, so as long as it’s no trouble to pass them like this I will continue to do so.

The stack frames will be laid out in accordance with the PPC ABIs:

    | ...                  |  high addresses
+-> | Link area            |
|   +----------------------+
|   | Register save area   |
|   | Local variable space |
|   | Parameter list space |
+---+ Link area            |  low addresses
    +----------------------+

The area referred to by the ABIs as “local variable space” will be arranged as follows:
```
    [local variable Rnlocals]
      ...
    [local variable 1       ] <-- Rlocals
    [padding as required    ]
```
Such that the first local variable is accessed as 0(Rlocals), the second as wordSize(Rlocals), and so on. This only works if we always know in advance how many local variables the method we are calling will need, which seems reasonable. If there are cases where this isn’t so I can insert a check in the method entry to resize the frame as necessary, but this is expected to be time-consuming so should be the exception rather than the rule.
Any additional stack slots will be allocated below the first local variable, such that the first additional stack slot will be referenced as -wordSize(Rlocals).
Monitors will be allocated below any additional stack slots. I may well always allocate some space for monitors depending on how frequently they are created, how many any given method is expected to require, and exactly how time-consuming a frame-resize is.
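
The addressing scheme above boils down to two offset calculations. A small sketch, assuming one slot per word; `kWordSize`, `local_offset` and `extra_slot_offset` are made-up names, and `sizeof(void*)` stands in for wordSize:

```cpp
#include <cassert>
#include <cstddef>

// Assumption: one stack slot per word; sizeof(void*) stands in for wordSize.
const std::ptrdiff_t kWordSize = sizeof(void*);

// Byte offset from Rlocals of local variable `index` (0-based):
// the first local is at 0(Rlocals), the second at wordSize(Rlocals).
std::ptrdiff_t local_offset(int index) {
  assert(index >= 0);
  return index * kWordSize;
}

// Byte offset from Rlocals of additional stack slot `index` (0-based):
// the first additional slot is at -wordSize(Rlocals).
std::ptrdiff_t extra_slot_offset(int index) {
  assert(index >= 0);
  return -(index + 1) * kWordSize;
}
```

Locals sit at non-negative offsets and everything allocated later sits at negative ones, so the two regions can never collide.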

Ok, I think that’s it.

This stack thing is proving to be a real pig. The problem is that the stack for a method is set up in two parts. The caller allocates space for the method’s parameters and fills them in prior to calling the method entry, and the method entry then extends the stack to allow for any additional local variables. This is fine on i486 and amd64 (and I think sparc too) where you have free rein to set the stack pointer to whatever you want, but an absolute pain on ppc where the ABI dictates the stack be arranged in frames. (I think this means that the “stack pointer” on ppc is actually a frame pointer.) I thought I could get around this because the caller can look in the methodOop to see how many additional local variables are required; you’d just allocate them at the same time as the parameters (which are just the first however many local variables anyway). Except it turns out that the C++ interpreter can arbitrarily extend the stack too, to allocate space for monitors, so I still need to figure out a way to do it!

aph pointed out that signals make a mess of my idea of writing below r1 and protecting later, but I realised that you know in advance how much stack a method will use so you can just set up the stack for that before you call it. Of course, all that has to be in assembler, so I can’t use my funky StackFrame class :( I decided it’s high time I made a table of register usage across ABIs to aid me in writing this:

| Register | ppc | ppc64 |
|---|---|---|
| `r0` | Volatile register which may be modified during function linkage | Same as ppc |
| `r1` | Stack frame pointer | Same as ppc |
| `r2` | Reserved | TOC pointer |
| `r3`-`r4` | Volatile registers used for parameter-passing and return values | Same as ppc |
| `r5`-`r10` | Volatile registers used for parameter-passing | Same as ppc |
| `r11` | Volatile register which may be modified during function linkage | Volatile register used in calls by pointer and/or as an environment pointer |
| `r12` | Volatile register which may be modified during function linkage | Volatile register used in function linkage and exception handling |
| `r13` | Small data area pointer | System thread ID |
| `r14`-`r30` | Non-volatile registers used for local variables | Same as ppc |
| `r31` | Non-volatile register used for local variables or as an environment pointer | Same as ppc |

I finished converting everything to 64-bit. It was way easier than I expected and I’m very glad I took the time to do it because what I thought were vast and insurmountable differences turned out to be pretty minuscule. I managed to tuck everything away in macros, the ABI differences in prolog and epilog, and enter (née function_entry_point) and call, and the register-size differences in much simpler ones like load and store which just map to lwz/ld and stw/std respectively. Hiding the ugly stuff in the assembler keeps the generators happily free of conditionals.
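
As an illustration of hiding the register-size difference behind `load` and `store`, here is a toy sketch. `TinyAssembler` is hypothetical and emits text rather than machine code; only the lwz/ld and stw/std mapping comes from the description above.

```cpp
#include <cassert>
#include <string>

// Hypothetical miniature "assembler" that emits text instead of machine code.
struct TinyAssembler {
  bool        is_64bit;  // true when targeting ppc64
  std::string out;       // emitted assembly, one instruction per line

  // Word-sized load: lwz on ppc, ld on ppc64.
  void load(const std::string& dst, int offset, const std::string& base) {
    emit(is_64bit ? "ld" : "lwz", dst, offset, base);
  }

  // Word-sized store: stw on ppc, std on ppc64.
  void store(const std::string& src, int offset, const std::string& base) {
    emit(is_64bit ? "std" : "stw", src, offset, base);
  }

 private:
  // Append one "op reg, offset(base)" line to the output.
  void emit(const char* op, const std::string& reg, int offset,
            const std::string& base) {
    assert(!reg.empty() && !base.empty());
    out += std::string(op) + " " + reg + ", " + std::to_string(offset) +
           "(" + base + ")\n";
  }
};
```

Code that calls `load` and `store` contains no conditionals of its own; the single `is_64bit` test lives inside the assembler.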

My next job is designing the interpreter calling convention. There’s no real reason for the interpreter to follow the platform’s ABI, and the fact that Java is essentially stack-based and PPC is essentially register-based is a very good reason not to. So I’m trying to figure out how to arrange the stack.

The general layout of stack frames is the same under both 32- and 64-bit ABIs:

    | ...                  |  high addresses
+-> | Link area            |
|   +----------------------+
|   | Register save area   |
|   | Local variable space |
|   | Parameter list space |
+---+ Link area            |  low addresses
    +----------------------+

The stack grows downwards, and the stack pointer, r1, points to the first word of the link area, the lowest address, such that all accesses into the stack are relative to r1 with a positive offset. The ABIs are pretty relaxed about what happens in the stack, but one thing they’re firm about is that 0(r1) points to the previous frame — essentially it’s where you save your caller’s r1. This is slightly irritating, because I think the interpreter would like an open-ended stack, but the requirement to maintain a valid link area at the very top of the stack would seem to preclude this. Aside from anything else, if r1 isn’t pointing to a valid link area then gdb cannot unwind the stack and produce backtraces. I discovered this empirically a while back ;)
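
The back chain is what makes unwinding possible. A rough sketch of the walk a debugger performs, using an ordinary array as a toy stack; `walk_back_chain` and `build_toy_stack` are illustrative names, and the sketch assumes the outermost frame stores a null back chain:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Walk a chain of frames the way an unwinder would: the word at 0(sp)
// holds the caller's stack pointer, and a null back chain ends the walk.
// Assumption: the outermost frame stores 0 there.
std::vector<const uintptr_t*> walk_back_chain(const uintptr_t* sp) {
  std::vector<const uintptr_t*> frames;
  while (sp != nullptr) {
    frames.push_back(sp);
    sp = reinterpret_cast<const uintptr_t*>(*sp);  // load 0(sp)
  }
  return frames;
}

// Build a toy downward-growing stack of three four-word frames inside
// `stack` and return the innermost frame's stack pointer.
const uintptr_t* build_toy_stack(uintptr_t* stack, int nwords) {
  assert(nwords >= 12);
  uintptr_t* outer  = stack + nwords - 4;  // highest address
  uintptr_t* middle = outer - 4;
  uintptr_t* inner  = middle - 4;          // lowest address
  *outer  = 0;                             // end of the chain
  *middle = reinterpret_cast<uintptr_t>(outer);
  *inner  = reinterpret_cast<uintptr_t>(middle);
  return inner;
}
```

If any frame's first word holds something other than its caller's stack pointer, the loop wanders off into garbage, which is what gdb runs into when r1 does not point at a valid link area.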

My thinking at the moment is to leave r1 alone, and to use another register (r31 maybe) as the interpreter’s stack pointer. That way the interpreter can extend the stack however it likes, without thought to link areas and alignment, on the assumption that if it ever jumps out into C-land then it must first create a valid stack frame around its own data to protect it. Specifically, this will be a frame with no register save area, meaning the interpreter’s stuff falls neatly into the local variable space.

I’m not sure how stack walking will work under this scenario. It may be that it’s better to do this stack-shuffling every time the interpreter calls a new method, such that each method call, be it Java or C, has its own valid ABI stack frame. This will undoubtedly resolve itself as I progress.

I’ve been working on converting my existing 32-bit ppc code to ppc64. A lot of the ABI is the same (and my funky StackFrame class hides a lot of the differences) but one weird thing is that on ppc64 function pointers do not point to the actual code but to some descriptor thing. OpenJDK has loads of places that do the following, and they’re all wrong:

void (*function)() = generate_some_code();
...
(*function)();

I think the descriptors are supposed to be in some table that the linker creates, but I sneaked around it by making an assembler macro, function_entry_point(), that creates a function descriptor that points to the address immediately following it, the idea being that you place it at the start of anything you’re going to jump into from C. I’m not entirely happy with it: I’m not sure what the implications of mixing descriptors and code like this are, and it makes the stubs non-relocatable, and omitting the function_entry_point() will cause a mysterious segfault as the processor dereferences the first two instructions of your stub and jumps to wherever they point. But needs must…
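
For reference, a ppc64 function descriptor is three doublewords: the entry address, the TOC base, and an environment pointer. The sketch below shows the "descriptor followed immediately by code" trick in plain C++. `emit_function_entry_point` is a hypothetical helper, not the actual macro, and what the real macro stores in the TOC and environment fields is not stated above, so those values are assumptions.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Layout of a ppc64 function descriptor: three doublewords.
struct FunctionDescriptor {
  uint64_t entry;  // address of the first instruction
  uint64_t toc;    // TOC base to load into r2
  uint64_t env;    // environment pointer
};

// Write a descriptor at `buffer` whose entry field points at the bytes
// immediately following it, and return where the stub's code should start.
// Assumption: the caller supplies the TOC value and env is left as 0.
uint8_t* emit_function_entry_point(uint8_t* buffer, uint64_t toc) {
  assert(buffer != nullptr);
  uint8_t* code_start = buffer + sizeof(FunctionDescriptor);
  FunctionDescriptor desc;
  desc.entry = reinterpret_cast<uintptr_t>(code_start);
  desc.toc   = toc;
  desc.env   = 0;
  std::memcpy(buffer, &desc, sizeof(desc));  // avoids alignment assumptions
  return code_start;
}
```

Because the descriptor holds an absolute address, copying the buffer elsewhere leaves `entry` pointing at the old location, which is the non-relocatability mentioned above. The segfault also falls out of the layout: two 4-byte instructions occupy the same 8 bytes that a caller reads as `entry`.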

So as if to atone for my awful disassembler hack I wrote a really neat stack frame class. Say you’re writing some function that needs some non-volatile registers and maybe a local variable. You define the frame like this:

StackFrame frame = StackFrame();

const Register start_addr     = frame.get_register();
const Register end_addr       = frame.get_register();
const FloatRegister mr_floaty = frame.get_float_register();

const Address variable = frame.get_local_variable();

Once you’ve defined the frame you just wrap your code with my funky new prolog and epilog macros and all the storing and restoring and stack pointer setting and alignment stuff gets done for you:

__ prolog (frame);

__ addi (end_addr, start_addr, 8);
__ lfd (mr_floaty, variable);
__ call (some_func);

__ epilog (frame);

As far as I can see the other OpenJDK architectures just do this kind of thing manually which for someone as slapdash as me is a whole world of pain.
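
The internals of the StackFrame class are not shown here, so the following is only a guess at the kind of bookkeeping such a class has to do. `FrameSketch` and all of its members are hypothetical. It assumes registers are handed out from r14 upwards and that sizes are rounded up to a multiple of 16 bytes, and it ignores the link area and parameter space.

```cpp
#include <cassert>

// Hypothetical sketch of a frame-description class: it only does the
// bookkeeping, where the real thing would also drive prolog and epilog.
class FrameSketch {
 public:
  // Hand out the next unused non-volatile register number (r14..r31).
  int get_register() {
    assert(next_register_ <= 31);
    return next_register_++;
  }

  // Reserve one word of local variable space; returns its index.
  int get_local_variable() { return num_locals_++; }

  // Number of registers the prolog would have to save.
  int saved_registers() const { return next_register_ - 14; }

  // Bytes needed for saved registers and locals, rounded up to a multiple
  // of 16. Assumption: link area and parameter space are ignored here.
  int body_size(int word_size) const {
    int raw = (saved_registers() + num_locals_) * word_size;
    return (raw + 15) & ~15;
  }

 private:
  int next_register_ = 14;  // r14 is the first non-volatile GPR
  int num_locals_ = 0;
};
```

Since the frame object knows exactly which registers and locals were requested, a prolog can save precisely those registers and size the frame once, which is the work that otherwise gets done by hand.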

Fun function of the week:

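#include <stdio.h>   // FILE, fopen, fprintf, snprintf, BUFSIZ
#include <stdlib.h>  // system
#include <unistd.h>  // getpid, unlink

// Disassemble the bytes in [start, end) by writing them out as a C array,
// compiling that with gcc and running objdump over the .data section.
// address and fatal() are expected to come from the surrounding code.
void disassemble(address start, address end)
{
  const char *fmt = "/tmp/aztec-%d.%c";
  char c_file[BUFSIZ], o_file[BUFSIZ];
  snprintf(c_file, sizeof(c_file), fmt, (int) getpid(), 'c');
  snprintf(o_file, sizeof(o_file), fmt, (int) getpid(), 'o');

  // Write the bytes out as an initialised array
  FILE *fp = fopen(c_file, "w");
  if (fp == NULL)
    fatal("%s:%d: can't write file", __FILE__, __LINE__);

  fputs("unsigned char start[] = {", fp);
  for (address a = start; a < end; a++) {
    if (a != start)
      fputc(',', fp);
    fprintf(fp, "0x%02x", *a);
  }
  fputs("};\n", fp);
  fclose(fp);

  // Compile the array into an object file
  char cmd[BUFSIZ];
  snprintf(cmd, sizeof(cmd), "gcc -c %s -o %s", c_file, o_file);
  if (system(cmd) != 0)
    fatal("%s:%d: can't compile file", __FILE__, __LINE__);

  // Disassemble the data section, keeping only the instruction lines
  putchar('\n');
  snprintf(cmd, sizeof(cmd), "objdump -D -j .data %s | grep '^....:'", o_file);
  if (system(cmd) != 0)
    fatal("%s:%d: can't disassemble file", __FILE__, __LINE__);
  putchar('\n');

  // Remove the temporary files
  unlink(c_file);
  unlink(o_file);
}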