I finished converting everything to 64-bit. It was way easier than I expected and I’m very glad I took the time to do it because what I thought were vast and insurmountable differences turned out to be pretty minuscule. I managed to tuck everything away in macros: the ABI differences go in prolog and epilog, and in enter (née function_entry_point) and call, and the register-size differences go in much simpler ones like load and store, which just map to lwz/ld and stw/std respectively. Hiding the ugly stuff in the assembler keeps the generators happily free of conditionals.
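
To make the macro idea concrete, here is a toy model in Python. It is only a sketch, not the port's real assembler: the `MacroAssembler` class, its methods, and `emit_copy_slot` are hypothetical names invented for illustration. It shows how `load` and `store` can choose lwz/ld and stw/std from the word size, so the code that calls them never branches on 32- versus 64-bit.

```python
class MacroAssembler:
    """Toy assembler that emits PowerPC mnemonics as text (hypothetical API)."""

    def __init__(self, bits):
        if bits not in (32, 64):
            raise ValueError("bits must be 32 or 64")
        self.bits = bits
        self.wordsize = bits // 8
        self.lines = []

    def load(self, dst, offset, base):
        # Word-sized load: lwz on 32-bit, ld on 64-bit.
        op = "lwz" if self.bits == 32 else "ld"
        self.lines.append(f"{op} {dst}, {offset}({base})")

    def store(self, src, offset, base):
        # Word-sized store: stw on 32-bit, std on 64-bit.
        op = "stw" if self.bits == 32 else "std"
        self.lines.append(f"{op} {src}, {offset}({base})")


def emit_copy_slot(masm, slot):
    """A 'generator' with no 32/64-bit conditionals: copy one stack slot
    into the next one, going through r3."""
    masm.load("r3", slot * masm.wordsize, "r1")
    masm.store("r3", (slot + 1) * masm.wordsize, "r1")
    return masm.lines
```

Calling `emit_copy_slot(MacroAssembler(32), 2)` and `emit_copy_slot(MacroAssembler(64), 2)` runs the same generator code and differs only in the mnemonics and offsets that come out.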

My next job is designing the interpreter calling convention. There’s no real reason for the interpreter to follow the platform’s ABI, and the fact that Java is essentially stack-based and PPC is essentially register-based is a very good reason not to. So I’m trying to figure out how to arrange the stack.

The general layout of stack frames is the same under both 32- and 64-bit ABIs:

    | ...                  |  high addresses
+-> | Link area            |
|   +----------------------+
|   | Register save area   |
|   | Local variable space |
|   | Parameter list space |
+---+ Link area            |  low addresses
    +----------------------+

The stack grows downwards, and the stack pointer, r1, points to the first word of the link area, the lowest address, such that all accesses into the stack are relative to r1 with a positive offset. The ABIs are pretty relaxed about what happens in the stack, but one thing they’re firm about is that 0(r1) points to the previous frame — essentially it’s where you save your caller’s r1. This is slightly irritating, because I think the interpreter would like an open-ended stack, but the requirement to maintain a valid link area at the very top of the stack would seem to preclude this. Aside from anything else, if r1 isn’t pointing to a valid link area then gdb cannot unwind the stack and produce backtraces. I discovered this empirically a while back ;)
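
The reason a debugger needs a valid link area can be sketched with a small Python model of the back chain. This is an illustration under simplifying assumptions, not how gdb is implemented: memory is a dictionary from address to word, the addresses are made up, and `walk_back_chain` is a hypothetical helper.

```python
def walk_back_chain(memory, sp):
    """Return the frame addresses reachable from the stack pointer `sp`.

    `memory` maps an address to the word stored there. The word at a frame's
    lowest address (0(r1) for the top frame) is its back chain, that is, the
    address of the caller's frame. A back chain of 0 ends the list.
    """
    frames = []
    while sp != 0:
        frames.append(sp)
        caller = memory[sp]  # KeyError here means sp is not a valid link area
        if caller != 0 and caller <= sp:
            # The stack grows downwards, so callers live at higher addresses.
            raise ValueError("back chain must point to a higher address")
        sp = caller
    return frames


# Three frames on a downward-growing stack; the addresses are invented.
memory = {0x1000: 0, 0x0F80: 0x1000, 0x0F00: 0x0F80}
frames = walk_back_chain(memory, 0x0F00)
```

If r1 were moved to an arbitrary address inside a frame, the first lookup would read something that is not a back chain word, which is the model's version of a broken backtrace.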

My thinking at the moment is to leave r1 alone, and to use another register (r31 maybe) as the interpreter’s stack pointer. That way the interpreter can extend the stack however it likes, without thought to link areas and alignment, on the assumption that if it ever jumps out into C-land then it must first create a valid stack frame around its own data to protect it. Specifically, this will be a frame with no register save area, meaning the interpreter’s stuff falls neatly into the local variable space.

I’m not sure how stack walking will work under this scenario. It may be that it’s better to do this stack-shuffling every time the interpreter calls a new method, such that each method call, be it Java or C, has its own valid ABI stack frame. This will undoubtedly resolve itself as I progress.

9 thoughts on this post


  1. Ok since you are using the c++ based interpreter the frame you describe here:

        | ...                  |  high addresses
    +-> | Link area            |
    |   +----------------------+
    |   | Register save area   |
    |   | Local variable space |
    |   | Parameter list space |
    +---+ Link area            |  low addresses
        +----------------------+
    
    is the province of the frame manager. It should in fact create a frame that
    looks quite like this. What you will find strange is that the frame manager
    (or the template interpreter) will need to extend the frame of the caller
    in order to make the Java locals contiguous. So let’s pretend that you are in
    the call_stub and you have just created the arguments that you are passing to
    the method about to be invoked. The stack will look something like this:
    
        | ...                  |  high addresses
    +-> | Link area            |
    |   +----------------------+
    |   | Register save area   |
    |   | call_stub temps      |
    |   | Local[0]             |  <-- &Local[0]
    |   | ...                  |
    |   | Local[n-1]           |
    +---+ Link area            | <-- SP(1)
        +----------------------+
    
    Where the Java method being called takes n arguments but has m locals (m >= n).
    The calling convention is typically such that a register is used to pass the value
    of &Local[0] and another register is used to contain the methodOop. Because of adapters
    another register is also used to point to the bottom of the stack (r1 here) before any
    adapter got into the mix. The registers that are used are all up to your decision.
    
    Now we have arrived in the frame manager code and it has created a new frame to allow
    the actual c++ based interpreter to execute. What does the stack look like now?
    
    
        | ...                  |  high addresses
    +-> | Link area            |
    |   +----------------------+
    |   | Register save area   |
    |   | call_stub temps      |
    |   | Local[0]             |  <-- &Local[0]
    |   | ...                  |
    |   | Local[n-1]           |
    |   | Local[n]             | <-- SP(1)
    |   | ...                  |
    |   | Local[m-1]           |
    +---+ Link area            | <-- SP(2)
    |   +----------------------+
    |   | Register save area   |
    |   | "interpreter         |
    |   |  state               |
    |   |  object"             | <- this address is passed to c++ interpreter
    |   | Monitor area         | <- monitor base
    |   | "Java                | <- stack base
    |   |  expression          |
    |   |  stack"              |
    |   | parameter area       |
    +---+ Link area            |  <- SP(3)
        +----------------------+
    

    Since the ppc ABI passes arguments in registers, the parameter area in this new frame is whatever needs to exist to handle register store-down for varargs. The monitor area is typically empty unless the method is synchronized, in which case a single monitor entry is created. The Java expression stack is the full-sized expression stack for this method (max_stack). The frame manager code fills the fields of the "interpreter state object" with things like the methodOop, where the locals and monitors are, the expression stack limits and top, etc. It also sets a message of "call_method". It then calls BytecodeInterpreter::run() with the pointer to this object. All of the interpretation happens there.

  2. “What you will find strange is that the frame manager (or the template interpreter) will need to extend the frame of the caller in order to make the Java locals contiguous.”

    This is one thing I do find strange, as I can’t see why you would not simply allocate space for all locals in one go. Certainly call_stub can read max_locals from the methodOop. I wondered if somewhere in the interpreter there would be cases where a method is entered without having access to its methodOop, but that sounds pretty implausible.

  3. You can’t get rid of stack extension even if you did this because of adapters. (Unless you went back to adapters with frames, and we don’t want to go there.) When you do an invoke from either compiled or interpreted code you don’t know if the callee is compiled or interpreted. So the calling convention essentially has to change as you cross the boundary. The interpreter always assumes it is calling interpreted code and the compiler assumes compiled. In the case where this is wrong you will need stack space somewhere (for certain if c2i). So this means the caller’s frame is extended or, in the bad old days of framed adapters, that an intermediate frame gets created.

    So you could make a version of the interpreter that, as part of the call, extended its own stack to account for the “extra” locals. Then if you ended up calling compiled you’d have wasted that space. It’s not a large amount of space for sure but I don’t see the real benefit of doing it before you call. In either case extension will happen on either the front side or back side of the call.

  4. Well I guess at the moment I won’t need adapters because I don’t have a compiler :) But it’s good to know…

    My concern is that extending the stack frame requires either moving the locals or storing them in reverse order. The former adds a chunk of overhead to every (interpreted) method call, and the latter would add overhead (negation) to every variable access. Both seem pretty bad. BTW I notice you drew your stack frames with reversed locals — was that intentional?

    Of course, if I do end up extending the stack for the extra locals then I may as well extend it far enough to fit the interpreter state and monitors and expression stack too. It’s not such an issue on 32-bit but ppc64 has a lot of overhead in its frames (14 slots, so 112 bytes) and it’d be nice to save that.

  5. Actually, I’m not even sure it’s possible to extend a stack frame on PPC without violating the ABI. The stack is arranged such that the stack pointer points to the first word of the top frame — the “back chain” word — which points to the back chain word of the next frame, and so on until the last frame which has a back chain of NULL. When creating a new frame you’re required to update the stack pointer and the back chain word atomically so that the stack pointer is always pointing to the beginning of a linked list of frames. There are special instructions for it, basically “store the contents of register X at address Y then store address Y in register X”, but I can’t see anything that could be used to atomically extend a frame as would be required. And a non-atomic extension risks having a signal turn up and trash everything.

    I may end up having to make the caller allocate space for the parameters and have the callee allocate space for all locals and copy the parameters into it. You’d waste the parameter space for interpreted callees, but I’m not sure there’s a way around it.

  6. Uh, the locals are stored in reverse order. Since it is a stack, the interpreter would push local[0] first, then local[1], and so on. So when you extend the stack you don’t do anything but allocate space. That is of course why I drew the stack with the locals reversed. The numbers refer to the Java local number. If you are actually doing array accesses then the number is negated. Take a look at the code in bytecodeInterpreter and you’ll see that the index is negated.

  7. I don’t have a copy of either the ABI or a ppc manual in front of me but it must be possible since that is how the other ppc ports work even though they have adapter frames and I know of no problem with signal handlers.

  8. I figured it out. Say the top of the stack is frame N, such that frame N’s back chain word points at frame N-1. First you create a new frame in the normal way, N+1. Then you change N+1’s back chain to point at N-1, effectively turning N and N+1 into the new frame N. It’ll be slightly more complex on ppc64 since the minimum frame is larger than the smallest amount you might want to extend it by, but the principle is the same.
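
The two-step extension described in comment 8 can be sketched as a toy Python model. Everything here is an assumption made for illustration: the `Stack` class and its method names are hypothetical, memory is a dictionary, `push_frame` stands in for the single store-with-update instruction (stwu on 32-bit, stdu on 64-bit) that comment 5 paraphrases, and the ppc64 minimum frame size is ignored.

```python
class Stack:
    """Toy model of r1 and the back chain words it leads to."""

    def __init__(self, top):
        self.mem = {top: 0}  # outermost frame has a back chain of NULL
        self.r1 = top

    def push_frame(self, size):
        # Models "stwu r1, -size(r1)": the old r1 is stored at the new
        # address and r1 is updated, all in one instruction.
        new_sp = self.r1 - size
        self.mem[new_sp] = self.r1
        self.r1 = new_sp

    def extend_top_frame(self, extra):
        """Grow the top frame N by `extra` bytes."""
        old_sp = self.r1
        self.push_frame(extra)                # step 1: ordinary new frame N+1
        self.mem[self.r1] = self.mem[old_sp]  # step 2: point N+1 at N-1

    def chain(self):
        """Frame addresses found by following back chains from r1."""
        frames, sp = [], self.r1
        while sp != 0:
            frames.append(sp)
            sp = self.mem[sp]
        return frames


stack = Stack(0x1000)         # frame N-1 (addresses are invented)
stack.push_frame(0x80)        # frame N
stack.extend_top_frame(0x20)  # N and N+1 merge into a larger frame N
```

The point of the ordering is that r1 heads a well-formed list after step 1 (three frames) and again after step 2 (two frames), so in this model a signal arriving between the two steps still sees a valid chain.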

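The reversed-locals layout from comment 6 comes down to a negated index. The Python sketch below illustrates that arithmetic only; `local_address` and `extra_local_bytes` are hypothetical helpers, not functions from bytecodeInterpreter, and link areas are left out of the model.

```python
def local_address(local0_addr, index, wordsize):
    """Address of Java local `index`, given &Local[0].

    Local[0] is pushed first, so it sits at the highest address and each
    later local is one word lower: the index is negated.
    """
    return local0_addr - index * wordsize


def extra_local_bytes(n_args, max_locals, wordsize):
    """Bytes the callee must allocate below the n arguments it was passed
    so that locals n..m-1 follow on contiguously (m = max_locals)."""
    if max_locals < n_args:
        raise ValueError("a method has at least as many locals as arguments")
    return (max_locals - n_args) * wordsize
```

Because `local_address(base, n, w)` is exactly one word below `local_address(base, n - 1, w)`, growing the stack downwards makes room for the extra locals without moving the arguments the caller already pushed.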