Lecture 27, Code Generation Details
the notes for CS:4980:1
We can describe the semantics of a stack machine's instruction set in terms of a stack stored in memory (the array M), and a stack-pointer register sp. Because the ARM, our target architecture, is defined as having a full descending stack, we'll use the same convention in the following description. We'll use the following notation:
sp = sp - x;
sp = sp + x;
M[--sp] = c;
M[sp] = M[M[sp]];
temp = M[sp++]; M[M[sp++]] = temp;
M[--sp] = M[sp];
temp = M[sp++]; M[sp] = M[sp] + temp;
temp = M[sp++]; M[sp] = M[sp] > temp; -- for example, could also be < or &eq;, etc
if M[sp--] -- or if not then pc = dst;
The above instruction set is sufficient to allow reference to global variables at static addresses, but for languages such as ALGOL 60 and its descendants, including C, C++ and Java, we need a way to load the addresses of local variables. The easiest way to do this is with a special register, the frame pointer fp. Within any procedure or function, this always points to the base of the stack frame (activation record) for the current context.
M[--sp] = d + fp;
There are numerous alternatives to the above. You could have an instruction to push the frame pointer itself, then push the displacement, then add, if you wanted to eliminate the combination of arithmetic with a constnt.
Because all data objects in Kestrel have a constant size that is statically determined, the stack pointer at any point in a block is at a constant displacement from the base of the activation record for that block. Consider this code.
sort: procedure( ref i: int32, ref j: int32 ) if i < j then k: var int32; k = i; i = j; j = k; end; end;
If you imagine a totally sub-optimal implementation of this code, you might consider pushing each term in an expression on the stack as you evaluate the expression. This leads to the following picture of what is on the stack at each point in the above code:
sort: procedure( ref i: int32, ref j: int32 ) -- @i @j if i -- @i @j i < j -- @i @j i j -- @i @j b then -- @i @j k: var int32; -- @i @j k k -- @i @j k @k = i -- @i @j k @k i ; -- @i @j k i -- @i @j k @i = j -- @i @j k @i j ; -- @i @j k j -- @i @j k @j = k -- @i @j k @j k ; -- @i @j k end; -- @i @j end; --- Legend i, j - the values of the actual parameters --- @i, @j - the addresses of the actual parameters --- k - the local variable --- @k - the address of the local variable --- b - a boolean resulting from comparing two integers
Note that the storage allocated in the activation record for the above routine's parameters holds references to the actual parameters, since they were declared as reference parameters. The basic activation record for the sort procedure therefore holds 2 reference parameters along with the local variable k. The rest of the variables in the activation record are anonymous temporaries that are used during the evaluation of expressions and the execution of assignment statements.
For example, to evaluate if i < j then we push the value of i on the stack, then push the value of j on the stack, before comparing them and replacing the two items with just one item reflecting the result of the comparison. Finally, we pop the result from the stack as we do a conditional branch.
The observation with which we began is simple: We do not need to use a dynamically computed stack pointer for this stack. We can use static computation, where each push is essentially an allocation operation in the current activation record, and each pop is a deallocation. Instead of using a stack pointer, we can use static offsets from the base of the activaiton record, just as we do for local variables.
Now, consider the example at the head of this section. Ignoring the details of the procedure call and return, the code generated for the example code would be as follows:
sort: procedure( ref i: int32, ref j: int32 ) if i < j then PUSHLA I -- push the address of @i LOAD -- get the value of @i LOAD -- get the value of i PUSHLA J -- push the address of @j LOAD -- get the value of @j LOAD -- get the value of j LT BFALSE 1 k: var int32; PUSHL 1 -- allocate space for k k = i; PUSHLA K -- get the address of k PUSHLA I -- get the address of @i LOAD -- get the value of @i LOAD -- get the value of i POPS i = j; PUSHLA I -- get the address of @i LOAD -- get the value of @i PUSHLA J -- get the address of @j LOAD -- get the value of @j LOAD -- get the value of j POPS j = k; PUSHLA J -- get the address of @j LOAD -- get the value of @j PUSHLA K -- get the address of k LOAD -- get the value of k POPS end; POPL 1 -- deallocate k LABEL 1 end;
If you imagine a totally sub-optimal implementation of a stack machine, you imagine each stack machine instruction involves one or more memory cycles to access the top elements of the stack. Looking at the above, this suggests that a simple assignment such as i=1 could take 3 or more instruction cycles. Even the Burroughs 5000 computer avoided this by dedicating two registers inisde the CPU to holding the top two elements of the stack. State bits in the CPU were used to record whether the stack was entirely in memory, or whether the top word of the stack was in a register, or whether the top two words were in registers. The net result was efficient execution with few memory cycles actually required. In later machines such as the AAMP, 4 registers were used to hold up to the top 4 elements of the stack.
Section 5.1.1 of the following document describes the roles of the 16 registers in the ARM CPU as they are to be used on entry and exit from external procedures and functions.
-- Procedure Call Standard for the ARM Architecture
During function calls, registers 0 to 3 are reserved for parameter passing and argument return. Registers 4 through 11 (but sometimes not register 9) are avaiable for general use. Local calls can use other conventions, but in the early stages of developing a compiler, it is probably better to use the same calling conventions for local and external calls.
How do you translate stack machine code for effective execution on such a computer? The obvious approach is to use the ARM SP register, register 13, as the stack pointer and keep the entire contents of the stack in memory. This would make very inefficient use of registers.
When the Burroughs Corporation developed the B5000 computer in the early 1960s, they had a better idea. Use some registers inside the CPU as a stack-top-cache. The B5000 used only 2 registers inside the CPU to hold zero, one or two of the top elements on the stack, but later Burroughs had 4 registers to serve this purpose, and machines like the Collins AAMP typically set aside 8 registers inside the CPU to cache the stack top. The code generator inside a compiler for a machine like the ARM can easily do the same thing.