Consider the following C code to print an integer as a decimal number:
Calling dspdec(n,w) has the same effect as the Pascal call write(n:w) or the C call fprintf("%*d", w, n); all of these print n, right justified in a field of width w, using more characters than w if n is too large. The names dspch and dspdec are used here because these are the names of the routines in the Hawk monitor.void dspdec( n, w ) unsigned int n; /* the number to print */ int w; /* the field width to use */ { unsigned int fw = w - 1; unsigned int rm = n % 10; if ((n / 10) != 0) { /* there are more significant digits, so print them first */ dspdec( n / 10, fw ); } else { /* there are no more significant digits, so print any leading blanks first */ for (;;) { fw = fw - 1; if (fw < 0) exit; /* from for loop */ dspch(' '); } } /* leading blanks and preceding digits are printed, so print this digit */ dspch('0' + rm); }
The above routine is recursive, and it is instructive to trace its behavior through a call to print a small number, say 123 (decimal) in a field of width 5:
The problem is, how can this be translated to assembly language? The central issue is where should the local variables be put. Because the routine is recursive, the answer is that storage must be allocated dynamically, at run-time, for the storage of these variables.
- dspdec(123, 5) called;
- fw = 4, rm = 3.
- dspdec(12, 4) called;
- fw = 3, rm = 2.
- dspdec(1, 3) called;
- fw = 2, rm = 1.
- using fw to count spaces, dspch(' ') is called twice.
- dspch('0' + rm) is called, output is '1'
- return
- dspch('0' + rm) is called, output is '2'
- return
- dspch('0' + rm) is called, output is '3'
- return
To allow for recursion, local variables are typically allocated on a stack (there are alternatives to use of stacks). The routines in the Hawk monitor use R2 for a stack pointer and R1 for linkage, and user code should use the same conventions (but nothing prevents users from using other conventions).
By entirely arbitrary convention, at the time control enters a procedure, R2 points to the lowest unused address on the stack, so m[R2] is available for use by the procedure, as is m[R2+1], but lower addresses such as m[R2-1] are unavailable. Pushing on the stack involves incrementing R2, while popping from the stack involves decrementing R2. A surprising number of machines use the opposite convention!
The following calling sequence is used for calling procedures on the Hawk machine to allow for recursion:
If PROC is not in the same source file as the call, so that the linker is involved, a slightly more complex form must be used:ADDI R2,AR JSR R1,PROC ADDI R2,-AR
This is needed because the SMAL linker cannot compute PC relative addresses of external procedures. Instead, it must fill in the pointer to the procedure in an absolute word, defined elsewhere in the same source file as follows:ADDI R2,AR LOAD R1,PPROC JSRS R1,R1 ADDI R2,-AR
In both of these calling sequences, the value AR is used to indicate the size of the area on the stack used by the current procedure. This area is called the activation record of the procedure, and the identifier AR (or identifiers ending with AR in the case of multiple coexisting procedures) is used for the size of the procedure's activation record.EXT PROC PPROC: W PROC
On entry to a potentially recursive procedure, the following receiving sequence is suggested, shown here decorated with typical comments and allowing for linkage to that procedure from other procedures:
To exit from the procedure, the following return sequence can be used:;------------------------ INT PROC ; AR format, indexed using R2: AR = nnnn PROC: ; a comment about what it does ; what registers it uses STORES R1,R2 ; save return address in AR
By convention, the first word in each activation record is reserved for saving and restoring the return address, and the size of the activation record is always a multiple of 4, so that the least significant two bits of R2 are always zero.LOADS R1,R2 ; recover return address from AR JUMPS R1 ; return
Note that this is only one possible activation record structure! Nothing about the Hawk architecture demands this structure, and there are many alternatives. Some machines have procedure calling instructions that dictate a particular calling sequence, sometimes quite complex, but the Hawk machine does not.
The complete details of the calling sequence for a typical procedure must also include where parameters and results are passed. On machines such as the Hawk machine, it is common to pass parameters in registers. Our convention is to pass the first parameter in R3, the second in R4, and so on. Similarly, the first result should be returned in R3, the second result in R4, and so on.
We can now begin translation of the DSPDEC procedure to assembly language. Our first version will use the above calling sequence exactly, so we can code the entry to the routine as follows:
Here, we have fleshed out the receiving sequence with details specific to the problem of coding the dspdec routine. The local variables fw and rm have been allocated space in the activation record, as fields DSPDFW and DSPDRM, at offsets 4 and 8 in the activation record. So long as R2 points to the start of the activation record, indexed addressing (with the LOAD, LOADCC, STORE and LEA instructions) may be used to address fields of the activation record. The C code given above both allocated and initialized fw and ar with the following two lines:;------------------------ INT DSPDEC ; AR format, indexed using R2: ; 0 ; return address DSPDFW = 4 ; field width DSPDRM = 8 ; remainder after division by 10 DSPDAR = 12 ; total size DSPDEC: ; output R3 in decimal ; uses R4 as field width ; wipes out ???? STORES R1,R2 ; save ret addr in ar
The corresponding assembly code, for initialization of these fields of the AR, is as follows:unsigned int fw = w - 1; unsigned int rm = n % 10;
Note the use of our standard calling sequence to call the DIVIDE routine. This routine divides the number in R3 by the number in R4, returning the quotient in R3 and the remainder in R4, and obliterating R5 and R6 in the process. As a result, we have computed both the remainder we needed for the initialization of the local variable rm, and the quotient needed by the next few lines of C code:ADDSI R4,-1 ; decrement the field width STORE R4,R2,DSPDFW LIS R4,10 ADDI R2,DSPDAR JSR R1,DIVIDE; divide number by 10 (wipe out R5-6) ADDI R2,-DSPDAR STORE R4,R2,DSPDRM
These convert to assembly language as follows:if ((n / 10) != 0) { /* there are more significant digits, so print them first */ dspdec( n / 10, fw );
The above code takes advantage of the fact that the DIVIDE routine left the quotient right where it was needed for passing to the recursive call to DSPDEC, so only one instruction was needed to put the parameter into place. The next segment of C code is:TESTR R3 ; check the quotient BZS DSPDBL ; if zero, go output blanks LOAD R4,R2,DSPDFW ADDI R2,DSPDAR JSR R1,DSPDEC ; recursive call ADDI R2,-DSPDAR BR DSPDQT ; go off to the endif
This is translated to assembly code as follows:} else { /* there are no more significant digits, so print any leading blanks first */ for (;;) { fw = fw - 1; if (fw < 0) exit; /* from for loop */ dspch(' '); }
The final segment of the procedure, in C, was:DSPDBL: ; setup to output DSPDFW blanks LOAD R6,R2,DSPDFW LIS R3,' ' DSPDLP: ADDSI R6,-1 ; decrement counter BLT DSPDQT ; quit if nothing left ADDI R2,DSPDAR JSR R1,DSPCH; output blank (wipe out R4-5) ADDI R2,-DSPDAR BR DSPDLP
This translates to the following assembly code:} /* leading blanks and preceding digits are printed, so print this digit */ dspch('0' + rm); }
DSPDQT: ; setup to output the digit LOAD R6,R2,DSPDRM ADDI R3,'0' ; convert remainder to ASCII ADDI R2,DSPDAR JSR R1,DSPCH; output it (wipe out R4-5) ADDI R2,-DSPDAR LOADS R1,R2 ; recover ret addr from ar JUMPS R1 ; return
If producing working code in a hurry is your goal, the above code is appropriate, but when performance is an issue, some additional work is called for.
The calling sequence we have used requires 3 instructions, plus a 1 instruction entry sequence and a 2 instruction exit sequence. This seems excessive! On machines designed according to the CISC (complex instruction set computer) philosophy, the conventional solution to this excess is to create special calling instructions that combine these functions, for example, imagine having a call instruction such as:
This would increment R2 by size, save the return address in M[R2] and transfer control to dst. As a result, the entry sequence is reduced to no instructions, but we need a new return instruction to undo things:CALL size,dst
This would transfer control to the return address found in M[R2], but first, it would look in M[R2]-4 for the size field of the CALL instruction in order to decrement R2. While programming with these "improved" procedure calling and return instructions would be pleasant, they are not pleasant instructions to implement, and their implementation would gravely complicate the problem of making the CPU run quickly.RETURN
As a result, processord designed according to the RISC (reduced instruction set complexity) philosophy usually omit such complicated but easy to use instructions and rely on the programmer, or more usually, on the compiler, to leave out those parts of the calling sequence that are not needed in some circumstance.
For example, in our receiving and return sequences:
The code to save and restore the return address is unnecessary if the procedure calls no other procedures. A smart programmer, or a smart compiler, after tentatively including the STORES and LOADS instructions that do this saving and restoring, can look at the body of the procedure and notice that it contains no calls, then remove these instructions.PROC: STORES R1,R2 ; save ret addr in ar ; some code LOADS R1,R2 ; recover ret addr from ar JUMPS R1 ; return
Having removed them, the size of the activation record for that procedure or function can be reduced by one. In the case of many procedures and functions, particularly those that call no other procedures, the size of the activation record can be reduced to zero because all the local variables needed by the procedure fit in registers.
The above optimization produces a class of procedures and functions that have no activation records and therefore ignore R2 when they are called. When calling such a procedure or function, the calling sequence:
Can be simplified to just:ADDI R2,AR JSR R1,PROC ADDI R2,-AR
Another optimization can be done in the context of consecutive calls. Consider this example:JSR R1,PROC
Here, two procedures were called in quick sequence, and clearly, the intermediate adjustment to the activation record pointer, R2, are redundant and the code can be collapsed to:ADDI R2,AR JSR R1,PROCA ADDI R2,-AR ADDI R2,AR JSR R1,PROCB ADDI R2,-AR
This optimization is clearly possible even when there are other instructions between the two procedure calls, but it would seem more difficult when the intervening instrucitons reference fields of the activation record, as in this example:ADDI R2,AR JSR R1,PROCA JSR R1,PROCB ADDI R2,-AR
(Here, the first procedure returned a value that was stored in one local variable before loading a parameter to the second procedure from a second local variable.) It turns out that the optimization is still possible:ADDI R2,AR JSR R1,PROCA ADDI R2,-AR STORES R3,R2,LOCALA LOADS R3,R2,LOCALB ADDI R2,AR JSR R1,PROCB ADDI R2,-AR
Here, we have substituted assembly-time arithmetic on the offsets used for the intermediate load and store instructions for the run-time arithmetic of the ADDI instructions.ADDI R2,AR JSR R1,PROCA STORES R3,R2,LOCALA-AR LOADS R3,R2,LOCALB-AR JSR R1,PROCB ADDI R2,-AR
Finally, there is an optimization that is particularly useful in the context of recursive procedures, and that is to collapse activation records so that, at the time of the recursive call, only those fields of the activation record that actually contain useful information are saved. In the case of the example decimal print routine, note, for example, that the field width stored in the DSPDFW field of the activation record is not needed after any recursive call.
This optimization is done by organizing the fields of the activiation record so that fields that aren't needed during a recursive call are stored last. In the case of our decimal print routine, we would put the DSPDRM field first, since it is needed during the recursive call, and the DSPDFW field last. The result is a routine that needs 1/3 less stack space when it is called.
Here is the result of applying a number of the above optimizations to the decimal print routine:
;------------------------ INT DSPDEC ; AR format, indexed using R2: ; 0 ; return address DSPDRM = 4 ; remainder after division by 10 DSPDFW = 8 ; field width DSPDAR = 12 ; total size DSPDEC: ; output R3 in decimal ; uses R4 as field width ; wipes out 4-6 STORES R1,R2 ; save ret addr in ar ADDSI R4,-1 ; decrement the field width STORE R4,R2,DSPDFW LIS R4,10 JSR R1,DIVIDE; divide number by 10 (wipe out R5-6) STORE R4,R2,DSPDRM TESTR R3 ; check the quotient BZS DSPDBL ; if zero, go output blanks LOAD R4,R2,DSPDFW ADDI R2,DSPDFW JSR R1,DSPDEC ; recursive call ADDI R2,-DSPDFW DSPDQT: ; setup to output the digit LOAD R3,R2,DSPDRM ADDI R3,'0' ; convert remainder to ASCII JSR R1,DSPCH; output it (wipe out R4-5) LOADS R1,R2 ; recover ret addr from ar JUMPS R1 ; return DSPDBL: ; setup to output DSPDFW blanks LOAD R6,R2,DSPDFW LIS R3,' ' DSPDLP: ADDSI R6,-1 ; decrement counter BLT DSPDQT ; quit if nothing left JSR R1,DSPCH; output blank (wipe out R4-5) BR DSPDLP
In the worst case on the Hawk machine, if you know nothing of the routine you are calling other than the bare fact that it uses R1 for linkage and R2 as an activation record pointer, you must assume that it destroys the contents of all other registers. In this case, you will be forced to allocate space in your activation record for all values you need saved during the call, and you will have to augment your calling sequence with code to save the contents of any registers you care about before the call.
When a compiler generates code to call a procedure that is defined outside the current source file, perhaps in a file that has not even yet been compiled, it must make pesimistic assumptions about that procedure. As a result, calls to such procedures will, by necessity, be expensive, involving maximal use of the activation record, poor use of registers, and the full glory of our calling sequence.
In comparison, when compilers generate code to procedures defined within the same compilation unit, they can frequently apply many of the optimizations we have discussed, making effective use of registers and considerably speeding the procedure call.
As a result, modern optimizing compilers frequently generate far better code for calls to local procedures than to separately compiled code. The exception is in languages like Ada, where there are strict rules about order of compilation. In such a language, when a procedure is compiled, the compiler generates not only an object file but a file of instructions about how to compile calls to that procedure. When a call to an external procedure is encounterd by the compiler, it reads that file of instructions in order to determine how the call is to be made. Unfortunately, older programmingl languages like C and Fortran are rarely implemented to take advantage of such opportunities for effective optimization of external procedure calls.