22C:18, Lecture 16, Fall 1996

Douglas W. Jones
University of Iowa Department of Computer Science

Recursion

Consider the following C code to print an integer as a decimal number:

void dspdec( n, w )
unsigned int n; /* the number to print */
int w; 		/* the field width to use */
{
	unsigned int fw = w - 1;
	unsigned int rm = n % 10;
	if ((n / 10) != 0) {
		/* there are more significant digits,
		   so print them first */
		dspdec( n / 10, fw );
	} else {
		/* there are no more significant digits,
		   so print any leading blanks first */
		for (;;) {
			fw = fw - 1;
			if (fw < 0) exit; /* from for loop */
			dspch(' ');
		}
	}
	/* leading blanks and preceding digits are printed,
	   so print this digit */
	dspch('0' + rm);
}
Calling dspdec(n,w) has the same effect as the Pascal call write(n:w) or the C call fprintf("%*d", w, n); all of these print n, right justified in a field of width w, using more characters than w if n is too large. The names dspch and dspdec are used here because these are the names of the routines in the Hawk monitor.

The above routine is recursive, and it is instructive to trace its behavior through a call to print a small number, say 123 (decimal) in a field of width 5:

dspdec(123, 5) called;
fw = 4, rm = 3.
dspdec(12, 4) called;
fw = 3, rm = 2.
dspdec(1, 3) called;
fw = 2, rm = 1.
using fw to count spaces, dspch(' ') is called twice.
dspch('0' + rm) is called, output is '1'
return
dspch('0' + rm) is called, output is '2'
return
dspch('0' + rm) is called, output is '3'
return
The problem is, how can this be translated to assembly language? The central issue is where should the local variables be put. Because the routine is recursive, the answer is that storage must be allocated dynamically, at run-time, for the storage of these variables.

A Recursive Calling Sequence

To allow for recursion, local variables are typically allocated on a stack (there are alternatives to use of stacks). The routines in the Hawk monitor use R2 for a stack pointer and R1 for linkage, and user code should use the same conventions (but nothing prevents users from using other conventions).

By entirely arbitrary convention, at the time control enters a procedure, R2 points to the lowest unused address on the stack, so m[R2] is available for use by the procedure, as is m[R2+1], but lower addresses such as m[R2-1] are unavailable. Pushing on the stack involves incrementing R2, while popping from the stack involves decrementing R2. A surprising number of machines use the opposite convention!

The following calling sequence is used for calling procedures on the Hawk machine to allow for recursion:

	ADDI	R2,AR
	JSR	R1,PROC
	ADDI	R2,-AR
If PROC is not in the same source file as the call, so that the linker is involved, a slightly more complex form must be used:
	ADDI	R2,AR
	LOAD	R1,PPROC
	JSRS	R1,R1
	ADDI	R2,-AR
This is needed because the SMAL linker cannot compute PC relative addresses of external procedures. Instead, it must fill in the pointer to the procedure in an absolute word, defined elsewhere in the same source file as follows:
	EXT	PROC
PPROC:	W	PROC
In both of these calling sequences, the value AR is used to indicate the size of the area on the stack used by the current procedure. This area is called the activation record of the procedure, and the identifier AR (or identifiers ending with AR in the case of multiple coexisting procedures) is used for the size of the procedure's activation record.

On entry to a potentially recursive procedure, the following receiving sequence is suggested, shown here decorated with typical comments and allowing for linkage to that procedure from other procedures:

;------------------------
	INT	PROC
;		AR format, indexed using R2:
AR	= nnnn		
PROC:			; a comment about what it does
			; what registers it uses
	STORES	R1,R2	; save return address in AR
To exit from the procedure, the following return sequence can be used:
	LOADS	R1,R2	; recover return address from AR
	JUMPS	R1	; return
By convention, the first word in each activation record is reserved for saving and restoring the return address, and the size of the activation record is always a multiple of 4, so that the least significant two bits of R2 are always zero.

Note that this is only one possible activation record structure! Nothing about the Hawk architecture demands this structure, and there are many alternatives. Some machines have procedure calling instructions that dictate a particular calling sequence, sometimes quite complex, but the Hawk machine does not.

The complete details of the calling sequence for a typical procedure must also include where parameters and results are passed. On machines such as the Hawk machine, it is common to pass parameters in registers. Our convention is to pass the first parameter in R3, the second in R4, and so on. Similarly, the first result should be returned in R3, the second result in R4, and so on.

Coding A Recursive Procedure

We can now begin translation of the DSPDEC procedure to assembly language. Our first version will use the above calling sequence exactly, so we can code the entry to the routine as follows:

;------------------------
        INT     DSPDEC
;               AR format, indexed using R2:
;         0     ; return address
DSPDFW	= 4	; field width
DSPDRM	= 8	; remainder after division by 10
DSPDAR	= 12	; total size

DSPDEC:			; output R3 in decimal
			; uses R4 as field width
			; wipes out ????
	STORES	R1,R2	; save ret addr in ar
Here, we have fleshed out the receiving sequence with details specific to the problem of coding the dspdec routine. The local variables fw and rm have been allocated space in the activation record, as fields DSPDFW and DSPDRM, at offsets 4 and 8 in the activation record. So long as R2 points to the start of the activation record, indexed addressing (with the LOAD, LOADCC, STORE and LEA instructions) may be used to address fields of the activation record. The C code given above both allocated and initialized fw and ar with the following two lines:
	unsigned int fw = w - 1;
	unsigned int rm = n % 10;
The corresponding assembly code, for initialization of these fields of the AR, is as follows:
	ADDSI	R4,-1	; decrement the field width
	STORE	R4,R2,DSPDFW
	LIS	R4,10
	ADDI	R2,DSPDAR
	JSR	R1,DIVIDE; divide number by 10 (wipe out R5-6)
	ADDI	R2,-DSPDAR
	STORE	R4,R2,DSPDRM
Note the use of our standard calling sequence to call the DIVIDE routine. This routine divides the number in R3 by the number in R4, returning the quotient in R3 and the remainder in R4, and obliterating R5 and R6 in the process. As a result, we have computed both the remainder we needed for the initialization of the local variable rm, and the quotient needed by the next few lines of C code:
	if ((n / 10) != 0) {
		/* there are more significant digits,
		   so print them first */
		dspdec( n / 10, fw );
These convert to assembly language as follows:
	TESTR	R3	; check the quotient
	BZS	DSPDBL	;   if zero, go output blanks
	LOAD	R4,R2,DSPDFW
	ADDI	R2,DSPDAR
	JSR	R1,DSPDEC ; recursive call
	ADDI	R2,-DSPDAR
	BR	DSPDQT	; go off to the endif
The above code takes advantage of the fact that the DIVIDE routine left the quotient right where it was needed for passing to the recursive call to DSPDEC, so only one instruction was needed to put the parameter into place. The next segment of C code is:
	} else {
		/* there are no more significant digits,
		   so print any leading blanks first */
		for (;;) {
			fw = fw - 1;
			if (fw < 0) exit; /* from for loop */
			dspch(' ');
		}
This is translated to assembly code as follows:
DSPDBL:			; setup to output DSPDFW blanks
	LOAD	R6,R2,DSPDFW
	LIS	R3,' '
DSPDLP:	ADDSI	R6,-1	; decrement counter
	BLT	DSPDQT	;   quit if nothing left
	ADDI	R2,DSPDAR
	JSR	R1,DSPCH; output blank (wipe out R4-5)
	ADDI	R2,-DSPDAR
	BR	DSPDLP
The final segment of the procedure, in C, was:
	}
	/* leading blanks and preceding digits are printed,
	   so print this digit */
	dspch('0' + rm);
}
This translates to the following assembly code:
DSPDQT:			; setup to output the digit
	LOAD	R6,R2,DSPDRM
	ADDI	R3,'0'	; convert remainder to ASCII
	ADDI	R2,DSPDAR
	JSR	R1,DSPCH; output it (wipe out R4-5)
	ADDI	R2,-DSPDAR
	LOADS	R1,R2	; recover ret addr from ar
	JUMPS	R1	; return

Optimization

If producing working code in a hurry is your goal, the above code is appropriate, but when performance is an issue, some additional work is called for.

The calling sequence we have used requires 3 instructions, plus a 1 instruction entry sequence and a 2 instruction exit sequence. This seems excessive! On machines designed according to the CISC (complex instruction set computer) philosophy, the conventional solution to this excess is to create special calling instructions that combine these functions, for example, imagine having a call instruction such as:

	CALL	size,dst
This would increment R2 by size, save the return address in M[R2] and transfer control to dst. As a result, the entry sequence is reduced to no instructions, but we need a new return instruction to undo things:
	RETURN
This would transfer control to the return address found in M[R2], but first, it would look in M[R2]-4 for the size field of the CALL instruction in order to decrement R2. While programming with these "improved" procedure calling and return instructions would be pleasant, they are not pleasant instructions to implement, and their implementation would gravely complicate the problem of making the CPU run quickly.

As a result, processord designed according to the RISC (reduced instruction set complexity) philosophy usually omit such complicated but easy to use instructions and rely on the programmer, or more usually, on the compiler, to leave out those parts of the calling sequence that are not needed in some circumstance.

For example, in our receiving and return sequences:

PROC:
        STORES  R1,R2   ; save ret addr in ar
	; some code
        LOADS   R1,R2   ; recover ret addr from ar
        JUMPS   R1      ; return
The code to save and restore the return address is unnecessary if the procedure calls no other procedures. A smart programmer, or a smart compiler, after tentatively including the STORES and LOADS instructions that do this saving and restoring, can look at the body of the procedure and notice that it contains no calls, then remove these instructions.

Having removed them, the size of the activation record for that procedure or function can be reduced by one. In the case of many procedures and functions, particularly those that call no other procedures, the size of the activation record can be reduced to zero because all the local variables needed by the procedure fit in registers.

The above optimization produces a class of procedures and functions that have no activation records and therefore ignore R2 when they are called. When calling such a procedure or function, the calling sequence:

	ADDI	R2,AR
	JSR	R1,PROC
	ADDI	R2,-AR
Can be simplified to just:
	JSR	R1,PROC
Another optimization can be done in the context of consecutive calls. Consider this example:
	ADDI	R2,AR
	JSR	R1,PROCA
	ADDI	R2,-AR
	ADDI	R2,AR
	JSR	R1,PROCB
	ADDI	R2,-AR
Here, two procedures were called in quick sequence, and clearly, the intermediate adjustment to the activation record pointer, R2, are redundant and the code can be collapsed to:
	ADDI	R2,AR
	JSR	R1,PROCA
	JSR	R1,PROCB
	ADDI	R2,-AR
This optimization is clearly possible even when there are other instructions between the two procedure calls, but it would seem more difficult when the intervening instrucitons reference fields of the activation record, as in this example:
        ADDI    R2,AR
        JSR     R1,PROCA        
        ADDI    R2,-AR
	STORES	R3,R2,LOCALA
	LOADS	R3,R2,LOCALB
        ADDI    R2,AR
        JSR     R1,PROCB
        ADDI    R2,-AR
(Here, the first procedure returned a value that was stored in one local variable before loading a parameter to the second procedure from a second local variable.) It turns out that the optimization is still possible:
        ADDI    R2,AR
        JSR     R1,PROCA        
	STORES	R3,R2,LOCALA-AR
	LOADS	R3,R2,LOCALB-AR
        JSR     R1,PROCB
        ADDI    R2,-AR
Here, we have substituted assembly-time arithmetic on the offsets used for the intermediate load and store instructions for the run-time arithmetic of the ADDI instructions.

Finally, there is an optimization that is particularly useful in the context of recursive procedures, and that is to collapse activation records so that, at the time of the recursive call, only those fields of the activation record that actually contain useful information are saved. In the case of the example decimal print routine, note, for example, that the field width stored in the DSPDFW field of the activation record is not needed after any recursive call.

This optimization is done by organizing the fields of the activiation record so that fields that aren't needed during a recursive call are stored last. In the case of our decimal print routine, we would put the DSPDRM field first, since it is needed during the recursive call, and the DSPDFW field last. The result is a routine that needs 1/3 less stack space when it is called.

Here is the result of applying a number of the above optimizations to the decimal print routine:

;------------------------
        INT     DSPDEC
;               AR format, indexed using R2:
;         0     ; return address
DSPDRM	= 4	; remainder after division by 10
DSPDFW	= 8	; field width
DSPDAR	= 12	; total size

DSPDEC:			; output R3 in decimal
			; uses R4 as field width
			; wipes out 4-6
	STORES	R1,R2	; save ret addr in ar
	ADDSI	R4,-1	; decrement the field width
	STORE	R4,R2,DSPDFW
	LIS	R4,10
	JSR	R1,DIVIDE; divide number by 10 (wipe out R5-6)
	STORE	R4,R2,DSPDRM
	TESTR	R3	; check the quotient
	BZS	DSPDBL	;   if zero, go output blanks
	LOAD	R4,R2,DSPDFW
	ADDI	R2,DSPDFW
	JSR	R1,DSPDEC ; recursive call
	ADDI	R2,-DSPDFW
DSPDQT:			; setup to output the digit
	LOAD	R3,R2,DSPDRM
	ADDI	R3,'0'	; convert remainder to ASCII
	JSR	R1,DSPCH; output it (wipe out R4-5)
	LOADS	R1,R2	; recover ret addr from ar
	JUMPS	R1	; return

DSPDBL:			; setup to output DSPDFW blanks
	LOAD	R6,R2,DSPDFW
	LIS	R3,' '
DSPDLP:	ADDSI	R6,-1	; decrement counter
	BLT	DSPDQT	;   quit if nothing left
	JSR	R1,DSPCH; output blank (wipe out R4-5)
	BR	DSPDLP

Internal and External Routine Efficiency

In the worst case on the Hawk machine, if you know nothing of the routine you are calling other than the bare fact that it uses R1 for linkage and R2 as an activation record pointer, you must assume that it destroys the contents of all other registers. In this case, you will be forced to allocate space in your activation record for all values you need saved during the call, and you will have to augment your calling sequence with code to save the contents of any registers you care about before the call.

When a compiler generates code to call a procedure that is defined outside the current source file, perhaps in a file that has not even yet been compiled, it must make pesimistic assumptions about that procedure. As a result, calls to such procedures will, by necessity, be expensive, involving maximal use of the activation record, poor use of registers, and the full glory of our calling sequence.

In comparison, when compilers generate code to procedures defined within the same compilation unit, they can frequently apply many of the optimizations we have discussed, making effective use of registers and considerably speeding the procedure call.

As a result, modern optimizing compilers frequently generate far better code for calls to local procedures than to separately compiled code. The exception is in languages like Ada, where there are strict rules about order of compilation. In such a language, when a procedure is compiled, the compiler generates not only an object file but a file of instructions about how to compile calls to that procedure. When a call to an external procedure is encounterd by the compiler, it reads that file of instructions in order to determine how the call is to be made. Unfortunately, older programmingl languages like C and Fortran are rarely implemented to take advantage of such opportunities for effective optimization of external procedure calls.