6. Subroutines, Recursion and Local Variables

Part of 22C:60, Computer Organization Notes
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

A Programming Problem, Binary to Decimal Conversion

The Hawk monitor provides several subroutines to output numbers but for the moment, assume that these functions are missing from the monitor and we wanted to write our own. The simplest binary to decimal conversion routine operates recursively! At each step, the number is divided into two parts, the least significant digit and the other digits. This division is done by dividing by ten! The least significant digit is the remainder, while the more significant digits are the quotient. We use recursion because we want to complete the conversion of the most significant digits before we finally print the least significant digit. Here is the code for this, written in C or C++, assuming we want to call the monitor routine dspch for each digit as we compute it:
A numeric output routine
void dspnum( unsigned int n )
/* display the integer number n */
{
        int quo = n / 10; /* the quotient -- more significant digits */
        int rem = n % 10; /* the remainder -- least significant digit */
        if (quo > 0) {
                /* output the more significant digits first */
                dspnum( quo );
        }
        /* finally, display the least significant digit */
        dspch( rem + '0' );
}

In contemplating converting this code to assembly language, we must ask several questions. Where can we store the local variables quo and rem? How do we divide by ten? How can we call one subroutine from within another, for example, the call to dspch(), and finally, how can we call that function from within itself, that is, how do we do recursion?

Note that we will not eliminate the recursion from this code, because we are far more interested in the exercise of properly translating recursion to machine language than the exercise of rewriting high-level language code to eliminate recurision.

Exercises

a) In the above code, explain why the expression rem+'0' works to convert integers from 0 to 9 into characters from '0' to '9'. This depends on a very specific property of the representations of the characters in our character set, so to answer this question, you may have to refer to the details of the ASCII character set given in Chapter 2.

b) Convert the above code to a Java, C# or whatever familiar language you may prefer and then write a small program to test it for a variety of different positive integers.

c) Modify the above code so it takes a second parameter, the number base. As a result, calling dspnum(100,8) should output 144 because 10010 is 1448. First, think only of number bases up to 10, then change the code to handle higher number bases.

Activation Records and the Stack

In discussion of high level languages, it is common to speak of local variables in a subroutine being pushed on the stack when the subroutine is called and popped from the stack when the subroutine returns. This applies to any for of subroutine, whether you call it a procedure, a function or a method, but it ignores a key question: How is this done, and while a local variable exists on the stack, how is it referenced.

The basic operations on a stack are push and pop. Classically, these are illustrated with a stack of dishes. If you imagine each dish holding one variable, you can imagine pushing new variables on the (abstract) stack by piling the dishes that hold them onto the (physical) stack of dishes, and you can imagine popping variables from the (abstract) stack by removing the corresponding dishes from the (physical) stack of dishes. Some cafeterias have dish stacks mounted on springs so that only the top dish is visible above the level of the countertop, making the terms push and pop even more appropriate, since adding a dish to the stack pushes the stack of dishes down into its well in the countertop and when you remove a dish, the stack pops up.

In general, in discussion of local variables and subroutine calling, we say that all of the local variables in a particular subroutine are grouped together into a data structure called an activation record. On entry to a subroutine, therefore, we can think in terms of the entire activation record being pushed on the stack, as a unit, instead of thinking of the individual local variables being pushed piecemeal, and on exit, we can think in terms of the entire activation record being popped from the stack.

The terms record, object and structure are sometimes used as synonyms. Of these terms, the word structure is the most ambiguous, since it sometimes refers to a block of related variables that are treated as a group, and it sometimes refers to more complex things like linked lists or trees. The term record refers to a group of related variables that are stored together; it is widely used for such groupings of variables in main memory and in files. The term object carries the most baggage, since it refers not only to a block of related variables, but it also implies that the details of those variables are hidden from outsiders and that those variables may only be manipulated by certain methods.

In the previous chapter, we blindly allocated space for the stack and set up register 2 to hold the address of that space. A variable or register that holds the address of a data structure is called a pointer, so here, we refer to R2 as the stack pointer. Several monitor routines used the stack, but we did not look at how it was used, except that we assumed that whatever use was made of the stack on entry to a function was undone before exit. Now, it is time to look at how we actually use the stack.

Just as we have a choice, when numbering the bits in a word, to number from the left or from the right, we have a choice, when building a stack, to have the stack grow from low memory to high memory or the reverse. Thus, pushing can involve moving towards higher memory addresses, or pushing can involve moving toward low memory addresses. This choice is completely unconstrained, and indeed, different computer architectures and different compiler writers have made this choice randomly. For example, on the Intel 80x86 and Pentium family, the stack grows downward from higher addresses to lower, while on the IBM RS6000 and Power PC family, it grows up from low addresses to high. Both schemes work equally well. The convention we will adopt on the Hawk machine for using the stack pointer is that the stack of activation records grows up.

We have a second choice. On entry to a subroutine, should we assume that the stack pointer points to, that is, holds the address of the topmost used location on the stack, or should we assume that it points to the bottommost free location just above the stack top? Again, both conventions can be made to work! In this case, however, the initialization we have already used determines the answer. In the code presented in the previous chapter, the stack pointer always pointed to the first word of an unused block of memory at the time a call was made to a subroutine.

We did not commit ourselves to where the stack pointer points between calls. Once the called subroutine is running, it can interpret this as a pointer to the first word of its activation record, deferring the increment of the stack pointer until just before the next call, or it could increment the stack pointer as it enlarges its activation record. We will do the former. Putting these decisions together, our stack looks like this on entry to a function:

The Hawk stack on entry to a subroutine
low memory
 
activation records of
of subroutines waiting
to be returned to
 
                  - stack pointer points here -                  
 
free space available
for activation record
of called subroutine
 
high memory

Within the body of a subroutine, we will assume that the stack pointer always holds the address of, or points to, the first word of the activation record. By convention, the first word of the activation record will generally hold the return address, but we only need to store the return address if this subroutine calls other subroutine. A good compiler for a machine like the Hawk will not make unnecessary use of memory, so the return address will only be stored when one subroutine calls another.

Beyond the return address, the remainder of the activation record is available to hold local variables. Of course, again, if a subroutine does not call other subroutines, we can store the first few local variables in registers, but for subroutines that call other subroutines, and for those with many local variables, we will need to use the space available on the stack.

Access to Local Variables, Load and Store

How do we access the variables in the activation record? The answer to this lies in the general form of an instruction we have already seen in a special case, the LOAD instruction, and its close relative, the STORE instruction. The version of the former we looked at in the previous chapter used program-counter-relative addressing, but in fact, this instruciton allows us to construct a memory address from the contents of any of the general purpose registers within the Hawk processor:
The Hawk LOAD instruction
15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00
1 1 1 1 dst 0 1 0 1 x
const16

Formally, this does r[dst]=word_memory[r[x]+sxt(15,const16)]

The version of LOAD discussed in the previous chapter had the x field set to zero, which implied the use of the program counter. When the x field is set to a nonzero value, it specifies one of the 15 general purpose registers, and it is the contents of that register that are added to the second halfword of the instruction after sign extension.

This approach to memory addressing is called indexed addressing, where the register R[x] is referred to as the index register and the constant from the instruction is referred to as the displacement. This form of addressing is generally useful if the index register points to the first word of a data object such as a record and the displacement gives the location of some field in the object relative to the start of that object.

Indexed addressing is generally useful, but our goal here is to use it in a very specific way. If we use the stack pointer as the index register, and if the stack pointer points to the first word of the topmost activation record on the stack, then the LOAD and STORE instructions can be used to access any word of the activation record by setting the displacement to the address of the desired local variable relative to the start of the activation record.

Exercises

d) Look up the LOADCC instruction in the Hawk manual. How does this differ from LOAD. There are two differences, one in the binary code, one in the behavior of the instruction itself; give both!

Addressing Modes

In our study of the Hawk machine, we have now seen quite a variety of addressing modes, that is, modes of access to operands of instructions. Some of these addressing modes actually address memory, while others do not; the term addressing mode is used, nonetheless, as a catch-all for all of these modes. In sum, the modes used on the Hawk are:

Immediate addressing
LIS R5,7
With immediate addressing, the operand is taken from a constant field of the instruction; in the example, the constant 7 is a 4-bit immediate constant. Immediate addressing on the Hawk uses several different sizes of immediate constants, with the size determined by the opcode.

Register addressing
MOVE R5,R7
With register addressing, the operand is taken from a register designated by a field of the instruction; in the example, both the source and destination operands are accessed using register addressing.

Short-indexed addressing
LOADS R5,R7
Short indexed addressing is sometimes called register-indirect addressing; in this case, an index register holds the address of, or a pointer to, the operand in memory. In the example, register 7 is used as an index register.

Indexed addressing
LOAD R5,R7,20
Indexed addressing was first developed in the 1950's; in this addresisng mode, the address of the operand in memory is the sum of the contents of an index register and a constant, the displacement, taken from a field of the instruction.

Program-counter-relative addressing
LOAD R5,PSTACK
Program counter relative addressing was first used in the early 1960's; in this addressing mode, the address of the operand in memory is the sum of the program counter and a constant, the displacement, taken from a field of the instruction.

With the Hawk architecture, the addressing mode is generally determined by the instruction itself, which is to say, by the opcode. Thus, for example the LIS, MOVE, LOADS and LOAD instructions are completely different, even though the only difference between them is the addressing mode used to select the source operand. Two of these instructions, though, LOAD using indexed mode and LOAD using program-counter-relative addressing are distinguished by the contents of the index register field. A value of zero indicates program-counter-relative mode, while a nonzero value indicates indexing using one of the 15 general purpose registers.

On some machines, a specific field of the instruction is used to encode the addressing mode for each operand. This idea was fully developed with the Digital Equipment Corporation PDP-11 computer in 1970, and it was imitated by the Motorola 68000 microprocessor and the later DEC VAX computer of the late 1970's and early 1980's. The VAX took this idea to the greatest extreme, with a 4-bit field used to select the index register for every operand, allowing 16 distinct addressing modes. Among the added modes were the following:

Direct addressing
A constant field in the instruction holds the address of the operand. This is the oldest addressing mode, found on computers designed as far back as the 1940s.

Indirect addressing
A constant field in the instruction holds the address of the word in memory holding the address of (or a pointer to) the operand. This addressing mode was invented at about the same time as indexed addressing, in the 1950s.

Double indexed addressing
A constant field in the instruction, the displacement, is added to the contents of two registers to compute the operand in memory. Sometimes, this mode is referred to as base-index addressing and one register is referred to as the base register while the other is referred to as the index register; when this terminology is used, the base register usually holds the address of the start of some data object while the index register and displacement are used to compute addresses within that object.

Indirect-indexed addressing
Indexed addressing is used to select the a word in memory that holds the address of the operand.

Auto-increment addressing
Auto-decrement addressing
These are like register-indirect addressing except that the register holding the address of the operand is incremented or decremented as a side effect of being used. Generally, if the increment is done before the use, the decrement is done after the use, and generally, the magnitude of the increment is determined by the operand size, in bytes.

In the case of the VAX, every one of the basic addressing modes could be converted to indirect form, so immediate-indirect addressing was used to implement absolute addressing, and register-indirect addressing was simply the indirect version of register addressing. Later studies of architectures that allowe all of the addressing modes to be used with any instruction showed that most of the addressing modes were used rarely, and when they were used, they were only used in very limited ways. Thus, on the PDP-11 and VAX, auto-decrement-indirect addressing was very rare, and auto-decrement mode was only commonly used with the stack pointer.

Exercises

e) Look at the binary codes for LOAD, LOADS, STORE and STORES and having done explain how the Hawk CPU distinguishes, trivially, between memory reference instructions that use indexed addressing and memory reference instructions that use short-indexed addressing.

f) Every hawk instruction that uses indexed addressing has a cousin that uses short indexed addressing. Furthermore, LOAD has a cousin called LEA that, instead of loading from memory, loads the memory address itself into the destination register. LOADS also has a cousin that does this, and it could have been named LEAS because it is the short-indexed cousin of LEA. What instruction is this?

g) Suppose you had an assembly language program for some other computer, and you had to translate it to Hawk asseembly language. The original code contains a load to register 5 using direct addressing to reference a memory location with the label LOC. Write equivalent Hawk code (it can be done in a few instructions).

h) Suppose you had an assembly language program for some other computer, and you had to translate it to Hawk asseembly language. The original code contains a load to register 5 using auto-increment addressing through register 3, with the increment before the memory reference and a one-word operand. Write equivalent Hawk code (it can be done in a few instructions).

Setting up an Activation Record

Looking back at the numeric output routine at the start of this chapter, we are interested in constructing an activation record that can be used to store the remainder, quotient and return address during the execution of any functions called from within the dspnum() function. This suggests the following activation record format:

An activation record for dspnum()
  0                     return address                  
4 remainder
8 quotient

A picture of a three word data object is easy to draw, but pictures provide poor documentation inside the text of a program. In that context, what we want are the symbolic names for each field of the object, where each name is defined as the offset from the start of the object to that field. Here, as with many other areas, assembly language programmers must adopt conventions to clearly document program details that high-level languages enforce with their syntax. The convention we will use for all records, structures and objects is illutrated below:
An activation record for DSPNUM
; activation record format for DSPNUM numeric output routine
RA      =       0       ; the return address
REM     =       4       ; the remainder
QUO     =       8       ; the quotient
ARSIZE  =       12      ; size of activation record in bytes

Given these definitions, and given that R2, the stack pointer, points to word zero of this activation record, we can store a value from register 4 into the remainder using STORE R4,R2,REM and load it back into register 3 using LOAD R3,R2,REM. These two instructions use the contents of the index register R2 plus the constant REM to compute their effective address.

A Useful Hawk Monitor Routine and its Calling Sequence

Our recursive numeric output routine began by dividing the number to be printed by ten in order to get the remainder and quotient. On some machines, this is done with a machine instruction, but on the Hawk, the general divide operation requires enough code that we use a monitor call:

DIVU -- unsigned divide
Parameters: Dividend = R3, Divisor = R4
Return value: Quotient = R3, Remainder = R4
Side effects: Uses R5 and R6

As you might expect, there are other arithmetic support routines in the monitor, but they are not needed now, so we will ignore them. What we will focus on here is how one subroutine can call another, first focusing on the non-recursive call from our decimal print routine to DIVU and then looking at the recursive call, which, as it turns out, is not materially different. The main programs we wrote in the previous chapter were simplified by the fact that they used no local variables and were not called, so they could largely ignore the use of the stack. Now, we must face the question of how to push the current activation record before a call and how to pop it back in place afterwards.

The principal thing to remember is this: During the execution of the body of a subroutine, the stack pointer, register 2, always points to the first word of the activation record of that subroutine. We assumed that it already pointed there on entry to the routine, and therefore, right before calling a subroutine, we must update the stack pointer to point to the first word of the new activation record, and the first thing we must do after a return to this function is undo that change. This leads to the following skeleton for a call to DIVU from within our numeric output routine:
The DIVU calling sequence
;       put the dividend in R3 and the divisor in R4
;       make sure nothing valuable is in R5 and R6

        ADDI    R2,R2,ARSIZE    ; push the current activation record
        LOAD    R1,PDIVU
        JSRS    R1,R1           ; call DIVU
        ADDI    R2,R2,-ARSIZE   ; pop the current activation record

;       get the quotient from R3 and the remainder from R4

The sequence of instructions needed to call a subroutine is called the calling sequence for that subroutine; the calling sequence given above is, in fact, the calling sequence we will use for every external subroutine on the Hawk machine. We will only make a slight change to this when we call an internal subroutine. For example, when our recursive print routine calls itself, it will use this calling sequence:

The DSPNUM calling sequence
;       put the number to be printed in R3
;       make sure R4 to R7 contain nothing valuable

        ADDI    R2,R2,ARSIZE    ; push the current activation record
        JSR     R1,DSPNUM       ; call DSPNUM
        ADDI    R2,R2,-ARSIZE   ; pop the current activation record

;       the number has been printed.

There are only two differences between this calling sequence and the calling sequence for the unsigned divide routine. First, it uses the JSR instruction to jump directly to the subroutine, using program-counter relative addressing instead of going through the more awkward process of loading the address of the subroutine into a register and then using that register. This works because the destination address DSPNUM is local.

The JSR instruction is actually the same machine instruction as the JUMP instruction, except that, where the latter has a destination register of zero, indicating that the return address should be discarded, the former has a nonzero value to indicate where the return address should be stored. As a rule, the Hawk architecture has only two interpretations for register fields that have the value zero. They either refer to the program counter, for example, in the context of index registers, or they cause the result to be discarded, as is the case here, where the JUMP instruction discards its return address.

The second difference is in the comments. For the unsigned divide routine, we know exactly what registers it uses because that code already exists, therefore, the comments about what registers the routine might use are exact. In the case of our decimal print routine, on the other hand, we have not yet written the code, so the most we can do is make a conservative statement.

Throughout all subroutines written for the Hawk machine, we will stick with the rules illustrated by this code: Parameters will alwasy be passed in registers 3 and up, and functions will always return results in registers 3 and up. Outside of the parameters and results, if the calling routine has anything of value in registers 3 to 7, it is up to the caller to save this before the call and restore it afterwards. In contrast, if the caller has anything of value in registers 8 to 15, it is the called routine's responsibility to save and restore this.

For every calling sequence, there is a matching receiving sequence, used at the entry point of the called routine to receive control from the calling routine, and there is a matching return sequence, used at the end of the called routine to return control to the caller. The first half of the calling sequence plus the receiving sequence work together to transfer parameters, transfer control, and save everything that must be saved, while the return sequence plus the second half of the calling sequence work together similarly to return to the caller and restore what must be restored. The receiving and return sequences for our numeric output routine are as follows:

The DSPNUM receiving and return sequences
;       receiving sequence assumptions
DSPNUM:         ; R3 holds the number to be printed
                ; R4-7 hold nothing of value
                ; R8-15 must be saved and restored if we use them
        STORES  R1,R2           ; save the return address

                ...             ; body of subroutine goes here

;       return sequence assertions
                ; R3-7 hold nothing of value
                ; R8-15, if used, must have been restored
        LOADS   R1,R2           ; get the return address
        JUMPS   R1              ; return

In the above code, we used STORES R1,R2 to save the return address in the activation record. We could have use STORE R1,R2,RA to do this, but because we have allocated the return address as the first field of the instruction, we can use the shorter instruction to access that field. As a general rule, the variable that is used most intensely in a subroutine should be put at the start of the activation record in order to speed all references to that variable; in our example, none of the variables are used very intensely, so this is not an issue.

The final return from the function is done with a JUMPS instruction; this is the short-indexed version of the JUMP instruction, and in fact, it is identical to the JSRS instruction with the destination field set to zero to indicate that the return address should be discarded.

This receiving sequence stores the return address in the activation record, allowing the subroutine to call other subroutines, and on exit, it restores the return address from the activation record. Had the subroutine made no other calls, or had other calls been linked through some other register than R1, there would have been no need to save and restore R1. A good compiler, when it generates machine code for a subroutine, will always make such optimizations when possible, but generally, assembly language programmers are advised to use the most general calling sequence during development, and only try such optimizations at the very end, after making very sure that everything works.

Exercises

i) Suppose someone built a new version of the Hawk computer with the JUMP, JSR and JSRS instructions eliminated, leaving only JUMPS available. The new machine still has LEA, LOAD and a full complement of load immediate operations. Give an instruction sequence that can do almost everything that the JUMP instruction does. Are there any important restrictions on this instruction sequence that you'd need to worry about when replacing JUMP instructions with this sequence?

j) Suppose someone built a new version of the Hawk computer as suggested in the previous question. Given an instruction sequence that can do almost everything that JSR does Can you formulate this sequence as a macro named JSR?

k) Suppose someone built a new version of the Hawk computer as suggested in the previous question. Given an instruction sequence that can do almost everything that JSRS does. Suppose you had to move some Hawk code to the new machine, and the code contained the instruction JSRS R1,R1; what problems does this pose if you try to substitute your instruction sequence for this?

Putting it all Together

We have now covered all of the pieces required to write a Hawk assembly language subroutine equivalent to the dspnum() subroutine given at the start of this chapter. Here is the finished product:
The DSPNUM subroutine
; activation record format for DSPNUM numeric output routine
RA      =       0       ; the return address
REM     =       4       ; the remainder
QUO     =       8       ; the quotient
ARSIZE  =       12      ; size of activation record in bytes

;       receiving sequence assumptions
DSPNUM:         ; R3 holds the integer n to be printed
                ; R4-7 hold nothing of value
                ; R8-15 must be saved and restored if we use them
        STORES  R1,R2           ; save the return address

        LIS     R4,10           ; the divisior to divide R3 by 10

        ADDI    R2,R2,ARSIZE    ; push the current activation record
        LOAD    R1,PDIVU
        JSRS    R1,R1           ; call DIVU
        ADDI    R2,R2,-ARSIZE   ; pop the current activation record

        STORE   R3,R2,QUO       ; quo = number / 10
        STORE   R4,R2,REM       ; rem = number % 10 (remainder)

        LOAD    R1,R2,QUO       ; get the quotient
        CMPI    R1,0            ; compare with zero
        BZS     ENDIF           ; if (quo > 0) do the following

        ; output the more significant digits first
        LOAD    R3,R2,QUO       ; get the quotient

        ADDI    R2,R2,ARSIZE    ; push the current activation record
        JSR     R1,DSPNUM       ; call dspnum( quo );
        ADDI    R2,R2,-ARSIZE   ; pop the current activation record

ENDIF:

        ; finally display the least significant digit, the remainder
        LOAD    R3,R2,REM       ; get the remainder
        ADDI    R3,R3,'0'       ; convert it to a character

        ADDI    R2,R2,ARSIZE    ; push the current activation record
        LOAD    R1,PDSPCH
        JSRS    R1,R1           ; call dspch( rem + '0' );
        ADDI    R2,R2,-ARSIZE   ; pop the current activation record

        ; the number has been printed.

;       return sequence assertions
                ; R3-7 hold nothing of value
                ; R8-15, if used, must have been restored
        LOADS   R1,R2           ; get the return address
        JUMPS   R1              ; return

In the above, we have avoided using global knowledge about the subroutine. The if statement, for example has been translated out of context, without any attention to the registers used by the previous statements. Similarly, after the division and after storing the remainder and quotient in the activation record, we have gone to memory to get a copy of the quotient when we needed to test to see if it was zero, even though register 3 still holds the right value.

Similarly, in the body of the if statement, we read the quotient from the activation record again before calling DSPNUM. Because we have not relied on any register other than the stack pointer to retain its value from one statement to the next, we have greatly simplified the answer to questions about the correctness of this code. Each line of the original dspnum() subroutine was translated to assembly language independently, except for the first two which were combined so that one divide operation could provide both the remainder and quotient.

Exercises

l) The comments for the DSPNUM receiving sequence and return sequence were written before the body of the the routine was written. They state permission to modify registers and obligations to preserve registers, instead of stating what registers are actually used. Give appropriate comments documenting the registers that are actually used!

m) Write a Hawk main program to test the above DSPNUM subroutine by using it to print out 1000016 in decimal. How many calls to DSPNUM does this test perform, including both the call from the main program and recursive calls? (Include the subroutine in the same source file with the main program, don't mess with separate assembly and linkage yet.)

n) Modify the Hawk DSPNUM code given above so it takes a second parameter, the number base, and prints a number in any base between 2 and 10. To keep things simple, don't bother checking for illegal number bases.

o) Write a Hawk assembly language function to print a signed integer. This should be equivalent to the following:

                void dspsgn( int n )
                /* output the integer n */
                {
                        if (n < 0) { n = -n; dspch('-'); }
                        dspnum( n );
                }

Optimization

As already pointed out, the code given in the previous section is not very good. We can tighten this code up in two ways: First, if we look at the whole program, we can find several places where we loaded a value from memory that was already in registers. Second, we can use some new machine instructions. The Hawk instruction set provides several instructions that are useful in this regard:

The Hawk LOADCC instruction
15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00
1 1 1 1 dst 0 1 0 0 x
const16

The LOADCC instruction does exactly the same data transfers as the LOAD instruction, except that it also sets the N and V condition codes to report on the value loaded. As a result, it is equivalent (for our purposes) to a load instruction followed by comparison with zero. Since we did exactly that in our example numeric output routine, we can substitute this one instruction for two instructions in the original. Note that LOADCC is not the same as comparison with regard to the V and C condition codes.

But, the if statement in the original dspnum() routine did not need to keep the value it was testing! For the purposes of our code, we only really need the condition codes to be set. As with several other Hawk instructions, we can set the destination field of the LOADCC instruction to zero, causing the result to be discarded after the condition codes have been set. The assembler knows this as the TEST instruction, and just as LOAD has a short-indexed cousin LOADS, LOADCC has a short-indexed cousin LOADSCC. From the latter, we get TESTS to test the value in memory pointed to by a register without actually loading it into a register.

But, as we already pointed out, the value we want to test is already in a register! We don't need to load it from memory at all, we only need to test the contents of that register! The Hawk provides an instruction for this as well, based on a cousin of the move instruction:

The Hawk MOVECC instruction
15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00
1 1 1 1 dst 1 1 1 0 x

The MOVECC instruction does exactly the same thing that the MOVE instruction does, except that it also sets the condition codes in order to report on the result. Thus, it is equivalent (for our purposes) to moving the data from one register to another and then comparing it with zero. If we only want to test the contents of a register and we don't need to move its value anywhere, we can set the destination field to zero; our assembler generates this code for the TESTR command.

We can always use a compare immediate instruction to compare the contents of a register with a constant, but this instruction, an add immediate that discards the result, is two halfwords long, while TESTR compares with zero using an instruction that occupies only one halfword.

If we apply all of the suggested optimizations and take their consequences into account, the code for DSPNUM can be compacted to the following:

The DSPNUM subroutine, partially optimized
; activation record format for DSPNUM numeric output routine
RA      =       0       ; the return address
REM     =       4       ; the remainder
ARSIZE  =       8       ; size of activation record in bytes

;       receiving sequence assumptions
DSPNUM:         ; R3 holds the integer n to be printed
                ; R4-7 hold nothing of value
                ; R8-15 must be saved and restored if we use them
        STORES  R1,R2           ; save the return address

        LIS     R4,10           ; the divisior to divide R3 by 10

        ADDI    R2,R2,ARSIZE    ; push the current activation record
        LOAD    R1,PDIVU
        JSRS    R1,R1           ; call DIVU
        ADDI    R2,R2,-ARSIZE   ; pop the current activation record

                                ; quo = number / 10 (quotient in R3)
        STORE   R4,R2,REM       ; rem = number % 10 (remainder)

        TESTR   R3              ; test rem
        BZS     ENDIF           ; if (quo > 0) do the following

        ; output the more significant digits first (from quotient in R3)
        ADDI    R2,R2,ARSIZE    ; push the current activation record
        JSR     R1,DSPNUM       ; call dspnum( quo );
        ADDI    R2,R2,-ARSIZE   ; pop the current activation record

                ...             ; the remainder of the code is unchanged

Here, because we never needed to go to the activation record for the variable we originally named quo holding the quotient after division, we have shortened the activation record by one word and we have eliminated a total of three instructions. We can do better than this!

The next optimization we can apply follows from the fact that each calling sequence ends with ADDI R2,R2,-ARSIZE while the next begins with ADDI R2,R2,ARSIZE. Where there are no references to the activation record between two successive calls, we can eliminate both of these instructions trivially. Even where there are references, we can eliminate them, as illustrated in the following fragment:
The DSPNUM subroutine, further optimization
                ...             ; the start of the code is unchanged

        ADDI    R2,R2,ARSIZE    ; push the current activation record
        LOAD    R1,PDIVU
        JSRS    R1,R1           ; call DIVU

                                ; quo = number / 10 (quotient in R3)
        STORE   R4,R2,REM-ARSIZE; rem = number % 10 (remainder)

        TESTR   R3              ; test rem
        BZS     ENDIF           ; if (quo > 0) do the following

        ; output the more significant digits first (from quotient in R3)
        JSR     R1,DSPNUM       ; call dspnum( quo );

ENDIF:

        ; finally display the least significant digit, the remainder
        LOAD    R3,R2,REM-ARSIZE; get the remainder
        ADDI    R3,R3,'0'       ; convert it to a character

        LOAD    R1,PDSPCH
        JSRS    R1,R1           ; call dspch( rem + '0' );
        ADDI    R2,R2,-ARSIZE   ; pop the current activation record

                ...             ; the end of the code is unchanged

In the above code, we have eliminated 4 more instructions, all of which incremented or decremented the stack pointer. In each block of code where the stack pointer was left in a nonstandard alignment because the size of the activation record had not been subtracted from it, we had to modify the indexing displacements on every instruction that used the stack pointer. Specifically, instead of subtracting the activation record size from the stack pointer at run time, we used the assembler to subtract it from the displacement fields at assembly time.

Exercises

p) Combine all the optimizations suggested above into one DSPNUM routine and test it.

q) The final optimization suggested above suggests an alternative receiving and return sequence, where the function begins and ends as follows:

        ; receiving sequence
        DSPNUM: STORES  R1,R2           ; save the return address
                ADDI    R2,R2,ARSIZE    ; push my activation record

                                        ; body of subroutine goes here
        ; return sequence
                ADDI    R2,R2,-ARSIZE   ; pop my activation record
                LOADS   R1,R2           ; get the return address
                JUMPS   R1              ; return

What is the calling sequence for DSPNUM from within a routine that uses this receiving and return sequence?

A Second Example

One of the classic examples of a recursive function is the Fibonacci series; this is the series of integers that begins 0 1 1 2 3 5 8 13 and continues with the rule that each member of the series is the sum of the two previous members. Mathematically, we say that f(i) gives the ith member of the series, where f(0)=0, f(1)=1 and for all higher values of i, f(i)=f(i-1)+f(i-2). The ratio of successive members of the Fibonacci series approximates the golden ratio, φ, an irrational number close to 1.618.

We can easily write code for a function that computes the ith member of the Fibonacci series in a high-level programming language. In C or C++, for example, we can write the following:

A recursive function to compute Fibonaccin numbers
unsigned int fibonacci( unsigned int i )
/* return the i-th member of the Fibonacci series */
{
        if (i <= 1) {
                return i;
        } else {
                return fibonacci( i - 1 ) + fibonacci( i - 2 );
        }
}

The truth is, this is a stupid way to compute the consecutive members of the Fibonacci series, since there is a much faster iterative algorithm to do the job, but we will focus on this recursive algorithm here because it is such a good illustration of recursion.

A really experienced assembly language programmer might write this in assembly language code without any preliminary work, but it is useful to rewrite it in a high-level language first, reducing complex expressions by adding variables so that each line of high-level-language code corresponds roughly to one assembly language instruction or simple block of instructions. To do this, we put each call on a line by itself, and we do all arithmetic separately from calls. In addition, it is convenient to eliminate multiple returns from inside the body of the function, since these can make the assembly language code even harder to read that high-level language code containing multiple returns from within the body of a subroutine. Putting these changes together gives us the following C or C++ code:

The Fibonacci function, rewritten
unsigned int fibonacci( unsigned int i )
/* return the i-th member of the Fibonacci series */
{
        unsigned int value = i;  /* the tentative return value */ 
        unsigned int fib1, fib2; /* intermediate values */
        if (value > 1) {
                fib1 = fibonacci( value - 1 );
                fib2 = fibonacci( value - 2 );
                value = fib1 + fib2;
        }
        return value;
}

This allows us to propose an activation record structure and to write a first draft of the code. As with the first example, this first draft is written with no optimization! We use a calling sequence here that is identical to the calling sequence used in the DSPDEC routine, with only one difference in the return sequence, and that is the use of register 3 to hold the return value:

The Fibonacci function, in Hawk Assembly Language
; activation record format for the FIBONACCI function
RA      =       0       ; the return address
VALUE   =       4       ; the tentative unsigned integer return value
FIB1    =       8       ; fibonacci(value-1) 
FIB2    =       12      ; fibonacci(value-2)
ARSIZE  =       16      ; size of activation record in bytes

;       receiving sequence assumptions
FIBONACCI:      ; R3 holds the unsigned integer parameter i
                ; R4-7 hold nothing of value
                ; R8-15 must be saved and restored if we use them
        STORES  R1,R2           ; save the return address
        STORE   R3,R2,VALUE     ; value = i

        LOAD    R3,R2,VALUE     ; get value
        CMPI    R3,1            ; compare value with 1
        BLEU    ENDIF           ; if (value > 1) do the following

        LOAD    R3,R2,VALUE     ; get value
        ADDSI   R3,-1           ; compute value - 1
        ADDI    R2,R2,ARSIZE    ; push the current activation record
        JSR     R1,FIBONACCI    ; call fibonacci
        ADDI    R2,R2,-ARSIZE   ; pop the current activation record
        STORE   R3,R2,FIB1      ; fib1 = fibonacci( value - 1 )

        LOAD    R3,R2,VALUE     ; get value
        ADDSI   R3,-2           ; compute value - 2
        ADDI    R2,R2,ARSIZE    ; push the current activation record
        JSR     R1,FIBONACCI    ; call fibonacci
        ADDI    R2,R2,-ARSIZE   ; pop the current activation record
        STORE   R3,R2,FIB2      ; fib2 = fibonacci( value - 2 )
        
        LOAD    R3,R2,FIB1      ; get fib1
        LOAD    R4,R2,FIB2      ; get fib2
        ADD     R3,R3,R4        ; add them
        STORE   R3,R2,VALUE     ; value = fib1 + fib2

ENDIF:
;       return sequence and assertions
        LOAD    R3,R2,VALUE     ; get return value
                ; R4-7 may have been used
                ; R8-15, if used, must have been restored
        LOADS   R1,R2           ; get the return address
        JUMPS   R1              ; return

Each block of instructions in the above code corresponds closely to one line of the high-level-language program from which we derived it, and these blocks were written without regard to each other. There may be one subtle point, in the use of the BLEU instruction. This is used because we are comparing unsigned operands. In fact, this is not really an issue, since the Fibonacci function will produce an overflow and an invalid result long before the difference between signed and unsigned arguments makes any difference!

Exercises

r) The comments on the receiving sequence and return sequence for the FIBONACCI function given above were written first, before the body of the function was written. They indicate registers that may be freely used and registers that must be saved and restored, but they say nothing about which registers are actually used. Rewrite these to reflect the actual register usage of this function.

s) Optimize the code for FIBONACCI to eliminate extraneous loads and stores, taking maximum advantage of values already known to be in registers.

t) Optimize the code for FIBONACCI to eliminate extraneous changes to the stack pointer between the two recursive calls.

u) Optimize the code for FIBONACCI to eliminate extraneous variables from the activation record. In fact, you can squeeze the activation record down to a single variable plus a return address.

v) Write an iterative subroutine in a high-level language that computes the ith Fibonacci number and then translate that subroutine to Hawk assembly language.

Multiple Routines in One Source File

It is natural to want to use the DSPNUM routine to test the FIBONACCI routine! One obvious way to do this is to combine both routines into one source file, along with the main program. Doing this poses potential problems because we have reused several identifiers in our initial versions of the code for each of these routines! Some of these identifiers are reused for the same meaning in both, for example RA was defined as zero in both activation records. ARSIZE on the other hand had different values in each, and both functions contained labels named ENDIF.

Our assembly language provides some help. Symbols defined by assignment with an equals sign may be redefined at will within a program. So, as long as the definition is before use, there is no problem at all with defining ARSIZE differently for each of these routines.

labels pose another problem! In any source file, no identifier may be used as a label with two different meanings. We will solve this by using naming conventions that will, usually, give us unique labels within each subroutine. In designing these two conventions, we face two conflicting requirements: First, while the assembler permits labels to be very long, long labels interfere with the columnar arrangement that is conventional for assembly code. But second, short labels are frequently unreadable gibberish. Here are some style guidelines:

Labels for subroutines:
The label for a subroutine may be up to 14 characters long, and it should be defined on the first line of the block of comments that documents the receiving sequence. There should be a unique and fairly obvious three-letter abbreviation for the subroutine name. For example, FIBONACCI can be abbreviated FIB and DSPNUM can be abbreviated DSP unless there are many display routines defined in this file, in which case, it would be better to abbreviate it DNM.

Labels for control structures within subroutines:
Any label used for a control structure within a subroutine should begin with the three letter abbreviation for the subroutine name.

Labels for loops:
If the subroutine with the abbreviated name FIB contains a loop, the start of the loop should have the label FIBLP and branches to exit that loop should go to FIBLX (the letter X is a mnemonic for exit). If there are multiple loops, consider adding a qualifying letter as a suffix, so you end up with FIBLPA and FIBLPB.

Labels for if statements:
If the subroutine with the abbreviated name FIB contains an if statement, the start of the else clause (if any) should begin with the label FIBEL and the end of the if statement should be marked with FIBEI; multiple if statements can be distinguished with one letter suffixes on these labels.

These conventions are by no means perfect, but they are sufficient to allow us to combine the two functions given here into one somewhat legible source file. One important result of this set of naming conventions is that names that have global meaning are longer and more legible, while names with only local meaning, those associated with local control structures within one routine, are less readable but still likely to be easy to find simply because they are local.

Exercises

w) Combine the DSPNUM and FIBONACCI subroutines into one source file, along with a main program that contians a loop to output the first 10 fibonacci numbers.

x) Combine the DSPNUM and FIBONACCI subroutines into one source file, and then add a call to DSPNUM to the body of FIBONACCI so that the Fibonacci function outputs a begin paren on entry, and just before exit, it outputs the value it is about to return followed by an end paren. If you called f(2), for example, this would output ((1)(0)1) and if you called f(3) it would output (((1)(0)1)(1)2). Run it for f(5).

y) Consider the series where s(i) defined by:

                unsigned int s( unsigned int i )
                {
                        int j;
                        int v = 0;
                        for (j = 0, j < i; j++) v = v + s(j);
                        return v;
                }

This begins 0 1 1 2, just like the Fibonacci series. Write a recursive assembly language program that computes this series, and test it with a main program that prints out the first 10 members of the series.