6. Subroutines, Recursion and Local Variables
Part of
22C:60, Computer Organization Notes
|
The Hawk monitor provides several subroutines to output numbers but for the moment, assume that these functions are missing so that we must write our own. The simplest binary to decimal conversion routine is recursive. At each step, the number is divided into two parts, the least significant digit and the other digits. After division by 10, the remainder gives the least significant digit while the quotient gives the rest. We use recursion because we want to output the most significant digits first, but we compute the least significant digit first. Here is the code for this, written in C or C++, assuming we want to call the monitor routine putchar for each digit as we compute it (warning — the version in the Hawk monitor has an additional parameter, the field width):
void putdecu( unsigned int n ) { /* display the unsigned integer n in decimal */ /* local variables and initialization */ int quo = n / 10; /* the quotient -- more significant digits */ int rem = n % 10; /* the remainder -- least significant digit */ /* if there are more significant digits, output them first */ if (quo > 0) { putdecu( quo ); } /* then display the least significant digit */ putchar( rem + '0' ); } |
Calling putdecu(123) causes the recursive call putdecu(12) which calls putdecu(1). The latter calls putchar('1') before it returns. In turn, putdecu(12) calls putchar('2') before it returns, allowing the outermost activation of putdecu to call putchar('3') and return. Informally, it might be convenient to think in terms of copies of putdecu when there is a recursive call, but no copies are actually made; this is why we use the term activation to refer to the activity caused by a call to a subroutine.
In contemplating converting this code to assembly language, we must ask several questions. Where can we store the local variables quo and rem? How do we divide by ten? How can we call one subroutine from within another, for example, the call to putchar() made from within putdecu, and finally, how can we call a function from within itself, that is, how do we do recursion?
Note that any recursive program can be rewritten into a purely iterative program using auxiliary variables instead of recursion, and this iterative code is frequently faster than any recursive code. Here, we are not interested in doing this. Rather, what we will do is translate the recursive algorithm to assembly language that exactly preserves the original recursive structure.
Exercises
a) In the above code, explain why the expression rem+'0' works to convert integers from 0 to 9 into characters from '0' to '9'. This depends on a very specific property of the representations of the characters in our character set, so to answer this question, you may have to refer to the details of ASCII given in Chapter 2.
b) Convert the above C code for putdecu() to Java, C# or whatever familiar language you may prefer and then write a small program to test it for a variety of different positive integers.
c) Modify the above code so it takes a second parameter, the number base. As a result, calling putdecu(100,8) should output 144 because 10010 is 1448. First, deal with number bases up to 10, then change the code to handle higher number bases.
When talking about high level languages, it is common to say that the local variables of a subroutine are pushed on the stack when the subroutine is called and popped from the stack when the subroutine returns. This applies to most forms of subroutines, including procedures functions and methods, but it ignores a key issue: How is this done, where is the stack, and how are variables on the stack referenced.
The basic operations on a stack are push and pop. Since 1962, stacks have been illustrated using a stack of dishes. If you imagine each dish holding a variable, you can imagine pushing a new variable onto the abstract stack by piling the dish that holds it onto the physical stack of dishes. You can imagine popping a variable from the abstract stack by removing the top dish from the stack. Some cafeterias have dish stacks mounted on springs so that only the top dish is visible above the countertop, making the terms push and pop even more intuitive. Adding a dish to the stack pushes the stack of dishes down into its well in the countertop, and when you remove a dish, the stack pops up.
In general, in discussion of local variables and subroutine calling, we say that all of the local variables in a particular subroutine are grouped together into a data structure called an activation record. This is the record of one activation of the subroutine. On entry to and exit from a subroutine, we can think in terms of the entire activation record being pushed and popped, as a unit, instead of thinking of the individual local variables being pushed and popped piecemeal.
The terms record, object and structure are sometimes used as synonyms. Of these terms, the word structure is the most ambiguous, since it sometimes refers to a block of related variables that are treated as a group, and it sometimes refers to more complex things like linked lists or trees. The term record refers to a group of related variables that are stored together; it is widely used for such groupings of variables in main memory, files and databases. The term object carries the most baggage, since it refers not only to a block of related variables, but it also implies that the details of those variables are hidden from outsiders and that those variables may only be manipulated by certain methods.
In the previous chapter, we relied on the Hawk monitor to alloate space for the stack and set up R2 to hold the address of that space. A variable or register that holds the address of a data structure is called a pointer, so here, we refer to R2 as the stack pointer. Several monitor routines used the stack, but we did not look at how it was used, except that we assumed that whatever use was made of the stack on entry to a function was undone before exit. Here, we will focus on the details.
Just as we have a choice to number the bits in a word from left to right or right to left, we can build a stack that grows from low memory to high or from high to low. Thus, pushing can either increment or decrement the stack pointer. This choice is unconstrained, and indeed, different computer architectures and different compiler writers have made this choice randomly. On the Intel 80x86 and Pentium family, pushing decrements the stack ponter (an idea copied from the DEC PDP-11), while on the IBM RS6000 and Power PC family, pushing increments the stack pointer. Both schemes work equally well. The convention we have adopted on the Hawk machine is that pushing increments the stack pointer.
We have a second choice. On entry to a subroutine, should we assume that the stack pointer points to or holds the address of the topmost used location on the stack, or should we assume that it points to the bottommost free location just above the stack? Again, both conventions can be made to work. In the case of the Hawk, on entry to a subroutine, we will assume that the stack pointer R2 always points to the first unused word above the topmost element currently on the stack.
These choices do not commit us to where the stack pointer points during subroutine execution. They only constrain where it points on entry to a routine and right after the routine returns. While the called routine is running, it can do anything it wants with the stack pointer, so long as it undoes everything before return. The stack pointer can be held constant in the body of the called routine, so it always points to the first word of the activation record, or we could increment the stack pointer immediately when local variables are allocated and decrement it just before return. Eventually, we will use a mixture of these techniques in optimized code, but for our initial examples in this chapter, the stack pointer will point to the start of the activation record throughout the subroutine body, except during calls to other routines. Combining these decisions, our stack looks like this on entry to a function:
low memory |
activation records of of subroutines waiting to be returned to |
- stack pointer points here -
free space available for activation record of called subroutine |
high memory |
Within a subroutine body, we will assume that the stack pointer always holds the address of or points to the first word of the activation record. By convention, we use the first word of the activation record to hold the return address. A good compiler or a careful programmer will not make unnecessary use of memory, so the return address only needs to be stored for subroutines that call others.
Beyond the return address, the rest of the activation record is available for local variables. If a subroutine does not call other subroutines, the local variables may all fit in registers, but when one routines calls another or if there are many local variables, we must allocate space for local variables on the stack.
We will use a variations on instructions we have already seen to access variables in the activation record. The version of LOAD we looked at in the previous chapter used program-counter-relative addressing, but in fact, this instruciton can be used to construct a memory address from the contents of any of general purpose register within the Hawk processor:
| ||||||||||||||||||||||||||||||||||||||||||||||||||
|
Formally, this does r[dst]=word_memory[r[x]+sxt(15,const16)]
The version of LOAD discussed in the last chapter had x set to zero, implying the use of the program counter. When x is nonzero, it specifies one of the 15 general purpose registers. The effective address is the sum of the contents of that register and the sign-extended second halfword of the instruction.
This approach to memory addressing is called indexed addressing. The register R[x] is referred to as the index register and the constant from the instruction is referred to as the displacement. Indexing is useful when the index register points to the first word of a data object such as a record and the displacement gives the location of a field in the object relative to the start of that object. Consider the problem of accessing records that hold the X and Y coordinates of a point on the display screen, along its color.
| ||||||
; displacements of fields of one point-color record X = 0 Y = 4 COLOR = 8 ; given the address of the point-color record in R3 LOAD R4,R3,X ; get the X coordinate into R4 LOAD R5,R3,Y ; get the Y coordinate into R5 LIS R6,BLUE STORE R6,R3,COLOR ; set the COLOR field to BLUE |
If we use the stack pointer as the index register, and if it points to the first word of the topmost activation record on the stack, then we can use LOAD with appropriate displacements to examine variables of the activation record. Similarly, the STORE instruction can be used to change fields of the activation record.
Exercises
d) Look up the LOADCC instruction in the Hawk manual. How does it differ from LOAD? There are two differences, one in the binary code, one in its behavior. Give both!
The Hawk architecture uses a variety of addressing modes, that is, modes of access to operands of instructions. Some of these addressing modes actually address memory, while others do not; the term addressing mode is used as a catch-all whether or not a memory address is used. In sum, the modes used on the Hawk are:
- Immediate addressing
- LIS R5,7
- With immediate addressing, the operand is taken from a constant field of the instruction; in the example, the constant 7 is an 8-bit immediate constant.
- Register addressing
- MOVE R5,R7
- With register addressing, the operand comes from or goes to a register; in the example, both the source and destination operands use register addressing.
- Short-indexed addressing
- LOADS R5,R7
- Short indexed addressing is sometimes called register-indirect addressing; in this case, an index register holds the address of (that is a pointer to) the operand in memory. In this example, register 7 is used as an index register.
- Indexed addressing
- LOAD R5,R7,20
- Indexed addressing was invented in the 1950s; in this mode, the operand address is the sum of the contents of an index register and a constant, the displacement, taken from a field of the instruction. The index register R7 points to an object, the displacement 20 identifies a field of the object.
- Program-counter-relative addressing
- LOAD R5,PHELLO
- Program-counter relative addressing was invented in the early 1960s; in this mode, the operand address is the sum of the program counter and a displacement. If the program counter is itself thought of as an index register, PC-relative addressing is simply a variant of indexed addressing.
With the Hawk architecture, the addressing mode is generally determined by the instruction itself, which is to say, by the opcode. Thus, for example the LIS, MOVE, LOADS and LOAD instructions are completely different, even though the primary difference between them is the addressing mode used to select the source operand. In contrast, LOAD using indexed mode and LOAD using PC-relative addressing are distinguished only by the contents of the index register field, where a value of zero indicates PC-relative addressing, while a nonzero value indicates indexing with one of the 15 general purpose registers.
Exercises
e) Look at the binary codes for LOAD, LOADS, STORE and STORES and having done so, explain how the Hawk CPU distinguishes between memory reference instructions that use indexed addressing and memory reference instructions that use short-indexed addressing.
f) Every hawk instruction that uses indexed addressing has a cousin that uses short indexed addressing. Furthermore, LOAD has a cousin called LEA that, instead of loading from memory, loads the effective address. LOADS also has a cousin that does this, and it could have been named LEAS because it is the short-indexed cousin of LEA. What instruction is this? (Hint: There is a regular structure to the binary codes for these instructions.)
On some machines, a specific field of the instruction is used to encode the addressing mode for each operand. This idea was present in rudimentary form on the Digital Equipment Corporation PDP-5, introduced in 1964, and the PDP-8 family that grew from it.
| ||||||||||||||||||||||||||||||||||||
|
The four addressing modes of the PDP-8 are complicated by the special status of page zero (addresses 0 to 128). With a 12-bit memory address but only a 7-bit address field in the instruction, a single instruction cannot directly address all of memory. To solve this, the PDP-8 designers broke memory into a sequence of 128 word pages, using the most significant 5 bits of the address to select the page and the least significant 7 bits to select the word in page. At any time, the PDP-8 could only directly addresses page zero and the current page selected by the high 5 bits of the program counter. Other machines such as the Hawk use various forms of program-counter relative addressing to solve this problem.
If we ignore the complexity of the paged structure of the PDP-8 memory, we see that the PDP-8 had the following two basic addressing modes, neither of which is present on the Hawk:
- Direct addressing
- A constant field in the instruction holds the operand address. This is the oldest mode, found on computers designed in the 1940s. A direct addressed load instruction on the Hawk could look like this:
DLOAD R,ADDR
We could implement this using following SMAL Hawk macro:
MACRO DLOAD =r,=addr
LIW r,addr
LOADS r,r
ENDMAC
- Indirect addressing
- A constant field in the instruction holds the address of the word in memory holding the operand address. That is, a word in memory points to the operand. This was invented in the 1950s. An indirect load instruction on the Hawk could look like this:
ILOAD R,ADDR
We cold implement this with the following Hawk macro:
MACRO ILOAD =r,=addr
LIW r,addr
LOADS r,r
LOADS r,r
ENDMAC
The Digital Equipment Corporation PDP-11 computer introduced in 1970 had a wide range of addressing modes. This was imitated by the Motorola 68000 microprocessor in the mid 1970s and further expanded on by the DEC VAX computer in the late 1970s and early 1980s. The VAX took this idea to the greatest extreme, with a 4-bit field used to select the addressing mode of every operand, allowing 16 distinct addressing modes. Here are the the PDP-11 addressing modes.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The PDP-11 had a 16-bit word and 8 general-purpose registers, where R[7] was the program counter. Many instructions had both source and a destination fields, where each could use any addressing mode. The indexed mode on the PDP-11 uses the next 16-bit word in the instruction stream as the displacement, exactly as in the Hawk, but both the source and destination fields could be indexed. Indirection could be applied to any addressing mode, generalizing on what the PDP-8 had done. The new modes on the PDP-11 are:
- Auto-increment addressing
- Auto-decrement addressing
- These are like register-indirect addressing except that the register holding the operand address is incremented or decremented as a side effect of use. The decrement was done before use, while the increment was done after, and the magnitude of the increment or decrement was set by the operand size. Memory was byte-addressable and the opcode determined whether the operand was one, two or four bytes.
It is convenient that languages descended from C such as C++ and Java allow us to write M[R[r]++] as a compact description of auto-increment addressing. This is not entirely a coincidence. This feature was not present in BCPL, the language that was the predecessor of C. The C language was originally designed as Unix operating system development moved to the PDP-11, and the C operators ++ and -- have frequently been described as responses to the use of auto-increment and auto-decrement addressing on the PDP-11.
There are other more obscure addressing modes. For example, the IBM 360 architecture supports Double indexed addressing, where the contents of two different registers are added to a 12-bit constant from the instruction register in order to compute the effective address.
Curiously, the PDP-11 allowed immediate and direct addressing modes as side-effects of other modes. Recall that register 7 was the program counter on the PDP-11. Auto-increment addressing in the source field using the program counter would fetch the next word from the instruction stream for use as an immediate constant. Similarly, auto-increment indirect addressing would fetch the next word from the instruction stream and use it as an address. In effect, this was a complex way of going about direct addressing.
Exercises
g) Suppose you to translate an assembly language program for some other computer to Hawk asseembly language. The original contains a load to register 5 using direct addressing to reference a memory location with the label LOC. Write equivalent Hawk code (it may take a few instructions).
h) Suppose you to translate an assembly language program for some other computer to Hawk asseembly language. The original contains a load to register 5 using auto-increment addressing through register 3, incrementing after the memory reference, as on the PDP-11. The operand size is one word. Write equivalent Hawk code (it may take a few instructions).
Looking back at the numeric output routine at the start of this chapter, we are interested in constructing an activation record that can be used to store the return address, remainder and quotient during the execution of any functions called from within the putdecu() function. This suggests the following three-word activation record format:
0 | return address |
4 | remainder |
8 | quotient |
A picture of a three word data object is easy to draw, but assemblers cannot read pictures. What we need to do is define symbolic names for each field of the object, where the value of each name gives the offset from the start of the object to that field. Here, as with many other areas, assembly language programmers must adopt conventions to clearly document program details that high-level languages enforce with their syntax. The convention we will use for all records, structures and objects is illutrated below:
; activation record format for PUTDECU numeric output routine RETAD = 0 ; the return address REM = 4 ; the remainder QUO = 8 ; the quotient ARSIZE = 12 ; size of activation record in bytes |
Given these definitions, and given that the stack pointer R2 points to the start of such an activation record, we can store a value from register 4 into the remainder using STORE R4,R2,REM and load it back into register 3 using LOAD R3,R2,REM. These two instructions use the contents of the index register R2 plus the constant REM to compute their effective address. Here are some other example uses of the activation record:
STORE R1,R2,RETAD ; Store into the return address from R1 LOAD R1,R2,RETAD ; Load from the return address into R1 STORES R1,R2 ; Equivalent to the above LOADS R1,R2 ; because RETAD is zero STORE R3,R2,REM ; Store into the REM field from R3 LOAD R3,R2,REM ; Load from the REM field into R3 LEA R3,R2,REM ; Get the address of the REM field into R3 |
Note that any symbol defined by assembly-time assignment can be redefined by another assignment. As a result, if a single assembly source file holds two subroutines, and if the activation record structure for each is defined before the code, there is no problem if the same names are used in both activation records. One name in particular, ARSIZE, will be defined for each activation record, giving its size. Because we will always use LOADS and STORES to access the return address, we will rarely need to define the RETAD field, although we will always include comments to indicate that it is there.
Our recursive numeric output routine began by dividing the number to be
printed by ten in order to get the remainder and quotient. On some machines,
this is done with a machine instruction, but on the Hawk, multiply and
divide are both supported by software in the monitor:
The monitor also provides TIMES for multiplying signed integers,
and TIMESU for unsigned integers.
The definitive documentation for these routines is in
the monitor.h file.
Note that both the multiply and divide routines have slightly nonstandard
register use for their parameters and results. This is because exact
conformance with the rule that registers from R3 upward should be used
would slow these routines down; here, speed was judged more valuable than
standardization.
Our focus here is how one subroutine can call another. We begin by focusing on the non-recursive call from our decimal print routine to DIVIDEU and then looking at the recursive call, which, as it turns out, is not materially different.
The main programs we wrote in the previous chapter were simple. There were no local variables and no calls other than to monitor routines. As a result, we could largely ignore the use of the stack. Now, we must ask, how can we push the current activation record before a call and how can we pop it back afterwards.
The principal thing to remember is this: On entry and exit from a
subroutine, the stack pointer, register 2, always points to the first
word of that routine's activation record. Initially, we will leave the
stack pointer alone in the body of the routine. This means that,
right before each call, we must update the stack pointer to point to
the first word of the new activation record, and right after each return,
we must undo this. The
following skeleton for a call to DIVIDEU from within our numeric
output routine does this:
; put the dividend in R3 and the divisor in R5 ; make sure nothing valuable is in R4 and R6 ADDI R2,R2,ARSIZE ; push the current activation record LIL R1,DIVIDEU JSRS R1,R1 ; call DIVIDEU ADDI R2,R2,-ARSIZE ; pop the current activation record ; get the quotient from R3 and the remainder from R4 |
The sequence of instructions needed to call a subroutine is called the
calling sequence for that routine; the calling sequence given
above is the calling sequence we will use for every external
subroutine on the Hawk machine until we start optimizing.
External subroutines are those that
are given in a separate source file. We will only make a slight change
for calls to local or internal subroutines, those defined in the same
source file. For the recursive
calls in our print routine we will use this calling sequence:
; put the number to be printed in R3 ; make sure R4 to R7 contain nothing valuable ADDI R2,R2,ARSIZE ; push the current activation record JSR R1,PUTDECU ; call PUTDECU ADDI R2,R2,-ARSIZE ; pop the current activation record ; the number has been printed. |
There are only two differences between this calling sequence and the
calling sequence for the unsigned divide routine. First, it uses
the JSR instruction to jump directly to the subroutine, using
program-counter relative addressing instead of going through the
more awkward process of first loading the address of the subroutine in
a register before use.
This works because the destination address PUTDECU is local.
JSR is really the same instruction as JUMP,
except that JUMP has an implicit destination field of zero,
indicating that the return address should be discarded.
As a rule, the Hawk
has only two interpretations for register fields that are
zero: Index register zero refers to the program counter,
while operand register zero is the constant zero, as a source, or it causes
the result to be discarded, as a destination.
The second difference is in the comments. For the unsigned divide routine, we know exactly what registers it uses because that code already exists, therefore, the comments about what registers the routine might use are exact. In the case of our decimal print routine, on the other hand, we have not yet written the code, so the most we can do is make a conservative statement based on our standard calling conventions.
As already mentioned, we always pass parameters and return results in R3 and up. In addition, if the calling routine has anything of value in R3 to R7, it should save it before the call; the caller may use these registers with impunity. It is up to the called routine to return with R8 to R15 unchanged, saving and restoring them if needed.
For every calling sequence, there is a matching receiving sequence,
used at the entry point of the called routine to receive control from the
calling routine, and a matching return sequence, used at
the end of the called routine to return to the caller. The first half
of the calling sequence works with the receiving sequence to transfer
parameters and control, saving whatever neeeds saving, while
the return sequence works with the second half of the calling sequence
to return to the caller and restore what must be restored.
The receiving and return sequences for our example are:
; receiving sequence assumptions PUTDECU:; expects R3 -- the number to be printed ; may use R4-7 ; R8-15 must be saved before any use STORES R1,R2 ; save the return address ... ; body of subroutine goes here ; return sequence assertions ; may use R3-7 ; R8-15 must be restored if used LOADS R1,R2 ; get the return address JUMPS R1 ; return |
In the above code, we used STORES R1,R2 to save the return
address in the activation record. We could have use
STORE R1,R2,RETAD to do this, but because we have allocated the
return address as the first field of the activation record,
we can use the shorter
instruction to access that field. As a general rule, the variable that is
used most intensely in a subroutine should be put at the start of the
activation record in order to speed all references to that variable; in our
example, none of the variables are used very intensely, so this is not an
issue.
The final return from the function is done with a JUMPS instruction; this is the short-indexed version of the JUMP instruction, and in fact, it is identical to the JSRS instruction with the destination field set to zero to indicate that the return address should be discarded.
The receiving sequence stores the return address in the activation record, allowing the subroutine to call other subroutines. On exit, the return sequence restores the return address from the activation record. Had the subroutine made no other calls, or had other calls been linked through some other register than R1, there would have been no need to save and restore R1. A good compiler, when it generates machine code for a subroutine, will always make such optimizations when possible, but generally, assembly language programmers are advised to use the most general calling sequence during development, and only try such optimizations at the very end, after making very sure that everything works.
Exercises
i) Suppose someone built a new version of the Hawk computer with the JUMP, JSR and JSRS instructions eliminated, leaving only JUMPS available. The new machine still has LEA, LOAD and a full complement of load immediate operations. Give an instruction sequence that can do almost everything that the JUMP instruction does. Are there any important restrictions on this instruction sequence that you'd need to worry about when replacing JUMP instructions with this sequence?
j) Suppose someone built a new version of the Hawk computer as suggested in the previous question. Given an instruction sequence that can do almost everything that JSR does Can you formulate this sequence as a macro named JSR?
k) Suppose someone built a new version of the Hawk computer as suggested in the previous question. Given an instruction sequence that can do almost everything that JSRS does. Suppose you had to move some Hawk code to the new machine, and the code contained the instruction JSRS R1,R1; what problems does this pose if you try to substitute your instruction sequence for this?
We have now covered all of the pieces required to write a Hawk assembly language subroutine equivalent to the putdec() subroutine given at the start of this chapter. As we put it together, we will make several changes.
The comments on the receiving sequence and the return sequence will be changed to reflect the actual register use made by this code. Registers 3 to 5 were used directly, and the documentation for DIVIDEU says that it also uses register 6. Checking the documentation for PUTCHAR shows that it used fewer registers, so the comments on the receiving sequence should indicate that registers 3 to 6 were used.
The code presented here was written without using global knowledge, except as already noted in assessing register use. Each statement was translated to assembly code without attention to what registers were used by the previous statement or what values were needed by the next statement. This approach to progrmming requires less thought and less knowledge than the alternatives, but it generally produces larger and slower results than could be obtained by writing code to make use of contextual information about the adjacent parts of the program.
For example, in the following code, the if statement was translated without noticing that the quotient was alread in R3. As a result, the code goes to memory for the quotient. In the body of the if statement, the code reads the quotient from memory again before calling PUTDECU even though it was already in R3. In sum, aside from using R2 to hold the stack pointer, this code uses registers as very short-term storage.
; activation record format for PUTDECU numeric output routine RETAD = 0 ; the return address REM = 4 ; the remainder QUO = 8 ; the quotient ARSIZE = 12 ; size of activation record in bytes ; receiving sequence assumptions PUTDECU:; expects R3 -- the number to be printed ; uses R4-6 STORES R1,R2 ; -- save the return address LIS R5,10 ; -- the divisior to divide R3 by 10 ADDI R2,R2,ARSIZE ; -- push the current activation record LIL R1,DIVIDEU JSRS R1,R1 ; -- call DIVIDEU ADDI R2,R2,-ARSIZE ; -- pop the current activation record STORE R3,R2,QUO ; quo = number / 10 STORE R4,R2,REM ; rem = number % 10 (remainder) LOAD R1,R2,QUO ; -- get the quotient CMPI R1,0 ; -- compare with zero BZS ENDIF ; if (quo > 0) do the following ; output the more significant digits first LOAD R3,R2,QUO ; -- get the quotient ADDI R2,R2,ARSIZE ; -- push the current activation record JSR R1,PUTDECU ; putdec( quo ); ADDI R2,R2,-ARSIZE ; -- pop the current activation record ENDIF: ; finally display the least significant digit, the remainder LOAD R3,R2,REM ; -- get the remainder ADDI R3,R3,'0' ; -- convert it to a character ADDI R2,R2,ARSIZE ; -- push the current activation record LIL R1,PUTCHAR JSRS R1,R1 ; putchar( rem + '0' ); ADDI R2,R2,-ARSIZE ; -- pop the current activation record ; the number has been printed. ; return sequence assertions ; abused R3-6 LOADS R1,R2 ; -- get the return address JUMPS R1 ; return |
Exercises
l) The comments for the PUTDECU receiving and return sequences were written first before the body of the the routine. They state permission to modify registers and obligations to preserve registers, but not what registers are actually used. Give appropriate comments documenting the final result.
m) Write a Hawk main program to test the above PUTDECU subroutine by using it to print out 1000016 in decimal. How many calls to PUTDECU does this test perform, including both the call from the main program and the recursive calls? (Put the subroutine in the same source file as the main program, don't mess with separate assembly and linkage yet, and you may need to rename your PUTDECU to avoid conflict with the monitor.)
n) Modify PUTDECU so it takes a second parameter, the number base, and prints a number in any base between 2 and 10. Don't bother checking for illegal number bases.
o) Write a Hawk assembly language function PUTDEC to print a signed integer. This should be equivalent to the following:
void putdec( int n ) { /* output the integer n */ if (n < 0) { n = -n; putchar('-'); } putdecu( n ); }
It is very likely that you will make errors in your programs. If you forget to save the return address or if you accidentally damage the saved return address on the stack, your return from a subroutine will jump to a wild memory location. If you accidentally change R2, attempts to use the activation record will access unpredictable memory locations.
The most common consequence of such errors is reported by the Hawk monitor as a bus trap. In general, traps are error conditions detected by hardware. When a program tries to store data in read-only memory or load data from nonexistant memory, the CPU detects this and reports a bus trap.
Sometimes, when a program fails, it will jump to a location that is a legal memory address, either in read-only or read-write memory and then execute random data as instructions. While most possible values are indeed legal Hawk instructions, some are not. When the hardware encounters an opcode that is not defined, the result is an instruction trap.
In general, when the hardware detects a trap, it transfers control to the Hawk monitor. The monitor's default trap service routine kills the program that was running at the time of the trap and reports the trap PC and the trap MA. The trap PC is the address of the instruction that caused the trap, and in the case of a bus trap, the trap memory address is the illegal memory address that the program tried to use.
To view the instruction that caused the trap, that is, the instruction that the emulator tried to execute at the time of the trap, type in the trap PC followed by the m (display memory) command. If the program that was running was in fact your code, you should be able to relate what you see to your code. Debugging is more difficult when the program takes a wild jump to an instruction that cannot be fetched. In that case, the contents of the registers can be help identify what hapened.
In the case of bus trap exceptions, the trap MA gives the word address that caused the trap. If the trap was caused by an attempt to fetch an instruciton from nonexistant memory, this will equal the program counter. If a load or store instruction was the cause, trap PC will point to this instruction, and trap MA will be the sum of the index register and the displacement, if any. Note that the least significant two bits of the trap memory address are used to signal more detail about the cause of the trap and should not be interpreted as part of the address.
The most difficult traps to debug are those that are the result of wild
jumps. There are two common cases, wild jumps as the result of attempting
to return from a subroutine after the stack pointer or activation record
have been corrupted, or wild jumps as the result of incorrect computation
of the destination address of a jump or subroutine call. In either case,
the registers are likely to hold evidence of what was going on before
the wild jump.
As pointed out above, the code we have given is not very good. We can tighten this code up in two ways: First, there are several places where we loaded a value from memory that was already in registers. Second, the Hawk has several instructions that can be used to tighten the code.
| ||||||||||||||||||||||||||||||||||||||||||||||||||
|
The LOADCC instruction does the same thing as LOAD, but it also sets the N and Z condition codes to report whether the value loaded is negative or zero. As a result, it is something like a LOAD followed by a comparison with zero, although it sets the V and C condition codes differently. We used a load-compare sequence in our example numeric output routine, and we can replace it with a LOADCC.
The if statement in the original putdecu() routine did not need the value of its operand, it just needed to know if it was zero or not. If we set the destination field of LOADCC to zero, the result will be discarded after setting the condition codes. The assembler assembles TEST as LOADCC R0. Just as LOAD has a short-indexed cousin LOADS, LOADCC and TEST have short-indexed cousins LOADSCC and TESTS.
As we pointed out above, the value we want to test is already in a register. We don't need to load it, we only need to test that register. The Hawk provides an instruction for this, based on a cousin of MOVE:
07 | 06 | 05 | 04 | 03 | 02 | 01 | 00 |   | 15 | 14 | 13 | 12 | 11 | 10 | 09 | 08 |
1 1 1 1 | dst | 1 1 1 0 | x |
MOVECC does exactly what MOVE does, except that it also sets the N and Z condition codes to report on the result. If all we want is to test the contents of a register, not move it, we use R0 as a destination. As with TEST and TESTS, we introduce TESTR which is MOVECC R0. If we use CMPI to compare a register with zero, this takes two halfwords to do what TESTR does in one halfword and half the time.
Applying the suggested optimizations to PUTDECU compacts the code to the following:
; activation record format for PUTDECU numeric output routine RETAD = 0 ; the return address REM = 4 ; the remainder ARSIZE = 8 ; size of activation record in bytes ; receiving sequence assumptions PUTDECU:; expects R3 -- the integer n to be printed ; uses R4-6 STORES R1,R2 ; -- save the return address LIS R5,10 ; -- the divisior to divide R3 by 10 ADDI R2,R2,ARSIZE ; -- push the current activation record LIL R1,DIVIDEU JSRS R1,R1 ; -- call DIVIDEU( number, 10 ) ADDI R2,R2,-ARSIZE ; -- pop the current activation record ; quo = number / 10 (quotient in R3) STORE R4,R2,REM ; rem = number % 10 (remainder) TESTR R3 ; -- test quo BZS ENDIF ; if (quo > 0) do the following ; output the more significant digits first (from quotient in R3) ADDI R2,R2,ARSIZE ; -- push the current activation record JSR R1,PUTDECU ; putdec( quo ); ADDI R2,R2,-ARSIZE ; -- pop the current activation record ... ; the remainder of the code is unchanged |
We never needed to go to the activation record for the variable quo in the optimized code, so we have shortened the activation record by a word and eliminated a total of three instructions.
We can do even better! Each call ends with ADDI R2,R2,-ARSIZE and the next starts with ADDI R2,R2,ARSIZE, so we can trivially drop both instructions between successive calls. This requires only small changes if the intervening code makes use of the activation record.
... ; the start of the code is unchanged ADDI R2,R2,ARSIZE ; -- push the current activation record LIL R1,DIVIDEU JSRS R1,R1 ; -- call DIVIDEU( number, 10 ) ; quo = number / 10 (quotient in R3) STORE R4,R2,REM-ARSIZE; rem = number % 10 (remainder) TESTR R3 ; -- test rem BZS ENDIF ; if (quo > 0) do the following ; output the more significant digits first (from quotient in R3) JSR R1,PUTDECU ; putdec( quo ); ENDIF: ; finally display the least significant digit, the remainder LOAD R3,R2,REM-ARSIZE; -- get the remainder ADDI R3,R3,'0' ; -- convert it to a character LOAD R1,PPUTCHAR JSRS R1,R1 ; putchar( rem + '0' ); ADDI R2,R2,-ARSIZE ; -- pop the current activation record ... ; the end of the code is unchanged |
In this round of optimization, we eliminated four more instructions, all of which manipullated the stack pointer. Wherever the stack pointer was left pointing off the end of the activation record because we did not decrement it between two calls, we had to subtract the activation record size from each index displacement used to access any field of the activation record. In effect, instead of subtracting the activation record size from the stack pointer at run time between two calls, we did it at assembly time.
Exercises
p) Combine all the optimizations suggested above into one PUTDECU routine and test it.
q) The final optimization suggested above suggests an alternative receiving and return sequence:
; receiving sequence PUTDECU:STORES R1,R2 ; -- save the return address ADDI R2,R2,ARSIZE ; -- push my activation record ; body of subroutine goes here ; return sequence ADDI R2,R2,-ARSIZE ; -- pop my activation record LOADS R1,R2 ; -- get the return address JUMPS R1 ; returnHow does this relate to the receiving and return sequence used for the main program given in the previous chapter?
One of the classic examples of a recursive function is the Fibonacci series; this is the series of integers that begins 0 1 1 2 3 5 8 13 and continues with the rule that each member of the series is the sum of the two previous members. Mathematically, we say that f(i) gives the ith member of the series, where f(0) = 0, f(1) = 1 and for all higher values of i, f(i) = f(i-1)+f(i-2). The ratio of successive members of the Fibonacci series approximates the golden ratio, φ, an irrational number close to 1.618.
We can easily write code for a function that computes the
ith member of the Fibonacci series in a high-level
programming language. In C or C++, for example, this works:
unsigned int fibonacci( unsigned int i ) { /* return the i-th member of the Fibonacci series */ if (i <= 1) { return i; } else { return fibonacci( i - 1 ) + fibonacci( i - 2 ); } } |
There is a much faster iterative algorithm to compute the Fibonacci series,
using an iterative algorithm instead of recursion. Our point here, though,
is to explore calling sequences and activation records, so we will use the
recursive algorithm.
An experienced assembly language programmer might write this in
assembly language code without preliminary work, but it is useful
to rewrite it in a high-level language first, reducing complex expressions
by adding variables so that each line of high-level-language code corresponds
to one assembly language instruction or a short block of instructions.
To do this, we put each call on a line by itself, and we do all arithmetic
separately from calls. In addition, it is convenient to eliminate multiple
returns from inside the body of the function, since these can make the
assembly language code harder to read.
Putting these changes together gives us the following C or C++ code:
unsigned int fibonacci( unsigned int i ) { /* return the i-th member of the Fibonacci series */ unsigned int value /* the return value */ unsigned int fib1, fib2; /* intermediate values */ value = i; if (value > 1) { unsigned int p = value - 1; fib1 = fibonacci( p ); p = value - 2; fib2 = fibonacci( p ); value = fib1 + fib2; } return value; } |
This allows us to write a first poorly optimized draft of the code.
The calling sequence is
similar to the one we used for DSPDEC, except that in the
return sequence,
register 3 holds the value to return.
; activation record format for the FIBONACCI function RETAD = 0 ; the return address VALUE = 4 ; the tentative unsigned integer return value FIB1 = 8 ; fibonacci(value-1) FIB2 = 12 ; fibonacci(value-2) P = 16 ; the parameter p ARSIZE = 20 ; size of activation record in bytes ; receiving sequence assumptions FIBONACCI: ; expects R3 -- i, an unsigned integer ; ignores R4-15 ; returns R3 -- fib(i), the ith Fibonacci number STORES R1,R2 ; -- save the return address STORE R3,R2,VALUE ; value = i LOAD R3,R2,VALUE ; -- get value CMPI R3,1 ; -- compare value with 1 BLEU ENDIF ; if (value > 1) do the following LOAD R3,R2,VALUE ; -- get value ADDSI R3,-1 ; STORE R3,R2,P ; p = value - 1 LOAD R3,R2,P ; -- put parameter p in place ADDI R2,R2,ARSIZE ; -- push the current activation record JSR R1,FIBONACCI ; -- call fibonacci ADDI R2,R2,-ARSIZE ; -- pop the current activation record STORE R3,R2,FIB1 ; fib1 = fibonacci( p ) LOAD R3,R2,VALUE ; -- get value ADDSI R3,-2 ; STORE R3,R2,P ; p = value - 2 LOAD R3,R2,P ; -- put parameter p in place ADDI R2,R2,ARSIZE ; -- push the current activation record JSR R1,FIBONACCI ; -- call fibonacci ADDI R2,R2,-ARSIZE ; -- pop the current activation record STORE R3,R2,FIB2 ; fib2 = fibonacci( p ) LOAD R3,R2,FIB1 ; -- get fib1 LOAD R4,R2,FIB2 ; -- get fib2 ADD R3,R3,R4 ; -- add them STORE R3,R2,VALUE ; value = fib1 + fib2 ENDIF: ; return sequence and assertions LOAD R3,R2,VALUE ; -- get return value ; ignored R4-15 LOADS R1,R2 ; -- get the return address JUMPS R1 ; return value |
Each block of instructions in the above code corresponds closely to one line of the high-level-language program from which we derived it, and these blocks were written without regard to each other. There may be one subtle point, in the use of the BLEU instruction. This is used because we are comparing unsigned operands. In fact, this is not really an issue, since the Fibonacci function will produce an overflow and an invalid result long before the difference between signed and unsigned arguments makes any difference!
Exercises
r) The comments on the receiving sequence and return sequence for the FIBONACCI function given above were written first, before the body of the function was written. They indicate registers that may be freely used and registers that must be saved and restored, but they say nothing about which registers are actually used. Rewrite these to reflect the actual register usage of this function.
s) Optimize the code for FIBONACCI to eliminate extraneous loads and stores, taking maximum advantage of values already known to be in registers.
t) Optimize the code for FIBONACCI to eliminate extraneous changes to the stack pointer between the two recursive calls.
u) Optimize the code for FIBONACCI to eliminate extraneous variables from the activation record. In fact, you can squeeze the activation record down to a single variable plus a return address. You may need to use more registers to do this.
v) Write an iterative subroutine in a high-level language that computes the ith Fibonacci number and then translate that subroutine to Hawk assembly language.
Our focus in this chapter has been local variables that are allocated on entry to a subroutine and deallocated on exit. There are occasions when global variables and statically allocated variables are needed. These variables are allocated when the program is started and persist until the the program terminates. The SMAL assembler, working together with the SMAL linker, can allocate storage for such variables using the COMMON directive.
Suppose you wanted to count the number of calls made to the recursive
FIBONACCI routine we have just defined. To do this, we need
a global variable that is incremented each time the body of FIBONACCI
is entered. We can allocate this variable as follows:
; allocate global variable COMMON FIBCOUNT,4 ; initialize global variable LCSAVE = . . = FIBCOUNT W 0 . = LCSAVE |
The above code is equivalent to int fibcount=0; in C or C++.
The COMMON directive takes two arguments, the name of the global
variable, FIBCOUNT in this case, and the size of the variable
in bytes, 4 in this case. This directive merely tells the linker to
allocate the indicated space in RAM. If multiple declarations of the
same global variable occur in separately assembled sections of the program,
the linker will allocate only one copy, so all references to that variable
refer to the same memory location.
Giving an initial value to a global variable involves setting the assembler's location counter to that variable and assembling the initial value into memory. If this code is anywhere but at the end of the program, we need to remember the previous location counter and restore it when we are done.
The following code fragment illustrates how to increment a global variable. This code could be placed at the head of our FIBONACCI routine to count calls.
LIL R4,FIBCOUNT ; -- load pointer to fibcount LOADS R5,R4 ; -- load its value ADDSI R5,1 ; fibcount = fibcount + 1 STORES R5,R4 ; -- save the incremented value |
The above code is designed to be inserted directly after the entry to the
FIBONACCI routine. At that point in the code, R3 is in
use; it holds the value of the parameter to the subroutine. Because of this,
the above code uses R4 and R5.
Suppose we want to use our PUTDECU to test the FIBONACCI routine. An obvious way to do this is to combine both routines into one source file, along with the main program. Doing this poses potential problems because we have reused several identifiers in our initial versions of the code for each of these routines. Some of these identifiers are reused for the same meaning in both, for example RETAD was defined as zero in both activation records. ARSIZE on the other hand had different values in each, and both functions contained labels named ENDIF.
Our assembly language provides some help. Symbols defined by assignment with an equals sign may be redefined at will within a program. So, as long as the definition is before use, there is no problem at all with defining ARSIZE differently for each of these routines.
Labels pose the main problem. In any source file, no identifier may be used as a label with two different meanings. We can solve this by using naming conventions that encourage use of unique labels in each subroutine. There are two conflicting requirements at work here: Long labels can ruin the readability of the code, while short labels are hard to make unique or meaningful. Here are some style suggestions:
- Labels for subroutines:
- The label for a subroutine may be up to 14 characters long, and it should be defined on the first line of the block of comments that documents the receiving sequence. There should be a unique and fairly obvious three-letter abbreviation for the subroutine name. For example, FIBONACCI can be abbreviated FIB and PUTDECU can be abbreviated PUT (if there is only one put routine in the source file) or PDU (if PUTDEC and PUTDECU are both in the same source file).
- Labels for control structures within subroutines:
- Labels for local control structures should begin with the three letter abbreviation for the routine they are in. The last label before the return from a subroutine, regardless of the control structure, can have the suffix QT or X meaning quit or exit. So, the last label in the fibonacci routine could be FIBQT.
- Labels for loops:
- Labels at the top of a loop be suffixed LP and labels marking the exit from a loop can be suffixed LX (the X stands for exit). If there are multiple loops, consider adding a qualifying letter as a suffix, so you end up with labels like FIBLPA and FIBLPB.
- Labels for if statements:
- Labels marking the start of else clauses can be suffixed with EL and the label at the end of the if statement can be suffixed EI (for endif). If there are multiple if statements, add a final distinguishing suffix, for example, FIBEIA.
These conventions are by no means perfect, but they are sufficient to
allow us to combine the fibonacci and print routines given here into one
somewhat
legible source file. One important result of this set of naming conventions
is that names that have global meaning are longer and more legible, while
names with only local meaning, those associated with local control structures
within one routine, are less readable but still likely to be easy to find
simply because they are always defined fairly close to the point of use.
Exercises
w) Combine the PUTDECU and FIBONACCI subroutines into one source file, along with a main program that contians a loop to output the first 10 fibonacci numbers.
x) Combine the PUTDECU and FIBONACCI subroutines into one source file, and then add a call to PUTDECU to the body of FIBONACCI so that the Fibonacci function outputs a begin paren on entry, and just before exit, it outputs the value it is about to return followed by an end paren. If you called f(2), for example, this would output ((1)(0)1) and if you called f(3) it would output (((1)(0)1)(1)2). Run it for f(5).
y) Consider the series where s(i) defined by:
unsigned int s( unsigned int i ) { int j; int v = 0; if (i < 2) return i; for (j = 0, j < i; j++) v = v + s(j); return v; }This begins 0 1 1 2, just like the Fibonacci series. Write a recursive assembly language program that computes this series, and test it with a main program that prints out the first 10 members of the series.
z) Instrument the Fibonacci code to count how many calls have been made to FIBONACCI and then write a main program to evaluate f(8) and report the number of calls.