Another classic area where different architectural approaches lead in vastly different directions is the area of procedure, function and method calls.
Since the development of the Algol 60 programming language it has been common to think of calls as allocating objects called activation records on a run-time stack. (It is noteworthy that PL/I, Pascal, C and Simula were all strongly influenced by Algol, and that C++ combines large parts of Simula semantics with C syntax.) The activation record for a block of code contains:
In general, if we assume this model, two index registers will be dedicated to holding the static link and dynamic link. A call is made as follows:
In general, it takes several instructions to do all of this! Some of these instructions will be part of the code in the calling block, and some will be part of the prologue on the called block. In general, on return from a block to its caller, it takes several instructions to restore the execution context and deallocate the activation record that is no longer needed. Again, these instructions are typically divided between the epilogue on the called block and the body of the calling block. In sum, we can identify 3 sequences of instructions:
Usually, when talking collectively about these three types of instruciton sequences, the term calling sequence is used broadly to refer to all three. There are obviously many ways to build calling sequences, but the design of a good calling sequence is sufficiently difficult that, on most machines, there is a standard calling sequence used by most programmers.
Machine architects have long seen calling sequences as a problem area. As a result, they have long attempted to produce general call instructions that package large parts of the calling sequence into one machine instruction. A typical medium complexity CALL instruction will push the return address on the stack and transfer control to a destination, but many machines have been designed with CALL instructions that attempted to do quite a bit more.
The DEC VAX, for example, included a CALL instruction that included a mask indicating what registers were being used, and selectively pushed these registers prior to pushing the return address and a "stack mark word" holding this mask so that the right registers can be restored on return. This instruciton requires a large number of memory cycles and forces a break from pipelined execution!
The designers of Gnu C++ discovered that they could almost always produce faster object code if they never used the standard VAX CALL instruction except for calling library routines that were obliged to adhere to the standard. This discovery led many machine architects to abandon such complex call instructions and go back to far simpler models that were easy to pipeline. Typically, this has meant a return to CALL instructions with the semantics of the old IBM System 360 BAL (branch and link) instruction:
CALL R DST R = PC PC = EAThe return address is simply left in a register, so that the receiving sequence can save it as needed. If the function is at a leaf of the call tree (it calls no other functions), the return address can remain in the register for the duration.
Given this basic calling mechanism, and the fully developed model of an activation record as outlined above, a typical calling sequence suite would be:
-- assume the following named index registers: -- AR - pointer to the current activation record -- NR - points to new activation record (temporarily) -- RA - register holding return address (temporarily) -- if P points to an activation record -- P->CP holds the continuation point -- P->DL holds the saved dynamic link -- P->SL holds the saved static link -- we assume that ARS, the size of each activation -- record is known statically; it may vary from -- place to place depending on the number of -- temporaries currently needed, but each time -- any particular instruciton is executed, the -- size is fixed. -------------------- -- calling sequence ADD NR = AR + ARS -- (a) -- copy parameters into fields of *NR (b) -- save registers into fields of *AR (c) MOVE NR->SL = DST_CONTEXT -- (d) CALL RA,DST -- restore registers from *AR (e) -------------------- -- receiving sequence DST MOVE AR->CP = RA -- (f) MOVE NR->DL = AR -- (g) MOVE AR = NR -- (h) -------------------- -- return sequence MOVE AR = AR->DL -- (i) MOVE RA = AR->CP -- (j) JUMP *RANotes on the above pseudocode:
(a) The above code assumes that the called function will need to use RAM. Small functions where the entire activation record for the function fits in registers need no activation record in RAM and this line can be eliminated.
(b) The above code assumes that parameters are passed in memory, as must be the case if there are many or if large objects are passed by value. If there are only a few small parameters, all may be passed in registers and the code at b will not be needed.
(c,e) The above code assumes that the caller has values in registers that must be saved. If the called routine (and anything it calls) does not modify certain registers, this code can be omitted.
(d) If the called routine makes no reference to global variables, or rather, to variables declared in statically enclosing blocks other than the outermost block, it needs no static link and this line can be omitted.
(f,j) If the called routine does not call other routines and does not need extra registers, these lines may be omitted and the return address may remain in the RA register through the call.
(g,h,i) If the called routine does not call other and does not need extra registers, these lines may be omitted. If this is done, all references to local variables or parameters in memory must be indexed off of NR instead of AR.
The above notes indicate, in a fair amount of detail, the kind of optimizations that good compilers such as the GNU C++ compiler make in order to minimize the cost of calls. This minimization is only possible if the compiler knows all of the characteristics of the called routine at the time the code for the call is compiled. Therefore, calls to other functions in the same compilation unit will typically be more efficient to calls in other compilation units, and calls to previously compiled routines may be more efficient than calls to routines that have not yet been compiled, but only if the compiler has access to a sufficiently detailed description of the called routine at the time it compiles the call.
Interrupts have been described as calls to parameterless functions that return no results and are initiated not by the calling code but by events detected by hardware outside the CPU such as the completion of an input/output transfer.
In fact, interrupts are not so simple except in the simplest of computers! The original DEC PDP-8 had one of the simplest interrupt mechanisms. On interrupt:
This simplicity is misleading! The PDP-8 CPU had an Interrupt Enable bit, IE. If IE was set to zero, the instruction execution cycle would not handle interrupts. The occurance of an interrupt automatically reset IE, but on return, it was up to the software to enabl interrupts; this meant that there was a 1-instruction window between the instruction that enabled interrupts and the instruction that completed the return. An interrupt could not be allowed to occur within this window, so additional hardware had to be added to prevent interrupt processing between the enable interrupt instruction and the following instruction. Things got even more complex in the context of the PDP-8 memory management unit because interrupt processing always took place in bank 0 of memory, and this meant that the MMU state had to be saved and restored.
On the DEC PDP-11 and many stack architectures, interrupts were handled quite differently! On interrupt, the previous program counter and processor status word were pushed onto the stack, and then the program counter (PC) and processor status word (PSW) were loaded from the interrupt vector.
The PDP-11 processor status word included the condition codes, an indication of the priority level at which the processor was running, and a field dedicated to controlling the optional memory management unit. The priority field was a 3-bit field. If the CPU was running at high priority, all but the non-maskable interrupts were disabled. at lower priority levels, more interrupts were enabled, until at the lowest level, all were enabled.
The interrupt vector contained 8 PC/PSW pairs, one corresponding to each of the 8 priority levels. The interrupt request line from a device could be attached to one of the 8 levels, so that when that device requested an interrupt, the PC and PSW would be loaded from the corresponding entry in the vector.
The PDP-11 had a special return-from-interrupt instruciton that popped the top two words form the stack into the PC and PSW. So long as there was sufficient space on the stack of the currently running application, interrupt service routines could operate entirely in the context of that stack, popping everything they pushed prior to return and restoring the stack to its state prior to the interrupt as they finished the returned.