Homework 7 Solutions

22C:122, Fall 1998

Douglas W. Jones

In the following pipeline diagram, F O A M and R are the initials for the Fetch, Operand, ALU, Memory and Result pipeline stages. Dashes indicate bubbles in the pipe, and vertical bars indicate the dependencies that forced the introduction of bubbles. The letter over each vertical bar is the register name that forced the dependency.
```
                                 c
        LEA   c,X       F O A M R|
	LOAD  a,Y         F O A M|R      b
	LOAD  b,Z(c)        F - -|O A M R|       c
	ADD   c,a,b               F - - -|O A M R|
	STORE c,Z                         F - - -|O A M R
	LEA   pc,D                                F O A M R
```
Background Consider the following small code fragment, annotated as above:
```
	LEA   a,X	F O A M R  a
	LEA   a,Y	  F O A M R|
	STORE a,Z	    F - - -|O A M R
                        1 2 3 4 5 6 7 8 ...
```
Part A In the above fragment, register a is locked by the first LEA in cycle 2 and unlocked in cycle 5, while the second LEA locks register a in cycle 3 and unlocks it in cycle 6. If the lock is just one bit, the Register stage will unlock register a in cycle 5, allowing the STORE to continue in cycle 6 when it ought to have been delayed until cycle 7. As a result, the following incorrect pipeline diagram would result:
```
                                 a
	LEA   a,X	F O A M R|
	LEA   a,Y	  F O A M|R
	STORE a,Z	    F - -|O A M R
                        1 2 3 4 5 6 7 8 ...
```
Part B There are at least three alternatives that would work. The most obvious is to replace the one-bit scoreboard entries with 3 bit entries that are incremented each time any instruction locks a register and decremented each time any instruction unlocks a register. This would produce exactly the above pipeline diagram. Three bits per lock are sufficient because these allow lock counts to reach a maximum of 7, and there are only 5 pipeline stages. In fact, only 4 stages are relevant to locking, since it is the Operand stage that asserts locks and the Result stage that releases locks, so the locks will vary over the range 0 to 4, in the worst case.
A less desirable solution would be to treat the lock as a pure mutual exclusion device, delaying the second LEA when it attempted to lock the already-locked register, as illustrated by the following pipeline diagram:
```
                                 a
        LEA   a,X       F O A M R|     a
        LEA   a,Y         F - - O|A M R|
        STORE a,Z               F - - -|O A M R
                        1 2 3 4 5 6 7 8 ...
```
This is excessive because there is no need to wait for a result that will be overwritten!
A third solution is to note, as the second LEA attempts to lock a, that the result of the first LEA is now irrelevant. As a result, the first LEA can be aborted by converting it to a no-op. This gives the following pipeline diagram:
```
        LEA   a,X       F O A - -  a
        LEA   a,Y         F O A M R|
        STORE a,Z           F - - -|O A M R
                        1 2 3 4 5 6 7 8 ...
```
Note that when an instruction is aborted when another instruction discovers that it will overwrite the results of that instruciton, the operation of releasing the lock of the aborted instruction and claiming the lock on behalf of the new instruction can be combined. As a result, one bit per lock still suffices.
Part C What if there is a condition code register, and all LOAD, STORE and arithmetic operations use the condition codes to report on the value loaded, stored or computed, and that there is a full set of conditional branches that allow the condition codes to be tested.
Explain how the concept a scoreboard can be used to prevent conflicts over the use of condition codes? Does the one-bit-per-register scoreboard discussed in class suffice for this? If not, does the scoreboard implementation you proposed in part B suffice?
BACKGROUND Consider the following Minimal Ultimate RISC with interrupts:
```
	repeat
	    SRC = M[PC]             -- PC = M[FFFF]
	    PC = PC+1
	    DST = M[PC]
	    PC = PC+1
	    TMP = M[SRC]
	    M[DST] = TMP
	    if interrupt requested and IE' nonzero, then
	        exchange( PC, IR )  -- IR = M[FFFE]
                IE = 0, IE' = 0     -- IE = M[FFFD] lsb
            else
                IE' = IE
	    endif
	forever
```
Part A: Consider the problem of returning from interrupt. Assume that, for the duration of the interrupt service routine, the interrupt register holds the return address and that interrupts are disabled. The return sequence is therefore:
```
	MOVE X1,IE  -- enable interrupts
            -- - - - - what if an interrupt occurs here?
	MOVE IR,PC  -- return
```
What if an interrupt occurs between the enable and the return instructions? This will cause the contents of IR to be lost; the nested interrupt will correctly return to this (interrupted) interrupt service routine, but this interrupt service routine will not be able to complete its return! Therefore, the effect of the enable-interrupt operation must be delayed so that the actual enable takes effect with the return itself.
There are many ways of doing this, ranging from having a special return from interrupt instruction that enables interrupts as a side effect, to the simple one-instruction delayed enable scheme in the example architecture.
Part B: Adding interrupts to the simple version of the Pipelined Ultimate RISC with delay slots exposed to the programmer poses real problems! On interrupt, the interrupted code would continue to execute in delay slots while the interrupt routine begins, and on return, the last few instructions of the interrupt service routine would continue to execute in delay slots after the interrupted code resumes. As a result, the programmer would not be able to make any assumptions about the execution of code in delay slots!
Result forwarding eliminates all but one branch delay slot, but that delay slot is still subject to the problems that occur with the basic pipelined version.
If, on the other hand, we make interrupts and all branches inject bubbles in the pipe to fill interrupt and branch delay slots, a fully compatable pipelined version of the suggested architecture can be made. Result forwarding can be used to minimize the number of bubbles that need to be injected, but that counts as an optimization.