22C:122, Lecture Notes, Lecture 26, Fall 1999

Douglas W. Jones
University of Iowa Department of Computer Science

  1. Operand Delay Slots

    If we assume the pipelined processor from the previous lecture and construct a pipeline diagram for the execution of a series of instructions, we can see some serious problems that a programmer will have to face.

    In the following discussion, we'll use the following short forms to name the 5 stages of our pipe:

        IF  instruction fetch
        AC  address computation
        OF  operand fetch
        OP  operate
        OS  operand store

    Consider the following sequence of instructions:

    		    ____ ____ ____ ____ ____
    	LOAD R1,A  |_IF_|_AC_|_OF_|_OP_|*OS*|____
    	STORE R1,B      |_IF_|_AC_|_OF_|_OP_|_OS_|____
    	STORE R1,C           |_IF_|_AC_|_OF_|_OP_|_OS_|____
    	STORE R1,D                |_IF_|_AC_|*OF*|_OP_|_OS_|
    	
    In this code, if executed on a pipelined machine with no interlocks, as presented in the previous lecture, the LOAD instruction that begins the sequence does not store the value of A in R1 until its operand store stage. In the pipeline diagram above, this is marked with stars.

    Also note that the STORE instructions inspect R1 in their operand fetch stages, so the first two STORE instructions above use the value of R1 prior to the value loaded from A! Thus, variables B and C receive the old value of R1, while variable D gets a copy of A, because only the final STORE instruction has its operand fetch stage after the operand store stage of the initial LOAD.

    As a result, the simplest code to correctly represent the assignment

    	 A = B
    	
    is:
    	LOAD R1,B
    	NOP
    	NOP
    	STORE R1,A
    	
    Here, we have used NOP instructions (no operation) to fill time while waiting for the necessary data to percolate through the pipeline.

    We refer to the extra instructions inserted between a load and store as delay slot fillers. The delay slots are the pipeline cycles that must be accounted for in order to get some data from one instruction to another. In this case, we are filling operand delay slots with no-op instructions. They are called operand delay slots because the delay is between the store cycle of one instruction and the operand fetch cycle of a successor.
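
    To see why two slots are needed here, note that an instruction k places after the LOAD reaches its operand fetch stage in cycle k + 3, while the LOAD does not deliver its result until its operand store stage in cycle 5; we need k + 3 > 5, so at least two instructions must intervene. As a minimal sketch only (the stage numbering is just the position of each stage in our 5-stage pipe, and the function name is invented here for illustration), a code generator could compute the required padding as:

        /* How many delay slots must separate an instruction that
         * produces a value in pipeline stage "produce" from one that
         * consumes it in stage "consume"?  For our pipe, produce = 5
         * (OS) and consume = 3 (OF), giving 2 slots. */
        int delay_slots(int produce, int consume)
        {
            /* A consumer k places later reaches stage "consume" in
             * cycle k + consume, which must come strictly after cycle
             * "produce"; the smallest such k leaves produce - consume
             * instructions in between. */
            int slots = produce - consume;
            return slots > 0 ? slots : 0;
        }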

    A good compiler will fill delay slots with useful code. For example, on our example pipeline, we can fill the operand delay slots by interleaving the code of successive assignment statements:

    	 A = B
    	 C = D
    	 E = F
            
    We compile this as:
    	LOAD R1,B
    	LOAD R2,D
    	LOAD R3,F
    	STORE R1,A
    	STORE R2,C
        STORE R3,E
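
    A code generator for this trick need not be elaborate. The following is a minimal sketch (the batching scheme and names such as emit_batch are invented here for illustration, not taken from any particular compiler) that handles simple dst = src assignments three at a time, so every LOAD is separated from its matching STORE by at least two instructions:

        #include <stdio.h>

        /* A simple assignment, dst = src, over named memory locations. */
        struct assign { const char *dst, *src; };

        /* Emit code for up to 3 assignments, filling the operand delay
         * slots by doing all the LOADs before any of the STOREs and
         * padding short batches with NOPs. */
        static void emit_batch(const struct assign *a, int n)
        {
            int i;
            for (i = 0; i < n; i++)
                printf("\tLOAD R%d,%s\n", i + 1, a[i].src);
            for (i = n; i < 3; i++)          /* pad if fewer than 3 */
                printf("\tNOP\n");
            for (i = 0; i < n; i++)
                printf("\tSTORE R%d,%s\n", i + 1, a[i].dst);
        }

        int main(void)
        {
            struct assign prog[] = { {"A","B"}, {"C","D"}, {"E","F"} };
            emit_batch(prog, 3);
            return 0;
        }

    Run on the three assignments above, this prints exactly the interleaved sequence just shown; given a single assignment, it falls back to the LOAD, NOP, NOP, STORE pattern of the earlier example.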
            
    The same delay considerations must be accounted for when one instruction modifies memory and the next accesses that modified memory location. Consider:
    	STORE R1,A
    	NOP
    	NOP
    	LOAD R2,A
    	
    Again, we needed to account for two delay slots, so that the operand store stage of the STORE instruction would happen before the operand fetch stage of the LOAD instruction.

  2. Self Modifying Code Delay Slots

    If code modifies itself, we must also consider the delay between the store stage of the instruction that modifies another and the instruction fetch stage that fetches the modified instruction. Consider, for example:

    		    ____ ____ ____ ____ ____
    	STORE R1,A |_IF_|_AC_|_OF_|_OP_|*OS*|____
    	STORE R2,A      |_IF_|_AC_|_OF_|_OP_|_OS_|____
    	STORE R3,A           |_IF_|_AC_|_OF_|_OP_|_OS_|____
    	STORE R4,A                |_IF_|_AC_|_OF_|_OP_|_OS_|____
    	STORE R5,A                     |_IF_|_AC_|_OF_|_OP_|_OS_|____
    	A: NOP                              |*IF*|_AC_|_OF_|_OP_|_OS_|
    	
    Here, assume that R1 through R5 contain 5 distinct machine instructions, each of which gets stored in location A. The operand store stages of the instructions that store R2 through R5 occur too late to have an effect on the instruction fetched from location A, so the value stored from R1 is the one that is eventually executed as the instruction at location A.

    This kind of self-modifying code is rare outside a limited set of situations. Most compilers never generate self-modifying code! If it is needed at all on a machine, it is needed in I/O code deep in the kernel of the operating system. For example, on the classic Intel microprocessors, there are 256 I/O addresses that can only be referenced by the IN and OUT instructions, where the I/O address being referenced is hard coded as an 8-bit field of the instruction. If, inside the system, you want to code a function such as in(X) that returns the value read from I/O address X, you would have to code the in() function using self-modifying code to copy the parameter X into the I/O address field of the IN instruction that actually does the work of this function.
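
    As a concrete illustration only (this is not code from the lecture, and the details are both x86- and environment-specific), the in(X) idea might look roughly like this in C on a classic real-mode machine where data memory can be executed; 0xE4 is the opcode of IN AL,imm8 and 0xC3 is RET:

        /* Sketch of a self-modifying in(port).  Assumes an environment
         * such as real-mode DOS where the buffer below may be executed
         * and IN is permitted; casting a data buffer to a function
         * pointer is not portable C. */
        static unsigned char in_stub[] = {
            0xE4, 0x00,   /* IN AL,imm8 -- 0x00 is the I/O address field */
            0xC3          /* RET        -- the result comes back in AL   */
        };

        unsigned char in(unsigned char port)
        {
            unsigned char (*do_in)(void) = (unsigned char (*)(void)) in_stub;

            in_stub[1] = port;   /* patch the hard-coded I/O address field */
            return do_in();      /* then execute the modified instruction  */
        }

    On a pipelined processor like ours, the store into in_stub[1] would of course also have to be far enough ahead of the fetch of the patched instruction to cover the store-to-fetch delay slots discussed above.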

    Most processors do little or nothing to automatically detect self-modifying code, so system programmers who need to rely on self-modification must do so very carefully!

  3. Pipeline Interlocks

    Although some DSP systems expose the pipeline to the programmer, most modern pipelined CPU designs make a serious effort to make code execute the way a naive user expects. Thus, if the user writes,

            LOAD R1,A
            STORE R1,B
            
    or
            LOAD R1,A
    	NOP
            STORE R1,B
            
    The system should automatically delay itself so that the result is equivalent to:
            LOAD R1,A
    	NOP
    	NOP
            STORE R1,B
            
    This means, first, that the performance is not great: naive code will not take full advantage of a pipelined CPU! On the other hand, a good compiler can fill the delay slots with useful operations without having to be as obsessive about correctness as it would have to be without this help from the CPU.

    Actually, what the CPU does is not quite the same as injecting NOP instructions into the instruction stream. Rather, we get the following effect in our pipeline diagram:

                        ____ ____ ____ ____ ____
            LOAD R1,A  |_IF_|_AC_|_OF_|_OP_|*OS*|____ ____ ____
            STORE R1,B      |_IF_|_AC_|////|////|*OF*|_OP_|_OS_|
            ANYTHING             |_IF_|////|////|_AC_|_OF_| ...
            ANYTHING                            |_IF_|_AC_| ...
            
    In effect, we ask the CPU to freeze the state of the first few stages of the pipeline until the operand that will be needed by the next instruction has been stored. The hardware that does this is called the pipeline interlock hardware, and the fillers injected into the pipeline to occupy the pipeline hardware during the delay slots are sometimes called bubbles.
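
    In software terms, the effect of the interlock is easy to describe. The following fragment is a minimal sketch only (names and representation invented here: each interstage latch holds the index of the instruction occupying it, with -1 for a bubble) of one clock edge of such a pipe:

        #include <stdbool.h>

        /* One clock edge of the five-stage pipe: stage[0] is IF,
         * stage[4] is OS, and "stall" is the interlock signal. */
        void clock_pipe(int stage[5], int next_instruction, bool stall)
        {
            stage[4] = stage[3];             /* OP -> OS always advances */
            if (stall) {
                stage[3] = -1;               /* inject a bubble into OP  */
                /* stage[0..2] (IF, AC, OF) simply hold their contents,
                 * and the caller does not advance its fetch pointer.   */
            } else {
                stage[3] = stage[2];         /* OF -> OP */
                stage[2] = stage[1];         /* AC -> OF */
                stage[1] = stage[0];         /* IF -> AC */
                stage[0] = next_instruction; /* fetch the next instruction */
            }
        }

    Clocked with the stall signal asserted during the two conflict cycles, this reproduces the behavior in the diagram above: the LOAD drains to completion while the STORE and everything behind it wait.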

  4. Pipeline Interlock Conditions

    In effect, for these operand interlocks, what we want is:

    If the source operand of the instruction currently in the operand fetch stage of the pipe matches the destination operand of any instruction downstream in the pipe, stop the instruction fetch, address computation and operand fetch stages, and inject a bubble into the later pipeline stages, allowing them to continue operating until the interlock condition is resolved.

    We need notation so we can formalize this. Recall the instruction format of our example machine:
    	 _______ ___ ___ _______________  
        |   op  | r | x |     disp      |
        |_______|___|___|_______________|
    	
    We will refer to the contents of, for example, the r field of the interstage register leading from the instruction fetch stage to the address computation stage as AC.r; similarly, OF.r refers to the r field presented to the operand fetch stage of the pipe by the address computation stage. In all cases, we prefix the name of an interstage register, or of a field of such a register, with the short form of the name of the stage that reads its contents.

    Similarly, if a value is an output of a pipeline stage to the outside world, we will prefix that signal with the name of the stage that computes it. Thus, AC.need_register is the signal that indicates that the address computation stage needs the contents of an index register.

    Now, we can formalize the interlock condition that was presented informally above:

    	if
    		 OF.need_register
    	     and OP.might_write_register
    	     and OF.r = OP.r
    	or
    		 OF.need_register
    	     and OS.write_register
    	     and OF.r = OS.r
    	or
    		 OF.need_memory
    	     and OP.might_write_memory
    	     and OF.ea = OP.ea
    	or
    		 OF.need_memory
    	     and OS.write_memory
    	     and OF.ea = OS.ea
    	then
    	     interlock, blocking the IF, AC and OF stages
    	
    Here, we had to introduce new might_write_register and might_write_memory outputs from the operate stage that anticipate the write_register and write_memory outputs of the operand store stage, predicting that, when the instruction reaches that stage, the corresponding write signal will be (or at least might be) asserted.
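
    This condition translates almost directly into combinational logic. As a sketch only (the struct and the exact C names below are assumptions made for illustration, not the lecture's interstage register layout), it could be written as:

        #include <stdbool.h>

        /* Hypothetical bundle of the per-stage signals named above;
         * the fields mirror OF.need_register, OP.might_write_register
         * and so on. */
        struct stage_sig {
            bool need_register, need_memory;
            bool might_write_register, might_write_memory;
            bool write_register, write_memory;
            int r;              /* register field of the instruction */
            unsigned ea;        /* effective address                 */
        };

        /* The operand interlock: one term per clause of the condition
         * given above. */
        bool operand_interlock(const struct stage_sig *OF,
                               const struct stage_sig *OP,
                               const struct stage_sig *OS)
        {
            return (OF->need_register && OP->might_write_register
                                      && OF->r == OP->r)
                || (OF->need_register && OS->write_register
                                      && OF->r == OS->r)
                || (OF->need_memory   && OP->might_write_memory
                                      && OF->ea == OP->ea)
                || (OF->need_memory   && OS->write_memory
                                      && OF->ea == OS->ea);
        }

    When this signal is asserted, the IF, AC and OF stages are held and a bubble is passed to the OP stage.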

    We need another interlock condition to prevent a register from being used for indexing while a change is pending:

            if
                     AC.need_register
                 and OF.might_write_register
                 and AC.r = OF.r
            or
                     AC.need_register
                 and OP.might_write_register
                 and AC.r = OP.r
            or
                     AC.need_register
                 and OS.write_register
                 and AC.r = OS.r
            then
                 interlock, blocking the IF and AC stages
            
    We could introduce similar interlock conditions to prevent an instruction fetch from a location slated to be modified by a previous instruction, but most processors don't do this.
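
    Continuing the sketch above (same hypothetical stage_sig bundle as before), the index-register interlock is one more test of the same form, this time comparing the r field seen by the address computation stage against everything downstream:

        /* The index-register interlock from the condition just given;
         * struct stage_sig is the one defined in the previous sketch. */
        bool index_interlock(const struct stage_sig *AC,
                             const struct stage_sig *OF,
                             const struct stage_sig *OP,
                             const struct stage_sig *OS)
        {
            return (AC->need_register && OF->might_write_register
                                      && AC->r == OF->r)
                || (AC->need_register && OP->might_write_register
                                      && AC->r == OP->r)
                || (AC->need_register && OS->write_register
                                      && AC->r == OS->r);
        }

    In a real design these two signals would simply gate the stage clock enables: operand_interlock freezes the IF, AC and OF stages, while index_interlock freezes only IF and AC.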