27. Pipeline Interlocks

Part of the 22C:122/55:132 Lecture Notes for Spring 2004
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Background

In the previous lecture, we introduced a pipelined machine based on the following model:

				         ______      _____
	 IF Instruction Fetch           |______|----|     |
				  _      |____|     |     |
	 AC Address Computation  | |----|______|    |     |
				 | |     |____|     |     |
	 OF Operand Fetch        | |----|______|----|  M  |
				 |R|     |____|     |     |
	 OP Operate              | |    |______|    |     |
				 | |     |____|     |     |
	 OS Operand Store        |_|----|______|----|_____|

We assumed that a word is 32 bits, that there are 16 general purpose registers R, that memory is word addressable, and that all instruction words share the following format, typical of one-and-a-half-address machines:

                8           4       4                  16
         _______________ _______ _______ _______________________________
	|_______________|_______|_______|_______________________________|
        |    opcode     |   r   |   x   |             disp              |

From the natural data paths in the pipeline that resulted from our desire for memory-to-register and register-to-memory instructions, the following opcode format emerged fairly naturally, although only a subset of these opcodes seems to have practical uses:

         _______ _ _ _ _
        |_______|_|_|_|_|   (the least significant 4 bits are the mode)
        | aluop | | | | |
                 | | | |-- wm -- operand store stage writes to M[ea]
                 | | |---- wr -- operand store stage writes to R[r]
                 | |-- rm -- operand fetch stage reads M[ea] instead of ea
                 |---- rr -- operand fetch stage reads R[r]
          
              useful modes
                 0 0 1 0  -- immediate to register
                 0 1 1 0  -- memory to register
                 1 0 0 1  -- register to memory
                 1 0 1 0  -- register + immediate to register
                 1 1 0 1  -- register + memory to memory
                 1 1 1 0  -- register + memory to register
                 1 1 1 1  -- register + memory to both register and memory

The problem with this instruction encoding is that it leaves us with a number of useless addressing modes, for example, modes in which results are discarded, modes in which an immediate operand is also used as an address, and so on. In addition, depending on what operations the ALU provides, some of the above modes may be redundant.
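
Before moving on, a minimal C sketch may help make the field and mode layout concrete; it assumes the opcode occupies the most significant byte of the word, as drawn above, and the struct and function names are ours, not part of the machine:

        #include <stdint.h>

        /* decoded instruction fields; names follow the diagrams above */
        struct decoded {
            unsigned aluop;     /* most significant 4 bits of the opcode     */
            unsigned rr, rm;    /* mode bits used by the operand fetch stage */
            unsigned wr, wm;    /* mode bits used by the operand store stage */
            unsigned r, x;      /* register and index register fields        */
            unsigned disp;      /* 16-bit displacement                       */
        };

        struct decoded decode(uint32_t ir) {
            struct decoded d;
            d.aluop = (ir >> 28) & 0xF;    /* high nibble of the opcode      */
            d.rr    = (ir >> 27) & 1;      /* low nibble of the opcode: mode */
            d.rm    = (ir >> 26) & 1;
            d.wr    = (ir >> 25) & 1;
            d.wm    = (ir >> 24) & 1;
            d.r     = (ir >> 20) & 0xF;
            d.x     = (ir >> 16) & 0xF;
            d.disp  =  ir        & 0xFFFF;
            return d;
        }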

Enhancement

Adding branches to the instruction set may be as straightforward as adopting the convention that r15 refers to the program counter. As a result, a load immediate into r15 becomes a branch instruction; we will defer the details, but keep this idea in mind.

Finding an opcode for conditionals is also fairly easy. Any mode that stores a result neither in memory nor in a register is an obvious candidate to borrow for a new meaning. Finding an appropriate meaning is harder.

As an example of a small change we could make that would be both useful and simple, consider modifying the operand fetch stage as follows:

	   Operand Fetch Stage
	   _____  _______________
	  | op|r||       ea      |
	  |___|_||_______________|
	    |  |         |
	    |  o---------|---------------o--------> register number       
	    |  |         |   _________   |  ------< data from registers
	    |  |         |  |  read   |  | |                        
	    o--|---------|--|registers|--|-|---o--> need register value
	    |  |         |  |_________| _|_|_  |                   
	    |  |         |              \0_1/--   << ADDITION
	    |  |         |                |                        
	    |  |         o--------------------o---> address to memory
	    |  |         |                |   |  -< data from memory
	    |  |         |                |  _|_|_
	    |  |         |   ______     --|--\0_1/                 
	    |  |         |  | read |   |  |    |                   
	    o--|---------|--|memory|---o--|----|--> needs memory
	    |  |         |  |______|      |    |
	    |  |         |                |     -----------        
	   _|__|  _______|_______  _______|_______  _______|_______
	  | op|r||       ea      ||      opa      ||      opb      |
	  |___|_||_______________||_______________||_______________|

Here, all we have done is add one multiplexor, so that, if a register is not read, the register number field is available as the opa input to the ALU instead of an undefined value from registers. This creates a new class of immediate addressing modes with short 4-bit operands:

	 _______ _ _ _ _
	|_______|_|_|_|_|
	| aluop | mode  |

            useful new modes
                 0 0 0 1  -- NEW short immediate to memory
                 0 1 0 1  -- NEW short immediate + memory to memory
	

Of course, these new modes will only be useful for those ALU operations that take an input from operand A; the modes shown with a + sign above will be useful with the full range of two-operand arithmetic and logical operations.
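
In the same sketchy C notation, and reusing the decoded struct from the earlier sketch, the data selection done by this modified operand fetch stage looks roughly as follows; reg[] and mem[] stand in for the register file and the data memory port:

        /* operand fetch data selection, including the added multiplexor */
        void operand_fetch(const struct decoded *d, uint32_t ea,
                           const uint32_t reg[16], const uint32_t mem[],
                           uint32_t *opa, uint32_t *opb) {
            /* ADDITION: when no register is read, the 4-bit r field itself
               becomes operand A, giving the short immediate modes */
            *opa = d->rr ? reg[d->r] : d->r;

            /* operand B is either the effective address itself (an immediate
               or an address constant) or the word read from memory at ea */
            *opb = d->rm ? mem[ea] : ea;
        }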

Even so, it is unlikely that all operations will be used with the same frequency, or that all operations will be equally likely with all addressing modes. As a result, we will probably want to introduce a more complex encoding of the opcodes, eliminating the orthogonal encoding of addressing modes and introducing a ROM in the address computation stage to take a compact opcode encoding and expand it into an orthogonal encoding.
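
As a hedged illustration of what such a ROM might hold, it can be modeled as a constant table indexed by a compact 8-bit opcode; the opcode numbering, mnemonics and aluop codes below are invented for the example, and only the mode bits come from the tables above:

        /* hypothetical decode ROM: compact opcode -> expanded control fields */
        struct expanded { unsigned aluop, rr, rm, wr, wm; };

        static const struct expanded decode_rom[256] = {
            /*                       aluop  rr rm wr wm                         */
            [0x00] = { /* LOADI */    0x0,   0, 0, 1, 0 },  /* immediate to reg */
            [0x01] = { /* LOAD  */    0x0,   0, 1, 1, 0 },  /* memory to reg    */
            [0x02] = { /* STORE */    0x1,   1, 0, 0, 1 },  /* register to mem  */
            [0x03] = { /* ADD   */    0x2,   1, 1, 1, 0 },  /* reg + mem to reg */
            /* ... remaining entries fill out the compact opcode space ... */
        };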

Operand Delay Slots

One crucial problem raised by the pipelined processor given here is that the result of one instruction is not stored to memory or to a register until long after that instruction is fetched, and in fact, until after the next few instructions have had their operands fetched.

We can easily illustrate this with a variant form of pipeline diagram where we list instructions on the vertical axis and time on the horizontal axis, showing the pipeline stage that is processing each instruction at the intersection of row and column.

		    ____ ____ ____ ____ ____
	LOAD R1,A  |_IF_|_AC_|_OF_|_OP_|*OS*|____
	STORE R1,B      |_IF_|_AC_|_OF_|_OP||_OS_|____
	STORE R1,C           |_IF_|_AC_|_OF||_OP_|_OS_|____
	STORE R1,D                |_IF_|_AC||*OF*|_OP_|_OS_|____
	STORE R1,E                     |_IF_|_AC_|_OF_|_OP_|_OS_|
                 time 1    2    3    4    5    6    7    8    9

The instruction sequence shown here is a test of the effect of the pipeline on operand interaction. The question this code asks is: which of the store instructions following the load will store the value that was loaded from variable A?

Note that the load instruction fetched at clock cycle 1 does not store its result, the value of A, in R1 until the end of clock cycle 5; therefore, use of R1 as a source operand in the next two instructions has odd results. The operand fetch stage for the second instruction occurs at time 4, so it obviously gets the previous value of R1, not the new value from A. The operand fetch stage of the third instruction also gets the old value of R1 if we assume edge-triggered registers (we must assume this, generally; if the underlying register file is based on simple latches, we add a slave latch to the output of the register file to give the overall system edge-triggered semantics).

As a result, the programmer will see the variables D and E set, as expected, to the value of A, but the variables B and C will be set to the old contents of R1 -- odd behavior, at least from the perspective of a naive programmer who doesn't understand pipelined processors.
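
For those who like to check such timing mechanically, the little C program below reproduces the reasoning; the cycle numbering follows the diagram, with instruction i fetched at cycle i, its operand fetch at cycle i+2, and the load's result written at the end of cycle 5:

        #include <stdio.h>

        /* which of the stores following LOAD R1,A (fetched at cycle 1)
           reads the newly loaded value of R1? */
        int main(void) {
            int load_os = 1 + 4;                /* operand store cycle of the load */
            for (int j = 2; j <= 5; j++) {      /* the four following stores       */
                int of = j + 2;                 /* operand fetch cycle of store j  */
                printf("instruction %d reads the %s value of R1\n",
                       j, of > load_os ? "new" : "old");
            }
            return 0;
        }

It reports that instructions 2 and 3 (the stores to B and C) read the old value, while instructions 4 and 5 (the stores to D and E) read the new one.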

The simplest code to correctly represent the assignment

	 A = B

is therefore:

	LOAD R1,B
	NOP
	NOP
	STORE R1,A

Here, we have used NOP instructions (no operation) to fill time while waiting for the necessary data to percolate through the pipeline.

We refer to the extra instructions inserted between a load and store as delay slot fillers. A delay slot is an instruction cycle during which the desired result has not yet reached the appropriate point in the pipeline, so we must wait an instruction cycle before we look again.

In this case, we are filling operand delay slots with no-op instructions because we must wait for the operand to be available for the operand fetch stage of the pipeline.

A good compiler will fill delay slots with useful code. For example, with the example pipeline, we can fill the operand delay slots by interleaving the code of successive assignment statements:

	 A = B
	 C = D
	 E = F

We compile this as:

	LOAD R1,B
	LOAD R2,D
	LOAD R3,F
	STORE R1,A
	STORE R2,C
	STORE R3,E

The same delay considerations must be accounted for when one instruction modifies memory and the next accesses that modified memory location. Consider:

	STORE R1,A
	NOP		-- delay slot waiting for store to A
	NOP		-- delay slot waiting for store to A
	LOAD R2,A

Again, we needed to account for two delay slots, so that the operand store stage of the STORE instruction would happen before the operand fetch stage of the LOAD instruction.

Self Modifying Code Delay Slots

While self-modifying code is not generally encouraged, it provides an example of the longest delay that our pipeline imposes. In this case, the modification is done by the operand store pipeline stage, and the modified code is fetched by the instruction fetch stage. Consider, for example:

		    ____ ____ ____ ____ ____
	STORE R1,A |_IF_|_AC_|_OF_|_OP_|*OS*|____
	STORE R2,A      |_IF_|_AC_|_OF_|_OP||_OS_|____
	STORE R3,A           |_IF_|_AC_|_OF||_OP_|_OS_|____
	STORE R4,A                |_IF_|_AC||_OF_|_OP_|_OS_|____
	STORE R5,A                     |_IF||_AC_|_OF_|_OP_|_OS_|____
	A: NOP                             ||*IF*|_AC_|_OF_|_OP_|_OS_|
	
Here, we assume that R1 through R5 contain 5 distinct machine instructions, each of which gets stored in location A. If we assume that the memory port used by the instruction fetch stage takes priority over any store operation attempted in the same cycle, the value stored from R1 is the one that is eventually executed as the instruction at location A.

This kind of self-modifying code is rare outside a limited set of situations. Most compilers never generate self-modifying code! If it is needed at all on a machine, it is needed in I/O code deep in the kernel of the operating system. For example, on the classic Intel 8080 microprocessor, there are 256 I/O addresses that can only be referenced by the IN and OUT instructions, where the I/O address is hard coded as an 8-bit field of the instruction. If, inside the system, you want to code a function such as in(X) that returns the value read from I/O address X, you would have to write the in() function using self-modifying code, copying the parameter X into the I/O address field of the IN instruction that actually does the work of this function.

Most processors do little or nothing to automatically detect self-modifying code, so system programmers who need to rely on self modification must do so very carefully!

Pipeline Interlocks

Some DSP systems expose the pipeline to the programmer, requiring the programmer or compiler writer to deal with all the different delay slots that must be filled, but from the first pipelined IBM 360 to modern pipelined CPU designs, a serious effort has been made to make code execute the way a naive user expects. Thus, if the user writes,

        LOAD R1,A
        STORE R1,B

or

        LOAD R1,A
	NOP
        STORE R1,B

The system should automatically delay itself so that the result is equivalent to:

        LOAD R1,A
	NOP
	NOP
        STORE R1,B

This means, first, that real programs will not necessarily perform at the peak rate allowed by the pipelined CPU -- only if accident or a good compiler puts the instructions in the right order will you get peak performance.

What the CPU does is not quite the same as injecting NOP instructions into the instruction stream. Rather, we get the following effect, shown here in our pipeline diagram notation:

                    ____ ____ ____ ____ ____
        LOAD R1,A  |_IF_|_AC_|_OF_|_OP_|*OS*|____ ____ ____
        STORE R1,B      |_IF_|_AC_|////|///||*OF*|_OP_|_OS_|____
        ANYTHING             |_IF_|////|////|_AC_|_OF_|_OP_|_OS_|____
        ANYTHING                            |_IF_|_AC_|_OF_|_OP_|_OS_|

In effect, we ask the CPU to freeze the state of the first few stages of the pipeline until the operand that will be needed by the next instruction has been stored. The hardware that does this is called the pipeline interlock hardware, and the fillers injected into the pipeline to occupy the hardware during the delay slots are sometimes called bubbles in the pipeline, or pipeline bubbles.

Pipeline Interlock Conditions

Here is the rule we want the hardware to enforce in order to deal with operand dependencies:

if the source operand of the instruction currently in the operand fetch stage of the pipe matches the destination operand of any instruction downstream in the pipe, stop the instruction fetch, address computation and operand fetch stages, and inject a bubble into the later pipeline stages, allowing them to continue operating until the interlock condition is resolved.

We need notation so we can formalize this. Recall the instruction format of our example machine:

	 _______ ___ ___ _______________  
	|_______|___|___|_______________|
	|   op  | r | x |     disp      |

We will refer to the contents of, for example, the r field of the interstage register leading from the instruction fetch stage to the address computation stage as AC.r; similarly, OF.r refers to the r field presented to the operand fetch stage of the pipe by the address computation stage. In all cases, we will prefix the interstage register or field of a register with the short form of the name of the stage that reads the contents of that register or field.

Similarly, if a value is an output of a pipeline stage to the outside world, we will prefix that signal with the name of the stage that computes it. Thus, AC.need_register is the signal that indicates that the address computation stage needs the contents of an index register.

Now, we can formalize the interlock condition that was presented informally above:

	if
		 OF.need_register
	     and OP.might_write_register
	     and OF.r = OP.r
	or
		 OF.need_register
	     and OS.write_register
	     and OF.r = OS.r
	or
		 OF.need_memory
	     and OP.might_write_memory
	     and OF.ea = OP.ea
	or
		 OF.need_memory
	     and OS.write_memory
	     and OF.ea = OS.ea
	then
	     interlock, blocking the IF, AC and OF stages
	     during the next clock cycle.

Here, we had to introduce new outputs from the operate stage that duplicate the write_register and write_memory outputs of the operand store stage, predicting that, when the instruction reaches that stage, one or the other write signal will be (or at least, might be) asserted.

Given these, the interlock predicate is simply a collection of comparators comparing fields of the different interstage registers, plus some and-or logic that produces, as output, the signal indicating that it is time to inject a bubble.
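
Expressed as a hedged C sketch, with one struct per interstage register and field names following the notation above (the struct layout itself is ours), the predicate is just a boolean function:

        #include <stdint.h>

        /* the per-stage signals the interlock logic examines; names follow
           the stage-prefix notation introduced above */
        struct stage {
            int need_register, need_memory;                /* operand fetch requests    */
            int might_write_register, might_write_memory;  /* operate stage predictions */
            int write_register, write_memory;              /* operand store actions     */
            unsigned r;                                    /* register number field     */
            uint32_t ea;                                   /* effective address         */
        };

        /* asserted when the IF, AC and OF stages must freeze and a bubble
           must be injected into the operate stage on the next clock cycle */
        int of_interlock(const struct stage *OF, const struct stage *OP,
                         const struct stage *OS) {
            return (OF->need_register && OP->might_write_register && OF->r  == OP->r)
                || (OF->need_register && OS->write_register       && OF->r  == OS->r)
                || (OF->need_memory   && OP->might_write_memory   && OF->ea == OP->ea)
                || (OF->need_memory   && OS->write_memory         && OF->ea == OS->ea);
        }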

We need another interlock condition to prevent a register from being used for indexing while a change is pending:

        if
                 AC.need_register
             and OF.might_write_register
             and AC.r = OF.r
        or
                 AC.need_register
             and OP.might_write_register
             and AC.r = OP.r
        or
                 AC.need_register
             and OS.write_register
             and AC.r = OS.r
        then
             interlock, blocking the IF and AC stages

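Continuing the same sketch, this second condition reuses struct stage from above and adds one more bank of comparators over the AC-stage fields:

        /* asserted when the IF and AC stages must freeze because the index
           register named by AC.r has a write pending somewhere downstream */
        int ac_interlock(const struct stage *AC, const struct stage *OF,
                         const struct stage *OP, const struct stage *OS) {
            return (AC->need_register && OF->might_write_register && AC->r == OF->r)
                || (AC->need_register && OP->might_write_register && AC->r == OP->r)
                || (AC->need_register && OS->write_register       && AC->r == OS->r);
        }
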
We could introduce similar interlock conditions to prevent an instruction fetch from a location slated to be modified by a previous instruction, but most processors don't do this.