25. Pipeline Diagrams

Part of the 22C:122/55:132 Lecture Notes for Spring 2004
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Thinking about Pipelines

Consider a machine where the execution of one instruction takes place in parallel with the fetch of the next instruction, and where the execution itself is broken down into several stages, for example, operand fetch, ALU/memory, and operand store. The operand fetch stage gets data from registers, the operand store stage returns results to registers, and the ALU/memory stage either performs a memory access (for load and store instructions) or an ALU operation (for register-to-register operate instructions).

For the moment, assume that we implement this on a Harvard architecture, that is, a computer where program memory and data memory are entirely separate. This makes it trivial to overlap the execution of one instruction with the fetch of another, since the memory references used for instruction fetch are to a different physical memory than those used for data transfers.

Tracing the flow of one instruction through this machine clock cycle by clock cycle, we see that each instruction passes through four steps:

  1. Instruction fetch
    -- IR = M[PC]; PC = PC + 1
  2. Operand fetch from registers
    -- ea = R[x] + disp; op1 = R[r]; op2 = R[x]
  3. Do ALU or memory operation
    -- either M[ea] = op1, result = M[ea] or result = ALU( op1, op2, f )
  4. Operand store to registers
    -- either R[r] = result, PC = result or nothing
In the above description of a simple pipelined processor, given with no discussion of the instruction format, we see hints of a number of interstage registers: ea, the effective address; op1 and op2, the operands being delivered to the ALU; and result, the result out of the ALU. The instruction register itself is also an interstage register. In later lectures, we will discuss the details of this inside the pipelined CPU. Here, we focus on the consequences visible to the programmer.
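The four steps and the interstage registers can also be sketched in software. The following Python model is illustrative only; the register-file size, the memory representation, and the instruction tuple (op, r, x, disp) are assumptions, not part of the machine described above. It walks one instruction through all four stages in sequence, whereas real hardware runs the stages concurrently on different instructions.

```python
class PipelineModel:
    """Toy model of the 4-step datapath; field names follow the text."""

    def __init__(self):
        self.R = [0] * 8        # register file (size is an assumption)
        self.M = {}             # data memory, addressed by integer
        self.PC = 0
        # interstage registers named in the text
        self.IR = None
        self.ea = self.op1 = self.op2 = self.result = None

    def fetch(self, program):
        self.IR = program[self.PC]       # IR = M[PC], program memory
        self.PC += 1                     # PC = PC + 1

    def operand_fetch(self):
        op, r, x, disp = self.IR
        self.ea = self.R[x] + disp       # ea = R[x] + disp
        self.op1 = self.R[r]             # op1 = R[r]
        self.op2 = self.R[x]             # op2 = R[x]

    def alu_or_memory(self):
        op, r, x, disp = self.IR
        if op == "LOAD":
            self.result = self.M.get(self.ea, 0)   # result = M[ea]
        elif op == "STORE":
            self.M[self.ea] = self.op1             # M[ea] = op1
        elif op == "ADD":                # one sample ALU function f
            self.result = self.op1 + self.op2

    def result_store(self):
        op, r, x, disp = self.IR
        if op in ("LOAD", "ADD"):
            self.R[r] = self.result      # R[r] = result
```

Walking a single LOAD through all four stages updates R[r] from M[ea] exactly as the step list above describes.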

Conventional programming relies on the assumption that the results of each instruction are immediately available to the next. With a pipelined machine, on the other hand, the results of one instruction may not be available until several clock cycles after that instruction has been fetched.

Pipeline Diagrams

Pipeline diagrams show the pipeline stages on one axis and time on the other. For each clock cycle, the operands currently being processed by each pipeline stage are documented. Such diagrams are extremely useful tools for keeping track of what is going on where in a pipelined computation.

Pipeline diagrams are usually presented with time as the horizontal axis, although there are contexts where it is convenient to present time as the vertical axis. For our example machine, the pipeline diagram for the execution of one instruction would usually be presented as follows:

                               Time ------>
           _ _ _ _ _ _ _ _ _ _____ _ _ _ _ _ _ _ _ _ _ _ _
	fetch               |     |
           _ _ _ _ _ _ _ _ _|_____|_____ _ _ _ _ _ _ _ _ _
	operand fetch             |     |
           _ _ _ _ _ _ _ _ _|_ _ _|_____|_____ _ _ _ _ _ _
	ALU/memory operation            |     |
           _ _ _ _ _ _ _ _ _|_ _ _|_ _ _|_____|_____ _ _ _
	result store                          |     |
           _ _ _ _ _ _ _ _ _|_ _ _|_ _ _|_ _ _|_____|_ _ _
                          __    __    __    __    __    __
        clock ___________|  |__|  |__|  |__|  |__|  |__|
                               1     2     3     4     5
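The shape of such a diagram can be computed mechanically. Here is a small Python sketch (the function name and table layout are my own, not from the text) that tabulates which instruction occupies each stage in each clock cycle, assuming one instruction enters the pipeline per cycle with no stalls:

```python
STAGES = ["fetch", "operand fetch", "ALU/memory", "result store"]

def stage_table(n_instructions, n_stages=len(STAGES)):
    """table[cycle][stage] = index of the instruction in that stage,
    or None if the stage is idle.  With no stalls, instruction i is in
    stage s during cycle i + s (both 0-indexed)."""
    n_cycles = n_instructions + n_stages - 1
    table = [[None] * n_stages for _ in range(n_cycles)]
    for i in range(n_instructions):
        for s in range(n_stages):
            table[i + s][s] = i
    return table
```

Printing the rows of this table (one row per cycle, one column per stage) reproduces the diagonal band of boxes shown above.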

Consider the following sequence of instructions for this machine, documented using an algebraic notation that allows us to ignore the actual machine code format:

     L1: R1 = VA  -- VA is a variable in memory
     L2: R1 = VB  -- VB is a variable in memory
     L3: PC = R1
     L4: R3 = R1 

If we use a pipeline diagram to examine the interaction between these, we see the following:

                               Time ------>
           _ _ _ _ _____ _____ _____ _____ _ _ _ _ _ _ _ _
	fetch     |L1   |L2   |L3   |L4   |      
	          |     |     |     |     |      
           _ _ _ _|_____|_____|_____|_____|_____ _ _ _ _ _
	operand fetch   |L1   |L2   |L3   |L4   |
                        |     |     |=R1  |=R1  |
           _ _ _ _ _ _ _|_____|_____|_____|_____|_____ _ _
	ALU/memory operation  |L1   |L2   |L3   |L4   |
	                      | =VA | =VB |     |     |
           _ _ _ _ _ _ _|_ _ _|_____|_____|_____|_____|_____ _ _
	result store                |L1   |L2   |L3   |L4   |
	                            |  R1=|  R1=|  PC=|  R3=|
           _ _ _ _ _ _ _|_ _ _|_ _ _|_ _ _|_____|_____|_____|_ _
        Clock cycle  1     2     3     4     5     6     7

        Register values  R1   ??========= VA=== VB===============
                         PC   3     4     5     6     ??=========
                         R3                                 VA===

This diagram exposes a serious problem! The first instruction, L1, does not store its result until the trailing edge of clock cycle 4. The instruction L3 loads its source operand during this same clock cycle, so it gets the value that was in R1 prior to the store done by L1. Similarly, the second instruction, L2, does not store its result in R1 until the end of clock cycle 5, so the final instruction, L4, ends up moving the result of L1 to R3, when a naive reader of the program would expect L4 not to have been executed at all, since it follows a branch!
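The timing argument above can be restated as a small rule. Numbering clock cycles as in the diagram, an instruction fetched in cycle i reads its register operands during cycle i+1 and stores its result at the trailing edge of cycle i+3, so a later instruction fetched in cycle j sees that result only when j + 1 > i + 3. A sketch of this check in Python (the function name is my own):

```python
def sees_result(i, j):
    """True iff instruction fetched in cycle j reads the value written
    by the instruction fetched in cycle i (i < j), given a 4-stage
    pipeline with one instruction fetched per cycle and no interlocks."""
    read_cycle = j + 1     # operand fetch happens during this cycle
    write_end = i + 3      # result is stored at the END of this cycle
    return read_cycle > write_end
```

Applied to the listing above: L1 is fetched in cycle 1, L2 in cycle 2, L3 in cycle 3, and L4 in cycle 4, which reproduces exactly the failures the diagram shows.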

Instruction L3 is a branch because it assigns to the program counter, yet instruction L4 (and the two instructions that would follow it) are still fetched, because L3 does not actually change the program counter until the end of clock cycle 6! The branch only takes effect at the end of the instruction's fourth clock cycle, three cycles after it is fetched. As a result, 3 more instructions will be executed in line after the branch before the branch actually occurs.

The most trivial way to make programs produce the results expected by programmers is to force the programmer to add no-ops between the instructions in order to allow the results to reach their intended destinations before they are used:

     L1: R1 = VA  -- VA is a variable in memory
     L2: R1 = VB  -- VB is a variable in memory
         NOP
         NOP
     L3: PC = R1
         NOP
         NOP
         NOP
     L4: R3 = R1 

If we put these in a pipeline diagram, the purpose of the no-ops becomes clear:

                               Time ------>
      _____ _____ _____ _____ _____ _____ _____ _____ _____ _ _ _ _ _ _ _ _
     |L1   |L2   |/////|/////|L3   |/////|/////|/////|     |      
     |     |     |/////|/////|     |/////|/////|/////|     |      
     |_____|_____|_____|_____|_____|_____|_____|_____||____|_ _ _ _ _
           |L1   |L2   |/////|/////|L3   |/////|/////||////|     |
           |     |     |/////|/////|=R1  |/////|/////||////|     |
        _ _|_____|_____|_____|_____||____|_____|_____||____|_____|__
                 |L1   |L2   |/////||////|L3   |/////||////|/////|     |
	         | =VA | =VB |/////||////|     |/////||////|/////|     |
      _ _ _ _ _ _|_ _ _|_____|_____||____|_____|_____||____|_____|_____|_
                       |L1   |L2   ||////|/////|L3   ||////|/////|/////|
                       |  R1=|  R1=|/////|/////|  PC=|/////|/////|/////|
          _ _ _ _ _ _ _|_ _ _|_ _ _|_ _ _|_____|_____|_____|_____|_____|_ _
        Clock cycle  1     2     3     4     5     6     7     8     9    10

Here, the diagonal bands of shading are the no-ops. Where one pipeline stage produces a result that is needed in the immediately following clock cycle by a later instruction, a double vertical line links the source of the data with its destination. So, at the end of clock cycle 3, the contents of R1 that were set in the result store stage are available for use during clock cycle 4 in the operand fetch stage, and the value of the program counter set at the end of clock cycle 6 will be used in the instruction fetch stage during clock cycle 7.

Pipeline Delay Slots

We used no-op instructions above for one purpose: to mark time between instructions so that a result would be available when it was needed. We refer to these intervals as delay slots, a term that derives from thinking of the time spent by the pipelined processor as being quantized into a series of slots, where some operation could be performed by each pipeline stage during each time slot. No-ops are inserted into the instruction stream to fill the time slots during which computation must be delayed to wait for some result.

A branch delay slot is a delay slot that is introduced because the branch does not take place immediately after the instruction containing that branch is fetched.

An operand delay slot is a delay slot that is introduced because an operand will not be available until some previous instruction reaches some later point in the pipeline.

Our example program required that we fill two operand delay slots and three branch delay slots.
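Those counts follow directly from the stage positions. Numbering the stages 0 (fetch) through 3 (result store), a register operand is read in stage 1 and written at the end of stage 3, while a branch target written in stage 3 is consumed by stage 0 of a later fetch. A small sketch (the function names and default arguments are my own):

```python
def operand_delay_slots(read_stage=1, write_stage=3):
    """No-ops needed between a producer and a consumer of a register:
    the consumer's read stage must come after the producer's write stage."""
    return write_stage - read_stage

def branch_delay_slots(fetch_stage=0, write_stage=3):
    """No-ops fetched after a branch before the new PC takes effect."""
    return write_stage - fetch_stage
```

These give the two operand delay slots and three branch delay slots our example required; a deeper pipeline would widen both gaps accordingly.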

Learning to Program All Over Again

Writing an efficient pipelined program requires thinking in terms of the pipeline from the start. We begin by tracing the data through the computation, and then we try to write the code so as to minimize the number of delay slots we must fill with no-ops. To do this, you must literally program with the pipeline diagram in hand.

Good compilers for pipelined processors do this routinely. The part of the compiler that does this is called the instruction scheduler, and its job is to find an order of computations that minimizes the number of delay slots that must be padded out with no-ops.
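The NOP-insertion half of such a scheduler is easy to sketch. The toy Python below (the instruction representation and the spacing constant are assumptions based on our 4-stage example) pads a program with just enough no-ops so that every instruction issues at least three slots after the producer of each register it reads. It handles operand delay slots only, not branch delay slots, and a real scheduler would also reorder independent instructions to fill slots with useful work rather than padding:

```python
NOP = ("nop", ())

def insert_nops(program, spacing=3):
    """program: list of (dest_register, source_registers) pairs in order.
    Returns the same program with NOPs inserted so that each consumer
    issues at least `spacing` slots after the producer of its sources."""
    out = []
    last_write = {}              # register -> issue slot of its last writer
    for dest, srcs in program:
        ready = 0
        for s in srcs:
            if s in last_write:
                ready = max(ready, last_write[s] + spacing)
        while len(out) < ready:
            out.append(NOP)      # pad delay slots with no-ops
        last_write[dest] = len(out)
        out.append((dest, srcs))
    return out
```

Run on the example program, this reproduces the two NOPs between L2 and L3; the three branch delay slots after L3 would require the scheduler to also know which instructions write PC.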

Very few pipelined processors actually expose the pipeline to the programmer. Instead, additional logic is added to detect when the succeeding instruction depends on an operand that has not yet been stored by the predecessor instruction. Whenever such a situation is detected, we say that there is an operand conflict, and we use interlock logic to inject the appropriate number of no-ops into the pipeline.
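The interlock check itself amounts to comparing the registers the next instruction reads against the destinations of instructions still in flight. In this toy Python version (a simplification of real interlock hardware; the spacing constant 3 comes from our 4-stage example), the result tells the hardware how many bubbles to inject before letting the reader proceed:

```python
def bubbles_needed(srcs, in_flight, spacing=3):
    """srcs: registers the next instruction reads.
    in_flight: list of (dest_register, cycles_ago_issued) pairs for
    instructions whose results are not yet stored.
    Returns the number of no-op bubbles to inject before issuing."""
    worst = 0
    for dest, ago in in_flight:
        if dest in srcs:                 # operand conflict detected
            worst = max(worst, spacing - ago)
    return worst
```

For our example, when L3 is about to read R1, L2 issued one cycle earlier and L1 two cycles earlier, so two bubbles are injected, matching the two no-ops the programmer would otherwise have written by hand.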

As a result, programmers never need to know anything about the structure of the pipelined CPU. Also as a result, however, naively written code may perform far worse than carefully crafted code in which the instructions have been reordered to minimize the number of conflicts. With pipelined processors, it is commonly the case that hand-crafted assembly language code is slower than the output of a modestly good compiler, simply because the compiler can reorder instructions without regard to the impact on readability, while the assembly language programmer is strongly motivated by the visual logic of the program to keep related instructions grouped together.

We will discuss the details of pipelined processors in far more detail in upcoming lectures!