22C:122/55:132, Notes, Lecture 23, Spring 2001

Douglas W. Jones
University of Iowa Department of Computer Science

  1. Thinking about Pipelines

    Consider a pipelined VLIW machine designed along the lines discussed in the previous lectures:

             _______________________________
            | ALU1 | ALU2 | Memory | Branch |
            |______|______|________|________|
    	
    We implement this on a Harvard architecture, with instructions fetched during the cycle prior to their execution. Furthermore, ALU1 uses a floating point pipeline with alignment, computation and normalization stages, while ALU2 executes only fast integer operations.
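
    In programming language terms, each instruction word of this machine bundles four independent sub-instructions, one per field. The following C sketch is purely illustrative; the field names and types are hypothetical, since these notes do not fix an encoding:

        /* An illustrative model of one VLIW instruction word; the
           fields are hypothetical, not a real encoding. */
        typedef struct {
            int alu1_op, alu1_dst, alu1_s1, alu1_s2; /* floating point */
            int alu2_op, alu2_dst, alu2_s1, alu2_s2; /* fast integer   */
            int mem_op,  mem_reg,  mem_index;        /* load or store  */
            int br_cond, br_target;                  /* branch         */
        } vliw_word;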

    Tracing the flow of one instruction through this machine clock cycle by clock cycle, we see that each instruction takes four clock cycles:

    1. Instruction fetch.
    2. Execute the ALU2, Memory and Branch sub-instructions;
       align the binary points for floating point add or subtract in ALU1.
    3. Do the ALU1 core operation.
    4. Normalize the ALU1 result.

    Conventional programming relies on the assumption that the results of each instruction are immediately available to the next. With a pipelined machine, on the other hand, the results of one instruction may not be available until several clock cycles after that instruction has been fetched.
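
    To make this timing concrete, here is a small C sketch, using our own convention of numbering both clock cycles and stages from the fetch, of when an instruction occupies each stage and when its ALU1 result first becomes usable:

        /* An instruction fetched in cycle c occupies stage s during
           cycle c + s, so its ALU1 result, written at the end of the
           normalize stage, is first usable in cycle c + 4. */
        enum stage { FETCH, EXECUTE, ALU1_OP, NORMALIZE };

        int cycle_in_stage(int c, enum stage s) { return c + (int) s; }

        int alu1_result_ready(int c) { return c + NORMALIZE + 1; }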

  2. Pipeline Diagrams

    A pipeline diagram shows the pipeline stages on one axis and time on the other. For each clock cycle, the operands currently being processed by each pipeline stage are documented. Such diagrams are extremely useful tools for keeping track of what is going on where in a pipelined computation.

    Pipeline diagrams are usually presented with time as the horizontal axis, although there are contexts where it is convenient to present time as the vertical axis. For our example machine, the pipeline diagram for the execution of one instruction would usually be presented as follows:

                                   Time ------>
               _ _ _ _ _ _ _ _ _ _____ _ _ _ _ _ _ _ _ _ _ _ _
    	fetch               |     |
               _ _ _ _ _ _ _ _ _|_____|_____ _ _ _ _ _ _ _ _ _
    	Execute (except ALU1)     |     |
               _ _ _ _ _ _ _ _ _|_ _ _|_____|_____ _ _ _ _ _ _
    	Do ALU1 operation               |     |
               _ _ _ _ _ _ _ _ _|_ _ _|_ _ _|_____|_____ _ _ _
    	Normalize ALU1 result                 |     |
               _ _ _ _ _ _ _ _ _|_ _ _|_ _ _|_ _ _|_____|_ _ _
                                |     |     |           |
               PC must be set ->|     |     |           |<- ALU1 result
               prior to this.   |     |     |               available for
                                      |     |               next operation.
                   Operands for all ->|     |<- ALU2 result available
                  instructions must             memory reference completed
                    be available by             branch takes effect.
                         this time.
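
    The bookkeeping behind such diagrams is mechanical enough that it is easy to automate. The following C sketch (the stage names and layout are ours, chosen to echo the figure above) prints which instruction occupies each stage in each clock cycle, assuming one instruction enters the pipe per cycle:

        #include <stdio.h>

        #define STAGES 4
        static const char *name[STAGES] =
            { "fetch    ", "execute  ", "ALU1 op  ", "normalize" };

        /* Instruction i (numbered from 0) occupies stage s during
           cycle i + s, so at time t, stage s holds instruction t - s. */
        void diagram(int n)              /* n = number of instructions */
        {
            for (int s = 0; s < STAGES; s++) {
                printf("%s", name[s]);
                for (int t = 0; t < n + STAGES - 1; t++) {
                    int i = t - s;
                    if (i >= 0 && i < n)
                        printf("|I%-3d", i + 1);
                    else
                        printf("|    ");
                }
                printf("|\n");
            }
        }

        int main(void)
        {
            diagram(3);     /* the three-instruction case traced below */
            return 0;
        }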
    	
    This diagram makes the problem concrete: the results of one instruction may not be available until several clock cycles after that instruction has been fetched, so the conventional assumption that each instruction's results are immediately available to the next no longer holds.

    Consider the following sequence of instructions that were written to illustrate the example VLIW architecture prior to introducing the possibility of pipelined execution; this is the loop used in computing the vector dot product:

        -- ALU1          ALU2          memory      branch
         L1:Rc = Ra * Rb; Rj = Rj - R1; Ra = a[Ri];     -
         L2:Rt = Rt + Rc; Ri = Ri + R1; Rb = b[Ri]; if Rj > 0, goto L1;
         L3: ??
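
    For reference, here is the computation this loop carries out, written in plain C with the pipeline ignored; the mapping of registers to variables is the obvious one (Rt holds the running total, Ri the index, and Rj the downward counter):

        /* The vector dot product that the VLIW loop implements;
           a plain C rendering, assuming n elements of type double. */
        double dot(const double a[], const double b[], int n)
        {
            double t = 0.0;              /* Rt, the running total  */
            for (int i = 0; i < n; i++)  /* Ri counts up, Rj down  */
                t += a[i] * b[i];        /* Rc = Ra * Rb; Rt += Rc */
            return t;
        }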
    	
    If we use a pipeline diagram to examine the interactions among these instructions, we see the following:
                                   Time ------>
               _ _ _ _ _ _ _ _____ _____ _____ _ _ _ _ _ _ _ _
    	fetch           |L1   |L2   |L3   |      
    	                |     |     |     |      
               _ _ _ _ _ _ _|_____|_____|_____|___ _ _ _ _ _ _
    	Execute               |Rj=  |Ri=  |??
               (except ALU1)      |Ra=  |Rb=  |
                                  |     |PC=L1|
               _ _ _ _ _ _ _|_ _ _|_____|_____|_____ _ _ _ _ _
    	Do ALU1 operation           |Ra*Rb|Rt+Rc|      
    	                            |     |     |
               _ _ _ _ _ _ _|_ _ _|_ _ _|_____|_____|_____ _ _
    	Normalize ALU1 result             |Rc=  |Rt=  |
    	                                  |     |     |
               _ _ _ _ _ _ _|_ _ _|_ _ _|_ _ _|_____|_____|_ _
    	
    This diagram exposes a multitude of problems! First, the control structure does not work as expected, because the branch that was part of instruction L2 does not take effect until after instruction L3 has been fetched.

    The second problem is that the value in Rc used by instruction L2 is fetched (during the Execute phase) one clock cycle before instruction L1 updates it (at the end of the Normalize phase), so L2 adds a stale value to Rt.

    The simplest solution to these problems is to rewrite the code with no-ops added between instructions as padding:

        -- ALU1          ALU2          memory      branch
         L1:Rc = Ra * Rb; Rj = Rj - R1; Ra = a[Ri];     -
         L1a:   -             -             -           -
         L1b:   -             -             -           -
         L2:Rt = Rt + Rc; Ri = Ri + R1; Rb = b[Ri]; if Rj > 0, goto L1;
         L2a:   -             -             -           -
    	
    Using a pipeline diagram makes the need for these clear:
                                   Time ------>
               _ _ _ _ _____ _____ _____ _____ _____ _____ _ _ _ _ _
    	fetch     |L1   |L1a//|L1b//|L2   |L2a//|L1   |     |
    	          |     |/////|/////|     |////||     |
               _ _ _ _|_____|_____|_____|_____|____||_____|_____ _ _
    	Execute         |Rj=  |/////|/////|Ri= ||/////|Rj=  |
               (except ALU1)|Ra=  |/////|////||Rb= ||/////|Ra=  |
                            |     |/////|////||PC=L1|/////|     |
               _ _ _ _|_ _ _|_____|_____|____||_____|_____|_____|___
    	Do ALU1 operation     |Ra*Rb|////||/////|Rt+Rc|/////|
    	                      |     |////||/////|     |/////|
               _ _ _ _|_ _ _|_ _ _|_____|____||_____|_____|_____|___
    	Normalize ALU1 result             |Rc=  ||////|/////|Rt=  |///
    	                                  |     |/////|/////|     |///
               _ _ _ _|_ _ _|_ _ _|_ _ _|_____|_____|_____|_____|___
    	
    Here, the diagonal bands of shading are the no-ops. Where one pipeline stage produces a result that is needed in the immediately following clock cycle by a later instruction, a double vertical line links the source of the data with the destination.

    So, the final pipeline stage of instruction L1 saves a result in Rc that is used, in the immediately following clock cycle, by the Execute stage of instruction L2. Likewise, the Execute stage of instruction L2 updates the PC (if the branch is taken), allowing L1 to be fetched in the next clock cycle for the next iteration of the loop.

  3. Delay Slots

    We used instructions made entirely of no-ops above for one purpose: to mark time between instructions so that a result would be available when it was needed. We refer to these intervals as delay slots, a term that derives from thinking of the time spent by each pipeline unit as being quantized into a series of slots, where some operation could be performed in each time slot. No-ops are inserted into the instruction stream to fill the time slots during which computation should be delayed in order to wait for some result.

    A branch delay slot is a delay slot that is introduced because the branch does not take place immediately after the instruction containing that branch is fetched.

    An operand delay slot is a delay slot that is introduced because an operand will not be available until some previous instruction reaches some later point in the pipeline.

    Our example program required that we fill two operand delay slots and one branch delay slot.
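
    The number of delay slots required can be read off from the pipeline structure. As a minimal sketch (our own formula, numbering stages from fetch = 1 and assuming one instruction issues per cycle, with results written at the end of their stage):

        /* If a result is written at the end of stage w and read during
           stage r of a later instruction, the consumer must be fetched
           at least w - r + 1 cycles after the producer, so w - r no-op
           slots must separate them. */
        int delay_slots(int write_stage, int read_stage)
        {
            return write_stage - read_stage;
        }

    For our machine, delay_slots(4, 2) = 2 for an ALU1 result (written by Normalize, read by Execute), and delay_slots(2, 1) = 1 for a branch (the PC is written by Execute and read by the fetch stage), in agreement with the counts above.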

  4. Learning to Program All Over Again

    Writing an efficient pipelined program requires thinking in terms of the pipeline from the start. We begin by tracing the data flow of one set of operands:

                                   Time ------>
               _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
    	fetch     |     |     |     |     |     |     |
    	          
               _ _ _ _|_____|_____|_____|_____|_ _ _|_ _ _|_____|_
    	Execute   |Ri=  |     |     |     |     |     |     |
                      | Ri+1|Ra=  |Rb=  | Ra  |           | Rc  |
                      |     | a[Ri| b[Ri| Rb  |           | Rt  |
               _ _ _ _|_____|_____|_____|_____|_____|_ _ _|_____|_____|_
    	ALU1 operation  |     |     |     |Ra*Rb|     |     |Rt+Rc|
    	                                  |     |           |     |
               _ _ _ _|_ _ _|_ _ _|_ _ _|_ _ _|_____|_____|_ _ _|_____|_____|_
    	Normalize |     |     |     |     |     |Rc=  |     |     |Rt=  |
    	                                        |     |           |     |
               _ _ _ _|_ _ _|_ _ _|_ _ _|_ _ _|_ _ _|_____|_ _ _|_ _ _|_____|_
    	
    This pipeline diagram includes only the computations required during one pass through the logical loop in our computation. The challenge now is to "roll the diagram up" so that we can keep more functional units busy. We need a bit more resolution in our diagram to see what rollings and unrollings will work:
                                   Time ------>
               _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
    	fetch     |     |     |     |     |     |     |
    	          
               _ _ _ _|_____|_____|_____|_____|_ _ _|_ _ _|_____|_
    	Execute   |     |     |     |     |     |     |     |
    	  M       |     |Ra=a[|Rb=b[|     |           |     |
    	  ALU2    |Ri+= |     |     |     |           |     |
              ALU1    |     |     |     | RaRb|           | RcRt|
               _ _ _ _|_____|_____|_____|_____|_____|_ _ _|_____|_____|_
    	ALU1 operation  |     |     |     |Ra*Rb|     |     |Rt+Rc|
    	                                  |     |           |     |
               _ _ _ _|_ _ _|_ _ _|_ _ _|_ _ _|_____|_____|_ _ _|_____|_____|_
    	Normalize |     |     |     |     |     |Rc=  |     |     |Rt=  |
    	                                        |     |           |     |
               _ _ _ _|_ _ _|_ _ _|_ _ _|_ _ _|_ _ _|_____|_ _ _|_ _ _|_____|_
                   1     2     1     2     1     2     1     2     1     2     1
    	
    Here, it is clear that no pipeline stage of any functional unit is needed for more than two operations per iteration of the loop, so the minimum length of the loop body is 2 instructions.
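
    This resource argument can be phrased as a simple calculation. In the C sketch below (our own illustration, assuming each pipeline stage of each functional unit accepts one operation per instruction slot), the busiest stage bounds how short the loop body can be:

        /* Lower bound on the rolled-up loop body length: the maximum,
           over all pipeline stages of all functional units, of the
           number of times one loop iteration uses that stage. */
        int min_loop_body(const int uses[], int nstages)
        {
            int bound = 1;
            for (int s = 0; s < nstages; s++)
                if (uses[s] > bound)
                    bound = uses[s];
            return bound;
        }

    Here the memory port and each ALU1 stage are used twice per iteration and ALU2 only once, so the bound is 2.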

    Miniaturizing this diagram, we can try to schedule this 2-instruction loop:

                                   Time ------>
               _ _ _ _ __ __ __ __ __ __ __ __ __ __ __ __ __
    	fetch     |  |  |  |  |  |  |  |  |  |  |  |  |  |
               _ _ _ _|__|__|__|__|__|__|__|__|__|__|__|__|__|__
    	Execute   |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
    	  M       |  |  |X |X |+ |+ |* |* |. |. |: |: |  |  |
    	  ALU2    |  |X |  |+ |  |* |  |. |  |: |  |  |  |  |  |
              ALU1    |  |  |  |  |X |  |+ |X |* |+ |. |* |: |. |  |:
               _ _ _ _|__|__|__|__|__|__|__|__|__|__|__|__|__|__|__|__|__|_
    	ALU1 operation  |  |  |  |X |  |+ |X |* |+ |. |* |: |. |  |: |
               _ _ _ _|__|__|__|__|__|__|__|__|__|__|__|__|__|__|__|__|__|__|_
    	Normalize |  |  |  |  |  |  |X |  |+ |X |* |+ |. |* |: |. |  |: |
               _ _ _ _|__|__|__|__|__|__|__|__|__|__|__|__|__|__|__|__|__|__|_
                          \                  /|     |\                      /
                               prologue       \     /        epilogue
                                         repeating pattern
    	
    Here, we have used X to indicate the flow of computation for the first logical iteration of the loop. All later iterations are indicated by other marks. The iterations are clearly overlapped! "Filling the pipeline" takes a loop prologue 7 instructions long, and "emptying the pipe" takes an epilogue 5 instructions long.

    Unfortunately, the suggested loop body actually accumulates two independent sums! On even iterations, it sums one set of values, while on odd iterations, it sums another. Both of these are nominally in Rt, but during any iteration, one of these sums will be in the process of being computed, hidden away in the pipelined ALU, while the other sum lands in a register ready to be added to in the next iteration. This makes it extremely difficult to construct appropriate prologues and epilogues for this loop!
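
    In a higher level language, the effect of the rolled-up loop is that of the following sketch (our own C rendering, assuming for simplicity that the element count n is even); the two sums that the hardware hides in the pipeline appear explicitly as t0 and t1:

        /* What the 2-instruction loop body really computes: two
           interleaved partial sums, combined only at the end. */
        double dot2(const double a[], const double b[], int n)
        {
            double t0 = 0.0, t1 = 0.0;      /* the two hidden sums */
            for (int i = 0; i < n; i += 2) {
                t0 += a[i]     * b[i];      /* even iterations */
                t1 += a[i + 1] * b[i + 1];  /* odd iterations  */
            }
            return t0 + t1;                 /* the epilogue's job  */
        }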

    Crafting assembly language code for this type of machine is beyond the ability of most programmers, so extremely complex and interesting compilers are essential. What we want is a design for a pipelined machine that allows programmers to largely ignore the pipeline while profiting, as much as possible, from its presence.