28. Pipeline Bubbles and Branches

Part of the 22C:122/55:132 Lecture Notes for Spring 2004
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science


In the previous lectures, we have discussed a pipelined machine based on the following model:

				       ______      _____
	 IF Instruction Fetch         |______|----|     |
				  _    |____|     |     |
	 AC Address Computation  | |--|______|    |     |
				 | |   |____|     |     |
	 OF Operand Fetch        | |--|______|----|  M  |
				 |R|   |____|     |     |
	 OP Operate              | |  |______|    |     |
				 | |   |____|     |     |
	 OS Operand Store        |_|--|______|----|_____|

We discussed operand conflict detection logic, or interlock logic, that could detect the need to stall some pipeline stages, injecting a bubble into the pipe in order to allow that stage to wait for earlier instructions in the pipe to produce a needed result. The result we want is illustrated here:

                    ____ ____ ____ ____ ____
        LOAD R1,A  |_IF_|_AC_|_OF_|_OP_|*OS*|____ ____ ____
        STORE R1,B      |_IF_|_AC_|////|////|*OF*|_OP_|_OS_|
        ANYTHING             |_IF_|////|////|_AC_|_OF_| ...
        ANYTHING                            |_IF_|_AC_| ...

At the end of the notes for the previous lecture, we covered the conditions we need to detect in order to determine when to delay an operation. In short, if a pipeline stage needs, as inputs, data from a register or other storage location that will be changed by any instruction that is farther along in the pipe, that stage will stall.

Our goal, now, is to develop a mechanism to stall a pipeline stage and introduce a bubble once we detect the need to do this:

Blocking the Pipe

Recall that all of the interstage registers of a pipelined processor are clocked in parallel:

                             |   |______| Instruction Fetch
                             |   |______| Address Computation
                             |   |______| Operand Fetch
        clock                |   |______| Operate
                                 |______| Operand Store

To stall a pipeline stages, all we need to do is to stop the clock pulses from reaching that stage and all of the earlier stages. If our only concern is delaying the operand fetch stage, for example, we might do the following:

                             |   |______| Instruction Fetch
        OF stall        ___  |   |_____ | Address Computation
                     --|___|     |______| Operand Fetch
                    |         ----|>___|
        clock       |        |   |______| Operate
                                 |______| Operand Store

In the above, if the operand-fetch-stall signal is false, both halves of the pipeline are clocked together. If, on the other hand, the need for a stall is detected, the clock to the upper half of the pipeline will be stopped until the condition is no longer true.

Our machine actually requires a second interlock for address computation stage, so we must use the following logic: to do is to stop the clock to those stages, as:

        AC stall          ___      ____ 
           -------------O|   |  --|>pc_|
                     ---O|AND| | |______| Instruction Fetch
                    |  --|___|-o--|>___|
        OF stall    | |   ___    |_____ | Address Computation
                      o--|___|   |______| Operand Fetch
                      |         --|>___|
        clock         |        | |______| Operate
                                 |______| Operand Store

Here, if OF-stall is true, the operand fetch, address computation and instruction fetch stages are all stopped, while if the AC-stall signal is true, we just stop the address computation and instruction fetch stages, allowing later stages to continue.

Why didn't we do this:

        AC stall                    ___   --|>pc_|
           -----------------------O|AND| | |______| Instruction Fetch
        OF stall          ___   |          |_____ | Address Computation
                       --|___|             |______| Operand Fetch
                      |                   --|>___|
        clock         |                  | |______| Operate
                                           |______| Operand Store

The answer is, this introduces successively more delay in clock cycles to upstream stages of the pipe, and we really want all pipeline stages to be clocked at the same time. In fact, we are likely to build the real system like this so that all pipeline stages are clocked at as close to identically the same moment as we can arrange:

        AC stall          ___      ____ 
           -------------O|   |  --|>pc_|
                     ---O|AND| | |______| Instruction Fetch
                    |  --|___|-o--|>___|
        OF stall    | |   ___    |_____ | Address Computation
                      o--|___|   |______| Operand Fetch
                      |         --|>___|
        clock         |   ___  | |______| Operate
                    1 ---|___|

Injecting a Bubble

Stalling upstream pipeline stages using the logic outlined above is not sufficient! This is because the pipeline stages below the stopped stages are still running, and they will process whatever data is handed to them, taking incorrect results from the blocked pipeline stage when those results should be ignored.

To solve this, we must inject the equivalent of NOP operations into the pipe. We could do this by changing the op-code field of the instructions being passed downstream into NOP opcodes, or we can do it by adding one bit to each downstream interstage register to indicate whether the contents of that stage are valid or invalid.

The complete logic for injecting a bubble into one pipeline stage is thus:

	lock requests
           to earlier
               stages  clock to other stages
                  | |  |   ___
                  o-|--|-O|   |  |  previous stage  |
                  | o--|-O|AND|  |__________________|
                  | |  o--|___|---|>__OP___|__etc__|
        need lock | |  |               |     __|__
           -------o-|--|------     NOP o----|
                    |  |      |     _|_|_   | Logic for this
        clock       |  |       ----\ 1 0 /  | pipeline stage
           ---------|--o            \___/   |
                    |  |   ___        |     |_____
                    o--|-O|AND|    ____|___ ___|___
                    |  o--|___|---|>__OP___|__etc__|
                    |  |         |                  |
                    |  |         |    next stage    |
                    |  |
        lock requests  clock to other stages
           from later

The multiplexor shown for injecting a no-operation opcode may be simplified if the NOP opcode is all zeros, as it is with our proposed instruction encoding, where setting the least significant 4 bits of the opcode to zero makes that opcode seek no register or memory values nor store anything; in this case, the multiplexor becomes just 4 and gates on these lines of the opcode field.

Notice that the logic for this pipeline stage may continue to perform all normal combinatorial operations, producing results for the next stage at every clock cycle. These results should have no effect if the valid bit is reset.

What matters is that, when an invalidated or no-op instruction arrives at a pipeline stage, that stage must not perform store operations to any register or memory location, and it should not assert any requests for data from any registers or memory unless these can be met without causing conflicts with other resource users. Thus, outputs such as "needs memory" must be anded with the valid signal.

Finally, we need to pay attention to opcode validity in checking for the need for an interlock. Thus, for example, the final interlock condition give in the previous lecture must be rewritten as:

             and OF.might_write_register
             and AC.r = OF.r
	     and AC.op != NOP  -- may need to be added
	     and OF.op != NOP  -- may need to be added
             and OP.might_write_register
             and AC.r = OP.r
	     and AC.op != NOP  -- may need to be added
	     and OP.op != NOP  -- may need to be added
             and OS.write_register
             and AC.r = OS.r
	     and AC.op != NOP  -- may need to be added
	     and OP.op != NOP  -- may need to be added
             interlock, blocking the IF and AC stages

Note that for our proposed instruction encoding, the encoding of a NOP instruction is 4 bits of zero in the least-significant bits of the opcode field, and the need_register and write_register or might_write_register conditions are indicated by single bits in this 4-bit field, so we need not include tests for NOP explicitly.


We now have the mechanisms in place that we need to implement branches! Consider what the pipeline must look like as a branch flows through:

                    ____ ____ ____ ____ ____
        BRANCH X   |_IF_|_AC_|_OF_|_OP_|*OS*|____
        SOMETHING       |_IF_|_AC_|_OF_|_OP_|////|____
        SOMETHING            |_IF_|_AC_|_OF_|////|////|____
        SOMETHING                 |_IF_|_AC_|////|////|////|____
        SOMETHING                      |_IF_|////|////|////|////|____
        X: SOMETHING                        |_IF_|_AC_|_OF_|_OP_|_OS_|____

Here, we assume that the branch instruction takes effect when the operand store stage of the pipeline is reached; this is the natural implementation if we assume that the program counter is R15, as suggested in the notes for the previous lecture.

By the time the operand store stage of the branch instruction is executed, the 4 instructions following it will have been fetched and fed into the pipe, even if the data in these memory locations are not intended to be instructions! Therefore, if we don't want to force the programmer to face the whole idea of branch delay slots, we should cancel the partially executed results of these instructions when the branch instruction finally changes the PC register!

The tools we introduced to solve the pipeline interlock problem do exactly what we need for this! When the operand store stage of the pipeline detects that it is executing a branch, it must invalidate the contents of all of the upstream pipeline stages. It can do this by forcing the opcode interstage registers to NOP or by clearing the valid bit (two alternative and equivalent ways of doing the same thing).

                       |         |  previous stage  |
                       |         |__________________|
                       |               |     __|__
                       |           NOP o----|
                       |   ___      _|_|_   | Logic for this
        need lock  ----|--| OR|----\ 1 0 /  | pipeline stage
                       o--|___|     \___/   |
                       |              |     |_____
                       |           ____|___ ___|___
                       |         |                  |
                       |         |    next stage    |

The above diagram assumes that some later pipeline stage asserts the branch signal as it assigns to the PC. The consequence of this is to invalidate every interstage register in the entire pipeline.

The branch logic itself involves simple changes to the instruction fetch stage of the pipeline:

          Part of                   |  ------     |
          Instruction              _|_|_    _|_   |
          Fetch Stage         -----\1_0/   |+1 |  |
                             |       |     |___|  |
                             |       |       |    |
                             |       |       o--- | ----> memory address
        clock           ___  |     __|__     |    |       for instruction
	   ------------|AND|------|>_PC_|    |    |       fetch
                     -O|___| |       |       |    |
                   _|_       |       o-------     |
                  |AND|      |       |            |
                  |___|      |       |            |
                   | O       |       |            |
                   | |       |       |            |
                   |  -------o       |            |
                   |         |       |            |
        lock requests      branch           branch address
           from later                       from some later
               stages                       stage

Note that if a later stage asserts a branch, we always take it, even if the instruction fetch stage is blocked by a bubble ahead of it in the pipe. This is because the branch will also be deleting from the pipe all of the instructions that raised interlock conditions. In our example architecture, both the branch signal and the branch address come from the operand store stage at the very end of the pipe.

This logic allows the instructions after the branch to begin execution as they flow into the pipe. Only when the branch reaches the end of the pipe does it cancel the instructions behind it. If the branch is conditional, the next post-branch instruction will be finished immediatly after the branch is ignored. Only if the branch is taken will any instructions be cancelled. This is called speculative execution, and it has a significant performance benefit over the alternative of stalling the pipe until the branch can be finished.

Skip Instructions

Notice that, using the logic outlined above, every branch instruction invalidates all of the instructions in the pipe. Thus, even a branch to the next instruction, the one that would be executed anyway had the branch not been there, would invalidate everything.

The desire to allow conditionals to be executed without such an expense has led to the reintroduction of a class of instructions that was common on first and second generation computers but largely absent from classical third generation CPU's, the so-called skip instructions.

On a first generation architecture, as typified by the DEC PDP-8 (a machine that was introduced using second-generation technology in 1965, but remained in production in various forms until arount 1990), when a skip condition was met, the program counter was simply incremented.

For example, the "skip if accumulator negative" instruction would be implemented, inside the CPU's execute phase by sequential logic such as:

	if AC < 0 then PC = PC + 1
This logic clearly isn't applicable to pipelined machines, but with the valid bit in the pipe, we can do something remarkably easy. If no instruction has any side effects before the final pipeline stage, the store stage, then we can skip an instruction in the sequence by simply invalidating it at any time before the store. Thus, all we need to do to implement a skip instruction is to have that instruction invalidate its successor in the pipe! (The successor in the pipe is the instruction in the previous pipeline stage, unless that stage holds a bubble.)

Thus, at first analysis, an instruction stream containing, as its only control structures, skip instructions, will execute at exactly the same speed on a pipelined processor when the skips are taken as it does when the skips are not taken.

That was a first analysis! Invalidating an instruction may release an interlock condition that depended on that instruction, and this may accelerate some later instruction that was tentatively blocked by the skipped instruction!