22C:122/55:132 Notes, Lecture 27, Spring 2001

Douglas W. Jones
University of Iowa Department of Computer Science

Bubbles in the Pipe
We will continue to assume the one-and-a-half address pipelined processor from the two previous lectures, with the following stages:
- IF - instruction fetch
- AC - address computation
- OF - operand fetch
- OP - operate
- OS - operand store
Our goal is to build interlock logic so that, when code attempts to use an operand before it is available, the instruction that required the operand and all instructions upstream in the pipe are stopped and a bubble is allowed into the pipe until the result may be obtained. For example:
```
                    ____ ____ ____ ____ ____
        LOAD R1,A  |_IF_|_AC_|_OF_|_OP_|*OS*|____ ____ ____
        STORE R1,B      |_IF_|_AC_|////|////|*OF*|_OP_|_OS_|
        ANYTHING             |_IF_|////|////|_AC_|_OF_| ...
        ANYTHING                            |_IF_|_AC_| ...
        
```
At the end of the notes for the previous lecture, we covered the conditions we need to detect in order to determine when to delay an operation. In short, if a pipeline stage needs, as inputs, data from a register or other storage location that will be changed by any instruction that is farther along in the pipe, that stage will stall.
For example, if the operand fetch stage needs the contents of a register that will be changed by the instruction currently in the operate or operand store stage, the operand fetch stage will stall, blocking all computation in the address computation and instruction fetch stages until the register in question has been updated. We say this introduces a bubble in the pipe because, if it was water flowing through the pipe and we blocked the pipe at some point, we might have to inject air to allow the flow to continue out of the pipe.
In order to solve this problem, we need a mechanism for blocking a stage of the pipeline!

Blocking the Pipe

Recall that all of the interstage registers of a pipelined processor are clocked in parallel:

                                   ____ 
                              ----|>pc_|
                             |   |______| Instruction Fetch
                             o----|>___|
                             |   |______| Address Computation
                             o----|>___|
                             |   |______| Operand Fetch
                             o----|>___|
        clock                |   |______| Operate
           ------------------o----|>___|
                                 |______| Operand Store

To freeze the contents of some pipeline stages, all we need to do is to stop the clock pulses from reaching those stages. If our only concern is delaying the operand fetch stage, for example, we might do the following:

                                   ____ 
                              ----|>pc_|
                             |   |______| Instruction Fetch
                             o----|>___|
        operand lock    ___  |   |_____ | Address Computation
           -----------o|AND|-o----|>___|
                     --|___|     |______| Operand Fetch
                    |         ----|>___|
        clock       |        |   |______| Operate
           ---------o--------o----|>___|
                                 |______| Operand Store

In the above, if the operand-interlock signal is false, both halves of the pipeline are clocked together. If, on the other hand, an interlock condition is detected, the clock to the upper half of the pipeline will be stopped until the condition is no longer true.

Our machine actually requires a second interlock for address computation stage, so we must use the following logic: to do is to stop the clock to those stages, as:

        ad lock           ___      ____ 
           -------------o|   |  --|>pc_|
                     ---o|AND| | |______| Instruction Fetch
                    |  --|___|-o--|>___|
        op lock     | |   ___    |_____ | Address Computation
           ---------o-|-o|AND|----|>___|
                      o--|___|   |______| Operand Fetch
                      |         --|>___|
        clock         |        | |______| Operate
           -----------o--------o--|>___|
                                 |______| Operand Store

Here, if op-lock is true, the operand fetch, address computation and instruction fetch stages are all stopped, while if the ad-lock signal is true, we just stop the address computation and instruction fetch stages, allowing later stages to continue.

Blocking the Pipe

Stopping the pipe using the logic outlined above is not sufficient! The reason is, the pipeline stages below the stopped stages are still running, and they will process whatever data is handed to them, taking incorrect results from the blocked pipeline stage when those results should be ignored.

To solve this, we must inject the equivalent of NOP operations into the pipe. We could do this by changing the op-code field of the instructions being passed downstream into NOP opcodes, or we can do it by adding one bit to each downstream interstage register to indicate whether the contents of that stage are valid.

The complete logic for injecting a bubble into one pipeline stage is thus:

        lock requests
           to earlier
               stages  clock to other stages
                  | |  |   ___
                  o-|--|-o|   |  |  previous stage  |
                  | o--|-o|AND|  |__________________|
                  | |  o--|___|---|>_valid_|__etc__|
        need lock | |  |               |     __|__
           -------o-|--|-------------  o----|
                    |  |             | |    | Logic for this
        clock       |  |             o_|    | pipeline stage
           ---------|--o            |AND|   |
                    |  |   ___      |___|   |_____
                    o--|-o|AND|    ____|___ ___|___
                    |  o--|___|---|>_valid_|__etc__|
                    |  |         |                  |
                    |  |         |    next stage    |
                    |  |
        lock requests  clock to other stages
           from later
               stages

Notice that the logic for this pipeline stage may continue to perform all normal combinatorial operations, producing results for the next stage at every clock cycle. These results should have no effect if the valid bit is reset.

Also, and this is extremely important, if the valid bit on the input to a pipeline stage is false, that stage must not perform any store operations to any register or memory location, and it should not assert any requests for data from any registers or memory unless these can be met without causing conflicts with other resource users. Thus, outputs such as "needs memory" must be anded with the valid signal.

Finally, the valid signal must itself be accounted for in determining the interlock conditions. Thus, for example, the final interlock condition give in the previous lecture must be rewritten as:

        if
                 AC.need_register
             and OF.might_write_register
             and AC.r = OF.r
	     and AC.valid       -- added
	     and OF.valid       -- added
        or
                 AC.need_register
             and OP.might_write_register
             and AC.r = OP.r
	     and AC.valid       -- added
	     and OP.valid       -- added
        or
                 AC.need_register
             and OS.write_register
             and AC.r = OS.r
	     and AC.valid       -- added
	     and OS.valid       -- added
        then
             interlock, blocking the IF and AC stages
        
 Branches

	We now have the mechanisms in place that we need to implement
	branches!  Consider what the pipeline must look like as a branch
	flows through:
        
                    ____ ____ ____ ____ ____
        BRANCH X   |_IF_|_AC_|_OF_|_OP_|*OS*|____
        SOMETHING       |_IF_|_AC_|_OF_|_OP_|////|____
        SOMETHING            |_IF_|_AC_|_OF_|////|////|____
        SOMETHING                 |_IF_|_AC_|////|////|////|____
        SOMETHING                      |_IF_|////|////|////|////|____
        X: SOMETHING                        |_IF_|_AC_|_OF_|_OP_|_OS_|____
        
	By the time the operand store stage of the branch instruction
	is executed, the 4 instructions following it will have been
	fetched and fed into the pipe, even if the data in these memory
	locations are not intended to be instructions!  Therefore,
	the partially executed results of these instructions must be
	cancelled when the branch instruction finally changes the PC
	register!
	
	The valid bit we just added to each interstage register offers us
	the tool we need to accomplish this!  Thus, our interstage register
	logic becomes:
	
                             |               
                             |   |  previous stage  |
                             |   |__________________|
                        -----|----|>_valid_|__etc__|
        need lock            |         |     __|__
          -------------------|-------  o----|
                             o-----  | |    |
                             |     | | |    | Logic for this
                             |     o_o_|    | pipeline stage
                             |    |  AND|   |
                             |    |_____|   |_____
                             |     ____|___ ___|___
                        -----|----|>_valid_|__etc__|
                             |   |                  |
                             |   |    next stage    |
                             |
                           branch
	
	The above diagram assumes that some later pipeline stage asserts
	the branch signal as it assigns to the PC.  The consequence of this
	is to invalidate every interstage register in the entire pipeline.
	
	The branch logic itself is simple:
	
                                     -------------
          Part of                   |  ------     |
          Instruction              _|_|_    _|_   |
          Fetch Stage         -----\1_0/   |+1 |  |
                             |       |     |___|  |
                             |       |       |    |
                             |       |       o--- | ----> memory address
        clock           ___  |     __|__     |    |       for instruction
	   ------------|AND|------|>_PC_|    |    |       fetch
                     -o|___| |       |       |    |
                   _|_       |       o-------     |
                  |AND|      |       |            |
                  |___|      |       |            |
                   | o       |       |            |
                   | |       |       |            |
                   |  -------o       |            |
                   |         |       |            |
        lock requests      branch           branch address
           from later                       from some later
               stages                       stage
	
	Note that if a later stage asserts a branch, we always take it,
	even if the instruction fetch stage is blocked by a bubble ahead
	of it in the pipe.  In our example architecture, both the branch
	signal and the branch address come from the operand store stage
	of the pipe.
	
	This logic allows the instructions after the branch to begin
	execution as they flow into the pipe.  Only when the branch reaches
	the end of the pipe does it cancel the instructions behind it.
	If the branch is conditional, the next post-branch instruction will
	be finished immediatly after the branch is ignored.  Only if the
	branch is taken will any instructions be cancelled.  This is called
	speculative execution, and it has a significant performance benefit
	over the alternative of stalling the pipe until the branch can be
	finished.

 Skip Instructions

	Notice that, using the logic outlined above, every branch
	instruction invalidates all of the instructions in the pipe.
	Thus, even a branch to the next instruction, the one that would
	be executed anyway had the branch not been there, would invalidate
	everything.
	

	The desire to allow conditionals to be executed without such an
	expense has led to the reintroduction of a class of instructions
	that was common on first and second generation computers but
	largely absent from classical third generation CPU's, skip
	instructions.
	

	On a first generation architecture, as typified by the DEC PDP-8
	(a machine that was introduced using second-generation technology
	in 1965, but remained in production in various forms until arount
	1990), when a skip condition was met, the program counter was
	simply incremented.
	

	For example, the "skip if accumulator negative" instruction
	would be implemented, inside the CPU's execute phase by sequential
	logic such as:
	
	if AC < 0 then PC = PC + 1
	
	This logic clearly isn't applicable to pipelined machines, but with
	the valid bit in the pipe, we can do something remarkably easy.
	If no instruction has any side effects before the final pipeline
	stage, the store stage, then we can skip an instruction in the
	sequence by simply invalidating it at any time before the store.
	Thus, all we need to do to implement a skip instruction is to have
	that instruction invalidate its successor in the pipe!
	
	Thus, at first analysis, an instruction stream containing, as
	its only control structures, skip instructions, will execute at
	exactly the same speed on a pipelined processor when the skips
	are taken as it does when the skips are not taken.
	

	That was a first analysis!  Invalidating an instruction may release
	an interlock condition that depended on that instruction, and this
	may accelerate some later instruction that was tentatively blocked
	by the skipped instruction.