22C:122, Lecture Notes, Lecture 26, Fall 1999

Douglas W. Jones
University of Iowa Department of Computer Science

  1. Bubbles in the Pipe

    If we continue to assume the pipelined processor from the previous lectures:

    Our goal is to build interlock logic so that, when code attempts to use an operand before it is available, the instruction that required the operand and all instructions upstream in the pipe are stopped and a bubble is allowed into the pipe until the result may be obtained. For example:
                        ____ ____ ____ ____ ____
            LOAD R1,A  |_IF_|_AC_|_OF_|_OP_|*OS*|____ ____ ____
            STORE R1,B      |_IF_|_AC_|////|////|*OF*|_OP_|_OS_|
            ANYTHING             |_IF_|////|////|_AC_|_OF_| ...
            ANYTHING                            |_IF_|_AC_| ...
            
    In the previous lecture, we covered the conditions we need to detect in order to determine when to delay an operation. Now, we are concerned with the problem of doing something when the condition is detected.

  2. Blocking the Pipe

    Recall that all of the interstage registers of a pipelined processor are clocked in strict synchrony:

                                       ____ 
                                  ----|>pc_|
                                 |   |______| Instruction Fetch
                                 o----|>___|
                                 |   |______| Address Computation
                                 o----|>___|
                                 |   |______| Operand Fetch
                                 o----|>___|
            clock                |   |______| Operate
               ------------------o----|>___|
                                     |______| Operand Store
    
    	
    To freeze the contents of some pipeline stages, all we need to do is to stop the clock to those stages. If our only concern is delaying the operand fetch stage, for example, we might do the following:
                                       ____ 
                                  ----|>pc_|
                                 |   |______| Instruction Fetch
                                 o----|>___|
            operand lock    ___  |   |_____ | Address Computation
               -----------o|AND|-o----|>___|
                         --|___|     |______| Operand Fetch
                        |         ----|>___|
            clock       |        |   |______| Operate
               ---------o--------o----|>___|
                                     |______| Operand Store
    
    	
    In the above, if the operand-interlock signal is false, both halves of the pipeline are clocked together. If, on the other hand, an interlock is requested, the clock to the upper half is stopped.

    Our machine actually requires a second interlock for address computation stage, so we must use the following logic: to do is to stop the clock to those stages, as:

            ad lock           ___      ____ 
               -------------o|   |  --|>pc_|
                         ---o|AND| | |______| Instruction Fetch
                        |  --|___|-o--|>___|
            op lock     | |   ___    |_____ | Address Computation
               ---------o-|-o|AND|----|>___|
                          o--|___|   |______| Operand Fetch
                          |         --|>___|
            clock         |        | |______| Operate
               -----------o--------o--|>___|
                                     |______| Operand Store
    
    	
    Here, if op-lock is true, the operand fetch, address computation and instruction fetch stages are all stopped, while if the ad-lock signal is true, we just stop the address computation and instruction fetch stages, allowing later stages to continue.

  3. Blocking the Pipe

    We cannot stop by simply stopping the advance of data through the pipeline! The reason is, the pipeline stages that are still running will process whatever invalid data is handed to them, taking incorrect results from the frozen stage when those results should be ignored.

    To solve this, we must inject the equivalent of NOP operations into the pipe. We could do this by changing the op-code field of the instructions being passed downstream into NOP opcodes, or we can do it by adding one bit to each downstream interstage register to indicate whether the contents of that stage are valid.

    The complete logic for injecting a bubble into one pipeline stage is thus:

            lock requests
               to earlier
                   stages  clock to other stages
                      | |  |   ___
                      o-|--|-o|   |  |  previous stage  |
                      | o--|-o|AND|  |__________________|
                      | |  o--|___|---|>_valid_|__etc__|
            need lock | |  |               |     __|__
               -------o-|--|-------------  o----|
                        |  |             | |    | Logic for this
            clock       |  |             o_|    | pipeline stage
               ---------|--o            |AND|   |
                        |  |   ___      |___|   |_____
                        o--|-o|AND|    ____|___ ___|___
                        |  o--|___|---|>_valid_|__etc__|
                        |  |         |                  |
                        |  |         |    next stage    |
                        |  |
            lock requests  clock to other stages
               from later
                   stages
    	
    Notice that the logic for this pipeline stage may continue to perform all normal combinatorial operations, producing results for the next stage at every clock cycle. These results should have no effect if the valid bit is reset!

    Also, and this is extremely important, if the valid input to a pipeline stage is false, the stage should perform no store operations to any register, and it should not assert any requests for data from any registers or memory unless these can be met without causing conflicts with other resource users. Thus, outputs such as "needs memory" must be anded with the valid signal.

    Finally, the valid signal must itself be accounted for in determining the interlock conditions. Thus, for example, the final interlock condition give in the previous lecture must be rewritten as:

            if
                     AC.need_register
                 and OF.might_write_register
                 and AC.r = OF.r
    	     and AC.valid       -- added
    	     and OF.valid       -- added
            or
                     AC.need_register
                 and OP.might_write_register
                 and AC.r = OP.r
    	     and AC.valid       -- added
    	     and OP.valid       -- added
            or
                     AC.need_register
                 and OS.write_register
                 and AC.r = OS.r
    	     and AC.valid       -- added
    	     and OS.valid       -- added
            then
                 interlock, blocking the IF and AC stages
            

  4. Branches

    We now have the mechanisms in place that we need to implement branches! Consider what the pipeline must look like as a branch flows through:

                        ____ ____ ____ ____ ____
            BRANCH X   |_IF_|_AC_|_OF_|_OP_|*OS*|____
            SOMETHING       |_IF_|_AC_|_OF_|_OP_|////|____
            SOMETHING            |_IF_|_AC_|_OF_|////|////|____
            SOMETHING                 |_IF_|_AC_|////|////|////|____
            SOMETHING                      |_IF_|////|////|////|////|____
            X: SOMETHING                        |_IF_|_AC_|_OF_|_OP_|_OS_|____
            
    By the time the operand store stage of the branch instruction is executed, the 4 instructions following it will have been fetched and fed into the pipe, even if the data in these memory locations are not intended to be instructions! Therefore, the partially executed results of these instructions must be cancelled when the branch instruction finally changes the PC register!

    The valid bit we just added to each interstage register offers us the tool we need to accomplish this! Thus, our interstage register logic becomes:

                                 |               
                                 |   |  previous stage  |
                                 |   |__________________|
                            -----|----|>_valid_|__etc__|
            need lock            |         |     __|__
              -------------------|-------  o----|
                                 o-----  | |    |
                                 |     | | |    | Logic for this
                                 |     o_o_|    | pipeline stage
                                 |    |  AND|   |
                                 |    |_____|   |_____
                                 |     ____|___ ___|___
                            -----|----|>_valid_|__etc__|
                                 |   |                  |
                                 |   |    next stage    |
                                 |
                               branch
    	
    The above diagram assumes that some later pipeline stage asserts the branch signal as it assigns to the PC. In fact, the actual branch logic at the input to the instruction fetch stage is quite simple:
                                         -------------
                                        |  ------     |
                                       _|_|_    _|_   |
                                  -----\1_0/   |+1 |  |
            clock           ___  |     __|__   |___|  |
    	   ------------|AND|------|>_PC_|    |    |
                         -o|___| |       |       |    |
                       _|_       |       o-------     |
                      |AND|      |       |            |
                      |___|      |    ___|________    |
                       | o       |   |            |   |
                       | |       |   |  IF stage  |   |
                       |  -------o                    |
                       |         |                    |
            lock requests      branch           branch address
               from later
                   stages
    	
    Note that if a later stage asserts a branch, we always take it, even if the instruction fetch stage is blocked by a bubble ahead of it in the pipe. In our example architecture, both the branch signal and the branch address come from the operand store stage of the pipe.

  5. Skip Instructions

    Notice that, using the logic outlined above, every branch instruction invalidates all of the instructions in the pipe. Thus, even a branch to the next instruction, the one that would be executed anyway had the branch not been there, would invalidate everything.

    The desire to allow conditionals to be executed without such an expense has led to the reintroduction of a class of instructions that was common on first and second generation computers but largely absent from classical third generation CPU's, skip instructions.

    On a first generation architecture, as typified by the DEC PDP-8 (a machine that was introduced using second-generation technology in 1965, but remained in production in various forms until arount 1990), when a skip condition was met, the program counter was simply incremented.

    For example, the "skip if accumulator negative" instruction would be implemented, inside the CPU's execute phase by sequential logic such as:

    	if AC < 0 then PC = PC + 1
    	
    This logic clearly isn't applicable to pipelined machines, but with the valid bit in the pipe, we can do something remarkably easy. If no instruction has any side effects before the final pipeline stage, the store stage, then we can skip an instruction in the sequence by simply invalidating it at any time before the store. Thus, all we need to do to implement a skip instruction is to have that instruction invalidate its successor in the pipe!

    Thus, at first analysis, an instruction stream containing, as its only control structures, skip instructions, will execute at exactly the same speed on a pipelined processor when the skips are taken as it does when the skips are not taken.

    That was a first analysis! Invalidating an instruction may release an interlock condition that depended on that instruction, and this may accelerate some later instruction that was tentatively blocked by the skipped instruction.