22C:122/55:132 Notes, Lecture 28, Spring 2001

Douglas W. Jones
University of Iowa Department of Computer Science

Forwarding and Out of Order Execution
The pipelined processors we've examined to this point have interlock logic to allow correct execution of programs that do not allow for delay slots by injecting bubbles into the pipe, but the result, for naive code, will be slow execution with a large fraction of the pipeline cycles stalled. If we can reduce the bubble size or fill bubbles, the processor will be able to run the same programs faster.
There are two common techniques that apply here; one is called operand forwarding, in which data is passed to its eventual consumer before it would arrive there by the normal data paths of the machine. This involves adding additional data paths.
The other technique is called out-of-order execution; this involves adding hardware that will allow certain instructions to bypass stalled data in the pipeline under certain circumstances. Generally, this is harder to accomplish in simple pipelined systems, but in superscalar machines, it becomes fairly common.
Operand Forwarding
Consider our running example pipelined system, focusing on the interlock logic that blocks the operand fetch stage of the pipeline until an operand is avilable:
```
                                   ____
                              ----|>pc_|
                             |   |______| Instruction Fetch
                             o----|>___|
                            _|_  |______| Address Computation
            interlock logic|___|--|>___|
                             |   |______| Operand Fetch
                             o----|>___|
        clock                |   |______| Operate
           ------------------o----|>___|
                                 |______| Operand Store

        
```
We block the operand fetch stage if (among other conditions):
- The operand is fetched from a register and
  The register is the destination register for the instruction currently in the operate phase
  
  or
- The operand is fetched from a register and
  The register is the destination register for the instruction currently in the operand store phase
Notice that, the value of the operand that will be stored in the target register at the end of the operand store stage is available on the ALU output at the end of the operate stage! Operand forwarding logic adds data paths to the pipelined system to allow this information to be used!
Specifically, if the stall logic would stall a stage in the pipeline awaiting the availability of some value in registers or memory, and if that value is currently available on some data path in the machine, typically in a later pipeline stage, additional multiplexors and data paths can be added to the system to forward that value to where it is needed, eliminating the stall.
What we need to do is change the operand fetch stage so that, if:
- The operand is fetched from a register and
  The register is the destination register for the instruction currently in the operate phase
  
  The contents of the register are ignored, and instead, the
  value on the ALU output is used as an operand.
  
  or
- The operand is fetched from a register and
  The register is the destination register for the instruction currently in the operand store phase
  
  The contents of the register are ignored, and instead, the
  value on the data path to registers from the operand store stage
  is used as an operand.
Operand Forwarding Logic
To implement this particular forwarding logic, we must add new data paths for forwarding to the pipelined machine:
```
                                   ____
                              ----|>pc_|
                             |   |______| Instruction Fetch
                             o----|>___|
                            _|_  |______| Address Computation
            interlock logic|___|--|>___|
                   ?         |       |  
                             |       | -------------------
                             |       || ----------------  |
                             |    ___|||                | |
                             |   |______| Operand Fetch | | forwarding
                             o----|>___|                | | paths
                             |   |______| Operate       | |
                             |       | |                | |
                             |       |  ----------------^ |
        clock                |     __|_   alu output      |
           ------------------o----|>___|                  |
                                 |______| Operand Store   |
                                       |                  |
                                        ------------------o-->
                                          value to regs
        
```
The actual forwarding logic we need for this example goes in the operand fetch stage:
```
              Operand Fetch Stage
              _____  _______________
             | op|r||       ea      |
             |___|_||_______________|
               |  |         |
               |  o---------|------------------------> register number       
               |  |         |   _________   ---------< data from registers
               |  |         |  |  read   | |                          
               o--|---------|--|registers|-|---------> need register value
               |  |         |  |_________| |                          
               |  |         |     |        |  ------------<< ALU output
               |  |         |  ___|___     | |  ----------<< value to  regs
               |  |         | |forward|   _|_|_|_
               |  |         | | logic |---\_____/     logic relating to
               |  |         | |_______|      |       operand from memory
              _|__|  _______|_______  _______|_______  _______|_______
             | op|r||       ea      ||      opa      ||      opb      |
             |___|_||_______________||_______________||_______________|
        
```
Remarkably, if we add this logic, we completely eliminate all operand delay slots involving register dependencies! When the forwarding logic indicates that the operand should be forwarded from the ALU output or the result register, the read registers logic may suppress the read from registers.
Similar logic can be added to suppress interlocks relating to memory access, for example, if:
- The operand is fetched from memory and
  The effective address is the same as the effective address of the instruction currently in the operand store phase, where that instruction will store in memory.
  
  The value from memory is ignored, and instead, the
  value in the result register feeding operand store stage
  is used as an operand.
The actual logic requires one other condition to be tested, and when this is covered, forwarding logic completely eliminates all operand fetch stalls from our example machine!
Similar forwarding logic can be used to eliminate many but not all address computation stalls! Address computation stalls caused by an index register that is being updated by the operand store stage can definitely be eliminated by forwarding.
Address computation stalls caused by an instruction in the operand fetch stage that will update the index register can definitely not be eliminated because the value that will eventually be stored in the index register has not yet been computed. Only after the result emerges from the ALU is the value available!
Address computation stalls caused by an instruction in the operate stage that will update the index register may be eliminated by forwarding, but there is a price! The address computation phase of the pipeline needs to add the contents of the index register to the displacement to compute the relative address. The operate stage needs to combine the operands in the ALU. Forwarding the ALU output to the input of the address computation adder means that we have two adders operating in sequence, and this places a limit on the clock speed!
As a result, it may be that a machine without a forwarding path from the operate stage to the address computation stage will be able to run at a higher clock rate than a machine with this forwarding path! On the other hand, a machine with this forwarding path will have fewer stalls than a machine without it. If the stalls that can be eliminated have a higher price than the reduction in clock rate, the machine with the slower clock rate may outperform the machine with a faster clock!
Beware! The value of forwarding lies entirely in the extent to which the added logic eliminates pipeline stalls, and the frequency of stalls depends on the quality of the instruction sequence reorganization stage of the code generators of the commonly used compilers! Therefore, improved compilers can reduce or eliminate the benefits of forwarding, and the benefits of an investment in adding forwarding logic must be compared with the benefits of investing in better compilers.
Out-of-Order Execution
When a pipeline stage is blocked, the obvious thing to do is what we have done, blocking all upstream pipeline stages as well! There is an alternative that is occasionally useful; this is out-of-order execution.
If the instruction in one pipeline stage does not depend on the pipeline stages ahead of it that are occupied by stalled instructions, then insteas of injecting a bubble after the stall, the instruction we have identified may be promoted! In effect, in this case, the hardware is doing a job at run-time that could have been done by an optimizing compiler that re-orders the machine instructions to avoid stalls.
For our example architecture, we both the address computation stage and the operand fetch pipeline stages may stall, assuming that we do not introduce any forwarding logic. As it turns out, we cannot promote instructions around the address computation stage, so reordering cannot be used to limit the impact of address computation stalls. We can, however, apply this approach in a very limited way to some operand fetch stalls.
Why is out-of-order execution inapplicable to the address computation phase? Because the basic promotion operator requires looking at the contents of the interstage register that serves as input to the previous pipeline stage! The previous stage is the instruction fetch stage, and there is no interstage register previous to it!
In the case of stalls in the operand fetch stage, however, these involve stalling the processing of the contents of the interstage register between the address computation stage and the operand fetch stage. The candidate for promotion is the instruction in the previous interstage register, between the instruction fetch stage and the address computation stage.
In general, promotion involves the following data paths:
```
                          |   |______| 
                          o----|>___|
                          |   |______| previous stage
                          |      | |__________________ 
                          |     _|__                  |
                          o----|>___|                 |
                          |   |______| blocked stage  |
                          |      |  __________________|                  
         interlock and   _|_    _|_|_   promotion path              
         promotion logic|___|---\___/                 
                          |     __|_                 
        ------------------o----|>___|
                              |______| following stage
	
```
What Instructions can be Promoted?
In general, the blocked pipeline stage is unavailable to perform any computation, so the only instructions that can be promoted are those that perform little or no computation. No computation if there is no logic in the promotion path, and little computation if there is little logic in the promotion path.
What instructions does this leave? Consider the example system, with an operand fetch stage that can block. There are a few instructions that make no significant use of the operand fetch stage! Specifically, the load immediate instruction can be be promoted because it needs neither an operand from memory nor an operand from registers. The value in the effective address output from the address computation stage becomes the operand B on the input to the operate stage, while operand A is unneeded, and the opcode and register fields are copied verbatim.
Recall that load immediate instructions into the program counter were allowed, and were used as branch instructions! Clearly, we must make a special case of these, because code re-ordering cannot be allowed to change the position of a branch instruction relative to other instructions in its vicinity!
There appear to be no other instruction promotions possible with this architecture! Superscalar pipelined machines offer far more interesting reordering possibilities, but the basic mechanisms remain remarkably similar to those discussed here. For the example pipeline architecture, though, it is quite clear that out-of-order execution cannot achieve the speedups that were possible using operand forwarding, and furthermore, for this architecture, there is no need to use both mechanisms; they are alternatives.