29. Forwarding and Out of Order Execution
the 22C:122/55:132 Lecture Notes for Spring 2004
The pipelined processors we've examined to this point have interlock logic that allows correct execution of naive programs written with no allowance for delay slots. We achieved this by injecting bubbles into the pipe, but the result, for naive code, is slow execution, with a large fraction of the pipeline cycles stalled. If we can shrink the bubbles or fill them, the processor will run the same programs faster than it otherwise would.
There are two now-common techniques that apply here; one is called operand forwarding, in which data is passed to its eventual consumer before it would arrive there by the normal data paths of the machine. This involves adding additional data paths.
The other technique is called out-of-order execution; this involves adding hardware that allows certain instructions to bypass stalled instructions in the pipeline under certain circumstances. Generally, this is harder to accomplish in simple pipelined systems, but in superscalar machines it becomes fairly common. The CDC 6600 actually supported out-of-order execution; there, it emerged as a side effect of the presence of interlock logic and multiple functional units.
Recall where we stand with our example pipelined system: we added interlock logic to block the clock to the operand fetch and address computation stages, along with auxiliary logic to create bubbles in the downstream pipe until the interlocks were released:
                        ____
                   ----|>pc_|
                        _|_
                       |______|      Instruction Fetch
        interlock  ___    |
        logic     |___|--|>___|
                        _|_
                       |______|      Address Computation
        interlock  ___    |
        logic     |___|--|>___|
                          |
                       |______|      Operand Fetch
               o------|>___|
         clock |          |
               |       |______|      Operate
          -----o------|>___|
                          |
                       |______|      Operand Store
In brief, the operand fetch stage interlock logic requested an interlock when a register or memory value that stage needed had not yet been stored by an instruction farther down the pipe, and the effective address stage interlock logic requested an interlock under the analogous conditions for the registers used in computing the effective address. Among these interlock conditions, one family is of special interest.
In all of these, the operand store stage is in the process of copying a value from the result register (the interstage register holding the result from the ALU in the operate stage) to either memory or a register, and either the operand fetch stage or the address computation stage is waiting for this value to be stored so that stage can fetch it.
Why wait for a value to be stored in memory or a register and then fetch it when that value is already available, in an interstage register from which we could get it directly, without waiting? Operand forwarding logic adds data paths to the pipelined system to allow this information to be used!
The exact same predicate that we used to request a stall becomes a request for forwarding. When we detect that the operand store stage is about to store exactly the value the operand fetch stage is waiting for, then instead of requesting a stall, we set a multiplexor in the operand fetch stage to take the data directly from the result register that is the input to the operand store stage, rather than from the register file.
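As a sketch of this decision in software terms, the stall predicate becomes a mux select. The dict-based instruction representation and field names below are our own illustration, not the machine's actual encoding:

```python
# Illustrative model of the forwarding decision for register operands.
# The instruction encoding (dicts with "op"/"dst"/"src" fields) is a
# made-up sketch, not the example machine's real format.

def forwards(of_instr, os_instr):
    """True when the instruction in the operand store stage is about to
    write the register that the operand fetch stage wants to read."""
    return (os_instr is not None
            and os_instr["writes_reg"]
            and os_instr["dst"] == of_instr["src"])

def operand_a(of_instr, os_instr, regs, result_reg):
    """The multiplexor on the opa register's input: take the forwarded
    value from the result register, or read the register file."""
    if forwards(of_instr, os_instr):
        return result_reg
    return regs[of_instr["src"]]

# LOAD R1,A sits in the operand store stage with its result, 42, still
# in the result register; STORE R1,B is in the operand fetch stage.
load  = {"writes_reg": True, "dst": 1}
store = {"src": 1}
print(operand_a(store, load, {1: 0}, 42))   # forwarded: prints 42
print(operand_a(store, None, {1: 7}, 42))   # normal path: prints 7
```

Note that the stall predicate and the forwarding predicate are literally the same test; only the action taken on a match changes.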
To implement this particular forwarding logic, from the operand store stage to the operand fetch stage for register operands, we must add new data paths for forwarding to the pipelined machine:
                        ____
                   ----|>pc_|
                        _|_
                       |______|      Instruction Fetch
        interlock  ___    |
        logic     |___|--|>___|
                        _|_
                       |______|      Address Computation
        interlock  ___    |
        logic     |___|--|>___|
                          |
                       |______|      Operand Fetch
                          |     ______________________
                        __|____|_                     |
                        \_0____1_/-- forwarding       |
                            |        logic            |
                           _|_  to opa register       |  forwarding
               o-------|>___|                         |  path
         clock |           |                          |
               |        |______|     Operate          |
          -----o-------|>___|                         |
                           |__________________________|
                           |    contents of result register
                         __|__
                        |______|     Operand Store
Having added this data path and moved one term from the interlock logic for the operand fetch stage to the forwarding logic for that stage, we get a real, measurable change in the performance of our computer. For a simple memory-to-memory assignment, using a register as an intermediate variable, and with no instructions inserted as operand delay fillers, our old machine would have required two operand delay slots:
             ____ ____ ____ ____ ____
LOAD  R1,A  |_IF_|_AC_|_OF_|_OP_|_OS*|
                  ____ ____ ____ ____ ____ ____ ____
STORE R1,B       |_IF_|_AC_|////|////|*OF_|_OP_|_OS_|
With forwarding, for this sequence, we can now get away with one delay slot:
             ____ ____ ____ ____ ____
LOAD  R1,A  |_IF_|_AC_|_OF_|_OP*|_OS_|
                  ____ ____ ____ ____ ____ ____
STORE R1,B       |_IF_|_AC_|////|*OF_|_OP_|_OS_|
It is interesting to ask, can we eliminate that remaining bubble by adding more forwarding logic? Let's look at one possibility:
                        ____
                   ----|>pc_|
                        _|_
                       |______|      Instruction Fetch
        interlock  ___    |
        logic     |___|--|>___|
                        _|_
                       |______|      Address Computation
        interlock  ___    |
        logic     |___|--|>___|
                          |
                       |______|      Operand Fetch
                          |     ______________________
                        __|____|_                     |
                        \_0____1_/-- forwarding       |
                            |        logic            |
                           _|_  to opa register       |  forwarding
               o-------|>___|                         |  path
         clock |           |                          |
               |        |______|     Operate          |
               |           |__________________________|
               |           |    from ALU output
          -----o-------|>___|
                           |
                        |______|     Operand Store
In fact, this will work! We may still need interlock logic for the operand fetch stage because we haven't dealt with forwarding for operands from memory, but we have successfully eliminated all operand delay slots for register operands.
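The delay-slot counts in these three schemes (no forwarding, forwarding from the result register, forwarding from the ALU output) follow from simple cycle arithmetic. A toy model, using our own stage numbering (IF=0 through OS=4, one cycle per stage, consumer issued one cycle behind the producer):

```python
# Cycle arithmetic behind the delay-slot counts; the stage numbering
# and the helper function are our own illustration.

OF = 2  # the consumer's operand fetch runs, unstalled, in cycle issue + OF

def bubbles(value_usable_cycle, consumer_issue=1):
    """Bubbles injected before the consumer's OF stage, given the first
    cycle in which the producer's value can be used there."""
    return max(0, value_usable_cycle - (consumer_issue + OF))

# No forwarding: the value reaches the register file during OS (cycle 4)
# and is readable from the register file the following cycle.
print(bubbles(5))  # prints 2 -- two delay slots
# Forwarding from the result register, loaded at the end of OP (cycle 3):
print(bubbles(4))  # prints 1 -- one delay slot
# Forwarding combinationally from the ALU output during OP itself:
print(bubbles(3))  # prints 0 -- no delay slots
```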
There is only one problem! In our first scheme, the one that left us with just one delay slot, the data path from the result register to the opa register passed through a single multiplexor. In the new scheme, the data path passes through the ALU (which probably contains several multiplexors, an adder, and carry propagation logic of some kind). As a result, the first scheme was very likely compatible with any clock rate we could use, while the second scheme adds a multiplexor in series with whatever delays were already inherent in the ALU. Since the ALU speed is likely to be a limiting factor in the speed of the overall system, our new scheme may require that we run with a reduced clock rate, while the first forwarding scheme is likely to have no overall performance penalty.
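To make the tradeoff concrete, consider some made-up delay figures; the numbers below are purely illustrative assumptions, since real figures depend entirely on the implementation technology:

```python
# Illustrative critical-path arithmetic; every delay here is an
# assumed number, not a measurement of any real machine.
ALU_DELAY     = 10.0  # ns through the operate stage's ALU (assumed)
MUX_DELAY     =  1.0  # ns through one multiplexor (assumed)
REGFILE_DELAY =  2.0  # ns to read the register file (assumed)

# Scheme 1: forward from the result register.  The operand fetch path
# is a register-file read or a forwarded value, each through one mux,
# so the ALU still determines the clock period.
scheme1 = max(ALU_DELAY, REGFILE_DELAY + MUX_DELAY)

# Scheme 2: forward from the ALU output.  The mux now sits in series
# with the ALU on the longest combinational path.
scheme2 = ALU_DELAY + MUX_DELAY

print(scheme1)  # prints 10.0 -- clock period unchanged
print(scheme2)  # prints 11.0 -- clock period stretched by the mux
```

Under these assumed numbers, the zero-delay-slot scheme buys its bubble back at the cost of a 10 percent slower clock on every cycle.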
We can add similar forwarding logic for memory operands by adding a new input to the multiplexor on the path to the operand B register at the input to the operate stage, feeding this new input from the ALU result register. Similarly, we can add forwarding logic for address computation by widening the multiplexor on the input to the adder in that stage, again feeding its new input from the ALU result register. The net effect is to reduce the number of operand delay slots from two to one and the number of address computation delay slots from three to two.
We can also add forwarding logic for unconditional branches. If we recognize that an instruction is an unconditional branch in the address computation stage (where the effective address is the destination of the branch) or in the operand fetch stage (where the value fetched from a register or from memory is the destination of the branch), we can forward that destination address to the PC immediately!
Branch forwarding logic is tricky on machines that allow skip instructions! In that case, we can't safely forward a branch address to the PC until we're sure the branch won't be skipped! It's also important to note that we invalidate only the stages between the stage that forwarded the branch address and the instruction fetch stage, inclusive. Any instructions in later pipeline stages are still valid and should be executed; they were fetched before the branch was fetched.
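The invalidation rule can be stated concisely; the stage numbering below (IF=0, AC=1, OF=2, OP=3, OS=4) is our own labeling of the example pipeline:

```python
# Which stages hold wrong-path instructions when a branch address is
# forwarded to the PC?  Stage numbering is our own illustration.
IF, AC, OF, OP, OS = range(5)

def stages_to_squash(forwarding_stage):
    """Every stage from instruction fetch up to and including the stage
    that forwarded the branch address holds an instruction fetched
    after the branch; everything farther down the pipe must run."""
    return list(range(IF, forwarding_stage + 1))

# Branch recognized in the operand fetch stage: squash IF, AC and OF.
print(stages_to_squash(OF))  # prints [0, 1, 2]
# Branch recognized in address computation: squash only IF and AC;
# instructions already in OF, OP or OS were fetched first and are valid.
print(stages_to_squash(AC))  # prints [0, 1]
```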
When a pipeline stage is blocked, the obvious thing to do is what we have done, blocking all upstream pipeline stages as well! There is an alternative that is occasionally useful; this is out-of-order execution.
If the instruction in one pipeline stage does not depend on the pipeline stages ahead of it that are occupied by stalled instructions, then instead of injecting a bubble after the stall, the instruction we have identified may be promoted! In effect, in this case, the hardware is doing a job at run-time that could have been done by an optimizing compiler that re-orders the machine instructions to avoid stalls.
For our example architecture, both the address computation stage and the operand fetch pipeline stages may stall, assuming that we do not introduce any forwarding logic. As it turns out, we cannot promote instructions around the address computation stage, so reordering cannot be used to limit the impact of address computation stalls. We can, however, apply this approach in a very limited way to some operand fetch stalls.
Why is out-of-order execution inapplicable to the address computation stage? Because the basic promotion operation requires looking at the contents of the interstage register that serves as input to the previous pipeline stage; here, the previous stage is the instruction fetch stage, and there is no interstage register ahead of it!
In the case of stalls in the operand fetch stage, however, these involve stalling the processing of the contents of the interstage register between the address computation stage and the operand fetch stage. The candidate for promotion is the instruction in the previous interstage register, between the instruction fetch stage and the address computation stage.
In general, promotion involves the following data paths:
                           |
                        |______|
                o------|>___|
                |          |
                |       |______|         previous stage
                |          |
                |   NOP  __|__
                o-------\_____/
                |          |
                o--------|>___|
                |          |_____________________
                |          |                     |
                |       |______|    blocked      |
                |          |        stage        |  promotion
   interlock   ___       __|_____________________|  path
   and        |   |      \______________________/
   promotion  |___|--o--------|
   logic             |       _|__
                -----o-----|>____|
                              |
                           |______|      following stage
When we promote an instruction around a blocked stage, we must replace it with a NOP at the stage from which it was promoted and we must use that instruction instead of the NOP coming out of the blocked stage.
In general, the blocked pipeline stage is unavailable to perform any computation, so the only instructions that can be promoted are those that perform little or no computation. No computation if there is no logic in the promotion path, and little computation if there is little logic in the promotion path.
What instructions does this leave? Consider the example system, with an operand fetch stage that can block. There are a few instructions that make no significant use of the operand fetch stage! Specifically, the load immediate instruction in our example instruction set can be promoted because it needs neither an operand from memory nor an operand from registers. The value in the effective address output from the address computation stage becomes the operand B on the input to the operate stage, while operand A is unneeded, and the opcode and register fields are copied verbatim.
Recall that load immediate instructions into the program counter were allowed, and were used as branch instructions! Clearly, we must make a special case of these, because code re-ordering cannot be allowed to change the position of a branch instruction relative to other instructions in its vicinity! No instruction should be promoted past a branch instruction, and no branch instruction should be promoted!
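The resulting promotion test is tiny. A sketch, in which the opcode name, the field layout, and the choice of treating the PC as register number 0 are all our own assumptions about the example machine:

```python
# Sketch of the promotion test; "LOADI", the dict fields, and PC = 0
# are illustrative assumptions, not the machine's actual encoding.
PC = 0  # assumed register number of the program counter

def can_promote(instr):
    """An instruction waiting behind the blocked operand fetch stage may
    be promoted only if it is a load immediate (it needs nothing from
    registers or memory) and its destination is not the PC (a load
    immediate into the PC is a branch, which must never be promoted)."""
    return instr["op"] == "LOADI" and instr["dst"] != PC

print(can_promote({"op": "LOADI", "dst": 3}))   # prints True
print(can_promote({"op": "LOADI", "dst": PC}))  # prints False: a branch
print(can_promote({"op": "LOAD",  "dst": 3}))   # prints False: needs OF
```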
There appear to be no other instruction promotions possible with this architecture! Superscalar pipelined machines offer far more interesting reordering possibilities, but the basic mechanisms remain remarkably similar to those discussed here. For the example pipeline architecture, though, it is quite clear that out-of-order execution cannot achieve the speedups that were possible using operand forwarding, and furthermore, for this architecture, there is no need to use both mechanisms; they are alternatives.