The pipelined processors we've examined to this point have been able to execute only one instruction per clock cycle. Large, high performance machines frequently have a very wide data path to memory, motivated, initially, by the desire to provide direct support for fast load and store operations on 32 and 64 bit operands. If a machine is able to fetch 64 bits at a time, the instruction prefetch register will be at least 64 bits, and the question arises: how can we use all this width if most instructions are only 16 or perhaps 32 bits?
One solution is to abandon short instructions and design a machine with very long instruction words. VLIW machines achieve parallel execution by encoding many operations into each long instruction word.
Another solution, the subject of this section, is to attempt to execute more than one instruction at a time. That is, to design a pipelined machine where each pipeline stage can simultaneously execute two or more instructions. At the top level, we could do this quite simply. Consider a pipelined machine with a fixed 32-bit instruction word and a 32 bit data path to memory:
[Figure: a five-stage pipeline (IF, AC, OF, OP, OS) connected to the register file R and to memory M by 32-bit data paths.]

We could build a 64 bit implementation of this machine as follows:
[Figure: the five pipeline stages duplicated side by side as pipe a and pipe b, sharing one instruction fetch path from memory M and sharing the register file R.]

This requires a single data path from memory to the instruction fetch stage, two paths from the registers to the address computation stage, two data paths from both memory and registers to the operand fetch stage and two data paths to both registers and memory for the operand store stage. Thus, we need 5 memory ports and a 6-port register file!
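The port counts follow directly from the data paths just listed; as a quick tally:

```python
# Tallying the ports needed by the dual five-stage pipeline,
# one term per data path named in the text.
memory_ports = (1      # memory -> instruction fetch stage
                + 2    # memory -> the two operand fetch stages
                + 2)   # the two operand store stages -> memory
register_ports = (2    # registers -> the two address computation stages
                  + 2  # registers -> the two operand fetch stages
                  + 2) # the two operand store stages -> registers
print(memory_ports, register_ports)  # 5 6
```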
Even if we ignore the problem of building multiport memory systems that support this degree of parallelism, this idea poses serious problems. Consider, for example, the problem of detecting operand dependencies and deciding when to stall a pipeline stage!
Consider, for example, the problem of when to stall the address computation stage of pipeline a, assuming that instruction a comes before instruction b when the two are packed into the same word. We must stall the AC stage of pipeline a if either:

-- an instruction in the OF, OP or OS stage of pipeline a will change a register that instruction a uses in its address computation, or

-- an instruction in the OF, OP or OS stage of pipeline b will change such a register.
In addition, we must stall the AC stage of pipeline b if either:

-- any of the above conditions holds for a register that instruction b uses in its address computation, or

-- the instruction in the AC stage of pipeline a will change such a register, since instruction a logically precedes instruction b.
In sum, the two pipelines are not independent! The instruction fetch stage must stall if any stage below it is stalled in either pipeline, and the stall interlock at each stage in each of the pipelines is approximately doubled in complexity!
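The stall conditions above can be sketched in software. This is only an illustrative model, not hardware: each in-flight instruction is a dict naming the registers it uses for address computation and the register it will write, and all names are assumptions for the example.

```python
# Hypothetical sketch of RAW-hazard stall checks for a dual-issue pipeline.
# An "instruction" is a dict: the index registers it reads in its address
# computation, and the register it will eventually write (or None).

def must_stall_ac(instr, later_stages):
    """Stall the AC stage if a register this instruction uses for address
    computation is still due to be written by an instruction in a later
    stage (OF, OP, or OS) of either pipeline."""
    pending_writes = {i["writes"] for i in later_stages if i and i["writes"]}
    return bool(instr["index_regs"] & pending_writes)

def must_stall_ac_b(instr_b, instr_a, later_stages):
    """Pipeline b's instruction logically follows pipeline a's, so it must
    additionally wait on the instruction packed beside it in the word."""
    extra = [instr_a] if instr_a else []
    return must_stall_ac(instr_b, later_stages + extra)

# instruction a in AC will write r1; instruction b in AC indexes off r1
a = {"index_regs": {"r2"}, "writes": "r1"}
b = {"index_regs": {"r1"}, "writes": "r3"}
print(must_stall_ac(a, []))       # False: nothing below a is pending
print(must_stall_ac_b(b, a, []))  # True: b depends on a in the same word
```

Note that the check for pipeline b scans one more instruction than the check for pipeline a, which is roughly the doubling of interlock complexity described above.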
We must also add some new interlocks! If the operand store stages of both pipelines attempt to store the same register or the same memory location, we must either stall pipeline b in order to allow pipeline a to store first, or we must discard the value stored by pipeline a, as the value that store would have written would be immediately overwritten if the two instructions were executed in sequence.
There is an exception to the latter rule! If the operand store stage is executing two branch instructions, the branch instruction being executed in pipeline a takes precedence, causing the branch instruction in pipeline b to be ignored because it came later!
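The store-conflict rule and its branch exception can be sketched as follows; the dict encoding is an assumption made for the example, with a branch modeled as a store to the program counter:

```python
# Hedged sketch of the operand-store conflict rule.  A "store" is a dict;
# for a taken branch the destination is the program counter.

def resolve_store_conflict(store_a, store_b):
    """Instruction a logically precedes instruction b.  If both stores
    target the same destination, b's value would overwrite a's in
    sequential execution, so a's store may be discarded -- UNLESS both
    are branches, in which case the earlier branch (pipeline a) takes
    precedence and the branch in pipeline b is ignored."""
    if store_a["dest"] != store_b["dest"]:
        return [store_a, store_b]   # no conflict: perform both stores
    if store_a["is_branch"] and store_b["is_branch"]:
        return [store_a]            # the earlier branch wins
    return [store_b]                # the later store to the same place wins

# two stores to r1: only the logically later value survives
print(resolve_store_conflict(
    {"dest": "r1", "is_branch": False, "value": 1},
    {"dest": "r1", "is_branch": False, "value": 2}))
# two taken branches: both target the PC, and the earlier one wins
print(resolve_store_conflict(
    {"dest": "PC", "is_branch": True, "value": 100},
    {"dest": "PC", "is_branch": True, "value": 200}))
```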
If we add Operand forwarding paths to the architecture, their complexity is also approximately doubled.
If the interlock logic detects the need to stall some stage in our superscalar pipelined machine, it is easy to imagine stalling both pipelines, but this would lead to deadlock if a stage of pipeline b requests a stall waiting for the same stage of pipeline a to finish an operation.
Rather than this, we must, at minimum, allow a stall in a stage of pipeline b to permit the corresponding stage of pipeline a to continue while blocking the stage above.
If stalling in either pipeline only blocks that pipeline (and also the instruction fetch stage), we end up allowing a significant amount of out-of-order execution. We must be very careful with this in order to prevent branch instructions from being subject to reordering! Rules like the following do this:

-- a branch instruction in either pipeline may not advance past a stage holding an instruction that logically precedes it in the other pipeline, and

-- when a branch is taken, every instruction fetched after that branch, in both pipelines, must be invalidated.
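The stall-propagation rule (a stall ripples upward only within its own pipeline, but always freezes the shared fetch stage) can be sketched as a small predicate; the stage names and the set-of-pairs encoding are assumptions for the example:

```python
# Illustrative model of stall propagation in the dual pipeline.
# `stalls` is a set of (pipe, stage) pairs that are currently stalled.

STAGES = ["IF", "AC", "OF", "OP", "OS"]

def blocked(pipe, stage, stalls):
    """A stage is blocked if it is itself stalled, or if the stage below
    it in the SAME pipeline is blocked (stalls ripple upward within one
    pipe only, so a stall in pipe b cannot freeze pipe a and deadlock).
    The shared IF stage is blocked if any stage in either pipe is
    stalled."""
    if stage == "IF":
        return bool(stalls)
    if (pipe, stage) in stalls:
        return True
    i = STAGES.index(stage)
    below = STAGES[i + 1] if i + 1 < len(STAGES) else None
    return below is not None and blocked(pipe, below, stalls)

stalls = {("b", "OF")}
print(blocked("b", "AC", stalls))  # True: b's AC waits behind b's OF
print(blocked("a", "AC", stalls))  # False: pipeline a keeps moving
print(blocked("a", "IF", stalls))  # True: the shared fetch stage waits
```

Because pipeline a keeps moving while pipeline b is stalled, this model exhibits exactly the limited out-of-order execution discussed above.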
If we have only a few different instruction lengths, for example, short instructions with no immediate constant or index displacement, and long ones with immediate constants or indexing displacements, as in the IBM System 360 family, we can easily use a promotion mechanism to bring the second half of a long instruction into position. Thus, if the instruction in the operand fetch stage of pipeline a is a long instruction, it can use and invalidate the instruction in the same stage of pipeline b.
Similarly, if the instruction in the operand fetch stage of pipeline b is a long instruction, it can use and invalidate the instruction in the previous stage of pipeline a (the address computation stage).
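The promotion mechanism described in the previous two paragraphs can be sketched like this; the slot encoding (an opcode plus a long/short flag) and all names are assumptions made for the example:

```python
# Hedged sketch of the promotion mechanism for two instruction lengths.
# Each slot holds (opcode, is_long) or None.

def promote(of_a, of_b, ac_a):
    """A long instruction in pipe a's OF stage consumes (and invalidates)
    the halfword beside it in pipe b's OF stage; a long instruction in
    pipe b's OF stage consumes the halfword in pipe a's previous (AC)
    stage.  Returns the issued instructions and the slots invalidated
    as extension words."""
    issued, invalidated = [], []
    if of_a is not None:
        if of_a[1]:                        # long instruction in pipe a
            issued.append((of_a[0], of_b))
            invalidated.append("of_b")
        else:
            issued.append((of_a[0], None))
    if of_b is not None and "of_b" not in invalidated:
        if of_b[1]:                        # long instruction in pipe b
            issued.append((of_b[0], ac_a))
            invalidated.append("ac_a")
        else:
            issued.append((of_b[0], None))
    return issued, invalidated

# a long LOAD in pipe a absorbs the halfword riding beside it in pipe b
print(promote(("LOAD", True), ("disp", False), ("ADD", False)))
```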
It is imperative that out-of-order execution be suppressed at the pipeline stage where variable length instructions are marshalled, because if it were allowed, a halfword that was intended as the extension of some long instruction could be moved relative to the instruction that needed it! The result would be a very strange situation where some halfword would be interpreted as either the second half of a long instruction or an independent instruction, depending on the operand dependencies that occur in nearby instructions!
Alternatively, we may deal with variable length instructions with a very complex instruction fetch stage using a prefetch buffer and appropriate marshalling logic:
[Figure: a six-syllable instruction prefetch buffer, loaded from memory through a shift network; decode logic routes up to three syllables into the next stage of pipeline a and up to two syllables into the next stage of pipeline b.]

The above illustrates an interesting possibility! We have the option of designing machines with one fully functional pipeline and one that only contains sufficient functionality to handle a subset of the instructions. The figure illustrates this with pipeline a that is able to handle one, two and three syllable instructions, while pipeline b is only able to handle one and two syllable instructions. Furthermore, as shown, if there is a three syllable instruction followed by a one syllable instruction, both will be admitted to their respective pipelines at the same time, while if a three syllable instruction is followed by a two syllable instruction, only the three syllable instruction will be admitted to pipeline a at the end of the clock cycle, because our marshalling logic does not contain the data path needed to bring the second syllable of the two syllable instruction into pipeline b.
If the instruction mix of this machine contains many 3 syllable instructions that are followed by 2 syllable instructions, the machine will operate as a simple scalar pipeline, while if it contains many one and two syllable instructions, the machine will operate at superscalar rates.
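The issue rule just described can be sketched as a function of the syllable counts at the head of the prefetch buffer. The exact routing limits of the shift network are an assumption here, chosen to match the cases given in the text:

```python
# Illustrative issue rule for the asymmetric dual pipeline.
# `buffer` holds the syllable counts of the next instructions in order.

def issue(buffer):
    """Return how many instructions issue this cycle.  Pipeline a takes
    1-, 2-, or 3-syllable instructions; pipeline b only 1- or 2-syllable
    ones.  As an assumed proxy for the marshalling network's data paths,
    a pair may issue together only if it fits in four syllables -- so a
    3-syllable instruction pairs with a following 1-syllable one, but
    not with a following 2-syllable one."""
    if not buffer:
        return 0
    first = buffer[0]
    if len(buffer) > 1:
        second = buffer[1]
        if second <= 2 and first + second <= 4:
            return 2     # both instructions enter their pipes this cycle
    return 1             # fall back to scalar issue

print(issue([3, 1]))  # 2: three-syllable plus one-syllable pair up
print(issue([3, 2]))  # 1: this pair needs a data path the network lacks
print(issue([1, 2]))  # 2: short instructions issue at superscalar rate
```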
We could design instruction decoding logic that locks any class of opcodes out of pipeline b; for example, because ports that can access memory are expensive, we could include memory reference hardware only in pipeline a, allowing only register to register instructions to execute in pipeline b. If the code being executed has few memory reference instructions, it would execute at superscalar rates, while if the code has many such instructions, the performance would fall back to simple scalar pipeline rates.
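The performance consequence of locking memory-reference instructions out of pipeline b can be sketched with a simple cycle count over an instruction trace; the opcode names and the greedy pairing rule are assumptions for the example:

```python
# Illustrative throughput model when only pipeline a has memory hardware.
MEMORY_OPS = {"LOAD", "STORE"}   # assumed memory-reference opcode class

def cycles(instrs):
    """Count issue cycles for a trace: a pair issues together only when
    the second instruction of the pair is register-to-register, since
    memory-reference instructions are locked out of pipeline b."""
    i = n = 0
    while i < len(instrs):
        if i + 1 < len(instrs) and instrs[i + 1] not in MEMORY_OPS:
            i += 2          # pair issues: pipe a takes instrs[i]
        else:
            i += 1          # a trailing memory op forces scalar issue
        n += 1
    return n

print(cycles(["ADD", "SUB", "XOR", "AND"]))      # register code: 2 cycles
print(cycles(["LOAD", "LOAD", "LOAD", "LOAD"]))  # memory-bound: 4 cycles
```

With few memory references the trace runs at two instructions per cycle; with nothing but memory references it falls back to the simple scalar rate, as the text predicts.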
High end microprocessors today frequently offer superscalar execution!