The pipelined processors we've examined to this point have had a fixed instruction word. In contrast, the first successful pipelined design, the IBM 360/95 had several instruction formats, and a number of other pipelined designs have had even more complex formats.
For example, the DEC VAX family of processors introduced in the mid 1970's had the following instruction format (considerably simplified):
instruction: _______8_______ the number of operands or addresses |_______________| following may be from 0 to 3, | opcode | depending on the opcode. operand formats: ___4___ ___4___ used for simple register and |_______|_______| for register indirect operands. | mode | reg | ___4___ ___4___ _______8_______ indexed |_______|_______|_______________| addressing | mode | index | disp | formats. ___4___ ___4___ ______________16_______________ |_______|_______|_______________________________| | mode | index | disp | ___4___ ___4___ ______ _____32__________________ |_______|_______|_____/ ... /_________________________| | mode | index | disp | immediate operand formats: ___4___ ___4___ |_______|_______| | mode | const | _______8_______ _______8_______ |_______________|_______________| | mode | const | _______8_______ ______________16_______________ |_______________|_______________________________| | mode | const | _______8_______ ______ _____32__________________ |_______________|_____/ ... /_________________________| | mode | const |This allows for instructions as short as 8 bits, but a 3-address add instruction (source operand plus source operand to destination operand) could take as few as 4 bytes or 32 bits (if all operands are in registers) or as many as 16 bytes or 128 bits (if all operands are in RAM requiring 32 bit index/displacement addressing for each). In fact, the instruction could be even longer, because the VAX had a special addressing mode that allowed a second index register to be multiplied by a small factor of two and added to the address specified by any of the indexed addressing modes listed above.
The Intel 80x86/Pentium architecture is even more irregular! The complexity of the VAX instruction set was designed in from the start, while a large part of the irregularity of the Pentium architecture is the result of evolutionary growth from the initial 8086 architecture; this was much simpler, but also irregular, having only a few registers, none of which were genuinely general purpose, and it had a 16 bit data path.
The original interpreters for these architectures were not pipelined to any significant depth. Later implementations of both of these architectures were deeply pipelined. Curiously, DEC (now part of Compaq) abandoned the VAX architecture at the peak of its market penetration, switching to the Alpha RISC architecture because they believed that there was little benefet to be gained by more intensive development of pipelined versions of the VAX. Intel, on the other hand, continues to develop every more sophisticated pipelined versions of the 80x86 family.
The classical way to implement a machine with variable length instructions is to use a microprogrammed interpreter. The fetch execute cycle begins by fetching just the opcode byte, and then, depending on the opcode, additional bytes are fetched, in sequence, until the instruction has been executed.
There is one approach to pipelined execution that is remarkably close to this in spirit. Instructions are pushed into the pipeline by the fetch stage one increment at a time. For example, a pipelined IBM 360 could be built with a 16-bit instruction register, and the fetch stage could fetch 16 bits at a time. When the opcode found in the instruction field of one instruction indicates that the immediately following 16 bits belong to the same instruction, the following instruction is grabbed from the pipeline, leaving a bubble in the pipe, as illustrated below:
| valid o-------------- ------|---------- | _|______|_______ | | |>______IR_______| | | ______|__________|___|______ | | | | | Stage that | | ... | | detects long | _|_|_ | | format ...---o--\0_1/ | |______________________|_____| | | ________|_______ _______|_______ |>______IR_______|_______________| |With this logic, the pipeline stage that detects the fact that an instruction requires a second word (or halfword) grabs the contents that is about to be clocked into the instruction register at the end of the preceeding pipeline stage and promotes it to be part of this instruction, and as it does so, it also forces the interstage register that would have received that instruction to be marked as invalid. For short instructions, the second half of the instruction register in the output of this pipeline stage may have some other use, or it may simply be ignored.
This approach to building long instructions only offers the full benefit of pipelined execution to short instructions, while it offers poorer performance for long instructions. This is justified primarily in processors with a relatively narrow data path to memory, for example, a 16 bit data path. This is a particularly appropriate choice when the long instruciton format is rare, so that the efficiently pipelined short instructions dominate the performance of application programs.
Many architectures, even the original 8086, had a very shallow pipeline with only 2 stages, instruction fetch and execute. The term pipelined execution is not frequently used for such shallow pipes; instead, trade literature frequently focuses on the part of the interstage register between these two stages, the instruction prefetch buffer. In fact, our minimal CISC design (lecture 12) came very close to incorporating this feature because it is a natural result of a word size that is bigger than the minimum instruction size!
The idea is simple: The instruction fetch mechanism always fetches an entire word from memory, even though the instruction itself is just a byte or halfword. Thus, during normal execution on a machine with 8-bit instructions and 16-bit words (for example, the original 80x86), there would be two instruction execution cycles per instruction fetch, and fetches would increment the program counter by two instead of incrementing it by one. This gives us an instruction fetch pipeline stage with the following structure:
instruction word from memory ---------------------------< | _________|_________ | | | -------------|-|--------- _|_|_ _|_|_ | \___/ \___/ | _______|_______ _______|_______ | |______IR_______|____PREFETCH___| | | | | | ---------- | instruction to remaining parts of CPUWith the above logic, on even numbered instruction execution cycles, we fetch a word from memory, while on odd cycles, we simply shift the instruction register to the left. The particular value we shift in does not matter, since it will never be seen by the rest of the CPU.
We face a problem with this logic! How do we suppress execution of instructions that are not supposed to be executed? If we branch to the first instruction in a pair, we should execute both instructions, but if we branch to the second half of the pair, we will fetch two instructions but only one should be executed.
This problem can be addressed by adding a valid bit to each potential instruction in the instruction register and its prefetch extension.
instruction word from memory -------------/------------<< | 2n _________|_________ | | * = 0 if branch to * | 1 | 0 an odd address _|_|_ _|_|_ _|_|_ | | ____| * = 1 in all other n+1/ -------------|-|--------- cases _|_|_ _|_|_ | \___/ \___/ /n+1 _______|_______ _______|_______ | |______IR_______|____PREFETCH___| | | | | | ---------- valid bit and instruction to remaining parts of CPUThis logic will inject invalid instructions into the later parts of the CPU once in a while, but it also offers an interesting possibility. Instead of fetching into the instruction register, we can fetch in anticipation of need, fetching not the current instruction and its immediate successor, but an instruction pair beyond that:
instruction word from memory ------------/-----------<< ________|_______ 2n | | * /n 1 /n 0 * = 0 if branch to _|_|_ _|_|_ _|_|_ an odd address | | | | --------|-o---- * = 1 in all other ----------|---|--------|-|-------- cases | --------|-o-|--------|-|------ | _|_|_ _|_|_|_ _|_|_ | | \___/ \_____/ \___/ | | _____|_____ _____|_____ _____|_____ | | |____IR_____|_PREFETCHa_|_PREFETCHb_| | | | | | | | /n+1 | o------- | | -----------|--------- valid bit and __|__ instruction to valid| rest of CPU bit| fetch from memory --> possible!Note that, when prefetch register a is invalid, for example, because we just branched to an odd instruction, this model allows us to do a long shift, emptying both prefetch registers as the one valid instruction from prefetch register b is shifted directly into the instruction register. A new instruction fetch will occur whenever prefetch register b is invalid, because we are guaranteed at least a short shift after each clock cycle.
Even the old 8086 architecture had a short prefetch buffer. You can infer this from the execution time statistics in the 8086 manual from Intel. The execution times of all instructions are documented (in clock cycles), and this documentation makes it very clear that the basic instruction interpreter in this machine is microcoded. However, the documenation also indicates that the times are only for straight-line code. Immediately following any branch instruction, the documentation indicates that there will be a penalty of several clock cycles. This branch penalty is the time it takes to load the first instruction at the branch destination into the prefetch buffer.
Once we have a wide enough prefetch buffer, along with valid bits associated with each instruction syllable, we can extract entire instructions from the prefetch buffer.
The prefetch buffer width we need is the sum of the longest instruction word plus the size of one word fetched from memory. So, for example, an IBM 360 with a 16-bit halfword instruction, a 32-bit fullword instruction and 32-bit data path to memory would need a 64 bit combined instruction register and prefetch buffer.
From Memory |_______________________| from shift network | | _____|_____ _____|_____ _____|_____ _____|_____ |____IR_____|_PREFETCHa_|_PREFETCHb_|_PREFETCHc_| | | _____________ o-----------|----| instruction |-----> how much | ___|_ | decode | to shift | \___/---|_____________| IR/PREFETCH | | ...|..........|.... other operand fetch activity? _____|__________|_______ |____IRa____|_____IRb____|In effect, the problem with variable length instructions is handled entirely by the decode logic that is added to the pipeline stage following the instruction fetch stage. This may be an operand fetch stage, for example.
The decode logic is a programmable logic array (generalized truth table) that takes, as inputs, the valid bits and any bits of the instruction that might be part of the opcode. The outputs of this programmable logic array include:
In addition, the decode PLA may convert a tightly encoded and perhaps variable length opcode field into an easily utilized unpacked format, so that, deeper in the pipeline, there is no need to check for specific opcodes; instead, deeper in the pipeline, there is one bit of the widened opcode field used to enable or disable each data path or option.
After this stage, all instructions are widened to a standard format that includes all optional fields that may be present in some instruction variants. The instruction fetch stage concerns itself only with the problem of keeping the IR/PREFETCH register full and providing the ability to shift it as many places as required as the decode logic extracts successive instructions from one end of the register.
When the RISC idea had penetrated the public perception enough that Intel felt a need to emphasize the currency of the Pentium architecture, an architecture that is inherently CISC oriented, they stated that the first stage of the pipeline in a Pentium converts instructions to RISC format. In fact, they are describing the logic we have just described here. Formally speaking, it is probably not correct to describe the function of the operand fetch stage this way, but it is certainly the case that, once this transformation is done, a CISC instruction set such as that of the Pentium or the VAX can be pipelined quite efficiently, as the commercial success of the Pentium clearly demonstrates.