30. Variable Length Instructions

Part of the 22C:122/55:132 Lecture Notes for Spring 2004
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science


The pipelined processors we've examined to this point have had a fixed instruction word. In contrast, the first successful pipelined design, the IBM 360/95 had several instruction formats, and a number of other pipelined designs have had even more complex formats, with the Intel 80x86/Pentium family among the most complex ever.

For example, the DEC VAX family of processors introduced in the mid 1970's had the following instruction format (considerably simplified):


	 _______8_______  the number of operands or addresses
	|_______________| following may be from 0 to 3,
	|    opcode     | depending on the opcode.

	for each operand that follows:

	 ___4___ ___4___  depending on the mode field, one of the
	|_______|_______| following address modifiers may follow.
	| mode  |  reg  |

	address modifiers or immediate constants:

	|      val      |

	|              val              |

	|                              val                              |

This allows for instructions as short as 8 bits, but a 3-address add instruction (source operand plus source operand to destination operand) could take as few as 4 bytes or 32 bits (if all operands are in registers) or as many as 16 bytes or 128 bits (if all operands use 32 bit index/displacement addressing). In fact, the instruction could be even longer, because the VAX had a special addressing mode that allowed a second index register to be multiplied by a small factor of two and added to the address specified by any of the indexed addressing modes listed above.

The Intel 80x86/Pentium architecture is even more irregular! The complexity of the VAX instruction set was designed in from the start, while a large part of the irregularity of the Pentium architecture is the result of evolutionary growth from the initial 8086 architecture; this was much simpler, but also irregular, having only a few registers, none of which were genuinely general purpose, and it had a 16 bit data path.

The original interpreters for these architectures were not pipelined to any significant depth. Later implementations of both of these architectures were deeply pipelined. Curiously, DEC (now part of Compaq) abandoned the VAX architecture at the peak of its market penetration, switching to the Alpha RISC architecture because they believed that there was little benefet to be gained by more intensive development of pipelined versions of the VAX. The DEC Alpha remains one of the most beautiful 64-bit architectures ever developed, and benchmarks done by Silicon Graphics show that current (2004) Alpha machines HP are slightly faster than comparable SGI systems based on Intel Itanium chips.

Variable Length Instructions through Pipeline Promotion

The classical way to implement a machine with variable length instructions is to use a microprogrammed interpreter. The fetch execute cycle begins by fetching just the opcode byte, and then, depending on the opcode, additional bytes are fetched, in sequence, until the instruction has been executed.

There is one approach to pipelined execution that is remarkably close to this in spirit. Instructions are pushed into the pipeline by the fetch stage one increment at a time. For example, a pipelined IBM 360 could be built with a 16-bit instruction register, and the fetch stage could fetch 16 bits at a time. When the opcode found in the instruction field of one instruction indicates that the immediately following 16 bits belong to the same instruction, the following instruction is grabbed from the pipeline, leaving a bubble in the pipe, as illustrated below:

	      NOP o-------------     typically, this is the
	       _|_|_            |    instruction fetch stage
	       \1_0/---------   |
	 ________|_______   |   |
	|>______IR_______|  |   |
	  |                 |   |      |  typically, this is the
	  |   Stage that    |   |  ... |  address computation stage
	  |   detects long  |  _|_|_   |  (but this may not work,
	  |   format  ...---o--\1_0/   |  see text!)
	         |               |
	 ________|_______ _______|_______

With this logic, the pipeline stage that detects the fact that an instruction requires a second word (or halfword) grabs the contents that is about to be clocked into the instruction register at the end of the preceeding pipeline stage and promotes it to be part of this instruction, and as it does so, it also forces the interstage register that would have received that instruction to be marked as invalid. For short instructions, the second half of the instruction register in the output of this pipeline stage may have some other use, or it may simply be ignored.

This approach to building long instructions only offers the full benefit of pipelined execution to short instructions, while it offers poorer performance for long instructions. This is justified primarily in processors with a relatively narrow data path to memory, for example, a 16 bit data path. This is a particularly appropriate choice when the long instruciton format is rare, so that the efficiently pipelined short instructions dominate the performance of application programs.

One serious problem with this scheme is that this promotion path interacts in a troublesome way with other pipeline issues. If the upstream stage is able to request a stall, for example, this must be inhibited when the contents of that stage are promoted to operand status for some downstream stage.

In addition, the downstream stage cannot do any arithmetic or anything else that adds delay to the data path! Thus, if we promote operands from the instruction fetch stage to the address computation stage, we must be aware that to get the operand, we paid the price of an instruction fetch from memory. Adding the cost of indexed addressing to this might lower the maximum clock rate for our processor unacceptably.

This argues for the addition of a somewhat underutilized instruction decode stage between the instruction fetch stage and the address computation stage, but adding this lengthens the pipe, adding to the branch delay.

The Prefetch Buffer

Many architectures, even the original 8086, had a very shallow pipeline with only 2 stages, instruction fetch and execute. The term pipelined execution is not frequently used for such shallow pipes; instead, trade literature frequently focuses on the part of the interstage register between these two stages, the instruction prefetch buffer. In fact, our minimal CISC design (lecture 12) came very close to incorporating this feature because it is a natural result of a word size that is bigger than the minimum instruction size!

The idea is simple: The instruction fetch mechanism always fetches an entire word from memory, even though the instruction itself is just a byte or halfword. Thus, during normal execution on a machine with 8-bit instructions and 16-bit words (for example, the original 80x86), there would be two instruction execution cycles per instruction fetch, and fetches would increment the program counter by two instead of incrementing it by one. This gives us an instruction fetch pipeline stage with the following structure:

			instruction word from memory
               |first half     |second half
               |               |
               |  -------------|-|---------
 load (0)     _|_|_           _|_|_        | shift data path
 shift(1)-----\0_1/-----------\0_1/        |
         _______|_______ _______|_______   |
	|______IR_______|____PREFETCH___|  |
                |               |          |
                |                ----------
           to remaining
           parts of CPU

With the above logic, on even numbered instruction execution cycles, we fetch a word from memory, while on odd cycles, we simply shift the instruction register to the left.

We face a problem with this logic! How do we suppress execution of instructions that are not supposed to be executed? If we branch to the first instruction in a pair, we should execute both instructions, but if we branch to the second half of the pair, we will fetch two instructions but only one should be executed.

On the CDC 6600, where 4 15-bit instruction syllables were packed per word, this problem was avoided. On that machine, all branches were to a word address, and as far as the hardware designers were concerned, it was up to the programmers to arrange their code so that all branches were to word boundaries. In fact, the assembler solved this problem by inserting no-ops before all labeled instructions to bring those instructions to the first position in the next word.

This problem can be addressed by adding a valid bit to each potential instruction in the instruction register and its prefetch extension. We can reset this valid bit anywhere in the pipeline when we find we have to invalidate an instruction, and if this bit is reset, the instruction becomes a no-op, no matter what other opcode bits are set.

			instruction word from memory
                        |              2n
               _________|_________          * = 0 if branch to
                |               |               an odd address
              * |             1 |    0      
             _|_|_           _|_|_  _|_|_   * = 1 in all other
               |               |  ____|         cases
            n+1/  -------------|-|--------- 
 load (0)     _|_|_           _|_|_        | shift data path
 shift(1)-----\0_1/-----------\0_1/        |
         _______|_______ _______|_______   |
	|______IR_______|____PREFETCH___|  |
                |               |          |
                |                ----------
           valid bit and
           instruction to
        remaining parts of CPU

This logic will inject invalid instructions into the later parts of the CPU once in a while, but it also offers an interesting new possibility. Instead of fetching into the instruction register, we can fetch in anticipation of need, fetching not the current instruction and its immediate successor, but an instruction beyond that:

                               instruction word from memory
                         _______|________      2n
                          |           |
                        * /n        1 /n   0     * = 0 if branch to
                       _|_|_       _|_|_  _|_|_      an odd address
                         |           |      |
                         |           |  ----     * = 1 in all other
              -----------|-----------|-|--------     cases
             |  ---------|-o---------|-|------  |
            _|_|_       _|_|_       _|_|_     | |
        Sa--\0_1/   Sb--\0_1/-------\0_1/     | |
         _____|_____ _____|_____ _____|_____  | |
	|____IR_____|_PREFETCHa_|_PREFETCHb_| | |
              |           |           |       | |
              /n+1        |            -------  |
              |            ---------------------
        valid bit and
        instruction to 
         rest of CPU

Note that, when prefetch register a is invalid, for example, because we just branched to an odd instruction, this model allows us to do a long shift, emptying both prefetch registers as the one valid instruction from prefetch register b is shifted directly into the instruction register. A new instruction fetch will occur whenever prefetch register b is invalid, because we are guaranteed at least a short shift after each clock cycle, barring stalls.

Here is a table giving the shift count to be used in each clock cycle as a function of the valid bits in PREFETCHa and PREFETCHb. Shift counts are given in units of instruction syllables first and then decoded into a form appropriate for use to control the multiplexors shown above. The fetch output shows whether the next pipeline clock cycle should cause an instruciton fetch.

     valid      valid    | Count Sa Sb  Fetch
       0          0      |   1    -  0    1
       0          1      |   2    1  0    1
       1          0      |   1    0  0    1
       1          1      |   1    1  1    0

Even the old 8086 architecture had a short prefetch buffer. You can infer this from the execution time statistics in the 8086 manual from Intel. The execution times of all instructions are documented (in clock cycles), and this documentation makes it very clear that the basic instruction interpreter in this machine is microcoded. However, the documenation also indicates that the times are only for straight-line code. Immediately following any branch instruction, the documentation indicates that there will be a penalty of several clock cycles. This branch penalty is the time it takes to load the first instruction at the branch destination into the prefetch buffer.

The Prefetch Buffer and Variable Length Instructions

Once we have a wide enough prefetch buffer, along with valid bits associated with each instruction syllable, we can extract entire instructions from the prefetch buffer, no matter what the instruction format.

The prefetch buffer width we need is the sum of the longest instruction word plus the size of one word fetched from memory. So, for example, an IBM 360 with a 16-bit halfword instruction, a 32-bit fullword instruction and 32-bit data path to memory would need a 64 bit combined instruction register and prefetch buffer.

                                       From Memory
            from shift network        |           |      
         _____|_____ _____|_____ _____|_____ _____|_____
              |           |     _____________
              o-----------|----| instruction |-----> how much
              |        ___|_   |   decode    |       to shift
              |        \___/---|_____________|       IR/PREFETCH
              |          |
           ...|..........|.... other operand fetch activity?
In effect, the problem with variable length instructions is handled entirely by the decode logic that is added to the pipeline stage following the instruction fetch stage. This may be an operand fetch stage, for example.

The decode logic is a programmable logic array (generalized truth table) that takes, as inputs, the valid bits and any bits of the instruction that might be part of the opcode. The outputs of this programmable logic array include:

  1. Multiplexor controls that determine the shift paths in the prefetch buffer on the next clock cycle. These should squeeze bubbles of invalid data out of the prefetch buffer, packing the valid data toward the instruction register end of the register, and they should account for the removal of one valid instruction once such an instruction has been assembled.

  2. The valid bit for the next pipeline stage. This goes to one only when all the parts of an entire valid instruction are present and properly packed in the instruction prefetch buffer.

  3. Controls for the multiplexors that select default values for instruction fields that are not provided from the instruction register.

  4. The memory fetch control, indicating whether an instruction fetch cycle is needed during this pipeline cycle.

In addition, the decode PLA may convert a tightly encoded and perhaps variable length opcode field into an easily utilized unpacked format, so that, deeper in the pipeline, there is no need to check for specific opcodes; instead, deeper in the pipeline, there is one bit of the widened opcode field used to enable or disable each data path or option.

After this stage, all instructions are widened to a standard format that includes all optional fields that may be present in some instruction variants. The instruction fetch stage concerns itself only with the problem of keeping the IR/PREFETCH register full and providing the ability to shift it as many places as required as the decode logic extracts successive instructions from one end of the register.

When the RISC idea had penetrated the public perception enough that Intel felt a need to emphasize the currency of the Pentium architecture, an architecture that is inherently CISC oriented, they stated that the first stage of the pipeline in a Pentium converts instructions to RISC format. In fact, they are describing the logic we have just described here. Formally speaking, it is probably not correct to describe the function of the operand fetch stage this way, but it is certainly the case that, once this transformation is done, a CISC instruction set such as that of the Pentium or the VAX can be pipelined quite efficiently, as the commercial success of the Pentium clearly demonstrates.