26. Pipelined CPUs

Part of the 22C:122/55:132 Lecture Notes for Spring 2004
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Origin of Pipelined CPUs

In the late 1960's, IBM introduced the 360 model 91; this was the first commercial offering of a pipelined computer, if one ignores the fact that the CDC 6600 computer could be viewed, in retrospect, as pipelined, in the sense that instruction fetch was done in anticipation of later execution by one or another functional unit.

The IBM System 360 family of computers had a common architecture, that is, a common instruction set, as seen by the users, with a wide variety of implementations. Vertical microcode (with large numbers of short microinstructions per instruction) and horizontal microcode (with a small number of long microinstgructions per instruction) were both used for some members of this family.

It is worth noting that the System 360 family is alive and well today. The number was coined by marketing to mean a 3rd generation machine for the 1960's, so, of course, when the 1960's ended, marketing considerations demanded a change of numbering, so the 370 family was born. In the 1980's, a new and somewhat scrambled numbering system obscured what was going on, but in the 1990's, the 390 family carried on the tradition. Today's representatives of this family remain object-code compatable with the original for user programs, despite immense changes in the implementation technology, the I/O architecture, and the operating system environment.

The IBM System 360 family is characterized by a 32-bit word, and 16 general purpose registers. The basic instruction format of the IBM System 360 was and is:

	Full-Word Instructions:
                8           4       4       4              12
	|_______________|_______|_______|_______|_______________________|
        |_______________|_______|_______|_______|_______________________|
        |       OP          R       X       B             DISP          |

	  Typical examples:

             L  (load)          GPR[R] = M[ GPR[X] + GPR[B] + DISP ]
             LA (load address)  GPR[R] =    GPR[X] + GPR[B] + DISP
             ST (store)                  M[ GPR[X] + GPR[B] + DISP ] = GPR[R]
             A  (add)           GPR[R] = M[ GPR[X] + GPR[B] + DISP ] + GPR[R]
             BAL(branch and link)   PC =    GPR[X] + GPR[B] + DISP; GPR[R] = PC

	Half-Word Instructions:
                8           4       4
	|_______________|_______|_______|
        |_______________|_______|_______|
        |       OP          R       X   |

	  Typical examples:

             BALR (branch and link) PC =    GPR[X]; GPR[R] = PC

There are hunddreds of texts on programming this family of machines (The assembly language was BAL, the Basic Assembly Language, so many texts are catalogued under this and not under IBM 360 in librar catilogs). The purpose of the above list of instructions is not to provide anything like an exhaustive list! Rather, it is to illustrate the kinds of operations the the underlying hardware must execute for each instruction execution cycle.

During any cycle, the machine may fetch a 16-bit halfword instruction or a 32-bit full-word instruction. In executing this instruction, it may perform any of the following operations:

Not all instructions will involve all of these stages, but note that This list includes several time-consuming operations. Gathering operands from registers will be fairly fast, since it involves accessing a very small and very fast RAM used to implement the register file. If we ignore floating point operations and multiplication and division, the addition required to form an effective address is very likely to be just as as time consuming as the ALU operations required for instructions such as add and subtract.

The memory reference for operands is certain to take as long as an instruction fetch, and it is quite possible to imagine a system where the time to perform arithmetic operations is comparable to the time taken to access memory. This leads to the suggestion that this architecture could be pipelined with the following pipeline stages:

(It is important to note that this set of pipeline stages was derived from the instruction set of one computer, under one set of assumptions about the relative time taken for various operations. Other instruction sets and other assumptions about relative times lead naturally to other sets of operations!)

The Interstage Registers

Given the above breakdown of an instruction set into pipeline stages, the first question that must be resolved is: What goes in the interstage registers?

The first stage fetches an instruction, so the interstage register that it feeds must contain this instruction word, here, stored in the IR interstage register. For the IBM 360 architecture and many others, there are instructions that require knowledge of the address of the instruction following the instruction that was just fetched. In the case of the IBM 360, the BAL and BALR instructions need this so they can save a return address. Other computers need the same information for PC relative branches. Here, we store this in the NEXT PC interstage register.

The gather operands stage collects the operands of the instruction from general registers and from the next PC interstage register, and feeds the addressing adder which produces an effective address. The details of what operand is loaded in the R and EA interstage registers are determined by the opcode.

Certain fields of the instruction register are no-longer needed at this point. The displacement and base register B have been used, and for all of the example instructions given here, there is no further need for the index register X, although there are many instructions that do need this.

The memory reference may save a register to memory, or it may load a register from memory. In the former case, most of the work of the instruciton has been completed and the ALU stage is only needed for cleanup. In the latter case, the ALU will be used to combine the two register operand R with the memory operand OP to compute some result. The opcode field is still needed, as is the register number, so the result can be saved.

Implementing a Pipelined Processor

If we want to give more detail on a pipelined processor in one or two lectures, we must simplify the design! To do this, we will eliminate the double indexing of the IBM 360 architecture, and then we will begin a bottom-up design that will actually yield most of the instruciton set as a side-effect of the design process!

We begin by assuming a canonical 5-stage pipe; this is the 5-stage pipeline that follows naturally from an analysis of an instruction set for a 1 and 1/2 address machine that contains both memory to register and register to memory arithmetic operations; the complexity of this pipeline is more than sufficient to handle register to register operations.

				      ______      _____
	   Instruction Fetch         |______|----|     |
				 _    |____|     |     |
	   Address Computation  | |--|______|    |     |
				| |   |____|     |     |
	   Operand Fetch        | |--|______|----|  M  |
				|R|   |____|     |     |
	   Operate              | |  |______|    |     |
				| |   |____|     |     |
	   Operand Store        |_|--|______|----|_____|
Aside: Instruction sets may be classified in terms of how many memory addresses there are per instruction. A 3-address machine allows two source operands from anywhere in memory to be combined and the result stored anywhere in memory, using only one instruction. The DEC/Compaq VAX architecture is the best known of the post 1970 3-address machines. A 2-address machine allows an operand anywhere in memory to be combined with any variable in memory. The DEC PDP-11 and M680x0 families allowed this. A 1-address machine allows an operand anywhere in memory to be addressed; typically, such machines have an accumulator with which operands may be combined. Zero-address machines use a stack for all operand manipulation and addressing, with no address fields required in any instructions.

Most stack machines contain several 1-address instructions for loading, storing and branching. A single accumulator is very limiting, so most one address machines contain multiple accumulators. Machine instructions on such a one-address machine therefore allow a rudimentary second address with a very limited address space, mainly, the identity of the accumulator to use. Such a machine is called a one-and-a-half address machine.

We'll assume that a word is 32 bits, that there are 16 general purpose registers R, word addressable memory, and that all instruction words are identical with the following format typical of 1 and 1/2 address machines:

                8           4       4                  16
         _______________ _______ _______ _______________________________
	|_______________|_______|_______|_______________________________|
        |    opcode     |   r   |   x   |             disp              |

We'll assume that we want to support instructions such as:

Furthermore, we'll have the rule that when x is zero, instead of designating R[0], it designates the constant 0, turning off indexing and giving us genuine immediate operands.

Finally, we'll assume that when x is 1111, this is a reference to R[15], the program counter, giving us PC relative addressing.

For the moment, we will completely ignore branches!

Deriving the Contents of the Interstage Registers

We can actually derive the contents of all of the interstage registers from the above description! We start with the instruction fetch to address computation interface. Clearly, this must contain the instruction that was fetched! In addition, because we are interested in relative addressing, we should pass onward the value of the PC that was used to fetch that instruction.

           Instruction Fetch
	|____________________________________|
	  |       pc      || op|r|x|  disp | 
	 _|_______________||___|_|_|_______|_
	|                                    |
	   Address Computation

The address computation stage is going to take information from register x and combine it with various sources to compute the effective address used by this instruction; therefore, the output of the address computation phase must include this effective address. The x and disp fields of the instruction register are not needed by later pipeline stages, and the pc value has been combined into the effective address, if it is needed, so these may be discarded. This gives us:

	   Address Computation
	|__________________________|
	  | op|r||       ea      |
	 _|___|_||_______________|_
	|                          |
	   Operand Fetch

The operand fetch stage is going to use either or both of r and ea to get operands from memory and registers, so obviously, it must include these operands in its output. The operand store stage will need both r and ea itself, so we pass these on as well, giving:

	   Operand Fetch
	|____________________________________________________________|
	  | op|r||       ea      ||      opa      ||      opb      |
	 _|___|_||_______________||_______________||_______________|_
	|                                                            |
	   Operate

The operate stage is going to contain the ALU that combines the operands as instructed by the opcode. Thus, its output must contain the result instead of the operands:

	   Operate
	|___________________________________________|
	  | op|r||       ea      ||     result    |
	 _|___|_||_______________||_______________|_
	|                                           |
	   Operand Store

This completes all the interstage registers!

Assumptions about the Memory and Register Interfaces

In outlining the structure of the pipeline stages, we must make some assumptions about the registers and memory. Here, we will assume that these are true multiport random access memories. For the CPU registers, this is a realistic assumption. For the main memory, this is unrealistic, but we won't let this deter us.

In addition, we will assume that there is some cost to a real multiport register or memory, so we will provide an input with each memory or register port, a signal saying "port needed". If this is not asserted during a pipeline cycle, that port is inactive.

Details of the Pipeline, Stage by Stage

We can now begin to build the pipeline stages, starting with the instruction fetch stage. In the abstract, this perorms the following logic:

	repeat
	   output_ir = M[pc]
	   output_pc = pc
	   pc = pc + 1
	forever
In the following, we'll assume that all registers are clocked simultaneously unless clock logic is shown! With this assumption, the fetch stage can be constructed as follows:
	   Instruction Fetch Stage
	                               True -----> need memory
	                                     
	 ---------o------------------------------> address to memory
	|       __|__               -------------< data from memory
	|      | +1  |             |
	|      |__ __|             |
	|         |                |
	|      ...|...             |
	|         |                |
	|  _______|_______  _______|_______
	| |       pc      || op|r|x|  disp | 
	| |_______________||___|_|_|_______|
        |_________|
The logic given above cannot perform branch instructions, but we will solve this later by adding a multiplexor at the location marked with a dotted line.

The address computation stage is only marginally more complex. This performs the following computation:

	repeat
	   output_op = input_ir.op
	   output_r  = input_ir.r
	   if input_ir.x = 0000
	      need_regs = 0
	      output_ea = input_ir.disp + 0
	   else if input_ir.x = 1111
	      need_regs = 0
	      output_ea = input_ir.disp + input_pc
	   else
	      need_regs = 1
	      output_ea = input_ir.disp + R[input_x]
	   endif
	forever

This reduces to the following logic:

	   Address Computation Stage
	   _______________  _______________
	  |       pc      || op|r|x|  disp | 
	  |_______________||___|_|_|_______|
                  |          |  | |    |
             -----|----------   | |    |
            |   --|-------------  o----|----------> register number
            |  |   -    ----------|----|----------< data from registers
            |  |    |  |  0       |    |
            |  |  __|__|__|__   __|__  |  ___ 
            |  |  \ 0  0  1 /--|=0000|-|-|nor|----> need register value
            |  |   \1__0__0/---|=1111|-|-|___|
            |  |       |               |
            |  |       |    ----------- 
            |  |       |___|            
            |  |      |  +  |           
            |  |      |_____|           
	   _|__|  _______|_______
	  | op|r||       ea      |
	  |___|_||_______________|

The operand fetch stage is quite simple:

	repeat
	   output_op = input_op
	   output_r  = input_r
	   output_ea = input_ea
	   if read_register( input_op )
	      output_opa = R[ input_r ]
	      needs_register = true
	   else
	      output_opa = undefined
	      needs_register = false
	   endif
	   if read_memory( input_op )
	      output_opb = M[ input_ea ]
	      needs_memory = true
	   else
              output_opb =    input_ea
	      needs_memory = false
	   endif
	forever

This translates to the following logic:

	   Operand Fetch Stage
	   _____  _______________
	  | op|r||       ea      |
	  |___|_||_______________|
	    |  |         |
	    |  o---------|------------------------> register number       
	    |  |         |   _________     -------< data from registers
	    |  |         |  |  read   |   |                        
	    o--|---------|--|registers|---|-------> need register value
	    |  |         |  |_________|   |                        
	    |  |         |                |                        
	    |  |         o--------------------o---> address to memory
	    |  |         |                |   |  -< data from memory
	    |  |         |                |  _|_|_
	    |  |         |   ______     --|--\0_1/                 
	    |  |         |  | read |   |  |    |                   
	    o--|---------|--|memory|---o--|----|--> needs memory
	    |  |         |  |______|      |    |
	    |  |         |                |     -----------        
	   _|__|  _______|_______  _______|_______  _______|_______
	  | op|r||       ea      ||      opa      ||      opb      |
	  |___|_||_______________||_______________||_______________|

The operate stage is equally straightforward:

	repeat
	   output_op = input_op
	   output_r  = input_r
	   output_ea = input_ea
	   output_result = ALU( aluop( input_op ), input_opa, input_opb )
	forever

Giving us this hardware:

	   Operate Stage
	   _____  _______________  _______________  _______________
	  | op|r||       ea      ||      opa      ||      opb      |
	  |___|_||_______________||_______________||_______________|
            |  |         |                |                |
            |  |         |               -   -------------- 
            |  |         |    _____    _|___|_
            o--|---------|---|aluop|--|  alu  |
            |  |         |   |_____|  |_______|
	   _|__|  _______|_______  _______|_______
	  | op|r||       ea      ||     result    |
	  |___|_||_______________||_______________|

Finally, we have the result store stage. This is complicated by the fact that it may store either to registers or to memory.

	repeat
	   if write_register( input_op )
	      R[ input_r ] = input_result
	   endif
	   if write_memory( input_op )
	      M[ input_ea ] = input_result
	   endif
	forever

We can implement this as follows:

	   Operand Store Stage
	   _____  _______________  _______________
	  | op|r||       ea      ||     result    |
	  |___|_||_______________||_______________|
            |  |         |                |
	    |   ---------|----------------|-------> register number       
	    |            |   ________     o-------> data to registers
	    |            |  |  write |    |                        
	    o------------|--|register|----|-------> write register
	    |            |  |________|    |                        
	    |            |                |                        
	    |             ----------------|-------> address to memory
	    |                ______        -------> data to memory
	    |               | write|
	     ---------------|memory|--------------> write memory
                            |______|

The above design omitted all details of flow-of-control through a program! We have no way to perform a branch, and this is a serious oversight which we must eventually correct!

Deriving an Instruction Set

The design outlined above directly suggests an instruction set for this machine! In looking at the information derived from the opcode field of the instruction to control the various pipeline stages, we have the following:

We can easily code these in an 8-bit instruction word as follows:
         _______________ _______ _______ _______________________________
	|_______________|_______|_______|_______________________________|
        |opcode (8 bits)|   r   |   x   |       disp (16 bits)          |
         _______ _ _ _ _
        |_______|_|_|_|_|
        |       | | | | |
          aluop rr  wr
                  rm  wm
We could consider the 4 read and write control bits to comprise something of an address mode field for the instruction. This leads to the following interpretation:
         _______ _ _ _ _
        |_______|_|_|_|_|
        |       | | | | |
          aluop rr  wr
                  rm  wm
                 0 0 0 0  -- *
                 0 0 0 1  -- *
                 0 0 1 0  -- immediate to register
                 0 0 1 1  -- *
                 0 1 0 0  -- *
                 0 1 0 1  -- *
                 0 1 1 0  -- memory to register
                 0 1 1 1  -- *
                 1 0 0 0  -- *
                 1 0 0 1  -- register to memory
                 1 0 1 0  -- register + immediate to register
                 1 0 1 1  -- *
                 1 1 0 0  -- *
                 1 1 0 1  -- register + memory to memory
                 1 1 1 0  -- register + memory to register
                 1 1 1 1  -- register + memory to both register and memory

The problem with this instruction encoding is that it leaves us with a number of useless addressing modes, for example, modes in which results are discarded, modes in which an immediate operand is also used as an address, and so on. These modes are starred above. In addition, depending on what operations the ALU provides, additional modes provided above may be redundant.

Nonetheless, the above tentative instruction set is actually a good start in the design of a practical computer! To complete this kind of design, what we will do is take some of the opcodes that serve no purpose with the register transfer design given and decode them for other purposes such as conditional branch instructions and other potentially interesting operations.