22. Functional Unit Parallelism

Part of the 22C:122/55:132 Lecture Notes for Spring 2004
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Coprocessors in the opcode space

The coprocessors discussed above appear to be peripheral devices from the point of view of the CPU or instruction execution unit. This is fine for the Ultimate RISC, where there is no concept of an opcode field, but that was something of an absurd architecture. On a decent computer with a more interesting instruction format, we would like to design the coprocessor so that it interprets the opcode of the instruction being executed. This is commonly done as follows:

First, the CPU is designed with instructions that are reserved for interpretation by a coprocessor. During the normal instruction execution cycle, the CPU treats these as follows:

        -- in the list of cases, one per opcode
        coprocessor_opcode:
           if coprocessor_acknowledge
              -- some coprocessor is willing to execute this opcode
              repeat
                 -- do nothing
              until coprocessor_done
           else
              handle unimplemented_instruction trap

This requires two additional lines of state information on the bus: coprocessor_acknowledge, asserted by any coprocessor that has decided to execute the instruction, and coprocessor_done, asserted by that coprocessor when it finishes executing the instruction.

Second, the CPU must expose selected state information on the bus. Specifically, whenever it is fetching an instruction word, it must set a special bus line indicating that this read from memory is an instruction fetch. Coprocessors are expected to constantly monitor the data bus, and whenever they see instruction fetch asserted by the CPU, they must take a copy of the data; if the opcode field of that data corresponds to a coprocessor instruction within that coprocessor's repertoire, they must assert coprocessor_acknowledge and then execute the instruction before asserting coprocessor_done.
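
As a concrete illustration, here is a minimal C sketch of this handshake, written as a purely sequential simulation in which the bus is a shared record of flags. The signal names follow the text, but the opcode value, field positions, and the single hard-wired coprocessor are inventions of the example.

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        /* illustrative bus state: a data word plus the control lines */
        struct bus {
            uint16_t data;              /* instruction or operand word */
            bool instruction_fetch;     /* asserted by the CPU on a fetch */
            bool coprocessor_acknowledge;
            bool coprocessor_done;
        };

        #define COP_OPCODE 0x3F         /* hypothetical coprocessor opcode */

        /* coprocessor side: watch every instruction fetch and claim the
           instruction if its opcode is in this coprocessor's repertoire */
        void coprocessor_snoop(struct bus *b) {
            if (!b->instruction_fetch) return;
            unsigned opcode = b->data >> 10;    /* invented field position */
            if (opcode == COP_OPCODE) {
                b->coprocessor_acknowledge = true;  /* claim it */
                /* ... execute the instruction internally ... */
                b->coprocessor_done = true;         /* report completion */
            }
        }

        /* CPU side: fetch, let coprocessors snoop, then either wait for
           coprocessor_done or take an unimplemented instruction trap */
        void cpu_coprocessor_case(struct bus *b, uint16_t instruction) {
            b->data = instruction;
            b->instruction_fetch = true;
            coprocessor_snoop(b);       /* stands in for bus monitoring */
            b->instruction_fetch = false;
            if (b->coprocessor_acknowledge) {
                while (!b->coprocessor_done)
                    ;                   /* do nothing until done */
                b->coprocessor_acknowledge = false;
                b->coprocessor_done = false;
            } else {
                printf("unimplemented instruction trap\n");
            }
        }

        int main(void) {
            struct bus b = {0};
            cpu_coprocessor_case(&b, (uint16_t)(COP_OPCODE << 10)); /* claimed */
            cpu_coprocessor_case(&b, 0x0000);                       /* traps */
            return 0;
        }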

Typically, we want some coprocessor instructions to load data from memory to coprocessor registers, while others store data from coprocessor registers to memory, and yet others perform register to register operations within the coprocessor. There are several ways of doing this.

One approach is to pre-designate, in the CPU, which coprocessor instructions include load and store cycles. Coprocessor-load instructions carry out the normal address computation part of the instruction execution cycle in the CPU, but the CPU ignores the value read from memory; if the coprocessor is present, it uses that value, while if it is absent, the CPU should trap. Similarly, for coprocessor-store instructions, the CPU computes the memory address as usual, but when the time comes to use it, the CPU puts no data on the data bus; if the coprocessor is present, it does this job.
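
This division of labor might be sketched in C as follows; the memory array, coprocessor register file, and coprocessor_present flag are all invented for the example, and the trap is reduced to a return value.

        #include <stdbool.h>
        #include <stdint.h>

        #define MEM_WORDS 1024

        static uint32_t memory[MEM_WORDS];  /* invented main memory */
        static uint32_t cop_reg[8];         /* invented coprocessor registers */
        static bool coprocessor_present = true;

        /* coprocessor-load: the CPU does the address arithmetic and runs
           the read cycle but ignores the data; the coprocessor captures
           the value into one of its own registers */
        bool coprocessor_load(uint32_t base, uint32_t offset, unsigned r) {
            uint32_t address = base + offset;       /* CPU's job */
            uint32_t data = memory[address % MEM_WORDS];
            if (!coprocessor_present)
                return false;                       /* CPU would trap */
            cop_reg[r & 7] = data;                  /* coprocessor's job */
            return true;
        }

        /* coprocessor-store: the CPU supplies the address, while the
           coprocessor supplies the data for the write cycle */
        bool coprocessor_store(uint32_t base, uint32_t offset, unsigned r) {
            uint32_t address = base + offset;       /* CPU's job */
            if (!coprocessor_present)
                return false;                       /* CPU would trap */
            memory[address % MEM_WORDS] = cop_reg[r & 7];
            return true;
        }

        int main(void) {
            memory[42] = 7;
            coprocessor_load(40, 2, 3);     /* cop_reg[3] = memory[42] */
            coprocessor_store(100, 0, 3);   /* memory[100] = cop_reg[3] */
            return 0;
        }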

If a coprocessor is not ready when the CPU fetches one of its instructions, the coprocessor must use the bus stall mechanism to stop the CPU until it can cooperate. Note that the model described here allows multiple coprocessors. The CPU raises an unimplemented instruction trap only if no coprocessor acknowledges an opcode. So, we could have a floating point coprocessor, a graphics coprocessor and a vector coprocessor if we wanted, with each interpreting a different subset of the coprocessor opcodes, as sketched below. In the extreme, the entire CPU may be designed as an instruction execution unit that just manages the fetch-execute cycle, with all instruction execution handled by coprocessors.
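
To make the trap rule concrete, here is a minimal C sketch in which the CPU traps only when no coprocessor claims the opcode; the set of coprocessors and the opcode ranges they claim are invented for the example.

        #include <stdbool.h>
        #include <stdio.h>

        /* each "coprocessor" claims some set of opcodes; it returns true
           if it acknowledges (and, in this sketch, completes) the opcode */
        typedef bool (*coprocessor)(unsigned opcode);

        static bool floating_point_unit(unsigned op) {
            return op >= 060 && op <= 067;      /* invented opcode range */
        }
        static bool vector_unit(unsigned op) {
            return op >= 070 && op <= 073;      /* invented opcode range */
        }

        static coprocessor units[] = { floating_point_unit, vector_unit };
        #define NUNITS (sizeof units / sizeof units[0])

        /* CPU rule: trap only if no coprocessor acknowledges the opcode */
        void dispatch(unsigned opcode) {
            for (unsigned i = 0; i < NUNITS; i++)
                if (units[i](opcode))
                    return;             /* some unit claimed and ran it */
            printf("unimplemented instruction trap: opcode %02o\n", opcode);
        }

        int main(void) {
            dispatch(064);      /* claimed by the floating point unit */
            dispatch(072);      /* claimed by the vector unit */
            dispatch(055);      /* nobody claims it, so the CPU traps */
            return 0;
        }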

This approach was taken on the DEC PDP-11/45 floating point unit back in 1973; as a result, execution of integer and floating point instructions could be overlapped. The floating point instruction set looked like it was fully integrated into the normal instruction set of the machine, but the floating point unit was optional, and it operated in parallel with the CPU. Floating point instructions were executed using a sequential microcoded approach, and this was not fast, yet the CPU never waited for the floating point unit to complete an operation except when a new floating point instruction was encountered before the previous one had been completed. Because most floating point algorithms require several integer instructions to be executed per floating point instruction, the result was an effective use of parallelism.

(Other floating point coprocessors DEC built for other members of the PDP-11 family were not as fast. The later PDP-11/70 used the PDP-11/45 floating point unit, while the 11/40 and 11/23 used lower performance coprocessors.)

The floating point coprocessors for the microprocessors of the 1980's typically operated similarly; some offered high performance by overlapping floating point with integer operations, but most did not.

This idea had its origins in the CPU of the CDC 6600. That CPU was composed entirely of components called functional units, each of which behaved like a coprocessor of the type described here and was able to operate in parallel with the other functional units. If the compiler was careful, rearranging instructions so that consecutive instructions rarely referenced the same functional unit, the machine was extremely fast.

Chapter 39 of Bell and Newell contains a writeup of the CDC 6600 written in 1964; this machine remained the fastest machine on earth until the early 1970's, when the CDC 7600 replaced it. Seymour Cray designed both machines, and after he quit CDC, he founded Cray Research and built the Cray I, which was the fastest machine on earth through the late 1970's. All these machines used functional unit parallelism.

The CDC 6600 was also the first machine to incorporate multiple peripheral processors, each a complete general purpose computer, dedicated to handling input/output and many operating system functions, so that the CPU could be devoted almost entirely to running user programs.

The CDC 6600 had a 60-bit word, with all main memory addresses being references to entire words. Instructions were 15 bits each, packed 4 to a word, with the following instruction format:

                  6        3     3     3
             _____________________________ 
            |___________|_____|_____|_____|
            |   opcode  | dst | src1| src2|

There were several banks of 8 registers in the CPU, with the opcode used to indicate which register bank was being addressed. Arithmetic in this machine was always register-to-register, so the above instruction format sufficed for arithmetic and logical operations on both the 60-bit floating point operand registers and on the 18-bit integer index registers.
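
As a concrete illustration, the following C sketch unpacks instructions in this format. It assumes, as a simplification, that the four 15-bit parcels are packed from the high-order end of a 60-bit word held in the low bits of a uint64_t, and the packed test word is made up.

        #include <stdint.h>
        #include <stdio.h>

        /* one decoded 15-bit parcel, with the fields of the figure above */
        struct parcel {
            unsigned opcode;    /* 6 bits */
            unsigned dst;       /* 3 bits */
            unsigned src1;      /* 3 bits */
            unsigned src2;      /* 3 bits */
        };

        /* extract parcel n (0..3) of a 60-bit word, assuming parcel 0
           occupies the high-order 15 bits of the word */
        struct parcel decode_parcel(uint64_t word, unsigned n) {
            unsigned p = (unsigned)((word >> (45 - 15 * (n & 3))) & 0x7FFF);
            struct parcel d;
            d.opcode = (p >> 9) & 077;
            d.dst    = (p >> 6) & 07;
            d.src1   = (p >> 3) & 07;
            d.src2   =  p       & 07;
            return d;
        }

        int main(void) {
            uint64_t word = 0;
            for (unsigned n = 0; n < 4; n++)
                word = (word << 15) | 0x7A53;   /* arbitrary parcel */
            for (unsigned n = 0; n < 4; n++) {
                struct parcel d = decode_parcel(word, n);
                printf("parcel %u: opcode=%02o dst=%o src1=%o src2=%o\n",
                       n, d.opcode, d.dst, d.src1, d.src2);
            }
            return 0;
        }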

The processor contained the following registers:

        X0-X7   eight 60-bit operand registers
        A0-A7   eight 18-bit address registers
        B0-B7   eight 18-bit index registers

Load and store on the CDC 6600 were distinctly strange! These operations were side effects of operations on A0 to A7! Specifically:

        setting A1 through A5 to an address caused the corresponding
        operand register, X1 through X5, to be loaded from that
        memory address

        setting A6 or A7 to an address caused the corresponding
        operand register, X6 or X7, to be stored to that memory
        address

        setting A0 had no load or store side effect
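
As a small illustration of this side-effect rule, here is a C model with ordinary arrays standing in for the 60-bit words and 18-bit address registers; it is a sketch of the idea only, not of the real hardware.

        #include <stdint.h>
        #include <stdio.h>

        #define MEM_WORDS 4096

        static uint64_t memory[MEM_WORDS];  /* stand-in for 60-bit words */
        static uint64_t X[8];               /* operand registers X0..X7  */
        static uint32_t A[8];               /* address registers A0..A7  */

        /* model of "set Ai": writing A1..A5 loads the matching X register
           from memory, writing A6..A7 stores the matching X register to
           memory, and A0 has no memory side effect */
        void set_A(unsigned i, uint32_t address) {
            i &= 7;
            A[i] = address;
            if (i >= 1 && i <= 5)
                X[i] = memory[address % MEM_WORDS];   /* load side effect  */
            else if (i >= 6)
                memory[address % MEM_WORDS] = X[i];   /* store side effect */
        }

        int main(void) {
            memory[100] = 12345;
            set_A(1, 100);              /* loads X1 from word 100 */
            X[6] = X[1] + 1;
            set_A(6, 101);              /* stores X6 into word 101 */
            printf("X1=%llu memory[101]=%llu\n",
                   (unsigned long long) X[1],
                   (unsigned long long) memory[101]);
            return 0;
        }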

This strange model of computation was useful because of the division of the processor into functional units. The following functional units could all operate in parallel, so long as the registers they referenced were disjoint:

        a branch unit
        a boolean (logical) unit
        a shift unit
        a fixed point add unit
        a floating point add unit
        two floating point multiply units
        a floating point divide unit
        two increment units (18-bit adders used for address
            arithmetic and the A-register load/store side effects)

Because these units operated in parallel once they were started on an operation, it was almost always possible to overlap many instructions. For example, it was quite easy to write instruction sequences that would, in parallel, load the next operand from memory, multiply the current operands, compute the memory address of an even later operand, and prepare to store the previous result to memory.

The machine had parallel data paths from the processor to memory, so that, as long as the memory addresses did not conflict, an instruction fetch, an operand store and an operand load could all be done in parallel (and there was room to allow one of the peripheral processors to do a DMA transfer at the same time). As a result, the two different increment functional units, the units that oversaw memory transfers and assignments to the A registers, could both be busy at the same time.

Multiple multiply functional units were needed because multiplication was a relatively slow operation. Even if there were not many multiply and divide instructions, these units could stay busy for a while.

Writing good compilers for the CDC 6600 was not easy! The compiler had not only to put out the right code, but also to reorder the machine instructions so that they were executed in an order that led to minimal waiting by one functional unit for results from another functional unit.

The CDC 6600 had a special component in the CPU called the scoreboard that was used to keep track of which registers were currently being used by which functional unit, so that other functional units needing the contents of those registers would wait until they were valid. This scoreboard function was a somewhat more complex version of the logic described earlier for making the CPU wait for a delayed result from a coprocessor.

We can describe the scoreboard most easily using terminology from concurrent programming: the scoreboard was an array of mutual exclusion locks, one per register. When a functional unit had an instruction to execute, it locked all the registers required by that instruction, then executed, and then unlocked those registers. An attempt by a functional unit to lock a register that was already locked blocked that functional unit until the register in question had been unlocked.
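
In software terms, this model might be sketched with POSIX threads roughly as follows. This is only a concurrency analogy with invented register values, not a description of the scoreboard hardware itself; taking the locks in ascending register order is simply one easy software way to avoid deadlock when two units need overlapping register sets.

        #include <pthread.h>
        #include <stdio.h>

        #define NREGS 8

        /* one lock per register: the software analogue of the scoreboard */
        static pthread_mutex_t reg_lock[NREGS];
        static long reg_value[NREGS];

        /* a "functional unit" computing dst = a + b: lock every register
           the instruction touches, do the work, then release the locks */
        static void execute_add(int dst, int a, int b) {
            int r[3] = { dst, a, b };
            for (int i = 0; i < 3; i++)          /* sort so locks are     */
                for (int j = i + 1; j < 3; j++)  /* taken in ascending    */
                    if (r[j] < r[i]) {           /* register order        */
                        int t = r[i]; r[i] = r[j]; r[j] = t;
                    }
            for (int i = 0; i < 3; i++)
                if (i == 0 || r[i] != r[i - 1])  /* skip duplicate regs   */
                    pthread_mutex_lock(&reg_lock[r[i]]);
            reg_value[dst] = reg_value[a] + reg_value[b];
            for (int i = 2; i >= 0; i--)
                if (i == 0 || r[i] != r[i - 1])
                    pthread_mutex_unlock(&reg_lock[r[i]]);
        }

        static void *unit1(void *arg) { execute_add(3, 1, 2); return arg; }
        static void *unit2(void *arg) { execute_add(6, 4, 5); return arg; }

        int main(void) {
            pthread_t t1, t2;
            for (int i = 0; i < NREGS; i++)
                pthread_mutex_init(&reg_lock[i], NULL);
            reg_value[1] = 10; reg_value[2] = 20;
            reg_value[4] = 30; reg_value[5] = 40;
            /* disjoint register sets, so the two "units" never block */
            pthread_create(&t1, NULL, unit1, NULL);
            pthread_create(&t2, NULL, unit2, NULL);
            pthread_join(t1, NULL);
            pthread_join(t2, NULL);
            printf("reg3=%ld reg6=%ld\n", reg_value[3], reg_value[6]);
            return 0;
        }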