21. Coprocessors

Part of the 22C:122/55:132 Lecture Notes for Spring 2003
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Motivation

Suppose we have some slow computation, for example, multiplication or division done by the shift and add or shift and subtract algorithm.

One way to incorporate this into a computer is to build it directly into the CPU, so that the execute portion of the fetch-execute cycle is expanded into many clock cycles when the current instruction requires such a time-consuming operation. This is how things have typically been done on classic machines, from the days of the Burks, Goldstine and von Neumann paper to low-performance microprocessors of the present day.

This low performance alternative relies on a single sequentially operated control unit, typically microcoded, with the control structure of the microcode looking something like the following:

	repeat
	   -- fetch
	   IR = memory[PC]; PC = PC + 1;

	   -- decode
           case IR.opcode of
	      load:
		 AC = memory[IR.address];
	      add:
		 AC = AC + memory[IR.address];
	      branch:
		 PC = IR.address;
	      times:
		 repeat n times -- multiplication loop
		    if odd MQ
		       AC|MQ = (AC + memory[IR.address])|MQ >> 1;
		    else
		       AC|MQ = AC|MQ >> 1;
		    endif
		 endloop
	      ...
	   end case
	endloop;
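
For concreteness, here is the multiplication inner loop rendered as a small C function; the 16-bit word size and the names used here are assumptions made for illustration, not part of the machine described above:

        #include <stdint.h>

        /* shift-and-add multiply, as in the microcode loop above;
           AC|MQ is modeled as a double-length pair of 16-bit halves */
        uint32_t multiply(uint16_t icand, uint16_t ier)
        {
            uint16_t ac = 0;              /* AC starts out zero          */
            uint16_t mq = ier;            /* MQ starts as the multiplier */
            for (int i = 0; i < 16; i++) {            /* repeat n times */
                /* when MQ is odd, add the multiplicand into AC, then
                   shift AC|MQ right one place; 64 bits are used so the
                   carry out of the add is not lost                      */
                uint64_t pair =
                    ((uint64_t)(ac + ((mq & 1) ? icand : 0)) << 16) | mq;
                pair >>= 1;
                ac = (uint16_t)(pair >> 16);
                mq = (uint16_t)pair;
            }
            return ((uint32_t)ac << 16) | mq;   /* double-length product */
        }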

There are two problems with this model: First, it requires a complex control unit, and second, it seems unnecessarily slow. It is reasonable to ask, is there some way to continue fetching and executing instructions while the slow iterative multiply operation is being done? This leads to the idea of creating a special secondary processor, a coprocessor, to perform such time consuming operations.

The easiest way to add a coprocessor is as a peripheral device, and exactly this approach was commonplace with minicomputers and microcomputers of the 1970's; it is the approach we will discuss here. This is not the order in which these ideas were discovered! If we were to follow a historical track, we would first discuss the CDC 6600 from 1965, with its use of functional units within the CPU, and then we would look at the DEC PDP-11/45 floating point coprocessor before, finally, we looked at simple coprocessors such as we are discussing here.

Coprocessor Parallelism

A coprocessor is a system component that runs in parallel with the CPU and has its own control unit, so that it may perform computations while the CPU is operating. Coprocessors may be as complex as entire secondary CPUs (for example, in systems with DSP, or digital signal processor, coprocessors), or they may be as simple as, for example, direct memory access processors.

Here, we are interested, specifically, in arithmetic coprocessors such as floating-point units.

The simplest such units are attached to the system bus as if they were peripheral devices. Consider, for example, a multiplier coprocessor, designed to support the standard but slow shift-and-add approach to multiplication. We would include the following registers:

        icand -- the multiplicand register
        acc   -- the accumulator, which develops the high half of the product
        ier   -- the multiplier register, which gives up one multiplier bit
                 per step and develops the low half of the product

Our coprocessor's control unit must be able to clock each of these registers, and it must be able to respond to a read request for the contents of these registers from the CPU. These functions require the following control signals from the coprocessor's control unit:

        cicand, cacc, cier -- clock pulses that load the corresponding
                 register from the bus
        ricand, racc, rier -- read enables that gate the corresponding
                 register onto the data bus

We can cheat a bit on the controls for our data part by noting that, for each cycle of the multiplication algorithm, the least significant bit of the multiplier controls whether we just shift or shift and add. Therefore, we can wire this connection directly in the data part and avoid involving the control part. This cheat is marked in the diagram of the data part that follows with an asterisk:

	Data    ======o=====o======o======o======o======o==
	Address ======|=====|======|======|======|======|==
	Read    ------|-----|------|------|------|------|--
	Write   ------|-----|------|------|------|------|--
                      |     |      |      |      |      |     
            rier ...  |  --/_\     |   --/_\     |   --/_\    
            racc      |     |      |  ____|______|______|_____
            ricand    |     |      | |    |      |  ____|___  |
                      |     |     _|_|_   |     _|_|_   |   | |
            run  ...  |     |   --\0_1/   |   --\0_1/   |   | |
                   ___|___  |    ___|___  |    ___|___  |   | |
	    cier -|>_icand| |  -|>_acc__| |  -|>_ier__| |   | |
            cacc      |     |       |     |       |     |   | |
            cicand     -----o    ---o     |       o-----    | |
                           _|___|_  |     |       |         | |
                          |___+___| o-----        |         | |
                             _|_____|_            |         | |
                             \1_____0/------------|---*     | |
                                _|________________|_  |     | |
                               |________>>1_________|-      | |
                                 |                |_________| |
                                 |____________________________|

The run control signal from the control unit is set to zero when the control unit is loading a register from the bus, and it is set to one while the multiply operation is in progress.

The control unit for this multiplier can be quite simple, operating according to the following algorithm:

	repeat
	   wait for start pulse
	   run = 1
	   repeat n times
	      generate step pulse
	   endloop
	   run = 0
	forever

We need a bus interface for the control unit in order to allow the CPU to use the bus to set and inspect the registers and to allow the CPU to request a multiply operation by issuing a start pulse:

	Data    ==========================
	Address ==o=======================
	Read    --|--------------o--------
	Write   --|--------------|---o----
                __|__            |   |
               |     |          _|_  |
               |= x  |-o-------| 4 |-|----- ricand
        address|= x+1|-|-o-----| x |-|----- racc 
        decoder|= x+2|-|-|-o---|AND|-|----- rier
               |= x+3|-|-|-|-o-|___|-|----- xx   
               |_____| | | | |      _|_
	                -|-|-|-----| 4 |--- cicand ___
              from        -|-|-----| x |----------|2 x|-- cacc
           coprocessor      -|-----|AND|----------|OR_|-- cier
          control unit        -----|___|-- start    |
                                                    |
              step ---------------------------------
              run  ------------------------------------- run

The above control unit enables read from the registers when there is a read from address x (icand), x+1 (acc) or x+2 (ier) and it enables a write to one of these registers when there is a write to address x, x+1 or x+2. A write to address x+3 (start) issues a start pulse to the coprocessor's control unit, while a read from address x+3 issues a pulse on the control line marked xx, with no current use; we will find a use for it shortly!

The coprocessor control unit issues step pulses, and these are ORed into the clock lines of both the acc and ier registers.
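
To make the address decoding concrete, here is a hypothetical C model of this bus interface; the base address X, the 16-bit word size, and the simplification that a start pulse runs the whole multiplication to completion at once are all assumptions made for illustration:

        #include <stdint.h>

        enum { X = 0x40 };                 /* hypothetical base address */

        struct coproc {
            uint16_t icand, acc, ier;      /* the three registers       */
            int run;                       /* 1 while a multiply runs   */
        };

        /* one step pulse: shift acc|ier right, adding icand into acc
           first when the low bit of ier is 1 (the starred connection
           in the data part diagram)                                    */
        static void step(struct coproc *c)
        {
            uint64_t pair =
                ((uint64_t)(c->acc + ((c->ier & 1) ? c->icand : 0)) << 16)
                | c->ier;
            pair >>= 1;
            c->acc = (uint16_t)(pair >> 16);
            c->ier = (uint16_t)pair;
        }

        /* the start pulse; the control unit's n-step loop is modeled
           as completing immediately                                    */
        static void start(struct coproc *c)
        {
            c->run = 1;
            for (int i = 0; i < 16; i++)
                step(c);
            c->run = 0;
        }

        /* a write bus cycle, decoding addresses x through x+3 */
        void bus_write(struct coproc *c, unsigned addr, uint16_t data)
        {
            switch (addr - X) {
            case 0: c->icand = data; break;  /* cicand              */
            case 1: c->acc   = data; break;  /* cacc                */
            case 2: c->ier   = data; break;  /* cier                */
            case 3: start(c);        break;  /* start, data ignored */
            }
        }

        /* a read bus cycle; the x+3 case is the xx line, given the
           run-status meaning suggested below                           */
        uint16_t bus_read(struct coproc *c, unsigned addr)
        {
            switch (addr - X) {
            case 0: return c->icand;         /* ricand */
            case 1: return c->acc;           /* racc   */
            case 2: return c->ier;           /* rier   */
            case 3: return (uint16_t)c->run; /* xx     */
            }
            return 0;
        }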

Note that if we wanted to generalize this coprocessor into one able to perform both multiply and divide operations, we might use writes to x+3 to record the desired operation. For example, writing 0 to x+3 might configure the data part to perform multiplication, while writing 1 to x+3 could configure the data part to perform division. The control unit would issue the same n step pulses in either case, but the effect of these would depend on the operation that was selected by the write operation that initiated coprocessor operation.

Programming this coprocessor

Assuming we've got an Ultimate RISC CPU, we can write code to use this multiply unit as follows:

	; code for A = B x C
        MOVE B ier      ; move to x+2
        MOVE C icand    ; move to x+0
	MOVE junk start ; move to x+3

	; must wait the appropriate time for the result

        MOVE acc A      ; move from x+1

A good compiler or a good assembly language programmer will insert code to perform other computations in the space where the above code indicates a wait.

The problem with this code is that it includes no provisions for waiting the appropriate time. If the manual for the coprocessor relates its speed to the speed of the CPU, we might be able to know exactly how many instructions to skip before trying to get the result, but if we upgrade our CPU, the answer will change. This leads us to look for other solutions.

Polling the coprocessor status

One solution is to have the coprocessor include an explicit status bit that programs running on the CPU can test in order to see if the coprocessor is busy. This solution is common on direct-memory-access coprocessors, and it is common on interfaces to DSP chips. In both contexts, we typically also allow the condition status=done to be interpreted by the CPU as an interrupt request.

We can do this with our example coprocessor by giving a meaning to the unused control line xx in the bus interface given previously. This line is asserted when the CPU tries to read from address x+3. We will use this line to enable a read of the state of the run output from the coprocessor control unit, so that the user can test to see if the coprocessor is running or idle. As a result, we change our code for multiplication to something like:

		; code for A = B x C
                MOVE B ier      ; move to x+2
                MOVE C icand    ; move to x+0
		MOVE junk start ; move to x+3

		; insert other operations here if there is something
		; else useful to be done before checking for done

        LP:     MOVE run sfal   ; skip next instruction if run = 0
		MOVE CLP pc     ; move constant equal to LP to pc
                MOVE acc A      ; move from x+1

This is a bit ponderous, but it allows the same code to run on machines with a wide range of relative CPU and coprocessor speeds.
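
Against the hypothetical C model of the bus interface given earlier, the same polling protocol looks like this (in that model the start pulse completes at once, so the loop never actually spins; on real hardware it would spin while run = 1):

        /* A = B x C by polling; bus_read, bus_write and X are from
           the earlier sketch                                         */
        uint32_t multiply_polled(struct coproc *cop, uint16_t b, uint16_t c)
        {
            bus_write(cop, X + 2, b);        /* ier   = B           */
            bus_write(cop, X + 0, c);        /* icand = C           */
            bus_write(cop, X + 3, 0);        /* start the multiply  */
            while (bus_read(cop, X + 3))     /* poll the run status */
                ;                /* other useful work could go here */
            return ((uint32_t)bus_read(cop, X + 1) << 16) /* acc: high half */
                 |            bus_read(cop, X + 2);       /* ier: low half  */
        }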

Asynchronous Busses

The multiplication coprocessor described above is likely to be able to complete its job in only a few instruction times, particularly if we are really using an Ultimate RISC as our main processor. This is because the control unit for the coprocessor can clock the registers at speeds limited by the short data circulation path within the coprocessor, while the CPU instruction execution cycle is limited by the bus speed and main memory speed.

If a coprocessor is fast enough that it is likely to complete its job in only a few instruction times, the cost of using a few instructions to poll the coprocessor status while waiting for it to finish the operation is relatively high! Therefore, we need another solution!

One common idea is to use an asynchronous bus, that is, a bus where the time it takes for a data transfer on the bus is under the control of the devices making the transfer. The minimum requirement for such a bus is an extra bus line, call it wait, that is raised by the device to say "hold on CPU, give me time to finish this job!" in response to a read or write request.

We use this idea commonly in modern computer systems, so that the CPU can do very fast cycles to addresses in fast RAM and significantly slower cycles to, for example, peripheral interface registers. Similarly, when the CPU starts a memory cycle, it has no idea whether the addressed location will be held in cache, and therefore available quite quickly, or available only in main memory, and therefore available after a much longer wait.

We'll use this idea as follows:

Whenever the CPU tries to read the contents of the multiplier register ier or of the accumulator, we'll have our coprocessor assert WAIT until the coprocessor is idle. This is illustrated below for just the read line for the accumulator:

	Data    ============o======
	Address ============|======
	Read    ------------|------
	Write   ------------|------
	Wait    -------o----|------
                       |    |
            racc -----/_\--/_\
                       |    |
            run  ------     |
                            |
                      data from acc

The CPU or (in the case of the Ultimate RISC) the Instruction Execution Unit must be modified to inspect the wait line. There are two approaches to this. In one, if wait is high during a bus cycle, the CPU must repeatedly retry that bus cycle until a cycle can be completed during which wait stays low.

In the other approach, raising wait during a bus cycle stops the CPU clock until wait is lowered. These two approaches are distinctly different! CPU designs that allow an external signal to stop or stretch clock pulses are notably more complex than those that simply keep retrying operations until they can be completed. Genuinely asynchronous busses require the ability to stretch a single bus cycle.
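
The retry approach can be sketched in C by extending the same hypothetical model; bus_read_cycle and its return convention are inventions for illustration, with wait raised on a read of acc while run = 1, as in the gate shown above:

        /* returns nonzero if the device raised wait during the cycle */
        int bus_read_cycle(struct coproc *c, unsigned addr, uint16_t *data)
        {
            if (addr == X + 1 && c->run)
                return 1;                 /* wait held high: no transfer */
            *data = bus_read(c, addr);
            return 0;
        }

        /* the retrying CPU: reissue the cycle until wait stays low */
        uint16_t cpu_read(struct coproc *c, unsigned addr)
        {
            uint16_t data;
            while (bus_read_cycle(c, addr, &data))
                ;                         /* retry; run eventually falls */
            return data;
        }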

We can program a machine where the coprocessor can block a fetch until the operation is done as follows:

		; code for A = B x C
                MOVE B ier      ; move to x+2
                MOVE C icand    ; move to x+0
                MOVE junk start ; move to x+3

		; insert other operations here if there is something
		; else useful to be done before checking for done

                MOVE acc A      ; move from x+1 (will block as needed)

The result is that naive code will run correctly, with the CPU being blocked as needed when results are not immediately available. Good compilers and careful assembly language programmers will interleave code so that other operations are performed at times when the coprocessor is expected to be busy, thus minimizing the number of times that the CPU is blocked by coprocessor activity.

Of course, we cheated a bit. Our coprocessor only forces the CPU to wait when it references the coprocessor's accumulator. If we use the design given above and the program looks first at the ier register, it will see intermediate results without being blocked. We would have to add some more gates to the design to prevent this from happening.

Coprocessors in the opcode space

The coprocessors discussed above appear to be peripheral devices from the point of view of the CPU or instruction execution unit. This is fine for the Ultimate RISC, where there is no concept of an opcode field, but a decent computer will have an instruction format that is more interesting, with a field used to designate the opcode. In such cases, we would like to design the coprocessor so that it executes instructions with a compatible format. This is commonly done as follows:

First, the CPU is designed with instructions that are reserved for interpretation by a coprocessor. During the normal instruction execution cycle, the CPU treats these almost as if they were no-ops.

Second, the CPU exposes selective state information on the bus. Specifically, whenever it is fetching an instruction word, it sets a special bus line indicating that this fetch from memory is reading the next instruction to be executed. The coprocessor may observe the instruction being fetched and, if the opcode field is one of those reserved for coprocessor action, the coprocessor may elect to act on the instruction.

Typically, we want some coprocessor instructions to load data from memory to coprocessor registers, while others store data from coprocessor registers to memory, and yet others perform register to register operations within the coprocessor. There are several ways of doing this.

One approach is to pre-designate coprocessor instructions in the CPU that include load and store cycles. When the coprocessor is absent, coprocessor-load instructions carry out the normal address computation part of the instruction execution cycle in the CPU, but the CPU ignores the value read from memory. If the coprocessor is present, the coprocessor will use that value. Similarly, coprocessor store instructions executed when the coprocessor is absent store an indeterminate value to memory, but if the coprocessor is present, it provides the data to be stored.

If the coprocessor is not ready when a coprocessor instruction is fetched by the CPU, the coprocessor must use the bus stall mechanism to stop the CPU until it can cooperate. Usually, we also include a coprocessor acknowledge signal on the bus. If the CPU decodes a pre-designated coprocessor instruction and no coprocessor acknowledges recognizing that instruction, the CPU raises an exception condition (typically using a trap mechanism) to indicate the situation. The exception handler may do a software simulation of the coprocessor instruction (as in floating point emulation), or it may just treat the instruction as an error.
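
Reduced to a C sketch, the fetch-time dispatch might look like the following; the instruction width, the opcode field, the reserved opcode range, and all of the helper names are hypothetical, invented for illustration:

        /* hypothetical stubs standing in for bus signals and traps */
        static int  coprocessor_ack(unsigned instr) { (void)instr; return 0; }
        static void trap(unsigned instr)            { (void)instr; }
        static void execute_normally(unsigned instr){ (void)instr; }

        /* assume opcodes 0x38 to 0x3F are reserved for coprocessors */
        #define IS_COP_OP(op) ((op) >= 0x38)

        void dispatch(unsigned instr)
        {
            unsigned op = instr >> 10;        /* 6-bit opcode, 16-bit word */
            if (IS_COP_OP(op)) {
                if (!coprocessor_ack(instr))  /* no coprocessor claimed it */
                    trap(instr);              /* emulate in software or    */
                                              /* treat it as an error      */
                return;                       /* otherwise, a no-op here   */
            }
            execute_normally(instr);
        }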

This approach was taken on the DEC PDP-11/45 floating point unit back in 1973; as a result, execution of integer and floating point instructions could be overlapped. The floating point instruction set looked like it was fully integrated into the normal instruction set of the machine, but the floating point unit was optional, and it operated in parallel with the CPU. Floating point instructions were executed using a sequential microcoded approach, and this was not fast, yet the CPU never waited for the floating point unit to complete an operation except when a new floating point instruction was encountered before the previous one had been completed. Because most floating point algorithms require several integer instructions to be executed per floating point instruction, the result was an effective use of parallelism.

(Other floating point coprocessors DEC built for other members of the PDP-11 family were not as fast. The later PDP-11/70 used the PDP-11/45 floating point unit, while the 11/40 and 11/23 used lower performance coprocessors.)

The floating point coprocessors for the microprocessors of the 1980's typically operated similarly; some offered high performance by overlapping floating point with integer operations, but most did not.

This idea had its origins in the CPU of the CDC 6600; that CPU was composed entirely of components called functional units, each of which behaved like a coprocessor of the type described here, able to operate in parallel with the other functional units. If the compiler carefully rearranged instructions so that consecutive instructions rarely referenced the same functional unit, the machine was extremely fast.

Chapter 39 of Bell and Newell contains a writeup of the CDC 6600 written in 1964; this machine remained the fastest machine on earth until the early 1970's, when the CDC 7600 replaced it. Seymour Cray designed both machines, and after he quit CDC, he founded Cray Research and built the Cray 1, which was the fastest machine on earth through the late 1970's. All these machines used functional unit parallelism.

The CDC 6600 was also the first machine to incorporate multiple peripheral processors, each a complete general purpose computer, dedicated to handling input/output and many operating system functions, so that the CPU could be devoted almost entirely to running user programs.

The CDC 6600 had a 60-bit word, with all main memory addresses being references to entire words. Instructions were 15 bits each, packed 4 to a word, with the following instruction format:

                  6        3     3     3
             _____________________________
            |           |     |     |     |
            |   opcode  | dst | src1| src2|
            |___________|_____|_____|_____|

There were several banks of 8 registers in the CPU, with the opcode used to indicate which register bank was being addressed. Arithmetic in this machine was always register-to-register, so the above instruction format sufficed for arithmetic and logical operations on both the 60-bit floating point operand registers and on the 18-bit integer index registers.
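
As a small illustration, here is one 15-bit instruction unpacked in C (the field positions follow the format above; the function and type names are inventions):

        /* decode a 15-bit CDC 6600 instruction: a 6-bit opcode
           followed by three 3-bit register designators           */
        struct cdc_insn { unsigned op, dst, src1, src2; };

        struct cdc_insn decode(unsigned word15)
        {
            struct cdc_insn i;
            i.op   = (word15 >> 9) & 077;    /* 6 bits */
            i.dst  = (word15 >> 6) & 07;     /* 3 bits */
            i.src1 = (word15 >> 3) & 07;
            i.src2 =  word15       & 07;
            return i;
        }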

The processor contained the following registers:

        X0-X7 -- eight 60-bit operand registers
        A0-A7 -- eight 18-bit address registers
        B0-B7 -- eight 18-bit index registers, with B0 hard-wired to zero

Load and store on the CDC 6600 were distinctly strange! These operations were side effects of operations on A0 to A7! Specifically,

        setting A1 through A5 caused the corresponding X register to be
        loaded from the memory location addressed, while

        setting A6 or A7 caused the corresponding X register to be
        stored to the memory location addressed.

This strange model of computation was useful because of the division of the processor into functional units. The following functional units could all operate in parallel, so long as the registers they referenced were disjoint:

        the branch, boolean and shift units

        the floating add unit and the long (fixed point) add unit

        two multiply units and a divide unit

        two increment units, which did the address arithmetic for
        assignments to the A and B registers and, as a side effect,
        oversaw loads and stores

Because these units ran in parallel once started on an operation, it was almost always possible to overlap many instructions. For example, it was quite easy to write instruction sequences that would, in parallel, load the next operand from memory, multiply the current operands, compute the memory address of an even later operand, and prepare to store the previous result to memory.

The machine had parallel data paths from the processor to memory, so that, so long as the memory addresses did not conflict, an instruction fetch, an operand store and an operand load could all be done in parallel (and there was room to allow one of the peripheral processors to do a DMA transfer at the same time). As a result, the two different increment functional units, the units that oversaw memory transfers and assignments to the A registers, could both be busy at the same time.

Multiple multiply functional units were needed because multiplication was a relatively slow operation; even when a program contained only a modest number of multiply and divide instructions, these units could stay busy for a long time.

Writing good compilers for the CDC 6600 was not easy! The compiler had to not only put out the right code, but also reorder the machine instructions so that they were executed in an order that led to minimal waiting by one functional unit for results from another.

The CDC 6600 had a special component in the CPU called the scoreboard that kept track of which registers were currently in use by which functional unit, so that other functional units needing the contents of those registers would wait until they were valid. This scoreboard was a somewhat more complex version of the logic described earlier for making the CPU wait for a delayed result from a coprocessor.