22C:122/55:132, Notes, Lecture 18, Spring 2001

Douglas W. Jones
University of Iowa Department of Computer Science

  1. Motivation for Coprocessor or Functional Units Parallelism

    Suppose we have some slow computation, for example, multiplication or division done by the shift and add or shift and subtract algorithm.

    One way to incorporate this into a computer is to incorporate it into the CPU, so that the execute portion of the fetch-execute cycle is expanded into many clock cycles when the current instruction requires such a time consuming operation. This is how things were typically done on classic machines, from the days of the Berks Goldstein and Von Neumann paper to low-performance microprocessors of the present day.

    This low performance alternative relies on a single sequentially operated control unit, typically microcoded, with the control structure of the microcode looking something like the following:

    		repeat
    		   -- fetch
    		   ADD = PC;
    		   IR = memory[ADD]; PC = PC + 1;
    		
    		   -- decode
    		   ADD = IR:address; case IR:opcode of
    		      load:
    			 AC = memory[ADD];
    		      add:
    			 AC = AC + memory[ADD];
    		      branch:
    			 PC = ADD;
    		      times:
    			 repeat n times
    			    if odd MQ
                                   AC|MQ = (AC + memory[ADD])|MQ >> 1;
                                else
                                   AC|MQ = AC|MQ >> 1;
    			    endif
                             endloop
                          ...
                       end case
                    endloop;
    	
    The problem with this model is that it requires a complex control unit and it interfered with the desire to run the computation at maximum speed.

  2. Coprocessor Parallelism

    A coprocessor is a system component that runs in parallel with the CPU and has its own control unit, so that it may perform computations while the CPU is operating. Coprocessors may be as complex as entire secondary CPUs (for example, in systems with DSP coprocessors, digital signal processors), or they may be as simple as, for example, direct memory access processors.

    Here, we are interested, specifically, in arithmetic coprocessors such as floating-point units.

    The simplest such units are attached to the system bus as if they were peripheral devices. Consider, for example, a multiplier coprocessor. Here is the data part we might need for a simple design based on the standard and slow shift and add operation:

    	Data    ======o=====o======o======o======o======o==
    	Address ======|=====|======|======|======|======|==
    	Read    ------|-----|------|------|------|------|--
    	Write   ------|-----|------|------|------|------|--
                          |     |      |      |      |      |     
                rier ...  |  --/_\     |   --/_\     |   --/_\    
                racc      |     |      |  ----|------|------|-----
                rica      |     |      | |    |      |  ----|---  |
                          |     |     _|_|_   |     _|_|_   |   | |
                run  ...  |     |   --\0_1/   |   --\0_1/   |   | |
                       ___|___  |    ___|___  |    ___|___  |   | |
    	    cier -|>_ier__| |  -|>_acc__| |  -|>icand_| |   | |
                cacc      |     |       |     |       |     |   | |
                cica       -----o    ---o     |       o-----    | |
                               _|___|_  |     |       |         | |
                              |___+___| o-----        |         | |
                                 _|_____|_            |         | |
                                 \1_____0/------------|---      | |
                                    _|________________|_  |     | |
                                   |________>>1_________|-      | |
                                     |                |         | |
                                     |                 ---------  |
                                      ----------------------------
    	
    The control signals from the control system of this coprocessor are:
    rier, racc, rica
    enables transfer of ier, acc and icand registers to the data bus
    run
    if zero, sets the multiplexors on intputs to acc and icand to take input from the data bus. If one, takes input from the output of the shifter.
    cier, cacc, cica
    clocks the ier, acc and icand registers

    The control unit for this multiplier is quite simple, operating according to the following:

    		repeat
    		   wait for start pulse
    		   run = 1
    		   repeat n times
    		      generate step pulse
    		   endloop
    		   run = 0
    		forever
    	
    Now, we need a bus interface for this system to allow the CPU to use the bus to set and inspect the registers and to allow the CPU to request a multiply operation by issuing a start pulse:
    	Data    ==========================
    	Address ==o=======================
    	Read    --|--------------o--------
    	Write   --|--------------|---o----
                    __|__            |   |
                   |     |          _|_  |
                   |= x  |-o-------| 4 |-|----- rier  
                   |= x+1|-|-o-----| x |-|----- racc 
                   |= x+2|-|-|-o---|AND|-|----- rica
                   |= x+3|-|-|-|-o-|___|-|----- xx   
                   |_____| | | | |      _|_
    	                -|-|-|-----| 4 |--- cier  ___
                  from        -|-|-----| x |---------|2 x|-- cacc
               coprocessor      -|-----|AND|---------|OR_|-- cica
              control unit        -----|___|-- start   |
                                                       |
                  step --------------------------------
                  run  ------------------------------------- run
    	
    The above control unit enables read from the registers when there is a read from address x (ier), x+1 (acc) or x+2 (icand) and it enables a write to a register when there is a write to address x, x+1 or x+2. A write to address x+3 (start) issues a start pulse to the coprocessor's control unit, while a read from address x+3 issues a pulse on the control line marked xx, with no current use (we will find a use for it!).

    The coprocessor control unit issues step pulses, and these are turned into clock pulses to both the acc and icand registers.

    Note that if we wanted to generalize this coprocessor into one able to perform both multiply and divide operations, we might use writes to x+3 to record the desired operation. For example, writing 0 to x+3 might configure the data part to perform multiplication, while writing 1 to x+3 could configure the data part to perform division. The control unit would issue the same n step pulses in either case, but the effect of these would depend on the operation that was selected by the write operation that initiated coprocessor operation.

  3. Programming this coprocessor

    Assuming we've got an Ultimate RISC CPU, we can write code to use this multiply unit as follows:

    		; code for A = B x C
    		MOVE B ier	; move to x+0
    		MOVE C icand    ; move to x+2
    		MOVE junk start ; move to x+3
    
    		; must wait the appropriate time for the result
    
    		MOVE icand C    ; move from x+3
    	
    A good compiler or a good assembly language programmer will insert code to perform other computations in the space where the above code indicates a wait. This is particularly valuable if there are other coprocessors in the system.

    The problem with this code is that it includes no provisions for waiting the appropriate time. If the manual for the coprocessor relates its speed to the speed of the CPU, we might be able to know exactly how many instructions to skip before trying to get the result, but if we upgrade our CPU, the answer will change. This leads us to look for other solutions.

  4. Polling the coprocessor status

    One solution is to have the coprocessor include an explicit status bit that programs running on the CPU can test in order to see if the coprocessor is busy. This solution is common on direct-memory-access coprocessors, and it is common on interfaces to DSP chips. In both contexts, we typically also allow the condition status=done to be interpreted by the CPU as an interrupt request.

    We can do this with our example coprocessor by giving a meaning to the unused control line xx in the bus interface given previously. This line is asserted when the CPU tries to read from address x+3. We will use this line to enable a read of the state of the run output from the coprocessor control unit, so that the user can test to see if the coprocessor is running or idle. As a result, we change our code for multiplication to something like:

                    ; code for A = B x C
                    MOVE B ier      ; move to x+0
                    MOVE C icand    ; move to x+2
                    MOVE junk start ; move to x+3
    
    		; insert other operations here if there is something
    		; else useful to be done before checking for done
    
    	LP:	MOVE run to sfal; skip next instruction if run = 0
                    MOVE CLP pc     ; move constant equal to LP to pc
                    MOVE icand C    ; move from x+3
            
    This is a bit ponderous, but it allows the same code to run on machines with a wide range of relative CPU and coprocessor speeds.

  5. Asynchronous Busses

    The multiplication coprocessor described above is likely to be able to complete its job in only a few instruction times. This is because the control unit for the coprocessor can clock the registers at speeds limited by the short data circulation path within the coprocessor, while the CPU instruction execution cycle is limited by the bus speed.

    If a coprocessor is fast enough that it is likely to complete its job in only a few instruction times, the cost of using a few instructions to poll the coprocessor status while waiting for it to finish the operation is relatively high! Therefore, we need another solution!

    One common idea is to use an asynchronous bus, that is, a bus where the time it takes for a data transfer on the bus is under the control of the devices making the transfer. The minimum requirement for such a bus is an extra bus line, call it wait, that is raised by the device to say "hold on CPU, give me time to finish this job!" in response to a read or write request.

    We use this idea commonly in modern computer systems, so that the CPU can do very fast cycles to addresses in fast RAM and significantly slower cycles to, for example, peripheral interface registers.

    Here, we'll use the idea as follows:

    Whenever the CPU tries to read the contents of the multiplier register ier or of the accumulator, we'll have our coprocessor assert WAIT until the coprocessor is idle. This is illustrated below for just the read line for the accumulator:

    	Data    ============o======
    	Address ============|======
    	Read    ------------|------
    	Write   ------------|------
    	Wait    -------o----|------
                           |    |
                racc -----/_\--/_\
                           |    |
                run  ------     |
                                |
                          data from acc
    	
    The CPU or (in the case of the ultimate RISC) the Instruction Execution Unit must be modified to inspect the Wait line. There are two approaches to this. In one, if Wait is high during a bus cycle, the CPU must repeatedly retry that bus cycle until a cycle can be completed during which wait stays low.

    In the other approach, raising wait during a bus cycle stops the CPU clock until wait is lowered. These two approaches are distinctly different! CPU designs that allow an external signal to stop or stretch clock pulses are notably more complex than those what simply keep retrying operations until they can be completed. Genuinely asynchronous busses require the ability to stretch a single bus cycle.

    We can program a machine where the coprocessor can block a fetch until the operation is done as follows:

                    ; code for A = B x C
                    MOVE B ier      ; move to x+0
                    MOVE C icand    ; move to x+2
                    MOVE junk start ; move to x+3
    
    		; insert other operations here if there is something
    		; else useful to be done before checking for done
    
                    MOVE icand C    ; move from x+3 (will block as needed)
            
    The result is that naive code will run correctly, with the CPU being blocked as needed when results are not immediately available. Good compilers and careful assembly language programmers will interleave code so that other operations are performed at times when the coprocessor is expected to be busy, thus minimizing the number of times that the CPU is blocked by coprocessor activity.

  6. Coprocessors in the opcode space

    The coprocessors discussed above appear to be peripheral devices from the point of view of the CPU or instruction execution unit. This is fine for the Ultimate RISC, where there is no concept of opcode field, but most instructions have an instruction format that is more interesting, with a field used to designate the opcode. In such cases, we would like to design the coprocessor so that it executes instructions with a compatable format. This is commonly done as follows:

    First, the CPU is designed with instructions that are reserved for interpretation by a coprocessor. During the normal instruction execution cycle, the CPU treates these almost as if they were no-ops.

    Second, the CPU exposes selective state information on the bus. Specifically, whenever it is fetching an instruction word, it sets a special bus line indicating that this fetch from memory is reading the next instruction to be executed. The coprocessor may observe the instruction being fetched and, if the opcode field is one of those reserved for coprocessor action, the coprocessor may elect to act on the instruction.

    Typically, we want some coprocessor instructions to load data from memory to coprocessor registers, while others store data from coprocessor registers to memory, and yet others perform register to register operations within the coprocessor. There are several ways of doing this.

    One approach is to pre-designate coprocessor instructions in the CPU that include load and store cycles. When the coprocessor is absent, coprocessor-load instructions carry out the normal address computation part of the instruction execution cycle in the CPU, but the CPU ignores the value read from memory. If the coprocessor is present, the coprocessor will use that value. Similarly, coprocessor store instructions executed when the coprocessor is absent store an indeterminate value to memory, but if the coprocessor is present, it provides the data to be stored.

    If the coprocessor is not ready when a coprocessor instruction is fetched by the CPU, it must use the bus stall mechanism to stop the CPU until it can cooperate. Usually, we also include a coprocessor acknowledge signal on the bus. If the CPU decodes a pre-designated coprocessor coprocessor instruction and no coprocessor acknowledges recognizing that instruction, the CPU raises an exception condition (typicall using a trap mechanism) to indicate the situation. The exception handler may do a software simulation of the coprocessor instruction (as in floating point emulation), or it may just treat the instruction as an error.

    This approach was taken on the DEC PDP-11/45 floating point unit back in 1973; as a result, execution of integer and floating point instructions could be overlapped. The floating point instruction set looked like it was fully integrated into the normal instruction set of the machine, but the floating point unit was optional, and it operated in parallel with the CPU. Floating point instructions were executed using a sequential microcoded approach, and this was quite slow, yet the CPU never waited for the floating point unit to complete an operation except when a new floating point instruction was encountered before the previous one had been completed. Because most floating point algorithms require several integer instructions to be executed per floating point instruction, the result was an effective use of parallelism.

    (Other floating point coprocessors DEC built for other members of the PDP-11 family were not as fast. The later PDP-11/70 used the PDP-11/45 floating point, while the 11/40 and 11/23 used lower performance coprocessors.

    The floating point coprocessors for the microprocessors of the 1980's typically operated similarly; some offered high performance by overlapping floating point with integer operations, but most did not.