23. Simple Pipelined Coprocessors

Part of the 22C:122/55:132 Lecture Notes for Spring 2004
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Vector Processing

An early hypothesis among numerical analysts hoping to build high performance supercomputers was that direct hardware support of array and vector operations would lead to performance breakthroughs. This was motivated, in part, by the computational style encouraged by languages such as APL, where the statement A=B+C had several interpretations depending on the types of A B and C. If all the variables were simple variables, this statement had the expected meaning, but if one or the other variable was a vector, several interesting meanings could emerge, among them:

            -- if A, B and C are vectors
	    for i in sizeof(A) do
               A(i) = B(i) + C(i)

            -- if A and B are vectors and C is a scalar
	    for i in sizeof(A) do
               A(i) = B(i) + C

The obvious question is, if these are natural operations for programmers, why not directly implement these in hardware? If we could do this, we would eliminate most of the cost of instruction fetches, so that the vector operation could run to completion with the only expense being operand transfers to and from memory.

This leads naturally to the idea of a vector coprocessor or vector functional unit that could be attached to a machine. Generally, when these were built, the vector processor performed the following algorithm, expressed in C:

	(void) vectorop( real * src1, int stride_s1,
			 real * src2, int stride_s2,
			 real * dst,  int stride_dst,
			 int numops, real f( real s1, real s2 ) )
		while (numops > 0) {
			*dst = f( *src1, *src2 );
			src1 += stride_s1;
			src2 += stride_s2;
			dst += stride_dst;

This general operation is very powerful; not only can it add corresponding elements of two vectors to compute the elements of a result vector, but it can do things like multiplying each element of a row of one matrix (with stride 1) by each element of a column of another matrix (with stride equal to the number of columns), and if we use a stride of zero, it will perform scalar to vector operations.

Typically, a vector functional unit or vector coprocessor would have the following structure:

	   src1, src2, dst   -- address registers
	   str1, str2, strd  -- increment or stride registers
	   temp1, temp2      -- internal operand registers
	   count             -- count for the operation in progress
	   op                -- operation to perform

	Data Paths:
	   address_bus = select( src1, src2, dst )
	   src1 = src1 + str1
	   src2 = src2 + str2
	   dst  = dst  + strd
	   count = count - 1
           temp1 = data_bus
           temp2 = data_bus
           data_bus = alu( op, temp1, temp2 )

If the vector processor is attached as a crude coprocessor with direct memory access capability, the CPU would perform a vector operation by loading the registers listed above (aside from the temp registers), then starting the coprocessor (perhaps an automatic side effect of loading the op register), and then doing other work, if possible, while awaiting the result.

If the vector processor is incorporated into the CPU as a functional unit, vector instructions might simply name the CPU registers to be used, so we could imagine a vector instruction with a format such as:

        | opcode  | count | addr| str | addr| str | addr| str |
        |                 |    src1   |    src2   |    dst    |
If we have 16 index registers, the count, address and stride fields of this instrucion could be 4-bit fields indicating the register that holds the desired information, so the instruction would use any 6 of the registers to start a vector operation.

At 4 bits per register field, the instruction outlined above takes 28 bits to specify source and destination addresses, strides and count. The remaining 4 bits of a 32 bit instruction seem insufficient for the opcode, but there is a common solution to this problem.

Typical large-scale scientific computers have always supported machine words much larger than the memory address. The IAS machine proposed in 1946 had a 40-bit word and an address well under 20 bits. The IBM 704/709/7040/7090 family had a 36-bit word and an address that was at most 18 bits (15 bits was far more typical). The CDC 6600 had a 60-bit word and an 18-bit memory address, and By the time 32 bit memory addresses were contemplated, the usual proposals for supercomputers had 64-bit words.

The indexed addressing scheme used on the 709, and later adopted (with modifications) by many vector processors used the following format for pointers stored in registers:

	contents of a register
	|     decrement     |      address      |
	|     (stride)      |                   |

The low half of each register, when used for indexing) is a pointer to a memory location. The high half of the register was ignored during indexing; this was used to store the amount by which the address should be decremented when the pointer was used as a for-loop counter in the context of a decrement-and-branc-if-zero instruction, but it could be used equally well as a stride field in a vector processing context. The 709, of course, was not a vector processor, but if we borrow this idea for representing vector pointers, we could use the following instruction format for initiating vector operations:

	| opcode  | count | src1| src2| dst |

Given 16 to 64 index registers, it takes only 12 to 18 bits of the instruction to specify source and destination addresses for a vector operation. An 8 bit opcode field and a 32-bit instruction word would leave us 6 to 12 bits for specifying the vector length or for specifying the register holding the vector length, and this should be enough!

The Vector Advantage

The vector advantage of a processor was defined as the relative speedup that could be realized by coding vector operations on that processor using the vector instruction set instead of the conventional instruction set of the same machine.

Naive advocates of vector architectures assumed that the most successful vector machines would be the machines with the highest vector advantage. In fact, a retrospective look at the vector architectures that were marketed shows exactly the opposite!

The most successful vector machine to come to market was the Cray I family (including its successors the Cray X-MP, the Cray 2 and the Cray Y-MP.) While these machines had incredibly complex vector processors, complete with vector operand registers inside the CPU able to hold 64 64-bit operands each, they had remarkably low vector advantages. For the Cray 1, the figure usually quoted is 4.

Other vector processors, most notably the CDC STAR had far higher vector advantages (STAR stood for STring and ARray processor). The STAR had 2 to 4 vector functional units very similar to those described above, with no special vector operand registers, and it had a very high vector advantage).

The fundamental problem with the emphasis on the vector advantage of a processor is that even the most numerically intensive of programs spend much of their time on housekeeping jobs -- computing the address pointers to use in a vector operation, or calling functions, or managing the overall recursive structure of the algorithm. As a result, in real programs, the impact of architectural enhancements that speed these activities is huge, and furthermore, these enhancements speed up not only numerical computation but everything else as well.

Why devote a huge fraction of the hardware in a CPU to vector processing if it only speeds up the resulting computations by at most a factor of 4? Why not devote this hardware to speeding up everything else? Once this question was asked, development of new vector processors came to an almost complete halt!


One idea that dominated the design of all high performance vector processors has survived: The use of pipelined computation. This idea was first cleary identified in looking at the computational complexity of floating point operations in the 1960's. A floating point add is performed as follows (ignoring rounding and questions of signed number representation):

	float add( float a, float b )
		int aexp = exponent_field_of(a);
		int aman = mantissa_field_of(a);
		int bexp = exponent_field_of(b);
		int bman = mantissa_field_of(b);
		int sexp; -- will hold the exponent of the sum
		int sman; -- will hold the mantissa of the sum

		-- align the points of the two operands
		if (aexp > bexp) {
			bman = bman >> (aexp-bexp);
			sexp = aexp;
		} else if (bexp > aexp) {
			aman = aman >> (bexp-aexp);
			sexp = bexp;

		-- add
		sman = aman + bman;

		-- normalize
		while (top_bit(sman) == 0) {
			sman = sman << 1;
			sexp = sexp - 1;

		return makefloat( sexp, sman );

In building hardware to perform this, it is natural to build along the following lines;

	      a              b
        |                                      |
        |           align                      |
              |aman          |bman       |sexp
         _____|______________|_____      |
        |                          |     |
        |            add           |     |
        |__________________________|     |
                      |sman              |
        |                                      |
        |         normalize                    |

The align step can be done crudely using a pair of subtractors (aexp-bexp) and (bexp-aexp) to control a set of shifters and a final multiplexor. This computation can be done in O(log n) gate delays! The adder can also be done in this time, and so can the normalizer. If speed is our goal, it seems annoying to have to wait 3 times as long as any one step takes, when we could perform the steps in parallel! The idea is a natural one, sometimes described as bringing assembly line methods to computation.

Consider adding a+b, c+d and e+f as follows:

	step 1   align a+b
	step 2   align c+d   add a+b
	step 3   align e+f   add c+d   normalize a+b
	step 4               add e+f   normalize c+d
	step 5                         normalize e+f

During the first 2 steps, we waste some resources because we are still starting things running, and during the last two steps, we waste resources because we've run out of operands to keep all the computational jobs working, but in between, each of the 3 "workstations" along our assembly line is fully occupied! The more work we have, the less waste there will be during the startup and takedown of our assembly line.

The hardware to do this is remarkably simple!

	      a              b
        |                                      |
        |           align                      |
              |aman          |bman       |sexp
 clock      __|______________|___________|__
 --o-------|>_______________________________| interstage register
   |     _____|______________|_____      |
   |    |                          |     |
   |    |            add           |     |
   |    |__________________________|     |
   |                  |sman              |
   |        __________|__________________|__
    -------|>_______________________________| interstage register
        |                                      |
        |         normalize                    |

At each step, we must clock all of the interstage registers in parallel! These must, of course, be edge triggered registers. A complete pipelined vector adder would have the following structure:

        |                          |
   . . .|       operand fetch      |
   .    |__________________________|
   .          |a             |b
 clock      __|______________|__
 --o-------|>___________________| interstage register
   |     _____|______________|_________________
   |    |                                      |
   |    |           align                      |
   |    |______________________________________|
   |          |aman          |bman       |sexp
   |        __|______________|___________|__
   o-------|>_______________________________| interstage register
   |     _____|______________|_____      |
   |    |                          |     |
   |    |            add           |     |
   |    |__________________________|     |
   |                  |sman              |
   |        __________|__________________|__
   o-------|>_______________________________| interstage register
   |     _____________|__________________|_____
   |    |                                      |
   |    |         normalize                    |
   |    |______________________________________|
   |                  |s
   |        __________|_________
    -------|>___________________| interstage register
   .     _____________|____________
   .    |                          |
   . . .|       result store       |
During clock cycle n, we fetch one pair of operands from memory, align the operands we fetched during cycle n-1, add the operands we fetched during cycle n-2, normalize the result computed from the operands fetched during cycle n-3, and we store the result computed from the operands fetched during cycle n-4.

Curiously, the add/subtract pipeline has 5 stages, while the multiply pipleine only needs 4 stages if we use an O(log n) multiplier! This is because there is no pre-alignment step needed for multiplication.

Note that the clock line goes only to the interstage registers, and not into any of the pipeline stages. If we ignore the operand fetch and result store stages shown here, all of the pipeline stages are purely combinational, with no registers embedded in them and so no need for clocking! Because the use of the clock signal in the operand fetch and result store stages is something of an exception to a general rule, we have shown the connection of the clock lines to these stages using dotted lines.

Of course, we've glossed over a big issue here: How, in one clock cycle, do we perform 3 memory references? In the Cray 1 processor, this was avoided by storing all operands in special vector registers, but even here, there was a single bank of 8 vector registers, where each register held 64 operands of 64 bits each; such a register bank is naturally implemented with a small RAM, so we still need to make multiple references to this RAM during each cycle.

We will ignore this problem for now. Please take it on faith that there are memory architectures that allow a limited number of parallel access paths, so that from 1 to 4 memory accesses can be performed in parallel. Such a memory device is called a multiport memory. We will discuss the implementation of multiport memory in considerable detail later, but only after we have finished discussing high performance CPU designs!