22C:122/55:132, Notes, Lecture 19, Spring 2001

Douglas W. Jones
University of Iowa Department of Computer Science

  1. Vector Processing

    An early hypothesis among numerical analysts hoping to build high-performance supercomputers was that direct hardware support for array and vector operations would lead to performance breakthroughs. This was motivated by the computational style encouraged by languages such as APL, where the statement A=B+C had several interpretations depending on the types of A, B and C. If all the variables were simple variables, this statement had the expected meaning, but if one or more of the variables was a vector, several interesting meanings could emerge:

                -- if A, B and C are vectors
                for i in sizeof(A) do
                   A(i) = B(i) + C(i)
                endloop

                -- if A and B are vectors and C is a scalar
                for i in sizeof(A) do
                   A(i) = B(i) + C
                endloop
    	
    The obvious question is, if these are natural operations for programmers, why not directly implement these in hardware?

    This leads naturally to the idea of a vector coprocessor or vector functional unit that could be attached to a machine. Generally, when these were built, the vector processor performed the following algorithm, expressed in C:

    	typedef double real;	/* any floating point type will do */

    	void vectorop( real * src1, int stride_s1,
    		       real * src2, int stride_s2,
    		       real * dst,  int stride_dst,
    		       int numops, real f( real s1, real s2 ) )
    	{
    		/* apply f to corresponding elements of the two source
    		   vectors, stepping each pointer by its own stride */
    		while (numops > 0) {
    			*dst = f( *src1, *src2 );
    			numops--;
    			src1 += stride_s1;
    			src2 += stride_s2;
    			dst  += stride_dst;
    		}
    	}
    	
    This general operation is very powerful; not only can it add corresponding elements of two vectors to compute the elements of a result vector, but it can do things like multiplying each element of a row of one matrix (with stride 1) by the corresponding element of a column of another matrix (with stride equal to the number of columns, assuming row-major storage).
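    As a sketch of how this might be used (the names N, mul and row_times_column below are illustrative, not part of any particular machine's software), multiplying row i of one N by N matrix, elementwise, by column j of another, with both matrices stored row-major, could be expressed as a single call to the routine above:

    	/* the operation to be applied to each pair of elements */
    	real mul( real s1, real s2 )
    	{
    		return s1 * s2;
    	}

    	/* multiply row i of a (consecutive elements, stride 1) by
    	   column j of b (elements N apart, stride N), leaving the
    	   N products in product[]; a and b are N by N matrices
    	   stored row-major in one-dimensional arrays */
    	void row_times_column( real a[], real b[], real product[],
    			       int i, int j, int N )
    	{
    		vectorop( &a[i*N], 1,        /* row i of a     */
    			  &b[j],   N,        /* column j of b  */
    			  product, 1,        /* result vector  */
    			  N, mul );
    	}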

    Typically, a vector functional unit or vector coprocessor would have the following structure:

    	Registers:
    	   src1, src2, dst   -- address registers
    	   str1, str2, strd  -- increment registers
    	   temp1, temp2      -- internal operand registers
    	   count             -- count for the operation in progress
    	   op                -- operation to perform
    
    	Data Paths:
    	   address_bus = select( src1, src2, dst )
    	   src1 = src1 + str1
    	   src2 = src2 + str2
    	   dst  = dst  + strd
    	   count = count - 1
    	   temp1 = data_bus
    	   temp2 = data_bus
    	   data_bus = alu( op, temp1, temp2 )
    	

    If the vector processor is attached as a crude coprocessor, the CPU would initiate a vector operation by loading the registers listed above (aside from the temp registers) and then starting the coprocessor.
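    As a very rough sketch of how the register set and data paths listed above might behave once started (the structure declaration, the use of array indices as addresses, and the simulated memory are all assumptions made purely for illustration), each pass through the loop below corresponds to one element of the vector operation:

    	#include <stddef.h>

    	typedef double real;               /* as in the routine above */

    	/* the registers listed above, modeled as a structure */
    	struct vector_unit {
    		size_t src1, src2, dst;    /* address registers          */
    		size_t str1, str2, strd;   /* increment registers        */
    		real   temp1, temp2;       /* internal operand registers */
    		int    count;              /* count for the operation    */
    		real (*op)( real, real );  /* operation to perform       */
    	};

    	/* run a vector operation to completion against a simulated
    	   memory, once the CPU has loaded the registers and started
    	   the unit; each iteration exercises the data paths listed */
    	void vector_unit_run( struct vector_unit *u, real memory[] )
    	{
    		while (u->count > 0) {
    			u->temp1 = memory[u->src1];   /* temp1 = data_bus   */
    			u->temp2 = memory[u->src2];   /* temp2 = data_bus   */
    			memory[u->dst] = u->op( u->temp1, u->temp2 );
    			u->src1 += u->str1;           /* src1 = src1 + str1 */
    			u->src2 += u->str2;           /* src2 = src2 + str2 */
    			u->dst  += u->strd;           /* dst  = dst  + strd */
    			u->count = u->count - 1;      /* count = count - 1  */
    		}
    	}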

    If the vector processor is incorporated into the CPU as a functional unit, vector instructions might simply name the CPU registers to be used, so we could imagine a vector instruction with the format:

             _____________________________________________________
            |_________|_______|_____|_____|_____|_____|_____|_____|
            | opcode  | count | addr| str | addr| str | addr| str |
            |                 |    src1   |    src2   |    dst    |
    	
    If we have 16 index registers, we can select any 6 of them to hold the addresses and strides used for this computation. (In many computations, only two strides are used, 1 and n, where the vectors are of length n and the matrices are n by n.)

    At 4 bits per register field, the instruction outlined above takes 24 bits to specify source and destination addresses and strides. The remaining 8 bits of a 32-bit instruction seem insufficient to specify both an operation and a (constant) count, but there is a common solution to this problem.
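    To see where the bits go, here is one possible packing and decoding of such an instruction; the exact field positions are an assumption made for illustration, but however the fields are arranged, the six 4-bit register fields consume 24 of the 32 bits, leaving only 8 to be split between the opcode and the count:

    	#include <stdint.h>

    	/* one assumed layout, high bits first:
    	   | opcode:4 | count:4 | src1:4 | str1:4 | src2:4 | str2:4 | dst:4 | strd:4 |
    	   with only 8 bits left over, the opcode and count together
    	   are badly cramped, which is exactly the problem noted above */

    	#define FIELD(inst, pos)  (((inst) >> (pos)) & 0xF)

    	struct long_vec_inst {
    		unsigned opcode, count;
    		unsigned src1, str1, src2, str2, dst, strd;
    	};

    	struct long_vec_inst decode_long( uint32_t inst )
    	{
    		struct long_vec_inst f;
    		f.opcode = FIELD( inst, 28 );   f.count = FIELD( inst, 24 );
    		f.src1   = FIELD( inst, 20 );   f.str1  = FIELD( inst, 16 );
    		f.src2   = FIELD( inst, 12 );   f.str2  = FIELD( inst,  8 );
    		f.dst    = FIELD( inst,  4 );   f.strd  = FIELD( inst,  0 );
    		return f;
    	}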

    Typical large-scale scientific computers have always supported machine words much larger than the memory address. The IAS machine proposed in 1946 had a 40-bit word and an address well under 20 bits. The IBM 704/709/7040/7090 family had a 36-bit word and an address that was at most 18 bits (15 bits was far more typical). The CDC 6600 had a 60-bit word and an 18-bit memory address, and by the time 32-bit memory addresses were contemplated, the usual proposals for supercomputers had 64-bit words.

    The indexed addressing scheme used on the 709, and later adopted (with modifications) by many vector processors, stored pointers in registers using the following format:

            contents of a register
    	 _______________________________________ 
    	|___________________|___________________|
    	|     decrement     |      address      |
    	|     (stride)      |                   |
    	
    The low half of each register, when used for indexing, is a pointer to a memory location. The high half of the register is used to modify the address when the pointer is used to address successive locations during a vector operation.

    The IBM 709 decremented the address by the amount indicated so that the same mechanism could be used for loop indexing, with termination occurring when the address field went to zero. The 709, of course, was not a vector processor. The typical vector processor would use increment semantics! This allows the following instruction format:

             ___________________________________
            |_________|_______|_____|_____|_____|
            | opcode  | count | src1| src2| dst |
    	
    Given 16 to 64 index registers, it takes only 12 to 18 bits of the instruction to specify source and destination addresses for a vector operation. An 8-bit opcode field leaves 6 to 12 bits to specify the number of elements per vector, and this is plenty!
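    Here is a sketch of such a packed register and its increment semantics (the 32-bit register with 16-bit halves below is an assumption made purely to keep the example short; the machines discussed above had 36-bit or larger words):

    	#include <stdint.h>

    	/* address in the low half, stride in the high half */
    	#define ADDRESS(r)  ((uint32_t)(r) & 0xFFFFu)     /* low half  */
    	#define STRIDE(r)   ((uint32_t)(r) >> 16)         /* high half */

    	/* advance the register by one element, using the increment
    	   semantics a vector processor would want (the 709 itself
    	   decremented instead) */
    	uint32_t advance( uint32_t r )
    	{
    		uint32_t addr = (ADDRESS(r) + STRIDE(r)) & 0xFFFFu;
    		return (r & 0xFFFF0000u) | addr;
    	}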

  2. The Vector Advantage

    The "vector advantage" of a processor was defined as the relative speedup that could be realized by coding vector operations on that processor using the vector instruction set instead of the conventional instruction set of the same machine.

    Naive advocates of vector architectures assumed that the most successful vector machines would be the machines with the highest vector advantage. In fact, a retrospective look at the vector architectures that were marketed shows exactly the opposite!

    The most successful vector machine to come to market was the Cray 1 family (including its successors, the Cray X-MP, the Cray 2 and the Cray Y-MP). While these machines had incredibly complex vector processors, complete with vector operand registers inside the CPU able to hold 64 64-bit operands each, they had remarkably low vector advantages. For the Cray 1, the figure usually quoted is 4.

    Other vector processors, most notably the CDC STAR (the name stood for STring and ARray processor), had far higher vector advantages. The STAR had 2 to 4 vector functional units very similar to those described above, with no special vector operand registers.

    Why devote a huge fraction of the hardware in a CPU to vector processing if it only speeds up the resulting computations by a factor of 4? Why not devote this hardware to speeding up everything else? Once this question was asked, development of new vector processors came to an almost complete halt!

  3. Pipelines

    One idea that dominated the design of all high performance vector processors has survived: The use of pipelined computation. This idea was first clearly identified in looking at the computational complexity of floating point operations in the 1960's. A floating point add is performed as follows (ignoring rounding and questions of signed number representation):

    	/* exponent_field_of, mantissa_field_of, top_bit and makefloat
    	   are assumed helpers for unpacking and repacking the fields
    	   of the floating point representation */
    	float add( float a, float b )
    	{
    		int aexp = exponent_field_of(a);
    		int aman = mantissa_field_of(a);
    		int bexp = exponent_field_of(b);
    		int bman = mantissa_field_of(b);
    		int sexp;
    		int sman;

    		/* align the points of the two operands */
    		if (aexp > bexp) {
    			bman = bman >> (aexp - bexp);
    			sexp = aexp;
    		} else if (bexp > aexp) {
    			aman = aman >> (bexp - aexp);
    			sexp = bexp;
    		} else {
    			sexp = aexp;
    		}

    		/* add */
    		sman = aman + bman;

    		/* normalize */
    		while (top_bit(sman) == 0) {
    			sman = sman << 1;
    			sexp = sexp - 1;
    		}

    		return makefloat( sexp, sman );
    	}
    	
    In building hardware to perform this, it is natural to organize it along the following lines:
    	      a              b
             _____|______________|_____
            |                          |
            |           align          |
            |__________________________|
                  |              |
             _____|______________|_____
            |                          |
            |            add           |
            |__________________________|
                          |
             _____________|____________
            |                          |
            |         normalize        |
            |__________________________|
                          |
                          s
    	
    The align step can be done using a pair of subtractors, computing (aexp-bexp) and (bexp-aexp), to control a set of shifters and a final multiplexor. This computation can be done in O(log n) gate delays! The adder can also be done in this time, and so can the normalizer. If speed is our goal, it seems annoying that the whole operation takes three times as long as any one step, while the hardware for each step sits idle most of the time, when we could be performing the steps in parallel on different operands! The idea is a natural one, sometimes described as bringing assembly line methods to computation.
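    Before turning to the assembly-line idea, here is a rough combinational sketch of the align stage just described (the function and variable names are placeholders, not part of any real design): both subtractors compute their differences, both shifters shift, and multiplexors select the outputs.

    	/* combinational view of the align stage: two subtractors, two
    	   shifters and multiplexors, all working in parallel in hardware */
    	void align_stage( int aexp, int aman, int bexp, int bman,
    			  int *man_a, int *man_b, int *exp_out )
    	{
    		int d1 = aexp - bexp;                     /* first subtractor  */
    		int d2 = bexp - aexp;                     /* second subtractor */

    		*man_a   = (d2 > 0) ? aman >> d2 : aman;  /* shifter and mux */
    		*man_b   = (d1 > 0) ? bman >> d1 : bman;  /* shifter and mux */
    		*exp_out = (d1 > 0) ? aexp : bexp;        /* mux             */
    	}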

    Consider adding a+b, c+d and e+f as follows:

    	step 1   align a+b
    	step 2   align c+d   add a+b
    	step 3   align e+f   add c+d   normalize a+b
    	step 4               add e+f   normalize c+d
    	step 5                         normalize e+f
    	
    During the first two steps, we waste some resources because the pipeline is still filling, and during the last two steps, we waste resources because we have run out of operands to keep all of the stages busy, but in between, each of the three "workstations" along our assembly line is fully occupied!
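    The schedule generalizes to any number of operand pairs; the following throwaway routine (purely illustrative, not part of any hardware) prints it, and with n = 3 it reproduces exactly the five-step table above:

    	#include <stdio.h>

    	/* print the 3-stage schedule for n operand pairs: at each step,
    	   one pair is aligned, the previous pair added, and the pair
    	   before that normalized (pairs numbered from 0, steps from 1) */
    	void print_schedule( int n )
    	{
    		int step;
    		for (step = 0; step < n + 2; step++) {
    			printf( "step %d  ", step + 1 );
    			if (step < n)
    				printf( "align pair %d  ", step );
    			if (step >= 1 && step - 1 < n)
    				printf( "add pair %d  ", step - 1 );
    			if (step >= 2 && step - 2 < n)
    				printf( "normalize pair %d", step - 2 );
    			printf( "\n" );
    		}
    	}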

    The hardware to do this is simple!

    	      a              b
             _____|______________|_____
            |                          |
            |           align          |
            |__________________________|
                __|______________|__
               |____________________| interstage register
             _____|______________|_____
            |                          |
            |            add           |
            |__________________________|
                __________|_________
               |____________________| interstage register
             _____________|____________
            |                          |
            |         normalize        |
            |__________________________|
                          |
                          s
    	
    At each step, clock all of the interstage registers in parallel! These must, of course, be edge-triggered registers. A complete pipelined vector adder would have the following structure:
             __________________________
            |                          |
            |       operand fetch      |
            |__________________________|
                __|______________|__
               |____________________| interstage register
             _____|______________|_____
            |                          |
            |           align          |
            |__________________________|
                __|______________|__
               |____________________| interstage register
             _____|______________|_____
            |                          |
            |            add           |
            |__________________________|
                __________|_________
               |____________________| interstage register
             _____________|____________
            |                          |
            |         normalize        |
            |__________________________|
                __________|_________
               |____________________| interstage register
             _____________|____________
            |                          |
            |       result store       |
            |__________________________|
    	
    During clock cycle n, we fetch one pair of operands from memory, align the operands we fetched during cycle n-1, add the operands we fetched during cycle n-2, normalize the result computed from the operands fetched during cycle n-3, and we store the result computed from the operands fetched during cycle n-4.
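    A minimal software model of this schedule follows (the structure, the valid bits, and the use of C's + operator in place of the align, add and normalize hardware are all simplifications made for illustration); its only point is to show every interstage register being updated, in effect, in parallel on each clock cycle:

    	#include <stdbool.h>

    	/* one interstage register: a pair of operands (or a result
    	   carried in a) plus a valid bit so the pipe can fill and drain */
    	struct stage_reg {
    		float a, b;
    		bool  valid;
    	};

    	void pipelined_vector_add( const float src1[], const float src2[],
    				   float dst[], int n )
    	{
    		struct stage_reg r1 = { 0, 0, false };  /* output of fetch     */
    		struct stage_reg r2 = { 0, 0, false };  /* output of align     */
    		struct stage_reg r3 = { 0, 0, false };  /* output of add       */
    		struct stage_reg r4 = { 0, 0, false };  /* output of normalize */
    		int fetched = 0, stored = 0;

    		while (stored < n) {        /* one iteration = one clock cycle */

    			/* result store works on what is in r4 this cycle */
    			if (r4.valid) dst[stored++] = r4.a;

    			/* clock every interstage register "in parallel":
    			   updating from the back forward means each stage
    			   consumes its predecessor's value from last cycle */
    			r4 = r3;                           /* normalize (identity here)  */
    			r3 = r2;                           /* add stage                  */
    			if (r3.valid) r3.a = r3.a + r3.b;  /* + stands in for align+add  */
    			r2 = r1;                           /* align (folded into + above) */
    			r1.valid = false;                  /* operand fetch               */
    			if (fetched < n) {
    				r1.a = src1[fetched];
    				r1.b = src2[fetched];
    				r1.valid = true;
    				fetched++;
    			}
    		}
    	}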

    Curiously, the add/subtract pipeline has 5 stages, while the multiply pipeline needs only 4 stages if we use an O(log n) multiplier! This is because there is no pre-alignment step needed for multiplication.

    Of course, we've glossed over a big issue here: How, in one clock cycle, do we perform 3 memory references? In the Cray 1 processor, this was avoided by storing all operands in special vector registers, but even here, there was a single bank of 8 vector registers, where each register held 64 operands of 64 bits each; such a register bank is naturally implemented with a small RAM, so we still need to make multiple references to this RAM during each cycle.

    We will ignore this problem for now. Please take it on faith that there are memory architectures that allow a limited number of parallel access paths, so that from 1 to 4 memory accesses can be performed in parallel. Such a memory device is called a multiport memory. We will discuss the implementation of multiport memory in considerable detail later, but only after we have finished discussing high performance CPU designs!