22C:122/55:132, Notes, Lecture 15, Spring 2001

Douglas W. Jones
University of Iowa Department of Computer Science

Multiplication

The classical Russian Peasant multiplication algorithm can be expressed iteratively as follows:

        unsigned int times(unsigned int ier, unsigned int icand)
        {
           unsigned int prod = 0;
           while (ier != 0) {
              if (ier & 1) prod = prod + icand;
              ier = ier >> 1;           -- /2
              icand = icand << 1; -- *2
           }
           return prod;
        }

However, this form is rarely implemented in hardware. Instead, the common approach, widely used on single-accumulator machines of the 1950's and 1960's, relies on the following data paths:

	
	|     Multiplicand    |
	|_____________________|
                 __|   _________              ____________
             ___|_____|___      |      ______|______      |
            |             |     |     |             |     |
            | ALU/Shifter |->------->-|   Shifter   |     |
            |_____________|     |     |_____________|     |
                   |            |            |            |
	 __________|__________  |  __________|__________  |
	|                     | | |                     | |
	|>    Accumulator     | | |>Multiplier-Quotient | |
	|_____________________| | |_____________________| |
                   |____________|            |____________|

The multiply algorithm can be expressed in terms of this hardware as follows:

        unsigned int times(unsigned int ier, unsigned int icand)
        {
           unsigned int AC = 0;   -- the accumulator
           unsigned int MQ = ier; -- the multiplier
           for (i = 0; i < wordsize; i++) {
              if (MQ odd) {
                  AC|MQ = (AC + icand)|MQ >> 1
              } else {
                  AC|MQ = AC|MQ >> 1
              } endif
           }
           return MQ;
	}

Here, the notation of a programming language such as C is deficient because it makes it difficult to deal with something that is very easy in hardware: The treatment of two one word variables as a two-word variable. This is indicated above using the vertical bar symbol as a concatenation operator, so AC|MQ indicates one double length quantity, half of which comes from or is destined for AC, and the other half comes from or is destined for MQ.

When applied to an N-bit word, this multiply operation actually produces a product that is 2N bits long, with the most significant half in AC, and the least significant half in MQ, replacing the multiplier that originally filled this register. Classical single accumulator machines simply exposed the programmer to this fact, allowing the programmer to keep one, the other, or both words of the result, depending on the needs of the algorithm. This option is lost in most moder high level languages, which insist that the product of 2 N-bit numbers is an N-bit number. Unfortunately, many modern machines follow the lead of such languages!

It is worth noting that the multiply loop can be rewritten as follows, using the C (if?then:else) notation for a conditional expression:

           for (i = 0; i < wordsize; i++) {
              AC|MQ = (odd(MQ) ? AC + icand : AC)|MQ >> 1;
           }

Few microcoded machines use a loop counter control register to this loop. Instead, it is very common to simply unroll the loop the required number of times. Thus, for a 16 bit machine, we would replace the above loop with the following code:

           AC|MQ = (odd(MQ) ? AC + icand : AC)|MQ >> 1;
           AC|MQ = (odd(MQ) ? AC + icand : AC)|MQ >> 1;
           AC|MQ = (odd(MQ) ? AC + icand : AC)|MQ >> 1;
           AC|MQ = (odd(MQ) ? AC + icand : AC)|MQ >> 1;

           AC|MQ = (odd(MQ) ? AC + icand : AC)|MQ >> 1;
           AC|MQ = (odd(MQ) ? AC + icand : AC)|MQ >> 1;
           AC|MQ = (odd(MQ) ? AC + icand : AC)|MQ >> 1;
           AC|MQ = (odd(MQ) ? AC + icand : AC)|MQ >> 1;

           AC|MQ = (odd(MQ) ? AC + icand : AC)|MQ >> 1;
           AC|MQ = (odd(MQ) ? AC + icand : AC)|MQ >> 1;
           AC|MQ = (odd(MQ) ? AC + icand : AC)|MQ >> 1;
           AC|MQ = (odd(MQ) ? AC + icand : AC)|MQ >> 1;

           AC|MQ = (odd(MQ) ? AC + icand : AC)|MQ >> 1;
           AC|MQ = (odd(MQ) ? AC + icand : AC)|MQ >> 1;
           AC|MQ = (odd(MQ) ? AC + icand : AC)|MQ >> 1;
           AC|MQ = (odd(MQ) ? AC + icand : AC)|MQ >> 1;

In microcode, this can be reduced to 16 consecutive identical microinstructions, differing only in their next address fields. Thus, there is no overhead for conditional evaluation or microcontrol, and this multiply operation can be executed in exactly 16 clock cycles.

Higher Radix Multiply

The first solution to be widely used to speed up the N clock-cycle multiply algorithm outlined above simply uses a higher radix. Consider, for example, the following idea:

                                 |
                   -----o-----o--o--o-----o-----o-----
           ___   _|_   _|_   _|_   _|_   _|_   _|_   _|_
	  | 0 | |A*1| |A*2| |A*3| |A*4| |A*5| |A*6| |A*7|
	  |___| |___| |___| |___| |___| |___| |___| |___|
            |     |     |     |_   _|     |     |     |
            |     |     |_____  | |  _____|     |     |
            |     |_________  | | | |  _________|     |
            |_____________  | | | | | |  _____________|
                         _|_|_|_|_|_|_|_|_
                     3  | 0 1 2 3 4 5 6 7 |
         multiplier -/--|       MUX       | 
                        |_________________| 
                                 |

This circuit fragment will multiply its input by a 3 bit multiplier. If we substitute this for the ALU in the original algorithm, we can do a 15 bit multiply in 5 clock cycles, where, in each cycle, the least significant 3 bits of the MQ register are used to compute a partial product for addition to the accumulator. Instead of using one-place shifts, the new multiplier uses 3-place shifts.

The above seems a bit complex, since it relies on separate combinational circuitry to compute what ammounts to a table of partial products, but in fact, computing these is straightforward. Consider the following facts:

	a*0 =                 = 0    ? 0
	a*1 = a                 0    + a
	a*2 =     a<<1        = 0    + a<<1
	a*3 = a + a<<1        = a<<2 - a
	a*4 =            a<<2 = a<<2 ? 0
	a*5 = a        + a<<2 = a<<2 + a
	a*6 =     a<<1 + a<<2 = a<<2 + a<<1
	a*7 = a + a<<1 + a<<2 = a<<3 - a

These can be computed with a single add/subtract unit and two 2-input multiplexors, so long as the multiplexors have enable inputs in addition to their select inputs (when not enabled, the multiplexor output is zero).

                                 |
                         -----o--o--o-----
                       _|_   _|_    |    _|_
	              |<<2| |<<3|   |   |<<1|
	              |___| |___|   |   |___|
                       _|_____|_   _|_____|_
                ASEL -| 0     1 | | 0     1 |- BSEL
                      |  MUX A  | |  MUX B  |
               AENAB -|_________| |_________|- BENAB
                          _|___________|_
                         | A           B |
              ADDSUB ----|    0: A+B     |
                         |    1: A-B     |
                         |_______________|
                                 |

Given 3 bits of the multiplier, we set the control inputs to the above circuitry as follows in order to create a circuit equivalent to the 8-input multiplexor version of the circuit.

	 M0 M1 M2 | ASEL AENAB BSEL BENAB ADDSUB
        ----------|-----------------------------
	  0  0  0 |   x    0     x    0     x
          0  0  1 |   x    0     0    1     0
          0  1  0 |   x    0     1    1     0
          0  1  1 |   0    1     0    1     1
          1  0  0 |   0    1     x    0     x
          1  0  1 |   0    1     0    1     0
          1  1  0 |   0    1     1    1     0
          1  1  1 |   1    1     0    1     1

	ASEL   = M0 and M1 and M2
	AENAB  = (M1 and M2) or M0
	BSEL   = M1 and not(M2)
	BENAB  = (M1 or M2)
	ADDSUB = (M1 and M2)

It is worth noting that the high end IBM System 360 computers of the late 1960's used radix 16 so that they could do a 32-bit by 32-bit full-word multiply in 8 clock cycles or a 16-bit by 16-bit half-word multiply in 4 clock cycles.

Despite its elegance, this scheme offers only a constant speedup; microcoded implementations are still called for, and the demand for very high performance computers appears to require faster multiply hardware. Also, of course, we are interested in the ultimate bounds on the computation.

Brute Force Parallel Multiplication

When we multiply two numbers, using the algorithm everyone learns for decimal multiplication in elementary school, we lay the work out as follows:

	           C   C   C   C   C    multipliCand
                    4   3   2   1   0

	         x         E   E   E    multipliEr
                            2   1   0
        -------------------------------
	           P   P   P   P   P
	            04  03  02  01  00
	       P   P   P   P   P        Partial products
	        14  13  12  11  10
	+  P   P   P   P   P
	    24  23  22  21  20
        -------------------------------
	   P   P   P   P   P   P   P    Product
            6   5   4   3   2   1   0

The product P is the sum of the partial products P_i, where the number of partial products equals the number of digits in the multiplier.

For radix 2, the digits of partial product P_i are zero if bit E_i of the multiplier is zero, and the partial product equals the multiplicand if E_i is one. That is

P_ij = E_i and C_j

Aside from this trivial logic, all that is involved in multiplication is an array of adders! To multiply an N bit numbe by an M bit number, we need N adders of M bits each. If each adder has an enable input to enable computation of either A+B or just A, we can construct a fully parallel multiplier as follows:

                             C     C     C     C     C
                          0   4 0   3 0   2 0   1 0   0
                         _|_/  _|_/  _|_/  _|_/  _|_/
                        |   |-|   |-|   |-|   |-|   |---- E
                     ---|___|-|___|-|___|-|___|-|___|-0    0
                   _|_ / _|_ / _|_ / _|_ / _|_ /  |
                  |   |-|   |-|   |-|   |-|   |---------- E
               ---|___|-|___|-|___|-|___|-|___|-0 |        1
             _|_ / _|_ / _|_ / _|_ / _|_ /  |     |
            |   |-|   |-|   |-|   |-|   |---------------- E
         ---|___|-|___|-|___|-|___|-|___|-0 |     |        2
        |     |     |     |     |     |     |     |
        P     P     P     P     P     P     P     P
         7     6     5     4     3     2     1     0

Each cell of this picture contains the following:

                          |   /
                        __|  o
                       |    /|
                 ------|-----|----o---
                      _|_____|_   |
	             | A     B |  |
	             |    BENAB|--
	             |    +    |
	         ----|Cout  Cin|------
                     |_________|     
                          |
                   /      |
                  /       |

If eacn of these units is a classical full adder with an added enable input, the total multiply time for an N-bit by N-bit multiply is 2N times the propagation delay of the full adder. That is, O(N), and we did this with O(N²) hardware.

A Really Fast Parallel Multiplier!
We know that we can use carry lookahead logic to take the brute force design above and convert each step from O(N) time to O(log N) time, but we still need N steps to add the partial sums, so this gives us an add time of O(N log N), which is not an improvement over the original O(N)!
The question is, if we have N partial sums to compute, can we do this faster? The solution rests on the following idea:
(((A + B) + C) + D) = (A + B) + (C + D)
No matter how we add N terms, it will take N-1 2-input adders, which is O(N). However, we can arrange these in two distinct ways; one adds the terms successively in O(N) time, while the other forms a binary tree off adders, adding them in O(log N) time.
This is what we will do with our partial products! The result is a multiplier that uses O(N² log N) hardware and computes the product in O(log N) time!
Carry Propagation
We can improve the performance of any of the multiplication schemes by exploiting an idea that was originally developed to speed up the simple shift-and-add multiply algorithm; this is the carry-save or the carry-forward adder (the term carry-save is used for iterative use of this approach, while the term carry-forward is used for purely combinational versions).
The classical full adder circuit, used in a ripple-carry adder, has three inputs, A, B and C_in that are identical in meaning, and it has two outputs, S and C_out. The connection of C_in from one stage to C_out of the next is the natural way to arrange these for adding two numbers, but when we are adding more than 2 numbers, there is another option! This option is illustrated here for the sum of 3 numbers::
```
          A  B  C               A  B  C       A  B  C  
           n  n  n               1  1  1       0  0  0
         _|__|__|_             _|__|__|_     _|__|__|_
        | A  B  C |           | A  B  C |   | A  B  C |
        |    +    |    ...    |    +    |   |    +    |
        |__C___S__|           |__C___S__|   |__C___S__|
           |   |                 |   |         |   |
           |   |   ---        ---    |   ------    |
           |   |  |   ---            |  |  0       |
           |  _|__|__|_  |          _|__|__|_      |
           | | A  B  C | |         | A  B  C |     |
           | |    +    | |  ...  | |    +    |     |
           | |__C___S__|         | |__C___S__|     |
           |   |    |            |   |    |        |
          -o---|-   |             ---     |        |
         |  ---o |  |                     |        |
         |_|   |_|  |                     |        |
        |AND| |XOR| |                     |        |
        |___| |___| |                     |        |
          |     |   |                     |        |
          S     S   S                     S        S
           n+2   n+1 n                     1        0
	
```
The above illustration shows an arrangement for adding 3 numbers using carry-forward addition. The first stage just adds, and it forwards the carry propagation to the next stage. Only the final stage needs to propagate the carry all the way.
In the original carry-save adders, the n-bit accumulator was augmented with an n-1 bit carry-save register. The carry out of each step in accumulating the partial products was saved in the carry-save register and added into the accumulating sum on the next cycle. Thus, even though the system essentially used ripple carry hardware, the microcycle time for each step in summing the partial products was only 3 gate delays. Only after the final step in the summation was it necessary to wait for ripple carry propagation.
This idea can be directly applied to the brute-force parallel multiplier. The result is that it takes exactly N² full adders to add the partial products, and the sum is computed in N times the delay of a one-bit full adder. The sum then needs a final carry propagate operation that can be done using the logic of a carry-lookahead half-adder (a carry lookahead incrementor).
This idea can also be applied to the binary tree of partial sums, although this requires a bit more thought.