19. Fast Arithmetic

Part of the 22C:122/55:132 Lecture Notes for Spring 2004
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Review of Addition

Given two numbers to be added, A and B, where A0 is the least significant bit of A, we add them as follows:

	     A  addend
	   + B  augend
          -----
             S  sum

Usually, we will treat the terms addend and augend as synonyms, since addition is commutative. In each bit position of the sum, we can identify the following:

	     C   carry into position i
	      i
	     A   addend bit i
	      i
	  +  B   augend bit i
	      i
        --------
	 C   S   carry out of bit position i
	  i+1 i  and bit i of the sum.

Here, Ci is both the carry into bit position i and the carry out of bit position i-1. These variables are related by the following truth table:

	 A  B  C  |  S  C
          i  i  i |   i  i+1
        ----------|---------
         0  0  0  |  0  0  
         0  0  1  |  1  0  
         0  1  0  |  1  0  
         0  1  1  |  0  1  
         1  0  0  |  1  0  
         1  0  1  |  0  1  
         1  1  0  |  0  1  
         1  1  1  |  1  1  

In hardware, we may build a 1-bit adder as follows:

Si = (Ai xor Bi ) xor Ci

Ci+1 = (Ai and Bi ) or (Ai and Ci ) or (Bi and Ci )

If we chain these together, we get a ripple-carry adder:

          A  B                  A  B          A  B      C
           n  n                  1  1          0  0      in
          |  |   ___       _    |  |   ___    |  |   ___|
         _|__|__|_  |       |  _|__|__|_  |  _|__|__|_
	| A  B  C | |       | | A  B  C | | | A  B  C |
	|    +    | |  ...  | |    +    | | |    +    |
	|__C___S__| |       | |__C___S__| | |__C___S__|
         __|   |    |_      |____|   |    |____|   |
        |      |                     |             |
        C      S                     S             S
         out    n                     1             0

The fundamental problem with this design is that the time it takes to add two n-bit numbers is O(n). Typically, this design is acceptable for n<4, but for n=16, the speed becomes a serious bottleneck, and for n=32 or n=64, it is intolerable.
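The per-bit equations above are easy to check with a short behavioral sketch in C. This is a simulation of the gate equations, not a gate-level model; the function name `ripple_add` is chosen here for illustration:

```c
/* Behavioral sketch of the ripple-carry adder: each loop iteration plays
   the role of one 1-bit adder stage, using the Si and Ci+1 equations above. */
unsigned ripple_add(unsigned a, unsigned b, int cin, int n, int *cout)
{
    unsigned s = 0;
    int c = cin;
    for (int i = 0; i < n; i++) {
        int ai = (a >> i) & 1;
        int bi = (b >> i) & 1;
        s |= (unsigned)((ai ^ bi) ^ c) << i;   /* Si = (Ai xor Bi) xor Ci */
        c = (ai & bi) | (ai & c) | (bi & c);   /* Ci+1 */
    }
    *cout = c;
    return s;
}
```

Note that the loop body is combinational logic; the O(n) delay of the hardware corresponds to the n sequential trips through this loop.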

Carry-Save Adders

One way to speed the addition is to defer carry propagation. This is done as follows:

          A  B                  A  B          A  B      C
           n  n                  1  1          0  0      in
          |  |  0               |  |  0       |  |   ___|
         _|__|__|_             _|__|__|_     _|__|__|_
	| A  B  C |           | A  B  C |   | A  B  C |
	|    +    |    ...    |    +    |   |    +    |
	|__C___S__|           |__C___S__|   |__C___S__|
         __|   |      ___     ___|   |      ___|   |
               |  0  |               |  0  |       S
              _|__|__|_             _|__|__|_       0
	     | A  B  C |           | A  B  C |
	     |    +    |    ...    |    +    |
	     |__C___S__|           |__C___S__|
              __|   |      ___     ___|   |
                   _|__|__|_              S
                  | A  B  C |              1

If we use exactly the scheme shown above, we will require O(n^2) adders and the delay will still be O(n)! However, this accounting fails to take into account the fact that each adder has an extra input that is unused in this context! Of course, many of these adders could be replaced by half-adders since they have unused inputs, but that doesn't change the asymptotic bound.

If we need to sum a number of inputs, we can use this structure to perform the sum, and this is perfectly suited to summing multiple partial products for a fast multiplier.

The term carry save adder comes from the use of this model in iterative multiplication; after adding each partial product to the sum, the accumulated sum would be held in one register and the carries produced in accumulating that sum would be held in a second register. As a result, the partial products could be accumulated at high speed, without the need to wait for carry propagation during each cycle of accumulation. Only at the end was there a need to use a full adder that propagated the carry.
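The carry-save step itself is tiny. Here is a word-level sketch in C (the function name and calling convention are illustrative, not from the notes):

```c
/* One carry-save (3:2) step: three input words in, a sum word and a carry
   word out.  No carry ripples across bit positions; each carry is simply
   shifted left one place, to be resolved by a later propagating add. */
void carry_save(unsigned a, unsigned b, unsigned c,
                unsigned *sum, unsigned *carry)
{
    *sum   = a ^ b ^ c;                          /* per-bit sums, no carries */
    *carry = ((a & b) | (a & c) | (b & c)) << 1; /* carries, moved up a place */
}
```

The invariant is that sum + carry always equals the true total, so a chain of these steps can absorb one new operand per clock at full-adder speed, with a single carry-propagating addition at the very end.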

If a carry-save adder is built in parallel, with 3 or more layers, it should be noted that the same components can be rewired to compute the sum for one column of the result in O(log n) time instead of O(n) time. To do this, the adders are arranged into a binary tree, where each internal node sums a carry input with the sums out of two subtrees. The carry outputs of each adder still go to the next tree to the left.

       a a a               a a a         a a a
        \|/                 \|/           \|/
      c'_O a c            ___O   __________O    a  -- addend at digit i
          \|/             \   \_/_     ___/     c  -- carry in to digit i
        c'_O a c           \ c'/  \ c /         c' -- carry in to digit i+1
            \|/             \|/    \|/          c" -- carry in to digit i+2
          c'_O a c        c"_O   c'_O           s  -- sum at digit i
              \|/            |      |
            c'_O             c'     s
               | 
               s

With this scheme, we create a bit of havoc with the wiring, and carries out from each stage now leapfrog upward (summing n inputs to one bit position now propagates carries to about log(n) other bit positions, instead of just sending n-2 carry bits to the adjacent bit position).

Carry-Lookahead Adders

Another way to add n-bit numbers in O(log n) time is to use carry lookahead or anticipatory carry logic. Charles Babbage coined the term carry anticipation, in the 1830's, in his work on fast mechanical adders for his difference engine, and this term is still used!

For any substring of digit positions within a sum, we can ask two questions about that substring: do the values of the operands within that substring cause it to generate a carry out, and do they cause it to propagate carry in to carry out?

	       A   A  |A   A   A | A
                5   4 | 3   2   1|  0
	     + B   B  |B   B   B | B
                5   4 | 3   2   1|  0
	     ---------|----------|---
	       S   S  |S   S   S | S
                5   4 | 3   2   1|  0
                       \________/
                          [3:1]

In the above, we have singled out bits 3 to 1 of a sum, and we ask for P[3:1] -- are the values of A and B in bit positions 3 to 1 such that C4 is implied by C1?

P[3:1] = (A3 or B3) and (A2 or B2) and (A1 or B1)

Similarly, we can ask for G[3:1] -- are the values of A and B in bit positions 3 to 1 such that they imply C4 without knowledge of C1?

G[3:1] = (A3 and B3) or (A2 and B2 and P[3]) or (A1 and B1 and P[3:2])

Our goal is to use the above information in a divide and conquer approach to fast addition. To do this, we want to build a tree with the adders for each bit position at the leaves and the tree body performing the carry propagation. To do this, we need a way to compute P[a:c] when given P[a:b] and P[b:c]. This is remarkably easy!

P[a:c] = P[a:b] and P[b:c]

We also need a way to compute G[a:c] when given G[a:b] and G[b:c]. This is remarkably easy!

G[a:c] = G[a:b] or (G[b:c] and P[a:b])
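These two combining rules can be checked exhaustively with a small sketch in C (the struct and function names are invented for illustration). For any block, carry out = G or (P and carry in):

```c
/* Block propagate/generate signals and their combining rules. */
typedef struct { int g, p; } PG;

/* P and G for a single bit position: g = A and B, p = A or B. */
PG pg_bit(int a, int b)
{
    PG r = { a & b, a | b };
    return r;
}

/* Combine a more significant block [a:b] with a less significant block
   [b:c]:  P[a:c] = P[a:b] and P[b:c],
           G[a:c] = G[a:b] or (G[b:c] and P[a:b]).  */
PG pg_combine(PG hi, PG lo)
{
    PG r = { hi.g | (lo.g & hi.p), hi.p & lo.p };
    return r;
}
```

Checking all 3-bit operand pairs and both carry-in values confirms that G | (P and Cin) for the combined block [2:0] always equals the true carry out of bit 2.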

Now, assuming the word size is 2n, we can build a complete balanced binary tree with adders for the leaves, where each internal node of the tree operates as above, with one additional detail.

We must, at each internal node, accept a carry in from the least significant side and, depending on the propagate signals, deliver this to the appropriate subtrees. With this change, some internal node of the carry tree will generate carry in to each stage of the adder, so no adder stage needs to compute carry out! The result is as follows:

          A  B                  A  B          A  B      C
           n  n                  1  1          0  0      in
          |  |                  |  |          |  |      |
          |  |   ---            |  |   ---    |  |   ---o
         _|__|__|_  |          _|__|__|_  |  _|__|__|_  |
	| A  B  C | |         | A  B  C | | | A  B  C | |
	|    +    | |  ...    |    +    | | |    +    | |
	|_G_P__S__| |         |_G_P__S__| | |_G_P__S__| |
          | |  |    |           | |  |    |   | |  |    |
          | |  S                | |  S    |   | |  S    |
          | |   n               | |   1   |   | |   0   |
          | |                   | |       |  -  |       |
                                |  -----  | |  -        |
                                 -----  | | | |         |
                                     _|_|_|_|_|_        |
               ...                  | G P C G P |       |
                                    |  1 1 1 0 0|       |
                            -----   |         C |-------o
                                 |  |____G_P___0|       |
                                 |       | |            |
                        -------  |  -----  |            |
                     --------  | | |  -----             |
                            _|_|_|_|_|_                 |
                           | G P C G P |                |
                           |  1 1 1 0 0|                |
                           |         C |----------------o
                           |____G_P___0|                |
                    ___         | |                     |
         C   ------|OR |-------- _|_                    |
          out      |___|--------|AND|-------------------
                                |___|

We now have a scheme that gives us carry propagation from Cin to Cout in O(log2 n) time for an n bit sum, while requiring O(n log2 n) hardware for carry propagation. The additional hardware cost to move from O(n) to O(n log2 n) hardware is almost always justified by the speed improvement from O(n) to O(log2 n).

Essentially all computers with word sizes greater than 8 bits have used this approach to fast carry since the early 1970's.

The adders at the leaves of the carry propagation tree are:

			___
	  A ---o-------|XOR|   ___
	  B ---|-o-----|___|--|XOR|---- S
	  C ---|-|------------|___|
               | |      ___
               | o-----|AND|----------- G
               o-|-----|___|
               | |      ___
               |  -----|OR |----------- P
                -------|___|

The internal nodes of the carry lookahead tree are:

   
			___
          C -----------|AND|   ___
           0     ------|___|--|OR |---- C
          G ----|---o---------|___|      1
           0    |   |   ___
          P ----o    --|AND|   ___
           0    |    --|___|--|OR |---- G
          G ----|---|---------|___|
           1    |   |   ___
          P ----|---o--|AND|----------- P
           1     ------|___|

This is the scheme used on most modern computers, starting around 1970, except that we use a 4-way tree instead of a binary tree. The 4-way tree is used because, for most digital logic circuits used in VLSI design, fanin and fanout on the order of 4 is easily done, while higher fanouts require higher power levels. So, a 16-bit adder requires four 4-bit lookahead units connecting the leaves, plus one more to coordinate them, and by adding one more level to the tree, we get a 64-bit fast adder.

Adding functionality to the ALU

The fast adder outlined above only performs one function, but it may easily be extended to other functions as follows. Note that, at the core of each adder, there is a simple function A xor B xor C. The truth table for this is:

	 A B C |  S
	-------|-----
	 0 0 0 |  0
	 0 0 1 |  1
	 0 1 0 |  1
	 0 1 1 |  0
	 1 0 0 |  1
	 1 0 1 |  0
	 1 1 0 |  0
	 1 1 1 |  1

Mechanical reduction of this to a sum-of-products logic circuit requires 3 inverters and 5 nand gates:

          A     B     C
          |     |     |
          o--   o--   o--
          | _|_ | _|_ | _|_
          | \ / | \ / | \ /
          |  O  |  O  |  O
          |  |  |  |  |  |  ___
         -|--o--|--|--|--|-|   |
         -|--|--|--o--|--|-|AND|O----
         -|--|--|--|--o--|-|___|     |
          |  |  |  |  |  |  ___      |
         -|--o--|--|--|--|-|   |     |
         -|--|--o--|--|--|-|AND|O--  |
         -|--|--|--|--|--o-|___|   |  ---|___
          |  |  |  |  |  |  ___     -----|   |
         -o--|--|--|--|--|-|   |         |AND|O--- S
         -|--|--|--o--|--|-|AND|O--------|___|
         -|--|--|--|--|--o-|___|       --|
          |  |  |  |  |  |  ___       |
         -o--|--|--|--|--|-|   |      |
         -|--|--o--|--|--|-|AND|O----- 
         -|--|--|--|--o--|-|___|
          |  |  |  |  |  |

If we expand this circuit by adding a few more gates and a few more inputs, we get the heart of a useful ALU:

          A    B             C
           i    i             i
          |    |  -----------|------------- NB not B
          |    | |           |  ------o---- CE carry enable
          |    | |           | |      | 
          |    | |           o------  |
          |    |_|           |_|    O_|
          |   |XOR|         |AND|  |AND|
          |   |___|         |___|  |___|
          |     |             |      |  
          o--   o--           |      |
          | _|_ | _|_  _______|      | 
          | \ / | \ / |   ___________|
          |  O  |  O  |  |  ------------ XE xor enable
          |  |  |  |  |  | |  ___
         -|--o--|--|--|--|-|-|   |
         -|--|--|--o--|--|-|-|AND|O----
         -|--|--|--|--o--|-|-|___|     |
          |  |  |  |  |  | |           |
          |  |  |  |  |  | o-|___      |
         -|--o--|--|--|--|-|-|   |     |
         -|--|--o--|--|--|-|-|AND|O--  |
         -|--|--|--|--|--o-|-|___|   | |
          |  |  |  |  |  | |         | |
          |  |  |  |  |  |  -|___    |  -|___
         -o--|--|--|--|--|---|   |    ---|   |
         -|--|--|--o--|--|---|AND|O------|AND|O--- S
         -|--|--|--|--|--o---|___|   ----|___|
          |  |  |  |  |  |    ___   |  --|
         -o--|--|--|--|--|---|   |  | |
         -|--|--o--|--|--|---|AND|O-  |  
         -|--|--|--|--o--|---|___|    |  
          |  |  |  |  |  |    ___     |  
         -o--|--|--|---------|   |    |  
         -|--|--o--|---------|AND|O---   
          |  |  |  |        -|___|        
                           |
                            ------------ AE and enable

We now have the following functions:

	 NB CE XE AE | function
	-------------|----------
	  0  1  1  0 | S = A + B (the normal add function)
	  1  1  1  0 | S = A - B (with carry = not borrow)
	  0  0  1  0 | S = A xor B
	  0  0  1  1 | S = A or B
	  0  0  0  1 | S = A and B

Many other functions are possible! Most notably, we can add extra enable inputs on the NAND gates in the column of gates, plus extra rows (with a corresponding widening of the final nand gate) to add functions such as left and right shift to the ALU.
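The table above can be checked with a word-level C model of this ALU slice. This is a behavioral sketch, not the notes' own code: NB inverts B, CE gates the carry into each stage, XE enables the xor (sum) term, AE adds an extra A-and-B term, and subtraction requires a carry in of 1 (consistent with carry = not borrow):

```c
/* Word-level model of the ALU slice: NB inverts B, CE gates the carry
   chain, XE enables the sum (xor) term, AE enables the A-and-B term. */
unsigned alu(unsigned a, unsigned b, int nb, int ce, int xe, int ae,
             int cin, int n)
{
    unsigned s = 0;
    int c = cin;
    for (int i = 0; i < n; i++) {
        int ai = (a >> i) & 1;
        int bi = ((b >> i) & 1) ^ nb;  /* NB: complement B for subtraction */
        int ci = c & ce;               /* CE: force carry in to 0 if disabled */
        s |= (unsigned)((xe & (ai ^ bi ^ ci)) | (ae & (ai & bi))) << i;
        c = (ai & bi) | (ai & ci) | (bi & ci);
    }
    return s;
}
```

With XE and AE both on and the carry disabled, the sum term gives A xor B while the extra term gives A and B, and their OR is exactly A or B, as the table claims.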

Naive Sequential Multiplication

Given two numbers to be multiplied, A and B, where A0 is the least significant bit of A, we multiply them as follows:

	     A  multiplicand (icand for short)
	   x B  multiplier   (ier for short)
          -----
             S  product      (prod for short)

Usually, we treat the terms multiplier and multiplicand as synonyms, since multiplication is commutative, but the algorithms for multiplication treat these two operands in very different ways!

The classic multiplication algorithm used on computers derives directly from an algorithm people frequently do in their heads. As an example, consider multiplying 12 by 15 mentally. This is hard, but you can multiply 6 by 30 somewhat easier and 3 by 60 even more easily -- the conversions between these problems involve dividing the multiplier by two while doubling the multiplicand in order to reduce the problem to one you can handle in your head.

The full algorithm, known as the Russian peasant multiplication algorithm, can be expressed as:

	times(ier, icand) =
		if ier = 0
			0
		else if even(ier)
			times(ier/2, icand*2)
		else
			times(ier/2, icand*2) + icand

This recursive formulation can be converted to iterative form:

	unsigned int times(unsigned int ier, unsigned int icand)
	{
	   unsigned int prod = 0;
	   while (ier != 0) {
	      if (ier & 1) prod = prod + icand;
	      ier = ier >> 1;     /* divide the multiplier by 2 */
	      icand = icand << 1; /* double the multiplicand */
	   }
	   return prod;
	}

In the above, we replaced multiplication and division by 2 with shift operators. The code, of course, only works for unsigned integers.

For a low-performance computer, we need to generalize the above algorithm to 2's complement numbers and then implement it in microcode or directly in a sequential control unit; this is exactly what Burks, Goldstine, and von Neumann proposed in their 1946 IAS report, and they also recognized that, for very low performance applications, this algorithm could just as well be implemented in software. For a high performance machine, we need to transform the above algorithm into something that can be efficiently implemented without reliance on microcode or other slow hardware!

Data Paths for Sequential Multiplication

The IAS proposal included the following data paths in support of multiplication, and these data paths were included, without much change, in almost all single-accumulator computer architectures of the 1950's and 1960's, including the DEC PDP-8, if you bought the optional extended arithmetic element.

	
	|     Multiplicand    |
	|_____________________|
                 __|   _________              ____________
             ___|_____|___      |      ______|______      |
            |             |     |     |             |     |
            | ALU/Shifter |->------->-|   Shifter   |     |
            |_____________|     |     |_____________|     |
                   |            |            |            |
	 __________|__________  |  __________|__________  |
	|                     | | |                     | |
	|>    Accumulator     | | |>Multiplier-Quotient | |
	|_____________________| | |_____________________| |
                   |____________|            |____________|

The multiply algorithm can be expressed in terms of this hardware as follows:

        unsigned int times(unsigned int ier, unsigned int icand)
        {
           unsigned int AC = 0;   /* the accumulator */
           unsigned int MQ = ier; /* the multiplier */
           for (i = 0; i < wordsize; i++) {
              if (MQ odd) {
                  AC|MQ = (AC + icand)|MQ >> 1
              } else {
                  AC|MQ = AC|MQ >> 1
              } endif
           }
           return MQ; /* the least significant word of the product */
        }

Here, the notation of a programming language such as C is deficient because it makes it difficult to deal with something that is very easy in hardware: The treatment of two one word variables as a two-word variable. This is indicated above using the vertical bar symbol as a concatenation operator, so AC|MQ indicates one double length quantity, half of which comes from or is destined for AC, and the other half comes from or is destined for MQ.

When applied to an N-bit word, this multiply operation actually produces a product that is 2N bits long, with the most significant half in AC, and the least significant half in MQ, replacing the multiplier that originally filled this register. Classical single accumulator machines simply exposed the programmer to this fact, allowing the programmer to keep one, the other, or both words of the result, depending on the needs of the algorithm. This option is lost in most modern high level languages, which insist that the product of 2 N-bit numbers is an N-bit number. Unfortunately, many modern machines follow the lead of such languages!
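The AC|MQ concatenation is easy to mimic in C with an explicit mask and shift. The following sketch (the function name and the 16-bit word size are illustrative choices) produces the full 2N-bit product:

```c
#include <stdint.h>

#define N    16
#define MASK 0xFFFFu

/* Shift-and-add multiply on the AC|MQ register pair.  After N steps, the
   most significant half of the product is in AC and the least significant
   half is in MQ, replacing the multiplier. */
void ias_multiply(uint32_t icand, uint32_t ier, uint32_t *ac, uint32_t *mq)
{
    uint32_t AC = 0, MQ = ier & MASK;
    for (int i = 0; i < N; i++) {
        if (MQ & 1)
            AC += icand & MASK;       /* AC briefly holds N+1 bits here */
        /* shift AC|MQ right one place as a single double-length value */
        MQ = ((MQ >> 1) | (AC << (N - 1))) & MASK;
        AC >>= 1;
    }
    *ac = AC;
    *mq = MQ;
}
```

Note how the conditional add and the double-length right shift happen in each iteration, just as they do in one clock cycle of the hardware.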

It is worth noting that the multiply loop can be rewritten as follows, using the C (if?then:else) notation for a conditional expression:

           for (i = 0; i < wordsize; i++) {
              AC|MQ = (odd(MQ) ? AC + icand : AC)|MQ >> 1;
           }

Few microcoded machines use a loop counter control register to control this loop. Instead, it is very common to simply unroll the loop the required number of times. Thus, for a 16 bit machine, we would replace the above loop with microcode equivalent to the following:

           AC|MQ = (odd(MQ) ? AC + icand : AC)|MQ >> 1;
           AC|MQ = (odd(MQ) ? AC + icand : AC)|MQ >> 1;
           AC|MQ = (odd(MQ) ? AC + icand : AC)|MQ >> 1;
           AC|MQ = (odd(MQ) ? AC + icand : AC)|MQ >> 1;

           AC|MQ = (odd(MQ) ? AC + icand : AC)|MQ >> 1;
           AC|MQ = (odd(MQ) ? AC + icand : AC)|MQ >> 1;
           AC|MQ = (odd(MQ) ? AC + icand : AC)|MQ >> 1;
           AC|MQ = (odd(MQ) ? AC + icand : AC)|MQ >> 1;

           AC|MQ = (odd(MQ) ? AC + icand : AC)|MQ >> 1;
           AC|MQ = (odd(MQ) ? AC + icand : AC)|MQ >> 1;
           AC|MQ = (odd(MQ) ? AC + icand : AC)|MQ >> 1;
           AC|MQ = (odd(MQ) ? AC + icand : AC)|MQ >> 1;

           AC|MQ = (odd(MQ) ? AC + icand : AC)|MQ >> 1;
           AC|MQ = (odd(MQ) ? AC + icand : AC)|MQ >> 1;
           AC|MQ = (odd(MQ) ? AC + icand : AC)|MQ >> 1;
           AC|MQ = (odd(MQ) ? AC + icand : AC)|MQ >> 1;

In microcode, this can be reduced to 16 consecutive identical microinstructions, differing only in their next address fields. Thus, there is no overhead for conditional evaluation or microcontrol, and this multiply operation can be executed in exactly 16 clock cycles.

Higher Radix Multiply

The first solution to be widely used to speed up the N clock-cycle multiply algorithm outlined above simply uses a higher radix. Consider, for example, the following idea:

                                 |
                   -----o-----o--o--o-----o-----o-----
           ___   _|_   _|_   _|_   _|_   _|_   _|_   _|_
	  | 0 | |A*1| |A*2| |A*3| |A*4| |A*5| |A*6| |A*7|
	  |___| |___| |___| |___| |___| |___| |___| |___|
            |     |     |     |_   _|     |     |     |
            |     |     |_____  | |  _____|     |     |
            |     |_________  | | | |  _________|     |
            |_____________  | | | | | |  _____________|
                         _|_|_|_|_|_|_|_|_
                     3  | 0 1 2 3 4 5 6 7 |
         multiplier -/--|       MUX       | 
                        |_________________| 
                                 |

This circuit fragment will multiply its input by a 3 bit multiplier. If we substitute this for the ALU in the original algorithm, we can do a 15 bit multiply in 5 clock cycles, where, in each cycle, the least significant 3 bits of the MQ register are used to compute a partial product for addition to the accumulator. Instead of using one-place shifts, the new multiplier uses 3-place shifts.

The above seems a bit complex, since it relies on separate combinational circuitry to compute what amounts to a table of partial products, but in fact, computing these is straightforward. Consider the following facts:

	a*0 =                 = 0    + 0
	a*1 = a               = 0    + a
	a*2 =     a<<1        = 0    + a<<1
	a*3 = a + a<<1        = a<<2 - a
	a*4 =            a<<2 = a<<2 + 0
	a*5 = a        + a<<2 = a<<2 + a
	a*6 =     a<<1 + a<<2 = a<<2 + a<<1
	a*7 = a + a<<1 + a<<2 = a<<3 - a

These can be computed with a single add/subtract unit and two 2-input multiplexors, so long as the multiplexors have enable inputs in addition to their select inputs (when not enabled, the multiplexor output is zero).

                                 |
                         -----o--o--o-----
                       _|_   _|_    |    _|_
	              |<<2| |<<3|   |   |<<1|
	              |___| |___|   |   |___|
                       _|_____|_   _|_____|_
                ASEL -| 0     1 | | 0     1 |- BSEL
                      |  MUX A  | |  MUX B  |
               AENAB -|_________| |_________|- BENAB
                          _|___________|_
                         | A           B |
              ADDSUB ----|    0: A+B     |
                         |    1: A-B     |
                         |_______________|
                                 |

Given 3 bits of the multiplier, we set the control inputs to the above circuitry as follows in order to create a circuit equivalent to the 8-input multiplexor version of the circuit.

	 M0 M1 M2 | ASEL AENAB BSEL BENAB ADDSUB
        ----------|-----------------------------
	  0  0  0 |   x    0     x    0     x
          0  0  1 |   x    0     0    1     0
          0  1  0 |   x    0     1    1     0
          0  1  1 |   0    1     0    1     1
          1  0  0 |   0    1     x    0     x
          1  0  1 |   0    1     0    1     0
          1  1  0 |   0    1     1    1     0
          1  1  1 |   1    1     0    1     1

	ASEL   = M0 and M1 and M2
	AENAB  = (M1 and M2) or M0
	BSEL   = M1 and not(M2)
	BENAB  = (M1 or M2)
	ADDSUB = (M1 and M2)
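Plugging these control equations into a behavioral sketch (in C; the function name is invented here) lets us check all eight digit values against direct multiplication. M0 is the most significant of the three multiplier bits, as in the truth table above:

```c
/* Compute a*m for a 3-bit multiplier digit m, using one add/subtract unit
   and two enabled multiplexors driven by the control equations above. */
unsigned radix8_partial(unsigned a, int m)
{
    int m0 = (m >> 2) & 1, m1 = (m >> 1) & 1, m2 = m & 1;

    int asel   = m0 & m1 & m2;
    int aenab  = (m1 & m2) | m0;
    int bsel   = m1 & !m2;
    int benab  = m1 | m2;
    int addsub = m1 & m2;

    unsigned A = aenab ? (asel ? a << 3 : a << 2) : 0; /* MUX A: a<<2 or a<<3 */
    unsigned B = benab ? (bsel ? a << 1 : a)      : 0; /* MUX B: a or a<<1 */

    return addsub ? A - B : A + B;
}
```

For example, m = 7 selects a<<3 minus a, giving 8a - a = 7a, exactly as the decomposition table promises.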

It is worth noting that the high end IBM System 360 computers of the late 1960's used radix 16 so that they could do a 32-bit by 32-bit full-word multiply in 8 clock cycles or a 16-bit by 16-bit half-word multiply in 4 clock cycles.

Despite its elegance, this scheme offers only a constant speedup; microcoded implementations are still called for, and the demand for very high performance computers appears to require faster multiply hardware. Also, of course, we are interested in the ultimate "big O" bounds on the computation.

Brute Force Parallel Multiplication

When we multiply two numbers, using the algorithm everyone learns for decimal multiplication in elementary school, we lay the work out as follows:

	           C   C   C   C   C    multipliCand
                    4   3   2   1   0

	         x         E   E   E    multipliEr
                            2   1   0
        -------------------------------
	           P   P   P   P   P
	            04  03  02  01  00
	       P   P   P   P   P        Partial products
	        14  13  12  11  10
	+  P   P   P   P   P
	    24  23  22  21  20
        -------------------------------
	   P   P   P   P   P   P   P    Product
            6   5   4   3   2   1   0

The product P is the sum of the partial products Pi, where the number of partial products equals the number of digits in the multiplier.

For radix 2, the digits of partial product Pi are zero if bit Ei of the multiplier is zero, and the partial product equals the multiplicand if Ei is one. That is

Pij = Ei and Cj

Aside from this trivial logic, all that is involved in multiplication is an array of adders! To multiply an N-bit multiplicand by an M-bit multiplier, we need M adders of N bits each. If each adder has an enable input to enable computation of either A+B or just A, we can construct a fully parallel multiplier as follows:

                             C     C     C     C     C
                          0   4 0   3 0   2 0   1 0   0
                         _|_/  _|_/  _|_/  _|_/  _|_/
                        |   |-|   |-|   |-|   |-|   |---- E
                     ---|___|-|___|-|___|-|___|-|___|-0    0
                   _|_ / _|_ / _|_ / _|_ / _|_ /  |
                  |   |-|   |-|   |-|   |-|   |---------- E
               ---|___|-|___|-|___|-|___|-|___|-0 |        1
             _|_ / _|_ / _|_ / _|_ / _|_ /  |     |
            |   |-|   |-|   |-|   |-|   |---------------- E
         ---|___|-|___|-|___|-|___|-|___|-0 |     |        2
        |     |     |     |     |     |     |     |
        P     P     P     P     P     P     P     P
         7     6     5     4     3     2     1     0

Each cell of this picture contains the following:
                          |   /
                        __|  o
                       |    /|
                 ------|-----|----o---
                      _|_____|_   |
	             | A     B |  |
	             |    BENAB|--
	             |    +    |
	         ----|Cout  Cin|------
                     |_________|     
                          |
                   /      |
                  /       |

If each of these units is a classical full adder with an added enable input, the total multiply time for an N-bit by N-bit multiply is 2N times the propagation delay of the full adder. That is, O(N), and we did this with O(N^2) hardware.
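Behaviorally, each row of the array either adds a shifted copy of the multiplicand into the running sum or passes the sum through unchanged. A C sketch of this view (names are illustrative):

```c
/* Brute-force array multiplier, viewed behaviorally: row i of adders adds
   (icand << i) into the running sum when multiplier bit Ei is one. */
unsigned array_multiply(unsigned icand, unsigned ier, int m)
{
    unsigned p = 0;
    for (int i = 0; i < m; i++)
        if ((ier >> i) & 1)    /* Pij = Ei and Cj gates the whole row */
            p += icand << i;
    return p;
}
```

In hardware, of course, all m rows exist side by side and the loop disappears; only the ripple of sums and carries through the array remains.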

A Really Fast Parallel Multiplier!

We know that we can use carry lookahead logic to take the brute force design above and convert each step from O(N) time to O(log N) time, but we still need N steps to add the partial sums, so this gives us an add time of O(N log N), which is not an improvement over the original O(N)!

The question is, if we have N partial sums to compute, can we do this faster? The solution rests on the following idea:

(((A + B) + C) + D) = (A + B) + (C + D)

No matter how we add N terms, it will take N-1 2-input adders, which is O(N). However, we can arrange these in two distinct ways; one adds the terms successively in O(N) time, while the other forms a binary tree of adders, adding them in O(log N) time.

This is what we will do with our partial products! The result is a multiplier that uses O(N^2 log N) hardware and computes the product in O(log N) time!

If we use carry-save adders as suggested previously, we can defer all of the carries down to the last stage, and only use the lookahead idea once at the end. The result requires O(N^2) hardware and O(log N) time to multiply 2 N-bit numbers. The constant factors are about twice those for addition.
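The tree of carry-save adders can be sketched as repeated 3:2 reduction layers, with one ordinary C addition at the end standing in for the final carry-lookahead stage. This is an illustrative model, not the notes' own construction; the function name is invented:

```c
/* Reduce n operand words to two with layers of 3:2 carry-save adders,
   then resolve them with one final carry-propagating addition. */
unsigned csa_tree_sum(unsigned v[], int n)
{
    while (n > 2) {
        int m = 0;
        for (int i = 0; i + 2 < n; i += 3) {  /* each CSA: 3 words in, 2 out */
            unsigned a = v[i], b = v[i + 1], c = v[i + 2];
            v[m++] = a ^ b ^ c;
            v[m++] = ((a & b) | (a & c) | (b & c)) << 1;
        }
        for (int i = n - n % 3; i < n; i++)   /* leftover words pass through */
            v[m++] = v[i];
        n = m;
    }
    return (n == 2) ? v[0] + v[1] : v[0];     /* the only propagating add */
}
```

Each pass shrinks the operand count by about a third, so the number of layers is O(log N), and only the last addition pays the carry-propagation cost.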

All of this was known by the early 1970's, and the Texas Instruments TTL handbook from 1973 includes nearly complete (but cryptic) details on how to build such high performance multipliers.

The question remains, however, whether it is economical to expend O(N2) hardware on multiplication. The answer, as it turns out, is that it is frequently not justified. The widespread success of minicomputers and low-end microprocessors that lacked multiplication instructions strongly hints at this, as does the fact that even on todays Pentium family, the run-times for integer multiplication strongly suggest that the older iterative schemes are still in use. It was not until the 1980's that this question was formally investigated.