22C:122/55:132, Notes, Lecture 14, Spring 2001

Douglas W. Jones
University of Iowa Department of Computer Science

Addition

Given two numbers to be added, A and B, where A₀ is the least significant bit of A, we add them as follows:

	     A  addend
	   + B  augend
          -----
             S  sum

Usually, we will treat the terms addend and augend as synonyms, since addition is commutative. In each bit position of the sum, we can identify the following:

	     C   carry into position i
	      i
	     A   addend bit i
	      i
	  +  B   augend bit i
	      i
        --------
	 C   S   carry out of bit position i
	  i+1 i  and bit i of the sum.

Here, C_i is both the carry into bit position i and the carry out of bit position i-1. These variables are related by the following truth table:

	 A  B  C  |  S  C
          i  i  i |   i  i+1
        ----------|---------
         0  0  0  |  0  0  
         0  0  1  |  1  0  
         0  1  0  |  1  0  
         0  1  1  |  0  1  
         1  0  0  |  1  0  
         1  0  1  |  0  1  
         1  1  0  |  0  1  
         1  1  1  |  1  1

In hardware, we may build a 1-bit adder as follows:

S_i = (A_i xor B_i ) xor C_i

C_i+1 = (A_i and B_i ) or (A_i and C_i ) or (B_i and C_i ) If we chain these together, we get a ripple-carry adder:

          A  B                  A  B          A  B      C
           n  n                  1  1          0  0      in
          |  |                  |  |          |  |      |
          |  |   ---            |  |   ---    |  |   ---
         _|__|__|_  |       |  _|__|__|_  |  _|__|__|_
	| A  B  C | |       | | A  B  C | | | A  B  C |
	|    +    | |  ...  | |    +    | | |    +    |
	|__C___S__| |       | |__C___S__| | |__C___S__|
           |   |    |       |    |   |    |    |   |
         --    |             ----    |     ----    |
        |      |                     |             |
        C      S                     S             S
         out    n                     1             0

The fundamental problem with this design is that the time it takes to add 2 n-bit numbers is O(n). Typically, this design is close to optimal for n<4, but for n=16, it begins to become a serious bottleneck, and for n=32 or n=64, it is intolerable.

Bulk Carry Properties

For any substring of digit positions within a sum, we can ask, for that substring, do the values of the operands within that substring cause it to generate a carry, and do the values of the operands within that substring cause it to propagate carry in to carry out

	       A   A  |A   A   A | A
                5   4 | 3   2   1|  0
	     + B   B  |B   B   B | B
                5   4 | 3   2   1|  0
	     ---------|----------|---
	       S   S  |S   S   S | S
                5   4 | 3   2   1|  0
                       \________/
                          [3:1]

In the above, we have singled out bits 3 to 1 of a sum, and we ask for P_[3:1] -- are the values of A and B in bit positions 3 to 1 such that C₄ is implied by C₁?

P_[3:1] = (A₃ or B₃) and (A₂ or B₂) and (A₁ or B₁)

Similarly, we can ask for G_[3:1] -- are the values of of A and B in bit positions 3 to 1 such that they imply C₄ without knowledge of C₁?

G_[3:1] = (A₃ and B₃) or (A₂ and B₂ and P_[3]) or (A₁ and B₁ and P_[3:2])

Our goal is to use the above information in a divide and conquer approach to fast addition. To do this, we want to build a tree with the adders for each bit position at the leaves and the tree body performing the carry propagation. To do this, we need a way to compute P_[a:c] when given P_[a:b] and P_[b:c]. This is remarkably easy!

P_[a:c] = P_[a:b] and P_[b:c]

We also need a way to compute G_[a:c] when given G_[a:b] and G_[b:c]. This is remarkably easy!

G_[a:c] = G_[a:b] or (G_[b:c] and P_[a:b])

Now, assuming the word size is 2ⁿ, we can build a complete balanced binary tree with adders for the leaves, where each internal node of the tree operates as above, with one additional detail.

We must, at each internal node, accept a carry in from the least significant side and, depending on the propagate signals, deliver this to the appropriate subtrees. With this change, some internal node of the carry tree will generate carry in to each stage of the adder, so no adder stage needs to compute carry out! The result is as follows:

          A  B                  A  B          A  B      C
           n  n                  1  1          0  0      in
          |  |                  |  |          |  |      |
          |  |   ---            |  |   ---    |  |   ---o
         _|__|__|_  |          _|__|__|_  |  _|__|__|_  |
	| A  B  C | |         | A  B  C | | | A  B  C | |
	|    +    | |  ...    |    +    | | |    +    | |
	|_G_P__S__| |         |_G_P__S__| | |_G_P__S__| |
          | |  |    |           | |  |    |   | |  |    |
          | |  S                | |  S    |   | |  S    |
          | |   n               | |   1   |   | |   0   |
          | |                   | |       |  -  |       |
                                |  -----  | |  -        |
                                 -----  | | | |         |
                                     _|_|_|_|_|_        |
               ...                  | G P C G P |       |
                                    |  1 1 1 0 0|       |
                            -----   |         C |-------o
                                 |  |____G_P___0|       |
                                 |       | |            |
                        -------  |  -----  |            |
                     --------  | | |  -----             |
                            _|_|_|_|_|_                 |
                           | G P C G P |                |
                           |  1 1 1 0 0|                |
                           |         C |----------------o
                           |____G_P___0|                |
                    ___         | |                     |
         C   ------|OR |-------- _|_                    |
          out      |___|--------|AND|-------------------
                                |___|

We now have a scheme that gives us carry propagation from C_in to C_out in O(log₂ n) time for an n bit sum, while requiring O(n log₂ n) hardware for carry propagation. The additional hardware cost to move from O(n) to O(n log₂ n) hardware is almost always justified by the speed improvement from O(n) to O(log₂ n).

Essentially all computers with word sizes greater than 8 bits have used this approach to fast carry since the early 1970's.

The adders at the leaves are:

	                ___
	  A ---o-------|XOR|   ___
	  B ---|-o-----|___|--|XOR|---- S
	  C ---|-|------------|___|
               | |      ___
               | o-----|AND|----------- G
               o-|-----|___|
               | |      ___
               |  -----|OR |----------- P
                -------|___|

The internal nodes of the carry lookahead tree are:

   
	                ___
          C -----------|AND|   ___
           0     ------|___|--|OR |---- C
          G ----|---o---------|___|      1
           0    |   |   ___
          P ----o    --|AND|   ___
           0    |    --|___|--|OR |---- G
          G ----|---|---------|___|
           1    |   |   ___
          P ----|---o--|AND|----------- P
           1     ------|___|

This is the scheme used on most modern computers, except that the we use a 4-way tree instead of a binary tree. The 4-way tree is used because, for most digital logic circuits used in VLSI design, fanin and fanout on the order of 4 is easily done, while higher fanouts require higher power levels. So, a 16-bit adder requires four 4-bit lookahead units connecting the leaves, plus one more to coordinate them, and by adding one more level to the tree, we get a 64-bit fast adder.

Adding functionality to the ALU

The fast adder outlined above only performs one function, but it may easily be extended to other functions as follows. Note that, at the core of each adder, there is a simple function A xor B xor C. The truth table for this is:

	 A B C |  S
	-------|-----
	 0 0 0 |  0
	 0 0 1 |  1
	 0 1 0 |  1
	 0 1 1 |  0
	 1 0 0 |  1
	 1 0 1 |  0
	 1 1 0 |  0
	 1 1 1 |  1

Mechanical reduction of this to a sum-of-products logic circuit requires 3 inverters and 5 nand gates:

          A     B     C
          |     |     |
          o--   o--   o--
          | _|_ | _|_ | _|_
          | \ / | \ / | \ /
          |  O  |  O  |  O
          |  |  |  |  |  |  ___
         -|--o--|--|--|--|-|   |
         -|--|--|--o--|--|-|AND|O----
         -|--|--|--|--o--|-|___|     |
          |  |  |  |  |  |  ___      |
         -|--o--|--|--|--|-|   |     |
         -|--|--o--|--|--|-|AND|O--  |
         -|--|--|--|--|--o-|___|   |  ---|___
          |  |  |  |  |  |  ___     -----|   |
         -o--|--|--|--|--|-|   |         |AND|O--- S
         -|--|--|--o--|--|-|AND|O--------|___|
         -|--|--|--|--|--o-|___|       --|
          |  |  |  |  |  |  ___       |
         -o--|--|--|--|--|-|   |      |
         -|--|--o--|--|--|-|AND|O----- 
         -|--|--|--|--o--|-|___|
          |  |  |  |  |  |

If we expand this circuit by adding a few more gates and a few more inputs, we get the heart of a useful ALU:

          A    B          C
           i    i          i
          |    |  --------|------------- NB not B
          |    | |        |  ------o---- CE carry enable
          |    | |        | |      | 
          |    |_|        o------  |
          |   |XOR|       |_|    |_|
          |   |___|      |AND|  |AND|
          |     |        |___|  |___|
          o--   o--        |      O  
          | _|_ | _|_  ----       | 
          | \ / | \ / |   --------
          |  O  |  O  |  |  ------------ XE xor enable
          |  |  |  |  |  | |  ___
         -|--o--|--|--|--|-|-|   |
         -|--|--|--o--|--|-|-|AND|O----
         -|--|--|--|--o--|-|-|___|     |
          |  |  |  |  |  | |           |
          |  |  |  |  |  | o-|___      |
         -|--o--|--|--|--|-|-|   |     |
         -|--|--o--|--|--|-|-|AND|O--  |
         -|--|--|--|--|--o-|-|___|   | |
          |  |  |  |  |  | |         | |
          |  |  |  |  |  |  -|___    |  -|___
         -o--|--|--|--|--|---|   |    ---|   |
         -|--|--|--o--|--|---|AND|O------|AND|O--- S
         -|--|--|--|--|--o---|___|   ----|___|
          |  |  |  |  |  |    ___   |  --|
         -o--|--|--|--|--|---|   |  | |
         -|--|--o--|--|--|---|AND|O-  |  
         -|--|--|--|--o--|---|___|    |  
          |  |  |  |  |  |    ___     |  
         -o--|--|--|---------|   |    |  
         -|--|--o--|---------|AND|O---   
          |  |  |  |        -|___|        
                           |
                            ------------ AE and enable

We now have the following functions:

	 NB CE XE AE | function
	-------------|----------
	  0  1  1  0 | S = A + B (the normal add function)
	  1  1  1  0 | S = A - B (with carry = not borrow)
	  0  0  1  0 | S = A xor B
	  0  0  1  1 | S = A or B
	  0  0  0  1 | S = A and B

Many other functions are possible! Most notably, we can add extra enable inputs on the NAND gates in the column of gates, plus extra rows (with a corresponding widening of the final nand gate) to add functions such as left and right shift to the ALU.

Naive Multiplication
Given two numbers to be multiplied, A and B, where A₀ is the least significant bit of A, we add them as follows:
```
	     A  multiplicand (icand for short)
	   + B  multiplier   (ier for short)
          -----
             S  product      (prod for short)
	
```
Usually, we are used to treating the terms multiplier and and multiplicand as synonyms, since multiplication is commutative, but the algorithms for multiplication treat these in very different ways!
The classic multiplication algorithm used on computers derives directly from an algorithm people frequently do in their heads. As an example, consider multiplying 12 by 15 mentally. This is hard, but you can multiply 6 by 30 somewhat easier and 3 by 60 even more easily -- the conversions between these problems involve dividing the multiplier by two while doubling the multiplicand in order to reduce the problem to one you can handle in your head.
The full algorithm, known as the Russian peasant multiplication algorithm, can be expressed as:
```
	times(ier, icand) =
		if even(ier)
			times(ier/2, icand*2)
		else
			times(ier/2, icand*2) + icand
	
```
This recursive formulation can be converted to iterative form:
```
	unsigned int times(unsigned int ier, unsigned int icand)
	{
	   unsigned int prod = 0;
	   while (ier != 0) {
	      if (ier & 1) prod = prod + icand;
	      ier = ier >> 1;           -- /2
	      icand = icand << 1; -- *2
	   }
	   return prod;
	}
	
```
In the above, we replaced multiplication and division by 2 by shift operators. The above, of course, only works for unsigned integers.
For a low-performance computer, we need to generalize the above algorithm to 2's complement numbers and then implement it in microcode or, for very low performance applications, leave it to the user to implement the above code directly in software. For a high performance machine, we need to transform the above algorithm into something that can be efficiently implemented without reliance on microcode or other slow hardware.