# 22C:122/55:132, Notes, Lecture 14, Spring 2001

Douglas W. Jones
University of Iowa Department of Computer Science

Given two numbers to be added, A and B, where A0 is the least significant bit of A, we add them as follows:

```	     A  addend
+ B  augend
-----
S  sum
```
Usually, we will treat the terms addend and augend as synonyms, since addition is commutative. In each bit position of the sum, we can identify the following:
```	     C   carry into position i
i
i
+  B   augend bit i
i
--------
C   S   carry out of bit position i
i+1 i  and bit i of the sum.
```
Here, Ci is both the carry into bit position i and the carry out of bit position i-1. These variables are related by the following truth table:
```	 A  B  C  |  S  C
i  i  i |   i  i+1
----------|---------
0  0  0  |  0  0
0  0  1  |  1  0
0  1  0  |  1  0
0  1  1  |  0  1
1  0  0  |  1  0
1  0  1  |  0  1
1  1  0  |  0  1
1  1  1  |  1  1
```
In hardware, we may build a 1-bit adder as follows:

Si = (Ai xor Bi ) xor Ci

Ci+1 = (Ai and Bi ) or (Ai and Ci ) or (Bi and Ci ) If we chain these together, we get a ripple-carry adder:

```          A  B                  A  B          A  B      C
n  n                  1  1          0  0      in
|  |                  |  |          |  |      |
|  |   ---            |  |   ---    |  |   ---
_|__|__|_  |       |  _|__|__|_  |  _|__|__|_
| A  B  C | |       | | A  B  C | | | A  B  C |
|    +    | |  ...  | |    +    | | |    +    |
|__C___S__| |       | |__C___S__| | |__C___S__|
|   |    |       |    |   |    |    |   |
--    |             ----    |     ----    |
|      |                     |             |
C      S                     S             S
out    n                     1             0
```
The fundamental problem with this design is that the time it takes to add 2 n-bit numbers is O(n). Typically, this design is close to optimal for n<4, but for n=16, it begins to become a serious bottleneck, and for n=32 or n=64, it is intolerable.

2. Bulk Carry Properties

For any substring of digit positions within a sum, we can ask, for that substring, do the values of the operands within that substring cause it to generate a carry, and do the values of the operands within that substring cause it to propagate carry in to carry out

```	       A   A  |A   A   A | A
5   4 | 3   2   1|  0
+ B   B  |B   B   B | B
5   4 | 3   2   1|  0
---------|----------|---
S   S  |S   S   S | S
5   4 | 3   2   1|  0
\________/
[3:1]
```
In the above, we have singled out bits 3 to 1 of a sum, and we ask for P[3:1] -- are the values of A and B in bit positions 3 to 1 such that C4 is implied by C1?

P[3:1] = (A3 or B3) and (A2 or B2) and (A1 or B1)

Similarly, we can ask for G[3:1] -- are the values of of A and B in bit positions 3 to 1 such that they imply C4 without knowledge of C1?

G[3:1] = (A3 and B3) or (A2 and B2 and P) or (A1 and B1 and P[3:2])

Our goal is to use the above information in a divide and conquer approach to fast addition. To do this, we want to build a tree with the adders for each bit position at the leaves and the tree body performing the carry propagation. To do this, we need a way to compute P[a:c] when given P[a:b] and P[b:c]. This is remarkably easy!

P[a:c] = P[a:b] and P[b:c]

We also need a way to compute G[a:c] when given G[a:b] and G[b:c]. This is remarkably easy!

G[a:c] = G[a:b] or (G[b:c] and P[a:b])

Now, assuming the word size is 2n, we can build a complete balanced binary tree with adders for the leaves, where each internal node of the tree operates as above, with one additional detail.

We must, at each internal node, accept a carry in from the least significant side and, depending on the propagate signals, deliver this to the appropriate subtrees. With this change, some internal node of the carry tree will generate carry in to each stage of the adder, so no adder stage needs to compute carry out! The result is as follows:

```          A  B                  A  B          A  B      C
n  n                  1  1          0  0      in
|  |                  |  |          |  |      |
|  |   ---            |  |   ---    |  |   ---o
_|__|__|_  |          _|__|__|_  |  _|__|__|_  |
| A  B  C | |         | A  B  C | | | A  B  C | |
|    +    | |  ...    |    +    | | |    +    | |
|_G_P__S__| |         |_G_P__S__| | |_G_P__S__| |
| |  |    |           | |  |    |   | |  |    |
| |  S                | |  S    |   | |  S    |
| |   n               | |   1   |   | |   0   |
| |                   | |       |  -  |       |
|  -----  | |  -        |
-----  | | | |         |
_|_|_|_|_|_        |
...                  | G P C G P |       |
|  1 1 1 0 0|       |
-----   |         C |-------o
|  |____G_P___0|       |
|       | |            |
-------  |  -----  |            |
--------  | | |  -----             |
_|_|_|_|_|_                 |
| G P C G P |                |
|  1 1 1 0 0|                |
|         C |----------------o
|____G_P___0|                |
___         | |                     |
C   ------|OR |-------- _|_                    |
out      |___|--------|AND|-------------------
|___|
```
We now have a scheme that gives us carry propagation from Cin to Cout in O(log2 n) time for an n bit sum, while requiring O(n log2 n) hardware for carry propagation. The additional hardware cost to move from O(n) to O(n log2 n) hardware is almost always justified by the speed improvement from O(n) to O(log2 n).

Essentially all computers with word sizes greater than 8 bits have used this approach to fast carry since the early 1970's.

The adders at the leaves are:

```	                ___
A ---o-------|XOR|   ___
B ---|-o-----|___|--|XOR|---- S
C ---|-|------------|___|
| |      ___
| o-----|AND|----------- G
o-|-----|___|
| |      ___
|  -----|OR |----------- P
-------|___|
```
The internal nodes of the carry lookahead tree are:
```
___
C -----------|AND|   ___
0     ------|___|--|OR |---- C
G ----|---o---------|___|      1
0    |   |   ___
P ----o    --|AND|   ___
0    |    --|___|--|OR |---- G
G ----|---|---------|___|
1    |   |   ___
P ----|---o--|AND|----------- P
1     ------|___|
```
This is the scheme used on most modern computers, except that the we use a 4-way tree instead of a binary tree. The 4-way tree is used because, for most digital logic circuits used in VLSI design, fanin and fanout on the order of 4 is easily done, while higher fanouts require higher power levels. So, a 16-bit adder requires four 4-bit lookahead units connecting the leaves, plus one more to coordinate them, and by adding one more level to the tree, we get a 64-bit fast adder.

3. Adding functionality to the ALU

The fast adder outlined above only performs one function, but it may easily be extended to other functions as follows. Note that, at the core of each adder, there is a simple function A xor B xor C. The truth table for this is:

```	 A B C |  S
-------|-----
0 0 0 |  0
0 0 1 |  1
0 1 0 |  1
0 1 1 |  0
1 0 0 |  1
1 0 1 |  0
1 1 0 |  0
1 1 1 |  1
```
Mechanical reduction of this to a sum-of-products logic circuit requires 3 inverters and 5 nand gates:
```          A     B     C
|     |     |
o--   o--   o--
| _|_ | _|_ | _|_
| \ / | \ / | \ /
|  O  |  O  |  O
|  |  |  |  |  |  ___
-|--o--|--|--|--|-|   |
-|--|--|--o--|--|-|AND|O----
-|--|--|--|--o--|-|___|     |
|  |  |  |  |  |  ___      |
-|--o--|--|--|--|-|   |     |
-|--|--o--|--|--|-|AND|O--  |
-|--|--|--|--|--o-|___|   |  ---|___
|  |  |  |  |  |  ___     -----|   |
-o--|--|--|--|--|-|   |         |AND|O--- S
-|--|--|--o--|--|-|AND|O--------|___|
-|--|--|--|--|--o-|___|       --|
|  |  |  |  |  |  ___       |
-o--|--|--|--|--|-|   |      |
-|--|--o--|--|--|-|AND|O-----
-|--|--|--|--o--|-|___|
|  |  |  |  |  |
```
If we expand this circuit by adding a few more gates and a few more inputs, we get the heart of a useful ALU:
```          A    B          C
i    i          i
|    |  --------|------------- NB not B
|    | |        |  ------o---- CE carry enable
|    | |        | |      |
|    |_|        o------  |
|   |XOR|       |_|    |_|
|   |___|      |AND|  |AND|
|     |        |___|  |___|
o--   o--        |      O
| _|_ | _|_  ----       |
| \ / | \ / |   --------
|  O  |  O  |  |  ------------ XE xor enable
|  |  |  |  |  | |  ___
-|--o--|--|--|--|-|-|   |
-|--|--|--o--|--|-|-|AND|O----
-|--|--|--|--o--|-|-|___|     |
|  |  |  |  |  | |           |
|  |  |  |  |  | o-|___      |
-|--o--|--|--|--|-|-|   |     |
-|--|--o--|--|--|-|-|AND|O--  |
-|--|--|--|--|--o-|-|___|   | |
|  |  |  |  |  | |         | |
|  |  |  |  |  |  -|___    |  -|___
-o--|--|--|--|--|---|   |    ---|   |
-|--|--|--o--|--|---|AND|O------|AND|O--- S
-|--|--|--|--|--o---|___|   ----|___|
|  |  |  |  |  |    ___   |  --|
-o--|--|--|--|--|---|   |  | |
-|--|--o--|--|--|---|AND|O-  |
-|--|--|--|--o--|---|___|    |
|  |  |  |  |  |    ___     |
-o--|--|--|---------|   |    |
-|--|--o--|---------|AND|O---
|  |  |  |        -|___|
|
------------ AE and enable
```
We now have the following functions:
```	 NB CE XE AE | function
-------------|----------
0  1  1  0 | S = A + B (the normal add function)
1  1  1  0 | S = A - B (with carry = not borrow)
0  0  1  0 | S = A xor B
0  0  1  1 | S = A or B
0  0  0  1 | S = A and B
```
Many other functions are possible! Most notably, we can add extra enable inputs on the NAND gates in the column of gates, plus extra rows (with a corresponding widening of the final nand gate) to add functions such as left and right shift to the ALU.

4. Naive Multiplication

Given two numbers to be multiplied, A and B, where A0 is the least significant bit of A, we add them as follows:

```	     A  multiplicand (icand for short)
+ B  multiplier   (ier for short)
-----
S  product      (prod for short)
```
Usually, we are used to treating the terms multiplier and and multiplicand as synonyms, since multiplication is commutative, but the algorithms for multiplication treat these in very different ways!

The classic multiplication algorithm used on computers derives directly from an algorithm people frequently do in their heads. As an example, consider multiplying 12 by 15 mentally. This is hard, but you can multiply 6 by 30 somewhat easier and 3 by 60 even more easily -- the conversions between these problems involve dividing the multiplier by two while doubling the multiplicand in order to reduce the problem to one you can handle in your head.

The full algorithm, known as the Russian peasant multiplication algorithm, can be expressed as:

```	times(ier, icand) =
if even(ier)
times(ier/2, icand*2)
else
times(ier/2, icand*2) + icand
```
This recursive formulation can be converted to iterative form:
```	unsigned int times(unsigned int ier, unsigned int icand)
{
unsigned int prod = 0;
while (ier != 0) {
if (ier & 1) prod = prod + icand;
ier = ier >> 1;           -- /2
icand = icand << 1; -- *2
}
return prod;
}
```
In the above, we replaced multiplication and division by 2 by shift operators. The above, of course, only works for unsigned integers.

For a low-performance computer, we need to generalize the above algorithm to 2's complement numbers and then implement it in microcode or, for very low performance applications, leave it to the user to implement the above code directly in software. For a high performance machine, we need to transform the above algorithm into something that can be efficiently implemented without reliance on microcode or other slow hardware.