Given two numbers to be added, A and B, where A_{0} is the least significant bit of A, we add them as follows:
A addend + B augend ----- S sumUsually, we will treat the terms addend and augend as synonyms, since addition is commutative. In each bit position of the sum, we can identify the following:
C carry into position i i A addend bit i i + B augend bit i i -------- C S carry out of bit position i i+1 i and bit i of the sum.Here, C_{i} is both the carry into bit position i and the carry out of bit position i-1. These variables are related by the following truth table:
A B C | S C i i i | i i+1 ----------|--------- 0 0 0 | 0 0 0 0 1 | 1 0 0 1 0 | 1 0 0 1 1 | 0 1 1 0 0 | 1 0 1 0 1 | 0 1 1 1 0 | 0 1 1 1 1 | 1 1In hardware, we may build a 1-bit adder as follows:
S_{i} = (A_{i} xor B_{i} ) xor C_{i}
C_{i+1} = (A_{i} and B_{i} ) or (A_{i} and C_{i} ) or (B_{i} and C_{i} ) If we chain these together, we get a ripple-carry adder:
A B A B A B C n n 1 1 0 0 in | | | | | | | | | --- | | --- | | --- _|__|__|_ | | _|__|__|_ | _|__|__|_ | A B C | | | | A B C | | | A B C | | + | | ... | | + | | | + | |__C___S__| | | |__C___S__| | |__C___S__| | | | | | | | | | -- | ---- | ---- | | | | | C S S S out n 1 0The fundamental problem with this design is that the time it takes to add 2 n-bit numbers is O(n). Typically, this design is close to optimal for n<4, but for n=16, it begins to become a serious bottleneck, and for n=32 or n=64, it is intolerable.
For any substring of digit positions within a sum, we can ask, for that substring, do the values of the operands within that substring cause it to generate a carry, and do the values of the operands within that substring cause it to propagate carry in to carry out
A A |A A A | A 5 4 | 3 2 1| 0 + B B |B B B | B 5 4 | 3 2 1| 0 ---------|----------|--- S S |S S S | S 5 4 | 3 2 1| 0 \________/ [3:1]In the above, we have singled out bits 3 to 1 of a sum, and we ask for P_{[3:1]} -- are the values of A and B in bit positions 3 to 1 such that C_{4} is implied by C_{1}?
P_{[3:1]} = (A_{3} or B_{3}) and (A_{2} or B_{2}) and (A_{1} or B_{1})
Similarly, we can ask for G_{[3:1]} -- are the values of of A and B in bit positions 3 to 1 such that they imply C_{4} without knowledge of C_{1}?
G_{[3:1]} = (A_{3} and B_{3}) or (A_{2} and B_{2} and P_{[3]}) or (A_{1} and B_{1} and P_{[3:2]})
Our goal is to use the above information in a divide and conquer approach to fast addition. To do this, we want to build a tree with the adders for each bit position at the leaves and the tree body performing the carry propagation. To do this, we need a way to compute P_{[a:c]} when given P_{[a:b]} and P_{[b:c]}. This is remarkably easy!
P_{[a:c]} = P_{[a:b]} and P_{[b:c]}
We also need a way to compute G_{[a:c]} when given G_{[a:b]} and G_{[b:c]}. This is remarkably easy!
G_{[a:c]} = G_{[a:b]} or (G_{[b:c]} and P_{[a:b]})
Now, assuming the word size is 2^{n}, we can build a complete balanced binary tree with adders for the leaves, where each internal node of the tree operates as above, with one additional detail.
We must, at each internal node, accept a carry in from the least significant side and, depending on the propagate signals, deliver this to the appropriate subtrees. With this change, some internal node of the carry tree will generate carry in to each stage of the adder, so no adder stage needs to compute carry out! The result is as follows:
A B A B A B C n n 1 1 0 0 in | | | | | | | | | --- | | --- | | ---o _|__|__|_ | _|__|__|_ | _|__|__|_ | | A B C | | | A B C | | | A B C | | | + | | ... | + | | | + | | |_G_P__S__| | |_G_P__S__| | |_G_P__S__| | | | | | | | | | | | | | | | S | | S | | | S | | | n | | 1 | | | 0 | | | | | | - | | | ----- | | - | ----- | | | | | _|_|_|_|_|_ | ... | G P C G P | | | 1 1 1 0 0| | ----- | C |-------o | |____G_P___0| | | | | | ------- | ----- | | -------- | | | ----- | _|_|_|_|_|_ | | G P C G P | | | 1 1 1 0 0| | | C |----------------o |____G_P___0| | ___ | | | C ------|OR |-------- _|_ | out |___|--------|AND|------------------- |___|We now have a scheme that gives us carry propagation from C_{in} to C_{out} in O(log_{2} n) time for an n bit sum, while requiring O(n log_{2} n) hardware for carry propagation. The additional hardware cost to move from O(n) to O(n log_{2} n) hardware is almost always justified by the speed improvement from O(n) to O(log_{2} n).
Essentially all computers with word sizes greater than 8 bits have used this approach to fast carry since the early 1970's.
The adders at the leaves are:
___ A ---o-------|XOR| ___ B ---|-o-----|___|--|XOR|---- S C ---|-|------------|___| | | ___ | o-----|AND|----------- G o-|-----|___| | | ___ | -----|OR |----------- P -------|___|The internal nodes of the carry lookahead tree are:
___ C -----------|AND| ___ 0 ------|___|--|OR |---- C G ----|---o---------|___| 1 0 | | ___ P ----o --|AND| ___ 0 | --|___|--|OR |---- G G ----|---|---------|___| 1 | | ___ P ----|---o--|AND|----------- P 1 ------|___|This is the scheme used on most modern computers, except that the we use a 4-way tree instead of a binary tree. The 4-way tree is used because, for most digital logic circuits used in VLSI design, fanin and fanout on the order of 4 is easily done, while higher fanouts require higher power levels. So, a 16-bit adder requires four 4-bit lookahead units connecting the leaves, plus one more to coordinate them, and by adding one more level to the tree, we get a 64-bit fast adder.
The fast adder outlined above only performs one function, but it may easily be extended to other functions as follows. Note that, at the core of each adder, there is a simple function A xor B xor C. The truth table for this is:
A B C | S -------|----- 0 0 0 | 0 0 0 1 | 1 0 1 0 | 1 0 1 1 | 0 1 0 0 | 1 1 0 1 | 0 1 1 0 | 0 1 1 1 | 1Mechanical reduction of this to a sum-of-products logic circuit requires 3 inverters and 5 nand gates:
A B C | | | o-- o-- o-- | _|_ | _|_ | _|_ | \ / | \ / | \ / | O | O | O | | | | | | ___ -|--o--|--|--|--|-| | -|--|--|--o--|--|-|AND|O---- -|--|--|--|--o--|-|___| | | | | | | | ___ | -|--o--|--|--|--|-| | | -|--|--o--|--|--|-|AND|O-- | -|--|--|--|--|--o-|___| | ---|___ | | | | | | ___ -----| | -o--|--|--|--|--|-| | |AND|O--- S -|--|--|--o--|--|-|AND|O--------|___| -|--|--|--|--|--o-|___| --| | | | | | | ___ | -o--|--|--|--|--|-| | | -|--|--o--|--|--|-|AND|O----- -|--|--|--|--o--|-|___| | | | | | |If we expand this circuit by adding a few more gates and a few more inputs, we get the heart of a useful ALU:
A B C i i i | | --------|------------- NB not B | | | | ------o---- CE carry enable | | | | | | | |_| o------ | | |XOR| |_| |_| | |___| |AND| |AND| | | |___| |___| o-- o-- | O | _|_ | _|_ ---- | | \ / | \ / | -------- | O | O | | ------------ XE xor enable | | | | | | | ___ -|--o--|--|--|--|-|-| | -|--|--|--o--|--|-|-|AND|O---- -|--|--|--|--o--|-|-|___| | | | | | | | | | | | | | | | o-|___ | -|--o--|--|--|--|-|-| | | -|--|--o--|--|--|-|-|AND|O-- | -|--|--|--|--|--o-|-|___| | | | | | | | | | | | | | | | | | -|___ | -|___ -o--|--|--|--|--|---| | ---| | -|--|--|--o--|--|---|AND|O------|AND|O--- S -|--|--|--|--|--o---|___| ----|___| | | | | | | ___ | --| -o--|--|--|--|--|---| | | | -|--|--o--|--|--|---|AND|O- | -|--|--|--|--o--|---|___| | | | | | | | ___ | -o--|--|--|---------| | | -|--|--o--|---------|AND|O--- | | | | -|___| | ------------ AE and enableWe now have the following functions:
NB CE XE AE | function -------------|---------- 0 1 1 0 | S = A + B (the normal add function) 1 1 1 0 | S = A - B (with carry = not borrow) 0 0 1 0 | S = A xor B 0 0 1 1 | S = A or B 0 0 0 1 | S = A and BMany other functions are possible! Most notably, we can add extra enable inputs on the NAND gates in the column of gates, plus extra rows (with a corresponding widening of the final nand gate) to add functions such as left and right shift to the ALU.
Given two numbers to be multiplied, A and B, where A_{0} is the least significant bit of A, we add them as follows:
A multiplicand (icand for short) + B multiplier (ier for short) ----- S product (prod for short)Usually, we are used to treating the terms multiplier and and multiplicand as synonyms, since multiplication is commutative, but the algorithms for multiplication treat these in very different ways!
The classic multiplication algorithm used on computers derives directly from an algorithm people frequently do in their heads. As an example, consider multiplying 12 by 15 mentally. This is hard, but you can multiply 6 by 30 somewhat easier and 3 by 60 even more easily -- the conversions between these problems involve dividing the multiplier by two while doubling the multiplicand in order to reduce the problem to one you can handle in your head.
The full algorithm, known as the Russian peasant multiplication algorithm, can be expressed as:
times(ier, icand) = if even(ier) times(ier/2, icand*2) else times(ier/2, icand*2) + icandThis recursive formulation can be converted to iterative form:
unsigned int times(unsigned int ier, unsigned int icand) { unsigned int prod = 0; while (ier != 0) { if (ier & 1) prod = prod + icand; ier = ier >> 1; -- /2 icand = icand << 1; -- *2 } return prod; }In the above, we replaced multiplication and division by 2 by shift operators. The above, of course, only works for unsigned integers.
For a low-performance computer, we need to generalize the above algorithm to 2's complement numbers and then implement it in microcode or, for very low performance applications, leave it to the user to implement the above code directly in software. For a high performance machine, we need to transform the above algorithm into something that can be efficiently implemented without reliance on microcode or other slow hardware.