19. Fast Arithmetic
Part of the 22C:122/55:132 Lecture Notes for Spring 2004

Given two numbers to be added, A and B, where A_{0} is the least significant bit of A, we add them as follows:
      A     addend
    + B     augend
    ----
      S     sum
Usually, we will treat the terms addend and augend as synonyms, since addition is commutative. In each bit position of the sum, we can identify the following:
            C     carry into position i
             i
            A     addend bit i
             i
          + B     augend bit i
             i
          ------
      C     S     carry out of bit position i
       i+1   i    and bit i of the sum
Here, C_{i} is both the carry into bit position i and the carry out of bit position i-1. These variables are related by the following truth table:
  A  B  C  |  S  C
   i  i  i |   i  i+1
 ----------+---------
  0  0  0  |  0  0
  0  0  1  |  1  0
  0  1  0  |  1  0
  0  1  1  |  0  1
  1  0  0  |  1  0
  1  0  1  |  0  1
  1  1  0  |  0  1
  1  1  1  |  1  1
In hardware, we may build a 1-bit adder as follows:
S_{i} = (A_{i} xor B_{i} ) xor C_{i}
C_{i+1} = (A_{i} and B_{i} ) or (A_{i} and C_{i} ) or (B_{i} and C_{i} )
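The two equations above can be checked directly in C; the following sketch (the function names are mine, not from the notes) evaluates both formulas over all eight input combinations and compares them against ordinary arithmetic.

```c
#include <assert.h>

/* One-bit full adder, written directly from the two equations above. */
static unsigned sum_bit(unsigned a, unsigned b, unsigned c)
{
    return (a ^ b) ^ c;                  /* S = (A xor B) xor C */
}

static unsigned carry_out(unsigned a, unsigned b, unsigned c)
{
    return (a & b) | (a & c) | (b & c);  /* the majority function */
}

/* Check both outputs against ordinary arithmetic for all 8 input cases. */
static void check_full_adder(void)
{
    for (unsigned a = 0; a < 2; a++)
        for (unsigned b = 0; b < 2; b++)
            for (unsigned c = 0; c < 2; c++) {
                unsigned total = a + b + c;     /* 0, 1, 2 or 3 */
                assert(sum_bit(a, b, c) == (total & 1));
                assert(carry_out(a, b, c) == (total >> 1));
            }
}
```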
If we chain these together, we get a ripple-carry adder:
       A  B             A  B      A  B
        n  n             1  1      0  0
        |  |             |  |      |  |
       _|__|_           _|__|_    _|__|_
      | A  B |         | A  B |  | A  B |       C
  ,---|C  +  |<-- ... -|C  +  |<-|C  +  |<------ in
  |   |___S__|         |___S__|  |___S__|
  |       |                |         |
  C       S                S         S
   out     n                1         0
The fundamental problem with this design is that the time it takes to add two n-bit numbers is O(n). Typically, this design is acceptable for n<4, but for n=16, the speed becomes a serious bottleneck, and for n=32 or n=64, it is intolerable.
One way to speed the addition is to defer carry propagation. This is done as follows:
   [Diagram: a first row of adders sums A and B bit by bit, with zeros on
   all carry inputs except C_in at bit 0; each sum output drops straight
   down into a second row of adders, while each carry output is routed
   down and one bit position to the left, so the deferred carries are
   absorbed by the next row instead of rippling along the current one;
   S_0 emerges from the first row, S_1 from the second, and so on.]
If we use exactly this scheme, we will require O(n^{2}) adders and the delay will still be O(n)! However, this fails to take into account the fact that each adder has an extra input that is unused in this context! Of course, many of these adders could be replaced by half-adders since they have unused inputs, but that doesn't change the asymptotic bound.
If we need to sum a number of inputs, we can use this structure to perform the sum, and this is perfectly suited to summing multiple partial products for a fast multiplier.
The term carry-save adder comes from the use of this model in iterative multiplication; after adding each partial product to the sum, the accumulated sum would be held in one register and the carries produced in accumulating that sum would be held in a second register. As a result, the partial products could be accumulated at high speed, without the need to wait for carry propagation during each cycle of accumulation. Only at the end was there a need to use a full adder that propagated the carry.
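The word-parallel form of this trick can be sketched in C: one carry-save step (sometimes called a 3:2 compressor) reduces three addends to a sum word and a carry word, with no carry propagation at all. The names below are illustrative, not from the notes; the invariant is that sum + carry equals the true total (modulo word size).

```c
#include <stdint.h>

/* One carry-save step: reduce three addends to a sum word and a carry
   word without propagating carries.  The true total is sum + carry. */
typedef struct { uint32_t sum; uint32_t carry; } csa_t;

static csa_t csa(uint32_t a, uint32_t b, uint32_t c)
{
    csa_t r;
    r.sum   = a ^ b ^ c;                          /* per-bit sum, no carries */
    r.carry = ((a & b) | (a & c) | (b & c)) << 1; /* carries, shifted to the
                                                     next bit position */
    return r;
}
```

Only when the final result is needed does a conventional (carry-propagating) add of sum and carry take place.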
If a carry-save adder is built in parallel, with 3 or more layers, it should be noted that the same components can be rewired to compute the sum for one column of the result in O(log n) time instead of O(n) time. To do this, the adders are arranged into a binary tree, where each internal node sums a carry input with the sums out of two subtrees. The carry outputs of each adder still go to the next tree to the left.
   [Diagram: a binary tree of adders summing eight addend bits for one
   digit position; each adder's carry outputs feed the corresponding
   trees for digit positions i+1 and i+2, and the root produces the
   sum bit s.]

       a    addend at digit i
       c    carry in to digit i
       c'   carry in to digit i+1
       c"   carry in to digit i+2
       s    sum at digit i
With this scheme, we create a bit of havoc with the wiring, and carries out from each stage now leapfrog upward (summing n inputs to one bit position now propagates carries to about log(n) other bit positions, instead of just sending n-2 carry bits to the adjacent bit position).
Another way to add n-bit numbers in O(log n) time is to use carry lookahead or anticipatory carry logic. Charles Babbage coined the term carry anticipation in the 1830's, in his work on fast mechanical adders for his difference engine, and this term is still used!
For any substring of digit positions within a sum, we can ask, for that substring: do the values of the operands within that substring cause it to generate a carry, and do the values of the operands within that substring cause it to propagate carry in to carry out?
      A  A  A  A  A  A
       5  4  3  2  1  0
    + B  B  B  B  B  B
       5  4  3  2  1  0
    ------------------
      S  S  S  S  S  S
       5  4  3  2  1  0
         \________/
           [3:1]
In the above, we have singled out bits 3 to 1 of a sum, and we ask for P_{[3:1]}: are the values of A and B in bit positions 3 to 1 such that C_{4} is implied by C_{1}?
P_{[3:1]} = (A_{3} or B_{3}) and (A_{2} or B_{2}) and (A_{1} or B_{1})
Similarly, we can ask for G_{[3:1]}: are the values of A and B in bit positions 3 to 1 such that they imply C_{4} without knowledge of C_{1}?
G_{[3:1]} = (A_{3} and B_{3}) or (A_{2} and B_{2} and P_{[3]}) or (A_{1} and B_{1} and P_{[3:2]})
Our goal is to use the above information in a divide and conquer approach to fast addition. To do this, we want to build a tree with the adders for each bit position at the leaves and the tree body performing the carry propagation. To do this, we need a way to compute P_{[a:c]} when given P_{[a:b]} and P_{[b:c]}. This is remarkably easy!
P_{[a:c]} = P_{[a:b]} and P_{[b:c]}
We also need a way to compute G_{[a:c]} when given G_{[a:b]} and G_{[b:c]}. This is remarkably easy!
G_{[a:c]} = G_{[a:b]} or (G_{[b:c]} and P_{[a:b]})
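The two combining rules can be sketched in C. Here a small struct (my notation, not the notes') holds the (G, P) pair for a span of bit positions; "hi" is the more significant span [a:b] and "lo" is the less significant span [b:c].

```c
#include <stdbool.h>

/* Generate/propagate pair for a span of bit positions. */
typedef struct { bool g; bool p; } gp_t;

/* Leaf: a single bit position.  G = A and B, P = A or B. */
static gp_t gp_leaf(bool a, bool b)
{
    gp_t r = { a && b, a || b };
    return r;
}

/* Combine the upper span [a:b] with the lower span [b:c]:
     P[a:c] = P[a:b] and P[b:c]
     G[a:c] = G[a:b] or (G[b:c] and P[a:b])  */
static gp_t gp_combine(gp_t hi, gp_t lo)
{
    gp_t r;
    r.p = hi.p && lo.p;
    r.g = hi.g || (lo.g && hi.p);
    return r;
}
```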
Now, assuming the word size is 2^{n}, we can build a complete balanced binary tree with adders for the leaves, where each internal node of the tree operates as above, with one additional detail.
We must, at each internal node, accept a carry in from the least significant side and, depending on the propagate signals, deliver this to the appropriate subtrees. With this change, some internal node of the carry tree will generate carry in to each stage of the adder, so no adder stage needs to compute carry out! The result is as follows:
   [Diagram: each adder stage, from A  B  down to A  B  with C  ,
                                     n  n          0  0       in
   now produces G and P outputs along with S, and no longer produces a
   carry out.  The G and P signals feed a binary tree of lookahead
   nodes; each node combines the (G,P) pairs of its two subtrees and
   routes carry in down to them, and the root's G and P are combined
   with C   by an AND gate and an OR gate to produce C   .]
         in                                           out
We now have a scheme that gives us carry propagation from C_{in} to C_{out} in O(log_{2} n) time for an n bit sum, while requiring O(n log_{2} n) hardware for carry propagation. The additional hardware cost to move from O(n) to O(n log_{2} n) hardware is almost always justified by the speed improvement from O(n) to O(log_{2} n).
Essentially all computers with word sizes greater than 8 bits have used this approach to fast carry since the early 1970's.
The adders at the leaves of the carry propagation tree are:
          ___
    A o--|   |
         |XOR|    ___
    B o--|___|---|   |
                 |XOR|---o S
    C o----------|___|
          ___
    A o--|   |
         |AND|---o G
    B o--|___|
          ___
    A o--|   |
         |OR |---o P
    B o--|___|
The internal nodes of the carry lookahead tree are:
          ___
    C ---|   |
         |AND|    ___
    P ---|___|---|   |
     0           |OR |---o C
    G -----------|___|      1
     0
          ___
    G ---|   |
     0   |AND|    ___
    P ---|___|---|   |
     1           |OR |---o G
    G -----------|___|
     1
          ___
    P ---|   |
     0   |AND|---o P
    P ---|___|
     1
This is the scheme used on most modern computers, starting around 1970, except that we use a 4-way tree instead of a binary tree. The 4-way tree is used because, for most digital logic circuits used in VLSI design, fan-in and fan-out on the order of 4 is easily done, while higher fan-outs require higher power levels. So, a 16-bit adder requires four 4-bit lookahead units connecting the leaves, plus one more to coordinate them, and by adding one more level to the tree, we get a 64-bit fast adder.
The fast adder outlined above only performs one function, but it may easily be extended to other functions as follows. Note that, at the core of each adder, there is a simple function A xor B xor C. The truth table for this is:
  A  B  C  |  S
 ----------+---
  0  0  0  |  0
  0  0  1  |  1
  0  1  0  |  1
  0  1  1  |  0
  1  0  0  |  1
  1  0  1  |  0
  1  1  0  |  0
  1  1  1  |  1
Mechanical reduction of this to a sum-of-products logic circuit requires 3 inverters and 5 NAND gates:
   [Diagram: A, B and C each drive an inverter; four 3-input NAND gates
   compute the four product terms of S, and a final 4-input NAND gate
   combines them to produce S.]
If we expand this circuit by adding a few more gates and a few more inputs, we get the heart of a useful ALU:
   [Diagram: the same circuit, extended with control inputs: NB (not B)
   optionally inverts B on the way in, CE (carry enable) gates the carry
   input C , XE (xor enable) gates the sum (xor) product terms, and AE
          i
   (and enable) ors an A-and-B term into the final NAND.]
We now have the following functions:
   NB CE XE AE | function
  -------------+--------------------------------------
    0  1  1  0 | S = A + B    (the normal add function)
    1  1  1  0 | S = A - B    (with carry = not borrow)
    0  0  1  0 | S = A xor B
    0  0  1  1 | S = A or B
    0  0  0  1 | S = A and B
Many other functions are possible! Most notably, we can add extra enable inputs on the NAND gates in the column of gates, plus extra rows (with a corresponding widening of the final NAND gate) to add functions such as left and right shift to the ALU.
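The function table can be modeled bit by bit in C. The sketch below is behavioral, not gate-level (the gating order is mine), but it reproduces the table: NB inverts B, CE gates the carry, XE enables the xor (sum) term, and AE ors in the and term.

```c
#include <stdbool.h>

/* One ALU bit position, following the control lines in the table above.
   Returns S; *cout receives the carry out of this position. */
static bool alu_bit(bool a, bool b, bool cin,
                    bool nb, bool ce, bool xe, bool ae, bool *cout)
{
    if (nb) b = !b;               /* NB: complement B (for subtraction) */
    if (!ce) cin = false;         /* CE: carry enable */
    bool x = (a != b);            /* A xor B */
    bool s = false;
    if (xe) s = (x != cin);       /* XE: (A xor B) xor C */
    if (ae) s = s || (a && b);    /* AE: or in the A-and-B term */
    *cout = ce && ((a && b) || (a && cin) || (b && cin));
    return s;
}
```

With XE=1 and AE=1 but CE=0, the two terms combine as (A xor B) or (A and B), which is exactly A or B, matching the table's fourth row.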
Given two numbers to be multiplied, A and B, where A_{0} is the least significant bit of A, we multiply them as follows:
      A     multiplicand (icand for short)
    x B     multiplier (ier for short)
    ----
      S     product (prod for short)
Usually, we are used to treating the terms multiplier and multiplicand as synonyms, since multiplication is commutative, but the algorithms for multiplication treat these in very different ways!
The classic multiplication algorithm used on computers derives directly from an algorithm people frequently do in their heads. As an example, consider multiplying 12 by 15 mentally. This is hard, but you can multiply 6 by 30 somewhat more easily and 3 by 60 even more easily - the conversions between these problems involve dividing the multiplier by two while doubling the multiplicand in order to reduce the problem to one you can handle in your head.
The full algorithm, known as the Russian peasant multiplication algorithm, can be expressed as:
times(ier, icand) =
    if ier = 0
        0
    else if even(ier)
        times(ier/2, icand*2)
    else
        times(ier/2, icand*2) + icand
This recursive formulation can be converted to iterative form:
unsigned int times(unsigned int ier, unsigned int icand)
{
    unsigned int prod = 0;
    while (ier != 0) {
        if (ier & 1) prod = prod + icand;
        ier = ier >> 1;        /* /2 */
        icand = icand << 1;    /* *2 */
    }
    return prod;
}
In the above, we replaced multiplication and division by 2 by shift operators. The above, of course, only works for unsigned integers.
For a low-performance computer, we need to generalize the above algorithm to 2's complement numbers and then implement it in microcode or directly in a sequential control unit; this is exactly what Burks, Goldstine and von Neumann proposed in their IAS paper from 1946; and they also recognized that, for very low performance applications, this algorithm could just as well be implemented in software. For a high performance machine, we need to transform the above algorithm into something that can be efficiently implemented without reliance on microcode or other slow hardware!
The IAS proposal included the following data paths in support of multiplication, and these data paths were included, without much change, in almost all single-accumulator computer architectures of the 1950's and 1960's, including the DEC PDP-8, if you bought the optional extended arithmetic element.
            ______________
           | Multiplicand |
           |______________|
                  |
            ______|_________        ___________
           | ALU/Shifter >> |----->|  Shifter  |
           |________________|      |___________|
              ^       |              ^     |
              |   ____|________      |   __|___________________
              `--| Accumulator |     `-| Multiplier-Quotient   |
                 |_____________|       |_______________________|
The multiply algorithm can be expressed in terms of this hardware as follows:
unsigned int times(unsigned int ier, unsigned int icand)
{
    unsigned int AC = 0;       /* the accumulator */
    unsigned int MQ = ier;     /* the multiplier */
    for (i = 0; i < wordsize; i++) {
        if (MQ odd) {
            AC|MQ = (AC + icand)|MQ >> 1
        } else {
            AC|MQ = AC|MQ >> 1
        }
    }
    return MQ;
}
Here, the notation of a programming language such as C is deficient because it makes it difficult to deal with something that is very easy in hardware: the treatment of two one-word variables as a two-word variable. This is indicated above using the vertical bar symbol as a concatenation operator, so AC|MQ indicates one double-length quantity, half of which comes from or is destined for AC, and the other half comes from or is destined for MQ.
When applied to an N-bit word, this multiply operation actually produces a product that is 2N bits long, with the most significant half in AC, and the least significant half in MQ, replacing the multiplier that originally filled this register. Classical single accumulator machines simply exposed the programmer to this fact, allowing the programmer to keep one, the other, or both words of the result, depending on the needs of the algorithm. This option is lost in most modern high level languages, which insist that the product of 2 N-bit numbers is an N-bit number. Unfortunately, many modern machines follow the lead of such languages!
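The double-register algorithm can be modeled concretely in C for a 32-bit word; the sketch below (my formulation, not from the notes) keeps AC and MQ as separate 32-bit registers and models the shift explicitly: the carry out of the add shifts into the top of AC, and the bit shifted out of AC's low end shifts into the top of MQ, just as the datapath above does it.

```c
#include <stdint.h>

#define WORDSIZE 32

/* Shift-and-add multiply on the AC and MQ register pair: AC accumulates
   the high half of the product while MQ gives up multiplier bits and
   receives low product bits, one per step. */
static uint64_t acmq_times(uint32_t ier, uint32_t icand)
{
    uint32_t AC = 0;        /* accumulator: becomes the high half */
    uint32_t MQ = ier;      /* multiplier, displaced by the low half */
    for (int i = 0; i < WORDSIZE; i++) {
        /* up to 33 bits: the carry out of the add must not be lost */
        uint64_t sum = (MQ & 1) ? (uint64_t)AC + icand : AC;
        MQ = (MQ >> 1) | ((uint32_t)(sum & 1) << 31); /* AC's low bit
                                                         shifts into MQ */
        AC = (uint32_t)(sum >> 1);  /* carry shifts into AC's top bit */
    }
    return ((uint64_t)AC << 32) | MQ;   /* the full 2N-bit product */
}
```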
It is worth noting that the multiply loop can be rewritten as follows, using the C (if?then:else) notation for a conditional expression:
for (i = 0; i < wordsize; i++) {
    AC|MQ = (odd(MQ) ? AC + icand : AC)|MQ >> 1;
}
Few microcoded machines use a loop counter control register to control this loop. Instead, it is very common to simply unroll the loop the required number of times. Thus, for a 16-bit machine, we would replace the above loop with microcode equivalent to the following:
AC|MQ = (odd(MQ) ? AC + icand : AC)|MQ >> 1;
AC|MQ = (odd(MQ) ? AC + icand : AC)|MQ >> 1;
AC|MQ = (odd(MQ) ? AC + icand : AC)|MQ >> 1;
AC|MQ = (odd(MQ) ? AC + icand : AC)|MQ >> 1;
AC|MQ = (odd(MQ) ? AC + icand : AC)|MQ >> 1;
AC|MQ = (odd(MQ) ? AC + icand : AC)|MQ >> 1;
AC|MQ = (odd(MQ) ? AC + icand : AC)|MQ >> 1;
AC|MQ = (odd(MQ) ? AC + icand : AC)|MQ >> 1;
AC|MQ = (odd(MQ) ? AC + icand : AC)|MQ >> 1;
AC|MQ = (odd(MQ) ? AC + icand : AC)|MQ >> 1;
AC|MQ = (odd(MQ) ? AC + icand : AC)|MQ >> 1;
AC|MQ = (odd(MQ) ? AC + icand : AC)|MQ >> 1;
AC|MQ = (odd(MQ) ? AC + icand : AC)|MQ >> 1;
AC|MQ = (odd(MQ) ? AC + icand : AC)|MQ >> 1;
AC|MQ = (odd(MQ) ? AC + icand : AC)|MQ >> 1;
AC|MQ = (odd(MQ) ? AC + icand : AC)|MQ >> 1;
In microcode, this can be reduced to 16 consecutive identical microinstructions, differing only in their next address fields. Thus, there is no overhead for conditional evaluation or microcontrol, and this multiply operation can be executed in exactly 16 clock cycles.
The first solution to be widely used to speed up the N clock-cycle multiply algorithm outlined above simply uses a higher radix. Consider, for example, the following idea:
      0   A*1  A*2  A*3  A*4  A*5  A*6  A*7
      |    |    |    |    |    |    |    |
     _|____|____|____|____|____|____|____|_
    |  0    1    2    3    4    5    6   7 |    3
    |                 MUX                  |---/--- multiplier
    |______________________________________|
                       |
This circuit fragment will multiply its input by a 3-bit multiplier. If we substitute this for the ALU in the original algorithm, we can do a 15-bit multiply in 5 clock cycles, where, in each cycle, the least significant 3 bits of the MQ register are used to compute a partial product for addition to the accumulator. Instead of using one-place shifts, the new multiplier uses 3-place shifts.
The above seems a bit complex, since it relies on separate combinational circuitry to compute what amounts to a table of partial products, but in fact, computing these is straightforward. Consider the following facts:
  a*0 = 0                = 0    + 0
  a*1 = a                = 0    + a
  a*2 = a<<1             = 0    + a<<1
  a*3 = a + a<<1         = a<<2 - a
  a*4 = a<<2             = a<<2 + 0
  a*5 = a + a<<2         = a<<2 + a
  a*6 = a<<1 + a<<2      = a<<2 + a<<1
  a*7 = a + a<<1 + a<<2  = a<<3 - a
These can be computed with a single add/subtract unit and two 2input multiplexors, so long as the multiplexors have enable inputs in addition to their select inputs (when not enabled, the multiplexor output is zero).
     a o----+--------+--------+--------+
           _|_      _|_      _|_       |
          |<<2|    |<<3|    |<<1|      |
          |___|    |___|    |___|      |
            |        |        |        |
           _|________|_      _|________|_
   ASEL --|  0      1  |    |  1      0  |-- BSEL
  AENAB --|   MUX A    |    |   MUX B    |-- BENAB
          |____________|    |____________|
                |                  |
               _|__________________|_
              |  A                B  |
  ADDSUB   ---|       0: A+B         |
              |       1: A-B         |
              |______________________|
                         |
Given 3 bits of the multiplier, we set the control inputs to the above circuitry as follows in order to create a circuit equivalent to the 8-input multiplexor version of the circuit.
   M0 M1 M2 | ASEL AENAB BSEL BENAB ADDSUB
  ----------+-----------------------------
    0  0  0 |  x     0    x     0     x
    0  0  1 |  x     0    0     1     0
    0  1  0 |  x     0    1     1     0
    0  1  1 |  0     1    0     1     1
    1  0  0 |  0     1    x     0     x
    1  0  1 |  0     1    0     1     0
    1  1  0 |  0     1    1     1     0
    1  1  1 |  1     1    0     1     1

  ASEL   = M0 and M1 and M2
  AENAB  = (M1 and M2) or M0
  BSEL   = M1 and not(M2)
  BENAB  = M1 or M2
  ADDSUB = M1 and M2
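The control equations can be verified in C by wiring them to a model of the add/subtract datapath and checking that the unit produces a*m for every 3-bit digit m (here M0 is the most significant bit of the digit, matching the table; the function name is mine).

```c
#include <stdint.h>
#include <stdbool.h>

/* Radix-8 multiple of a, computed the way the add/subtract unit above
   does it, using the control equations derived from the table. */
static uint32_t multiple(uint32_t a, unsigned m)   /* m = 0..7 */
{
    bool m0 = (m >> 2) & 1;   /* most significant bit of the digit */
    bool m1 = (m >> 1) & 1;
    bool m2 = m & 1;

    bool asel   = m0 && m1 && m2;
    bool aenab  = (m1 && m2) || m0;
    bool bsel   = m1 && !m2;
    bool benab  = m1 || m2;
    bool addsub = m1 && m2;

    /* MUX A picks a<<2 or a<<3; MUX B picks a or a<<1; a disabled
       multiplexor outputs zero. */
    uint32_t A = aenab ? (asel ? a << 3 : a << 2) : 0;
    uint32_t B = benab ? (bsel ? a << 1 : a) : 0;
    return addsub ? A - B : A + B;
}
```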
It is worth noting that the high-end IBM System 360 computers of the late 1960's used radix 16 so that they could do a 32-bit by 32-bit fullword multiply in 8 clock cycles or a 16-bit by 16-bit halfword multiply in 4 clock cycles.
Despite its elegance, this scheme offers only a constant speedup; microcoded implementations are still called for, and the demand for very high performance computers appears to require faster multiply hardware. Also, of course, we are interested in the ultimate "big O" bounds on the computation.
When we multiply two numbers, using the algorithm everyone learns for decimal multiplication in elementary school, we lay the work out as follows:
                  C  C  C  C  C       multipliCand
                   4  3  2  1  0
             x           E  E  E      multipliEr
                          2  1  0
            ---------------------
                  P  P  P  P  P
                   04 03 02 01 00
               P  P  P  P  P          Partial products
                14 13 12 11 10
         +  P  P  P  P  P
             24 23 22 21 20
        -------------------------
         P  P  P  P  P  P  P          Product
          6  5  4  3  2  1  0
The product P is the sum of the partial products P_{i}, where the number of partial products equals the number of digits in the multiplier.
For radix 2, the digits of partial product P_{i} are zero if bit E_{i} of the multiplier is zero, and the partial product equals the multiplicand if E_{i} is one. That is
P_{ij} = E_{i} and C_{j}
Aside from this trivial logic, all that is involved in multiplication is an array of adders! To multiply an N-bit number by an M-bit number, we need N adders of M bits each. If each adder has an enable input to enable computation of either A+B or just A, we can construct a fully parallel multiplier as follows:
   [Diagram: a 3 by 5 array of adder cells; multiplicand bits C_4 to C_0
   enter diagonally at the top, each multiplier bit E_0, E_1, E_2 drives
   the enable inputs of one row, and each row's sums pass down to the
   next row, with product bits P_7 to P_0 emerging along the bottom and
   right edge.]

Each cell of this picture contains the following:
          |    /
          |   /     o---------
        __|__/______|______
       |  A  B             |
       |            BENAB  |
       |         +         |
       |  Cout        Cin  |
       |________ __________|
          /     |
         /      |
If each of these units is a classical full adder with an added enable input, the total multiply time for an N-bit by N-bit multiply is 2N times the propagation delay of the full adder. That is, O(N), and we did this with O(N^{2}) hardware.
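The behavior of the whole array can be sketched in C: each row forms one partial product by anding the multiplicand with a single multiplier bit (the P_ij = E_i and C_j rule), shifts it into position, and adds it to the running result. The word-level adds here stand in for rows of full-adder cells; the function name is mine.

```c
#include <stdint.h>

/* Array-multiplier model: each partial product row is the multiplicand
   anded with one multiplier bit, shifted into position, then the rows
   are summed as the adder array would sum them. */
static uint64_t array_multiply(uint32_t icand, uint32_t ier)
{
    uint64_t product = 0;
    for (int i = 0; i < 32; i++) {
        /* P  = E  and C , applied to the whole multiplicand at once */
        /*  ij   i      j                                            */
        uint64_t row = ((ier >> i) & 1) ? (uint64_t)icand : 0;
        product += row << i;       /* one row of the adder array */
    }
    return product;
}
```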
We know that we can use carry lookahead logic to take the brute force design above and convert each step from O(N) time to O(log N) time, but we still need N steps to add the partial sums, so this gives us an add time of O(N log N), which is not an improvement over the original O(N)!
The question is, if we have N partial sums to compute, can we do this faster? The solution rests on the following idea:
(((A + B) + C) + D) = (A + B) + (C + D)
No matter how we add N terms, it will take N-1 2-input adders, which is O(N). However, we can arrange these in two distinct ways; one adds the terms successively in O(N) time, while the other forms a binary tree of adders, adding them in O(log N) time.
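The tree arrangement is easy to illustrate in C: summing by repeatedly splitting the terms in half uses the same N-1 additions, but the longest chain of dependent additions (the adder depth) grows only logarithmically. The depth bookkeeping below is for illustration; it is not part of the notes.

```c
#include <stddef.h>

/* Sum n terms by splitting in half, the way a tree of adders groups
   them; *depth reports the number of adder levels on the longest path. */
static long tree_sum(const long *a, size_t n, int *depth)
{
    if (n == 1) { *depth = 0; return a[0]; }
    int dl, dr;
    long left  = tree_sum(a, n / 2, &dl);
    long right = tree_sum(a + n / 2, n - n / 2, &dr);
    *depth = 1 + (dl > dr ? dl : dr);   /* one more adder level */
    return left + right;
}
```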
This is what we will do with our partial products! The result is a multiplier that uses O(N^{2} log N) hardware and computes the product in O(log N) time!
If we use a carrysave adder as suggested previously, we can propagate all of the carries down to the last stage, and only use the lookahead idea once at the end. The result requires O(N^{2}) hardware and O(log N) time to multiply 2 Nbit numbers. The constant factors are about twice that for addition.
All of this was known by the early 1970's, and the Texas Instruments TTL handbook from 1973 includes nearly complete (but cryptic) details on how to build such high performance multipliers.
The question remains, however, whether it is economical to expend O(N^{2}) hardware on multiplication. The answer, as it turns out, is that it is frequently not justified. The widespread success of minicomputers and low-end microprocessors that lacked multiplication instructions strongly hints at this, as does the fact that even on today's Pentium family, the runtimes for integer multiplication strongly suggest that the older iterative schemes are still in use. It was not until the 1980's that this question was formally investigated.