9. Beyond Integer Addition and Subtraction.
Part of
22C:40, Computer Organization and Hardware Notes

In the previous chapter, we demonstrated that addition and subtraction operators can be built in hardware using a number of logic gates proportional to the word size. One bit of an adder can be built using about 12 and, or and not gates, and this can be extended to do subtraction using about 4 more primitive gates to build an exclusive-or operator. Adding a second exclusive-or gate plus an and gate and an or gate gives us all of the logic functions we need, and adding a multiplexor, at a cost of 7 more gates, gives us a complete one-bit slice of an arithmetic logic unit. Adding up these gates and multiplying by the 32 bits per word gives us an estimate for the number of gates in the Hawk arithmetic-logic unit:
   Adder                                          12
   exclusive or to make adder-subtractor           4
   exclusive or, and and or gates for logic        6
   4-input multiplexor for function select       + 7
                                                ----
   Total gates per bit                             29
   word size                                     × 32
                                                ----
   Estimated gate count for the Hawk ALU          928
This estimate of 928 gates covers only addition, subtraction and the simple logic operations. To obtain the left and right shift operations, we must add two shift networks, one for left shifts and one for right shifts. The Hawk shift instructions take shift counts between 1 and 16, so each shift network can be made of either 32 16-input multiplexors or 64 4-input multiplexors. We will assume the latter:
   4-input multiplexor                             7
   word size                                    × 32
                                                ----
   gates per 32-bit 4-way shifter                224
   2 layers of shifters                          × 2
                                                ----
   gates per 32-bit 16-way shifter               448
   need both left and right shifts               × 2
                                                ----
   Total gates for the Hawk shift network        896
Adding the 896 gates for the two shift networks to the 928 gates in the arithmetic logic unit gives us an estimate for the total complexity of the Hawk's arithmetic and shifting hardware: 1,824 and, or and not gates.
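The way two layers of 4-way multiplexors implement a shift of 0 to 15 places can be sketched in a few lines of C. This is only an illustrative model of the data path, not a description of the actual Hawk circuit, and the function name is ours:

```c
#include <stdint.h>

/* Model of a two-layer barrel shifter: a shift by s (0 to 15) is done
   as a shift by (s mod 4) followed by a shift by a multiple of 4.
   Each statement corresponds to one layer of 4-input multiplexors. */
uint32_t shift16way( uint32_t x, unsigned int s )
{
    x = x << (s & 3);    /* first layer:  shift 0, 1, 2 or 3 places  */
    x = x << (s & 12);   /* second layer: shift 0, 4, 8 or 12 places */
    return x;
}
```

Because (s & 3) + (s & 12) = s for any s from 0 to 15, the two masked shifts compose into a single shift by s, which is exactly why two layers of small multiplexors suffice.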
Exercises
a) Estimate the cost, in gates, of the ALU-shifter for a 16-bit digital signal processor chip (a DSP), assuming it supports only the following shift operations: byte swap, left shift one place, right shift one place, or no shift at all. These options apply to the output of the ALU and are applicable to each ALU operand.
b) The accounting above assumed the use of two layers of 4-way multiplexors to construct a shift network that could shift from 0 to 15 places. How does the result differ if you assume 4 layers of 2-way multiplexors to do this job?
What must we add to this to add multiplication to the Hawk's repertoire of arithmetic operations? To find the answer, we must go back and review the algorithm for multiplication that most people learn in elementary school. Here is a multiplication problem worked out:
             1 2 8     multiplicand
           × 5 1 2     multiplier
           -------
             2 5 6     partial product 2 × 128
           1 2 8       partial product 1 × 128
       + 6 4 0         partial product 5 × 128
       -----------
       6 5 5 3 6       product
This decimal algorithm, shown here in graphical form, involves first multiplying the multiplicand by each digit of the multiplier to create the partial products. Each partial product is written shifted over so that its least significant digit is under the corresponding digit of the multiplier, and then the partial products are added to compute the product.
The statement of this algorithm does not depend on the number base being used! Thus, we can use this exact same algorithm in base 16 or in base 8 or even in base 2. The hard part of this algorithm in decimal is multiplying the multiplicand by the digits of the multiplier. To do this, the typical elementary school student must memorize the multiplication tables for base 10. In base 2, however, the multiplication table is easy because there are only two possibilities, multiplication by one or multiplication by zero!
Exercises
c) Write code, in C, C++ or Java, to multiply two numbers using the decimal algorithm you'd use with pen and paper. With each iteration, you should accumulate one partial product into the product, multiply the multiplicand by 10, and divide the multiplier by 10. You can extract the least significant digit of the multiplier using the mod operator, and you can multiply this digit by the multiplicand to compute the partial product. This may seem silly, since you are multiplying and dividing in order to multiply, but note that you are only dividing by 10, never by any other number, and you are only multiplying by numbers from 0 to 9, never larger. So, you are making progress, using this limited set of operations to implement a much more general multiply routine.
To multiply two binary numbers using pencil and paper, we use exactly the same multiplication algorithm we would use in decimal, but we do it using binary arithmetic. Consider, for example, the problem of multiplying 100_{10} by 10_{10}. In binary, this is 1100100_{2} times 1010_{2}, and if we work the problem using pencil and paper, we will get something like this:
               1 1 0 0 1 0 0   = 100       multiplicand
             ×       1 0 1 0   = 10        multiplier
             ---------------
               0 0 0 0 0 0 0   = 0 × 100   partial products
             1 1 0 0 1 0 0     = 1 × 100
           0 0 0 0 0 0 0       = 0 × 100
       + 1 1 0 0 1 0 0         = 1 × 100
       ---------------------
       1 1 1 1 1 0 1 0 0 0     = 10 × 100  = 1000
This binary computation, shown here in graphical form, follows exactly the same pattern as the decimal one, except that each partial product is now either zero or a copy of the multiplicand, shifted so that its least significant bit is under the corresponding bit of the multiplier.
It is not too hard to reduce this paper-and-pencil algorithm to machine code for the Hawk machine! The first step in such a reduction is to work it out in a high-level language like C or Java:
unsigned int multiply( unsigned int multiplier, unsigned int multiplicand )
{
    unsigned int product = 0;
    while (multiplier != 0) {
        if ((multiplier & 1) != 0) {
            product = product + multiplicand;
        }
        multiplier = multiplier >> 1;
        multiplicand = multiplicand << 1;
    }
    return product;
}
In the above, we have shifted the multiplicand one place left on each iteration so we can add it in as a partial product when the corresponding bit of the multiplier is nonzero. For each iteration, we shift the multiplier one place right so that we can always check the same bit, the least significant bit, instead of testing a different bit for each iteration. We test the bit by using the bitwise and operator of C to mask out all but the least significant bit. Here is exactly the same code using operators that many novice programmers will find to be more familiar:
unsigned int multiply( unsigned int multiplier, unsigned int multiplicand )
{
    unsigned int product = 0;
    while (multiplier != 0) {
        if ((multiplier % 2) != 0) {
            product = product + multiplicand;
        }
        multiplier = multiplier / 2;
        multiplicand = multiplicand * 2;
    }
    return product;
}
In the above, we substituted multiplication and division by two for shifting, and we tested to see if the multiplier was even or odd by checking the remainder after division by two. Unfortunately, this code uses multiplication and division operations in order to implement multiplication, so it is not a very effective solution to the problem of doing multiplication on a machine that has no multiply hardware!
It is worth noting that this algorithm for multiplication is ancient! It has been rediscovered repeatedly by many cultures. Some authors refer to it as the Russian peasant algorithm because peasants in some part of Russia multiplied by repeated doubling of the multiplicand while halving the multiplier.
The first version of the code given above used only operators that correspond directly to Hawk instructions! For the right shift applied to unsigned integer operands, where C used >>, the Hawk has the SRU instruction; for left shifts, where C used <<, the Hawk has SL; and for the and operator, where C used &, the Hawk has AND. Using these, we can rewrite the C code as follows:
MULTIPLY:               ; link through R1
                        ; R3 = multiplier on entry, product on return
                        ; R4 = multiplicand on entry, destroyed
                        ; R5 = temporary copy of multiplier, destroyed
                        ; R6 = temporary used for anding with one
                        ; uses no other registers
        MOVE    R5,R3           ; move the multiplier
        CLR     R3              ; product = 0;
MULTLP:
        TESTR   R5
        BZS     MULTLX          ; while (multiplier != 0) {
        LIS     R6,1
        AND     R6,R5
        BZS     MULTEI          ;   if ((multiplier & 1) != 0) {
        ADD     R3,R3,R4        ;     product = product + multiplicand;
MULTEI:                         ;   }
        SRU     R5,1            ;   multiplier = multiplier >> 1;
        SL      R4,1            ;   multiplicand = multiplicand << 1;
        BR      MULTLP
MULTLX:                         ; }
        JUMPS   R1              ; return product;
The above code takes either 8 or 9 instructions per iteration of the loop, and it is written in what might be called a bone-headed style, making as little use of auxiliary information as possible. In fact, a little study of the shift instructions quickly leads to some significant optimizations.
All shift instructions on the Hawk set the condition codes, so after doing the SRU instruction, we can test whether the multiplier is zero. In fact, we can test even more than this, because the shift operators set the carry bit in the condition codes to the last bit shifted out of the operand, so the SRU instruction actually tells us whether the operand was even or odd prior to shifting. This leads us to the following faster unsigned multiplication code:
MULTIPLY:               ; link through R1
                        ; R3 = multiplier on entry, product on return
                        ; R4 = multiplicand on entry, destroyed
                        ; R5 = temporary copy of multiplier, destroyed
                        ; uses no other registers
        MOVE    R5,R3           ; move the multiplier
        CLR     R3              ; product = 0;
        TESTR   R5
        BZS     MULTLX          ; if (multiplier != 0) {
MULTLP:                         ;   do {
        SRU     R5,1            ;     multiplier = multiplier >> 1;
        BCR     MULTEI          ;     if (multiplier was odd) {
        ADD     R3,R3,R4        ;       product = product + multiplicand;
MULTEI:                         ;     }
        SL      R4,1            ;     multiplicand = multiplicand << 1;
        TESTR   R5
        BZR     MULTLP          ;   } while (multiplier != 0);
MULTLX:                         ; }
        JUMPS   R1              ; return product;
This new version of the code takes either 5 or 6 instructions per iteration of the loop, so it will run about 30% faster than the original code. Furthermore, it uses one fewer register because of the way we test the bits of the multiplier. The above code is about as well optimized as you would normally hope for from a competent assembly language programmer, but further optimizations are possible. Consider the following:
MULTIPLY:               ; link through R1
                        ; R3 = multiplier on entry, product on return
                        ; R4 = multiplicand on entry, destroyed
                        ; R5 = temporary copy of multiplier, destroyed
                        ; uses no other registers
        MOVE    R5,R3           ; move the multiplier
        CLR     R3              ; product = 0
        TESTR   R5
        BZR     MULTLE          ; go loop if nonzero multiplier
        JUMPS   R1              ; return product if not
MULTLP:                         ; continue main multiplication loop
        SL      R4,1            ; multiplicand = multiplicand << 1
MULTLE:                         ; loop entry point
        SRU     R5,1            ; multiplier = multiplier >> 1
        BCS     MULADD          ; if (multiplier was odd) go add
        BZR     MULTLP          ; continue loop if multiplier nonzero
        JUMPS   R1              ; return product
MULADD:                         ; annex to multiplication loop
        ADD     R3,R3,R4        ; product = product + multiplicand
        TESTR   R5
        BZR     MULTLP          ; continue loop if multiplier nonzero
        JUMPS   R1              ; return product
This new version of the code takes either 4 or 6 instructions per iteration, half an instruction faster per iteration, on average, than the previous version. Furthermore, because the return instruction has been duplicated, this code saves several more instructions per call. The cost of this optimization is the complete destruction of the clean control structure of the original code, making it impractical to use comments written in a high-level language to describe what is going on. This kind of optimization is almost never justified because it produces a result that is almost entirely unreadable and unmaintainable.
Furthermore, this optimization is working on a dead-end project. Just as the key to fast string operations on the Hawk was to work on more than one character at a time, the key to fast multiplication on the Hawk architecture is to operate in a higher number base, base 16. Chapter 10 of the Hawk manual includes an unsigned integer multiply algorithm that has a worst-case cost of 2.8 instructions per bit while computing a 32-bit product; this code is even less readable than any of the above blocks of code, but if speed is the object, it is the clear winner.
Exercises
d) Flip coins to generate a random 8-bit multiplicand and a random 4-bit multiplier. Multiply them using the paper-and-pencil binary multiplication algorithm, and then check your results by converting the multiplier, multiplicand and product to decimal to verify your computation.
e) Rewrite the binary multiplication algorithm in C to perform the multiplication in base 16. This is actually an easy job using either version of the C code given here! It involves only extremely small changes: you must allow each digit of the multiplier to take 16 possible values instead of just 2 (zero and one), and you must change all references to the radix from 2 to 16 (or, alternatively, change all one-bit shifts to 4-bit shifts).
f) What happens with the Hawk binary multiplication code if you multiply numbers whose product is 2^{32} or greater? For example, consider what happens if you compute 125,000 squared using this algorithm. (Note that 125,000 is near 2^{17}, so squaring it would give a number near 2^{34} if a computer with a larger word size were being used.)
Examining any of the above multiply algorithms shows that to multiply a pair of 32bit numbers, we must be prepared to perform 32 pairs of shift operations and 32 add operations. We can add all of the partial products in parallel using a binary tree of 31 adders, but this tree of adders has absolutely no use for anything but multiplication, and it is very expensive!
If we opted to add such a tree to the arithmetic logic unit and shifter of the Hawk, we would use 32 and gates to compute each partial product by anding the bits of the multiplicand with one of the bits of the multiplier, plus 31 adders, each made of 32 full adders, each made of 12 gates:
   A single full adder                             12
   32 bits per word                              × 32
                                                 ----
   gates per 32-bit adder                         384
   31 adders to add 32 partial products          × 31
                                                -----
   Subtotal: cost of adding partial products    11904

   and gates to compute a partial product          32
   32 partial products                           × 32
                                                -----
   Subtotal: cost of computing partial products  1024
                                                -----
   Estimated gate count for fast multiplication 12928
This is rather discouraging! We use the Hawk arithmetic logic unit and shifter with a very large fraction of all instructions we write, so we are not unhappy to spend 1,824 logic gates on this function. On the other hand, adding 12,928 more gates to our computer for only one purpose, multiplication, is not very appealing. Therefore, few but the highest performance computers ever build high-speed multipliers that operate in this brute-force way.
Instead, most computers that have multiply hardware do multiplication using the same arithmetic-logic unit as they use for addition, cycling the data through the arithmetic-logic unit as many times as is required to compute the product. This is why, when you look at data for the speed of the various instructions on real computers, it is quite common to find that integer multiplication is 10 or more times slower than integer addition. On the Intel Pentium, for example, the hardware can execute two integer add instructions in parallel in 1/10th the time it takes to execute one integer multiply instruction.
Because of the slow speed of hardware multiplication on most computers, compiler writers have long been urged to avoid using multiply instructions whenever a short sequence of shift and add instructions could be used to accomplish the same thing. These optimizations can be expressed quite easily in languages like C and Java:
   original     substitution
   ---------------------------------
   a * 2        a << 1
   a * 3        (a << 1) + a
   a * 4        a << 2
   a * 5        (a << 2) + a
   a * 6        ((a << 1) + a) << 1
   a * 7        (a << 3) - a
   a * 8        a << 3
   a * 9        (a << 3) + a
   a * 10       ((a << 2) + a) << 1
In sum, whenever the multiplier is a constant, the compiler should consider directly adding the appropriate partial products. There will be one partial product to add for each one in the binary representation of the multiplier, plus the shift operators required to align the partial products. On the Pentium, for example, where an integer multiply takes ten times as long as a simple shift or add instruction, a good compiler will usually use a mix of shift and add instructions whenever the multiplier has five or fewer one bits in its binary representation.
Sometimes, however, we can do even better, as illustrated in the case for multiplication by seven in the above table. In that case, we could have used (((a<<1)+a)<<1)+a, adding three terms that correspond to the 3 one bits in the multiplier, but instead, we multiplied by eight and then subtracted.
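Both forms of the multiply-by-7 substitution are easy to check in C; this is a sketch, and the function names below are ours:

```c
#include <stdint.h>

/* multiply by 7 using the three one bits of the multiplier: 4a + 2a + a */
uint32_t mul7_add( uint32_t a ) { return (((a << 1) + a) << 1) + a; }

/* multiply by 7 as 8a - a, using fewer operators */
uint32_t mul7_sub( uint32_t a ) { return (a << 3) - a; }
```

The additive form uses two shifts and two adds; the subtractive form needs only one shift and one subtract, which is the saving described above.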
Another trick we can use is to factor a number and then multiply by each of the factors in succession. For example, to multiply by 100, we could multiply by 10 twice in succession, or we could multiply by 5 twice in succession and then multiply the result by 4.
   a * 100        ((((a << 1) + a) << 3) + a) << 2

   a * 10 * 10    t = ((a << 2) + a) << 1;
                  ((t << 2) + t) << 1

   a * 5 * 5 * 4  t = (a << 2) + a;
                  ((t << 2) + t) << 2
Which of these optimizations will be fastest on some particular machine? That cannot be determined from the source file, although we can make educated guesses. For classical computer architectures, each operator in the source file corresponds to one machine instruction, and when register-to-register operations are available, assignments to temporary variables used to hold intermediate results are free. For such a machine, what matters is the total count of operators in the expression. Looking at the above table, we find that simple multiplication by 100 takes 3 shifts and 2 adds, a total of 5 operators, while multiplication by 10 and then 10 again takes 4 shifts and 2 adds, a total of 6 operators, and multiplying by 5 twice and then by 4 takes 3 shifts and 2 adds. This suggests that all will be noticeably faster than a hardware multiply, if that really takes ten times as long as an add, but it does not give any one of them a great advantage over the others.
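As a sketch, here are the three alternatives written as C functions (the names are ours), each of which can be checked against an ordinary multiplication by 100:

```c
#include <stdint.h>

/* direct: 100 = 1100100 in binary, so add the shifted partial products */
uint32_t mul100_direct( uint32_t a )
{
    return ((((a << 1) + a) << 3) + a) << 2;   /* ((3a * 8) + a) * 4 */
}

/* factored as 10 * 10 */
uint32_t mul100_by10s( uint32_t a )
{
    uint32_t t = ((a << 2) + a) << 1;          /* t = 10a */
    return ((t << 2) + t) << 1;                /* 10t = 100a */
}

/* factored as 5 * 5 * 4 */
uint32_t mul100_by5s( uint32_t a )
{
    uint32_t t = (a << 2) + a;                 /* t = 5a */
    return ((t << 2) + t) << 2;                /* 5t * 4 = 100a */
}
```

Counting the operators confirms the tally in the text: 5 for the direct form, 6 for the 10-by-10 form, and 5 for the 5-5-4 form.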
Reducing these to machine code requires looking at the repertoire of shift instructions on the target machine. On the Hawk, we have two left shift instructions: MOVSL, which does a register-to-register move plus a shift in one instruction, and ADDSL, which shifts the contents of one register and then adds another register to it. The SL assembly language opcode is really ADDSL with the second register set to zero, indicating that nothing is to be added. Similar shift-and-add instructions are included in several architectures designed after the late 1970s, and in the early 1980s, a group at Hewlett-Packard was the first to describe, in detail, how to fully exploit these instructions. Here is how we can use them on the Hawk:
   SL     R3,1      ; R3 = R3 * 2 = R3 << 1

   ADDSL  R3,R3,1   ; R3 = R3 * 3 = (R3 << 1) + R3

   SL     R3,2      ; R3 = R3 * 4 = R3 << 2

   ADDSL  R3,R3,2   ; R3 = R3 * 5 = (R3 << 2) + R3

   SL     R3,1      ; R3 = R3 * 6 = R3 * 2 * 3
   ADDSL  R3,R3,1

   NEG    R1,R3     ; R3 = R3 * 7 = (R3 << 3) - R3
   ADDSL  R3,R1,3

   SL     R3,3      ; R3 = R3 * 8 = R3 << 3

   ADDSL  R3,R3,3   ; R3 = R3 * 9 = (R3 << 3) + R3

   SL     R3,1      ; R3 = R3 * 10 = R3 * 2 * 5
   ADDSL  R3,R3,2
The Hawk ADDSL instruction allows us to multiply by a wide range of small constants in just one or two instructions. Multiplication by 11 takes three instructions, and three-instruction sequences exist for all larger multipliers up to well beyond 35. A large number of these are documented in Chapter 10 of the Hawk CPU manual.
Exercises
g) Convert each of the three algorithms given above for multiplication by 100 into tightly coded Hawk instruction sequences that compute R3 = R3 * 100. Rank them in order of speed! (Note that there could be a tie, and assume you can use R1 as an auxiliary register.)
h) Write Hawk code to multiply R3 by 55.
i) Write Hawk code to multiply R3 by 60.
j) Write Hawk code to multiply R3 by 47.
k) What is the smallest constant multiplier for the Hawk that requires a 4-instruction sequence? (All the sequences given in the Hawk manual are only 3 instructions long.)
To divide one number by another using pencil and paper, we use the classical long division algorithm that most people learn in elementary school:
               5 1 2     quotient
             ---------
   1 2 8  )  6 5 5 3 8   dividend
           - 6 4 0       partial product
           -------
             1 5 3 8
           - 1 2 8       partial product
           -------
               2 5 8
             - 2 5 6     partial product
             -------
                   2     remainder
This decimal algorithm, shown here in graphical form, involves a series of subtractions from the dividend. The values subtracted are called partial products because, in fact, they are the partial products we would add when we multiply the quotient by the divisor to compute the dividend (less the remainder). You can verify this by comparing the original multiplication problem given earlier in this chapter with the division problem shown above.
Each of these partial products is the product of one digit of the quotient times the divisor. The hard part of the decimal long-division algorithm is estimating these partial products, but if we do our division in binary, that estimation problem becomes trivial because each partial product is either zero or it is equal to the divisor shifted over the appropriate number of places!
The classic way of dividing one binary number by another is to shift both the quotient and the dividend to the left with each step in the process, instead of holding them statically and shifting the partial products successively rightward. This algorithm starts with the 32-bit dividend in the least significant 32 bits of a 64-bit register, and then, after 32 shift-and-subtract steps, ends with the remainder in the high 32 bits of that 64-bit register and the quotient in the low 32 bits. Here is how this code would look if written in C on a machine that supported the type int using 32 bits and long int using 64 bits:
unsigned int divide( unsigned int dividend, unsigned int divisor )
{
    unsigned long int remquot = dividend;
    unsigned int remainder, quotient;
    int i;
    for (i = 0; i < 32; i++) {
        remquot = remquot << 1;
        if (remquot >= ((unsigned long int) divisor << 32)) {
            remquot = (remquot - ((unsigned long int) divisor << 32)) + 1;
        }
    }
    remainder = remquot >> 32;
    quotient = remquot & 0xFFFFFFFF;
    return quotient;
}
Notice that this short but difficult bit of code computes both the remainder and the quotient each time it is called, no matter which will actually be used. Because it was written in C, it cannot conveniently return both, but it is common to write assembly language divide routines that return both the remainder and quotient, and division hardware on many computers does the same.
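One conventional C workaround is to return a structure holding both halves, in the spirit of the standard library's div() function. This is a hypothetical sketch; the type and function names are ours, and it restates the shift-and-subtract loop using the fixed-width types of modern C:

```c
#include <stdint.h>

typedef struct { uint32_t quot; uint32_t rem; } udiv_t;  /* our names */

/* shift-and-subtract division returning both quotient and remainder */
udiv_t udivide( uint32_t dividend, uint32_t divisor )
{
    uint64_t remquot = dividend;              /* 64-bit working register */
    uint64_t divhigh = (uint64_t) divisor << 32;
    udiv_t result;
    int i;
    for (i = 0; i < 32; i++) {
        remquot = remquot << 1;               /* shift dividend bit into remainder */
        if (remquot >= divhigh) {
            remquot = (remquot - divhigh) + 1;  /* subtract, set quotient bit */
        }
    }
    result.rem  = (uint32_t)(remquot >> 32);
    result.quot = (uint32_t)(remquot & 0xFFFFFFFF);
    return result;
}
```

Using the worked example from this chapter, udivide(65538, 128) yields a quotient of 512 and a remainder of 2.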
If we are to write equivalent code for the Hawk, we need a way to shift a 64-bit quantity one place to the left. Obviously, we can store the 64-bit quantity in memory or in registers as a pair of consecutive 32-bit quantities, but to shift it, we have to examine some of the secondary aspects of the Hawk shift instructions. First among these is that, on the Hawk, as on most other machines, the last bit shifted out of a register by a shift instruction is shifted into the carry bit. Therefore, to extend a left shift operation, we need an instruction that incorporates the carry bit into the least significant bit of the register holding the most significant half of the extended number. There are many ways of doing this on the Hawk:
        SL      R4,1            ; shift high half left first
        SL      R3,1            ; shift low half
        BCR     NOCARRY         ; if (one shifted out of low half) {
        ADDSI   R4,1            ;   add one to the high half
NOCARRY:                        ; }
        SL      R4,1            ; shift high half left first
        SL      R3,1            ; shift low half
        ADDC    R4,R0           ; add the carry bit to the high half
        SL      R3,1            ; shift low half left first
        ADDC    R4,R4           ; shift carry into high half
        SRU     R3,1            ; shift low half right first
        SRU     R4,1            ; shift high half second
        ADJUST  R3,CMSB         ; add carry bit to high bit of low half
The first approach to double-word left shifting shown above uses no obscure features of the architecture. It works, using a conditional branch to see if the carry bit was set after shifting the low half of the number, and then explicitly adding one if it was set. This is tedious enough that a special instruction was added to the Hawk architecture for adding the carry bit to a register or to the sum of two registers, ADDC. The most obvious way to use it is to replace the conditional branch and increment logic given in the first solution, but because ADDC can add two registers plus the carry, and because adding a register to itself doubles it, shifting its value one place left, we can shift a register and shift in the carry at once, as in the third example. The final example is included for completeness; it shows the fastest sequence for a 64-bit right shift on the Hawk.
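The register-pair left shift has a direct C analog. This sketch (the function name is ours) models a 64-bit left shift on two 32-bit halves, with the bit shifted out of the low half carried into the high half, just as the SL/ADDC pair does on the Hawk:

```c
#include <stdint.h>

/* one-place left shift of a 64-bit value held as two 32-bit halves */
void shiftleft64( uint32_t *hi, uint32_t *lo )
{
    uint32_t carry = *lo >> 31;   /* the bit shifted out of the low half */
    *hi = (*hi << 1) | carry;     /* like SL then ADDC on the Hawk      */
    *lo = *lo << 1;
}
```

Shifting the pair (0, 0x80000000) one place left moves the top bit of the low half into the bottom bit of the high half, giving (1, 0).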
DIVIDE:                 ; link through R1
                        ; R3 = dividend on entry, quotient on return
                        ; R4 = unused on entry, remainder on return
                        ; R5 = divisor, unchanged
                        ; R6 = scratch, used as a loop counter
                        ; uses no other registers
        CLR     R4              ; remainder = 0
        LIS     R6,32           ; loopcount = 32
DIVLP:                          ; do {
        SL      R3,1            ;   quotient = quotient << 1, sets C bit
        ADDC    R4,R4           ;   remainder = (remainder << 1) + C
        CMP     R4,R5
        BLTU    DIVNO           ;   if (remainder >= divisor) {
        SUB     R4,R4,R5        ;     remainder = remainder - divisor
        ADDSI   R3,1            ;     quotient = quotient + 1;
DIVNO:                          ;   }
        ADDSI   R6,-1           ;   loopcount = loopcount - 1;
        BGT     DIVLP           ; } while (loopcount > 0)
        JUMPS   R1              ; return
In this code, we have referred to the high half of our 64bit working register as the remainder and the low half as the quotient, because that is what they become, eventually. As the code iterates, successive bits of the dividend are shifted out of the quotient register into the remainder before each attempt at subtracting a partial product.
It is not easy to speed up this code by tightening the loop body, but it can be improved by another method, loop unrolling. To unroll a definite loop one step, we replace the loop body with two copies of it, and then we halve the number of iterations. This is easily demonstrated on the high-level language version of our divide routine:
unsigned int divide( unsigned int dividend, unsigned int divisor )
{
    unsigned long int remquot = dividend;
    unsigned int remainder, quotient;
    int i;
    for (i = 0; i < 16; i++) {
        remquot = remquot << 1;
        if (remquot >= ((unsigned long int) divisor << 32)) {
            remquot = (remquot - ((unsigned long int) divisor << 32)) + 1;
        }
        remquot = remquot << 1;
        if (remquot >= ((unsigned long int) divisor << 32)) {
            remquot = (remquot - ((unsigned long int) divisor << 32)) + 1;
        }
    }
    remainder = remquot >> 32;
    quotient = remquot & 0xFFFFFFFF;
    return quotient;
}
Unrolling the loop one step eliminates half of the instructions used to increment and test the loop counter, and it eliminates half of the branch instructions back to the top of the loop. Unrolling the loop a full 32 times would entirely eliminate the loop control variable and all branch instructions, at the expense of duplicating the body a full 32 times.
Exercises
l) Write Hawk code for a 96-bit left shift and for a 96-bit right shift, storing the operand in registers 3, 4 and 5, least significant 32 bits first.
m) Write a new version of the Hawk division code given here with the loop body unrolled one step, as in the C code given above.
The most obvious way to handle signed multiplication and division is to remember the signs of all the operands, then take their absolute values, and finally compute the proper sign for the result, complementing the appropriate register if that sign is negative. This is illustrated here for multiplication, but the same code could be written for any arithmetic operation. In fact, on signed-magnitude computers, even addition is done this way, with parallel hardware used to selectively negate the sum depending on the signs of the operands.
int signed_multiply( int multiplier, int multiplicand )
{
    int product_negative = 0;  /* false */
    int product;
    if (multiplier < 0) {
        multiplier = -multiplier;
        product_negative = ! product_negative;
    }
    if (multiplicand < 0) {
        multiplicand = -multiplicand;
        product_negative = ! product_negative;
    }
    product = multiply( multiplier, multiplicand );
    if (product_negative) product = -product;
    return product;
}
The only problem with this approach is that it is unnecessarily complicated! So long as we use two's complement numbers, we can take advantage of the simple place-value interpretation of the binary digits of a two's complement number. For bits 0 to 30 of a 32-bit two's complement number, the weight of bit i is 2^{i}, identical to the weight for a simple binary number. For the most significant bit, the sign bit, the weight is negated, so the weight of bit 31 is -2^{31}. This leads to the remarkable fact that, when multiplying, the final partial product should be subtracted from the others, while when dividing, it should be added!
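This place-value claim is easy to check numerically. The sketch below (the function name is ours) reconstructs a two's complement value by summing bit weights, giving the sign bit the weight -2^{31}:

```c
#include <stdint.h>

/* reconstruct a two's complement value from its bits, giving bit 31
   the weight -2^31 and bit i (i < 31) the ordinary weight 2^i */
int64_t from_bits( uint32_t x )
{
    int64_t value = 0;
    int i;
    for (i = 0; i < 31; i++)
        if ((x >> i) & 1) value += (int64_t)1 << i;   /* positive weights */
    if ((x >> 31) & 1) value -= (int64_t)1 << 31;     /* negated weight   */
    return value;
}
```

For example, the all-ones word 0xFFFFFFFF sums to (2^{31} - 1) - 2^{31}, which is -1, exactly the two's complement interpretation.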
To use this fact, we recode our multiply algorithm to use a definite loop instead of an indefinite loop, doing 31 multiply steps inside the loop but handling the most significant bit outside the loop, subtracting instead of adding. Here is an example signed multiply, coded using left shifts for both the multiplier and the product:
MULTIPLYS:              ; link through R1
                        ; R3 = multiplicand on entry, product on return
                        ; R4 = multiplier on entry, destroyed
                        ; R5 = temporary copy of multiplicand, destroyed
                        ; R6 = loop counter, destroyed
        MOVE    R5,R3           ; set aside multiplicand
        CLR     R3              ; product = 0
        SL      R4,1            ; multiplier = multiplier << 1, C = sign bit
        BCR     MULSEI          ; if (sign bit was 1) {
        SUB     R3,R0,R5        ;   product = product - multiplicand
MULSEI:                         ; }
        LIS     R6,31           ; loopcount = 31
MULSLP:                         ; do {
        SL      R3,1            ;   product = product << 1
        SL      R4,1            ;   multiplier = multiplier << 1; C = top bit
        BCR     MULSEIL         ;   if (top bit was 1) {
        ADD     R3,R3,R5        ;     product = product + multiplicand;
MULSEIL:                        ;   }
        ADDSI   R6,-1           ;   loopcount = loopcount - 1
        BGT     MULSLP          ; } while (loopcount > 0)
        JUMPS   R1              ; return
The above code is pleasantly compact, but it can be accelerated by all of the methods we have already discussed. For example, the loop may be unrolled to eliminate the loop counter. This is only mildly complicated by the fact that the number of iterations, 31, is prime, requiring one extra iteration to be done outside the loop.
Exercises
n) Write Hawk code for a naive signed multiply routine that calls the unsigned multiply routine given previously.
o) Write C, C++ or Java code for a naive signed divide routine that calls something akin to the unsigned divide routine given previously.
p) Modify the code for the signed multiply routine given here so that it computes the 64-bit product in registers 3 and 4. Note that this is almost an easy job! All you need to do is replace the two independent shifts of registers 3 and 4 with a single 64-bit shift.
q) Modify the code for the signed multiply routine given here to use 15 iterations, where each iteration includes one step of loop unrolling.
r) Modify the code for signed multiply given here to do unsigned multiplication, and then compare it with the unsigned multiply code given previously. Which would be faster, and why?
As already noted, the Hawk architecture includes support for long shifts composed of a number of shorter 32-bit shifts. The same special support instructions that permit this are also useful for high-precision addition and subtraction. Thus, if some problem requires the use of 128-bit numbers, for example, using registers 4 to 7 to hold one number and registers 8 to 11 to hold another, we can add or subtract these two numbers as follows:
        ADD     R4,R4,R8        ; add least significant 32 bits first
        ADDC    R5,R9
        ADDC    R6,R10
        ADDC    R7,R11          ; add most significant 32 bits last
        SUB     R4,R4,R8        ; subtract least significant 32 bits first
        SUBB    R5,R9
        SUBB    R6,R10
        SUBB    R7,R11          ; subtract most significant 32 bits last
There are two ways to build a high-precision multiply routine. One is simply to take a binary multiply routine and expand it, using multi-register shifts and multi-register adds. The other is to build a high-precision multiply from a previously developed lower-precision multiply. Given a 32-bit unsigned multiply, and given that ab and cd are 64-bit quantities, where a and c are the most significant 32 bits of each and b and d are the least significant 32 bits, the product ab×cd is a×c×2^{64} + a×d×2^{32} + b×c×2^{32} + b×d. All of the individual products can be computed using 32-bit multiplies that produce 64-bit products, so only the addition of the final sum requires long add instructions. This extended precision multiply algorithm was thoroughly documented by Charles Babbage back in the 19th century!
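As a sketch of this four-partial-product scheme (function and variable names are ours), here is a C routine that builds a 128-bit product, held as two 64-bit halves, from nothing wider than 32×32-bit multiplications:

```c
#include <stdint.h>

/* multiply 64-bit ab by 64-bit cd, giving a 128-bit product in two
   64-bit halves, using only 32x32 -> 64 bit multiplications */
void mul128( uint64_t ab, uint64_t cd, uint64_t *hi, uint64_t *lo )
{
    uint64_t a = ab >> 32, b = ab & 0xFFFFFFFF;   /* high and low halves */
    uint64_t c = cd >> 32, d = cd & 0xFFFFFFFF;

    uint64_t bd = b * d;    /* weight 2^0  */
    uint64_t ad = a * d;    /* weight 2^32 */
    uint64_t bc = b * c;    /* weight 2^32 */
    uint64_t ac = a * c;    /* weight 2^64 */

    /* sum the three contributions to the middle 32 bits, keeping carries */
    uint64_t mid = (bd >> 32) + (ad & 0xFFFFFFFF) + (bc & 0xFFFFFFFF);

    *lo = (mid << 32) | (bd & 0xFFFFFFFF);
    *hi = ac + (ad >> 32) + (bc >> 32) + (mid >> 32);
}
```

The variable mid plays the role of the long add in the text: it gathers the two 2^{32}-weighted partial products plus the upper half of b×d, and its own upper bits carry into the high half of the result.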
Exercises
r) Write Hawk code for a 128-bit right shift.
s) Write Hawk code for a 128-bit left shift.
t) Write Hawk code for a 64-bit unsigned integer multiply, producing a 128-bit product.
While binary arithmetic is easy, it is not the only way to do arithmetic on computers. Many of the computers built in the 1940s and 1950s and even into the 1960s were built to support decimal arithmetic, and some programming languages, most notably COBOL, assume that numbers are represented and manipulated in decimal. Many lab instruments use decimal representations simply because they are designed to operate small decimal displays on the instrument itself, so when these instruments are augmented with computer interfaces, the data is sent to the computer in decimal. The internal number representations used on most pocket calculators are also decimal.
How are decimal numbers manipulated on binary computers? There are two common schemes, binary-coded decimal and excess-3 decimal. In both cases, each digit is represented with 4 bits, but the values used to represent the digits differ:
      BCD      decimal   Excess-3
    0 0 0 0       0      0 0 1 1
    0 0 0 1       1      0 1 0 0
    0 0 1 0       2      0 1 0 1
    0 0 1 1       3      0 1 1 0
    0 1 0 0       4      0 1 1 1
    0 1 0 1       5      1 0 0 0
    0 1 1 0       6      1 0 0 1
    0 1 1 1       7      1 0 1 0
    1 0 0 0       8      1 0 1 1
    1 0 0 1       9      1 1 0 0
The binary-coded decimal or BCD representation uses the natural binary weight for each decimal digit, while the excess-3 representation adds three to the simple binary value of each digit. The reason for this addition will become clear in a moment, but even without knowing the ultimate reason, it is easy to see that the excess-3 representation is symmetric, in the sense that the digits 5 and above are represented by the one's complements of the digits below 5. This makes 10's complement representations of signed numbers particularly practical, if values whose most significant decimal digit is 5 or greater are taken to represent negative numbers.
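This symmetry is easy to check mechanically. Here is a small C sketch (the function name is an illustrative assumption) verifying that complementing the four bits of the excess-3 code for a digit d yields the code for 9-d:

```c
/* Check the complement symmetry of excess-3: for every decimal
   digit d, the one's complement of its 4-bit excess-3 code is the
   excess-3 code of 9-d. */
int ex3_symmetry_holds(void) {
    for (int d = 0; d <= 9; d++) {
        int code = d + 3;          /* excess-3 code for d      */
        int comp = code ^ 0xF;     /* one's complement, 4 bits */
        if (comp != (9 - d) + 3)   /* excess-3 code for 9-d    */
            return 0;
    }
    return 1;                      /* symmetry holds for all digits */
}
```

This is exactly why taking the 10's complement of an excess-3 number is so cheap: complement all the bits, then add one in the least significant digit.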
When designing an adder for decimal numbers, we must figure out how to recognize the need to carry one from one decimal digit to the next. This means recognizing that the sum of two digits is 10 or greater. Although we could construct a decimal adder directly, with a special carry propagation circuit to detect oversized sums, the common trick, since the early 1960s, has been to bias the numbers so that the simple binary carry out of each digit position will be right. This is done by adding 6 to each BCD digit of the sum, or alternatively, by using an excess-3 representation, because the extra 3 in each excess-3 digit of the two operands adds up to 6. Of course, after this is done, corrections must be made to the result! Here is an example addition problem, done with 8-bit or 2-digit BCD numbers:
              1 1 1          carry bits
    0 1 0 0 0 1 0 1          addend = 45
    0 1 0 0 0 1 1 0          augend = 46
  + 0 1 1 0 0 1 1 0          sixes
  -----------------
    1 1 1 1 0 0 0 1          preliminary sum
  - 0 1 1 0 0 0 0 0          correction
  -----------------
    1 0 0 1 0 0 0 1          the sum = 91
Note that the sum for the decimal digits that produced a carry is correct, but that the sum for the digits that did not produce a carry is off by 6. So, we must subtract 6 from each digit position that did not produce a carry!
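The worked example above can be sketched in C for two-digit packed BCD operands. This is illustration only; the function name bcd_add and the exclusive-or trick for detecting nibble carries are assumptions, not the Hawk mechanism (which uses the ADJUST instruction and the decimal carry field):

```c
#include <stdint.h>

/* Add-six-then-correct BCD addition for two digits packed one per
   nibble: bias the sum by 0x66 so binary nibble carries match
   decimal carries, then subtract 6 from every digit whose nibble
   produced no carry. */
uint8_t bcd_add(uint8_t x, uint8_t y) {
    unsigned t = x + y + 0x66;                      /* biased preliminary sum */
    unsigned carries = (t ^ x ^ y ^ 0x66) & 0x110;  /* carries into bits 4, 8 */
    if (!(carries & 0x010)) t -= 0x06;              /* low digit did not carry  */
    if (!(carries & 0x100)) t -= 0x60;              /* high digit did not carry */
    return (uint8_t)t;
}
```

Running the example from the text, bcd_add(0x45, 0x46) forms the preliminary sum 0xF1, finds that only the low digit carried, subtracts 0x60, and returns 0x91.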
The Hawk machine has a special field in the processor status word, the decimal carry field, used to record the carry out of each decimal digit of the arithmetic logic unit. This field is of almost no use except for BCD arithmetic, and it borrows ideas that were originally introduced in the Intel 8080 to support BCD arithmetic:
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00
                                                |      BCD carry      |             N  Z  V  C
Bits 8 to 15 of the Hawk processor status word record the carry out of each BCD or hexadecimal digit of the arithmetic logic unit. These bits are available to the ADJUST instruction for performing the appropriate adjustment after the preliminary add instruction of a BCD or excess-3 addition, leading to the following short instruction sequences for decimal addition on the Hawk:
        ; precondition: R3 and R4 contain BCD numbers
        LIL     R5,#666666
        ORIS    R5,#66          ; load sixes
        ADD     R5,R5,R3        ; correct addend by sixes
        ADD     R5,R5,R4        ; produce preliminary sum
        ADJUST  R5,BCD          ; correct the result using the BCD carry bits
        ; postcondition: R5 = R3 + R4, the BCD sum

        ; precondition: R3 and R4 contain excess-3 numbers
        ADD     R5,R3,R4        ; produce preliminary sum
        ADJUST  R5,EX3          ; correct the result using the BCD carry bits
        ; postcondition: R5 = R3 + R4, the excess-3 sum
Here, the advantage of excess-3 should be clear! With BCD, we must keep the constant 66666666_{16} in a register, and we must always add this constant to one of our operands before each addition. With excess-3, because the extra 6 is already incorporated into the biased operands before addition, all we need to do is make a slightly more complex correction after each sum.
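The excess-3 sequence can also be sketched in C, to show what the correction after the preliminary sum must do. The function name ex3_add and the carry-detection trick are illustrative assumptions; on the Hawk this correction is what ADJUST ...,EX3 performs using the decimal carry field:

```c
#include <stdint.h>

/* Excess-3 addition for two digits packed one per nibble.  Each
   operand already carries a +3 bias per digit, so the binary sum
   contains the needed +6; afterward, a digit that produced a carry
   needs +3 and a digit that did not needs -3 to restore the bias. */
uint8_t ex3_add(uint8_t x, uint8_t y) {
    unsigned t = x + y;                       /* preliminary sum */
    unsigned carries = (t ^ x ^ y) & 0x110;   /* carries into bits 4, 8 */
    if (carries & 0x010) t += 0x03; else t -= 0x03;   /* low digit  */
    if (carries & 0x100) t += 0x30; else t -= 0x30;   /* high digit */
    return (uint8_t)t;
}
```

For example, 45 and 46 are 0x78 and 0x79 in packed excess-3; adding them gives the preliminary sum 0xF1, and the correction yields 0xC4, which is 91 in excess-3.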
Exercises
u) Write Hawk code to take the 10's complement of a BCD number and then explain how to detect if the result is positive or negative.
v) Write Hawk code to take the 10's complement of an excess-3 number and then explain how to detect if the result is positive or negative.
w) Write Hawk code to convert from excess-3 to BCD and from BCD to excess-3.
x) What value is added to or subtracted from each decimal digit by the ADJUST ...,EX3 instruction? You may have to work a few examples to figure this out.
Floating point arithmetic is a common feature of modern computers, but at the start of the computer age it was not obvious that there was any need for it! John von Neumann is reputed to have said that "a good programmer should be able to keep the point in his head" when someone asked him about the value of floating point hardware. How do you keep the point in your head? The answer is remarkably easy! You do arithmetic using integers, but instead of counting, say, integer numbers of inches, you count integer numbers of 128ths of an inch. If you do this, you can think of the numbers as being represented in this form:
24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00 -1 -2 -3 -4 -5 -6 -7
|                             Integer part                               |      Fraction      |
The bits of the fixed-point number given above have been renumbered, so instead of considering the most significant bit to be bit 31, it is bit 24, and instead of considering the least significant bit to be bit 0, it is bit -7. This is because the least significant bit has the weight 2^{-7} in the place-value system, or 1/128, while the most significant bit has the weight 2^{24}. We always number the bits so that the one's place is numbered 0. When we write a number on paper, this is exactly what the point signifies. That is why we avoid the term decimal point here: the point does not in any way depend on the number base. In any number base, it marks the one's place, the dividing line between the integer part of the number and the fractional part.
Consider the binary number 0.1000000; this is 1/2 or 64/128. Similarly, 0.0100000 is 1/4 or 32/128 in the fixed-point number system outlined above. If we want to represent 0.1_{10}, we cannot do it exactly in this system; the closest we can come are 0.0001100, which is 12/128 or 0.09375, on the low side, and 0.0001101, which is 13/128 or 0.1015625, on the high side. The fundamental problem we face is that there is no exact representation of one tenth in binary, just as there is no exact representation of 1/3 in decimal. All we can hope for is a good approximation.
The rules for dealing with binary fixed-point arithmetic are exactly the familiar rules for dealing with decimal fractions. When adding or subtracting two fixed-point numbers, first shift them to align the points, and then add or subtract. When multiplying, first multiply, then count the digits to the right of the point in the multiplier and in the multiplicand; the sum of these two counts tells where the point goes in the product.
A C programmer using the fixed-point number representation described here would do arithmetic as follows:
        int a, b, c;  /* fixed point, with 7 places right of the point */
        int d;        /* fixed point, with 14 places right of the point */

        /* all variables have the point in the same place, easy! */
        a = b + c;          /* no shifts needed */
        a = b - c;          /* no shifts needed */
        a = (b * c) >> 7;   /* product had 14 places right of the point */

        /* variables with point in different places */
        d = (a + b) << 7;   /* shift result to fit */
        a = (d >> 7) - b;   /* align points before subtract */
        d = b * c;          /* luck! product has right number of places */
Unfortunately, languages like C and Java provide no help to the programmer doing fixed-point arithmetic. It is up to the programmer to remember to shift all constants appropriately, or to code them with the correct built-in bias, and to shift operands left or right at the appropriate times. This is what von Neumann thought programmers could do in their heads, but it was clear by the early 1960s that programmers had real difficulty doing it. Floating point hardware or software eliminates the need for this.
Exercises
y) Given that the divisor and the dividend have points in the same place, where is the point in the quotient? Where is the point in the remainder? For a concrete example, consider the possibility that both divisor and dividend have 7 places after the point.
z) Give Hawk code to split a 2's complement fixed-point number given in register 3 into its integer part, in register 3, and its fractional part, in register 4. The initial number should have 7 places after the point. The fractional part of the result will have one bit for the sign and 31 bits after the point.