14. Hawk Support for Multiply and Divide
Part of
the Hawk Manual

14.1. Multiplication by Constants  Methodology
14.2. Small Constant Multipliers  Optimal Code
14.3. Multiplication by Variables
14.4. Radix 16 Multiplication  Fast Code
14.5. Division by Powers of 2
14.6. Division by Constants  Methodology
14.7. Division by Variables
The Hawk has no integer multiply instruction. Multiplication hardware fits poorly with modern pipelined architectures, and the most common case, multiplication by small constants, is always faster using sequences of easy to pipeline instructions. For example, the Pentium instruction timings done by Quantasm Corporation, show that the base time for a 32bit integer add or subtract on that machine is 1 clock cycle, and both can frequently be paired with other instructions to execute concurrently. In contrast, a 32bit multiply takes 10 cycles and a divide takes 41 cycles, and neither can be paired. Clearly, these instructions should be avoided wherever possible, so why put them in the instruction set?
In the Hawk, MOVESL, ADDSL and SL can each be used to multiply by 16 distinct constants with a single instruction. This works identically for signed and unsigned integers:
SL r,s — r = r × 2^{s} for 0 < s < 17
ADDSL r,r,s — r = r × (1+2^{s}) for 0 < s < 17
For example:
MOVESL R3,R4,1 ; R3 gets R4 times 2 SL R3,1 ; R3 times 2 ADDSL R3,R3,1 ; R3 times 3 SL R3,2 ; R3 times 4 ADDSL R3,R3,2 ; R3 times 5 SL R3,3 ; R3 times 8 ADDSL R3,R3,3 ; R3 times 9
When the desired constant multiplier does not fit this pattern, it is
frequently possible to factor the multiplier into factors where
multiplication by each factor can be done by following this pattern.
Multiplying by each
factor in sequence is, of course, equivalent to multiplying by their
product. Here are some examples that use this trick for some common
small multipliers that aren't in the above list:
SL R3,1 ; \ R3 times 6 = 2 x 3 ADDSL R3,R3,1 ; / SL R3,1 ; \ R3 times 10 = 2 x 5 ADDSL R3,R3,2 ; / ADDSL R3,R3,1 ; \ R3 times 15 = 3 x 5 ADDSL R3,R3,2 ; /
If the desired multiplier cannot be factored, the next thing to consider is computing the product as a sum of partial products.
Generally, our goal will be to search out values of q and r that fit one of the patterns above. In the extreme, of course, any number can be represented as a sum of powers of two, but faster multiply code will frequently rest on other partial products. Summing partial products requires the use of a temporary register; here, we use R1. In loading this temporary with a copy of the multiplicand, it is useful to recall that we can use MOVESL to scale the multiplicand and we can use NEG to negate it.
NEG R1,R3 ; \ R3 times 7 = 8 + (1) ADDSL R3,R1,3 ; / MOVESL R1,R3,1 ; \ R3 times 10 = 8 + 2 ADDSL R3,R1,3 ; / MOVESL R1,R3,1 ; \ R3 times 11 = 9 + 2 ADDSL R3,R3,3 ;  ADD R3,R3,R1 ; / MOVE R1,R3 ; \ R3 times 13 = (3 x 4) + 1 ADDSL R3,R3,1 ;  ADDSL R3,R1,2 ; / MOVESL R1,R3,1 ; \ R1 times 14 = (3 x 4) + 2 ADDSL R3,R3,1 ;  ADDSL R3,R1,2 ; /
When all else fails, note that any number can be expressed as a sum of powers of two. The binary representation of the number has a onebit for each power of two that is present in this sum. Exploiting this leads directly to code to multiply by an arbitrary constant:
MOVE R1,R3 ; \ ADDSL R3,R1,2 ;  R3 times 10 = 1010 SL R3,1 ; / MOVE R1,R3 ; \ ADDSL R3,R1,1 ;  R3 times 27 = 11011 ADDSL R3,R1,2 ;  ADDSL R3,R1,1 ; / MOVE R1,R3 ; \ ADDSL R3,R1,1 ;  R3 times 100 = 1100100 ADDSL R3,R1,3 ;  SL R3,2 ; / MOVE R1,R3 ; \ ADDSL R3,R1,1 ;  R3 times 117 = 1110101 ADDSL R3,R1,1 ;  ADDSL R3,R1,2 ;  ADDSL R3,R1,2 ; /
In general, for a multiplier with n ones in its binary representation, we can compute the product by first making a copy of the multiplicand and then performing n–1 ADDSL operations, and if the binary representation of the multiplier has trailing zeros, one additional SL instruction. The shift counts for the ADDSL instructions can be read as the distance between consecutive ones in the multiplier.
It is difficult to give a general rule for which model will give the shortest instruction sequence for multiplication. For each multiplier, a different approach may yield a better result. Trial and error gives us the following instruction sequences for multipliers from 2 to 10:
SL R3,1 ; R3 times 2 ADDSL R3,R3,1 ; R3 times 3 SL R3,2 ; R3 times 4 ADDSL R3,R3,2 ; R3 times 5 SL R3,1 ; \ R3 times 6 = 2 x 3 ADDSL R3,R3,1 ; / NEG R1,R3 ; \ R3 times 7 = 8 + (1) ADDSL R3,R1,3 ; / SL R3,3 ; R3 times 8 ADDSL R3,R3,3 ; R3 times 9 SL R3,1 ; \ R3 times 10 = 2 x 5 ADDSL R3,R3,2 ; /
None of the above require more than 2 instructions. For multipliers from 11 to 38, 3 instructions suffice:
MOVE R1,R3 ; \ R3 times 11 = 1011 ADDSL R3,R1,2 ;  ADDSL R3,R1,1 ; / SL R3,2 ; \ R3 times 12 = 4 x 3 ADDSL R3,R3,1 ; / MOVE R1,R3 ; \ R3 times 13 = 1101 ADDSL R3,R1,1 ;  ADDSL R3,R1,2 ; / MOVESL R1,R3,1 ; \ R3 times 14 = (3 x 4) + 2 ADDSL R3,R3,1 ;  ADDSL R3,R1,2 ; / ADDSL R3,R3,1 ; \ R3 times 15 = 3 x 5 ADDSL R3,R3,2 ; / SL R3,4 ; R3 times 16 ADDSL R3,R3,4 ; R3 times 17 ADDSL R3,R3,3 ; \ R3 times 18 = 9 x 2 SL R3,1 ; / MOVE R1,R3 ; \ R3 times 19 = 10011 ADDSL R3,R1,3 ;  ADDSL R3,R1,1 ; / SL R3,2 ; \ R3 times 20 = 4 x 5 ADDSL R3,R3,2 ; / MOVE R1,R3 ; \ R3 times 21 = 10101 ADDSL R3,R1,2 ;  ADDSL R3,R1,2 ; / MOVESL R1,R3,1 ; \ R3 times 22 = 10110 (with a trick!) ADDSL R3,R1,3 ;  ADDSL R3,R1,1 ; / NEG R1,R3 ; \ R3 times 23 = (3 x 8) + (1) ADDSL R3,R3,1 ;  ADDSL R3,R1,3 ; / SL R3,3 ; \ R3 times 24 = 8 x 3 ADDSL R3,R3,1 ; / ADDSL R3,R3,2 ; \ R3 times 25 = 5 x 5 ADDSL R3,R3,2 ; / MOVESL R1,R3,1 ; \ R3 times 26 = (3 x 8) + 2 ADDSL R3,R3,1 ;  ADDSL R3,R1,3 ; / ADDSL R3,R3,3 ; \ R3 times 27 = 9 x 3 ADDSL R3,R3,1 ; / MOVESL R1,R3,2 ; \ R3 times 28 = (3 x 8) + 4 ADDSL R3,R3,1 ;  ADDSL R3,R1,3 ; / NEG R1,R3 ; \ R3 times 29 = 32 + (3) ADDSL R1,R1,1 ;  ADDSL R3,R1,5 ; / SL R3,1 ; \ R3 times 30 = 2 x 3 x 5 ADDSL R3,R3,1 ;  ADDSL R3,R3,2 ; / NEG R1,R3 ; \ R3 times 31 = 32 + (1) ADDSL R3,R3,5 ; / SL R3,5 ; R3 times 32 ADDSL R3,R3,5 ; R3 times 33 ADDSL R3,R3,4 ; \ R3 times 34 = (16 + 1) x 2 SL R3,1 ; / MOVESL R1,R3,1 ; \ R3 times 35 = 3 + 32 ADD R1,R1,R3 ;  ADDSL R3,R1,5 ; / MOVESL R1,R3,2 ; \ R3 times 36 = 32 + 4 ADDSL R3,R1,5 ; / MOVE R1,R3 ; \ R3 times 37 = 5 + 32 ADDSL R1,R1,2 ;  ADDSL R3,R1,5 ; / MOVESL R1,R3,1 ; \ R3 times 38 = (2 x 3) + 32 ADDSL R1,R1,1 ;  ADDSL R3,R1,5 ; /
Multiplying by 39 takes 4 instructions, but multiplying by 100 takes only 3. Extending this table is tedious and the payoff is limited. Occasionally, however, some applicatons require multiplication by larger constants. For example, some pseudorandom number generators reqsuire fast multiplication by strange values such as 16,807, 39,373, 69,621 and 48,271.
The classic shiftandadd algorithm for multiplying is known by many names; if, in each step, the multiplier is right shifted and the multiplicand is left shifted, so that the partial products are added in order of increasing significance, it is called the Russian Peasant Algorithm, giving credit to one folk origin.
Whatever the shift direction or the order of additions, all basic binary multiplication algorithms work as follows: In each multiply step, one bit of the multiplier is tested and both the multiplier and product are shifted one place; if the bit is one, the multiplicand is added, as a partial product, to the product being accumulated. It takes 32 such steps to compute the 64bit product of two 32bit unsigned numbers.
In many cases, only the low 32 bits of the 64bit product are, so the following simplified code is useful. This shifts both the product and multiplier left, adding the most significant partial product first:
TIMESUL:; unsigned multiply, low 32bits of the product ; expects: R1  the return address ; R4  the multiplier ; R5  the multiplicand ; returns: R3  the low 32 bits of product ; destroys: R4  all bits shifted away ; R6  used as loop counter CLR R3 LIS R6,32 ; load loop counter TMULLP: ;  start of multiply step SL R3,1 ; shift product SL R4,1 ; shift multiplier BCR TMULCN ; if high bit was one ADD R3,R3,R5; add partial product to product TMULCN: ;  end of multiply step ADDSI R6,1 ; count BGT TMULLP ; iterate JUMPS R1 ; return
Signed multiplication in the 2's complement number system poses only small added challenges. The important thing to note is that the effective weight of the sign bit is negative, so the partial product computed from the sign bit must be subtracted from the product instead of being added. Therefore, we handle the sign bit with a special multiply step that adds instead of subtracts. Aside from this change, the following code uses the same algorithm as the unsigned code given above:
TIMESL: ; signed multiply, low 32 bits of the product ; expects: R1  the return address ; R4  the multiplier ; R5  the multiplicand ; returns: R3  the low 32 bits of product ; destroys: R4  all bits shifted away ; R6  used as loop counter CLR R3 ;  start of special multiply step SL R4,1 ; shift multiplier BCR TMSSKP ; if sign bit was 1 SUB R3,R0,R5; subtract partial product from product TMSSKP: ;  end of special multiply step LIS R6,31 ; load loop counter TMSLLP: ;  start of multiply step SL R3,1 ; shift product SL R4,1 ; shift multiplier BCR TMSLCN ; if high bit was one ADD R3,R3,R5; add partial product to product TMSLCN: ;  end of multiply step ADDSI R6,1 ; count BGT TMSLLP ; iterate JUMPS R1 ; return
An obvious way to speed up this code is to unroll the loops. Here is the code for unsigned integer multiply, with the loop partially unrolled so there are 2 multiply steps per iteration:
TIMESUL:; unsigned multiply, low 32bits of the product ; expects: R1  the return address ; R4  the multiplier ; R5  the multiplicand ; returns: R3  the low 32 bits of product ; destroys: R4  all bits shifted away ; R6  used as loop counter CLR R3 LIS R6,16 ; load loop counter TMULLP: ;  start of first multiply step SL R3,1 ; shift product SL R4,1 ; shift multiplier BCR TMULC1 ; if high bit was one ADD R3,R3,R5; add partial product to product TMULC1: ;  end of first, start of second SL R3,1 ; shift product SL R4,1 ; shift multiplier BCR TMULC2 ; if high bit was one ADD R3,R3,R5; add partial product to product TMULC2: ;  end of second multiply step ADDSI R6,1 ; count BGT TMULLP ; iterate JUMPS R1 ; return
Each multiply step in the original code had 4 instructions, plus 2 instructions for loop control. Each unrolling, adding an additional multiply step to the loop body, eliminates 2 more loop control instructions. If the loop is fully unrolled, so all 32 multiply steps are in line, the net performance improvement would be 1/3. On a pipelined machine, the elimination of the branch penalty may lead to even greater improvements.
The following table summarizes the potential performance gains by unrolling the unsigned multiply routine, from no unrolling with one multiply step per iteration, to completely unrolled with 32 multiply steps and no loop. Execution time can be estimated from the number of instructions, but on pipelined machines, there will be branch penalties that may only apply to unpredictable branches.
Unrolling (steps per iteration)  1  2  n < 32  32 

Code size, bytes  18  26  10 + 8n  260 
Worst case instructions  195  163  131 + 64/n  130 
Average instructions  179  147  115 + 64/n  114 
Worst case branch count  63  47  31 + 32/n  32 
Average branch count  47  31  15 + 32/n  16 
Predictable branches  31  15  –1 + 32/n  0 
Another way to speed up multiplication is to move from binary to a higher radix such as base 16. This cuts the number of iterations from 32 down to 8, but it requires a caseselect construct for each of the 16 possibilities at each step. This idea is borrowed from the way the HP PARISC does integer multiplication.
Here, BTRUNC is used for case selection based on the rightmost bits of the multiplier. bits of a register. The code right shifts the multiplier after each step, adding the least significant partial product first. It keeps only the least significant 32 bits of the product. The first version of this code is written for clarity:
TIMESUL:; unsigned multiply, low 32bits of the product ; expects: R1  the return address ; R4  the multiplier ; R5  the multiplicand ; returns: R3  the low 32 bits of product ; destroys: R4  all bits shifted away ; R6  used as loop counter ; R7  used for partial products CLR R3 LIS R6,8 ; load loop counter TMULLP: ;  start of radix 16 multiply step MOVE R7,R5 ; copy multiplicand for partial product computation BTRUNC R4,4 ; decode least 4 bits of multiplier BR TMUL0 ; branch table for BTRUNC BR TMUL1 BR TMUL2 BR TMUL3 BR TMUL4 BR TMUL5 BR TMUL6 BR TMUL7 BR TMUL8 BR TMUL9 BR TMULA BR TMULB BR TMULC BR TMULD BR TMULE BR TMULF TMUL0: CLR R7 ; partial product is zero BR TMULCN ; done with 0 TMUL1: BR TMULCN ; done with 1 TMUL2: SL R7,1 ; multiply by two BR TMULCN ; done with 2 TMUL3: ADDSL R7,R7,1 ; multiply by three BR TMULCN ; done with 3 TMUL4: SL R7,2 ; multiply by four BR TMULCN ; done with 4 TMUL5: ADDSL R7,R7,2 ; multiply by five BR TMULCN ; done with 5 TMUL6: SL R7,1 ; multiply by two ADDSL R7,R7,1 ; multiply by three BR TMULCN ; done with 6 TMUL7: ADDSL R7,R7,1 ; multiply by three ADDSL R7,R5,1 ; multiply by two and add one BR TMULCN ; done with 7 TMUL8: SL R7,3 ; multiply by eight BR TMULCN ; done with 8 TMUL9: ADDSL R7,R7,3 ; multiply by nine BR TMULCN ; done with 9 TMULA: SL R7,1 ; multiply by two ADDSL R7,R7,2 ; multiply by five BR TMULCN ; done with 10 TMULB: ADDSL R7,R7,2 ; multiply by five ADDSL R7,R5,1 ; multiply by two and add one BR TMULCN ; done with 11 TMULC: ADDSL R7,R7,1 ; multiply by three SL R7,2 ; multiply by four BR TMULCN ; done with 12 TMULD: ADDSL R7,R7,1 ; multiply by three ADDSL R7,R5,2 ; multiply by four and add 1 BR TMULCN ; done with 13 TMULE: ADDSL R7,R7,1 ; multiply by three ADDSL R7,R5,1 ; multiply by two and add one SL R7,1 ; multiply by two BR TMULCN ; done with 14 TMULF: ADDSL R7,R7,1 ; multiply by three ADDSL R7,R7,2 ; multiply by five BR TMULCN ; done with 15 TMULCN: ADD R3,R3,R7; accumulate product SL R5,4 ; shift multiplicand SR R4,4 ; shift multiplier ;  end of radix 16 multiply step ADDSI R6,1 ; count BGT TMULLP ; iterate JUMPS R1 ; return
The above radix16 algorithm is significantly faster than the unsigned binary code given previously, yet it is more compact than the fully unrolled binary multiply algorithm. If integer multiply speed is paramount, we can optimize the above code at the expense of readability. Such optimization may be justified because, once debugged, this code can be used by a many applications; the gain, however, is small.
TIMESUL:; unsigned multiply, low 32bits of the product ; expects: R1  the return address ; R4  the multiplier ; R5  the multiplicand ; returns: R3  the low 32 bits of product ; destroys: R4  all bits shifted away ; R6  used as loop counter ; R7  used for partial products CLR R3 LIS R6,8 ; load loop counter TMULLP: ;  start of radix 16 multiply step BTRUNC R4,4 ; decode least 4 bits of multiplier BR TMUL0 ; branch table for BTRUNC BR TMUL1 BR TMUL2 BR TMUL3 BR TMUL4 BR TMUL5 BR TMUL6 BR TMUL7 BR TMUL8 BR TMUL9 BR TMULA BR TMULB BR TMULC BR TMULD BR TMULE ; the following code handles case 15! MOVE R7,R5 ; copy multiplicand for partial product computation ADDSL R7,R7,1 ; multiply by three ADDSL R7,R7,2 ; multiply by five  R7 = R5 x 15 ADD R3,R3,R7; accumulate product  start standard epilogue SL R5,4 ; shift multiplicand SR R4,4 ; shift multiplier  to end multiply step ADDSI R6,1 ; count BGT TMULLP ; iterate JUMPS R1 ; return TMUL0: ; case 0 has an optimized epilogue with no ADD SL R5,4 ; shift multiplicand SR R4,4 ; shift multiplier  to end multiply step ADDSI R6,1 ; count BGT TMULLP ; iterate JUMPS R1 ; return TMUL1: ; case 1 has an optimized epilogue, MOVE eliminated ADD R3,R3,R5; accumulate product SL R5,4 ; shift multiplicand SR R4,4 ; shift multiplier  to end multiply step ADDSI R6,1 ; count BGT TMULLP ; iterate JUMPS R1 ; return TMUL2: MOVESL R7,R5,1 ; multiply by two  R7 = R5 x 2 ADD R3,R3,R7; accumulate product  start standard epilogue SL R5,4 ; shift multiplicand SR R4,4 ; shift multiplier  to end multiply step ADDSI R6,1 ; count BGT TMULLP ; iterate JUMPS R1 ; return TMUL3: MOVE R7,R5 ; copy multiplicand for partial product computation ADDSL R7,R7,1 ; multiply by three  R7 = R5 x 3 ADD R3,R3,R7; accumulate product  start standard epilogue SL R5,4 ; shift multiplicand SR R4,4 ; shift multiplier  to end multiply step ADDSI R6,1 ; count BGT TMULLP ; iterate JUMPS R1 ; return TMUL4: MOVESL R7,R5,2 ; multiply by four  R7 = R5 x 4 ADD R3,R3,R7; accumulate product  start standard epilogue SL R5,4 ; shift multiplicand SR R4,4 ; shift multiplier  to end multiply step ADDSI R6,1 ; count BGT TMULLP ; iterate JUMPS R1 ; return TMUL5: MOVE R7,R5 ; copy multiplicand for partial product computation ADDSL R7,R7,2 ; multiply by five  R7 = R5 x 5 ADD R3,R3,R7; accumulate product  start standard epilogue SL R5,4 ; shift multiplicand SR R4,4 ; shift multiplier  to end multiply step ADDSI R6,1 ; count BGT TMULLP ; iterate JUMPS R1 ; return TMUL6: MOVESL R7,R5,1 ; multiply by two ADDSL R7,R7,1 ; multiply by three  R7 = R5 x 6 ADD R3,R3,R7; accumulate product  start standard epilogue SL R5,4 ; shift multiplicand SR R4,4 ; shift multiplier  to end multiply step ADDSI R6,1 ; count BGT TMULLP ; iterate JUMPS R1 ; return TMUL7: MOVE R7,R5 ; copy multiplicand for partial product computation ADDSL R7,R7,1 ; multiply by three ADDSL R7,R5,1 ; multiply by two and add one  R7 = R5 x 7 ADD R3,R3,R7; accumulate product  start standard epilogue SL R5,4 ; shift multiplicand SR R4,4 ; shift multiplier  to end multiply step ADDSI R6,1 ; count BGT TMULLP ; iterate JUMPS R1 ; return TMUL8: MOVESL R7,R5,3 ; multiply by 8  R7 = R5 x 8 ADD R3,R3,R7; accumulate product  start standard epilogue SL R5,4 ; shift multiplicand SR R4,4 ; shift multiplier  to end multiply step ADDSI R6,1 ; count BGT TMULLP ; iterate JUMPS R1 ; return TMUL9: MOVE R7,R5 ; copy multiplicand for partial product computation ADDSL R7,R7,3 ; multiply by nine  R7 = R5 x 9 ADD R3,R3,R7; accumulate product  start standard epilogue SL R5,4 ; shift multiplicand SR R4,4 ; shift multiplier  to end multiply step ADDSI R6,1 ; count BGT TMULLP ; iterate JUMPS R1 ; return TMULA: MOVESL R7,R5,1 ; multiply by two ADDSL R7,R7,2 ; multiply by five  R7 = R5 x 10 ADD R3,R3,R7; accumulate product  start standard epilogue SL R5,4 ; shift multiplicand SR R4,4 ; shift multiplier  to end multiply step ADDSI R6,1 ; count BGT TMULLP ; iterate JUMPS R1 ; return TMULB: MOVE R7,R5 ; copy multiplicand for partial product computation ADDSL R7,R7,2 ; multiply by five ADDSL R7,R5,1 ; multiply by two and add one  R7 = R5 x 11 ADD R3,R3,R7; accumulate product  start standard epilogue SL R5,4 ; shift multiplicand SR R4,4 ; shift multiplier  to end multiply step ADDSI R6,1 ; count BGT TMULLP ; iterate JUMPS R1 ; return TMULC: MOVESL R7,R5,2 ; multiply by four ADDSL R7,R7,1 ; multiply by three  R7 = R5 x 12 ADD R3,R3,R7; accumulate product  start standard epilogue SL R5,4 ; shift multiplicand SR R4,4 ; shift multiplier  to end multiply step ADDSI R6,1 ; count BGT TMULLP ; iterate JUMPS R1 ; return TMULD: MOVE R7,R5 ; copy multiplicand for partial product computation ADDSL R7,R7,1 ; multiply by three ADDSL R7,R5,2 ; multiply by four and add 1  R7 = R5 x 13 ADD R3,R3,R7; accumulate product  start standard epilogue SL R5,4 ; shift multiplicand SR R4,4 ; shift multiplier  to end multiply step ADDSI R6,1 ; count BGT TMULLP ; iterate JUMPS R1 ; return TMULE: MOVE R7,R5 ; copy multiplicand for partial product computation ADDSL R7,R7,1 ; multiply by three ADDSL R7,R5,1 ; multiply by two and add one SL R7,1 ; multiply by two  R7 = R5 x 14 ADD R3,R3,R7; accumulate product  start standard epilogue SL R5,4 ; shift multiplicand SR R4,4 ; shift multiplier  to end multiply step ADDSI R6,1 ; count BGT TMULLP ; iterate JUMPS R1 ; return
As is usual with optimized code, the above code is not particularly readable. Code from the loop prologue along with the entire loop epilogue has been duplicated in each case, and then the result has been optimized. As a result, the borderline between the divide step and the enclosing iterative control structure has been blurred. In addition, case 15 has been moved to the start of the list, in order to eliminate the last branch from the branch table.
The code above will be better than twice the speed of the unoptimized unsigned binary multiply code for random integers. Note that multipliers with long runs of zero bits are very common. These are the worst case for the unsigned binary code and the best case for this code. The following table completes the comparison of all the different unsigned multiply routines given above:
Multiply algorithm  Simple  Unrolled  Radix 16  Optimized Radix 16 

Code size, bytes  18  260  132  294 
Worst case instructions  195  130  99  91 
Average instructions  179  114  86.5  73.5 
Worst case branch count  63  32  31  23 
Average branch count  47  16  31  <23 
Predictable branches  31  0  23  15 
We are clearly in the realm of diminishing returns here, having more than doubled the code size over the original radix16 code for a 10% gain in speed. It is quite possible that the doubled size of this code could lead to a net decrease in performance on machines with a small cache, since the larger code could push instructions out of the cache, leading to overall slower execution.
If there is no cache or a sufficiently large cache, the total number of instructions per multiply can be reduced to 56.5 by fully unrolling the loop. This final optimization multiplies the code size by a factor of just under 8, to over 2K bytes, in order to achieve a 25% speed improvement.
SRU can be used for fast unsigned division with truncation:
SRU R3,1 ; unsigned divide by 2 SRU R3,2 ; unsigned divide by 4
Use of SR for signed division is complicated by the fact that there are two alternative rules for truncating the quotient. SR truncates the quotient to the next lower integer so the remainder is always positive:
SR R3,1 ; signed divide by 2, truncating downward SR R3,2 ; signed divide by 4, truncating downward
Generally, people expect –5 divided by 4 to give a quotient of –1 with a remainder of –1. When SR is used for this division problem, it gives a quotient of –2 with a remainder of 3. In fact, our naive expectation is contradicted by another equally naive expectation. We expect the remainder after division by 4 to be between 0 and 3. These two naive expectations are incompatible for signed dividends. SR satisfies our naive expectation about the remainder while violating our naive expectation about the quotient.
Many programming languages define integer division as truncating the result toward zero, satisfying the naive expectation for the quotient. That is, the sign of the remainder always matches the sign of the dividend. The ADJUST r,SSQ instruction can be used to deliver this result, as follows:
SR R3,2 ; divide by 4, truncating downward ADJUST R3,SSQ ; adjust quotient so truncation is toward zero
For both signed and unsigned shifts, the quotient can be rounded to the nearest integer as follows:
SR R3,2 ; signed divide by 4 with positive remainder ADDC R3,0 ; round so remainder is between 2 and 1 SRU R3,3 ; unsigned divide by 8 with positive remainder ADDC R3,0 ; round so remainder is between 4 and 3
In general, it is faster to multiply by the reciprocal of a number than to divide by that number! This is particularly important for division by constants that are not powers of two, since we can divide by a power of two by a simple shift. Furthermore, this applies across most computers, including those with hardware divide instructions. The following table shows the reciprocals of the integers from 1 to 10.
n  1/n (base 10)  1/n (base 2) 

1  1.0  1.0 
2  0.5  0.1 
3  0.33  0.0101 
4  0.25  0.01 
5  0.2  0.00110011 
6  0.166  0.00101 
7  0.142857142857  0.001001 
8  0.125  0.001 
9  0.11  0.000111000111 
10  0.1  0.000110011 
In binary or decimal repeating fractions are traditionally marked with an overbar above the string of digits that repeat; here, this is rendered like this. Many numbers that have exact reciprocals in decimal have reciprocals that repeat in binary. For example, 0.1_{10} has no finite binary representation.
In general, if we multiply a 32 bit integer by the reciprocal, truncated to 32 bits, we get a result that differs from the expected quotient by no more than one. In some cases, this inaccuracy is acceptable, but in most cases, an exact quotient is needed.
One way to obtain an exact quotient is to multiply by a carefully selected approximation of the reciprocal. In general, for an nbit unsigned divisor, we can get the exact result by multiplying by an n+1 bit quantity and then shifting to take the appropriate bits of the product. The n+1 bit approximate reciprocal cannot have more than n+1 onebits, so there are at most n+1 partial products to add in computing the quotient this way. This means that we can divide a 32bit unsigned number by an arbitrary constant with a sequence of no more than 33 ADDSRU instructions. Here are some examples:
; R3 = R3 / 3 = R3 * 0.010101010101010101010101010101011 MOVE R1,R3 ; R3 = R1 * 1.0 (make copy of original R3) SRU R3,1 ; R3 = R1 * 0.1 ADDSRU R3,R1,2 ; R3 = R1 * 0.011 exact R3/3 for all R1 < 8 ADDSRU R3,R1,2 ; R3 = R1 * 0.01011 all R1 < 32 ADDSRU R3,R1,2 ; R3 = R1 * 0.0101011 all R1 < 128 ADDSRU R3,R1,2 ; R3 = R1 * 0.010101011 all R1 < 512 ADDSRU R3,R1,2 ; R3 = R1 * 0.01010101011 all R1 < 2,048 ADDSRU R3,R1,2 ; R3 = R1 * 0.0101010101011 all R1 < 8,192 ADDSRU R3,R1,2 ; R3 = R1 * 0.010101010101011 all R1 < 32,768 ADDSRU R3,R1,2 ; R3 = R1 * 0.01010101010101011 all R1 < 131,072 ADDSRU R3,R1,2 ; R3 = R1 * 0.0101010101010101011 ADDSRU R3,R1,2 ; R3 = R1 * 0.010101010101010101011 ADDSRU R3,R1,2 ; R3 = R1 * 0.01010101010101010101011 ADDSRU R3,R1,2 ; R3 = R1 * 0.0101010101010101010101011 ADDSRU R3,R1,2 ; R3 = R1 * 0.010101010101010101010101011 ADDSRU R3,R1,2 ; R3 = R1 * 0.01010101010101010101010101011 ADDSRU R3,R1,2 ; R3 = R1 * 0.0101010101010101010101010101011 ADDSRU R3,R1,2 ; R3 = R1 * 0.010101010101010101010101010101011 ; R3 = R3 / 5 = R3 * 0.0011001100110011001100110011001101 MOVE R1,R3 ; R3 = R1 * 1.0 (make copy of original R3) SRU R3,2 ; R3 = R1 * 0.01 ADDSRU R3,R1,1 ; R3 = R1 * 0.101 ADDSRU R3,R1,3 ; R3 = R1 * 0.001101 exact R3/5 for all R1 < 64 ADDSRU R3,R1,1 ; R3 = R1 * 0.1001101 ADDSRU R3,R1,3 ; R3 = R1 * 0.0011001101 R1 < 1,024 ADDSRU R3,R1,1 ; R3 = R1 * 0.10011001101 ADDSRU R3,R1,3 ; R3 = R1 * 0.00110011001101 R1 < 16,384 ADDSRU R3,R1,1 ; R3 = R1 * 0.100110011001101 ADDSRU R3,R1,3 ; R3 = R1 * 0.001100110011001101 R1 < 262,144 ADDSRU R3,R1,1 ; R3 = R1 * 0.1001100110011001101 ADDSRU R3,R1,3 ; R3 = R1 * 0.0011001100110011001101 ADDSRU R3,R1,1 ; R3 = R1 * 0.10011001100110011001101 ADDSRU R3,R1,3 ; R3 = R1 * 0.00110011001100110011001101 ADDSRU R3,R1,1 ; R3 = R1 * 0.100110011001100110011001101 ADDSRU R3,R1,3 ; R3 = R1 * 0.001100110011001100110011001101 ADDSRU R3,R1,1 ; R3 = R1 * 0.1001100110011001100110011001101 ADDSRU R3,R1,3 ; R3 = R1 * 0.0011001100110011001100110011001101
As the comments above indicate, the complete series of 16 ADDSRU instructions is only needed if the maximum value of the dividend is unknown. If the maximum is known, an abbreviated shift sequence will suffice. For example, for a 16bit dividend, 6 or 7 ADDSRU instructions do the job.
Given code to divide by 3, division by 6, 12 and other binary multiples of 3 is straightforward; simply increase the final shift count appropriately. Similarly, we can derive code for divide by 10, 20 and other binary multiples of 5 by simple changes to the final shift count.
The above examples illustrate a general technique for multiplying an unsigned binary integer by a fractional constant binary multiplier. If there are k onebits in the multiplier, there will be one SRU instruction followed by k–1 ADDSRU instructions. Reading from the first SRU downward, the sequence of shift counts gives the intervals between consecutive onebits of the multiplier, starting with the leastsignificant bit. If the two least significant bits are one, the first shift count is one; if there are z zeros between the two least significant ones, the first shift count is z+1. The final shift count follows the same formula, based on the number of zeros between the point and the most significant one.
To divide a 32bit integer by an arbitrary integer constant, take the reciprocal of the divisor as a fixed point binary fraction. This fraction must be computed to 33 places of precision, counting the most significant one as the first place. Then, add one to the bit 33 places to the right of the most significant one.
For repeating binary fractions, faster code is possible at the expense of a result that is only approximate. The trick is to find the repeating pattern in the reciprocal and exploit it. The following example illustrates this for division by 5:
; R3 = ~(R3 / 5) = R3 * 0.00110011001100110011001100110011 MOVE R1,R3 ; R3 = R1 * 1.0 (make copy of original R3) SRU R3,1 ; R3 = R1 * 0.1 ADDSRU R3,R1,3 ; R3 = R1 * 0.0011 ADDSRU R3,R1,1 ; R3 = R1 * 0.10011 ADDSRU R3,R1,1 ; R3 = R1 * 0.110011 MOVE R4,R3 ; R3 = R1 * 0.110011 (make copy of working value) SRU R3,8 ; R3 = R1 * 0.00000000110011 ADDSRU R3,R4,8 ; R3 = R1 * 0.0000000011001100110011 * ADDSRU R3,R4,8 ; R3 = R1 * 0.000000001100110011001100110011 * ADDSRU R3,R4,2 ; R3 = R1 * 0.00110011001100110011001100110011 ; Either R3 = (R1 / 5) or (R3 + 1) = (R1 / 5);
This took only 10 instructions instead of the 18 in the bruteforce code, but the result is not exact. With this code, the quotient may be off by one. If the two lines marked with asterisks are deleted, the code works for all 16bit dividends, while if just one of these is deleted, it is good for all 24bit dividends.
Because the approximation will never be off by more than one, we can fix it by computing the remainder and then checking to see that it is within bounds. The following code fixes the result of the above divide by 5 code:
; Given R1 = dividend, R3 = approximate quotient MOVE R4,R3 ADDSL R4,R3,2 ; R4 = R3 * 5 = (R1 / 5) * 5 SUB R4,R1,R4; R4 = R1 % 5 (remainder after division) CMPI R4,5 BLTU NOFIX ; if remainder >= 5 ADDSI R3,1 ; fix quotient ADDSI R4,5 ; fix remainder NOFIX: ; Now, R1 = dividend, R3 = quotient, R4 = remainder
The above code produces both the remainder and the quotient after division. In fact, aside from the "perfect" algorithms given initially, where an exact quotient is produced by reciprocal multiplication, all useful division algorithms produce the remainder and quotient as the result of a single computation. Similarly, computers with a general purpose divide instruction usually use it to produce both the remainder and quotient.
These methods extend trivially to signed division by positive constants, with one caveat. Simply replace SRU and ADDSRU with SR and ADDSR. The caveat is that the remainder will always be positive. If you want the sign of the remainder the same as the sign of the dividend, you will need to add a correction step similar to that suggested above: If the dividend is negative and the remainder is nonzero, increment the quotient and decrement the remainder by the divisor. Computing the remainder is does exactly the same in both the signed and unsigned versions.
Division by a variable is somewhat more complex than multiplication by a variable. The following code implements the classic restoring division algorithm on 32bit unsigned operands, giving a 32bit remainder and a 32bit quotient.
DIVIDEU:; unsigned divide, 32bit operands ; expects: R1  return address ; R3  dividend ; R5  divisor ; returns: R3  quotient ; R4  remainder ; destroys: R6  used as loop counter CLR R4 ; clear remainder LIS R6,32 ; load loop counter DIVULP: ;  start of divide step SL R3,1 ; shift dividend/quotient ROL R4 ; putting high bit into remainder CMP R4,R5 ; see if divisor divides remainder BLTU DIVUC ; if not, skip SUB R4,R4,R5; if so, take away divisor ADDSI R3,1 ; and add a bit to the quotient DIVUC: ;  end of divide step ADDSI R6,1 ; count BGT DIVULP ; iterate JUMPS R1 ; return
As with the algorithms given for multiplication by a variable, this code does not check for overflow. If size is the primary consideration, this code is optimal. If speed is paramount, unrolling the loop will eliminate 2 of the 8 instructions per iteration, giving a speedup of at least 1/4. Division by variables is rare enough in practice that this is generally acceptable.