14. Hawk Support for Multiply and Divide

Part of the Hawk Manual
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Contents

14.1. Multiplication by Constants -- Methodology
14.2. Small Constant Multipliers -- Optimal Code
14.3. Multiplication by Variables
14.4. Radix 16 Multiplication -- Fast Code

14.5. Division by Powers of 2
14.6. Division by Constants -- Methodology
14.7. Division by Variables


14.1. Multiplication by Constants -- Methodology

The Hawk has no integer multiply instruction. Multiplication hardware fits poorly with modern pipelined architectures, and the most common case, multiplication by small constants, is always faster using sequences of easy to pipeline instructions. For example, the Pentium instruction timings done by Quantasm Corporation, show that the base time for a 32-bit integer add or subtract on that machine is 1 clock cycle, and both can frequently be paired with other instructions to execute concurrently. In contrast, a 32-bit multiply takes 10 cycles and a divide takes 41 cycles, and neither can be paired. Clearly, these instructions should be avoided wherever possible, so why put them in the instruction set?

In the Hawk, MOVESL, ADDSL and SL can each be used to multiply by 16 distinct constants with a single instruction. This works identically for signed and unsigned integers:

MOVESL a,b,s — a = b × 2s   for 0 < s < 17

SL     r,s   — r = r × 2s    for 0 < s < 17

ADDSL  r,r,s — r = r × (1+2s) for 0 < s < 17

For example:

        MOVESL R3,R4,1    ;   R3 gets R4 times 2
        SL     R3,1       ;   R3 times 2
        ADDSL  R3,R3,1    ;   R3 times 3
        SL     R3,2       ;   R3 times 4
        ADDSL  R3,R3,2    ;   R3 times 5
        SL     R3,3       ;   R3 times 8
        ADDSL  R3,R3,3    ;   R3 times 9

When the desired constant multiplier does not fit this pattern, it is frequently possible to factor the multiplier into factors where multiplication by each factor can be done by following this pattern. Multiplying by each factor in sequence is, of course, equivalent to multiplying by their product. Here are some examples that use this trick for some common small multipliers that aren't in the above list:
 

        SL     R3,1       ; \ R3 times 6  = 2 x 3
        ADDSL  R3,R3,1    ; /

        SL     R3,1       ; \ R3 times 10 = 2 x 5
        ADDSL  R3,R3,2    ; /

        ADDSL  R3,R3,1    ; \ R3 times 15 = 3 x 5
        ADDSL  R3,R3,2    ; /

If the desired multiplier cannot be factored, the next thing to consider is computing the product as a sum of partial products.

to compute the product ap
where p = q + r
use ap = aq + ar

Generally, our goal will be to search out values of q and r that fit one of the patterns above. In the extreme, of course, any number can be represented as a sum of powers of two, but faster multiply code will frequently rest on other partial products. Summing partial products requires the use of a temporary register; here, we use R1. In loading this temporary with a copy of the multiplicand, it is useful to recall that we can use MOVESL to scale the multiplicand and we can use NEG to negate it.

        NEG    R1,R3      ; \ R3 times 7 = 8 + (-1)
        ADDSL  R3,R1,3    ; /

        MOVESL R1,R3,1    ; \ R3 times 10 = 8 + 2
        ADDSL  R3,R1,3    ; /

        MOVESL R1,R3,1    ; \ R3 times 11 = 9 + 2
        ADDSL  R3,R3,3    ; |
        ADD    R3,R3,R1   ; /

        MOVE   R1,R3      ; \ R3 times 13 = (3 x 4) + 1
        ADDSL  R3,R3,1    ; |
        ADDSL  R3,R1,2    ; /

        MOVESL R1,R3,1    ; \ R1 times 14 = (3 x 4) + 2
        ADDSL  R3,R3,1    ; |
        ADDSL  R3,R1,2    ; /

When all else fails, note that any number can be expressed as a sum of powers of two. The binary representation of the number has a one-bit for each power of two that is present in this sum. Exploiting this leads directly to code to multiply by an arbitrary constant:

        MOVE   R1,R3      ; \
        ADDSL  R3,R1,2    ; | R3 times 10 = 1010
        SL     R3,1       ; /

        MOVE   R1,R3      ; \
        ADDSL  R3,R1,1    ; | R3 times 27 = 11011
        ADDSL  R3,R1,2    ; |
        ADDSL  R3,R1,1    ; /

        MOVE   R1,R3      ; \
        ADDSL  R3,R1,1    ; | R3 times 100 = 1100100
        ADDSL  R3,R1,3    ; |
        SL     R3,2       ; /

        MOVE   R1,R3      ; \
        ADDSL  R3,R1,1    ; | R3 times 117 = 1110101
        ADDSL  R3,R1,1    ; |  
        ADDSL  R3,R1,2    ; |
        ADDSL  R3,R1,2    ; /

In general, for a multiplier with n ones in its binary representation, we can compute the product by first making a copy of the multiplicand and then performing n–1 ADDSL operations, and if the binary representation of the multiplier has trailing zeros, one additional SL instruction. The shift counts for the ADDSL instructions can be read as the distance between consecutive ones in the multiplier.
 
 

14.2. Small Constant Multipliers -- Optimal Code

It is difficult to give a general rule for which model will give the shortest instruction sequence for multiplication. For each multiplier, a different approach may yield a better result. Trial and error gives us the following instruction sequences for multipliers from 2 to 10:

        SL     R3,1       ;   R3 times  2

        ADDSL  R3,R3,1    ;   R3 times  3

        SL     R3,2       ;   R3 times  4

        ADDSL  R3,R3,2    ;   R3 times  5

        SL     R3,1       ; \ R3 times  6 = 2 x 3
        ADDSL  R3,R3,1    ; /

        NEG    R1,R3      ; \ R3 times  7 = 8 + (-1)
        ADDSL  R3,R1,3    ; /

        SL     R3,3       ;   R3 times  8

        ADDSL  R3,R3,3    ;   R3 times  9

        SL     R3,1       ; \ R3 times 10 = 2 x 5
        ADDSL  R3,R3,2    ; /

None of the above require more than 2 instructions. For multipliers from 11 to 38, 3 instructions suffice:

        MOVE   R1,R3      ; \ R3 times 11 = 1011
        ADDSL  R3,R1,2    ; |
        ADDSL  R3,R1,1    ; /

        SL     R3,2       ; \ R3 times 12 = 4 x 3
        ADDSL  R3,R3,1    ; /

        MOVE   R1,R3      ; \ R3 times 13 = 1101
        ADDSL  R3,R1,1    ; |
        ADDSL  R3,R1,2    ; /

        MOVESL R1,R3,1    ; \ R3 times 14 = (3 x 4) + 2
        ADDSL  R3,R3,1    ; |
        ADDSL  R3,R1,2    ; /

        ADDSL  R3,R3,1    ; \ R3 times 15 = 3 x 5
        ADDSL  R3,R3,2    ; /

        SL     R3,4       ;   R3 times 16

        ADDSL  R3,R3,4    ;   R3 times 17

        ADDSL  R3,R3,3    ; \ R3 times 18 = 9 x 2
        SL     R3,1       ; /

        MOVE   R1,R3      ; \ R3 times 19 = 10011
        ADDSL  R3,R1,3    ; |
        ADDSL  R3,R1,1    ; /

        SL     R3,2       ; \ R3 times 20 = 4 x 5
        ADDSL  R3,R3,2    ; /

        MOVE   R1,R3      ; \ R3 times 21 = 10101
        ADDSL  R3,R1,2    ; |
        ADDSL  R3,R1,2    ; /

        MOVESL R1,R3,1    ; \ R3 times 22 = 10110 (with a trick!)
        ADDSL  R3,R1,3    ; |
        ADDSL  R3,R1,1    ; /

        NEG    R1,R3      ; \ R3 times 23 = (3 x 8) + (-1)
        ADDSL  R3,R3,1    ; |
        ADDSL  R3,R1,3    ; /

        SL     R3,3       ; \ R3 times 24 = 8 x 3
        ADDSL  R3,R3,1    ; /

        ADDSL  R3,R3,2    ; \ R3 times 25 = 5 x 5
        ADDSL  R3,R3,2    ; /

        MOVESL R1,R3,1    ; \ R3 times 26 = (3 x 8) + 2
        ADDSL  R3,R3,1    ; |
        ADDSL  R3,R1,3    ; /

        ADDSL  R3,R3,3    ; \ R3 times 27 = 9 x 3
        ADDSL  R3,R3,1    ; /

        MOVESL R1,R3,2    ; \ R3 times 28 = (3 x 8) + 4
        ADDSL  R3,R3,1    ; |
        ADDSL  R3,R1,3    ; /

        NEG    R1,R3      ; \ R3 times 29 = 32 + (-3)
        ADDSL  R1,R1,1    ; |
        ADDSL  R3,R1,5    ; /

        SL     R3,1       ; \ R3 times 30 = 2 x 3 x 5
        ADDSL  R3,R3,1    ; |
        ADDSL  R3,R3,2    ; /

        NEG    R1,R3      ; \ R3 times 31 = 32 + (-1)
        ADDSL  R3,R3,5    ; /

        SL     R3,5       ;   R3 times 32

        ADDSL  R3,R3,5    ;   R3 times 33


        ADDSL  R3,R3,4    ; \ R3 times 34 = (16 + 1) x 2
        SL     R3,1       ; /

        MOVESL R1,R3,1    ; \ R3 times 35 = 3 + 32
        ADD    R1,R1,R3   ; |
        ADDSL  R3,R1,5    ; /

        MOVESL R1,R3,2    ; \ R3 times 36 = 32 + 4
        ADDSL  R3,R1,5    ; /

        MOVE   R1,R3      ; \ R3 times 37 = 5 + 32
        ADDSL  R1,R1,2    ; |
        ADDSL  R3,R1,5    ; /

        MOVESL R1,R3,1    ; \ R3 times 38 = (2 x 3) + 32
        ADDSL  R1,R1,1    ; |
        ADDSL  R3,R1,5    ; /

Multiplying by 39 takes 4 instructions, but multiplying by 100 takes only 3. Extending this table is tedious and the payoff is limited. Occasionally, however, some applicatons require multiplication by larger constants. For example, some pseudo-random number generators reqsuire fast multiplication by strange values such as 16,807, 39,373, 69,621 and 48,271.

14.3. Multiplication by Variables

The classic shift-and-add algorithm for multiplying is known by many names; if, in each step, the multiplier is right shifted and the multiplicand is left shifted, so that the partial products are added in order of increasing significance, it is called the Russian Peasant Algorithm, giving credit to one folk origin.

Whatever the shift direction or the order of additions, all basic binary multiplication algorithms work as follows: In each multiply step, one bit of the multiplier is tested and both the multiplier and product are shifted one place; if the bit is one, the multiplicand is added, as a partial product, to the product being accumulated. It takes 32 such steps to compute the 64-bit product of two 32-bit unsigned numbers.

In many cases, only the low 32 bits of the 64-bit product are, so the following simplified code is useful. This shifts both the product and multiplier left, adding the most significant partial product first:

TIMESUL:; unsigned multiply, low 32-bits of the product
        ; expects:  R1 - the return address
        ;           R4 - the multiplier
        ;           R5 - the multiplicand
        ; returns:  R3 - the low 32 bits of product
        ; destroys: R4 - all bits shifted away
        ;           R6 - used as loop counter

        CLR     R3 
        LIS     R6,32   ; load loop counter
TMULLP:                 ;   --- start of multiply step
        SL      R3,1    ;   shift product 
        SL      R4,1    ;   shift multiplier
        BCR     TMULCN  ;   if high bit was one
        ADD     R3,R3,R5;     add partial product to product
TMULCN:                 ;   --- end of multiply step
        ADDSI   R6,-1   ;   count
        BGT     TMULLP  ; iterate
        JUMPS   R1      ; return 

Signed multiplication in the 2's complement number system poses only small added challenges. The important thing to note is that the effective weight of the sign bit is negative, so the partial product computed from the sign bit must be subtracted from the product instead of being added. Therefore, we handle the sign bit with a special multiply step that adds instead of subtracts. Aside from this change, the following code uses the same algorithm as the unsigned code given above:

TIMESL: ; signed multiply, low 32 bits of the product
        ; expects:  R1 - the return address
        ;           R4 - the multiplier
        ;           R5 - the multiplicand
        ; returns:  R3 - the low 32 bits of product
        ; destroys: R4 - all bits shifted away
        ;           R6 - used as loop counter
        
        CLR     R3
                        ; --- start of special multiply step
        SL      R4,1    ; shift multiplier
        BCR     TMSSKP  ; if sign bit was 1
        SUB     R3,R0,R5;   subtract partial product from product
TMSSKP:                 ; --- end of special multiply step
        LIS     R6,31   ; load loop counter
TMSLLP:                 ;   --- start of multiply step
        SL      R3,1    ;   shift product 
        SL      R4,1    ;   shift multiplier
        BCR     TMSLCN  ;   if high bit was one
        ADD     R3,R3,R5;     add partial product to product
TMSLCN:                 ;   --- end of multiply step
        ADDSI   R6,-1   ;   count
        BGT     TMSLLP  ; iterate
        JUMPS   R1      ; return 

An obvious way to speed up this code is to unroll the loops. Here is the code for unsigned integer multiply, with the loop partially unrolled so there are 2 multiply steps per iteration:

TIMESUL:; unsigned multiply, low 32-bits of the product
        ; expects:  R1 - the return address
        ;           R4 - the multiplier
        ;           R5 - the multiplicand
        ; returns:  R3 - the low 32 bits of product
        ; destroys: R4 - all bits shifted away
        ;           R6 - used as loop counter

        CLR     R3
        LIS     R6,16   ; load loop counter
TMULLP:                 ;   --- start of first multiply step
        SL      R3,1    ;   shift product 
        SL      R4,1    ;   shift multiplier
        BCR     TMULC1  ;   if high bit was one
        ADD     R3,R3,R5;     add partial product to product
TMULC1:                 ;   --- end of first, start of second
        SL      R3,1    ;   shift product 
        SL      R4,1    ;   shift multiplier
        BCR     TMULC2  ;   if high bit was one
        ADD     R3,R3,R5;     add partial product to product
TMULC2:                 ;   --- end of second multiply step
        ADDSI   R6,-1   ;   count
        BGT     TMULLP  ; iterate
        JUMPS   R1      ; return 

Each multiply step in the original code had 4 instructions, plus 2 instructions for loop control. Each unrolling, adding an additional multiply step to the loop body, eliminates 2 more loop control instructions. If the loop is fully unrolled, so all 32 multiply steps are in line, the net performance improvement would be 1/3. On a pipelined machine, the elimination of the branch penalty may lead to even greater improvements.

The following table summarizes the potential performance gains by unrolling the unsigned multiply routine, from no unrolling with one multiply step per iteration, to completely unrolled with 32 multiply steps and no loop. Execution time can be estimated from the number of instructions, but on pipelined machines, there will be branch penalties that may only apply to unpredictable branches.

Cost per 32-bit unsigned multiply
Unrolling (steps per iteration) 1 2 n < 32 32
Code size, bytes 18 26 10 + 8n 260
Worst case instructions 195 163 131 + 64/n 130
Average instructions 179 147 115 + 64/n 114
Worst case branch count 63 47 31 + 32/n 32
Average branch count 47 31 15 + 32/n 16
Predictable branches 31 15 –1 + 32/n 0

14.4. Radix 16 Multiplication -- Fast Code

Another way to speed up multiplication is to move from binary to a higher radix such as base 16. This cuts the number of iterations from 32 down to 8, but it requires a case-select construct for each of the 16 possibilities at each step. This idea is borrowed from the way the HP PA-RISC does integer multiplication.

Here, BTRUNC is used for case selection based on the rightmost bits of the multiplier. bits of a register. The code right shifts the multiplier after each step, adding the least significant partial product first. It keeps only the least significant 32 bits of the product. The first version of this code is written for clarity:

TIMESUL:; unsigned multiply, low 32-bits of the product
        ; expects:  R1 - the return address
        ;           R4 - the multiplier
        ;           R5 - the multiplicand
        ; returns:  R3 - the low 32 bits of product
        ; destroys: R4 - all bits shifted away
        ;           R6 - used as loop counter
        ;           R7 - used for partial products

        CLR     R3
        LIS     R6,8    ; load loop counter

TMULLP:                 ; --- start of radix 16 multiply step
        MOVE    R7,R5   ; copy multiplicand for partial product computation
        BTRUNC  R4,4    ; decode least 4 bits of multiplier
        BR      TMUL0   ;   branch table for BTRUNC
        BR      TMUL1
        BR      TMUL2
        BR      TMUL3
        BR      TMUL4
        BR      TMUL5
        BR      TMUL6
        BR      TMUL7
        BR      TMUL8
        BR      TMUL9
        BR      TMULA
        BR      TMULB
        BR      TMULC
        BR      TMULD
        BR      TMULE
        BR      TMULF

TMUL0:  CLR     R7      ; partial product is zero
        BR      TMULCN  ; done with 0

TMUL1:  BR      TMULCN  ; done with 1

TMUL2:  SL      R7,1    ; multiply by two
        BR      TMULCN  ; done with 2

TMUL3:  ADDSL   R7,R7,1 ; multiply by three
        BR      TMULCN  ; done with 3

TMUL4:  SL      R7,2    ; multiply by four
        BR      TMULCN  ; done with 4

TMUL5:  ADDSL   R7,R7,2 ; multiply by five
        BR      TMULCN  ; done with 5

TMUL6:  SL      R7,1    ; multiply by two
        ADDSL   R7,R7,1 ; multiply by three
        BR      TMULCN  ; done with 6

TMUL7:  ADDSL   R7,R7,1 ; multiply by three
        ADDSL   R7,R5,1 ; multiply by two and add one
        BR      TMULCN  ; done with 7

TMUL8:  SL      R7,3    ; multiply by eight
        BR      TMULCN  ; done with 8

TMUL9:  ADDSL   R7,R7,3 ; multiply by nine
        BR      TMULCN  ; done with 9

TMULA:  SL      R7,1    ; multiply by two
        ADDSL   R7,R7,2 ; multiply by five
        BR      TMULCN  ; done with 10

TMULB:  ADDSL   R7,R7,2 ; multiply by five
        ADDSL   R7,R5,1 ; multiply by two and add one
        BR      TMULCN  ; done with 11

TMULC:  ADDSL   R7,R7,1 ; multiply by three
        SL      R7,2    ; multiply by four
        BR      TMULCN  ; done with 12

TMULD:  ADDSL   R7,R7,1 ; multiply by three
        ADDSL   R7,R5,2 ; multiply by four and add 1
        BR      TMULCN  ; done with 13

TMULE:  ADDSL   R7,R7,1 ; multiply by three
        ADDSL   R7,R5,1 ; multiply by two and add one
        SL      R7,1    ; multiply by two
        BR      TMULCN  ; done with 14

TMULF:  ADDSL   R7,R7,1 ; multiply by three
        ADDSL   R7,R7,2 ; multiply by five
        BR      TMULCN  ; done with 15

TMULCN: ADD     R3,R3,R7; accumulate product
        SL      R5,4    ; shift multiplicand
        SR      R4,4    ; shift multiplier
                        ; --- end of radix 16 multiply step

        ADDSI   R6,-1   ; count
        BGT     TMULLP  ; iterate
        JUMPS   R1      ; return 

The above radix-16 algorithm is significantly faster than the unsigned binary code given previously, yet it is more compact than the fully unrolled binary multiply algorithm. If integer multiply speed is paramount, we can optimize the above code at the expense of readability. Such optimization may be justified because, once debugged, this code can be used by a many applications; the gain, however, is small.

TIMESUL:; unsigned multiply, low 32-bits of the product
        ; expects:  R1 - the return address
        ;           R4 - the multiplier
        ;           R5 - the multiplicand
        ; returns:  R3 - the low 32 bits of product
        ; destroys: R4 - all bits shifted away
        ;           R6 - used as loop counter
        ;           R7 - used for partial products

        CLR     R3
        LIS     R6,8    ; load loop counter

TMULLP:                 ; --- start of radix 16 multiply step
        BTRUNC  R4,4    ; decode least 4 bits of multiplier
        BR      TMUL0   ;   branch table for BTRUNC
        BR      TMUL1
        BR      TMUL2
        BR      TMUL3
        BR      TMUL4
        BR      TMUL5
        BR      TMUL6
        BR      TMUL7
        BR      TMUL8
        BR      TMUL9
        BR      TMULA
        BR      TMULB
        BR      TMULC
        BR      TMULD
        BR      TMULE
                        ; the following code handles case 15!
        MOVE    R7,R5   ; copy multiplicand for partial product computation
        ADDSL   R7,R7,1 ; multiply by three
        ADDSL   R7,R7,2 ; multiply by five -- R7 = R5 x 15
        ADD     R3,R3,R7; accumulate product -- start standard epilogue
        SL      R5,4    ; shift multiplicand
        SR      R4,4    ; shift multiplier -- to end multiply step
        ADDSI   R6,-1   ; count
        BGT     TMULLP  ; iterate
        JUMPS   R1      ; return 

TMUL0:                  ; case 0 has an optimized epilogue with no ADD
        SL      R5,4    ; shift multiplicand
        SR      R4,4    ; shift multiplier -- to end multiply step
        ADDSI   R6,-1   ; count
        BGT     TMULLP  ; iterate
        JUMPS   R1      ; return 

TMUL1:                  ; case 1 has an optimized epilogue, MOVE eliminated
        ADD     R3,R3,R5; accumulate product
        SL      R5,4    ; shift multiplicand
        SR      R4,4    ; shift multiplier -- to end multiply step
        ADDSI   R6,-1   ; count
        BGT     TMULLP  ; iterate
        JUMPS   R1      ; return 

TMUL2:  MOVESL  R7,R5,1 ; multiply by two -- R7 = R5 x 2
        ADD     R3,R3,R7; accumulate product -- start standard epilogue
        SL      R5,4    ; shift multiplicand
        SR      R4,4    ; shift multiplier -- to end multiply step
        ADDSI   R6,-1   ; count
        BGT     TMULLP  ; iterate
        JUMPS   R1      ; return 

TMUL3:  MOVE    R7,R5   ; copy multiplicand for partial product computation
        ADDSL   R7,R7,1 ; multiply by three -- R7 = R5 x 3
        ADD     R3,R3,R7; accumulate product -- start standard epilogue
        SL      R5,4    ; shift multiplicand
        SR      R4,4    ; shift multiplier -- to end multiply step
        ADDSI   R6,-1   ; count
        BGT     TMULLP  ; iterate
        JUMPS   R1      ; return 

TMUL4:  MOVESL  R7,R5,2 ; multiply by four -- R7 = R5 x 4
        ADD     R3,R3,R7; accumulate product -- start standard epilogue
        SL      R5,4    ; shift multiplicand
        SR      R4,4    ; shift multiplier -- to end multiply step
        ADDSI   R6,-1   ; count
        BGT     TMULLP  ; iterate
        JUMPS   R1      ; return 

TMUL5:  MOVE    R7,R5   ; copy multiplicand for partial product computation
        ADDSL   R7,R7,2 ; multiply by five -- R7 = R5 x 5
        ADD     R3,R3,R7; accumulate product -- start standard epilogue
        SL      R5,4    ; shift multiplicand
        SR      R4,4    ; shift multiplier -- to end multiply step
        ADDSI   R6,-1   ; count
        BGT     TMULLP  ; iterate
        JUMPS   R1      ; return 

TMUL6:  MOVESL  R7,R5,1 ; multiply by two
        ADDSL   R7,R7,1 ; multiply by three -- R7 = R5 x 6
        ADD     R3,R3,R7; accumulate product -- start standard epilogue
        SL      R5,4    ; shift multiplicand
        SR      R4,4    ; shift multiplier -- to end multiply step
        ADDSI   R6,-1   ; count
        BGT     TMULLP  ; iterate
        JUMPS   R1      ; return 


TMUL7:  MOVE    R7,R5   ; copy multiplicand for partial product computation
        ADDSL   R7,R7,1 ; multiply by three
        ADDSL   R7,R5,1 ; multiply by two and add one -- R7 = R5 x 7
        ADD     R3,R3,R7; accumulate product -- start standard epilogue
        SL      R5,4    ; shift multiplicand
        SR      R4,4    ; shift multiplier -- to end multiply step
        ADDSI   R6,-1   ; count
        BGT     TMULLP  ; iterate
        JUMPS   R1      ; return 

TMUL8:  MOVESL  R7,R5,3 ; multiply by 8 -- R7 = R5 x 8
        ADD     R3,R3,R7; accumulate product -- start standard epilogue
        SL      R5,4    ; shift multiplicand
        SR      R4,4    ; shift multiplier -- to end multiply step
        ADDSI   R6,-1   ; count
        BGT     TMULLP  ; iterate
        JUMPS   R1      ; return 

TMUL9:  MOVE    R7,R5   ; copy multiplicand for partial product computation
        ADDSL   R7,R7,3 ; multiply by nine -- R7 = R5 x 9
        ADD     R3,R3,R7; accumulate product -- start standard epilogue
        SL      R5,4    ; shift multiplicand
        SR      R4,4    ; shift multiplier -- to end multiply step
        ADDSI   R6,-1   ; count
        BGT     TMULLP  ; iterate
        JUMPS   R1      ; return 

TMULA:  MOVESL  R7,R5,1 ; multiply by two
        ADDSL   R7,R7,2 ; multiply by five -- R7 = R5 x 10
        ADD     R3,R3,R7; accumulate product -- start standard epilogue
        SL      R5,4    ; shift multiplicand
        SR      R4,4    ; shift multiplier -- to end multiply step
        ADDSI   R6,-1   ; count
        BGT     TMULLP  ; iterate
        JUMPS   R1      ; return 

TMULB:  MOVE    R7,R5   ; copy multiplicand for partial product computation
        ADDSL   R7,R7,2 ; multiply by five
        ADDSL   R7,R5,1 ; multiply by two and add one -- R7 = R5 x 11
        ADD     R3,R3,R7; accumulate product -- start standard epilogue
        SL      R5,4    ; shift multiplicand
        SR      R4,4    ; shift multiplier -- to end multiply step
        ADDSI   R6,-1   ; count
        BGT     TMULLP  ; iterate
        JUMPS   R1      ; return 

TMULC:  MOVESL  R7,R5,2 ; multiply by four
        ADDSL   R7,R7,1 ; multiply by three -- R7 = R5 x 12
        ADD     R3,R3,R7; accumulate product -- start standard epilogue
        SL      R5,4    ; shift multiplicand
        SR      R4,4    ; shift multiplier -- to end multiply step
        ADDSI   R6,-1   ; count
        BGT     TMULLP  ; iterate
        JUMPS   R1      ; return 





TMULD:  MOVE    R7,R5   ; copy multiplicand for partial product computation
        ADDSL   R7,R7,1 ; multiply by three
        ADDSL   R7,R5,2 ; multiply by four and add 1 -- R7 = R5 x 13
        ADD     R3,R3,R7; accumulate product -- start standard epilogue
        SL      R5,4    ; shift multiplicand
        SR      R4,4    ; shift multiplier -- to end multiply step
        ADDSI   R6,-1   ; count
        BGT     TMULLP  ; iterate
        JUMPS   R1      ; return 

TMULE:  MOVE    R7,R5   ; copy multiplicand for partial product computation
        ADDSL   R7,R7,1 ; multiply by three
        ADDSL   R7,R5,1 ; multiply by two and add one
        SL      R7,1    ; multiply by two -- R7 = R5 x 14
        ADD     R3,R3,R7; accumulate product -- start standard epilogue
        SL      R5,4    ; shift multiplicand
        SR      R4,4    ; shift multiplier -- to end multiply step
        ADDSI   R6,-1   ; count
        BGT     TMULLP  ; iterate
        JUMPS   R1      ; return 

As is usual with optimized code, the above code is not particularly readable. Code from the loop prologue along with the entire loop epilogue has been duplicated in each case, and then the result has been optimized. As a result, the borderline between the divide step and the enclosing iterative control structure has been blurred. In addition, case 15 has been moved to the start of the list, in order to eliminate the last branch from the branch table.

The code above will be better than twice the speed of the unoptimized unsigned binary multiply code for random integers. Note that multipliers with long runs of zero bits are very common. These are the worst case for the unsigned binary code and the best case for this code. The following table completes the comparison of all the different unsigned multiply routines given above:

Cost per 32-bit unsigned multiply
Multiply algorithm SimpleUnrolledRadix 16Optimized Radix 16
Code size, bytes 18 260 132 294
Worst case instructions 195 130 99 91
Average instructions 179 114 86.5 73.5
Worst case branch count 63 32 31 23
Average branch count 47 16 31 <23
Predictable branches 31 0 23 15

We are clearly in the realm of diminishing returns here, having more than doubled the code size over the original radix-16 code for a 10% gain in speed. It is quite possible that the doubled size of this code could lead to a net decrease in performance on machines with a small cache, since the larger code could push instructions out of the cache, leading to overall slower execution.

If there is no cache or a sufficiently large cache, the total number of instructions per multiply can be reduced to 56.5 by fully unrolling the loop. This final optimization multiplies the code size by a factor of just under 8, to over 2K bytes, in order to achieve a 25% speed improvement.

14.5. Division by Powers of 2

SRU can be used for fast unsigned division with truncation:

        SRU    R3,1       ; unsigned divide by 2
        SRU    R3,2       ; unsigned divide by 4

Use of SR for signed division is complicated by the fact that there are two alternative rules for truncating the quotient. SR truncates the quotient to the next lower integer so the remainder is always positive:

        SR     R3,1       ; signed divide by 2, truncating downward
        SR     R3,2       ; signed divide by 4, truncating downward

Generally, people expect –5 divided by 4 to give a quotient of –1 with a remainder of –1. When SR is used for this division problem, it gives a quotient of –2 with a remainder of 3. In fact, our naive expectation is contradicted by another equally naive expectation. We expect the remainder after division by 4 to be between 0 and 3. These two naive expectations are incompatible for signed dividends. SR satisfies our naive expectation about the remainder while violating our naive expectation about the quotient.

Many programming languages define integer division as truncating the result toward zero, satisfying the naive expectation for the quotient. That is, the sign of the remainder always matches the sign of the dividend. The ADJUST r,SSQ instruction can be used to deliver this result, as follows:

        SR     R3,2       ; divide by 4, truncating downward
        ADJUST R3,SSQ     ;   adjust quotient so truncation is toward zero

For both signed and unsigned shifts, the quotient can be rounded to the nearest integer as follows:

        SR     R3,2       ; signed divide by 4 with positive remainder
        ADDC   R3,0       ;   round so remainder is between -2 and 1

        SRU    R3,3       ; unsigned divide by 8 with positive remainder
        ADDC   R3,0       ;   round so remainder is between -4 and 3

14.6. Division by Constants -- Methodology

In general, it is faster to multiply by the reciprocal of a number than to divide by that number! This is particularly important for division by constants that are not powers of two, since we can divide by a power of two by a simple shift. Furthermore, this applies across most computers, including those with hardware divide instructions. The following table shows the reciprocals of the integers from 1 to 10.

 n 1/n (base 10) 1/n (base 2)
1 1.0 1.0
2 0.5 0.1
3 0.33 0.0101
4 0.25 0.01
5 0.2 0.00110011
6 0.166 0.00101
7 0.142857142857 0.001001
8 0.125 0.001
9 0.11 0.000111000111
10 0.1 0.000110011

In binary or decimal repeating fractions are traditionally marked with an overbar above the string of digits that repeat; here, this is rendered like this. Many numbers that have exact reciprocals in decimal have reciprocals that repeat in binary. For example, 0.110 has no finite binary representation.

In general, if we multiply a 32 bit integer by the reciprocal, truncated to 32 bits, we get a result that differs from the expected quotient by no more than one. In some cases, this inaccuracy is acceptable, but in most cases, an exact quotient is needed.

One way to obtain an exact quotient is to multiply by a carefully selected approximation of the reciprocal. In general, for an n-bit unsigned divisor, we can get the exact result by multiplying by an n+1 bit quantity and then shifting to take the appropriate bits of the product. The n+1 bit approximate reciprocal cannot have more than n+1 one-bits, so there are at most n+1 partial products to add in computing the quotient this way. This means that we can divide a 32-bit unsigned number by an arbitrary constant with a sequence of no more than 33 ADDSRU instructions. Here are some examples:

        ;        R3 = R3 / 3 = R3 * 0.010101010101010101010101010101011
        MOVE    R1,R3   ; R3 = R1 * 1.0      (make copy of original R3)
        SRU     R3,1    ; R3 = R1 * 0.1
        ADDSRU  R3,R1,2 ; R3 = R1 * 0.011    exact R3/3 for all R1 < 8
        ADDSRU  R3,R1,2 ; R3 = R1 * 0.01011                 all R1 < 32
        ADDSRU  R3,R1,2 ; R3 = R1 * 0.0101011               all R1 < 128
        ADDSRU  R3,R1,2 ; R3 = R1 * 0.010101011             all R1 < 512
        ADDSRU  R3,R1,2 ; R3 = R1 * 0.01010101011           all R1 < 2,048
        ADDSRU  R3,R1,2 ; R3 = R1 * 0.0101010101011         all R1 < 8,192
        ADDSRU  R3,R1,2 ; R3 = R1 * 0.010101010101011       all R1 < 32,768
        ADDSRU  R3,R1,2 ; R3 = R1 * 0.01010101010101011     all R1 < 131,072
        ADDSRU  R3,R1,2 ; R3 = R1 * 0.0101010101010101011
        ADDSRU  R3,R1,2 ; R3 = R1 * 0.010101010101010101011
        ADDSRU  R3,R1,2 ; R3 = R1 * 0.01010101010101010101011
        ADDSRU  R3,R1,2 ; R3 = R1 * 0.0101010101010101010101011
        ADDSRU  R3,R1,2 ; R3 = R1 * 0.010101010101010101010101011
        ADDSRU  R3,R1,2 ; R3 = R1 * 0.01010101010101010101010101011
        ADDSRU  R3,R1,2 ; R3 = R1 * 0.0101010101010101010101010101011
        ADDSRU  R3,R1,2 ; R3 = R1 * 0.010101010101010101010101010101011

        ;        R3 = R3 / 5 = R3 * 0.0011001100110011001100110011001101
        MOVE    R1,R3   ; R3 = R1 * 1.0      (make copy of original R3)
        SRU     R3,2    ; R3 = R1 * 0.01
        ADDSRU  R3,R1,1 ; R3 = R1 * 0.101
        ADDSRU  R3,R1,3 ; R3 = R1 * 0.001101 exact R3/5 for all R1 < 64
        ADDSRU  R3,R1,1 ; R3 = R1 * 0.1001101
        ADDSRU  R3,R1,3 ; R3 = R1 * 0.0011001101                R1 < 1,024
        ADDSRU  R3,R1,1 ; R3 = R1 * 0.10011001101
        ADDSRU  R3,R1,3 ; R3 = R1 * 0.00110011001101            R1 < 16,384
        ADDSRU  R3,R1,1 ; R3 = R1 * 0.100110011001101
        ADDSRU  R3,R1,3 ; R3 = R1 * 0.001100110011001101        R1 < 262,144
        ADDSRU  R3,R1,1 ; R3 = R1 * 0.1001100110011001101
        ADDSRU  R3,R1,3 ; R3 = R1 * 0.0011001100110011001101
        ADDSRU  R3,R1,1 ; R3 = R1 * 0.10011001100110011001101
        ADDSRU  R3,R1,3 ; R3 = R1 * 0.00110011001100110011001101
        ADDSRU  R3,R1,1 ; R3 = R1 * 0.100110011001100110011001101
        ADDSRU  R3,R1,3 ; R3 = R1 * 0.001100110011001100110011001101
        ADDSRU  R3,R1,1 ; R3 = R1 * 0.1001100110011001100110011001101
        ADDSRU  R3,R1,3 ; R3 = R1 * 0.0011001100110011001100110011001101

As the comments above indicate, the complete series of 16 ADDSRU instructions is only needed if the maximum value of the dividend is unknown. If the maximum is known, an abbreviated shift sequence will suffice. For example, for a 16-bit dividend, 6 or 7 ADDSRU instructions do the job.

Given code to divide by 3, division by 6, 12 and other binary multiples of 3 is straightforward; simply increase the final shift count appropriately. Similarly, we can derive code for divide by 10, 20 and other binary multiples of 5 by simple changes to the final shift count.

The above examples illustrate a general technique for multiplying an unsigned binary integer by a fractional constant binary multiplier. If there are k one-bits in the multiplier, there will be one SRU instruction followed by k–1 ADDSRU instructions. Reading from the first SRU downward, the sequence of shift counts gives the intervals between consecutive one-bits of the multiplier, starting with the least-significant bit. If the two least significant bits are one, the first shift count is one; if there are z zeros between the two least significant ones, the first shift count is z+1. The final shift count follows the same formula, based on the number of zeros between the point and the most significant one.

To divide a 32-bit integer by an arbitrary integer constant, take the reciprocal of the divisor as a fixed point binary fraction. This fraction must be computed to 33 places of precision, counting the most significant one as the first place. Then, add one to the bit 33 places to the right of the most significant one.

For repeating binary fractions, faster code is possible at the expense of a result that is only approximate. The trick is to find the repeating pattern in the reciprocal and exploit it. The following example illustrates this for division by 5:

        ;     R3 = ~(R3 / 5) = R3 * 0.00110011001100110011001100110011
        MOVE    R1,R3   ; R3 = R1 * 1.0      (make copy of original R3)
        SRU     R3,1    ; R3 = R1 * 0.1
        ADDSRU  R3,R1,3 ; R3 = R1 * 0.0011
        ADDSRU  R3,R1,1 ; R3 = R1 * 0.10011
        ADDSRU  R3,R1,1 ; R3 = R1 * 0.110011
        MOVE    R4,R3   ; R3 = R1 * 0.110011 (make copy of working value)
        SRU     R3,8    ; R3 = R1 * 0.00000000110011
        ADDSRU  R3,R4,8 ; R3 = R1 * 0.0000000011001100110011          *
        ADDSRU  R3,R4,8 ; R3 = R1 * 0.000000001100110011001100110011  *
        ADDSRU  R3,R4,2 ; R3 = R1 * 0.00110011001100110011001100110011
        ;     Either R3 = (R1 / 5) or (R3 + 1) = (R1 / 5);

This took only 10 instructions instead of the 18 in the brute-force code, but the result is not exact. With this code, the quotient may be off by one. If the two lines marked with asterisks are deleted, the code works for all 16-bit dividends, while if just one of these is deleted, it is good for all 24-bit dividends.

Because the approximation will never be off by more than one, we can fix it by computing the remainder and then checking to see that it is within bounds. The following code fixes the result of the above divide by 5 code:

        ; Given R1 = dividend, R3 = approximate quotient
        MOVE    R4,R3
        ADDSL   R4,R3,2 ; R4 = R3 * 5 = (R1 / 5) * 5
        SUB     R4,R1,R4; R4 = R1 % 5 (remainder after division)
        CMPI    R4,5
        BLTU    NOFIX   ; if remainder >= 5
        ADDSI   R3,1    ;   fix quotient
        ADDSI   R4,-5   ;   fix remainder
NOFIX:
        ; Now, R1 = dividend, R3 = quotient, R4 = remainder

The above code produces both the remainder and the quotient after division. In fact, aside from the "perfect" algorithms given initially, where an exact quotient is produced by reciprocal multiplication, all useful division algorithms produce the remainder and quotient as the result of a single computation. Similarly, computers with a general purpose divide instruction usually use it to produce both the remainder and quotient.

These methods extend trivially to signed division by positive constants, with one caveat. Simply replace SRU and ADDSRU with SR and ADDSR. The caveat is that the remainder will always be positive. If you want the sign of the remainder the same as the sign of the dividend, you will need to add a correction step similar to that suggested above: If the dividend is negative and the remainder is nonzero, increment the quotient and decrement the remainder by the divisor. Computing the remainder is does exactly the same in both the signed and unsigned versions.

14.7. Division by Variables

Division by a variable is somewhat more complex than multiplication by a variable. The following code implements the classic restoring division algorithm on 32-bit unsigned operands, giving a 32-bit remainder and a 32-bit quotient.

DIVIDEU:; unsigned divide, 32-bit operands
        ; expects:  R1 - return address
        ;           R3 - dividend
        ;           R5 - divisor
        ; returns:  R3 - quotient
        ;           R4 - remainder
        ; destroys: R6 - used as loop counter

        CLR     R4      ; clear remainder
        LIS     R6,32   ; load loop counter

DIVULP:                 ;   --- start of divide step
        SL      R3,1    ;   shift dividend/quotient
        ROL     R4      ;     putting high bit into remainder
        CMP     R4,R5   ;   see if divisor divides remainder
        BLTU    DIVUC   ;     if not, skip
        SUB     R4,R4,R5;     if so, take away divisor
        ADDSI   R3,1    ;       and add a bit to the quotient
DIVUC:                  ;   --- end of divide step

        ADDSI   R6,-1   ;   count
        BGT     DIVULP  ; iterate
        JUMPS   R1      ; return

As with the algorithms given for multiplication by a variable, this code does not check for overflow. If size is the primary consideration, this code is optimal. If speed is paramount, unrolling the loop will eliminate 2 of the 8 instructions per iteration, giving a speedup of at least 1/4. Division by variables is rare enough in practice that this is generally acceptable.