# 14. Hawk Support for Multiply and Divide

## 14.1. Multiplication by Constants -- Methodology

The Hawk has no integer multiply instruction. Multiplication hardware fits poorly with modern pipelined architectures, and the most common case, multiplication by small constants, is always faster using sequences of easy to pipeline instructions. For example, the Pentium instruction timings done by Quantasm Corporation, show that the base time for a 32-bit integer add or subtract on that machine is 1 clock cycle, and both can frequently be paired with other instructions to execute concurrently. In contrast, a 32-bit multiply takes 10 cycles and a divide takes 41 cycles, and neither can be paired. Clearly, these instructions should be avoided wherever possible, so why put them in the instruction set?

In the Hawk, MOVESL, ADDSL and SL can each be used to multiply by 16 distinct constants with a single instruction. This works identically for signed and unsigned integers:

MOVESL a,b,s — a = b × 2s   for 0 < s < 17

SL     r,s   — r = r × 2s    for 0 < s < 17

ADDSL  r,r,s — r = r × (1+2s) for 0 < s < 17

For example:

```        MOVESL R3,R4,1    ;   R3 gets R4 times 2
SL     R3,1       ;   R3 times 2
ADDSL  R3,R3,1    ;   R3 times 3
SL     R3,2       ;   R3 times 4
ADDSL  R3,R3,2    ;   R3 times 5
SL     R3,3       ;   R3 times 8
ADDSL  R3,R3,3    ;   R3 times 9
```

When the desired constant multiplier does not fit this pattern, it is frequently possible to factor the multiplier into factors where multiplication by each factor can be done by following this pattern. Multiplying by each factor in sequence is, of course, equivalent to multiplying by their product. Here are some examples that use this trick for some common small multipliers that aren't in the above list:

```        SL     R3,1       ; \ R3 times 6  = 2 x 3

SL     R3,1       ; \ R3 times 10 = 2 x 5

ADDSL  R3,R3,1    ; \ R3 times 15 = 3 x 5
```

If the desired multiplier cannot be factored, the next thing to consider is computing the product as a sum of partial products.

to compute the product ap
where p = q + r
use ap = aq + ar

Generally, our goal will be to search out values of q and r that fit one of the patterns above. In the extreme, of course, any number can be represented as a sum of powers of two, but faster multiply code will frequently rest on other partial products. Summing partial products requires the use of a temporary register; here, we use R1. In loading this temporary with a copy of the multiplicand, it is useful to recall that we can use MOVESL to scale the multiplicand and we can use NEG to negate it.

```        NEG    R1,R3      ; \ R3 times 7 = 8 + (-1)

MOVESL R1,R3,1    ; \ R3 times 10 = 8 + 2

MOVESL R1,R3,1    ; \ R3 times 11 = 9 + 2

MOVE   R1,R3      ; \ R3 times 13 = (3 x 4) + 1

MOVESL R1,R3,1    ; \ R1 times 14 = (3 x 4) + 2
```

When all else fails, note that any number can be expressed as a sum of powers of two. The binary representation of the number has a one-bit for each power of two that is present in this sum. Exploiting this leads directly to code to multiply by an arbitrary constant:

```        MOVE   R1,R3      ; \
ADDSL  R3,R1,2    ; | R3 times 10 = 1010
SL     R3,1       ; /

MOVE   R1,R3      ; \
ADDSL  R3,R1,1    ; | R3 times 27 = 11011

MOVE   R1,R3      ; \
ADDSL  R3,R1,1    ; | R3 times 100 = 1100100
SL     R3,2       ; /

MOVE   R1,R3      ; \
ADDSL  R3,R1,1    ; | R3 times 117 = 1110101
```

In general, for a multiplier with n ones in its binary representation, we can compute the product by first making a copy of the multiplicand and then performing n–1 ADDSL operations, and if the binary representation of the multiplier has trailing zeros, one additional SL instruction. The shift counts for the ADDSL instructions can be read as the distance between consecutive ones in the multiplier.

## 14.2. Small Constant Multipliers -- Optimal Code

It is difficult to give a general rule for which model will give the shortest instruction sequence for multiplication. For each multiplier, a different approach may yield a better result. Trial and error gives us the following instruction sequences for multipliers from 2 to 10:

```        SL     R3,1       ;   R3 times  2

ADDSL  R3,R3,1    ;   R3 times  3

SL     R3,2       ;   R3 times  4

ADDSL  R3,R3,2    ;   R3 times  5

SL     R3,1       ; \ R3 times  6 = 2 x 3

NEG    R1,R3      ; \ R3 times  7 = 8 + (-1)

SL     R3,3       ;   R3 times  8

ADDSL  R3,R3,3    ;   R3 times  9

SL     R3,1       ; \ R3 times 10 = 2 x 5
```

None of the above require more than 2 instructions. For multipliers from 11 to 38, 3 instructions suffice:

```        MOVE   R1,R3      ; \ R3 times 11 = 1011

SL     R3,2       ; \ R3 times 12 = 4 x 3

MOVE   R1,R3      ; \ R3 times 13 = 1101

MOVESL R1,R3,1    ; \ R3 times 14 = (3 x 4) + 2

ADDSL  R3,R3,1    ; \ R3 times 15 = 3 x 5

SL     R3,4       ;   R3 times 16

ADDSL  R3,R3,4    ;   R3 times 17

ADDSL  R3,R3,3    ; \ R3 times 18 = 9 x 2
SL     R3,1       ; /

MOVE   R1,R3      ; \ R3 times 19 = 10011

SL     R3,2       ; \ R3 times 20 = 4 x 5

MOVE   R1,R3      ; \ R3 times 21 = 10101

MOVESL R1,R3,1    ; \ R3 times 22 = 10110 (with a trick!)

NEG    R1,R3      ; \ R3 times 23 = (3 x 8) + (-1)

SL     R3,3       ; \ R3 times 24 = 8 x 3

ADDSL  R3,R3,2    ; \ R3 times 25 = 5 x 5

MOVESL R1,R3,1    ; \ R3 times 26 = (3 x 8) + 2

ADDSL  R3,R3,3    ; \ R3 times 27 = 9 x 3

MOVESL R1,R3,2    ; \ R3 times 28 = (3 x 8) + 4

NEG    R1,R3      ; \ R3 times 29 = 32 + (-3)

SL     R3,1       ; \ R3 times 30 = 2 x 3 x 5

NEG    R1,R3      ; \ R3 times 31 = 32 + (-1)

SL     R3,5       ;   R3 times 32

ADDSL  R3,R3,5    ;   R3 times 33

ADDSL  R3,R3,4    ; \ R3 times 34 = (16 + 1) x 2
SL     R3,1       ; /

MOVESL R1,R3,1    ; \ R3 times 35 = 3 + 32

MOVESL R1,R3,2    ; \ R3 times 36 = 32 + 4

MOVE   R1,R3      ; \ R3 times 37 = 5 + 32

MOVESL R1,R3,1    ; \ R3 times 38 = (2 x 3) + 32
```

Multiplying by 39 takes 4 instructions, but multiplying by 100 takes only 3. Extending this table is tedious and the payoff is limited. Occasionally, however, some applicatons require multiplication by larger constants. For example, some pseudo-random number generators reqsuire fast multiplication by strange values such as 16,807, 39,373, 69,621 and 48,271.

## 14.3. Multiplication by Variables

The classic shift-and-add algorithm for multiplying is known by many names; if, in each step, the multiplier is right shifted and the multiplicand is left shifted, so that the partial products are added in order of increasing significance, it is called the Russian Peasant Algorithm, giving credit to one folk origin.

Whatever the shift direction or the order of additions, all basic binary multiplication algorithms work as follows: In each multiply step, one bit of the multiplier is tested and both the multiplier and product are shifted one place; if the bit is one, the multiplicand is added, as a partial product, to the product being accumulated. It takes 32 such steps to compute the 64-bit product of two 32-bit unsigned numbers.

In many cases, only the low 32 bits of the 64-bit product are, so the following simplified code is useful. This shifts both the product and multiplier left, adding the most significant partial product first:

```TIMESUL:; unsigned multiply, low 32-bits of the product
; expects:  R1 - the return address
;           R4 - the multiplier
;           R5 - the multiplicand
; returns:  R3 - the low 32 bits of product
; destroys: R4 - all bits shifted away
;           R6 - used as loop counter

CLR     R3
LIS     R6,32   ; load loop counter
TMULLP:                 ;   --- start of multiply step
SL      R3,1    ;   shift product
SL      R4,1    ;   shift multiplier
BCR     TMULCN  ;   if high bit was one
TMULCN:                 ;   --- end of multiply step
BGT     TMULLP  ; iterate
JUMPS   R1      ; return
```

Signed multiplication in the 2's complement number system poses only small added challenges. The important thing to note is that the effective weight of the sign bit is negative, so the partial product computed from the sign bit must be subtracted from the product instead of being added. Therefore, we handle the sign bit with a special multiply step that adds instead of subtracts. Aside from this change, the following code uses the same algorithm as the unsigned code given above:

```TIMESL: ; signed multiply, low 32 bits of the product
; expects:  R1 - the return address
;           R4 - the multiplier
;           R5 - the multiplicand
; returns:  R3 - the low 32 bits of product
; destroys: R4 - all bits shifted away
;           R6 - used as loop counter

CLR     R3
; --- start of special multiply step
SL      R4,1    ; shift multiplier
BCR     TMSSKP  ; if sign bit was 1
SUB     R3,R0,R5;   subtract partial product from product
TMSSKP:                 ; --- end of special multiply step
LIS     R6,31   ; load loop counter
TMSLLP:                 ;   --- start of multiply step
SL      R3,1    ;   shift product
SL      R4,1    ;   shift multiplier
BCR     TMSLCN  ;   if high bit was one
TMSLCN:                 ;   --- end of multiply step
BGT     TMSLLP  ; iterate
JUMPS   R1      ; return
```

An obvious way to speed up this code is to unroll the loops. Here is the code for unsigned integer multiply, with the loop partially unrolled so there are 2 multiply steps per iteration:

```TIMESUL:; unsigned multiply, low 32-bits of the product
; expects:  R1 - the return address
;           R4 - the multiplier
;           R5 - the multiplicand
; returns:  R3 - the low 32 bits of product
; destroys: R4 - all bits shifted away
;           R6 - used as loop counter

CLR     R3
LIS     R6,16   ; load loop counter
TMULLP:                 ;   --- start of first multiply step
SL      R3,1    ;   shift product
SL      R4,1    ;   shift multiplier
BCR     TMULC1  ;   if high bit was one
TMULC1:                 ;   --- end of first, start of second
SL      R3,1    ;   shift product
SL      R4,1    ;   shift multiplier
BCR     TMULC2  ;   if high bit was one
TMULC2:                 ;   --- end of second multiply step
BGT     TMULLP  ; iterate
JUMPS   R1      ; return
```

Each multiply step in the original code had 4 instructions, plus 2 instructions for loop control. Each unrolling, adding an additional multiply step to the loop body, eliminates 2 more loop control instructions. If the loop is fully unrolled, so all 32 multiply steps are in line, the net performance improvement would be 1/3. On a pipelined machine, the elimination of the branch penalty may lead to even greater improvements.

The following table summarizes the potential performance gains by unrolling the unsigned multiply routine, from no unrolling with one multiply step per iteration, to completely unrolled with 32 multiply steps and no loop. Execution time can be estimated from the number of instructions, but on pipelined machines, there will be branch penalties that may only apply to unpredictable branches.

 Unrolling (steps per iteration) Code size, bytes Worst case instructions Average instructions 1 2 n < 32 32 18 26 10 + 8n 260 195 163 131 + 64/n 130 179 147 115 + 64/n 114 63 47 31 + 32/n 32 47 31 15 + 32/n 16 31 15 –1 + 32/n 0

## 14.4. Radix 16 Multiplication -- Fast Code

Another way to speed up multiplication is to move from binary to a higher radix such as base 16. This cuts the number of iterations from 32 down to 8, but it requires a case-select construct for each of the 16 possibilities at each step. This idea is borrowed from the way the HP PA-RISC does integer multiplication.

Here, BTRUNC is used for case selection based on the rightmost bits of the multiplier. bits of a register. The code right shifts the multiplier after each step, adding the least significant partial product first. It keeps only the least significant 32 bits of the product. The first version of this code is written for clarity:

```TIMESUL:; unsigned multiply, low 32-bits of the product
; expects:  R1 - the return address
;           R4 - the multiplier
;           R5 - the multiplicand
; returns:  R3 - the low 32 bits of product
; destroys: R4 - all bits shifted away
;           R6 - used as loop counter
;           R7 - used for partial products

CLR     R3
LIS     R6,8    ; load loop counter

TMULLP:                 ; --- start of radix 16 multiply step
MOVE    R7,R5   ; copy multiplicand for partial product computation
BTRUNC  R4,4    ; decode least 4 bits of multiplier
BR      TMUL0   ;   branch table for BTRUNC
BR      TMUL1
BR      TMUL2
BR      TMUL3
BR      TMUL4
BR      TMUL5
BR      TMUL6
BR      TMUL7
BR      TMUL8
BR      TMUL9
BR      TMULA
BR      TMULB
BR      TMULC
BR      TMULD
BR      TMULE
BR      TMULF

TMUL0:  CLR     R7      ; partial product is zero
BR      TMULCN  ; done with 0

TMUL1:  BR      TMULCN  ; done with 1

TMUL2:  SL      R7,1    ; multiply by two
BR      TMULCN  ; done with 2

TMUL3:  ADDSL   R7,R7,1 ; multiply by three
BR      TMULCN  ; done with 3

TMUL4:  SL      R7,2    ; multiply by four
BR      TMULCN  ; done with 4

TMUL5:  ADDSL   R7,R7,2 ; multiply by five
BR      TMULCN  ; done with 5

TMUL6:  SL      R7,1    ; multiply by two
ADDSL   R7,R7,1 ; multiply by three
BR      TMULCN  ; done with 6

TMUL7:  ADDSL   R7,R7,1 ; multiply by three
BR      TMULCN  ; done with 7

TMUL8:  SL      R7,3    ; multiply by eight
BR      TMULCN  ; done with 8

TMUL9:  ADDSL   R7,R7,3 ; multiply by nine
BR      TMULCN  ; done with 9

TMULA:  SL      R7,1    ; multiply by two
ADDSL   R7,R7,2 ; multiply by five
BR      TMULCN  ; done with 10

TMULB:  ADDSL   R7,R7,2 ; multiply by five
BR      TMULCN  ; done with 11

TMULC:  ADDSL   R7,R7,1 ; multiply by three
SL      R7,2    ; multiply by four
BR      TMULCN  ; done with 12

TMULD:  ADDSL   R7,R7,1 ; multiply by three
BR      TMULCN  ; done with 13

TMULE:  ADDSL   R7,R7,1 ; multiply by three
SL      R7,1    ; multiply by two
BR      TMULCN  ; done with 14

TMULF:  ADDSL   R7,R7,1 ; multiply by three
ADDSL   R7,R7,2 ; multiply by five
BR      TMULCN  ; done with 15

SL      R5,4    ; shift multiplicand
SR      R4,4    ; shift multiplier
; --- end of radix 16 multiply step

BGT     TMULLP  ; iterate
JUMPS   R1      ; return
```

The above radix-16 algorithm is significantly faster than the unsigned binary code given previously, yet it is more compact than the fully unrolled binary multiply algorithm. If integer multiply speed is paramount, we can optimize the above code at the expense of readability. Such optimization may be justified because, once debugged, this code can be used by a many applications; the gain, however, is small.

```TIMESUL:; unsigned multiply, low 32-bits of the product
; expects:  R1 - the return address
;           R4 - the multiplier
;           R5 - the multiplicand
; returns:  R3 - the low 32 bits of product
; destroys: R4 - all bits shifted away
;           R6 - used as loop counter
;           R7 - used for partial products

CLR     R3
LIS     R6,8    ; load loop counter

TMULLP:                 ; --- start of radix 16 multiply step
BTRUNC  R4,4    ; decode least 4 bits of multiplier
BR      TMUL0   ;   branch table for BTRUNC
BR      TMUL1
BR      TMUL2
BR      TMUL3
BR      TMUL4
BR      TMUL5
BR      TMUL6
BR      TMUL7
BR      TMUL8
BR      TMUL9
BR      TMULA
BR      TMULB
BR      TMULC
BR      TMULD
BR      TMULE
; the following code handles case 15!
MOVE    R7,R5   ; copy multiplicand for partial product computation
ADDSL   R7,R7,1 ; multiply by three
ADDSL   R7,R7,2 ; multiply by five -- R7 = R5 x 15
ADD     R3,R3,R7; accumulate product -- start standard epilogue
SL      R5,4    ; shift multiplicand
SR      R4,4    ; shift multiplier -- to end multiply step
BGT     TMULLP  ; iterate
JUMPS   R1      ; return

TMUL0:                  ; case 0 has an optimized epilogue with no ADD
SL      R5,4    ; shift multiplicand
SR      R4,4    ; shift multiplier -- to end multiply step
BGT     TMULLP  ; iterate
JUMPS   R1      ; return

TMUL1:                  ; case 1 has an optimized epilogue, MOVE eliminated
SL      R5,4    ; shift multiplicand
SR      R4,4    ; shift multiplier -- to end multiply step
BGT     TMULLP  ; iterate
JUMPS   R1      ; return

TMUL2:  MOVESL  R7,R5,1 ; multiply by two -- R7 = R5 x 2
ADD     R3,R3,R7; accumulate product -- start standard epilogue
SL      R5,4    ; shift multiplicand
SR      R4,4    ; shift multiplier -- to end multiply step
BGT     TMULLP  ; iterate
JUMPS   R1      ; return

TMUL3:  MOVE    R7,R5   ; copy multiplicand for partial product computation
ADDSL   R7,R7,1 ; multiply by three -- R7 = R5 x 3
ADD     R3,R3,R7; accumulate product -- start standard epilogue
SL      R5,4    ; shift multiplicand
SR      R4,4    ; shift multiplier -- to end multiply step
BGT     TMULLP  ; iterate
JUMPS   R1      ; return

TMUL4:  MOVESL  R7,R5,2 ; multiply by four -- R7 = R5 x 4
ADD     R3,R3,R7; accumulate product -- start standard epilogue
SL      R5,4    ; shift multiplicand
SR      R4,4    ; shift multiplier -- to end multiply step
BGT     TMULLP  ; iterate
JUMPS   R1      ; return

TMUL5:  MOVE    R7,R5   ; copy multiplicand for partial product computation
ADDSL   R7,R7,2 ; multiply by five -- R7 = R5 x 5
ADD     R3,R3,R7; accumulate product -- start standard epilogue
SL      R5,4    ; shift multiplicand
SR      R4,4    ; shift multiplier -- to end multiply step
BGT     TMULLP  ; iterate
JUMPS   R1      ; return

TMUL6:  MOVESL  R7,R5,1 ; multiply by two
ADDSL   R7,R7,1 ; multiply by three -- R7 = R5 x 6
ADD     R3,R3,R7; accumulate product -- start standard epilogue
SL      R5,4    ; shift multiplicand
SR      R4,4    ; shift multiplier -- to end multiply step
BGT     TMULLP  ; iterate
JUMPS   R1      ; return

TMUL7:  MOVE    R7,R5   ; copy multiplicand for partial product computation
ADDSL   R7,R7,1 ; multiply by three
ADDSL   R7,R5,1 ; multiply by two and add one -- R7 = R5 x 7
ADD     R3,R3,R7; accumulate product -- start standard epilogue
SL      R5,4    ; shift multiplicand
SR      R4,4    ; shift multiplier -- to end multiply step
BGT     TMULLP  ; iterate
JUMPS   R1      ; return

TMUL8:  MOVESL  R7,R5,3 ; multiply by 8 -- R7 = R5 x 8
ADD     R3,R3,R7; accumulate product -- start standard epilogue
SL      R5,4    ; shift multiplicand
SR      R4,4    ; shift multiplier -- to end multiply step
BGT     TMULLP  ; iterate
JUMPS   R1      ; return

TMUL9:  MOVE    R7,R5   ; copy multiplicand for partial product computation
ADDSL   R7,R7,3 ; multiply by nine -- R7 = R5 x 9
ADD     R3,R3,R7; accumulate product -- start standard epilogue
SL      R5,4    ; shift multiplicand
SR      R4,4    ; shift multiplier -- to end multiply step
BGT     TMULLP  ; iterate
JUMPS   R1      ; return

TMULA:  MOVESL  R7,R5,1 ; multiply by two
ADDSL   R7,R7,2 ; multiply by five -- R7 = R5 x 10
ADD     R3,R3,R7; accumulate product -- start standard epilogue
SL      R5,4    ; shift multiplicand
SR      R4,4    ; shift multiplier -- to end multiply step
BGT     TMULLP  ; iterate
JUMPS   R1      ; return

TMULB:  MOVE    R7,R5   ; copy multiplicand for partial product computation
ADDSL   R7,R7,2 ; multiply by five
ADDSL   R7,R5,1 ; multiply by two and add one -- R7 = R5 x 11
ADD     R3,R3,R7; accumulate product -- start standard epilogue
SL      R5,4    ; shift multiplicand
SR      R4,4    ; shift multiplier -- to end multiply step
BGT     TMULLP  ; iterate
JUMPS   R1      ; return

TMULC:  MOVESL  R7,R5,2 ; multiply by four
ADDSL   R7,R7,1 ; multiply by three -- R7 = R5 x 12
ADD     R3,R3,R7; accumulate product -- start standard epilogue
SL      R5,4    ; shift multiplicand
SR      R4,4    ; shift multiplier -- to end multiply step
BGT     TMULLP  ; iterate
JUMPS   R1      ; return

TMULD:  MOVE    R7,R5   ; copy multiplicand for partial product computation
ADDSL   R7,R7,1 ; multiply by three
ADDSL   R7,R5,2 ; multiply by four and add 1 -- R7 = R5 x 13
ADD     R3,R3,R7; accumulate product -- start standard epilogue
SL      R5,4    ; shift multiplicand
SR      R4,4    ; shift multiplier -- to end multiply step
BGT     TMULLP  ; iterate
JUMPS   R1      ; return

TMULE:  MOVE    R7,R5   ; copy multiplicand for partial product computation
ADDSL   R7,R7,1 ; multiply by three
SL      R7,1    ; multiply by two -- R7 = R5 x 14
ADD     R3,R3,R7; accumulate product -- start standard epilogue
SL      R5,4    ; shift multiplicand
SR      R4,4    ; shift multiplier -- to end multiply step
BGT     TMULLP  ; iterate
JUMPS   R1      ; return
```

As is usual with optimized code, the above code is not particularly readable. Code from the loop prologue along with the entire loop epilogue has been duplicated in each case, and then the result has been optimized. As a result, the borderline between the divide step and the enclosing iterative control structure has been blurred. In addition, case 15 has been moved to the start of the list, in order to eliminate the last branch from the branch table.

The code above will be better than twice the speed of the unoptimized unsigned binary multiply code for random integers. Note that multipliers with long runs of zero bits are very common. These are the worst case for the unsigned binary code and the best case for this code. The following table completes the comparison of all the different unsigned multiply routines given above:

Multiply algorithm Code size, bytes Worst case instructions Average instructions Simple Unrolled Radix 16 Optimized Radix 16 18 260 132 294 195 130 99 91 179 114 86.5 73.5 63 32 31 23 47 16 31 <23 31 0 23 15

We are clearly in the realm of diminishing returns here, having more than doubled the code size over the original radix-16 code for a 10% gain in speed. It is quite possible that the doubled size of this code could lead to a net decrease in performance on machines with a small cache, since the larger code could push instructions out of the cache, leading to overall slower execution.

If there is no cache or a sufficiently large cache, the total number of instructions per multiply can be reduced to 56.5 by fully unrolling the loop. This final optimization multiplies the code size by a factor of just under 8, to over 2K bytes, in order to achieve a 25% speed improvement.

## 14.5. Division by Powers of 2

SRU can be used for fast unsigned division with truncation:

```        SRU    R3,1       ; unsigned divide by 2
SRU    R3,2       ; unsigned divide by 4
```

Use of SR for signed division is complicated by the fact that there are two alternative rules for truncating the quotient. SR truncates the quotient to the next lower integer so the remainder is always positive:

```        SR     R3,1       ; signed divide by 2, truncating downward
SR     R3,2       ; signed divide by 4, truncating downward
```

Generally, people expect –5 divided by 4 to give a quotient of –1 with a remainder of –1. When SR is used for this division problem, it gives a quotient of –2 with a remainder of 3. In fact, our naive expectation is contradicted by another equally naive expectation. We expect the remainder after division by 4 to be between 0 and 3. These two naive expectations are incompatible for signed dividends. SR satisfies our naive expectation about the remainder while violating our naive expectation about the quotient.

Many programming languages define integer division as truncating the result toward zero, satisfying the naive expectation for the quotient. That is, the sign of the remainder always matches the sign of the dividend. The ADJUST r,SSQ instruction can be used to deliver this result, as follows:

```        SR     R3,2       ; divide by 4, truncating downward
```

For both signed and unsigned shifts, the quotient can be rounded to the nearest integer as follows:

```        SR     R3,2       ; signed divide by 4 with positive remainder
ADDC   R3,0       ;   round so remainder is between -2 and 1

SRU    R3,3       ; unsigned divide by 8 with positive remainder
ADDC   R3,0       ;   round so remainder is between -4 and 3
```

## 14.6. Division by Constants -- Methodology

In general, it is faster to multiply by the reciprocal of a number than to divide by that number! This is particularly important for division by constants that are not powers of two, since we can divide by a power of two by a simple shift. Furthermore, this applies across most computers, including those with hardware divide instructions. The following table shows the reciprocals of the integers from 1 to 10.

n 1/n (base 10) 1/n (base 2)
1 1.0 1.0
2 0.5 0.1
3 0.33 0.0101
4 0.25 0.01
5 0.2 0.00110011
6 0.166 0.00101
7 0.142857142857 0.001001
8 0.125 0.001
9 0.11 0.000111000111
10 0.1 0.000110011

In binary or decimal repeating fractions are traditionally marked with an overbar above the string of digits that repeat; here, this is rendered like this. Many numbers that have exact reciprocals in decimal have reciprocals that repeat in binary. For example, 0.110 has no finite binary representation.

In general, if we multiply a 32 bit integer by the reciprocal, truncated to 32 bits, we get a result that differs from the expected quotient by no more than one. In some cases, this inaccuracy is acceptable, but in most cases, an exact quotient is needed.

One way to obtain an exact quotient is to multiply by a carefully selected approximation of the reciprocal. In general, for an n-bit unsigned divisor, we can get the exact result by multiplying by an n+1 bit quantity and then shifting to take the appropriate bits of the product. The n+1 bit approximate reciprocal cannot have more than n+1 one-bits, so there are at most n+1 partial products to add in computing the quotient this way. This means that we can divide a 32-bit unsigned number by an arbitrary constant with a sequence of no more than 33 ADDSRU instructions. Here are some examples:

```        ;        R3 = R3 / 3 = R3 * 0.010101010101010101010101010101011
MOVE    R1,R3   ; R3 = R1 * 1.0      (make copy of original R3)
SRU     R3,1    ; R3 = R1 * 0.1
ADDSRU  R3,R1,2 ; R3 = R1 * 0.011    exact R3/3 for all R1 < 8
ADDSRU  R3,R1,2 ; R3 = R1 * 0.01011                 all R1 < 32
ADDSRU  R3,R1,2 ; R3 = R1 * 0.0101011               all R1 < 128
ADDSRU  R3,R1,2 ; R3 = R1 * 0.010101011             all R1 < 512
ADDSRU  R3,R1,2 ; R3 = R1 * 0.01010101011           all R1 < 2,048
ADDSRU  R3,R1,2 ; R3 = R1 * 0.0101010101011         all R1 < 8,192
ADDSRU  R3,R1,2 ; R3 = R1 * 0.010101010101011       all R1 < 32,768
ADDSRU  R3,R1,2 ; R3 = R1 * 0.01010101010101011     all R1 < 131,072
ADDSRU  R3,R1,2 ; R3 = R1 * 0.0101010101010101011
ADDSRU  R3,R1,2 ; R3 = R1 * 0.010101010101010101011
ADDSRU  R3,R1,2 ; R3 = R1 * 0.01010101010101010101011
ADDSRU  R3,R1,2 ; R3 = R1 * 0.0101010101010101010101011
ADDSRU  R3,R1,2 ; R3 = R1 * 0.010101010101010101010101011
ADDSRU  R3,R1,2 ; R3 = R1 * 0.01010101010101010101010101011
ADDSRU  R3,R1,2 ; R3 = R1 * 0.0101010101010101010101010101011
ADDSRU  R3,R1,2 ; R3 = R1 * 0.010101010101010101010101010101011

;        R3 = R3 / 5 = R3 * 0.0011001100110011001100110011001101
MOVE    R1,R3   ; R3 = R1 * 1.0      (make copy of original R3)
SRU     R3,2    ; R3 = R1 * 0.01
ADDSRU  R3,R1,1 ; R3 = R1 * 0.101
ADDSRU  R3,R1,3 ; R3 = R1 * 0.001101 exact R3/5 for all R1 < 64
ADDSRU  R3,R1,1 ; R3 = R1 * 0.1001101
ADDSRU  R3,R1,3 ; R3 = R1 * 0.0011001101                R1 < 1,024
ADDSRU  R3,R1,1 ; R3 = R1 * 0.10011001101
ADDSRU  R3,R1,3 ; R3 = R1 * 0.00110011001101            R1 < 16,384
ADDSRU  R3,R1,1 ; R3 = R1 * 0.100110011001101
ADDSRU  R3,R1,3 ; R3 = R1 * 0.001100110011001101        R1 < 262,144
ADDSRU  R3,R1,1 ; R3 = R1 * 0.1001100110011001101
ADDSRU  R3,R1,3 ; R3 = R1 * 0.0011001100110011001101
ADDSRU  R3,R1,1 ; R3 = R1 * 0.10011001100110011001101
ADDSRU  R3,R1,3 ; R3 = R1 * 0.00110011001100110011001101
ADDSRU  R3,R1,1 ; R3 = R1 * 0.100110011001100110011001101
ADDSRU  R3,R1,3 ; R3 = R1 * 0.001100110011001100110011001101
ADDSRU  R3,R1,1 ; R3 = R1 * 0.1001100110011001100110011001101
ADDSRU  R3,R1,3 ; R3 = R1 * 0.0011001100110011001100110011001101
```

As the comments above indicate, the complete series of 16 ADDSRU instructions is only needed if the maximum value of the dividend is unknown. If the maximum is known, an abbreviated shift sequence will suffice. For example, for a 16-bit dividend, 6 or 7 ADDSRU instructions do the job.

Given code to divide by 3, division by 6, 12 and other binary multiples of 3 is straightforward; simply increase the final shift count appropriately. Similarly, we can derive code for divide by 10, 20 and other binary multiples of 5 by simple changes to the final shift count.

The above examples illustrate a general technique for multiplying an unsigned binary integer by a fractional constant binary multiplier. If there are k one-bits in the multiplier, there will be one SRU instruction followed by k–1 ADDSRU instructions. Reading from the first SRU downward, the sequence of shift counts gives the intervals between consecutive one-bits of the multiplier, starting with the least-significant bit. If the two least significant bits are one, the first shift count is one; if there are z zeros between the two least significant ones, the first shift count is z+1. The final shift count follows the same formula, based on the number of zeros between the point and the most significant one.

To divide a 32-bit integer by an arbitrary integer constant, take the reciprocal of the divisor as a fixed point binary fraction. This fraction must be computed to 33 places of precision, counting the most significant one as the first place. Then, add one to the bit 33 places to the right of the most significant one.

For repeating binary fractions, faster code is possible at the expense of a result that is only approximate. The trick is to find the repeating pattern in the reciprocal and exploit it. The following example illustrates this for division by 5:

```        ;     R3 = ~(R3 / 5) = R3 * 0.00110011001100110011001100110011
MOVE    R1,R3   ; R3 = R1 * 1.0      (make copy of original R3)
SRU     R3,1    ; R3 = R1 * 0.1
ADDSRU  R3,R1,3 ; R3 = R1 * 0.0011
ADDSRU  R3,R1,1 ; R3 = R1 * 0.10011
ADDSRU  R3,R1,1 ; R3 = R1 * 0.110011
MOVE    R4,R3   ; R3 = R1 * 0.110011 (make copy of working value)
SRU     R3,8    ; R3 = R1 * 0.00000000110011
ADDSRU  R3,R4,8 ; R3 = R1 * 0.0000000011001100110011          *
ADDSRU  R3,R4,8 ; R3 = R1 * 0.000000001100110011001100110011  *
ADDSRU  R3,R4,2 ; R3 = R1 * 0.00110011001100110011001100110011
;     Either R3 = (R1 / 5) or (R3 + 1) = (R1 / 5);
```

This took only 10 instructions instead of the 18 in the brute-force code, but the result is not exact. With this code, the quotient may be off by one. If the two lines marked with asterisks are deleted, the code works for all 16-bit dividends, while if just one of these is deleted, it is good for all 24-bit dividends.

Because the approximation will never be off by more than one, we can fix it by computing the remainder and then checking to see that it is within bounds. The following code fixes the result of the above divide by 5 code:

```        ; Given R1 = dividend, R3 = approximate quotient
MOVE    R4,R3
ADDSL   R4,R3,2 ; R4 = R3 * 5 = (R1 / 5) * 5
SUB     R4,R1,R4; R4 = R1 % 5 (remainder after division)
CMPI    R4,5
BLTU    NOFIX   ; if remainder >= 5
NOFIX:
; Now, R1 = dividend, R3 = quotient, R4 = remainder
```

The above code produces both the remainder and the quotient after division. In fact, aside from the "perfect" algorithms given initially, where an exact quotient is produced by reciprocal multiplication, all useful division algorithms produce the remainder and quotient as the result of a single computation. Similarly, computers with a general purpose divide instruction usually use it to produce both the remainder and quotient.

These methods extend trivially to signed division by positive constants, with one caveat. Simply replace SRU and ADDSRU with SR and ADDSR. The caveat is that the remainder will always be positive. If you want the sign of the remainder the same as the sign of the dividend, you will need to add a correction step similar to that suggested above: If the dividend is negative and the remainder is nonzero, increment the quotient and decrement the remainder by the divisor. Computing the remainder is does exactly the same in both the signed and unsigned versions.

## 14.7. Division by Variables

Division by a variable is somewhat more complex than multiplication by a variable. The following code implements the classic restoring division algorithm on 32-bit unsigned operands, giving a 32-bit remainder and a 32-bit quotient.

```DIVIDEU:; unsigned divide, 32-bit operands
; expects:  R1 - return address
;           R3 - dividend
;           R5 - divisor
; returns:  R3 - quotient
;           R4 - remainder
; destroys: R6 - used as loop counter

CLR     R4      ; clear remainder
LIS     R6,32   ; load loop counter

DIVULP:                 ;   --- start of divide step
SL      R3,1    ;   shift dividend/quotient
ROL     R4      ;     putting high bit into remainder
CMP     R4,R5   ;   see if divisor divides remainder
BLTU    DIVUC   ;     if not, skip
SUB     R4,R4,R5;     if so, take away divisor