Assignment 8, Solutions

Part of the homework for 22C:60 (CS:2630), Spring 2012
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Background: One approximation for π is 333/106. In a high level language, you could write a function to (approximately) multiply an integer by π as follows:
```
unsigned int timespi( unsigned int i ) {
    return (i * 333)/106;
}
```
Note that, to 64 bits of precision in hex, 1/106 = 0.026A439F656F1826
a) Write the fastest SMAL Hawk instruction sequence you can find to multiply by 333, using the tricks discussed in Chapter 9 of the notes and Chapter 14 of the Hawk manual. (0.5 points)
333 = 3 × 111 = 3 × 3 × 37 -- prime factors
But, we have a fast trick for multiplying by 9, so we'll use 9 × 37. This gives the following, using the cookbook in Section 14.2 of the Hawk manual:
```
        ADDSL  R3,R3,3    ;   R3 times 9
        MOV    R1,R3      ; \ R3 times 37 = 5 + 32
        ADDSL  R1,R1,2    ; |
        ADDSL  R3,R1,5    ; /
```
This looks very good, but it might be faster to use the binary representation, with one ADDSL per one bit, plus a single MOV instruction. On conversion, 333₁₀ is 101001101₂, which has 5 one bits. As such, this looks like a dead end. On careful inspection, though, this is the sum of two interesting numbers, 101000101 and 1000. We can use this to advantage:
```
	MOVSL  R1,R3,3    ;   R1 = R3 * 8 (1000)
	ADDSL  R3,R3,2    ;   R3 = R3 * 5 (101)      \ together, give 101000101
	ADDSL  R3,R3,6    ;   R3 = R3 * 65 (1000001) /
	ADD    R3,R3,R1   ;   put it all together
```
In conclusion, we have a tie. The first answer was easy (using cookbook methods), while the second took inventing and was unlikely to show up in any solutions to the homework.
b) Write the fastest SMAL Hawk instruction sequence you can find to divide by 106, using all the tricks. (0.5 points)
Given that 1/106 = 0.026A439F656F1826, we need either the 32 most significant bits of the fraction, with correction, or or the 33 most significant nonzero bits:
```
0000 0010 0110 1010 0100 0011 1001 1111 -- most significant bits
0000 0010 0110 1010 0100 0011 1001 1111 0110 0101
       \______________nonzero bits_____________/
```
We need a shift and add for each one bit in either solution. The second solution has 3 more 1 bits, but as detailed in Section 14.6 of the Hawk manual, using only the 32 most significant bits would require multiplying by 106 to check if the result was off by one. 106 binary is 1101010, so it will take at least three instructions to do this, suggesting that we are better off using the second solution:
```
        MOVE    R1,R3   ; R3 = R1 * 1.0
        SRU     R3,2    ; R3 = R1 * 0.01
        ADDSRU  R3,R1,3 ; R3 = R1 * 0.00101
        ADDSRU  R3,R1,1 ; R3 = R1 * 0.100101
        ADDSRU  R3,R1,2 ; R3 = R1 * 0.01100101
        ADDSRU  R3,R1,1 ; R3 = R1 * 0.101100101
        ADDSRU  R3,R1,1 ; R3 = R1 * 0.1101100101
        ADDSRU  R3,R1,1 ; R3 = R1 * 0.11101100101
        ADDSRU  R3,R1,1 ; R3 = R1 * 0.111101100101
        ADDSRU  R3,R1,3 ; R3 = R1 * 0.001111101100101
        ADDSRU  R3,R1,1 ; R3 = R1 * 0.1001111101100101
        ADDSRU  R3,R1,1 ; R3 = R1 * 0.11001111101100101
        ADDSRU  R3,R1,5 ; R3 = R1 * 0.0000111001111101100101
        ADDSRU  R3,R1,3 ; R3 = R1 * 0.0010000111001111101100101
        ADDSRU  R3,R1,2 ; R3 = R1 * 0.010010000111001111101100101
        ADDSRU  R3,R1,2 ; R3 = R1 * 0.01010010000111001111101100101
        ADDSRU  R3,R1,1 ; R3 = R1 * 0.101010010000111001111101100101
        ADDSRU  R3,R1,3 ; R3 = R1 * 0.001101010010000111001111101100101
        ADDSRU  R3,R1,7 ; R3 = R1 * 0.0000001001101010010000111001111101100101
```

Background: Given a 64-bit value, with the least significant half in R3 and the most significant half in R4, give the fastest code you can work out for:

a) Shifting the value 1 place left. (0.2 points)

        SL      R3,1    ; the low half, shifts one bit into carry
        ADDC    R3,R3   ; the high half, and capture the carry bit

b) Shifting the value 6 places left. Strong hint: Yes, 6 repetitions of the code from part a) will work, but the problem can be solved in half as many instructions. (0.8 points)

        MOVE    R1,R3   ; keep a copy of the low half
        SL      R3,6    ; shift the low half, discarding 6 high bits
        SL      R4,6    ; shift the high half
        SR      R1,10   ; align the copy of the discarded bits of the low half
        OR      R4,R1   ; align the copy of the discarded bits of the low half

Background: The first unsigned multiplication routine given in Section 14.3 of the Hawk manual is not the same as the unsigned multiply code given in Chapter 9.
a) Explain the most important difference between the two. Your explanation must be concise and at a high level. (0.4 points)

The version in Chapter 9 shifts the multiplier right while shifting the multiplicand left, using an indefinite loop and processing the least significant bit of the multiplier first. The version in Section 14 of the hawk manual shifts both of them left, using a definite loop and processing the most significant bit of the multiplier first.
The most important difference might be the difference in shift direction, because this implies the other two.

b) Modify the code from Section 14.3 so it computes the 64-bit product of two unsigned 32-bit operands. (Hint: the modification required involves changing exactly one instruction.) (0.6 points)
```
TIMESUL:; unsigned multiply, low 32-bits of the product
        ;   inputs
        ;   - R1 return address
        ;   - R4 multiplier (unsigned!)
        ;   - R5 multiplicand (unsigned!)
        ;   outputs
        ;   - R3 low 32 bits of product (unsigned!)
        ;   - R4 high 32 bits of product (unsigned!)                   *** a
        ;   - R5 multiplicand
        ;   - R6 scratch -- used as loop counter

        CLR     R3      ;
        LIS     R6,32   ; load loop counter
TMULLP:                 ; --- start of multiply step
        SL      R3,1    ; shift product 
        ADDC    R4,R4   ; shift multiplier and capture top product bit *** b
        BCR     TMULCN  ; if high bit of multiplier was one
        ADD     R3,R3,R5;   add partial product to product
        ADDC    R4,R0   ;   capture any carry produced above           *** c
TMULCN:                 ; --- end of multiply step
        ADDSI   R6,-1   ; count
        BGT     TMULLP  ; iterate
        JUMPS   R1      ; return 
```
The changed lines are marked with stars. Line a is a changed comment to document the effect of the other two changes. Line b is the crucial change, converting two independent 32-bit shifts into a single 64-bit shift. Line c is important only when adding the multiplicand to the product produces a carry out of the low 32 bits. This will be true only for the very largest multiplicands and can easily be missed.