Assignment 8, due Mar 27

Solutions

Part of the homework for CS:2630, Spring 2015
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Background: In C, Java or Python, you can write expressions like 1<<i where the shift count is a constant. In contrast, the Hawk architecture (like many other computers) only has constant shift counts in its shift instructions. If you wanted to compute 1<<i on the Hawk, with the result in R3 and the unsigned variable i in R4, you could do it with the following simple iterative solution:
```
        LIS     R3,1
        TESTR   R4
        BZS     DONE
LOOP:   SL      R3,1
        ADDSI   R4,-1
        BZR     LOOP
DONE:
```
Here is a clever alternative solution:
```
        LIS     R3,1
        BITTST  R4,0
        BBR     SKIP0
        SL      R3,1
SKIP0:  BITTST  R4,1
        BBR     SKIP1
        SL      R3,2
SKIP1:  BITTST  R4,2
        BBR     SKIP2
        SL      R3,4
SKIP2:  BITTST  R4,3
        BBR     SKIP3
        SL      R3,8
SKIP3:  BITTST  R4,4
        BBR     SKIP4
        SL      R3,16
SKIP4:
```
a) Assuming all instructions take the same amount of time, what is the smallest shift count (in R4) for which the clever alternative is faster? (0.4 points)

For a shift count of 4, the clever solution takes 12 instructions, while the iterative solution takes 15 instructions.

b) Under the same assumptions, what is the largest shift count (in R4) for which the simple iterative solution is faster? (0.4 points)

For a shift count of 3, the iterative solution takes 12 instructions while the clever solution takes 13 instructions.
In general, the clever solution takes 11 instructions plus one more instruction for each one bit in the shift count (between 0 and 31), while the iterative solution takes 3 instructions plus 3 more for each bit shifted.

c) How many shift (or shift-add) instructions are included in the clever alternative solution? (Hint: You will probably have to look things up in the Hawk manual to find out which of the instructions are shift instructions. (0.2 points)

There are 5 SL instructions, the obvious answer; these are actually ADDSL instructions. plus 5 BITTST instructins; these are really ADDSR instructions (because the bit numbers are less than 16). This makes a total of 10. See Chapter 6 of the Hawk manual for details.
A problem:
Give the fastest Hawk code you can to multiply by 2015. (1.0 points)
First, 2015 is 5 × 13 × 31 or 5 × 403 or 13 × 155 or 31 × 65. Alternatively, 2015 is 11111011111 in binary. Since the binary representation has lots of one bits, we'll ignore it for now. The first obvious candidate is 5 × 13 × 31. This gives us:
```
        ADDSL  R3,R3,2    ;   R3 times  5
        MOVE   R1,R3      ; \ R3 times 13 = 1101
        ADDSL  R3,R1,1    ; |
        ADDSL  R3,R1,2    ; /
        NEG    R1,R3      ; \ R3 times 31 = 32 + (-1)
        ADDSL  R3,R3,5    ; /
```
But this ignores the fact that 5 × 13 is 65, or 1000001 in binary. This gives the following:
```
        ADDSL  R3,R3,6    ;   R3 times 65
        NEG    R1,R3      ; \ R3 times 31 = 32 + (-1)
        ADDSL  R3,R3,5    ; /
```
There are other approaches. For example 11111011111 is 100000000000 minus 100001. Neither of these have too many one bits, so we can write this code:
```
        MOVSL  R1,R3,11   ;   R1 = R3 times 100000000000
	ADDSL  R3,R3,5    ;   R3 = R3 times 100001
        SUB    R3,R1,R3
```
There are many variations on the above; consider this:
```
        NEG    R1,R3
	ADDSL  R1,R1,5
	ADDSL  R3,R1,11
```
There don't seem to be any solutions shorter than 3 instructions.

Background: Consider the first version of TIMESUL given in section 14.3 of the Hawk manual. This returns the low 32 bits of the unsigned product of two 32-bit numbers in R3.

A problem: Modify this code so it returns the low 32 bits of the product in R3 and the high 32 bits of the product in R4. (1.0 points)

Hint: Study how to do 64-bit shifts, a subject discussed in the section on Division in Chapter 9 of the notes and also in Section 10 of the Hawk manual. You should also look at which bits of R4 contain useful information as TIMESUL performs successive iterations.

TIMESUL:; unsigned multiply, low 32-bits of the product
        ; expects:  R1 - the return address
        ;           R4 - the multiplier
        ;           R5 - the multiplicand
        ; returns:  R3 - the low 32 bits of product
        ; destroys: R4 - the high 32 bits of the product -- update comment
        ;           R6 - used as loop counter

        CLR     R3 
        LIS     R6,32   ; load loop counter
TMULLP:                 ;   --- start of multiply step
        SL      R3,1    ;   shift product 
;       SL      R4,1    ;   shift multiplier -------- replaced instruction
        ADDC    R4,R4   ;   shift multiplier -------- new instruciton
        BCR     TMULCN  ;   if high bit was one
        ADD     R3,R3,R5;     add partial product to product
        ADDC    R4,R0   ;     keep the carry -------- new instruction
TMULCN:                 ;   --- end of multiply step
        ADDSI   R6,-1   ;   count
        BGT     TMULLP  ; iterate
        JUMPS   R1      ; return

It is easy to forget the second ADDC instruction, but adding the next partial product can cause a carry out of the low 32 bits of the result.