Final Exam

Solutions and Commentary

Part of material for 22C:122/55:132, Spring 2004
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Grade Distributions

Final Exam

Median  16.0                         X
Mean    16.40                        X
                             X X   X X         X
                             X X   X X X       X
    0 . 2 . 4 . 6 . 8 . 10. 12. 14. 16. 18. 20. 22. 24. 26. 28. 30

Total of Final plus Midterm

Median = 26
Mean   = 26.33
              X   X X   X X   X     X         X
 12. 14. 16. 18. 20. 22. 24. 26. 28. 30. 32. 34. 36. 38. 40. 42. 44

Homework, Top 10 Scores out of 13 Assignments

Median  = 48                                                  X
Mean    = 45.20                                               X
                                                              X X
                                                            X X X
                                                      X X   X X X
 18. 20. 22. 24. 26. 28. 30. 32. 34. 36. 38. 40. 42. 44. 46. 48. 50

Grand Total

Median = 71.3                   X X
Mean   = 71.54                  X X
                        X       X X X X     X
   40. 44. 48. 52. 56. 60. 64. 68. 72. 76. 80. 84. 88. 92. 96


  1. Consider the problem of building a pipelined Ultimate RISC system where each instruction consists of two words, the source and destination addresses. The natural pipeline architecture for this machine consists of 3 stages, but this is complicated by the fact that the instruction-fetch stage must fetch two words per pipeline cycle, while the other two stages each load or store a single word per cycle. This leads to the following proposal:
        _______________           |         |
       |  Instr Fetch  |----------|---------x--- even instruction fetch
       |_______________|----------x---      |    odd instruction fetch
        _|___________|_           |         |
       | Operand Fetch |----------x---------x--- operand from memory
       |_______________|          |         |
        _|___________|_           |         |
       | Operand Store |----------x---------x--- operand to memory
       |_______________|          |         |
                                __|__     __|__
                               | odd |   | even|  2-way interleaved
                               |  M  |   |  M  |  memory
                               |_____|   |_____|

    a) Assuming that there are no caches and that the memory cycle time is equal to the pipeline clock period, how frequently would you expect memory conflicts in this scheme that cause pipeline stalls? Measure frequency as a fraction of pipeline cycles. (2 points)

    Every instruction fetch needs even and odd memory, so if any operand fetch or store is active, the instruction fetch stage will stall. Therefore, the only successful instruction fetches will occur when "pipeline bubbles" occupy both the operand fetch and operand store stages, so 2/3 of the pipeline cycles will be stalled, and only 1/3 will be productive. In sum, this pipelined architecture will not outperform a conventional sequential architecture.

    Note above that the instruction fetch stalls are so common that there are no conflicts between operand fetch and operand store!

    Many said it would stall on every cycle, from which we would have to infer that the pipeline deadlocks and no instruction ever completes. Such answers got significant partial credit. Those who focused only on operand stalls also received partial credit. Only a few got full credit, and only a few got no credit.
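    The steady-state behavior claimed above can be checked with a small simulation. This is only a sketch under the stated assumptions: the instruction fetch stage needs both memory banks, so it succeeds only when the operand fetch and operand store stages both hold bubbles.

```python
def simulate(cycles):
    """Model the 3-stage pipeline with no caches: a fetch succeeds
    only when both operand stages hold bubbles (both banks free)."""
    of_busy = os_busy = False   # do OF and OS hold real instructions?
    fetched = 0
    for _ in range(cycles):
        # IF needs both banks; any active operand stage blocks it
        can_fetch = not of_busy and not os_busy
        # advance the pipeline one cycle: OF moves to OS, the fetch
        # (or bubble) enters OF
        os_busy, of_busy = of_busy, can_fetch
        if can_fetch:
            fetched += 1
    return fetched / cycles     # fraction of productive cycles

print(simulate(3000))   # one third of cycles are productive
```

    The simulation settles into the period-3 pattern described above: one fetch followed by two stalled cycles.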

    b) Assuming that the Instruction Fetch stage is supported by a pair of I-caches (one for even words, one for odd words), where these caches are sized so that they have 90% hit rates, how frequently would you expect memory conflicts to stall this processor? Please state any simplifying assumptions you make. (2 points)

    First, note that the two I-caches are not independent: they will either both hit or both miss under any reasonable cache management scheme. So 90% of the instructions will run with no instruction stall, and 10% will behave as in part a).

    If the instruction fetch stage is not stalling, there will be the possibility of interaction between operand fetch and operand store. In this case, the probability of both accessing the same memory module is 1/4.

    So, the fraction of pipeline cycles that will stall is going to be close to 0.9 × 1/4 + 0.1 × 2/3, or about 0.3.
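    The arithmetic can be written out directly, using the probabilities assumed above:

```python
hit_rate = 0.90        # both I-caches hit together
p_conflict = 1 / 4     # operand fetch and operand store collide on a module
p_stall_miss = 2 / 3   # stall fraction from part a) when both caches miss

# weighted average over the hit and miss regimes
stall_fraction = hit_rate * p_conflict + (1 - hit_rate) * p_stall_miss
print(round(stall_fraction, 2))   # prints 0.29
```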

    A disturbing number of students multiplied all the different probabilities together blindly and came up with answers that were best classified as outlandish nonsense. Others ignored the operand fetch and store interactions, gaining half credit.

    c) Identify each of the interstage registers for this pipelined processor. (Hint: There are not many, no more than 5 full words.) For full credit, identify any auxiliary bits or fields that are added to these registers by the pipeline stages (this involves very little). (2 points)

    Instruction Fetch
        src and dst, the two 32-bit halves of the instruction register
        valid, 1 bit
    Operand Fetch
        dst, 32 bits, one half of the instruction register
        tmp, 32 bits, the operand that was fetched and will be stored
        valid, 1 bit
    Operand Store
        (the final stage; no interstage register follows it)
    Only a few remembered that pipeline bubbles require the addition of a valid bit or something equivalent to each interstage register. A fair number added lots of ALU control fields, even though the ALU is not part of the instruction execution unit on the Ultimate RISC family of architectures. Some forgot the source and destination address fields.
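    The register contents listed above can be sketched as record types. This is purely illustrative (Python dataclasses standing in for hardware latches); the fields and widths are those given in the answer:

```python
from dataclasses import dataclass

@dataclass
class IFtoOF:
    """Latch between Instruction Fetch and Operand Fetch."""
    src: int      # 32-bit source address half of the instruction
    dst: int      # 32-bit destination address half
    valid: bool   # false when the slot holds a pipeline bubble

@dataclass
class OFtoOS:
    """Latch between Operand Fetch and Operand Store."""
    dst: int      # 32-bit destination address
    tmp: int      # the operand that was fetched and will be stored
    valid: bool   # bubble marker

latch = IFtoOF(src=0x100, dst=0x104, valid=True)
```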

    d) In your prototype system, all instructions are required to begin on an even word. Your principal investor wants to lift this restriction in the production system. Give a brief outline of the components that must be added to lift this restriction, why they are needed, and what they do. (2 points)

    In the prototype scheme, the program counter was 31 bits because we could assume the least significant bit was always zero. We must add this bit.

    We need to add an alignment network for selecting either the even or odd word as the source or destination addresses. This requires two 32-bit by 2 input multiplexors, controlled by the least significant bit of the program counter.

    In the old scheme, the even and odd memory addresses both used the program counter (ignoring the least significant bit). Now, we must use PC as the address of the source address and PC+1 as the address of the destination address, with an alignment network (two 32-bit multiplexors) controlled by the least significant bit of the program counter, used to select which address goes to even memory and which address goes to odd memory.
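    One way to picture the two alignment networks is the following sketch. The helper and memory layout are hypothetical: `even_mem` and `odd_mem` hold the even- and odd-addressed words, each indexed by the address divided by two.

```python
def fetch_instruction(pc, even_mem, odd_mem):
    """Fetch src (at PC) and dst (at PC+1) from the interleaved banks."""
    if pc & 1 == 0:
        # src lies in even memory, dst in odd memory
        even_addr, odd_addr = pc >> 1, (pc + 1) >> 1
    else:
        # src lies in odd memory, dst in even memory
        odd_addr, even_addr = pc >> 1, (pc + 1) >> 1
    # first alignment network chose which address went to which bank;
    # now both banks are read in parallel
    even_word, odd_word = even_mem[even_addr], odd_mem[odd_addr]
    # second alignment network routes the returned words to src and dst
    return (even_word, odd_word) if pc & 1 == 0 else (odd_word, even_word)

mem = list(range(8))                    # word i holds the value i
even_mem, odd_mem = mem[0::2], mem[1::2]
print(fetch_instruction(3, even_mem, odd_mem))   # (3, 4): src at 3, dst at 4
```

    Note that PC and PC+1 always fall in opposite banks, so both instruction words can still be read in one cycle.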

    A fair number did well; about as many missed the need for an alignment network for the data returned from memory, and twice as many missed the need for an additional alignment network to deliver PC+1 to the memory addresses. Some added confusion by talking about variable-length instructions (there are none) or instructions split across word boundaries (not relevant).

    e) Consider the program <A,B><A,C><C,D>, where we represent each instruction as <src,dst>. Give a pipeline diagram for the progress of this computation, showing all bubbles introduced by memory conflicts. Assume that the addresses A and C are even and B and D are odd. (2 points)

    Assume the use of I-cache, and assume all hits:
    		A B  IF OF OS
    		A C     IF OF OS
    		C D        IF __ OF OS

    If we assume no I-cache, we get this:

    		A B  IF OF OS
    		A C     __ __ IF OF OS
    		C D              __ __ IF OF OS

    There is a more complex behavior that was worth full credit, based on the assumption that some instruction fetches will be accomplished in two cycles, one for the even word and one for the odd word. In the event that the operand fetch and operand store stages do not use the even memory during some cycle, the even word of the instruction can be fetched during that cycle, and in the event that the operand stages do not use odd memory during some cycle, the odd instruction word can be fetched during that cycle. This requires a far more complex instruction fetch stage, but it can work.

    Half the class did well here, but most failed to document their assumption about the presence or absence of a cache, leaving that for the reader to infer, at a small penalty. Among the remainder, some had very poor notation for the pipeline diagram, and some had odd notions of how the architecture worked, missed conflicts, or applied odd priority rules for access to memory.

  2. Consider these two approaches to adding the ALU to the pipelined Ultimate RISC architecture:

        _______________           |      _______________   |
       |  Instr Fetch  |----------x--   |  Instr Fetch  |--x--
       |_______________|          |     |_______________|  |
        _|___________|_           |      _|___________|_   |
       | Operand Fetch |----o-----x--   | Operand Fetch |--x--
       |_______________|  __|__   |     |_______________|  |
         |           |   | ALU |  |      _|___________|_   |
         |           |   |_____|  |     | Operand Store |--x--
        _|___________|_     |     |     |_______________|  |
       | Operand Store |----o-----x--                       --o-------o--
       |_______________|        __|__                       __|__   __|__
                               |     |                     | ALU | |     |
            Design A           |  M  |       Design B      |_____| |  M  |
                               |_____|                             |_____|

    a) One of these designs has a distinct performance advantage over the other, one involving reduced contention in the memory arbitration subsystem and therefore fewer pipeline stalls. Identify and explain this advantage. (2 points)

    Design A is faster because ALU references travel over dedicated paths from the operand fetch and operand store stages, so they never enter the memory arbitration system and introduce no memory conflicts. In Design B, every ALU reference competes with memory traffic on the shared bus.

    Half the class did well here; others had odd reasoning or focused their answers on odd considerations.

    b) One of these designs requires a distinctly more complex ALU. Specifically, the register file holding the 16 registers in the ALU subsystem must have a more complex access mechanism in one than in the other. The simpler design can be described as requiring a two-port register file, with a read port and a write port, both sharing a common address. Give a description, at this level of abstraction, of the more complex design, and for each port, identify its purpose. (2 points)

    Design A requires a 3-port register file. One port is read-only, with the register number provided by the operand fetch stage and with its data output directed into the logic for the unary operators allowed on source ALU operands. The other two ports share an address provided by the operand store stage: one is a read-only port that provides data to the ALU for binary operations, and the other is a write-only port that delivers the results of binary operations back to the register.
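    As a behavioral sketch (not a hardware description; the class and method names are made up for illustration), the two access patterns look like this:

```python
class ThreePortRegFile:
    """Behavioral model of the register file Design A needs."""
    def __init__(self):
        self.reg = [0] * 16

    def read_src(self, n):
        # port 1: read-only, addressed by the operand fetch stage,
        # feeding the unary-operator logic on source operands
        return self.reg[n]

    def read_modify_write(self, n, binop, operand):
        # ports 2 and 3: shared address from the operand store stage;
        # the read port feeds one ALU input, the write port takes the result
        result = binop(self.reg[n], operand)
        self.reg[n] = result
        return result

rf = ThreePortRegFile()
rf.read_modify_write(5, lambda a, b: a + b, 7)
print(rf.read_src(5))   # prints 7
```

    The key point is that the read and write on ports 2 and 3 share a single register number, which is what keeps the design simpler than a fully general 3-address file.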

    Only a few did well here. Many focused on the entire ALU subsystem and not on the register file within it. Many did not provide a useful enumeration of the ports to the register file, and close to a third of the students provided nothing useful in their answer.

    c) Which of these designs could be easily adapted to the more interesting memory arbitration model shown in the illustration for the previous question? (2 points)

    Design A requires no change at all for incorporation into the crossbar-switch based memory interleaving scheme of the first problem, while Design B would require major changes to allow it to be shared between two memory busses, one for even addresses and one for odd addresses.

    Half the class did well here, while the remainder appeared to be seriously confused. Many earned no credit.

  3. Consider the question of pipeline efficiency on the pipelined Ultimate RISC.

    a) How many operand delay slots must we account for? (1 point)

    1 operand delay slot.

    2/3 of the class did well, while 1/3 did not.

    b) How many branch delay slots must we account for? (1 point)

    2 branch delay slots.

    2/3 of the class did well, while 1/3 did not.

    c) Assume that all operands are in memory, and ignore operands that address the ALU or other special addresses. What predicate enables result forwarding and what is the forwarding path? (2 points)

    Forward the operand store data out to the operand fetch data in when the operand store address equals the operand fetch address and when both stages contain valid instructions.
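    The predicate can be stated compactly. This is a sketch only; the stage latches are modeled as hypothetical records holding the fields named in the answer above:

```python
def should_forward(of_latch, os_latch):
    """Forward the operand store data-out to the operand fetch data-in
    when the addresses match and both stages hold valid instructions."""
    return (of_latch["valid"] and os_latch["valid"]
            and of_latch["src_addr"] == os_latch["dst_addr"])

of_latch = {"valid": True, "src_addr": 0x200}
os_latch = {"valid": True, "dst_addr": 0x200}
print(should_forward(of_latch, os_latch))   # prints True
```

    When the predicate holds, the value on the operand store stage's data path is routed directly into the operand fetch stage instead of waiting for memory.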

    Half the class did well here, and of the remainder, most were on the right track but were somewhat vague about things. A few gave only the forwarding path or only the predicate, and a few more gave neither.

    d) Ignoring operands that address the ALU, etc, how many operand delay slots remain if we use this forwarding predicate? (1 point)

    This forwarding path eliminates all operand delay slots.

    Those who understood the forwarding path, as evidenced by their answers to part c) did well here. The remainder earned no credit.

    e) In the above statement, the special behavior of the ALU addresses was explicitly ignored. Show why it would be wrong to use this forwarding logic for ALU operands. (1 point)

    The addresses sent to the ALU include encodings of operations on registers or register values; the meanings of these operations on source addresses do not conform to the meanings on destination addresses, so comparing these is nonsense.

    A third of the class did well, while the remainder did poorly.

  4. On the midterm exam and subsequent homework, a very simple 16-bit architecture was presented:
       Register 0 is the program counter!
      15___ _______ _______ ___________0  
       |op |r1     |r2     |op'  |sk |st
        0 0  operate      result = ALU( R[r1], R[r2], op' )
                          if SKIP(result, sk) then PC = PC + 1
                          if st then R[r1] = result
      15___ _______ _______ ___________0     4-stage pipeline:
       |_._|_._._._|_._._._|_._._._._._|      1 IF       -- instruction fetch
       |op |r1     |r2     |dd         |        
                   |ea = R[r2]+dd      |      2 OF/AC    -- operand fetch and
                                                             address computation
        0 1  load         R[r1] = M[ea]       3 ALU/MEM  -- ALU and
        1 0  store        M[ea] = R[r1]                      memory reference
        1 1  load address R[r1] = ea          4 OS       -- operand store

    Consider adding interrupts to this architecture. From the device perspective, the device raises the interrupt request line and then awaits the interrupt acknowledge line. During this wait, the CPU may complete any number of bus cycles. When the CPU acknowledges the interrupt, the device (or interrupt controller) places the interrupt address on the data line and asserts data ready until the CPU drops the acknowledge line. The problem is, what should the CPU do?

    a) On interrupt, the CPU must save register 0 (the program counter) somewhere before it changes register 0 to the address provided by the device. Why must the hardware also save at least one other register? Hint: Consider the problem of writing code to save all 15 registers, assuming that all of them contain important but unknown data. (2 points)

    Software in the interrupt service routine must use a register to hold the addresses of the locations in memory to which the other registers are saved. Without help from the hardware, there is no way to save this register to make space for loading this address constant.

    Half the class figured this out, while many others demonstrated that they were on the right track by asking the CPU to save the activation record address. In fact, the interrupt mechanism cannot really know what registers held what values prior to the interrupt, so the activation record concept is not quite right, but it is tangentially related to what we are doing here.

    b) Here is a proposal: When the device requests an interrupt at address A, the CPU saves R0 and R1 to memory locations A and A+1 and then loads R0 and R1 from memory locations A+2 and A+3. Explain why this proposal is a bad idea for use with this architecture. (2 points)

    This proposal requires a complex control unit to sequence this series of memory accesses, and the pipelined RISC architecture proposed here does not need such control unit complexity for any other purpose, so adding it just for interrupts would be inappropriate.

    It is true that this proposal requires the use of RAM for the interrupt vector, while alternatives such as c) work in ROM without needing any RAM. The software, however, would end up using RAM anyway for saving other information, so this is not really a central issue.

    Only one student did well here. Some earned partial credit with secondary reasoning, but most earned no credit.

    c) Here is a proposal: When the device requests an interrupt at address A, the CPU saves R0 and R1 to special registers IR0 and IR1 inside the CPU before loading R0 with A. This is easy to implement in hardware, but the problem is, how can the software of the interrupt service routine get at IR0 and IR1? To do this, we need to give meaning to an instruction that is currently useless. There are entire families of useless operate instructions that can be identified from the documentation given here, even if we assume that all 8 ALU operations are useful when applied to the program counter. Find them and explain why they are useless. (2 points)

    Instructions op=00, sk=00 and st=0 are no-ops. We might want to have one no-op, but this group of instructions includes 2^11 no-ops.

    Instructions op=00, r1=00, sk!=00, st=1 do useless conditional skips because the store to the PC erases the result of the skip; there are 2^7 useless instructions in this group.
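    The size of the first group can be checked by brute-force enumeration. The field layout is assumed from the instruction format figure: op in bits 15-14, r1 in 13-10, r2 in 9-6, op' in 5-3, sk in 2-1, st in bit 0.

```python
# Count the no-op operate encodings: op=00, sk=00, st=0, with
# r1 (4 bits), r2 (4 bits) and op' (3 bits) free -- 11 free bits.
noop_count = 0
for w in range(1 << 16):            # every 16-bit encoding
    op = (w >> 14) & 0b11
    sk = (w >> 1) & 0b11
    st = w & 1
    if op == 0 and sk == 0 and st == 0:
        noop_count += 1
print(noop_count)   # prints 2048, that is, 2**11
```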

    Some students observed that, for example, exclusive-ORing a value with the program counter is not useful. It is, however, well defined, and a dedicated contrarian could therefore write code that used such bizarre operations, so full credit was not offered for this idea. Others looked back in their notes and found that only 6 of the 8 ALU operations were defined. That definition was incomplete, and it admitted as much by saying that the ALU supported at least those operations, so you cannot assume that other operations will never be added.

    About 2/5 of the class did well, 2/5 had one of the abovementioned weak answers, and 1/5 earned no credit.

    d) Traps and interrupts on this architecture pose problems that would be solved if we combined pipeline stages 3 and 4 into one stage. Why? (2 points)

    Pipeline stage 3 finalizes store operations while pipeline stage 4 finalizes load and load address operations and all operate instructions. If all finalization is done in the same stage, interrupt and trap handling is simplified. With distributed finalization, we have the possibility of imprecise traps and we must drain the pipeline before starting the interrupt handler instead of simply invalidating all the instructions in the pipe.

    Half the class did well here, while the remainder were divided between vague answers and no answer.