Final Exam

Part of material for 22C:122/55:132, Spring 2004
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Note: This is an open-book, open-notes exam. Please write your answers in the exam booklet supplied. Leave blank lines between your answers to different questions, and leave a margin down the side of each page. Please write your name in the form it appears on your university ID card; write your name in the blank provided on the front cover. This exam is worth 30 points! Allocate about 4 minutes per point.

Questions

Consider the problem of building a pipelined Ultimate RISC system where each instruction consists of two words, the source and destination addresses. The natural pipeline architecture for this machine consists of 3 stages, but this is complicated by the fact that the instruction-fetch stage must fetch two words per pipeline cycle, while the other two stages each load or store a single word per cycle. This leads to the following proposal:
```
    _______________           |         |
   |  Instr Fetch  |----------|---------x--- even instruction fetch
   |_______________|----------x---      |    odd instruction fetch
    _|___________|_           |         |
   | Operand Fetch |----------x---------x--- operand from memory
   |_______________|          |         |
    _|___________|_           |         |
   | Operand Store |----------x---------x--- operand to memory
   |_______________|          |         |
                            __|__     __|__
                           | odd |   | even|  2-way interleaved
                           |  M  |   |  M  |  memory
                           |_____|   |_____|
```
a) Assuming that there are no caches and that the memory cycle time is equal to the clock rate, how frequently would you expect memory conflicts in this scheme that cause pipeline stalls? Measure frequency as a fraction of pipeline cycles. (2 points)
b) Assuming that the Instruction Fetch stage is supported by a pair of I-caches (one for even words, one for odd words), where these caches are sized so that they have 90% hit rates, how frequently would you expect memory conflicts to stall this processor? Please state any simplifying assumptions you make. (2 points)
c) Identify each of the interstage registers for this pipelined processor. (Hint: There are not many, no more than 5 full words.) For full credit, identify any auxiliary bits or fields that are added to these registers by the pipeline stages (this involves very little). (2 points)
d) In your prototype system, all instructions are required to begin on an even word. Your principal investor wants to lift this restriction in the production system. Give a brief outline the components that must be added to lift this restriction, why they are needed and what they do. (2 points)
e) Consider the program <A,B><A,C><C,D>, where we represent each instruction as <src,dst>. Give a pipeline diagram for the progress of this computation, showing all bubbles introduced by memory conflicts. Assume that the addresses A and C are even and B and D are odd. (2 points)
Consider these two approaches to adding the ALU to the pipelined Ultimate RISC architecture:
```
    _______________           |      _______________   |
   |  Instr Fetch  |----------x--   |  Instr Fetch  |--x--
   |_______________|          |     |_______________|  |
    _|___________|_           |      _|___________|_   |
   | Operand Fetch |----o-----x--   | Operand Fetch |--x--
   |_______________|  __|__   |     |_______________|  |
     |           |   | ALU |  |      _|___________|_   |
     |           |   |_____|  |     | Operand Store |--x--
    _|___________|_     |     |     |_______________|  |
   | Operand Store |----o-----x--                       --o-------o--
   |_______________|        __|__                       __|__   __|__
                           |     |                     | ALU | |     |
        Design A           |  M  |       Design B      |_____| |  M  |
                           |_____|                             |_____|
```
a) One of these designs a distinct performance advantage over the other, one involving reduced contention in the memory arbitration subsystem and therefore fewer pipeline stalls. Identify and explain this advantage. (2 points)
b) One of these designs requires a distinctly more complex ALU. Specifically, the register file holding the 16 registers in the ALU subsystem must have a more complex access mechanism in one than in the other. The simpler design can be described as requiring a two-port register file, with a read port and a write port, both sharing a common address. Give a description, at this level of abstraction, of the more complex design, and for each port, identify its purpose. (2 points)
c) Which of these designs could be easily adapted to the more interesting memory arbitration model shown in the illustration for the previous question. (2 points)
Consider the quesiton of pipeline efficiency on the pipelined Ultimate RISC.
a) How many operand delay slots must we account for? (1 point)
b) How many branch delay slots must we account for? (1 point)
c) Assume that all operands are in memory, and ignore operands that address the ALU or other special addresses. What predicate enables result forwarding and what is the forwarding path? (2 points)
d) Ignoring operands that address the ALU, etc, how many operand delay slots remain if we use this forwarding predicate? (1 point)
e) In the above statement, the special behavior of the ALU addresses was explicitly ignored. Show why it would be wrong to use this forwarding logic for ALU operands. (1 point)
On the midterm exam and subsequent homework, a very simple 16-bit architecture was presented:
```
   Register 0 is the program counter!

  15___ _______ _______ ___________0  
   |_._|_._._._|_._._._|_._._|_._|_|
   |op |r1     |r2     |op'  |sk |st

    0 0  operate      result = ALU( R[r1], R[R2], op' )
                      if SKIP(result, sk) then PC = PC + 1
                      if st then R[r1] = result
                
  15___ _______ _______ ___________0     4-stage pipeline:
   |_._|_._._._|_._._._|_._._._._._|      1 IF       -- instruction fetch
   |op |r1     |r2     |dd         |        
               |ea = R[r2]+dd      |      2 OF/AC    -- operand fetch and
                                                         address computation
    0 1  load         R[r1] = M[ea]       3 ALU/MEM  -- ALU and
    1 0  store        M[ea] = R[r1]                      memory reference
    1 1  load address R[r1] = ea          4 OS       -- operand store
```
Consider adding interrupts to this architecture. From the device perspective, the device raises the interrupt request line and then awaits the interrupt acknowledge line. During this wait, the CPU may complete any number of bus cycles. When the CPU acknowledges the interrupt, the device (or interrupt controller) places the interrupt address on the data line and asserts data ready until the CPU drops the acknowledge line. The problem is, what should the CPU do?
a) On interrupt, the CPU must save register 0 (the program counter) somewhere before it changes register 0 to the address provided by the device. Why must the hardware also save at least one other register? Hint: Consider the problem of writing code to save all 15 registers, assuming that all of them contain important but unknown data. (2 points)
b) Here is a proposal: When the device requests an interrupt at address A, the CPU saves R0 and R1 to memory locations A and A+1 and then loads R0 and R1 from memory locations A+2 and A+3. Explain why this proposal is a bad idea for use with this architecture. (2 points)
c) Here is a proposal: When the device requests an interrupt at address A, the CPU saves R0 and R1 to special registers IR0 and IR1 inside the CPU before loading R0 with A. This is easy to implement in hardware, but the problem is, how can the software of the interrupt service routine get at IR0 and IR1? To do this, we need to give meaning to an instruction that is currently useless. There are entire families of useless operate instructions that can be identified from the documentation given here, even if we assume that all 8 ALU operations are useful when applied to the program counter. Find them and explain why they are useless. (2 points)
d) Traps and interrupts on this architecture pose problems that would be solved if we combined pipeline stages 3 and 4 into one stage. Why? (2 points)