Final

22C:122, Fall 1998

This is an open-book exam! You may use any notes or books you wish, but not your neighbor's exam paper. Please write your answers in the exam booklet provided.

NOTE: None of the questions on this exam require long textual answers or complex drafting! Excessively long or complex answers will be penalized!

This exam is worth almost 1/3 of the final grade (30 points)

Consider a 2-way superscalar pipelined RISC processor with a 32-bit instruction word and a 64-bit data path to memory. The instruction set includes load, store and operate instructions, in the classic RISC style.
```
      ---------------      _                       ||
     | fetch |       |----| |---------(+)          ||
     |-------+-------| | instr |       |           ||
     | gather|       | | cache |       |   ON CHIP || OFF
     |-------+-------|  -------        |    LOGIC  || CHIP
     | ALU   |       |                 |           ||
     |-------+-------|                 |           ||
     | memory|       |--(+)            |           ||
     |       |       |--(+)            |           ||
     |-------+-------|   |     _       |           ||
     | store |       |    ----| |-----(+)  _____   ||
      ---------------      | data  |   |  |     |  ||
                           | cache |    --| MMU |--||-->
                            -------       |_____|  ||
```
Part A: Why are there two memory ports required by the memory stage of the pipeline? (1.5 point)
Part B: Why not build the system this way (with a separate cache on each port from the memory stage of the pipeline)? (1.5 points)
```
     |  ...  |  ...  |                ...
     |-------+-------|         _       |           ||
     | memory|       |--------| |-----(+)          ||
     |       |       |---  | cache |   |           ||
     |-------+-------|   |  -------    |           ||
     | store |       |   |     _       |   _____   ||
      ---------------     ----| |-----(+) |     |  ||
                           | cache |    --| MMU |--||-->
                            -------       |_____|  ||
```
Part C: The designs suggested above are not appropriate for a bus-based multiprocessor using snooping cache technology. Explain why this is so. (2 points)
Part D: On a machine of this complexity, context switching involves storing the general registers for the old context in memory and then loading new values of the registers from memory. Context switching also requires that the MMU's TLB registers be changed. Explain why the placement of the caches in the above designs leads to problems in context switching, and suggest a simple solution. (2 points)
Part E: Where does (or do) the alignment network(s) go in the first design presented above? Note that the data path to memory is 64 bits, while the instruction word is 32 bits and many of the operands are likely to be 32 bits (or smaller). (Warning! There is a trick to this question! Why is no alignment network required for the data path used during instruction fetch?) (2 points)
Consider a machine with 16 sets of general purpose registers in the CPU, where a 4-bit field of the PSW is used to select the current active register set. So long as there are 16 or fewer processes running on the machine, this allows context switching in the time it takes to change the PSW.
Both parts of this question deal with situations where there are more than 16 processes running on the machine. In each part, the question is: How would you assign register sets to processes and what performance would you expect compared to a machine that had only one register set but was otherwise comparable?
Part A Assume that processes have priority, and that high priority processes are frequently scheduled while low priority processes are scheduled only infrequently. (2 points)
Part B Assume that all processes are equally likely to be scheduled and that there is no priority. (2 points)
Today's technology would allows us to make CPU's with thousands of fast registers, but making effective use of these registers is not always easy.
Part A Given a mix of applicatons code, suggest a way of empirically determining the optimum size of the register set for that code. (2 points)
Part B How would you expect the optimum size of the register set to change as the application changes. Consider, at one extreme, a large-scale single-threaded numerical application involving complex image processing computations, and at the other extreme, a multithreaded application based on huge numbers of very short-lived threads. (2 points)
Consider the following instruction set:
```
    Resources:

	R[0..15] -- 16 General Purpose Registers
	  PC = R[15]
	A[0..1]  -- 2 address registers
	Op[0..1] -- 2 operand registers
	Res	 -- 1 result register

    Instructions, all 8-bit, some with additional data following

	000axxxx  -- A[a] = R[xxxx]
        001a00ss  -- A[a] = A[a] << ss
        001a01ss  -- A[a] = A[a] + c -- note 1
        001a1bss  -- A[a] = M[A[b]]  -- note 2
	001a11ss  -- Op[a] = c       -- note 1
	010axxxx  -- Op[a] = R[xxxx]
        011a0bss  -- Op[a] = M[A[b]] -- note 2
	011a1bss  -- M[A[b]] = Op[a] -- note 2
	100axxxx  -- Op[a] = ALU[ x, Op[0], Op[1] ]
	101axxxx  -- R[x] = Op[a]
	110axxxx  -- R[x] = A[a]

    Note 1: c is an 8, 16, 32 or 64 bit constant following the
	instruction syllable; the size depends on the value of ss.
	Constants are sign-extended to the register size.

    Note 2: the size of the operand fetched or stored from memory
	may be 8, 16, 32 or 64 bits, depending on ss.  Address
	registers are only 32 bits, so not all values of ss apply
	to uses of the address registers.
```
Part A: Consider the instruction ADD (X),1000(Y),Z (which means, R[Z] = M[R[X]] + M[R[Y]+1000]). On the VAX, this would be a single instruction occupying 6 bytes (using a 16 bit index constant for the value 1000). How would this instruction be endoded using the above instruction set? (2 points)
Part B: How would you code CALL X, where X is the address of a function, and where the CALL instruction pushes the old PC onto the stack defined by R14 before the transfer of control to location X. Assume that the push operation predecrements the stack pointer, and that the pop operation postincrements the stack pointer. (2 points)
Part C: Using the sequence of ADD/CALL as an example, comment on the feasibility of pipelined execution of this instruction set, including such issues as the possibility of reordering the instructions to eliminate interlock problems. (2 points)
Part D: The instruction set proposed in the study question had only one address register, while the set proposed here has two. What impact does this change have on the possibility of pipelining this instruction set? (2 points)
Here is a block diagram of an interesting small computer architecture:
```
 _________
|IEU =o===|===o======o======o======o=== CPU bus (8 bit address)
|    _|_  |   |      |     _|_    _|_           (16 bit data)
|   |_PC| |  _|_    _|_   |MAR|  |MDR|
|     |   | |AC_|==|ALU|    |      |
|_____|___|                 |  ___  --> data to and from memory
      |                      -| 1 |       (both 16 bits)
      |                       |MUX|---> address to memory
       -----------------------|_0_|
```
Each instruction is 16 bits, with an 8 bit source field and an 8 bit destination field. Load operations from the Memory Data pseudoRegister (MDR) cause a read from memory, while store operations to the MDR cause store operations to memory. In either case, the Memory Address Register (MAR) holds the memory address being referenced.
Part A: What's missing from the above? Specifically, consider the problem of writing code to load a variable from memory location X. (1 point)
Part B: Given that there are 16 ALU operations (evoked by write) and 16 possible condition code checks (evoked by read), how many accumulators could reasonably be added to this design? (2 points)
Part C: Discuss the possibilities for pipelining this architecture. (2 points)