Final Solutions

22C:122, Fall 1998

Douglas W. Jones

Score Distribution

Mean = 21.9
                                X
________X_______________X_X_____X_X_____X_____X_____________X___
 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29

    Solutions

  1. Part A: A 2-way superscalar pipelined RISC processor with a 32-bit instruction word and a 64-bit data path to memory requires two memory ports on the memory stage of the pipeline because the 2-way superscalar design may attempt to execute two memory reference instructions during each clock cycle. Many clock cycles will involve no memory reference, or only one, so a bus arbitrator, stalling one of the instructions in case of collision, can be used to combine these ports into a single port.
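    The arbitration idea can be sketched as follows (a hypothetical Python model, not part of the solutions; the function name and the encoding of a cycle as a pair of request flags are illustrative):

```python
# Hypothetical sketch of the bus arbitrator described above: the two memory
# request lines from the 2-way superscalar memory stage are merged onto one
# port; when both request in the same cycle, the second instruction stalls.

def arbitrate(req0, req1):
    """One clock cycle: return (grant0, grant1, stall1)."""
    if req0 and req1:
        return True, False, True    # collision: stall the second instruction
    return req0, req1, False

# Most cycles issue zero or one memory reference, so stalls are rare.
cycles = [(False, False), (True, False), (True, True), (False, True)]
stalls = sum(1 for r0, r1 in cycles if arbitrate(r0, r1)[2])
```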

    Part B: If a separate cache were put on each memory port from the memory stage of the pipeline, this would introduce cache coherency problems. Solving these would add significantly to the complexity of the chip.

    Half the class had problems here, going off on tangents, being distracted by locality issues, and other things.

    Part C: The designs suggested here are not appropriate for a bus-based multiprocessor using snooping cache technology because the memory management unit is placed between the on-chip caches and the external memory bus. As a result, the caches all cache virtual addresses and not physical addresses, and this makes snooping very difficult at best, and impossible without the aid of complex support from the memory management unit.

    One third had problems here, but most had the right idea and then added confusion or distraction.

    Part D: Context switching is complicated on this machine because the caches hold virtual addresses. If the contents of the memory management unit's mapping registers are changed, the information in the cache becomes obsolete and must be ignored! A straightforward solution to this problem is to add a mechanism to invalidate the contents of the caches whenever a change in the memory management state of the machine occurs.

    Only one did well here; many suggested reversing the order of the memory management unit and caches, but this requires either a complex multiport memory management unit or it makes the memory management unit into a severe bottleneck. In either case, the cost is likely to be much higher than a cache invalidation mechanism.
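    The invalidation mechanism suggested in Part D can be sketched as follows (hypothetical Python; the class and method names are illustrative, not from the solutions):

```python
# Hypothetical sketch of the Part D fix: because the caches hold virtual
# addresses, any change to the memory management unit's mapping registers
# must invalidate every cache.

class VCache:
    """A virtually-addressed cache: virtual address -> cached data."""
    def __init__(self):
        self.lines = {}

    def invalidate(self):
        self.lines.clear()          # clear every valid bit in one step

class MMU:
    def __init__(self, caches):
        self.caches = caches
        self.mapping = {}           # virtual page -> physical page

    def set_mapping(self, vpage, ppage):
        self.mapping[vpage] = ppage
        for cache in self.caches:   # mapping changed: cached virtual
            cache.invalidate()      # addresses are now stale
```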

    Part E: The most obvious place to put the alignment networks in the system presented is between the pipeline stage and the bus arbitrators. Thus, all bus arbitration and all caches deal with a single fixed word size. There is no need for an alignment network on the instruction port because all instruction fetches can fetch an aligned 64 bit doubleword. In the case that a branch is made to the odd half of such a doubleword, the other half can be simply invalidated.
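    The job of such an alignment network can be sketched as follows (hypothetical Python; little-endian word order within the doubleword is an assumption made here for illustration):

```python
# Hypothetical sketch of the alignment network's job: the caches and
# arbitrators always move aligned 64-bit doublewords, and the network
# extracts the addressed 32-bit word on the pipeline side.

def load_word(mem, addr):
    """Fetch the aligned doubleword holding addr, then align the word."""
    base = addr & ~7                    # the cache sees only aligned fetches
    dword = mem[base]
    if addr & 4:                        # odd (upper) word of the doubleword
        return (dword >> 32) & 0xFFFFFFFF
    return dword & 0xFFFFFFFF           # even (lower) word
```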

    One third did well here, one third proposed putting the alignment network after the caches, and the remainder gave answers that were difficult to classify.

  2. Part A: Assuming that processes have priority, and that high priority processes are frequently scheduled while low priority processes are scheduled only infrequently, high priority processes can be permanently bound to a register set, so that context switching to and from such processes is inexpensive. Only the low priority (and therefore infrequently used) processes would share one (or a few) register set(s). The result would be a significant performance improvement, particularly when the difference in frequency associated with the spectrum of priorities is large.

    Part B: Assuming that all processes are equally likely to be scheduled and that there is no priority, added register sets can be assigned on an LRU or round robin basis. If the number of processes only slightly exceeds the number of register sets, and if scheduling is random, there can be a significant performance benefit. If scheduling is round-robin or if the number of processes is much larger than the number of register sets, the benefit of multiple register sets will be minimal.
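    The Part B policy and its behavior can be sketched as follows (hypothetical Python, not from the solutions; the cost model counts only whole-register-set save/restore operations):

```python
# Hypothetical sketch of LRU register-set assignment: a context switch is
# only expensive when the incoming process must steal a set from the least
# recently run owner and have its registers reloaded.

from collections import OrderedDict

def count_expensive_switches(schedule, num_sets):
    owners = OrderedDict()              # processes holding a set, LRU order
    expensive = 0
    for pid in schedule:
        if pid in owners:
            owners.move_to_end(pid)     # cheap: registers already resident
        else:
            if len(owners) >= num_sets:
                owners.popitem(last=False)  # evict least recently run owner
            owners[pid] = True
            expensive += 1              # must save and restore a register set
    return expensive
```

With round-robin scheduling of more processes than register sets, every switch is expensive, while with only a few active processes most switches become cheap, matching the benefit and the limit described above.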

    Just over half did well here; of the remainder, many suggested policies but failed to discuss the performance in any significant way.

  3. Part A: To determine the number of registers required for effective execution of an application, consider one of the following experiments: First, source code analysis can determine how many variables are "live" at each point in the application mix. The number of registers should exceed the typical number of live variables. Second, consider writing single-accumulator or stack-based compilers for the code. For the single accumulator code, a memory trace study that shows cache working-set size can be used to suggest a number of registers. For the stack-based code, the amplitude of the short-term ripples in the stack pointer is a useful measure of how many registers should be assigned to hold variables on the stack top.
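    The second experiment, measuring stack-pointer ripple, can be sketched as follows (hypothetical Python; the window size is an assumed parameter, not from the solutions):

```python
# Hypothetical sketch of the stack-pointer experiment: given a trace of
# stack depths recorded while running stack-based compiled code, the
# short-term ripple amplitude suggests how many registers are worth
# assigning to hold the top of the stack.

def ripple_amplitude(depth_trace, window=8):
    """Largest stack-depth swing seen inside any short window of the trace."""
    best = 0
    for i in range(len(depth_trace) - window + 1):
        chunk = depth_trace[i:i + window]
        best = max(best, max(chunk) - min(chunk))
    return best
```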

    Two thirds did well here, suggesting alternatives such as those given above. The remainder made hard-to-classify errors.

    Part B: The optimum size of the register set will decrease as the lifetime of the average process or thread decreases. In the extreme, if each thread only performs one two-operand arithmetic instruction before termination, one accumulator is sufficient.

    Half did well here, and of the rest, most received no credit.

  4. Part A: Here is one way to code ADD (X),1000(Y),Z:
       1	A[0] = R[X]      0000xxxx
       2	Op[0] = M[A[0]]  011000ss -- ss, the size is not given
       3	A[1] = R[Y]      0001yyyy
       4	A[1] = A[1] + C  001101ss -- ss = 16 bit, encoding not given
    	                 cccccccc
    	                 cccccccc
       5	Op[1] = M[A[1]]  011101ss -- ss, the size is not given
       6	Op[0] = ADD      1000oooo -- oooo, the ADD encoding is not given
       7	R[Z] = Op[0]     1010zzzz
    
    About half had significant programming errors.

    Part B: Here is one way to code CALL X:

       8	A[0] = R[14]     00001110
       9	A[0] = A[0] - 4  001001ss -- ss = 8 bit, encoding not given
                             11111100 -- = -4
      10	R[14] = A[0]     11001110
      11	Op[0] = R[15]    01001111
      12	Op[1] = 10       001111ss -- ss = 8 bit, encoding not given
                             00001010
      13	Op[0] = ADD      1000oooo -- oooo, the ADD encoding is not given
      14	M[A[0]] = Op[0]  011010ss -- ss = 32 bit, encoding not given
  15	Op[1] = X        001111ss -- ss = 32 bit, encoding not given
    	                 xxxxxxxx
    	                 xxxxxxxx
    	                 xxxxxxxx
    	                 xxxxxxxx
      16	R[15] = Op[1]    10111111
    
    One in five did well here; about half forgot that R[15] is the current PC and, until the very last instruction of the macro, it doesn't point to the return address. Therefore, it must be adjusted (the ADD instruction above) to skip over the rest of the call. Another common error was to forget to pre-decrement the stack pointer.

    Part C: Given that the two instruction sequences above are to be executed in sequence, they may be interleaved as (3 1 4 2 3 5 8 6 9 7 11 10 12 15 13 14 16). In this interleaving, no instruction depends on the result of the immediately preceding instruction, so if a pipeline can be built with only 1 visible operand delay slot, this code can be pipelined with no interlock problems! The instruction set itself admits a 3 or 5 stage pipeline -- ALU and memory operations can be done in the same stage, since no instruction uses both, and with only 2 input registers to select between, an operand gather stage may be unneeded.
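    The adjacency property used here can be checked mechanically (hypothetical Python, not from the solutions; the dependency sets are read off the numbered instructions in Parts A and B):

```python
# Hypothetical sketch: deps maps each numbered instruction from the CALL
# and ADD macros above to the instructions whose results it consumes.

deps = {
    1: set(),   2: {1},    3: set(),      4: {3},       5: {4},
    6: {2, 5},  7: {6},    8: set(),      9: {8},      10: {9},
    11: set(), 12: set(), 13: {11, 12},  14: {10, 13},
    15: set(), 16: {15},
}

def has_adjacent_hazard(order):
    """True if some instruction consumes a result of its predecessor."""
    return any(a in deps[b] for a, b in zip(order, order[1:]))
```

Executed strictly in order 1 through 16, adjacent dependences abound (2 follows 1, 4 follows 3, and so on); interleaving the two sequences pushes dependent instructions apart.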

    Half the class failed to make any use of the example to support their contentions about pipelining this instruction set. One in three did well.

    Part D: The addition of the additional address register removes a severe bottleneck, greatly improving the ability of this architecture to allow interleaving of different instruction sequences in order to remove conflicts from the instruction stream!

    Two in three did well here; a few dodged the question or made other errors.

  5. Part A: The instruction set, as specified, contains no convenient way to load a constant into the address register in order to do direct memory addressing.

    One in five did well here. One in three completely missed this, while others claimed that there was no way to fetch instructions because the address on the CPU bus was only 8 bits wide. The alternative path directly to memory (through a MUX) from the PC eliminates this problem!

    Part B: Given that there are 16 ALU operations (evoked by write) and 16 possible condition code checks (evoked by read), an 8-bit address leaves 4 bits available to specify an accumulator on an ALU reference instruction. But! There must also be some way to specify the MAR, MDR and PC, so only 15 accumulators would work (the address that could have referred to the 16th would be sufficient to allow addressing of the other registers).

    Half the class gave the minimum number of accumulators that would allow comfortable programming, but the question clearly asked for the maximum (how many could be added, not how many should be added). One in five did well, but typically forgot to allow addressing of the other registers. Almost half gave incomprehensible or very confused explanations of their answers.
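    The Part B budget works out as follows (a hypothetical Python restatement of the arithmetic above; the constant names are illustrative):

```python
# Hypothetical sketch of the address-space budget: a 4-bit operation field
# (16 ALU ops on write, 16 condition checks on read) leaves 4 bits of the
# 8-bit address to name an accumulator, and one of those 16 codes must be
# kept free as an escape to reach the MAR, MDR and PC.

ADDRESS_BITS = 8
OP_FIELD_BITS = 4                   # selects among 16 ALU ops / 16 checks
ACC_FIELD_BITS = ADDRESS_BITS - OP_FIELD_BITS
RESERVED_CODES = 1                  # escape code for MAR, MDR and PC
MAX_ACCUMULATORS = 2 ** ACC_FIELD_BITS - RESERVED_CODES
```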

    Part C: This architecture could be easily pipelined, with an instruction fetch, operand fetch and operand store stage in the pipe, just as on the minimal ultimate RISC. Memory arbitration (or a 2-port memory) would be required only when an instruction referenced the memory data register. In other cases, instruction fetch could bypass the CPU bus. In any case, the CPU bus and all registers directly attached to it should allow parallel read and write.

    One in five did well here, an equal number gave answers with no connection to the problem, and an equal number did fairly well but gave answers indicating a misunderstanding of the memory addressing restrictions of the architecture.