Final Exam Solutions

22C:122/55:132, Spring 2001

Douglas W. Jones

Exam Grade Distribution

Mean   = 14.6                                            X
Median = 16.5                                            X
                                   X                 X   X X X
_X___________________X_X___X___X_X_X_______X_X_______X_X_X_X_X_____________________X_
 . 4 . 5 . 6 . 7 . 8 . 9 .10 .11 .12 .13 .14 .15 .16 .17 .18 .19 .20 .21 .22 .23 .24 
Rough Grade Scale    - B             B +     + | -     - A             A +

Solutions

  1. Part a: Give reasons why a programmer, knowing something about high-speed architecture, might choose to compute the sum x+y+z+w outside the loop instead of simply adding the 4 terms computed during each iteration to a single accumulator with the statement:
                    x += (sap         & xFF)
                      +  ((sap >>  8) & xFF)
                      +  ((sap >> 16) & xFF)
                      +  (sap >> 24);
    
    If the 4 terms are summed independently, as in the background code given, there are no data dependencies between the 4 assignments, so the code for these may be interleaved in order to maximize the potential parallelism in this computation. The alternative code given in this problem requires the addition of 4 terms to x, adding data dependencies that may (for some pipelined processors) force pipeline interlocks.
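
    For reference, a sketch in C of the background code's approach (the names x, y, z, w, sap, and xFF are taken from the exam; the surrounding loop is omitted):

        /* four independent accumulators: there are no data
           dependencies between the statements, so they may
           be freely interleaved */
        x += sap         & xFF;
        y += (sap >>  8) & xFF;
        z += (sap >> 16) & xFF;
        w += (sap >> 24);
        /* after the loop: x = x + y + z + w */

    In the single-accumulator version, every term must instead be folded into x in one long dependent chain.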

    16 did well here, 1 earned no credit, and the remainder earned partial credit. Most of those earning partial credit understood the primary issue but made a major error.

    Part b: The final statement from the code given above may be parenthesized several ways:

    	x = ((x + y) + z) + w;
    	x = (x + y) + (z + w);
    
    Given a pipelined machine with 1 operand delay slot, what performance difference would you expect between the above two alternatives? Explain!
    ((x+y)+z)+w requires 2 stalls under the conditions given, because each addition depends on the result of the one immediately before it; (x+y)+(z+w) requires only 1 stall, because the two inner additions are independent, so the first can fill the delay slot of the second.
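
    For example, under the 1-delay-slot assumption, the two schedules might look like this (3-register ADD instructions and the temporaries t, t1, and t2 are assumed; // marks a stall):

        ADD t,x,y    -- t = x + y
        //           -- stall: t not yet available
        ADD t,t,z    -- t = t + z
        //           -- stall: t not yet available
        ADD x,t,w    -- 5 cycles in all

        ADD t1,x,y   -- t1 = x + y
        ADD t2,z,w   -- independent, fills the delay slot
        //           -- stall: t2 not yet available
        ADD x,t1,t2  -- 4 cycles in all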

    11 did well here, 8 lost half a point for either a failure to quantify or an error in computing the number of stalls, 2 both failed to quantify and made errors, and one answer made no sense.

    Part c: Assume we run this code on a machine with a modest-sized I-cache, where memory cycles are several times slower than the machine's clock rate. Explain why the issue raised in part b is highly likely to be irrelevant.

    Under these circumstances, the code after the loop exit is unlikely to be in the I-cache, so each instruction fetch after the loop exit will likely cause a stall in the fetch stage. As a result, data dependencies in this part of the code are unlikely to introduce any additional slowdown.

    This was a difficult question; 2 did well and 2 more gave odd answers that were not seriously wrong. 2 were vague about precisely which memory references introduced stalls, and 6 were worse than vague, explicitly mentioning operand fetches as the problem -- in code dealing entirely with registers! A total of 8 gave irrelevant answers.

    Part d: On a conventional 1970's style architecture (for example, the M680x0 or the 80x86) the loop terminating condition used above might translate to:

    	CMP ap,a1023  -- compare registers
    	BLE LOOP      -- conditional branch on condition codes
    
    Explain why the designers of many RISC architectures do not favor this approach, explain a viable alternative typical of RISC architectures, and illustrate the use of this alternative to replace the instruction sequence given above.

    This loop termination code requires the use of condition codes, and on these processors, all or most instructions set the condition codes, introducing significant data dependencies into the program. Specifically, there is a data dependency between the CMP and BLE instructions above, so a good compiler would attempt to insert other instructions between these two in order to eliminate interlocks. The problem is, on the example machines, most instructions change the condition codes, so it is not practical to reorder the code this way.
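
    For example (the ADD is a hypothetical filler instruction), any instruction moved into the gap clobbers the condition codes that BLE needs:

        CMP ap,a1023  -- compare registers, sets the condition codes
        ADD x,y       -- moved here to fill the gap; it sets them too
        BLE LOOP      -- now tests the result of the ADD, not the CMP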

    Many modern RISC architectures eliminate condition codes and introduce alternative conditional branches such as a compare and skip instruction:

    	CMPS ap,a1023,GT  -- compare registers and skip if greater
    	BR LOOP           -- unconditional branch
    

    5 did well here, 2 more gave weak explanations, 3 gave poor explanations, and 3 more gave very weak alternatives. Many students got very involved in why counting down to zero on a register in order to terminate this loop would save cycles, but in fact, it saves no cycles or registers, assuming that our architecture has, for example, a register-to-register compare and skip instruction, or, for a second example, a test-register and branch that could test a temporary register that was set by a 3-operand subtract.
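
    For example, the second alternative might be coded as follows (SUB as a 3-operand subtract and BLEZ as a test-register-and-branch are assumed mnemonics):

        SUB  t,ap,a1023  -- t = ap - a1023, t a temporary register
        BLEZ t,LOOP      -- branch to LOOP if t <= 0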

    Part e: Two of the register variables given in the above code, xFF and a1023 hold constant values; the justifications for saving these values in registers on a modern RISC machine are quite different! Explain this.

    xFF is pre-loaded in a register before the loop because, although the 8-bit constant FF (hexadecimal) is likely to be allowed as an immediate operand, many RISC machines allow immediate operands only on a small subset of the ALU instructions. Add immediate is very common, but immediate logical operations may be sufficiently infrequent that no opcode is reserved for this possibility.
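
    For example, if there is no AND-immediate, the constant is loaded once, outside the loop (LI and a 3-register AND are assumed mnemonics):

        LI  xFF,0xFF   -- load immediate, once, before the loop
        ...
        AND t,sap,xFF  -- register-to-register AND, inside the loop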

    a1023 may be a static constant, but if so, it is highly likely to be a large constant, requiring a 32-bit load immediate. If the instruction set is a typical 32-bit RISC instruction set, this may require a 2-instruction sequence. Furthermore, if the array a is not a statically allocated global array, the address of element 1023 will not be a constant, but will depend on where the array was allocated at run time. For this circumstance, it may take several instructions to compute the address of a[1023].
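
    For example, on a machine with MIPS-like LUI and ORI instructions (assumed here, as is a statically allocated a with 4-byte elements, putting a[1023] at byte offset 4092), the load takes two instructions:

        LUI a1023,hi(a+4092)        -- load the upper 16 bits
        ORI a1023,a1023,lo(a+4092)  -- OR in the lower 16 bits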

    Only 1 understood the problem with xFF while 3 understood the problem with a1023. 9 wandered into discussions of fetching xFF from memory (an unusual idea; few architectures lack a load-immediate that can load this 8-bit constant directly into a register, with no operand memory reference required). 2 somehow involved the use of a memory management unit in the computation of the address of a[1023]; this will not be involved on any computer of which I am aware. 12 earned no credit for either part of this question.

  2. Part a: Do the instruction caches (src and dst) need to snoop? Explain your answer, with references to (mis)features of the Ultimate RISC instruction set where these are relevant.

    Because the Ultimate RISC has no support for indexing or indirection, code that uses pointers or arrays must be self-modifying. The use of self-modifying code requires that the I-cache snoop. In this case, there are 2 I-caches that must snoop.
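
    For example (a sketch of the usual Ultimate RISC idiom; AA, I, X, and the label NEXT are assumed names, with AA holding the address of a), fetching a[i] requires patching the code:

        MOVE AA,ACC     -- acc = address of a
        MOVE I,ADD      -- acc = acc + i, the address of a[i]
        MOVE ACC,NEXT   -- store the computed address into the code
        NEXT: MOVE 0,X  -- the 0 is overwritten above; x = a[i]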

    5 did well here, 6 did not understand the need for self-modifying code on the ultimate RISC, 7 asserted the need for snooping based on flawed logic or failed to provide a reason. The remainder earned no credit.

    Part b: How many clock cycles would the following sequence of instructions take, assuming fast main memory and assuming that all instructions are in the cache before the sequence begins:

       MOVE A,ACC  -- acc = a         I   Fa  Sac 
       MOVE B,ADD  -- acc = acc + b       I   Fb  S+
       MOVE ACC,C  -- c = acc                 I   //  Fac Sc
       MOVE D,ADD  -- acc = acc + d               //  I   //  Fd  S+
       MOVE ACC,E  -- e = acc                             //  I   //  Fac Se
                                      1   2   3   4   5   6   7   8   9   10
    
    It takes 10 cycles (in the diagram, I is instruction fetch, F is operand fetch, S is operand store, and // marks a stall), with operand interlocks in cycles 4 and 8, and a memory interlock in cycle 6.

    3 did well here; 6 missed the memory interlock in cycle 6, 2 each missed the operand interlocks in cycles 4 and 8, and 9 added extra operand delays involving interaction between the first 2 instructions. Only 3 earned no credit.

    Part c: In order to eliminate pipeline stalls caused by operand dependencies, we add the following logic:

    if the memory address in the destination address register input to the operand store stage equals the memory address in the source address register in the operand fetch stage, forward the data out from the operand store stage directly into the operand fetch stage instead of doing a read from memory.
    When we try this logic, we discover that many programs that do input/output or arithmetic no longer operate correctly! Explain.

    The forwarding rule assumes that a read from an address immediately after a write to that address must return the value just written. That is true of ordinary memory locations, but not of locations with side effects: for a device register, read and write typically access physically distinct registers at the same address, so forwarding the written value in place of a device read breaks input/output; for the memory-mapped function registers of the ALU, the value stored is an operand while the value read back should be a computed result, so forwarding breaks arithmetic.

    11 did well here, 5 more earned partial credit by identifying something relevant, but failed to follow through with a coherent explanation.

    Part d: The figure includes two suggestions for the connection of Input-output devices to the machine, a and b. Discuss the differences between these, with respect to:

    With suggestion a, the device is attached to both the SRC and DST busses between the CPU and the memory arbitration system, so each device needs two bus interfaces; with suggestion b, the device sits on the main memory bus and requires only one bus interface per device. It is true that the busses used with alternative a are simpler (one is a read-only bus, the other write-only), but the device may need to interlock if there is a simultaneous read and write to the device, so overall, alternative a is likely to require more complex device interfaces.

    In alternative a, the device interface poses no interesting problems for the cache designer. In alternative b, however, we must suppress cache operation for transfers to and from device registers.

    7 understood the complexity problem, 9 understood the impact on the cache. A few earned partial credit on one or the other issue, while 6 earned no credit.

    Part e: The instruction fetch stage of the pipeline in the above system is an interesting finite state machine, with the inputs and outputs listed on the exam. Describe this finite state machine in more detail.

    Nobody did well here; 2 gave finite state machines that solved this problem but could not give single-cycle execution when there were hits in both caches, and 1 could not fetch the source and destination operands in parallel. 4 more gave random finite state machines, and 9 earned no credit by giving answers that could not be understood as finite state machine specifications. 6 gave no answer at all.

  3. Part a: Assuming there were no other problems with this proposal, it would work for some arithmetic operations and not for others. Which common arithmetic operations cannot be made into extended precision arithmetic operations by this scheme?

    Since all we did was connect the ALUs by allowing carry propagation, we have done nothing to solve the problems of multiplication and division, operations that require not only carry propagation but additional computation steps in proportion to the number of digits in the operands.
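
    To illustrate (a sketch, not part of the exam): even given a working double-width add, a double-width multiply must be composed from four partial products plus several extra additions, as in this C fragment building a 64 x 64 -> 128 bit product from 32-bit halves:

        #include <stdint.h>

        /* the four partial products and their accumulation; where
           extended addition needs only a carry chain, the product
           needs extra multiply and add steps */
        void mul128(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
        {
            uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
            uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

            uint64_t p0 = a_lo * b_lo;      /* low  x low  */
            uint64_t p1 = a_lo * b_hi;      /* low  x high */
            uint64_t p2 = a_hi * b_lo;      /* high x low  */
            uint64_t p3 = a_hi * b_hi;      /* high x high */

            uint64_t mid  = p1 + (p0 >> 32);     /* cannot overflow */
            uint64_t mid2 = p2 + (uint32_t)mid;  /* cannot overflow */

            *lo = (mid2 << 32) | (uint32_t)p0;
            *hi = p3 + (mid >> 32) + (mid2 >> 32);
        }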

    12 did well here, 4 only mentioned one or the other of the two operations, and 3 more focused on relevant side issues. The remainder either gave no answer or failed to identify anything relevant.

    Part b: What features of modern pipelined architectures in the context of a shared-memory multiprocessor make it unlikely that the processors will stay in lock-step? What additional logic must be added to the interconnection between the processors to make this scheme work?

    If the machines begin in identical states and execute the same instruction sequence, they will remain in lock step until the processors contend for memory. At that point, some processor must stall while another continues, and they will no longer be synchronized. We can fix this problem by cross-connecting the stall logic for each pipeline stage so that whenever one processor stalls in some stage, all will stall in that stage.
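
    A sketch of the added logic (names assumed): the stall signal for a given stage becomes the OR, across all processors, of that stage's local stall conditions:

        /* in hardware this is one OR gate per stage; shown in C */
        int global_stall(const int local_stall[], int nprocs)
        {
            int s = 0;
            for (int p = 0; p < nprocs; p++)
                s |= local_stall[p];  /* any one stalling stalls all */
            return s;
        }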

    9 did well here, and 3 more came close. 2 earned half credit by mentioning key issues, and 6 more touched on something relevant. Only 2 earned no credit.

  4. Part a: If we have 2^n CPU's and 2^n memory modules, how many caches does this imply, and how many different caches may be checked when a CPU references some specific memory location?

    For 2^n CPU's, this arrangement requires an n-level interconnection network containing 2^n caches per level, for a total of n(2^n) caches. During any one memory access, n caches are involved, one per level. For example, with 8 CPU's (n = 3), there are 3 levels of 8 caches each, 24 caches in all, and 3 of them are checked on each reference.

    4 gave good answers, 6 more had at least the understanding that n caches are checked per memory cycle. 2 made careless errors, while 9 gave answers that either dodged the point or clearly illustrated a significant misunderstanding of the system (for example, proposing an n by n crossbar).

    Part b: Assume simple write-through caches. What kinds of applications will work on this multiprocessor, what kinds will fail, and why?

    Since these are not snooping caches, a write by one CPU leaves stale copies in the caches along other CPUs' paths to memory, so applications that involve read-write access to shared memory locations will fail. Applications that share only read-only data, or in which each memory location is used by only one CPU, will work.

    4 did well here, and 3 more saw the need to snoop but failed to identify the class of applications that would fail. 8 focused on performance and did not appear to notice that there was a fundamental semantic problem with this architecture.

    Part c: A critic of this architecture says "When a CPU reads a word from some memory location, a copy of that word will be cached in every cache between that CPU and the memory. Therefore, if we describe the switching network in terms of n layers of caches, the aggregate contents of the cache in each layer will be the same, although the division of data between caches within each layer will differ from layer to layer." The critic is wrong. Explain.

    On any memory cycle that goes through to memory, some cache in each of the n layers of the interconnect network will be updated with the value transferred, but the cache entry replaced in each level will be different. Therefore, the contents of each layer will differ. For an extreme example, if all processors in the system access the same location in memory, there will be 2^n copies of this location in the cache layer closest to the processors, and half as many copies in each layer successively closer to memory. Therefore, in each layer closer to memory, there will be successively more space to cache other items.

    6 did well, 2 more were vague but seemed to have the right idea, and 2 were distracted by the lack of cache coherency. Note that the critic would still be wrong if we solved the cache coherency problem on this architecture!