# Final Solutions and Comments, Spring 2003

## Overall Score Distribution

```
Median = 74.2
Mean   = 73.2
(histogram of individual scores over the scale below)
48 .52 .56 .60 .64 .68 .72 .76 .80 .84 .88 .92
C               B               A
```

## Composite Exam Score Distribution

```
Median = 24.9
Mean   = 26.1
(histogram of individual scores over the scale below)
0 . 4 . 8 .12 .16 .20 .24 .28 .32 .36 .40 .44 .48 .
```

## Composite Exam Score Distribution

```
Median = 14.0
(histogram of individual scores over the scale below)
0 . 2 . 4 . 6 . 8 .10 .12 .14 .16 .18 .20 .22 .24 .26 .28 .30
```

## Exam Solutions

1. Background: Consider each of the following instruction formats and identify the appropriate interpretation or interpretations of the field marked ? (a zero register field read either as the constant zero or as the program counter).
```
         ___ _____ _____ _______________
Part a) |0 0|  r  |0 0 0|      disp     |  store M[ ?   + disp] = R[r]
        |_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|
```

Since low memory is ROM, a store to M[0+disp] would be useless. Since most code on a microcontroller will be executed from ROM, store M[pc+disp] would rarely work. Even if code were executed from RAM, mixing code and data is discouraged by all modern programming methodologies, so these instructions can be used for other purposes.

Only 1/5 got this. 1/5 more got partial credit for recognizing that a store to low memory wouldn't work, but seemed unaware that a PC-relative store is rarely needed. Many stated that both forms were useful.

```
         ___ _____ _____ _______________
Part b) |0 0|0 0 0|  x  |      disp     |  store M[R[x] + disp] =   ?
        |_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|
```

Storing zero to an arbitrary memory address is clearly useful for initializing variables! Storing the program counter to an arbitrary address may be useful in forming alternative calling sequences for procedures and functions.

1/3 got this, and another 1/4 earned partial credit, mostly by noticing that saving the program counter might be useful. A surprising number asserted that saving zero in memory was of no use.

```
         ___ _____ _____ _______________
Part c) |0 1|  r  |0 0 0|      disp     |  load R[r] = M[ ?   + disp]
        |_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|
```

Loading from M[0+disp] (low memory) will work, and since low memory is ROM, it could be used to load popular constants. But note! A load from M[pc+disp] (pc-relative) is no less effective for loading a constant, so it is unlikely that programmers or compiler writers would be bothered by a lack of support for loading from low memory.

Only 1/7 got this, while 1/3 held that both interpretations were equally useful, earning partial credit. 1/5 seemed not to understand that pc-relative loads had any particular use.

```
         ___ _____ _____ _______________
Part d) |0 1|0 0 0|  x  |      disp     |  load  ?   = M[R[x] + disp]
        |_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|
```

Loading the constant zero is clearly meaningless, but loading from an arbitrary memory location into the program counter allows indirect jumps, indirect indexed jumps, and function returns, so it is clearly useful.

3/4 understood this!

```
         ___ _____ _____ _______________
Part e) |1 0|  r  |0 0 0|      disp     |  immed R[r] =  ?   + disp
        |_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|
```

Loading 0+disp into a register is a good way to load an immediate constant, an extremely important operation! Loading pc+disp into a register offers another way to load the return address into a register, very useful if there is not a built-in function call instruction.
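All five parts hinge on what a zero r or x field means in the 16-bit format above (2-bit opcode, 3-bit r field, 3-bit x field, 8-bit displacement). For concreteness, here is a minimal field-extraction sketch in Python; the dictionary keys and the opcode values in the comments follow the formats shown above, not any official assembler.

```python
def decode(word):
    """Split a 16-bit PuCPL instruction word into its fields:
    2-bit opcode, 3-bit r, 3-bit x, 8-bit displacement."""
    assert 0 <= word < 1 << 16
    return {
        "op":   (word >> 14) & 0x3,   # 00 store, 01 load, 10 immed
        "r":    (word >> 11) & 0x7,   # register field
        "x":    (word >> 8)  & 0x7,   # index register field
        "disp": word & 0xFF,          # displacement
    }

# Part d's useful case: a load with r = 0 (the ? field) and x nonzero,
# e.g. opcode 01, r = 0, x = 2, disp = 4.
f = decode(0b01_000_010_00000100)
```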

2. Background: Consider the following proposal for a function call linked through register 2:
```
	IMMED 2,0,3     -- R2 = PC + 3
	LOAD  PC,PC,1   -- PC = M[PC + 1]
	CONST F         -- address of F
```

Part a) Suppose we'd had enough "opcode space" to allow a branch and link instruction that did everything a normal PµCPL jump instruction does but also saves the return address in a register. How much faster, in clock cycles, would this new call be than the call given above?

One cycle would be saved. The sequence above involves no interlocks, so replacing the IMMED/LOAD pair with a BAL instruction eliminates only one instruction fetch, taking the branch one cycle sooner than the original.

1/3 got this, and 1/3 got some partial credit; it is hard to tell why this problem was difficult.

Part b) If the PµCPL pipeline is implemented the naive way, the CPU will read the constant F from memory twice. If we add some forwarding logic, we can eliminate one of these fetches. Explain why the double read occurs, explain where the value of F can be forwarded from to eliminate the extra read, and describe, at an abstract level, the forwarding predicate that is required.

Because the constant F is stored immediately after the load instruction, the IF stage of the pipeline will fetch this one cycle before the MA/OP stage reads it while executing the load instruction. We can therefore forward it from the instruction register of the AC/OF stage to the memory data input of the MA/OP stage. We do this whenever we see a PC-relative load with a displacement of zero in the MA/OP stage.
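The predicate can be sketched as a boolean function of the decoded fields of the instruction in MA/OP. This sketch assumes register 7 names the PC and opcode 01 marks loads; both are assumptions for illustration, not the PµCPL encoding.

```python
PC = 7  # assumed register number for the program counter

def forward_constant(maop_fields):
    """True when the word sitting in the AC/OF instruction register
    should be forwarded to the MA/OP memory-data input, skipping a
    second read of the constant.  maop_fields is the decoded
    instruction currently in MA/OP: a dict with keys op, x, disp."""
    return (maop_fields["op"] == 0b01        # a load instruction...
            and maop_fields["x"] == PC       # ...that is PC-relative...
            and maop_fields["disp"] == 0)    # ...of the very next word
```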

Only 1/7 covered all three issues this question asked about. 1/3 gave poor explanations of the double-read, 1/3 gave incorrect descriptions of the forwarding path, and 1/2 gave incorrect predicates, or did not give a predicate.

3. Background: In typical microcontrollers, all RAM and ROM is on the same chip as the processor. The small size of the main memory on such systems and the fact that no off-chip data paths are required for access to memory make it possible to use memory with a cycle time that is closely matched to the CPU clock rate.

A question: What memory arbitration and cache structure would you recommend for use in the PµCPL? A top-level description will suffice.

An I-cache (even a small one) will reduce contention caused by instruction fetches, but no operand cache (O-cache) is needed because memory and the CPU run at the same rate. All memory is on the same chip as the CPU, and there is only one CPU, so no snooping mechanisms are required. A simple 2-port memory arbitrator, with a read-only instruction port and a read-write data port, is needed between the CPU and the memory.
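A sketch of the arbitration rule follows. Giving the data port priority on a conflict is my own assumption, on the reasoning that the I-cache should absorb most instruction fetches anyway; each request is either None or a ('read'/'write', address) pair.

```python
def arbitrate(ifetch_request, data_request):
    """Grant one memory access per cycle between the two ports of a
    simple 2-port arbitrator: a read-only instruction port and a
    read-write data port.  The data port wins conflicts (an assumed
    priority rule, not part of the posted solution)."""
    if data_request is not None:
        return ("data", data_request)          # data port wins conflicts
    if ifetch_request is not None:
        op, _addr = ifetch_request
        assert op == "read", "instruction port is read-only"
        return ("ifetch", ifetch_request)
    return None                                # memory idle this cycle
```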

1/10 did well. 1/3 insisted that we need an operand cache, 1/5 did not see the need for an instruction cache, 1/7 saw no need for bus arbitration, and 1/10 did not understand that this is definitely a von Neumann architecture, not a Harvard architecture. 1/2 had difficult-to-classify errors involving such things as off-chip memory busses, multiprocessors, or pipeline designs completely different from the specified design.

4. Background: Assume that the PµCPL operate instructions are entirely based on programmable logic. Assume that, at power-up, the processor begins executing code at memory location zero.

Part a) What must you assume about the mechanisms for accessing the programmable logic in order to bootstrap this machine?

It must be possible to program the operate instructions without using any operate instructions, that is, the initial instructions starting at location zero must be load, store or immediate instructions that combine to initialize the ALU control mechanisms required by the operate instructions.
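If we assume the one opcode value not used by the formats in problem 1 (they use 00, 01, and 10) marks the operate class, so that 11 means operate (a guess, not the published encoding), a sanity check on a boot ROM image looks like:

```python
OPERATE = 0b11   # assumed opcode class for operate instructions;
                 # 00 store, 01 load, 10 immed appear in problem 1,
                 # so 11 for operate is a guess, not the official spec

def boot_code_ok(rom_words, init_done_index):
    """Check that no word executed before the ALU control mechanisms
    are initialized (i.e. before init_done_index) is an operate
    instruction.  Inline CONST data words are not distinguished here;
    this is only a sketch of the bootstrap constraint."""
    return all((w >> 14) & 0b11 != OPERATE
               for w in rom_words[:init_done_index])
```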

1/3 did well and 1/7 had confusing answers that were not wrong. 1/3 earned some credit for assuming a complex initialization mechanism such as a DMA processor or complex reset logic to load the ALU programming information; this assumes mechanisms that are almost as complex as a CPU just for this job!

Part b) What must you assume about the mechanisms for accessing the programmable logic in order to allow interrupt handling on this machine?

On interrupt, the set of operate instructions currently programmed would be unknown, so the interrupt handler would have to save some or all of the programming and load new programming, then, before return, it would have to restore the saved programming. The entry and exit sequences for the interrupt handler that do this would have to be written without any use of programmable operate instructions.

1/4 did well here; many others had oddly focused answers, for example only discussing interrupts during the bootstrap code, or inventing complex mechanisms for interrupt handling that did not address the programmable operate instructions.

5. Background: A fully general programmable logic system for the PµCPL operate instructions may not be very useful. Assume, instead, that there is a general purpose ALU with a shift network attached to one input, a shift network attached to the result output, and programmable carry propagation between successive digits.

Part a) How many bits would you expect to be needed to control a general purpose shift network for the PµCPL?

A shift operation on a 16-bit word could shift from 0 to 15 places, taking 4 bits. There are 4 shift operations that are likely to be useful: arithmetic right shift, logical right shift, left shift, and rotate. We can use 2 bits to specify which shift operation. So, our total would be 6 bits.
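The 6-bit control word splits into a 2-bit operation select plus a 4-bit count; the particular 2-bit encoding below is arbitrary, chosen only for illustration.

```python
def shift16(value, control):
    """Apply one of four 16-bit shifts selected by a 6-bit control
    word: bits 5..4 pick the operation (00 left, 01 logical right,
    10 arithmetic right, 11 rotate -- an assumed encoding), and
    bits 3..0 give the shift count, 0 to 15."""
    op, n = (control >> 4) & 0b11, control & 0b1111
    value &= 0xFFFF
    if op == 0b00:                       # left shift
        return (value << n) & 0xFFFF
    if op == 0b01:                       # logical right shift
        return value >> n
    if op == 0b10:                       # arithmetic right shift
        sign = -(value >> 15)            # 0 or -1
        return ((value | (sign << 16)) >> n) & 0xFFFF
    return ((value >> n) | (value << (16 - n))) & 0xFFFF  # rotate
```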

1/4 gave good answers here, 1/3 said 2 bits sufficed -- these clearly had a very limited interpretation of the phrase "general purpose shift network". Several confused shift networks and alignment networks, while others gave answers such as 16 bits or, in one case, 33 bits, apparently offering bizarrely complex functions in the shift network.

Part b) A typical ALU has 4-bits of function select, if we ignore control over carry propagation. Given this, how many bits would you want in the op field for complete control over the ALU and shifters?

Given that we have 2 shift networks at 6 bits each, plus one ALU at 4 bits, it takes 16 bits to control the ALU/shifter. One student added the comment "this is too easy."

1/3 got this. A very popular answer was 5 bits, simply reading the PµCPL specification, and apparently missing the "how many bits would you want" as opposed to "how many bits do you have".

Part c) Here is a proposal for the programmable logic in our ALU: In addition to a carry propagation control register, we allow each bit of control input to the ALU and shifters to be computed as the programmed and of any of the bits in the op field, or their inverses. How many bits does it take to program each bit of control input, and how many control registers would you use to hold these bits?

There are 5 bits in the PµCPL op field. Our control function must specify, for each of these bits, whether it must be on, off or ignored. This takes 2 bits, for a total of 10 bits. We can clearly organize these 10 bits in a single control register.
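One such programmed AND term can be evaluated as follows. The 2-bit code assignment (high bit means "care", low bit gives the required value) is illustrative; the point is that each op bit needs the three states on, off, and ignored.

```python
def and_term(op, spec):
    """Evaluate one programmed control-input bit.  op is the 5-bit op
    field; spec gives one 2-bit code per op bit: the high bit means
    "care" and the low bit is the required value, so 0x = ignore,
    10 = must be 0, 11 = must be 1 (an assumed assignment)."""
    for i, code in enumerate(spec):
        if not (code & 0b10):
            continue                      # this op bit is a don't-care
        if (op >> i) & 1 != (code & 1):   # mismatch kills the AND term
            return 0
    return 1
```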

Only one got this. 1/3 somehow got things backward for a little credit, ignoring the 5-bit op field from the PµCPL spec and instead starting from their answer to part b. 2/5 gave no justification for their numbers!

Part d) Combining your answers to parts b and c, how many control registers are required to control the programmable ALU stage of our pipeline?

There are 16 bits that control the ALU-shifter (answer from part b) and it takes a 10-bit control register for each of these bits (answer from part c). Therefore, it takes 16 10-bit control registers to program the ALU stage of the pipeline, plus whatever is involved in carry propagation control.

1/10 did well here. 1/4 added their answers from part b and c instead of multiplying! 1/2 gave no justification for their numbers.

Note: parts c and d propose a really bad architecture. We'd be better off with 32 16-bit control registers, one for each of the possible 5-bit values of the op field. This would let us program a single op instruction by loading just one control register, while the scheme outlined here requires us to set all the control registers to program just one op instruction.
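The suggested alternative amounts to a simple lookup table indexed by the op field, so programming one operate instruction touches exactly one register. A minimal sketch:

```python
# 32 control registers, one 16-bit ALU/shifter control word per
# possible 5-bit op value (the alternative organization suggested
# above; register and function names here are informal).
control = [0] * 32

def program_op(op, word):
    control[op & 0b11111] = word & 0xFFFF   # one write programs one op

def alu_control(op):
    return control[op & 0b11111]            # direct lookup each cycle
```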

6. Background: If a PµCPL program first sets one of the operate stage programmed logic control registers, and also, in an adjacent instruction, does an operate instruction, there may be a need for pipeline interlocks. The logic control register load could be finished in one of the following pipeline stages:
• AC/OF -- Address Computation / Operand Fetch
• MA/OP -- Memory Access / Operate
• RS -- Result Store

Problem: For each of the possibilities listed above, is interlock logic required? In each case, must an operate instruction be blocked to await the loading of a logic control register, or must the loading be blocked to await completion of an operate instruction?

If a program loads an ALU control register immediately before an operate instruction and that load completes in the RS stage, the operate instruction must be blocked for one cycle; no interlock is needed if the load completes in the MA/OP stage or earlier.

If a program loads an ALU control register immediately after an operate instruction, we would need an interlock if the load is done in the operand fetch stage (admittedly, an odd place to do it!).

There is an obscure point worth noting here: If the ALU control registers are edge triggered and trigger on the same clock edge that advances the pipeline, no interlock is required; this is unlikely -- these registers would probably be simple latches.
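The two cases above can be summarized as a stall predicate on adjacent pipeline stages; the instruction-kind labels are informal, and the stage names match the list in the background.

```python
CTRL_LOAD = "ctrl_load"   # a load targeting an ALU control register
OPERATE   = "operate"     # any programmable operate instruction

def must_stall(acof_kind, maop_kind, load_done_in):
    """Hold the AC/OF instruction for a cycle when required.
    load_done_in names the stage where a control-register load
    completes: 'AC/OF', 'MA/OP', or 'RS'."""
    if maop_kind == CTRL_LOAD and acof_kind == OPERATE:
        return load_done_in == "RS"      # load ahead finishes too late
    if maop_kind == OPERATE and acof_kind == CTRL_LOAD:
        return load_done_in == "AC/OF"   # load behind finishes too soon
    return False
```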

1/3 did well here; 1/5 correctly documented the need for interlocks but didn't say which stages blocked. 1/7 had the blocking backward, stopping a later pipeline stage so an earlier stage could complete, indicating a serious misunderstanding of pipeline interlocks!