Study Questions

22C:122, Fall 1998

Douglas W. Jones
  1. Consider a fully interlocked pipelined RISC processor with two-way superscalar instruction execution. The pipeline has the conventional 5 stages: instruction fetch (which fetches 2 instructions per memory cycle), operand gather, ALU, memory, and result store. Assume that instructions are 32 bits and that the memory-CPU data path is 64 bits wide.
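
    (The numbers are consistent: 2 instructions per fetch times 32 bits per instruction is exactly 64 bits, so one memory cycle over the 64-bit data path can supply both issue slots of the superscalar pipeline each cycle.)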

    Part A How many level-1 caches would it be appropriate to include on the monolithic VLSI implementation of this processor? What purpose would they serve, which would be appropriately implemented as read-only caches, and which as write-through caches?

    Part B Draw a top-level diagram of the memory data paths you would expect on the monolithic VLSI implementation of this processor, emphasizing the memory arbitration logic and location of cache memory (or memories) between the pipeline and the memory bus connections of the chip.

    Part C If the chip is to include a memory management unit, suggest an appropriate placement of the MMU relative to the memory arbitration logic and cache (or caches) you proposed in part B. Assume the MMU is minimalist, containing only a TLB and relying on a software fault handler to service TLB misses.

    Part D Suggest alternative answers to parts A, B and C, and attempt to evaluate their expected effects on the performance and utilization of the resources of your hypothetical processor.

  2. Most modern RISC processors have very large execution contexts, with a large number of large registers, plus a heavy reliance on a large cache (or on large caches) to obtain reasonable performance. As a result, the context switching time on these machines is large, and once a context switch takes place, it may take many machine cycles for the cache to fill with enough data from the new context to allow operation to reach full speed.

    Some CPU designers have allowed for faster context switching by including multiple register sets in the CPU. For example, as early as 1974, the MODCOMP IV computer included 16 register sets in the CPU, with a field in the PSW used to select the current register set.
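
    As a rough illustration of why on-chip register sets help, here is a minimal C sketch (the names and sizes are invented for illustration; a real kernel's switch code would also save the PSW, floating point state, and so on) contrasting a copy-style context switch with a bank-select switch in the MODCOMP IV style:

        #include <stdint.h>

        #define NREGS  32   /* registers per context (assumed)                     */
        #define NBANKS 16   /* register sets held in the CPU, as on the MODCOMP IV */

        uint32_t cpu_regs[NREGS];        /* the single architectural register set    */
        uint32_t banks[NBANKS][NREGS];   /* hardware register banks                  */
        int current_bank;                /* models the register-set field of the PSW */

        /* Conventional switch: copy every register out to the old process's
           save area and back in from the new one.  The cost grows with NREGS,
           and the copies also displace useful lines from the data cache.       */
        void switch_by_copy(uint32_t *old_save, uint32_t *new_save) {
            for (int i = 0; i < NREGS; i++) old_save[i] = cpu_regs[i];
            for (int i = 0; i < NREGS; i++) cpu_regs[i] = new_save[i];
        }

        /* Banked switch: change the register-set selector, a single field
           update, provided the incoming process already owns one of the
           NBANKS register sets.                                                */
        void switch_by_bank(int new_bank) {
            current_bank = new_bank;
        }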

    If you use the "ps" command on a Unix system, you can get a list of all active processes (but by default, it only lists your personal processes and not those of others, the system, etc). Experiment with the "ps" command, and based on your experiments, discuss the impact you would expect from having a modest number of register sets stored within the CPU.

    What kind of instruction set design would you suggest for a machine designed to support extremely fast switching between a very large number of short-lived processes (or perhaps short-lived threads)?

  3. The DEC VAX architecture is one of the most rational orthogonal CISC architectures ever developed. Most arithmetic instructions come with either 1, 2 or 3 operands, where each operand may be a constant, a register (1 of 16), a memory location (register indirect or indexed), or an indirect address, with indexing allowed on the indirect address.

    VAX opcodes are 1 byte. Each operand field is 1 byte if it refers to a register or register indirect mode, 3 or 5 bytes if it holds a constant or index constant, and a prefix on the operand field may hold a second index constant, used, for example, in double-indexed addressing and in indexing off of a pointer used in indirect addressing.

    In sum, VAX instructions may vary over an extreme range of instruction sizes, and a single instruction may reference a large number of memory locations: a 3-address arithmetic instruction involving 3 indirect pointers, for example, will reference 6 distinct memory locations (3 pointers plus 3 operands). Instruction decoding on the VAX is quite complex because the size of an instruction depends on the opcode field (which determines how many operands there are) and on the initial bytes of each operand field, which determine the size of the operand fields.
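
    To see where the decoding complexity comes from, here is a small C sketch that computes the length of one operand specifier in a simplified VAX-like format (mode in the high nibble of the first byte, register number in the low nibble). The mode numbering below is a simplification, and the PC-based immediate and absolute forms are ignored, but it shows the essential problem: the decoder cannot find the next operand field, or the next instruction, until it has examined the first byte of this one.

        #include <stdint.h>

        /* Return the number of bytes occupied by the operand specifier that
           begins at p, in a simplified VAX-like encoding.                       */
        int specifier_length(const uint8_t *p) {
            int mode = p[0] >> 4;                   /* addressing mode, high nibble */
            switch (mode) {
            case 0: case 1: case 2: case 3:
                return 1;                           /* short literal in the byte itself */
            case 4:                                 /* index prefix: a second, full     */
                return 1 + specifier_length(p + 1); /* specifier follows it             */
            case 5: case 6: case 7: case 8: case 9:
                return 1;                           /* register, deferred, autoinc/dec  */
            case 0xA: case 0xB:
                return 2;                           /* byte displacement (deferred)     */
            case 0xC: case 0xD:
                return 3;                           /* word displacement (deferred)     */
            default:                                /* 0xE, 0xF                         */
                return 5;                           /* longword displacement (deferred) */
            }
        }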

    In a talk given in 1982, an interesting idea was proposed, that of a "sub-instruction set". The idea is to view each VAX instruction as consisting of a sequence of sub-instructions, where the sub-instructions involve such things as:

    In the above, x is a 4-bit constant (0..15), s is a 2-bit constant, and c is a constant taken from successive bytes that follow the sub-instruction.

    Part A: Propose a sub-instruction encoding. Can you arrange things so all sub-instructions are encoded as 8 bits (with added bytes to hold the constant c, if needed)?

    Part B: Consider the instruction ADD (X),1000(Y),Z (which means R[Z] = M[R[X]] + M[R[Y]+1000]). On the VAX, this would be a single instruction occupying 6 bytes (using a 16-bit index constant for the value 1000). How would this instruction be encoded using your answer to part A?
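
    For reference, one plausible byte-by-byte layout of that 6-byte encoding (assuming one-byte VAX-style operand specifiers and a word-displacement mode for 1000(Y)) is:

        byte 0      opcode (3-operand add)
        byte 1      operand specifier for (X): register deferred, register X
        byte 2      operand specifier for 1000(Y): word displacement, register Y
        bytes 3-4   the 16-bit displacement constant 1000
        byte 5      operand specifier for Z: register mode, register Z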

    Part C: Consider the possibility of pipelining these sub-instructions. There are opportunities to use both functional-unit parallelism and conventional pipelined execution. How much overlap of execution could you squeeze out of your code from part B? In evaluating this, consider the possibility of reordering the instructions from part B, for example, by interleaving the code that computes the two source operands. Also, consider the example of two consecutive add instructions, where the code to store the result could be interleaved with the code to fetch the next operand.

    Part D: How would you code CALL X, where X is the address of a function, and where the CALL instruction pushes the old PC onto the stack defined by R14 before the transfer of control to location X?

  4. GRI once made a modular 16-bit minicomputer with many similarities to the Ultimate RISC. This computer had the same general idea for an instruction execution unit, but the IEU's address size was very small, just 8 bits. The advantage of this was that each full instruction was just 16 bits. From the point of view of the IEU, main memory was manipulated through a memory address register and a memory data register, both located within the address space of the IEU.
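
    As a concrete picture of this execution model, here is a small C sketch of one IEU step: fetch a 16-bit instruction, treat its two bytes as an 8-bit source and an 8-bit destination device address, and move one word from source to destination. The device addresses, the single accumulator, and the ALU ports below are invented purely for illustration; main memory is reachable only through the MAR/MDR pair, as the problem statement says.

        #include <stdint.h>

        #define MEMSIZE 65536

        uint16_t mem[MEMSIZE];        /* 16-bit-wide main memory                 */
        uint16_t pc;                  /* 16-bit program counter                  */
        uint16_t mar;                 /* memory address register (an IEU device) */
        uint16_t acc;                 /* one accumulator, for illustration       */
        uint16_t alu_left, alu_sum;   /* ALU operand latch and result            */

        /* Invented 8-bit device addresses in the IEU's address space.  Note how
           the ALU addresses could let high-order address bits select the
           operation, as part C below suggests.                                  */
        enum { DEV_MAR = 0x01, DEV_MDR = 0x02, DEV_ACC = 0x10,
               ALU_LEFT = 0x20, ALU_ADD = 0x21, ALU_SUM = 0x22 };

        /* Reading or writing the MDR address is what actually touches main
           memory, at the address currently held in MAR.                         */
        uint16_t dev_read(uint8_t a) {
            switch (a) {
            case DEV_MAR:  return mar;
            case DEV_MDR:  return mem[mar];
            case DEV_ACC:  return acc;
            case ALU_SUM:  return alu_sum;
            default:       return 0;
            }
        }

        void dev_write(uint8_t a, uint16_t v) {
            switch (a) {
            case DEV_MAR:  mar = v;                 break;
            case DEV_MDR:  mem[mar] = v;            break;
            case DEV_ACC:  acc = v;                 break;
            case ALU_LEFT: alu_left = v;            break;
            case ALU_ADD:  alu_sum = alu_left + v;  break;
            }
        }

        /* One instruction: the high byte names the source device, the low byte
           the destination, so the whole machine is "move source to destination".
           The instruction fetch itself uses the full 16-bit PC, not the 8-bit
           device bus -- the routing issue raised in part A below.                */
        void step(void) {
            uint16_t ir = mem[pc++];
            dev_write(ir & 0xFF, dev_read(ir >> 8));
        }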

    Part A: Suggest a register-transfer level design for this machine. Note that the PC is 16 bits, and therefore routing the value of the PC to memory cannot be done over the 8-bit address bus that connects the IEU to the ALU, accumulator(s), memory address register, and memory data register.

    Part B: There's something missing from the above description of this architecture. Consider the problem of loading constants into the address register for direct addressing. Suggest some solutions to this problem.

    Part C: How many accumulators would it take to make effective use of this idea? Consider both the 8-bit address path between the IEU and the ALU and registers, and the problems you would expect in writing simple programs such as A = B + C. Assume that the ALU decodes at least 3 and more likely 4 bits of the 8-bit address to select the operation.
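
    Purely to illustrate the flavor of code this implies, here is how A = B + C might look in terms of the invented devices from the sketch above. Each dev_write(DEV_MAR, ...) step stands for "get this 16-bit address constant into MAR somehow", which is exactly the gap part B points at.

        /* A = B + C as a sequence of transfers on the hypothetical machine
           sketched above; addr_a, addr_b, addr_c are the memory addresses of
           the three variables.                                                 */
        void a_equals_b_plus_c(uint16_t addr_a, uint16_t addr_b, uint16_t addr_c) {
            dev_write(DEV_MAR, addr_b);                /* address of B -> MAR       */
            dev_write(ALU_LEFT, dev_read(DEV_MDR));    /* M[B] -> ALU left operand  */
            dev_write(DEV_MAR, addr_c);                /* address of C -> MAR       */
            dev_write(ALU_ADD, dev_read(DEV_MDR));     /* M[C] -> ALU; forms B + C  */
            dev_write(DEV_MAR, addr_a);                /* address of A -> MAR       */
            dev_write(DEV_MDR, dev_read(ALU_SUM));     /* sum -> M[A]               */
        }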

    Part D: How much pipelining could be applied to this architecture? How many memory ports would the pipelined version require? How deep would the pipeline be?