Final Exam

22C:122/55:132, Spring 2001

Douglas W. Jones

Please write your answers on the paper provided. Make sure your name is on the front page, in the same form as it appears in the official class list! Illegible and overly long answers are difficult to grade and may be penalized. Please leave margins around your answers!

This is an open-book, open-notes exam. There is no limit placed on your use of handwritten or printed reference materials.

This exam is worth 30 points (30% of your final grade). There are 15 parts to this exam, divided between 4 questions. Each part of each question is worth 2 points, and is worth about 8 minutes of your time.

  1. Background: Consider the problem of summing an array of characters. The following fragment of C code does this:
    	unsigned char a[1024];
    	register unsigned int * ap;
    	register unsigned int w,x,y,z;
    	register unsigned int xFF = 0xFF;
    	register unsigned int * a1023 = (unsigned int *)&(a[1023]);
    
    	...
    
    	w = x = y = z = 0;
    	ap = (unsigned int *)a;
    	do {
    		register unsigned int sap = *ap;
    		ap++;
    		x +=  sap        & xFF;
    		y += (sap >>  8) & xFF;
    		z += (sap >> 16) & xFF;
    		w += (sap >> 24);
    	} while (ap <= a1023);
    	x = x + y + z + w;
    
    Prefixing a declaration in C or C++ with the keyword register advises the compiler that it would be wise to allocate a register to hold this variable. In the above code, ap (once initialized) always points to an element of a, but instead of pointing to a character, it points to a word (4 characters).

    Part a: Give reasons why a programmer, knowing something about high-speed architecture, might choose to compute the sum x+y+z+w outside the loop instead of simply adding the 4 terms computed during each iteration to a single accumulator with the statement:

                    x += ( sap        & xFF)
                      +  ((sap >>  8) & xFF)
                      +  ((sap >> 16) & xFF)
                      +  ( sap >> 24);
    

    Part b: The final statement from the code given above may be parenthesized several ways:

    	x = ((x + y) + z) + w;
    	x = (x + y) + (z + w);
    
    Given a pipelined machine with 1 operand delay slot, what performance difference would you expect between the above two alternatives? Explain!

    Part c: Assume we run this code on a machine with a modest-sized I-cache, where memory cycles are several times slower than the machine's clock rate. Explain why the issue raised in part b is highly likely to be irrelevant.

    Part d: On a conventional 1970's style architecture (for example, the M680x0 or the 80x86) the loop terminating condition used above might translate to:

    	CMP ap,a1023  -- compare registers
    	BLE LOOP      -- conditional branch on condition codes
    
    Explain why the designers of many RISC architectures do not favor this approach, explain a viable alternative typical of RISC architectures, and illustrate the use of this alternative to replace the instruction sequence given above.

    Part e: Two of the register variables given in the above code, xFF and a1023 hold constant values; the justifications for saving these values in registers on a modern RISC machine are quite different! Explain this.

    Hint: Consider such questions as whether the value is likely to be statically defined, the likely formats for immediate operands, and the likelihood that these formats will be supported for the operations where they are needed.

  2. Background: Consider a pipelined Ultimate RISC with the following structure:
             ________              src  |
    	|        |------o-----------x--
    	|        |  ____|____       |
    	|        | |src cache|      |
    	|   IF   | |_________| dst  |
    	|        |------o-----------x--
    	|        |  ____|____       |
    	|________| |dst cache|      |
    	 |>_____|  |_________|      |
    	|___OF___|----o------o------x--     (b)
    	 |>_____|     |      |      |        ?
    	|___OS___|----|-o----|-o----x--     I/O   ___ 
    	             _|_|_   | |    |        |   |   |
    	            |     |  I/O     --o-----o---| M |
    	            | ALU |   ?     ___|___      |___|
    	            |_____|  (a)   | cache |
    	                           |_______|
    
    The IF stage fetches the source and destination addresses from consecutive memory locations. In case of conflict in the memory arbitration system, the bottommost active bus from the CPU wins and the topmost must wait.

    Part a: Do the instruction caches (src and dst) need to snoop? Explain your answer, with reference to (mis)features of the Ultimate RISC instruction set where these are relevant.

    Part b: How many clock cycles would the following sequence of instructions take, assuming fast main memory and assuming that all instructions are in the cache before the sequence begins:

       MOVE A,ACC  -- acc = a
       MOVE B,ADD  -- acc = acc + b
       MOVE ACC,C  -- c = acc
       MOVE D,ADD  -- acc = acc + d
       MOVE ACC,E  -- e = acc
    
    In answering this question, account for delays caused by operand interlocks and contention for memory, and also account for the time to fill the pipe, counting from the first instruction fetch to the last operand store.

    Part c: In order to eliminate pipeline stalls caused by operand dependencies, we add the following logic:

    If the memory address in the destination address register at the input to the operand store stage equals the memory address in the source address register in the operand fetch stage, forward the data from the operand store stage directly into the operand fetch stage instead of doing a read from memory.
    When we try this logic, we discover that many programs that do input/output or arithmetic no longer operate correctly! Explain.

    Hint: Some of the interlocks required to make the code from part b work correctly may provide useful illustrations.

    Part d: The figure includes two suggestions for the connection of input/output devices to the machine, labeled (a) and (b). Discuss the differences between these, with respect to:

    Part e: The instruction fetch stage of the pipeline in the above system is an interesting finite state machine with the following inputs:

    and the following outputs:

    Describe this finite state machine in more detail.

     

  3. Background: Many cryptographic algorithms require very long integers -- it is not uncommon to speak of doing arithmetic on 1024 bit words! If we build a computer with such a large word size, most of that width will be wasted for most purposes. Consider, therefore, building a shared memory multiprocessor where the CPUs can operate independently on 32 bit words, or they can be joined in pairs, quads, octets, or larger groups to operate on larger words.

    A naive but wealthy investor says: "All we need to do is enable carry propagation between the ALUs in the operate stages of the pipelined CPUs, and then have all of them jump simultaneously to the code that performs high precision integer arithmetic. From then on, the processors will operate in lock-step, until they execute the instruction that ends the connection."

    Part a: Assuming there were no other problems with this proposal, it would work for some arithmetic operations and not for others. Which common arithmetic operations cannot be made into extended precision arithmetic operations by this scheme?

    Part b: What features of modern pipelined architectures in the context of a shared-memory multiprocessor make it unlikely that the processors will stay in lock-step? What additional logic must be added to the interconnection between the processors to make this scheme work?

  4. Background: Consider the following alternative to a 2x2 crossbar switching element for use in building scalable multiprocessors using a banyan interconnect topology:
    	                      |
    	From client ----o-----x-
    	              __|__   |
    	             |cache|  |
    	             |_____|  |
    	                      |
    	From client ----o-----x-  _________
    	              __|__   |  |recognize|
    	             |cache|  o--| Address |--- To server
    	             |_____|  |  |__xx0xx__|
    	                      |   _________
    	                      |  |recognize|
    	                      o--| Address |--- To server
    	                      |  |__xx1xx__|
    
    Assume that the recognize address logic can be set to test exactly one bit of the address in order to route a bus cycle to one or the other of the two server-side memory busses.

    Part a: If we have 2^n CPUs and 2^n memory modules, how many caches does this imply, and how many different caches may be checked when a CPU references some specific memory location?

    Part b: Assume simple write-through caches. What kinds of applications will work on this multiprocessor, what kinds will fail, and why?

    Part c: A critic of this architecture says "When a CPU reads a word from some memory location, a copy of that word will be cached in every cache between that CPU and the memory. Therefore, if we describe the switching network in terms of n layers of caches, the aggregate contents of the cache in each layer will be the same, although the division of data between caches within each layer will differ from layer to layer." The critic is wrong. Explain.