22C:122/55:132 Notes, Lecture 40, Spring 2001

Douglas W. Jones
University of Iowa Department of Computer Science

Alignment Networks
Up to this point, we have assumed that the only memory operations are to load a word from memory and to store a word in memory. In fact, most memory systems operate this way, but many of our CPU designs assume a more interesting interface, where any particular instruction may read or write bytes or words (and sometimes also halfwords and doublewords). This leads us to imagine a memory interface such as the following:
```
	CPU side                            Memory side
                        _____________
        address   >----|             |----> address
        data out  >----|             |----> data out
        data in   <----|  interface  |----< data in
        direction >----|             |----> direction
        size      >----|_____________|
	
```
At the heart of this interface is a system of multiplexors and demultiplexors called an allignment network
Consider the simple case of a 16-bit memory interface on a machine that allows references to both bytes and 16-bit words. For a read from memory, we might use the following alignment network:
```
	CPU side                                  Memory side

	 byte     n   |------------------/----------- word
	address --/---|                 n-1         address
	              |-------- 
                               |  
                               |  /|
                               | |0|-- 0
                            ---|-| |
                           |   | |1|--   
                    MSB  8 |   |  \|  |   8  MSB
                 16   |--/-    |  |   o---/--|  
        data <----/---|        |  |   |      |---/--< data
                      |--/-    |  |   |  -/--|    16
                    LSB  8 |   /| |   | | 8  LSB
                           |  |1|-|---  |
                            --| | |     |
                              |0|-|-----
                               \| |
                                  |
        size ---------------------
	
```
When the size input to this circuit is 0, the circuit reads one byte from memory, using the least significant bit of the address to select the byte. When the size input is 1 and the least significant bit of the address is zero, it reads a 16-bit word.
There is a problem when this logic is used with a size of 1 (word) and an odd address. In this case, this alignment network reads the most significant byte from memory into both bytes of the value given to the CPU. This is clearly nonsense! The easiest solution is to simply disallow this case -- we simply require that all one-word operands be aligned on word boundaries in memory.
Exercise: Add logic to detetect an attempt to access a non-aligned operand and report this to the CPU.
Exercise: Work out the read alignment network for a 32-bit machine supporting aligned operands of size 8, 16 and 32 bits, encoded using size codes of 0, 1 and 2 from the CPU. Assume that 16-bit operands must have even addresses.
Exercise: Note that the alignment network given here always reads zero into the most significant bits of 8-bit operands. This is appropriate when reading data of the type declared as unsigned char in C or 0..255 in Pascal, but it is for the type signed char or -128..127. These require that the 8-bit operand be sign-extended to 16 bits. Modivy the alignment network given above to include an optional sign-extend control that enables this alternative.
Alignment Networks for Write
The most common way to build an alignment network that supports writes of partial words is based on a bus design that divides the write signal into multiple write signals, one for each byte of the word. We can do this with 8-bit bytes and a 16-bit word as follows:
```
	                       Memory Subsystem
         word ----/------------o--------
	address    n-1         |        |
                        |--/---|--------|---
         data out --/---|   8  |        |   |
                     16 |--/---|---     |   |
                            8  |  _|_   |  _|_
                                -|   |   -|   |
                                 |RAM|    |RAM|
         write low --------------|>__|   -|>__|
                                   |    |   |
                        |--/-------|----|---
         data in  --/---|   8      |    |
                     16 |--/-------     |
                                        |
         write high --------------------
	
```
This is the approach taken on a large fraction of the 16-bit microprocessors, including the Motorola 68000 and the Intel 8086, and modern 32-bit processors simply widen this model to 4 bytes, with 4 lines to control which byte is written during any particular write cycle, and an 8-line version would be required for a 64-bit word.
Given that we are using this type of ram, we can build the alignment network for the write data path as follows:
```
	CPU side                                  Memory side

	 byte     n   |------------------/---------- word
	address --/---|             ___    n-1       address
	              |-----o------|   |   ___ 
                            |      |OR |--|   |
                            |   ---|___|  |AND|-- write high
                            |  |         -|___|
                            |  |        |
        write --------------|--|--------o
                            |  |        |  ___
                            |  |    ___  -|   |
                             --|--O|   |  |AND|-- write low
                               |   |OR |--|___|
                               o---|___|
                               |
                               | |\
                            ---|-|1|
                           |   | | |--   
                    MSB  8 |  -|-|0|  |   8  MSB
                 16   |--/-  | | |/    --/--|  
        data >----/---|      | |  |         |---/--> data
                      |--/---o-|--|------/--|    16
                    LSB  8     |  |       8  LSB
        size ------------------o--
	
```
For an aligned word write, this alignment network puts the entire word on the data bus, while for byte write, it puts the duplicate copies of the byte on both bytes of the bus and then relies on the LSB of the address to determin whether the high or low byte gets written.
With this network, as with the read network given previously, an attempt to access a non-aligned word, that is, a word that starts at an odd address in memory, will be handled incorrectly; therefore, an attempt to do so should be detected and reported to the CPU.
Exercise:Work out the write alignment network for a 32-bit machine supporting aligned operands of size 8, 16 and 32 bits, encoded using size codes of 0, 1 and 2 from the CPU. Assume that 16-bit operands must have even addresses.
There is an alternative approach to writing partial words to memory that rests on a conventional memory where the entire word must be written at once. This is to read the entire word, update the byte or bytes that are to be changed, and then write it back to memory. Such read-modify-write cycles could be executed at no runtime cost in the era of core memory because the read operation from core was destructive, requiring that every read be followed by a write. Adding the data paths to allow the value written to be a modified version of the original value introduced no significant delay.
With modern DRAM semiconductor memory, the read operation is also destructive because it discharges the capacitors that store each bit of the word, but the refresh logic on the chip automatically handles the problem of refreshing the stored charge, and as a result, external logic does not see two memory cycles per read. Because of this, read-modify-write cycles require a pipeline stall on a typical modern machine, so (with an important exception), this approach is rarely used.
Non-Aligned Words
If we attempt to read or write a non-aligned word, we must recognize that we are dealing with a word that is stored physically in two consecutive words of memory. Thus, on a 16-bit machine, we encounter one case"
- Word access to an odd address
```
	          ________ ________
	aligned  |__HIGH__|__LOW___|
	          ________ ________ 
	  non    |________|__HIGH__| \
	aligned  |__LOW___|________| /
	
```
On a 32-bit machine, we have many more non-aligned cases:
- Half-word access to an address ending 11
- Word access to an address ending 01
- Word access to an address ending 10
- Word access to an address ending 11
```
	          ________ ________ ________ ________
	aligned  |_HIGHER_|__HIGH__|__LOW___|_LOWER__|
	         |________|________|__HIGH__|__LOW___|
	         |________|__HIGH__|__LOW___|________|
	         |__HIGH__|__LOW___|________|________|

	          ________ ________ ________ ________
	  non    |________|________|________|_HIGHER_| \
	aligned  |__HIGH__|__LOW___|_LOWER__|________| /
	          ________ ________ ________ ________
	         |________|________|_HIGHER_|__HIGH__| \
	         |__LOW___|_LOWER__|________|________| /
	          ________ ________ ________ ________
	         |________|_HIGHER_|__HIGH__|__LOW___| \
	         |_LOWER__|________|________|________| /
	          ________ ________ ________ ________
	         |________|________|________|__HIGH__| \
	         |__LOW___|________|________|________| /
	
```
Given that the operands we address in these cases span multiple words in memory, there is no choice but to use multiple memory cycles to access them!
Typically, in a pipelined CPU, this means that operand load or store operations that involve non-aligned operands must be executed using multiple memory cycles from the same pipeline stage, and this, in turn, means that this pipeline stage must stall for an extra cycle.
We could do this using a simple microprogrammed control unit for the pipeline stage that performs the operand fetch or store, but if this is the only cause we have to break out from pipelined execution, we can use simpler schemes.
For example, when we detect a non-aligned operand reference, we can stall the stages above while we peform the first half of the non-aligned reference, clocking a new operation into the interstage register on the next cycle that will (a) release the stall and (b) perform the second half of the non-aligned operation.
Eliminate Byte Addressing?
If the hardware supports non-aligned memory references, as is the case on the 80x86 family and most other CISC machines, they are bad for the efficiency of pipelined processing, and therefore, they should be avoided.
Compilers, for example, can automatically pad data structures so that each operand begins in a new word, and the heap manager for dynamically allocated data types can guarantee that all objects on the heap begin on a word boundary. Better compilers, particularly for languages that do not specify a one-to-one mapping from textual form of declaration to a specific storage layout in memory, may re-order the fields of records (and activation records) so that memory is used efficiently while staying within alignment constraints. Consider:
```
	struct widget {
		byte A;
		halfword B;
		byte C;
		word D;
	}

	Inefficiently allocated:
	          ________ ________ ________ ________
	         |////////|////////|////////|___A____|
	         |////////|////////|________B________|
	         |////////|////////|////////|___C____|
	         |_________________D_________________|

	Efficiently allocated:
	          ________ ________ _________________
	         |___A____|___C____|________B________|
	         |_________________D_________________|
	
```
Given that good compilers avoid non-aligned operands, it is fair to ask, why bother supporting them in hardware.
The designers of the DEC Alpha RISC architecture took an extreme position. This machine has 64-bit registers, and it supports direct load and store of 64-bit and 32-bit operands, with no support of non-aligned 64-bit operands, and no load or store operations for operands smaller than 32 bits.
Memory addresses on the Alpha are still byte addresses, so each byte in memory can be given an address, but the least significant 2 bits of the address are not used outside the CPU. Given a byte pointer in a register on the Alpha, loading a byte from memory takes a 2-instruction sequence:
1. Load 32-bit word using the byte pointer. The 2 least significant bits of the pointer are ignored by this instruction.
2. Extract byte from word using the byte pointer. This instruction uses only the 2 least significant bits of the byte pointer, using them to control an alignment network inside the ALU stage of the pipeline.
The code to store a byte in an arbitrary memory address is even more complex:
1. Load 32-bit word using the byte pointer. The 2 least significant bits of the pointer are ignored by this instruction.
2. Insert byte in word using the byte pointer. This instruction uses only the 2 least significant bits of the byte pointer, using them to control an alignment network inside the ALU stage of the pipeline.
3. Store the updated 32-bit word using byte pointer. As with the load instruction, the 2 least significant bits of the pointer are ignored.
These instruction sequences seem quite unwieldy, but the designers of the Alpha emphasized the importance of carefully hand coding very high performance string operations that manipulated strings using 64-bit parallel data transfers whenever possible. As a result, the performance of the Alpha has proven to be extremely good; for many years, the Alpha was the fastest microprocessor available, outperforming the competing versions of the Intel Pentium, the SGI MIPS and the IBM/Apple Power architecture.
Address Translation
Since around 1970, most operating systems of any significance have assumed the availability of a memory management unit, sitting between the CPU and main memory and able to translate virtual addresses output from the CPU to physical addresses used in memory.
```
         _____               data               _____
        |     |  |--------------------------|  |     |
	| CPU |--|          _____           |--| RAM |
	|_____|  |---------|     |----------|  |_____|
                   virtual | MMU | physical
                   address |_____| address
	
```
The naive programmer generally does not want to know about physical addresses and would prefer to ignore the presence of the memory management unit in the system. The naive system programmer knows it is there and typically prefers a very simple model of its function based on a set of data structures that are all in RAM:
```
            page    word in page
         ___________ ___________
        |___________|___________| Virtual address
              |           |
              |            --------------------
              |                                |
              |             page table         |
              |                                |
              |         rights  frame          |
              |          ________________      |
              |         |____|___________|     |
              |         |____|___________|     |
               -------->|____|___________|     |
                        |____|___________|     |
                        |____|___________|     |
                        |____|___________|     |
                        |____|___________|     |
                           |       |           |
        Rights <-----------        |           |
                              _____|_____ _____|_____
            Physical address |___________|___________|

                                 frame   word in frame
	
```
If the rights field of the table entry indicates that that table entry is invalid, or if the rights field says that a table entry is read-only but the program attempted a write operation, the memory management unit will abort the memory access and force the CPU to trap.
If we actually built memory management units like this, the performance penalty would be immense! For each memory reference by an applications program, the MMU would have to add the page number field of the virtual address to the base address of the page table to construct the address of a page-table entry, read the page-table entry from RAM, and then use it to construct a physical address. Thus, each main memory reference from the CPU would take two memory cycles, one for address translation and one for the actual data transfer.
```
         _____               data            
        |     |  |--------------------------|  |
	| CPU |--|          _____           |--x--
	|_____|  |---------|     |----------|  |
                   virtual | MMU | physical    |
                   address |_____| address     |
                              |                |
                               -------o--------x--
                                    __|__      |   _____
                                   |     |     |  |     |
	                           | TLB |      --| RAM |
	                           |_____|        |_____|
	
```
We can improve on this naive design by adding a cache to the MMU, as shown above. This cache is usually called the translation lookaside buffer, but in the conceptually easiest designs, it is just a fast cache.
In fact, it is foolish to use the physical address of a page-table entry as a key during translation lookaside buffer lookup, because this requires adding the page-table base address to the page number prior to checking the TLB. It is better to use the page number itself as the search key in the TLB, so that the TLB associative memory directly associates page numbers with frame numbers, and the physical address of a page-table entry is only needed if there is a TLB miss.
In modern RISC systems such as the SGI MIPS chip, it is common to simplify the memory management unit even more by eliminating the hardware for handling TLB misses. In these systems, a TLB miss is treated by the hardware in the same way that it treats an access violation or an invalid page-table entry. The memory management unit simply requests a trap when any of these events occur. On such a machine, the TLB entries are treated as special registers that may be written by system software in the trap handler in response to a TLB-miss trap.
Further discussion of memory management unit design requires a large body prerequisite material from operating systems!
The Big Picture
The full data path from a pipelined CPU to the bus interface connector of a modern microprocessor chip such as the Pentium or Power PC that supports symmetrical multiprocessor configurations must look something like the following:
```
        Pipelined               |                        |
	   CPU       -----o-----x--
	            |   __|__   |                        |
	 ________   |  |  I  |  |      On the same
	|________|--   |cache|  |      chip as the       |
	 |______|      |_____|  |          CPU
	|________|      _____   |                        |
	 |______|      |     |  |
	|________|-----|Align|--x--  _____               |
	 |______|      |_____|  |   |     |          |
	|________|               ---| MMU |----o-----x-- |
	                            |_____|  __|__   |
	                                    |  O  |--o   |   |
	                                    |cache|  |   _   | Bus on
	                                    |_____|   --|_|--o motherboard
	                                                     |
	                                     (L1)        |   |
	
```
The memory management unit sits between the processor and the operand cache because the operand cache is a snooping cache, and must therefore deal with physical addresses. The instruction cache, in contrast, uses virtual addresses, and because of this, we must invalidate the entire contents of the I-cache whenever we change the address mapping data used by the memory management unit.
A second reason that the memory manager is typically put between the processor and the cache is so that the entries in the page table (or the TLB) can contain one bit to indicate whether each page of the address space is to be cached or not. Pages that are mapped to conventional RAM are typically marked as cached, while pages that are mapped to device registers on the input-output bus should not be cached. Special areas of the memory address space such as video RAM may also require the cache to be inhibited, particularly if it is a write-back cache and not a write-through cache.