13. The Complexities of Memory.

Part of 22C:40, Computer Organization and Hardware Notes
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Background

Up to this point, we have assumed a very simple model of main memory, where the address output by the processor goes straight to the memory without modification, and where the data path to and from memory is not interrupted by anything. An abstract view of this naive memory model is given in this figure.
A naive model of memory
Naive model of memory

There are two problems with this model. The first is one of speed! A typical central processor of the year 2000 can execute about 1 billion instructions per second, while typical memory of the year 2000 can handle only about 20 million memory references per second. Central processors are about 50 times faster than main memory!

First-generation processors, whether built with vacuum tubes in 1950, with integrated circuits in 1965, or as microprocessors in 1980, were generally about the same speed as main memory. On such processors, this naive model was perfectly reasonable. By 1970, however, transistorized supercomputers were being built where the central processor was significantly faster than the main memory, and by 1980 the difference had increased further, although it took several decades for the performance gap to reach today's extreme.

Our solution to this problem is to use what is called a cache memory between the central processor and the main memory. Cache memory takes advantage of the fact that, with any of the memory technologies available for the past half century, we have had a choice between building large but slow memories or small but fast memories. This was known as far back as 1946, when Burks, Goldstine and von Neumann proposed the use of a memory hierarchy, with a few fast registers in the central processor at the top of the hierarchy, a large main memory in the middle, and a library of archival data, stored off-line, at the very bottom.

A cache memory sits between the central processor and the main memory. During any particular memory cycle, the cache checks the memory address being issued by the processor. If this address matches the address of one of the few memory locations held in the cache, the cache handles the memory cycle very quickly; this is called a cache hit. If the address does not match, the memory cycle must be satisfied far more slowly by the main memory; this is called a cache miss. All cache systems implement some policy to determine what small set of memory locations is held in the fast memory inside the cache and when this set will change. This is called the cache replacement policy.
Adding a cache to the naive view
Memory with a cache
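
To see why even a small, fast cache pays off, consider the average time per memory reference. The short C sketch below works this out for purely illustrative figures (a 5 nanosecond cache, a 50 nanosecond main memory, and a 95 percent hit rate, with the simplifying assumption that a miss costs only the main-memory time); the real numbers depend on the technology and on the program's pattern of memory references.

    #include <stdio.h>

    int main(void) {
        /* Illustrative figures only; real values depend on the hardware. */
        double t_cache  = 5.0;   /* time to service a hit, in nanoseconds  */
        double t_main   = 50.0;  /* time to service a miss, in nanoseconds */
        double hit_rate = 0.95;  /* fraction of references that hit        */

        /* Average time per reference: hits are fast, misses pay full price. */
        double t_avg = hit_rate * t_cache + (1.0 - hit_rate) * t_main;

        printf("average access time = %.2f ns\n", t_avg);  /* 7.25 ns */
        return 0;
    }

Even with these made-up numbers, the average access time ends up much closer to the cache's speed than to main memory's, which is why the scheme is worthwhile despite the cache holding only a tiny fraction of memory.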

A cache constantly monitors the data and address busses that connect the processor to the main memory. This requires that these busses be designed to allow such eavesdropping, and it requires that the memory be designed to allow a cache to preempt it, forcing the premature completion of a memory cycle that would otherwise take much longer.

Without cache memory, the applications programmer would have to explicitly transfer data between the large main memory and the small but fast cache. With the use of a cache, this transfer is automated. It is natural to ask: can we automate the transfer of data between other levels in the memory hierarchy? The answer to this question is a firm yes. In 1962, Ferranti Ltd., working with Manchester University, solved this problem with the introduction of the Atlas computer, the first computer to incorporate what was later called a paged memory management unit and to support virtual memory.

By 1970, several other commercially available systems supported virtual memory and IBM was scrambling to support this on their mainframe computer systems. By 1985, several microcomputer-based virtual memory systems were being sold, and by 2000, virtual memory technology was universally available on all but the smallest embedded computer systems.

Put simply, virtual memory is present in any system where the memory addresses issued by the program are transformed, by any mechanism, to produce the memory addresses actually seen by the memory hardware. The mechanism that performs this transformation has come to be called a memory management unit, although it has gone under other names in older systems.
Adding a memory management unit
A Memory Management Unit between the CPU and Memory

Memory management units take the address issued by the central processor and replace it with some other address, allowing all of the parts of memory that the current program needs to be stored in a relatively small main memory, while unneeded parts of memory are stored on disk or some other larger, slower memory. So long as the program never tries to use the other memory, it runs at full speed, but when it issues an address for a word that is not currently in main memory, the memory management unit forces a trap, allowing the operating system to bring the required data into memory, an action that will probably involve moving other data from main memory to disk. This activity is called a page fault.

Where we spoke simply of the address of a word of data when there was no memory management unit, we must now distinguish between the address that is used as input to the memory management unit and the address it produces as output. We refer to the addresses issued by the application software running on the central processor as virtual addresses, while the addresses actually delivered to memory are physical addresses. The function of the memory management unit is therefore to translate the virtual address to the physical address.
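
As a rough illustration of the translation the memory management unit performs, here is a C sketch that maps a virtual address to a physical address through a single-level page table. The page size, the table layout, the toy 16-page address space and the page_fault stub are all inventions for this example; real memory management units do this work in hardware and differ considerably in detail.

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define PAGE_BITS 12                    /* assume 4096-byte pages       */
    #define PAGE_SIZE (1u << PAGE_BITS)
    #define NUM_PAGES 16                    /* a toy 16-page virtual space  */

    typedef struct {
        bool     valid;   /* is this page currently in main memory?         */
        uint32_t frame;   /* physical frame number, if valid                */
    } pte_t;

    static pte_t page_table[NUM_PAGES];

    /* Stand-in for the trap to the operating system on a page fault. */
    static void page_fault(uint32_t vpn) {
        fprintf(stderr, "page fault on virtual page %u\n", vpn);
        exit(1);
    }

    /* Translate a virtual address to the corresponding physical address. */
    static uint32_t translate(uint32_t vaddr) {
        uint32_t vpn    = vaddr >> PAGE_BITS;       /* virtual page number  */
        uint32_t offset = vaddr & (PAGE_SIZE - 1);  /* offset within page   */

        if (vpn >= NUM_PAGES || !page_table[vpn].valid)
            page_fault(vpn);            /* operating system must intervene  */

        return (page_table[vpn].frame << PAGE_BITS) | offset;
    }

    int main(void) {
        page_table[2] = (pte_t){ .valid = true, .frame = 7 };     /* page 2 in frame 7 */
        uint32_t pa = translate(0x2ABC);
        printf("0x%05x -> 0x%05x\n", (unsigned) 0x2ABC, (unsigned) pa); /* 0x02abc -> 0x07abc */
        return 0;
    }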

Alternatively, some authors, primarily British, have referred to virtual addresses as names while physical addresses are simply referred to as addresses. In this case, the application program is described as operating in terms of the (numeric) names of variables, while the actual memory holds these variables at specific addresses. The memory management unit is then described as performing name to address translation, and in this regard it serves somewhat as the old village postman once did, in the era when letters could be addressed simply to a person, and it was up to the postman to know where that person's mail should be delivered.

While memory management units were invented in order to allow the use of virtual memory to automate data transfers between levels in the memory hierarchy, a second use quickly became apparent. The memory management unit can be used to completely isolate different programs from each other. Prior to the advent of virtual memory, computers could run multiple programs, but an error in one program could corrupt data anywhere in the memory, not only data belonging to other programs, but also data belonging to the operating system itself. With a memory management unit, we can arrange things so that, at any time, the only parts of memory that can be addressed at all are those parts belonging to the currently running program.

Typical modern computer systems contain multiple caches plus a memory management unit. An instruction cache or I cache holds only instructions, speeding the fetch phase of the fetch-execute cycle. A data cache or D cache holds only operands, and serves to speed the execute phase of the fetch-execute cycle. Sometimes, instead of separate I and D caches, there is just one on-chip cache, the L1 cache. In addition, it is common to find an L2 cache that is outside the processor chip, and there are systems that use an L3 cache, sometimes having one such cache on each memory chip. The cache levels L1 and L2 are counted moving from the processor toward memory. Sometimes, the I cache and D cache are counted together as the L1 cache. A complete system, therefore, might be as complex as that shown here:
A complete system
A complete system with I cache, D cache, MMU and L2 cache

In the above diagram, the central processor has two separate interfaces to memory, each with its own cache; this is possible because most modern processors are pipelined, so that the next instruction is fetched while its predecessor is executed. If the execute phase of one instruction involves a memory reference, for example, when the instruction is a load or store instruction, the processor tries to perform this load or store in parallel with the instruction fetch. Because of the separate caches for the fetch and execute stages of the pipeline, it is possible that these two memory cycles will indeed take place in parallel. Only when both caches have a miss will the bus arbitration logic have to select between the two memory references, allowing the execute memory reference to go first, and then allowing the next instruction fetch.

Exercises

a) In the last figure above, identify the various caches according to type (I cache, D cache, L1 cache, L2 cache, L3 cache, etc). Note that there may be multiple caches of one kind and none of some other kind. Also note that multiple names may be applicable to some of the caches.

b) In the last figure above, if the mapping function performed by the memory management unit is changed, the contents of some of the caches must be erased. Which caches, and what makes their contents invalid?

c) Suppose input/output devices are connected to the same address and data busses as the main memory. Explain why the presence of cache memory causes problems when a program attempts to read the device registers. Hint: Consider the impact of caches on a polling loop that repeatedly tests the device ready bit in a memory-mapped device register.

Building a Simple Cache

Our goal here is to show how a cache can be built from a small, fast random-access memory and a bit of auxiliary logic. If our memory address space allows a single 32-bit word of memory to be selected by a 30-bit address, this means that the main memory could be up to a gigaword, 2^30 words. Here, we will assume a cache of only 32 kilowords, 2^15 words. This is not an unusual size for the L2 cache on a modern processor. Of course, we will also assume that our small memory is much faster than main memory. If one memory cycle on main memory takes 50 nanoseconds, it is reasonable to rely on an L2 cache that has a 5 nanosecond cycle time. The internal L1 cache or the I cache and D cache might have cycle times of 0.5 nanoseconds in the same system.

If we can only store 32K words in our cache, then we need additional logic in the cache to associate each word of cache with a particular address in main memory. The easiest way to do this is to use 15 bits of our 30-bit main memory address to select a word in our small memory. If we do this, each word in the small memory can hold the value of any one of 32K different words in main memory, so with each word, we need to store the remaining 15 bits of the address in order to remember which particular word we are holding in the cache. Thus, each entry in our small memory holds 15 bits of address, which we call the tag, plus a 32-bit word of data. Pictorially, the core of such a memory can be described as follows:
The data flow within a cache
The structure of a simple cache
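
For concreteness, the entries of such a cache might be modeled in C as shown below. The field widths follow the text (a 15-bit tag and a 32-bit data word, with the low 15 bits of the 30-bit word address selecting the entry and the high 15 bits stored as the tag); the valid flag is an extra detail, not part of the entry as described above, added here only so that entries that have never been loaded cannot produce false hits.

    #include <stdint.h>
    #include <stdbool.h>

    #define CACHE_BITS  15                        /* 2^15 = 32K entries     */
    #define CACHE_SIZE  (1u << CACHE_BITS)

    typedef struct {
        bool     valid;              /* has this entry ever been loaded?    */
        uint16_t tag;                /* high 15 bits of the word address    */
        uint32_t data;               /* the cached 32-bit word              */
    } cache_entry_t;

    static cache_entry_t cache[CACHE_SIZE];

    /* Split a 30-bit word address into a cache index and a tag. */
    static inline uint32_t cache_index(uint32_t addr) { return addr & (CACHE_SIZE - 1); }
    static inline uint16_t cache_tag(uint32_t addr)   { return (uint16_t)(addr >> CACHE_BITS); }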

The logic shown in this figure has one output to the cache controller, hit, indicating that the addressed word in the cache's memory has a tag that matches the contents of the address bus. The cache controller for this cache must produce two control signals: clock, to indicate when a new data word and its tag should be stored in the cache's memory, and enable, to control when a data word from the cache's memory should be put on the data bus.

The job of the cache controller is to clock new data into the cache's internal memory whenever there is valid data on the data bus along with a valid address on the address bus, and, in response to a hit that occurs when there is a read request from the processor, to put the cache's data on the data bus, cutting the memory cycle short.
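
In software terms, that read behavior might be sketched as follows, continuing the hypothetical cache_entry_t model above; main_memory_read() here stands in for a full, slow bus cycle to main memory, and the whole function is only a model of what the hardware does while watching the bus.

    extern uint32_t main_memory_read(uint32_t addr);   /* slow main-memory bus cycle */

    /* Model of one read cycle as seen through the cache. */
    uint32_t cached_read(uint32_t addr) {
        cache_entry_t *e = &cache[cache_index(addr)];

        if (e->valid && e->tag == cache_tag(addr))
            return e->data;            /* hit: cut the memory cycle short   */

        /* Miss: let main memory answer, then clock the word and its tag
           into the cache so a later read of this address will hit. */
        uint32_t word = main_memory_read(addr);
        e->valid = true;
        e->tag   = cache_tag(addr);
        e->data  = word;
        return word;
    }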

This cache design is called a write-through cache because it does nothing to accelerate write cycles. Write-back caches are more complex, but their payoff is lower because most memory cycles issued by most processors are for read operations. The I cache that handles only instruction fetches need not support write cycles at all, and if there are both an L1 and an L2 cache, it is fairly common for one of them to be a write-through cache.

A control system for this cache must connect with the bus control lines. Here, we will assume that there is a read line that is asserted when the processor attempts to read from memory, and an ack line asserted by memory in response. The read signal indicates that a valid address is already on the address lines, and the ack line indicates that the data from memory is valid. The cache itself will assert the ack line when there is a hit. If we ignore write cycles, for example, because the cache is serving a read-only memory or because the cache is being used as an I cache, no other control signals are involved, and the full cache controller looks like this:
A read-only cache controller
a read-only cache controller

Note here that both the cache controller and the memory can drive the ack signal. If neither drives this signal, it is false. We show this using a bus driver on the cache controller; this driver provides no output if it is not enabled, and provides a constant output of one if it is enabled.

Extending this cache controller to make a write-through cache is not difficult. The primary new requirement is that the cache should record the value written whenever there is a write to memory. To support write operations, we must extend our set of bus control lines by adding a write signal that goes true when a valid address and the data to be written are both present on the bus. Again, we will assume that the memory responds to this with an ack signal to acknowledge completion of the write.
A write-through cache controller
a write-through cache controller
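
A corresponding sketch of the write path, again continuing the hypothetical C model above: in a write-through cache the word always goes all the way to main memory, but the matching cache entry is updated (or filled) at the same time so that a later read of the same address will hit. The main_memory_write() call is an assumed stand-in for the full bus cycle.

    extern void main_memory_write(uint32_t addr, uint32_t word);  /* slow bus cycle */

    /* Model of one write cycle through a write-through cache. */
    void cached_write(uint32_t addr, uint32_t word) {
        cache_entry_t *e = &cache[cache_index(addr)];

        /* Record the value in the cache so a later read will hit ...      */
        e->valid = true;
        e->tag   = cache_tag(addr);
        e->data  = word;

        /* ... but the write is not accelerated: it still goes to memory.  */
        main_memory_write(addr, word);
    }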

Both the cache controller and the bus interface for a write-back cache are far more complex, since such a cache must be able to initiate memory cycles independently from the CPU. In effect, the cache controller for such a cache is an independent but very simple processor, with its own internal state, responding to bus cycles initiated by the central processor, but just like a direct memory access interface to a peripheral device, able to initiate its own memory cycles.

Exercises

d) The cache shown above will work, to some extent, if the address lines are switched, so that the most significant bits of the address are used to select a word from the cache's memory and the least significant bits of the address are stored as the tag. This would work just as well as the original design if addresses issued by the processor were truly random, but it works very poorly for typical programs. Why?

e) The random access memory used in the cache shown here has a 15-bit memory address. How big is each word of this memory?

Inside the Memory Management Unit

Today, the available main memory technology allows access to one word of memory in around 50 nanoseconds, while our central processors can execute an instruction in around one nanosecond. This imbalance would cripple the speed of our computers if we did not add additional hardware, known as cache memory, to speed access to main memory.

......