36. Cache Memory

Part of the 22C:122/55:132 Lecture Notes for Spring 2004
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Pipelines and Multiport Memory

Pipelined machines invariably require at least two memory ports for the CPU, one for the instruction fetch pipeline stage, and one for operand fetch or result store. The MODCOMP IV computer, a 16-bit superminicomputer that came to market in 1974, did this with a crossbar switch:

	                   __________   |   |   |   |
        Instruction Fetch |__________|--x---x---x---x--
	                   _|>_____|_   |   |   |   |
        Execute           |__________|--x---x---x---x--
                                        |   |   |   |
	                   __________   |   |   |   |
        DMA Controller    |__________|--x---x---x---x--
	                   __________   |   |   |   |
        DMA Controller    |__________|--x---x---x---x--
                                        |   |   |   |
	                               |M| |M| |M| |M|
                                       |_| |_| |_| |_|
This machine had a microprogrammed execute stage that could execute most of the common instructions in a single cycle, but there were many others that took multiple cycles. The 4-way interleaved main memory reduced the chance of a collision between the two pipeline stages to 1 in 4 -- a collision occurs when the instruction fetch and operand fetch or store reference the same memory module.
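
To make the 1-in-4 figure concrete, here is a minimal sketch (not part of the original notes) that assumes the instruction fetch and the operand reference each pick one of the 4 interleaved modules independently and uniformly at random; real instruction fetches are largely sequential, so this is only a rough approximation:

	#include <stdio.h>
	#include <stdlib.h>

	int main( void )
	{
	    const int modules = 4;         /* 4-way interleaved main memory      */
	    const int trials  = 1000000;
	    int collisions = 0;

	    for (int i = 0; i < trials; i++)
	        /* a collision: both pipeline stages pick the same memory module */
	        if (rand() % modules == rand() % modules)
	            collisions++;

	    /* prints a value close to 0.25 */
	    printf( "collision rate = %f\n", (double)collisions / trials );
	    return 0;
	}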

The problem with crossbar switches is that they are expensive! If possible, it would be far more pleasant to use a simple bus arbitration scheme to access memory. Consider the following approach for our example pipelined processor:

	                   __________   |  
        Instruction Fetch |__________|--x--
	                   _|>_____|_   |
        Compute Address   |__________|  |
	                   _|>_____|_   |
        Operand Fetch     |__________|--x--
	                   _|>_____|_   |
        Operate           |__________|  |
	                   _|>_____|_   |
        Operand Store     |__________|--x--
                                        |
                                         -----o---
                                             _|_
                                            |RAM|
                                            |___|
Here, we have reduced the system to a single memory bus, with a single column of bus arbitration and bus couplers. Each connection from a pipeline stage still has almost the full complexity of a crossbar switch interconnection, but the cost now grows linearly with the number of pipeline stages that need access to memory.

Under what circumstances will this design perform at all?

If the instruction fetch and operand store stages make simultaneous attempts to access memory, we must block one of them. If we block the operand store stage, this stalls the entire pipeline! In contrast, if we block the instruction fetch stage, we introduce a bubble but keep the pipeline flowing. Therefore, it is easy to conclude that, whether we use a crossbar switch or a single memory bus, we must assign fixed memory access priorities to the pipeline stages, with the highest priority going to the final stage in the pipe and lower priorities going to each earlier stage.

Under what circumstances will this design perform well?

The answer is simple: Whenever two pipeline stages request access to memory during the same clock cycle, one of them must stall. Therefore, this simple design will only perform well if this is a rare event.

How can we reduce the frequency of memory accesses in a pipeline stage?

The answer depends on the stage. In the instruction fetch stage, for example, our preliminary pipelined designs all fetched one instruction word from memory on every pipeline cycle! There is one mechanism we have discussed that will reduce this demand on memory: if we fetch multiple instructions per memory access, replacing the instruction register with an instruction prefetch register complete with shift logic, the demand on memory is reduced.

For example, if we have a 64-bit wide memory data bus, but instructions are only 32 bits each, we need to execute an instruction fetch only on every other pipeline cycle. Or if the memory is 32 bits wide but we have many 16-bit instructions, as in the IBM System/360, again, we usually need only one fetch every other instruction cycle. With a machine like the Pentium, where there are one-byte forms of the most common instructions, we can fetch even less often, so long as the compiler uses these compact but archaic instruction forms whenever possible.
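
The following is a minimal sketch (not from the original notes) of such an instruction prefetch register for the 64-bit-bus, 32-bit-instruction case; fetch_word64() is a hypothetical stand-in for a memory cycle issued by the instruction fetch stage:

	#include <stdint.h>

	extern uint64_t fetch_word64( uint64_t addr );  /* hypothetical memory cycle */

	static uint64_t fetch_addr = 0;  /* address of the next 64-bit word    */
	static uint64_t prefetch;        /* holds two 32-bit instructions      */
	static int      remaining = 0;   /* instructions left in the register  */

	uint32_t next_instruction( void )
	{
	    if (remaining == 0) {        /* only here do we use the memory bus */
	        prefetch = fetch_word64( fetch_addr );
	        fetch_addr += 8;
	        remaining = 2;
	    }
	    uint32_t ir = (uint32_t)(prefetch >> 32);  /* take the high half   */
	    prefetch <<= 32;             /* the shift logic mentioned above    */
	    remaining--;
	    return ir;
	}

A real prefetch unit must also discard its contents whenever a branch is taken; that detail is omitted here.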

Clearly, the design of the instruction set determines the frequency of memory references for operand fetch or operand store. Some instruction sets come close to demanding that almost every instruction reference an operand in memory, while others encourage large numbers of register-to-register operations.

The DEC VAX family of processors allowed programmers to write programs using any of memory-to-memory, memory-to-register, register-to-memory or register-to-register operations. In many cases, use of memory-to-memory instructions reduced the instruction count in machine or assembly language programs, but the minimum run-time was achieved by using as many register-to-register instructions as possible. The reason is that use of memory-to-memory instructions maximizes the demand for CPU-to-memory bandwidth, and therefore, maximizes the number of conflicts over memory access.

More Justification for Load/Store RISC Architectures

The RISC instruction set design philosophy that has dominated the design of new instruction sets since 1980 stresses the idea that load and store instructions should not do computation (beyond indexed addressing), and that arithmetic instructions should not access operands in memory. Such an instruction set is called a load-store instruction set, and it leads naturally to a four-stage pipeline:

	        __________       |  
        IF     |__________|------x--
	        _|>_____|_       |
        OF/AC  |__________|      |
	        _|>_____|_       |
        OP/MA  |__________|------x--
	        _|>_____|_       |
        OS     |__________|      |
                                  -----o---
                                      _|_
                                     |RAM|
                                     |___|
Because the instruction set uses the load/store model, address computation is only needed for load and store instructions, while only the arithmetic instructions need an operand fetch. Therefore, we combine these in the same pipeline stage.

This, in turn, allows the memory access required by a load or store instruction to be done by the same pipeline stage that does the ALU operation for an arithmetic instruction.

The final stage, operand store, saves the result of arithmetic instructions and loads from memory into a register.

Note that all of the logic to the left of the bus arbitration logic in the above figure is essentially a Harvard architecture CPU, while the memory to the right of the bus arbitration logic operates as if it were connected to a Von Neumann CPU. As such, the bus arbitration logic can be described as converting between the Harvard architecture and the Von Neumann architecture.

Cache Memory

There is another way to reduce the frequency of conflicts between the different memory ports in the CPU. This is to use cache memory or associative memory to reduce the number of memory cycles required.

A cache is a place where something is saved or hidden away. Thus, we speak of finding "a rebel arms cache", meaning a secret place where the rebels have hidden their weapons.

In computer science generally, we use the term cache to refer to a relatively local storage area used to reduce the cost of access to remote data. Caches may be implemented in hardware or software at many different levels of abstraction. A web page cache is a good example of a software cache implemented at a very high level of abstraction. Here, an internet portal retains a cache of recently used web pages. When a client of that portal requests a web page, the page may be delivered immediately if it is in the portal's cache. Only if the page is not in the cache is it necessary to seek the page from the distant web-server where the original copy of the page is stored.

A cache hit occurs when the client finds the desired item in the cache, and therefore, gets fast service. The cache hit rate is the fraction of resource references that lead to cache hits. A cache miss occurs when the item is not found, so the client is required to take the time required to get the item from far away. The cache miss rate is the fraction of resource references that lead to cache misses.
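
As a minimal sketch (not part of the original notes), the usual way to quantify the benefit is the average access time seen by the client; the figures used below (90 percent hit rate, 10 ns cache, 100 ns remote access) are made up for illustration:

	/* average time per reference: hits cost only the cache lookup,
	   misses cost the lookup plus the trip to the remote copy */
	double effective_access_time( double hit_rate,
	                              double t_cache, double t_remote )
	{
	    return hit_rate * t_cache
	         + (1.0 - hit_rate) * (t_cache + t_remote);
	}

	/* effective_access_time( 0.9, 10.0, 100.0 ) is 20 ns,
	   five times better than always going to the remote copy */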

There are two basic ways to build a cache for objects that may be either read or written. A write-through cache does nothing to speed access to the remote resource when a write takes place. The client must wait for the completion of the entire write operation, and all the cache does is keep a copy of the item that was written in order to speed future read operations.

A write-back cache intercepts write operations, allowing the client to continue as soon as the write to cache is completed. The actual write to the remote resource will be completed by the cache, possibly at a much later time. Write-back caches offer the possibility of even greater reduction in remote traffic because, if the client writes a new value to the cache before the previous value has been written to its ultimate destination, the first value can be simply discarded.

Introducing a cache into a system will have a significant impact only if a significant number of resource references lead to cache hits. If the pattern of resource usage is random, it is very difficult for a cache of the most recent resources to do much good. On the other hand, if the system exhibits temporal locality, that is, if recently referenced items are highly likely to be referenced in the near future, the cache can have a significant impact.

Fortunately, real programs exhibit considerable temporal locality! If a program contains loops, recently referenced instructions are very likely to be referenced again, and many of the variables referenced during one iteration of a loop will be referenced during every iteration!

Cache Memory and Pipelined Processors

If we put hardware caches between a pipelined CPU and the memory arbitration system, we can eliminate a large number of memory cycles and many memory conflicts. Consider this design:

	    __________              |  
        IF |__________|------o------x--
	    _|>_____|_    ___|___   |
        AC |__________|  |I cache|  |
	    _|>_____|_   |_______|  |
        OF |__________|-------------x--
	    _|>_____|_              |
        OP |__________|             |
	    _|>_____|_              |
        OS |__________|-------------x--
                                    |
                                     -----o---
                                         _|_
                                        |RAM|
                                        |___|

Here, we introduced only one cache, an instruction cache, in the data path between the instruction fetch stage of the pipeline and the memory arbitration logic. If the cache hit rate for this cache is 90 percent, we will reduce the number of pipeline stalls caused by memory conflicts in the instruction fetch stage to 10 percent of the original value. This single cache can therefore have a huge effect.

We can add a second cache to the system as follows:

	    __________                  |  
        IF |__________|--------o--------x--
	    _|>_____|_      ___|___     |
        AC |__________|    |I cache|    |
	    _|>_____|_   | |_______|    |
        OF |__________|--x--            |
	    _|>_____|_   |              |
        OP |__________|  |              |
	    _|>_____|_   |              |
        OS |__________|--x--            |
                         |              |
                          -------o------x--
                              ___|___   |
                             |O cache|   -----o---
                             |_______|       _|_
                                            |RAM|
                                            |___|
Here, we introduced an operand cache. Why not three caches, with separate caches for the operand fetch and operand store stages? Because this introduces a serious problem: how do you ensure that the view of memory seen by the operand fetch stage matches the view seen by the operand store stage? If these stages use the same cache, then the result of an operand store will be seen immediately by the next operand fetch memory cycle. If each has a separate cache, there is a serious problem keeping these caches coherent -- this is called the cache coherency problem.

Note that the instruction cache is a read only cache, while the operand cache is a read-write cache! The instruction cache, being simpler, can be larger, while the operand cache, being somewhat more complex, may be smaller.

In a pure load-store architecture, such as most modern RISC systems, there is only one pipeline stage that makes memory references for operands, so there is no need for an arbitrator between the operand cache and the execute stages of the pipelined CPU.

Why is there no cache coherency problem between the instruction cache and the operand cache? In fact, there is a problem, but it is rarely evident! So long as we obey the rule that all programs are read-only, there is no problem. If we are executing code from ROM, for example, the instruction cache will never disagree with the contents of ROM.

If, on the other hand, we have self-modifying code, there is a problem! Consider the kind of code that was once common on machines that had no hardware support for indexing. A program needing to do indexed addressing would compute the displacement and then add this to a memory reference instruction that addressed the first word of the data structure. Once the desired instruction was created, it was stored in the next memory location and then executed.

Classical self-modifying code like this is unneeded on modern machines, but it is common to use self-modifying code on a much larger scale. We write programs that invoke a loader to bring some routine into memory, then we call that routine, and then we load something else and call it. This is typical of dynamically loaded library routines, and it is at the heart of our activities when we run a sequence of programs under a control language interpreter or command shell.

Machines that have an instruction cache typically provide at least one special instruction to deal with this problem: An instruction to invalidate all of the entries in the instruction cache. There are several simple variations on this, for example, instructions to simply turn the cache on and off, or to turn on and off the cache's ability to save new information. If a machine has a data cache, similar cache control options may apply to it.
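
For example (a sketch, not from the original notes, assuming the machine provides an invalidate_I_cache instruction), a dynamic loader would be expected to bracket its work as follows:

	load routine into memory;
	invalidate_I_cache;
	call routine;

Without the invalidation, stale instructions left in the cache from whatever previously occupied those memory locations could be executed in place of the newly loaded code.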

Implementing Associative Memory

Hardware cache implementations rest on various implementations of what is called associative memory or content addressable memory. There are several ways of implementing this, each appropriate for use in different contexts. A pure software model of an associative memory is a useful way to begin, but before giving this, it is useful to give a pure software model of a conventional memory:

	int classic_ram( int addr, int data, int op )
	{
	    /* size, read and write are assumed to be constants defined elsewhere */
	    static int memory[ size ];

	    if (op == read)
	        return memory[addr];
	    else /* op == write */
	        memory[addr] = data;
	    return data;
	}

Using similar terms, a write-through cache might be modeled as follows:

	int write_through_cached_ram( int addr, int data, int op )
	{
	    static struct line {
	        int tag;
	        int data;
	    } cache [ cache_size ];     /* the cache must persist between calls */

	    int row;

	    /* search the cache */
	    for (row = 0; row < cache_size; row++)
	        if (cache[row].tag == addr) break;
	    if (row == cache_size) row = random_row();

	    /* perform the access */
	    if (op == read) {
	        if (cache[row].tag != addr) {   /* miss: fetch and remember it */
	            cache[row].data = classic_ram( addr, data, op );
	            cache[row].tag = addr;
	        }
	        return cache[row].data;
	    } else { /* op == write */
	        classic_ram( addr, data, op );  /* the write goes through to RAM */
	        cache[row].data = data;
	        cache[row].tag = addr;
	        return data;
	    }
	}

This software cache uses a linear search of the tags on each cache line to find an item in the cache, and it replaces lines at random if there is a cache miss. This gives the correct semantics, but this is not how we implement caches in hardware!

Note that each cache line consists of a tag and the associated data. This is the standard terminology used in discussion of cache implementation.

Externally, we can describe an associative memory as follows:

          address in    data in
	  _|_|_|_|______|_|_|_|_
	 |           |          |
	 |           |          |
	 |           |          |
	 |    tag    |   data   |
	 |   array   |   array  |
	 |           |          |
	-|> write    |          |
	 |___________|__ _ _ _ _|
               |        | | | |
             miss       data out

The cache has a write strobe, address in, data in, and data out, just like a conventional RAM, but it has one new external connection, the miss line (or alternately, a hit line) that indicates that no cache line currently contains a value corresponding to the address.

In the most expensive implementation of an associative memory, the data array is exactly like the data array of a RAM. That is, there is a word-line for each word of data. When an address is input that matches one of the tags, this word line is asserted, allowing the contents of that word to be presented to the data out lines, and allowing a write pulse to load data in to that word.

In this case, the tag array serves instead of an address decoder. Each entry in the tag array is a register holding one address. When the address input matches a tag, the word line is asserted. If no word-line is asserted, we declare a miss. The basic mechanism for one row is described here:

         addr          data in            
          |               |                          |
          o------         o--|----------------       |
          |      |        |  |     ___        |      |
          |  ____|____    |  o----|   |   ____|____  |
          | |>__tag___|   |  |    |AND|--|>__data__| |
          |      |  ___   |  |  --|___|       |      |
          |       -|   |  |  | |             _|_     |
          |        | = |--|--|-o------------\   /    |
          o--------|___|  |  |               \ /     |
          |               |  |                |      |
          |               |  |                 ------o
          |               |  |                       |
                             |                       |
                           write                  data out

One important detail has been omitted here: How does the tag get written? One way to handle this is to add a conventional address decoding array and a second set of address lines to allow direct selection of a cache line. If there is a miss and a write operation is requested, the second set of address lines determines which cache line is written. If there is a hit, the second set of address lines is ignored:

              line
             select   address in    data in
	    ___|_|___ _|_|_|_|______|_|_|_|_
	   |         |           |          |
	   |   row   |           |          |
	   |  decode |           |          |
	   | (only if|    tag    |   data   |
	   |   miss) |   array   |   array  |
	   |         |           |          |
	   |         |           |          |
	  -|> write  |           |          |
	   |_________|___________|__ _ _ _ _|
                           |        | | | |
                         miss       data out

We can then hide the row decode mechanism as follows:

                                    address  data
                                      in      in
                                -----  |       |
                              _|_    | |       |
                             |+1 |   | |       |
                     ___     |___|   / |       |
        write --o---|   |   ___|___  | |       |
	        |   |AND|--|>_row__| | |       |
                |  -|___|      |     | /       /
                | |            o-----  |       |
                | |         ___|_______|_______|___
                | |        |       |       |       |
                | |        |  row  |  tag  |  data |
                | |        | decode| array | array |
                 -|--------|>      |       |       |
                  |        |_______|_______|_______|
                  |                    |       |
                   --------------------o       /
                                       |       |
                                     miss    data
                                              out

In the above, write operations when there is a hit rely on the tag array to select the row of the data array to be written. Write operations when there is a miss rely on the row register to select the row of the tag and data arrays to be written. As given here, the row select register only increments once per write when there is a miss. This leads to FIFO replacement for items in the associative memory, not the best replacement policy, but generally no worse than random.
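
A minimal software sketch (not from the original notes) of this FIFO policy, reusing the cache_size constant of the software model given earlier:

	static int next_row = 0;            /* the row select register         */

	int select_row( int hit, int hit_row )
	{
	    if (hit)                        /* a hit writes the matching line  */
	        return hit_row;
	    int victim = next_row;          /* a miss writes the oldest line   */
	    next_row = (next_row + 1) % cache_size;   /* ... then increments   */
	    return victim;
	}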

By the early 1970's, a 4-word by 4-bit MSI content addressable memory chip was available from Fairchild Semiconductor, with access times under 100 nanoseconds (in an era when core memory typically had access times on the order of a microsecond). This was designed for constructing tag arrays that cleanly fit this model. Compatible 4-word by 4-bit chips were available to implement sections of the data array; in concert, these allowed the construction of associative memories with capacities on the order of a few hundred words, quite adequate for cache memories and memory management units.

Using RAM to implement Associative Memory

The content addressable memory design suggested above is expensive! Expensive implementations such as this have been used since the 1960's, but caches would not be widely used and memory management units would be significantly less common if there were no inexpensive way to build a content addressable memory.

The trick to lowering the cost is to use conventional fast static RAM technology to build the content addressable memory. Small RAM chips (capacity around 512 bytes) with 50 nanosecond access time were available by 1972, and as the scale of integration has increased, the available fast RAM has grown even smaller and faster. This can be used to build a cache as follows:

             address      data in
                |            |
              __|__          |    
               | |           | 
               o-|---------  | 
               | |        _|_|_
               | |      ____|____
               | |     |   data  |
               |  -----|addr     |
               |       |         |
               |       |   RAM   |
               |       |         |
        write -|-------|>        |
               |       |_________|
               |          __|__
               |           | | 
               |  ---------  |
               |_|           |
              | = |          |
              |___|          |
        ____    |            |
        miss = hit        data out

Consider, for example, using a 1K RAM to implement the associative memory for a machine with a 32 bit address bus. Inside the associative memory, the least significant 10 bits of the address are used to select RAM entries. This system will perform as well as a fully associative memory with one exception. If two items have addresses that are the same in the least significant 10 bits, the associative memory can store only one of them!

This is called a set-associative memory because there is a set of tag values that is associated with each entry in the associative memory.
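
A minimal software sketch (not part of the original notes) of the address split implied by this design, assuming the 1K-entry RAM and 32-bit addresses of the example above; the valid flag is an addition needed to mark lines that have never been loaded:

	#include <stdint.h>

	#define INDEX_BITS 10                       /* 1K RAM entries            */
	#define LINES      (1 << INDEX_BITS)

	struct line { uint32_t tag; uint32_t data; int valid; };
	static struct line cache[ LINES ];

	int lookup( uint32_t addr, uint32_t *data )
	{
	    uint32_t index = addr & (LINES - 1);    /* low 10 bits pick the line */
	    uint32_t tag   = addr >> INDEX_BITS;    /* the other 22 bits must match */

	    if (cache[index].valid && cache[index].tag == tag) {
	        *data = cache[index].data;          /* hit */
	        return 1;
	    }
	    return 0;                               /* miss */
	}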

Multiway Set Associative Memory

The problem with a simple set associative memory such as is illustrated above is that it only has one data word available to associate with each set of tag values. If we have 1K of associative memory and a number of the tag values we're interested in have the same bit pattern in their least-significant bits, our associative memory will be unable to help. The solution is to store more than one tag value for each set. This is called a multiway set associative memory.

For example, in a 2-way set associative memory, we store two different tags for each row of the cache, as illustrated below:

             address      data in
                |            |
              __|__          |    
               | |           | 
               | o-----------|-----
               o-|---------  |     |
               | |        _|_|_    |
               | |          |      |
               | |          o------|-----
               | |      ____|____  |  ____|____
               | |     |   data  | | |   data  |
               |  -----|addr     |  -|addr     |
               |       |         |   |         |
               |       |   RAM   |   |   RAM   |
               |    clk|    A    |clk|    B    |
               |     a-|>        | b-|>        |
               |       |_________|   |_________|
               |          __|__         __|__
               o-------    | |           | |
               |  -----|---  |           | |
               | |     |  ---|-----------  |
               |_|     |_|   |             |
              | = |   | = |  |             |
              |_a_|   |_b_|   ----     ----
                |       |        _|___|_
                 --   --o--------\0   1/
                   |_|            \___/
                  |OR |             |
                  |___|             |
            ____    |               |
            miss = hit              |

In the above, no mechanism has been given for deciding which half of the associative memory should be written when it is time to update or store a new association. If the desired tag is found in RAM A, of course, new associations for that tag should be stored in RAM A. If the desired tag is found in RAM B, the new association should be stored in RAM B. This is the simple case.

If, for some reason, the desired tag is found in both RAM A and RAM B, the logic given above always uses the copy in RAM B, so that is the copy that should be updated. In fact, if we build a decent cache controller, this situation should never arise.

The final question is, if the desired tag is not found in either bank of memory, where should new associations be stored? It is perfectly acceptable to use one or the other bank at random! Any choice will lead to a correctly functioning associative memory, but a careful choice will lead to a higher likelihood that the desired association is available when it is needed.

The following logic will alternate banks when there is a cache miss, while replacing the data in the bank that matched the tag in the event of a hit:

               |_|     |_|
              | = |   | = | logic from
              |_a_|   |_b_| previous figure
                |       |
                 ----   |      ___ 
                     |  o----O|   |        ___     
                     o--|-----|AND|-------|   |
        write --o----|--|--o--|___|       | OR|-- clk a
                |    |  |  |   ___    ----|___|
                |    |  o--|--|   |  |     ___
                |    |  |  |  |AND|--|----|   |
                |    |  |   --|___|  |    | OR|-- clk b
             ---     |  |            |  --|___|
            |        |  |            | |
            |        |  |      ___   | |
            |  ___   |  o----O|   |  | |
            | |   |  o--|----O|AND|--  |
            | |  Q|--|--|-----|___|    |
             -|>T_|  |  |      ___     |
              |  Q|--|--|-----|   |    |
              |___|  |   ----O|AND|----
                      -------O|___|

The essentially random replacement scheme used above is not a very wise one. To improve the performance of this associative memory, we will have to add one bit per line of the memory to remember which bank had the most recent hit for that line, and then use this bit to determine which bank to replace. Our goal is to always use the bank holding the least recently used match for that particular tag.
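
A minimal sketch (not from the original notes) of that least-recently-used policy, continuing the set-associative example above: one extra bit per line records which bank had the most recent hit, and a miss replaces the other bank.

	#include <stdint.h>

	#define INDEX_BITS 10
	#define LINES      (1 << INDEX_BITS)

	struct way { uint32_t tag; uint32_t data; int valid; };
	static struct way bank_a[ LINES ], bank_b[ LINES ];
	static int last_hit_b[ LINES ];     /* 1 if bank B had the last hit    */

	void store_assoc( uint32_t addr, uint32_t data )
	{
	    uint32_t index = addr & (LINES - 1);
	    uint32_t tag   = addr >> INDEX_BITS;

	    if (bank_a[index].valid && bank_a[index].tag == tag) {
	        bank_a[index].data = data;  last_hit_b[index] = 0;
	    } else if (bank_b[index].valid && bank_b[index].tag == tag) {
	        bank_b[index].data = data;  last_hit_b[index] = 1;
	    } else {                        /* miss: replace the LRU bank      */
	        struct way *victim = last_hit_b[index] ? bank_a : bank_b;
	        victim[index].tag   = tag;
	        victim[index].data  = data;
	        victim[index].valid = 1;
	        last_hit_b[index]   = (victim == bank_b);
	    }
	}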

Extension of the 2-way set-associative memory given here to 4 or 8 ways is a straightforward exercise left to the reader.

Building Read Only Cache from Associative Memory

To build a read-only cache, such as might be used to improve the performance of the instruction fetch stage of a pipelined processor, we start with an associative memory and add bus connections:

	address-----------------o---/-------------------------
	data   -----------------|---/---o--------o------------
	read   --o--------------|-------|--------|------------
	ack    --|-o------------|-------|--------|---------o--
                 | |            /       /        |         |
                 | |         ___|_______|___    / \       / \
                 | |        | addr  | data  |  /___\--o--/___\
                 | |   ___  |       |       |    |    |    1
                 |  --|   | |  associative  |    |    |
                 o----|AND|-|>   memory     |    |    |
                 |  -O|___| |       |       |    /    |
                 | |        |__hit__|_______|    |    |
                 | |            |       |        |    |
                 | |            |  ___   --------     |
                 |  ------------o-|   |               |
                 |                |AND|---------------
                  ----------------|___|

This associative memory will force an acknowledge of a read request the moment it has a hit, independently of any action by other memory attached to the bus. As a result, the CPU will continue as if the memory cycle had completed, without regard to collisions in the bus arbitrator. When there is a cache miss, the acknowledge line will not be asserted until the main memory responds with valid data, and until then, we expect the processor (or processor stage) to stall awaiting the data.

When the cache misses, it sends no signals to the bus, but when the memory acknowledges the bus request, indicating that valid data is available, the associative memory is clocked, causing it to remember the new address-data pairing.
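
In software terms, here is a minimal sketch (not from the original notes) of the behavior just described; lookup() and store_assoc() stand for the search and update operations of the associative memory sketched earlier, and bus_read() is a hypothetical stand-in for a full bus cycle to main memory:

	#include <stdint.h>

	extern int      lookup( uint32_t addr, uint32_t *data );
	extern void     store_assoc( uint32_t addr, uint32_t data );
	extern uint32_t bus_read( uint32_t addr );   /* hypothetical bus cycle */

	uint32_t cached_read( uint32_t addr )
	{
	    uint32_t data;

	    if (lookup( addr, &data ))   /* hit: acknowledge at once, no bus cycle */
	        return data;

	    data = bus_read( addr );     /* miss: stall until main memory answers  */
	    store_assoc( addr, data );   /* remember the new address-data pairing  */
	    return data;
	}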

Write Through Cache made from Associative Memory

A write-through cache speeds read operations but does nothing to speed up write operations. The term write through refers to the fact that write operations pass through the cache to main memory without any modification or interruption by the cache hardware. To build a write-through cache, such as might be used to improve performance for operand fetch and store, we only need to add a few elements to the basic read-only cache outlined above.

	address----------------------o--/--------------------
	data   ----------------------|--/--o------o----------
	read -o----------------------|-----|------|----------
	write-|----------o-----------|-----|------|----------
	ack  -|-o--------|-----------|-----|------|-------o--
              | |        |           /     /      |       |
              | |        |        ___|_____|__   / \     / \
              | |        |       | addr |data | /___\-o-/___\
              | |        |  ___  |      |     |   |   |   1
              | |   ___   -|   | | associative|   |   |
              |  --|   |   | OR|-|>  memory   |   |   |
              o----|AND|---|___| |      |     |   /   |
              |  -O|___|         |__hit_|_____|   |   |
              | |                   |      |      |   |
              | |                   |  ___  ------    |
              |  -------------------o-|   |           |
              |                       |AND|-----------
               -----------------------|___|

The only change that has been made here is the addition of logic to store a new association in the cache each time there is a write cycle. In the original, new data was stored only when there was a miss and the memory returned an acknowledge indicating that the read from main memory was complete.

Both the read-only cache and the write-through cache designs shown here illustrate that, with careful bus design, cache memory can be entirely optional! Simply pulling the cache from the socket that connects it to the bus should only slow the system down without changing any essential behavior!

Of course, the above assertion is based on the assumption that all of the devices on the bus have memory semantics! If some of the devices are not RAM, for example, if there are I/O devices on the bus, it is essential that we provide a cache disable function, and that the cache be disabled during all data transfers to and from any I/O device register. This is a job that can be left to software, for example, by requiring that all accesses to device registers be coded using instruction sequences such as:

	disable_D_cache;
	load device status;
	enable_D_cache;
This is only a minor inconvenience if only new code is involved, but if we wish to maintain compatibility with old code that was not written to deal with the cache, we must add hardware to detect addresses in the range reserved for I/O devices so that the cache will be automatically disabled for all such devices.

Write Back Cache

A write-back cache is a cache that can hold a value to be written back to main memory and finish the write later while allowing the CPU to continue as if the write had been completed immediately.

The performance advantage of moving to a write-back policy from a write-through policy is limited! In typical programs, it is fair to estimate that about 2/3 of the operand memory cycles are reads, and only 1/3 are writes. Furthermore, much of the code that violates this rule of thumb will involve such things as memory to memory block moves, and when a block of memory is moved from here to there, the presence of a write-back cache is rarely significant.

A write-back cache involves a significant control unit, bus arbitration and more! The cache must retain, for each cached location, a bit of state information to indicate whether the value matches the value in memory or not. There is a match if the value was originally read from memory or if the value has been written back, and there is no match between the time the CPU writes a value and the time the cache finishes the write-back.

The write-back cache operates at full speed, with one cache cycle per CPU cycle, until an attempt is made to replace a cached location that needs to be written back. At that point, the cache controller stalls the CPU and initiates a write cycle to main memory to write the cached value back. Only after the memory write is finished does the CPU finish its write of a new value to the cache.
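
In the style of the write-through software model given earlier, here is a minimal write-back sketch (not from the original notes); the only additions are a dirty bit per line and the write-back performed when a dirty line is chosen as the victim:

	int write_back_cached_ram( int addr, int data, int op )
	{
	    static struct line {
	        int tag;
	        int data;
	        int dirty;
	    } cache [ cache_size ];

	    int row;

	    /* search the cache */
	    for (row = 0; row < cache_size; row++)
	        if (cache[row].tag == addr) break;

	    if (row == cache_size) {            /* miss: pick a victim line        */
	        row = random_row();
	        if (cache[row].dirty)           /* finish the postponed write      */
	            classic_ram( cache[row].tag, cache[row].data, write );
	        cache[row].tag   = addr;
	        cache[row].data  = classic_ram( addr, 0, read );
	        cache[row].dirty = 0;
	    }

	    if (op == read)
	        return cache[row].data;
	    else {                              /* op == write: no memory cycle now */
	        cache[row].data  = data;
	        cache[row].dirty = 1;
	        return data;
	    }
	}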

How can this complexity pay off? It pays off only if there is a high likelihood that memory locations that have previously been written will be written again before that cache line is needed to store another memory location.

Registers as Cache

In a machine with general purpose registers, it is important to note that, from a compiler writer's perspective, these serve many of the same purposes as a cache. The more registers there are, the more frequently the compiler will maintain variables entirely in registers, and the fewer memory cycles will be used for operand load and store. As a result, machines with larger and more flexible sets of general purpose registers make lighter demands on their data caches than do machines with few general purpose registers.

The equivalence between cache and registers is not complete! On most machines, registers are only good for scalar variables, not complex objects such as structures or arrays. This is because there are typically no mechanisms for using indexing for register selection. There have been a few machines built where this is not true, and some of these have been widely used, but the exceptions do not represent the mainstream of architectural development today.