22C:122/55:132 Notes, Lecture 36, Spring 2001

Douglas W. Jones
University of Iowa Department of Computer Science

  1. Associative or Content Addressable Memory

    Hardware cache implementations rest on various implementations of what is called associative memory or content addressable memory. There are several ways of implementing this, each appropriate for use in different contexts. A pure software model of an associative memory is a useful way to begin, but before giving this, it is useful to give a pure software model of a conventional memory:

    	enum { read, write };        /* symbolic operation codes */
    	#define size 1024            /* words of memory in this model */

    	int classic_ram( int addr, int data, int op )
    	{
    	    static int memory[ size ];

    	    if (op == read) {
    		return memory[addr];
    	    } else {                 /* op == write */
    		memory[addr] = data;
    		return data;
    	    }
    	}
    	
    Using similar terms, a write-through cache might be modeled as follows:
    	#define cache_size 64        /* lines in the cache */

    	int write_through_cached_ram( int addr, int data, int op )
    	{
    	    static struct line {
    		int tag;
    		int data;
    	    } cache [ cache_size ];

    	    int row;

    	    /* search the cache */
    	    for (row = 0; row < cache_size; row++)
    		if (cache[row].tag == addr) break;
    	    if (row == cache_size) row = random_row();

    	    /* perform the operation */
    	    if (op == read) {
    		if (cache[row].tag != addr) {   /* miss: fetch and cache */
    		    cache[row].data = classic_ram( addr, data, op );
    		    cache[row].tag = addr;
    		}
    		return cache[row].data;
    	    } else {                            /* op == write */
    		classic_ram( addr, data, op );
    		cache[row].data = data;
    		cache[row].tag = addr;
    		return data;
    	    }
    	}
    	
    This software cache uses a linear search of the tags on each cache line to find an item in the cache, and it replaces lines at random if there is a cache miss. This gives the correct semantics, but this is not how we implement caches in hardware!

    Note that each cache line consists of a tag and the associated data. This is the standard terminology used in discussion of cache implementation.

    Externally, we can describe an associative memory as follows:

              address in    data in
    	  _|_|_|_|______|_|_|_|_
    	 |           |          |
    	 |           |          |
    	 |           |          |
    	 |    tag    |   data   |
    	 |   array   |   array  |
    	 |           |          |
    	-|> write    |          |
    	 |___________|__ _ _ _ _|
                   |        | | | |
                 miss       data out
    	
    It has a write strobe, address in, data in, and data out, just like a conventional RAM, but it has one new external connection, the miss line (or alternatively, a hit line) that indicates that no cache line currently contains a value corresponding to the address.

    In the most expensive implementation of an associative memory, the data array is exactly like the data array of a RAM. That is, there is a word-line for each word of data. When an address is input that matches one of the tags, this word line is asserted, allowing the contents of that word to be presented to the data out lines, and allowing a write pulse to load data into that word.

    In this case, the tag array serves instead of an address decoder. Each entry in the tag array is a register holding one address. When the address input matches a tag, the word line is asserted. If no word-line is asserted, we declare a miss. The basic mechanism for one row is described here:

             addr          data in            
              |               |                          |
              o------         o--|----------------       |
              |      |        |  |     ___        |      |
              |  ____|____    |  o----|   |   ____|____  |
              | |>__tag___|   |  |    |AND|--|>__data__| |
              |      |  ___   |  |  --|___|       |      |
              |       -|   |  |  | |             _|_     |
              |        | = |--|--|-o------------\   /    |
              o--------|___|  |  |               \ /     |
              |               |  |                |      |
              |               |  |                 ------o
              |               |  |                       |
                                 |                       |
                               write                  data out
    	
    One important detail has been omitted here: How does the tag get written? One way to handle this is to add a conventional address decoding array and a second set of address lines to allow direct selection of a cache line. If there is a miss and a write operation is requested, the second set of address lines determines which cache line is written. If there is a hit, the second set of address lines is ignored:
                  line
                 select   address in    data in
    	    ___|_|___ _|_|_|_|______|_|_|_|_
    	   |         |           |          |
    	   |   row   |           |          |
    	   |  decode |           |          |
    	   | (only if|    tag    |   data   |
    	   |   miss) |   array   |   array  |
    	   |         |           |          |
    	   |         |           |          |
    	  -|> write  |           |          |
    	   |_________|___________|__ _ _ _ _|
                               |        | | | |
                             miss       data out
    	
    We can then hide the row decode mechanism as follows:
                                        address  data
                                          in      in
                                    -----  |       |
                                  _|_    | |       |
                                 |+1 |   | |       |
                         ___     |___|   / |       |
            write --o---|   |   ___|___  | |       |
    	        |   |AND|--|>_row__| | |       |
                    |  -|___|      |     | /       /
                    | |            o-----  |       |
                    | |         ___|_______|_______|___
                    | |        |       |       |       |
                    | |        |  row  |  tag  |  data |
                    | |        | decode| array | array |
                     -|--------|>      |       |       |
                      |        |_______|_______|_______|
                      |                    |       |
                       --------------------o       /
                                           |       |
                                         miss    data
                                                  out
    	
    In the above, write operations when there is a hit rely on the tag array to select the row of the data array to be written. Write operations when there is a miss rely on the row register to select the row of the tag and data arrays to be written. As given here, the row select register increments only once per write when there is a miss. This leads to FIFO replacement of items in the associative memory; FIFO is not the best replacement policy, but it is generally no worse than random.
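    In software terms, the behavior of the row register can be sketched as follows. This is a minimal model of the figure, not the hardware itself; the names assoc_write and fifo_row are invented for illustration, and a 4-line memory stands in for a realistic size:

```c
#include <assert.h>

#define CACHE_SIZE 4

/* tag and data arrays of the figure; -1 marks an unused line */
static int tags[CACHE_SIZE] = { -1, -1, -1, -1 };
static int words[CACHE_SIZE];
static int fifo_row = 0;     /* the row register of the figure */

/* write an association; on a hit, the tag array selects the row, while
   on a miss, the row register selects the victim and then increments,
   giving FIFO replacement */
void assoc_write( int addr, int data )
{
    int row;
    for (row = 0; row < CACHE_SIZE; row++)
        if (tags[row] == addr) break;          /* hit */
    if (row == CACHE_SIZE) {                   /* miss */
        row = fifo_row;
        fifo_row = (fifo_row + 1) % CACHE_SIZE;
    }
    tags[row] = addr;
    words[row] = data;
}
```

    Note that a hit updates the matching line in place and leaves the row register alone, so the register counts misses, not writes.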

  2. Using RAM to implement Associative Memory

    This is an expensive content addressable memory! Expensive implementations such as this are actually used, but this is rare! It is far more common to use compromise implementations based on conventional fast static RAM technology!

                 address      data in
                    |            |
                  __|__          |    
                   | |           | 
                   o-|---------  | 
                   | |        _|_|_
                   | |      ____|____
                   | |     |   data  |
                   |  -----|addr     |
                   |       |         |
                   |       |   RAM   |
                   |       |         |
            write -|-------|>        |
                   |       |_________|
                   |          __|__
                   |           | | 
                   |  ---------  |
                   |_|           |
                  | = |          |
                  |___|          |
            ____    |            |
            miss = hit        data out
    	
    Consider, for example, using a 1K RAM to implement the associative memory for a machine with a 32 bit address bus. Inside the associative memory, the least significant 10 bits of the address are used to select RAM entries. This system will perform as well as a fully associative memory with one exception. If two items have addresses that are the same in the least significant 10 bits, the associative memory can store only one of them!

    This is called a set-associative memory because there is a set of tag values that is associated with each entry in the associative memory.
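    The RAM-based scheme above can be modeled in software as follows. This is a sketch only: the names assoc_read and assoc_write are invented, a 4-entry RAM stands in for the 1K example, and the tag RAM stores the whole address rather than just the high-order bits:

```c
#include <assert.h>

#define RAM_SIZE 4              /* stands in for the 1K RAM */

static int tag_ram[RAM_SIZE];   /* tag RAM: holds the full address here */
static int data_ram[RAM_SIZE];
static int miss;                /* the miss output of the figure */

/* read: the low address bits index the RAM, and the comparator on the
   stored tag decides hit or miss; data out is valid only on a hit */
int assoc_read( int addr )
{
    int row = addr % RAM_SIZE;         /* least significant address bits */
    miss = (tag_ram[row] != addr);     /* the equality comparator */
    return data_ram[row];
}

/* write: unconditionally claims the selected row for this address */
void assoc_write( int addr, int data )
{
    int row = addr % RAM_SIZE;
    tag_ram[row] = addr;
    data_ram[row] = data;
}
```

    As in the hardware, data out is always driven; the miss line is what tells the user whether it means anything.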

  3. Multiway Set Associative Memory

    The problem with a simple set associative memory such as is illustrated above is that it only has one data word available to associate with each set of tag values. If we have 1K of associative memory and a number of the tag values we're interested in have the same bit pattern in their least-significant bits, our associative memory will be unable to help. The solution is to store more than one tag value for each set. This is called a multiway set associative memory.

    For example, in a 2-way set associative memory, we store two different tags for each row of the cache, as illustrated below:

                 address      data in
                    |            |
                  __|__          |    
                   | |           | 
                   | o-----------|-----
                   o-|---------  |     |
                   | |        _|_|_    |
                   | |          |      |
                   | |          o------|-----
                   | |      ____|____  |  ____|____
                   | |     |   data  | | |   data  |
                   |  -----|addr     |  -|addr     |
                   |       |         |   |         |
                   |       |   RAM   |   |   RAM   |
                   |    clk|    A    |clk|    B    |
                   |     a-|>        | b-|>        |
                   |       |_________|   |_________|
                   |          __|__         __|__
                   o-------    | |           | |
                   |  -----|---  |           | |
                   | |     |  ---|-----------  |
                   |_|     |_|   |             |
                  | = |   | = |  |             |
                  |_a_|   |_b_|   ----     ----
                    |       |        _|___|_
                     --   --o--------\0   1/
                       |_|            \___/
                      |OR |             |
                      |___|             |
                ____    |               |
                miss = hit              |
    	
    In the above, no mechanism has been given for deciding which half of the associative memory should be written when it is time to update or store a new association. If the desired tag is found in RAM A, of course, new associations for that tag should be stored in RAM A. If the desired tag is found in RAM B, the new association should be stored in RAM B. This is the simple case.

    If, for some reason, the desired tag is found in both RAM A and RAM B, the logic given above always reads the copy in RAM B, so that is the copy that should be updated.

    The final question is, if the desired tag is not found in either bank of memory, where should new associations be stored? It is perfectly acceptable to use one or the other bank at random! Any choice will lead to a correctly functioning associative memory, but a careful choice will lead to a higher likelihood that the desired association is available when it is needed.

    The following logic will select a random bank when a new tag is encountered:

                   |_|     |_|
                  | = |   | = | logic from
                  |_a_|   |_b_| previous figure
                    |       |
                     ----   |      ___ 
                         |  o----O|   |        ___     
                         o--|-----|AND|-------|   |
            write --o----|--|--o--|___|       | OR|-- clk a
                    |    |  |  |   ___    ----|___|
                    |    |  o--|--|   |  |     ___
                    |    |  |  |  |AND|--|----|   |
                    |    |  |   --|___|  |    | OR|-- clk b
                 ---     |  |            |  --|___|
                |        |  |            | |
                |        |  |      ___   | |
                |  ___   |  o----O|   |  | |
                | |   |  o--|----O|AND|--  |
                | |  Q|--|--|-----|___|    |
                 -|>T_|  |  |      ___     |
                  |  Q|--|--|-----|   |    |
                  |___|  |   ----O|AND|----
                          -------O|___|
    	
    The toggle flip-flop in the above circuit causes the preferred bank for saving a new tag to alternate on each write cycle.

    To improve the performance of this associative memory, we will have to add one bit per line of the memory to remember which bank had the most recent hit for that line, and then use this bit to determine which bank to replace. Our goal is to always replace the entry in the memory bank that was least recently used for that tag.

    Extension of the 2-way set-associative memory given here to 4 or 8 ways is a straightforward exercise left to the reader.
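    As a software sketch, the one-bit-per-line LRU policy just described might look like this. The names lookup and store are invented, a 4-set memory stands in for a realistic size, and addresses are assumed nonzero since the tag arrays start out zeroed:

```c
#include <assert.h>

#define SETS 4

static int tag[2][SETS];       /* banks A and B of the figure */
static int val[2][SETS];
static int lru[SETS];          /* which bank had the most recent hit */
static int hit;                /* set by lookup() */

/* probe both banks; on a hit, return the data and remember the bank */
int lookup( int addr )
{
    int set = addr % SETS;
    int bank;
    hit = 0;
    for (bank = 0; bank < 2; bank++)
        if (tag[bank][set] == addr) {
            hit = 1;
            lru[set] = bank;           /* record the most recent hit */
            return val[bank][set];
        }
    return 0;
}

/* store an association; on a miss, replace the bank that was not used
   most recently -- the one-bit LRU policy described above */
void store( int addr, int data )
{
    int set = addr % SETS;
    int bank;
    lookup( addr );
    bank = hit ? lru[set] : 1 - lru[set];
    tag[bank][set] = addr;
    val[bank][set] = data;
    lru[set] = bank;
}
```

    With only two ways, one bit per set is a complete LRU record; with 4 or 8 ways, a true LRU record takes more bits, which is one reason the extension is left as an exercise.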

  4. Read Only Cache made from Associative Memory

    To build a read-only cache, such as might be used to improve the performance of the operand fetch stage of a pipelined processor, we start with an associative memory and add bus connections:

    	address-----------------o---/-------------------------
    	data   -----------------|---/---o--------o------------
    	read   --o--------------|-------|--------|------------
    	ack    --|-o------------|-------|--------|---------o--
                     | |            /       /        |         |
                     | |         ___|_______|___    / \       / \
                     | |        | addr  | data  |  /___\--o--/___\
                     | |   ___  |       |       |    |    |    1
                     |  --|   | |  associative  |    |    |
                     o----|AND|-|>   memory     |    |    |
                     |  -O|___| |       |       |    /    |
                     | |        |__hit__|_______|    |    |
                     | |            |       |        |    |
                     | |            |  ___   --------     |
                     |  ------------o-|   |               |
                     |                |AND|---------------
                      ----------------|___|
    	
    This associative memory will force an acknowledge of a read request the moment it has a hit, independently of any action by other memory attached to the bus. As a result, the CPU will continue as if the memory cycle had completed, without regard to collisions in the bus arbitrator.

    When the cache misses, it sends no signals to the bus, but when the memory acknowledges the bus request, indicating that valid data is available, the associative memory is clocked, causing it to remember the new address-data pairing.
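    The bus behavior just described can be sketched in software as follows. This is a model of the protocol, not of the circuit: cached_read is an invented name, a 4-line direct-mapped memory stands in for the associative memory, and the mem_reads counter stands in for main-memory bus cycles:

```c
#include <assert.h>

#define LINES 4

static int tags[LINES] = { -1, -1, -1, -1 };   /* -1 marks an unused line */
static int words[LINES];
static int memory[64];        /* main memory on the bus */
static int mem_reads = 0;     /* counts main-memory read cycles */

/* a hit acknowledges immediately, without any main-memory cycle; a miss
   waits for memory's acknowledge, which clocks the new pairing into
   the associative memory */
int cached_read( int addr )
{
    int row = addr % LINES;
    if (tags[row] == addr)    /* hit: the cache drives ack itself */
        return words[row];
    mem_reads++;              /* miss: main memory completes the cycle */
    tags[row] = addr;         /* memory's ack clocks the cache */
    words[row] = memory[addr];
    return memory[addr];
}
```

    Pulling the cache out of this model reduces it to reading memory[addr] every time, which is exactly the "optional cache" property claimed in the next section.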

  5. Write Through Cache made from Associative Memory

    To build a write-through cache, such as might be used to improve performance for operand fetch and store, we only need to add a few elements to the basic read-only cache outlined above.

    	address----------------------o--/--------------------
    	data   ----------------------|--/--o------o----------
    	read -o----------------------|-----|------|----------
    	write-|----------o-----------|-----|------|----------
    	ack  -|-o--------|-----------|-----|------|-------o--
                  | |        |           /     /      |       |
                  | |        |        ___|_____|__   / \     / \
                  | |        |       | addr |data | /___\-o-/___\
                  | |        |  ___  |      |     |   |   |   1
                  | |   ___   -|   | | associative|   |   |
                  |  --|   |   | OR|-|>  memory   |   |   |
                  o----|AND|---|___| |      |     |   /   |
                  |  -O|___|         |__hit_|_____|   |   |
                  | |                   |      |      |   |
                  | |                   |  ___  ------    |
                  |  -------------------o-|   |           |
                  |                       |AND|-----------
                   -----------------------|___|
    	
    The only change that has been made here is the addition of logic to store a new association in the cache each time there is a write cycle. In the original, new data was stored only when there was a miss and the memory returned an acknowledge indicating that the read to main memory was complete.

    Both the read-only cache and the write-through cache designs shown here illustrate that, with careful bus design, cache memory can be entirely optional! Simply pulling the cache from the socket that connects it to the bus should only slow the system down without changing any essential behavior!

    Of course, the above assertion is based on the assumption that all of the devices on the bus have memory semantics! If some of the devices are not RAM; for example, if there are I/O devices on the bus, it is essential that we provide a cache disable function, and that the cache be disabled during all data transfers to and from any I/O address register. This is a job that can be left to software, for example, by requiring that all accesses to device registers be coded using instruction sequences such as:

    	disable_D_cache;
    	load device status;
    	enable_D_cache;
    	
    This is only a minor inconvenience if only new code is involved, but if we wish to maintain compatibility with old code that was not written to deal with the cache, we must add hardware to detect addresses in the range reserved for I/O devices so that the cache will be automatically disabled for all such devices.

  6. Write Back Cache

    A write-back cache is a cache that can hold a value to be written back to main memory and finish the write later while allowing the CPU to continue as if the write had been completed immediately.

    The performance advantage of moving to a write-back policy from a write-through policy is limited! In typical programs, it is fair to estimate that about 2/3 of the operand memory cycles are reads, and only 1/3 are writes. Furthermore, much of the code that violates this rule of thumb will involve such things as memory to memory block moves, and when a block of memory is moved from here to there, the presence of a write-back cache is rarely significant.

    A write-back cache involves a significant control unit, bus arbitration and more! The cache must retain, for each cached location, a bit of state information to indicate whether the value matches the value in memory or not. There is a match if the value was originally read from memory or if the value has been written back, and there is no match between the time the CPU writes a value and the time the cache finishes the write-back.

    The write-back cache operates at full speed, with one cache cycle per CPU cycle, until an attempt is made to replace a cached location that needs to be written back. At that point, the cache controller stalls the CPU and initiates a write cycle to main memory to write the cached value back. Only after the memory write is finished does the CPU finish its write of a new value to the cache.

    How can this complexity pay off? It pays off if there is a high likelihood that memory locations that have previously been written will be written again before that cache line is needed to store another memory location.
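    The write path of such a cache can be sketched in software as follows. This is a model under stated assumptions, not the lecture's circuit: wb_write is an invented name, a 4-line direct-mapped cache stands in for a realistic one, and the mem_writes counter stands in for main-memory write cycles:

```c
#include <assert.h>

#define LINES 4

static int tags[LINES] = { -1, -1, -1, -1 };   /* -1 marks an unused line */
static int words[LINES];
static int dirty[LINES];      /* the per-line state bit described above:
                                 set when the cache holds a newer value
                                 than main memory */
static int memory[64];        /* main memory on the bus */
static int mem_writes = 0;    /* counts main-memory write cycles */

/* CPU write: completes in the cache; main memory is touched only when
   a dirty line must be evicted (the stall described above) */
void wb_write( int addr, int data )
{
    int row = addr % LINES;
    if (tags[row] != addr && dirty[row]) {
        memory[tags[row]] = words[row];   /* the stall: write back first */
        mem_writes++;
    }
    tags[row] = addr;
    words[row] = data;
    dirty[row] = 1;
}
```

    Repeated writes to the same location cost a single eventual write-back, which is exactly the payoff described above; a write-through cache would have paid one bus cycle per write.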