22C:122/55:132 Notes, Lecture 37, Spring 2001

Douglas W. Jones
University of Iowa Department of Computer Science

  1. Caches and Multiprocessors

    Consider a multiprocessor system such as the following:

             _____ 
            |     |            |
    	| CPU |-----o------x--
            |_____|   __|__    |
             _____   |cache|   |
            |     |  |_____|   |
    	| CPU |-----o------x--
            |_____|   __|__    |
                     |cache|    ----o--------o--  ...
                     |_____|      __|__    __|__
     	                     |     |  |     |
     	                     | RAM |  | I/O |
                                 |_____|  |_____|
    	
    This system is not too different from the basic model for a pipelined CPU:
                   
             _____             |
    	|_____|-----o------x--
             |>__|    __|__    |
            |_____|  |cache|   |
             |>__|   |__I__|   |
    	|_____|-----o------x--
             |>__|    __|__    |
            |_____|  |cache|    ----o--------o--  ...
                     |__O__|      __|__    __|__
     	                     |     |  |     |
     	                     | RAM |  | I/O |
                                 |_____|  |_____|
    	
    But, the multiprocessor system suffers from a severe cache coherency problem where the pipelined CPU operates without any great problems. With the pipelined CPU, the I cache is read only; except for the rare occasions where self-modifying code is used, for example, when a new program is loaded in memory, there is no coherency problem, and we can solve that problem trivially by invalidating the I cache just before jumping to the newly loaded code.

    With the multiprocessor, there is no problem so long as the programs on each processor never interact, but in that case, there is no reason to build a multiprocessor. We could demand that the program on each CPU invalidate its cache whenever it is about to check a shared variable for changes by the other processor, but this imposes far too high a penalty on many parallel algorithms.

    The penalty can be reduced if there is a special instruction to invalidate the cached copy of just one memory location. In this case, each read of a variable that another processor may have changed must be preceeded by an explicit instruction to invalidate that variable. This is feasible, but it requires great care in programming, and it precludes the use of a high level language unless the language being used allows variables to be declared as shared so the compiler can automatically do the necessary invalidation.

    Because of this complexity, early shared memory multiprocessors used crossbar switches to access shared memory, and if they had cache memory, the cache was simply disabled for all accesses to shared memory. Local memory was cached, and therefore it was fast, while the shared memory was slow.

  2. Snooping Caches

    In the early 1980's, a new idea was developed, the snooping cache. This was first brought to market by two competing companies, Sequent and Encore; the Encore Multimax product line and the Sequent Symmetry product line were each significant commercial successes by the mid 1980's, with typical configurations based 16 commercial 32-bit microprocessors sharing a single memory bus.

    Encore was founded by C. Gordon Bell, after a series of experiments with multiprocessors done at Carnegie Mellon University. The C.mmp computer had 16 PDP-11 processors connected to 16 memory modules by a crossbar switch. (The name means Computer -consisting of- multiple mini-processors.)

    The Cm* system consisted of 50 LSI-11 microprocessors, each with local memory, interconnected using a hierarchy of switches that allowed access to remote memory at speeds only 1/3 to 1/9 the speed of local memory access, depending on how many levels of of the switching hierarchy were involved. (the name Cm means Computer module; the asterisk is the Kleene closure from regular expression notation, meaning indefinite repetition.) Cm* was the first "NUMA" machine -- that is, the first machine where all memory was addressable by all processors, but where the memory access times depended on which processor was accessing what memory region. (NUMA means NonUniform Memory Access.)

    The fundamental idea behind a snooping cache is that the cache constantly monitors the bus. (The verb to snoop is colloquial English meaning to spy or to monitor.) The idea is, whenever the snooping cache detects a write operation on the bus for a particular memory location, it invalidates any cached copy it has of the contents of that location. As a result, so long as a memory location is read-only, it may duplicated between many caches, as different processors read it, but when a processor writes that location, all copies will be destroyed except the one copy in the local cache of the processor that performed the write operation, and, of course, the copy in main memory.
                                 .
                                 .
                                  
             _____               |
            |     |              |
    	| CPU |------o-------x--
            |_____|  ____|____   |
                    |         |  |
                    | snooping|--o <-- snooping
                    |  cache  |  |     connection
                    |_________|  |
                                 .
                                 .
    	
    The two bus connections of a snooping cache are quite different! The connection on the side closest to the CPU is the standard connection that provides the cache function. The connection to the shared bus to memory is a passive, read-only connection that is used only for the purpose of snooping.

    Most of the logic of the snooping cache is identical to a conventional write through cache. Why write through? Because the entire idea of a snooping cache depends on the fact that every write cycle from any processor will be seen by all caches! Without this, snooping would not work.

    The new logic in a snooping cache is all centered on one bit per cache entry indicating whether this entry is valid or not. The memory holding these valid bits must be a 2-port memory, with one port used by the cache and the other used for invalidation. valid bit per cache entry. The logic for this is quite easy if the cache uses the simples set associative structure:

                       |                           write  addr
               --------|-------      1       0         |  |
              |   _____|____   |   __|_______|__       |  |
              |  |  data in |  |  |  A  in   B  | --|  |  | M
              |  |          |  |  |             |   |--|--o E
            --o--|addr      |   --|A   addr    B|---|  |  | M
                 |    main  |     |             |      |  | O
                 |   cache  |     |   2 port    |      |  | R
                 |    RAM   |     |     RAM     |      |  | Y
            --o--|>         |   --|>ckA     ckB<|------o  |  
              |  |   data   |  |  |             |      |  | B
              |  |____out___|  |  |__A__out__B__|      |  | U
              |        |       |     |                 |  | S
               --------|-------      |                 |  |
                       |           valid               |  |
    	
    The logic connecting the main cache RAM to the CPU bus is not shown above. This remains as it was. Whenever a write cycle occurs on the memory bus, the valid bit for the corresponding cache line is cleared. Whenever a value is stored in the main cache RAM, the corresponding valid bit is set. Whenever an entry in the main cache RAM is read, the valid bit is also read, and a cache miss is declared if the entry is invalid, whether or not the tags match.

    The above logic is an oversimplification, because it ignores one important possibility. Some of the write cycles on the memory bus will have been initiated by this CPU; in this case, the cache entries must not be invalidated! To fix this, we add a small amount of extra logic:

                           CPU BUS
                  write -----------------o--
                                         |
                                         | write addr
                  1       0              |    |  |
            |   __|_______|__            |    |  |
            |  |  A  in   B  |   ---|    |    |  | M
            |  |             |      |----|----|--o E
             --|A   addr    B|------|    |    |  | M
               |             |           |    |  | O
               |   2 port    |     ___   |    |  | R
               |     RAM     |    |   |O-     |  | Y
             --|>ckA     ckB<|----|AND|       |  |  
            |  |             |    |___|-------o  | B
            |  |__A__out__B__|                |  | U
            |     |                           |  | S
                  |                           |  |
                valid                         |  |
    	
    The 2-port RAM suggested above is a specialized device, but it is not difficult to figure out how to build it. Consider the following design:
             Addr        clear                             Addr
    	  A           |                        |         B
    	 |  ___       o--------------          |     ___  |
    	 | |   |- | | |    ___   ____|_   ___  |   -|   | |
              -| D |- o-|-|---|   | |    R'| |   |-o   -| D |-
               | E |- | | |   |AND|-|S    R|-|AND| |   -| E |
               | C |--|-|-|-o-|___| |_Q____| |___|-|----| C |
               | O |- | | | |       __|__          |   -| O |
               | D |- | | |  -------\   /          |   -| D |
                 E    | | |          \ /           |      E
               |___|- | | |           |            |   -|___|
    	          | o-|-----------             |
    	          | | |                        |
    	          | |                          |
    	clock A --  valid                       --- clock B
    	
    Aside from the two decoder arrays, each cell in the memory is just an RS flipflop, a tri-state driver and two and gates. The RS flipflop is only slightly specialized in order to support a global clear input for initializing the cache and for invalidating it when the CPU executes a cache invalidate instruction. Implementing this flipflop is extremely simple:
                                                R'
                            ___          ___    |
    	          S ---|   |        |   |--- 
    	               |NOR|--    --|NOR|---- R
    	              -|___|   \/   |___|-
                         |         /\         |
                          ----o---    -------- 
                              |                 
                              Q                 
    	
    Is there any danger of putting this flipflop into a metastable state? This can only occur if the CPU accesses the same address as is being accessed on the bus. If the CPU is trying to write an address while the bus reports a write to the same address, the invalidate is suppressed by the little bit of special logic outlined above. If the CPU is trying to read an, the write to the cache (and the validation of this cache entry) will only occur when the CPU gets access to the memory bus for a read cycle. In sum, there will never be simultaneous set and reset cycles involving this flipflop.

  3. Higher Performance

    The snooping cache architecture described here rests on write-through caches and it requires that the entire data path from processor to memory be claimed for the duration of each memory cycle that results from a cache miss.

    Higher performance systems can reduce the burden of memory traffic by several means. One is to introduce packet switched buss structures. When this is done, each bus arbitrator becomes a packet switch, holding a message from CPU to memory until the bus is idle, and then forwarding it. If the packet holds a write message, it includes both the data and the address to be written. If it is a read message, it holds just the address, and when the memory receives this, it will send a reply packet back to the CPU.

    Packet switching busses require that the controller on each memory module include bus arbitration logic just like that for each CPU, raising the price. The advantage is that, on a read cycle. From the point of view of any particular CPU, the benefit seems minimal, but from the point of view of the system, because the CPU releases the memory bus as soon as the write request is transmitted to memory, before the write actually takes place there is an advantage, and because the read request is broken into two transactions, there is an advantage. Interleaved memory allows one module to process a read while another receives data to write while another is waiting to transmit what it has read.

    The Encore Multimax architecture used this scheme, with a single snooping cache serving each pair of microprocessors. The resulting architecture was as folows:

             _____ 
            |     |  |
    	| CPU |--x--
            |_____|  |
             _____   |
            |     |  |      
    	| CPU |--x--          |
            |_____|  |            |
    	          ------o-----x--
             _____        __|__   |
            |     |  |   |cache|--o
            | CPU |--x-- |_____|  |
            |_____|  |            |
             _____   |            |
            |     |  |            |
    	| CPU |--x--          |
            |_____|  |            |
    	          ------o-----x--
                          __|__   |
                         |cache|--o
                         |_____|  |
             _____                |
            |     |               |
    	| RAM |---------------x--
            |_____|               |
             _____                |
            |     |               |
    	| RAM |---------------x--
            |_____|               |
    	
    This machine supported up to 16 CPUs on 8 circuit boards that plugged into the backplane, plus 4 memory modules (with 4-way interleaving) and several I/O processor boards, each with one non-cached CPU and a bus connector (SCSI or Ethernet). The CPUs were microprogrammed 32-bit CISC processors typical of the early to mid 1980's, so allowing a pair of CPUs to share a single cache was reasonable -- many clock cycles on such a CPU do not require memory cycles!

    Modern high-end machines packaged with on-chip cache generally provide snooping hardware as part of the "CPU" chip. Thus, typical modern "CPU" chip frequently pack logic such as the following:

             ______          |             ______          |
            |______|----o----x--          |______|----o----x--
    	 |>___|   __|__  |             |>___|   __|__  |
            |______| |cache| |            |______| |cache| |
    	 |>___|  |__I__| |             |>___|  |__I__| |
            |______|---------x--          |______|----o----x--  
    	 |>___|          |        |    |>___|   __|__  |    |
            |______|          --o-----x-- |______| |cache|  ----x--
    	                  __|__   |            |__O__|------o
                             |cache|--o                         |
                             |__O__|  |
    	
    Such "CPU chips" can be connected in parallel to a single memory bus in order to build multiprocessor systems. With the relatively small size of the on-chip cache, these usually deliver reasonable performance in dual processor configurations, and some perform well in 4-processor configurations, but few perform well in large configurations. The problem is, modern VLSI CPU designs allow internal clock rates that are far higher than the rate at which off-chip traffic to such resources as main memory can be accomplished, so most of the performance advantage provided by the cache is used to achieve a good balance between CPU clock rate and main memory cycle time.

    Both the left and right diagram show the O-cache as a snooping cache that snoops the external memory bus, but the two diagrams show the O-cache connected differently. Because the operand fetch/store pipeline stage has the highest priority for access to the memory bus, the O-cache will never prevent the fetch/store stage from blocking, so these two arrangements provide very similar performance.

    With the arrangement shown on the right, an O-cache hit can allow an instruction fetch to go through when it might otherwise be blocked by an operand fetch. With the arrangement on the left, the O-cache will naturally contain copies of all instructions that were recently fetched, but a small amount of extra logic can suppress the storage of instruction words in this cache.

             ______          |
            |______|----o----x--
    	 |>___|   __|__  |
            |______| |cache| |
    	 |>___|  |__I__| |
            |______|----o----x--
    	 |>___|   __|__  |          |
            |______| |cache|  --o-------x--
                     |__O__|----|----o--o
                              __|__  |  |
                             |cache|-   |
                             |_____|    |
    
                        L1      L2
    	
    Here, we have two levels of cache. The L1 caches (I-cache and O-cache) are typically part of the same chip as the CPU, while the L2 cache is typically on a separate chip. Sometimes, the L2 cache is mounted on the same ceramic substrate or daughter board as the CPU chip, and sometimes, it is a separate component. An L3 cache may also be present. When such a cache hierarchy is used, the caches closest to the CPU are typically the smallest and fastest, while those closer to memory are typically larger and slower.