22C:122/55:132 Notes, Lecture 35, Spring 2001

Douglas W. Jones
University of Iowa Department of Computer Science

  1. Pipelines and Multiport Memory

    Pipelined machines invariably require at least two memory ports for the CPU, one for the instruction fetch pipeline stage, and one for operand fetch or result store. The MODCOMP IV computer, a 16-bit superminicomputer that came to market in 1974, did this with a crossbar switch:

    	                   __________   |   |   |   |
            Instruction Fetch |__________|--x---x---x---x--
    	                   _|>_____|_   |   |   |   |
            Execute           |__________|--x---x---x---x--
                                            |   |   |   |
    	                   __________   |   |   |   |
            DMA Controller    |__________|--x---x---x---x--
    	                   __________   |   |   |   |
            DMA Controller    |__________|--x---x---x---x--
                                            |   |   |   |
    	                               |M| |M| |M| |M|
                                           |_| |_| |_| |_|
    This machine had a microprogrammed execute stage that could execute many instructions in a single cycle each, though many other instructions took multiple cycles. The 4-way interleaved main memory reduced the chance of a collision between the two pipeline stages to 1 in 4 -- a collision occurs when the instruction fetch and the operand fetch or store reference the same memory module.
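    To see where the 1-in-4 figure comes from, here is a small simulation (a Python sketch, not part of the original hardware; the uniform-random access pattern is an assumption):

```python
import random

def collision_rate(n_modules, n_trials=100_000, seed=42):
    """Estimate how often two independent memory references -- one
    instruction fetch and one operand fetch or store -- land in the
    same interleaved module, forcing one reference to wait."""
    rng = random.Random(seed)
    collisions = 0
    for _ in range(n_trials):
        fetch = rng.randrange(n_modules)    # module number = address mod n_modules
        operand = rng.randrange(n_modules)
        if fetch == operand:
            collisions += 1
    return collisions / n_trials
```

    With 4 modules the estimate comes out near 0.25; doubling the interleaving to 8 modules roughly halves it.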

    The problem with crossbar switches is that they are expensive! If possible, it would be far more pleasant to use a simple bus arbitration scheme to access memory. Consider the following approach for our example pipelined processor:

    	                   __________   |  
            Instruction Fetch |__________|--x--
    	                   _|>_____|_   |
            Compute Address   |__________|  |
    	                   _|>_____|_   |
            Operand Fetch     |__________|--x--
    	                   _|>_____|_   |
            Operate           |__________|  |
    	                   _|>_____|_   |
            Operand Store     |__________|--x--
    Here, we have reduced the system to a single memory bus, with a single column of bus arbitration and bus couplers. Each connection from a pipeline stage to the bus still has almost the full complexity of a crossbar switch crosspoint, but the cost now grows linearly with the number of pipeline stages that need access to memory.

    Under what circumstances will this design perform correctly?

    If the instruction fetch and operand store stages make simultaneous attempts to access memory, we must block one of them. If we block the operand store stage, we stall the entire pipeline! In contrast, if we block the instruction fetch stage, we introduce a bubble but keep the pipeline flowing. Therefore, it is easy to conclude that, whether we use a crossbar switch or a single memory bus, we must assign fixed memory-access priorities to the pipeline stages, with the highest priority going to the final stage in the pipe and lower priorities going to each earlier stage.
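    The fixed-priority rule can be sketched as a tiny arbiter (an illustrative Python model; the stage names are those of the example pipeline above):

```python
def arbitrate(requests):
    """Fixed-priority bus arbiter: among the stages requesting memory
    this cycle, grant the one deepest in the pipe, so that blocking
    never holds up an already-computed result.
    `requests` maps stage name -> True/False; returns the winner."""
    priority = ["Operand Store", "Operand Fetch", "Instruction Fetch"]
    for stage in priority:          # scan from highest priority down
        if requests.get(stage):
            return stage
    return None                     # no stage wants the bus this cycle
```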

    Under what circumstances will this design perform well?

    The answer is simple: Whenever two pipeline stages request access to memory during the same clock cycle, one of them must stall. Therefore, this simple design will only perform well if this is a rare event.

    How can we reduce the frequency of memory accesses in a pipeline stage?

    The answer depends on the stage. In the instruction fetch stage, for example, our preliminary pipelined designs all fetched one instruction word from memory on each and every memory cycle! There is one mechanism we have discussed that will reduce this demand on memory: if each memory word holds multiple instructions, and we replace the instruction register with an instruction prefetch register, complete with shift logic, then the demand on memory will be reduced.

    For example, if we have a 64-bit wide memory data bus, but instructions are only 32 bits each, we only need to execute an instruction fetch on every other pipeline cycle.
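    The prefetch register with its shift logic can be modeled as follows (a Python sketch of the idea, not any particular machine's design; the 64-bit word holding two 32-bit instructions matches the example above):

```python
class PrefetchRegister:
    """Holds one wide memory word and shifts out one instruction at a
    time, so a single 64-bit fetch supplies two 32-bit instructions."""
    def __init__(self):
        self.buffer = []               # instructions awaiting issue

    def need_fetch(self):
        return not self.buffer         # fetch only when empty

    def load_word(self, word64):
        # split the 64-bit word into its high and low 32-bit instructions
        self.buffer = [(word64 >> 32) & 0xFFFFFFFF, word64 & 0xFFFFFFFF]

    def next_instruction(self):
        return self.buffer.pop(0)      # shift out the next instruction
```

    Driving this for four pipeline cycles issues four instructions while making only two memory fetches.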

    The design of the instruction set determines the frequency of memory references for operand fetch or operand store. Some instruction sets come close to demanding that almost every instruction reference an operand in memory, while others encourage large numbers of register-to-register operations.

    The DEC VAX family of processors allowed programmers to write programs using any of memory-to-memory, memory-to-register, register-to-memory or register-to-register operations. In many cases, use of memory-to-memory instructions reduced the instruction count in machine or assembly language programs, but the minimum run-time was achieved by using as many register-to-register instructions as possible. The reason is that use of memory-to-memory instructions maximizes the demand for CPU-to-memory bandwidth, and therefore, maximizes the number of conflicts over memory access.
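    The bandwidth argument can be made concrete with a toy accounting of bus cycles (illustrative numbers only: one cycle per instruction fetch plus one per memory operand; real VAX timing was more complicated):

```python
def count_memory_cycles(program):
    """Each instruction costs one fetch cycle plus one cycle per
    operand that lives in memory rather than in a register."""
    return sum(1 + mem_operands for op, mem_operands in program)

# d = (a+b)*(a-b) in memory-to-memory style: every operand in memory
mem_to_mem = [("ADD", 3), ("SUB", 3), ("MUL", 3)]

# the same computation in load/store style: a and b fetched only once
reg_to_reg = [("LOAD", 1), ("LOAD", 1), ("ADD", 0),
              ("SUB", 0), ("MUL", 0), ("STORE", 1)]
```

    Here the register version uses twice as many instructions yet fewer memory cycles (9 versus 12), because each variable crosses the bus only once.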

  2. More Justification for Load/Store RISC Architectures

    The RISC instruction set design philosophy that has dominated the design of new instruction sets since 1980 stresses the idea that load and store instructions should not do computation (beyond indexed addressing), and that arithmetic instructions should not access operands in memory. Such an instruction set is called a load-store instruction set, and it leads naturally to a four-stage pipeline:

    	        __________       |  
            IF     |__________|------x--
    	        _|>_____|_       |
            OF/AC  |__________|      |
    	        _|>_____|_       |
            OP/MA  |__________|------x--
    	        _|>_____|_       |
            OS     |__________|      |
    Because the instruction set uses the load/store model, address computation is only needed for load and store instructions, while only the arithmetic instructions need an operand fetch. Therefore, we combine these in the same pipeline stage.

    This, in turn, allows the memory access required by a load or store instruction to be done by the same pipeline stage that does the ALU operation for an arithmetic instruction.

    The final stage, operand store, writes the result of an arithmetic instruction, or the value loaded from memory, into a register.

    Note that all of the logic to the left of the bus arbitration logic in the above figure is essentially a Harvard architecture CPU, while the memory to the right of the bus arbitration logic operates as if it were connected to a Von Neumann CPU. As such, the bus arbitration logic can be described as converting between the Harvard architecture and the Von Neumann architecture.

  3. Cache Memory

    There is another way to reduce the frequency of conflicts between the different memory ports in the CPU. This is to use cache memory or associative memory to reduce the number of memory cycles required.

    A cache is a place where something is saved or hidden away. Thus, we speak of finding "a rebel arms cache", meaning a secret place where the rebels have hidden their weapons.

    In computer science generally, we use the term cache to refer to a relatively local storage area used to reduce the cost of access to remote data. Caches may be implemented in hardware or software at many different levels of abstraction. A web page cache is a good example of a software cache implemented at a very high level of abstraction. Here, an internet portal retains a cache of recently used web pages. When a client of that portal requests a web page, the page may be delivered immediately if it is in the portal's cache. Only if the page is not in the cache is it necessary to seek the page from the distant web server where the original copy of the page is stored.

    A cache hit occurs when the client finds the desired item in the cache, and therefore, gets fast service. The cache hit rate is the fraction of resource references that lead to cache hits. A cache miss occurs when the item is not found, so the client is required to take the time required to get the item from far away. The cache miss rate is the fraction of resource references that lead to cache misses.
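    These definitions are easy to state in code. Here is a small model that replays a reference trace through a least-recently-used cache and reports the hit rate (a generic sketch; the LRU replacement policy is our choice, not something the definitions require):

```python
from collections import OrderedDict

def lru_hit_rate(references, capacity):
    """Hit rate of an LRU cache of the given capacity over a trace;
    the miss rate is simply 1 minus this value."""
    cache = OrderedDict()
    hits = 0
    for ref in references:
        if ref in cache:
            hits += 1
            cache.move_to_end(ref)        # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False) # evict least recently used
            cache[ref] = True
    return hits / len(references)
```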

    There are two basic ways to build a cache for objects that may be either read or written. A write-through cache does nothing to speed access to the remote resource when a write takes place. The client must wait for the completion of the entire write operation, and all the cache does is keep a copy of the item that was written in order to speed future read operations.

    A write-back cache intercepts write operations, allowing the client to continue as soon as the write to cache is completed. The actual write to the remote resource will be completed by the cache, possibly at a much later time. Write-back caches offer the possibility of even greater reduction in remote traffic because, if the client writes a new value to the cache before the previous value has been written to its ultimate destination, the first value can be simply discarded.
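    The write-coalescing behavior of a write-back cache can be seen in a few lines (a schematic Python model; a dict stands in for the remote resource):

```python
class WriteBackCache:
    """Writes complete into the cache immediately; the remote copy is
    updated only when the cache is flushed, so back-to-back writes to
    the same item cost just one remote write."""
    def __init__(self, backing):
        self.backing = backing      # dict standing in for the remote resource
        self.data = {}
        self.dirty = set()          # items whose remote copy is stale
        self.remote_writes = 0

    def write(self, key, value):
        self.data[key] = value      # client continues immediately
        self.dirty.add(key)

    def flush(self):
        for key in self.dirty:      # only the latest value travels
            self.backing[key] = self.data[key]
            self.remote_writes += 1
        self.dirty.clear()
```

    Writing the same item twice and then flushing produces a single remote write carrying only the second value; the first is simply discarded.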

    Introducing a cache into a system will have a significant impact only if a significant number of resource references lead to cache hits. If the pattern of resource usage is random, it is very difficult for a cache of the most recent resources to do much good. On the other hand, if the system exhibits temporal locality, that is, if recently referenced items are highly likely to be referenced in the near future, the cache can have a significant impact.

    Fortunately, real programs exhibit considerable temporal locality! If a program contains loops, recently referenced instructions are very likely to be referenced again, and many of the variables referenced during one iteration of a loop will be referenced during every iteration!
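    The effect of temporal locality is easy to demonstrate: run a loop-like trace and a random trace of the same length through the same small cache (a Python sketch; FIFO replacement is used just to keep the model short):

```python
import random

def hit_rate(trace, capacity):
    """Hit rate of a small fully associative cache, FIFO replacement."""
    cache, hits = [], 0
    for ref in trace:
        if ref in cache:
            hits += 1
        else:
            if len(cache) >= capacity:
                cache.pop(0)        # evict the oldest entry
            cache.append(ref)
    return hits / len(trace)

# a loop touching the same 8 addresses on every iteration...
loop_trace = [addr for _ in range(100) for addr in range(8)]

# ...versus the same number of references scattered at random
rng = random.Random(1)
random_trace = [rng.randrange(1000) for _ in range(800)]
```

    With a 16-entry cache, the loop trace hits well over 95 percent of the time (only the first pass misses), while the random trace hits only a few percent of the time.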

  4. Cache Memory and Pipelined Processors

    If we put hardware caches between a pipelined CPU and the memory arbitration system, we can eliminate a large number of memory cycles and many memory conflicts. Consider this design:

    	    __________              |  
            IF |__________|------o------x--
    	    _|>_____|_    ___|___   |
            AC |__________|  |I cache|  |
    	    _|>_____|_   |_______|  |
            OF |__________|-------------x--
    	    _|>_____|_              |
            OP |__________|             |
    	    _|>_____|_              |
            OS |__________|-------------x--
    Here, we introduced only one cache, an instruction cache, in the data path between the instruction fetch stage of the pipeline and the memory arbitration logic. If the hit rate for this cache is 90 percent, we will reduce the number of pipeline stalls caused by memory conflicts in the instruction fetch stage to 10 percent of the original value. This single cache can therefore have a huge effect.
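    The arithmetic behind this claim is simply that only cache misses reach the shared bus (a worked example; the base conflict rate used below is a stand-in value, not a measurement):

```python
def fetch_stall_rate(base_conflict_rate, icache_hit_rate):
    """Fraction of cycles the instruction fetch stage stalls once an
    instruction cache filters its traffic: only misses go to memory,
    so conflicts scale with the miss rate."""
    return base_conflict_rate * (1.0 - icache_hit_rate)
```

    For example, a stage that conflicted on 1 cycle in 4 without a cache conflicts on only 1 cycle in 40 behind a 90-percent-hit-rate cache.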

    We can add a second cache to the system as follows:

    	    __________                  |  
            IF |__________|--------o--------x--
    	    _|>_____|_      ___|___     |
            AC |__________|    |I cache|    |
    	    _|>_____|_   | |_______|    |
            OF |__________|--+--            |
    	    _|>_____|_   |              |
            OP |__________|  |              |
    	    _|>_____|_   |              |
            OS |__________|--+--            |
                             |              |
                          ___|___           |
                         |O cache|----------x--
                         |_______|
    Here, we introduced an operand cache. Why not 3 caches, one for the operand fetch stage and one for the operand store stage? Because this introduces a serious problem: How do you ensure that the view of memory seen by the operand fetch stage matches the view seen by the operand store stage! If these stages use the same cache, then the result of an operand store will be seen immediately by the next operand fetch memory cycle. If each has a separate cache, there is a serious problem keeping these caches coherent -- this is called the cache coherency problem.
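    The benefit of a single shared operand cache can be shown directly (a schematic Python model; the write-through policy and the dict standing in for main memory are assumptions made to keep the sketch short):

```python
class OperandCache:
    """One read-write cache shared by the operand fetch and operand
    store stages: a store updates the single copy, so the very next
    operand fetch of that address sees the new value, with no
    coherency machinery needed."""
    def __init__(self, memory):
        self.memory = memory        # dict standing in for main memory
        self.lines = {}

    def fetch(self, addr):          # used by the operand fetch stage
        if addr not in self.lines:
            self.lines[addr] = self.memory[addr]   # miss: fill from memory
        return self.lines[addr]

    def store(self, addr, value):   # used by the operand store stage
        self.lines[addr] = value
        self.memory[addr] = value   # write-through keeps memory current
```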

    Note that the instruction cache is a read-only cache, while the operand cache is a read-write cache! The instruction cache, being simpler, can be larger, while the operand cache, being somewhat more complex, may be smaller. Furthermore, the CPU makes an instruction fetch on essentially every pipeline cycle, while only some instructions reference operands in memory, so the instruction cache is the one with the most traffic to filter.

    In a pure load-store architecture, such as most modern RISC systems, there is only one pipeline stage that makes memory references for operands, so there is no need for an arbitrator between the operand cache and the execute stages of the pipelined CPU.

    Why is there no cache coherency problem between the instruction cache and the operand cache? In fact, there is a problem, but it is rarely evident! So long as we obey the rule that all programs are read-only, there is no problem. If we are executing code from ROM, for example, the instruction cache will never disagree with the contents of ROM.

    If, on the other hand, we have self-modifying code, there is a problem! Consider the kind of code that was once common on machines that had no hardware support for indexing. A program needing to do indexed addressing would compute the displacement and then add this to a memory reference instruction that addressed the first word of the data structure. Once the desired instruction was created, it was stored in the next memory location and then executed.

    Classical self-modifying code like this is unneeded on modern machines, but it is common to use self-modifying code on a much larger scale. We write programs that invoke a loader to bring some routine into memory, then we call that routine, and then we load something else and call it. This is typical of dynamically loaded library routines, and it is at the heart of our activities when we run a sequence of programs under a control language interpreter or command shell.

    Machines that have an instruction cache typically provide at least one special instruction to deal with this problem: an instruction to invalidate all of the entries in the instruction cache. There are several simple variations on this, for example, instructions to simply turn the cache on and off, or to turn on and off the cache's ability to save new information. If a machine has a data cache, similar cache control options may apply to it.
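    The need for such an invalidate instruction can be seen in a small model (a Python sketch; the method name invalidate_all is ours, standing in for whatever special instruction a given machine provides):

```python
class InstructionCache:
    """Read-only instruction cache. Because it never watches stores,
    it goes stale when code in memory is overwritten; invalidate_all
    models the special cache-flush instruction described above."""
    def __init__(self, memory):
        self.memory = memory        # dict standing in for main memory
        self.lines = {}

    def fetch(self, addr):
        if addr not in self.lines:
            self.lines[addr] = self.memory[addr]   # miss: fill from memory
        return self.lines[addr]

    def invalidate_all(self):
        self.lines.clear()          # forget everything; refetch from memory
```

    After a loader overwrites a routine, a fetch still returns the stale cached instruction; only after invalidate_all does the cache refetch the new code.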