22C:18, Lecture 38, Fall 1996

Douglas W. Jones
University of Iowa Department of Computer Science

Pedagogical Lies

As this course has progressed, we have glossed over a number of issues and ignored many typical features of modern computer architectures. Modern machines rarely execute one instruction after another, following the classical fetch-execute cycle; instead, consecutive instructions are frequently executed in parallel. Modern machines are rarely built with a direct access path between the CPU and main memory; instead, the CPU accesses main memory through intermediate memory levels known as caches. Finally, modern machines rarely access I/O devices directly attached to the main bus; instead, there are intermediate input/output busses between the main bus and the I/O device interfaces.

Parallel Execution

Modern CPU's do not execute instructions following the simple fetch-execute cycle outlined as:

	repeat
	    fetch instruction
	    execute instruction
	    check for interrupts
	forever
Instead, we build CPU's that attempt to execute multiple instructions in parallel. The basic function of such a CPU can be outlined as follows:
	repeat
	    in parallel
		fetch next instruction
		execute current instruction
		store results of previous instruction
	    end of parallel stuff
	    check for interrupts
 	forever
The above scheme has the CPU performing parts of the instruction execution cycle for three different instructions at each iteration of the loop, and it requires 3 iterations of the loop to execute each instruction. This is called pipelined execution, and it is worth noting that modern CPU designs frequently have "deeper" pipelines with more pipeline stages, so that 5 or more instructions may be executed in parallel.
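
To see what pipelining buys, count clock ticks. The sketch below, in C, assumes an idealized pipeline with one clock tick per stage and no stalls; the instruction count and stage count are chosen only for illustration.

        #include <stdio.h>

        /* idealized pipeline timing: one clock tick per stage, no stalls */
        int main(void)
        {
            int n = 100;                 /* instructions to execute          */
            int k = 3;                   /* stages: fetch, execute, store    */

            int serial    = n * k;       /* finish each instruction before
                                            starting the next: 300 ticks     */
            int pipelined = n + k - 1;   /* once the pipe is full, one
                                            instruction finishes per tick:
                                            102 ticks                        */

            printf("serial:    %d ticks\n", serial);
            printf("pipelined: %d ticks\n", pipelined);
            return 0;
        }

For long instruction sequences, the speedup under these assumptions approaches the number of pipeline stages.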

CPU's that allowed multiple instructions to be executed in parallel were pioneered with the CDC 6600, designed by Seymour Cray and first marketed in 1965. The modern pipelined view of execution emerged in the late 1960's in the IBM 360/95 -- a machine some describe as IBM's first supercomputer. By the mid 1970's, many commercial minicomputers used some form of pipelined execution, with 2-phase pipes (overlapping fetching the next instruction with executing the current one) becoming fairly common.

First generation microprocessors such as the Intel 8080 did not do any pipelining, but by the second generation, characterized by the Intel 8086, instruction prefetching was common. Today, pipelined execution is universal in all but small programmable controller chips (the smallest class of microprocessors on the market).

An easy way to see the effect of pipelining is to plot the instruction being executed on one axis of a graph versus the time on another axis:

          |_________ _________ _________
          |         |         |         |
  Instr 1 |  Fetch  | Execute |  Store  |
          |_________|_________|_________|_________
          |         |         |         |         |
  Instr 2 |         |  Fetch  | Execute |  Store  |
          |         |_________|_________|_________|_________
          |                   |         |         |         |
  Instr 3 |                   |  Fetch  | Execute |  Store  |
          |                   |_________|_________|_________|
          |__________________________________________________
               1         2         3         4         5

                               Time --->
Typically, we hope that each pipeline stage can execute in one clock tick, so that if the CPU runs at 400 MHz, it can execute 400 million instructions per second.

Pipelining has a significant effect on programming at the assembly language level. Consider what happens if the following sequence of instructions is executed on a pipelined version of the Hawk machine:

	Instr 1: LIS	R3,3
	Instr 2: ADD	R4,R4,R3
In terms of the above graph, the store phase of the LIS instruction takes place at the end of step 3, but the execute phase of the ADD instruction takes place during the same step! If the pipeline were implemented in the most straightforward way, this would mean that the ADD instruction would use the value of R3 prior to the value assigned by the LIS instruction! Some pipelined machines have been built this way, but most add additional hardware to "stall the pipeline" or to insert bubbles in the pipe, delaying instruction execution until the operands on which the instruction depends are available. With pipeline stalls, a typical CPU would execute the above instruction sequence as follows:
          |_________ _________ _________
          |  Fetch  | Execute |  Store  |
  Instr 1 |   LIS   |   LIS   |  in R3  |
          |_________|_________|_________|_________ _________
          |         |  Fetch  |/////////| Execute |  Store  |
  Instr 2 |         |   ADD   |//stall//|   ADD   |  in R4  |
          |         |_________|/////////|_________|_________|
          |__________________________________________________
               1         2         3         4         5

                               Time --->
As a result, programs on modern CPU's that are written without attention to the presence of a pipeline will frequently execute more slowly than programs written with an understanding of the behavior of the pipeline. Consider the following two instruction sequences:
	Slow and readable	Fast and unreadable

	LOAD	R3,X		LOAD	R3,X
	ADDSI	R3,1		LOAD	R4,Y
	STORE	R3,X		ADDSI	R3,1
	LOAD	R4,Y		ADDSI	R4,-4
	ADDSI	R4,-4		STORE	R3,X
	STORE	R4,Y		STORE	R4,Y
These two instruction sequences do exactly the same thing; the left sequence is organized for clear readability, with related instructions grouped together so that it is easy to see that this increments X by 1 while decrementing Y by 4. The right sequence has the same instructions interleaved, making it hard to read. But! In the left sequence, most of the instructions depend on a result computed by their immediate predecessors, while in the right sequence, the interleaving is such that no instruction depends on a result computed by the previous instruction. As a result, the right sequence is faster!
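
The difference can be estimated by counting ticks, assuming the 3-stage pipeline above and a one-tick stall whenever an instruction uses a result computed by its immediate predecessor, as in the stall diagram. The C sketch below is only a rough model of that assumption, not a description of any particular machine.

        #include <stdio.h>

        /* rough model: a 3-stage pipe with a one-tick stall whenever an
           instruction uses a result from its immediate predecessor */
        static int ticks(const int dep_on_prev[], int n)
        {
            int t = n + 2;              /* n instructions, 3-stage pipe  */
            for (int i = 0; i < n; i++)
                t += dep_on_prev[i];    /* one extra tick per stall      */
            return t;
        }

        int main(void)
        {
            /* slow and readable: each ADDSI and STORE uses the result of
               the instruction just before it, giving four stalls */
            int slow[6] = { 0, 1, 1, 0, 1, 1 };

            /* fast and unreadable: interleaved, so no instruction uses
               its immediate predecessor's result */
            int fast[6] = { 0, 0, 0, 0, 0, 0 };

            printf("slow sequence: %d ticks\n", ticks(slow, 6));   /* 12 */
            printf("fast sequence: %d ticks\n", ticks(fast, 6));   /*  8 */
            return 0;
        }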

Good modern compilers will automatically reorder instructions to minimize dependencies between successive instructions, and many computers, all the way back to the CDC 6600 and including the Pentium today, are able to execute instructions out of order, effectively reordering instructions at run-time in order to maximize pipeline performance.

No matter how the instructions in a program are reordered, branch instructions cause problems! Typically, whether or not a conditional branch will be taken is not known until the store phase of the instruction execution cycle. By this time, with a 3-phase pipeline, the following instruction has been executed and the instruction after that has been fetched. If the branch is not followed, this is what we want for maximum execution speed, but if the branch is to be followed, we have to discard these partially completed instructions. Thus, we say that each branch followed incurs a branch penalty, and that instructions following a branch are subject to speculative execution since any results completed may need to be discarded.
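
The cost of branch penalties can be estimated with a little arithmetic. The figures in the sketch below are hypothetical, chosen only for illustration.

        #include <stdio.h>

        /* hypothetical figures: 20% of instructions are branches, 60% of
           branches are taken, and each taken branch discards two partially
           completed instructions (the penalty for a 3-stage pipe) */
        int main(void)
        {
            double branch_freq = 0.20;
            double taken_frac  = 0.60;
            double penalty     = 2.0;

            double cpi = 1.0 + branch_freq * taken_frac * penalty;
            printf("average ticks per instruction: %.2f\n", cpi);  /* 1.24 */
            return 0;
        }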

Thus, even if two CPU's are clocked at the same rate and are able to fetch instructions at the same rate, they may execute identical programs at different speeds depending on how much reordering they can do and depending on whether they handle speculative execution or just stall their pipelines whenever a branch instruction is in the pipe. If the CPU's being compared have different instruction sets, so that they do differing amounts of work per instruction, comparison is even more difficult.

Cache Memory

There is a significant mismatch between high performance CPU speeds and main memory speeds! Today's main memory chips operate at around 60 ns, allowing a single 32 bit (or sometimes 64 bit) load or store during each 60 ns memory cycle. The CPU's we build are clocked at upwards of 100 MHz, so in theory, we are doing one instruction fetch every 10 ns, plus occasional loads and stores, all from a memory that requires 60 ns for each operation!

Even if we assume that multiple instructions are packed into each word fetched from the instruction stream, with only a fraction of the instructions requiring loads or stores, it is hard to imagine how such a slow memory can keep up with such a fast CPU.

The key to this problem lies in the introduction of cache memories, small fast memories inserted between the CPU and main memory. A cache keeps copies of values that were recently read from main memory, so that, if the CPU needs the contents of those memory locations again, it can get them from the cache. With every memory reference, the CPU looks first in the cache, and goes to main memory only when the cache entry indicates that the desired value is not available.

 _____    _____
|     |  |     |        Bus
| CPU |==|cache|======================
|_____|  |_____|        ___|___
                       |       |
                       | Memory|
                       |_______|
When the CPU finds a value it needs in the cache, we call this a hit. When it is forced to go to main memory, we call this a miss. Modern cache designs frequently achieve hit ratios (that is, the ratio of hits to total CPU memory references) that are above 90%, and as a result, fewer than one memory reference in ten requires a main memory cycle.
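
The following C sketch shows the essential mechanism a cache uses to decide between a hit and a miss. A direct-mapped organization with 256 one-word lines is assumed here purely for illustration; real caches differ widely in size and organization, and the main_memory_read stub merely stands in for a real memory cycle.

        #include <stdio.h>
        #include <stdint.h>
        #include <stdbool.h>

        #define LINES 256

        static uint32_t tag_store[LINES];   /* which word each line holds   */
        static uint32_t data_store[LINES];  /* the cached copy of that word */
        static bool     line_valid[LINES];

        /* stand-in for a slow (60 ns) main memory read */
        static uint32_t main_memory_read(uint32_t addr) { return addr ^ 0x5A5Au; }

        static uint32_t fetch_word(uint32_t addr, int *hit)
        {
            uint32_t word = addr / 4;       /* byte address to word address */
            uint32_t line = word % LINES;   /* low bits pick the cache line */
            uint32_t tag  = word / LINES;   /* remaining bits form the tag  */

            if (line_valid[line] && tag_store[line] == tag) {
                *hit = 1;                   /* hit: no main memory cycle    */
                return data_store[line];
            }
            *hit = 0;                       /* miss: go to main memory      */
            data_store[line] = main_memory_read(addr);
            tag_store[line]  = tag;
            line_valid[line] = true;
            return data_store[line];
        }

        int main(void)
        {
            int hit;
            fetch_word(0x1000, &hit);
            printf("first reference:  %s\n", hit ? "hit" : "miss");
            fetch_word(0x1000, &hit);
            printf("second reference: %s\n", hit ? "hit" : "miss");
            return 0;
        }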

How can a small but fast auxiliary memory lead to such high hit ratios? The answer lies in a property of programs called locality: Most programs repeatedly reference a small number of memory locations -- most execution, for example, involves loops, repeatedly fetching the same instructions and repeatedly referencing the same variables. At any instant, as a result, the set of recently referenced memory locations is a good predictor of the set of memory locations that will be referenced in the near future.
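
Locality is what makes high hit ratios possible, and a high hit ratio translates directly into speed. The figures in the sketch below are hypothetical: a 10 ns cache, a 60 ns main memory, and a 90% hit ratio, combined with the simplest possible model of the memory system.

        #include <stdio.h>

        /* one-level cache model; all figures are hypothetical */
        int main(void)
        {
            double hit_ratio  = 0.90;
            double cache_time = 10.0;   /* ns for a cache hit               */
            double miss_time  = 60.0;   /* ns for a reference that must go
                                           to main memory                   */

            double effective = hit_ratio * cache_time
                             + (1.0 - hit_ratio) * miss_time;
            printf("effective access time: %.1f ns\n", effective);  /* 15 ns */
            return 0;
        }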

Cache memory was originally developed in the late 1960's, when modestly fast semiconductor caches were installed between the fastest CPUs of the day and the then-dominant core memory systems. In the 1970's, the idea of separate caches for instructions and data emerged to allow instruction fetches to be carried out in parallel with data references, so long as both involved cache hits. Instruction caches are noticeably simpler than data caches because instructions are read-only, while the data cache must deal with both read and write cycles.

Multilevel caches may involve separate cache modules between the CPU chip and the memory bus, with a faster but smaller cache chip mounted on the same substrate as the CPU, and even faster but much smaller instruction and data caches built into the CPU chip itself.
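
The effective access time arithmetic extends to multiple cache levels. Again, the figures in the sketch below are hypothetical, and the model is deliberately simple.

        #include <stdio.h>

        /* two-level cache model; all figures are hypothetical */
        int main(void)
        {
            double h1 = 0.90, t1 =  5.0;   /* on-chip cache: hit ratio, ns   */
            double h2 = 0.80, t2 = 20.0;   /* off-chip cache, for references
                                              that miss the on-chip cache    */
            double tm = 60.0;              /* main memory, ns                */

            double effective = h1 * t1
                             + (1.0 - h1) * (h2 * t2 + (1.0 - h2) * tm);
            printf("effective access time: %.1f ns\n", effective); /* 7.3 ns */
            return 0;
        }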

In some cases, the presence of caches makes a difference to the programmer! Self modifying code, for example, will not work predictably if there are separate instruction and data caches! If self-modifying code must be supported, systems with such caches must typically include special cache flush instructions that cause the cache to forget, thus forcing the next few (or next few thousand) memory cycles to be cache misses, refilling the cache with the newly modified data.

Polled input/output is problematic if the cache sits between the CPU and the device registers! Again, cache flush instructions can solve this! Some memory management units include a bit in the descriptor for each page that can be set to inhibit use of the cache for data on that page; on such systems, this bit would usually be set for those pages that are used to access input/output devices!

Program structures and data structures can play a significant role in determining locality, and thus, in determining execution speed on systems that contain caches. Consider the following two program fragments:

	poor locality		good locality

	for i := 1 to n do	for i := 1 to n do begin
	    if a[i] > a[max]	    if a[i] > a[max]
		then max := i;		then max := i;
	for i := 1 to n do	    if a[i] < a[min]
	    if a[i] < a[min]		then min := i;
		then min := i;	end;
The version on the left has two loops, each of which visits every array element once. The version on the right has one loop that visits each array element twice in quick succession. As a result, the second reference to each element in the right example is unlikely to cause a cache miss, while in the left example, the second loop may easily encounter just as many cache misses as the first in references to items in the array.

Bus Structures

Machines based on the simple model of interaction between the CPU and memory through a single bus were only commercially successful during the 1970's! The DEC PDP-11, introduced in 1970, had a single bus, the UNIBUS, connecting all memory and devices with the CPU. The DEC PDP-8/E OMNIBUS was introduced in the same year, and many other minicomputer manufacturers followed with similar bus structures. The S-100 bus, introduced with the Altair 8800 in 1975 and the first great microcomputer bus, had this structure, as did a number of other early microcomputer bus structures.

 _____ 
|     |             Bus
| CPU |===============================
|_____|    ___|___   ___|___   ___|___
          | Device| |       | | Device|
          |   A   | | Memory| |   B   |
          |_______| |_______| |_______|
Early systems, those designed in the 1950's and 1960's, frequently had very irregular input/output subsystems, with some devices directly attached to the CPU and handled with special instructions, other devices attached to special controllers that were directly linked to the CPU, and still other devices attached to dedicated I/O busses. The unified bus structures of the 1970's were a great improvement over such irregular architectures.

The problem with this structure is that the bus must accommodate the very high speed traffic from CPU to memory, and it is inconvenient to build device interfaces to accommodate such traffic! Furthermore, modern CPUs are fast enough that there are strict limits on the length of the CPU-memory bus, and this makes it difficult to add peripherals.

As a result, many modern systems have returned to an irregular-looking hierarchy of busses. On the IBM PC, for example, there are a number of busses. Some are included for compatibility with peripherals first designed for use on early generation PC hardware, some are included in order to accommodate modern fast peripherals (the IDE bus on the PC), and some are included for compatibility with industry standards such as the SCSI standard. Thus, a typical system might have a bus structure such as the following:

 _____
|     |   Main Bus (fast, 64 bit wide)
| CPU |==================================
|_____|    ___|___   ___|___   ___|___
          |       | |  bus  | |  bus  |
          | Memory| |coupler| |coupler|
          |_______| |_______| |_______|
                        |         |
  Fast I/O bus (16 bit) |         |  Slow I/O bus (8 bit)
 ==========================     ===========================
     ___|___   ___|___              ___|___   ___|___ 
    |       | |  bus  |            |       | |  Key  |
    | Disk  | |coupler|            | Mouse | | board |
    |_______| |_______|            |_______| |_______|
                  |
   SCSI bus       |
 ==========================
     ___|___     ___|___
    |       |   |       |
    | Disk  |   | Tape  |
    |_______|   |_______|
The various bus couplers may vary in complexity depending on whether or not they support DMA transfers and depending on whether or not they attempt to be invisible to the CPU. That is, does the CPU access the devices on one of the secondary busses by loading a device select register in the coupler and then setting or inspecting a data register, or is some field of the memory address used to directly reference the device on a secondary bus, so that the CPU can address, for example, the mouse, as if its device registers were attached directly to the main bus?
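
To make the distinction concrete, here is a sketch in C of the two styles of access. The addresses (COUPLER_SELECT, COUPLER_DATA, MOUSE_DIRECT) and the device number are made up for illustration; the real register layout depends entirely on the coupler and the system.

        #include <stdint.h>

        /* all addresses and the device number below are hypothetical */
        #define COUPLER_SELECT ((volatile uint8_t *) (uintptr_t) 0xFFFF0000u)
        #define COUPLER_DATA   ((volatile uint8_t *) (uintptr_t) 0xFFFF0004u)
        #define MOUSE_DEVICE   3

        /* the mouse data register as it would appear through a coupler
           that maps its secondary bus into the CPU's address space */
        #define MOUSE_DIRECT   ((volatile uint8_t *) (uintptr_t) 0xFFFE0030u)

        /* style 1: the coupler is visible -- select the device, then move
           data through the coupler's own data register */
        uint8_t read_mouse_indirect(void)
        {
            *COUPLER_SELECT = MOUSE_DEVICE;
            return *COUPLER_DATA;
        }

        /* style 2: the coupler is invisible -- a field of the address picks
           the device, so its registers appear to be on the main bus */
        uint8_t read_mouse_direct(void)
        {
            return *MOUSE_DIRECT;
        }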