15. Inside a Modern CPU

Part of 22C:60, Computer Organization Notes
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Background

Up to this point, we have described the central processing unit as executing a sequential fetch-execute cycle, in which each instruction is executed completely before the next instruction is fetched. We can characterize such a processor as being hardware that executes an algorithm such as the following:

The sequential fetch-execute cycle
while (TRUE) do {
        ... fetch one instruction
        ... get the operand registers and compute the effective address
        ... do the ALU operations or access memory as required
        ... save any results as required
}

Starting in the mid 1960s, Control Data Corporation and IBM pioneered new approaches to the instruction execution cycle that allowed for much higher performance by overlapping the fetch of one instruction with the execution of its predecessors. There were several approaches to doing this in the computers of the late 1960s, and the Hawk is designed around two of these.

One of the first successful approaches to speeding the fetch-execute cycle was developed by Control Data. This is known as functional unit parallelism. The Hawk coprocessor architecture is designed for this; in terms of the terminology used by Control Data, each coprocessor is a functional unit. When the Hawk CPU starts a coprocessor operation such as a floating-point add with the COSET instruction, the coprocessor begins executing the indicated operation, but the CPU is free to fetch and execute other instructions. The only time the CPU waits for a coprocessor operation to finish is when the CPU uses a COSET or COGET instruction to communicate with a busy coprocessor.
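
In the same informal notation as the sequential fetch-execute cycle given above, the CPU's side of this scheme can be sketched roughly as follows. The helper names is_COSET, is_COGET and coprocessor_busy are invented here purely for illustration; the point is simply that the test for a busy coprocessor is made only when the CPU actually needs to talk to it.

The fetch-execute cycle with a coprocessor, a rough sketch
while (TRUE) do {
        ... fetch one instruction into ir
        if (is_COSET( ir ) || is_COGET( ir )) {
                while (coprocessor_busy()) {
                        ... wait; this is the only point at which the CPU stalls
                }
                ... move the register to or from the coprocessor; a COSET
                    of the operation register starts the coprocessor working
        } else {
                ... execute the instruction as usual, while the coprocessor
                    continues its own operation in parallel
        }
}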

By the late 1970s, another approach to speeding the fetch-execute cycle came to dominate all of the others. This is called pipelined execution. Different pipelined machines have had different numbers of pipeline stages. Short pipelines with only two stages have been built, while others have five or more stages. The four-stage model illustrated below is typical and very appropriate for the Hawk:
 

A pipelined processor
IF      Instruction Fetch
OF/AC   Operand Fetch / Address Computation
ALU/MA  Arithmetic Logic Unit / Memory Access
RS      Result Store

The basic idea of a pipelined processor is that each instruction is processed, in turn, by each of the pipeline stages. In the 4-stage pipeline illustrated above, the instruction fetch stage begins the execution of each instruction by fetching it, and then passes it off to the operand-fetch/address-computation stage, which gathers operands from registers and computes the effective address. After this is done, the arithmetic-logic-unit/memory-access stage does whatever arithmetic is required for register-to-register instructions, or goes to memory for memory reference instructions, and then passes the value to be stored in the destination register to the result-store stage.

Just as we could express the behavior of a sequential processor in algorithmic terms, we can also express the behavior of a pipelined processor in such terms. The outer loop remains the same, but each iteration of this loop performs part of the computation for several different instructions in parallel. To do this, we need some way to express parallelism. In the following, we express this with the informally defined in parallel block, which initiates the parallel execution of all of the statements within that block:

The pipelined fetch-execute cycle
while (TRUE) do in parallel {
        ... fetch instruction x
        ... get operands and compute effective address for instruction x-1
        ... do ALU operations or access memory for instruction x-2
        ... save any results as required by instruction x-3
}

As a result, for the processor illustrated, during each execution cycle there are four instructions in various stages of execution, one being processed by each stage. It takes four execution cycles to complete each instruction, but once the pipeline is full, one instruction is completed during each cycle. The following figure, called a pipeline diagram, illustrates the execution of a short sequence of instructions on a pipelined processor:

A pipeline diagram of program execution
LOADS  R3,R4    IF      OF/AC   ALU/MA  RS
ADDSI  R4,1             IF      OF/AC   ALU/MA  RS
STORES R3,R4                    IF      OF/AC   ALU/MA  RS
time            1       2       3       4       5       6

This diagram shows the instruction LOADS R3,R4 being fetched at time 1. At time 2, the contents of R4 are taken from the registers as the effective address of this instruction, while at time 3, this memory address is used to load one word from memory. Finally, at time 4, this value is stored in R3, completing the execution of this instruction.

Similarly, the instruction ADDSI R4,1 is fetched at time 2. At time 3, the contents of R4 are taken from the registers as the operand of this instruction, while at time 4, this operand is incremented. Finally, at time 5, the incremented value is stored back in R4, completing the execution of the second instruction.
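
The timing visible in this diagram generalizes. With no delays, the first instruction needs one cycle per pipeline stage, and every following instruction finishes one cycle after its predecessor, so a sequence of n instructions on a k-stage pipeline takes k + (n - 1) cycles; for the three instructions above, 4 + (3 - 1) = 6, matching time steps 1 through 6. As a trivial check, in the same C-like notation:

Pipeline timing, with no delays
int pipeline_cycles( int n, int k ) {
        /* n instructions, k pipeline stages, no delay slots */
        return k + (n - 1);     /* e.g. pipeline_cycles( 3, 4 ) == 6 */
}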

Exercises

a) With reference to the pipeline diagram given above, during what time step does the STORES compute its effective address? Given this, what odd behavior would you expect with regard to the effects of the immediately preceding ADDSI?

Problems with Pipelining

Pipelined processors work best if instructions are independent of each other. Consider the problem posed by the simple assignment statement a=b. If we assume that a and b are local variables, this would be done on the Hawk with two instructions, LOAD R3,R2,B followed by STORE R3,R2,A. If our pipelined machine executed these in sequence without any special considerations, we would get a strange result. This can be seen by examining the pipeline diagram.

A problem pipeline diagram
LOAD  R3,R2,B   IF      OF/AC   ALU/MA  RS
STORE R3,R2,A           IF      OF/AC   ALU/MA  RS
time            1       2       3       4       5

The LOAD instruction loads its operand from memory in cycle 3 and saves this operand in R3 during cycle 4. Meanwhile, the STORE instruction gets its operand from R3 during cycle 3 and stores it to memory during cycle 4. As a result, even though the LOAD instruction is fetched before the STORE instruction, the effect is as if the LOAD were executed second. Pipelined processors typically deal with such operand dependencies by detecting the dependency and delaying the execution of the dependent instruction until the necessary operands are present. The following pipeline diagram illustrates this.
 

A pipeline diagram with delay slots
LOAD  R3,R2,B   IF      OF/AC   ALU/MA  RS
STORE R3,R2,A           IF      delay   delay   OF/AC   ALU/MA  RS
time            1       2       3       4       5       6       7

These delay slots are introduced by interlock logic that detects that one pipeline stage depends on results from another and prevents that stage from proceeding until the necessary results are available. In this case, the operand fetch stage of the store instruction blocks itself until the load instruction has completed its result store step.
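
The test made by this interlock logic can be sketched as a small function. The names here are invented for illustration; the essential idea is that the operand fetch of the current instruction is held back while any older instruction still in a later pipeline stage has yet to store into a register that the current instruction reads.

The interlock test, a rough sketch
int must_delay( int src_reg,                   /* a register this instruction reads */
                int alu_dst, int alu_writes,   /* what the ALU/MA stage will write  */
                int rs_dst,  int rs_writes ) { /* what the RS stage will write      */
        return (alu_writes && (src_reg == alu_dst))
            || (rs_writes  && (src_reg == rs_dst));
}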

Pipeline delay slots can destroy the potential performance of a pipelined processor. To minimize their impact, processor designers have developed several tricks. Result forwarding logic attempts to make results available sooner. For example, although the result of the load instruction in our example is not properly in the destination register at the end of cycle 3, it has been fetched from memory at that point and must therefore be in some internal register of the processor. Therefore, a smart operand fetch stage can get the data from that internal register instead of going to R3 as directed by the instruction. Result forwarding logic can generally eliminate one delay cycle per operand dependency.
 

The effect of result forwarding
LOAD  R3,R2,B   IF      OF/AC   ALU/MA  RS
STORE R3,R2,A           IF      delay   OF/AC   ALU/MA  RS
time            1       2       3       4       5       6
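
Result forwarding can be thought of as one more multiplexor in front of each operand input: if the instruction one stage ahead is about to store into the very register this instruction needs, take the value from that stage's internal result register instead of from the register file. A rough sketch, again with invented names:

Result forwarding into the operand fetch stage, a rough sketch
int fetch_operand( int reg, int reg_file_value,       /* the normal register read   */
                   int alu_dst, int alu_writes,       /* the instruction one stage  */
                   int alu_result ) {                 /* ahead, and its result      */
        if (alu_writes && (reg == alu_dst)) {
                return alu_result;       /* forwarded from the ALU/MA stage */
        }
        return reg_file_value;           /* the normal register-file value  */
}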

Compilers can also play an important part in eliminating operand dependencies by rearranging the order of instructions in a program. Assembly language programmers can also reorder their instructions to eliminate operand dependencies, but the result is generally far less readable than an ordering that emphasizes the relationship of instructions to each other. As a result, hand crafted assembly code frequently runs slower than compiled code, given the use of a high quality compiler.

There are other delays that must be considered. Consider the problem of branch and jump instructions. Generally, the address computation stage of branch instructions computes the destination address, the ALU stage checks the condition codes to see if the branch should take place, and the result store stage actually sets the program counter. As a result, the instruction fetch for the next instruction must be delayed until after the result-store stage, resulting in a 3-cycle delay. Branch prediction logic in the instruction fetch stage can reduce these delays, so long as the predictions are correct.
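
A rough cost model makes the stakes clear. If a fraction b of all instructions are branches, a fraction m of those are not predicted correctly, and each such branch costs p delay cycles, the average cost rises from one cycle per instruction to 1 + b*m*p. For example, if one instruction in five is a branch and there is no prediction at all, the 3-cycle delay above gives 1 + 0.2 * 1.0 * 3 = 1.6 cycles per instruction; a predictor that is right 80 percent of the time cuts this to 1.12.

A rough model of branch delay costs
double cycles_per_instruction( double branch_frac,   /* fraction that are branches   */
                               double miss_frac,     /* fraction mispredicted        */
                               int    penalty ) {    /* delay cycles per such branch */
        return 1.0 + branch_frac * miss_frac * (double) penalty;
}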
 

Exercises

b) Consider the high level code a=b;c=d. This can be converted to assembly language in two different ways, assuming that the variables are all local. Which will be faster? Why?

LOAD  R3,R2,B               LOAD  R3,R2,B
STORE R3,R2,A               LOAD  R4,R2,D
LOAD  R3,R2,D               STORE R3,R2,A
STORE R3,R2,C               STORE R4,R2,C

c) Construct the pipeline diagram for each of the above instruction sequences, showing how each would be executed, assuming a simple pipelined processor without any result forwarding or other schemes to reduce the number of delay cycles.
 

Pipelining and the Hawk

The Hawk architecture was designed to be pipelined using the 4-stage model illustrated above. In order to understand how this is done, we must examine how the different pipeline stages communicate. The output of each pipeline stage is stored in an interstage register that serves as input to the next pipeline stage. So, for example, the instruction register is an interstage register loaded by the instruction fetch pipeline stage and used as an input to the operand fetch and address computation stage, and the effective address register is an output of the address computation stage and an input into the memory access stage. The first step in designing a pipelined processor after laying out the basic pipeline stages is to figure out what all of the necessary interstage registers are. For the Hawk, we can determine the following.

Outputs from the IF stage, input to the OF/AC stage

IF-IR,
the instruction register output from the instruction fetch stage.

IF-PC,
the program counter output from the instruction fetch stage. This is needed because many instructions need to examine the value of the program counter used in fetching that instruction, for example, when computing program-counter relative addresses.

Outputs from the OF/AC stage, input to the ALU/MA stage

OF-IR,
the instruction register output from the operand fetch stage. This is needed because later stages still need to know what instruction they are executing.

OF-EA,
the effective address output from the address computation stage, needed only if the instruction is a memory reference instruction.

OF-OP1,
the first operand output from the operand fetch stage, for example, the operand to be stored in memory or added to.

OF-OP2,
the second operand output from the operand fetch stage, needed primarily for arithmetic operations.

Outputs from the ALU/MA stage, input to the RS stage

ALU-IR,
the instruction register output from the arithmetic logic unit stage. This is needed because later stages still need to know what the instruction is.

ALU-RES,
the result of the arithmetic instruction or the value loaded from memory.
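
Taken together, the interstage registers amount to a small bundle of state at each stage boundary. A minimal sketch in C, with widths chosen here only for illustration (a real implementation would size the IR copies to hold whatever instruction bits the later stages still need):

The interstage registers, a rough sketch
struct if_to_of  { unsigned short ir;           /* IF-IR              */
                   unsigned int   pc; };        /* IF-PC              */
struct of_to_alu { unsigned short ir;           /* OF-IR              */
                   unsigned int   ea;           /* OF-EA              */
                   unsigned int   op1, op2; };  /* OF-OP1, OF-OP2     */
struct alu_to_rs { unsigned short ir;           /* ALU-IR             */
                   unsigned int   res; };       /* ALU-RES            */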

 
The following figure shows these registers in pictorial form, in relation to the pipeline stages they interconnect. The figure also shows the relation of the pipeline stages to memory and the 15 general registers:

 

Pipeline interstage registers in their context

Exercises

d) Which pipeline interstage register provides data as input to the operand fetch stage for result forwarding?

 

The Instruction Fetch Stage

Given these registers, we can now begin to work out, in algorithmic form, the behavior of each pipeline stage. Consider, for example, the instruction fetch stage. If we ignore the problem of implementing branch instructions and if we ignore all long instructions, so that all Hawk instructions are assumed to be 16 bits, this stage becomes very simple:

The IF stage, simplified
{ /* once per clock cycle */
        if_pc = if_pc + 2;
        if_ir = * (halfword_pointer) if_pc;
}

This version of the instruction fetch stage is seriously oversimplified. In any useful computer, there must be some way to force a branch. We do this with an output from a later stage, for example, the result store stage. If the result-store stage detects a branch instruction, it must send the branch address to the instruction fetch stage, along with a signal indicating that a branch is to be taken.

The IF stage, with branches
{ /* once per clock cycle */
        if (branch) {
                if_pc = branch_address;
        } else {
                if_pc = if_pc + 2;
        }
        if_ir = * (halfword_pointer) if_pc;
}

We can build this circuit using a multiplexor to select between the two values of the program counter, plus an incrementor.

The IF stage, as digital logic

We have preserved two major oversimplifications through this exercise. First, we have ignored long instructions. The Hawk computer is usable with only short instructions, but programs are far more compact if long instructions are permitted. Second, we have ignored the fact that physical memory addresses are 30-bit quantities (or 32 bits with the least significant two bits ignored), and that data to and from main memory is always transferred 32 bits at a time.
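
In code, the effect of the 32-bit memory path on instruction fetch is that memory delivers the whole word containing the instruction, and the low bit of the halfword address selects which half to use. The sketch below assumes one particular assignment of halfwords to the high and low ends of the word; the circuit version of this selection is the subject of an exercise below.

Selecting a halfword from a 32-bit memory word, a rough sketch
unsigned short fetch_halfword( unsigned int word_from_memory,
                               unsigned int halfword_address ) {
        if (halfword_address & 1) {     /* the halfword-in-word select bit */
                return (unsigned short) (word_from_memory >> 16);
        } else {
                return (unsigned short) (word_from_memory & 0xFFFF);
        }
}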
 

Exercises

e) For each of the following, write an equivalent instruction sequence, using only short instructions. If you need a scratch register, use R1.
  -- LIL R3,#123456
  -- LOAD R4,R3,#1234

f) Assume the data path from memory is 32 bits wide, and you have a 31-bit halfword address. Draw a circuit that splits the 31-bit address into a 30-bit word address and a 1-bit halfword-in-word signal, and then uses a multiplexor to select the correct halfword.

The Operand-Fetch/Address-Computation Stage

The operand-fetch/address-computation stage is sometimes described as the operand marshalling stage. Most of the logic in this stage is concerned with marshalling or shunting the values from various registers into the correct inputs of the next pipeline stage. The term marshalling is used here as it is in British English, where a marshalling yard is a railroad yard in which cars are shunted around to make up trains. The one piece of logic that does something other than shunt operands around is the adder used for program-counter-relative and indexed address computation.

The details of the operand routing depend on the contents of the instruction register. For some instructions, we will need the values of up to three registers. We can express the details of the marshalling as code using a sequence of if statements detailing which inputs to this pipeline stage get assigned to what outputs under what circumstances. All of these output registers can be loaded in parallel if the register subsystem allows multiple registers to be inspected at once.

The Hawk OF/AC stage, simplified
{ /* once per clock cycle */
        int dst = extract_dst_field( if_ir );
        int s1 = extract_s1_field( if_ir );
        int s2 = extract_s2_field( if_ir );

        of_ir = if_ir;

        if (is_branch( if_ir )) {
                /* pc-relative branch: the displacement field, in halfwords;
                   extract_disp_field is assumed, like the extractors above */
                of_ea = if_pc + (extract_disp_field( if_ir ) << 1);
        } else if ( s2 == 0 ) {
                of_ea = if_pc;
        } else {
                of_ea = register[s2];
        }

        if (is_memref_or_two_register( if_ir )) {
                of_op1 = register[dst];
        } else {
                of_op1 = register[s1];
        }

        if (is_ADDSI( if_ir )) {
                of_op2 = s2;    /* the s2 field itself is the immediate constant */
        } else {
                of_op2 = register[s2];
        }
}

In the above, and in the following logic diagram, a number of details have been omitted. Sign extension has not been documented at all, where it is required, and the data paths are insufficient to support several instructions. The logic diagram makes it clear that all of the inputs are delivered to every possible destination, with multiplexors at the destination determining which inputs are used and which inputs are ignored.

The OF/AC stage, as digital logic

The logic diagram given here does not express the logic for the effective address computation the same way that it was expressed in code. There, it was done with an if-then-else construct, while here it is done with two multiplexors upstream of an adder. Both approaches can be made to work, but the conditions tested are different.

Exercises

g) Express the inputs a, b, c and d used to control the multiplexors in the above drawing in terms of values of various bits or fields in the instruction register. Comparison with the code given previously should help.

h) There are at least two places in the code, and two places in the logic diagram, where sign extension or other alignment issues have been omitted. Identify each of these and explain what was left out of the code or logic diagram.

i) This logic diagram allows a 4-bit value from the s2 field of the instruction register to be passed through into the OP2 register. What instructions need this? Does the difference in sign extension requirements for these two contexts make any difference to this pipeline stage?

j) This logic diagram does not appear to support the 8-bit constants needed for the LIS and ORSI instructions. Discuss how to add support for these instructions. It will probably take another 2-way multiplexor or, alternatively, the widening of one of the 2-way multiplexors into a 3-way multiplexor. There may be alternatives, depending on which register you elect to use for the value being loaded.

The Arithmetic-Logic-Unit/Memory-Access Stage

The arithmetic-logic-unit/memory-access stage either performs an arithmetic or logic operation (including, possibly, shifts) on the two operand inputs, or it stores data to memory, or it loads data from memory. In addition, this stage can route the effective address to the result stage in order to handle the load-effective-address instruction.

The ALU/MA stage, as digital logic

The version of the ALU stage shown here omits one very important detail: The condition codes! Obviously, the condition code register must be included somewhere, and the most natural place to put it is in this pipeline stage.
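
Setting the condition codes aside, the data flow just described can be sketched in the same notation used for the first two stages. The helper names here are invented, and the selection of the ALU function from the instruction is hidden inside the alu() call, which is exactly the kind of detail the exercises below ask about.

The ALU/MA stage, a rough sketch
{ /* once per clock cycle */
        alu_ir = of_ir;

        if (is_load( of_ir )) {
                alu_res = * (word_pointer) of_ea;        /* load from memory    */
        } else if (is_store( of_ir )) {
                * (word_pointer) of_ea = of_op1;         /* store to memory     */
        } else if (is_LEA( of_ir )) {
                alu_res = of_ea;                         /* load effective addr */
        } else {
                alu_res = alu( of_op1, of_op2, of_ir );  /* arithmetic or logic */
        }
}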

Exercises

k) Write out this stage, at a level of detail similar to that in the drawing, using a program notation similar to that used to describe the first two pipeline stages above.

l) What fields or bits of the instruction register control the multiplexor input shown as a in the diagram?

m) What fields or bits of the instruction register control the multiplexor input shown as c in the diagram?

n) The ALU control input b is a complex function of a number of fields of the instruction register. Use example instructions to show each field that matters.

o) There are obviously at least two control signals from this pipeline stage to memory, one used to indicate the need to read from memory and one used to indicate the need to write to memory.
  -- What fields or bits of the instruction register determine the value of the read-from-memory signal?
  -- What fields or bits of the instruction register determine the value of the write-to-memory signal?

The Result-Store Stage

The result-store stage takes the result from the previous stages and does one of several things with it: it may store the result in the destination register, it may pass the result back to the instruction fetch stage to be used as the new program counter in the case of taken branches and jumps, or it may simply discard it, as for store instructions, whose memory write was already done in the previous stage.

There is very little logic in this pipeline stage, aside from that involved with deciding what to do with the result. The basic data flow in this pipeline stage is exceedingly simple:

The RS stage, as digital logic

The version of the result-store pipeline stage shown here is extremely oversimplified in two important respects. First, it contains no logic to handle conditional branches -- obviously, this logic will involve the value of the condition code register from the previous pipeline stage, combined with the branch condition field (the dst field of the instruction register). This omission is a consequence of our omission of the condition codes from the previous pipeline stage.

The second omission is more important: We have made no provision for call instructions. A call instruction must both change the program counter and save a result (the return address) in a register. We will have to fix our top-level design for the pipeline to solve this problem, adding at least one new interstage register so that this pipeline stage can see both the result and the return address (the value of the program counter).
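
With those two omissions noted, the data flow of the result-store stage can be sketched in the same notation as the other stages. The condition-code input cc and the helper names are assumed here for illustration; the branch and branch_address outputs are the signals already used by the instruction fetch stage above.

The RS stage, a rough sketch
{ /* once per clock cycle */
        branch = 0;                             /* no branch unless set below  */
        if (is_taken_branch( alu_ir, cc )) {
                branch = 1;                     /* signal the IF stage         */
                branch_address = alu_res;       /* the branch target address   */
        } else if (writes_register( alu_ir )) {
                register[ extract_dst_field( alu_ir ) ] = alu_res;
        }
        /* otherwise, as for stores and untaken branches, there is nothing to do */
}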

Superscalar, Multicore and Multiprocessor Architecture

Some pipelined processors have what is called superscalar execution. This involves fetching two (or more) instructions at a time in the instruction fetch stage and then letting these flow through the pipeline in parallel. So long as there are no operand dependencies between such instructions, this can lead to double (or more) the execution speed. The Hawk, with 16-bit instructions and a 32-bit data path to memory, lends itself naturally to two-way superscalar implementations.

Of course, superscalar execution means even more potential for operand conflicts, and the interlock logic needed to prevent unexpected results on a 2-way superscalar pipelined machine is frequently around four times the complexity of the interlocking needed in a simple pipeline. This leads to diminishing returns.

In the 1960's, Burroughs Corporation began seriously exploring the idea of building computers with multiple CPUs. These machines were somewhat successful in competition with IBM's mainframe family, but everything about the Burroughs family of computer architectures was sufficiently eccentric that most observers saw those systems as being well outside the mainstream of computer architectures.

In the 1970's, Gordon Bell of Digital Equipment Corporation and Carnegie Mellon University observed that multiple inexpensive minicomputers offered significantly more aggregate computing power than a single mainframe with the same cost. This led, in the 1970's and into the 1980's, to a number of multi-minicomputer and multi-microprocessor architectures. Some of these were marginally successful in the commercial marketplace.

These multiprocessor experiments largely faded in the face of competition from pipelined superscalar machines, but in the 1990's, it was clear that the rate of return on pipelining and superscalar processors was in decline. Adding pipeline stages makes machines faster, but it increases the number of branch delay slots. A two-way superscalar processor can be twice as fast as a conventional processor, but only for straight-line code where there are not too many operand dependencies. Furthermore, a two-way superscalar processor requires more than twice the silicon of the simple pipelined processor it is built from.

This led, by the end of the 1990's, to the realization that the next direction in computer architecture was going to be single-chip microprocessors with multiple CPUs. Such processors came to be called multicore processors. A multicore processor has two or more CPUs on one chip, sharing an on-chip cache and bus arbitration logic. The first generation of multicore processors had just two CPU cores, but four-core chips are available, and it is natural to expect eight and sixteen core chips to follow.

A multicore system

Because it has only one memory management unit, the multicore chip shown in the figure requires that all processors operate in the same address space with the same access rights to memory. As such, it provides support for multithreaded applications but not for multiple independent processes that are protected from each other by separate memory management units. More sophisticated multicore systems either support multiple memory management units or more complex memory management units that allow each core of the processor to see a different memory address space.

Multicore chips, and before them, multiprocessors, pose real challenges to both operating systems and application developers. Unix proved, early on, to be well adapted to multiprocessors; it was successfully ported to a number of multiprocessors in the 1980's. The X window system for Unix works very well in the multicore-multiprocessor context because it uses a separate parallel process for the window manager. As processors expand beyond two cores, however, full processor utilization will only be possible if the applications themselves are multi-threaded. Unfortunately, most programmers find it difficult to write multi-threaded code.