14. Inside a Modern CPU
Part of 22C:60, Computer Organization Notes
Up to this point, we have described the central processing unit as executing a sequential fetch-execute cycle, in which each instruction is executed completely before the next instruction is fetched. Starting in the mid 1960s, new approaches to executing instructions were developed that allowed for much higher performance by overlapping the fetch of one instruction with the execution of its predecessors.
While there were several approaches to doing this in the computers of the late 1960s, one approach came to dominate all others in the 1970s. This is called pipelined execution. Different pipelined machines have had different numbers of pipeline stages: short pipelines with only two stages have been built, while others have had five or more stages. The four-stage model illustrated below is typical:
    IF      Instruction Fetch
    OF/AC   Operand Fetch / Address Computation
    ALU/MA  Arithmetic Logic Unit / Memory Access
    RS      Result Store
The basic idea of a pipelined processor is that each instruction is processed, in turn, by each of the pipeline stages. In the 4-stage pipeline illustrated above, the instruction fetch stage begins the execution of each instruction by fetching it and then passing it to the operand-fetch, address-computation stage, which gathers operands from registers and computes the effective address. After this is done, the arithmetic-logic-unit, memory-access stage does whatever arithmetic is required for register-to-register instructions, or goes to memory for memory reference instructions, and then passes the value to be stored in the destination register to the result-store stage.
As a result, for the processor illustrated, during each execution cycle, there are four instructions in various stages of execution, one being executed by each stage. It takes four execution cycles to complete each instruction, but one instruction is completed during each cycle. The following figure, called a pipeline diagram, illustrates the execution of a short sequence of instructions on a pipelined processor:
    LOADS R3,R4  | IF     | OF/AC  | ALU/MA | RS     |        |        |
    ADDSI R4,1   |        | IF     | OF/AC  | ALU/MA | RS     |        |
    STORES R3,R4 |        |        | IF     | OF/AC  | ALU/MA | RS     |
    time             1        2        3        4        5        6
This diagram shows the instruction LOADS R3,R4 being fetched at time 1. At time 2, the contents of R4 are taken from the registers as the effective address of this instruction, while at time 3, this memory address is used to load one word from memory. Finally, at time 4, this value is stored in R3, completing the execution of this instruction.
Similarly, the instruction ADDSI R4,1 is fetched at time 2. At time 3, the contents of R4 are taken from the registers as the operand of this instruction, while at time 4, this operand is incremented. Finally, at time 5, the incremented value is stored back in R4, completing the execution of the second instruction.
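The stagger in the diagram follows a simple rule: an instruction fetched at time t occupies pipeline stage s (counting IF as stage 1) at time t + s - 1. As a rough illustration, the following C sketch uses this rule to print a diagram of the same form for a short instruction sequence; the function and variable names here are invented for the illustration and are not part of any real Hawk tool.

    #include <stdio.h>

    #define STAGES 4
    static const char *stage_name[STAGES] = { "IF", "OF/AC", "ALU/MA", "RS" };

    /* print a pipeline diagram for a sequence of instructions, assuming
       one instruction enters the pipeline per cycle and no stage stalls */
    void print_pipeline_diagram( const char *instr[], int count ) {
        int last_time = count + STAGES - 1; /* time at which the last RS finishes */
        int i, t;

        for (i = 0; i < count; i++) {
            printf( "%-14s", instr[i] );
            for (t = 1; t <= last_time; t++) {
                int s = t - i - 1;          /* stage occupied at time t, 0-indexed */
                if (s >= 0 && s < STAGES) {
                    printf( "| %-6s ", stage_name[s] );
                } else {
                    printf( "|        " );
                }
            }
            printf( "|\n" );
        }

        printf( "%-14s", "time" );
        for (t = 1; t <= last_time; t++) printf( "    %-5d", t );
        printf( "\n" );
    }

    int main() {
        const char *prog[] = { "LOADS R3,R4", "ADDSI R4,1", "STORES R3,R4" };
        print_pipeline_diagram( prog, 3 );
        return 0;
    }

Running this reproduces the three-instruction diagram above, and makes it easy to see how the pattern continues for longer instruction sequences.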
a) With reference to the pipeline diagram given above, during what time step does the STORES instruction compute its effective address? Given this, what anomalous behavior would you expect with regard to the effects of the immediately preceding ADDSI instruction?
The Hawk architecture was designed to be pipelined using the 4-stage model illustrated above. In order to understand how this is done, we must examine how the different pipeline stages communicate. The output of each pipeline stage is stored in an interstage register that serves as input to the next pipeline stage. So, for example, the instruction register is an interstage register loaded by the instruction fetch stage and used as an input to the operand fetch and address computation stage, and the effective address register is an output of the address computation stage and an input to the memory access stage. The first step in designing a pipelined processor, after laying out the basic pipeline stages, is to figure out what all of the necessary interstage registers are. For the Hawk, the code fragments below use the interstage registers if_pc and if_ir, written by the instruction fetch stage, and of_ir, of_ea and of_op1 (among others), written by the operand fetch and address computation stage.
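As a rough sketch, these interstage registers could be declared as ordinary global variables in the C-style pseudocode used below. Only the registers actually named in the fragments that follow are shown; a complete Hawk pipeline would need several more, for example registers carrying the result and its destination forward to the result store stage.

    /* interstage registers written by the instruction fetch (IF) stage and
       read by the operand fetch / address computation (OF/AC) stage */
    unsigned int if_pc;   /* program counter, incremented as instructions are fetched */
    unsigned int if_ir;   /* the instruction most recently fetched */

    /* interstage registers written by the OF/AC stage and read by the
       ALU / memory access (ALU/MA) stage */
    unsigned int of_ir;   /* a copy of the instruction, passed along the pipe */
    unsigned int of_ea;   /* the effective address, for memory reference instructions */
    unsigned int of_op1;  /* the first operand, taken from the destination register */

    /* the general purpose register file, read by the OF/AC stage */
    unsigned int registers[16];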
Given these registers, we can now begin to work out, in algorithmic form, the behavior of each pipeline stage. Consider, for example, the instruction fetch stage. If we ignore the problem of implementing branch instructions, and if we ignore all long instructions so that we can assume all Hawk instructions are 16 bits, this stage becomes very simple:
    for (;;) { /* iterate once per clock cycle */
        if_ir = * (halfword_pointer) if_pc;  /* fetch one 16-bit instruction */
        if_pc = if_pc + 2;                   /* advance to the next halfword */
    }
The operand fetch and address computation stage is a bit more complex, even if we ignore the long instructions of the Hawk, since the operation it performs depends on the contents of the instruction register:
    for (;;) { /* iterate once per clock cycle */
        of_ir = if_ir;  /* pass the instruction along to the next stage */
        if (is_memory_reference_instruction( if_ir )) {
            r = extract_rx_field( if_ir );
            if (r == 0) {               /* index register zero means use the PC */
                of_ea = if_pc;
            } else {
                of_ea = registers[ r ];
            }
        }
        if (needs_value_of_rd( if_ir )) {
            r = extract_rd_field( if_ir );
            if (r == 0) {               /* register zero reads as zero */
                of_op1 = 0;
            } else {
                of_op1 = registers[ r ];
            }
        }
        ... similar code for other operand registers ...
    }
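The for (;;) loops above are written as if each stage ran on its own, but in the hardware all of the stages operate at once, each doing one step per clock cycle. If we wanted to simulate this behavior in software, one simple approach (only a sketch, with invented function names) is a single clock loop that evaluates the stages from last to first, so that each stage reads the interstage registers written by its predecessor on the previous cycle before they are overwritten:

    /* hypothetical functions, one per pipeline stage; each reads the
       interstage registers written by the stage before it and writes
       the interstage registers read by the stage after it */
    void instruction_fetch_stage( void );
    void operand_fetch_stage( void );
    void alu_memory_access_stage( void );
    void result_store_stage( void );

    void run_pipeline( void ) {
        for (;;) { /* one iteration per clock cycle */
            result_store_stage();       /* RS     */
            alu_memory_access_stage();  /* ALU/MA */
            operand_fetch_stage();      /* OF/AC  */
            instruction_fetch_stage();  /* IF     */
        }
    }

An alternative with the same effect is to have each stage compute new values for its output registers and then copy all of the new values into the interstage registers at the end of the cycle, which is closer to what the clocked hardware registers actually do.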
More to be added