Two major classes of computer architectures emerged in the 1940s: the Harvard architectures, named after the series of relay calculators developed by Aiken at Harvard (the Harvard Mark I, Mark II, Mark III and Mark IV machines), and the Von Neumann or Princeton machines. The latter are characterized by a single main memory holding both program and data, while the former have two separate memory spaces, one for code and one for data.
Harvard Architecture
 ________       _____       ________
| PROGRAM|     |     |     |  DATA  |
| MEMORY |=====| CPU |=====| MEMORY |
|________|     |_____|     |________|
Princeton Architecture
 _____       ________
|     |     |        |
| CPU |=====| MEMORY |
|_____|     |________|
Essentially all general purpose computers appear to be Princeton
or Von Neumann machines, from the programmer's perspective, but
this appearance is frequently misleading -- modern high-performance
CPUs are, at their heart, frequently designed in the Harvard
spirit, with added hardware outside the heart of the CPU to create
the appearance of a Princeton design.
Many microcontrollers and DSP (digital signal processor) designs are pure Harvard architectures. This frequently extends to such things as a word-size for the program memory that is unrelated to the word-size for the data memory.
For example, the Microchip 16Cxx family of microcontrollers has a 14-bit program word and an 8-bit data word. The data memory of this microcontroller is very small, with a maximum of 422 bytes of uncommitted RAM (out of a 512-byte address space; the rest of the address space is reserved for various special purpose registers such as device interfaces). The program memory for processors in this family is limited to 8K 14-bit words, all of which is ROM.
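To make the separation concrete, here is a minimal C sketch of a Harvard-style memory model, loosely patterned on the 16Cxx numbers quoted above; the sizes, masks and function names are illustrative assumptions, not Microchip's actual definitions.

    #include <stdint.h>

    #define PROG_WORDS 8192               /* 8K words of program ROM     */
    #define DATA_BYTES  512               /* 512-byte data address space */

    static uint16_t prog_mem[PROG_WORDS]; /* only bits 13..0 are used    */
    static uint8_t  data_mem[DATA_BYTES];

    /* instruction fetch touches only the program memory ...             */
    uint16_t fetch(uint16_t pc) {
        return prog_mem[pc % PROG_WORDS] & 0x3FFF;   /* 14-bit word      */
    }

    /* ... while loads and stores touch a completely separate array,
       with its own unrelated word size                                  */
    uint8_t load(uint16_t addr) {
        return data_mem[addr % DATA_BYTES];
    }
    void store(uint16_t addr, uint8_t value) {
        data_mem[addr % DATA_BYTES] = value;
    }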
DSP systems, whether they are built from modern DSP chips or from DSP peripherals attached to a minicomputer of 1973, are designed to run as adjuncts to a general purpose processor. As such, the DSP relies on the general purpose processor for all support functions, and is not designed to support a resident operating system or other support software. The host computer typically accesses the DSP's program and data memory as a special peripheral device.
The basic instruction execution cycle of a VLIW machine is straightforward. Consider the example format given at the start of this section of the notes:
repeat
    IR = CodeMemory[ PC ];
    do
        R[aDST] = ALUa( aOP, R[aS1], R[aS2] );
    in parallel with
        R[bDST] = ALUb( bOP, R[bS1], R[bS2] );
    in parallel with
        if mOP = load
            R[mR] = DataMemory[ mDISP + R[mX] ];
        elseif mOP = store
            DataMemory[ mDISP + R[mX] ] = R[mR];
        ...
        endif
    in parallel with
        if is_call_op( cOP )
            do
                R[cR] = PC + 1;
            in parallel with
                PC = bDISP + R[cX];
            enddo;
        elseif branch_condition_satisfied( cOP, R[cR] )
            PC = bDISP + R[cX];
        else
            PC = PC + 1;
        endif
    enddo
forever
Each cycle of the main loop of this instruction execution
cycle fetches an instruction and then executes all of the
components of that instruction in parallel.
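For concreteness, the same cycle can also be written as a C sketch. The vliw_t fields and the helpers alu_a, alu_b, is_call_op and branch_taken are invented names for this illustration; the sequential C code approximates the parallel register transfers by reading the old register values before writing any new ones.

    #include <stdint.h>
    #include <stdbool.h>

    typedef struct {
        int aOP, aS1, aS2, aDST;       /* ALU a fields    */
        int bOP, bS1, bS2, bDST;       /* ALU b fields    */
        int mOP, mR, mX, mDISP;        /* memory fields   */
        int cOP, cR, cX, bDISP;        /* control fields  */
    } vliw_t;

    enum { M_LOAD = 2, M_STORE = 3 };  /* illustrative encodings */

    extern int32_t R[];                /* register file           */
    extern vliw_t  code_memory[];
    extern int32_t data_memory[];
    extern int32_t alu_a(int op, int32_t s1, int32_t s2);
    extern int32_t alu_b(int op, int32_t s1, int32_t s2);
    extern bool    is_call_op(int cOP);
    extern bool    branch_taken(int cOP, int32_t cr);

    void run(void) {
        int pc = 0;
        for (;;) {
            vliw_t ir = code_memory[pc];            /* fetch     */

            /* read phase: capture the old register values        */
            int32_t a      = alu_a(ir.aOP, R[ir.aS1], R[ir.aS2]);
            int32_t b      = alu_b(ir.bOP, R[ir.bS1], R[ir.bS2]);
            int32_t link   = pc + 1;
            int32_t target = ir.bDISP + R[ir.cX];
            bool    taken  = branch_taken(ir.cOP, R[ir.cR]);

            /* write phase: all of the simultaneous transfers     */
            if (ir.mOP == M_LOAD)
                R[ir.mR] = data_memory[ir.mDISP + R[ir.mX]];
            else if (ir.mOP == M_STORE)
                data_memory[ir.mDISP + R[ir.mX]] = R[ir.mR];

            if (is_call_op(ir.cOP)) { R[ir.cR] = link; pc = target; }
            else if (taken)         { pc = target; }
            else                    { pc = pc + 1; }

            R[ir.aDST] = a;                         /* writeback */
            R[ir.bDST] = b;
        }
    }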
The VLIW instruction execution cycle outlined above requires two register transfers in sequence per cycle: the instruction fetch, followed by the parallel execution of the instruction's components. All of the interesting functional units remain idle during the instruction fetch, and the instruction memory interface remains idle during the operation of the functional units.
In order to reduce this inefficiency, we will pipeline the instruction execution cycle by treating the instruction register itself as an interstage register. The resulting instruction execution cycle can be described as follows:
repeat
    do
        IR = CodeMemory[ PC ];
    in parallel with
        R[aDST] = ALUa( aOP, R[aS1], R[aS2] );
    in parallel with
        R[bDST] = ALUb( bOP, R[bS1], R[bS2] );
    in parallel with
        if mOP = load
            R[mR] = DataMemory[ mDISP + R[mX] ];
        elseif mOP = store
            DataMemory[ mDISP + R[mX] ] = R[mR];
        ...
        endif
    in parallel with
        if is_call_op( cOP )
            do
                R[cR] = PC + 1;
            in parallel with
                PC = bDISP + R[cX];
            enddo;
        elseif branch_condition_satisfied( cOP, R[cR] )
            PC = bDISP + R[cX];
        else
            PC = PC + 1;
        endif
    enddo
forever

Now, each cycle of the instruction execution cycle executes all of its register transfers in parallel! This means that each cycle performs two simultaneous memory accesses, one an instruction fetch and one a data load or store; this is easy if the machine uses a Harvard architecture.
If we make no additional changes to the architecture, this change has grave consequences for the programmer! It means, for example, that the instruction immediately following an instruction that does a branch or call will be executed before the branch or call takes place! This is because the next instruction was fetched while the branch or call was being decoded and executed.
This is called a "branch delay slot", and is an example of a problem that all pipelined architectures introduce. The hardware designer must either make the CPU inefficient in order to prevent the programmer from being aware of the existence of the pipeline, or saddle the programmer with some very un-obvious features in order to maximize the efficiency of the hardware.
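The delay slot is easy to see in a pipelined variant of the C sketch above, where a local variable holding the newly fetched word plays the role of the interstage register. Here execute() is assumed to perform the entire execute phase of the earlier sketch and return the new PC; vliw_t and code_memory are reused from that sketch.

    extern int execute(const vliw_t *ir, int pc);

    void run_pipelined(void) {
        int pc = 0;
        vliw_t ir = {0};                    /* pipe starts with a noop */
        for (;;) {
            /* one clock cycle: in hardware both transfers happen
               simultaneously                                          */
            vliw_t next = code_memory[pc];  /* fetch stage             */
            pc = execute(&ir, pc);          /* execute stage           */
            ir = next;
            /* if execute() just took a branch, `next` was fetched
               from the old sequential path, and it will still run
               next cycle: that word sits in the branch delay slot     */
        }
    }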
The second way we can exploit parallelism is by changing the instruction set so that each instruction gives work to an ALU but does not await the result. Instead, during each instruction cycle, we store the results of the previous ALU operation while giving the ALU the next operand.
Focusing only on one ALU field of the instruction, the specification of a single operation is now spread over two instructions, as follows:
| SRC1 | SRC2 | OP | DST |

 ______ ______ ____
|______|______|____|_____          first instruction
                   |_____|         second instruction
This requires the addition of one pipeline interstage register ahead
of each ALU. This holds the operands and operation that the ALU is
to process during the next clock cycle.
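In C, this interstage register can be modeled as a small structure latched at the end of each cycle; reusing the names from the earlier sketches, the destination field of the current instruction directs where the previous instruction's result lands. The structure and function names here are invented for illustration.

    typedef struct { int op; int32_t s1, s2; } alu_latch_t;

    static alu_latch_t a_latch;   /* interstage register ahead of ALU a */

    void step_alu_a(const vliw_t *ir) {
        /* finish LAST cycle's operation, steering the result with
           THIS instruction's destination field                         */
        R[ir->aDST] = alu_a(a_latch.op, a_latch.s1, a_latch.s2);

        /* latch this instruction's opcode and operands for next cycle  */
        a_latch.op = ir->aOP;
        a_latch.s1 = R[ir->aS1];
        a_latch.s2 = R[ir->aS2];
    }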
Of course, having added one interstage register, we can now talk seriously about adding others. For example, we might have one ALU that only does add, subtract and other simple operations, while the other can multiply and divide. The simple ALU might finish each operation on the next instruction, as above, while the ALU that can multiply and divide might have several pipeline stages.
A pipelined fast multiply or divide might, for example, handle 8 partial products per pipeline stage, completing a 32-bit multiply in 4 clock cycles. As a result, the multiply-divide pipe would have to be programmed with instructions that have the following character:
| SRC1 | SRC2 | OP | DST |

 ______ ______ ____
|______|______|____|_____          first instruction
                    __ __          second
                    __ __          third
                    __ _____       fourth
                      |_____|      fifth instruction
Programmers won't like writing code for this machine, nor will they
enjoy writing compilers for it, but it does allow a good programmer
or a good compiler writer to exploit the parallelism of the machine.
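A multi-stage pipe can be modeled the same way, with a short shift register in place of the single latch. This sketch abstracts the 8-partial-products-per-stage detail into one step and simply delays the finished product by four cycles, so the DST field of the fourth instruction after the issue names where the result lands; it reuses vliw_t and R from the earlier sketches, and all other names are invented.

    #define MUL_STAGES 4

    typedef struct { int32_t result; int valid; } mul_stage_t;

    static mul_stage_t mul_pipe[MUL_STAGES];

    /* called once per instruction with the multiply unit's fields     */
    void step_multiplier(const vliw_t *ir) {
        /* the oldest operation completes; this instruction's DST
           field says where its result goes                            */
        if (mul_pipe[MUL_STAGES - 1].valid)
            R[ir->aDST] = mul_pipe[MUL_STAGES - 1].result;

        /* advance the pipe one stage                                  */
        for (int s = MUL_STAGES - 1; s > 0; s--)
            mul_pipe[s] = mul_pipe[s - 1];

        /* issue a new operation into stage 0 (the per-stage partial
           product arithmetic is abstracted into a single step)        */
        mul_pipe[0].result = (int32_t)((int64_t)R[ir->aS1] * R[ir->aS2]);
        mul_pipe[0].valid  = 1;
    }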
Any machine that allows simultaneous operations on a number of registers must support register files that allow multiple operands to be extracted at once. For example, consider the following abstraction:
                data in
 write addr   _____|_____
-------------|           |
 strobe      |           |
-----------|>|           |
             | register  |
 read addr   |   file    |   read addr
A -----------|___________|----------- B
                 |     |
       data out A       B data out
Here, the write address determines which register changes when there
is a strobe pulse. Read address A determines what data appears on
data out A, and read address B determines what data appears on
data out B.
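Behaviorally, the abstraction above is just an array with one write port and two independent read ports; here is a small C model of it (the size is arbitrary):

    #include <stdint.h>

    #define NREGS 16

    typedef struct { uint32_t reg[NREGS]; } regfile_t;

    /* on the strobe, the write address selects which register changes */
    void rf_write(regfile_t *rf, unsigned write_addr, uint32_t data_in) {
        rf->reg[write_addr % NREGS] = data_in;
    }

    /* the two read addresses select the two outputs independently     */
    uint32_t rf_read_a(const regfile_t *rf, unsigned read_addr_a) {
        return rf->reg[read_addr_a % NREGS];
    }
    uint32_t rf_read_b(const regfile_t *rf, unsigned read_addr_b) {
        return rf->reg[read_addr_b % NREGS];
    }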
We can implement this with two simpler register sets as follows:
                    data in
                        |
              ----------o----------
              |                   |
 write addr   |                   |
-------o------------------        |
       |      |          |        |
 strobe|      |          |        |
---o------------------   |        |
   |   |      |       |  |        |
   |   |  ____|____   |  |    ____|____
   |   | |         |  |  |   |         |
   |   --| waddr   |  |  ----| waddr   |
   ----|>|         |  -----|>|         |
         |    A    |         |    B    |
         | register|         | register|
read addr|  file   |         |  file   | read addr
A -------|_________|         |_________|------- B
              |                   |
    data out A                   B data out
Here, we have used two off-the-shelf register files with one
data input and one data output to make a dual port register
file with two data outputs. The two files store identical data
and allow parallel access to that data by simple duplication.
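The same duplication trick, modeled behaviorally in C on top of the regfile_t sketch above: every write is broadcast to both copies, so each copy can serve one of the two read ports (each copy needs only a single data output, as with the off-the-shelf parts).

    typedef struct { regfile_t a, b; } dual_rf_t;

    void dual_write(dual_rf_t *d, unsigned write_addr, uint32_t data_in) {
        rf_write(&d->a, write_addr, data_in);  /* same data and address */
        rf_write(&d->b, write_addr, data_in);  /* strobed into both     */
    }

    /* each read port is served by its own copy of the data            */
    uint32_t dual_read_a(const dual_rf_t *d, unsigned ra) {
        return rf_read_a(&d->a, ra);
    }
    uint32_t dual_read_b(const dual_rf_t *d, unsigned rb) {
        return rf_read_a(&d->b, rb);
    }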
Another way of realizing the same function is to build the multiport register file from the ground up. Here is an example: a small file of four registers with one write port and two read ports:
                        data in
                            |
            -----------o----o-----o-----------
            |          |          |          |
          __|__      __|__      __|__      __|__
       -|>_____|  -|>_____|  -|>_____|  -|>_____|
       |      |   |      |   |      |   |      |
 _____ |      |   |      |   |      |   |      |
|     |-      |   |      |   |      |   |      |
|     |------------      |   |      |   |      |
|     |-----------------------      |   |      |
|_____|----------------------------------      |
  |  |        |          |          |          |
  |  strobe   |          |          |          |
 write addr   |          |          |          |
         -----o-----     |          |          |
         | ----------o----          |          |
         | | ----------o-------------          |
         | | | ----------o----------------------
         | | | |   | | | |
        _|_|_|_|_ _|_|_|_|_
read addr                    read addr
A -------\_______/ \_______/-------- B
            |         |
   data out A         B data out
Both of the above implementations of multiport memory can be
easily extended to any number of ports, and both are commonly
used for small memories such as register files inside computers.
The first version is best where off-the-shelf components must be
used, while the second is easily incorporated into custom designs.
Multiport memory subsystems that allow parallel write operations are somewhat more complex.
Multiport main memory is more complex, but the above illustrations suffice, for the time being, to demonstrate that it is possible, in principle.
Notice that the multiport register file outlined here is easy to extend to support multiple read ports, but it is far more complex to extend it to multiple write ports. This leads naturally to restrictions on the use of registers by the different functional units.
The most extreme way to do this is to designate one register as the result register for each functional unit. As a result, the destination register specification fields of the instruction format are eliminated (aDST, bDST, mR for load instructions, and cR for call instructions); instead, the destination register is completely implied by the opcode. So aDST, bDST, mLOAD and cRETURN become reserved registers in the register set. All of these are available as source operands in any operation.
Some applications will function reasonably well with this limitation, but others need more registers for at least some of the functional units. Consider the following idea:
                            _______
  Source Register Number   |_|_|_|_|
                           | F | R |

                            ___
  Destination Register     |_|_|
  (F is implicit)          | R |

  F: Functional Unit
       0 0 - Call/Branch unit
       0 1 - Load/Store unit
       1 0 - ALU A
       1 1 - ALU B

  R: Destination Register for that unit
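Decoding such a register number is trivial; here is a short C sketch (the enum names are invented) that splits the 4-bit source register number into its F and R fields:

    enum funit { CALL_BRANCH = 0, LOAD_STORE = 1, ALU_A = 2, ALU_B = 3 };

    /* upper 2 bits: which functional unit owns the register */
    static inline enum funit src_unit(unsigned regnum) {
        return (enum funit)((regnum >> 2) & 3);
    }

    /* lower 2 bits: which of that unit's 4 registers */
    static inline unsigned src_reg(unsigned regnum) {
        return regnum & 3;
    }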
The 4 registers in the Call/Branch functional unit allow for
up to 4 nesting levels of function calls without the need to save
return addresses in memory. Above 4 levels, the program must start
managing a stack in RAM, but it is usually the innermost loop and the
leaf-function calls in the calling tree that determine performance,
so this is probably overkill.
The 4 registers in the Load/Store functional unit allow for up to 4 constants or operands loaded from memory before there is a need to use one of the ALUs to shuffle operands into other registers without operating on them.
Taking this into account, we arrive at the following concrete VLIW instruction set, based on the example described at the start of the previous lecture:
 _____ _____ ____ __ _____ _____ ____ __
|__4__|__4__|__3_|_2|__4__|__4__|__3_|_2| ...
| S1  | S2  | OP |D | S1  | S2  | OP |D |
|       ALUa       |       ALUb       |

     __ _____ _____ _________ ____ _____ _____ _________
... |_2|__4__|__4__|____20___|__3_|__4__|__4__|___20____|
    |OP| mR  | mX  |  mDISP  |cOP | cR  | cX  |  bDISP  |
    |         memory         |         control          |
aOP: 000 = add small constant  (R[D] = R[S1] + S2)  (also noop)
     001 = subtract from const (R[D] = S2 - R[S1])  (also negate)
     010 = add
     011 = subtract
     100 = multiply low   [low word of doubleword result]
     101 = multiply high  [high word of doubleword result]
     110 = divide
     111 = mod

bOP: 000 = add small constant  (R[D] = R[S1] + S2)
     001 = add
     010 = subtract
     011 = xor
     100 = and
     101 = or
     110 = shift right (R[D] = R[S1] >> S2)
     111 = shift left  (R[D] = R[S1] << S2)
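The bOP table transcribes directly into C (this could serve as the alu_b helper assumed in the earlier sketches). Note that for opcode 000 the second operand is the small constant from the S2 field itself, so the caller must pass the field value rather than R[S2] in that one case:

    #include <stdint.h>

    int32_t alu_b(int op, int32_t s1, int32_t s2) {
        switch (op & 7) {
        case 0:  return s1 + s2;      /* add small constant (s2 = S2)   */
        case 1:  return s1 + s2;      /* add                            */
        case 2:  return s1 - s2;      /* subtract                       */
        case 3:  return s1 ^ s2;      /* xor                            */
        case 4:  return s1 & s2;      /* and                            */
        case 5:  return s1 | s2;      /* or                             */
        case 6:  return (int32_t)((uint32_t)s1 >> (s2 & 31)); /* shr    */
        default: return s1 << (s2 & 31);                      /* shl    */
        }
    }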
mOP: 00 = load immediate   (R[mR] = mX mDISP, a 24-bit constant)
     01 = load address     (R[mR] = R[mX] + mDISP)
     10 = load from memory (R[mR] = DataMemory[ R[mX] + mDISP ])
     11 = store            (DataMemory[ R[mX] + mDISP ] = R[mR])
cOP: 0000 = test R[cR]; the top 2 bits of the cR field serve as
            an extension of the opcode (see the note below):
              00xx = call
              01xx = load immediate?
              10xx = load address?
              11xx = ?
     0001 = branch
     0010 = branch if R[cR] > 0
     0011 = branch if R[cR] < 0
     0100 = branch if R[cR] = 0
     0101 = branch if R[cR] >= 0
     0110 = branch if R[cR] <= 0
     0111 = branch if R[cR] != 0
The above instruction set includes some creativity! The
no-ops have been eliminated! To make an ALU do a no-op,
add the small constant zero to a register.
In the branch unit, note that the call instruction specifies cR as a destination, so it only uses 2 bits of this register field. Therefore, the remaining 2 bits are used as an extension to the opcode field, allowing for a second set of load immediate instructions. The option of similar extended opcode sets is open in the memory functional unit.
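As a final concreteness check, here is a sketch of extracting the four fields of one 13-bit ALU slice of the long instruction word (S1:4, S2:4, OP:3, D:2); the assumption that S1 occupies the highest bits is mine, not something the format above pins down:

    #include <stdint.h>

    typedef struct { unsigned s1, s2, op, d; } alu_fields_t;

    alu_fields_t decode_alu_slice(uint16_t slice) {
        alu_fields_t f;
        f.s1 = (slice >> 9) & 0xF;   /* bits 12..9 : S1 (source 1)    */
        f.s2 = (slice >> 5) & 0xF;   /* bits  8..5 : S2 (source 2)    */
        f.op = (slice >> 2) & 0x7;   /* bits  4..2 : OP (operation)   */
        f.d  =  slice       & 0x3;   /* bits  1..0 : D  (destination) */
        return f;
    }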