24. VLIW Architecture
the 22C:122/55:132 Lecture Notes for Spring 2004
One way to take advantage of parallelism in an architecture is to directly expose that parallelism to the programmer. This is most common in DSP systems, particularly those with what are known as VLIW or Very Long Instruction Word architectures.
The basic idea is to simply allow each instruction to directly utilize each of a number of functional units within the CPU. Consider, for example, the following instruction format for a machine with two ALUs and operand port to memory:
_____ _____ _____ ______ _____ _____ _____ ______ __ |_____|_____|_____|______|_____|_____|_____|______| ... | aS1 | aS2 | aOP | aDST | bS1 | bS2 | bOP | bDST | | ALUa | ALUb | __ ___ _____ _____ _________ _____ _____ _____ _________ ... |___|_____|_____|_________|_____|_____|_____|_________| |mOP| mR | mX | mDISP | cOP | cR | cX | bDISP | | memory | control | aOP, bOP: add, subtract, multiply, divide, and, or, etc S1 and S2 are source registers DST is the destination register mOP: load, store, load immediate address R is the register to load or store X is the index register DISP is the displacement cOP: branch, call, branch if positive, branch if zero, etc. R is the register to test for conditional ops R gets the return address for call X is the index register DISP is the displacement
Here, each instruction has 4 major fields: The ALUa and ALUb fields control the function of two ALUs. In one instruction cycle, each ALU may take two operands (S1 and S2), combine them using an operation OP, and deliver the result to a destination DST.
Each instruction may also perform a memory operation, loading or storing the contents of register R in a memory location computed by adding the contents of register X to the displacement DISP. Other memory field operations might include load immediate address or perhaps load immediate, the latter would use X concatenated with DISP as a slightly longer immediate constant than could be formed using the load immediate address operation.
It takes a fair number of registers to allow efficient use of an architecture such as this. There are 10 register fields in each instruction word, so merely allowing all operands to be independent requires that the machine have at least 10 registers. On a machine with 3-registers per instruction word, it is common to find uses for 8 to 16 registers, so we can guess that programmers want about 3 times as many registers as are referenced in a single instruction, suggesting that this machine might need 30 or even 60 registers.
Assuming the register fields are all 6 bits each, allowing 64 registers, that the op fields are all 4 bits each, and that the addressing displacements are all 12 bits, we have 10×6 bits for registers, 4×4 bits for operation specification, and 2×12 bits for displacement, or 96 bits per instruction! This is big, but that is what the name VLIW suggests!
Note that each instruction in the example format may read from as many as 8 registers (aS1, aS2, bS1, bS2, mR on store, mX, cR, cX) and each instruction may update as many as 4 registers (aDST, bDST, mR on load, cR on call, assuming that a call instruction saves the return address in the designated register). Thus, we need an extraordinary number of parallel accesses to the registers in the CPU, but this problem is a solved problem!
This instruction set also assumes that every instruction will be a memory reference instruction, although it is likely that there will be no-op opcodes in each functional unit's field of the instruction word. Therefore, we will only achieve peak performance if we can perform two simultaneous memory references for each instruction.
Several aspects of VLIW instruction set design can be evaluated in exactly the same way as a conventional architecture. For example, the question of how many registers should be included works out in the same way, as do questions about the division of the register set between general purpose registers and special purpose registers such as index registers, floating point registers, and so on.
The one aspect of a VLIW instruciton set that requires new approaches to evaluation is the question of how many distinct functional units should be included in the machine!
With conventional architectures, the presence of anything like a functional unit is hidden from the programmer; such units may lurk inside the machine, but their number may change from one implementation of the architecture to the next. With a VLIW architecture, their presence is exposed, so their number must be fixed for all implementations of the architecture.
It is clear that not all functional units will be needed in every instruction. Therefore, there must be at least one "no-op" operation available to each functional unit. If this is an explicit no-op, evaluation is somewhat simplified, but pointless computations on otherwise unneeded instructions are effective no-ops, as are branches to the next instruction in sequence.
The question an evaluator can ask is: In well optimized code, what fraction of instructions for each functional unit are no-ops? If the fraction is high, the instruction set is out of balance. If the fraction is low, that functional unit is fully utilized. If the fraction of no-ops is uniformly distributed between the different functional units, the architecture is probably well balanced.
This is probably because our design has too many ALUs relative to the other classes of functional units.
This may be because the design has too few ALUs; sequences of instructions are being used instead of simple instructions.
This may be because the design has too few ALUs, as above.
Perhaps these fields should be combined, so each instruction can either perform a memory operation or a branch, but not both.
In general, a good way to begin any architectural project is to examine real programs, looking at the statistics for the operations used. The example VLIW architecture presented here is well matched to programs that have approximately as many memory references as branches, and typically two arithmetic operations per memory reference.
The set of functions each functional unit is able to perform may be more varied than an initial inspection suggests. For example, the memory reference functional unit of the example VLIW architeture will certainly include these operations:
Note that one of these, the load address function, is really an arithmetic function, using the indexed addressing capability of this functional unit to do a register to register add immediate. Indexed addressing requires that the memory reference functional unit include an adder, and once we know it's there, we may as well expose it, allowing its use for other addition operations when there is no need for a memory reference, and this allows the memory reference functional unit can be used to carry some of the load of the ALUs.
The branch functional unit is also likely to include indexed addressing, because a general indexed branch instruction is needed, for example, in the execution of case-select and function return operations.
We can easily imagine scaling this example architecture down to one where a single functional unit could perform either a memory reference or a branch, but not both, and we could easily scale it down to have just one arithmetic functional unit.
Consider the following problem:
t = 0 for i = 1 to 10 do t = t + a[i]*b[i]
Here, t is the vector dot product of the 10-vectors a and b. Translating this to good code for our example VLIW instruction set is not easy! A first crude effort might be:
-- ALUa ALUb memory control - - Rt = 0; - - - Ri = 1; - LP: - - Ra = a[Ri]; - - - Rb = b[Ri]; - Ra = Ra * Rb; - - - Rt = Rt + Ra; - - - - - Ri = Ri + 1; - - - Ra = Ri + -10; - - - - if Ra < 0, goto LP;
Here, we simply did a brute-force literal translation of the original to machine code, and in the process, we never used more than one functional unit in any instruction. Optimization of this kind of code is difficult but can have a huge payoff! Consider the following equivalent code:
-- ALUa ALUb memory control - - Rt = 0; - - - Ri = 1; - - - Rj = 10; - - - R1 = 1; - - Rj = Rj - R1; Ra = a[Ri]; - - Ri = Ri + R1; Rb = b[Ri]; - LP:Rc = Ra * Rb; Rj = Rj - R1; Ra = a[Ri]; - Rt = Rt + Rc; Ri = Ri + R1; Rb = b[Ri]; if Rj > 0, goto LP; Rc = Ra * Rb; - - - Rt = Rt + Rc; - - -
In the above, we added a new loop control variable, Rj, used to count from 10 down to 0. This is because testing for zero is a very common feature of conditional branches, while comparison with an immediate constant is difficult. The second thing we did was to "pipeline" the iteration, so that each iteration of the main loop fetches the operands from one vector element while it multiplies the previous vector elements and adds them to the sum.
This reduces the loop to just two instructions, with only one no-op! Note that the 4 functional units operate in parallel! Therefore, the results from a computation in one functional unit are not available to another functional unit until the next instruction. Incrementing Ri, for example, in the same instruction as a fetch from b[Ri], will use the old value for indexed addressing while computing a new value for use in the next iteration.
Because of the pipelined execution of the loop, with Ra and Rb serving as "interstage registers" for this software pipeline, we needed to add a pair of instructions to the loop prologue to start the first iteration, filling the pipeline before the loop begins, and we had to add a pair of instructions making up a loop epilogue to finish up what was in the pipeline after the loop terminates.
In a typical program, there would be many blocks of code like this; when splicing one such block of code to the next, it is usually possible to overlap the loop prologue of one block with the loop epilogue of the next. The above example illustrates this possibility nicely, because the set of functional units used in the prologue does not overlap the set of functional units used in the epilogue.
Our example architecture included no provisions for small immediate constants as operands for the two ALUs, so the amount by which the array index and loop counter are adjusted for each iteration is stored in a register, R1, holding the constant 1 for the duration of the loop. A set of ALU operations that interpreted one of the operand register select fields as a small constant instead of a register number whould have been slightly more elegant.
Also, notice in the above that one of the ALUs was used for loop control variable updates, an operation that is likely to involve short integers, while the other ALU is used for all computations involved in producing the actual result, operations likely to involve long integers or floating point variables. This strongly suggests that it would be quite reasonable to give the two ALU's different word lengths and different sets of operators! Of course, it would be wrong to judge the architecture on just this one problem! The decision to move forward with such a design should rest on examination of a large suite of programs!
This example illustrates an important problem with all high performance architectures! Naive machine-language code rarely makes full use of the parallelism inherent in the CPU design. Clear easy-to-read assembly language code for such machines rarely performs well, while code that performs well is frequently very convoluted and hard to understand and maintain. As a result, it is quite common to find that the output of a good high-quality compiler outperforms hand crafted assembly language, because the compiler is not responsible for generating clear and easy to maintain object code, while the assembly language programmer is almost always under this obligation.
Two major classes of computer architectures emerged in the 1940's, Harvard architectures, named after the Harvard series of relay calculators developed by Aiken at Harvard (the Harvard Mark I, Mark II, Mark III and Mark IV machines) and the Von Neumann or Princeton machines. The latter are characterized by a single main memory, holding both program and data, while the former have two separate memory spaces, one for code and one for data.
Harvard Architecture ________ _____ ________ | PROGRAM| | | | DATA | | MEMORY |=====| CPU |=====| MEMORY | |________| |_____| |________| Princeton Architecture _____ ________ | | | | | CPU |=====| MEMORY | |_____| |________|
Essentially all general purpose computers today appear to be Princeton or Von Neumann machines, from the programmer's perspective, but this appearance is frequently misleading -- modern high-performance CPU's are, at their heart, frequently designed in the Harvard spirit, with added hardware outside the heart of the CPU to create the appearance of a princeton design.
Many microcontrollers and DSP (digital signal processor) designs are pure Harvard architectures. This frequently extends to such things as a word-size for the program memory that is unrelated to the word-size for the data memory.
For example, the Microchip 16Cxx family of microcontrollers has a 14 bit program word and an 8-bit data word. The data memory of this microcontroller is very small, with a maximum of 422 bytes of uncommitted RAM (out of a 512 byte address space; the rest of the address space is reserved for various special purpose registers such as device interfaces). The program memory for processors in this family is limited to 8K 14-bit words, all of which is ROM.
DSP systems, whether they are DSP chips on a modern microprocessor or DSP peripherals on a minicomputer of 1973, are designed to be run as adjuncts to a general purpose processor. As such, the DSP relies on the general purpose processor for all support functions, and is not designed to support a resident operating system or other support software. The host computer typically accesses the DSP's program and data memories as special peripheral devices.
The basic instruction execution cycle of a VLIW machine is straightforward. Consider the example format given at the start of this section of the notes:
repeat IR = CodeMemory[ PC ]; do R[aDST] = ALUa( aOP, R[aS1], R[aS2] ); in parallel with R[bDST] = ALUa( bOP, R[bS1], R[bS2] ); in parallel with if mOP = load R[mR] = DataMemory[ mDISP + R[mX] ]; elseif mOP = store DataMemory[ mDISP + R[mX] ] = R[mR]; ... endif in parallel with if is_call_op( cOP ) do R[cR] = PC + 1; in parallel with PC = bDISP + R[cX]; enddo; elseif branch_condition_satisfied( cOP, R[cR] ) PC = bDISP + R[cX] else PC = PC + 1; endif enddo forever
Each cycle of the main loop of this instruction execution cycle fetches an instruction and then executes all of the components of that instruction in parallel.
The VLIW instruction execution cycle outlined above required two register transfers in sequence per cycle. All of the interesting functional units remain idle during the instruction fetch, and the instruction memory interface remains idle during the operation of the functional units.
In order to reduce this inefficiency, we will pipeline the instruction execution cycle by treating the instruction register itself as an interstage register. The resulting instruction execution cycle can be described as follows:
repeat do IR = CodeMemory[ PC ]; in parallel with R[aDST] = ALUa( aOP, R[aS1], R[aS2] ); in parallel with R[bDST] = ALUa( bOP, R[bS1], R[bS2] ); in parallel with if mOP = load R[mR] = DataMemory[ mDISP + R[mX] ]; elseif mOP = store DataMemory[ mDISP + R[mX] ] = R[mR]; ... endif in parallel with if is_call_op( cOP ) do R[cR] = PC + 1; in parallel with PC = bDISP + R[cX]; enddo; elseif branch_condition_satisfied( cOP, R[cR] ) PC = bDISP + R[cX] else PC = PC + 1; endif enddo forever
Now, each cycle of the instruction execution cycle executes all of its register transfers in parallel! This means that each cycle peforms two simultaneous memory accesses per cycle, one an instruction fetch and one a data load or store; this is easy if the machine uses a Harvard architecture.
If we make no additional changes to the architecture, this change has grave consequences for the programmer! It means, for example, that the instruction immediately following an instruction that does a branch or call will be executed before the branch or call takes place! This is because the next instruction was fetched while the branch or call was being decoded and executed.
This is called a "branch delay slot", and is an example of a problem that all pipelined architectures introduce. The hardware designer must either make the CPU inefficient in order to prevent the programmer from being aware of the existance of the pipeline, or the designer will saddle the programmer with some very un-obvious features in order to maximize the efficiency of the hardware.
The second way we can exploit parallelism is by changing the instruction set so that each instruction gives work to an ALU but does not await the result. Instead, during each instruction cycle, we store the results of the previous ALU operation while giving the ALU the next operand.
Focusing only on one ALU field of the instruction, the specification of a single operation is now spread over two instructions, as follows:
| SRC1 | SRC2 | OP | DST | ______ ______ ____ |______|______|____|_____ first instruction |_____| second instructionThis requires the addition of one pipeline interstage register ahead of each ALU. This holds the operands and operation that the ALU is to process during the next clock cycle.
Of course, having added one interstage register, we can now talk seriously about adding others. For example, we might have one ALU that only does add, subtract and other simple operations, while the other can multiply and divide. The simple ALU might finish each operation on the next instruction, as above, while the ALU that can multiply and divide might have several pipeline stages.
A pipelined fast multiply or divide might, for example, handle 8 partial products per pipeline stage, completing a 32 bit multiply in 4 clock cycles. As a result, the multiply-divide pipe would have to be programmed with instructions that have the following character:
| SRC1 | SRC2 | OP | DST | ______ ______ ____ |______|______|____|_____ first instruction __ __ second __ __ third __ _____ fourth |_____| fifth instructionProgrammers won't like writing code for this machine, nor will they enjoy writing compilers for it, but it does allow a good programmer or a good compiler writer to exploit the parallelism of the machine.