22C:122, Lecture Notes, Lecture 21, Fall 1999

Douglas W. Jones
University of Iowa Department of Computer Science

Very Long Instruction Word Architectures
One way to take advantage of parallelism in an architecture is to directly expose that parallelism to the programmer. This is most common in DSP systems, particularly those with what are known as VLIW or Very Long Instruction Word architectures.
The basic idea is to simply allow each instruction to directly utilize each of a number of functional units within the CPU. Consider, for example, the following instruction format:
```
	 _____ _____ _____ ______ _____ _____ _____ ______ 
	|_____|_____|_____|______|_____|_____|_____|______| ...
	| aS1 | aS2 | aOP | aDST | bS1 | bS2 | bOP | bDST |
	|          ALUa          |          ALUb          |
	     ___ _____ _____ _________ _____ _____ _____ _________ 
	... |___|_____|_____|_________|_____|_____|_____|_________|
	    |mOP| mR  | mX  |  mDISP  | cOP | cR  | cX  |  cDISP  |
	    |          memory         |          control          |

	aOP, bOP:  add, subtract, multiply, divide, and, or, etc

	mOP:       load, store, load immediate

	cOP:       branch, call, branch if positive, branch if zero, etc.
	
```
Here, each instruction has 4 major fields: The ALUa and ALUb fields control the function of two ALUs. In one instruction cycle, each ALU may take two operands (S1 and S2), combine them using an operation OP, and deliver the result to a destination DST.
Each instruction may also perform a memory operation, loading register R from, or storing register R to, the memory location whose address is computed by adding the contents of register X to the displacement DISP. Other memory field operations might include load immediate, using the combination X|DISP as an immediate constant.
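The memory field semantics described above can be sketched in Python. This is only an illustration: the function name `memory_op`, the operation mnemonics, the use of a dictionary for memory, and the 12-bit displacement width are assumptions, as is the reading of X|DISP as the X field concatenated above the DISP field.

```python
def memory_op(op, regs, memory, r, x, disp, disp_bits=12):
    """Perform one memory-field operation (hypothetical model)."""
    if op == "load":
        # Effective address is the contents of register X plus DISP.
        regs[r] = memory.get(regs[x] + disp, 0)
    elif op == "store":
        memory[regs[x] + disp] = regs[r]
    elif op == "load_immediate":
        # Assumption: the X field supplies the high bits of the constant.
        regs[r] = (x << disp_bits) | disp
    else:
        raise ValueError("unknown memory operation: " + op)

regs = [0] * 64
memory = {}
regs[1], regs[2] = 9, 100
memory_op("store", regs, memory, r=1, x=2, disp=4)   # memory[104] = 9
memory_op("load", regs, memory, r=3, x=2, disp=4)    # regs[3] = memory[104]
memory_op("load_immediate", regs, memory, r=5, x=1, disp=2)
```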
It takes a fair number of registers to allow efficient use of an architecture such as this. Assuming the register fields are all 6 bits each, allowing 64 registers, that the op fields are all 4 bits each, and that the addressing displacements are all 12 bits, we have 10×6 = 60 bits for registers, 4×4 = 16 bits for operation specification, and 2×12 = 24 bits for displacement, or 100 bits per instruction! This is big, but that is what the name VLIW suggests!
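As a check on the arithmetic, the following Python sketch adds up the field widths under the assumptions just stated and packs a set of field values into a single instruction word. The names `FIELDS`, `instruction_width`, `pack` and `unpack` are hypothetical, and placing the first field in the high-order bits is an arbitrary choice made for illustration.

```python
# Field widths in bits, from the assumptions in the text.
REG_BITS, OP_BITS, DISP_BITS = 6, 4, 12

# Fields in instruction order. Names follow the format diagram, with the
# two displacement fields written here as mDISP and cDISP.
FIELDS = [
    ("aS1", REG_BITS), ("aS2", REG_BITS), ("aOP", OP_BITS), ("aDST", REG_BITS),
    ("bS1", REG_BITS), ("bS2", REG_BITS), ("bOP", OP_BITS), ("bDST", REG_BITS),
    ("mOP", OP_BITS), ("mR", REG_BITS), ("mX", REG_BITS), ("mDISP", DISP_BITS),
    ("cOP", OP_BITS), ("cR", REG_BITS), ("cX", REG_BITS), ("cDISP", DISP_BITS),
]

def instruction_width(fields=FIELDS):
    """Total instruction size in bits."""
    return sum(width for _, width in fields)

def pack(values, fields=FIELDS):
    """Pack a dict of field values into one integer; missing fields are 0."""
    word = 0
    for name, width in fields:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(name + " does not fit in its field")
        word = (word << width) | value
    return word

def unpack(word, fields=FIELDS):
    """Split an instruction word back into its named fields."""
    values = {}
    for name, width in reversed(fields):
        values[name] = word & ((1 << width) - 1)
        word >>= width
    return values

print(instruction_width(), "bits per instruction")
```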
VLIW and Pipelined Execution
Typical VLIW instruction sets are pipelined at two levels:
First, the CPU is typically a Harvard architecture, that is, the CPU has two separate memory ports, one for instruction fetch and one for operand load and store. If we want to keep both memory ports busy, we can do this by fetching the next instruction in parallel with execution of the most recently fetched instruction.
The second way we can exploit parallelism is by changing the instruction set so that each instruction gives work to an ALU but does not await the result. Instead, during each instruction cycle, we store the result of the previous ALU operation while giving the ALU its next operands.
Focusing only on one ALU field of the instruction, the specification of a single operation is now spread over two instructions, as follows:
```
	| SRC1 | SRC2 | OP | DST |
         ______ ______ ____       
	|______|______|____|_____  first instruction
	                   |_____| second instruction
	
```
This requires the addition of one pipeline interstage register ahead of each ALU. This holds the operands and operation that the ALU is to process during the next clock cycle.
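A minimal Python sketch of this behavior follows. The class `PipelinedALU` and the operation mnemonics are hypothetical, and the model assumes that operands are read before the pending result is written back, that is, that there is no bypass path from the ALU output to its inputs.

```python
import operator

# A few of the ALU operations named earlier; the mnemonics are illustrative.
OPS = {"add": operator.add, "sub": operator.sub,
       "and": operator.and_, "or": operator.or_}

class PipelinedALU:
    """One ALU fed through a single interstage register (hypothetical model)."""

    def __init__(self, num_regs=64):
        self.regs = [0] * num_regs
        self.latch = None  # interstage register: (op, operand1, operand2)

    def step(self, src1, src2, op, dst):
        """Execute the ALU field of one instruction."""
        # Read the new operands first (assumption: no bypass path).
        next_latch = (op, self.regs[src1], self.regs[src2])
        if self.latch is not None:
            prev_op, a, b = self.latch
            # DST of this instruction receives the result of the
            # operation started by the previous instruction.
            self.regs[dst] = OPS[prev_op](a, b)
        self.latch = next_latch

alu = PipelinedALU()
alu.regs[1], alu.regs[2] = 5, 7
alu.step(1, 2, "add", 0)   # starts 5 + 7; nothing is pending, DST is ignored
alu.step(1, 1, "sub", 3)   # stores 5 + 7 in register 3, starts 5 - 5
alu.step(0, 0, "add", 4)   # stores 5 - 5 in register 4
```

Under these assumptions, a result only becomes visible to the source fields of the instruction after the one whose DST field stores it.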
Of course, having added one interstage register, we can now talk seriously about adding others. For example, we might have one ALU that only does add, subtract and other simple operations, while the other can multiply and divide. The simple ALU might finish each operation on the next instruction, as above, while the ALU that can multiply and divide might have several pipeline stages.
A pipelined fast multiply or divide might, for example, handle 8 partial products per pipeline stage, completing a 32 bit multiply in 4 clock cycles. As a result, the multiply-divide pipe would have to be programmed with instructions that have the following character:
```
        | SRC1 | SRC2 | OP | DST |
         ______ ______ ____
        |______|______|____|_____  first instruction
         __                    __  second
         __                    __  third
         __                 _____  fourth
                           |_____| fifth instruction
        
```
Programmers won't like writing code for this machine, nor will they enjoy writing compilers for it, but it does allow a good programmer or a good compiler writer to exploit the parallelism of the machine.
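The timing of the multiply pipe can be modeled with a short Python sketch. The class `MultiplyPipe` is hypothetical, and for simplicity it computes the whole product at issue and merely delays it, rather than forming 8 partial products in each stage as real hardware would.

```python
from collections import deque

class MultiplyPipe:
    """Multiplier whose result emerges 4 instructions after issue (a sketch)."""

    def __init__(self, stages=4, num_regs=64):
        self.regs = [0] * num_regs
        # One slot per pipeline stage; None marks an empty stage.
        self.pipe = deque([None] * stages)

    def step(self, src1, src2, dst):
        """Issue regs[src1] * regs[src2]; retire the oldest product to dst."""
        a, b = self.regs[src1], self.regs[src2]
        finished = self.pipe.popleft()
        if finished is not None:
            self.regs[dst] = finished
        self.pipe.append(a * b)

mul = MultiplyPipe()
mul.regs[1], mul.regs[2] = 6, 7
mul.step(1, 2, 0)        # first instruction: SRC1, SRC2 and OP take effect
for _ in range(3):
    mul.step(0, 0, 0)    # second through fourth: the product is in flight
before = mul.regs[10]
mul.step(0, 0, 10)       # fifth instruction: its DST receives the product
```

The programmer or compiler must therefore place the destination for each multiply four instructions after the instruction that supplied its operands.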

Multi-port Register Files

Any machine that allows simultaneous operations on a number of registers must support register files that allow multiple operands to be extracted at once. For example, consider the following abstraction:

                      data in
        write addr  _____|_____ 
              -----|           |
            strobe |           |
              -----|>          |
                   |  register |
        read addr  |    file   |  read addr
            A -----|___________|----- B
                     |       |
            data out A       B data out

Here, the write address determines which register changes when there is a strobe pulse. Read address A determines what data appears on data out A, and read address B determines what data appears on data out B.

We can implement this with two simpler register sets as follows:

                               data in
                                  |
        write addr        --------o---------
             ----o-------|----------        |
                -|-------|--------  |       |
               | |  _____|_____   | |  _____|_____
               |  -|           |  |  -|           |
        strobe |   |           |  |   |           |   
             --o---|>    A     |   ---|>    B     |
                   |  register |      |  register |
        read addr  |    file   |      |    file   |
            A -----|___________|    --|___________|
                         |         |        |       read addr
                         |          --------|---------  B
                         |                  |
                data out A                  B data out

Here, we have used two off-the-shelf register files with one data input and one data output to make a dual port register file with two data outputs. The two files store identical data and allow parallel access to that data by simple duplication.
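The duplication scheme can be sketched in Python as follows. Both class names are hypothetical, and a method call stands in for the strobe pulse.

```python
class SimpleRegisterFile:
    """Stand-in for an off-the-shelf file: one data input, one data output."""

    def __init__(self, size):
        self._cells = [0] * size

    def write(self, addr, data):
        self._cells[addr] = data

    def read(self, addr):
        return self._cells[addr]

class DualReadRegisterFile:
    """Two identical copies give two independent read ports."""

    def __init__(self, size=64):
        self._file_a = SimpleRegisterFile(size)
        self._file_b = SimpleRegisterFile(size)

    def write(self, addr, data):
        # Write address, data in and strobe are wired to both copies,
        # so the two files always hold identical contents.
        self._file_a.write(addr, data)
        self._file_b.write(addr, data)

    def read(self, addr_a, addr_b):
        # Each copy serves one read address, so both reads proceed at once.
        return self._file_a.read(addr_a), self._file_b.read(addr_b)

rf = DualReadRegisterFile()
rf.write(3, 111)
rf.write(9, 222)
print(rf.read(3, 9))
```

The cost of this approach is one full copy of the storage per read port.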

Another way of realizing the same function is to build the multiport register file starting from the ground up. Here is an example register file holding 4 registers, with two read ports:

                                     data in
                                        |
        write addr        ---------o----o----o--------- 
            -----      __|__     __|__     __|__     __|__
                 |   -|>____|  -|>____|  -|>____|  -|>____|
                 |  |    |    |    |    |    |    |    |
                 /|-     |    |    |    |    |    |    |
        strobe  | |------|----     |    |    |    |    |
            ----| |------|---------|----     |    |    |
                 \|------|---------|---------|----     |
                         |         |         |         |
                         |        -|---------|-------o- 
                         |      -|-|---------o-----  |  
                         |    -|-|-o-------------  | |  
                          -o-|-|-|-------------  | | |
               read addr  _|_|_|_|_           _|_|_|_|_  read addr
                   A -----\_______/           \_______/----- B
                              |                   |
                     data out A                   B data out

Both of the above implementations of multiport memory can be easily extended to any number of read ports, and both are commonly used for small memories such as register files inside computers. The first version is best where off-the-shelf components must be used, while the second is easily incorporated into custom designs.
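The ground-up design amounts to one decoder steering the strobe to a single register, plus one multiplexer per read port. A Python sketch under those assumptions follows; the names `decode`, `mux` and `RegisterFile` are hypothetical.

```python
def decode(addr, size):
    """Address decoder: enables exactly one of the strobe lines."""
    return [i == addr for i in range(size)]

def mux(addr, inputs):
    """Multiplexer: passes the selected register output to a read port."""
    return inputs[addr]

class RegisterFile:
    """One write port and any number of read ports (a sketch)."""

    def __init__(self, size=4, read_ports=2):
        self.registers = [0] * size
        self.read_ports = read_ports

    def strobe(self, write_addr, data_in):
        # Data in reaches every register, but only the register whose
        # strobe line is enabled by the decoder actually loads it.
        enables = decode(write_addr, len(self.registers))
        for i, enabled in enumerate(enables):
            if enabled:
                self.registers[i] = data_in

    def read(self, *addrs):
        # One multiplexer per port, all sharing the same register outputs.
        if len(addrs) != self.read_ports:
            raise ValueError("one address is needed per read port")
        return tuple(mux(addr, self.registers) for addr in addrs)

rf4 = RegisterFile(size=4, read_ports=2)
rf4.strobe(2, 17)
rf4.strobe(0, 5)
print(rf4.read(2, 0))
```

Adding a read port here costs only another multiplexer, not another copy of the registers.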

Multiport memory subsystems that allow parallel write operations are somewhat more complex.

Multiport main memory is more complex, but the above illustrations suffice, for the time being, to demonstrate that it is possible, in principle.