17. Systematic Instruction Set Design

Information Theory

One problem the designer of any insturction set must answer is, How many bits shoul be assigned to each field in each variation of the instruction word. This is a question of information encoding, and the theory relevant to this is information theory.

Information theory deals with the information content of messages. Both information content and message size are measured in the same unit, and this leads to problems, so we will distinguish between bits of information on the one hand and either bits of representation or binary digits on the other hand.

Claude Shannon, the developer of information theory, coined the word bit as an abbreviation for binary digit in 1948, and proposed that, where the difference between bits of information and binary digits matters, the term binary digits should be used for the digit count, perhaps abbreviated binits, while the word bits should be used to indicate the quantity of information. Sadly, today, nobody uses the word binits, the phrase binary digits is rare, and we use the word bits ambiguously.

Shannon's definition of the information content of a message, in bits, assumes the following givens:

• M -- the message.
• P(M) -- the probability of M in the universe of all messages.

The information content, or entropy, in this context, is:

• H(M) = log2( 1/P(M) )

So, for example, the if the probability of the message "I am sick" is 1/8, the entropy H("I am sick") is log2( 1/(1/8) ) bits or log2(8) bits which is 3 bits.

If you just received M1="my body temperature is above normal", you can guess that I have a fever and therefore that I am sick. Here, you are relying not on the unconditional probability M2="I am sick", P(M2), but rather on the conditional probability P(M2|M1), the probability of M2 in the context of M1. In fact, I could have an above-normal temperature because I just came out of a sauna, so P(M2|M1) is not 1, but it is certainly higher than P(M2).

Note that the entropy, H, of a message need not be an integer. It is perfectly reasonable to say that H("I have a temperature") is 2.75 and that the conditional entropy of the second message, "I am sick" is 0.25.

If we have messages composed from an alphabet, for example, messages composed of letters of the Roman alphabet, or messages composed of instructions from an instruction set, we can measure the entropy of the message in many ways. The simplest, H0 or the zeroth order entropy, rests on the unconditional probabilities of the letters.

• M -- a message composed of n=|M| letters, m1 to mn
• P(mi) -- the probability the ith letter in M
• H(mi) = log2( 1/P(mi) ) -- the entropy of each letter of M
• H0(M) = sum from 1 to |M| of H(mi)

That is, the entropy of a string is the sum of the entropies of the letters of the string.

Higher order entropies take context into account, so H1, the first order entropy, is based on P(mi|mi-1) for P(mi). That is, we use the probability of each letter in the context of its predecessor instead of the raw probability of each letter.

If we use context, we generally get a smaller measure of the entropy. For example, the first order entropy is:

• H1(M) = sum from 1 to |M| of log2(1/P(mi|mi-1))

and the second order entropy is

• H2(M) = sum from 1 to |M| of log2(1/P(mi|mi-2mi-1)

In the above, we have to deal with boundary conditions involving the predecessors of m1. We deal with this artificially by considering the message to be preceeded by an infinite string of characters outside the alphabet, so that P(m1|m0) is the probability of m1 as the first character of the message.

For an example, in English, the probability of the letter e, P(e) is fairly high (0.12706 in a novel I once measured, so it carries about 3.5 bits of information), but the probability of e in the context of a preceeding q, P(e|q) is near zero, so if an e follows a q, it carries an immense amount of information because it reports an extremely unlikely event. On the other hand, the probability of the letter u is about 1/4 that of e (P(u)=0.02997, so it carries about 5.5 bits of information), and the letter q is the rarest letter in the alphabet (P(q)=0.00070, so it carries over 10 bits of information). The probability of a u after a q in English, P(u|q) is near unity, so finding a u after a q carries almost no information.

In general, data compression algorithms seek to compress the representations of S to H(S) bits, and the better compression algorithms take into account quite a bit of context, so they do better than H0. No data compression algorithm can compress to better than H(S) for an arbitrary S. If it does so for some particular string S, it must do significantly worse than H(S) for a sufficient number of other strings that the expected value of the compressed length of the output is greater than or equal to H(S).

Variable-Length Codes

Unambiguous variable length codes are usually organized as prefix codes. Consider the following example for an alphabet of 7 letters:

```	A = 00
B = 010
C = 011
D = 10
E = 110
F = 1110
G = 1111
```

The string DECADE in this alphabet would be encoded as 101100110010110.

Prefix codes are called this because no codeword in such a code is a prefix of any other codeword.

Any prefix code can be described by a trie, or decoding tree. The trie for the above code is:

```			     .
0/ \1
/   \
/     \
.       .
0/ \1   0/ \1
A   .   D   .
0/ \1   0/ \1
B   C   E   .
0/ \1
F   G
```

If our representation is in base n, the trie for decoding a prefix code will be an n-ary tree. To decode the code, start at the root, following the labeled branches that match successive digits of the code, until you reach the leaf. Record the character in the leaf and then start back at the root.

Huffman discovered a coding algorithm to optimize the binary prefix code representation of a finite alphabet where each letter has a probability that is known in advance. Huffman's algorithm assigns the probability to the letter as a weight, and then produces a weight-balanced trie. The weight of the entire trie is 1 (the sum of all the probabilities), and the weights of the two subtrees are as close to 1/2 as is possible. Huffman's algorithm produces a Huffman code, and it is fairly easy to prove that a Huffman code will compress text in the alphabet being used to within 1 bit per letter of H0, and that no fixed prefix code can do better than this.

Application to Instruction Set Design

The prefix codes that interest us in the world of instruction coding are prefix codes with radix higher than 2. So, if our machine has instructions that are a multiple of 8 bits in length, we might view the instruction set as a prefix code in radix 256. this would be an appropriate way to view the instruction set of machines such as the DEC VAX or the Intel 80x86 family.

When we look at the VAX or PDP-11 instruction set from this perspective, we note, immediately, that the orthogonality of these instruciton sets forces the coding to be sub-optimal. The problem is, orthogonality has forced us to offer efficient encodings of instructions that are never needed (except perhaps as curiosities). Consider, on the PDP-11, a move memory to memory instruction using autodecrement addressing on the program counter for both operands. This compact 16-bit instruction does nothing useful -- except to provide a delightful exam question for intro assembly language students studying that machine.

Bits per Instruction Field

The instruction formats for many machines are composed of many fields. For example, we speak of the instruction code or opcode field, register select fields, addressing mode fields, and so on. The CPU, as it reads the instruction stream, parses the instruction into its respective fields using context. Any analysis of the coding of the instruction should take advantage of this same context!

Therefore, we do not use the simple probability P(x) for each field of the instruction -- that is, the probability that some instruction field will be x, out of context, but rather, we use the conditional probability, P(x|c), the probability that, in context c, the field will be x.

Typically, opcodes are considered out of context -- that is, we do not allow the opcode x to be encoded differently depending on the previous instruction. This would make the CPU far too complex. On the other hand, all other instruciton elements are encoded in the context of their opcode.

In fact, on the PDP-11, as the designers ran out of opcodes for uncommon instructions, they intuitively borrowed bits from operand fields to extend the opcode field, shrinking the operand fields and enlarging the opcode field in exactly the way that information theory suggests we might want to do for coding of rare operations. Thus, it is the non-orthogonal parts of the PDP-11 instruction set that provide the best illustrations of efficient instruction design, from an information theoretic perspective!

Given that we have already selected the set of opcodes we wish to implement, assignment of optimal encodings to them is straightforward: For each opcode Oi we simply assign

ciel( log2(1/P(Oi)) ) bits

And, we construct a prefix code for the opcodes. To do this, of course, we need statistical information about the frequency of different operations. Systematic efforts to gather such information date back to the 1950s, but until the 1970s, this information was not used to any great extent in instruction set design. As this information began to appear for competing machines with overall similarities, the realization of the inefficiency of many instruction sets began to be widely recognized. Simple features like compact encodings of small constants proved to have an immense impact, and it was recognized fairly quickly that excessive orthogonality and complex addressing modes did not lead to efficient encoding.

The VAX actually tried to respond to these observations by providing short encodings for constants and by providing short displacement fields for indexing, but by 1980, it was clear that the fastest code generated by compilers for the VAX rarely used memory to memory instructions, but rather, tended to load operands in registers, operate on them there, and then store results. Thus, most of the orthogonality of the VAX was wasted!

In theory, we could use completely different encodings for the addressing and register fields of each instruction, and occasionally, this can lead to significant compaction. Usually, we ignore this, and simply lump all occurances of the same type of information together as we gather our statistics.

For the register field, it is quite common to find that some registers are referenced with far greater frequency than others. A variable length register field is possible, but usually, we fix the length of the register field. Given that a register field of n bits allows for 2n registers, the question is, how many registers should we use.

The answer can be determined by writing a compiler that is parameterized in the number of registers it uses as it compiles programs. To make things fair, we make it preferentially use the lowest numbered free register whenever it needs a new register, so high-numbered registers are guaranteed to be used less frequently than low-numbered registers.

Now, we run the compiler over the set of all programs in some benchmark suite and collect run-time (or simulated run-time) register use statistics for the resulting program executions. We have an optimal design using n bits to select among 2n registers if, for all registers r,

n-1 < log2(1/P(r)) < n+1

In sum, if all registers are equally likely to be used, the design is optimal for any coding scheme, and if the variation between the most used and the least used register is less than a factor of 4, a better binary coding scheme based on a fixed number of bits for register selection cannot be devised.

At some point, increasing n will cause the register usage to become suboptimal by the above criteria. At that point, we must either abandon the fixed coding system, moving to, for example, a variable number of bits used for register selection, or we must abandon our goal of devising an optimal encoding.

The same approach can be used to select the number of bits used to encode addressing modes, displacements from pointers for addressing structures, and any other field selection constructs.

Of course, all of the above was predicated on the idea that each construct was encoded in isolation, with minimal use of context. In practice, each addressing mode and each register reference is in the context of some operation, and statistical analysis of the impact of context on usage sometimes discloses serious imbalance. For example, some addressing modes may never be used except in the context of some instructions, and some registers may only be used in the context of some instructions.

Block Codes

If we have a code that is based on fixed size words, such as is the case on most modern RISC machines, we cannot use Huffman's algorithm. Nonetheless, we note that when H(S) = Length(S), S cannot be compressed, and when this is true, P(0) = P(1) = .5 in the binary representation of S. Therefore, an instruction code in which the representations of real programs does not look random is not optimally compressed.

If an instruction set results in programs that are not optimally compressed, we expect it to take more memory cycles to execute, to require more cache misses, more page faults, and more code space than a comparable computer with a more nearly optimal program representation. This is independent of all other details of the CPU implementation!