22C:122/55:132, Lecture Notes, Lecture 11, Fall 1999

Douglas W. Jones
University of Iowa Department of Computer Science

Instruction Coding
One problem the designer of any insturction set must answer is, How many bits shoul be assigned to each field in each variation of the instruction word. This is a question of information encoding, and the theory relevant to this is information theory.
A Brief Review of Information Theory
Information theory deals with the information content of messages. Both information content and message size are measured in the same unit, and this leads to problems, so we will distinguish between bits of information on the one hand and either bits of representation or binary digits on the other hand.
Given
- M -- a message
- P(M) -- the probability of M in the universe of all messages.
We define the information content, or entropy of the message as:
- H(M) = log₂( 1/P(M) )
So, for example, the if the probability of the message "I am sick" is 1/8, the entropy H("I am sick") is log₂( 1/(1/8) ) or log₂(8) which is 3 bits.
If you just received M₁="my body temperature is above normal", you can guess that I have a fever and therefore that I am sick. Here, you are relying not on the unconditional probability M₂="I am sick", P(M₂), but rather on the conditional probability P(M₂|M₁), the probability of M₂ in the context of M₁. In fact, I could have an above-normal temperature because I just came out of a sauna, so P(M₂|M₁) is not 1, but it is certainly higher than P(M₂).
Note that the entropy, H, of a message need not be an integer. It is perfectly reasonable to say that H("I have a temperature") is 2.75 and that the conditional entropy of the second message, "I am sick" is 0.25.
If we have messages composed from an alphabet, for example, messages composed of letters of the Roman alphabet, or messages composed of instructions from an instruction set, we can measure the entropy of the message in many ways. The simplest, H₀ or the zeroth order entropy, rests on the unconditional probabilities of the letters.
- S -- a string composed of letters S₁ to S_n
- P(S_i) -- the probability of letter i
- H₀(S) = sum from 1 to n of H(S_i)
That is, the entropy of a string is the sum of the entropies of the letters of the string.
- H₀(S_i) = sum from 1 to n of log₂(1/P(S_i))
Higher order entropies take context into account, so H₁, the first order entropy, substitutes P(S_i:S_i-1) for P(S_i). That is, we use the probability of each letter in the context of its predecessor instead of the raw probability of each letter.
If we use context, we generally get a smaller measure of the entropy. For example, the first order entropy is:
- H₁(S_i) = sum from 1 to n of log₂(1/P(S_i|S_i-1)
and the second order entropy is
- H₂(S_i) = sum from 1 to n of log₂(1/P(S_i|S_i-2S_i-1)
In general, data compression algorithms seek to compress the representations of S to H(S) bits. No data compression algorithm can compress to better than H(S) for an arbitrary S. If it does so for some particular string S, it must do significantly worse than H(S) for a sufficient number of other strings that the expected value of the compressed length of the output is greater than or equal to H(S).
Variable-Length Codes
Unambiguous variable length codes are usually organized as prefix codes. Consider the following example for an alphabet of 7 letters:
```
	A = 00
	B = 010
	C = 011
	D = 10
	E = 110
	F = 1110
	G = 1111
	
```
The string DECADE in this alphabet would be encoded as 101100110010110.
Prefix codes are called this because no codeword in such a code is a prefix of any other codeword.
Any prefix code can be described by a trie, or decoding tree. The trie for the above code is:
```
                             . 
                           0/ \1
                           /   \
                          .     . 
                        0/ \1 0/ \1
                        A   . D   . 
                          0/ \1 0/ \1
                          B   C E   . 
                                  0/ \1
                                  F   G
	
```
If our representation is in base n, the trie for decoding a prefix code will be an n-ary tree. To decode the code, start at the root, following the labeled branches that match successive digits of the code, until you reach the leaf. Record the character in the leaf and then start back at the root.
Huffman discovered a coding algorithm to optimize the binary prefix code representation of a finite alphabet where each letter has a probability that is known in advance. Huffman's algorithm assigns the probability to the letter as a weight, and then produces a weight-balanced trie. The weight of the entire trie is 1 (the sum of all the probabilities), and the weights of the two subtrees are as close to 1/2 as is possible. Huffman's algorithm produces a Huffman code, and it is fairly easy to prove that a Huffman code will compress text in the alphabet being used to within 1 bit per letter of H₀, and that no prefix code can do better than this.
The prefix codes that interest us in the world of instruction coding are prefix codes with radix higher than 2. So, if our machine has instructions that are a multiple of 8 bits in length, we might view the instruction set as a prefix code in radix 256. this would be an appropriate way to view the instruction set of machines such as the DEC VAX or the Intel 80x86 family.
Bits per Instruction Element
Instruction sets are typically composed of many elements. For example, we speak of the instruction code or opcode field, register select fields, addressing mode fields, and so on. The CPU, as it reads the instruction stream, parses the instruction into its respective fields using context. Any analysis of the coding of the instruction should take advantage of this same context!
Therefore, we do not use the simple probability P(x) for each element of the instruction -- that is, the probability that the next instruction element will be x, out of context, but rather, we use the conditional probability, P(x/c), the probability that, in context c, the next instruction element will be x.
Typically, opcodes are considered out of context -- that is, we do not allow the opcode x to be encoded differently depending on the previous instruction. On the other hand, all other instruciton elements are encoded in the context of their opcode.
Given that we have already selected the set of opcodes, assignment of optimal encodings to them is straightforward: For each opcode O_i we simply assign
ciel( log₂(1/P(O_i)) ) bits
And, we construct a prefix code for the opcodes.
In theory, we could use completely different encodings for the addressing and register fields of each instruction, and occasionally, this can lead to significant compaction. Usually, we ignore this, and simply lump all occurances of the same type of information together as we gather our statistics.
For the register field, it is quite common to find that some registers are referenced with far greater frequency than others. A variable length register field is possible, but usually, we fix the length of the register field. Given that a register field of n bits allows for 2ⁿ registers, the question is, how many registers should we use.
The answer can be determined by writing a compiler that is parameterized in the number of registers it uses as it compiles programs. To make things fair, we make it preferentially use the lowest numbered free register whenever it needs a new register, so high-numbered registers are guaranteed to be used less frequently than low-numbered registers.
Now, we run the compiler over the set of all programs in some benchmark suite and collect run-time (or simulated run-time) register use statistics for the resulting program executions. We have an optimal design using n bits to select among 2ⁿ registers if, for all registers r,
n-1 < log₂(1/P(r)) < n+1
In sum, if all registers are equally likely to be used, the design is optimal for any coding scheme, and if the variation between the most used and the least used register is less than a factor of 4, a better binary coding scheme based on a fixed number of bits for register selection cannot be devised.
At some point, increasing n will cause the register usage to become suboptimal by the above criteria. At that point, we must either abandon the fixed coding system, moving to, for example, a variable number of bits used for register selection, or we must abandon our goal of devising an optimal encoding.
The same approach can be used to select the number of bits used to encode addressing modes, displacements from pointers for addressing structures, and any other field selection constructs.
Of course, all of the above was predicated on the idea that each construct was encoded in isolation, with minimal use of context. In practice, each addressing mode and each register reference is in the context of some operation, and statistical analysis of the impact of context on usage sometimes discloses serious imbalance. For example, some addressing modes may never be used except in the context of some instructions, and some registers may only be used in the context of some instructions.
Block Codes
If we have a code that is based on fixed size words, such as is the case on most modern RISC machines, we cannot use Huffman's algorithm. Nonetheless, we note that when H(S) = Length(S), S cannot be compressed, and when this is true, P(0) = P(1) = .5 in the binary representation of S. Therefore, an instruction code in which the representations fo real programs does not look random is not optimally compressed.
If an instruction set results in programs that are not optimally compressed, we expect it to take more memory cycles to execute, to require more cache misses, more page faults, and more code space than a comparable computer with a more nearly optimal program representation. This is independent of all other details of the CPU implementation!