Lecture 38, Peephole Optimization

Part of the notes for CS:4980:1
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Without Optimization

Consider code generation for the ARM using a simple stack-based code generator interface. Given this input (where i is a local variable):

i = i + 1;

The code generator might generate this intermediate code, perhaps in the form of a sequence of calls to code generation routines and not an actual intermediate code:

PUSHLA i
PUSHLA i
LOAD
PUSHI  1
ADD
POPS

The code generator for an ARM machine might respond to this as follows, assuming that it is smart enough to use r3, r4, r5, etc. as a stack:

        sub     r3, fp, #8    -- PUSHLA i
        sub     r4, fp, #8    -- PUSHLA i
        ldr     r4, [r4, #0]  -- LOAD
        mov     r5, #1        -- PUSHI  1
        add     r4, r4, r5    -- ADD
        str     r4, [r3, #0]  -- POPS

This is, of course, very inefficient! Any decent assembly language programmer would have written the following code to increment the local variable i:

        ldr     r3, [fp, #-8] -- load  i
        add     r3, r3, #1    -- add   1
        str     r3, [fp, #-8] -- store i

The question we will ask here is: how can we get this quality of code without being very smart?

Peephole Optimization

Peephole optimization involves examination of code at a very local level, attempting to find patterns of instructions that can be replaced with more efficient patterns of instructions. Here is a typical peephole optimization rule:

When you see this pattern:

        mov     |a|, #|b|
        add     |c|, |d|, |a|

replace it with this:

        add     |c|, |d|, #|b|

Here is a second rule:

When you see this pattern:

        sub     |a|, |b|, #|c|
        ldr     |d|, [|a|, #0]

Replace it with this:

        ldr     |d|, [|b|, #-|c|]

Searching the output of our compiler for instances that match these patterns and then making the indicated substitutions, we get the following code:

        sub     r3, fp, #8    -- PUSHLA i
        ldr     r4, [fp, #-8] -- PUSHLA i / LOAD
        add     r4, r4, #1    -- PUSHI  1 / ADD
        str     r4, [r3, #0]  -- POPS

This code is almost optimal. Peephole optimization cannot easily find the final optimization, combining the initial sub with the final str, because these are far apart in the code and separated by other operations. Peephole optimization is, in effect, looking at the code through a very small window, a peephole. Our rules required a window only two instructions long, yet they produced pretty good code.

Note that our peephole optimization rules are not necessarily universal rules. For example, neither of our patterns could be used with any safety in general contexts because, in each case, we eliminated the initialization of a register. What would happen, for example, if r4 had been used at a later point in the code? The peephole rules we have presented above are based on the assumption that the code generator is very limited in the code it generates, such that it will never generate sequences of instructions that match our patterns except when those patterns apply.

Peephole optimizers can be improved by allowing negative constraints, for example, constraints that say |x| ≠ |y|.

Implementing Peephole Optimization

Obviously, peephole optimizers can be written in languages with strong pattern matching tools. The Unix tools SED and AWK, along with programming languages such as Perl, can express the rules required for peephole optimization reasonably compactly.

Nonetheless, using such a language to do the peephole optimization for a production compiler would be silly. As with many other issues, it is easier to do this optimization earlier in the code generation process. For example, consider the following idea:

Instead of emitting code as text strings, for example, "mov r5, #1", consider, instead, emitting the code from the code generator as a tuple such as <"mov","r5","#1","">. The code generator can emit these tuples directly into the peephole optimizer, where a fixed-size set of tuples can be buffered. As new tuples are pushed into the peephole mechanism, it can emit tuples as needed (in text form) to make space, but each time a new tuple is pushed in, the peephole mechanism can compare the component fields to see if an optimizing rule applies. Since the fields are pre-parsed (having never been converted to text) the necessary comparisons can be very simple.
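
Here is a minimal sketch of this idea in C. The type and routine names are assumptions made for illustration, not part of any real compiler: a tuple holds an opcode and up to three operand fields, a two-tuple window is buffered, and the first rule given above (folding a mov of a constant into a following add) is checked each time a tuple is pushed.

#include <stdio.h>
#include <string.h>

/* one generated instruction, kept in pre-parsed form */
typedef struct {
        char op[8];                  /* "mov", "add", "ldr", "str", ...   */
        char a[16], b[16], c[16];    /* up to three operand fields        */
} tuple;

static tuple window[2];              /* the peephole: a 2-instruction buffer */
static int   held = 0;               /* number of tuples currently buffered  */

static void emit( tuple *t ){        /* convert one tuple to text, at last   */
        printf( "\t%s\t%s", t->op, t->a );
        if (t->b[0] != '\0') printf( ", %s", t->b );
        if (t->c[0] != '\0') printf( ", %s", t->c );
        printf( "\n" );
}

void push_tuple( tuple t ){
        if (held == 2) {             /* make room by emitting the oldest */
                emit( &window[0] );
                window[0] = window[1];
                held = 1;
        }
        window[held] = t;
        held = held + 1;

        /* the first rule given above:  mov |a|,#|b|  then  add |c|,|d|,|a|
           becomes  add |c|,|d|,#|b|                                        */
        if (held == 2
        &&  strcmp( window[0].op, "mov" ) == 0
        &&  window[0].b[0] == '#'
        &&  strcmp( window[1].op, "add" ) == 0
        &&  strcmp( window[1].c, window[0].a ) == 0) {
                strcpy( window[1].c, window[0].b );  /* substitute the constant */
                window[0] = window[1];               /* and discard the mov     */
                held = 1;
        }
}

void flush_tuples( void ){           /* call once at the end of code generation */
        int i;
        for (i = 0; i < held; i++) emit( &window[i] );
        held = 0;
}

A production peephole optimizer would check a table of such rules rather than one hard-wired test, but each test stays this simple because the fields never have to be parsed out of text.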

Another option is to do the peephole optimization on the intermediate stack-based instruction set. In this case, when we find optimizations that work on the real machine, we work out what sequence of stack-based instructions created the conditions that led to this optimization. Then, we introduce new stack-based instructions to express the optimization. For example, the optimization rules we introduced above could be re-expressed as:

When you see this pattern:

        PUSHLA  |a|
        LOAD

Replace it with this new instruction:

        PUSHLAL |a|

This new instruction is defined as being exactly equivalent to the two instructions it replaces, but the ARM code generator can implement PUSHLAL with a single ldr instruction where the trivial approach to ARM code generation would have used a sub followed by an ldr for the original PUSHLA LOAD sequence.

Here is the second example rule rewritten similarly:

When you see this pattern:

        PUSHI |a|
        ADD

Replace it with this:

        ADDI |a|

Again, the new instruction is defined as being exactly equivalent to the two instructions it replaces, and again, it allows the ARM code generator to generate improved code. Applying both of these optimizations, our original stack code gets rewritten as follows:

PUSHLA  i
PUSHLAL i
ADDI    1
POPS

The code generator for an ARM machine might respond to this as follows:

        sub     r3, fp, #8    -- PUSHLA  i
        ldr     r4, [fp, #-8] -- PUSHLAL i
        add     r4, r4, #1    -- ADDI    1
        str     r4, [r3, #0]  -- POPS
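
The code generator routines for the new instructions remain simple. Here is a rough sketch, assuming (this is not the notes' actual interface) a code generator that keeps its virtual stack in registers r3 and up, and that the constant handed to ADDI fits the ARM immediate format discussed in the next section:

#include <stdio.h>

/* The virtual stack lives in registers r3, r4, r5, ...; "top" is the number
   of the register currently holding the top of stack (2 means the stack is
   empty and r3 is the next register to be used). */
static int top = 2;

void gen_pushlal( int offset ){      /* PUSHLAL: push the value of a local variable */
        top = top + 1;
        printf( "\tldr\tr%d, [fp, #%d]\n", top, offset );
}

void gen_addi( int constant ){       /* ADDI: add a constant to the stack top */
        printf( "\tadd\tr%d, r%d, #%d\n", top, top, constant );
}

With i at offset -8, the calls gen_pushlal(-8) and gen_addi(1) produce exactly the ldr and add instructions shown above.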

Constants

The ARM instruction format has some odd constraints. One of them is on the form of immediate operands. In instructions such as compare and add, only the second operand may be a constant, and this constant is formed from the shifter operand specified by bits 0 to 11 of the instruction. This field is divided as follows:

bits 0 to 7 -- immed_8
An 8-bit operand.
bits 8 to 11 -- rotate_imm
A 4-bit rotate count. This is doubled, and then the 8-bit operand is rotated right that many bit positions to form the actual 32-bit immediate constant.
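
In other words, the hardware rotates immed_8 right by twice rotate_imm. A one-function sketch of the decoding (arm_immediate is just an illustrative name, and a 32-bit unsigned int is assumed; the guard avoids an undefined shift by 32 when the rotation is zero):

/* the 32-bit constant encoded by an 8-bit immed_8 and a 4-bit rotate_imm */
unsigned int arm_immediate( unsigned int immed_8, unsigned int rotate_imm ){
        unsigned int r = 2 * rotate_imm;     /* the rotation is always even */
        if (r == 0) return immed_8;
        return (immed_8 >> r) | (immed_8 << (32 - r));
}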

As a result, the only immediate constants that can be directly represented on the ARM are those whose one bits all fit within an 8-bit field beginning at an even bit position, possibly wrapping around the top of the word. Expressed in hex, a constant is representable if all of its one bits fall within one of the following masks:

        000000FF  000003FC  00000FF0  00003FC0
        0000FF00  0003FC00  000FF000  003FC000
        00FF0000  03FC0000  0FF00000  3FC00000
        FF000000  FC000003  F000000F  C000003F

If you write instructions like add r3,r2,#c or mov r3,#c, the assembler will emit useful code if the constant c can be anded with one of the above without changing its value; otherwise, dumber assemblers will complain and some smarter assemblers will silently use a more complex instruction sequence to load the constant.

The mvn instruction takes the one's complement of the shifter operand, so if you want to load negative constants, you can use mvn. If you really want to do mov r3,#-5, you must instead use mvn r3,#4, because the one's complement of 4 is also the two's complement of 5. Smarter assemblers will automatically determine if this substitution works.

Similarly, for comparison, cmn compares with the two's complement of its second argument. So, if you want to do cmp r3,#-5, you must instead do cmn r3,#5.

The official ARM assembler knows about the relationship between MOV and MVN. If you write MOV R3,#&FFFFFFFF, for example, it will generate MVN R3,#0. (The official ARM assembler uses upper case where the Gnu assembler uses lower case, and it uses the & symbol to indicate hexadecimal.)

This trick doesn't always suffice. Suppose you wanted to do this: mov r3,#0x12345670. The assembler does not permit it. The official ARM assembler has a special "pseudo operation" that you can use: LDR R3,=&12345670. When the ARM assembler sees this, it first asks if the literal can be loaded as an immediate constant using a mov or mvn instruction. In that case, it generates that instruction. If not, it generates a PC-relative reference to the literal pool and places the constant there.

The official ARM assembler accumulates literals until it encounters an LTORG directive or until the end of the program, and then it emits all of the accumulated literals. The LTORG directive is only needed if the program is so large that PC-relative addressing cannot reach the default literal pool at the end of the program. In that case, the user must include one or more explicit LTORG directives in mid program to output the accumulated literals. Obviously, an LTORG directive should be placed after an unconditional branch or subroutine return so that the literals are not accidentally executed, and it should be placed within the range of PC-relative addressing of the instructions that reference those literals.

Note that third-party assemblers such as the GNU assembler cannot be relied on to include these support tricks. Experiment. If they are supported and work, use them; otherwise, you must write your own "expert" to load constants. If your compiler uses a stack machine as the interface to its code generator, the obvious place to put the "load constant" expert is in the pushi routine for pushing an immediate constant on the stack.

Assemblers or compiler optimizers that select how to load a constant on the ARM must have an algorithm for recognizing numbers that fit the immediate constant pattern and numbers that do not. The code that generates the optimal constant-load instructions is, in effect, a very special case of peephole optimization, with pattern checking that is unusually complex.

Here is the algorithm to determine if a 32-bit nonzero constant c can be handled by an ARM immediate instruction. First, note that any 32-bit nonzero constant c can be viewed as having the following structure:

        000...000  1xx...xx1  000...000
         p zeros       x       q zeros

That is, p leading zeros, a string of bits x, and q trailing zeros. We are interested in the maximal strings of leading and trailing zeros, so we can safely assume that the string of bits x begins and ends with ones once we eliminate the special case of the constant zero.

The number can be compactly encoded as an immediate constant in an ARM instruction if p+q > 24; if p+q = 24, it can be encoded only when p (and therefore q) is even, because the 8-bit field must then land exactly on an even bit position. (Constants whose 8-bit field wraps around the top of the word are also encodable, but this simple test ignores them.) So, how do we compute p and q for the constant c? This rests on a set of classical tricks for computing properties of binary numbers.

The trailers operator counts the trailing zeros in the binary representation of a number; it works by converting each trailing zero into a one bit, setting every other bit to zero, and then counting the ones. The reverse operator reverses the order of the bits in a word, and the bitcount operator counts the number of one bits in a word. Given these, q is trailers(c) and p is trailers(reverse(c)).

Counting trailing zeros

To count the trailing zeros in a word, use:

#define trailers(x) bitcount( ~(x | -x) )

The expression ~(x | -x) is unlikely to be intuitive, so let's work a 16-bit example, using x = 0x0530:

        x          = 0000 0101 0011 0000
        -x         = 1111 1010 1101 0000
        x | -x     = 1111 1111 1111 0000
        ~(x | -x)  = 0000 0000 0000 1111

The final complement leaves a one bit in each trailing-zero position of x, so the bitcount of this value, 4, is the number of trailing zeros.

Actually, there are several equivalent ways to compute this. For example, this would also work:

#define trailers(x) bitcount( (~x ^ -x) >> 1 )

Reversing the bits in a word

The reverse operator could be implemented by a simple loop, shifting the input value left while shifting the output value right, but this is a slow way to do it. There is a trick implementation of the reverse operator:

unsigned int reverse( unsigned int x ){
	x = ((x >>  1) & 0x55555555) + ((x & 0x55555555) <<  1);
	x = ((x >>  2) & 0x33333333) + ((x & 0x33333333) <<  2);
	x = ((x >>  4) & 0x0F0F0F0F) + ((x & 0x0F0F0F0F) <<  4);
	x = ((x >>  8) & 0x00FF00FF) + ((x & 0x00FF00FF) <<  8);
	x = ((x >> 16)             ) + ((x             ) << 16);
        return x;
}

The first line exchanges each pair of bits. The second line exchanges each pair of 2-bit fields. The third line exchanges each pair of 4-bit fields, and so on. The final line has been optimized to eliminate the need for extra constants.

Note that, if several bitcount and reverse operations are needed, there is value to expanding them as inline macros so that an optimizing compiler can load the obscure constants into registers just once and then use them over and over. These constants are nasty, in the sense that there is no fast way to load them in a single ARM instruction and much the same is true on most other machines.

On machines with a shorter word, a table lookup can be competitive, for example, using a 256 entry table to reverse the order of bits in each byte. Trial and error is generally the best way to figure out which algorithm will win in any particular context. Some machines even have special instructions for swapping the order of bytes in a word.
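
For example, here is a sketch of that idea for a 16-bit word, with the 256-entry byte-reversal table built by a simple loop at startup (a real implementation might ship the table as a precomputed constant array):

#include <stdint.h>

static uint8_t rev8[256];            /* rev8[i] is i with its 8 bits reversed */

void init_rev8( void ){              /* build the table once, bit by bit */
        int i, b;
        for (i = 0; i < 256; i++) {
                uint8_t x = (uint8_t) i, r = 0;
                for (b = 0; b < 8; b++) {
                        r = (uint8_t)((r << 1) | (x & 1));
                        x >>= 1;
                }
                rev8[i] = r;
        }
}

uint16_t reverse16( uint16_t x ){    /* reverse a 16-bit word by table lookup */
        return (uint16_t)( (rev8[x & 0xFF] << 8) | rev8[x >> 8] );
}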

Counting bits

Like the reverse operator, the bitcount operator can be implemented by a simple loop, shifting one bit off per iteration and counting the one bits. There is, however, a trick implementation of the bitcount operator that is remarkably similar to the trick reverse operator discussed above:

unsigned int bitcount( unsigned int x ){
	x = (x & 0x55555555) + ((x >>  1) & 0x55555555);
	x = (x & 0x33333333) + ((x >>  2) & 0x33333333);
	x = (x & 0x0F0F0F0F) + ((x >>  4) & 0x0F0F0F0F);
	x = (x & 0x00FF00FF) + ((x >>  8) & 0x00FF00FF);
	x = (x & 0x0000FFFF) + ((x >> 16)             );
        return x;
}

The first assignment replaces each pair of bits with the sum of those two bits. The second replaces each 4-bit segment with the sum of the two 2-bit integers in that segment. The third replaces each 8-bit segment with the sum of the two 4-bit integers in that segment, and so on. This version of the algorithm has been written so that the special constants it uses are the same as those required by the reverse operator.

For machines with shorter words, it is sometimes faster to use a table-lookup, for example, with a 256 entry table giving the bitcounts for the integers from 0 to 255. Given this, you can sum the bitcounts of the bytes of a word. Ultimately, for each computer architecture, there are generally a few different bitcount implementations worth testing to see which is fastest.
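
Putting the pieces together, the test for an encodable constant is short. A minimal sketch, built from the trailers, reverse and bitcount definitions given above (the function name is just illustrative):

/* A test for the p+q condition described above.  The test is sufficient but
   not exhaustive: constants whose 8-bit field wraps around the top of the
   word, such as 0xF000000F, are encodable on the ARM but are not detected
   here. */
int fits_arm_immediate( unsigned int c ){
        unsigned int q, p;
        if (c == 0) return 1;                 /* zero is trivially encodable   */
        q = trailers( c );                    /* q = number of trailing zeros  */
        p = trailers( reverse( c ) );         /* p = number of leading zeros   */
        return (p + q > 24) || ((p + q == 24) && ((p & 1) == 0));
}

A "load constant" expert in the pushi routine can use a test like this to decide between a mov, an mvn applied to the complemented constant, or a reference to the literal pool.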

Peephole Optimization Applied to Pipelined Processors

Note that all modern processors of any quality at all are pipelined. A typical pipeline has 4 stages (some have more, some fewer):

  1. IF Instruction Fetch
  2. OG Operand Gather (from registers and instruction fields)
  3. OP Operate on the ALU or Memory (do the actual work)
  4. RS Result Store (back to a register)

Note that many processors also incorporate result forwarding -- this is work done in the operand gather stage where, if a source operand is in a register and the result store stage is currently trying to store a result in that same register, the operand gather stage takes the result directly instead of waiting for the store to the register to complete. The mechanism by which this is done does not matter to us as much as the fact that it can be done.

Consider the impact of this on the code of the example used above:

INSTRUCTION            TIME
ldr     r3, [fp, #-8]  | IF | OG | OP | RS |
add     r3, r3, #1          | IF |////| OG | OP | RS |
str     r3, [fp, #-8]                 | IF |////| OG | OP | RS |
                       | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  |

What could have been done in 6 clock cycles had there been no operand dependencies between these three instructions now takes 8 clock cycles. In each of the last two instructions, one clock cycle of delay was added to force the operand gather phase of that instruction to wait until the result computed by the operate phase of the previous instruction was available. Two cycles would have been added had it not been for the result forwarding logic typical of modern pipelined machines.

Now, suppose we have a sequence of two unrelated statements such as:

i = i + 1;
j = j - 1;

If we convert these to the following assembly language code, we will incur 4 clock-cycles worth of delay because of operand dependencies:

        ldr     r3, [fp, #-8]  -- load i
        add     r3, r3, #1     -- add 1
        str     r3, [fp, #-8]  -- store i
        ldr     r3, [fp, #-12] -- load j
        sub     r3, r3, #1     -- sub 1
        str     r3, [fp, #-12] -- store j

In contrast, if we allocate different registers and interleave the two sequences, we can eliminate all of the operand delays:

        ldr     r3, [fp, #-8]  -- load i
        ldr     r4, [fp, #-12] -- load j
        add     r3, r3, #1     -- add 1
        sub     r4, r4, #1     -- sub 1
        str     r3, [fp, #-8]  -- store i
        str     r4, [fp, #-12] -- store j

Of course, the penalty for such interleaving is very hard to read code. This is why good compilers can frequently generate faster code than the average assembly-language programmer, who may be more interested in writing readable and maintainable code than in writing code that is as fast as possible.

Peephole optimization cannot take the two blocks of code given originally and interleave them this fully, but there are some simple peephole rules that can eliminate many delay slots. Consider the following peephole optimization rule:

When you see this pattern:

        str     |a|, [fp, #|b|]
        ldr     |c|, [fp, #|d|]

Where |a| ≠ |c|, and |b| ≠ |d|, replace it with this:

        ldr     |c|, [fp, #|d|]
        str     |a|, [fp, #|b|]

In short, where you see a store to a local variable followed by a load from a different local variable, swap them so long as different registers are used.
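
Using the tuple representation sketched earlier, the test for this rule, negative constraints included, is a short chain of field comparisons. A rough sketch (the function name and the field layout are assumptions, with the register in field a and the complete "[fp, #offset]" operand in field b):

#include <string.h>

/* Does a str followed by a ldr satisfy the rule, including the negative
   constraints |a| != |c| and |b| != |d|?  If this returns true, the two
   tuples in the peephole buffer may be swapped. */
int can_swap( tuple *first, tuple *second ){
        return strcmp( first->op,  "str" ) == 0
            && strcmp( second->op, "ldr" ) == 0
            && strncmp( first->b,  "[fp,", 4 ) == 0      /* local variables only */
            && strncmp( second->b, "[fp,", 4 ) == 0
            && strcmp( first->a,  second->a ) != 0       /* different registers  */
            && strcmp( first->b,  second->b ) != 0;      /* different variables  */
}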

Why restrict this to local variables (that is, those indexed off of the frame pointer)? Because in the case of Falcon programs, we can guarantee that there are no aliases for local variables. Non-local variables could have been passed to the current routine as reference parameters, and this creates the possibility that for arbitrary loads and stores, two different names, for example, a direct reference to a global variable and the use of a reference parameter, could refer to the same actual variable. Switching the order of loads and stores to a single particular variable would change the meaning of a program.

Of course, this optimization only works if the compiler's code generator allocates different registers for the different variables. That is difficult with a pure stack-oriented code generator, where register assignment depends only on the level in the stack, but it works well with some other register allocation schemes.