Lecture 32, Peephole Optimization

Part of the notes for 22C:196:002 (CS:4908:0002)
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Without Optimization

Consider code generation for the ARM using a simple stack-based code generator interface. Given this input (where i is a local variable):

i = i + 1;

The compiler might generate this intermediate code, perhaps in the form of a sequence of calls to code generation routines rather than as an actual intermediate representation:

pushla i
pushla i
load
pushi  1
add
pops
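
If the "sequence of calls" view is taken literally, the compiler might make calls like those in the sketch below. The routine names and the use of printf in place of real code emission are assumptions made purely for illustration:

#include <stdio.h>

/* A minimal sketch of a stack-based code generator interface.  The
   routine names and printf as a stand-in for real code emission are
   assumptions made for illustration only. */

void gen_pushla( int offset ){ printf( "pushla (fp%+d)\n", offset ); }
void gen_pushi( int value ) { printf( "pushi  %d\n", value ); }
void gen_load( void ){ printf( "load\n" ); }
void gen_add( void ) { printf( "add\n" ); }
void gen_pops( void ){ printf( "pops\n" ); }

int main(){
        /* the calls a compiler might make for  i = i + 1;
           assuming i sits at offset -8 from the frame pointer */
        gen_pushla( -8 );
        gen_pushla( -8 );
        gen_load();
        gen_pushi( 1 );
        gen_add();
        gen_pops();
        return 0;
}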

The code generator for an ARM machine might respond to this stack code as follows, assuming that it is smart enough to use r3, r4, r5, etc. as a stack:

        sub     r3, fp, #8    -- pushla i
        sub     r4, fp, #8    -- pushla i
        ldr     r4, [r4, #0]  -- load
        mov     r5, #1        -- pushi  1
        add     r4, r4, r5    -- add
        str     r4, [r3, #0]  -- pops

This is, of course, very inefficient! Any decent assembly language programmer would have written the following code to increment the local variable i:

        ldr     r3, [fp, #-8] -- load  i
        add     r3, r3, #1    -- add   1
        str     r3, [fp, #-8] -- store i

The question we will ask here is: How can we get this quality of code without being very smart?

An Aside: Which Way Does the ARM Stack Grow?

Those negative displacements from the frame pointer may seem odd. Which way does the stack grow on the ARM? A quick empirical test suffices to answer this question:

#include <stdio.h>

void f( int * p ){
        int q;
        if (p < &q) {
                puts("stack grows by incrementing the stack pointer");
        } else {
                puts("stack grows by decrementing the stack pointer");
        }
}
int main(){
        int p;
        f( &p );
}

Run this, and you get this output:

stack grows by decrementing the stack pointer

Peephole Optimization

Peephole optimization involves examination of code at a very local level, attempting to find patterns of instructions that can be replaced with more efficient patterns of instructions. Here is a typical peephole optimization rule:

When you see this pattern:

        mov     |a|, #|b|
        add     |c|, |d|, |e|

Where |a| = |e|, replace it with this:

        add     |c|, |d|, #|b|

Here is a second rule:

When you see this pattern:

        sub     |a|, |b|, #|c|
        ldr     |d|, [|e|, #0]

Where |a| = |e|, replace it with this:

        ldr     |d|, [|b|, #-|c|]

Searching the output of our compiler for instances that match these patterns and then making the indicated substitutions, we get the following code:

        sub     r3, fp, #8    -- pushla i
        ldr     r4, [fp, #-8] -- pushla i / load
        add     r4, r4, #1    -- pushi  1 / add
        str     r4, [r3, #0]  -- pops

This code is almost optimal. Peephole optimization cannot easily find the final optimization, combining the initial sub with the final str, because these are far apart in the code and separated by other operations. Peephole optimization is, in effect, looking at the code through a very small window, a peephole. Our rules required a window that was only two instructions long, yet they produced pretty good code.

Note that our peephole optimization rules are not necessarily universal rules. For example, neither of our patterns could be used with any safety in general contexts because, in each case, we eliminated the initialization of a register. What would happen, for example, if r4 had been used at a later point in the code? The peephole rules we have presented above rest on the assumption that the code generator is so limited in what it does that it will never generate sequences of instructions matching our patterns except in contexts where the rules are safe to apply.

Implementing Peephole Optimization

Obviously, peephole optimizers can be written in languages with strong pattern-matching tools. The Unix tools sed and awk, along with programming languages such as Perl, can express the rules required for peephole optimization reasonably compactly.

Nonetheless, using such a language to do the peephole optimization for a production compiler would be silly. As with many other issues, it is easier to do this optimization earlier in the code generation process. For example, consider the following idea:

Instead of emitting code as text strings such as "mov r5, #1", consider emitting each instruction from the code generator as a tuple such as <"mov","r5","#1","">. The code generator can push these tuples directly into the peephole optimizer, where a fixed-size set of tuples can be buffered. As new tuples are pushed into the peephole mechanism, it emits older tuples (in text form) as needed to make space, and each time a new tuple is pushed in, the peephole mechanism compares the component fields to see if an optimizing rule applies. Since the fields are pre-parsed (having never been converted to text), the necessary comparisons can be very simple.
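
As a concrete sketch of this idea, the buffer and one rule (the mov/add rule given above) might look like the following in C. The struct layout, the two-instruction window, and the routine names are assumptions made for illustration, not the interface of any particular compiler:

#include <stdio.h>
#include <string.h>

/* one pre-parsed instruction tuple, e.g. <"mov","r5","#1",""> */
typedef struct {
        char op[8];
        char a[16], b[16], c[16];
} tuple;

#define WINDOW 2                /* a two-instruction peephole */
static tuple buf[WINDOW];
static int count = 0;           /* number of buffered tuples */

/* emit the oldest buffered tuple as text */
static void emit_oldest( void ){
        printf( "\t%s\t%s, %s", buf[0].op, buf[0].a, buf[0].b );
        if (buf[0].c[0] != '\0') printf( ", %s", buf[0].c );
        printf( "\n" );
        buf[0] = buf[1];
        count--;
}

/* emit whatever remains in the buffer */
void flush_tuples( void ){
        while (count > 0) emit_oldest();
}

/* push one tuple into the peephole; when the window holds
       mov  |a|, #|b|
       add  |c|, |d|, |e|     with |a| = |e|
   fold the immediate into the add and drop the mov */
void push_tuple( tuple t ){
        if (count == WINDOW) emit_oldest();     /* make room */
        buf[count++] = t;
        if (count == 2
        &&  strcmp( buf[0].op, "mov" ) == 0
        &&  strcmp( buf[1].op, "add" ) == 0
        &&  buf[0].b[0] == '#'
        &&  strcmp( buf[0].a, buf[1].c ) == 0) {
                strcpy( buf[1].c, buf[0].b );   /* add |c|, |d|, #|b| */
                buf[0] = buf[1];
                count = 1;
        }
}

int main(){
        tuple t1 = { "mov", "r5", "#1", "" };
        tuple t2 = { "add", "r4", "r4", "r5" };
        push_tuple( t1 );
        push_tuple( t2 );
        flush_tuples();         /* prints:  add  r4, r4, #1 */
        return 0;
}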

Another option is to do the peephole optimization on the intermediate stack-based instruction set. In this case, when we find optimizations that work on the real machine, we work out what sequence of stack-based instructions created the conditions that led to this optimization. Then, we introduce new stack-based instructions to express the optimization. For example, the optimization rules we introduced above could be re-expressed as:

When you see this pattern:

        pushla |a|
        load

Replace it with this new instruction:

        pushl |a|

This new instruction is defined as being exactly equivalent to the two instructions it replaces, but the ARM code generator can implement the new pushl with a single ldr instruction where the trivial approach to ARM code generation would have used a sub followed by an ldr for the original pushla load sequence.

Here is the second example rule rewritten similarly:

When you see this pattern:

        pushi |a|
        add

Replace it with this:

        addi |a|

Again, the new instruction is defined as being exactly equivalent to the two instructions it replaces, and again, it allows the ARM code generator to generate improved code. Applying both of these optimizations, our original stack code gets rewritten as follows:

pushla i
pushl  i
addi   1
pops

The code generator for an ARM machine might respond to this as follows:

        sub     r3, fp, #8    -- pushla i
        ldr     r4, [fp, #-8] -- pushl  i
        add     r4, r4, #1    -- addi   1
        str     r4, [r3, #0]  -- pops
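
One way to get this effect without a separate pass is to buffer the most recent stack instruction inside the code generator interface and fuse it with the next one whenever a rule applies. The following sketch does this for the two rules above; the opcode set, routine names, and one-instruction buffer are assumptions made only for illustration:

#include <stdio.h>

/* a sketch of fusing stack instructions on the fly as they are generated */

typedef enum { NONE, PUSHLA, PUSHL, PUSHI, ADDI, LOAD, ADD, POPS } opcode;

static opcode pending = NONE;   /* the one buffered stack instruction */
static int    pending_arg;

/* emit the buffered instruction, if any */
static void flush( void ){
        static const char *name[] =
                { "", "pushla", "pushl", "pushi", "addi", "load", "add", "pops" };
        if (pending == NONE) return;
        if (pending == LOAD || pending == ADD || pending == POPS)
                printf( "%s\n", name[pending] );
        else
                printf( "%-6s %d\n", name[pending], pending_arg );
        pending = NONE;
}

/* generate one stack instruction, fusing with the buffered one
   when one of the two rewrite rules above applies */
void gen( opcode op, int arg ){
        if (pending == PUSHLA && op == LOAD)
                pending = PUSHL;        /* pushla a; load  =>  pushl a */
        else if (pending == PUSHI && op == ADD)
                pending = ADDI;         /* pushi b; add    =>  addi b  */
        else {
                flush();
                pending = op;
                pending_arg = arg;
        }
}

int main(){
        /* the calls for  i = i + 1;  with i at frame-pointer offset -8 */
        gen( PUSHLA, -8 );
        gen( PUSHLA, -8 );
        gen( LOAD, 0 );
        gen( PUSHI, 1 );
        gen( ADD, 0 );
        gen( POPS, 0 );
        flush();        /* prints:  pushla -8, pushl -8, addi 1, pops */
        return 0;
}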

Peephole Optimization Applied to Pipelined Processors

Note that all modern processors of any quality at all are pipelined. A typical pipeline has 4 stages (some have more, some fewer):

  1. IF Instruction Fetch
  2. OG Operand Gather (from registers and instruction fields)
  3. OP Operate on the ALU or Memory (do the actual work)
  4. RS Result Store (back to a register)

Note that many processors also incorporate result forwarding -- this is work done in the operand gather stage where, if a source operand is in a register and the result store stage is currently trying to store a result in that same register, the operand gather stage takes the result directly instead of waiting for the store to the register to complete. The mechanism by which this is done does not matter to us as much as the fact that it can be done.

Consider the impact of this on the code of the example used above:

INSTRUCTION            TIME
ldr     r3, [fp, #-8]  | IF | OG | OP | RS |
add     r3, r3, #1          | IF |////| OG | OP | RS |
str     r3, [fp, #-8]                 | IF |////| OG | OP | RS |
                       | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  |

What could have been done in 6 clock cycles had there been no operand dependencies between these three instructions now takes 8 clock cycles. In each of the last two instructions, one clock cycle of delay was added to force the operand gather phase of that instruction to wait until the result computed by the operate phase of the previous instruction was available. Two cycles per instruction would have been added had it not been for the result forwarding logic typical of modern pipelined machines.
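
To check the count: with a four-stage pipeline and no stalls, three instructions finish in 4 + (3 - 1) = 6 cycles; the two dependency stalls add one cycle each, giving 6 + 2 = 8 cycles, as the diagram shows.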

Now, suppose we have a sequence of two unrelated statements such as:

i = i + 1;
j = j - 1;

If we convert these to the following assembly language code, we will incur 4 clock-cycles worth of delay because of operand dependencies:

        ldr     r3, [fp, #-8]  -- load i
        add     r3, r3, #1     -- add 1
        str     r3, [fp, #-8]  -- store i
        ldr     r3, [fp, #-12] -- load j
        sub     r3, r3, #1     -- sub 1
        str     r3, [fp, #-12] -- store j

In contrast, if we allocate different registers and interleave the two sequences, we can eliminate all of the operand delays:

        ldr     r3, [fp, #-8]  -- load i
        ldr     r4, [fp, #-12] -- load j
        add     r3, r3, #1     -- add 1
        sub     r4, r4, #1     -- sub 1
        str     r3, [fp, #-8]  -- store i
        str     r4, [fp, #-12] -- store j

Of course, the penalty for such interleaving is code that is very hard to read. This is why good compilers can frequently generate faster code than the average assembly-language programmer, who may be more interested in writing readable and maintainable code than in writing code that is as fast as possible.

Peephole optimization cannot take the two blocks of code given originally and interleave them this fully, but there are some simple peephole rules that can eliminate many of these delays. Consider the following peephole optimization rule:

When you see this pattern:

        str     |a|, [fp, #|b|]
        ldr     |c|, [fp, #|d|]

Where |a| ≠ |c|, and |b| ≠ |d|, replace it with this:

        ldr     |c|, [fp, #|d|]
        str     |a|, [fp, #|b|]

In short, where you see a store to a local variable followed by a load from a different local variable, swap them so long as different registers are used.

Why restrict this to local variables (that is, those indexed off of the frame pointer)? Because in the case of Falcon programs, we can guarantee that there are no aliases for local variables. Non-local variables could have been passed to the current routine as reference parameters, and this creates the possibility that for arbitrary loads and stores, two different names, for example, a direct reference to a global variable and the use of a reference parameter, could refer to the same actual variable. Switching the order of loads and stores to a single particular variable would change the meaning of a program.
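
To make the conditions concrete, here is how the safety test might look on the tuple representation sketched earlier. The field layout and the test for a frame-pointer-relative operand are assumptions made for illustration:

#include <string.h>
#include <stdbool.h>

/* one pre-parsed instruction tuple, as in the earlier sketch */
typedef struct {
        char op[8];
        char a[16], b[16], c[16];       /* e.g. <"str","r4","[fp, #-8]",""> */
} tuple;

/* true if an operand is frame-pointer relative, like "[fp, #-8]";
   only such operands are known to be alias-free local variables */
static bool fp_relative( const char *operand ){
        return strncmp( operand, "[fp,", 4 ) == 0;
}

/* true if the adjacent pair t0;t1 is a str followed by an ldr that
   may safely be exchanged: distinct locals and distinct registers */
bool can_swap( const tuple *t0, const tuple *t1 ){
        return strcmp( t0->op, "str" ) == 0
            && strcmp( t1->op, "ldr" ) == 0
            && fp_relative( t0->b ) && fp_relative( t1->b )
            && strcmp( t0->a, t1->a ) != 0      /* different registers */
            && strcmp( t0->b, t1->b ) != 0;     /* different locals    */
}

When can_swap holds, the peephole driver simply exchanges the two buffered tuples before emitting them.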

Of course, this optimization only works if the compiler's code generator allocates different registers for the different variables. This is difficult with a pure stack-oriented code generator, where register assignment depends on the level in the stack, while it works well with pure LRU allocation of registers.