36. Branch Prediction Logic

Part of the 22C:122/55:132 Lecture Notes for Spring 2004
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Branch Anticipation Caches

Associative memory plays an important role in one special area of modern high-speed central processors; that is, predicting which instruction should be fetched next.

In a pipelined machine, the obvious next instruction to fetch is the successor of the previously fetched instruction, but this will be incorrect whenever a branch is taken. The problem we face is to design an instruction fetch mechanism that will correctly anticipate the next instruction even in the presence of branches. This can be done with moderate success using a branch prediction cache.

Given an associative memory holding the successors for branch instructions, we can build an instruction fetch stage that operates as follows:

              ------o----------
             |    __|__     ___|___________
             |   |     |   |  tag  |       | Branch
	     |   | + 1 |   |       | value | Prediction
	     |   |_____|   |__hit__|_______| Cache
             |      |          |       |
             |      |  --------|------- 
             |     _|_|_       |
             /    \ 0 1 /------  Make prediction
             |     \___/         
         ----|-----  |           
        |    |    _|_|_          
        |   -|---\ 1 0 / Invalidate prediction
        | i| |    \_ _/         
        | n| |   ___|___ 
        | v| |  |>__PC__|
        | a| |      |
        / l|  ---o--o----------------> addr to memory
        | i|     |        -----------< data from memory
        | d|  ___|___ ___|___
        | a| |>__PC__|___IR__|
        | t|    _|_______|____
        | e|   |
        ^  ^   |  next pipeline stage

In sum, we predict that the address of the next instruction will be the successor of the previous one except when we find that the previous address in the branch prediction cache. In that case, we predict that the next instruction will be at the address given by the cache.

Checking the Branch Prediction

Because our branch predictions may be wrong, we must be prepared to check them. We do this by keeping a record, with each instruction, of the address from which it was fetched. Thus, we include the program counter value in the interstage register at each stage of the pipeline up to the stage that verifies that we have fetched the correct instructions. (We had to do this anyway in order to allow PC-relative addressing to be done in the address computation stage, or in order to allow precise traps to properly track what instruction caused the trap.) The verification is done in the pipeline stage that actually executes the branch.

        
        ^  ^        |       |
        |  |   -----o       |
        /  |  |  ___|___ ___|___ __
        |  |  | |>__PC__|___IR__|__ ...
        |  |  |     |       |           
        | i|  |   __|__    _|______________
        | n|  |  |     |  |           
        | v|  |  | + 1 |  |     stage that executes
        | a|  /  |_____|  |     branch instructions
        | l|  |     |     |
        | i|  |     |  ---| effective address
        | d|  |    _|_|_  |
        | a|  |   \ 0 1 /-| take branch
        | t|  |    \___/  |________________
        | e|  |      |
        |  |   ----  |
         --|-------|-o
           |      _|_|_ 
           |     |     |
           |     |  =  |
           |     |_____|
           |        O   
           |        |   
            --------o--------------->

When there was no branch prediction, we invalidated all of the upstream instructions in the pipeline whenever there was a branch. With branch prediciton, we instead compare our predicted next address with the next address computed in the branch stage of the pipe.

If the predicted next address was wrong, we do two things: First, we load the correct next address in the program counter used by the instruction fetch stage, and second, we invalidate all partially executed instructions between the instruction fetch stage and the branch stage.

Updating the Branch Prediction Cache

When should a new branch prediciton be added to the cache?
When should a prediction be removed?

The most conservative approach is to only use the branch prediction cache for unconditional branches. In this case, we add a prediction whenever we encounter an unconditional branch opcode, and we never need to delete predictions from the cache except when we modify the code or when the cache fills (in which case, new predictioins overwrite old ones). In that case, we invalidate the entire branch prediction cache when we invalidate the I-cache, and at all other times, we simply add a new prediction whenever we encounter an unconditional branch.

A more extreme approach is to add every branch taken to the branch prediction cache, and to invalidate entries whenever they make a bad prediction. We therefore add the following logic to the pipeline stage that may take a branch in order to add predictions:

        ^  ^  ^
        |  |  |  ___|___ ___|___ __
        |  |  | |>__PC__|___IR__|__ ...
        |  |  |     |       |           
        |  |  |     |      _|______________
        |  |   --/--      |           
        |  |    tag       |     stage that executes
        |  |              |     branch instructions
        |  |  value       |
        |   -----/--------| effective address
        |     ___         |
        |    |   |--------| take branch
         ----|AND|        |________________
        add  |___|-- 
        prediction  |
                    |
                    ^
                 delete
               prediction

Notice that we only add a prediction if the prediction in the cache was wrong. If the cache correctly predicted this branch, we make no changes! Finally, we add the following logic to delete incorrect predictions:

        
        ^  ^        |       |
        |  |   -----o       |
        |  /  |  ___|___ ___|___ __
        |  |  | |>__PC__|___IR__|__ ...
        |  |  | tag |       |           
        |   --|-----o       |           
        |     |   __|__    _|______________
        |d    |  |     |  |           
        |e    |  | + 1 |  |     stage that executes
        |l    /  |_____|  |     branch instructions
        |e    |     |     |
        |t    |     |  ---| effective address
        |e    |    _|_|_  |
        |     |   \ 0 1 /-| take branch
        |p    |    \___/  |________________
        |r    |      |
        |e     ----  |
        |d         | |
        |i        _|_|_
        |c       |     |
        |t       |  =  |
        |i       |_____|
        |o          O
        |n          |
         -----------

Here, the address of the instruction who's successor was incorrectly preicted is used as a tag in the prediction cache, and if there is a match for this tag, that entry is invalidated.

Is this more complex scheme likely to pay off? Clearly, both schemes will pay off for unconditional branches. Since most loops iterate more than once, the conditional branches involved in post-test loop exits that are taken during the first iteration of a loop will tend to be taken the same way during all subsequent iterations up until the last, so the complex scheme works for those branches.

If a function is called exactly once in a loop body, the complex scheme will correctly predict the return address. Functions that are called multiple times in a loop body will have their returns predicted incorrectly, but the loops that limit the speed of real programs rarely call any given function more than once per iteration.

If statements that are actually taken randomly, with a 50/50 probability distribution will not be predicted in any useful way, but most if statements are not random! Many if statements handle exception conditions, and most others have a preponderance of the control flow going to one branch or the other.

Therefore, it is highly likely that the complex approach to branch prediction will correctly predict a large fraction of the conditional branches in real programs, and therefore, that it will eliminate a large fraction of the bubbles from the pipeline in iterative code and other branch-intensive computations.

Better Predictions

Efforts have been made to improve branch predition. One scheme that has a significant potential payoff is to use a more complex tag, for example, not merely the previous address, but the previous address plus some other recent address. Of course, given that many associative memories discard a considerable amount of information, we do not use the full concatenation of 32 bit (or longer, on some modern machines) addresses, as a tag, but rather, some combinatorially simple hash function of the two addresses.

If our goal is to allow, for example, two different return addresses to be cached for a given function, we do not want to use the address of the previous two instructions, but rather, we want the address of the most recent call plus the address of the previous instruction. Research suggests that this can lead to marginally useful improvements in run-time for real programs.

The Associative Memory Specification

From the above, we can infer that the branch prediction cache is slightly more interesting than most. We found uses for the following inputs above:

Tag₁ -- used to search for a prediction.
Tag₂ -- used to add a prediction.
Value
Add Prediction
Tag₃ -- used to delete a prediction.
Delete prediction

It is quite possible that all three of these might be used during a single clock cycle, if a branch is executed during that cycle and the previously predicted destination address was wrong.

It is not a horribly difficult exercise to design an associative memory that can do all three of these at the same time. For example, the snooping cache model required a second tag input for invalidating cache entries, and exactly the same mechanism can be used to delete predicitons using tag₃.

The ability to search the cache using one tag while simultaneously adding an entry to the cache using another tag is significantly more troublesome! If we assume a small branch prediction cache, large enough to improve the performance of inner loops and little more, we could easily build a fully associative memory with two or even three tag comparators on each line of the memory. Given that the branch prediction logic is always part of a VLSI CPU, this is quite practical.

But, we need not go to this extreme. Note that the destination of a branch instruction is almost never another branch instruction! In fact, when an optimizing compiler discovers that it has generated a branch to a branch instruction, it should always adjust the destination of the first branch to go directly to the destination of the second branch. Therefore, it is quite reasonable to disable the branch prediction logic for one clock cycle whenever we find that we need to add a prediction to the cache, and then use that clock cycle to add the new prediction to the cache!

Speculation

The approaches to branch prediction outlined above all support speculative execution. When we fetch instructions, we assume that they will be executed, so we pour them into the pipeline and only invalidate instructions if we discover that our speculation was incorrect. In sum, there is nothing extraordinary about speculative execution.

In a superscalar pipeline, we can do something quite interesting, though. If we use the branch prediction cache to cache not only the addresses but also the values fetched from memory at that address, we can, when we get a branch prediction, speculate both that this branch will be taken and that it will not, injecting the PC+1 prediction into one pipeline and the cache prediciton into the other pipeline. This way, when we finally discover which instruction should be taken and which should not, we can cancel only half of the instructions in the pipeline, leaving the other half intact.

The mechanisms needed for two-way branch prediction are only slightly short of the mechanisms needed to do what is now called hyperthreading, where we allow the superscalar CPU to split itself into two independent CPUs. Recent research shows that enabling hyperthreaded execution generally destroys the effectiveness of multiway speculation, so if the performance of the architecture depends on multiway speculation to any significant extent, then hyperthreaded execution has only limited value.