Associative memory plays an important role in one special area of modern high-speed central processors: predicting which instruction should be fetched next.
In a pipelined machine, the obvious next instruction to fetch is the successor of the previously fetched instruction, but this will be incorrect whenever a branch is taken. The problem we face is to design an instruction fetch mechanism that will correctly anticipate the next instruction even in the presence of branches. This can be done with moderate success using a branch anticipation cache.
Given an associative memory holding the successors for branch instructions, we can build an instruction fetch stage that operates as follows:
[Figure: the instruction fetch stage. The previous fetch address feeds both a +1 incrementer and the tag input of the branch prediction cache; on a hit, a multiplexor substitutes the cached value for the incremented address. A second multiplexor driven by the invalidate signal can override the prediction before it is clocked into the PC. The PC supplies the address to memory, and the fetched instruction and its PC are latched into the PC/IR interstage register feeding the next pipeline stage.]

In sum, we predict that the address of the next instruction will be the successor of the previous one, except when we find the previous address in the branch prediction cache. In that case, we predict that the next instruction will be at the address given by the cache.
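The fetch rule just summarized can be sketched in software. This is a minimal Python model, not hardware: the cache is a dictionary from fetch address (the tag) to predicted successor (the value), and the class name and the word-addressed `pc + 1` successor are illustrative assumptions.

```python
# Hypothetical sketch of the fetch-stage prediction rule described above.
# The cache maps the address of an instruction (the tag) to the predicted
# address of its successor (the value). All names are illustrative.

class BranchPredictionCache:
    def __init__(self):
        self.entries = {}          # tag (fetch address) -> predicted successor

    def predict(self, pc):
        """Return the predicted next fetch address for the instruction at pc."""
        if pc in self.entries:     # hit: predict the cached branch target
            return self.entries[pc]
        return pc + 1              # miss: predict the sequential successor

bpc = BranchPredictionCache()
bpc.entries[100] = 37              # a branch at address 100 with target 37
assert bpc.predict(99) == 100      # no prediction cached: fall through
assert bpc.predict(100) == 37      # cached prediction overrides pc + 1
```

The dictionary stands in for the tag/value/hit ports of the associative memory in the figure; the `if` plays the role of the multiplexor selected by the hit signal.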
Because our branch predictions may be wrong, we must be prepared to check them. We do this by keeping a record, with each instruction, of the address from which it was fetched. Thus, we include the program counter value in the interstage register at each stage of the pipeline up to the stage that verifies that we have fetched the correct instructions. The verification is done in the pipeline stage that actually executes the branch.
[Figure: the stage that executes branch instructions. The branch's PC plus one and the effective address feed a multiplexor controlled by the take-branch signal; the resulting correct next address is compared with the PC recorded for the instruction that followed the branch, and a mismatch asserts the invalidate signal.]

Without branch prediction, we invalidated all of the upstream instructions in the pipeline whenever there was a branch. With branch prediction, we instead compare our predicted next address with the next address computed in the branch stage of the pipe.
If the predicted next address was wrong, we do two things: First, we load the correct next address into the program counter used by the instruction fetch stage, and second, we invalidate all partially executed instructions between the instruction fetch stage and the branch stage.
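The two-part recovery above can be modeled in the same style. In this hypothetical sketch, `verify_branch` plays the role of the branch stage: it computes the correct next address, and on a mismatch it squashes the partially executed upstream instructions and returns the address to reload into the fetch-stage PC. All names are illustrative.

```python
def verify_branch(predicted_next_pc, branch_pc, effective_address, taken, upstream):
    """Hypothetical check done in the stage that executes branches.
    predicted_next_pc is the fetch address recorded for the instruction
    that entered the pipe right after the branch; upstream is the list of
    partially executed instructions between fetch and this stage."""
    correct_next = effective_address if taken else branch_pc + 1
    if predicted_next_pc != correct_next:
        for stage in upstream:      # invalidate everything fetched after the branch
            stage['valid'] = False
        return correct_next         # reload this into the fetch-stage PC
    return None                     # prediction verified; nothing to do

pipe = [{'valid': True}, {'valid': True}]
assert verify_branch(105, 104, 40, True, pipe) == 40   # mispredicted: squash
assert all(not s['valid'] for s in pipe)
```

Returning `None` on a correct prediction corresponds to leaving the invalidate signal unasserted.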
When should a new branch prediction be added to the cache? When should a prediction be removed? The most conservative approach is to use the branch prediction cache only for unconditional branches. In this case, we add a prediction whenever we encounter an unconditional branch opcode, and we never need to delete predictions except when the code itself is modified. When that happens, we invalidate the entire branch prediction cache at the same time we invalidate the I-cache.
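The conservative policy amounts to two small rules, sketched here in the same dictionary model. The `'jump'` opcode name and the function names are assumptions made for illustration.

```python
def update_conservative(entries, opcode, branch_pc, target):
    """Conservative policy: cache predictions only for unconditional
    branches ('jump' is an illustrative opcode name)."""
    if opcode == 'jump':
        entries[branch_pc] = target

def invalidate_with_icache(entries):
    """The code was modified: flush the prediction cache with the I-cache."""
    entries.clear()

e = {}
update_conservative(e, 'jump', 10, 99)   # unconditional branch: cached
update_conservative(e, 'beq', 12, 30)    # conditional branch: never cached
assert e == {10: 99}
```

Because unconditional branches always go to the same place, no entry added under this policy can ever make a wrong prediction while the code is unchanged.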
A more extreme approach is to add every branch taken to the branch prediction cache, and to invalidate entries whenever they make a bad prediction. We therefore add the following logic to the pipeline stage that may take a branch in order to add predictions:
[Figure: logic added to the branch stage to add predictions. The branch instruction's PC supplies the tag and its effective address supplies the value; an AND gate asserts add-prediction only when the branch is taken and the cached prediction was found to be wrong.]

Notice that we only add a prediction if the prediction in the cache was wrong. If the cache correctly predicted this branch, we make no changes! Finally, we add the following logic to delete incorrect predictions:
[Figure: logic added to delete incorrect predictions. The PC of the instruction in the branch stage is fed back to the prediction cache as a delete tag; the same mismatch that invalidates the upstream pipeline stages asserts delete-prediction.]

Here, the address of the instruction whose successor was incorrectly predicted is used as a tag in the prediction cache, and if there is a match for this tag, that entry is invalidated.
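Combining the add and delete logic above, the aggressive policy can be modeled as a single update applied in the branch stage. This is a sketch under the same dictionary model; the mispredict test and the word-addressed `branch_pc + 1` fall-through are illustrative.

```python
def update_aggressive(entries, branch_pc, taken, target, predicted_next):
    """Hypothetical add/delete logic for the aggressive policy: a taken
    branch whose outcome was mispredicted is added to the cache, and an
    entry that made a wrong prediction is deleted."""
    correct_next = target if taken else branch_pc + 1
    if predicted_next == correct_next:
        return                       # correct prediction: change nothing
    if taken:
        entries[branch_pc] = target  # add (or overwrite) the prediction
    elif branch_pc in entries:
        del entries[branch_pc]       # the cached entry mispredicted: delete it

e = {}
update_aggressive(e, 100, True, 37, 101)   # taken, predicted fall-through: add
assert e[100] == 37
update_aggressive(e, 100, False, 37, 37)   # not taken, cache said 37: delete
assert 100 not in e
```

The early `return` is the AND gate in the figure: when the comparator reports a correct prediction, neither the add nor the delete port is exercised.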
Is this more complex scheme likely to pay off? Clearly, both schemes will pay off for unconditional branches. Since most loops iterate more than once, the conditional branches involved in loop exits tend to be taken the same way during every iteration but the last, so the complex scheme works for those branches.
If a function is called exactly once in a loop body, the complex scheme will correctly predict the return address. Functions that are called multiple times in a loop body will have their returns predicted incorrectly, but the loops that limit the speed of real programs rarely call any given function more than once per iteration.
If statements that branch randomly, with a 50/50 probability distribution, will not be predicted in any useful way, but most if statements are not random! Many if statements handle exceptional conditions, and most others send a preponderance of the control flow to one branch or the other.
Therefore, it is highly likely that the complex approach to branch prediction will correctly predict a large fraction of the conditional branches in real programs, and therefore, that it will eliminate a large fraction of the bubbles from the pipeline in iterative code and other branch-intensive computations.
Efforts have been made to improve branch prediction. One scheme that has a significant potential payoff is to use a more complex tag, for example, not merely the previous address, but the previous address combined with some other recent address. Of course, given that many associative memories already discard a considerable amount of information, we do not use the full concatenation of two 32-bit (or longer, on some modern machines) addresses as a tag, but rather, some combinationally simple hash function of the two addresses.
If our goal is to allow, for example, two different return addresses to be cached for a given function, we do not want to use the address of the previous two instructions, but rather, we want the address of the most recent call plus the address of the previous instruction. Research suggests that this can lead to marginally useful improvements in run-time for real programs.
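One possible hash of this kind, offered only as an illustration: XOR the two addresses, with a one-bit shift so that identical addresses do not cancel, and keep the low-order bits of the result. The shift, the tag width, and the function name are all assumptions, not a recommendation from the text.

```python
def hash_tag(call_addr, prev_addr, bits=12):
    """Illustrative tag hash: fold two addresses into a short tag by
    XOR (prev_addr shifted by one so equal inputs do not cancel),
    keeping only the low-order `bits` bits."""
    return (call_addr ^ (prev_addr << 1)) & ((1 << bits) - 1)

# Two different call sites reaching the same return instruction
# yield different tags, so both return addresses can be cached.
tag_a = hash_tag(0x1004, 0x0500)
tag_b = hash_tag(0x1010, 0x0500)
assert tag_a != tag_b
```

Any combinationally simple mixing function would serve; XOR is attractive in hardware because it costs one gate per tag bit and no carry chain.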
From the above, we can infer that the branch prediction cache is slightly more interesting than most associative memories. We found uses for the following inputs above:

   tag1 -- the address of the instruction being fetched, used to search for a predicted successor.
   tag2 -- the address of a branch instruction, used with its effective address to add a new prediction.
   tag3 -- the address of an instruction whose successor was mispredicted, used to delete a prediction.
It is quite possible that all three of these might be used during a single clock cycle, if a branch is executed during that cycle and it is discovered that the predicted destination address was wrong.
It is not a horribly difficult exercise to design an associative memory that can do all three of these at the same time. For example, the snooping cache model required a second tag input for invalidating cache entries, and exactly the same mechanism can be used to delete predictions using tag3.
The ability to search the cache using one tag while simultaneously adding an entry to the cache using another tag is significantly more troublesome! If we assume a small branch prediction cache, large enough to improve the performance of inner loops and little more, we could easily build a fully associative memory with two or even three tag comparators on each line of the memory. Given that the branch prediction logic is always part of a VLSI CPU, this is quite practical.
But, we need not go to this extreme. Note that the destination of a branch instruction is almost never another branch instruction! In fact, when an optimizing compiler discovers that it has generated a branch to a branch instruction, it should always adjust the destination of the first branch to go directly to the destination of the second branch. Therefore, it is quite reasonable to disable the branch prediction logic for one clock cycle whenever we find that we need to add a prediction to the cache, and then use that clock cycle to add the new prediction to the cache!
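The cycle-stealing trick can be sketched as a single fetch cycle in the dictionary model: when a prediction is waiting to be written, the write takes the tag port and the fetch falls back to the sequential prediction, which is safe because the target of a branch is almost never itself a branch. The names and the one-entry `pending_add` queue are illustrative.

```python
def fetch_cycle(entries, pc, pending_add):
    """One fetch cycle of the single-ported scheme sketched above.
    pending_add is either None or a (tag, value) pair waiting to be
    written into the prediction cache."""
    if pending_add is not None:
        tag, value = pending_add
        entries[tag] = value         # the write steals the lookup port
        return pc + 1                # no lookup: predict fall-through
    return entries.get(pc, pc + 1)   # normal predicted fetch

e = {}
assert fetch_cycle(e, 100, (100, 37)) == 101   # write cycle: sequential guess
assert fetch_cycle(e, 100, None) == 37         # next cycle: the new entry hits
```

Forgoing the lookup for one cycle costs a correct prediction only in the rare branch-to-branch case, which an optimizing compiler should have eliminated anyway.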