# Lecture 40, Stack Top Caches

## Implementing Stacks in Hardware

The Burroughs Corporation, when they built the B5000 machine back in the 1960s, recognized that a pure stack architecture had an obvious performance problem. Consider the run-time cost of executing i = i + 1 on a machine where the stack is entirely memory resident:

```
PUSHLA    i    -- sp = sp + 1; M[sp] = fp + i;
PUSHLA    i    -- sp = sp + 1; M[sp] = fp + i;
LOAD           -- M[sp] = M[M[sp]];
PUSHI     1    -- sp = sp + 1; M[sp] = 1;
ADD            -- M[sp-1] = M[sp-1] + M[sp]; sp = sp - 1;
POP            -- M[M[sp-1]] = M[sp]; sp = sp - 2;
```

In sum, this requires 12 memory cycles (excluding instruction fetches). An obvious way to improve this is to use one register inside the CPU to hold the stack top; call it a:

```
PUSHLA    i    -- sp = sp + 1; M[sp] = a; a = fp + i;
PUSHLA    i    -- sp = sp + 1; M[sp] = a; a = fp + i;
LOAD           -- a = M[a];
PUSHI     1    -- sp = sp + 1; M[sp] = a; a = 1;
ADD            -- a = M[sp] + a; sp = sp - 1;
POP            -- M[M[sp]] = a; a = M[sp-1]; sp = sp - 2;
```

This offers a modest improvement: 12 memory cycles are reduced to 8. But if you think about building hardware, you realize that for any add operation, both operands must be held in registers while the ALU computes the result, and for any store operation, both the address and the value being stored must be held in registers. Most of the reads from the stack in the above would simply be filling this second register.

The most obvious way to exploit this second register is to keep the top two elements of the stack in registers at all times. Let's call the register holding the stack top a and the one holding the element below the stack top b. The naive way to use this hardware to execute the above program would be:

```
PUSHLA    i    -- sp = sp + 1; M[sp] = b; b = a; a = fp + i;
PUSHLA    i    -- sp = sp + 1; M[sp] = b; b = a; a = fp + i;
LOAD           -- a = M[a];
PUSHI     1    -- sp = sp + 1; M[sp] = b; b = a; a = 1;
ADD            -- a = a + b; b = M[sp]; sp = sp - 1;
POP            -- M[b] = a; a = M[sp]; b = M[sp-1]; sp = sp - 2;
```

Unfortunately, this makes no difference in the run time, since there are still 8 memory cycles. Note that we do not count the register-to-register data transfers as costing any time because, when this is done inside the CPU, these transfers can all be done in parallel.

### A Brilliant Idea

I suspect that Burroughs went through something like the above before someone on the design team came up with a brilliant idea: instead of keeping the two registers full of data at all times, add two bits of state information, one bit to indicate whether a currently holds data and one to indicate whether b does. Reading the two bits as a binary number, with the a bit as the high-order bit, the system has 3 potential states (state 2, with a in use but b vacant, never arises because b is filled first):

• state 0: a and b both vacant
• state 1: a vacant, b in use (the stack top)
• state 3: a in use (the stack top), b in use

Now, consider how the above program would run if the stack was initially in state 0:

```
PUSHLA    i    -- b = fp + i; state = 1;
PUSHLA    i    -- a = fp + i; state = 3;
LOAD           -- a = M[a]; state = 3;
PUSHI     1    -- sp = sp + 1; M[sp] = b; b = a; a = 1; state = 3;
ADD            -- b = a + b; state = 1;
POP            -- a = b; b = M[sp]; sp = sp - 1; M[b] = a; state = 0;
```

This reduces the run time to 4 memory cycles, a significant improvement. The Burroughs implementation simply included the current value of the stack state in its instruction decoding, so that each instruction had a different implementation in each potential stack state:

PUSHLA i in state 0
b = fp + i; state = 1;
PUSHLA i in state 1
a = fp + i; state = 3;
PUSHLA i in state 3
sp = sp + 1; M[sp] = b; b = a; a = fp + i; state = 3;

PUSHI i in state 0
b = i; state = 1;
PUSHI i in state 1
a = i; state = 3;
PUSHI i in state 3
sp = sp + 1; M[sp] = b; b = a; a = i; state = 3;

LOAD in state 0
b = M[sp]; sp = sp - 1; b = M[b]; state = 1;
LOAD in state 1
b = M[b]; state = 1;
LOAD in state 3
a = M[a]; state = 3;

ADD in state 0
a = M[sp]; b = M[sp-1]; sp = sp - 2; b = a + b; state = 1;
ADD in state 1
a = b; b = M[sp]; sp = sp - 1; b = a + b; state = 1;
ADD in state 3
b = a + b; state = 1;

POP in state 0
a = M[sp]; b = M[sp-1]; sp = sp - 2; M[b] = a; state = 0;
POP in state 1
a = b; b = M[sp]; sp = sp - 1; M[b] = a; state = 0;
POP in state 3
M[b] = a; state = 0;

In effect, the a and b registers served as a cache for the top of stack. Later stack machines added additional registers, enlarging this cache and further reducing the likelihood that stack operations would result in actual memory cycles.

## Stack Top Caches in Compilers

This idea can be applied directly to register allocation in a compiler, with one major change: in a compiler, we want to avoid spending instructions on register-to-register transfers, so we do not want to dedicate a particular register to holding the top item on the stack. Consequently, our stack state is going to be far more complicated. If we have a block of n registers that is to be used as a stack-top cache, each of these registers has one of n+1 possible uses: it may be unused, it may hold the stack top, an item below the stack top, all the way up to the item n-1 positions below the stack top.

A first venture at this would attach, to each register, a counter indicating whether it was currently not part of the stack (0), or whether it represented the top element (1), the element below the top (2), and so on. This suggests that we might need to search the registers to find the stack top, and it suggests the need to update all of the registers with nonzero tags each time there is a push or pop. This is not appealing.

Fortunately, there is an alternative. If we arrange the set of cache registers used to cache the stack top as a circular buffer spanning registers min through max, so that the cache size is cache = (max - min) + 1, we can use just one index to track which register holds the stack top, call it top, and a count to track how many stack elements are currently in registers, call it count. On the ARM, for example, min might be 3 and max might be 12. Here is the logic for generating code for the PUSHI c and ADD instructions in this context:

```
void pushi( int c ){
    int r = top + 1;
    if (r > max) r = min;
    if (count == cache) { /* we need to free a register */
        -- emit code to push R[r] on the RAM stack --
        count = count - 1;
    }
    -- emit code to load the constant c into R[r] --
    top = r;
    count = count + 1;
}

void add(){
    if (count == 0) { /* we need to get the stack top into a register */
        -- emit code to pop the RAM stack into R[top] --
        count = count + 1;
    }
    int r = top - 1;
    if (r < min) r = max;
    if (count == 1) { /* we need to get the second operand into a register */
        -- emit code to pop the RAM stack into R[r] --
        count = count + 1;
    }
    -- emit code to add R[r] plus R[top] and put the result in R[r] --
    top = r;
    count = count - 1;
}
```

This scheme works well within a basic block, but whenever control converges on a label from two or more directions, it does not guarantee that the stack will be in the same state on each path into that label. As a result, at the end of each branch point in the code, the stack-top cache must be put into a standard state such as entirely in RAM. This was not a problem with the hardware stack-top cache, since the cache state there was a dynamic attribute. In general, stack-top caches are not a good stating point for developing more optimal code generators.