Attacking The Return Address

Part of 22C:169, Computer Security Notes
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Buffer Overflow

Consider this bit of C code intended to get a number from the input stream:

#include <stdio.h>
#include <stdlib.h>
int getnum() {
        char buf[32];
        gets(buf);
        return atoi(buf);
}

Here, gets and atoi are routines in the standard C library to get a string from the standard input (gets from stdio.h) and convert from a textual string to an integer (atoi from stdlib.h.

This code is typical of thousands, perhaps millions of lines of C and C++ code written around the world. Code like this has found its way into compilers, web web browsers, text editors, input-output drivers and many other contexts.

The problem with this code is that the input primitive it uses has no idea how long a string to get. It simply gets one line of data, whether that is one character of 100. Since the line is only supposed to contain a number, correct input to the program will never involve very long lines of text, this testing of a program using this routine may never reveal a problem unless the person designing the test knows about the risk.

In this case, the risk is that the programmer only allowed for lines of 31 or fewer characters. Recall that C and C++ strings always end with a null character, so if someone types a line 31 characters long, the string that results is actually 32 bytes, including tha null.

Of course, the programmer should never have used gets. The designers of C included this routine in their standard library early on in the development of the language, and only later realized that including it had been a mistake. By then, there were so many C programs out there using this routine that they couldn't just eliminate it, so they put a warning in the manual encouraging programmers to use fgets instead, but it remains available and tempting, to this day.

Consider including the above routine with this trivial main program.

int main() {
        for (;;) {
                int i;
                printf("Enter a number ");
                i = getnum();
                printf("Your number was: %d\n", i);
        }
}

This program just gets a number from the terminal and echos it, not very exciting at all. Now, try running the program with an eye to what the program will do when yor input string is too long. There are two possibilities, both fatal:

If the stack grows up: In this case, pushing on the stack increments the stack pointer. If the string is larger than the 32 character buffer, it will overwrite the next activation record on the stack -- in this case, the activation record of the gets routine that was called to fill the buffer. When it comes time for gets to return, the return address will have been overwritten by the input data, so the return will be "into the wild blue".

If the stack grows down: In this case, pushing on the stack decrements the stack pointer. If the string is larger than the 32 character buffer, it will overwrite whatever was pushed on the stack before the call to getnum, but before it gets there, it will overwrite the return address of getnum itself and anything else that was implicitly pushed onto the stack on the way to entering this getnum.

Either way, the program is unlikely to continue operating normally. Try experimenting with this routine. See what it does.

Experimenting with Deliberate Damage to the Stack

From the previous lecture, we developed a vague idea of where the return address was stored in the activation record. We can construct a little program to allow us to explore the details of this by probing different locations.

#include <stdio.h>
#include <stdlib.h>
int offset; /* offset into stack to probe */
void probe() {
        char * p;
        p = (char *)&p;
        p[offset] = '\x55';
}

int main(int argc, char * argv[]) {
        offset = atoi(argv[1]);
        probe();
        putstr("No damage\n");        
}

Note, this main program takes its input from the command line arguments provided when the program is launched. The parameters to a C or C++ main program launched from the command line under Unix, Linux or Microsoft's DOS are, argc, an integer count of the command line arguments, and argv, an array of pointers to strings, one string per argument. argv[1] points to the first command line argument, and because C and C++ think that an array is the same as a pointer to the first element of that array, argv[1][0] is the first character of the first argument.

Note: This program really stinks. It ought to have, at the very least, checked to see that the argument was passed. If you don't type in an argument, so argc is zero, argv[1] will be a null pointer, and the attempt to look at the string it points to will likely result in an addressing error.

Using this program, we can probe around in memory, relative to the activaiton record of probe, looking for memory locations that cause damage. We know that return addresses are probably 32-bit numbers, so using the results of the experiments in the previous lecture, we can guess roughly the offsets that will cause the most damage. It is highly likely that 4 offsets in this range will cause our little program to fail.

Experiment. Find the specific offsets from that cause the program to fail instead of returning with no damage.

A Bit of C/C++ Obscurity

In C and C++, the name of a function is a pointer to that function. Consider this bit of code to demonstrate this fact:

#include <stdio.h>
void printit(char * s, char * p) {
        printf("%s = %x;", s, (long int)p);
        printf(" first 2 bytes %x %x", p[0], p[1] );
}

main() {
        printit( "printit", (char *)printit );
        printit( "main", (char *)main );
}

Here, the printit routine prints out the name of the subroutine, passed as a string, the address of the subroutine, passed as a pointer (to a character), and the first two bytes of that subroutine, printed by following that pointer. These two bytes are actual machine instructions. The address and the instructions are printed in hex.

The main program simply requests information on each of the routines in the program (there are only two). It casts the function names to type char* in order to meet the requirements of printit.

When you run this program, you will find that (to no great surprise), that the compiler compiles the routines of your program in the order you wrote them, and that the first few instruction bytes of each routine are always the same because the receiving sequences are always the same. That is, the sequence of instructions used to receive control from the caller are always the same -- typically instructions to save something, registers or the return address, on the stack.

Exploits

Now that we know, to a fair degree of certainty, where the return address of the function is, we should be able to exploit it to cause that function to return to the wrong place. Consider the following bit of code:

#include <stdio.h>
#include <stdlib.h>
void wrongplace() {
        printf("We got to the wrong place\n");
        exit();
}

int offset; /* offset into stack to probe */
void probe() {
        char * p;
        p = (char *)&p;
        *(void **)(&p[offset]) = (void *)wrongplace;
}

int main(int argc, char * argv[]) {
        offset = atoi(argv[1]);
        probe();
        puts("No damage\n");        
}

Where the previous program merely probed into its stack by writing nonsense over one byte, this program probes into its stack by writing a pointer, probably a 4-byte value on most modern computers, into the designated part of its stack. This pointer is a pointer to the subroutine wrongplace, and if you probe the stack with the correct offset into the stack, the result will be that probe returns by entering wrongplace. Of course, it enters wrongplace in an awkward way, thinking that it was returning from a function, and certainly not pushing a return address on the stack. This is why wrongplace was written to terminate by calling exit, terminating the entire program.

The attack we demonstrated above was, of course, a self-attack, but we can make it into a useful attack if we can get the address of a subroutine within the program we wish to attack and slip this address into the string of characters that the program reads using gets or any of a large number of other vulnerable input routines that C and C++ programmers have used over the years.