Insecurity in the C Family of Languages

Part of 22C:169, Computer Security Notes
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Why C

In one sense, the C programming language is an inexcusable language with an exemplary heritage and some depressing consequences.

The Algol 60 programming language, developed by a committee made up of the greatest computer scientists of the late 1950's, introduced some key ideas, the if-then-else statement, for example. In the early 1960's, Christopher Strachey at Cambridge led a group that developed a language called CPL, Christopher's Programming Language (to his students), or the Cambridge Programming Language (to many), or the Combined Programming Language (to those who didn't like Cambridge getting all the credit).

CPL was never implemented. It existed on paper only. CPL was a reaction to shortcomings of Algol 60. One of Christopher Strachey's students implemented a subset of CPL, called Basic CPL or BCPL. BCPL was practical and attracted a small but interesting following in the late 1960's. BCPL was implemented on Multics as an alternative to PL/1, the official systems programming language of the Multics project, and several defense contractors began using it for at least some of their projects.

An interesting aside in the history of BCPL is that Strachey's group began experimenting with object-oriented programming using it, despite the fact that it had no built-in support for objects. BCPL gave us curly braces as brackets for blocks of statements. Algol 60 used the keywords begin and end for this.

One group that got interested in BCPL was at Bell Labs. Ken Thompson and Dennis Ritchie, experienced developers from the Multics system, were working there on the early prototype of the Unix system. They decided to invent a new language based on BCPL for use in the re-implementation of Unix. This language was christened C because an earlier subset of BCPL developed at Bell Labs was called B. The first compiler was written for the DEC PDP-11 computer, and some features of the language were based on ideas from that machine.

C became a success for two reasons: First, Unix was a success. A Unix academic license cost only a few dollars, and universities around the country began to experiment with Unix systems. Second, C was small enough and clean enough that microcomputer implementations became available.

In the early 1980's, it was not at all clear that C and Unix would become major forces. Pascal was the dominant programming language for academic research, and it was, technically, far superior. It was strongly typed, so it was far less prone to the kinds of security nightmares that C programs are notorious for. Pascal was also widely available on microcomputers.

The force that drove C into the limelight was the early Internet. Digital Equipment Corporation's VAX, a 32-bit successor to the 16-bit PDP-11, was priced right, and when the University of California at Berkeley ported Unix to the VAX, producing BSD Unix, VAXes running Unix quickly became the dominant computers used for E-mail. As the Internet was created, these became the dominant Internet hosts.

C and Pascal were not object oriented. Another language dating from the late 1960's, Simula 67, was the first programming language that was designed to support object oriented programming. For a decade, hardly anyone appreciated this language outside of Scandinavia, where it had been developed. Many thought of it as a special-purpose simulation language based on Algol 60. While simulation motivated the development of the language, it was definitely not special purpose.

Many programmers in the late 1970's and early 1980's began building object-oriented extensions to C and Pascal. Objective C, Object-Oriented Pascal, and many others came from this effort. One of those was C++, a language developed by Bjarne Stroustrup. He had been a Simula 67 programmer, and he was not happy programming in C in his new job at Bell Labs, so he wrote a preprocessor that took C with extensions to support good features from Simula 67 and produced C output. This extended language was known as C++.

The success of C++ was sealed by the development of the Gnu C++ compiler, a compiler that generated better code, for the VAX, than almost any other compiler, and that was a free open-source product. As the Internet grew, Unix and the Gnu C++ compiler were ported to many other computers, and C and C++ became the dominant languages for system implementation.

C++ has its problems. Many of these are problems it inherited from C. One group that was bothered by this was a group at Sun Microsystems. The result of their work, released in the mid 1990's, was Java. Java preserves much of the syntax of C++, cleaning up the semantics of the language and a few ugly bits of syntax. Unfortunately, it preserves many ugly features of C -- this was considered necessary in order to make it easy for C++ programmers to migrate to Java.

Modern type-safe languages such as Java have a serious vulnerability hiding under them. They tend to run on operating systems, including Linux, other systems with a Unix heritage, and Windows, that are implemented in a mix of C and C++. This means that, even though the language itself may be safe, the operating system interface on which it sits may contain numerous vulnerabilities, and any calls to external code (Java, for example, contains hooks allowing Java programs to call C or C++ code) introduce all of the potential insecurity of C and C++ into these more modern languages.

Dangerous Programming

The standard Introduction to C begins with the program to output the string "Hello World" to the standard output stream:

#include <stdio.h>

int main()
{
    printf("Hello World!\n");
}

We'll extend this dull little program with some declarations. Consider this version:

#include <stdio.h>

int main()
{
    int i;
    char s[] = "Hello World!\n";
    for (i = 0; s[i] != '\0'; i++) {
        putchar(s[i]);
    }
}

Now, our main program has two local variables. The first, i, is an integer. We're going to use it as the index to count through a for loop. The second, s, is an un-dimensioned array of characters initialized to the string "Hello World!". All strings in C implicitly have a null byte at the end, represented in C as '\0'. The for loop terminates when it finds this null byte.

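Notice that the array really is undimensioned! The compiler sets aside just enough memory to hold the initial string, including its null byte, but nothing checks index values when the program runs. If our program elects to use index values outside the range defined by the string, it will find some byte of memory.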
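To make the missing check concrete, here is a minimal sketch. The names greeting_storage and checked_char are hypothetical helpers invented for this illustration; the first reports how much memory the compiler reserved for the array, and the second performs, by hand, the range test that the expression s[i] never performs.

```c
#include <stddef.h>

/* Storage reserved for the array: 13 characters plus the null byte. */
size_t greeting_storage(void)
{
    char s[] = "Hello World!\n";
    return sizeof s;
}

/* Hypothetical helper: the range check that a[i] never performs.
   Returns the character, or -1 if i is outside the array. */
int checked_char(const char *a, size_t size, size_t i)
{
    if (i >= size)
        return -1;               /* refuse to read past the end */
    return (unsigned char)a[i];  /* in range, so the read is safe */
}
```

Nothing in C forces a program to route its array accesses through a helper like checked_char; the plain expression s[i] compiles and runs whatever the value of i.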

Here is another variant on the program that is even more unsafe:

#include <stdio.h>

int main()
{
    char *s = "Hello World!\n";
    for (; *s != '\0'; s++) {
        putchar(*s);
    }
}

Here, we didn't declare s as an array of characters; we declared it as a pointer to a character. Now, our for loop has no initialization, and we step through the string "Hello World" using the character pointer to pick successive characters out of the string. The character pointer can point to any memory address at all, but because we were careful, it remained pointing only to characters within the string (including the null byte).

Notice that we're doing arithmetic on the pointer with the operation s++. This is exactly equivalent to s=s+1. Adding one to a pointer in C doesn't always add one to the representation of that pointer -- it adds the size of one element of the type pointed to. In this case, because it's a pointer to a character, it adds 1 because characters are of size 1 byte.
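The scaling can be observed directly. The following sketch (the function names char_step and int_step are made up for this illustration) measures, in bytes, how far apart p and p + 1 are for two different pointer types.

```c
#include <stddef.h>

/* Distance in bytes between p and p + 1 when p points to a char. */
size_t char_step(void)
{
    char a[2];
    char *p = a;
    return (size_t)((char *)(p + 1) - (char *)p);
}

/* The same measurement when p points to an int. */
size_t int_step(void)
{
    int a[2];
    int *p = a;
    return (size_t)((char *)(p + 1) - (char *)p);
}
```

char_step() returns 1, while int_step() returns sizeof(int), which is 4 on many current machines but is not guaranteed to be.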

We can do worse, taking some serious risks:

#include <stdio.h>

int main()
{
    int i;
    char *s = "Hello World!\n";
    while (*s != '\0') {
        putchar(*s);
        i = (int)s;
        i = i + sizeof(char);
        s = (char *)i;
    }
}

Here, we have used integer addition to increment the pointer s. We did this by using a very dangerous C (and C++) feature called casting to take the representation of s and make the compiler treat that representation as an integer. Without the cast, the assignment of a pointer to an integer (or vice versa) would be illegal. With the cast, we've told the compiler, "yes, I really want you to do this unsafe thing."

The above program is verbose. We can shorten it, keeping it just as unsafe, by rewriting it as follows:

#include <stdio.h>

int main()
{
    char *s = "Hello World!\n";
    for (; *s != '\0'; s = (char *)((int)s + sizeof(char))) {
        putchar(*s);
    }
}
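The casts above only work when an int is as wide as a pointer. As a sketch of the same trick without that restriction, the version below uses uintptr_t from <stdint.h>, an integer type intended to hold a pointer's representation. It assumes a C99 compiler that provides uintptr_t (the standard makes it optional), and the function name count_with_casts is invented here. It counts characters instead of printing them so that the result is easy to check.

```c
#include <stddef.h>
#include <stdint.h>

/* Count the characters in s, stepping the pointer with integer addition.
   Assumes the platform provides uintptr_t; C99 makes it optional. */
size_t count_with_casts(const char *s)
{
    size_t n = 0;
    while (*s != '\0') {
        uintptr_t i = (uintptr_t)s;  /* pointer representation as an integer */
        i = i + sizeof(char);        /* integer addition, not pointer arithmetic */
        s = (const char *)i;         /* and back to a pointer */
        n++;
    }
    return n;
}
```

This is no safer than the versions above. It merely avoids truncating the pointer on machines where pointers are wider than int.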

But Why

The above examples are not meant to suggest that you should ever write this kind of code in any commercial product. In most contexts, the above code is dangerous and any programmer who writes such code should be severely reprimanded.

The above programs are progressively more dangerous, offering progressively more severe opportunities to make serious programming errors. We will use them as a launching point for something else, an exploration of how unsafe programming methods can expose the implementation of the programming language in ways that allow an outsider to attack the behavior of a program.

Object File Vulnerabilities

On a typical Unix system, if your test program (any of the programs above) is in a source file called, say, t.c, you'd compile it as follows:

cc t.c

There are several C and C++ compilers on most Unix systems. Typical names are cc for C and CC, c++ or g++ for C++. The above programs should work under all of them, unless your machine uses 64-bit pointers. On such machines, change int to long int in both the declaration and the casts so that the integer and pointer variables have the same size. Once the program is compiled, the output of the compiler is, by default, stored in a file called a.out.

This file is just that, a file. It is easy to change its contents. Here is an example session where such a change was made:

% cc t.c
% a.out
Hello World!
% modify "Hello" "H---o" < a.out > b.out
% chmod +x b.out
% b.out
H---o World!

Here, I wrote my own command (just another program) called modify that was applied to the input file a.out, giving the output file b.out. I used a utility called sed to write this command. My modify command substituted the text H---o for any occurrences of Hello it happened to find. The second-to-last command in the above example makes the file b.out executable, because it isn't executable by default.

Why did we do this? To demonstrate that programs, even after they are compiled, are just files. Object programs under most computer systems can be easily modified. Such modifications can make arbitrary changes to the file! Many classic viruses, for example, search out object files and attach themselves to those files by editing the object files they find.
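The modify command used above was built with sed. As a rough stand-in written in C (not the command actually used), the hypothetical function patch_bytes below performs the same kind of substitution on a file image already loaded into memory. It assumes the replacement text is exactly as long as the text it replaces, as H---o and Hello are, so that nothing else in the file changes position.

```c
#include <stddef.h>
#include <string.h>

/* Replace every occurrence of `from` in buf[0..len) with `to`.
   The buffer is treated as raw bytes, so embedded null bytes are fine.
   Returns the number of replacements, or 0 if the two strings differ
   in length (a different length would shift the rest of the file). */
size_t patch_bytes(unsigned char *buf, size_t len,
                   const char *from, const char *to)
{
    size_t n = strlen(from);
    size_t count = 0;
    size_t i = 0;

    if (n == 0 || n != strlen(to) || n > len)
        return 0;
    while (i + n <= len) {
        if (memcmp(buf + i, from, n) == 0) {
            memcpy(buf + i, to, n);  /* overwrite in place */
            count++;
            i += n;                  /* skip past the patched text */
        } else {
            i++;
        }
    }
    return count;
}
```

A complete tool would read a.out into a buffer, call patch_bytes(buf, len, "Hello", "H---o"), and write the buffer out as b.out. Nothing in the function knows or cares that the bytes happen to be an executable program.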

Good C References

The definitive reference on the C language is the slim little text that launched the language:

The C Programming Language by Brian W. Kernighan and Dennis M. Ritchie, Prentice Hall, 1978 and subsequent editions.

This slim little book is a classic, demonstrating the clear distinction between tutorial introductions to a programming language and reference material, and demonstrating how concise and clean a language definition can be. The current edition describes ANSI C, and it remains in print. Note that Ritchie wrote the first C compiler as well as co-writing the book.

Another interesting on-line source giving one man's take on the C++ language is:

Bjarne Stroustrup's FAQ

Note that Bjarne Stroustrup was the developer of C++, so his take on the language is important.

If you ever have to write big programs in C, consider the advice given here:

Douglas W. Jones Manual of C Style

New Unix users may find the following tutorials to be useful.

A vi tutorial. There are other editors, but the two leading editors for Unix programmers are still vi and Emacs. There's no need to learn both, but Unix programmers ought to know one or the other.

A Unix tutorial. There are many introductory Unix tutorials; this one has only one virtue: it is short.