2 -- C for experienced programmers

22C:112 Notes, Spring 2009

Part of the 22C:112, Operating Systems Notes
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

The C programming language

The C programming language was developed by Dennis Ritchie at Bell Telephone Laboratories around 1973. An exact date is hard to pin down because the language evolved from an earlier language, known as B, and it was not completely solidified in an instant. (Brian Kernighan co-authored the classic book on C, but the language design itself was Ritchie's.)

C was developed as a system programming language to support porting the Unix operating system to the Digital Equipment PDP-11 computer. Unix had originally been implemented in the native assembly language of an earlier computer made by Digital, the PDP-7. This had an 18-bit word (allowing either three 6-bit characters per word or two 9-bit characters per word), and a single accumulator. In contrast, the PDP-11 had a 16-bit word and an 8-bit byte. This forced a complete reimplementation of the operating system, and the system designers, Dennis Ritchie and Ken Thompson, decided to try writing it in a high level language.

Bell Laboratories had been involved in the Multics project (about which, more later). Multics was one of the first operating systems to be written in a high level language (but not the first). Much of Multics was written in PL/I, a programming language designed in the early 1960s by a committee of SHARE (the IBM users' group), but parts of Multics were written in BCPL.

The B language was Ken Thompson's attempt to improve BCPL. It was effectively untyped (with just one type, the word). In moving to the PDP-11, it was necessary to extend this to handle both words and bytes explicitly, leading (in the short run) to two distinct types, int and char, and in the long run, to the prefixes short and long, so that you could declare objects of type short int (at least 16 bits) or long int (32-bit integers), as well as int (originally 16 bits).

BCPL was, in turn, a practical implementation of CPL. BCPL stands for Basic CPL. It was implemented at Cambridge University by Martin Richards. CPL was a paper effort for a long time, and it took years for a working compiler to emerge. Hence the need for a stripped down basic version. The acronym CPL officially stood for Combined Programming Language, but, as it began at Cambridge, early explanations of the name referred to it as the Cambridge Programming Language, and for many years, people jokingly referred to it as Christopher's Programming Language, because of the involvement of Christopher Strachey. CPL was, in turn, strongly influenced by Algol.

The standing joke was that because BCPL led to B which led to C, the successor of C should have been P and not C++.

The Wikipedia writeups on these languages are interesting. The wiki pages for BCPL, B and C all include examples where the evolution is fairly clear:

  1. http://en.wikipedia.org/wiki/ALGOL
  2. http://en.wikipedia.org/wiki/Combined_Programming_Language
  3. http://en.wikipedia.org/wiki/BCPL
  4. http://en.wikipedia.org/wiki/B_programming_language
  5. http://en.wikipedia.org/wiki/C_%28programming_language%29

Hello World in C

	/* hello.c -- Hello-world program */

	#include <stdio.h>
	int main()
	{
		printf( "Hello world!\n" );
	}

The above example is a classic! Starting with Kernighan and Ritchie's book, The C Programming Language, essentially every introductory C book has begun with this example. It declares an integer function called main. Curiously, main returns no value, but only outputs, to the standard output stream, the message "Hello world!" (with a newline at the end).

The main program is an integer function only for historical reasons, and indeed, most main programs return no value. If C were being redesigned today, main programs would probably be of type void, indicating that they return nothing.
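
In modern C, a main program that explicitly reports its status would look like this (a sketch; the file name hello2.c is just for illustration):

	/* hello2.c -- hello world, returning a status code */

	#include <stdio.h>

	int main(void)
	{
		printf( "Hello world!\n" );
		return 0;	/* reports success to the shell */
	}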

C comments are delimited by digraphs (two-character symbols) made up of a slash and a star: /* opens a comment and */ closes it. The trigraph /*/ can be used to both open and close comments if you want to be silly, because it is not recognized as a degenerate empty comment (the other obvious interpretation).
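
For example, the following line is a single complete comment, opened and closed by the trigraph:

	/*/ this is silly, but it is one complete comment /*/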

The line that begins #include instructs the compiler to search for a file called stdio.h containing the interface definition for the standard input/output package, part of the C standard library. Programs can call standard library routines without including the interface definition file, but it is easier and safer to include it.

The interface definition files for the C standard library are conventionally stored in /usr/include/, and it can be instructive to read around in this directory.

Compiling a C program under Unix or Linux

C source files under Unix should always have names ending in .c, and by convention, it is common to begin each source file with a comment indicating the name of the file where it is expected to be stored. Thus, the hello-world program above would naturally be stored in a file named hello.c.

C source files are typically edited with a simple text editor, as opposed to a word processor, although nowadays many word processors can be used to manipulate simple text files. Nonetheless, many C programmers still use editors such as vi or emacs because they are fast and powerful.

Given a file named hello.c, you can compile it under a Unix or Linux system using the command line interface and the command:

	cc hello.c

This will place the executable output in a file called a.out, a name that originally indicated the default output file of the assembler. If we were starting over again, we would probably not use this name. If you want to put the output in a different file, perhaps hello, you do it as follows:

	cc -o hello hello.c

To run the program, you type its name:

	hello

If this does not work, it is because your search path is not set up to run files from the current working directory. In that case, you can either edit your search path or override it. Overriding it is easy:

	./hello

This command is a full filename, saying look for a file named hello in the directory named . (dot), that is, in the current working directory. The output should be:

	Hello world!

Types in C

To declare a simple variable in C, you name the type first, then the variable:

        char ch;                /* ch should hold an 8-bit character */
        unsigned char ch;       /* typically, 0 <= ch <= 255 */
        signed char ch;         /* typically, -128 <= ch <= 127 */

        int ch;                 /* typically either a 16 or 32-bit integer */
        signed int ch;          /* the same as int ch, e.g. -32768 to 32767 */
        unsigned int ch;        /* for 16-bit ints, would be 0 to 65535 */
        long int ch;            /* longer than int, at least 32 bits */
        unsigned long int ch;   /* these attributes combine! */

        float x;                /* single-precision floating point */
        double y;               /* double precision, the usual choice in C */

One of the most dangerous things about C is that the precision of all of the standard integer types is ill-defined, or rather, the precision of these types is defined in terms of the word and byte size of the architecture. As a result, recompiling a program, even using different compilers on the same machine, can break things. Long after C was developed, the header stdint.h was designed as a solution to this problem. Including this header defines a set of types that have deterministic representations. The type names in this header are ugly, but each name encodes the number of bits in the representation and whether the value is signed or unsigned. Use of these types is recommended in programs intended to be portable:

        #include <stdint.h>
        int8_t a;               /*        -128 <= a <= 127        */
        uint8_t b;              /*           0 <= b <= 255        */
        int16_t c;              /*      -32768 <= c <= 32767      */
        uint16_t d;             /*           0 <= d <= 65535      */
        int32_t e;              /* -2147483648 <= e <= 2147483647 */
        uint32_t f;             /*           0 <= f <= 4294967295 */

Operating systems make little or no use of floating point, so we will not do more than mention that C supports floating point.

Arrays are declared in C using a rather evil syntax:

        int a[100];            /* an array of 100 ints, from a[0] to a[99] */
        char str[20];          /* an array of 20 chars, from str[0] to str[19] */

This is evil because the array bounds are naturally an attribute of the type, and placing the array name between two parts of the type is awkward. The semantic model used for C arrays is worse than this syntax suggests, because the array name is actually a pointer to the first element of the array, and indexing is defined as being equivalent to pointer arithmetic.

C has no concept of class, but it is possible to declare things called structures (or records) that group together the variables that make up the representation of a class.

        struct point {
                int x;         /* every point has an integer x coordinate */
                int y;         /* every point has an integer y coordinate */
        };

        struct point a;        /* the variable a has components a.x and a.y */
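
The components are then accessed with dot notation:

        a.x = 0;               /* set the x coordinate of a */
        a.y = 0;               /* set the y coordinate of a */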

You can name types in C. For example:

        typedef long int bigint; /* bigint is an alias for long int */
        typedef struct xxpoint {
                int x;
                int y;
        } point;

Having created these types, you can now declare objects of these types with less verbose notation:

        bigint a;              /* a is a long int */
        point origin;          /* origin is of type struct xxpoint */

Evil features of C

In C, variables of type char may be either signed or unsigned at the discretion of the compiler writer. On some machines, loading an unsigned character is fast, while loading a signed character requires sign extension, while on other machines, load instructions always sign-extend and unsigned loads require an extra instruction to truncate the result. These areas of compiler discretion lead to problems writing portable code!
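
A minimal sketch of the resulting portability trap (the names here are arbitrary):

        char ch = '\xFF';         /* one byte with all bits set */
        if (ch == 0xFF) C = D;    /* executes only where char is unsigned; */
                                  /* where char is signed, ch here is -1 */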

In C, there is no real difference between integers and characters. It is perfectly legal to use 'a' (the character constant for the letter a) where an integer is expected. Strongly typed languages really ought to distinguish integer types from character types.
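
For example, this is legal, and on an ASCII machine it outputs 98:

        int i = 'a';               /* 'a' is just the integer 97 in ASCII */
        printf( "%d\n", i + 1 );   /* outputs 98 */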

In C, the = operator means assignment and the == operator means comparison. If a programmer writes:

        if (A = B) C = D;

This is perfectly legal. It may not mean what was intended, though. It is equivalent to

        A = B;
        if (A != FALSE) C = D;

That is, the value of B is assigned to A, and then this value is compared with FALSE to see if C should be assigned to D. It is worth noting that TRUE and FALSE are not predefined, and there is no Boolean type. Boolean variables in C are typically stored as integers, and the system interprets zero as false and all nonzero values as true.
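
Programmers who really do intend an embedded assignment conventionally make the comparison explicit, as in this sketch:

        if ((A = B) != 0) C = D;   /* assignment intended; compared explicitly */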

There is no bounds checking on arrays. If the array a has 20 elements, assignment to a[100] is perfectly legal, and may well lead your program to do very undesirable things.
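
For example, this compiles and runs without complaint:

        int a[20];
        a[100] = 5;            /* stores 5 somewhere outside the array */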

Arrays are not types. The name of an array is merely a pointer to its first element. The expression a[i] is equivalent to *(a + i) where the asterisk means "follow the pointer" and adding the integer i to a pointer creates a pointer to the ith successive item following the referent of the original pointer, where the item size is determined by the type of the pointer.
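
A small demonstration of this equivalence:

        int a[4] = { 10, 20, 30, 40 };
        printf( "%d\n", a[2] );        /* outputs 30 */
        printf( "%d\n", *(a + 2) );    /* outputs 30, the very same element */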

It is possible to create a pointer to any variable. Thus, if i is the name of a variable of type t, &i is a pointer of type *t to that variable, where the type *t means "pointer to t."
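
For example (the names i and p are arbitrary):

        int i;                 /* a variable of type int */
        int * p;               /* p is of type pointer to int */
        p = &i;                /* p now points to i */
        *p = 5;                /* assigns 5 to i through the pointer */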

Messing with Hello World

/* hello.c -- Hello world program */

#include <stdio.h>

void notmain(char s[]);  /* interface definition for notmain */

int main()
{
	notmain("-- Hello there! --");
	/* pass a string.  Note that strings in C are just arrays of
           characters with an implicit null included to mark the end
           of the string */
}

void notmain(char * s)
        /* note the change in the definition of s!  Here, s is declared
           as a pointer to a character; that is what the asterisk means.
           In C, the name of an array is the same as a pointer to its
           first element, so this definition is equivalent to the interface
           definition.  It is very bad style to do this.  It would be
           far better to make the two textually identical */
{
	printf(s);
	putchar('\n');
        /* in the original hello-world program, the trailing newline was
           output at the end of string constant.  Here, the newline is
           output using a separate call, in this case, a call to putchar,
           the standard C routine to output one character to stdout
           (the standard output stream). */
}
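
One caution about the call printf(s) above: printf treats its first argument as a format string, so if s happened to contain a % character, printf would try to interpret it. A safer way to output an arbitrary string verbatim is the standard routine fputs:

	fputs( s, stdout );	/* outputs s with no format interpretation */
	putchar( '\n' );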