2 -- C for experienced programmers

CS:3620 Notes, Spring 2012

Part of the CS:3620, Operating Systems Notes
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

The C programming language

The C programming language was developed by Dennis Ritchie at Bell Telephone Laboratories around 1973 (Brian Kernighan later co-authored the definitive book on the language). An exact date is hard to pin down because the language evolved from an earlier language, known as B, and it was not completely solidified in an instant.

C was developed as a system programming language to support porting the Unix operating system to the Digital Equipment PDP-11 computer. Unix had originally been implemented in the native assembly language of an earlier computer made by Digital, the PDP-7. This had an 18-bit word (allowing either three 6-bit characters per word or two 9-bit characters per word), and a single accumulator. In contrast, the PDP-11 had a 16-bit word and an 8-bit byte. This forced a complete reimplementation of the operating system, and the system designers, Dennis Ritchie and Ken Thompson, decided to try writing it in a high level language.

Bell Laboratories had been involved in the Multics project (about which, more later). Multics was one of the first operating systems to be written in a high level language (but not the first). Much of Multics was written in PL/I, a programming language designed in the early 1960's by a committee of SHARE (the IBM user's group), but parts of Multics were written in BCPL.

The B language was Ken Thompson's attempt to improve BCPL. It was effectively untyped (with just one type, the word). In moving to the PDP-11, it was necessary to extend this to handle both words and bytes explicitly, leading (in the short run) to two distinct types, int and char, and in the long run, to the prefixes short and long, so that you could declare objects of type short int (at least 16 bits) or long int (at least 32 bits), as well as int (originally 16 bits).

BCPL was, in turn, a practical implementation of CPL. BCPL stands for Basic CPL. It was implemented at Cambridge University by Martin Richards. CPL was a paper effort for a long time, and it took years for a working compiler to emerge. Hence the need for a stripped down basic version. The acronym CPL officially stood for Combined Programming Language, but, as it began at Cambridge, early explanations of the name referred to it as the Cambridge Programming Language, and for many years, people jokingly referred to it as Christopher's Programming Language, because of the involvement of Christopher Strachey. CPL was, in turn, strongly influenced by Algol.

The standing joke was that because BCPL led to B which led to C, the successor of C should have been P and not C++.

The Wikipedia writeups on these languages are interesting. The writeups for BCPL, B and C all include examples where the evolution is fairly clear:

  1. Algol — http://en.wikipedia.org/wiki/ALGOL
  2. CPL — http://en.wikipedia.org/wiki/Combined_Programming_Language
  3. BCPL — http://en.wikipedia.org/wiki/BCPL
  4. B — http://en.wikipedia.org/wiki/B_programming_language
  5. C — http://en.wikipedia.org/wiki/C_%28programming_language%29

Hello World in C

/* hello.c -- Hello-world program */

#include <stdio.h>
int main()
{
	printf( "Hello world!\n" );
}

The above example is a classic! Starting with Kernighan and Ritchie's classic book, The C Programming Language, essentially all intro to C books have begun with this example, and designers of other programming languages have included hello-world examples for their languages. This program declares an integer function called main. Curiously, main returns no value, but only outputs, to the standard output stream, the message "Hello world!" (with a newline at the end).

The main program is an integer function only for historical reasons, and indeed, most main programs return no value. If C were being redesigned today, main programs would probably be of type void indicating that they return nothing.

Indenting in C is, by convention, always done with tabs and tabs are always 8 spaces. C has nothing similar to an inner class or any other nesting of declarations, so the only thing that leads to deep indenting is deeply nested control structures.

C comments are set out by digraphs (two character symbols) made up of a slash and star. The trigraph /*/ can be used to both open and close comments if you want to be silly, because it is not recognized as a degenerate comment (the other obvious interpretation).

The line that begins #include instructs the compiler, actually the C preprocessor, to search for a file called stdio.h containing the interface definition for the standard input/output package, part of the C standard library. Programs can call standard library routines without inserting the interface definition file, but it is easier to do so. Include directives search for the indicated file along one of two search paths. If the file's name is bracketed, as is the case with <stdio.h>, it searches for the file in the system search path; on Unix, the primary directory on this path is /usr/include. If the file's name is in double quotes, "file", the preprocessor searches the current directory and optionally other directories on a user-specified search path.

It's worth taking a look at some of the files in /usr/include/ on a Unix compatible system. Include files are not easy to read for two reasons. First, many of them include other files, and second, many contain multiple conditionals allowing them to define things differently depending on the context (for example, depending on what other files have been included, or depending on what type of CPU or what operating system is being used).

Compiling a C program under Unix or Linux

Under Unix, C source files should always end in .c, and by convention, it is common to begin each source file with a comment indicating the name of that source file (or at least, the name the programmer expects that file to be stored under). Thus, the hello-world program above would naturally be stored in a file named hello.c.

C source files are typically edited with a simple text editor, as opposed to a word processor. Although many word processors can nowadays manipulate simple text files, most C programmers still use editors such as vi or emacs because they are fast and powerful. If you can touch type, an editor that does not require use of the mouse can be much faster than an editor that forces you to shift your hands from the keyboard to the mouse and back again. These editors come into their own when making complex repetitive edits.

Given a file named hello.c, you can compile it under a Unix or Linux system using the command line interface and the command:

cc hello.c

This will place the executable output in a file called a.out, a name that originally indicated the default output file of the assembler. If we were starting over again, we would probably not use this name. If you want to put the output in a different file, perhaps hello, you do it as follows:

cc -o hello hello.c

The -o command line option tells the C compiler to use the next item on the command line as the name of the output file. By convention, command line options in Unix-like systems are indicated with a leading dash, although nothing forces this. If you write a new command, your code is free to interpret its command line any way it wants, using any mechanism you want to distinguish between, for example, file names and option flags.

To run the program, you type the name of the executable file:

hello

If this does not work, it is probably because your search path is not set up to run files from the current working directory. In that case, you can either edit your search path or override it. Overriding it is easy:

./hello

This command is a full filename, saying look for a file named hello in the directory named . (dot). In Unix systems, dot refers to the current working directory. If hello.c was the hello-world program, the output should be:

Hello world!

Types in C

To declare a simple variable in C, you name the type first, then the variable:

char ch;                /* ch should hold an 8-bit character */
unsigned char ch;       /* typically, 0 <= ch <= 255 */
signed char ch;         /* typically, -128 <= ch <= 127 */

int ch;                 /* typically either a 16 or 32-bit integer */
signed int ch;          /* the same as int ch, e.g. -32768 to 32767 */
unsigned int ch;        /* for 16-bit ints, would be 0 to 65535 */
long int ch;            /* longer than int, at least 32 bits */
unsigned long int ch;   /* these attributes combine! */

float x;                /* single-precision floating point; C also has double */

One of the most dangerous things about C is that the precision of all of the standard integer types is ill defined, or rather, the precision of these types is defined in terms of the word and byte size of the architecture. As a result, recompiling a program, even using different compilers on the same machine, can break things. Long after C was developed, the package stdint.h was designed as a solution to this problem. Using this package defines a set of types that have deterministic representations. The type names in this package are ugly, but each name encodes the number of bits in the representation and whether the value is signed or unsigned. Use of these types is recommended in programs intended to be portable:

#include <stdint.h>
int8_t a;               /*        -128 <= a <= 127        */
uint8_t b;              /*           0 <= b <= 255        */
int16_t c;              /*      -32768 <= c <= 32767      */
uint16_t d;             /*           0 <= d <= 65535      */
int32_t e;              /* -2147483648 <= e <= 2147483647 */
uint32_t f;             /*           0 <= f <= 4294967295 */

Java's primitive data type int corresponds to int32_t. Java programmers may be a bit surprised by the variety of integer types in C, and more particularly, by the unsigned types. Python's integer class is not comparable to C integers for two reasons. First, Python integers are first-class objects, with numerous applicable methods. In contrast, Java and C have primitive integer types that are not part of the object-oriented programming model. Second, Python integers do not have a fixed size in memory; instead, the amount of memory allocated to hold each Python integer depends on its magnitude. Python frees the programmer from worrying about the memory requirements of integers, but at a high cost in both storage requirements and speed.

Operating systems make little or no use of floating point, so we will not do more than mention that C supports floating point.

Arrays are declared in C using a rather evil syntax:

int a[100];            /* an array of 100 ints, from a[0] to a[99] */
char str[20];          /* an array of 20 chars, str[0] to str[19] */

This is evil because the array bounds are naturally an attribute of the type, and placing the array name between two parts of the type is awkward. The semantics model used for C arrays is even worse than this syntax suggests, because the array name is actually a constant-valued pointer to the first element of the array, and indexing is defined as being equivalent to pointer arithmetic.

C has no concept of class, but it is possible to declare things called structures (or records) that group together the variables that make up the representation of a class.

struct point {
	int x;         /* every point has an integer x coordinate */
	int y;         /* every point has an integer y coordinate */
};

struct point a;        /* the variable a has components a.x and a.y */

You can name types in C. For example:

typedef long int bigint; /* bigint is an alias for long int */
typedef struct xxpoint {
	int x;
	int y;
} point;

Having created these types, you can now allocate objects of these types with less verbose notation:

bigint a;              /* a is a long int */
point origin;          /* origin is of type struct xxpoint */

As a result of the above definitions, you can set origin.x and origin.y to zero, assuming that the origin of your coordinate system should be at (0,0).

Evil features of C

In the original definition of C, variables of type char could be either signed or unsigned at the discretion of the compiler writer. On some machines, loading an unsigned character is fast, while loading a signed character requires sign extension. On other machines, load instructions always sign-extend and unsigned loads require an extra instruction to truncate the result. Giving the compiler the right to pick signed or unsigned, whichever is faster, leads to faster execution, but it also leads to problems writing portable code. The same char value may be greater than another on one machine and less on a different machine!

There is no difference between integers and characters. It is perfectly legal to use 'a' (the character constant) where an integer is expected. Strongly typed languages really ought to distinguish integer types from character types. Java has the same problem.

In C, the = operator means assignment and the == operator means comparison. Java and many other languages retain this convention. If a C programmer writes:

if (A = B) C = D;

This is perfectly legal. It is not legal in Java, where assignment is not legal within an expression. It is unlikely to mean what was intended because it is equivalent to

A = B;
if (A != 0) C = D;

That is, the value of B is assigned to A, and then this value is compared with zero to see if C should be assigned to D. It is worth noting that true and false are not predefined, and there is no distinct boolean type in C. (As an afterthought, the header file stdbool.h was introduced to define the type bool and the constants true and false.) Boolean variables in C are typically stored as integers, and the system interprets zero as false and all nonzero values as true.

There is no bounds checking on arrays. If the array a has 20 elements, assignment to a[100] is perfectly legal, and may well lead your program to do very undesirable things.

Arrays are not types. The name of an array is merely a pointer to its first element. The expression a[i] is equivalent to *(a + i) where the asterisk means "follow the pointer" and adding the integer i to a pointer creates a pointer to the ith successive item following the referent of the original pointer, where the item size is determined by the type of the pointer.

It is possible to create a pointer to any variable. Thus, if i is the name of a variable of type t, &i is a pointer of type *t to that variable, where the type *t means "pointer to t."

Messing with Hello World

/* hello.c -- Hello world program */

#include <stdio.h>

void notmain( char s[] );  /* interface definition for notmain */

int main()
{
        notmain("-- Hello there! --");
        /* pass a string.  Note that strings in C are just arrays of
           characters with an implicit null included to mark the end
           of the string */
}

void notmain( char * s )
        /* note the change in the definition of s!  Here, s is declared
           as a pointer to a character, that is what the asterisk means.
           in C, the name of an array is the same as a pointer to its
           first element, so this definition is equivalent to the interface
           definition.  It is very bad style to do this.  It would be
           far better to make it textually identical */
{
        printf( s );
        /* the above code is evil.  The problem is, the first parameter
	   to printf() is supposed to be a format string indicating how
	   to print data from the following parameters.  Special characters
	   in the format string (notably the percent sign) can cause
	   trouble.  As a rule, never use anything but a string constant
	   as the first parameter to printf(). */
        putchar( '\n' );
        /* in the original hello-world program, the trailing newline was
           output at the end of string constant.  Here, the newline is
           output using a separate call, in this case, a call to putchar,
           the standard C routine to output one character to stdout
           (the standard output stream). */
}

The right way to print out an unknown string is one of the following:

        printf( "%s", s );
        /* the first argument is a format string indicating that the second
           argument is to be interpreted as the address of a string. */
        fputs( s, stdout );
        /* put the string s out to the standard output stream; the putchar
           and printf routines implicitly output to stdout, but fputs allows
           output to any output stream. */

You can also output a string one character at a time. Here's a bad way to do this:

        int i; /* you must declare the index variable outside the loop */
        for (i = 0; i < strlen( s ); i++) putchar( s[i] );

The above requires that you have an #include <string.h> directive at the head of your program, but that is not a huge issue. Note that C is not object oriented. We say strlen(s), not s.strlen(). The above code is not recommended because strlen() must actually count all the characters in the string in order to find its length. Remember, strings in C are just unbounded arrays of characters with a null character ('\0') marking their end. It's better to use a while loop:

        int i = 0;
	while (s[i] != '\0') {
		putchar( s[i] );
		i = i + 1;
	}

The above code can be written more compactly using auto-increment addressing:

        int i = 0;
	while (s[i] != '\0') putchar( s[i++] );

If this is the last use we're making of the string parameter s, we can compact this even more:

	while (*s != '\0') putchar( *(s++) );

Here, we've replaced s[i] with *s, that is, we're using s as a pointer to the first character of the string. Then, instead of incrementing i by one, we're incrementing s itself by the size of one character.