Lecture 7, Errors and Testing

Part of the notes for CS:4980:1
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Groupware Issues

Right now, your Subversion directory is cs4980 and your group's directory will be named something like cs4980/GROUPNAME. It's annoying to type long path names with underscores in them, and you have no need to access the other stuff in cs4980 that doesn't involve your group.

Consider creating a symbolic link to your group's directory so you can cd directly there when you begin work. Presuming your group's name is GROUPNAME (a very unrealistic name), the following shell commands will do this:

ln -s cs4980/GROUPNAME GROUPNAME

This creates a symbolic link named GROUPNAME from your home directory to your group's directory within the class Subversion repository. As a result, typing cd GROUPNAME is equivalent to typing the far more verbose cd cs4980/GROUPNAME.

This still leaves you the opportunity to use cd cs4980/jones followed by svn update to get the current code being given away by the course instructor.

Note that you should never do an svn update in any other group's directory; doing so could lead to inappropriate copying of code from other groups. All Subversion activity is logged, and after the groups are properly launched on their projects, your access to other groups' directories will be blocked.

Error Reporting

The normal way programmers think about error reporting code is to just output the error message wherever it is required. For example, at the end of the last lecture, we wrote this code:

if ( lex_next.value > ((UINT32_MAX - (ch - '0')) / 10) ) {
        /* =BUG= report number out of bounds */
} else {
        /* accumulate value of digit */
        lex_next.value = (lex_next.value * 10) + (ch - '0');
}

To eliminate this bug, we need a way to report this error. Ideally, each error message ought to report where in the input file the problem was found, all error messages ought to have a uniform format, and all of the messages should be easy to find so that they can be rewritten, for example, to produce versions of the program for sale in non-English-speaking lands.

In large projects, therefore, it is quite common to force all error messages through a single error reporting package. A typical error reporting package has, at minimum, the following interface:

/* errors.h -- error reporting package interface */

/* an enumeration type to encode the error messages */
typedef enum {
        /* intended for use in calls to error_fatal */
        ER_BADFILE,
        ...
        /* intended for use in calls to error_warn */
        ER_TOOBIG,
        ...
} error_message;

void error_fatal( error_message er, int line );
        /* output message er and exit the program indicating failure */

void error_warn( error_message er, int line );
        /* output message er and allow processing to continue */

In a compiler, non-fatal errors (and possibly warnings) are also important, although it is sometimes a good idea to abort the compilation after several non-fatal errors; usually only the first few messages are really useful, and aborting early prevents a huge stream of useless messages. Inside the error package, we must relate the messages to their strings:

/* errors.c -- error reporting package implementation */

#include <stdlib.h>
#include <stdio.h>
#include "errors.h"

/* the error messages.
 *  NOTE:  this array must have one entry for each
 *  member of the error_message enumeration, in exactly the same order
 */
static const char * message[] = {
        /* intended for use in calls to error_fatal */
        /* ER_BADFILE */ "Cannot open input file.",
        ...
        /* intended for use in calls to error_warn */
        /* ER_TOOBIG  */ "Value too large.",
        ...
};

void error_fatal( error_message er, int line ) {
        /* output the message er and exit the program indicating failure */
        fprintf( stderr, "Fatal error on line %d: %s\n", line, message[ er ] );
        exit( EXIT_FAILURE );
}

void error_warn( error_message er, int line ) {
        /* output message er and allow processing to continue */
        fprintf( stderr, "Error on line %d: %s\n", line, message[ er ] );
}

Note that the error message array is a constant array of pointers to string constants, and those strings do not make up the entire message. Instead, the error reporting routines compose the message with boiler-plate text (for example, "Fatal error on line") and the line number. All formatting is done in the error print routines, where the array contains only the parts of the message that "fill in the blanks" in the standard error message.

If the error reporting module is well designed, localizing the program to a different language, for example changing it from English to French, can be quite simple. With luck, the only changes are to the contents of the message array, along with the format strings in error reporting routines. Here is how we use this error reporting system in the example that introduced this section:

if ( lex_next.value > ((UINT32_MAX - (ch - '0')) / 10) ) {
        error_warn( ER_TOOBIG, line_number );
} else {
        /* accumulate value of digit */
        lex_next.value = (lex_next.value * 10) + (ch - '0');
}
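
Returning to the earlier point about localization: the change amounts to editing the string constants and nothing else. Here is a sketch of what a French version of the two message entries and the error_warn() format string might look like; the particular translations below are only for illustration and are not part of the original notes.

/* the error messages, localized (illustrative translations only) */
static const char * message[] = {
        /* ER_BADFILE */ "Impossible d'ouvrir le fichier d'entrée.",
        ...
        /* ER_TOOBIG  */ "Valeur trop grande.",
        ...
};

void error_warn( error_message er, int line ) {
        /* output message er and allow processing to continue */
        fprintf( stderr, "Erreur à la ligne %d : %s\n", line, message[ er ] );
}

Because the enumeration in errors.h and all of the calling code are unchanged, nothing else in the compiler needs to be touched.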

Since we are writing a compiler, it is useful to note that a very common form of syntax error is "x encountered when y expected". For example, the error that started this discussion is "EOF encountered when end quote expected". This suggests that we should have an error reporting routine called, for example, error_gotbutwant( got, want ). The got parameter should probably be the actual character or lexeme encountered in the input text -- in the lexical analyzer, it will be a character; in the syntax analyzer, a lexeme. The want parameter can select one of the enumerated error message strings from the error reporting package.

The obvious solution is to have both error_gotbutwant(), which is used to report that you found one character when you expected another, and a higher-level lex_gotbutwant() to report that you found one lexeme when you expected another.
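
The notes do not give code for these routines, so here is a minimal sketch of what the character-level one might look like inside errors.c. It assumes a third line parameter in the style of error_warn() (the two-parameter form mentioned above omits it), and it assumes that want-style fragments such as "an end quote" are added to the message array; both of these are assumptions made for the sake of illustration.

/* a sketch only -- the signature here is an assumption, not part of the package above */
void error_gotbutwant( int got, error_message want, int line ) {
        /* report that character got was found where want was expected */
        if (got == EOF) {
                fprintf( stderr, "Error on line %d: got end of file when %s expected.\n",
                        line, message[ want ] );
        } else {
                fprintf( stderr, "Error on line %d: got '%c' when %s expected.\n",
                        line, got, message[ want ] );
        }
}

A call from the string-scanning code might then look like error_gotbutwant( ch, ER_WANTQUOTE, line_number ), where ER_WANTQUOTE is a hypothetical new member of the error_message enumeration whose string would be something like "an end quote". The lexeme-level lex_gotbutwant() would belong in lexical.c, since it must know how to print a lexeme, presumably by calling lex_put() on the got lexeme.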

Reminder

In developing any large project, there will always be places where you know more work could be done but you cannot do it now. Particularly in a group project, you cannot rely on your own memory to remind you to go back and flesh out those details. One way to deal with this is to agree, across the group, on a standard textual marker to show where additional work is needed. This way, any group member looking at the code can see if something is missing.

In the code given here, this marker is the string =BUG=, typically embedded at the start of a comment with an explanation following. The idea is that the code can be executable, but the bug notice indicates shortcomings in the code, features that are missing, tests it will not pass, and so on.

By using a standard string such as =BUG= to mark stubs in your code, a simple search with grep =BUG= *.c *.h can show you all the known shortcomings in your project, giving you a quick way to see what remains unfinished. In a large project, with many people sharing the work, communicating through bug notices in the code can be a far more effective way to direct the work than having group meetings. Adding well documented and correct bug notices to code is productive work, so don't feel embarrassed if all you do in a work session is add bug notices. Of course, it's also productive work to eliminate them by replacing the bug notice with working code.

A Main Program for Testing

We've written enough of a lexical analyzer to be worth testing. Here is a minimal testlex.c, a main program that can be used to test the lexical analyzer:

/* testlex.c -- test framework for the lexical analyzer */
#include <stdio.h>
#include <stdlib.h>

#include "lexical.h"

int main( void ) {
        /* the main program for testlex */

        /* =BUG= this should open argv[1] if present so we can read files */
        lex_open( NULL ); /* default to STDIN if argv[1] not present */
        do {
                lex_put( &lex_this, stdout );
                putchar( '\n' ); /* output each lexeme on a new line */
                lex_advance();
        } while (lex_this.type != ENDFILE);
        return EXIT_SUCCESS;
}

Oops, some missing code

We didn't write lex_open(), nor did we write lex_put(). We'd better do this pronto:

void lex_open( char * f ) {
        /* open file f for input, or use stdin if f is null */
        if (f != NULL) {
                infile = fopen( f, "r" );
                if (infile == NULL) error_fatal( ER_BADFILE, 0 );
        } else {
                infile = stdin;
        }

        /* initialize the sliding window */
        ch = fgetc( infile );
        line_number = 1;
        lex_advance();
        lex_advance();
}

Here, fopen() is the standard C routine for opening a stream; it may fail (returning NULL), so we must issue an error message in that case.

What about end-of-file? The fgetc() function (and the equivalent getc() macro) gets one character from the given file each time it is called, but the character is returned as an int, not a char. If it gets a character successfully, it returns that character as a non-negative value. If, on the other hand, it reaches the end of file, it returns EOF, a negative value. So, we can change the declaration of ch to this in lexical.c:

static int ch; /* the current character not yet part of a lexeme */

And, in lex_advance() we can add a new case at the head of the cascade of cases that classify lexemes:

if (ch == EOF) {
        /* end of file */
        lex_next.type = ENDFILE;
        lex_next.value = 0; /* meaningless */
} else if ...

We must also be careful about EOF checking inside the code for each lexeme; for example, when reading the successive digits of a number, we need to write one of the following:

} while ((ch >= '0') && (ch <= '9'));
} while ((ch != EOF) && ISCLASS( ch, DIGIT ));

The first line above is safe because EOF is outside the range of digits, but wherever we use ISCLASS() we must be careful because it only works with positive values and EOF is negative.
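
For concreteness, here is a sketch of how the second test above might sit in the digit-scanning loop; the do-while structure is assumed from the number-scanning code of the previous lecture, not quoted from it.

/* a sketch only -- the loop structure is an assumption */
do {
        if ( lex_next.value > ((UINT32_MAX - (ch - '0')) / 10) ) {
                error_warn( ER_TOOBIG, line_number );
        } else {
                /* accumulate value of digit */
                lex_next.value = (lex_next.value * 10) + (ch - '0');
        }
        ch = fgetc( infile );
} while ((ch != EOF) && ISCLASS( ch, DIGIT ));

Note that the EOF test must come first; the short-circuit evaluation of && guarantees that ISCLASS() is never applied to a negative value.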

The lex_put() routine is a bit more interesting. If the lexical analyzer had merely constructed a string variable holding the lexeme, lex_put() could simply output that, but we did not do so. As a result, our routine must reconstruct the string from the data in the lexeme it is given.

void lex_put( lexeme * lex, FILE * f ) {
        /* reconstruct and output lex to file f */
        switch (lex->type) {
        case IDENT:
        case KEYWORD:
                /* =BUG= missing code for these lexeme types */
                break;
        case NUMBER:
                fprintf( f, "%" PRIu32, lex->value );
                /* note:  the above uses a format string from <inttypes.h> */
                break;
        case PUNCT:
                /* =BUG= how do we do this? */
                break;
        case STRING:
        case ENDFILE:
                /* =BUG= missing code for these lexeme types */
                break;
        }
}

The above code includes a bug notice: How do we print out punctuation? The current version of lexical.c contains an enumerated type that gives names to all of the punctuation marks. This enumeration really belongs in lexical.h, so we'll move it there. The table mapping ASCII characters to these punctuation codes still belongs in lexical.c where we put it, but now we need the inverse table. We can define this simply as an array of strings indexed by the punctuation codes:

/* table mapping from punctuation names to text strings for lex_put */
/* WARNING:  strings must be given in the order enumerated in punct_type */
static const char * punct_name[] = {
    /* PTX       */ "?WHAT?",
    /* PT_SEMI   */ ";",  /* PT_EQUALS */ "=",  /* PT_COLON  */ ":",
    /* PT_LPAREN */ "(",  /* PT_LBRAKT */ "[",  /* PT_LBRACE */ "{",
    /* PT_RPAREN */ ")",  /* PT_RBRAKT */ "]",  /* PT_RBRACE */ "}",
    /* PT_COMMA  */ ",",  /* PT_ATSIGN */ "@",  /* PT_ELIPS  */ "..",
    /* PT_NOTEQL */ "/=", /* PT_GT     */ ">",  /* PT_GE     */ ">=",
    /* PT_LT     */ "<",  /* PT_LE     */ "<=", /* PT_PLUS   */ "+",
    /* PT_MINUS  */ "-",  /* PT_TIMES  */ "*",  /* PT_DIV    */ "/",
    /* PT_MOD    */ "%",  /* PT_AND    */ "&",  /* PT_OR     */ "|",
    /* PT_NOT    */ "~",  /* PT_DOT    */ "."
};

With this, we can replace our bug notice:

        case PUNCT:
                fputs( punct_name[lex->value], f );
                break;

Testing

Now, we can start testing! As you develop programs as complex as a compiler, there are huge numbers of test cases that you need to develop. Consider, in one small area, the problem of evaluating numeric lexemes. First, we need to test some typical legal values. All the digits should evaluate to themselves, and a good number of multi-digit numbers ought to be tested. We can do this with our framework quite easily:

~/compiler $ echo 0 1 2 3 4 5 6 7 8 9 | ./testlex
0
1
2
3
4
5
6
7
8
9
~/compiler $ echo "; ( ) , / < - % ~" | ./testlex
;
(
)
,
/
<
-
%
~

Each of the above shows a line of input to the Unix/Linux shell, followed by the expected output. The next test is for a multi-digit input with a value that ought to be representable as a 32-bit number; this input also contains some leading spaces:

~/compiler $ echo "   1234567890" | ./testlex
1234567890

When running a new test, it makes good sense to do this interactively; however, once you have debugged your code, it is also a good idea to keep the tests along with a record of the expected output. Here are the shell commands to save the output of these tests to a file called expected.txt:

~/compiler $ echo 0 1 2 3 4 5 6 7 8 9 | ./testlex > expected.txt
~/compiler $ echo "; ( ) , / < - % ~" | ./testlex >> expected.txt
~/compiler $ echo "   1234567890" | ./testlex >> expected.txt

Now, you can keep a test script that runs a sequence of tests and compares the results with the expected results:

echo 0 1 2 3 4 5 6 7 8 9 | ./testlex > gotten.txt
echo "; ( ) , / < - % ~" | ./testlex >> gotten.txt
echo "   1234567890" | ./testlex >> gotten.txt
diff gotten.txt expected.txt

The diff command compares the two files it is given as arguments. If the files are the same, it produces no output at all. If they differ, it lists just the differences. After every change to the lexical analyzer, you should re-run the script of tests it has previously passed and deal with any changes in the output. Some changes may be because you fixed bugs; in those cases, adjust the file of expected output accordingly. Other changes may mean you have introduced new bugs; in those cases, fix them, or at least document them and note that the program no longer passes the tests you've written.

This practice, called regression testing, is an essential part of large software projects where you want to avoid breaking things that used to work. Automated testing and re-testing after every incremental change is the key to success.