Lecture 7, Odds and Ends

Part of the notes for 22C:196:002 (CS:4908:0002)
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Groupware Issues

Right now, your Subversion directory is c_196_jones and your group's directory is named something like c_196_jones/GROUPNAME. It's annoying to type long path names with underscores in them, and you have no need to access the other stuff in c_196_jones that doesn't involve your group. Consider doing this (starting in your home directory):

cd c_196_jones
svn cleanup
cd ..
rm -rf c_196_jones

Having done that, you have completely disconnected your directory structure from the Subversion repository. You need to reconnect, but why bother reconnecting to the whole repository? Just do this to reconnect:

svn checkout --username HAWKID https://svn.divms.uiowa.edu/repos/c_196_jones/GROUPNAME

Now, your group's directory hangs directly off your home directory, so you have direct access to the project without typing underscores or other annoying things.

Error Reporting

The normal way programmers think about error reporting code is to just output the error message wherever it is required. For example, in the middle of the lexical analyzer, while processing a string, an end of file found during the scan for a closing quote might be handled like this:

while ((ch != EOF) && (ch != quote)) { /* scan over string */
	SCAN();
	/* WEAKNESS must work on backslash escapes */
}
if (ch == EOF) {
	fputs( "close quote expected when end of file found\n", stderr );
} else {
	SCAN(); /* skip closing quote */
}

This is, of course, better than no error checking, and it is better than no error message, but it has several weaknesses:

The latter objection is particularly significant when software must be internationalized, with error messages changed to other languages. Ad hoc error reporting, in this case, forces the internationalization team to search the entire source for character strings, figure out whether they must be changed, and then translate those strings.

In large projects, therefore, it is quite common to force all error messages through a single error reporting package. A typical error reporting package has, at minimum, the following interface:

/* errors.h */

/* an enumeration type to encode the error messages */
enum error_message {
	/* intended for use in calls to error_fatal */
	ER_BADFILE,
	...
};

void error_fatal( enum error_message er, int line );
/* output message er and exit the program indicating failure */

In a compiler, non-fatal errors, and possibly warnings, are also important. It is sometimes a good idea to abort the compilation after several non-fatal errors, since usually only the first few messages are really useful and the rest form a huge stream of useless output. Inside the error package, we must relate the messages to their strings:

/* errors.c */

#include <stdio.h>
#include <stdlib.h>
#include "errors.h"

/* the error messages.
 * NOTE:  this array must have one entry for each
 * member of the error_message enumeration, in exactly the same order
 */
static const char * message[] = {
	/* intended for use in calls to error_fatal */
	/* ER_BADFILE */ "Cannot open input file.",
        ...
};

void error_fatal( enum error_message er, int line ) {
	/* output the message er and exit the program indicating failure */
	fprintf( stderr, "Fatal error on line %d: %s\n", line, message[ er ] );
	exit( EXIT_FAILURE );
}

Note that the error message array is a static array of pointers to constant strings, and those strings do not make up the entire message. Instead, the error_fatal() routine composes the message from boiler-plate text (in this case, "Fatal error on line") and the line number. All formatting is done in the error print routines, while the array contains only the parts of the message that "fill in the blanks" in the standard error message.
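The requirement that the array and the enumeration stay in step can be partly checked by the compiler. The following is a minimal sketch, not part of the package above: the ER_COUNT sentinel, the ER_BADCHAR message, the message_table_check typedef and the error_text() helper are all hypothetical names introduced for illustration.

```c
/* Sketch only: ER_BADCHAR and ER_COUNT are hypothetical additions. */
enum error_message {
	ER_BADFILE,
	ER_BADCHAR,  /* hypothetical second message, for illustration */
	ER_COUNT     /* hypothetical sentinel: the number of messages; must stay last */
};

static const char * const message[] = {
	/* ER_BADFILE */ "Cannot open input file.",
	/* ER_BADCHAR */ "Illegal character."
};

/* Compile-time check:  the array type declared here has a negative
 * size, which is a compile error, if the number of strings in the
 * table differs from the number of enumeration members.
 */
typedef char message_table_check[
	(sizeof( message ) / sizeof( message[0] ) == ER_COUNT) ? 1 : -1
];

/* look up the text that goes with one message code */
const char * error_text( enum error_message er ) {
	return message[ er ];
}
```

This catches a missing or extra table entry, but not entries listed in the wrong order, so the comments naming each entry are still worth keeping.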

When moving the code to a different language, the contents of the message array would be changed, along with the format string in error_fatal() (and in other error reporting routines).
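One of the other error reporting routines might handle non-fatal errors and give up after too many of them. The sketch below assumes a routine named error_warn(), a message ER_TOOMANY, a limit ERROR_LIMIT of 10 and an accessor error_total(); none of these names come from the package described above, and the message list is abbreviated to keep the example self-contained.

```c
#include <stdio.h>
#include <stdlib.h>

/* Sketch only: ER_TOOMANY and ER_BADCHAR are hypothetical messages. */
enum error_message { ER_BADFILE, ER_TOOMANY, ER_BADCHAR };

static const char * const message[] = {
	/* ER_BADFILE */ "Cannot open input file.",
	/* ER_TOOMANY */ "Too many errors.",
	/* ER_BADCHAR */ "Illegal character."
};

#define ERROR_LIMIT 10  /* assumed cutoff, chosen arbitrarily */

static int error_count = 0;  /* non-fatal errors reported so far */

void error_fatal( enum error_message er, int line ) {
	/* output the message er and exit the program indicating failure */
	fprintf( stderr, "Fatal error on line %d: %s\n", line, message[ er ] );
	exit( EXIT_FAILURE );
}

void error_warn( enum error_message er, int line ) {
	/* output the message er and return, unless the limit is reached */
	fprintf( stderr, "Error on line %d: %s\n", line, message[ er ] );
	error_count = error_count + 1;
	if (error_count >= ERROR_LIMIT) error_fatal( ER_TOOMANY, line );
}

int error_total( void ) {
	/* lets the main program decide whether the compilation succeeded */
	return error_count;
}
```

A translation of this package would need to change both format strings as well as the message array.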

Since we are writing a compiler, it is useful to note that a very common form of syntax error is "x encountered when y expected". For example, the error that started this discussion is "EOF encountered when end quote expected". This suggests that we should have an error reporting routine called, for example, er_gotbutwant( got, want ). The got parameter should probably be the actual character or lexeme encountered in the input text -- in the lexical analyzer, it will be a character; in the syntax analyzer, a lexeme. The want parameter can be one of the enumerated error messages from the error reporting package.
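A rough sketch of such a routine follows. Several details are assumptions made only for illustration: got is passed as a string so that the same routine can serve for a character or a lexeme, a line parameter is added to match error_fatal(), ER_ENDQUOTE is a hypothetical message, and the helper er_gotbutwant_text() exists only so the composed text can be examined separately from printing it.

```c
#include <stdio.h>

/* Sketch only: ER_ENDQUOTE is a hypothetical message. */
enum error_message { ER_BADFILE, ER_ENDQUOTE };

static const char * const message[] = {
	/* ER_BADFILE  */ "Cannot open input file.",
	/* ER_ENDQUOTE */ "end quote"
};

/* compose the message text into buf, truncating if it does not fit;
 * returns what snprintf returns, the length of the full message
 */
int er_gotbutwant_text( char * buf, size_t size,
                        const char * got, enum error_message want, int line ) {
	return snprintf( buf, size,
	                 "Error on line %d: %s encountered when %s expected",
	                 line, got, message[ want ] );
}

/* report that got was found in the input where want was expected */
void er_gotbutwant( const char * got, enum error_message want, int line ) {
	char buf[ 256 ];
	er_gotbutwant_text( buf, sizeof( buf ), got, want, line );
	fprintf( stderr, "%s\n", buf );
}
```

With these assumptions, the lexical analyzer could report the unterminated string with a call such as er_gotbutwant( "EOF", ER_ENDQUOTE, line ).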

Classifying Characters

C and C++ offer a standard set of character classifying tools in the header file <ctype.h>. These are wonderful, to a point, but they have several weaknesses -- notably, they do not classify things exactly the way Falcon wants them classified, and they include tools for localization that are irrelevant to Falcon.

It is easy to build a custom character classifier. Here is one example:

enum char_type { OTH=0, WIT=1, LET=2, DIG=4, PUN=8 };

/* note in the following that punctuation marks
   with no use in FALCON are classified as OTH */
static const enum char_type char_class[256] = {
        /* NUL SOH STX ETX EOT ENQ ACK BEL BS  HT  LF  VT  FF  CR  SO  SI  */
           OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,WIT,WIT,WIT,WIT,WIT,OTH,OTH,
        /* DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM  SUB ESC FS  GS  RS  US  */
           OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,
        /*      !   "   #   $   %   &   '   (   )   *   +   ,   -   .   /  */
           WIT,OTH,OTH,OTH,OTH,OTH,OTH,OTH,PUN,PUN,PUN,PUN,PUN,PUN,PUN,PUN,
        /*  0   1   2   3   4   5   6   7   8   9   :   ;   <   =   >   ?  */
           DIG,DIG,DIG,DIG,DIG,DIG,DIG,DIG,DIG,DIG,PUN,PUN,PUN,PUN,PUN,OTH,
        /*  @   A   B   C   D   E   F   G   H   I   J   K   L   M   N   O  */
           PUN,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,
        /*  P   Q   R   S   T   U   V   W   X   Y   Z   [   \   ]   ^   _  */
           LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,PUN,PUN,PUN,OTH,OTH,
        /*  `   a   b   c   d   e   f   g   h   i   j   k   l   m   n   o  */
           OTH,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,
        /*  p   q   r   s   t   u   v   w   x   y   z   {   |   }   ~  DEL */
           LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,PUN,OTH,PUN,OTH,OTH,
        /* beyond ASCII                                                    */
           OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,
           OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,
           OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,
           OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,
           OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,
           OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,
           OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,
           OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH
};

/* check whether a character is in a particular class other than OTH */
#define ISCLASS(ch,class) (char_class[ch]&(class))

This allows a call to ISCLASS( ch, LET | DIG ) to return zero (false) if the character is not a letter or digit -- as we define it, and nonzero (true) if it is one.
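As an illustration of how a lexical analyzer might use such a macro, here is a small sketch that measures an identifier at the start of a string. It rests on several assumptions: identifiers are taken to be a letter followed by letters and digits, which may not be exactly Falcon's rule; the table is filled at run time by a hypothetical init_char_class() only to keep the example short, leaving punctuation unclassified; and the macro casts its argument to unsigned char, because a negative value (a signed char above 127, or EOF) would otherwise index outside the table.

```c
#include <stddef.h>

enum char_type { OTH=0, WIT=1, LET=2, DIG=4, PUN=8 };

/* zero-initialized, so every character starts out as OTH */
static unsigned char char_class[256];

/* the cast keeps the index within 0..255 even if ch is negative */
#define ISCLASS(ch,class) (char_class[(unsigned char)(ch)]&(class))

/* mark each character in the string chars as belonging to class t */
static void set_class( const char * chars, enum char_type t ) {
	const char * p;
	for (p = chars; *p != '\0'; p++) {
		char_class[ (unsigned char) *p ] = (unsigned char) t;
	}
}

/* hypothetical run-time setup, standing in for the full static table */
void init_char_class( void ) {
	set_class( "abcdefghijklmnopqrstuvwxyz", LET );
	set_class( "ABCDEFGHIJKLMNOPQRSTUVWXYZ", LET );
	set_class( "0123456789", DIG );
	set_class( " \t\n\v\f\r", WIT );
}

/* length of the identifier at the start of s, or 0 if there is none;
 * the terminating NUL is classified OTH, so the loop stops there
 */
size_t identifier_length( const char * s ) {
	size_t n = 0;
	if (ISCLASS( s[0], LET )) {
		n = 1;
		while (ISCLASS( s[n], LET | DIG )) n = n + 1;
	}
	return n;
}
```

Because the classes are distinct bits, a single table lookup and one AND answer the question "letter or digit?", which is the point of encoding the enumeration as powers of two.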