Lecture 7, Odds and Ends

Part of the notes for 22C:196:002 (CS:4908:0002)
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Groupware Issues -- the Nuclear Option

Right now, your Subversion directory is c_196_jones and your group's directory will be named something like c_196_jones/GROUPNAME. It's annoying to type long path names with underscores in them, and you have no need to access the other stuff in c_196_jones that doesn't involve your group. Consider doing this (starting in your home directory):

cd c_196_jones
svn cleanup
cd ..
rm -rf c_196_jones

Having done that, you have completely disconnected your directory structure from the Subversion repository. You need to reconnect, but why bother reconnecting to the whole repository? Just do this to reconnect:

svn checkout --username HAWKID https://svn.divms.uiowa.edu/repos/c_196_jones/GROUPNAME

Now, your group's directory hangs directly off your home directory, so you have direct access to the project without typing underscores or other annoying things.

Groupware Issues -- a Less Radical Option

Alternatively, you can leave the entire structure in place and create a symbolic link to your group's directory. Presuming your group's name is GROUPNAME (a very unrealistic name), consider, for example, using the following shell command while in your home directory:

ln -s c_196_jones/GROUPNAME GROUPNAME

This creates a symbolic link named GROUPNAME from your home directory to your group's directory within the class Subversion repository. As a result, typing cd GROUPNAME is equivalent to typing the far more verbose cd c_196_jones/GROUPNAME.

The net result of this suggestion is largely equivalent to the first suggestion. You no longer have to type annoying long pathnames with slashes and underscores in them. The difference is that this option leaves you the opportunity to use cd c_196_jones/jones followed by svn update to take a look at the code being given away by the course instructor. For that reason, this is the recommended approach.

Note that you should never do an svn update in any other group's directory, since it could lead to inappropriate copying of code from other groups. Note also that all Subversion activity is logged.

Error Reporting

The normal way programmers think about error reporting code is to just output the error message wherever it is required. For example, in the middle of the lexical analyzer, while processing a string, if an end of file is found during the scan for a closing quote:

while ((ch != EOF) && (ch != quote)) { /* scan over string */
	SCAN();
	// ==BUG== must deal with nonprintable characters in strings
}
if (ch == EOF) {
	fputs( "close quote expected when end of file found", stderr );
} else {
	SCAN(); /* skip closing quote */
}

This is, of course, better than no error checking, and it is better than no error message, but it has several weaknesses:

The error message doesn't say what line the error was on.
Uniform format for error reporting is at the mercy of the programmer.
Error reporting text strings are scattered all over the place.

The last objection is particularly significant when software must be internationalized, with error messages changed to other languages. Ad hoc error reporting, in this case, forces the internationalization team to search the entire source for character strings, figure out whether they must be changed, and then translate those strings.

In large projects, therefore, it is quite common to force all error messages through a single error reporting package. A typical error reporting package has, at minimum, the following interface:

// errors.h

// an enumeration type to encode the error messages
enum error_message {
	// intended for use in calls to error_fatal
	ER_BADFILE,
	...
};

void error_fatal( enum error_message er, int line );
	// output message er and exit the program indicating failure

In a compiler, non-fatal errors and possibly warnings are also important. It is sometimes a good idea to abort the compilation after several non-fatal errors in order to prevent a huge stream of useless error messages, since usually only the first few messages are really useful. Inside the error package, we must relate the messages to their strings:

/* errors.c */

#include <stdio.h>
#include <stdlib.h>
#include "errors.h"

// the error messages.
// NOTE:  this array must have one entry for each
// member of the error_message enumeration, in exactly the same order
static const char * message[] = {
	// intended for use in calls to error_fatal
	/* ER_BADFILE */ "Cannot open input file.",
        ...
};

void error_fatal( enum error_message er, int line ) {
	// output the message er and exit the program indicating failure
	fprintf( stderr, "Fatal error on line %d: %s\n", line, message[ er ] );
	exit( EXIT_FAILURE );
}

Note that the error message array is a constant array of pointers to string constants, and those strings do not make up the entire message. Instead, the error_fatal() routine composes the message from boiler-plate text (in this case, "Fatal error on line") and the line number. All formatting is done in the error print routines, while the array contains only the parts of the message that "fill in the blanks" in the standard error message.
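The requirement that the message array match the enumeration entry for entry is easy to violate as messages are added. One common way to keep the two in step is to generate both from a single list using the preprocessor. The following is only a sketch of that idea, not part of the package described here: ERROR_LIST, ER_TOOMANY, ER_COUNT and error_text() are hypothetical names introduced for illustration.

```c
/* One line per error: the enumeration name, then the message text.
   ER_BADFILE and its text come from the notes; ER_TOOMANY is hypothetical. */
#define ERROR_LIST \
	X( ER_BADFILE, "Cannot open input file." ) \
	X( ER_TOOMANY, "Too many errors." )

/* expand the list once to build the enumeration;
   ER_COUNT (hypothetical) ends up equal to the number of messages */
#define X( name, text ) name,
enum error_message { ERROR_LIST ER_COUNT };
#undef X

/* expand it again to build the message array, in exactly the same order */
#define X( name, text ) text,
static const char * const message[] = { ERROR_LIST };
#undef X

/* hypothetical helper: return the text that goes with message er */
const char * error_text( enum error_message er ) {
	return message[ er ];
}
```

With this arrangement, adding a message means adding one line to the list, and the array cannot drift out of order. The cost is that the preprocessor tricks make the header harder to read.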

If the error reporting module is well designed, localizing the program to a different language, for example changing it from English to French, can be quite simple. With luck, the only changes would be to the contents of the message array and to the format string in error_fatal() (and in other error reporting routines).
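Returning to the earlier remark about non-fatal errors, a routine that reports an error, counts it, and gives up after too many might look something like the following. This is a minimal sketch under assumptions: error_report(), error_total(), ER_TOOMANY and the limit of 10 are all hypothetical, and a real package would pick its own names and cutoff.

```c
#include <stdio.h>
#include <stdlib.h>

/* ER_BADFILE is from the notes; ER_TOOMANY is hypothetical */
enum error_message { ER_BADFILE, ER_TOOMANY };

static const char * const message[] = {
	/* ER_BADFILE */ "Cannot open input file.",
	/* ER_TOOMANY */ "Too many errors."
};

#define ERROR_LIMIT 10       /* assumed cutoff, not from the notes */
static int error_count = 0;  /* non-fatal errors reported so far */

void error_fatal( enum error_message er, int line ) {
	/* output the message er and exit the program indicating failure */
	fprintf( stderr, "Fatal error on line %d: %s\n", line, message[ er ] );
	exit( EXIT_FAILURE );
}

void error_report( enum error_message er, int line ) {
	/* output the message er, count it, and abort at the limit */
	fprintf( stderr, "Error on line %d: %s\n", line, message[ er ] );
	error_count = error_count + 1;
	if (error_count >= ERROR_LIMIT) error_fatal( ER_TOOMANY, line );
}

int error_total( void ) {
	/* number of non-fatal errors reported so far */
	return error_count;
}
```

The compiler's main program could consult error_total() after parsing to decide whether to go on to code generation and what exit status to return.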

Since we are writing a compiler, it is useful to note that a very common form of syntax error is "x encountered when y expected". For example, the error that started this discussion is "EOF encountered when end quote expected". This suggests that we should have an error reporting routine called, for example, er_gotbutwant( got, want ). The got parameter should probably be the actual character or lexeme encountered in the input text -- in the lexical analyzer, it will be a character; in the syntax analyzer, a lexeme. The want parameter can be one of the enumerated error messages from the error reporting package.
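A character-level version of this routine might be sketched as follows. The details here are assumptions made for illustration: the extra line parameter, the ER_ENDQUOTE message, the helper er_format_gotbutwant() that builds the text in a buffer, and the choice to print unprintable characters in hexadecimal are not taken from the notes.

```c
#include <stdio.h>

/* hypothetical message naming what was wanted */
enum error_message { ER_ENDQUOTE };

static const char * const message[] = {
	/* ER_ENDQUOTE */ "end quote"
};

/* describe the character got in a printable way */
static void describe( char * buf, size_t len, int got ) {
	if (got == EOF) {
		snprintf( buf, len, "EOF" );
	} else if ((got >= 0x20) && (got <= 0x7E)) {
		snprintf( buf, len, "'%c'", got );  /* printable ASCII */
	} else {
		snprintf( buf, len, "character 0x%02X", (unsigned)got & 0xFF );
	}
}

/* compose the whole message in buf; all formatting is done here */
void er_format_gotbutwant( char * buf, size_t len,
                           int got, enum error_message want, int line ) {
	char what[ 32 ];
	describe( what, sizeof what, got );
	snprintf( buf, len, "Error on line %d: %s encountered when %s expected\n",
	          line, what, message[ want ] );
}

/* report that got was found on the given line when want was expected */
void er_gotbutwant( int got, enum error_message want, int line ) {
	char buf[ 128 ];
	er_format_gotbutwant( buf, sizeof buf, got, want, line );
	fputs( buf, stderr );
}
```

Under these assumptions, the lexical analyzer fragment that began this discussion could call er_gotbutwant( ch, ER_ENDQUOTE, line ) in place of its fputs() call, where line is whatever line counter the lexical analyzer maintains.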

The got-but-want routine would appear to be a higher level tool than the basic error reporting tool, since the lexical analyzer depends on the ability to report basic errors, but the got-but-want error reporter may depend on the lexical analyzer. There are ways of dealing with circular dependencies between program components, but we do not need to use those for this problem. It is almost certainly simpler to provide different got-but-want error reporting mechanisms for items at different levels of abstraction, for example er_gotbutwant() for reporting at a low level, and lex_gotbutwant() for reporting at the lexical level.

Classifying Characters

C and C++ offer a standard set of character classifying tools in the header file <ctype.h>. These are wonderful, to a point, but they have several weaknesses -- notably, they do not classify things exactly the way Kestrel wants them classified, and they include tools for localization that are irrelevant to Kestrel.

It is easy to build a custom character classifier. Here is one example:

enum char_type { OTH=0, WIT=1, LET=2, DIG=4, PUN=8 };

// note in the following that punctuation marks
// with no use in Kestrel are classified as OTH
static const enum char_type char_class[256] = {
        // NUL SOH STX ETX EOT ENQ ACK BEL BS  HT  LF  VT  FF  CR  SO  SI
           OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,WIT,WIT,WIT,WIT,WIT,OTH,OTH,
        // DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM  SUB ESC FS  GS  RS  US
           OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,
        //      !   "   #   $   %   &   '   (   )   *   +   ,   -   .   /
           WIT,OTH,OTH,OTH,OTH,OTH,PUN,OTH,PUN,PUN,PUN,PUN,PUN,PUN,PUN,PUN,
        //  0   1   2   3   4   5   6   7   8   9   :   ;   <   =   >   ?
           DIG,DIG,DIG,DIG,DIG,DIG,DIG,DIG,DIG,DIG,PUN,PUN,PUN,PUN,PUN,OTH,
        //  @   A   B   C   D   E   F   G   H   I   J   K   L   M   N   O
           PUN,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,
        //  P   Q   R   S   T   U   V   W   X   Y   Z   [   \   ]   ^   _
           LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,PUN,OTH,PUN,OTH,OTH,
        //  `   a   b   c   d   e   f   g   h   i   j   k   l   m   n   o
           OTH,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,
        //  p   q   r   s   t   u   v   w   x   y   z   {   |   }   ~  DEL
           LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,PUN,PUN,PUN,PUN,OTH,
        // beyond ASCII
           OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,
           OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,
           OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,
           OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,
           OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,
           OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,
           OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,
           OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH
};

/* check whether a character is in a particular class other than OTH */
#define ISCLASS(ch,class) (char_class[ch]&(class))

This allows a call to ISCLASS( ch, LET | DIG ) to return zero (false) if the character is not a letter or digit, according to our definition, and nonzero (true) if it is one.
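As a usage sketch, here is how a scanner might measure the length of an identifier with such a classifier. Several things here are assumptions made to keep the example short: the table is filled in at run time by a hypothetical char_class_init() that handles only letters, digits and a few white-space characters (a stand-in for the full table above), identifiers are assumed to be a letter followed by letters and digits, and the macro casts its argument to unsigned char so that a negative char value or EOF cannot index outside the table.

```c
enum char_type { OTH=0, WIT=1, LET=2, DIG=4, PUN=8 };

/* abbreviated stand-in for the full table; all OTH (zero) until filled in */
static enum char_type char_class[256];

/* hypothetical initializer; assumes ASCII, where letters are contiguous */
void char_class_init( void ) {
	int c;
	for (c = 'a'; c <= 'z'; c++) char_class[c] = LET;
	for (c = 'A'; c <= 'Z'; c++) char_class[c] = LET;
	for (c = '0'; c <= '9'; c++) char_class[c] = DIG;
	char_class[' '] = WIT;
	char_class['\t'] = WIT;
	char_class['\n'] = WIT;
}

/* as ISCLASS above, but the cast keeps the index in the range 0 to 255 */
#define ISCLASS(ch,cls) (char_class[(unsigned char)(ch)]&(cls))

/* length of the identifier at the start of s, or 0 if there is none */
int ident_length( const char * s ) {
	int n = 0;
	if (!ISCLASS( s[0], LET )) return 0;  /* must start with a letter */
	while (ISCLASS( s[n], LET | DIG )) n = n + 1;
	return n;
}
```

The loop stops at the terminating NUL because NUL is classified as OTH, so no separate end-of-string test is needed.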

Where does the character classifier go? The most obvious place is within the implementation of the lexical analyzer, lexical.c, since it is unlikely to be needed elsewhere. It could, however, be in a separate header file, with its relationship to the lexical analyzer documented in the makefile.