Lecture 15, Predefined Symbols and Keywords

Part of the notes for CS:4908:0004 (22C:196:003)
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Predefined Symbols

The interface to the symbol table was discussed. The following were included in the symboltable.h file (abridged):

//symboltable.h

typedef uint32_t symbol_handle;

void symbol_start();
void symbol_append( char ch );
symbol_handle symbol_lookup();

Here, to deal with the problem of predefined symbols, we propose adding a new interface routine to the symbol table. This could look like the following in symboltable.h:

symbol_handle symbol_add( char * s );
// add the string s (a null terminated character string) to the symbol
// table, returning its handle.  This is used during initialization only.

The idea is to call symbol_add once for each predefined identifier or keyword in the language in order to insert them into the symbol table (and as a side effect, into the string pool). This routine is redundant, in the sense that the user could always accomplish the same thing using the lower level character-at-a-time interface, but it is convenient. The code for symbol_add() is trivial:

symbol_handle symbol_add( char * s ) {
	symbol_start();
	while ( *s != NUL ) symbol_append( *s );
	return symbol_end();
}

An initializer could use a sequence of calls to add each keyword to the symbol table, like this:

	key_if_handle = symbol_add( "if" );
	key_else_handle = symbol_add( "else" );
	key_while_handle = symbol_add( "while" );

This is probably not the end of the story! We really want a compact representation for keyword handles, so that we can construct sets of keywords. Thus, keyword handles should be small integers in a compact range. So long as we have fewer than 32 keywords, this allows us to represent sets of keywords using single-word bit vectors.

The Keyword Package

What we need to build on top of the symbol table is a keyword package. The lexical analyzer will use this to distinguish identifiers from keywords, and it will also output the text of a keyword on demand. An appropriate interface to the keyword package is straightforward:

// keywords.h
//   user interface to the keyword recognition mechanism

// users of this header file should first include
//   
//   
//   stringpool.h
//   symboltable.h

/* by: -- record of authorship -- date- etc */

enum key_handle { // a list of all the keywords in the language
	key_END,
        key_CONST,
        key_TYPE,
        key_EXCEPTION,
        key_VAR,
        key_PROCEDURE,
        key_FUNCTION,
        key_PRIVATE,
        key_EXTERNAL,
        key_ENUM,
        key_ARRAY,
        key_OF,
        key_RECORD,
        key_IF,
        key_THEN,
        key_ELSE,
        key_SELECT,
        key_CASE,
        key_WHILE,
        key_DO,
        key_UNTIL,
        key_FOR,
        key_IN,
        key_TRY,
        key_HANDLE,
        key_RASIE,
        key_NULL
}
// any additions to this enumeration must adjust KEY_MAX accordingly and
// IT IS CRUCIAL that the order of the keyword names listed in this table
// exactly match the order of the keyword strings in the initialization of
// key_names in file keywords.c

#define KEY_MIN key_END
#define KEY_MAX key_NULL
#define KEY_INVALID (KEY_MAX + 1)

void key_init();
// initializer for keyword mechanism

void key_put( key_handle k, FILE * f );
// output the keyword to the human readable file

key_handle key_lookup( symbol_handle s );
// output the keyword handle corresponding to the given symbol handle
// if the symbol is not a keyword, returns KEY_INVALID

With this framework in place, we can easily write code for key_put() using a constant array of strings to translate key handles to their text:

const char * const key_names[ KEY_MAX + 1 ] = {
	"end",
        "const",
        "type",
        "exception",
        "var",
        "procedure",
        "function",
        "private",
        "external",
        "enum",
        "array",
        "of",
        "record",
        "if",
        "then",
        "else",
        "select",
        "case",
        "while",
        "do",
        "until",
        "for",
        "in",
        "try",
        "handle",
        "raise",
        "null"
}
// IT IS CRUCIAL that the order of the strings listed in this table
// exactly match the order of the keyword names in the declaration of
// key_handle in file keywords.h

void key_put( key_handle k, FILE * f ) {
        // output the keyword to the human readable file
        fputs( key_names[ k ], f );
}

The more interesting problem is converting string handles to the corresponding keywords. This requires that all of the names listed above be entered into the symbol table, and from there into the string pool, and then it requires constructing a data structure to map between symbol handles and keyword handles. The simplest solution to this problem is computatinally inefficient, but easy to code:

string_handle key_table[ KEY_MAX + 1 ]

void key_init() {
        // initializer for keyword mechanism
        int i;
        for (i = KEY_MIN; i <= KEY_MAX; i++) {
		key_table[i] = symbol_add( key_names[i] );
	}
}

key_handle key_lookup( symbol_handle s ) {
	// output the keyword handle corresponding to the given symbol handle
	// if the symbol is not a keyword, returns KEY_INVALID

	/*BUG*/ // stupid version, uses a linear search

	key_handle i = KEY_MIN;
	while ((i <= KEY_MAX) && (key_table[i] != s)) i++;
	return s;
	// this relies on the fact that KEY_INVALID is KEY_MAX + 1
}

This implementation of key_lookup given here is a problem because it will be called once for every identifier scanned by the lexical analyzer in order to determine if that identifier is a keyword. Doing a linear search once for each identifier seems excessive.

We can do better using hashing, a topic we will discuss later.

Information Hiding

Note that our design above isolates the details of the mechanism used to identify keywords from the mechanisms used to identify symbols in the symbol table, and it isolates the lexical analyzer from how keywords are recognized. As a result, you can live for a long time using this crummy keyword recognition mechanism, and only replace it with something better at a far later time, after users start to complain about the speed of the compiler (if ever).

The only reason we will bother improving on the above design is because it is not difficult and it is a good example of a trick typical to a wide variety of applications.