Lecture 14, Predefined Symbols and Keywords

Part of the notes for CS:4980:1
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Predefined Symbols

The lexical analyzer we've discussed up to this point handles identifiers, but it does not handle keywords. The code for identifiers looks like this:

} else if (ISCLASS(ch,LETTER)) {
        /* identifier or possibly a keyword */
        lex_next.type = IDENT;
        symbol_start( line_number );
        do {
                /* save the character */
                symbol_append( ch );
                /* get the next character */
                ch = getc( infile );
        } while ((ch != EOF) && ISCLASS(ch,LETTER|DIGIT));
        lex_next.value = symbol_lookup();
} else if (ISCLASS(ch,DIGIT)) {

To handle keywords, we need to add code at the end of the above to ask: Is this identifier really a keyword? If so, mark it as such. We'll do this with the following modification to the last few lines given above:

        } while ((ch != EOF) && ISCLASS(ch,LETTER|DIGIT));
        lex_next.value = symbol_lookup();
        {
                /* check to see if it is a keyword */
                key_handle key = key_lookup( lex_next.value );
                if (key != KEY_INVALID) {
                        lex_next.type = KEYWORD;
                        lex_next.value = key;
                }
        }
} else if (ISCLASS(ch,DIGIT)) {

Note that we're not asking "is the text of this identifier a keyword?" Instead, we're asking "is the symbol handle of this identifier the symbol handle of a keyword?" This requires that the text of all of the keywords be pre-entered into the symbol table. To support this, we'll need to add a new interface to symboltable.h:

symbol_handle symbol_add( const char * s );
/* add the null terminated character string s to the symbol table,
 * returning its handle.  This is used during initialization only.
 */

The idea is to call symbol_add() once for each predefined identifier or keyword in the language in order to insert them into the symbol table (and as a side effect, into the string pool). This routine is redundant, in the sense that the user could always accomplish the same thing using the lower level character-at-a-time interface. The code for symbol_add() is trivial:

symbol_handle symbol_add( const char * s ) {
        symbol_start( 0 );
        while ( *s != '\0' ) {
                symbol_append( *s );
                s++;
        }
        return symbol_lookup();
}

A rather stupid initializer could use a sequence of calls to add each keyword to the symbol table:

        key_if_handle = symbol_add( "if" );
        key_else_handle = symbol_add( "else" );
        key_while_handle = symbol_add( "while" );

This is no way to handle keywords, but it might be an appropriate approach to a second class of special Kestrel symbols. In Kestrel, the notation r.f refers to field f of record object r, but dot notation is also used in two other contexts. If t is the name of a scalar type, t.max and t.min are the limits on that type, and if a is the name of an array, a.index is the index type of that array, while a.max and a.min are the limits on the array index, shorthand notations for a.index.max and a.index.min.

The Keyword Package

What we need to build on top of the symbol table is a keyword package. The lexical analyzer will use this to distinguish identifiers from keywords, and it will also output the text of a keyword on demand. An appropriate interface to the keyword package is straightforward:

/* keywords.h -- keyword recognition mechanism interface */

/* by: -- record of authorship -- date- etc */

/* users of this header file should first include
 *   stringpool.h
 *   symboltable.h
 */

/* list of all the keywords in the language */
typedef enum {
        KEY_INVALID,    /* the null keyword handle */
        KEY_END,        KEY_CONST,      KEY_FINAL,      KEY_TYPE,
        KEY_EXCEPTION,  KEY_VAR,        KEY_PROCEDURE,  KEY_FUNCTION,
        KEY_PRIVATE,    KEY_RESTRICTED, KEY_EXTERNAL,   KEY_ARRAY,
        KEY_SET,        KEY_OF,         KEY_RECORD,     KEY_IF,
        KEY_THEN,       KEY_ELSE,       KEY_SELECT,     KEY_CASE,
        KEY_WHILE,      KEY_DO,         KEY_UNTIL,      KEY_FOR,
        KEY_IN,         KEY_CATCH,      KEY_RAISE,      KEY_NULL
} key_handle;
/* the range of valid key_handle values is KEY_INVALID + 1 to KEY_NULL
 * always keep KEY_NULL at the end!
 */

void key_init();
/* initializer for keyword mechanism */

void key_put( key_handle k, FILE * f );
/* output the indicated keyword to the human readable file */

key_handle key_lookup( symbol_handle s );
/* return the keyword handle corresponding to the given symbol handle
 * if the symbol is not a keyword, returns KEY_INVALID
 */

With this framework in place, we can easily write code for key_put() using a constant array of strings to translate key handles to their text:

/* list of all the keywords in the language */
const char * const key_names[] = {
        "???", /* the null keyword handle */
        "end",          "const",        "final",        "type",
        "exception",    "var",          "procedure",    "function",
        "private",      "restricted",   "external",     "array",
        "set",          "of",           "record",       "if",
        "then",         "else",         "select",       "case",
        "while",        "do",           "until",        "for",
        "in",           "catch",        "raise",        "null"
};
/* IT IS CRUCIAL that the order of the strings listed in this table
 * exactly match the order of the keyword names declared in
 * key_handle in file keywords.h
 */

void key_put( key_handle k, FILE * f ) {
        /* output the keyword to the human readable file */
        fputs( key_names[ k ], f );
}

The more interesting problem is converting symbol handles to the corresponding keywords. This requires that all of the names listed above be entered into the symbol table, and from there into the string pool, and then it requires constructing a data structure to map between symbol handles and keyword handles. The simplest solution to this problem is computationally inefficient, but easy to code:

symbol_handle key_table[ KEY_NULL + 1 ];

void key_init() {
        /* initializer for keyword mechanism */
        int i;
        for (i = KEY_INVALID + 1; i <= KEY_NULL; i++) {
                key_table[i] = symbol_add( key_names[i] );
        }
}

key_handle key_lookup( symbol_handle s ) {
        /* return the keyword handle corresponding to the given symbol handle
         * if the symbol is not a keyword, returns KEY_INVALID
         */

        /* =BUG= stupid version, uses a linear search */

        int i = KEY_NULL;
        while ((i >= KEY_INVALID + 1) && (key_table[i] != s)) i--;
        return (key_handle)i;
}

This linear search implementation of key_lookup is a problem because it will be called once for every identifier scanned by the lexical analyzer in order to determine if that identifier is a keyword. Doing a linear search once for each identifier might be unreasonably expensive.

Might be unreasonably expensive, not is unreasonably expensive. If you write code for efficient searches and compare it with linear search, most experiments will show that a linear search beats, for example, a binary search for 8 or fewer items, and that a binary search beats a linear search for 32 or more items (rounding to the nearest power of two). In the grey area from about 8 to about 32 items, it really doesn't matter which searching method you use. Since our keyword table is under 32 elements, we're in the grey area.

But, hashing is generally faster than binary search, so we ought to look into how that can be used.

Hashed Keyword Lookup

We can do better using hashing. Note here that we are hashing without ever seeing the actual text of the symbol. Rather, we are starting with the symbol handle and using hashing to speed up the search of key_table. Because the symbol handle was itself the result of hashing the text of the symbol, we could say here that we are using rehashing.

We could do this with hash table entries that have 2 fields each, the symbol handle and the key handle:

key_handle key_lookup( symbol_handle s ) {
        /* return the keyword handle corresponding to the given symbol handle
         * if the symbol is not a keyword, returns KEY_INVALID
         */

        /* =BUG= stupid hash version */

        key_handle start = (s % KEY_NULL) + 1;
        key_handle i = start;
        do {
                if (hash_table[i].sym == s) return hash_table[i].key;
                if (i == KEY_NULL) i = KEY_INVALID + 1; else i = i + 1;
        } while (i != start);
        return KEY_INVALID;
}

What is stupid about this? It finds the correct symbol very quickly if the keyword is in the table, but it does a linear search through the entire table if the keyword is not there. If we make the hash table perhaps 10 percent larger than the number of keywords, with the extra entries initialized to invalid, it can be acceptably fast, but there is an alternative. Note that the 10 percent larger rule, with 2 fields per table entry, really implies a hash table that is 2.2 times the size of the set of keywords.

There is an extremely fast alternative that uses no loops or searching at all:

key_handle key_lookup( symbol_handle s ) {
        /* return the keyword handle corresponding to the given symbol handle
         * if the symbol is not a keyword, returns KEY_INVALID
         */

        /* excessively clever hash version */

        int hash = s % KEY_HASH_SIZE;
        if (key_hash[hash].sym == s) {
                return key_hash[hash].key;
        } else {
                return KEY_INVALID;
        }
}

The key to making this algorithm work is to select a hash table size that is just large enough that there are no collisions. That is, no two valid keywords hash to the same value. For our set of keywords, the minimum hash table size is just 89 entries. There is no magic way to find the size that works. You must do an exhaustive search, working up from the number of symbols until you find a size with no collisions. Here is the initializer for this hash table:

void key_init() {
        /* initializer for keyword lookup mechanism */
        int i;

        int trouble = 0;

        /* first, put default values in table */
        for (i = 0; i < KEY_HASH_SIZE; i++) {
                key_hash[i].key = KEY_INVALID;
                key_hash[i].sym = SYMBOL_INVALID;
        }

        /* second, try to put the keywords into the hash table */
        for (i = KEY_INVALID + 1; i <= KEY_NULL; i++) {
                symbol_handle s = symbol_add( key_names[i] );
                int hash = s % KEY_HASH_SIZE;
                if (key_hash[hash].key == KEY_INVALID) {
                        key_hash[hash].key = (key_handle)i;
                        key_hash[hash].sym = s;
                } else {
                        trouble = trouble + 1;
                }
        }

        if (trouble > 0) {
                /* =BUG= this should probably be a call to error_fatal() */
        }
}

There is an interesting (but obscure) area in the study of algorithms that involves what are called perfect hash functions, where the hash function itself is adjusted until all of the legal symbols hash into unique slots in a compact range of values. Here, we've avoided the need for such fancy hash functions by simply enlarging the table until everything fit.

The statistics on the number of collisions as a function of the size of the keyword table are interesting:

Table size: 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
Collisions:  6  7  7  7  7  7  9  9  7  8  6  3  8  9  8  1  3  4  5  5

Table size: 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69
Collisions:  5  1  4  3  3  1  4  3  5  3  2  1  3  3  5  3  5  3  3  2

Table size: 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89
Collisions:  4  2  4  4  5  2  2  3  5  4  3  2  2  2  3  2  2  2  1  0

You can see what looks like random variations in the number of collisions as the table size increases, but the trend is downward. One surprise is the small number of collisions when the table size is barely larger than the number of keywords. This is because our initial hash function used to put the symbols in the symbol table is fairly good.

Note, however, that this hash function is fragile. Any time the symbol table size is changed, this changes the first hash function applied to the text of the keywords, and this in turn changes where the collisions will occur with the second hash. This means that the experiments used to pick the keyword hash table size will need to be repeated whenever the configuration file is changed. This is why the keyword hash table initializer must check for collisions and report an error if any occur.

On the other hand, hashing with a search will usually find its target on the first try even if we only put 30 entries in the hash table: in the experiments above, only 6 of our keywords collided with other table entries at that size. This means that hashing with a search can be very efficient even when the table is only slightly larger than the set of keywords.