Lecture 16, Rehashing

Part of the notes for 22C:196:003 (CS:4908:0003)
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

The Keyword Problem

Consider the problem of distinguishing reserved words (keywords) from other identifiers in the lexical analyzer. As a matter of background, recall that the symbol table interface included the constant SYMBOL_SIZE and that, on entering or looking up a symbol in the symbol table, the lookup routine or routines return an integer symbol handle in the range 0 to SYMBOL_SIZE-1 that uniquely identifies the symbol.

Most probably, but hidden inside the symbol table, there is an array of pointers to strings indexed by the the symbol handle. In mature code, the search of this array for a string is likely to be done by hashing, where the strings themselves are stored in a second array, the string pool.

The problem we face here is how to distinguish between identifiers such as this and keywords such as else. We have already presented a stupid mechanism based on a linear search. There are several ways to do so. Here, we will limit our attention to solutions that do not involve keyword recognitions mechanisms that are independent of the symbol table. We will focus on mechanisms that map symbol handles in the range from 0 to SYMBOL_SIZE to keyword handles in the range from KEY_MIN to KEY_MAX, where the latter ange is much smaller than the former.

An Array of Keyword Values

We could solve the problem with an array of SYMBOL_SIZE keyword handles. This leads to the fastest solution to the problem, completely eliminating the need for a linear search but using lots of memory:

key_handle key_table[ SYMBOL_SIZE ];
/*BUG*/ // stupid waste of memory

void key_init() {
        // initializer for keyword mechanism
        int i;
        for (i = 0; i < SYMBOL_SIZE; i++) key_table[i] = KEY_INVALID;
        for (i = KEY_MIN; i <= KEY_MAX; i++) {
                key_table[ symbol_add( key_names[i] ) ] = i;
        }
}

key_handle key_lookup( symbol_handle s ) {
        // output the keyword handle corresponding to the given symbol handle
        // if the symbol is not a keyword, returns KEY_INVALID

        return key_table[ s ];
        // this is very very fast
}

It is commonplace these days to assume that memory is almost free, but in fact, that is not true. Large data structures cause inefficient use of space in the cache, actually, in the caches. Most modern computers use multiple levels of cache, and the perfomance benefit of keeping things small can be significant.

Rehashing

Note that the string handle of each identifier is likely to be the result of a hashed search of the string pool. Therefore, the 30 or so keywords in our language will have string IDs that are scattered widely and apparently at random in the range 0 to SYMBOL_SIZE-1. Suppose we make key_table[] small, indexed over the range 0 to KEY_TABLE_SIZE-1, where this range includes the range but is slightly larger KEY_MIN to KEY_MAX. For example, in Kestrel, we have fewer than 32 keywords,so it makes sense to use a value of KEY_TABLE_SIZE somewhat larger than 32.

Given a string handle in a large range, we can hash this down into a key-table index by using a simple mod operator. The quality of this hash function will depend on KEY_TABLE_SIZE. If we use a size that is a power of two, the mod function simply gives us the least significant few bits of the symbol handle. This is extremely easy to compute (no division is required).

If the hashing of keywords into the symbol table is good, their symbol handles should be distributed effectively at random over the entire range of legal symbol handles, and the result is, taking the least significant bits of the symbol handle should also give a random result.

Note that, as with hashed searches, even a bad rehash function will work. In the worst case, if all symbols rehash to the identical value, the hash function simply becomes a linear search. If the rehash function distributes symbols reasonably well over the range, even if the table is 90% full, the average serch time will be only 3 or 4 iterations of the search loop. Here is some code:

#define REHASH_SIZE 32
// range 0 to REHASH_SIZE-1 must properly contain range KEY_MIN to KEY_MAX
#define rehash(h) ((h) % REHASH_SIZE)

struct {
        symbol_handle sym;
        key_handle key;
} rehash_table[ REHASH_SIZE ];

void key_init() {
        // initializer for keyword mechanism
        int i;
        for (i = 0; i < REHASH_SIZE; i++) {
                rehash_table[i].key = KEY_INVALID;
                rehash_table[i].sym = SYMBOL_INVALID;
        for (i = KEY_MIN; i <= KEY_MAX; i++) 
                symbol_handle s = symbol_add( key_names[i] );
                int j = rehash(s); // this must be identical in key_lookup
                while (rehash_table[j].key != KEY_INVALID) {
                        // this is a very special hash search
                        // during initialization, each symbol will be seen
                        // just once, so we don't need to check s
                        // we are guaranteed that the table will never fill
                        j = (j + 1); // this must be identical in key_lookup
                        if (j == REHASH_SIZE) j = 0;
                }
                rehash_table[j].key = i;
                rehash_table[j].sym = s;
        }
}

key_handle key_lookup( symbol_handle s ) {
        // output the keyword handle corresponding to the given symbol handle
        // if the symbol is not a keyword, returns KEY_INVALID

        int j = rehash(s); // this must be identical in key_init
        while ((rehash_table[j].sym != s)
        &&     (rehash_table[j].sym != SYMBOL_INVALID)) {
                // this is a special hash search
                // because we know the table will never be full
                j = (j + 1); // this must be identical in key_init
                if (j == REHASH_SIZE) j = 0;
        }
        return rehash_table[j].key;
}

Note the use of SYMBOL_INVALID above. Prior to this point, we did not need an invalid symbol handle. We need one here, so we must go back to the symbol table and add an out-of-range value to the set of symbol handles.

This code works fine, but it is a bit annoying to need two entries per symbol in the hash table, one for the symbol handle and one for the keyword handle. If we first determined the slot numbers in the rehash hash table to which each keyword would hash, and then assigned keyword handles based on those slot numbers, we could eliminate the key field of each rehash_table entry. The result would be very fragile code, because any change to the symbol table ordering, for example, because of a change to the symbol table's primary hash function, would force the reordering of the keword handles.

We could completely eliminate the need for a search by using a perfect hash function for rehashing. As with the above idea, this also leads to fragile code where any change to the primary symbol table mechanism could force the reengineering of the keyword handles.