Lecture 18, Rehashing

Part of the notes for 22C:196:003 (CS:4908:0003)
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

The Keyword Problem

Consider the problem of distinguishing reserved words (keywords) from other identifiers in the lexical analyzer. Let's assume the following as background: Assume that the string pool interface includes the constant STRING_IDS and that, on entering or looking up a string in the string pool, the string pool lookup routine or routines return an integer in the range 0 to STRING_IDS-1 that uniquely identifies the string. This integer is the string identifier or string ID.

Most probably, but hidden inside the string pool, there is an array of pointers to strings indexed by the string ID. In mature code, the search of this array for a string is likely to be done by hashing, where the strings themselves are stored in a second array, the string pool itself.

So, the problem is, how to distinguish between identifiers such as this and keywords such as else. There are several ways to do it. Here, we will limit our attention to solutions that operate inside the lexical analyzer, so in all cases, the lexical analyzer first identifies the keyword as an identifier, and then determines that it is a keyword.

Furthermore, we can assume that there is a support routine (either in the lexical analyzer or in a separate keywords package) used by the lexical analyzer, called keyword_make() that enters a keyword in the string pool. A routine called keyword_init() calls keyword_make() for each keyword, perhaps containing a sequence of calls something like this:

keyword_init( "if", KEY_IF );
keyword_init( "then", KEY_THEN );
keyword_init( "else", KEY_ELSE );
keyword_init( "end", KEY_END );

Obviously, keyword_init() uses an interface to the string pool to put the keywords in the string pool, and it must maintain an auxiliary data structure to relate keywords to the values that the lexical analyzer returns.

An Array of Keyword Values

The simplest way to do this is to maintain an array indexed on the range 0 to STRING_IDS-1 that contains the keyword value for each identifier. Let's call this array key_code[]. For normal identifiers, key_code[] would have the value KEY_INVALID or something similar to indicate that the identifier is not a keyword. For the entry for the string ID of the keyword if, though, the value would be KEY_IF, and similarly for all other keywords.

This is fast but consumes lots of memory, it takes a whole word per potential identifer, most likely thousands of words of RAM. It is not that RAM is expensive in terms of static cost per bit, but rather, that large sparse data structures spoil the effectiveness of cache memory. Therefore, we will seek an alternative.

Rehashing

Note that the string ID of each identifier is likely to be the result of a hashed search of the string pool. Therefore, the 30 or so keywords in our language will have string IDs that are scattered widely and apparently at random in the range 0 to STRING_IDS-1. Suppose we make key_code[] small, indexed over the range 0 to KEY_CODES-1, where 0 to KEY_CODES is much smaller than STRING_IDS but larger than the number of keywords. Given that id is the string ID of some keyword, key_code[id % KEY_CODES] is the first entry in key_code[] where we will look for that keyword. Of course, there may be collisions, so we need to add an extra field to each entry, the string ID of each keyword. This lets us search key_code[] for the identifer id to see if it is a keyword as follows:

code = id % KEY_CODES;
start = code;
for (;;) {
        if (key_code[ code ].id == id) break; /* we found the keyword */
        if (key_code[ code ].code == KEY_INVALID) break; /* not a keyword */
        code = code + 1;
        if (code == KEY_CODES) code = 0;
        if (code == start) this should never happen because
                we assumed that the KEY_CODES is greater than
                the number of keywords.
}
if (key_code[ code ].id == id) {
        we found a keyword
} else {
        it is not a keyword
}

This is just another example of hashing, and since the main symbol table is likely to be a hash table, we refer to this search as a rehash. Here, we are hashing on the integers, not on strings, and the mod function itself is our hash function.

As with all hashing, the search should be very fast, and rejection of non-keywords should take only a few iterations of the loop so long as KEY_CODES is sufficiently greater than the number of keywords.

Note the following line, quoted from above:

code = id % KEY_CODES;

As written, this uses the remainder or (in the case of unsigned numbers) mod operation. Generally speaking, whether hardware or software are used, computing the remainder is a side effect of computing the quotient, and this computation is very slow. For a typical modern computer, the add and subtract instructions, as well as bitwise logical operations, all take about 1 CPU cycle and can be pipelined with superscalar execution.

In contrast, multiplication, in the best case, takes around 4 clock cycles, while division is even worse, commonly taking on the order of 10 clock cycles. neither pipelined execution nor superscalar computation are typically applicable to these instructions.

This is why it is very common to use powers of two for values like KEY_CODES. If, for example, the value is 64, then the above line can be rewritten as:

code = id & 0x3F;

Or, alternatively,

code = id & (KEY_CODES - 1);

This only works for powers of two, but it is as fast as addition, and if there is only a 50% chance of a collision in the hash, it is likely that the entire rehash table lookup will be done before a divide instruction would have finished.