Lecture 8, Identifiers and String Pools

Part of the notes for CS:4980:1
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Identifiers

The first step in processing identifiers is fairly simple, we simply follow the rules in the railroad diagram and substitute the code into the lexical analyzer:

if (ch == EOF) {
        lex_next.type = ENDFILE;
        lex_next.value = 0; /* irrelevant */
} else if (ISCLASS(ch,LETTER)) {
        /* identifier or possibly a keyword */
        lex_next.type = IDENT;
        do {
                /* =BUG= must remember ch as part of the identifier's text */
                /* get the next letter */
                ch = getc( infile );
        } while ((ch != EOF) && ISCLASS(ch,LETTER|DIGIT));
        /* =BUG= lex_next.value must be set to something unique to the text */
} else if (ISCLASS(ch,DIGIT)) {
        /* decimal digit */

Strings are similar:

} else if ((ch == '"') || (ch == '\'') {
        /* string */
        char quote = ch; /* remember what quote mark to use */
        lex_next.type = STRING;
        /* get the first letter after the opening quote */
        ch = getc( infile );
        while ((ch != EOF) && (ch != '\n') && (ch != quote)) {
                /* =BUG= must remember ch as part of the string's text */
                /* get the next letter within the string */
                ch = getc( infile );
        }
        /* =BUG= lex_next.value must be set to something unique to the string */
        if (ch == quote) {
                /* get the next letter after the closing quote */
                ch = getc( infile );
        } else {
                error_warn( ER_BADSTR, line_number );
        }
} else if (ISCLASS(ch,PUNCTUATION)) {

Bug notices in the above clearly identify a problem. Whether we are processing an identifier or string, we must save the text somewhere, and we must compute a value that uniquely identifies the text of that lexeme, so that all instances of the same text map to the same value, and so that that value can be used to reconstruct the text when needed.

We will refer to the values we use as handles, where a handle is a generalization on the concept of a pointer. A pointer is the memory address of an object -- the text of an identifier or string, in this case, while a handle is not necessarily the address; it may map to the object in question through a small computation. Note that there is no reason not to use the same mechanism to solve both problem, so we will refer to our handles as symbol handles.

A Path Forward

The standard libraries associated with C++ offer tools that can certainly help with this problem. The std::string class supports a complete range of string operations, including concatenation, with the + operator, comparison using ==, substrings and input-output. In fact, the only real disadvantage of using the C++ string class is that it is far more powerful than we need, and we pay for that power with computational cost. Using C++ strings to solve our problems is akin to using a sledghammer to hit a tack.

In a small program, there would be absolutely no justification for building our own string storage mechanism. In a large program like a compiler, we can significantly improve overall performance by building our own mechanisms, but only in such a large program. In general, the larger a program is, the more justification there is for building custom tools to support that application. At the extreme of what has come to be called Google-scale data, finely-tuned custom-crafted mechanisms rule the day.

We only need some string support in one place, the lexical analyzer. In the lexical analyzer, however, the speed of the mechanism is crucial to the efficiency of the compiler. We have two goals, speed and compactness. To meet our speed goal, we aim to minimize the number of times any particular character is examined or moved. To meet our compactness goal, we aim to store the largest possible symbol table in the smallest possible space. C++ strings typically require 3 words in addition to the text of the string, one for the string length, one for the length of the buffer allocated to hold the string, and one for the reference count. In addition, the buffer allocated to hold each string is likely to be significantly larger than the string stored there. Note that, in compiling large production programs, there may be thousands of identifiers, so these inefficiencies can add up to quite a bit of memory.

But wait! Memory is cheap these days! Why worry about a lousy factor of 3 bloat on the size of the string pool? We've got megabytes to burn! Is that really true? The speed of memory access is far faster if our data structures are compact enough to fit in cache, and if our working-set of pages in a virtual memory system is small. Programs with compact data structures can run faster. And indeed, megabytes are cheap, but there are many applications today that involve gigabytes of data. In addition, some of today's text processing involves huge volumes of text (take a look at what Google stores). In those contexts, compact storage is still very valuable.

There are two issues here: First the issue of string storage. We will solve this with a component called the string pool. The string pool simply stores (and allows recall) of identifiers, without any other processing. We will cover this first, since its interface will shape the interface for the symbol table. The symbol table's job is to determine whether a string has been seen before, so that duplicate strings are never stored. A second role for the symbol table is to allow easy association of attributes with symbols. One way to do this is to have the symbol table map the text of symbols to symbol handles that are integers in a relatively compact range so that the user may construct arrays indexed by symbol handle; alternatively, we can search dictionaries using the handle as a search key instead of using the text of the symbol.

The String Pool

As we read an identifier, we need a place to hold the text of that identifier, temporarily. The easiest way to do this is to set aside, inside the lexical analyzer, a buffer to hold the text. This same buffer can be used to accumulate the text of quoted character strings. A simple array of characters suffices for this, but the size of the array limits the permissible length of identifiers (and possibly string constants), so it should be generous. The problems with this preliminary design are:

Obviously, where we do encounter limits, we should make those limits configuration constants that are read from some header file that holds global configuration constants or even provided by the makefile.

This leads us to a design:

The simplest string pool implementation uses a single array of characters, with a maximum size that limits the total size of all of the distinct strings in the program. More complex implementations allocate the string array in segments, but this is overkill for an initial implementation. If we build an appropriate interface, such design details can be changed later.

Inside the string pool, each string can be terminated by a NUL and addressed by a pointer to its first character. The string pool starts out empty, and it never shrinks, as every new identifier in the entire program eventually ends up added to the pool.

NUL termination is a C-centric design. Nothing in the specification of the Kestrel language mentions NUL terminators. Instead, Kestrel seems to rely on the ability to implicitly pass array bounds as it passes what are called conformant array parameters. We could just as easily encode the length of each string in the pool using a prefix. The choice between these two ways of encoding the length of a string is one that can be hidden inside the string pool code and should never be seen from the outside.

Kestrel allows string constants that include embedded NULs, but these are not primitive constants at the lexical level. Rather, they are created using expression syntax, for example "A"+NUL+"B" which represents a 3-character string where the middle character is a NUL. We might well want to use our string pool and symbol table to hold the results of such constant string concatenations, and we can do this if we agree from the start to record the length of each string as a prefix on that string in the string pool.

Here is a preliminary definition of the string pool interface that does not commit us to either null-termination or a prefix count:

/* stringpool.h -- string pool interface specification */

/* users of this header file should first include
 *   <stdio.h>
 *   <stdint.h>
 *   <stdbool.h>
 * users of this header file should define (probably from a global
 * configuration file)
 *   POOL_SIZE -- the number of characters in the pool.
 * when this file is used from stringpool.c (but nowhere else)
 *   the user must first define EXTERN as nothing or blank
 */

#ifndef EXTERN
        #define EXTERN extern
#endif

typedef uint32_t string_handle;
/* the string handle type.  C and C++ don't let us be more specific,
 * but it is actually values between 0 and POOL_SIZE-1, inclusive.
 */

#define STRING_NULL 0
/* the null string handle, never used to index an actual string
 * this serves as a null pointer to a string
 */

void string_init();
/* initializer for the string pool package */

string_handle string_start( int line );
/* setup to accumulate a new string and return its handle
 * the line parameter is given so that string-pool errors can be reported
 */

void string_append( char ch );
/* append one character to the string */

void string_done();
/* mark the end of the string */

void string_accept();
/* accept the new string as a permanent part of the string pool */

void string_reject();
/* reject the new string, it will not be included in the string pool */

/* note:
 * to add a string to the pool
 *   handle = string_start( current line )
 *   for each character ch in the string { string_append(ch) }
 *   string_done()
 *   if symbol_table_lookup_fails() {
 *     string_accept()
 *     symbol_table_add( handle )
 *   } else {
 *     string_reject()
 *   }
 */

void string_put( string_handle h, FILE * f );
/* output the text of the string to the human-readable file */

/* =BUG= the code generator will need a way to put strings into object files */
/* =BUG= could string_chars(s,thunk()) call thunk(ch) for each char of s? */

bool string_eq( string_handle h1, string_handle h2 );
/* compare the strings h1 and h2 and return true if they are identical */

#undef EXTERN

We can now rewrite some of the code from the lexical analyzer to use this string pool:

} else if (ISCLASS(ch,LETTER)) {
        /* identifier or keyword */
        string_handle str = string_start( line_number );
        lex_next.type = IDENT; /* later, we may find that it's a KEYWORD */
        do {
                /* save this letter */
                string_append( ch );
                /* get the next letter */
                ch = getc( infile );
        } while ((ch != EOF) && ISCLASS(ch,LETTER|DIGIT));
        string_done();
        /* =BUG= must call either string_accept() or string_reject() */
        /* =BUG= lex_next.value must be set to something unique to the text */
} else if (ISCLASS(ch,DIGIT)) {