Lecture 6, Strings and String Pools

Part of the notes for 22C:196:002 (CS:4908:0002)
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Supporting Lexical Analysis

Typically, we expect the lexical analyzer to output (or more accurately return, possibly in a global variable) a sequence of lexemes, where each lexeme has a type (identifier, reserved word, integer, punctuation, etc.). The type field itself can be an enumeration type, accompanied by an integer value field, the interpretation of which depends on the lexeme type.

For integer lexemes, the value can obviously be that integer. When the lexical analyzer first encounters an integer, it should set the value to zero; then, as each digit is processed, the value should be multiplied by the radix before the digit is added in. That was easy, but the other lexeme types pose more interesting problems.
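As a concrete sketch of this accumulation (the function name and the string-based interface are illustrative only; a real lexer would pull successive characters from its input stream):

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>

/* illustrative sketch: accumulate the value of an integer lexeme
 * from its text; value starts at zero, and each digit is added in
 * after the running value is multiplied by the radix */
uint32_t lex_integer_value( const char *text, uint32_t radix ) {
    uint32_t value = 0;
    while (isdigit( (unsigned char)*text )) {
        value = (value * radix) + (uint32_t)(*text - '0');
        text++;
    }
    return value;
}
```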

Punctuation

If all punctuation marks were single characters, the integer value of a punctuation mark lexeme could be, simply, the character code for that punctuation mark. The problem with this is that we have a number of punctuation lexemes that are built up from multiple characters. Consider, for example, comparison operators such as >= which might also be represented by ≥ (2265₁₆ in Unicode). This is also a string in UTF-8, with successive bytes encoded as E2₁₆ 89₁₆ A5₁₆.

In the syntax analysis code, we don't want to deal with alternative representations. This suggests that what we need is a single enumeration type that enumerates all of the legal punctuation in the language. The lexical analyzer's header file should export this enumeration just as it exports the enumeration giving the set of lexeme types.

The Kestrel documentation for the lexical level lists a number of punctuation marks, with comments indicating their purposes. Inside the compiler, we need names for these symbols. Consider, for example, writing C code something like this in the lexical.h file:

enum punct_type {
    PT_EQUALS /* = */,   PT_COLON  /* : */, 
    PT_LPAREN /* ( */,   PT_RPAREN /* ) */, 
    PT_LBRAKT /* [ */,   PT_RBRAKT /* ] */, 
    PT_LBRACE /* { */,   PT_RBRACE /* } */, 
    PT_COMMA  /* , */,   PT_ATSIGN /* @ */, 
    PT_ELIPS  /* .. */,  PT_NOTEQL /* <> */, 
    PT_GT     /* > */,   PT_GE     /* >= */, 
    PT_LT     /* < */,   PT_LE     /* <= */, 
    PT_PLUS   /* + */,   PT_MINUS  /* - */, 
    PT_TIMES  /* * */,   PT_DIV    /* / */, 
    PT_AND    /* & */,  PT_OR     /* | */, 
    PT_DOT    /* . */
};

There is nothing sacred about this suggested enumeration, nor is it exhaustive. It is easy to imagine adding mod and remainder operators (look at the Wikipedia write-up for the modulo operation to see how messy this can get if you permit negative numbers).

During syntax analysis, there will be many points where the parser is looking for one of several symbols. In such a context, it would be nice to be able to use formal operations on sets. Unfortunately, C and C++ do not permit the following fanciful bit of code:

set operators = { PT_PLUS, PT_MINUS };
if (  (lex_this.type == PUNCT)
   && (lex_this.value ∈ operators)
) {  ...

(In contrast, Pascal and several other languages had set types.) There are numerous C++ classes that have been constructed to implement sets, but most of them are overkill, allowing sets over arbitrary integer domains. So long as we are interested in sets over domains of 32 or fewer items (Kestrel has about 25 punctuation marks), we can use 32-bit integers to represent sets.

typedef uint32_t set32_t;

/* construct set32_t value from one integer */
#define TO_SET32(i)   (((set32_t)1)<<(i))

/* test if integer is in set32_t value; returns nonzero if so, zero if not */
#define IN_SET32(i,s) (TO_SET32(i) & (s))

It might even pay off, if only to reduce textual bloat, to create some specific defines for constructing larger sets compactly:

/* construct set32_t value from several integers */
#define TO_SET32_2(i,j)   (TO_SET32(i) | TO_SET32(j))
#define TO_SET32_3(i,j,k) (TO_SET32(i) | TO_SET32(j) | TO_SET32(k))

Once the basic set architecture is established, these kinds of auxiliary definitions can be added as needed -- there is no reason to add them if there is no demand, and they are easy to add, once the pattern is understood. Given these definitions (in the appropriate header file!), we can write code like this, using the bitwise and and or operators for set intersection and union:

if (  (lex_this.type == PUNCT)
   && IN_SET32( lex_this.value, TO_SET32_2( PT_PLUS, PT_MINUS ) )
) {  ...

This code can be more efficient than simply using conditional boolean operators, and the benefit grows as the number of set elements increases. Part of the benefit is that the set constructor, while it involves many or and shift operators, is composed entirely of constants. C and C++ compilers are very good at moving computation to compile time whenever possible, so what looks like a complex expression is, at run time, just a shift, an and operator and a branch on result equal to zero. Contrast that with the machine code you expect for this expression:

if (  (lex_this.type == PUNCT)
   && ((lex_this.value == PT_PLUS) || (lex_this.value == PT_MINUS ))
) {  ...

Keywords

Everything we said above about punctuation also applies to keywords. There are fewer than 30 keywords currently defined for Kestrel. It is natural to use an enumeration type for the value associated with a keyword, and we can use the set mechanism discussed above to work with sets of keywords in contexts where one of several keywords are expected.
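As a hedged sketch of what that might look like (the keyword names here are purely illustrative, not the real Kestrel keyword list, and the set32_t definitions repeat those suggested above):

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t set32_t;
#define TO_SET32(i)     (((set32_t)1)<<(i))
#define TO_SET32_2(i,j) (TO_SET32(i) | TO_SET32(j))
#define IN_SET32(i,s)   (TO_SET32(i) & (s))

/* illustrative keyword enumeration -- not the real Kestrel keyword list;
 * small enumeration values let keywords be used as set32_t elements */
enum key_type { KW_IF, KW_WHILE, KW_DO, KW_END };
```

With this, a parser fragment expecting one of several keywords can test membership with a single shift and mask, exactly as with punctuation.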

We will defer further consideration of this issue until we have investigated the problem of dealing with identifiers, because aside from there being a predefined set of keywords, all keywords will initially be recognized as identifiers. We must therefore deal with identifiers before worrying about how to figure out which potential identifiers are keywords and which are not.

Identifiers

For identifiers, we must save the characters of the identifier as they are read from the input stream. Once we have the entire identifier, we must examine the set of previously encountered identifiers to see if this one is the same as one of its predecessors. If it is the same, we return a value associated with that identifier, but if the identifier is new, we add it to the set and return its key.

The standard libraries associated with C++ offer tools that can certainly solve this problem. The std::string class supports a complete range of string operations, including concatenation with the + operator, comparison using ==, substrings and input-output. In fact, the only real disadvantage of using the C++ string class is that it is far more powerful than we need. Using it is akin to using a sledgehammer to hit a tack.

In a small program, there would be no justification for building our own string storage mechanism. In a large program like a compiler, however, we can significantly improve overall performance by building our own mechanisms; it is only at that scale that the effort pays off. In general, the larger a program is, the more justification there is for building custom tools to support it.

Unlike the set types discussed above, which are likely to be used throughout the parser, we only need string support in one place, the lexical analyzer. In the lexical analyzer, however, the speed of the mechanism is crucial to the efficiency of the compiler. We have two goals, speed and compactness. To meet our speed goal, we aim to minimize the number of times any particular character is examined or moved. To meet our compactness goal, we aim to store the largest possible symbol table in the smallest possible space. C++ strings typically require 3 words in addition to the text of the string, one for the string length, one for the length of the buffer allocated to hold the string, and one for the reference count. Note that, in compiling large production programs, there may be thousands of identifiers.

But wait! These days, memory is cheap! Why worry about a lousy factor of 3 bloat in the size of the string pool? We've got megabytes to burn! Or do we? Memory access is far faster if our data structures are compact enough to fit in cache, and if our working set of pages in a virtual memory system is small. Programs with compact data structures can run faster. And indeed, megabytes are cheap, but there are many applications today that involve gigabytes of data. In addition, some of today's text processing involves huge volumes of text (take a look at what Google stores). In those contexts, compact storage is still very valuable.

As we read an identifier, we need a place to hold the text of that identifier, temporarily. The easiest way to do this is to set aside, inside the lexical analyzer, a buffer to hold the text. This same buffer can be used to accumulate the text of quoted character strings. A simple array of characters suffices for this, but the size of the array limits the permissible length of identifiers (and possibly string constants), so it should be generous. Obviously, the maximum length should be a configuration constant, read from some header file that holds global configuration constants, and it should be significantly longer than the longest line of text a programmer is likely to produce.
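A minimal sketch of such a buffer might look like this (the names and the value of MAX_LEX_LENGTH are assumptions; a real lexer would take the constant from the global configuration header and report an error rather than silently truncate):

```c
#include <assert.h>
#include <string.h>

/* assumption: in the real compiler this comes from a configuration header */
#define MAX_LEX_LENGTH 128

static char lex_buffer[MAX_LEX_LENGTH + 1]; /* +1 for the terminator */
static int  lex_length = 0;

/* begin accumulating a new identifier or quoted string */
void lex_buffer_start( void ) {
    lex_length = 0;
    lex_buffer[0] = '\0';
}

/* add one character; this sketch silently truncates over-long lexemes */
void lex_buffer_append( char ch ) {
    if (lex_length < MAX_LEX_LENGTH) {
        lex_buffer[lex_length++] = ch;
    }
    lex_buffer[lex_length] = '\0';
}
```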

Now, where to put the rest of the strings, the ones that have previously been recognized as unique? The common solution is in a data structure called a string pool. This may be as simple as a single array of characters, with a maximum size that limits the total size of all of the distinct strings in the program, or it may be made up incrementally, for example, by allocating a new string pool segment each time the previous one fills. If we build an appropriate interface, these kinds of design details can be adjusted later. Inside the string pool, each string is terminated by a NUL and addressed by a pointer to its first character. The string pool starts out empty, and it never shrinks, as every new identifier in the entire program eventually ends up added to the pool.

Wait! Terminated by a NUL? That is a C-centric design. Nothing in the specification of the Kestrel language mentions NUL terminators. Instead, Kestrel seems to rely on the ability to implicitly pass array bounds as it passes an array as a parameter. We could just as easily encode the length of each string in the pool using a prefix. The choice between these two ways of encoding the length of a string is one that can be hidden inside the string pool code.
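As a sketch of the prefix-length alternative (all names here are illustrative; this version limits strings to 255 bytes and omits overflow checks, details a real pool would have to address):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define POOL_SIZE 4096
static unsigned char pool[POOL_SIZE];
static int pool_top = 0;

/* add len bytes of text to the pool, preceded by a one-byte length
 * prefix instead of followed by a NUL; return the entry's offset */
int pool_add( const char *text, uint8_t len ) {
    int at = pool_top;
    pool[pool_top++] = len;
    memcpy( &pool[pool_top], text, len );
    pool_top += len;
    return at;
}

/* the length of the string stored at a given offset */
uint8_t pool_length( int at ) {
    return pool[at];
}

/* a pointer to the (unterminated) text of the string at an offset */
const char * pool_text( int at ) {
    return (const char *)&pool[at + 1];
}
```

Because callers see only offsets and accessor functions, switching between NUL termination and a length prefix changes nothing outside the pool code.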

How do we search the string pool? A typical solution involves a hash table. As the lexical analyzer accumulates each string, the successive characters are also accumulated into a hash function. This provides the starting point for a string pool search. The use of a hash function dictates the final detail: outside the string pool, the handle on a string in the pool is the index of the hash-table entry for that string. The above arguments lead to a string pool interface something like the following:

/* stringpool.h */
/* users of this header file should define (probably from a global
 * configuration file) MAX_STRINGS -- the maximum number of distinct
 * strings permitted.  This sets the hash table size.
 */

typedef uint32_t string_handle;
/* the string handle type.  C and C++ don't let us be
 * more specific, but its values are actually between
 * 0 and MAX_STRINGS-1, inclusive.
 */

#define STRING_HASH(acc,ch) ((((acc) * _STRING_PRIME)+(ch)) % MAX_STRINGS)
/* the user, in accumulating the hash of a string, must allocate
 * an accumulator acc and update it with the statement
 *     acc = STRING_HASH( acc, ch)
 * for every character in the string.
 */

#define _STRING_PRIME 61
/* _STRING_PRIME is private to this header file.
 * _STRING_PRIME must be relatively prime to MAX_STRINGS;
 *      it is easiest if it is simply a prime number.
 * _STRING_PRIME must be less than (2**32)/MAX_STRINGS,
 *      in order to avoid integer overflow in STRING_HASH().
 */

string_handle string_lookup( char * s, string_handle h );
/* given s, a pointer to the text of a null-terminated string,
 * and given h, the hash function computed for that string,
 * look it up in the hash table internal to the string pool
 * and, if the string has not been seen before, add it to the
 * string pool.  Finally, return the string handle for the
 * string, as it is now guaranteed to be in the string pool.
 */

char * string_get( string_handle h );
/* given a string handle, return a pointer to the corresponding
 * null-terminated character string.  We need this so that users
 * of the string pool can print out strings in error messages,
 * and so that they can include the text of strings in object
 * files -- for debugging, linkage, or most crucially, if the
 * string pool is used for quoted character strings.
 */

The above specification is incomplete, but ignoring that, we are now set to state that the value associated with identifier lexemes should be the string handle of that identifier.

Incomplete? What if the string won't fit in the string pool because it is too big or there are too many strings? In that case, it would be fair to throw an exception, but it would be, by definition, an unhandlable exception: the only remedy is to enlarge MAX_STRINGS, recompile the compiler, and try again, and no exception handler can do that. Instead of raising an exception, it makes more sense to call a standard fatal error routine (defined in some other header file) that outputs an appropriate message and exits with an error status.
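To make the design concrete, here is one hedged sketch of an implementation that could sit behind the interface above, using linear probing in the hash table. The constants are illustrative, and the checks for a full table or a full pool (where that fatal error routine would be called) are omitted:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_STRINGS 1024
#define POOL_SIZE   16384
#define _STRING_PRIME 61
#define STRING_HASH(acc,ch) ((((acc) * _STRING_PRIME)+(ch)) % MAX_STRINGS)

typedef uint32_t string_handle;

static char pool[POOL_SIZE];         /* text of all strings, NUL terminated */
static int  pool_top = 0;            /* first unused byte of the pool */
static int  hash_table[MAX_STRINGS]; /* offset+1 into pool; 0 means empty */

string_handle string_lookup( char * s, string_handle h ) {
    for (;;) {
        if (hash_table[h] == 0) {           /* empty slot, a new string */
            hash_table[h] = pool_top + 1;   /* +1 so that 0 can mean empty */
            strcpy( &pool[pool_top], s );
            pool_top += (int)strlen( s ) + 1;
            return h;
        }
        if (strcmp( &pool[hash_table[h] - 1], s ) == 0) {
            return h;                       /* this string was seen before */
        }
        h = (h + 1) % MAX_STRINGS;          /* linear probe to next slot */
    }
}

char * string_get( string_handle h ) {
    return &pool[hash_table[h] - 1];
}
```

Note how looking up the same text twice returns the same handle, which is exactly the property the lexical analyzer needs for identifiers.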

Keywords (Again)

Some piece of initialization code needs to call string_lookup() to enter each reserved word in the string pool. There also needs to be a mechanism to map the string handles of reserved words to small integers suitable for storing in a set -- allowing sets of reserved words. Obviously, this mapping mechanism also needs to identify which strings are reserved words. There are trivial ways to do this (for example, an array of MAX_STRINGS bytes giving the reserved-word number of each reserved word, or FF₁₆ for non-reserved words). There are also sophisticated tricks such as perfect hash functions that can significantly reduce the memory overhead of this computation. Refinement, here, can be deferred.
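A sketch of the trivial byte-array mapping (the keyword names and all other identifiers here are illustrative assumptions, not part of any specified interface):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_STRINGS 1024
#define NOT_A_KEYWORD 0xFF   /* the FF sentinel for non-reserved words */

/* illustrative keyword numbers -- not the real Kestrel keyword list */
enum key_type { KW_IF, KW_WHILE, KW_DO, KW_END };

/* one byte per possible string handle */
static uint8_t keyword_number[MAX_STRINGS];

/* called once at startup, before the keyword handles are recorded */
void keyword_init( void ) {
    memset( keyword_number, NOT_A_KEYWORD, sizeof keyword_number );
}

/* record that string handle h names keyword number k;
 * in the real compiler, h would come from string_lookup() */
void keyword_define( uint32_t h, uint8_t k ) {
    keyword_number[h] = k;
}

/* map a handle to its keyword number, or NOT_A_KEYWORD */
uint8_t keyword_lookup( uint32_t h ) {
    return keyword_number[h];
}
```

This costs MAX_STRINGS bytes; a perfect hash function over the keywords could shrink that later without changing the callers.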

Character Strings

There is good reason to store character strings in the string pool: To prevent common strings from being duplicated in memory. If someone is writing a "string heavy" application and uses the string constant "Help" over and over again, it would be nice if the compiler only stored this string once, both at compile time and in the resulting object code. Suppose the same program also uses the identifier Help. There is no damage at all if the identifier and the string constant share the same string pool entry.

But there is another problem. The Kestrel language allows construction of string constants containing non-printing characters. For example, "This"LF"text"NUL is a string that includes an embedded linefeed and an embedded null. It is not one lexeme, it is a sequence of 4 lexemes, which are combined at the semantic level. Kestrel string lexemes never contain NUL characters, so NUL terminators can be used in the string pool, but string constants constructed by the semantic level may contain NUL characters. If we construct these higher-level strings in the string pool, we need to design the string pool without NUL-terminated strings.