Lecture 6, Strings and String Pools

Part of the notes for 22C:196:002 (CS:4908:0002)
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Supporting Lexical Analysis

Typically, we expect the lexical analyzer to output (or more accurately return, possibly in a global variable) a sequence of lexemes, where each lexeme has a type (identifier, reserved word, integer, punctuation, etc.). The type field itself can be an enumeration type, with an integer value field, the interpretation of which depends on the lexeme type.
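
For concreteness, the lexeme representation described above might look something like this. The field names lex_this.type and lex_this.value are used in later examples in these notes, but aside from PUNCT, the enumeration constants here are hypothetical placeholders for the categories just listed:

#include <stdint.h>

/* the lexeme types; only PUNCT is referenced later in these notes,
 * the other names are placeholders for the categories listed above */
typedef enum { IDENT, RESWORD, NUMBER, PUNCT } lex_types;

typedef struct lexeme {
    lex_types type;   /* the classification of this lexeme */
    uint32_t  value;  /* interpretation depends on the type field */
} lexeme;

extern lexeme lex_this;  /* the current lexeme, returned in a global */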

For integer lexemes, the value, obviously, can be that integer. When the lexical analyzer first encounters an integer, it should set the value to zero, and then, as each digit is processed, the value should be multiplied by the radix prior to adding in that digit. That was easy, but the other lexeme types pose more interesting problems.
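
In C, the digit-accumulation loop might look something like this. This is only a sketch for radix 10, and next_char() is a hypothetical stand-in for whatever input mechanism the scanner actually uses:

static int scan_integer( int ch, int radix ) {
    /* ch holds the first digit on entry; this digit test assumes radix 10 */
    int value = 0;                  /* start the value at zero */
    while ((ch >= '0') && (ch <= '9')) {
        value = (value * radix) + (ch - '0'); /* multiply, then add the digit */
        ch = next_char();           /* hypothetical: advance the input */
    }
    return value;
}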

Punctuation

If all punctuation marks were single characters, the integer value of a punctuation mark lexeme could be, simply, the character code for that punctuation mark. The problem with this is that we have a number of punctuation lexemes that are built up from multiple characters. Consider, for example, comparison operators such as >= which might also be represented by ≥ (2265₁₆ in Unicode). This is also a string in UTF-8, with successive bytes encoded as E2₁₆ 89₁₆ A5₁₆.

In the syntax analysis code, we don't want to deal with alternative representations. This suggests that what we need is a single enumeration type that enumerates all of the legal punctuation in the language. The lexical analyzer's header file should export this enumeration just as it exports the enumeration giving the set of lexeme types.

The Falcon documentation for the lexical level lists a number of punctuation marks, with comments indicating their purposes. Inside the compiler, we need names for these symbols. Consider, for example, writing C code something like this in the lexical.h file:

enum punct_type {
    PT_EQUALS /* = */,   PT_COLON  /* : */, 
    PT_LPAREN /* ( */,   PT_RPAREN /* ) */, 
    PT_LBRAKT /* [ */,   PT_RBRAKT /* ] */, 
    PT_LBRACE /* { */,   PT_RBRACE /* } */, 
    PT_COMMA  /* , */,   PT_ATSIGN /* @ */, 
    PT_ELIPS  /* .. */,  PT_NOTEQL /* <> */, 
    PT_GT     /* > */,   PT_GE     /* >= */, 
    PT_LT     /* < */,   PT_LE     /* <= */, 
    PT_PLUS   /* + */,   PT_MINUS  /* - */, 
    PT_TIMES  /* * */,   PT_DIV    /* / */, 
    PT_AND    /* /\ */,  PT_OR     /* \/ */, 
    PT_DOT    /* . */
};

There is nothing sacred about this suggested enumeration, nor is it exhaustive. It is easy to imagine adding mod and remainder operators (look at the Wikipedia write-up for the modulo operation to see how messy this can get if you permit negative numbers).

During syntax analysis, there will be many points where the parser is looking for one of several symbols. In such a context, it would be nice to be able to use set operations such as:

set operators = { PT_PLUS, PT_MINUS };
if (  (lex_this.type == PUNCT)
   && (lex_this.value in operators)
) {  ...

Unfortunately, C and C++ do not have built-in set types. (In contrast, Pascal and several other languages had set types.) There are numerous C++ classes that have been constructed to implement sets, but most of them are overkill, allowing sets over arbitrary integer domains. So long as we are interested in sets over domains of 32 or fewer items (the punctuation domain currently includes under 25 members), we can use 32-bit integers to represent sets of punctuation marks.

#include <stdint.h>

typedef uint32_t set32_t;

/* construct set32_t value from one integer */
#define TO_SET(i)   (((set32_t)1) << (i))

/* test if integer is in set32_t; returns nonzero if so, zero if not */
#define IN_SET(i,s) (TO_SET(i) & (s))

It might even pay off, if only to reduce textual bloat, to create some specific defines for constructing larger sets compactly:

/* construct set32_t value from several integers */
#define TO_SET2(i,j)   (TO_SET(i) | TO_SET(j))
#define TO_SET3(i,j,k) (TO_SET(i) | TO_SET(j) | TO_SET(k))

Once the basic set architecture is established, these kinds of auxiliary definitions can be added as needed -- there is no reason to add them if there is no demand, and they are easy to add, once the pattern is understood. Given these definitions (in the appropriate header file!), we can write code like this, using the bitwise and and or operators for set intersection and union:

if (  (lex_this.type == PUNCT)
   && IN_SET( lex_this.value, TO_SET2( PT_PLUS, PT_MINUS ) )
) {  ...

This code can be more efficient than simply using conditional boolean operators, and the benefit grows as the number of set elements increases. Part of the benefit is that the set constructor, while it involves many or and shift operators, is composed entirely of constants and can be evaluated at compile time. At run time, there is just one shift and one and operation to be computed. Contrast that with the machine code you expect for this expression:

if (  (lex_this.type == PUNCT)
   && ((lex_this.value == PT_PLUS) || (lex_this.value == PT_MINUS ))
) {  ...

Keywords

Everything we said above about punctuation also applies to keywords. There are fewer than 30 keywords currently defined for Falcon. It is natural to use an enumeration type for the value associated with a keyword, and we can use the set mechanism discussed above to work with sets of keywords in contexts where one of several keywords is expected.

We will defer further consideration of this issue until we have investigated the problem of dealing with identifiers, because aside from there being a predefined set of keywords, all keywords will initially be recognized as identifiers. We must therefore deal with identifiers before worrying about how to figure out which potential identifiers are keywords and which are not.

Identifiers

For identifiers, we must save the characters of the identifier as they are read from the input stream. Once we have the entire identifier, we must examine the set of previously encountered identifiers to see if this one is the same as one of its predecessors. If it is the same, we return a value associated with that identifier, but if the identifier is new, we add it to the set and return its key.

The standard libraries associated with C++ offer tools that can certainly solve this problem. The std::string class supports a complete range of string operations, including concatenation with the + operator, comparison using ==, substrings, and input/output. In fact, the real disadvantage of using the C++ string class is that it is far more powerful than we need. Using it is akin to using a sledgehammer to hit a tack.

In a small program, there would be absolutely no justification for building our own string storage mechanism. In a large program like a compiler, however, we can significantly improve overall performance by building our own mechanisms. In general, the larger a program is, the more justification there is for building custom tools to support that application.

Unlike the set types discussed above, which are likely to be used throughout the parser, we only need some string support in one place, the lexical analyzer. In the lexical analyzer, however, the speed of the string mechanism is crucial to the efficiency of the compiler. We have two goals, speed and compactness. To meet our speed goal, we aim to minimize the number of times any particular character is examined or moved. To meet our compactness goal, we aim to store the largest possible symbol table in the smallest possible space. C++ strings typically require 3 words in addition to the text of the string, one for the string length, one for the length of the buffer allocated to hold the string, and one for the reference count. Note that, in compiling large production programs, there may be thousands of identifiers.

As we read an identifier, we need a place to hold the text of that identifier, temporarily. The easiest way to do this is to set aside, inside the lexical analyzer, a buffer to hold the text. This same buffer can be used to accumulate the text of quoted character strings. A simple array of characters suffices for this, but the size of the array limits the permissible length of identifiers (and possibly string constants), so it should be generous. Obviously, the maximum length should be a configuration constant, read from some header file that holds global configuration constants, and it should be significantly longer than one line of text.
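
A minimal sketch of such a buffer follows. The names and the fallback size here are hypothetical, and as just noted, the size really belongs in the global configuration header:

#define SYMBUF_SIZE 256  /* hypothetical; should come from the configuration file */

static char symbuf[SYMBUF_SIZE]; /* text of identifier or string being scanned */
static int  symlen = 0;          /* number of characters accumulated so far */

static void symbuf_append( char ch ) {
    /* add one character to the buffer, quietly truncating over-long
     * text; a real scanner should report a lexical error instead */
    if (symlen < (SYMBUF_SIZE - 1)) {
        symbuf[symlen] = ch;
        symlen = symlen + 1;
        symbuf[symlen] = '\0';   /* keep the buffer null terminated */
    }
}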

Now, where to put the rest of the strings, the ones that have previously been recognized as unique? The common solution is in a data structure called a string pool. This may be as simple as a single array of characters, with a maximum size that limits the total size of all of the distinct strings in the program, or it may be made up incrementally, for example, by allocating a new string pool segment each time the previous one fills. If we build an appropriate interface, these kinds of design details can be adjusted later. Inside the string pool, each string is terminated by a null and addressed by a pointer to its first character. The string pool starts out empty, and it never shrinks, as every new identifier in the entire program eventually ends up added to the pool.

How do we search the string pool? A typical solution involves a hash table. As the lexical analyzer accumulates each string, the successive characters are also accumulated into a hash value. This provides the starting point for a string pool search. The use of a hash function dictates the final detail: outside the string pool, the handle on a string in the pool is the index of the hash-table entry for that string. The above arguments lead to a string pool interface something like the following:

/* stringpool.h */
/* users of this header file should define (probably from a global
 * configuration file) MAX_STRINGS -- the maximum number of distinct
 * strings permitted.  This sets the hash table size.
 */

#include <stdint.h>   /* defines uint32_t, used below */

typedef uint32_t string_handle;
/* the string handle type.  C and C++ don't let us be
 * more specific, but it is actually values between
 * 0 and MAX_STRINGS-1, inclusive.
 */

#define STRING_HASH(acc,ch) ((((acc) * _STRING_PRIME)+(ch)) % MAX_STRINGS)
/* the user, in accumulating the hash of a string, must allocate
 * an accumulator acc and update it with the statement
 *     acc = STRING_HASH( acc, ch)
 * for every character in the string.
 */

#define _STRING_PRIME 61
/* _STRING_PRIME is private to this header file.
 * _STRING_PRIME must be relatively prime to MAX_STRINGS;
 *      it is easiest if it is simply a prime number.
 * _STRING_PRIME must be less than 2**32/MAX_STRINGS;
 *      in order to avoid integer overflow in STRING_HASH()
 */

string_handle string_lookup( char * s, string_handle h );
/* given s, a pointer to the text of a null-terminated string,
 * and given h, the hash function computed for that string,
 * look it up in the hash table internal to the string pool
 * and, if the string has not been seen before, add it to the
 * string pool.  Finally, return the string handle for the
 * string, as it is now guaranteed to be in the string pool.
 */

char * string_get( string_handle h );
/* given a string handle, return a pointer to the corresponding
 * null-terminated character string.  We need this so that users
 * of the string pool can print out strings in error messages,
 * and so that they can include the text of strings in object
 * files -- for debugging, linkage, or most crucially, if the
 * string pool is used for quoted character strings.
 */

The above specification is incomplete, but ignoring that, we are now set to state that the value associated with identifier lexemes should be the string handle of that identifier.

Incomplete? What if the string won't fit in the string pool because it is too big or there are too many strings? In that case, it is fair to throw an exception, but it will be, by definition, an unhandlable exception, since the handler would need to enlarge MAX_STRINGS, recompile the compiler, and try again. Handlers don't do this! Instead of raising an exception, it makes more sense to call a standard fatal error routine (defined in some other header file) that outputs an appropriate message and exits with an error status.
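
To make the interface concrete, here is a minimal sketch of one possible implementation. The single fixed-size pool array, POOL_SIZE, the linear-probe collision policy, and the particular sizes are all assumptions, not part of the specification above; fatal_error() stands for the fatal error routine just discussed:

/* stringpool.c -- a sketch, not a finished implementation */
#include <string.h>

#define MAX_STRINGS 1021   /* hypothetical; the header requires users to define this */
#include "stringpool.h"

#define POOL_SIZE 65536    /* hypothetical configuration constant */

extern void fatal_error( const char * msg );  /* from some other header file */

static char   pool[POOL_SIZE];     /* the text of every distinct string */
static int    pool_pos = 0;        /* the first unused character in the pool */
static char * table[MAX_STRINGS];  /* the hash table; NULL marks an empty slot */

string_handle string_lookup( char * s, string_handle h ) {
    while (table[h] != NULL) {     /* probe until an empty slot or a match */
        if (strcmp( table[h], s ) == 0) return h;  /* seen this string before */
        h = (h + 1) % MAX_STRINGS; /* collision, so try the next slot; note
                                    * that a completely full table makes this
                                    * loop forever -- a real version must
                                    * count probes and call fatal_error() */
    }
    {   /* not seen before, so add it to the pool */
        int len = strlen( s ) + 1; /* include the terminating null */
        if ((pool_pos + len) > POOL_SIZE) fatal_error( "string pool full" );
        strcpy( &pool[pool_pos], s );
        table[h] = &pool[pool_pos];
        pool_pos = pool_pos + len;
    }
    return h;
}

char * string_get( string_handle h ) {
    return table[h];
}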

Keywords (Again)

Some piece of initialization code needs to call string_lookup() to enter each reserved word in the string pool. There also needs to be a mechanism to map the string handles of reserved words to small integers suitable for storing in a set -- allowing sets of reserved words. Obviously, this mapping mechanism also needs to identify which strings are reserved words. There are trivial ways to do this (an array of MAX_STRINGS bytes giving the reserved-word number of each reserved word, or FF₁₆ for non-reserved words). There are also sophisticated tricks such as perfect hash functions that can significantly reduce the memory overhead of this computation. Refinement, here, can be deferred.
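
A sketch of that initialization and of the trivial byte-array mapping follows. Every name here except string_lookup() and STRING_HASH() is hypothetical, the keywords shown are placeholders rather than Falcon's actual reserved words, and starting the hash accumulator at zero is my assumption, not something the interface above requires:

#include "stringpool.h"

#define NOT_KEYWORD 0xFF   /* marks strings that are not reserved words */

/* hypothetical keyword numbers, one per reserved word */
typedef enum { KW_IF, KW_WHILE /* ... */ } keyword_type;

static unsigned char kw_table[MAX_STRINGS]; /* keyword number or NOT_KEYWORD */

static void enter_keyword( char * s, int kw_num ) {
    string_handle h = 0;  /* assumption: the hash accumulator starts at zero */
    char * p;
    for (p = s; *p != '\0'; p++) h = STRING_HASH( h, *p );
    kw_table[ string_lookup( s, h ) ] = kw_num;
}

void keyword_init( void ) {
    int i;
    for (i = 0; i < MAX_STRINGS; i++) kw_table[i] = NOT_KEYWORD;
    enter_keyword( "if",    KW_IF );     /* placeholder keywords */
    enter_keyword( "while", KW_WHILE );
    /* ... one call per reserved word ... */
}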

Character Strings

There is good reason to store character strings in the string pool: To prevent common strings from being duplicated in memory. If someone is writing a "string heavy" application and uses the string constant "Help" over and over again, it would be nice if the compiler only stored this string once in the resulting object code. Suppose the same program also uses the identifier Help. There is no damage at all if the identifier and the string constant share the same string pool entry.

But, there is another problem. The Falcon language allows escape sequences within strings in order to include non-printing characters. For example, "This\LF\text\NUL\" is a string that includes an embedded linefeed and an embedded null. There must be, somewhere, a symbol table that maps LF and NUL to their character codes. There are 34 such codes defined in ASCII: 32 control characters plus SP, an alternative way to say " ", and DEL, the non-printing character encoded as 7F₁₆. Where do these go?

It seems a pity to add another mechanism for these, distinct from the keyword and identifier mechanisms. There are, of course, many solutions to this problem. If one byte per identifier is set aside in the keyword table, so long as there are no keywords that are also control characters, we could use one bit of that byte to distinguish 7-bit control characters from keywords, but this does not generalize cleanly to support Unicode. Unicode, with its code space of over a million potential characters, suggests that the data structure for looking up named characters might want to be external, on a disk file. This issue can be deferred if the entire mechanism for looking up character names is packaged as a separate (abstract or C++) class.
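
If it is packaged that way, the interface can be as small as the following sketch, whose names are hypothetical; everything behind it (an in-memory table, a perfect hash, or an external file) can then change without touching any callers:

/* charname.h -- sketch of a separate character-name lookup package */

int charname_lookup( const char * name );
/* given the name of a character, for example "LF" or "NUL", return
 * its character code, or -1 if no character has that name.  The
 * data structure behind this interface is free to move to an
 * external file when Unicode names are added.
 */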