Lecture 6, Strings and String Pools

Part of the notes for CS:4908:0004 (22C:196:004)
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Supporting Lexical Analysis

Typically, we expect the lexical analyzer to output (or more accurately return, possibly in a global variable) a sequence of lexemes, where each lexeme has a type (identifier, reserved word, integer, punctuation, etc.). The type field itself can be an enumeration type, with an integer value field, the interpretation of which depends on the lexeme type.

For integer lexemes, the value, obviously, can be that integer. When the lexical analyzer first encounters an integer, it should set the value to zero, and then, as each digit is processed, the value should be multiplied by the radix before adding in that digit. That was easy, but the other lexeme types pose more interesting problems.

Punctuation

If all punctuation marks were single characters, the integer value of a punctuation mark lexeme could simply be the character code for that punctuation mark. The problem with this is that we have a number of punctuation lexemes that are built up from multiple characters. Consider, for example, comparison operators such as >= which might also be represented by ≥ (2265 in hexadecimal Unicode). This is also a string in UTF-8, with successive bytes encoded as E2 89 A5 (hexadecimal).

In the syntax analysis code, we don't want to deal with alternative representations. This suggests that what we need is a single enumeration type that enumerates all of the legal punctuation in the language. The lexical analyzer's header file should export this enumeration just as it exports the enumeration giving the set of lexeme types.

The Kestrel documentation for the lexical level lists a number of punctuation marks, with comments indicating their purposes. Inside the compiler, we need names for these symbols. Consider, for example, writing C (or C++) code something like this in the lexical.h file:

enum punct_type {
    PT_EQUALS /* = */,   PT_COLON  /* : */, 
    PT_LPAREN /* ( */,   PT_RPAREN /* ) */, 
    PT_LBRAKT /* [ */,   PT_RBRAKT /* ] */, 
    PT_LBRACE /* { */,   PT_RBRACE /* } */, 
    PT_COMMA  /* , */,   PT_ATSIGN /* @ */, 
    PT_ELIPS  /* .. */,  PT_NOTEQL /* <> */, 
    PT_GT     /* > */,   PT_GE     /* >= */, 
    PT_LT     /* < */,   PT_LE     /* <= */, 
    PT_PLUS   /* + */,   PT_MINUS  /* - */, 
    PT_TIMES  /* * */,   PT_DIV    /* / */, 
    PT_AND    /* & */,   PT_OR     /* | */, 
    PT_DOT    /* . */
};

There is nothing sacred about this suggested enumeration, nor is it exhaustive. It is easy to imagine adding mod and remainder operators (look at the Wikipedia write-up for the modulo operation to see how messy this can get if you permit negative numbers).

What about comments? The above example uses C-style comments instead of the C++ line-end comments (//) so that it can organize the enumeration with 2 columns of definitions, each with its own comments. Mixing the two comment styles may be a bit awkward, but arranging the enumeration with two commented columns of text makes it more readable than it would be with all of the comments confined to the ends of the lines.

During syntax analysis, there will be many points where the parser is looking for one of several symbols. In such a context, it would be nice to be able to use formal operations on sets. Unfortunately, C and C++ do not permit the following fanciful bit of code:

set operators = { PT_PLUS, PT_MINUS };
if (  (lex_this.type == PUNCT)
   && (lex_this.value ∈ operators)
) {  ...

(In contrast, Pascal and several other languages had set types.) There are numerous C++ classes that have been constructed to implement sets, but most of them are overkill, allowing sets over arbitrary integer domains with very heavy-weight set implementations. So long as we are interested in sets over domains of 32 or fewer items (Kestrel has about 25 punctuation marks), we can use 32-bit integers to represent sets.

typedef uint32_t set32_t;

/* construct set32_t value from one integer */
#define TO_SET32(i)   (((set32_t)1)<<(i))

/* test if integer is in set32_t, returns nonzero if so, zero if not */
#define IN_SET32(i,s) (TO_SET32(i) & (s))

It might even pay off, just in terms of textual bloat, to create some specific defines for constructing larger sets compactly:

/* construct set32_t value from several integers */
#define TO_SET32_2(i,j)   (TO_SET32(i) | TO_SET32(j))
#define TO_SET32_3(i,j,k) (TO_SET32(i) | TO_SET32(j) | TO_SET32(k))

Once the basic set architecture is established, these kinds of auxiliary definitions can be added as needed -- there is no reason to add them if there is no demand, and they are easy to add once the pattern is understood. Given these definitions in an appropriate header file, for example, sets.h, we can write code like this, using the bitwise and and or operators for set intersection and union:

if (  (lex_this.type == PUNCT)
   && IN_SET32( lex_this.value, TO_SET32_2( PT_PLUS, PT_MINUS ) )
) {  ...

This code can be more efficient than simply using conditional boolean operators, and the benefit grows as the number of set elements increases. Part of the benefit is that the set constructor, while it involves many or and shift operators, is composed entirely of constants. C and C++ compilers are very good at moving computation to compile time whenever possible, so what looks like a complex expression is, at run time, just a shift, an and operator and a branch on result equal to zero. Contrast that with the machine code you expect for this expression:

if (  (lex_this.type == PUNCT)
   && ((lex_this.value == PT_PLUS) || (lex_this.value == PT_MINUS ))
) {  ...

Keywords

Everything we said above about punctuation also applies to keywords. There are fewer than 30 keywords currently defined for Kestrel. It is natural to use an enumeration type for the value associated with a keyword, and we can use the set mechanism discussed above to work with sets of keywords in contexts where one of several keywords are expected.

We will defer further consideration of this issue until we have investigated the problem of dealing with identifiers, because aside from there being a predefined set of keywords, all keywords will initially be recognized as identifiers. We must therefore deal with identifiers before worrying about how to figure out which potential identifiers are keywords and which are not.

Identifiers

For identifiers, we must save the characters of the identifier as they are read from the input stream. Once we have the entire identifier, we must examine the set of previously encountered identifiers to see if this one is the same as one of its predecessors. If it is the same, we return a value associated with that identifier, but if the identifier is new, we add it to the set and return its key.

The standard libraries associated with C++ offer tools that can certainly solve this problem. The std::string class supports a complete range of string operations, including concatenation with the + operator, comparison using ==, substrings and input-output. In fact, the only real disadvantage of using the C++ string class is that it is far more powerful than we need. Using it is akin to using a sledgehammer to hit a tack.

In a small program, there would be absolutely no justification for building our own string storage mechanism, but in a large program like a compiler, we can significantly improve overall performance by building our own mechanisms. In general, the larger a program is, the more justification there is for building custom tools to support that application.

Unlike the set types discussed above, which are likely to be used throughout the parser, we only need string support in one place, the lexical analyzer. In the lexical analyzer, however, the speed of the mechanism is crucial to the efficiency of the compiler. We have two goals, speed and compactness. To meet our speed goal, we aim to minimize the number of times any particular character is examined or moved. To meet our compactness goal, we aim to store the largest possible symbol table in the smallest possible space. C++ strings typically require 3 words in addition to the text of the string, one for the string length, one for the length of the buffer allocated to hold the string, and one for the reference count. Note that, in compiling large production programs, there may be thousands of identifiers.

But wait! These days, memory is cheap! Why worry about a lousy factor of 3 bloat on the size of the string pool? We've got megabytes to burn! Or do we? The speed of memory access is far faster if our data structures are compact enough to fit in cache, and if our working-set of pages in a virtual memory system is small. Programs with compact data structures can run faster. And indeed, megabytes are cheap, but there are many applications today that involve gigabytes of data. In addition, some of today's text processing involves huge volumes of text (take a look at what Google stores). In those contexts, compact storage is still very valuable.

There are two issues here. First is the issue of string storage, which we will solve with a component called the string pool. The string pool simply stores (and allows recall of) identifiers, without any other processing. We will cover this first, since its interface will shape the interface for the symbol table. The symbol table's job is to determine whether a string has been seen before, so that duplicate strings are never stored. A second role for the symbol table is to allow easy association of attributes with symbols. One way to do this is to have the symbol table map the text of symbols to symbol handles that are integers in a relatively compact range, so that the user may construct arrays indexed by symbol handle.

The String Pool

As we read an identifier, we need a place to hold the text of that identifier, temporarily. The easiest way to do this is to set aside, inside the lexical analyzer, a buffer to hold the text. This same buffer can be used to accumulate the text of quoted character strings. A simple array of characters suffices for this, but the size of the array limits the permissible length of identifiers (and possibly string constants), so it should be generous. The problems with this preliminary design are that the buffer size imposes an arbitrary limit, and that the text of each accepted string must be copied a second time, from the buffer to its permanent home.

Obviously, where we do encounter limits, we should make those limits configuration constants that are read from some header file that holds global configuration constants or even provided by the makefile.

What is the alternative? Accumulate strings in the same place that they will eventually be stored. After accumulating a string, the symbol table mechanism can look it up to see if it has been encountered before. If so, discard the newly accumulated string. If not, keep it.

The simplest string pool implementation uses a single array of characters, with a maximum size that limits the total size of all of the distinct strings in the program. More complex implementations allocate the string array in segments, but this is overkill for an initial implementation. If we build an appropriate interface, such design details can be changed later. Inside the string pool, each string is terminated by a NUL and addressed by a pointer to its first character. The string pool starts out empty, and it never shrinks, as every new identifier in the entire program eventually ends up added to the pool.

Wait! Terminated by a NUL? That is a C-centric design. Nothing in the specification of the Kestrel language mentions NUL terminators. Instead, Kestrel seems to rely on the ability to implicitly pass array bounds as it passes an array as a parameter. We could just as easily encode the length of each string in the pool using a prefix. The choice between these two ways of encoding the length of a string is one that can be hidden inside the string pool code and should never be seen from the outside.

Here is a preliminary definition of the string pool using NUL-terminated strings. It could be changed to use another representation, and the user need never know (except that such a change could allow NULs to be included in strings):

// stringpool.h
//   user interface specifications for the string pool.

// users of this header file should first include
//   <stdio.h>
//   <stdint.h>
//   <stdbool.h>
// users of this header file should define (probably from a global
// configuration file)
//   POOL_SIZE -- the number of characters in the pool.
// when this file is used from stringpool.c (but nowhere else)
//   the user must first define EXTERN as nothing or blank

#ifndef EXTERN
	#define EXTERN extern
#endif

typedef uint32_t string_handle;
// the string handle type.  C and C++ don't let us be
// more specific, but it is actually values between
// 0 and POOL_SIZE-1, inclusive.

EXTERN char _string_pool[ POOL_SIZE ];
// private, the actual location where the text of strings is stored

EXTERN string_handle _string_limit;
// private, index of the first unused location in _string_pool

EXTERN string_handle _string_pos;
// private, position to store new characters added to _string_pool

#define STRING_NULL 0
// the null string, never used to index an actual string

//void string_init();
#define string_init() { _string_limit = 1; }
// initializer for the string pool package
// initial value 1 guarantees that STRING_NULL won't accidentally be used

//string_handle string_start();
#define string_start() ( _string_pos = _string_limit, _string_limit )
// setup to accumulate a new string and return its handle

//void string_append( char ch );
#define string_append(ch) {                          \
        if (_string_pos > (POOL_SIZE - 1)) /*BUG*/;  \
        _string_pool[ _string_pos ] = (ch);          \
        _string_pos++;                               \
}
// append one character to the string

//void string_done();
#define string_done() {                              \
        if (_string_pos > (POOL_SIZE - 1)) /*BUG*/;  \
        _string_pool[ _string_pos ] = '\0';          \
        _string_pos++;                               \
}
// mark the end of the string

//void string_accept();
#define string_accept() { _string_limit = _string_pos; }
// accept the new string as a permanent part of the string pool

//void string_reject();
#define string_reject() { /* nothing; _string_limit stays unchanged */ }
// reject the new string, it will not be included in the string pool

// note:
// to add a string to the pool
//   handle = string_start()
//   for each character ch in the string { string_append(ch) }
//   string_done()
//   if symbol_table_lookup_fails() {
//     string_accept()
//     symbol_table_add( handle )
//   } else {
//     string_reject()
//   }

/*BUG*/ // the code generator will need a way to put strings into object files
/*BUG*/ // perhaps string_chars(s,thunk()) calls thunk(c) for each char c in s

void string_put( string_handle h, FILE * f );
// output the string to the human-readable file

bool string_eq( string_handle h1, string_handle h2 );
// compare the strings h1 and h2, returns true if they have identical text

#undef EXTERN

We have adopted some conventions in the above that should probably be documented in the project README file: names beginning with an underscore are private to the package and must never be used from outside it, and the EXTERN trick lets a single header file serve both to declare variables for users and to define them in stringpool.c.

Aside from the incomplete elements indicated by bug notices, this header file illustrates a common feature of many C and C++ programs: It is possible to use the #define mechanism of the C preprocessor to create macros that appear, to the user, to be function calls. In classical C, stdio.h included something like the following:

#define putchar(ch) putc(ch,stdout)
#define putc(ch,f) (                                        \
        /* a very long define to put ch in the buffer */    \
        /* for f and if the buffer is full call write */    \
        /* to append the buffer to the file */              \
)
int fputc( int ch, FILE * f );

Note that putc(ch,f) and fputc(ch,f) have the same behavior, except that the former is a macro, while the latter is a function call. The reason that it is sometimes sensible to use macros is that the function call overhead can be significant. Use of a macro avoids not only the instructions for calling the function and returning, but also the instructions for passing parameters, allocating and deallocating space on the stack, and saving and restoring registers.

We are now set to state that the value associated with identifier lexemes should be the string handle of that identifier.

The C and C++ preprocessors support the #define mechanism, the basis of macros in both languages. The notation used for macro definition in C and C++ is awkward and limited in power, requiring the macro body to be defined on a single line of text. The multi-line macro bodies given above use the backslash character (\) at the end of each line to hide from the macro processor the line ends that were included for readability. Aligning these backslashes in a column makes the text of the macro easier to read.

Incomplete? What if the string won't fit in the string pool because it is too big or there are too many strings? In that case, it is fair to throw an exception, but it will be, by definition, an unhandlable exception, since the handler would need to enlarge POOL_SIZE, recompile the compiler, and try again. Handlers in C++ can't do this! Instead of raising an exception, it makes more sense to call a standard fatal error routine (defined in some other header file) that outputs an appropriate message and exits with an error status.

The Symbol Table

The string pool maps string handles to the text of the corresponding strings, offering no operations other than accumulating text for a string, comparing strings, and printing string text to a file. We want to use this to create a mechanism for quickly searching the set of strings to find whether a string is new. We do this by layering a new abstraction over the strings -- we will call this new layer the symbol table.

The symbol table associates symbol handles with symbols in exactly the same way that the string pool associates string handles with strings. The difference is that the symbol table guarantees that the handle it offers for each symbol will uniquely identify that symbol. The string pool is happy to accumulate multiple copies of a string, but the symbol table layer prevents this, guaranteeing uniqueness.

The easiest way to build a symbol table is to use an array of string handles. After entering the candidate string into the string pool, the symbol table can do a linear search through this array, using string_eq() to compare the new string with each of the strings already in the pool. If the new string is not found, the new string is added to the pool with string_accept(). If the new string is found in the table, string_reject() is called to discard the new string from the string pool.

Of course, as a rule of thumb, linear search is unlikely to be a reasonable choice when there are more than about 10 or 20 items in the list being searched. (The overhead of more sophisticated search algorithms makes linear search a good choice when the list is short.) In our language, the set of symbols pre-loaded in the symbol table will include all the predefined identifiers and keywords, and there are well over 20 of these. Therefore, we want a better search algorithm. The obvious choice is a hashed search. Here is an interface specification for the symbol table based on this:

// symboltable.h
//   user interface specifications for the symbol table.

// users of this header file should first include
//   <stdio.h>
//   <stdint.h>
//   stringpool.h
// users of this header file should define (probably from a global
// configuration file)
//   SYMBOL_SIZE -- the maximum number of symbols
//       values on the order of POOL_SIZE/6 make good sense
//   SYMBOL_HASH -- used in the hash function
//       must be less than and relatively prime to SYMBOL_SIZE
//       wrong values degrade performance but do not break anything
// when this file is used from symboltable.c (but nowhere else)
//   the user must first define EXTERN as nothing or blank

#ifndef EXTERN
	#define EXTERN extern
#endif

typedef uint32_t symbol_handle;
// the symbol handle type.  C and C++ don't let us be
// more specific, but it is actually values between
// 0 and SYMBOL_SIZE-1, inclusive.

EXTERN string_handle _symbol_table[ SYMBOL_SIZE ];
// a private definition:  Nobody outside this package should use this

void symbol_init();
// initializer for the symbol table package
// -- a real function, it must set all entries in the symbol table

EXTERN symbol_handle _symbol_hash;
// the hash code of the symbol currently being accumulated
// a private definition:  Nobody outside this package should use this

EXTERN string_handle _symbol_string;
// the tentative string handle for the symbol being accumulated.
// a private definition:  Nobody outside this package should use this

//void symbol_start();
#define symbol_start() { _symbol_hash = 0; _symbol_string = string_start(); }
// setup to accumulate one string into the symbol table

//void symbol_append( char ch )
#define symbol_append(ch) {                            \
        _symbol_hash = ((_symbol_hash * SYMBOL_HASH )  \
                        + (ch)) % SYMBOL_SIZE;         \
        string_append(ch);                             \
}
// append one character to the symbol currently being accumulated

symbol_handle symbol_lookup();
// finish the lookup of the symbol currently being accumulated
// this looks it up in the hashed symbol table and returns its handle
// if the symbol is new, it is added to the table and retained in the pool

// note:
// to add a symbol to the table
//   symbol_start()
//   for each character ch in the string { symbol_append(ch) }
//   handle = symbol_lookup()

//void symbol_put( symbol_handle s, FILE * f );
#define symbol_put( s, f ) { string_put( _symbol_table[ s ], f ); }
// output the symbol to the human-readable file

/*BUG*/ // the code generator will need a way to put symbols into object files
/*BUG*/ // perhaps symbol_chars(s,thunk()) calls string_chars().

#undef EXTERN

The symbol append function given above computes a hash function incrementally as characters are appended to the symbol. The function it uses is related to multiplicative congruence pseudo-random number generators. It is important to note that hashed searches work even if the hash function is totally broken, and the performance of a hashed search is pretty good for even marginally well thought-out hash functions. So long as the hash table is under about 90% full, even a mediocre hash function can give near constant-time searches.

The actual symbol-table search is done by the lookup routine. Here is some code:

//symboltable.c
//  implementation of the symbol table
//  (although much of it was implemented in the header file)

#include <stdio.h>
#include <stdint.h>

#include "config.h"
#include "stringpool.h"

#define EXTERN
#include "symboltable.h"

void symbol_init() {
	/*BUG*/ // missing body -- must set every table entry to STRING_NULL
}

symbol_handle symbol_lookup() {
	symbol_handle place = _symbol_hash;
	for (;;) { /* loop exits by embedded returns */
		if (_symbol_table[ place ] == STRING_NULL ) {
			// add symbol to table
			_symbol_table[ place ] = _symbol_string;
			string_accept();
			return place;
		}
		if (string_eq(_symbol_table[ place ], _symbol_string )) {
			// symbol is already in table
			string_reject();
			return place;
		}
		place = place + 1;
		if (place == SYMBOL_SIZE) place = 0;
		if (place == _symbol_hash) { /*BUG*/ } // table is full!
	}
}

Character Strings

There is good reason to store character strings in the string pool: To prevent common strings from being duplicated in memory. If someone is writing a "string heavy" application and uses the string constant "Help" over and over again, it would be nice if the compiler only stored this string once, both at compile time and in the resulting object code. Suppose the same program also uses the identifier Help. There is no damage at all if the identifier and the string constant share the same string pool entry.

But, there is another problem. The Kestrel language allows construction of string constants containing non-printing characters. For example, "This"LF"text"NUL"" is a string that includes an embedded linefeed and an embedded null. It is not one lexeme, it is a sequence of 5 lexemes, which are combined at the semantic level. Kestrel string lexemes never contain NUL characters, so NUL terminators can be used in the string pool, but string constants constructed by the semantic level may contain NUL characters. If we construct these higher-level strings in the string pool, we need to design the string pool without NUL-terminated strings.

In addition, this forces us to build some kind of concatenation tool that can build new strings by concatenating characters and other strings onto the string under construction. We can ignore this problem for a long time before coming back to create a mechanism for solving it.