Lecture 17, Blocks

Part of the notes for 22C:196:003 (CS:4908:0003)
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Symbol Accumulation

The interface to the string pool was discussed. Specifically, why do we propose the following stringpool.h file (abridged):

//stringpool.h

typedef uint32_t string_handle;

void string_start();
void string_append( char ch );
string_handle string_end();

string_handle string_lookup( char * s );

char * string_put( string_handle h, FILE * f );

In looking at this interface, it appears that string_lookup() is redundant. this is intentional. In fact, string_lookup() is trivial:

string_handle string_lookup( char * s ) {
	string_start();
	while ( *s != NUL ) string_append( *s );
	return string_end();
}

Why? As it turns out, we definitely need string_lookup() to put predefined strings into the string pool, both identifiers and keywords. So, for example, somewhere in the initializer, there will be code something like this:

	key_if_handle = string_lookup( "if" );
	key_else_handle = string_lookup( "else" );
	key_while_handle = string_lookup( "while" );

This is probably not the end of the story for keywords, but the point is, keywords are easiest to recognize if the lexical analyzer first sees them as identifiers, and then, after collecting the text of the identifier, asks if that text matches one of the keyword handles. To do that, all of the keywords need to be entered into the string pool as part of the initialization for the lexical analyzer, and this requires something like string_lookup() to enter constant strings from the program text into the string pool data structure.

Why the low level interface then? The answer lies in the notion of information hiding in the interface between the symbol table and the lexical analyzer.

Information Hiding

Information hiding is a standard technique in software engineering. If the border between two software modules, for example, the border between the lexical analyzer and the string pool. The more the interface between two components exposes the inner structure of one component to the other, the harder it becomes to make design changes to either.

For example if we require the lexical analyzer to present strings to the string pool as null-terminated C strings, we do two things: First, we force the lexical analyzer to accumulate the text of strings, probably in some kind of internal string variable. Second, we strongly suggest to the string pool that all strings should be null terminated -- precluding the inclusion of null characters in the text of a string.

By placing all string accumulation in the string pool, and having the lexical analyzer merely indicate the start and end of each string, passing the characters of the new string in between, we allow the string pool to completely control the storage of strings, deciding for itself how to store them.

Mechanism

Typical string pools store the text of individual strings compactly, adding each new string to the end of a single long array of characters. In a more mature string pool, this array might be broken into large segments, with a new segment allocated from the heap every time the previous segment fills. This changes the nature of a string handle, but if information hiding is taken seriously, the users of strings never need to know the nature of a string handle, just that string handles are one legal type for the value of a lexeme -- the type that applies specifically to lexemes of type identifier (and probably also string).

Given this, string_start() merely clears the accumulator for computing the hash of a string and sets the string-start pointer to the first unused character position at the end of the string pool.

Then, string_append(ch) adds the character ch to the end of the string pool and accumulates it into the hash function. If, in the end, the string does turn out to be a new one, by the time that is discovered, it will already be in the string pool.

And, string_end() looks up the hash of the string in the hash table to compute the string handle for the newly accumulated string. If the new string is already in the hash table, the end of the string pool is reset to the former start of the newly accumulated string (erasing it from the end of the string pool). If, on the other hand, the new string is new, it is already in the pool, so only the hash table needs updating.

Macros

Note that, aside from the calls in string_lookup(), the interface routines string_start(), string_append() and string_end() will probably only be called from the code in the lexical analyzer to recognize an identifier and to recognize a string. Furthermore, the code for these routines will typically be small. This suggests that these routines can be packaged as macros instead of as subroutines.

The way this is typically done in C and C++ is to use the #define mechanism of the preprocessor. Instead of header file entries like this:

void string_start();
void string_append( char ch );
string_handle string_end();

We might write something like this:

///////////////////////////////
// hash function parameters
#define __HASHX 15
#define __HASHY POOLSIZE
// note that HASHX must be relatively prime to HASHY

///////////////////////////////
// private variables of the string pool defined here only
// because the interface functions are macros

char __stringpool[ POOLSIZE ]; // the string pool
int  __poolpos; // the current end of the string pool
int  __newsym;  // the start of the new symbol being added to the pool

///////////////////////////////
// Interface functions converted to macros

#define string_start() {                                \
        __hash = 0;                                     \
        __newsym = __poolpos;                           \
}

void string_append( char ch ); {                        \
        __hash = ((__hash * __HASHX) + ch) % __HASHY;   \
        __stringpool[__poolpos] = ch;                   \
        __poolpos++;					\
}

string_handle string_end();

In C and C++, #define macros may extend over more than one line if the newlines in the macro body are prefixed by backslashes.

In C and C++, there is a convention -- merely a convention -- that identifiers used in a header file starting with a double underscore are private to the package and may not be used outside that package. Thus, even though every user of stringpool.h is given definitons for __stringpool, this may not be used except within the string pool package.

Users of the string pool need not know whether the string pool functions are functions or macros. In fact, the entire mechanism may be developed with conventional functions, and then, late in the process, they can be converted to macros to improve the performance of the compiler.