Chapter 3, Symbol Tables

From Abstraction to Efficient Implementation

Part of the notes for 22C:50, Introduction to System Software
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

What is a Symbol Table?

Most computer programs which interact with humans have at least some places where symbolic input is allowed. For example, job control languages or command shells allow the use of commands such as "ls", "cat" or "exit" along with symbolic program and file names, and programming languages allow symbolic identifiers and key words. For a program to process such input, it must have some way to associate such symbolic names with their abstract values. Mechanisms which maintain such associations are called symbol tables.

The problems of constructing and managing a symbol table are relatively independent of the nature of the values which it associates with symbolic names, and these problems remain essentially the same whether names are associated with integer memory addresses, as they are in a simple assembly language, or with data types, storage locations, and other information needed by a high level language compiler.

In Figure 2.17, we assumed that there was some kind of object, st, the symbol table, containing the routines define and lookup for adding a definition to the symbol table and for looking up a definition in the table. These procedures were used in order to isolate the code of the example assembler from the internal details of the symbol table itself. The user of the procedures define and lookup thinks only in terms of the abstract concept of a symbol table, without reference to how it is implemented. Similarly, the author of an implementation of define and lookup need only implement this abstraction, without reference to how the symbol table will be used. The abstract view of what a symbol table does may be informally defined as shown in Figure 3.1.

define(s,v)
After calling define with arguments s and v, future calls to lookup(s) will return v; lookup(y) for all y not equal to s will be the same as it would have been if define(s,v) had not been called.

lookup(s)
The effect of calling lookup is completely defined above, except that, initially, all calls to lookup return a distinguished value indicating that the symbol s is undefined.

Figure 3.1. An informal definition of a symbol table.

Any data storage component of a program which may be specified without reference to how it is implemented is called an abstract data type. In most common programming languages, programmers view the predefined data types such as integer and real as abstract data types, using them without reference to the number of bits or the representation system used for values of these types. When a user defines a new type, the user must give a concrete representation for the type, but modern programming methodologies encourage the user to isolate that representation from its uses, allowing the user to use the new type abstractly.

Modern programming languages such as Java, C++, Ada and Modula 2 provide facilities which allow the explicit declaration and use of abstract data types, but even when explicit declaration is not possible, it is always possible to hide the data structures representing an abstraction behind a wall of procedures and functions. In any case, a clear understanding of the abstract data types involved is one of the most powerful tools for modularizing large programs.

As with the syntactic definitions given in the previous chapter, abstract data types can be defined formally as well as informally. Although a number of formal definition techniques exist for abstract data types, only one will be illustrated here, the algebraic method. Using the algebraic definition method, the first step in a formal definition of an abstraction is to define the sets of values to be used; these are given in Figure 3.2 for symbol tables.

T  --  The set of symbol tables
V  --  The set of values
S  --  The set of symbols

Figure 3.2. Sets needed to define a symbol table.

Note that, from the point of view of this definition, members of the set T in Figure 3.2 are values taken on by symbol table variables. Thus, define must be treated as a function which, when given a symbol, a value, and a symbol table, returns that symbol table augmented with the new definition. Here, we use the italic function name define for the abstract function implemented by the concrete procedure define. The abstract function is a pure function, with no side effects, so any updates it performs to the representation of an object of type symbol table are represented by explicitly passing the old object as a parameter while expecting the function to return the updated version as a result. In the assembler example used here, in contrast, there is only one object of type symbol table, so our define operation may implicitly operate on this object without the need to name it as a parameter.

Following the same reasoning, our formal model of lookup is a function lookup that takes both a symbol table and a symbol as arguments and returns a value. The argument and result types of these functions are formally specified in Figure 3.3, along with two constants (sometimes considered to be constant valued functions of no arguments) which represent the initial symbol table value and the error value:

error : --> V
empty : --> T
define : T x S x V --> T
lookup : T x S --> V

Figure 3.3. Functions on the sets in Figure 3.2.

Formally, the last line in Figure 3.3 says that the function lookup maps an element of the set T and an element of the set S to an element of the set V.

The algebraic definition is completed by giving equations which define the effect of applying these functions as shown in Figure 3.4.

lookup(empty,s) = error
looking up any symbol s in an empty symbol table gives an error result.

lookup(define(t,s,v),s) = v
looking up any symbol s in the symbol table resulting from a definition of that symbol returns the corresponding value.

lookup(define(t,s',v),s) = lookup(t,s)
looking up any symbol s in the symbol table resulting from the definition of some other symbol s' returns the same value as would be returned by looking up the symbol s in the symbol table to which s' was added.

Figure 3.4. Equations defining the actions on a symbol table.

Unlike the situation with formal syntactic definitions, where the systematic derivation of a parser from a grammar is possible, there is no simple and general way to derive economical implementations of an abstract data type from its formal definition. Automatic derivation of uneconomical implementations, however, is reasonably easy. For example, the definition given in Figure 3.4 suggests using an expandable table. The define function would enter new definitions at the end of the table, while the lookup function would simply perform a linear search for the desired symbol, starting from the end.
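
For example, the definition in Figure 3.4 leads almost mechanically to the following C sketch, where the names pairtab, top, NAIVELIM and UNDEFINED are assumptions made here for illustration, and where overflow of the table is simply ignored:

/* a deliberately naive symbol table derived from Figure 3.4 */
#include <string.h>

#define NAIVELIM 1000       /* assumed capacity; overflow is ignored */
#define UNDEFINED (-1)      /* assumed distinguished error value */

static struct {
    char *symbol;           /* the caller's string is not copied */
    int value;
} pairtab[NAIVELIM];

static int top = 0;         /* number of pairs defined so far */

void define( char *s, int v )
{
    pairtab[top].symbol = s;   /* append; old entries never change */
    pairtab[top].value = v;
    top = top + 1;
}

int lookup( char *s )
{
    int i;
    for (i = top - 1; i >= 0; i--)   /* newest definition wins */
        if (strcmp( pairtab[i].symbol, s ) == 0)
            return pairtab[i].value;
    return UNDEFINED;
}

Because lookup scans from the most recent entry back, a redefinition shadows any older entry for the same symbol, exactly as the equations in Figure 3.4 require.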

As the definition of the term abstract data type suggests, there are many ways of implementing these specifications; the following sections illustrate a variety of implementations for symbol tables. In these sections, it will be assumed that there is only one symbol table in use, and thus, that there is no need to pass the table itself as a parameter to the define and lookup routines. It will also be assumed that symbols are externally represented as strings with a large unspecified maximum length. The values associated with symbols by the symbol table will be unspecified, although it is convenient to think of them as integers. In fact, for an assembler, the values will typically be records which contain both an integer value and various flags indicating whether that value is defined, undefined or multiply defined, and perhaps whether the value is relocatable.
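
For concreteness, such a value record might be sketched in C as follows; the particular field and type names are illustrative assumptions only:

/* one possible rendering of the values stored by an assembler's
   symbol table; the field names are assumptions made here */
typedef struct {
    int value;                            /* the integer value */
    enum { UNDEF, DEF, MULTIDEF } status; /* defined, undefined,
                                             or multiply defined */
    int relocatable;                      /* nonzero if relocatable */
} valtype;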

A Simple Implementation

A simple way to represent a symbol table is as an unsorted array of symbol-value pairs. In such a symbol table, there must be some way of telling whether an entry has previously been used. Solutions to this problem include using a flag in each entry, or a distinctive string value such as the null string or blank. The solution used in this example relies on an auxiliary counter indicating how many slots are full. When a symbol is defined, a linear search is used to see if there is a previous definition; a similar search is needed to implement the lookup function. Figure 3.5 illustrates the resulting data structure, as it would appear in the computer's memory, assuming that each value is 16 bits, each symbol is limited to 6 characters, and there are at most 8 table entries. In the example, entry 4 associates the value 5 with the symbol "NAME".

           0   1   2   3   4   5   6   7    byte number
         |___|___|___|___|___|___|___|___|  within each
 table 0 |_______________________|_______|  table entry
       1 |_______________________|_______|
       2 |_______________________|_______|
       3 |_______________________|_______|
       4 |_N___A___M___E_________|___5___|
       5 |_______________________|_______|
       6 |_______________________|_______|
       7 |_______________________|_______|
                  symbol           value

Figure 3.5. A simple symbol table representation in memory.

The concrete representation given in Figure 3.5 can be easily declared in programming languages such as Pascal and C (or C++), as is illustrated in Figure 3.6.

{ in Pascal }
var table: array [0..7] of record
               symbol: packed array [0..5] of char;
               value:  0..65535;
           end;

/* in C */
struct {
    char symbol[6];
    unsigned short int value;
} table[8];

Figure 3.6. Pascal and C declarations for Figure 3.5.

A common characteristic of most languages used for system programming is that they allow fairly precise control of the representation of an object. Pascal and C (including C++) are both moderately good in this regard, but far from excellent. Java, on the other hand, cannot be used to declare anything resembling the data structure illustrated in Figure 3.5! (The reason for this is an excellent test of your understanding of the implementation of Java!)

Of course, it would be foolish to constrain our high-level language implementation to a particular number of array entries or a particular type for the value of a symbol. To avoid this, any civilized programming language should allow the definitions to be parameterized. Figure 3.7 illustrates a far more appropriate set of definitions:

-- in Ada
type SymType is array (0 .. Symlim) of Character;

type EntryType is
  record
    Symbol: SymType;
    Value:  ValueType;
  end record;

Table: array (0 .. Tablim) of EntryType;

/* in C */
struct {
    char symbol[symsize];
    valtype value;
} table[tabsize];

Figure 3.7. Ada and C declarations using named bounds and types.

The declarations in Figure 3.7 are parameterized in the sense that the array bounds and component types are symbolic names of items declared elsewhere in the program. Ada includes a more formal notion of parameterized types, where actual parameters must be provided prior to any instantiation of the type. We will not delve into this kind of feature here!
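
In C, the names used in Figure 3.7 might be introduced by definitions such as the following; the particular values are arbitrary assumptions:

/* in C -- assumed definitions for the names used in Figure 3.7 */
#define symsize 32              /* maximum characters per symbol */
#define tabsize 100             /* maximum number of table entries */

typedef unsigned short valtype; /* or whatever the values require */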

Figure 3.8 gives a complete declaration of the symbol table, including the access methods for the abstraction, in Pascal.

type symtype = array [0..symlim] of char;

var table: array [0..tablim] of record
               symbol: symtype;
               value: valtype;
           end;

    filled: integer { initially zero };

procedure define(s: symtype; v: valtype);
var i: 0 .. tablim;
begin
     i := 0;
     while (i < tablim)
       and (i < filled)
       and (table[i].symbol <> s)
         do i := i + 1;
     if table[i].symbol = s then begin
          { redefinition }
          table[i].value := v;
     end else if i = filled then begin
          { new definition }
          table[i].symbol := s;
          table[i].value := v;
          filled := filled + 1;
     end;
end {define};

function lookup(s: symtype): valtype;
var i: 0..tablim;
begin
     i := 0;
     lookup := undefined;
     while (i < tablim)
       and (i < filled)
       and (table[i].symbol <> s)
        do i := i + 1;
     if table[i].symbol = s
       then lookup := table[i].value;
end {lookup};

Figure 3.8. A simple symbol table implementation.

In Pascal, there is no return statement such as is found in languages descended from C; instead, assignment to the name of a function within that function is used to specify the return value. The other noteworthy feature of the above code, for those unfamiliar with Pascal, is the use of <> as a not-equals operator; also Pascal uses = as the comparison operator, while := means assignment.

The symbol table implementation discussed in Figures 3.5 to 3.8 has one important property which was not included in either the formal or informal definitions given previously: The table can fill up! Ignoring this possibility in the definitions of an abstract data type is common; in formal definitions, attention to such issues can easily obscure the normal behavior of the type in question. For example, representations of the abstract data type integer also have a finite capacity, yet programming language specifications almost never specify what will happen when this capacity is exceeded, other than to state that it is an error to do so. Just as better programming language implementations would be expected to detect errors involving integer overflow, better symbol table implementations are expected to detect such things as a full table!

Although it is more efficient than a mindless attempt to translate the specification given in Figure 3.4 into a working implementation, the implementation suggested by Figures 3.5 to 3.8 is not very efficient and should never be used except for very small tables. In this table, the time taken to search for a symbol which has never been defined is linearly proportional to the number of symbols already in the table. If a search is made for a randomly selected symbol already in the table, on average, half of the entries will be tested before the symbol is found. Thus, again, the time taken will be proportional to the number of elements in the table. In making such time comparisons, it is common to state the above by saying that the times taken are O(n); the abbreviation O(n) is read as "on the order of n, where n is the number of elements in the table." Better searching methods are known; for example, a sorted table can be searched in O(log n) time, and the best symbol table organizations can actually be searched in times which approximate O(1) (a constant time)!

String Pools

In addition to its poor performance, the symbol table organization suggested in Figures 3.5 to 3.8 has a second problem: it does not use memory very efficiently. The reason is that each entry in the table itself must be large enough to hold the longest possible symbol. This would pose no problem if all symbols were the same length, and fit exactly in the storage allocated for them, but such a situation is rare. Usually, there is some maximum symbol length, but the average length is considerably shorter. For example, in C, identifiers may be up to 31 characters in length, but few programmers use identifiers this long. (In fact, in C, identifiers may be any length, but the language standard only requires implementations to check the first 31 characters.)

If each slot in the symbol table can hold a symbol 31 characters long, while the average identifier is about 8 characters in length, fully 3/4 of the space allocated for string storage will be wasted by the time the table fills. Given a choice, most programmers would prefer a language processor that could use this wasted storage.

Why worry about wasted storage? In the 1960's, when computers with over 64K bytes of memory were extremely rare and memory cost a penny a bit, it made sense to shoehorn data tightly into memory. Today, when megabytes can be had for less than the cost of a night's lodging, does this make sense? Not always!

Prototype software, used to prove that something can be done, usually doesn't need this care, but the situation is quite different when speed is of the essence. Modern computer architectures rely on relatively small cache memories for their remarkable speed. Excessively large and ornate data structures that fill the cache with unneeded data can have a serious impact on performance! Therefore, efficient memory use has a significant impact on the speed of the application.

And, of course, size still matters for many applications. While RAM may be relatively inexpensive, we have many applications today that are able to fill RAM as quickly as we can buy it. Image and sound processing, full-text database searches, and many other applications need all the RAM they can get. If the data structures used in such an application are inefficient, these kinds of applications will be unable to operate on many existing computers and will be unable to coexist on others. Nobody appreciates the message "Insufficient memory, please close as many other applications as possible before trying again."

The oldest solution to this problem is to set the maximum symbol length so that it is only a few characters larger than the expected average. For example, many assembly languages from the 1960's and early 1970's allowed only 6 characters per symbol. This is annoying to programmers because it prevents the use of mnemonic names. A slight variation on this scheme is to allow symbols of any length, but to store only the first few characters. For example, early implementations of Pascal frequently ignored all but the first 8 characters of identifiers. In large programs, both of these solutions frequently force the use of preposterous non-mnemonic or overly abbreviated names such as IEF142I, PWRF$$, or RDTK$$. (The first of these comes from IBM's OS/360, the other two are from PR1MOS, a minicomputer operating system from the early 1980's.)

If all characters in a symbol are to be preserved without wasting space, some method of dynamically allocating just enough space for each symbol must be used. If we use dynamic storage allocation for string objects, as a typical naive C++ or Java programmer might, the overhead can be startlingly high! A string object in such a language will typically be allocated as shown in Figure 3.9.

 |___|___|___|___|    |___|___|___|___|
 |  handle     o-|--->| my class    o-|---> Class descriptor
 |_______________|    |_______________|
                      | size  |       |
                      |_______|       |
                      |         text  |
                      |    ___________|
                      |   |  padding  |
                      |___|___________|

Figure 3.9. A string object in C++ or Java.

Each C++ or Java string object is referenced by a handle, so an array of strings is implemented as an array of 32-bit pointers to the string objects. Each object in these languages is usually allocated starting in a new 32-bit word, so if the object does not completely fill some multiple of 32 bits, padding is added on the end. For string objects, this padding will most likely be, on average, about 16 bits. Every object in these languages always includes a pointer to its class descriptor, a record describing all of the attributes of the class; the cost of this is typically another 32 bits. Since string objects may be of any size up to 64K, each string object begins with a 16-bit size indicator. Thus, in languages such as Java, the total cost of a string is about 12 bytes in addition to the actual memory used to store the text! If the average identifier is 5 characters long, this means that we have an overhead of over 200% per identifier, on average!

Actually, in Java, the overhead is even higher! Each Java object "knows" its own name; this is typically implemented by including a pointer to the object name (another string) as part of the object. Each object also knows about the context in which it was declared, so each object includes a pointer to the declaring context. These two pointers add another 8 bytes of overhead.

This discussion also ignores the overhead of heap management. Dynamically allocated blocks of storage typically include, at minimum, a prefix indicating the size and status of the block. This typically adds at least 4 bytes, and more typically 8 to 12 bytes.

We can afford this kind of overhead in prototype software, but in production software, we would rather make more efficient use of memory, particularly when you consider that the standard header files used for programs in languages such as C and C++ define many hundreds of identifiers before any user-defined identifiers have been encountered.

Fortunately, there is a simple solution to this problem: the string pool. The basic idea is to store the text of all identifiers in compact form, packed into a single array of characters. If our programming language imposes overhead for the array, this is imposed only once, and we will pack a large number of identifiers (and eventually, other textual objects) into this string pool.

A string pool consists of a single array of characters large enough to hold all of the strings that a program must store. Each string in the pool is referenced by an integer giving the array index of the first character in the string. If the pool size is under 65536 bytes, a 16-bit integer is sufficient (a C or C++ short int, or a Pascal or Ada subrange type such as 0..65535). The actual entry in the symbol table is then just an integer index into the string pool, as illustrated in Figure 3.10; again, as in Figure 3.5, entry 4 holds an association for the symbol "NAME".

           0   1   2   3    byte number
         |___|___|___|___|  within each           |___|
 table 0 |_______|_______|  table entry    pool 0 |___|
       1 |_______|_______|                      1 |___|
       2 |_______|_______|                      2 |___|
       3 |_______|___                           3 |___|
       4 |______o|------------------------------->|_N_|
       5 |_______|_____                         5 |_A_|
       6 |_______|_______|                      . |_M_|
       7 |_______|_______|                      . |_E_|
       8 |_______|_______|                        |___|
           symbol  value                          |___|

Figure 3.10. A symbol table implemented using a string pool.

Note that nothing about the use of a string pool changes the way the table is searched! From a conceptual point of view, strings are still stored in the table itself. Thus, the conceptual operation "examine the string in entry X of the table" is implemented as "examine the string in the pool starting at the location pointed to by entry X of the table."
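
In C, for example, this conceptual operation might reduce to the following sketch; the helper name entrychar is an assumption made here, and the declarations of pool and table are those of Figures 3.12 and 3.14 below:

/* character i of the string conceptually stored in table entry x;
   assumes the pool array of Figure 3.12 and a table whose entries
   hold a pool index in a field named s, as in Figure 3.14 */
char entrychar( int x, int i )
{
    return pool[ table[x].s + i ];
}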

The string pool can be thought of as an abstract data type in its own right. For example, consider the informal definition given in Figure 3.11.

store(s)
-- A function returning an integer i such that fetch(i) will return s at all future times. This implies that fetch(store(s)) = s.

fetch(i)
-- A function returning a string when given an integer. If i was not returned by a previous call to store, fetch(i) will return a meaningless result.

Figure 3.11. An informal definition of a string pool.

If the possibility of string pool overflow is ignored, the code in Figure 3.12 outlines an appropriate implementation of a string pool in C.

char pool[poollim];	/* the string pool */

int pos = 0;		/* current position in string pool */

int store( symtype symbol )
/* add symbol to stringpool */
{
    int len, i;
    len = length(symbol);
    for (i = 0; i < len; i++) pool[pos + i] = symbol[i];
    pool[pos + len] = marker;
    i = pos;
    pos = pos + len + 1;
    return i;
} /* store */

int fetch( int pos, char * symref )
/* recover symbol from stringpool
   put it in the referenced place
   returns the length of the symbol */
{
    int len;
    len = 0;
    while (pool[pos] != marker) {
        *symref = pool[pos];
        symref++;
        pos++;
        len++;
    }
    return len;
} /* fetch */

Figure 3.12. A string pool implementation in C.

This C code had to deal with several problems imposed by the C (or C++) languages. First and foremost, C (and C++) functions cannot be relied on to return complex data types such as arrays or structures; they can only safely return simple types such as integers or pointers to complex objects. This restriction has been lifted to various degrees by more recent implementations of these languages, but it is not safe to rely on this! Since our goal is fast efficient code, we cannot afford to allocate something like a string object to hold the return value, so instead, we pass a pointer to the place where the text should be put and let fetch() copy the text to that position. This is not a very well thought out design. In fact, since most of our use of the string pool will involve comparisons with strings in the pool, we should rethink the user interface to this abstraction before using it!

The string pool routines given in Figure 3.12 use a marker, typically some otherwise illegal character such as null (0x00 in C or C++, chr(0) in Pascal) to mark the end of each symbol. If we only want this symbol table for use in an assembler, colon or any other punctuation mark could have been used, since these are not allowed in identifiers, but since they might be legal in other strings we might eventually store in the string pool, we don't use them! An alternate way of marking the end of a symbol would be to encode the length of the symbol before the first character of the symbol. For example, an 'A' could be used as a prefix for all symbols of length 1, a 'B' for symbols of length 2, etc. This has exactly the same storage overhead as the use of a marker, unless we encounter a large number of strings over 256 bytes long (unusual in programming languages), so there is little reason to prefer one scheme over the other.
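
For illustration, here is a sketch of a store routine using the length-prefix convention; it assumes the pool, pos, symtype and length of Figure 3.12, and symbols shorter than 256 characters:

int store( symtype symbol )
/* add symbol to the string pool, length-prefixed variant */
{
    int len, i;
    len = length(symbol);
    pool[pos] = (char) len;   /* raw count; the 'A', 'B' scheme in
                                 the text adds a printable offset */
    for (i = 0; i < len; i++) pool[pos + 1 + i] = symbol[i];
    i = pos;
    pos = pos + len + 1;
    return i;
} /* store */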

For the sake of computational efficiency, a third string pool access function is frequently introduced for comparing strings with pool entries. The reason for this is that comparison is a very common operation, needed in both the "define" and "lookup" symbol table access functions. The fetch function, as given in Figures 3.11 and 3.12, does not allow efficient comparison because it extracts a copy of the indicated string from the pool when the comparison could have been made without any copying operations. A typical compare function defined in terms of fetch is given in Figure 3.13.

function compare( s: symtype; i: integer ): boolean;
begin
     compare := ( fetch(i) = s );
end {compare};

Figure 3.13. A poor way to compare strings with pool entries.

An efficient implementation of compare would not use fetch; instead, it would compare each character of the string parameter s directly with the characters following position i in the pool. In general, efficient text processing software avoids copying strings or substrings into temporary containers, but instead, does as much of the examination of the text in place as is possible!
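
One possible in-place version, written in C under the marker convention of Figure 3.12 and assuming that symbols are null terminated strings, might look like this (compare Exercise 6 at the end of the chapter):

/* compare the null terminated symbol s with the pool entry
   starting at position i; returns nonzero if they are equal */
int compare( char * s, int i )
{
    while ((*s != 0) && (pool[i] != marker)) {
        if (*s != pool[i]) return 0;   /* mismatch */
        s++;
        i++;
    }
    /* equal only if both strings ended at the same place */
    return (*s == 0) && (pool[i] == marker);
}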

The versions of define and lookup given in Figure 3.8 share the same search loop, suggesting that this loop should itself be a subroutine. The symbol table implementation given in Figure 3.14 does this while introducing the use of the string pool.

var table: array [0..tablim] of record
                s: integer {string pool pointer};
                v: valtype;
           end;

    filled: integer { initially 0 };

function find(s: symtype): integer;
{ returns -1 only when the table is full}
var i: 0..tablim;
begin
     i := 0;
     while (i < tablim)
       and (i < filled)
       and not( compare( s, table[i].s ) )
         do i := i + 1;
     if i = filled then begin { new entry }
          table[i].s := store(s);
          filled := filled + 1;
          find := i;
     end else begin { old entry or full table }
          if compare( s, table[i].s )
              then find := i   { old entry }
              else find := -1; { table is full! }
     end;
end {find};

procedure define(s: symtype; v: valtype);
{ has no effect if the table is full }
var i: integer;
begin
     i := find(s);
     if i > -1 then table[i].v := v;
end {define};

function lookup(s: symtype): valtype;
{ returns undefined if s is not in table }
var i: -1..tablim;
begin
     i := find(s);
     if i > -1 then lookup := table[i].v
               else lookup := undefined;
end {lookup};

Figure 3.14. A linear search, string pool based symbol table.

In this symbol table implementation, note that define and lookup deal only with the value field of each symbol table entry, while find deals only with the other fields. This suggests, correctly, that there are two separable aspects of symbol table implementation: Table organization, which deals with how symbols are associated with specific slots in the table, and value storage, which deals with the nature of the values stored. The find function deals only with the table organization. As given in Figure 3.14, it implements a linear organization. Faster symbol table search techniques can be implemented with changes to find without any effect on the define and lookup routines.

Object Oriented Thinking in Older Languages

Classical programming languages such as C provide no obvious support for object oriented programming methodologies, but this does not mean that there is no appropriate way to use these methodologies in these languages. In C, for example, the separate compilation mechanisms of the language provide a very strong information hiding tool.

Each C compilation unit exports a collection of symbols that may be referenced from other compilation units, while other symbols are entirely private to the compilation unit. In C, by default, all global variables and all functions are exported, but any global declaration prefixed by the keyword static is private to the current compilation unit. Figure 3.15 gives an appropriate skeleton for a symbol table compilation unit in C:

/* symtab.c -- a symbol table */

#include "symtab.h"
#include "stringpool.h"

static struct {
	valtype value;
	int poolref;
} table[tabsize];

static int filled;

static int find( char* s )
{
 	...
}

void define ( ... )
{
	...
}

void lookup ( ... )
{
	...
}

Figure 3.15. A skeleton for a symbol table compilation unit in C.

In the skeleton shown in Figure 3.15, the include file symtab.h contains all of the definitions needed by both the users of the symbol table and the implementation. Definitions known only to the implementation are local to symtab.c; thus, the include file serves as an interface definition. The actual table and all other details of the implementation are declared to be static, so they are hidden from the outside world, while the two access functions, define and lookup, are visible. This code assumes that the string pool is similarly packaged.
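
For illustration, the interface file itself might be as small as the following sketch; the particular contents are assumptions consistent with Figure 3.15, including the guess that lookup returns a valtype:

/* symtab.h -- interface to the symbol table */

#define tabsize 100               /* capacity of the symbol table */

typedef unsigned short valtype;   /* or whatever the values require */

void define( char * s, valtype v );
valtype lookup( char * s );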

This definition includes one significant defect! The find() function takes a simple pointer to a character as a parameter. We cannot afford the overhead of passing a complex string object as a parameter, but we also cannot afford to pass just a pointer because we need some indication of the length of the string being defined! The solution to this dilemma is to be found in the lexical analyzer and how it represents lexemes on a source line. If, inside the lexical analyzer, the start and end of each lexeme is recorded as the analysis process proceeds, then we should probably pass precisely this information to find().

This illustrates an important point! While the final modularization of a program into components may well follow the initial outline, some knowledge of the representations used within one component may have a strong impact on the most efficient choice for the interface specification of another component. As a result, while the top-level division of a large-scale project into modules may be decided early on, the decision to freeze the interface specifications between modules should usually be deferred until after initial prototypes of each module have been designed! Of course, once the interfaces are frozen, these prototypes will need to be discarded or rewritten!

String Packing

Before examining more efficient symbol table search strategies, it is worth noting that, at some computational expense, text can be packed to use storage with about one third greater efficiency than can be had by storing text one character per byte. This is particularly important on very small computers where there is very little available memory, but it is also important in full-text retrieval applications. Text can be packed by taking advantage of the fact that only a fraction of the character set is allowed in identifiers, and furthermore, some letters are far more common than others.

Consider the 94 printing characters in the ASCII character set, also known as the ISO-7 character set; these are illustrated in Figure 3.16.

  ! " # $ % & ' ( ) * + , - . /
0 1 2 3 4 5 6 7 8 9 : ; < = > ?
@ A B C D E F G H I J K L M N O
P Q R S T U V W X Y Z [ \ ] ^ _
` a b c d e f g h i j k l m n o
p q r s t u v w x y z { | } ~

Figure 3.16. The printing ASCII characters.

Seven bits are required to encode these 94 characters plus 33 control characters and the space character. On most modern machines, each character is stored in an 8 bit byte for convenient manipulation. In most programming languages, only letters and digits are allowed in identifiers. The 52 upper and lower case letters, 10 digits, a punctuation mark such as underline or dot, and a marker used to terminate symbols make exactly 64 codes, which fit in 6 bits! Using only 6 bits per character packs 4 characters into each successive group of 3 bytes. The code to perform such packing and unpacking is left as an exercise!

An even more complex packing scheme is known as Radix 40 or Radix 50 notation (note that 40 in base 10 is 50 in base 8). The encoding shown in Figure 3.17 is one example of a Radix 40 character encoding:

0       = blank
1 - 26  = ABCDEFGHIJKLMNOPQRSTUVWXYZ
27      = $
28      = .
29      = :
30 - 39 = 0123456789

Figure 3.17. A Radix 40 encoding.

Note that exactly 40 different values are used to encode the allowed characters. Thus, it requires a little less than 6 bits to store one character. The language supported by this particular example treated both period and dollar sign as letters for the purpose of lexical analysis, and as with many languages, it considered upper and lower case equivalent, so there was no need to store case distinctions.

The idea of using fractional bits to store information is not new; for example, 10 bits can be used to represent values between 0 and 1023, allowing all combinations of 3 decimal digits to be stored in 10 bits even though storing each digit alone would require 4 bits. Given the functions rad40(ch) and ascii(n) which convert ASCII characters to radix 40 codes and the reverse, the radix 40 pack and unpack routines, for groups of 3 characters, can be written in C as shown in Figure 3.18.

int pack( char c1, char c2, char c3 )
{
     return ( rad40(c1)*40 + rad40(c2) )*40 + rad40(c3);
} /* pack */

void unpack( int i, char* c1, char* c2, char* c3 )
{
     *c1 = ascii(i / (40*40));
     *c2 = ascii((i / 40) % 40);
     *c3 = ascii(i % 40);
} /* unpack */

Figure 3.18. Radix 40 conversion routines.

Using these routines to pack 3 character groups into integers, the minimum packed value is pack(' ',' ',' ') which returns zero. The maximum is pack('9','9','9') which returns (39*40+39)*40+39, or 63999. Since this is less than 65536 (2^16), 3 characters can be packed into 16 bits or two bytes. This technique is most commonly used for fixed size symbols; for example, it allows symbols 6 characters long to be stored in a 32 bit word. The classical directory structure of the RT/11 operating system for the PDP-11, in which file names are 6 characters, followed by a dot and a 3 character extension, comes from the use of 3 16-bit words per file name with a radix 40 representation for the text of the name. Early versions of the Microsoft DOS file system were based on this, so the legacy of Radix 40 persists to this day! In principle, this scheme could be applied to string pool organization, but memory isn't that expensive except on the smallest of embedded systems!
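
As a sketch, a 6 character name such as FILENM can be packed into a single 32-bit word using the pack routine of Figure 3.18; the helper name below is an assumption made here:

/* pack the 6 character name "FILENM" into one 32-bit word,
   RT/11 style: 3 characters in each 16-bit half */
long packname( void )
{
    return ((long) pack( 'F', 'I', 'L' ) << 16)
         |  (long) pack( 'E', 'N', 'M' );
}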

Hashing

The remaining issue in symbol table implementation is search speed. All of the details of the search process have been isolated in the procedure find which was defined above. The linear search approach used in the original version of find takes O(n) time when operating on an n item table. As a result, the time taken to build a table with n defined entries ends up O(n²). These rough time estimates can be used as a guide in comparing different table organizations.

The most common approach to fast searching is to use a sorted table and a binary search. Binary search can be described as being based on the use of an initial approximation or guess at the location of the item in the table, with successive iterations of the search loop improving that approximation. Although binary search takes only O(log n) time, the cost of maintaining a sorted table is high, since the best sorting routines take O(n log n) time for an n entry table.

Binary search is not the answer to the symbol table problem, but it does give a clue: The answer will involve use of an initial approximation of the value of the find function. Note that find maps up to tablim+1 distinct symbols onto distinct integers in the range from 0 to tablim, and maps all other symbols to -1. An initial approximation, on the other hand, will map all possible symbols onto non-unique integers in the same range.

It turns out that almost any initial approximation will work; for example, binary search uses a very crude initial approximation, tablim/2, but it uses the fact that the table is sorted to allow a fast search. An alternative is to try to find a good initial approximation so that a simple search algorithm can be used to improve on it. The best initial approximation functions produce all possible values between 0 and tablim when they are given all possible symbols as input. Additionally, they have the property that they produce randomly distributed values in the range when they are given randomly selected subsets of the possible symbols. Because of this random behavior, the approximation functions used in symbol table searches are known as hash functions (to hash something up, in common English, means to scramble or mix or chop it up). Given any such hash function, and using a linear search from the initial guess, a reasonable first draft version of the find function could be coded as shown in Figure 3.19.

var table: array [0..tablim] of record
                s: integer {string pool pointer, initially -1};
                v: value;
           end;

function find(s: symtype): integer;
var i: 0..tablim;
    j: 0..tablim;
begin
     i := hash(s);
     j := (i + tablim) mod (tablim + 1);
     { it is now true that ((j + 1) mod (tablim + 1)) = i }
     { the following checks all table entries starting with hash(s) }
     while (i <> j)
       and (table[i].s >= 0)
       and not( compare( s, table[i].s ) )
         do i := (i + 1) mod (tablim + 1);
     if table[i].s < 0 then begin { new entry }
          table[i].s := store(s);
          find := i;
     end else if compare( s, table[i].s )
              then find := i    { old entry }
              else find := -1;  { table is full! }
end {find};

Figure 3.19. A hash based version of the find function.

Other approaches to organizing hashed symbol tables have been developed. These have primarily been motivated by a desire to avoid accidentally filling a cluster of consecutive table slots when, by accident, a large number of identifiers hash to the same slot. Since similar identifiers such as xab1, xab2, xab3 and so on are common in some large programs, this problem with clustering should be expected. When the hash of a symbol leads to a slot in the table that is already in use for a different symbol, the situation is called a collision. The version of the find function given in Figure 3.19 is said to resolve collisions with a linear search. The result is that symbols which hash to the same slot "overflow" into adjacent slots.

Clustering in the hash table raises the expense of linear search collision resolution because it results in an uneven distribution of vacant slots in the hash table. One approach to avoiding clustering is to introduce a second hash function which is used to determine where to start the linear search when there is a collision. Hopefully, this "rehash" will be different for each symbol which originally hashed to the same slot. Another approach is to maintain linked lists of items which hashed to the same slot in the hash table. This is called bucket hashing because each entry in the table in effect holds all strings which hashed to that location, without overflowing into other buckets. Bucket hashing simply changes the collision resolution problem to one of list management.
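
A minimal sketch of the data structures for bucket hashing, in C, might look like the following; the names bucket, hashtab and findbucket are assumptions made here, while hash and compare are the functions discussed elsewhere in this chapter:

/* bucket hashing: each slot of the hash table heads a linked
   list of the entries whose symbols hashed to that slot */
struct bucket {
    int poolref;              /* string pool index of the symbol */
    valtype value;
    struct bucket *next;      /* next entry with the same hash */
};

struct bucket *hashtab[tabsize];  /* all initially NULL */

struct bucket *findbucket( char *s )
{
    struct bucket *b;
    for (b = hashtab[hash(s)]; b != NULL; b = b->next)
        if (compare( s, b->poolref )) break;
    return b;   /* the entry for s, or NULL if s is absent */
}

Note that with this organization the table never fills in the sense of Figure 3.19; instead, performance degrades gradually as the lists grow longer.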

It is worth noting that hashed symbol tables, when used with a good hash function, tend to find most symbols on the first try, which is to say, the time taken is roughly constant, independent of the table size. Only for tables that are more than about 90 percent full do collisions become common enough that the search times become significant. An analytical treatment of hashing is much more difficult than similar treatments of searching sorted lists or of sorting; thus this 90 percent figure is only a rule of thumb and not a guarantee of performance.

The final issue in hashed symbol table construction is the choice of the hash function itself. This can play a major role in avoiding collisions if it produces large and almost random differences in its output for small changes in the input. This property is also a property of pseudo random number generators; in fact, the techniques for generating random numbers using computers are very similar to those given below for hashing. An obvious approach to hashing is to treat the entire symbol to be hashed as a positive integer, for example by using a very high precision Radix 40 representation. Given that this integer is S, two excellent hash functions are:

hash(S) = S mod (limit + 1)
Note that this is only a good hash function when (limit + 1) is relatively prime to (shares no common factors with) the radix of the symbol representation. Otherwise, it can ignore some characters in the symbol; for example, if symbols use a radix 40 representation and (limit + 1) is exactly 40, the hash value depends only on the last character of the symbol.

hash(S) = ((T*S) mod 1)*limit
Where T should be irrational. The inverse of the golden ratio (phi) is particularly appealing; the golden ratio is defined as the solution to phi = 1/(phi-1); phi = 1.61803...
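
For symbols that fit in an integer, the second function might be sketched in C as follows, using floating point arithmetic to approximate the high precision computation; the table is assumed to have slots 0 through tablim:

#include <math.h>

#define T 0.6180339887498949    /* 1/phi, the inverse golden ratio */

int hash( unsigned long s )
{
    /* take the fractional part of T*s and scale it to 0..tablim */
    return (int)( fmod( T * (double) s, 1.0 ) * (tablim + 1) );
}
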
The above hash functions have the disadvantage that they require high precision arithmetic for their computation. This is especially difficult on small computers where even simple arithmetic can be difficult, and it is actually undesirable even on many large computers because multiply and divide are relatively expensive instructions! An alternative approach is to use a function the value of which can be accumulated one character at a time. Such functions have the special advantage that the hash function can be computed by the lexical analyzer while the symbol is being scanned for the first time, but of course, this introduces yet another area where the interface between the lexical analyzer and the symbol table might need to be modified to accommodate this performance improvement.

The CRC (cyclic redundancy check) function used for computing checksums in data transmission is particularly suited to this task, but other functions are easier to compute; for example, consider the hash function given in Figure 3.20.

int hash( char * s )
{
    int i;
    int j;
    j = 0;
    for (i = 0; i < length(s); i++) {
        j = ( 5*(j + s[i]) ) % (tablim + 1);
    }
    return j;
} /* hash */

Figure 3.20. A poor implementation of a decent hash function.

In the hash function given in Figure 3.20, almost any prime number will do where 5 was used, but it is important that the number should not divide evenly into the table size.

The implementation given in Figure 3.20 is not very good! Because the compiler does not know whether or not the length function has side effects, it calls length() once per iteration! Furthermore, the length function computes the string length by counting the characters until it finds an end-of-string mark, so the for loop in the figure might as well be rewritten as a while loop, exiting when it finds an end-of-string mark.

Multiplication by small constants such as 5 does not require use of a relatively slow multiply instruction. On many modern machines, the compiler translates 5*x to something like x+(x<<2); this shift-and-add sequence may take 2 instructions instead of one, but these simple instructions can usually be executed faster than a multiply instruction.

Terminology

The reader should be familiar with the following general terms after reading this section.

symbol tables                   radix 40
abstract data types             binary search
access functions                hashing
linear search                   collisions
string pools                    collision resolution

Additionally, the reader should be familiar with the symbol table and string pool as examples of abstract data types.

Exercises

  1. Give informal definitions, in the style of Figure 3.1, of the following abstract data types:

    a) FIFO queues, with the operations enqueue and dequeue.

    b) Stacks, with the operations push and pop.

    c) Priority queues, with the operations enqueue and dequeue.

    d) Files of records, with the operations read, write, reset, and rewrite.

  2. The comments after Figure 3.6 state that there is no Java declaration that produces a data structure comparable to that shown in Figure 3.5. Why? If a Java program contains an array of 8 records, where each record contains a 6 character string and a small integer, what actual data structure goes into memory? Describe this data structure using a notation comparable to that used in Figures 3.5, 3.9, and 3.10. Don't draw the entire structure.

  3. Rewrite the define routine from Figure 3.8 so that it fails gracefully if the symbol table fills up. You may recode in the language of your choice, and use any approach you prefer for reporting the failure.

  4. Measure the actual memory required for a string object in an object-oriented language such as Java or C++. Consider the following methodology. Create an array of thousands of string objects, each holding one character. Measure the storage required by your program before and after this array is initialized (under almost every operating system, there is a command for looking at the resource utilization of a running program).

  5. Rewrite the store function from Figure 3.12 so that it fails gracefully if the string pool fills up. You may recode in the language of your choice, and use any approach you prefer for reporting the failure.

  6. Rewrite the code for "compare" from Figure 3.13 in an efficient form, that is, so that it does not make a copy of the string from the pool using fetch, but instead, compares the given symbol with the text in the pool character by character. For this purpose, assume that a symbol is an array of characters with a null terminator.

  7. Decode the following Radix 40 message, given as a sequence of decimal numbers, using the information given in Figures 3.17 and 3.18: 28844 15360 55628.

  8. Convert the following text to Radix 40 representation, and present your result as a sequence of decimal numbers, using the information given in Figures 3.17 and 3.18: ENCODE IT.

  9. Explain the weakness of the following simple hash function:
    hash(s) = s[1] mod (tablim + 1)
    
  10. a) Modify the lexical analyzer given in Figure 2.22 so that it computes the hash function given in Figure 3.20 for each identifier as it scans that identifier. The "value" field for newly scanned identifiers should be the value of this hash function. For integers, of course, this field would most likely be the integer value of the lexeme.

    b) Modify the result of part a of this problem by having the lexical analyzer directly call the "find" function from Figure 3.19 as the final step in the analysis of an identifier. As a result, the "value" field of a lexeme representing an identifier should end up giving the index of the identifier's slot in the symbol table. Note that the combination of this problem with Problems 7 and 8 of Chapter 2 gives a lexical analyzer quite similar to many in common use in production language processors.

    c) Criticize the result of your work on parts a and b from the point of view of object oriented methodology. What happened to the abstract interface between lexical analysis and the symbol table? Can you draw a clear line in the result?

  11. Rewrite the code from Figure 3.20 so that the for loop is replaced by a while loop searching for the end of string. For maximum efficiency, also try to eliminate the array subscript and replace it with autoincrement indirect addressing.

References

The classic reference for symbol table organization is

The Art of Computer Programming. Volume 3, Sorting and Searching by D. E. Knuth (Addison-Wesley, 1973).
This includes detailed discussions of binary search, hashing, and many related techniques; unfortunately, all of the examples are in the MIX assembly language. Another reasonable source of information on hashing is
Fundamentals of Computer Algorithms by E. Horowitz and S. Sahni (Computer Science Press, 1978).
Most data structures texts give at least some coverage to this subject.

Two papers by D. L. Parnas helped motivate many language implementors to include support for abstract data types in new languages. The first paper provides the inspiration (and more formal foundation) for the informal notation used in Figure 3.1 to specify the symbol table abstract data type.

A Technique for Software Module Specification with Examples, by D. L. Parnas, in Communications of the ACM, 15 5 (May, 1972).
The second paper provides good examples of how a problem can be divided into logical modules, many of which would later come to be called abstract data type implementations.
On the Criteria to be Used in Decomposing Systems into Modules, by D. L. Parnas, in Communications of the ACM, 15 12 (December, 1972).

This textbook provides nice summaries of the programming language features discussed in this chapter.

Programming Language Concepts by C. Ghezzi and M. Jazayeri (Wiley, 1982).

The best source for information on Ada is

ANSI/MIL-STD-1815A (22 Jan, 1983), the MILITARY STANDARD Ada Programming Language manual, published by the Ada Joint Program Office, OUSD(R&E), Washington D.C. 20301.
A number of Ada textbooks are also available; among the better ones is
Understanding Ada by G. Bray and D. Pokrass (Wiley, 1985).

The best source for information on C is

The C Programming Language, second edition by Brian W. Kernighan and Dennis Ritchie, published by Prentice Hall 1988.
This is a small book (only 272 pages) and it includes both a good tutorial (the first 189 pages) and a good reference manual. Kernighan and Ritchie are the original co-developers of C, and their text sets the standard for how C programs ought to look.