Lecture 24, Static Semantics

Part of the notes for 22C:196:002 (CS:4908:0002)
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Blocks

A Kestrel block may contain a sequence of definitions for such things as constants, variables, types and functions. Some of these definitions may be public, visible from outside the block, while others may be private, visible only from inside.

The "syntactic sugar" of block declaration includes specific details about how the items declared within a block are marked as public or private, and given identity as types, variables or whatever. This is of no concern at a deeper level. All that matters is that the components of the block are bindings between identifiers and their values, with each such binding marked as either public or private.

From a practical perspective, this suggests that the environment established by each block is represented, inside the compiler, by a list of such bindings. The details of this list structure do not depend on the "syntactic sugar", and indeed, that shallow syntax may be changed in many ways, so long as it is possible to mark each definition as either public or private, and so long as it is possible to define types, constants, variables, etc. Consider this block:

    this:  const 3;
    thing: type  this .. this + 5;
    private a: var thing;

The list of bindings established by this block might be given as:

(this,  public,  constant 3)
(thing, public,  type this .. this + 5)
(a,     private, var thing)

From the point of view of a syntax-driven parser, there is no need to expose the existence of this list to the outside world. Objects of type block simply present, to the world, the ability to look up an identifier in that block from within, where all components are visible, or from the outside, where only the public components are visible.
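As a minimal sketch of this interface, the following Python (the lecture's implementation language is C++, but the structure carries over directly) represents a block's environment as a list of bindings with inside and outside lookup operations. All names here are illustrative, not taken from any actual Kestrel compiler:

```python
# Sketch: a block environment as a list of (name, visibility, attribute)
# bindings, with lookup-from-inside and lookup-from-outside operations.
from dataclasses import dataclass

PUBLIC, PRIVATE = "public", "private"

@dataclass
class Binding:
    name: str          # in a real compiler, a string-pool handle
    visibility: str    # PUBLIC or PRIVATE
    attribute: object  # constant value, type object, variable record, ...

class Environment:
    def __init__(self):
        self.bindings = []

    def define(self, name, visibility, attribute):
        self.bindings.append(Binding(name, visibility, attribute))

    def lookup_inside(self, name):
        """From inside the block, every binding is visible."""
        for b in self.bindings:
            if b.name == name:
                return b
        return None

    def lookup_outside(self, name):
        """From outside the block, only public bindings are visible."""
        for b in self.bindings:
            if b.name == name and b.visibility == PUBLIC:
                return b
        return None

# The example block from the text:
env = Environment()
env.define("this",  PUBLIC,  ("constant", 3))
env.define("thing", PUBLIC,  ("type", "this .. this + 5"))
env.define("a",     PRIVATE, ("var", "thing"))
```

Note that the private binding of a is found by lookup_inside but hidden from lookup_outside, which is exactly the behavior a block object must present to the rest of the compiler.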

In point of fact, however, the above list is wrong. Note, first, that the definition of a subrange type in the Kestrel grammar is as follows:

<subrange> ::= <expression> ".." <expression>

This implies that the minimum and maximum bounds on the values of type thing should be the values of the expressions, not the text of the expressions. Therefore, we really ought to rewrite our list of bindings as something like the following:

(this,  public,  constant 3)
(thing, public,  type(subrange, 3, 8))
(a,     private, var thing)

That is, the second binding in our example environment binds the name thing to the type (subrange, 3, 8), and this binding is marked as public. The semantics of Kestrel require that the expressions used to establish the bounds on subrange types be statically evaluatable -- they may not incorporate any terms that depend on variables. One can imagine relaxed versions of the semantics that permit variables here, but they would be far more difficult to implement.
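The static evaluation of the bound expressions can be sketched as a small recursive evaluator that walks an already-parsed expression and rejects anything that is not a compile-time constant. The tuple encoding of expressions here is a hypothetical stand-in for real parse-tree nodes:

```python
# Sketch: static evaluation of subrange bounds, assuming bound
# expressions are parsed into nested tuples such as
# ("+", ("name", "this"), ("num", 5)).  Illustrative names only.

def eval_const(expr, env):
    """Evaluate a compile-time constant expression; reject variables."""
    op = expr[0]
    if op == "num":
        return expr[1]
    if op == "name":
        kind, value = env[expr[1]]
        if kind != "constant":
            raise TypeError(expr[1] + " is not statically evaluatable")
        return value
    if op == "+":
        return eval_const(expr[1], env) + eval_const(expr[2], env)
    raise ValueError("unknown operator " + op)

# Evaluating the bounds of  thing: type this .. this + 5
env = {"this": ("constant", 3)}
low  = eval_const(("name", "this"), env)
high = eval_const(("+", ("name", "this"), ("num", 5)), env)
thing = ("subrange", low, high)
```

A binding to a variable would raise the TypeError above, which is where a compiler enforcing Kestrel's static-evaluation rule would issue its error message.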

Finally, there is also a problem with the third binding shown above. The binding of a as a variable of the type thing should not be a binding to the name of the type, implying that there must be a lookup operation every time the type of the variable is checked. Rather, it should be a binding directly to the type itself, yet another example of knitting together the branches of the parse tree, as described in the previous lecture.

As an aside, consider the following assignment, in the context of the above definitions:

a = 10;

Assuming that the compiler checks subrange conformance statically, at compile time, it should note that 10 exceeds the upper bound on the subrange type used for a, which was 8. If the (abstract) parse tree is fully knitted together, you might get an error message like "Error: 10 outside of 3 to 8." Retaining the type name in addition to the type to which it is bound would allow the error message "Error: 10 outside of range thing," or even "10 outside of thing, the range 3 to 8." It is not immediately obvious whether the longer error message is more helpful.
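The static conformance check itself is a few lines once the type carries its evaluated bounds; what varies is how much naming information is available when the message is formatted. A sketch, with hypothetical names throughout:

```python
# Sketch: static subrange-conformance check, showing how keeping the
# type name alongside the evaluated bounds changes the error message.

def check_assignment(value, type_name, subrange):
    """Return None if value fits the subrange, else an error message."""
    _, low, high = subrange
    if low <= value <= high:
        return None
    if type_name is None:
        # only the bounds survived the knitting of the parse tree
        return "Error: %d outside of %d to %d." % (value, low, high)
    # both the name and the bounds were retained
    return "Error: %d outside of %s, the range %d to %d." % (
        value, type_name, low, high)

thing = ("subrange", 3, 8)
```

Both variants depend on the bounds having been evaluated statically; the choice of message is purely a matter of what the binding retains.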

Here, we have emphasized only the most basic attributes of bindings, but once we start actually generating code, we care about additional attributes. Each type, for example, will have a storage-requirement attribute that gives the size, in "storage allocation units" (probably bytes), of objects of that type. Each variable will need an address relative to the base address associated with the variable's block.

Abstract Parse Trees vs. Concrete Parse Trees

The example of parsing blocks is just one example of the distinction between abstract syntax and concrete syntax. Another comes from the grammar for expressions. The concrete grammar for expressions has these rules:

<expression> ::= <comparand> [ <comparing operator> <comparand> ]
<comparand> ::= <term> { <adding operator> <term> }
<term> ::= <factor> { <multiplying operator> <factor> }

A brute-force approach to building a parse tree suggests that there ought to be four classes of tree nodes here, all subclasses of the syntactic category abstract class: one for expressions, one for comparands, one for terms and one for factors.

In fact, though, the distinction between these is merely a matter of operator precedence, and what we really want is a single class, call it expression if you want, that combines two items with an operator. The parser -- in charge of concrete syntax -- can have separate parse-expression, parse-comparand, parse-term and parse-factor routines to do recursive descent parsing, but if it produces a parse tree, the nodes in that tree should be of just one class, the abstract binary operator class, that combines two operands using any of the available operators.

If constructed in an object-oriented programming language such as C++, the natural thing to do is build the parse tree using the class expression, where the public interface to class expression is the expression::parse() method. The methods for parsing terms, factors and others can be private methods of the expression class, since the outside world does not know about these kinds of subexpressions. Any parse tree built would, after parsing, consist of just expression nodes with appropriate operators. The surface expression syntax is no longer present in such a tree, or rather, the topology of the tree preserves the surface syntax, not the types of the tree nodes.
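The idea can be sketched compactly in Python: three concrete-syntax routines, one abstract node class. Only "+" and "*" are handled, and the token stream is simply a list consumed left to right; everything here is illustrative:

```python
# Sketch: recursive descent with separate routines per precedence level,
# all building nodes of a single abstract binary-operator class.

class BinaryOp:
    """The one node class; precedence survives only in the tree topology."""
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right

def parse_expression(tokens):
    left = parse_term(tokens)
    while tokens and tokens[0] == "+":
        tokens.pop(0)
        left = BinaryOp("+", left, parse_term(tokens))
    return left

def parse_term(tokens):
    left = parse_factor(tokens)
    while tokens and tokens[0] == "*":
        tokens.pop(0)
        left = BinaryOp("*", left, parse_factor(tokens))
    return left

def parse_factor(tokens):
    return tokens.pop(0)   # an identifier or number token

tree = parse_expression(["a", "+", "b", "*", "c"])
```

For a + b * c the routines build a tree whose root is the "+" node with the "*" node below it; the comparand/term/factor distinction is gone from the node types, but the precedence it encoded is preserved in the tree's shape.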

Symbol Binding: Two Approaches

At every point in each block, there is a set of bindings between identifiers and their attributes -- constants, types, variables, etc. There are two basic ways to manage these bindings:

Search

First, whenever we need to find the binding of a symbol, we can walk the list of bindings. This is actually a two-dimensional walk, since we walk up the list of enclosing blocks, and for each block, we walk down the list of identifiers currently defined in that block. The comparisons made during this walk need not be string comparisons with the names of the symbols, they can be comparisons of the string handles, where the actual text of the symbol names is hidden away in the string pool.
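The two-dimensional walk can be sketched as an outer loop over enclosing blocks and an inner loop over each block's binding list, with integer handles standing in for symbol names (all names hypothetical):

```python
# Sketch of the search model: each block holds a list of
# (handle, attribute) pairs and a link to its enclosing block.

class Block:
    def __init__(self, enclosing=None):
        self.enclosing = enclosing
        self.bindings = []        # list of (handle, attribute) pairs

    def define(self, handle, attribute):
        self.bindings.append((handle, attribute))

def search(block, handle):
    while block is not None:            # walk out through enclosing blocks
        for h, attribute in block.bindings:
            if h == handle:             # integer comparison of handles,
                return attribute        # not a string comparison
        block = block.enclosing
    return None

outer = Block()
outer.define(1, "outer-x")
inner = Block(enclosing=outer)
inner.define(2, "inner-y")
```

Because the handles index into the string pool, the inner comparison is a single integer compare no matter how long the identifiers are.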

Lookup

Second, we can maintain a single symbol table, associating each symbol in the program with its current binding. If, for example, the string pool uses an internal hash table, and exports string handles that are indices into that hash table, the semantic processing component for symbol bindings can maintain an array of current symbol bindings that has the same dimension as the hash table, using the symbol handles as indices into that table whenever it looks up the binding of a symbol.
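The save-and-restore cost mentioned below can be made concrete with a sketch in which each table entry is a push-down list of bindings; entering a block pushes new bindings, and leaving it pops everything that block defined (names and table size are hypothetical):

```python
# Sketch of the lookup model: one table indexed by string handle, where
# each entry is a push-down list of bindings for that symbol.

POOL_SIZE = 16                       # dimension of the hashed string pool
table = [[] for _ in range(POOL_SIZE)]

def define(handle, attribute, block_log):
    table[handle].append(attribute)  # shadow any outer binding
    block_log.append(handle)         # remember what to pop on block exit

def lookup(handle):
    stack = table[handle]
    return stack[-1] if stack else None

def leave_block(block_log):
    for handle in block_log:
        table[handle].pop()          # restore the outer binding, if any

outer_log = []
define(1, "outer-x", outer_log)

inner_log = []
define(1, "inner-x", inner_log)      # shadows the outer x

shadowed = lookup(1)                 # the inner binding, while inside
leave_block(inner_log)
restored = lookup(1)                 # the outer binding, after leaving
```

Each lookup is a constant-time array index, but every definition and every block exit pays for the push-down bookkeeping.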

In thinking about these two alternatives, note that it is easier to see the correctness of the search model, since this tends to match the intuitive models used in the official programming language semantics. On the other hand, linear searches are potentially expensive.

In contrast, the lookup model requires a complex mapping from the official programming language semantics to the implementation, and this offers numerous opportunities for error. At the same time, while the lookup model may be faster for each individual identifier referenced, it requires a complex computation whenever a new binding is added to a block in order to save that binding, and a second computation on exit from each block to restore all the bindings that were saved because of definitions in that block.

Adding Complexity

Note that there are actually three reasons we need to look up a symbol:

First, we can look up a symbol at the time it is being defined, in order to see if it is already defined in the current block, for the purpose of detecting and complaining about multiple symbol definitions. For example, if the code i: const x.f is encountered, we have to look up i to see if it already has a binding in the current block. If it does, then this binding is illegal.

Second, when we find an identifier at the head of a reference, we need to look it up in this block or any enclosing block. For example, in the code i: const x.f, we must look up x in the current block, and if it is not defined there, we look in the enclosing block, and so on, all the way out to the global definitions. Public and private bindings are visible in this search.

Third, when we find an identifier within a reference where it is used to refer to a field of an object, we must look it up in the list of public fields of that object, given by the list of public bindings in the block from which the object was instantiated. For example, in the code i: const x.f, we must look up f in the block that x names an instance of. Only public bindings are visible from the point of view of this lookup.

All of these variant lookup tools can be built on top of a linear search of the contents of one block -- the basis of the first case listed above. The second case encloses this search in a loop that steps out through enclosing blocks, while the third case filters the result of the search to eliminate private components.
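The layering of the three cases on one primitive can be sketched directly; search_block is the primitive, and the other two wrap or filter it (all names illustrative):

```python
# Sketch: three lookup variants built on a linear search of one block.
# Bindings are (handle, visibility, attribute) triples.

PUBLIC, PRIVATE = "public", "private"

class Block:
    def __init__(self, enclosing=None):
        self.enclosing = enclosing
        self.bindings = []

    def define(self, handle, visibility, attribute):
        self.bindings.append((handle, visibility, attribute))

def search_block(block, handle):
    """Case 1: linear search of one block (detects redefinition)."""
    for binding in block.bindings:
        if binding[0] == handle:
            return binding
    return None

def search_enclosing(block, handle):
    """Case 2: wrap case 1 in a loop over enclosing blocks."""
    while block is not None:
        found = search_block(block, handle)
        if found is not None:
            return found
        block = block.enclosing
    return None

def search_public(block, handle):
    """Case 3: filter case 1 down to public bindings (field lookup x.f)."""
    found = search_block(block, handle)
    if found is not None and found[1] == PUBLIC:
        return found
    return None

g = Block()
g.define(1, PUBLIC, "global-x")
b = Block(enclosing=g)
b.define(2, PRIVATE, "local-a")
b.define(3, PUBLIC, "field-f")
```

The private binding of handle 2 is visible to cases 1 and 2 but filtered out by case 3, matching the visibility rules described above.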

In contrast, using a global symbol table only addresses the second case, and it does so at significant expense: Each time a new binding is entered in the symbol table, the previous binding of that symbol, if any, must be saved so that it can be restored at the end of the block. In effect, the symbol table holds bindings between identifiers and push-down lists of attributes. At the end of each block, all bindings established by that block must be popped. This, in effect, requires that the entire data structure needed for the linear search be retained.

In sum, the linear search approach to maintaining bindings is the one that defines the semantics of the language, while the use of a global symbol table is an optimization. Other optimizations are possible, for example, instead of a linear search, the set of bindings for each block could be maintained as a binary search tree. Note that this is not a binary search tree of strings, but rather, a binary search tree of indices into the hashed string pool. As such, comparisons in this tree (or comparisons in the linear search) are fast -- they are not string comparisons, just integer comparisons.

A general rule with optimizations applies here: First get things to work, then, only if it matters, optimize. One reason it is important to have a good modular abstraction that isolates the details of the search from the user is that it enables this kind of later optimization.