Lecture 23, Knitting the Parse Tree

Part of the notes for CS:4980:1
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

The parse tree revisited

Consider the following Kestrel code:

rt: type record
	f: ft
end;
et: type @ rt
at: type array it of et;
a:  at;

This artificial example declares the variable a of type at, where type at is an array type indexed by an index variable of type it and containing elements of type et. Type it is not defined here; presumably it is a subrange type that has been defined somewhere else, perhaps in an enclosing block. Type et, the type of the elements of array a is a pointer type, constrained to point to objects of type rt. Those objects are records with one field each, a variable field named f of type ft In the context of this example a[i]@.f means a reference to field f of the record referenced by element i of array a.

A somewhat simplified parse tree (perhaps we should call it an abstract syntax tree) for the block containing the declarations above might look like this:

Knitting the tree

This is just a parse tree, at this point. It does not express anything about the semantics of the language. In the previous lectures, we have discussed creating a special data structure, the environment, that serves to keep track of the meanings of declarations. There is another way of thinking of this that does not involve an auxiliary data structure, but uses a modified parse tree to track the variable bindings.

We modify the parse tree as follows: Wherever we find an identifier in the tree in a context where we need its meaning, we search up the tree for a declaraton of that identifier. To be more specific, we search for a block element that is a declaration that gives meaning to that identifier, and we link from the identifier to its declaration:

Assuming that ft and et refer to identifiers that were properly declared in some enclosing context, the links to ft and et in the above diagram must go up the tree to something declared outside this picture.

We can describe what we have done as knitting the parse tree. It is no longer strictly a tree, but we can now use the data structure to generate code without any searching of the tree during code generation becase all of the search targets were found during the construction of the data structure.

At this point, all of the new links we have added to the parse tree are simply annotations, but we can do better. In kestrel, anonymous types are allowed. For example, the Kestrel code we started with can be rewritten as this:

a:  var array it of @ record f: ft end

This use of anonymous types does not necessarily add to readability, but if some type is used in just one place, there is no need to give that type a name. As a result, from the point of view of a parse tree, the type definition itself can be substituted into the tree for any occurance of that type. If we take this into account when we knit the tree, we get the following result:

Here, note that the only occurance of any identifier in the tree is at the point where that identifier was defined. All references to that identifier have been replaced with references to the object that identifier was bound to. This may pose problems when generating error messages (sometimes, it is nice to name the type of an object when complaining about it), but it works for compiling.

Furthermore, this knit version of the parse tree greatly simplifies the type equivalence problem. Kestrel says that two types are the same if they rest on the same definition. Once we have knit the tree, two different variables have the same type if their type pointers point to the same type object in the tree.

Of course, type compatibility in Kestrel is more complex, because two different scalar types are compatible if they have the same base type, and pointer types may be compatible in a type hierarchy created by class-subclass relationships, but all of these are easier to work out if the parse tree has been knit before any questions about type compatibility are asked.

Relating the knit tree to the environment

In the previous notes, we wrote Block::compile() something like this:

Block * Block::compile( Environment * e ) {

        lex_wantinset( START_PUNCS, START_KEYS, START_LEXS, ER_WANT_BLOCK );
        while (lex_isinset( START_PUNCS, START_KEYS, START_LEXS )) {
                if ( (lex_this.type == IDENT)
                &&   lex_ispunc( lex_next, PT_COLON ) ) {
                        e = Declaration::compile( e );
                } else {
                        Statement * s = Statement::compile( e );
                }
                if (lex_ispunc( lex_this, PT_SEMI )) lex_advance();
        }
        lex_wantinset( FOLLOW_PUNCS, FOLLOW_KEYS, FOLLOW_LEXS, ER_WANT_ENDBLOK);
        return NULL;
}

The question is, how does the concept of an environment, as used in this code, relate to the idea of the knit parse tree introduced above? The key is, the search for an identifier in the parse tree proceeds by searching up the tree through a chain of block elements to find the declaration for the identifier, while in the above code, searching for an identifier involves an operation on the environment.

Therefore, the chain of block elements that runs up the parse tree is equivalent to the environment in the above model. In fact, the most obvious implementation of the environment is as a linear list of bindings of identifiers to attributes, so each binding in this implementation is equivalent to a block element. While our abstractions may look different, the actual data structures for the environment may well end up exactly the same as the parse tree shown in our example.

So what is the environment, from an abstract point of view? The basic operations we want on any environment e are:

The most obvious implementation of the environment is as a linked list with the most recently added attribute at the head of the list, but there are other implementations. A linear list has a constant-time cost for adding a new element, but search times are linear in the number of elements, in the worst case. Fortunately, in most programs, the most frequently referenced identifiers are local, and this means that the expected cost of a search is better than linear in the size of the environment.

An alternative implementation would use a search tree for the environment. In this case, the lookup operation would be faster, taking logarithmic time, but adding a new element to the tree would be slower, probably also taking logarithmic time. Don't build complex mechanisms like this during the early stages of building a compiler!