Lecture 23, Parse Errors, Knitting the Parse Tree

Part of the notes for 22C:196:002 (CS:4908:0002)
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Parse Errors

Consider a parser for this fragment of the Falcon grammar:

<constructor> ::= "(" <expression> { [ "," ] <expression> } ")"
               |  "[" <expression> { [ "," ] <expression> } "]"
               |  "{" <expression> { [ "," ] <expression> } "}"

On analysis of the grammar, we find (trivially, that the start set for a constructor is {"(", "[", "{"} With somewhat more difficulty, we can compute the follow set for constructors, a set that turns out to be the union of the start sets for statements, all the operators allowed in expressions, all of the addressing operators allowed in references, and all of the keywords that can terminate any section of a block. With these sets in hand, we can imagine writing a parser as follows -- using an informal notation:

void parse_constructor() {
	if (lex_this ∉ {"(", "[", "{"}) {
		error( expected an opening brace );
	}
	if (lex_this == "(") follower = ")";
	else if (lex_this == "[") follower = "]";
	else follower = "}";
	lex_advance(); /* skip opening brace */

	parse_expression();
	while (lex_this ≠ follower) {
		if (lex_this == ",") lex_advance(); /* skip optional comma */
		parse_expression();
	}
	lex_advance(); /* skip closing brace */
	if (lex_this ∉ constructor_follow_set) {
		error( expected something like EOF or ; );
        }
}

There are some interesting questions raised by the above code. What do we do if we don't find the expected lexeme in the tests for either the start set or the final set? A first approximation answer would be "it doesn't matter; having detected a syntax error, our responsibility is done."

The problem with this model is that it is sometimes useful to have the compiler recover from syntax errors, if such recovery allows it to parse onward and report other errors. It is useless if the result is merely a cascade of errors that are a consequence of misparsing the first error, so the question becomes, how can we make the compiler recover?

One model for doing this operates by skipping forward in the input whenever we know that the current lexeme should be in some set and it is not. For example, we can replace the first test above with this:

if (lex_this ∉ {"(", "[", "{"}) {
	error( looking for an opening brace );
	while (lex_this ∉ {"(", "[", "{"}) {
		lex_advance();
	}
}

This model works fairly well, although it might be reasonable to limit the extent to which it skips ahead. Arguing such issues amounts to fine tuning the response of the parser to programs that are already broken. This offers only limited return on the investment.

Dynamic Computation of the Follow Set

One problem with the above code is that the follow set for each nonterminal is always statically computed. The code to parse a parenthesized expression and the code to parse the controlling expression for an if statement have different contexts that are likely to be worthy of different treatment in the parser. Consider these uses of an expression:

if i > j then ...
while i > j do ...
p = (i > j);

The fact that an expression is parsed within an if statement implies that the keyword then is likely to follow, while if the expression is parsed in the context of a while statement, the keyword then would be an error, while the keyword do is likely. In the third case, a parenthesized expression, no keyword would be syntactically valid, but the expression must be followed by a closing parenthesis (or a comma, since in Falcon, parenthesized expressions occur within constructors.

This suggests using dynamic computation of the follow sets instead of static computation. This leads to code like the following for our example, parsing constructors:

void parse_constructor( set followers ) {
	if (lex_this ∉ {"(", "[", "{"}) {
		error( expected an opening brace );
	}
	if (lex_this == "(") follower = ")";
	else if (lex_this == "[") follower = "]";
	else follower = "}";
	lex_advance(); /* skip opening brace */

        set s = follower ∪ {","} ∪ expression_start_set;
	parse_expression( s );
	while (lex_this ≠ follower) {
		if (lex_this == ",") lex_advance(); /* skip optional comma */
		parse_expression( s );
	}
	lex_advance(); /* skip closing brace */
	if (lex_this ∉ followers ) {
		error( expected something in followers );
        }
}

Here, notice that the follow set is passed to each parsing routine. Here, we end the parsing of the constructor by checking the follow set we were given, instead of checking against a generic follow set, and as each subsidiary expression is paresed, it is given the follow set determined by this context -- a follow set that includes only the closing brace that matches the opening brace of this constructor.

This scheme is most effective in shallowly nested programs. When follow sets are passed down the parse tree through a great depth, they may end up accumulating essentially all of the symbols that static analysis would place in the follow set. On the other hand, in a shallow parse tree, the smaller follow sets can lead to significantly more focused error messages or more focused error recovery.

Set Operators

Above, we used informal notation:

if (lex_this ∉ {"(", "[", "{"}) {

A naive C or C++ programmer working with our proposed parsing framework might write this instead:

if (!(  (lex_this.type == PUNCT)
     && (  (lex_this.value == P_LP)
	|| (lex_this.value == P_LSQR)
	|| (lex_this.value == P_LCURL)
)    )  ) {

This is really hard to read, and things get even worse if the set in question contains a mix of keywords and punctuation.

This suggests that we might want to build a general set mechanism. Note that the set of lexeme types is small, and the set of keywords is small, and the set of punctuation marks is small. In this case, the relevant definition of small is: smaller than the size of one word on a typical 32-bit computer.

Therefore, we could represent sets of lexemes, as needed in the parser, as word triples. One that is a bit vector for the lexeme type, one that is a bit vector for the keyword code, and one that is a bit vector for the punctuation code. Consider this code:

if (lex_this ∉ s) {

We might rewrite this as follows using the above idea:

if (!(  (lex_this.type | s.lextypes)
     || ((lex_this.type == PUNCT) && (lex_this.value & s.puncts))
     || ((lex_this.type == KEYWORD) && (lex_this.value & s.keys))
)    ) {

This is still ugly, but we can easily package this behind a wall of abstraction (either defining functions or using the preprocessor define mechanism) so that programmers merely need to write something like this:

if (!in_lex_set(lex_this, s)) {

Native speakers of C will prefer preprocessor defines. Native C++ speakers might create a new class for lexeme sets. If the methods of this class are defined using the C++ inline mechanism, there is no real efficiency difference between these two models.

In either case, note that the parser will also need set constructors for lexeme sets, along with mechanisms to take the union of sets. These are straightforward programming exercises.

Knitting the Parse Tree Together

Consider the following Falcon code:

type rt = record
	var f: ft
end;
type et = @ rt
type at = array it of et;
var a: at;

This artificial example allows the following reference to be legal a[i]@.f -- a reference to field f of the record referenced by element i of array a might create a new class for lexeme sets. If the methods of this class are defined using the C++ inline mechanism, there is no real efficiency difference between these two models.

A parse tree for the block containing the declarations might look like this:

This tree can be made useful if, as definitions are parsed, they are cross linked, knitting the tree together. So, whenever a reference to the type rt is encountered, it is linked directly to the corresponding type definition in the tree. Completing the process on the above tree adds the following links: