Lecture 23, Parse Errors, Knitting the Parse Tree

Part of the notes for 22C:196:002 (CS:4908:0002)
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Parse Errors

Consider a parser for this fragment of the Kestrel grammar:

<constructor> ::= "(" <expression> { [ "," ] <expression> } ")"
               |  "[" <expression> { [ "," ] <expression> } "]"
               |  "{" <expression> { [ "," ] <expression> } "}"

On analysis of the grammar, we find (trivially, that the start set for a constructor is {"(", "[", "{"} With somewhat more difficulty, we can compute the follow set for constructors, a set that turns out to be the union of the start sets for statements, all the operators allowed in expressions, all of the addressing operators allowed in references, and all of the keywords that can terminate any section of a block. With these sets in hand, we can imagine writing a parser as follows -- using an informal notation:

void parse_constructor() {
	if (lex_this ∉ {"(", "[", "{"}) {
		error( expected an opening brace );
	}
	if (lex_this == "(") follower = ")";
	else if (lex_this == "[") follower = "]";
	else follower = "}";
	lex_advance(); /* skip opening brace */

	parse_expression();
	while (lex_this ≠ follower) {
		if (lex_this == ",") lex_advance(); /* skip optional comma */
		parse_expression();
	}
	lex_advance(); /* skip closing brace */
	if (lex_this ∉ constructor_follow_set) {
		error( expected something like EOF or ; );
        }
}

There are some interesting questions raised by the above code. What do we do if we don't find the expected lexeme in the tests for either the start set or the final set? A first approximation answer would be "it doesn't matter; having detected a syntax error, our responsibility is done."

The problem with this model is that it is sometimes useful to have the compiler recover from syntax errors, if such recovery allows it to parse onward and report other errors. It is useless if the result is merely a cascade of errors that are a consequence of misparsing the first error, so the question becomes, how can we make the compiler recover?

One model for doing this operates by skipping forward in the input whenever we know that the current lexeme should be in some set and it is not. For example, we can replace the first test above with this:

if (lex_this ∉ {"(", "[", "{"}) {
	error( looking for an opening brace );
	while (lex_this ∉ {"(", "[", "{"}) {
		lex_advance();
	}
}

This model works fairly well, although it might be reasonable to limit the extent to which it skips ahead. Arguing such issues amounts to fine tuning the response of the parser to programs that are already broken. This offers only limited return on the investment.

Dynamic Computation of the Follow Set

One problem with the above code is that the follow set for each nonterminal is always statically computed. The code to parse a parenthesized expression and the code to parse the controlling expression for an if statement have different contexts that are likely to be worthy of different treatment in the parser. Consider these uses of an expression:

if i > j then ...
while i > j do ...
p = (i > j);

The fact that an expression is parsed within an if statement implies that the keyword then is likely to follow, while if the expression is parsed in the context of a while statement, the keyword then would be an error, while the keyword do is likely. In the third case, a parenthesized expression, no keyword would be syntactically valid, but the expression must be followed by a closing parenthesis (or a comma, since in Kestrel, parenthesized expressions occur within constructors.

This suggests using dynamic computation of the follow sets instead of static computation. This leads to code like the following for our example, parsing constructors:

void parse_constructor( set followers ) {
	if (lex_this ∉ {"(", "[", "{"}) {
		error( expected an opening brace );
	}
	if (lex_this == "(") closer = ")";
	else if (lex_this == "[") closer = "]";
	else closer = "}";
	lex_advance(); /* skip opening brace, remember matching closer */

        set s = closer ∪ {","} ∪ expression_start_set;
	parse_expression( s );
	while (lex_this ≠ closer) {
		if (lex_this == ",") lex_advance(); /* skip optional comma */
		parse_expression( s );
	}
	lex_advance(); /* skip closing brace */
	if (lex_this ∉ followers ) {
		error( looking for something in followers );
		while (lex_this ∉ followers) do lex_advance();
        }
}

Here, notice that the follow set is passed to each parsing routine. Here, we end the parsing of the constructor by checking the follow set we were given, instead of checking against a generic follow set, and as each subsidiary expression is paresed, it is given the follow set determined by this context -- a follow set that includes only the closing brace that matches the opening brace of this constructor.

This scheme is most effective in shallowly nested programs. When follow sets are passed down the parse tree through a great depth, they may end up accumulating essentially all of the symbols that static analysis would place in the follow set. On the other hand, in a shallow parse tree, the smaller follow sets can lead to significantly more focused error messages or more focused error recovery.

Error messages themselves are a problem. The final error message above should not, for example, say "looking for something in {'do', ';', 'end', <reference> ... }" enumerating the entire follow set. That could go on and on. One option is to have the error message say "looking for something like 'do'", picking just one representative element of the follow set.

Set Operators

Above, we used informal notation:

if (lex_this ∉ {"(", "[", "{"}) {

On several occasions, we have suggested implementing some kind of set primitives to permit relatively easy substitutions for messy code like this:

if (!(  (lex_this.type == PUNCT)
     && (  (lex_this.value == P_LP)
	|| (lex_this.value == P_LSQR)
	|| (lex_this.value == P_LCURL)
)    )  ) {

If we do things right, we can write something like this:

if (!is_punct(lex_this, set_cons3(P_LP, P_LSQR, P_LCURL)) {

Native speakers of C will prefer preprocessor defines. Native C++ speakers might create a new class for lexeme sets. If the methods of this class are defined using the C++ inline mechanism, there is no real efficiency difference between these two models.

Knitting the Parse Tree Together

Consider the following Kestrel code:

rt: type record
	f: var ft
end;
et: type @ rt
at: type array it of et;
a:  var  at;

This artificial example allows the following reference to be legal a[i]@.f -- a reference to field f of the record referenced by element i of array a might create a new class for lexeme sets. If the methods of this class are defined using the C++ inline mechanism, there is no real efficiency difference between these two models.

A parse tree for the block containing the declarations might look like this:

This tree can be made useful if, as definitions are parsed, they are cross linked, knitting the tree together. So, whenever a reference to the type rt is encountered, it is linked directly to the corresponding type definition in the tree. Completing the process on the above tree adds the following links: