Lecture 20, Parse Errors and Follow Sets

Part of the notes for CS:4980:1
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Kestrel expression lists

Consider a parser for this fragment of the Kestrel grammar:

<expression list> ::= "(" <expressions> ")"
                   |  "[" <expressions> "]"
                   |  "{" <expressions> "}"
<expressions> ::= <expression> { [ "," ] <expression> }

First some notes: Kestrel uses expression lists for several distinct semantic purposes. They serve as parameter lists to procedures and functions, and they serve as subscript lists for arrays. Unlike languages like Pascal, C and C++, Kestrel does use the type of bracketing to distinguish between the purposes. Instead, semantic context determines the purpose. If the expression list follows a reference to an array, the expression list is a subscript list. If the expression list follows a reference to a procedure or function, it is a parameter list.

The point of this particular discussion, however, has to do with parsing expression lists, not with their meaning.

It is not hard to examine the grammar to determine the start set for an expression list:

<expression list>.start = {"(", "[", "{"}

The follow set for expression requires more careful analysis of the grammar. Processing the grammar tools provided on the course home page, we find that the follow set is:

<expression list>.follow = {"@", ".", "(", "[", "{", "=", ",", "*", "/", "%", "&", "-", "+", "|", "/=", ">", ">=", "<", "<=", ")", "]", "}", "..", ";", "~", ":",
"in", "then", "end", "do", "if", "select", "catch", "raise", "while", "for", "else", "case", "until", "external", "of", "array", "set", "record", "null",
<identifier>, <number>, <string>, <end of file>}

The start sets and follow sets for expressions are intermediate in difficulty between the above two, and once we run the tool, find that they are:

<expression>.start = {"-", "~", "(", "[", "{", "record", "null",
<identifier>, <number>, <string>, <end of file>}
<expression>.follow = <expression list>.follow

Parsing

We can write informal recursive descent code to parse expression lists as follows:

void parse_expression_list() {
	if (lex_this ∉ expression_list_start_set) {
		error( expected something in expression_list_start_set );
	}
	if (lex_this == "(") follower = ")";
	else if (lex_this == "[") follower = "]";
	else follower = "}";
	lex_advance(); /* skip opening brace */

	parse_expression();

	while (lex_this ≠ follower) {
		if (lex_this == ",") lex_advance(); /* skip optional comma */
		parse_expression();
	}

	lex_advance(); /* skip closing brace */

	if (lex_this ∉ expression_list_follow_set) {
		error( expected something in expression_list_follow_set );
        }
}

The above code illustrates a general pattern, each parsing routine can begin with code to complain if the current lexeme is not in the start set for that nonterminal, and it can end with code to complain if the current lexeme is not in the follow set. Clearly, we can write lexical support routines to perform these checks and issue appropriate complaints. There are some interesting questions raised by the above code.

Where the original code contained 3 production rules to handle the problem of balancing the different kinds of brackets, this code simply remembers the closing bracket that goes opposite the opening bracket and then checks for that. The big shortcoming of this code is that it will produce rather poor error messages.

Improving the error messages

The first problem with the error messages suggested above is that merely saying "expression list follow set" means nothing to the average programmer, wile actually giving the follow set would flood the programmer with too much information. A well crafted error message would probably pick a representative element or elements from the follow set, for example, saying something like "operator or semicolon expected."

But we can do better. The above algorithm complains if there is no element of the start set at the beginning, and it complains if there is no member of the follow set at the end, but it does nothing to recover from this fault, essentially guaranteeing that the parsing will be derailed by the first error encountered, leading to a cascade of complaints as all the other parsing rules fail.

One model for reducing the severety of error cascades operates by skipping forward in the input whenever we know that the current lexeme should be in some set and it is not. For example, we can replace the first test above with this:

if (lex_this ∉ expression_list_start_set) {
	error( looking for an opening brace );
	while (lex_this ∉ expression_list_start_set) {
		lex_advance();
	}
}

This model works fairly well, although it might be reasonable to limit the extent to which it skips ahead. Arguing such issues amounts to fine tuning the response of the parser to programs that are already broken. This offers only limited return on the investment.

Another problem is the sheer size of the follow set. The fact is, most programs are not deeply indented, and as a result, in many contexts, the follow set, taking into account only the surrounding context, will be smaller than the follow set for the same syntactic construct in an unconstrained context. For example, if the opening bracket of an expression list is a parentheses, the follow set for that parentheses contains only a closing parentheses, and not a closing square bracket or brace. We can solve this problem by dynamically computing the follow set, or that part of the follow set that depends on context.

Dynamic Computation of the Follow Set

One problem with the above code is that the follow set for each nonterminal is always statically computed. The code to parse a parenthesized expression and the code to parse the controlling expression for an if statement have different contexts that are likely to be worthy of different treatment in the parser. Consider these uses of an expression:

if i > j then ...
while i > j do ...
p = (i > j);

The fact that an expression is parsed within an if statement implies that the keyword then is likely to follow, while if the expression is parsed in the context of a while statement, the keyword then would be an error, while the keyword do is likely. In the third case, a parenthesized expression, no keyword would be syntactically valid, but the expression must be followed by a closing parenthesis (or a comma, since in Kestrel, parenthesized expressions occur within constructors.

This suggests using dynamic computation of the follow sets instead of static computation. This leads to code like the following for our example, parsing constructors:

void parse_expression_list( set followers ) {
	if (lex_this ∉ expression_list_start_set) {
		error( expected something in expression_list_start_set );
	}
	if (lex_this == "(") closer = ")";
	else if (lex_this == "[") closer = "]";
	else closer = "}";
	lex_advance(); /* skip opening brace, remember matching closer */

        set s = {closer} ∪ {","} ∪ expression_start_set;
	parse_expression( s );

	while (lex_this ≠ closer) {
		if (lex_this == ",") lex_advance(); /* skip optional comma */
		parse_expression( s );
	}

	lex_advance(); /* skip closing brace */

	if (lex_this ∉ followers ) {
		error( looking for something in followers );
		while (lex_this ∉ followers) do lex_advance();
        }
}

Here, notice that the follow set is passed to each parsing routine. Here, we end the parsing of the constructor by checking the follow set we were given, instead of checking against a generic follow set, and as each subsidiary expression is paresed, it is given the follow set determined by this context -- a follow set that includes only the closing brace that matches the opening brace of this constructor.

This scheme is most effective in shallowly nested programs. When follow sets are passed down the parse tree through a great depth, they may end up accumulating essentially all of the symbols that static analysis would place in the follow set. On the other hand, in a shallow parse tree, the smaller follow sets can lead to significantly more focused error messages or more focused error recovery.

Error messages themselves are a problem. The final error message above should not, for example, say "looking for something in {'do', ';', 'end', <reference> ... }" enumerating the entire follow set. That could go on and on. One option is to have the error message say "looking for something like 'do'", picking just one representative element of the follow set.