Lecture 11, Start and Follow Sets, LR1 Parsers

Part of the notes for CS:4908:0004 (22C:196:004)
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

An Example Grammar

Consider the following grammar from the previous lecture:

<expression> ::= <term> { ( '+' | '-' ) <term> }
<term> ::= <factor> { ( '*' | '/' ) <factor> }
<factor> ::= [ '-' ] ( <number> | '(' <expression> ')' )

We transformed this into the following equivalent left-recursive grammar last time:

 1  <expression> ::= <term>
 2                |  <expression> '+' <term>
 3                |  <expression> '-' <term>
 4  <term> ::= <factor>
 5          |  <term> '*' <factor>
 6          |  <term> '/' <factor>
 7  <factor> ::= <number>
 8            |  '-' <number>
 9            |  '(' <expression> ')'
10            |  '-' '(' <expression> ')'

If you consider writing a shift-reduce parser, it is immediately obvious that you need to look a varying distance into the stack. The distance you need to look is determined by the number of symbols in the longest rule. In the case of the above grammar, the longest rule is rule 10, so we need to look at the top four items on the stack.

Peeking deeply into the stack may be uncomfortable, but we can always rewrite the grammar to avoid this need by introducing extra nonterminal symbols. This transformation is trivial and can easily be automated. Here is the above grammar rewritten so that each rule requires looking at only the topmost symbol and the symbol immediately below it:

 1  <expression> ::= <term>
 2                |  <expression+> <term>
 3                |  <expression-> <term>
 2a <expression+> ::= <expression> '+'
 3a <expression-> ::= <expression> '-'
 4  <term> ::= <factor>
 5          |  <term*> <factor>
 6          |  <term/> <factor>
 5a <term*> ::= <term> '*'
 6a <term/> ::= <term> '/'
 7  <factor> ::= <number>
 8            |  '-' <number>
 9            |  <(expression> ')'
10            |  <-(expression> ')'
 9a <(expression> ::= '(' <expression>
10a <-(expression> ::= <-(> <expression>
10b <-(> ::= '-' '('

In the above, the right-hand side of the original rule 2 has been shortened to just two symbols by the addition of a new nonterminal, <expression+>. This new nonterminal is defined in rule 2a. We have similarly split up rules 3, 5, 6 and 9. Rule 10 was more complex. Here, we rewrote rule 10 as this:

10  <factor> ::= <-(expression> ')'
10a <-(expression> ::= '-' '(' <expression>

But then, because rule 10a has 3 symbols on its right-hand side, we had to perform an additional split, creating rule 10b.

The Challenge

The basic shift-reduce parser operates by scanning the input text from end to end (left to right, in languages using the Roman alphabet). In addition to the input text, the parser has a push-down stack, and the ability to examine the top elements of the stack.

The question we face when building a shift reduce parser is simple: Which of the possible reductions should we do at any particular time? The key to this problem is to identify two sets of terminal symbols associated with each symbol in the grammar:

The start symbols for each nonterminal are all of the terminal symbols that might be found at the start of an occurance of a particular terminal symbol, and
The following symbols for each nonterminal are all of the terminal that might be found immediately after each occurance of a particular terminal symbol.

For our grammar, before we shortened all the production rules, we have the following start symbols:

<expression>: starts with one of { <number>, '-', '(' }
<term>: starts with one of { <number>, '-', '(' }
<factor>: starts with one of { <number>, '-', '(' }

In this case, all three sets are the same because every factor is a legal term and every term is a legal expression. In our modified grammar, limited to two-symbol access to the stack top, we add the following:

<expression+>: starts with one of { <number>, '-', '(' }
<expression->: starts with one of { <number>, '-', '(' }
<term*>: starts with one of { <number>, '-', '(' }
<term/>: starts with one of { <number>, '-', '(' }
<(expression>: starts with one of { '(' }
<-(expression>: starts with one of { '-' }
<-(>: starts with one of { '-' }

In the above, new nonterminals representing things that started with expressions and terms followed by something all have the same set of starting symbols as the basic expression and term nonterminals. The new nonterminals representing things that started with punctuation have very simple start sets containing only that punctuation mark.

Note that we can trivially identify the set of start symbols for each terminal symbol in the grammar: Each such symbol has a start set containing just itself.

We needed the start sets in order to compute the sets of following symbols for each nonterminal. Consider this production rule:

<a> ::= b c d

This rule tells us nothing about the set of following symbols for <a>. On the other hand, regardless of whether b, c, and d are nonterminals or terminals, it tells us that the follow set for b contains the start symbols for c, and the follow set for d contains the start symbols for d. If we traverse the entire grammar, we can accumulate the complete follow sets for each terminal and nonterminal in the grammar.

Having this set in hand is a powerful tool. If we look just at the follow set for each terminal, we can identify numerous syntactically invalid strings without doing any parsing. If the string contains the text a b, for example, where b is not one of the following symbols of the terminal symbol a, then that text cannot be syntactically valid.

Note that we need to consider end-of-file as a special symbol in this context. To do this, we add one extra nonterminal at the very top level of the grammar, <file>:

    <file> ::= <expression> <eof>
 1  <expression> ::= <term>
 2                |  <expression> '+' <term>
 3                |  <expression> '-' <term>
 4  <term> ::= <factor>
 5          |  <term> '*' <factor>
 6          |  <term> '/' <factor>
 7  <factor> ::= ...

Here, we have used <eof> to represent end-of-file. With this in place, the follow sets for each of our nonterminals can be computed, giving us this:

<expression>: followed by one of { <eof>, '+', '-', ')' }
<term>: followed by one of { <eof>, '*', '/', '+', '-', ')' }
<factor>: followed by one of { <eof>, '*', '/', '+', '-', ')' }
<expression+>: followed by one of { <number>, '-', '(' }
<expression->: followed by one of { <number>, '-', '(' }
<term*>: followed by one of { <number>, '-', '(' }
<term/>: followed by one of { <number>, '-', '(' }
<(expression>: followed by one of { ')' }
<-(expression>: followed by one of { ')' }
<-(>: followed by one of { <number>, '-', '(' }

These start sets and follow sets have many uses. For example, in a shift-reduce parser, when no shift or reduction is possible, that indicates a syntax error. What do you do at that point? The traditional recovery algorithm is to skip over whatever is found in the input stream until some lexeme is found that is in the follow set for the symbol on top of the stack. At that point, parsing can continue and it might be possible to output more useful error messages that could not have been output if parsing had halted abruptly at the very first error.

Of course, skipping until a symbol is found can also lead to a completely different and entirely incorrect interpretation of the input, explaining why some syntax errors lead to strings of utterly useless error messages. Smart compilers try to produce a few error messages before giving up, because productive programmers can somtimes correct a few errors per attempt at compilation, but it is rarely useful to allow compilation to continue for more than a handful of error messages before abandoning the attempt.

A Practical Shift-Reduce Parsing Algorithm

Repeat the following steps until either the target symbol (<file> in our example) is the only thing on the stack or until none of the alternatives below apply:

If the right-hand side of a production rule matches the top of stack, and if the next input symbol is in the set of end symbols for the nonterminal on the left-hand side of the production rule, apply that rule.

Otherwise, when the above cannot be done, if the next input symbol is in the set of end symbols for the topmost symbol on the stack (terminal or nonterminal), shift one symbol from the input stream onto the stack top.

Otherwise, if neither of the above can be done, there is an error. Nothing can be done except perhaps discarding input until the next symbol is in the set of end symbols for the top symbol on the stack.

The above algorithm was invented by Don Knuth in 1965. It is called an LR1 parser. The L in LR1 means that it operates operates Left to right. The R means that it reduces the Rightmost derivation first (that is, it reduces the right side of the parse tree for any production rule before attacking the rest of that rule. The 1 means that it operates with just one symbol of lookahead. Knuth proved that a grammar that can only be parsed with an LRk parser (that is, requiring k symbols of lookahead) can be written so that it can be parsed with an LR1 parser. As a result, nobody needs stronger parsers than LR1. LR1 parsers do require large tables, and as a result, weaker but more compact parsers are common. Explore the Wikipedia article, Canonical LR parser, for an introduction to variants such as the LALR and SLR parsers.

Shift-Reduce Parsers as Compilers

Given a shift-reduce parser, we can make a compiler as follows: First, we need to associate additional information with every item on the stack. Numeric lexemes have values, identifiers have a specific identity, denoting, for example, the memory locations at which variables are stored. Terms, factors and expressions, similarly, each must be associated with the storage location for the value of that term, factor, or expression (perhaps a register, perhaps a memory locaiton).

In this context, we associate a semantic action with every production rule. For many of the rules, the actions will merely involve preserving information, In our example grammar, this applies to these rules:

 1  <expression> ::= <term>
 2a <expression+> ::= <expression> '+'
 3a <expression-> ::= <expression> '-'
 4  <term> ::= <factor>
 5a <term*> ::= <term> '*'
 6a <term/> ::= <term> '/'
 7  <factor> ::= <number>
 9            |  <(expression> ')'
10            |  <-(expression> ')'
 9a <(expression> ::= '(' <expression>

For each of the above rules, whatever semantic attribute was associated with the nonterminal on the right-hand side is simply moved to the nonterminal on the left-hand side as that nonterminal is placed on the stack top. The punctuation marks consumed by these rules carry no semantic content once it is decided to apply these rules.

The following rules from our example grammar have more significant semantic content:

 2  <expression> ::= <expression+> <term>
 3                |  <expression-> <term>
 5  <term> ::= <term*> <factor>
 6          |  <term/> <factor>
 7  <factor> ::= <number>
 8            |  '-' <number>
10a <-(expression> ::= <-(> <expression>

Rule 10b, <-(> ::= '-' '(' involves no symbols that carry any semantic content, so we did not include in either of the above lists. Now, let's look at the actions associated with these rules:

For rules 2, 3, 5 and 6, the semantic actions are all similar: First, find where the values associated with each of the two operands are stored. Second, generate machine code to add, subtract, multiply or divide those two values (depending on which rule we are dealing with). Finally, wherever the machine code put the result should be associated with the result nonterminal after applying the reduction rule.

For rule 7, the compiler needs to generate machine code to place the constant numeric value from the lexeme into some location, and then wherever the machine code put the result should be associated with the newly identified term. Rule 8 is like rule 7, excep that the machine code we generate should load the negated constant (or negate the loaded constant).

For rule 10a, first, we need to find where the expression's value is located, and then generate machine code to negate this value. Finally, we need to associate the result with the newly identified term.

One popular approach to syntax errors in LR1 parsers is to add production rules that reflect the popular "language extensions" reflected in the syntax error. For example, missing semicolons could be considered to be a class of language extension, so production rules are added allowing programs to be parsed despite missing semicolons. The semantic action for each such rule is an error message instead of useful code generation. Where these rules are added, they sometimes include incorrect parsing of the text after the error until some recovery point, but the important part, the error message, is output, and there is a chance that parsing can recover enough to output useful error messages after the first error.