Lecture 10, Shift Reduce Parsing

Part of the notes for 22C:196:002 (CS:4908:0002)
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

An Example Grammar

Consider the following simplified version of the grammar for Falcon expressions:

<expression> ::= <term> { ( '+' | '-' ) <term> }
<term> ::= <factor> { ( '*' | '/' ) <factor> }
<factor> ::= [ '-' ] ( <number> | '(' <expression> ')' }

(The simplifications we made eliminated comparing and logical operators, variables and non-numeric constants.)

The question we face is, how do we construct a parser that can take an input string and match it to this grammar. A human working at parsing a string might proceed as follows, rewriting the string after each step:

-    4    *    5    +    6    -   (   7   +   8   )   /    9
  factor  *  factor +  factor -   (factor + factor)   /  factor
        term        +   term  -   ( term  +  term )   /  factor
        term        +   term  -   (  expression   )   /  factor
        term        +   term  -        factor         /  factor
        term        +   term  -                     term
                          expression

Unfortunately, this view of the problem does not give us an algorithm. People can take a breadth-first approach to problems like this, doing large numbers of substitutions along the entire string, and then working on the substituted string. Note, however, that what we have shown above can also be described by a tree:

-    4    *    5    +    6    -   (   7   +   8   )   /    9
 \   |    |    |    |    |    |   |   |   |   |   |   |    |
  factor  |  factor |  factor |   |factor | factor|   |  factor
       \  |  /      |    |    |   |   |   |   |   |   |  /
        term        |   term  |   | term  |  term |  /  /
          |          \   |    |   |     \ | /    /  /  /
           \          \   \   |    \ expression /  /  /
            \_______   \   \   \    \_    |   _/  /  /
                    \   \   \   \     \   |  /  _/  /
                     \   \   \   |     factor  /  _/
                      \   \   |  |          \ /  /
                       \   |  |  |          term
                        \  |  |  |   ________/
                         \ |  |  |  /
                          expression

Such a tree is called a parse tree. This parse tree exactly reflects the syntactic relationships implied by the above grammar, but it says little about one important issue. Our top-level expression contains two operators, '+' and '-'. What is the order of evaluation of operators that occur at the same level in the expression?

Refining the Grammar

The above grammar is loaded with extended BNF constructs, and these obscure several issues. Let's remove them! There are, unfortunately, two distinct ways to do this (not counting mixed approaches). We could expand our grammar to this one, referred to as a left-recursive grammar because the recursion occurs on the left side of each rule:

<expression> ::= <term>
              |  <expression> '+' <term>
              |  <expression> '-' <term>
<term> ::= <factor>
        |  <term> '*' <factor>
        |  <term> '/' <factor>
<factor> ::= <number>
          |  '-' <number>
          |  '(' <expression> ')'
          |  '-' '(' <expression> ')'

Alternatively, we could produce this grammar, which is right recursive:

<expression> ::= <term>
              |  <term> '+' <expression>
              |  <term> '-' <expression>
<term> ::= <factor>
        |  <factor> '*' <term>
        |  <factor> '/' <term>
<factor> ::= <number>
          |  '-' <number>
          |  '(' <expression> ')'
          |  '-' '(' <expression> ')'

Which is correct? First and foremost, these two grammars describe the exact same set of strings and therefore they describe the same language, so long as you define the language only as the set of strings, without ascribing any meanngs to the strings.

The problem we face is this: These two grammars give distinctly different interpretations to such expressions as "5*6/7". The left-recursive grammar above gives the left tree, while the right-recursive grammar gives the right tree:

    5    *    6    /    7          5    *    6    /    7
    |    |    |    |    |          |    |    |    |    |
 factor  | factor  | factor     factor  | factor  | factor
    |    |    |    |   /            \   |    |    |    |
  term   |   /    /   /              \   \    \   |  term
     \_  |  /    /   /                \   \    \  |  _/
       \ | /    /   /                  \   \    \ | /
        term   /   /                    \   \   term 
          \   / __/                      \__ \   /
           \ | /                            \ | / 
           term                              term
             |                                |
        expression                        expression

Unless we apply some very complex semantic rules, the first grammar above and the left parse tree imply left-to-right evaluation of operators, giving the value 4, while the second grammar and left parse tree above imply right to left evaluation, giving the value 0.

(Semantic rules can be formulated that give either value for either parse tree, but to do this, we have to interpret the value of each subtree as something quite different from a simple integer value, for example, as a list, with complex semantics at each level for constructing new lists, and an evaluator at the very top level that actually applies the operators.)

The Basic Idea

The basic shift-reduce parser operates by scanning the input text from end to end (left to right, in languages using the Roman alphabet). In addition to the input text, the parser has a push-down stack, and the ability to examine the top elements of the stack. In the following figures, we will show the stack top to the right, separated from the input text by a double vertical bar. Consider this input text, shown with an empty stack:

|| - 4 * 5 + 6 - ( 7 + 8 ) / 9

We will parse this against the following grammar, the left recursive grammar presented previously. In order to simplify the presentation, we'll number the production rules:

 1  <expression> ::= <term>
 2                |  <expression> '+' <term>
 3                |  <expression> '-' <term>
 4  <term> ::= <factor>
 5          |  <term> '*' <factor>
 6          |  <term> '/' <factor>
 7  <factor> ::= <number>
 8            |  '-' <number>
 9            |  '(' <expression> ')'
10            |  '-' '(' <expression> ')'

Our parser begins by reading one symbol from the input and pushing it onto the stack:

- || 4 * 5 + 6 - ( 7 + 8 ) / 9

Our grammar does not have any production rules that match a single minus sign, so we repeat, scanning forward another lexeme:

- <number> || * 5 + 6 - ( 7 + 8 ) / 9

At this point, we can apply rule 8, reducing the stack to this:

<factor> || * 5 + 6 - ( 7 + 8 ) / 9

A factor standing alone is a term, according to rule 4, so we can also do this reduction:

<term> || * 5 + 6 - ( 7 + 8 ) / 9

How did we know not to apply rule 1 to the above? The answer is, we peeked ahead at the next lexeme, the '*', and noticed that <term> '*' shows up at the start of another rule, but <expression> '*' does not.

Now, we shift in another symbol:

<term> * || 5 + 6 - ( 7 + 8 ) / 9

No rules match at this point, so we shift in yet another symbol:

<term> * <number> || + 6 - ( 7 + 8 ) / 9

At this point, we can apply rule 7 to reduce the number to a factor:

<term> * <factor> || + 6 - ( 7 + 8 ) / 9

Now, we can apply rule 5:

<term> || + 6 - ( 7 + 8 ) / 9

This time, we can apply rule 1 because the next lexeme is '+' and <expression> '+' does show up at the head of a rule. This gives us:

<expression> || + 6 - ( 7 + 8 ) / 9

And then shift twice:

<expression> + || 6 - ( 7 + 8 ) / 9
<expression> + <number> || - ( 7 + 8 ) / 9

And then apply rules 7 and 4 and then 2:

<expression> + <factor> || - ( 7 + 8 ) / 9
<expression> + <term> || - ( 7 + 8 ) / 9
<expression> || - ( 7 + 8 ) / 9

Now, we shift 3 times before we find any rules we can apply:

<expression> - || ( 7 + 8 ) / 9
<expression> - ( || 7 + 8 ) / 9
<expression> - ( >number< || + 8 ) / 9

At this point, we can apply rules 7 and 4 and then 1:

<expression> - ( >factor< || + 8 ) / 9
<expression> - ( >term< || + 8 ) / 9
<expression> - ( >expression< || + 8 ) / 9

No rules apply here, so we shift and shift again:

<expression> - ( >expression< + || 8 ) / 9
<expression> - ( >expression< + <number> || ) / 9

At this point, we can apply rules 7, 4 and 2:

<expression> - ( >expression< + <factor> || ) / 9
<expression> - ( >expression< + <term> || ) / 9
<expression> - ( >expression< || ) / 9
No rules apply here, so we shift, allowing us to apply rules 9 and 4:

<expression> - ( >expression< ) || / 9
<expression> - >factor< || / 9
<expression> - >term< || / 9

Again, we did not apply rule 1 because we looked ahead and noticed that <term> '/' is the start of a rule, while <expression> '/' is not.

At this point, we shift twice, and then apply rules 7, 6 and 3:

<expression> - >term< / || 9
<expression> - >term< / <number> ||
<expression> - >term< / <factor> ||
<expression> - >term< ||
<expression> ||

And we are done!

Tables

If we had not had the idea of looking ahead into the input string, we'd have had to do a backtrack search for a parse. The idea of looking ahead can be formalized by listing, for each production rule, the set next symbols that permit it to be applied. This is a purely mechanical job, and it is possible to write a program that takes a grammar as input and either produces the table giving, for each rule, the next symbol that permits that rule, or rejects the grammar.