Lecture 10, Context-Free Grammars

Part of the notes for CS:4980:1
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Language Categories

Noam Chomsky classified languages into what is now called the Chomsky hierarchy. There are 4 levels in this hierarchy, listed here from most restrictive to most general; Chomsky himself numbered them from type 3, the regular languages, down to type 0, the recursively enumerable languages:

  1. Regular languages. The formal grammar for a regular language can be expressed as a regular expression, and a finite-state machine can be built to distinguish between valid and invalid strings in any regular language (see the sketch following this list). The lexical level of the Kestrel language is regular, as is the lexical level of almost every programming language in common use today -- if one ignores rules about which digits are legal in what number base, limits on the number of characters in an identifier, and similar details.

    Many programmers are familiar with the so-called regular expressions found in text editors such as vi. The term has escaped from its original theoretical context into widespread use. Unfortunately, the implementations of regular expressions in text processing systems vary immensely. In some, the expressive power of the expressions is very limited, falling well short of what Chomsky's type 3 allows, while in others, the expressive power approaches type 2.

  2. Context-free languages. The formal grammar of a context-free language is a context-free grammar. BNF and its common extensions are all notations for expressing context-free grammars. A push-down automaton, that is, a finite-state machine augmented with a push-down stack, is sufficient to distinguish between valid and invalid strings in a context-free language. Most programming languages in use today are described by context-free grammars, and the parser we will build for Kestrel uses such a grammar.

    In fact, most computer languages are more complex than the context-free grammar for the language suggests. The languages we use have context sensitivities. We generally ignore these while we build the parser for the language, and then process the context-sensitive details in the semantics code built on top of the parser.

  3. Context-sensitive languages are described by context-sensitive grammars. Languages in this class can be recognized by what is called a linear bounded automaton, that is, a finite-state machine provided with an auxiliary storage area that has a size proportional to the size of the input string. In effect, the parser must retain the entire input string, and it needs added data structures with a size proportional to that of the input string. Unlike context-free languages, there is no push-pop constraint on the use of the auxiliary memory.

  4. Recursively enumerable languages are the largest class. These languages can be recognized by a Turing machine, a finite-state machine augmented with an infinite memory, usually described as an infinite tape that the machine may scan in either direction, reading and writing as it goes. Any programming language that allows construction of open-ended data structures such as linked lists is formally equivalent to a Turing machine, except, of course, that real implementations of such languages are necessarily limited by the fact that we cannot build infinite memories for our computers.
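
To make the lowest level concrete, here is a minimal sketch in Python (ours, not part of any Kestrel tool) of a finite-state machine that recognizes one regular lexical pattern, unsigned decimal integers -- the regular expression digit digit*. The machine has two states and makes one transition per input character, which is all a finite-state machine may do:

def is_unsigned_integer(s):
    state = "start"                  # the states: "start" and "in number"
    for ch in s:
        if ch.isdigit():
            state = "in number"      # any digit enters the final state
        else:
            return False             # no transition defined: reject
    return state == "in number"      # accept only in the final state

print(is_unsigned_integer("4037"))   # True
print(is_unsigned_integer("40x7"))   # False, as is the empty string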

Production Rules

In formal linguistics, a grammar consists of a set of rewrite rules applied to strings of terminal and nonterminal symbols. A string in the language consists only of terminal symbols, while nonterminals, also called metasymbols, are used for describing syntactic elements of the language.

There are two ways of thinking about grammars: The generative viewpoint starts with a single nonterminal, for example <sentence>, and applies a sequence of rewrite rules to it in order to generate an utterance in the language. The opposite viewpoint, parsing, starts with an utterance and applies a series of rewrite rules to it in order to reduce it to a single nonterminal such as <sentence>.

Here, we follow the conventions of BNF and enclose the names of nonterminal symbols in angle brackets, for example, <sentence>, while we leave terminal symbols alone. In addition, we write our rewrite rules using the BNF ::= symbol. Consider this rewrite rule from the Kestrel grammar:

<until loop> ::= "do" <block> "until" <expression>

Applied as a generative grammar, we can read this rewrite rule from left to right. In this reading, it allows the nonterminal symbol <until loop> to be replaced with the string of terminal and nonterminal symbols "do" <block> "until" <expression>. The generative view of grammars leads us to refer to each of these rewrite rules as a production rule or, in short, a production.

(Note, in the Kestrel grammar, terminal symbols are quoted in order to distinguish them from metasymbols. Thus, { is a metasymbol, the opening brace indicating that the following material is to be repeated zero or more times, while "{" is a terminal symbol, an opening brace that may appear in the text of a Kestrel program. Many EBNF users rely on the reader's intuition to distinguish metasymbols from terminal symbols, while some use a different type font.)

Alternatively, we can apply this production rule in parsing by reading it from right to left. In this case, wherever we encounter "do" <block> "until" <expression>, we may replace it with <until loop>. In this case, we read the rule as a reduction rule, since it reduces a string of mixed terminal and nonterminal symbols to a single nonterminal symbol.
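
The two readings can be acted out mechanically if we treat a rule as a pair of strings. In this Python sketch (the string representation and the surrounding <statement> context are ours, invented for illustration), the same rule generates in one direction and reduces in the other:

lhs = "<until loop>"
rhs = '"do" <block> "until" <expression>'

# Generative reading: rewrite the nonterminal as its right-hand side.
s = "<statement> <until loop>"
print(s.replace(lhs, rhs, 1))   # <statement> "do" <block> "until" <expression>

# Reduction reading: rewrite the right-hand side as the nonterminal.
t = '<statement> "do" <block> "until" <expression>'
print(t.replace(rhs, lhs, 1))   # <statement> <until loop>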

The grammars for Chomsky type zero languages consist of unconstrained rules. Both the left-hand and right-hand sides of production rules may consist of any number of terminal and nonterminal symbols. Thus, if you see a grammar that has rules looking like this, it may describe a type zero language:

a <b> c ::= d <e> f <g>

For a language to be a context free language, it must be possible to give a grammar for that language where every production rule has just one nonterminal symbol on its left-hand side. The grammar for Kestrel is given in this form, and in fact, all properly formed BNF grammars can be reduced to this form.

It is important to note that there are infinitely many grammars for any language. Consider this production rule:

<leftside> ::= <right> <hand> <side> 

If we introduce the new nonterminal <middle>, we may replace this one rule with this:

<leftside> ::= <middle>
<middle> ::= <right> <hand> <side> 

Or we might opt, instead, to replace it with this:

<leftside> ::= <right> <middle>
<middle> ::= <hand> <side> 

The two alternatives shown above are entirely equivalent to the original in the sense that they describe the same language, or in other words, they generate the same set of strings. They are different only because they use different sets of production rules to do so.
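
Grammars in this form are easy to carry around in a program. As a Python sketch (the dictionary representation is our own, not taken from any tool), each nonterminal maps to a list of alternative right-hand sides, each of which is a list of symbols:

# The second rewrite above, as data.
grammar = {
    "<leftside>": [["<right>", "<middle>"]],
    "<middle>":   [["<hand>", "<side>"]],
}

A vertical bar in BNF, introduced below, simply contributes one more entry to a nonterminal's list of alternatives.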

BNF Notation and its Extensions

Consider the following set of context-free production rules for the Kestrel language:

<loop> ::= <while loop>
<loop> ::= <until loop>
<loop> ::= <for loop>

In proper BNF, we combine these 3 distinct production rules into one rule:

<loop> ::= <while loop> | <until loop> | <for loop>

This is just a matter of abbreviation. In formal language theory, this is three production rules. BNF simply permits a more compact representation.

Most of the extensions added to BNF to make the various EBNF notations come from regular expressions. Formally, a regular expression over some set of symbols is defined recursively:

  1. Any single symbol is a regular expression, matching exactly that symbol.

  2. If A and B are regular expressions, the concatenation AB is a regular expression, matching any string matched by A followed by any string matched by B.

  3. If A and B are regular expressions, the alternation A|B is a regular expression, matching anything matched by either A or B.

  4. If A is a regular expression, A* is a regular expression, matching zero or more consecutive strings, each of which is matched by A.

Using parentheses to group, we might write this regular expression to describe strings of dots separated by decimal digits:

.((0|1|2|3|4|5|6|7|8|9).)*

This regular expression matches . and it matches .4. and .1.4.3.7. as well as an infinite number of other strings. The star notation for zero or more repetitions has a name, the Kleene closure.
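
We can check this expression mechanically. In Python's re notation the literal dots must be escaped, but the expression is otherwise the same; a quick sketch:

import re

# The dots-and-digits language: '.' followed by zero or more
# (digit '.') pairs, exactly as written above.
pattern = re.compile(r"\.((0|1|2|3|4|5|6|7|8|9)\.)*")

for s in [".", ".4.", ".1.4.3.7.", "4.", ".."]:
    print(s, bool(pattern.fullmatch(s)))
# True for the first three strings, False for "4." and ".."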

In BNF, from the start, the vertical bar was used to indicate alternatives, and sequences of terminals and nonterminals were used to indicate matching sequences. What is missing is an iterator. The most common way to indicate iteration in extended BNF is with curly braces, so instead of writing a* we write {a}.

This extension to BNF does not change the expressive power of the notation, it only allows compact expression. Consider, for example, this rule from the Kestrel grammar:

<formal parameters> ::= <parameter> { [ "," ] <parameter> }

We can rewrite this, adding new nonterminals, in order to completely eliminate the extensions to BNF:

<formal parameters> ::= <formal list>
<formal list> ::= <parameter>
<formal list> ::= <parameter> [ "," ] <formal list>

Here, we have used recursion where the original used iteration. We have one more extension to BNF in the above rule, square brackets indicating optional elements. We can eliminate this extension by the following rewrite of the final rule above:

<formal list> ::= <parameter> <formal list>
<formal list> ::= <parameter> "," <formal list>

If we wanted to define <formal parameters> in the original un-extended BNF, we can combine the above ideas to give the following:

<formal parameters> ::= "(" <formal list> ")"
<formal list> ::= <parameter>
               |  <parameter> <formal list>
               |  <parameter> "," <formal list>

The advantage of the extended BNF notation is compactness. We can automatically derive the full set of production rules from the compact form, but finding expressive human-oriented names for the new non-terminals introduced during this expansion of the grammar is a job for humans.
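
That derivation is simple enough to sketch. Assume the dictionary representation shown earlier, with an EBNF repetition { ... } marked by a nested tuple beginning with "repeat"; these conventions are ours. This version expands each repetition into a fresh nonterminal with an empty alternative, a variant of the recursive rewrite shown above:

counter = 0

def fresh(base):
    global counter                   # number the invented nonterminals
    counter += 1
    return "<" + base.strip("<>") + " " + str(counter) + ">"

def expand(grammar):
    out = {}
    for lhs, alternatives in grammar.items():
        out.setdefault(lhs, [])
        for rhs in alternatives:
            new_rhs = []
            for sym in rhs:
                if isinstance(sym, tuple) and sym[0] == "repeat":
                    rep = fresh(lhs)
                    body = list(sym[1:])   # nested { } groups would need
                                           # a recursive pass, omitted here
                    out[rep] = [[], body + [rep]]   # rep ::= empty | body rep
                    new_rhs.append(rep)
                else:
                    new_rhs.append(sym)
            out[lhs].append(new_rhs)
    return out

g = {"<formal parameters>":
     [["<parameter>", ("repeat", '","', "<parameter>")]]}
print(expand(g))   # invents the new nonterminal <formal parameters 1>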

The worst grammar rewrite tools tend to invent names such as <a>, <b> and <c>, with no reference to the names used in the original grammar. Better rewrite tools tend to name new nonterminals after the nonterminal on the left-hand side of the rule where the new nonterminal is introduced, so they might invent, for example, <formal parameter list a> in the above context. It takes human intervention to pick better names.

Our next job is to discuss parsing. As we discuss that, we will see that each parsing algorithm tends to favor one or another grammar for the language. Compact presentations with fewer nonterminals may be better for humans in some contexts, but in other contexts, some nonterminals may have obvious semantic content and a human reader may prefer to preserve them in the grammar even when they have no particular value to an explanation of the syntax. For this reason, it is handy to seek out tools for grammar rewriting, both tools to expand EBNF grammars to their primitive BNF form, and tools to attempt to tighten up sprawling grammars by eliminating extraneous nonterminals and redundant production rules.

An Example Grammar

Consider the following simplified version of the grammar for Kestrel expressions:

<expression> ::= <term> { ( '+' | '-' ) <term> }
<term> ::= <factor> { ( '*' | '/' ) <factor> }
<factor> ::= [ '-' ] ( <number> | '(' <expression> ')' )

(The simplifications we made eliminated comparison and logical operators, variables, and non-numeric constants.)

The question we face is: how do we construct a parser that can take an input string and match it against this grammar? A human working at parsing a string might proceed as follows, rewriting the string after each step:

-    4    *    5    +    6    -   (   7   +   8   )   /    9
  factor  *  factor +  factor -   (factor + factor)   /  factor
        term        +   term  -   ( term  +  term )   /  factor
        term        +   term  -   (  expression   )   /  factor
        term        +   term  -        factor         /  factor
        term        +   term  -                     term
                          expression

Unfortunately, this view of the problem does not give us an algorithm. People can take a breadth-first approach to problems like this, doing large numbers of substitutions along the entire string, and then working on the substituted string. Note, however, that what we have shown above can also be described by a tree:

-    4    *    5    +    6    -   (   7   +   8   )   /    9
 \   |    |    |    |    |    |   |   |   |   |   |   |    |
  factor  |  factor |  factor |   |factor | factor|   |  factor
       \  |  /      |    |    |   |   |   |   |   |   |  /
        term        |   term  |   | term  |  term |  /  /
          |          \   |    |   |     \ | /    /  /  /
           \          \   \   |    \ expression /  /  /
            \_______   \   \   \    \_    |   _/  /  /
                    \   \   \   \     \   |  /  _/  /
                     \   \   \   |     factor  /  _/
                      \   \   |  |          \ /  /
                       \   |  |  |          term
                        \  |  |  |   ________/
                         \ |  |  |  /
                          expression

Such a tree is called a parse tree. This parse tree exactly reflects the syntactic relationships implied by the above grammar, but it says little about one important issue. Our top-level expression contains two operators, '+' and '-'. What is the order of evaluation of operators that occur at the same level in the expression?

Refining the Grammar

The above grammar is loaded with extended BNF constructs, and these obscure several issues. Let's remove them! There are, unfortunately, two distinct ways to do this (not counting mixed approaches). We could expand our grammar to the following, which is referred to as a left-recursive grammar because the recursive nonterminal appears at the left end of each recursive right-hand side:

<expression> ::= <term>
              |  <expression> '+' <term>
              |  <expression> '-' <term>
<term> ::= <factor>
        |  <term> '*' <factor>
        |  <term> '/' <factor>
<factor> ::= <number>
          |  '-' <number>
          |  '(' <expression> ')'
          |  '-' '(' <expression> ')'

Alternatively, we could produce this grammar, which is right recursive:

<expression> ::= <term>
              |  <term> '+' <expression>
              |  <term> '-' <expression>
<term> ::= <factor>
        |  <factor> '*' <term>
        |  <factor> '/' <term>
<factor> ::= <number>
          |  '-' <number>
          |  '(' <expression> ')'
          |  '-' '(' <expression> ')'

Which is correct? First and foremost, these two grammars describe the exact same set of strings and therefore they describe the same language, so long as you define the language only as the set of strings, without ascribing any meanings to the strings.

The problem we face is this: These two grammars give distinctly different interpretations to such expressions as "5*6/7". The left-recursive grammar above gives the left tree below, while the right-recursive grammar gives the tree to the right:

    5    *    6    /    7          5    *    6    /    7
    |    |    |    |    |          |    |    |    |    |
 factor  | factor  | factor     factor  | factor  | factor
    |    |    |    |   /            \   |    |    |    |
  term   |   /    /   /              \   \    \   |  term
     \_  |  /    /   /                \   \    \  |  _/
       \ | /    /   /                  \   \    \ | /
        term   /   /                    \   \   term 
          \   / __/                      \__ \   /
           \ | /                            \ | / 
           term                              term
             |                                |
        expression                        expression

Unless we apply some very complex semantic rules, the first grammar above and the left parse tree imply left-to-right evaluation of operators, giving the value 4 (with integer division), while the second grammar and the right parse tree imply right-to-left evaluation, giving the value 0.

(Semantic rules can be formulated that give either value for either parse tree, but to do this, we have to interpret the value of each subtree as something quite different from a simple integer value, for example, as a list, with complex semantics at each level for constructing new lists, and an evaluator at the very top level that actually applies the operators.)
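
The arithmetic itself is easy to check; in Python, where // is the integer division operator, the two groupings give:

# The two trees above, reduced to explicit parentheses.
print((5 * 6) // 7)   # left-to-right grouping, the left tree:  30 // 7 == 4
print(5 * (6 // 7))   # right-to-left grouping, the right tree: 5 * 0   == 0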

Shift-reduce Parsing

The basic shift-reduce parser operates by scanning the input text from end to end (left to right, in European languages, but right to left in Arabic or Hebrew). In addition to the input text, the parser has a push-down stack, and the ability to examine the top elements of the stack. In the following figures, we will show the stack top to the right, separated from the input text by a double dagger mark (‡). Consider this input text, shown with an empty stack:

‡ - 4 * 5 + 6 - ( 7 + 8 ) / 9

We will parse this against the following grammar, the left-recursive grammar presented previously. In order to simplify the presentation, we'll number the production rules:

 1  <expression> ::= <term>
 2                |  <expression> '+' <term>
 3                |  <expression> '-' <term>
 4  <term> ::= <factor>
 5          |  <term> '*' <factor>
 6          |  <term> '/' <factor>
 7  <factor> ::= <number>
 8            |  '-' <number>
 9            |  '(' <expression> ')'
10            |  '-' '(' <expression> ')'

Our parser begins by reading one symbol from the input and pushing it onto the stack:

- ‡ 4 * 5 + 6 - ( 7 + 8 ) / 9

Our grammar does not have any production rules that match a single minus sign, so we repeat, scanning forward another lexeme:

- <number> ‡ * 5 + 6 - ( 7 + 8 ) / 9

At this point, we can apply rule 8, reducing the stack to this:

<factor> ‡ * 5 + 6 - ( 7 + 8 ) / 9

A factor standing alone is a term, according to rule 4, so we can also do this reduction:

<term> ‡ * 5 + 6 - ( 7 + 8 ) / 9

How did we know not to apply rule 1 to the above? The answer is, we peeked ahead at the next lexeme, the '*', and noticed that <term> '*' shows up at the start of another rule, but <expression> '*' does not.

Now, we shift in another symbol:

<term> * ‡ 5 + 6 - ( 7 + 8 ) / 9

No rules match at this point, so we shift in yet another symbol:

<term> * <number> ‡ + 6 - ( 7 + 8 ) / 9

At this point, we can apply rule 7 to reduce the number to a factor:

<term> * <factor> ‡ + 6 - ( 7 + 8 ) / 9

Now, we can apply rule 5:

<term> ‡ + 6 - ( 7 + 8 ) / 9

This time, we can apply rule 1 because the next lexeme is '+' and <expression> '+' does show up at the head of a rule. This gives us:

<expression> ‡ + 6 - ( 7 + 8 ) / 9

And then shift twice:

<expression> + ‡ 6 - ( 7 + 8 ) / 9
<expression> + <number> ‡ - ( 7 + 8 ) / 9

And then apply rules 7 and 4 and then 2:

<expression> + <factor> ‡ - ( 7 + 8 ) / 9
<expression> + <term> ‡ - ( 7 + 8 ) / 9
<expression> ‡ - ( 7 + 8 ) / 9

Now, we shift 3 times before we find any rules we can apply:

<expression> - ‡ ( 7 + 8 ) / 9
<expression> - ( ‡ 7 + 8 ) / 9
<expression> - ( <number> ‡ + 8 ) / 9

At this point, we can apply rules 7 and 4 and then 1:

<expression> - ( <factor> ‡ + 8 ) / 9
<expression> - ( <term> ‡ + 8 ) / 9
<expression> - ( <expression> ‡ + 8 ) / 9

No rules apply here, so we shift and shift again:

<expression> - ( <expression> + ‡ 8 ) / 9
<expression> - ( <expression> + <number> ‡ ) / 9

At this point, we can apply rules 7, 4 and 2:

<expression> - ( <expression> + <factor> ‡ ) / 9
<expression> - ( <expression> + <term> ‡ ) / 9
<expression> - ( <expression> ‡ ) / 9

No rules apply here, so we shift, allowing us to apply rules 9 and 4:

<expression> - ( <expression> ) ‡ / 9
<expression> - <factor> ‡ / 9
<expression> - <term> ‡ / 9

Again, we did not apply rule 1 because we looked ahead and noticed that <term> '/' is the start of a rule, while <expression> '/' is not.

At this point, we shift twice, and then apply rules 7, 6 and 3:

<expression> - <term> / ‡ 9
<expression> - <term> / <number> ‡
<expression> - <term> / <factor> ‡
<expression> - <term> ‡
<expression> ‡

And we are done! No more reductions are possible and there is no further input to shift into the stack. If we reach this state with the stack containing the sentence symbol of our grammar, the input was syntactically correct. If we reach this state with anything else on the stack, the input contained a syntax error. Unfortunately, this parsing method does not directly produce any hint about the nature of the error.
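
The procedure just traced is small enough to sketch as a program. The following is our own sketch in Python, not a generated parser and not part of any Kestrel tool: the production rules are data, tried longest right-hand side first, and the lookahead and unary-minus checks are hard-coded the way the trace reasoned about them. A table-driven parser, discussed next, would derive such checks mechanically.

# A sketch of the shift-reduce loop traced above.
RULES = [   # (rule number, left-hand side, right-hand side)
    (10, "<factor>",     ["-", "(", "<expression>", ")"]),
    (9,  "<factor>",     ["(", "<expression>", ")"]),
    (6,  "<term>",       ["<term>", "/", "<factor>"]),
    (5,  "<term>",       ["<term>", "*", "<factor>"]),
    (3,  "<expression>", ["<expression>", "-", "<term>"]),
    (2,  "<expression>", ["<expression>", "+", "<term>"]),
    (8,  "<factor>",     ["-", "<number>"]),
    (7,  "<factor>",     ["<number>"]),
    (4,  "<term>",       ["<factor>"]),
    (1,  "<expression>", ["<term>"]),
]

def reducible(stack, num, lhs, rhs, look):
    if stack[-len(rhs):] != rhs:
        return False
    # The lookahead rule from the trace: never reduce a <term> away
    # while '*' or '/' follows, since <term> '*' and <term> '/' begin
    # rules 5 and 6.
    if lhs == "<expression>" and look in ("*", "/"):
        return False
    # Rules 8 and 10 treat '-' as a unary minus, legal only at the
    # start of the input or just after '(' or an operator.
    if num in (8, 10) and len(stack) > len(rhs):
        if stack[-len(rhs) - 1] not in ("(", "+", "-", "*", "/"):
            return False
    return True

def parse(tokens):
    stack, rest = [], list(tokens)
    while True:
        look = rest[0] if rest else None
        for num, lhs, rhs in RULES:    # longest right-hand sides first
            if reducible(stack, num, lhs, rhs, look):
                del stack[-len(rhs):]  # reduce: pop the right-hand side,
                stack.append(lhs)      # push the left-hand side
                break
        else:                          # no rule applies: shift
            if not rest:
                return stack == ["<expression>"]
            stack.append(rest.pop(0))

tokens = ["-", "<number>", "*", "<number>", "+", "<number>", "-",
          "(", "<number>", "+", "<number>", ")", "/", "<number>"]
print(parse(tokens))   # True: the input reduces to <expression>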

Tables

If we had not had the idea of looking ahead into the input string, we'd have had to do a backtracking search for a parse. The idea of looking ahead can be formalized by listing, for each production rule, the set of next symbols that permit it to be applied. This is a purely mechanical job, and it is possible to write a program that takes a grammar as input and either produces the table giving, for each rule, the set of next symbols that permit that rule, or rejects the grammar.

Rejecting the grammar does not mean that this method cannot be used to parse the language that this grammar describes, but rather, it means that the grammar itself is not suited to this method. There may be other grammars for the same language that this method can be applied to. If given a "parser construction" tool that produces the parse tables for a shift-reduce parser, the compiler writer will typically need to spend time "massaging" the grammar, reformulating it to meet the constraints of the parser generator.
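
To give the flavor of such a table, here is roughly what the interesting entries for the numbered grammar above might look like, written as a Python dictionary. These lookahead sets are read off the informal reasoning used in the trace; a real parser generator would compute them from the grammar itself.

END = None   # stands for the end of the input

# For each rule number, the lookahead symbols that permit the reduction.
REDUCE_OK = {
    1: {"+", "-", ")", END},   # <expression> ::= <term>
    2: {"+", "-", ")", END},   # <expression> ::= <expression> '+' <term>
    3: {"+", "-", ")", END},   # <expression> ::= <expression> '-' <term>
    # Rules 4 through 10 are safe to apply whatever symbol follows, so
    # their entries would simply list every symbol that can follow them.
}

# The hand-coded lookahead test in the earlier sketch then becomes a
# table lookup: reduce by rule n only when the next symbol is in
# REDUCE_OK[n].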