Lecture 9, Context-Free Grammars

Part of the notes for CS:4908:0004 (22C:196:004)
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Language Categories

Noam Chompsky classified languages into what is now called the Chompsky Hierarchy. There are 4 levels in this hierarchy:

Regular languages. The formal grammar for a regular language can be expressed as a regular expression, and a finite-state machine. can be built to distinguish between valid and invalid strings in any regular language. The lexical level of the Kestrel language is regular, as is the lexical level of almost every programming language in common use today -- if one ignores rules about which digits are legal in what number base, limits on the number of characters in an identifier, and similar details.
Many programmers are familiar with so-called regular expressions found in text editors such as vi. The term has escaped from its original theoretical context into widespread use. Unfortunately, the implementations of regular expressions in text processing systems varies immensely. In some, the expressive power of the expressions is very limited, being much less powerful than Chompsky's level 3 implies, while in others, the expressive power approaches level 2.
Context-free languages. The formal grammar of a context-free language is a context free-grammar. BNF and its common extensions are all notations for expressing context free grammars. A push-down automaton, that is, a finite-state machine augmented with a push-down stack, is sufficient to distinguish between valid and invalid strings in a context-free language. Most programming languages in use today are described by context free grammars, and the parser we will build for Kestrel uses such a grammar.
In fact, most computer languages are more complex than the context-free grammar for the language suggests. The languages we use have context sensitivities. We generally ignore these while we build the parser for the language, and then process the context sensitive details in the semantics code built on top of the parser.
Context sensitive languages are described by context sensitive grammars. Languages in this class can be recognized by what is called a linear bounded automaton, that is, a finite-state machine provided with an auxiliary storage area that has a size proportional to the size of the input string. In effect, the parser must retain the entire input string, and it needs added data structures that have a size proportional to the input string. Unlike context-free languages, there is no push-pop constraint on the use of the auxiliary memory.
Recursively enumerable languages are the largest class. These languages can be recognized by a Turing machine, a finite-state machine augmented with an infinite memory, usually described as an infinite tape that the machine may scan in either direction, reading and writing as it goes. Any programming language that allows construction of open-ended data structures such as linked lists is formally equivalent to a Turing machine, except, of course, that real implementations of such languages are necessarily limited by the fact that we cannot build infinite memories for our computers.

Production Rules

In formal linguistics, a grammar consists of a set of rewrite rules applied to strings of terminal and nonterminal symbols. A string in the language consists only of terminal symbols, while nonterminals, also called metasymbols, are used for describing syntactic elements of the language.

There are two ways of thinking about grammars: The generative viewpoint starts with a single nonterminal, for example, <sentence> and applies a sequence of rewrite rules to this in order to generate an utterance in the language. The opposite viewpoint, parsing, starts with a sentence and applies a series of rewrite rules to it in order to reduce it to a sentence.

Here, we follow the conventions of BNF and enclose the names of nonterminal symbols in angle brackets, for example, <sentence>, while we leave terminal symbols alone. In addition, we write our rewrite rules using the BNF ::= symbol. Consider this rewrite rule from the Kestrel grammar:

<until loop> ::= "do" <block> "until" <expression>

Applied as a generative grammar, we can read this rewrite rule from left to right. In this reading, it allows the nonterminal symbol <until loop> to be replaced with the string of terminal and nontermial symbols do <block> until <expression> The generative view of grammars leads us to refer to each of these rewrite rules as a production rule or, in short, a production.

(Note, in the Kestrel grammar, terminal symbols are quoted in order to distinguish them from metasymbols. Thus, { is a metasymbol, the opening brade indicating that the following material is to be repeated zero or more times, while "{" is a terminal symbol, an opening brade that may appear in the text of a Kestrel program. Many EBNF users rely on the reader's intuition to distinguish metasymbols from terminal symbols, while some use a different type font.)

Alternatively, we can apply this production rule in parsing by reading it from right to left. In this case, wherever we encounter do <block> until <expression>, we may replace it with <until loop>. In this case, we read the rule as a reduciton rule, since it reduces a string of mixed terminal and nonterminal symbols to a nonterminal symbol.

The grammars for Chompsky type zero languages consist of unconstrained rules. Both the left-hand and right-hand sides of production rules may consist of any number of terminal and nonterminal symbols. Thus, if you see a grammar that has rules looking like this, it may describe a type zero language:

a <b> c ::= d <e> f <g>

For a language to be a context free language, it must be possible to give a grammar for that language where every production rule has just one nonterminal symbol on its left-hand side. The grammar for Kestrel is given in this form, and in fact, all properly formed BNF grammars can be reduced to this form.

It is important to note that there are an infinite number of grammars for any language. Consider this production rule:

<leftside> ::= <right> <hand> <side>

If we introduce the new nonterminal <middle>, we may replace this one rule with this:

<leftside> ::= <middle>
<middle> ::= <right> <hand> <side>

Or we might opt, instead, to replace it with this:

<leftside> ::= <right> <middle>
<middle> ::= <hand> <side>

The two alternatives shown above are entirely equivalent to the original in the sense that they describe the same language, or in other words, they generate the same set of strings. They are different only because they use different sets of production rules to do so.

BNF Notation and its Extensions

Consider the following set of context-free production rules for the Kestrel language:

<loop> ::= <while loop>
<loop> ::= <until loop>
<loop> ::= <for loop>

In proper BNF, we combine these 3 distinct production rules into one rule:

<loop> ::= <while loop> | <until loop> | <for loop>

This is just a matter of abbreviation. In formal language theory, this is three production rules. BNF simply permits a more compact representation.

Most of the extensions of added to BNF to make the various EBNF notations come from regular expressions. Formally, a regular expression is defined as:

a terminal symbol, a
the concatenation of two regular expressions, ab
the selection between two regular expressions, a|b
zero or more repetitions of a regular expressions, a*

Using parentheses to group, we might write this regular expression to describe strings of dots separated by decimal digits:

.((0|1|2|3|4|5|6|7|8|9).)*

This regular expression matches . and it matches .4. and .1.4.3.7. as well as an infinite number of other strings. The star notation for zero or more repetitions has a name, the Kleene closure.

In BNF, from the start, the vertical bar was used to indicate alternatives, and sequences of terminals and nonterminals were used to indicate matching sequences. What is missing is an iterator. The most common way to indicate iteration in extended BNF is with curly braces, so instead of writing a* we write {a}.

This extension to BNF does not change the expressive power of the notation, it only allows compact expression. Consider, for example, this rule from the Kestrel grammar:

<formal parameters> ::= "(" <parameter> { [ "," ] <parameter>} ")"

We can rewrite this, adding new nonterminals, in order to completely eliminate the extensions to BNF:

<formal parameters> ::= "(" <formal parameter list> ")"
<formal parameter list> ::= <parameter>
<formal parameter list> ::= <parameter> [ "," ] <formal parameter list>

Here, we have used recursion where the original used iteration. We have one more extension to BNF in the above rule, square brackets indicating optional elements. We can eliminate this extension by the following rewrite of the final rule above:

<formal parameter list> ::= <parameter> <formal parameter list>
<formal parameter list> ::= <parameter> "," <formal parameter list>

Therefore, if we wanted to define <formal parameters> in the original un-extended BNF, we might have written it as follows:

<formal parameters> ::= "(" <formal parameter list> ")"
<formal parameter list> ::= <parameter>
                         |  <parameter> <formal parameter list>
                         |  <parameter> "," <formal parameter list>

The advantage of the extended BNF notation is compactness. We can automatically derive the full set of production rules from the compact form, but finding expressive human-oriented names for the new non-terminals introduced during this expansion of the grammar is a job for humans.

The worst grammar rewrite tools tend to invent names such as <a>, <b> and <c>, with no reference to the names used in original grammar. Better rewrite tools tend to name new nonterminals after the nonterminal on the left-hand side of the rule where the new nonterminal is introduced, so they might invent, for example, <formal parameter list a> in the above context. It takes human intervention to pick better names.

Our next job is to discuss parsing. As we discuss that, we will see that each parsing algorithm tends to favor one or another grammar for the language. Compact presentations with fewer nonterminals may be better for humans in some contexts, but in other contexts, some nonterminals may have obvious semantic content and a human reader may prefer to preserve them in the grammar even when they have no particular value to an explanation of the syntax. For this reason, it is handy to seek out tools for grammar rewriting, both tools to expand EBNF grammars to their primitive BNF form, and tools to attempt to tighten up sprawling grammars by eliminating extraneous nonterminals and redundant production rules.