Lecture 12, Recursive Descent Parsing

Part of the notes for 22C:196:002 (CS:4908:0002)
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

An Example Grammar

Consider the same old example grammar:

<expression> ::= <term> { ( '+' | '-' ) <term> }
<term> ::= <factor> { ( '*' | '/' ) <factor> }
<factor> ::= [ '-' ] ( <number> | '(' <expression> ')' )

We can convert this grammar to a railroad diagram, just as we did with the description of the lexical level of the language. There is a difference. At the lexical level, there was no recursion, so there was just one railroad diagram that cycled until the end of file. Here, we have recursion, so each rule converts to a separate chunk of railroad:

[figure: railroad diagrams, one per rule, for <expression>, <term> and <factor>]

Recursive Descent

The basic idea of recursive descent parsing is to take the railroad diagram (or equivalently, the EBNF description of the grammar) as a flow chart for the language processing program. Perhaps the easiest way to see this is to think in terms of a recursive descent generator instead of a parser. Consider generating random expressions with the following code, code that directly follows the control structures implied by the EBNF or railroad diagrams above:

void expression() {
        term();
        while (randomlytrue()) {
                if (randomlytrue()) {
                        printf( " +" );
                } else {
                        printf( " -" );
                }
                term();
        }
}

void term() {
        factor();
        while (randomlytrue()) {
                if (randomlytrue()) {
                        printf( " *" );
                } else {
                        printf( " /" );
                }
                factor();
        }
}

void factor() {
        if (randomlytrue()) {
                printf( " -" );
        }
        if (randomlytrue()) {
                printf( " %u", randomvalue() );
        } else {
                printf( " (" );
                expression();
                printf( " )" );
        }
}

Each call to expression() above will generate, to the standard output file, a new random expression -- as random as the seeding of the pseudo-random number generator behind the random() function permits.
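The lecture leaves the bodies of randomlytrue() and randomvalue() unspecified; one plausible sketch, built on the POSIX random() function, is the following (the 40% bias is an arbitrary choice, not from the lecture):

```c
#include <stdlib.h>

/* plausible definitions of the helpers assumed by the generator
   above -- one sketch among many, built on POSIX random() */

/* true with some fixed probability; biasing the coin toward false
   keeps the recursive generator from running away */
int randomlytrue() {
        return (random() % 100) < 40;   /* true about 40% of the time */
}

/* a small random value to print as a <number> lexeme */
unsigned int randomvalue() {
        return (unsigned int) (random() % 100);
}
```

In practice, each choice point in the generator would likely want its own bias; a single fixed probability is merely the simplest version.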

Running this code with appropriate probability distributions for each of the random choices gives output like this (from an actual run, one expression per line):

 ( 93 )
 - 27
 40 / - 68 - - 23
 - 22
 ( 42 / - 84 ) * 26
 ( ( ( - 84 * - 29 / - ( 14 ) ) ) / ( 51 ) )
 - 26
 - ( 78 - - 92 ) / 65
 - 29 - ( 27 )
 6 + - ( - 3 / 15 ) + - 51 / ( 41 ) * 14
 - ( - 59 )
 - ( - ( - 95 * - ( - 43 * - ( - 63 * - 4 * 17 / - 70 ) ) * 39 ) )

If we'd done the same thing with a grammar for a language like English, we'd have created a nonsense sentence generator that might produce sentences such as "Colorless green ideas sleep furiously." Grammatically correct, but utterly free of meaning.

Recursive Descent Parsing

Converting this code to a parser is simply a matter of replacing the code to produce output with code that consumes input, plus finding a way to decide what branch in the code to follow. The latter is the key problem.

The solution to this problem is found in the previous lecture: for each branch in the code above, the action taken along that branch begins by consuming some terminal or nonterminal symbol. Therefore, at the point where the branch is taken, the current lexeme must be in the start set for that symbol (for a terminal symbol, the start set is just the symbol itself).

Consider a lexical analyzer with the following interface:

extern lexeme lex_this; /* the current lexeme */
extern lexeme lex_next; /* the lexeme after lex_this */
void lex_advance();     /* advance the two-lexeme window */
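The lexeme type itself is left abstract here. One plausible layout, consistent with the lex_this.type and lex_this.value fields used later in these notes, is the following (the enumeration members are illustrative assumptions):

```c
/* one plausible lexeme representation -- the field names match the
   lex_this.type / lex_this.value references later in these notes,
   but the enumeration members are illustrative assumptions */
typedef enum {
        LEX_NUMBER,   /* a numeric constant; value holds its value */
        LEX_PUNC,     /* punctuation; value holds a PUNC_ code     */
        LEX_EOF       /* end of input                              */
} lexeme_type;

typedef struct {
        lexeme_type  type;
        unsigned int value;
} lexeme;
```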

Recall that we identified the following start sets for the nonterminals in our grammar.

<expression>
starts with one of { <number>, '-', '(' }

<term>
starts with one of { <number>, '-', '(' }

<factor>
starts with one of { <number>, '-', '(' }

We can put this together to give us the following expression parser:

void expression() {
        term();
        while ((lex_this == "+") || (lex_this == "-")) {
                if (lex_this == "+") {
                        consume( "+" );
                } else {
                        consume( "-" );
                }
                term();
        }
}

void term() {
        factor();
        while ((lex_this == "*") || (lex_this == "/")) {
                if (lex_this == "*") {
                        consume( "*" );
                } else {
                        consume( "/" );
                }
                factor();
        }
}

void factor() {
        if (lex_this == "-") {
                consume( "-" );
        }
        if (lex_this == number) {
                consume( number );
        } else {
                consume( "(" );
                expression();
                consume( ")" );
        }
}

The above parser will correctly parse all correct expressions, and if we code the consume() routine correctly, it will detect syntax errors in a useful way:

void consume( lexeme l ) {
        if (lex_this == l) {
                lex_advance();
        } else {
                printf( "found %s when %s expected\n", lex_this, l );
        }
}

This code makes an effort to output a reasonable error message, but it makes no effort to recover from the error. Sensible recovery requires a bit more work.
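As one hedged sketch of what that extra work might look like, consider recovery by insertion: on a mismatch, report the missing lexeme but do not advance, so the parse continues as if the expected lexeme had been present. The toy code below is self-contained, uses plain strings as lexemes in the same deliberately careless style as the parser above, and is not from the lecture:

```c
#include <stdio.h>
#include <string.h>

/* a self-contained toy illustrating recovery by insertion; the
   input stream "( 1 2 )" is missing an operator between 1 and 2 */
static const char *input[] = { "(", "1", "2", ")" };
static int pos = 0;
static int nerrors = 0;

static const char *lex_this_str() {
        return (pos < 4) ? input[pos] : "<eof>";
}

static void lex_advance() { pos++; }

static void consume( const char *l ) {
        if (strcmp( lex_this_str(), l ) == 0) {
                lex_advance();
        } else {
                nerrors++;
                printf( "found %s when %s expected\n", lex_this_str(), l );
                /* recovery by insertion: deliberately do not advance,
                   so one missing token does not derail the parse */
        }
}
```

Asked to consume "(", "1", "+", "2", ")", this code reports one error at the missing "+" and then finishes the rest of the input cleanly.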

In the above code, we have been deliberately careless about the type of the lexeme objects, acting as if we could compare lexemes with strings. In fact, lexemes will typically be complex data objects, so the comparison we wrote as:

while ((lex_this == "*") || (lex_this == "/")) {

will probably have to be written as something harder to read such as:

while ( (lex_this.type == LEX_PUNC)
&&      ( (lex_this.value == PUNC_MUL) || (lex_this.value == PUNC_DIV) )
) {

Programming Tricks

This leads naturally to a strong market for parsing support routines that make it easy to organize sets of lexemes and test for set membership. Consider the following defines, based on the assumption that all punctuation marks are organized in an enumeration and that the range of this enumeration is smaller than the word size:

#define ispunc( lex, set ) ( ((lex).type == LEX_PUNC) && ((1 << (lex).value) & (set)) )
#define makeset2( a, b ) ( (1 << (a)) | (1 << (b)) )

Now, we can rewrite the above ugly predicate as:

while (ispunc( lex_this, makeset2( PUNC_MUL, PUNC_DIV ) )) {

The C and C++ preprocessor macro facilities are weak enough that we can't make a general set constructor macro. Instead, we must make constructor macros for each size of set we might encounter. It is easy enough to make such macros, and the number of different set sizes will be limited. We can use a very similar mechanism for identifying any of a set of keywords.
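Concretely, the family of constructor macros might look like this; makeset2 is from the notes above, while makeset1, makeset3 and the enumeration are illustrative extensions:

```c
/* constructor macros for each set size we need -- makeset2 appears
   above; makeset1 and makeset3 are the obvious extensions */
#define makeset1( a )       (1 << (a))
#define makeset2( a, b )    ( makeset1(a) | makeset1(b) )
#define makeset3( a, b, c ) ( makeset2(a,b) | makeset1(c) )

/* an illustrative punctuation enumeration, assumed small enough
   that each member fits as a bit position in one machine word */
enum {
        PUNC_PLUS, PUNC_MINUS, PUNC_MUL, PUNC_DIV,
        PUNC_LPAREN, PUNC_RPAREN
};
```

Membership testing then reduces to one shift, one AND and a comparison, regardless of the size of the set.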

Limitations

Recursive descent parsing cannot handle left-recursive grammars: If we have a production rule like this, we are lost:

<thing> ::= <thing> <otherthings>
         |  <last thing>

The problem is that we cannot determine when to stop the recursion on such a rule: the parser would call thing() again before consuming any input, recursing forever. Any such grammar may be rewritten, however, so that it is not left recursive. In EBNF, the rule above becomes:

<thing> ::= <last thing> { <otherthings> }

In some cases, we cannot decide which rule to apply, based solely on the current lexeme. In such cases, we can sometimes resolve the problem by looking ahead at the next lexeme. This is why the lexical analyzer interface we have described included a two-lexeme window into the input string, with lex_this being the current lexeme, and lex_next being its successor. Unfortunately, one-symbol lookahead is not always sufficient. With LR(1) parsers, we could prove that any grammar that required k-symbol lookahead could be rewritten so that only one symbol of lookahead is required. In the case of recursive descent, or LL(k) parsers, there are LL(2) grammars that cannot be reduced to LL(1) form. This is a comparatively weak parsing method! It is, however, strong enough for most programming languages.
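As an illustrative sketch of how the lookahead window resolves such a choice (the statement forms here are hypothetical, not part of our expression grammar): in a language where both assignments and procedure calls begin with an identifier, lex_this alone cannot pick the rule, but lex_next can:

```c
#include <string.h>

/* toy illustration, with lexemes as plain strings: both rules start
   with an identifier, so the current lexeme cannot choose between
   them, but one lexeme of lookahead can */
static int is_assignment( const char *lex_this, const char *lex_next ) {
        (void) lex_this;        /* same start set for either rule */
        return strcmp( lex_next, "=" ) == 0;
}
```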

Furthermore, a recursive descent parser does not require complex support software to maintain. So long as it is written in a programming language that supports recursion, the source code for the parser can be constructed to directly reflect the grammar as it would be expressed in railroad diagrams or modest extensions of BNF. In the context of a compiler construction course, this means we can move onward to a discussion of semantics without spending excessive time on the intricacies of shoehorning a grammar into the form required by a parser generator.