Lecture 12, Recursive Descent Parsing

Part of the notes for CS:4980:1
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

An Example Grammar

Consider the same example grammar we've been working with:

<expression> ::= <term> { ( '+' | '-' ) <term> }
<term> ::= <factor> { ( '*' | '/' ) <factor> }
<factor> ::= [ '-' ] ( <number> | '(' <expression> ')' )

We can convert this grammar to a railroad diagram, just as we did with the description of the lexical level of the language. There is a difference. At the lexical level, there was no recursion, so there was just one railroad diagram that cycled until the end of file. Here, we have recursion, so each rule converts to a separate chunk of railroad:

[railroad diagrams, one for each of the three rules above]

Recursive Descent

The basic idea of recursive descent parsing is to take the railroad diagram (or equivalently, the EBNF description of the grammar) as a flow chart for the language processing program. Perhaps the easiest way to see this is to think in terms of a recursive descent generator instead of a parser. Consider generating random expressions with the following code, code that directly follows the control structures implied by the EBNF or railroad diagrams above:

#include <stdio.h>

int randomlytrue();       /* helpers, sketched below */
unsigned int randomvalue();
void term();              /* forward declarations; the three */
void factor();            /* generator routines are mutually recursive */

/* <expression> ::= <term> { ( '+' | '-' ) <term> } */
void expression() {
        term();
        while (randomlytrue()) {
                if (randomlytrue()) {
                        printf( " +" );
                } else {
                        printf( " -" );
                }
                term();
        }
}

/* <term> ::= <factor> { ( '*' | '/' ) <factor> } */
void term() {
        factor();
        while (randomlytrue()) {
                if (randomlytrue()) {
                        printf( " *" );
                } else {
                        printf( " /" );
                }
                factor();
        }
}

/* <factor> ::= [ '-' ] ( <number> | '(' <expression> ')' ) */
void factor() {
        if (randomlytrue()) {
                printf( " -" );
        }
        if (randomlytrue()) {
                printf( " %u", randomvalue() );
        } else {
                printf( " (" );
                expression();
                printf( " )" );
        }
}

Each call to expression() above will generate, to the standard output file, a new random expression -- as random as the seeding of the pseudo-random number generator behind the randomlytrue() and randomvalue() helpers permits.
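Those two helpers are left undefined above. A minimal sketch of what they might look like, built on the standard C rand() function; the one-in-three probability and the 0 to 99 value range are arbitrary choices made for this example, not part of the original code:

#include <stdlib.h>

int randomlytrue() {
        /* true about one time in three; higher probabilities make
           deep recursion (and very long expressions) more likely */
        return (rand() % 3) == 0;
}

unsigned int randomvalue() {
        return rand() % 100;    /* a number from 0 to 99 */
}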

Running this code with appropriate probability distributions for each of the random choices gives output like this (from an actual run, one expression per line):

 ( 93 )
 - 27
 40 / - 68 - - 23
 - 22
 ( 42 / - 84 ) * 26
 ( ( ( - 84 * - 29 / - ( 14 ) ) ) / ( 51 ) )
 - 26
 - ( 78 - - 92 ) / 65
 - 29 - ( 27 )
 6 + - ( - 3 / 15 ) + - 51 / ( 41 ) * 14
 - ( - 59 )
 - ( - ( - 95 * - ( - 43 * - ( - 63 * - 4 * 17 / - 70 ) ) * 39 ) )

If we'd done the same thing with a grammar for a language like English, we'd have created a nonsense sentence generator that might produce sentences such as "Colorless green ideas sleep furiously." Grammatically correct, but utterly free of meaning.

Recursive Descent Parsing

Converting this code to a parser is simply a matter of replacing the code that produces output with code that consumes input, plus finding a way to decide which branch in the code to follow. The latter is the key problem.

The solution to this problem is found in the previous lecture: for each branch in the code above, the action taken along that branch begins by consuming some terminal or nonterminal symbol. Therefore, at the point where the branch is taken, the current lexeme must be that terminal, or a member of the start set for that nonterminal.

Consider a lexical analyzer with the following interface:

extern lexeme lex_this; /* the current lexeme */
void lex_advance(); /* advance to the next lexeme */

Recall that we identified the following start sets for the nonterminals in our grammar:

<expression>
starts with one of { <number>, '-', '(' }

<term>
starts with one of { <number>, '-', '(' }

<factor>
starts with one of { <number>, '-', '(' }

We can put this together to give us the following expression parser:

void expression() {
        term();
        while ((lex_this == "+") || (lex_this == "-")) {
                if (lex_this == "+") {
                        consume( "+" );
                } else {
                        consume( "-" );
                }
                term();
        }
}

void term() {
        factor();
        while ((lex_this == "*") || (lex_this == "/")) {
                if (lex_this == "*") {
                        consume( "*" );
                } else {
                        consume( "/" );
                }
                factor();
        }
}

void factor() {
        if (lex_this == "-") {
                consume( "-" );
        }
        if (lex_this == number) {
                consume( number );
        } else {
                consume( "(" );
                expression();
                consume( ")" );
        }
}

In the above code, we've been very sloppy about interaction with the lexical level. We wrote things like if(lex_this=="+") when, in the context of the lexical analyzer we've been writing, we ought to have written a far more complex bit of code, probably something like this:

if ((lex_this.type == LEX_PUNC) && (lex_this.value == PT_PLUS))

Another short-hand notation we used is to write things like consume("+"). In a context where we already know that the current lexeme is "+", we could just call lex_advance(). This covers all but two of the cases above. In each of those cases, we could write a consume function something like the following:

void consume( lex_types t, uint32_t v ) {
        if ((lex_this.type == t) && (lex_this.value == v)) {
                lex_advance();
        } else {
                ... an appropriate error message ...
        }
}

This code outputs an error message, but it makes no effort to recover from the error. Sensible recovery requires a bit more work.
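Here is a sketch of one crude form of recovery, so-called panic mode: report the error, then discard input until the expected lexeme (or end of file) turns up. The error_report() helper and the LEX_EOF lexeme type are assumed names, not part of the interface described above:

void consume( lex_types t, uint32_t v ) {
        if ((lex_this.type == t) && (lex_this.value == v)) {
                lex_advance();
        } else {
                error_report( "unexpected lexeme" ); /* assumed helper */
                /* panic mode: skip lexemes until we find the one we
                   wanted or hit end of file */
                while ( (lex_this.type != LEX_EOF)
                &&     !((lex_this.type == t) && (lex_this.value == v)) ) {
                        lex_advance();
                }
                if (lex_this.type != LEX_EOF) lex_advance();
        }
}

This is far from ideal -- discarding text can cascade into more error messages -- but it guarantees that the parser makes progress instead of looping forever on bad input.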

Sets

As our compiler develops, we will find cases where we need to check if a lexeme is in a fairly large start set. In the above, the worst of these bits of code was:

while ((lex_this == "*") || (lex_this == "/")) {

Rewriting this properly gives us something that is already unwieldy:

while (  (lex_this.type == LEX_PUNC)
&&       ( (lex_this.value == PT_TIMES) || (lex_this.value == PT_DIV) )
) {

For larger sets, things get very messy very quickly. We can solve this by directly implementing a set membership tool using a C (and C++) header file. Effectively, this file creates a new abstract type, set32_t. Values of this type can hold sets over any domain containing 32 or fewer elements with integer representations.

/* sets.h -- a fast lightweight implementation of set operations */
#include <stdint.h>

typedef uint32_t set32_t;

/* set32_t to_set32( int i ) */
#define to_set32(i)   (((set32_t)1)<<(i))
/* construct a single-member set32_t value from one integer or enumeration */

/* bool in_set32( int i, set32_t s ) */
#define in_set32(i,s) (to_set32(i) & (s))
/* test if integer in set32_t, returns nonzero if so, zero if not */

Why did we need to use the cast (set32_t)1 above instead of the simple constant 1? The reason is that the constant 1 is of type int, which is not guaranteed to be 32 bits. It could legally be as narrow as 16 bits, in which case 1<<16 is undefined behavior; on most machines, the one bit would simply be shifted off the end of the number representation, leaving zero.

With these definitions, the set {1} is created with to_set32(1), and the set {1,5} is created with to_set32(1)|to_set32(5). Because of the way we have used integers to represent sets, we can use the bitwise or (|) and and (&) operators to perform the set union and intersection operations. The notation is verbose, but a few extra defines for constructing 2, 3 and 4 element sets can help make things look nicer:

/* set32_t to_set32_2( int i, int j ) */
#define to_set32_2(i,j)     (to_set32(i) | to_set32(j))
/* construct set32_t value from 2 integers */

/* set32_t to_set32_3( int i, int j, int k ) */
#define to_set32_3(i,j,k)   (to_set32(i) | to_set32(j) | to_set32(k))
/* construct set32_t value from 3 integers */

/* set32_t to_set32_4( int i, int j, int k, int l ) */
#define to_set32_4(i,j,k,l) (to_set32_2(i,j) | to_set32_2(k,l))
/* construct set32_t value from 4 integers */

Once the basic set architecture is established, these kinds of auxiliary constructors for larger sets can be added as needed -- there is no reason to add them if there is no demand, and they are easy to add, once the pattern is understood.

It would be nice if the C #define macro mechanism were strong enough to write a generalized to_set set constructor that could handle arbitrary numbers of set elements, but it is not. C (and C++) do support a rather expensive mechanism for function calls with variable numbers of parameters, but using it would sacrifice a significant amount of run-time efficiency, because functions that use this mechanism cannot generally be compiled inline. In contrast, macros such as we've defined above can generally be evaluated entirely at compile time, so writing something like to_set32_2(0,3) will actually produce the value 9 at compile time, deferring no computation to run time.
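As a quick illustration of these macros in action, using the punctuation codes from earlier in this lecture:

set32_t addops = to_set32_2( PT_PLUS, PT_MINUS );  /* { +, - } */
set32_t mulops = to_set32_2( PT_TIMES, PT_DIV );   /* { *, / } */
set32_t binops = addops | mulops;                  /* union, via bitwise or */
set32_t common = addops & mulops;                  /* intersection; here, the empty set, 0 */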

Using these definitions, the problematic code at the start of this section can be rewritten as follows:

while (  (lex_this.type == LEX_PUNC)
   &&    in_set32( lex_this.value, to_set32_2( PT_TIMES, PT_DIV ) )
) {  ...

The above code only works if the set of punctuation marks is within the 32-element limit of our set framework. In the case of Kestrel, it is. If we were concerned with a larger universe of punctuation marks, we could write working code along the same lines that exploited the fact that most C and C++ compilers support 64-bit integers as type uint64_t; this would allow us to define type set64_t.
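A sketch of that 64-bit variant, following the to_set32 pattern exactly (the names are by analogy with the 32-bit version, not from an existing header):

typedef uint64_t set64_t;

/* set64_t to_set64( int i ) */
#define to_set64(i)   (((set64_t)1)<<(i))

/* bool in_set64( int i, set64_t s ) */
#define in_set64(i,s) (to_set64(i) & (s))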

Lexical Analysis Support Routines

As it turns out, there will be many places in our parser where we need to check to see if the next lexeme is in a particular set of punctuation marks, and later, once we have support for keywords, there are places where we need to see if the next lexeme is in a particular set of keywords. We can easily extend the set processing ideas above to support tests for membership in such specialized sets:

/* bool lex_ispuncset( lexeme lex, set32_t s ); */
#define lex_ispuncset( lex, s ) (                               \
        ((lex).type == LEX_PUNC) && in_set32( (lex).value, s )  \
)

Where does this code go? To the authors of the parser, it looks like lexical analysis code. To the authors of the lexical analyzer, it looks like user-level code, since the implementations of these routines do not depend in any way on the internal details of the lexical analyzer. Here, the naming convention is appropriate for the outside view. The prefix lex_ on each routine suggests that it's inside the lexical analyzer. It would make sense, however, to put the code in an auxiliary module, for example, lexsupport.h and lexsupport.c.

With this, we can rewrite the code at the end of the last section as:

while (lex_ispuncset( lex_this, to_set32_2( PT_TIMES, PT_DIV ) )) {

We might also write a simpler define, since there are many cases where we are looking for a specific punctuation mark:

/* bool lex_ispunc( lexeme lex, punct_type t ); */
#define lex_ispunc( lex, t ) (                                 \
        ((lex).type == LEX_PUNC) && ((lex).value == (t))       \
)

At the start of this lecture, we wrote things like if(lex_this=="+"). Now, we can rewrite that as if(lex_ispunc(lex_this,PT_PLUS)).

It would make sense to include the consume() function here as well, with specialized versions for punctuation, and later, for keywords. For example:

void lex_forcepunc( punct_type t ) {
        /* force lex_this to be the punctuation mark t and advance over it */
        if (lex_ispunc( lex_this, t )) {
                lex_advance();
        } else {
                /* =BUG= must report what was found in lex_this when t was expected */
        }
}
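Putting these pieces together, the expression() parser from earlier in this lecture can be rewritten entirely in terms of the support routines. A sketch:

void expression() {
        /* <expression> ::= <term> { ( '+' | '-' ) <term> } */
        term();
        while (lex_ispuncset( lex_this, to_set32_2( PT_PLUS, PT_MINUS ) )) {
                lex_advance(); /* the test already told us it is + or - */
                term();
        }
}

Note that the if-else distinguishing + from - in the original is gone; a parser that only checks syntax does not care which operator it consumed. Once we attach semantics, we will need to look at lex_this.value before advancing.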

Limitations

Recursive descent parsing cannot handle left-recursive grammars. If we have a production rule like this, we are lost:

<thing> ::= <thing> <otherthings>
         |  <last thing>

The problem is that the parsing function for <thing> would begin by calling itself without consuming any input, so there is no way to decide when to stop the recursion -- it would simply never terminate. Any such grammar may be rewritten, however, so that it is not left recursive.
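For example, the left-recursive rule above generates one <last thing> followed by any number of <otherthings>, so it can be re-expressed iteratively in the same EBNF style used for expressions at the start of this lecture:

<thing> ::= <last thing> { <otherthings> }

A recursive descent parser handles the iterative form with a simple while loop, exactly as expression() handles its list of terms.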

In some cases, we cannot decide which rule to apply based solely on the current lexeme. In such cases, we can sometimes resolve the problem by looking ahead at the next lexeme. This is why the lexical analyzer interface we have described includes a two-lexeme window into the input string, with lex_this being the current lexeme and lex_next being its successor. Unfortunately, one-symbol lookahead is not always sufficient. For LR parsers, it can be proved that any grammar requiring k-symbol lookahead, that is, any LR(k) grammar, can be rewritten so that only one symbol of lookahead is required. In the case of recursive descent, or LL(k) parsing, there are LL(2) grammars that cannot be reduced to LL(1) form. This is a comparatively weak parsing method! It is, however, strong enough for most programming languages.
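As an illustrative sketch of how the two-lexeme window gets used (the lexeme type LEX_IDENT, the punctuation code PT_EQUALS, and the two parsing routines are hypothetical names, not from our expression grammar): a parser choosing between two rules that both begin with an identifier can peek at lex_next before committing:

/* both rules start with an identifier; one symbol of lookahead
   at lex_next resolves the choice */
if ((lex_this.type == LEX_IDENT) && lex_ispunc( lex_next, PT_EQUALS )) {
        assignment();   /* <identifier> '=' <expression> ... */
} else {
        call();         /* <identifier> ... */
}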

Furthermore, a recursive descent parser does not require complex support software to maintain. So long as it is written in a programming language that supports recursion, the source code for the parser can be constructed to directly reflect the grammar as it would be expressed in railroad diagrams or modest extensions of BNF. In the context of a compiler construction course, this means we can move onward to a discussion of semantics without spending excessive time on the intricacies of shoehorning a grammar into the form required by a parser generator.