Lecture 22, A Practical Framework

Part of the notes for CS:4980:1
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Background

In Lecture 16, we gave a basic algorithmic framework for the top level of a parser for Kestrel:

void parse_program() {
        parse_block();
        if (lex_this.type != ENDFILE) {
                lex_gotbutwant( lex_this, ENDFILE );
        }
}

void parse_block() {
        while ( (lex_this.type == IDENT)
        ||      (lex_iskeyset( lex_this, STATEMENT_KEYWORDS ))
        ) { /* there are more block elements */
                if ( (lex_this.type == IDENT)
                &&   (lex_ispunc( lex_next, PT_COLON ))) {
                        parse_declaration();
                } else {
                        parse_statement();
                }
                if (lex_ispunc( lex_this, PT_SEMCOL )) lex_advance();
        }
}

This framework was done rather carelessly: it provides no solid hooks from which to hang the code needed to handle more complex semantics, and it does not provide a framework for breaking a large program into multiple files.

Our grammar tools, particularly the program gstartfollow, can be used to find the start and follow sets of all of the Kestrel nonterminals. Running gstartfollow on the Kestrel grammar gives us the following results for <program> and <block>:

<kestrel program> ::= <block> <end of file>
                   |  <end of file>
# start set:   <end of file> <identifier> "do" "if" "select" "catch"
#              "raise" "while" "for"

<block> ::= <block-a>
# start set:   <identifier> "do" "if" "select" "catch" "raise" "while"
#              "for"
# follow set:  <end of file> "end" "else" "case" "until"

Note that the grammar tools ended up creating a new nonterminal, <block-a>. This new nonterminal was introduced in order to remove various EBNF extensions from the original Kestrel grammar for blocks. The gstartfollow tool computes the start and follow sets for all of the nonterminals in a BNF grammar, including both the symbols in the original EBNF grammar and all of the symbols added in translating EBNF to pure BNF.

Lexical Support

In general, the start and follow sets of nonterminals contain some mixture of keywords, punctuation, and lexical categories such as <identifier>. This means that we really need a tool to allow easy checking for membership in a start or follow set. Our original lexsupport package had only part of this. We had:

/* bool lex_ispunc( lexeme lex, punct_type t ); */
/* bool lex_ispuncset( lexeme lex, set32_t s ); */
/* bool lex_iskeyset( lexeme lex, set32_t s ); */
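
These predicates presumably treat set32_t as a 32-bit bit vector indexed by punctuation or keyword code. As a sketch only, lex_iskeyset() might look like the following; the KEYWORD lexeme type and the value field of the lexeme structure are assumptions here, not things established by the notes above:

        bool lex_iskeyset( lexeme lex, set32_t s ) {
                /* true if lex is a keyword whose code is a member of s */
                return (lex.type == KEYWORD)
                &&     (((UINT32_C(1) << lex.value) & s) != 0);
        }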

What we want is something like this:

bool lex_isinset( set32_t ps, set32_t ks, set32_t ls ) {
        /* test if lex_this is punct in ps, keyword in ks, or lexeme in ls */
        return lex_ispuncset( lex_this, ps )
        ||     lex_iskeyset( lex_this, ks )
        ||     lex_isset( lex_this, ls );
}

Furthermore, at the start and end of the parsing routine for every nonterminal, we can force membership in the start set and follow set for that nonterminal by calling a routine something like this:

void lex_wantinset( set32_t ps, set32_t ks, set32_t ls, error_message e ) {
        /* test if lex_this is punct in ps, keyword in ks, or lexeme in ls */
        if (lex_isinset( ps, ks, ls )) return;

        /* complain about error */
        lex_gotbutwant( lex_this, e );

        /* =BUG= should probably skip ahead until we find something in set */
        /*       or take some other action to recover from the error       */
}

This code assumes we have an error support tool, lex_gotbutwant( x, y ), that outputs a message such as "x y", where x is the textual representation of a lexeme and y is an error message such as "found when statement expected."
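
One way to discharge the =BUG= note in lex_wantinset() would be a recovery variant that, after complaining, skips ahead until the current lexeme is in the wanted set or the end of file is reached. The name lex_forceinset() and the recovery policy itself are only suggestions here, not part of the existing lexsupport package:

        void lex_forceinset( set32_t ps, set32_t ks, set32_t ls, error_message e ) {
                /* like lex_wantinset, but discard lexemes to recover */
                if (lex_isinset( ps, ks, ls )) return;

                /* complain, then skip ahead to something in the set */
                lex_gotbutwant( lex_this, e );
                while ( !lex_isinset( ps, ks, ls )
                &&      (lex_this.type != ENDFILE) ) {
                        lex_advance();
                }
        }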

A Framework

Now, we are ready to build a reasonable framework for a parser or compiler. We start with the decision that every publicly known nonterminal is a C++ class with a factory method called compile() that actually does the work. Using class Block as an example, this decision leads to something like the following:

// block.h -- interface specification for parser/compiler for blocks

// Author, Date, etc.

// Note: in block.cpp, and only there, define EXTERN before including this
// Note: before including this, include "environment.h"

// BNF
// <block> ::= { <block element> [ ; ] }
// <block element> ::= <declaration> | <statement>

#ifndef EXTERN
        #define EXTERN extern
#endif

class Block {
public:
        static Block * compile( Environment * e );
        // factory method

        // =BUG= blocks certainly have attributes, they are missing here
};
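
The EXTERN trick in this header only pays off if the interface includes global variables. Class Block as shown declares none, but as a hypothetical illustration, a header using this idiom could export a global like the one below, so that block.cpp (where EXTERN is defined as empty) holds the one real definition while every other file that includes the header sees only an extern declaration:

        EXTERN int block_count; // hypothetical example global, not part of block.h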

Note that we made a decision here to fold the processing of <block element> into <block>, since no other production rule in the grammar references <block element>. Blindly expanding every nonterminal into a parsing routine is not sensible. With this, we can start to flesh out the code for block.cpp:

// block.cpp -- implementation of parser/compiler for blocks

// Author, Date, etc.

// BNF
// <block> ::= { <block element> [ ; ] }
// <block element> ::= <declaration> | <statement>

#include <stdint.h>
#include <stdbool.h>
#include "config.h"
#include "sets.h"
#include "errors.h"
#include "stringpool.h"
#include "symboltable.h"
#include "keywords.h"
#include "lexical.h"
#include "lexsupport.h"

#include "environment.h"

#define EXTERN
#include "block.h"

The above list of include files is long, but it is not arbitrary: each header file we know we need contains a note saying "the user must include" certain other headers first, and following those notes leads to the rest of the list. With this prologue out of the way, we can define the start and follow sets for blocks:

// start sets
#define START_PUNCS SET32_EMPTY
#define START_KEYS ( to_set32_4( KEY_DO, KEY_IF, KEY_SELECT, KEY_CATCH ) \
                   | to_set32_3( KEY_RAISE, KEY_WHILE, KEY_FOR )         \
                   )
#define START_LEXS to_set32( IDENT )

// follow sets
#define FOLLOW_PUNCS SET32_EMPTY
#define FOLLOW_KEYS to_set32_4( KEY_END, KEY_ELSE, KEY_CASE, KEY_UNTIL )
#define FOLLOW_LEXS to_set32( ENDFILE )
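
The to_set32() family of macros is assumed to come from sets.h and to build 32-bit bit sets from the numeric codes of their elements. The actual definitions there may differ, but the idea is something along these lines:

        /* a sketch of what sets.h might provide -- not necessarily its text  */
        /* #define to_set32( a )            (UINT32_C(1) << (a))              */
        /* #define to_set32_3( a, b, c )    (to_set32(a)|to_set32(b)|to_set32(c)) */
        /* #define to_set32_4( a, b, c, d ) (to_set32_3(a,b,c)|to_set32(d))   */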

Note that we had to split each set into three parts: one set of punctuation marks, one set of keywords, and one set of other lexeme types. Some of these sets are empty, so if optimal code were our primary goal, we could create special versions of lex_wantinset() that take fewer parameters. For now, we will ignore that kind of optimization and proceed to the parser code:

Block * Block::compile( Environment * e ) {

        lex_wantinset( START_PUNCS, START_KEYS, START_LEXS, ER_WANT_BLOCK );

        while (lex_isinset( START_PUNCS, START_KEYS, START_LEXS )) {

                if ( (lex_this.type == IDENT)
                &&   lex_ispunc( lex_next, PT_COLON ) ) {
                        // all declarations begin with ident:
                        e = Declaration::compile( e );
                } else {
                        // if not a declaration must be a statement
                        Statement * s = Statement::compile( e );
                }
                if (lex_ispunc( lex_this, PT_SEMI )) lex_advance();
        }

        lex_wantinset( FOLLOW_PUNCS, FOLLOW_KEYS, FOLLOW_LEXS, ER_WANT_ENDBLOK);

        // =BUG= the following is almost certainly an error
        return NULL;
}

The parsing component of this code begins with the requirement that the lexeme be in the start set for blocks, and it ends with the requirement that the lexeme be in the follow set for blocks.

Between these two is the loop looking for block elements. This loop uses the same start set because the start sets of blocks and block elements are the same. The loop body uses one symbol of lookahead to distinguish between declarations and statements, and then it handles the optional semicolon between statements.

This code is only slightly more than a parser. A block is given an environment, and the declarations in that block add to the environment. Whatever it is that the block itself builds, however, is not yet developed, except for the bug notices.

The stub for programs is a bit different, because the program parser is responsible for creating the standard environment. So, our framework for Program::compile() will look something like this:

Program * Program::compile() {

        lex_wantinset( START_PUNCS, START_KEYS, START_LEXS, ER_WANT_BLOCK );
        // =BUG= the above check is redundant, since Block::compile checks this

        // =BUG= must emit standard prologue for object file

        Environment * e = NULL;
        // =BUG= the above should create the standard environment

        Block * b = Block::compile( e );
        // =BUG= do we need to do anything with b or can we discard it?

        // =BUG= must emit standard epilogue for object file

        lex_wantinset( FOLLOW_PUNCS, FOLLOW_KEYS, FOLLOW_LEXS, ER_WANT_EOF );

        // =BUG= the following is almost certainly an error
        return NULL;
}

The follow set for programs is, of course, just end of file.
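
To exercise this framework, the main program needs only to start up the lexical analyzer and then hand control to Program::compile(). The following is a sketch; the file name kestrel.cpp, the header program.h, and the initializer lex_open() are assumptions that would have to be reconciled with the real interfaces in lexical.h and elsewhere:

        // kestrel.cpp -- main program (a sketch, see assumptions above)

        #include <stdio.h>
        #include <stdlib.h>
        #include "lexical.h"
        #include "program.h"

        int main( int argc, char * argv[] ) {
                if (argc != 2) {
                        fprintf( stderr, "usage: kestrel sourcefile\n" );
                        exit( EXIT_FAILURE );
                }
                lex_open( argv[1] );    // assumed lexer initializer
                Program::compile();     // parse (and someday compile) it
                exit( EXIT_SUCCESS );
        }

As with block.cpp, a real version of this file would need the full chain of support headers discussed above.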

Syntactic Stubs

You could try to write the entire parser and then test it, but that is a very large effort. If you are serious about incremental development, it is better to lop off large chunks of the grammar, replacing their parsing routines with stubs, so that testing can begin early. Consider this stub for statements, assuming that the start set and follow set components are already declared:

Statement * Statement::compile( Environment * e ) {

        lex_wantinset( START_PUNCS, START_KEYS, START_LEXS, ER_WANT_STMNT );

        // =BUG= the following is a stub
        lex_advance();

        lex_wantinset( FOLLOW_PUNCS, FOLLOW_KEYS, FOLLOW_LEXS, ER_WANT_ENDSTMT);

        // =BUG= does the enclosing block need anything about this statement?
        return NULL;
}

This stub version of compile simply advances over the lexeme that introduced the statement. This is totally wrong for most statements, but it will correctly handle parameterless subroutine calls, since a call to a parameterless Kestrel subroutine consists of nothing more than the name of that subroutine. A similar stub for declarations might look like this:

// =BUG= shouldn't the following return a Declaration?
Environment * Declaration::compile( Environment * e ) {

        // =BUG= the following is unneeded because caller checks ident:
        lex_wantinset( START_PUNCS, START_KEYS, START_LEXS, ER_WANT_STMNT );

        // =BUG= the following is a stub
        lex_advance();  // skip the identifier
        lex_advance();  // skip the colon
        lex_advance();  // skip one identifier, the simplest declaration

        lex_wantinset( FOLLOW_PUNCS, FOLLOW_KEYS, FOLLOW_LEXS, ER_WANT_ENDSTMT);

        // =BUG= does the enclosing block need anything about this declaration?
        return NULL;
}

Our generic pattern for writing a parsing routine has that routine return an object of the class that represents its syntactic category. The first bug notice above complains that Declaration::compile() ought to return an object of class Declaration. Our code for Block::compile(), however, requires that it return an environment, and we satisfied that requirement here in the stupidest way we could. Making class Declaration a subclass of Environment might help bring our code back into conformity with the generic pattern, but consistency with this pattern is not that important.
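
If we took that route, the change would be confined to the header for declarations. The file name declaration.h below is assumed by analogy with block.h, and this is a sketch of the idea, not a commitment to the design:

        // declaration.h (sketch) -- if a Declaration is an Environment, then
        // Declaration::compile() may return its result where Block::compile()
        // expects an Environment *

        class Declaration: public Environment {
        public:
                static Declaration * compile( Environment * e );
                // factory method

                // =BUG= declarations certainly have attributes, missing here
        };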

Note the second bug notice above. While our generic pattern for writing a parsing routine begins with a check for membership in the start set and ends with a check for membership in the follow set, this code never needs to check start set membership: the only call to Declaration::compile() is in Block::compile(), and there it is guarded by an if statement that checks not only for the start set (an identifier) but also for the lexeme after that (a colon).

With these two stubs in place, our parser should now be able to parse a Kestrel program such as this:

x: int;
y: bool;
call1;
call2;

This is hardly an impressive program, but you should test your code on something like it, with several variations. Semicolons are optional, for example, so try leaving some out; you should also create variations that elicit all of the error messages you can. Try something like this:

x: int;
call1;
+