Lecture 17, Blocks

Part of the notes for 22C:196:002 (CS:4908:0002)
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Blocks

The Kestrel grammar says:

<kestrel program> ::= <block> <end of file>

<block> ::= { <block element> [ ";" ] }

<block element> ::= <declaration>
                 |  <statement>

In short, a block is a sequence of statements mixed with declarations. A declaration associates a name with an entity such as a variable or a procedure or function; the grammare given here is abridged:

<declaration> ::= <identifier> ":" <entity>
<entity> ::= <type constructor>
          |  <procedure constructor>
          |  <procedure constructor>

The term constructor here is used as a generic term for the syntactic forms that construct attributes that can be bound to identifiers by declarations. Types, variables and constants, after all, are radically different from each other, and we need a name for the syntactic category that includes all of them.

Looking further, we find that blocks occur in several contexts. The first context is the body of a procedure (or function, but we're abridging the grammar here):

<procedure constructor> ::= "procedure" [ <formal parameters> ]
                            <body>

<body> ::= "external"
        |  <block> "end"

The second context involves the construction of a record type (the Kestrel term for what C and C++ call a struct, but also used for objects, because blocks may contain procedure and function declarations, which in this context become the methods of the object). Again, we have abridged the grammar:

<type constructor> ::= "type" <type>

<type> ::= <reference>
        |  <record>

<record> ::= "record" <block> "end"

The third context involves the bodies of certain types of statements:

<statement> ::= <if>
             |  <assignment>

<if> ::= "if" <expression> [ "then" ]
             <block>
       [ "else"
             <block> ]
         "end"

Parsing Programs

Given this much of the grammar

<kestrel program> ::= <block> <end of file>

We can write the code for a top-down recursive descent parse_program routine something like this:

void parse_program() {
        parse_block();
	if (lex_iskey( lex_this, KEY_END )) lex_advance();
	if (lex_this.type != ENDFILE)) {
		er_gotbutwant( lex_this, ENDFILE );
	}
}

This parser can be changed into a translator by adding code to emit the standard prologue and standard epilogue to the object file.

void compile_program() {
        fputs( PROLOGUE, objectfile );
        parse_block();
	if (lex_iskey( lex_this, KEY_END )) lex_advance();
	if (lex_this.type != ENDFILE)) {
		er_gotbutwant( lex_this, ENDFILE );
	}
        fputs( EPILOGUE, objectfile );
}

What are the standard prologue and epilogue for your machine? You can find it by compiling an empty C program and comparing the output with the result of compiling, for example, a hello-world program. Consider compiling these two:

/* empty.c -- the empty C program */
int main() {
}

/* hello.c -- a hello-world C program */
#include 
int main() {
	fputs( "Hello world!\n", stdout );
}

If you compile these with cc -S empty.c and cc -S hello.c, the output files empty.s and hello.s, will hold the assembly code corresponding to the two input files, in (one of) the assembly language(s) of the machine you are using. By comparing the two files, you should be able to identify the code that is always prefixed to a main program and the code that always follows a main program, as well as the code specific to the body of the hello world program.

What's that about multiple assemblers? Unfortunately, there is no such thing as the unique and only assembly language for any machine. It is always possible to devise an alternative assembly language for any machine, and this has been done for most machines. The GNU family of compilers generally use the GNU assemblers to process their outputs, while the hardware vendor for each machine generally has a somewhat incompatible assembly language. Usually these have similar names for the machine instructions, but even that is not always true.

For the x86 family, for example, you have a choice between masm, the Microsoft assembler, and gas, the GNU assembler. There are many others. There is less variety for the ARM, but gas on the ARM is not fully compatible with the Microsoft armasm, and there are many more assemblers, ranging from freeware to high priced commercial products.

Parsing Blocks

Given this much of the grammar

<block> ::= { <block element> [ ";" ] }

<block element> ::= <declaration>
                 |  <statement>

We can write the framework for a top-down recursive descent parse_block routine:

void parse_block() {
	while (??) { // there are more block elements
		if (??) {
			parse_declaration();
		} else {
			parse_statement();
		}
		if (lex_ispunc( lex_this, PT_SEMCOL )) lex_advance();
	}
}

This leaves just a few questions unanswered. How do we know when the block ends? How do we know when the block element is a declaration, and when it is a statement? To answer these questions, we need to look at the start sets of block elements, declarations and statements.

Unfortunately, the start set of <statement> includes the start set of <assignment> and assignment statements begin with identifiers. Similarly all declarations begin with the identifier being declared. What does the parser do? It must look ahead at the next lexeme. All declarations in Kestrel begin with <identifier> ":". No statement begins this way. Therefore, by looking at lex_next, we can distinguish between statements and declarations. This gives us the following, assuming that the constant STATEMENT_KEYWORDS is the set of all keywords that could begin statements:

void parse_block() {
	while ( (lex_this.type == IDENT)
        ||      (iskeyset( lex_this, STATEMENT_KEYWORDS ))
        ) { // there are more block elements
		if ( (lex_this.type == IDENT)
		&&   (ispunc( lex_next, PT_COLON ))) {
			parse_declaration();
		} else {
			parse_statement();
		}
		if (lex_ispunc( lex_this, PT_SEMCOL )) lex_advance();
	}
}

The compile_block() routine does not generate any code, ever. That job belongs to parse_declaration and parse_statement. The compile_block() routine does need to do something that is not hinted at above: That is, it needs to collect all of the declarations in the block from the declarations and make them available to all future declarations in the same block, as well as to all statements in the block. To understand what is involved here, we must look at the concept of scope rules and how this is evident in Kestrel.

Hierarchic Scope Rules

So what are the semantic attributes of a block?

Of course, there is the code associated with the various statements, but here, we are interested in the attributes associated with the declarations in the block.

Essentially, each block establishes a set of name-attribute pairs as the environment in which each statement and declaration in that block is evaluated. When searching for the definition of a variable or other named item in a particular context, the immediately enclosing block is searched first, and then the block that encloses that one, and so on, until all enclosing blocks have been examined.

Kestrel can be compiled by a one-pass compiler, so we add one more rule: The only declarations visible from any point in the code are those that occur textually before that point.

In the abstract, the above model suffices, but in the real world, it is useful to distinguish between two kinds of blocks:

Free Blocks: The blocks that are the bodies of procedures or functions and the blocks that are the bodies of records establish entirely new environments. While the hierarchic scope rules allow code within such a block to reference variables outside that block, the semantics of such references are complicated, as will be discussed in the next lecture.
Block Extensions: Blocks declared in the body of a statement extend the block enclosing that statement. In a sense, these behave like simple concatenation, but when the end of a block extension is reached, that extension is discarded. Nesting of such blocks is comparatively simple. In effect, each declaration extends the current block, but and when a block extension is discarded, the associated declarations are removed from the current block.