Lecture 4, Practical Lexical Analysis

Part of the notes for CS:4908:0004 (22C:196:004)
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Top Level Issues

Compilers are large programs, so we need to think in the large when writing them. Separating the code for the lexical analyzer from the code for the rest of the compiler seems essential, both from the point of view of controlling the complexity of the project, and from the point of view of creating testable interfaces.

For a compiler written in C or C++, the natural way to do this is to code the lexical analyzer in a file named, for example lexical.c, with the interface given in a separate file named lexical.h. The file name extensions used here are arbitrary, set as a matter of convention and not necessity, as is the convention that the implementation of a program component reside in a file with the same name as the interface definition, but a different extension.

Once a program consists of multiple components connected by interface definitions, it becomes essential to document the relationships between the pieces. In the C/C++ world, the most valuable tool for this purpose is the makefile. Read about makefiles on Wikipedia, or take a look at the suggestions for makfile format and use shown here:

http://homepage.cs.uiowa.edu/~dwjones/syssoft/style.html#make

This suggests that we will begin our project with the following files:

lexical.h
lexical.c
testlex.c
Makefile

Of course, our long term goal is a compiler, so we should also have a main program for the entire compiler, probably:

main.c

But, at least initially, we will not write the compiler's main program. Instead, we will write a test framework for testing the lexical analyzer. As a general principle, whenever a large project is divided into components, test frameworks for those components are extremely useful. When the components are developed before their users, test frameworks become essential.

The makefile should look something like this, including the parts that refer to the not-as-yet developed main program and the parts that refer to our initial plan for the lexical analyzer:

    |# Makefile
    |
    |################################################################
    |# File dependencies for the Kestrel compiler                   #
    |# Author:  Author's Names                                      #
    |# Instructions:                                                #
    |#          make         -- will build compiler someday         #
    |#          make testlex -- build lexical analysis test program #
    |#          make clean   -- delete everything unnecessary       #
    |################################################################
    |
    |#######
    |# configuration options
    |
    |# compiler to use, may give global compiler options here
    |COMPILER = c++
    |
    |#######
    |# primary make target:  the Kestrel compiler
    |
    |kestrel: main.o lexical.o
    |        $(COMPILER) -o kestrel main.o lexical.o
    |
    |main.o: main.c lexical.h
    |        $(COMPILER) -c main.c
    |
    |lexical.o: lexical.c lexical.h
    |        $(COMPILER) -c lexical.c
    |
    |#######
    |# secondary make target:  testlex for testing lexical.o
    |
    |testlex: testlex.o lexical.o
    |        $(COMPILER) -o testlex testlex.o lexical.o
    |
    |testlex.o: testlex.c lexical.h
    |        $(COMPILER) -c testlex.c
    |
    |#######
    |# secondary make target:  clean for cleaning up the project
    |
    |clean:
    |        rm *.o
    |        rm testlex
    |        rm kestrel

The above makefile includes a commitment to eventually write a compiler -- the main program, but initially, we will use it to build a test program, testlex. Provisions for this test program should be preserved essentially forever, since any time there is a change to the lexical analyzer, it is a good idea to test it in isolation before trying to build the whole compiler that rests on it.

In fact, it is even a good idea to think about designing the test program before designing the lexical analyzer! How do you test a lexical analyzer? An obvious idea is to throw a small kestrel program at it and print out the sequence of lexemes. This implies that the lexical analyzer must be prepared both to read from a source file and to output the values of lexemes, or provide tools permitting this. Here, the need for a test program actually allows us to discover some of the functionality of our lexical analyzer.

Of course, that functionality is also going to be needed by the compiler. A compiler does not print out every lexeme it encounters, but when an error is encountered, the compiler typically needs to print out something about the location of the error as part of an error message. One way to do this is to print out some of the lexemes in the vicinity, as well as things like the current line number.

We have also included a bit of boilerplate above, support for the command make clean. In a large project, there are huge numbers of intermediate files produced by preprocessors and compilers. These clutter the directory for the project, making archives and backups difficult to maintain. The purpose of make clean is to delete all of the automatically generated files, including the object files, so that the project directory is reduced to just the original human-created code. That is all you need to burn to CD for an off-line backup, and that is all you need to zip up as an e-mail attachment when you share the code with a distant friend.

Project makefiles sometimes also include make install, to install the executable and any other files it depends on in the appropriate /bin file. Usually, the makefile will include a configuration option that you can set to specify where the program and any auxiliary files it requires should be installed.

Another thing you could include is make test that runs a suite of test scripts to verify that the compiler is working. As the development progress evolves, you would add tests to the test suite, and after each development step, you would use make test to run all the tests and see if your improved code is good, or whether, as frequently happens, you broke something in the process of adding a missing feature. Obviously, you should run make test first, check that all the tests were good, and only then run make install to replace the installed version of the code with the current development version.

Since we have no code to test, it is premature to develop a make test mechanism. Initially, we will run our tests manually and evaluate the results by eyeball.

An Interface

The next key question is, what does the interface to the lexical analysis subsystem look like. We need this first before we get too far along writing code for the lexical analyzer, and we certainly need it before we write test code.

For a C++ programmer, an obvious question is, should there be a class called lexical, so that the lexical analyzer is an instance of this class? Some object-oriented programming purists insist that everything must be done with objects and classes, but the truth is, classes offer little value for an abstraction where there is only one instance. In a typical compiler, we don't need multiple lexical analyzers. Our language is defined in terms of just one stream of lexemes.

C++ supports an older form of abstraction, inherited from C, where each separately compiled source file can be viewed as defining a single instance of an object of an anonymous class. The global variables in a source file are public, by default, but by prefixing the variable name with the keyword static, it becomes a private variable, accessible only by code within that file. The keyword extern tells C compilers to use a variable defined elsewhere. The rules for functions declared in a file are similar; by default, they are the public methods applicable to the object defined by that file, but function definitions prefixed with the keyword static are private and may only be called from other code within the compilation unit.

We are already committed to making the lexical analyzer a separate compilation unit, so it is natural to use this approach. As a consequence, our lexical analyzer must export something like the following functions:

lex_open( filename ) opens the indicated file and initializes the lexical analyzer for reading from that file. This one is fairly obvious, it is the initializer for the lexical analyzer.
lex_advance() advances the lexical analyzer one lexeme onward into the source file.

Some languages can be parsed by looking only at one lexeme, but other languages require that the parser examine some of the context. For example, if a program sees a: in the Kestrel language, it means that a is being defined and it should process a as the symbol being defined, while if it sees a=, it means that a is the object of an assignment statement and should be processed as the name of an already defined variable. Parsers that operate this way are said to be looking ahead in the source file. To do so requires that the lexical analyzer support lookahead.

For most programming languages, a single lexeme of lookahead suffices, so our lexical analyzer will support this with a two-lexeme sliding window in the source file. The lex_advance() function will slide this window forward one step. An obvious way to expose a sliding window to the parser is to offer, as part of the lexical analyzer interface, two variables, one holding the current lexeme and one holding its successor:

lex_this -- the current lexeme
lex_next -- the next lexeme following the current one

Here, we are sticking to a naming convention, where all public components of the lexical analyzer are named with the prefix lex. Now that we are exporting two variables that represent lexemes, we must also export the type definition for those variables. Are lexemes a class? Or are they simple variables? However we decide, it is clear that lexemes come in many sorts.

lex_this.type What kind of lexeme is this? The kind of a lexeme might be an instance of an enumeration type, with the values that include such things as IDENT, KEYWORD, NUMBER, STRING, PUNCT and ENDFILE.
lex_this.value For numeric lexemes, the value is obvious, an integer, probably of the longest supported unsigned integer type. The lexical analyzer for Kestrel does not need to deal in negative values, because negation is a matter of expression syntax. Keywords and identifiers need a unique distinguishing value so that, whenever any particular word is encountered, the same value is returned. An integer can clearly be used to encode such a value. For strings, we need access to the characters of that string, and the value can serve as some kind of handle. Regardless, it is obvious from this brief discussion that the interpretation of the value depends on the type of the lexeme.

We also need to provide services that can be applied to lexemes. It is perfectly reasonable to have a class lexeme, with methods applicable to lexemes, or alternatively, to have a type (in C and C++, a typedef or a struct) lexeme with some applicable functions. The latter is adequate, and we are unlikely to need more than exactly two lexemes. The methods applicable, whether we use a C++ class or fake it with functions, must include something like this:

lex_put( lexeme, file ) Output the lexeme in textual form to the indicated file. This is the most obvious function to call from within the test program, and it provides obvious support for formatting error messages. It could output text of the lexeme as it appeared in the source file, but from the point of view of testing the lexical analyzer, it would be more effective to have it output the reconstruction of the text of the lexeme, using the type and value fields. Given that the text of the lexeme may eventually be embedded in error messages, lex_put() should not decorate the output with any spaces or other text -- let the caller add whatever decoration is required by the context.

The above discussion suggests a header file for the lexical analyzer that looks something like the following:

/* lexical.h -- lexical analyzer interface specificaton */

/* by: -- record of authorship -- date- etc */

enum lex_types { IDENT, KEYWORD, NUMBER, STRING, PUNCT, ENDFILE };

typedef struct lexeme {
	lex_types type;
	uint32_t value;
} lexeme;

EXTERN lexeme lex_this; /* the current lexeme */
EXTERN lexeme lex_next; /* the next lexeme */

void lex_open( char * f );
void lex_advance();
void lex_put( lexeme * lex, FILE f );

This is oversimplified. For example, in many cases, the syntax analyzer may save a lexeme that is a key part of some construct and then, at a later time, discover an error that requires printing this lexeme. For example, the in an expression such as (a/b)+(c/d) the expression analyzer might save the + operator as it works its way through the expression, and only at the end of processing (c/d) will it know if the two operands of the + are legal addends. Only then can it print the error message, and when it does so, the most obvious point in the source program to refer to is the point where the + was found, not the point where the expression ended, which might even be several lines later in the code. This suggests that each lexeme should eventually include its own line number instead of having a special interface to the lexical analyzer to get the current line number at the frontier of lexical analysis.

A well planned interface definition allows us to defer adding support for such features until the need is discovered.