Lecture 4, Practical Lexical Analysis

Part of the notes for 22C:196:002 (CS:4908:0002)
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Top Level Issues

Compilers are large programs, so we need to think in the large when writing them. Separating the code for the lexical analyzer from the code for the rest of the compiler seems essential, both from the point of view of controlling the complexity of the project, and from the point of view of creating testable interfaces.

For a compiler written in C or C++, the natural way to do this is to code the lexical analyzer in a file named, for example, lexical.c, with the interface given in a separate file named lexical.h. The file name extensions used here are a matter of convention, not necessity, as is the convention that the implementation of a program component resides in a file with the same name as its interface definition, but a different extension.

Once a program consists of multiple components connected by interface definitions, it becomes essential to document the relationships between the pieces. In the C/C++ world, the most valuable tool for this purpose is the makefile. Read about makefiles on Wikipedia, or take a look at the suggestions for makefile format and use shown here:

http://homepage.cs.uiowa.edu/~dwjones/syssoft/style.html#make

This suggests that we will begin our project with the following files: the makefile itself, main.c (the eventual main program of the compiler), lexical.c and lexical.h (the lexical analyzer and its interface), and testlex.c (a test program for the lexical analyzer).

Why testlex.c? Because any time any component of a large program is developed, a test framework for that component should be developed as well. The makefile should look something like this:

    |# Makefile
    |
    |###########################################################
    |# File dependencies for the Falcon compiler               #
    |# Author:  Author's Names                                 #
    |# Instructions:                                           #
    |#          make         -- will build compiler someday    #
    |#          make testlex -- build lexical analysis tester  #
    |#          make clean   -- delete everything unnecessary  #
    |###########################################################
    |
    |#######
    |# configuration options
    |
    |# compiler to use, may give global compiler options here
    |COMPILER = c++
    |
    |#######
    |# primary make target:  the falcon compiler
    |
    |falcon: main.o lexical.o
    |        $(COMPILER) -o falcon main.o lexical.o
    |
    |main.o: main.c lexical.h
    |        $(COMPILER) -c main.c
    |
    |lexical.o: lexical.c lexical.h
    |        $(COMPILER) -c lexical.c
    |
    |#######
    |# secondary make target:  testlex for testing lexical.o
    |
    |testlex: testlex.o lexical.o
    |        $(COMPILER) -o testlex testlex.o lexical.o
    |
    |testlex.o: testlex.c lexical.h
    |        $(COMPILER) -c testlex.c
    |
    |#######
    |# secondary make target:  clean for cleaning up the project
    |
    |clean:
    |        rm -f *.o
    |        rm -f testlex
    |        rm -f falcon

The above makefile includes a commitment to eventually write a compiler -- the main program -- but initially, we will use it to build a test program, testlex. Provisions for this test program should be preserved essentially forever, since any time there is a change to the lexical analyzer, it is a good idea to test it in isolation before trying to build the whole compiler that rests on it.

In fact, it is even a good idea to think about designing the test program before designing the lexical analyzer! How do you test a lexical analyzer? An obvious idea is to throw a small Falcon program at it and print out the sequence of lexemes it finds. This implies that the lexical analyzer must be prepared both to read from a source file and to output the values of lexemes, or must provide tools permitting this. Here, the need for a test program actually allows us to discover some of the functionality of our lexical analyzer.

Of course, that functionality is also going to be needed by the compiler. A compiler does not print out every lexeme it encounters, but when an error is encountered, the compiler typically needs to print out something about the location of the error as part of an error message. One way to do this is to print out some of the lexemes in the vicinity, as well as things like the current line number.
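As a rough sketch of how such error reporting might look, consider the following hypothetical lex_error routine. The name lex_error and the lex_line line counter are assumptions made here for illustration; only lex_this and lex_put come from the interface developed below.

/* sketch of an error reporting routine, not part of the interface below */

#include <stdio.h>
#include "lexical.h"    /* for lex_this and lex_put, as given below */

extern int lex_line;    /* hypothetical:  the current source line number */

/* report an error message, with an indication of where it happened */
void lex_error( const char * message ) {
        fprintf( stderr, "line %d: %s, near ", lex_line, message );
        lex_put( &lex_this, stderr );  /* show the offending lexeme */
        fputc( '\n', stderr );
}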

We have also included a bit of boilerplate above, support for the command make clean. In a large project, there are huge numbers of intermediate files produced by preprocessors and compilers. These clutter the directory for the project, making archives and backups difficult to maintain. The purpose of make clean is to delete all of the automatically generated files, including the object files, so that the project directory is reduced to just the original human-created code.

Project makefiles sometimes also include make install, to install the executable and any other files it depends on in the appropriate bin directory. Usually, the makefile will include a configuration option that you can set to specify where the program and any auxiliary files it requires should be installed.
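In the style of the makefile above, a minimal install target might look like this; the INSTALLDIR name and its default value are assumptions, not decisions the project has made:

    |#######
    |# configuration option:  where make install should put the compiler
    |INSTALLDIR = /usr/local/bin
    |
    |#######
    |# secondary make target:  install the falcon compiler
    |
    |install: falcon
    |        cp falcon $(INSTALLDIR)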

An Interface

The next key question is this: what does the interface to the lexical analysis subsystem look like? We need this before we get too far along writing code for the lexical analyzer, and we certainly need it before we write test code.

An obvious question, for a C++ programmer, is this: Should there be a class called lexical, so that the lexical analyzer is an instance of this class? Some object-oriented programming purists insist that everything must be done with objects and classes, but the truth is, classes offer little value for an abstraction where there is only one instance. In a typical compiler, we don't need multiple lexical analyzers. Our language is defined in terms of just one stream of lexemes.

C++ supports an older form of abstraction, inherited from C, where each separately compiled source file can be viewed as defining a single instance of an object of an anonymous class. The global variables in a compilation unit are, by default, public, but prefixing a variable declaration with the keyword static makes it private, accessible only by code within that compilation unit. The keyword extern tells the compiler to use a variable defined elsewhere. The functions declared in a compilation unit are similar: by default, they are the public methods applicable to that object, but function definitions prefixed with the keyword static are private and may only be called from other code within the compilation unit.
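To make the idiom concrete, here is a minimal sketch using a hypothetical counter compilation unit; the names counter.h, counter.c, count_total, count_tick, ticks and audit are all illustrative, not part of the compiler:

/* counter.h -- the public interface to a hypothetical counter unit */
extern int count_total;          /* a public variable */
void count_tick();               /* a public function */

/* counter.c -- one instance of an anonymous "class" */
#include "counter.h"

static int ticks = 0;            /* private: invisible outside this file */

static void audit() {            /* private: callable only in this file */
        /* internal consistency checks could go here */
}

int count_total = 0;             /* the definition of the public variable */

void count_tick() {              /* the definition of the public function */
        audit();
        ticks = ticks + 1;
        count_total = ticks;
}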

We are already committed to making the lexical analyzer a separate compilation unit, so it is natural to use this approach. As a consequence, our lexical analyzer must export something like the following functions: one to open a source file and start lexical analysis of it, and one to advance to the next lexeme. These become lex_open() and lex_advance() in the interface given below.

As processing continues through the source program, the parser (and the test program) needs to examine the current lexeme and, occasionally, its successor. This requires that the lexical analyzer keep both the current and the next lexeme available for inspection. An obvious way to do this is to offer, as part of the interface, two variables: lex_this, the current lexeme, and lex_next, the one that follows it.
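For example, a parser built on this interface might use the one-lexeme lookahead along these lines; this is only a hypothetical fragment, written in terms of the names and types declared in the header file given below:

if (lex_this.type == IDENT) {
        /* peek at lex_next to decide what construct this identifier starts */
        if (lex_next.type == PUNCT) {
                /* perhaps an assignment or a call; the parser decides here */
        }
        lex_advance();  /* consume the identifier; lex_next becomes lex_this */
}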

Here, we are sticking to a naming convention where all public components of the lexical analyzer are named with the prefix lex. Now that we are exporting two variables that represent lexemes, we must also export the type definition for those variables. Are lexemes objects of a class, or values of some simpler type? However we decide, it is clear that lexemes come in several distinct sorts.

We also need to provide services that can be applied to lexemes. It is perfectly reasonable to have a class lexeme, with methods applicable to lexemes, or alternatively, to have a type (in C and C++, a typedef or a struct) lexeme with some applicable functions. The latter is adequate, since we are unlikely ever to need more than the two lexeme variables above. The methods applicable, whether we use a C++ class or fake it with functions, must include something like this: a routine to write the text of a lexeme to a file, needed both by the test program and for reporting errors. This becomes lex_put() in the interface given below.

The above discussion suggests a header file for the lexical analyzer that looks something like the following:

/* lexical.h -- lexical analyzer interface specification */

#include <stdint.h>	/* uint32_t */
#include <stdio.h>	/* FILE */

enum lex_types { IDENT, KEYWORD, NUMBER, STRING, PUNCT, ENDFILE };

typedef struct lexeme {
	lex_types type;  /* what kind of lexeme is this? */
	uint32_t value;  /* its value; the interpretation depends on type */
} lexeme;

extern lexeme lex_this; /* the current lexeme */
extern lexeme lex_next; /* the next lexeme */

void lex_open( char * f );              /* open file f and start the analyzer */
void lex_advance();                     /* advance to the next lexeme */
void lex_put( lexeme * lex, FILE * f ); /* write the text of lex to file f */

Or something like that.
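To make the earlier testing idea concrete, a version of testlex.c built on this interface might look like the following. This is only a sketch; in particular, it assumes that lex_open() leaves lex_this (and lex_next) holding valid lexemes when it returns, and if the real lexical analyzer behaves differently, the loop changes accordingly.

/* testlex.c -- test framework for the lexical analyzer */

#include <stdio.h>
#include <stdlib.h>
#include "lexical.h"

int main( int argc, char * argv[] ) {
        if (argc != 2) {
                fprintf( stderr, "usage: testlex sourcefile\n" );
                exit( EXIT_FAILURE );
        }

        lex_open( argv[1] );  /* open the source file and start analysis */
        while (lex_this.type != ENDFILE) {
                lex_put( &lex_this, stdout );  /* print the current lexeme */
                putchar( '\n' );
                lex_advance();
        }
        return EXIT_SUCCESS;
}

Run on a small Falcon source file, this should print one lexeme per line, output that is easy to inspect by eye or to compare against expected results with diff.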