Lecture 4, Practical Lexical Analysis

Part of the notes for 22C:196:002 (CS:4908:0002)
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Top Level Issues

Compilers are large programs, so we need to think in the large when writing them. Separating the code for the lexical analyzer from the code for the rest of the compiler seems essential, both from the point of view of controlling the complexity of the project, and from the point of view of creating testable interfaces.

For a compiler written in C or C++, the natural way to do this is to code the lexical analyzer in a file named, for example, lexical.c, with the interface given in a separate file named lexical.h. The file name extensions used here are arbitrary, set as a matter of convention and not necessity, as is the convention that the implementation of a program component reside in a file with the same name as the interface definition, but a different extension.

Once a program consists of multiple components connected by interface definitions, it becomes essential to document the relationships between the pieces. In the C/C++ world, the most valuable tool for this purpose is the makefile. Read about makefiles on Wikipedia, or take a look at the suggestions for makefile format and use shown here:

http://homepage.cs.uiowa.edu/~dwjones/syssoft/style.html#make

This suggests that we will begin our project with the following files:

lexical.h
lexical.c
testlex.c
Makefile

Why testlex.c? Because any time any component of a large program is developed, a test framework for that component should be developed as well. The makefile should look something like this:

    |# Makefile
    |
    |###########################################################
    |# File dependencies for the Kestrel compiler              #
    |# Author:  Author's Names                                 #
    |# Instructions:                                           #
    |#          make         -- will build compiler someday    #
    |#          make testlex -- build lexical analysis tester  #
    |#          make clean   -- delete everything unnecessary  #
    |###########################################################
    |
    |#######
    |# configuration options
    |
    |# compiler to use, may give global compiler options here
    |COMPILER = c++
    |
    |#######
    |# primary make target:  the Kestrel compiler
    |
    |kestrel: main.o lexical.o
    |        $(COMPILER) -o kestrel main.o lexical.o
    |
    |main.o: main.c lexical.h
    |        $(COMPILER) -c main.c
    |
    |lexical.o: lexical.c lexical.h
    |        $(COMPILER) -c lexical.c
    |
    |#######
    |# secondary make target:  testlex for testing lexical.o
    |
    |testlex: testlex.o lexical.o
    |        $(COMPILER) -o testlex testlex.o lexical.o
    |
    |testlex.o: testlex.c lexical.h
    |        $(COMPILER) -c testlex.c
    |
    |#######
    |# secondary make target:  clean for cleaning up the project
    |
    |clean:
    |        rm -f *.o testlex kestrel

The above makefile includes a commitment to eventually write a compiler -- the main program, but initially, we will use it to build a test program, testlex. Provisions for this test program should be preserved essentially forever, since any time there is a change to the lexical analyzer, it is a good idea to test it in isolation before trying to build the whole compiler that rests on it.

In fact, it is even a good idea to think about designing the test program before designing the lexical analyzer! How do you test a lexical analyzer? An obvious idea is to throw a small Kestrel program at it and print out the sequence of lexemes. This implies that the lexical analyzer must be prepared both to read from a source file and to output the values of lexemes, or provide tools permitting this. Here, the need for a test program actually allows us to discover some of the functionality of our lexical analyzer.

Of course, that functionality is also going to be needed by the compiler. A compiler does not print out every lexeme it encounters, but when an error is encountered, the compiler typically needs to print out something about the location of the error as part of an error message. One way to do this is to print out some of the lexemes in the vicinity, as well as things like the current line number.

We have also included a bit of boilerplate above, support for the command make clean. In a large project, there are huge numbers of intermediate files produced by preprocessors and compilers. These clutter the directory for the project, making archives and backups difficult to maintain. The purpose of make clean is to delete all of the automatically generated files, including the object files, so that the project directory is reduced to just the original human-created code. That is all you need to burn to CD for an off-line backup, and that is all you need to zip up as an e-mail attachment when you share the code with a distant friend.

Project makefiles sometimes also include make install, to install the executable and any other files it depends on in the appropriate /bin directory. Usually, the makefile will include a configuration option that you can set to specify where the program and any auxiliary files it requires should be installed.

Another thing you could include is make test, which runs a suite of test scripts to verify that the compiler is working. As development progresses, you would add tests to the test suite, and after each development step, you would use make test to run all the tests and see if your improved code is good, or whether, as frequently happens, you broke something in the process of adding a missing feature. Obviously, you should run make test first, check that all the tests passed, and only then run make install to replace the installed version of the code with the current development version.

An Interface

The next key question is, what does the interface to the lexical analysis subsystem look like? We need to settle this before we get too far along writing code for the lexical analyzer, and we certainly need it before we write test code.

For a C++ programmer, an obvious question is, should there be a class called lexical, so that the lexical analyzer is an instance of this class? Some object-oriented programming purists insist that everything must be done with objects and classes, but the truth is, classes offer little value for an abstraction where there is only one instance. In a typical compiler, we don't need multiple lexical analyzers. Our language is defined in terms of just one stream of lexemes.

C++ supports an older form of abstraction, inherited from C, where each separately compiled source file can be viewed as defining a single instance of an object of an anonymous class. The global variables in a compilation unit are public by default, but a variable whose declaration is prefixed with the keyword static is private, accessible only by code within that compilation unit. The keyword extern tells the compiler that a variable is defined elsewhere, typically in another compilation unit. Functions are similar. By default, the functions defined in a compilation unit are the public methods applicable to the object defined by the source file, but function definitions prefixed with the keyword static are private and may only be called from other code within the compilation unit.
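As a minimal sketch of this style, consider a hypothetical file counter.c that behaves like a single object. The names count, counter_limit, at_limit and counter_bump are invented for illustration and have nothing to do with the Kestrel compiler.

```c
/* counter.c -- a hypothetical one-instance "object" made from a compilation unit */

static int count = 0;     /* private variable: invisible outside this file */
int counter_limit = 3;    /* public variable: others declare "extern int counter_limit;" */

/* private function: may only be called from within this file */
static int at_limit( void ) {
	return count >= counter_limit;
}

/* public function: increments the private count until the limit is reached */
int counter_bump( void ) {
	if (!at_limit()) count = count + 1;
	return count;
}
```

Code in another file can call counter_bump() and can read or change counter_limit after declaring it extern, but any attempt to name count or at_limit from outside counter.c will fail at link time.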

We are already committed to making the lexical analyzer a separate compilation unit, so it is natural to use this approach. As a consequence, our lexical analyzer must export something like the following functions:

lex_open( filename ) opens the indicated file and initializes the lexical analyzer for reading from that file. This one is fairly obvious.
lex_advance() advances the lexical analyzer one lexeme onward into the source file.

As the processing of the file continues through the source program, the parser (and the test program) needs to examine the current lexeme and occasionally, examine the successor to that lexeme. This requires that the lexical analyzer always have on hand both the current and next lexemes. An obvious way to do this is to offer, as part of the interface, two variables, this lexeme and the next lexeme:

lex_this -- the current lexeme
lex_next -- the next lexeme following the current one
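The one-lexeme lookahead can be sketched as follows. This is only an illustration of the shifting mechanism, under loud assumptions: the lexeme type here is a placeholder holding a single character, the source is a string in memory and not a file, and the names scan_one and lex_open_text are hypothetical.

```c
#include <ctype.h>

/* placeholder lexeme: one non-blank character, or 0 at end of input */
typedef struct lexeme { int ch; } lexeme;

lexeme lex_this;   /* the current lexeme */
lexeme lex_next;   /* the next lexeme following the current one */

static const char * src = "";   /* hypothetical in-memory source text */

/* hypothetical scanner: skip blanks, return the next character as a lexeme */
static lexeme scan_one( void ) {
	lexeme l;
	while (isspace( (unsigned char)*src )) src++;
	l.ch = (unsigned char)*src;
	if (*src != '\0') src++;
	return l;
}

/* shift the lookahead window one lexeme onward */
void lex_advance( void ) {
	lex_this = lex_next;
	lex_next = scan_one();
}

/* hypothetical stand-in for lex_open: prime both lexemes from a string */
void lex_open_text( const char * text ) {
	src = text;
	lex_next = scan_one();
	lex_advance();
}
```

After lex_open_text( "a + b" ), lex_this holds a and lex_next holds +. Each call to lex_advance() moves both one step along, and at the end of the input both settle on the end marker.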

Here, we are sticking to a naming convention, where all public components of the lexical analyzer are named with the prefix lex_. Now that we are exporting two variables that represent lexemes, we must also export the type definition for those variables. Are lexemes a class? Or are they simple variables? However we decide, it is clear that lexemes come in many sorts.

lex_this.type What kind of lexeme is this? The kind of a lexeme might be an instance of an enumeration type, with the values that include such things as IDENT, KEYWORD, NUMBER, STRING, PUNCT and ENDFILE.
lex_this.value For numeric lexemes, the value is obvious, an integer, probably of the longest supported unsigned integer type. The lexical analyzer for Kestrel does not need to deal in negative values, because negation is a matter of expression syntax. Keywords and identifiers need a unique distinguishing value so that, whenever any particular word is encountered, the same value is returned. An integer can clearly be used to encode such a value. For strings, we need access to the characters of that string, and the value can be some kind of handle.
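One way to get a unique integer for each distinct word is a table of the spellings seen so far, with the table index serving as the value. The following is a minimal sketch under stated assumptions, not a proposal for the Kestrel compiler: the name symbol_lookup, the fixed limits SYM_MAX and SYM_LEN, and the linear search are all hypothetical, and a production compiler would more likely use hashing.

```c
#include <stdint.h>
#include <string.h>

#define SYM_MAX 64   /* hypothetical table capacity */
#define SYM_LEN 32   /* hypothetical maximum word length, including the NUL */

static char sym_table[SYM_MAX][SYM_LEN];   /* private: spellings seen so far */
static uint32_t sym_count = 0;             /* private: number of entries in use */

/* return the value for word, adding it to the table if it is new;
   returns SYM_MAX if the word is too long or the table is full */
uint32_t symbol_lookup( const char * word ) {
	uint32_t i;
	if (strlen( word ) >= SYM_LEN) return SYM_MAX;
	for (i = 0; i < sym_count; i++) {
		if (strcmp( sym_table[i], word ) == 0) return i;
	}
	if (sym_count == SYM_MAX) return SYM_MAX;
	strcpy( sym_table[sym_count], word );   /* safe: length checked above */
	return sym_count++;
}
```

Looking up the same spelling twice returns the same integer, and different spellings return different integers, which is all the parser needs in order to compare identifiers.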

We also need to provide services that can be applied to lexemes. It is perfectly reasonable to have a class lexeme, with methods applicable to lexemes, or alternatively, to have a type (in C and C++, a typedef or a struct) lexeme with some applicable functions. The latter is adequate, and we are unlikely to need more than exactly two lexemes. The methods applicable, whether we use a C++ class or fake it with functions, must include something like this:

lex_put( lexeme, file ) Output the lexeme in textual form to the indicated file. This is the most obvious function to call from within the test program, and it provides obvious support for formatting error messages. It should output the text of the lexeme as it appeared in the source file, with no additional decoration. The caller can output linefeeds and enclosing context as needed.
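A partial sketch of lex_put might look like this. It assumes a lexeme with the type and value fields described above, assumes that the value of a PUNCT lexeme is simply its character code, and handles only numbers and punctuation. Identifiers, keywords and strings are omitted because printing them requires a table of spellings that has not been designed yet.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

enum lex_types { IDENT, KEYWORD, NUMBER, STRING, PUNCT, ENDFILE };

typedef struct lexeme {
	enum lex_types type;
	uint32_t value;
} lexeme;

/* output the text of a lexeme to file f, with no added decoration */
void lex_put( lexeme * lex, FILE * f ) {
	switch (lex->type) {
	case NUMBER:
		fprintf( f, "%" PRIu32, lex->value );   /* decimal, as an assumption */
		break;
	case PUNCT:
		fputc( (int)lex->value, f );   /* assumption: value is the character code */
		break;
	default:
		/* IDENT, KEYWORD and STRING need a spelling table, not shown;
		   ENDFILE has no text in the source file */
		break;
	}
}
```

A test program could call lex_put( &lex_this, stdout ) followed by putchar( '\n' ) for each lexeme, leaving the choice of separators to the caller.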

The above discussion suggests a header file for the lexical analyzer that looks something like the following:

/* lexical.h -- lexical analyzer interface specification */

#include <stdint.h>
#include <stdio.h>

enum lex_types { IDENT, KEYWORD, NUMBER, STRING, PUNCT, ENDFILE };

typedef struct lexeme {
	enum lex_types type;
	uint32_t value;
} lexeme;

extern lexeme lex_this; /* the current lexeme */
extern lexeme lex_next; /* the next lexeme */

void lex_open( const char * f );
void lex_advance( void );
void lex_put( lexeme * lex, FILE * f );

This is oversimplified. For example, in many cases, the syntax analyzer may save a lexeme that is a key part of some construct and then, at a later time, discover an error that requires printing this lexeme. For example, in an expression such as (a/b)+(c/d) the expression analyzer might save the + operator as it works its way through the expression, and only at the end of processing (c/d) will it know if the two operands of the + are legal addends. Only then can it print the error message, and when it does so, the most obvious point in the source program to refer to is the point where the + was found, not the point where the expression ended, which might even be several lines later in the code. This suggests that each lexeme should include its own line number instead of having a special interface to the lexical analyzer to get the current line number at the frontier of lexical analysis.
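A lexeme that carries its own line number might be sketched as follows. The field name line and the helper lex_error_text are assumptions made for illustration. The point is only that an error message can cite the line of a saved lexeme long after the lexical analyzer has moved on.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

enum lex_types { IDENT, KEYWORD, NUMBER, STRING, PUNCT, ENDFILE };

typedef struct lexeme {
	enum lex_types type;
	uint32_t value;
	uint32_t line;   /* assumed new field: source line where the lexeme was found */
} lexeme;

/* hypothetical helper: format "line N: message" for a saved lexeme;
   returns what snprintf returns, the length the full message needs */
int lex_error_text( char * buf, size_t size, const lexeme * where, const char * msg ) {
	return snprintf( buf, size, "line %" PRIu32 ": %s", where->line, msg );
}
```

In the (a/b)+(c/d) example, the expression analyzer would copy the + lexeme when it sees it. If the operands later prove illegal, it passes that saved copy to the helper, so the message names the line of the +, wherever lex_this happens to be by then.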

Other attributes are harder to anticipate, and will typically be added later, as the need for them is discovered.