Lecture 4, Practical Lexical Analysis

Part of the notes for CS:4980:1
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Top Level Issues

Compilers are large programs, so we need to think in the large when writing them. Separating the code for the lexical analyzer from the code for the rest of the compiler seems essential, both from the point of view of controlling the complexity of the project, and from the point of view of creating testable interfaces.

For a compiler written in C or C++, the natural way to do this is to code the lexical analyzer in a file named, for example, lexical.c or lexical.cpp, with the interface given in a separate file named lexical.h. The convention of using .h extensions on interface specifications is common to both C and C++. The letter h stands for header, because C function headers are a major part of C interface specifications.

On Unix and Linux systems, file name extensions are somewhat arbitrary, set as a matter of convention rather than necessity. The C and C++ compilers do use default extensions, but it is fairly easy to override these. The worlds of Windows and web applications tend to use extensions more rigidly. Using standard extensions is helpful, though, and the convention that families of related files share the same name with different extensions is very helpful.

Once a program consists of multiple components connected by interface definitions, it becomes essential to document the relationships between the pieces. In the C/C++ world, the most valuable tool for this purpose is the makefile. Read about makefiles on Wikipedia, or take a look at the suggestions for makefile format and use shown here:

http://homepage.cs.uiowa.edu/~dwjones/syssoft/style.html#make

This suggests that we will begin our project with the following files:

lexical.c -- the lexical analyzer implementation
lexical.h -- the lexical analyzer interface
Makefile  -- the record of how the pieces fit together

Of course, our long-term goal is a compiler, so we should also have a main program for the entire compiler, probably:

main.c -- the compiler's main program

But, at least initially, we will not write the compiler's main program. Instead, we will write a test framework for testing the lexical analyzer. This approach is common when a project is too large to begin with any kind of approximation of the real main program.

As a general principle, whenever a large project is divided into components, test frameworks for those components are extremely useful. When a large project has a fairly obvious main program, component test frameworks are merely convenient; when the components are developed before their users, they become essential.

The makefile should look something like this, including the parts that refer to the not-as-yet developed main program and the parts that refer to our initial plan for the lexical analyzer:

|# Makefile
|
|################################################################
|# File dependencies for the Kestrel compiler                   #
|# Author:  Author's Names                                      #
|# Instructions:                                                #
|#          make         -- will build compiler someday         #
|#          make testlex -- build lexical analysis test program #
|#          make clean   -- delete everything unnecessary       #
|################################################################
|
|#######
|# configuration options
|
|# compiler to use, may give global compiler options here
|COMPILER = c++
|
|#######
|# primary make target:  the Kestrel compiler
|
|kestrel: main.o lexical.o
|        $(COMPILER) -o kestrel main.o lexical.o
|
|main.o: main.c lexical.h
|        $(COMPILER) -c main.c
|
|lexical.o: lexical.c lexical.h
|        $(COMPILER) -c lexical.c
|
|#######
|# secondary make target:  testlex for testing lexical.o
|
|testlex: testlex.o lexical.o
|        $(COMPILER) -o testlex testlex.o lexical.o
|
|testlex.o: testlex.c lexical.h
|        $(COMPILER) -c testlex.c
|
|#######
|# secondary make target:  clean for cleaning up the project
|
|clean:
|        rm -f *.o testlex kestrel

It should be fairly obvious that the first item in the above makefile is a comment, as is every other line that starts with a pound sign. Makefile syntax is noticeably awkward! Indenting must be done with tabs, for example, and named variables are defined with equals signs but referenced with an awkward $(NAME) construct. The brilliant thing about makefiles -- and the reason that this awkward tool remains in widespread use -- is that they allow arbitrary relationships between files to be defined. To make the file testlex.o, for example, this makefile says: "first find testlex.c and lexical.h, and then issue a shell command firing up the appropriate compiler to process testlex.c with the -c compiler option." This option causes the compiler to produce testlex.o instead of continuing the compilation process to produce an executable program.
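Stripped of specifics, every rule in a makefile has the same shape (a generic illustration, not one of our rules):

|# general form of a make rule
|target: files the target depends on
|        shell command that rebuilds the target

When any of the listed files is newer than the target, make reruns the command.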

The above makefile includes a commitment to eventually write a compiler -- the main program -- but initially, we will use it to build a test program, testlex. Provisions for this test program should be preserved essentially forever, since any time there is a change to the lexical analyzer, it is a good idea to test it in isolation before trying to build the whole compiler that rests on it.

In fact, it is even a good idea to think about designing the test program before designing the lexical analyzer! How do you test a lexical analyzer? An obvious idea is to throw a small Kestrel program at it and print out the sequence of lexemes. This implies that the lexical analyzer must be prepared both to read from a source file and to output the values of lexemes, or provide tools permitting this. Here, the need for a test program actually allows us to discover some of the functionality of our lexical analyzer.

Of course, that functionality is also going to be needed by the compiler. A compiler does not print out every lexeme it encounters, but when an error is encountered, the compiler typically needs to print out something about the location of the error as part of an error message. One way to do this is to print out some of the lexemes in the vicinity, as well as things like the current line number.

We have also included a bit of boilerplate above, support for the command make clean. In a large project, there are huge numbers of intermediate files produced by preprocessors and compilers. These clutter the directory for the project, making archives and backups difficult to maintain. The purpose of make clean is to delete all of the automatically generated files, including the object files, so that the project directory is reduced to just the original human-created code. That is all you need to burn to CD for an off-line backup, and that is all you need to zip up as an e-mail attachment when you share the code with a distant friend.

Project makefiles sometimes also include make install, to install the executable and any other files it depends on in the appropriate bin directory. Usually, the makefile will include a configuration option that you can set to specify where the program and any auxiliary files it requires should be installed.
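Such a target might look something like this sketch (the variable name INSTALLDIR and the destination /usr/local/bin are illustrative assumptions, not part of our makefile):

|#######
|# hypothetical installation target
|
|# where to put the executable; change to suit your system
|INSTALLDIR = /usr/local/bin
|
|install: kestrel
|        cp kestrel $(INSTALLDIR)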

Another thing you could include is a make test target that runs a suite of test scripts to verify that the compiler is working. As development progresses, you would add tests to the test suite, and after each development step, you would use make test to run all the tests and see whether your improved code is good or whether, as frequently happens, you broke something in the process of adding a missing feature. Obviously, you should run make test first, check that all the tests pass, and only then run make install to replace the installed version of the code with the current development version.

Since we have no code to test, it is premature to develop a make test mechanism. Initially, we will run our tests manually and evaluate the results by eyeball.

An Interface

The next key question is, what does the interface to the lexical analysis subsystem look like? We need this before we get too far along writing code for the lexical analyzer, and we certainly need it before we write test code.

For a C++ programmer, an obvious question is, should there be a class called lexical, so that the lexical analyzer is an instance of this class? Rigidly object-oriented programming languages like Java require that everything be done with objects and classes, but the truth is, classes offer little value for an abstraction where there is only one instance. In a typical compiler, we don't need multiple lexical analyzers. Our language is defined in terms of just one stream of lexemes.

C++ supports an older form of abstraction, inherited from C, where each separately compiled source file can be viewed as defining a single instance of an object of an anonymous class. The global variables in a source file are public, by default, but by prefixing the variable name with the keyword static, it becomes a private variable, accessible only by code within that file. The keyword extern tells C compilers to use a variable defined elsewhere. The rules for functions declared in a file are similar; by default, they are the public methods applicable to the object defined by that file, but function definitions prefixed with the keyword static are private and may only be called from other code within the compilation unit.
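As a small illustration of this style, consider a hypothetical file counter.c (not part of our compiler) that implements a single counter object:

/* counter.c -- a one-instance object in the C style described above */
/* hypothetical example, not part of the Kestrel compiler */

static int count = 0;           /* private: visible only within this file */

int counter_total = 0;          /* public: other files access it by declaring
                                   extern int counter_total;               */

static void note_bump( void ) { /* private function, callable only here */
        count = count + 1;
}

void counter_bump( void ) {     /* public method of this one-of-a-kind object */
        note_bump();
        counter_total = counter_total + 1;
}

A corresponding counter.h would declare extern int counter_total; and void counter_bump();, exactly as lexical.h will do for the lexical analyzer.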

We'll adopt a simple convention for all public names exported by the interface to an object defined by a separate C compilation unit: each such interface will have the object name (or a convenient abbreviation) as a prefix. For example, if lexical.c and lexical.h are the implementation and interface for the lexical analyzer, each public name in this interface will be prefixed lex_.

We are already committed to making the lexical analyzer a separate compilation unit, so it is natural to use this approach. As a consequence, our lexical analyzer must export something like the following functions:

lex_open( filename ) -- open the named source file and start lexical analysis
lex_advance()        -- advance to the next lexeme

Some languages can be parsed by looking only at one lexeme, but other languages require that the parser examine some of the context. For example, if the parser sees a: in the Kestrel language, it means that a is being defined, and it should process a as the symbol being defined, while if it sees a=, it means that a is the target of an assignment statement and should be processed as the name of an already defined variable. Parsers that operate this way are said to be looking ahead in the source file. To do so requires that the lexical analyzer support lookahead.

For most programming languages, a single lexeme of lookahead suffices, so our lexical analyzer will support this with a two-lexeme sliding window into the source file. The lex_advance() function will slide this window forward one step. An obvious way to expose a sliding window to the parser is to offer, as part of the lexical analyzer interface, two variables, one holding the current lexeme and one holding its successor:

lex_this -- the current lexeme
lex_next -- the next lexeme

To clarify this idea of a sliding window, consider the Hello World program in Kestrel:

putstr( "Hello world"+LF, output )

Recall that our lexical analyzer will eventually divide this into lexemes as follows:

[putstr] [(] ["Hello world"] [+] [LF] [,] [output] [)]

After opening this input file, we expect:

lex_this = [putstr]
lex_next = [(]

Calling lex_advance() changes the state of the lexical analysis to:

lex_this = [(]
lex_next = ["Hello world"]

And calling lex_advance() again changes the state to:

lex_this = ["Hello world"]
lex_next = [+]

Of course, we have yet to specify what a lexeme is from the point of view of data structures, so [(] does not denote any specific value; it is just informal notation for the unevaluated text of a lexeme. We have not yet worked out what data structure will actually be used to represent a lexeme.

So what is the type of lex_this and lex_next? Are lexemes a class? Or are they simple variables? However we decide, it is clear that lexemes come in many sorts. This suggests that we might want to be able to write things such as the following sketch (assuming the lexeme representation declared below):
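/* a sketch, assuming the lexeme representation given below */
if (lex_this.type == NUMBER) {
        /* the current lexeme is a number; its value is in lex_this.value */
}
if ((lex_this.type == IDENT) && (lex_next.type == PUNCT)) {
        /* an identifier followed by punctuation: one lexeme of lookahead */
}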

We also need to provide services that can be applied to lexemes. It is perfectly reasonable to have a class lexeme, with methods applicable to lexemes, or alternatively, to have a type (in C and C++, a typedef or a struct) lexeme with some applicable functions. The latter is adequate, since we are unlikely to need more than exactly two lexemes. The methods applicable, whether we use a C++ class or fake it with functions, must include something like this:

lex_put( lexeme, file ) -- output the text of the given lexeme to the given file

The above discussion suggests a header file for the lexical analyzer that looks something like the following:

/* lexical.h -- lexical analyzer interface specification */

/* by: -- record of authorship -- date -- etc */

#include <stdint.h>     /* uint32_t */
#include <stdio.h>      /* FILE */

typedef enum { IDENT, KEYWORD, NUMBER, STRING, PUNCT, ENDFILE } lex_types;

typedef struct lexeme {
	lex_types type;
	uint32_t value;
} lexeme;

extern lexeme lex_this; /* the current lexeme */
extern lexeme lex_next; /* the next lexeme */

void lex_open( char * f );              /* open file f, or stdin if f is NULL */
void lex_advance( void );               /* slide the window forward one lexeme */
void lex_put( lexeme * lex, FILE * f ); /* output the text of *lex to f */

This is oversimplified. For example, in many cases, the syntax analyzer may save a lexeme that is a key part of some construct and then, at a later time, discover an error that requires printing this lexeme. For example, in an expression such as (a/b)+(c/d), the expression analyzer might save the + operator as it works its way through the expression, and only at the end of processing (c/d) will it know whether the two operands of the + are legal addends. Only then can it print the error message, and when it does so, the most obvious point in the source program to refer to is the point where the + was found, not the point where the expression ended, which might be several lines later in the code. This suggests that each lexeme should eventually include its own line number, instead of having a special interface to the lexical analyzer to get the current line number at the frontier of lexical analysis.
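For example, the lexeme type might eventually grow a field recording where the lexeme was found; here is a sketch of one possibility (the field name line is an assumption, not a commitment):

typedef struct lexeme {
	lex_types type;
	uint32_t value;
	int line;	/* sketch: source line on which the lexeme began */
} lexeme;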

A well planned interface definition allows us to defer adding support for such features until the need is discovered.

A Test Framework

Even before we start implementing the lexical analyzer, it's worth setting up a preliminary test framework. Here is a minimal test framework, the file testlex.c. This is a main program that simply calls the lexical analyzer to consume consecutive lexemes from the input, printing each lexeme using the lexical analyzer's own printing tool:

/* testlex.c -- test framework for the lexical analyzer */
#include <stdio.h>
#include <stdlib.h>

#include "lexical.h"

int main(){
        /* the main program for testlex */

        /* =BUG= this should open argv[1] if present so we can read files */
        lex_open( NULL ); /* default to STDIN if argv[1] not present */
        do {
                lex_put( &lex_this, stdout );
                putchar( '\n' ); /* output each lexeme on a new line */
                lex_advance();
        } while (lex_this.type != ENDFILE);
}

In a real compiler, the most likely use of lex_put() will be in error messages, but here, it is our primary testing tool. If we can read each lexeme and successfully put it out, something is working correctly.
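In an error message, the same tool might be used something like this (a hypothetical parser fragment, assuming the interface above):

/* sketch: reporting an error at the current point in the source */
fputs( "unexpected lexeme: ", stderr );
lex_put( &lex_this, stderr );
fputc( '\n', stderr );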

We've deliberately left some unsolved issues here. Eventually, we want the compiler to be able to take command line arguments that specify such things as the name of the file to be compiled. The test program should also allow such arguments, but we've simply put a bug notice in the code to remind ourselves to write a better test program.
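For now, a manual test session might look something like this (hello.kestrel stands for any small hypothetical test input; since lex_open( NULL ) reads standard input, we redirect the file into the program):

make testlex
./testlex < hello.kestrel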