Lecture 3, Moving Toward an Algorithm

Part of the notes for 22C:196:002 (CS:4908:0002)
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Alternative Descriptions of Lexical Structure

Toward the end of the last lecture, we described the lexemes of Kestrel as follows, in a context where a Kestrel program was defined as a sequence of lexemes separated by white space of various kinds.

<lexeme> ::=

    <letter> { <letter> | <digit> }

|   ( <digit> { <digit> } ) [
      '#' ( <digit> | <letter> ) {
        ( <digit> | <letter> )
      }
    ]

|   <string>

|   <punctuation>

|   <end of file>

There are other notations that describe much the same thing. One such notation is sometimes known as the "railroad diagram" because of its resemblance to the schematic map of a railroad network. Here is the railroad diagram for the Kestrel lexical structure, including a first approximation of the rules for Kestrel character strings, and integrating the description of skipping white-space with the description of the structure of each type of lexeme:

a railroad diagram

In using a railroad diagram to scan a block of text, you follow the lines in the diagram. Whenever your path passes through an oval, you must read a character from the input that matches the contents of that oval. If you are blocked, in a situation where none of the ovals you could pass through contain the next available character, you have found a lexical error. At any choice point where one path splits many ways, only one oval should match the next character. If more than one matches, the lexical description provided by the railroad diagram is ambiguous.

In fact, the above railroad diagram is ambiguous: Is the string "abc" one identifier, or is it "a" followed by "bc", or "ab" followed by "c", or perhaps "a" followed by "b" followed by "c"? The number "123" could be similarly divided, and is "a1" a single identifier, or an identifier followed by a number? This ambiguity reflects the ambiguity in the original BNF description. We resolved this informally with the greedy rule: We will match the longest identifier we can before going on to the next identifier or number, and similarly, the longest number we can before going on to the next identifier or number.

The labels on the sides of the lines in the railroad diagram indicate the conceptual output of the lexical analyzer. Every time you scan past such a label, you have reached the end of some lexeme (in this case, a letter, a number, a string or a punctuation mark).

Another notation we can use is the state diagram for a finite state machine. In fact, this is just another way of thinking about the railroad diagram. We can begin our move to a state diagram by labeling each branch point in the railroad diagram with a state:

a preliminary state diagram

In the state diagram, states are numbered circles, connected by labeled arcs called state transitions. In each state, one character of the input is read, and the value of that character determines which arc is followed. An arc is followed if the character just read matches the character that labels that arc. So, for example, in the above state diagram, if you are in state 1 and read a letter, you move to state 2.

Simply identifying the states as branch points in the railroad diagram is only the first step. Wherever two "tracks" in the railroad diagram merged and then split, making a choice, there are problems. Consider states 4 and 5 above. State 5 has two outgoing arcs, one of which has no label, because the labels "letter" and "digit" between states 4 and 5 were intended to apply to it. The solution is to avoid merging "tracks" before a split, and instead, duplicate the split, moving the merge points so they are after the split. As a result, the un-labeled arc out of state 5 back to state 4 becomes two arcs, one for letters, and one for digits, both leading from state 5 to state 5. Similar problems occur at state 1 and are resolved in the same way.

The standard terminology for state diagrams refers to the arcs between states as state transitions. Each transition should be labeled with the input that allows that transition to be taken and, if an output is associated with that transition, the output. The label "x/y" on a transition indicates that this transition is taken when the input is x, and if taken, the output will be y. The diagram below includes the changes discussed above and attempts to conform to the standard notation, while maintaining as much of the layout of the original railroad diagram as possible:

a final state diagram

In the above state diagram, we had to abbreviate some of the output labels because of space constraints, but the abbreviations, and one missing label, should be obvious. There is still some abbreviation in the above figure. In a formal finite state machine diagram, lines representing transitions never merge or fork. All such lines have their tail attached to one state and their head attached to another. When the diagram is redrawn that way, there is never a need to attach more than one input label to any line. In the abbreviated form shown above, it is important to note that exactly one character of input is consumed as "control" moves from one state to the next. If the path from state to state passes by two labels, this does not mean that two inputs are consumed. Thus, in analyzing the string "9a", after consuming "9" and entering state 3 to identify this as a number, state 3 sees "a", forcing a transition to state 2 as the letter is consumed. The path in the above diagram runs from state 3 to state 2 past the labels anything else and letter; here, as in every other case where this occurs, letter is merely a subset of anything else.

The other traditional way to represent a finite state machine is with a state table giving, for each state, the next state and output as a function of input. This is given below:

  State     Input     Next    Output
    1       " "         1
    1       letter      2
    1       digit       3
    1       "'"         6
    1       '"'         7
    1       other       1     punctuation
    2       letter      2
    2       digit       2
    2       " "         1     identifier
    2       "'"         6     identifier
    2       '"'         7     identifier
    2       other       1     identifier + punctuation
    3       digit       3
    3       "#"         4
    3       " "         1     number
    3       letter      2     number
    3       "'"         6     number
    3       '"'         7     number
    3       other       1     number + punctuation
    4       letter      5
    4       digit       5
    5       letter      5
    5       digit       5
    5       " "         1     number
    5       "'"         6     number
    5       '"'         7     number
    5       other       1     number + punctuation
    6       "'"         1     string
    6       other       6
    7       '"'         1     string
    7       other       7

In the above state table, we have resolved the ambiguities in our preliminary state diagram above by letting the transitions out of the final state involved in processing each kind of lexeme branch to state 1 except in cases where the character that terminated one lexeme was the start character of the next. In that case, the transition out of the end of one lexeme can lead directly to the first state for processing the next.

The above state table is simplified: We have left out the details for tabs, linefeeds and break characters. In the context of our simplified view of the Kestrel lexical level, the rules for all of these are identical to the rules for spaces.

One thing we omitted in the first descriptions of the lexical analyzer here is a specific description of where the analysis begins. We should have been explicit about this, and we can do so trivially with the state machine by specifying state 1 (the state that skips blanks) as the first state. We have also omitted the special case for end-of-file, although this is trivial.

The biggest disadvantage of a state table, as a formal specification of the lexical level, is that it is about as readable as machine language printed in hexadecimal. In fact, in a very formal sense, the state table is a kind of machine language. Debugging or updating a state table in the event that the specifications for the language change is as difficult as doing the same kinds of things to a machine language program.

Aside: Some Theory

If you take a course on automata theory, you will find that there are two equivalent types of finite-state machines. In a Moore machine, the outputs are purely a function of the state, while in a Mealy machine, the outputs are a function of the state and the input consumed in that state -- that is to say, the outputs are a function of the state transition.

The Moore and Mealy formalisms are equivalent, in that every Moore machine may be transformed into an equivalent Mealy machine and vice versa. To convert a Moore machine to a Mealy machine, the outputs associated with each state are attached, instead, to the transitions that lead to that state.

To convert a Mealy machine to a Moore machine, the outputs associated with each transition are pushed ahead into the states that those transitions lead into. Wherever this transformation associates two or more outputs with one state, that state is split as needed into additional states, each with just one of the outputs. The state transitions out of each "substate" of a split state are identical for each of the substates.

So, is the above state machine a Moore or a Mealy machine?

Both the state diagram and the state table give away the answer. In the diagram, the outputs were labels on the state transitions, not on the states. In the table, for any given state, there were multiple outputs, with the output selected by the input encountered in that state. In sum, we produced a Mealy machine above.

Coding a State Machine

An obvious way to code a finite state machine is to directly represent the state table as an array, indexed by state. Each entry in the state array in this case is an array indexed by the input character encountered in that state. The entries in this final array give the next state and (in the case of a Mealy machine) the output to produce on that transition. This leads us to the following code:

state = INITIAL_STATE;
for (;;) {
	/* do one state transition per iteration */
	int ch = getchar(); /* consume one character */
	char output = state_table[state][ch].out;
	state = state_table[state][ch].next;
	putchar( output );
}

The above code assumes that the output is just a character. The resulting code is called a finite-state transducer. This would allow us, for example, to output the letter "i" for each identifier in the input, the letter "n" for each number, etc. This is where the typical automata theory course stops in the discussion of finite state machines, digressing, instead, into questions of the properties of the class of languages a finite state machine can recognize, and the various ways to increase the power of the machine by adding memory.

We want more! A practical lexical analyzer must do more than merely recognize that a lexeme exists. It must also, for example, interpret the values of numbers and maintain a table of all identifiers that have been encountered. This means that we are not interested in just outputting the mere fact that a properly formed lexeme has been encountered, but we are also interested in saying something about that lexeme. In that case, instead of simply outputting a character with each state transition, we let each transition trigger an appropriate computation. The following program structure is typical of table-driven lexical analyzers:

state = INITIAL_STATE;
for (;;) {
	/* do one state transition per iteration */
	int ch = getchar(); /* consume one character */
	int action = state_table[state][ch].act;
	state = state_table[state][ch].next;
	switch (action) {

	case FIRSTDIGIT: /* ch is the first digit of a number */
		value = ch - '0';
		break;

	case NEXTDIGIT: /* ch is a subsequent digit of a number */
		value = (value * 10) + (ch - '0');
		break;

	}
}

The state table in either of the above approaches is typically a statically allocated constant. The code to initialize this table is almost always totally unreadable, consisting of many hundreds of constants. Some automatic lexical analyzer construction tools take, as input, a textual description of the lexical rules and produce such a table as output. In this case, programmers do not ever directly manipulate the table, they manipulate the textual description from which the table is generated.

An alternative is to code the state machine in the flow of control of the lexical analyzer. With this approach, the railroad diagram is a flow-chart for the code. Going back to the railroad diagram above, we can derive something like the following:

char ch = getchar(); /* we work with one character always in hand */
for (;;) {
	while ((ch == ' ') || (ch == '\t') || (ch == '\n') || (ch == '\r')) {
		/* skip whitespace */
		ch = getchar();
	}
	if (((ch >= 'A') && (ch <= 'Z')) || ((ch >= 'a') && (ch <= 'z'))) {
		/* found start of identifier */
		...
	} else if ((ch >= '0') && (ch <= '9')) {
		/* found start of number */
		value = ch - '0';
		ch = getchar();
		while ((ch >= '0') && (ch <= '9')) {
			/* consume remaining digits */
			value = (value * 10) + (ch - '0');
			ch = getchar();
		}
		...
	} else ...
}

The advantage of the above approach is that it produces legible code that anyone knowledgeable in the programming language can edit. The disadvantage of this code is that the formal state machine is lost in the code. A knowledgeable reader will note that the code ch=getchar() occurs in very strictly controlled places, always after the previous character has been dealt with and immediately before a choice point in the code. This code integrates the semantic actions (for example, accumulating the values of numbers) with the code to recognize numbers. So long as the properties of interest are simple -- as they are with numbers -- this is easy, but in some cases, we might be interested in complex properties where the code for the semantics hides the code for the state machine.

The above code was written as if the lexical analyzer was the outer loop of the program. This is not a good idea. In most languages, the syntax analysis code makes a more appropriate main program, and our goal is to build a lexical analyzer we can call from the syntax analysis system. Our goal, in a production lexical analyzer, therefore, is to offer an interface to the world where a call to (for example) nextlex() advances one lexeme through the stream of lexemes.

The syntax analyzer, therefore, will call nextlex() each time it is ready to advance one step through the text of the program. In general, the syntax analyzer may need to look not only at the current lexeme, but also at some sliding window of neighboring lexemes. Many programming languages are described by grammars that require looking both at the current lexeme and its successor, for example. In such a case, nextlex() will advance the window, an operation more complex than merely returning the current lexeme.

Our design effort must also define exactly what a lexeme is. That will be the goal of the next lecture.