Lecture 3, Moving Toward an Algorithm

Part of the notes for CS:4908:0004 (22C:196:004)
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Alternative Descriptions of Lexical Structure

Toward the end of the last lecture, we described the lexemes of Kestrel as follows, in a context where the a Kestrel program was defined as a sequence of lexemes separated by white space of various kinds.

<lexeme> ::=

    <letter> { <letter> | <digit> }

|   ( <digit> { <digit> } ) [
      '#' ( <digit> | <letter> ) {
        ( <digit> | <letter> )
      }
    ]

|   <string>

|   <punctuation>

|   <end of file>

There are other notations that are describe much the same thing. One notation is sometimes known as the "railroad diagram" because of its resemblance to the schematic map of a railroad network. Here is the railroad diagram for the Kestrel lexical structure, including the rules for Kestrel character strings, and integrating the description of skipping white-space with the description of the structure of each type of lexeme:

a railroad diagram

In using a railroad diagram to scan a block of text, you follow the lines in the diagram. Whenever your path passes through an oval, you must read a character from the input that matches the contents of that oval. If you are blocked, in a situation where none of the ovals you could pass through contain the next available character, you have found a lexical error. At any choice point where one path splits many ways, only one oval should match the next character. If more than one matches, the lexical description provided by the railroad diagram is ambiguous.

In fact, the above railroad diagram is ambiguous: Is the string "abc" one identifier, or is it "a" followed by "bc" or "ab" followed by "c" or perhaps "a" followed by "b" followed by "c". The number "123" could be similarly divided, and is "a1" a single identifier, or an identifier followed by a number? This ambiguity reflecting the ambiguity in the original BNF description. We resolved this informally with the greedy rule: We will match the longest identifier we can before going on to the next identifier or number, and similarly, the longest number we can before going on to the next identifier or number.

The labels on the sides of the lines in the railroad diagram indicate the conceptual output of the lexical analyzer. Every time you scan past such a label, you have reached the end of some lexeme (in this case, a letter, a number, a string or a punctuation mark).

Another notation we can use is the state diagram for a finite-state machine. In fact, railroad diagrams can be considered to be an alternative notation for describing finite-state machines. Formally, we can describe a finite-state machine as:

The transition rules of a finite-state machine map from the current state and the current input value to the next state and the output value. The machine operates by consuming sequences of inputs, moving from state to state as it does so, and if it is a transducer, producing a sequence of outputs.

To convert a railroad diagram to a finite state machine we start by labeling each branch point in the railroad diagram with a state. In the following, we have used circled numbers as state labels:

a preliminary state diagram

The arcs connecting states in the modified railroad diagram correspond to state transistions, and the labels on those arcs indicate the input that triggers that state transition. In each state of our machine, one character of the input is read, and the value of that character determines the which arc is followed. An arc is followed if the character just read matches the character that labels that arc. So, for example, in the above state diagram, if you are in state one and read a letter, you move to state 2.

Simply identifying the states as branch points in the railroad diagram is only the first step. Wherever two "tracks" in the railroad diagram merged and then split, making a choice, there are problems. Consider states 4 and 5 above. State 5 has two outputs, one of which has no label. That is, the arck from state 5 to state 4 is intended to be followed without consuming any input; this is called a null-input transition.

We can repair this problem with states 4 and 5 by replacing the arc from state 5 to state 4 with a pair of arcs from state 5 back to itself, one followed when the input is a letter, the other when the input is a digit. In effect, the choice made in state 4 has been replicated on the transistions out of state 5 in order to eliminate the null-input transition. Similar problems occur with the transitions from states 2, 3 and 5 to state 1 that can be resolved in the same way.

In the formulation we have used here, each state transition should be labeled with the input that allows that transition to be taken and, if an output is associated with that transition, the output. The label "x/y" on a transition indicates that this transition is taken when the input is x, and if taken, the output will be y. The diagram below includes the changes discussed above and attempts to conform to the standard notation, while maintaining as much of the layout of the original railroad diagram as possible:

a final state diagram

In the above state diagram, we had to abbreviate some of the output labels because of space constrtaints, but the abbreviations, and one missing label, should be obvious. There is still some abbreviation in the above figure. In a formal finite state machine diagram, lines representing transitions never never merge or fork. All such lines have their tail attached to one state an their head attached to another. When the diagram is redrawn that way, there is never a need to attach more than one input label to any line. In the abbreviated form shown above, it is important to note that exactly one character of input is consumed as "control" moves from one state to the next. If the path from state to state passes by two labels, this does not mean that two inputs are consumed. Thus, in analyzing the string "9a", after consuming "9" and entering state 3 to identify this as a number, state 3 "a" forcing a transitition to state 2 as the letter is consumed. The path in the above diagram follows from state 3 to state 3 past the labels anything else and letter; here, as in every other case where this occurs, letter is merely a subset of anything else.

The other traditional way to represent a finite state machine is with a state table giving, for each state, the next state and output as a function of input. This is given below:

  State     Input     Next    Output
1 " "1
1 letter 2
1 digit 3
1 "'"6
1 '"'7
1 other 1 punctuation
2 letter 2
2 digit 2
2 " "1 identifier
2 "'"6 identifier
2 '"'7 identifier
2 other 1 identifier + punctuation
3 digit 3
3 "#"4
3 " "1 number
3 letter 2 number
3 "'"6 number
3 '"'7 number
3 other 1 number + punctuation
4 letter 5
4 digit 5
5 letter 5
5 digit 5
5 " "1 number
5 "'"6 number
5 '"'7 number
5 other 1 number + punctuation
6 "'"1 string
6 other 6
7 '"'1 string
7 other 7

In the above state table, we have resolved the ambiguities in our preliminary state diagram above by letting the transitions out of the final state involved in processing each kind of lexeme branch to state 1 except in cases where the character that terminated one lexeme was the start character of the next. In that case, the transition out of the end of one lexeme can lead directly to the first state for processing the next.

Note that, while some state transitions have no associated outputs, others have multiple outputs. If the machine is processing the input 12+, the numeral 2 is consumed with a transition from state 3 to state 3, producing no output, and then the + sign is consumed, forcing a transition from state 3 to state 1. The output of this second transition is both a report that a number has been consumed and a report that a punctuation mark has been consumed.

The above state table is simplified: We have left out the details for tabs, linefeeds and break characters. In the context of our simplified view of the Kestrel lexical level, the rules for all of these are identical for the rules for spaces.

One thing we omitted in the first descriptions of the lexical analyzer here is a specific description of where the analysis begins. We should have been explicit about this, and we can do so trivially with the state machine by specifying state 1 (the state that skips blanks) as the first state. We have also omitted any mention of end-of-file.

The biggest disadvantage of a state table, as a formal specification of the lexical level, is that it is about as readable as machine language printed in hexadecimal. In fact, in a very formal sense, the state table is a kind of machine language. Debugging or updating a state table in the event that the specifications for the language change is as difficult as doing the same kinds of things to a machine language program.

Aside: Some Theory

If you take a course on automata theory, you will find that there are two equivalent types of finite-state machines. In a Moore machine, the outputs are purely a function of the state, while in a Mealy machine, the outputs are a function of the state and the input consumed in that state -- that is to say, the outputs are a function of the state transition.

The Moore and Mealy formalisms are equivalent. That is, every Moore machine may be transformed into a Mealy machine that produces the same outputs in response to the same inputs, and visa versa. To convert a Moore machine to a mealy machine, the outputs associated with each state are attached, instead, to the transitions that lead to that state.

To convert a Mealy machine to a Moore machine, the outputs associated with each transition are pushed ahead into the states that those outputs lead into. Wherever this transformation associates conflicting outputs with a state, that state is split as needed into additional states, each with just one of the outputs. The state transistions out of each "substate" of a split state are identical for each of the substates.

So, is the above state machine a Moore or a Mealy machine?

Both the state diagram and the state table give away the answer. In the diagram, the outputs were labels on the state transitions, not on the states. In the table, for any given state, there were multiple outputs, with the output selected by the input encountered in that state. In sum, we produced a Mealy machine above.

Coding a State Machine

An obvious way to code a finite state machine is to directly represent the state table as an array, indexed by state. Each entry in the state array is an array indexed by the input character encountered in that state. The entires in this final array give the next state and (in the case of a Mealy machine) the output to produce in that state. This leads us to the following code:

state = INITIAL_STATE;
for (;;) {
	/* do one state transition per iteration */
	int ch = getchar(); /* consume one character */
	char output = state_table[state][ch].out;
	state = state_table[state][ch].next;
	putchar( output );
}

The above code assumes that the output is just a character. The resulting code is called a finite-state transducer. This would allow us, for example, to out put the letter "i" for each identifier in the input, the letter "n" for each number, etc. This is where the typical automata theory course stops in the discussion of finite state machines, digressing, instead, into questions of the properties of the class of languages a finite state machine can recognize, and the various ways to increase the power of the machine by adding memory.

A minimal lexical analyzer simply chews each lexeme off of the input text, but a lexical analyzer can do more. The Java class Scanner, for example, can both recognize a number of lexical classes and evaluate values in the class. Instances of Scanner can analyze inputs containing integers, floating point numbers, strings, and other forms of tokens. For example, the method hasNextFloat() returns true if the next lexeme in the input stream is a floating point number, and nextFloat() consumes the next lexeme and returns its floating-point value.

Many off-the-shelf scanners are difficult to customize for languages with a lexical structure different from that anticipated by their designers, and in the context of a compiler, the lexical analyzer can do significant jobs that are not covered by the typical off-the-shelf tokenizer. This makes it worth our time to explore the inner workings of a good lexical analyzer.

In addition to interpreting the values of numbers, the lexical analyzers used in compilers frequently maintain a table of all of the identifiers that have been encountered. This means that the lexical analyzer does not merely output mere fact that a properly formed lexeme has been encountered, but it gives a useful interpretation of that lexeme. State transistions inside such a lexical analyzer must therefore trigger arbitrary computations.

The following program structure is typical of table-driven lexical analyzers:

state = INITIAL_STATE;
for (;;) {
	/* do one state transition per iteration */
	int ch = getchar(); /* consume one character */
	int action = state_table[state][ch].act;
	state = state_table[state][ch].next;
	switch (action) {

	case FIRSTDIGIT: /* ch is the first digit of a number */
		value = ch - "0";
		break;

	case NEXTDIGIT: /* ch is the first digit of a number */
		value = (value * 10) + (ch - "0");
		break;

	}
}

The state table in either of the above approaches is typically a statically allocated constant, since we aren't interested in compilers where the lexical structure of the language is a variable. The code to initialize this table is almost always totally unreadable, consisting of many hundreds of constants. Some automatic lexical analyzer construction tools take, as input, a textual description of the lexical rules and produce such a table as output. In this case, programmers do not ever directly manipulate the table, they manipulate the textual description from which the table is generated.

An alternative is to code the state machine in the flow of control of the lexical analyzer. With this approach, the railroad diagram is a flow-chart for the code. Going back to the railroad diagram above, we can derive something like the following:

char ch = getchar(); /* we work with one character always in hand */
for (;;) {
        while ((ch == ' ') || (ch == '\t') || (ch == '\n') || (ch == 'r')) {
		/* skip whitespace */
		ch = getchar();
	}
	if (((ch >= 'A') && (ch <= 'Z')) || ((ch >= 'a') && (ch <= 'z')) {
		/* found start of identifier */
		...
	} else if ((ch >= '0') && (ch <= '9')) {
		/* found start of number */
		value = ch - '0';
		ch = getchar();
		while ((ch >= '0') && (ch <= '9')) {
			/* consume remaining digits */
			value = (value * 10) + (ch - '0');
			ch = getchar();
		}
		...
	} else ...
}

The advantage of the above approach is that it produces legible code that anyone knowledgible in the programming language can edit. The disadvantage of this code is that the formal state machine is lost in the code. A knowledgible reader will note that the code ch=getchar() occurs in very strictly controlled places, always after the previous character has been dealt with and immediately before a choice point in the code. This code integrates the semantic actions (for example, accumulating the values of numbers) with the code to recognize numbers. So long as the properties of interest are simple -- as they are with numbers, this is easy, but in some cases, we might be interested in complex properties where the code for the semantics hides the code for the state machine.

The above code was written as if the lexical analyzer was the outer loop of the program. This is not a good idea. In most lanugages, the syntax analysis code makes a more appropriate main program, and our goal is to build a lexical analyzer we can call from the syntax analysis system. Our goal, in a production lexical analyzer, therefore, is to offer an interface to the world where a call to (for example) lex_advance() advances one lexeme through the stream of lexemes.

The syntax analyzer, therefore, will call lex_advance() each time it is ready to advance one step through the text of the program. In general, the syntax analyzer may need to look not only at the current lexeme, but also at some sliding window of neighboring lexemes. Many programming languages are described by grammars that require looking both at the current lexeme and its successor, for example. In such a case, lex_advance() will advance the window, an operation more complex than merely returning the current lexeme.

Our design effort must also define exactly what a lexeme is, as a data type returned by. That will be the goal of the next lecture.