Lecture 2, Lexical Structure

Part of the notes for CS:4980:1
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

A human-centered view

It is useful to consider how a person would process a program before trying to think about how it is done by a program. For this purpose, consider this program in our toy programming language, Kestrel.

putstr( "Hello world"+LF, output )

When a person literate in the Roman alphabet looks at this text, an important, almost unconscious processing step takes place: The text is seen not as a random pattern of letters on the page, but as a sequence of distinct words and symbols. This processing step is formally called lexical analysis, and the words and similar structures recognized at this level are called lexemes. When a person literate in programming languages reads the above code, they will likely break it into the following 8 lexemes (each enclosed below in square brackets):

[putstr] [(] ["Hello world"] [+] [LF] [,] [output] [)]

If the person knows the language in which the text is written, a second and still possibly unconscious processing step will occur: Lexical elements of the text will be classified into structures according to their function in the text. A literate programmer, regardless of whether they know this language, will probably see the parenthesized text as a parameter list containing two expressions separated by a comma, where these are parameters for a call to something called putstr.

In the case of a computer language, literate readers will see statements, expressions, terms and factors; in English, literate readers will see subjects, objects, verbs, and subsidiary phrases. In our language, Kestrel, there are blocks, declarations, statements and expressions. This level of analysis is called syntactic analysis, and is performed with respect to the grammar or syntax of the language in question.

Specifying the lexical structure

The analysis of the lexical structure of the text begins with classification of the characters in that text. If you look at the definition of the Kestrel language, for example, you will find that it begins with a description of the character set, where the following categories of characters are identified.

<tab>
<newline>
<break>
There are a variety of less common non-printing characters in this class.
<space>
<digit>
The usual digits, from zero to nine.
<letter>
The usual letters, both upper case and lower case.

These are then used to define a broader class of symbols:

<white space> ::= <space> | <tab> | <newline> | <break>
The white space in a program consists, obviously, of spaces, but may also include tab characters, newlines and a variety of other break characters.
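
A character classification rule like this maps directly onto a small predicate function. Here is a minimal sketch in C; the name iswhite is an assumption, not part of the Kestrel definition, and carriage return merely stands in for the less common break characters.

    /* sketch: nonzero if ch is in the <white space> class; the exact set
       of <break> characters is open, so carriage return stands in here */
    int iswhite( int ch ) {
        return (ch == ' ') || (ch == '\t') || (ch == '\n') || (ch == '\r');
    }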

In the above, we are using a notation called BNF (either Backus-Naur Form or Backus Normal Form, there is dispute about what BNF stands for). Each BNF rule defines a class of strings named on the left of the ::= sign in terms of the symbols or named classes of symbols on the right, with each alternative separated by a vertical bar.

This allows a top-level description of a Kestrel program as a sequence of lexemes, with optional white space separating the lexemes.

<kestrel program> ::= { { <white space> } <lexeme> }
A Kestrel program consists of a sequence of lexemes, separated by possibly empty sequences of white space characters.

The above definition is a bit dishonest, since it does not even mention the end-of-file mark at the end. In practice, the shortest possible program consists of an end-of-file mark, which we will classify as a kind of lexeme. The underlying system guarantees that there will never be multiple end-of-file marks, and that there will be exactly one at the end of every program, no matter how defective it may be. Note that Kestrel does permit an empty program consisting of just an end-of-file mark. Such a program does nothing.
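
The repetition in the rule above, together with the guaranteed end-of-file mark, suggests the overall shape of the main loop of a lexical analyzer. The following C sketch rests on hypothetical helper routines: nextch() reads one character, iswhite() is the predicate sketched above, and getlexeme() consumes one lexeme. None of these names come from the Kestrel definition.

    #include <stdio.h>

    extern int nextch( void );       /* hypothetical: read the next character */
    extern int iswhite( int ch );    /* the white-space test sketched above */
    extern int getlexeme( int ch );  /* hypothetical: consume one lexeme,
                                        return the character that follows it */

    /* sketch of the top-level scan: { { <white space> } <lexeme> } */
    void scanfile( void ) {
        int ch = nextch();                        /* first character of the file */
        for (;;) {
            while (iswhite( ch )) ch = nextch();  /* { <white space> } */
            if (ch == EOF) break;                 /* the <end of file> lexeme ends the scan */
            ch = getlexeme( ch );                 /* one <lexeme> */
        }
    }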

In the above, we used a commonly-used extension to BNF to indicate zero or more repetitions of a symbol (or sequence of symbols). This is, perhaps, the most common extension to BNF. Curly braces in this notation indicate repetition, so the notation a{b}c means the infinite set of strings including ac, abc, abbc, abbbc, etc. (In the original version of BNF, recursion was used to represent such sets.)

An abridged lexical specification for Kestrel begins as follows:

<lexeme> ::= <identifier> | <number> | <string>
          |  <punctuation> | <end of file>
<identifier> ::= <letter> { <letter> | <digit> }
An identifier consists of at least one letter, followed by zero or more additional letters and digits.
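
In code, the repetition in this rule again becomes a loop. The following C sketch assumes the hypothetical nextch() routine used above and leaves open what is done with the characters of the identifier; in a real compiler they would be accumulated and looked up.

    #include <ctype.h>

    extern int nextch( void );   /* hypothetical: read the next character */

    /* sketch: <identifier> ::= <letter> { <letter> | <digit> }
       on entry, ch is a letter; the return value is the first
       character after the identifier */
    int scan_identifier( int ch ) {
        do {
            /* accumulate or otherwise record ch here */
            ch = nextch();
        } while (isalpha( ch ) || isdigit( ch ));
        return ch;
    }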

<number> ::= <decimal number> [ '#' <extended number> ]
Numbers begin with a decimal number. This may be followed by a number sign (the # mark), followed by an extended number. In that case, the decimal number gives the number base, and the extended number gives the value. In the more common case, the decimal number gives the value.

<decimal number> ::= <digit> { <digit> }
A decimal number consists of a sequence of one or more digits. There's little reason to get formal about how those digits are interpreted, because this is a notation we all learned in elementary school.

<extended number> ::= <extended digit> { <extended digit> }
An extended number consists of a sequence of one or more extended digits, allowing radixes higher than 10. Obviously, use of an extended digit greater than permitted by the radix is an error.

<extended digit> ::= <digit> | <letter>
The extended digits include both letters and digits, thus allowing, for example, the number 16#3E8 as the hexadecimal representation of 1000. The same value in octal would be 8#1750.
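
Pulling these rules together, number scanning is the familiar multiply-and-add accumulation. The following C sketch again assumes nextch(); how the final value is delivered to the rest of the compiler, and how an out-of-range extended digit is reported, are deliberately left open.

    #include <ctype.h>

    extern int nextch( void );   /* hypothetical: read the next character */

    /* value of one <extended digit>: '0'..'9' give 0..9, 'A' or 'a' gives 10, and so on */
    int digitvalue( int ch ) {
        if (isdigit( ch )) return ch - '0';
        return toupper( ch ) - 'A' + 10;
    }

    /* sketch: <number> ::= <decimal number> [ '#' <extended number> ]
       on entry, ch is a digit; returns the first character after the number */
    int scan_number( int ch ) {
        int value = 0;
        while (isdigit( ch )) {                    /* <decimal number> */
            value = (value * 10) + (ch - '0');
            ch = nextch();
        }
        if (ch == '#') {                           /* [ '#' <extended number> ] */
            int radix = value;
            value = 0;
            ch = nextch();
            while (isalpha( ch ) || isdigit( ch )) {   /* <extended number> */
                /* a digit value >= radix should be reported as an error */
                value = (value * radix) + digitvalue( ch );
                ch = nextch();
            }
        }
        /* value now holds the number; deliver it to the rest of the compiler as needed */
        return ch;
    }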

Note the use of another extension to BNF above: Square brackets describe optional elements, so a[b] describes the set containing the strings a and ab. This is perhaps the second most common extension to BNF. In the original notation, this would have been described as a|ab.

We will stop here. For the complete description of the Kestrel lexical structure, refer to the official definition of the language. Notably, it includes quoted strings, and it includes a less than obvious (at first blush) system for number bases between 16 and 32.

The primary problem with BNF definitions is that they are wordy. In addition to the symbols of the language, we have introduced numerous metasymbols (with their names enclosed in angle brackets) such as <letter> and <digit>.

In fact, BNF notation is far more powerful than we need for describing a language at the lexical level. To see this, all we need to do is condense the above description by substituting the definition of each metasymbol for the use of that symbol in the text.

<lexeme> ::=

    <letter> { <letter> | <digit> }

|   ( <digit> { <digit> } ) [
      '#' ( <digit> | <letter> ) {
        ( <digit> | <letter> )
      }
    ]

|   <string>

|   <punctuation>

|   <end of file>

Note the use of another extension to BNF above: We have used parentheses to group subsidiary alternatives. Also, because of the complex nested structure of the collapsed text, we have begun to use line breaks and indenting to help the reader figure out which end paren goes with which begin paren.

The key thing to note in the above is that there is no recursion, just choices and loops. In fact, a complete, functional lexical analyzer can be written as "flat" code using nothing but if and while statements, with no need to look at more than one character at a time or to remember any previous characters. Formally, we say that for most languages, the lexical description of that language is not merely context-free, but susceptible to analysis by a finite-state machine.
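
Here is what that flat structure might look like, as a C sketch of the getlexeme() routine used in the earlier top-level loop. The scan_ routines are the sketches given above, scan_string() stands for a string scanner not shown here, and the punctuation case is only outlined; nothing in it is anything but an if-else chain, with while loops inside the scan_ routines.

    #include <ctype.h>

    extern int nextch( void );            /* hypothetical: read the next character */
    extern int scan_identifier( int ch );
    extern int scan_number( int ch );
    extern int scan_string( int ch );     /* hypothetical: strings are not shown here */

    /* sketch: classify and consume one <lexeme> starting at ch;
       returns the first character after the lexeme */
    int getlexeme( int ch ) {
        if (isalpha( ch )) {              /* <identifier> */
            ch = scan_identifier( ch );
        } else if (isdigit( ch )) {       /* <number> */
            ch = scan_number( ch );
        } else if (ch == '"') {           /* <string> */
            ch = scan_string( ch );
        } else {                          /* <punctuation> */
            /* record the punctuation mark */
            ch = nextch();
        }
        return ch;
    }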

We're still cheating just a bit here: This definition suggests the possibility that a file might contain multiple <end of file> lexemes, and this is impossible.