Lecture 6, Numbers and Punctuation

Part of the notes for CS:4980:1
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Code for a Lexical Analyzer

Typically, we expect the lexical analyzer to output (or, more accurately, return, possibly in a global variable) a sequence of lexemes, where each lexeme has a type (identifier, reserved word, integer, punctuation, etc.). The type field itself can be an enumeration type, accompanied by an integer value field whose interpretation depends on the lexeme type.

For integer lexemes, the value can simply be that integer. When the lexical analyzer first encounters an integer, it should set the value to zero, and then, as each digit is processed, the value should be multiplied by the radix before the digit is added in. That was easy, but the other lexeme types pose more interesting problems.
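The code fragments in the rest of this lecture refer to a lexeme type and to the globals lex_this and lex_next from the sliding window of Lecture 4. A minimal sketch of the declarations they assume might look like this; the names are taken from the fragments below, while the field widths and the abbreviated set of lexeme types are assumptions:

```c
#include <stdint.h>

/* lexeme types; a full analyzer would add identifiers, strings, etc. */
typedef enum { NUMBER, PUNCT } lex_types;

typedef struct lexeme {
        lex_types type; /* what kind of lexeme is this? */
        uint32_t value; /* its value; interpretation depends on type */
} lexeme;

/* the sliding window described in Lecture 4 */
lexeme lex_this; /* the current lexeme */
lexeme lex_next; /* the lexeme after it */
```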

We can start the job of writing a lexical analyzer rather quickly, so long as we focus only on the simplest of lexemes, decimal numbers:

static char ch; /* the current character not yet part of a lexeme */
static FILE * infile; /* the input file */

/* =BUG= lex_open needs to initialize ch and infile */

void lex_advance() {
        lex_this = lex_next; /* slide the window forward */
        while ((ch == ' ') || (ch == '\n') || (ch == '\t')) {
                /* skip whitespace */
                ch = fgetc( infile );
                /* =BUG= what if this hits an end of file? */        
                /* =BUG= how do we handle comments? */        
        }
        if ((ch >= '0') && (ch <= '9')) {
                /* decimal digit */
                lex_next.type = NUMBER;
                lex_next.value = 0;
                do {
                        /* accumulate value of digit */
                        lex_next.value = (lex_next.value * 10) + (ch - '0');
                        /* =BUG= what if there is an overflow? */

                        /* get next decimal digit */
                        ch = fgetc( infile );
                        /* =BUG= what if this hits an end of file? */        
                } while ((ch >= '0') && (ch <= '9'));
                /* =BUG= what if a # leads into an odd number base? */        
        } else {
                /* =BUG= what about identifiers, strings and punctuation? */        
        }
}

If you ignore all the parts that are missing above, the code for lex_advance() given here precisely follows the railroad diagram in the notes for Lecture 3, with just a bit of code at the very start to implement the sliding window described in Lecture 4.

Classifying Characters

Parts of the above code are ugly. Consider this:

while ((ch == ' ') || (ch == '\n') || (ch == '\t')) {

Wouldn't it be nicer to be able to write this?

while (ch ∈ WHITESPACE) {

To do this, we'll need some kind of character classification tool. The standard C and C++ libraries include a set of tools for this purpose in the <ctype.h> header: isspace(), isdigit(), isalpha(), ispunct() and others.

These classify characters using the lexical rules for C. While most of these are the same for most languages, it is worth considering the problem of how these are implemented, and we do have some characters that need different interpretations.

It is easy to build a fast and general purpose character classifier using a constant array indexed by character, where the array entry contains the basic class of that character. For our lexical analyzer, we are only interested in a few character classes: whitespace, letters, digits, punctuation and other. We'll use some 3-letter abbreviations for these in order to keep the text of the table from exploding:

typedef enum {
	OTHER=0, WHITESPACE=1, LETTER=2, DIGIT=4, PUNCTUATION=8
} char_type;

/* short definitions of character types used in the classifier table */
#define OTH OTHER
#define WIT WHITESPACE
#define LET LETTER
#define DIG DIGIT
#define PUN PUNCTUATION

/* character classifier table */
static const char_type char_class[256] = {
     /* NUL SOH STX ETX EOT ENQ ACK BEL BS  HT  LF  VT  FF  CR  SO  SI */
        OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,WIT,WIT,WIT,WIT,WIT,OTH,OTH,
     /* DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM  SUB ESC FS  GS  RS  US */
        OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,
     /*      !   "   #   $   %   &   '   (   )   *   +   ,   -   .   / */
        WIT,OTH,OTH,OTH,OTH,OTH,PUN,OTH,PUN,PUN,PUN,PUN,PUN,PUN,PUN,PUN,
     /*  0   1   2   3   4   5   6   7   8   9   :   ;   <   =   >   ? */
        DIG,DIG,DIG,DIG,DIG,DIG,DIG,DIG,DIG,DIG,PUN,PUN,PUN,PUN,PUN,OTH,
     /*  @   A   B   C   D   E   F   G   H   I   J   K   L   M   N   O */
        PUN,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,
     /*  P   Q   R   S   T   U   V   W   X   Y   Z   [   \   ]   ^   _ */
        LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,PUN,OTH,PUN,OTH,OTH,
     /*  `   a   b   c   d   e   f   g   h   i   j   k   l   m   n   o */
        OTH,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,
     /*  p   q   r   s   t   u   v   w   x   y   z   {   |   }   ~  DEL */
        LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,PUN,PUN,PUN,PUN,OTH,
     /* beyond ASCII */
        OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,
        OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,
        OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,
        OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,
        OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,
        OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,
        OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,
        OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH,OTH
};

/* get rid of short definitions */
#undef OTH
#undef WIT
#undef LET
#undef DIG
#undef PUN

The above table includes some odd classifications. For example, quotation marks are classified as other instead of punctuation. Perhaps we should have created a new category for quotes? Also, note that all of the punctuation marks that do not appear as the first characters of Kestrel punctuation lexemes are classified as other, not as punctuation.

It is worth asking, why did we make this a 256-entry table even though ASCII only has 128 characters? Because the C and C++ type char is 8 bits, so we need to allow for the possibility of a file that contains characters outside the ASCII range. In fact, Unicode, in the UTF-8 encoding, uses sequences of 2 or more bytes in this extended range to represent characters outside the ASCII range.

We could use the above table directly, but it's easier to use a character class test such as the following:

/* check whether a character is in a particular class other than OTH */
#define ISCLASS(ch,class) (char_class[(unsigned char)(ch)]&(class))

We can now rewrite the problem line from the start of this section as follows:

while (ISCLASS(ch,WHITESPACE)) {

If we want to test whether a character is a letter or digit, we could write ISCLASS(ch,LETTER)||ISCLASS(ch,DIGIT), but our assignment of power-of-two values to the enumeration allows this to be compacted to ISCLASS(ch,LETTER|DIGIT).

Where does this code go? The most obvious answer is in lexical.c since it is unlikely to be needed anywhere else, but this makes lexical.c larger, so we could put it in some subsidiary header file, perhaps charclass.h.

Punctuation

If all punctuation marks were single characters, the integer value of a punctuation mark lexeme could be, simply, the character code for that punctuation mark. The problem with this is that we have a number of punctuation lexemes that are built up from multiple characters. Consider, for example, comparison operators such as >= which might also be represented by the single character ≥ (2265₁₆ in Unicode). In UTF-8, this is a 3-byte string, with successive bytes encoded as E2₁₆ 89₁₆ A5₁₆.

In the syntax analysis code, we don't want to deal with alternative representations. This suggests that what we need is a single enumeration type that enumerates all of the legal punctuation in the language. The lexical analyzer's header file should export this enumeration just as it exports the enumeration giving the set of lexeme types.

The Kestrel documentation for the lexical level lists a number of punctuation marks, with comments indicating their purposes. Inside the compiler, we need names for these symbols. Consider, for example, writing C (or C++) code something like this in the lexical.h file:

typedef enum {
    PT_SEMI   /* ; */,   PT_EQUALS /* = */,   PT_COLON  /* : */, 
    PT_LPAREN /* ( */,   PT_LBRAKT /* [ */,   PT_LBRACE /* { */,
    PT_RPAREN /* ) */,   PT_RBRAKT /* ] */,   PT_RBRACE /* } */, 
    PT_COMMA  /* , */,   PT_ATSIGN /* @ */,   PT_ELIPS  /* .. */,
    PT_NOTEQL /* /= */,  PT_GT     /* > */,   PT_GE     /* >= */, 
    PT_LT     /* < */,   PT_LE     /* <= */,  PT_PLUS   /* + */,
    PT_MINUS  /* - */,   PT_TIMES  /* * */,   PT_DIV    /* / */, 
    PT_MOD    /* % */,   PT_AND    /* & */,   PT_OR     /* | */, 
    PT_NOT    /* ~ */,   PT_DOT    /* . */
} punct_type;

There is nothing sacred about this suggested enumeration, nor is it exhaustive. Punctuation marks are simply listed in the order in which they appear in the Kestrel lexical specification, and each is named in a manner that is perhaps long-winded but self-explanatory.

We can use something very similar to our char_class array to classify punctuation marks, allowing us to write code something like this in our lexical analyzer:

} else if (ISCLASS(ch,PUNCTUATION)) {
        lex_next.type = PUNCT;
        lex_next.value = punct_class[ch];
        ch = fgetc( infile );
        /* =BUG= what if this hits an end of file? */
        /* =BUG= what about 2-character punctuation marks? */
}
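A sketch of what such a punct_class table might look like, using C99 designated initializers; the PT_NONE filler value, the table size, and the subset of entries shown are all assumptions:

```c
/* map a single punctuation character to its punct_type code;
   PT_NONE (an assumed filler value) marks non-punctuation slots */
typedef enum {
        PT_NONE = 0, PT_SEMI, PT_EQUALS, PT_COLON, PT_LPAREN, PT_RPAREN,
        PT_COMMA, PT_PLUS, PT_MINUS, PT_TIMES, PT_DIV
} punct_type;

static const punct_type punct_class[128] = {
        [';'] = PT_SEMI,   ['='] = PT_EQUALS, [':'] = PT_COLON,
        ['('] = PT_LPAREN, [')'] = PT_RPAREN, [','] = PT_COMMA,
        ['+'] = PT_PLUS,   ['-'] = PT_MINUS,
        ['*'] = PT_TIMES,  ['/'] = PT_DIV
        /* remaining Kestrel punctuation marks would be filled in here */
};
```

Adding PT_NONE shifts the enumeration values relative to the earlier listing, which is harmless so long as the rest of the compiler refers only to the names.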

Overflow

Let's tackle a bug before we make more of them. Our first draft of the lexical analyzer included this code:

/* accumulate value of digit */
lex_next.value = (lex_next.value * 10) + (ch - '0');
/* =BUG= what if there is an overflow? */

What happens if a programmer writes a number larger than 2,147,483,647? That is the largest positive number that can be represented as a signed 32-bit integer, but since we're accumulating an unsigned value, we get to use the top bit, so this is no problem.

However, what happens if a programmer writes a number larger than 4,294,967,295? This is the largest value of type uint32_t. Fortunately, you don't have to memorize this value, although it's useful to remember it as being about 4 billion. You don't need to memorize it because it has the defined name UINT32_MAX in the header file <stdint.h>.

But what do you do to detect that a computation would overflow the integer representation you're using? C and C++ don't give you direct access to the condition codes, so you can't just do the addition and then see if it overflowed. Instead, you have to think ahead. What you want to do is detect that:

(value * radix) + digit > UINT32_MAX

But you want to do this without ever computing a value that the computer can't handle. We can do this by doing some algebra! Subtract digit from both sides of the above comparison, and it will still be true under the same circumstances:

(value * radix) > (UINT32_MAX - digit)

The multiplication could overflow, so we divide both sides by radix to get:

value > (UINT32_MAX - digit)/ radix

This, we can write into our code in order to detect and report the error:

if ( lex_next.value > ((UINT32_MAX - (ch - '0')) / 10) ) {
        /* =BUG= report number out of bounds */
} else {
        /* accumulate value of digit */
        lex_next.value = (lex_next.value * 10) + (ch - '0');
}

We haven't reduced our bug count, but we have made forward progress.