Note on Machine Problem 1

Part of the homework for 22C:50, Summer 2003
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

The solution to MP1 distributed to the class is the second one I produced. My first solution was good enough for the main part of the problem, supporting numbers in bases other than 10 and 16, but as I began to solve the auxiliary questions, the small duplication of code for the main part of the probem began to threaten to grow huge.

The basic problem was, as I solved the problem for too many digits in a number for base 10, I realized that I really ought to add similar code to base 16. As I solved the problem for an illegal hex digit, I realized I ought to add similar code to the code for arbitrary bases. This led me to a major rewrite. Here is my original code for lex_scan, where the main partof the problem was solved, along with many of the secondary parts:

void lex_scan()
/* scan for the next lexeme
   updates lex_this and lex_next as it advances one lexeme through the text
*/
{  
        lex_this = lex_next;
   
        /* set lexeme attributes to default values */
        lex_next.pos = line;
        lex_next.len = 0;
        lex_next.val = 0;
   
        if (pos == NULL) {
                lex_next.typ = endfile;
                return;
        }
   
        /* skip blanks */
        while (*pos == ' ') pos++;

        if ((*pos == '\0')
        ||  (*pos == ';')) {
                lex_next.typ = endline;
                return;
        }

        /* process nonblank lexeme */
        lex_next.pos = pos;
        if (isalpha( *pos )) {
                lex_next.typ = identifier;
                lex_next.val = (unsigned int)SYM_NOHASH;
                do {
                        lex_next.val = (unsigned int)sym_hash( *pos,
                                (SYM_HANDLE)lex_next.val );
                        lex_next.len++;
                        pos++;
                } while (isalnum( *pos ) || (*pos == '_'));
                lex_next.val = (unsigned int)sym_find(
                        lex_next.pos, lex_next.len,
                        (SYM_HANDLE)lex_next.val );
                return;
        }
        if (isdigit( *pos )) {
                lex_next.typ = number;
                do {
/*V1.0*/                int digit;
/*V1.0*/                if (lex_next.val > (UINT_MAX / 10)) {
/*V1.0*/                        lex_error( &lex_next, "number too large" );
/*V1.0*/                } else {
/*V1.0*/                        lex_next.val = lex_next.val * 10;
/*V1.0*/                }
/*V1.0*/                digit = (unsigned int)((int)*pos - (int)'0');
/*V1.0*/                if (lex_next.val > (UINT_MAX - digit)) {
/*V1.0*/                        lex_error( &lex_next, "number too large" );
/*V1.0*/                } else {
/*V1.0*/                        lex_next.val = lex_next.val + digit;
/*V1.0*/                }
/*V1.0*/                /* the above 13 lines replaces the following from V0.0:
\*V1.0*\
                        lex_next.val = lex_next.val * 10
                                + (unsigned int)((int)*pos - (int)'0');
\*V1.0*\
\*V1.0*\                *  the added code catches oversized numbers
\*V1.0*\                *  and we really ought to add it to the code
\*V1.0*\                *  for all different radixes, not just decimal!
\*V1.0*\                *
\*V1.0*\                *  This would force number evaluation
\*V1.0*\                *  into a subroutine to avoide repeating this
\*V1.0*\                *  mess three times over.
\*V1.0*\                */
                        lex_next.len++;
                        pos++;
                } while (isdigit( *pos ));
/*V1.0*/
/*V1.0*/        if (*pos == '#') { /* number is in extended radix */
/*V1.0*/                int radix = lex_next.val;
/*V1.0*/                pos++;
/*V1.0*/                if ((radix < 2) || (radix > 36)) {
/*V1.0*/                        lex_error( &lex_next, "bad radix" );
/*V1.0*/                        radix = 36;
/*V1.0*/                }
/*V1.0*/                lex_next.val = 0;
/*V1.0*/                while (isalnum( *pos )) {
/*V1.0*/                        int digit;
/*V1.0*/                        if (isdigit( *pos )) {
/*V1.0*/                                digit = (int)*pos - (int)'0';
/*V1.0*/                        } else if (isupper( *pos )) {
/*V1.0*/                                digit = 10 + (int)*pos - (int)'A';
/*V1.0*/                        } else /* islower( *pos ) */ {
/*V1.0*/                                digit = 10 + (int)*pos - (int)'a';
/*V1.0*/                        }
/*V1.0*/                        if (digit > radix) {
/*V1.0*/                                lex_error( &lex_next,
/*V1.0*/                                        "bad digit in number" );
/*V1.0*/                                digit = 0;
/*V1.0*/                        }
/*V1.0*/                        /* The above check on digits within bounds
\*V1.0*\                         * should also be added to the original code
\*V1.0*\                         * hex numbers below.
\*V1.0*\                         * This would force number evaluation
\*V1.0*\                         * into a subroutine to avoide repeating this
\*V1.0*\                         * mess twice over.
\*V1.0*\                         */
/*V1.0*/                        lex_next.val = lex_next.val * radix + digit;
/*V1.0*/                        pos++;
/*V1.0*/                }
/*V1.0*/        }
/*V1.0*/
                return;
        }
        if (*pos == '#') {
                lex_next.typ = number;
                pos++;
                while (isxdigit( *pos )) {
                        lex_next.val = lex_next.val << 4;
                        if (isdigit( *pos )) {
                                lex_next.val = lex_next.val
                                + (unsigned int)((int)*pos - (int)'0');
                        } else if (isupper( *pos )) {
                                lex_next.val = lex_next.val
                                + (unsigned int)(10+(int)*pos-(int)'A');
                        } else { /* islower( *pos ) */
                                lex_next.val = lex_next.val
                                + (unsigned int)(10+(int)*pos-(int)'a');
                        }
                        lex_next.len++;
                        pos++;
                }
                return;
        }
        lex_next.typ = punc;
        lex_next.len = 1;
        lex_next.val = (unsigned int) *pos;
        pos++;
        return;
}

The comments in the above code describe the various messes that led me to write the second version that is posted as the official solution. If I were grading this code, I'd give the above version full credit, but only because it included self-critical comments about what really ought to be done. The official solution fixes all the little problems, not just some of them. As a result, it required changes to code outside of the lexical analyzer:

The file objectcode.c was changed so that it would never print bytes as more than 2 hex digits, and never print words as more than 4 hex digits.

The file parser.c was changed so that it would report an error if the operand on a byte or word directive was out of bounds for 8 or 16 bits. This was subtle for two reasons. First, the value field of an OBJECT_VALUE is an unsigned integer, while it is perfectly reasonable to put out negative values in the object code. When represented in an unsigned integer, negative numbers look just like oversized positive numbers, so the value of the expression 0-1 seems to be greater than 255. Therefore, I couldn't simply reject oversized values.

In solving the above problem, I could have simply assumed that unsigned integers in C are 32 bits, and that would have worked. In that case, I would have rejected bytes if they were greater than 0xFF and less than 0xFFFFFF00. The problem is, the official definition of C doesn't say that unsigned integers are 32 bits, it says that their length is implementation defined but guaranteed to be at least 16 bits. So, I went to the official definition of standard C and found that there is a special header file <limits.h> that defines the symbol UINT_MAX as the maximum allowed value of an unsigned integer on this particular machine. So, instead of 0xFFFFFF00, I used UINT_MAX minus 0xFF as the cutoff between legal unsigned representations of negative values and illegal oversized positive numbers. This is ugly, but I suspect that all solutions to this problem are ugly.