Compiler Construction

The parameter argc gives the number of command line arguments. It will be non-negative. If the program is launched from one of the standard shells, argv[0] will be a pointer to the program name, as it appeared on the command line, and argc will be at least one, since the program name counts as the first argument. If a program was directly launched by another application, for example, using the execve() system call, things are a bit murkier. Bombproof code ought not make too many assumptions here.

The obvious command line argument for a compiler is the name of the input file. If this name is absent, it is also natural for a compiler to read from standard input. The other obvious arguments are all likely to be optional, things such as the name of the output file and flags to control optional behaviors or override defaults.

In general, users have difficulty with positional parameters to shell commands, particularly when there are command-line options that are rarely used and must appear in a particular position relative to each other. This leads to problems parsing command line options.

Finally, there are environment variables. In many cases, large applications have a hierarchy of default behaviors. For example, there might be a -xxx command line option to set some behavior. If this option is missing, the environment variable XXX is used to set that behavior, and if that is missing, a hard-coded behavior is adopted. Environment variables are inspected with the getenv() library routine, found in <stdlib.h>.

The code to deal with the above issues is big enough that it makes sense to devote the main program of any large application to this problem and nothing else. Once the main program has parsed the command line parameters and set the defaults, it calls the initializer routines of the rest of the application and then launches the application.

Our goal, for a Kestrel compiler, is to be able to launch the compiler with commands like these:

kestrel infile.k -o outfile.s  # the most obvious
kestrel infile.k               # implicit output to infile.s
kestrel -o outfile.s           # input from stdin, output to outfile.s
kestrel                        # input from stdin, output to stdout
kestrel infile.k -o -          # input from named file, output to stdout
kestrel - -o -                 # explicit use of stdin, stdout

Here we assumed that Kestrel source code would appear in files with the .k extension, but the compiler may not even check the particular extension that it is given, except perhaps to make sure that it is not using the same file for input and output. You might equally well use .kes as the source file extension.

Command line syntax is easy to forget, so it would be nice to support these kinds of variants:

kestrel infile.kes -ooutfile.s     # leaving out the space shouldn't matter
kestrel infile.kes -o=outfile.s    # using an equals sign should work
kestrel -o outfile.s infile.kes    # the order shouldn't matter

And, of course, any command with complex arguments ought to provide a brief help message on demand.

kestrel -help
kestrel -?

This help message should be longer than the normal error message, but much shorter than the man page or a full manual for the compiler. It needn't list nonstandard forms of the arguments, but it should at least mention the available arguments. And, of course, since some common shell commands understand -? as a request for help, while others understand -help, the program should accept both and not force the user to remember how to ask for help.

Communicating the Parameter Settings

There are two obvious ways for the main program to communicate with the rest of the application: passing parameters and setting global variables. Both approaches have their value. If we create main.h as the interface between the main program and the rest of the application, this can hold all of the options. Here is an example:

/* main.h -- main program interface specification */

/* Prerequisites for use:
 *   In main.c, but nowhere else,
 *     EXTERN must be defined first
 */

#ifndef EXTERN
        #define EXTERN extern
#endif

/* Information from the command line */

EXTERN const char * main_progname; /* program name */
EXTERN const char * main_infile;   /* input file name, NULL if stdin */
EXTERN const char * main_outfile;  /* output file name, NULL if stdout */

/* any additional command line option values go here */

#undef EXTERN

This design allows the main program to simply set the options, without knowing what parts of the application rely on them. In the code for some part of the application, if that code needs, for example, the input file name, it would include main.h and then reference the global variable main_infile.

These global variables are declared as pointers to constant strings (that is what the type const char * means) because the strings in question are never intended to be editable by users of these pointers. (In contrast, note that the type char * const refers to a constant-valued pointer that can never be changed after it is initialized.)

The alternative is to have the main program parcel out the different options and defaults, passing them as parameters to each of the initializers. This requires that the main program have significant knowledge of the rest of the application, knowing which command line options matter to each of the different components of the application.

Bullet Proof Code

As already mentioned, the standard Unix shells make useful guarantees about argv and argc, but Unix programs can also be launched from other applications using execve(), and if the software is moved to non-Unix environments, behavior may vary even more. About the only guarantee it is safe to accept is that argc is the count of consecutive non-null entries at the start of argv. This forces something like the following code to deal with the program name:

        /* first, deal with the program name */
        if ((argc > 0)           /* Unix/Linux shells guarantee this */
        &&  (argv[0] != NULL)) { /* under Unix/Linux, above implies this */
                main_progname = argv[0];
                if (main_progname[0] == '\0') { // if nonstandard exec
                        main_progname = default_name;
                }
        } else { /* nonstandard exec might even do this */
                main_progname = default_name;
        }
        /* assert: program name is now well defined */

The same logic can be written more compactly by setting the default first:

        /* first, deal with the program name */
        main_progname = default_name;
        if ((argc > 0)              /* Unix/Linux shells guarantee this */
        &&  (argv[0] != NULL)       /* under Unix/Linux, above implies this */
        &&  (argv[0][0] != '\0')) { /* nonstandard exec could do this */
                main_progname = argv[0];
        }
        /* assert: program name is now well defined */

Here, we have accounted for several possibilities: That no arguments were provided, that a non-Unix system passed nonstandard arguments, and that the first argument, argv[0] might have been empty after the program was launched in a nonstandard way. These are unlikely, but we are taking no chances.

So who needs to know the program name? Error messages. It is good practice to have the error message identify the program that produced it, because that program may be one of several that are reporting errors. The Kestrel compiler can obviously set its default name to something like "kestrel", but if it was installed as the kest command, it ought to automatically prefix its error messages with kest, and if it was installed with a short name like kc, it ought to prefix messages with that. By taking the name from the command line, the compiler reports the name by which it was actually invoked.

Parsing the Argument Vector

Handling the above is sufficiently complex to illustrate a basic mechanism that can be generalized to handle a wide variety of command-line arguments. Here is suggested code to handle just the above options:

/* set argument strings to indicate that they have not been set */
main_infile = NULL;  // this means read from stdin
main_outfile = NULL; //            write to stdout
isinfile = false;    // indicates that no input file has been set

/* then deal with the command line arguments */
i = 1; /* start with the argument after the program name */
while ((i < argc) && (argv[i] != NULL)) { /* for each arg */
	const char * arg = argv[i]; /* this argument */
	char ch = *arg;     /* first char of this argument */

	if ( ch == '\0' ) {
		/* ignore empty argument strings */
	} else if ( ch != DASH ) {
		/* arg not starting with dash is the input file name */
		if (isinfile) { /* too many input files given */
			er_fatal( ER_EXTRAINFILE, 0 );
		}
		main_infile = arg;
		isinfile = true;
	} else {
		/* command line -option */
		arg++; /* skip over the leading dash */
		ch = *arg; /* first char of argument */

		if (ch == '\0') { /* - by itself */
			/* ... meaning read stdin */
			if (isinfile) { /* too many input files specified */
				er_fatal( ER_EXTRAINFILE, 0 );
			}
			isinfile = true;
		} else if (ch == 'o' ) { /* -o option */

			/* =BUG= code to parse -o option goes here */

		/* put code to parse other options here */

		} else if (!strcmp( arg, "help" )) { /* -help */
			er_help();

		} else if (!strcmp( arg, "?" )) { /* -? (alternate help) */
			er_help();

		} else {
			er_fatal( ER_BADARG, 0 );

		}
	}      
	i++; /* advance to the next argument */
}

In the above code, we used the character DASH as the lead character on command line options. When installed on Unix-based systems, this should be defined as a dash, since the Unix convention is to use dashes as the lead-in to command line options. The DOS/Windows command line convention uses a slash, so a simple change of the definition of DASH can be used to create a DOS-style version of the code. (Historical note: It was the established use of / as a command-line option character that forced Microsoft DOS to use \ as its path separator in file names when (as an afterthought) the DOS file system was extended to support a hierarchical directory structure; this happened in DOS version 2.0 in 1983.)

Why was the above code written as a while loop and not a for loop, as suggested by the comment on the while loop header? It is bad style to modify the loop index variable inside the body of a for loop, and some programming languages actually forbid this, making the loop index a constant within the body of the loop. By making it a while loop, we explicitly make it clear that the index variable may be modified in the loop body, and in a moment, we'll be doing so.

Why call er_help() to output the help message? We could have just output it directly here but, like all error messages, it should be easy to localize to a different language if you export the program to a different part of the world. Putting all output message generation in the same place gives the translator just one place to look for the material that needs to be localized. Of course, just like other error messages, the help message should automatically use the name by which the program was installed, whatever that may be.

The above code sets the default input file before it parses the argument list. On a Linux system where the program is installed as the kestrel command, this code will accept all of the following as requests to read from standard input:

kestrel
kestrel -
kestrel /dev/stdin

The third option above is a bit of a surprise and works slightly differently from the other two. In the third case, the program will not read from the already opened standard input file, but will, instead, take /dev/stdin as the name of a file and open it. Usually, this will be exactly the same as reading from the already open standard input file — except if the current user does not have read permission on that file.

(Aside: It takes a fairly deep understanding of Unix security to find the circumstance where an application can read the already open standard input file, but cannot open that file. This occurs if the application was stored in a file with the setuid or setgid bits set in the access rights for that file, so the application runs in a different protection domain from the user who launched it. This means that there can be files that the user can open and pass, as open files, to the application, but the application cannot open them itself.)

The above code uses an auxiliary Boolean variable, isinfile, to detect multiple specifications of the input file. Consider the following:

kestrel this that
kestrel - -
kestrel - this
kestrel that -

All of these will result in calls to er_fatal() to report that there was an extra input file specification. In fact, had there not been a convention that a null pointer meant "read from standard input", there would be no need for the auxiliary Boolean variable; instead, the code could have simply tested to see if main_infile was already non-null.

Now, suppose we wanted to support a -o option to set the output file name. There are two obvious ways such an option could be supported, -o outfile and -ooutfile. In the former case, the output file name is the next sequential parameter after the option specifier, while in the latter case, the option and its argument are concatenated. In fact, the standard Unix C compiler uses both forms. For specifying output files, cc and gcc traditionally use -o file, while for specifying libraries to be used, the traditional form is -lfile. Some GCC options also use the form -option=value, suggesting the possibility of -o=file.

The problem with this proliferation of different ways of indicating arguments to command-line options is that programmers frequently forget, for each option, how the argument to that option is formatted. If only we could start over from scratch and adopt a uniform notation, this would be much easier, but as things stand now, perhaps the best we can do is to uniformly support all three obvious ways of specifying arguments to command line options. The following code, intended to be added to the command-line option parsing code above, does this:

} else if (ch == 'o') { /* -o outfile or -o=outfile or -ooutfile */
	if (isoutfile) { /* too many output files */
		er_fatal( ER_EXTRAOUTFILE, 0 );
	}

	arg++; /* strip off the letter o */
	ch = *arg;
	if (ch == '\0') { /* -o filename */
		i = i + 1; /* the file name is the next argument */
		if ((i >= argc) || (argv[i] == NULL)) { /* nothing follows -o */
			er_fatal( ER_MISSINGFILE, 0 );
		}
		main_outfile = argv[i];
		isoutfile = true;
	} else { /* -ofilename or -o=filename */
		if (ch == '=') {
			arg++; /* strip off the equals sign */
			ch = *arg;
		}
		if (ch == '\0') er_fatal( ER_MISSINGFILE, 0 );
		main_outfile = arg;
		isoutfile = true;
	}
	if (!strcmp( main_outfile, "-" )) { /* a lone dash means stdout */
		main_outfile = NULL; /* assumes er_fatal() never returns */
	}

/* code to parse other command line options goes here */

This code assumes that main_outfile is initially null and that isoutfile is initially false. After the argument list is parsed, if main_outfile is still null and isoutfile is still false, the output file name should be synthesized from the input file name. If isoutfile is true, the output file has either been specified or explicitly forced to be standard output (still null). The code to synthesize the file name can be quite complex:

if ((!isoutfile) && (main_infile != NULL)) { /* compute output file */
	/* =BUG= must write code */
}

Some fun code has been omitted above to construct an output file name from the input file name. Suppose the output is in assembly code, and you are using the .s suffix to indicate assembly language files. If the source file name is file.kes, the output file should be named file.s, while if the input file name is just file with no dotted suffix, the output file should be named file.s. This means you have to pick apart the file name to see if it has a suffix before appending your suffix.

Note: There is a suite of command-line option parsing tools hiding in the GNU C library. The routines getopt() and getsubopt() seem very flexible, at least for single letter options. The question to ask is, could you find the getopt package, digest its documentation, and write code using it faster than you could write code such as the above? That question identifies one of the chief barriers to code reuse.

Lecture 17, C and C++ Main Programs
