5 -- Pulling Apart a Shell

22C:112 Notes, Spring 2011

Part of the 22C:112, Operating Systems Notes
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

A Shell Language

The standard Unix shells all parse lines of input into blank-delimited chunks, so if the user types

        /bin/ls -l myfile

the 3 parts are

        a) /bin/ls
        b) -l
        c) myfile

Part a) is interpreted as a file name. The standard shells do far more than merely attempt to execute the program in this file. They also try appending the file name provided by the user to each of several prefixes provided in the search path (you can look at your current search path by typing the shell command echo $PATH). This is why, when you type ls as a shell command, it successfully executes /bin/ls, the utility for listing a directory. Initially, our example shell completely ignores the concept of a search path and merely attempts to execute the program in the file given, so a user wishing to list a directory would need to type the full file name.

Parts b) and c) are arguments to the program. In this case, the -l argument is a command-line option, requesting a long-form listing, and the myfile argument is the name of the file (or directory) that the /bin/ls program is to list the attributes of.

A Shell

The example shell at http://homepage.cs.uiowa.edu/~dwjones/opsys/notes/mush.txt is quite minimal. Its main loop is

        for (;;) {
                getcommand();
                parseargv();
                launch();
        }

The subroutine getcommand() reads one command from standard input into the global string command (recall that in C, strings are just arrays of characters with a NUL character marking the end of each string).

The parseargv() routine finds the start of each blank-delimited component of command and sets one pointer in the global array argv to point to that component. It also plants a NUL at the end of each component; as a result, at the end of this process, argv is an array of pointers to strings, where each string is one component of the original command string.

The argv data structure used by the program is dictated by the conventions for launching applications in Unix. When a Unix program wants to transfer control to a different program, it passes an array of parameter strings. The first parameter to a program is always the name of the program. The number of parameters is indicated by a NULL pointer marking the end of the parameter array.
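
As a small illustration of this convention, the following sketch counts the entries in an argument array by scanning for the NULL pointer that marks its end. The name count_args() is hypothetical; it is not part of the example shell.

```c
#include <stddef.h>

/* Count the strings in a NULL-terminated argument array.
   count_args is a hypothetical helper, not taken from mush. */
int count_args(char *const argv[])
{
        int n = 0;
        while (argv[n] != NULL) n++;   /* stop at the terminating NULL */
        return n;
}
```

Given the array { "/bin/ls", "-l", "myfile", NULL }, this returns 3. The NULL marker itself is not counted.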

The launch() routine is responsible for actually launching the application. It uses the execve() kernel call to transfer control to the new application. This takes 3 arguments: the name of the file from which the application is to be run (for example, /bin/ls), the argument array, and the environment. The code given here is very stupid: it does not check to see if there is a command, it merely calls execve(argv[0], ... ). In all of the more interesting shells, there are a number of built-in commands, so before calling execve(), the shell would have to check if the command name is built into the shell. The most obvious built-in command is exit, the command that tells the shell to terminate, but if the shell supports control structures, commands such as if and while are also typically built in.

The Unix system call execve is central to the shell. To launch a non-builtin command, the shell calls:

        execve( name, argv, environ );

Here, name is the name of the file to be executed, argv is the array of argument strings, and environ is a second array of strings that makes up what Unix calls the environment of the program. These two arrays are both parameters, but the intent of Unix is that the argv contain parameters, while environ contains an array of name/value pairs that behave like global variables. This is merely a convention, in the sense that an application that wanted to abuse the argument list or the environment is free to do so. Our example shell completely ignores the environment.

The Search Path

The standard Unix shells use an environment variable, $PATH, to decide what file names to check for commands. Here is a typical structure for $PATH:

        echo $PATH
        /usr/local/bin:/bin:/usr/bin:/space/jones/bin:.

This means, first look for the command in /usr/local/bin and then look in /bin before looking in /usr/bin. Finally, the path checks my personal binary directory, /space/jones/bin, before trying the current directory, indicated by a period. Components of the path are separated by colons, and each component should be a directory name.

To use components of the path, the program needs to get the path. The main program in our example shell had no parameters, or rather, none were declared. All Unix systems actually pass 3 parameters, so we should have declared the main program as:

        int main(int argc, char **argv, char **envp)

The first parameter, argc, is the count of arguments in the argv array that was passed to the main program by the call to execve made by whoever executed the main program. The second parameter is a pointer to argv, the array of arguments, where each array entry is a pointer to a character string. We could have declared this in either of two ways, which are equivalent because an array parameter in C is really passed as a pointer to its first element:

        char **argv,    /* a pointer to a pointer to a character */
        char *argv[],   /* an array of pointers to characters */

A two-dimensional array of characters is not equivalent. A declaration such as char argv[][80] describes rows of characters stored contiguously, not an array of pointers, and C does not accept char argv[][] with both dimensions omitted.

The first parameter is actually redundant since we could just count the entries in the array until we find a NULL pointer. The third parameter is a pointer to the environment (which is, itself, an array of pointers to the strings defining that environment). Curiously, the global variable environ is also a pointer to the environment, so there is no need for this third parameter to the main program. This is further evidence of the unplanned evolution of C.

One of the strings in the environment is PATH, the search path. The C standard library has a utility function getenv() that can be used to look up the values of variables in the environment. So, we could find the search path using:

        char * path = getenv( "PATH" );
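
To make the structure of the environment concrete, here is a rough sketch of the kind of search getenv() performs, written against an explicitly passed array instead of the global environ. It assumes the usual convention that each environment string has the form NAME=value. The name env_lookup() is hypothetical, and real implementations of getenv() may differ in detail.

```c
#include <stddef.h>
#include <string.h>

/* Search a NULL-terminated array of "NAME=value" strings for name.
   Returns a pointer to the value (inside the original string, not a
   copy), or NULL if the name is not defined.
   env_lookup is a hypothetical helper used only for illustration. */
const char *env_lookup(char *const envp[], const char *name)
{
        size_t len = strlen(name);
        for (int i = 0; envp[i] != NULL; i++) {
                /* the name must match exactly and be followed by '=' */
                if (strncmp(envp[i], name, len) == 0 && envp[i][len] == '=')
                        return envp[i] + len + 1;
        }
        return NULL;
}
```

Note that the result points into the environment itself, so editing it (as strtok() would) edits the environment.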

The C standard library contains a wide variety of routines for working with character strings. The shell command man 3 string will list them all with no additional information. The command man 3 xxx, where xxx is the name of one of these string functions, will give the official manual page for that function. The 3 in man 3 xxx is there because we are only asking for information from section 3 of the manual (the standard library). Section 1 describes shell commands, while Section 2 describes the system calls provided by the operating system kernel.

One of the standard string functions, for example, can be used to pick off successive tokens from a string. Because the strings in the path are delimited by colons, we could get the first entry from the path as follows:

        char * entry = strtok( path, ":" );

If the above code had been written in Java or Python, the string returned by strtok() would be a copy of the first token found in the string path. The C string package does not work this way! Instead, strtok() actually edits path by planting an end-of-string mark in place of the terminator on the first token and then returns a pointer to the start of that token. This is quite dangerous. Here is an example of some somewhat safer code using the C string package:

                char name[LINELEN+1];
                strncpy( name, "/usr/bin/", LINELEN);
                strncat( name, argv[0], LINELEN - strlen(name));
                execve( name, argv, environ );

The above code first allocates a local variable, name, an array of characters big enough to hold one line plus a terminating NUL character, and then copies a component of the search path into this buffer using the strncpy() function. Having done this, it then uses strncat() to append the command name the user typed. The strncat() function takes an argument giving the number of characters it is permitted to append, which is the length of the buffer minus the length of the string actually stored in that buffer.

Finally, the above code fragment attempts to launch the application. If successful, it will never return because execve only returns if it is unsuccessful in launching the application. The above code could be replicated for each name on a constant-valued search path, or put into a loop to successively try components of the search path taken from the environment.
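
The loop over path components might be sketched as follows. This is only one possible approach, under several assumptions: the function name find_in_path() is hypothetical, it uses strcspn() and snprintf() instead of strtok() and strncat() so that the path string is never edited, and it uses access() to test whether a candidate file is executable instead of simply attempting execve() on each candidate. Empty path components are skipped here, although some shells treat them as the current directory.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Try each colon-separated directory in path, looking for an
   executable file named cmd.  On success, the full file name is left
   in out and 0 is returned; if nothing is found, -1 is returned.
   find_in_path is a hypothetical helper, not taken from mush. */
int find_in_path(const char *path, const char *cmd, char *out, size_t outlen)
{
        const char *start = path;
        while (*start != '\0') {
                /* length of this component, without modifying path */
                size_t len = strcspn(start, ":");
                if (len > 0) {
                        /* build "directory/cmd" in the caller's buffer */
                        int n = snprintf(out, outlen, "%.*s/%s",
                                         (int)len, start, cmd);
                        if (n > 0 && (size_t)n < outlen
                            && access(out, X_OK) == 0)
                                return 0;
                }
                start += len;
                if (*start == ':') start++;   /* step over the colon */
        }
        return -1;
}
```

A shell could call this with the value of getenv("PATH") and argv[0], and then pass the resulting name to execve().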

Launching an application

The code to launch an application under operating systems descended from Unix is strange. A more rational design might have the exec() system call run some other application and then return when that application terminates. Instead, the Unix exec() system calls (execv() and execve()) are semantically equivalent to goto statements, permanently abandoning the caller's program as they start the new application. Nothing in the documentation for these system services hints at a way for the caller to retain control while waiting for a subsidiary application to return.

The reason Unix does things this way is a consequence of a second design decision, the way Unix handles parallel process creation. Where a more rational design might connect the launching of a parallel process with starting a new application -- so an application could be started in parallel with the current application or executed sequentially after the current one -- Unix opted to allow a process to fork, that is, to create a copy of itself running in parallel with the original.

The Unix fork() system service causes the calling process to be duplicated. Conceptually, the new process is a copy of the caller, with every bit of the caller's memory duplicated. In fact, no read-only data is duplicated -- it is read only, so it can be shared by both the original process and its copy. Furthermore, modern memory management mechanisms allow copying to be limited to those parts of the process's read-write memory that are actually changed, so if a read-write variable is not changed, no copy will be made. We will discuss this later in our discussion of virtual memory technology.

The Unix fork() system call creates only one difference between the original process, the caller or parent process, and the new process, the child process: the value that fork() returns. For the caller, fork() returns the process ID of the child, which is always nonzero. For the child, it returns zero. (If no child could be created, fork() returns -1 to the caller.)

The parent can call the wait() system call to wait for the child to exit. The wait() system call waits for any child process of the parent to terminate, and when that child terminates, it returns the process ID of the child. It can, optionally, also capture the exit status of the child.

The exit() system call terminates a process. Applications normally terminate by calling exit(), and if the child does not launch an application, the child itself should exit. This leads to the following framework for launching an application from within another application running under Unix:

        int pid; /* the process ID of the child */
        int status; /* the exit status of the child */
        if ((pid = fork()) == 0) {
                /* child process */
                execve( file, argv, environ );

                /* control reaches this point only if the execve fails */
                exit(-1);
        } else {
                /* parent process */
                while (wait( &status ) != pid);
        }
        /* the child has terminated with status indicating how */

The argument to exit() is an integer. If a pointer to an integer is passed to the wait() system call (as in wait(&status) above), 8 bits of the status argument passed to exit() are packed into the referenced integer, along with other information. There is a collection of macros that can be used to extract this information. For example, WEXITSTATUS(status) will extract the 8-bit status itself, while WIFEXITED(status) will return true if the child terminated by calling exit() (it could have terminated by a run-time error or by being killed).
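
The framework and the status macros can be combined into one function, sketched below. Several details are assumptions and not part of the example shell: the name run_and_wait() is hypothetical, waitpid() is used in place of the wait() loop so that only the one child is awaited, the child calls _exit() so that it does not flush the parent's buffered output a second time, and the value 127 for a failed execve() follows a common shell convention.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Launch file with the given arguments and wait for it to finish.
   Returns the child's 8-bit exit status, or -1 if the child could not
   be created or did not terminate by calling exit().
   run_and_wait is a hypothetical helper, not taken from mush. */
int run_and_wait(const char *file, char *const argv[])
{
        int status;
        pid_t pid = fork();
        if (pid < 0) return -1;       /* fork failed, there is no child */
        if (pid == 0) {
                /* child process */
                execve(file, argv, environ);
                _exit(127);           /* reached only if execve fails */
        }
        /* parent process: wait for this particular child */
        while (waitpid(pid, &status, 0) < 0) {
                if (errno != EINTR) return -1;
        }
        if (WIFEXITED(status)) return WEXITSTATUS(status);
        return -1;                    /* killed or otherwise abnormal */
}
```

For example, running /bin/sh with the arguments -c and "exit 3" through this function should yield 3.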

Input Parsing

The example shell uses parseargv() to separate one line of input into an array of arguments. The buffer to hold the text, command, is passed as a global variable, along with argv, the array where pointers to successive arguments will be put. We could have shortened parseargv() considerably by using the strtok() routine mentioned above. Here, we avoided doing so simply to show all of the computation in one place instead of requiring the reader to understand a complex subroutine library.

The parseargv() function loops through the arguments. For each argument, it first skips blanks, then remembers the address of the argument with &command[j], where the ampersand means "get me the address of the variable, not its value." Having found the start of an argument, it then scans for a blank or end of string, and then replaces the blank at the end of the argument with a NUL. The result is the replacement of a single string with an array of strings without moving any of the text.
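
The steps just described can be sketched as follows. This is not the code from the example shell: it takes the buffer and the array as parameters where the original uses global variables, the name split_args() is hypothetical, and the limit maxargs is an added safeguard against overflowing the array.

```c
#include <stddef.h>

/* Split command in place into blank-delimited arguments.  Pointers to
   the arguments are stored in argv, followed by a NULL pointer; the
   number of arguments is returned.  At most maxargs - 1 arguments are
   stored, and maxargs must be at least 1.
   split_args is a hypothetical stand-in for parseargv(). */
int split_args(char *command, char *argv[], int maxargs)
{
        int i = 0;   /* index of the next free slot in argv */
        int j = 0;   /* index of the current character in command */
        while (i < maxargs - 1) {
                while (command[j] == ' ') j++;       /* skip blanks */
                if (command[j] == '\0') break;       /* no more arguments */
                argv[i++] = &command[j];             /* remember the start */
                while (command[j] != ' ' && command[j] != '\0') j++;
                if (command[j] == '\0') break;       /* end of the line */
                command[j++] = '\0';                 /* plant a NUL */
        }
        argv[i] = NULL;   /* mark the end of the argument array */
        return i;
}
```

Because the pointers refer into the original buffer, the buffer must remain intact for as long as argv is in use.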

Files

In the above discussion, we ignored the question of open files. By default, when a process forks, both parent and child processes share all open files. By default, when a program uses some flavor of exec() to launch another program, the files that were open in the caller remain open in the new program. As we will see later, this has interesting security consequences.

It is possible to mark files to be automatically closed when the program does an exec(), but this is an unnecessary feature of Unix and its descendants, since a program that has open files ought to know about them, and knowing about them, the program could explicitly close them itself before calling exec().
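
For reference, marking a file to be closed on exec is done with the fcntl() system call and the FD_CLOEXEC flag. The sketch below shows one way to set the flag on an already open file descriptor; the name set_cloexec() is hypothetical.

```c
#include <fcntl.h>

/* Mark the open file descriptor fd so that it is closed automatically
   when the process calls one of the exec() functions.
   Returns 0 on success, -1 on failure (for example, if fd is not open).
   set_cloexec is a hypothetical helper used only for illustration. */
int set_cloexec(int fd)
{
        int flags = fcntl(fd, F_GETFD);   /* fetch the current flags */
        if (flags < 0) return -1;
        if (fcntl(fd, F_SETFD, flags | FD_CLOEXEC) < 0) return -1;
        return 0;
}
```

The flag applies only to exec(); a child created by fork() still shares the file until it calls exec().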

What's Missing

Our example shell is missing a huge number of features. The code does nothing to deal with over-length lines, so formally speaking, it is in error. Built-in commands such as cd, if and while are completely absent. No tools are provided to support manipulation of the environment, for example, with built-in commands such as set to assign values to environment variables. There is no support for parameter and environment variable substitution, for example, by recognizing dollar signs as lead-ins to environment variable names. The use of quotation marks to enclose arguments that include blanks is missing, as is input-output redirection. All of these would make interesting machine problems, and if they were all done, the resulting shell would be genuinely useful.