CS:3620 Notes, Lecture 4, Spring 2018

When Unix development began in the late 1960s, the most common way for on-line users to interact with the computer was a Teletype. This combined a keyboard and printer in one unit, transmitting keypresses to the computer and printing whatever the computer sent back. Usually, what you typed was printed immediately, but this was under software control.

Computers had textual command languages where commands directed the computer to run programs. Very early during the development of Unix, a decision was made that the command language interpreter would be an application program, not an integral part of the operating system. This was an innovative decision with an important long-term consequence: if users didn't like the command language interpreter that came with the system, they could write their own.

The first command language interpreter written for the Unix system was called sh, short for shell, the command shell. It was called a shell because, in some conceptual sense, it wrapped around the programs that users wanted to run. Ken Thompson, one of the designers of Unix, wrote the first version of sh, but Thompson had higher priorities, and development of sh was passed to Stephen Bourne, who released a significantly enhanced but compatible shell in 1977, still called sh.

When the University of California at Berkeley got a Unix source license, they began developing BSD Unix, the Berkeley Software Distribution. A grad student on the BSD project thought he could write a better shell, and he called his shell the C shell or csh, because it had a few programming features that were a bit more C-like than the programming features of the Bourne shell.

The Bourne shell was tied up in the AT&T Unix license, and as the move toward open software gained steam, Brian Fox wrote a compatible extension of the Bourne shell that he called the Bourne Again Shell or bash. Ken Greer at Carnegie Mellon University had experience with the command language of a different operating system, TENEX. He wrote a new version of csh extended with some features from TENEX, and called it tcsh.

Today, the standard distributions of Linux and other systems descended from Unix include both tcsh and bash. Both support the basic command language of Ken Thompson's original shell, with incompatible extensions beyond this. When you open an interactive shell session on a Unix-like system, you can find out what shell you are using with the echo command by asking it to show you the value of $SHELL, a shell variable that holds the file name of the default shell for your user account:

[dwjones@fastx05 ~]$ echo $SHELL
/bin/tcsh
[dwjones@fastx05 ~]$

Note that this shows a full file name, not just the name of the shell. You can run a different shell on a one-off basis by simply typing the name of that shell:

[dwjones@fastx05 ~]$ bash
bash-4.4$ echo $SHELL
/bin/tcsh
bash-4.4$ exit
[dwjones@fastx05 ~]$

Notice that running a different shell doesn't change the values of variables like $SHELL. Also notice that the exit shell command terminates the current shell, returning control to whatever program launched it.

You can change your default shell with the chsh command. It prompts for the shell you want to use. Usually, this will be either /bin/bash or /bin/tcsh, but if you write your own shell, you can make it your default once you are certain that it works well enough. The chsh program won't change your shell unless you type in your password to confirm that the change is what you want.

Unix shell scripts

In Unix-based systems like Linux and MacOS, scripts are executable files containing text that begins with the "magic number" #!. What follows, on the same line, is a file name terminated by a newline. When the system sees an executable beginning like this, it runs the named file as an interpreter, providing the entire script file as input to that interpreter. Consider this script:

#!/bin/tcsh
echo "script output"

If we store this text in a file called myscript, we can make it executable and then execute it as follows:

[dwjones@serv15 ~]$ chmod +x myscript
[dwjones@serv15 ~]$ ./myscript
script output
[dwjones@serv15 ~]$

The first command above makes the script executable. The chmod command changes the access rights (the mode) of a file or list of files. The string +x means add the execute (x) right. The string -x would remove the execute right, and -w would remove the right to write the file, making it read-only.

The second command runs the script. In general, typing a file name on the command line makes the shell try to run that file. When the system sees the first line of the file, #!/bin/tcsh, it launches the tcsh shell to read from the file myscript. To the shell, all lines starting with # are comments, so the shell ignores the first line and begins work on the second line by running the echo program, a program that simply outputs the text of whatever parameters are passed to it.
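
The interpreter named on the #! line does not have to be a shell. The following is a minimal sketch, written in Bourne-shell (sh or bash) syntax rather than tcsh, and assuming that /bin/cat exists and that the scratch directory made by mktemp permits execution. The names selfprint_demo and selfprint are made up for this illustration. On Linux, the system passes the script's file name to the interpreter as an argument, so cat simply copies the file to the output, #! line and all:

```shell
# selfprint_demo is a hypothetical helper that builds and runs a tiny
# script whose interpreter is cat instead of a shell.
selfprint_demo() {
    # Work in a scratch directory so the current one is not touched.
    dir=$(mktemp -d) || return 1

    # Write a two-line script; the first line names the interpreter.
    printf '#!/bin/cat\nthis line is data, not a command\n' > "$dir/selfprint"
    chmod +x "$dir/selfprint"

    # Running the file launches /bin/cat on the script itself, so both
    # lines of the file are printed.
    "$dir/selfprint"

    rm -r "$dir"
}

selfprint_demo
```

Because cat has no notion of comments, the first line is printed along with the second. A shell used as the interpreter reads the same file but skips the first line as a comment.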

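The ls -l command shows the access rights of a file, along with some of its other attributes. For example, for a script stored in the file shellscript:

[dwjones@serv15 ~]$ ls -l shellscript
-rwxr-xr-x 1 dwjones faculty 223 Jan  5 11:29 shellscript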

This indicates that the access rights are rwxr-xr-x for this file. The first rwx applies to access by the owner, named dwjones, while the middle r-x applies to members of the group named faculty. The final r-x applies to all others. The 223 in the output is the size of the file, in bytes, which is followed by the date of last modification, followed by the file name itself.
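
To watch the access rights change, here is a small sketch in Bourne-shell (sh or bash) syntax. The helper name show_mode is made up for this illustration, and the example assumes that mktemp creates its scratch file readable and writable by the owner only. The prefixes u (owner) and a (all) are standard chmod notation for saying whose rights to change:

```shell
# show_mode is a hypothetical helper: it prints only the first ten
# characters of the ls -l line, which hold the file type and the
# three groups of access rights.
show_mode() {
    ls -l "$1" | cut -c1-10
}

demo=$(mktemp)          # scratch file, created with mode rw-------
chmod u+x "$demo"       # add the execute right for the owner
show_mode "$demo"       # now -rwx------
chmod a-w "$demo"       # remove the write right for everyone
show_mode "$demo"       # now -r-x------
rm -f "$demo"
```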

For more information about the ls command (or any shell command) type man ls. (You can also use a web search engine to search for the keywords ls and "unix command".) There are two problems with this. First, the result is sometimes huge; you may get the feeling you are drinking from a fire hose. Second, if the command is a built-in command of some shell, you'll have to type, for example, man tcsh and then hunt around in that document for the command you want.

Commentary

The #! mechanism is a kluge. File attributes really ought to be encoded as some kind of file type, with type attributes of a file quite distinct from the data in that file. The MacOS notion of having files that have a resource fork (type information) and a data fork (the file itself) is an attempt to find a more rational way of encoding file types. Some Unix applications also use extensions on file names. Consider, for example, the file name image.jpg, where the extension .jpg indicates that the file is in JPEG image format. This is another kluge, dating back to the 1960s, and it is not used by any of the core components of Unix, but only by applications. Thus, in Unix, it is perfectly legal to attempt to edit a JPEG file with a text editor.

For a strong example to illustrate the foolishness of using the extension on a file name to indicate the type of the file, consider how you would react to a programming language that required you to indicate the type of each variable with an extension on the variable name. Where most programming languages let you write i=i+1, you would have to write i.int=i.int+1 or even i.int=i.int+1.int. This would get old very fast.

Shell Parameters

The Unix system allows parameters to be passed to any Unix command. The echo command simply outputs its parameters, with spaces delimiting them. When you type echo Hello world!, you actually pass two parameters to the echo command, Hello followed by world!. You can test this by typing the command with extra spaces:

echo          Hello            world!

The output will be the same as it was without the spaces. The shell lets you include explicit spaces within a parameter if you put the parameter in quotes.

echo "        Hello            world! "

(The space after the exclamation point is there because exclamation points have special meaning to some shells in some contexts, and including this space avoids such a context.)

Unix shells use an awful notation for referencing parameters. This is forced on the shell by the lack of anything resembling a subroutine heading on shell scripts. Had shell scripts required a subroutine heading of some kind, that could have provided formal parameter names, but lacking these, the shell forces formal parameters to begin with a dollar sign followed by the parameter number. Consider this shell script stored in the file parameters:

#!/bin/tcsh
# parameters $1 $2 $3

echo parameter 1 is $1
echo parameter 2 is $2
echo parameter 3 is $3

If you make this script executable and run it with the command ./parameters this "is a" test, you will see the output:

parameter 1 is this
parameter 2 is is a
parameter 3 is test

Dollar signs mean quite a bit to the shell. $1 is equivalent to $argv[1], for example, where $argv is logically the name of an array, the argument vector, containing all of the arguments. If you put $argv with no subscript, the entire argument vector will be substituted into the text. $$ is replaced with the current process number. In all cases, the dollar sign and whatever text follows within the syntactic constraints of the shell you are using is replaced by the text of the indicated value.

The short notation $1 is the original Thompson shell notation. As the shell matured, named shell variables were added, and then these were divided into arrays of components.
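
For comparison, the Bourne-style shells (sh and bash) use the same $1 notation but name the whole argument list "$@" instead of $argv. The following sketch is written in that syntax, not tcsh, and the function name list_params is made up for this illustration. It reproduces the output of the parameters script above for any number of parameters:

```shell
# list_params is a hypothetical helper: it prints each parameter on a
# line of its own, numbered from 1.
list_params() {
    i=1
    # "$@" expands to all of the parameters, each kept as one word,
    # so a quoted parameter containing a space is not split apart.
    for p in "$@"; do
        echo "parameter $i is $p"
        i=$((i + 1))
    done
}

list_params this "is a" test
```

Called as shown, this prints the same three lines as the tcsh script, with is a treated as a single parameter.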

Shell Variables

The Unix shells allow variable creation. The mechanisms for this differ from shell to shell, so here, we will focus on the C-shell (and tcsh) variable mechanism.

set myvariable = "Hello world! "
echo $myvariable

To create a numeric variable, or rather, a variable with a value that is the textual representation of an integer, you could just set it, as above, but if you use the @ shell command, it will evaluate expressions to the right of the equals sign. This only works with integers, and you should keep in mind the fact that all shell variables are text strings, so when you do addition in the shell, it is converting text to binary, doing the addition, and then converting back to text. Here is an example:

@ myvariable = 1 + 1
echo $myvariable
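
Since the mechanisms differ from shell to shell, it may help to see the same two steps in Bourne-shell (sh or bash) syntax. This is only a sketch for comparison with the tcsh commands above. In these shells, assignment is written with no spaces around the equals sign, and the $(( )) notation does the job of the @ command:

```shell
myvariable="Hello world"      # plain assignment of a text string
echo "$myvariable"

myvariable=$((1 + 1))         # integer arithmetic, result stored as text
echo "$myvariable"            # prints 2

# The value is still a text string, so it can be pasted into a
# longer string just like any other text.
echo "${myvariable}0"         # prints 20
```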

Quotes

The standard Unix shells support a number of different types of quotation marks. Consider these examples:

set myvariable = "Hello    world! "
echo $myvariable

Here, the output is Hello world!, with the extra spaces excluded, because the value of $myvariable is substituted into the command line first, and then the result is parsed into parameters for the echo command. Add quotes, and this changes:

set myvariable = "Hello    world! "
echo "$myvariable"

Now, the output is Hello    world! , with the extra spaces included. This is because the quotes prevented the division of the variable's value into separate parameters, while still permitting the dollar sign to be recognized. Single quotes suppress interpretation of the dollar sign:

set myvariable = "Hello    world! "
echo '$myvariable'

The result is $myvariable, which is to say, the literal text, without any attempt to substitute the variable's value into place. On the other extreme, we can ask the shell to evaluate a command and substitute the output of that command for the text of that command. Consider:

set myvariable = "Today is"
echo "$myvariable" `date "+%A %B %d, %Y."`

Here, the back quotes (`) around the date command tell the shell to run that command and substitute its output into the echo command line. The date command is given the argument "+%A %B %d, %Y.", an argument that specifies the date format. Note the inclusion of the double-quoted string within the back-quoted string.
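
One way to see what each kind of quote does is to count the parameters that reach the command. The sketch below uses Bourne-shell (sh or bash) syntax, where these four quoting rules work the same way as described above. The helper name count_args is made up for this illustration, and LC_ALL=C is added only so that date uses English day and month names:

```shell
# count_args is a hypothetical helper: it prints the number of
# parameters it was given ($# in Bourne-style shells).
count_args() {
    echo $#
}

myvariable="Hello    world"

count_args $myvariable        # prints 2: value split at the spaces
count_args "$myvariable"      # prints 1: double quotes keep it whole
count_args '$myvariable'      # prints 1: the literal text $myvariable

# Back quotes substitute the output of date, which is then split at
# spaces into day name, month name, day number and year.
count_args `LC_ALL=C date "+%A %B %d, %Y."`     # prints 4
```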

Control Structures

The C shell and tcsh support a number of control structures, including the usual if and while statements. Any language that allows for an assignment statement, arithmetic or string operators, and while loops conditional on arithmetic or string comparison is a general purpose programming language, although it should be immediately obvious that the Unix shells are very ugly programming languages, as illustrated in the following useless script:

#!/bin/tcsh
# shellscript <number> <text>
#   outputs many copies of <text> 
#   the number of copies is controlled by <number> 

echo entering $$ $1 $2

@ myvariable = $argv[1]
while ( $myvariable > 0 )
        echo $argv[2]
        @ myvariable = $myvariable - 1
        ./shellscript $myvariable $argv[2]
end

echo exiting $$ $1 $2

This example illustrates both iteration with a while loop and recursion. Note that the shell variables are local. Each invocation of a shell script launches a new shell with a completely new environment, based on the environment of the shell that launched it.

As a result, the number of times this script echoes $argv[2] is large because of the combination of recursion and iteration controlled by $argv[1]. The script includes debug output (the echo commands on entry and exit) to help you understand the pattern of the recursion.
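
To see how large, let f(n) be the number of copies echoed when $argv[1] is n. Each trip around the loop echoes one copy and then makes a recursive call with the decremented count, so f(n) = n + f(n-1) + ... + f(0) with f(0) = 0, which works out to 2^n - 1. The following sketch mimics the script with a function in Bourne-shell (sh or bash) syntax, leaving out the debug output. The name copies is made up for this illustration. The function body is written in parentheses so that each call runs in its own subshell with its own copy of n, much as each invocation of the script gets a new shell:

```shell
# copies is a hypothetical stand-in for the shellscript file above.
# The parentheses around the body run each call in a subshell, so
# the variable n is private to that call.
copies() (
    n=$1
    while [ "$n" -gt 0 ]; do
        echo "$2"
        n=$((n - 1))
        copies "$n" "$2"     # recursive call with the decremented count
    done
)

# Count the lines of output for n = 3; 2^3 - 1 is 7.
copies 3 x | wc -l
```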

A Security Disaster -- Shell Injection Attacks

The common Unix shells show plenty of evidence of evolution, and little evidence of intelligent design. It is true that the designers were intelligent, but features were added without thinking things through, and their evolution into general purpose programming languages was more of an accident than an intentional effort. The result is not only a language that is ugly, but a language that has some very dangerous properties. One of the worst of these is its vulnerability to what is known as an injection attack. Consider this script:

#!/bin/tcsh
# check arg
# outputs "it matches" if the argument is 1

if ($1 == 1) echo it matches

This appears to be an uninteresting shell script. Call this script with an argument other than 1 and it will produce no output. Call it with an argument that evaluates to 1, and the output will be the string it matches. There is trouble, though. Consider the following call:

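./check "1 ) echo"

After the shell substitutes the argument for $1, the if command in the script reads:

if (1 ) echo == 1) echo it matches

The text of the argument closes the parenthesized condition early and supplies a command of its own, so the command that follows the condition is the injected echo, not the one the script's author wrote.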

Shell injection attacks can be far more dangerous than mere injection of an echo command. Consider:

./check "1 ) rm check "

This deletes the check script, but if we'd used rm -r * it would have deleted all of the files in the current directory before it tried to parse the rest of the line. This example ends up producing lots of error messages, but all of these error messages are output after the damage is done.

Defense against shell injection attacks in all of the standard Unix shells varies from difficult to impossible. The problem is, the central element of the defense involves parsing the parameter and checking to see that it is safe, and you cannot write a parser in a typical Unix shell that is not, itself, vulnerable to injection attacks.

So, while the Unix shells are general purpose programming languages, secure applications typically avoid reliance on complex shell scripts, or if they do use them, they wrap protective code around the shell script in order to guarantee that the parameters to the shell script are safe.
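
As an illustration of what such protective code might look like, here is a minimal sketch in Bourne-shell (sh or bash) syntax. It assumes that the only acceptable parameter is a string of decimal digits, and it relies on the fact that, in these shells, a double-quoted "$1" is substituted as a single word without being parsed again. The names is_safe_number and checked_call are made up for this illustration. This is a sketch of the idea, not a complete defense:

```shell
# is_safe_number is a hypothetical validator: it succeeds only when
# its argument is a non-empty string made up entirely of digits.
is_safe_number() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit
        *)           return 0 ;;
    esac
}

# checked_call is a hypothetical wrapper: it runs the command named
# by its first parameter, passing along the second parameter, but
# only after the second parameter has passed validation.
checked_call() {
    if is_safe_number "$2"; then
        "$1" "$2"
    else
        echo "rejected argument" >&2
        return 1
    fi
}
```

With this wrapper, checked_call ./check 1 would run the check script, while checked_call ./check "1 ) rm check " would be rejected before the script ever sees the argument.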

Injection attacks are a broad general problem for interpreted code. The most widely known injection attack vulnerabilities are found in SQL scripts that are run by server-side Web applications, but many other server-side applications are vulnerable.

4 -- Shell Scripts
