The Unix File System

Part of 22C:169, Computer Security Notes
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

The File System Tree

Prior to 1965, most computers that had any type of file system had flat file systems. That is, the file system had a top-level directory listing file names, with no other structure. All file names were typically very short, frequently with limits on name length as short as 8 characters.

This arrangement was totally insufficient for multi-user systems, and it was hardly useful for large-scale software development. Several early computer systems had simple extensions of this. For example, consider this early file name structure:

File name structure: name:file.extension
each file name is constructed of 3 fields, a user name, a file name relative to that user name, and an extension.

The user name: name
limited to 6 characters, identifies a user. For example, jones.

The file name: file
limited to 6 characters, identifies a particular file belonging to that user. For example, proj1.

The extension: extension
limited to 3 characters, identifies the version of the file. For example, jones:proj1.src might be the source file, while jones:proj1.obj might be the compiled object file, and jones:proj1.doc might be the documentation for the project.

This scheme was clumsy and failed to generalize. Users with large-scale file storage requirements quickly found that they ran out of names, and they quickly found that they needed better tools for organizing their work.

In the mid 1960s, a group of workers involved with the Multics project at MIT and Bell Labs published a paper proposing an alternative model. In this model, the file system organizes directories as a tree, with leaves that are files.

Different systems have implemented this idea with minor changes over the years.

It may seem that the only significant difference between these systems is their use of different separator characters between names along the path from the root of the system to the particular file you might be interested in. This is not true.

It's worth noting that the original version of Microsoft's DOS, purchased from Seattle Computer Products, had a flat file system. In the era when floppy disks were small, having just one directory (FAT, to use old DOS terminology) per disk was no problem. As hard drives became common, and as floppy disks got larger, Microsoft added a hierarchical file system in MS-DOS 2.0. By that time, in 1983, the basic idea of a hierarchical file system was about 20 years old and Unix was close to 15 years old.

These systems differ in the way they handle links to files. In the Multics file system, each file had a canonical name. That was the name of the file in the directory tree. There could be additional links to the file, but the secondary links were marked as secondary.

In contrast, the directory tree under Unix allows multiple links to a file to have exactly equal status. To understand this, consider a file called /users/jones/demo/test. Given this file, consider the following sequence of commands.

> ln /users/jones/demo/test /users/jones/new

The ln command under Unix creates a link from a directory to a file. Having created a new link, our file has two distinct names: /users/jones/demo/test and /users/jones/new. These two names refer to exactly the same file. They are, in effect, two handles on a single object. Of course, this only works if the directories named on the two paths already exist.

Now, consider the effect of the following command:

> rm /users/jones/demo/test

The rm command removes a link from a directory to a file. Having removed the link from /users/jones/demo/ named test, the path name /users/jones/demo/test is no longer the valid name of a file. The file /users/jones/new still exists, and in fact, it is the original file.
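
The behavior of ln and rm described above can be tried out safely in a scratch directory. The following is only a sketch under some assumptions: it is written for a Bourne-style sh rather than tcsh, it assumes the mktemp -d command is available, and the directory layout is made up for the example. It uses ls -i, which prints a file's inode number, to show that both names lead to the same underlying file.

```shell
#!/bin/sh
# Hard-link sketch in a scratch directory (all paths are hypothetical).
demo=$(mktemp -d)
mkdir "$demo/demo"
echo "hello" > "$demo/demo/test"

ln "$demo/demo/test" "$demo/new"      # second name for the same file

# ls -i prints the inode number first; both names should report the same one.
inode_test=$(ls -i "$demo/demo/test" | awk '{print $1}')
inode_new=$(ls -i "$demo/new" | awk '{print $1}')
echo "inode numbers: $inode_test $inode_new"

rm "$demo/demo/test"                  # removes one link, not the file itself
content_after=$(cat "$demo/new")      # the file is still reachable by its other name
echo "contents via remaining name: $content_after"

rm -rf "$demo"                        # clean up the scratch directory
```

Neither name is marked as the original, so removing either one leaves the file reachable through the other.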

When you rename a file under the Unix file system, you do this by creating a new link using the new name, and then deleting the old link.
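
As a rough illustration of renaming as link-then-unlink, the two steps can be carried out by hand. This is a sketch in Bourne-style sh (not tcsh), assuming mktemp -d is available; the file names old and renamed are made up. Current systems typically offer a single rename operation with the same net effect, so this should be read as a model of the semantics, not as a description of how mv is implemented.

```shell
#!/bin/sh
# Renaming modeled as "create new link, then delete old link".
demo=$(mktemp -d)
echo "data" > "$demo/old"
inode_before=$(ls -i "$demo/old" | awk '{print $1}')

ln "$demo/old" "$demo/renamed"    # step 1: new link under the new name
rm "$demo/old"                    # step 2: delete the old link

inode_after=$(ls -i "$demo/renamed" | awk '{print $1}')
if [ -e "$demo/old" ]; then old_gone=no; else old_gone=yes; fi
echo "inode before: $inode_before, after: $inode_after, old name gone: $old_gone"

rm -rf "$demo"
```

The inode number is unchanged, which suggests that the file itself never moved; only the directory entries changed.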

Unix enforces the rule that there may not be multiple links to directories, except for two very specialized cases: The link named . (period) in every directory references that directory itself, and the link named .. (dot dot) references the parent of that directory. This has the consequence that any slash in a Unix path name can be replaced by the string /./. All of the following names are equivalent:

/users/jones/new
/users/./jones/././new
/users/jones/../jones/new
/users/jones/./.././jones/new
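
One way to check that such names are equivalent is to compare inode numbers. The sketch below builds a small users/jones tree inside a scratch directory (so the paths are stand-ins for the ones above), and is written for a Bourne-style sh with mktemp -d assumed to be available.

```shell
#!/bin/sh
# Check that paths using . and .. all reach the same file.
demo=$(mktemp -d)
mkdir -p "$demo/users/jones"
echo "x" > "$demo/users/jones/new"

first=$(ls -i "$demo/users/jones/new" | awk '{print $1}')
same=yes
for p in \
    "$demo/users/./jones/././new" \
    "$demo/users/jones/../jones/new" \
    "$demo/users/jones/./.././jones/new"
do
    n=$(ls -i "$p" | awk '{print $1}')   # inode number reached through this name
    echo "$n  $p"
    [ "$n" = "$first" ] || same=no
done
echo "all names equivalent: $same"

rm -rf "$demo"
```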

The Current Directory

Typing long path names is cumbersome. To simplify matters, Unix systems maintain, for each process, a current working directory. Path names that start with / are interpreted starting at the root of the file system, while path names that start with any other character are interpreted starting at the current working directory. The command cd (change directory) is used to change the current working directory. Suppose a command named reference is used to reference files. Every use of reference in the following sequence refers to exactly the same file:

> reference /users/jones/new
> cd /
> reference users/jones/new
> cd /users
> reference jones/new
> cd /users/jones
> reference new

> cd ..
> reference jones/new
> cd ..
> reference users/jones/new
> cd users
> reference jones/new
> cd jones
> reference new

The first batch of cd commands above all use absolute path names, while the second batch use relative path names to walk around in the directory structure. This clearly illustrates the value of the .. link in each directory for walking up the tree toward the root. File systems without such back links require a special command other than cd to walk up the tree.
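
The fact that the current working directory belongs to a process, and that relative names are resolved against it, can be seen in a short experiment. This is a sketch in Bourne-style sh (not tcsh) using a scratch tree created with mktemp -d; the helper function inode is made up for the example. It relies on the fact that a parenthesized or $(...) command runs in a separate subshell process.

```shell
#!/bin/sh
# Relative names depend on the current directory, which is per-process.
inode() { ls -i "$1" | awk '{print $1}'; }   # hypothetical helper: print inode number

demo=$(mktemp -d)
mkdir -p "$demo/users/jones"
echo "x" > "$demo/users/jones/new"
target=$(inode "$demo/users/jones/new")      # absolute path name

cd "$demo/users/jones"
here_before=$(pwd)
via_relative=$(inode new)                    # resolved against the current directory

# The subshell changes its own current directory, not ours.
via_parent=$(cd .. && inode jones/new)
here_after=$(pwd)

echo "$target $via_relative $via_parent"
cd /
rm -rf "$demo"
```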

Unix systems also support the notion of the home directory of the current user. The environment variable $HOME gives the user's home directory, and when the shell sees the symbol ~ (tilde or curly dash) at the start of a path name, it substitutes the user's home directory. Thus, if the current user is jones, the following file names all reference the same file:

/users/jones/new
~/new
~jones/new
$HOME/new

The second form above depends on who the current user is, while the third form above can be used by any user of the same file system to reference the same file belonging to the user named jones. The shell variable $HOME contains the home directory of the current user, so if the current user is jones, the final form also references the same file.
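
The relationship between ~ and $HOME can be checked directly. In this sketch, written for a Bourne-style sh, HOME is temporarily pointed at a made-up directory inside a scratch area so that nothing in a real home directory is touched. The ~user form is left out because it depends on the accounts that exist on a particular system.

```shell
#!/bin/sh
# Show that ~/new and $HOME/new expand to the same path name.
demo=$(mktemp -d)
saved_home=$HOME
HOME=$demo/users/jones            # pretend home directory, for this sketch only
export HOME
mkdir -p "$HOME"
echo "x" > "$HOME/new"

tilde_name=~/new                  # the shell replaces the leading ~ with $HOME
var_name=$HOME/new
echo "$tilde_name"
echo "$var_name"

HOME=$saved_home                  # put the real value back
rm -rf "$demo"
```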

Search Paths

When the shell sees the ls command, for example, it runs an application from the file /bin/ls. This program then lists the current directory.

It is a reasonable guess that some early version of the Unix shell contained logic something like this: Given the command name, try to open a file by that name in the /bin directory. If that fails, try the current directory. This simplistic logic invites extensions, adding other directories to the list of directories to be searched. The developers of the Unix shell eventually hit on the notion of a generalized search path.

When any of the standard Unix shells tries to launch a command, it uses the shell variable $PATH. Like all shell variables, the value of this variable is a character string. In this case, the string is constructed of a sequence of directory names separated by colons. Using the echo command to examine the current path on one Unix system returned the following:

> echo $PATH
/bin:/sbin:/usr/bin:/usr/sbin:/Users/jones/bin:.

This says that, on encountering a command name, the shell will first look in a directory called /bin and then in /sbin, and then /usr/bin, then /usr/sbin, and /Users/jones/bin, before finally looking in the current directory.
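
The left-to-right search just described can be sketched as a small shell function. This is only an illustration under stated assumptions: it is written for a Bourne-style sh rather than tcsh, the function name lookup and the scratch directories a and b are made up for the example, and real shells add refinements (such as remembering where commands were found) that are not modeled here.

```shell
#!/bin/sh
# lookup NAME SEARCHPATH
# Print the first executable file called NAME found in the colon-separated
# SEARCHPATH, scanning left to right. Returns 1 if nothing is found.
lookup() {
    cmd=$1
    search=$2
    saved_ifs=$IFS
    IFS=:                         # split the search path at colons
    set -f                        # no wildcard expansion while splitting
    for dir in $search; do
        [ -n "$dir" ] || dir=.    # an empty component means the current directory
        if [ -f "$dir/$cmd" ] && [ -x "$dir/$cmd" ]; then
            IFS=$saved_ifs; set +f
            printf '%s\n' "$dir/$cmd"
            return 0
        fi
    done
    IFS=$saved_ifs; set +f
    return 1
}

# Two scratch directories, each holding an executable named tool.
demo=$(mktemp -d)
mkdir "$demo/a" "$demo/b"
for d in a b; do
    printf '#!/bin/sh\necho %s\n' "$d" > "$demo/$d/tool"
    chmod +x "$demo/$d/tool"
done

found_ab=$(lookup tool "$demo/a:$demo/b")    # a is searched first
found_ba=$(lookup tool "$demo/b:$demo/a")    # b is searched first
if lookup nosuchtool "$demo/a:$demo/b" >/dev/null; then
    missing=found
else
    missing=notfound
fi
echo "$found_ab"
echo "$found_ba"
echo "nosuchtool: $missing"

rm -rf "$demo"
```

The same command name resolves to a different file depending only on the order of the directories in the search path.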

The search path implements, at the shell level, an approximation of the kind of rules we are used to in programming languages. When you mention the variable i in a program, the compiler first checks the local block to see if the variable is defined there, and then it checks the surrounding block, and on outward until finally it checks to see if i was declared globally.

The default search path provided by the system when a user logs on is usually just a list of standard directories where the predefined commands are stored. In the case of the above, the predefined path was /bin:/sbin:/usr/bin:/usr/sbin. Many users have locally developed programs or programs they installed for their personal use. A good place to store these is the user's own directory of executables. On Unix systems, it is traditional to name this ~/bin. In the above example, the ~/.tcshrc file of the user named "jones" (automatically executed when the shell begins execution) contained the following commands:

> setenv PATH ${PATH}:/Users/jones/bin
> setenv PATH ${PATH}:.

Each of the above two lines adds one new component to the environment variable $PATH.

The $PATH shell variable gives the search path used by the shell to execute shell commands. Many other Unix applications maintain paths. For example, the $MANPATH environment variable is used on some Unix systems to tell the man command what directories to look in for the manual pages. Some compilers maintain a similar path for looking up included source files, so that the #include directive need not explicitly give a full file name.

Security Consequences of Search Paths

The search path mechanism of Unix (and of systems that have copied this idea from Unix) has some dangerous consequences. Consider this little shell script:

#!/bin/tcsh
#  myls arguments
# home-made command to list directories

ls $argv

The author of the script undoubtedly intended that the ls command in the script would execute /bin/ls. Unfortunately, this is not guaranteed! Suppose the user edits the path as follows:

> setenv PATH ~/bin:$PATH

Now, the user's own bin directory is in the path ahead of the system directories, so the shell will look in the user's directory first. When the user executes myls, the shell looks up the ls command in the user's directory first, so the user is now free to redefine ls or any other command (except built-in shell commands) that the script happens to use. Consider what happens if this little script is stored as ls in the user's ~/bin directory:

#!/bin/tcsh
# ls arguments
# attack script causing ls to have very different behavior
rm -f $argv
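
The attack can be reproduced harmlessly by substituting a stand-in for ls that only prints a message. The following is a sketch, not the original scripts: it uses Bourne-style sh instead of tcsh, it assumes mktemp -d is available and that files in the scratch directory may be executed, and the scratch bin directory plays the role of the user's ~/bin.

```shell
#!/bin/sh
# Harmless demonstration of a command being hijacked through the search path.
demo=$(mktemp -d)
mkdir "$demo/bin"

# Stand-in for the attack script: it prints a message instead of removing files.
cat > "$demo/bin/ls" <<'EOF'
#!/bin/sh
echo "hijacked ls called with: $*"
EOF
chmod +x "$demo/bin/ls"

# A script that trusts the search path, in the spirit of myls.
cat > "$demo/myls" <<'EOF'
#!/bin/sh
ls "$@"
EOF
chmod +x "$demo/myls"

# Run myls with the scratch bin directory ahead of the system directories.
# The modified PATH applies only to this one command.
hijack_output=$(PATH="$demo/bin:$PATH" "$demo/myls" /tmp)
echo "$hijack_output"

rm -rf "$demo"
```

The script itself was not modified; only the search path it inherited was changed.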

Defense Against Search Path Attacks

When writing shell scripts, there are two basic defenses that can be used to prevent users from corrupting the interpretation of the commands in the script.

The first defense is to never rely on search paths, but to always write out the path-names of commands in full. This can be a bit of a nuisance, because it requires the script author to go hunting for each command to see where it is defined instead of using the common names for the commands. Having done this, the above example script could be rewritten as follows:

#!/bin/tcsh
#  myls arguments
# home-made command to list directories

/bin/ls $argv

The second defense is to take over the search path mechanism within the script. Don't accept the path you were given, but define the path within the script so that it only searches the directories you want to search. Typically, this involves restoring the default search path that was used prior to any user modifications:

#!/bin/tcsh
#  myls arguments
# home-made command to list directories

setenv PATH /bin:/usr/bin
ls $argv

The second defense only works because the setenv command is a built-in shell command. Note that there is no need to restore the path when the shell script terminates. The script is running in a copy of the user's environment, so any changes to the environment variables are changes to the copy, not to the original, and the current copy of the environment will be discarded when the script reaches its end.
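
The second defense, and the point about the environment being a copy, can both be checked with a small experiment. This sketch uses Bourne-style sh, where PATH=...; export PATH plays the role of setenv; it assumes that ls lives in /bin or /usr/bin, that mktemp -d is available, and that files in the scratch directory may be executed.

```shell
#!/bin/sh
# A script that resets its own search path is not fooled by a hostile PATH.
demo=$(mktemp -d)
mkdir "$demo/bin" "$demo/data"
: > "$demo/data/file1"            # one empty file for the real ls to list

# Harmless stand-in for a hostile ls.
cat > "$demo/bin/ls" <<'EOF'
#!/bin/sh
echo "hijacked"
EOF
chmod +x "$demo/bin/ls"

# Defended script: sets the search path before using any external command.
cat > "$demo/myls" <<'EOF'
#!/bin/sh
PATH=/bin:/usr/bin
export PATH
ls "$@"
EOF
chmod +x "$demo/myls"

path_before=$PATH
defended_output=$(PATH="$demo/bin:$PATH" "$demo/myls" "$demo/data")
path_after=$PATH                  # our own PATH is untouched by the script
echo "$defended_output"

rm -rf "$demo"
```

The script's change to PATH is made in its own process, so the caller's search path is the same before and after.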

Symbolic Links

Unix supports a second kind of link, the symbolic link, created by using the ln -s command. Unlike hard links, symbolic links can be attached not only to leaves in the file tree, but also to any directory. Unlike hard links, symbolic links do not prevent the deletion of the file or directory they reference. Symbolic links are sometimes called soft links. Consider this shell command, in a context where /users/jones/demo/test already existed.

> ln -s /users/jones/demo/test /users/jones/new

After the above command, /users/jones/demo/test and /users/jones/new are both names for the same file. Up until this point, the -s option on the link command has had no interesting consequences. Now, however, if we delete the original file with this command:

> rm /users/jones/demo/test

The file is actually gone. Attempts to use either /users/jones/demo/test or /users/jones/new will result in error messages. The symbolic link /users/jones/new still exists in the directory structure, but it is a link to nowhere, and it remains so until a new file is created with the name /users/jones/demo/test. At that point, both names become defined again.
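
The life cycle of a dangling symbolic link can be observed in a scratch directory. This is a sketch in Bourne-style sh with made-up paths, assuming mktemp -d is available. It uses the standard test operators -L, which asks whether the name is itself a symbolic link, and -e, which follows the link and asks whether the target exists.

```shell
#!/bin/sh
# A symbolic link survives the deletion of its target and revives with it.
demo=$(mktemp -d)
mkdir "$demo/demo"
echo "first" > "$demo/demo/test"
ln -s "$demo/demo/test" "$demo/new"

rm "$demo/demo/test"              # the file is really gone now

if [ -L "$demo/new" ]; then link_exists=yes; else link_exists=no; fi
if [ -e "$demo/new" ]; then target_exists=yes; else target_exists=no; fi
echo "link exists: $link_exists, target exists: $target_exists"

# Creating a new file under the old name makes the link usable again.
echo "second" > "$demo/demo/test"
content_after=$(cat "$demo/new")
echo "contents via link: $content_after"

rm -rf "$demo"
```

The link stores a path name, not a reference to the file, so it follows whatever file currently has that name.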

Under the Unix file system, the file /users/jones/demo/test used in the above examples with symbolic links could be a directory, and it could even be a file on a different hardware volume than the symbolic link. With hard links, the link can only be made to a file, not a directory, and it can only be made to a file on the same file-system volume (the same disk).

References

The classic Multics file system paper is worth reading:

A General-Purpose File System For Secondary Storage, by R. C. Daley and P. G. Neumann. Fall Joint Computer Conference, 1965. Available on line from:
http://www.multicians.org/fjcc4.html

The classic paper on the Unix system is also worth reading:

The UNIX Time-Sharing System, by D. M. Ritchie and K. Thompson. Communications of the ACM, July 1974. The best on-line version of this paper comes from Ritchie's web site:
http://cm.bell-labs.com/cm/cs/who/dmr/cacm.html

Section III of the paper covers the file system, as originally envisioned.

The Wikipedia article on the Unix file system covers the developments after the original Unix system. Most of these have to do with implementation and not with end-user semantics.

http://en.wikipedia.org/wiki/Unix_File_System