22C:116, Lecture 13, Spring 1997

Douglas W. Jones
University of Iowa Department of Computer Science

File System Layers
A well designed file system will typically be built in many layers, for example, with a structure such as the following:
1. The disk interrupt service routine
2. The device driver (puts requests on queue)
3. Low-level file system
4. Buffered streams
5. Named files
The low-level file system offers a device interface quite similar to that offered by the device driver, and possibly identical to it. The difference is that the device driver enqueues requests for I/O to a real device, while the low-level file system accepts requests for I/O to virtual devices.
Higher level layers of the file system create the illusion of buffered stream I/O, and independence from the actual block sizes of the underlying device, and create a system of directories and file names for the files implemented by the low-level file system. Here, we will focus on how a decent low level file system can be built.
A standard low-level interface
Assume a standard device interface, shown pictorially as:
```
        _______
       | READ  |  -- read( buffer, device-address )
       |_______|
       | WRITE |  -- write( buffer, device-address )
       |_______|
```
It is clear that this can be supported on a large variety of random-access storage devices, ranging from hard and floppy disks to software emulation of such devices using fast RAM for data storage. For the purpose of this section, we will assume one fixed buffer size, equal to the sector size of the device, and we will assume that device addresses are linearized over a range such as 0 to Max-Address instead of being constructed from components such as sector, cylinder and surface numbers.
Higher level I/O primitives, such as the standard random-access stream primitives of UNIX or the even higher level stream primitives of C, can easily be implemented on top of this layer, if needed.
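As a rough illustration of this interface, the following Python sketch emulates such a device in RAM. The class name `RamDisk`, the 512-byte `SECTOR_SIZE` and the `max_address` attribute are assumptions made for the example, not part of any particular system.

```python
SECTOR_SIZE = 512  # assumed sector size, for illustration only


class RamDisk:
    """Software emulation of a random-access device using RAM for storage."""

    def __init__(self, sectors):
        self.max_address = sectors - 1
        self._store = [bytes(SECTOR_SIZE) for _ in range(sectors)]

    def _check(self, address):
        if not 0 <= address <= self.max_address:
            raise IndexError("device address out of range")

    def read(self, buffer, address):
        """Copy one sector from the device into the caller's buffer."""
        self._check(address)
        buffer[:] = self._store[address]

    def write(self, buffer, address):
        """Copy one sector from the caller's buffer onto the device."""
        self._check(address)
        if len(buffer) != SECTOR_SIZE:
            raise ValueError("buffer must be exactly one sector")
        self._store[address] = bytes(buffer)


disk = RamDisk(sectors=16)
out = bytearray(b"x" * SECTOR_SIZE)
disk.write(out, 3)                 # write one sector at device address 3
back = bytearray(SECTOR_SIZE)
disk.read(back, 3)                 # read it back into a second buffer
```

Any object offering these two methods can stand in for a device in the layers discussed below.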
In addition to constructing physical device drivers that use this interface, we can also construct virtual devices that use it. For example, consider the following picture:
```
                             ___
        _______ ____________|  _______
       | READ  |            | | READ  |  
       |_______|    Disk    | |_______|
       | WRITE |   Cache    | | WRITE |
       |_______|____________| |_______|
                            |___
```
This shows a disk cache implementation that assumes the availability of a device supporting this interface, and that itself supports this interface. If interposed between any physical device and a user of that device, this disk cache can improve the apparent speed of the disk, from the user's viewpoint, by caching recently used sectors in main memory, for example, using an LRU replacement algorithm.
The disk cache might, for example, maintain a set of buffers in memory that hold copies of the most recently accessed disk sectors; on read, if the desired sector is buffered, no actual disk I/O takes place. On write, a copy of the data written can be held in the cache for later reading, and cache buffers can be assigned to sectors on an LRU basis, with all of this done by the software.
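One possible shape for such a cache is sketched below in Python. The names `DiskCache` and `CountingDisk` are hypothetical, and the write-through policy is an assumption chosen for brevity; the text above does not say when written data reaches the disk.

```python
from collections import OrderedDict


class CountingDisk:
    """Stand-in for any device with read/write; counts real transfers."""

    def __init__(self, sectors, sector_size=512):
        self.store = [bytes(sector_size) for _ in range(sectors)]
        self.reads = 0
        self.writes = 0

    def read(self, buffer, address):
        self.reads += 1
        buffer[:] = self.store[address]

    def write(self, buffer, address):
        self.writes += 1
        self.store[address] = bytes(buffer)


class DiskCache:
    """Write-through LRU cache offering the same read/write interface."""

    def __init__(self, device, capacity):
        self.device = device
        self.capacity = capacity
        self.buffers = OrderedDict()  # address -> sector data, oldest first

    def _remember(self, address, data):
        self.buffers[address] = data
        self.buffers.move_to_end(address)      # mark as most recently used
        if len(self.buffers) > self.capacity:
            self.buffers.popitem(last=False)   # evict least recently used

    def read(self, buffer, address):
        if address in self.buffers:            # hit: no disk I/O at all
            buffer[:] = self.buffers[address]
            self.buffers.move_to_end(address)
        else:                                  # miss: go to the device
            self.device.read(buffer, address)
            self._remember(address, bytes(buffer))

    def write(self, buffer, address):
        self.device.write(buffer, address)     # write-through (an assumption)
        self._remember(address, bytes(buffer))


disk = CountingDisk(sectors=8)
cache = DiskCache(disk, capacity=2)
buf = bytearray(512)
for address in (0, 1, 0, 2, 1):
    cache.read(buf, address)
```

With room for only two buffers, the second read of sector 0 is served from memory, while sector 1 has been evicted by the time it is read again.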
Our focus here is on file systems. In effect, an open file is a virtual device that also supports this virtual device interface, although perhaps with a noticeably smaller value of Max-Address. Thus, the following picture applies:
```
                             ___
        _______ ____________|  _______
       | READ  |            | | READ  |  
       |_______|   Opened   | |_______|
       | WRITE |    File    | | WRITE |
       |_______|____________| |_______|
                            |___
```
The implementation of an open file must map read and write requests from the user into read and write requests on the underlying device. Of course, the file system need not be implemented directly on a physical device. The following structure is quite possible:
```
                        ___                    ___
   _______ ____________|  _______ ____________|  _______
  | READ  |            | | READ  |            | | READ  |  
  |_______|   Opened   | |_______|    Disk    | |_______|
  | WRITE |    File    | | WRITE |   Cache    | | WRITE |
  |_______|____________| |_______|____________| |_______|
                       |___                   |___
```
A file system consists of two essentially separate parts, one that creates opened files in response to open requests, and one that implements opened files. We will focus on the latter first.
Opened File Semantics
Just like the underlying disk or cached disk on which it is implemented, an opened file supports our two basic operations, read and write. The fact that it is an open disk file does nothing to change the data that flows to or from disk with these operations. What it does do is change the disk addresses used.
A user of a bare disk or a disk cache uses real disk addresses. A user of an opened file uses addresses relative to that file. In effect, these can be considered to be virtual disk addresses, and all of the address mapping mechanisms applicable to virtual address translation are equally applicable to translating file addresses to disk addresses.
The simplest of these is additive linear mapping, illustrated below:
```
disk_address( file_address ) =
    if file_address < file_size
       then return file_address + base_address
       else error
```
In this case, each open file is described by a base disk address and a limit.
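A minimal Python sketch of an open file built on additive linear mapping follows. The class names `LinearMappedFile` and `ListDisk` are made up for the example, and the underlying device is a simple in-memory stand-in.

```python
class LinearMappedFile:
    """Open file described by a base disk address and a limit (its size)."""

    def __init__(self, device, base_address, file_size):
        self.device = device            # anything with read/write methods
        self.base_address = base_address
        self.file_size = file_size

    def disk_address(self, file_address):
        if not 0 <= file_address < self.file_size:
            raise IndexError("file address beyond end of file")
        return file_address + self.base_address

    def read(self, buffer, file_address):
        self.device.read(buffer, self.disk_address(file_address))

    def write(self, buffer, file_address):
        self.device.write(buffer, self.disk_address(file_address))


class ListDisk:
    """Hypothetical stand-in device: one bytes object per sector."""

    def __init__(self, sectors):
        self.sectors = [b""] * sectors

    def read(self, buffer, address):
        buffer[:] = self.sectors[address]

    def write(self, buffer, address):
        self.sectors[address] = bytes(buffer)


disk = ListDisk(100)
f = LinearMappedFile(disk, base_address=40, file_size=5)
f.write(bytearray(b"hello"), 2)      # lands in disk sector 42
```

The data passes through unchanged; only the address is translated.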
In old minicomputer and early microcomputer operating systems, file systems were frequently constructed this way. A directory entry consisted of a file name, a base address and a file size. If this is the only form of file mapping on a system, the primary weakness shows up when there are large numbers of small files. As the file system evolves, the disk space grows progressively more fragmented, until sufficiently large free blocks cannot be found to allow new files to be created. At this point, the file system must be compacted, sliding all existing files together and consolidating the free space into one large free block.
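Compaction of this kind can be sketched in a few lines of Python. The representation is an assumption made for illustration: the disk is a list of sectors and the directory is a list of dictionaries with `name`, `base` and `size` keys.

```python
def compact(disk, directory):
    """Slide all files toward address 0; return the start of the free block."""
    next_free = 0
    for entry in sorted(directory, key=lambda e: e["base"]):
        if entry["base"] != next_free:
            # Files only move downward, so copying sectors in ascending
            # order never overwrites data that has not yet been moved.
            for offset in range(entry["size"]):
                disk[next_free + offset] = disk[entry["base"] + offset]
            entry["base"] = next_free
        next_free += entry["size"]
    return next_free


disk = ["a0", "a1", None, None, "b0", None, "c0", "c1", "c2", None]
directory = [
    {"name": "a", "base": 0, "size": 2},
    {"name": "b", "base": 4, "size": 1},
    {"name": "c", "base": 6, "size": 3},
]
free_start = compact(disk, directory)
```

After the call, everything from `free_start` to the end of the disk is one free block, although the old sector contents left there are stale.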
On modern systems, such simple additive linear mapping is commonly called partitioning, and it is quite common for large disks to be divided into many smaller virtual disks, each called a partition, and each supporting an independent file system.
Partitioning is done for a number of reasons. The most practical is to control the amount of material saved and restored when system backups are made. It is common to divide a disk into a user partition and a system partition because the system is only changed occasionally, and therefore only needs to be backed up occasionally.
A more unfortunate reason for partitioning a disk is to accommodate an old file system that is unable to manage the space on a new larger disk drive. Many of today's operating systems were originally designed in an era when disks with capacities over a few hundred megabytes were unavailable, and the designs frequently failed to anticipate the availability of larger drives. Partitioning a modern large drive into a number of smaller virtual drives is one way to allow such file systems to remain useful.
Another reason to partition a disk is to control resource contention. If each subcommunity of users allocates their files in independent partitions, the system management can control the impact of misuse of disk space. Well designed multi-user file systems generally have far more powerful tools for solving this problem, but partitioning remains in common use for this purpose.
Another way to translate the user's address within a file to a physical disk address is to use a lookup table analogous to the page tables used in conventional memory management units.
```
disk_address( file_address ) = file_table[ file_address ]
```
This requires a different mapping table for each file; the difficulty with this is that most files are quite small, while the table size is determined by the largest legal file. In such a case, most table entries would contain error indicators. A more reasonable implementation would involve storing only the non-error entries in the table:
```
disk_address( file_address ) =
    if file_address < file_size
       then return file_table[ file_address ]
       else error
```
Here, each open file has a file table of the size needed for that file, and a record of the size of this table.
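The same translation, written as a small runnable Python sketch. The table contents are invented for the example; the point is only that consecutive file addresses may map to scattered disk addresses.

```python
def disk_address(file_table, file_address):
    """Translate a sector number within a file to a disk address."""
    if not 0 <= file_address < len(file_table):
        raise IndexError("file address beyond end of file")
    return file_table[file_address]


# A three-sector file whose sectors are scattered across the disk.
file_table = [40, 7, 93]
addresses = [disk_address(file_table, a) for a in range(3)]
```

Here the length of the table doubles as the record of the file size.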
For large files, it is not sensible to keep the entire mapping table in main memory! The obvious place to put it is on disk, and one way to do this is to put the mapping table in another file. This is recursive, but it isn't an infinite regress if there is a minimum threshold size below which this method is not used. In that case, the translation code might look something like:
```
disk_address( file_address ) =
    if tiny_file then
       return file_table[ file_address ]
    else
       sector = file_address div slots_per_sector
       slot = file_address mod slots_per_sector
       read( buffer, sector, file_description_file )
       return buffer[ slot ]
    endif
```
Here, tiny open files are described by a small table in memory, perhaps extracted from the directory entry for the file, while large files are described by an open file that holds the table. To translate a file address for a large file, the appropriate entry from the appropriate sector of the description file must be read.
Note that this approach is only efficient if a disk cache sits between the file system and the disk -- if not, an extra disk access would be required for every sector of a large file read or written. Because it is highly likely that consecutive sectors will be accessed or that one sector will be accessed multiple times, perhaps with a read and then a write, the locality principle operates here, and the use of an appropriate disk cache will eliminate most of the extra I/O operations.
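The two-level scheme can be sketched in Python as follows. Everything named here (`Disk`, `OpenFile`, `TinyFile`, `LargeFile`, 4-byte little-endian disk addresses, the sector numbers used in the setup) is an assumption made for the example. No cache is included, so every translation for the large file costs one extra read of the description file.

```python
import struct

SECTOR_SIZE = 512
SLOTS_PER_SECTOR = SECTOR_SIZE // 4        # assumed 4-byte disk addresses


class Disk:
    """RAM-backed stand-in for the (cached) disk."""

    def __init__(self, sectors):
        self.store = [bytes(SECTOR_SIZE) for _ in range(sectors)]

    def read(self, buffer, address):
        buffer[:] = self.store[address]

    def write(self, buffer, address):
        self.store[address] = bytes(buffer)


class OpenFile:
    """Common read/write; subclasses supply disk_address()."""

    def __init__(self, device, file_size):
        self.device = device
        self.file_size = file_size

    def check(self, file_address):
        if not 0 <= file_address < self.file_size:
            raise IndexError("file address beyond end of file")

    def read(self, buffer, file_address):
        self.device.read(buffer, self.disk_address(file_address))

    def write(self, buffer, file_address):
        self.device.write(buffer, self.disk_address(file_address))


class TinyFile(OpenFile):
    """Mapping table small enough to hold in memory."""

    def __init__(self, device, file_table):
        super().__init__(device, len(file_table))
        self.file_table = file_table

    def disk_address(self, file_address):
        self.check(file_address)
        return self.file_table[file_address]


class LargeFile(OpenFile):
    """Mapping table stored in another open file, the description file."""

    def __init__(self, device, file_size, description_file):
        super().__init__(device, file_size)
        self.description_file = description_file

    def disk_address(self, file_address):
        self.check(file_address)
        sector, slot = divmod(file_address, SLOTS_PER_SECTOR)
        buffer = bytearray(SECTOR_SIZE)
        self.description_file.read(buffer, sector)
        return struct.unpack_from("<I", buffer, 4 * slot)[0]


# Build a 200-sector file whose data sectors sit at 100, 102, 104, ...
disk = Disk(1000)
table = [100 + 2 * i for i in range(200)]
table += [0] * (2 * SLOTS_PER_SECTOR - len(table))     # pad to 2 sectors
for n, where in enumerate((10, 11)):                   # table lives here
    chunk = table[n * SLOTS_PER_SECTOR:(n + 1) * SLOTS_PER_SECTOR]
    disk.write(struct.pack(f"<{SLOTS_PER_SECTOR}I", *chunk), where)

big = LargeFile(disk, 200, TinyFile(disk, [10, 11]))
```

Because the description file is itself an open file, placing a cache beneath both files would absorb most of the repeated reads of sectors 10 and 11.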
In fact, the scheme outlined above can also be viewed as storing disk files as tree structures:
```
                      _ _ _
root (a tiny file)   |_|_|/|
                    __| |__
                  _|_ _   _|_ _
description file |_|_|_| |_|/|/|
                __| | |__ |___
              _|_  _|_  _|_  _|_
data sectors |___||___||___||___|
```
In the above, both the "tiny" file description table and each sector of the description file have been assumed to hold three disk addresses each. It is far more common to have upwards of 128 disk addresses per sector of the description file (or files), and the root structure frequently holds ten or so disk addresses.
This kind of tree structure does not impose any physical organization on the sectors of a file, and this is both an asset and a liability. It is an asset because it means that a file system organized using this kind of indexing is not subject to serious fragmentation problems. It is a liability because it means that sequential access to files on such a system may be subject to serious latency problems imposed by poor organization of the file sectors.
Tree structured file systems are quite common, but they are rarely implemented in the straightforward way suggested above. Instead, they are usually implemented using fairly awkward special purpose code.
UNIX I-nodes
The widely used UNIX file system is an example. There, each open file is represented in memory by a data structure called an I-node. The I is usually taken to stand for either Information or Index. The classic version of the I-node data structure contains:
- The access rights.
- Owner, date of creation, etc.
- The disk addresses of the first 8 sectors. This allows fast access to the first few sectors of a file.
- The disk address of the sector containing pointers to the next 128 sectors of the file.
- The disk address of the sector containing pointers to the sectors containing pointers to the next 128x128 sectors of the file.
Given that the classic UNIX sector size is 512 bytes, this allowed accessing files up to about 8 megabytes on classic UNIX systems. This was sufficient for machines with 16 bit words, and it was sufficient for the disk technology available in the early 1970's, but by the late 1970's, it was obviously too small.
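The 8 megabyte figure can be checked with a little arithmetic, shown here as a Python sketch using the numbers given above. The function `locate` is a hypothetical helper showing which part of the I-node would cover a given sector of the file.

```python
SECTOR_SIZE = 512
DIRECT = 8                     # sectors addressed directly from the I-node
POINTERS = 128                 # disk addresses per indirect sector

max_sectors = DIRECT + POINTERS + POINTERS * POINTERS
max_bytes = max_sectors * SECTOR_SIZE


def locate(file_address):
    """Classify a file sector number by the I-node field that reaches it."""
    if file_address < 0:
        raise IndexError("negative file address")
    if file_address < DIRECT:
        return ("direct", file_address)
    file_address -= DIRECT
    if file_address < POINTERS:
        return ("single-indirect", file_address)
    file_address -= POINTERS
    if file_address < POINTERS * POINTERS:
        return ("double-indirect",) + divmod(file_address, POINTERS)
    raise IndexError("file address beyond the largest legal file")
```

This gives 8 + 128 + 16384 = 16520 sectors, or 8,458,240 bytes, just over 8 megabytes.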
Modern UNIX systems overcome the deficiency of the data structure outlined above by adding one more disk address to each I-node that supports a three-level tree, the so-called large-model file system.
The complexity of the UNIX I-node data structure is a result of the memory limited context of early UNIX systems, combined with a desire to make the first few sectors of even the largest files easy to access.
Fundamental Problems
The two fundamental problems that a well engineered file system must address are as follows:
Most files are small. Small shell scripts, bits of E-mail, single procedures and other bits and pieces make up the majority of the files on most systems. Many of these are significantly smaller than a single sector. Therefore, a well engineered file system should be able to store a large number of tiny files.
Most accesses to files made by running programs are to large files. The big applications that justify the highest performance systems are the best examples of this. Therefore, a well engineered file system should support efficient access to very large files.
Object Oriented Design
Recall that all of the software components we have discussed have a standard user interface, perhaps implemented as follows:
```
   record
      pointer to READ routine
      pointer to WRITE routine
      device-dependent fields
   end record
```
The device dependent fields for a disk might include a disk request queue, and the READ and WRITE routines would post requests in this queue and communicate with the appropriate device driver.
This approach to implementing file systems dates back to the mid 1960's and is one of the original problems that led to the modern notion of object oriented design. Today, we would refer to this record as an instance of a polymorphic class with the public methods READ and WRITE being virtual methods of the class, and with subclasses called physical disk, disk cache, linear-mapped disk, etc.
The specialized version of this interface for an open file might look like the following:
```
   record
      pointer to READ routine
      pointer to WRITE routine
      pointer to physical disk interface record
      mapping information for sectors of the file
   end record
```
Given this, we can view the open(filename) system service as a service that takes a textual file name and returns a pointer to an open file interface record such as the above; in modern terms, it is an object creator that creates an instance of the appropriate class to allow access to the indicated file. The close service, in its simplest form, merely deletes this record or causes the destructor method of the class to be invoked.
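In modern terms, the records above might be rendered roughly as follows in Python. The class names, the `open_file` function and the dictionary used as a directory are all hypothetical, and linear mapping is used for the open file only to keep the sketch short.

```python
from abc import ABC, abstractmethod


class Device(ABC):
    """The standard interface: every layer offers read and write."""

    @abstractmethod
    def read(self, buffer, address): ...

    @abstractmethod
    def write(self, buffer, address): ...


class PhysicalDisk(Device):
    """RAM-backed stand-in for a real disk and its driver."""

    def __init__(self, sectors):
        self.sectors = [b""] * sectors

    def read(self, buffer, address):
        buffer[:] = self.sectors[address]

    def write(self, buffer, address):
        self.sectors[address] = bytes(buffer)


class OpenFile(Device):
    """Open file record: a device pointer plus mapping information."""

    def __init__(self, disk, base, size):
        self.disk, self.base, self.size = disk, base, size

    def _map(self, address):
        if not 0 <= address < self.size:
            raise IndexError("address beyond end of file")
        return self.base + address

    def read(self, buffer, address):
        self.disk.read(buffer, self._map(address))

    def write(self, buffer, address):
        self.disk.write(buffer, self._map(address))


def open_file(directory, disk, filename):
    """Object creator: turn a textual name into an open file record."""
    base, size = directory[filename]     # KeyError if no such file
    return OpenFile(disk, base, size)


disk = PhysicalDisk(64)
directory = {"notes": (10, 4)}           # name -> (base, size), made up
f = open_file(directory, disk, "notes")
f.write(bytearray(b"data"), 1)           # goes to disk sector 11
```

Since `OpenFile` is itself a `Device`, a disk cache or another file layer could be stacked on either side of it without changing its users.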
File Names and Directory Managers
The primary job of a directory manager is to maintain a mapping between textual file names, as presented to the open function, and the information needed to build an open file description record.
The directory itself is the data structure on disk used by the directory manager in performing this mapping.
The simplest approach to directory management was used on many early computer systems, including most early floppy-disk based systems. On such systems, a fixed part of each disk was reserved for the directory, which took the form of a flat table, for example:
```
Directory = array [1..64] of record
              name: packed array [0..7] of char;
              start-sector: disk-address;
              sectors: integer;
            end record;
```
Here, as in many crude file systems, a linear additive mapping scheme is used to describe the disk space occupied by a file. It is trivial to extend such a directory to include such features as the date of file creation, the length of the file, in bytes, and similar details.
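A flat directory of this kind might be searched and updated as in the following Python sketch. The function names `new_directory`, `lookup` and `create` are hypothetical, and `None` is used here to mark an unused slot.

```python
from dataclasses import dataclass

DIRECTORY_SLOTS = 64
NAME_LENGTH = 8


@dataclass
class Entry:
    name: str
    start_sector: int
    sectors: int


def new_directory():
    return [None] * DIRECTORY_SLOTS       # None marks an unused slot


def lookup(directory, name):
    """Linear search of the flat table for a file name."""
    for entry in directory:
        if entry is not None and entry.name == name:
            return entry
    raise FileNotFoundError(name)


def create(directory, name, start_sector, sectors):
    """Claim the first unused slot for a new file."""
    if not 0 < len(name) <= NAME_LENGTH:
        raise ValueError("name must be 1 to 8 characters")
    if any(e is not None and e.name == name for e in directory):
        raise FileExistsError(name)
    for slot, entry in enumerate(directory):
        if entry is None:
            directory[slot] = Entry(name, start_sector, sectors)
            return directory[slot]
    raise OSError("directory full")


directory = new_directory()
create(directory, "README", start_sector=20, sectors=3)
entry = lookup(directory, "README")
```

The entry returned by `lookup` holds exactly the base and size needed to build a linear-mapped open file.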
Hierarchically structured file names
In general, users don't like flat name spaces, and they quickly start building hierarchic spaces using simple conventions such as the now almost universal
```
filename.extension
```
format. In fact, the name of this file,
```
/csf/jones/.public-html/opsys/notes/0920.html
```
could as easily be treated as a single long character string in a flat name space as it can be treated as a description of a path through a tree. Flat directory structures become unwieldy, though, so most modern systems support some variant on the tree-structured directory.
Tree-structured directories appear to have originated with proposals put forth by workers at Project MAC for what eventually became the Multics system. These ideas were published in a paper by Dennis and Van Horn, in Communications of the ACM, March 1966.
The key idea is to store directory data structures such as that outlined for a simple flat directory system in files. One bit attached to each file determines whether that file contains data or a directory. Whether it is a data file or a directory, all other details of storage allocation for the file are typically the same. Only the interpretation and the set of allowed operations on the file's contents change.
The UNIX file system
The UNIX file system is a variant on the theme of hierarchically structured directories. Underlying the user-level directory manager is a primitive flat file system using integers as file names. The integers are simply indices into an array of I-nodes, the I-table, allocated out near the center of the disk:
```
 ___________________________________________________
|super|                |         |                  |
|block|                | I-table |                  |
|_____|________________|_________|__________________|
```
The superblock at the start of the disk contains the disk address of the I-table and its size, along with free-space data structures. In order to provide a degree of crash protection, the superblock is actually stored redundantly. The I-table is in the middle of the disk in order to minimize seek times in moving the heads from I-table to data and back.
Each I-node has one bit indicating whether the associated file is a directory or a data file. Other than that, all other attributes such as how storage is allocated for files are the same, whether it's a directory or a data file.
Under the UNIX system, each directory is a sequence of ordered pairs (file-name, I-number), where file-name is a variable-length string and I-number is an integer. Each directory contains two special entries, one called "." (dot), which contains the I-number of that directory, and one called ".." (dot dot), containing the I-number of the parent directory. Other than these two special entries, the directories of a UNIX system are strictly tree structured.
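Translating a path name to an I-number then amounts to a walk through these pairs, one directory per path component. The Python sketch below uses a toy in-memory I-table; the I-numbers, the file names and the functions `search` and `resolve` are all invented for the example.

```python
ROOT = 2  # assumed I-number of the root directory

# A toy I-table: I-number -> (is_directory, contents).
i_table = {
    2: (True, [(".", 2), ("..", 2), ("usr", 5)]),
    5: (True, [(".", 5), ("..", 2), ("notes.txt", 9)]),
    9: (False, b"lecture notes"),
}


def search(i_number, name):
    """Look up one name in one directory, returning its I-number."""
    is_directory, pairs = i_table[i_number]
    if not is_directory:
        raise NotADirectoryError(name)
    for file_name, found in pairs:
        if file_name == name:
            return found
    raise FileNotFoundError(name)


def resolve(path, start=ROOT):
    """Follow a slash-separated path from the starting directory."""
    i_number = start
    for name in path.split("/"):
        if name:                          # skip empty components
            i_number = search(i_number, name)
    return i_number
```

Because "." and ".." are ordinary directory entries, `resolve` needs no special cases to handle them.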
UNIX does allow data files to be referenced from multiple directories. Thus, the same file may appear under different names in two or more directories.
The redundancy of the UNIX directory structure (self-links and back links) allows a program called fsck (file system check) to reconstruct the directory structure in the face of many possible failures. On most UNIX systems, fsck is routinely run as part of the startup script. The redundancy of the UNIX directory structure was not designed with support for fsck in mind; other systems, notably the Xerox Distributed File System (XDFS), have provided far greater redundancy, so that the entire directory system can be reconstructed from the data files themselves.
Under XDFS, the prefix on each sector of each file contains the sector number of that sector relative to the file it is in, along with the file number of the file (analogous to a UNIX I-number). As a result, the entire data structure describing mappings from (file-number, sector-number) pairs to actual sectors on disk can be reconstructed by scanning the headers on all the sectors on disk. In UNIX terms, the I-table and all subsidiary data structures can be reconstructed.
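The reconstruction scan can be sketched in Python as follows. The header representation is an assumption made for illustration (a tuple per disk sector, or `None` for a free or unreadable sector) and does not reflect the actual XDFS on-disk format.

```python
def rebuild_file_maps(headers):
    """Rebuild file_number -> {sector_number: disk_address} from headers.

    headers[disk_address] is (file_number, sector_number), or None when
    the sector is free or unreadable.
    """
    files = {}
    for disk_address, header in enumerate(headers):
        if header is None:
            continue
        file_number, sector_number = header
        files.setdefault(file_number, {})[sector_number] = disk_address
    return files


headers = [None, (7, 1), (3, 0), (7, 0), None, (3, 1)]
maps = rebuild_file_maps(headers)
```

A damaged sector simply leaves a hole in the map of the one file that owned it.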
Under XDFS, sector zero of each file contains the file number of the directory that references that file, as well as the name of the file and any related information for the file. Thus, by finding sector zero of each file on the system (by scanning the entire disk) the directory structure of the file system can be reconstructed. The net result is that if some sectors on a disk are damaged, only the data in those sectors will be lost.
Under UNIX, the prohibition on directory structures other than a tree with back links (..) and self-links (.) is imposed in order to avoid problems with reclamation of storage when links are deleted. Each I-node contains a count of the number of incoming links; this is incremented when a link to that I-node is added, and it is decremented when a link to that I-node is removed. If the count reaches zero, the file is deleted.
The problem with this scheme is that it cannot detect when a directory becomes inaccessible, because the circular and self-links prevent the count for the directory from going to zero. As a result, UNIX uses one command to remove links to data files (rm) and another command to delete directories from the directory tree (rmdir). The Cambridge File System eliminates this problem by using dynamic garbage collection to reclaim unreachable files or directories.
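The behavior of the link counts can be seen in a simplified Python model. The `INode` class and the `link`, `unlink` and `make_directory` functions are hypothetical and ignore everything about a real I-node except the count and the directory entries.

```python
class INode:
    def __init__(self, is_directory=False):
        self.is_directory = is_directory
        self.links = 0                    # count of incoming links
        self.entries = {}                 # used only by directories


i_table = {}


def link(directory, name, i_number):
    directory.entries[name] = i_number
    i_table[i_number].links += 1


def unlink(directory, name):
    i_number = directory.entries.pop(name)
    node = i_table[i_number]
    node.links -= 1
    if node.links == 0:
        del i_table[i_number]             # storage reclaimed


def make_directory(parent_number, name, i_number):
    i_table[i_number] = INode(is_directory=True)
    link(i_table[i_number], ".", i_number)        # self-link
    link(i_table[i_number], "..", parent_number)  # back link
    link(i_table[parent_number], name, i_number)


# Root directory 1 with its own "." and ".." entries.
i_table[1] = INode(is_directory=True)
link(i_table[1], ".", 1)
link(i_table[1], "..", 1)

# A data file (I-number 2) reachable under two names.
i_table[2] = INode()
link(i_table[1], "a", 2)
link(i_table[1], "b", 2)
unlink(i_table[1], "a")                   # count 2 -> 1, file survives
unlink(i_table[1], "b")                   # count 1 -> 0, file deleted

# A directory (I-number 3) removed with the plain unlink.
make_directory(1, "d", 3)
unlink(i_table[1], "d")                   # count 2 -> 1 because of "."
```

In this model, directory 3 is no longer reachable from the root, yet its count never reaches zero, and its ".." entry still holds a link to the root.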
Performance Problems
A successful file system must support large numbers of small files while supporting fast access to large files. Supporting large numbers of small files requires allocation of files in relatively small units such as the sector, and this encourages use of translation tables, N-ary trees, or other similar data structures to hold the data needed to translate between addresses within a file and physical disk addresses.
To support fast access to large files, contiguous allocation of disk space is essential. Many file systems do this, preferentially selecting among available free blocks so that, when a file is enlarged, new sectors are accessible with minimum latency. This requires that, when a file is enlarged, free sectors in the same cylinder are used, if they are available, and if not, free sectors in adjacent cylinders are used.
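One simple way to express this preference is sketched below in Python. The function `allocate_near`, the use of a set for the free list and the figure of 32 sectors per cylinder are assumptions for the example; a real allocator would also consider rotational position within the cylinder.

```python
def allocate_near(free_sectors, previous, sectors_per_cylinder):
    """Take the free sector whose cylinder is closest to that of `previous`."""
    if not free_sectors:
        raise OSError("disk full")
    home = previous // sectors_per_cylinder

    def cost(sector):
        distance = abs(sector // sectors_per_cylinder - home)
        return (distance, sector)         # ties go to the lower address

    best = min(free_sectors, key=cost)
    free_sectors.remove(best)
    return best


free = {5, 40, 41, 90}
last = 35                                 # file currently ends in cylinder 1
chosen = []
for _ in range(4):
    last = allocate_near(free, last, sectors_per_cylinder=32)
    chosen.append(last)
```

The first two allocations stay in the cylinder holding the end of the file; only when it is exhausted does the allocator move to a neighboring cylinder.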