22C:116, Lecture 13, Fall 1999

Douglas W. Jones
University of Iowa Department of Computer Science

File System Layers
A well designed file system (as opposed, for example, to the file systems used on most commercially successful operating systems) will typically be built in many layers, for example, with a structure such as the following:
1. The device driver
  - Interrupt service routine
  - Disk request scheduler
  - Low level disk-request queue interface
2. Low-level file system
3. Buffered streams
4. Named files
The low-level file system offers a device interface quite similar to that offered by the device driver, and possibly identical to it. The difference is that the device driver enqueues requests for I/O to a real device, while the low-level file system accepts requests for I/O to virtual devices that are each implemented by one or more real devices.
Higher level layers of the file system create the illusion of buffered stream I/O, and independence from the actual block sizes of the underlying device, and create a system of directories and file names for the files implemented by the low-level file system. Here, we will focus on how a decent low level file system can be built.
A standard low-level interface
If you assume a standard device interface, shown pictorially as:
```
        _______
       | READ  |  -- read( buffer, device-address )
       |_______|
       | WRITE |  -- write( buffer, device-address )
       |_______|
```
It is clear that this can be supported by the low-level interface to the device drivers of a large variety of random-access storage devices, ranging from hard and floppy disks to software emulation of such devices using fast RAM for data storage. For the purpose of this section, we will assume one fixed buffer size, equal to the sector size of the device, and we will assume that device addresses are lineraized over a range such as 0 to Max-Address instead of being constructed from components such as sector, cylinder and surface numbers.
To go beyond these simplifying assumptions, we could add a service to our standard low-level random-access device interface:
```
        _______
       | READ  |  -- read( buffer, device-address )
       |_______|
       | WRITE |  -- write( buffer, device-address )
       |_______|
       | INFO  |  -- info( infoblock )
       |_______|
```
The new info service for each device would be expected to return information about that device's disk geometry answering questions such as: What is Max-Address for this device? Is the sector size fixed, and if so, to what value? Is the number of sectors per track fixed and if so, to what number, and is this divice divided into multiple surfaces and cylinders, and if so, how many of each? The answers to these questions may vary from device to device, creating serious problems for higher level software.
Higher level I/O primitives, such as the standard random-access stream primitives of UNIX or the even higer level stream primitives of C can easily be implemented on top of this layer, if needed, and these primitives can hide most of the complexity exposed by the addition of the info service suggested above.
In addition to constructing physical device drivers that use this interface, we can also construct virtual devices that use it. For example, consider the following picture:
```
                             ___
        _______ ____________|  _______
       | READ  |            | | READ  |  
       |_______|    Disk    | |_______|
       | WRITE |   Cache    | | WRITE |
       |_______|____________| |_______|
                            |___
```
This shows a disk cache implementation that assumes the availability of a device supporting this interface, and that, itself supports this interface. If interposed between any physical device and a user of that device, this disk cache can improve the apparent speed of the disk, from the user's viewpoint, by caching recently used sectors in some area of main memory, for example, using an LRU replacement algorithm.
The disk cache might, for example, maintain a set of buffers in memory that hold copies of the most recently accessed disk sectors; on read, if the desired sector is buffered, no actual disk I/O takes place. On write, a copy of the data written can be held in the cache for later reading, and cache buffers can be assigned to sectors on an LRU basis, with all of his done by the software.
Our focus here is on file systems. In effect, our lowest level model of an open file is a virtual device that also supports this standard device interface, although perhaps with a noticably smaller value of Max Address. Thus, the following picture applies:
```
                             ___
        _______ ____________|  _______
       | READ  |            | | READ  |  
       |_______|   Opened   | |_______|
       | WRITE |    File    | | WRITE |
       |_______|____________| |_______|
                            |___
```
The implementation of an open file must map read and write requests from the user into read and write requests on the underlying file. Of course, the file system need not be implemented directly on a physical device. The following structure is quite possible:
```
                        ___                    ___
   _______ ____________|  _______ ____________|  _______
  | READ  |            | | READ  |            | | READ  |  
  |_______|   Opened   | |_______|    Disk    | |_______|
  | WRITE |    File    | | WRITE |   Cache    | | WRITE |
  |_______|____________| |_______|____________| |_______|
                       |___                   |___
```
A file system consists of two essentially separate parts, one that creates opened files in response to open requests, and one that implements opened files. We will focus on the former first.
Opened File Semantics
Just like the underlying disk or cached disk on which it is implemented, an opened file supports our two basic operations, read and write. The fact that it is an open disk file does nothing to change the data that flows to or from disk with these operations. What it does do is change the disk addresses used. Thus, the top-level code of the read and write operations on an open file can be expressed as:
```
open-file representation and access methods:

   device -- identity of device used to implement this file

   description -- some kind of description of this file
   
   read( buffer, file-address )
      device.read( buffer, disk_address( file_address, description ))

   write( buffer, file-address )
      device.write( buffer, disk_address( file_address, description ))
```
A user of a bare disk or a disk cache uses real disk addresses. A user of an opened file uses addresses relative to that file. In effect, these can be considered to be virtual disk addresses, and all of the address mapping mechanisms applicable to virtual address tranalation are equally applicable to translating file addresses to disk addresses.
Additive linear mapping is the simplest way to translate virtual to physical addresses. This is illustrated below:
```
disk_address( file_address, description ) =
    if file_address < description.size
       then return file_address + description.base
       else error
```
In this case, the description of an open file consists of just two fields, a base and a limit, where both are sector numbers.
In old minicomputer and early microcomputer operating systems, file systems were frequently constructed this way. A directory entry consisted of a file name, a base address and a file size. If this is the only form of file mapping on a system, the primary weakness shows up when there are large numbers of small files. As the file system evolves, the disk space grows progressively more fragmented, until sufficiently large free blocks cannot be found to allow new files to be created. At this point, the file system must be compacted, sliding all existing files together and consolodating the free space into one large free block.
On modern systems, such simple additive linear mapping is commonly called partitioning, and it is quite common for large disks to be divided into many smaller virtual disks, each called a partition, and each supporting an independent file system.
Partitioning is done for a number of reasons:

to allow partial backups
It is common to divide a disk into a user partition and a system partition because the system is only changed occasionally, and therefore only needs to be backed up occasionally.

to support an old file system
This is an unfortunate reason for partitioning a disk! As disk drives get larger, they have frequently exceeded the maximum size anticipated by the designers of older file systems. Many of todays operating systems were originally designed in an era when disks with capacities over a few hundred megabytes were unavailable, and the designs frequently failed to anticipate the availability of larger drives. Small virtual disk drives can easily support the old file system with minimal changes!

to control resource contention
If each subcommunity of users allocates their files in independent partitions, the system management can control the impact of misuse of disk space. Well designed multi-user file systems generally have far more powerful tools for solving this problem, but partitioning remains in common use for this purpose.

to allow coexisting file systems

When one hardware system must support multiple operating systems, it is common to partition at least one of the disk drives, reserving a partition for each of the operating systems. Thus, we find PC systems with Linux partitions and DOS partitions. The coexisting operating systems must cooperate to support this by supporting compatable disk partitioning schemes; if one file system partitions the disk by cylinders while the other partitions the disk by recording surface, coexistance will be difficult!

Table lookup mapping is a common method - this is exactly the method used with virtual memory address maps:
```
disk_address( file_address, description ) = description[ file_address ]
```
Here, the description of each file is an array, indexed by file address, giving the corresponding disk addresses. Each file has its own mapping table; the difficulty with this is that most files are quite small, while the table size is determined by the largest legal file. In such a case, most table entries would contain error indicators. A more reasonable implementation would involve storing only the non-error entries in the table:
```
disk_address( file_address, description ) =
    if file_address < description.size
       then return description.table[ file_address ]
       else error
```
Here, the description of an open file is a record containing the size of the mapping array, along with that array. This allows a dynamically allocated array of size just large enough for the file.
For large files, it is not sensible to keep the entire mapping table in main memory! The obvious place to put it is on disk, and one way to do this is to put the mapping table in another file. This is recursive, but it isn't infinite regress if there is a minimum threshold size used for this method. In that case, the translation code might look something like:
```
disk_address( file_address, description ) =
    if file_address >= description.size
       then error
       else if description.size = 1 then
          then return description.address
	  else
       else
          sector = file_address div slots_per_sector
          slot = file_address mod slots_per_sector
          description.file.read( buffer, sector )
          return buffer[ slot ]
```
Here, an open file description has 3 fields, the size of the file, the disk address of the sector holding that file, if it is only one sector, or the file holding the table of sectors of that file, if it is larger than one sector. Each address translation for a large file involves reading a sector of the description file to get the required address.
Note that this approach is only efficient if a disk cache sits between the file system and the disk -- if not, an extra disk access would be required for every sector of a large file read or written. Because it is highly likely that consecutive sectors will be accessed or that one sector will be accessed multiple times, perhaps with a read and then a write, the locality principle operates here, and the use of an appropriate disk cache will eliminate most of the extra I/O operations.
In fact, the scheme outlined above can also be viewed as storing disk files as tree structures:
```
                      _ _ _
root (1 sector)      |_|_|/|
                    __| |__
                  _|_ _   _|_ _
description file |_|_|_| |_|/|/|
                __| | |__ |___
              _|_  _|_  _|_  _|_
data sectors |___||___||___||___|
```
In the above, both root file and each sector of the intermediate level description file have been assumed hold three disk addresses each. It is far more common to have upwards of 128 disk addresses per sector of the description file (or files), and instead of defining tiny files as those of just 1 sector, the base level data structure can contain the addresses of up to 8 or 10 sectors.
This kind of tree structure does not impose any physical organization on the sectors of a file, and this is both an asset and a liability. It is an asset because it means that a file system organized using this kind of indexing is not subject to serious external fragmentation problems. It is a liability because it means that sequential access to files on such a system may be subject to serious latency problems imposed by poor organization of the file sectors.
A good file system using this type of organization will typically maintain free-space data structures in such a way that space for files can be allocated in the same cylinder as the space used to index the sectors of those files, and so that consecutive sectors of a file are stored in the same cylinder whenever possible.
Tree structured file systems are quite common, but they are rarely implemented in the straitforward way suggested above. Instead, they are usually implemented using fairly awkward special purpose code.
UNIX I-nodes
The widely used UNIX file system is an example. There, each open file is represented in memory by a data structure called an I-node. The I usually stands for either the words Information or Index. The classic version of the I-node data structure contains:
- The access rights.
- Owner, date of creation, etc.
- The disk addresses of the first 8 sectors. This allows fast access to the first few sectors of a file.
- The disk address of the sector containing pointers to the next 128 sectors of the file.
- The disk address of the sector containing pointers to the sectors containing pointers to the next 128x128 sectors of the file.
Given that the classic UNIX sector size is 512 bytes, this allowed accessing files up to about 8 megabytes on classic UNIX systems. This was sufficient for machines with 16 bit words, and it was sufficient for the disk technology available in the early 1970's, but by the late 1970's, it was obviously too small.
Modern UNIX systems overcome the deficiency of the data structure outlined above by adding one more disk address to each I-node that supports a three-level tree, the so-called large-model file system.
The complexity of the UNIX I-node data structure is a result of the memory limited context of early UNIX systems, combined with a desire to make the first few sectors of even the largest files easy to access.
Fundamental Problems
The two fundamental problems that a well engineered file system must address are as follows:
Most files are small. Small shell scripts, bits of E-mail, single procedures and other bits and pieces make up the majority of the files on most systems. Many of these are significantly smaller than a single sector. Therefore, a well engineered file system should be able to store a large number of tiny files.
Most accesses to files made by running programs are to large files. Image files, sound files, gigantic databases, and similar applications are important! Therefore, a well engineered file system should support efficient access to very large files.
Object Oriented Design.
Recall that all of our the software components we have discussed have a standard user interface, perhaps implemented as follows:
```
   record
      pointer to READ routine
      pointer to WRITE routine
      device-dependent fields
   end record
```
The device dependent fields for a physical disk might include pointers to the disk request queue, so the READ and WRITE routines could post requests in the right queue and communicate with the appropriate interrupt service routine.
This approach to implementing file systems dates back to the mid 1960's and is one of the original problems that led to the modern notion of object oriented design. Today, we can refer to this record as an instance of a polymorphic class with the public methods READ and WRITE being virtual methods of the class, and with subclasses called physical disk, disk cache, linear-mapped disk, etc.
The specialized version of this interface for an open file might look like the following:
```
   record
      pointer to READ routine
      pointer to WRITE routine
      pointer to physical disk interface record
      mapping information for sectors of the file
   end record
```
Given this, we can view the open(filename) system service as a service that takes a textual file name and returns a pointer to an open file interface record such as the above; in modern terms, it is an object creator that creates an instance of the appropriate class to allow access to the indicated file. The close service, in its simplest form, merely deletes this record or causes the destructor method of the class to be invoked.
NOTE: Creating an instance of an open-file is not the same as creating an instance of a file on disk! The open-file objects that have dominated the above discussion are transient objects that get created when the file is opened and destroyed when the file is closed. Files themselves typically have much longer lifetimes, although there are systems that allocate scratch files that have no file name and autodestruct on close.
File Names and Directory Managers
The primary job of a directory manager is to maintain a mapping between textual file names, as presented to the open function, and the information needed to build an open file description record.
The directory itself is the data structure on disk used by the directory manager in performing this mapping.
The simplest approach to directory management was used on many early computer systems, including most early floppy-disk based systems. On such systems, a fixed part of each disk was reserved for the directory, which took the form of a flat table, for example:
```
Directory = array [1..64] of record
              name: packed array [0..7] of char;
              start-sector: disk-address;
              sectors: integer;
            end record;
```
Here, as in many crude file systems, a linear additive mapping scheme is used to describe the disk space occupied by a file. It is trivial to extend such a directory to include such features as the date of file creation, the length of the file, in bytes, and similar details.
Hierarchically structured file names
In general, users don't like flat name spaces, and they quickly start building hierarchic spaces using simple conventions such as the now almost universal
```
filename.extension
```
format. In fact, the name of this file,
```
/csf/jones/.public-html/opsys/notes/13.html
```
could as easily be treated as a single long character string in a flat name space as it can be treated as a description of a path through a tree. Flat directory structures become unwieldy, though, so most modern systems support some variant on the tree-structured directory.
Tree-structured directories appear to have originated with proposals put forth by workers at project Mac for what eventually became the Multics system. These ideas were published in paper by Dennis and Van Horn, in Communications of the ACM, March 1966.
The key idea is to store directory data structures such as that outlined for a simple flat directory system in files. One bit attached to each file determines whether that file contains data or a directory. Whether it is a data file or a directory, all other details of storage allocation for the file are typically the same. Only the interpretation and the set of allowed operations on the file's contents changes.
The UNIX file system
The UNIX file system is a variant on the theme of hierarchically structured directories. Underlying the user-level directory manager is a primitive flat file system using integers as file names. The integers are simply indices into an array of I-nodes, the I-table, allocated out near the center of the (linearized) range of disk addresses:
```
 ___________________________________________________
|super|                |         |                  |
|block|                | I-table |                  |
|_____|________________|_________|__________________|
```
The superblock at the start of the disk contains the disk address of the I-table and its size, along with free-space data structures. In order to provide a degree of crash protection, the superblock is actually stored redundantly. The I-table is in the middle of the disk in order to minimize seek times in moving the heads from I-table to data and back.
Each I-node has one bit indicating whether the associated file is a directory or a data file. Other than that, all other attributes such as how storage is allocated for files are the same, whether it's a directory or a data file.
Under the UNIX system, each directory is a sequence of ordered pairs , where file-name is a variable-length string and I-number is an integer. Each directory contains two special entires, one called "." (dot), which contains the I-number of that directory, and one called ".." (dot dot), containing the I-number of the parent directory. Other than these two special entries, the directories of a UNIX system are strictly tree structured.
UNIX does allow data files to be referenced from multiple directories. Thus, the same file may appear under different names in two or more directories.
The redundancy of the UNIX directory structure (self-links and back links) allows a program called fsck (file system check) to reconstruct the directory structure in the face of many possible failures. On most UNIX systems, fsck is routinely run as part of the startup script. The redundancy of the UNIX directory structure was not designed with support for fsck in mind; other systems, notably the Xerox Distributed File System (XDFS), have provided far greater redundancy, so that the entire directory system can be reconstructed from the data files themselves.
Under XDFS, the prefix on each sector of each file contains the sector number of that file relative to the file it is in, along with the file number of the file (analogous to a UNIX I-node number). As a result, the entire data structure describing mappings from [sector-number,file-number] pairs to actual files on disk can be reconstructed by scanning the headers on all the sectors on disk. In UNIX terms, the I-table and all subsidiary data structures can be reconstructed from the representations of the files themselves.
Under XDFS, sector zero of each file contains the file number of the directory that references that file, as well as the name of the file and any related information for the file. Thus, by finding sector zero of each file on the system (by scanning the entire disk) a strictly tree-structured directory for the file system can be reconstructed. The net result is that if some sectors on a disk are damaged, the only possible data loss is to the data in those sectors. No localized damage will ever destroy the entire file system. Sadly, few modern file systems have incorporated this degree of redundancy in order to allow a similar degree of damage survival, so today, it is common to lose entire files or directories because of very localized disk failures.
Under UNIX, the prohibition on directory structures other than a tree with back links (..) and self-links (.) is imposed in order to avoid problems with reclamation of storage when links are deleted. Each I-node contains a count of the number of incoming links; this is incremented when a link to that I-node is added, and it is decremented when a link to that I-node is removed. If the count reaches zero, the file is deleted.
The problem with this scheme is that it can't detect it when a directory becomes inaccessible because the circular and self-links prevent the count for the directory from going to zero. As a result, UNIX uses one command to remove links to data files (rm) and another command to delete directories from the directory tree (rmdir). The Cambridge File System (see Wilkes and Needham, The Cambridge Cap Computer and its Operating System) eliminates this problem by using dynamic garbage collection to reclaim unreachable files or directories. Again, this solution, while well proven, has not been adapted by any commercial file systems in widespread use.
Performance Problems
A successful file system must support large numbers of small files while supporting fast access to large files. Large numbers of small files requires allocation of files in relatively small units such as the sector, and this encourages use of translation tables, N-ary trees, or other similar data structures to hold the data needed to translate between addresses within a file and physical disk addresses.
To support fast access to large files, contiguous allocation of disk space is essential. Many file systems do this, preferentially selecting among available free blocks so that, when a file is enlarged, new sectors are accessable with minimum latency. This requires that, when a file is enlarged, free sectors in the same cylinder are used, if they are available, and if not, free sectors in adjacent cylinders are used.