22C:116, Lecture Notes, Feb. 17, 1995

Douglas W. Jones
University of Iowa Department of Computer Science

  1. Focus on File Systems.

    Assume a standard device interface, shown pictorially as:

            _______
           | READ  |  -- read( buffer, device-address )
           |_______|
           | WRITE |  -- write( buffer, device-address )
           |_______|
    
    It is clear that this interface can be supported on a large variety of random-access storage devices. Here, we will assume one fixed buffer size, equal to the sector size, and we will assume that device addresses are linearized over a range such as 0 to Max-Address instead of being constructed from components such as sector, cylinder and surface numbers.
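
    For concreteness, this interface might be captured in C as a record of
    function pointers. This is only a sketch; the names device_t, sector_t
    and SECTOR_SIZE are illustrative assumptions, not part of any
    particular system:

        /* a sketch of the standard device interface, assuming one fixed
           buffer size equal to the sector size, and device addresses
           linearized over the range 0 to max_address - 1 */

        #define SECTOR_SIZE 512           /* assumed sector size, in bytes */

        typedef unsigned long sector_t;   /* a linearized device address */

        typedef struct device {
            sector_t max_address;         /* addresses run from 0 to max_address - 1 */
            void (*read)( struct device *dev,
                          char buffer[SECTOR_SIZE], sector_t address );
            void (*write)( struct device *dev,
                           char buffer[SECTOR_SIZE], sector_t address );
        } device_t;

    Each physical or virtual device then carries its own read and write
    routines, so a user can be handed any device_t without knowing what
    lies beneath it.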

    In addition to constructing physical device drivers that use this interface, we can also construct virtual devices that use it. For example, consider the following picture:

                                 ___
            _______ ____________|  _______
           | READ  |            | | READ  |  
           |_______|    Disk    | |_______|
           | WRITE |   Cache    | | WRITE |
           |_______|____________| |_______|
                                |___
    
    This shows a disk cache implementation that assumes the availability of a device supporting this interface, and that itself supports this interface. If interposed between any physical device and a user of that device, this disk cache can improve the apparent speed of the disk, from the user's viewpoint, by caching recently used sectors in main memory, for example, using an LRU replacement algorithm.
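
    As a sketch of how such a layer might be built, here is a minimal
    write-through cache with LRU replacement, written against the
    hypothetical device_t interface sketched above. The linear search,
    the fixed line count, and all of the names are illustrative
    assumptions, not a real system's design:

        #include <string.h>   /* for memcpy */

        #define CACHE_LINES 64

        typedef struct cache_line {
            int           valid;      /* does this line hold a sector? */
            sector_t      address;    /* which sector it holds */
            unsigned long last_used;  /* timestamp for LRU replacement */
            char          data[SECTOR_SIZE];
        } cache_line_t;

        typedef struct disk_cache {
            device_t      interface;  /* the read/write seen by the user */
            device_t     *disk;       /* the underlying device */
            unsigned long clock;      /* advances on every access */
            cache_line_t  line[CACHE_LINES];
        } disk_cache_t;

        /* find the line holding address, or else the LRU victim */
        static cache_line_t *find_line( disk_cache_t *c, sector_t address ) {
            cache_line_t *victim = &c->line[0];
            for (int i = 0; i < CACHE_LINES; i++) {
                if (c->line[i].valid && c->line[i].address == address)
                    return &c->line[i];
                if (c->line[i].last_used < victim->last_used)
                    victim = &c->line[i];
            }
            return victim;
        }

        static void cache_read( device_t *dev, char buffer[SECTOR_SIZE],
                                sector_t address ) {
            disk_cache_t *c = (disk_cache_t *) dev; /* interface is 1st member */
            cache_line_t *l = find_line( c, address );
            if (!(l->valid && l->address == address)) {
                /* miss: evict the victim, fetch the sector from below */
                c->disk->read( c->disk, l->data, address );
                l->valid = 1;
                l->address = address;
            }
            l->last_used = ++c->clock;
            memcpy( buffer, l->data, SECTOR_SIZE );
        }

        static void cache_write( device_t *dev, char buffer[SECTOR_SIZE],
                                 sector_t address ) {
            disk_cache_t *c = (disk_cache_t *) dev;
            cache_line_t *l = find_line( c, address );
            l->valid = 1;
            l->address = address;
            l->last_used = ++c->clock;
            memcpy( l->data, buffer, SECTOR_SIZE );
            c->disk->write( c->disk, l->data, address );  /* write through */
        }

    A constructor would set interface.read to cache_read and
    interface.write to cache_write. Because writes go straight through to
    the underlying device, a crash never loses cached data; a write-back
    design would perform better at the cost of that guarantee.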

    Our focus here is on file systems. In effect, an open file is a virtual device that supports this same device interface, although perhaps with a noticeably smaller value of Max-Address. Thus, the following picture applies:

                                 ___
            _______ ____________|  _______
           | READ  |            | | READ  |  
           |_______|   Opened   | |_______|
           | WRITE |    File    | | WRITE |
           |_______|____________| |_______|
                                |___
    
    The implementation of an open file must map read and write requests from the user into read and write requests on the underlying device. Of course, the file system need not be implemented directly on a physical device. The following structure is quite possible:
                            ___                    ___
       _______ ____________|  _______ ____________|  _______
      | READ  |            | | READ  |            | | READ  |  
      |_______|   Opened   | |_______|    Disk    | |_______|
      | WRITE |    File    | | WRITE |   Cache    | | WRITE |
      |_______|____________| |_______|____________| |_______|
                           |___                   |___
    
    A file system consists of two essentially separate parts, one that creates opened files in response to open requests, and one that implements opened files. Here, we will focus on the latter.

  2. Opened File Semantics

    Just like the underlying disk or cached disk on which it is implemented, an opened file supports our two basic operations, read and write. The fact that it is an open disk file does nothing to change the data that flows to or from disk with these operations. What it does do is change the disk addresses used.

    A user of a bare disk or a disk cache uses real disk addresses. A user of an opened file uses addresses relative to that file. In effect, these can be considered to be virtual disk addresses, and all of the address mapping mechanisms applicable to virtual address translation are equally applicable to translating file addresses to disk addresses.

    The simplest of these is additive linear mapping, illustrated below:

    disk_address( file_address ) =
        if file_address < file_max
           then return file_address + base_address
           else error
    
    In this case, each open file is described by a base disk address and a limit. Such simple additive linear mapping is commonly used to implement partitions: it is common for large disks to be divided into many smaller virtual disks, each called a partition, and each supporting an independent file system.
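
    In terms of the device interface sketched earlier, a partition is just
    a virtual device that checks each address against the limit and adds
    the base before passing the request through. A minimal sketch, again
    using the hypothetical device_t type (the write routine would be
    symmetric):

        #include <stdlib.h>   /* for abort */

        typedef struct partition {
            device_t  interface;   /* the read/write seen by the user;
                                      interface.max_address is the limit */
            device_t *disk;        /* the underlying device */
            sector_t  base;        /* first sector of this partition */
        } partition_t;

        static void partition_read( device_t *dev, char buffer[SECTOR_SIZE],
                                    sector_t address ) {
            partition_t *p = (partition_t *) dev;
            if (address >= p->interface.max_address)
                abort();                  /* error: beyond the partition */
            p->disk->read( p->disk, buffer, address + p->base );
        }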

    Partitioning is done for a number of reasons. The most practical is to control the amount of material saved and restored when system backups are made. It is common to divide a disk into a user partition and a system partition because the system is only changed occasionally, and therefore only needs to be backed up occasionally.

    A more unfortunate reason for partitioning a disk is that, when some old file system was designed, disks of the current size were unavailable. As a result, the file system can only utilize the disk if it is broken into manageable pieces. Another reason to partition a disk is to control resource contention. If each subcommunity of users allocates its files in independent partitions, the system management can control the impact of misuse of disk space. Well designed multi-user file systems generally have far more powerful tools for solving this problem.

    As with virtual address translation, tables are commonly used to map file addresses to disk addresses:

    disk_address( file_address ) = file_table( file_address )
    
    This requires a different mapping table for each file; the difficulty with this is that most files are quite small, while the table size is determined by the largest legal file. In such a case, most table entries would contain error indicators. A more reasonable implementation would involve storing only the non-error entries in the table:

    disk_address( file_address ) =
        if file_address < file_max
           then return file_table( file_address )
           else error
    
    Here, each open file has a file table of the size needed for that file, and a record of the size of this table.
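
    As a sketch in C, the per-file state and translation function for this
    scheme might look as follows; the names are again illustrative:

        /* table-driven mapping for one open file, storing only the
           non-error entries */

        typedef struct open_file {
            sector_t  file_max;     /* number of sectors in this file */
            sector_t *file_table;   /* file_max disk addresses */
        } open_file_t;

        /* returns 0 and sets *result on success, -1 if out of range */
        static int disk_address( open_file_t *f, sector_t file_address,
                                 sector_t *result ) {
            if (file_address >= f->file_max)
                return -1;          /* error: beyond the end of the file */
            *result = f->file_table[file_address];
            return 0;
        }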

    For large files, it is not sensible to keep the entire mapping table in main memory! The obvious place to put it is on disk, and one way to do this is to put the mapping table in another file. This is recursive, but it does not lead to an infinite regress if files below some minimum threshold size keep their tables in memory. In that case, the translation code might look something like:

    disk_address( file_address ) =
        if tiny_file then
           return file_table( file_address )
        else
           sector = file_address div slots_per_sector
           slot = file_address mod slots_per_sector
           read( buffer, sector, file_description_file )
           return buffer( slot )
        endif
    
    Here, tiny open files are described by a small table in memory, while large files are described by an open file that holds the table. To translate a disk address for a large file, the appropriate entry from the appropriate file sector must be read.

    Note that this approach is only efficient if a disk cache sits between the file system and the disk -- if not, an extra disk access would be required for every sector of a large file read or written. Because it is highly likely that consecutive sectors will be accessed, the locality principle operates here, and the use of an appropriate disk cache will eliminate most of the extra I/O operations.
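
    Rendered in C against the earlier sketches, the two cases might look
    like this; TINY_LIMIT is an assumed threshold, and the description
    file is itself just another device in our sense:

        #define TINY_LIMIT       8    /* assumed threshold, in sectors */
        #define SLOTS_PER_SECTOR (SECTOR_SIZE / sizeof (sector_t))

        typedef struct mapped_file {
            sector_t  file_max;                /* sectors in this file */
            sector_t  tiny_table[TINY_LIMIT];  /* if file_max <= TINY_LIMIT */
            device_t *description_file;        /* used for larger files */
        } mapped_file_t;

        static sector_t mapped_disk_address( mapped_file_t *f,
                                             sector_t file_address ) {
            if (f->file_max <= TINY_LIMIT) {
                return f->tiny_table[file_address];
            } else {
                sector_t buffer[SLOTS_PER_SECTOR];
                sector_t sector = file_address / SLOTS_PER_SECTOR;
                int      slot   = file_address % SLOTS_PER_SECTOR;
                f->description_file->read( f->description_file,
                                           (char *) buffer, sector );
                return buffer[slot];
            }
        }

    With a disk cache below the description file, repeated translations
    that touch the same description sector cost only one real disk access.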

    In fact, the scheme outlined above can also be viewed as storing disk files as tree structures:

                          _ _ _
    root (a tiny file)   |_|_|/|
                        __| |__
                      _|_ _   _|_ _
    description file |_|_|_| |_|/|/|
                    __| | |__ |___
                  _|_  _|_  _|_  _|_
    data sectors |___||___||___||___|
    
    In the above, both the "tiny" root table and each sector of the description file have been assumed to hold three disk addresses each. It is far more common to have upwards of 128 disk addresses per sector of the description file (or files), and the root structure frequently holds ten or so disk addresses.

    Tree structured file systems are quite common, but they are rarely implemented in the straightforward way suggested above. Instead, they are usually implemented using fairly awkward special purpose code.

  3. UNIX I-nodes

    The widely used UNIX file system is an example. There, each open file is represented in memory by a data structure called an I-node, where the I is usually said to stand for index. The classic version of the I-node data structure contains:

    The access rights.

    Owner, date of creation, etc.

    The disk addresses of the first 8 sectors. This allows fast access to the first few sectors of a file.

    The disk address of the sector containing pointers to the next 128 sectors of the file.

    The disk address of the sector containing pointers to the sectors containing pointers to the next 128x128 sectors of the file. Given that the classic UNIX sector size is 512 bytes, this allowed access to files of up to 8 + 128 + 128x128 = 16,520 sectors, or about 8 megabytes, on classic UNIX systems. This was sufficient for machines with 16-bit words, and it was sufficient for the disk technology available in the early 1970's, but by the late 1970's, it was obviously too small.

    Modern UNIX systems overcome the deficiency of the data structure outlined above by adding one more disk address to each I-node, one that supports a three-level tree of index sectors.
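
    Under the layout just described, the I-node might be sketched in C as
    follows. This follows the counts given above (8 direct pointers, then
    one single-, one double- and one triple-indirect pointer); the field
    names are illustrative, not the actual UNIX declarations:

        typedef struct inode {
            unsigned short mode;             /* access rights */
            unsigned short owner;            /* owner */
            unsigned long  ctime;            /* date of creation, etc. */
            sector_t       direct[8];        /* first 8 sectors of the file */
            sector_t       single_indirect;  /* pointers to next 128 sectors */
            sector_t       double_indirect;  /* pointers to pointers: 128x128 */
            sector_t       triple_indirect;  /* modern addition: 3-level tree */
        } inode_t;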

    The complexity of the UNIX I-node data structure is a result of the memory-limited context of early UNIX systems, combined with a desire to make the first few sectors of even the largest files easy to access.

  4. Fundamental problems.

    The two fundamental problems that a well engineered file system must address are as follows:

    Most files are small. Small shell scripts, bits of E-mail, single procedures and other bits and pieces make up the majority of the files on most systems. Many of these are significantly smaller than a single sector. Therefore, a well engineered file system should be able to store a large number of tiny files.

    Most accesses to files made by running programs are to large files. The big applications that justify the highest performance systems are the best examples of this. Therefore, a well engineered file system should support efficient access to very large files.