22C:116, Lecture 19, Spring 2002

Douglas W. Jones
University of Iowa Department of Computer Science

  1. File Systems -- More on the Device Interface

    Recall that a device driver is simply a component of the operating system that offers a standard interface to higher level system components such as (for example, in the context of UNIX), seek, read and write.

    There is no requirement that a device driver actually access a real device! One of the best known drivers that does not access a real device is the UNIX device /dev/null; this device happily returns from any operation requested by the user, reporting that there were no errors; on output, it does nothing at all; on input, it always returns an end-of-file indication.

    Another common example of a device driver that actually connects to no external device is the so-called RAM disk. On UNIX, there is a special device, /dev/kmem, that offers a RAM disk access model to all of the kernel's memory. This is useful only for debugging, and it is only accessable to the super-user!

    It is worth noting that each open file object has a notion of its own position, this is set by seek, and both read and write increment the position by the number of bytes read or written. This attribute is not a property of the device driver, and this requires us to modify our model somewhat! Open file objects must have a structure something like the following:

          open file                     device
          _________                    _________
         |________o------------------>|________o-------> read
         |_________| position         |________o-------> write
         |_________| \                |________o-------> ...
         |_________|  > file          |_________| \
         |_________| /  attributes    |_________|  > device
         |   ...   |                  |_________| /  attributes
    				  |   ...   |
    
    For a driver that just supports one device with fixed behavior, all of the device attributes may be embedded in the code for the driver. Other devices, however, have complex attributes:

    The open-file has not only a position, but a set of read-write permissions. These permissions are attributes of the file, not the device, so they go in the open-file object.

    In addition, we must consider the fact that users read and write files in arbitrary increments, reading 100 bytes now, then 20, then 200, and seeking to random byte positions in the file. In contrast, the hardware interface to the disk requires that all transfers be done in units of sectors.

    If the user is wise, and if the user knows the sector size, the user can always seek to positions in the file that are on a sector boundary, and then always read or write buffers that are a multiple of the size of a sector. In this place, read and write operations can go straight to the user's buffer.

    Because not all users do this, the system is prepared to allocate a buffer for the user's open file, where that buffer is big enough to hold one or more sectors. When such a buffer is alloated, the read and write operations called by the user copy between the system buffer and the user's address space, and only when the buffer is filled (on write) or empties (on read) does the system go to the actual device for an I/O operation.

    Thus, our open file object has the following fields:

          open file
          _________
         |________o------------------> device
         |_________| position
         |_________| allowed operations (read, write)
         |________o-----> buffer (may be nil)
         |_________| buffer status
    
    Buffer status information is needed in order to handle transitions from reading to writing. If the buffer has been allocated to hold data that is being read from the file and the user starts writing, there is no problem, but if the user has partially filled a buffer by writing and then wants to read at the current file position, some complex work is required.

    Exercise: Write the code required to manage the buffer associated with an open file; this should allow an arbitrary sequence of read, write and seek operations, with arbitrary buffer sizes provided by the user. Your code should make read and write operations to the underlying device in blocks of size N (the buffer size), doing all seeks to an integer multiple of N.

  2. File Systems -- Files as Virtual Devices

    A disk file can be thought of as nothing more than a virtual disk. The interface to the disk file looks, to the higher level blocking and deblocking code, exactly like the interface to an entire disk. The only thing different is that physical disks have a fixed file size; attempts to change their size by writing past the end-of-file always fail. On files, on the other hand, an attempt to write past the end of file simply enlarges the file.

    Therefore, from the point of view of higher level software, open disk files look just like another kind of device driver! The key difference lies in the device specific information in the device object and in the read and write routines.

          disk file
          _________
         |________o----> read  \
         |________o----> write  > within disk file
         |________o----> ...   /
         |________o---------------->  disk device
         |_________| \ how to          _________
         |   ...   |  |translate      |________o----> read
                      |disk           |________o----> write
    		 / address	  |________o----> ...
    				  |   ...   |
    

    The simplest example that fits this model is a partition of disk. To all higher level software, a partition is indistinguishable from a physical disk. Partitions are typically implemented with the most trivial of address translations:

          partition
          _________
         |________o----> read  \
         |________o----> write  > within partition
         |________o----> ...   /
         |________o----------->  actual disk (a device)
         |_________| base              _________
         |_________| bound            |   ...   |
    

    The read operation on a partition represented as described above can be implemented as follows:

    	partition read( buffer, length, sector_in_partition )
    	   if sector_in_partition > bound
    	      complain
    	   else
    	      sector_on_disk = sector_in_partition + base
    	      actual_disk.read( buffer, length, sector_on_disk )
    		     
    
    Historically, partitioning of disk drives originated because of poor file system design. File systems always seem to be designed with some assumption about the largest possible disk drive, and usually, this assumption fails -- new disk drives invariably seem to come on the market that are larger than the file system can accomodate.

    When this happens, the entire file system can be redesigned to support the larger disk drives, but this takes time, so as a stopgap measure, partitioning is used to subdivide the large disk drive into smaller virtual disks -- partitions, each of which can support a file system.

    Partitioning is not entirely bad, however! Once you are forced to partition your disk, you discover that it is convenient to put user data in one partition, temporary files in another, and system files in a third in order to simplify backup management. The user partition should be backed up frequently, while the temporary partition never needs to be backed up. The system partition doesn't need backups if that partition can be easily rebuilt from the system distributioin CDROMs, and even if the system is customized, only one backup of the system partition is needed, instead of the frequent backups that are made for the user partition.

    Partitioning can also compensate for a bad file system by forcing the sectors of a file into nearby parts of the disk -- poor file systems allocate sectors at random, and without partitioning, consecutive sectors could be anywhere on the disk.

    The big difference between disk files and partitions, from the user's perspective, is that a write past the end of a disk file causes the file to be enlarged, where a write past the end of a partition or physical disk is an error.

  3. File Systems -- Address Mapping for Disk Files

    The implementation of disk files must therefore use a more complex data structure to map (virtual) disk addresses within a file to physical disk addresses on the disk or partition where that file is stored.

    UNIX provides a classic solution to this problem. In UNIX, the data structure describing the allocatio of a file on disk is called an i-node. (This stands for information node.) The structure of a UNIX i-node has evolved over the years, but the basic idea is that there is one i-node for each file on the disk (or for each file on the partition); this describes where the data for that file is stored, and contains the other attributes of the file that are not encoded in the data. When the file is open, the i-node for the file is in main memory; when the file is closed, the i-node is on disk.

          disk file
          _________
         |________o----> read  \
         |________o----> write  > within file
         |________o----> ...   /
         |________o----------------->  actual disk (a device)
         |________o---->  i-node             _________
                         _________          |   ...   |
                        |_________| \
                        |_________|  > other file attributes
                        |_________| /
                        |_________| file size
                        |_________| disk address of sector 0
                        |_________| disk address of sector 1
                        |_________| disk address of sector 2
                        |__ ... __   ...
                         _________| disk address of sector 7
                        |_________| disk address of big-file index sector
                        |_________| disk address of large-file index sector
    

    This allows very fast translation (by simple table lookup) of the disk addresses of the first 8 sectors in any file. Most files need only a few sectors, so for most files, this so-called small model suffices.

    For larger files, so-called big files, the next disk address does not refer to a data sector of the file, but rather, to a sector holding the disk addresses of the next data sectors of the file. Classic UNIX had 512 bytes per sector, and disk addresses (linearized) were 32 bits or 4 bytes. This allows 128 disk addresses per disk sector, so the large-file index sector held the disk addresses addresses of data sectors 8 to 135 (135 = 127+8). This handles files up to 69,632 bytes in length.

    For even larger files, referred to as large-model files, the next disk address in the I-node holds the disk addresses of sectors used to hold the disk addresses of the next sectors of the file. For 512-byte sectors, this holds the addresses of 128 sectors, each holding 128 addresses, so in sum, it can hold the addresses of 128x128 or 16384 sectors, sector numbers 136 to 16519 of the file. Thus, this model allows for huge files up to 8,458,240 bytes.

    When disk and applications evolved to the point that there was a need for even larger files, two directions emerged. In one, sectors were enlarged to 4K bytes each, so each index sector held 1K disk addresses, and the large model allowed for 1M + 1K + 8 sectors. The alternative was to add keep the small sector size (having observed that many data files and most directories are smaller than 512 bytes) and instead, add a third level of inderection, the so-called huge model, allowing a total file size of 8+128x128+128x128x128 sectors in each file. Both of these schemes work well.

    Of course, the UNIX approach is not the only way to solve the problem. Many early systems only allowed random access use of partitions, and required all access to smaller disk files to be purely sequential. In these systems, each sector of a disk file contained the disk address of the next sector of that file.

    The UNIX approach, with its lopsided tree of index sectors, requires some rather irregular code to access sectors. Why not use a more uniform data structure such as a simple tree or a B-tree to access the sectors of the file?