11. Disk Input/Output

Part of the 22C:116 Lecture Notes for Fall 2002
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Disks

Rotating magnetic memory devices date back to the early 1950's. The first such devices were drums, with one head fixed over every track. The drum on ILLIAC I, for example, had 64 tracks, with an 8 by 8 array of heads. (This drum is still owned by the U of Illinois computer science department).

In the late 1950's and early 1960's, workers at IBM figured out how to make a record/playback head "fly" over the flat surface of a spinning disk. The result was a new technology, the moving head disk, where one head was positioned over each recording surface on the disk spindle and moved in or out to the appropriate track.

For large-capacity, high performance disks, whether in 1965 or 2000, a number of disk platters are rigidly attached to each disk spindle. The number of platters on one spindle has varied greatly over the years. IBM's original disk drive had over 20 platters, while it was common to see disks with 12 platters on a common spindle in the early 1970's. Usually, both sides of each platter are used for data storage; today's larger capacity disks typically have between 4 and 8 recording surfaces.

When there are multiple recording surfaces, there is typically only one head per surface, with all heads sharing a common head-positioning mechanism, as illustrated below:

     spindle
       ||  v------||    __________
  =====[]=====    ||___|          |
       ||  ^------||   |   head   |
       ||  v------||___|positioner|
  =====[]=====    ||   |__________|
     __||__^------||
    |      |
    | motor|
    |______|

It's fun to note that the tiny disks used in today's laptop computers frequently have the motor literally inside the hub of the disk, while the disk drives of the 1960's had motors like those found in washing machines, with a belt-drive from motor to disk spindle.

Moving head disks did not displace older technologies immediately. Drums continued to be made until the mid 1960's, and fixed head disks, disks with one head fixed over every track, persisted into the 1970's. Drums and fixed-head disks were particularly popular as paging and swapping devices because their lower latency led to reduced page-fault service times. At the other extreme, moving head disks have been made with more than one head per recording surface, and they have been made with more than one head positioner per disk drive. Such eccentric designs are uncommon and will not be discussed here!

The first production moving-head disks with one head per surface date from the very early 1960's. The IBM 1401 computer, introduced in 1961, came with the IBM 1505 disk drive (the combination was sold under the name RAMAC -- the Random Access Method of Accounting and Control, named after its initial business application). This disk drive had 25 platters, 200 tracks per surface, 5 sectors per track and 200 6-bit bytes per sector. 5 and 10 platter disks were common in the late 1960's and early 1970's, with platters about 14 inches in diameter. Some early disks used hydraulic head positioning mechanisms, but the dominant head positioning technology for high performance disks from the late 1960's to the present has been the voice-coil servo drive.

Reynold B. Johnson, 1906-1998, led the IBM lab in San Jose that developed the RAMAC; he also developed one of the first mark-sense readers and the 1/2-inch video-tape. (Professor Lindquist of the University of Iowa developed another early mark-sense reader to grade his college entrance exam, the ACT.)

While the speed and capacity of rotating storage media have increased immensely since the early days of computing, and sizes have fallen from large free-standing cabinets to palm-sized enclosures, the basic physics has remained unchanged, and thus, the basic issues with which the software must deal are the same now as they were in the early 1950's.

Disk Latency

The rotational speed of all high performance disks is fixed; typical disks of the early 1970's ran at 20 revolutions per second, while modern high performance disks may rotate much faster. The stored kinetic energy in the rotating disk is sufficient that it may take a few seconds to bring it up to speed on power-up, and any speed change after that is prohibitively expensive. This is as true of today's miniature disk drives with flea-sized motors as it was in the days when the disk assembly weighed 20 pounds and was spun by a washing-machine motor.

Some floppy disk drives, notably the old Apple Macintosh 800K drives, used a variable speed scheme to pack a bit more data onto a disk than they could pack onto a fixed speed disk. This worked only because the floppy disk itself is very lightweight and rotates relatively slowly; most of the motor power is needed to overcome friction and it takes only a small amount of power to accelerate or decelerate the disk. The alternative is to keep the rotational speed constant and vary the data clock rate from track to track. While variable speed clocks are easy to build at the low data rates typical of floppy disks, variable speed clocking for high performance hard drives is difficult and rare. The problem is, the highest performance drives move data to and from the disk at speeds close to the limit of the digital logic itself.

Magnetic disk recording surfaces are divided into numerous circular tracks; typically, each track is subdivided into sectors, and the basic operation on the disk is to read or write one sector of data. On a disk or drum with one head fixed over each track, the time delay between the issue of a read or write and the completion of that operation can be divided into two components, the rotational latency time and the transfer time.

The rotational latency time is the time it takes for the rotation of the disk to bring the start of the desired sector into position under the head. This may vary from one full revolution, if the start of the desired sector had just passed under the head at the time the transfer was requested, to nothing, if the start of the desired sector is just about to pass under the head at the time the transfer is requested.

The transfer time is the time it takes for the rotation of the disk to rotate one sector past the head.

With the advent of moving head disk technology in the early 1960's, a new component was added to the time taken for a single read or write operation, the head positioning latency time or seek time. (The latter term derives from the common use of the verb `to seek' to describe the act of moving the heads to the desired track.) This is the time it takes for the disk head positioning mechanism to move the heads to the desired track; it may vary from nothing, in the case where the head was already on the desired track, to an appreciable fraction of a second.

Typical head positioning hardware smoothly accelerates the head from the starting track to the ending track, and a bit of simple physics leads to the result that such hardware gives head movement times proportional to the square root of the seek distance (equivalently, the distance covered grows quadratically with the time spent moving). That is, if a move of one track takes one time unit, a move of four tracks will take two time units and a move of nine tracks will take three time units.
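
A minimal sketch of this timing model in C may make the three components concrete. The parameters used here (100 revolutions per second, 500 sectors per track, a 1 millisecond single-track seek) are illustrative assumptions, not the figures for any particular drive:

    /* illustrative disk timing model; all parameters are assumptions */
    #include <stdio.h>
    #include <math.h>

    #define REV_PER_SEC       100.0   /* assumed rotational speed         */
    #define SECTORS_PER_TRACK 500.0   /* assumed sectors per track        */
    #define ONE_TRACK_SEEK    0.001   /* assumed one-track seek time, sec */

    /* rotational latency averages half a revolution */
    double avg_rotational_latency(void) {
        return 0.5 / REV_PER_SEC;
    }

    /* transfer time is the time for one sector to pass under the head */
    double transfer_time(void) {
        return 1.0 / (REV_PER_SEC * SECTORS_PER_TRACK);
    }

    /* under smooth acceleration, seek time grows as the square root of
       the distance: 4 tracks take twice as long as 1, 9 take three times */
    double seek_time(int cylinders_moved) {
        if (cylinders_moved <= 0) return 0.0;
        return ONE_TRACK_SEEK * sqrt((double) cylinders_moved);
    }

    int main(void) {
        printf("average rotational latency = %.4f sec\n", avg_rotational_latency());
        printf("transfer time per sector   = %.6f sec\n", transfer_time());
        printf("seek across 100 cylinders  = %.4f sec\n", seek_time(100));
        return 0;
    }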

It is worth noting that the standard CD format, being designed for audio recording, uses a single spiral track. Like a magnetic disk, this track is divided into sectors, and random access to the recorded data, whether to play a specific track of an audio CD or to read a random sector of a CD-ROM on a computer, typically involves two steps, first moving the head blindly to the estimated radial distance of the desired sector, and then reading the first sector header encountered in order to see if the move was far enough. As a result, CD-ROM drives have head motion latency and rotational latency not too different from conventional magnetic disks.

Note on Data Formats

A typical sector on a disk is organized as follows:

 __________________________________________________________
|________|______________________________________|__________|
  prolog                  data                     epilog

Typically, the prolog includes synchronizing marks that allow the clock used for reading the sector to be synchronized with the data and that allow the hardware to recognize the start of a disk sector. Following this, the prolog typically includes the full disk address of the sector.

The disk address is included in the prolog for many reasons. First, it allows verification that the head positioning hardware and disk surface selection logic are working correctly, or at least, that they are working the same way they were when the disk was formatted. Second, on some disks, the head positioning servo actually uses the track or cylinder numbers read off the disk to move the heads in or out to the right track, and finally, the addressing logic of many disk systems relies on the sector number recorded at the start of each sector to determine whether the sector is the one the user wishes to access.

The end of the prolog is typically a checksum on the prolog. Once the prolog is recorded by the disk formatter, it is usually never written again. When a write operation is requested, the hardware begins by reading and verifying the prolog. Only then does it begin to write the data.

The epilog typically consists of a checksum. All read and write operations must process the entire sector, either writing a full sector and then writing the checksum, or reading the full sector and then verifying the checksum. If the disk controller allows partial sector reads or writes, this is an illusion. A partial sector write almost always overwrites the full sector, appending arbitrary data to the user's data in order to achieve this. A partial sector read only reports a fraction of the sector's data to the user, but the entire sector must be read by the controller in order to verify the checksum.
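
The layout can be pictured as a structure. The following C sketch is purely illustrative; the field names and sizes (8 sync bytes, 512 data bytes, 16-bit and 32-bit checksums) are assumptions, not the format of any real drive:

    /* an illustrative sector image, as the controller might see it */
    #include <stdint.h>

    struct sector_prolog {
        uint8_t  sync_marks[8];    /* clock-synchronizing pattern          */
        uint16_t cylinder;         /* full disk address of this sector     */
        uint8_t  surface;
        uint8_t  sector;
        uint16_t prolog_checksum;  /* written once, at format time         */
    };

    struct sector_epilog {
        uint32_t data_checksum;    /* verified on every read, rewritten on
                                      every write of the data field        */
    };

    struct sector_image {
        struct sector_prolog prolog;
        uint8_t              data[512];   /* assumed sector payload size   */
        struct sector_epilog epilog;
    };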

Formatting a disk

When a "raw" disk is mounted on a disk drive, it must be formatted. Typically, this involves writing predetermined data in each sector of each track of the disk. The important part of the formatting operation from this perspective is not the data recorded but rather that the formatter puts valid prologs and epilogs on all of the sectors are written.

Usually, the disk formatting software also verifies that each sector is readable and writable, gathers the good sectors into a free-space data structure, and creates a file system of some type on the disk. The latter subject will be discussed in more detail later, and in fact, close examination of most formatting programs shows that these two jobs are largely independent.

When the disk interface is configured for disk formatting, the hardware generally allows the software to write out arbitrary data to fill entire sectors, including the prolog and epilog. This gives the software the option of laying down sectors in arbitrary order on each track of the disk. If they are put down in consecutive ascending order, it may be impossible to read or write consecutive sectors at full speed. This is because, by the time the interrupt marking the end of one read or write has been processed, it may be too late to start work on the next sector without waiting a full revolution.

To avoid this problem, many systems arrange the sectors on a disk in interleaved order, so that logically consecutive sectors are separated by one or more other sectors. The following example illustrates the interleaving of 16 sectors, using consecutive letters of the alphabet to label logically consecutive sectors.

     A B C D E F G H I J K L M N O P  -- no interleaving
     A I B J C K D L E M F N G O H P  -- 2-way interleaving, almost
     A G L B H M C I N D J O E K P F  -- 3-way interleaving
     A E I M B F J N C G K O D H L P  -- 4-way interleaving, almost
     A N K H E B O L I F C P M J G D  -- 5-way interleaving

The 2-way and 4-way interleaving patterns shown above are imperfect because 2 and 4 evenly divide 16. The 3 and 5-way interleaving patterns are perfect. Interleaving can be done in hardware, or it can be done in software.

When the physical addresses recorded in the sector prologs on the disk are not interleaved, the software may interleave them. In general, if there are n sectors, numbered consecutively from 0 to n-1 on each track, and if we want m-way interleaving, then so long as n and m are relatively prime (they have no common factors) the software can generate interleaved sector numbers using the formula:

interleave(i) = (m * i) mod n
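
The following C sketch applies this formula to build an interleaving table. The check that m and n are relatively prime guards against a mapping that is not one-to-one, and the 16-sector, 5-way case reproduces the last pattern shown above:

    /* software interleaving: logical sector i at physical (m * i) mod n */
    #include <stdio.h>

    static int gcd(int a, int b) {
        while (b != 0) { int t = a % b; a = b; b = t; }
        return a;
    }

    /* fill table[i] with the physical sector for logical sector i;
       returns 0 if m and n are not relatively prime */
    int build_interleave_table(int table[], int n, int m) {
        int i;
        if (gcd(n, m) != 1) return 0;   /* mapping would not be one-to-one */
        for (i = 0; i < n; i++)
            table[i] = (m * i) % n;
        return 1;
    }

    int main(void) {
        int table[16], i;
        if (build_interleave_table(table, 16, 5)) {   /* 5-way interleaving */
            for (i = 0; i < 16; i++)
                printf("logical %2d -> physical %2d\n", i, table[i]);
        }
        return 0;
    }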

The software interface

To read or write one sector of a disk, the software must:

  1. Specify the address of a buffer in memory.

  2. Specify one of the several disk drives that may be attached to a single controller.

  3. Specify how to position the heads on the drive in question. This is usually called the cylinder portion of the disk address.

  4. Specify which recording head or recording surface is to be used on the selected drive.

  5. Specify which sector of the track selected by the cylinder and surface is to be used.

  6. Specify the data transfer direction.

In simple disk controllers, each of the above might appear as a separate register in the controller, writable by software. Typically, the actual transfer begins only when the final command, the transfer direction, is issued; it is the act of loading a value into this last register that actually starts the controller in operation. Thus, the loading of a memory address or a field of the disk address has no effect until the software actually specifies a read or write operation, at which point the values previously loaded are used to initiate a seek on the appropriate drive, await the right sector of the selected track, and then transfer the data.
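
As a sketch of what this looks like to software, consider the following hypothetical memory-mapped register layout. The structure, the register order and the address 0xC0000000 are all assumptions made for illustration, not the interface of any real controller:

    #include <stdint.h>

    struct disk_controller {
        volatile uint32_t buffer_address;  /* 1. memory buffer              */
        volatile uint32_t drive;           /* 2. which drive on controller  */
        volatile uint32_t cylinder;        /* 3. head position              */
        volatile uint32_t surface;         /* 4. recording surface          */
        volatile uint32_t sector;          /* 5. sector within the track    */
        volatile uint32_t command;         /* 6. direction; starts transfer */
        volatile uint32_t status;          /* done and error bits           */
    };

    #define DISK ((struct disk_controller *) 0xC0000000)  /* assumed address */
    #define CMD_READ  1
    #define CMD_WRITE 2

    void start_read(uint32_t buf, uint32_t drive,
                    uint32_t cyl, uint32_t surf, uint32_t sect) {
        DISK->buffer_address = buf;
        DISK->drive          = drive;
        DISK->cylinder       = cyl;
        DISK->surface        = surf;
        DISK->sector         = sect;
        DISK->command        = CMD_READ;   /* loading this final register
                                              starts the seek and transfer */
    }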

When a data transfer is completed, a status register typically displays this fact, along with an indication of any errors detected. In addition, the done condition usually results in an interrupt request.

Added Complexity

Interface busses such as the SCSI bus separate the responsibility for each operation into two halves, as illustrated below:

           __________            __________   ____
CPU ----->|   DMA    |          |   DISK   |-|disk|
          |controller|<========>|controller|-|disk|
Memory <=>|__________|          |__________|-|____|

This design adds significant complexity, and it is noteworthy that such hardware has been common since the 1960s, when the IBM 360 channel architecture served as the archetypal example. There, the DMA controller was called a channel processor. Today's examples include SCSI, USB and Firewire controllers; as with the IBM 360, most of these allow multiple disks (and their controllers) to be attached to one such bus.

Of course, everything is smaller today than in the 1960's. Today's disk controllers are made of a few chips mounted on a single printed circuit board attached to each disk drive, and today's channel processors are similarly small, in many cases built into the motherboard of the host computer. In the 1960's, each of these occupied a free-standing cabinet full of electronics.

Then and now, commands from the CPU to the channel controller, perhaps a SCSI or Firewire controller, direct it to transfer data to or from some memory buffer and direct it to forward certain additional commands to the disk controller. All of the usual I/O operations are possible with such systems, but most such systems also allow many nonsensical operations such as directing the channel controller to read data from the disk to memory, while directing the disk controller to write data from the memory to disk.

Another complicating factor is introduced by the fact that it is frequently useful to start one disk seeking while another reads or writes data. This is particularly important in large servers. To support this, many disk controllers allow a seek operation to be initiated on one disk without initiating either a read or a write. Software that actually uses this feature is common in high performance operating systems and is only slowly becoming common on personal computers as they migrate from single-user operating systems to scaled-back versions of general purpose operating systems.

Virtual memory adds another layer of complexity. Generally, the user's I/O buffer is not forced to lie entirely within one page of the user's address space, but instead, it may span several pages. While these may be consecutive in the user's virtual address space, they might be anywhere at all in the physical address space. DMA controllers that understand the structure of page tables have been built, but for most simple DMA controllers, the I/O software must first gather data from the fragments of the user's buffer into a physically contiguous buffer, and then do a DMA write from that buffer to disk, and similarly, after a DMA read from disk, the system must scatter the data from the system's buffer to the fragments of the user buffer.

Many high end DMA controllers support scatter read and gather write operations. For these controllers, there is not a single base/count register pair describing the user's buffer, but rather, a list of base/count pairs describing the fragments of the user's buffer. Usually, this list is in main memory, and the DMA controller has a register that points to the next pair to process when the current pair is completed. Only when the list is exhausted does the transfer stop. In fact, this is an old feature! The IBM 360 channel controllers supported an even more complex variation on this through what were called channel programs.
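
A scatter/gather list is, in essence, an array of base/count pairs. The following C sketch shows one plausible shape for such a list; the field names and the fixed maximum are chosen for illustration rather than taken from any particular controller or from the IBM 360 channel program format:

    #include <stdint.h>
    #include <stddef.h>

    struct sg_entry {
        uint32_t base;    /* physical address of one fragment of the buffer */
        uint32_t count;   /* number of bytes in that fragment               */
    };

    /* The driver builds one entry per physical fragment of the user's
       buffer and hands the controller the address of the first entry;
       the controller advances to the next entry whenever the current
       count is exhausted, stopping at the end of the list. */
    struct sg_list {
        size_t          n_entries;
        struct sg_entry entry[16];   /* illustrative fixed maximum */
    };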

Some disk controllers allow multiple sectors to be read in sequence, so if the buffer is larger than one sector, it can be entirely read in one disk transfer, so long as the desired sectors are consecutively numbered on disk. Thus, depending on the controller, it is sometimes possible to read or write as much as an entire track or even a cylinder to or from the disk in one transfer.

Finally, at the other extreme of performance, some modest and low performance disk controllers have been built without the full generality of DMA. Instead, the disk controllers are built with an internal buffer able to hold one sector or one track, and all disk I/O is to or from this buffer. Prior to writing the disk, software running on the CPU must copy the data from the application's memory area to the disk buffer, and after a read, it is up to software to copy from the disk buffer to the user's buffer. The disk buffer in such controllers may or may not be addressable as a normal memory area; in the lowest performance disks, data is transferred between the CPU and this buffer one byte at a time.

Disk Request Queues

Typically, assuming a simple DMA disk interface, without the added features discussed above, a modern operating system will maintain a queue of disk I/O requests. Each request is a record of a disk input or output operation that has been requested by a user but not yet carried out by the disk interface. A typical disk I/O queue will contain records with a format such as the following:

     Disk Request Record:
         Buffer Address in main memory.
         Disk Address:
            cylinder.
            surface.
            sector.
         transfer direction
         done semaphore
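
The same record might be declared in C roughly as follows; the semaphore type is assumed to be supplied by the kernel's own synchronization code:

    struct semaphore;                      /* assumed kernel primitive      */

    /* one pending disk I/O request */
    struct disk_request {
        void     *buffer;                  /* buffer address in main memory */
        unsigned  cylinder;                /* disk address: cylinder        */
        unsigned  surface;                 /*               surface         */
        unsigned  sector;                  /*               sector          */
        enum { DISK_READ, DISK_WRITE } direction;
        struct semaphore *done;            /* V'd by the interrupt handler  */
        struct disk_request *next;         /* link used by the queue        */
    };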

A user process, or rather, the file system or page fault handler acting on behalf of that user, might pose a disk request as follows:

    low-level user disk I/O request routine
         enqueue( request, disk-queue )
         enable( disk_interrupt )
         P( request.semaphore )  -- wait for I/O complete
         return

The interrupt service routine might operate as follows:

    disk interrupt service routine
         disable( disk_interrupt )
         if request not null then V( request.semaphore )
            -- the above lets a waiting user continue
         if not empty( disk-queue ) then
            request := dequeue( disk-queue )
            buffer-address-register := request.buffer-address
            disk-address-registers := request.disk-address
            disk-command := request.transfer-direction
            enable( disk_interrupt )
            return
         else
            request := null
            return

Disk interrupts signal the completion of the previously requested operation, so the first thing the interrupt service routine does is signal the semaphore on which the process requesting the previous service may have blocked. Having done this, the interrupt service routine continues by starting a new disk transfer, if one is pending in the disk request queue.

Note that, if the queue is empty, the disk routine returns, leaving the interrupt disabled but doing nothing to make the controller drop its interrupt request. When a user enqueues a new request, the user code enables the interrupt, causing an immediate interrupt and allowing the controller to immediately start the newly requested transfer.

Exercise: The above pseudocode for the disk interrupt service routine and for enqueueing disk requests assumes that disk I/O is being requested for typical user code. Imagine the case where some disk requests are enqueued by the page-fault service routine. In that case, it may be awkward for the service routine to block on a semaphore (some models of interrupt service prohibit service routines from blocking on anything), so it may be better to have the disk interrupt service routine finish the processing of the page fault. Propose an augmented data structure for the disk request queue and pseudocode for the page-fault handler and disk interrupt service routine to do this.

Improving Disk Performance, Disk Caches

One way to improve the performance of a disk system is to use the queue of pending disk requests as a cache. New read requests are only posted to the queue if no write is pending to the same sector, and new write requests cause previously posted write requests addressed to the same sector to be deleted. The following version of the user's code to post a new disk I/O request to the disk request queue describes this:

    low-level user disk I/O request routine
       if request.disk-address matches prior.disk-address
          where prior is a request already in the disk-queue
          if request.direction = in and prior.direction = out
             copy prior.buffer into request.buffer
             return
          if request.direction = out and prior.direction = out
             delete prior from disk-queue
             V( prior.semaphore )  -- unblock the earlier writer
             -- fall through; the new write is enqueued below
       enqueue( request, disk-queue )
       enable( disk_interrupt )
       P( request.semaphore )
       return

The result of this logic is that there will never be more than one write queued for any disk sector, and if a write is queued for a sector, there will be no reads queued for that sector.

Exercise: Multiple reads from the same sector can also be combined. Propose additions to the disk request queue data structure and to the pseudocode given here that would do this.

The use of a disk cache poses risks, whether the cache is based on the request queue, as outlined above, or whether it is based on specifically reserved buffer space in main memory. In either case, there is the possibility that some write may never be recorded to disk! If some software keeps writing some sector, each such write will cancel the write already in the queue, and the result is that no write will ever be done. So long as there are no system failures, this is not a problem, but if there is a power failure, this could cause the loss of very important data. Therefore, it is common for systems that support disk cache schemes to include, in the kernel, a specific operation that forces all pending writes to be completed. In the UNIX system, the sync() kernel call does this.
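
Under UNIX this looks roughly as follows: sync() asks the kernel to schedule all dirty cached blocks for writing, while the related fsync() call (on systems that provide it) waits until the blocks of one particular open file have actually reached the disk:

    #include <unistd.h>

    /* force cached writes out to disk before a critical point */
    void checkpoint(int fd) {
        fsync(fd);   /* wait for this file's blocks to reach the disk     */
        sync();      /* schedule every remaining dirty block, system-wide */
    }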

Improving Disk Performance, Disk Schedulers

Disk scheduler software can markedly improve the performance of disk I/O for any disk that serves multiple independent sources of I/O requests. This is typical of disks in large timesharing systems and of disks in network file servers. The key is to note that, if there are multiple requests in the disk I/O queue, the time taken to process those requests depends on the order in which they are processed. This is because the latency time depends on the order in which they are accessed.

For example, consider a request queue containing 2 requests for operations on track 1 and two requests for operations on track 400. Clearly, if these are done in the order (1, 400, 1, 400) the result will be quite a bit slower than if they are done in the order (1, 1, 400, 400). The former requires three long seeks, while the latter requires only one long seek.

Thus, if disk requests are scheduled in the appropriate order, disk performance can be markedly improved. It is generally impossible to point to a separate procedure in the operating system and say "this is the disk scheduler". Instead, the disk scheduler is split between the two ends of the disk request queue, and the queue data structure is itself the structure used to organize the scheduler.

Typically, the queue for each disk is broken into separate pieces for each cylinder, and the enqueue routine on the user side of the disk request queue is made responsible for enqueueing the request in the appropriate cylinder's queue. The scheduling policy is then the responsibility of the disk interrupt service routine.

Typical scheduling policies are:

  1. FIFO, the base case, is inherently fair to all users, but it does nothing to improve throughput. (A fun exercise is to quantify fairness and then prove that the FIFO policy is optimal).

  2. Shortest seek time first. After processing all requests for the current cylinder, the next cylinder to be processed is the one with pending requests that is closest to the current cylinder. This policy produces good throughput (the number of I/O requests processed per second), but it is unfair; it may cause some users to wait indefinitely. (The optimality of this method depends on how the seek time varies as a function of the number of cylinders moved. For a fun exercise, figure out the class of functions for which this is optimal.)

  3. The bidirectional elevator algorithm. After processing all requests for the current cylinder, the next cylinder to be processed is the one closest to the current cylinder in a fixed direction, for example, towards the center of the disk. Only if no work is pending in any cylinder in that direction does the preferred direction reverse so that future seeks will be in the opposite direction. This technique offers a good throughput, but it discriminates against disk requests near either extreme of the range of valid cylinder numbers.

  4. The unidirectional elevator algorithm. After processing all requests for the current cylinder, the next cylinder to be processed is the one closest to the current cylinder in a fixed direction, for example, towards the center of the disk. Only if no work is pending in any cylinder in that direction does a seek in the opposite direction take place. In that case, the most distant available cylinder with pending work is selected, and the preferred direction remains unchanged. This technique offers equal service to all cylinders while only slightly reducing the throughput of the bidirectional elevator.

The elevator algorithms are based on the model used by elevators in tall buildings. The elevator does not visit floors in FIFO order, that is, going from floor to floor in the order those floors were requested by users pushing buttons requesting visits to various levels of the building. Instead, the elevator sweeps systematically up the building, visiting those floors for which it has pending requests and stopping when it reaches the highest requested floor. At that point, it begins a downward sweep, stopping at the lowest requested floor. This is exactly the bidirectional elevator algorithm, with floor numbers substituted for cylinder numbers.

                 Queues for each
                  disk cylinder

                    ---------   |
                      00000 |   |
                    ---------   |
                        000 |   |
    Source          ---------   |
      of    ---->      0000 |   |
   requests         --------- /---\  Cylinder
                                00|  currently
                    --------- \---/  being
                          0 |        served
                    ---------

The conflict between throughput and fairness illustrated by the differences between the unidirectional and bidirectional elevator algorithms is typical of a number of scheduling issues in operating systems. Attempts to optimize against a single goal are fairly straightforward, but there is no solution which is simultaneously optimal against all criteria!

The following pseudocode implements the unidirectional elevator algorithm; in this code, the single disk request queue has been replaced with an array of queues, one per cylinder:

    low-level user disk I/O request routine
       enqueue( request, disk-queue[ request.cylinder ] )
       enable( disk_interrupt )
       P( request.semaphore )  -- wait for I/O complete
       return

The interrupt service routine consumes requests from the queues as follows:

    disk interrupt service routine
       disable( disk_interrupt )
       if request not null then V( request.semaphore )
          -- the above lets a waiting user continue
       if empty( current-queue ) then
          -- time to move the elevator to a new floor
          i := current-cylinder
          repeat
             i := (i + 1) mod cylinders
             until i = current-cylinder or not empty( disk-queue[i] )
          current-queue := disk-queue[i]
          disk-queue[i] := empty
          current-cylinder := i
       end if
       if not empty( current-queue ) then
          request := dequeue( current-queue )
          buffer-address-register := request.buffer-address
          disk-address-registers := request.disk-address
          disk-command := request.transfer-direction
          enable( disk_interrupt )
          return
       else
          request := null
          return

Note that the disk interrupt service routine uses a separate queue, current-queue, to hold the requests for the cylinder it is currently servicing. This prevents late arrivals at that cylinder from being served until the next visit to that cylinder; if this were not done, and if some process continuously maintained a supply of pending disk requests for one cylinder, no other cylinder would ever be served.

Another detail to note is that the search for the next cylinder to process need not be a sequential search, as outlined above. If the disk queue array is maintained as a circular sparse array (a linked list structure), so that empty queues are not represented at all, the next queue will always be adjacent to the current one, and the computation shown as a search above can be recoded as a simple pointer assignment. (For a pleasant exercise, write the necessary code.)

Device Independence

The desire for device independent input/output was one of the original motives behind the development of polymorphic object-oriented programming methodologies. Typically, we ask that each device support some set of input/output operations, for example:

     Read( ch )                  Arbitrary software
     Write( ch )        <----->  that performs actions
     Read( buf, add )            that look like I/O to
     Write( buf, add )           the user, and may even
                                 involve I/O

     Used for disk files,
     windows, network ports,
     and similar abstractions.

The advantages of using a standard software interface for every device, whether disk or terminal or printer or network port, have been known since the development of the FORTRAN programming language in the late 1950's. To achieve this, it is necessary to have a user-level device interface that is sufficiently flexible to operate with both random access and sequential devices, supporting both stream and block modes of operation on any device. The low level Unix interface (read, write, seek) is one good example of such an interface.
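
For example, the following C fragment uses that low level UNIX interface; the same open/lseek/read/write calls work whether the name refers to an ordinary disk file or to a device such as a terminal (the function name and buffer handling here are merely illustrative):

    #include <fcntl.h>
    #include <unistd.h>

    /* read one block from an arbitrary byte offset in a file or device,
       then write it to standard output, which may itself be a file,
       a terminal, or a pipe -- the calls are the same in every case */
    int copy_block(const char *path, off_t offset, char *buf, size_t len) {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return -1;
        if (lseek(fd, offset, SEEK_SET) == (off_t) -1) { close(fd); return -1; }
        ssize_t got = read(fd, buf, len);
        close(fd);
        if (got < 0) return -1;
        return (int) write(STDOUT_FILENO, buf, (size_t) got);
    }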

This interface may be considered a virtual device interface, with the input/output software of the operating system translating virtual input/output operations into physical ones.

The virtual device may also be considered as an object, in the sense used in object-oriented programming. In fact, the first clear description of how objects in this sense can be implemented came from the field of operating systems, in a paper by Christopher Strachey describing his OS 6 system from the mid 1960's. In this system, he used the following implementation:

       An open file descriptor
		________
 F ----------->|     o--|--------> Read( F, ch )
	       |     o--|--------> Write( F, ch )
	       |________|    
 Data specific |        |     Pointers to code
 to this file  |        |     to do device dependent
 and device    |        |     part of the I/O operation
	       |________|


  Read( F, ch ) <------ the device independent read
  {                     routine called by the user.
     F^.Read( F, ch )
  }                     It simply calls the device
			specific routine, passing
			the pointer to the data
			needed to do the operation.

This was originally implemented in BCPL (the Basic Combined Programming Language), a predecessor of C based on CPL. Strachey invented CPL, which stood either for Christopher's Programming Language, the Combined Programming Language, or the Cambridge Programming Language, depending on who you talk to.

In systems with no protection, a file variable can be as simple as a pointer to the descriptor for the open file. This descriptor is typically created by the Open system service. If there is protection, the descriptor must be protected from the user program, so F, the file variable, is more complex than just a pointer. On Unix, for example, F is an integer index into a system maintained array of pointers to descriptors (the array is called the file table of the current process).

In object oriented terms, each of Strachey's open files is an object, an instance of the class file. The class file is amenable to multiple implementations, each requiring different code; such a type is called a polymorphic type, and one requirement of the operating system context is that multiple implementations of the abstraction coexist, so that, for example, a user can have one open file refer to a file on disk, while another represents the keyboard.

In object oriented programming languages, the same approach described here is commonly used to implement polymorphic objects. The data structure representing the object contains pointers to the action routines that implement the abstract operations on the object.
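
In C, which has no built-in objects, Strachey's scheme can be sketched with a structure holding function pointers; the names here are illustrative:

    /* an open-file descriptor in the style described above */
    struct file {
        int  (*read)(struct file *f, char *ch);   /* device dependent code */
        int  (*write)(struct file *f, char ch);
        void *data;              /* data specific to this file and device  */
    };

    /* the device independent routines called by the user simply
       forward the call through the pointers in the descriptor */
    int Read(struct file *f, char *ch)  { return f->read(f, ch); }
    int Write(struct file *f, char ch)  { return f->write(f, ch); }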

File Systems

An open disk file is an example of a virtual device just as much as a disk or a printer interface is, when presented through a device independent interface. Thus, we can extend our polymorphic class hierarchy for open files to include not only open files which refer directly to physical devices such as printer ports, serial communications lines, and disk drives, but also open files that refer to files on the file system, windows under the window manager, and other virtual I/O devices.

Therefore, the underlying data structure for an open disk file under a modern file system might resemble that shown in the following diagram:

                ________
         F --->|     o--|--------> File System 
               |     o--|--------> Code
               |________|    
               |        |              ________
               |     o--|------------>|     o--|------> Disk I/O
               |        |             |     o--|------> Code
               |________|             |________|
             Descriptor of            |        |
             open disk file           |        | Data needed to
                                      |        | access the disk
                                      |________|
                                     Descriptor of
                                     the disk itself.

        Code to read and write an open disk file

        * May convert block size to that of disk.
        * Translates address of block in file
          to disk address on the real disk.

        The latter is Virtual Address Translation

The descriptor of the open disk file will typically contain buffers for block size conversions, and it will contain the data structure needed to convert addresses from file-relative form to absolute disk addresses.

The descriptor of the disk device itself will typically contain the information needed for disk scheduling, the queue needed to communicate with the disk interrupt service routines, and other information specific to the actual disk being used.

The read and write routines for the open disk file can be written in such a way that they call the standard, device independent read and write routines of the disk itself -- in effect, the read and write routines that implement the open disk file can be written as if they were user-level code.
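
As a final sketch, the file-system layer's read routine might look roughly like this in C. The simple base-plus-offset translation assumes, purely for illustration, a contiguously allocated file (a real file system would consult an index structure), and block_read stands for the disk's own device independent read routine:

    struct file;   /* descriptor of the underlying disk, as sketched earlier */

    /* the disk's device independent block read routine, assumed to exist */
    extern int block_read(struct file *disk, unsigned block, char *buf);

    struct disk_file {
        struct file *disk;         /* descriptor of the disk itself        */
        unsigned     first_block;  /* where the file starts on that disk   */
        unsigned     n_blocks;     /* length of the file, in blocks        */
    };

    /* translate a file-relative block number to a disk address and read it,
       exactly as user-level code would read the raw disk */
    int disk_file_read(struct disk_file *df, unsigned file_block, char *buf) {
        if (file_block >= df->n_blocks) return -1;   /* beyond end of file */
        unsigned disk_block = df->first_block + file_block;
        return block_read(df->disk, disk_block, buf);
    }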