22C:116, Lecture Notes, Sept. 15, 1995

Douglas W. Jones
University of Iowa Department of Computer Science

  1. Disks

    Rotating magnetic memory devices date back to the early 1950's. The first such devices were drums, with one head fixed over every track. The drum on ILLIAC I, for example, had 64 tracks, with an 8 by 8 array of heads.

    In the late 1950's and early 1960's, workers at IBM figured out how to make a record/playback head "fly" over the flat surface of a spinning disk. The result was a new technology, the moving head disk, where one head was positioned over each recording surface on the disk spindle and moved in or out to the appropriate track.

    For large-capacity, high performance disks, whether in 1965 or 1995, a number of disk platters are rigidly attached to each disk spindle. Thus, a disk drive may have from one to 20 recording surfaces; 4 or 8 surfaces per drive are typical of modern equipment.

    When there are multiple recording surfaces, there is typically only one head per surface, with all heads sharing a common head-positioning mechanism, as illustrated below:

           ||  v------||    __________
      =====[]=====    ||___|          |
           ||  ^------||   |   head   |
           ||  v------||___|positioner|
      =====[]=====    ||   |__________|
        |      |
        | motor|

    Drums continued to be made until the mid 1960's, and fixed head disks, disks with one head fixed over every track, persisted into the 1970's.

    While the speed and capacity of rotating storage media have increased immensely since the early days of computing, the basic physics has remained unchanged, and thus, the basic issues with which the software must deal are the same now as they were in the early 1950's.

  2. Latency

    The rotational speed of all high performance disks is fixed; typical disks of the early 1970's ran at 20 revolutions per second, and modern high performance disks may rotate at 3 to 5 times this speed. The stored kinetic energy in the rotating disk is sufficient that it may take a few seconds to bring it up to speed on power-up, and any speed change after that is prohibitively expensive.

    Some floppy disk drives, notably the Apple Macintosh 800K drives, use a variable speed scheme to pack a bit more data onto a disk than they could pack onto a fixed speed disk. Equivalently, a variable speed clock used in recording the data can achieve this with a fixed rotational speed.

    Disk recording surfaces are divided into numerous circular tracks; typically, each track is subdivided into sectors, and the basic operation on the disk is to read or write one sector of data. On a disk or drum with one head fixed over each track, the time delay between the issue of a read or write and the completion of that operation can be divided into two components, the rotational latency time and the transfer time.

    The rotational latency time is the time it takes for the rotation of the disk to bring the start of the desired sector into position under the head. This may vary from one full revolution, if the start of the desired sector had just passed under the head at the time the transfer was requested, to nothing, if the start of the desired sector is just about to pass under the head at the time the transfer is requested.

    The transfer time is the time it takes for the rotation of the disk to rotate one sector under the head.

    With the advent of moving head disk technology in the early 1960's, a new component was added to the time taken for a single read or write operation, the head positioning latency time or seek time. (The latter term derives from the common use of the verb `to seek' to describe the act of moving the heads to the desired track.) This is the time it takes for the disk head positioning mechanism to move the heads to the desired track; it may vary from nothing, in the case where the head was already on the desired track, to an appreciable fraction of a second.

    Typical head positioning hardware smoothly accelerates the head from the starting track to the end track, and a bit of simple physics leads to the result that, with constant acceleration, the distance covered grows as the square of the time, so the head movement time grows as the square root of the distance moved. That is, if a move of one track takes one time unit, a move of four tracks will take two time units and a move of nine tracks will take three time units.

  3. Note on Data Formats

    A typical sector on disk is organized as follows:

      prologue                data                    epilogue

    Typically, the prologue includes synchronizing marks that allow the clock used for reading the sector to be synchronized with the data and that allow the hardware to recognize the start of a disk sector. Following this, the prologue typically includes the full disk address of the sector.

    The disk address is included in the prologue for several reasons. First, it allows verification that the head positioning hardware and disk surface selection logic are working correctly, or at least, that they are working the same way they were when the disk was formatted. Second, on some disks, the head positioning servo actually uses the track or cylinder numbers read off the disk to move the heads in or out to the right track. Third, the addressing logic of many disk systems relies on the sector number recorded at the start of each sector to determine whether the sector is the one the user wishes to access.

    The end of the prologue is typically a checksum on the prologue. Once the prologue is recorded by the disk formatter, it is usually never written again. When a write operation is requested, the hardware begins by reading and verifying the prologue. Only then does it begin to write the data.

    The epilogue typically consists of a checksum. All read and write operations must process the entire sector, either writing a full sector and then writing the checksum, or reading the full sector and then verifying the checksum. If the disk controller allows partial sector reads or writes, this is an illusion. A partial sector write almost always overwrites the full sector, appending arbitrary data to the user's data in order to achieve this. A partial sector read only reports a fraction of the sector's data to the user, but the entire sector must be read by the controller in order to verify the checksum.
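    The illusion of a partial sector read can be made concrete. In the sketch below, the sector size and all names are invented for illustration; the point is that the copy to the user's buffer can only happen after the controller has read and checksummed the whole sector:

```c
#include <string.h>

#define SECTOR_SIZE 512   /* an assumed sector size */

/* What a "partial sector read" really does: by the time this copy can
   happen, the controller has already read the entire sector and
   verified its checksum; only the delivery to the user is partial. */
int partial_read(const unsigned char sector[SECTOR_SIZE],
                 unsigned char *user_buf, int offset, int count) {
    if (offset < 0 || count < 0 || offset + count > SECTOR_SIZE)
        return -1;                    /* request exceeds the sector */
    memcpy(user_buf, sector + offset, count);
    return count;
}
```

    A partial write is the mirror image: the controller must still write a full sector, so it pads the user's fragment with arbitrary data before the transfer.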

  4. Formatting a disk

    When a "raw" disk is mounted on a disk drive, it must be formatted. Typically, this involves writing random data to each sector of each track of the disk. The important part of the formatting operation is that the prologues on all of the sectors are written.

    Usually, the disk formatting software also creates a file system of some type on the disk, a subject to be discussed later.

    When writing in disk formatting mode, the software can typically write out arbitrary data to fill entire sectors, including the prologue and epilogue. This gives the software the option of laying down sectors in arbitrary order on each track of the disk. If they are put down in consecutive ascending order, it may be impossible to read or write consecutive sectors at full speed. This is because, by the time the interrupt marking the end of one read or write has been processed, it will be too late to start work on the next sector without waiting a full revolution.

    To avoid this, many disk formatting programs record sectors in interleaved order, so that logically consecutive sectors are not physically consecutive. Here are some examples of interleaved sectors on a fictional 8 sector per track disk:

     0  1  2  3  4  5  6  7    non-interleaved
     0  4  1  5  2  6  3  7    2-way interleaved
     0  3  6  1  4  7  2  5    3-way interleaved

  5. The software interface

    To read or write one sector of a disk, the software must:

    1) Specify the address of a buffer in memory.

    2) Specify one of the several disk drives that may be attached to a single controller.

    3) Specify how to position the heads on the drive in question. This is usually called the cylinder portion of the disk address.

    4) Specify which recording head or recording surface is to be used on the selected drive.

    5) Specify which sector of the track selected by the cylinder and surface is to be used.

    6) Specify the data transfer direction.

    In simple disk controllers, each of the above might appear as a separate register in the controller, writable by software. Typically, the transfer begins only when the final item, the transfer direction, is specified; it is the act of loading a value into this last register that actually starts the controller in operation. Thus, the loading of a memory address or a field of the disk address has no effect until the software actually specifies a read or write operation, at which point the values previously loaded are used to initiate a seek on the appropriate drive, await the right sector of the selected track, and then transfer the data.

    When a data transfer is completed, a status register typically displays this fact, along with an indication of any errors detected. In addition, the done condition usually results in an interrupt request.
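    The register layout and start-on-command-write convention might be sketched as follows in C. This is not any real controller; every register name, address layout, and command code here is invented for illustration:

```c
#include <stdint.h>

/* A hypothetical memory-mapped disk controller. On real hardware this
   struct would be overlaid on the controller's register addresses. */
struct disk_ctrl {
    volatile uint32_t buffer_addr;  /* memory buffer address          */
    volatile uint32_t drive;        /* drive select                   */
    volatile uint32_t cylinder;     /* head positioning (seek target) */
    volatile uint32_t surface;      /* recording surface select       */
    volatile uint32_t sector;       /* sector within the track        */
    volatile uint32_t command;      /* loading this starts a transfer */
    volatile uint32_t status;       /* done and error bits            */
};

#define CMD_READ  1
#define CMD_WRITE 2

/* Loading the first five registers has no effect by itself; the final
   store to the command register is what sets the controller going. */
void start_read(struct disk_ctrl *c, uint32_t buf,
                int drive, int cyl, int surf, int sec) {
    c->buffer_addr = buf;
    c->drive       = (uint32_t) drive;
    c->cylinder    = (uint32_t) cyl;
    c->surface     = (uint32_t) surf;
    c->sector      = (uint32_t) sec;
    c->command     = CMD_READ;
}
```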

  6. Added Complexity

    Real disk interfaces are complicated by three factors:

    First, interface busses such as the SCSI bus separate the responsibility for each operation into two halves, as illustrated below:

               __________            __________   ____
    CPU ----->|   DMA    |          |   DISK   |-|disk|
    Memory <=>|__________|          |__________|-|____|

    Here, commands from the CPU to the DMA controller, perhaps a SCSI controller, direct it to transfer data to or from some memory buffer and to forward certain additional commands to the disk controller. All of the usual I/O operations are possible with such systems, but many nonsensical operations are also possible, such as directing the DMA controller to read data from the disk to memory, while directing the disk controller to write data from the memory to disk.

    Second, with a number of disks on one controller, it is frequently useful to start one disk seeking while another reads or writes data. Thus, many disk controllers allow a seek operation to be initiated on some disk without initiating either a read or write. Software that actually uses this feature is common on high performance operating systems, but rare on personal computers.

    Third, many DMA controllers allow "scatter read" and "gather write" operations. That is, the memory buffer need not be a single contiguous buffer, but may be described by a string of base-count pairs, each giving the base address of one part of the buffer and a count of the words or bytes in that part. This is extremely useful when a buffer may span many pages in a virtual memory system -- in that case, it is rare for the physical addresses of the pages making up the buffer to be contiguous.
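    The base-count pairs described above can be pictured as a small data structure. The names below are invented; the shape is the standard one:

```c
#include <stddef.h>

/* A scatter/gather buffer: a list of base-count pairs, each naming one
   physically contiguous piece of the (logically contiguous) buffer. */
struct sg_entry {
    void  *base;   /* physical base address of this piece */
    size_t count;  /* number of bytes in this piece       */
};

/* total transfer length described by an n-entry scatter/gather list */
size_t sg_total(const struct sg_entry *list, int n) {
    size_t total = 0;
    int i;
    for (i = 0; i < n; i++)
        total += list[i].count;
    return total;
}
```

    A transfer spanning two full 4096-byte pages plus a 512-byte fragment, for example, would be described by three entries rather than one contiguous buffer.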

    Finally, many disk controllers allow multiple sectors to be read in sequence. Thus, it is frequently possible to read or write as much as an entire track to or from the disk in one transfer.

  7. Disk Request Queues

    Typically, assuming a simple disk interface, without the added features discussed above, a disk I/O queue will contain requests formatted as follows.

    Disk Request Record:
       buffer address in main memory
       disk address (cylinder, surface, sector)
       transfer direction
       done semaphore

    A user process, or rather, the file system, acting on behalf of a user, might pose a disk request as follows:
       enqueue( request, disk-queue )
       enable( disk_interrupt )
       P( request.semaphore )

    The interrupt service routine might operate as follows:
       disable( disk_interrupt )
       if request <> null then V( request.semaphore )
          -- the above lets a waiting user continue
       if not empty( disk-queue ) then
          request := dequeue( disk-queue )
          buffer-address-register := request.buffer-address
          disk-address-registers := request.disk-address
          disk-command := request.transfer-direction
          enable( disk_interrupt )
       else
          request := null

    Note that, if the queue is empty, the disk interrupt service routine sets request to null and returns, leaving the interrupt disabled but doing nothing to make the controller drop its interrupt request; clearing request ensures that no semaphore is signalled twice. When a user enqueues a new request, the user code enables the interrupt, causing an immediate interrupt and allowing the controller to immediately start the newly requested transfer.

  8. Improving disk performance, caches

    One way to improve the performance of a disk system is to use the queue of pending disk requests as a cache. New read requests are only posted to the queue if no write is pending to the same sector, and new write requests cause previously posted write requests addressed to the same sector to be deleted. The following pseudocode describes this:

       if request.disk-address matches prior.disk-address
          where prior is a request already in the disk-queue
          if request.direction = in and prior.direction = out
             copy prior.buffer into request.buffer
             return -- the read was satisfied from the queued write,
                    -- so it is not enqueued at all
          if request.direction = out and prior.direction = out
             delete prior from disk-queue
       enqueue( request, disk-queue )
       enable( disk_interrupt )
       P( request.semaphore )

    The result of this logic is that there will never be more than one write queued for any sector, and if a write is queued for a sector, there will be no reads queued for that sector. Multiple reads can also be combined with small additions to the data structure.

  9. Improving disk performance, schedulers

    A disk scheduler can markedly improve the performance of disk I/O for any disk that has multiple independent pending I/O requests. The key is to note that, if there are multiple requests in the disk I/O queue, the time taken to process those requests depends on the order in which they are processed. This is because the latency time depends on the order in which they are accessed.

    For example, consider a request queue containing two requests for operations on track 1 and two requests for operations on track 400. Clearly, if these are done in the order (1, 400, 1, 400) the result will be quite a bit slower than if they are done in the order (1, 1, 400, 400). The former requires three long seeks, while the latter requires only one long seek.

    Thus, if disk requests are scheduled in the appropriate order, disk performance can be markedly improved. It is generally impossible to point to a separate procedure in the operating system and say "this is the disk scheduler". Instead, the disk scheduler is broken between the two ends of the disk request queue, and the queue data structure is itself the structure used to organize the scheduler.

    Typically, the queue for each disk is broken into separate pieces for each cylinder, and the enqueue routine on the user side of the disk request queue is made responsible for enqueueing the request in the appropriate cylinder's queue. The scheduling policy is then the responsibility of the disk interrupt service routine.

    Typical scheduling policies are:

    1) FIFO, the base case, is inherently fair to all users, but it does nothing to improve throughput. (A fun exercise is to quantify fairness and then prove that the FIFO policy is optimal).

    2) Shortest seek time first. After processing all requests for the current cylinder, the next cylinder to be processed is the one with pending requests that is closest to the current cylinder. This policy produces good throughput (the number of I/O requests processed per second), but it is unfair: it can cause some users to wait indefinitely. (The optimality of this method depends on how the seek time varies as a function of the number of cylinders moved. For a fun exercise, figure out the class of functions for which this is optimal.)

    3) The bidirectional elevator algorithm. After processing all requests for the current cylinder, the next cylinder to be processed is the one closest to the current cylinder in a fixed direction, for example, towards the center of the disk. Only if no work is pending in any cylinder in that direction does the preferred direction reverse so that future seeks will be in the opposite direction. This technique offers a good throughput, but it discriminates against disk requests near either extreme of the range of valid cylinder numbers.

    4) The unidirectional elevator algorithm. After processing all requests for the current cylinder, the next cylinder to be processed is the one closest to the current cylinder in a fixed direction, for example, towards the center of the disk. Only if no work is pending in any cylinder in that direction does a seek in the opposite direction take place. In that case, the most distant available cylinder with pending work is selected, and the preferred direction remains unchanged. This technique offers equal service to all cylinders while only slightly reducing the throughput of the bidirectional elevator.

    The conflict between throughput and fairness illustrated above is typical of a number of issues on operating systems. Attempts to optimize against a single goal are fairly straightforward, but there is no solution which is simultaneously optimal against both criteria!

  10. Elevator code

    The elevator algorithm is named after the algorithm used by an elevator in a tall building. The following pseudocode implements this algorithm:

       enqueue( request, disk-queue[ request.cylinder ] )
       enable( disk_interrupt )
       P( request.semaphore )

    The interrupt service routine might operate as follows:
       disable( disk_interrupt )
       if request <> null then V( request.semaphore )
          -- the above lets a waiting user continue
       if empty( current-queue ) then
          -- time to move the elevator to a new cylinder
          i := current-cylinder
          repeat
             i := (i + 1) mod cylinders
          until i = current-cylinder or not empty( disk-queue[i] )
          current-queue := disk-queue[i]
          disk-queue[i] := empty
          current-cylinder := i
       end if
       if not empty( current-queue ) then
          request := dequeue( current-queue )
          buffer-address-register := request.buffer-address
          disk-address-registers := request.disk-address
          disk-command := request.transfer-direction
          enable( disk_interrupt )
       else
          request := null

    Note that the disk interrupt service routine uses a separate queue, current-queue, to hold the requests for the cylinder it is currently servicing. This prevents late arrivals at that cylinder from being served until the next visit to that cylinder; if this were not done, and if some process continuously maintained a supply of pending disk requests for one cylinder, no other cylinder would ever be served.

    Another detail to note is that the search for the next cylinder to process need not be a sequential search, as outlined above. If the disk queue array is maintained as a circular sparse array (a linked list structure), so that empty queues are not represented at all, the next queue will always be adjacent to the current one and the computation shown as a search above can be recoded as a simple pointer assignment. (For a pleasant exercise, write the necessary code.)

  11. Device Independence

      Read( ch )                  Arbitrary software
      Write( ch )        <----->  that performs actions
      Read( buf, add )            that look like I/O to
      Write( buf, add )           the user, and may even
                                  involve I/O

      Used for disk files,
      windows, network ports,
      and similar abstractions.

    The advantages of using a standard software interface for every device, whether disk or terminal or printer or network port, have been known since the days of FORTRAN. To achieve this, it is necessary to have a user-level device interface that is sufficiently flexible to operate with both random access and sequential devices, supporting both stream and block modes of operation on any device. The low level Unix interface (read, write, seek) is one good example of such an interface.

    This interface may be considered a virtual device interface, with the input output software serving the function of virtual to physical input output operation translation.

    The virtual device may also be considered as an object, in the sense used in object-oriented programming. In fact, one of the first clear descriptions of how objects in this sense can be implemented came from the field of operating systems, in a paper by Christopher Strachey describing his OS 6 system, from the early 1970's. In this system, he used the following implementation:

           An open file descriptor
     F ----------->|     o--|--------> Read( F, ch )
                   |     o--|--------> Write( F, ch )
     Data specific |        |     Pointers to code
     to this file  |        |     to do the device dependent
     and device    |        |     part of the I/O operation

      Read( F, ch ) <------ the device independent read
      {                     routine called by the user.
         F^.Read( F, ch )
      }                     It simply calls the device
                            specific routine, passing
                            the pointer to the data
                            needed to do the operation.

    This was originally implemented in BCPL (the Basic Combined Programming Language), a predecessor of C based on CPL. Strachey invented CPL, which stood either for Christopher's Programming Language, the Combined Programming Language, or the Cambridge Programming Language, depending on who you talk to.

    In systems with no protection, a file variable can be as simple as a pointer to the descriptor for the open file. This descriptor is typically created by the Open system service. If there is protection, the descriptor must be protected from the user program, so F, the file variable, is more complex than just a pointer. On Unix, for example, F is an integer index into a system maintained array of pointers to descriptors (the array is called the file table of the current process).

    This approach to implementing device independence is strongly related to object oriented programming. A file is an object. The abstract data type "file" is amenable to multiple implementations, each requiring different code. Such a type is called a polymorphic type, and one requirement of the operating system context is that multiple implementations of the abstraction coexist, so that, for example, a user can have one open file refer to a file on disk, while another represents the keyboard.

    In object oriented programming languages, the same approach described here is commonly used to implement polymorphic objects. The data structure representing the object contains pointers to the action routines that implement the abstract operations on the object.
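    The Strachey-style descriptor above translates directly into C as a struct of function pointers. The sketch below is illustrative; all names are invented, and the string-backed "file" stands in for a real device driver:

```c
/* An open file descriptor carrying pointers to the device-specific
   routines, in the style described above. */
struct file {
    void (*read)(struct file *f, char *ch);   /* device specific read  */
    void (*write)(struct file *f, char ch);   /* device specific write */
    void *data;            /* data specific to this file and device    */
};

/* the device independent routines called by the user: each simply
   indirects through the descriptor, passing it along */
void Read(struct file *f, char *ch)  { f->read(f, ch); }
void Write(struct file *f, char ch)  { f->write(f, ch); }

/* one possible implementation: a "file" backed by an in-memory string */
struct strdata { const char *s; int pos; };

void str_read(struct file *f, char *ch) {
    struct strdata *d = (struct strdata *) f->data;
    *ch = d->s[d->pos++];
}
```

    A second implementation, say for a keyboard, would supply different routines behind the same two entry points; their coexistence is exactly the polymorphism described above.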

  12. File Systems

    An open disk file is an example of a virtual device just as much as a disk or a printer interface is, when presented through a device independent interface. Thus, we might see the following structure emerging in a modern file system implementation:

             F --->|     o--|--------> File System 
                   |     o--|--------> Code
                   |        |              ________
                   |     o--|------------>|     o--|------> Disk I/O
                   |        |             |     o--|------> Code
                   |________|             |________|
                 Descriptor of            |        |
                 open disk file           |        | Data needed to
                                          |        | access the disk
                                         Descriptor of
                                         the disk itself.

            Code to read and write an open disk file
            * May convert block size to that of disk.
            * Translates address of block in file
              to disk address on the real disk.
            The latter is Virtual Address Translation

    The descriptor of the open disk file will typically contain buffers for block size conversions, and it will contain the data structure needed to convert addresses from file-relative form to absolute disk addresses.

    The descriptor of the disk device itself will typically contain the information needed for disk scheduling, the queue needed to communicate with the disk interrupt service routines, and other information specific to the actual disk being used.

    The read and write routines for the open disk file can be written in such a way that they call the standard, device independent read and write routines of the disk itself -- in effect, the read and write routines that implement the open disk file can be written as if they were user-level code.
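    The virtual address translation step can be sketched concretely. Assume, purely for illustration, that the open-file descriptor holds a table mapping file-relative block numbers to absolute sector numbers, and that the disk geometry (sectors per track, surfaces per cylinder) is known:

```c
/* A disk address in the form used by the driver of section 5. */
struct disk_address { int cylinder, surface, sector; };

/* Translate a file-relative block number into a disk address.
   file_map[] maps file blocks to absolute sector numbers; the
   geometry arguments describe the drive. All names are invented. */
struct disk_address translate(const int *file_map, int block,
                              int sectors_per_track, int surfaces) {
    int abs_sector = file_map[block];   /* file-relative -> absolute */
    struct disk_address a;
    a.sector    = abs_sector % sectors_per_track;
    abs_sector /= sectors_per_track;
    a.surface   = abs_sector % surfaces;
    a.cylinder  = abs_sector / surfaces;
    return a;
}
```

    Real file systems replace the flat table with index blocks or extent lists, but the translation itself has this shape.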