Chapter 11, Random Access Input/Output

Lecture notes for 22C:116 Introduction to System Software, by Douglas W. Jones, University of Iowa Department of Computer Science

Disks

Although the terminal and tape peripherals discussed in the previous chapters would be sufficient to create a working system, the system would be primitive by modern standards, or for that matter, by the standards of 1970. The files needed by most programs could be stored on magnetic tape, but this would require that an operator spend a considerable amount of time mounting and dismounting tapes, with all directory structure being handled by the people who manage the tape library. In fact, this is the way that files were usually handled in the early 1950's, and even in the 1970's, certain applications continued to be handled this way. The advent of magnetic drum memory, and later, the introduction of magnetic disk memory, changed this.

Drums were introduced in 1952 on the Whirlwind I computer; by 1961, the Manchester University Atlas computer was using a primitive file system which blurred the distinction between drum and tape. The CTSS system at MIT had a fully developed (but not hierarchical) disk-based file system by 1963, and the description of the MULTICS file system published in 1965 had all of the features usually associated with modern file systems.

MULTICS was one of the most influential operating systems of all time! The MULTICS system grew out of Project MAC at MIT in the mid 1960's, and was developed by a consortium that included MIT, Bell Labs and General Electric as original members. Bell dropped out and Honeywell bought the General Electric computer division, so it was Honeywell that ended up commercializing this system. MULTICS was developed to pursue the idea of a computer utility, a bold idea in the mid 1960's, but a commonplace idea starting in the early 1970's. Today's internet service providers all follow models very similar to that anticipated by MULTICS. The MULTICS system achieved its greatest commercial success in the late 1970's and early 1980's, at a time when its academic reputation had already begun to fade into the shadow of UNIX and the growing promise of networking, and the last MULTICS system remaining in operation was shut down in the fall of 2000.

The term file has two common meanings, file variable and disk file. File variables, as discussed in the previous chapters, are the objects to which programs direct input/output requests. These are bound, typically by some operation such as an open system service, to actual input/output devices or disk files. Taken in its narrowest sense, the term disk file refers to a named storage area on a magnetic disk, but there are many other devices which may be used to store such files. Among these are the older magnetic drum technology, fixed-head disks, moving-head disks, floppy disks, magnetic bubble memory, RAM disks, flash EEPROM cards, and others. As a result, the term secondary storage files would more accurately describe what are commonly called disk files.

All of the secondary storage technologies mentioned above have some common characteristics. These technologies allow the storage of data at prices per bit well below that of primary memory, but (for various reasons) the memory technology being used is not appropriate as the main memory from which programs and data may be directly accessed by the central processor. As a result, they are used as secondary storage, with input/output hardware that allows the transfer of blocks of data between primary and secondary storage.

As with the primary memory in which programs are executed, secondary memory devices usually support random access, so it is reasonable to talk about secondary memory addressing using terms similar to those used for primary memory addressing. Unlike tapes and other sequential devices, it is not necessary to read or write from the start. Unlike primary memory, however, the time taken to complete a data transfer usually depends on the address used for that data transfer and the address used for the previous transfer.

Random Access Device Characteristics

Drums

The oldest random-access secondary storage device is the drum memory, although it is worth noting that some early computers actually used drum technology for their primary memory. Drum memories had, as their central element, a rotating drum coated with a continuous layer of magnetic material. The drum was driven by an electric motor, spinning at a fixed rate, typically from 5 to 60 revolutions per second; the rate was typically locked to the frequency of the alternating current powering the motor. Data was read and written using magnetic recording heads very similar to those used for tape recorders. Each head was fixed in position over a circular track of data on the drum, so the number of heads was the same as the number of tracks. Even in the early days of drum recording, drives with 64 or more heads were common. Figure 11.1 shows a typical drum, in schematic form.

            _____________________
 _______   |                     |
|       |  |                     |
| MOTOR |==|                     |===== axis of rotation
|_______|  |                     |
           |_____________________|
            ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^
         ___|_|_|_|_|_|_|_|_|_|_|____
        |                            |
        | head selection electronics |<---- track address
        |____________________________|
                      |
                      |
             data to or from drum

Figure 11.1. A magnetic drum.

The logical characteristics of the drum are the number of heads and tracks (these numbers are the same), the data capacity of each track, and the speed at which the drum rotates, which determines the time to read and write a track or the time to find and read part of a track. On early drums, users were required to read or write an entire track, so such drums were said to allow random access to the track and sequential access within a track.

On later drums, the technology had advanced to the point that more data could be stored on one track than could be conveniently transferred into memory at one time, so tracks were broken into sectors. The actual recording of sectors on a track is similar to the recording of blocks on a tape, with intersector gaps separating consecutive sectors of data. Each sector typically has a prologue that includes an easy to recognize bit pattern that occurs only at the start of a sector, plus, optionally, sector and track number information in order to verify that the sector is the correct one. Following the data part of the sector, there is usually an epilogue containing a checksum for the data on the sector so that recording or playback errors can be detected.
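
As a concrete illustration, the layout of one sector might be described by a structure like the following sketch in C; the field widths shown are illustrative assumptions, not those of any particular drive:

struct sector {
     /* prologue */
     unsigned char  preamble[ 4 ]; /* easy to recognize start pattern */
     unsigned short track;         /* to verify the head position */
     unsigned char  sector;        /* to verify the rotational position */

     /* data */
     unsigned char  data[ 512 ];

     /* epilogue */
     unsigned short checksum;      /* to detect recording or playback errors */
};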

Users wishing to transfer data to or from such a drum must specify the track to be accessed and the sector within that track, and then, before data can be transferred, the user may have to wait for the appropriate sector to pass under the head. The waiting time for such a drum access is called the rotational latency time.

The term 'latency time' is a general term which refers to the time between the initiation of a process and the time when that process actually starts. The process is said to be latent during this interval. Thus, for example, a person has a latent disease between the time of infection and the time the first symptoms appear.
If access to a randomly selected sector is attempted at a randomly selected time, the expected rotational latency will be about one half of the time it takes the drum to make a full rotation.

A multiple sector drum is characterized by the number of tracks, sectors per track, and bytes per sector, as well as the time to transfer one sector and the minimum, maximum, and average latency time. The largest component of the latency time is usually the rotational latency time, but other components can be due to the time it takes to select the head to be used (head selection latency), or to the time it takes to set up the hardware to start a data transfer (hardware setup latency).
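
These parameters determine the expected time to access one randomly selected sector. The following sketch computes this under the simplifying assumption that the only latency components are a fixed setup time plus an average rotational latency of half a revolution; all of the numbers and names used here are hypothetical:

#include <stdio.h>

/* expected time to access one randomly selected sector of a drum */
double avg_access_time( double rev_per_sec, int sectors_per_track,
                        double setup_time )
{
     double rotation = 1.0 / rev_per_sec;             /* one revolution */
     double latency  = setup_time + (rotation / 2.0); /* expected wait */
     double transfer = rotation / sectors_per_track;  /* one sector */
     return latency + transfer;
}

int main()
{
     /* a 30 revolution per second drum, 16 sectors per track, 0.1 ms setup */
     printf( "%f seconds\n", avg_access_time( 30.0, 16, 0.0001 ) );
     return 0;
}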

Note that when each track is broken into multiple sectors, many hardware interfaces are unable to read physically consecutive sectors because the time taken for the intersector gap to rotate past the heads is shorter than the electronic components of the latency time. Specifically, there is insufficient time between when one transfer ends and the start of the next sector for the software to load up the device interface registers to start the next transfer. Of course, this problem could be solved by making the intersector gaps larger, but this would reduce the useful data capacity of each track. Some controllers have two sets of interface registers, one for the next data transfer, and one for the current one; this is expensive, but allows for very short intersector gaps. A more common approach is to change the numbering scheme so that logically consecutive sectors are not physically consecutive; for example, if there are 5 sectors per track, number them 1-4-2-5-3 instead of 1-2-3-4-5. This is called interleaving; the above example is two-way interleaved because consecutive logical sectors occupy every other physical sector. Three-way interleaving would have consecutive logical sectors occupy every third physical sector.
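
The mapping from logical to physical sector numbers can be computed mechanically, as in the following sketch; the simple modular formula used here assumes that k and n share no common factor, as in the 5-sector, 2-way example above, and MAX_SECTORS is an arbitrary limit:

#include <stdio.h>

#define MAX_SECTORS 64

/* print the logical sector number recorded in each physical slot of
   an n-sector track with k-way interleaving */
void interleave( int n, int k )
{
     int map[ MAX_SECTORS ];
     int logical, physical;

     for (logical = 0; logical < n; logical++) {
          physical = (logical * k) % n;
          map[ physical ] = logical + 1;  /* number sectors from 1 */
     }
     for (physical = 0; physical < n; physical++) {
          printf( "%d ", map[ physical ] );
     }
     printf( "\n" );
}

int main()
{
     interleave( 5, 2 );  /* prints 1 4 2 5 3, as in the text */
     return 0;
}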

Fixed Head Disks

Disks originated as a simple approach to reducing the volume occupied by drum memories. Such simple disks are logically indistinguishable from drums, although physically, they consist of a magnetically surfaced platter with heads mounted against it from one or both sides. Such a disk is called a fixed head disk to distinguish it from a later development, the moving head disk.

One result of the disk arrangement is that the tracks near the center of the disk are physically shorter than those near the edge. Since the disk rotates at a constant rate and data is transferred at a constant rate (in bits per second), the central tracks are recorded at a higher linear density (in bits per inch) than the peripheral tracks. It is occasionally suggested that a greater amount of data could be stored on the disk if either the speed of the disk or the data rate were varied depending on which track was being used; unfortunately, disks make good flywheels and it may take many seconds to significantly change their speed, and the only variable frequency clocking that is inexpensive works only at relatively low speeds. As a result, with almost all disk drives made, we pack data at the maximum practical density on the innermost track, and store the data on the outer tracks at progressively lower densities.

Moving Head Disks

Both drums and fixed head disks are expensive because of the requirement of one head (and associated electronics) per recording track. Moving head disks reduce this cost by having only one head per recording surface, with a mechanical head positioning arm used to move the head from track to track as needed. The savings from this change are offset, to some extent, by the introduction of a new form of latency, head positioning latency, sometimes called the seek time (i.e., the time it takes the head to seek to a new radial distance from the disk center). The time it takes to move the heads from one track to another on a recording surface varies as a complicated function of the distance of the move; this can often be approximated by taking the seek time to be proportional to the square root of the distance, so if it takes 0.1 second to move the heads 400 tracks, it would be expected to take about 0.005 seconds to move the heads 1 track.
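
Under this square-root approximation, a sketch of the seek time calculation might look like the following, calibrated by the assumption that a 400-track seek takes 0.1 seconds; real drives add fixed startup and settling times that this simple model ignores:

#include <math.h>
#include <stdio.h>

/* seek time under the square-root model, 0.1 seconds per 400 tracks */
double seek_time( int distance )
{
     return 0.1 * sqrt( distance / 400.0 );
}

int main()
{
     printf( "%f seconds for 1 track\n",    seek_time( 1 ) );   /* 0.005 */
     printf( "%f seconds for 100 tracks\n", seek_time( 100 ) ); /* 0.050 */
     return 0;
}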

Moving head disks vary considerably in performance, from low-speed, low-capacity floppy disks, with only one or two recording surfaces, to high speed units with multiple disk platters all mounted on a common spindle.

The original moving head disk was developed under the direction of Reynold B. Johnson at IBM's San Jose research lab for the IBM 1401 computer; introduced with the 1401 in 1961, it was called the IBM 1405 RAMAC disk, after the Random Access Method of Accounting and Control, the first disk-based business application. This disk had 25 platters, giving 50 recording surfaces; there were 200 data tracks per surface, 5 sectors per track, and 200 6-bit bytes per sector. The drive stood around 6 feet (180 cm) tall, with a glass face giving an edge-on view of all the recording surfaces. One of the worst design features of this early drive was the use of hydraulics to position the heads over the desired data tracks! The hydraulic system occasionally broke down, and if hydraulic fluid sprayed onto the disks, as it sometimes did, the result was invariably destructive.

By the early 1970's, the dominant disk drives used removable disk packs with 11 platters each, giving 20 recording surfaces -- the topmost and bottommost surfaces were unused because they were the most vulnerable to damage when the disk pack was removed from the drive. Each platter was 14 inches in diameter, and as with the original RAMAC drive, there were 200 data tracks per surface. Each track was divided into 20 sectors of 512 bytes each. The drives were typically the size of a top-loading washing machine, complete with a lid that opened to expose the tub and drive spindle inside, where the disk pack could be mounted. At normal operating speed, the disk in this drive spun at 1200 revolutions per minute. By this time, electromagnetic linear motors, called voice-coil drives, were the dominant head positioning technology for high performance drives.

The first generation of 8 inch floppy disks, from about the same era, had one recording surface with 76 recording tracks; each track was divided into 26 sectors of 128 bytes each. These disks were quickly superseded by double sided double density disks, where the double density recording format allowed 256 bytes per sector. Most 8 inch floppy disk drives spun at 360 RPM. This slow speed was necessary because frictional heating of the disk surface inside its envelope could damage the disk at higher speeds. Head positioning on the first floppy disk drives was done with stepping motors that drove helical lead screws to move the head assembly in or out.

An 80 megabyte hard drive from 1987 had 3 platters, giving 6 recording surfaces. Each platter was 5 inches in diameter. By 1994, much higher capacity drives were being made, packing as many as 4 platters (8 recording surfaces), each 3 3/4 inches in diameter, into a drive just under 1 inch thick. Most of these drives were designed to spin at 7200 revolutions per minute. Electromagnetic voice coil drives are still dominant in high end disk drives, although in the 1980's, stepping motors were widely used.

Whether from 1962 or today, these disks all have the same abstract structure, as illustrated in Figure 11.2.

                                   _________
                    ||--------\/  |         |
                    ||  ===============================
 -------------      ||--------/\  |   Hub   |
|    head     |_____||--------\/  |         |
| positioning |_____||  ===============================
|    motor    |     ||--------/\  |         |
 -------------      ||--------\/  |         |
                    ||  ===============================
                    ||--------/\  |_________|
                    arm      heads    | |
                                    __|_|__
                                   |       |
                                   |spindle|
                                   | motor |
                                   |_______|

Figure 11.2. A disk drive.

The number of platters may vary from 1 to a large number. The technology used to position the heads may vary. The actual head technology has evolved considerably over the years, but whether the disk is a RAMAC disk from 1962 or a modern hard drive, the head flies over the surface of the spinning disk, pushed toward the surface by a spring, and held away from the surface against the tension of the spring by a film of air that is pulled between the head and the disk by the motion of the disk surface. This is why disk failures are sometimes called disk crashes. When the air film is interrupted, for example, by a particle of dust, the head crashes into the disk surface, driven by the spring and typically scraping the magnetic recording surface off of the aluminum disk.

The heads on a floppy disk drive press into the non-rigid disk surface. Because the head and disk are in constant contact, the head can grind away the disk if any dust is present. To minimize this risk, floppy disks are always contained in an envelope with an expanded teflon mat lining its inside. This teflon mat is also in contact with the disk, but because teflon is among the softest plastics known, any dirt which gets into the disk will embed itself harmlessly in the teflon.

Whatever the technology, there is usually a single head positioning motor that moves all of the heads radially in or out together. This is because motors and precision bearings are relatively expensive, and the cost of a motor does not depend greatly on the number of heads involved. As a result, the set of tracks which can be accessed without moving the heads is an important logical structure, called the cylinder. If the tracks on each surface are numbered from the outside inwards, each cylinder consists of all of the tracks on different surfaces with the same track number. Thus there are as many cylinders in a disk pack as there are tracks on one surface, and each track is at the intersection of one recording surface and one cylinder.

With each passing year, the number of tracks which can be packed per unit of radial distance on a disk has improved. These improvements have been due to better head positioning mechanisms, smaller spacing between the head and the disk, narrower recording tracks, and many other considerations. Similarly, the number of bits that can be recorded in a unit distance of a given track has been improved as a result of new recording surface materials, new head geometries, and smaller spacing between the head and the disk. The limits of physics will eventually prevent further improvement in magnetic recording, but so far, every time these limits have appeared to be near, some new scheme has emerged to change the rules.

It should be noted that there are two different approaches to arranging the sectors on the tracks of a disk or drum. Hard sectored devices have a physical indicator to mark the start of each sector, while soft sectored devices use the data on the disk track itself to indicate the start of each sector. Typical hard sectoring schemes rely on notches in the edge of one disk platter along with photocells, or they use one disk surface to hold "servo tracks" that are recorded only once, in the factory, and used from then on to control the recording format used on all the other tracks of the device.

Aside from the rotating magnetic media discussed above, there are a number of other technologies which are used for auxiliary or secondary random access storage. For example, bulk core memory was once used in sizes up to about 16 megabytes, and now, bulk semiconductor memories based on Flash EEPROM (electrically erasable programmable read only memory) are used for many applications. Typically, bulk memories are organized for sequential data transfers of data blocks corresponding to disk sectors, at data rates comparable to or sometimes faster than the fastest modern disks, but without head positioning or rotational latency.

Magnetic bubble memories were once considered a promising auxiliary memory technology. In these, the magnetic storage medium is fixed, but a rotating magnetic field forces the magnetically encoded data to circulate along tracks in the medium. Typically, the storage medium is a garnet chip, with tracks defined by iron or nickel patterns on the surface of the chip. The data is read and written using a read/write head that functions like that of a conventional disk, and curiously, because data is circulating around tracks, terms such as rotational latency are as appropriate to bubble memories as they are to disks and drums. The lack of moving parts and extremely high radiation resistance made bubble memory devices particularly appealing in aerospace applications.

Optical disks have great potential for use as very high capacity read-only memories (ROM). The CDROM and DVDROM formats are now well established, and with the advent of recordable drives, "write-once, read-mostly" or WORM memory has become widespread. With these technologies, disks have only a single recording surface, but issues of latency and head motion are identical to those for magnetic disks.

The Hardware Interface

For all of the secondary storage media mentioned above, the basic input/output operations require that a block, sector, or track of data on the device be specified before the operations begin, along with a buffer in primary memory. With both of these specified, the data can be read or written as soon as the record/playback head is over the right track and the correct sector has rotated into position.

Device addresses have different forms on different devices, but on a typical disk drive, they take the form <cylinder, surface, sector>. This form corresponds to a specification of the radius to which the heads should be moved (cylinder), the height (surface), and the angle (sector) of the start of the desired sector in a three dimensional cylindrical coordinate system.

The order of these three coordinates, <cylinder, surface, sector>, is significant! Accessing consecutive sectors of one track is a very simple operation; we just leave the cylinder and surface selection hardware alone and let the disk continue to spin as we access the consecutive sectors. Changing surfaces within a cylinder is inexpensive, since this involves electronically switching from one head to another. The most expensive operation is moving the heads to a different cylinder, since this involves accelerating the head assembly from a standing start and then decelerating it to stop at the desired cylinder.
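
This cost ordering explains the mapping commonly used from a linear sector or block number to a <cylinder, surface, sector> address: consecutive block numbers advance the sector first, then the surface, and only then the cylinder, so that sequential access involves as little head motion as possible. Here is a minimal sketch, assuming hypothetical geometry constants:

#define SECTORS  20   /* sectors per track (hypothetical) */
#define SURFACES  8   /* surfaces per cylinder (hypothetical) */

/* convert a linear block number to a <cylinder, surface, sector> address */
void block_to_address( int block, int *cyl, int *surf, int *sect )
{
     *sect = block % SECTORS;               /* cheapest coordinate to advance */
     *surf = (block / SECTORS) % SURFACES;  /* next cheapest */
     *cyl  = block / (SECTORS * SURFACES);  /* most expensive */
}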

The hardware interfaces to different disk devices vary considerably, but most have device interface registers that contain the fields of one disk address. Once these have been loaded, a "seek" command can usually be issued to cause the drive to position the heads over the desired track and wait until the desired sector is available. On some of the earliest floppy disk drives, the actual data transfer was done one character at a time, through a serial port, but high performance disk drives use direct memory access controllers to move an entire sector without involving the CPU. A typical disk interface is described in Figure 11.3.

                                 _______________
DISKCON control                 |_______|___|_|_|
                                  unused | | | |
   IE = 1 - interrupt enable             | | | IE
   ON = 1 - spindle motor on             | | ON
                                         OP
   OP = 00 - idle
   OP = 01 - seek
   OP = 10 - read
   OP = 11 - write
                                 _______________
DISKSTAT status                 |___|_______|_|_|
                               unused| | | | | |
   READY   - operation complete      | | | | | READY
   ONTRACK - seek completed          | | | | ONTRACK
                                     ERCODE
   ERCODE = 0000 - no error
   ERCODE = 0001 - disk not mounted
   ERCODE = 0010 - disk not up to speed
   ERCODE = 0011 - no such cylinder
   ERCODE = 0100 - no such surface
   ERCODE = 0101 - no such sector
   ERCODE = 0110 - byte count too large
   ERCODE = 1000 - formatting error
   ERCODE = 1001 - parity error
                                 _______________
DISKCYL cylinder select         |msb____________|
                                |____________lsb|
                                 16 bits in/out
                                 _______________
DISKSURF surface select         |_______________|
                                  8 bits in/out
                                 _______________
DISKSECT sector select          |_______________|
                                  8 bits in/out
                                 _______________
DISKADDR DMA address            |msb____________|
                                |_______________|
                                |_______________|
                                |____________lsb|
                                 32 bits in/out
                                 _______________
DISKCNT  DMA byte count         |msb____________|
                                |____________lsb|
                                 16 bits in/out

Figure 11.3. A typical disk interface.

The disk interface given in Figure 11.3 is fairly typical of disk subsystems from the late 1960's to the early 1980's. Older disk subsystems tended to be more complex because disks were only available on the largest of computers, with the most complex peripheral interface architectures, while later disk subsystems grew complex again as microprocessor systems acquired peripheral architectures with secondary and tertiary busses and other added complexity. Thus, access to a disk connected through a SCSI interface (Small Computer System Interface), a USB interface (Universal Serial Bus) or some other modern alternative is quite complex.

The interface given in Figure 11.3 is somewhat simpler than the interface for an IDE disk; the IDE interface requires a total of 13 registers for the disk address and it uses a completely independent register set in the DMA controller to handle main memory addressing.

The status register (DISKSTAT) shown in Figure 11.3 is able to signal a long list of error conditions. When the drive is idle (the DISKCON motor on bit set to zero), it will typically report ERCODE=0010 (disk not up to speed) unless the drive has a removable disk, in which case it may report that there is no disk mounted in the drive. If the drive is turned on by setting the right bit in the control register, the spindle motor will begin to spin, but only after it has been spinning for a while will it come up to operating speed, and only then will the error code field of the status register change.

During normal operation, the head positioning mechanism always tries to keep the heads positioned on the cylinder selected by the cylinder select register (DISKCYL). If the heads are not positioned on the correct cylinder, the on-track bit in the status register will be zero; when the mechanism completes a seek operation to the selected track, this bit will change to 1. If the current disk operation was seek, the ready bit will also be set, but if the current disk operation was read or write, the ready bit will not be set until the data transfer to the selected sector is completed, so long as that sector is legal.

In the event of an attempt to seek a non-existent cylinder, or in the event of an attempt to read or write a non-existent surface or sector, the ready bit will be set with the error code field of the status register indicating the error. Similarly if, for some reason, the disk was bad or the data on the disk was corrupted, the ready bit will be set with an error code indicating what happened.

Floppy disks and most removable disks have, of course, only one platter with at most two recording surfaces, and floppy disks and some older hard drives for personal computers spun slowly enough that DMA hardware was unnecessary. These low-performance disk drives typically had a data register in their interface, and they requested one interrupt per byte or one interrupt per 16 bit word read from or written to the disk.

To read one sector from a high performance disk drive, the software must typically go through the steps outlined in Figure 11.4.

void diskaddress( char * buffer, int len, int cyl, int surf, int sect )
{
     /* set DISKADDR = buffer */
     outp( DISKADDR,     ((int)buffer >> 24) & 0xFF ); /* msb */
     outp( DISKADDR + 1, ((int)buffer >> 16) & 0xFF );
     outp( DISKADDR + 2, ((int)buffer >>  8) & 0xFF );
     outp( DISKADDR + 3, ((int)buffer      ) & 0xFF ); /* lsb */

     /* set DISKCNT = len */
     outp( DISKCNT,      (        len >>  8) & 0xFF ); /* msb */
     outp( DISKCNT + 1,  (        len      ) & 0xFF ); /* lsb */

     /* set DISKCYL  = cyl  */
     outp( DISKCYL,      (        cyl >>  8) & 0xFF ); /* msb */
     outp( DISKCYL + 1,  (        cyl      ) & 0xFF ); /* lsb */

     /* set DISKSURF = surf */
     outp( DISKSURF,     (       surf      ) & 0xFF );

     /* set DISKSECT = sect */
     outp( DISKSECT,     (       sect      ) & 0xFF );
}

void waitdiskdone()
{
     while ((inp(DISKSTAT) & 0x01) == 0) /* poll the READY bit */;
}

void diskread( char * buffer, int len, int cyl, int surf, int sect )
{
     /* first set up all the disk addressing registers */
     diskaddress( buffer, len, cyl, surf, sect );

     /* then issue the read command */
     outp( DISKCON, 0x0A ); /* OP = read, ON = 1, IE = 0 */

     /* finally wait for the operation to complete */
     waitdiskdone();
}

Figure 11.4. Code to read one sector, using the disk interface of Figure 11.3.

The code shown in Figure 11.4 includes generic routines for loading up all of the disk addressing registers and for waiting for a disk transfer complete. The read routine is built from these two, along with one additional access to an interface register to actually command the disk drive to transfer the data. This code is very crude! For example, it does no error checking, and it is perfectly willing to start a data transfer on a drive that has no disk mounted or that has not been spun up to speed.

Some disk hardware requires a separate command to spin the drive up to speed before any head positioning operations are allowed, and then a new command to seek the correct cylinder before it becomes legal to request a read or write. The code given here assumes that a read command will automatically await the disk coming up to speed before seeking, and then automatically seek before attempting a data transfer.

The code given here does nothing to check that the data buffer provided by the user is one sector long. Most drives allow partial sectors to be read or written -- on read, the extra data is simply ignored. On write, zeros are automatically appended to fill one sector. Some disk interfaces will automatically seek successive sectors, writing one sector at a time until the desired number of bytes have been written. In some, this is only allowed within one track, in others, the interface hardware will even increment the surface number to fill successive tracks of a cylinder, and then increment the cylinder number when the cylinder fills. Nowadays, when it is inexpensive to incorporate several small complete computer systems into the interface controller that is built into the disk drive, this kind of complexity is quite common!

Some error conditions are the user's fault, for example, attempting to address a cylinder number that is outside the range of head motion for the drive. On the other hand, while the disk is reading, it may get a transient read error, for example, because a dust particle causes the disk head to bounce briefly away from the surface, or because of electrical interference. When this occurs, a second or third try might successfully read the data that was misread on the first try. Therefore, most disk interface software checks the error status after each operation. For the interface described in Figure 11.3, a retry would be reasonable after a formatting error, since that is the natural error code for reporting a misread prologue on a sector, and after a parity error on a read operation, since that is the natural error code for reporting a checksum error in the data part of a sector. Only a few retries should be allowed, however, since repeated occurrences of these error conditions are likely to indicate that the disk itself is damaged.

The code shown in Figure 11.4 uses a polling loop to await the completion of a disk data transfer. This is as undesirable here as it is with low performance serial input/output ports. Certainly, during the latency interval, while the disk device is seeking the correct cylinder and then while the disk rotates to the correct sector, the processor could be busy doing something useful. During the actual data transfer, however, there may be no opportunity to continue computation.

If the disk has a very high data transfer rate, the DMA transfer may actually bring the CPU to a stop by moving data at the maximum rate the memory can accommodate! For disks without DMA, the transfer rate may be so high that we cannot afford to service interrupts during the transfer, so the transfer may have to be done with interrupts disabled. Only for relatively low data rates can useful computation be overlapped with disk data transfers.

The Device Driver

The software used for disk input/output can be divided into two layers, a file-system layer which provides the user-level file abstraction, including such things as multiple logical files residing on one disk device, and directory services, and the device-driver layer which provides basic disk input/output services for the use of the file system layer. The file system layer will be described in the next chapter, while the low level disk management layer will be described here.

The device driver layer is typically highly dependent on the hardware interface to the disks themselves, and is usually only device independent to the extent that the hardware is. Thus, if the hardware interface allows many different kinds of disks to be connected to one set of disk controller electronics, as is the case with the IDE and SCSI interfaces, issues such as the number of cylinders, the number of recording surfaces, or the sector size must be contained in a device description record.
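
In C, a device description record might be sketched as follows; the exact fields depend on the actual hardware, so all of the fields shown here are assumptions:

struct disk_description {
     int cylinders;         /* cylinders per drive */
     int surfaces;          /* recording surfaces per cylinder */
     int sectors;           /* sectors per track */
     int bytes_per_sector;  /* sector size */
     int base_port;         /* base address of this drive's registers */
};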

Assuming that the hardware allows for the overlap of computation and input/output, the approach to operating the disk shown in Figure 11.4 is quite undesirable because it requires a polling loop immediately after each transfer is initiated. As a result, low level disk input/output is usually done by posting an input/output request with a device and then checking later to see if the request has been satisfied; the device service itself is performed by an interrupt service routine, with a queue holding pending disk requests, as shown in Figure 11.5.

                         __________                       
                  ___   |          |   ___           ___ 
     interface to    |  | Queue of |  |    Interrupt    | interface
     higher level    |==| pending  |==|    service      | registers
     file system. ___|  | requests |  |___ routine   ___|
                        |__________|

Figure 11.5. Basic organization of a disk driver

It should be noted that the term input/output driver refers to the combination of the interrupt service routine, the request queue, and the interface between that queue and higher level software.

What information needs to be saved for a pending disk request? At any instant, there could be any number of input/output requests posted with different devices, and the order in which the devices finish the requests may differ considerably from the order in which they were posted. Therefore, in addition to the basic information about the disk operation to be performed (what cylinder, surface and sector, what memory address, what buffer length, and is it a read or write), the record must contain some way to indicate, to the user, that the operation has been completed.

To wait for a hardware device to finish an operation, the software in Figure 11.4 waited for a bit to be set in a device register. To wait for some operation to be finished by an interrupt service routine, we can use a very similar scheme, waiting for a bit to be set in a variable. We will need one variable per pending disk request. Since the disk software cannot anticipate in advance how many disk requests may be enqueued, the variable used to signal completion of a request must be provided by the user; typically, therefore, we ask the user to pass a pointer to this variable along with the buffer address and other information, as suggested in Figure 11.6.

     One disk request:   __________
                 buffer |__________------> user's buffer
                   size |__________| size of buffer
               cylinder |__________| \
                surface |__________|  > disk address
                 sector |__________| /
              direction |__________| in or out
                   done |__________------> user's done flag
                         __________
                   next |__________| \ queue
                   prev |__________| / management

Figure 11.6. An entry in a disk request queue.

The queue entry outlined in Figure 11.6 includes suggested queue management fields, a pointer to the next item in the queue, and a pointer to the previous item. In fact, we could hide the queue implementation behind a wall of abstraction, but there is little practical choice in the way we manage the queue. Array implementations are not appropriate for queues with an unknown maximum size, and the large size of each queue entry suggests that we ought to try to minimize data copying. Both of these arguments lead us toward a linked list implementation. If the only operations on the queue are enqueue and dequeue, we only really need a next pointer, but as will be seen later, we will have several reasons to delete items that were previously enqueued, and deletion of arbitrary items from a linked list is simplest if the list items have both next and back pointers.
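
In C, the queue entry of Figure 11.6 might be declared as follows; the Pascal code in Figure 11.7 assumes an equivalent record type:

struct disk_request {
     char *buffer;                      /* user's buffer */
     int   size;                        /* size of buffer */
     int   cylinder, surface, sector;   /* disk address */
     int   direction;                   /* in or out */
     int  *done;                        /* user's done flag */
     struct disk_request *next, *prev;  /* queue management */
};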

At an abstract level, our device driver consists of a user interface routine that enqueues items as users request input/output, a queue, and an interrupt service routine that dequeues items when the previous operation is complete. Given this, the basic structure of the device driver must parallel that of an interrupt driven character sequential device driver such as that illustrated in Figure 10.7. For example, whether the queue is a queue of characters or a queue of disk requests, when the queue is emptied by the interrupt service routine, the service routine must disable the device's interrupt request register, and after each enqueue operation, the device must be re-enabled. This leads us to the minimalist disk input/output driver shown in Figure 11.7.

{ statically allocated variables }
var request: disk_request_pointer;  { the current request }
    free: disk_request_pointer;     { the list of free requests }

procedure diskinterrupt;
{ control reaches here whenever there is any interrupt
  request from the disk drive }
begin
     { first, deal with the request that was just finished }
     if request <> nil then begin

          { signal the user that the request is done }
          request^.done^ := 1;

          { return the request record to the free list }
          request^.next := free;
          free := request;
          request := nil;  { so a later interrupt won't re-free it }
     end;

     { second, start a new request }
     if not empty( disk_request_queue ) then begin

          request := dequeue( disk_request_queue );

          diskaddress( { function defined in Figure 11.4 }
               request^.buffer,
               request^.size,
               request^.cylinder,
               request^.surface,
               request^.sector
          );

          if request^.direction = out
               then outp( DISKCON, WRITE_ON_IE )
               else outp( DISKCON, READ_ON_IE );

     end else begin
          { there was no new disk request, so disable interrupts }
          outp( DISKCON, IDLE_ON );
     end;
end { diskinterrupt };

procedure diskread( buffer: char_pointer;
               len, cyl, surf, sect: integer;
               done: int_pointer
             );
     var req: disk_request_pointer;
         temp: integer;
begin
     { get a free disk request record }
     while free = nil do { nothing };
     req := free;
     free := req^.next;

     { setup this request record }
     req^.buffer := buffer;
     req^.size := len;
     req^.cylinder := cyl;
     req^.surface := surf;
     req^.sector := sect;
     req^.direction := in;
     req^.done := done;

     { reset done flag before enqueueing the request, so that the
       interrupt service routine cannot set it before it is reset }
     done^ := 0;

     { enqueue the request }
     enqueue( disk_request_queue, req );

     { enable disk interrupt -- a critical section }
     disableints;
     temp := inp( DISKCON );
     setbit( temp, IE );
     outp( DISKCON, temp );
     enableints;

end { diskread };
    
procedure waitdiskdone( done: int_pointer );
begin
     while done^ = 0 do { nothing };
end { waitdiskdone };

Figure 11.7. A disk input/output driver.

The interrupt service routine shown in Figure 11.7 begins by marking the previous request as having been completed, un-blocking any software that is waiting for that request to complete, and then it deallocates the disk request record for the old request. Once this is done, it starts the new disk operation, if any. The code shown does nothing about errors; in a production disk driver, some errors would have to lead to retrying operations (for example, read checksum errors) while others would have to lead to error reports (for example, illegal sector number errors).

The waitdiskdone routine given in Figure 11.7 differs from the version in Figure 11.4 in an important way. The version in Figure 11.4 waits for any disk operation to complete, while the version in Figure 11.7 waits for a specific disk operation (the disk operation that was started using a particular done pointer) to complete.

The code and data structures shown here are typical of the internal structure of many systems, but one important feature has been omitted. This code assumes only a single disk device, the disk, while in the real world, we are interested in systems with many disk devices. When a system has multiple disks, all of the disk devices could operate using the same interrupt service routine, with some status register indicating which device caused the most recent disk interrupt, or we could have a different interrupt service routine for each device. Typically, if all the disk interfaces are compatible, most of the code will be the same, with just a pointer to a disk description data structure that is changed when an operation is directed to one device or another.

Another detail we have omitted is the possibility of supporting sequential input and output operations to disks. When making a backup of a disk to a tape, or when restoring the contents of a disk drive from tape, it is convenient to do so by using sequential read and write operations, and if the best block and sector sizes for the disk and tape are not matched to each other, character sequential copying might be the easiest. In this case, we need to augment the block-oriented interface for each disk device with a stream interface, in much the same way that we did this with tape interfaces in Figure 9.10. This is usually not provided directly by the device driver, but rather, it is provided by software one or two layers up, still inside the system and below the level of the user program.

Enhanced Disk Performance

When disk input/output requests are queued, it is possible to give the user the illusion of a disk which sometimes runs much faster than the actual physical disk! For example, when the user requests that some sector be read, the queue can be searched to see if a write request for the same sector has already been enqueued. If one is found, the buffer associated with the write request can simply be copied into the buffer associated with the read request, without any actual disk input/output taking place! Similarly, whenever a write request is enqueued, any write requests for the same sector which were already in the queue can be deleted from the queue and marked as having been completed, since whatever they would have written out would have been overwritten when the new request is completed.

The consequence of this strategy is that there will never be more than one pending write request for any particular sector, and if there is a pending write request for a sector, there will never be read requests for the same sector in the queue after it. The effect of this is to greatly enhance the performance of the disk interface for popular disk sectors, while adding a small overhead for access to infrequently used disk sectors.
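
The read side of this strategy might be sketched as follows, in C, reusing the request record sketched earlier; the queue head name is an assumption, a real driver would do this scan in a critical section with the disk interrupt disabled, and a real driver would also check the buffer sizes before copying:

#include <stddef.h>
#include <string.h>

#define READ  0
#define WRITE 1

struct disk_request {                    /* as sketched after Figure 11.6 */
     char *buffer;
     int   size;
     int   cylinder, surface, sector;
     int   direction;                    /* READ or WRITE */
     int  *done;
     struct disk_request *next, *prev;
};

struct disk_request *disk_request_queue; /* head of the pending queue */

/* if a write to the same sector is pending, satisfy a read from its
   buffer; returns nonzero if the read was satisfied without disk I/O */
int try_read_from_queue( struct disk_request *r )
{
     struct disk_request *q;
     for (q = disk_request_queue; q != NULL; q = q->next) {
          if ((q->direction == WRITE)
          &&  (q->cylinder  == r->cylinder)
          &&  (q->surface   == r->surface)
          &&  (q->sector    == r->sector)) {
               memcpy( r->buffer, q->buffer, r->size );
               *(r->done) = 1;  /* signal immediate completion */
               return 1;
          }
     }
     return 0;
}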

For a heavily used disk, such as the disk on a file server, where a large percentage of the data transfers are directed to the same sectors, this approach can give users the impression that the disk operates at main memory speed because the majority of the input/output requests will be processed without actually accessing the disk.

This approach is sometimes called a disk cache, by analogy with cache memory hardware.

A cache memory is a small, expensive, very high speed memory which sits between the central processor and main memory. It holds the values of the most recently referenced words of main memory, and inspects each read request from main memory, intercepting and speeding those which request recently referenced words. An access to a memory address that the cache can help with is called a cache hit, and an access to an address that the cache does not know about is called a cache miss.
When the desired data can be found in main memory (in the queue) and no disk access is required, we call this a disk cache hit, while any disk request which actually causes disk input/output is called a disk cache miss. As with hardware cache performance, the hit ratio (percentage of all input/output requests which result in hits) is used as a measure of the effectiveness of the disk cache.
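
The effect of the hit ratio on performance follows the usual effective access time calculation. Here is a small worked sketch, with made-up numbers:

#include <stdio.h>

/* expected cost of one request, given hit ratio h */
double effective_time( double h, double t_hit, double t_miss )
{
     return (h * t_hit) + ((1.0 - h) * t_miss);
}

int main()
{
     /* 90% hits costing 10 microseconds, misses costing 20 milliseconds */
     printf( "%f seconds\n", effective_time( 0.9, 0.00001, 0.020 ) );
     /* prints 0.002009 -- the cache makes a 20 millisecond disk
        look like a 2 millisecond disk */
     return 0;
}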

A second possibility for improving disk performance arises if disk requests are allowed to be processed in an order other than the one in which they were posted. For example, if the user posts requests for transfers to or from cylinders 1, 2, 1, 2 and 1, in that order, the disk heads will have to seek to a new track 4 or 5 times (depending on the initial cylinder). On the other hand, if these same five requests are performed in the order 1, 1, 1, 2 and 2, only one or two seeks will be required. Similarly, because long seeks take more time than short seeks, access to tracks 1, 2, 3, 4 and 5 will be much faster in order than if they are accessed, for example, in the order 1, 4, 2, 5 and 3. If the disk is interleaved to make it possible to read sectors 1, 2, 3, 4 and 5 within a track with minimal rotational latency, we are likely to be able to read them in one revolution, while reading the same sectors in the order 5, 4, 3, 2 and 1 is likely to require 4 disk revolutions.

This reordering of the request queue is called disk (or drum) scheduling. As illustrated above, disk scheduling can lead to significant performance improvements if the number of requests in the queue is large. If the request queue never contains more than one item, however, scheduling never accomplishes anything! Logically, the disk scheduler can be thought of as an independent program which observes and reorders the disk queue in order to maximize performance. In fact, the disk scheduler is never implemented this way! Instead, we implement the scheduler in the enqueue and dequeue routines for the disk request queue.

Drum and fixed-head disk scheduling are fairly simple, since head motion latency need not be considered. The optimal algorithm for such devices maintains as many first-in first-out request queues as there are sectors per track. When a request is posted, it is placed in the queue selected by the sector number field of its disk address. When an interrupt arrives from the device, the next request is taken from the queue for the sector number which is about to pass under the heads. This results in both maximal utilization of the device, measured, for example, in bits transferred per second, and in a minimum value for the maximum time any particular user will have to wait to see their request satisfied.
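
This per-sector queue scheme might be sketched as follows; the queue array, the dequeue routine and the current_sector function, which would read the rotational position from the device, are all assumptions:

#include <stddef.h>

#define SECTORS 16  /* sectors per track (hypothetical) */

struct disk_request;  /* as in Figure 11.6 */
extern struct disk_request *sector_queue[ SECTORS ];  /* one FIFO per sector */
extern struct disk_request *dequeue( struct disk_request **queue );
extern int current_sector();  /* rotational position, from the hardware */

/* select the pending request for the sector about to pass under the heads */
struct disk_request *next_request()
{
     int i;
     int s = current_sector();

     for (i = 0; i < SECTORS; i++) {
          int candidate = (s + i) % SECTORS;
          if (sector_queue[ candidate ] != NULL) {
               return dequeue( &sector_queue[ candidate ] );
          }
     }
     return NULL;  /* no pending requests */
}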

Once moving heads are allowed, there are a number of possible disk scheduling policies that meet different response requirements. Each policy is stated in terms of which pending disk transfer request should be selected for processing when the time comes to examine the queue of pending requests. To evaluate a queueing policy, we ask three questions: What throughput does the policy allow? What is the average waiting time for a request? And what is the worst-case waiting time for a request?

In fact, the first two questions are really two sides of the same coin. The policy with the best throughput, in requests per second, will have the shortest average wait, in seconds per request. The measure of each of these is the reciprocal of the other. The average and worst-case waiting time can be quite different for some of the disk scheduling policies:
FIFO -- first in, first out

By default, if you are told that a queue is being maintained, this is the usual assumption you make about how this queue is managed. In evaluating alternative scheduling disciplines, one of the first questions you should ask is: is it better than FIFO with regard to our measures of performance, and if so, how much better is it? There is a sense in which the FIFO policy is the fairest. No disk request is favored over any other, and as a result, if the queue length remains constant, the worst case and the average case will be comparable.

Priority -- process the highest priority request first

In some settings, it is appropriate to assign priorities to different disk requests. Processing pending requests in the order of their priority may not improve overall performance, in fact, it may make things worse by forcing a very bad order of request processing, but it may guarantee that key requests are processed in time to meet important deadlines.

SSTF -- Shortest seek-time first

One way to try to optimize disk performance is to try to minimize the latency. Thus, after finishing a disk operation, the one to try next is the one that can be initiated first. Within a cylinder, on completing an operation on sector n, we should preferentially select sector n+1 for processing next, and if no transfer is pending to sector n+1, we should try sector n+2. When moving between cylinders, we should preferentially select the pending transfer in the cylinder nearest the current cylinder before we request a seek to a distant cylinder. In sum, each time the disk interrupt service routine finishes a data transfer, it searches the request queue for that request which can be met with minimum head movement from the current head position.

This policy results in a minimum average waiting time and maximum throughput. If these are the measures we care about, this policy is optimal, but it is far from optimal in its worst-case! If some region of the disk has a high number of pending requests, those requests will receive very fast service, while requests for transfers involving other areas of the disk will receive poor service. In the extreme case, if our application software always keeps at least one disk request pending for one region of the disk, requests for access to some other region of the disk will never be processed! Therefore, the worst case for this policy is an infinite wait!

Cyclic Scan -- Cycle through pending requests in address order.

The cyclic scan policy attempts to gain most of the benefits of the SSTF policy while retaining the essential fairness of the FIFO policy. The idea is to always advance the disk address with each transfer completed, so, after completing a transfer to sector s of cylinder c, the next transfer must be either to a sector s' where s'>s, or to some sector of cylinder c' where c'>c, until there are no pending transfers that meet this criterion; when this is the case, we select the minimum sector number where a transfer is pending in the minimum cylinder number where a transfer is pending and begin again. In effect, we are using the SSTF policy with the constraint that all seeks are in the same direction, or, looked at from another perspective, we are scanning through the pending requests in a cycle.

If there are many pending requests, this policy will frequently outperform FIFO scheduling on all measures of performance. In fact, if the goal is to minimize the worst case performance, this is the best disk scheduling policy, but there is an even better solution for maximizing throughput or minimizing the average waiting time.

Elevator -- Cycle up and down through the pending requests.

The elevator disk scheduling policy was originally invented as a control scheme for elevators, and then applied to disk scheduling. The analogy between elevators and moving-head disks is actually quite strong! Each cylinder of the disk can be imagined as a floor of the building, and the set of pending requests for each cylinder can be imagined as the set of people waiting for the elevator on the corresponding floor of the building. The scheduling policy question can then be restated as: in what order should the elevator visit the floors of the building in order to minimize the average or worst-case waiting time? The cyclic scan algorithm, stated in these terms, says that the elevator stops at each floor where someone wants to get on or off only on the way up, and then takes an express trip to the bottom. With elevators, we don't do this! Instead, we let the elevator stop whenever it passes a floor where people want to get on or off.

Translating back to the disk context, this means that we follow the scanning algorithm until we reach one end of the scan, and then, instead of taking a quick seek to the lowest numbered cylinder where there is a pending request, we simply reverse the direction of the scan, working through pending requests toward successively lower numbered cylinders.

There is no scheduling policy that minimizes both the average and worst-case waiting time, and the elevator policy minimizes neither, but overall, it is usually considered the best choice!

All of these disk scheduling policies can be implemented by using one queue per cylinder on each disk; in this case, the routine to post a disk request simply uses the cylinder number as an index into an array of queues, and the interrupt service routine simply searches this array for the nearest non-empty queue (in the appropriate direction) when deciding what to do next. With modern disks that have hundreds of cylinders, the arrays required by this solution may seem large, but this is not a major problem.

Even if the queues used for requests within each cylinder are simple FIFO queues, the performance will be significantly better if pending requests are spread out over many cylinders, but the best performance requires that the queues within each cylinder be carefully structured to allow all pending requests for that cylinder to be read with minimum latency.

For the cyclic scan and elevator policies, an alternative implementation uses two priority queues, one for all requests inside the current head position, and one for all requests outside the current head position. In the case of the elevator policy, the two queues are sorted in opposite orders, so that the outermost request in the inner queue has the highest priority, and the innermost request in the outer queue has the highest priority. When a new request is posted, the current head position is inspected to decide which queue it goes in. When the disk interrupt service routine finishes dealing with a request, it gets its next request from the same queue as it got its last request from, unless that queue is empty, in which case it switches queues. This approach can be very efficient because the operations on a priority queue can be done in time proportional to the log of the queue size if an appropriate data structure such as a heap-ordered binary tree is used.
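
Here is a sketch of this two-queue scheme, with the priority queue operations left abstract; a heap-ordered tree would give the logarithmic insert and remove times mentioned above. All of the names here are assumptions:

#include <stddef.h>

struct disk_request;  /* as in Figure 11.6 */

/* assumed priority queue operations, smallest key first */
extern void pq_insert( int which, struct disk_request *r, int key );
extern struct disk_request *pq_remove_min( int which );
extern int  pq_empty( int which );
extern int  request_cylinder( struct disk_request *r );

#define INWARD  0  /* requests inside the current head position */
#define OUTWARD 1  /* requests outside the current head position */

static int scan = OUTWARD;  /* direction of the current scan */
static int head_position;   /* cylinder now under the heads, updated
                               as each transfer completes */

void post( struct disk_request *r )
{
     int cyl = request_cylinder( r );
     if (cyl >= head_position) {
          /* key makes the nearest outward request smallest */
          pq_insert( OUTWARD, r, cyl );
     } else {
          /* negated key makes the nearest inward request smallest */
          pq_insert( INWARD, r, -cyl );
     }
}

struct disk_request *next_request()
{
     if (pq_empty( scan )) {
          scan = 1 - scan;  /* reverse the direction of the scan */
     }
     if (pq_empty( scan )) {
          return NULL;      /* nothing pending in either direction */
     }
     return pq_remove_min( scan );
}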

Terminology

The reader should be familiar with the following general terms after reading this section:

random-access devices           drums
moving-head disks               fixed-head disks
floppy disks                    disk packs
recording surfaces              hard sectoring
tracks                          soft sectoring
sectors                         disk addresses
cylinders                       inter-sector gaps
rotational latency              head positioning latency
interleaving                    seek time
low level disk input/output     disk request blocks
disk request queues             critical sections
disk cache                      hit ratio
disk scheduling policy          shortest-seek-time-first
cyclic-scan                     elevator

Exercises

  1. Given a disk with 5 sectors per track, where these are numbered 1-2-3-4-5 when no interleaving is used and 1-4-2-5-3 with two-way interleaving, how would these be numbered with

    a) 3 way interleaving?

    b) 4 way interleaving?

  2. Show the physical order of the logical sector numbers on a 3 way interleaved track of 16 sectors.

  3. Given that it takes 0.1 seconds to move the disk heads 400 tracks and that the number of tracks moved is proportional to the square of the head motion latency for that move, how long will it take to move

    a) 2 tracks?

    b) 3 tracks?

  4. If one sector rotates by the head every 138 microseconds and it takes 100 microseconds for the software to start the next disk transfer after the previous one has finished, and if the head positioning latency is 0.005 times the square root of the number of tracks moved, make a table of the number of sectors which must be skipped before the next transfer can begin as a function of the number of tracks the head must move, from 0 to at least 15 tracks.

  5. a) Which of the errors listed in Figure 11.3 should be considered fatal errors, which might be resolved by retrying a read operation, and which might be resolved by retrying a write operation?

    b) Modify diskread from Figure 11.4 to check for the appropriate errors and retry the operation up to 3 times before giving up and raising a read-error exception.

  6. Following up on the above exercise, modify the code shown in Figure 11.7 to check for appropriate errors and retry any disk operation up to 3 times before giving up.

  7. Give an appropriate disk write routine for Figure 11.7.

  8. Write a procedure which reads the contents of an entire disk and copies it to tape; use the low level disk interface described in Figure 11.7 and the user level tape interface described in Figure 9.9. Assume that the global constants "sectorsize", "sectorspertrack", "surfaces", and "cylinders" are available to describe the disk being copied, and that your procedure will take, as a parameter, a pointer to the opened file descriptor of the tape.

  9. Exactly what damage could occur if an interrupt occurred at the wrong time during the critical section in the diskread routine in Figure 11.7? Between which instructions would this interrupt have to occur, and what observable consequences would result from the damage?

  10. Modify the code shown in Figure 11.7 for "diskread" so that it searches the request queue for other requests for the same disk address and implements the disk cache ideas discussed in this chapter.

  11. Consider a disk system where the hardware allows consecutive sectors of one track to be read, but 4 sectors must be skipped to move 1 cylinder in or out, 6 must be skipped to move 2, 7 must be skipped to move 3, and 8 must be skipped to move 4. If there are 16 sectors per track (numbered 0 to 15), disk addresses are expressed as (track,sector) pairs, and requests arrive to access addresses (1,2), (2,5), (4,10), (5,14), (1,6), and (3,11), how many rotations and fractions of a rotation, at minimum, would it take to process this sequence of requests

    a) if they are processed in the order they are given above?

    b) if they are processed in the best possible order?

  12. For each of the disk scheduling policies, what should be done when a request is posted for the track the head is currently processing? What alternatives are there, if any, and what undesirable consequences would result from the use of these alternatives?

    a) shortest-seek-time-first

    b) cyclic-scan

    c) elevator

  13. How likely is it that a disk scheduler would make much of a performance difference on a small personal computer? Why?

  14. How likely is it that a disk scheduler would make much of a performance difference on a large web server? Why?

Readings

There is a good discussion of disk scheduling in Chapter 7 of

Operating System Concepts, by J. Peterson and A. Silberschatz (Addison Wesley, 1983).

See also section 5.2.2 of

Operating System Elements, A User Perspective, by P. Calingaert (Prentice Hall, 1982).