Lecture 15, Random Access Input/Output

Disks and related secondary memory devices

Part of the notes for 22C:112, Operating Systems
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Disks

Although the keyboard, display and tape peripherals discussed in the previous lectures would be sufficient to create a working system, the system would be primitive by modern standards, or for that matter, by the standards of 1970. The files needed by most programs could be stored on magnetic tape, but this would require that an operator spend a considerable amount of time mounting and dismounting tapes, with all directory structure being handled by the people who manage the tape library. In fact, this is the way that files were usually handled in the early 1950's, and even in the 1970's, certain applications continued to be handled this way. The advent of magnetic drum memory, and later, magnetic disk memory changed this.

Drums were introduced in 1952 on the Whirlwind I computer; by 1961, the Manchester University Atlas computer was using a primitive file system which blurred the distinction between drum and tape. The CTSS system at MIT had a fully developed (but not hierarchical) disk-based file system by 1963, and the description of the Multics file system published in 1965 had all of the features usually associated with modern file systems, including directory hierarchies and a strong security model.

Multics was one of the most influential operating systems of all time! The Multics system grew out of Project MAC at MIT in the mid 1960's, and was developed by a consortium that included MIT, Bell Labs and General Electric as original members. Bell dropped out and Honeywell bought the General Electric computer division, so it was Honeywell that ended up commercializing this system. Multics was developed to pursue the idea of a computer utility, a bold idea in the mid 1960's, but a commonplace idea starting in the early 1970's. Today's internet service providers all follow models very similar to that anticipated by Multics. The Multics system achieved its greatest commercial success in the late 1970's and early 1980's, at a time when its academic reputation had already begun to fade into the shadow of Unix and the growing promise of networking. The last Multics system remaining in operation was shut down in the fall of 2000.

The term file has two common meanings, file variable and disk file. File variables, as discussed in the previous chapters, are the objects to which programs direct input/output requests. These are bound, typically by some operation such as an open system service, to actual input/output devices or disk files. Taken in its narrowest sense, the term disk file refers to a named storage area on a disk, but there are many other devices which may be used to store such files. Among these are the older magnetic drum technology, fixed-head disks, moving-head disks, floppy disks, magnetic bubble memory, RAM disks, flash EEPROM cards, and flash USB drives. As a result, the term secondary storage files would more accurately describe what are commonly called disk files.

All of the secondary storage technologies mentioned above have some common characteristics. These technologies allow the storage of data at prices per bit well below that of primary memory, but (for various reasons) the memory technology being used is not appropriate as the main memory from which programs and data may be directly accessed by the central processor. As a result, they are used as secondary storage, with input/output hardware that allows the transfer of blocks of data between primary and secondary storage.

As with the primary memory in which programs are executed, secondary memory devices usually support random access, so it is reasonable to talk about secondary memory addressing using terms similar to those used for primary memory addressing. Unlike tapes and other sequential devices, it is not necessary to read or write from the start. Unlike primary memory, however, the time taken to complete a data transfer usually depends on the address used for that data transfer and the address used for the previous transfer.

Random Access Device Characteristics

Drums

The oldest random-access secondary storage device is the magnetic drum memory, although it is worth noting that some early computers actually used drum technology for their primary memory. Drums were introduced in the early 1950's and were obsolete by the late 1960's. Drum memories had, as their central element, a rotating drum coated with a continuous layer of magnetic material. The drum was driven by an electric motor, spinning at a fixed rate, typically from 5 to 60 revolutions per second; the rate was typically locked to the frequency of the alternating current powering the motor. Data was read and written using magnetic recording heads very similar to those used for tape recorders. Each head was fixed in position over a circular track of data on the drum, so the number of heads was the same as the number of tracks. Even in the early days of drum recording, drives with 64 or more heads were common. Figure 1 shows a typical drum, in schematic form.

            _____________________
 _______   |                     |
|       |  |                     |
| MOTOR |==|                     |===== axis of rotation
|_______|  |                     |
           |_____________________|
            ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^
         ___|_|_|_|_|_|_|_|_|_|_|____
        |                            |
        | head selection electronics |<---- track address
        |____________________________|
                      |
                      |
             data to or from drum

Figure 1. A magnetic drum.

The logical characteristics of the drum are the number of heads and tracks (these numbers are the same), the data capacity of each track, and the speed at which the drum rotates, which determines the time to read and write a track or the time to find and read part of a track. On early drums, users were required to read or write an entire track, so such drums were said to allow random access to the track and sequential access within a track.

On later drums, the technology had advanced to the point that more data could be stored on one track than could be conveniently transferred into memory at one time, so tracks were broken into sectors. The actual recording of sectors on a track is similar to the recording of blocks on a tape, with intersector gaps separating consecutive sectors of data. Each sector typically has a prologue that includes an easy to recognize bit pattern that occurs only at the start of a sector, plus, optionally, sector and track number information in order to verify that the sector is the correct one. Following the data part of the sector, there is usually an epilogue containing a checksum for the data on the sector so that recording or playback errors can be detected.
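
The following C structure sketches the parts of one recorded sector. The field widths here are invented, and a real device records these fields as serial bit patterns on the track, not as bytes in a structure:

struct sector {
     /* prologue */
     unsigned short sync;      /* easy-to-recognize start-of-sector pattern */
     unsigned char  track;     /* optional check on the head position */
     unsigned char  sector;    /* optional check that this is the right sector */

     /* the data part of the sector */
     unsigned char  data[512];

     /* epilogue */
     unsigned short checksum;  /* detects recording or playback errors */
};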

Users wishing to transfer data to or from such a drum must specify the track to be accessed and the sector within that track, and then, before data can be transferred, the user may have to wait for the appropriate sector to pass under the head. The waiting time for such a drum access is called the rotational latency time.

The term latency time is a general term which refers to the time between the initiation of a process and the time when that process actually starts. The process is said to be latent during this interval. Thus, for example, a person has a latent disease between the time of infection and the time the first symptoms appear.

If access to a randomly selected sector is attempted at a randomly selected time, the expected rotational latency will be about one half of the time it takes the drum to make a full rotation.

A multiple sector drum is characterized by the number of tracks, sectors per track, and bytes per sector, as well as the time to transfer one sector and the minimum, maximum, and average latency time. The largest component of the latency time is usually the rotational latency time, but other components can be due to the time it takes to select the head to be used (head selection latency), or to the time it takes to setup the hardware to start a data transfer (hardware setup latency).
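
For concreteness, the following fragment works out these latency figures for a hypothetical drum; the rotation rate and the fixed overhead are made-up numbers:

#include <stdio.h>

int main()
{
     double revolution = 1.0 / 30.0; /* drum at 30 revolutions per second */
     double overhead = 0.0001;       /* head select plus setup, in seconds */

     printf( "minimum latency %f seconds\n", overhead );
     printf( "maximum latency %f seconds\n", overhead + revolution );
     printf( "average latency %f seconds\n", overhead + revolution / 2.0 );
     return 0;
}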

Note that when each track is broken into multiple sectors, many hardware interfaces are unable to read physically consecutive sectors because the time taken for the intersector gap to rotate past the heads is shorter than the electronic components of the latency time. Specifically, there is insufficient time between when one transfer ends and the start of the next sector for the software to load up the device interface registers to start the next transfer. Of course, this problem could be solved by making the intersector gaps larger, but this would reduce the useful data capacity of each track. Some controllers have two sets of interface registers, one for the next data transfer, and one for the current one; this is expensive, but allows for very short intersector gaps. A more common approach is to change the numbering scheme so that logically consecutive sectors are not physically consecutive; for example, if there are 5 sectors per track, number them 1-4-2-5-3 instead of 1-2-3-4-5. This is called interleaving; the above example is two-way interleaved because consecutive logical sectors occupy every other physical sector. Three-way interleaving would have consecutive logical sectors occupy every third physical sector.
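
The interleaved numbering is easy to compute. The following sketch maps a logical sector number to its physical slot on the track; this simple modular formula only works when the interleave factor and the number of sectors per track have no common divisor, and real controllers often use table-driven formats instead:

/* map a logical sector number to its physical slot on the track;
   with 5 sectors per track and an interleave factor of 2, logical
   sectors 0 1 2 3 4 land in slots 0 2 4 1 3, so the track reads
   1-4-2-5-3 when the sectors are numbered from one, as above */
int physical_slot( int logical, int interleave, int sectors_per_track )
{
     return (logical * interleave) % sectors_per_track;
}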

Fixed Head Disks

Disks originated as a simple approach to reducing the volume occupied by drum memories. Such simple disks are logically indistinguishable from drums, although physically, they consist of a magnetically surfaced platter with heads mounted against it from one or both sides. Such a disk is called a fixed head disk to distinguish it from a later development, the moving head disk.

The evolution from the magnetic drum to the fixed-head magnetic disk parallels the evolution from Thomas Edison's cylindrical phonograph record to Emile Berliner's disk-shaped gramophone record.

One result of the disk arrangement is that the tracks near the center of the disk are physically shorter than those near the edge. Since the disk rotates at a constant rate and data is transferred at a constant rate (in bits per second), the central tracks are recorded at a higher linear density (in bits per inch) than the peripheral tracks. It is occasionally suggested that a greater amount of data could be stored on the disk if either the speed of the disk or the data rate were varied depending on which track was being used; unfortunately, disks make good flywheels and it may take several seconds to significantly change their speed, and variable frequency clocking is historically an expensive option. As a result, disk drives made in the 20th century usually packed a fixed amount of data per degree of rotation instead of a fixed amount per linear measure. The maximal practical linear data density is only achieved at the innermost track of the disk.

Moving Head Disks

Both drums and fixed head disks are expensive because of the requirement of one head (and associated electronics) per recording track. Moving head disks reduce this cost by having only one head per recording surface, with a mechanical head positioning arm used to move the head from track to track as needed. The savings from this change are offset, to some extent, by the introduction of a new form of latency, head positioning latency, sometimes called the seek time (i.e., the time it takes the head to seek to a new radial distance from the disk center). The time it takes to move the heads from one track to another usually varies as a complicated function of the distance of the move; because the head assembly accelerates and then decelerates, the distance covered is roughly a quadratic function of the time, so the seek time grows roughly as the square root of the distance. Thus, if it takes 0.1 second to move the heads 400 tracks, it should take about 0.1 times the square root of 1/400, that is, about 0.005 seconds, to move the heads 1 track.
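
The square-root approximation is easy to capture in code. In the following sketch, the calibration numbers are taken from the example above:

#include <math.h>

/* seek time modeled as proportional to the square root of the
   distance, calibrated so that a 400-track seek takes 0.1 second */
double seek_time( int tracks )
{
     if (tracks <= 0) return 0.0;
     return 0.1 * sqrt( tracks / 400.0 );
}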

Moving head disks vary considerably in performance, from low-speed, low-capacity floppy disks, with only one or two recording surfaces, to high speed units with multiple disk platters all mounted on a common spindle.

The original moving head disk was developed under the direction of Reynold B. Johnson at IBM's San Jose research lab for the IBM 1401 computer; introduced with the 1401 in 1961, it was called the IBM 1405 RAMAC disk, after the Random Access Method of Accounting and Control, the first disk-based business application. This disk had 25 platters, giving 50 recording surfaces; there were 200 data tracks per surface, 5 sectors per track, and 200 6-bit bytes per sector. The drive stood around 6 feet (180 cm) tall, with a glass face giving an edge-on view of all the recording surfaces. One of the worst design features of this early drive was the use of hydraulics to position the heads over the desired data tracks! The hydraulic system occasionally broke down, and if hydraulic fluid sprayed onto the disks, as it sometimes did, the result was invariably destructive.

By the early 1970's, the dominant disk drives used removable disk packs with 11 platters each, giving 20 recording surfaces -- the topmost and bottommost surfaces were unused because they were the most vulnerable to damage when the disk pack was removed from the drive. Each platter was 14 inches in diameter, and as with the original RAMAC drive, there were 200 data tracks per surface. Each track was divided into 20 sectors of 512 bytes each. The drives were typically the size of a top-loading washing machine, complete with a lid that opened to expose the tub and drive spindle inside, where the disk pack could be mounted. At normal operating speed, the disk in this drive spun at 1200 revolutions per minute. By this time, electromagnetic linear motors, called voice-coil drives, were the dominant head positioning technology for high performance drives.

Among hard disk drives of the PC era, the older MFM drives mostly had 17 sectors per track, with from 306 to 1024 cylinders, from 1 to 8 heads, and typically 512 bytes per sector. Hard-sectored ESDI drives typically supported 34 sectors per track, from 1 to 32 heads, but rarely more than 15, and from 614 to 1632 cylinders, and still 512 bytes per sector. Today's IDE and newer drives hide their physical layout behind a layer of firmware in the microcontrollers that run the drive itself, so it is harder to extract this information. The need to maintain compatibility with ancient software leads most newer disks to continue to support 512 byte sectors, but where the interface registers operate in terms of 16383 cylinders, 16 surfaces and 63 sectors per track, the hardware may have only 4 physical surfaces. It is difficult to extract more information from the publicly available data.

The first generation of 8 inch floppy disks, from the early 1970s, had one recording surface with 76 recording tracks; each track was divided into 26 sectors of 128 bytes each. These disks were quickly superseded by double sided, double density disks, where the double density recording format allowed 256 bytes per sector. Most 8 inch floppy disk drives spun at 180 RPM. This slow speed was necessary because frictional heating of the disk surface inside its envelope could damage the disk at higher speeds. Head positioning on the first floppy disk drives was done with stepping motors that drove helical lead screws to move the head assembly in or out.

It is noteworthy that the 3 1/2 inch microfloppy disk is the only magnetic disk format where variable speed recording has been widely used. Because floppy disks have very low angular momentum, they can spin up to the desired speed in about the time it takes to seek halfway across the disk. The Apple SuperDrive for the 3 1/2 inch floppy allowed the disk to spin at variable rates, and this allowed Apple's double-sided format to store 800K on a disk where the rest of the industry's fixed-speed drives only managed 720K.

An 80 megabyte hard drive from 1987 had 3 platters, giving 6 recording surfaces. Each platter was 5 inches in diameter. By 1994, much higher capacity drives were being made, packing as many as 4 platters, or 8 recording surfaces, 3 3/4 inches in diameter, into a drive just under 1 inch thick. Most of these drives were designed to spin at 7200 revolutions per minute. Electromagnetic voice coil drives are still dominant in high end disk drives, although in the 1980's, stepping motors were widely used. Whether from 1962 or today, these disks all have the same abstract structure, as illustrated in Figure 2.

                                   _________
                    ||--------\/  |         |
 _____________      ||  ===============================
|             |     ||--------/\  |   Hub   |
|    head     |_____||--------\/  |         |
| positioning |_____||  ===============================
|    motor    |     ||--------/\  |         |
|_____________|     ||--------\/  |         |
                    ||  ===============================
                    ||--------/\  |_________|
                   head      heads    | |
                positioning         __|_|__
                   arm             |       |
                                   |spindle|
                                   | motor |
                                   |_______|

Figure 2. A disk drive.

The number of platters may vary from 1 to a large number. The technology used to position the heads may vary. The actual head technology has evolved considerably over the years, but whether the disk is a RAMAC disk from 1962 or a modern hard drive, the head flies over the surface of the spinning disk, held against the disk by a spring, and held away from the surface against the tension of the spring by a film of air that is pulled between the head and the disk by the motion of the disk surface. This is why disk failures are sometimes called disk crashes. When the air film is interrupted, for example, by a particle of dust, the head crashes into the disk surface, driven by the spring and typically scraping the magnetic recording surface off of the disk, creating more dust and eventually causing the heads on other recording surfaces to crash.

The heads on a floppy disk drive press into the non-rigid disk surface. Because the head and disk are in constant contact, the head can grind away the disk if any dust is present. To minimize this risk, floppy disks only spin when an actual data transfer is under way, and they are always contained in an envelope with an expanded teflon mat lining its inside. This teflon mat is also in contact with the disk, but because teflon is among the softest plastics known, any dirt which gets into the disk will embed itself harmlessly in the teflon.

Whatever the technology, there is usually a single head positioning motor that moves all of the heads radially in or out together. This is because motors and precision bearings are relatively expensive, and the cost of a motor does not depend greatly on the number of heads involved. As a result, the set of tracks which can be accessed without moving the heads is an important logical structure, called the cylinder. If the tracks on each surface are numbered from the outside inwards, each cylinder consists of all of the tracks on different surfaces with the same track number. Thus there are as many cylinders in a disk pack as there are tracks on one surface, and each track is at the intersection of one recording surface and one cylinder.

With each passing year, the number of tracks which can be packed per unit of radial distance on a disk has improved. These improvements have been due to better head positioning mechanisms, smaller spacing between the head and the disk, narrower recording tracks, and many other considerations. Similarly, the number of bits that can be recorded in a unit distance of a given track has been improved as a result of new recording surface materials, new head geometries, and smaller spacing between the head and the disk. The limits of physics will eventually prevent further improvement in magnetic recording, but so far, every time these limits appear to approach, some new scheme has emerged to change the rules.

There are two different approaches to arranging the sectors on the tracks of a disk or drum. Hard sectored devices have a physical indicator to mark the start of each sector, while soft sectored devices use the data on the disk track itself to indicate the start of each sector. Typical hard sectoring schemes rely on notches in the edge of one disk platter along with photocells, or they use one disk surface to hold "servo tracks" that are recorded only once, in the factory, and used from then on to control the recording format on all the other surfaces of the disk.

RAID

An interesting change in disk technology happened in the 1980's. Prior to this, new developments in disk technology were driven by the needs of high end computer systems. IBM and Control Data Corporation were leaders in the development of new disk technologies for their mainframes, and there was something of a trickle-down phenomenon, in which the disk drives available to the minicomputer marketplace were typically several years behind the technology being used for large systems.

Even floppy disk technology followed this pattern. Floppy disks came to market in 1971 as auxiliary disk drives on large IBM mainframes, used only to hold the data needed to bootstrap the system. By 1974, floppy disk drives were available on the newest minicomputers, as an option, and with the dawn of the microcomputer age in 1976, floppy disks were sought after in that marketplace.

A decade later, in the mid 1980's, the booming microcomputer marketplace had changed this! The demand for hard disk drives was immense! This led to the use of stepping-motor based head positioning on low-performance hard drives, but as soon as the price of the control systems fell far enough, around 1987, voice-coil head positioning mechanisms for hard drives in the microcomputer marketplace could equal the head-positioning latency times of the drives used on high-end machines.

This led quickly to the realization that arrays of these inexpensive disks could easily replace the large disk drives that had been developed to meet the needs of large mainframes. Patterson, Gibson and Katz at the University of California at Berkeley coined the term RAID in 1987 to refer to a Redundant Array of Inexpensive Disks; they recognized that, once you replace an expensive disk drive with, for example, 8 mass-market drives, it is quite simple to add a ninth drive to hold, for example, parity information on the data in the other 8 drives. This redundancy is enough to allow recovery from the failure of any one drive.
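
The parity scheme is nothing more than the exclusive-or of corresponding sectors. The following sketch assumes 8 data drives plus one parity drive, as in the example above:

#define NDRIVES    8    /* data drives, as in the example above */
#define SECTORSIZE 512  /* bytes per sector */

/* the parity sector is the exclusive-or of the corresponding
   sectors on all of the data drives */
void compute_parity( unsigned char data[NDRIVES][SECTORSIZE],
                     unsigned char parity[SECTORSIZE] )
{
     int i, j;
     for (j = 0; j < SECTORSIZE; j++) {
          parity[j] = 0;
          for (i = 0; i < NDRIVES; i++) parity[j] ^= data[i][j];
     }
}

/* if any one drive fails, its sector is the exclusive-or of the
   corresponding sectors of the surviving drives and the parity */
void recover_sector( unsigned char surviving[NDRIVES - 1][SECTORSIZE],
                     unsigned char parity[SECTORSIZE],
                     unsigned char lost[SECTORSIZE] )
{
     int i, j;
     for (j = 0; j < SECTORSIZE; j++) {
          lost[j] = parity[j];
          for (i = 0; i < NDRIVES - 1; i++) lost[j] ^= surviving[i][j];
     }
}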

Today, it is common to find RAID subsystems holding many terabytes attached to the mainframe computers in large corporate data centers. These are packaged in racks the size of upright refrigerators, with on the order of 100 small disk drives. Small RAID systems the size of a classic IBM PC with on the order of 10 drives are appropriate for the kind of server that might handle the work of a single office. Whatever its size, the controller for a RAID may be quite complex; typical RAID controllers look like SCSI devices or network file systems to the host computer, and it is the RAID controller that handles the redundancy of the RAID and handles the job of dividing the data between the different disk drives. Of course, if high performance is not needed, multiple drives connected directly to the host computer may be operated as a RAID using special input-output driver software to do the job of a RAID controller.

Other Secondary Memory Technologies

Aside from the rotating magnetic media discussed above, there are a number of other technologies which are used for auxiliary or secondary random access storage. For example, in the 1970's, when core memory was still common as the primary memory of large computers, typically with sizes under one megabyte, bulk core memory was used as secondary memory in sizes up to about 16 megabytes. Today, bulk semiconductor memories based on Flash EEPROM (electrically erasable programmable read only memory) are used for many applications. Typically, bulk memories are organized for sequential data transfers of data blocks corresponding to disk sectors, at data rates comparable to or sometimes faster than the fastest modern disks, but without head positioning or rotational latency.

Note that Flash EEPROM is a descendant of a long line of older memory technologies. Originally, ROM or Read Only Memory was primarily used for storing very small bits of program code, for example, bootstrap loaders or the microprogram that controlled the CPU's fetch-execute cycle. The contents of a ROM were determined when the memory was manufactured and could not be changed. PROM, or programmable ROM, was memory that could be programmed after manufacturing. Typically, it was built containing all zeros (or sometimes all ones), and programming circuitry could be used to change any zero to a one (or any one to a zero). Once changed, bits could not be changed back. The next technology was EPROM, programmable ROM that could be erased -- that is, returned to its initial state. Most EPROM was erased by exposure to ultraviolet light, so-called UV-EPROM. This was followed by EEPROM, electrically erasable EPROM. This starts to sound like RAM, but erasure was quite slow, taking seconds, and in most cases, the erase operation erased the entire chip. Flash EEPROM is a type of EEPROM that can be erased very quickly. Usually, the erase operation takes under a millisecond (still slow compared to RAM, but not too different from a disk drive), and each block can be erased separately, so you can think of the blocks as if they were disk sectors.

Magnetic bubble memories were once considered a promising auxiliary memory technology. In these, the magnetic storage medium is fixed, but a rotating magnetic field forces the magnetically encoded data to circulate along tracks in the medium. Typically, the storage medium is a garnet chip, with tracks defined by iron or nickel patterns on the surface of the chip. The data is read and written using a read/write head that functions like that of a conventional disk, and curiously, because data is circulating around tracks, terms such as rotational latency are as appropriate to bubble memories as they are to disks and drums. The lack of moving parts and extremely high radiation resistance made bubble memory devices particularly appealing in aerospace applications. Today, magnetic bubble memory technology is almost forgotten, but a new technology with very similar properties is on the threshold of the marketplace, the race-track memory, where electrical current pushes magnetic bubbles along conductive traces.

Optical disks have great potential for use as very high capacity read-only memories (ROM). The audio CDROM and video DVDROM formats are now well established, and with the advent of recordable drives, write-once, read-mostly or WORM memory has become widespread, in the form of CD/R and DVD/R devices. With these technologies, disks usually have only a single recording surface, but issues of latency and head motion are identical to those for magnetic disks.

The CDROM format has one unusual feature. Instead of having multiple recording tracks, like a magnetic disk, each CDROM has a single spiral track, like a phonograph record. This track is broken into sectors, and the head positioning system includes algorithms allowing it to guess what radial distance to look at for a particular sector. Thus, while there is only one track, the latency for a read from CDROM still involves head positioning and rotational latency.

Erasable CD/RW and DVD/RW disks pose some new problems. Erasing such a disk involves heating the recording medium to a temperature high enough to "heal" the damage done to the recording surface by recording data on it. This is typically done with the same laser used for writing data, but running it at a higher power level. Rewriting a sector of such a disk typically involves at least one pass over the sector heating the track to a high enough temperature that it erases, followed by a pass to record new data on that sector. This makes for very slow recording compared to the playback times, and it is most efficiently done if a long sequence of sectors are erased and then rewritten. In contrast, it takes exactly the same amount of time to write a track of a conventional magnetic disk as it takes to read that track. File systems developed for magnetic disks sometimes perform poorly on re-recordable CDROM disks because of the inappropriate assumptions they make about writing speeds.

Another problem with both Flash EEPROM and CD/RW and DVD/RW disks is the limitation on the number of times they may be written. Conventional disks do not wear out as a result of write operations. Conventional disks fail either because of bearing failures or dust particles that manage to squeeze into the drive enclosure or, more frequently, particles that are shed from some part of the interior of the enclosure. In contrast, Flash EEPROM is usually guaranteed to survive only a limited number of erase-write cycles. Limits below 100,000 cycles were common in early flash devices, and for a brief time, flash memories were made that could survive over a million cycles. Unfortunately, the desire for higher capacity has trumped the desire for more erase cycles, so by 2013, most flash memory is only guaranteed for 2000 to 3000 write cycles!

Such memory limits are a minor problem for flash memory cards used in cameras or for backups, but if you use a flash memory for the primary "disk" on your computer, it can be a big problem. Most flash controllers attempt to implement some kind of "wear leveling" algorithm in order to even out the wear on the different sectors of the flash device. This means that, for example, "sector 0" of the flash device does not reference a fixed location on that device, but rather, it references a location that may move each time that sector is erased in order to even out the wear on the different sectors.
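
The essential mechanism behind wear leveling is a level of indirection. The following sketch is deliberately minimal, with invented names, and it omits the free-block bookkeeping a real flash controller must do:

#define NBLOCKS 1024  /* physical blocks in a made-up flash device */

int map[NBLOCKS];     /* logical sector number -> physical block */
int erases[NBLOCKS];  /* erase cycles consumed by each physical block */

int least_worn_free_block(); /* assumed; free-pool bookkeeping not shown */

void rewrite_sector( int logical )
{
     int block = least_worn_free_block();
     /* ... erase the new block and write the new data there ... */
     erases[block] = erases[block] + 1;
     map[logical] = block; /* the old block returns to the free pool */
}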

The Hardware Interface

For all of the secondary storage media mentioned above, the basic input/output operations require that a block, sector, or track of data on the device be specified before the operations begin, along with a buffer in primary memory. With all of these specified, the data can be read or written, perhaps with some latency while waiting for the record/playback head to reach the right track and waiting for the correct sector to rotate into position.

Device addresses have different forms on different devices, but typically, they take the form <cylinder, surface, sector> tuples. This form corresponds to a specification of the radius to which the heads should be moved, the cylinder, the height in the stack of surfaces to attend to, the surface, and the angle at which recording or playback is to start, the sector. Thus, the <cylinder, surface, sector> tuple can be thought of as the coordinates, in 3 dimensions, of the start of the data being recorded, using a cylindrical coordinate system.

The order of these three coordinates, <cylinder, surface, sector>, is significant! Reading consecutive sectors of one track is a very simple operation; we just leave the cylinder and surface selection hardware alone and let the disk continue to spin as we access the consecutive sectors. Changing surfaces within a cylinder is inexpensive, since this involves electronically switching from one head to another. The most expensive operation is moving the heads to a different cylinder, since this involves accelerating the head assembly from a standing start and then decelerating to stop at the desired cylinder.
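
This cost ordering explains why consecutively numbered disk blocks are conventionally laid out sector-first, then surface, then cylinder. The following sketch shows the usual mapping from a linear block number to a <cylinder, surface, sector> tuple, for a made-up geometry:

#define SECTORS  20  /* sectors per track -- a made-up geometry */
#define SURFACES 20  /* recording surfaces in the pack */

void block_to_chs( int block, int * cyl, int * surf, int * sect )
{
     *sect = block % SECTORS;              /* cheapest to advance */
     *surf = (block / SECTORS) % SURFACES; /* next cheapest */
     *cyl  = block / (SECTORS * SURFACES); /* most expensive */
}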

An Example Disk Interface

The hardware interfaces to different disk devices vary considerably, from the very low level interfaces of the disk hardware itself to the standardized interfaces of SCSI disks, to the very abstract interfaces of USB devices. Access to a USB disk is really access to a disk over a computer network, where the USB wires connect the network. All USB devices include small processors in addition to any input/output or storage devices. Many modern SCSI disks also include local processing, so access to the device that appears to be direct is actually indirectly through a small computer system built into the disk device.

For the sake of this discussion, we will ignore disk devices with built-in processors and focus on disks more typical of the early 1970s, before such complexity was common. Where disk devices incorporate processors, the processors typically must deal with the primitive low-level hardware through an interface such as is described here.

At a low level, the hardware for a typical disk typically has device interface registers that contain the fields of one disk address. Once these have been loaded, a seek command can usually be issued to cause the drive to position the heads over the desired track and wait until the desired sector is available. On some of the earliest floppy disk drives, the actual data transfer was done one character at a time, through a serial port, but high performance disk drives use direct memory access controllers to move an entire sector without involving the CPU.

The interface for commercially available drives that interface to machines such as the classic IBM PC is complicated by the complete separation of the drive functions from the direct memory access data transfer functions. On the PC, device registers 000 to 00F (hexadecimal) are used to communicate with the DMA controller. This special controller supports 4 DMA channels to which devices may be attached. To perform a disk read or write operation, the buffer address and size are loaded in the appropriate DMA channel control registers and a read or write command is issued to the DMA controller, and then, independently, the cylinder, surface and sector numbers are loaded in the control registers of the disk drive itself. Finally, the disk drive is told to read or write. Once all of this is done, the DMA controller and drive cooperate to actually move the desired data between main memory and the disk.

We will avoid this complexity by using a fictional disk interface more in the spirit of minicomputer disk interfaces of the early 1970's than in the spirit of today's high performance drives. Disk subsystems older than this tended to be more complex because they were only available on the largest computers, with their complex peripheral interface architectures, while later disk subsystems grew complex again as microprocessor-based controllers began to incorporate features originally found only on the largest computers. This example interface is described in Figure 3.

DISKCON Disk Control Register
This one-byte write-only register holds the command to the disk drive. Writing a new value in this register causes the disk controller to begin a new operation on the disk; when that operation is completed, the disk interface will indicate operation complete and (if interrupts are enabled) request an interrupt.
      _______________
     |_______|___|_|_|
              | | | |
              | | | 0 -- interrupt disabled
IE            | | | 1 -- interrupt enabled
              | | 0 -- disk spindle motor power off
ON            | | 1 -- disk spindle motor power on
IDLE          0 0 -- disk idle
SEEK          0 1 -- seek specified cylinder, no transfer requested
READ          1 0 -- read from disk to main memory
WRITE         1 1 -- write from main memory to disk

DISKSTAT Disk Status Register
This one-byte read-only register holds the current disk status.
      _______________
     |___|_______|_|_|
          | | | | | |
          | | | | | 0 -- operation in progress
READY     | | | | | 1 -- operation complete
          | | | | 0 -- head not currently on correct cylinder
ONCYL     | | | | 1 -- head currently on correct cylinder (seek complete)
          0 0 0 0 -- no error
          0 0 0 1 -- disk not mounted
          0 0 1 0 -- disk not up to speed
          0 0 1 1 -- no such cylinder
          0 1 0 0 -- no such surface
          0 1 0 1 -- no such sector
          0 1 1 0 -- byte count larger than sector size
          1 0 0 0 -- formatting error
          1 0 0 1 -- parity error
ERRMSK    1 1 1 1 -- (never used by hardware)

Warning: The following registers should never be changed while a read or write operation is in progress!

DISKCYL Disk Cylinder Select Register
This two-byte write-only register holds the requested cylinder number.

DISKSURF Disk Surface Select Register
This one-byte write-only register holds the requested surface number.

DISKSECT Disk Sector Select Register
This one-byte write-only register holds the requested sector number.

DISKADDR Disk Memory Address Register
This four-byte write-only register holds the memory address to be used for the next direct-memory-access data transfer.

DISKCNT Disk Byte Count Register
This two-byte write-only register holds the byte-count for the next direct-memory-access data transfer.

Figure 3. A typical disk interface.

For this discussion, we will ignore the fact that some of the device registers shown in Figure 3 are 16 or even 32 bits in size, while many input-output architectures limit the register size for the interface registers to 16 or even 8 bits. In a real system, we would have to break the larger of these registers into two or more pieces, for example, breaking the 32-bit DISKADDR register into DISKADDRH and DISKADDRL, 16-bit registers holding the high and low halves of the 32-bit quantity.
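
The split is trivial in code. The following sketch assumes the hypothetical register names DISKADDRH and DISKADDRL mentioned above, and an outp() routine that transfers 16 bits at a time:

/* load the 32-bit disk memory address through two hypothetical
   16-bit registers, DISKADDRH and DISKADDRL */
void set_diskaddr( unsigned long addr )
{
     outp( DISKADDRH, (int)((addr >> 16) & 0xFFFF) ); /* high half */
     outp( DISKADDRL, (int)(addr & 0xFFFF) );         /* low half */
}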

The interface given in Figure 3 is somewhat simpler than the interface for an IDE disk; the IDE interface requires a total of 13 registers for the disk address and it uses a completely independent register set in the DMA controller to handle main memory addressing.

The status register DISKSTAT shown in Figure 3 is able to signal a long list of error conditions. When the drive is idle (the DISKCON motor on bit is set to zero), it will typically report disk not up to speed unless the drive has a removable disk, in which case it may report disk not mounted. If the drive is turned on by setting the ON bit in the control register, the spindle motor will begin to spin, but only after it has been spinning for a while will it come up to operating speed, and only then will the error code field of the status register change.

During normal operation, the head positioning mechanism always tries to keep the heads positioned on the cylinder selected by the cylinder select register DISKCYL. If the heads are not positioned on the correct cylinder, the ONCYL bit in the status register will be zero; when the mechanism completes a seek operation to the selected track, this bit will change to 1. If the current disk operation was seek, the ready bit will also be set, but if the current disk operation was read or write, the ready bit will not be set until the data transfer to the selected sector is completed, so long as that sector is legal.

In the event of an attempt to seek a non-existent cylinder, or in the event of an attempt to read or write a non-existent surface or sector, the ready bit will be set as soon as the error is discovered, and the error code field of the status register will indicate the error. Similarly if, for some reason, the disk was bad or the data on the disk was corrupted, the ready bit will be set with an error code indicating what happened.

Floppy disks and most other removable disks have, of course, only one platter with at most two recording surfaces, and floppy disks and some older hard drives for personal computers spun slowly enough that DMA hardware was unnecessary. These low-performance disk drives typically had a data register in their interface, and they requested one interrupt per byte or one interrupt per 16 bit word read from or written to the disk.

A Minimal Disk Driver

The software used for disk input/output can be divided into two layers, a file-system layer which provides the user-level file abstraction, including such things as multiple logical files residing on one disk device, and directory services, and the device-driver layer which provides basic disk input/output services for the use of the file system layer. The file system layer will be described in the next chapter, while the low level disk management layer will be described here.

The device driver layer is typically highly dependent on the hardware interface to the disks themselves, and is usually only device independent to the extent that the hardware is. Thus, if the hardware interface allows many different kinds of disks to be connected to one set of disk controller electronics, as is the case with the IDE and SCSI interfaces, issues such as the number of cylinders, the number of recording surfaces, or the sector size must be contained in a device description record.

The driver for a high performance disk drive with the interface described in Figure 3 would typically follow the outline shown in Figure 4, although this figure gives no code for the write operation.

void diskaddress( char * buffer, int len, int cyl, int surf, int sect )
{
     outp( DISKADDR, (int)buffer );
     outp( DISKCNT,  len );
     outp( DISKCYL,  cyl );
     outp( DISKSURF, surf );
     outp( DISKSECT, sect );
}

void waitdiskdone()
{
     while ((inp( DISKSTAT ) & 0x01) == 0) /* nothing */ ;
}

void diskread( char * buffer, int len, int cyl, int surf, int sect )
{
     /* first set up all the disk addressing registers */
     diskaddress( buffer, len, cyl, surf, sect );

     /* then issue the read command */
     outp( DISKCON, ON | READ );

     /* finally wait for the operation to complete */
     waitdiskdone();
}

Figure 4. A simple polled driver for the disk interface of Figure 3.

The code shown in Figure 4 includes generic routines for loading the disk addressing registers and for waiting for a disk transfer to complete. The read routine is built from these two, along with one additional access to an interface register to actually command the disk drive to move the data. This code is very crude! It does no error checking, and it is perfectly willing to start a data transfer on a drive that has no disk mounted or that is not up to speed.

Some disk hardware requires a separate command to spin the drive up to speed before any head positioning operations are allowed, and then a new command to seek the correct cylinder before it becomes legal to request a read or write. The code given here assumes that a read command will automatically await the disk coming up to speed before seeking, and then automatically seek before attempting a data transfer.

The code given here does nothing to check that the data buffer provided by the user is one sector long. Most drives allow partial sectors to be read or written -- on read, the extra data is simply ignored. On write, zeros are automatically appended to fill one sector. Some disk interfaces will automatically seek successive sectors, writing one sector at a time until the desired number of bytes have been written. In some, this is only allowed within one track; in others, the interface hardware will even increment the surface number to fill successive tracks of a cylinder, and then increment the cylinder number when the cylinder fills. Nowadays, when it is inexpensive to incorporate several small complete computer systems into the interface controller that is built into the disk drive, this kind of complexity is quite common!

Some error conditions are the user's fault, for example, attempting to address a cylinder number that is outside the range of head motion for the drive. On the other hand, while the disk is reading, it may get a transient read error, for example, because a dust particle causes the disk head to bounce briefly away from the surface, or because of electrical interference. When this occurs, a second or third try might successfully read the data that was misread on the first try. Therefore, most disk interface software checks the error status after each operation. For the interface described in Figure 3, a retry would be reasonable after a format error, since this is a reasonable error indicator to use for reporting a misread prologue on a sector, and a retry would be reasonable after a parity error reported on a read operation, since this would be a reasonable error indicator to use to report checksum errors on the data part of a sector. Only a few retries should be allowed, however, since repeated occurrences of these error conditions are likely to indicate that the disk itself is damaged.
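
The following sketch wraps the diskread routine of Figure 4 in such a retry loop. It assumes that the 4-bit error code of Figure 3 occupies the four bits of DISKSTAT just above the ONCYL bit; the number of retries allowed is an arbitrary choice:

#define ERRMASK   0x3C        /* error code field, bits 2-5 of DISKSTAT */
#define FORMATERR (0x8 << 2)  /* formatting error code, in place */
#define PARITYERR (0x9 << 2)  /* parity error code, in place */
#define MAXTRIES  3           /* an arbitrary limit */

int diskread_retry( char * buffer, int len, int cyl, int surf, int sect )
{
     int tries;
     int err = 0;

     for (tries = 0; tries < MAXTRIES; tries++) {
          diskread( buffer, len, cyl, surf, sect ); /* from Figure 4 */
          err = inp( DISKSTAT ) & ERRMASK;
          if (err == 0) return 0;                   /* success */
          if ((err != FORMATERR) && (err != PARITYERR))
               break;                               /* not transient; give up */
     }
     return err; /* nonzero: a persistent error or a damaged disk */
}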

The code shown in Figure 4 uses a polling loop to await the completion of a disk data transfer. This is as undesirable here as it is with low performance serial input/output ports. Certainly, during the latency interval, while the disk device is seeking the correct cylinder and then while the disk rotates to the correct sector, the processor could be busy doing something useful. During the actual data transfer, however, there may be no opportunity to continue computation.

If the disk has a very high data transfer rate, the DMA transfer may actually bring the CPU to a stop by moving data at the maximum rate the memory can accommodate! For disks without DMA, the transfer rate may be so high that we cannot afford to service interrupts during the transfer, so the transfer may have to be done with interrupts disabled.

Higher Performance Disk Drivers

Assuming that the hardware allows for the overlap of computation and input/output, the approach to operating the disk shown in Figure 4 is quite undesirable because it requires a polling loop immediately after each transfer is initiated. As a result, low level disk input/output is usually done by posting an input/output request with a device and then checking later to see if the request has been satisfied; the device service itself is performed by an interrupt service routine, with a queue holding pending disk requests, as shown in Figure 5.

                         __________                       
                  ___   |          |   ___           ___ 
     interface to    |  | Queue of |  |    Interrupt    | interface
     higher level    |==| pending  |==|    service      | registers
     file system. ___|  | requests |  |___ routine   ___|
                        |__________|

Figure 5. Basic organization of a disk driver

It should be noted that the term input/output driver refers to the combination of the interrupt service routine, the request queue, and the interface between that queue and higher level software.

What information needs to be saved for a pending disk request? At any instant, there could be any number of input/output requests posted with different devices, and the order in which the devices finish the requests may differ considerably from the order in which they were posted. Therefore, in addition to the basic information about the disk operation to be performed (what cylinder, surface and sector, what memory address, what buffer length, and is it a read or write), the record must contain some way to indicate, to the user, that the operation has been completed.

To wait for a hardware device to finish an operation, the software waits for a bit to be set in a device register. To wait for some operation to be finished by an interrupt service routine, we can use a very similar scheme, waiting for a bit to be set in a variable. We will need one variable per pending disk request. Since the disk software cannot anticipate in advance how many disk requests may be enqueued, the variable used to signal completion of a request must be provided by the user; typically, therefore, we ask the user to pass a pointer to this variable along with the buffer address and other information, as suggested in Figure 6.

     One disk request:   __________
                 buffer |__________------> user's buffer
                   size |__________| size of buffer
               cylinder |__________| \
                surface |__________|  > disk address
                 sector |__________| /
              direction |__________| in or out
                   done |__________------> user's done flag
                         __________
		   next |__________| \ queue
		   prev |__________| / management

Figure 6. An entry in a disk request queue.

The queue entry outlined in Figure 6 includes suggested queue management fields, a pointer to the next item in the queue, and a pointer to the previous item. In fact, we could hide the queue implementation behind a wall of abstraction, but there is little practical choice in the way we manage the queue. Array implementations are not appropriate for queues with an unknown maximum size, and the large size of each queue entry suggests that we ought to try to minimize data copying. Both of these arguments lead us toward a linked list implementation. If the only operations on the queue are enqueue and dequeue, we only really need a next pointer, but as will be seen later, we will have several reasons to delete items that were previously enqueued, and deletion of arbitrary items from a linked list is simplest if the list items have both next and back pointers.

At an abstract level, our device driver consists of a user interface routine that enqueues items as users request input/output, a queue, and an interrupt service routine that dequeues items when the previous operation is complete. Given this, the basic structure of the device driver must parallel that of an interrupt driven character sequential device driver such as that illustrated in Figure 10.7. For example, whether the queue is a queue of characters or a queue of disk requests, when the queue is emptied by the interrupt service routine, the service routine must disable the device's interrupt request register, and after each enqueue operation, the device must be re-enabled. This leads us to the minimalist disk input/output driver shown in Figure 7.

/* statically allocated variables */
struct diskrequest * request;  /* the current request */
struct diskrequest * free;     /* the list of free requests */

void diskinterrupt() {
     /* control reaches here whenever there is any interrupt
        request from the disk drive */

     /* first, deal with the request that was just finished */
     if (request != NULL) {

          /* signal the user that the request is done */
          *(request->done) = 1;

          /* return the request record to the free list */
          request->next = free;
          free = request;
     }

     /* second, start a new request */
     if (!empty( disk_request_queue )) {

          request = dequeue( disk_request_queue );

          diskaddress( /* function defined in Figure 4 */
               request->buffer,
               request->size,
               request->cylinder,
               request->surface,
               request->sector
          );

          if (request->direction == out) {
               outp( DISKCON, WRITE + ON + IE );
          } else {
               outp( DISKCON, READ + ON + IE );
          }
     } else {
          /* there was no new disk request, so record that the disk
             is idle and disable its interrupt */
          request = NULL;
          outp( DISKCON, IDLE + ON );
     }
}

void diskread( char * buffer,
               int len, int cyl, int surf, int sect,
               int * done
             ) {
     struct diskrequest * req;

     /* get a free disk request record -- rarely needs to wait; in
        a production driver, this manipulation of the free list
        would also need protection from the interrupt routine */
     while (free == NULL) { /* nothing */ }
     req = free;
     free = req->next;

     /* setup this request record */
     req->buffer = buffer;
     req->size = len;
     req->cylinder = cyl;
     req->surface = surf;
     req->sector = sect;
     req->direction = in;
     req->done = done;
     /* reset done flag, so interrupt routine can set it */
     *done = 0;

     /* enqueue the request */
     enqueue( disk_request_queue, req );

     /* enable the disk interrupt -- a critical section.  Note that
        DISKCON is write-only (Figure 3), so it cannot be read back,
        updated and rewritten.  Instead, if the disk is idle (no
        current request), rewrite it with interrupts enabled; the
        pending operation-complete condition will then cause an
        interrupt that starts this request.  If a transfer is in
        progress, its completion interrupt will start this request. */
     disableints;
     if (request == NULL) outp( DISKCON, IDLE + ON + IE );
     enableints;
} /* diskread */
    
void waitdiskdone( int * done ) {
     while (*done == 0) { /* nothing */ };
}

Figure 7. A disk input/output driver.

The interrupt service routine shown in Figure 7 begins by marking the previous request as having been completed, un-blocking any software that is waiting for that request to complete, and then it deallocates the disk request record for the old request. Once this is done, it starts the new disk operation, if any. The code shown does nothing about errors; in a production disk driver, some errors would have to lead to retrying operations (for example, read checksum errors) while others would have to lead to error reports (for example, illegal sector number errors).

The waitdiskdone routine given in Figure 7 differs from the version in Figure 4 in an important way. The version in Figure 4 waits for any disk operation to complete, while the version in Figure 7 waits for a specific disk operation (the disk operation that was started using a particular done pointer) to complete.

The code and data structures shown here are typical of the internal structure of many systems, but one important feature has been omitted. This code assumes only a single disk device, the disk, while in the real world, we are interested in systems with many disk devices. When a system has multiple disks, it is common to find that all of the disk devices operate using the same interrupt service routine, with some status register indicating which device caused the most recent disk interrupt. This is particularly common with SCSI or similar bus architectures that allow one controller to handle many disk drives. Even if each disk's interrupt requests are vectored to separate drivers, if the disk interfaces are compatible, most of the code of those drivers will be the same, with just a pointer to a disk description data structure that is changed when an operation is directed to one device or another.

Another detail we have omitted is the possibility of supporting sequential input and output operations to disks. When making a backup of a disk to a tape, or when restoring the contents of a disk drive from tape, it is convenient to do so by using sequential read and write operations, and if the best block and sector sizes for the disk and tape are not matched to each other, character sequential copying might be the easiest. In this case, we need to augment the block-oriented interface for each disk device with a stream interface, in much the same way that we did this with tape interfaces. This is usually not provided directly by the device driver, but rather, it is provided by software one or two layers up, still inside the system and below the level of the user program.

Enhanced Disk Performance

When disk input/output requests are queued, it is possible to give the user the illusion of a disk which sometimes runs much faster than the actual physical disk! For example, when the user requests that some sector be read, the queue can be searched to see if a write request for the same sector has already been enqueued. If one is found, the buffer associated with the write request can simply be copied into the buffer associated with the read request, without any actual disk input/output taking place! Similarly, whenever a write request is enqueued, any write requests for the same sector which were already in the queue can be deleted from the queue and marked as having been completed, since whatever they would have written out will be irrelevant after the new write operation is done.

The consequence of this strategy is that there will never be more than one pending write request for any particular sector, and if there is a pending write request for a sector, there will never be read requests for the same sector in the queue after it. The effect of this is to greatly enhance the performance of the disk interface for popular disk sectors, while adding a small overhead for access to infrequently used disk sectors.
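
The search for a pending write is a simple walk down the request queue. In the following sketch, queue_head (the head of the pending request queue) is an invented name, and the request records are those of Figure 6:

struct diskrequest * queue_head; /* head of the pending request queue */

/* search the pending queue for a write request addressed to the
   given sector; return it, or NULL if there is none */
struct diskrequest * find_pending_write( int cyl, int surf, int sect )
{
     struct diskrequest * p;
     for (p = queue_head; p != NULL; p = p->next) {
          if ((p->direction == out)
          &&  (p->cylinder == cyl)
          &&  (p->surface == surf)
          &&  (p->sector == sect)) return p;
     }
     return NULL;
}

A read request that finds such a pending write can be satisfied immediately by copying from the write request's buffer into the reader's buffer and setting the reader's done flag, with no disk transfer at all.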

For a heavily used disk, such as the disk on a file server, where a large percentage of the data transfers are directed to the same sectors, this approach can give users the impression that the disk operates at main memory speed because the majority of the input/output requests will be processed without actually accessing the disk.

This approach is sometimes called a disk cache, by analogy with cache memory hardware.

Cache is pronounced the same way as cash; it comes from the French verb cacher meaning 'to hide'. Generally, a cache is a secure or hidden storage place.

A cache memory is a small, expensive, very high speed memory which sits between the central processor and main memory. It holds the values of the most recently referenced words of main memory, and inspects each read request from main memory, intercepting and speeding those which request recently referenced words. An access to a memory address that the cache can help with is called a cache hit, and an access to an address that the cache does not know about is called a cache miss.

When the desired data can be found in main memory (in the queue) and no disk access is required, we call this a disk cache hit, while any disk request which actually causes disk input/output is called a disk cache miss. As with hardware cache performance, the hit ratio (percentage of all requests which result in hits) is used as a measure of the effectiveness of the disk cache.

A second possibility for improving disk performance arises if disk requests are allowed to be processed in an order other than the one in which they were posted. For example, if the user posts requests for transfers to or from cylinders 1, 2, 1, 2 and 1, in that order, the disk heads will have to seek to a new track 4 times (after the initial seek). On the other hand, if these same five requests are performed in the order 1, 1, 1, 2 and 2, only one seek will be required after the initial seek. Similarly, because long seeks take more time than short seeks, access to tracks 1, 2, 3, 4 and 5 will be much faster in that order than if they are accessed, for example, in the order 1, 4, 2, 5 and 3. If the disk is interleaved to make it possible to read sectors 1, 2, 3, 4 and 5 within a track with minimal rotational latency, we are likely to be able to read them in one revolution, while reading the same sectors in the order 5, 4, 3, 2 and 1 is likely to require 4 disk revolutions.

This reordering of the request queue is called disk scheduling. As illustrated above, disk scheduling can lead to significant performance improvements if the number of requests in the queue is large. If the request queue never contains more than one item, however, scheduling never accomplishes anything! If the user interface to the file system requires that each read be completed immediately before the user program is allowed to continue, disk scheduling only improves the performance if there are multiple concurrent users.

Disk scheduling is irrelevant on devices such as flash memory, where the access time for any block of data is a constant regardless of the address of that block. Disk scheduling is almost irrelevant on single-user systems where disk I/O is only occasional and rarely limits performance. At the other extreme, disk scheduling is crucial on file servers and search engines, where huge quantities of data are stored on gigantic disk systems and shared by large numbers of users.

Logically, the disk scheduler can be thought of as an independent program which observes and reorders the disk queue in order to maximize performance. In practice, the disk scheduler is never implemented this way! Instead, we implement the scheduler in the enqueue and dequeue routines for the disk request queue. Thus, we change the queueing discipline of this queue from first-in first-out to one that, ideally, dequeues requests in an order giving minimum latency.
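
Put another way, the scheduler hides behind the same two queue operations the driver already uses. Under that assumption, the interface might look like this (names invented; one drive assumed, for brevity):

    /* called when a user request is posted; the scheduling policy
       decides where in the queue structure the request goes */
    void postrequest( struct diskrequest * r );

    /* called from the interrupt service routine when the device
       is free; the policy decides which pending request is next */
    struct diskrequest * nextrequest( void );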

Drum and fixed-head disk scheduling are fairly simple, since head motion latency need not be considered. The optimal algorithm for such devices maintains as many first-in first-out request queues as there are sectors per track. When a request is posted, it is placed in the queue selected by the sector-number field of its disk address. When an interrupt arrives from the device, the next request is taken from the queue for the sector number which is about to pass under the heads. This results in both maximal utilization of the device, measured, for example, in bits transferred per second, and in a minimum value for the maximum time any particular user will have to wait to see their request satisfied.
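
A sketch of this policy, assuming simple FIFO enqueue and dequeue operations and an invented currentsector() helper that reads the hardware register reporting the sector now passing under the heads:

    #define SECTORS 16   /* sectors per track, for example */

    extern void enqueue( struct diskrequest ** q, struct diskrequest * r );
    extern struct diskrequest * dequeue( struct diskrequest ** q );
                                 /* assumed to return NULL if empty */
    extern int currentsector( void );

    struct diskrequest * sectorqueue[SECTORS]; /* one FIFO per sector */

    void postrequest( struct diskrequest * r )
    {
        enqueue( &sectorqueue[r->sector], r );
    }

    /* select the request for the sector about to reach the heads;
       if that queue is empty, take the next nearest sector */
    struct diskrequest * nextrequest( void )
    {
        int s = (currentsector() + 1) % SECTORS;
        int i;
        for (i = 0; i < SECTORS; i++) {
            struct diskrequest * r = dequeue( &sectorqueue[(s + i) % SECTORS] );
            if (r != NULL) return r;
        }
        return NULL;   /* no pending requests */
    }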

Once moving heads are allowed, there are a number of possible disk scheduling policies that meet different response requirements. Each policy is stated in terms of which pending disk transfer request should be selected for processing when the time comes to examine the queue of pending requests. To evaluate a queueing policy, we ask three questions:

    What throughput does the policy achieve, in requests processed per second?
    What is the average time a request must wait for service?
    What is the worst-case time a request may wait for service?

In fact, the first two questions are really two sides of the same coin. The policy with the best throughput, in requests per second, will have the shortest average wait, in seconds per request. The measure of each of these is the reciprocal of the other. The question of average versus worst-case waiting time, however, is quite different. Some policies minimize the average wait, while others minimize the worst-case wait. Here are some example disk scheduling policies:

FIFO -- first in, first out
By default, if you are told that a queue is being maintained, you assume that it is a first-in first-out queue. In evaluating alternative scheduling disciplines, one of the first questions you should ask is: is it better than FIFO with regard to our measures of performance, and if so, how much better is it? There is a sense in which the FIFO policy is the fairest. No disk request is favored over any other, and as a result, if the queue length remains constant, the worst case and the average case will be comparable.

Priority -- process the highest priority request first
In some settings, it is appropriate to assign priorities to different disk requests. Processing pending requests in priority order may not improve overall performance; in fact, it may make things worse by forcing a very bad order of request processing. It may, however, guarantee that key requests are processed in time to meet important deadlines.

Shortest latency-time first
One way to try to optimize disk performance is to try to minimize the latency. Thus, after finishing a disk operation, the one to try next is the one that can be initiated first. Within a cylinder, on completing an operation on sector n, we should preferentially select sector n+1 for processing next, and if no transfer is pending to sector n+1, we should try sector n+2 (this discussion assumes that no interleaving is needed). When moving between cylinders, we should preferentially select the pending transfer in the cylinder nearest the current cylinder before we request a seek to a distant cylinder. In sum, each time the disk interrupt service routine finishes a data transfer, the request queue is searched for that request which can be completed soonest.
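
A sketch of the selection step, assuming a single global request queue headed by queuehead, plus invented latency() and removerequest() helpers; latency( r ) estimates, from the current head and rotational position, how soon request r could begin.

    extern struct diskrequest * queuehead;        /* pending requests */
    extern int latency( struct diskrequest * r ); /* assumed estimator */
    extern void removerequest( struct diskrequest ** q,
                               struct diskrequest * r );

    /* pick the pending request that can be started soonest */
    struct diskrequest * nextrequest( void )
    {
        struct diskrequest * r;
        struct diskrequest * best = NULL;
        for (r = queuehead; r != NULL; r = r->next) {
            if (best == NULL || latency( r ) < latency( best ))
                best = r;
        }
        if (best != NULL)
            removerequest( &queuehead, best );
        return best;
    }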

This policy results in a minimum average waiting time and maximum throughput. If these are the measures we care about, this policy is optimal, but it is far from optimal in its worst-case! If some region of the disk has a high number of pending requests, those requests will receive very fast service, while requests for transfers involving other areas of the disk will receive poor service. In the extreme case, if our application software always keeps at least one disk request pending for one region of the disk, requests for access to some other region of the disk will never be processed! Therefore, the worst case for this policy is an infinite wait!

SSTF -- Shortest Seek Time First
Because head motion is usually far more expensive than rotational latency, most implementations of minimum latency scheduling ignore the rotational component, scheduling transfers within a cylinder in FIFO order while continuing to carefully schedule seeks, that is, head motion from one cylinder to another.

Cyclic Scan -- Cycle through pending requests in address order.
The cyclic scan policy attempts to gain most of the benefits of the SSTF policy while retaining the essential fairness of the FIFO policy. The idea is to always advance the cylinder number when all pending requests for the previous cylinder have been completed. It is important to the fairness of the cyclic scan schedule to ignore new requests posted for the current cylinder until the next visit to that cylinder. In effect, we are using the SSTF policy with the constraint that all seeks are in the same direction, or, looked at from another perspective, we are scanning through the pending requests in a cycle.

If there are many pending requests, this policy will frequently outperform FIFO scheduling on all measures of performance. In fact, if the goal is to minimize the worst case seek times, this is the best disk scheduling policy, but there is an even better solution for maximizing throughput or minimizing the average waiting time without completely sacrificing fairness.

Elevator -- Cycle up and down through the pending requests.
The elevator scheduling policy was originally invented as a control scheme for elevators, and was then applied to disk scheduling. The analogy between elevators and moving-head disks is actually quite strong! Each cylinder of the disk can be imagined as a floor of the building, and the set of pending requests for each cylinder can be imagined as the set of people waiting for the elevator on the corresponding floor. The scheduling policy question can then be restated as: in what order should the elevator visit the floors of the building in order to minimize the average or worst-case waiting time? The cyclic scan algorithm, stated in these terms, says that the elevator stops at each floor where someone wants to get on or off only on the way up, and then takes an express trip to the bottom. With elevators, we don't do this! Instead, we let the elevator stop whenever it passes a floor where people want to get on or off.

Translating back to the disk context, this means that we follow the scanning algorithm until we reach one end of the scan, and then, instead of taking a quick seek to the lowest numbered cylinder where there is a pending request, we simply reverse the direction of the scan, working through pending requests toward successively lower numbered cylinders.

We are working here in a world where perfection with regard to one goal is incompatible with perfection with regard to another. There is no scheduling policy that minimizes both the average and the worst-case waiting time, and the elevator policy minimizes neither! Despite this, it is usually considered the best choice.

All of these disk scheduling policies can be implemented by using one queue per cylinder on each disk; in this case, the routine to post a disk request simply uses the cylinder number as an index into an array of queues, and the interrupt service routine simply searches this array for the nearest non-empty queue (in the appropriate direction) when deciding what to do next. With modern disks that have hundreds of cylinders, the arrays required by this solution may seem large, but this is not a major problem.
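
A sketch of this organization for the elevator policy, with invented names and the FIFO operations assumed earlier; for the cyclic scan, the direction would simply be reset to the start of the disk instead of reversed.

    #define CYLINDERS 1024   /* for example */

    struct diskrequest * cylqueue[CYLINDERS]; /* one FIFO per cylinder */
    static int headpos = 0;   /* cylinder where the heads now rest */
    static int direction = 1; /* +1 scanning inward, -1 outward */

    void postrequest( struct diskrequest * r )
    {
        enqueue( &cylqueue[r->cylinder], r );
    }

    /* elevator: take the nearest pending request in the current
       direction; reverse when the scan runs off either end */
    struct diskrequest * nextrequest( void )
    {
        int pass;
        for (pass = 0; pass < 2; pass++) {
            int c;
            for (c = headpos; c >= 0 && c < CYLINDERS; c += direction) {
                if (cylqueue[c] != NULL) {
                    headpos = c;
                    return dequeue( &cylqueue[c] );
                }
            }
            direction = -direction;   /* hit the end; turn around */
        }
        return NULL;   /* no pending requests anywhere */
    }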

Even if the queues used for requests within each cylinder are simple FIFO queues, the performance will be significantly better if pending requests are spread out over many cylinders, but the best performance requires that the queues within each cylinder be carefully structured to allow all pending requests for that cylinder to be processed with minimum rotational latency.

For the cyclic scan and elevator policies, an alternative implementation uses two priority queues, one for all requests inside the current head position, and one for all requests outside the current head position. In the case of the elevator (bidirectional scan) policy, the two queues are sorted in opposite orders, so that the outermost request in the inner queue has the highest priority, and the innermost request in the outer queue has the highest priority. When a new request is posted, the current head position is inspected to decide which queue it goes in. When the disk interrupt service routine finishes dealing with a request, it gets its next request from the same queue as it got its last request from, unless that queue is empty, in which case it switches queues. This approach can be very efficient because the operations on a priority queue can be done in time proportional to the logarithm of the queue size if an appropriate data structure such as a heap-ordered binary tree is used.
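
A sketch of the two-queue organization for the elevator policy, assuming opaque heap-based insert, extract, and empty operations (invented here) that each take time logarithmic in the queue size:

    /* assumed heap operations, each O(log n); the inner heap
       delivers its largest cylinder first, the outer heap its
       smallest, so each scan direction comes out in order */
    struct heap;
    extern void insert( struct heap * h, struct diskrequest * r );
    extern struct diskrequest * extract( struct heap * h );
    extern int empty( struct heap * h );

    extern struct heap * inner;   /* requests below the head */
    extern struct heap * outer;   /* requests at or above the head */
    static struct heap * current; /* heap now being served */
    static int headpos = 0;

    void postrequest( struct diskrequest * r )
    {
        if (r->cylinder < headpos) insert( inner, r );
        else                       insert( outer, r );
    }

    struct diskrequest * nextrequest( void )
    {
        struct diskrequest * r;
        if (current == NULL) current = outer;
        if (empty( current ))         /* finished this direction? */
            current = (current == inner) ? outer : inner;
        if (empty( current )) return NULL;
        r = extract( current );
        headpos = r->cylinder;
        return r;
    }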

Terminology

The reader should be familiar with the following general terms after reading this section:

random-access devices           drums
moving-head disks               fixed-head disks
floppy disks                    disk packs
recording surfaces              hard sectoring
tracks                          soft sectoring
sectors                         disk addresses
cylinders                       inter-sector gaps
rotational latency              head positioning latency
interleaving                    seek time
low level disk input/output     disk request blocks
disk request queues             critical sections
disk cache                      hit ratio
disk scheduling policy          shortest-seek-time-first
cyclic-scan                     elevator

Exercises

  1. Given a disk with 5 sectors per track, where these are numbered 1-2-3-4-5 when no interleaving is used and 1-4-2-5-3 with two-way interleaving, how would these be numbered with

    a) 3 way interleaving?

    b) 4 way interleaving?

  2. Show the physical order of the logical sector numbers on a 3 way interleaved track of 16 sectors.

  3. Given that it takes 0.1 seconds to move the disk heads 400 tracks and that the number of tracks moved is proportional to the square of the head motion latency for that move, how long will it take to move

    a) 2 tracks?

    b) 3 tracks?

  4. The Seagate Barracuda ATA IV disk family has an internal transfer rate from disk to controller of 555 million bits per second. The advertised spindle speed is 7200 RPM, and the advertised sector size is 512 bytes.

    a) How many bytes per track of unformatted storage capacity does this imply?

    b) If each sector has a 3 byte epilogue (a 2 byte checksum plus a marker byte) and a 13 byte prologue (cylinder, track and sector numbers, a checksum on the prologue, and some marker bytes), how many whole sectors fit on one track?

    c) The advertised (rotational) latency time is 4.16 milliseconds. Is this consistent with the advertised spindle speed?

  5. If one sector rotates by the head every 138 microseconds and it takes 100 microseconds for the software to start the next disk transfer after the previous one has finished, and if the head positioning latency is 0.005 seconds times the square root of the number of tracks moved, make a table of the number of sectors which must be skipped before the next transfer can begin as a function of the number of tracks the head must move, from 0 to at least 15 tracks.

  6. a) Which of the errors listed in Figure 3 should be considered fatal errors, which might be resolved by retrying a read operation, and which might be resolved by retrying a write operation?

    b) Modify diskread from Figure 4 to check for the appropriate errors and retry the operation up to 3 times before giving up and raising a read-error exception.

  7. Figure 3 contains a warning that certain interface registers should not be changed while a read or write operation is in progress.

    a) Describe the possible consequences of changing DISKCYL while a write operation is in progress.

    b) Describe the possible consequences of changing DISKADDR while a read operation is in progress.

  8. Following up on the above exercise, modify the code shown in Figure 7 to check for appropriate errors and retry any disk operation up to 3 times before giving up.

  9. Give an appropriate disk write routine for Figure 7.

  10. Write a procedure which reads the contents of an entire disk and copies it to tape; use the low level disk interface described in Figure 7 and the user level tape interface described in Figure 9.9. Assume that the global constants "sectorsize", "sectorspertrack", "surfaces", and "cylinders" are available to describe the disk being copied, and that your procedure will take, as a parameter, a pointer to the opened file descriptor of the tape.

  11. Exactly what damage could occur if an interrupt occurred at the wrong time during the critical section in the diskread routine in Figure 7? Between which instructions would this interrupt have to occur, and what observable consequences would result from the damage?

  12. Modify the code shown in Figure 7 for "diskread" so that it searches the request queue for other requests for the same disk address and implements the disk cache ideas discussed in this chapter.

  13. Consider a disk system where the hardware allows consecutive sectors of one track to be read, but 4 sectors must be skipped to move 1 cylinder in or out, 6 must be skipped to move 2, 7 must be skipped to move 3, and 8 must be skipped to move 4. If there are 16 sectors per track (numbered 0 to 15), disk addresses are expressed as (track,sector) pairs, and requests arrive to access addresses (1,2), (2,5), (4,10), (5,14), (1,6), and (3,11), how many rotations and fractions of a rotation, at minimum, would it take to process this sequence of requests

    a) if they are processed in the order they are given above?

    b) if they are processed in the best possible order?

  14. For each of the disk scheduling policies, what should be done when a request is posted for the track the head is currently processing? What alternatives are there, if any, and what undesirable consequences would result from the use of these alternatives?

    a) shortest-seek-time-first

    b) cyclic-scan

    c) elevator

  15. How likely is it that a disk scheduler would make much of a performance difference on a small personal computer? Why?

  16. How likely is it that a disk scheduler would make much of a performance difference on a large web server? Why?

Readings

There is a good discussion of disk scheduling in Chapter 7 of

Operating System Concepts, by J. Peterson and A. Silberschatz (Addison Wesley, 1983).

See also Section 5.2.2 of

Operating System Elements: A User Perspective, by P. Calingaert (Prentice Hall, 1982).