How do you make an assignment to a multi-byte object look atomic? This is essential when updating a directory, updating a database entry, and in many other operating system contexts. A system failure during an update operation will prevent the completion of the operation, leaving part of the data structure or disk sector updated and part of it in its original state, with, perhaps, some parts entirely corrupted or erased.
One common solution is to make all references to the object through an atomically updatable pointer, and then record a complete new copy of the object without changing the original or the pointer. The actual moment of irreversible update then becomes the moment that the pointer is changed. If the system fails before the change of the pointer, the old object will still be used after the system recovers. If the pointer has been changed, the new object will be referenced.
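In main memory, where the hardware does provide an atomic word-sized store, the idea can be sketched in C11 as follows; the type Object and the variable current are invented for the illustration:

   #include <stdatomic.h>
   #include <stdlib.h>
   #include <string.h>

   typedef struct { char data[256]; } Object;

   _Atomic(Object *) current;   /* all readers follow this pointer */

   void update(const Object *new_value)
   {
       /* build a complete new copy, touching neither the original
          object nor the pointer */
       Object *copy = malloc(sizeof *copy);
       if (copy == NULL)
           abort();
       memcpy(copy, new_value, sizeof *copy);

       /* the moment of irreversible update: one atomic pointer store */
       atomic_store(&current, copy);

       /* (a real system must also reclaim the old copy safely) */
   }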
But how can we update the pointer on a device such as a disk or flash EEPROM where there is no atomic update primitive in the hardware? The solution to this problem is particularly important when you are trying to update the multiple copies of a disk sector in a RAID, because you would like the semantics to be "either the sector is updated correctly, or no change is made" and you would like this to be true despite a power failure at any point in time during the update.
Leslie Lamport (and independently, others) developed algorithms for updating a shared variable in the face of failures. These assume that a mutual exclusion algorithm guarantees that only one process tries to update the variable at a time; given that, the algorithms guarantee that the result of the update will be correct, even in the face of failure.
The basic operations offered by Lamport's stable storage algorithms are:
   inspect( V, value )           (or V.inspect() returns value)
   update( V, value )            (or V.update( value ))

Stable storage rests on a new, redundant representation for the stored value, and on two procedures (or protocols) for inspecting and updating that value.
Conceptually, it is reasonable to use a client-server world view and imagine the inspect and update procedures as being the services offered by the server to clients. If the server fails, we can easily start a new server, as long as the variable itself is stored in such a way that it survives the failure.
A stable variable V is represented as a record where every field is duplicated:

   Copy1 -- the value stored
   Time1 -- the time it was stored
   Sum1  -- a checksum over the value and time
   Copy2
   Time2
   Sum2

There are two copies of the value, and for each copy, there is a record of the time at which the value was last updated and a record of the checksum computed as of the last update.

The update operation writes one complete copy, waits for that write to finish, and only then writes the other:

   Procedure Update( V, value )
   begin
      update time
      V.Copy1 := value
      V.Time1 := time
      V.Sum1 := Checksum( value, time )
      -- wait for the assignments to really finish
      V.Copy2 := value
      V.Time2 := time
      V.Sum2 := Checksum( value, time )
      -- wait for the assignments to really finish
   end

The utility of this code relies on keeping two (or more) copies of the value, where no failure will corrupt more than one copy at a time. The wait for one update to finish before starting another is very important in this regard.
The fault tolerance of this scheme improves if the two (or more) copies of the variable are stored in places that fail independently, for example, on different disk drives or at widely distributed sites, so that no single failure can corrupt both copies.
The update and inspect operations must be performed inside critical
sections, and if failure is to be survived, these critical sections
must use some algorithm that can detect failure of a process inside
a critical section and release any associated mutual exclusion semaphores.
The assignment statements shown above will not, typically,
execute instantaneously. On a cache machine, for example, there
may be a delay before the values are written back to main memory.
If disks are involved, there may be a considerable time between
the issuing of a write request and the actual write to disk.
If disk cache software is involved, it is even possible that
the write to disk will never happen.
There is a problem here -- many modern inexpensive disk drives
refuse to tell the user when the actual write to the magnetic
disk surface has completed. The controller that is permanently
attached to the drive contains an internal buffer, and the controller
does everything it can to convince the user that the disk write
operation has been completed as soon as the data is written to this
buffer. This certainly helps improve system performance, but it makes
it difficult or even impossible to use such disks for stable storage.
Sometimes, the only way to force the drive to complete a write
operation is to tell it to park its heads and prepare for a shutdown.
Many operating system disk I/O drivers also attempt to introduce
buffering between the user and the disk. This can be done by disk
schedulers and RAM cache implementations that store frequently
referenced sectors in RAM and write them back to disk much later.
Such drivers can make it impossible to implement stable storage at
the application level, unless there is a system service to force
buffered data back to disk. Under UNIX, this service is called
fsync().
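For example, an application-level stable storage implementation might force each copy out to the disk as follows; this is a minimal POSIX sketch, and the file name handling is illustrative:

   #include <fcntl.h>
   #include <unistd.h>

   /* write a buffer and force it out of all software buffers to the disk */
   int write_and_flush(const char *path, const void *buf, size_t len)
   {
       int fd = open(path, O_WRONLY | O_CREAT, 0644);
       if (fd < 0)
           return -1;
       if (write(fd, buf, len) != (ssize_t)len ||
           fsync(fd) != 0) {            /* force buffered data to disk */
           close(fd);
           return -1;
       }
       return close(fd);
   }

Note that, as described above, even fsync() cannot defeat buffering hidden inside the drive itself.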
The inspect operation checks both copies and returns the most recent value that has a valid checksum:

   Procedure Inspect( V )
   begin
      -- optionally start by reading all fields from disk
      if V.Sum1 = checksum( V.Copy1, V.Time1 )
         if V.Sum2 = checksum( V.Copy2, V.Time2 )
            if V.Time1 > V.Time2
               value := V.Copy1
            else
               value := V.Copy2
            endif
         else
            value := V.Copy1
         endif
      else
         if V.Sum2 = checksum( V.Copy2, V.Time2 )
            value := V.Copy2
         else
            -- failure - both copies were corrupted --
         endif
      endif
      return value
   end

This code is fairly simple -- there are four cases to consider:

1) There were no errors
In this case, the checksums on both copies will be valid, both copies will hold the same value, and they will have the same timestamp. Either copy may be returned.
2) There was a failure such that update managed to write one
updated copy of the data, but it didn't get to start
writing the other updated copy.
In this case, the checksums on both copies will be valid,
but their timestamps will differ. Return the copy with
the most recent timestamp.
3) There was a failure during one of the updates, or there
was a failure that corrupted one of the stored values between
updates.
In this case, the checksum on the copy in question will
be invalid. The other copy should still be OK and can be
returned.
If the failure occurs during the write of the second copy, it will be as if the update was successful, because the first copy will have already been written and will be available for use. If the failure occurs during the write of the first copy, it will be as if the update never occurred.
4) A failure or failures destroy all copies. There is nothing that
can be done about this, but it takes multiple failures to cause
this kind of damage, and the probability of this multiple failure
can be made arbitrarily low by storing more copies in more widely
distributed locations.
Note that the stable storage update procedure can be made even more reliable if it does a write-read-verify cycle on each copy it writes out; that is, it writes each copy, then reads it back and verifies correct storage before continuing.
Also, note that no checksum algorithm can detect all possible errors in the stored data. Checksums can be made arbitrarily error resistant, but they cannot be made perfect. For any checksum algorithm, there is some combination of errors that will defeat it!
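To make the protocol concrete, here is a minimal C sketch of the record and the two procedures. The additive checksum, the fixed-size value, and the placeholder routines now() and force_to_medium() (standing in for a real timestamp source and for fsync() or better) are all assumptions of the sketch, not part of Lamport's algorithm as presented above:

   #include <stdint.h>
   #include <string.h>

   #define VALSIZE 64

   typedef struct {
       uint8_t  copy[VALSIZE];   /* the value stored */
       uint64_t time;            /* the time it was stored */
       uint64_t sum;             /* checksum over value and time */
   } Half;

   typedef struct { Half h1, h2; } Stable;   /* every field duplicated */

   static uint64_t current_time = 0;          /* placeholder timestamp source */
   static uint64_t now(void) { return ++current_time; }
   static void force_to_medium(void) { }      /* placeholder for fsync() or better */

   /* simple additive checksum over value and time; real systems would
      use something stronger, and no checksum can catch every error */
   static uint64_t checksum(const uint8_t *v, uint64_t t)
   {
       uint64_t s = t;
       for (int i = 0; i < VALSIZE; i++)
           s += v[i];
       return s;
   }

   static void write_half(Half *h, const uint8_t *v, uint64_t t)
   {
       memcpy(h->copy, v, VALSIZE);
       h->time = t;
       h->sum = checksum(v, t);
       force_to_medium();      /* wait for the writes to really finish */
   }

   void update(Stable *V, const uint8_t *value)
   {
       uint64_t t = now();
       write_half(&V->h1, value, t);   /* finish copy 1 completely... */
       write_half(&V->h2, value, t);   /* ...before starting copy 2 */
   }

   /* returns 0 and fills value on success, -1 if both copies are bad */
   int inspect(const Stable *V, uint8_t *value)
   {
       int ok1 = V->h1.sum == checksum(V->h1.copy, V->h1.time);
       int ok2 = V->h2.sum == checksum(V->h2.copy, V->h2.time);
       if (!ok1 && !ok2)
           return -1;                  /* failure - both copies corrupted */
       if (ok1 && (!ok2 || V->h1.time >= V->h2.time))
           memcpy(value, V->h1.copy, VALSIZE);
       else
           memcpy(value, V->h2.copy, VALSIZE);
       return 0;
   }

A write-read-verify step, as suggested above, would fit naturally at the end of write_half.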
Before discussing file systems, it's useful to spend some time on a simpler
class of devices: timers. In a sense, a timer is the simplest of all I/O
devices, because it performs no I/O. There are many kinds of timers; the
simplest is the periodic timer. All it does is request a periodic interrupt.
All modern operating systems, UNIX, Windows and Mac/OS included, use periodic timers. In the case of early UNIX, the timer fired 60 times per second, a frequency derived from the power supply. Modern UNIX systems almost universally have the system clock fire 100 times per second.
In the case of the DOS and Windows environments, the rate has classically been 18.2 ticks per second. Most modern systems use hardware that is able to provide far more general timer service, but commercially available operating systems frequently ignore the generality of the hardware in favor of using it to provide a simple periodic interrupt.
As an example of a very simple hardware interface, consider a timer that
fires every millisecond. This might present the following interface to
the software:
   expired -- once a millisecond, the timer sets this single-bit interface
   register. When this bit is set, the timer also requests an interrupt.
   The timer interrupt service software must reset this bit before
   returning from the interrupt or re-enabling interrupts, in order to
   turn off the interrupt request.
This simple timer will be used for the remainder of this lecture, but note
that far more complex timers are common. For example, interval timers
contain a register that is decremented at a constant rate, for example,
once per microsecond. The timer requests an interrupt when the register
reaches zero, and the interrupt service routine may load a new count in the
register to request the next interrupt.
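By way of illustration, service code for such an interval timer might look like the following C fragment; the register addresses and the one-microsecond rate are hypothetical, not those of any particular device:

   #include <stdint.h>

   /* hypothetical memory-mapped interval timer registers */
   #define TIMER_COUNT   (*(volatile uint32_t *)0xFFFF0010) /* decrements once per microsecond */
   #define TIMER_EXPIRED (*(volatile uint32_t *)0xFFFF0014) /* set when the count reaches zero */

   void timer_interrupt_service(void)
   {
       TIMER_EXPIRED = 0;    /* acknowledge: turn off the interrupt request */
       TIMER_COUNT = 1000;   /* request the next interrupt 1000 us from now */
       /* ... do the work associated with this timer expiration ... */
   }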
It is rare to find software that requires a periodic interrupt at the same rate as the hardware period of the timer. When the application requires a periodic interrupt at longer intervals, however, this can be synthesized from the basic periodic interrupt of the hardware clock as follows:

   periodic-interrupt-service-routine:
      reset( expired )
      countdown := countdown - 1
      if countdown = 0 call user-service
      return

Here, the procedure user-service is called at the end of an interval defined by the periodic decrementing of the variable countdown; user-service should reset countdown each time it is called. If user-service always sets countdown to the same value, the user routine will be called periodically; if user-service sets different values each time it is called, it will be aperiodic.

This creates, in software, a programmable interval timer, but most modern real-time-clock hardware directly supports this function, with a countdown register decrementing at several megahertz and the rule that the hardware will request an interrupt when the register reaches zero.
Application programs rarely want just one interval timer. What they
typically want is a service such as the following, callable from within
any user process:
timed_wait( t ) -- when a process calls this operating system service,
that process will be delayed t time units (the particular time unit
depends on the system; we will use milliseconds for the examples presented
here).
Exercise: How can you implement the example timed_wait() service in terms of the UNIX timer primitives? See the man page for setitimer() for documentation.
On a system with only one process and an interval timer, this service
would be trivial to implement. On a system with more than one process
and a periodic timer, this service is far more interesting. In such a
context, many processes may be awaiting different times, and there may
easily be more than one process waiting for the same time instant.
Thus, there must be a list of waiting processes, with a record, for each
process, of the time it is awaiting. When a timer expires, it must somehow
search this list and wake up each process waiting for the current instant.
The need to block processes and then wake them up suggests that each timer record will have a semaphore on which the process blocks. This leads to the following data structure:

   Timer record:
      link fields, as needed for list management
      delay -- the requested time delay
      sem -- semaphore used to block process

   Timer list:
      a linked list of timer records

The wait service seen by an application program would then be as follows:

   timed_wait( t )
      get a timer record r
      initialize r.sem to a count of zero
      r.delay := t
      append( r, timer-list )
      P( r.sem )

A single timer record can be permanently attached to each process. The interrupt routine needed to support this service is as follows:

   periodic-interrupt-service-routine:
      reset( expired )
      for each timer record r in timer-list
         r.delay := r.delay - 1
         if r.delay <= 0 then
            remove( r, timer-list )
            V( r.sem )
         endif
      endloop
      return

Code written along these lines is actually used on some systems, but it begins to behave very badly on systems with large numbers of waiting processes. The problem is that interrupt service routines should never perform open-ended computations such as scanning linked lists. This work should be moved to an application process if at all possible.
Nonetheless, the above code implements what can properly be called a virtual interval timer available to each application process. This notion of virtual timers is an important example of a virtual device interface; its primary importance rests on the fact that timers, as a class of device, have the simplest of semantics.
Exercise: Explore the documentation for the UNIX timer services
and write a critical essay on the deficiencies of these services. There are
many!
The key to avoiding complex interrupt processing in the implementation of timer services is to rely on a sorted list of timer records, each holding the time being awaited. This requires the following data structure:

   Timer record:
      link fields, as needed for list management
      time -- the requested time at which the timer should fire
      sem -- semaphore used to block process

   Timer queue:
      a priority queue of timer records, sorted by time

   Clock:
      the current time

    _____________     ______     ______     ______
   | Timer Queue |-->| Link |-->| Link |-->| Link |--X
   |_____________|   |------|   |------|   |------|
                     | time |   | time |   | time |
                  /  |------|   |------|   |------|
   Sort by order /   | sem  |   | sem  |   | sem  |
   of time fields!   |______|   |______|   |______|

If the system is to run for long periods, the variables used to hold the clock and the time fields of timer records must have sufficient precision that overflows are unlikely during the life of the system. A year's worth of milliseconds will overflow a 32 bit counter, so realistic systems should use 48 or 64 bit counters.
Some systems maintain all time values in, for example, milliseconds since
some arbitrary dawn of time. Others maintain time values in more expensive
forms, with fields divided out for year, day, hour, minute, second and
millisecond. There is little reason to do the latter, however, since these
divisions are usually only needed for the output of dates, and internal
time and interval arithmetic is simplified by holding time values in a simple
form.
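For instance, advancing a time value by 1500 milliseconds is a single addition in the flat form, while the broken-out form needs carry propagation through every field. A small C illustration, with the broken-out layout invented for the example:

   #include <stdint.h>

   /* flat form: one addition */
   uint64_t wakeup_flat(uint64_t clock_ms)
   {
       return clock_ms + 1500;
   }

   /* broken-out form (illustrative layout): carries at every level */
   struct broken { int ms, sec, min, hour; };

   struct broken wakeup_broken(struct broken t)
   {
       t.ms += 1500;
       t.sec  += t.ms / 1000;  t.ms  %= 1000;
       t.min  += t.sec / 60;   t.sec %= 60;
       t.hour += t.min / 60;   t.min %= 60;
       return t;
   }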
Exercise: The classic Unix representation of the date uses a signed
32 bit count of the seconds since Midnight January 1, 1970, GMT. When will
the UNIX date overflow? (It is interesting to note that MacOS and UNIX
were both designed without Y2K errors in their operating kernels.)
Assuming that the time fields are absolute times, in milliseconds, and that the timer interrupt service routine maintains a global variable called clock, measuring milliseconds from the beginning of time, the wait service can be coded as follows:

   timed_wait( t )
      get a timer record r
      initialize r.sem to a count of zero
      r.time := clock + t
      enqueue( r, timer-queue )
      P( r.sem )

The interrupt routine needed to support this service is as follows:

   periodic-interrupt-service-routine:
      reset( expired )
      clock := clock + 1 -- count time
      while head( timer-queue ).time <= clock
         r := dequeue( timer-queue )
         V( r.sem )
      endloop
      return

This code rests on a priority queue. The simplest such data structure is a sorted linked list, where the enqueue service sorts new items into the list in time order, the head of the list is always available, and the dequeue service simply removes the head element from the list.
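Sorting on insertion keeps the interrupt routine's work at the head of the list. A minimal C sketch of such a sorted enqueue, with the record layout taken from the data structure above and the semaphore left abstract:

   #include <stdint.h>

   typedef struct timer_record {
       struct timer_record *next;  /* link field for list management */
       uint64_t time;              /* the time at which the timer fires */
       void *sem;                  /* semaphore used to block the process */
   } TimerRecord;

   /* insert r into the queue, keeping the list sorted by time so the
      interrupt routine only ever has to look at the head */
   void enqueue(TimerRecord *r, TimerRecord **queue)
   {
       while (*queue != NULL && (*queue)->time <= r->time)
           queue = &(*queue)->next;
       r->next = *queue;
       *queue = r;
   }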
This simple implementation moves most of the work to the enqueue service, where it belongs, but if the list is long, the enqueue service may take some time. There are faster algorithms that use binary trees or heap-ordered trees to achieve O(log n) operation times on the queue. You can get more information on this subject from "An Empirical Comparison of Priority Queue and Event Set Implementations," Communications of the ACM, 29, April 1986, pages 300-311. Code for interesting priority queue algorithms is available to WWW users.
Adapting the above code to a programmable interval timer poses some problems.
Primary among these is the problem of maintaining the clock. Every time the
interval timer expires, the clock can be updated to reflect the interval that
has just passed, but for inquiries about the clock value between updates,
the running count (countdown, in the example given earlier) must
be inspected to get the current value.
The above code relies on a global variable "clock" that is incremented once per timer interrupt. This is a source of significant problems in real systems. For example, consider a timer that increments once per millisecond. This will increment by 1000 per second (so about 10 bits of timer precision are devoted to counting seconds). Counting 60 seconds per minute takes roughly another 6 bits of precision, and counting 60 minutes per hour takes another 6 bits, leaving 10 bits out of a typical 32 bit word available for counting hours. 1000 hours is just 40 days of 25 hours each, so we can quickly estimate that our 32 bit counter will overflow in just 6 weeks! If we replace rough estimates such as those used above with accurate arithmetic, we get a clock lifetime of 49.71027 days.
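The exact figure is easy to verify:

   2^32 ms = 4,294,967,296 ms
   4,294,967,296 / (1000 * 60 * 60 * 24) = 4,294,967,296 / 86,400,000
                                         ≈ 49.71 days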
Using the above algorithm, we would expect our system to run correctly for
a month, and then we would expect a brief burst of irregular behavior as the
clock rolled over, after which it would deliver correct performance for
another month. This is hardly acceptable!
Therefore, we seek a better solution!
One approach is to simply use a higher precision counter. For example, we
could replace our 32 bit counter with a 64 bit counter; with this, our
system would last for millennia instead of mere months. We pay a price,
however, in slow arithmetic, unless we're running on a machine with a 64 bit
ALU.
Another approach is to note that, while we may well want our system to run for longer than one month, most of the timed delays requested by users are likely to be on the order of seconds. We can easily rewrite the timer data structures to support this by replacing the time field in each timer record with a relative delay field:

   Timer record:
      link fields, as needed for list management
      delay -- the delay between the previous timer and this one
      sem -- semaphore used to block process

   Timer queue:
      a sequential queue of timer records, sorted by expiration time

   Clock:
      the current time

    _____________     ______     ______     ______
   | Timer Queue |-->| Link |-->| Link |-->| Link |--X
   |_____________|   |------|   |------|   |------|
                     | delay|   | delay|   | delay|
   Relative delay /  |------|   |------|   |------|
   from one record   | sem  |   | sem  |   | sem  |
   to the next.      |______|   |______|   |______|

This data structure adds only marginally to the complexity of the timer services, while it allows very fast arithmetic to be used for typical short delays.
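The insert operation must now convert a total delay into a link-by-link remainder, and the per-tick work touches only the head of the list. A hedged C sketch under the same assumptions as before (record layout as above, semaphore left abstract):

   #include <stdint.h>

   typedef struct timer_record {
       struct timer_record *next;
       uint32_t delay;   /* delay relative to the previous record */
       void *sem;        /* semaphore used to block the process */
   } TimerRecord;

   /* insert r with total delay t: walk the list subtracting the delays
      already ahead of us, then store only the remainder */
   void enqueue_delta(TimerRecord *r, TimerRecord **queue, uint32_t t)
   {
       while (*queue != NULL && (*queue)->delay <= t) {
           t -= (*queue)->delay;
           queue = &(*queue)->next;
       }
       r->delay = t;
       if (*queue != NULL)
           (*queue)->delay -= t;   /* successor is now relative to r */
       r->next = *queue;
       *queue = r;
   }

   /* per-tick work: decrement only the head; wake every record that
      reaches zero (several may share one expiration time) */
   void tick(TimerRecord **queue)
   {
       if (*queue == NULL)
           return;
       if ((*queue)->delay > 0)
           (*queue)->delay -= 1;
       while (*queue != NULL && (*queue)->delay == 0) {
           TimerRecord *r = *queue;
           *queue = r->next;
           /* V( r->sem ) would go here */
       }
   }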
Summary: The above notes suggest a number of ways to implement a large number of virtual timers for use by user processes, given only a single hardware timer. There are many more ways of doing this! This lesson applies to any number of virtual resources created by operating systems; so, for example, a file system may create many virtual disks from one physical disk in many different ways, and a window manager may create many virtual display screens within the confines of one physical display screen in many different ways.
The semantics of timers is not always immediately obvious. When a process says "wait s seconds", this does not mean to wait exactly s seconds and then continue. Rather, it means to wait at least s seconds. If there are other ready processes at the end of the wait, we generally make no attempt to predict which process will be running at the end of the wait; we merely insist that the waiting process become ready when the timer expires.
This is consistent with the usual weak assumptions made about parallel
processes -- that they advance unpredictably, so that we can make no
assumptions about how CPU resources are divided between them. Usually,
proof of correctness under these weak assumptions is easier than proof under
more complex assumptions, so we use these weak assumptions even when stronger
assumptions are reasonable.
Note that the presence of a timed-wait or equivalent service is one of the defining characteristics of a real-time operating system! Real-time applications demand stronger assumptions. Typically, the defining characteristic of a real-time system is the presence of deadlines by which the software must complete some actions.
When parallel programming is applied to real-time systems, we must ask
if there is a feasible schedule. That is, we ask not only
if there is enough CPU time overall to complete the entire computation,
but we also ask if each task can accomplish the necessary computation
before each deadline.
If there is no feasible schedule, the system cannot succeed, but if there is a feasible schedule, we must also ask whether the scheduling algorithm is able to find a feasible schedule among the infinite set of possible schedules. Two classes of real-time scheduling algorithms have been widely studied.
Rate monotonic priority scheduling will always find feasible schedules if
there is approximately 1/3 idle CPU capacity. In this scheme, processes
are scheduled with priorities related to the intervals between deadlines.
Processes dealing with deadlines that have a short interval are given
high priorities, while those dealing with deadlines that have a long
interval are given relatively lower priorities.
Usually, rate monotonic scheduling is discussed under the assumption that
intervals are periodic, in which case, the priority of a process is fixed
proportionally to the rate at which it must meet deadlines. In fact, there
is no reason to limit this scheme to periodic deadlines, but introducing
aperiodic deadlines does force the system to use dynamically varying
priorities.
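As a concrete check of that 1/3 figure (with task parameters invented for illustration): the classic Liu and Layland bound says that n independent periodic tasks, each with deadline equal to its period, are guaranteed schedulable under rate monotonic priorities when total utilization is at most n(2^(1/n) - 1), which falls toward ln 2, about 0.69, as n grows. For example, a task needing 3 ms of CPU every 10 ms and another needing 10 ms every 25 ms have utilization 3/10 + 10/25 = 0.70, below the two-task bound 2(2^(1/2) - 1) ≈ 0.83, so rate monotonic scheduling, giving the 10 ms task the higher priority, is guaranteed to meet every deadline.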
Deadline based scheduling uses the time of the next deadline to be met by a process as its priority. Thus, if a process must finish some job by time t, the priority of that job is t. Distant future deadlines have low priority, while imminent deadlines have high priority. Deadline based scheduling is somewhat uncommon in practice, but it can find a feasible schedule even when there is no excess CPU capacity.