How do you make an assignment to a multi-byte object look atomic? This is essential when updating a directory, updating a database entry, and in many other operating system contexts. A system failure during an update operation will prevent the completion of the operation, leaving part of the data structure or disk sector updated and part of it in its original state, with, perhaps, some parts entirely corrupted or erased.
One common solution is to make all references to the object through an atomically updatable pointer, and then record a complete new copy of the object without changing the original or the pointer. The actual moment of irreversible update then becomes the moment that the pointer is changed. If the system fails before the change of the pointer, the old object will still be used after the system recovers. If the pointer has been changed, the new object will be referenced.
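In main memory, where the hardware does provide an atomic word-sized store, the idea can be sketched in C11 as follows; the type Object and the variable current are invented for the illustration:

   #include <stdatomic.h>
   #include <stdlib.h>
   #include <string.h>

   typedef struct { char data[256]; } Object;

   _Atomic(Object *) current;   /* all readers follow this pointer */

   void update(const Object *new_value)
   {
       /* build a complete new copy, touching neither the original
          object nor the pointer */
       Object *copy = malloc(sizeof *copy);
       if (copy == NULL)
           abort();
       memcpy(copy, new_value, sizeof *copy);

       /* the moment of irreversible update: one atomic pointer store */
       atomic_store(&current, copy);

       /* (a real system must also reclaim the old copy safely) */
   }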
But how can we update the pointer on a device such as a disk or flash EEPROM where there is no atomic update primitive in the hardware? The solution to this problem is particularly important when you are trying to update the multiple copies of a disk sector in a RAID, because you would like the semantics to be "either the sector is updated correctly, or no change is made" and you would like this to be true despite a power failure at any point in time during the update.
Leslie Lamport (and independently, others) developed algorithms for updating a shared variable in the face of failures. These assume that a mutual exclusion algorithm guarantees that only one process tries to update the variable at a time; given that, the algorithms guarantee that the result of the update will be correct, even in the face of failure.
The basic operations offered by Lamport's stable storage algorithms are:
   inspect( V, value )           (or V.inspect() returns value)
   update( V, value )            (or V.update( value ))

Stable storage rests on a new, redundant representation for the stored value, and on two procedures (or protocols) for inspecting and updating that value.
Conceptually, it is reasonable to use a client-server world view and imagine the inspect and update procedures as being the services offered by the server to clients. If the server fails, we can easily start a new server, as long as the variable itself is stored in such a way that it survives the failure.
A stable variable V is represented as a record where every field is duplicated:

   Copy1 -- the value stored
   Time1 -- the time it was stored
   Sum1  -- a checksum over the value and time
   Copy2
   Time2
   Sum2

There are two copies of the value, and for each copy, there is a record of the time at which the value was last updated and a record of the checksum computed as of the last update.

The update operation writes one complete copy, waits for that write to finish, and only then writes the other:

   Procedure Update( V, value )
   begin
      update time
      V.Copy1 := value
      V.Time1 := time
      V.Sum1 := Checksum( value, time )
      -- wait for the assignments to really finish
      V.Copy2 := value
      V.Time2 := time
      V.Sum2 := Checksum( value, time )
      -- wait for the assignments to really finish
   end

The utility of this code relies on keeping two (or more) copies of the value, where no failure will corrupt more than one copy at a time. The wait for one update to finish before starting another is very important in this regard.
The fault tolerance of this scheme improves if the two (or more) copies of the variable are stored in places that fail independently, for example, on different disk drives or at widely distributed sites, so that no single failure can corrupt both copies.
The update and inspect operations must be performed inside critical
sections, and if failure is to be survived, these critical sections
must use some algorithm that can detect failure of a process inside
a critical section and release any associated mutual exclusion semaphores.
The assignment statements shown above will not, typically,
execute instantaneously. On a cache machine, for example, there
may be a delay before the values are written back to main memory.
If disks are involved, there may be a considerable time between
the issuing of a write request and the actual write to disk.
If disk cache software is involved, it is even possible that
the write to disk will never happen.
There is a problem here -- many modern inexpensive disk drives
refuse to tell the user when the actual write to the magnetic
disk surface has completed. The controller that is permanently
attached to the drive contains an internal buffer, and the controller
does everything it can to convince the user that the disk write
operation has been completed as soon as the data is written to this
buffer. This certainly helps improve system performance, but it makes
it difficult or even impossible to use such disks for stable storage.
Sometimes, the only way to force the drive to complete a write
operation is to tell it to park its heads and prepare for a shutdown.
Many operating system disk I/O drivers also attempt to introduce
buffering between the user and the disk. This can be done by disk
schedulers and RAM cache implementations that store frequently
referenced sectors in RAM and write them back to disk much later.
Such drivers can make it impossible to implement stable storage at
the application level, unless there is a system service to force
buffered data back to disk. Under UNIX, this service is called
fsync().
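For example, an application-level stable storage implementation might force each copy out to the disk as follows; this is a minimal POSIX sketch, and the file name handling is illustrative:

   #include <fcntl.h>
   #include <unistd.h>

   /* write a buffer and force it out of all software buffers to the disk */
   int write_and_flush(const char *path, const void *buf, size_t len)
   {
       int fd = open(path, O_WRONLY | O_CREAT, 0644);
       if (fd < 0)
           return -1;
       if (write(fd, buf, len) != (ssize_t)len ||
           fsync(fd) != 0) {            /* force buffered data to disk */
           close(fd);
           return -1;
       }
       return close(fd);
   }

Note that, as described above, even fsync() cannot defeat buffering hidden inside the drive itself.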
The inspect operation checks both copies and returns the most recent value that has a valid checksum:

   Procedure Inspect( V )
   begin
      -- optionally start by reading all fields from disk
      if V.Sum1 = checksum( V.Copy1, V.Time1 )
         if V.Sum2 = checksum( V.Copy2, V.Time2 )
            if V.Time1 > V.Time2
               value := V.Copy1
            else
               value := V.Copy2
            endif
         else
            value := V.Copy1
         endif
      else
         if V.Sum2 = checksum( V.Copy2, V.Time2 )
            value := V.Copy2
         else
            -- failure - both copies were corrupted --
         endif
      endif
      return value
   end

This code is fairly simple -- there are four cases to consider:

1) There were no errors
In this case, the checksums on both copies will be valid, both copies will hold the same value, and they will have the same timestamp. Either copy may be returned.
2) There was a failure such that update managed to write one
updated copy of the data, but it didn't get to start
writing the other updated copy.
In this case, the checksums on both copies will be valid,
but their timestamps will differ. Return the copy with
the most recent timestamp.
3) There was a failure during one of the updates, or there
was a failure that corrupted one of the stored values between
updates.
In this case, the checksum on the copy in question will
be invalid. The other copy should still be OK and can be
returned.
If the failure occurs during the write of the second copy, it will be as if the update was successful, because the first copy will have already been written and will be available for use. If the failure occurs during the write of the first copy, it will be as if the update never occurred.
4) A failure or failures destroy all copies. There is nothing that
can be done about this, but it takes multiple failures to cause
this kind of damage, and the probability of this multiple failure
can be made arbitrarily low by storing more copies in more widely
distributed locations.
Note that the stable storage update procedure can be made even more reliable if it does a write-read-verify cycle on each copy it writes out; that is, it writes each copy, then reads it back and verifies correct storage before continuing.
Also, note that no checksum algorithm can detect all possible errors in the stored data. Checksums can be made arbitrarily error resistant, but they cannot be made perfect. For any checksum algorithm, there is some combination of errors that will defeat it!
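To make the protocol concrete, here is a minimal C sketch of the record and the two procedures. The additive checksum, the fixed-size value, and the placeholder routines now() and force_to_medium() (standing in for a real timestamp source and for fsync() or better) are all assumptions of the sketch, not part of Lamport's algorithm as presented above:

   #include <stdint.h>
   #include <string.h>

   #define VALSIZE 64

   typedef struct {
       uint8_t  copy[VALSIZE];   /* the value stored */
       uint64_t time;            /* the time it was stored */
       uint64_t sum;             /* checksum over value and time */
   } Half;

   typedef struct { Half h1, h2; } Stable;   /* every field duplicated */

   static uint64_t current_time = 0;          /* placeholder timestamp source */
   static uint64_t now(void) { return ++current_time; }
   static void force_to_medium(void) { }      /* placeholder for fsync() or better */

   /* simple additive checksum over value and time; real systems would
      use something stronger, and no checksum can catch every error */
   static uint64_t checksum(const uint8_t *v, uint64_t t)
   {
       uint64_t s = t;
       for (int i = 0; i < VALSIZE; i++)
           s += v[i];
       return s;
   }

   static void write_half(Half *h, const uint8_t *v, uint64_t t)
   {
       memcpy(h->copy, v, VALSIZE);
       h->time = t;
       h->sum = checksum(v, t);
       force_to_medium();      /* wait for the writes to really finish */
   }

   void update(Stable *V, const uint8_t *value)
   {
       uint64_t t = now();
       write_half(&V->h1, value, t);   /* finish copy 1 completely... */
       write_half(&V->h2, value, t);   /* ...before starting copy 2 */
   }

   /* returns 0 and fills value on success, -1 if both copies are bad */
   int inspect(const Stable *V, uint8_t *value)
   {
       int ok1 = V->h1.sum == checksum(V->h1.copy, V->h1.time);
       int ok2 = V->h2.sum == checksum(V->h2.copy, V->h2.time);
       if (!ok1 && !ok2)
           return -1;                  /* failure - both copies corrupted */
       if (ok1 && (!ok2 || V->h1.time >= V->h2.time))
           memcpy(value, V->h1.copy, VALSIZE);
       else
           memcpy(value, V->h2.copy, VALSIZE);
       return 0;
   }

A write-read-verify step, as suggested above, would fit naturally at the end of write_half.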
Before discussing file systems, it's useful to spend some time on a simpler
class of devices: timers. In a sense, a timer is the simplest of all I/O
devices, because it performs no I/O. There are many kinds of timers; the
simplest is the periodic timer. All it does is request a periodic interrupt.
All modern operating systems, UNIX, Windows and Mac/OS included, use periodic timers. In the case of early UNIX, the timer fired 60 times per second, a frequency derived from the power supply. Modern UNIX systems almost universally have the system clock fire 100 times per second.
In the case of the DOS and Windows environments, the rate has classically been 18.2 ticks per second. Most modern systems use hardware that is able to provide far more general timer service, but commercially available operating systems frequently ignore the generality of the hardware in favor of using it to provide a simple periodic interrupt.
As an example of a very simple hardware interface, consider a timer that
fires every millisecond. This might present the following interface to
the software:
   expired -- once a millisecond, the timer sets this single-bit interface
   register. When this bit is set, the timer also requests an interrupt.
   The timer interrupt service software must reset this bit before
   returning from the interrupt or re-enabling interrupts, in order to
   turn off the interrupt request.
This simple timer will be used for the remainder of this lecture, but note
that far more complex timers are common. For example, interval timers
contain a register that is decremented at a constant rate, for example,
once per microsecond. The timer requests an interrupt when the register
reaches zero, and the interrupt service routine may load a new count in the
register to request the next interrupt.
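By way of illustration, service code for such an interval timer might look like the following C fragment; the register addresses and the one-microsecond rate are hypothetical, not those of any particular device:

   #include <stdint.h>

   /* hypothetical memory-mapped interval timer registers */
   #define TIMER_COUNT   (*(volatile uint32_t *)0xFFFF0010) /* decrements once per microsecond */
   #define TIMER_EXPIRED (*(volatile uint32_t *)0xFFFF0014) /* set when the count reaches zero */

   void timer_interrupt_service(void)
   {
       TIMER_EXPIRED = 0;    /* acknowledge: turn off the interrupt request */
       TIMER_COUNT = 1000;   /* request the next interrupt 1000 us from now */
       /* ... do the work associated with this timer expiration ... */
   }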
It is rare to find software that requires a periodic interrupt at the same rate as the hardware period of the timer. When the application requires a periodic interrupt at longer intervals, however, this can be synthesized from the basic periodic interrupt of the hardware clock as follows:

   periodic-interrupt-service-routine:
      reset( expired )
      countdown := countdown - 1
      if countdown = 0 call user-service
      return

Here, the procedure user-service is called at the end of an interval defined by the periodic decrementing of the variable countdown; user-service should reset countdown each time it is called. If user-service always sets countdown to the same value, the user routine will be called periodically; if user-service sets different values each time it is called, it will be aperiodic.

This creates, in software, a programmable interval timer, but most modern real-time-clock hardware directly supports this function, with a countdown register decrementing at several megahertz and the rule that the hardware will request an interrupt when the register reaches zero.
Application programs rarely want just one interval timer. What they
typically want is a service such as the following, callable from within
any user process:
timed_wait( t ) -- when a process calls this operating system service,
that process will be delayed t time units (the particular time unit
depends on the system; we will use milliseconds for the examples presented
here).
Exercise: How can you implement the example timed_wait() service in terms of the UNIX timer primitives? See the man page for setitimer() for documentation.
On a system with only one process and an interval timer, this service
would be trivial to implement. On a system with more than one process
and a periodic timer, this service is far more interesting. In such a
context, many processes may be awaiting different times, and there may
easily be more than one process waiting for the same time instant.
Thus, there must be a list of waiting processes, with a record, for each
process, of the time it is awaiting. When a timer expires, it must somehow
search this list and wake up each process waiting for the current instant.
The need to block processes and then wake them up suggests that each timer record will have a semaphore on which the process blocks. This leads to the following data structure:

   Timer record:
      link fields, as needed for list management
      delay -- the requested time delay
      sem -- semaphore used to block process

   Timer list:
      a linked list of timer records

The wait service seen by an application program would then be as follows:

   timed_wait( t )
      get a timer record r
      initialize r.sem to a count of zero
      r.delay := t
      append( r, timer-list )
      P( r.sem )

A single timer record can be permanently attached to each process. The interrupt routine needed to support this service is as follows:

   periodic-interrupt-service-routine:
      reset( expired )
      for each timer record r in timer-list
         r.delay := r.delay - 1
         if r.delay <= 0 then
            remove( r, timer-list )
            V( r.sem )
         endif
      endloop
      return

Code written along these lines is actually used on some systems, but it begins to behave very badly on systems with large numbers of waiting processes. The problem is that interrupt service routines should never perform open-ended computations such as scanning linked lists. This work should be moved to an application process if at all possible.
Nonetheless, the above code implements what can properly be called a virtual interval timer available to each application process. This notion of virtual timers is an important example of a virtual device interface; its primary importance rests on the fact that timers, as a class of device, have the simplest of semantics.
Exercise: Explore the documentation for the UNIX timer services
and write a critical essay on the deficiencies of these services. There are
many!
The key to avoiding complex interrupt processing in the implementation of timer services is to rely on a sorted list of timer records, each holding the time being awaited. This requires the following data structure:

   Timer record:
      link fields, as needed for list management
      time -- the requested time at which the timer should fire
      sem -- semaphore used to block process

   Timer queue:
      a priority queue of timer records, sorted by time

   Clock:
      the current time

    _____________     ______     ______     ______
   | Timer Queue |-->| Link |-->| Link |-->| Link |--X
   |_____________|   |------|   |------|   |------|
                     | time |   | time |   | time |
                  /  |------|   |------|   |------|
   Sort by order /   | sem  |   | sem  |   | sem  |
   of time fields!   |______|   |______|   |______|

If the system is to run for long periods, the variables used to hold the clock and the time fields of timer records must have sufficient precision that overflows are unlikely during the life of the system. A year's worth of milliseconds will overflow a 32 bit counter, so realistic systems should use 48 or 64 bit counters.
Some systems maintain all time values in, for example, milliseconds since
some arbitrary dawn of time. Others maintain time values in more expensive
forms, with fields divided out for year, day, hour, minute, second and
millisecond. There is little reason to do the latter, however, since these
divisions are usually only needed for the output of dates, and internal
time and interval arithmetic is simplified by holding time values in a simple
form.
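For instance, advancing a time value by 1500 milliseconds is a single addition in the flat form, while the broken-out form needs carry propagation through every field. A small C illustration, with the broken-out layout invented for the example:

   #include <stdint.h>

   /* flat form: one addition */
   uint64_t wakeup_flat(uint64_t clock_ms)
   {
       return clock_ms + 1500;
   }

   /* broken-out form (illustrative layout): carries at every level */
   struct broken { int ms, sec, min, hour; };

   struct broken wakeup_broken(struct broken t)
   {
       t.ms += 1500;
       t.sec  += t.ms / 1000;  t.ms  %= 1000;
       t.min  += t.sec / 60;   t.sec %= 60;
       t.hour += t.min / 60;   t.min %= 60;
       return t;
   }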
Exercise: The classic Unix representation of the date uses a signed
32 bit count of the seconds since Midnight January 1, 1970, GMT. When will
the UNIX date overflow? (It is interesting to note that MacOS and UNIX
were both designed without Y2K errors in their operating kernels.)
Assuming that the time fields are absolute times, in milliseconds, and that the timer interrupt service routine maintains a global variable called clock, measuring milliseconds from the beginning of time, the wait service can be coded as follows:

   timed_wait( t )
      get a timer record r
      initialize r.sem to a count of zero
      r.time := clock + t
      enqueue( r, timer-queue )
      P( r.sem )

The interrupt routine needed to support this service is as follows:

   periodic-interrupt-service-routine:
      reset( expired )
      clock := clock + 1 -- count time
      while head( timer-queue ).time <= clock
         r := dequeue( timer-queue )
         V( r.sem )
      endloop
      return

This code rests on a priority queue. The simplest such data structure is a sorted linked list, where the enqueue service sorts new items into the list in time order, the head of the list is always available, and the dequeue service simply removes the head element from the list.
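Sorting on insertion keeps the interrupt routine's work at the head of the list. A minimal C sketch of such a sorted enqueue, with the record layout taken from the data structure above and the semaphore left abstract:

   #include <stdint.h>

   typedef struct timer_record {
       struct timer_record *next;  /* link field for list management */
       uint64_t time;              /* the time at which the timer fires */
       void *sem;                  /* semaphore used to block the process */
   } TimerRecord;

   /* insert r into the queue, keeping the list sorted by time so the
      interrupt routine only ever has to look at the head */
   void enqueue(TimerRecord *r, TimerRecord **queue)
   {
       while (*queue != NULL && (*queue)->time <= r->time)
           queue = &(*queue)->next;
       r->next = *queue;
       *queue = r;
   }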
This simple implementation moves most of the work to the enqueue service, where it belongs, but if the list is long, the enqueue service may take some time. There are faster algorithms that use binary trees or heap-ordered trees to achieve O(log n) operation times on the queue. You can get more information on this subject from "An Empirical Comparison of Priority Queue and Event Set Implementations," Communications of the ACM, 29, April 1986, pages 300-311. Code for interesting priority queue algorithms is available to WWW users.
Adapting the above code to a programmable interval timer poses some problems.
Primary among these is the problem of maintaining the clock. Every time the
interval timer expires, the clock can be updated to reflect the interval that
has just passed, but for inquiries about the clock value between updates,
the running count (countdown, in the example given earlier) must
be inspected to get the current value.
The above code relies on a global variable "clock" that is incremented once per timer interrupt. This is a source of significant problems in real systems. For example, consider a timer that increments once per millisecond. This will increment by 1000 per second (so about 10 bits of timer precision are devoted to counting seconds). Counting 60 seconds per minute takes roughly another 6 bits of precision, and counting 60 minutes per hour takes another 6 bits, leaving 10 bits out of a typical 32 bit word available for counting hours. 1000 hours is just 40 days of 25 hours each, so we can quickly estimate that our 32 bit counter will overflow in just 6 weeks! If we replace rough estimates such as those used above with accurate arithmetic, we get a clock lifetime of 49.71027 days.
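The exact figure is easy to verify:

   2^32 ms = 4,294,967,296 ms
   4,294,967,296 / (1000 * 60 * 60 * 24) = 4,294,967,296 / 86,400,000
                                         ≈ 49.71 days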
Using the above algorithm, we would expect our system to run correctly for
a month, and then we would expect a brief burst of irregular behavior as the
clock rolled over, after which it would deliver correct performance for
another month. This is hardly acceptable!
Therefore, we seek a better solution!
One approach is to simply use a higher precision counter. For example, we
could replace our 32 bit counter with a 64 bit counter; with this, our
system would last for millennia instead of mere months. We pay a price,
however, in slow arithmetic, unless we're running on a machine with a 64 bit
ALU.
Another approach is to note that, while we may well want our system to run for longer than one month, most of the timed delays requested by users are likely to be on the order of seconds. We can easily rewrite the timer data structures to support this by replacing the time field in each timer record with a relative delay field:

   Timer record:
      link fields, as needed for list management
      delay -- the delay between the previous timer and this one
      sem -- semaphore used to block process

   Timer queue:
      a sequential queue of timer records, sorted by expiration time

   Clock:
      the current time

    _____________     ______     ______     ______
   | Timer Queue |-->| Link |-->| Link |-->| Link |--X
   |_____________|   |------|   |------|   |------|
                     | delay|   | delay|   | delay|
   Relative delay /  |------|   |------|   |------|
   from one record   | sem  |   | sem  |   | sem  |
   to the next.      |______|   |______|   |______|

This data structure adds only marginally to the complexity of the timer services, while it allows very fast arithmetic to be used for typical short delays.
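The insert operation must now convert a total delay into a link-by-link remainder, and the per-tick work touches only the head of the list. A hedged C sketch under the same assumptions as before (record layout as above, semaphore left abstract):

   #include <stdint.h>

   typedef struct timer_record {
       struct timer_record *next;
       uint32_t delay;   /* delay relative to the previous record */
       void *sem;        /* semaphore used to block the process */
   } TimerRecord;

   /* insert r with total delay t: walk the list subtracting the delays
      already ahead of us, then store only the remainder */
   void enqueue_delta(TimerRecord *r, TimerRecord **queue, uint32_t t)
   {
       while (*queue != NULL && (*queue)->delay <= t) {
           t -= (*queue)->delay;
           queue = &(*queue)->next;
       }
       r->delay = t;
       if (*queue != NULL)
           (*queue)->delay -= t;   /* successor is now relative to r */
       r->next = *queue;
       *queue = r;
   }

   /* per-tick work: decrement only the head; wake every record that
      reaches zero (several may share one expiration time) */
   void tick(TimerRecord **queue)
   {
       if (*queue == NULL)
           return;
       if ((*queue)->delay > 0)
           (*queue)->delay -= 1;
       while (*queue != NULL && (*queue)->delay == 0) {
           TimerRecord *r = *queue;
           *queue = r->next;
           /* V( r->sem ) would go here */
       }
   }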
Summary: The above notes suggest a number of ways to implement a large number of virtual timers for use by user processes, given only a single hardware timer. There are many more ways of doing this! This lesson applies to any number of virtual resources created by operating systems; so, for example, a file system may create many virtual disks from one physical disk in many different ways, and a window manager may create many virtual display screens within the confines of one physical display screen in many different ways.
The semantics of timers is not always immediately obvious. When a process says "wait s seconds", this does not mean to wait exactly s seconds and then continue. Rather, it means to wait at least s seconds. If there are other ready processes at the end of the wait, we generally make no attempt to predict which process will be running at the end of the wait; we merely insist that the waiting process become ready when the timer expires.
This is consistent with the usual weak assumptions made about parallel
processes -- that they advance unpredictably, so that we can make no
assumptions about how CPU resources are divided between them. Usually,
proof of correctness under these weak assumptions is easier than proof under
more complex assumptions, so we use these weak assumptions even when stronger
assumptions are reasonable.
Note that the presence of a timed-wait or equivalent service is one of the defining characteristics of a real-time operating system! Real-time applications demand stronger assumptions. Typically, the defining characteristic of a real-time system is the presence of deadlines by which the software must complete some actions.
When parallel programming is applied to real-time systems, we must ask
if there is a feasible schedule. That is, we ask not only
if there is enough CPU time overall to complete the entire computation,
but we also ask if each task can accomplish the necessary computation
before each deadline.
If there is no feasible schedule, the system cannot succeed, but if there is a feasible schedule, we must also ask whether the scheduling algorithm is able to find a feasible schedule among the infinite set of possible schedules. Two classes of real-time scheduling algorithms have been widely studied.
Rate monotonic priority scheduling will always find feasible schedules if
there is approximately 1/3 idle CPU capacity. In this scheme, processes
are scheduled with priorities related to the intervals between deadlines.
Processes dealing with deadlines that have a short interval are given
high priorities, while those dealing with deadlines that have a long
interval are given relatively lower priorities.
Usually, rate monotonic scheduling is discussed under the assumption that
intervals are periodic, in which case, the priority of a process is fixed
proportionally to the rate at which it must meet deadlines. In fact, there
is no reason to limit this scheme to periodic deadlines, but introducing
aperiodic deadlines does force the system to use dynamically varying
priorities.
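As a concrete check of that 1/3 figure (with task parameters invented for illustration): the classic Liu and Layland bound says that n independent periodic tasks, each with deadline equal to its period, are guaranteed schedulable under rate monotonic priorities when total utilization is at most n(2^(1/n) - 1), which falls toward ln 2, about 0.69, as n grows. For example, a task needing 3 ms of CPU every 10 ms and another needing 10 ms every 25 ms have utilization 3/10 + 10/25 = 0.70, below the two-task bound 2(2^(1/2) - 1) ≈ 0.83, so rate monotonic scheduling, giving the 10 ms task the higher priority, is guaranteed to meet every deadline.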
Deadline based scheduling uses the time of the next deadline to be met by a process as its priority. Thus, if a process must finish some job by time t, the priority of that job is t. Distant future deadlines have low priority, while imminent deadlines have high priority. Deadline based scheduling is somewhat uncommon in practice, but it can find a feasible schedule even when there is no excess CPU capacity.