The usual assumption surrounding such assignments as

    A := 0;        A := -1;

is that assignment is atomic; that is, each assignment either happens
or it doesn't, and if the above two statements are executed in
parallel, there are only two possible outcomes, expressed by the
following postcondition:

    (A = 0) or (A = -1)

What if the variable A is 64 bits long on a machine where all loads
and stores are in terms of 32 bit quantities?  Equivalently, for a
simpler example, what if the variable is 16 bits long, on a machine
where all load and store operations operate on 8 bit quantities?
Suddenly, there are a number of other possible outcomes!  Each of
the following outcomes is now possible:

    (A = 0) or (A = -1) or (A = 255) or (A = -256)

If the low half of the word A is A0 and the high half is A1, the
following sequence of assignments will give the value 255:

    Process 1       Process 2
    A0 := 0;
                    A0 := 255;
                    A1 := 255;
    A1 := 0;

The usual solution to this is to use some kind of mutual exclusion
mechanism, so that the assignments to A0 and A1 are done in a
critical section, but this cannot prevent failures.  Consider the
following:

    Process 1       Process 2
    A0 := 0;
    A1 := 0;
                    A0 := 255;
                    < A1 := 255; >   -- fails

The consequence of this is that the illegal value 255 still appears
in the shared variable, so that, even if the mutual exclusion
mechanism can recover from the failure of one process inside the
critical section, and grant entry to another, the shared data is
corrupted.
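To make this concrete, here is a small C sketch that plays out, by
hand, the failure-free interleaving shown above; the representation
of A as two byte-wide halves and the names A0, A1 and value_of_A are
assumptions made for the illustration:

    #include <stdio.h>
    #include <stdint.h>

    /* a 16-bit shared variable updated one byte at a time, as on a
       machine whose loads and stores are 8-bit quantities */
    static uint8_t A0, A1;        /* low and high halves of A */

    static int16_t value_of_A(void) {
        return (int16_t)((uint16_t)A0 | ((uint16_t)A1 << 8));
    }

    int main(void) {
        /* one interleaving of "A := 0" (process 1) with
           "A := -1" (process 2), written out by hand */
        A0 = 0;               /* process 1: low half of 0 */
        A0 = 255;             /* process 2: low half of -1 */
        A1 = 255;             /* process 2: high half of -1 */
        A1 = 0;               /* process 1: high half of 0 */
        printf("A = %d\n", value_of_A());   /* prints A = 255 */
        return 0;
    }

The mirror-image interleaving leaves A0 = 0 and A1 = 255, which
reads back as -256 on a two's complement machine.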
This problem shows up whenever the value of a shared variable can
only be updated one piece at a time; for example, when A is a logical
record on disk that consists of multiple physical records.  The
problem obviously occurs on parallel computers, but it also occurs on
purely sequential machines when there is asynchronous context
switching, or when the shared variable is in some persistent memory
device such as disk or battery backed-up RAM and the CPU halts
unexpectedly, for example, because of a power failure.
One way to ensure that a shared variable is updated on an
all-or-none basis is to perform the updates "off line".  This
requires a way to atomically move the updated copy into public view.
Consider the following scheme:

The publicly visible copy of a composite shared variable is pointed
to by a pointer; a process wishing to update the composite variable
updates a private copy, and then atomically updates the pointer.

    P   a pointer to the shared value

    Inspect(V):
       V := P^;

    Update(V):          -- temp points to this process's private copy
       temp^ := V;
       P := temp;

If N processes compete for access to the shared variable, it must be
possible for each process to independently update its own private
copy of the variable; thus, the atomically updatable pointer must
have at least N possible values, corresponding to the private
variables of each process.
Therefore, if the machine allows atomic updates of N bit quantities in memory, this scheme allows up to 2**N processes to compete.
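In C-like terms, this scheme might be sketched as follows, using a
C11 atomic pointer as P; the two-buffer pool and all the names here
are assumptions made for the illustration:

    #include <stdatomic.h>

    /* a composite value too large for a single atomic store */
    struct value { int field[4]; };

    /* one private buffer per competing process; with N processes the
       pointer must be able to take at least N distinct values */
    static struct value buffer[2];
    static _Atomic(struct value *) P = &buffer[0];

    /* Inspect(V):  V := P^ */
    void inspect(struct value *V) {
        *V = *atomic_load(&P);
    }

    /* Update(V):  temp^ := V;  P := temp
       The new value is built off line in this process's private
       buffer, then published with one atomic pointer store. */
    void update(const struct value *V, struct value *temp) {
        *temp = *V;
        atomic_store(&P, temp);
    }

Only the pointer load and store need to be atomic; the copies into
and out of the buffers happen off line.  A full treatment needs
enough buffers that a buffer is never recycled while a reader may
still be copying from it.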
How do you make an assignment to a multi-byte object look atomic?
This is easy if there is no problem with failure, but it is harder
when failures are possible!
Leslie Lamport developed algorithms for updating a shared variable
in the face of failures.  These algorithms assume that a mutual
exclusion algorithm guarantees that only one process tries to update
the variable at a time; in this context, they guarantee that the
result of the update will be correct, even in the face of failure.
The basic operations offered by Lamport's stable storage algorithms are:
    inspect( V )
    update( V )

Lamport's stable storage rests on a new and redundant representation
for the stored value, and on two procedures (or protocols) for
inspecting and updating the stored value.
Conceptually, it is reasonable to use a client-server world view and imagine the inspect and update procedures as being the services offered by the server to clients. If the server fails, we can easily start a new server, as long as the variable itself is stored in such a way that it survives the failure.
A stable variable V is represented as a record where every field is duplicated:
    Copy1   Time1   Sum1
    Copy2   Time2   Sum2

There are two copies of the value, and for each copy, there is a
record of the time at which the value was last updated and a record
of the checksum computed as of the last update.
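In a C-like notation, this representation might look like the
following; the field widths are arbitrary choices for the sketch:

    #include <stdint.h>

    #define VALUE_SIZE 64      /* size of the stored value, for the sketch */

    struct copy {
        uint8_t  value[VALUE_SIZE];   /* Copy1 / Copy2 */
        uint64_t time;                /* Time1 / Time2: time of last update */
        uint32_t sum;                 /* Sum1 / Sum2: checksum as of that update */
    };

    struct stable {
        struct copy copy1;
        struct copy copy2;
    };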
The fault tolerance of Lamport's scheme improves if the two
(or more) copies of the value are stored on separate devices, so
that no single failure can corrupt more than one copy.
The update and inspect operations must be performed inside critical
sections, and if failure is to be survived, these critical sections
must use some algorithm that can detect failure of a process inside
a critical section and release any associated mutual exclusion semaphores.
    Procedure Update( V )
    begin
       update time            -- get a new timestamp
       Copy1 := V
       Time1 := time
       Sum1 := Checksum( V, time )
       -- wait for the update to finish
       Copy2 := V
       Time2 := time
       Sum2 := Checksum( V, time )
       -- wait for the update to finish
    end

The utility of this code relies on keeping two (or more)
copies of the value, where no failure will corrupt more than
one copy at a time.  The wait for one update to finish before
starting another update is very important, in this regard.
The assignment statements shown above will not, typically,
execute instantaneously. On a cache machine, for example, there
may be a delay before the values are written back to main memory.
If disks are involved, there may be a considerable time between
the issuing of a write request and the actual write to disk.
If disk cache software is involved, the write to disk may never
happen.
This illustrates something that was pointed out earlier, in the
context of the discussion of disk caches. In general, fancy
caching algorithms can improve performance, but they can get in
the way when fault tolerance is the goal!
    Procedure Inspect( V )
    begin
       -- optionally start by reading all fields from disk
       if Sum1 = Checksum( Copy1, Time1 ) then
          if Sum2 = Checksum( Copy2, Time2 ) then
             -- both copies are good; use the newer one
             if Time1 > Time2 then
                V := Copy1
             else
                V := Copy2
             endif
          else
             V := Copy1
          endif
       else
          if Sum2 = Checksum( Copy2, Time2 ) then
             V := Copy2
          else
             -- failure --
          endif
       endif
    end

This code is fairly simple -- there are four cases to
consider:
1) There were no errors
In this case, the checksums on both copies will be valid,
both will be the same, and they will have the same
timestamp.
In this case, either copy may be returned.
2) There was a failure such that update managed to write one
updated copy of the data, but it didn't get to start
writing the other updated copy.
In this case, the checksums on both copies will be valid,
but their timestamps will differ. Return the copy with
the most recent timestamp.
3) There was a failure during one of the updates, or there
was a failure that corrupted one of the stored values between
updates.
In this case, the checksum on the copy in question will
be invalid. The other copy should still be OK and can be
returned.
If the failure occurs during the write of the second copy,
it will be as if the write was successful because the first
copy will have already been written and will be available
for use.
If the failure occurs during the write of the first copy,
it will be as if the write never occurred.
4) A failure or failures destroy all copies. There is nothing that
can be done about this, but it takes multiple failures to cause
this kind of damage, and the probability of this multiple failure
can be made arbitrarily low by storing more copies in more widely
distributed locations.
Note that the stable storage update procedure can be made even
more reliable if it does a write-read-verify cycle on each
copy it writes out, that is, it writes each to memory and then reads
it back and verifies correct storage before continuing.
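A sketch of such a write-read-verify cycle against a POSIX file
descriptor might look like the following; the function name and the
512-byte limit are assumptions of the sketch:

    #include <sys/types.h>
    #include <unistd.h>
    #include <string.h>

    /* write one copy, force it toward the device, read it back, and
       verify; returns 0 on verified success, -1 on failure.  This
       helps only to the extent that the cache layers underneath are
       honest -- a disk cache that never writes defeats it, as noted
       above. */
    int write_verified(int fd, const void *buf, size_t len, off_t off) {
        char check[512];

        if (len > sizeof check) return -1;
        if (pwrite(fd, buf, len, off) != (ssize_t)len) return -1;
        if (fsync(fd) != 0) return -1;
        if (pread(fd, check, len, off) != (ssize_t)len) return -1;
        return memcmp(buf, check, len) == 0 ? 0 : -1;
    }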
Also, note that no checksum algorithm can detect all possible errors in
the stored data. Checksums can be made arbitrarily error resistant,
but they cannot be made perfect. For any checksum algorithm, there
is some combination of errors that will defeat it!
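As a small illustration, consider a checksum that simply adds up
the bytes of the stored value: any error that merely rearranges
those bytes leaves the sum unchanged, so the corruption goes
undetected.  The two strings below are contrived for the example:

    #include <stdio.h>
    #include <stdint.h>

    /* a deliberately weak checksum: the sum of the bytes */
    static uint32_t add_checksum(const uint8_t *p, size_t n) {
        uint32_t s = 0;
        while (n--) s += *p++;
        return s;
    }

    int main(void) {
        uint8_t good[] = "pay 10 to A, 99 to B";
        uint8_t bad[]  = "pay 99 to A, 10 to B";  /* same bytes, rearranged */

        /* both sums agree, so this corruption is not detected */
        printf("%u %u\n",
               (unsigned) add_checksum(good, sizeof good),
               (unsigned) add_checksum(bad, sizeof bad));
        return 0;
    }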
Atomic assignment and inspection are not enough to guarantee that
a shared variable is used correctly in the face of faults! For
example, consider the problem of transferring money from one checking
account to another:
    Jones pays Smith some Amount:

       Jones := Jones - Amount
       Smith := Smith + Amount

To avoid accidentally creating or destroying money, either both updates
must be made before an error occurs, or neither must be made!

In the general case, there may be three machines here, one belonging
to a bank where Smith has an account, one belonging to a bank
where Jones has an account, and a third machine that mediates the
transaction.
The term transaction comes from the financial world, as is
suggested by the above example. Transactions can be quite complex:
For example, if I deposit a check in a bank, the transaction involves
debiting the account of the check writer, crediting my account, and
crediting the sum representing the current day's checks received.
Each transaction typically involves three sets of variables:
    V         -- the set of variables
    I in V    -- the subset of V inspected
    U in V    -- the subset of V updated

Typically, we assume that every variable is associated with a lock,
a binary mutual exclusion semaphore; this allows processes to claim
exclusive use of that variable.
In one sense, atomic transactions are trivial to implement. The
entire database needed for the transaction could be stored as a
single variable using Lamport's stable storage. A single mutual
exclusion semaphore suffices to guard access to this variable.
This trivial solution is unacceptable! Most interesting databases
cannot be locked in their entirety for each update. Instead, the process
performing an update must claim only part of the structure, allowing
other processes to operate on other parts of the structure at the
same time.
Consider, for example, the Bank of America. This is a large bank
with branches all over California. It would be impossible to
operate such a bank if each transaction required giving the teller
in charge of that transaction exclusive use of the bank's entire
database for the duration of the transaction.
The sets I and U are not known in advance.
Typically, some subset of I must be inspected in order
to determine the other variables in I and U.
This leads to the two-phase model of transactions.
    Phase 1:  Claim locks and inspect variables
              until all needed variables are locked.
              All variables must be locked prior to inspection.

If multiple variables are locked, deadlock is possible. Deadlocks
can be resolved by aborting and restarting transactions because
they always occur in phase 1, before the transaction has modified
any shared variables.

    Phase 2:  Update variables and release locks!
Deadlocks can be avoided by claiming locks in a standard order, but
this is not always practical in the general case. For example,
if each database record contains pointers to other records that must
be locked, you must claim the record containing the pointer in order
to follow the pointer, and any cycles in the pointer structure are potential
sources of deadlock.
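When the set of records to lock is known in advance, the
standard-order idea can be sketched with POSIX mutexes by always
claiming locks in ascending address order; the account record and
transfer function are contrived for the illustration:

    #include <pthread.h>
    #include <stdint.h>

    struct account { pthread_mutex_t lock; long balance; };

    /* claim both locks in a standard order -- here, ascending
       address -- so that two concurrent transfers can never wait on
       each other in a cycle; assumes from != to */
    void transfer(struct account *from, struct account *to, long amount) {
        struct account *first  = (uintptr_t)from < (uintptr_t)to ? from : to;
        struct account *second = (uintptr_t)from < (uintptr_t)to ? to : from;

        pthread_mutex_lock(&first->lock);
        pthread_mutex_lock(&second->lock);
        from->balance -= amount;
        to->balance   += amount;
        pthread_mutex_unlock(&second->lock);
        pthread_mutex_unlock(&first->lock);
    }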
The two-phase model guarantees that things will work, but it is overly
restrictive. There are specific problems where a greater degree of
concurrent access can be allowed by violation of the two-phase model.
For example, consider a shared lexicographic binary tree. The
two-phase model requires that a user of any node in this tree must
lock all nodes on the path from the root to the node in question
before releasing any of the locks.
In fact, the data structure will work just as well if the user only
holds a few locks at a time. Specifically, the user first claims the
lock on the root, and then claims the lock on the child of the root
that is on the path to the target node in the tree. As soon as it
is determined that the root is not the direct parent of the target,
the root can be released and the process can be continued recursively.
In the end, the process holds a lock on the target node and on the
parent of the target node.
The two-phase model can be improved by using more interesting locks.
Instead of 2-state locks, where the two states are free and in-use,
3-state locks can be used, with the following states:

    free
    read-only
    in-use

If a process only needs to read a variable, it may claim the variable
by setting the lock to read-only. This is legal if the previous
state of the lock was free or read-only. If a process may need to
write a variable, it must claim the lock as in-use, a claim that can
only be made if the lock was free. Thus, a variable may have multiple
readers.
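These three states behave exactly like a reader-writer lock, so one
concrete realization (an assumption of this sketch, not part of the
model itself) is a POSIX reader-writer lock:

    #include <pthread.h>

    /* free / read-only / in-use correspond to an unlocked,
       read-locked, and write-locked reader-writer lock */
    static pthread_rwlock_t lock = PTHREAD_RWLOCK_INITIALIZER;
    static long shared_variable;

    long read_variable(void) {
        long v;
        pthread_rwlock_rdlock(&lock);   /* claim read-only; many readers may hold this */
        v = shared_variable;
        pthread_rwlock_unlock(&lock);
        return v;
    }

    void write_variable(long v) {
        pthread_rwlock_wrlock(&lock);   /* claim in-use; granted only when the lock is free */
        shared_variable = v;
        pthread_rwlock_unlock(&lock);
    }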