29. Atomic Transactions

Part of the 22C:116 Lecture Notes for Fall 2002
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Introduction

The usual assumption surrounding such assignments as

          A := 0; A := -1

is that assignment is atomic, that is, it either happens or it doesn't. If the above two statements are executed in parallel, or if one statement is interrupted in order to complete the other, a programmer will typically expect only two possible outcomes. These are expressed by the following postcondition:

          (A = 0) or (A = -1)

What if the variable A is 64 bytes long on a machine where all loads and stores are in terms of 32 bit quantities? Or, on a small microcontroller, what if the variable A is 16 bits long, on a machine where all load and store operations operated on 8 bit quantities. Suddenly, there are a number of other possible outcomes! Each of the following outcomes is now possible on the 8/16 bit microcontroller:

          (A = 0) or (A = -1)
          or (A = 255)
          or (A = -256)

If the low half of the word A is A0 and the high half is A1, the following sequence of assignments will give the value 255:

          Process 1  Process 2
          A0 := 0;
		     A0 := 255;
		     A1 := 255;
          A1 := 0;

The usual solution to this is to use some kind of mutual exclusion mechanism, so that the assignments to A0 and A1 are done in a critical section, but this cannot prevent failures. Consider the following:

          Process 1  Process 2
          A0 := 0;
          A1 := 0;
		     A0 := 255;
		   < A1 := 255; > -- fails

Such a failure, affecting one process but not the other, is easily imagined on a multiprocessor, but even on a uniprocessor, if two different users are running code that shares a memory segment, and then one user hits control-C to kill a process, this kind of consequence is possible.

The consequence of this partial failure is that the illegal value 255 still appears in the shared variable, so that, even if the mutual exclusion mechanism can recover from the failure of one process inside the critical section, and grant entry to another, the shared data is corrupted.

This problem shows up whenever the value of a shared variable can only be updated one piece at a time. For example, if A is a logical record on disk that consists of multiple physical records. The problem may occur not only on parallel computers, but also on purely sequential machines when there is asynchronous context switching.

Analogous problems can occur in the absence of context switching, if a sequence of assignments is interrupted by a failure. Consider:

          A0 := 0;
          A1 := 0;
            ...      -- some time later
          A0 := 255;
        < A1 := 255; > -- fails

Here, if the data being updated is on disk or in "persistant RAM" such as RAM with battery backup or old-fashioned core memory, the failure can leave an illegal value in memory, a value that was only partially updated.

Pointer Assignment

One way to assure that a shared variable is updated on an all or none basis is to perform the updates "off line". This requires a way to atomically move the updated copy into public view. Consider the following scheme:

The publically visible copy of a composite shared variable is pointed to by a pointer; a process wishing to update the composite variable updates a copy, and then atomically updates the pointer. We assume that updates to pointers are atomic -- that is, that pointers occupy exactly one hardware word, where a word is defined as the unit of atomic update in memory.

     P a pointer to the shared value

     Inspect(V):
        V := P^;  -- return the value pointed to by P

     Update(V):
	new( temp );
        temp^ := V;
        P := temp; -- make P point to the desired value

The above code uses a newly allocated cell from the heap for each update to the atomicly updated variable. This generality is unnecessary. If N processes compete for access to the shared variable, it must be possible for each process to independently update its own private copy of the variable; thus, the atomically updatable pointer must have at least N possible values, corresponding to the private variables of each process.

Therefore, if the machine allows atomic updates of n bit quantities in memory, this scheme allows up to 2n processes to compete.

In fact, we do not need a completely general heap! If each process has two cells able to hold a single value each, and if each process uses these cells in turn, updating the least recently updated cell each time an update is required, and finishing the update with a pointer assignment, this scheme will work. Therefore, pointer assignment schemes can support atomic update on a 4-bit machine with as many as 8 processes, or on an 8-bit machine with as many as 128 processes.

Stable Storage

How do you make an assignments to a multi-byte object look atomic without using pointer assignment and without using memory proportional to the number of processes? This is easy if there is no problem with failure, but it is harder when failures are possible!

Leslie Lamport developed algorithms for updating a shared variable in the face of failures. This assumes that a mutual exclusion algorithm guarantees that only one process tries to update the variable at a time, and in this context, it guarantees that the result of the update will be correct, in the face of failure.

The basic operations offered by Lamport's stable storage algorithms are:

    inspect( V ) returns value   (or V.inspect()        )
    update( V, value )           (   V.update( value )  )

Lamport's stable storage rests on a new and redundant representation for the stored value and it rests on two procedures (or protocols) for inspecting and updating the stored value.

Conceptually, it is reasonable to use a client-server world view and imagine the inspect and update procedures as being the services offered by the server to clients. If the server fails, we can easily start a new server, as long as the variable itself is stored in such a way that it survives the failure.

A stable variable V is represented as a record where every field is duplicated:

            Copy1  -- the value stored
            Time1  -- the time it was stored
            Sum1   -- a checksum over the value and time

            Copy2
            Time2
            Sum2
There are two copies of the value, and for each copy, there is a record of the time at which the value was last updated and a record of the checksum computed as of the last update.

The fault tolerance of Lamport's scheme improves if the two (or more) copies of the <value, time, checksum> tuple are stored in such a way that failures only destroy one copy at a time. Thus, they should be in different memory modules, or on different disk drives. If the system is physically distributed, they should be geographically separated and connected to different computer systems.

The update and inspect operations must be performed inside critical sections, and if failure is to be survived, these critical sections must use some algorithm that can detect failure of a process inside a critical section and release any associated mutual exclusion semaphores.

        Procedure Update( V, value )
        begin
           update time

           V.Copy1 := value
           V.Time1 := time
           V.Sum1  := Checksum( value, time )

           -- wait for the assignment to really finish

           V.Copy2 := value
           V.Time2 := time
           V.Sum2  := Checksum( value, time )

           -- wait for the assignments to really finish
        end

The utility of this code relies on keeping two (or more) copies of the value, where no failure will corrupt more than one copy at a time. The wait for one update to finish before starting another update is very important, in this regard.

The assignment statements shown above will not, typically, execute instantaneously. On a cache machine, for example, there may be a delay before the values are written back to main memory. If disks are involved, there may be a considerable time between the issuing of a write request and the actual write to disk. If disk cache software is involved, it is even possible that the write to disk will never happen.

This illustrates something that was pointed out earlier, in the context of the discussion of disk caches. In general, fancy cacheing algorithms can improve performance, but they can get in the way when fault tolerance is the goal! We must have a way to force cached values out into the real memory before we can rely on this stable storage algorithm. In systems derived from UNIX, the kernel primitive that does this is fsync().

        Procedure Inspect( V )
        begin
           -- optionally start by reading all fields from disk

           if V.Sum1 = checksum( V.Copy1, V.Time1 )
              if V.Sum2 = checksum( V.Copy2, V.Time2 )
                 if V.Time1 > V.Time2
                    value = V.Copy1
                 else
                    value = V.Copy2
                 endif
              else
                 value = V.Copy1
              endif
           else
              if V.Sum2 = checksum( V.Copy2, V.Time2 )
                 value = V.Copy2
              else
                 -- failure --
              endif
           endif
           return value
        end

This code is fairly simple -- there are four cases to consider:

1) There were no errors In this case, the checksums on both copies will be valid, both will be the same, and they will have the same timestamp. In this case, either copy may be returned.

2) There was a failure such that update managed to write one updated copy of the data, but it didn't get to start writing the other updated copy. In this case, the checksums on both copies will be valid, but their timestamps will differ. Return the copy with the most recent timestamp.

3) There was a failure during one of the updates, or there was a failure that corrupted one of the stored values between updates. In this case, the checksum on the copy in question will be invalid. The other copy should still be OK and can be returned.

If the failure occurs during the write of the second copy, it will be as if the write was successful because the first copy will have already been written and will be available for use. If the failure occurs during the write of the first copy, it will be as if the write never occurred.

4) A failure or failures destroy all copies. There is nothing that can be done about this, but it takes multiple failures to cause this kind of damage, and the probability of this multiple failure can be made arbitrarily low by storing more copies in more widely distributed locations.

Note that the stable storage update procedure may be made to be even more reliable if it does a write-read-verify cycle on each copy it writes out, that is, it writes each to memory and then reads it back and verifies correct storage before continuing.

Also, note that no checksum algorithm can detect all possible errors in the stored data. Checksums can be made arbitrarily error resistant, but but they cannot be made perfect. For any checksum algorithm, there is some combination of errors that will defeat it!

Transactions

Atomic assignment and inspection are not enough to guarantee that a shared variable is used correctly through possible fault sequences! For example, consider the problem of transferring money from one checking account to another:

      Jones pays Smith some Amount:
         Jones := Jones - Amount
         Smith := Smith + Amount

To avoid accidentally creating or destroying money, either both updates must be made before an error occurs, or neither must be made! In the general case, there may be three machines here, one belonging to a bank where Smith has an account, one belonging to a bank where Jones has an account, and a third machine that mediates the transaction.

The term transaction comes from the financial world, as is suggested by the above example. Transactions can be quite complex: For example, if I deposit a check in a bank, the transaction involves debting the account of the check writer, crediting my account, and crediting the sum representing the current day's checks received.

Each transaction typically involves three sets of variables:

  V       -- the set of variables
  I in V  -- the subset of V inspected
  U in V  -- the subset of V updated

In the following, we will make the slightly stronger assumption:

  U in I  -- updated variables are all inspected

Typically, we assume that every variable is associated with a lock, a binary mutual exclusion semaphore; this allows processes to claim exclusive use of that variable. If a variable is to be updated but does not need inspection, we lock it as if we needed to inspect it first.

In one sense, atomic transactions are trivial to implement. The entire database needed for the transaction could be stored as a single variable using Lamport's stable storage. A single mutual exclusion semaphore suffices to guard access to this variable.

This trivial solution is unacceptable! Most interesting databases cannot be claimed totally for each update. Instead, the process performing an update must claim only part of the structure, locking only those variables needed while allowing other processes to operate on other parts of the structure at the same time.

Consider, for example, the Bank of America. This is a large bank with branches all over California. It would be impossible to operate such a bank if each transaction required giving the teller in charge of that transaction exclusive use of the bank's entire data base for the duration of the transaction.

The sets I (inspected variables) and U (updated variables) are not known in advance. Typically, some subset of I must be inspected in order to determine the other variables in I and U. This leads to the following two-phase model of transactions.

     Phase 1
        Claim locks and inspect variables
        until all needed variables are locked.

All variables must be locked prior to inspection. If multiple variables are locked, deadlock is possible. Deadlocks can be resolved by aborting and restarting transactions because they always occur in phase 1, before the transaction has modified any shared variables.

Deadlocks can be avoided by claiming locks in a standard order, but this is not always practical, in the general case. For example, if each database record contains pointers to other records that may need to be locked and inspected, you must claim the record containing the pointer in order to follow the pointer, and any cycles in the pointer structure are potential sources of deadlock.

     Phase 2
        Update variables and release locks!

The 2 phase model can be improved by using more interesting locks. Instead of 2-state locks -- where the two states are free and in-use, for example, 3-state locks can be used, with the following states:

            free
            read-only
            in-use (read-write)

If a process only needs to read a variable, it may claim the variable by setting the lock to read-only. This is legal if the previous state of the lock was free or read-only. If a process may need to write a variable, it must claim the lock as in-use, a claim that can only be made if the lock was free. Thus, a variable may have multiple readers. Locks that support this are said to solve the readers-writers problem.

The 2 phase model guarantees that things will work, but it is overly restrictive. There are specific problems where a greater degree of concurrent access can be allowed by violation of the two-phase model.

For example, consider a shared lexicographic binary tree. The two-phase model requires that a user of any node in this tree must lock all nodes on the path from the root to the node in question before releasing any of the locks.

In fact, the data structure will work just as well if the user only holds a few locks at a time. Specifically, the user first claims the lock on the root, and then claims the lock on the child of the root that is on the path to the target node in the tree. As soon as it is determined that the root is not the direct parent of the target, the root can be released and the process can be continued recursively. In the end, the process holds a lock on the target node and on the parent of the target node.

Note: The above definitions of the two-phase model of transactions do not guarantee atomicity. They only describe the mutual exclusion model that atomic transaction systems provide for the user. The implementation of atomicity in this model will be the subject of the next lecture!