22C:116, Lecture Notes, Nov. 6, 1995

Douglas W. Jones
University of Iowa Department of Computer Science

  1. Atomic Transactions

    The usual assumption surrounding such assignments as

              A := 0; A := -1
    
    is that assignment is atomic, that is, it either happens or it doesn't, and that if the above two statements are executed in parallel, there are only two possible outcomes, expressed by the following postcondition:
              (A = 0) or (A = -1)
    
    What if the variable A is 64 bytes long on a machine where all loads and stores are in terms of 32 bit quantities? Equivalently, for a simpler example, what if the variable 16 bits long, on a machine where all load and store operations operated on 8 bit quantities. Suddenly, there are a number of other possible outcomes! Each of the following outcomes is now possible:
              (A = 0) or (A = -1)
              or (A = 255)
              or (A = -256)
    
    If the low half of the word A is A0 and the high half is A1, the following sequence of assignments will give the value 255:
              Process 1  Process 2
              A0 := 0;
    		     A0 := 255;
    		     A1 := 255;
              A1 := 0;
    
    The usual solution to this is to use some kind of mutual exclusion mechanism, so that the assignments to A0 and A1 are done in a critical section, but this cannot prevent failures. Consider the following:
              Process 1  Process 2
              A0 := 0;
              A1 := 0;
    		     A0 := 255;
    		   < A1 := 255; > -- fails
    
    The consequence of this is that the illegal value 255 still appears in the shared variable, so that, even if the mutual exclusion mechanism can recover from the failure of one process inside the critical section, and grant entry to another, the shared data is corrupted.

    This problem shows up whenever the value of a shared variable can only be updated one piece at a time. For example, if A is a logical record on disk that consists of multiple physical records. The problem may obviously occur on parallel computers, but it also occurs on purely sequential machines when there is asynchronous context switching or where the shared variable is in some persistant memory device such as disk or battery backed-up RAM and the CPU halts unexpectedly, for example, because of a power failure.

  2. Pointer Assignment

    One way to assure that a shared variable is updated on an all or none basis is to perform the updates "off line". This requires a way to atomically move the updated copy into public view. Consider the following scheme:

    The publically visible copy of a composite shared variable is pointed to by a pointer; a process wishing to update the composite variable updates a copy, and then atomically updates the pointer.

         P a pointer to the shared value
    
         Inspect(V):
            V := P^;
    
         Update(V):
            temp^ := V;
            P := temp;
    
    If N processes compete for access to the shared variable, it must be possible for each process to independently update its own private copy of the variable; thus, the atomically updatable pointer must have at least N possible values, corresponding to the private variables of each process.

    Therefore, if the machine allows atomic updates of N bit quantities in memory, this scheme allows up to 2**N processes to compete.

  3. Stable Storage

    How do you make an assignments to a multi-byte object look atomic? This is easy if there is no problem with failure, but it is harder when failures are possible!

    Leslie Lamport developed algorithms for updating a shared variable in the face of failures. This assumes that a mutual exclusion algorithm guarantees that only one process tries to update the variable at a time, and in this context, it guarantees that the result of the update will be correct, in the face of failure.

    The basic operations offered by Lamport's stable storage algorithms are:

        inspect( V )
        update( V )
    
    Lamport's stable storage rests on a new and redundant representation for the stored value and it rests on two procedures (or protocols) for inspecting and updating the stored value.

    Conceptually, it is reasonable to use a client-server world view and imagine the inspect and update procedures as being the services offered by the server to clients. If the server fails, we can easily start a new server, as long as the variable itself is stored in such a way that it survives the failure.

    A stable variable V is represented as a record where every field is duplicated:

                Copy1
                Time1
                Sum1
    
                Copy2
                Time2
                Sum2
    
    There are two copies of the value, and for each copy, there is a record of the time at which the value was last updated and a record of the checksum computed as of the last update.

    The fault tolerance of Lamport's scheme improves if the two (or more) copies of the tuple are stored in such a way that failures only destroy one copy at a time. Thus, they should be in different memory modules, or on different disk drives. If the system is physically distributed, they should be geographically separated and connected to different computer systems.

    The update and inspect operations must be performed inside critical sections, and if failure is to be survived, these critical sections must use some algorithm that can detect failure of a process inside a critical section and release any associated mutual exclusion semaphores.

            Procedure Update( V )
            begin
               update time
    
               Copy1 := V
               Time1 := time
               Sum1  := Checksum( V, time )
    
               -- wait for the update to finish
    
               Copy2 := V
               Time2 := time
               Sum2  := Checksum( V, time )
    
               -- wait for the update to finish
            end
    
    The utility of this code relies on keeping two (or more) copies of the value, where no failure will corrupt more than one copy at a time. The wait for one update to finish before starting another update is very important, in this regard.

    The assignment statements shown above will not, typically, execute instantaneously. On a cache machine, for example, there may be a delay before the values are written back to main memory. If disks are involved, there may be a considerable time between the issuing of a write request and the actual write to disk. If disk cache software is involved, the write to disk may never happen.

    This illustrates something that was pointed out earlier, in the context of the discussion of disk caches. In general, fancy cacheing algorithms can improve performance, but they can get in the way when fault tolerance is the goal!

            Procedure Inspect( V )
            begin
               -- optionally start by reading all fields from disk
    
               if Sum1 = checksum( Copy1, Time1 )
                  if Sum2 = checksum( Copy2, Time2 )
                     if Time1 > Time2
                        V = Copy1
                     else
                        V = Copy2
                     endif
                  else
                     V = Copy1
                  endif
               else
                  if Sum2 = checksum( Copy2, Time2 )
                     V = Copy2
                  else
                     -- failure --
                  endif
               endif
            end
    
    This code is fairly simple -- there are four cases to consider:

    1) There were no errors In this case, the checksums on both copies will be valid, both will be the same, and they will have the same timestamp. In this case, either copy may be returned.

    2) There was a failure such that update managed to write one updated copy of the data, but it didn't get to start writing the other updated copy. In this case, the checksums on both copies will be valid, but their timestamps will differ. Return the copy with the most recent timestamp.

    3) There was a failure during one of the updates, or there was a failure that corrupted one of the stored values between updates. In this case, the checksum on the copy in question will be invalid. The other copy should still be OK and can be returned.

    If the failure occurs during the write of the second copy, it will be as if the write was successful because the first copy will have already been written and will be available for use. If the failure occurs during the write of the first copy, it will be as if the write never occurred.

    4) A failure or failures destroy all copies. There is nothing that can be done about this, but it takes multiple failures to cause this kind of damage, and the probability of this multiple failure can be made arbitrarily low by storing more copies in more widely distributed locations.

    Note that the stable storage update procedure may be made to be even more reliable if it does a write-read-verify cycle on each copy it writes out, that is, it writes each to memory and then reads it back and verifies correct storage before continuing.

    Also, note that no checksum algorithm can detect all possible errors in the stored data. Checksums can be made arbitrarily error resistant, but but they cannot be made perfect. For any checksum algorithm, there is some combination of errors that will defeat it!

  4. Transactions

    Atomic assignment and inspection are not enough to guarantee that a shared variable is used correctly through possible fault sequences! For example, consider the problem of transferring money from one checking account to another:

        Jones pays Smith some Amount
    
          Jones := Jones - Amount
          Smith := Smith + Amount
    
    To avoid accidentally creating or destroying money, either both updates must be made before an error occurs, or neither must be made! In the general case, there may be three machines here, one belonging to a bank where Smith has an account, one belonging to a bank where Jones has an account, and a third machine that mediates the transaction.

    The term transaction comes from the financial world, as is suggested by the above example. Transactions can be quite complex: For example, if I deposit a check in a bank, the transaction involves debting the account of the check writer, crediting my account, and crediting the sum representing the current day's checks received.

    Each transaction typically involves three sets of variables:

      V       -- the set of variables
      I in V  -- the subset of V inspected
      U in V  -- the subset of V updated
    
    Typically, we assume that every variable is associated with a lock, a binary mutual exclusion semaphore; this allows processes to claim exclusive use of that variable.

    In one sense, atomic transactions are trivial to implement. The entire database needed for the transaction could be stored as a single variable using Lamport's stable storage. A single mutual exclusion semaphore suffices to guard access to this variable.

    This trivial solution is unacceptable! Most interesting databases cannot be claimed totally for each update. Instead, the process performing an update must claim only part of the structure, allowing other processes to operate on other parts of the structure at the same time.

    Consider, for example, the Bank of America. This is a large bank with branches all over California. It would be impossible to operate such a bank if each transaction required giving the teller in charge of that transaction exclusive use of the bank's entire data base for the duration of the transaction.

    The sets I and U are not known in advance. Typically, some subset of I must be inspected in order to determine the other variables in I and U. This leads to the two-phase model of transactions.

      Phase 1
         Claim locks and inspect variables
         until all needed variables are locked.
    
    All variables must be locked prior to inspection. If multiple variables are locked, deadlock is possible. Deadlocks can be resolved by aborting and restarting transactions because they always occur in phase 1, before the transaction has modified any shared variables.

    Deadlocks can be avoided by claiming locks in a standard order, but this is not always practical, in the general case. For example, if each database record contains pointers to other records that must be locked, you must claim the record containing the pointer in order to follow the pointer, and any cycles in the pointer structure are potential sources of deadlock.

      Phase 2
         Update variables and release locks!
    
    The 2 phase model can be improved by using more interesting locks. Instead of 2-state locks -- where the two states are free and in-use, for example, 3-state locks can be used, with the following states:
                free
                read-only
                in-use
    
    If a process only needs to read a variable, it may claim the variable by setting the lock to read-only. This is legal if the previous state of the lock was free or read-only. If a process may need to write a variable, it must claim the lock as in-use, a claim that can only be made if the lock was free. Thus, a variable may have multiple readers.

    The 2 phase model guarantees that things will work, but it is overly restrictive. There are specific problems where a greater degree of concurrent access can be allowed by violation of the two-phase model.

    For example, consider a shared lexicographic binary tree. The two-phase model requires that a user of any node in this tree must lock all nodes on the path from the root to the node in question before releasing any of the locks.

    In fact, the data structure will work just as well if the user only holds a few locks at a time. Specifically, the user first claims the lock on the root, and then claims the lock on the child of the root that is on the path to the target node in the tree. As soon as it is determined that the root is not the direct parent of the target, the root can be released and the process can be continued recursively. In the end, the process holds a lock on the target node and on the parent of the target node.