22C:116, Lecture 30, Fall 1999

Douglas W. Jones
University of Iowa Department of Computer Science

  1. Atomic Transaction Protocols

    The two-phase model of an atomic transaction discussed at the end of the last lecture ensures that, if there is a deadlock, the transactions involved will be blocked prior to any changes they might intend to make. As a result, they can be safely aborted when a deadlock occurs. Furthermore, with a bit of effort, it is possible to save enough state information to abort and retry transactions when deadlocks are detected, thus hiding the failures from users.

    Typically, atomic transactions in a distributed environment are carried out by clients using servers to store the data. As a result, a family of atomic transaction protocols has emerged. Typically, such a protocol begins by establishing a transaction identifier:

    	      CLIENT  |  SERVER
    	    Begin --------->
    		 <--------- Ready
    	      Transaction ID is established.
    Either the server assigns the identifier and gives it to the client, or the client creates an identifier that is guaranteed to be unique and gives it to the server. In the latter case, the client might, for example, concatenate a guaranteed-to-be-unique process ID with a transaction serial number to create such an identifier.
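
    The client-side scheme described above can be sketched as follows; the function name and the ID format are illustrative, not part of any standard protocol.

```python
import itertools
import os

# Per-process serial number; combined with the (assumed system-wide unique)
# process ID, it yields an identifier no other client can generate.
_serial = itertools.count(1)

def new_transaction_id():
    """Return a transaction identifier unique to this client."""
    return f"{os.getpid()}-{next(_serial)}"
```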

    Each pair of messages between the client and server is initiated by the client and may be viewed as a remote-procedure call and implemented using any of the standard remote-procedure-call protocols.

    Having established a transaction ID, the client then instructs the server to lock records in the database and read or write them. The server maintains locks on behalf of transaction identifiers; thus, in more general cases, each server may service many clients, and each client may be involved in multiple independent transactions.

    	      CLIENT  |  SERVER
        Read( X, ID ) --------->
    		 <--------- Read_Reply( D )
     Write( X, D, ID )--------->
    		 <--------- Write_Reply
    	  Read and Write locks are
           established as side effects of
           the read and write operations.
    Read( X, ID ) reads variable X, with the value of X being returned as D in the reply. Write( X, D, ID ) registers the intent of this transaction to write the value D to X. This write is not carried out immediately; rather, X is locked and the value D is set aside until all the changes required by this transaction have been accumulated.

    The reply messages not only signal the completion of the read or write operation, but they indicate that the requested data has been locked on behalf of this transaction. If some other transaction has an exclusive lock on the data, these replies must be delayed until the other transaction is finished.

    Because of these locks, transactions may deadlock! If there is only one server and multiple clients, that server may detect deadlocks locally and reply to a read or write request with an abort message, indicating that the transaction has been discarded and that the client must restart it.

    	      CLIENT  |  SERVER
    		 <--------- Abort
    	      Signals premature
    	     end of transaction.
    Some systems also allow the client to prematurely abort the transaction. In any case, an aborted transaction is equivalent to a transaction that was never initiated. All locks claimed by the transaction are released and any writes made by the transaction are discarded.

    When a client finishes with a transaction, it may request that the server commit the transaction.

    	      CLIENT  |  SERVER
          Commit( ID )--------->
    	      Record all changes
    	    and release all locks.
    		 <--------- Commit_Reply
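
    The whole Begin / Read / Write / Commit exchange can be mimicked by a toy in-process server; all names here are illustrative stand-ins for the RPC messages above, and locking is omitted for brevity.

```python
# Toy server illustrating deferred writes: Write() only records intent,
# and Commit() applies every accumulated change at once.
class ToyServer:
    def __init__(self, data):
        self.data = dict(data)     # the "database"
        self.pending = {}          # transaction ID -> {variable: new value}
        self.next_id = 0

    def begin(self):
        self.next_id += 1
        self.pending[self.next_id] = {}
        return self.next_id        # the Ready reply carries the new ID

    def read(self, x, tid):
        # Within a transaction, that transaction's pending write is visible.
        return self.pending[tid].get(x, self.data[x])

    def write(self, x, d, tid):
        self.pending[tid][x] = d   # intent only: X is not changed yet

    def commit(self, tid):
        self.data.update(self.pending.pop(tid))   # record all changes

    def abort(self, tid):
        self.pending.pop(tid)      # discard: as if never initiated
```

    Note how, before Commit, the stored data still holds the old value of X, exactly as the protocol requires.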

  2. Locking

    The most trivial implementation of transactions is to have the Begin message claim a single global lock on the shared data and have the Commit message release this lock. If fault tolerance is not required, the read and write operations can work directly on the shared data.

    The exciting part of implementing transactions comes from trying to allow greater concurrency -- that is, allowing multiple simultaneous transactions to continue as long as they reference disjoint parts of the shared data, and trying to allow fault tolerance -- that is, allowing an assurance that if the client or server dies at any point during the transaction, the data that survives will always give the appearance that the transaction either completed successfully or never started.

    If separate binary locks (semaphores) are placed on each data item, transactions that reference disjoint data may overlap (and potentially deadlock), leading to a significant improvement in performance; even more concurrency is possible with read-write locks.

    A read-write lock may have a total of 4 states:

    	unlocked       -- no process has the lock
    	read-single    -- one process is reading
    	read-multiple  -- multiple processes are reading
    	write          -- one process may read or write
    There are 3 operations on such a lock:
         read-lock( ID )
    	block while state = write
    	case lock state of:
    	   unlocked:      state <- read-single
    	   read-single:   state <- read-multiple
    	   read-multiple: no change
         write-lock( ID )
    	block while state = write
    	   or while state = read-multiple
    	   or while state = read-single
    	      and ID not the reader's ID
    	case lock state of:
    	   unlocked:      state <- write
    	   read-single:   state <- write
         unlock( ID )
    	state <- unlocked
    Implementing read-write locks is a bit of a challenge! (Note that the simplified unlock above is too crude for the read-multiple state; a real implementation must count readers so that one reader's unlock does not release the others' locks.) It can be done using semaphores, though.
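
    One semaphore-style sketch, using a Python condition variable in place of raw semaphores: a reader count distinguishes the unlocked, read-single, and read-multiple states, and the lock-upgrade path (a single reader acquiring the write lock) is omitted for brevity.

```python
import threading

class ReadWriteLock:
    """Four-state read-write lock: unlocked / read-single / read-multiple / write."""
    def __init__(self):
        self.cond = threading.Condition()
        self.readers = 0       # 0 = unlocked, 1 = read-single, >1 = read-multiple
        self.writing = False   # True = write state

    def read_lock(self):
        with self.cond:
            while self.writing:              # block while state = write
                self.cond.wait()
            self.readers += 1

    def write_lock(self):
        with self.cond:
            while self.writing or self.readers > 0:
                self.cond.wait()             # block until unlocked
            self.writing = True

    def unlock(self):
        with self.cond:
            if self.writing:
                self.writing = False
            else:
                self.readers -= 1            # read-multiple -> ... -> unlocked
            self.cond.notify_all()
```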

  3. Transaction Implementation

    Modifications made to shared data must not be visible until the transaction is committed. For a trivial database consisting of a single variable, this may be done by having the server use Lamport's atomic update operation when the server is instructed to commit a new value for that variable.

    For realistic servers, the solution involves what are called shadow pages (or disk sectors) to hold the new values of variables that are to be changed. The actual pages (or sectors) making up the file are located through an index; a separate copy of this index is made for each transaction in progress, and all changes are made by updating the appropriate copy. When a transaction is committed, the master index is replaced using an atomic update.

    In the following example, the atomic transactions will be discussed in terms of updates to sectors of a single file. The same approach can be used for the files of a single file system, but for now, we will think of read and write operations as applying to sectors. The example is also limited to a flat data structure, where there is no hierarchical structure in the sectors or in the index.

                      _____             _____     _____
            Old      |     |   . . .   |     |   |     |
            Data     |_____|           |_____|   |_____|
                       |  \______       /          / |
                       |        _\_____/          /  |
                       |       / _\______________/   |
                       | ...  / /  \                 |
                      _|_____/_/_   \      _____    /
            Old      |           |   \    | NEW |  /
            Index    |___________|    \   |_____| /
                                       \  ... |  /
                             New      |           |
                             Index    |___________|
    The new index is conceptually created by the Begin operation, and initially, it is simply a working copy of the old index, with pointers to all of the old data sectors.

    When a Write is done, neither the old index nor the old sectors are touched. Instead, a new sector is allocated to hold the new value and then the new index is updated to reflect the change.

    When the commit operation is done, the new index replaces the old index in one atomic operation. For example, the index could be stored in stable storage (using Lamport's approach). In this case, it is reasonable to assume that the old index is out on disk, while the new index is a working copy in the volatile memory of the server.

    This approach allows outside viewers to see the updates as happening instantaneously at the time of the atomic update of the index. As described here, however, this assumes that only one transaction at a time is being processed.

    One way to allow multiple concurrent transactions is to have the server maintain multiple copies of the new index. Since the server also maintains the locks, it can assure that no two copies have conflicting changes -- that is, that any pointers modified in one copy of the index are unmodified in all other copies.

    When a commit operation is performed, the server does a stable storage update of the index from one working copy, and also updates all other working copies to reflect the changes that were made before it releases the locks on the pages that were locked during the transaction.
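
    The shadow-page mechanism above can be sketched for a flat file; the sector-allocation scheme and all names here are simplifying assumptions.

```python
class ShadowFile:
    """Flat file with shadow paging: Commit swaps in a whole new index."""
    def __init__(self, sectors):
        self.sectors = list(sectors)              # pool of data sectors
        self.index = list(range(len(sectors)))    # master index (on "disk")
        self.working = {}                         # transaction ID -> index copy

    def begin(self, tid):
        self.working[tid] = list(self.index)      # working copy of old index

    def read(self, i, tid):
        return self.sectors[self.working[tid][i]]

    def write(self, i, value, tid):
        self.sectors.append(value)                # new sector; old one untouched
        self.working[tid][i] = len(self.sectors) - 1

    def commit(self, tid):
        self.index = self.working.pop(tid)        # the single atomic update
```

    Until commit, neither the old index nor the old sectors are touched, so a crash before the atomic index swap leaves the file exactly as it was.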

  4. What if the client dies in mid-transaction?

    One way to detect the failure of clients (or servers) is to set a time limit for each transaction. If the transaction is not committed within its limit, the transaction is aborted. This not only recovers from failures, but it resolves deadlocks!

    Usually, the time limit is negotiated as part of the dialog that agrees on the transaction ID. For example, the client might propose a time limit before which it hopes to finish the transaction. If the client dies before it can commit the transaction, the server will notice that the transaction has exceeded its time limit and abort it.

    Deadlocks are resolved trivially by this scheme: if there is a time limit on every transaction, and if the server aborts a transaction when its limit is exceeded even if that transaction is blocked awaiting the release of a lock, then some transaction involved in every deadlock will eventually be aborted when its time limit expires.
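
    A sketch of the server's timer mechanism, under the assumption that deadlines are checked by a periodic sweep; the names are illustrative.

```python
import time

deadlines = {}      # transaction ID -> absolute deadline (monotonic clock)

def begin(tid, proposed_limit):
    """Record the time limit negotiated when the transaction ID is assigned."""
    deadlines[tid] = time.monotonic() + proposed_limit

def reap_expired(now=None):
    """Abort (here: just drop) every transaction past its limit."""
    now = time.monotonic() if now is None else now
    expired = [tid for tid, t in deadlines.items() if now > t]
    for tid in expired:
        del deadlines[tid]   # a real server would also release its locks
    return expired
```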

  5. Where are the locks?

    Locks are not stored on disk with the sectors themselves! Locks are transient; if the server fails, all transactions are aborted and all locks released; this is most easily assured if locks are stored in transient memory and not out on disk with the permanent data.

  6. What if the index is not flat?

    Hierarchical data structures may be updated by a modification of the shadow page scheme:

                   _____   _____       _____   _____
            Old   |     | |     |     |     | |     |
            Data  |_____| |_____|     |_____| |_____|
                       \___/           /   \___/
            Old Sub   |     |         /   |     |
        Indices   |_____|         \   |_____|
                           \ \         \   /        _____
                            \ \         \_/____    | NEW |
                             \ \         /     \   |_____|
                              \ \_______/_      \___/
                               \       /  \    |     | New Sub
                                \     /    \   |_____| Index
                                 \___/      \___/
            Old                 |     |    |     |  New
            Index               |_____|    |_____|  Index
    When there is a hierarchy of sub-indices, all sub-indices along the path from the root to the new data must be replicated. Only the root needs to be atomically updated, however, and the effect of that update is a complete update to the entire data structure.

    The above example is formulated in terms of a binary tree, but it generalizes trivially to the 128-ary or any other trees that are common in disk file systems -- B-trees, Unix-style directory trees, Unix-style i-nodes, etc.

  7. Transactions distributed over multiple servers

    Transactions involving multiple servers add new problems. What happens if one server is told to commit a transaction, and then there is a failure preventing another of the servers from committing? The solution to this is found in the 2-phase commit protocols.

    	      CLIENT  |  SERVERS
    Prepare-to-commit --------->
    	   (sent to every server)
    		 <--------- Ready-to-commit
    	   (one reply from each server)
    Commit or Rollback --------->
    	   (sent to every server)
    		 <--------- Done
    	   (one reply from each server)
    In a two-phase commit, the client first tells every server that it is about to commit, then awaits messages from all the servers saying that they are ready. If some server fails to reply or requests an abort, the client tells all of them to roll back the transaction; otherwise, it tells all of them to commit. When all indicate that they are done, the client can continue, assured that all is well.
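
    The coordinator side of this protocol can be sketched as follows; the prepare/commit/rollback method names stand in for the messages above and are not a real RPC interface.

```python
def two_phase_commit(servers):
    """Return True if every server committed, False if all rolled back."""
    # Phase 1: ask every server to prepare; a refusal or failure vetoes.
    try:
        ready = all(s.prepare() for s in servers)
    except Exception:
        ready = False
    # Phase 2: send the single, uniform decision to every server.
    for s in servers:
        if ready:
            s.commit()
        else:
            s.rollback()
    return ready
```

    A server that was never asked to prepare may still receive Rollback; in practice, rolling back an unknown transaction is a harmless no-op.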

    If a server uses sub-servers, it passes Prepare-to-commit messages on to its sub-servers; it doesn't say it's ready until all of its sub-servers are ready, and, of course, it passes on all commit and rollback messages to its sub-servers and waits for them to say they're done before it says it's done.

    This doesn't guarantee consistency! There is, unfortunately, a small chance that a server will fail after it says Ready but before it successfully commits. This interval from Ready to Commit can be made quite short compared to the length of the transaction, however, so this model is at least helpful in reducing the likelihood of a failure.

    If every client and server in the system maintains a detailed transaction log in stable storage that indicates what transactions it believes were completed, it is possible to guarantee global consistency, but this guarantee only applies when all systems are running, or after a failure, when all systems have been repaired. In the presence of irreparable failure, or in the presence of unbounded repair times, the best guarantee we can offer sets a bound on the probability of data loss.

  8. Distributed Deadlock

    In a distributed system, it is never reasonable to detect deadlocks by building a graph model of the system in some particular machine's memory and then running an algorithm to detect cycles in that graph. The expense of gathering the information needed to build the model is greater than the expense of deadlock detection itself!

    The key to this is to note that the deadlock detection algorithm can actually be distributed over the network and run in parallel. The basic algorithm involves exactly the same kind of distributed protocol as is used for algorithms such as those for computing the spanning tree of a network, group communications, or distributed mutual exclusion.

    Typically, if a process is blocked for too long and suspects that it is time to search for a deadlock, it sends a message to each process from which it is awaiting something. This message asks "Is there a deadlock?", and it includes the ID of the originating process.

    If a process receives such a message and it isn't blocked, it sends a reply saying "No"; if it receives such a message and it is blocked, it forwards the message to every process from which it is expecting a message. If a process receives such a message from itself (through a series of intermediaries), then it knows that there is a deadlock.
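
    The probe-forwarding idea can be sketched with an explicit waits-for map standing in for the network; in a real system each recursive step here would be a message forwarded to another machine.

```python
def probe(origin, current, waits_for, visited=None):
    """Forward a probe from `origin`; True means the probe came back: deadlock."""
    visited = set() if visited is None else visited
    for target in waits_for.get(current, ()):
        if target == origin:
            return True                    # probe reached its originator
        if target not in visited:
            visited.add(target)            # don't forward the same probe twice
            if probe(origin, target, waits_for, visited):
                return True
    return False
```

    A process with an empty waits-for set is not blocked, so the probe dies there, corresponding to the "No" reply above.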

    This basic algorithm works for single-resource models and for and-models. A somewhat more complex algorithm is required for the or-model, but atomic transactions don't involve or-model deadlocks!