22C:116, Lecture Notes, Apr. 12, 1995

Douglas W. Jones
University of Iowa Department of Computer Science

  1. Atomic Transaction Protocols

    The two phase model of an atomic transaction discussed at the end of the last lecture ensures that, if there is a deadlock, the transactions involved will be blocked prior to any changes they might intend to make, and as a result, they can be safely aborted. Furthermore, with a bit of effort, it is possible to save sufficient state information that aborted transactions can be retried.

    Typically, atomic transactions in a distributed environment are carried out by clients using servers to store the data. As a result, a family of atomic transaction protocols has emerged. Typically, such a protocol begins by establishing a transaction identifier:

    	      CLIENT  |  SERVER
    	    Begin --------->
    		 <--------- Ready
    	      Transaction ID is
    Either the server assigns the identifier and gives it to the client, or the client, or the client creates an identifier that is guaranteed to be unique and gives it to the server. In the latter case, the client might, for example, concatenate a guaranteed-to- be-unique process ID with a transaction serial number to create such an identifier.

    Each pair of messages between the client and server is initiated by the client and may be viewed as a remote-procedure call and implemented using any of the standard remote-procedure-call protocols.

    Having established a transaction ID, the client then instructs the server to locks records in the database and read or write them. The server maintains locks on behalf of transaction identifiers; thus, in more general cases, each server may service many clients, and each client may be involved in multiple independant transactions.

    	      CLIENT  |  SERVER
        Read( X, ID ) --------->
    		 <--------- Read_Reply( D )
     Write( X, D, ID )--------->
    		 <--------- Write_Reply
    	  Read and Write locks are
           established as side effects of
           the read and write operations.
    Read( X, ID ) reads variable X, with the value of X being returned as D in the reply. Write( X, D, ID ) registers thethe intent of this transaction to write the value D on X. This write is not carried out immediately; rather, the value is set aside until all the changes required by this transaction have been accumulated.

    The reply messages not only signal the completion of the read or write operation, but they indicate that the requested data has been locked on behalf of this transaction. If some other transaction has an exclusive lock on the data, these replies must be delayed until the other transaction is finished.

    Because of these locks, transactions may deadlock! If there is only one server and multiple clients, that server may detect deadlocks locally and reply to a read or write request with an abort message, indicating that the transaction has been discarded and that the client must restart it.

    	      CLIENT  |  SERVER
    		 <--------- Abort
    	      Signals premature
    	     end of transaction.
    Some systems also allow the client to prematurely abort the transaction. In any case, an aborted transaction is equivalent to a transaction that was never initiated. All locks claimed by the transaction are released and any writes made by the transaction are discarded.

    When a client finishes with a transaction, it may request that the server commit the transaction.

    	      CLIENT  |  SERVER
          Commit( ID )--------->
    	      Record all changes
    	    and release all locks.
    		 <--------- Commit_Reply
    If a deadlock is detected during a transaction, some blocked transaction must be aborted. The abort message will usually be received instead of a normal reply to a read or write request.

  2. Locking

    The most trivial implementation of transactions is to have the Begin message claim a single global lock on the shared data and have the Commit message release this lock. If fault tolerance is not required, the read and write operations can work directly on the shared data.

    The exciting part of implementing transactions comes from trying to allow greater concurrency -- that is, allowing multiple simultaneous transactions to continue as long as they reference disjoint parts of the shared data, and trying to allow fault tolerance -- that is, allowing an assurance that if the client or server dies at any point during the transaction, the data that survives will always look like either the transaction never started or it looks like the transaction completed successfully.

    If separate binary locks (semaphores) are placed on each data item, the transactions may overlap (and potentially deadlock), but even more concurrency is possible with read-write locks.

    A read-write lock has 4 states:

    	no process has the lock.
            one process has the lock;
            others may also read;
    	the first process may also write.
         read multiple
            multiple processes are reading;
            no process may write.
    	one process may read or write;
    There are 3 operations on such a lock:
         read-lock( ID )
    	legal if the lock is unlocked or read-single;
    	blocked otherwise.
         write-lock( ID )
    	legal if the lock is unlocked or is read-single
    	by the same ID.  Blocked otherwise.
         unlock( ID )
            undoes read-lock or write-lock;
    Implementing read-write locks is a bit of a challenge! It can be done using semaphores, though.

  3. Transaction Implementation

    Modifications made to shared data must not be visible until the transaction is committed. For a trivial database consisting of a single variable, this may be done by having the server use Lamport's atomic update operation when the server is instructed to commit a new value for that variable.

    For realistic servers, the solution involves what are called shadow pages (or disk sectors) to hold the new values of variables that are to be changed. The actual pages (or sectors) making up the file are stored in an index, and it is this index; a separate copy of this index is made for each transaction in progress, and all changes are made by updating the appropriate index. When a transaction is committed, the master index is updated using an atomic update.

    In the following example, the atomic transactions will be discussed in terms of updates to sectors of a single file. The same approach can be used for the files of a single file system, but for now, we will think of read and write operations as applying to sectors. The example is also limited to a flat data structure, where there is no hierarchical structure in the sectors or in the index.

                      _____             _____     _____
            Old      |     |   . . .   |     |   |     |
            Data     |_____|           |_____|   |_____|
                       |  \______       /          / |
                       |        _\_____/          /  |
                       |       / _\______________/   |
                       | ...  / /  \                 |
                      _|_____/_/_   \      _____    /
            Old      |           |   \    | NEW |  /
            Index    |___________|    \   |_____| /
                                       \  ... |  /
                             New      |           |
                             Index    |___________|
    The new index is conceptually created by the Begin operation, and initially, it is simply a working copy of the old index, with pointers to all of the old data sectors.

    When a Write is done, neither the old index nor the old sectors are touched. Instead, a new sector is allocated to hold the new value and then the new index is updated to reflect the change.

    When the commit operation is done, the new index replaces the old index in one atomic operation. For example, the index could be stored in stable storage (using Lamports approach). In this case, it is reasonable to assume that the old index is out on disk, while the new index is a working copy in the volitile memory of the server.

    This approach allows outside viewers to see the updates as happening instantaneously at the time of the atomic update of the index. As described here, however, this assumes that only one transaction at a time is being processed.

    One way to allow multiple concurrent transactions is to have the server maintain multiple copies of the new index. Since the server also maintains the locks, it can assure that no two copies have conflicting changes -- that is, that any pointers modified in one copy of the index are unmodified in all other copies.

    When a commit operation is performed, the server does a stable storage update of the index from one working copy, and also updates all other working copies to reflect the changes that were made before it releases the locks on the pages that were locked during the transaction.

  4. What if the client dies during a transaction?

    One way to detect the failure of clients (or servers) is to set a time limit for each transaction. If the transaction is not committed within its limit, the transaction is aborted. This not only recovers from failures, but it resolves deadlocks!

    Usually, the time limit is negociated as part of the dialog that agrees on the transaction ID. For example, the client might propose a time limit before which it hopes to finish the transaction. If the client dies before it can commit the transaction, the server will notice that it has exceeded its time limit and abort the transaction.

    Deadlocks are resolved trivially by this because if there is a time limit on every transaction, and if the server aborts the transaction when the limit is exceeded even if the transaction was blocked awaiting the release a lock, then some transaction involved in every deadlock will eventually be aborted when its time limit expires.

  5. Where are the locks?

    Locks are not stored on disk with the sectors themselves! This would duplicate locked sectors writes in the shadow pages. Furthermore, locks are transient; if the server fails, all transactions are aborted and all locks released; this is most easily assured if locks are stored in transient memory and not out on disk with the permanent data.

  6. What if the index is not flat?

    Hierarchical data structures may be updated by a modification of the shadow page scheme:

                   _____   _____       _____   _____
            Old   |     | |     |     |     | |     |
            Data  |_____| |_____|     |_____| |_____|
                       \___/           /   \___/
            Old Sub   |     |         /   |     |
            Indeces   |_____|         \   |_____|
                           \ \         \   /        _____
                            \ \         \_/____    | NEW |
                             \ \         /     \   |_____|
                              \ \_______/_      \___/
                               \       /  \    |     | New Sub
                                \     /    \   |_____| Index
                                 \___/      \___/
            Old                 |     |    |     |  New
            Index               |_____|    |_____|  Index
    When there is a hierarchy of sub-indeces, all sub-indeces along the path from the root to the new data must be replicated. Only the root needs to be atomically updated, however, and the effect of that update is a complete update to the entire data structure.

    The above example is formulated in terms of a binary tree, but it generalizes trivially to the 128ary or similarl trees that are common in disk file systems.

  7. Transactions distributed over multiple servers

    Transactions involving multiple servers add new problems. What happens if one server is told to commit a transaction, and then there is a failure preventing another of the servers from committing. The solution to this is found in the 2-phase commit protocols.

                                     |        -->
             |                       |      /
        Commit -->  ====>  Prepare-to-commit
         <-- Done                    |      \
             |                       |        -->
                                Ready to commit
                               /     |
                           <--       |
                               \     |
                                Ready to commit
                                     |        -->
                                     |      /
                          Commit or Rollback
                                     |      \
                                     |        -->
                                <--  |
    In a two-phase commit, the client first tells every server that it is about to commit, then awaits messages from all the servers that they are ready. If all indicate readyness, it tells all of them to commit. If some don't reply or request an abort, then the client tells all to roll back the transaction; otherwise it tells all to commit. When all indicate that they are done, the client can continue assured that all is well.

    If a server uses sub-servers, it passes on Prepare-to-commit messages to subservers, it doesn't say it's ready until all subservers are ready, and, of course, it passes on all commit and rollback messages to its sub-servers and waiths for them to say they're done before it says it's done.

    This doesn't guarantee consistency! There is, unfortunately, a small chance that a server fail after it says Ready but before it successfully commits. This interval from Ready to Commit can be made quite short compared to the length of the transaction, however, so this model is at least helpful in reducing the likelyhood of a failure.

  8. Distributed Deadlock

    In a distributed system, it is never reasonable to detect deadlocks by building a graph model of the system in some particular machine's memory and then run an algorithm to detect cycles in the graph. The expense of gathering the information needed to build this model is greater than the expense of deadlock detection!

    The key to this is to note that the deadlock detection algorithm can actually be distributed over the network and run in parallel. The basic algorithm involves exactly the same kind of distributed protocol as is used for algorithms such as those for computing the spanning tree of a network, group communications, or distributed mutual exclusion.

    Typically, if a process is blocked for too long and suspects that it is time to search for a deadlock, it sends a message to each process from which it is awaiting something. This message asks "Is there a deadlock", and it includes the ID of the originating process.

    If a process receives such a message and it isn't blocked, it sends a reply saying "No"; if it receives such a message and it is blocked, it forwards the message to every process from which it is expecting a message. If a process receives such a message from itself (through a series of intermediaries), then it knows that there is a deadlock.

    This basic algorithm works for single-resource models and for and-models. A somewhat more complex algorithm is required for the or-model, but atomic transactions don't involve or-model deadlocks!