22C:116, Lecture 17, Spring 1997

Douglas W. Jones
University of Iowa Department of Computer Science

  1. Deadlock

    Deadlock occurs when two or more processes are blocked, where each process is blocked awaiting some action by another blocked process.

    A process might be blocked awaiting a message from another process, or it might be blocked awaiting a resource currently held by another process. If all the processes in a deadlocked group of processes are blocked awaiting messages, the deadlock is described as a communication deadlock; if all processes in a deadlocked group are awaiting resources, it is a resource deadlock. Mixed cases where some processes in the group are awaiting resources and others are awaiting messages are quite possible.

    In either case, whether resources or messages (or both) are involved, the key is that there is a cyclic wait.
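
    As a concrete illustration, here is a minimal sketch (in Python, with thread and lock names invented for the example) of two processes reaching exactly this kind of cyclic wait by claiming two resources in opposite orders; when run, the program will normally hang forever.

        # Two threads, each holding one lock while waiting for the other's
        # lock: a cyclic wait, and therefore a deadlock.
        import threading
        import time

        lock_x = threading.Lock()
        lock_y = threading.Lock()

        def process_a():
            with lock_x:            # A claims X ...
                time.sleep(0.1)     # (give B time to claim Y)
                with lock_y:        # ... and then waits for Y
                    print("A ran")

        def process_b():
            with lock_y:            # B claims Y ...
                time.sleep(0.1)     # (give A time to claim X)
                with lock_x:        # ... and then waits for X
                    print("B ran")

        threading.Thread(target=process_a).start()
        threading.Thread(target=process_b).start()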

    In some older texts, the possibility of message passing is completely ignored, and deadlocks are defined entirely in terms of resource contention.

  2. Resource Deadlocks

    Deadlocks resulting from resource allocation have been widely studied since the late 1960's. Three distinct issues have been raised in this context: detecting that a deadlock has occurred, resolving it once detected, and avoiding it in the first place; these are taken up in turn in the sections that follow.

    Once a deadlock is detected, it must be somehow resolved. Deadlock resolution is frequently destructive -- so only fault tolerant applications can survive it. Therefore, if it is possible, deadlock avoidance is better.

  3. Deadlock Detection

    Since a deadlock involves a cycle of processes waiting for something, the standard algorithms for deadlock detection convert the problem into graph theoretic terms and then use various cycle finding algorithms on the resulting graph.

    The following notation is traditionally used in graph theoretic formulations of the deadlock detection problem:

       Allocated        Process that
       resource         has the resource
          _                _
         |_|------------->(_)
    
       Waiting          Resource the
       process          process waits for
          _                _
         (_)------------->|_|
    
       Waiting          Process that
       process          should send message
          _                _
         (_)------------->(_)
    
    The graph has vertices for each process (round) and resource (square). The graph has edges representing the "waits for" relation. A process can wait for a resource, a resource can wait to be released by a process, and (in message passing systems) a process can wait for a message from some other process. The latter is a modern extension to this old notation.
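
    To make the notation concrete, a wait-for graph can be stored as a simple adjacency structure. The sketch below (in Python, with vertex names invented for illustration) maps each vertex to the set of vertices it waits for, so all three kinds of edge become ordinary directed edges.

        # A wait-for graph as an adjacency dictionary.
        waits_for = {
            "R":  {"P1"},    # allocated resource R -> process P1 that holds it
            "P2": {"R"},     # waiting process P2 -> resource R it waits for
            "P3": {"P2"},    # waiting process P3 -> process P2 that should send a message
            "P1": set(),     # P1 is not blocked
        }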

    Here is an example "wait for graph" describing 4 processes, A, B, C and D, and 4 resources, X, Y, Z and W. A is waiting for X, but B has both X and Z. C has W and is waiting for Y, and D has Y and is waiting for W. Does a deadlock exist?

                Processes     Resources
                    _            _
                  A(_)--------->|_|X
                            ___/
                    _      /     _
                  B(_)<===    ->|_|Y
                           \/ __/
                    _ _____/\/   _
                  C(_)<--   /\__|_|Z
                          \/
                    _     /\____ _
                  D(_)<--    -->|_|W
                     \_____/
    
    There is a deadlock involving processes C and D and resources W and Y.

    Detecting a cycle in a graph involves traversing the cycle! Thus, the cost of deadlock detection is related to the number of vertices in the cycle, which could be as large as the number of vertices in the graph. As a result, deadlock detection algorithms scale poorly, growing in expense as the system grows in size.

    As long as the only wait operations involve processes waiting on resources, the deadlock graph is formally a bipartite graph, that is, a graph where there are two sets of vertices (round and square, here), and where every edge connects a vertex in one set with a vertex in the other. If processes may wait for messages, the graph is not bipartite, but this has no effect on the relevant algorithms.

  4. A Review of Graph Algorithms

    Basic graph algorithms are a standard part of undergraduate courses on data structures!

    A graph consists of a set of vertices (processes and resources, in our example), and a set of edges (the arrows connecting processes and resources in our example). In general, edges need not have direction associated with them, but in our example application, we are only interested in directed graphs, where edges have direction.

    Many of the basic questions we will need to resolve about graphs rest on the same basic meta-algorithm for graph traversal:

        Initially, all vertices and edges are unmarked
        R is some vertex, to be used as the root
        Mark R
        S = { R }
        Repeat
            Remove a vertex V from S
            For each V' reachable from V,
                If V' is not marked
                    mark V'
                    put V' in S
                    mark the edge from V to V'
                endif
            endfor
        Until S is empty
    
    At any point during the execution of this meta-algorithm, the set S contains the vertices "on the frontier" of the graph being explored. Marked vertices that are no longer in S have been fully explored, while unmarked vertices remain to be visited.

    If S is managed on a LIFO basis, this algorithm performs a depth-first traversal of the graph, following the longest possible path from the root before returning to visit other parts of the graph. If S is managed on a FIFO basis, this algorithm performs a breadth-first traversal of the graph, visiting all vertices reachable from vertices already in S before visiting others beyond them.
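
    The following Python sketch is one way to realize the meta-algorithm; the set S is kept in a double-ended queue, popped from one end for a LIFO (depth-first) traversal and from the other for a FIFO (breadth-first) traversal, and the marked edges are collected as they are found. The graph representation and names are assumptions made for the example.

        from collections import deque

        def traverse(graph, root, depth_first=True):
            # graph: dict mapping each vertex to the vertices reachable from it
            marked = {root}                        # Mark R
            tree_edges = []                        # the marked edges
            s = deque([root])                      # S = { R }
            while s:                               # Repeat ... Until S is empty
                v = s.pop() if depth_first else s.popleft()   # LIFO or FIFO
                for w in graph.get(v, ()):         # each vertex reachable from v
                    if w not in marked:
                        marked.add(w)              # mark it,
                        s.append(w)                # put it in S,
                        tree_edges.append((v, w))  # and mark the edge from v to w
            return marked, tree_edges

        # On the graph {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []},
        # traverse(graph, "A") marks all four vertices and three tree edges.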

    The set of marked edges produced by this algorithm is a tree that spans every vertex in that part of the graph reachable from the root vertex R. That is, it is a spanning tree of the reachable part of the graph. Not all parts of all graphs are reachable from any vertex, so this is not always a spanning tree of the whole graph. Note that, depending on the traversal scheme used (that is, depending on the management of the set S), many different spanning trees can be produced by the same meta-algorithm.

    NOTE: Other variants on the above meta-algorithm will show up repeatedly in the remainder of this course! The subject of spanning trees will also show up again and again!

    Given a directed graph rooted at vertex R, it is not hard to modify the depth-first recursive tree traversal algorithm to traverse the graph and search for cycles. The modified algorithm only marks those edges and vertices on the path from the root to the leaf currently being examined -- thus, as it follows a path through the tree, it marks edges and vertices, and as it backtracks in its traversal, it unmarks edges and vertices. If this modified algorithm ever encounters a marked vertex as it walks deeper into the graph (as opposed to encountering such a vertex on backtracking), it has found a cycle reachable from the root R.
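
    Here is a Python sketch of that modified search, applied to the example wait-for graph of section 3 (the dictionary encoding is an assumption made for the example). Vertices on the path from the root to the vertex currently being examined are held in a set; meeting one of them again while walking deeper signals a cycle, and each vertex is removed from the set again on backtracking.

        # The example graph: A waits for X, which B holds; B also holds Z;
        # C waits for Y, which D holds; D waits for W, which C holds.
        waits_for = {
            "A": ["X"], "X": ["B"], "Z": ["B"], "B": [],
            "C": ["Y"], "Y": ["D"], "D": ["W"], "W": ["C"],
        }

        def has_cycle(graph, root, on_path=None):
            if on_path is None:
                on_path = set()
            if root in on_path:          # reached a vertex already on the current path
                return True
            on_path.add(root)            # mark the vertex ...
            for v in graph.get(root, ()):
                if has_cycle(graph, v, on_path):
                    return True
            on_path.remove(root)         # ... and unmark it on backtracking
            return False

        print(has_cycle(waits_for, "A"))   # False: A's chain ends at B
        print(has_cycle(waits_for, "C"))   # True:  C -> Y -> D -> W -> C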

  5. Deadlock Graphs

    It is worth noting that the details of the system being studied determine the nature of the graph algorithm needed! In a resource model, where processes wait for resources, if no process may wait for more than one resource at a time, the number of outgoing edges from any vertex will never exceed one, so the deadlock graph has a trivial structure -- it consists of a number of chains or loops. This special case is called the single resource model.

    In this simple case, deadlock detection doesn't involve a general graph algorithm; rather, it involves a simple algorithm that chases down chains of vertices looking either for an open end -- the signal that there is no deadlock, or for a loop -- the signal of a deadlock.
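
    A Python sketch of this chain chasing follows, using the example of section 3, where every vertex waits for at most one other (the dictionary encoding is assumed here for illustration).

        # waits_for maps each vertex to the single vertex it waits for;
        # a vertex with no entry is not waiting (an open end).
        waits_for = {"A": "X", "X": "B", "C": "Y", "Y": "D", "D": "W", "W": "C"}

        def chain_deadlocked(waits_for, start):
            seen = set()
            v = start
            while v in waits_for:        # follow the chain of single waits
                if v in seen:
                    return True          # the chain loops: deadlock
                seen.add(v)
                v = waits_for[v]
            return False                 # the chain reached an open end: no deadlock

        print(chain_deadlocked(waits_for, "A"))   # False
        print(chain_deadlocked(waits_for, "C"))   # True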

    Note that the general resource model allows for a process to wait for a number of resources, as in "block P until X and Y are available", but that it is quite possible to recode this as "block P until X is available, then block P until Y is available". Thus, resource models may typically be recoded to allow this trivial deadlock detection algorithm.

    However, the primitive "block until A and B are available, then claim both" may never deadlock in circumstances where "claim A and then wait for B" will deadlock. So, if a system is proven to be deadlock free under the single-resource model, it is guaranteed deadlock free under the more general model, but not vice versa.

    The most common application of this in real systems is in database systems, where resources correspond to database records that must be claimed before some update can be performed. Some database systems use exactly this trivial algorithm.

  6. Deadlock Resolution

    Once a deadlock is detected, what should be done? The classic solution is to break the cycle by aborting a process in order to free its resources. This is only applicable to systems using a resource model!

    What process should be aborted? The obvious answer is to abort the least important process, but this requires a ranking of processes by importance. Process priority, for those operating systems that support such a notion, may or may not relate to importance.

    An alternative is to return an error code on the resource allocation request that would have led to a deadlock. If the application is unprepared to deal with this error, it may be aborted, but note that this may mean that the wrong process is aborted, since the last process to join a deadlock cycle need not be the least critical process in the cycle. If a process can be coded to anticipate the possibility of allocation failures, this alternative can be quite useful.
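
    One crude approximation of this idea is sketched below in Python (the lock names and helper function are invented for the example, and it refuses whenever the second resource is merely busy, not only when a true deadlock would form): the request for the second resource fails instead of blocking, and the caller releases what it holds and recovers.

        import threading

        lock_x = threading.Lock()
        lock_y = threading.Lock()

        def update_both():
            lock_x.acquire()                        # wait normally for the first resource
            if not lock_y.acquire(blocking=False):  # but refuse to wait for the second
                lock_x.release()                    # give back what we hold ...
                return False                        # ... and report the failure
            try:
                # ... work with both resources here ...
                return True
            finally:
                lock_y.release()
                lock_x.release()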

    The classic solution is clearly not oriented towards communication deadlocks! Furthermore, aborting a process is hard on critical applications, where less drastic solutions are desirable.

    In deciding which process to abort, it is common to abort the process that was the last one added to the cyclic wait -- that is, the one that closed the cycle. This is not always the right choice, since the order in which processes join a deadlock is unrelated to the amount of damage that would be caused by aborting the processes.

    Ideally, the process to be aborted should be the one where that act will cause the least damage, that is, the least critical process. Few systems provide information sufficient to determine how critical a process is, however. The common alternative of aborting the process with the lowest priority approximates this, since high priority processes are frequently more critical than low priority processes, but this isn't necessarily so.

    Communications deadlock cannot be resolved by aborting a process! Aborting all processes in the deadlocked group is possible, of course, but it is hardly desirable, especially if fault tolerance is one of the system goals.

    In a fault tolerant system, when a communications deadlock is detected, an appropriate action to take would be to break the deadlock by sending an error message to one of the waiting processes or by causing the call to Receive to return an error code or raise an exception instead of blocking forever awaiting a message that will never arrive.
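
    One simple way to approximate this behaviour is sketched below in Python: Receive is given a timeout, and its expiry is turned into an exception rather than an unbounded wait. The mailbox and timeout value are assumptions made for the example.

        import queue

        mailbox = queue.Queue()

        def receive(timeout=5.0):
            try:
                return mailbox.get(timeout=timeout)   # normally: the awaited message
            except queue.Empty:
                raise RuntimeError("receive timed out; possible communication deadlock")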

  7. Deadlock Avoidance

    There is a trivial solution to the problem of deadlock avoidance in a pure resource allocation context: All you need to do is prohibit processes from incrementally allocating resources.

    This trivial solution has been widely used. You simply require each process to declare, as a field of its job card, all of the resources it might want. It is clumsy, and it requires that the process hold all resources it might need for the duration of its run even if it only needs some of them briefly.
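
    A Python sketch of such a non-incremental allocator is given below (the class and resource names are invented for the example): a request is granted only when every declared resource is free, so a process never holds some resources while waiting for others.

        import threading

        class AllOrNothingAllocator:
            def __init__(self, resources):
                self.free = set(resources)
                self.cond = threading.Condition()

            def acquire_all(self, wanted):
                # 'wanted' is the full set of resources declared in advance
                with self.cond:
                    while not wanted <= self.free:   # block until all of them are free
                        self.cond.wait()
                    self.free -= wanted              # then claim them all at once

            def release_all(self, held):
                with self.cond:
                    self.free |= held
                    self.cond.notify_all()

        # allocator = AllOrNothingAllocator(["modem", "printer", "tape"])
        # allocator.acquire_all({"modem", "printer"})   # all or nothing
        # allocator.release_all({"modem", "printer"})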

    Another less trivial solution is to require that all resources be acquired in some standard order. For example, if any process requiring both use of the modem and use of the laser printer wishes to acquire both, it must acquire the modem first and the printer second. As long as this rule is enforced, it will be possible to guarantee that the system is deadlock free with regard to these two resources, because it will be impossible for one process to claim the modem and then request the printer while another claims the printer and then requests the modem. Total orderings over all resources are possible, and many applications can be rewritten to comply with such an ordering constraint.
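
    A Python sketch of the ordered-acquisition rule, using the modem and printer of the example (the global order assumed here is simply alphabetical by resource name):

        import threading

        locks = {"modem": threading.Lock(), "printer": threading.Lock()}

        def acquire_in_order(names):
            for name in sorted(names):               # always claim in the agreed order
                locks[name].acquire()

        def release_all(names):
            for name in sorted(names, reverse=True):
                locks[name].release()

        # Both of these claim the modem first and the printer second, no matter
        # how the caller lists them, so they cannot deadlock with each other:
        #   acquire_in_order(["printer", "modem"]); ...; release_all(["printer", "modem"])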