Chapter 16, Schedulers

Making one processor appear to be many

Part of the notes for 22C:50, Introduction to System Software
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science


The Problem

Although it is easy to think of support for multiple processes in terms of one central processing unit per process, few systems are constructed that way. The reason for this is that the number of processes to be used in solving a problem is usually a logical consequence of the problem to be solved, while the number of processors to be used in building any particular computer system is determined by the amount of money available to build the system, the technology available and the speed required.

Prior to 1960, all computers had only a single processor. Between 1960 and 1970, there were two important developments in the support of multiple processes. The first of these was the development of a small number of multiprocessor systems, that is, computer systems with more than one central processor connected to the same main memory. A good example of such a system was the Burroughs D825 system, known by the military as the AN/GYK-3(V); this system was completed in 1962, and had up to four central processors attached to a main memory of up to 64K words of 48 bits each. During the 1960's, Burroughs Corporation was a leader in the construction of multiprocessors, and by the end of the decade, the high-end Burroughs mainframe computers were all based on multiprocessor configurations.

The use of multiprocessor systems was not economically justified for any but the most demanding applications until the development of the minicomputer, but by the early 1970's, experimental systems were being built with up to 16 processors, and by the mid 1980's, universities and research labs were experimenting with machines having hundreds of microprocessors, and machines with around 16 processors were being actively marketed for conventional applications by such companies as Encore, Sequent and Silicon Graphics.

During the 1980's, a few manufacturers of desktop computers offered multiprocessor workstations, notably the DEC Firefly, a 4-processor micro-VAX workstation. By the 1990's, such configurations became available on personal computer systems, typically in dual processor configurations offered in both the Intel/Microsoft and Apple/Power-PC families.

The second major development in the 1960's was the introduction of operating systems which could support multiple processes on a single central processor. The motivation for these systems was provided by the fact that most user programs spend a considerable amount of time waiting for the completion of input/output operations. Since early central processors were very expensive, there was a strong motivation to use this idle time to run other user programs. The Ferranti Atlas system developed at Manchester University in 1961 was a good example of such a system, as was the BBN time-sharing system developed for the PDP-1 computer in 1962. The first of these was a large batch system where user jobs were submitted as decks of punched cards and results were returned some time later, while the second was an interactive system with 5 terminals. By the late 1960's, virtually all large computer systems supported multiple processes, whether in a batch or timesharing mode of operation.

The end result of the developments described above is that the typical computer system supports at least as many processes as it has processors, and has at least one processor. Each process must, of course, be supported by what we can describe as a virtual processor of its own, and when that process is executing some instruction, some physical processor must be involved. The part of an
operating system which takes the physical processor resources and multiplexes them among the virtual processors needed to run the processes on that system is called the central processing unit scheduler. If the term scheduler is used without qualification, it is usually taken to refer to the CPU scheduler.

At any instant, each processor in a system is executing the code of some process. A context switch occurs whenever a processor stops executing the code of one process and starts executing the code of another. Context switching may be initiated by the process which was running, or it may be initiated by the scheduler. The latter are called preemptive or involuntary context switches.

The States of a Process

The scheduler decides, at each instant, which processor shall be used to execute instructions on behalf of which process. Because all user code and most system code can be thought of as running as part of some process, the scheduler can be viewed as the bottommost or component in the hierarchy of software components making up an operating system.

From the point of view of processes running under the control of the scheduler, each process is either running, waiting for some event, or dead; from the scheduler's point of view, an additional state is necessary: a process may be able to run, but not currently being supported by any real central processor. Such processes are said to be ready because they are ready to run as soon as a processor becomes available. Thus, the states of a process shown in Figure 15.11 must be extended as shown in Figure 16.1.

                      |                 |
                      |      ready      |
                      |         ^       |
            create    |       | | relin-|    wait
         ------------>| sched-| | quish |------------>
    dead              |  ule  | |       |              waiting
         <------------|       | | pre-  |<------------
           terminate  |       | | empt  |   signal
              kill    |       V         |
                      |     running     |
                      |                 |

Figure 16.1. The states of a process.

The list of system services supported by a typical scheduler is implied by the names of the transitions between the states shown in Figure 16.1. These services are described in more detail in Figure 16.2.

create a process which is an activation of the code c; the created process changes state from dead to ready or running; a process identifier, p, is returned.

kill process p, changing its state from ready, running, or waiting to dead.

used by a process to terminate itself when its work is done; the state changes from running to dead.

used by a process to voluntarily change state from running to ready so that some other process may be run on the processor which was being used.

used by a process to wait on the semaphore s, decrementing the count in s and causing the process to change state from running to waiting if the result is negative.

used by a process to signal using the semaphore s, incrementing the count in s and, if there is a process waiting on s, changing the state of one waiting process to ready or running.

initialize the semaphore s with the initial count i.

Figure 16.2. Typical scheduler services.

Two significant changes were made in moving from the state diagram in Figure 16.1 and the list of services in Figure 16.2. The newsem() service was added in Figure 16.2 and the preempt and schedule operations are missing. Adding a semaphore creation or initialization operation is a fairly obvious change; if we have a data type, there should be some way to initialize it or create an instance of it. The situation with preemption and scheduling is more complex. A process is preempted involuntarily by the scheduler, without any process explicitly requesting preemption, and scheduling of ready processes always happens as a side effect of other operations.

The list in Figure 16.2 is by no means universal. For example, processes on the UNIX system are created by an operation called fork which splits the parent process into two identical copies which run the same code but have separate copies of the stack and local variables. One of the resulting processes is called the parent, the other is called the child; a call to fork() returns the process identifier of the child to the parent, and it returns zero to the child; aside from this difference in return value, the parent and child continue running from the point of return. The child has a copy of the parent's stack and heap, while the parent has the originals, but because these are identical, it really doesn't matter which process gets the copy and which keeps the originals.

The particular services which a system provides for synchronizing user processes vary considerably from system to system, but the wait and signal operations on semaphores are perhaps the simplest alternatives, and they are sufficient to implement all of the alternatives. For example, a semaphore with an initial count set to one can be used to guarantee mutual exclusion on a resource shared by two processes. When a process wants to use that resource, it waits on that semaphore, and when it is done, it signals that semaphore. The wait operation blocks the process if the semaphore is already zero, while it decrements the semaphore if or when the semaphore is one. The signal operation always increments the semaphore, allowing exactly one waiting process to continue.

As illustrated in Figure 15.13, semaphores can also be used for counting, for example, to keep track of the number of bytes in a queue. The wait and signal operations can be used to increment and decrement a counter (in a semaphore) whenever the size of a data structure is changed; as a result, processes are forced to wait until data is available, that is, until the counter is positive, before they decrement the count and consume some of the data.

There are a number of systems which provide other synchronizing services. For example, some have services with names such as sleep and wake-up; with these, when an event occurs for which some process is waiting (sleeping), the process which caused or detected the event must know the process identifier of the sleeping process in order to wake it up. This problem can be solved by adopting the convention that processes must post notices of what they are waiting for before going to sleep, but this raises the possibility that a posted notice will be inspected and acted on before the waiting process actually goes to sleep. As a result, the wake-up operation might actually get done while the process is still awake, just prior to its going to sleep. This wake-up while running

problem is one of the problems that inspired the development of semaphores in the late 1960's, and it is one reason that services like wake-up(p) are not central to modern operating systems.

Some systems, including most early versions of UNIX, provide no general purpose services for process synchronization, but hide synchronization inside of input/output operations. This is usually done under the assumption that independent users of a timesharing system will want to view their share of a system as a personal computer without much interaction with other users. Of course, synchronization is still needed in order to implement the sharing of such things as line printers and disks, and it is essential for the implementation of electronic mail and multiuser games, but every effort is made to hide this from the users. Only late in the history of UNIX, with the standardization of UNIX System V, has a standard set of semaphore services become available. These are an afterthought, and like afterthoughts in other areas of system and programming language design, they are not widely used or understood by users.

Alternative Scheduling Policies

Scheduling decisions may be made with the cooperation of the processes being scheduled, for example, when a process voluntarily relinquishes a processor, but on many systems, the scheduler can also preemptively change a process from running to ready in order to run some other process; such schedulers are called preemptive schedulers. With a preemptive scheduler, the interval allotted to a process between the time it is scheduled and the time it is preempted is called a time slice. The choice of whether or not to preempt running processes, and if so, the length of time slice to use is a policy decision which depends on the demands of the application. For example, if efficiency of processor utilization is a primary concern, the overhead of preemption makes it undesirable; after all, with each preemption, the scheduler must devote some processor time to scheduling decisions. On the other hand, if interactive response time is a concern, preemption is essential. A very short time slice allows fast response but it also requires frequent, and possibly time consuming scheduling decisions.

When a processor becomes available, some ready process must be scheduled. How this ready process is selected is a second policy issue, and again, the particular policy depends on the application. The round robin policy is the simplest; this simply schedules the process which has been ready the longest. The round robin policy is unquestionably fair, but fairness in scheduling is not necessarily desirable. All of the alternatives to round robin scheduling are based on some assignment of priorities to the ready processes. For example, each process might have a fixed priority based on the response time expected for that process, or priorities might be varied depending on resource usage.

The idiomatic phrase round robin has been part of the English language for centuries; it originally appears to have referred to a petition in which all names were written in a circle so that the recipient couldn't tell who had signed first. Today, it usually refers to a letter circulated among a group of correspondants; on receiving the letter, each correspondant may read everything the others have written before adding to it.

Although some priority scheduling policies are based on the assumption that each process has a distinct priority, most must have some rule for breaking ties when processes have equal priority. Equal priorities are quite common; for example, a time-sharing system may put all interactive user processes at one priority (sometimes called the foreground priority), and all non-interactive processes at a lower priority (sometimes called to background priority). When equal priorities are possible, ties should be broken in such a way that processes of equal priority are treated fairly. Random selection is usually considered fair, but it is sometimes desirable to use round robin scheduling of processes within each priority class.

Variable process priorities are common in time sharing systems. For example, each process might be assigned a priority inversely proportional to the fraction of the available processor time that
process used in the recent past. This kind of priority rule favors processes which are not computationally intensive while penalizing those which are consuming an inordinate share of the processor time. As a result, interactive processes which spend large amounts of time waiting for input from users will have a high priority and a short response time, independently of whether there are compute bound processes consuming most of the available processor time.

In process control systems, for example, the computer and software controlling a reaction in a chemical plant or the flaps of an airplane, it is common to assign fixed priorities to processes on the basis of how frequently they must operate. Consider a system with two processes, one which must check a thermometer every second in order to adjust a heater, and one which must check the fluid level in a tank every minute in order to adjust a pump. Each process attempts to maintain some value at a constant, but one of these values, the temperature, is more critical than the other, the fluid level. Clearly, the temperature control process should have a higher priority.

Process control systems where priorities are assigned proportionally to the frequency with which processes perform key services are said to be based on a rate monotonic priority assignment. Rate monotonic systems work, but when more than about 60 percent of the processor time is used, on the average, there is a risk of occasionally missing a critical deadline. This problem can be overcome by identifying specific deadlines for each process which specify the times at which key actions must be performed. The priority of each process in such a deadline system is determined by the deadlines themselves so that the process with the most imminent deadline always has the highest priority. Each time a process finishes a key action, it reports this to the scheduler by posting a new deadline indicating the time by which it must complete its next key action.

Although most studies of scheduling in computer systems focuses on the role of a centralized scheduler, it is possible to view scheduling in a broader context that includes decentralized scheduling decisions. For example, most coroutine-based systems may be viewed as multiple process systems in which the scheduling decisions are decentralized and non-preemptive. Each coroutine instance can be viewed as a process, and the resume statements which transfer control between coroutine instances can be viewed as combining a relinquish operation with an explicit statement of which process to schedule next.

An interrupt mechanism can be considered to be a decentralized preemptive scheduler. It is preemptive because the program running at the time an interrupt occurs has no control over the occurrence of that event; the central processor is, in effect, snatched away from the user program and given to the interrupt handler by the interrupt hardware. It is decentralized because decisions to request interrupts are made independently by each input/output device, without consulting other devices. Note that a centralized decision may be made in the rare event that multiple devices simultaneously request interrupts, typically on the basis of hardware priority.

Implementing a Non-Preemptive Scheduler

Before discussing the implementation of schedulers in the general context, with multiple processors and preemption, a simple coroutine based scheduler on a uniprocessor with no interrupt mechanisms will be discussed. This avoids all discussion of asynchronous events, while allowing the conceptual structure of the scheduler to be thoroughly examined. This may be viewed as an exercise in simulating multiple processes using coroutines, but the data structures used here form the basis of most schedulers, so little is lost by such simulation.

The data structures needed by a coroutine based non-preemptive scheduler are shown in Figure 16.3.

type processref = ^ process;
     process = record
                    next: processref;
                    self: activation;

     processqueue = record
                         head, tail: processref;

     semaphore = record
                      count: integer;
                      waiting: processqueue;

var ready: processqueue;
    running: processref;

Figure 16.3. Data structures for a simple scheduler.


The process description record shown in Figure 16.3 centers on a pointer to the coroutine activation for that process, along with a pointer to another process activation so that processes may be linked into queues. Such queues will be used, for example, to represent the lists of processes waiting at a semaphore, or the list of ready processes. In an object-oriented world, we would say that objects of type process are a subclass of the generic or abstract class queueable.

Note that we do not include the abstract state of a process as an explicit component of the process; instead, we will represent the state of a process by the place where a pointer to the process is stored. If a process is running, there is a pointer to it from the global variable running; if it is ready, it is in the ready queue, the queue named ready in Figure 16.3. If a process is waiting, it is in the queue associated with the semaphore on which it waited. The enqueue and dequeue operations on queues of processes will not be defined here, but are not difficult to code.

The record describing a process is frequently called a process control block or task control block. When a priority scheduler is used, the priority or deadline of the current process is stored in this record. Additional information is frequently stored in this record, including such things as the list of all files currently open for use by the process, security information such as a user identifier, and billing records so that users may be charged for all resources used by each process or so that the priority may be adjusted to reflect resource usage. Many systems have process control blocks which have expanded to hundreds or even thousands of words, making the creation of new processes quite cumbersome. If a significant number of these words must be copied to or from global variables or registers during a context switch, context switching can become quite expensive.

The services to activate and terminate a process illustrate the basic use of the data structures shown in Figure 16.3. These are illustrated in Figure 16.4.

var terminate: activation { the only activation of terminatebody };

procedure create( coroutine processbody );
var p: processref;
     new( p );
     p^.self := activate processbody;
     enqueue( p, ready );
end { activate };

coroutine terminatebody;
          dispose( running^.self { the coroutine activation record } );
          dispose( running       { the process control block } );
          running := dequeue( ready );
          resume running^.self;

Figure 16.4. Coroutine based activate and terminate services.


The create operation in Figure 16.4 is shown as a procedure; it simply allocates and initializes a new process control block and an activation of the code for that process, without actually transferring control to it. This situation is more complex with terminate; this is the only activation of a coroutine, and when it is resumed, it never returns control to the code that resumed it. In most systems, this code would be packaged so that it can be called as a procedure, but it never returns! When control is transferred to terminate, the caller's process disappears from the system and control is transferred to some other process selected from the ready queue.

Note that the code given for terminate does not check to see if the ready list is empty! Instead, it blindly dequeues a new process from the ready list and runs it. This code will be correct only if we assume that there is always at least one ready process. Many systems assure this by including one extra process in the system, called the idle process. When there is nothing else to do, the system runs the idle process. Sometimes, this is nothing but an infinite loop, with one call to relinquish per iteration. Sometimes, the idle process does useful computation, but the one rule it must obey is that it never, ever, blocks. The idle process state must always be either ready or running!

The relinquish service is at the very center of a non-preemptive scheduler This may be constructed as shown in Figure 16.5.

var relinquish: activation { the only activation of relinquishbody };

coroutine relinquishbody;
          enqueue( running, ready );
          running := dequeue( ready );
          resume running^.self;
end { relinquishbody };

Figure 16.5. A coroutine based version of the relinquish service.

When the coroutine representing the body of a running process executes a resume relinquish statement, the code shown in Figure 16.5 puts that process in the queue of ready processes, gets the next ready process from the queue, and resumes it, changing its state to running. Taken together, this accomplishes a context switch from one process to another, except in the case where there is only one process; in that case, the process will do a context switch to itself.

Because this is a non-preemptive scheduler, the simple minded definition of the basic operations on semaphores shown in Figure 15.12 cannot be used. Instead, the wait operation must explicitly remove the waiting process from the ready list until a signal operation allows it to continue. This is shown in Figure 16.6.

var wait:   activation { the only activation of waitbody };

coroutine waitbody( var sem: semaphore );
          sem.count := sem.count - 1;
          if sem.count < 0 then begin { go to sleep }
               enqueue( running, sem.waiting );
               running := dequeue( ready );
          resume running^.self;
end { waitbody };

procedure signal( var sem: semaphore );
var temp: processref;
     if sem.count < 0 then begin { wake up a process }
          temp := dequeue( sem.waiting );
          enqueue( temp, ready );
     sem.count := sem.count + 1;
end { signalbody };

Figure 16.6. Coroutine based implementations of wait and signal.


With the implementation given in Figure 16.6, a positive count in the semaphore corresponds to the value in the semaphore in Figure 15.12, but when the value is negative, the corresponding semaphore in Figure 15.12 would have a count of zero. When negative counts are allowed, they show how many processes are waiting in the semaphore queue.

Most systems, of course, will package the wait operation as a procedure, but we show it as a coroutine here simply to avoid having to deal with the complex semantics that arises when a coroutine calls a procedure that then does a resume operation.

The wait coroutine requires that the identity of a semaphore be passed as a parameter, for example, using the notation resume wait(sem). This can be implemented by copying a pointer to the desired semaphore into a local variable of the coroutine each time it is resumed. Once the coroutine is resumed, it will either find that the semaphore count was positive, in which case, it will resume the original running task, or it will find that the semaphore count was negative, in which case it will enqueue the caller on the queue for that semaphore and resume some other ready process.

The signal procedure given in Figure 16.6 has logic which is almost the reverse of that given for wait, but because signal always returns to its caller, it is coded as a procedure instead of a coroutine.

The data structures shown in Figure 16.3 are insufficient to implement the kill operation on a process. The problem is that, when kill(p) is called, the process p may be either in the ready queue or in the waiting queue of some semaphore; in order to kill the process, it must be removed from this queue, but to do this, the identity of the queue in which it is waiting must be determined. Maintaining all queues as doubly linked lists would help solve this problem; alternately, the process description record could be extended with a pointer to the head of the queue in which the process is listed. Processes are not killed very frequently in most systems, so the kill operation need not be very efficient.

Alternative Scheduler Structures

Assuming that the queues used to implement the scheduler shown in the previous section are first-in first-out queues, the scheduler will implement a round robin scheduling policy. Priority scheduling can easily be done by replacing the first-in first-out queues with priority queues. The simplest implementation of priority queues simply stores the list of processes in each queue in order, sorted by priority. In this case, the enqueue operation would perform a linear search of the queue to insert new entries in the right place, and the dequeue operation would remove the head item each time.

It should be noted that a simple linear list implementation of priority queues is optimal for queues of fewer than about 10 elements, but for queues of more than about 30 elements, other implementations are likely to be much faster. The problem is that insertions in a sorted linear list take a time proportional to the number of items in the list, while data structures such as the heap used in heapsort allow priority queue operations to be done in times proportional to the log of the number of items in the queue.

There are a number of very sophisticated priority queue implementations that are even faster than the heaps used in heapsort, but it should be noted that the average ready-queue length on most systems, even large timesharing systems with hundreds of users, is not particularly great. Most of the time, most processes on most systems are blocked waiting for input or output operations to complete. As a result, the linear list implementation is usually the most appropriate.

There is some disagreement about whether semaphore queues should be sorted by priority when priority scheduling is used. Some definitions of semaphores require that first-in first-out queues be used, but what is really required is a queue which guarantees that every waiting process eventually reaches the head of the queue. If priority queues are used for semaphores and processes have fixed priorities, this property cannot be guaranteed unless it can be shown that the queue will occasionally be empty. If processes have priorities that rise as they wait, or if priorities are assigned based on the deadlines of the waiting processes, priority queues pose no problems, and indeed, they should be used on semaphores as well as for the ready-queue.

Some older priority schedulers are constructed on a radically different plan from the one outlined in the previous section. These are based on a master process list which never changes, except when processes are created or deleted. The master process list is sorted in order of priority, and each process has a flag indicating its state. Whenever it is time to schedule a new process, the master process list is searched in order of descending priority for a ready process. The primary advantage of this approach is that it makes it quite easy to find all processes in the system. Unfortunately, there are a number of problems with this approach. If the master process list becomes long, the operation of scheduling a new process can become quite expensive. If random or round robin scheduling of tasks with equal priority is desired, this requires a complex system to either reorder the master process list or remember which process was most recently run at each priority level.


An important programming error can occur in multiple process programs which was not possible in programs made of a single process. This occurs when every process in some group of processes waits for a condition which only some other process in the same group can signal. When this occurs, all processes in the group are said to be deadlocked. Deadlocks are a fundamental risk of concurrent or parallel programming in the same way that infinite loops are a fundamental risk of conventional or sequential programming.

Gridlock in a congested road system may be viewed as an example of a deadlock. A simple example of gridlock is illustrated in Figure 16.7.

        |       |
        |   |   |
        | S     |          N - car going north
        | S |   |          S - car going south
--------  S      --------  E - car going east
          S  W W W W       W - car goint west
  --  --         --  --
     E E E E  N
--------      N  --------
        |   | N |
        |     N |          Assume car's can't go in reverse.
        |   |   |
        |       |

Figure 16.7. Gridlock, an example of deadlock.

Each car involved in a gridlock is waiting to occupy the patch of pavement immediately in front of it, but some other car is already occupying that patch, so none of the cars can move. Once a gridlock has occurred, the only way to break it is for some car at the edge of the gridlock to back up, that is, for some participant to relinquish a resource (a patch of pavement) it had already claimed.

Unfortunately, processes in computer systems cannot generally run in reverse, so deadlocks must be broken by killing a process and signaling that each resource previously held by that process has become free. This corresponds to resolving gridlock at in a congested intersection by dynamiting some of the cars.

Deadlocks are frequently associated with critical sections in programs that consist of many processes that share variables with each other. In such a program, critical sections are typically protected by binary semaphores, that is, semaphores that only take on two values, zero and one. On entry to a critical section where some shared variable is manipulated, a process does a wait operation on that semaphore, locking out any other processes that might want to use that variable. On exit from the critical section, a process does a signal operation on that semaphore, restoring its value to one and allowing other processes to continue if they were waiting on the same semaphore.

An example of a deadlock prone system of processes is shown in Figure 16.8.

var a, b: semaphore;

coroutine p1;
          resume relinquish;
          resume wait( a );
          resume relinquish;
          resume wait( b );
             .  { critical section }
          signal( a );
          resume relinquish;
          signal( b );
end { p1 };

coroutine p2;
          resume relinquish;
          resume wait( b );
          resume relinquish;
          resume wait( a );
             .  { critical section }
          signal( a );
          resume relinquish;
          signal( b );
end { p2 };

begin { main program };
     newsem( a, 1 );
     newsem( b, 1 );
     activatep( p1 );
     activatep( p2 );
     resume terminate;
end { main program };

Figure 16.8. A deadlock prone pair of processes.

The example in Figure 16.8 also serves to illustrate the use of some of the coroutine based scheduler services defined in the previous sections. The processes p1 and p2 follow parallel paths; each is iterative, waiting on each of two semaphores, then doing something in its critical section before signalling the semaphores and doing something that doesn't involve any shared variables. Neither the shared variables nor the actual useful work done by these processes is shown, but we do assume that each processes relinquishes frequently, so that both have equal opportunities to run.

Of course, the example in Figure 16.8 is artificial. When only two processes are competing with a single critical section, there is no justification for using more than one binary semaphore. Only when there are multiple critical sections involving multiple variables is there a justification for using multiple semaphores, but then, the risk of deadlock becomes quite real.

If one of the processes shown in Figure 16.8 successfully claims both semaphores, there will be no deadlock. There is a chance, however, that the following sequence of operations will occur: First, p1 does wait(a), and then then p2 does wait(b). At this point, the counters in both semaphores are zero, so when p1 does wait(b) it will be blocked, with its state changed from running to waiting on semaphore b, and when p2 does wait(a) it will also be blocked. At this point, there is a deadlock!

Note that the deadlock in Figure 16.8 cannot be broken by simply deleting one of the processes. It is also necessary to perform signal operations on any semaphores that the blocked process had waited on in the process of getting stuck. Two simple modifications to the code in Figure 16.8 would have completely eliminated the risk of deadlock. First, we could remove the resume operations between the wait operations, but this only works in a non-preemptive system.

One solution to the problem works in all scheduling environments: Deadlocks involving processes which take exclusive use of resources can always be eliminated by requiring that all claims for resources be made in a specific order. In the example, this can be done by requiring that wait operations on the semaphores a and b always be done in the same order, for example, alphabetical order. Any order will suffice, so long as all processes in the system use the same order.

In a purely non-preemptive scheduling system where there are no interrupts, deadlocks which cause all of the processes to enter a waiting state can be easily detected by checking, in the code for wait, to see if the ready queue has become empty. This will not detect the fact that a subset of the processes are deadlocked, and it will not help when there are interrupts; of course, if there is an idle process that never waits, there will never be a global deadloc, so this is rarely useful. In interrupt based systems, it is frequently the case that all user processes are waiting for input or output operatons to complete. In this case, the signal operations will be done by interrupt service routines, and it will be quite common for the ready list to contain only one process, the idle process.

Automatic detection of deadlocks at the level of synchronizing primitives such as wait and signal is, in general, impossible. As long as there is one interrupt enabled or one ready or running process, the system cannot predict whether or not that interrupt's service routine or that process will eventually perform a signal operation on any particular semaphore. At a higher level, however, system software can detect and even avoid deadlocks. The key to deadlock detection and avoidance is to have all requests for resources pass through a centralized resource manager, sometimes called a monitor. At all times, the manager tracks the state of each client process, including the number of resources allocated to that client.

As an example, consider a resource manager for main memory. All user programs requesting memory must make their requests through this manager. For each client, the manager keeps a record of how much memory is currently allocated to that client, and if the client is waiting for memory, how much additional memory is needed. Assuming that fragmentation is not a problem, a client calling allocate(n) will not have to wait if n is less than the number of free words. If n is greater, the client will have to wait until that many words are available. Words are made available when some other process deallocates them. Since a process which is waiting for memory cannot deallocate what it currently holds until it is given what it is waiting for, memory held by waiting processes cannot be counted on to become available to unblock other processes. As a result, whenever the amount of storage requested by a process is greater than the sum of the free storage and that held by other non-blocked processes, we are at risk of deadlock!

Deadlocks such as the one described above can be avoided by a technique known as the banker's algorithm in which each client process is given an additional attribute, the resource limit of that process. Resource limits are analogous to credit ratings; they state the maximum amount of a resource which a process is allowed to obtain. The difference between the resource limit of a process and the amount currently allocated to that process is the potential claim of that process on the system. The banker's algorithm requires the assumption that every process will eventually deallocate all of the resources it allocated.

The banker's algorithm defines a system as being unsafe if a group of processes making legal requests (under their limits) would lead the system to deadlock; it is possible, of course, that no such requests will be made, and thus, that there will be no deadlock. The banker's algorithm operates by blocking processes making allocation requests when those requests would lead the system into an unsafe state, even if there is sufficient memory to satisfy the immediate request.

In detail, the banker's algorithm operates as follows: Assuming that the current state is safe when a request for resource allocation is made, temporarily allocate the requested amount and then check to see if the resulting state is safe. To find if a state is safe, assume that the process with the smallest potential claim requested its entire claim; if this can be granted from the current free resources, the original state is unsafe. If it can be granted, assume that the process will complete and return everything which was allocated to it, so provisionally deallocate all of its resources and see if the resulting state, minus that process, is safe. If this can be continued until all processes are done, the original state was safe and the resource allocan request can actually be granted. If it is determined that the state was unsafe, the process making the allocation request is putting the system at risk of deadlock.

What should the system do if it determines that granting a user's request would put the system in an unsafe state where deadlock could result? The usual answer is, the system should block that user, refusing to grant it the resources it has requested until such a grant can be made without putting the system in an unsafe state. It can be proven that this rule leads to a system where there will be no deadlock, under the assumtion that all blocking involves contention for resources, and under the assumption that there is only one resource type, for example, memory.

Deadlocks involving contention for shared resources can always be dealt with if the resources are preemptable; that is, if the resources can be taken away from one use and given to another. Preemption is not a very interesting solution to the deadlock problem if it involves terminating the process that claimed the resources which were preempted, but this is not always necessary. For example, some page replacement algorithms attempt to assign enough page frames to each process that the entire working set of each process can always be held in memory. A simple approach to this would involve a risk of deadlock when the sum of all of the working sets exceeded the number of frames available, but this deadlock is easily eliminated by preempting frames held by one process and giving them to others.

Deadlocks are surprisingly common in large multiuser database systems, and the usual way of resolving them is to preempt the resources claimed by one user. This does not involve terminating a proces, though! Instead, each database update is arranged as what is called a transaction. As the update progresses, it claims items in the database, and these claims may result in a deadlock. If such a deadlock is detected, for example, because some transaction has not been completed by an appropriate time limit, the transaction is aborted, allowing the transactions with which it is competing to complete. When a transaction is aborted, the process trying to make that update simply waits a bit and then retries the transaction.

Preemptive Scheduling

The data structures used by preemptive schedulers do not differ greatly from those illustrated previously for non-preemptive schedulers. The key difference is in the code itself. Preemptive schedulers rely on a real-time-clock to request an interrupt whenever it is time to preempt one process in order to schedule another. The simplest real-time-clock hardware requests an interrupt each time a set interval has elapsed. For example, many older machines had a power line clock which requested an interrupt 60 times a second, using the 60 cycle per second alternating current from the power line as a time base.

In Europe, power lines run at 50 cycles per second, so power line clocks on that continent run somewhat slower. Whether the power line is 50 or 60 cycles per second, it is worth noting that utility companies make a major effort to keep the frequency of the electric power they deliver constant; in the United States, the interconnected electric utility companies maintain an accuracy over the long run of better than 1 part in a million, so systems using power-line-clock hardware are generally quite accurate.
More complex systems have what are called programmable real-time clocks, that allow the interrupt frequency to be set by software, or programmable interval timers, that allow a register to be loaded with a time interval, and produce an interrupt when that amount of time elapses. Yet others have time of day clocks which report the current time of day and include an alarm register so that an interrupt can be requested at a specific time.

The typical IBM PC compatable has three programmable interval timers compatable with the old Intel 8253 chip. This chip contains 3 16-bit counters that are decremented every 0.838 microseconds (1.19318 MHz). These counters can be used as programmable interval timers or programmable real-time clocks, as well as several other functions. Counter zero is wired to request an interrupt when it expires, while counters 1 and 2 are used for other purposes (dynamin RAM refresh and crude tone generation).

The classic Microsoft DOS operating system makes only minimal use of the real-time clock, setting it to request an interrupt every 65536 clock cycles, which comes to 18.206 interrupts per second. This interrupt source is used to count time, in units of 1/18.206 seconds, counting from midnight, and software conversion routines convert this to time of day in hours, minutes and seconds.

The typical PC compatable also contains a second real-time clock, running off of a battery. This clock counts time in units of years, months, days, hours, minutes and seconds, and typically, one of the things the operating system does as it starts is to read this clock and use it to set the time and date clock that is maintained, from then on, by the programmable interval timer.

However the real-time-clock hardware comes to produce an interrupt, on systems where this interrupt is used for scheduling, the real-time-clock interrupt service routine must perform essentially the same operations as are performed by the relinquish service of a non-preemptive scheduler. Thus, the clock interrupt service routine must save the program counter and other registers of the interrupted process, put that process on the ready queue, and then take a ready process from the ready queue and return to it as if it was the interrupted process. There is no way for any particular process to tell when it will be interrupted and preempted in order to allow another process to run. As a result, it is possible that any number of actions taken by other processes will occur between any two consecutive instructions executed by that process.

It should be noted that the scheduler interrupt service routine (the clock interrupt service routine) is unlike any other interrupt service routines on most systems. All other interrupt service routines can usually be viewed as procedures which are implicitly called by the interrupt hardware. When any other interrupt service routine has completed its job, it returns to the point in the code where the processor was executing when the interrupt occurred. The scheduler interrupt service routine is called the same way any other interrupt service routine is called, and it returns using the same return sequence, but it does not return to the point of the call because it has performed a context switch, allowing it to return to the point in some other process where that process was most recently interrupted.

The situation under a preemptive scheduler on a uniprocessor is not radically different from the situation on a multiprocessor where there at least as many processors as processes. On a uniprocessor, as mentioned above, any number of instructions executed by other processes will sometimes be executed between two consecutive instructions of any process. On a multiprocessor, it is likely that one instruction will be executed on every processor in the system between every pair of consecutive instructions in every process. If programs are written under the assumption that all ready or running processes execute instructions at unpredictable and irregular rates, they should work equally well in both environments. The primary difference between the two environments is one of performance: Ten independent processes running on a preemptively scheduled uniprocessor will run at least ten times slower than on a system with ten separate central processing units.

The code for system services such as wait, signal, and relinquish on a preemptively scheduled system or a multiprocessor system will change considerably from what was shown previously for a non-preemptive scheduler. The reason for this is that the possibility of preemption between any pair of consecutive instructions must be considered in coding these services. Whenever a sequence of instructions is found where such preemption cannot be tolerated, that sequence is a critical section during which interrupts must be disabled.

Critical sections within processes running under the a preemptive scheduler may be protected using the wait and signal operations on semaphores, but critical sections within the implementation of wait and signal cannot be protected this way! Thus, lower level synchronization primitives must be used to implement the primitives offered by the scheduler. On a uniprocessor, it is common to use instructions which enable and disable interrupts to guard small critical sections within the scheduler, but this does not suffice on a multiprocessor. The next chapter contains an extended discussion of these issues.

High Level Schedulers

The central processor schedulers discussed up to this point are usually classified as low level schedulers, since they deal only with the immediate question of what process should run next, with no attention to the history of a process. Another class of scheduling problems occurs at a higher level, concerning such issues as which process should be started next and what priority each process should run at. Such problems are solved by high level schedulers. A common characteristic of high level scheduling is that it has no effect on the meaning of programs but has a great impact on system performance. Low level scheduler operations such as wait and signal have a great impact on the meaning of a program, but changes to the priority of the program or changes to the time that program begins execution usually have little or no impact.

The oldest high level scheduling problem has to do with deciding which jobs to run in what order on a batch oriented computer system, that is, a system where users submit batches of programs to be executed and then come back later to find the results. Historically, jobs would be submitted for such systems in the form of decks of punched cards, but equivalent operating modes are still useful for certain applications. Instead of submitting a card deck and picking up a stack of printout a few hours later, todays batch subsystems allow a command-language script to be submitted, and the results are delivered as an on-line file instead of a stack of printout.

On batch systems, user requests for any computation are placed in a job queue, from which the system selects jobs for execution. If a large part of a system's work comes from the job queue, the performance of the system can depend crucially on how many jobs from the queue the system allows to run at the same time, and on how those jobs are selected from the queue.

Typical batch systems have a structure such as is illustrated in Figure 16.9.

               submission     scheduling

   interactive  ---                --->  batch
   user process    |              |     process
                   |              |
                    --->  job  ---
   interactive  ------->       ------->  batch
   user process     ---> queue ---      process
                   |              |
                   |              |
   card reader  ---                --->  batch
     process                            process

Figure 16.9. The structure of a batch system.

The job queue is a queue of job description records. Each such record contains, at minimum, the identity of a command language file or shell script describing what is to be done by that job. The record usually also contains the identity of the user who submitted the job and various information about the resources needed to perform that job or other constraints. When a new job is read from the card reader on an antiquated system, or when one is submitted by an interactive user on a modern system, a job description record is created.

In purely card based systems, the card reader process was frequently quite complex. The reason for this was that a job could consist of many sub-files. For example, a typical input card deck might contain the text of a program to compile, the test data for the compiled version of the code, and the command language requesting compilation, linking, loading, and execution of the program. Typically, the card reader input process would be responsible for separating these various items, storing each in a different temporary file, editing the command language text to contain the file names for each of the other files, and submitting the batch request itself.

The card reader process was frequently called the input spooler because, on early computer systems, the batch queue was frequently maintained on its own spool of paper (and later magnetic) tape. IBM had fun with this, coining the 'backronym' (backwards acronym) Simultaneous Peripheral Operation On Line to explain the use of the term SPOOL.

An output spooler process handles a far simpler problem, accepting requests for files to be printed from user programs and printing them. process which handles a queue of files to be output, for example on a printer, and handling the problem of printing each in turn. A complex output spooler might handle such jobs as distributing the printing load over multiple printers or assigning priority to different print jobs depending on the number of pages to be printed.

In the oldest systems, the jobs of input queue management and high level scheduling were all clerical jobs. User card decks would accumulate in trays in the input area, and clerks would examine these jobs and construct batches of jobs to be run concurrently.

The batch processes in Figure 16.9 take jobs from the batch queue and run them. These processes were called initiators on IBM's classic OS/360. The running jobs compete for system resources; if there are too few jobs running, some resources may be under-utilized, but if there are too many jobs, the competition may slow the entire system. The problem of the high level scheduler is to provide jobs to the user processes in such an order that all system resources will as uniformly used as is possible, under the constraints imposed by the system's owners.

Example policies the owners might insist on include giving short jobs higher priority, or that users willing to pay more for their resources should get faster service. Such policies may be in conflict with maximizing resource utilization. For example, most university computer centers in the 1960's and 1970's had policies favoring the short jobs run by students in introductory computer science courses. These jobs typically perform almost no computation, but produce large amounts of compiler diagnostic messages and debugging output. As a result, the input/output resources of such computer systems were frequently saturated while the processors spent quite a bit of time idle. The best system utilization could be obtained if the student jobs were balanced by some fraction of compute bound jobs such as simulation runs, but allowing too many such jobs during times of peak student demand would reduce the number of student jobs which could be run!

A high level scheduler needs certain information about a job in order to decide when that job should be run. This information consists essentially of a forecast of the resource needs of that job. In the era of punched cards, this information was generally summarized on a special card, the job card, that was the first card in the deck of cards for that job. The high level scheduler used the information from the job card to assign a priority to the job in the job queue, and possibly also a priority for low level scheduling when the job finally runs, although this was rare. The resource limits listed on the job card could also be used by deadlock avoidance schemes such as the banker's algorithm.

Asking user's to make accurate forecasts of the computing resources needed by their jobs isn't always easy, but users learn quickly to make decent guesses. Classic batch schedulers almost always enforced the user's estimates as strict upper limits on resource usage, so users who underestimated their resource usage frequently got no results at all. Users who overestimated their resource requirements would typically earn low priorities and slow service as a consequence, and between these two penalties, users typically became quite good at predicting what they needed.

As might be guessed, the formulas used to assign priorities in the job queue grew quite complex. Typically, they were the result of empirical testing; in fact, an entire specialty area within computer science grew around the art of adjusting these formulas on the basis of performance measurements. Furthermore, the simple priority based approaches to job queue management frequently allowed user jobs to be deferred indefinitely when there was an unlimited supply of higher priority jobs. For example, consider the fate of a large simulation job on a system used by students in introductory computer science classes.

It should be noted that the high level scheduler for a batch system may be identified as a separate process, but that it may equally well be distributed between the enqueue operations used to put jobs in the batch queue and the dequeue operations used by each batch job execution process to get jobs to run. It is quite natural to think of a process as a unit of abstraction and then demand that each abstract component of a system exist as a separate process, but the queues through which processes communicate are also abstractions!

There is little in the way of batch scheduling on most modern personal computer systems, but UNIX systems do have a mechanism with much the same structure, hidden behind the at command. This command allows a user to schedule a shell command to be run at some fugure time of day on some date. The at queue is essentially a batch queue, sorted by the time at which the job is to be run. Queue entries contain, in addition to the time, the identity of the user who submitted the job, along with the current directory and shell environment that was in effect at the time the job was submitted. As a result, when the time comes, the requested job will run as if the user had simply waited until the indicated time and then typed the command.

High Level Schedulers in Interactive Systems

A second major high level scheduling problem involves the scheduling of interactive users on timesharing systems. On simple timesharing systems, the number of interactive users allowed is determined by the number of terminals which may be physically connected to the system, and all users run at the same priority, using a round-robin scheduler. The problem with such systems is that different users have significantly different computational demands. Users who are editing text require much less computational power than those who are compiling programs, running simulations, or formatting text for a laser printer. Thus, the response time, or time it takes the system to react to a user's input, may vary considerably depending on what the other users of the system are doing.

The goal of high level scheduling policies on timesharing systems is usually to provide a guaranteed response time to each interactive user in some class. Typically, this will be at the expense of other users or potential users. For example, a typical policy might be that users who are running jobs with computational demands typical of text editing will be guaranteed a response time of 0.2 seconds.

Various human factors research studies have shown that response times slower than this are perceived as annoying by users who have ever experienced faster response. In fact, the efficiency of users performing simple clerical tasks is significantly degraded by response times slower than 0.1 seconds.
To achieve the goals of this policy, it may be necessary to turn away users even when there are enough terminals attached to the system, or it may be necessary to slow the response times of users who are placing greater computational burdens on the system. Users in the latter category are usually advised that, if their response times become unacceptably slow, they should run their programs as batch jobs or at some time of day when there are fewer users.

Limiting the number of interactive users can usually be done by adding a special process to the system which has, as its only responsibility, the job of periodically measuring and publishing (in some memory location) the response time. If the measured response time falls below the desired threshold, the program which handles new interactive users can start rejecting attempts to connect to the system. Of course, it would also be possible to forcibly eject users from the system in order to improve response, but this usually leads to hard feelings.

A priority scheduler is needed in order to enforce the policy that the response times of compute-bound users should be sacrificed, if necessary, in order to ensure that interactive users receive good response. In effect, this policy requires that interactive users run at a higher priority than compute bound users. In order to implement this, it is necessary to find a way of identifying
which processes are compute bound, and which are interactive. This can be done by examining the recent history of each process. For example, the system might keep a record of the fraction of processor time used by each process over some interval in the recent past. If there are n processes on the system, processes which used less than 1/n of the processor time might be given a high priority, while those which used more could be given a low priority.

The high level scheduler used in the classic UNIX system illustrates an alternative approach to solving this problem. When a process exits the running state voluntarily, calling the equivalent of a wait or relinquish operation, it's priority is nudged upward, while when a process is preempted, its priority is nudged downward. This policy biases the system towards the recent history of each process, while the policy suggested in the previous paragraph looks at longer term behavior.


The reader should be familiar with the following general terms after reading this section:

central processor scheduler     ready queue
centralized scheduler           idle process
decentralized scheduler         deadlock
preemptive scheduler            real time systems
non-preemptive scheduler        priority scheduler
process                         priority queue
task                            shortest deadline first scheduler
process control block           foreground
process states                  background
running process                 time-slice
ready process                   virtual processor
waiting process                 low level scheduler
terminated process              high level scheduler
job initiation                  batch job queue
round robin schedulers          job card

In addition, the reader should be familiar with the following system services of the example operating system:

relinquish                      wait
kill                            signal
terminate                       newsem


  1. Which of the state transitions shown in Figure 16.1 are initiated by the process to which they apply, which are initiated by the scheduler, and which are initiated by other processes?

  2. Consider the following implementation of the sleep and wake-up primitives: The sleep operation simply schedules the next ready process; the calling process is not entered into any data structure; wake-up(x) puts process x in the ready queue. Both the ready queue and the semaphore queues are singly linked. What serious errors are likely consequences of applying the wake-up operation to a running process.

  3. Consider the following implementation of the sleep and wake-up primitives: A semaphore (initially zero) is associated with each process (e.g. as part of the control block); the sleep operations does a wait on this semaphore; wake-up(x) does a signal on the semaphore of process x. How does this implementation deal with the wake-up while running problem?

  4. Consider a uniprocessor system with a round-robin preemptive scheduler using a fixed length time-slice. What is the longest time-slice which can be used if a response time of 0.2 seconds is to be guaranteed for non-compute-bound applications when there are 20 active processes?

  5. If one instruction is executed every microsecond, and there are 100 context switches every second, what fraction of the CPU time is available for user computation, assuming:

    a) a context switch takes 20 instructions?

    b) a context switch takes 200 instructions?

    c) a context switch takes 2000 instructions?

  6. Write the enqueue and dequeue routines assumed by the code in Figures 16.4 through 16.6 using the data structures shown in Figure 16.3.

  7. Modify the data structures shown in Figure 16.3 as needed to implement a priority scheduler. Assume that the activate operation specifies the priority at which the new process is to run. Write the code for the enqueue and dequeue operations needed to make the ready list into a simple implementation of a priority queue.

  8. Modify the data structures shown in Figure 16.3 and the code in Figures 16.4 through 16.6 to allow for the kill operation. Write code for your kill operation.

  9. Write the code needed within the allocate and deallocate operations to maintain the records needed for deadlock detection. (Assuming that there is no fragmentation problem, this will involve a counter of the number of free words and a counter per process of the number of words allocated to that process, as well as information about the state of each process). Modify the data structures of Figure 16.3 to include any data which must be associated with each process in order to solve this problem. Emphasize the following issues in planning your solution to this problem: When to wait, when to signal, when to return nil to a user in order to signal that that user's request would have caused a deadlock.

  10. Write a boolean function, safe(free,status), to determine if the current configuration is safe using the Banker's algorithm. Assume that the current configuration is defined by the following data structures:
    free: integer { a count of the free resources };
    status: array [1..processes] of record { for each process }
                 current: integer { the current allocation };
                 limit: integer   { the maximum allowed allocation };

  11. Why do 10 independent processes on a preemptively scheduled system run at least 10 times slower than they would on a 10 processor system instead of exactly 10 times slower (assuming that both systems can execute instructions at the same rate)?

  12. Where are the critical sections in Figures 16.4, 16.5 and 16.6 where an interrupt causing an involuntary resume relinquish would cause trouble? Assume that the enqueue and dequeue operations are atomic, indivisible operations. Note that there are at least two classes of problems which you can look for: Those where an interrupt during the critical section would result in a process being listed as being in more than one state at a time, for example, running and ready, and those where the count in a semaphore could be incorrectly managed.

  13. Expand on Figure 16.9 by redrawing this with a print spooler process and a queue of print jobs, showing arrows from each process which may request a print job to the print queue.



Any respectable operating system text should have good coverage of the material in this chapter.

The structure of a short-term scheduler is covered quite well in Chapter 4 of

Operating System Principles, by Per Brinch Hansen (Prentice Hall, 1973).
Chapter 6 of Hansen's book contains detailed analysis of some medium and long term scheduling policies, and there is a clear introduction to the Banker's algorithm in Sections 2.6.1 through 2.6.3, while Section 3.5 provides good coverage of deadlocks.

Additional material on process management and deadlocks is covered in Chapters 7 and 8 of Shaw's

The Logical Design of Operating Systems, by Shaw (Prentice Hall, 1974).
This material is somewhat more theoretical than the introductory material in Hansen's text.

Material on these subjects at a level of detail comparable to the level of the material covered here is given in Chapter 4 of

Operating System Elements, by Calingaert (Prentice Hall, 1982).

More extensive coverage of these issues is given in Chapters 4 and 8 of

Operating System Concepts, by Peterson and Silberschatz (Addison Wesley, 1983).

Those interested in the early history of multiple processor and multiple process machines might be interested in a description of one of the first multiple process batch systems:

"The Atlas Scheduling System," by Horwath, Jones, and Wyld, in The Computer Journal, 5, 3 (October 1962), pages 238-244.
For a description of one of the first time-sharing systems, see
"A Time-Sharing Debugging System for a Small Computer," by McCarthy, Boilen, Fredkin and Licklider, in the Proceedings of the 1963 Summer Joint Computer Conference, AFIPS Conference Proceedings, Vol. 23, pages 51-57.
For a description of one of the first multiple processor systems, see
"The D825 Automatic Operating and Scheduling Program," by Thompson and Wilkinson, Proceedings of the 1963 Summer Joint Computer Conference, AFIPS Conference Proceedings, Vol. 23, pages 41-49.

Finally, no reading of early papers on multiple processor or multiple process systems is complete without a careful examination of M. E. Conway's paper,

A Multiprocessor System Design, by M. E. Conway in Proceedings of the 1963 Fall Joint Computer Conference, AFIPS Conference Proceedings, Vol. 24, pages 139-146.
This paper includes one of the first clear statements of the distinction between processes and processors, and it also includes an early description of a set of supervisor services which allow a program to spawn multiple cooperating processes.

The classic discussion of the shortcomings of rate monotonic scheduling and the advantages of deadline based scheduling is in

"Scheduling Algorithms for Multiprogramming in a Hard-Real-Time Environment," by Liu and Layland in the Journal of the ACM, 20, 1 (January, 1973) pages 46-61.

A survey of various priority queue implementations, along with the results of numerous performance measurements, is contained in

"An Empirical Comparison of Priority-Queue and Event-Set Implementations," by D. W. Jones, in the Communications of the ACM, 29, 4 (April 1986) pages 300-311.