22C:116, Lecture Notes, Lecture 5, Fall 1999

Douglas W. Jones
University of Iowa Department of Computer Science

Other Mutual Exclusion Mechanisms
The problem of mutual exclusion on multiprocessors where there are no atomic exchange or test-and-set operations has been the subject of years of study, during which many of the most ingeneous solutions to the critical section problem have been found. Dekker's solution is oldest of these. Each spin-lock is represented by three variables:
```
state[0..1] -- process i sets state[i] to enter section
turn        -- indicates who's turn it is to try next
```
Given this, and that all variables are initially zero, the code for the lock and unlock operations is as follows, assuming that each process passes it's process ID to the lock and unlock operations:
```
other(pid)
   return( 1 - pid );

lock(pid):
   state[pid] = 1;
   while state[other(pid)] = 1 do
      state[pid] = 0;
      while turn <> pid do nothing;
      state[pid] = 1;
   end loop;
   return;

unlock(pid):
   state[pid] = 0;
   turn = other(pid);
   return;
```
When A is inside the critical section, state[A] is one and state[B] is zero. If both try to enter at the same time, so both state variables are set to one, they both enter the loop waiting for the other to reset their state to zero. One will immediately reassert their claim, while the other is blocked a waiting its turn. Furthermore, on exit from the critical section, each process always sets turn to give priority to the other process, preventing one or the other from hogging the critical section.
This code is difficult to understand! Furthermore, it is difficult to generalize to more than two processes (consider, for example, building a binary tree of processes to generalize it!). There are many more solutions to this problem (consult your textbook for some!)
Although many modern RISC processors don't have test-and-set instructions, they don't necessarily force use of code such as the above. For example, the DEC Alpha processor has pair of instructions, load-interlocked and store-interlocked that allows the user to create an an instruction sequence that appears atomic, even though it isn't. The key is that a sequence of instructions starting with a load-interlocked instruction and ending with a store-interlocked instruction addressing the same location will either operate as a critical section or do nothing, reporting a failure.
To do this, the CPU contains a special register that contains the address of the location being interlocked. This is set by the load-interlocked instruction, and if this register detects use of the same location by another processor, the following store-interlocked instruction becomes a no-op and the error condition is reported.
Producers and Consumers
Mutual exclusion is not enough. Low level control over the sharing of data is essential but insufficient to allow productive parallel programming.
Consider a situation where one process produces data to be consumed by another:
```
    Producer --> Consumer
```
The producer can easily fill a shared variable, then let the consumer take the data, but controlled sharing is not enough. There must be some way for the producer to signal to the consumer that data is available.
Signalling as an Abstraction
One way to view this signalling problem is to think of the fork and join primitives.
```
   Process A             Process B
       |                     |
    Produce                  |
   some data                 |
       |                     |
       V                     |
     ----- fork = signal(s)  |
       | \                   |
       |   ------(s)------   |
       |                   \ |
       |  s is a semaphore  ||
       |                  __VV__ join = wait(s)
       |                     |
       |                  Comsume
       |                  the data
       |                     |
```
The control path between the two processes can be considered to be a short-lived process that serves to synchronize the producer and consumer, or it can be considered to be a synchronizing signal of some type. (This understanding of synchronization as being theoretically equivalent to control flow is a relatively new insight!)
In operation, if process A produces data before process B is ready to consume it, the signal reaches the join and waits for process B to arrive at that point. If Process B is ready to consume before process A has produced any data, Process B waits at the join for the signal to arrive.
Queues
What kind of data structure should be used to communicate between the producer and consumer? There are many options, but all boil down to variations on the theme of bounded buffers.
A bounded buffer has a data capacity that may vary from 1 item (where it becomes equivalent to a simple variable) to some large number. If more than one item is allowed, the buffer may have either a linked list or an array implementation. Here, we'll look at the array implementation:
```
   ________________________________
  |  |  |\\|\\|\\|\\|\\|\\|  |  |  |
  |__|__|\\|\\|\\|\\|\\|\\|__|__|__|
   Empty |     Data        | Empty
         |                 |
       Head               Tail
```
The operations on this structure are simple, if we ignore problems with preventing enqueueing data in a full queue or dequeueing data from an empty queue:
```
  enque(d)
     buffer[tail] = d
     tail = tail + 1 (modulo queue size)

  dequeue(d)
     d = buffer[head]
     head = head + 1 (modulo queue size)
```
We can detect empty and full queues fairly easily:
```
  empty := (head = tail)
  full := (head = tail)
```
But, these conditions are the same, which poses problems! Despite these problems, this approach to detecting the buffer full and buffer empty conditions is actually used, on occasion, primarily in low-level applications such as communication between processes and interrupt handlers.
The usual solution to the ambiguity is to redefine the full-queue condition as:
```
  full := (head = tail + 1) (modulo queue size)
```
Exercise: Propose other solutions to detecting the full and empty conditions in a circular buffer, perhaps by counting the data elements in the buffer. Compare your solution with the one proposed above! Does it take more or less space? Does its use imply a run-time cost? Do your answers depend on the size of the enqueued data items, and if so, how?
Shared Queues
When a queue is shared between multiple processes, there are two different problems. One is assurance of mutual exclusion, so that, for instance, no two processes try to enqueue an item of data in the same queue slot, and the other is keeping track of the number of available data items and free spaces. All of these can be accomplished by semaphores:
```
  Semaphores used to provide mutual exclusion:
     headlock -- protects the head pointer
     taillock -- protects the tail pointer

  Semaphores used to count available resources:
     free -- counts number of free spaces
     data -- counts number of enqueued data items
```
Given these semaphores, with all mutual exclusion semaphores initialized to one and the counting semaphores initialized to reflect the initial state of the queue, the operations on a shared queue can be implemented as follows:
```
  enque(d)
     P(free)     -- wait for free space to put data
     P(taillock) -- wait for exclusive use of tail
     buffer[tail] = d
     tail = tail + 1 (modulo queue size)
     V(taillock) -- release the tail
     V(data)     -- signal that data is available

  dequeue(d)
     P(data)     -- wait for data to be available
     P(headlock) -- wait for exclusive use of head
     d = buffer[head]
     head = head + 1 (modulo queue size)
     V(headlock) -- release the head
     V(free)     -- signal free space available
```
This implementation of shared FIFO queues is universally applicable, but there are some special characteristics worth noting:
First, the mutual exclusion locks are only needed if there are multiple producers or consumers. When there is only one producer process, that is the only process that will inspect and manipulate the tail pointer, so no mutual exclusion semaphore is needed to protect the tail pointer.
Second, the mutual exclusion operations shown above protect very brief critical sections, so it is unlikely that any process will wait long for entry. As a result, the best implementations of headlock and taillock involve busy waiting at a spin-lock, not semaphores implemented in the process manager.
Third, if the queues are very large, so there is a large probability that the queue will be neither full nor empty, the counting semaphores will rarely block the producer or consumer; thus, context switching will be very rare.
Variations on Queues
If the buffer has a capacity of one, the head and tail pointers are superfluous, as are the locks protecting them. In this case, the producer and consumer processes would be expected to operate in lock-step, producing one data item, then consuming it before the next is produced.
Exercise: Write the code for this limiting case for a bounded buffer queue. How many semaphores are required? What purposes do the remaining semaphores serve, and what range of values do they take on?
Lock step operation of the producer and consumer is frequently innefficient because unless the computational requirements of the two processes are identical, either the producer will spend time waiting for the consumer, or vicea versa.
From elementary queueing theory, we can predict the following: If the average time taken to produce an item is the same as the average time taken to consume an item, and the times vary randomly, the performance of the system will be maximized with an infinite queue capacity and an infinite average number of items in the queue. Furthermore, as the size of the bounded buffer is decreased, the percentage of the time that the producer spends waiting for space or the consumer spends waiting for data will increase smoothly. The conclusion is that, where memory allows, large buffers are preferable.
The bounded buffer with a capacity of two is a commonly used special case, particularly in 1960's vintage input/output subsystems. In that context, it is frequently described as "Double Buffered I/O".
Exercise: Write the code for this common case for a bounded buffer queue. How many semaphores are required if you assume a single consumer and a single producer? What purposes do the remaining semaphores serve, and what range of values do they take on?
When multiple FIFO queues are implemented using linked lists instead of arrays, it is common to have a single shared pool of free list items instead of a bounded supply of list items for each queue. For a given amount of memory, this can markedly improve performance, but it adds slightly to the complexity of the code. Specifically, instead of having a separate space semaphore to count the amount of free space in each queue, there is a single free space semaphore.
Exercise: Write the code for linked list queues in the general case where there are multiple queues where each queue may serve multiple producers and multiple consumers. Assume that there is only one type of object enqueued in any of the many queues -- call it a buffer, and assume that each buffer has a data field and however many links are required for the queue management scheme you use. How many semaphores are required per queue, and how many additional semaphores are shared by all queues?
Other problems
Do not be misled into thinking that the mutual exclusion and producer-consumer problems are the only problems in parallel programming. They are the most important and most common problems, but there are many others:
The readers-writers problem is a common problem in file-system design. A file may either be closed, open for read access by any number of users, or open for write access by exactly one user. Implementing the read-open, write-open and close operations using semaphores and shared variables is an interesting exercise.
Deadlock, the situation that arises when multiple processes are blocked, each awaiting some action by another blocked process, is a common failing in parallel programs. There are complex algorithms for both deadlock avoidance and deadlock detection. We will talk about these later.
Schedulers, a Footnote
There are two basic kinds of low-level schedulers. In the lecture on schedulers, the ready queue was described as merely a memory with the enqueue and dequeue operations, and no description of the scheduling discipline. In fact, there are many alternatives:
If the ready queue is a FIFO queue (for example, a linked list of process descriptions with head and tail pointers), the scheduler is said to implement the round-robin algorithm, where all processes have equal priority for access to the CPU.
If the ready queue is a priority queue, where priority is a fixed attribute of each process, the scheduler is a simple priority scheduler. Such schedulers are commonly used in real-time operating systems. Typically, high priorities are given to processes that must meet frequent deadlines, while low priorities are given to processes with infrequent deadlines (this is called the rate monotonic priority assignment).
If the ready queue is a priority queue, where priority is adjusted in response to process behavior, we can develop an adaptive scheduler. For example, the classic UNIX scheduler would raise the priority of processes each time they voluntarily relinquished the CPU, perhaps to wait for I/O completion, while periodically lowering the priorities of all processes. The result is that interactive processes, which do little computing but frequently await input, would naturally rise to a higher prioriy, while compute bound processes would fall in priority. This provides excellent service to interactive users without the need for users to be aware that there is any kind of priority structure.
Another category of scheduler attaches, as the priority of each process, the deadline by which it must finish its operation. The process with the nearer deadline gets priority, and when a process finishes its operation, it either terminates or resets its deadline to the time at which it must finish its next operation. Deadline based schedulers are provably optimal for real-time systems, but they are rarely used in practice.