22C:116, Lecture 24, Spring 2002

Douglas W. Jones
University of Iowa Department of Computer Science

Multiprocessor Operating Systems
Porting an operating system to a symmetrical multiprocessor can be easy, if the operating system was constructed using "textbook methods". The textbook methods that matter are:
- Hierarchical design, with the process manager at the root of the hierarchy, where the process manager includes support for semaphores.
- All critical sections outside the process manager guarded by semaphores and not by ad-hoc synchronization mechanisms.
- Minimal reliance on enable-interrupts and disable-interrupts to guard critical sections except in the process manager.
- Fine-grained mutual exclusion. That is, whenever possible, instead of locking out access to an entire data structure (table, tree, queue), locking should be at the level of the individual item.
Violations of these rules are common in uniprocessor operating systems. If all kernel code is privileged, and if some critical section is very short, the system programmer on a uniprocessor will naturally use disable-interrupts to enter a critical section and enable-interrupts to exit that critical section. Code written this way will port very poorly to a multiprocessor.
The reason for this is: If there are multiple CPUs, disable interrupts does nothing to halt other CPU's, and if we care about performance, we don't want to halt other CPUs! When one CPU needs to update a shared data structure, it must only block those other CPUs that are trying to update the same data structure!
Basic Rules
When porting uniprocessor code to a multiprocessor, wherever the uniprocessor code had:
```
	disable interrupts
	.. critical section ..
	enable interrupts
```
The corresponding code on the multiprocessor will have this code:
```
	disable interrupts     -- prevent other activity on this CPU
	wait on spin lock      -- block if other CPU in critical section
	.. critical section ..
	release spin lock      -- allow other CPUs into critical section
	enable interrupts      -- allow other activity on this CPU
```
In the worst case, any critical section guarded by disable-interrupts on the uniprocessor will wait on a global lock variable. The spin-lock can be implemented using any available locking primitive. In the worst case, somethign as crude as Dekker's solution might apply, but if the machine has instructions supporting better solutions, they should be used.
- The machine has test-and-set
  Note that the particular values that mean set or reset depend on the test-and-set instruction. The only way to find this is to read the hardware manual. Reset means that the lock is open, while set means that the lock is closed, blocking entry to a critical section.
```
	wait on spin lock (i)
		loop
		    cc = test-and-set(i)
		until cc indicates that i was reset

	release spin lock (i)
		i = reset
```
- The machine has exchange memory and register
  Here, we have the freedom to pick the values that mean open and closed for the lock.
```
	wait on spin lock (i)
		register r = closed
		loop
		    exchange-memory-and-register(r,i)
		until r = open

	release spin lock (i)
		i = open
```
- The machine has load-locked store-conditional
  r = load-locked(i) loads from memory location i into register r and sets auxiliary hardware in the CPU to work monitoring location i to see if any store operations to i are executed. This monitoring is called snooping on the memory bus.
  cc = store-conditonal(r,i) stores the contents of register r into memory location i, but only if the snooping mechanism indicates that there has been no store to location i since the most recent load-locked, and only if the snooping mechanism is currently snooping for references to locaiton i. The success or failure of store-conditional is returned in some kind of condition code.
```
	wait on spin lock (i)
		register r = closed
		register s
		loop
		    s = load-locked(i)
		    cc = store-conditional(r,i)
		until (r = open) and cc indicates success

	release spin lock (i)
		i = open
```
  In the above, load-locked and store-conditional were used to simulate exchange-memory-and-register. In fact, they can be used for much more. For example, to atomically increment an integer in location i of memory, we can use:
```
	atomic increment (i)
		register r
		loop
		    s = load-locked(i)
		    i = i + 1
		    cc = store-conditional(r,i)
		until cc indicates success
```
  If no other process does a store in the indicated variable, this will run in only one more instruction than a normal load-increment-store sequence. Only in the unlikely case that another process touches the memory location during this very short critical section does the loop iterate. Most modern RISC CPU's support this model!
When porting uniprocessor code to a multiprocessor, the interrupt mechanism of the multiprocessor must be inspected! There are two common possibilities:
- Only one processor services interrupts.
  In this case, the hardware designer took the easy way out, and the machine is not truly a symmetric multiprocessor. Since all interrupts are served on the same CPU, there are no new or special interactions between interrupt service routines, so the only interactions that must be considered are between interrupt service routines on one processor and non-interrupt service code running on other processors.
  Within each I/O driver, critical sections involving data shared between the interrupt service routine and the user-side access routines for that driver must be identified and guarded by spin-locks. Ideally, a single global lock variable should be avoided, and instead, locks local to the driver should be used -- there is no need to block the ethernet driver when the SCSI interrupt service routine is exploring the disk schedule.
- All processors services interrupts.
  In this case, the hardware designer did extra work to allow interrupts to be handled by different CPU's in parallel. For example, a mechanism could have been added to the system so that successive interrupts are delivered to successive CPU's in round-robin order, or a mechanism could be added so that each CPU can declare its priority level, and all interrupts go to the CPU that is running at the lowest priority.
  In this case, we need to audit the code for the interrupt handlers to check all references to global resources such as buffer pools. Any use of such a resource must be guarded by spin-locks. Again, it is better if there is a lock to guard each shared resource (such as the head of the linked list of free buffers) instead of one global lock.
Trap service routines always run on the same processor that executed the instruciton that caused the trap. Therefore, all CPU's execute trap service routines, and it is most productive if they are treated as user code running under the authority of the scheduler as part of a user process (but executing in the kernel).
User Code
When user processes must deal with shared data, in a uniprocessor, they will usually use semaphores or some equivalent mechanism supported by the kernel. Assuming correct kernel implementation, this will generally work correctly on the multiprocessor, so it requires no changes, however, changes to user code may allow better multiprocessor performance.
Inside trap-service routines, however, there may be places where the code running on behalf of a user process uses disable-interrupts to guard some critical section. Spin locks must be added to each such critical section!
In general, both user and kernel code running on behalf of users has two options to guard each critical section: spin-locks and semaphores (or equivalent mechanisms that involve the process manager). The choice between these is should be made as follows:
- Low contention critical sections should be guarded by spin locks.
  These critical sections are usually short, and the probability that a process will be blocked on entry to these sections is small.
  The key thing to note is that operations on semaphores involve transfer of control to a scheduler routine, while operations on spin locks involve only a few machine instructions if there is no contention. Therefore, the overhead of spin locks is far less if there is usually no contention.
- High contention critical sections should be guarded by semaphores.
  These critical sections are usually long, and the probability that a process will be blocked on entry to one of these sections is high.
  The key thing to note is that waiting on a spin-lock monopolizes the CPU on which the blocked process is running, while semaphores allow other processes to run. Therefore, use of semaphores will allows more useful work to be done if the likelyhood of blocking is high.