20. Distributed Systems

Part of the 22C:116 Lecture Notes for Fall 2002
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

General Concepts in Distributed Systems

There is a broad (and in fact, multidimensional) spectrum of distributed systems. To explore this spectrum, consider asking questions such as these: How many processors does the system contain? Do the processors share a common memory, or do they communicate only by exchanging messages? Are they managed by a single operating system that spans all of them, or does each machine run its own?

In answering these questions, four large classes of systems emerge: isolated uniprocessors, the typical subject of introductory courses, plus multiprocessors, networks of systems, and distributed systems.

To add to the confusion, the answers to some of these questions may change depending on the level of abstraction from which the system is examined. For example, some systems that appear to share memory at the user level are implemented on top of packet switched networks.

Multiprocessors

Some computer systems have multiple CPU's and multiple memories, but (under the constraints of some protection mechanism) any code running on any processor may access any location in memory. Some of these machines have a single ready list shared equally by all processors, while others assign processes to processors for any of a variety of reasons.

The University of Iowa has had a number of these systems since the mid 1980's. For many years, we had a number of Encore Multimax systems at the Weeg Computer Center, and now we have several large Silicon Graphics systems on campus that are good examples of such multiprocessors. The idea of building such multiprocessors dates back to work by Conway in the early 1960's, and such architectures were widely used on Burroughs mainframes from the mid 1960's onward. In the late 1970's, Encore and Sequent reintroduced such architectures in the marketplace, and today, such architectures are relatively common in high-end scientific computation. Dual and quad processor personal computers or desktop workstations have been sold since the early 1980's, and under operating systems such as Linux, MacOS X or Windows NT, use of an extra processor or two is no longer considered exotic.

It is important to note that, at some level of abstraction, all multiprocessor systems may be viewed as having a communications network. Typically, this network connects processors and memory, and it is typically possible to isolate transactions on this network that resemble messages. If the messages in the interconnection network are generated by hardware and concern individual memory references, we will classify such systems as shared-memory systems, while if the messages are at a higher level, generated by software at the user or system level, we will classify such systems as distributed systems.

The discussion that follows deals with symmetrical multiprocessors, that is, with multiprocessors where all CPUs have the same instruction set and an effort is made to divide the work evenly between all. The alternative to this is to designate one CPU as a primary CPU and view the others as special purpose coprocessors, peripheral support processors, etc. Most modern computer systems have large numbers of such secondary processors that are viewed by the operating system as parts of I/O devices.

Operating systems for multiprocessors frequently differ only slightly from systems for uniprocessors. The difference is that, where a uniprocessor had a single idle process on the ready list, multiprocessors must have enough idle processes to satisfy all processors when there is no work to be done, and the ready list can be shared equally by all processors. If a system is written with a uniform mechanism for entry and exit from critical sections, most of the changes required to move such a system to a multiprocessor are confined to this mechanism. This explains why Unix has been successfully ported to a number of multiprocessors.
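
As a rough sketch of that difference in C (the process and ready-list routines named here are hypothetical stand-ins, not code from any particular system), the change at startup amounts to creating one idle process per CPU instead of a single one:

	#define NCPUS 4                       /* number of processors (illustrative) */

	typedef struct process process_t;
	extern process_t *create_process( void (*body)(void) );   /* hypothetical */
	extern void ready_list_insert( process_t *p );            /* hypothetical */
	extern void idle( void );    /* loops forever doing nothing useful */

	void start_idle_processes( void )
	{
	    /* a uniprocessor would create just one idle process; a
	       multiprocessor needs one per CPU, so that every processor
	       always finds something on the shared ready list */
	    for (int i = 0; i < NCPUS; i++)
	        ready_list_insert( create_process( idle ) );
	}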

The other major difference between multiprocessor and uniprocessor operating systems is visible only in the lowest level kernel code. Where a uniprocessor operating system kernel could guard a critical section with, for example:

	disable interrupts

	-- critical code

	enable interrupts

The same critical code on a multiprocessor must be guarded by:

	disable interrupts
	spin-lock

	-- critical code

	spin-unlock
	enable interrupts

The interrupt enable/disable prevents other activity on the same CPU from occurring, while the spin-lock (using, for example, test-and-set instructions, Dekker's solution, Lamport's minimum latency solution, or any of a number of others) prevents other CPUs from interfering.
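
For example, in C, such a spin lock might be sketched as follows, using the GCC __sync_lock_test_and_set builtin as a stand-in for whatever atomic test-and-set instruction the hardware provides (the interrupt enable and disable are machine dependent and are only hinted at in comments):

	/* a minimal spin lock sketch, assuming the GCC atomic builtins */
	typedef volatile int spinlock_t;

	void spin_lock( spinlock_t *lock )
	{
	    /* atomically set *lock to 1 and get its old value; keep
	       trying until the old value was 0, meaning the lock was free */
	    while (__sync_lock_test_and_set( lock, 1 ))
	        ;  /* spin */
	}

	void spin_unlock( spinlock_t *lock )
	{
	    __sync_lock_release( lock );   /* atomically reset *lock to 0 */
	}

	/* inside the kernel, a critical section then looks like:
	       disable_interrupts();    -- machine dependent
	       spin_lock( &l );
	       -- critical code
	       spin_unlock( &l );
	       enable_interrupts();
	*/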

If all critical sections are clearly identified in the code, the conversion of a uniprocessor system to a multiprocessor system can be quite straightforward. This is why most modern systems, including Unix, Windows NT, and MacOS X, are all able to handle symmetrical multiprocessing.

There are several types of shared memory computer systems; the full details of how these are implemented are the subject of an architecture course, but a brief summary is worth including here:

Coprocessors or asymmetrical shared memory multiprocessors

If the operating system runs on one processor, while user applications run on others, we say the machine is based on an asymmetrical model. In this case, we say that the operating system runs on the control processor. If some applications also run on the control processor, we refer to it as the primary processor; in this case, we refer to the other processors as coprocessors.

Not all coprocessors are parts of multiprocessors. Floating point units, for example, are frequently described as coprocessors, but they do not generally execute an independent stream of instructions. The DSP (digital signal processor) chips in a typical modern PC are excellent examples of this category, as are some graphics processors.

This is not a new category! The CDC 6600 (the fastest computer on earth when Control Data Corporation introduced it in 1965) had 10 peripheral processors in addition to its central processor. The CPU had a 60-bit word and was very fast for its day; the PPs (peripheral processors) were slow and limited to 12- and 18-bit words (an awkward mix of word sizes!). The operating system kernel, to use modern terminology, ran largely on one of the peripheral processors; it scheduled user computation on the central processor and it scheduled I/O operations on all of the peripheral processors.

Symmetrical multiprocessors

When Conway first proposed building a multiprocessor, in his seminal papers on parallel programming in the early 1960's, he was thinking of a symmetrical model where all CPU's were equal and none could be said to uniquely run the operating system. The original implementations of symmetrical multiprocessors were based on crossbar switches:

             |    |    |
     CPU ----+----+----+--
             |    |    |
     CPU ----+----+----+--
             |    |    |
     CPU ----+----+----+--
             |    |    |
             M    M    M  (memory modules)

Crossbar switches have been built to connect up to 16 CPUs to 16 memory modules; the C.mmp system at Carnegie-Mellon was the best-known example at the high end of this range. This interconnection scheme has a cost that grows as O(n^2), so the price grows prohibitively as the system gets large. Note that each intersection in this switching network connects a full CPU-to-memory bus, not just one wire, but typically 32 or 64 data wires and 32 address wires! Despite this, 4x4 crossbar switches are fairly common!

An alternative way to build a large-scale symmetrical multiprocessor is to use fixed-size crossbar switches, for example, 2x2 switch modules, and route the path from CPU to memory through a series of layers:

             |  |                  |  |
     CPU ----+--+--    ------------+--+--
     CPU ----+--+--   |       -----+--+--
             |  |_____|     /      |  |_____ M
             |__________   /       |________ M
                         \/
             |  |        /\        |  |
     CPU ----+--+--     /  \-------+--+--
     CPU ----+--+--    /        ---+--+--
             |  |_____/        |   |  |_____ M
             |_________________|   |________ M

For the 2x2 example illustrated here, each CPU can be thought of as being at the root of a binary tree, where each memory module is a leaf of that tree and the crossbar switches serve as the internal nodes. The weaving together of these trees allows n CPUs to access n memory modules using a switching system with O(log n) layers of crossbar switches and a total of O(n log n) intersections in the crossbar switches. The best-known large-scale example of this architecture is the BBN Butterfly (although it departed from this topology in an important way); the Butterfly used 5 layers of 4x4 switches to connect 1024 processors to an equal number of memory modules. The name butterfly comes from one way of drawing the interconnection network; this network has also been called a banyan network because, like the botanical banyan tree, it has multiple independent roots.
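
To see how much cheaper this is than a full crossbar, the following small C program (an illustration using the numbers from the Butterfly example, not code from any real system) counts the layers, switch modules and crosspoints in a banyan network built from k-by-k switches and compares the crosspoint count with that of a single n-by-n crossbar:

	#include <stdio.h>

	int main()
	{
	    int n = 1024;   /* CPUs (and memory modules), as in the BBN Butterfly */
	    int k = 4;      /* each switch module is a k-by-k crossbar */

	    /* the number of layers is the log, base k, of n */
	    int layers = 0;
	    for (int reach = 1; reach < n; reach *= k)
	        layers++;

	    int switches = layers * (n / k);        /* n/k modules per layer */
	    int crosspoints = switches * k * k;     /* k*k crosspoints per module */

	    printf( "%d layers, %d switches, %d crosspoints "
	            "(a full %dx%d crossbar needs %d)\n",
	            layers, switches, crosspoints, n, n, n * n );
	    return 0;
	}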

Yet another way of building a symmetrical multiprocessor is to connect all of the processors to a shared bus:

     CPU    CPU
      |      |
   ---+------+------+------+--
                    |      |
                    M      M

This has a cost of O(n) for n processors, but if each processor makes r memory references per second, the bus must handle nr references per second. If the memory modules, CPU and bus interfaces are all built using the same technology, it will be difficult to use more than a very few CPUs with this approach. This was done in the 1970's by interfacing multiple low-performance microprocessors to high-performance busses and memory modules, exploiting the imbalance in technology between the first generation of microprocessors and the best available memory technology of the era.

Starting in the 1980's, this became the standard way to build high performance multiprocessors with a modest number of processors. The key to this was the use of cache memory technology on each processor. If each processor makes r memory references per second, but the local cache on each processor has a miss rate of m, then each processor will only require mr bus cycles per second. Therefore, if the bus and memory modules are built using the same technology as the processor, that is, if the bus can handle O(r) memory references per second, then we should be able to use such a bus to interconnect O(1/m) processors with a shared memory. So, for example, if our cache miss rate is 10%, we ought to be able to build 10 processor systems using this model. Today's dual processor desktop workstations use this model, and systems with up to about 16 processors have been sold commercially in the last decade.
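
The arithmetic can be made concrete with a short C calculation (the reference rate used here is an illustrative number, not a figure from these notes):

	#include <stdio.h>

	int main()
	{
	    double r = 100e6;   /* memory references per second per CPU (illustrative) */
	    double m = 0.10;    /* cache miss rate */

	    double bus_refs_per_cpu = m * r;          /* references that reach the bus */
	    double max_cpus = r / bus_refs_per_cpu;   /* = 1/m if the bus handles r refs/sec */

	    printf( "each CPU uses %.0f bus cycles/sec; "
	            "a bus handling %.0f refs/sec supports about %.0f CPUs\n",
	            bus_refs_per_cpu, r, max_cpus );
	    return 0;
	}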

NUMA multiprocessors

The systems discussed above all had uniform memory access times; that is, all CPUs had equal access to all memory modules. In the case of the butterfly or banyan network, however, the number of switching delays in access to memory can be significant. We may be able to ignore a single crossbar switch in the path from CPU to memory, but we cannot ignore five such switches! Therefore, the BBN butterfly and several other machines in this class have been extended with an extra path to memory directly from each CPU to the corresponding memory module.

Such machines are called Non-Uniform Memory Access (NUMA) multiprocessors because the access time from a CPU to its local memory is very fast, while access to other memory is significantly slower. NUMA machines pose special operating system problems because of the need to optimize code and data placement. In general, code sharing on such a machine is a bad idea! It is better to duplicate the code in the local memory of the machine that is executing that code. Similarly, read-only data should be duplicated. The stack of a process should be in the local memory of the CPU executing that process, so that the only references to remote memory are references to data shared between multiple processes running on different CPUs.

This localization of data to the CPU running a process requires that the process remain at that CPU for long periods, or perhaps, for its lifetime. Instead of one ready list for the entire system, allowing processes to migrate freely between CPUs, we must have one ready list per processor, and at the time we create processes, we must solve the process placement problem.

Virtual memory subsystems for NUMA processors have been developed that solve the page placement problem, automatically replicating read-only pages in the local memories of the processors that reference those memory locations, and automatically migrating shared pages toward the local memory of the processor making the most intense use of those pages. As with LRU page replacement, these systems do not try to find a perfect place to store each page, but rather, they observe fault patterns and make decent decisions about page placement.
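
The flavor of such a policy can be sketched in C as follows; the data structure and the replicate and migrate operations named in the comments are hypothetical, intended only to illustrate the kind of fault-driven decisions described above:

	#include <string.h>

	#define NODES 8          /* NUMA nodes in the machine (illustrative) */
	#define THRESHOLD 16     /* remote faults needed to justify migration (illustrative) */

	struct page {
	    int read_only;       /* nonzero if the page is never written */
	    int home;            /* node whose local memory currently holds the page */
	    int faults[NODES];   /* count of remote faults, per node */
	};

	/* called when node n takes a remote fault on page p */
	void place_page( struct page *p, int n )
	{
	    p->faults[n]++;

	    if (p->read_only) {
	        /* read-only pages can safely be copied into n's local memory */
	        /* replicate( p, n );  -- hypothetical replication routine */
	    } else if (p->faults[n] > THRESHOLD) {
	        /* node n has faulted often enough that moving the page is worthwhile */
	        /* migrate( p, n );  -- hypothetical migration routine */
	        p->home = n;
	        memset( p->faults, 0, sizeof p->faults );
	    }
	    /* otherwise, leave the page where it is for now */
	}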

Network Operating Systems

A network operating system is a conventional operating system which includes provisions for attaching it to a network. Most versions of Unix in common use today are network operating systems, and networking features have become standard in MacOS and the various Windows systems from Microsoft.

Typical features that distinguish a network operating system from a stand-alone operating system include:

Remote command execution, for example, as provided by the Unix rsh command and the newer secure ssh command. This allows a user to issue a command to be executed on a particular remote system. For example, to run the sync command on a remote Unix system named cow, assuming you have accounts that are configured correctly, you may type:

	rsh cow sync

Remote file access is provided, for example, by the Unix to Unix rcp command, the newer secure scp command and by the more general ftp utility. These subsystems allow users to copy files from one system to another. For example, to copy a file from one Unix system to another (assuming the appropriate accounts exist with the appropriate rights):

	rcp cow:src goat:dst

This copies the file src on cow to the file dst on goat.

Remote login is provided, for example, by the Unix to Unix rlogin command and by the more general telnet utility. In addition, the newer secure ssh command can be used for Unix to Unix remote login. These allow users of one system to open interactive sessions on a remote system. For example, to start a session on a machine named tractor, assuming the various network permissions are set correctly, you would type:

	rlogin tractor

At a lower level, network operating systems must provide user processes with access to communications protocols for communicating over the network with processes on remote machines. These provide the basis for the implementation of network-oriented commands such as those outlined above.
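
For example, under Unix, a user process might reach the network through the BSD socket interface; the following C sketch (the address and port here are placeholders, and errors are handled crudely) opens a TCP connection to a local echo server, sends a line, and reads the reply:

	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>
	#include <sys/socket.h>
	#include <netinet/in.h>
	#include <arpa/inet.h>

	int main()
	{
	    /* create a TCP socket */
	    int s = socket( AF_INET, SOCK_STREAM, 0 );
	    if (s < 0) { perror( "socket" ); return 1; }

	    /* address of the remote service; the address and port are placeholders */
	    struct sockaddr_in addr;
	    memset( &addr, 0, sizeof addr );
	    addr.sin_family = AF_INET;
	    addr.sin_port = htons( 7 );                     /* the echo service */
	    addr.sin_addr.s_addr = inet_addr( "127.0.0.1" );

	    /* connect and exchange a message with the remote process */
	    if (connect( s, (struct sockaddr *) &addr, sizeof addr ) < 0) {
	        perror( "connect" ); return 1;
	    }
	    write( s, "hello\n", 6 );

	    char buf[64];
	    int len = read( s, buf, sizeof buf );
	    if (len > 0) fwrite( buf, 1, len, stdout );

	    close( s );
	    return 0;
	}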

Network file systems are a special feature of most network operating systems. These allow multiple machines in a network to share one logical file system, even though the different machines may run otherwise unrelated operating systems. Where the rcp and scp commands are simply utilities for copying files from one machine to another, network file systems allow a file stored on one machine to be opened directly by any application on another machine as if it were part of the local file system. The NFS protocols developed originally by Sun Microsystems have become the standard network file system in the Unix and Linux worlds, and newer versions of MacOS and Windows allow those machines to use the NFS protocols to share file systems. Novell Corporation developed a set of competing protocols for the IBM-compatible PC world, and the file sharing protocols of the old AppleTalk network also offered these types of services.

Distributed Operating Systems

A distributed operating system differs from a network of machines each supporting a network operating system in only one way: the machines supporting a distributed operating system all run under a single operating system that spans the network. Thus, at some instant, the print spooler might be running on one machine while the file system runs on others and still other machines run other parts of the system; under some distributed operating systems, these parts may even migrate from machine to machine over time.

With network operating systems, each machine runs an entire operating system. In contrast, with distributed operating systems, the entire system is itself distributed across the network. As a result, distributed operating systems typically make little distinction between remote execution of a command and local execution of that same command. In theory, all commands may be executed anywhere; it is up to the system to execute commands where it is convenient.

Homogeneous and Inhomogeneous Systems

Network operating systems are naturally compatible with inhomogeneous networks, that is, networks containing many different kinds of machines. Thus, for example, the Internet connects many Unix machines, but it also connects IBM PCs running various versions of Windows, IBM mainframes running IBM's VM operating system, and Apple Macintosh computers running versions of MacOS. This inhomogeneity is tolerated as long as the network operating systems involved all support some common subset of protocols.

Distributed operating systems have typically been implemented on homogeneous networks, where all machines support identical instruction sets. In fact, though, if there are any differences between machines, even, for example, differences in optional extensions such as floating point units, the system must distinguish between machines in exactly the same way it would have to if it supported machines with entirely different instruction sets.

The most troublesome differences in a distributed system are those involving differences in data representation. Some machines store integers least-significant byte first, while others store integers most significant byte first. As a result, if an array of integers is transmitted in byte sequential order from one machine to another, the integers may have their bytes shuffled! The same problem may occur with floating point numbers, even when the source and destination machine agree on a common floating point format such as the IEEE floating point standard!
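
The standard defense is to convert data to an agreed-upon network byte order before transmission and back to the host's byte order on receipt. For 32-bit integers, the Unix htonl and ntohl routines do exactly this, as the following small C example sketches (the value transmitted is, of course, arbitrary):

	#include <stdio.h>
	#include <arpa/inet.h>   /* htonl() and ntohl() */

	int main()
	{
	    unsigned int value = 0x01020304;

	    /* sender: convert from host byte order to network byte order */
	    unsigned int wire = htonl( value );

	    /* ...the 4 bytes of 'wire' travel over the network... */

	    /* receiver: convert from network byte order back to host order */
	    unsigned int received = ntohl( wire );

	    printf( "sent 0x%08x, received 0x%08x\n", value, received );
	    return 0;
	}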