35. Bus Arbitration and Crossbar Switches

Part of the 22C:122/55:132 Lecture Notes for Spring 2004
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Bus Arbitration

It is common to build computer systems with multiple active devices attached to the same memory bus. The most obvious example may be a machine with multiple central processors, but it is equally reasonable to speak in terms of secondary processors such as direct-memory-access peripheral devices. Consider, for example:

                    bus
     =========================================
       __|__   __|__   __|__   __|__   __|__
      |     | |     | |     | |     | |     |
      | CPU | | CPU | | DMA | | DMA | | RAM |
      |_____| |_____| |_____| |_____| |_____|
                       __|__   __|__
                      |     | |video|
                      | disk| | out |
                      |_____| |_____|
                        
      

Here, we have two CPUs and two DMA devices. Each of these may independently access RAM. The CPUs initiate DMA transfers to or from disk, but once a transfer is initiated, the disk controller moves the data between the disk and RAM on its own, without involving a CPU in each transfer. The DMA video controller operates continuously, reading a new pixel from RAM every microsecond and converting it to voltages on the red, green and blue signals sent to a color CRT.

We can view this system as having the simplest form of multiport memory, one where there is in actuality only one memory port that is time multiplexed between the various clients.

The problem this poses is how these different devices can share the same bus connecting them to memory. Up to this point, we have spoken of memory busses with the assumption that a single device, the bus master, provided the memory address and memory control signals. Now, we assume a bus where any of several devices may take control, and they must share it in an orderly way!

We solve this problem with bus arbitration logic. This logic may be centralized or distributed, and it may operate synchronously or asynchronously. Distributed asynchronous bus arbitration is very hard, while centralized synchronous arbitration is fairly easy. Consider the following interface between the bus arbitrator and any particular client:

            Clock -------->

Arbitrator  <------ Request  Client
 
            Grant -------->

With a synchronous bus, the arbitrator and all clients share a common clock, so it is usual to distribute the clock signal on the bus and to treat the clock itself as a component of the arbitrator. During each clock cycle (possibly a major cycle with several minor cycles), at most one client may use the bus.

To use the bus, a client asserts its bus request line. If the arbitrator replies by asserting the bus grant line, the client may use the bus for a read or a write cycle. If the arbitrator does not grant the bus, the client must stall. If we have 2 clients, we can make a truth table describing the bus arbitrator:

       Requests        Grants
      A       B   |   A       B
 -----------------+-----------------
      0       0   |   0       0
      0       1   |   0       1
      1       0   |   1       0
      1       1   |   1       0

The first 3 lines of this truth table are simple. If no client requests the bus, no client receives a bus grant. If only one client requests the bus, that client receives the bus grant. The fourth line is more interesting! If both clients request the bus, only one may receive it. As shown, whenever this happens, the bus is granted to client A; thus, we have established a strict priority, giving the highest priority to client A. Here is a similar truth table for 3 clients:

       Requests        Grants
      A   B   C   |   A   B   C
 -----------------+-----------------
      0   0   0   |   0   0   0
      0   0   1   |   0   0   1
      0   1   0   |   0   1   0
      0   1   1   |   0   1   0
      1   0   0   |   1   0   0
      1   0   1   |   1   0   0
      1   1   0   |   1   0   0
      1   1   1   |   1   0   0
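
As an aside, the fixed-priority rule in both truth tables can be captured in a few lines of software. Here is a minimal sketch in C (my own illustration, not part of the original notes), treating bit i of a request word as client i's request line, with bit 0 (client A) having the highest priority:

    #include <stdio.h>

    /* grant exactly the lowest-numbered (highest priority) request */
    unsigned fixed_priority_grant(unsigned requests)
    {
        return requests & (~requests + 1u);  /* isolate the lowest set bit */
    }

    int main(void)
    {
        /* reproduce the 3-client truth table; bit 0 = A, bit 1 = B, bit 2 = C */
        for (unsigned req = 0; req < 8; req++) {
            unsigned g = fixed_priority_grant(req);
            printf("requests %u%u%u -> grants %u%u%u\n",
                   req & 1, (req >> 1) & 1, (req >> 2) & 1,
                   g & 1, (g >> 1) & 1, (g >> 2) & 1);
        }
        return 0;
    }
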
A fixed priority bus arbitration system is very appropriate when the clients are naturally different in their patterns of bus usage. For example, consider a system with a CPU able to make one memory reference every 50 nanoseconds, a network interface that makes one memory reference every microsecond, and a disk interface that makes a memory reference every 5 microseconds. It turns out that the best priority allocation for this system gives the highest priority to the client that makes the least frequent use of the memory, so the disk gets the highest priority, the network the next lower priority, and the CPU the lowest priority. The reason is that the infrequent clients can least afford to wait, since a disk or network interface that stalls too long may lose data, while the CPU merely runs slightly slower on the rare cycles when it is locked out.

Fixed priority is undesirable when the clients are equals! For example, if two CPUs share a bus, giving one a higher priority will always let one CPU run fast at the expense of the other. The solution to this is to rotate the priority. This is easiest in the context of a distributed arbitrator:

Distributed Synchronous Arbitration

A distributed bus arbitrator can be constructed on the following model:

                       grant
                         |
                    _____v_____ 
                   |           |
   request A ----->| arbitrate |--------> grant A
                   |___________|
                         |
                    _____v_____ 
                   |           |
   request B ----->| arbitrate |--------> grant B
                   |___________|
                         |
                         v
                          .
                          .
                          .

Here, the highest priority is at the top of the list. If the arbitrator at the top sees a request, it always grants it. If the arbitrator below sees a request, it only grants it if there are no requests above. The vertical connection between the arbitration units is called the priority chain. Each arbitrate unit has the following logic:

                 priority|
                   chain |
                     in  |   ___ 
                         o--|   |
                         |  |AND|------- grant
       request --------o----|___|
                       | |
                       O_| 
                      |   |
                      |AND|
                      |___|
                        |  
                        |  priority
                        |   chain
                        |    out
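
The behavior of this cell can be summarized in two equations: grant = request AND chain-in, and chain-out = chain-in AND NOT request. A one-cell model in C (my sketch, with names of my own choosing) makes the daisy chain easy to experiment with:

    /* one arbitrate cell in the priority chain (illustrative model) */
    typedef struct {
        int grant;      /* grant output to this cell's client */
        int chain_out;  /* priority chain passed to the next cell down */
    } arb_cell;

    arb_cell arbitrate(int request, int chain_in)
    {
        arb_cell o;
        o.grant     = request && chain_in;   /* grant only if no request above */
        o.chain_out = chain_in && !request;  /* a granted request breaks the chain */
        return o;
    }

Feeding a constant 1 into the chain input of the top cell and cascading each chain output into the cell below reproduces the fixed-priority truth tables given earlier.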

To create a round-robin priority system, where granting the request of some bus client causes its priority to drop, all we need to do is create a circular priority chain that is broken immediately below that client each time the bus is granted.

                _________      clock
               |         |       |
               |    _____v_____  |
               |   |           | |
   request A --|-->| arbitrate |-|---o--> grant A
               |   |___________| |  _|_
               |         |       o-|>__|
               |         |  _____|___|
               |         |_|     |
               |        |   |    |
               |        | OR|    |
               |        |___|    |
               |         |       |
               |    _____v_____  |
               |   |           | |
   request B --|-->| arbitrate |-|---o--> grant B
               |   |___________| |  _|_
               |         |       o-|>__|
               |         |  _____|___|
               |         |_|     |
               |        |   |    |
               |        | OR|    |
               |        |___|    |
               |_________|
                          
                          

Here, on each clock cycle, we remember which grant line was high in a flipflop tied to that grant line. The outputs of these flipflops control where the priority chain is broken and, therefore, which arbitrate unit is at the head of the chain.

The above arbitration circuit has a serious fault. If a bus cycle occurs during which no client makes a memory request, none of the flipflops will be set, and therefore, no arbitrator will be at the head of the chain. This is clearly a problem, but we can solve it by enabling the clock to the flipflops only when there is at least one request, so that the flipflops always remember the most recent grant. We must also make sure that exactly one of the flipflops is set on powerup.
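
In software, the whole round-robin arbiter reduces to remembering the last grant and searching the ring from the position just below it. Here is a small C model (my construction, intended only to mirror the circuit's behavior):

    #include <stdio.h>

    #define NCLIENTS 4

    /* models the flipflops; exactly one "set" position, as required at powerup */
    static int last_grant = 0;

    /* returns the granted client's index, or -1 if there are no requests */
    int round_robin_grant(const int request[NCLIENTS])
    {
        for (int i = 1; i <= NCLIENTS; i++) {
            int c = (last_grant + i) % NCLIENTS;  /* walk the chain from its head */
            if (request[c]) {
                last_grant = c;  /* clock the flipflops only when there is a request */
                return c;
            }
        }
        return -1;  /* no requests: the chain head stays where it was */
    }

    int main(void)
    {
        int req[NCLIENTS] = {1, 1, 0, 1};  /* clients 0, 1 and 3 all want the bus */
        for (int cycle = 0; cycle < 4; cycle++)
            printf("cycle %d: grant to client %d\n",
                   cycle, round_robin_grant(req));
        return 0;  /* grants rotate 1, 3, 0, 1 rather than starving anyone */
    }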

Crossbar Switches and Interleaving

The oldest way to build a true multiport memory saw one of its first major uses in the Burroughs D825 multiprocessor system. Conway's paper on the D825 operating system, presented at the 1963 Spring Joint Computer Conference, is a good source on this. This machine was designed with up to 4 CPUs, 16 memory modules of 4K each, and an I/O subsystem, all able to access memory. The illustration in this paper was roughly as follows:

                  ___  ___  ___  ___
                 |   ||   ||   ||   |
                 |CPU||CPU||CPU||CPU|
                 |___||___||___||___|
      ___     |    |    |    |    |
     |   |    |    |    |    |    |
     | M |----x----x----x----x----x-
     |___|    |    |    |    |    |
      ___     |    |    |    |    |
     |   |    |    |    |    |    |
     | M |----x----x----x----x----x-
     |___|    |    |    |    |    |
       ...    |    |    |    |    |
      ___     |    |    |    |    |
     |   |    |    |    |    |    |
     | M |----x----x----x----x----x-
     |___|    |    |    |    |    |
              |
              |
          I/O subsystem

The matrix structure of this interconnect system is called a crossbar switch, and the component at each x has, at its heart, a piece of bus arbitration logic. The bus coming out of each CPU is conventional, except that it has a stall line that causes the CPU to stall if the memory request cannot be granted. The bus going to memory is also conventional, except that some of the address lines from the CPU determine which memory module is to be addressed, and so the address data path on the bus to memory is narrower than the address data path on the bus from the CPU. During any clock cycle, each CPU may access a different memory module, with stalls requested only when two CPUs attempt to access the same module.

The term crossbar switch comes from the field of telephony. Early dial telephones used 10-ary and later 100-ary trees of stepper switches in the telephone exchange to connect the calling phone to the called phone. These were expensive, and each switch could only carry one telephone call at a time. Crossbar telephone switches consist of an array of horizontal bars crossing an array of vertical bars, with contact fingers on each. Each bar has an electromagnet that can rotate it. If a horizontal and a vertical bar are rotated simultaneously, the contact fingers at their intersection latch. Once a connection is made, it remains connected until a release electromagnet on one of the bars is energized, and while one connection is in use, any of the other connections may be made or broken.

Interleaving

Typically, if we have 2^n memory modules, we use the least significant n bits of the memory address from the CPUs to select the module. This is called 2^n-way interleaved memory, and it effectively randomizes the locations of variables among the different memory modules, since consecutive memory addresses map to different memory modules, never to the same module.
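
The module-select logic is nothing more than a mask and a shift. A short C fragment (with constants of my own choosing) shows the decode for 4 modules:

    #define MODULE_BITS 2   /* 2^2 = 4 memory modules */

    /* the low n bits pick the module... */
    unsigned module_select(unsigned addr) { return addr & ((1u << MODULE_BITS) - 1); }

    /* ...and the remaining bits address a word within that module */
    unsigned module_offset(unsigned addr) { return addr >> MODULE_BITS; }

    /* consecutive addresses 0, 1, 2, 3, 4, ... land in modules 0, 1, 2, 3, 0, ... */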

The alternative is to treat each module as a separate bank of contiguous memory addresses, a model that forces the programmer to be constantly aware of the assignment of banks to CPUs. This is done only when the cost of a memory access depends on which module is accessed and which CPU makes the request, in which case we refer to the machine as a non-uniform memory access machine, or NUMA machine for short. In such a system, the operating system must carefully allocate both memory and processors so as to minimize the cost of memory access. Today's small and medium scale multiprocessors generally avoid such complexity, but as the number of CPUs grows beyond 16 to 64, these problems are hard to avoid.

Interleaved memory has several consequences, some of which are important even when there is only one bus client. For example, on classical core memory, a read cycle caused the memory location that was read to be erased as a side effect of the read. In low performance systems, the restoration of the contents of this memory location was frequently left to microcode in the CPU, and as a result, reads were actually slower than writes.

In high performance core memory systems, though, it was useful to build more complex memory modules to take care of this detail. As a result, on a high performance system, the memory module from which a read was made was unavailable for one clock cycle after each read while it restored the contents of the location that was read. This is of little value if there is only one module, but when there are multiple modules and they are interleaved, a sequence of reads from successive memory locations can be done without waiting for the restore cycles. This was done on the CDC 6600, where the common claim is that it used 10-way interleaving, but the number 10 was almost certainly in base 8, that is, 8-way interleaving in decimal terms.
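
A toy timing model (again my own construction, not from the notes) makes the benefit concrete: each module is busy for one extra cycle after a read, but successive addresses hit successive modules, so a run of sequential reads never waits.

    #include <stdio.h>

    #define NMODULES 4   /* assume 4-way interleaving */
    #define RESTORE  1   /* cycles a module stays busy restoring after a read */

    int main(void)
    {
        int busy_until[NMODULES] = {0};
        int cycle = 0;
        for (unsigned addr = 0; addr < 8; addr++) {  /* sequential reads */
            int m = addr % NMODULES;
            if (cycle < busy_until[m])
                cycle = busy_until[m];               /* stall for the restore */
            printf("addr %u read from module %d at cycle %d\n", addr, m, cycle);
            busy_until[m] = cycle + 1 + RESTORE;     /* read plus restore time */
            cycle++;                                 /* next bus cycle */
        }
        return 0;  /* rerun with NMODULES = 1 to see the stall on every read */
    }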

While core is an archaic technology today, read cycles still tend to take longer than write cycles because the write requires only that the address and data travel together from CPU to memory, while read cycles require that the address travel from CPU to memory and then that the data travel in the opposite direction, for a total of twice the transit time. On the highest performance CPU-memory interconnect systems, these two data transfers take place in separate bus cycles, and it is useful to interleave the memory so that a bus transaction with a new memory module can be started while a read is being processed by another module.

A Crossbar Switch Cell

At each bus intersection in a crossbar switch, we need several components, but the most basic is a way to switch address and data information from one bus to another:

                 CPU
             addr data write
               |   |   |
               /   /   |
               |   |   |
   addr  --/---|---|---|--o------------------------------
MEM            |   |   |  |
   data  --/---|---|---|--|-----------o------------------
               |   |   |  |           |
   write ------|---|---|--|-o---------|------------------
               |   |   | _|_|_        |
               |   |   |   |          |
               |   |   |  / \         |
               |   |   | /___\--------|----------------o--- enable
               |   |   |   |       ---o---       ___   |
               o---|---|-| |     _|_      |     |   |--o
               |   |   | |-     \   /-----|-----|AND|  |
               |   |   o-|       \ /      |     |___|O-|--
               |   |   |          |       |      ___   |  |
               |   |   |          |      / \    |   |--   |
               |   |   |          |     /___\---|AND|     |
               |   |   |          |       |     |___|-----o
               |   |   |           ---o---                |
               |   |   |              |                   |
               |   o---|--------------                    |
               |   |   |                                  |
               |   |   o---------------------------------- 
               |   |   |
               |   |   |

Here, when the enable line for an intersection between the busses is high, bus coupler logic transmits the address and the read-write control line from the CPU bus to the memory bus, and bidirectional bus coupler logic transmits the data in the appropriate direction between the busses, depending on whether this is a read or write operation.
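
In behavioral terms, the cell above acts as follows; this C sketch (structure and names are mine) ignores the tri-state electrical details and just shows the direction of each transfer:

    /* one side of a bus: address and data paths plus the write control line */
    typedef struct { unsigned addr; unsigned data; int write; } bus;

    void couple(int enable, bus *cpu, bus *mem)
    {
        if (!enable) return;        /* couplers are tri-stated when not enabled */
        mem->addr  = cpu->addr;     /* address always flows from CPU to memory */
        mem->write = cpu->write;    /* as does the read-write control line */
        if (cpu->write)
            mem->data = cpu->data;  /* write cycle: data flows CPU to memory */
        else
            cpu->data = mem->data;  /* read cycle: data flows memory to CPU */
    }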

A particular connection will be enabled when two conditions hold: first, the least significant bits of the address on the CPU bus must select the memory module served by this intersection; and second, the bus arbitration logic for that memory module must grant this CPU's request.

So, in sum, we must have the following picture:

                   CPU
                  basic
                   bus req grant
                    |   |   |   ________ 
                    |   |   |  | Decode |
                    o---|---|--|addr lsb|
                    |   |   |  |________|
                    |   |   |   | |  | |
                    |   |   |      \
                    |   |   |       | each intersection 
                    |   |   |       | is connected to a
                   _|_  |   |       | different output.
                  |   | |   |       |
   basic bus -----|   |-|---|-------|------------------
MEM               |___| |   |       |
   req       ----/--|---|---|-------|------o-----------
                /   |   |   |       |      |
   grant     --/----|---|---|-------|------|-------o---
               |    |   |   |      _|_     |       |
               |    |   |   |     |   |   / \     _|_ 
               |    |   o---|-----|AND|  /___\-o-\   /
               |    |   |   |     |___|    1   |  \ /
               |    |   |   |       |req       |   |
               |    |   |   |   ____|____      |   |
   priority    |    |   |   |  |bus grant|     |   |
   chain     --|----|---|---|--| logic   |-----|---|---
               |    |   |   |  |_________|     |   |
               |    |   |   |       |          |   |
               |    |   |   |       |grant     |   |
                ----|---|---|-------o----------    |
                    |   |   |                      |
                    |   |   |                      |
                    |   |   o----------------------
                    |   |   |
                    |   |   |
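
Putting the two conditions together, the enable logic at one intersection can be modeled in a few lines of C (a sketch under the same assumptions as the earlier fragments; THIS_MODULE is a hypothetical constant naming which module's bus this intersection serves):

    #define MODULE_BITS 2   /* 2^2 = 4 memory modules, as before */
    #define THIS_MODULE 1   /* module served by this intersection (hypothetical) */

    int intersection_enable(unsigned cpu_addr, int cpu_request, int chain_in)
    {
        /* condition 1: the low address bits must select this module */
        int selected = (cpu_addr & ((1u << MODULE_BITS) - 1)) == THIS_MODULE;

        /* condition 2: this module's priority chain must grant the request */
        int request = cpu_request && selected;
        return request && chain_in;
    }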

The cost of an N by N crossbar switch for a bus with an A-bit address and a D-bit data path will be O(N x N x (A+D)). This is not a pleasant cost to pay when N is large. Furthermore, if you imagine building a single crossbar intersection as an integrated circuit, it will require 2(A+D)+k input-output pins, where k is a small integer that accounts for miscellaneous connections such as the priority chain, read/write lines and so on. On a modern machine with a 32-bit address and a 64-bit data path, this requires chips with about 200 pins, ignoring power and ground! This pin count is comparable to that of a good microprocessor circa the year 2000!
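
The pin-count arithmetic is easy to check (k = 8 below is just a guess at the number of miscellaneous signals):

    #include <stdio.h>

    int main(void)
    {
        int A = 32, D = 64, k = 8;   /* address, data, and miscellaneous pins */
        printf("pins per intersection: %d\n",
               2 * (A + D) + k);     /* 2 * 96 + 8 = 200 */
        return 0;
    }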

Despite this, crossbar switches have proven to be quite useful. Burroughs was routinely building machines with 8 by 8 crossbar switches in the late 1960's, typically using 6 CPUs and 2 DMA coprocessors on each system. In the 1970's, Carnegie-Mellon University built the C.mmp multiprocessor (Computer with Multiple MiniProcessors, officially, but to many, the Carnegie Mellon MultiProcessor), with 16 PDP-11 CPUs and 16 memory modules; this machine had a 16-bit data path and a 20-bit physical memory address.

It is clear, however, that crossbar switches are too expensive for large scale use. Building a 1000 processor machine with just a crossbar switch is not a reasonable idea! We need an alternative.

Multiport Memory on Pipelined Systems

By the mid 1970's, the MODCOMP IV minicomputer incorporated a 4x4 crossbar switch (again with a 16-bit data path and an 18-bit address), but this time, the primary goal was to support a single pipelined processor, using one memory port for the instruction fetch stage of the pipeline and one memory port for loading or storing operands. The other two memory ports were usually used for DMA coprocessors, but a second processor could be connected, in theory.

The MODCOMP IV provides a useful example of the considerations involved in hooking a pipelined machine to a multiport memory:

      _________________________       ______________
     |   ___________________   |     |              |
     |  | Instruction Fetch |========|              |
     |  |___________________|  |     |              |
CPU  |   __|_____________|__   |     |              |
     |  |      Execute      |========|  Multiport   |
     |  |___________________|  |     |    Memory    |
     |_________________________|     |              |
         ___________________         |   (4 by 4    |
        |  DMA coprocessor  |========|   crossbar   |
        |___________________|        |  + memory)   |
         ___________________         |              |
        |  DMA coprocessor  |========|              |
        |___________________|        |______________|

Which of the memory ports in this system should have priority, or should we use a round-robin scheme where priority rotates from port to port with use?

Part of the answer comes from our earlier comments about DMA processors sharing a single bus with the CPU. In that case, we said that the DMA processor needed priority over the CPU. This remains true here! If the DMA processor is handling data transfers from a peripheral, the CPU can wait. If the CPU took priority, data arriving from the peripheral might be lost and the input-output operation would fail.

But what about the two memory ports used by the CPU itself? Does the instruction fetch stage have priority over the operate stage, or vice versa? The answer comes from the nature of stalls in a pipelined system. If the instruction fetch stage stalls, bubbles are introduced into the pipe and later stages can continue to operate, but if an operate stage stalls, the instruction fetch stage is blocked behind it. Therefore, operand fetch and store must take priority over instruction fetch in this model! As a result, the highest priority port on the crossbar switch in the diagram above is the bottom port, and the top port has the lowest priority.