It appears that the payoff from adding more parallelism to the execution of a single instruction stream, for example, by supporting long registers, deep pipelines, or superscalar or speculative execution, is limited. Today's Pentium chips are on the order of 10 times the size of the first generation of 32-bit microprocessors (the HP 9000, from around 1983), and it seems likely that, for many applications, a net performance gain could be achieved by monolithic multiprocessors, where each processor on the chip uses only moderate levels of internal parallelism.
Looking at modern microprocessors, it appears that we are entering an era of diminishing returns. In moving from 32-bit processors to 64-bit processors, it is unclear that more than a small percentage of the operands in a typical program will require the full ALU or register width. Certainly, some key algorithms, such as the bitblit algorithm, can be improved by using wide registers, but these need only a small number of wide registers, and most computations will never need the extra bits.
Superscalar execution is similarly limited by the parallelism inherent in the basic blocks of a program. Smart compilers can rearrange code to improve this, but the emphasis on multiway speculative execution in some modern superscalar chips provides clear evidence that our basic blocks are not long or complex enough to fully exploit more than 3-way or 4-way superscalar pipelines, and I suspect that the net gain in moving above 2-way superscalar execution is not impressive.
Speculative execution after conditional branches is another area where the net gain must be limited. With a single speculative stream, following hints in the object code about the likely path, random hints would lead to half of the speculatively executed instructions being discarded, while good hints can lead to very few pipeline cycles being devoted to fruitless speculation. Two-way speculation, while requiring no hints, guarantees that half of the speculatively executed instructions will be discarded, and 4-way speculation (allowing execution to begin through the next branch instruction before the previous branch is decided) allows as many as 3/4 of the speculatively executed instructions to be discarded!
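The arithmetic in the preceding paragraph can be checked with a small Python sketch. It is only an illustration under simplifying assumptions that are not part of the argument above: every speculative path is assumed to do the same amount of work, a wrong hint is assumed to waste all of the speculative work, and the function names are invented for this example.

```python
from fractions import Fraction


def discarded_single_path(hint_accuracy):
    """Fraction of speculative work discarded when only the hinted path runs.

    Assumption: a correct hint wastes nothing and a wrong hint wastes
    everything, so the discarded fraction is 1 - accuracy.
    """
    return 1 - Fraction(hint_accuracy)


def discarded_multiway(paths):
    """Largest fraction discarded when `paths` alternatives run at once.

    Assumption: all paths do equal work and exactly one of them turns
    out to be the path actually taken.
    """
    if paths < 1:
        raise ValueError("need at least one path")
    return Fraction(paths - 1, paths)


if __name__ == "__main__":
    print(discarded_single_path(Fraction(1, 2)))   # random hints: 1/2
    print(discarded_single_path(Fraction(9, 10)))  # good hints: 1/10
    print(discarded_multiway(2))                   # two-way: 1/2
    print(discarded_multiway(4))                   # four-way: 3/4
```

The 90 percent hint accuracy used above is an arbitrary value chosen for the example, not a measurement.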
Thus, it seems that the payback from adding more speculation, more superscalar pipes, or wider registers to the current generation of 32-bit processors will be limited, and that some of the investment already made in these features may have offered very limited payoff.
If there is only a limited payback to be obtained by putting more parallelism into individual CPUs, what benefit can we get from our modern VLSI fabrication methods? One obvious approach is to migrate function onto the same die as the CPU itself. Since around 1986, for example, we have seen on-chip floating point units and memory management units emerge to replace the coprocessors that dominated the earlier generation of microprocessors.
There is always room for more on-chip cache, it seems, but the payback for adding cache capacity is also subject to a law of diminishing returns. Once the cache size equals the working set size of the application in question, increases in the cache size have negligible effect, and what is gained by larger caches is frequently lost to context switching unless the cache can be made large enough to hold the working sets of all active contexts.
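The working-set effect described above can be seen in a toy simulation. The following Python sketch is a hypothetical model, not a measurement of any real processor: it assumes a fully associative cache with LRU replacement and a synthetic trace of uniformly random references to a fixed working set of 64 blocks. The function names and parameters are invented for the illustration.

```python
import random
from collections import OrderedDict


def hit_rate(trace, capacity):
    """Hit rate of a fully associative LRU cache holding `capacity` blocks."""
    cache = OrderedDict()          # keys are block numbers, oldest first
    hits = 0
    for block in trace:
        if block in cache:
            hits += 1
            cache.move_to_end(block)       # mark as most recently used
        else:
            cache[block] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return hits / len(trace)


def random_trace(working_set, length, seed=0):
    """Synthetic trace: uniform random references within the working set."""
    rng = random.Random(seed)
    return [rng.randrange(working_set) for _ in range(length)]


if __name__ == "__main__":
    trace = random_trace(working_set=64, length=10_000)
    for capacity in (16, 32, 64, 128, 256):
        print(capacity, round(hit_rate(trace, capacity), 4))
```

In this model the hit rate climbs until the capacity reaches the working set size and is identical for every larger capacity, because only the first reference to each block can miss. Real reference streams are far less uniform, so the knee is softer in practice.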
This leads naturally to the suggestion that serious consideration should be given to using some of the available die area on modern VLSI processor chips for additional CPUs. The original HP 9000 CPU, released around 1983, was 1/4 inch square, about 1/10th the size of a modern Intel Pentium, and the density increases we have seen in the intervening decade have been sufficient that I suspect we could add appropriate memory arbitration logic to allow on the order of 5 to 10 such CPUs on one Pentium-sized die with a single external memory bus.
This is not to say that I would advocate using that old CPU design! Rather, my suspicion is that we would be well justified in building Pentium-sized chips holding 2 or 4 interconnected CPUs implemented with modest internal parallelism. My guess is that these CPUs could profitably incorporate double-issue superscalar pipelines and one-way speculative execution over conditional branches, with branch prediction hints provided by the compiler.
It is not clear that the proposed class of monolithic multiprocessors would have immediate applications in the MS-DOS-compatible world of MS-Windows variants, but such a chip should have immediate application in the UNIX workstation world. We have an established base of multiple-process code that runs under UNIX, and UNIX variants have been built for shared-memory multiprocessors for over a decade by Encore, Sequent and Silicon Graphics, among others.
Furthermore, the use of X-windows under UNIX provides an easy way to use two CPUs, because the X-server can use one while the X-client uses the other. When running animations under X on a single workstation, a common problem is pulsating progress of the animation caused by the time-sliced sharing of one CPU between client and server. The proposed architecture would eliminate this trivially, without the need for special graphics acceleration hardware!
How should the CPUs of a monolithic multiprocessor be interconnected? The most obvious scheme is to use snooping cache technology with a shared on-chip bus. This bus may well communicate with the off-chip bus through a second-level cache, and the second-level cache may itself snoop the external bus to allow for larger-scale parallelism.
Should all on-chip CPUs be identical? The system software would be simplest if all were identical, but we know how to build multiprocessor operating systems that can handle inhomogeneous CPU resources. For example, C.mmp/Hydra at Carnegie Mellon handled a mix of PDP-11 processors back in the 1970s. One could easily imagine giving only one of the on-chip processors full floating point support, while the others support only integers.
Clearly, quantitative performance measurements on comparable modern processors should be made! Within individual CPUs, we need to know the average and peak utilization of pipeline stages, across all pipelines in the case of superscalar designs, as measured for CPU-bound applications. For multiprocessors, we need the CPU utilization under typical single-user UNIX workloads, both CPU-bound and display-bound. With this quantitative information in hand, we ought to be able to decide how much parallelism to put into each CPU in a monolithic multiprocessor.
With conventional uniprocessor chips, a point defect will typically spoil the chip. With a monolithic multiprocessor, an interesting possibility arises: in those cases where the effects of the defect are confined to one processor, disabling that processor may result in a perfectly useful, although lower-performance, chip! This suggests that something akin to a fusible link ought to be included with each processor so that each can be permanently disabled.
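To get a feel for how much this could matter, here is a minimal Python sketch of the yield arithmetic. It rests on assumptions made purely for illustration: each on-chip processor is defect-free independently with the same probability, the shared bus and cache logic are assumed to be defect-free, and the 0.8 figure and the function name are invented for the example.

```python
from math import comb


def usable_fraction(n_cpus, p_good, min_working):
    """Probability that at least `min_working` of `n_cpus` processors work.

    Assumption: each processor is good independently with probability
    `p_good`, and all shared on-chip logic is defect-free.
    """
    return sum(
        comb(n_cpus, k) * p_good**k * (1 - p_good) ** (n_cpus - k)
        for k in range(min_working, n_cpus + 1)
    )


if __name__ == "__main__":
    n, p = 4, 0.8   # hypothetical: 4 CPUs, each good 80% of the time
    for need in range(n, 0, -1):
        share = usable_fraction(n, p, need)
        print(f"at least {need} working CPUs: {share:.4f}")
```

Requiring all processors to work corresponds to the conventional case where any defect spoils the chip; relaxing the requirement shows how many additional dice could be sold as lower-performance parts under these assumptions.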
One potential benefit of monolithic multiprocessors is improved testability. Once the on-chip bus has been tested, it should be possible to test each of the processors independently, verifying the functionality of each while the others are shut down. This requires that, in addition to a permanent, static means of shutting down each on-chip processor, there be some way to shut processors down dynamically for diagnostic purposes.
This writeup is a more formal presentation of an idea discussed in the course Advanced Computer Architecture, 22C:122, offered in the spring of 1996 at the University of Iowa, and distributed as a take-home final exam question on April 26.
This is not the original proposal for monolithic multiprocessors. A decade ago, for example, the idea was discussed at Rockwell International in the context of wafer-scale integration -- and it may have been discussed elsewhere. The Rockwell idea focused on the use of a redundant message passing interconnection structure, with a fault tolerant operating system that would automatically account for manufacturing defects. In comparison, what I have outlined here is far more conventional.
The ill-fated CDC EP-IX 64-bit microprocessor was apparently manufactured in a 4-processor monolithic configuration for supporting a high performance UNIX workstation.