7. Reduction to Microcode, An Example

Part of the 22C:122/55:132 Lecture Notes for Spring 2004
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Introduction

In general, given a set of register-transfer components that correspond to the operators of an imperitive programming languge, we can reduce any program in that language to a register-transfer system and a finite-state control unit. We do this as follows:

Eliminate all recursion, eliminate all object-oriented constructs, eliminate all pointers, and eliminate all subroutines. These theoretically trivial but ugly changes can all be done at the source level; the resulting program will generally be just as fast as the original, but it will be ugly, with all variables global and no obvious use of abstraction anywhere.
For each simple variable in the program, build one register, with the number of bits in the register able to hold the required range of values.
For each array in the program, build one random-access memory, with the number of bits per word and the addressing of words matched to the data type of the array element and the dimensions of the array.
Count the number of distinct assignment statements that assign to each variable or to each array. Two assignment statements are not distinct if they assign the same function of the same variables to the same variable or array. If some variable or array has n distinct assignments to it, attach an n input multiplexor to the data input for the corresponding register or random-access memory. In the special case where n is one, omit the multiplexor.
Count the number of distinct subscript expressions used for each different array, and put a multiplexor on the address input of the corresponding random-access memory allowing selection between the values of these expressions.
For each distinct expression used in the program, feed the values of the appropriate registers, inputs and constants into combinational register-transfer circuitry that will compute the value of the expression, and then feed the value of that expression into each multiplexor that requires this value.

The above algorithm gives you the register-transfer logic needed to implement an arbitrary program. The next part of the problem is to construct a finite-state control unit that will make this logic execute the program. Now, to derive the control unit, we will focus on the control structure of the program, after identifying all of the inputs and outputs from the control unit:

The outputs of the control unit include all clock lines to registers and all control inputs to multiplexors, as well, potentially, as control inputs to multi-function operators such as ALUs that might result from applying optimizations to the basic register-transfer logic outlined above.
The inputs to the control unit include all boolean values tested in if statements in the program, and all small integers used to control select-case constructs.
For a first effort, each assignment statement in the program corresponds to one state in the control unit; when the control unit is in that state, it will invoke the corresponding register transfer in the register-transfer logic of the machine.
To optimize, look at consecutive states of the control unit, and whenever the register transfers invoked in these states are completely independent, the states can be combined into one state that invokes both register transfers in parallel. Two register transfers are independent if the second one in sequence does not rely on any results computed by the first in that sequence, and if none of the data paths required for each of the register transfers involves control inputs that differ from the control inputs used by the other register transfer.
The control structure of the program determines the successor state or successor states for each state of the finite state control unit. States will have multiple successors if an if statement or case-select construct in involved. The control expression of the control statement determines which successor state will be selected.

The net result of applying this second part of the algorithm is the state table for a (Moore) finite-state machine to drive the register transfer logic developed in the first part of this.

There is one little detail needed to reduce this to actual hardware: We need to make sure that the register transfers occur at the right times. Typically, this means something like the following:

                                       ___________
                       _______________| status
                      |   ____________| outputs
                ______|__|______      |
               |                |     |   Register
clock -----o---|> Finite State  |     |   Transfer
           |   |  Control Unit  |     |   Logic
           |   |________________|     |
           |      _|_|_   | |_________| select
           |     | and |  |___________| inputs
            -----|gates|              |
                 |_____|              |
                   | |________________|  clock
                   |__________________|  inputs
                                      |___________

This bit of logic guarantees that the clock inputs to the register transfer logic all go to zero between successive clock pulses, even if the same register is assigned to by successive states of the control unit. This is extremely safe if the control unit changes state on the positive going clock edge and the registers in the register-transfer half of the system are negative edge triggered. Faster performance is possible if all register transfers happen on the same clock edge as the state changes; because we used and gates above, this is best done by having everything happen on the falling edge of the clock, so the and gates cut off any changes in the clock inputs to the register transfer logic before the output of the control unit begins to wobble through its state changes.

An Extended Example, Euclid's GCD Algorithm

For the sake of example, consider Euclid's algorithm to compute the greatest common divisor of two numbers. This can be expressed algorithmically as follows:

	unsigned int gcd( unsigned int x, unsigned int y )
	{
		if (x == y) {
			return x;
		} else if (x > y) {
			return gcd( x-y, y );
		} else /* x < y */ { 
			return gcd( x, y-x );
		}
	}

The above algorithm is expressed recursively, so our first job is to remove the recursion:

	unsigned int gcd( unsigned int x, unsigned int y )
	{
		while (x != y) {
			if (x > y) {
				x = x - y;
			} else /* x < y */ { 
				y = y - x;
			}
		}
		return x;
	}

The next step is to remove the function calls from the program; in this case, we have no main program calling for the greatest common divisor, instead, we have the specification, invented arbitrarily for this assignment, that our machine should compute the greatest common divisor of its two inputs whenever the input ready signal is asserted, and our machine should output result valid whenever the output of the machine holds the result.

	/* these variables stand in for external inputs */
	unsigned int input_a; input_b;
	BOOLEAN input_valid;

	/* these are the variables of our machine */
	unsigned int x, y;
	BOOLEAN result_valid;

	/* the contents of x and y will be equal when result_valid is true */
	/* either x or y can be sampled as the output when this holds */

	main()
	{
		for (;;) { /* an infinite loop */
			do {
				(void);
			while (input_ready == FALSE);
			result_valid = FALSE;
			x = input_a;
			y = input_b;
			while (x != y) {
				if (x > y) {
					x = x - y;
				} else /* x < y */ { 
					y = y - x;
				}
			}
			result_valid = TRUE;
		}
	}

We have written the above with some extra variables added to the declarations up front, standing in for outputs from the outside world, and with mere comments stating which internal variables are visible to the outside world to hold outputs. We had to do this because the programming language view of how input and output are done differs in serious ways from the way mechanisms actually accomplish this.

Register Transfers in this Algorithm

Having reduced our algorithm to the final form shown above, we can now begin reducing this to register-transfer logic. We begin by extracting all the assignment statements from the code and grouping them by the name of the variable assigned to:

	result_valid = FALSE;
	result_valid = TRUE;

	x = input_a;
	x = x - y;

	y = input_b;
	y = y - x;

Next, for each variable, we use a register, and we use a multiplexor to select which assignment is performed, with appropriate register-transfer functional units substituted for the operators in the expressions. For the result-valid Boolean variable, this gives the following:

                          FALSE TRUE
                           __|___|__
       F  -----------------\ 0   1 /
        result-valid        \_____/
                        _______|_______
       C  -------------|> result valid |
	result-valid   |_______________|
	                       |
	                       |
	                 result-valid
	                    output

This example is constructed, deliberately, without any thoughtful optimization! Given the encoding of FALSE as zero and TRUE as one, it is trivial to optimize this as:

       F  ---------------------
        result-valid           |
                        _______|_______
       C  -------------|> result valid |
	result-valid   |_______________|
	                       |
	                       |
	                  result-valid
	                     output

We can apply similar reductions to each of the other registers, so for the X register, we get:

                          input a         ___________ value of y
                             |        ___|______ 
                             |      _|___|_     |
                             |     | x   y |    |
                             |     |  x-y  |    |
                             |     |_______|    |
                             |    _____|        |
                           __|___|__            |
       F  -----------------\ 0   1 /            |
        x                   \_____/             |
                        _______|_______         |
       C  -------------|>      x       |        |
	x              |_______________|        |
	                       |                |
	                       o----------------
	                       |
	                      gcd
	                     output

The diagram for the Y register is very similar, but this raises a question if we try to combine all of these: How do we make an organized drawing of the entire system? Here is one suggestion for the overall structure of the drawing of such a digital system:

           ______________________________________ ____________
          |    inputs from the outside world     |            |
          |______________________________________|            |
          |   distibution network for feedback   |            |
          |       from internal registers        |  feedback  |
          |______________________________________|    data    |
          |   combinational register-transfer    |   paths    |
          |              elements                |            |
          |______________________________________|            |
          |            multiplexors              |            |
          |______________________________________|            |
          |              registers               |            |
          |______________________________________|____________|

Following this convention gives us the following diagram for the entire register-transfer half of the system:

              input a            input b           
                |                  |
                |        ----------|-----------o-------------------- x
                |       |          |           |                    |
                |       |    ------|-------o---|---------------- y  |  
                |      _|___|_     |      _|___|_               |   |
                |     | x   y |    |     | y   x |              |   |
                |     |  x-y  |    |     |  y-x  |              |   |
                |     |_______|    |     |_______|              |   |
                |    _____|        |    _____|                  |   |
              __|___|__          __|___|__                      |   |
       F  ----\ 0   1 /   F  ----\ 0   1 /   F  --------        |   |
        x      \_____/     y      \_____/     rv        |       |   |
              ____|____          ____|____          ____|____   |   |
       C  ---|>   x    |  C  ---|>   y    |  C  ---|>   rv   |  |   |
	x    |_________|   y    |_________|   rv   |_________|  |   |
	          |                  |                  |       |   |
	          |                  o------------------|-------    |
	          o------------------|------------------|-----------
	          |                  |                  |
	         gcd             alternate         result valid
	        output           gcd output           output

At this point, we are more than halfway there! We know, at this point, that the outputs from the finite state control unit will be no more and no less than F_x, C_x, F_y, C_y, F_rv, and C_rv. For compactness, we have renamed the result-valid flipflop (and its control signals) with the abbreviated name rv.

What we are missing are the inputs to the control unit! To locate these, we extract, from the final textual version of our program, all of the expressions evaluated in the control structures of the program. These were;

	(input_ready == FALSE)
	(x != y)
	(x > y)

Input-ready is trivial, since this Boolean input to the system is simply passed directly to the control unit, unchanged. The other two are more interesting. We could add special combinational logic to compute these directly from the distribution network for the values of internal registers, as follows:

                             input a         input b           
                               |               |
           ----------o---------|-----o---------|---------o---------- x
          |          |         |     |         |         |          |
          |    ------|---o-----|-----|---o-----|-----o---|------ y  |  
         _|___|_    _|___|_    |    _|___|_    |    _|___|_     |   |
        | x   y |  | x   y |   |   | x   y |   |   | y   x |    |   |
        |  x>y  |  |  x=y  |   |   |  x-y  |   |   |  y-x  |    |   |
        |_______|  |_______|   |   |_______|   |   |_______|    |   |
            |          |       |       |       |       |        |   |
       <----           |
         x>y           |
                       |
       <--------------- 
         x=y

This, however, is completely unnecessary, because we are already computing both the difference x-y and the difference y-x; between these two differences, we can completely determine equality and the relative magnitudes of the numbers.

If we were dealing with signed numbers, we might be tempted to use the signs of the differences for this comparison. This would fail in the event of an overflow. Instead, we will take advantage of the fact that the subtraction x-y is generally done using two's complement additioin, x+~y+1; when this is done, the carry out of the adder means "no borrow output from most significant bit", which, in turn, means x>y. Therefore, taking both the differences x-y and y-x gives us signals indicating both x>y and y>x. We can get all the possible comparisons of x and y from simple functions of these two carry signals.

       input  input a            input b           
       ready    |                  |
          |     |        ----------|-----------o-------------------- x
          |     |       |          |           |                    |
          |     |       |    ------|-------o---|---------------- y  |  
          |     |      _|___|_     |      _|___|_               |   |
          |     |     | x   y |    |     | y   x |              |   |
          |     |     |  x-y  |    |     |  y-x  |              |   |
    IR <--      |     |_c_____|    |     |_c_____|              |   |
             /| |       | |        |       | |                  |   |
    x>y <--O< |-|-------|-|--------|-------o |                  |   |
             \| |       | |        |       | |                  |   |
           ___  |       | |        |       | |                  |   |
          |   |-|-------  |        |       | |                  |   |
    x=y <-|and| |         |        |       | |                  |   |
          |___|-|---------|--------|-------  |                  |   |
                |    _____|        |    _____|                  |   |
              __|___|__          __|___|__                      |   |
       F  ----\ 0   1 /   F  ----\ 0   1 /   F  --------        |   |
        x      \_____/     y      \_____/     rv        |       |   |
              ____|____          ____|____          ____|____   |   |
       C  ---|>   x    |  C  ---|>   y    |  C  ---|>   rv   |  |   |
	x    |_________|   y    |_________|   rv   |_________|  |   |
	          |                  |                  |       |   |
	          |                  o------------------|-------    |
	          o------------------|------------------|-----------
	          |                  |                  |
	         gcd             alternate         result valid

The Control Unit

The control unit can now be described! We know that the outputs of the control unit are F_x, C_x, F_y, C_y, F_rv, and C_rv, and we know that the inputs are IR, x>y and x=y. The truth tables for the control unit will therefore have the following structure

	  this             | next        this  | F  C  F  C  F   C
	  state IR x>y x=y | state       state |  x  x  y  y  rv  rv
	-------------------|-------     -------|----------------------

All we need to do is fill in these tables. To do so, we examine the algorithm again, grouping all assignment statements into groups that can be carried out in parallel using the data paths outlined at the end of the previous seciton. These groups are identified by the labels in the following restatement of the algorithm:

	main()
	{
		for (;;) { /* an infinite loop */
			do {
		A:		(void);
			while (input_ready == FALSE);
		B1:	result_valid = FALSE;
		B2:	x = input_a;
		B3:	y = input_b;
			while (x != y) {
				if (x > y) {
		C:			x = x - y;
				} else /* x < y */ { 
		D:			y = y - x;
				}
			}
		E:	result_valid = TRUE;
		}
	}

Each letter used above corresponds to one state of the finite state control unit. In state A no assignments take place; in state B, we have three parallel assignments, while in states C, D and E, there is one assignment for each. We can formalize this in the table that relates the outputs of our finite state control unit to the current state:

          this  | F  C  F  C  F   C
	  state |  x  x  y  y  rv  rv
	--------|----------------------
            A   | -  0  -  0  -   0
            B   | 0  1  0  1  0   1
            C   | 1  1  -  0  -   0
            D   | -  0  1  1  -   0
            E   | -  0  -  0  1   1

In the above truth table, we have not provided binary encodings for the state names, and we have used dashes in the output to indicate don't care values. As a general rule, if a register is not clocked during some clock-cycle of the finite state control unit, the multiplexor controls and function select controls that govern the data inputs to that register do not matter. Designating don't care values is important when an effort is made to optimize the implementation of a boolean function, so it is polite to indicate it if the door to optimization is to remain open.

The second state table we must fill out follows directly from the control structure for our algorithm; we can express this as a picture, as follows:

                                 ______
                                /      \x>y
                                \      /
                      ------------->(C)-------- 
   _____ir=0        /           /      \    x=y \
  /     \          /x>y         \___   /y>x      \
  \     /         /                 \ /           \
   -->(A)------>(B)                  X             ---->(E)
  /      ir=1    |\              ___/ \           /  /    \
 /               | \y>x         /      \x>y      /  /      \
 \               |  \           \      /    x=y /  /       / 
  \           x=y|    ------------->(D)--------   /       /
   \             |              /      \         /       /
    \             \             \______/y>x     /       /
     \             \___________________________/       /
      \_______________________________________________/

There is nothing particularly pretty about this state diagram! The inner loop of the original code has been thoroughly obscured by the fact that we only have states for each block of assignment statements, and not for control structures such as the while loop and the if statement. Without these extra states, states B, C and D all test for loop termination,m and states C and D are connected in what might be described as a figure-eight loop. This state diagram reduces directly to the following truth-table for the next-state function:

	  this              | next
	  state IR x>y x=y  | state
	--------------------|-------
            A   0   -   -   |  A
            A   1   -   -   |  B
            B   -   1   -   |  C
            B   -   0   0   |  D
            B   -   0   1   |  E
            C   -   1   -   |  C
            C   -   0   0   |  D
            C   -   0   1   |  E
            D   -   1   -   |  C
            D   -   0   0   |  D
            D   -   0   1   |  E
            E   -   -   -   |  A

Again, we have used dashes to indicate don't care values on the inputs. When mechanically reducing a truth table where there are don't care inputs, neither that input nor the inverted value of that input needs connection to the row decoder for that row of the input. Digital logic optimization is largely concerned with algorithms for combining rows of the truth table by identifying the rows for which the inputs are don't care values.

We can combine the above two truth tables into one table without changing any of the information expressed. If we do it as follows, we make clear that this table still describes a Moore machine:

	  this              | next    F  C  F  C  F   C
	  state IR x>y x=y  | state    x  x  y  y  rv  rv
	--------------------|------------------------------
            A   0   -   -   |  A      -  0  -  0  -   0
            A   1   -   -   |  B
                            |
            B   -   1   -   |  C      0  1  0  1  0   1
            B   -   0   0   |  D
            B   -   0   1   |  E
                            |
            C   -   1   -   |  C      1  1  -  0  -   0
            C   -   0   0   |  D
            C   -   0   1   |  E
                            |
            D   -   1   -   |  C      -  0  1  1  -   0
            D   -   0   0   |  D
            D   -   0   1   |  E
                            |
            E   -   -   -   |  A      -  0  -  0  1   1

In the above truth table, all rows that involve transitions from one state have been grouped into blocks, and for each block, only one set of outputs is given.

We can assign arbitrary Boolean encodings for the state names used here! Sometimes, the optimization of the logic for the finite-state control unit is simplified by careful encoding of the state names, but since we are largely ignoring such issues in this class, we will not bother with this.

The important thing to note is that, aside from questions of optimization, the reduction of an algorithm to a digital system that executes that algorithm is a mechanical process. Most of the information you need to evaluate the performance of that digital system is actually apparent in the final version of the source code for the algorithm, particularly if it is annotated to show the statements that can be executed in parallel.

Filling in the contents of this state table has a strong resemblance to programming! It is, in fact, quite reasonable to view the part of the truth table dealing with each state as a kind of machine instruction, where next-state fields of the table give the control structure part of the behavior of that instruction, while the control unit outputs to the data part of the system give the computational part of the instruction.

Microprogramming

If we think in these terms, the register-transfer logic of the digital system can be thought of as defining the instruction set of the control unit, and then the actual job of working out the control unit for that control unit can be thought of as programming in that instruction set. When the register transfer logic is designed with the intent of allowing general purpose computation, as opposed to designing it to meet the needs of one particular algorithm, we refer to the entire system as a microprogrammable system, and we refer to the contents of the state table of the control unit as the microprogram (µprogram, for short). This terminology is particularly applicable when the state table for the control unit is stored in a dedicated fast RAM or in some kind of programmable ROM (PROM, generally; for example, EEPROM or Flash EEPROM), so that the particular algorithm to be executed by the digital system can be changed after the system has been manufactured.