Homework 6 Solved

22C:122/55:132, Spring 2001

Part A: Design the carry-lookahead unit for a fast adder based on trinary trees. This will have inputs P_i, G_i, P_i+1, G_i+1, P_i+2, and G_i+2 from the three subtrees above it, and input C_i from the right; it will have outputs C_i+1 and C_i+2 to the left and center subtree above it, and it will have outputs P and G to its parent in the carry lookahead tree.

C_i+1 = G_i: or (P_i and C_i)
C_i+2 = G_i+1: or (P_i+1 and G_i); or (P_i+1 and P_i and C_i)
P = P_i+2 and P_i+1 and P_i
G = G_i+2: or (P_i+2 and G_i+1); or (P_i+2 and P_i+1 and G_i)

Part B: Give an Iowa Logic Specification Language specification for this part.

circuit lookahead3;
   inputs
      p0, g0, p1, g1, p2, g2, c0;
   outputs
      c1, c2, p, g;
   parts
      c1or: or(2);
      c1and: and(2);
      c2or: or(3);
      c2and1: and(2);
      c2and2: and(3);
      pand: and(3);
      gor: or(3);
      gand1: and(2);
      gand2: and(3);
   wires
      g0 to c1or.in(1);
      p0 to c1and.in(1);
      c0 to c1and.in(2);
      c1and.out to c1or.in(2);
      c1or.out to c1;

      g1 to c2or.in(1);
      p1 to c2and1.in(1);
      g0 to c2and1.in(2);
      c2and1.out to c2or.in(2);
      p1 to c2and2.in(1);
      p0 to c2and2.in(2);
      c0 to c2and2.in(3);
      c2and2.out to c2or.in(3);
      c2or.out to c2;
      
      p2 to pand.in(1);
      p1 to pand.in(2);
      p0 to pand.in(3);
      pand.out to p;

      g2 to gor.in(1);
      p2 to gand1.in(1);
      g1 to gand1.in(2);
      gand1.out to gor.in(2);
      p2 to gand2.in(1);
      p1 to gand2.in(2);
      g0 to gand2.in(3);
      gand2.out to gor.in(3);
      gor.out to g;
end;

Part C: Consider a 27 bit adder using this circuit. A[0..26] are the individual one-bit adders at the leaves of the tree. Draw those parts of the tree connecting A[15..20] to the root.

     | |  _   | |  _   | |  __   | |  _   | |  _   | |  __ 
     |_|_| |  |_|_| |  |_|_|  |  |_|_| |  |_|_| |  |_|_|  |
    |a b c|| |a b c|| |a b c| | |a b c|| |a b c|| |a b c| |
... | A20 || | A19 || | A18 | | | A17 || | A16 || | A15 | | ...
    |_p_g_|| |_p_g_|| |_p_g_| | |_p_g_|| |_p_g_|| |_p_g_| |
      | |  |   | |  |   | |   |   | |  |   | |  |   | |   |
     _|_|__|___|_|__|___|_|_  |  _|_|__|___|_|__|___|_|_  |
    | p2g2 c2  p1g1 c1  p0g0| | | p2g2 c2  p1g1 c1  p0g0| |
    |      lookahead3     c0|-o |      lookahead3     c0|-o
    |___________________p_g_| | |_p_g___________________| |
             _______    | |   |   | |   __________________|
    _______  ____   |   | |   |   | |  |    _______
... ____   | __  |  |   | |   |   | |  |   |  _____  ________ 
    __  |  |   | |  |   | |   |   | |  |   | |   __ |  ______ ...
     _|_|__|___|_|__|___|_|_  |  _|_|__|___|_|__|___|_|_   __
    | p2g2 c2  p1g1 c1  p0g0| | | p2g2 c2  p1g1 c1  p0g0| |
    |      lookahead3     c0|-o |      lookahead3     c0|-o
    |___________________p_g_| | |_p_g___________________| |
                        | |   |   | |   __________________|
                        | |   |   | |  |    ________
                        | |   |   | |  |   |  ______ ...
                       _|_|___|___|_|__|___|_|_   __
                      | p2g2  c2  p1g1 c1  p0g0| |
                      |       lookahead3     c0|-o Cin
                      |________p______g________| |
                         ____  |      |          |
                  ____  |    |-       |          |
                 |    |-| AND|        |          |
       Cout  o---| OR | |____|--------|---------- 
                 |____|---------------

A Question: If we assume a constant delay per gate and no cost for interconnecting wires, an increment circuit can be built that takes a constant time to increment a number regardless of the number of bits.
Explain why no physical realization of this circuit will break O(log n). To do this, you will have to identify the component or components that pose special problems, and you will have to explain what those problems are that lead to the indicated performance.
In building an n-bit increment circuit, the recursive subcircuit build is called n-1 times; the final time, the formal parameter index will have the value n. Within build, there is an and gate named g. This gate has its fan-in given by the formal parameter index. The above two facts combine to require an and-gate with O(n) fan-in.
Given any particular hardware technology, there will be some limit to the fan-in of any single gate. If we need greater fan-in, we will have to create a cascade of gates. The most efficient cascade of gates is a tree, leading to the conclusion that the delay for a gate with a fan-in of n will be O(log n).
A similar argument can be constructed based on the fanout, since the input to the least significant position in the increment circuit is used by all greater positions in the circuit, therefore requiring a fanout of n.
Part A: Assuming that, prior to multiplying, the multiplicand is in M[sp], assume the multiplier is in t, and assume the accumulator is zero. What new data paths and functional operations must be added to the machine to allow multiplication?
ACF needs an extra bit because we now need a right-shift operation and an add with the sum right-shifted. Very arbitarily, we could code these as ACF=4 and ACF=5.
We need to add TF to control a multiplexor for the input to the T register. TF=0 would load T from the data bus, while TF=1 would right-shift T (shifting the bit shifted out of the AC function select into the most significant bit of T).
We need to add a new feedback path from the data-flow part to the control system, so the microcode can inspect the least significant bit of T; call this TM2 (T mod 2).
In order to allow TM2 to be used to control the microcontrol unit, we add a new bit to the cond field of each microinstruction. This Very arbitrarily, we use COND=4 to select TM2 as the condition.

Part B: Give the microinstruction sequence you would need for a 16 bit multipy. Note that no loop counter is available, so your code will include 16 repeats of the same microinstructions. You need only show one repeat of your code if you document the repeating pattern properly.
```
uaddr || next uadr | clocks |bus |   function     |
------++-----------+--------+----+----------------+---------
      ||           | IAPTSM | AM | I  A  T P S M  |             
      || COND NEXT | RCCCPW | CR | R  C  F C P A  |             
      ||           | CCC CR | WE | F  F    F F D  |
------++-----------+--------+----+----------------+---------
  a   || 100    b  | 000000 | 00 | x xxx x x x xx | test T bit 0
  b   || 100    c  | 010100 | 00 | x 100 x 1 x xx | shift, test bit 1
  b|1 || 100    c  | 010100 | 01 | x 101 x 1 x 10 | shift-add, test bit 1
  c   || 100    d  | 010100 | 00 | x 100 x 1 x xx | \ repeat (bit 2)
  c|1 || 100    d  | 010100 | 01 | x 101 x 1 x 10 | /
     - - - - - - - - - - - - - - - - - - - - - - - - -
  p   || 100    q  | 010100 | 00 | x 100 x 1 x xx | \ repeat (bit 15)
  p|1 || 100    q  | 010100 | 01 | x 101 x 1 x 10 | /
  q   || 000   next| 010100 | 00 | x 100 x 1 x xx | shift, (done)
  q|1 || 000   next| 010100 | 01 | x 100 x 1 x 10 | shift-add (done)
```
The above code takes 33 microinstructions, one that only tests the least significant bit of T. The remaining 32 instructions consist of 16 instruction pairs, stored in even/odd memory address pairs. Each of these instruction pairs deals with shifting and adding in response to the previously tested bit of the multiplier while testing the next bit of the multiplier. It takes 17 microcycles to execute the multiply, and for each nonzero bit in the multiplier, there will be a read of the multiplicand from memory.
This microcode can be optimized slightly by combining the first and last microinstructions in sequence with operations that might otherwise be part of the computations done prior to the multiply or after it, but optimization is not part of this exercise.