5. Register Transfer Logic, A Breakneck Review
Part of
the 22C:122/55:132 Lecture Notes for Spring 2004

At the register transfer level, digital systems are characterized by a set of registers and the combinatorial circuitry that interconnects these registers. While we can treat each of these combinatorial circuits as a new design problem, there are a large number of standard combinatorial circuits that occur frequently in register transfer level designs, and it is worth reviewing these before we look at mechanistic approaches to reducing programming-language style specifications at the instruction-set processor level to register transfer systems.
Register transfer systems are generally described schematically, using notations based on the standard schematic diagram notations developed by electrical engineers. In this basic notation, the wires (or conducting paths on printed circuit boards or integrated circuit dies) that interconnect components are shown as lines, while abstract components themselves are shown as boxes. In high level digital system design, the wires connecting parts to the power supply (and ground) are generally omitted, so the wires that remain all carry digital signals, with one bit carried on each wire:
[Figure: a single line, labeled "one bit", representing one wire carrying one digital signal.]
When multiple bits are involved, there are a number of equivalent notations:
[Figure: three equivalent notations for a bundle: three separate lines labeled D_{1}, D_{2} and D_{3}; a single thick line labeled D_{1-3} crossed by a slash with the count 3 (thickness indicates multiple wires, the count indicates how many); and a single thin line labeled D_{1-3} crossed by a slash with the count 3 (careful use of a count indicates multiple wires).]
Making a distinction between "fat lines" and "thin lines" is helpful, but it takes more work to draw the pictures this way. Where there is only one line thickness, slashes (representing bits of string used to tie wires into bundles) and wire counts are needed on each line representing a bundle, and this must be repeated as needed to make it clear that some particular line in a schematic diagram represents more than one wire.
When "fat line" notation is used for bundles, particularly when there is a standard word size, no "wire count" or slash is needed for those bundles where the count is obvious from context.
Where it is necessary to show a bundle being broken into its individual wires, the following notation is common:
[Figure: the bundle D_{1-3} broken into the single wire D_{1} and the sub-bundle D_{2-3}, which is broken in turn into the individual wires D_{2} and D_{3}.]
In using this notation, particularly when showing a bundle being broken into sub-bundles, it is important to make it clear, using subscripts or wire names, which wires go in which bundle!
Generally, when a group of wires represents the bits of one word, they should be named systematically. For example, if the wires carry the contents of the n-bit register R, it would be sensible to name them R_{0} to R_{n-1} or R_{1} to R_{n}. The particular range and order of the subscripts may vary from one setting to another. If R is thought of as holding an integer value, for example, R_{i} might be used to hold the bit with weight 2^{i}, implying that R_{0} is the least significant bit. On the other hand, if R holds a fixed point fraction, R_{i} might hold the bit with weight 2^{-i}, implying that R_{1} holds the most significant bit.
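The two weighting conventions are easy to confuse, so here is a small behavioral sketch in Python (the function names are ours, purely illustrative):

```python
def integer_value(bits):
    """bits[i] is R_i with weight 2**i, so bits[0] is the least significant bit."""
    return sum(b << i for i, b in enumerate(bits))

def fraction_value(bits):
    """bits[i-1] is R_i with weight 2**-i, so bits[0] is the most significant bit."""
    return sum(b * 2.0 ** -(i + 1) for i, b in enumerate(bits))

# The same bit pattern 1101 read both ways:
print(integer_value([1, 0, 1, 1]))   # R_3..R_0 = 1101 -> 13
print(fraction_value([1, 1, 0, 1]))  # 0.1101 binary -> 0.8125
```

Note that the same list of bits names different physical wires under the two conventions, which is exactly the documentation hazard discussed below.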
If a system is designed from the ground up, it is useful to adopt a uniform naming convention and use it throughout the system. Things get more difficult when off-the-shelf components are used. Each component will come with documentation that uses its own naming scheme, and as a result, you might find that the address output from a microprocessor chip has A_{1} as the most significant bit, while the address input to the memory chip might have A_{0} as the least significant bit. This kind of incompatible naming adds no complexity to the implementation of the system, but it offers numerous opportunities for confusion in the documentation.
Off the shelf components can also add to the physical system complexity. The most common cause of this is incompatible assignment of binary encodings to signals. A one bit signal may be encoded using positive or negative logic (does high mean true or false?), and in general, for an n-bit signal, there are 2^{n}! (that is, factorial) different ways of encoding the same set of 2^{n} meanings on that bundle of wires. Interconnecting off the shelf components can therefore involve arbitrary looking little bits of combinatorial logic to do the necessary code conversions. This logic is sometimes described as the glue used to hold the components together.
A register is a collection of identical flip-flops all sharing a common clock. For example, we might construct the 3-bit positive pulse-triggered register R from 3 positive pulse-triggered D latches as:
[Figure: a 3-bit register R built from three positive pulse-triggered D latches R_{2}, R_{1} and R_{0}, with data inputs D_{2}-D_{0} on top, a common clock C on the left, and data outputs Q_{2}-Q_{0} on the bottom.]
Generally, we will avoid drawing out the gate-level components, and instead, we will simply abbreviate this register as:
[Figure: the register abbreviated as a single box labeled R, with a 3-wire input bundle D_{2-0} on top, a clock input C on the left, and a 3-wire output bundle Q_{2-0} on the bottom.]
The commonly used notation clearly marks, on the bundled wires, the number of conductors, while leaving it to the reader to infer that the register must have that number of bits. External connections are generally given names, with subscripts indicating the number of conductors and their order. Thus, if the input is labeled D_{2-0}, the most significant bit is D_{2} and the least significant bit is D_{0}.
In the above figure, we could have used subscripts on the register name to indicate the numbering of the bits of R; for example, R_{2-0} to indicate that R has 3 bits and that R_{2} is the most significant bit while R_{0} is the least significant.
In theory, no digital system designer needs more than one type of edge triggered register, but optimal designs sometimes require more types. So, it is handy to keep at least the following 4 types in mind:
[Figure: the four register symbols: positive pulse-triggered, positive edge-triggered, negative pulse-triggered, and negative edge-triggered. The clock input distinguishes them; the edge-triggered symbols carry the dynamic-input marker on the clock, and the negative-triggered symbols add an inversion bubble.]
The pulse-triggered registers are made with type D latches, while the edge-triggered registers are made with edge-triggered or master-slave flip-flops. A simple inverter on the clock input to a positive pulse or edge triggered device will convert it to a negative pulse or edge triggered device.
It is worth remembering that register transfer level designers rarely make any explicit use of the Q-bar outputs of registers in their initial designs, but that these outputs are available at almost no cost. Therefore, if a register is located, in the physical realization of the circuit, near some function that requires the inverted outputs, it is common to extract these directly from the register instead of using inverters to compute the function.
On the other hand, if the inverse is needed at some distance from the register, the cost of the added inverters may frequently be less than the cost of the extra interconnections needed to move both the Q outputs and the Q-bar outputs from the register.
When some logic element must process data from multiple sources, a multiplexor can be used to select the source. A 2-input, 1-output multiplexor is composed of elements with the following truth table:
   A B C | D      C A B | D      C A B | D
   0 0 0 | 0      0 0 0 | 0      0 0 - | 0
   0 0 1 | 0      0 0 1 | 0      0 0 - | 0
   0 1 0 | 0      0 1 0 | 1      0 1 - | 1
   0 1 1 | 1      0 1 1 | 1      0 1 - | 1
   1 0 0 | 1      1 0 0 | 0      1 - 0 | 0
   1 0 1 | 0      1 0 1 | 1      1 - 1 | 1
   1 1 0 | 1      1 1 0 | 0      1 - 0 | 0
   1 1 1 | 1      1 1 1 | 1      1 - 1 | 1

   C A B | D
   0 0 - | 0
   0 1 - | 1
   1 - 0 | 0
   1 - 1 | 1
Here, when the control input C is 0, the output D is equal to the input A, while when the control input is 1, the output is equal to input B. The left truth table above gives the specification with the inputs listed in the natural order that comes from using A and B as the two inputs and C as the control input. The second truth table is exactly the same, except that the control input has been listed first. The rightmost truth table is based on the second, but with don't care conditions clearly marked on the inputs; finally, the truth table at the bottom is what results from collapsing identical rows, leading to an optimal design. Experienced digital designers may come up with this final version as their initial draft, but it is worth showing the full progression here.
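The selection rule is easy to check behaviorally. Here is a small Python sketch (our own function names, purely illustrative) comparing the abstract selection rule against the sum-of-products form that the collapsed truth table leads to:

```python
def mux(a, b, c):
    """Behavioral rule: when C is 0 the output follows A; when C is 1, B."""
    return a if c == 0 else b

def mux_gates(a, b, c):
    """Gate-level form: D = (A and not C) or (B and C)."""
    return (a & (c ^ 1)) | (b & c)

# The two agree on all 8 input combinations.
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            assert mux(a, b, c) == mux_gates(a, b, c)
```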
If we follow the recipe for converting a truth table to a digital system given in the notes for lecture 2, we get the following:
[Figure: the multiplexor implemented as an and-or array: the inputs C, A and B run down through inverters to form input columns; the and array has one gate per row of the collapsed truth table that outputs 1, and the or array combines the two products to produce D.]
Note that our optimization rules have allowed us to dispense with two inverters on the input columns and two table rows representing input combinations where the output was zero. Redrawing the resulting circuit compactly gives us a two-input by one bit multiplexor in the following form:
[Figure: the multiplexor redrawn compactly as two AND gates feeding an OR gate, with an inverter on C: D = (A and not C) or (B and C).]
In register transfer diagrams, multiplexors are frequently drawn using one or the other of the following notations:
[Figure: two common multiplexor symbols, each with data inputs labeled 0 and 1 on top, control input C on the side, and the output on the bottom: a rectangular box labeled MUX, and an unlabeled box with sloped sides.]
This notation may be used for either single bit or multiple bit multiplexors, with the presence of multiple bits indicated by "bundle" notation on the interconnections and not by anything within the box. The slope-sided notation is commonly used without a label by designers who limit the use of this symbol to multiplexors and use it for nothing else. Other designers use the slope-sided notation for any function that combines multiple inputs to make one output, for example, to represent adders or similar parts. When that approach is used, a function label is, of course, required.
Multiplexors for more than 2 inputs may be made by cascading 2-input multiplexors to make a multiplexor tree, as:
[Figure: a 4-input multiplexor tree: two 2-input multiplexors select between A_{0} and A_{1} and between A_{2} and A_{3} under control of C_{0}, and a third 2-input multiplexor selects between their outputs under control of C_{1} to produce D.]
Use of multiplexor trees adds delays, so it is important to know that there is a general fast way to create a multiplexor with n control inputs and 2^{n} data inputs. This is illustrated in the following example, with 2 control inputs and 4 data inputs:
[Figure: a flat 4-input multiplexor: the control inputs C_{1} and C_{0} and their complements are decoded by four and gates, one per data input A_{0} to A_{3}, and the or of the four products is the output D.]
Both the multiplexor tree and the optimal design are likely to be drawn as follows in a registertransferlevel logic diagram:
[Figure: the abbreviated 4-input multiplexor symbol: a slope-sided box with data inputs labeled 00, 01, 10 and 11 on top, a 2-wire control bundle C_{1-0} on the side, and the output D on the bottom.]
The dimension on the control input to this multiplexor can be inferred from the number of inputs to the multiplexor, but the order of the subscripts cannot. The latter is essential to the determination of which combination of control inputs will select what data input. The labels on the inputs can be given in any system that is obvious, in context. Binary and hexadecimal labels are common, and symbolic labels are sometimes used when there is a well established binding of the bit patterns on the control input to some set of symbols.
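Behaviorally, the one-hot decode-and-gate structure of the wide multiplexor collapses to an index operation. A small Python sketch (names are ours, purely illustrative):

```python
def wide_mux(data, control_bits):
    """data has 2**n entries; control_bits is (c_{n-1}, ..., c_0)."""
    n = len(control_bits)
    assert len(data) == 2 ** n
    index = 0
    for c in control_bits:         # decode the control bits into an index;
        index = (index << 1) | c   # in hardware this is the one-hot decoder
    return data[index]

print(wide_mux([0, 1, 1, 0], (1, 0)))  # control = 10 selects input A_2 -> 1
```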
If some combinational logic element is to be replicated for each bit in a word, it is represented schematically using a box to represent the replicated element, with inputs and outputs labeled appropriately and the contents of the box documenting the function being computed:
[Figure: a replicated combinational element: bundles A and B enter a box at inputs labeled x and y, the box is labeled z = x or y, and the output bundle C leaves the z connection at the bottom.]
Note in the above example that we have local labels on the inputs and outputs of the box, as well as global labels on the inputs and outputs from the entire system. The local labels on the inside of the box at each connection relate only to the documentation of the function of the box, while the global labels outside the box are used to document the relationship between the box and its larger context.
Generally, when the equation in the box uses logical operators such as and, or and not, we assume that this logical operation is carried out in parallel, bit by bit. Thus, if the word size is n bits, the above box becomes equivalent to n iterations of the following simple logic circuit:
[Figure: bit i of the replicated element: a single OR gate with inputs A_{i} and B_{i} and output C_{i}.]
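Each bit position gets its own copy of the gate. The bit-parallel reading of "z = x or y" can be sketched in Python as n independent per-bit operations (names are illustrative):

```python
def or_words(x_bits, y_bits):
    """n iterations of the one-bit OR circuit, one per bit position."""
    return [xi | yi for xi, yi in zip(x_bits, y_bits)]

a = [1, 0, 1, 0]
b = [1, 1, 0, 0]
print(or_words(a, b))  # [1, 1, 1, 0]
```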
It is quite common to discover that, along some data path within a digital system, some selection must be made between combinational functions. This can always be done as follows, using a brute-force implementation:
[Figure: the brute-force selection: bundles A and B feed two boxes in parallel, one computing z = x or y and the other its complement, z = (not x) and (not y); a multiplexor controlled by C selects between the two results to produce F.]
We will frequently abbreviate such circuits using the following notation:
[Figure: the abbreviated form: bundles A and B enter one box at x and y; the box lists the two functions, 0: z = x or y and 1: z = (not x) and (not y), selected by the control input C on the side, with the result F leaving the z connection at the bottom.]
Here, we use the same convention as was used with multiplexors: control inputs are shown from the left when data inputs are from the top, or from the bottom when data inputs are from the side.
Note that clear documentation is not the same thing as good implementation. Direct implementation of the logic element described above involves computing the two alternative results completely independently and then selecting between them. If you descend below the surface, however, you may find that the two functions share terms, and if this is the case, duplicate computation of those terms is not a good idea! In the above example, it turns out that we can do a bit of algebra to discover that the two values are very closely related: applying DeMorgan's law to (not x) and (not y) gives not (x or y)! Therefore, we might want to implement this as:
[Figure: the shared-term implementation: A and B enter a box computing z = x or y; its output feeds a second box that either passes the value through (0: y = x) or inverts it (1: y = not x) under control of C, producing F.]
It should be noted that the conditional invert function required in the above is equivalent to an exclusive or applied to each input:
[Figure: the conditional invert implemented as one exclusive or gate per bit: each XOR takes x_{i} and the shared control C and produces F_{i}.]
This is not really as exciting as it may seem! The exclusive or function is one of the most expensive of the logic functions to compute. Most implementations rest on one or the other of the following:
[Figure: two gate-level implementations of A xor B, each three gates deep, one built entirely from inverting gates and one computing the sum of products directly. (This is really: A xor B = (A and not B) or (not A and B).)]
Propagation of an input change to the output of either circuit takes 3 gate delays, and the second circuit is exactly equivalent to a multiplexor controlled by the B input being used to select between A and not A.
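Both identities are easy to verify exhaustively. A small Python sketch (function names are ours, purely illustrative):

```python
def xor_sop(a, b):
    """Sum of products form: A xor B = (A and not B) or (not A and B)."""
    return (a & (b ^ 1)) | ((a ^ 1) & b)

def xor_as_mux(a, b):
    """The multiplexor view: B selects between A and not A."""
    return a if b == 0 else (a ^ 1)

# All three forms agree on the 4 input combinations.
for a in (0, 1):
    for b in (0, 1):
        assert a ^ b == xor_sop(a, b) == xor_as_mux(a, b)
```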
The simplest arithmetic operation, adding one or incrementing, was mentioned at the end of the previous lecture. To repeat a bit of content and give an example, adding 1 to 1011_{2} is done as follows:
        0 0 1 1      the carry bits
   N      1 0 1 1    addend
   +            1    augend
   -----------------
   N+1  0 1 1 0 0    sum
Each column of this sum produces a 2bit result, with values ranging from 0 to 2. The least significant bit of each singlecolumn addition is the result bit for that column. The most significant bit is the carry into the next most significant column. The result of the increment operation always has the potential to require one more bit than the input.
A simple truth table can be constructed to describe what is done for each column:
   in  +  | out
   n   c  | c    s
    i   i |  i+1  i
   -------+--------
   0   0  | 0    0
   0   1  | 0    1
   1   0  | 0    1
   1   1  | 1    0
Here, we have labeled the carry inputs to bit i of the number as c_{i} and the carry output from bit i as c_{i+1}. It is easy to see the following relations from the truth table:

   s_{i} = n_{i} xor c_{i}
   c_{i+1} = n_{i} and c_{i}
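The relations that this truth table gives can be sketched directly in Python (an illustrative function name):

```python
def half_adder(n_i, c_i):
    """One column of the increment: sum is XOR of the inputs, carry is AND."""
    s_i = n_i ^ c_i
    c_next = n_i & c_i
    return s_i, c_next

print(half_adder(1, 1))  # (0, 1): sum 0, carry out 1
```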
We call this logic circuit a half adder because it does almost half of the work required to add two binary numbers.
The full increment function on a 3 bit number can be implemented by a series of these 1bit operations:
[Figure: a 3-bit increment unit: three half adders, one per bit of the addend N_{2-0}, with the constant 1 as the augend feeding the carry into the least significant bit; the carry out c of each stage feeds the carry in of the next, producing the sum S_{2-0} and the carry out C_{3}.]
This design, with the carry out of each stage fed into the carry in of the next is called a ripple carry design. It has the serious disadvantage that, as the word size grows large, the time it takes the carry to propagate from the least significant bit up to the most significant bit is linearly proportional to the word size.
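A behavioral sketch of the ripple-carry increment in Python makes the linear carry propagation explicit (the function name is illustrative):

```python
def increment(bits):
    """Ripple-carry increment. bits[0] is the least significant bit;
    returns (sum_bits, carry_out)."""
    carry = 1  # the augend: add one
    out = []
    for n in bits:           # one half adder per stage; each stage must
        out.append(n ^ carry)  # wait for the carry from the stage before
        carry = n & carry
    return out, carry

# N = 1011 (listed least significant bit first): 11 + 1 = 12 = 1100
print(increment([1, 1, 0, 1]))  # ([0, 0, 1, 1], 0)
```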
At the register transfer level, we will typically abbreviate this as one of the following:
[Figure: three register-transfer abbreviations of the 3-bit incrementer. Leftmost: a box labeled +1 with a 3-wire input N and a 3-wire output S. Center: the same box with a 4-wire output, the carry folded in as the fourth bit. Rightmost: a box labeled cs = a+b with an explicit carry input C, a 3-wire sum output S, and a distinct carry output C.]
In the leftmost example, the carry out is ignored! In the center example, the carry out is incorporated into the output as a 4th bit. In the rightmost example, the carry out is extracted as a distinct output. The form used depends on our focus and on the context.
In the 2 left examples above, the carry input to the least significant bit of the adder chain has been ignored. We assume that it is one if it is not mentioned. In the rightmost example, the carry input is brought out explicitly so that it may have a value other than 1.
We could also use the following form to emphasize that the value used for the carry input can be thought of as being used to select the function:
[Figure: the incrementer drawn as a function selection box: input N enters at x, the box lists 0: f = x and 1: f = x+1, selected by the control input C on the side, and the result S leaves the f connection at the bottom.]
Here, we have treated the C input as a control input for selecting between two functions, increment if it is one and do nothing if it is zero.
A full adder for two binary numbers is described by the following truth table:
   in      +  | out
   a   b   c  | c    s
    i   i   i |  i+1  i
   -----------+--------
   0   0   0  | 0    0
   0   0   1  | 0    1
   0   1   0  | 0    1
   0   1   1  | 1    0
   1   0   0  | 0    1
   1   0   1  | 1    0
   1   1   0  | 1    0
   1   1   1  | 1    1
Here, aside from adding a new input, we have labeled things compatibly with the increment example. Inspection gives the following implementation:
[Figure: a full adder built from two half adders: the first adds a_{i} and b_{i}; its sum feeds the second half adder along with c_{i}; the sum output of the second half adder is s_{i}, and the two carry outputs are combined to produce c_{i+1}.]

Here, we have violated a standard rule of logic schematic diagrams: data should always flow from left to right and top to bottom. The reason is, of course, that the convention from arithmetic, that carry is propagated from right to left, has been allowed to dominate!
The one-bit increment components are referred to as half adders precisely because it takes two of them to make a full adder, as shown above.
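The two-half-adder construction can be sketched behaviorally in Python (illustrative names; the carries are combined with an OR, since at most one of the two half adders can carry):

```python
def half_adder(x, y):
    return x ^ y, x & y

def full_adder(a, b, c):
    """A full adder from two half adders plus an OR on the carries."""
    s1, c1 = half_adder(a, b)   # first half adder adds a and b
    s, c2 = half_adder(s1, c)   # second adds in the carry
    return s, c1 | c2

# Check against the full adder truth table: 2*c_out + s must equal a+b+c.
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            s, cout = full_adder(a, b, c)
            assert 2 * cout + s == a + b + c
```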
At the block diagram level, we might specify an adder as follows:
[Figure: two block-diagram forms of the adder. Left: a box labeled f = x + y, with n-wire inputs A and B at x and y and an (n+1)-wire sum output S at f. Right: a box labeled f = x + y + c, with n-wire inputs A and B, a 1-wire carry input at c, a sum output S at f, and a distinct carry output C at the auxiliary c output.]
In the left example, the carry out of the high bit of the adder has been folded into an n+1 bit result, given that the addend and augend were n bits each. In the right example, the carry out from the most significant bit is an auxiliary output of the adder, and the carry into the least significant bit is an auxiliary input. The convention that inputs are on the top edge and outputs are on the bottom edge makes the subscripted notation c_{in} and c_{out} redundant, but it might have been a good idea to include it. Alternately, we could have used the subscripts C_{0} and C_{n}, but only if, everywhere in the design, we rigorously stuck to the convention that bit zero was the least significant bit and bit n-1 was the most significant.
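A behavioral n-bit ripple-carry adder matching these block diagrams can be sketched as follows (our own names, purely illustrative):

```python
def ripple_add(a_bits, b_bits, carry_in=0):
    """Ripple-carry addition. Bit lists are least significant bit first;
    returns (sum_bits, carry_out)."""
    carry = carry_in
    out = []
    for a, b in zip(a_bits, b_bits):
        total = a + b + carry    # one full adder per stage
        out.append(total & 1)    # the sum bit
        carry = total >> 1       # the carry into the next stage
    return out, carry

# 0110 + 0011 = 1001 (6 + 3 = 9), bits listed least significant first
print(ripple_add([0, 1, 1, 0], [1, 1, 0, 0]))  # ([1, 0, 0, 1], 0)
```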
If speed is of the essence, as it is in most high performance computer architectures, the ripple carry approach to computing the sum outlined above is not very good. The adder outlined above has a speed proportional to O(n) on an n bit word. There are practical ways of building an adder with O(log n) speed, and of course, optimal design methods can always be applied to give something with 3 gate delays, for an apparent speed of O(1). The latter is only an illusion, though, because it requires gates with O(n) inputs and O(n) fanout, and the speed of such gates tends to fall off as O(log n)! As a result, O(log n) speed is the realistic goal for designers of fast arithmetic hardware.
Subtraction of binary numbers can be handled by adding the two's complement; the two's complement of a binary number is one plus the one's complement, and the one's complement is computed by inverting all of the bits of the number. This leads to the following naive design for a subtractor:
[Figure: a naive subtractor: B passes through a box labeled not and then a box labeled +1, and the result feeds the y input of an adder box labeled f = x + y, with A at x, producing the difference S.]

Noting that the carry input to the adder can also be used to add one to the sum, we can collapse this to:
[Figure: the collapsed subtractor: B passes through a box labeled not and feeds the y input of an adder box labeled f = x + y + c; A feeds x, the constant 1 feeds the carry input c, and the outputs are the difference S and the carry out C.]

This is a common way to realize a subtractor, and it has the consequence that C_{out}=1 indicates that there was no borrow and C_{out}=0 indicates that there was a borrow. (The fact that the carry out after a subtract has an inverted sense is directly exposed to assembly language programmers in some architectures, while in others, the hardware designers added extra logic to avoid this inversion that some novice programmers find to be confusing.)
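The invert-and-add construction, including the inverted borrow sense of the carry out, can be sketched in Python (illustrative names; unsigned operands assumed):

```python
def subtract(a_bits, b_bits):
    """A - B as A + (not B) + 1. Bit lists are least significant bit first;
    returns (difference_bits, carry_out). Carry out 1 means no borrow."""
    carry = 1  # the +1 of the two's complement
    out = []
    for a, b in zip(a_bits, b_bits):
        total = a + (b ^ 1) + carry  # add the inverted B bit
        out.append(total & 1)
        carry = total >> 1
    return out, carry

# 0110 - 0011 = 0011 (6 - 3 = 3); carry out 1 means no borrow occurred
print(subtract([0, 1, 1, 0], [1, 1, 0, 0]))  # ([1, 1, 0, 0], 1)
```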
Of course, we can also simply design the truth table for the subtractor from first principles and implement it directly.
While adders and subtractors are useful and sufficient, it is common to design single functional units that can perform both jobs. Such a unit is usually referred to as an Arithmetic Logic Unit. Here is one typical design:
[Figure: a 3-bit ALU. Each bit of B passes through an exclusive or gate controlled by F_{0} before entering its full adder stage; AND gates controlled by F_{1} gate the carry path between the stages; and F_{2}, gated the same way, supplies the carry into the least significant stage. The stages produce the sum S_{2-0} and the carry out.]

This example ALU computes 8 possible functions of A and B, as selected by the inputs F_{0}, F_{1} and F_{2}. F_{0} is exclusive-ored with each of the bits of the second input to the adder, inverting them if it is one, leaving them unchanged if it is zero. F_{1} is used to control carry propagation between the stages of the adder, allowing propagation if it is one, inhibiting propagation if it is zero. F_{2} is the carry in to the least significant bit of the adder.
Because F_{2} in the above design is ignored when F_{1} is zero, this ALU computes only 6 useful functions. These are enumerated below:
   F  F  F  |
    2  1  0 | operation
   ---------+-------------------------------
   -  0  0  | S = A xor B
   -  0  1  | S = A equ B = A xor (not B)
   0  1  0  | S = A + B
   0  1  1  | S = (A - B) - 1 = A + (not B)
   1  1  0  | S = A + B + 1
   1  1  1  | S = A - B = A + (not B) + 1

This ALU design is typical of many commercial ALU designs! One typical feature is that many of the control input combinations produce operations of doubtful utility. If some carefully selected and gates in an optimal version of the adder are given additional enable inputs, the result is a few more useful operations and many more useless ones, with the addition of only one or two additional control lines. Because of this, useful systems almost always hide the ALU control lines from the user, and instead, run the user's operation code through a small ROM (table lookup) to select from among the useful ALU operations without exposing the user to the useless or redundant operations.
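A behavioral model of this ALU is a good way to check the function table (our own names; we assume, as the table requires, that the F_{1} carry gating also blocks the F_{2} carry into the least significant bit):

```python
def alu(a_bits, b_bits, f2, f1, f0):
    """Bit lists are least significant bit first; returns the result bits."""
    carry = f2 & f1                    # F2 is the carry in, gated by F1
    out = []
    for a, b in zip(a_bits, b_bits):
        b = b ^ f0                     # F0 conditionally inverts B
        s = a ^ b ^ carry              # full adder sum
        c = (a & b) | (a & carry) | (b & carry)  # full adder carry
        out.append(s)
        carry = c & f1                 # F1 = 0 blocks carry propagation
    return out

A = [0, 1, 1, 0]  # 6, least significant bit first
B = [1, 1, 0, 0]  # 3
print(alu(A, B, 0, 1, 0))  # A + B   -> [1, 0, 0, 1] (9)
print(alu(A, B, 1, 1, 1))  # A - B   -> [1, 1, 0, 0] (3)
print(alu(A, B, 0, 0, 0))  # A xor B -> [1, 0, 1, 0] (5)
```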
The final sections of http://homepage.cs.uiowa.edu/~dwjones/assem/notes/08arith.html cover some of this material on arithmetic logic units, with better quality graphics and an alternative presentation.
It is common in practical designs to find that one or both data inputs to the ALU come from one of several sources. It is worth noting, in this context, that the exclusive-or function in the above example ALU can be widened into a multiplexor at no additional cost in gate delays. Recall that the implementation of an exclusive-or function is unlikely to be faster than a multiplexor used to select between B and not B. Therefore, a broadened ALU, with integral multiplexors to select between inputs, is unlikely to be much slower than an ALU with only two inputs. Therefore, if our naive design calls for something like this:
[Figure: a naive design in which each ALU data input is fed by a 2-input multiplexor: one selects between A and B, the other between C and D, and their outputs feed the a and b inputs of an ALU box with carry output c and sum output s.]
We can easily fold the multiplexor on the b input to the above ALU into the exclusive or for inverting this input in our implementation.
Random access memory plays an important role in many settings. In addition to the large RAM used for main memory, smaller random access memories serve many roles, for example, to hold banks of registers inside the central processor, or as key parts of cache memory units.
Any random access memory can be modelled as an array of registers, and the small RAMs used to implement register banks frequently follow this model almost exactly. The most naive implementation of a RAM is as follows:
[Figure: a naive 4-word by n-bit RAM: the n-wire data input fans out to an array of four n-bit registers; a demultiplexor distributes the clock signal to the register selected by the 2-wire address; and a multiplexor driven by the same address selects the appropriate register for the n-wire data output.]

The registers used to implement a small RAM are almost always simple latches. The larger the memory, the more likely some other technology is to be used, either simpler (capacitive memory in DRAM) or more complex (DRAM refresh logic). In the above, the notation suggests positive pulse-triggered latches. Normally, we will use the following notation to indicate such a RAM:
[Figure: the abbreviated RAM symbol: a box labeled RAM 4 x n, with an n-wire data input on top, a 2-wire address input and a clock input on the left, and an n-wire data output on the bottom.]

Having indicated that the RAM is 4 x n bits, the labels stating that the data in and data out lines are n bits wide and that the address line is log_{2}4 bits wide are redundant, but such redundancy can prevent misunderstanding!
Given the above implementation of a RAM, it is easy to see how multiple read ports could be added, so that two different customers for the data from the RAM can independently, and in parallel, extract the values of different registers. All we need is a second output multiplexor attached to a second address input. Multiport RAM gets far harder to build when the memory is large!
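The register-array view of a small RAM, including the way a second read port is just a second output multiplexor, can be sketched behaviorally (class and method names are ours, purely illustrative):

```python
class SmallRAM:
    """A RAM modeled as an array of n-bit registers."""
    def __init__(self, words, bits):
        self.mask = (1 << bits) - 1    # each register holds n bits
        self.cells = [0] * words

    def write(self, addr, value):
        # models the clock demultiplexor: only the addressed register is clocked
        self.cells[addr] = value & self.mask

    def read(self, addr):
        # models one output multiplexor; a second read port is just a second
        # call with its own independent address input
        return self.cells[addr]

ram = SmallRAM(4, 8)
ram.write(2, 0b1010)
ram.write(3, 0b0101)
print(ram.read(2), ram.read(3))  # two ports reading different registers
```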
The usual implementation of RAM does not involve a distinct multiplexor and demultiplexor. Instead, the address lines are decoded once and the decoder outputs distributed to the memory cells as follows:
[Figure: the usual RAM implementation: the address inputs A_{1} and A_{0} and their complements are decoded once, producing for each of the four rows a row select signal and a row clock (the decoder output gated with the clock input); these run horizontally to the cells of each row, while the data inputs D_{2-0}, their complements, and the data outputs run vertically through the columns.]

Each row of the memory is one register, with select and clock signals from the left and data and inverted data inputs for each bit from above. Each bit of a naive (very small, very fast) static RAM memory is typically implemented as follows:
[Figure: one static RAM cell: a flip-flop built from cross-coupled gates, with the row clock gating the data input D_{i} and its complement into the latch, and the row select enabling a tri-state driver that places the stored bit on the data output line D_{i}.]

The memory cell shown above has a tri-state output, so that the multiplexor function is distributed between memory cells instead of being abstracted out. An outside observer cannot see this, although some memory designs also include a tri-state data output on the entire memory, so that many memory chips can be wired in parallel, with only one of them enabled at any time.
Larger and more tightly designed RAM arrays have cells reduced from the design shown here. Instead of using a full RS flipflop for each cell, the cells of the array are reduced to what is called a "jam latch" because data is jammed into the latch through its outputs. The data output line is, in this case, also used for input, so we call it the bit line. The data output is connected to the latch by a bidirectional bus coupler (in MOS technology, this is a single transistor, and the latch itself is just 2 more transistors). To read from memory, the contents of the latch are gated onto the bit line and the data from the bit line is amplified and fed to the output. To write to memory, the data to be written is amplified and fed to the bit line while the latch for one cell is selected. The data is jammed backward from the bus through the latch and into the cell. This is the approach used in most SRAM.
To build DRAM, we use the above approach, simplified one step farther. We replace the 2 transistors used to make the basic jam latch with a single capacitor. So long as the contents of this capacitor are read shortly after it is written, the data will be the same as what was written. If the read is delayed, the charge on the capacitor will leak away, leaving an ambiguous value in that bit of memory. Each time the cell is read, the value read is immediately jammed back in, refreshing that cell. To make sure that the contents of the entire RAM array remain valid, all cells in that array must be read on a regular basis. DRAM refresh hardware does this, borrowing unused memory cycles (as seen from the perspective of any particular memory array) to perform refresh cycles.