2. Data Representation

Part of 22C:60, Computer Organization Notes
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Representation

When we write computer programs, we always deal with representations. For example, we don't deal with text the way a calligrapher or author deals with text; we deal with a specific character set that may be used to represent text. Instead of dealing with integers or real numbers, we deal with finite precision representations of integers and with floating point representations of real numbers.

Well designed programming languages allow the programmer to treat these representations as if they were identical to the abstractions they represent, but this identity is never complete. Thus, for example, our machine representations of integers behave exactly like abstract integers until a result greater than the machine precision is produced; at that point, either the behavior of integer variables becomes very strange (the usual consequence in C++) or the program raises an exception (the usual consequence in Java).

If speed is no issue, arbitrary precision arithmetic packages are available in some languages such as C and Java. These packages, while slow, automatically switch to a larger representation every time the result of a computation overflows the precision of the representation previously in use. In languages where speed is of only secondary importance, such as Python, all integers are represented with such arbitrary precision packages.

Just about every introduction to assembly language or machine language programming ever written begins with a section like this. The IBM 7040/7044 Data Processing System Student Text (IBM, 1963) begins with a chapter on binary arithmetic that is as good as anything more recent, and the Introduction to Programming for the PDP-8 (DEC, 1973) has a nice "Number System Primer" that covers this ground equally well!

Binary Representations

All modern computers use binary representations for their data, and all use fixed size containers for data. There is no reason that computers could not be built that use other representations; "trinary" (base 3) logic devices aren't hard to build, and in fact, decimal computers were built in large numbers in the 1960's. There is no compelling benefit to using non-binary representations, though, and even the decimal machines of the 1960's were built using binary logic and used binary-coded-decimal (BCD) internally.

The bit is the fundamental unit of information. A bit has two values, conventionally called zero and one. We could as easily name the values on and off, high and low, or true and false; in fact, all of those names are used in different contexts, and the truth is that all such names are arbitrary! The representation of a bit inside the computer is also arbitrary. The presence or absence of holes in a punched card or strip of paper tape was once a common representation of bits, while inside the computers, the two values of a bit are usually represented using different voltages. The specific voltages used do not generally matter to the computer architect, but are left to the engineering level of the design.

Binary encoding of information dates back to the 19th century with the development of the punched card, but it only grew to maturity in the second half of the 20th century, with the dawn of the computer age. Over the years, bits have been grouped in many different ways, and these groupings have been given many names; words, bytes, characters and octets are but a few examples.

Groupings in SMAL

An assembly language is a language for assembling data and instructions into memory. The primitive operations in an assembly language place individual data items such as bytes, words or halfwords into memory. Additional, more complex operations fill memory with instructions, but initially, we will ignore these.

In most assembly languages, each line of text specifies one data item of some kind to be placed in memory. Consecutive lines of text specify the contents of consecutive memory locations. The assembly directive or instruction on each line specifies how to place data in memory, while the operands of the directive or instruction specify what data to place in memory.

Our assembly language is called SMAL, because it is a Symbolic Macro Assembly Language. To begin with, we will discuss the basic features of this language for assembling data into memory. We will discuss more advanced features of this language later. SMAL contains direct support for 8-bit bytes, 16-bit halfwords and 32-bit words, using the B, H and W directives. (There is also a T directive for the occasionally useful 24-bit triple-byte quantity.) Each of these assembly directives will accept an operand indicating the value of one byte, halfword or word. A single word of memory can be filled with either one W directive, two consecutive H directives, or four consecutive B directives. (Or, if you insist on obscurity, a T directive plus a B directive.)

Filling 3 words of memory using SMAL
        W       2#10101010110011001111000011111111
        H       2#1111000011111111
        H       2#1010101011001100
        B       2#11111111
        B       2#11110000
        B       2#11001100
        B       2#10101010

In this example, we have filled 3 words of memory so that each 32-bit word holds the identical binary value. The first line of code assembles a word as a single operation. The next 2 lines assemble the 2 halfwords of a word in 2 operations, while the final 4 lines assemble the 4 bytes of a word in 4 separate operations.

The values in this example have all been given in binary, as indicated by the leading 2# on each value. The SMAL assembler supports other data representations, as we will discuss later. Note that, when assembling code for the Hawk machine or for the Intel 8086/Pentium family, the least significant byte or halfword is always assembled first. Some other computers, notably the IBM Power architecture, put the most significant byte first when breaking a word down into multiple bytes. There is no particular advantage of one scheme over the other, so different computers take one approach or the other essentially arbitrarily. This variety is a great source of confusion for novice programmers.
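
If you are curious which order your own machine uses, a short C program can reveal it by inspecting the individual bytes of a word. This is a minimal sketch, assuming a 32-bit word and using the example value assembled above; nothing in it is part of SMAL or the Hawk tools:

        #include <stdio.h>
        #include <stdint.h>

        int main(void)
        {
                uint32_t word = 0xAACCF0FF;  /* the example word assembled above */
                unsigned char *byte = (unsigned char *) &word;
                int i;

                /* print the 4 bytes in increasing memory address order;
                   a little-endian machine such as the Hawk or a Pentium
                   prints FF F0 CC AA, matching the order of the four
                   B directives in the example */
                for (i = 0; i < 4; i++) printf( "%02X ", byte[i] );
                printf( "\n" );
                return 0;
        }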

Coding

How many values can be coded in one bit? Two! We'll call them 0 and 1 for now, but in other contexts, it makes more sense to call them true and false; the names we use are arbitrary. How many values can be coded by 2 bits? By 3 bits? By 4 bits? Here is a systematic listing of all possible combinations of 4 bits:

All the combinations of 4 bits
0 0 0 0     1 0 0 0        4 bits gives 2⁴ = 16 possibilities
0 0 0 1     1 0 0 1
0 0 1 0     1 0 1 0
0 0 1 1     1 0 1 1

0 1 0 0     1 1 0 0
0 1 0 1     1 1 0 1
0 1 1 0     1 1 1 0
0 1 1 1     1 1 1 1

The structure of the above table hides, within it, listings of all the combinations of 2 bits, in the rightmost 2 bits of each quadrant of the table. Similarly, all the combinations of 3 bits are given in the rightmost 3 bits of each column.

In general, to get n combinations, it takes log₂(n) bits (log₂ is the logarithm to the base 2), rounded up to the nearest integer, since it is hard to physically represent a fractional bit. Thus, if we have 3 possible values, we need 2 bits, and if we have 5 possible values, we need 3 bits.
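
This rule amounts to finding the smallest power of two that is at least n, which takes only a few lines of C; the function name here is invented for illustration:

        /* the number of bits needed to encode n distinct values;
           computes log2(n) rounded up, by searching for the
           smallest power of 2 that is at least n */
        int bits_needed( int n )
        {
                int bits = 0;
                while ((1 << bits) < n) bits = bits + 1;
                return bits;
        }

So bits_needed(3) is 2 and bits_needed(5) is 3, as in the examples above.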

How we associate values with these combinations of bits is our business! We could decide that 0000 means Tuesday and 0001 means five, if we wanted to, or we could be more systematic, associating 0000 with a, 0001 with b, and so on. Such rational systematic encodings of character sets on computers are rare, because there is almost always a need for compatibility with some older encoding standard.

Character sets frequently look very arbitrary! Consider the character set used by ILLIAC I, a computer built at the University of Illinois in the early 1950's. Despite dating back to the early years of the computer era, this character set inherits its odd order from even older technology. The table below was transcribed from a typewritten original in The ILLIAC Miniature Manual by John Halton (U. of Illinois Digital Computer Lab file no. 260, Oct 22 1958, page 3):

ILLIAC Character Codes
           | Characters |n for 92
Tape Holes | F/S    L/S | Orders
-----------------------------------
     o        0      P       2F
     o  O     1      Q      66F
     o O      2      W     130F
     o OO     3      E     194F
     oO       4      R     258F
     oO O     5      T     322F
     oOO      6      Y     386F
     oOOO     7      U     450F
    Oo        8      I     514F
    Oo  O     9      O     578F
    Oo O      +      K     642F
    Oo OO     -      S     706F
    OoO       N      N     770F
    OoO O     J      J     834F
    OoOO      F      F     898F
    OoOOO     L      L     962F
   O o      Delay  Delay     3F
   O o  O   $(Tab)   D      67F
   O o O    CR/LF  CR/LF   131F
   O o OO     (      B     195F
   O oO   L/S=Letter-Shift 259F
   O oO O     ,      V     323F
   O oOO      )      A     387F
   O oOOO     /      X     451F
   OOo      Delay  Delay   515F
   OOo  O     =      G     579F
   OOo O      .      M     643F
   OOo OO F/S=Figure-Shift 707F
   OOoO       '      H     771F
   OOoO O     :      C     835F
   OOoOO      x      Z     899F
   OOoOOO   Space  Space   963F

This character set is coded using 5 bits per character, with two special codes, L/S or letter shift and F/S or figure shift, used to shift between the two different and wildly disorganized sets of 32 characters! The code CR/LF or carriage return -- line feed was used to mark the end of each line of text.

A fragment of paper tape punched in the ILLIAC code.
On close inspection, it should be apparent that the layout of this character code is based in part on the QWERTY layout of the keys on a classical typewriter, but beyond that, the rationale for this code is a mystery! Note that the capital O in the "tape holes" column stands for a hole in the tape or the digit one in the binary code, while blank stands for the digit zero. The lower case o in the "tape holes" column shows where the sprocket hole in the tape goes; punched paper tape was pulled through the tape reader by a sprocket that engaged these holes. The sprocket hole is not part of the data coded on the tape, although users sometimes viewed it as serving a function similar to that of a decimal point.

Exercises:

a) Show the representation for the string "HELLO WORLD." as it would have appeared punched on paper tape for the ILLIAC I. Remember, the initial state of the teletype is unknown! You don't know if the previous text left it set in figure-shift or letter-shift state. (Use upper-case O for holes in the tape, and a lower-case o for the sprocket hole.)

b) What is encoded on the snippet of ILLIAC I tape given above?

The ISO-7 or ASCII character set was originally proposed in the early 1960's as a rational response to the limitations of the 6-bit character sets used on early computers. ASCII was proposed with the understanding that new computers would have 8-bit bytes, but it only used 7 of those bits and it had an upper-case only subset that needed only 6 bits:

The ASCII character codes
                               left 3 bits
Right
4 bits     000   001   010   011   100   101   110   111

0000       NUL   DLE   SP    0     @     P     `     p
0001       SOH   DC1   !     1     A     Q     a     q
0010       STX   DC2   "     2     B     R     b     r
0011       ETX   DC3   #     3     C     S     c     s
0100       EOT   DC4   $     4     D     T     d     t
0101       ENQ   NAK   %     5     E     U     e     u
0110       ACK   SYN   &     6     F     V     f     v
0111       BEL   ETB   '     7     G     W     g     w
1000       BS    CAN   (     8     H     X     h     x
1001       HT    EM    )     9     I     Y     i     y
1010       LF    SUB   *     :     J     Z     j     z
1011       VT    ESC   +     ;     K     [     k     {
1100       FF    FS    ,     <     L     \     l     |
1101       CR    GS    -     =     M     ]     m     }
1110       SO    RS    .     >     N     ^     n     ~
1111       SI    US    /     ?     O     _     o     DEL

Spaces are encoded with the SP code, 010 0000. The function of the other named codes will be given later. Using this, we can encode the string "Hello World!" in ASCII as follows:

Encoding a string in ASCII
100 1000   H        101 0111   W
110 0101   e        110 1111   o
110 1100   l        111 0010   r
110 1100   l        110 1100   l
110 1111   o        110 0100   d
010 0000   SP       010 0001   !

The ASCII character set, now officially known as the ISO-7 Latin 1 character set, is the basis of all of the widely used character sets in common use today and forms the base from which the Unicode character set was developed. From the very start, ASCII text has been encoded in 8 bit bytes by adding an extra bit on the left, so the letter 'a' is coded as either 0110 0001 or 1110 0001 instead of being coded as 110 0001. Of course, some vendors opted for setting the extra bit to 0 while others set it to 1. Fortunately this variation ended in the 1980s. All modern 8-bit character sets such as the UTF-8 encoding of Unicode set this bit to zero to indicate the classic ASCII character set and set it to one to indicate one or another extended character set.

Pure Unicode uses 21 bits to encode each character, although the characters from the common alphabets of the world can be coded using the 16-bit subset of Unicode. When Unicode is represented in the UTF-8 encoding, the binary codes 0000 0000 to 0111 1111 are used for the 128 characters of the classic ASCII character set. Each of the Unicode codes from 128 to 2047 can be coded in two bytes in UTF-8. The two-byte UTF-8 codes include complete support for the Latin, Cyrillic, Greek, Arabic and Hebrew alphabets, including a wide range of accent marks. For codes from 2048 to 65535, the UTF-8 encoding uses three bytes per character. These three-byte codes suffice for most purposes, but the UTF-8 scheme can be continued to encode up to 36 bits of data as a string of seven 8-bit bytes. The Unicode standard never needs more than 4 bytes to encode a 21-bit value in UTF-8.

Unicode UTF-8 encoding
Range of Values     Encoding                           Number of Bytes
      0 - 127       0xxx xxxx                          1 (classic ASCII)
   128 - 2047       110x xxxx 10xx xxxx                2
  2048 - 65535      1110 xxxx 10xx xxxx 10xx xxxx      3

In the above UTF-8 coding table, each bit of the pure Unicode character is represented by the letter x, while bits added in the UTF-8 encoding are given explicitly as ones and zeros. The order of the Unicode bits remains unchanged when the consecutive bytes of the UTF-8 character are written out from left to right, as in the table.
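
The three rows of the table reduce to a few shifts and masks in C. The following encoder is a sketch covering only the 1- to 3-byte cases shown, for codes up to 65535; the function name and the buffer convention are invented for illustration:

        /* encode one Unicode value (up to 65535) as UTF-8, placing
           1 to 3 bytes in buf; returns the number of bytes used */
        int utf8_encode( unsigned int code, unsigned char buf[] )
        {
                if (code < 128) {                /* 0xxx xxxx */
                        buf[0] = code;
                        return 1;
                } else if (code < 2048) {        /* 110x xxxx 10xx xxxx */
                        buf[0] = 0xC0 | (code >> 6);
                        buf[1] = 0x80 | (code & 0x3F);
                        return 2;
                } else {                         /* 1110 xxxx 10xx xxxx 10xx xxxx */
                        buf[0] = 0xE0 | (code >> 12);
                        buf[1] = 0x80 | ((code >> 6) & 0x3F);
                        buf[2] = 0x80 | (code & 0x3F);
                        return 3;
                }
        }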

It should be noted that not all possible 21-bit values are legal Unicode characters. The full complexity of Unicode is such that we cannot reasonably cover the details here, but it is worth noting that UTF-16 allows all valid Unicode characters to be represented in 1 or 2 consecutive 16-bit words, with most common alphabetic characters represented in just 1 word.

One other character set from the 1960s is still with us: EBCDIC, an 8-bit character set originally devised by IBM and still used on the massive IBM mainframes that serve many large corporations. EBCDIC, the Extended Binary Coded Decimal Interchange Code, is an 8-bit extension of one of the old 6-bit BCD character codes used on IBM's 36-bit family of computers.

Whether ancient or modern, all practical character codes contain a number of non-printing codes. In ASCII, these have standard 2 and 3 letter names. Except for space, SP, these are known as control characters. Most of the ASCII control characters are rarely used. In several cases, they were included in the character set to support applications that are now obsolete; in some cases, they were included to support applications that were never really developed. The C programming language and its descendants use special representations for the most commonly used control characters; for example, they use \t for HT and \f for FF.

The ASCII Control Characters
NUL       null, intended to be ignored, commonly abused as end-of-string
SOH start of heading
STX start of text
ETX end of text
EOT end of transmission
ENQ enquire (about the status of a remote terminal?)
ACK acknowledge (an answer to ENQ?)
BEL \a bell, used to ring the bell in the remote terminal
BS \b backspace, commonly abused as erase
HT \t horizontal tab, still used as originally intended
VT \v vertical tab, commonly used as a reverse linefeed
LF \n linefeed, frequently used as end-of-line
FF \f formfeed, on printers, frequently causes page eject
CR \r carriage return, frequently used to mean enter
SO shift out (change ribbon color?)
SI shift in (undo SO)
DLE data link escape
DC1 device control 1 (for devices on a remote terminal)
DC2 device control 2
DC3 device control 3
DC4 device control 4
NAK negative acknowledge (an answer to ENQ?)
SYN synchronize (for synchronous data links)
ETB end transmission block
CAN cancel
EM end of medium
SUB substitute
ESC escape (a prefix altering the use of what follows)
FS file separator
GS group separator
RS record separator
US unit separator
SP space
DEL delete, intended to be ignored

Exercises:

c) Show the representation for the text "The 4th of July." (excluding quotes) in 7-bit ASCII, as a sequence of binary numbers.

d) What range of Unicode values is encoded in 4 consecutive bytes in the UTF-8 encoding?

e) If the DEC PDP-8 made a comeback, we might want a Unicode UTF-12 encoding for that machine's 12 bit word. Following the coding scheme of UTF-8, what range of Unicode values would you encode in one 12-bit word? In two 12-bit words? In three 12-bit words?

Character Codes in SMAL

Our assembly language, SMAL, contains two features that directly support the ASCII character set. The first is its support for characters as operands for assembly into memory as bytes, halfwords or words (although the most common use is as bytes). The second feature is the ASCII directive for storing the consecutive bytes of a string in memory. As a result, we could store the text Hello in three different ways, as illustrated below:

An ASCII string using SMAL
        B       2#01001000
        B       2#01100101
        B       2#01101100
        B       2#01101100
        B       2#01101111

        B       "H"
        B       "e"
        B       "l"
        B       "l"
        B       "o"

        ASCII   "Hello"

This example shows the same 5-character string being encoded in three different ways, producing a total of 15 consecutive bytes of data. We could have encoded this with a single line reading ASCII "HelloHelloHello". There is no difference at all between the data loaded in memory by these three different methods. Once it is loaded in memory, all data becomes nothing but patterns of bits with no inherent meaning. Nothing in the result records the manner in which the bytes were placed in memory.

SMAL contains no direct support for control characters other than SP, space, although as we will see later, it is possible to define identifiers to have any desired value. We can therefore define each of the control characters by name, for example, giving NUL the meaning 0000 0000 and LF the meaning 0000 1010.

Binary Numbers

It is possible to associate arbitrary bit patterns with numeric values. For example, nothing prevents us from deciding to represent zero as 1101 and one as 1000. Using such a mapping from binary values to the integers makes arithmetic extremely difficult -- in effect, this arbitrary mapping forces us to use table-lookup schemes for all arithmetic operations.

Table lookup has been used as the basis of all arithmetic on at least one computer, the IBM 1620, codenamed the Cadet during development in the late 1950's. Later, someone decided that the name should be interpreted as the acronym "Can't Add, Doesn't Even Try," because the Cadet used table-lookup to compute the results of arithmetic operations. Some programmers even loaded new tables so they could do arithmetic in bases other than 10, the base for which the machine was designed.

With few exceptions, all computers today use variations on the binary number system for integer arithmetic. The basic binary number system we use today dates back to ancient China, where it was used to order the 64 combinations of I-Ching sticks (each stick could be in either of two states; the pattern that resulted from a throw of the sticks was used for fortune telling):

All possible 6-bit binary numbers
 0 = 000000     16 = 010000     32 = 100000     48 = 110000
 1 = 000001     17 = 010001     33 = 100001     49 = 110001
 2 = 000010     18 = 010010     34 = 100010     50 = 110010
 3 = 000011     19 = 010011     35 = 100011     51 = 110011
 4 = 000100     20 = 010100     36 = 100100     52 = 110100
 5 = 000101     21 = 010101     37 = 100101     53 = 110101
 6 = 000110     22 = 010110     38 = 100110     54 = 110110
 7 = 000111     23 = 010111     39 = 100111     55 = 110111
 8 = 001000     24 = 011000     40 = 101000     56 = 111000
 9 = 001001     25 = 011001     41 = 101001     57 = 111001
10 = 001010     26 = 011010     42 = 101010     58 = 111010
11 = 001011     27 = 011011     43 = 101011     59 = 111011
12 = 001100     28 = 011100     44 = 101100     60 = 111100
13 = 001101     29 = 011101     45 = 101101     61 = 111101
14 = 001110     30 = 011110     46 = 101110     62 = 111110
15 = 001111     31 = 011111     47 = 101111     63 = 111111

In looking formally at how the bit pattern for a number is related to its abstract value, we must distinguish clearly between representation and abstraction. Conventionally, when we say ten, we are using the name of an abstract value, without any reference to number systems. In English, we have such simple names for all values up to twelve. When we write "12", however, we are using a decimal representation of the abstract value denoted by the name twelve. Of course, twelve itself is also a representation, a string of italicized lower-case letters from the Roman alphabet that designates a sound. That sound is related to an abstract value that is a number only by the cultural conventions of the English speaking world.

In what follows, we will use a programming language notation, where unquoted numerals stand for abstract integer values, subject to arithmetic operations, while quoted strings stand for concrete representations of those values. Thus, the string "011100" is just a string of symbols, with no inherent meaning. It could be equivalent to eleven thousand one hundred if we decided to interpret it as a decimal number, or it could be the binary representation of the number twenty eight.

Consider, first, the formula that relates representations of decimal numbers to their values:

             i = digits(S)-1
value(S) =         Σ         v(S[i]) × 10ⁱ
                 i = 0

Where
S is a string of digits
value(S) is the numeric value assigned to that string
digits(S) is the number of digits in S
the rightmost digit in S is S[0]
the leftmost digit in S is S[digits(S)-1]
v(d) is the numeric value associated with digit d

Users of the decimal system usually speak of the one's place, the ten's place, the hundred's place and so on. To illustrate the use of this formula, consider the following:

value("1024") = v("1") × 103  +  v("0") × 102  +  v("2") × 101  +  v("4") × 100

In thinking about the above example, it is important to distinguish between "1", the numeral, which is a character, v("1"), the application of the numeric value function v to this numeral, and one, the actual value computed by the application of this function. As humans, we confuse these all the time, rarely distinguishing between the numeral, a symbol, and the abstract value it represents. Inside a computer, this distinction is crucial.

The formula given above can be directly reduced to code in a high level programming language! For example, the following C code will convert from decimal numbers, passed as a string, to the corresponding binary value, encoded in a variable of type int:

        #include <string.h>     /* for strlen */
        #include <math.h>       /* for pow */

        int v(char d);          /* the numeric value of digit d; defined below */

        int value( char S[] )
        {
                int i;
                int a = 0;
                for (i = 0; i < strlen(S); i++) {
                        a = a + v(S[i]) * (int)pow( 10, i );
                }
                return a;
        }

This function converts decimal to binary because it assumes that the representation of all values of type int is binary; that is, it assumes that the computer performing this computation is a binary computer! This function is awful, for a number of reasons: First, it uses the floating-point pow function to compute 10ⁱ on each iteration, when it could simply have taken the multiplier from the previous iteration and multiplied it by 10. Second, it uses the table-based function v (defined in a later section) to find the integer value of each digit in the string, when there are much faster ways of doing this. Third, it computes the length of the string using strlen(S) again and again with each iteration of the loop, when it could have done this just once, up front. Finally, and most importantly from the point of view of human users, the digits in the string are interpreted backwards, so the least significant digit, or the one's place, is the first character in the string; conventional renderings in the Roman character set with left-to-right printing would have us put the most significant digit first.

Exercises:

f) Convert the above function to the language of your choice and write a small program to try it out on some real strings of digits, giving as output the string it converted and the value of that string.

g) Fix the code you just wrote so it avoids re-evaluating the string-length with each iteration.

h) Fix the code you just wrote so it avoids using an exponential function, but instead starts with a multiplier of 1 and multiplies this by 10 on each iteration.

i) Fix the code you just wrote so it evaluates the digits of the number in their conventional order.

Powers of Two

The formula given above applies equally well for computing the abstract value of a binary number, with the simple restriction that each binary digit, or bit, may take on only two values, 0 and 1, and that the base of the number system is 2 instead of 10:

             i = digits(S)-1
value(S) =         Σ         v(S[i]) × 2ⁱ        (binary)
                 i = 0

If we evaluate this formula using decimal arithmetic, as we would typically do with pencil and paper, it is an excellent formula for converting from binary to decimal! If we evaluate it using binary arithmetic on a binary computer, it is a useful way to convert from the textual representation of a binary number to the representation of that number in a single integer variable.

Just as we speak of the one's place, the ten's place and the hundred's place in decimal numbers, we speak of the one's place, the two's place, the four's place and the eight's place with binary numbers. The list of powers of 2 generally becomes very familiar to users of binary numbers:

The first 16 powers of two
 i      2ⁱ            i      2ⁱ
 0       1            8      256
 1       2            9      512
 2       4           10    1,024
 3       8           11    2,048
 4      16           12    4,096
 5      32           13    8,192
 6      64           14   16,384
 7     128           15   32,768
 8     256           16   65,536

 
Powers of two close to powers of ten
 i     2ⁱ
10     1,024              ≈ 1,000              = 1K (kilo)
20     1,048,576          ≈ 1,000,000          = 1M (mega)
30     1,073,741,824      ≈ 1,000,000,000      = 1G (giga)
40     1,099,511,627,776  ≈ 1,000,000,000,000  = 1T (tera)

Note that each power of two is twice the previous power of two, and note the convenient coincidence that 2¹⁰ = 1024 is within a few percent of 10³ = 1000. This allows quick estimates of the magnitude of binary numbers -- if you have a binary number with 10 bits, it will be close to a 3-digit decimal number, while if it has 20 bits, it will be close to a 6-digit decimal number.

Exercises:

j) What is 2²⁴? Several machines were once built with 24-bit words.

k) What is 2³², in decimal, exactly? This is relevant on machines with a 32-bit word!

l) The maximum number that can be represented in 6 bits in binary is 63, while 2⁶ is 64. The maximum number that can be represented in 2 digits in decimal is 99, while 10² is 100. What is the maximum number that can be represented in n bits, as a function of n?

Binary/Decimal Conversion with Pen and Paper

To convert binary to decimal using pencil and paper, just write the place values over each bit position, in decimal, then sum the products of the place values times the digits under them. For binary numbers the arithmetic is simple because these products are either 0 or the place value itself. For example, to find the decimal equivalent of 10101011₂ we proceed as follows:

128  64  32  16   8   4   2   1
  1   0   1   0   1   0   1   1

10101011₂ = 128 + 32 + 8 + 2 + 1 = 171

To convert a decimal number to binary using pencil and paper, divide the number by two, writing the integer part of the quotient below the number and writing the remainder off to the side. Repeat this until you get a quotient of 0, and the column of remainders you are left with will be the binary number, reading it up from bottom to top. So, to convert 171 to binary, we proceed as follows:

171
 85   1
 42   1
 21   0
 10   1
  5   0
  2   1
  1   0
  0   1      10101011₂

The above two examples confirm each other, since the final result of converting to decimal and then back to binary is the same as the original value.

Exercises:

m) Flip a coin 8 times (or use any other random process) in order to make an 8-bit binary number. Convert it to decimal. Convert the result back to binary to check your work!

n) Write a little program that takes two decimal integers as input n and b, and prints out n in base b. The program should work for values of b from 2 to 10. In C, C++ and Java, the / and % operators return the quotient and remainder, respectively, for positive operands.

Integer Values of Individual Digits

How do we relate the characters used to represent digits to their abstract numeric values? When working with pencil and paper, we did this using rules learned in elementary school, and it may have seemed so natural that it passed unnoticed. When working with computers, we need to be much more explicit. The most general solution to this is an array indexed by characters, where the array entries corresponding to the digits in the character set hold the corresponding integer values. Many programming languages allow the use of character values for array indexing, although many programmers get nervous when they see expressions such as A['b'], which means the element of the array A selected by the character 'b'. The following conversion function based on a table works for the 7-bit ASCII character set in C or C++ (add 128 more entries at the end, all 0, to make it work for 8-bit ASCII):

        int v(char d) /* a function v of a character returning an integer */
        {
                static int values[128] = { 0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0,
                                           0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0,
                                           0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0,
                                           0,1,2,3, 4,5,6,7, 8,9,0,0, 0,0,0,0,
                                           0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0,
                                           0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0,
                                           0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0,
                                           0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0 };
                return values[d];
        }

This is computationally fast, but there is no need for array indexing here! The reason is, almost all character codes used on computers have assigned consecutive values to the symbols for the 10 decimal digits. This was true even with the odd character set of ILLIAC I! As a result, the following works:

        int v(char d) /* a function v of a character returning an integer */
        {
                return d - '0';
        }

The above code is legal C and C++, and similar code is legal in Java and C#. In languages descended from C, characters and numbers are members of the same abstract type, the integers. In some languages, strong type checking prevents integers and characters from being mixed, so explicit conversions are needed to get the integer representation of a character or the character represented by an integer. For example, consider this bit of Pascal code, exactly equivalent to the C code given above:

        function v(d: char): integer;
        begin
                v := ord(d) - ord('0');
        end;

In Pascal, the built-in function ord() asks for the integer used to represent a character. It is not really a function, but just an instruction to the compiler to take the binary code used to represent a character and treat it as an integer. This Pascal is identical in effect to the C code; it is likely that a Pascal compiler would produce exactly the same machine code for this function as a C compiler.

Other Number Bases

Any integer may be used as the radix of a number system. The formula used above to evaluate numbers in bases 2 and 10 also works for r, an arbitrary radix:
             i = digits(S)-1
value(S) =         Σ         v(S[i]) × rⁱ        (radix r)
                 i = 0

Note that, for any radix r, the string "10" in that radix has the value r. Thus, in binary, the string 10 means two; in decimal, the string 10 means ten; and in octal, the string 10 means eight. To avoid confusion, the base of a number in printed work is sometimes given as a subscript, itself written in base ten, as in:

10110₂ = 26₈ = 22₁₀ = 211₃

As long as the radix r is ten or less, the conventional digits 0 to 9 can be used. For r greater than ten, additional digit symbols are needed. Historically, a variety of symbols have been used! For example, in base 12, also called duodecimal, the digits T for ten and E for eleven have been used. On ILLIAC I, base 16 was called sexadecimal, and they used the following digits:

0123456789KSNJFL

The most common convention for bases greater than ten is to use A for ten, B for eleven, C for twelve, and so on, up to Z. This allows bases as high as 36, although base 16, hexadecimal, is the highest in common use. Thus, the hexadecimal digits are:

0123456789ABCDEF

It is common to use bases 8 and 16 when working with binary numbers. This is the result of grouping binary digits in groups of three or four. We routinely group digits of long numbers in decimal as well. Usually, we group them into threes, so we write 1,234,567 instead of 1234567. In effect, this grouping of decimal digits shifts us to base 1000 with three character names for each digit and commas between the digits. So 1,234,567 has 1 in the million's place, 234 in the thousand's place, and 567 in the one's place.

Consider the representation of one thousand in binary. In binary, this is a nice long number:

1000 = 512 + 256 + 128 + 64 + 32 + 8

1000₁₀ = 1111101000₂

We can group the digits of this number in order to improve readability. If we group the bits in groups of 3, we can treat each triplet of bits as an octal (base 8) digit, and if we group the digits in groups of 4, we can treat each group of bits as a hexadecimal digit:

1000₁₀ = 1,111,101,000₂ = 1750₈

1000₁₀ = 11,1110,1000₂ = 3E8₁₆

For integers, regardless of the number base, we always group digits from the least significant end of the number, leaving the odd leftovers for the most significant bits.
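
As an aside, C's printf can produce the octal and hexadecimal forms directly (standard C gained a binary conversion only with C23, so it is omitted here); a small sketch using the value from the example above:

        #include <stdio.h>

        int main(void)
        {
                unsigned int n = 1000;

                /* prints: 1000 = 1750 octal = 3e8 hexadecimal */
                printf( "%u = %o octal = %x hexadecimal\n", n, n, n );
                return 0;
        }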

The pencil and paper conversion algorithms given for binary actually work for any place-value number system. To convert to decimal from base r, write successive powers of r over successive digits, starting with 1 over the least significant digit, then multiply each digit by the power of r above it and sum the products. To convert a decimal number to base r, divide repeatedly by r, setting aside the remainders, until you get zero, and then write the remainders in order, last to first, as digits of the number.

Some programmers get so used to using the octal or hexadecimal number systems as abbreviations for binary numbers that they will refer to numbers in those bases as binary numbers. This kind of carelessness is fairly natural and common, but technically, it is wrong.

Exercises:

o) Given the numbers 1b, 10b, 100b and 1000b, where b is the unknown number base, what are the decimal equivalents for these numbers, assuming that the base is 2? Assuming that it is 8? Assuming that it is 10?

p) The letters A through F are sufficient to spell many words, and we can augment them with misreadings of the digits 0 and 1 as the capital letters O and I to spell even more words. So, what decimal values, when converted to hexadecimal, spell the words BEEF, FEED, FOOD, and CODE? What are the corresponding binary values? What are the corresponding octal values?

q) Flip a coin 8 times, in sequence, and write down 1 for each time it comes up heads, and 0 for each time it comes up tails in order to make an 8-bit binary number. Convert that to decimal. Convert the result to octal, and then replace each digit by its 3-bit binary representation. You should get your original number back.

r) Flip a coin 8 times, in sequence, and write down 1 for each time it comes up heads, and 0 for each time it comes up tails in order to make an 8-bit binary number. Convert that to decimal. Convert the result to hexadecimal, and then replace each digit by its 4-bit binary representation. You should get your original number back.

Other Number Bases in SMAL

As we have already seen, our assembly language, SMAL, supports binary representations of data to be assembled into memory, along with the use of the ASCII code for text. In fact, SMAL offers support for all of the common number bases and many uncommon number bases. For example, the letter A, which is represented as 1000001₂, can be represented in SMAL in the following different forms, among many others:

SMAL encoding in different number bases
        B       "A"          ; ASCII
        B       2#1000001    ; binary
        B       3#2102       ; base 3
        B       4#1001       ; base 4
        B       8#101        ; base 8, octal
        B       10#65        ; base 10, decimal
        B       65           ; also decimal
        B       16#41        ; base 16, hexadecimal
        B       #41          ; also hexadecimal

In the above, note that the ASCII, decimal and hexadecimal representations have special privileges. For all other systems of representation, the number base must be given explicitly (in decimal), followed by a number sign, followed by the number in that base. Decimal numbers are given directly, while a number sign without an explicit base prefix indicates hexadecimal.

For those who are curious, the SMAL assembler contains a general conversion routine from any number base to binary, and when it encounters a number in any other number base, it applies this routine to that number in order to convert it to binary. This conversion routine will operate in any number base up to 36. The limit of base 36 is set by the fact that there are 26 letters in the alphabet plus 10 decimal digits available for use as digits in higher radix numbers.

SI Multipliers

It is common, in the modern world, to hear that some nuclear weapon has an explosive yield equivalent to so many megatons of TNT, or that a disk has a capacity of so many gigabytes, or that a computer has so many megabytes of main memory. Units of measurement like the kilometer and the millimeter are also widely used. It is important to remember that the multipliers used for such units, multipliers like kilo- and milli-, have precise meanings that are defined by international standards as part of the SI system of measurement, otherwise known as the metric system:

The common SI multipliers
1T = 1 tera  = 1,000,000,000,000
1G = 1 giga  = 1,000,000,000
1M = 1 mega  = 1,000,000
1K = 1 kilo  = 1,000
1  = 1       = 1
1m = 1 milli = .001
1µ = 1 micro = .000,001
1n = 1 nano  = .000,000,001
1p = 1 pico  = .000,000,000,001

In the world of computers, we have on-line disk subsystems with capacities measured in terabytes, or Tbytes, and we have CPU clock speeds measured in gigahertz or GHz. Small microcontrollers have memory capacities measured in kilobytes or Kbytes, and data communications lines may have lengths measured in kilometers or Km. We also have memory speeds measured in nanoseconds or ns, chip areas measured in square millimeters or mm², and we have disk access times measured in milliseconds, ms. Note that capitalization is used to indicate whether the multiplier is greater than or less than one, and note the use of the Greek letter mu, µ, for example, in the abbreviation of microseconds as µseconds. The speed of light is remarkably close to one foot per nanosecond.

When someone tells you that their memory has a capacity of 100Kb, unfortunately, you are in trouble. Chip designers will tend to use this to refer to kilobits or Kbits, while computer system designers will expect it to stand for kilobytes or Kbytes. As a rule, spell it out if there is any doubt!

In the world of computer science, it is common to use one K or one kilo to refer to a multiplier of 1024, but in fact, this usage is only approximate. In the world of disks, it is common to use strange mixed approximations, where mega means 1000×1024. To distinguish multipliers that are powers of 2 from multipliers that are powers of 10, it has been suggested that we should use the subscript 2 to indicate the common computer science approximation, while reserving the unsubscripted form for the exact SI prefix.

1 kilo₂ = 1,024
1 mega₂ = 1,048,576

Of course, this leaves us with no convenient notation for the disk salesman's definition of a megabyte as 1000 times 1024. Fortunately, this absurd usage is in decline.

Exercises:

s) How long is a microfortnight, in seconds?

t) How heavy is a kilodram (weight), in pounds? How heavy is a megadram in tons? (Be careful to use the definition of a dram as a unit of weight, because it is also defined as a unit of volume.)

Finite Precision

On computers, we generally are forced to work in terms of a fixed word size, so our numbers are composed of a fixed number of bits. Leading zeros are used to pad a number out to the desired width. Thus, on a machine with a 16 bit word,

100₁₀ = 0000,0000,0110,0100₂ = 0064₁₆

On this machine, the maximum representable number is

65535₁₀ = 1111,1111,1111,1111₂ = FFFF₁₆

When deciding between octal and hexadecimal for compact representation of numbers, it is common to select one that divides words and bytes evenly, so for machines based on an 8 bit byte, hexadecimal seems a good choice, while for machines with a 6 bit byte, octal makes sense.

There are exceptions to this! The Motorola 68000 and the DEC PDP-11 were essentially 16-bit machines with 8-bit bytes, but each instruction had four 3-bit fields. The Intel 8080 was an 8-bit machine with two 3-bit fields per instruction. It was convenient on these machines to print the contents of memory in octal in order to expose these 3-bit fields as octal digits. Unix and the C programming language grew up on the PDP-11, so to this day, Unix and C tend to default to octal.

Arithmetic

To add two decimal numbers, for example, to add 132 to 39, we arrange the numbers as follows:

    1 3 2   augend
  +   3 9   addend

And then do the addition one pair of digits at a time, starting with 2+9. To add one pair of digits, we use an addition table that most of us should have memorized in elementary school. This same table would have been loaded in the memory of the IBM 1620 Cadet, which did arithmetic using a table-lookup instead of an adder built from digital logic:

The decimal addition table
 +    0   1   2   3   4   5   6   7   8   9

 0    0   1   2   3   4   5   6   7   8   9
 1    1   2   3   4   5   6   7   8   9  10
 2    2   3   4   5   6   7   8   9  10  11
 3    3   4   5   6   7   8   9  10  11  12
 4    4   5   6   7   8   9  10  11  12  13
 5    5   6   7   8   9  10  11  12  13  14
 6    6   7   8   9  10  11  12  13  14  15
 7    7   8   9  10  11  12  13  14  15  16
 8    8   9  10  11  12  13  14  15  16  17
 9    9  10  11  12  13  14  15  16  17  18
     10  11  12  13  14  15  16  17  18  19

For the example, looking up 2+9 in the table gives a sum of 11; we split this sum into two pieces, the sum digit and the carry digit, writing the result as follows:

      1     carry
    1 3 2   augend
  +   3 9   addend

        1   sum

There is a problem adding the numbers in the next column: Our addition table only allows adding two numbers, but we have three numbers in this column. One way to do this is to use the same addition table, but to step down one row in cases where we have to add in a carry. This is why an extra row was given in the addition table. Doing this in the 10's place of our example gives us this:

      1     carry
    1 3 2   augend
  +   3 9   addend

      7 1   sum

Finally, we add up the column in the hundred's place. For this example, this is trivial:

      1     carry
    1 3 2   augend
  +   3 9   addend

    1 7 1   sum

Binary Addition

Recall that the value of a simple binary number, represented as a string S of digit symbols, is defined as:

             i = digits(S)-1
value(S) =         Σ         v(S[i]) × 2ⁱ
                 i = 0

This is the same place-value system we use for the decimal number system, except that instead of the 1's, 10's and 100's places, we have the 1's, 2's and 4's places. It should not be surprising, then, that binary arithmetic strongly resembles decimal arithmetic. Binary addition can be done using exactly the approach described above, but substituting the following much smaller table for the addition table:

The binary addition table
 +    0    1

 0    0    1
 1    1   10
     10   11

If this seems too simple, that is because it really is simple! The rules are the same as for decimal arithmetic because the place-value system is the same system. The only change is a reduction in the number of values each digit can take on. This means that remembering things like the addition and multiplication tables is far easier in binary than in decimal. We pay a penalty, of course, because there are more binary digits in a particular number than there are decimal digits. It takes 4 binary digits to represent a quantity that would be comfortably represented in 1 decimal digit, and it takes ten bits to represent a 3 digit decimal number. Converting decimal to binary, adding the binary numbers using the above rules, and converting the binary sum back to decimal gives:

 

   decimal                 binary

       1    carry                      1
     1 3 2  augend         1 0 0 0 0 1 0 0
   +   3 9  addend       +     1 0 0 1 1 1

     1 7 1  sum            1 0 1 0 1 0 1 1

 

If there is a limit on the number of bits allowed, as on a computer with a fixed word size, we say that an overflow has occurred if the carry out of the topmost bit of our fixed size word is nonzero. Note that this applies to words used to hold unsigned data, and that the conventions for detecting overflow are more complex if we are also interested in representing negative numbers.
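
In C, unsigned arithmetic is defined to wrap around, so this carry out of the topmost bit can be detected by checking whether the sum came out smaller than one of the operands; a minimal sketch, assuming 32-bit values, with an invented function name:

        #include <stdint.h>

        /* report whether the unsigned sum a + b overflows 32 bits,
           that is, whether there is a carry out of the top bit */
        int unsigned_add_overflows( uint32_t a, uint32_t b )
        {
                uint32_t sum = a + b;    /* computed modulo 2 to the 32nd */
                return sum < a;          /* true exactly when a carry out occurred */
        }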

Exercises:

u) Flip a coin 16 times to produce 2 8-bit binary numbers. Add them in binary, and then convert the original numbers to decimal, add them, and convert the result back to binary in order to check your sum.

Negative Numbers and Subtraction

Subtraction in binary is as easy as subtraction in decimal. Borrowing follows the same rules in binary and decimal, but binary subtraction involves more digits and far simpler details at each digit. Subtraction, however, introduces the problem of negative numbers, and these are not usually represented in computers in the same way that we represent them on paper!

Consider the problem of representing positive and negative numbers using 4 bits for each number. If we follow our intuition from the world of pencil and paper, we can simply reserve one bit to represent the sign while the remaining bits are used to hold the binary representation of the magnitude. This is called the signed magnitude number system. In this system, we can think of the sign bit as taking on the values + and -, while the 3 magnitude bits take on the values 0 and 1 and follow a place-value system of interpretation. This allows the representation of the following 16 values (a short C sketch of the decoding rule follows the table):

All 4-bit signed magnitude numbers
0 0 0 0   +0        1 0 0 0   -0
0 0 0 1   +1        1 0 0 1   -1
0 0 1 0   +2        1 0 1 0   -2
0 0 1 1   +3        1 0 1 1   -3
0 1 0 0   +4        1 1 0 0   -4
0 1 0 1   +5        1 1 0 1   -5
0 1 1 0   +6        1 1 1 0   -6
0 1 1 1   +7        1 1 1 1   -7
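
The decoding rule promised above takes only a few lines of C; this sketch assumes the 4-bit pattern is held in the low bits of an int, and the function name is invented for illustration:

        /* the abstract value of a 4-bit signed magnitude pattern */
        int signed_magnitude_value( int bits )
        {
                int magnitude = bits & 7;           /* the low 3 bits */
                if (bits & 8) return -magnitude;    /* bit 3 is the sign */
                return magnitude;
        }

Note that both 0000 and 1000 decode to zero; this duplicate zero is the annoyance discussed below.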

The signed-magnitude system works, and since representations are ultimately arbitrary, it might be as good as any alternative. Unfortunately, it is harder to build hardware to add numbers in this system than in alternative systems, so it is not commonly used for integer arithmetic today. This was not obvious to computer developers in the 1940s and 1950s, so many early computers used signed magnitude binary numbers. The only widespread use of signed magnitude today is in floating point numbers.

Signed magnitude numbers pose one annoyance because positive zero and negative zero have different representations. It would be nicer if two numbers were always equal if their representations were equal.

To negate a signed magnitude number, all you need to do is flip the sign bit from a one to a zero or from a zero to a one. On the other hand, to add two signed magnitude numbers, you must either add or subtract the magnitudes, depending on whether the signs are equal or unequal, and then set the sign of the result according to a rather complex set of rules.

An alternative way to make negative numbers is to arrange things so that the negative of a number is obtained by reversing all the bits of the corresponding positive number. This leads to the one's complement number system, illustrated with the following table of 4-bit signed numbers:

All 4-bit one's complement numbers
0 0 0 0   +0        1 0 0 0   -7
0 0 0 1   +1        1 0 0 1   -6
0 0 1 0   +2        1 0 1 0   -5
0 0 1 1   +3        1 0 1 1   -4
0 1 0 0   +4        1 1 0 0   -3
0 1 0 1   +5        1 1 0 1   -2
0 1 1 0   +6        1 1 1 0   -1
0 1 1 1   +7        1 1 1 1   -0

The operation of negating a one's complement number is called taking the one's complement of a number or one's complementing the number. So, the one's complement of 0000 is 1111 and the one's complement of 1111 is 0000. Note that negating a negative number produces a positive number, and vice versa. Also note that the one's complement operator is well defined even when other number systems are being used. The operator is defined in terms of its effect on the binary representation, not in terms of the interpretation of that representation.

Warning: Many students get confused by the use of the term one's complement to mean both the name of the number system and the name of the operator used to negate numbers in that system! If someone says: "Take the one's complement of x," or "one's complement x" that means to apply the operator, reversing each bit in x; the number of bits in the result will be the same as the number of bits in x. If someone says: "write x as a 4-bit one's complement number," that means to use the number system!

As was noted above, adding two signed magnitude numbers is fairly difficult. In contrast, the rules for adding two one's complement numbers are simple, but far from obvious: Treat the most significant bit exactly like all the other bits, and compute the unsigned binary sum of the two numbers. Then, if this addition produces a carry out of the topmost bit, add one to the result! This is called an end around carry.

This rule for addition is so simple that, when building computer hardware to do one's complement arithmetic, subtraction is always done by adding the one's complement of the subtrahend number to the minuend. One's complement has been used as the primary representation for integers on a number of computers, but almost all of these were designed prior to 1965.
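
The end around carry rule is easy to express in C; here is a minimal sketch for 16-bit one's complement values held in the low 16 bits of an unsigned int (the function name is invented for illustration):

        /* add two 16-bit one's complement numbers, folding the
           carry out of the top bit back into the sum */
        unsigned int ones_complement_add( unsigned int a, unsigned int b )
        {
                unsigned int sum = a + b;            /* at most 17 bits */
                sum = (sum & 0xFFFF) + (sum >> 16);  /* end around carry */
                return sum;  /* a second carry cannot occur, since
                                a + b is at most 0x1FFFE */
        }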

Note that one's complement numbers and signed magnitude numbers share the problem of having two representations of zero, +0 and -0. This is eliminated when two's complement numbers are used. In the two's complement system, numbers are negated by taking the one's complement and then adding one. While this idea is far from obvious, the result is the dominant number system used on today's computers.

All 4-bit two's complement numbers
0 0 0 0    0        1 0 0 0   -8
0 0 0 1   +1        1 0 0 1   -7
0 0 1 0   +2        1 0 1 0   -6
0 0 1 1   +3        1 0 1 1   -5
0 1 0 0   +4        1 1 0 0   -4
0 1 0 1   +5        1 1 0 1   -3
0 1 1 0   +6        1 1 1 0   -2
0 1 1 1   +7        1 1 1 1   -1

The two's complement operator is defined as take the one's complement and then add one. It is important to note that applying this operator twice to an n-bit number always produces the original value. So, for example, the two's complement of 0101 is 1011 and the two's complement of 1011 is 0101.

The two's complement number system has an annoying feature: It cannot represent the negation of the most negative number. In 4 bits, for example, the value -8 is represented as 1000, but there is no 4-bit two's complement representation of +8. If you apply the two's complement operator to 1000 you get 1000!

Despite this, two's complement arithmetic is the dominant system used on all modern computers! Among the advantages of this system are that there are no problems with multiple representations of zero. Applying the two's complement operator to 0000 gives 0000 after we discard the carry out of the high bit. Another advantage is that odd numbers always have a one in their least significant bit, whether they are positive or negative. Similarly, all two's complement numbers that are evenly divisible by 4 end in 00, and all that are evenly divisible by 8 end in 000, no matter what the sign.

The fundamental reason for the popularity of two's complement numbers is that arithmetic is trivial. To add two two's complement numbers, just treat all bits as magnitude bits, even the most significant bit that you might think of as a sign bit.

To subtract two two's complement numbers, take the one's complement of the number to be subtracted, then add the two numbers plus one. Using C notation (which is also C++ and Java notation), where - is the two's complement operator and ~ is the one's complement operator, the following expressions are all equivalent:

A - B
A + -B
A + (~B + 1)
(A + ~B) + 1

Note that with two's complement subtraction, a carry of one into any bit position means, in essence, don't borrow from this position. This seems an odd way of handling borrows, but it works.
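
You can let a C compiler confirm that these four expressions agree; a throwaway test program, assuming the two's complement int used by essentially all modern machines:

        #include <stdio.h>

        int main(void)
        {
                int A = 171, B = 39;   /* arbitrary test values */

                /* all four expressions print 132 */
                printf( "%d %d %d %d\n",
                        A - B, A + -B, A + (~B + 1), (A + ~B) + 1 );
                return 0;
        }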

Warning: As was the case with one's complement arithmetic, many students get confused by the use of the term two's complement to mean both the name of the number system and the name of the operator used to negate numbers in that system! If someone says: "Take the two's complement of x," or "two's complement x" that means to apply the operator; the number of bits in the result will be the same as the number of bits in x. If someone says: "write x as a 4-bit two's complement number," that means to use the number system, not negate the number.

The two's complement number system turns out to be a strange place value system, in which the "sign bit" has a value that is negative. Thus, in addition to the obvious ways of converting two's complement numbers to decimal (if negative, write down a minus sign, then two's complement it to create a positive number, then convert that to decimal), we can do the conversion using the sum of the place values. For the 8 bit two's complement system, the place values are:

-128   64   32   16    8    4    2    1
   1    0    1    0    1    0    1    1

10101011₂ = -128 + 32 + 8 + 2 + 1 = -85

When adding two two's complement numbers, overflow detection is not as simple as looking for a carry out of some bit position. If the two numbers being added were of opposite sign, the sum must lie between them, and overflow is impossible. If the numbers being added were of the same sign, overflow is possible, and if it occurs, the sign of the result will differ from the sign of the two numbers being added. Alternately, this turns out to be equivalent to asking if the carry into the sign position is the same as the carry out of the sign position -- in adding negative numbers, there should always be a carry both into and out of the sign, and in adding positive numbers, there should never be a carry in or carry out of the sign.
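
The sign-based rule above can be written as a few bitwise operations in C; a sketch assuming 32-bit values, with the addition done on unsigned operands because overflowing signed addition is formally undefined in C:

        #include <stdint.h>

        /* report whether adding the two's complement values a and b
           overflows 32 bits */
        int signed_add_overflows( int32_t a, int32_t b )
        {
                uint32_t ua = (uint32_t) a;
                uint32_t ub = (uint32_t) b;
                uint32_t sum = ua + ub;

                /* the top bit of (ua ^ sum) & (ub ^ sum) is 1 exactly
                   when a and b have the same sign and the sum's sign
                   differs from it */
                return ((ua ^ sum) & (ub ^ sum)) >> 31;
        }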

As an aside, two's complement numbers have a strong relationship to what is called the ten's complement number system in decimal, sometimes known as the odometer number system. To compute the ten's complement of a number, first take the nine's complement by subtracting every digit from 9, then add 1 to the result. So, -1, as a 6-digit ten's complement number, is represented by 999999, and -2 is 999998.

There are other number systems that see occasional use. For example, there are biased number systems, in which a bias is added to the abstract value in order to guarantee a non-negative value, and then that value is represented in binary. The most common bias used for n-bit numbers is 2ⁿ⁻¹. Here is an example, in 4 bits:

All 4-bit numbers with a bias of 8
0 0 0 0   -8        1 0 0 0    0
0 0 0 1   -7        1 0 0 1   +1
0 0 1 0   -6        1 0 1 0   +2
0 0 1 1   -5        1 0 1 1   +3
0 1 0 0   -4        1 1 0 0   +4
0 1 0 1   -3        1 1 0 1   +5
0 1 1 0   -2        1 1 1 0   +6
0 1 1 1   -1        1 1 1 1   +7

Biased number systems seem like oddities, but they are commonly used in representing the exponent field of floating point numbers. Biased numbers have the advantage that sorting the representations into simple binary order puts the abstract values in numerical order. This is not true for any of the other systems discussed here! Like the two's complement system, there is always at least one value that cannot be negated.

Curiously, when the bias used with n-bit numbers is 2ⁿ⁻¹, the biased representation of some value v can be computed by taking the two's complement representation of v and then inverting the most significant bit. So, for example, -8 as a 4-bit two's complement number is 1000 and as a 4-bit number biased by 8, it is 0000.
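
In C, this conversion is a single exclusive or; a minimal sketch for the 4-bit case, with a made-up function name:

        /* convert between 4-bit two's complement and excess-8 biased
           representations by inverting the sign bit; the operation
           is its own inverse */
        int flip_sign_bit( int bits )
        {
                return bits ^ 8;    /* invert bit 3 */
        }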

Exercises:

v) Flip a coin 8 times to produce an 8-bit binary number. What abstract value does this number represent in the signed magnitude system? Check your result by converting back to binary in the signed magnitude system.

w) Flip a coin 8 times to produce an 8-bit binary number. What abstract value does this number represent in the one's complement system? Check your result by converting back to binary in the one's complement system.

x) Flip a coin 8 times to produce an 8-bit binary number. What abstract value does this number represent in the two's complement system? Check your result by converting back to binary in the two's complement system.

y) Flip a coin 8 times to produce an 8-bit binary number. What abstract value does this number represent in the biased number system, with the most common bias for 8-bit numbers? Check your result by converting back to binary in the biased number system.