When we write computer programs, we always deal with representations. For example, we don't deal with text the way a calligrapher or author deals with text, we deal with a specific character set that may be used to represent text. Instead of dealing with integers or real numbers, we deal with finite precision representations of integers and we deal with floating point numbers.
Well designed programming languages allow the programmer to treat these representations as if they were identical to the abstractions they represent, but this identity is never complete. Thus, for example, our machine representations of integers behave exactly like abstract integers until a result greater than the machine precision is produced; at that point, the behavior of integer variables becomes very strange!
All modern computers use binary representations for their data, and use fixed size containers for data. There is no reason that computers could not be built that use other basic representations; "trinary" (base 3) logic devices aren't hard to build, and in fact, decimal computers were built in large numbers in the 1960's. There is no compelling benefit to using non-binary representations, though, and even the decimal machines of the 1960's were built using binary logic and used binary-coded-decimal (BCD) internally.
The bit is the fundamental unit of information. A bit has two values, conventionally called zero or one. We could as easily name the values on and off, high and low, or true and false; in fact, all of those names are used in different contexts, and the truth is that all such names are arbitrary!
Over the past 50 years, bits have been grouped inside computers in many ways, and these groupings have been given many names:
How many values can be coded in one bit? Two! We'll call them 0 and 1 for now. How many values can be coded by 2 bits? By 3 bits? By 4 bits? Here is a systematic listing of all possible combinations of 4 bits:
0 0 0 0 1 0 0 0 0 0 0 1 1 0 0 1 0 0 1 0 1 0 1 0 0 0 1 1 1 0 1 1 4 0 1 0 0 1 1 0 0 4 bits, 2 = 16 possibilities. 0 1 0 1 1 1 0 1 0 1 1 0 1 1 1 0 0 1 1 1 1 1 1 1In general, to get n combinations, you need log2(n) bits (log2 is the log to the base 2).
How we chose to associate values with these bits is our business! We could decide that 0000 means Tuesday and 0001 means five, if we wanted, and, in fact, character sets frequently look that arbitrary! Consider the character set used by ILLIAC I; the table below was transcribed from a typewritten original in The ILLIAC Miniature Manual by John Halton (U. of Illinois Digital Computer Lab file no 260, Oct 22 1958, page 3):
| Characters |n for 92 Tape Holes | F/S L/S | Orders ----------------------------------- o 0 P 2F o O 1 Q 66F o O 2 W 130F o OO 3 E 194F oO 4 R 258F oO O 5 T 322F oOO 6 Y 386F oOOO 7 U 450F Oo 8 I 514F Oo O 9 O 578F Oo O + K 642F Oo OO - S 706F OoO N N 770F OoO O J J 834F OoOO F F 898F OoOOO L L 962F O o Delay Delay 3F O o O $(Tab) D 67F O o O CR/LF CR/LF 131F O o OO ( B 195F O oO L/S=Letter-Shift 259F O oO O , V 323F O oOO ) A 387F O oOOO / X 451F OOo Delay Delay 515F OOo O = G 579F OOo O . M 643F OOo OO F/S=Figure-Shift 707F OOoO ' H 771F OOoO O : C 835F OOoOO x Z 899F OOoOOO Space Space 963FThis character set is coded in 5 bits, with L/S and F/S as shift codes for switching between two different and wildly disorganized sets of 32 characters! How the designers of this character set decided on the order of the letters is a mystery! Note that the capital O in the "tape holes" column stands for one in the binary code, and that blank stands for zero. The lower case o in the "tape holes" column shows where the sprocket hole in the tape goes; this is not part of the binary code, and is more like a decimal point, from the human reader's perspective.
The modern ISO/7 or ASCII character set, originally proposed in the early 1960's, is a significant improvement:
left 3 bits 000 001 010 011 100 101 110 111 --------------------------------- 0000 | NUL DLE SP 0 @ P ` p 0001 | SOH DC1 ! 1 A Q a q 0010 | STX DC2 " 2 B R b r 0011 | ETX DC3 # 3 C S c s 0100 | EOT DC4 $ r D T d t 0101 | ENQ NAK % 5 E U e u 0110 | ACK SYN & 6 F V f v Right 0111 | BEL ETB ' 7 G W g w 4 bits 1000 | BS CAN ( 8 H X h x 1001 | HT EM ) 9 I Y i y 1010 | LF SUB * : J Z j z 1011 | VT ESC + ; K [ k { 1100 | FF FS , < L \ l | 1101 | CR GS - = M ] m } 1110 | SO RS . > N ^ n ~ 1111 | SI US / ? O _ o DELThis character set, now officially known as the ISO/7 character set, is the basis of all of the widely used character sets in common use today. One other character set from the "bad old days" is still with us, EBCDIC.
As with character, it is possible to associate arbitrary bit patterns with numeric values, but if arithmetic is to be perfomed on those values, things are far easier if a uniform character set is used. The basic binary number system dates back to ancient China, where it was used to order of the 64 combinations of I-Ching sticks (each stick could be in either of two states; the pattern that resulted from a throw of the sticks was used for fortune telling):
0 = 000000 16 = 010000 32 = 100000 48 = 110000 1 = 000001 17 = 010001 33 = 100001 49 = 110001 2 = 000010 18 = 010010 34 = 100010 50 = 110010 3 = 000011 19 = 010011 35 = 100011 51 = 110011 4 = 000100 20 = 010100 36 = 100100 52 = 110100 5 = 000101 21 = 010101 37 = 100101 53 = 110101 6 = 000110 22 = 010110 38 = 100110 54 = 110110 7 = 000111 23 = 010111 39 = 100111 55 = 110111 8 = 001000 24 = 011000 40 = 101000 56 = 111000 9 = 001001 25 = 011001 41 = 101001 57 = 111001 10 = 001010 26 = 011010 42 = 101010 58 = 111010 11 = 001011 27 = 011011 43 = 101011 59 = 111011 12 = 001100 28 = 011100 44 = 101100 50 = 111100 13 = 001101 29 = 011101 45 = 101101 61 = 111101 14 = 001110 30 = 011110 46 = 101110 62 = 111110 15 = 001111 31 = 011111 47 = 101111 63 = 111111In looking formally at how the bit pattern for a number is related to its abstract value, we must distinguish clearly between representation and abstraction. Here, we will use a programming language notation, where unquoted numerals stand for abstract integer values, subject to arithmetic operations, while quoted strings stand for concrete representations of those values.
Thus, the string "011100" is just a string of symbols, with no inherent meaning. It could be equivalent to "eleven thousand one hundred" or it could be the binary representation of the number 28.
Consider, first, the formula that relates representations of decimal numbers to their values:
i=digits(S)-1 i value(S) = SUM v(S[i]) × 10 (decimal) i=0 Where S is a string of digits value(S) is the numeric value assigned to that string digits(S) is the number of digits in S the rightmost digit in S is S[0] the leftmost digit in S is S[size(S)-1] v(d) is the numeric value associated with digit d
Users of the decimal system rarely use mathematical notation, as above; instead, we speak of the one's place, the ten's place, the hundred's place and so on.
The above formula applies equally to binary numbers, with the simple restriction that each binary digit, or bit may take on only two values, 0 and 1, and that the base of the number system is 2 instead of 10:
i=digits(S)-1 i value(S) = SUM v(S[i]) × 2 (binary) i=0Similarly, in binary numbers, we speak of the one's place, the two's place, the four's place and the eight's place. The list of powers of 2 rapidly becomes familiar to users of binary numbers:
i i 2 0 1 1 2 2 4 3 8 4 16 5 32 6 64 7 128 8 256 9 512 10 1024 about 1000 or 1K (kilo) 11 2048 12 4096 13 8192 14 16384 15 32768 16 65536 20 1048576 about 1,000,000 or 1M (mega) 30 about 1,000,000,000 or 1G (giga)How do we relate the characters used to represent digits to their abstract numeric values? The most general solution to this problem is to use some kind of lookup table, for example:
to compute v(d) find the value of i such that key[i]=d where key = "0123456789" return that value of iIn practice, this is unnecessary because we arrange our character sets so that the numeric positions of the numerals in the character set are consecutive. This was true even with the quirky character set used on ILLIAC! Thus, the following works:
to compute v(d) return rep('0')-rep(d)Here, rep(c) returns the integer used to represent the character c.
Any integer may be used as the radix of a number system. The formula used above to evaluate numbers in bases 2 and 10 also works for r, an arbitrary radix:
i=digits(S)-1 i value(S) = SUM v(S[i]) × r (radix r) i=0
Note that, for any radix r, the string 10 in that radix has, as its value, the radix. Thus, in binary, 10 means two, in decimal, the string 10 means ten, and in octal, the string 10 means eight. To avoid confusion, it is conventional to indicate the number base by a numerical subscript, as
10110 = 26 = 22 = 211 2 8 10 3As long as the radix r is no greater than 10, the conventional digits, 0 through 9, can be used. For r greater than 10, additional symbols must be chosen for use as digits. Historically, a variety of symbols have been used! For example, in base 12 also called duodecimal, the digits T and E have been used (T for ten and E for eleven). Later, on ILLIAC I, the following digits were used for base 16, called sexadecimal in the documentation:
0123456789KSNJFLToday, the standard convention for bases greater than ten is to use A for ten, B for eleven, C for twelve, and so on, up to Z. This allows for bases as high as 36, although the highest commonly used number base is base 16, usually called hexadecimal, using the digits.
0123456789ABCDEFIn the context of computer systems based on binary numbers, it is common to use bases 8 and 16. The reason for this is that it is generally easier for people to treat very long numbers in terms of groups of digits. For example, in dealing with large decimal numbers, we group the digits in threes. In effect, this grouping creates a number in base 1000 with three character names for each digit in that number and commas separating the digits. In the SI system of units, we even have standard adjectival forms for the place values in this system:
1 kilo = 1,000 1 mega = 1,000,000 1 giga = 1,000,000,000 1 tera = 1,000,000,000,000Many computer science students think one k is 1024, but in fact, this usage is only approximate. In the world of disks, it is common to use strange mixed approximations, where mega means 1000×1024. The convention of using a subscript two on the SI prefixes to indicate that the nearest powers of two is intended has recently been suggested and seems to be a good idea! So,
1 kilo = 1024 1 mega = 1,048,576 2 2Having found such utility in grouping decimal digits into groups of three, it is natural to group binary digits similarly:
100 = 1,100,100 = 144 10 2 8But, as is shown above, if each group of 3 bits is interpreted as a value between zero and seven, the result is an octal (base 8) number. Hexadecimal (Base 16) results from a similar grouping of binary digits in groups of 4, with digit value ranging from zero to fifteen.
100 = 110,0100 = 64 10 2 16
On computers, we generally are forced to work in terms of a fixed word size, so our numbers are composed of a fixed number of bits. Leading zeros are used to pad a number out to the desired width. Thus, on a machine with a 16 bit word,
100 = 0000,0000,0110,0100 = 0064 10 2 16On this machine, the maximum representable number is
65535 = 1111,1111,1111,1111 = FFFF 10 2 16When deciding between octal and hexadecimal for compact representation of numbers, it is common to select one that divides words and bytes evenly, so for machines based on an 8 bit byte, hexadecimal seems a good choice, while for machines with a 6 bit byte, octal makes sense. There are exceptions to this! The Motorola 68000 is essentially a 16 bit machine with 8 bit bytes, but the instruction format is made of numerous 3 bit fields, so displaying 68000 instructions in octal makes good sense!