22C:18, Lecture 18, Fall 1996

Douglas W. Jones
University of Iowa Department of Computer Science

Scientific Notation

John von Neumann said that "any programmer worth his salt can keep the point in his head" (a story recounted to me by Don Gillies, a student of von Neumann's). This remark was made in the early 1950s as an argument against the need for floating point arithmetic hardware. At the time, floating point was a new idea.

The truth is, programmers have problems with fixed point arithmetic. The rules discussed in the previous lecture do work for most applications, but without the kind of automatic support found in languages such as PL/I and Ada, they are a common source of programming errors.

It is probably true that, for most applications, floating point hardware is overkill, but it is commonly available. Floating point number representations are the binary equivalent of the scientific notation you should have learned in high school. In this notation, a number is represented by a normalized mantissa and an exponent. In decimal, for example, Avogadro's constant is expressed as:

         23
6.02 x 10
(This is Avogadro's number, the number of hydrogen atoms in one gram of hydrogen.)

The mantissa of this number is 6.02, and this mantissa is normalized so that it has one place ahead of the point (an alternate way to state the normalization rule is that the value of the mantissa, m, satisfies the rule 10.0 > m >= 1.0). The exponent of this number is 23 and the number base is 10. The following denormalized numbers are equivalent:

         22
60.2 x 10
         24
.602 x 10
When we write out the number, we use a superscript and we write out the string x10 to make it easier to read. A less readable notation is common in computer output, used primarily because older computer printers could not print superscripts and subscripts. In this case, we write:
6.02E23
Here, it would be fair to read the letter E (standing for exponent) as "times ten to the". Even tighter packings are appropriate in some circumstances. For example, internally in typical calculators, floating point numbers are stored in packed form, typically with the exponent in the two most significant digits (and an implicit -99 to +99 limit on the range of legal exponent values). In this format, our example number would be:
23602
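
As a sketch of how such a packing might be computed, the following fragment builds the packed form from a nonnegative exponent and a 3-digit mantissa. The name pack5 and the digit counts are invented for illustration; the lecture does not say how negative exponents are packed, so this sketch ignores them:

	#include <stdio.h>

	/* Toy version of the packed calculator format described above:
	   the exponent occupies the two most significant decimal digits
	   and the mantissa digits follow.  A nonnegative exponent and a
	   3 digit mantissa (6.02 stored as 602) are assumed. */
	long pack5(int exponent, int mantissa3)
	{
		return (long) exponent * 1000 + mantissa3;
	}

	int main(void)
	{
		printf("%ld\n", pack5(23, 602));  /* prints 23602 */
		return 0;
	}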

Binary Floating Point Numbers

Binary floating point representations rest on the same idea, using the same terminology. The differences are in the number base, 2, in the normalization rule, where we usually put the point to the left of the entire mantissa, and in the way we distinguish exponent from mantissa by reserving fixed fields for each.

An easy way to represent floating point numbers on a computer is as a record. For example, consider the following Pascal type definition:

type
    float = record
	exp: integer;
	mantissa: integer;
    end;
In C, where float happens to be a reserved word, we can give the record a slightly different name:

	typedef struct Float {
		int exp;
		int mantissa;
	} Float;
Here, we assume that the number base is 2, so the abstract value of a number in this system is given by:
                    exponent
value = mantissa x 2
Furthermore, we will assume that the mantissa used in the representation declared above is a fixed point two's complement binary fraction. Thus, the point will be assumed to be between the sign bit and the remainder of the mantissa, so the mantissa has a value ranging from -1 to slightly less than +1.

In the place-value system, the sign bit of the mantissa is the -1's place, the next bit to the right is the 1/2's place, the bit to the right of that is the 1/4's place, and so on.
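
To make this concrete, the following sketch computes the abstract value of such a record as a double, assuming the C record above (repeated here so the example stands alone) and the scaling just described, where the stored mantissa integer is the fraction times 2 to the 31. The helper name float_value is invented for illustration:

	#include <stdio.h>
	#include <math.h>

	typedef struct Float {       /* the record declared above */
		int exp;
		int mantissa;
	} Float;

	/* The stored mantissa is the two's complement fraction times
	   2 to the 31, so the value is mantissa x 2 to the (exp - 31).
	   ldexp(x, k) computes x times 2 to the k. */
	double float_value(Float *n)
	{
		return ldexp((double) n->mantissa, n->exp - 31);
	}

	int main(void)
	{
		Float half = { 0, 0x40000000 };      /* 0.5 x 2 to the 0 */
		printf("%f\n", float_value(&half));  /* prints 0.500000 */
		return 0;
	}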

This representation is very easy to deal with in software, but it is inefficient because it allows a huge range of exponent values. Real computations rarely need base 10 exponents outside the range -99 to +99. Recalling that 2 to the 20 is roughly 10 to the 6, our binary exponents are unlikely to need a range outside -330 to +330. This means that an 8 bit exponent is on the small side, but a 10 bit exponent (allowing values in the range -512 to +511) is generous.

In fact, most computations require exponents smaller than the exponent in Avogadro's number, so an exponent range of -23 to +23 will frequently suffice. Again, using the rule of thumb that 6 decimal digits equal 20 bits, this translates to binary exponents between -76 and +76, implying that an 8 bit exponent will actually satisfy most users.

Thus, a well designed floating point number system needs an 8 to 10 bit exponent. How much precision is needed in the mantissa? Most real measurements are made to between 1 part in 100 and 1 part in 1000, so 3 decimal digits of precision will usually suffice. When computations involve long sums, extra precision will be needed! As a result, around 7 decimal digits of precision is a practical minimum for computational use. A 24 bit mantissa allows values to range between -8 million and +8 million, a bit less than 7 full decimal digits of precision. Some applications need significantly more than this!

This leads to the conclusion that, within the constraints of a 32 bit word, an 8 bit exponent with a 24 bit mantissa will just barely suffice for many practical computations, giving the following typical single-precision floating point representation:

 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|
| |               |                                             |
 s    exponent    .                   mantissa
Here, the bit s is the sign bit of the mantissa, separated from the other bits of the mantissa. This separation is useful in order to allow the sign of the whole floating point number to be tested using the same instructions that would be used for testing signs of integers. A related convention is to assume that, when all 32 bits of the number are zero, the value of the number is zero.
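
As a sketch of how the fields of this layout might be pulled apart with shifts and masks, assuming the sign in bit 31, the exponent field in bits 30 through 23, and the rest of the mantissa in bits 22 through 0, the following fragment could be used. The names unpack and magnitude are invented for illustration, and the exponent is printed as the raw stored field, whose interpretation is discussed below:

	#include <stdio.h>

	/* Pull the three fields out of the 32 bit word pictured above. */
	void unpack(unsigned long word,
	            int *sign, int *expfield, long *magnitude)
	{
		*sign      = (int)((word >> 31) & 0x1);     /* bit 31 */
		*expfield  = (int)((word >> 23) & 0xFF);    /* bits 30..23 */
		*magnitude = (long)(word & 0x7FFFFFL);      /* bits 22..0 */
	}

	int main(void)
	{
		int s, e;
		long m;
		/* the bit pattern 0 10000000 10000000000000000000000,
		   which appears in the examples below */
		unpack(0x40400000UL, &s, &e, &m);
		printf("s=%d expfield=%d magnitude=%lx\n",
		       s, e, (unsigned long) m);
		return 0;
	}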

Biased Exponent

For an interesting combination of practical reasons, it is common to represent the exponent using a biased number system, so the exponent 00000000 is the most negative value (-128 decimal), 10000000 is zero, and 11111111 is the most positive value (127 decimal). In particular, this means that the value for zero is interpreted as:

       -128
0.0 x 2
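
The bias arithmetic itself is trivial. As a sketch, assuming the bias of 128 described above, conversion between the true exponent and the stored 8 bit field might be written as follows (the helper names are invented for illustration):

	#include <stdio.h>

	int encode_exp(int exponent)  /* true exponent -> stored field */
	{
		return exponent + 128;
	}

	int decode_exp(int field)     /* stored field -> true exponent */
	{
		return field - 128;
	}

	int main(void)
	{
		printf("%d %d %d\n",
		       encode_exp(0),      /* prints 128, i.e. 10000000 */
		       decode_exp(0x00),   /* prints -128 */
		       decode_exp(0xFF));  /* prints 127 */
		return 0;
	}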

Signed Magnitude Mantissa

The mantissa field in floating point numbers is frequently stored as a signed magnitude binary fraction, although it seems that every possible representation of the sign has been used at one time or another by some computer somewhere. Thus, the point in the mantissa is generally placed between the sign and the other bits. The following examples illustrate this floating point representation:

					       0
 0 10000000 10000000000000000000000  =  0.5 x 2  =  0.500000

					       1
 1 10000001 10000000000000000000000  = -0.5 x 2  = -1.00000

					        2
 1 10000010 11000000000000000000000  = -0.75 x 2 = -3.00000

					       3
 0 10000011 10100000000000000000000  =0.625 x 2  =  5.00000

					       4
 0 10000100 10100000000000000000000  =0.625 x 2  = 10.0000
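
The examples above can be checked mechanically. The following sketch interprets a 32 bit pattern under the conventions just described, a sign bit, an exponent field biased by 128, and a 23 bit mantissa magnitude with the point to its left; the name decode is invented for illustration:

	#include <stdio.h>
	#include <math.h>

	/* Interpret a 32 bit pattern with a signed magnitude mantissa
	   and an exponent field biased by 128.  The 23 magnitude bits
	   are a fraction scaled by 2 to the 23. */
	double decode(unsigned long word)
	{
		int sign       = (int)((word >> 31) & 0x1);
		int exponent   = (int)((word >> 23) & 0xFF) - 128;
		long magnitude = (long)(word & 0x7FFFFFL);
		double value   = ldexp(ldexp((double) magnitude, -23), exponent);
		return sign ? -value : value;
	}

	int main(void)
	{
		/* 0 10000100 10100000000000000000000, the last example above */
		printf("%f\n", decode(0x42500000UL));  /* prints 10.000000 */
		return 0;
	}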

Note, above, the comment that a 32 bit binary floating point number is just barely sufficient for the typical range of exponent values and the typical mantissa precision needed for real-world computation! As a result, most floating point systems allow higher precision, frequently supporting a 64 bit double precision representation, or even a 128 bit quadruple precision format. A well designed double precision format will typically offer both a larger exponent field, perhaps 12 bits, and a larger mantissa field, although many historical computers have retained the same size exponent field for double precision numbers as they have for single precision.

Normalized Binary Floating Point Numbers

The usual normalization rule for binary floating point numbers is simple: the most significant non-sign bit of the mantissa is one. An equivalent statement is that the mantissa m is adjusted so that its magnitude satisfies:

	1 > |m| >= 0.5
Thus, of the following representations for the decimal number 10, only the first is considered to be normalized:
 0 10000100 10100000000000000000000
 0 10000101 01010000000000000000000
 0 10000110 00101000000000000000000
In addition, most floating point formats adopt the convention that, if the mantissa is zero, the exponent is also set to zero.

Given a pointer to an arbitrary floating point number, the following C function will normalize it, assuming the Float record declared earlier, with the exponent and mantissa held in separate signed 32 bit integers:

	void normalize(Float* n)
	{
		int sign;

		/* zero is a special case: force the exponent to zero too */
		if (n->mantissa == 0) {
			n->exp = 0;
			return;
		}

		/* work with the magnitude of the mantissa */
		sign = (n->mantissa < 0);
		if (sign) n->mantissa = -n->mantissa;

		/* shift left until the most significant non-sign bit is 1 */
		while ((n->mantissa & 0x40000000L) == 0) {
			n->mantissa <<= 1;
			n->exp -= 1;
		}

		/* restore the sign */
		if (sign) n->mantissa = -n->mantissa;
	}
The only test in the above code that is specific to the particular representation used is the test of the most significant non-sign bit of the mantissa. For a 32 bit word, this is the bit with the hexadecimal value 40000000, written 0x40000000L in the code above.

The Hidden Bit

Using the normalization and representation rules suggested above, the most significant non-sign bit of the mantissa is always 1. Given that, in a 32 bit floating point number, both mantissa and exponent bits are in short supply, it seems wasteful to use a bit that is always set. Many commonly used floating point number representations avoid this waste by not storing this bit. We call this the hidden bit of the mantissa. As a result, assuming the same breakdown of mantissa and exponent bits as used before, we might document this hidden bit as:

 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|
| |             __|                                             |
 s    exponent |.1                    mantissa
                hidden bit
The example floating point values illustrated before can be converted to this new format by shifting their mantissas one bit left in order to "hide" their most significant bits. The example values presented previously are now:
					       0
 0 10000000 00000000000000000000000  =  0.5 x 2  =  0.500000

					       1
 1 10000001 00000000000000000000000  = -0.5 x 2  = -1.00000

					       2
 1 10000010 10000000000000000000000  =-0.75 x 2  = -3.00000

					       3
 0 10000011 01000000000000000000000  =0.625 x 2  =  5.00000

					       4
 0 10000100 01000000000000000000000  =0.625 x 2  = 10.0000
Using a hidden bit does pose one problem: How do you represent zero? Formally, a value of all zeros is interpreted as follows under our hidden bit rules:
                                                -128
 0 00000000 00000000000000000000000  = 0.500 x 2
While this number is very small, it is not zero! The common answers are somewhat arbitrary and smack of cheating. Typically, we simply declare this value to be zero as a special case, or we declare that, when the exponent field is zero, there is no hidden bit and, therefore, for such small numbers, one less bit of the mantissa is stored. In either case, special case treatment is involved.
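
To see how the hidden bit convention plays out in code, here is a sketch of the decoding direction, assuming the layout above with a bias of 128 and treating the all-zero word as zero by special case. The stored 23 bit field holds the mantissa bits below the always-one leading bit, so the fraction is 0.5 plus the field scaled by 2 to the -24; the name decode_hidden is invented for illustration:

	#include <stdio.h>
	#include <math.h>

	double decode_hidden(unsigned long word)
	{
		int sign       = (int)((word >> 31) & 0x1);
		int exponent   = (int)((word >> 23) & 0xFF) - 128;
		long field     = (long)(word & 0x7FFFFFL);
		double m, value;

		if (word == 0) return 0.0;            /* special case for zero */

		m = 0.5 + ldexp((double) field, -24); /* restore the hidden bit */
		value = ldexp(m, exponent);
		return sign ? -value : value;
	}

	int main(void)
	{
		/* 0 10000011 01000000000000000000000 from the examples above */
		printf("%f\n", decode_hidden(0x41A00000UL));  /* prints 5.000000 */
		return 0;
	}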

Floating Point Arithmetic

In high school math, you should have learned the following algorithm for adding two numbers in scientific notation:

  1. Denormalize the number with the smaller exponent so the exponents are equal.
  2. Add the mantissas.
  3. Normalize the result.
In effect, the denormalization of a number conceptually aligns the points so that digits with equivalent place values can be added to each other. The following example illustrates this:
     the problem:
                  8                  6
        9.927 x 10    +    8.665 x 10    =

     denormalize:
                  8                  8
        9.927 x 10    +    0.087 x 10    =

     add:
                  8
       10.014 x 10    =

     normalize:
                  9
        1.001 x 10
We follow exactly the same rules for adding or subtracting binary floating point numbers. Assuming the Float record used earlier, where the exponent and mantissa are signed 32 bit integers, we can code an add function as follows:
	void add(Float *a, Float *b)
	{
		/* denormalize: shift the mantissa of the operand with the
		   smaller exponent right until the exponents are equal */
		while (a->exp > b->exp) {
			b->exp ++;
			b->mantissa >>= 1;
		}
		while (a->exp < b->exp) {
			a->exp ++;
			a->mantissa >>= 1;
		}

		/* add */
		a->mantissa += b->mantissa;

		/* renormalize */
		if (overflow) {    /* not quite standard C; see below */
			a->exp ++;
			a->mantissa >>= 1;
			a->mantissa ^= 0x80000000;
		}
		normalize( a );
	}
Note the odd bit of code, not quite standard C, that checks for overflow. This special normalization step applies when adding the two mantissas produces an overflow, and it recovers from that overflow by shifting the mantissa right, incrementing the exponent, and fixing the sign bit, which is guaranteed to have been wrong in case of overflow.

This test for overflow is trivial in assembly language on a machine with condition codes, but quite messy to code in any high level language. A C or Pascal programmer would typically end up being forced to waste one bit of precision by denormalizing both operands prior to the add in order to avoid overflow.
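
For what it is worth, the overflow test can be written in portable C without giving up that bit, at the price of some clumsiness. The sketch below uses the fact that a two's complement add overflows exactly when both operands have the same sign and the result has the opposite sign; it assumes 32 bit ints, and the helper name add_overflows is invented for illustration:

	#include <stdio.h>

	/* A two's complement add overflows exactly when both operands
	   have the same sign and the result has the opposite sign.  The
	   sum is computed in unsigned arithmetic, which is defined to
	   wrap, and its sign bit is inspected directly. */
	int add_overflows(int x, int y)
	{
		unsigned int sum = (unsigned int) x + (unsigned int) y;
		int result_negative = (int)((sum >> 31) & 1);
		return ((x < 0) == (y < 0)) && (result_negative != (x < 0));
	}

	int main(void)
	{
		printf("%d\n", add_overflows(0x40000000, 0x40000000));  /* 1 */
		printf("%d\n", add_overflows(0x40000000, -1));          /* 0 */
		return 0;
	}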

Floating point subtraction is quite similar. Floating point multiplication and division are, by comparison, trivial. The rules you learned in high school apply: To multiply two numbers, add the exponents, multiply the mantissas, and normalize the result. In Pascal, this is:

	procedure multiply (var a, b: float);
	begin
	    a.exp := a.exp + b.exp;
	    a.mantissa := a.mantissa * b.mantissa;
	    normalize(a);
	end;
Note that floating point multiplication is not much more difficult than integer multiplication, while floating point addition is significantly more complex than integer multiplication. This speed difference shows up quite clearly in the run times for these instructions on many machines, where floating multiplication is frequently not much slower than integer multiplication, while floating addition is frequently far slower than integer multiplication.
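
Doing the same thing for the binary record representation involves one detail that the Pascal outline glosses over: the mantissas are fractions scaled by 2 to the 31, so their integer product is scaled by 2 to the 62 and must be shifted back down. The sketch below assumes the Float record and normalize function given earlier and a 64 bit long long for the intermediate product (a common compiler extension in this era, later standardized); it is an illustration, not the lecture's own code:

	/* Multiply step for the Float record: add the exponents,
	   multiply the fraction mantissas, and renormalize.  The
	   product of two fractions scaled by 2 to the 31 is scaled
	   by 2 to the 62, so shift down by 31 to rescale it. */
	void multiply(Float *a, Float *b)
	{
		long long product;

		a->exp = a->exp + b->exp;
		product = (long long) a->mantissa * (long long) b->mantissa;
		a->mantissa = (int)(product >> 31);
		normalize( a );
	}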