11. Floating Point
Part of
22C:60, Computer Organization Notes
The Hawk architecture includes two instructions reserved for communication with coprocessors. A coprocessor is a special purpose processor that operates in conjunction with the central processor. Coprocessors may be physically separate from the central processor, on a separate chip, or they may be integrated on the same chip with it. It is the logical separation that is essential. Coprocessors are commonly used for floating point, but they have also been used for graphics, cryptography, and more.
The Hawk coprocessor instructions, COSET and COGET, transfer data between the general purpose registers and specialized registers inside one or more coprocessors. Here, we will only discuss coprocessor number one, the floating point coprocessor.
In these instructions, dst always refers to a CPU register, while x refers to a register in the active coprocessor, as selected by the value in coprocessor register zero, the coprocessor status register, COSTAT. See the Hawk manual for details of the fields in COSTAT. The following code enables the floating point coprocessor for short (32-bit) floating point operands:
        LIL     R1, FPENAB + FPSEL
        COSET   R1, COSTAT
Once the floating point coprocessor is enabled, addressing coprocessor registers 1 to 15 refers specifically to registers inside the floating point coprocessor. When operating in short format, there are only two useful registers in the floating-point coprocessor, floating-point accumulators zero and one, FPA0 and FPA1, which correspond to coprocessor registers 2 and 3. Coprocessor register 1 will be ignored for now. It is used for access to the least significant halfword of long (64-bit) floating point operands.
Floating point coprocessor registers 4 to 15 are not really registers. Rather, operations that set these registers operate on the floating point accumulators. The available operations are floating point add, subtract, multiply and divide, as well as square root and integer to floating conversion. Setting even floating point registers 4 to 14 causes operations on FPA0, and setting odd registers 5 to 15 operates on FPA1. For example, setting coprocessor register number 5 converts the integer stored in a general purpose register into a floating point value in FPA1. The complete set of short (32-bit) floating point operations on FPA0 is illustrated below; the same operations are available on FPA1.
        COSET   R1, FPA0          ;  2  FPA0 = R1
        COSET   R1, FPINT+FPA0    ;  4  FPA0 = (float) R1
        COSET   R1, FPSQRT+FPA0   ;  6  FPA0 = sqrt( R1 )
        COSET   R1, FPADD+FPA0    ;  8  FPA0 = FPA0 + R1
        COSET   R1, FPSUB+FPA0    ; 10  FPA0 = FPA0 - R1
        COSET   R1, FPMUL+FPA0    ; 12  FPA0 = FPA0 * R1
        COSET   R1, FPDIV+FPA0    ; 14  FPA0 = FPA0 / R1
Floating point operations do not directly set the condition codes. Instead, when the COGET instruction is used to get the contents of a floating point accumulator, it sets the N and Z condition codes to show whether the floating point value is negative or zero. In addition, the C condition code is used to report floating point values that are infinite or non-numeric.
Exercises
a) Give appropriate defines for the symbols FPA0, FPA1, FPSQRT, FPADD, FPSUB, FPMUL and FPDIV that are used as operands on the COSET instruction.
b) Given 32-bit floating point values x in R4 and y in R5, give Hawk code to enable the coprocessor, compute sqrt(x² + y²), put the result in R3 and then disable the coprocessor.
To fully specify our floating-point coprocessor, we must define not only the operations but also the data formats it uses. The Hawk, like most modern computers, uses the floating point format defined by the Institute of Electrical and Electronics Engineers (IEEE). This follows a general outline very similar to most floating point formats developed since the early 1960s, but it has some eccentric features.
Binary floating point numbers are closely related to decimal numbers expressed in scientific notation. Consider the number 6.02×10²³, Avogadro's number. This number is composed of a mantissa, 6.02, and an exponent, 23. The mantissa is written in the same number base, ten, that is raised to the power given by the exponent, and the mantissa in scientific notation is always normalized, as shown below:
60221419.9 | × | 10¹⁶ | not normalized
60221.4199 | × | 10¹⁹ | not normalized
60.2214199 | × | 10²² | not normalized
6.02214199 | × | 10²³ | normalized
0.602214199 | × | 10²⁴ | not normalized
Of these, only 6.02... × 10²³ is considered to be properly in scientific notation. The rule is that the mantissa must always be a decimal number from 1.000 to 9.999... The only exception is zero. When a number has a mantissa that does not satisfy this rule, we normalize it by moving the point and adjusting the exponent until it satisfies this rule.
The IEEE standard includes both 32-bit and 64-bit floating point numbers. Here, we will ignore the latter and focus on the 32-bit format. As in scientific notation, binary floating point numbers have exponent and mantissa fields, but they are in binary, so the mantissa is in base two and the exponent is a power of two.
bit 31 | bits 30 to 23 | bits 22 to 00
S | exponent | mantissa
In the IEEE floating point formats, like most others, the most significant bit holds the sign of the mantissa, with zero meaning positive. The mantissa is stored in signed magnitude form. The magnitude of the mantissa of a 32-bit floating-point number is given to 24 bits of precision, while the exponent is stored in the 8 remaining bits. Notice that this adds up to 33 bits of sign, exponent and mantissa, evidence of some exceptional trickery in this floating point representation. IEEE double-precision numbers differ from the above in that each number is 64 bits. This allows 11 bits for the exponent instead of 8 bits, and 53 bits for the mantissa, including one extra bit obtained from the same trickery.
The IEEE format gets an extra bit for the mantissa as a consequence of the mantissa normalization rule it uses. The mantissa in an IEEE format number is a binary fixed point number with one place to the left of the point. The normalization rule is similar to that used for scientific notation, so the smallest normalized mantissa value is 1.0, while the largest normalized value is 1.1111...₂. This means that all normalized mantissas have a most significant bit that is one. In general, if a bit is always the same, there is no point in storing it since we can take the constant value for granted. We call such a bit that does not need to be stored a hidden bit. Consider the IEEE floating point value represented as 12345678₁₆:
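0 | 00100100 | 01101000101011001111000
S | exponent | mantissa
Here, the sign is positive, the exponent field is 00100100₂, and the mantissa, with the hidden bit restored, is 1.01101000101011001111000₂.
The IEEE format supports non-normalized mantissas only for the smallest possible exponent. Zero is represented by the smallest possible exponent with a non-normalized mantissa of zero.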
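The field boundaries described above can be sketched in C. This is only an illustration, not code from these notes; the names ieee_fields, ieee_unpack and ieee_full_mantissa are invented for the example, and the hidden-bit helper is meaningful only for normalized values.

```c
#include <stdint.h>

/* Fields of a 32-bit IEEE floating point number (illustrative names). */
typedef struct {
    uint32_t sign;      /* bit 31: 0 = positive, 1 = negative */
    uint32_t exponent;  /* bits 30 to 23: the exponent field as stored */
    uint32_t mantissa;  /* bits 22 to 00: stored mantissa, hidden bit omitted */
} ieee_fields;

/* Split a 32-bit word into its sign, exponent and mantissa fields. */
ieee_fields ieee_unpack(uint32_t word) {
    ieee_fields f;
    f.sign     = word >> 31;
    f.exponent = (word >> 23) & 0xFFu;
    f.mantissa = word & 0x7FFFFFu;
    return f;
}

/* The 24-bit mantissa of a normalized number, with the hidden bit restored. */
uint32_t ieee_full_mantissa(ieee_fields f) {
    return f.mantissa | 0x800000u;
}
```

Applied to 12345678₁₆, ieee_unpack gives a sign of 0, an exponent field of 24₁₆ and a stored mantissa of 345678₁₆; restoring the hidden bit gives B45678₁₆.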
The Biased Exponent
The second odd feature of the IEEE format is that the exponent is given as a biased signed integer with the eccentric bias of 127. The normal range of exponents runs from 00000001₂, meaning -126, to 11111110₂, meaning +127. The exponent represented as 00000000₂ also means -126, and is used for unnormalized mantissas. In this case, the hidden bit is zero. The exponent 11111111₂ is reserved for values that the IEEE calls NaNs, where NaN stands for not a number. The value of the mantissa tells what kind of NaN. The hardware uses a mantissa of zero for infinity. Software may set nonzero values.
Because of the odd bias of 127 for exponents, an exponent of one is represented as 10000000₂, zero is 01111111₂, and negative one is 01111110₂. There is a competing but equivalent explanation of the IEEE format that presents the bias as 128 and places the point in the mantissa one place to the right. The different presentations of the IEEE system make no difference in the number representations, but they can be confusing when comparing different textbook descriptions of the system.
The following table shows IEEE floating-point numbers, given in binary, along with their interpretations.
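The interpretation rules above (bias of 127, hidden bit of one for normalized values and zero for the exponent field 00000000₂, exponent field 11111111₂ reserved) can be sketched in C. The names ieee_class, ieee_classify and ieee_value are invented for this example, and ieee_value is only meaningful for finite values.

```c
#include <stdint.h>

typedef enum {
    IEEE_ZERO, IEEE_UNNORMALIZED, IEEE_NORMAL, IEEE_INFINITY, IEEE_NAN
} ieee_class;

/* Classify a 32-bit IEEE word by its exponent and mantissa fields. */
ieee_class ieee_classify(uint32_t word) {
    uint32_t exponent = (word >> 23) & 0xFFu;
    uint32_t mantissa = word & 0x7FFFFFu;
    if (exponent == 0xFFu) return (mantissa == 0) ? IEEE_INFINITY : IEEE_NAN;
    if (exponent == 0)     return (mantissa == 0) ? IEEE_ZERO : IEEE_UNNORMALIZED;
    return IEEE_NORMAL;
}

/* Value of a finite IEEE word: mantissa times two to the unbiased exponent. */
double ieee_value(uint32_t word) {
    uint32_t exponent = (word >> 23) & 0xFFu;
    uint32_t mantissa = word & 0x7FFFFFu;
    int power;
    double value;
    if (exponent == 0) {
        power = -126;                 /* unnormalized: hidden bit is zero */
    } else {
        power = (int)exponent - 127;  /* remove the bias */
        mantissa |= 0x800000u;        /* normalized: hidden bit is one */
    }
    power -= 23;                      /* 23 places lie right of the point */
    value = (double)mantissa;
    while (power > 0) { value *= 2.0; power--; }
    while (power < 0) { value /= 2.0; power++; }
    return (word >> 31) ? -value : value;
}
```

For example, ieee_value(0x3F800000) is 1.0, ieee_value(0x40000000) is 2.0 and ieee_value(0xBF000000) is -0.5, while ieee_classify(0x7F800000) reports infinity.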
Software Floating Point
Some versions of the Hawk emulator do not have a floating point coprocessor. On such a machine, floating point arithmetic must be done by software. This lets us examine the algorithms used for floating point arithmetic, and it gives an extended example of software to implement a complicated class.
In the following presentation, we will initially ignore compatibility with the Hawk floating point coprocessor and the IEEE floating point format. Instead, we will focus on clear code and simple data representation. Only at the end will we give code to convert to and from IEEE format.
The interface specification for a class should list all of the operations applicable to objects of that class, the methods, and for each method, it should specify the parameters, constraints on those parameters, and the nature of the result. The implementation of the class then gives the details of the object representation and the specific algorithms that implement each method. It is good practice to write the interface specification so that it serves as a manual for users of the class as well as a formal interface.
For our floating-point class, the set of operations is fairly obvious. We want operators that add, subtract, multiply and divide floating-point numbers, as well as operators that return the integer part of a number and convert integers to floating-point form. We will ignore other operations for now.
In most object-oriented programming languages, a strong effort is made to avoid copying objects from place to place. Instead, objects sit in memory and object handles are used to refer to them. The handle for an object is a pointer to that object, that is, a word holding the address of the object. We will follow this pattern, so parameters to our floating point operators will be the addresses of their operands.
Finally, the interface specification for a class must indicate how to allocate storage for an element of that class.
The only thing a user of the object needs to know is the size of the object, not the internal details of its representation. The following interface specification for our Hawk floating point package assumes that each floating point number is stored in two words of memory, enough for an exponent and a mantissa of one word each, although the user need not know how the words are used.
A floating point representation
The simplest floating point representation from a software perspective is a record containing two fields of one word each, the exponent and the mantissa. This is not enough detail. Which word is which? How is each represented? What is the range of exponent values? How do we represent the sign of the exponent? How is the mantissa normalized? How do we represent non-normalized values such as zero?
Since our computer uses two's complement arithmetic, it makes sense to use that for the exponent and mantissa. We can represent zero using a mantissa of zero; technically, when the mantissa is zero, the exponent does not matter, but for zero, the exponent will always be the smallest (most negative) value.
The more difficult question is, where is the point in the mantissa? We could put it anywhere, but there are two good choices: The mantissa could be an integer, or the point could be just right of the sign bit. We will do the latter, with the mantissa normalized so the bit just right of the point is always one.
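As a rough illustration of this representation, the following C sketch models the two-word record and computes the value it stands for. The type name soft_float, the field order (exponent first) and the function soft_value are assumptions made for the example; the notes' own package is written in Hawk assembly language.

```c
#include <stdint.h>

/* Hypothetical model of the two-word software floating point record. */
typedef struct {
    int32_t exponent;  /* two's complement power of two */
    int32_t mantissa;  /* two's complement fraction, point just right of the sign bit */
} soft_float;

/* Value represented: (mantissa / 2^31) * 2^exponent.
   Intended for modest exponents; the loops run once per power of two. */
double soft_value(soft_float x) {
    double value;
    int32_t e = x.exponent;
    if (x.mantissa == 0) return 0.0;           /* zero, whatever the exponent */
    value = (double)x.mantissa / 2147483648.0; /* fraction between -1 and 1 */
    while (e > 0) { value *= 2.0; e--; }
    while (e < 0) { value /= 2.0; e++; }
    return value;
}
```

With this layout, a mantissa of 40000000₁₆ stands for the fraction 0.5, so an exponent of 1 with that mantissa represents 1.0, and an exponent of 4 with a mantissa of 50000000₁₆ (0.625) represents 10.0.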
Normalizing a floating point number
Many operations on floating point numbers produce results that are unnormalized, and these must be normalized before performing additional operations on them. If this is not done, there will be a loss of precision in the results. Classical scientific notation is always presented in normalized form for the same reason.
To normalize a floating point number, we must distinguish some special cases: First, is the number zero? Zero cannot be normalized! Second, is the number negative? Because we have opted to represent our mantissa in two's complement form, negative numbers are slightly more difficult to normalize; this is why many hardware floating-point systems use signed magnitude for their floating point numbers.
The normalize subroutine is not part of the public interface to our floating point package; rather, it is a private component, used as the final step of just about every floating point operation. Therefore, we can write it with the assumption that operands are passed in registers instead of using pointers to memory locations. We will code this here using registers 3 and 4 to hold the exponent and mantissa of the number to be normalized, and we will use this convention both on entrance and exit.
There are two tricks here. First, we used BITTST to test bit 30 of the mantissa. BITTST moves the indicated bit to the C condition code; actually it is an alias for a left or a right shift that moves the bit to the C bit, discarding the shifted result to R0. In C, C++ or Java, programmers typically test bits of a word by anding the word with a constant that has just that bit set.
The second trick involves normalizing negative numbers. By converting negative values from two's complement to one's complement while normalizing, we can use the rule that bit 30 of normalized negative mantissas is always zero.
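The normalization procedure described above can be sketched in C. This is an approximation written for illustration, not a translation of the Hawk routine: the soft_float type and soft_normalize name are invented here, the bit test uses the anding idiom mentioned above, and the special case for a mantissa of exactly -1.0 is an addition that the one's complement trick needs in order to terminate.

```c
#include <stdint.h>

/* Hypothetical record: value = (mantissa / 2^31) * 2^exponent. */
typedef struct {
    int32_t exponent;
    int32_t mantissa;
} soft_float;

#define BIT30 0x40000000u

soft_float soft_normalize(soft_float x) {
    if (x.mantissa == 0) {
        x.exponent = INT32_MIN;        /* zero: use the smallest exponent */
    } else if (x.mantissa == INT32_MIN) {
        x.mantissa = -0x40000000;      /* -1.0 becomes -0.5 times two */
        x.exponent += 1;
    } else if (x.mantissa > 0) {
        uint32_t m = (uint32_t)x.mantissa;
        while ((m & BIT30) == 0) {     /* shift left until bit 30 is one */
            m <<= 1;
            x.exponent -= 1;
        }
        x.mantissa = (int32_t)m;
    } else {
        /* One's complement of the magnitude: normalized when bit 30 is zero. */
        uint32_t m = (uint32_t)x.mantissa - 1u;
        while ((m & BIT30) != 0) {
            m = (m << 1) | 1u;         /* shift in a one to stay in one's complement */
            x.exponent -= 1;
        }
        x.mantissa = -(int32_t)(~m);   /* ~m is the magnitude; negate it again */
    }
    return x;
}
```

For example, an exponent of 31 with a mantissa of 10 normalizes to an exponent of 4 with a mantissa of 50000000₁₆, that is, 0.625 × 2⁴.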
Integer to Floating Conversion
Conversion from integer to floating point is remarkably simple. If we set the exponent field to 31 with the integer value in the mantissa and then normalize, the result is correct. This is because the fixed point fractions we are using for the mantissa can be thought of as integer counts in units of 2⁻³¹.
Floating to Integer Conversion
Conversion of floating-point numbers to integer is a bit more complex, but only because we have no pre-written denormalize routine that will set the exponent field to 31. Instead, we need to write this ourselves. Where the normalize routine shifted the mantissa left and decremented the exponent until the number was normalized, the floating to integer conversion routine will have to shift the mantissa right and increment the exponent until the exponent has the value 31.
This leaves open the question of what happens if the initial value of the exponent was greater than 31. The answer is, in that case, the integer part of the number is too large to represent in 32 bits. In this case, we should raise an exception, or, lacking a model of how to write exception handlers, we could set the overflow condition code. Here, this is left as an exercise for the reader.
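Both conversions can be sketched in C under the same assumed two-word record. The names soft_from_int and soft_to_int are invented, the error return for exponents above 31 is just one possible answer to the exercise, and the sketch truncates toward zero by shifting the magnitude, which is a choice made here rather than something the text specifies.

```c
#include <stdint.h>

typedef struct {
    int32_t exponent;  /* value = (mantissa / 2^31) * 2^exponent */
    int32_t mantissa;
} soft_float;

/* Same normalization sketch as before, repeated to keep this block self-contained. */
soft_float soft_normalize(soft_float x) {
    if (x.mantissa == 0) {
        x.exponent = INT32_MIN;
    } else if (x.mantissa == INT32_MIN) {
        x.mantissa = -0x40000000;
        x.exponent += 1;
    } else if (x.mantissa > 0) {
        uint32_t m = (uint32_t)x.mantissa;
        while ((m & 0x40000000u) == 0) { m <<= 1; x.exponent -= 1; }
        x.mantissa = (int32_t)m;
    } else {
        uint32_t m = (uint32_t)x.mantissa - 1u;
        while ((m & 0x40000000u) != 0) { m = (m << 1) | 1u; x.exponent -= 1; }
        x.mantissa = -(int32_t)(~m);
    }
    return x;
}

/* Integer to floating: exponent 31, integer as the mantissa, then normalize. */
soft_float soft_from_int(int32_t n) {
    soft_float x;
    x.exponent = 31;
    x.mantissa = n;
    return soft_normalize(x);
}

/* Floating to integer: shift right until the exponent would be 31.
   Returns 0 on success, -1 if the exponent is greater than 31. */
int soft_to_int(soft_float x, int32_t *result) {
    uint32_t magnitude;
    if (x.mantissa == 0) { *result = 0; return 0; }
    if (x.exponent > 31) return -1;            /* too large for 32 bits */
    if (x.exponent < 0)  { *result = 0; return 0; }
    magnitude = (x.mantissa < 0) ? 0u - (uint32_t)x.mantissa
                                 : (uint32_t)x.mantissa;
    magnitude >>= (31 - x.exponent);           /* shift count is 0 to 31 */
    if (x.mantissa < 0)
        *result = (magnitude > (uint32_t)INT32_MAX) ? INT32_MIN
                                                    : -(int32_t)magnitude;
    else
        *result = (int32_t)magnitude;
    return 0;
}
```

Converting 10 gives an exponent of 4 and a mantissa of 50000000₁₆, and converting that back shifts the mantissa right 27 places to recover 10.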
Floating Point Addition
We are now ready to explore the implementation of some of the floating point operations. These follow quite naturally from the standard rules for working with numbers in scientific notation. Consider the problem of adding 9.92×10³ to 9.25×10¹. We begin by denormalizing the numbers so that they have the same exponents; this allows us to add the mantissas, after which we renormalize the result and round it to the appropriate number of decimal places:
9.92 × 10³ + 9.25 × 10¹ | given
9.92 × 10³ + 0.0925 × 10³ | denormalized
10.0125 × 10³ | added
1.00125 × 10⁴ | renormalized
1.00 × 10⁴ | rounded
An important question arises here: Which number do we denormalize prior to adding? The answer is, we never want to lose the most significant digits of the sum, so we always increase the smaller of the two exponents while shifting the corresponding mantissa to the right. To do this, we will exchange the arguments into a standard order before proceeding.
There are many ways to exchange the contents of two registers. We will use the most straightforward approach, setting the value in one register aside, moving the other register, and then moving the set-aside value into its final resting place. This takes three move instructions and a spare register. There are ways to do this without an extra register. The most famous and cryptic of these uses the exclusive or operator: a=a⊕b;b=a⊕b;a=a⊕b.
Also note that when we add two normalized numbers, there will sometimes be a carry into the sign bit, which is to say, an overflow. With pencil and paper, we stop propagating the carry before it changes the sign and instead wedge in an extra bit and note the need to normalize the result. In software, we must undo the damage done by the overflow and then fix the normalization of the result. The following floating point add subroutine solves these problems:
Most of this code follows simply from the logic of adding that we demonstrated with the addition of two numbers using scientific notation. There are some points, however, that are worthy of note.
The above code uses its activation record to save R1, allowing the call to NORMALIZE, and to save R8, freeing it for use as a temporary during exchange operations. NORMALIZE does not use an activation record, so this code has been optimized by eliminating stack pointer adjustment before and after the call.
Finally, there is the issue of dealing with overflow during addition. After an overflow, the bit in the sign position is wrong when interpreted as a sign, but it does have the correct value as the most significant bit of the magnitude, as if there were an invisible sign bit to the left of it. Therefore, after a signed right shift to make space for the new sign bit (incrementing the exponent to compensate for this), we can complement the sign by adding one to it, for example, using the ADJUST instruction.
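The addition algorithm can be sketched in C under the same assumed two-word record. This is not the Hawk subroutine: the names are invented, the exchange uses a temporary as in the three-move approach, and overflow is handled by forming the sum in 64 bits and shifting it right one place rather than by repairing the sign bit. Operands whose exponents differ by more than 31 are treated as if the smaller one were zero, a simplification chosen for this sketch.

```c
#include <stdint.h>

typedef struct {
    int32_t exponent;  /* value = (mantissa / 2^31) * 2^exponent */
    int32_t mantissa;
} soft_float;

/* Normalization sketch, repeated to keep this block self-contained. */
soft_float soft_normalize(soft_float x) {
    if (x.mantissa == 0) {
        x.exponent = INT32_MIN;
    } else if (x.mantissa == INT32_MIN) {
        x.mantissa = -0x40000000;
        x.exponent += 1;
    } else if (x.mantissa > 0) {
        uint32_t m = (uint32_t)x.mantissa;
        while ((m & 0x40000000u) == 0) { m <<= 1; x.exponent -= 1; }
        x.mantissa = (int32_t)m;
    } else {
        uint32_t m = (uint32_t)x.mantissa - 1u;
        while ((m & 0x40000000u) != 0) { m = (m << 1) | 1u; x.exponent -= 1; }
        x.mantissa = -(int32_t)(~m);
    }
    return x;
}

/* Portable signed (arithmetic) right shift, for 0 <= n <= 63. */
int64_t shift_right_signed(int64_t v, int n) {
    return (v < 0) ? ~(~v >> n) : (v >> n);
}

soft_float soft_add(soft_float a, soft_float b) {
    soft_float r;
    int64_t difference, sum;
    if (a.mantissa == 0) return soft_normalize(b);
    if (b.mantissa == 0) return soft_normalize(a);
    if (a.exponent < b.exponent) {     /* exchange so a has the larger exponent */
        soft_float t = a;
        a = b;
        b = t;
    }
    difference = (int64_t)a.exponent - (int64_t)b.exponent;
    if (difference > 31) return soft_normalize(a);  /* b is too small to matter */
    /* Denormalize b to match the exponent of a, then add the mantissas. */
    sum = (int64_t)a.mantissa + shift_right_signed(b.mantissa, (int)difference);
    r.exponent = a.exponent;
    if (sum > INT32_MAX || sum < INT32_MIN) {       /* carry into the sign bit */
        sum = shift_right_signed(sum, 1);
        r.exponent += 1;
    }
    r.mantissa = (int32_t)sum;
    return soft_normalize(r);
}
```

Adding 10 (exponent 4, mantissa 50000000₁₆) to itself overflows the 32-bit mantissa; the right shift and exponent increment give exponent 5 with mantissa 50000000₁₆, that is, 20.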
Floating Point Multiplication
Starting with a working integer multiply routine, floating point multiplication is simpler than floating point addition. This simplicity is apparent in the algorithm for multiplying in scientific notation: Add the exponents, multiply the mantissas and normalize the result, as illustrated below:
Unlike addition, we need not denormalize anything before the operation. The one new issue we face is the matter of precision. Multiplying two 32-bit mantissas gives a 64-bit result. We will assume a signed multiply routine that delivers this result, with the following calling sequence:
If the multiplier and multiplicand have 31 places after the point in each, then the 64-bit product has 62 places after the point. Therefore, to normalize the result, we will always shift it one place. If the multiplier and multiplicand are normalized to have minimum absolute values of 0.5, the product will have a minimum absolute value of 0.25. Normalizing such a small product will require an additional shift, but never more than one. We must use 64-bit shifts for these normalize steps in order to avoid loss of precision, so we cannot use the normalize code we used with addition, subtraction and conversion from binary to floating point.
Most of the above code is involved with normalizing the result. This code is oversimplified! What if the product is zero? Our normalization rule is that a product of zero should have the most negative possible exponent. The code also does not test for overflow or underflow, that is, for an exponent that is out of bounds.
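The multiplication algorithm can be sketched in C, again under the assumed two-word record. The names are invented, the sketch multiplies magnitudes and reapplies the sign instead of using a signed multiply routine, and the result is truncated rather than rounded. Unlike the oversimplified routine discussed above, it does check for a zero product, but like that routine it makes no test for an exponent out of bounds.

```c
#include <stdint.h>

typedef struct {
    int32_t exponent;  /* value = (mantissa / 2^31) * 2^exponent */
    int32_t mantissa;
} soft_float;

soft_float soft_multiply(soft_float a, soft_float b) {
    soft_float r = { INT32_MIN, 0 };   /* zero: smallest exponent, zero mantissa */
    uint64_t ma, mb, product;
    int64_t exponent;
    int negative;
    int32_t mantissa;

    if (a.mantissa == 0 || b.mantissa == 0) return r;

    negative = (a.mantissa < 0) != (b.mantissa < 0);
    ma = (uint64_t)((a.mantissa < 0) ? -(int64_t)a.mantissa : (int64_t)a.mantissa);
    mb = (uint64_t)((b.mantissa < 0) ? -(int64_t)b.mantissa : (int64_t)b.mantissa);

    product = ma * mb;                 /* 62 places right of the point */
    exponent = (int64_t)a.exponent + (int64_t)b.exponent;

    if ((product >> 62) != 0) {        /* only when both mantissas are -1.0 */
        product >>= 1;
        exponent += 1;
    }
    while ((product >> 61) == 0) {     /* normalize using 64-bit shifts */
        product <<= 1;
        exponent -= 1;
    }
    mantissa = (int32_t)(product >> 31);   /* keep 31 places after the point */

    r.exponent = (int32_t)exponent;    /* no overflow or underflow test */
    r.mantissa = negative ? -mantissa : mantissa;
    return r;
}
```

Multiplying 3 (exponent 2, mantissa 60000000₁₆) by 10 (exponent 4, mantissa 50000000₁₆) needs the one extra normalizing shift, since 0.75 × 0.625 is less than 0.5; the result is exponent 5 with mantissa 78000000₁₆, that is, 0.9375 × 2⁵ = 30.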
Other Operations
Multiply and divide routines do not finish the story. Our commitment to strong abstraction means that users of our floating point numbers may not examine their representations. The designers of floating point hardware do not face this constraint. They advertise the exact format they use and users are free to use this information. If we do not disclose such detail, we must provide tools for comparing numbers, for testing the sign of numbers, for testing for zero, and other operations that might otherwise be trivial.
Another issue we face is the import and export of floating point numbers. We need tools to convert numbers to and from textual and IEEE standard format.
The routine to convert from our eccentric format to IEEE format begins by dealing with the range of exponent values, since our 32-bit exponent field has an extraordinary range. Second, it converts the exponent and mantissa to the appropriate form, and finally, it packs the pieces together. The following somewhat oversimplified code does this:
Note in the above code that the advertised bias of the IEEE format is 127, yet we used a bias of 126! This is because we also subtracted one from the original exponent to account for the fact that our numbers were normalized in the range 0.5 to 1.0, while IEEE numbers are normalized in the range 1.0 to 2.0. This is also why we compared with 128 and -125 instead of 127 and -126 when checking for the maximum and minimum legal exponents in the IEEE format. We have omitted one significant detail in the above! All underflows were simply forced to zero when some of them ought to have resulted in denormalized numbers.
Conversion from IEEE format to our eccentric software format is fairly easy because our exponent and mantissa fields are larger than those of the single-precision IEEE format. Thus, we can convert with no loss of precision.
The code presented above ignores the possibility that the value might be a NaN or infinity. This code makes extensive use of shifting to clear fields within the number. Thus, instead of writing n&0xFFFFFF00, we write (n>>8)<<8. This trick is useful on many machines where loading a large constant is significantly slower than a shift instruction. By doing this, we avoid both loading a long constant into a register and using an extra register to hold it. We used a related trick to set the implicit one bit, using a subtract instruction to set the carry bit and then adding this bit into the number using an adjust instruction.
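Both directions of the IEEE conversion can be sketched in C under the assumed two-word record. The names are invented and several details are assumptions of this sketch: overflow is turned into infinity, underflow is forced to a signed zero, the mantissa is truncated to 24 bits rather than rounded, the operand is assumed to be normalized, and on the way back NaNs, infinities and unnormalized values are not handled. The bias of 126 and the limits 128 and -125 follow the discussion above, and the shift pair that isolates the mantissa field is an instance of the field-clearing trick.

```c
#include <stdint.h>

typedef struct {
    int32_t exponent;  /* value = (mantissa / 2^31) * 2^exponent */
    int32_t mantissa;
} soft_float;

/* Software format to 32-bit IEEE format (operand assumed normalized). */
uint32_t soft_to_ieee(soft_float x) {
    uint32_t sign, magnitude, field, fraction;
    if (x.mantissa == 0) return 0;
    sign = (x.mantissa < 0) ? 0x80000000u : 0u;
    magnitude = (x.mantissa < 0) ? 0u - (uint32_t)x.mantissa
                                 : (uint32_t)x.mantissa;
    if (x.exponent > 128)  return sign | 0x7F800000u;  /* overflow: infinity */
    if (x.exponent < -125) return sign;                /* underflow: zero */
    field = (uint32_t)(x.exponent + 126);      /* bias 127, less one */
    fraction = (magnitude >> 7) & 0x7FFFFFu;   /* drop the hidden bit */
    return sign | (field << 23) | fraction;
}

/* 32-bit IEEE format to software format (finite, normalized values only). */
soft_float soft_from_ieee(uint32_t word) {
    soft_float r = { INT32_MIN, 0 };
    uint32_t field = (word >> 23) & 0xFFu;
    uint32_t fraction = (word << 9) >> 9;      /* shifts clear sign and exponent */
    int32_t magnitude;
    if (field == 0) return r;                  /* zero (unnormalized values too) */
    magnitude = (int32_t)((fraction | 0x800000u) << 7);  /* hidden bit to bit 30 */
    r.exponent = (int32_t)field - 126;
    r.mantissa = (word >> 31) ? -magnitude : magnitude;
    return r;
}
```

For example, 10 (exponent 4, mantissa 50000000₁₆) converts to 41200000₁₆, the IEEE representation of 10.0, and converting 41200000₁₆ back recovers the same exponent and mantissa.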
Conversion to Decimal
A well designed floating point package will include a complete set of tools for conversion to and from decimal textual representations, but our purpose here is to use the conversion problem to illustrate the use of our floating point package, so we will write our conversion code as user-level code, making no use of any details of the floating point abstraction that are not described in the header file for the package.
First, consider the problem of printing a floating point number using only the operations we have defined, ignoring the complexity of assembly language and focusing on the algorithm. We can begin by taking the integer part of the number and printing that, followed by a point, but the question is, how do we continue from there, printing the digits after the point?
To print the fractional part of a number, we will take the integer part of the number and subtract it from the number, leaving just the fraction; this is equivalent to taking the number mod one. Multiplying the fractional part by ten brings one decimal digit of the fraction above the point. This can be repeated to extract each successive digit. This is not particularly efficient, but it works.
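The printing algorithm just described can be sketched in C, using the built-in double type as a stand-in for our floating point numbers. The function name format_float and its buffer interface are assumptions made so the result can be inspected; the sketch also assumes the integer part fits in a long, and it truncates after the requested number of places rather than rounding.

```c
#include <stddef.h>
#include <stdio.h>

/* Write num into buf as text with the given number of places after the point. */
void format_float(char *buf, size_t size, double num, int places) {
    size_t used = 0;
    long whole;
    int n, i;

    if (size == 0) return;
    if (num < 0.0) {                   /* print the sign, then use the magnitude */
        num = -num;
        if (used + 1 < size) buf[used++] = '-';
    }
    whole = (long)num;                 /* integer part */
    n = snprintf(buf + used, size - used, "%ld.", whole);
    if (n < 0) n = 0;
    used += (size_t)n;
    if (used >= size) used = size - 1; /* output was truncated */

    num -= (double)whole;              /* keep just the fraction */
    for (i = 0; i < places && used + 1 < size; i++) {
        int digit;
        num *= 10.0;                   /* bring one digit above the point */
        digit = (int)num;
        buf[used++] = (char)('0' + digit);
        num -= (double)digit;          /* and remove it again */
    }
    buf[used] = '\0';
}
```

For example, formatting 3.25 with two places takes the integer part 3, then multiplies the fraction 0.25 by ten to extract the digit 2, leaving 0.5, which yields the digit 5 on the next step.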
Before writing assembly code, we will rewrite this so the code passes handles to floating point numbers, not the numbers themselves, and so it calls the right subroutines to do arithmetic.
The above code shows some of the problems we forced on ourselves by insisting on having no knowledge of the representation of floating point numbers when we write our print routine. Where a C or Java programmer would write 10.0, relying on the compiler to create the floating point representation and put it in memory, we have been forced to use the integer constant 10 and then call float() to do the conversion. This is a common consequence of strict object oriented encapsulation, although there are looser encapsulation schemes that export compile time or assembly time macros to process constants into their internal representations.
The next problem we face is that, at the time we write this code, we are denying ourselves knowledge of the size of the representation of floating point numbers. As a result, we cannot allocate space in our activation records taking advantage of a known size. Our solution to this problem rests on two elements. First, we will rely on the fact that the interface definition float.h provides us with the size of a floating point number in the constant FLOATSIZE. We have adopted a general convention here: For each object, record or structure, we always have a symbol defined to hold its size.
Second, we can use the assembler to sum up the sizes of the fields of the activation record instead of adding them by hand, as we have in our previous examples. To do this, we begin by setting the activation record size to zero, and then we define each field in terms of the previous size, adding the field size at each step. The repetitive definitions involved in this can be packaged in a macro.
Had we allowed ourselves to use knowledge about the size of a floating point number, we could have defined NUM=4, TMP=12 and TEN=20. If we did this, changes in the floating point package might require us to rewrite the code. If we had not defined the macro LOCAL, each local variable declaration would have required two lines of code. For example, the declaration of the local variable NUM would begin with NUM=ARSIZE, and then it would add to the activation record size with ARSIZE=ARSIZE+FLOATSIZE.
The local variables for saving registers 8 and 9 were allocated so that the integer variables in our code can use these registers over and over again instead of being loaded and stored in order to survive each call to a routine in the floating point package. Of course, if those routines need registers 8 and 9, they will be saved and restored anyway, but we leave that to them.
The following code contains one significant optimization. With all of the subroutine calls, we could have incremented and decremented the stack pointer many times. Instead, we increment it just once at the start of the print routine and decrement it just once at the end; in between, we always subtract ARSIZE from every displacement into the activation record in order to correct for this.