11. Floating Point

Part of CS:2630, Computer Organization Notes
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

The Hawk Floating Point Coprocessor

The Hawk architecture includes two instructions reserved for communication with coprocessors. A coprocessor is a special purpose processor that operates in conjunction with the central processor. Coprocessors may be physically separate from the central processor, on a separate chip, or they may be integrated on the same chip with it. It is the logical separation that is essential. Coprocessors are commonly used for floating point, but they have also been used for graphics, cryptography, and other specialized operations.

The Hawk coprocessor instructions, COSET and COGET, move data between general purpose registers and registers inside the selected coprocessor. Our focus is on coprocessor number one, the floating-point coprocessor.
 

The Hawk coprocessor set instruction
       
07 06 05 04 03 02 01 00   15 14 13 12 11 10 09 08
 0  0  0  1 |   dst   |     0  0  1  0 |    x    |
       

The Hawk coprocessor get instruction
       
07 06 05 04 03 02 01 00   15 14 13 12 11 10 09 08
 0  0  0  1 |   dst   |     0  0  1  1 |    x    |
       

In these instructions, dst always refers to a CPU register, while x refers to a register in the active coprocessor, as selected by the value in coprocessor register zero, the coprocessor status register, COSTAT. Fields of this register determine which coprocessors are enabled (powered up), which coprocessor is currently selected, and in some cases, the operating mode of that coprocessor. See the Hawk manual for the full details of the fields in COSTAT. Assuming that there is a USE directive for the float.h file, so that the right definitions are available, the following code enables the floating-point coprocessor for short (32-bit) floating-point operands:
 

Enabling the floating-point coprocessor
               
        LIL     R1, FPENAB + FPSEL
        COSET   R1, COSTAT
               

Once the floating-point coprocessor is enabled and selected, addressing coprocessor registers 1 to 15 refers specifically to registers inside the floating-point coprocessor and not any other. When operating in short format, there are only two useful registers in the floating-point coprocessor, floating-point accumulators zero and one, FPA0 and FPA1, which correspond to coprocessor registers 2 and 3. Coprocessor register 1 will be ignored, for now. It is used for access to the least significant halfword of long (64-bit) floating-point operands.

Floating point coprocessor registers 4 to 15 are not really registers. Rather, the operations that set these pseudo-registers are used to operate on the floating-point accumulators. Another way to think about this is to say that the least significant bit of the x field is the coprocessor register select bit, while the high 3 bits of the x field are the coprocessor operation code.

The available operations are floating-point add, subtract, multiply and divide, as well as square root and integer to floating conversion. Setting even floating-point registers 4 to 14 causes operations on FPA0 and setting odd registers 5 to 15 operates on FPA1. For example, setting coprocessor register number 5 converts the integer stored in a general purpose register into a floating-point value in FPA1. The complete set of short (32-bit) floating-point operations on FPA0 is illustrated below using R3 as a source operand; the same operations are available on FPA1.

Floating point coprocessor operations
        COSET   R3, FPA0        ;  2 FPA0 = R3
        COSET   R3, FPINT+FPA0  ;  4 FPA0 = (float) R3
        COSET   R3, FPSQRT+FPA0 ;  6 FPA0 = sqrt( R3 )
        COSET   R3, FPADD+FPA0  ;  8 FPA0 = FPA0 + R3
        COSET   R3, FPSUB+FPA0  ; 10 FPA0 = FPA0 - R3
        COSET   R3, FPMUL+FPA0  ; 12 FPA0 = FPA0 * R3
        COSET   R3, FPDIV+FPA0  ; 14 FPA0 = FPA0 / R3

Floating point operations do not directly set the condition codes. Instead, when the coprocessor get instruction COGET is used to get the contents of a floating-point accumulator, it sets the N and Z condition codes to show whether the floating-point value is negative or zero. In addition, the C condition code is used to report floating-point values that are infinite or non-numeric. So, the following code would work to get the contents of FPA0 into R3 and check that it is valid:

Reading a floating-point register
        COGET   R3, FPA0        ; R3 = FPA0
        BCS     INVALID         ; -- go deal with invalid result

Exercises

a) Give appropriate defines for the symbols FPA0, FPA1, FPSQRT, FPADD, FPSUB, FPMUL and FPDIV that are used as operands on the COSET instruction.

b) Given 32-bit floating-point values x in R4 and y in R5, give Hawk code to enable the coprocessor, compute sqrt(x² + y²), put the result in R3 and then disable the coprocessor.
 

IEEE Floating Point Format

To fully specify our floating-point coprocessor, we must define not only the operations but also the data formats it uses. The Hawk, like most modern computers, uses the floating-point format defined by the Institute of Electrical and Electronics Engineers (IEEE). This follows a general outline very similar to most floating-point formats developed since the early 1960s, but it has some eccentric features.

Binary floating-point numbers are closely related to decimal numbers expressed in scientific notation. Consider the number 6.02 × 10²³, Avogadro's number. This number is composed of a mantissa, 6.02, and an exponent, 23. The number base of the mantissa, 10, is also the base to which the exponent is applied, and the mantissa in scientific notation is always normalized, as shown below:

Equivalent values
60221419.9  × 10¹⁶  \
60221.4199  × 10¹⁹    not normalized
60.2214199  × 10²²  /
6.02214199  × 10²³    normalized
0.602214199 × 10²⁴    not normalized

Of these, only 6.02... × 10²³ is considered to be properly in scientific notation. The rule is that the mantissa must always be a decimal number from 1.000 to 9.999... The only exception is zero. When a number has a mantissa that does not satisfy this rule, we normalize it by moving the point and adjusting the exponent until it satisfies this rule.

The IEEE standard includes both 32 and 64-bit floating-point numbers. Here, we will ignore the latter and focus on the 32-bit format. As in scientific notation, binary floating-point numbers have exponent and mantissa fields, but they are in binary, so the mantissa is in base two and the exponent is a power of two.
 

The IEEE single-precision floating-point representation
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00
 S |      exponent        |                             mantissa                                 |

In the IEEE floating-point formats, like most other floating-point formats designed since the 1950s, the most significant bit holds the sign of the mantissa, with zero meaning positive. The mantissa is stored in signed magnitude form. The magnitude of the mantissa of a 32-bit IEEE floating-point number is given to 24 bits of precision, while the exponent is stored in the 8 remaining bits. Notice that this adds up to 33 bits of sign, exponent and mantissa, evidence of some exceptional trickery. IEEE double-precision numbers differ from the above in that each number is 64 bits. This allows 11 bits for the exponent instead of 8, and 53 bits for the mantissa, including the extra bit obtained by the same kind of trickery.
 

A Normalized Mantissa

The IEEE format gets an extra bit for the mantissa as a consequence of the mantissa normalization rule it uses. The mantissa in an IEEE format number is a binary fixed point number with one place to the left of the point. The normalization rule is similar to that used for scientific notation, so the smallest normalized mantissa value is 1.0, while the largest normalized value is 1.1111...₂, a value just less than 2.0. This means that all normalized mantissas have a most significant bit that is one. In general, if a bit is always the same, there is no point in storing it since we can take the constant value for granted. We call such a bit that does not need to be stored a hidden bit. Consider the IEEE floating-point value represented as 12345678₁₆:
 

An example single-precision floating-point number
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00
 0  0  0  1  0  0  1  0  0  0  1  1  0  1  0  0  0  1  0  1  0  1  1  0  0  1  1  1  1  0  0  0

sign     = 0 (positive)
exponent = 001 0010 0
mantissa = 1.011 0100 0101 0110 0111 1000
value    = 5.69046 × 10⁻²⁸
 

 

The IEEE format supports non-normalized mantissas only for the smallest possible exponent. Zero is represented by the smallest possible exponent and a non-normalized mantissa of zero. Conveniently, this is represented as 00000000₁₆.

We have been referring to the point, not the decimal point or the binary point. Why? The purpose of the point, in any radix, is to sit just to the right of the one's place, dividing the integer part of a number from the fractional part. This has nothing to do with the base of the number system, so calling it a decimal point or a binary point conveys no useful information.
 

The Biased Exponent

Most floating-point formats designed since the 1960s have used a biased representation for the exponent. That is, all zeros in the exponent field is the most negative possible exponent, and all ones is the most positive value. In the case of the IEEE format, the bias used is an odd one, 127, and two exponent values have been singled out for special treatment. Exponents in the range from 00000001₂ to 11111110₂ are interpreted as biased values, so 00000001₂ means –126 and 11111110₂ means +127.

The exponent represented as 00000000₂ is special. Instead of its usual biased interpretation, where it would mean –127, it also means –126, but it implies that the hidden bit is zero, which is to say, the mantissa is not normalized.

The exponent 11111111₂, which in this biased system would mean +128, is reserved for values that the IEEE calls NaNs, where NaN stands for not a number. The value of the mantissa tells what kind of NaN. The hardware uses a mantissa of zero to mean infinity. Software may use nonzero values for other meanings.

Because of the odd bias of 127 for exponents, an exponent of one is represented as 10000000₂, zero is 01111111₂, and negative one is 01111110₂. There is a competing but equivalent explanation of the IEEE format that presents the bias as 128 and places the point in the mantissa one place farther to the right.

The competing presentations of the IEEE floating-point system make no difference in the number representations, but they can be confusing when comparing different textbook descriptions of the system. The following table shows IEEE floating-point numbers, given in binary, along with their interpretations.
 

Example IEEE single-precision floating-point numbers

Infinity and NaN
0 11111111 00000000000000000000000  =  Infinity
1 11111111 00000000000000000000000  =  -Infinity
0 11111111 00000000000010000000000  =  NaN (not a number)
1 11111111 00000010001111101000110  =  NaN

Normalized numbers                        sign     exp.      mant.      value
0 10000000 00000000000000000000000  =    +1.0  ×  2¹     ×  1.00₂  =    2.0
0 01111111 00000000000000000000000  =    +1.0  ×  2⁰     ×  1.00₂  =    1.0
0 01111111 10000000000000000000000  =    +1.0  ×  2⁰     ×  1.10₂  =    1.5
0 01111111 11000000000000000000000  =    +1.0  ×  2⁰     ×  1.11₂  =    1.75
1 01111111 11000000000000000000000  =    –1.0  ×  2⁰     ×  1.11₂  =   –1.75
0 00000001 00000000000000000000000  =    +1.0  ×  2⁻¹²⁶  ×  1.00₂  =    2⁻¹²⁶

Unnormalized numbers                      sign     exp.      mant.      value
0 00000000 10000000000000000000000  =    +1.0  ×  2⁻¹²⁶  ×  0.10₂  =    2⁻¹²⁷
0 00000000 01000000000000000000000  =    +1.0  ×  2⁻¹²⁶  ×  0.01₂  =    2⁻¹²⁸
0 00000000 00000000000000000000000  =    +1.0  ×  2⁻¹²⁶  ×  0.00₂  =    0
1 00000000 00000000000000000000000  =    –1.0  ×  2⁻¹²⁶  ×  0.00₂  =    0

 

In the above table, note that 2⁻¹²⁶ is the smallest positive normalized value, with a mantissa that is all zero except the hidden bit. Everything smaller is non-normalized, with a zero exponent implying a hidden bit of zero.
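
These interpretation rules are easy to express in a few lines of C. The following sketch is ours, not part of the Hawk software; the function name ieee_decode is hypothetical, and it relies on the standard C library ldexp function. It picks apart the bits of an IEEE single-precision number and prints the value they represent, covering the infinity, NaN and denormalized cases:

Decoding an IEEE single-precision number in C
#include <stdio.h>
#include <stdint.h>
#include <math.h>

/* illustrative only: decode an IEEE single-precision bit pattern */
void ieee_decode( uint32_t bits ) {
    int sign = (bits >> 31) & 1;           /* 1-bit sign */
    int exponent = (bits >> 23) & 0xFF;    /* 8-bit biased exponent */
    uint32_t fraction = bits & 0x007FFFFF; /* 23 explicit mantissa bits */

    if (exponent == 0xFF) {                /* reserved for infinity and NaN */
        if (fraction == 0) {
            printf( sign ? "-Infinity\n" : "Infinity\n" );
        } else {
            printf( "NaN\n" );
        }
    } else if (exponent == 0) {            /* denormalized, hidden bit is 0 */
        double value = ldexp( fraction / 8388608.0, -126 ); /* 8388608 = 2^23 */
        printf( "%.10g\n", sign ? -value : value );
    } else {                               /* normalized, hidden bit is 1 */
        double value = ldexp( 1.0 + fraction / 8388608.0, exponent - 127 );
        printf( "%.10g\n", sign ? -value : value );
    }
}

int main() {
    ieee_decode( 0x12345678 ); /* the example above, about 5.69046e-28 */
    ieee_decode( 0x3FC00000 ); /* 1.5, from the table above */
    ieee_decode( 0x00400000 ); /* 2^-127, a denormalized value */
    return 0;
}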

Exercises

c) Give the 32-bit IEEE representation for 100₁₀ in binary.

d) Give the 32-bit IEEE representation for 1000₁₀ in binary.

e) Give the 32-bit IEEE representation for 0.1₁₀ in binary.

f) Give the 32-bit IEEE representation for 0.01₁₀ in binary.

g) Consider the following infinite loop on the Hawk:

        LIL     R3, FPENAB + FPSEL
        COSET   R3, COSTAT
        LIS     R3, 1
L:
        COSET   R3, FPINT + FPA0
        COGET   R3, FPA0
        BR      L

Describe the sequence of values in R3, with an emphasis on the value after many iterations. An approximate upper bound on the value of R3 is a strong answer to this question.

g) Write C code equivalent to the above. Hint: you will need to do some type casting in order to do unsafe type conversions from float to int.

Floating Point in C

The standard C data types float and double are frequently implemented using the IEEE 32 and 64-bit floating-point formats. Some compilers also support long double for even higher precision. These are not universal rules. On machines that have other floating-point formats, C compilers will use the native format, whatever it may be. The C99 standard does not specify a floating-point format, but it does specify a minimum acceptable precision for floating-point numbers by bounding FLT_EPSILON, the smallest value that, when added to 1.0, must produce a value bigger than 1.0.

In addition, the C90 standard requires that the header file float.h define a number of symbols describing the local floating-point representation, among them FLT_RADIX, FLT_MANT_DIG, FLT_DIG, FLT_EPSILON, FLT_MIN and FLT_MAX.

Note that C compilers for two different machines supporting 32-bit floating point numbers could have different descriptions: One might have FLT_MANT_DIG = 24 and FLT_RADIX = 2 while the other has FLT_MANT_DIG = 6 and FLT_RADIX = 16. These two machines will have similar floating-point precision, but the radix 16 machine will tend to lose more precision in arithmetic because it must round results to the nearest base-16 digit instead of rounding to the nearest binary digit.
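
A short test program, ours rather than anything required by the standard, will print these parameters for whatever machine compiles it. On a machine using the IEEE 32-bit format, it should report FLT_RADIX as 2 and FLT_MANT_DIG as 24:

Printing the float.h parameters in C
#include <stdio.h>
#include <float.h>

int main() {
    printf( "FLT_RADIX    = %d\n", FLT_RADIX );
    printf( "FLT_MANT_DIG = %d\n", FLT_MANT_DIG );
    printf( "FLT_DIG      = %d\n", FLT_DIG );
    printf( "FLT_EPSILON  = %g\n", FLT_EPSILON );
    printf( "FLT_MIN      = %g\n", FLT_MIN );
    printf( "FLT_MAX      = %g\n", FLT_MAX );
    return 0;
}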

Also note that a programmer can ask how big the representation of a floating point number is with the sizeof pseudo function. It is called a pseudo function because it looks like a function but is not one. When a C programmer writes sizeof(float) for example, this asks the compiler to give the size of a floating-point number. This happens at compile time, not at run time. The argument to sizeof can be the name of any type or variable. The value is always given as an integer number of storage units, but on most machines, these are bytes.

Here is a C program that explores the representation of a floating-point number on whatever machine it is compiled on:
 

Exploring floating-point representations in C
#include <stdio.h>
  
int main(){
    float a = 1.0;
    printf( "sizeof(float) = %d\n", sizeof(float) );
    printf( " 1.0 in hex = %x\n", *((int*)&a) );
    printf( " this isn't = %x\n", a );
    return 0;
}

The last two lines above attempt to print the hexadecimal representation of the example floating-point value. The first of them, with its complicated casting, actually works, while the final line fails miserably. The problem is that when parameters are passed in C to a function with a variable-sized argument list, all of the arguments are promoted or widened. So, float arguments are promoted, perhaps to double, and scalar arguments such as char are promoted to int or even long int. As a result, the value printed by the last line is a complete misinterpretation of the actual parameter. The developers of C refer to functions with a variable argument list as variadic, a term dating back to 1936 that was all but unknown prior to the widespread use of C.

In C, most casting merely defeats the type rules of the language. Thus, for example, given the two types struct listnode and struct treenode which happen to have identical sizes, casting a variable from one of these types to the other makes no changes to the data, it only re-labels that data from one type to another. In contrast, casting between integer and floating-point values causes code to be generated that translates between these representations. So, (int)3.1416 has the integer value 3.

The casting in the example, *((int*)&a) prevents any conversion from being done. Instead of working with the value of a, we work with its address &a. This pointer has the data type float*, which we then cast to int*, so it is now a pointer to an integer. Finally, the leading * dereferences the pointer, that is, follows that pointer to the actual bits of a. The bits that represented the floating-point number 1.0 are now seen as an integer.
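
An alternative that the text above does not use, mentioned here only for comparison, is to copy the bytes of the float into an int with the standard memcpy function. This gets at the same bits without playing games with pointer types, assuming int and float are the same size:

An alternative using memcpy
#include <stdio.h>
#include <string.h>

int main() {
    float a = 1.0;
    int bits;
    memcpy( &bits, &a, sizeof(bits) ); /* copy the raw bytes, no conversion */
    printf( " 1.0 in hex = %x\n", bits );
    return 0;
}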

Software Floating Point

Some computers do not have floating-point hardware. This is particularly true of the kind of small microcontrollers used in the internet of things. Floating point arithmetic on those machines is done by software. Such software forces us to examine floating-point arithmetic algorithms, and it gives an extended implementation example. Here, we ignore compatibility with the IEEE floating-point format until the end, so we can focus on clear code with simple data representations.

Consider the following implementation, given in C:

A software floating-point representation in C
           
typedef struct {
    int exponent, mantissa;
} myFloat;
           

This declares the type myFloat to be a struct with two integer fields, exponent and mantissa. If we use this declaration in the header file, we expose the representation of myFloat to users, but we allow users to declare variables of type myFloat. We will accept this loss of abstraction here.
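
If the package were split into a header file and an implementation file, the header might look like the following sketch. The file name myfloat.h is our invention; the list contains only routines that actually appear in this chapter, plus two, myFloat_sub and myFloat_to_int, that are assumed later but never shown:

A sketch of a header file for the package
/* myfloat.h -- hypothetical interface to the software floating-point package */

typedef struct {
    int exponent, mantissa;  /* exposed so users can declare myFloat variables */
} myFloat;

void  int_to_myFloat( myFloat * f, int i );          /* *f = (myFloat) i      */
void  myFloat_add( myFloat * f, const myFloat * g ); /* *f = *f + *g          */
void  myFloat_mul( myFloat * f, const myFloat * g ); /* *f = *f * *g          */
float myFloat_to_float( const myFloat * f );         /* convert to IEEE float */

/* assumed later in this chapter but not shown */
void  myFloat_sub( myFloat * f, const myFloat * g ); /* *f = *f - *g          */
int   myFloat_to_int( const myFloat * f );           /* truncate to integer   */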

To keep things simple, the exponent field of our floating-point representation will be an ordinary 2's complement integer, with no bias or other complication. Occasionally, floating-point representations have been proposed where the mantissa is an integer, but it is easier to state a normalization rule for fixed point fractions.

A fixed-point binary number is a number with the point in a fixed position within the word. Here, we will put the point just to the right of the sign bit in our numbers. That means that, aside from the sign bit that indicates whether the number is positive or negative, our mantissa m must be within the range 1.0 > m ≥ –1.0. (The asymmetry in this range is due to the fact that we are using 2's complement numbers.) Consider the following signed binary fraction as an example. This is the closest approximation to 0.001 in 32 bits, with the point immediately to the right of the sign bit:

A fixed-point 32-bit signed binary fraction
00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
 0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  1  1  0  0  0  1  0  0  1  0  0  1  1  0  1  1
 
             31
value(n) =    Σ   nᵢ × 2⁻ⁱ
             i=0

         = 2⁻¹⁰ + 2⁻¹⁶ + 2⁻¹⁷ + 2⁻²¹ + 2⁻²⁴ + 2⁻²⁷ + 2⁻²⁸ + 2⁻³⁰ + 2⁻³¹
         ≅ 0.0009999996983

Note that in this example, we numbered the bits in bigendian order, starting at the left, so that we could use negated bit numbers as exponents in the sum of powers of two. The decimal result is shown to more places than it should be in order to emphasize that it is only an approximation of 0.001; in fact, 32 bits can only express a value to a precision of one part in 4 billion, so we should never show more than 10 digits. If we round the number to 10 digits, we get exactly 0.001.
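
One way to check this example, assuming a machine with 32-bit int and IEEE double arithmetic, is to scale 0.001 by 2³¹ and truncate; the integer that results should have exactly the bit pattern shown above:

Checking the fixed-point example in C
#include <stdio.h>

int main() {
    /* 0.001 with the point just right of the sign: scale by 2^31 and truncate */
    int frac = (int)( 0.001 * 2147483648.0 );          /* 2147483648.0 = 2^31 */
    printf( "fraction bits = %08x\n", frac );          /* prints 0020c49b     */
    printf( "value back    = %.13f\n", frac / 2147483648.0 );
    return 0;
}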

The normalization rule we will adopt for the mantissa is that the absolute value of the mantissa |m| always obeys the rule 1.0 > |m| ≥ 0.5. We have excluded –1 from the range, so any number can be safely negated, and we have required that the high bit of the number, with the place value 1/2, be set in the absolute value of the number. Here are some example floating-point values in this system:
 

Examples using the software floating-point representation
exponent  00000000000000000000000000000000    +0.5  × 2⁰  =  0.5
mantissa  01000000000000000000000000000000
exponent  00000000000000000000000000000001    +0.5  × 2¹  =  1.0
mantissa  01000000000000000000000000000000
exponent  00000000000000000000000000000001    +0.75 × 2¹  =  1.5
mantissa  01100000000000000000000000000000
exponent  00000000000000000000000000000001    –0.75 × 2¹  =  –1.5
mantissa  10100000000000000000000000000000
exponent  11111111111111111111111111111111    +0.5  × 2⁻¹ =  0.25
mantissa  01000000000000000000000000000000
exponent  11111111111111111111111111111101    +0.5  × 2⁻³ =  0.0625
mantissa  01000000000000000000000000000000
exponent  11111111111111111111111111111101    ~8/10 × 2⁻³ ≅  0.1
mantissa  01100110011001100110011001100110

Now, consider the problem of converting from a 32-bit integer to this software floating-point format. We can just assign the integer to the mantissa, with no change to the bits, but what value do we assign to the exponent? The point in the integer is just to the right of the least significant bit, but we want the point to be just to the right of the sign bit.

Up to this point, we treated the value of the floating-point number as the product of a fixed-point mantissa times an exponent term. The term floating point comes from a different way of looking at the same thing. Instead of viewing the exponent as a multiplier, we consider the exponent as an indication of where the point is in the number. An exponent of 0, in our software floating-point representation, puts the point just to the right of the sign bit. An exponent of 1 puts the point one place farther right, and an exponent of 31 puts the point at the right end of the mantissa.

Setting the exponent to 31 therefore converts an integer to a number in our floating point system, but it is not normalized. To normalize, we need to do a bit more work. The following code does this:

Converting integer to floating in C
void int_to_myFloat( myFloat * f, int i ) {
    f->mantissa = i;
    f->exponent = 31;

    if (i == 0) return;  /* zero cannot be normalized; avoid an infinite loop */

    /* normalize f */
    if (i < 0) i = -i;
    while ((i & 0x40000000) == 0) {
        i = i << 1;
        f->mantissa = f->mantissa << 1;
        f->exponent = f->exponent - 1;
    }
}

The normalization code above shifts the mantissa left while decrementing the exponent, until the mantissa satisfies the normalization rule. The termination test uses the absolute value of the mantissa, but we could have written two versions of the loop, one for positive mantissas and one for negative ones.
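
A small test of this routine, using a value not involved in the exercises below: converting the integer 6 should give an exponent of 3 and a mantissa of 0.11₂, stored as 60000000₁₆, since 0.75 × 2³ = 6. Here is a minimal sketch, assuming the declarations above are in scope:

Testing the integer to floating conversion
#include <stdio.h>

/* assumes the myFloat type and int_to_myFloat given above */
int main() {
    myFloat f;
    int_to_myFloat( &f, 6 );
    printf( "exponent = %d\n",   f.exponent );  /* expect 3        */
    printf( "mantissa = %08x\n", f.mantissa );  /* expect 60000000 */
    return 0;
}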

Exercises

h) Give the normalized software floating-point representation for 100₁₀ in binary.

i) Repeat exercise h for 1000₁₀.

j) Repeat exercise h for 0.1₁₀.

k) Repeat exercise h for 0.01₁₀.

l) In this software floating-point number system, what is the largest possible positive value (in binary)?

m) In this software floating-point number system, what is the smallest possible positive nonzero normalized value?

m) Write SMAL Hawk code equivalent to int_to_myFloat given above. Note that the test written in C as (i&0x40000000) can be done with the BITTST instruction on the Hawk.
 

Floating Point Addition

We can now explore the implementation of more interesting operations. These follow naturally from the standard rules for working with numbers in scientific notation. Consider the problem of adding 9.92 × 10³ to 9.25 × 10¹. We begin by denormalizing the numbers so that they have the same exponents; this allows us to add the mantissas, after which we renormalize the result and round it to the appropriate number of places:
 

Adding in scientific notation
given         9.92 × 10³  +  9.25 × 10¹
denormalized  9.92 × 10³  +  0.0925 × 10³
rearranged    (9.92 + 0.0925) × 10³
added         10.0125 × 10³
normalized    1.00125 × 10⁴
rounded       1.00 × 10⁴

 

The final rounding step is one many students forget, particularly in the age of pocket calculators. In scientific notation, we have the convention that the number of digits given indicates the precision of the measurements from which a number was taken. As a result, if two numbers are given in scientific notation and then added or subtracted, the result should not be expressed to greater precision than the least precise operand. When throwing away the less significant digits of the result, we always round to minimize the information loss and avoid introducing systematic errors.

An important question is: which operand do we denormalize before adding? The answer is that we never want to lose the most significant digits of the sum, so we always increase the smaller of the exponents while shifting the corresponding mantissa right. To do this, we need to sort the arguments into a standard order before proceeding.

Note that when we add two numbers, there will sometimes be a carry into the sign bit, which is to say, an overflow. On pencil and paper, we stop propagating the carry before it changes the sign and instead, wedge in an extra bit and note the need to normalize the result. In a language like C, where the language provides no tools for recognizing an integer overflow, we have two choices: Either assume an overflow in advance, by shifting the numbers right to make space for the extra digit before doing the add, or find an algorithm for detecting the overflow. The former has the disadvantage that we lose one bit of precision when there is no overflow. The latter is easy in assembly language, where the condition codes report overflow, but difficult in C. (It is possible: ((a^(a+b))&~(a^b))<0 is true if (a+b) produces an overflow.)
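
To make that expression concrete, here is a small demonstration with test values of our own choosing. Strictly speaking, signed overflow is undefined behavior in C, so this relies on the usual 2's complement wraparound that most compilers provide:

Detecting integer overflow in C
#include <stdio.h>

/* true if a + b overflows 32-bit 2's complement arithmetic */
int add_overflows( int a, int b ) {
    return ((a ^ (a + b)) & ~(a ^ b)) < 0;
}

int main() {
    printf( "%d\n", add_overflows( 0x7FFFFFFF, 1 ) );          /* 1: overflow    */
    printf( "%d\n", add_overflows( 0x40000000, 0x40000000 ) ); /* 1: overflow    */
    printf( "%d\n", add_overflows( -5, 3 ) );                  /* 0: no overflow */
    return 0;
}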

Regardless of whether there is an overflow, the final step in addition is to normalize the result. The normalize algorithm given above for integer-to-floating conversion is the same one used below:

Adding two floating-point numbers in C
void myFloat_add( myFloat * f, const myFloat * g ) {
    /* adds *g to *f */

    myFloat aug; /* the augend that is added to */
    {
        myFloat add; /* the addend added to the augend */
        if (f->exponent > g->exponent) {
	    aug = *g;
	    add = *f;
        } else {
	    aug = *f;
	    add = *g;
        }
        /* assert, the addend has the larger exponent if they differ */

        /* first, align the points, making the exponents equal */
        while (add.exponent > aug.exponent) {
            aug.exponent = aug.exponent + 1;
            aug.mantissa = aug.mantissa >> 1;
        }

        /* prevent overflow */
        add.mantissa = add.mantissa >> 1;
        aug.exponent = aug.exponent + 1;
        aug.mantissa = aug.mantissa >> 1;

        /* second, add the mantissas */
        aug.mantissa = aug.mantissa + add.mantissa;
    } /* done with addend */
    {
        /* finally, normalize the augend */
        int i = aug.mantissa;
        if (i < 0) i = -i;
        while ((i & 0x40000000) == 0) {
            i = i << 1;
            aug.mantissa = aug.mantissa << 1;
            aug.exponent = aug.exponent - 1;
        }
    }
    *f = aug;
}

Most of the above code follows simply from the logic of adding that we demonstrated with the addition of two numbers using scientific notation. There are some points, however, that are worthy of note.

The second parameter is declared as a const parameter. This means that the object pointed to by the pointer cannot be changed. This function will change the object pointed to by its first parameter, but not the second.

The right shifts that denormalize the number with the smaller exponent simply throw away its least significant bits. In a single computation, this might be insignificant, but in a large-scale computation with hundreds or thousands of additions, these errors can accumulate. Rounding can be used to reduce the severity of these errors considerably.

To see how rounding works, it is useful to look at a decimal example. Consider the difference between rounding and truncating π = 3.14159265... to 5 places of precision. If we truncate, the result is 3.1415 with an error of 9.265... × 10⁻⁵, while if we round, the result is 3.1416 with an error of 7.35... × 10⁻⁶.

The algorithm for rounding in binary or any other number base is the same as the algorithm for rounding in decimal. If the least significant digit being retained has weight w, add w/2 to the number and then truncate. In binary, this simplifies to adding 1 in the position of the most significant bit to be discarded and then truncating. The IEEE floating-point specification requires rounding.
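
In C, the idea can be sketched as a small helper that rounds while shifting right. This is a generic illustration of the rule above, not the specific fix the exercises below ask for:

Rounding while shifting right in C
/* shift x right by n places (0 < n < 31), rounding instead of truncating */
int shift_right_rounded( int x, int n ) {
    /* add 1 in the position of the most significant discarded bit, then shift */
    return ( x + (1 << (n - 1)) ) >> n;
}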

We have already mentioned the second major problem with the above code, the loss of one bit of precision when there is no overflow. Detecting overflow in C requires looking at the signs of the operands and the sign of the result. If the operands have the same sign and the result has a different sign after addition, then there was an overflow. Writing code to detect and deal with this in C is messy.
 

Adding mantissas and handling overflow in machine code
; given R3=add.exponent, R4=add.mantissa, R5=aug.exponent, R6=aug.mantissa
	ADD     R4,R6           ; add.mantissa = add.mantissa + aug.mantissa
        BVR     NOTOVF          ; if (overflow) { -- we need one more bit
        ADDSI   R3,1            ;   add.exponent = add.exponent + 1
        SR      R4,1            ;   add.mantissa = add.mantissa >> 1
        SUB     R0,R1,R1        ;   -- set carry bit
        ADJUST  R4,CMSB         ;   flip sign bit of add.mantissa
NOTOVF:                         ; }

The code for handling overflow in machine code is fast, but it is subtle. An overflow means that the sign of the result is wrong. Doing a signed right shift of the result squeezes an extra bit in as it duplicates the sign bit. That sign bit was wrong as a sign, but it was the correct value as part of the mantissa, so our problem is to flip the sign bit. We could do this by loading the long constant 80000000₁₆, but loading this constant would require an extra register and 48 bits of instructions (LIL followed by ORIS).

The Hawk ADJUST instruction is used for a variety of odd purposes. ADJUST r,CMSB adds the carry bit to the most significant bit of the register r. This instruction was added to the Hawk in order to simplify and speed up the code for long right shifts, but if the carry bit is first forced to one, it can be used to do something equivalent to adding 80000000₁₆ to a register. The instruction SUB R0,R1,R1 sets the C and Z condition codes by subtracting two equal values and then discarding the result. Any two equal values will do this.

Exercises

n) The floating point add code given here is rather stupid about shifting. It should never right-shift the lesser of the two addends more than 32 times! Fix this!

o) Fix this code so that the denormalize step rounds the lesser of the two addends by adding one to the least significant bit just prior to the final right shift operation.

p) Fix the C code so that, when it shifts both operands right one place before adding them (in order to avoid overflow) it rounds them instead of merely discarding the least significant bit.

q) Fix the HAWK code for dealing with overflow so that it rounds instead of simply discarding the least significant bit when it does a right shift.

Floating Point Multiplication

Floating point multiplication is simpler than floating point addition. This is apparent in the algorithm for multiplying in scientific notation: Add the exponents, multiply the mantissas and normalize the result, as shown here:

Multiplication in scientific notation
given       1.02 × 10³  ×  9.85 × 10¹
rearranged  (1.02 × 9.85) × 10⁽³⁺¹⁾
multiplied  10.047 × 10⁴
normalized  1.0047 × 10⁵
rounded     1.00 × 10⁵

Unlike addition, we do not denormalize anything before the operation. The one new issue we face is the matter of precision. Multiplying two 32-bit values gives a 64-bit result. We can do this in C by using long int operands.

Multiplying two floating-point numbers in C
void myFloat_mul( myFloat * f, const myFloat * g ) {
    /* multiplies f by g -- assumes long int is 64 bits */
    long int product = ((long int) f->mantissa * g->mantissa) << 1;
    f->exponent = f->exponent + g->exponent;

    /* normalize, testing the bit just below the sign of the 64-bit product */
    if (product > 0) {
        while ((product & 0x4000000000000000L) == 0) {
            product = product << 1;
            f->exponent = f->exponent - 1;
        }
    } else {
        while ((-product & 0x4000000000000000L) == 0) {
            product = product << 1;
            f->exponent = f->exponent - 1;
        }
    }
    f->mantissa = (product >> 32);
}

There are two tricky parts to this code: First, why shift the product one place left after multiplication? The answer comes from the rules for multiplication of fixed-point numbers: The number of places right of the point in the product is the sum of the number of places right of the point in each multiplicand. Since our original 32-bit mantissas had 31 places right of the point, the 64-bit product has 62 bits right of the point. If we want the high half of the 64-bit product to have 31 bits right of its point, we need to shift it over one place.
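
The bit accounting can be checked with a small example of our own, assuming long int is 64 bits: multiplying the mantissa 0.5 by itself should leave 0.25, with 31 places right of the point, in the high half of the shifted product:

Checking the fraction-bit accounting in C
#include <stdio.h>

int main() {
    int m1 = 0x40000000;  /* 0.5 with 31 places right of the point */
    int m2 = 0x40000000;  /* 0.5 with 31 places right of the point */

    /* 62 places in the raw product; one left shift leaves 31 in the high half */
    long int product = ((long int) m1 * m2) << 1;
    int high = (int)(product >> 32);

    printf( "high half = %08x\n", high );              /* prints 20000000 */
    printf( "value     = %f\n", high / 2147483648.0 ); /* prints 0.250000 */
    return 0;
}

As the text notes, a product of 0.25 is not normalized, so the normalization loop then shifts it left one place.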

The second issue is the normalization code. If both operands of the multiply were properly normalized, the result will require at most one place of shift, since 0.5×0.5 is the smallest normalized product in this number system. Using a loop to normalize with more than one place of shift is only done here in case one or both operands were not normalized.

The normalization code given here uses an alternative to the algorithm used in the integer-to-floating conversion and floating-point addition code. The version given here treats the positive and negative cases separately. Both algorithms accomplish the same thing.

The final 32-bit shift at the end of the above code simply signifies that the most significant half of the result is to be returned. This need not involve an actual shift if 64-bit longwords are stored in pairs of 32-bit registers.
 

Exercises

r) Fix this code to deal with exponent overflows. In the event of an exponent overflow, the exponent and the absolute value of the mantissa should stick at their maximum possible values.

s) Fix this code to deal with underflows. This problem is hard because the exponent could go below the minimum value when the exponents of the multiplicands are added, or it could be decremented below its minimum during normalization. Non-normalized mantissa values with the minimum possible exponent make good sense for underflows with small magnitudes, while zero is a sensible result for big underflows.

t) Fix this floating point multiply code so it correctly handles the value zero.

u) Write code for a floating point divide routine.
 

Other Operations

Multiply and divide are not enough. Our commitment to strong abstraction discourages users from examining the representations of floating-point numbers. Floating point hardware designers do not face this constraint. They advertise the exact number format and encourage users to take advantage of it. Because we do not do this, we must provide tools for operations that would otherwise be trivial, such as testing the sign of a number or testing for zero.

We also need to support import and export of floating-point numbers. Users need to convert to and from textual and IEEE standard format. We convert from our format to IEEE format by picking off the sign, aligning the point in the mantissa, and then dealing with the exponent. Our 32-bit exponent has an extraordinary range, so overflow and underflow are problems.

Again, we will declare the parameter to be a const, indicating that this function will not change the myFloat value passed to it. Note that, in C, const char * p makes a pointer to a constant, while char * const p makes a constant pointer to a variable. At best, this is a confusing consequence of a poor language design.

Also note that the bias of the IEEE format is 127, yet in the following code, we use a bias of 126. This is because the code is doing two things with one operation. First, it is decrementing the original exponent by one to account for the fact that our numbers were normalized in the range 0.5 to 1.0, while IEEE numbers are normalized in the range 1.0 to 2.0. Second, it is adding 127, the actual bias.

The range check on the exponent in the code below uses the bounds –148, –125 and 128 for denormalized and normalized numbers instead of the bounds –149, –126 and 127 specified for the IEEE format. Again, this difference is because of the change in normalization rule between the IEEE model and our word-based model.
 

Converting to IEEE Floating point format in C
float myFloat_to_float( const myFloat * f ) {
    /* convert from myfloat to the native IEEE format */
    int exponent = f->exponent;
    int mantissa = f->mantissa;

    /* first split off the sign */
    int sign = 0;
    if (mantissa < 0) {
        mantissa = -mantissa;
        sign = 0x80000000;
    }

    /* convert */
    if (exponent > 128) { /* convert overflow to infinity */
        mantissa = 0;
        exponent = 0x7F800000;

    } else if (exponent < -148) { /* convert underflow to zero */
        mantissa = 0;
        exponent = 0;
        sign = 0; /* avoids returning negative zero */

    } else if (exponent < -125) { /* denormalized conversion */
        do {
            mantissa = mantissa >> 1;
            exponent = exponent + 1;
        } while (exponent < -125);
        mantissa = mantissa >> 7;
        exponent = 0x00000000;

    } else { /* normalized conversion */
        mantissa = (mantissa >> 7) & 0x007FFFFF;
        exponent = (exponent + 126) << 23;

    }

    int floatval = sign | exponent | mantissa;
    return *((float*)&floatval);
}

The final step in the above code, done as the result is returned, is to force the C compiler to see the type of the returned result as float despite the fact that all of the components have been manipulated as int values. This is done by creating a pointer that points to the value, then casting that pointer so the compiler thinks it points to a float before dereferencing it to get the return value as a float. We did the same kind of thing in Chapter 3 when we explored how integers are stored in multiple bytes of memory.

Why did we write *((float*)&floatval) instead of (float)floatval? This is because casting in C has two completely different meanings, which is to say, there is an error in the design of C. When we write (sometype)value it usually means "discard the compiler's previous type for value and instead use sometype with no change of representation." This holds except when both sometype and value are numeric types. In that case, casting causes a change of representation, generating code as needed to convert integer to floating or vice versa. We prevented this second interpretation by casting a pointer to the data instead of casting the data.

The above code is not perfect. It truncates, discarding the insignificant bits instead of rounding to the nearest IEEE floating point value. Rounding is preferred because it halves the maximum error in conversion.
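
As a quick check of the normalized path, with example values of our own: the myFloat value for 1.5 from the earlier table, exponent 1 and mantissa 60000000₁₆, should convert to the IEEE bit pattern 3FC00000₁₆. Here is a minimal sketch, assuming the declarations above are in scope:

Testing the conversion to IEEE format
#include <stdio.h>

/* assumes the myFloat type and myFloat_to_float given above */
int main() {
    myFloat f;
    f.exponent = 1;            /* 0.75 * 2^1 = 1.5 */
    f.mantissa = 0x60000000;

    float result = myFloat_to_float( &f );
    printf( "result      = %f\n", result );            /* prints 1.500000 */
    printf( "bit pattern = %x\n", *((int*)&result) );  /* prints 3fc00000 */
    return 0;
}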

Hawk code to convert from IEEE format to myFloat
        SUBTITLE Convert from IEEE-format floating-point to myFloat
FLOAT_TO_MYFLOAT:
	; expects R3 = result, pointer to a 2-word myFloat
        ;         R4 = input, the number in IEEE format
        ; uses    R5 = a temporary
        MOVE    R5,R4           ; -- work on exponent
        SL      R5,1            ; -- discard sign bit
        SR      R5,12
        SR      R5,12           ; -- discard mantissa
        ADDI    R5,R5,-126      ; exp = ((input & 0x7F800000)>>23) - 126
        STORES  R5,R3           ; result.exponent = exp

        MOVE    R5,R4           ; -- work with mantissa
        SL      R5,9            ; -- discard exponent and sign
        SR      R5,1            ; -- make space for hidden bit
        SUB     R0,R1,R1        ; -- set carry bit
        ADJUST  R5,CMSB         ; -- move carry to hidden bit
        SR      R5,1            ; mant = ((input & 0x007FFFFF)<<7) | 0x40000000
        TESTR   R4
        BNR     FIEEEPOS        ; if (input < 0) {
        NEG     R5,R5           ;   mant = -mant
FIEEEPOS:                       ; }
        STORE   R5,R3,MANTISSA  ; result.mantissa = mant
        JUMPS   R1              ; return

Conversion from IEEE format to our eccentric software format is fairly easy because our exponent and mantissa fields are larger than those of the single-precision IEEE format. Thus, we can convert with no loss of precision. The code presented above ignores the possibility that the value might be a NaN or infinity.

This code makes extensive use of shifting to clear fields within the number. Thus, instead of writing n&0xFFFFFF00, we write (n>>8)<<8. This trick is useful on many machines where loading a large constant is significantly slower than a shift instruction. By doing this, we avoid both loading a long constant into a register and using an extra register to hold it. On the Hawk, this optimization is effective only for shift counts of 16 or less. Shifting by a greater amount takes two consecutive shift instructions, in which case, it may be faster to load a long constant using LIL followed by an AND instruction.

Another trick is used in the above code to set the implicit one bit. Instead of or-ing a long constant with the mantissa, the code uses a subtract instruction to set the carry bit, and then it uses an ADJUST instruction to or the carry bit into the most significant bit. The subtract instruction subtracted an uninitialized register from itself and discarded the result. This is perfectly safe because the result does not depend on the value of that register.

Conversion to Decimal

A well designed floating point package will include a complete set of tools for conversion to and from decimal textual representations. Our purpose here, however, is to use the conversion problem to illustrate the use of our software floating point package, so we will write our conversion code as user-level code, making no use of any details of the floating point abstraction aside from those discussed above.

First, consider the problem of printing a floating-point number using only the operations we have defined, and focusing on the algorithm. We can begin by taking the non-rounded integer part of the number and printing that, followed by a point, but the question is, how do we continue from there, printing the digits after the point?

To print the fractional part of a number, we will take the integer part of the number and subtract it from the number, leaving just the fraction; this is equivalent to taking the number mod one. Multiplying the fractional part by ten brings one decimal digit of the fraction above the point. This can be repeated to extract each successive digit. This is not particularly efficient, but it works. Here is C code for this:

C code to print a floating-point number
void putfloat( float num, int places ) {
    /* output num to places digits after the point */
    int inum; /* the integer part */

    if (num < 0) {  /* make it positive and print the sign */
        num = -num;
        putchar( '-' );
    }

    /* first put out integer part */
    inum = (int)num;
    printf( "%1d", inum );
    putchar( '.' );

    /* second put out digits of the fractional part */
    for (; places > 0; places--) {
        num = (num - (float)inum) * 10.0;
        inum = (int)num;
        putchar( inum + '0' );
    }
}

The code given above relies on the fact that the cast (int)num truncates the floating-point value num as it converts it to an integer. A C programmer who wants rounding must write (int)(num+0.5). Similarly, (float)inum converts integer to floating. All other casting in C merely tells the compiler to suppress type checking, but here, casting converts the representation from one form to another, involving various amounts of code depending on the computer being used.

We can rewrite this code to use our software floating point format, passing a read-only handle (that is, a pointer to a constant) to the two-word object to be printed. In the code given below, we assume that we have a myFloat_to_int conversion routine, the opposite of the routine given above, and we assume a subtraction routine, myFloat_sub, comparable to the addition routine given above.

This code violates the myFloat abstraction by reaching into the myFloat to negate it by negating the mantissa. This is a cheap computation, while a subroutine call to a myFloat_negate routine would be comparatively expensive. The code also includes the constant ten that was composed by hand. We could have written int_to_myFloat(&ten,10) to initialize this constant, but that would require run-time computation every time a floating point number is printed, and it would eliminate the opportunity to make ten a static constant because a value computed by a function call at run time is obviously not constant.

One feature of this code that is quite distinct from anything that can be written in Java or Python is the use of several myFloat objects embedded directly into the activation record. Both num and tmp are local variables, without handles. This means that they are 2-word fields of the activation record. References to num.mantissa are simple local variable references, and when we pass &num as a handle, we are computing the address of a local variable to pass.

C code to print a number in myFloat format
void putMyFloat( const myFloat *pnum, int places ) {
    myFloat num;  /* a copy of the number -- don't mess with the original */
    int inum;     /* the integer part */

    num.mantissa = pnum->mantissa;
    num.exponent = pnum->exponent;
    if (num.mantissa < 0) {
        num.mantissa = -num.mantissa;
        putchar( '-' );
    }

    /* first put out the integer part */
    inum = myFloat_to_int( &num );
    printf( "%d", inum );
    putchar( '.' );
    
    /* second put out digits of the fractional part */
    while (places > 0) {
        myFloat tmp;
        static const myFloat ten = { 0x00000004, 0x50000000 };
        int_to_myFloat( &tmp, inum );
        myFloat_sub( &num, &tmp );
        myFloat_mul( &num, &ten );
        inum = myFloat_to_int( &num );
        putchar( inum + '0' );
        places = places - 1;
    }
}

Exercises

v) The putfloat routine given above only limits the number of fractional places printed; it does not control the number of places printed left of the point, and it has no provision to control overflow, where the floating point number is larger than can be converted to an integer. Fix it so that it takes a third parameter, left_places, and so that it prints an error indication such as ? when the number is too big to fit in the indicated number of places.

w) A good floating point print routine will round the number to the indicated precision. Fix putfloat to do this.