11. Floating Point
Part of
CS:2630, Computer Organization Notes

The Hawk architecture includes two instructions reserved for communication with coprocessors. A coprocessor is a special purpose processor that operates in conjunction with the central processor. Coprocessors may be physically separate from the central processor, on a separate chip, or they may be integrated on the same chip with it. It is the logical separation that is essential. Coprocessors are commonly used for floating point, but they have also been used for graphics, cryptography, and other specialized operations.
The Hawk coprocessor instructions, COSET and COGET move data between general purpose registers and registers inside the selected coprocessor. Our focus is on coprocessor number one, the floatingpoint coprocessor.


In these instructions, dst always refers to a CPU register, while x refers to a register in the active coprocessor, as selected by the value in coprocessor register zero, the coprocessor status register, COSTAT. Fields of this register determine which coprocessors are enabled (powered up), which coprocessor is currently selected, and in some cases, the operating mode of that coprocessor. See the Hawk manual for the full details of the fields in COSTAT. Assuming that there is a USE directive for the float.h file, so that the right definitions are avalable, the following code enables the floatingpoint coprocessor for short (32bit) floatingpoint operands:
LIL R1, FPENAB + FPSEL COSET R1, COSTAT 
Once the floatingpoint coprocessor is enabled and selected, addressing coprocessor registers 1 to 15 refers specifically to registers inside the floating point coprocessor and not any other. When operating in short format, there are only two useful registers in the floatingpoint coprocessor, floatingpoint accumulators zero and one, FPA0 and FPA1, which corespond to coprocessor registers 2 and 3. Coprocessor register 1 will be ignored, for now. It is used for access to the least significant halfword of long (64bit) floatingpoint operands.
Floating point coprocessor registers 4 to 15 are not really registers. Rather, the operations that set these pseudoregisters are used to operate on the floatingpoint accumulators. Another way to think about this is to say that the least significant bit of the x field is the coprocessor register select bit, while the high 3 bits of the x field are the coprocessor operation code.
The available operations are floatingpoint add, subtract, multiply and divide, as well as square root and integer to floating conversion. Setting even floatingpoint registers 4 to 14 causes operations on FPA0 and setting odd registers 5 to 15 operates on FPA1. For example, setting coprocessor register number 5 converts the integer stored in a general purpose register into a floatingpoint value in FPA1. The complete set of short (32bit) floatingpoint operations on FPA0 is illustrated below using R3 as a source operand; the same operations are available on FPA1.
COSET R3, FPA0 ; 2 FPA0 = R3 COSET R3, FPINT+FPA0 ; 4 FPA0 = (float) R3 COSET R3, FPSQRT+FPA0 ; 6 FPA0 = sqrt( R3 ) COSET R3, FPADD+FPA0 ; 8 FPA0 = FPA0 + R3 COSET R3, FPSUB+FPA0 ; 10 FPA0 = FPA0  R3 COSET R3, FPMUL+FPA0 ; 12 FPA0 = FPA0 * R3 COSET R3, FPDIV+FPA0 ; 14 FPA0 = FPA0 / R3 
Floating point operations do not directly set the condition codes. Instead, When the coprocessor get instruction COGET is used to get the contents of a floatingpoint accumulator, it sets the N and Z condition codes to show whether the floatingpoint value is negative or zero. In addition, the C condition code is used to report floatingpoint values that are infinite or non numeric. So, the following code would work to get the contents of FPA0 into R3 and check that it is valid:
COGET R3, FPA0 ; R3 = FPA0 BCS INVALID ;  go deal with invalid result 
a) Give appropriate defines for the symbols FPA0, FPA1, FPSQRT, FPADD, FPSUB, FPMUL and FPDIV that are used as operands on the COSET instruction.
b) Given 32bit floatingpoint values x in R4 and y in R5, give Hawk code to enable the coprocessor, compute sqrt(x^{2} + y^{2}), put the result in R3 and then disable the coprocessor.
To fully specify our floatingpoint coprocessor, we must define not only the operations but also the data formats it uses. The Hawk, like most modern computers, uses the floatingpoint format the Institute for Electrical and Electronic Engineers (IEEE) has defined. This follows a general outline very similar to most floatingpoint formats developed since since the early 1960's, but it has some eccentric features.
Binary floatingpoint numbers are closely related to decimal numbers expressed in scientific notation. Consider the number 6.02 × 10^{23}, Avagadro's number. This number is composed of a mantissa, 6.02, and an exponent, 23. The number base of the mantissa, 10, is also the base to which the exponent is applied, and the mantissa in scientific notation is always normalized, as shown below:
60221419.9  ×  10^{16}  \  
60221.4199  ×  10^{19}  not normalized  
60.2214199  ×  10^{22}  /  
6.02214199  ×  10^{23}  normalized  
0.602214199  ×  10^{24}  not normalized 
Of these, only 6.02... × 10^{23} is considered to be properly in scientific notation. The rule is that the mantissa must always be a decimal number from 1.000 to 9.999... The only exception is zero. When a number has a mantissa that does not satisfy this rule, we normalize it by moving the point and adjusting the exponent until it satisfies this rule.
The IEEE standard includes both 32 and 64bit floatingpoint numbers. Here,
we will ignore the latter and focus the 32bit
format. As in scientific notation, binary floatingpoint numbers have
exponent and mantissa fields, but they are in binary, so the mantissa
is in base two and the exponent is a power of two.
31  30  29  28  27  26  25  24  23  22  21  20  19  18  17  16  15  14  13  12  11  10  09  08  07  06  05  04  03  02  01  00 
S  exponent  mantissa 
In the IEEE floatingpoint formats, like most other floatingpoint formats designed since the 1950s, the most significant bit holds the sign of the mantissa, with zero meaning positive. The mantissa is stored in signed magnitude form. The magnitude of the mantissa of a 32bit IEEE floatingpoint number is given to 24 bits of precision, while the exponent is stored in the 8 remaining bits. Notice that this adds up to 33 bits of sign, exponent and mantissa, evidence of some exceptional trickery. IEEE doubleprecision numbers differ from the above in that each number is 64 bits. This allows 11bits for the exponent instead of 8, and 53 bits for the mantissa, including the extra bit obtained by the same kind of trickery.
The IEEE format gets an extra bit for the mantissa as a consequence
of the mantissa normalization rule it uses. The mantissa
in an IEEE format number is a binary fixed point number
with one place to the left of the point. The
normalization rule is similar to that used for scientific notation, so
the smallest normalized mantissa value is 1.0, while the largest normalized
value is 1.1111..._{2}, a value just less than 2.0.
This means that all normalized mantissas have
a most significant bit that is one. In general, if a bit
is always the same, there is no point in storing it since we can take
the constant value for granted. We call such a bit that does not need to be
stored a hidden bit. Consider the IEEE floatingpoint value
represented as 12345678_{16}:
 
 
The IEEE format supports nonnormalized mantissas only for the smallest possible exponent. Zero is represented by the smallest possible exponent with and a nonnormalized mantissa of zero. Conveniently, this is represented as 00000000_{16}.
We have been referring to the point, not the decimal point or the binary point. Why? The purpose of the point, in any radix, is to sit just to the right of the one's place, dividing the integer part of a number from the fractional part. This has nothing to do with the base of the number system, so calling it a decimal point or a binary point conveys no useful information.
Most floatingpoint formats designed since the 1960s have used a biased representation for the exponent. In the case of the IEEE format, the bias used is an odd one, 127, and two exponent values have been singled out for special treatment. Exponents in the range from 00000001_{2} to 11111110_{2} are interpreted as biased values, so 00000001_{2} means –126 and 11111110_{2} means +127.
The exponent represented as 00000000_{2} is special. It also means –126, but it implies that the hidden bit is zero, which is to say, the mantissa is not normalized. The exponent 11111111_{2} is reserved for values that the IEEE calls NaNs, where NaN stands for not a number. The value of the mantissa tells what kind of NaN. The hardware uses a mantissa of zero to mean infinity. Software may use nonzero values for other meanings.
Because of the odd bias of 127 for exponents, an exponent of
one is represented as 10000000_{2}, zero is
01111111_{2}, and negative one is
01111110_{2}. There is a competing but equivalent
explanation of the IEEE format that presents the bias as 128 and places
the point in the mantissa one place farther to the right.
The different presentations of the IEEE system make no difference in the
number representations, but they can be confusing when comparing
different textbook descriptions of the system.
The following table shows IEEE floatingpoint numbers,
given in binary, along with their interpretations.
Infinity and NaN  
0 11111111 00000000000000000000000  =  Infinity  
1 11111111 00000000000000000000000  =  Infinity  
0 11111111 00000000000010000000000  =  NaN (not a number)  
1 11111111 00000010001111101000110  =  NaN  
Normalized numbers  Sign  Exp.  Mant.  Value  

0 10000000 00000000000000000000000  =  +1.0  ×  2^{1}_{ }  × 1.00_{2}^{ }  = 2.0  
0 01111111 00000000000000000000000  =  +1.0  ×  2^{0}_{ }  × 1.00_{2}^{ }  = 1.0  
0 01111111 10000000000000000000000  =  +1.0  ×  2^{0}_{ }  × 1.10_{2}^{ }  = 1.5  
0 01111111 11000000000000000000000  =  +1.0  ×  2^{0}_{ }  × 1.11_{2}^{ }  = 1.75  
1 01111111 11000000000000000000000  =  –1.0  ×  2^{0}_{ }  × 1.11_{2}^{ }  = –1.75  
0 00000001 00000000000000000000000  =  +1.0  ×  2^{–126}_{ }  × 1.00_{2}^{ }  = 2^{–126}_{ }  
Unnormalized numbers  Sign  Exp.  Mant.  Value  
0 00000000 10000000000000000000000  =  +1.0  ×  2^{–126}_{ }  × 0.10_{2}^{ }  = 2^{–127}_{ }  
0 00000000 01000000000000000000000  =  +1.0  ×  2^{–126}_{ }  × 0.01_{2}^{ }  = 2^{–128}_{ }  
0 00000000 00000000000000000000000  =  +1.0  ×  2^{–126}_{ }  × 0.00_{2}^{ }  = 0  
1 00000000 00000000000000000000000  =  –1.0  ×  2^{–126}_{ }  × 0.00_{2}^{ }  = 0 
In the above table, note that 2^{–126} is the smallest positive normalized value, with a mantissa that is all zero except the hidden bit. Everything smaller is nonnormalized, with a zero exponent implying a hidden bit of zero.
c) Give the 32bit IEEE representation for 100_{10} in binary.
d) Give the 32bit IEEE representation for 1000_{10} in binary.
e) Give the 32bit IEEE representation for 0.1_{10} in binary.
f) Give the 32bit IEEE representation for 0.01_{10} in binary.
g) Consider the following infinite loop on the Hawk:
LIL R3, FPENAB + FPSEL COSET COSTAT LIS R3, 1 L: COSET R3, FPINT + FPA0 COGET R3, FPA0 BR L
Describe the sequence of values in R3, with an emphasis on the value after many iterations. An approximate upper bound on the value of R3 is a strong answer to this question.
g) Write C code equivalent to the above. Hint, you will need to do some type casting in order to do unsafe type conversions from float to int.
The standard C data types float and double are frequently implemented using the IEEE 32 and 64bit floatingpoint formats. Some compilers also support long double, for even higher precision. These are not a universal rules. On machines that have other floatingpoint formats, C compilers will use the native format, whatever it may be. The C99 standard does not specifiy a floatingpoint format, but it does specify a minimum acceptable precision for floatingpoint numbers by specifying a value that, when added to 1.0, must produce a value bigger than 1.0:
In addition, the C90 standard requires that the hearder file float.h define the following symbols
Note that C compilers for two different machines supporting 32bit floating point numbers could have different descriptions: One might have FLT_MANT_DIG = 24 and FLT_RADIX = 2 while the other has FLT_MANT_DIG = 6 and FLT_RADIX = 16. These two machines will have similar floatingpoint precision, but the radix 16 machine will tend to lose more precision in arithmetic because it must round results to the nearest base16 digit instead of rounding to the nearest binary digit.
Also note that a programmer can ask how big the representation of a floating point number is with the sizeof pseudo function. It is called a pseudo function because it looks like a function but is not one. When a C programmer writes sizeof(float) for example, this asks the compiler to give the size of a floatingpoint number. This happens at compile time, not at run time. The argument to sizeof can be the name of any type or variable. The value is always given as an integer number of storage units, but on most machines, these are bytes.
Here is a C program that explores the representation of a floatingpoint number on whatever machine it is compiled on:
#include <stdio.h> int main(){ float a = 1.0; printf( "sizeof(float) = %d\n", sizeof(float) ); printf( " 1.0 in hex = %x\n", *((int*)&a) ); printf( " this isn't = %x\n", a ); return 0; } 
The last two lines above attempt to print the hexadecimal representation of the example floatingpoint value. The first line, with complicate casting, actually works, while the final line fails miserably. The problem is, when parameters are passed in C to a function with a variablesized argument list, all of the arguments are promoted or widened. So, float arguments are promoted, perhaps to double and scalar arguments such as char are promoted to int or even long int. As a result, the value printed by the last line is a complete misinterpretation of the actual parameter. The developers of C refer to functions with a variable argument list as variadic a term dating back to 1936 that was all but unknown prior to the widespread use of C.
In C, most casting merely defeats the type rules of the language. Thus, for example, given the two types struct listnode and struct treenode which happen to have identical sizes, casting a variable from one of these types to the other makes no changes to the data, it only relabels that data from one type to another. In contrast, casting between integer and floatingpoint values causes code to be generated that translates between these representations. So, (int)3.1416 has the integer value 3.
The casting in the example, *((int*)&a) prevents any conversion from being done. Instead of working with the value of a, we work with its address &a. This pointer has the data type float*, which we then cast to int*, so it is now a pointer to an integer. Finally, the leading * dereferences the pointer, that is, follows that pointer to the actual bits of a. The bits that represented the floatingpoint number 1.0 are now seen as an integer.
Some computers do not have floatingpoint hardware. This is particularly true of the kind of small microcontrollers used in the internet of things. Floating point arithmetic on those machines is done by software. Such software forces us to examine floatingpoint arithmetic algorithms, and it gives an extended implementation example. Here, we ignore compatibility with the IEEE floatingpoint format until the end, so we can focus on clear code with simple data representations.
Consider the following implementation, given in C:
typedef struct { int exponent, mantissa; } myFloat; 
This declares type myFloat to be a struct with two integer fields, exponent and mantissa. If we use this declaration, in the header file, we expose the representation of myFloat to users, but we allow users to declare variables of type myFloat. We will accept this loss of abstraction here.
To keep things simple, the exponent field of our floatingpoint representation will be an ordinary 2's complement integer, with no bias or other complication. Occasionally, floatingpoint representations have been proposed where the mantissa is an integer, but it is easier to state a normalization rule for fixed point fractions.
A fixedpoint binary number is a number with the point in a fixed position within the word. Here, we will put the point just to the right of the sign bit in our numbers. That means that (aside from the sign bit that indicates whether the number is positive or negative), our mantissa m must be within the range 1.0 > m ≥ –1.0. (The assymetry in this range is due to the fact that we are using 2's complement numbers.) Consider the following example signed binary fraction as an example. This is the closest approximation to 0.001 in 32 bits, with the point immediately to the right of the sign bit:
 
 
= 2^{–10} + 2^{–16} + 2^{–17} + 2^{–21} + 2^{–24} + 2^{–27} + 2^{–28} + 2^{–30} + 2^{–31}  
≅ 0.0009999996983 
Note that in this example, we numbered the bits in bigendian order, starting at the left, so that we could use negated bit numbers as exponents in the sum of powers of two. The decimal result is shown to more places than it should be in order to emphasize that it is only an approximation of 0.001; in fact, 32 bits can only express a value to a precision of one part in 4 billion, so we should never show more than 10 digits. If we round the number to 10 digits, we get exactly 0.001.
The normalization rule we will adopt for the mantissa is that the absolute value of the mantissa m always obeys the rule 1.0 > m ≥ 0.5. We have excluded –1 from the range, so any number can be safely negated, and we have required that the high bit of the number, with the place value 1/2, be set in the absolute value of the number. Here are some example floatingpoint values in this system:
exponent  00000000000000000000000000000000  +0.5 × 2^{0} = 0.5 
mantissa  01000000000000000000000000000000  
exponent  00000000000000000000000000000001  +0.5 × 2^{1} = 1.0 
mantissa  01000000000000000000000000000000  
exponent  00000000000000000000000000000001  +0.75 × 2^{1} = 1.5 
mantissa  01100000000000000000000000000000  
exponent  00000000000000000000000000000001  0.75 × 2^{1} = –1.5 
mantissa  10100000000000000000000000000000  
exponent  11111111111111111111111111111111  +0.5 × 2^{–1} = 0.25 
mantissa  01000000000000000000000000000000  
exponent  11111111111111111111111111111101  +0.5 × 2^{–3} = 0.0625 
mantissa  01000000000000000000000000000000  
exponent  11111111111111111111111111111101  ~8/10 × 2^{–3} =~ 0.1 
mantissa  01100110011001100110011001100110 
Now, consider the problem of converting from a 32bit integer to this software floatingpoint format. We can just assign the integer to the mantissa, with no change to the bits, but what value do we assign to the exponent? The point in the integer is just to the right of the least significant bit, but we want the point to be just to the right of the sign bit.
Up to this point, we have treated the value of the floatingpoint number as the product of a fixedpoint mantissa times an exponent term. The term floating point comes from a different way of looking at the same thing. Instead of viewing the exponent as a multiplier, we consider the exponent as an indication of where the point is in the number. An exponent of 0, in our software floating point representation, places the point just to the right of the sign, the first bit from the right. An exponent of 1 puts the point after the second bit from the right, and an exponent of 31 puts the point at the right end of the 32bit number.
Setting the exponent to 31 therefore converts an integer to a number in our floating point system, but it is not normalized. To normalize, we need to do a bit more work. The following code does this:
void int_to_myFloat( myFloat * f, int i ) { f>mantissa = i; f>exponent = 31; /* normalize f */ if (i < 0) i = i; while ((i & 0x40000000) == 0) { i = i << 1; f>mantissa = f>mantissa << 1; f>exponent = f>exponent  1; } } 
The normalization code used above shifts the mantissa left while decrementing the exponent, until the mantissa satisfies the normalization rule. The loop is controlled in terms of the absolute value of the mantissa, but we could just as easily used two different versions of the loop, one for positive mantissas an one for negative ones.
h) Give the normalized software floating point representation for 100_{10} in binary.
i) Repeat exercise h for 1000_{10}.
j) Repeat exercise h for 0.1_{10}.
k) Repeat exercise h for 0.01_{10}.
l) In this software floatingpoint number system, what is the largest possible positive value (in binary).
m) In this software floatingpoint number system, what is the smallest possible positive nonzero normalized value?
m) Write SMAL Hawk code equivalent to int_to_myFloat given
above. Note that the test written in C as
(i&0x40000000) can be done with the BITTST
instruction on the Hawk.
We can now explore the implementation of more interesting operations. These follow naturally from the standard rules for working with numbers in scientific notation. Consider the problem of adding 9.92×10^{3} to 9.25×10^{1}. We begin by denormalizing the numbers so that they have the same exponents; this allows us to add the mantissas, after which we renormalize the result and round it to the appropriate number of places:
given  9.92 × 10^{3}_{ }  +  9.25  ×  10^{1}_{ }  

denormalized  9.92 × 10^{3}_{ }  +  0.0925  ×  10^{3}_{ }  
rearranged  (9.92 + 0.0925)  ×  10^{3}_{ }  
added  10.0125  ×  10^{3}_{ }  
normalized  1.00125  ×  10^{4}_{ }  
rounded  1.00  ×  10^{4}_{ } 
The final rounding step is one many students forget, particularly in the age of pocket calculators. In scientific notation, we have the convention that the number of digits given indicates the precision of the measurements from which a number wase taken. As a result, if two numbers are given in scientific notation and then added or subtracted, the result should not be expressed to greater precision than the least precise operand. When throwing away the less significant digits of the result, we always round to minimise the information loss and avoid introducing systematic errors.
An important question is which operand do we denormalize before adding? The the answer is, we never want to lose the most significant digits of the sum, so we always increase the smaller of the exponents while shifting the corresponding mantissa right. To do this, we need to sort the arguments into a standard order before proceeding.
Note that when we add two numbers, there will sometimes be a carry into the sign bit, which is to say, an overflow. On pencil and paper, we stop propagating the carry before it changes the sign and instead, wedge in an extra bit and note the need to normalize the result. In a language like C, where the language provides no tools for recognizing an integer overflow, we have two choices: Either assume an overflow in advance, by shifting the number right to make space for the extra digit before doing the add, or find an algorithm for detecting the overflow. The former has the disadvantage that we lose one bit of precision when there is no overflow. The latter is easy in assembly language, where the condition codes report overflow, but difficult in C.
Regardless of whether there is an overflow, the final step in addition is to normalize the result. The normalize algorithm given above for integertofloating conversion is the same one used below:
void myFloat_add( myFloat * f, const myFloat * g ) { /* adds *g to *f */ myFloat aug; /* the augend that is added to */ { myFloat add; /* the addend added to the augend */ if (f>exponent > g>exponent) { aug = *g; add = *f; } else { aug = *f; add = *g; } /* assert, the addend has the larger exponent if they differ */ /* first, align the points, making the exponents equal */ while (add.exponent > aug.exponent) { aug.exponent = aug.exponent + 1; aug.mantissa = aug.mantissa >> 1; } /* prevent overflow */ add.mantissa = add.mantissa >> 1; aug.exponent = aug.exponent + 1; aug.mantissa = aug.mantissa >> 1; /* second, add the mantissas */ aug.mantissa = aug.mantissa + add.mantissa; } /* done with addend */ { /* finally, normalize the augend */ int i = aug.mantissa; if (i < 0) i = i; while ((i & 0x40000000) == 0) { i = i << 1; aug.mantissa = aug.mantissa << 1; aug.exponent = aug.exponent  1; } } *f = aug; } 
Most of the above code follows simply from the logic of adding that we demonstrated with the addition of two numbers using scientific notation. There are some points, however, that are worthy of note.
The second parameter is declared as a const parameter. This means that the object pointed to by the pointer cannot be changed. This function will change the object pointed to by its first parameter, but not the second.
The right shifts to denormalize the numbers with the smaller exponent simply threw away the least significant bits. In a single computation, this might be insignificant, but in large scale computation with hundreds or thousands of additions, these errors can accumulate. Rounding can be used to reduce the severity of these errors considerably.
To see how rounding works, it is useful to look at a decimal example. Consider the difference between rounding and truncating π 3.14159265... to 5 places of precision. If we truncate, the result is 3.1415 with an error of 9.265...×10^{–5} while if we round, the result is 3.1416 with an error of 7.35...×10^{–6}.
The algorithm for rounding in binary or any other number base is the same as the algorithm for rounding in decimal. If the least significant digit being retained has weight w, add w/2 to the number and then truncate. In binary, this simplifies to adding 1 to the most significant digit to be discarded and then truncate. The IEEE floating point specificatin requires rounding.
We have already mentioned the second major problem with the above code, the loss of one bit of precision in the case of overflow. Detecting overflow in C requires looking at the sign of the operands and the sign of the result. If the operands have the same sign and the result has a different sign after addition, then there was an overflow. Writing code to detect and deal with this in C is messy.
; given R3=add.exponent, R4=add.mantissa, R5=aug.exponent, R6=aug.mantissa ADD R4,R6 ; add.mantissa = add.mantissa + aug.mantissa BVR NOTOVF ; if (overflow) {  we need one more bit ADDSI R3,1 ; add.exponent = add.exponent + 1 SR R4,1 ; add.mantissa = add.mantissa >> 1 SUB R0,R1,R1 ;  set carry bit ADJUST R4,CMSB ; flip sign bit of add.mantissa NOTOVF: ; } 
The code for handling overflow in machine code is fast, but it is subtle. An overflow means that the sign of the result is wrong. Doing a signed right shift of the result squeezes an extra bit in as it duplicates the sign bit. That sign bit was wrong as a sign, but it was the correct value as part of the mantissa, so our problem is to flip the sign bit. We could do this by loading the long constant 80000000_{16}, but loading this constant would require an extra register and 48 bits of instructions (LIL followed by ORIS).
The Hawk ADJUST instruction is used for a variety of odd purposes. ADJUST r,CMSB adds the carry bit to the most significant bit of the register r. This instruction was added to the HAWK in order to simplify and speed up the code for long right shifts, but if the carry bit is first forced to one, it this can be used to do something equivalent to adding 80000000_{16} to a register. The instruction SUB R0,R1,R1 sets the C and Z condition codes by subtracting two equal values and then discarding the result. Any two equal values will do this.
n) The floating point add code given here is rather stupid about shifting. It should never rightshift the lesser of the two addends more than 32 times! Fix this!
o) Fix this code so that the denormalize step rounds the lesser of the two addends by adding one to the least significant bit just prior to the final right shift operation.
p) Fix the C code so that, when it shifts both operands right one place before adding them (in order to avoid overflow) it rounds them instead of merely discarding the least significant bit.
q) Fix the HAWK code for dealing with overflow so that it rounds instead of simply discarding the least significant bit when it does a right shift.
Floating point multiplication is simpler than floating point addition. This is apparent in the algorithm for multiplying in scientific notation: Add the exponents, multiply the mantissas and normalize the result, as illustrated below:
given  1.02 × 10^{3} × 9.85 × 10^{1}  

rearranged  (1.02 × 9.85) × 10^{(3 + 1)}  
multiplied  10.047 × 10^{4}  
normalized  1.0047 × 10^{5}  
rounded  1.00 × 10^{5} 
Unlike addition, we do not denormalize anything before the operation. The one new issue we face is the matter of precision. Multiplying two 32bit values gives a 64bit result. We can do this in C by using long int operands.
void myFloat_mul( myFloat * f, const myFloat * g ) { /* multiplies f by g */ long int product = ((long int) f>mantissa * g>mantissa) << 1; f>exponent = f>exponent + g>exponent; /* normalize */ if (product > 0) { while ((product & 0x40000000L) == 0) { product = product << 1; f>exponent = f>exponent  1 } } else { while ((product & 0x40000000L) == 0) { product = product << 1; f>exponent = f>exponent  1 } } f>mantissa = (product >> 32); } 
There are two tricky parts to this code: First, why shift the product one place left immediately after multiplication. The answer comes from the rules for multiplication of fixed point numbers: The number of places right of the point in the product is the sum of the number of places right of the point in each multiplicand. Since our original 32bit mantissas had 31 places right of the point, the 64bit product has 62 bits right of the point. If we want the high half of the 64bit product to have its point aligned properly for the mantissa of the result, we need tho shift it over one place.
The second issue is the normalization code. If both operands of the multiply were properly normalized, the result will require at most one place of shift, since 0.5×0.5 is the smallest normalized product in this number system. Using a loop to normalize with more than one place of shift is only done here in case one or both operands were not normalized.
The normalization code given here uses an alternative algorithm from the one used in the integer to float conversion and floating point addition code. The one given here treats the positive and negative cases separately. Both algorithms accomplish the same thing.
The final 32bit shift at the end of the above code simply signifies that the most significant half of the result is to be returned. This need not involve an actual shift if 64bit longwords are stored in pairs of 32bit registers.
r) Fix this code to deal with exponent overflows. In the event of an exponent overflow, the exponent and the absolute value of the mantissa should stick at their maximum possible values.
s) Fix this code to deal with underflows. This problem is hard because the exponent could go below the minimum value when the exponents of the multiplicands are added, or it could be decremented below its minimum during normalization. Nonnormalized mantissa values with the minimum possibl exponent make good sense for underflows with small magnitudes, while zero is a sensible result for big underflows.
t) Fix this floating point multiply code so it correctly handles the value zero.
u) Write code for a floating point divide routine.
Multiply and divide are not enough. Our commitment to strong abstraction discourages users from examining the representations of floatingpoint numbers. Floating point hardware designers do not face this constraint. They advertise the exact number format and encourage users to take advantage of it. If we do not do this, we must provide tools for operations that may be trivial, such as testing the sign of numbers, or testing for zero.
We also need to support import and export of floatingpoint numbers. Users need to convert to and from textual and IEEE standard format. We convert from our format to IEEE format by picking off the sign, aligning the point in the mantissa, and then dealing with the exponent. Our 32bit exponent has an extraordinary range, so overflow and underflow are problems. The following somewhat oversimplified C code does this:
float myFloat_to_float( const myFloat * f ) { /* convert from myfloat to the native IEEE format */ int exponent = f>exponent; int mantissa = f>mantissa; /* first split off the sign */ int sign = 0; if (mantissa < 0) { mantissa = mantissa; sign = 0x80000000; } /* convert */ if (exponent > 128) { /* convert overflow to infinity */ mantissa = 0; exponent = 0x7F800000; } else if (exponent < 148) { /* convert underflow to zero */ mantissa = 0; exponent = 0; sign = 0; /* avoids returning negative zero */ } else if (exponent < 125) { /* denormalized conversion */ do { mantissa = mantissa >> 1; exponent = exponent + 1; } while (exponent < 125); mantissa = mantissa >> 7; exponent = 0x00000000; } else { /* normalized conversion */ mantissa = (mantissa >> 7) & 0x007FFFFF; exponent = (exponent + 126) << 23 } int floatval = sign  exponent  mantissa; return *((float*)&floatval) } 
Again, we have declared the parameter to be a const indicating that this function will not change the myFloat value passed to it. Also note that the bias of the IEEE format is 127, yet in the above code, we used a bias of 126. This is because the code is doing two things with one operation. First, it is decrementing the original exponent by one to account for the fact that our numbers were normalized in the range 0.5 to 1.0, while IEEE numbers are normalized in the range 1.0 to 2.0. Second, it is adding 127, the actual bias.
The range check on the exponent in the above code uses the bounds –148, –125 and 128 for denormalized and normalized numbers instead of the bounds –149, –126 and 127 specified for the IEEE format. Again, this difference is because of the change in normalization rule between the IEEE model and our wordbased model.
The final step in the above code, done as the result is returned, is to force the C compiler to see the type of the returned result as float despite the fact that all of the components have been manipulated as int values. This is done by creating a pointer that points to the value, then casting that pointer so the compilere thinks it points to a float before dereferencing it to get the return value as a float. We did the same kind of thing in Chapter 3 when we explored how integers are stored in multiple bytes of memory.
The above code is not perfect. It truncates, discarding the insignificant bits instead of rounding to the nearest IEEE floating point value. Rounding is preferred because it halves the maximum error in conversion.
SUBTITLE Convert from IEEEformat floatingpoint to myFloat FLOAT_TO_MYFLOAT: ; expects R3 = result, pointer to a 2word myFloat ; R4 = input, the number in IEEE format ; uses R5 = a temporary MOVE R5,R4 ;  work on exponent SL R5,1 ;  discard sign bit SR R5,12 SR R5,12 ;  discard mantissa ADDI R5,R5,126 ; exp = ((input & 0x7F800000)>>23)  126 STORES R5,R3 ; result.exponent = exp MOVE R5,R4 ;  work with mantissa SL R5,9 ;  discard exponent and sign SR R5,1 ;  make space for hidden bit SUB R0,R1,R1 ;  set carry bit ADJUST R5,CMSB ;  move carry to hidden bit SR R5,1 ; mant = ((input & 0x007FFFFF)<<7)  0x40000000 TESTR R4 BNR FIEEEPOS ; if (input < 0) { NEG R5,R5 ; mant = mant FIEEEPOS: ; } STORE R5,R3,MANTISSA ; result.mantissa = mant JUMPS R1 ; return 
Conversion from IEEE format to our eccentric software format is fairly easy because our exponent and mantissa fields are larger than those of the singleprecision IEEE format. Thus, we can convert with no loss of precision. This code presented above ignores the possibility that the value might be a NaN or infinity.
This code makes extensive use of shifting to clear fields within the number. Thus, instead of writing n&0xFFFFFF00, we write (n>>8)<<8. This trick is useful on many machines where loading a large constant is significantly slower than a shift instruction. By doing this, we avoid both loading a long constant into a register and using an extra register to hold it. On the Hawk, this optimization is effective only for shift counts of 16 or less. Shifting by a greater amount takes two consecutive shift instructions, in which case, it may be faster to load a long constant using LIL followed by an AND instruction.
Another trick is used in the above code to set the implicit one bit. Instead of oring a long constant with the mantissa, the code uses a subtract instruction to set the carry bit, and then it uses an ADJUST instruction to or the carry bit into the most significant bit.
A well designed floating point package will include a complete set of tools for conversion to and from decimal textual representations, but our purpose here is to use the conversion problem to illustrate the use of our software floating point package, so we will write our conversion code as userlevel code, making no use of any details of the floating point abstraction aside from those discussed above.
First, consider the problem of printing a floatingpoint number using only the operations we have defined, and focusing on the algorithm. We can begin by taking the nonrounded integer part of the number and printing that, followed by a point, but the question is, how do we continue from there, printing the digits after the point?
To print the fractional part of a number, we will take the integer part of the number and subtract it from the number, leaving just the fraction; this is equivalent to taking the number mod one. Multiplying the fractional part by ten brings one decimal digit of the fraction above the point. This can be repeated to extract each successive digit. This is not particularly efficient, but it works. Here is C code for this:
void putfloat( float num, int places ) { /* output num to places digits after the point */ int inum; /* the integer part */ if (num < 0) { /* make it positive and print the sign */ num = num; putchar( '' ); } /* first put out integer part */ inum = (int)num; printf( "%1d", inum ); putchar( '.' ); /* second put out digits of the fractional part */ for (; places > 0; places) { num = (num  (float)inum) * 10.0; inum = (int)num; putchar( inum + '0' ); } } 
The code given above relies on the fact that casting int(num) truncates the floating point value num as it converts it to integer. A C programmer who wants rounding, must write int(num+0.5). Similarly, (float)inum converts integer to floating. All other casting in C merely tells the compiler to suppress type checking, but here, casting converts the representation from one form to another, involving various amounts of code depending on the computer being used.
We can rewrite this code to use our software floating point format, passing a readonly handle (that is, a const pointer) to the twoword object to be printed. In the code given below, we assume that we have a myFloat_to_int conversion routine, the opposite of the routine given above, and we assume a subtraction routine, myFloat_sub, comparable to the addition routine given above.
This code violates the myFloat abstraction by reaching into the myFloat to negate it by negating the mantissa. This is a cheap computation, while a subroutine call to myFloat_negate would be comparatively expensive. The code also includes the constant ten that was composed by hand. We could have written int_to_myFloat(&ten,10) to initialize this constant, but that would require runtime computation every time a floating point number is printed, and it eliminates the opportunity to make ten a static constant because, as the result of a function, it is obviously not constant.
One feature of this code that is quite distinct from anything that can be written in Java or Python is the use of several myFloat objects embedded directly into the activation record. Both num and tmp are local variables, without handles. This means that they are 2word fields of the activation record. References to num.mantissa are simple local variable references, and when we pass &num as a handle, we are computing the address of a local variable to pass.
void putMyFloat( const myFloat *pnum, int places ) { myFloat num; /* a copy of the number  don't mess with the original */ int inum; /* the integer part */ num.mantissa = pnum>mantissa; num.exponent = pnum>exponent; if (num.mantissa < 0) { num.exponent = num.exponent; putchar( '' ); } /* first put out the integer part */ inum = myFloat_to_int( &num ); printf( "%d", inum ); putchar( '.' ); /* second put out digits of the fractional part */ while (places > 0) { myfloat tmp; static const myfloat ten = { 0x00000004, 0x50000000 }; int_to_myfloat( &tmp, inum ); myFloat_sub( &num, &tmp ); myFloat_mul( &num, &ten ); inum = myFloat_to_int( &num ); putchar( inum + '0' ); places = places  1; } } 
v) The putfloat routine given above only limits the number of fractional places printed, it does not control the number of places printed left of the point, and it has no provisions to control overflow, where the floating point number is larger than can be converted to an integer. Fix it so that it takes a third parameter, left_places and so that it prints an error indication such as ? when the number is too big to fit in the indicated number of places.
w) A good floating point print routine will round the number to the
indicated precision. Fix putfloat to do this.