11. Floating Point

Part of CS:2630, Computer Organization Notes
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

The Hawk Floating Point Coprocessor

The Hawk architecture includes two instructions reserved for communication with coprocessors. A coprocessor is a special purpose processor that operates in conjunction with the central processor. Coprocessors may be physically separate from the central processor, on a separate chip, or they may be integrated on the same chip with it. It is the logical separation that is essential. Coprocessors are commonly used for floating point, but they have also been used for graphics, cryptography, and other specialized operations.

The Hawk coprocessor instructions, COSET and COGET move data between general purpose registers and registers inside the selected coprocessor. Our focus is on coprocessor number one, the floating point coprocessor.

The Hawk coprocessor set instruction

07 06 05 04 03 02 01 00 15 14 13 12 11 10 09 08

0 0 0 1 dst 0 0 1 0 x

The Hawk coprocessor get instruction

07 06 05 04 03 02 01 00 15 14 13 12 11 10 09 08

0 0 0 1 dst 0 0 1 1 x

In these instructions, dst always refers to a CPU register, while x refers to a register in the active coprocessor, as selected by the value in coprocessor register zero, the coprocessor status register, COSTAT. Fields of this register determine which coprocessors are enabled (powered up), which coprocessor is currently selected, and in some cases, the operating mode of that coprocessor. See the Hawk manual for the full details of the fields in COSTAT. Assuming that there is a USE directive for the float.h file, so that the right definitions are avalable, the following code enables the floating point coprocessor for short (32-bit) floating point operands:

Enabling the floating point coprocessor

LIL R1, FPENAB + FPSEL COSET R1, COSTAT

**Enabling the floating point coprocessor**
	LIL R1, FPENAB + FPSEL COSET R1, COSTAT

Once the floating point coprocessor is enabled and selected, addressing coprocessor registers 1 to 15 refers specifically to registers inside the floating point coprocessor and not any other. When operating in short format, there are only two useful registers in the floating-point coprocessor, floating-point accumulators zero and one, FPA0 and FPA1, which corespond to coprocessor registers 2 and 3. Coprocessor register 1 will be ignored, for now. It is used for access to the least significant halfword of long (64-bit) floating point operands.

Floating point coprocessor registers 4 to 15 are not really registers. Rather, the operations that set these pseudo-registers are used to operate on the floating point accumulators. Another way to think about this is to say that the least significant bit of the x field is the coprocessor register select bit, while the high 3 bits of the x field are the coprocessor operation code.

The available operations are floating point add, subtract, multiply and divide, as well as square root and integer to floating conversion. Setting even floating point registers 4 to 14 causes operations on FPA0 and setting odd registers 5 to 15 operates on FPA1. For example, setting coprocessor register number 5 converts the integer stored in a general purpose register into a floating point value in FPA1. The complete set of short (32-bit) floating point operations on FPA0 is illustrated below using R3 as a source operand; the same operations are available on FPA1.

Floating point coprocessor operations

COSET R3, FPA0 ; 2 FPA0 = R3 COSET R3, FPINT+FPA0 ; 4 FPA0 = (float) R3 COSET R3, FPSQRT+FPA0 ; 6 FPA0 = sqrt( R3 ) COSET R3, FPADD+FPA0 ; 8 FPA0 = FPA0 + R3 COSET R3, FPSUB+FPA0 ; 10 FPA0 = FPA0 - R3 COSET R3, FPMUL+FPA0 ; 12 FPA0 = FPA0 * R3 COSET R3, FPDIV+FPA0 ; 14 FPA0 = FPA0 / R3

**Floating point coprocessor operations**
COSET R3, FPA0 ; 2 FPA0 = R3 COSET R3, FPINT+FPA0 ; 4 FPA0 = (float) R3 COSET R3, FPSQRT+FPA0 ; 6 FPA0 = sqrt( R3 ) COSET R3, FPADD+FPA0 ; 8 FPA0 = FPA0 + R3 COSET R3, FPSUB+FPA0 ; 10 FPA0 = FPA0 - R3 COSET R3, FPMUL+FPA0 ; 12 FPA0 = FPA0 * R3 COSET R3, FPDIV+FPA0 ; 14 FPA0 = FPA0 / R3

Floating point operations do not directly set the condition codes. Instead, When the coprocessor get instruction COGET is used to get the contents of a floating point accumulator, it sets the N and Z condition codes to show whether the floating point value is negative or zero. In addition, the C condition code is used to report floating point values that are infinite or non numeric. So, the following code would work to get the contents of FPA0 into R3 and check that it is valid:

Reading a floating point register

COGET R3, FPA0 ; R3 = FPA0 BCS INVALID ; -- go deal with invalid result

**Reading a floating point register**
COGET R3, FPA0 ; R3 = FPA0 BCS INVALID ; -- go deal with invalid result

Exercises

a) Give appropriate defines for the symbols FPA0, FPA1, FPSQRT, FPADD, FPSUB, FPMUL and FPDIV that are used as operands on the COSET instruction.
b) Given 32-bit floating point values x in R4 and y in R5, give Hawk code to enable the coprocessor, compute sqrt(x² + y²), put the result in R3 and then disable the coprocessor.

IEEE Floating Point Format

To fully specify our floating-point coprocessor, we must define not only the operations but also the data formats it uses. The Hawk, like most modern computers, uses the floating point format the Institute for Electrical and Electronic Engineers (IEEE) has defined. This follows a general outline very similar to most floating point formats developed since since the early 1960's, but it has some eccentric features.

Binary floating point numbers are closely related to decimal numbers expressed in scientific notation. Consider the number 6.02 × 10²³, Avagadro's number. This number is composed of a mantissa, 6.02, and an exponent, 23. The number base of the mantissa, 10, is also the base to which the exponent is applied, and the mantissa in scientific notation is always normalized, as shown below:

Equivalent values
60221419.9 × 10¹⁶ \
60221.4199 × 10¹⁹ not normalized
60.2214199 × 10²² /
6.02214199 × 10²³ normalized
0.602214199 × 10²⁴ not normalized

**Equivalent values**
60221419.9	×	10¹⁶	\
60221.4199	×	10¹⁹	not normalized
60.2214199	×	10²²	/
6.02214199	×	10²³	normalized
0.602214199	×	10²⁴	not normalized

Of these, only 6.02... × 10²³ is considered to be properly in scientific notation. The rule is that the mantissa must always be a decimal number from 1.000 to 9.999... The only exception is zero. When a number has a mantissa that does not satisfy this rule, we normalize it by moving the point and adjusting the exponent until it satisfies this rule.

The IEEE standard includes both 32 and 64-bit floating point numbers. Here, we will ignore the latter and focus the 32-bit format. As in scientific notation, binary floating point numbers have exponent and mantissa fields, but they are in binary, so the mantissa is in base two and the exponent is a power of two.

The IEEE single-precision floating-point representation
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00

S exponent mantissa

In the IEEE floating point formats, like most others, the most significant bit holds the sign of the mantissa, with zero meaning positive. The mantissa is stored in signed magnitude form. The magnitude of the mantissa of a 32-bit floating-point number is given to 24 bits of precision, while the exponent is stored in the 8 remaining bits. Notice that this adds up to 33 bits of sign, exponent and mantissa, evidence of some exceptional trickery. IEEE double-precision numbers differ from the above in that each number is 64 bits. This allows 11-bits for the exponent instead of 8 bits, and 53 bits for the mantissa, including one extra bit obtained by the same kind of trickery.

A Normalized Mantissa

The IEEE format gets an extra bit for the mantissa as a consequence of the mantissa normalization rule it uses. The mantissa in an IEEE format number is a binary fixed point number with one place to the left of the point. The normalization rule is similar to that used for scientific notation, so the smallest normalized mantissa value is 1.0, while the largest normalized value is 1.1111...₂, a value just less than 2.0. This means that all normalized mantissas have a most significant bit that is one. In general, if a bit is always the same, there is no point in storing it since we can take the constant value for granted. We call such a bit that does not need to be stored a hidden bit. Consider the IEEE floating point value represented as 12345678₁₆:

An example single-precision floating point number

31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00

0 0 0 1 0 0 1 0 0 0 1 1 0 1 0 0 0 1 0 1 0 1 1 0 0 1 1 1 1 0 0 0

sign = 0 (positive)
exponent = 001 0010 0
mantissa = 1.011 0100 0101 0110 0111 1000
value = 5.69046 × 10^-28

The IEEE format supports non-normalized mantissas only for the smallest possible exponent. Zero is represented by the smallest possible exponent with and a non-normalized mantissa of zero. Conveniently, this is represented as 00000000₁₆.

The Biased Exponent

Most floating point formats designed since the 1960s have used a biased representation for the exponent. In the case of the IEEE format, the bias used is an odd one, 127, and two exponent values have been singled out for special treatment. Exponents in the range from 00000001₂ to 11111110₂ are interpreted as biased values, so 00000001₂ means –126 and 11111110₂ means +127.

The exponent represented as 00000000₂ is special. It also means –126, but it implies that the hidden bit is zero, which is to say, the mantissa is not normalized. The exponent 11111111₂ is reserved for values that the IEEE calls NaNs, where NaN stands for not a number. The value of the mantissa tells what kind of NaN. The hardware uses a mantissa of zero to mean infinity. Software may use nonzero values for other meanings.

Because of the odd bias of 127 for exponents, an exponent of one is represented as 10000000₂, zero is 01111111₂, and negative one is 01111110₂. There is a competing but equivalent explanation of the IEEE format that presents the bias as 128 and places the point in the mantissa one place farther to the right. The different presentations of the IEEE system make no difference in the number representations, but they can be confusing when comparing different textbook descriptions of the system. The following table shows IEEE floating-point numbers, given in binary, along with their interpretations.

Example IEEE single-precision floating-point numbers
Infinity and NaN
0 11111111 00000000000000000000000 = Infinity
1 11111111 00000000000000000000000 = -Infinity

0 11111111 00000000000010000000000 = NaN (not a number)
1 11111111 00000010001111101000110 = NaN

Normalized numbers Sign Exp. Mant. Value
0 10000000 00000000000000000000000 = +1.0 × 2¹ × 1.00₂ = 2.0
0 01111111 00000000000000000000000 = +1.0 × 2⁰ × 1.00₂ = 1.0
0 01111111 10000000000000000000000 = +1.0 × 2⁰ × 1.10₂ = 1.5
0 01111111 11000000000000000000000 = +1.0 × 2⁰ × 1.11₂ = 1.75
1 01111111 11000000000000000000000 = –1.0 × 2⁰ × 1.11₂ = –1.75
0 00000001 00000000000000000000000 = +1.0 × 2^–126 × 1.00₂ = 2^–126

Unnormalized numbers Sign Exp. Mant. Value
0 00000000 10000000000000000000000 = +1.0 × 2^–126 × 0.10₂ = 2^–127
0 00000000 01000000000000000000000 = +1.0 × 2^–126 × 0.01₂ = 2^–128
0 00000000 00000000000000000000000 = +1.0 × 2^–126 × 0.00₂ = 0
1 00000000 00000000000000000000000 = –1.0 × 2^–126 × 0.00₂ = 0

**Example IEEE single-precision floating-point numbers**
Infinity and NaN
0 11111111 00000000000000000000000	=	Infinity
1 11111111 00000000000000000000000	=	-Infinity

0 11111111 00000000000010000000000	=	NaN (not a number)
1 11111111 00000010001111101000110	=	NaN

Normalized numbers	Sign		Exp.	Mant.	Value
0 10000000 00000000000000000000000	=	+1.0	×	2¹	× 1.00₂	= 2.0
0 01111111 00000000000000000000000	=	+1.0	×	2⁰	× 1.00₂	= 1.0
0 01111111 10000000000000000000000	=	+1.0	×	2⁰	× 1.10₂	= 1.5
0 01111111 11000000000000000000000	=	+1.0	×	2⁰	× 1.11₂	= 1.75
1 01111111 11000000000000000000000	=	–1.0	×	2⁰	× 1.11₂	= –1.75
0 00000001 00000000000000000000000	=	+1.0	×	2^–126	× 1.00₂	= 2^–126

Unnormalized numbers	Sign		Exp.	Mant.	Value
0 00000000 10000000000000000000000	=	+1.0	×	2^–126	× 0.10₂	= 2^–127
0 00000000 01000000000000000000000	=	+1.0	×	2^–126	× 0.01₂	= 2^–128
0 00000000 00000000000000000000000	=	+1.0	×	2^–126	× 0.00₂	= 0
1 00000000 00000000000000000000000	=	–1.0	×	2^–126	× 0.00₂	= 0

In the above table, note that 2^–126 is the smallest positive normalized value, with a mantissa that is all zero except the hidden bit. Everything smaller is non-normalized, with a zero exponent implying a hidden bit of zero.

Exercises

c) Give the 32-bit IEEE representation for 100₁₀.
d) Give the 32-bit IEEE representation for 1000₁₀.
e) Give the 32-bit IEEE representation for 0.1₁₀.
f) Give the 32-bit IEEE representation for 0.01₁₀.
g) Consider the following infinite loop on the Hawk:
        LIL     R3, FPENAB + FPSEL
        COSET   COSTAT
        LIS     R3, 1
L:
        COSET   R3, FPINT+FPA0
        COGET   R3, FPA0
        BR      L
Describe the sequence of values in R3, with an emphasis on the value after many iterations. An approximate upper bound on the value of R3 is a strong answer to this question.

Software Floating Point

Some computers do not have a floating point hardware. Floating point arithmetic on those machines is done by software. Such software forces us to examine floating point arithmetic algorithms, and it gives an extended implementation example. Here, we ignore compatibility with the IEEE floating point format until the end, so we can focus on clear code with simple data representations.

Software Floating Point Interface Specification

TITLE float.h, interface specification for float.a FLOATSIZE = 8 ; size of a floating point number, in bytes. ; the user gets no other information about the ; number representation ; for all calling sequences here: ; R1 = return address ; R2 = pointer to activation record ; R3-7 = parameters and temporaries ; R8-15 = guaranteed to be unchanged on return ; functions that return floating values use: ; R3 = pointer to place to put return value ; the caller must pass this pointer! EXT FLOAT ; convert integer to floating ; expects R3 = pointer to floating result ; R4 = integer to convert EXT FLTINT ; convert floating to integer ; expects R3 = pointer to floating value ; returns R3 = integer return value EXT FLTCPY ; copy a floating point number ; expects R3 = pointer to floating result ; R4 = pointer to floating operand EXT FLTTST ; test sign and zeroness of floating number ; expects R3 = pointer to floating value ; returns R3 = integer -1, 0 or 1 EXT FLTADD ; add floating-point numbers ; expects R3 = pointer to floating result ; R4 = pointer to addend ; R5 = pointer to augend EXT FLTSUB ; subtract floating-point numbers ; expects R3 = pointer to floating result ; R4 = pointer to subtrahend ; R5 = pointer to minuend EXT FLTNEG ; negate a floating-point number ; expects R3 = pointer to floating result ; R4 = pointer to operand EXT FLTMUL ; multiply floating-point numbers ; expects R3 = pointer to floating result ; R4 = pointer to multiplicand ; R5 = pointer to multiplier EXT FLTDIV ; divide floating-point numbers ; expects R3 = pointer to floating result ; R4 = pointer to multiplicand ; R5 = pointer to multiplier

**Software Floating Point Interface Specification**
TITLE float.h, interface specification for float.a FLOATSIZE = 8 ; size of a floating point number, in bytes. ; the user gets no other information about the ; number representation ; for all calling sequences here: ; R1 = return address ; R2 = pointer to activation record ; R3-7 = parameters and temporaries ; R8-15 = guaranteed to be unchanged on return ; functions that return floating values use: ; R3 = pointer to place to put return value ; the caller must pass this pointer! EXT FLOAT ; convert integer to floating ; expects R3 = pointer to floating result ; R4 = integer to convert EXT FLTINT ; convert floating to integer ; expects R3 = pointer to floating value ; returns R3 = integer return value EXT FLTCPY ; copy a floating point number ; expects R3 = pointer to floating result ; R4 = pointer to floating operand EXT FLTTST ; test sign and zeroness of floating number ; expects R3 = pointer to floating value ; returns R3 = integer -1, 0 or 1 EXT FLTADD ; add floating-point numbers ; expects R3 = pointer to floating result ; R4 = pointer to addend ; R5 = pointer to augend EXT FLTSUB ; subtract floating-point numbers ; expects R3 = pointer to floating result ; R4 = pointer to subtrahend ; R5 = pointer to minuend EXT FLTNEG ; negate a floating-point number ; expects R3 = pointer to floating result ; R4 = pointer to operand EXT FLTMUL ; multiply floating-point numbers ; expects R3 = pointer to floating result ; R4 = pointer to multiplicand ; R5 = pointer to multiplier EXT FLTDIV ; divide floating-point numbers ; expects R3 = pointer to floating result ; R4 = pointer to multiplicand ; R5 = pointer to multiplier

The interface specification for a class should list all of the methods applicable to objects of that class, and for each method, it should specify the parameters, constraints on those parameters, and the nature of the result. The class implementation adds the details of object representation and the specific algorithms for each method. It is good practice to write the interface specification so it serves as a user manual for the class.

For our floating-point class, we will define add, subtract, multiply and divide, plus tools for conversion to and from integers. Of course, other operations can be added. Since users of this class need not know how numbers are represented, we pass object handles (pointers) as parameters. Thus, users never need to know how many registers a number might occupy.

Finally, the interface specificaiton for a class should indicate how to allocate storage for variables in the class. The only thing a user of an object needs to know is the size of the object. The above interface specification assumes that each floating point number is stored in two words of memory, enough for an exponent and a mantissa of one word each, although the user need not know how the words are used.

Exercises

h) Write a main program that uses a separate common block of size FLOATSIZE to hold each floating point variable used in the computation of the floating point representation of 0.1, computed by converting the integer constants 1 and 10 to floating point and then dividing one by the other. This should call FLOAT twice and then call to FLTDIV.
i) Write a separately compilable subroutine called SQUARE that takes 2 pointers to floating point numbers as parameters and returns the square of the second number in the first. Also write the interface specification and comment everyting appropriately.

A floating point representation

The simplest floating point representation from a software perspective is a record containing two fields of one word each, the exponent and the mantissa. Of course, we must specify which word is which and how each is represented.

Since our computer uses two's complement arithmetic, it makes sense to use that for both the exponent and mantissa. We can represent zero using a mantissa of zero; in this case, the exponent does not matter, but by convention, we will set it to the smallest (most negative) value. Note that this format has no special cases for non-normalized numbers or non-numeric values.

The more difficult question is, where is the point in the mantissa? We could put it anywhere, but there are two good choices: The mantissa could be an integer, or the point could be just right of the sign bit. We will do the latter, with the mantissa normalized so the bit just right of the point is always one. Note that this format has no hidden bit.

A floating-point number representation
exponent 00000000000000000000000000000000 +0.5 × 2⁰ = 0.5
mantissa 01000000000000000000000000000000

exponent 00000000000000000000000000000001 +0.5 × 2¹ = 1.0
mantissa 01000000000000000000000000000000

exponent 00000000000000000000000000000001 +0.75 × 2¹ = 1.5
mantissa 01100000000000000000000000000000

exponent 00000000000000000000000000000001 -0.75 × 2¹ = –1.5
mantissa 10100000000000000000000000000000

exponent 11111111111111111111111111111111 +0.5 × 2^–1 = 0.25
mantissa 01000000000000000000000000000000

exponent 11111111111111111111111111111101 +0.5 × 2^–3 = 0.0625
mantissa 01000000000000000000000000000000

exponent 11111111111111111111111111111101 ~8/10 × 2^–3 =~ 0.1
mantissa 01100110011001100110011001100110

**A floating-point number representation**
exponent	`00000000000000000000000000000000`	+0.5 × 2⁰ = 0.5
mantissa	`01000000000000000000000000000000`

exponent	`00000000000000000000000000000001`	+0.5 × 2¹ = 1.0
mantissa	`01000000000000000000000000000000`

exponent	`00000000000000000000000000000001`	+0.75 × 2¹ = 1.5
mantissa	`01100000000000000000000000000000`

exponent	`00000000000000000000000000000001`	-0.75 × 2¹ = –1.5
mantissa	`10100000000000000000000000000000`

exponent	`11111111111111111111111111111111`	+0.5 × 2^–1 = 0.25
mantissa	`01000000000000000000000000000000`

exponent	`11111111111111111111111111111101`	+0.5 × 2^–3 = 0.0625
mantissa	`01000000000000000000000000000000`

exponent	`11111111111111111111111111111101`	~8/10 × 2^–3 =~ 0.1
mantissa	`01100110011001100110011001100110`

Exercises

j) In this number system, what is the largest possible positive value (in binary).
k) In this number system, what is the smallest possible positive nonzero normalized value?
l) In this number system, how is 10.0₁₀ represented.
m) In this number system, how is 100.0₁₀ represented.

Normalizing a floating point number

Many operations on floating point numbers produce unnormalized results. If we do not normalize them before performing the next operation, we risk loss of precision. This is why classical scientific notation requires all values to be normalized after each operation. To normalize a number, we must first check for zero, since zero cannot be normalized. Our decision to use two's complement for the mantiss requires a special case for negative numbers. This is why many hardware floating-point systems use signed magnitude for their mantissas.

The normalize routine is not part of the public interface to our floating point package; rather, it a private component, used as the final step of most floating point operations. Therefore, we can write it with the assumption that operands are passed in registers instead of using object handles. We will use R3 and R4 to hold the exponent and mantissa of the number to be normalized, on both entrance and exit.

Normalizing a floating-point number

SUBTITLE floating point normalize NORMALIZE: ; expects R3 = exponent ; R4 = mantissa ; returns R3 = exponent ; R4 = mantissa TESTR R4 BZR NRMNZ ; if (mantissa == 0) { LIL R3,#800000 SL R3,8 ; exponent = 0x80000000; JUMPS R1 ; return; NRMNZ: BNS NRMNEG ; } /* else */ if (mantissa > 0) { NRMPLP: ; while BITTST R4,30 BBS NRMPRT ; ((mantissa & 0x40000000) == 0) { SL R4,1 ; mantissa = mantissa << 1; ADDSI R3,-1 ; exponent = exponent - 1; BR NRMPLP ; } NRMPRT: JUMPS R1 ; return; NRMNEG: ; } /* else if */ { /* mantissa < 0 */ ADDSI R4,-1 ; mantissa = mantissa - 1; ; /* mantissa now in one's complement form */ NRMNLP: ; while BITTST R4,30 BBR NRMNRT ; ((mantissa & 0x40000000) != 0) { SL R4,1 ; mantissa = mantissa << 1; ADDSI R3,-1 ; exponent = exponent - 1; BR NRMPLP ; } NRMNRT: ADDSI R4,1 ; mantissa = mantissa + 1; ; /* mantissa now in two's complement form */ JUMPS R1 ; return; ; }

**Normalizing a floating-point number**
SUBTITLE floating point normalize NORMALIZE: ; expects R3 = exponent ; R4 = mantissa ; returns R3 = exponent ; R4 = mantissa TESTR R4 BZR NRMNZ ; if (mantissa == 0) { LIL R3,#800000 SL R3,8 ; exponent = 0x80000000; JUMPS R1 ; return; NRMNZ: BNS NRMNEG ; } /* else / if (mantissa > 0) { NRMPLP: ; while BITTST R4,30 BBS NRMPRT ; ((mantissa & 0x40000000) == 0) { SL R4,1 ; mantissa = mantissa << 1; ADDSI R3,-1 ; exponent = exponent - 1; BR NRMPLP ; } NRMPRT: JUMPS R1 ; return; NRMNEG: ; } / else if / { / mantissa < 0 / ADDSI R4,-1 ; mantissa = mantissa - 1; ; / mantissa now in one's complement form / NRMNLP: ; while BITTST R4,30 BBR NRMNRT ; ((mantissa & 0x40000000) != 0) { SL R4,1 ; mantissa = mantissa << 1; ADDSI R3,-1 ; exponent = exponent - 1; BR NRMPLP ; } NRMNRT: ADDSI R4,1 ; mantissa = mantissa + 1; ; / mantissa now in two's complement form */ JUMPS R1 ; return; ; }

We used BITTST to test bit 30 of the mantissa above. This macro moves the indicated bit to one of the condition codes with a shift instruction that depends on the bit being tested. BITTST sets different condition codes depending on the bit being tested, so it defines two macros, BBS and BBR as the appropriate conditional branches to check if the bit was set or reset. In C, C++ or Java, bits are usually tested by anding with a constant with just that bit set.

To normalize negative numbers, we converted to one's complement form. In this form, bit 30 is always zero when the number is normalized. It would have been just as easy to negate the number, normalized it as a positive value, and then negate it again.

Exercises

n) The above code does not detect underflow. Decrementing the exponent below the smallest legal value gives the highest legal value. Fix this error.

Integer to Floating Conversion

Conversion from integer to floating is easy, we simply set the exponent field to 31 and then normalize. This is because our mantissa representation can can be thought of as an integer in units of 2^–31. The code below is optimized, it makes use of the known register use of NORMALIZE and it saves the return address in a register.

Integer to Floating Conversion on the Hawk

; format of a floating point number stored in memory EXPONENT = 0 MANTISSA = 4 FLOATSIZE = 8 SUBTITLE integer to floating conversion FLOAT: ; expects R3 = result -- floating result pointer ; R4 = number -- integer to convert ; uses R5 = return address ; R6 = result pointer MOVE R5,R1 ; -- set aside return address MOVE R6,R3 ; -- set aside result LIS R3,31 ; exponent = 31 ; mantissa = number JSR R1,NORMALIZE ; normalize( R3-4 ) STORES R3,R6 ; result->exponent = exponent STORE R4,R6 MANTISSA; result->mantissa = mantissa JSRS R5 ; return -- uses saved return address!

**Integer to Floating Conversion on the Hawk**
; format of a floating point number stored in memory EXPONENT = 0 MANTISSA = 4 FLOATSIZE = 8 SUBTITLE integer to floating conversion FLOAT: ; expects R3 = result -- floating result pointer ; R4 = number -- integer to convert ; uses R5 = return address ; R6 = result pointer MOVE R5,R1 ; -- set aside return address MOVE R6,R3 ; -- set aside result LIS R3,31 ; exponent = 31 ; mantissa = number JSR R1,NORMALIZE ; normalize( R3-4 ) STORES R3,R6 ; result->exponent = exponent STORE R4,R6 MANTISSA; result->mantissa = mantissa JSRS R5 ; return -- uses saved return address!

Floating to Integer Conversion

Conversion of floating-point numbers to integer is a bit more complex, but only because we have no pre-written denormalize routine that will set the exponent field to 31. Instead, we need to write this ourselves. Where the normalize routine shifted the mantissa left and decremented the exponent, floating to integer conversion involves shifting the mantissa right and incrementing the exponent until the exponent reaches 31.

This leaves open the question of what happens if the initial value of the exponent was greater than 31. The answer is, in that case, the integer part of the number is too large to represent in 32 bits. In this case, we should raise an exception, or, lacking a model of how to write exception handlers, we could set the overflow condition code. Here, this is left as an exercise for the reader.

Floating to Integer Conversion on the Hawk

SUBTITLE floating to integer conversion FLTINT: ; expects R3 = number -- pointer to floating value ; returns R3 = result -- an integer ; uses R4 = exponent LOADS R4,R3 ; exponent = number->exponent LOAD R3,R3,MANTISSA ; result = number->mantissa FINTLP: ; while CMPI R4,31 BGE FINTLX ; (exponent < 31) { SR R3,1 ; result = result >> 1 ADDSI R4,1 ; exponent = exponent + 1; BR FINTLP ; } FINTLX: ; unchecked error condition: exponent > 31 implies overflow JUMPS R1 ; return result

**Floating to Integer Conversion on the Hawk**
SUBTITLE floating to integer conversion FLTINT: ; expects R3 = number -- pointer to floating value ; returns R3 = result -- an integer ; uses R4 = exponent LOADS R4,R3 ; exponent = number->exponent LOAD R3,R3,MANTISSA ; result = number->mantissa FINTLP: ; while CMPI R4,31 BGE FINTLX ; (exponent < 31) { SR R3,1 ; result = result >> 1 ADDSI R4,1 ; exponent = exponent + 1; BR FINTLP ; } FINTLX: ; unchecked error condition: exponent > 31 implies overflow JUMPS R1 ; return result

Exercises

o) The above code for FLTINT truncates the result in an odd way for negative numbers. If the floating point input value is –1.5, what integer does it return? Why?
p) The above code for FLTINT truncates the result in an odd way for negative numbers. Fix the code so that it truncates the way a naive programmer would expect.
q) The above code for FLTINT truncates, but sometimes, it is desirable to have a version that rounds a number to the nearest integer. Binary numbers can be rounded by adding one in the most significant discarded digit and then truncating, Write code for FLTROUND that does this.
r) The above code for FLTINT could do thousands of right shifts for numbers with very negative exponents, an utter waste. Modify the code so that it automatically recognizes these extreme cases and returns a value of zero whenever more than 32 shifts would be required.

Floating Point Addition

We can now explore the implementation of more interesting operations. These follow naturally from the standard rules for working with numbers in scientific notation. Consider the problem of adding 9.92×10³ to 9.25×10¹. We begin by denormalizing the numbers so that they have the same exponents; this allows us to add the mantissas, after which we renormalize the result and round it to the appropriate number of places:

Adding in scientific notation
given 9.92 × 10³ + 9.25 × 10¹
denormalized 9.92 × 10³ + 0.0925 × 10³
rearranged (9.92 + 0.0925) × 10³
added 10.0125 × 10³
normalized 1.00125 × 10⁴
rounded 1.00 × 10⁴

**Adding in scientific notation**
given	9.92 × 10³	+	9.25	×	10¹
denormalized	9.92 × 10³	+	0.0925	×	10³
rearranged	(9.92 + 0.0925)	×	10³
added	10.0125	×	10³
normalized	1.00125	×	10⁴
rounded	1.00	×	10⁴

The final rounding step is one many students forget, particularly in the age of pocket calculators. In scientific notation, we have the convention that the number of digits given indicates the precision of the measurements from which a number wase taken. As a result, if two numbers are given in scientific notation and then added or subtracted, the result should not be expressed to greater precision than the least precise operand. When throwing away the less significant digits of the result, we always round to minimise the information loss and avoid introducing systematic errors.

An important question is which operand do we denormalize before adding? The the answer is, we never want to lose the most significant digits of the sum, so we always increase the smaller of the exponents while shifting the corresponding mantissa right. To do this, we need to sort the arguments into a standard order before proceeding.

There are many ways to exchange the contents of two registers. We will use the most straightforward approach, setting the value in one register aside, moving the other register, and then moving the first value into its final place. This takes three move instructions and a spare register. There are ways to do this without an extra register. The most famous and cryptic of these uses the exclusive or operator: a=a⊕b;b=a⊕b;a=a⊕b.

Also note that when we add two numbers, there will sometimes be a carry into the sign bit, which is to say, an overflow. On pencil and paper, we stop propagating the carry before it changes the sign and instead, wedge in an extra bit and note the need to normalize the result. In software, we must undo the damage done by the overflow and then normalize. The following floating point add subroutine solves these problems:

Adding two floating point numbers on the Hawk

SUBTITLE floating add ; activation record format RETAD = 0 ; return address R8SAVE = 4 ; place to save R8 FLTADD: ; expects R3 = pointer to floating sum ; R4 = pointer to addend ; R5 = pointer to augend STORES R1,R2 ; -- save return address STORE R8,R2,R8SAVE ; -- save R8 MOVE R7,R3 ; R7 = saved pointer to sum LOADS R3,R4 ; R3 = addend.exponent LOAD R4,R4,MANTISSA ; R4 = addend.mantissa LOAD R6,R5,MANTISSA ; R6 = augend.mantissa LOADS R5,R5 ; R5 = augend.exponent CMP R3,R5 BLE FADDEI ; if (addend.exponent > augend.exponent) { MOVE R8,R3 MOVE R3,R5 MOVE R5,R8 ; exchange exponents MOVE R8,R4 MOVE R4,R6 MOVE R6,R8 ; exchange mantissas FADDEI: ; } -- now addend.exponent <= augend.exponent) FADDDL: ; while CMP R3,R5 BGE FADDDX ; (addend.exponent < augend.exponent) { ADDSI R3,1 ; increment addend.exponent SR R4,1 ; shift addend.mantissa BR FADDDL FADDDX: ; } -- now addend.exponent = augend.exponent ADD R4,R6 ; add mantissas BVR FADDNO ; if (overflow) { /* we need one more bit */ ADDSI R3,1 ; increment result.exponent SR R4,1 ; shift result.mantissa SUB R0,R0,R0 ; set carry bit in order to ... ADJUST R4,CMSB ; flip sign bit of result (overflow repaired!) FADDNO: ; } JSR R1,NORMALIZE ; normalize( result ) STORES R3,R7 ; save result.exponent STORE R4,R7,MANTISSA ; save result.mantissa LOAD R8,R2,R8SAVE ; restore R8 LOADS R1,R2 ; restore return address JUMPS R1 ; return!

**Adding two floating point numbers on the Hawk**
SUBTITLE floating add ; activation record format RETAD = 0 ; return address R8SAVE = 4 ; place to save R8 FLTADD: ; expects R3 = pointer to floating sum ; R4 = pointer to addend ; R5 = pointer to augend STORES R1,R2 ; -- save return address STORE R8,R2,R8SAVE ; -- save R8 MOVE R7,R3 ; R7 = saved pointer to sum LOADS R3,R4 ; R3 = addend.exponent LOAD R4,R4,MANTISSA ; R4 = addend.mantissa LOAD R6,R5,MANTISSA ; R6 = augend.mantissa LOADS R5,R5 ; R5 = augend.exponent CMP R3,R5 BLE FADDEI ; if (addend.exponent > augend.exponent) { MOVE R8,R3 MOVE R3,R5 MOVE R5,R8 ; exchange exponents MOVE R8,R4 MOVE R4,R6 MOVE R6,R8 ; exchange mantissas FADDEI: ; } -- now addend.exponent <= augend.exponent) FADDDL: ; while CMP R3,R5 BGE FADDDX ; (addend.exponent < augend.exponent) { ADDSI R3,1 ; increment addend.exponent SR R4,1 ; shift addend.mantissa BR FADDDL FADDDX: ; } -- now addend.exponent = augend.exponent ADD R4,R6 ; add mantissas BVR FADDNO ; if (overflow) { /* we need one more bit */ ADDSI R3,1 ; increment result.exponent SR R4,1 ; shift result.mantissa SUB R0,R0,R0 ; set carry bit in order to ... ADJUST R4,CMSB ; flip sign bit of result (overflow repaired!) FADDNO: ; } JSR R1,NORMALIZE ; normalize( result ) STORES R3,R7 ; save result.exponent STORE R4,R7,MANTISSA ; save result.mantissa LOAD R8,R2,R8SAVE ; restore R8 LOADS R1,R2 ; restore return address JUMPS R1 ; return!

Most of the above code follows simply from the logic of adding that we demonstrated with the addition of two numbers using scientific notation. There are some points, however, that are worthy of note.

The above code uses saves R1 in its activation record to allow the call to NORMALIZE, and it saves R8 so that can be used as a temporary during the exchange operations. This code does not adjust the stack pointer before and after the call to NORMALIZE because that routine does not use its activation record.

Finally, there is the issue of dealing with overflow during addition. After addition, when the sign is wrong, interpreted as a sign bit, it does have the correct value as the most significant bit of the magnitude, as if there were an invisible sign bit to the left of it. Therefore, after a signed right shift to make space for the new sign bit (incrementing the exponent to compensate for this) we can complement the sign by adding one to it. There are many ways to do this. The approach used here using the ADJUST instruction has the advantage that it does not use any extra registers.

Exercises

s) The floating point add code given here is rather stupid about shifting. It should never right-shift the lesser of the two addends more than 32 times! Fix this!
t) Fix this code so that the denormalize step rounds the lesser of the two addends by adding one to the least significant bit just prior to the final right shift operation.

Floating Point Multiplication

Starting with a working integer multiply routine, floating point multiplication is simpler than floating point addition. This simplicity is apparent in the algorithm for multiplying in scientific notation: Add the exponents, multiply the mantissas and normalize the result, as illustrated below:
Multiplication in scientific notation
given 1.02 × 10³ × 9.85 × 10¹
rearranged (1.02 × 9.85) × 10^{(3 + 1)}
multiplied 10.047 × 10⁴
normalized 1.0047 × 10⁵
rounded 1.00 × 10⁵

**Multiplication in scientific notation**
given		1.02 × 10³ × 9.85 × 10¹
rearranged		(1.02 × 9.85) × 10^{(3 + 1)}
multiplied		10.047 × 10⁴
normalized		1.0047 × 10⁵
rounded		1.00 × 10⁵

Unlike addition, we need not denormalize anything before the operation. The one new issue we face is the matter of precision. Multiplying two 32-bit mantissas gives a 64-bit result. We will assume a signed multiply routine that delivers this result, with the following calling sequence:

A signed multiply interface specification

MULTIPLYS: ; expects R3 = multiplier ; R4 = multiplicand ; returns R3 = product, low bits ; R4 = product, high bits ; uses R5, R6, does not use activation record

**A signed multiply interface specification**
MULTIPLYS: ; expects R3 = multiplier ; R4 = multiplicand ; returns R3 = product, low bits ; R4 = product, high bits ; uses R5, R6, does not use activation record

If the multiplier and multiplicand have 31 places after the point in each, then the 64-bit product has 62 places after the point. Therefore, to normalize the result, we will always shift it one place. If the multiplier and multiplicand are normalized to have minimum absolute values of 0.5, the product will have a minimum absolute value of 0.25. Normalizing such a small product will require an additional shift, but never more than one. We must use 64-bit shifts for thiese normalize steps in order to avoid loss of precision, so we cannot use the normalize code we used with addition, subtraction and conversion from binary to floating point. As a result, most of the following is concerned with normalization.

Multiplying two floating point numbers on the Hawk

SUBTITLE floating multiply ; activation record format RETAD = 0 ; return address PRODUCT = 0 ; saved product pointer FLTMUL: ; expects R3 = prod, product pointer ; R4 = plier, multiplier pointer ; R5 = icand, multiplicand pointer ; uses R6-7 STORES R1,R2 ; -- save return address STORE R3,R2,PRODUCT ; -- save pointer to product LOADS R6,R4 ; R6 = plier.exponent LOADS R7,R5 ; R7 = icand.exponent ADD R7,R6,R7 ; R7 = prod.exponent LOAD R3,R4,MANTISSA ; -- parameter R3 = plier.mantissa LOAD R4,R5,MANTISSA ; -- parameter R4 = icand.mantissa LIL R1,MULTIPLYS JSRS R1,R1 ; prod.mantissa = multiplys( ... ) ; -- note it has 2 bits left of the point SL R3,1 ADDC R4,R4 ; prod.mantissa = product.mantissa << 1 BNS FMULN ; if (prod.mantissa > 0) { BITTST R4,30 BBS FMULOK ; if (prod.mantissa not normalized) { SL R3,1 ADDC R4,R4 ; prod.mantissa = prod.mantissa << 1 ADDSI R7,-1 ; prod.exponent = prod.exponent - 1 BR FMULOK ; } FMULN: ; } else { -- prod.mantissa < 0 ADDSI R3,-1 BCS FMULNC ADDSI R4,-1 ; prod.mantissa = prod.mantissa - 1 FMULNC: ; -- mantissa now in one's complement form BITTST R4,30 BBR FMULNOK ; if (prod.mantissa not normalized) { SL R3,1 ADDC R4,R4 ; prod.mantissa = prod.mantissa << 1 ADDSI R7,-1 ; prod.exponent = prod.exponent - 1 FMULNOK: ; } ADDSI R3,1 ADDC R4,R0 ; prod.mantissa = prod.mantissa + 1 FMULOK: ; } -- mantissa now normalized LOAD R5,R2,PRODUCT STORES R7,R5 ; -- save prod.exponent STORE R4,R5 ; -- save prod.mantissa LOADS R1,R2 ; -- restore return address JUMPS R1 ; return

**Multiplying two floating point numbers on the Hawk**
SUBTITLE floating multiply ; activation record format RETAD = 0 ; return address PRODUCT = 0 ; saved product pointer FLTMUL: ; expects R3 = prod, product pointer ; R4 = plier, multiplier pointer ; R5 = icand, multiplicand pointer ; uses R6-7 STORES R1,R2 ; -- save return address STORE R3,R2,PRODUCT ; -- save pointer to product LOADS R6,R4 ; R6 = plier.exponent LOADS R7,R5 ; R7 = icand.exponent ADD R7,R6,R7 ; R7 = prod.exponent LOAD R3,R4,MANTISSA ; -- parameter R3 = plier.mantissa LOAD R4,R5,MANTISSA ; -- parameter R4 = icand.mantissa LIL R1,MULTIPLYS JSRS R1,R1 ; prod.mantissa = multiplys( ... ) ; -- note it has 2 bits left of the point SL R3,1 ADDC R4,R4 ; prod.mantissa = product.mantissa << 1 BNS FMULN ; if (prod.mantissa > 0) { BITTST R4,30 BBS FMULOK ; if (prod.mantissa not normalized) { SL R3,1 ADDC R4,R4 ; prod.mantissa = prod.mantissa << 1 ADDSI R7,-1 ; prod.exponent = prod.exponent - 1 BR FMULOK ; } FMULN: ; } else { -- prod.mantissa < 0 ADDSI R3,-1 BCS FMULNC ADDSI R4,-1 ; prod.mantissa = prod.mantissa - 1 FMULNC: ; -- mantissa now in one's complement form BITTST R4,30 BBR FMULNOK ; if (prod.mantissa not normalized) { SL R3,1 ADDC R4,R4 ; prod.mantissa = prod.mantissa << 1 ADDSI R7,-1 ; prod.exponent = prod.exponent - 1 FMULNOK: ; } ADDSI R3,1 ADDC R4,R0 ; prod.mantissa = prod.mantissa + 1 FMULOK: ; } -- mantissa now normalized LOAD R5,R2,PRODUCT STORES R7,R5 ; -- save prod.exponent STORE R4,R5 ; -- save prod.mantissa LOADS R1,R2 ; -- restore return address JUMPS R1 ; return

This code is oversimplified: What if the product is zero? In this case, we should set the exponent to the most negative value. Also, this code does not detect overflow or underflow in the exponent field.

Exercises

u) Fix this code to return zero on underflow and the maximum value with the right sign on overflow.
v) Fix this floating point multiply code so it correctly handles the value zero.
w) Write code for a floating point divide routine.

Other Operations

Multiply and divide are not enough. Our commitment to strong abstraction discourages users from examining the representations of floating point numbers. Floating point hardware designers do not face this constraint. They advertise the exact number format and encourage users to take advantage of it. If we do not do this, we must provide tools for operations that may be trivial, such as testing the sign of numbers, or testing for zero.

We also need to support import and export of floating point numbers. Users need to convert to and from textual and IEEE standard format. We convert from our format to IEEE format by picking off the sign, aligning the point in the mantissa, and then dealing with the exponent. Our 32-bit exponent has an extraordinary range, so overflow and underflow are problems. The following somewhat oversimplified C code does this:

Packing an IEEE Floating point value in C

unsigned int ieeepack( int exponent, int mantissa ) { /* first split off the sign */ int sign = 0; if (mantissa < 0) { mantissa = -mantissa; sign = 0x80000000; } /* convert */ if (exponent > 128) { /* convert overflow to infinity */ mantissa = 0; exponent = 0x7F800000; } else if (exponent < -148) { /* convert underflow to zero */ mantissa = 0; exponent = 0; sign = 0; /* avoids returning negative zero */ } else if (exponent < -125) { /* denormalized conversion */ while (exponent < -125) { mantissa = mantissa >> 1; exponent = exponent + 1; } mantissa = mantissa >> 7; exponent = 0x00000000; } else { /* normalized conversion */ mantissa = (mantissa >> 7) & 0x007FFFFF; exponent = (exponent + 126) << 23 } return sign | exponent | mantissa; }

**Packing an IEEE Floating point value in C**
unsigned int ieeepack( int exponent, int mantissa ) { /* first split off the sign / int sign = 0; if (mantissa < 0) { mantissa = -mantissa; sign = 0x80000000; } / convert / if (exponent > 128) { / convert overflow to infinity / mantissa = 0; exponent = 0x7F800000; } else if (exponent < -148) { / convert underflow to zero / mantissa = 0; exponent = 0; sign = 0; / avoids returning negative zero / } else if (exponent < -125) { / denormalized conversion / while (exponent < -125) { mantissa = mantissa >> 1; exponent = exponent + 1; } mantissa = mantissa >> 7; exponent = 0x00000000; } else { / normalized conversion */ mantissa = (mantissa >> 7) & 0x007FFFFF; exponent = (exponent + 126) << 23 } return sign \| exponent \| mantissa; }

The bias of the IEEE format is 127, yet in the above code, we used a bias of 126. This is because the code is doing two things with one operation. First, it is decrementing the original exponent by one to account for the fact that our numbers were normalized in the range 0.5 to 1.0, while IEEE numbers are normalized in the range 1.0 to 2.0. Second, it is adding 127, the actual bias.

The range check on the exponent in the above code uses the bounds –148, –125 and 128 for denormalized and normalized numbers instead of the bounds –149, –126 and 127 specified for the IEEE format. Again, this difference is because of the change in normalization rule between the IEEE model and our word-based model.

The above code is not perfect. It truncates, discarding the insignificant bits instead of rounding to the nearest IEEE floating point value. Rounding is preferred because it halves the maximum error in conversion.

Hawk code to unpack an IEEE-format floating-point number

SUBTITLE unpack an IEEE-format floating point number FLTIEEE:; expects R3 = result, pointer to a 2-word floating point number ; R4 = input, the number in IEEE format ; uses R5 = a temporary MOVE R5,R4 ; -- work on exponent SL R5,1 ; -- discard sign bit SR R5,12 SR R5,12 ; -- discard mantissa ADDI R5,R5,-126 ; exp = ((input & 0x7F800000)>>23) - 126 STORES R5,R3 ; result.exponent = exp MOVE R5,R4 ; -- work with mantissa SL R5,9 ; -- discard exponent and sign SR R5,1 ; -- make space for hidden bit SUB R0,R0,R0 ; -- set carry bit ADJUST R5,CMSB ; -- move carry to hidden bit SR R5,1 ; mant = ((input & 0x007FFFFF)<<7) | 0x40000000 TESTR R4 BNR FIEEEPOS ; if (input < 0) { NET R5,R5 ; mant = -mant FIEEEPOS: ; } STORE R5,R3,MANTISSA ; result.mantissa = mant JUMPS R1 ; return

**Hawk code to unpack an IEEE-format floating-point number**
SUBTITLE unpack an IEEE-format floating point number FLTIEEE:; expects R3 = result, pointer to a 2-word floating point number ; R4 = input, the number in IEEE format ; uses R5 = a temporary MOVE R5,R4 ; -- work on exponent SL R5,1 ; -- discard sign bit SR R5,12 SR R5,12 ; -- discard mantissa ADDI R5,R5,-126 ; exp = ((input & 0x7F800000)>>23) - 126 STORES R5,R3 ; result.exponent = exp MOVE R5,R4 ; -- work with mantissa SL R5,9 ; -- discard exponent and sign SR R5,1 ; -- make space for hidden bit SUB R0,R0,R0 ; -- set carry bit ADJUST R5,CMSB ; -- move carry to hidden bit SR R5,1 ; mant = ((input & 0x007FFFFF)<<7) \| 0x40000000 TESTR R4 BNR FIEEEPOS ; if (input < 0) { NET R5,R5 ; mant = -mant FIEEEPOS: ; } STORE R5,R3,MANTISSA ; result.mantissa = mant JUMPS R1 ; return

Conversion from IEEE format to our eccentric software format is fairly easy because our exponent and mantissa fields are larger than those of the single-precision IEEE format. Thus, we can convert with no loss of precision. This code presented above ignores the possibility that the value might be a NaN or infinity.

This code makes extensive use of shifting to clear fields within the number. Thus, instead of writing n&0xFFFFFF00, we write (n>>8)<<8. This trick is useful on many machines where loading a large constant is significantly slower than a shift instruction. By doing this, we avoid both loading a long constant into a register and using an extra register to hold it. On the Hawk, this optimization is effective only for shift counts of 16 or less. Shifting by a greater amount takes two consecutive shift instructions, in which case, it may be faster to load a long constant using LIL followed by an AND instruction.

Another trick is used in the above code to set the implicit one bit. Instead of or-ing a long constant with the mantissa, the code uses a subtract instruction to set the carry bit, and then it uses an ADJUST instruction to or the carry bit into the most significant bit.

Conversion to Decimal

A well designed floating point package will include a complete set of tools for conversion to and from decimal textual representations, but our purpose here is to use the conversion problem to illustrate the use of our floating point package, so we will write our conversion code as user-level code, making no use of any details of the floating point abstraction that are not described in the header file for the package.

First, consider the problem of printing a floating point number using only the operations we have defined, ignoring the complexity of assembly language and focusing on the algorithm. We can begin by taking the non-rounded integer part of the number and printing that, followed by a point, but the question is, how do we continue from there, printing the digits after the point?

To print the fractional part of a number, we will take the integer part of the number and subtract it from the number, leaving just the fraction; this is equivalent to taking the number mod one. Multiplying the fractional part by ten brings one decimal digit of the fraction above the point. This can be repeated to extract each successive digit. This is not particularly efficient, but it works. Here is C code for this:

C code to print a floating point number

void putflt( float num, int places ) { int inum; /* the integer part */ if (num < 0) { /* make it positive and print the sign */ num = -num; putchar( '-' ); } /* first put out integer part */ inum = fltint( num ); putdecu( inum, 0 ); putchar( '.' ); /* second put out digits of the fractional part */ for (; places > 0; places--) { num = (num - float(inum)) * 10.0; inum = fltint( num ); putchar( inum + '0' ); } }

**C code to print a floating point number**
void putflt( float num, int places ) { int inum; /* the integer part / if (num < 0) { / make it positive and print the sign / num = -num; putchar( '-' ); } / first put out integer part / inum = fltint( num ); putdecu( inum, 0 ); putchar( '.' ); / second put out digits of the fractional part / for (; places > 0; places--) { num = (num - float(inum)) 10.0; inum = fltint( num ); putchar( inum + '0' ); } }

Before writing assembly code, we will rewrite this so the code passes handles to floating point numbers, not the numbers themselves, and so it calls the right subroutines to do arithmetic.

Lower level C code to print a floating point number

void putflt( float *pnum, int places ) { float num; /* a copy of the number */ float tmp; /* a temporary floating point number */ float ten; /* a constant floating value */ int inum; /* the integer part */ int i; /* loop counter */ float( &ten, 10 ); if (flttst( &num ) < 0) { /* make it positive, print the sign */ fltneg( &num, pnum ); putchar( '-' ); } else { fltcpy( &num, pnum ); } /* first put out integer part */ inum = fltint( &num ); putdecu( inum, 0 ); putchar( '.' ); /* second put out digits of the fractional part */ while (places > 0) { float( &tmp, inum ); fltsub( &num, &num, &tmp ); fltmul( &num, &num, &ten ); inum = fltint( &num ); putchar( inum + '0' ); places = places - 1; } }

**Lower level C code to print a floating point number**
void putflt( float pnum, int places ) { float num; / a copy of the number / float tmp; / a temporary floating point number / float ten; / a constant floating value / int inum; / the integer part / int i; / loop counter / float( &ten, 10 ); if (flttst( &num ) < 0) { / make it positive, print the sign / fltneg( &num, pnum ); putchar( '-' ); } else { fltcpy( &num, pnum ); } / first put out integer part / inum = fltint( &num ); putdecu( inum, 0 ); putchar( '.' ); / second put out digits of the fractional part */ while (places > 0) { float( &tmp, inum ); fltsub( &num, &num, &tmp ); fltmul( &num, &num, &ten ); inum = fltint( &num ); putchar( inum + '0' ); places = places - 1; } }

The above code shows some of the problems we forced on ourselves by insisting on having no knowledge of the floating-point representation when writing our print routine. Where a C or Java programmer would write 10.0, relying on the compiler to create the floating point representation and put it in memory, we have been forced to use the integer constant 10 and then call float() to do the conversion. This is a common result of strict object oriented encapsulation, although we could have the interface definition export a compile time or assembly time macro to process constants into their internal representations.

If we deny ourselves knowledge of the size of the representation of floating point numbers, we cannot allocate space in our activation records for local floating point variables the way we have up to this point. We solve this by having the interface definition in float.h export the size of floating point numbers as the constant FLOATSIZE. We can adopt the general convention that we always define a symbol for the size of each object, record or structure.

Before this point, we manually added the sizes of the fields in the activation record when we built our activation records, but we can use the assembler to do this clerical work. To do this, we begin by setting the activation record size to zero, and then we increment it by the object size as we define each field.

Building an activation record for FLTPRINT

TITLE fltprint.a -- floating print routine USE "float.h" INT FLTPRINT MACRO LOCAL X, =SIZE X = ARSIZE ARSIZE = ARSIZE + SIZE ENDMAC WRDSIZE = 4 ; size of one word ; activation record format ARSIZE = 0 ; initial size of activation record LOCAL RETAD, WRDSIZE ; return address LOCAL NUM, FLOATSIZE ; copy of the floating point number LOCAL TMP, FLOATSIZE ; a temporary floating point number LOCAL TEN, FLOATSIZE ; the constant ten LOCAL R8SAVE, WRDSIZE ; save area for register 8 LOCAL R9SAVE, WRDSIZE ; save area for register 9 LOCAL R10SAVE, WRDSIZE; save area for register 10

**Building an activation record for FLTPRINT**
TITLE fltprint.a -- floating print routine USE "float.h" INT FLTPRINT MACRO LOCAL X, =SIZE X = ARSIZE ARSIZE = ARSIZE + SIZE ENDMAC WRDSIZE = 4 ; size of one word ; activation record format ARSIZE = 0 ; initial size of activation record LOCAL RETAD, WRDSIZE ; return address LOCAL NUM, FLOATSIZE ; copy of the floating point number LOCAL TMP, FLOATSIZE ; a temporary floating point number LOCAL TEN, FLOATSIZE ; the constant ten LOCAL R8SAVE, WRDSIZE ; save area for register 8 LOCAL R9SAVE, WRDSIZE ; save area for register 9 LOCAL R10SAVE, WRDSIZE; save area for register 10

Had we allowed ourselves to use knowledge about the size of a floating point number, we could have defined NUM=4, TMP=12 and TEN=20. If we did this, changes in the floating point representation, for example, changing to a two-word mantissa, would require a a major rewrite of every routine that uses floating point numbers. Had we not defined the macro LOCAL, each local variable declaration would have required two lines of code. For example, the declaration of the local variable NUM would begin with NUM=ARSIZE, and then it would add to the activation record size with ARSIZE=ARSIZE+FLOATSIZE.

The need to save registers 8 to 10 was discovered only after writing the code on the next page. This code could have been written using fewer registers and more local variables in the activation record, but it is more efficient to use registers, and our standard calling conventions require that if we use registers 8 and up, we must save and then restore whatever was in them. Of course, if any of the subroutines called by our code use registers 8 to 10, those routines must save and restore that data for us.

The code on the next page contains one significant optimization. With all of the subroutine calls, we could have incremented and decremented the stack pointer many times. Instead, we increment it just once at the start of the print routine and decrement it just once at the end; in between, we always subtract ARSIZE from every displacement into the activation record in order to correct for this.

The body of the floating print routine, part 1

FLTPRINT: ; expects R3 = pnum, pointer to floating number to print ; R4 = places to print after the point ; uses R8 = pnum ; R9 = places ; R10 = inum, integer part of number STORES R1,R2 STORE R8,R2,R8SAVE STORE R9,R2,R9SAVE STORE R9,R2,R10SAVE ; -- saved return address, R8-10 MOVE R8,R3 MOVE R9,R4 ; -- move number and places ADDI R2,R2,ARSIZE ; -- from here on, R2 points to end of AR LEA R3,R2,TEN-ARSIZE LIS R4,10 LIL R1,FLOAT ; -- we need the constant 10 JSRS R1,R1 ; float( &ten, 10 ); MOVE R3,R8 LIL R1,FLTTST JSRS R1,R1 TESTR R3 BNR FPRNNEG ; if (flttst( pnum ) < 0) { LEA R3,R2,NUM-ARSIZE MOVE R4,R8 LIL R1,FLTNEG JSRS R1,R1 ; fltneg( &num, pnum ); LIS R3,'-' LIL R1,PUTCHAR JSRS R1,R1 ; putchar( '-' ); BR FPRABS FPRNNEG: ; } else { LEA R3,R2,NUM-ARSIZE MOVE R4,R8 LIL R1,FLTCPY JSRS R1,R1 ; fltcpy( &num, pnum ); FPRABS: ; } ; /* first put out the integer part */ LEA R3,R2,NUM-ARSIZE LIL R1,FLTINT JSRS R1,R1 STORE R3,R2,INUM ; -- parameter inum = fltint( &num ); LIS R4,0 ; -- parameter 0 LIL R1,PUTDECU JSRS R1,R1 ; putdecu( inum, 0 ); LIS R3,'.' LIL R1,PUTCHAR JSRS R1,R1 ; putchar( '.' );

**The body of the floating print routine, part 1**
FLTPRINT: ; expects R3 = pnum, pointer to floating number to print ; R4 = places to print after the point ; uses R8 = pnum ; R9 = places ; R10 = inum, integer part of number STORES R1,R2 STORE R8,R2,R8SAVE STORE R9,R2,R9SAVE STORE R9,R2,R10SAVE ; -- saved return address, R8-10 MOVE R8,R3 MOVE R9,R4 ; -- move number and places ADDI R2,R2,ARSIZE ; -- from here on, R2 points to end of AR LEA R3,R2,TEN-ARSIZE LIS R4,10 LIL R1,FLOAT ; -- we need the constant 10 JSRS R1,R1 ; float( &ten, 10 ); MOVE R3,R8 LIL R1,FLTTST JSRS R1,R1 TESTR R3 BNR FPRNNEG ; if (flttst( pnum ) < 0) { LEA R3,R2,NUM-ARSIZE MOVE R4,R8 LIL R1,FLTNEG JSRS R1,R1 ; fltneg( &num, pnum ); LIS R3,'-' LIL R1,PUTCHAR JSRS R1,R1 ; putchar( '-' ); BR FPRABS FPRNNEG: ; } else { LEA R3,R2,NUM-ARSIZE MOVE R4,R8 LIL R1,FLTCPY JSRS R1,R1 ; fltcpy( &num, pnum ); FPRABS: ; } ; /* first put out the integer part */ LEA R3,R2,NUM-ARSIZE LIL R1,FLTINT JSRS R1,R1 STORE R3,R2,INUM ; -- parameter inum = fltint( &num ); LIS R4,0 ; -- parameter 0 LIL R1,PUTDECU JSRS R1,R1 ; putdecu( inum, 0 ); LIS R3,'.' LIL R1,PUTCHAR JSRS R1,R1 ; putchar( '.' );

The body of floating print, part 2

FPRLP: TESTR R9 BLE FPRLX ; while (places > 0) { LEA R3,R2,TMP-ARSIZE MOVE R4,R8 LIL R1,FLOAT JSRS R1,R1 ; float( &tmp, inum ); LEA R3,R2,NUM-ARSIZE MOVE R4,R3 LEA R5,R2,TMP-ARSIZE LIL R1,FLTSUB JSRS R1,R1 ; fltsub( &num, &num, &tmp ); LEA R3,R2,NUM-ARSIZE MOVE R4,R3 LEA R5,R2,TEN-ARSIZE LIL R1,FLTMUL JSRS R1,R1 ; fltmul( &num, &num, &ten ); LEA R3,R2,NUM-ARSIZE LIL R1,FLTINT JSRS R1,R1 MOVE R10,R3 ; parameter R8 = inum = fltint( &num ); ADDI R3,R3,'0' LIL R1,PUTCHAR JSRS R1,R1 ; putchar( inum + '0' ); ADDSI R9,-1 ; places = places - 1; BR FPRLP FPRLX: ; } ADDI R2,R2,-ARSIZE LOAD R8,R2,R8SAVE LOAD R9,R2,R9SAVE LOAD R10,R2,R10SAVE LOADS R1,R2 ; restore return address, R8, R9 JUMPS R1 ; return

**The body of floating print, part 2**
FPRLP: TESTR R9 BLE FPRLX ; while (places > 0) { LEA R3,R2,TMP-ARSIZE MOVE R4,R8 LIL R1,FLOAT JSRS R1,R1 ; float( &tmp, inum ); LEA R3,R2,NUM-ARSIZE MOVE R4,R3 LEA R5,R2,TMP-ARSIZE LIL R1,FLTSUB JSRS R1,R1 ; fltsub( &num, &num, &tmp ); LEA R3,R2,NUM-ARSIZE MOVE R4,R3 LEA R5,R2,TEN-ARSIZE LIL R1,FLTMUL JSRS R1,R1 ; fltmul( &num, &num, &ten ); LEA R3,R2,NUM-ARSIZE LIL R1,FLTINT JSRS R1,R1 MOVE R10,R3 ; parameter R8 = inum = fltint( &num ); ADDI R3,R3,'0' LIL R1,PUTCHAR JSRS R1,R1 ; putchar( inum + '0' ); ADDSI R9,-1 ; places = places - 1; BR FPRLP FPRLX: ; } ADDI R2,R2,-ARSIZE LOAD R8,R2,R8SAVE LOAD R9,R2,R9SAVE LOAD R10,R2,R10SAVE LOADS R1,R2 ; restore return address, R8, R9 JUMPS R1 ; return

Exercises

x) Write a floating print routine that produces its output in scientific notation, for example, using the format 6.02E23 where the E stands for times ten to the. To do this, you will have to first do a decimal normalize, counting the number of times you have to multiply or divide by ten in order to bring the mantissa into the range from 1 to just under 10, and then you will have to print the mantissa (using the floating print routine we just discussed), and finally print the exponent.

`31`	`30`	`29`	`28`	`27`	`26`	`25`	`24`	`23`	`22`	`21`	`20`	`19`	`18`	`17`	`16`	`15`	`14`	`13`	`12`	`11`	`10`	`09`	`08`	`07`	`06`	`05`	`04`	`03`	`02`	`01`	`00`

S	exponent								mantissa