22C:18, Lecture 19, Fall 1996

Douglas W. Jones
University of Iowa Department of Computer Science

BCD Arithmetic

While many programming languages support only binary and floating point arithmetic, another number representation is quite important in the real world: BCD, or binary coded decimal.

This is important in applications where the number of arithmetic operations performed on a number is comparable to the number of times that number is read from or written to a textual format. In such cases, for binary numbers, the expense of converting to and from ASCII character strings is far greater than the cost of the arithmetic itself, largely because of the many divisions by ten required each time the number is output.

Furthermore, in financial applications, high precision is commonly needed. Typically, dollar amounts are represented in pennies, and a 32 bit integer can only represent numbers up to around 4 billion, or 40 million dollars, a number dangerously small from the point of view of even a modest bank or medium size company.

BCD numbers are represented as strings of 4-bit nibbles, where each nibble holds one decimal digit. Thus, the number 365 is represented as:

	 3    6    5
	0011 0110 0101
Printing a BCD number is trivial! Any hexadecimal output routine will do it! Reading a BCD number is equally trivial, easier than reading a hexadecimal number. No expensive divisions by ten are needed; instead, all that is needed is shifting.
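Because each nibble is one decimal digit, the conversions need only shifts and masks. A minimal sketch in C (the helper names bcd_from_string and bcd_to_string are made up for illustration):

```c
#include <stdint.h>

/* Pack an ASCII digit string into 8-digit BCD: one shift and OR per
   digit, no multiplication by ten needed. */
uint32_t bcd_from_string(const char *s)
{
    uint32_t v = 0;
    while (*s)
        v = (v << 4) | (uint32_t)(*s++ - '0');
    return v;
}

/* Unpack 8-digit BCD into an ASCII string (buf must hold 9 bytes):
   one mask and shift per digit, no division by ten needed. */
void bcd_to_string(uint32_t v, char *buf)
{
    for (int i = 7; i >= 0; i--) {
        buf[i] = (char)('0' + (v & 0xF));
        v >>= 4;
    }
    buf[8] = '\0';
}
```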

Doing arithmetic on BCD numbers is, however, a bit more difficult than doing arithmetic on binary numbers. Nonetheless, many computers have been built with hardware in their arithmetic unit to do BCD arithmetic. Machines designed using the CISC philosophy commonly include BCDADD and BCDSUB instructions to do BCD arithmetic, but most modern machines do not go this far. Instead, they take their cue from work done in the late 1950s and use simple binary addition and subtraction to do BCD arithmetic, providing only a BCD "fixup" instruction to take care of the details that the binary arithmetic did not complete.

To see what these details are, note that the simple binary sum of two BCD numbers will be correct so long as the sum does not generate any carry from one decimal digit to another. So, the simple binary sum of the BCD representations of 365 and 123 will be the correct BCD representation of the decimal sum, 488.

        0011 0110 0101   = 365
      + 0001 0010 0011   = 123
      -----------------
        0100 1000 1000   = 488
This seems too easy, and it is. The problem is, any decimal digits that produce a sum greater than 9, that is, A (hex) or above, must be induced to force a carry into the next digit position. One way to force this carry to take place properly is to add 6 to every digit of one of the operands. This will cause digits of the sum that would have been greater than 9 to be greater than 15 instead, so the binary carry from one BCD digit to the next will be the correct decimal carry.

Conveniently, after 6 has been added to each digit of one operand, the preliminary sum will hold the right value in every digit position that produced a carry, but in those positions that did not produce carries, the sum digit will be too large by 6. The trick, then, is to take away 6 from every position in the preliminary sum that did not generate a carry. This is illustrated by the following example, where 365 is added to 128 to produce 493, using BCD arithmetic:

        0011 0110 0101   = 365  -- operand 1
      + 0110 0110 0110   = 666  -- sixes
      -----------------
        1001 1100 1011
      + 0001 0010 1000   = 128  -- operand 2
      -----------------
        1010 1111 0011   = AF3  -- preliminary sum
      - 0110 0110 0000   = 660  -- correction
      -----------------
        0100 1001 0011   = 493  -- sum
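The whole sequence above can be sketched in C, using a 64-bit intermediate so the carry out of the top digit is not lost. This is a sketch of the technique, not of the Hawk's actual datapath; bcd_add is a made-up name:

```c
#include <stdint.h>

/* Add two 8-digit BCD numbers: add sixes to one operand, do a plain
   binary add, then take six back out of every nibble that produced
   no carry. */
uint32_t bcd_add(uint32_t a, uint32_t b)
{
    uint64_t x   = (uint64_t)a + 0x66666666;  /* add 6 to every digit */
    uint64_t sum = x + b;                     /* plain binary add */
    /* Each bit of (x ^ b ^ sum) is the carry INTO that bit position,
       so bit 4k+4 is the carry OUT of nibble k. */
    uint64_t carries = (x ^ b ^ sum) & 0x111111110ULL;
    /* Nibbles with no carry out are too large by 6; subtract it.
       Shifting the no-carry bits down 4 puts one bit at the bottom of
       each such nibble, and multiplying by 6 turns it into a 6 there. */
    uint64_t fixup = ((~carries & 0x111111110ULL) >> 4) * 6;
    return (uint32_t)(sum - fixup);
}
```

The example in the text, 365 + 128, follows exactly this path: 365 + 666 + 128 gives the preliminary sum AF3, the middle and high nibbles produce no carry, so 660 is subtracted, giving 493.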
Many modern machines have simple binary arithmetic units that are augmented with a bit of extra logic to record the carry out of each 4-bit nibble. A special instruction, BCDADJ in the case of the Hawk, is then provided to take away 6 from those nibbles that need it.

The Hawk processor status word register includes a special field, the BCD carry field, which holds the 8 carry bits from the most recent add or subtract; the BCDADJ instruction uses these bits to decide which BCD digits should be adjusted by subtracting 6.

As a result, the following macro will add two BCD numbers on the Hawk machine:

	MACRO	BCDADD a,b,c
	LIW	RF,#66666666
	ADD	a,b,RF
	ADD	a,a,c
	BCDADJ	a
Warning: this macro uses register 15 as a temporary register to hold the constant 66666666 (hex). If large amounts of BCD arithmetic are being done, a register should be reserved to hold this value instead of loading it in each expansion of the macro.

Extended Precision Arithmetic

32 bit binary arithmetic and 8 digit BCD arithmetic are not sufficient for many applications! In fact, no matter what word size a machine supports, it is likely that some application will need a larger word size. Because of this, most machines provide some kind of support for extended precision arithmetic, allowing multiple words to be treated as single operands for arithmetic computations. Usually, the solution is to provide some way to add the carry bit from one add instruction to the next.

The basic Hawk instruction set allows this to be done in a cumbersome way. Consider the following example code for adding the 64 bit double-word in R4 and R5 to the 64 bit double-word in R6 and R7, assuming the low word of each pair holds the least significant bits:

	ADD	R6,R6,R4	; add lower bits
	BCR	nocarry
	ADDSI	R7,1		; add carry to higher bits
nocarry:ADD	R7,R7,R5	; add higher bits
While this code works, detecting overflow and detecting whether the result is zero is inefficient, and generalizing it to more than 64 bit arithmetic is even worse. As a result, the Hawk machine includes a special instruction ADDC, or add with carry; this instruction reduces the example above to:
	ADD	R6,R6,R4	; add lower bits
	ADDC	R7,R5		; add higher bits plus carry
The ADDC and SUBB (subtract with borrow) instructions not only take into account the carry from the previous add or subtract, but they set the condition codes to reflect the state of the entire result. This even applies to higher precisions, such as:
	SUB	R6,R6,R3	; subtract lower bits
	SUBB	R7,R4		; subtract middle bits
	SUBB	R8,R5		; subtract high bits
	BGT	somewhere	; branch if R[6-8] > R[3-5]
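The carry chain that ADD and ADDC implement in hardware can be sketched in C for any precision. This sketch assumes 32-bit words stored least significant first; add_extended is a made-up name:

```c
#include <stdint.h>

/* dst += src over n 32-bit words, least significant word first.
   Each step folds in the carry out of the previous word, just as a
   chain of one ADD followed by ADDC instructions would. */
void add_extended(uint32_t *dst, const uint32_t *src, int n)
{
    uint32_t carry = 0;
    for (int i = 0; i < n; i++) {
        uint64_t t = (uint64_t)dst[i] + src[i] + carry;
        dst[i] = (uint32_t)t;         /* low 32 bits of the sum */
        carry  = (uint32_t)(t >> 32); /* carry into the next word */
    }
}
```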
The following example illustrates the use of these extended precision instructions in the context of a macro to add two 16 digit decimal numbers:
	MACRO	BCDADD a,b,c
	LIW	RF,#66666666
	ADD	a,b,RF		; add sixes to low 8 digits
	ADD	a+1,b+1,RF	; add sixes to high 8 digits
	ADD	a,a,c
	BCDADJ	a		; compute low 8 digits of sum
	ADDC	a+1,c+1
	BCDADJ	a+1		; compute high 8 digits of sum
Warning: again, this macro uses register 15 as a temporary register for 66666666 (hex). A globally reserved register could hold this value!

Note that BCD and binary arithmetic are almost equally easy to extend to greater than 32 bit precision, so long as only adding and subtracting are involved. Extended precision multiplication and division, on the other hand, are somewhat more difficult, and as a result, conversion of high precision binary numbers to and from decimal is an expensive proposition. As a result, if the only arithmetic required is addition and subtraction, extended precision BCD is almost always less expensive (hexadecimal I/O routines can be used for each 8 digit chunk of the BCD number!).

Extended precision multiplication and division are an interesting venture. The fundamental observation, made by Charles Babbage, among others, is that in multiplying two numbers, for example, AB times CD, where AB consists of two n-digit parts, A, the most significant n digits, and B, the least significant n digits, the product is:

		   A   B
		x  C   D
	       ----------
		  DBh DBl
	      DAh DAl
	      CBh CBl
	+ CAh CAl
	-----------------
Here, we have used the notation that DB, the product of the two n-digit numbers D and B, is a number of 2n digits; DBh is the high n digits of the product and DBl is the low n digits. For n=1 and in base 10, this is the familiar rule for multiplying two numbers of two digits each. For n=32, this says that we can compute the product of two 64-bit numbers using 4 multiplies and 3 high precision adds, assuming that we have a multiply instruction that produces a 64-bit product from two 32-bit operands.
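Babbage's scheme translates directly into code. Here is a C sketch for n = 32, computing the 128-bit product of two 64-bit operands from the four 32x32-to-64-bit partial products of the longhand layout above (mul64x64 is a made-up name):

```c
#include <stdint.h>

/* Compute hi:lo = ab * cd, where ab and cd are split into 32-bit
   halves A,B and C,D, using the four partial products DB, DA, CB
   and CA of the longhand scheme. */
void mul64x64(uint64_t ab, uint64_t cd, uint64_t *hi, uint64_t *lo)
{
    uint32_t a = (uint32_t)(ab >> 32), b = (uint32_t)ab;
    uint32_t c = (uint32_t)(cd >> 32), d = (uint32_t)cd;

    uint64_t db = (uint64_t)d * b;   /* lowest partial product */
    uint64_t da = (uint64_t)d * a;   /* shifted up n digits */
    uint64_t cb = (uint64_t)c * b;   /* shifted up n digits */
    uint64_t ca = (uint64_t)c * a;   /* shifted up 2n digits */

    uint64_t mid = da + (db >> 32);  /* cannot overflow 64 bits */
    mid += cb;                       /* but this add can carry out */
    uint64_t midcarry = (mid < cb);  /* detect that carry */

    *lo = (mid << 32) | (uint32_t)db;
    *hi = ca + (mid >> 32) + (midcarry << 32);
}
```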

String Operations

In Pascal, it is legal to write

	a := 'Some string of characters';
so long as the declaration of a, as a packed array of characters, is compatible with the number of characters in the quoted string constant.

In C, we can write a piece of code that does the same thing:

	strcpy(a, "Some string of characters");
Both of these copy the text from the quoted string to the indicated array of characters. The Pascal code requires that the compiler generate code to call an appropriate copy routine, or that the compiler generate a for loop to do the copying. The C code explicitly calls the routine that does the copying.

If you read the manual for the C programming language, you will find that the strcpy routine is defined something like the following:

	strcpy(a,b)
	char* a;
	char* b;
	{
		while (*a++ = *b++);
	}
This is wonderfully obscure C code, but it is not hard to rewrite the while loop in more understandable form, while staying in C:
	while ((*a = *b) != 0) {
		a++;
		b++;
	}
It is easy to rewrite such routines in assembly language. Consider:
;------------------------
        INT     STRCPY
STRCPY:                 ; copy a null terminated string
                        ; R3 points to the destination string
                        ; R4 points to the source string
                        ; wipes out R3-6
                        ; does not use R2
STRCLP:
        LOADS   R5,R4   ; get a byte and test for null
        EXTB    R5,R5,R4
        LOADS   R6,R3   ; store a byte
        STUFFB  R6,R5,R3
        STORES  R6,R3
        BZS     STRCQT  ;   quit if byte was null
        ADDSI   R3,1    ; otherwise move on to next byte
        ADDSI   R4,1
        BR      STRCLP  ; and iterate
STRCQT:
        JUMPS   R1      ; return
Other procedures in the standard C string library are equally easy to translate to trivial Hawk code. Consider strlen(), the C function to measure the length of a null terminated string:
	int strlen(a)
	char* a;
	{
		char *b = a;
		while (*b++);
		return b - a - 1;
	}
This translates to the following assembly code:
;------------------------
        INT     STRLEN
STRLEN:                 ; measure length of string pointed to by R3
                        ; returns the length in R3
                        ; wipes out R4-5
                        ; does not use R2
        MOVE    R5,R3   ; save address of string start
STRLLP:
        LOADS   R4,R3   ; test one chunk of the string
        EXTB    0,R4,R3 ;   (a chunk is one byte)
        BZS     STRLQT  ;   quit if null
        ADDSI   R3,1    ; otherwise move on to next byte
        BR      STRLLP  ; and iterate
STRLQT:
        SUB     R3,R3,R5; compute number of non nulls
        JUMPS   R1      ; return

Optimized String Operations

While the code for STRCPY and STRLEN presented above is typical of what a good compiler would produce from the C code shown, it is not very efficient! The problem is that it addresses the consecutive bytes of each string separately, without regard to the fact that each word of the string holds 4 bytes. Thus, each word holding part of the string is fetched from memory as many as 4 times! Faster string manipulation code can be written!

Consider the STRLEN function, the simplest of the string manipulation functions. If strings are forced to begin on a word boundary, this function can be written to load consecutive words of the string, counting a length of 4 bytes for each word, until it finds a word with a null byte. For that word only, it would have to find out which byte is null. The situation is only slightly more complex if strings are allowed to begin on an arbitrary byte. In that case, the first and last words of the string may contain fewer than 4 bytes, while the middle words of the string can be counted quickly, 4 bytes at a time.

The Hawk architecture borrows a feature from various modern RISC architectures to speed code for dealing with null-terminated strings. The key to this revolves around the condition codes set by the LOADCC and LOADSCC instructions. Clearly, the N and Z condition codes are needed to report on the entire 32 bit value loaded. In addition, LOADCC must zero the V condition code in order to allow the BGT, BGE, BLT and BLE instructions to be used to compare the loaded value with zero. There is, however, no special need for the C condition code after loading, so this code is available for other uses!

After a LOADCC or LOADSCC instruction, the Hawk sets the C condition code to report whether any individual byte of the loaded word is zero. This allows the following fast version of strlen to be coded:

;------------------------
        INT     STRLEN
STRLEN:                 ; measure length of string pointed to by R3
                        ; returns the length in R3
                        ; wipes out R4-5
                        ; does not use R2
        MOVE    R5,R3   ; save address of string start
        ADDSRU  0,R3,2  ; test least significant 2 bits of address
        BCR     STRLLP  ;   if all zero, go work by words
        LOADSCC R4,R3   ; otherwise get first 1-3 bytes
        EXTB    0,R4,R3 ; test first byte of string
        BZS     STRLQT  ;   quit if null
        ADDSI   R3,1    ; otherwise move on to next byte
        ADDSRU  0,R3,2  ; test least significant 2 bits
        BCR     STRLLP  ;   if all zero, go work by words
        EXTB    0,R4,R3 ; otherwise test second byte of string
        BZS     STRLQT  ;   quit if null
        ADDSI   R3,1    ; otherwise move on to third byte
        ADDSRU  0,R3,2  ; test least significant 2 bits
        BCR     STRLLP  ;   if all zero, go work by words
        EXTB    0,R4,R3 ; otherwise test third byte of string
        BZS     STRLQT  ;   quit if null
        ADDSI   R3,1    ; otherwise move on to 4th byte
                        ; !!! we know 4th byte is in next word!
STRLLP:                 ; top of loop scanning words of string
        LOADSCC R4,R3   ;   get 4 bytes of string
        BCS     STRLQL  ;     quit if any bytes are null
        ADDSI   R3,4    ;   count 4 non-nulls
        BR      STRLLP  ; iterate
STRLQL:                 ; R4 holds a null somewhere
        EXTB    0,R4,R3 ; test byte 0 of word of string
        BZS     STRLQT  ;   quit if high byte was null
        ADDSI   R3,1    ; otherwise count a byte
        EXTB    0,R4,R3 ; test byte 1 of word of string
        BZS     STRLQT  ;   quit if upper middle byte null
        ADDSI   R3,1    ; otherwise count a byte
        EXTB    0,R4,R3 ; test byte 2 of word of string
        BZS     STRLQT  ;   quit if lower middle byte null
        ADDSI   R3,1    ; otherwise count a byte
STRLQT:                 ; R3 points to the null byte
        SUB     R3,R3,R5; compute number of non nulls
        JUMPS   R1      ; return
The first part of this code, prior to the label STRLLP, repeatedly asks: is the start of the string aligned on a word boundary? If not, it checks the byte to see if it is null, and if not, advances to the next byte and repeats the test. This code could be written as a loop, but since there are at most 3 non-aligned bytes to be tested at the start of a string, the loop is unrolled, with the 3 iterations written in sequence. For loops with small bodies, the result is significantly faster!

The middle part of the code, from STRLLP to STRLQL, counts words of 4 bytes each, terminating when the word in question has a null anywhere in it. The final part then works through that word looking for the null. As with the first part of the code, the final part is an unrolled loop.
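The any-byte-zero test that LOADSCC performs in hardware has a well-known software counterpart: the expression (w - 0x01010101) & ~w & 0x80808080 is nonzero exactly when some byte of the 32-bit word w is zero. The following C sketch of the word-at-a-time loop uses it; strlen_word is a made-up name, and for brevity this version uses memcpy rather than handling the unaligned head the way the Hawk code above does:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Word-at-a-time strlen. The caller must guarantee that the buffer
   extends at least 3 bytes past the terminating null, since up to
   3 bytes beyond the null may be examined. */
size_t strlen_word(const char *s)
{
    const char *p = s;
    uint32_t w;
    for (;;) {
        memcpy(&w, p, 4);                        /* fetch 4 bytes */
        if ((w - 0x01010101u) & ~w & 0x80808080u)
            break;                               /* some byte is zero */
        p += 4;                                  /* count 4 non-nulls */
    }
    while (*p)                                   /* find the null in */
        p++;                                     /*   the final word */
    return (size_t)(p - s);
}
```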

Writing similarly optimized code for STRCPY is harder! If both the source and destination strings are aligned identically, the logic used for STRLEN can be used without change, but if the alignments are different, a different approach must be used. Typically, the code for an optimized version of STRCPY begins with a test of the relative alignment of the strings, for example, by exclusive ORing the pointers and then testing the least significant bits (the alignments are identical if the least significant two bits of the result are zero).

At least three different versions of the body are typically used, one that takes over if the alignments are the same, one for alignments that differ by 2 bytes, perhaps copying successive halfwords until a null is encountered, and one that handles alignments that are off by 1 byte. Optimizing compilers cannot be expected to produce such code from the simple C versions of the string functions found in the C reference manual! Instead, for each RISC architecture, someone must typically hand-craft these key bits of code, taking the same kind of care that is typically devoted to the hardware design of string copy instructions on complex instruction set computers!
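The relative-alignment test described above amounts to one exclusive OR and a mask. A minimal sketch (same_word_alignment is a made-up name):

```c
#include <stdint.h>

/* The source and destination can be copied with identical
   word-at-a-time logic only when the low two bits of the exclusive
   OR of the two pointers are zero. */
int same_word_alignment(const char *a, const char *b)
{
    return (((uintptr_t)a ^ (uintptr_t)b) & 3) == 0;
}
```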