7. Byte and Halfword Addressing, Text Strings.

Part of 22C:60, Computer Organization
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

A Programming Problem, Text Manipulation

The Hawk monitor borrows its convention for recording the length of text strings from the C programming language. This is used in the Hawk monitor in only one place, the DSPST routine, but it would certainly be convenient to have access to the string processing routines of the C standard library. These include:

puts( src );
Given that src is the address of the first byte of a null-terminated string, output that string to the standard output device. Our DSPST does exactly this.

strncpy( dst, src, count );
Given that src is the address of the first byte of a null-terminated string, copy that string, including the terminating null if possible, to memory starting at the address dst but do not copy more than count characters. Usually, count gives the size of the memory region pointed to by dst. This routine returns the pointer that was passed as dst.

strncat( dst, src, count );
Given that src and dst are the addresses of the first bytes of two null-terminated strings, concatenate the string pointed to by src to the end of the string pointed to by dst, terminating the resulting string with a single null, if possible. If the resulting string is longer than count characters, it will be truncated. This routine returns the pointer dst to the result.

strlen( src );
Given that src is the address of the first byte of a null-terminated string, return the length of that string, up to but not including the null.

On most Unix systems, full documentation for any of these routines can be obtained using the man command; for example, to get the official documentation for the strlen() routine, type the man strlen command into the shell interface. The book The C Programming Language by Kernighan and Ritchie (Prentice Hall) also documents these routines.

If we wish to write code to perform these functions, we need a way to fetch individual bytes from memory. As we discussed in Chapter 2, the memory addresses on the Hawk computer do allow us to give a distinct address to each byte of memory, but the variations on the LOAD and STORE introduced in Chapters 4, 5 and 6 all ignore the two least significant bits of the memory address and load or store a full 32-bit word.

The IBM System/360 was very first computer to offer support for 8-bit bytes, 16-bit halfwords and 32-bit words; this computer had a complete assortment of load and store instructions, load byte, load halfword and load word, along with their complements store byte, store halfword and store word. The problem with this approach to instruction set design are twofold.

First, if we commit to this idea, we use up 6 opcodes for load and store, and on a machine like the Hawk, the instruction set is crowded enough that it is not clear that we have these opcodes to spare.

Second, on a machine with a 32-bit word, loading and storing fractional words is inefficient enough that it should be discouraged! If the instruction set of the computer offers convenient load and store instructions for bytes and halfwords, this might encourage programmers to make excessive use of them. The truth is, loading a byte from memory on a machine with a 32-bit word actually involves loading the word containing that byte and then extracting the byte from the word, so why not make the instruction set match the work actually done by the hardware?

Exercises

a) One puzzle about the above interface definition is worth thinking about. Why do strncpy() and strncat() have, as return values, their first argument. Hint: Consider finding a more compact way to encode this list of string operations.

                strncpy( str, "Hello, ", 10 );
                strncat( str, "world", 10 );
                strncat( str, "!", 10 );

This little example loads the string pointed to by str with "Hello, world!" but not more than 10 characters.

Byte and Halfword Extraction

The Hawk architecture offers two special instructions for extracting bytes and halfwords from full-word operands. These instructions are designed for use with the LOADS instruction for fast loading of bytes from memory. Here is the EXTB instruction for extracting bytes; the EXTX instruciton is quite similar, except that it extracts halfwords.

The Hawk EXTB instruction
15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00
0 1 0 1 dst s1 s2

Formally, this does r[dst]=sxt(7,r[s1]>>((r[s2]&3)<<3)). Unfortunately, while this formal notation is compact, it is not very informative! Here is a more detailed breakdown of what the above line of code really means:

r[dst]=sxt(7,r[s1] ... )
A byte is taken from register designated by the s1 field of the instruction and moved to the destination register. That byte is sign extended, so the result will make sense for bytes holding two's complement values.

r[dst]= ... r[s1]>>((r[s2]&3)<<3) ...
The particular byte extracted is determined by the contents of the register designated by the s2 field of the instruciton.

... ( ... )<<3 ...
The shift count used to select the byte will be a multiple of 8 because shifting a binary integer 3 places left is equivalent to multiplication by 8.

... r[s2]&3 ...
The shift count depends only on the least significant two bits of the register designated by the s2 field of the instruction because anding a bit pattern with 310 which is 112 forces all but those two bits to zero. These 2 bits of the second source register are the bits that select the byte within the word.

... (r[s2]&3)<<3 ...
From the above, we conclude that if the least significant two bits are 00, we get a shift count of 0, if they are 01, we get a shift count of 8, if they are 10, we get a shift count of 16, and if they are 11, we get a shift count of 24.

In rather long winded English, this says that if the least significant 2 bits of the byte pointer Rp are 00, EXTB Rd,Rs,Rp will load the destination register Rd with the least significant byte of the source register Rs. If the least significant 2 bits of Rp are 01, Rd will be loaded with the second to the least significant byte of Rs. If the least significant 2 bits of Rp are 10, Rd will be loaded with the next to the most significant byte of Rs and If the least significant 2 bits of Rp are 11, Rd will be loaded with the most significant byte of Rs. In all cases, this byte is sign extended before loading in the destination register.

The net result of all of this is that, if Rp contains a memory address and Rs contains a word that was loaded from memory at that address, then the instruction EXTB Rd,Rs,Rp will pick, from Rs the byte pointed to by Rp and put that byte in Rd. So, if Rp contains a pointer to a byte in memory, we can load that byte into some register as follows:
Loading a signed byte from memory
; given:  R3, a pointer to a byte in memory
        LOADS   R1,R3           ; get the word containing the desired byte
        EXTB    R1,R2,R3        ; extract the desired byte from that word
; result:  R1 now holdds the desired byte, sign-extended; R1=M[R3]

The extract instructions both set the condition codes to report on whether the extracted byte or halfword is zero or negative, so we can easily test for end-of-string as we extract bytes from some source string in order to perform some string operation.

There is one minor problem. What if we wanted to extract unsigned halfwords or unsigned bytes? The Hawk architecture solves this problem by providing an additional instruction for truncating any signed quantity to some desired number of bits. The TRUNC R2,8 instruction truncates a signed quantity in R2 to 8 bits, forcing all higher bits to be set to zero.
Loading an unsigned byte from memory
; given R1, a pointer to a byte in memory
        LOADS   R2,R1           ; get the word containing the desired byte
        EXTB    R2,R2,R1        ; extract the desired byte from that word
        TRUNC   R2,8            ; force an unsigned interpretation on it
; result R2 now holds the desired byte, padded on the left with zeros.

Exercises

b) Given that R1 holds ABCDEF16, what value is left in the register after TRUNC R1,8? after TRUNC R1,6? after TRUNC R1,8?

c) If the instruction EXTB R3,R3,R0 is used (In this context, register zero the constant zero), what is the result in R3 if its initial is ABCDEF16? If its initial value is 12345616? What useful function does this particular variant of the EXTB statement do?

Counting the Bytes in a Null Terminated String

We have now covered enough ground to write a subroutine to count the bytes in a null terminated string. In a high level language like C or C++, we could do this as follows:
C code for strlen(src)
unsigned int strlen( char src[] )
{
        int n = 0;
        while ( src[n] != '\0' ) {
                n = n + 1;
        }
        return n;
}

This first version of the code views the string to be measured as an array of characters. Each iteration of the loop examines one character in this array, testing to see if it is the null character that marks the end of the string. In the C and C++ languages, the name of an array is the same as a pointer to its first element. As a result, the declarations char src[] and char * src declare the parameter src to have exactly the same type; it holds the address of a character in memory. Since strings in C and C++ are represented by arrays of consecutive characters, the natural way to pass a string to a function is by passing a pointer to its first character. This feature of the C and C++ languages may be one of their greatest weaknesses, but we can use it to write the following alternative version of the code to compute the string length:
C code for strlen(src)
unsigned int strlen( char * src )
{
        int n = 0;
        while ( *src != '\0' ) {
                src = src + 1;
                n = n + 1;
        }
        return n;
}

In the above, note that the star operator, when applied as a prefix operator prefix operator on a value that is a pointer, causes that pointer to be followed to the object it references. In this case, we follow the character pointer to a particular character, so so the object *src is the character pointed to by src and could also have been written as src[0].

In C and C++, adding an integer to a pointer produces a new pointer; the amount by which the memory address is incremented depends on the number of bytes occupied by the objects to which the pointer refers. In this case, since we are assuming that a char object occupies one byte, adding one to a char pointer increments the address by one. Adding one to an int pointer would incrment the address by 4 if we assume that int objects occupy 4 bytes. Note the word "assume" above! The C standard does not actually state the size, in bytes, of any standard types.

We have covered enough detail now to translate the second version of the above code into Hawk assembly language. Our standard calling sequence suggests that the actual parameter should be passed in register 3, and the result should be returned in register 3. Because our conventions allow a called routine to do anything it wants in registers 3 to 7, we can use these for local variables within the function, and this allows us to completely avoid use of the stack or activation records in this little routine:
Hawk code for STRLEN using pointer arithmetic
;       receiving sequence assumptions for STRLEN (SLN)
STRLEN:         ; R3 holds src, the address of the string
                ; R4 and 5 hold nothing of value
                ; R4 will be used to extract characters
                ; R5 holds n, the count of the length
        LIS     R5,0            ; n = 0;
SLNLP:                          ; while (TRUE) {
        LOADS   R4,R3
        EXTB    R4,R4,R3
        BZS     SLNLX           ;   if (*src == '\0') exit;
        ADDSI   R3,1            ;   src = src + 1;
        ADDSI   R5,1            ;   n = n + 1;
        BR      SLNLP
SLNLX:                          ; }
        MOVE    R3,R5
        JUMPS   R1              ; return n

There is one big disadvantage of this code! Each sequence of 4 consecutive iterations of the loop in this program will load exactly the same data into register 4 prior to extracting a different byte from that register. This is inefficient, and in fact, it is the motivation for some significant and interesting tricks that have become common on new computers designed since the mid 1980's.

This code can also be tightened by some local optimizations that ignore these larger issues. There are two registers being incremented in the loop body, and one of these increments can be eliminated by computing the string length as the difference between the pointer to the first byte and the pointer to the terminating null. There are two branches within the loop, one conditional and one unconditional; rewriting the loop so the termination test is at the end of the loop can eliminate one of these branches. Combining these optimizations, we can reduce the cost of the loop from 6 instructions per iteration down to 4 instructions per iteration.

Exercises

d) Improve the above code for STRLEN by eliminating one increment operator from the loop body, as suggested above.

e) Improve the above code for STRLEN by changing the control structure so that the loop exit test is done by a conditional branch back to the top of the loop from the end of the loop, as suggested above.

Array Addressing

Suppose we wanted to write assembly code equivalent to the first version of strlen given in the previous seciton. This requires that we deal with array indexing. Given the expression src[n], we must handle this as follows:

  1. Get the address of element zero of the array.
  2. Add the product of the subscript value and the size of one array element.
  3. Use the resulting pointer to reference the indicated element.

We can express this computation in C or C++ by stating that src[n] is equivalent to *(src+n), with the added warning that adding an integer to a pointer in C implies an implicit multiplication of the integer by the size of the object pointed to. Since we are dealing with arrays of characters, where a character is one byte, the multiplicaiton step is trivial, giving us the following Hawk code for STRLEN
Hawk code for STRLEN using arrays
;       receiving sequence assumptions for STRLEN (SLN)
STRLEN:         ; R3 holds src, the address of the string
                ; R4 and 5 hold nothing of value
                ; R4 will be used to extract characters
                ; R5 holds n, the count of the length
        LIS     R5,0            ; n = 0;
SLNLP:                          ; while (TRUE) {
	ADD	R4,R3,R5
        LOADS   R4,R4		
        EXTB    R4,R4,R3
        BZS     SLNLX           ;   if (src[n] == '\0') exit;
        ADDSI   R5,1            ;   n = n + 1;
        BR      SLNLP
SLNLX:                          ; }
        MOVE    R3,R5
        JUMPS   R1              ; return n

Note that this code is exactly the same complexity as the first assembly language version of STRLEN. We have eliminated one increment from the body of the loop, but we had to add one instruction to the code to compute the address of the character to test.

Had the size of an array element been anything other than one, we would have had to add additional instructions in order to multiply the index value by the element size. This added complexity suggests that, in the general case, the pointer arithmetic version would be faster, but it turns out that, on modern computers, the pointer arithmetic version will also be slightly faster even in the case of one-byte characters because of problems that arise in pipelined processors when consecutive instructions use the same register.

Stuffing Bytes into Words

On machines where all data transfers between the processor and memory move 32-bit quantities, storing a byte or halfword into some word in memory involves first loading that word, and then replacing the correct byte or halfword before finally storing the modified word back into memory. This is true whether or not the instruction set of the computer offers a "store byte to memory" or "store word in memory" instruction.

In the case of the Hawk, the three steps described above must be carried out explicitly. The first and last steps are easily done with LOADS and STORES instructions. Two special instructions are provided for the middle step, one to stuff bytes into a register and the other to stuff halfwords. Here is the former:

The Hawk STUFFB instruction
15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00
0 1 1 1 dst s1 s2

Formally, this is r[dst]=(r[dst]&~(0xFF<<((r[s2]&3)<<3))|((r[s1]&0xFF)<<((r[s2]&3)<<3)). Unfortunately, this formal notation neither compact nor very very informative! The following comments should help in understanding this:

... (r[s2]&3)<<3 ...
This subexpression shows up twice in the above, and it also shows up in the description of the semantics of the EXTB. This takes the least significant 2 bits of r[s2] and multiplies by 8 to give values of 0, 8, 16 or 24.

... ~(0xFF<<((r[s2]&3)<<3)) ...
Here, we take the value 111111112 and shift it over 0, 8, 16 or 24 places so that these 8 ones line up with the byte to be stuffed in place. Then we invert the value making ones where there were zeros and zeros where there were ones. This is the value that is anded with the destination register in order to set the selected byte of the destination to zero.

... ((r[s1]&0xFF)<<((r[s2]&3)<<3)).
Here, we take the least significant byte of r[s1] and shift it over 0, 8, 16 or 24 places so that it lines up with the byte to be stuffed in place. Finally, we or it into place.

Again, in rather long winded English, this says that if the least significant 2 bits of the byte pointer Rp are 00, STUFFB Rt,Rs,Rp will replace the least significant byte of the temporary register Rt with the least significant byte of the source register Rs. If the least significant 2 bits of Rp are 01, it will replace the second to the least significant byte of Rt with the least significant byte of Rs. If the least significant 2 bits of Rp are 10, it will replace the second to the most significant byte of Rt with the least significant byte of Rs and if the least significant 2 bits of Rp are 11, it will replace the most significant byte of Rt with the least significant byte of Rs.

The net result of all of this is that the instruction STUFFB Rt,Rs,Rp will replace the byte in the temporary register Rt selected by the least significant bits of the pointer register Rp with the byte taken from the source register Rs. If Rt is vacant, and Rs contains a byte we wish to store in the memory location pointed to by Rp, we can first load the word pointed to by Rp into Rt, then stuff the byte into place in that word, and then store the result back to memory, as follows:

Stuffing a byte into memory
; given:  R3, a byte pointer and R4, a byte to store
        LOADS   R1,R3           ; get the word containing the desired byte
        STUFFB  R1,R4,R3        ; insert the desired byte into that word
        STORES  R1,R3           ; put the modified word back in memory
; result:  R1 now holds a copy of the updated word, M[R3] = R4

Byte manipulation on the Hawk is cumbersome enough that we could write macros to hide these somewhat strange instruction sequences. The macros given here will not be used later, but are given for the sake of example. The naming convention we have used is that LOADB and STOREB manipulate bytes, and the suffix S indicates that we are using short-indexed addressing. These definitions assume that register 1 can be used as a temporary location for byte stuffing:

Byte manipulation macros
MACRO    LOADBS =dst,=x
  LOADS  dst,x
  EXTB   dst,dst,x
ENDMAC

MACRO    STOREBS =dst,=x
  LOADS  R1,x
  STUFFB R1,dst,x
  STORES R1,x
ENDMAC

Exercises

f) Given that the hardware interprets R0 as the constant zero in this context, what does STUFFB R3,R0,R0 do?

g) Write a LOADB macro that behaves like LOAD r,x,disp except that it loads a single byte into the destination register instead of loading a word. This instruction may have to compute a destination address in a register.

A String Copy Subroutine

In setting out to write a Hawk assembly language version of code equivalent to the strncpy() routine from the C standard library, we can start by writing this in a high level language. This routine centers on a loop that copies one character per iteration, with two exits from this loop, one if the character copied was a null, marking the end of string, and one exit if the copying reached the end of the memory allocated to the destination. Two exit loops are usually easier to read if they are coded with the exits standing on an equal basis, wo we will write this as an infinite loop with explicit break statements marking the loop exits. The following code is written with no complex statements, in the hope that each assignment is simple enough to be replaced by a single machine instruction.
C code for strncpy(dst,src,len)
char * strncpy( char * dst, char * src, unsigned int len )
{
        char ch;
        char * return_value = src;
        while (TRUE) {
                if (len == 0) break;
                ch = *src;
                *dst = ch;
                if (ch = '\0') break;
                src = src + 1;
                dst = dst + 1;
                len = len - 1;
        }
        return return_value;
}

The first step in planning to convert the above code to Hawk assembly language is to work out how we plan to use registers and the activation record, if any. On entry, our convention is that parameters are passed in registers 3 and up. In this case, this means passing dsc in register 3, src in register 4, and len in register 5. We can use register 6 to hold the copy of dst that we have called return_value, and register 7 to hold the character we are copying, ch.

Our convention is that registers 3 to 7 can be used freely in any subroutine, so at first glance, it appears that we do not need any variables in an activation record. In fact, however, we need one more register in order to hold the copy of the word in memory that we are modifying with the STUFFB instruction. Here, we will use register 1, but that means we must save the return address in memory, so we do need a one-word activation record for that. With this worked out, we can now write code:
The STRNCPY subroutine
; activation record format for STRNCPY string copy routine
RA      =       0       ; the return address
ARSIZE  =       4       ; size of activation record in bytes

;       receiving sequence and assumptions
STRNCPY:        ; R3 holds dst, the address of the destination
                ; R4 holds src, the address of the source
                ; R5 holds len, the maximum length to copy
                ; R6 and 7 are available for any use
                ; R8-15 must be saved and restored if we use them
        STORES  R1,R2

        MOVE    R6,R3           ; return_value = src
STNCLP:                         ; while (TRUE) do {
        TESTR   R5
        BZS     STNCLX          ;   if (len == 0) break
;*
        LOADS   R7,R4
        EXTB    R7,R7,R4        ;   ch = *src

        LOADS   R1,R3
        STUFFB  R1,R7,R3
        STORES  R1,R3           ;   *dst = ch

        TESTR   R7
        BZS     STNCLX          ;   if (ch == '\0') break

        ADDSI   R4,1            ;   src = src + 1
        ADDSI   R3,1            ;   dst = dst + 1
        ADDSI   R5,-1           ;   len = len - 1

        BR      STNCLP
STNCLX:                         ; }

        MOVE    R3,R6
        LOADS   R1,R2
        JUMPS   R1              ; return return_value

The above code is far from optimal for more reasons than one. First and foremost, like our machine-code version of the strlen() routine, it accesses each word of memory 4 times, once per byte. We can certainly do better if we can find some way to process bytes in blocks of four instead of one byte at a time.

Second, even if we ignore the first inefficiency, we can improve this code by making several local optimizations. Note, for example, that the ADDSI R5,-1 instruction right before the end of the loop sets the condition codes to report on the result. We can therefore replace the BR instruction at the end of the loop with a BNZ instruciton, and move the loop-top label down to the point marked with an asterisk in the code. This eliminates 2 of the 13 instructions per iteration, giving better than a 10% improvement.

We can also eliminate the TESTR R7 instruction in midloop, because the EXTB sets the condition codes to report on the byte extracted, and none of the instructions between that instruction and the TESTR instruction touch the condition codes. This is one reason that each load instruction on the Hawk comes in two flavors, one that sets the condition codes (LOADSCC) and one that does not (LOADS).

Finally, there is one little optimization we can make, using register 6 as our working pointer to the destination string, so that we never modify the copy in register 3, eliminating one instruction outside the loop. Because this optimization has no impact on the loop itself, the only reason to pursue this optimization is if memory is in extremely short supply.

Exercises

h) Incorporate all of the suggested optimizations into STRNCPY.

i) Write Hawk code for strncat() that first finds the end of the destination string using logic torn from STRLEN and then copies the source string into place using logic borrowed from STRNCPY.

j) Can you modify the STRNCPY routine so that it still conforms to the documented assumptions in the comments with the the receiving sequence while eliminating the use of the activation record on the stack? This will require using some register other than R1 for the temporary used for byte stuffing. If you cannot do this, explain why.

k) Rewrite STRNCPY so that it uses array addressing instead of pointer arithmetic to operate on consecutive elements of the source and destination strings.

Testing the String Processing Subroutines

A simple test of the above routines would be to print the copy of a string. Performing this test raises some new programming issues which we can explore in a high level language before we explore them in the Hawk assembly language. Here is, perhaps, the simplest test program for these routines, written in C.
Testing strncpy() and strlen()
        main()
        {
                char buffer[12];
                buffer[11]='/0';
                dspat( 6, 1 )
                dspstr( strncpy( buffer, "String!", 11 ));
                dspat( 1, 1 )
                dspnum( strlen( buffer) );
        }

Programmers accustomed to working in object oriented languages may not be comfortable with having to explicitly declare the memory into which the string is copied; we have done that here with a variable called buffer which is a local variable of the main program. This is an array of 12 char objects, which is to say, 12 bytes, indexed 0 to 11. We have set the final byte to zero because strncpy only copies the zero terminator of strings shorter than its limit. In our test program, we have set this limit to 11 so that strncpy will only change buffer[0] to buffer[10], leaving the zero in buffer[11] as the terminator for long strings while the zero copied by strncpy serves as the terminator for short strings.

The word buffer is itself a source of confusion for some beginning programmers. This word comes to computer science from the field of electrical engineering. A buffer in electrical engineering is an electrical circuit, typically an amplifier, that passes its input voltage through to its output without allowing perturbations in its output to pass back through to its input. In programming, the term buffer has come to refer to an area of memory holding a copy of a value, allowing changes to be made to the copy without damage to the original. The term is used primarily in the context of input or output, but buffering is also used between stages in a computation.

The above test program tests only one string; a complete test should really test a variety of string lengths, some longer than the buffer and some shorter. We could simply re-assemble or re-compile our test program over and over with different strings, but we might be wiser to use a for loop to copy successively longer strings into the buffer. That option is left as an exercise. Here is the Hawk version of our simple test program:
Testing strncpy() and strlen()
; activation record format for MAIN
BUFFER  =       0       ; an array of 12 bytes
ARSIZE  =       12

; entry point for MAIN and assumptions
MAIN:
        LOAD    R2,PSTACK       ; initialize stack pointer
        STORE   R0,R2,BUFFER+8  ; buffer[8-11] = "\0\0\0\0"

        LIS     R3,6
        LIS     R4,1
        ADDI    R2,ARSIZE
        LOAD    R1,PDSPAT
        JSRS    R1,R1
        ADDI    R2,-ARSIZE      ; dspat( 6, 1 )

        LEA     R3,R2,BUFFER
        LEA     R4,STRING
        LIS     R5,11
        ADDI    R2,ARSIZE
        JSR     R1,STRNCPY
        ADDI    R2,-ARSIZE
        ADDI    R2,ARSIZE
        LOAD    R1,PDSPSTR
        JSRS    R1,R1
        ADDI    R2,-ARSIZE      ; dspstr(strncpy(buffer,string,11));

        LIS     R3,1
        LIS     R4,1
        ADDI    R2,ARSIZE
        LOAD    R1,PDSPAT
        JSRS    R1,R1
        ADDI    R2,-ARSIZE      ; dspat( 1, 1 )

        LEA     R3,R2,BUFFER
        ADDI    R2,ARSIZE
        JSR     R1,STRLEN
        ADDI    R2,-ARSIZE
        ADDI    R2,ARSIZE
        JSR     R1,DSPNUM       ; dspnum( strlen( buffer ));
        ADDI    R2,-ARSIZE

        LOAD    R1,PEXIT
        JSRS    R1,R1           ; exit()

STRING: ASCII   "String!",0
        END

The above Hawk code assumes that the source file also includes our DSPNUM routine along with the STRCPY and STRLEN functions. This code has not been optimized at all, and as a result, it includes many changes to the stack pointer that could easily be eliminated. They are left in the code here in order to emphasize the rules for calling sequences we have been following.

This is the first main program we have presented that uses an activation record of its own. This activation record differs from those of called subroutines in only one way: It contains no return address. Aside from this, addressing of local variables of the main program is identical in style to addressing of local variables in other subroutines. The instruction LEA R3,R2,BUFFER loads register 3 with a pointer to the first byte in the local variable BUFFER; in fact, a simple register to register move would have sufficed because the buffer begins at offset zero in the activation record, but the fully un-optimized style we have adopted here does not take advantage of such knowledge, and deliberately demonstrates the general case, code that would have worked no matter where in the activation record we had decided to put the buffer.

Exercises

l) Optimize the test program by removing all unnecessary code to increment and decrement the stack pointer, and by substituting more efficient instructions wherever this is possible.

m) Write a test program that prints successively shorter substrings of "abcdefghijklmnop" using a for loop to select the starting point in the string (so all substrings end with the letter p). This program should copy each substring into an 11 byte buffer with the 12th byte set to zero, print that substring, and print the length of that substring. Without this latter requirement, the entire test could be done without testing STRNCPY; by forcing the use of a short buffer, we can force a reasonable test of the buffer overflow conditions that might occur in this routine.

String Concatenation

The C standard library routine for string concatenation, strncat(src,dst,len), provides an interesting exercise. Recall that this concatenates the null-terminated strings pointed to by src and dst, storing the result in the same buffer that held src and limiting the result to len bytes, and returning the pointer to the result. This routine can be written from scratch, without using strlen() or strncpy(), or it can be written as a very small routine based on them, using strlen() first to find the length of the first string, and then using strncpy to copy the second string onto the end of the first, up to the end of the buffer. In C, this can be done as follows:

C code for strncat(dst,src,len)
char * strncat( char * dst, char * src, unsigned int len )
{
        int dstlen = strlen(dst);
        if (dstlen < len) {
                strncpy( dst + dstlen, src, len - dstlen );
        }
        return dst;
}

This approach to concatenating strings is an excellent test of the strlen() and strncpy() routines, and in addition, it is a good way to illustrate, at the assembly language level, how we can call two routines we have already written from within another routine.

If we intend to put all of the routines in our string package in the same source file, we will clearly have a problem with names that are the same! Most of the C string routines begin with STR and with an explicit parameter giving the buffer length have names that begin STRN. As a result, the unique prefix for STRNCAT could be no shorter than STRNC, but this is also the prefix on the name STRNCPY. Fortunately, our preliminary versions of these routines are very short, needing only a few labels; our initial version of STRNCAT, for example, needs only one label to mark the end of the if statement that copies the second half of the string into place. We will use the name STRNCX for this, using the convention that the suffix X is short for exit. We would probably have to switch to longer identifiers fairly quickly if we wanted to put all of the string routines in the same source file!

While our STRLEN and STRNCPY routines did not use their activation records, our STRNCAT routine will need its activation record to hold all of the parameters through the call to STRLEN and to save the destination buffer address through the call to STRNCPY. In addition, of course, we must also save the return address for STRNCAT. This leads to the following code:

Hawk code for STRNCAT
; activation record format for STRNCAT string concatenation routine
RA      =       0       ; the return address
DST     =       4       ; destination string address
SRC     =       8       ; source string address
LEN     =       12      ; length of destination buffer (unsigned)
ARSIZE  =       16

;       receiving sequence assumptions
STRNCAT:        ; R3 holds DST
                ; R4 holds SRC
                ; R5 holds LEN
                ; R6-7 contain nothing of value
                ; R8-15 must be saved and restored if used
        STORES  R1,R2           ; save return address
        STORE   R3,R2,DST
        STORE   R4,R2,SRC
        STORE   R5,R2,LEN       ; parameters are saved

        ADDI    R2,R2,ARSIZE
        JSR     R1,STRLEN
        ADDI    R2,R2,-ARSIZE   ; dstlen = strlen( dst )

        LOAD    R5,R2,LEN
        SUB     R5,R5,R3        ; R5 = len - dstlen
        BLEU    STRNCX          ; if (len > dstlen) {

        LOAD    R4,R2,DST       ;    note, R3 still holds dstlen
        ADD     R3,R3,R4        ;    note, R5 still holds len - dstlen
        LOAD    R4,R2,SRC
        ADDI    R2,R2,ARSIZE
        JSR     R1,STRNCPY
        ADDI    R2,R2,-ARSIZE   ;    strncpy( dst+dstlen,src,len-dstlen )
STRNCX:                         ; }

        LOAD    R3,R2,DST
        LOADS   R1,R2
        JUMPS   R1              ; return dst

The above code can be optimized significantly, but some optimization has already been done. For example, we have computed the difference between the buffer length and the destination string length once, keeping it in register 5, while our high-level language source code suggests that the comparison of these two quantities and the later subtraction might be two different operations. And, we never allocated space in the activation record for the result returned by STRLEN, but instead, we retained this in register 3 until it was no longer needed.

There are several additional optimizations we could make. The first of these involves optimizing the space required for the activation record on the stack. Everything we have saved in the activation record is needed after the first call from within this routine, the call to STRLEN but only some of this information is still relevant after the call to STRNCPY. As a result, we could shorten the activation record during that second call. In the context of this code, this saves nothing, but in a recursive function, careful optimization of the information saved during a recursive call can lead to significant improvements in memory usage.

The second optimization we could make eliminates unneeded changes to the stack pointer between the two recursive calls. This change is a bit tricky because the second call is inside an if statement while the first call is outside that if statement. As a result, this optimization cannot be combined with the first optimization suggested above! This forces us to make a choice! Do we want to optimize for speed, eliminating as many instructions as possible, or do we want to optimize for efficient use of space on the stack? In this case, the answer is clear, since in fact, neither of the routines called by this routine make much use of the stack.

In fact, all of the above discussion was based on ignorance of the real uses made of the stack by the STRLEN and STRNCPY routines. If those routines had been written by other programmers and were likely to be changed, it would be fair to keep the code we have above, but if those routines were completely debugged and were part of the same library as this code, we could safely take advantage of our knowledge that STRLEN does not use the stack at all, and that STRNCAT makes only the slightest use of the stack.

Exercises

n) Improve the above code for STRNCAT by minimizing the waste of space on the stack during the call to STRNCPY.

o) Improve the above code for STRNCAT by minimizing the incrementing and decrementing of the stack pointer, keeping the assumption that both called routines are likely to use their activation records.

p) Improve the above code for STRNCAT by taking advantage of all the information you have about the called routines, so that there is no unneeded stack manipulation.

A Big Optimization, Working With Words

As pointed out in the context of both the STRLEN and STRCPY routines, our code copying one character at a time did nothing to take advantage of the fact that the computer we are using is able to work with 32-bit data. If we were dealing with an 8-bit microprocessor, the algorithm we are using would be appropriate, but on a 32-bit machine, we should move data in 32-bit increments whenever possible.

In the mid 1960's, when Control Data Corporation introduced their 6600 family of machines, programmers began to explore how to use operations on such large words to efficiently manipulate text. On that machine, 10 6-bit characters were packed into a 60-bit word, and the machine had less support for byte-addressing than the Hawk. By the mid 1980's, the developers of the Digital Equipment Corporations 64-bit Alpha microprocessor had developed the model used on the Hawk for stuffing and extracting bytes in larger registers, and they had developed the model we will discuss here for doing string operations one word at a time. Although the motivation was pixel manipulation in graphics applications, the Intel Pentium MMX extensions are similarly motivated.

Our basic idea is to treat the words in the string one character at a time only at the beginning and end of the string, while the body of the string is treated one word at a time. Since we are packing 4 characters per word, we will deal with only 0 to 3 individual characters at the start of the string and 0 to 3 individual characters at the end. In doing this, we face two distinct problems: How do we check an address to see if it is a word address, something we must do at the start of the string, and how do we check a word to see if it includes the end-of-string character?

The Hawk provides a distinct solution to each of these problems. First, it has a collection of special instructions for inspecting and testing the least significant bits of a word, and second, it has an addition to the load instructions that sets the C condition code if any byte of the loaded word is zero, so we can easily detect the word containing the end of a string.

The Hawk architecture has a truncate instruction, TRUNC dst,n that preserves only the least significant n bits of a register, setting the others to zero. We used this earlier in this chapter to convert a signed byte loaded from memory into an unsigned byte, but it can also be used to snip the 2 least significant bits from an address, setting the Z condition code if these are zero.

In writing a fast string routine, what we really want is a way to test the least significant bits of an address and go to one of 4 different cases depending on the values of those bits. The Hawk architecture has a specialized instruction to do exactly this, BTRUNC. This truncates a register and then, instead of putting the truncated result in a register, it adds the truncated result to the program counter, so that the next instruction will be selected from among the instructions immediately following the BTRUNC instruction. Usually, the next few instructions are branch instructions.

The tricks we will use to write a fast STRLEN routine cannot be easily expressed in languages like C, and in fact, assembly language programmers are still employed to write carefully crafted and very fast assembly-language versions of key routines such as those in our string library and those in similar pixel-oriented graphics libraries. Here is our solution for STRLEN:
Hawk code for STRLEN
;       receiving sequence assumptions for STRLEN (SLN)
STRLEN:         ; R3 holds src, the address of the string
                ; R4 and 5 hold nothing of value on entry
                ; R4 will be used to extract characters
                ; R5 holds initial value of R3
        MOVE    R5,R3           ; remember starting place
        LOADSCC R4,R3           ; get first word of string
        BTRUNC  R3,2            ; test what byte we are starting with
        BR      SLNB0           ;  byte 0 - can work with entire words
        BR      SLNB1           ;  byte 1
        BR      SLNB2           ;  byte 2   (Note, the condition codes
        BR      SLNB3           ;  byte 3    are as left by LOADSCC!)

SLNB1:  EXTB    R0,R4,R3        ; extract and test byte 1
        BZS     SLNQT           ; quit if null
        ADDSI   R3,1            ; advance to byte 2
SLNB2:  EXTB    R0,R4,R3        ; extract and test byte 2
        BZS     SLNQT           ; quit if null
        ADDSI   R3,1            ; advance to byte 3
SLNB3:  EXTB    R0,R4,R3        ; extract and test byte 3
        BZS     SLNQT           ; quit if null
        ADDSI   R3,1            ; advance to full word
        LOADSCC R4,R3           ; get next full word

SLNB0:  ; assert R3 divisible by 4, R4=M[R3], LOADSCC set the C bit
        BCS     SLNLX           ; some byte in this word is zero
SLNLP:
        ADDSI   R3,4            ; advance to next full word
        LOADSCC R4,R3           ; get next full word
        BCR     SLNLP           ; iterate if all bytes nonzero
        
SLNLX: ; assert some byte in R4 is zero, R4=M[R3], R3 is divisible by 4
        EXTB    R0,R4,R3        ; test first byte
        BZS     SLNQT
        ADDSI   R3,1            ; advance to full word
        EXTB    R0,R4,R3        ; test second byte
        BZS     SLNQT
        ADDSI   R3,1            ; advance to full word
        EXTB    R0,R4,R3        ; test third byte
        BZS     SLNQT
        ADDSI   R3,1            ; it must be the final byte

SLNQT:  ; assert R5 points to first byte, R3 to final byte
        SUB     R3,R5,R3        ; compute the string length
        JUMPS   R1              ; return

Note that the inner loop here tests 4 bytes per iteration using only 3 instructions per iteration. Thus, for long strings, this STRLEN subroutine will count the bytes using fewer instructions than there are bytes in the string being tested. At the start and end of the string, we pay 3 instructions per byte. Compared with the uniform 4 instructions per byte we paid in the optimized version of the byte-oriented code, the overall improvement is quite significant.

It is significantly harder to apply these optimizations to STRNCP because we must consider not only how the source string is broken into words but also how the destination buffer is broken up. If the source and destination addresses are equal in the least significant two bits, we can use the logic given for STRLEN, but there are three other cases we might want to handle with more arcane code.

Exercises

q) Write a version of STRNCPY that uses the original code given earlier if the source and destination pointers are aligned differently, and uses logic based on the above version of STRLEN if the source and destination pointers are aligned identically.

r) Write a version of STRNCPY that uses halfword stuffing and halfword extraction, so that it can copy strings when the source and destination addresses are off by 2 from each other. How many instructions per byte copied does it average for long strings.