7. Bytes, Halfwords and Text.

Part of CS:2630, Computer Organization
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

A Programming Problem, Text Manipulation

The Hawk monitor borrows its convention for recording the length of text strings from the C programming language. This is used in the PUTS and GETS routines, and it also supports string processing routines based on routines in the C standard library, defined in the header file string.h. The full string library of C is fairly large, while the string library for the Hawk is much smaller. Both include the following routines:

STRLEN — measure the length of a null-terminated string
Parameters: R3 — address of the first byte of the string.
Return value: R3 — length of the string.
The length does not include the trailing NUL character.

C equivalent: len = strlen(str)

STRCPY — string copy with no length limit
STRNCPY — string copy with a strict length limit
Parameters: destination = R3, source = R4, length = R5
Return value: R3 — the destination (unchanged)
The source register points to the string to copy.
The destination register points to the the memmory block that will hold the copy.
The length, used only with STRNCPY, is the size of the block of memory, in bytes.

C equivalents: strcpy(dst, src); strncpy(dst, src, length)

STRCAT — string concatenate with no length limit
STRNCAT — string concatenate with a strict length limit
Parameters: destination = R3, source = R4, length = R5
Return value: R3 — the destination (unchanged)
The source register points to the string to append.
The destination register points to the string to which it will be appended.
The length, used only with STRNCAT, specifies the maximum number of characters to be appended to the destination.

C equivalents: strcat(dst, src); strncat(dst, src, length)

These are much weaker string management tools than the ones in languages like Java and Python. In C or on the Hawk, the string routines do not allocate memory for the result, so before calling one of the string copy or concatenation routines, programs must first take care of that.

The presence of two different versions of several string routines is a frequent source of confusion. The strcpy and strcat routines are older. They were documented in the first edition of Kernighan and Richie's book The C Programming Language (1978). These routines are dangerous because there is no limit on the size of the result string. By the time the 1983 print edition of the Unix Programmer's Manual was published by Bell Labs, strncpy and strncat were added to the library.

C (and C++) programmers using the Unix, Linux and MacOS command line interface can get full documentation for any routine in the C standard library with the man shell command. For example, to get the official documentation for strlen, use man strlen. The C Programming Language by Kernighan and Ritchie (Second Edition, Prentice Hall, 1988) also documents these routines, as do numerous web sites (many of them merely reprint the description given by the man command).

The man command gives access to the on-line Unix Programmer's Reference Manual When a library routine has the same name as a shell command, use man 3 strlen — the number refers to the section number in the manual. Section 1 is shell commands, section 2 is the operating system interface, and Section 3 is the C standard library.

Here is a little C program demonstrating the use of these routines to construct the string "Hello world!" and a SMAL Hawk equivalent:

Demonstrating C string handling
#include <stdio.h>
#include <string.h>
char buffer[15]; /* a place to put the text */

int main() {
    strncpy(buffer,"hello, ",15);
    puts(buffer);    /* output the accumulated text */
    return 0;

The first thing that stands out in this program is the explicit allocation of memory to hold the string being constructed. Here, the array buffer can hold any string of up to 15 characters, including the trailing NUL. In both of the above examples, buffer is declared as a global variable, but it would work just as well if it had been declared as a local variable.

The term buffer has a long history. In many fields, a buffer is something that cushions against rapid change. So, for example, a chemical buffer prevents changes in pH despite addition of small amounts of additional acid or base to a solution. A buffer amplifier protects a signal source from being altered by the load being driven; if you short circuit a buffered output, for example, you may damage the buffer, but the signal source is protected.

In computing, a buffer is a block of memory used to hold data between when it is produced and when it is consumed. Without a buffer, each item produced must be consumed immediately. With a buffer, data can be produced first (here, by calling strncpy and strncat) and consumed later (here, by calling puts).

Demonstrating Hawk string handling
	USE	"hawk.h"
	USE	"stdio.h"
	USE	"string.h"

	COMMON	BUFFER,15	; -- a place to put the text


	LIS	R5,15
	JSRS	R1,R1		; strncpy(buffer,"hello, ",15)

	LIS	R5,7
	JSRS	R1,R1		; strncat(buffer,"world!",7)

	JSRS	R1,R1		; puts(buffer) -- output the text


HELLO:	ASCII	"hello, ",0
WORLD:	ASCII	"world!",0

In this example, the buffer was sized to be a little larger than the total length of the string that was computed. Because we start with an empty buffer, the limit on strncpy was given as 15, the same as the buffer length. The copy operation puts 8 characters in the buffer, 7 printable plus a trailing NUL, so there is only space for 7 more characters with the first to strncat.

The problem with strcpy and strcat is that they impose no limit on the number of characters in the destination string. If a fixed size buffer is allocated for the destination and then oversized strings are put in it, nothing in the implementation of strncpy or strncat prevents copying too much, leading to a Buffer overrun error. The strncpy and strncat routines were added to help prevent these errors, but it is still up to the programmer to pass appropriate limits, and setting these limits carelessly still results in buffer overflows.

In languages like Python and Java, the buffer to hold the concatenated strings is allocated automatically and automatically sized to hold the concatenated strings. We can do the same thing in C or assembly language, but to do so, we need access to additional routines in the standard library, with interface definitions in stdlib.h:

MALLOC — allocate a block of memory
Parameters: R3 — size of the block to allocate, in bytes.
Return value: R3 — points to the allocated block, or NULL.
The name is a contraction of memory allocate.
If a block cannot be allocated, NULL is returned.
The block always begins at a word-aligned address.
The block is always in a region of memory called the heap.

C equivalent: ptr = malloc(size)
Note that the return type is void*.
A pointer of type void* points to nothing in particular.

FREE — deallocate a block of memory
Parameters: R3 — pointer to the block to be returned to the heap.
Return value: None
The block must have previously been allocated by malloc.
It is an error to attempt to free an already free block.
These errors cannot always be detected.

C equivalent: free(ptr)

Note that memory on the heap is not free. Allocating a block of memory on the heap takes time because it requires a search for a sufficiently large block of free space. And, the heap must contain not only blocks memory but also, for each block, an indication of its status and a record of the size of the block. In aggregate, this information will occupy at least one word per block, and algorithms to speed the search for an appropriate free block typically require more information to be stored with each block.

Beware of the term heap because it has two distinct meanings in computer science. Here, we refer to a region of disorderly memory use as a heap, as opposed to a stack, where allocation is in a neat last-in first-out order. The other meaning, which does not concern us here is a heap-ordered tree, where the root of the tree or subtree has a value greater than any value below it. Heap-ordered trees are used in algorithms such as heapsort.

In languages like Python and Java, all strings are allocated on the heap. Here are C routines for string processing that do approximately what those languages do for string copy and concatenation operations:

C string handling on the heap
char * string_copy(char * s) {
    char * buf = (char *)malloc(strlen(s) + 1);
    return buf;

char * string_concatenate(char * s1, char * s2) {
    char * buf = (char *)malloc(strlen(s1) + strlen(s2) + 1);
    return buf;

Notice here that we used strcpy and strcat instead of strncpy and strncat. There are frequent warnings against the use of strcpy and strcat because of the risk of buffer overruns, but they are safe to use here because the freshly allocated buffer is always sized precisely to hold the strings being copied or concatenated.

There is one important feature of the Java and Python programming environment that is not part of C, C++ or our Hawk programming environment: Garbage collection. Java and Python include a garbage collector to automatically reclaim memory on the heap when it is no-longer accessible. Whether in C or under the Hawk monitor, programmers must track all memory objects allocated by malloc and remember to call free for each object when it is no longer needed. Failure to do so creates a memory leak where it appears that the free memory in the heap is leaking away. Memory leaks are a minor issue in short-lived toy applications, but they can be extremely problematic in applications that are expected to run or hours or days.


a) The strncpy and strncat routines return their first argument as their return value. This feature can be used to write more compact string handling code. Rewrite the following part of the string handling Hello World program to demonstrate this. The result should be a single expression with just one mention of buffer:

                strncpy( str, "hello, ", 15 );
                strncat( str, "world", 7 );

b) Rewrite (and test) the Hello World program so that it uses malloc to allocate the buffer and then frees it after use. Do this with the SMAL Hawk version as well.

c) Rewrite (and test) the Hello World program so that it uses string_concatenate to do the concatenation and then frees the result string after it is no longer needed. Do this with the SMAL Hawk version as well.

Loading a Byte from Memory

Before we can translate strcpy and strcat to assembly language, we need a way to load individual bytes from memory. As we saw in Chapter 2, Hawk memory addresses allow us to give a distinct address to each byte of memory, but the variations on the LOAD and STORE instructions introduced in Chapters 4, 5 and 6 all ignore the two least significant bits of the memory address and load or store a full 32-bit word.

The IBM System/360 introduced support for 8-bit bytes, 16-bit halfwords and 32-bit words in 1965. This machine had load and store instructions for each operand size: load byte, load halfword and load word, along with store byte, store halfword and store word. This poses two problems:

First, if we commit to this idea, we need 6 opcodes for variants on load and store, and it is hard to find spare opcodes for these on the Hawk.

Second, loading and storing fractional words involves extra work. Either the CPU or the memory must be more complex than it would be if all loads and stores were for whole and properly aligned words.

These considerations lead to the decision on the Hawk, as well as on several other reduced instruction set complexity architectures to separate the mechanisms for memory access form the mechanisms for byte manipulation. On these machines, to load a byte from memory, you first load the word holding that byte and then use a second instruction to extract the desired byte from that word.

The Hawk architecture offers two special instructions for extracting bytes and halfwords from full-word operands. These are designed for use after LOADS when the desired operand is a byte or halfword. EXTB is for extracting bytes; EXTH is similar, but operates on halfwords.

The Hawk EXTB instruction
07 06 05 04 03 02 01 00   15 14 13 12 11 10 09 08
0 1 0 1 dst s1 s2

Formally, this does r[dst]=(r[s1]>>((r[s2]&3)<<3))&255. This formal notation is based entirely on the syntax common to C, C++, Java and Python, but it is not easy to read. Here an explanation:

r[dst]=(r[s1] ... )&255
A byte is taken from the register designated by the s1 field and moved to the destination register. The high 24 bits of the result are zero. Recall that 25510 = 111111112.

r[dst]= ... r[s1]>>((r[s2] ... ) ... ) ...
The register designated by the s2 field determines the byte extracted. (Recall that >> means right-shift, so x>>8 means x/256.)

... (R[S2] ... )<<3 ...
The shift count for byte selection is a multiple of 8. (x<<3 is the same as x×8.)

... r[s2]&3 ...
The shift count depends only on the least significant two bits of the register designated by the s2 field. Not that 310 is 112, and anding with 112 sets all high bits to zero. The 2 bits that survive are the bits that select the byte within the word.

... (r[s2]&3)<<3 ...
Multiplying 0, 1, 2 or 3 by 8 gives shift counts of 0, 8, 16 or 24.

In long-winded English, if the least significant 2 bits of the byte pointer Rp are 00, EXTB Rd,Rs,Rp will load the destination Rd with the least significant byte of the source register Rs. If those 2 bits of Rp are 01, Rd will be loaded with the second to the least significant byte of Rs. If those 2 bits are 10, Rd will be loaded with third byte of Rs and If those 2 bits are 11, Rd will be loaded with the top byte of Rs. The result is always truncated to just 8 bits in the destination register.

As a result, if Rp contains a pointer, or memory address, and Rs contains a word that was loaded using that pointer, then the instruction EXTB Rd,Rs,Rp will pick, from Rs the byte pointed to by Rp and put that byte in Rd. The result will be in the range 0 — 255. So, if Rp contains a pointer to a byte, we can load that byte into a register as follows:

Loading an unsigned byte from memory
; given:  R4, a pointer to a byte in memory
        LOADS   R3,R4           ; get the word containing the desired byte
        EXTB    R3,R3,R4        ; extract the desired byte from that word
; result:  R3 now holds the desired byte; R3=M[R4]

EXTB and EXTH both set the condition codes to report on whether the extracted byte or halfword is zero or nonzero, so we can easily test for end-of-string as we extract bytes from a string during a string operation.

What if we wanted to extract signed halfwords in the range –32768 — +32727 or signed bytes in the range –128 — +127? The Hawk solves this problem with a special sign-extend instruction. Recall that, to sign extend a two's complement number, you duplicate the sign bit to widen that number to the desired number of bits. For example if the the 4-bit numbers 1010 and 0101 are sign extended to 8 bits, the results will be 11111010 and 00000101. This preserves the value of the number while enlarging its representation. The above two's complement numbers, in either their 4-bit or 8-bit forms have the decimal values –6 and +5.

The Hawk sign extend instruction, SXT, takes a register and a count of the number of bits to be preserved. SXT R3,8 extends the value in R3 by replacing its top 24 bits with whatever was in bit 7. Another instruction, TRUNC, has the same format, but takes a signed quantity and forces all of the higher bits to zero; this is called truncation.

Loading a signed byte from memory
; given R4, a pointer to a byte in memory
        LOADS   R3,R4           ; get the word containing the desired byte
        EXTB    R3,R3,R4        ; extract the desired byte from that word
        SXT     R3,8            ; force a signed interpretation on it
; result R3 now holds the desired byte, sign extended to 32 bits


d) Given that R1 holds ABCDEF16 right before each of the following instructions, what value is left in the register after each of them?

  SXT   R1,8?
  TRUNC R1,8?
  SXT   R1,6?
  TRUNC R1,6?
  SXT   R1,5?
  TRUNC R1,5?

You can check your work empirically with a Hawk emulator.

e) If the instruction EXTB R4,R3,R0 is used (In the context of this instruction, register zero is the constant zero), what is the result in R3 if its initial is ABCDEF16? If its initial value is 12345616? What useful function does this particular variant of EXTB do?

Array Addressing

Consider the problem of writing code to access an element a[i] of an array a of characters. As in C, we'll assume that the integer i is within the array bounds, so we don't need to worry about bounds checking. Bounds checking is, after all, just a matter of adding if statements to the code that compare i with the upper and lower array bounds. These if statements are implicitly there before every single use of array indexing in languages like Java and Python. To evaluate a[i], we proceed as follows:

  1. Get the address of a, that is, the address of a[0].
  2. Add the product of the subscript i and the size of an array element.
  3. Use the resulting pointer to reference the indicated element.

In C or C++, the expression a[i] is always equivalent to *(a+i). In these languages, the unary * operator means "follow the pointer to the object it references in memory." Also, in C and C++, an array name is almost the same thing as a pointer to the first element of the array, and adding an integer to a pointer implicitly multiplies the integer by the size of the referenced object before adding. In the case of arrays of characters in C, each character is just one byte so the multiplication is trivial. As a strange consequence of this, the C expression &a[i] is equivalent to a+i, where the unary & operator returns the address of its operand.

On the Hawk, the code for a[i] depends on whether a is a local variable, an array of constants embedded in the program text (such as the "hello world" array), or a global variable.

The expression A[i] for an array of bytes
; assume A is a local array of characters, i is in R3
LEA     R4,R2,A         ; get the address of A into R4
ADD     R4,R3,R4        ; R4 now holds the address of A[i]
LOADS   R3,R4           ; R3 now holds the word at this address
EXTB    R3,R3,R4        ; R3 now holds the the array element A[i]
; assume A is a constant array of characters, i is in R3
LEA     R4,A            ; get the address of A into R4
ADD     R4,R3,R4        ; R4 now holds the address of A[i]
LOADS   R3,R4           ; R3 now holds the word at this address
EXTB    R3,R3,R4        ; R3 now holds the the array element A[i]
; assume A is a global, allocated using COMMON, i is in R3
LIL     R4,A            ; get the address of A into R4
ADD     R4,R3,R4        ; R4 now holds the address of A[i]
LOADS   R3,R4           ; R3 now holds the word at this address
EXTB    R3,R3,R4        ; R3 now holds the the array element A[i]

The above bits of code differ only in the instructions that load the address of the array. Stack pointer or R2-relative addressing is used for local variables; implicit PC-relative addressing is used for constants in the code, and direct addressing is used for global variables. For arrays of words, we can use a simple LOADS to load each array element, but we must first multiply the array subscript by the word size in bytes:

The expression A[i] for a local array of words
; assume A is a local array of words, i is in R3
LEA     R4,R2,A         ; get the address of A into R4
SL      R3,2            ; multiply R3 by 4 by shifting 2 places left
ADD     R4,R3,R4        ; R4 now holds the address of A[i]
LOADS   R3,R4           ; R3 now holds the word at this address
; assume A is a local array of words, i is in R3
LEA     R4,R2,A         ; get the address of A into R4
ADDSL   R3,R4,2         ; multiply R3 by 4 and add R4 to the result
LOADS   R3,R3           ; R3 now holds the word at this address

In the above code fragments, the SL R3,2 instruction shifts the contents of R3 left two places, throwing away any bits shifted out the left end of the register and padding with zeros on the right. This is equivalent to multiplication by four; in general, SL R3,n multiplies by 2n, and because the sequence of shifting and then adding is common, the instruction ADDSL is available to combine these functions.

Both fragments coded for arrays of words illustrate only the case where the array is local, a part of the activation record. The problem of loading the starting address of the array does not depend on the size of the array elements, so the examples from the previous figure cover what is needed for constant or for global arrays.

Counting the Bytes in a String

We have now covered enough ground to write a subroutine to count the bytes in a null terminated string. In a high level language like C or C++, the following code works:

C code for strlen using arrays
unsigned int strlen( const char str[] ) {
    int i = 0;
    while (str[i] != 0) i = i + 1;
    return i;

This first version of the code views the string to be measured as a constant array of characters. Each iteration of the loop examines but may not change one character in this array, testing to see if it is the null character that marks the end of the string. Using the material on array addressing from the previous section, we can translate this code directly to Hawk assembly language as follows:

Hawk code for STRLEN using arrays
;       receiving sequence assumptions for STRLEN (SLN)
STRLEN: ; expects R3 -- str, the (constant) string's address
        ; uses    R4 -- the address of src[i]
        ;         R5 -- used to extract characters
        ;         R6 -- holds i, the count of the length
        ; returns R3 -- the length of the string str
        LIS     R6,0            ; i = 0;
SLNLP:                          ; while (TRUE) {
        ADD     R4,R3,R6        ;   -- compute address of src[i]
        LOADS   R5,R4           ;   -- get word holding src[i]
        EXTB    R5,R5,R4        ;   -- extract the byte src[i]
        BZS     SLNLX           ;   if (src[i] == '\0') break;
        ADDSI   R6,1            ;   i = i + 1;
        BR      SLNLP
SLNLX:                          ; }
        MOVE    R3,R6
        JUMPS   R1              ; return i in R3

In C and C++, the name of an array is almost the same as a pointer to its first element. As a result, the declarations char str[] and char * str declare the parameter str to have exactly the same type, the address of a character in memory. Since strings in C and C++ are represented by arrays of consecutive characters, the natural way to pass a string to a function is by passing a pointer to its first character. This feature of the C and C++ languages is sometimes a source of serious confusion Here, though, we can exploit this to it to write an alternative version of strlen() that avoids array indexing:

C code for strlen using pointer arithmetic
unsigned int strlen( const char * ptr ) {
    int i = 0;
    while ( *ptr != '\0' ) {
        ptr = ptr + 1;
        i = i + 1;
    return i;

In the above, note that the asterisk operator applied as a prefix to a pointer causes pointer dereferencing. That is, it causes the computer to follow the pointer to the object it references. In this case, *ptr is the character pointed to by ptr and could also have been written as ptr[0].

Here, note that the declaration const char * ptr declares ptr to be a pointer to a character constant, thus forbidding strlen from making any changes to whatever ptr points to. It does not make the pointer ptr a constant. The C compiler is pretty good at enforcing such restrictions, but when we write assembly code, nothing but comments in the code documents such restrictions.

In C and C++, adding an integer to a pointer produces a new pointer; the amount by which the memory address is incremented depends on the number of bytes occupied by the objects to which the pointer refers. In this case, since we are assuming that a char object occupies one byte, adding one to a char pointer increments the address by one. Adding one to an int pointer would incrment the address by 4 if we assume that int objects occupy 4 bytes. Note the word assume above! The C standard does not fix the number of bytes in objects of any of the standard types. We have covered enough detail now to translate the second version of the above code into Hawk assembly language.

Hawk code for STRLEN using pointer arithmetic
;       receiving sequence assumptions for STRLEN (SLN)
STRLEN: ; expects R3 -- ptr, the (constant) string's address
        ; uses    R4 -- used to extract characters
        ;         R5 -- holds i, the count of the length
        ; returns R3 -- the length of the string ptr
        LIS     R5,0            ; i = 0;
SLNLP:                          ; while (TRUE) {
        LOADS   R4,R3           ;   -- get word holding *ptr
        EXTB    R4,R4,R3        ;   -- extract the byte *ptr
        BZS     SLNLX           ;   if (*ptr == '\0') break;
        ADDSI   R3,1            ;   ptr = ptr + 1;
        ADDSI   R5,1            ;   i = i + 1;
        BR      SLNLP
SLNLX:                          ; }
        MOVE    R3,R5
        JUMPS   R1              ; return i in R3

Note that both versions of this code have exactly the same complexity One has an extra increment in the loop body, while the other must add in the base of the array during each iteration. If the array had been an array of integers, the second version would have been a bit faster, but there are optimizations that can make this code even faster.

Both versions of this code have an important shortcoming: Each block of 4 consecutive iterations of the loop body will load exactly the same data into register 4 prior to extracting a byte. This is inefficient, and later, we will discuss some interesting tricks for eliminating this that have become common on new computers designed since the mid 1980's.

This code can also be tightened by some local optimizations that ignore these larger issues. There are two registers being incremented in the loop body; if we compute the string length from the difference of the initial and final values of the string pointer, we can eliminate one increment. In addition, there are two branches within the loop, one conditional and one unconditional; rewriting the loop so the termination test is at the end of the loop can eliminate one of these. Combining these optimizations, we can reduce the cost of the loop from 6 instructions per iteration down to 4. STRLEN is important enough that we should do this.


f) Improve the above code for strlen and STRLEN by eliminating one increment operator from the loop body, as suggested above.

g) Improve the above code for strlen and STRLEN to use a post-test loop, where in assembly language, this involves a conditional branch back to the top of the loop from the end, as suggested above.

Stuffing Bytes into Words

On machines where all data transfers between processor and memory move 32-bit quantities, storing a byte or halfword into a word in memory involves first loading that word, then replacing the correct byte or halfword in it, and finally storing the modified word back into memory. This is usually true even when the instruction set of the computer offers store byte or store halfword instructions.

In the case of the Hawk, the three steps described above must be carried out explicitly. The first and last steps are easily done with LOADS and STORES instructions. Two special instructions are provided for the middle step, STUFFB to stuff bytes into a register and STUFFH to stuff halfwords. Here is the former:

The Hawk STUFFB instruction
07 06 05 04 03 02 01 00   15 14 13 12 11 10 09 08
0 1 1 1 dst s1 s2

r[dst]=(r[dst]&~(0xFF<<((r[s2]&3)<<3))|((r[s1]&0xFF)<<((r[s2]&3)<<3)), in our formal notation. Unfortunately, as with the EXTB instruction, this formal notation neither compact nor very informative! The following comments should help in understanding it:

... (r[s2]&3)<<3 ...
This subexpression shows up twice above, and it also shows up in the description of EXTB. This takes the least significant 2 bits of r[s2] and multiplies by 8 to give values of 0, 8, 16 or 24.

... ~(0xFF<<((r[s2]&3)<<3)) ...
Here, the value 111111112 is shifted 0, 8, 16 or 24 places so that the 8 ones line up with the byte to be stuffed. The value is then inverted to make ones where there were zeros and visa versa. This is the value that is anded with the destination register to set the selected byte of the destination to zero.

... ((r[s1]&0xFF)<<((r[s2]&3)<<3)).
Here, we take the least significant byte of r[s1] and shift it over 0, 8, 16 or 24 places so that it lines up with the desired byte before we or it into place.

In long winded English, this says that if the least significant 2 bits of the byte pointer Rp are 00, STUFFB Rt,Rs,Rp will replace the least significant byte of the temporary register Rt with the least significant byte of the source register Rs. If the those 2 bits of Rp are 01, it will replace the second to the least significant byte of Rt with the least significant byte of Rs. If those 2 bits are 10, it will replace the third byte of Rt with the low byte of Rs and if those 2 bits 11, it will replace the top byte of Rt with the low byte of Rs.

In effect, the instruction STUFFB Rt,Rs,Rp replaces the byte in the temporary register Rt selected by the least significant bits of the pointer register Rp with the low byte of Rs. Suppose we want to store the low byte of Rs in the memory location pointed to by Rp. To do this, we first load the word pointed to by Rp into a tempory register Rt, then stuff the byte into it before we store the word back in memory.

Stuffing a byte into memory
; given:  R3, a byte pointer and R4, a byte to store
        LOADS   R1,R3           ; get the word containing the desired byte
        STUFFB  R1,R4,R3        ; insert the desired byte into that word
        STORES  R1,R3           ; put the modified word back in memory
; result:  R1 now holds a copy of the updated word, M[R3] = R4

Byte manipulation on the Hawk may appear cumbersome compared to byte manipulation on machines such as the IBM 360, the DEC PDP-11 or the Intel 80x86/Pentium family. On those machines, there is a single instruction to load a byte from memory to register or store a byte from register to memory. The fact that it can be done in one instruction does not, however, make it fast. The instruction may still load an entire word from memory for modification, and in order to support this instruction, every word loaded from memory must pass through an alignment network that slightly slows every memory reference. The designers of some modern machines have opted to expose this complexity in the instruction set, allowing fast access to memory for words at the price of explicit byte extraction. The DEC Alpha processor provided the model for Hawk byte addressing.

Programmers who want to avoid knowing about byte extraction and stuffing can hide these details within macros. The macros given here will not be used later; they are merely examples. The macro names LOADBS and STOREBS indicate that these instructions are kin to LOADS and STORES, with B added to the suffix because they operate on bytes.

The STOREBS macro definition assumes that R1 can be used as a temporary location for byte stuffing. This is compatble with the use of R1 for subroutine linkage, since that use guarantees that R1 will be needed for every call, and therefore, between calls, it will usually be available for short-term use.

Byte manipulation macros
MACRO    LOADBS =dst,=x
  LOADS  dst,x
  EXTB   dst,dst,x

MACRO    STOREBS =dst,=x
  LOADS  R1,x
  STUFFB R1,dst,x


h) Given that the Hawk interprets R0 as the constant zero in this context, what does STUFFB R3,R0,R0 do?

i) Write a LOADB macro that behaves like LOAD r,x,disp except that it loads a single byte into the destination register instead of loading a word. Use R1 as a temporary if needed.

A String Copy Subroutine

Before giving an assembly language version, we will look at a C implementation of the strncpy routine. The central part of our routine will be a loop to copy one character per iteration from the source string to the destination string. This loop has two terminating conditions: It must exit when it has copied the NUL marking the end of the string, and it must exit when the number of characters copied reaches the length limit. Note that the code given here can be converted to implement strcpy by deleting every statement that mentions the length limit.

Loops with two exit points are easier to read if the exits are coded similarly, so we will write this as an infinite loop with two explicit break statements instead of mixing break statements with other kinds of loop exits. The following code is written with no complex statements, in the hope that each assignment can be replaced by one machine instruction.

C code for strncpy using pointer arithmetic
char * strncpy( char * dst, char * src, unsigned int len ) {
    char ch;
    char * return_value = src;
    for (;;) {
        if (len == 0) break;
        ch = *src;
        *dst = ch;
        if (ch = '\0') break;
        src = src + 1;
        dst = dst + 1;
        len = len - 1;
    return return_value;

In the above, we opted to use pointer arithmetic instead of array indexing based on what we learned from STRLEN about the optimizations this made possible.

The first step in writing any assembly language code is to document the parameters and to suggest register usage in the body of the code. By convention, parameters are passed in R3 and up, so dst will go in R3, src in R4, and len in R5. Looking at the C code, we need a place for the return value, so we will set aside R5 for this, and we know we will need a register to hold each successive character, so we set aside R7 for that. All of this should be documented in the subroutine header comments.

The convention that registers 3 to 7 can be used freely in any subroutine suggests that the our code might not need an activation record. Unfortunately, we will need one more register as a temporary for use with STUFFB. Storing the return address in the activation record frees R1, so we will do that in the code that follows.

STRNCPY in assembly language
; activation record format for STRNCPY string copy routine
RETAD   =       0       ; the return address
ARSIZE  =       4       ; size of activation record in bytes

;       receiving sequence and assumptions
STRNCPY:; expects R3 -- dst, the destination string address
        ;         R4 -- src, the source address
        ;         R5 -- len, the maximum length to copy
        ; uses    R6 -- the return value
        ;         R7 -- the character being copied
        ; returns R3 -- dst, unchanged
        STORES  R1,R2
        MOVE    R6,R3           ; return_value = src
STNCLP:                         ; while (TRUE) do {
        TESTR   R5
        BZS     STNCLX          ;   if (len == 0) break
        LOADS   R7,R4
        EXTB    R7,R7,R4        ;   ch = *src

        LOADS   R1,R3
        STUFFB  R1,R7,R3
        STORES  R1,R3           ;   *dst = ch

        TESTR   R7
        BZS     STNCLX          ;   if (ch == '\0') break

        ADDSI   R4,1            ;   src = src + 1
        ADDSI   R3,1            ;   dst = dst + 1
        ADDSI   R5,-1           ;   len = len - 1

        BR      STNCLP
STNCLX:                         ; }

        MOVE    R3,R6
        LOADS   PC,R2           ; return return_value

You can minimally test the above code for STRNCPY using the test program at the start of this chapter. Just insert the code for STRNCPY into the file (you may have to add an ALIGN directive if you put it at the end) and then replace all occurences of the name STRNCPY with some other name. Adding calls to test STRLEN is fairly trivial.

This code is far from optimal. Like our assembly-code version of strlen(), it accesses each word of memory 4 times, once per byte. We can do better if we can process blocks of bytes.

We can improve this code with some local optimizations. The ADDSI R5,-1 right before the end of the loop sets the condition codes, so we can replace the BR at the loop end with a BNZ to exit the loop when the count hits zero. This lets us move the loop-top label down to the point in the code marked with a star, saving two instructions per iteration.

We can eliminate the TESTR R7 in midloop, because the EXTB above it set the condition codes and nothing between the EXTB and the TESTR touches them, saving one more instrution per iteration.

Finally, we can use R6 as the working pointer in the destination string, so that we never modify the copy in R3. This eliminates an instruction outside the loop, but it has no impact on the loop itself.


j) Incorporate all of the suggested optimizations into STRNCPY.

k) Suppose we eliminate one ADDSI per iteration of STRNCPY by changing the use of R5. Before the loop, we compute the highest permitted value of the destination pointer, and then compare the destination with this maximum value on each iteration. Does this lead to a faster code? Compare with both the original and optimized versions.

l) Rewrite STRNCPY to use array addressing, not pointer arithmetic.

m) The test suggested above is minimal. Write a more exhaustive test that prints successively shorter substrings of "abcdefghijklmnop" using a for loop to select the starting point in the string (so all substrings end with the letter p, if it fits in the buffer). each iteration should copy its substring into an 11 byte buffer with a 12th byte set to zero, print the buffer and also print the string length.

String Concatenation

The C standard library strncat(dst,src,len) routine provides an interesting exercise. This concatenates the null-terminated strings pointed to by dst and src, storing the result in the same buffer that held the original value of dst, copying no more than len bytes, and returning a pointer to the result.

While strncat can be written from scratch, without using strlen or strncpy, if we use them, the result as a very small routine. First, we use strlen() to find the length of the dst string, and then we offset a copy of the dst pointer so it points to the end of the string. Finally, we use strncpy() to copy src onto the end of the dst string, up to but not over the end of the buffer. In C, this can be done as follows:

C code for strncat(dst,src,len)
char * strncat( char * dst, char * src, unsigned int len ) {
    int dstlen = strlen(dst);
    strncpy( dst + dstlen, src, len );
    return dst;

This approach to concatenating strings is a good test of the strlen() and strncpy() routines, and it illustrates, how we can call routines we have already written from within another.

If we put all of the routines in our string package in one source file, we will need naming conventions to avoid label clashes. The C string routine names all begin with str and those with parameters giving the buffer length have names beginning strn. As a result, the unique prefix for strncat could be strnc, but this is also the prefix on the name strncpy. We could just use longer identifiers for local labels, or we could abbreviate our prefixes for local labels. For example, local labels in STRNCAT could begin SNCAT or even SNCT and local labels in STRLEN could begin SLEN or SLN.

The STRLEN code given above made no use of its activation record and the STRNCPY code used it for just one thing, the return address. In writing STRNCAT, we need a more fully developed activation record for two reasons. First, STRNCAT calls other subroutines so it needs to save its return address, and second, it needs to save the values of its parameters through the call to STRLEN so it can pass them (or values derived from them) to STRNCPY.

Note that our calling conventions allow us to assume that registers R8 — R15 are not changed by a call, so we could use those registers to hold variables we need after the call to STRLEN. The problem is, if we use any of these registers, we must first save their values because whoever calls STRNCAT is also allowed to assume that R8 — R15 will not change. In this case, cost of saving these registers exceeds the benefit of using them. This leads to the following code:

Hawk code for STRNCAT
; activation record format for STRNCAT string concatenation routine
RETAD   =       0       ; the return address
DST     =       4       ; destination string address
SRC     =       8       ; source string address
LEN     =       12      ; length of destination buffer (unsigned)
ARSIZE  =       16

;       receiving sequence assumptions
STRNCAT:; expects R3 -- dst, the destination string
        ;         R4 -- src, the source string
        ;         R5 -- len, the destination buffer length
        ; uses    R4-7
        ; returns R3 -- dst, unchanged
        STORES  R1,R2           ; -- save return address
        STORE   R3,R2,DST
        STORE   R4,R2,SRC
        STORE   R5,R2,LEN       ; -- parameters are saved

                                ; -- parameter s = dst (already in R3)
        ADDI    R2,R2,ARSIZE
        JSR     R1,STRLEN
        ADDI    R2,R2,-ARSIZE   ; dstlen = strlen( dst ) -- in R3

        LOAD    R4,R2,DST
        ADD     R3,R3,R4        ;    -- parameter dst = dst+dstlen
        LOAD    R4,R2,SRC       ;    -- parameter src
        LOAD    R5,R2,LEN       ;    -- parameter len = len-dstlen (in R5)
        ADDI    R2,R2,ARSIZE
        JSR     R1,STRNCPY
        ADDI    R2,R2,-ARSIZE   ;    strncpy( dst+dstlen,src,len )

        LOAD    R3,R2,DST
        LOADS   PC,R2           ; return dst

The above code can be optimized significantly, and a little optimization has already been done: We did not allocate space in the activation record for the result returned by STRLEN, instead we kept it in register 3 until it was needed.

Another optimization we could make is to avoid wasted space in the activation record. Everything in the activation record is needed after the call to STRLEN from STRNCAT, but only some of it still matters during the call to STRNCPY. This lets us shorten the activation record during the second call. All we need to do is assign a smaller value to ARSIZE at the point when some fields become irrelevant. Of course, the fields of the activation record must be arranged so that the fields that become irrelevant later in the code are at the end. This doesn't matter much here, but the memory saved by such optimization in deeply recursive subroutines can matter.

We could eliminate unneeded changes to the stack pointer between the two internal calls. This change is a bit tricky because the second call is inside an if statement while the first call is outside it. This optimization is only possible if the activation record size is fixed, so we must decide between efficient use of space on the stack and optimizing for speed by eliminating changes to the stack pointer.

All of the above discussion was based on ignorance of the real use made of the stack by STRLEN and STRNCPY. If those routines had been written by other programmers and were likely to be changed, it would be fair to keep the code we have above, but if those routines were completely debugged and part of the same library as this code, we could take advantage of our knowledge that STRLEN does not use the stack at all.


n) Improve STRNCAT by minimizing wasted space on the stack during the call to STRNCPY.

o) Improve STRNCAT by minimizing the incrementing and decrementing of the stack pointer, assuming that both called routines might need their activation records.

p) Improve STRNCAT by taking advantage of all the information you have about the called routines, so that there is no unneeded stack manipulation.

A Big Optimization, Working With Words

As pointed out in the context of both the STRLEN and STRCPY routines, working with one character at a time does nothing to take advantage of the fact that the computer we are using is able to work with 32-bit data. If we were dealing with an 8-bit microprocessor, the algorithms we used would be appropriate, but on a 32-bit machine, we should move data in 32-bit increments if possible.

In the mid 1960's, Control Data Corporation introduced the 6600 architecture. This packed ten 6-bit characters into each 60-bit word, with much less support for byte-addressing than the Hawk. This forced programmers to learn how to do efficient text manipulation in terms of words. By the mid 1980's, the developers of Digital Equipment Corporation's 64-bit Alpha processor had developed byte manipulation instructions like those of the Hawk, and they finished developing efficient string operations using words. The MMX extensions to the Intel Pentium architecture are based on many of the same ideas.

Long after these techniques were developed, the term SIMD within a register (abbreviated SWAR) was coined. A SIMD or single instruction multiple data stream computer is one where there are machine instructions that operate on entire arrays in parallel. Similar techniques for manipulating arrays of pixels packed into machine words have become quite common on graphics coprocessors.

What we want to do here is deal with individual characters only at the beginning and end of each string, and operate on full words in between. Since we are packing 4 characters per word, we will deal with only 0 to 3 individual characters at the start and 0 to 3 individual characters at the end. In doing this, we face two distinct problems: How do we check an address to see if it is a word address, something we must do at the start of the string, and how do we check a word to see if it includes the end-of-string character at the end?

A C or Java programmer wanting to check the variable n to see if it is divisible by 4 can write if(n%4==0), but taking the remainder after division is expensive. We can take advantage of the binary number system by rewriting this as if(n&3==0), and because the Hawk architecture includes an AND instruction to compute the logical and of two registers, we can convert the above directly to Hawk instructions.

The Hawk offers some special instructions that improve on this general solution. First, it has a special instruction for inspecting and testing the least significant bits of a word. The truncate instruction, TRUNC dst,n zeros out all but the least significant n bits of a register. This can be used to snip the 2 least significant bits from an address and set the Z condition code if they are zero. To help solve the second problem, detecting ene of string, the LOADCC and LOADSCC instructions set the C condition code if any byte of the loaded word is zero.

The Hawk also provides a special instruction to select between different cases, for example, the 0 to 3 bytes that must be handled separately at the start and end of each string. The BTRUNC instruction truncates its operand just like the TRUNC instruction, but instead of saving the result, it adds it to the program counter, so that the next instruction will be selected from among the instructions immediately after the BTRUNC. Usually, the next few instructions are branch instructions.

The tricks used in the fast STRLEN routine given below cannot be easily expressed in languages like C, although the comments in the code try to do this. Assembly language programmers are still employed to write carefully crafted and very fast code like this, not only for key routines such in the string library, but also for pixel-oriented graphics libraries. Game developers and search-engine developers make extensive use of such custom crafted code in order to achieve the fastest possible performance.

Efficient Hawk code for STRLEN
;       receiving sequence assumptions for STRLEN (SLN)
STRLEN: ; expects R3 -- src, the address of the string
        ; uses    R4 -- ch, the character *src
        ;         R5 -- start, the starting address of src
        ; returns R3 -- len, length of the string src
        MOVE    R5,R3           ; start = src
        LOADSCC R4,R3           ; ch = *src -- the whole word
        BTRUNC  R3,2            ; select (src & 3) {
        BR      SLNB0           ;   -- branch table 0 (byte number)
        BR      SLNB1           ;   --              1
        BR      SLNB2           ;   --              2
        BR      SLNB3           ;   --              3
SLNB1:                          ;   case 1     
        EXTB    R0,R4,R3        ;     -- finish ch = *src
        BZS     SLNQT           ;     if (ch == NUL) break
        ADDSI   R3,1            ;     src = src + 1
SLNB2:                          ;   case 2     
        EXTB    R0,R4,R3        ;     -- finish ch = *src
        BZS     SLNQT           ;     if (ch == NUL) break
        ADDSI   R3,1            ;     src = src + 1
SLNB3:                          ;   case 3     
        EXTB    R0,R4,R3        ;     -- finish ch = *src
        BZS     SLNQT           ;     if (ch == NUL) break
        ADDSI   R3,1            ;     src = src + 1
        LOADSCC R4,R3           ;     ch = *src -- the whole word
SLNB0:                          ;   case 0 -- src divisible by 4
                                ;     -- note ch=*src, LOADSCC set C bit
        BCS     SLNLX           ;     if no byte in this word is zero {
SLNLP:                          ;       do {
        ADDSI   R3,4            ;         src = src + 4
        LOADSCC R4,R3           ;         ch = *src -- the whole word
        BCR     SLNLP           ;       while no byte in this word is zero
SLNLX:                          ;     } -- some byte in R4 is zero
        EXTB    R0,R4,R3        ;     -- finish ch = *src
        BZS     SLNQT           ;     if (ch == NUL) break
        ADDSI   R3,1            ;     src = src + 1
        EXTB    R0,R4,R3        ;     -- finish ch = *src
        BZS     SLNQT           ;     if (ch == NUL) break
        ADDSI   R3,1            ;     src = src + 1
        EXTB    R0,R4,R3        ;     -- finish ch = *src
        BZS     SLNQT           ;     if (ch == NUL) break
        ADDSI   R3,1            ;     src = src + 1
SLNQT:                          ; } -- assert *src == NUL
        SUB     R3,R3,R5        ; len = src - start
        JUMPS   R1              ; return len

Note that the inner loop above tests 4 bytes for each iteration, using only 3 instructions. Thus, for long strings, this STRLEN subroutine will use fewer instructions than there are bytes in the string being tested. The cost is 3 instructions per byte at the start and end of the string, while the optimized version of the byte-oriented code required a uniform 4 instructions per byte. Counting instructions is a very crude way to measure performance, particularly on modern pipelined processors. Ultimately, on any real CPU, physical testing against the clock is the final judge.

It is significantly harder to apply these optimizations to STRNCP because we must consider not only how the source string is broken into words but also how the destination buffer is broken up. If the least significant two bits of the source and destination addresses are equal, we can use the logic given for STRLEN. If the only the second to the least significant bits differ, we can copy a halfword at a time, but if the least significant bits differ, the byte alignment of the two string operands is different and we may be reduced to copying a byte at a time.

In C or Java, if you want to find out which bits of a and b are equal, you can use a^b to exclusive-or them. In the result, any bits that were equal in the originals will be zero. For the Hawk, the EQU instruction gives the logical inverse of the exclusive-or, so each result bit will be set if the corresponding bits of the operands were equal.


q) Write a version of STRNCPY that uses the original code given earlier if the source and destination pointers are aligned differently, and uses logic based on the above version of STRLEN if the source and destination pointers are aligned identically.

r) Write a version of STRNCPY that uses halfword stuffing and halfword extraction, so that it can copy strings when the source and destination addresses are off by 2 from each other. How many instructions per byte copied does it average for long strings.

s) Near the start of this chapter, code was give for a heap-based string_concatenate routine that calls malloc before using strcpy and strcat to do the work. Near the end of the chapter, code for strncat is given, from which you can infer an implementation of strcat. Given this code, strlen is called a total of 3 times for every call to string_concatenate. Rewrite string_concatenate to minimize the number of calls to strlen?