7. Byte and Halfword Addressing, Text Strings.
Part of
22C:60, Computer Organization
|
The Hawk monitor borrows its convention for recording the length of text
strings from the C programming language. This is used in the Hawk monitor in
only one place, the DSPST routine, but it would certainly be
convenient to have access to the string processing routines of the C standard
library. These include:
- puts( src );
- Given that src is the address of the first byte of a null-terminated string, output that string to the standard output device. Our DSPST does exactly this.
- strncpy( dst, src, count );
- Given that src is the address of the first byte of a null-terminated string, copy that string, including the terminating null if possible, to memory starting at the address dst but do not copy more than count characters. Usually, count gives the size of the memory region pointed to by dst. This routine returns the pointer that was passed as dst.
- strncat( dst, src, count );
- Given that src and dst are the addresses of the first bytes of two null-terminated strings, concatenate the string pointed to by src to the end of the string pointed to by dst, terminating the resulting string with a single null, if possible. If the resulting string is longer than count characters, it will be truncated. This routine returns the pointer dst to the result.
- strlen( src );
- Given that src is the address of the first byte of a null-terminated string, return the length of that string, up to but not including the null.
On most Unix systems, full documentation for any of these routines
can be obtained using the man command; for example, to get the
official documentation for the strlen() routine, type the
man strlen command into the shell interface. The book
The C Programming Language by Kernighan and Ritchie (Prentice
Hall) also documents these routines.
If we wish to write code to perform these functions, we need a way to fetch individual bytes from memory. As we discussed in Chapter 2, the memory addresses on the Hawk computer do allow us to give a distinct address to each byte of memory, but the variations on the LOAD and STORE introduced in Chapters 4, 5 and 6 all ignore the two least significant bits of the memory address and load or store a full 32-bit word.
The IBM System/360 was very first computer to offer support for 8-bit bytes, 16-bit halfwords and 32-bit words; this computer had a complete assortment of load and store instructions, load byte, load halfword and load word, along with their complements store byte, store halfword and store word. The problem with this approach to instruction set design are twofold.
First, if we commit to this idea, we use up 6 opcodes for load and store, and on a machine like the Hawk, the instruction set is crowded enough that it is not clear that we have these opcodes to spare.
Second, on a machine with a 32-bit word, loading and storing fractional words is inefficient enough that it should be discouraged! If the instruction set of the computer offers convenient load and store instructions for bytes and halfwords, this might encourage programmers to make excessive use of them. The truth is, loading a byte from memory on a machine with a 32-bit word actually involves loading the word containing that byte and then extracting the byte from the word, so why not make the instruction set match the work actually done by the hardware?
Exercises
a) One puzzle about the above interface definition is worth thinking about. Why do strncpy() and strncat() have, as return values, their first argument. Hint: Consider finding a more compact way to encode this list of string operations.
strncpy( str, "Hello, ", 10 ); strncat( str, "world", 10 ); strncat( str, "!", 10 );This little example loads the string pointed to by str with "Hello, world!" but not more than 10 characters.
The Hawk architecture offers two special instructions for extracting bytes and halfwords from full-word operands. These instructions are designed for use with the LOADS instruction for fast loading of bytes from memory. Here is the EXTB instruction for extracting bytes; the EXTH instruciton is quite similar, except that it extracts halfwords.
07 | 06 | 05 | 04 | 03 | 02 | 01 | 00 | 15 | 14 | 13 | 12 | 11 | 10 | 09 | 08 | |
0 1 0 1 | dst | s1 | s2 |
Formally, this does r[dst]=(r[s1]>>((r[s2]&3)<<3))&255. Unfortunately, while this formal notation is compact, and based entirely on the syntax common to C, C++ and Java, it is not very informative! Here is a more detailed breakdown of what the above line of code really means:
- r[dst]=(r[s1] ... )&255
- A byte is taken from register designated by the s1 field of the instruction and moved to the destination register. The high 24 bits of the result is zero.
- r[dst]= ... r[s1]>>((r[s2] ... ) ... ) ...
- The particular byte extracted is determined by the contents of the register designated by the s2 field of the instruciton.
- ... (R[S2] ... )<<3 ...
- The shift count used to select the byte will be a multiple of 8 because shifting a binary integer 3 places left is equivalent to multiplication by 8.
- ... r[s2]&3 ...
- The shift count depends only on the least significant two bits of the register designated by the s2 field of the instruction because anding a bit pattern with 310 which is 112 forces all but those two bits to zero. These 2 bits of the second source register are the bits that select the byte within the word.
- ... (r[s2]&3)<<3 ...
- From the above, we conclude that if the least significant two bits are 00, we get a shift count of 0, if they are 01, we get a shift count of 8, if they are 10, we get a shift count of 16, and if they are 11, we get a shift count of 24.
In long-winded English, this says that if the least significant 2 bits of the byte pointer Rp are 00, EXTB Rd,Rs,Rp will load the destination Rd with the least significant byte of the source register Rs. If the least significant 2 bits of Rp are 01, Rd will be loaded with the second to the least significant byte of Rs. If the least significant 2 bits are 10, Rd will be loaded with third byte of Rs and If the least significant 2 bits are 11, Rd will be loaded with the top byte of Rs. The result is always truncated to just 8 bits in the destination register.
As a result, if Rp contains a memory address and Rs contains a word that was loaded from memory at that address, then the instruction EXTB Rd,Rs,Rp will pick, from Rs the byte pointed to by Rp and put that byte in Rd. So, if Rp contains a pointer to a byte, we can load that byte into a register as follows:
; given: R4, a pointer to a byte in memory LOADS R3,R4 ; get the word containing the desired byte EXTB R3,R3,R4 ; extract the desired byte from that word ; result: R3 now holds the desired byte; R3=M[R4] |
EXTB and EXTH both set the condition codes to report on whether the extracted byte or halfword is zero or nonzero, so we can easily test for end-of-string as we extract bytes from a string during a string operation.
There is one minor problem. What if we wanted to extract signed halfwords or signed bytes? The Hawk architecture solves this problem by providing an additional instruction for sign extending any unsigned quantity. To sign extend a two's complement number, you add copies of the sign bit to widen that number to the desired number of bits. For example if the the 4-bit numbers 1010 and 0101 are sign extended to 8 bits, the results will be 11111010 and 00000101. This operation preserves the value of the number while changing its representation. The above numbers, in either their 4-bit or 8-bit forms have the decimal values -6 and +5.
The hawk sign extend instruction is called SXT. This takes a register and a count of the number of bits to be preserved, so SXT R3,8 extends the value in R3 by replacing its top 24 bits with whatever was in bit number 7. It is worth noting that another instruction, with the identical format, takes a signed quantity and forces all of the higher bits to zero; this is called truncation, and the instruction is called TRUNC.
; given R4, a pointer to a byte in memory LOADS R3,R4 ; get the word containing the desired byte EXTB R3,R3,R4 ; extract the desired byte from that word SXT R3,8 ; force an unsigned interpretation on it ; result R3 now holds the desired byte, sign extended to 32 bits |
Exercises
b) Given that R1 holds ABCDEF16, what value is left in the register after TRUNC R1,8? after TRUNC R1,6? after TRUNC R1,5?
c) If the instruction EXTB R3,R3,R0 is used (In this context, register zero the constant zero), what is the result in R3 if its initial is ABCDEF16? If its initial value is 12345616? What useful function does this particular variant of the EXTB statement do?
Consider the problem of writing code to access an element of an array of characters, for example, code for the expression a[i], where a is an array of characters and i is an integer that we know to be within the array bounds, so we don't need to worry about bounds checking. To evaluate this expression, we proceed as follows:
In C or C++, the expression a[i] is always equivalent to *(a+i). In these languages, the * operator means "follow the pointer to the object it references in memory." In these languages, the name of an array is always the same thing as a pointer to the first element of that array, and adding an integer to a pointer always implicitly multiplies the integer by the size of the object referenced by the pointer. In the case of arrays of characters in C, each character is just one byte.
On the hawk, the code for a[i] depends on whether a is a local variable, an array of constants embedded in the program text (such as the "hello world" array, or a global variable.
; assume A is a local array of characters, i is in R3 LEA R4,R2,A ; get the address of A into R4 ADD R4,R3,R4 ; R4 now holds the address of A[i] LOADS R3,R4 ; R3 now holds the word at this address EXTB R3,R3,R4 ; R3 now holds the the array element A[i] |
; assume A is a constant array of characters, i is in R3 LEA R4,A ; get the address of A into R4 ADD R4,R3,R4 ; R4 now holds the address of A[i] LOADS R3,R4 ; R3 now holds the word at this address EXTB R3,R3,R4 ; R3 now holds the the array element A[i] |
; assume A is a global, allocated using COMMON, i is in R3 ; and assume that PA is the label on a word pointing to A LIL R4,A ; get the address of A into R4 ADD R4,R3,R4 ; R4 now holds the address of A[i] LOADS R3,R4 ; R3 now holds the word at this address EXTB R3,R3,R4 ; R3 now holds the the array element A[i] |
The above code fragments differ only in the initial instructions
that load the address of the first array element. After that, everything
is the same. The story for arrays of words is similar, except
that we can use a simple LOADS to load each array element, but we
must first multiply the array subscript by four before using it.
; assume A is a local array of words, i is in R3 LEA R4,R2,A ; get the address of A into R4 SL R3,2 ; multiply R3 by 4 by shifting 2 places left ADD R4,R3,R4 ; R4 now holds the address of A[i] LOADS R3,R4 ; R3 now holds the word at this address |
; assume A is a local array of words, i is in R3 LEA R4,R2,A ; get the address of A into R4 ADDSL R3,R4,2 ; multiply R3 by 4 and add R4 to the result LOADS R3,R3 ; R3 now holds the word at this address |
In the above code fragments, the SL R3,2 instruction shifts
the contents of R3 left two places, throwing away any bits shifted out the
left end of the register and padding with zeros on the right. This is
equivalent to multiplication by four; in general,
SL R3,n multiplies by 2n, and because the
sequence of shifting and then adding is common, the instruction
ADDSL is available to combine these functions.
Both fragments coded for arrays of words illustrate only the case where the array is local, a part of the activation record. The problem of loading the starting address of the array does not depend on the size of the array elements, so the examples from the previous figure cover what is needed for constant arrays that are part of the program or for global arrays stored in space allocated by the COMMON directive.
We have now covered enough ground to write a subroutine to count the
bytes in a null terminated string. In a high level language like C or
C++, we could do this as follows:
unsigned int strlen( char src[] ) { int n = 0; while ( src[n] != '\0' ) { n = n + 1; } return n; } |
This first version of the code views the string to be measured as an
array of characters. Each iteration of the loop examines one character
in this array, testing to see if it is the null character that marks the
end of the string. Using the material on array addressing from the
previous section, we can translate this code directly to Hawk assembly
language as follows:
; receiving sequence assumptions for STRLEN (SLN) STRLEN: ; R3 holds src, the address of the string ; R4 and 5 hold nothing of value ; R4 will be used to extract characters ; R5 holds n, the count of the length LIS R5,0 ; n = 0; SLNLP: ; while (TRUE) { ADD R4,R3,R5 LOADS R4,R4 EXTB R4,R4,R3 BZS SLNLX ; if (src[n] == '\0') exit; ADDSI R5,1 ; n = n + 1; BR SLNLP SLNLX: ; } MOVE R3,R5 JUMPS R1 ; return n |
In the C and C++ languages, the name of an array is the same as a pointer to its first element. As a result, the declarations char src[] and char * src declare the parameter src to have exactly the same type, the address of a character in memory. Since strings in C and C++ are represented by arrays of consecutive characters, the natural way to pass a string to a function is by passing a pointer to its first character. This feature of the C and C++ languages is sometimes a source of serious confusion Here, though, we can exploit this to it to write an alternative version of strlen().
unsigned int strlen( char * src ) { int n = 0; while ( *src != '\0' ) { src = src + 1; n = n + 1; } return n; } |
In the above, note that the star operator, when applied as a prefix operator on a value that is a pointer, causes that pointer to be followed to the object it references. In this case, we follow the character pointer to a particular character, so so the object *src is the character pointed to by src and could also have been written as src[0].
In C and C++, adding an integer to a pointer produces a new pointer; the amount by which the memory address is incremented depends on the number of bytes occupied by the objects to which the pointer refers. In this case, since we are assuming that a char object occupies one byte, adding one to a char pointer increments the address by one. Adding one to an int pointer would incrment the address by 4 if we assume that int objects occupy 4 bytes. Note the word "assume" above! The C standard does not actually state the size, in bytes, of any standard types. We have covered enough detail now to translate the second version of the above code into Hawk assembly language.
; receiving sequence assumptions for STRLEN (SLN) STRLEN: ; R3 holds src, the address of the string ; R4 and 5 hold nothing of value ; R4 will be used to extract characters ; R5 holds n, the count of the length LIS R5,0 ; n = 0; SLNLP: ; while (TRUE) { LOADS R4,R3 EXTB R4,R4,R3 BZS SLNLX ; if (*src == '\0') exit; ADDSI R3,1 ; src = src + 1; ADDSI R5,1 ; n = n + 1; BR SLNLP SLNLX: ; } MOVE R3,R5 JUMPS R1 ; return n |
Note that both versions of this code have exactly the same complexity One of these pieces of code has an extra increment in the body of the loop, while the other has an extra add to add in the base of the array during each iteration. If the array had been an array of integers, the second version would have been noticably faster. If we view these pieces of code as starting points for optimization, it turns out that the second one will lead us to the fastest version.
There is one big disadvantage of either version of this code! Each sequence of 4 consecutive iterations of the loop in this program will load exactly the same data into register 4 prior to extracting a different byte from that register. This is inefficient, and in fact, it is the motivation for some significant and interesting tricks that have become common on new computers designed since the mid 1980's.
This code can also be tightened by some local optimizations that ignore these larger issues. There are two registers being incremented in the loop body, and one of these increments can be eliminated by computing the string length as the difference between the pointer to the first byte and the pointer to the terminating null. There are two branches within the loop, one conditional and one unconditional; rewriting the loop so the termination test is at the end of the loop can eliminate one of these branches. Combining these optimizations, we can reduce the cost of the loop from 6 instructions per iteration down to 4 instructions per iteration.
Exercises
d) Improve the above code for STRLEN by eliminating one increment operator from the loop body, as suggested above.
e) Improve the above code for STRLEN by changing the control structure so that the loop exit test is done by a conditional branch back to the top of the loop from the end of the loop, as suggested above.
On machines where all data transfers between the processor and memory move 32-bit quantities, storing a byte or halfword into some word in memory involves first loading that word, and then replacing the correct byte or halfword before finally storing the modified word back into memory. This is true even if the instruction set of the computer offers a "store byte to memory" or "store halfword in memory" instruction.
In the case of the Hawk, the three steps described above must be carried out explicitly. The first and last steps are easily done with LOADS and STORES instructions. Two special instructions are provided for the middle step, one to stuff bytes into a register and the other to stuff halfwords. Here is the former:
07 | 06 | 05 | 04 | 03 | 02 | 01 | 00 | 15 | 14 | 13 | 12 | 11 | 10 | 09 | 08 | |
0 1 1 1 | dst | s1 | s2 |
This is
r[dst]=(r[dst]&~(0xFF<<((r[s2]&3)<<3))|((r[s1]&0xFF)<<((r[s2]&3)<<3))
in our formal notation.
Unfortunately, as with the EXTB instruction, this formal notation
neither compact nor very
very informative! The following comments should help in understanding
this:
- ... (r[s2]&3)<<3 ...
- This subexpression shows up twice in the above, and it also shows up in the description of the semantics of the EXTB. This takes the least significant 2 bits of r[s2] and multiplies by 8 to give values of 0, 8, 16 or 24.
- ... ~(0xFF<<((r[s2]&3)<<3)) ...
- Here, we take the value 111111112 and shift it over 0, 8, 16 or 24 places so that these 8 ones line up with the byte to be stuffed in place. Then we invert the value making ones where there were zeros and zeros where there were ones. This is the value that is anded with the destination register in order to set the selected byte of the destination to zero.
- ... ((r[s1]&0xFF)<<((r[s2]&3)<<3)).
- Here, we take the least significant byte of r[s1] and shift it over 0, 8, 16 or 24 places so that it lines up with the byte to be stuffed in place. Finally, we or it into place.
In long winded English,
this says that if the least significant
2 bits of the byte pointer Rp are 00,
STUFFB Rt,Rs,Rp will replace
the least significant byte of the temporary register Rt with the
least significant byte of the source register Rs.
If the least significant 2 bits of Rp are 01, it will replace
the second to the least significant byte of Rt with the least
significant byte of Rs.
If the least significant 2 bits of Rp are 10, it will replace
the third byte of Rt with the low
byte of Rs and
if the least significant 2 bits of Rp are 11, it will replace
the top byte of Rt with the low
byte of Rs.
In effect, the instruction STUFFB Rt,Rs,Rp replaces the byte in the temporary register Rt selected by the least significant bits of the pointer register Rp with the low byte of Rs. Suppose that we want to store the low byte of Rs in the memory location pointed to by Rp. To do this, we first load the word pointed to by Rp into a tempory register Rt, then stuff the byte into that word before we store the word back in memory.
; given: R3, a byte pointer and R4, a byte to store LOADS R1,R3 ; get the word containing the desired byte STUFFB R1,R4,R3 ; insert the desired byte into that word STORES R1,R3 ; put the modified word back in memory ; result: R1 now holds a copy of the updated word, M[R3] = R4 |
Byte manipulation on the Hawk may appear cumbersome compared to byte manipulation on machines such as the IBM 360, the DEC PDP-11 or the Intel Pentium. On all of those machines, there is a single instruction that will load one byte from memory to register or store one byte from register to memory. The fact that it can be done in one instruction does not, however, make this a fast instruction. The instruction always loads an entire word from memory, and in order to support this instruction, every word loaded from memory must pass through an alignment network that slightly delays every memory reference. The designers of some modern machines have opted to expose this complexity in the instruction set, allowing fast access to memory for words and forcing explicitly extract byte operands. The DEC Alpha processor provided the model for Hawk byte addressing.
Programmers who want to avoid knowing about byte extraction and stuffing can hide these somewhat strange instruction sequences within macros. The macros given here will not be used later, but are given for the sake of example. The naming convention we have used is that LOADB and STOREB indicate that these instructions manipulate bytes, while the suffix S indicates the use of short-indexed addressing.
The STOREBS macro definition assumes that R1 can be used
as a temporary location for byte stuffing. This assumption is compatble with
the use of R1 for subroutine linkage, since between the point where
a subroutine saves its return address in its activation record and the point
where the subroutine restores its return address, R1 is available
for other uses, and its use in calling other subroutines prevents us from
using it for any long-term purposes.
MACRO LOADBS =dst,=x LOADS dst,x EXTB dst,dst,x ENDMAC MACRO STOREBS =dst,=x LOADS R1,x STUFFB R1,dst,x STORES R1,x ENDMAC |
Exercises
f) Given that the hardware interprets R0 as the constant zero in this context, what does STUFFB R3,R0,R0 do?
g) Write a LOADB macro that behaves like LOAD r,x,disp except that it loads a single byte into the destination register instead of loading a word. Use R1 as a temporary register to hold the effective address.
In setting out to write a Hawk assembly language version of code equivalent to the strncpy() routine from the C standard library, a reasonable first step is to first write it in a high level language. The central part of our routine will be a loop that copies one character per iteration from the source string to the destination string. There are two distinct exits conditions this loop must test. First, we must terminate the loop when we have copied the null character that marks the end of string. Second, we must terminate the loop when the copy operation has filled the array of bytes allocated to the destination string.
Loops with two exit points are usually easier to read if they are coded with the exits standing on an equal basis, so we will write this as an infinite loop with two explicit break statements instead of writing it as, for example, a while loop with a second exit encoded with a break. The following code is written with no complex statements, in the hope that each assignment is simple enough to be replaced by a single machine instruction.
char * strncpy( char * dst, char * src, unsigned int len ) { char ch; char * return_value = src; while (TRUE) { if (len == 0) break; ch = *src; *dst = ch; if (ch = '\0') break; src = src + 1; dst = dst + 1; len = len - 1; } return return_value; } |
The first step in converting the above to Hawk assembly language is to work out how we plan to use registers and the activation record, if any. On entry, our convention is that parameters are passed in registers 3 and up, so we will put dst in register 3, src in register 4, and len in register 5. We can use register 6 to hold the copy of dst that we have called return_value, and register 7 to hold the character we are copying, ch. This gives the following:
; activation record format for STRNCPY string copy routine RETAD = 0 ; the return address ARSIZE = 4 ; size of activation record in bytes ; receiving sequence and assumptions STRNCPY: ; R3 holds dst, the address of the destination ; R4 holds src, the address of the source ; R5 holds len, the maximum length to copy ; R6 holds the return value ; R7 holds ch, the character being copied ; R8-15 must be saved and restored if we use them STORES R1,R2 MOVE R6,R3 ; return_value = src STNCLP: ; while (TRUE) do { TESTR R5 BZS STNCLX ; if (len == 0) break ;* LOADS R7,R4 EXTB R7,R7,R4 ; ch = *src LOADS R1,R3 STUFFB R1,R7,R3 STORES R1,R3 ; *dst = ch TESTR R7 BZS STNCLX ; if (ch == '\0') break ADDSI R4,1 ; src = src + 1 ADDSI R3,1 ; dst = dst + 1 ADDSI R5,-1 ; len = len - 1 BR STNCLP STNCLX: ; } MOVE R3,R6 LOADS R1,R2 JUMPS R1 ; return return_value |
Our convention that registers 3 to 7 can be used freely in any subroutine suggests, at first glance, that the above code should not need an activation record. Unfortunately, we needed one more register to hold the copy of the word in memory that we are modifying with STUFFB. We used R1, and that forced us to save the return address in a one-word activation record.
The above code is far from optimal for many reasons. First and foremost, like our machine-code version of the strlen() routine, it accesses each word of memory 4 times, once per byte in that word. We can certainly do better if we can find some way to process bytes in blocks of four instead of one at a time.
Second, even if we ignore the first inefficiency, we can improve this code by making several local optimizations. Note, for example, that the ADDSI R5,-1 instruction right before the end of the loop sets the condition codes to report on the result. We can therefore replace the BR instruction at the end of the loop with a BNZ instruciton, and move the loop-top label down to the point marked with an asterisk in the code. This eliminates 2 of the 13 instructions per iteration, giving better than a 10% improvement.
We can also eliminate the TESTR R7 instruction in midloop, because the EXTB sets the condition codes to report on the byte extracted, and none of the instructions between that instruction and the TESTR instruction touch the condition codes. This is one reason that each load instruction on the Hawk comes in two flavors, one that sets the condition codes (LOADSCC) and one that does not (LOADS).
Finally, there is a small optimization we can make, using R6 as the working pointer in the destination string, so that we never modify the copy in R3, eliminating one instruction outside the loop. Because this optimization has no impact on the loop itself, the only reason to pursue this optimization is if memory is in extremely short supply.
Exercises
h) Incorporate all of the suggested optimizations into STRNCPY.
i) Write Hawk code for strncat() that first finds the end of the destination string using logic torn from STRLEN and then copies the source string into place using logic borrowed from STRNCPY.
j) Can you modify the STRNCPY routine so that it still conforms to the documented assumptions in the comments with the the receiving sequence while eliminating the use of the activation record on the stack? This will require using some register other than R1 for the temporary used for byte stuffing. If you cannot do this, explain why.
k) Rewrite STRNCPY so that it uses array addressing instead of pointer arithmetic to operate on consecutive elements of the source and destination strings.
A simple test of the above routines would be to print the copy of a string. Performing this test raises some new programming issues which we can explore in a high level language before we explore them in the Hawk assembly language. Here is a simple test program for these routines, written in C.
main() { char buffer[12]; buffer[11]='/0'; dspat( 6, 1 ) dspstr( strncpy( buffer, "String!", 11 )); dspat( 1, 1 ) dspnum( strlen( buffer) ); } |
Programmers used to working in object oriented languages may not like explicitly declaring the memory into which the string is copied. The local variable buffer serves that purpose here; it is an array of 12 characters, indexed 0 to 11. We set buffer[11] to zero because strncpy only copies the null terminator if the string fits. In our test program, we have set the limit so that strncpy will only change buffer[0] to buffer[10], leaving the null in buffer[11] just in case the string being copied is long.
The word buffer comes from electronics, where a buffer is an amplifier that passes its input to its output without allowing changes at its output to leak back to the input. In programming, the word refers to variables that sit between the stages of a computation and serve a similar purpose.
; activation record format for MAIN BUFFER = 0 ; an array of 12 bytes ARSIZE = 12 ; entry point for MAIN and assumptions MAIN: LIL R2,STACK ; initialize stack pointer STORE R0,R2,BUFFER+8 ; buffer[8-11] = "\0\0\0\0" LIS R3,6 LIS R4,1 ADDI R2,ARSIZE LIL R1,DSPAT JSRS R1,R1 ADDI R2,-ARSIZE ; dspat( 6, 1 ) LEA R3,R2,BUFFER LEA R4,STRING LIS R5,11 ADDI R2,ARSIZE JSR R1,STRNCPY ADDI R2,-ARSIZE ADDI R2,ARSIZE LIL R1,DSPSTR JSRS R1,R1 ADDI R2,-ARSIZE ; dspstr(strncpy(buffer,string,11)); LIS R3,1 LIS R4,1 ADDI R2,ARSIZE LIL R1,DSPAT JSRS R1,R1 ADDI R2,-ARSIZE ; dspat( 1, 1 ) LEA R3,R2,BUFFER ADDI R2,ARSIZE JSR R1,STRLEN ADDI R2,-ARSIZE ADDI R2,ARSIZE LIL R1,DSPDECU JSRS R1,R1 ; dspdecu( strlen( buffer )); ADDI R2,-ARSIZE LIL R1,EXIT JSRS R1,R1 ; exit() STRING: ASCII "String!",0 |
The above Hawk code uses the monitor routine DSPDEC, while the STRCPY and STRLEN routines must be embedded in the same source file. This code has not been optimized at all, and as a result, it includes many changes to the stack pointer that could easily be eliminated. They are left in the code here in order to emphasize the rules for calling sequences we have been following.
This is the first main program we have seen that uses an activation record of its own. This activation record differs from those of called subroutines in only one way: It contains no return address. Aside from this, addressing local variables of the main program is done in the same way that they are addressed in subroutines. The instruction LEA R3,R2,BUFFER loads register 3 with a pointer to the first byte in the local variable BUFFER; in fact, a simple register to register move would have sufficed because the buffer begins at offset zero in the activation record, but we have ignored this optomization in order to deliberately demonstrate code that will work no matter where in the activation record the buffer is placed.
The above test program tests only one string; a complete test should really test a variety of string lengths, some longer than the buffer and some shorter, perhaps by using a for loop to copy successively longer strings into the buffer. That option is left as an exercise.
Exercises
l) Optimize the test program by removing all unnecessary code to increment and decrement the stack pointer, and by substituting more efficient instructions wherever this is possible.
m) Write a test program that prints successively shorter substrings of "abcdefghijklmnop" using a for loop to select the starting point in the string (so all substrings end with the letter p, if it fits in the buffer). This program should copy each substring into an 11 byte buffer with the 12th byte set to zero, print that substring, and print the length of that substring. Without this latter requirement, the entire test could be done without testing STRNCPY; by forcing the use of a short buffer, we can force a reasonable test of the buffer overflow conditions that might occur in this routine.
The C standard library routine for string concatenation, strncat(dst,src,len), provides an interesting exercise. This concatenates the null-terminated strings pointed to by dst and src, storing the result in the same buffer that held dst, limiting the result to len bytes, and returning the pointer to the result. This routine can be written from scratch, without using strlen() or strncpy(), but if we use them, the result as a very small routine that first uses strlen() to find the length of dst and then uses strncpy to copy src onto the end of dst, up to the end of the buffer. In C, this can be done as follows:
char * strncat( char * dst, char * src, unsigned int len ) { int dstlen = strlen(dst); if (dstlen < len) { strncpy( dst + dstlen, src, len - dstlen ); } return dst; } |
This approach to concatenating strings is an excellent test of the strlen() and strncpy() routines, and in addition, it is a good way to illustrate, at the assembly language level, how we can call two routines we have already written from within another routine.
If we intend to put all of the routines in our string package in the same source file, we must eliminate labels that are the same in different routines! The C string routines begin with str and with an explicit parameter giving the buffer length have names that begin strn. As a result, the unique prefix for strncat could be no shorter than strnc, but this is also the prefix on the name strncpy. Fortunately, our preliminary versions of these routines are very short, needing only a few labels; our initial version of STRNCAT, for example, needs only one label to mark the end of the if statement that copies the second half of the string into place. We will use the name STRNCX for this, using the convention that the suffix X is short for exit. We would have to switch to longer identifiers if we wanted to put all of the string routines in the same source file, or perhaps shorten the STR prefix in our abbreviated names.
While our STRLEN and STRNCPY routines did not make much use of their activation records, STRNCAT will need its activation record to hold its parameters through calls it makes, and it will have to save its own return address through these calls. This leads to the following code:
; activation record format for STRNCAT string concatenation routine RETAD = 0 ; the return address DST = 4 ; destination string address SRC = 8 ; source string address LEN = 12 ; length of destination buffer (unsigned) ARSIZE = 16 ; receiving sequence assumptions STRNCAT: ; R3 holds DST ; R4 holds SRC ; R5 holds LEN ; R6-7 contain nothing of value ; R8-15 must be saved and restored if used STORES R1,R2 ; save return address STORE R3,R2,DST STORE R4,R2,SRC STORE R5,R2,LEN ; parameters are saved ADDI R2,R2,ARSIZE JSR R1,STRLEN ADDI R2,R2,-ARSIZE ; dstlen = strlen( dst ) LOAD R5,R2,LEN SUB R5,R5,R3 ; R5 = len - dstlen BLEU STRNCX ; if (len > dstlen) { LOAD R4,R2,DST ; note, R3 still holds dstlen ADD R3,R3,R4 ; note, R5 still holds len - dstlen LOAD R4,R2,SRC ADDI R2,R2,ARSIZE JSR R1,STRNCPY ADDI R2,R2,-ARSIZE ; strncpy( dst+dstlen,src,len-dstlen ) STRNCX: ; } LOAD R3,R2,DST LOADS R1,R2 JUMPS R1 ; return dst |
The above code can be optimized significantly, but some optimization has already been done. For example, we computed the difference between the buffer length and the destination string length once, keeping it in register 5, while our high-level language source code suggests that the different operations for comparison and subtraction of these two quantities. Also, we did not allocate space in the activation record for the result returned by STRLEN; we simply kept this in register 3 until it was no longer needed.
Another optimizations we could make is to avoid wasted space in activation record. Everything in the activation record is needed after the call to STRLEN from STRNCAT, but only some of it is still relevant after the call to STRNCPY. Therefore, we can shorten the activation record during the second call. This is hardly worthwhile here, but the memory savings from such optimization of a recursive function can be important.
We could eliminate unneeded changes to the stack pointer between the two internal calls. This change is a bit tricky because the second call is inside an if statement while the first call is outside it. This optimization is only possible if the activation record size is fixed, so we must decide between efficient use of space on the stack and optimizing for speed by eliminating changes to the stack pointer.
All of the above discussion was based on ignorance of the real use made of the stack by the STRLEN and STRNCPY routines. If those routines had been written by other programmers and were likely to be changed, it would be fair to keep the code we have above, but if those routines were completely debugged and were part of the same library as this code, we could take advantage of our knowledge that STRLEN does not use the stack at all, and that STRNCAT makes only the slightest use of the stack.
Exercises
n) Improve STRNCAT by minimizing wasted space on the stack during the call to STRNCPY.
o) Improve STRNCAT by minimizing the incrementing and decrementing of the stack pointer, keeping the assumption that both called routines are likely to use their activation records.
p) Improve STRNCAT by taking advantage of all the information you have about the called routines, so that there is no unneeded stack manipulation.
As pointed out in the context of both the STRLEN and STRCPY routines, our code copying one character at a time did nothing to take advantage of the fact that the computer we are using is able to work with 32-bit data. If we were dealing with an 8-bit microprocessor, the algorithm we are using would be appropriate, but on a 32-bit machine, we should move data in 32-bit increments whenever possible.
In the mid 1960's, Control Data Corporation introduced the 6600 architecture. This packed ten 6-bit characters into each 60-bit word, and it had even less support for byte-addressing than the Hawk, forcing programmers to learn how to do efficient text manipulation in terms of words. By the mid 1980's, the developers of Digital Equipment Corporation's 64-bit Alpha processor had developed byte manipulation instructions like those of the Hawk, and they fully developed a model for efficient string operations using words. The MMX extensions to the Intel Pentium architecture are based on many of the same ideas.
The basic idea is to deal with individual characters only at the beginning and end of the string, and use words between. Since we are packing 4 characters per word, we will deal with only 0 to 3 individual characters at the start and 0 to 3 individual characters at the end. In doing this, we face two distinct problems: How do we check an address to see if it is a word address, something we must do at the start of the string, and how do we check a word to see if it includes the end-of-string character at the end?
The Hawk helps solve both problems. First, it has special instructions for inspecting and testing the least significant bits of a word, and second, the load instructions set the C condition code if any byte of the loaded word is zero to help detect the ends of strings.
The Hawk architecture has a truncate instruction, TRUNC dst,n that preserves only the least significant n bits of a register, setting the others to zero. This can be used reverse the effect of sign extension, but it can also be used to snip the 2 least significant bits from an address and set the Z condition code if they are zero.
What we really want is a way to test the least significant bits of an address and go to one of 4 different cases depending on the values of those bits. The Hawk architecture has a specialized instruction to do exactly this, BTRUNC. This truncates its operand and then, instead of saving the truncated, it adds it to the program counter, so that the next instruction will be selected from among the instructions immediately following the BTRUNC instruction. Usually, the next few instructions are branch instructions.
; receiving sequence assumptions for STRLEN (SLN) STRLEN: ; R3 holds src, the address of the string ; R4 and 5 hold nothing of value on entry ; R4 will be used to extract characters ; R5 holds initial value of R3 MOVE R5,R3 ; remember starting place LOADSCC R4,R3 ; get first word of string BTRUNC R3,2 ; test what byte we are starting with BR SLNB0 ; byte 0 - can work with entire words BR SLNB1 ; byte 1 BR SLNB2 ; byte 2 (Note, the condition codes BR SLNB3 ; byte 3 are as left by LOADSCC!) SLNB1: EXTB R0,R4,R3 ; extract and test byte 1 BZS SLNQT ; quit if null ADDSI R3,1 ; advance to byte 2 SLNB2: EXTB R0,R4,R3 ; extract and test byte 2 BZS SLNQT ; quit if null ADDSI R3,1 ; advance to byte 3 SLNB3: EXTB R0,R4,R3 ; extract and test byte 3 BZS SLNQT ; quit if null ADDSI R3,1 ; advance to full word LOADSCC R4,R3 ; get next full word SLNB0: ; assert R3 divisible by 4, R4=M[R3], LOADSCC set the C bit BCS SLNLX ; some byte in this word is zero SLNLP: ADDSI R3,4 ; advance to next full word LOADSCC R4,R3 ; get next full word BCR SLNLP ; iterate if all bytes nonzero SLNLX: ; assert some byte in R4 is zero, R4=M[R3], R3 is divisible by 4 EXTB R0,R4,R3 ; test first byte BZS SLNQT ADDSI R3,1 ; advance to full word EXTB R0,R4,R3 ; test second byte BZS SLNQT ADDSI R3,1 ; advance to full word EXTB R0,R4,R3 ; test third byte BZS SLNQT ADDSI R3,1 ; it must be the final byte SLNQT: ; assert R5 points to first byte, R3 to final byte SUB R3,R5,R3 ; compute the string length JUMPS R1 ; return |
The tricks used in the fast STRLEN routine given above cannot be easily expressed in languages like C, so assembly language programmers are still employed to write carefully crafted and very fast versions of key routines such as those in the string library or in pixel-oriented graphics libraries. Game developers and search-engine developers make extensive use of such custom crafted code in order to achieve the fastest possible performance.
Note that the inner loop here tests 4 bytes per iteration using only 3 instructions per iteration. Thus, for long strings, this STRLEN subroutine will count the bytes using fewer instructions than there are bytes in the string being tested. At the start and end of the string, we pay 3 instructions per byte. Compared with the uniform 4 instructions per byte we paid in the optimized version of the byte-oriented code, the overall improvement is quite significant.
It is significantly harder to apply these optimizations to STRNCP because we must consider not only how the source string is broken into words but also how the destination buffer is broken up. If the source and destination addresses are equal in the least significant two bits, we can use the logic given for STRLEN, but there are three other cases we might want to handle with more arcane code.
Exercises
q) Write a version of STRNCPY that uses the original code given earlier if the source and destination pointers are aligned differently, and uses logic based on the above version of STRLEN if the source and destination pointers are aligned identically.
r) Write a version of STRNCPY that uses halfword stuffing and halfword extraction, so that it can copy strings when the source and destination addresses are off by 2 from each other. How many instructions per byte copied does it average for long strings.