Lecture 39, Exploiting the Assembler
the notes for 22C:196:002 (CS:4908:0002)
In many assemblers (unfortunately, this excludes the Gnu Assembler, as), it is possible to generate code without any forward references. Consider this little bit of ARM code:
        -- some code
        b       there
        -- more code
there:
        -- yet more --
The key to avoiding forward references is to use the .org directive that is intended for setting the assembly origin. Consider the following:
        -- some code
bplace: .org    .+4             @ reserve space for the b instruction
        -- more code
there:  .org    bplace          @ go back to fill in missing code
        b       there           @ the forward branch
        .org    there           @ resume normal assembly
        -- yet more --
This is evil code, but it does eliminate the need for forward references. It is also illegal under the rules of as because as forbids the use of .org directives that decrement the location counter. This is annoying, but then, as does support forward references, so we don't really need to use this trick.
Note one thing: The resulting object code is full of hesitations as the location counter is set and reset. Typical object codes consist of a sequence of blocks, with each block headed by a load address, so the loader reads the object file one block at a time, storing it starting at the indicated memory address. Direct-memory-access I/O devices allow the actual copying of each block into RAM to be done by a single DMA operation, but the software may have to intervene at the start of each block. Furthermore, blocks may have to be stored on disk so that each block begins on a sector boundary. The net result is that it is far more efficient to load a program as one big block than it is to load it as a sequence of little blocks.
Of course, the linker could take the output of the assembler and linearize it, weaving the blocks together into the correct order. If the object code is stored on a random-access disk file, complete sorting of the blocks and concatenation of those blocks that are to be loaded into consecutive memory locations requires only two passes. The first pass makes an index of the blocks and their starting addresses on disk and the memory addresses where they are to be loaded. This index is then sorted by memory address, and then the blocks are copied to the output file in order by memory address.
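The two passes above can be sketched in a few lines of Python. This is an illustrative model, not any real linker's file format: an "object file" is just a list of (load address, data) blocks in whatever order the assembler emitted them, and the block names and layout are hypothetical.

```python
def linearize(blocks):
    # Pass 1: index the blocks by the memory address where each loads.
    index = sorted(range(len(blocks)), key=lambda i: blocks[i][0])

    # Pass 2: copy the blocks out in order of memory address,
    # concatenating blocks destined for consecutive memory locations.
    output = []
    for i in index:
        addr, data = blocks[i]
        if output and output[-1][0] + len(output[-1][1]) == addr:
            output[-1] = (output[-1][0], output[-1][1] + data)
        else:
            output.append((addr, data))
    return output

# Blocks emitted out of order, as .org juggling would produce:
scattered = [(0x100, b"AAAA"), (0x108, b"CCCC"), (0x104, b"BBBB")]
print(linearize(scattered))   # one big block at 0x100
```

In a real linker, pass 1 would read only the block headers from the random-access file, and pass 2 would seek back to copy each block's data; the sorting step is the same.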
Compiling a case-select statement can be done with methods similar to those described above. Consider this example:
select i from
   case 1:
      -- some code
   case 2:
      -- more code
end
An assembly language programmer intent on writing equivalent machine code might write this:
        ldr     r3,[fp,#i]      @ get operand i (a local var)
        cmp     r3,#lower       @ check lower bound
        blt     endcase
        cmp     r3,#upper       @ check upper bound
        bgt     endcase
        adr     r4,table
        ldr     pc,[r4,r3,lsl #2] @ indirect indexed jump through table
lower   =       1
upper   =       2
table   =       .-(lower<<2)    @ address of element zero of table
        .word   case1           @ table[1]
        .word   case2           @ table[2]
case1:
        -- some code
        b       endcase
case2:
        -- more code
        b       endcase         @ optional
endcase:
This approach to select-case statements easily expands to any number of cases, and the run-time is independent of the number of cases. If the list of cases includes gaps, the unused table entries need to point to the else clause, if there is one. If there is none, the unused table entries need to point to the label marking the end.
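The bounds check plus indexed jump, with gaps defaulting to the else clause, can be mimicked in Python using a list of handlers in place of the table of addresses. The names here are illustrative, not part of any compiler:

```python
def build_table(cases, lower, upper, default):
    # One entry per value in lower..upper; unused entries (the gaps)
    # all point at the default, which plays the role of the else
    # clause or the end-of-statement label.
    return [cases.get(v, default) for v in range(lower, upper + 1)]

def select(i, table, lower, upper, default):
    if i < lower or i > upper:      # the cmp/blt and cmp/bgt tests
        return default()
    return table[i - lower]()       # the indexed jump through the table

cases = {1: lambda: "some code", 3: lambda: "more code"}   # 2 is a gap
default = lambda: "else clause"
table = build_table(cases, 1, 3, default)
```

Here `select(2, ...)` lands on a gap entry and runs the default, just as an out-of-range `select(9, ...)` does; only the in-range, filled-in values reach case-specific code.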
But note: If the list of table entries is sparse, the table can be a very inefficient use of memory. A cascade of n compare instructions, each followed by a conditional branch, takes 2n words on the ARM if all of the constants are small, and 4n words in the worst case where the constants are arbitrary 32-bit values. Thus, a cascade of compares may make better use of memory when fewer than half of the table entries are used (the small-constant case), and it will certainly make better use of memory when fewer than a quarter of the entries are used (even in the worst case).
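The break-even arithmetic is easy to check. The sketch below counts one word per table entry over the whole range and 2 or 4 words per compare-and-branch pair, ignoring the fixed overhead of the bounds-check and dispatch sequence:

```python
def table_words(lower, upper):
    # The jump table needs one word for every value in the range,
    # used or not.
    return upper - lower + 1

def cascade_words(n_cases, words_per_case):
    # 2 words per case with small constants, 4 in the worst case.
    return n_cases * words_per_case

# 4 cases scattered over a range of 16 values (one quarter used):
assert cascade_words(4, 2) < table_words(0, 15)    # small constants: cascade wins
assert cascade_words(4, 4) == table_words(0, 15)   # worst case: break-even at 1/4
```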
An optimizing compiler that tries to select between a jump table and a cascade of comparisons must accumulate the entire list of cases before it decides on how to select between them. This is difficult. When Niklaus Wirth designed Pascal, his solution to this problem was simple: He simply advertised, in the reference manual for the language, that case statements were inefficient when the set of cases was sparse. Programmers were simply advised to write cascades of if-else statements to handle selection between sparse sets of cases. This worked well.
Compiling a case-select construct in one pass is surprisingly easy when a two-pass assembler with arbitrary setting of the assembler origin is used. Consider the following reorganization of the above code:
        ldr     r3,[fp,#i]      @ get operand i (a local var)
        cmp     r3,#lower       @ check lower bound
        blt     endcase
        cmp     r3,#upper       @ check upper bound
        bgt     endcase
        adr     r4,table
        ldr     pc,[r4,r3,lsl #2] @ indirect indexed jump through table
case1:  .org    table+(1<<2)
        .word   case1           @ table[1]
        .org    case1
        -- some code
        b       endcase
case2:  .org    table+(2<<2)
        .word   case2           @ table[2]
        .org    case2
        -- more code
        b       endcase
lower   =       1
upper   =       2
table   =       .-(lower<<2)    @ address of element zero of table
        .org    .+(((upper-lower)+1)<<2) @ reserve space for the table
endcase:
This is genuinely evil code. During pass 1 of the assembly, the origins used to put the table entries into place are undefined. They will be well defined during pass 2, though, so nothing bad comes of this except, for some assemblers, lots of pass 1 errors. The benefit of this scheme is that the assembler does not need to keep track of the values of the different case labels; all it needs to track is the maximum and minimum labels it has encountered.
The code, as shown, does not work if the table is sparse. In that case, either the table must be initialized with default values before the case-specific labels are filled in, or the unused table entries must be tracked as the case statement is compiled so that they can be filled in with their default values last. Both of these have been done successfully using two-pass assembly, but we will not give solutions here because this entire scheme is a digression that does not work with the ARM and the as assembler.
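The same one-pass discipline can be expressed in a back-patching style that works without any .org tricks: while compiling, record only each case value and its code address, plus the running minimum and maximum; when the end of the statement is reached, build the table with the default in every slot first, then patch in the known cases. The names below are illustrative, not part of any compiler:

```python
def compile_select(seen_cases, default_addr):
    # seen_cases: list of (case_value, code_address) pairs, recorded
    # in source order during the single pass over the statement.
    lower = min(v for v, _ in seen_cases)
    upper = max(v for v, _ in seen_cases)
    # Fill the whole table with the default (else clause or end
    # label) first, so gaps in a sparse range come out right ...
    table = [default_addr] * (upper - lower + 1)
    # ... then patch in the entries for the cases actually seen.
    for value, addr in seen_cases:
        table[value - lower] = addr
    return lower, upper, table

# Hypothetical addresses: cases 1 and 4 compiled, 2 and 3 are gaps.
lower, upper, table = compile_select([(1, 0x2000), (4, 0x2040)], 0x20F0)
```

Here the gap entries for 2 and 3 both hold the default address, matching the rule given above for sparse tables.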
When function definitions are nested arbitrarily within other functions, as permitted by Falcon, there is another problem. Consider this example:
procedure a
   -- some code
   procedure b
      -- body
   end;
   -- more code
end;
It would be nice, at compile time, to untangle the procedures a and b, so that the compiler output would look like the following:
A:
   -- receiving sequence for a
   -- translation of some code
   -- translation of more code
   -- return sequence for a
B:
   -- receiving sequence for b
   -- translation of body
   -- return sequence for b
A compiler that generates the complete parse tree and then generates code from that parse tree can do this, but a one-pass compiler that generates code on the fly must resort to alternatives. One approach to this is to simply compile into successive memory locations in the order of the text of the program, with added branches to connect the parts of each subroutine. The result might look like the following:
A:
   -- receiving sequence for a
   -- translation of some code
        b       ACONTINUE
B:
   -- receiving sequence for b
   -- translation of body
   -- return sequence for b
ACONTINUE:
   -- translation of more code
   -- return sequence for a
This imposes a run-time penalty for the nesting structure of the program, so it is worth looking for alternatives. One useful alternative offered by the Gnu as assembler is the notion of object code sections and subsections.
The assembler supports multiple assembly location counters. By default, code is assembled into the text section, but you can create other sections. Furthermore, each section can have subsections. Since the text section is the one that normally contains all of the executable code of a program, it is natural to put the code there, and it is reasonable to use subsections to organize that code. For the full details, see the documentation for the .section assembler operation, and the .text directive, equivalent to .section .text.
We could, for example, put each subroutine into a separate subsection of the text section by replacing the branch around each nested subroutine with a .text directive specifying a new subsection:
        .text   1
A:
   -- receiving sequence for a
   -- translation of some code
        .text   2
B:
   -- receiving sequence for b
   -- translation of body
   -- return sequence for b
        .text   1
   -- translation of more code
   -- return sequence for a
By generating code this way, the body of each routine is assembled into one contiguous subsection of the text section, and these subsections are concatenated, in order of subsection number, to form the entire program.
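A toy model of this mechanism makes the concatenation order concrete. This is a hypothetical sketch, not as's real implementation: each .text n directive switches the stream that subsequent code goes into, and the final image is the subsections glued together in numerical order.

```python
class SubsectionAssembler:
    def __init__(self):
        self.subsections = {}   # subsection number -> emitted lines
        self.current = 0

    def text(self, n):
        # Models the ".text n" directive: switch location counters.
        self.current = n
        self.subsections.setdefault(n, [])

    def emit(self, line):
        # Models assembling one line into the current subsection.
        self.subsections.setdefault(self.current, []).append(line)

    def link(self):
        # Concatenate the subsections in order of subsection number.
        out = []
        for n in sorted(self.subsections):
            out.extend(self.subsections[n])
        return out

asm = SubsectionAssembler()
asm.text(1); asm.emit("A: receiving sequence for a")
asm.emit("translation of some code")
asm.text(2); asm.emit("B: receiving sequence for b")
asm.emit("translation of body"); asm.emit("return sequence for b")
asm.text(1); asm.emit("translation of more code")
asm.emit("return sequence for a")
```

After `link()`, all four lines of a's body come out contiguously, followed by all of b's, even though the source interleaved them, which is exactly the untangling the one-pass compiler wanted.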