Lecture 39, Exploiting the Assembler

Part of the notes for CS:4908:0004 (22C:196:004)
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Avoiding Forward References

In many assemblers (unfortunately, this excludes as, the Gnu assembler, also known as gas), it is possible to generate code without any forward references. Consider this little bit of ARM code containing a forward branch instruction:

        -- some code --
        b       there
        -- more code --
there:
        -- yet more --

The key to avoiding forward references is to use the .org directive that is intended for setting the assembly origin. Consider the following:

        -- some code --
bplace:
        .org    .+4     @ reserve space for the b instruction
        -- more code --
there:
        .org    bplace  @ go back to fill in missing code
        b       there   @ the forward branch
        .org    there   @ resume normal assembly
        -- yet more --

This is evil code, but it does eliminate the need for forward references. It is also illegal under the rules of as because as forbids the use of .org directives that decrement the location counter. This is annoying, but then, as does support forward references, so we don't really need to use this trick.

Actually, many assemblers work perfectly well with this code even though they output error messages. The prohibition against decrementing the location counter exists because the author of the assembler did not want to guarantee a correct result, but in many cases the implementation does exactly what you would expect even as it outputs the error message.

Note one other thing: the resulting object code is full of hesitations as the location counter is set and reset. Typical object codes consist of a sequence of blocks, with each block headed by a load address, so the loader reads the object file one block at a time, storing each block starting at the indicated memory address. Direct-memory-access I/O devices allow the actual copying of each block into RAM to be done by a single DMA operation, but the software may have to intervene at the start of each block. Furthermore, blocks may have to be stored on disk so that each block begins on a sector boundary. The net result is that it is far more efficient to load a program as one big block than as a sequence of little blocks.
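
To make the per-block cost concrete, here is a minimal C sketch of a loader for a hypothetical block-structured object format; the struct layout and the names used here are invented for illustration, not taken from any real object format.

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

/* a hypothetical object format: each block is a header giving the
   load address and byte count, followed by that many bytes of code */
struct blockhead {
    uint32_t addr;    /* memory address at which to load the block */
    uint32_t size;    /* number of bytes in the block body */
};

/* read one object file into the memory image mem,
   assumed large enough to hold every block */
void load(FILE *obj, uint8_t *mem) {
    struct blockhead h;
    while (fread(&h, sizeof h, 1, obj) == 1) {
        /* the software must intervene here, once per block, even if
           the block body itself is copied by a single DMA transfer */
        if (fread(mem + h.addr, 1, h.size, obj) != h.size) {
            fprintf(stderr, "truncated object file\n");
            exit(EXIT_FAILURE);
        }
    }
}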

Of course, the linker could take the output of the assembler and linearize it, weaving the blocks together into the correct order. If the object code is stored in a random-access disk file, complete sorting of the blocks, and concatenation of those blocks that are to be loaded into consecutive memory locations, requires only two passes. The first pass makes an index recording each block's starting address on disk and the memory address where it is to be loaded. This index is then sorted by memory address, and the blocks are copied to the output file in that order.
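
A sketch of those two passes in C might look like the following; assume pass 1 has already filled in the index array, and note that the simplified output format (a bare load address heading each merged block) is invented for illustration.

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

/* one index entry per block, filled in by pass 1 */
struct index_entry {
    long     file_offset;  /* where the block body sits in the input */
    uint32_t mem_addr;     /* memory address where it is to be loaded */
    uint32_t size;         /* length of the block body in bytes */
};

static int by_mem_addr(const void *a, const void *b) {
    const struct index_entry *p = a, *q = b;
    return (p->mem_addr > q->mem_addr) - (p->mem_addr < q->mem_addr);
}

/* pass 2: sort the index, then copy blocks out in memory order,
   emitting a new block header only where there is a gap */
void linearize(struct index_entry *idx, size_t n, FILE *in, FILE *out) {
    qsort(idx, n, sizeof idx[0], by_mem_addr);
    for (size_t i = 0; i < n; i++) {
        if (i == 0
        ||  idx[i].mem_addr != idx[i-1].mem_addr + idx[i-1].size) {
            /* not contiguous with the previous block: new load address */
            fwrite(&idx[i].mem_addr, sizeof idx[i].mem_addr, 1, out);
        }
        fseek(in, idx[i].file_offset, SEEK_SET);
        for (uint32_t j = 0; j < idx[i].size; j++) {
            int c = getc(in);
            if (c == EOF) break;        /* truncated input */
            putc(c, out);               /* copy the block body */
        }
    }
}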

Case-select Jump Tables

Compiling a case-select statement can be done with methods similar to those described above. Consider this example:

select i from
case 1:
    -- some code
case 2:
    -- more code
end

An assembly language programmer intent on writing equivalent machine code might write this:

        ldr     r3,[fp,#i]      @ get operand i (a local var)

        cmp     r3,#lower       @ check lower bound
        blt     endcase
        cmp     r3,#upper       @ check upper bound
        bgt     endcase

        adr     r4,table
        ldr     pc,[r4,r3,lsl#2]@ indirect indexed jump through table
lower   =       1
upper   =       2
table   =       .-(lower<<2)    @ address of element zero of table
        .word   case1           @ table[1]
        .word   case2           @ table[2]

case1:
        -- some code
        b       endcase
case2:
        -- more code
        b       endcase         @ optional
endcase:
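
The definition of table deserves a note: it names the address of the hypothetical element zero of the table, lower<<2 bytes before the first .word directive. With lower = 1, table is the address of table[1] minus 4, so the indexed load computes table + (i<<2), which for i = 1 lands exactly on table[1]. This bias lets the value of i index the table directly, with no run-time subtraction of the lower bound.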

This approach to select-case statements easily expands to any number of cases, and the run-time is independent of the number of cases. If the list of cases includes gaps, the unused table entries need to point to the else clause, if there is one. If there is none, the unused table entries need to point to the label marking the end.

But note: if the list of table entries is sparse, the table can be a very inefficient use of memory. A cascade of n compare instructions, each followed by a conditional branch, takes 2n words on the ARM if all of the constants are small, and 4n words in the worst case where the constants are arbitrary 32-bit values. Thus, a cascade of compares could make better use of memory if fewer than half of the table entries are used, and it will certainly make better use of memory if fewer than 1/4 of the table entries are used.
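
For example, if the case labels span the range 1 to 20 but only 4 labels are actually used, the jump table occupies 20 words, while a cascade of 4 compares takes 8 words with small constants and at worst 16 words with arbitrary constants; since 4 is less than 1/4 of 20, the cascade wins either way.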

If the cascade of compares is organized correctly, it can be arranged as a binary search. Consider replacing this

if (i = 1)
  -- code for case 1
else
  if (i = 2)
    -- code for case 2
  else
    if (i = 3)
      -- code for case 3
    else
      if (i = 4)
        -- code for case 4
      else
        -- code for all others
    end
  end
end

with this

if (i >= 1) & (i <= 4)
  if (i < 3)
    if (i = 1)
      -- code for case 1
    else
      -- code for case 2
    end
  else
    if (i = 3)
      -- code for case 3
    else
      -- code for case 4
    end
  end
else
  -- code for all others
end

This code has the efficiency of a binary search, with the same number of comparisons required regardless of which case is selected (excepting the else clause, which an optimizing compiler might be able to eliminate or simplify based on range constraints on the type of the value i). For code patterned on the above, we can select between any n consecutive alternatives with log2(n)+2 comparisons; here, n = 4 gives 4 comparisons, two for the bounds check plus two levels of binary search.

An optimizing compiler that tries to select between a jump table, a linear cascade of comparisons, or a binary-search cascade of comparisons must accumulate the entire list of cases before deciding how to select between them. This requires complex code that is typical of a mature optimizing compiler but unlikely in a prototype compiler.
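
As a sketch of the decision itself, the following C fragment chooses a strategy once the full label list is known; the thresholds follow the memory arithmetic given earlier, and the names and cutoffs are invented for illustration.

#include <stddef.h>

enum dispatch { JUMP_TABLE, LINEAR_CASCADE, BINARY_CASCADE };

/* choose a dispatch strategy given the n distinct case labels;
   assumes n >= 1 */
enum dispatch choose_dispatch(const int labels[], size_t n) {
    int lo = labels[0], hi = labels[0];
    for (size_t i = 1; i < n; i++) {
        if (labels[i] < lo) lo = labels[i];
        if (labels[i] > hi) hi = labels[i];
    }
    size_t span = (size_t)(hi - lo) + 1;  /* words in a full jump table */
    if (2 * n >= span)
        return JUMP_TABLE;       /* dense: a cascade would be no smaller */
    else if (n <= 8)
        return LINEAR_CASCADE;   /* few cases: a short linear cascade */
    else
        return BINARY_CASCADE;   /* sparse and numerous: binary search */
}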

When Niklaus Wirth designed Pascal, his solution to this problem was simple: he advertised, in the reference manual for the language, that case statements were inefficient when the set of cases was sparse, and advised programmers to write cascades of if-else statements to select between sparse sets of cases. This worked well, and it allowed the compiler to always use an indexed indirect jump through a jump table.

Things are different in languages such as Python, where expressions are allowed everywhere, including as case labels. There, the cascade of comparisons is the general solution, and only a mature optimizing compiler will be able to generate a jump table when the set of case labels is constant and compact.

Assembler Abuse for Case-Select Jump Tables

Compiling a case-select construct in one pass is surprisingly easy when a two-pass assembler with arbitrary setting of the assembler origin is used. Consider the following reorganization of the above code:

        ldr     r3,[fp,#i]      @ get operand i (a local var)

        cmp     r3,#lower       @ check lower bound
        blt     endcase
        cmp     r3,#upper       @ check upper bound
        bgt     endcase

        adr     r4,table
        ldr     pc,[r4,r3,lsl#2]@ indirect indexed jump through table

case1:
        .org    table+(1<<2)
        .word   case1          @ table[1]
        .org    case1
        -- some code
        b       endcase
case2:
        .org    table+(2<<2)
        .word   case2          @ table[2]
        .org    case2
        -- more code
        b       endcase

lower   =       1
upper   =       2
table   =       .-(lower<<2)    @ address of element zero of table
        .org    .+(((upper-lower)+1)<<2) @ reserve space for the table
endcase:

This is genuinely evil code. During pass 1 of the assembly, the origins used to put the table entries into place are undefined. They will be well defined during pass 2, though, so nothing bad comes of this except, for some assemblers, lots of pass 1 error messages. The benefit of this scheme is that the compiler does not need to keep track of the values of the different case labels; all it needs to track is the maximum and minimum labels it has encountered.

The code, as shown, does not work if the table is sparse. In that case, either the table must be initialized with default values before the case-specific labels are filled in, or the unused table entries must be tracked as the case statement is compiled so that they can be filled in with their default values last. Both of these have been done successfully using two-pass assembly, but we will not give solutions here because this entire scheme is a digression that does not work with the ARM and the as assembler.

Nested Function Definitions

When function definitions are nested arbitrarily within other functions, as permitted by Kestrel, there is another problem. Consider this example:

a: procedure
    -- some code

    b: procedure
       -- body
    end;

    -- more code
end;

It would be nice, at compile time, to untangle the procedures a and b, so that the compiler output would look like the following:

A:      -- receiving sequence for a
        -- translation of some code
        -- translation of more code
        -- return sequence for a

B:      -- receiving sequence for b
        -- translation of body
        -- return sequence for b

A compiler that generates the complete parse tree and then generates code from that parse tree can do this, but a one-pass compiler that generates code on the fly must resort to alternatives. One approach to this is to simply compile into successive memory locations in the order of the text of the program, with added branches to connect the parts of each subroutine. The result might look like the following:

A:      -- receiving sequence for a
        -- translation of some code
        B       ACONTINUE

B:      -- receiving sequence for b
        -- translation of body
        -- return sequence for b

ACONTINUE:
        -- translation of more code
        -- return sequence for a

This imposes a run-time penalty for the nesting structure of the program, so it is worth looking for alternatives.

Perhaps the simplest solution is to defer the generation of the receiving sequence of a subroutine until the first component of that subroutine is encountered that requires code generation. This means that, so long as programmers adopt a coding style in which all declarations come before any executable statements, and so long as local variables are not subject to implicit initialization, the receiving sequence for each block can be deferred until after the entire code for nested subroutines has been generated. In the event that programmers mix definitions and code in an undisciplined way, the branches suggested above would remain mixed into the code.
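
A hedged C sketch of this bookkeeping follows; the proc record, the require_prologue hook, and the emitter are all invented names, standing in for whatever a real one-pass code generator would use.

#include <stdbool.h>
#include <stdio.h>

/* one record per procedure being compiled; all names are invented */
struct proc {
    const char *name;
    bool prologue_emitted;   /* has the receiving sequence gone out? */
};

/* stand-in for the real code emitter */
static void emit_receiving_sequence(const char *name) {
    printf("%s:\t-- receiving sequence\n", name);
}

/* call this before generating code for any executable statement in
   p's body; declarations, including nested procedure definitions,
   never trigger it for the enclosing procedure, so the receiving
   sequence is emitted only when the first executable statement is
   reached, after any nested bodies have already been compiled */
static void require_prologue(struct proc *p) {
    if (!p->prologue_emitted) {
        emit_receiving_sequence(p->name);
        p->prologue_emitted = true;
    }
}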

The Gnu as assembler offers a useful alternative because it supports the use of object code sections and subsections, each with its own location counter. By default, code is assembled into the text section, but you can create other sections. Furthermore, each section can have subsections. Since the text section is the one that normally contains all of the executable code of a program, it is natural to put the code there, and it is reasonable to use subsections of the text section to organize that code. For the full details, see the documentation for the .section directive and for the .text directive, which is roughly equivalent to .section .text.

We could, for example, put each subroutine into a separate subsection of the text section by replacing the branch around each nested subroutine with a .text directive specifying a new subsection:

        .text   1
A:      -- receiving sequence for a
        -- translation of some code
    
        .text   2
B:      -- receiving sequence for b
        -- translation of body
        -- return sequence for b

        .text   1
        -- translation of more code
        -- return sequence for a

When code is generated this way, the body of each routine is assembled into one contiguous subsection of the text section. The assembler concatenates the subsections in increasing numerical order to form the complete text section, and with it, the entire program.