Chapter 6, Conditional and Macro Assembly

High level tools in a low level language

Part of the notes for 22C:50, Introduction to System Software
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

The Problem

Although the assembler discussed in the previous sections would be a useful programming tool, it lacks two important features which distinguish most commercially available assemblers; these are conditional assembly directives and a macro mechanism. These features are also common in some programming languages; for example, the C and C++ preprocessor supports conditional compilation and compiler macros.

These features, usually grouped under the name macro processing, are handled at translation time. That is, they are control structures used to control the behavior of a language translator, determining what text the translator processes, what text it ignores, and what substitutions are performed in that text prior to processing.

. In fact, these control structures are quite general; if correctly implemented, they allow arbitrary computations to be performed at translation time. The macro facility we will explore has this generality. The standard C and C++ preprocessors do not!

Conditional Assembly Directives

Conditional assembly or compilation mechanisms are frequently considered to be mechanisms that allow a single version of the source code for a program to be used to generate multiple versions of the executable. Consider, for example, a C or C++ program that can be compiled in two versions, one for use under a window manager, where error messages are displayed in a popup window, and one for use under a command shell, where error messages are simply directed to the error output stream. This can be done shown in Figure 6.1.

/* ERROR_STYLE should be defined as either WINDOW or STREAM */
#define WINDOW 0
#define STREAM 1
#define ERROR_STYLE WINDOW

void errormsg( char* m )
/* report an error
   given:  m, a null terminated error message
*/
{
	#if ERROR_STYLE == WINDOW
		make_popup_window( m );
	#else
		fputs( m, stderr );
	#endif
}

Figure 6.1. Use of conditional compilation in C or C++.

When the example from Figure 6.1 is comiled, the machine code that results will either call the make-popup-window routine or the fputs routine. No machine code will be included for the other routine! The symbols ERROR_STYLE, WINDOW and STREAM are not the names of variables! Rather, they are defined as synonyms for the numeric constants 0 and 1. The syntax for conditional compilation in C and C++ does not treat the lines beginning with # as statements, but unlike other constructs in C, it requires that they begin on a new line.

We will extend our example assembly language with analogous directives, placing them in the syntactic positon of symbolic opcodes, using the names IF, ELSE and ENDIF. Unlike the B and W directives, these will not cause anything to be put in memory, but instead, they will turn on and off the processing of all other assembly language statements. This is why these kinds of directives are known as pseudo-operations or pseudo-codes in many assembly languages.

The conditional assembly directives of many older assemblers much more closely resemble the control structures of machine languages. For example, the conditional assembly directive on the IBM 360, 370, and related machines takes the form of a conditional GOTO! "AIF (e) l" causes the assembler to skip to the line with the label l when the expression e is true. In that assembler, labels used only for conditional assembly can be prefixed with a period to prevent them from being entered into the assembler's symbol table.

Conditional assembly must not be confused with the run-time control structure of the assembled program. The effects of conditional assembly are entirely confined to assembly time, and are in many ways analogous "commenting out" sections of the source code. Any control structure in the assembled program must still be implemented by branch or jump machine instructions and labels.

Before continuing with a discussion of how conditional assembly directives are implemented, it should be noted that there are a number of different conditions that it might be useful to test. Examples of these are given in Figure 6.2.

IF a = b           ; if a and b are equal
IF a != b          ; if a and b are not equal
IF a < b           ; if a is less than b

IF DEF( a )        ; if a has been defined
IF FWD( a )        ; if a is a forward reference
IF TYP( a ) = ABS  ; if the type of A is absolute

IF FWD(A) | (DEF(A) & (A > 255))

Figure 6.2. Typical tests for conditional assemblers.

The obvious conditions are equality, inequality, and other comparison operators. Some assembly languages also include special tests for zero and nonzero, although there is no need for these when given general comparisons. Less obvious, although quite useful, are predicates which test assembly time properties of names. For example, the predicate DEF() might test to see if its operand has been defined, and FWD() might test if its operand is a forward reference, that is, that it has been used but not yet defined during the current pass. In assembly languages where symbols have types, an issue to be discussed in the next chapter, TYP() could return the type of its operand.

Just as conventional if statements are most useful in the context of a full suite of control structures that includes iteration and perhaps recursion, conditional compilation or assembly can be complemented with additional control structures. We will do this here with macros, after we first complete an examination of the implementation of conditional assembly features.

Implementing Conditionals

Conditional assembly is relatively easy to implement. When the assembler encounters an IF directive where the condition evaluates to false, assembly is suspended until the matching ELSE or ENDIF is encountered. When the assembler encounters an ELSE during normal assembly, assembly is suspended until the matching ENDIF is found. When an ELSE clause is encountered while assembly was suspended by the matching IF clause, normal assembly resumes. Encountering an ENDIF during normal assembly has no effect.

This, of course, leaves one big question unanswered: How does the assembler suspend the assembly process? This has been answered in many different ways in different implementations of conditional assembly or conditional compilation. Some assemblers and compilers rely on a prepass through the source to process the conditionals, including or excluding lines of source code before the regular assembly or compilation process begins. The prepass is sometimes handled by an entirely separate software tool, a preprocessor for the language. The IBM 360 assembler processes conditionals in a prepass, while C and C++ rely on a preprocessor usually known as cpp (but invoked by cc -E -C) to handle conditionals, and this preprocessor may be easily used to process the input to other compilers or assemblers.

As an alternative to using a prepass, the assembler or compiler can process conditionals as it processes the program. There are several ways of doing this. The most extreme involves suppressing object code generation, location counter updates and symbol table changes when assembly is turned off, while continuing to do a complete lexical and syntax analysis of the input.

This can be done with a simple Boolean flag, assembly-on. All symbol table updates, location counter updates, object code generation and error message output are conditional on the value of this flag. On encountering an IF directive, the assembler would set assembly-on to the value of the Boolean expression tested. On encountering an ELSE directive, the assembler would complement assembly-on, and on encountering an ENDIF directive, the assembler would restore assembly-on to its value prior to the IF directive.

This scheme can have odd side effects if care is not taken. Consider an IF-ENDIF block containing multiple ELSE directives! Or, consider the problems posed by nesting one IF-ELSE-ENDIF block inside another; without special care, the consecutive ELSE directives will each complement assembly-on, leading to something quite different from what the user wanted.

A correct implementation of conditional assembly directives follows from the way a person might handle them. A person will typically count the nesting levels and perhaps draw lines connecting the corresponding IF, ELSE and ENDIF directives.

An assembler or compiler can't draw lines on a listing, but it can count nesting levels. Consider using a counter which is zero during normal assembly and positive when assembly is inhibited. If the counter is zero when an IF is found with a false condition, this counter is set to 1 to inhibit assembly; the same is done when an ELSE is found in the same context. When the counter is positive, IF increments the counter and ENDIF decrements the counter. Thus, if the counter is one, assembly must have been inhibited by the immediately enclosing IF-ENDIF pair; only in this context should ELSE set the counter to zero.

The example in Figure 6.3 shows the use of this counter by giving counter values and equivalent assembly code for each combination of conditions tested in the original code.

              |     c1=false   |c1=true,c2=false|c1=true,c2=true
original code | counter   code | counter   code | counter   code
--------------+----------------+----------------+---------------
B 1           |    0      B 1  |    0      B 1  |    0      B 1
IF c1         |                |                |
   B 2        |    1           |    0      B 2  |    0      B 2
   IF c2      |                |                |
      B 3     |    2           |    1           |    0      B 3
   ENDIF      |                |                |
   B 4        |    1           |    0      B 4  |    0      B 4
ELSE          |                |                |
   B 5        |    0      B 5  |    1           |    1
ENDIF         |                |                |
B 6           |    0      B 6  |    0      B 6  |    0      B 6

Figure 6.3. Conditional assembly counting nested IF directives.

Most assembly languages with conditional assembly features treat the conditional directives as if they were operation codes. This means that labels are allowed on the same line as an IF or an ENDIF directive, and it raises the question of how such labels should be interpreted. The most straightforward implementation processes such labels in whatever assembly mode was in effect as the processing of the line began, before paying any attention to what assembly directive or symbolic opcode follows the label.

In the C and C++ preprocessor, all conditional directives are on a line by themselves, with a # as the first nonblank character on the line. This is analogous to forbidding labels on conditional assembly directives, and for our example assembly language, the result can be described by a trivial change to the definition of <line> from Figure 2.9:

<line> ::= ( <definition>
           | <statement> |
           | IF <operand> |
           | ELSE |
           | ENDIF
           ) [ ;<text> ] <line end>

Using this as a guide, and using a global counter cond to control the assembly process, conditional assembly directives, with nesting, could be processed by the code given in Figure 6.4.

procedure line;
begin
     if lex.this = 'IF' then begin
          if cond > 0 then begin
               cond := cond + 1;
               ignoreline;
          end else begin
               nextlex {skip IF so operand can be parsed};
               if not(operand) then cond := 1;
          end;
     end else if lex.this = 'ELSE' then begin
          if cond < 2
               then cond := 1 - cond;
          nextlex {skip ELSE};
     end else if lex.this = 'ENDIF' then begin
          if cond > 0
               then cond := cond - 1;
          nextlex {skip ENDIF};
     end else if cond = 0 then begin
          if lex.next = '='
               then definition
               else statement;
     end;
     skipline;
end {line};

Figure 6.4. Outline of code for conditional assembly.

This code may be viewed as a modification of Figures 2.14 or 2.15. The code shown here is only an approximation of what would be included in a production assembler, since a number of issues have been ignored. For example, no effort has been made to detect the use of extra ELSE directives outside of an IF-ENDIF pair.

From the user's point of view, perhaps the most important remaining issue is actually a cosmetic one: What should be included in the assembler listing when assembly is suppressed? There are at least two alternatives: List everything, treating suppressed lines as comments in the listing; or, suppress listing of conditional directives, listing only those lines assembled normally. Since both of these are useful in different contexts, a new assembly directive is frequently introduced to select one or the other mode. For example, in MACRO 11, the assembler for the historically important DEC PDP-11 family of computers, the directive .NLIST CND turns off listing of code which is ignored, while .LIST CND turns on listing of this code (note that all assembler directives in MACRO 11 start with a dot; this makes them easy to distinguish from machine instructions).

Macro Assembly Directives

Macro facilities are are common in both high level languages such as C and C++ and in assembly languages. In addition, macros are an important feature of text formatting systems, spreadsheet cell definitions and many other applications. The C and C++ macro mechanism uses the keyword #define, as illustrated in Figure 6.5:

/* macro to increment x -- open subroutine */
#define MINC(x) x = x + 1

/* function to increment x -- closed subroutine */
void finc( int * x )
{
	*x = *x + 1
}

void demo()
{
	int i;
	MINC( i );   /* macro call */
	finc( &i );  /* function call */
}

Figure 6.5. Open and closed subroutines in C.

In Figure 6.5, two different mechanisms are created, each capable of accomplishing the same purpose. The difference between these two lies not in what the routine does but in how the job is divided between compile-time processing and run-time processing. In the case of the function finc(), the compiler translates this into machine code, and the call to the function happens at run-time with an actual control transfer from the machine code for the function demo() to the machine code of the function finc() and then a return when the function completes. At run-time, a pointer to the variable i is actually passed to the function.

For the macro MINC(), on the other hand, the compiler, or rather, the C preprocessor, substitutes the text i=i+1 for the macro call MINC(i), so that, at run-time, there are no control transfers and there is no parameter passing!

In the 1950's, when the art of computer programming was still young, different research groups, working independently, developed these two ideas for how to implement subroutines, and for a while, the terms open subroutine and closed subroutine were commonly used for what we now call macros on the one hand functions, procedures or just subroutines on the other hand. Whether these are implemented in a high level language or in an assembly language, the distinction remains the same. Macros are processed at translation time, while subroutines, functions or procedures exist at run-time and require the use of call and return mechanisms in the executed machine language.

For our example assembly language, we will use the basic macro notation illustrated in Figure 6.6, a notation that is typical of that used in assembly languages designed after the mid 1960's.

	MACRO CRLF
	  B CR
	  B LF
	ENDMAC

	MACRO BB X
	  B X
	  B X
	ENDMAC

	CRLF
	BB	5
Figure 6.6. A notation for macro definition and use

The first macro definition in Figure 6.6 defines a macro named CRLF that has no parameters and a 2-line macro body. The assembler does not assemble the macro body when it is defined, but simply saves it for later processing. Once a macro has been defined, use of that macro's name in the symbolic opcode position of a line of assembly code constitutes a macro call. The assembler, on encountering a macro call performs a process called macro expansion by substituting the body of the macro for the call in the assembled code. The second to the last line of Figure 6.6 is a call to the macro CRLF; this has the effect of assembling two bytes in memory, a CR followed by an LF.

The second macro defined in Figure 6.6, named BB, has one macro formal parameter, X. The example macro call to B provides one macro actual parameter, 5. During macro expansion, the actual parameter is substituted for the formal parameter wherever it occurs in the body of the macro and then the body is substituted for the call itself.

Figure 6.7 shows some more complex macros in our example assembly language:

	MACRO IIF (COND), (THENPART), (ELSEPART)
	  IF COND
	    THENPART
	  ELSE
	    ELSEPART
	  ENDIF
	ENDMAC

	MACRO POINT TABLE, ENTRY
	  TEMP = .
	     . = TABLE + ( 2 * ENTRY )
	       W TEMP
	     . = TEMP
	ENDMAC

	IIF (SIZE = 0), (B 5), (W 5)
	POINT T, 6

Figure 6.7. Some complicated macro definitions.

The macro IIF defined in Figure 6.7 illustrates the combination of macro parameters with conditional assembly directives. In IIF, the macro formal parameters are each parenthesized to indicate that the actual parameters must themselves be parenthesized lists. Had the formal parameters not been enclosed in parentheses, the actual parameters would have been required to be single lexemes. This is about as close as most macro assemblers come to the concept of parameter passing modes. The net effect of the IIF macro is to allow a compact one-line expression of a condition and two alternatives, as illustrated on the second to the last line of Figure 6.7. Here, we wish to assemble a byte if SIZE is zero, and a word otherwise.

The second macro defined in Figure 6.7, POINT, performs a complex combination of operations on the location counter, making use of macro parameters to control this. POINT first saves the current location by assigning it to the symbol TEMP; at the end, it restores the location counter from TEMP, so a call to POINT makes no net change to the location counter. Within the body of POINT, the location counter is temporarily set to a value depending on the formal parameters TABLE and ENTRY, and a word is stored containing the saved location counter from the calling context. The parameter TABLE gives the address of word zero of an array, ENTRY gives the entry in that array where a one-word pointer to the caller's location counter should be stored. So, the final line of Figure 6.7 makes T[6] point to the current location.

The two code fragments in Figure 6.8 assemble exactly the values into memory; the left example uses the macros defined in Figure 6.6 and Figure 6.7, while the right example is coded without the use of any but the most basic assembler facilities.

; WITH MACROS                        ; WITHOUT MACROS
   A = 5
   POINT TAB,0                       TAB0: B CR
   CRLF                                    B LF
   POINT TAB,1                       TAB1: B 5
   B     5                           TAB2: B 6
   POINT TAB,2
   IIF   (A = 5), (B 6), (CRLF)      TAB:  W TAB0
TAB:                                       W TAB1
   . = . + 6                               W TAB2
   B 32					   B 32

Figure 6.8. Calls to macros and equivalent code.

The macro definition and calls in Figure 6.9 provide another example.

	MACRO TWICE (LINE)
	  LINE
	  LINE
	ENDMAC

	TWICE (B 2)

	TWICE (TWICE (B 4))
Figure 6.9. An illustration of macro expansion.

Consider first the expansion of the call "TWICE (B 2)" in Figure 6.9. The first step in processing this is to save "B 2" as the actual parameter to be substituted for LINE in the body of TWICE; we will save it in a special symbol table, the macro parameter table. Next, the text of the macro TWICE is read, and each formal parameter in this text is replaced from the macro parameter table. Finally, each line that results is processed as normal input, so that the net result is the same as assembling the line "B 2" twice, putting the value 2 in two consecutive memory locations.

Processing of each line of the macro body may involve further macro expansion! This is illustrated in the second call to TWICE in Figure 6.9. The results of successive macro expansion operations on the macro calls from Figure 6.9 are shown in Figure 6.10.

Original code        Expanded once        Expanded again

TWICE (B 2)          B 2                  B 2
                     B 2                  B 2

TWICE (TWICE (B 4))  TWICE (B 4)          B 4
                                          B 4
                     TWICE (B 4)          B 4
                                          B 4

Figure 6.10. Multiple macro expansion passes.

In fact, most real macro assemblers do not make multiple expansion passes, as suggested by Figure 6.9! Nor are macros expanded first and then, after expansion, conditionals processed! Instead, all macro expansions and conditional processing are almost always done together; as the body of a macro is expanded, the text is processed, looking for further macro definitions, further macro calls, and any conditional directives.

The classic C preprocessor notation does not allow for this! In the C preprocessor, a macro (a #define) may not contain conditional directives (#if). C preprocessor conditionals can cause a macro to have different definitions depending on the condition tested, but this happens at macro definition time, not at expansion time. As a result, C preprocessor macro bodies may not be recursive because, without conditionals, there would be no way to terminate the recursion.

Implementing Macros

The key to efficient macro implementation is a mechanism which allows many different sources of text to be spliced together as the assembly process continues. This could be viewed as being done by a switch on the input to the assembler, as shown in Figure 6.11.

                          ---<--------------------------
                         |                  ---<------  |
                         |                 |          | |
 -------------     -------------     -------------    | |
| source file |   | saved macro |   | saved macro |   | |
|  (on disk)  |   |   bodies    |   |  parameters |   | |
 -------------     -------------     -------------    | |
       |                 |                 |          | |
        ---------->----  |  ----<----------           | |
                       | | |                          | |
                       o o o    -----------------     | |
     input selection   |-------| stack of switch |    | |
         switch         -      |     settings    |    | |
                         |      -----------------     | |
              -----------------------                 | |
             | Lexical and Syntactic |--->------------  |
             |   Analysis Phases of  |--->--------------
             |      the Assembler    |
              -----------------------

Figure 6.11. The data flow through a macro assembler.

This switch is normally set to read from the input file, but when a macro call is encountered, the current switch setting is pushed on a stack and the switch is changed to read from the saved copy of the macro body. When the end of a macro is encountered, the switch is set back to its previous setting. While reading from the source of a macro, the switch must be temporarily set to read actual parameters each time a formal parameter is encountered.

The substitution mechanism (or mechanisms) described above must make use of stored copies of both the macro body and of the actual parameters. Furthermore, there must be symbol table facilities to match macro bodies with their names and to match actual and formal parameters. Generally, macro names are stored in the op-code table, since they are used syntactically as if they were op-codes. Formal parameters pose more difficult problems. They take on a particular set of values only briefly (during a single expansion of a macro), and they are undefined most of the time. Furthermore, if a macro call is encountered during the expansion of a macro, whether or not that call refers to the same macro (a recursive macro) or to another, there must be a stack of definitions, with only the current definition being active at any time.

Although there are many equivalent macro substitution mechanisms, only one will be outlined here. In this approach, macro bodies are stored in the string pool, but in a slightly modified form, as shown in Figure 6.12.

Source text of macro:

MACRO POINT TABLE, ENTRY
  TEMP = .
     . = TABLE + ( 2 * ENTRY )
       W TEMP
     . = TEMP
ENDMAC

Stored form of text:

--TEMP-=-.@-----.-=-#1-+-(-2-*-#2-)@ -------W-TEMP@-----.-=-TEMP@&

Key to special symbols:

                                      Note:  These symbols
	  symbol   use                are used as printable
	     @     end of line        representations of
	     &     end of macro       character codes that
	     #n    parameter n        would be unprintable
	     _     space              in real implementations.

Figure 6.12. Storage of a macro body in the string pool.

The macro header and the line containing ENDMAC are not included in the string pool, since these are not needed in the expansion. If we wish to support a notion of parameter type (for example, the idea that some parameters can be required to be parenthesized lists) we can prefix the macro body with one byte per formal parameter, where that byte gives the form of the expected parameter.

In the string pool, line ends must be marked with a special character, typically a character that is guaranteed to be absent from the text of each line, such as the ASCII CR (carriage return) or LF (line feed) characters, shown in Figure 6.12 as @. The end of each macro must also be marked with another special character, perhaps the ASCII NULL or ETX (end of text), represented here as &. The positions where actual parameters must be substituted into the macro body are marked, but the symbolic actual parameter names are removed and replaced with parameter numbers, in Figure 6.12, we used #1 to refer to the first parameter, #2 for the second, and so on. (The symbol # is used for yet another character that is guaranteed to be absent from the input line, perhaps the ASCII DLE [data link escape] or ESC [escape].)

The storage format suggested here for macro bodies is based on the observation that many applications of macros in production settings involve huge numbers of macros, some of which have extremely large bodies. Consider, for example, the C preprocessor. The standard header files <stdio.h> contains 21 defines (macro definitions), and it includes files that define even more symbols. The header files for such major components as window managers are far bigger, and as a result, we routinely expect production assemblers or compilers to be able to store thousands of macro definitions!

The substitution of symbols such as #1 for each formal parameter in the stored text of a macro allows the formal parameter symbol table to be re-initialized as each macro definition is processed and discarded afterwards. Since the number of formal parameters for any particular macro will be small, this table should probably be managed by a trivial linear search!

Macro processing can take place during a prepass before assembly or compilation begin, as is done with the C and C++ preprocessor, or it can be done at the same time as assembly or compilation. Many assemblers designed after the mid 1960's integrate macro processing with the assembly process, and in most such assemblers, macros are simply expanded identically during each pass. The alternative is to output the result of expansion from the first pass and process this pre-expanded text on the second pass, but this trades the CPU time required for macro expansion for the input-output time required to save an intermediate file containing the expanded macros, and even in 1970, the computation required to expand macros was usually faster than the time taken to write the results to an intermediate file.

It should be noted that in a two pass assembler where macro substitution is performed during both passes, it would be a mistake to store new definitions of the macros as the macro definitions are processed during the second pass! Not only would doing so store duplicate definitions of each macro, but it would lead to an even greater possibility that a macro might expand differently on the two passes. The presence of conditional features in the assembly language already leads to the risk of different results during the two passes, and we do not want to do anything to increase this risk!

During macro expansion, a stack is needed to hold the actual parameters. When a new macro call is encountered, each of its parameters is pushed onto this stack, and when expansion finishes, the associated parameters are popped. The easiest way of viewing this stack is to consider the actual line of the call to the macro currently being expanded as being on the stack top. To save time in extracting actual parameters from the line, the text of the line could be accompanied by an index giving the starting character position of each parameter on the line. This works well, but it does waste space by storing leading blanks, comments, and the macro name itself.

Some assemblers severely limit the number of nesting levels allowed for macro calls; numbers like 4 and 8 permitted levels of nesting are not uncommon! If such a limit is acceptable, it makes perfectly good sense to stack the entire line along with a table of pointers to each parameter. Such shallow limits have little effect on everyday programming, where any use of macros is considered to be exotic, but if for example, recursive macros are used to perform interesting computations, we would like to make the most efficient possible use of memory. Figure 6.13 suggests one stack format that makes efficient use of memory:

           __________________________________________________________
 <-push-< |n|k|o| ... |o| last param |@| ... |first param|@| ...
           ----+-------+--------------------------------------------
           ^   |       |_^                    ^
stack top _|   |______________________________|

              \_________/\________________________________/
              n pointers           n parameters

     n = the number of parameters.
     k = previous location from which input was being read.
     @ = end of parameter

Figure 6.13. Suggested format for the stack of actual macro parameters.

The parameters are shown on the stack in reversed order, since the last parameter pushed will be the topmost in the stack, and they are pushed in the order that they are parsed, left to right on the calling line. Note that, unless we are dealing with unusual extensions to the macro language, the parameter text we push is the text of the actual parameter, extracted directly from the macro call! If there are no parameters, the stack entry is just two bytes, the saved location from which input was being read and a count of zero.

During the construction of each stack entry, a temporary array is needed to hold the starting positions of each of the parameters; only when all parameters have been pushed is this array itself pushed onto the stack top. wopping a stack entry constructed as shown in Figure 6.13 requires searching for the end of the last parameter; this isn't computationally awful, but we can eliminate the search by including, near the head of each stack entry, a pointer to the previous stack entry.

It should be noted that clear reporting of errors in macro usage may require additional data in each stack entry! For example, in order to report which macro was being expanded, a pointer to the symbolic name of the macro (in the string pool, for instance) would have to be stored in each entry. Another error reporting problem occurs when an insufficient number of actual parameters is provided on a macro call. If the expected number of parameters is not stored with the macro definition, this will only be detected when a formal parameter is used which was not defined. In this case, the null string is frequently substituted for the formal parameter, giving no indication of an error unless the result could not be assembled. An alternative which is not commonly used is to restore the symbolic form of the formal parameter when this happens, leading to an obvious indication of what happened.

It is interesting to note that a complete macro assembler requires two variable size data structures, the string pool and the macro parameter stack. Each of these may grow to an unusually large size, and it is therefore tempting to allocate a separate huge memory array to each. Doing so may not be wise, since both of these structures are arrays of bytes, and each grows from one end. If we let the stack and the string pool share the same array, growing from opposite ends, then either one may grow to fill the free space. This arrangement is shown in Figure 6.14.

 ____________________________________________________________________
| string pool              |        unused        |      macro stack |
 --------------------------------------------------------------------
                            ^                      ^
pool free space pointer ____|                      |
                           macro stack pointer ____|

Figure 6.14. The string pool and macro stack sharing one array.

Arranging things so that there is only one block of unused space between two data structures instead of two blocks, one in the array for each structure makes it more likely that we will not run out of memory, no matter how much free space we have. The logic for doing this is very similar to that for using a string pool in the first place: Limits are placed on the total resource usage instead of on each individual demand. Stack or string pool overflow are detected when the pool free space pointer is greater than the macro stack pointer.

The macro expansion process in most assemblers operates as follows: First, we finish pushing whatever actual parameter table may be currently active, and then we parse the actual parameter list for the macro call, pushing each macro parameter onto the macro expansion stack as we do so, and building a new actual parameter table in the process.

As each line of text is read from the saved body of a macro, all formal parameters in that line are replaced from the actual parameter table. Typically, this replacement is done by the read routine prior to lexical analysis so that the line can be listed with all substitutions done properly. Finally, when the end of the macro is reached, the current macro parameters are popped off the macro expansion stack stack and macro parameter table in effect prior to the macro call is restored.

Thinking in object oriented terms, the lexical analysis package must call the macro package get-line routine instead of directly reading lines from the input file. The parser must be able to hand macro definitions to the macro package, and the parser must be able to hand macro calls to the macro package. The macro definition interface must allow the parser to pass the text of the macro, to the macro package, with formal parameters recognized and singled out for special storage, and the macro call interface to the macro package must allow the parser to pass the actual parameters to the macro package as it parses the parameter list of the call.

Extensions to Macro Processing

Before finishing with macros, some common extensions should be mentioned. Macros frequently need to generate labels on the fly, so that two different calls to the same macro can generate different labels. One way to generate labels is to concatenate strings, for example, a letter followed by the text of one of the formal parameters.

If we require that each formal parameter in the macro body be a separate simple identifier, successive formal parameters must be separated by at least one space; the result is that actual parameter values, after substitution, will remain separated by that space, thus making it impossible to parameters to manufacture new identifiers.

There are a number of solutions to this problem. The oldest is to require that all formal parameters be specially marked. For example, the IBM 360 assembler requires that each formal parameter name be prefixed by the symbol &; this makes it unnecessary to put spaces between successive formal parameters, thus making concatenation easy. The Unix shell uses a similar parameter substitution mechanism, with the rule that every formal parameter to a shell script is prefixed with $; actually, the Unix shell parameter notation is even worse, since the formal parameter names are positional, $1 being the leftmost formal parameter.

The MACRO-11 assembler allows single quotes to be used within the body of a macro to separate successive formal parameters or to separate formal parameters from adjacent identifiers or numbers; at the time a macro is expanded, each isolated single quote is deleted from the text. As a result, if the macro definition contains the string A'B, where A and B are formal parameters, and the corresponding actual parameters are a and b, expanding the macro will produce the string ab. This leaves unanswered the question: What if the user wants to include a single quote in the expanded macro? In MACRO-11, the answer is that all strings of consecutive single quotes are shortened by one as the macro is expanded. Thus, to include 'x' in the result of macro expansion, the macro definition would need to include ''x''.

The motivation for concatenation given above was the desire to be able to create new labels in a macro by concatenating macro parameters, but we cannot always rely on the caller of a macro to provide a unique identifier from which new identifiers may be constructed. What we would really like is an unlimited supply of new identifiers, so that, for example, we could use an integer counter to create the string of identifiers L1, L2, L3 etc. Consider the macro shown in Figure 6.15.

MACRO LAB
 LAB: B 0
      W LAB
ENDMAC
Figure 6.15. An example requiring automatic label generation.

The macro shown in Figure 6.15 stores a byte in the locaton labeled LAB followed by a pointer to that byte. At least, it does so if we only call this macro once. The second time we call the macro, we will redefine LAB, and most assemblers consider the redefinition of a label to be illegal!

There are a number of solutions to this problem, the least elegant being simply to require that the caller of the macro provide a unique label with each call as a parameter. The better solutions to the problem of labels within macro text all involve some way of automatically generating a new unique label with each call to the macro.

An interesting solution to this problem involves adding the notion of formal parameter types to the assembly language. We have already suggested this with the distinction between the formal parameters a and (b), where the actual parameter for a must be a single lexeme and the actual parameter bor (b) must be a list of lexemes enclosed in parentheses. Consider adding the parameter type =c. A macro formal parameter declared with this format can be required to be replaced by the textual representation of the value of the expression passed as an actual parameter. This is illustrated in Figure 6.16:

Source Text
MACRO CONCAT (FRONT),=NUMBER,(BACK)
  FRONT'NUMBER'BACK
ENDMAC

; first call
CONCAT (R), 5, (:)

; second call
CONCAT (B R), 4+1, ()
Macro Expansion
; first call
  R5:

; second call
  B R5
Figure 6.16: An example of numeric parameter concatenation

The macro shown in Figure 6.15 motivated this discussion because, if it is called multiple times, the label LAB ends up with multiple definitions. We can create a new version of this macro that avoids this problem by calling the concatenate macro from Figure 6.16 as is shown in Figure 6.17:

Source Text
LCNT = 1

MACRO LAB
  CONCAT (LAB),LCNT,(: B 0)
  CONCAT ( W LAB),LCNT,()
  LCNT = LCNT + 1
ENDMAC

; first call
LAB

; second call
LAB
Macro Expansion (ignoring lines setting LCNT)
; first call
  LAB1: B 0
   W LAB1

; second call
  LAB2: B 0
   W LAB2
Figure 6.17. Automatic label generation using parameters passed by value.

Another automatic label generation scheme has been used in the IBM 360-370 series assembly language: Here a special macro formal parameter is implicitly defined for each macro call, &SYSNDX; this is automatically replaced by the decimal representation of a different integer in each macro expansion. Thus, what has been done explicitly with the variable LCNT in Figure 6.17 is done automatically with &SYSNDX. MACRO-11 for the DEC PDP-11 series had a more general approach to automatic label generation: Macro formal parameter definitions preceded by the symbol ? were associated with automatically generated labels when no actual parameter was given.

A somewhat uncommon, although useful addition to macro assembly facilities is a general string manipulation mechanism. Consider, for example, the possibility of extending the parenthesized list parameter passing mode so that the actual parameter may be a substring of the list provided. Consider allowing the following notation in an actual parameter list: (p):a:b; this could indicate that characters a through b of the parenthesized list p are to be passed as the actual parameter. If this kind of facility is provided, other facilities are usually provided to allow the length of a string to be inspected, for example, by adding a LEN function to the expression analyzer which returns the length of its argument in characters.

Repetitive Assembly Directives

It should be noted that some macro assemblers also include a mechanism which allows repetition of a segment of code. This is essentially an assembly time loop. A possible use and syntax for this are illustrated in Figure 6.18.

Source Code
COUNT = 0
LOOP
  COUNT = COUNT + 1
  B COUNT
ENDLOOP COUNT = 5
Equivalent Code
  B 1
  B 2
  B 3
  B 4
  B 5
Figure 6.18. An example of repetitive assembly.

It should be clear that this kind of construct has important uses in initializing large complex tables because it allows the assembler to execute the algorithm used to generate the table instead of requiring the assembly language programmer to do the work. This reduces the chance for clerical errors as well as simplifying the life of the assembly language programmer.

In fact, if we have conditionals and macros in our assembler, this kind of iterative construct is not strictly necessary because a recursive macro can generate any data structure that an iterative construct could generate! This is true, of course, only if we have enough space on our macro stack to handle the number of recursive calls involved. For example, Figure 6.19 does the job recursively that was done iteratively in Figure 6.18.

MACRO LOOP =COUNT, =LIMIT
  B COUNT+1
  IF (COUNT + 1) < LIMIT
     LOOP (COUNT + 1), LIMIT
  ENDIF
ENDMAC

LOOP 0,5
Figure 6.19. A recursive macro equivalent to the loop in Figure 6.18.

Assembly time looping constructs may be implemented in either of two ways. The simplest implementation involves remembering (with a special stack entry) where the top of the loop was and reseting the input back to that point each time the end of the loop is reached. Since some stream input models don't allow arbitrary resets to the stream, such implementations frequently include the restriction that loops may only be specified inside macro bodies. The alternative is to push the text of the loop body onto the stack (with the aid of a prepass through the loop) and then repetitively read it from the stack until the loop terminates. This requires more memory, but allows loops in any context.

Terminology

The reader should be familiar with the following general terms after reading this section:

conditional assembly            macro actual parameters
pseudo-operations               macro formal parameters
pseudo-codes                    macro substitution
nested conditionals             macro expansion
macros                          repetitive assembly
macro calls                     preprocessors or prepasses

Additionally, the reader should be familiar with the following new assembly directives introduced into the example assembly language:

IF                              ENDMAC
ELSE                            LOOP
ENDIF                           ENDLOOP
MACRO

Exercises

Modify the grammar for expressions given in Figure 5.8 so that it allows function calls such as those suggested in Figure 6.2. Although it doesn't matter to the grammar, consider having Boolean values represented by integers so that there is no concept of the type of an expression, for example, True could be represented by 1 and False by 0, as in C.
The implementation of conditionals outlined in Figure 6.4 requires that repeated calls to line be made to skip the parts of the input which are not to be assembled, and that there be a state variable indicating whether calls to line are for the purpose of skipping text or assembling it. Rewrite this code so that, when IF or an ELSE is encountered which turns off assembly, the assembler immediately reads forward until it finds the line containing the corresponding ENDIF or ELSE. This will involve more code than the original, but it will run faster. Assume that the procedure lex.skipline can be called to terminate processing of the current line and setup lex.this and lex.next to be the first two lexemes on the next line.
Consider the problem of implementing conditional assembly the way it is done in the IBM 360-370 assembly language, where the directive AIF has two arguments, a (Boolean) expression and a label to skip to if the expression is true, and the directive AGO has one argument, a label to skip to unconditionally. Assuming that the labels referenced by AGO and AIF must occur later in the input text than the directives, that the function parse_operand parses one expression, and that the values 0 and 1 correspond to true and false, rewrite the code in Figure 6.4 to process this style of conditional assembly.
a) Using the representation for stored macros from Figure 6.12, show how the macros from Figure 6.6 would be stored together in the string pool. Show pointers from the macro names in a stylized opcode table to the start of the text of each macro; don't bother putting the names themselves in the string pool. Document any assumptions you have to make.
b) As with part a, but encode the symbols from Figure 6.7, suggesting an appropriate extention to the data structures of Figur 6.12 for dealing with the types of each formal parameter.
b) Repeat part a, but assume that the strings in the opcode table are themselves stored in the string pool.
Assuming that the macro expansion stack is organized as suggested in Figure 6.13, show the contents of this stack immediately after the first recursive call to TWICE resulting from the expansion of the code in Figure 6.9.
Write a macro which, when called, generates a table of the squares of numbers from 0 to 99 in memory. Assume that the assembler allows multiply operators in expressions!
a) Do this using the loop construct illustrated in Figure 6.18.
b) Do this using only conditional and macro directives, as in Figure 6.19.
Write a macro which uses the substring facilities mentioned in the section on extensions to macro assemblers to extract successive decimal digits from a string of decimal digits passed as a parameter. The following example illustrates the effect of a call to this macro:
The Macro Call
```
DIGITS 32768
```
The Effect
```
  B 3
  B 2
  B 7
  B 6
  B 8
```
You may use looping facilities such as are illustrated in Figure 6.18 or recursion as illustrated in Figure 6.19, and you should consider using boolean predicates such as those listed in Figure 6.2. You may need multiple macros to solve this problem.
Consider the problem of using IBM 360-370 style conditional assembly directives to implement loops. To the programmer, this is simply a backward GOTO in the text, and it is only allowed within the body of macros. How would you implement this in the assembler?
What problem would result if parameters were passed by normal textual substitution instead of by value in Figure 6.16? Consider passing an initial count of 40 when the maximum allowed line length is 80 characters.

References

Most of the example assembler, including the syntax given in this chapter for conditional and macro assembly, is based on the SMAL assembler which is fully described in

Machine Independent SMAL: A Symbolic Macro Assembly Language, by D. W. Jones, University of Iowa Department of Computer Science technical report 84-09 (July, 1984).

Chapter 10 of the text

PDP-11, Structured Assembly Language Programming by Sebsta (Benjamin Cummings, 1985)

contains a nice introduction to the macro facilities of MACRO-11, the assembler for the PDP-11 family of computers. This is one of the better designed macro and conditional facilities which can be found in commercially available assembly languages.

Chapter 4 of the text

System Software, an Introduction to Systems Programming by Beck (Addison Wesley, 1985)

contains a decent survey of alternative approaches to macro processing, as well as a short discussion of macro implementation. Chapter 4 of the text

Systems Programming by Donovan (McGraw Hill, 1972)

gives details of the implementation of an IBM 360 style macro preprocessor. Section 4.3.2 discusses the possibility of macros which, when expanded, cause other macros to be defined.

The macro preprocessor for C is described in

The C Programming Language, by Kernighan and Ritchie (Prentice Hall, 1978 or 1988).

Many C textbooks give this subject only minimal coverage, even though the macro preprocessor is quite powerful; it even allows the definition of macros which build new control structures.