Lecture 42, Immediate Constants

Part of the notes for (CS:4908:0004) 22C:196:004
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

The Problem

The ARM instruction format has some odd constraints. One of them is on the form of immediate operands. In instructions such as compare, only the second operand may be a constant, and this constant is formed using the shifter operand specified by bits 0 to 11 of the instruciton. This field is divided as follows:

bits 0 to 7 -- immed_8: An 8 bit operand.
bits 8 to 11 -- rotate_imm: A 4-bit shift count. This is doubled, and then the 8-bit operand is rotated that many bits to form the actual 32-bit immediate constant.

As a result, the only immediate constants that can be directly represented on the ARM are those that fit one of the following patterns:

000000FF
000003FC
00000FF0
00003FC0
0000FF00
0003FC00
000FF000
003FC000
00FF0000
03FC0000
0FF00000
3FC00000
FF000000
FC000003
F000000F
C000003F

If you write instructions like add r3,r2,#c or mov r3,#c, the assembler will emit useful code if the constant c can be anded with one of the above without changing its value; otherwise, the assembler will complain.

The mvn instruction takes the one's complement of the shifter operand, so if you want to load negative constants, you can use mvn. If you really want to do mov r3,#-5, you must instead use mvn r3,4 because the one's complement of 4 is also the two's complement of 5.

Similarly, for comparison, cmn compares with the two's complement of its second argument. So, if you wanted to do cmp r3,-5, you must instead do cmn r3,5.

Assembler Tricks

The official ARM assembler knows about the relationship between MOV and MVN. If you write MOV R3,#&FFFFFFFF, for example, it will generate MVN R3,#0. (The official ARM assembler uses upper case where the Gnu assembler uses lower case, and it uses the & symbol to indicate hexadecimal.)

This trick doesn't suffice. Suppose you wanted to do this: mov r3,#0x12345670. The assemlber does not permit it. The official ARM assemlber has a special "pseudo operation" that you can use: LDR R3,=&0x12345670. When ARM the assemlber sees this, it first asks if the literal can be loaded as an immediate constant using a mov or mvn instruction. In that case, it generates that instruction. If not, it generates a PC-relative reference to the literal pool, and places the constant there.

The official ARM assembler accumulates literals until it encounters an LTORG directive or until the end of the program, and then it emits all of the accumulated literals. The LTORG directive is only needed if the program is so large that PC-relative addressing cannot reach the default literal pool at the end of the program. In that case, the user must include one or more explicit LTORG directives in mid program to output accumulated literals. Obviously, the LTORG directive should be given after an unconditional branch or subroutine return in order to ensure that the literals are not accidently executed, and an LTORG directive should be given within the range.

Note that third-party assemblers such as the GNU assembler cannot be relied on to include these support tricks. Experiment. If they are supported and work, use them, otherwise, you must write your own "expert" to load constants. If your code generator has a stack machine as its interface to the code generator, the obvious place to put the "load constant" expert is in the pushi routine for pushing an immediate constant on the stack.

The expert algorithm

How do you determine if the constant c conforms to this pattern?

00003FC0
0000FF00
0003FC00
000FF000

The answer is that any nonzero 32-bit constant c other than zero can be described as:

0*p x 0*q

That is, p leading zeros, a string of bits x, and q trailing zeros. We are interested in the maximal strings of leading and trailing one bits, so the string of bits x begins and ends with ones.

The number can be compactly encoded as an immediate constant in an ARM instruction if p+q>24. If p is even, it can be encoded if p+q=24. So, how do we compute p and q for the constant c. This rests on a set of classical tricks for computing properties of binary numbers.

p = bitcount( trailers( c ))
q = bitcount( leaders( c ))

The leader and trailer operators count the number of leading and trailing zeros in the binary representation of a number. These can be computed surprisingly efficiently on a binary computer, although we need two supporting operators to do this, one counts the number of one bits in a word, and the other reverses the order of the bits in a word.

Counting trailing zeros

To count the trailing zeros in a word, use:

#define trailers(x) bitcount( ~(x | -x) )

This is unlikely to be intuitive, so let's work a 16-bit example:

0010110001010000 = x
1101001110101111 = ~x
1101001110110000 = -x = ~x + 1
1111111111110000 = x | -x
0000000000001111 = ~(x | -x)

Actually, there are several equivalent ways to compute this. For example, this would also work:

#define trailers(x) bitcount( (~x ^ -x) >> 1 )

Counting leading zeros

To count the leading zeros in a word, simply reverse the order of the bits and count the trailing zeros:

#define leaders(x) = trailers( reverse( x ))

Counting Bits

The bitcount operator could be implemented by a simple loop, shifting one bit off per iteration until no ones are left. There is, however, a well-known trick implementation of the bitcount operator that is very fast:

unsigned int bitcount( unsigned int x ){
	x = (x & 0x55555555) + ((x & 0xAAAAAAAA) >>  1);
	x = (x & 0x33333333) + ((x & 0xCCCCCCCC) >>  2);
	x = (x & 0x0F0F0F0F) + ((x & 0xF0F0F0F0) >>  4);
	x = (x & 0x00FF00FF) + ((x & 0xFF00FF00) >>  8);
	x = (x & 0x0000FFFF) + ((x & 0xFFFF0000) >> 16);
        return x;
}

The first assignment replaces each pair of bits with the sum of those two bits. The second replaces each 4-bit segment with the sum of the two 2-bit integers in that segment. The third replaces each 8-bit segment with the sum of the two 4-bit integers in that segment, and so on.

Note that the final assignment above can be rewritten as follows:

	x = (x & 0x0000FFFF) + ((x             ) >> 16);

That is, there is no need for an and operator to clip off the low bits of a word that are about to be discarded by the shift operator. In addition, the first and operator on this line can be done by a truncate operation on machines that have such a special instruction for this purpose. This may avoid the need to reserve a register for the constant FFFF₁₆.

For machines with shorter words, it is sometimes faster to use a table-lookup, for example, with a 256 entry table giving the bitcounts for the integers from 0 to 255. Given this, you can sum the bitcounts of the bytes of a word. Ultimately, for each computer architecture, there are generally a few different bitcount implementations worth testing to see which is fastest.

Reversing the bit order

The reverse operator could also be implemented by a simple loop, shifting the input value left while shifting the output value right. There is, however, a trick implementation of the reverse operator that is closely related to that given for the bitcount operator above:

unsigned int reverse( unsigned int x ){
	x = ((x & 0x55555555) <<  1) + ((x & 0xAAAAAAAA) >>  1);
	x = ((x & 0x33333333) <<  2) + ((x & 0xCCCCCCCC) >>  2);
	x = ((x & 0x0F0F0F0F) <<  4) + ((x & 0xF0F0F0F0) >>  4);
	x = ((x & 0x00FF00FF) <<  8) + ((x & 0xFF00FF00) >>  8);
	x = ((x             ) << 16) + ((x             ) >> 16);
        return x;
}

The first line exchanges each pair of bits. The second line exchanges each consecutive pair of bit pairs. The third line exchanges each consecutive pair of 4-bit fields, and so on. The final line has been optimized to eliminate the need for extra constants, as was done for the bitcount operator.

Note that, if several bitcount and reverse operations are needed, there is value to expanding them as inline macros so that an optimizing compiler can load the obscure constants into registers just once and then use them over and over.

As with the bitcount operator, on machines with a shorter word, a table lookup can be competitive, for example, using a 256 entry table to reverse the order of bits in each byte. Trial and error is generally the best way to figure out which algorithm will win in any particular context.