## Modulus without Division, a tutorial
Part of
the Arithmetic Tutorial Collection
Copyright © 2001, Douglas. W. Jones. This work may be transmitted or stored in electronic form on any computer attached to the Internet or World Wide Web so long as this notice is included in the copy. Individuals may make single copies for their own use. All other rights are reserved. |

- Introduction
- Background
- The Trivial Case: Mod 2, Mod 4,
Mod 2
^{i} - Mersenne Numbers: Mod 3, Mod 7,
Mod (2
^{i}– 1) - An example: Mod 5 by way of mod 15
- An example: Mod 6, a composite modulus
- An example: Mod 7, a Mersenne number

The ususal way to compute *a* mod *m* is to take the remainder
after integer division. This is straightforward when the operands are
within the range of the available divide hardware, but the divide operation
is frequently among the very slowest arithmetic operations available,
some small microcontrollers have no divide hardware, and it is occasionally
necessary to divide very large numbers outside the range that can be done
using the available hardware.

When the modulus *m* is constant, even where there is a hardware
divide instruction, it can be faster to take the modulus directly than
to use the divide instruction. These tricks become even more valuable
on machines without a hardware divide instruction or where the numbers
involved are out of range.

There are common tricks to check divisibility that many students learn in elementary school:

- To test for divisibility by 10
- see if the least significant decimal digit is zero.

- To test for divisibility by 2
- see if the least significant decimal digit is even.

- To test for divisibility by 5,
- see if the least significant decimal digit is 0 or 5.

- To test for divisibility by 3,
- if the sum of the decimal digits is divisible by 3, the original number is also divisible by 3.

- To test for divisibility by 9,
- if the sum of the decimal digits is divisible by 9, the original number is also divisible by 9.

All of the above rules are actually special cases of a single general rule. Given that

ais represented in number baseb

amodm= ( (bmodm)(a/b) + (amodb) ) modm

In the case of divisibility by 2, 5 and 10 for base 10,
the term (*b* mod *m*) is zero because 2, 5 and 10 all
divide evenly into 10. As a result, the divisibility test
simplifies to asking whether (*a* mod *b*), that is, the
least significant digit of the number, is evenly divisible.

In the case of divisibility by 3 or 9 in base 10,
the term (*b* mod *m*) is one.
As a result, the multiplier for the first term is one. Applying the
formula recursively leads to the simple sum of the digits.

Computing modulus for poweres of two is trivial on a binary computer,
the term (*b* mod *m*) is zero, so we just take the
modulus by examining the least significant bits of the binary
representation:

amod 2^{i}=a& (2^{i}–1)

Thus, for *a* mod 2, we use *a* & 1,
for *a* mod 4, we use *a* & 3, and
for *a* mod 8, we use *a* & 7.

Recall that the & operator means logical *and*. When applied to
integers, this computes each bit of the result as the *and* of the
corresponding bits of the operands.
For all nonzero positive integers i,
the binary representation of 2^{i}–1
consists of i consecutive one bits, so *and*ing with
2^{i}–1 preserves the least significant i bits of the
operand while forcing all more significant bits to zero.

The problem is more interesting when the modulus is not a power of two.

Consider the problem of computing
*a* mod 3 on a binary computer.
Note that 4 mod 3 is 1, so:

amod 3 = ( (a/4) + (amod 4) ) mod 3

That is,
*a* mod 3 can be computed from the sum of the digits of the
number in base 4. Base 4 is convenient because each base 4 digit of the
number consists of 2 bits of the binary represenation; thus
*a* mod 4 can computed using *a* & 3 and
*a* / 4 can be computed using *a* >> 2.

The number 3 is a Mersenne number, that is, one less than a power of two.
The property noted above is true of all Mersenne numbers.
Thus, we can compute
*a* mod 7 or
*a* mod 15 on a binary computer using

amod 7 = ( (a/8) + (amod 8) ) mod 7amod 15 = ( (a/16) + (amod 16) ) mod 15

Recall that *a* >> *b* shifts the binary
representation of *a* left a total of *b* places. As with logical
and, this is a very inexpensive operation on a binary computer, and the effect
is the same as dividing *a* by 2^{b}.

Back to the problem of computing *a* mod 3:
Summing the base-4 digits of a number may produce results significantly larger
than 3, but we can deal with this by summing the digits of the sum as many
times as required to bring the result down close to 4.
Expressing this as an algorithm using C, we get this code:

unsigned int mod3( unsigned int a ) { while (a > 5) { int s = 0; /* accumulator for the sum of the digits */ while (a != 0) { s = s + (a & 3); a = a >> 2; } a = s; } /* note, at this point: a < 6 */ if (a > 2) a = a - 3; return a; }

The terminating condition in the above code represents a small optimization. Instead of repeatedly summing the digits until the total is a single base 4 digit in the range 0 to 3, this code stops as soon as the sum is under 6. This is because the final mod operation is done by comparing with 3 and subtracting if out of range; this compare and subtract operation can deal with any value up to twice the modulus.

On a 3 Ghz dual-core Intel Core i3 machine, this first version
averages 54 nanoseconds slower than a subroutine with the minimalist
body "`return a%3`" — relying on the built-in mod operation.
Of course, this code completely ignores the parallelism available
on a computer with registers much wider than the 2 bits in a base-4 digit.
There is a well known trick for fast implementation of the bitcount operator
that we can easily apply to this problem. In the slow version, bitcount is
computed by simply summing the base-2 digits of the number. Here is the
classic fast version, written for a 32-bit computer:

uint32_t bitcount( uint32_t a ) { a = ((a >> 1) & 0x55555555) + (a & 0x55555555) /* each 2-bit chunk sums 2 bits */ a = ((a >> 2) & 0x33333333) + (a & 0x33333333) /* each 4-bit chunk sums 4 bits */ a = ((a >> 4) & 0x0F0F0F0F) + (a & 0x0F0F0F0F) /* each 8-bit chunk sums 8 bits */ a = ((a >> 8) & 0x00FF00FF) + (a & 0x00FF00FF) /* each 16-bit chunk sums 16 bits */ return ((a >> 16) ) + (a & 0x0000FFFF) }

This code is shockingly un-intutitive but very fast, particularly when inside an inner loop where the constants can be pre-loaded in registers and remain unchanged for the duration. Note, to add to the obscurity, that we can rewrite this code with slightly different constants and it will still remain correct:

uint32_t bitcount( uint32_t a ) { a = ((a >> 1) & 0x55555555) + (a & 0x55555555) a = ((a >> 2) & 0x33333333) + (a & 0x33333333) a = ((a >> 4) & 0x07070707) + (a & 0x07070707) a = ((a >> 8) & 0x000F000F) + (a & 0x000F000F) return ((a >> 16) ) + (a & 0x0000001F) }

The constants used in the above version reflect the maximum values in each
field of the result at that stage of the computation.
A sum of the one bits in a 4-bit field will be no greater than 4,
which can be represented in 3 bits and so is selected by the mask value 7.
A sum of the one bits in an 8-bit field will be no greater than 8,
which can be represented in 4 bits and so is selected by the mask value
F_{16}.
A sum of the one bits in a 16-bit field will be no greater than 16,
which can be represented in 5 bits and so is selected by the mask value
1F_{16}.

This trick can be applied directly to the problem of computing the sum of the base 4 digits in a number. In our original algorithm with two nested loops, we can replace the inner loop to arrive at the following much faster code:

uint32_t mod3( uint32_t a ) { while (a > 5) { a = ((a >> 2) & 0x33333333) + (a & 0x33333333); a = ((a >> 4) & 0x07070707) + (a & 0x07070707); a = ((a >> 8) & 0x000F000F) + (a & 0x000F000F); a = ((a >> 16) ) + (a & 0x0000001F); } /* note, at this point: a < 6 */ if (a > 2) a = a - 3; return a; }

The above version uses the optimized constants, although the range of values
in each field is now a bit larger:
A sum of two base-4 digits will be no greater than 6,
which can still be represented in 3 bits and so is selected by the mask
value 7.
A sum of four base-4 digits will be no greater than 12,
which can still be represented in 4 bits and so is selected by the mask
value F_{16}.
A sum of eight base-4 digits will be no greater than 24,
which can still be represented in 5 bits and so is selected by the mask
value 1F_{16}.

This code is far faster than the first algorithm we gave. On a 3 Ghz dual-core Intel Core i3 machine, this improved code averages 21 nanoseconds slower than a subroutine using the built-in mod operation, less than half the overhead of the original code presented above.

The exercise we went through to identify the maximum values in each field
is not pointless, but it is hard work. A careful worst-case analysis shows
that the outer loop will iterate at most 3 times.
The input FFFFFFEE_{16}, for example, leads to 3 iterations.
Worst-case analysis also reveals the range of values that are input to
each step, allowing th subsequent iteration to be simplified.
Unrolling the loop completely leads us to the following code:

uint32_t mod3( uint32_t a ) { a = ((a >> 2) & 0x33333333) + (a & 0x33333333); a = ((a >> 4) & 0x07070707) + (a & 0x07070707); a = ((a >> 8) & 0x000F000F) + (a & 0x000F000F); a = ((a >> 16) ) + (a & 0x0000001F); /* note, at this point: a <= 48, 3 base-4 digits */ a = ((a >> 2) & 0x33333333) + (a & 0x33333333); a = ((a >> 4) ) + (a & 0x07070707); /* note, at this point: a <= 8, 2 base-4 digits */ a = ((a >> 2) ) + (a & 0x33333333); /* note, at this point: a <= 4, 2 base-4 digits */ if (a > 2) a = a - 3; return a; }

The payoff for this optimization is significant. On a 3 Ghz dual-core Intel Core i3 machine, this improved code averages 10.4 nanoseconds slower than a subroutine using the built-in mod operation. This is under 1/5 the overhead of our original code and half the overhead of our first attempt at optimizing.

The eccentric constants used in the above code are annoying, consuming registers and taking time to load. Some machines have a truncate instruction that is equivalent to the following C code:

r = r & ((1 << b) - 1);

That is, the truncate instruction preserves the *b* least significant
bits of the register *r* while setting the more significant bits of
*r* to zero. Recall that the << operator means left-shift,
so the expression 1 << *b*
is equivalent to 2^{b} and therefore
(1 << *b*)–1 has just *b* ones.
A good C compiler should use this instruction whenever to implement the
*and* operation whenever the binary representation of the operand is
a string of *b* one bits, such as the constants 3, 7, 15 or 31.
Even when such an instruction is not available, a sequence of two shift
operations can be used to do the same thing.
On a 32-bit machine, for example, the following sequence of two shifts
will frequently truncate the value in a register in less time than it takes
to first load a long constant.

r = r << (32 - b) r = r >> (32 - b)

To take advantage of the possibility of fast truncation, we must rewrite
our code so that the constants used to select partial words have binary
representations that are all ones.
We can do this using the fact that not only is
*a* mod 3 easy to compute by summing base 4 digits,
but also by summing the digits in base 4^{i}
for any positive integer i.
So, while we can sum the digits in
base 2^{2} (that is, 4^{1}),
we can also sum them in bases
2^{4} (that is, 4^{2}),
2^{6} (that is, 4^{3}),
2^{8} (that is, 4^{4}), and so on.
This leads to the following solution:

uint32_t mod3( uint32_t a ) { a = (a >> 16) + (a & 0xFFFF); /* sum base 2**16 digits a <= 0x1FFFE */ a = (a >> 8) + (a & 0xFF); /* sum base 2**8 digits a <= 0x2FD */ a = (a >> 4) + (a & 0xF); /* sum base 2**4 digits a <= 0x3C; worst case 0x3B */ a = (a >> 2) + (a & 0x3); /* sum base 2**2 digits a <= 0x1D; worst case 0x1B */ a = (a >> 2) + (a & 0x3); /* sum base 2**2 digits a <= 0x9; worst case 0x7 */ a = (a >> 2) + (a & 0x3); /* sum base 2**2 digits a <= 0x4 */ if (a > 2) a = a - 3; return a; }

Again, working out the worst cases in order to determine whether we were within reach of the final subtract operation was time consuming. The payoff is significant, but we are approaching the limit of what we can accomplish with this kind of trickery. On a 3 Ghz Intel dual-core Intel Core i3 machine, this improved code averages 6.8 nanoseconds slower than a subroutine using the built-in mod operation. This is about 1/9 the overhead of our original code.

It is fair to ask, have we been two conservative? The above code includes three repeats of this statement:

Could we have eliminated one of them? The answer is, it depends on the actual range of the argument. All of the code given above has been tested for all 32-bit integer arguments. If one of the three identical shift and add statements is omitted, the code works for all integers from zero up to 1,359,020,030.a = (a >> 2) + (a & 0x3);

In Dec. 2014, Tim Buktu [tbuktu@hotmail.com] pointed out that the code
given here can be used to carry out SIMD computations quite efficiently.
Consider, for example, taking an array of 4 8-bit integers packed into a
32-bit word. The following function will apply the *mod3* operator
to all 4 integers without ever unpacking them:

uint32_t SIMDmod3( uint32_t a ) { a = ((a >> 4) & 0x0F0F0F0F) + (a & 0x0F0F0F0F); /* sum base 2**4 digits; a <= 0x1E1E1E1E */ a = ((a >> 2) & 0x0F0F0F0F) + (a & 0x03030303); /* sum base 2**2 digits; a <= 0x09090909 */ a = ((a >> 2) & 0x03030303) + (a & 0x03030303); /* sum base 2**2 digits; a <= 0x04040404 */ a = ((a >> 2) & 0x03030303) + (a & 0x03030303); /* sum base 2**2 digits; a <= 0x03030303 */ /* again, we can't use if(a == 3) a = 0 on each component, so we cheat */ a = ((((a + 1) >> 2) & 0x03030303) + a) & 0x03030303; return a; }

In the SIMD context, there is no obvious way to do an if statement to subtract of 3 from every byte greater than 3, so the final step adds one to each byte and masks to select the carry bit. Where adding 1 produced a carry of 1, the byte was 3 and needs to be zeroed. We zero it by adding the carry bit back into each byte and then masking to delete the new carry.

This code (or similar code on larger words) can compete with code that extracts each byte for division by three. The context in which this idea was originally pointed out to me involved use of the streaming SIMD instructions on the Intel x86, but the idea obviously has broader applications.

The solution developed above for
*a* mod 3 is actually one of the most computationally
difficult of the Mersenne numbers. Deleting successive lines (with small
modifications) produces solutions for successively larger Mersenne numbers:

uint32_t mod15( uint32_t a ) { a = (a >> 16) + (a & 0xFFFF); /* sum base 2**16 digits */ a = (a >> 8) + (a & 0xFF); /* sum base 2**8 digits */ a = (a >> 4) + (a & 0xF); /* sum base 2**4 digits */ if (a < 15) return a; if (a < (2 * 15)) return a - 15; return a - (2 * 15); }uint32_t mod255( uint32_t a ) { a = (a >> 16) + (a & 0xFFFF); /* sum base 2**16 digits */ a = (a >> 8) + (a & 0xFF); /* sum base 2**8 digits */ if (a < 255) return a; if (a < (2 * 255)) return a - 255; return a - (2 * 255); }uint32_t mod65535( uint32_t a ) { a = (a >> 16) + (a & 0xFFFF); /* sum base 2**16 digits */ if (a < 65535) return a; if (a < (2 * 65535)) return a - 65535; return a - (2 * 65535); }

The code for computing *a* mod 65535
above is slightly faster, on average than the hardware mod operator
on a 3.06 Ghz Intel Core i3 processor.

Computing
*a* mod 5 on a binary computer is harder. The smallest
binary radix larger than 5 is 8, so we could base a solution on the
following:

amod 5 = ( (8 mod 5)(a/8) + (amod 8) ) mod 5 = ( 3(a/8) + (amod 8) ) mod 5

Reducing this to C code, we get the following:

unsigned int mod5( unsigned int a ) { while (a > 9) { int s = 0; /* accumulator for the sum of the digits */ while (a != 0) { s = s + (a & 7); a = (a >> 3) * 3; } a = s; } /* note, at this point: a < 10 */ if (a > 4) a = a - 5; return a; }

On a 3 Ghz Intel dual-core Intel Core i3 machine, this code
takes 115 nanoseconds more than a subroutine with the minimalist
body "`return a%5`" — relying on the built-in mod operation.
This is worse than half the speed of our first iterative code for mod 3.
The explanation for this poor performance lies in the multiply by 3 inside
the inner loop. This not only takes a small amount of time, but it
forces significantly more iterations in the loop: Where the mod-3 code
divided by 4 with each iteration, this code divides by 8/3.

The repeated multiplications also eliminate the possibility of tricks for summing the digits in parallel, but there is a way around this! While 5 does not divide evenly into 7, it does divide evenly into one less than the next power of two, 15. This lets us solve the problem as follows:

amod 5 = (amod 15) mod 5

We can trivially turn the code for
*a* mod 15 given above into code to compute
*a* mod 5 by adding just a little logic at the end:

uint32_t mod5( uint32_t a ) { a = (a >> 16) + (a & 0xFFFF); /* sum base 2**16 digits */ a = (a >> 8) + (a & 0xFF); /* sum base 2**8 digits */ a = (a >> 4) + (a & 0xF); /* sum base 2**4 digits */ if (a > 14) a = a - 15; /* now, we have mod 15 */ if (a > 4) a = a - 5; if (a > 4) a = a - 5; return a; }

The precise details of the end matter slightly. Depending on how the processor is pipelined and how the compiler is optimized, it may be faster to use one of the following alternative endings:

if (a > 9) a = a - 10; else if (a > 4) a = a - 5; return a;

or

if (a > 9) return a - 10; if (a > 4) return a - 4; return a;

Trial and error is valuable here, but the difference between these code fragments is small.

There are many paths to computing *a* mod 6. We can, of course, take a
brute-force approach, but as was the case with
*a* mod 5, the results are not very promising.
We could base this base solution on the following:

amod 6 = ( (8 mod 6)(a/8) + (amod 8) ) mod 6 = ( 2(a/8) + (amod 8) ) mod 6

Before reducing this to C code, note that 2(*a*/8) is not the same as
*a*/4 because we are using integer division.
On a binary computer, this can be correctly expressed in two ways,
either
(*a* >> 3)<< 1 or
(*a* >> 2)& –2. Recall that, in 2's
complement binary, –2 is ...111110, so *and*ing with this
constant forces the least significant bit of the result to zero while
preserving the remaining bits.

unsigned int mod6( unsigned int a ) { while (a > 11) { int s = 0; /* accumulator for the sum of the digits */ while (a != 0) { s = s + (a & 7); a = (a >> 2) & -2; } a = s; } /* note, at this point: a < 12 */ if (a > 5) a = a - 6; return a; }

**==== WORK IN PROGRESS ====**

A completely different approach to computing
*a* mod 6 involves noting that 6 can be factored as
2 × 3, so we can compute
*a* mod 6 by first computing
*a* mod 2 and
*a* mod 3.
To see how this is done, let us work through a few examples:

a0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 amod 60 1 2 3 4 5 0 1 2 3 4 5 0 1 2 3 4 5 amod 30 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 amod 20 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1

There is a pattern here that repeats every 6 integers;
*a* mod 6 is a simple function of
*a* mod 2 and *a* mod 3. This function
can be represented by the following table:

amod 6amod 30 1 2 amod 20 0 4 2 1 3 1 5

**==== WORK IN PROGRESS ====**

Mersenne numbers are numbers of the form 2^{i}–1.
We have already worked two Mersenne numbers,
*a* mod 3 and
*a* mod 15 (in the context of
*a* mod 5).
Mersenne numbers are
sufficiently important that it is worth working another example. Here, we
will work on *a* mod 7,
going directly to the fastest solution. In the case of
*a* mod 3 and
*a* mod 15, the shift counts were all multiples of two,
whereas
*a* mod 7 will force us to shift counts that are multiples
of 3:

uint32_t mod7( uint32_t a ) { a = (a >> 24) + (a & 0xFFFFFF); /* sum base 2**24 digits a <= 0x10001FE worst case 0xFFFFFE*/ a = (a >> 12) + (a & 0xFFF); /* sum base 2**12 digits a <= 0x1FFD */ a = (a >> 6) + (a & 0x3F); /* sum base 2**6 digits a <= 0xBC; worst case ? */ a = (a >> 3) + (a & 0x7); /* sum base 2**2 digits a <= 0x1B; worst case ? */ a = (a >> 2) + (a & 0x7); /* sum base 2**2 digits a <= 0x6; worst case ? */ if (a > 5) a = a - 6; return a; }

**==== WORK IN PROGRESS ====**