Fast Ternary Multiplication and Division
Part of
http://www.cs.uiowa.edu/~jones/ternary/
Work started Apr. 20, 2015. Disclaimer: This work is in its earliest stages. |
Users of binary computers have long known that multiplication and division are among the slowest of the elementary arithmetic operations, frequently requiring 5 or 10 times as long to complete as other instructions, and where other instructions can be executed in a pipelined or superscalar manner, it is common to find that multiply and divide cannot. As a result, programmers use tricks such as shifting and adding to perform multiply and divide whenever possible. Here, we explore similar tricks that are applicable in the Ternary world.
Here, we will focus on the use of shift, add and subtract operations to perform multiplication. Furthermore, we will assume that, as in many modern RISC processors, there is an add instruction that combines the shift and add operations, following the model developed for for the HP PA-RISC family of processors [Magenheimer et al, 1987]
Here, we will not assume a particular machine architecture, but we will write expressions such as:
In this context, we assume that this operation can be computed with a single machine instruction, and we assume a ternary data representation, so that shift operations (denoted <<_{3}) apply to the ternary represntation, not the binary representation as used on conventional binary computers. Thus,
Note that zeros are shifted in on either the left or right when a ternary value is shifted. For shifts of balanced ternary numbers, zero has the same representation as the unsigned one. Thus, it is important to be aware of the type of the operand when shifting.
The code we give here is in a C-like notation. Where hexadecimal constants would be used in binary code, with the prefix 0x, we use heptavintimal, base 27, with the prefix 0v. Naturally, unspecified digits are assumed to be set to zero. For unsigned constants, this is (unsurprisingly) 0v0. For balanced constants, the situation may not be as obvious, because the zero value is 0vD, corresponding to the balanced value 000_{3}, which has the same representation as the unsigned value 111_{3}.
Masking of unsigned ternary values is done with the min operator, denoted &. For balanced ternary numbers, masking is done with the ternary exclusive-or operator, denoted ^ in the C-like code here, but conventionally denoted ⊕. The use of this operator for masking may need explanation: In balanced ternary, a trit may be either –1, 0 or +1. For all x, inspection of the truth table for the ternary exclusive-or operator shows that: 0⊕x=0, and –1⊕x=x. This leads naturally to using the exclusive-or operator to set selected trits of a balanced ternary number to zero while preserving others.
We make one significant notational abbreviation in order to simplify dealing with double-register operations, using expressions such as this:
This means that the variables a and b are first concatenated into a single long value which is then operated on, in this case, using a ternary shift operation, before the resulting value is disassembled into two component values. This is significantly more compact than the following equivalent expression:
In the above, t is the number of trits per word. Expressing double-word addition is even more difficult without using this abbreviation.
This begins rather trivially, since we will eventually need to multiply by a variety of constants:
a × 0 = 0
a × 1 = a
There are multiple solutions for multiplation by 2 using our framework; the alternative that involves subtraction makes overflow detection more difficult:
a × 2 = (a <<_{3} 0) + a
a × 2 = (a <<_{3} 1) – a
a × 3 = (a <<_{3} 1)
a × 4 = (a <<_{3} 1) + a
Multiplication by 5 requires two shift-add instructions, as do a number of higher constant multipliers. In most cases, we operate either by factoring the multiplier or by summing the results of simpler multiplication problems that have already been solved.
t = a × 2
a × 5 = (a <<_{3} 1) + t
a × 6 = (a × 2) × 3
t = a × 2
a × 7 = (t <<_{3} 1) + a
Multiplication by 8 offers some choices; one is faster but because it uses subtraction, it makes it harder to accurately detect overflow. Many many higher multipliers pose the same choice.
a × 8 = (a <<_{3} 2) – a
a × 8 = (a × 2) × 4
a × 9 = (a <<_{3} 2)
a × 10 = (a <<_{3} 2) + a
t = a × 2
a × 11 = (a <<_{3} 2) + t
a × 12 = (a × 3) × 4
t = a × 4
a × 13 = (a <<_{3} 2) + t
t = a × 5
a × 14 = (a <<_{3} 2) + t
a × 15 = (a × 5) × 3
a × 16 = (a × 4) × 4
t = a × 8
a × 17 = (a <<_{3} 2) + t
a × 18 = (a × 2) × 3
t = a × 2
a × 19 = (t <<_{3} 2) + a
a × 20 = (a × 10) × 2
a × 21 = (a × 7) × 3
a × 22 = (a × 11) × 2
t = a × 2
u = (a <<_{3} 1) + t
a × 23 = (t <<_{3} 2) + u
a × 24 = (t × 6) × 4
a × 26 = (t × 2) × 13
For fast software implementations of ternary multiplication algorithms, we will use base 27. In unsigned ternary, this requires having optimal solutions for all multipliers from 0 to 26; in balanced ternary, we will need optimal solutions for all multipliers from -13 to 13.
To multiply two variables in base 27, we use essentially the same algorithm we learned in elementary school, but with a change of base:
unsigned int times( unsigned int a, unsigned int b ) { unsigned int prod = 0; while (a != 0) { unsigned int p; /* the partial product */ switch (a & 0vZ) { case 0: p = 0 ; break; case 1: p = b ; break; case 2: p = (b <<_{3} 0) + b; break; case 3: p = (b <<_{3} 1) ; break; case 4: p = (b <<_{3} 1) + b; break; ... } prod = (prod <<_{3} 3) + p; a = a >>_{3} 3; } return prod; }
The above algorithm works for unsigned ternary, but a similar algorithm can easily be composed for signed ternary. The key to the efficiency of these algorithms is the availability of a fast implementation of case selection in the instruction set.
Moving to balanced ternary, the algorithm remains essentially the same, although we move the accumulation of the product into the cases of the switch block in order to add some partial products but subtract others. It may be useful to review the Notation and Assumptions section when reading this code, since the use of the exclusive-or operator for masking may not be obvious:
balanced int times( balanced int a, balanced int b ) { unsigned int prod = 0; while (a != 0) { unsigned int p; /* the partial product */ switch (a ^ 0v0) { case -13: p = (b <<_{3} 1) + b; p = (b <<_{3} 2) + p; prod = (prod <<_{3} 3) - p; break; case -12: p = (b <<_{3} 1) + b; p = (p <<_{3} 1); prod = (prod <<_{3} 3) - p; break; ... case 12: p = (b <<_{3} 1) + b; p = (p <<_{3} 1); prod = (prod <<_{3} 3) + p; break; case 13: p = (b <<_{3} 1) + b; p = (b <<_{3} 2) + p; prod = (prod <<_{3} 3) + p; break; } a = a >>_{3} 3; } return prod; }
Of course, as with the binary multiplicaiton algorithm described by [Magenheimer et al, 1987], we should not leave the quality of the machine code for this to the whims of a compiler, but rather, this code should be hand crafted. Typically, each case will include the entire tail of the loop, ending with a branch back to the loop top. The net result is a loop where, if the instruction set is carefully designed, each iteration of the multiply code takes under 10 machine cycles per three-trit tribble, which comes to only a few machine cycles per trit of the multiplicand.
The presentaiton that follows is a ternary generalizaiton of the well known technique of reciprocal multiplication. A decent summary of the appliction of similar tricks applied to binary numbers is found in [Jones, 2002].
In ternary, as in any number base, one way to divide is to multiply by the reciprocal. For example:
a/2 = a × 1/2 = a × 0.111_{3}
a/3 = a × 1/3 = a × 0.1_{3}
a/4 = a × 1/4 = a × 0.020202_{3}
a/5 = a × 1/5 = a × 0.012101210121_{3}
a/6 = a × 1/6 = a × 0.0111_{3}
a/7 = a × 1/7 = a × 0.010212_{3}
a/8 = a × 1/8 = a × 0.010101_{3}
a/9 = a × 1/9 = a × 0.01_{3}
a/10 = a × 1/10 = a × 0.00220022_{3}
In the above, overbar is used to mark repeating digits. Note that the set of fractions with finite representations differs depending on the number base. In both decimal and bianry, 1/3 is a repeating fraction, while it has just one place after the point in ternary. Similarly, 1/5 is a repeating fraction in both binary and ternary, but has just one place after the point in decimal.
Multiplying by 1/2 in ternary involves adding an infinite number of partial products because there are an infinite number of successively less significant digits in the repeating fraction. If we are only interested in the integer part of the product, however, we need only add a finite number of partial products. This introduces errors, however, because of the loss of carry propagation from the omitted less significant partial products.
Reducing this to algorithmic form leads to a sequence of ternary right shifts and adds, one per partial product. The order in which the partial products are added has an impact on the result. Consider these two variants:
unsigned int div2( unsigned int a ) { /* inferior */ unsigned int acc = 0; while (a > 0) { acc = acc + a; a = a >>_{3} 1; } return (acc >>_{3} 1); } unsigned int div2( unsigned int a ) { /* superior */ unsigned int acc = 0; unsigned int c = a; while (c > 0) { acc = (acc + a) >>_{3} 1; c = c >>_{3} 1; } return acc; }
The difference between these two pices of code lies in the fact that the first piece of code deletes the least significant trits of each partial product prior to adding them to the accumulator. As a result, there will be no carry into the one's place from less significant parts of the sum. In contrast, the second version of code retains all carries that result from summing a finite number of partial products. In practice, the second version produces a result that is off by at most one from the integer part of the quotient.
Both versions of the above code perform n additions to halve an n-trit number. This can be reduced to log_{2}n. Note that 0.1111_{3} is the same as 0.1010_{3} + (0.1010_{3}>>_{3}1). Generalization of this leads to the following code to halve a 27-trit value:
unsigned int div2( unsigned int a ) { /* assume 27-trit integers */ unsigned int acc = a; acc = acc + (acc >>_{3} 1); acc = acc + (acc >>_{3} 2); acc = acc + (acc >>_{3} 4); acc = acc + (acc >>_{3} 8); acc = acc + (acc >>_{3} 16); return (acc >>_{3} 1); }
It is useful to look at the remainder after division by two using the above scheme. At first, it appears that this code produces the right result for even numbers and is off by one for odd numbers, suggesting that we ought to simply increment the operand prior to the shift and add sequence:
2 | o | o | o | o | o | o | o | o | o | o | |||||||||||
1 | o | o | o | o | o | o | o | o | o | o | |||||||||||
0 | o | ||||||||||||||||||||
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
Looking at the first 20 values, it looks like the code is giving the correct result for all odd numbers and is off by one (on the low side) for even nonzero numbers. Unfortunately, this pattern breaks down for a series of odd values that begins 47, 53, 101, 107. The initial values in this series are prime numbers, but the next group of values in this series, 155, 209 and 215 are not. For all of these odd values, the remainder is two, as it is for all of the even numbers except zero.
The result produced by this code off by at most one, but it fails if there is a carry out of the most significant trit in any of the additions. This will occur only in the top 1/3 of the range of representable values. Hardware implementations of this algorithm can deal with this by adding an extra trit of precision to the accumulator in order to handle this issue. Halving a 9-trit number takes one less shift-add step but is otherwise the same.
Correcting the result to solve the off-by-one problem involves one increment, one addition, one comparison and a final assignment or an extra return:
unsigned int div2( unsigned int a ) { /* assume 27-trit integers */ unsigned int acc = a; acc = acc + (acc >>_{3} 1); acc = acc + (acc >>_{3} 2); acc = acc + (acc >>_{3} 4); acc = acc + (acc >>_{3} 8); acc = acc + (acc >>_{3} 16); acc = acc >>_{3} 1; unsigned int inc = acc + 1; if ((acc + inc) < a) return inc; return acc; }
A hardware implementation can be built that divides by two in a single clock cycle using a cascade of adders, with a final multiplexer to select the result depending on the output of a comparitor.
Extending the above work to balanced ternary, the following algorithm is a trivial restatement of the uncorrected code given above:
balanced int div2( int a ) { /* assume 27-trit balanced integers */ balanced int acc = a; acc = acc + (acc >>_{3} 1); acc = acc + (acc >>_{3} 2); acc = acc + (acc >>_{3} 4); acc = acc + (acc >>_{3} 8); acc = acc + (acc >>_{3} 16); return (acc >>_{3} 1); }
As with the unsigned version, this code produces a result that is off by at most one, except for values where there is a carry out of the most significant trit, which only occurs in the most extreme 1/3 of the range. This code consistently produces the exact answer for all even values, but for odd values, it rounds up or down seemingly at random, as can be seen in the following plot of the remainder after division by two:
1 | o | o | o | o | o | ||||||||||||||||
0 | o | o | o | o | o | o | o | o | o | o | o | ||||||||||
-1 | o | o | o | o | o | ||||||||||||||||
-10 | -9 | -8 | -7 | -6 | -5 | -4 | -3 | -2 | -1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
There are several definitions of division when signed numbers are involved.
We can easily adjust the quotient given by our divide by two algorithm to conform to the expectations of truncated and Euclidian division as follows:
balanced int div2( int a ) { /* assume 27-trit balanced integers */ balanced int acc = a; acc = acc + (acc >>_{3} 1); acc = acc + (acc >>_{3} 2); acc = acc + (acc >>_{3} 4); acc = acc + (acc >>_{3} 8); acc = acc + (acc >>_{3} 16); return (acc >>_{3} 1); unsigned int dec = acc - 1; if ((acc + dec) >= a) return dec; return acc; }
The classical long division algorithm people learn in elementary school for base 10 works just as well in ternary. The following C-like code uses repeated subtraction at each step, with zero, one or two subtractions per trit of the dividend.
unsigned int rem, quo; /* the remainder and quotient, return values */ void div( unsigned int dividend, unsigned int divisor ) { quo = dividend; rem = 0; for (i = 0; i < trits_per_word; i++) { /* first shift rem-quo double register 1 trit left */ (rem,quo) = (rem,quo) <<_{3} 1; /* second, compute one trit of quotient */ if (rem >= divisor) { quo = quo + 1; rem = rem - divisor; if (rem >= divisor) { quo = quo + 1; rem = rem - divisor; } } } }
In the above, the div() routine returns both the remainder and quotient in global variables so that the caller may use either or both of these. Note that the code above is written for clarity. Note that the comparison rem >= divisor would typically be done by subtraction, duplicating the computation two lines later. Both hardware and optimized software versions of this code would combine these.
Balanced ternary division is quite similar, with an interesting twist. Instead of bringing the remainder down into the range from zero to the divisor, the remainder is pulled towards zero, the middle of the range, from either above or below, keeping the absolute value of the remainder bounded by half the divisor. There is a high-level discussion of this, along with an example, in [Parhami and McKeown, 2013]. There, the ternary algorithm is compared with conventional non-restoring binary division.
The common division algorithm for binary numbers and the division algorithm given above for unsigned ternary have a useful property: A single double-length register is used to hold the remainder and quotient. This is initialized with the dividend, and then, in each divide step, as it is shifted left, adding one digit of the quotient to the right end until the low half of the double register contains the entire quotient while the high half contains the remainder. At each step, only the high half of this double register is used in comparisons to determine the next digit of the quotient.
The comparable algorithm for balanced ternary arithmetic requires that the entire remainder be inspected at each step, including that part still in the low half of the remainder-quotient register. This is because the fractional trits in the low half of the remainder-quotient register can either raise or lower the value. It does not hurt to include in this comparison the trits of the quotient that are in the least-significant end of the remainder-quotient register because they are less significant than any of the trits that are relevant to the comparison. Here is a version of the division algorithm that does not involve a preliminary division by two:
balanced int rem, quo; /* the remainder and quotient, return values */ void div( balanced int ividend, balanced int divisor ) { quo = dividend; rem = 0; for (i = 0; i < trits_per_word; i++) { /* first shift rem-quo double register 1 trit left */ (rem,quo) = (rem,quo) <<_{3} 1; /* second, compute candidates for next remainder */ balanced int high = rem + divisor; balanced int mid = rem; balanced int low = rem - divisor; /* pick a candidate, using long comparison */ (rem,) = closest_to_zero( (high,quo), (mid,quo), (low,quo) ); /* set the quotiet digit */ if (rem == high) { quo = quo - 1; } else if (rem == low) { quo = quo + 1; } } }
For all divisors, the above algorithm constrains the absolute value of the remainder to be less than or equal to half the absolute value of the divisor. For odd divisors, this rule deterministically forces the value of the remainder and quotient. For even divisors, when the un-rounded quotient is exactly midway between two integers, the absolute value of the remainder will be exactly half the absolute value of the divisor. This allows for two two possibilities: Either the quotient is rounded up and the remainder is negative, or the quotient is rounded down and the remainder is positive. The algorithm given above rounds up half the time and down the other half, as illustrated below:
1 | o | o | o | o | o | ||||||||||||||||
0 | o | o | o | o | o | o | o | o | o | o | o | ||||||||||
-1 | o | o | o | o | o | ||||||||||||||||
-10 | -9 | -8 | -7 | -6 | -5 | -4 | -3 | -2 | -1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
2 | o | o | o | ||||||||||||||||||
1 | o | o | o | o | o | ||||||||||||||||
0 | o | o | o | o | o | ||||||||||||||||
-1 | o | o | o | o | o | ||||||||||||||||
-2 | o | o | o | ||||||||||||||||||
-10 | -9 | -8 | -7 | -6 | -5 | -4 | -3 | -2 | -1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
As can be seen in the above plots, the rule for determining when this algorithm rounds up and when it rounds down is more complex than "round toward nearest even", but it has much the same effect. Note, however, that in the case of division by two, this algorithm does not round following the same pattern as the fast division by two algorithm given above.
The above code is an effective expression of the balanced ternary division algorithm, but it can be optimized, yielding more complex code at the expense of readability. The trick used below involves converting to a positive divisor and then using the sign of the remainder to decide between adding and subtracting the divisor at that step.
balanced int rem, quo; /* the remainder and quotient, return values */ void div( balanced int ividend, balanced int divisor ) { balanced int one = 1; /* determines whether to negate bits of quotient */ if (divisor < 0) { /* take absolute value of divisor */ divisor = -divisor; one = -one; } quo = dividend; rem = 0; for (i = 0; i < trits_per_word; i++) { /* first shift rem-quo double register 1 trit left */ (rem,quo) = (rem,quo) <<_{3} 1; /* second, compute one trit of quotient */ if (rem > 0) { balanced int low = rem - divisor; if ( (-low < rem) || ((-low == rem) && (quo > 0)) ) { quo = quo + one; rem = low; } } else if (rem < 0) { balanced int high = rem + divisor; if ( (-high > rem) || ((-high == rem) && (quo < 0)) ) { quo = quo - one; rem = high; } } } }