TerSCII: Ternary Standard Code for Information Interchange

Part of http://www.cs.uiowa.edu/~jones/ternary/
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Disclaimer: Nobody but the author endorses the use of this character set, and even he isn't so sure of it.

Abstract

The TerSCII character set is designed to play, in the world of ternary information processing systems, the same role that ASCII and its compatible descendant, Unicode, play in the world of binary information processing. As coding systems, ASCII and Unicode are resolutely binary, with blocks of 16, 32 and 64 characters as their basis. This is inappropriate in the ternary world, where 9, 27 and 81 are far more likely to be relevant. In addition, just as the UTF-8 and UTF-16 codes are appropriate for representing Unicode as strings of 8-bit and 16-bit words, ternary computers will require TTF-3 and TTF-9 encodings for representing TerSCII as strings of 3-trit trybbles or 9-trit trytes.

  1. Background and Motivation
  2. The TerSCII Code
  3. The TTF-3 and TTF-9 Encodings

1. Background and Motivation

A single ternary (base 3) digit is a trit, which may take on the values 0, 1 and 2 for unsigned data, or –1, 0 and +1 for signed data. Trits are naturally grouped into triplets referred to as trybbles, where each trybble has 3^3 or 27 possible values. Where we need to represent ternary values compactly, we will use heptavintimal, that is, base 27, with the following encoding:

Heptavintimal trybble encodings

Weight:    0   1   2   3   4   5   6   7   8
Ternary: 000 001 002 010 011 012 020 021 022
Digits:    0   1   2   3   4   5   6   7   8

Weight:    9  10  11  12  13  14  15  16  17
Ternary: 100 101 102 110 111 112 120 121 122
Digits:    9   A   B   C   D   E   F   G   H

Weight:   18  19  20  21  22  23  24  25  26
Ternary: 200 201 202 210 211 212 220 221 222
Digits:    K   M   N   P   R   T   V   X   Z
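
As a minimal sketch of how this notation can be used, the following Python converts between integers, 3-trit trybbles and heptavintimal strings. The digit alphabet is taken from the table; the function names are hypothetical and only unsigned values are handled.

```python
# Heptavintimal digit alphabet, copied from the Digits row of the table.
HEPT_DIGITS = "0123456789ABCDEFGHKMNPRTVXZ"


def trybble_to_trits(value):
    """Return the 3-trit string for a trybble value in 0..26."""
    if not 0 <= value < 27:
        raise ValueError("a trybble holds values 0..26")
    return "".join(str((value // power) % 3) for power in (9, 3, 1))


def to_heptavintimal(n):
    """Render a non-negative integer in base 27, most significant digit first."""
    if n < 0:
        raise ValueError("only unsigned values are handled in this sketch")
    digits = HEPT_DIGITS[n % 27]
    n //= 27
    while n:
        digits = HEPT_DIGITS[n % 27] + digits
        n //= 27
    return digits


def from_heptavintimal(text):
    """Parse a heptavintimal string back into an integer."""
    value = 0
    for ch in text:
        value = value * 27 + HEPT_DIGITS.index(ch)
    return value


print(trybble_to_trits(16), to_heptavintimal(16))  # 121 G
```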

Three consecutive trybbles make up a tryte, able to represent a range of 27^3 or 19,683 distinct values. It is natural to pack 3 trytes or 27 trits into a word. This gives us a word that can be represented in 43 bits on a binary computer.

Representing text on such a machine using ASCII would naturally suggest using 5 trits per character, since 3^5 is 243. This is almost but not quite sufficient to represent UTF-8, but if we use 6 trits per character, we have 729 possible values, an awkward number even if we invent UTF-9 so that we can pack 512 values into each character.
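
The arithmetic in the two paragraphs above is easy to check. This short Python sketch (the helper name bits_for_trits is hypothetical) computes how many bits are needed to hold a given number of trits and prints the sizes mentioned in the text.

```python
def bits_for_trits(trits):
    """Smallest number of bits that can hold every value of `trits` trits."""
    # 3**trits distinct values run from 0 to 3**trits - 1.
    return (3 ** trits - 1).bit_length()


print(27 ** 3)             # 19683 values in a 9-trit tryte
print(bits_for_trits(27))  # 43 bits for a 27-trit word
print(3 ** 5, 3 ** 6)      # 243 729: 5 trits fall short of 256, 6 overshoot 512
```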

Further investigation of ASCII and Unicode shows that the code is resolutely based on powers of two. The upper and lower case equivalents of each Roman letter are separated by a difference of 32. There are provisions for 32 control characters, most of which are rarely or inconsistently used. The blocks set aside in Unicode for different alphabets are all documented in hexadecimal and most begin cleanly on addresses that are multiples of 16 or 256. In contrast, the TerSCII code is resolutely designed in terms of multiples of powers of 3. The natural code block size is 9 by 9 instead of 16 by 16.

A second motive for developing a new character set follows from serious security problems caused by Unicode. In Unicode, there are many glyphs that display identically. Numerous letters appear identically in the Roman, Greek and Russian alphabets, and there are numerous different ways of displaying accent marks, some as single glyphs that render the letter with the accent mark, and some as sequences consisting of a letter followed by the accent mark. As a result, there are many character strings that render identically but are actually quite different.

Consider, for example, the string "ТerSСΙI" which should resemble the string "TerSCII" on a web browser conforming to modern standards, but is composed of the following Unicode entities:
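
The entities can be listed mechanically. The Python sketch below assumes the look-alike string substitutes Cyrillic Te (U+0422), Cyrillic Es (U+0421) and Greek Iota (U+0399) for the Roman T, C and first I; that reading of the string is an assumption, so it is spelled out with escapes to make the code points explicit.

```python
import unicodedata

# Assumed reconstruction of the look-alike string, written with escapes so
# the non-Roman letters are visible in the source code.
lookalike = "\u0422erS\u0421\u0399I"
plain = "TerSCII"

# One line per character: code point and official Unicode name.
for ch in lookalike:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# Positions at which the two strings differ despite rendering alike.
differing = [i for i, (a, b) in enumerate(zip(lookalike, plain)) if a != b]
print(lookalike == plain, differing)
```

The two strings compare unequal even though most fonts draw them identically, which is the hazard the text describes.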

This has serious security consequences, for example, when a bogus web site has a URL that renders identically to a legitimate site. TerSCII, in contrast, does not permit this. There is exactly one encoding for each glyph. Accent marks may only be encoded as combining marks, never as separate accented characters. One consequence of this is that conversion from Unicode to TerSCII is deterministic and straightforward, while conversion from TerSCII to Unicode is nondeterministic, although there are sensible heuristics that can be used to pick the most appropriate of several identical Unicode glyphs in any particular context.

2. The TerSCII Code

With 26 letters in the English alphabet and comparable numbers in other western and middle-eastern alphabets, the first power of 3 that lends itself to representing a reasonable character set is 3^4 or 81. A 4-trit character allows encoding the Roman alphabet in both upper and lower case, plus 10 digits and a modest (but insufficient) set of control characters and punctuation marks. In this environment, a code extension system comparable to that of Unicode invites a character code built on 81-character blocks.

2.1 The TerSCII Basic Roman Block

Consider the following block for the basic Roman alphabet:

 00    0    1    2    3    4    5    6    7    8
  0   ES   SP    0    9    I    R    _    i    r
  1   EL    -    1    A    J    S    a    j    s
  2   ET    '    2    B    K    T    b    k    t
  3   LR    ,    3    C    L    U    c    l    u
  4   OP    ;    4    D    M    V    d    m    v
  5   RL    :    5    E    N    W    e    n    w
  6   SU    .    6    F    O    X    f    o    x
  7   HT    !    7    G    P    Y    g    p    y
  8   SD    ?    8    H    Q    Z    h    q    z

Code  Meaning
ES    End of String, analogous to NULL
EL    End of Line, analogous to LF or CR/LF
ET    End of Text file
LR    Left to Right rendering of following text
OP    OverPrint following text on previous char
RL    Right to Left rendering of following text
SU    Shift Up (superscript) following by 1/3 baseline
HT    Horizontal Tab in current rendering direction
SD    Shift Down (subscript) following by 1/3 baseline
SP    Space
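
The block can be turned into a lookup table. The Python sketch below assumes that codes run down each column, so that code = 9 × column + row; this numbering is inferred from HT (row 7 of column 0) being code 7 and is not stated explicitly in the table. The names COLUMNS, CODE_OF and terscii_codes are hypothetical.

```python
# Block 00, one list per column 0..8, each read top to bottom (rows 0..8).
COLUMNS = [
    ["ES", "EL", "ET", "LR", "OP", "RL", "SU", "HT", "SD"],
    ["SP"] + list("-',;:.!?"),
    list("012345678"),
    list("9ABCDEFGH"),
    list("IJKLMNOPQ"),
    list("RSTUVWXYZ"),
    list("_abcdefgh"),
    list("ijklmnopq"),
    list("rstuvwxyz"),
]

# Assumption: codes are numbered column by column, code = 9 * column + row.
CODE_OF = {
    name: 9 * col + row
    for col, column in enumerate(COLUMNS)
    for row, name in enumerate(column)
}


def terscii_codes(text):
    """Map a Python string to block-00 codes (space -> SP, newline -> EL, tab -> HT).

    Raises KeyError for characters that block 00 does not contain.
    """
    special = {" ": "SP", "\n": "EL", "\t": "HT"}
    return [CODE_OF[special.get(ch, ch)] for ch in text]


print(terscii_codes("fox\n"))  # [60, 69, 78, 1]
```

Under this assumed numbering, the upper and lower case forms of each letter differ by exactly 27, the ternary counterpart of the difference of 32 in ASCII.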

This is a meagre character set, but it is good enough to typeset the body text of a novel, if substitution of apostrophes for quote marks is acceptable. Control over rendering direction eliminates the need for backspace. The ability to overprint allows underlining and, with the addition of more characters, accent marks. Shift up and shift down can be used to superscript or subscript text.

Some rules apply to overprinting: All characters following an OP control code will overprint until the next LR or RL control code. No other overprinting mechanism is present. Specifically, there is no equivalent of the ASCII CR, which, on printing terminals, allowed the following characters to overprint (from left to right) an entire line of text. Instead, if an EL is encountered in LR mode, the next line is rendered starting at the left, and in RL mode, EL starts rendering the next line at the right. Where supported, changes from LR to RL mode in midline should operate so that "RL a b c d" should render identically to "RL a LR c b RL d" and both should print as "abcd".

2.2 The TerSCII Extended Roman Block

The next block adds characters that are missing from the basic block and are useful for Western European languages.

 01    0    1    2     3    4    5     6    7    8  
0
1 *
2
3 /
4 |
5 \
6
7
8

Note that double quotes are merely pairs of single quotes. This eliminates the distinction between ‘‘this’’ (quoted with pairs of single quotes) and “that” (quoted with double quotes).

3. The TTF-3 and TTF-9 Encodings

The basic TerSCII character set can be encoded in 4-trit quartets, but addressing 4-trit units on a ternary computer is as difficult as addressing 6-bit units on a binary machine. 6 or 9 trits per character make far more sense. The Trinary Text Formats TTF-3 and TTF-9 use these sizes. These formats borrow some ideas from the UTF-8 encoding of Unicode, but they do so without threatening to create any degree of compatibility.

TTF-3 encodes each character as a sequence of one or more 3-trit trybbles. The leading trit of each trybble indicates whether that trybble is a stand-alone character, the first trybble of a long character, or a subsequent trybble of a long character. Each trybble carries 2 trits of the character representation.

codes        trybble 0   trybble 1   trybble 2   trybble 3   blocks
             trits 0-2   trits 3-5   trits 6-8   trits 9-11
0 – 8        0 t1 t0                                         0: only control characters
9 – 80       2 t3 t2     1 t1 t0                             0: basic Roman except cc's
81 – 728     2 t5 t4     1 t3 t2     1 t1 t0                 1 to 8
729 – 6560   2 t7 t6     1 t5 t4     1 t3 t2     1 t1 t0     9 to 80

As with Unicode, each character must be encoded using its shortest encoding. Thus, while HT (Horizontal Tab) could be encoded as 200 121 (KG in heptavintimal) or even 200 100 121 (K9G in heptavintimal), we require it to be encoded as 021 (7 in heptavintimal). This constraint plus our encoding scheme guarantees that simple trybble-by-trybble comparison of two strings in their TTF-3 form will alphabetize them as if the characters had been fully expanded into their canonical fixed-size representation.
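
A minimal Python sketch of TTF-3 as described above follows. Trybbles are modeled as 3-character strings of the digits 0, 1 and 2, and the function names are hypothetical. The encoder caps codes at 8 trybbles, the limit given in this section. The decoder's rule that a long character may not begin with 200 is derived here from the shortest-encoding requirement rather than stated in the text.

```python
def _pair(digit):
    """Two-trit string for a base-9 digit 0..8."""
    return f"{digit // 3}{digit % 3}"


def ttf3_encode(code):
    """Encode one character code as a list of 3-trit strings (trybbles)."""
    if not 0 <= code < 3 ** 16:
        raise ValueError("code does not fit in 8 trybbles")
    if code < 9:
        return ["0" + _pair(code)]  # stand-alone trybble: 0 t1 t0
    pairs = []
    while code:  # peel off base-9 digits, least significant first
        pairs.append(_pair(code % 9))
        code //= 9
    pairs.reverse()
    # Leading trit 2 marks the first trybble, 1 marks each later one.
    return ["2" + pairs[0]] + ["1" + p for p in pairs[1:]]


def ttf3_decode(trybbles):
    """Decode a list of trybbles into character codes, rejecting bad forms."""
    codes, parts = [], []

    def flush():
        # Finish the long character being assembled, if there is one.
        if parts:
            if len(parts) < 2 or len(parts) > 8 or parts[0] == 0:
                raise ValueError("malformed or non-shortest long character")
            value = 0
            for p in parts:
                value = value * 9 + p
            codes.append(value)
            parts.clear()

    for t in trybbles:
        lead, value = t[0], int(t[1:], 3)
        if lead == "1":
            if not parts:
                raise ValueError("continuation trybble without a start")
            parts.append(value)
            continue
        flush()
        if lead == "0":
            codes.append(value)
        else:  # lead == "2": first trybble of a long character
            parts.append(value)
    flush()
    return codes


print(ttf3_encode(7))   # ['021'], the required encoding of HT
print(ttf3_encode(81))  # ['201', '100', '100']
```

Decoding 200 121 with this sketch raises ValueError, matching the requirement that HT be written only as 021.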

Unlike Unicode, the first trybble of the longer character encodings does not indicate the length of the character. This scheme can, potentially, be stretched to arbitrary-length codes, but we arbitrarily declare that any character encoding with more than 8 trybbles is illegal. This sets an excessively generous upper bound on the size of the character set and permits encoding of any data that can be encoded in TTF-9.

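TTF-9 encodes each character as a sequence of one or two 9-trit trytes. The first trit of each tryte indicates whether that tryte is a stand-alone character (0), the first tryte of a two-tryte character (2), or the second tryte of a two-tryte character (1).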

codes             tryte 0 (trits 0-8)               tryte 1 (trits 9-17)         blocks
0 – 6560          0 t7 t6 t5 t4 t3 t2 t1 t0                                      0 to 80
6561 – 43046720   2 t15 t14 t13 t12 t11 t10 t9 t8   1 t7 t6 t5 t4 t3 t2 t1 t0    81 and up

This encoding scheme is generous, allowing for about 43 million distinct character codes (3^16 is 43,046,721), considerably more than Unicode's upper limit. Like UTF-8 and TTF-3, TTF-9 allows lexical sorting of strings based on their full TerSCII representation while doing successive comparisons one tryte at a time.
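
The TTF-9 layout can be sketched in a few lines of Python. Trytes are modeled as 9-character strings of the digits 0, 1 and 2; the function names are hypothetical.

```python
def _trits(value, width):
    """Fixed-width ternary string, most significant trit first."""
    digits = ""
    for _ in range(width):
        digits = str(value % 3) + digits
        value //= 3
    return digits


def ttf9_encode(code):
    """Encode one character code as a list of one or two 9-trit strings."""
    if not 0 <= code < 3 ** 16:
        raise ValueError("code does not fit in two trytes")
    if code < 3 ** 8:
        return ["0" + _trits(code, 8)]  # single tryte: 0 t7..t0
    high, low = divmod(code, 3 ** 8)
    # Two trytes: 2 t15..t8 followed by 1 t7..t0.
    return ["2" + _trits(high, 8), "1" + _trits(low, 8)]


print(ttf9_encode(7))     # ['000000021']
print(ttf9_encode(6561))  # ['200000001', '100000000']
```

Because a single-tryte character begins with 0 and a two-tryte character begins with 2, comparing encodings tryte by tryte orders individual characters the same way as comparing their numeric codes.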

Because TTF-3 encodes the common Roman characters in just 2 trybbles while TTF-9 encodes them in 3, TTF-3 should be more compact for European languages. It should remain competitive even where characters in blocks 1 to 8 dominate because of the efficient encoding of spaces and control characters.
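
The size comparison can be made concrete by counting trits per character in each format. The Python sketch below uses hypothetical helper names and arbitrary sample codes, one from each range in the TTF-3 table plus one beyond block 80.

```python
def ttf3_trits(code):
    """Trits used by TTF-3: 3 per trybble, one trybble for codes 0..8,
    otherwise one trybble per base-9 digit of the code."""
    trybbles = 1
    while code >= 9:
        code //= 9
        trybbles += 1
    return 3 * trybbles


def ttf9_trits(code):
    """Trits used by TTF-9: one 9-trit tryte below 3**8, otherwise two."""
    return 9 if code < 3 ** 8 else 18


# Arbitrary sample codes: a control character, a basic Roman character,
# a block 1 to 8 character, a block 9 to 80 character, and one past block 80.
for code in (7, 28, 100, 1000, 7000):
    print(code, ttf3_trits(code), ttf9_trits(code))
```

For these samples, TTF-3 is smaller for the control and basic Roman codes, ties with TTF-9 for blocks 1 to 8, and is larger for blocks 9 to 80.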