TerSCII: Ternary Standard Code for Information Interchange

Part of http://homepage.cs.uiowa.edu/~dwjones/ternary/
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Disclaimer: Nobody but the author endorses the use of this character set, and even he isn't so sure of it.

Abstract

The TerSCII character set is designed to serve the same role in the world of ternary information processing systems that ASCII and its compatible descendant, Unicode, serve in the world of binary information processing. As coding systems, ASCII and Unicode are resolutely binary, with blocks of 16, 32 and 64 characters as their basis. This is inappropriate in the ternary world, where 9, 27 and 81 are far more likely to be relevant. In addition, just as the UTF-8 and UTF-16 codes are appropriate for representing Unicode as strings of 8-bit and 16-bit words, ternary computers will require TTF-3 and TTF-9 encodings for representing TerSCII as strings of 3-trit trybbles or 9-trit trytes.

1. Background and Motivation

Heptavintimal trybble encodings
Weight:	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26
Ternary:	000	001	002	010	011	012	020	021	022	100	101	102	110	111	112	120	121	122	200	201	202	210	211	212	220	221	222
Digits:	0	1	2	3	4	5	6	7	8	9	A	B	C	D	E	F	G	H	K	M	N	P	R	T	V	X	Z
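
The table above can be exercised with a short Python sketch. The digit string is copied from the Digits row; the function names are hypothetical, and values are treated as unsigned, matching the weights 0 to 26 shown in the table.

```python
# Heptavintimal digits, copied from the "Digits" row of the table above.
DIGITS = "0123456789ABCDEFGHKMNPRTVXZ"

def trybble_to_heptavintimal(trits: str) -> str:
    """Convert a 3-trit string such as '121' to its heptavintimal digit."""
    if len(trits) != 3 or any(t not in "012" for t in trits):
        raise ValueError("expected exactly three trits, each 0, 1 or 2")
    return DIGITS[int(trits, 3)]

def to_heptavintimal(value: int, trybbles: int = 3) -> str:
    """Render a non-negative integer as a fixed number of heptavintimal digits."""
    if not 0 <= value < 27 ** trybbles:
        raise ValueError("value does not fit in the requested number of trybbles")
    out = []
    for _ in range(trybbles):
        value, digit = divmod(value, 27)
        out.append(DIGITS[digit])
    return "".join(reversed(out))
```

For example, `trybble_to_heptavintimal("121")` gives `G` and `trybble_to_heptavintimal("200")` gives `K`, in agreement with the table.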

Three consecutive trybbles make up a tryte, able to represent a range of 27³ or 19,683 distinct values. It is natural to pack 3 trytes or 27 trits into a word. This gives us a word that can be represented in 43 bits on a binary computer.
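
The arithmetic behind these figures can be checked with a few lines of Python. This is only a sanity check of the numbers quoted above; the constant names are illustrative.

```python
TRITS_PER_TRYBBLE = 3
TRYBBLES_PER_TRYTE = 3
TRYTES_PER_WORD = 3

# A tryte is three trybbles, each with 27 possible values.
tryte_values = 27 ** TRYBBLES_PER_TRYTE

# A word is three trytes, that is 27 trits.
word_trits = TRITS_PER_TRYBBLE * TRYBBLES_PER_TRYTE * TRYTES_PER_WORD
word_values = 3 ** word_trits

# Smallest number of bits able to hold every value of a 27-trit word.
word_bits = (word_values - 1).bit_length()
```

Here `tryte_values` is 19683, `word_trits` is 27 and `word_bits` is 43.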

Representing text on such a machine using ASCII would naturally suggest using 5 trits per character, since 3⁵ is 243. This is almost but not quite sufficient to represent UTF-8, but if we use 6 trits per character, we have 729 possible values, an awkward number even if we invent UTF-9 so that we can pack 512 values into each character.

Further investigation of ASCII and Unicode shows that the code is resolutely based on powers of two. The upper and lower case equivalents of each Roman letter are separated by a difference of 32. There are provisions for 32 control characters, most of which are rarely or inconsistently used. The blocks set aside in Unicode for different alphabets are all documented in hexadecimal and most begin cleanly on addresses that are multiples of 16 or 256. In contrast, the TerSCII code is resolutely designed in terms of multiples of powers of 3. The natural code block size is 9 by 9 instead of 16 by 16.

A second motive for developing a new character set follows from serious security problems caused by Unicode. In Unicode, there are many glyphs that display identically. Numerous letters appear identically in the Roman, Greek and Russian alphabets, and there are numerous different ways of displaying accent marks, some as single glyphs that render the letter with the accent mark, and some as sequences consisting of a letter followed by the accent mark. As a result, there are many character strings that render identically but are actually quite different.

2. The TerSCII Code

With 26 letters in the English alphabet and comparable numbers in other western and middle-eastern alphabets, the first power of 3 that lends itself to representing a reasonable character set is 3⁴ or 81. A 4-trit character allows encoding the Roman alphabet in both upper and lower case, plus 10 digits and a modest (but insufficient) set of control characters and punctuation marks. In this environment, a code extension system comparable to that of Unicode invites a character code built on 81-character blocks.

Consider the following block for the basic Roman alphabet:

00	0	1	2	3	4	5	6	7	8
0	ES	SP	0	9	I	R	_	i	r
1	EL	-	1	A	J	S	a	j	s
2	ET	'	2	B	K	T	b	k	t
3	LR	,	3	C	L	U	c	l	u
4	OP	;	4	D	M	V	d	m	v
5	RL	:	5	E	N	W	e	n	w
6	SU	.	6	F	O	X	f	o	x
7	HT	!	7	G	P	Y	g	p	y
8	SD	?	8	H	Q	Z	h	q	z

Code	Meaning
ES	End of String, analogous to NULL
EL	End of Line, analogous to LF or CR/LF
ET	End of Text file
LR	Left to Right rendering of following text
OP	OverPrint following text on previous char
RL	Right to Left rendering of following text
SU	Shift Up (superscript) following by 1/3 baseline
HT	Horizontal Tab in current rendering direction
SD	Shift Down (subscript) following by 1/3 baseline
SP	Space
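
The block above can be held as a small lookup table. The following Python sketch assumes that a character's code within the block is 9 × column + row. The text does not state this outright, but it is consistent with HT being code 7 and with codes 0 to 8 being the control characters. The names `BLOCK_00` and `code_of` are hypothetical.

```python
# Rows 0..8 of block 00, copied from the table above (columns 0..8).
BLOCK_00 = [
    ["ES", "SP", "0", "9", "I", "R", "_", "i", "r"],
    ["EL", "-",  "1", "A", "J", "S", "a", "j", "s"],
    ["ET", "'",  "2", "B", "K", "T", "b", "k", "t"],
    ["LR", ",",  "3", "C", "L", "U", "c", "l", "u"],
    ["OP", ";",  "4", "D", "M", "V", "d", "m", "v"],
    ["RL", ":",  "5", "E", "N", "W", "e", "n", "w"],
    ["SU", ".",  "6", "F", "O", "X", "f", "o", "x"],
    ["HT", "!",  "7", "G", "P", "Y", "g", "p", "y"],
    ["SD", "?",  "8", "H", "Q", "Z", "h", "q", "z"],
]

def code_of(symbol: str) -> int:
    """Return the code of a block-00 symbol, assuming code = 9 * column + row."""
    for row, cells in enumerate(BLOCK_00):
        if symbol in cells:
            return 9 * cells.index(symbol) + row
    raise KeyError(symbol)
```

Under this assumption the digits occupy the consecutive codes 18 to 27, and each lower case letter is exactly 27 above its upper case equivalent, for example `code_of("a") - code_of("A") == 27`.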

This is a meagre character set, but it is good enough to typeset the body text of a novel, if substitution of apostrophes for quote marks is acceptable. Control over rendering direction eliminates the need for backspace. The ability to overprint allows underlining and, with the addition of more characters, accent marks. Shift up and shift down can be used to superscript or subscript text.

Some rules apply to overprinting: All characters following an OP control code will overprint until the next LR or RL control code. No other overprinting mechanism is present. Specifically, there is no equivalent of the ASCII CR, which, on printing terminals, allowed the following characters to overprint (from left to right) an entire line of text. Instead, if an EL is encountered in LR mode, the next line is rendered starting at the left, and in RL mode, EL starts rendering the next line at the right. Where supported, changes from LR to RL mode in midline should operate so that "RL a b c d" should render identically to "RL a LR c b RL d" and both should print as "abcd".

The next block we add provides characters that are missing from the block above and are useful for Western European languages.

01	0	1	2	3	4	5	6	7	8
0	‘
1	*
2	’
3	/
4	|
5	\
6	‹
7	◊
8	›

Note that double quotes are merely pairs of single quotes. This eliminates the distinction between ‘‘this’’ (quoted with pairs of single quotes) and “that” (quoted with double quotes).

3. The TTF-3 and TTF-9 Encodings

The basic TerSCII character set can be encoded in 4-trit quartets, but addressing 4-trit units on a ternary computer is as difficult as addressing 6-bit units on a binary machine. 6 or 9 trits per character make far more sense. The Trinary Text Formats TTF-3 and TTF-9 use these sizes. These formats borrow some ideas from the UTF-8 encoding of Unicode, but they do so without threatening to create any degree of compatibility.

TTF-3 encodes each character as a sequence of one or more 3-trit trybbles. The leading trit of each trybble indicates whether that trybble is a stand-alone character (0), the first trybble of a long character (2), or a subsequent trybble of a long character (1). Each trybble carries 2 trits of the character representation.

trybble	0			1			2			3			blocks
trit	0	1	2	3	4	5	6	7	8	9	10	11
0 – 8	0	t₁	t₀										0: only control characters
9 – 80	2	t₃	t₂	1	t₁	t₀							0: basic Roman except control characters
81 – 728	2	t₅	t₄	1	t₃	t₂	1	t₁	t₀				1 to 8
729 – 6560	2	t₇	t₆	1	t₅	t₄	1	t₃	t₂	1	t₁	t₀	9 to 80

As with Unicode, each character must be encoded using its shortest encoding. Thus, while HT (Horizontal Tab) can be encoded as 200 121 (KG₂₇) or even 200 100 121 (K9G₂₇), we require it to be encoded as 021 (7₂₇). This constraint plus our encoding scheme guarantees that simple trybble-by-trybble comparison of two strings in their TTF-3 form will alphabetize them as if the characters had been fully expanded into their canonical fixed-size representation.
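
The shortest-encoding rule can be sketched as a small Python encoder. This is an illustration built from the TTF-3 table, not a reference implementation: the function names are hypothetical, trybbles are represented as 3-character strings of trits, and the upper limit of 8 trybbles per character is the one the text declares for TTF-3.

```python
def pair_to_trits(digit: int) -> str:
    """Render a base-9 digit (0..8) as the two trits a trybble carries."""
    return "012"[digit // 3] + "012"[digit % 3]

def ttf3_encode_char(code: int) -> list[str]:
    """Return the shortest TTF-3 encoding of one character code."""
    if not 0 <= code < 3 ** 16:          # 8 trybbles carry 16 trits
        raise ValueError("code out of range for TTF-3")
    if code < 9:
        # Stand-alone trybble: leading trit 0.
        return ["0" + pair_to_trits(code)]
    # Split the code into base-9 digits, most significant first.
    # No leading zero digit is ever produced, so the result is the
    # shortest multi-trybble form.
    digits = []
    while code:
        code, digit = divmod(code, 9)
        digits.append(digit)
    digits.reverse()
    # First trybble has leading trit 2, every later trybble leading trit 1.
    return ["2" + pair_to_trits(digits[0])] + [
        "1" + pair_to_trits(d) for d in digits[1:]
    ]
```

With this sketch, `ttf3_encode_char(7)` yields `["021"]`, the required encoding of HT, and the overlong forms 200 121 and 200 100 121 are never generated.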

Unlike Unicode, the first trybble of the longer character encodings does not indicate the length of the character. This scheme can, potentially, be stretched to arbitrary-length codes, but we arbitrarily declare that any character encoding with more than 8 trybbles is illegal. This sets an excessively generous upper bound on the size of the character set and permits encoding of any data that can be encoded in TTF-9.
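
Because the first trybble does not announce the length, a decoder has to scan forward until the continuation trybbles run out. The following Python sketch shows one way to do this while enforcing the 8-trybble limit and rejecting overlong forms. The function name is hypothetical, and treating a lone leading trybble (leading trit 2 with no continuation) as an error is an assumption, since the table defines no such form.

```python
MAX_TRYBBLES = 8  # limit declared in the text

def ttf3_decode(trybbles: list[str]) -> list[int]:
    """Decode a list of 3-trit strings into character codes."""
    for t in trybbles:
        if len(t) != 3 or any(c not in "012" for c in t):
            raise ValueError(f"not a trybble: {t!r}")
    codes = []
    i = 0
    while i < len(trybbles):
        lead, payload = trybbles[i][0], int(trybbles[i][1:], 3)
        i += 1
        if lead == "0":                      # stand-alone character
            codes.append(payload)
            continue
        if lead == "1":
            raise ValueError("continuation trybble with no leading trybble")
        # lead == "2": gather every continuation trybble that follows.
        value, length = payload, 1
        while i < len(trybbles) and trybbles[i][0] == "1":
            value = value * 9 + int(trybbles[i][1:], 3)
            length += 1
            i += 1
        if length == 1:
            raise ValueError("leading trybble with no continuation")
        if length > MAX_TRYBBLES:
            raise ValueError("character longer than 8 trybbles")
        if value < 9 ** (length - 1):        # a shorter form exists
            raise ValueError("overlong encoding")
        codes.append(value)
    return codes
```

Under these rules `ttf3_decode(["021"])` returns `[7]`, while `ttf3_decode(["200", "121"])` raises an error because it is an overlong encoding of HT.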

TTF-9 encodes each character as a sequence of one or more 9-trit trytes. The first trit of each tryte gives the length of the encoding.

tryte	0									1									blocks
trit	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17
0 – 6560	0	t₇	t₆	t₅	t₄	t₃	t₂	t₁	t₀										0 to 80
6561 – 43046720	2	t₁₅	t₁₄	t₁₃	t₁₂	t₁₁	t₁₀	t₉	t₈	1	t₇	t₆	t₅	t₄	t₃	t₂	t₁	t₀	81 and up

This encoding scheme is generous, allowing for about 43 million distinct character codes (3¹⁶ = 43,046,721), considerably more than Unicode's upper limit. Like UTF-8 and TTF-3, TTF-9 allows lexical sorting of strings based on their full TerSCII representation while doing successive comparisons one tryte at a time.
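
A TTF-9 encoder following the table above is only a few lines long. This Python sketch uses hypothetical function names and represents each tryte as a 9-character string of trits.

```python
def to_trits(value: int, width: int) -> str:
    """Render a non-negative integer as exactly `width` trits."""
    digits = []
    for _ in range(width):
        value, digit = divmod(value, 3)
        digits.append("012"[digit])
    return "".join(reversed(digits))

def ttf9_encode_char(code: int) -> list[str]:
    """Return the TTF-9 encoding of one character code as 9-trit strings."""
    if 0 <= code < 3 ** 8:
        # Single tryte: leading trit 0, then 8 trits of code.
        return ["0" + to_trits(code, 8)]
    if 3 ** 8 <= code < 3 ** 16:
        # Two trytes: leading trits 2 and 1, each followed by 8 trits.
        high, low = divmod(code, 3 ** 8)
        return ["2" + to_trits(high, 8), "1" + to_trits(low, 8)]
    raise ValueError("code out of range for TTF-9")
```

For example, `ttf9_encode_char(7)` yields `["000000021"]`, and the first two-tryte code, 6561, yields `["200000001", "100000000"]`.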

Because TTF-3 encodes the common Roman characters in just 2 trybbles (6 trits) while TTF-9 encodes them in a full tryte (9 trits, the equivalent of 3 trybbles), TTF-3 should be more compact for European languages. It should remain competitive even where characters in blocks 1 to 8 dominate, because of the efficient encoding of spaces and control characters.
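
The size comparison can be made concrete by counting trits per character in each format. The Python sketch below works on raw character codes, using the lengths from the two encoding tables. The function names and the sample code sequences are illustrative only; the letter codes in the comments assume codes are numbered 9 × column + row within a block.

```python
def ttf3_trits(code: int) -> int:
    """Trits used by the shortest TTF-3 encoding of a code."""
    if not 0 <= code < 3 ** 16:
        raise ValueError("code out of range")
    if code < 9:
        return 3                      # one stand-alone trybble
    trybbles = 0
    while code:                       # one trybble per base-9 digit
        code //= 9
        trybbles += 1
    return 3 * trybbles

def ttf9_trits(code: int) -> int:
    """Trits used by the TTF-9 encoding of a code."""
    if not 0 <= code < 3 ** 16:
        raise ValueError("code out of range")
    return 9 if code < 3 ** 8 else 18

def total_trits(codes: list[int]) -> tuple[int, int]:
    """Return (TTF-3 size, TTF-9 size) in trits for a sequence of codes."""
    return sum(map(ttf3_trits, codes)), sum(map(ttf9_trits, codes))

# Basic Roman text: 'H' (35), 'i' (63), SP (9), EL (1) under the assumed numbering.
roman_sample = [35, 63, 9, 1]
# Text dominated by block 1 characters (codes 81 and up), with a space and a line end.
block1_sample = [81, 9, 81, 1]
```

Here `total_trits(roman_sample)` gives `(21, 36)` and `total_trits(block1_sample)` gives `(27, 36)`, so in both samples the TTF-3 form is the shorter one.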
