Data Compression and Encryption Algorithms

by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Index

Splay Tree Based Codes
Encryption as an Introductory Programming Exercise

Splay Tree Based Codes

The algorithms for balancing splay-trees, a form of self-adjusting binary search tree invented by Dan Sleator and analyzed by Bob Tarjan, can be adapted to the job of balancing the trie used within a prefix code. This was reported in the paper Applications of Splay Trees to Data Compression by Douglas W. Jones in Communications of the ACM, Aug. 1988, pages 996-1007.

The following C code implementing this algorithm is available:

The first two pieces of code listed above are primarily intended as tutorial examples, although the compression code has been tuned for speed and is very appropriate for modest compression in small, memory limited applications. This is the full version of the code from the paper Practical Evaluation of a Data Compression Algorithm, by Douglas W. Jones, in the proceedings of DCC '91 (the 1991 Data Compression Conference), published by the IEEE Computer Society Press.

The final piece of code listed above is roughly comparable to the commonly used UNIX compress and uncompress utilities based on LZW compression. The user interface is largely compatable, and it compresses about as well, although not quite as quickly.

Splay compression performs unexpectedly well on images! It typically performs as well as the LZW model used in the GIF image format, while consiming considerably less memory. This is because the algorithm is locally adaptive. Scanning images along a Hilbert curve instead of along a conventional raster improves the performance by allowing more effective local adaption. The following software was used to demonstrate this:

Filters for converting between Hilbert-curves and raster scans.

The splay-compression utility is actually a subset of a more interesting utility that contained some additional cryptographic features. That code cannot (as far as I know) be exported from the United States. The basic idea of the cryptographic features (an idea used by Lotus in their product Ami Pro) is to condition the initial code tree used for data compression by the successive characters of the encryption key before use.

The Lotus Ami Pro version of the algorithm is insecure (deliberately, in order to allow export). In addition to conditioning the initial code tree with the password, they also include the password (encoded as a string of two digit decimal numbers) as a prefix on the data to be compressed and encrypted. As a result, if the password is purely alphabetic (and therefore doesn't mess up the encoding of the digits), the decimal form of the password can be inferred fairly easily. The situation is modestly more complex if the password is partially numeric. The weaknesses (and details) of the Lotus Ami Pro encryption scheme were reported to me by Paul C. Kocher (pck@netcom.com).

The weakness of Lotus's approach can be patched by removing the password from the head of the string being encrypted, and replacing it by a randomly chosen prefix. This makes known plaintext attacks difficult by using the random prefix to "stir up the tree" prior to the occurance of any known data.

Splay encryption is vulnerable to a chosen plaintext attacks! There are sequences of bytes that will expose the current state of the tree to someone who examines the compressed data, thus allowing any following data to be decoded. It may be possible to patch this weakness by a combination of run-length encoding and the insertion of random data at periodic points in the string to be encrypted, where these points must be spaced sufficiently closely to break the chosen plaintext string into short segments. More detail on thiese suggestions is available.

None of the comments about patching the weakness of splay-tree based encryption should be taken as any assurance that the patched result is very secure! It is only an accident that splay trees are useful for encryption, and the security of variants on the algorithm is still largely an open question.

A full version of the splay-tree compression program, including the cryptographic features stripped from the version released over the Web, is available by E-mail from the author.

The comp.compression FAQ is a good general source of references on this entire subject area. There is also an excellent list of compression pointers on the web, sorted by author and by substance.

Encryption as an Introductory Programming Exercise

In the spring of 1993, I taught 22C:16, Introduction to Programming in Pascal. Every semester I've taught this course, I try to find a "theme" around which to organize my programming assignments. For the spring of 1993, I decided to use some very elementary cryptography as the subject matter for programming assignments.

Machine problem zero was, not surprisingly, to run a program I gave them that printed "Hello World". Machine problem one was also unrelated to the theme, but all following machine problems revolved around real (although trivial) cryptographic problems.

Machine problem two -- elementary Caesar cyphers (rot13).
Machine problem three -- general substitution cyphers.
Machine problem four -- automatic cracking of Caesar cyphers.
Machine problem five -- automatic cracking of general substitution cyphers.
- The assignment
- Unfortunately, I have lost the solution I distributed
Machine problem -- dictionary based codes.