From Checksums to Authentication

Part of 22C:169, Computer Security Notes
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Checksums and Hash Codes

Two classic problems lead to much the same idea: The first is error detection in data transmission. If you transmit data, or you save data how do you add to this data information that suffices to allow the recipient to tell if the data was corrupted in transmission or during storage. The oldest idea along these lines is to store each byte with a parity bit:

D7 D6 D5 D4 D3 D2 D1 D0  P 

P = D0 + D1 + D2 + D3 + D4 + D5 + D6 + D7 + D8 (mod 2)

This approach to parity dates back to the era of electromechanical data transmission. For example, Teletypes that transmitted data encoded in the 7-bit ASCII code were frequently configured to transmit 8 bit data, the 8th bit being a parity bit. The magnetic tape recording formats of the 1950's stored 6 data bits plus parity on 7-track tape, and as the move was made to 8-bit bytes in the 1960's, the standard tape format used 9-track tape to allow for parity.

On receipt of the data, the parity for the 8 data bits is computed and compared with the parity bit that was received. If the computed and received parity are equal, it is likely that the data is correct. This works so long as the probability of two corrupted data bits in one byte is low, and it fails as soon as this is likely.

This weakness led developers of data communications equipment to seek more robust parity schemes -- that is, schemes that would allow detection of errors in the face of multiple errors or bursts of errors, and that would allow this with less overhead. A common very simple scheme was to use the arithmetic sum of all the bytes in the message as a check at the end of the message.

This is called a checksum, and the term applies whether the sum is done byte by byte or word by word, and whether it is done using integer arithmetic, modular arithmetic, or bitwise exclusive-or operators. In fact, the term checksum has come to be used broadly for any check-code computed as a function of the message contents. whether the accumulator is larger than one byte or whether the sum is done In practice, the most common scheme for this is the cyclic redundancy check:

Core logic to compute the CCITT CRC-16 function, x16 + x12 + x5 + 1.

Here, we are not interested in the details of efficient computation of this function, but rather, it should be noted that this idea of using a cyclic shift register with feedback to various points in the register and exclusive or gates is also widely used for pseudo-random number generation. Certain shift registers wired this way (but not all) will generate maximal pseudo-random number sequences if initialized with any nonzero value -- that is, they will visit all other nonzero values before returning to their initial value.

As used for cyclic redundancy-check computation, the register is initialized to zero, and then all of the bits of the message are shifted into it in sequence. At the end of this process, zeros are shifted in and the contents of the register are shifted out into the data stream. The reciever duplicates this computation to check the data for transmission errors. The example cyclic cyclic redundancy check logic given here, the CCITT CRC-16 function, is widely used in telecommunication. There are many others.

The thing to note is that all CRC functions compute a small n-bit digest of the message, computed in such a way that corruption of the message has approximately a 1 in 2n chance of going undetected.

Hash Codes

Hash functions are commonly used in constructing symbol-tables and other fast associative memories. Instead of searching through the entire set of items, we compute a hash function, giving us a small integer index into the set of items. An ideal hash function will have the following properties:

Here is a typical hash function:

int hash ( char *s ) {
    int v = 0;
    while (*s != NULL) {
        v = (A * v + *s) % M
    return v;

The code at the center of this hash function is remarkably similar to the code used in a multiplicative congruence pseudo-random number generator.

In fact, there is no reason not to use a CRC function as a hash function. Classic CRC functions, however, are ugly things to compute efficiently in software -- those exclusive-or gates and shift registers that are easy to build in bit-serial hardware devices at the ends of communication lines are not so easy to code as word-parallel logic in a C program.

An application: Authentication

Normally, in data communictions or symbol tables for compilers, we use hash or checksum functions that compute small values. So, for data communications, 16-bit CRC functions are common, and it is common to compute hash functions using 32-bit arithmetic. This gives us a 1 in 216 or a 1 in 232 chance of a collision between any two randomly chosen inputs to the CRC or hash function.

What happens if we compute the hash function using far larger numbers, for example, 256-bit integers? In this case, the probability of randomly selected input strings colliding, that is, hashing to the same value, will be 1 in 2256.

This is more than just an error detection trick! Some functions are easy to invert, while others are hard to invert. If our hash function is easy to invert, then given the value of hash(s), we can easily compute some s that has that hash value. In contrast, if we have a hard-to-invert hash function, it is very hard to go from hash(s) to some s with that hash value. In the worst case, the function may be so hard to invert that we are forced to resort to trial and error methods.

We refer to these hard-to-invert hash functions as cryptographically secure hash functions. The term message digest is sometimes used for the value of the hash function applied to some message.

Given a secure hash function, we can, for example, publish the message digest for a message, then lock that message away. Later, when this message is retrieved, if the recomputed message digest is the same as the value published earlier, we know that the message that was recovered is authentic. This is an example of using secure hash functions for authentication.

Consider the problem of providing proof, in court, that a digital photograph has not been altered since the time the photograph was made at a crime scene. Altering digital photographs is trivial these days, so there is good reason to be suspicious of any digital photo. If, on extraction from the camera, the digital image is immediately hashed, and this hash value is published, then the alteration can be easily detected.

Of course, publishing the message digest of each photo would be expensive, so the usual scheme is to archive a file containing the message digests of all the documents you wish to protect, and then publish the message digest of this archive. At the time a question is raised about the authenticity of some document, you then compute its message digest, show that this number is in the archive, and then show that the message digest of the archive it itself the same as the published value. This scheme is currrently supported commercially, with obscure want-ads in places like the New York Times and the Wall Street Journal serving as the ultimate points of publication.

A Warning

The idea that there exist difficult to invert functions, also known as trapdoor functions, is pure speculation. For every known function, the most we can do is speculate that there is no easy algorithm for inverting it.

For small functions, we can invert them by simply exahustively going through all inputs and constructing a table of all outputs, listing, for each output, an input that gives that value. Therefore, as memory and computing power get cheaper, functions we thought could not be inverted become invertable.

It is fairly easy to estimate the total storage capacity of the universe, given an approximate value of the mass of the universe -- quantum physics actually allows us to state the entropy of the universe in bits! The number is around 2×10122. Note the rule of thumb that 3 decimal digits is the same as 10 binary bits, so this is just over 2400 bits of information.

If our trapdoor function has a 512-bit value, then we have over 100 bits of safety margin -- our universe is definitely not big enough to hold the lookup table needed to invert such a function. However, this does not tell us that there is no algorithm.

The history of trapdoor functions is littered with examples of functions that looked good when they were originally proposed, but where attackers eventually found algorithms for inverting them. The current SHA-1 function is not as strong as was originally suspected, but its known weakness does not put it in range of systematic attack.

Another Appliction

Suppose Alice and Bob each think they know the same thing, and they need to prove to each other that they do, but they want to share this secret without allowing Carol to intercept it. For example, Alice and Bob want to prove to each other that they are using the same cryptographic key for a symmetric key cypher.

One way for Alice and Bob to prove that they share the same key is for each of them to apply a trapdoor function -- they can talk openly about what function to use, if it is truly non-invertable, and then they can each privately apply this function to their secret before they openly exchange the results.

After doing this, Alice and Bob each know the hash for their own secret and each knows the hash for the other's secret, so each of them can privately compare the two hash codes to see if they share the same secret. Carol also can know that they have successfully shared the secret, but Carol cannot know what that secret is because the trapdoor function prevents Carol from inverting it.


The Wikipedia is a good source:

For a rather opaque treatment of cyclic redundancy checks, see:

For secure hash functions, see

For the information content of the universe, see:

A New Large Number Coincidence and the Maximum Number of Bits in the Universe, by Scott Funkhouser