This is one of those pages documenting an effort that never seems to end. I thought it would end, but I keep coming up with things to try. This effort grew from a little curiosity. One day, my copy of "Numerical Recipes In C" fell open to the section on Huffman Coding. The algorithm looked fairly simple, but the source code that followed looked pretty complicated and relied on the vector library used throughout the book.

The complexity of the source in the book caused me to search the web for clearer source. Unfortunately, all I found was source further obfuscated by either C++ or Java language structures. Instead of searching any further, I decided to write my own implementation using what I hope is easy to follow ANSI C.

I thought that I could put everything to rest after implementing the basic Huffman algorithm. I thought wrong. Mark Nelson had mentioned that there are canonical Huffman codes which require less information to be stored in encoded files so that they may be decoded later. Now I have an easy to follow (I hope) ANSI C implementation of encoding and decoding using canonical Huffman codes.

As time passes, I've been tempted to make other enhancements to my implementation, and I've created different versions of code. Depending on what you're looking for, one version might suit you better than another.

Information on the different versions of my code, as well as instructions for downloading and building the source, may be found in the Actual Software section below.

The rest of this page discusses the results of my effort.


Algorithm Overview

Huffman coding is a statistical technique which attempts to reduce the number of bits required to represent a string of symbols. The algorithm accomplishes this by allowing the codes for symbols to vary in length. Shorter codes are assigned to the most frequently used symbols, and longer codes to the symbols which appear less frequently in the string (that's where the statistical part comes in). Arithmetic coding is another statistical coding technique.

Building a Huffman Tree

The Huffman code for an alphabet (set of symbols) may be generated by constructing a binary tree with nodes containing the symbols to be encoded and their probabilities of occurrence. This means that you must know all of the symbols that will be encoded and their probabilities prior to constructing a tree. The tree may be constructed as follows:

Step 1. Create a parentless node for each symbol. Each node should include the symbol and its probability.
Step 2. Select the two parentless nodes with the lowest probabilities.
Step 3. Create a new node which is the parent of the two lowest probability nodes.
Step 4. Assign the new node a probability equal to the sum of its children's probabilities.
Step 5. Repeat from Step 2 until there is only one parentless node left.
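
To make the steps concrete, here is a minimal sketch of the algorithm in ANSI C. The node layout and the O(n^2) search for the two lowest counts are illustrative simplifications of my own, not necessarily what the downloadable library does:

#include <stdlib.h>

typedef struct huffman_node_t
{
    int symbol;                 /* symbol value; -1 for internal nodes */
    unsigned int count;         /* occurrence count for this subtree */
    struct huffman_node_t *left, *right, *parent;
} huffman_node_t;

/* Build a Huffman tree following Steps 1 through 5.  nodes[] initially
 * holds n parentless leaves (Step 1); the root is returned. */
static huffman_node_t *BuildTree(huffman_node_t *nodes[], size_t n)
{
    while (n > 1)
    {
        size_t i, lo1, lo2;     /* indexes of the two lowest counts */
        huffman_node_t *parent;

        /* Step 2: find the two parentless nodes with the lowest counts */
        lo1 = (nodes[0]->count <= nodes[1]->count) ? 0 : 1;
        lo2 = 1 - lo1;

        for (i = 2; i < n; i++)
        {
            if (nodes[i]->count < nodes[lo1]->count)
            {
                lo2 = lo1;
                lo1 = i;
            }
            else if (nodes[i]->count < nodes[lo2]->count)
            {
                lo2 = i;
            }
        }

        /* Steps 3 and 4: create a parent whose count is the sum of its
         * children's counts */
        parent = malloc(sizeof(huffman_node_t));
        if (NULL == parent)
        {
            return NULL;
        }

        parent->symbol = -1;
        parent->count = nodes[lo1]->count + nodes[lo2]->count;
        parent->left = nodes[lo1];
        parent->right = nodes[lo2];
        parent->parent = NULL;
        nodes[lo1]->parent = parent;
        nodes[lo2]->parent = parent;

        /* Step 5: the new parent replaces its children in the
         * parentless set */
        nodes[lo1] = parent;
        nodes[lo2] = nodes[n - 1];
        n--;
    }

    return nodes[0];
}

A priority queue would make the selection of the two lowest counts faster, but the straight search keeps the sketch short.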

The code for each symbol may be obtained by tracing a path to the symbol from the root of the tree. A 1 is assigned for a branch in one direction and a 0 is assigned for a branch in the other direction. For example, a symbol which is reached by branching right twice, then left once, may be represented by the pattern '110' or '001'. The figure below depicts a sample tree where 'A' encodes to 0, 'B' encodes to 100, 'C' encodes to 101, 'D' encodes to 110, and 'E' encodes to 111.

    ABCDE
   /     \
(0)A     BCDE
       /      \
      BC        DE
    /   \      /   \
(100)B  (101)C (110)D (111)E

Once a Huffman tree is built, Canonical Huffman codes, which require less information to rebuild, may be generated by the following steps:

Step 1. Remember the lengths of the codes resulting from a Huffman tree generated per above.
Step 2. Sort the symbols to be encoded by the lengths of their codes (use symbol value to break ties).
Step 3. Initialize the current code to all zeros and assign code values to symbols from longest to shortest code as follows:

  A. If the current code length is greater than the length of the code for the current symbol, right shift off the extra bits.
  B. Assign the code to the current symbol.
  C. Increment the code value.
  D. Get the symbol with the next longest code.
  E. Repeat from A until all symbols are assigned codes.
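
Here is a sketch of Step 3 in ANSI C. For clarity it assumes codes fit in an unsigned long (my actual library uses bit arrays, since codes may be as long as 255 bits), and the entry layout is illustrative:

#include <stddef.h>

typedef struct
{
    int symbol;             /* symbol value */
    unsigned int codeLen;   /* length of this symbol's code in bits */
    unsigned long code;     /* right-justified code value */
} canonical_entry_t;

/* Assign canonical codes to entries[], which is already sorted with the
 * longest code first (Step 2) */
static void AssignCanonicalCodes(canonical_entry_t entries[], size_t n)
{
    unsigned long code = 0;                     /* start with all zeros */
    unsigned int length = entries[0].codeLen;   /* longest code length */
    size_t i;

    for (i = 0; i < n; i++)
    {
        if (length > entries[i].codeLen)
        {
            /* A: right shift off the extra bits */
            code >>= (length - entries[i].codeLen);
            length = entries[i].codeLen;
        }

        entries[i].code = code;     /* B: assign code to current symbol */
        code++;                     /* C: increment the code value */
    }                               /* D and E: on to the next symbol */
}

Applying these steps to the tree above reproduces Table 1 below.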

The tree above will generate the following canonical code:

TABLE 1. Table of Canonical Codes
Symbol   Code Length   Code
E        3             000
D        3             001
C        3             010
B        3             011
A        1             1

Encoding Data

Once a Huffman code has been generated, data may be encoded simply by replacing each symbol with its code.

Decoding Data

If you know the Huffman code for some encoded data, decoding may be accomplished by reading the encoded data one bit at a time. Once the bits read match the code for a symbol, write out the symbol and start collecting bits again. See Decoding Encoded Files for details.

References

A copy of the section from "Numerical Recipes In C" which started this whole effort may be found at https://lib-www.lanl.gov/numerical/bookcpdf/c20-4.pdf.

A copy of one of David Huffman's original publications about his algorithm may be found at https://www.ic.tu-berlin.de/fileadmin/fg121/Source-Coding_WS12/selected-readings/10_04051119.pdf.

A discussion on Huffman codes including canonical Huffman codes may be found at http://www.compressconsult.com/huffman/.


Implementation Issues

What is a Symbol

One of the first questions that needs to be resolved before you start is "What is a symbol?" For my implementation, a symbol is any 8-bit combination as well as an End Of File (EOF) marker. This means that there are 257 possible symbols in any code.

Handling End-of-File (EOF)

The EOF is of particular importance, because it is likely that an encoded file will not have a number of bits that is an integral multiple of 8. Most file systems require that files be stored in bytes, so it's likely that encoded files will have spare bits. If you don't know where the EOF is, the spare bits may be decoded as an extra symbol.

At the time I sat down to implement Huffman's algorithm, there were two ways that I could think of for handling the EOF. It could either be encoded as a symbol, or ignored. Ignoring the EOF requires that a count of the number of symbols encoded be maintained so that decoding can stop after all real symbols have been decoded and any spare bits can be ignored.

Later I learned about the "bijective" method for handling the EOF. For information on the "bijective" method refer to SCOTT's "one to one" compression discussion.

Encoding the EOF has the advantage of not requiring a count of the number of symbols encoded in a file. When I originally started out, I thought that a 257th symbol would allow for the possibility of a 17-bit code, and I didn't want to have to deal with 17-bit values in C. As it turns out, a 257th symbol creates the possibility of a 256-bit code, and I ended up writing a library that could handle 256-bit codes anyway. (See Code Generation.)

Consequently, I have two different implementations: a 0.1 version that contains a count of the number of symbols to be decoded, and versions 0.2 and later that encode the EOF.

Code Generation

The source code that I have provided generates a unique Huffman tree based on the number of occurrences of symbols within the file to be encoded. The result is a Huffman code that yields an optimal compression ratio for the file to be encoded. The algorithm to generate a Huffman tree and the extra steps required to build a canonical Huffman code are outlined above.

Using character counts to generate a tree means that a character may not occur more often than it can be counted. The counters used in my implementation are of the type unsigned int, therefore a character may not occur more than UINT_MAX times. My implementation checks for this and issues an error. If larger counts are required, the program may be modified to use a larger type or other arbitrary sized counters.
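
As a sketch, the counting might look something like this; the function name is illustrative, not the library's actual routine:

#include <limits.h>
#include <stdio.h>

/* Count symbol occurrences in fp.  counts[] must be zeroed by the
 * caller.  Returns -1 if any count would overflow an unsigned int. */
static int CountSymbols(FILE *fp, unsigned int counts[256])
{
    int c;

    while ((c = fgetc(fp)) != EOF)
    {
        if (UINT_MAX == counts[c])
        {
            fprintf(stderr, "Error: symbol count overflow\n");
            return -1;      /* a larger counter type would be required */
        }

        counts[c]++;
    }

    return 0;
}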

Code Length

In general, a Huffman code for an N symbol alphabet may yield codes as long as N - 1 bits. Following the rules outlined above, it can be shown that if, at every step that combines the two parentless nodes with the lowest probabilities, only one of the combined nodes already has children, an N symbol alphabet will produce two codes of length N - 1.

Example: Given a 6 symbol alphabet with the following symbol probabilities (shown as relative counts): A = 1, B = 2, C = 4, D = 8, E = 16, F = 32

Step 1. Combine A and B into AB with a probability of 3.
Step 2. Combine AB and C into ABC with a probability of 7.
Step 3. Combine ABC and D into ABCD with a probability of 15.
Step 4. Combine ABCD and E into ABCDE with a probability of 31.
Step 5. Combine ABCDE and F into ABCDEF with a probability of 63.

The following tree results:

      ABCDEF
     /      \
  (0)F      ABCDE
          /     \
      (10)E     ABCD
              /    \
        (110)D     ABC
                  /   \
            (1110)C    AB
                      /  \
              (11110)A    (11111)B

In order to handle a 256 character alphabet, which may require code lengths of up to 255 bits, I created a library that performs standard bit operations on arrays of unsigned characters. Versions prior to 0.3 use a library designed specifically for 256-bit arrays. Later versions use a bit array library designed for arbitrary length arrays. Both versions of the library are written in the same portable ANSI C as the rest of my Huffman code library.
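
The flavor of such a library can be sketched with a pair of operations on an array of unsigned char; the macro and function names here are illustrative, not the library's actual interface:

#include <limits.h>     /* CHAR_BIT */

/* byte containing a bit, and a mask for that bit (MSB first) */
#define BIT_CHAR(bit)   ((bit) / CHAR_BIT)
#define BIT_MASK(bit)   (1 << (CHAR_BIT - 1 - ((bit) % CHAR_BIT)))

static void BitArraySetBit(unsigned char *array, unsigned int bit)
{
    array[BIT_CHAR(bit)] |= BIT_MASK(bit);
}

static int BitArrayTestBit(const unsigned char *array, unsigned int bit)
{
    return 0 != (array[BIT_CHAR(bit)] & BIT_MASK(bit));
}

A real bit array library needs many more operations (copy, shift, increment, and so on), but these two convey the idea.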

Writing Encoded Files

I chose to write my encoded files in two parts. The first part contains information used to reconstruct the Huffman code (a header) and the second part contains the encoded data.

Header

In order to decode files, the decoding algorithm must know what code was used to encode the data. Being unable to come up with a clean way to store the tree itself, I chose to store information about the encoded symbols.

To reconstruct a traditional Huffman code, I chose to store a list of all the symbols and their counts. By using the symbol counts and the same tree generation algorithm that the encoding algorithm uses, a tree matching the encoding tree may be constructed.

To save some space, I only stored the non-zero symbol counts, and the end of the count data is indicated by an entry for character zero with a count of zero. The EOF count is not stored in my implementations that encode the EOF; both the encoder and decoder assume that there is only one EOF.
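
A sketch of writing such a header follows. The field widths and byte order shown are illustrative only; the actual header layout used by my implementations may differ:

#include <stdio.h>

/* Write the non-zero symbol counts followed by the (0, 0) terminator */
static void WriteCountHeader(FILE *out, const unsigned int counts[256])
{
    const unsigned int zero = 0;
    int c;

    for (c = 0; c < 256; c++)
    {
        if (counts[c] != 0)
        {
            fputc(c, out);                                      /* symbol */
            fwrite(&counts[c], sizeof(unsigned int), 1, out);   /* count */
        }
    }

    /* end of count data: character zero with a count of zero */
    fputc(0, out);
    fwrite(&zero, sizeof(unsigned int), 1, out);
}

Writing raw unsigned int values like this is what ties encoded files to a platform's type size and byte order, as noted under Portability Issues below.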

Canonical Huffman codes usually take less information to reconstruct than traditional Huffman codes. To reconstruct a canonical Huffman code, you only need to know the length of the code for each symbol and the rules used to generate the code. The header generated by my canonical Huffman algorithm consists of the code length for each symbol. If the EOF is not encoded the total number of encoded symbols is also included in the header.

Encoded Data

The encoding of the original data immediately follows the header. One natural by-product of a canonical Huffman code is a table containing symbols and their codes. This table allows for fast lookup of codes. If symbol codes are stored in tree form, the tree must be searched for each symbol to be encoded. Instead of searching the leaves of the Huffman tree each time a symbol is to be encoded, my traditional Huffman implementation builds a table of codes for each symbol. The table is built by performing a depth first traversal of the Huffman tree and storing the codes for the leaves as they are reached.
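
A depth first traversal that fills in such a table might look like the following sketch, which reuses the node type from the tree-building sketch above and again assumes codes fit in an unsigned long:

typedef struct
{
    unsigned long code;     /* the symbol's code, right-justified */
    unsigned int length;    /* number of bits in the code */
} code_entry_t;

/* Fill table[] (indexed by symbol) by depth first traversal */
static void BuildCodeTable(const huffman_node_t *node, code_entry_t table[],
    unsigned long code, unsigned int length)
{
    if ((NULL == node->left) && (NULL == node->right))
    {
        /* leaf: the path taken to get here is the symbol's code */
        table[node->symbol].code = code;
        table[node->symbol].length = length;
        return;
    }

    BuildCodeTable(node->left, table, code << 1, length + 1);        /* 0 branch */
    BuildCodeTable(node->right, table, (code << 1) | 1, length + 1); /* 1 branch */
}

The initial call passes the root of the tree with a zero-length code: BuildCodeTable(root, table, 0, 0).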

With a table of codes, writing encoded data is simple: read a symbol to be encoded, and write the code for that symbol. Since codes are generally not an integral number of bytes in length, care needs to be taken when writing each one; bits need to be aggregated into bytes. In my 0.1 version of the code, all the aggregation is done in-line. Versions 0.2 and later use my bitfile library to handle writing any number of bits to a file.
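
The aggregation step can be sketched as follows; this is an illustration of the idea, not the actual bitfile interface:

#include <stdio.h>

typedef struct
{
    FILE *fp;
    unsigned char buffer;   /* bits waiting to be written */
    int bitCount;           /* number of bits in buffer */
} bit_writer_t;

/* Accumulate one bit; flush to the file when a whole byte is ready */
static void WriteBit(bit_writer_t *bw, int bit)
{
    bw->buffer = (unsigned char)((bw->buffer << 1) | (bit ? 1 : 0));
    bw->bitCount++;

    if (8 == bw->bitCount)
    {
        fputc(bw->buffer, bw->fp);  /* a whole byte has accumulated */
        bw->buffer = 0;
        bw->bitCount = 0;
    }
}

Any bits left in the buffer when encoding finishes must be padded out to a whole byte; those are the spare bits discussed in the EOF section above.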

Decoding Encoded Files

Like encoding a file, decoding a file is a two step process. First the header data is read in, and the Huffman code for each symbol is reconstructed. Then the encoded data is read and decoded.

I have read that the fastest method for decoding symbols is to read the encoded file one bit at a time and traverse the Huffman tree until a leaf containing a symbol is reached. However, I have also read that it is faster to store the codes for each symbol in an array sorted by code length and search for a match every time a bit is read in. I have yet to see a proof for either side.

I do know that the tree method is faster for the worst case encoding where all symbols are 8 bits long. In this case an 8-bit code leads to a symbol just 8 levels down the tree, while a binary search on 256 symbols is O(log2(256)), or 8 steps, and the search may be repeated each time a bit is read in.

Since conventional Huffman encoding naturally leads to the construction of a tree for decoding, I chose the tree method here. The encoded file is read one bit at a time, and the tree is traversed according to each of the bits. When a bit causes a leaf of the tree to be reached, the symbol contained in that leaf is written to the decoded file, and traversal starts again from the root of the tree.
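
A sketch of that decode loop appears below. It reuses the node type from the tree-building sketch; the ReadBit() helper and the SYMBOL_EOF value are illustrative assumptions, not the library's actual interface:

#include <stdio.h>

#define SYMBOL_EOF  256     /* illustrative value for the encoded EOF */

/* Read one bit from fp, most significant bit first; -1 at end of file.
 * Static state keeps the sketch short (one file at a time). */
static int ReadBit(FILE *fp)
{
    static int buffer = 0;
    static int bitCount = 0;

    if (0 == bitCount)
    {
        if (EOF == (buffer = fgetc(fp)))
        {
            return -1;
        }

        bitCount = 8;
    }

    bitCount--;
    return (buffer >> bitCount) & 1;
}

/* Traverse the tree bit by bit, emitting a symbol and restarting at
 * the root each time a leaf is reached */
static void DecodeWithTree(const huffman_node_t *root, FILE *in, FILE *out)
{
    const huffman_node_t *node = root;
    int bit;

    while ((bit = ReadBit(in)) != -1)
    {
        node = bit ? node->right : node->left;  /* follow the bit */

        if ((NULL == node->left) && (NULL == node->right))
        {
            if (SYMBOL_EOF == node->symbol)
            {
                break;                  /* only spare bits remain */
            }

            fputc(node->symbol, out);   /* leaf: write decoded symbol */
            node = root;                /* start again from the root */
        }
    }
}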

Canonical Huffman encoding naturally leads to the construction of an array of symbols sorted by the size of their code. Consequently, I chose the array method for decoding files encoded with a canonical Huffman code. The encoded file is read one bit at a time, with each bit accumulating in a string of undecoded bits. Then all the codes of a length matching the string length are compared to the string. If a match is found, the string is decoded as the matching symbol and the bit string is cleared. The process repeats itself until all symbols have been decoded.
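
Here is a sketch of decoding one symbol this way, reusing the canonical entry type and the ReadBit() helper from the earlier sketches; a linear scan stands in for whatever lookup the real code uses:

/* Accumulate bits until they match a code of the same length */
static int DecodeSymbol(const canonical_entry_t entries[], size_t n, FILE *in)
{
    unsigned long bits = 0;     /* string of undecoded bits */
    unsigned int length = 0;    /* number of bits collected */
    size_t i;
    int bit;

    while ((bit = ReadBit(in)) != -1)
    {
        bits = (bits << 1) | (unsigned long)bit;
        length++;

        /* compare against every code whose length matches */
        for (i = 0; i < n; i++)
        {
            if ((entries[i].codeLen == length) && (entries[i].code == bits))
            {
                return entries[i].symbol;   /* match: decode and clear */
            }
        }
    }

    return -1;      /* ran out of bits without a match */
}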


Library Code Interface

Encoding Data (Traditional codes)

HuffmanEncodeFile

Declaration:

int HuffmanEncodeFile(FILE *inFile, FILE *outFile);

Description:

This routine generates a Huffman tree optimized for a file and writes out a version of that file encoded using the tree.

Parameters:

inFile - The file stream to be encoded. It must be opened and rewindable. NULL pointers will return an error.
outFile - The file stream receiving the encoded results. It must be opened. NULL pointers will return an error.

Effects:

inFile is encoded using a Huffman code and written to outFile. Neither file is closed after exit.

Returned:

0 for success, -1 for failure. errno will be set in the event of a failure.

Encoding Data (Canonical codes)

CanonicalEncodeFile

Declaration:

int CanonicalEncodeFile(FILE *inFile, FILE *outFile);

Description:

This routine generates a canonical Huffman tree optimized for a file and writes out a version of that file encoded using the tree.

Parameters:

inFile - The file stream to be encoded. It must be opened and rewindable. NULL pointers will return an error.
outFile - The file stream receiving the encoded results. It must be opened. NULL pointers will return an error.

Effects:

inFile is encoded using a Canonical Huffman code and written to outFile. Neither file is closed after exit.

Returned:

0 for success, -1 for failure. errno will be set in the event of a failure.

Decoding Data (Traditional codes)

HuffmanDecodeFile

Declaration:

int HuffmanDecodeFile(FILE *inFile, FILE *outFile);

Description:

This routine reads a file encoded by the Huffman coding algorithm and writes out a decoded version of that file.

Parameters:

inFile - The file stream to be decoded. It must be opened. NULL pointers will return an error.
outFile - The file stream receiving the decoded results. It must be opened. NULL pointers will return an error.

Effects:

inFile is decoded using a traditional Huffman algorithm and written to outFile. Neither file is closed after exit.

Returned:

0 for success, -1 for failure. errno will be set in the event of a failure.

Decoding Data (Canonical codes)

CanonicalDecodeFile

Declaration:

int CanonicalDecodeFile(FILE *inFile, FILE *outFile);

Description:

This routine reads a file encoded by the canonical Huffman coding algorithm and writes out a decoded version of that file.

Parameters:

inFile - The file stream to be decoded. It must be opened. NULL pointers will return an error.
outFile - The file stream receiving the decoded results. It must be opened. NULL pointers will return an error.

Effects:

inFile is decoded using a canonical Huffman algorithm and written to outFile. Neither file is closed after exit.

Returned:

0 for success, -1 for failure. errno will be set in the event of a failure.
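
As an example, a minimal program using the encode routine might look like the sketch below. The header name "huffman.h" is my assumption here; check the actual source for the correct include file:

#include <stdio.h>
#include "huffman.h"    /* assumed header name */

int main(void)
{
    FILE *inFile = fopen("sample.txt", "rb");
    FILE *outFile = fopen("sample.huf", "wb");
    int result = -1;

    if ((NULL != inFile) && (NULL != outFile))
    {
        result = HuffmanEncodeFile(inFile, outFile);

        if (0 != result)
        {
            perror("HuffmanEncodeFile");    /* errno is set on failure */
        }
    }

    if (NULL != inFile)
    {
        fclose(inFile);     /* the library leaves both files open */
    }

    if (NULL != outFile)
    {
        fclose(outFile);
    }

    return (0 == result) ? 0 : 1;
}

The decode routines are called the same way, with the encoded file as inFile.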

Portability Issues

All the source code that I have provided is written in strict ANSI-C. I would expect it to build correctly on any machine with an ANSI-C compiler. However, I have only tested the code on my 64-bit Linux PC using the gcc compiler.

The code makes no assumptions about the size of types or byte order (endianness), so it should be able to run on all platforms. However, type size and byte order differences will prevent files that are encoded on one platform from being decoded on another platform. The code also assumes that an array of unsigned char will be allocated in a contiguous block of memory.


Actual Software

I am releasing my implementations of Huffman's algorithms under the LGPL. If you've actually read this page to get all the way down here, you already know that I have different implementations. In general, earlier versions are simpler (maybe easier to follow) and later versions are easier to use as libraries and better suited for projects taking advantage of the LGPL. In some cases later versions also fix minor bugs. All versions of my source code are available on GitHub. I recommend that you check out the latest revision, unless you're looking for something specific.

Repository Location https://github.com/michaeldipperstein/huffman

If you have any further questions or comments, you may contact me by e-mail. My e-mail address is: mdipperstein@gmail.com


Last updated on December 27, 2018