The insanity continues. After playing with Huffman coding and LZSS, I had to give arithmetic coding a try. It wasn't enough to understand how it worked; I had to actually implement it.
I've read about what this stuff did to Ross Williams. I can only hope that I'm not going to be subjected to a similar fate.
As with my other compression implementations, my intent is to publish an easy-to-follow ANSI C implementation of the arithmetic coding algorithm. Anyone familiar with ANSI C and the arithmetic coding algorithm should be able to follow and learn from my implementation. There's a lot of room for improvement of compression ratios, speed, and memory usage, but this project is about learning and sharing, not perfection.
Recently (since 2008 or so) I've been using Python when I've needed quick hacks to do something on a PC (vs an embedded system). I eventually wrote a bitfile library for Python, then I realized that I needed to write some code that actually uses the library, so now there's a Python implementation of the arithmetic coding algorithm.
Information on downloading my source code appears near the end of this page.
The rest of this page discusses arithmetic coding and the results of my efforts so far.
Arithmetic coding is similar to Huffman coding; they both achieve their compression by reducing the average number of bits required to represent a symbol.
Given: an alphabet with symbols S0, S1, ... Sn, where each symbol has a probability of occurrence of p0, p1, ... pn such that ∑pi = 1.
From the fundamental theorem of information theory, it can be shown that the optimal coding for Si requires -log2(pi) bits.
More often than not, the optimal number of bits is fractional. Unlike Huffman coding, arithmetic coding provides the ability to represent symbols with fractional bits.
Since ∑pi = 1, we can represent each probability, pi, as a unique, non-overlapping range of values between 0 and 1. There's no magic in this; we're just creating ranges on a probability line.
For example, suppose we have an alphabet 'a', 'b', 'c', 'd', and 'e' with probabilities of occurrence of 30%, 15%, 25%, 10%, and 20%. We can choose the following range assignments to each symbol based on its probability:
Symbol | Probability | Range
---|---|---
a | 30% | [0.00, 0.30)
b | 15% | [0.30, 0.45)
c | 25% | [0.45, 0.70)
d | 10% | [0.70, 0.80)
e | 20% | [0.80, 1.00)

Table 1: Symbols, Probabilities, and Ranges
Here square brackets '[' and ']' mean the adjacent number is included, and parentheses '(' and ')' mean the adjacent number is excluded.
Range assignments like the ones in Table 1 can then be used for encoding and decoding strings of symbols in the alphabet. Algorithms using ranges for coding are often referred to as range coders.
By assigning each symbol its own unique probability range, it's possible to encode a single symbol by its range. Using this approach, we could encode a string as a series of probability ranges, but that wouldn't compress anything. Instead, additional symbols are encoded by restricting the current probability range by the range of each new symbol being encoded. The pseudo code below illustrates how additional symbols may be added to an encoded string by restricting the string's range bounds.
lower bound = 0
upper bound = 1

while there are still symbols to encode
    current range = upper bound - lower bound
    upper bound = lower bound + (current range × upper bound of new symbol)
    lower bound = lower bound + (current range × lower bound of new symbol)
end while
Any value between the computed lower and upper probability bounds now encodes the input string.
Example: encode the string "ace" using the probability ranges from Table 1.

Start with lower and upper probability bounds of 0 and 1.

Encode 'a'
    current range = 1 - 0 = 1
    upper bound = 0 + (1 × 0.30) = 0.30
    lower bound = 0 + (1 × 0.00) = 0.00

Encode 'c'
    current range = 0.30 - 0.00 = 0.30
    upper bound = 0.00 + (0.30 × 0.70) = 0.210
    lower bound = 0.00 + (0.30 × 0.45) = 0.135

Encode 'e'
    current range = 0.210 - 0.135 = 0.075
    upper bound = 0.135 + (0.075 × 1.00) = 0.210
    lower bound = 0.135 + (0.075 × 0.80) = 0.195

The string "ace" may be encoded by any value within the probability range [0.195, 0.210).
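To see the pseudo code in action, here's a minimal C sketch of the floating point version with the Table 1 ranges hard-coded. The array names are mine, for illustration only; as discussed in the implementation section, a real coder can't rely on floating point precision.

```c
#include <stdio.h>
#include <string.h>

/* Table 1 ranges, indexed by symbol 'a' through 'e' */
static const double lows[]  = {0.00, 0.30, 0.45, 0.70, 0.80};
static const double highs[] = {0.30, 0.45, 0.70, 0.80, 1.00};

int main(void)
{
    const char *str = "ace";
    double lower = 0.0;
    double upper = 1.0;
    size_t i;

    for (i = 0; i < strlen(str); i++)
    {
        int s = str[i] - 'a';
        double range = upper - lower;       /* current range */

        /* restrict the bounds by the new symbol's range */
        upper = lower + (range * highs[s]);
        lower = lower + (range * lows[s]);
    }

    /* any value in [lower, upper) encodes the string */
    printf("[%f, %f)\n", lower, upper);     /* prints [0.195000, 0.210000) */
    return 0;
}
```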
It should become apparent from the example that precision requirements increase as additional symbols are encoded. Strings of unlimited length require infinite precision probability range bounds. The section on implementation discusses how the need for infinite precision is handled.
The decoding process must start with an encoded value representing a string. By definition, the encoded value lies within the lower and upper probability range bounds of the string it represents. Since the encoding process keeps restricting ranges (without shifting), the initial value also falls within the range of the first encoded symbol. Successive encoded symbols are identified by removing the scaling applied by the known symbol. To do this, subtract out the lower probability range bound of the known symbol, and divide by the size of the symbol's range.
Based on the discussion above, decoding a value may be performed following the steps in the pseudo code below:
encoded value = encoded input

while string is not fully decoded
    identify the symbol containing encoded value within its range
    // remove effects of symbol from encoded value
    current range = upper bound of new symbol - lower bound of new symbol
    encoded value = (encoded value - lower bound of new symbol) ÷ current range
end while
Example: using the probability ranges from Table 1, decode the three character string encoded as 0.20.

Decode first symbol
    0.20 is within [0.00, 0.30)
    0.20 encodes 'a'
Remove effects of 'a' from encoded value
    current range = 0.30 - 0.00 = 0.30
    encoded value = (0.20 - 0.00) ÷ 0.30 = 0.67 (rounded)

Decode second symbol
    0.67 is within [0.45, 0.70)
    0.67 encodes 'c'
Remove effects of 'c' from encoded value
    current range = 0.70 - 0.45 = 0.25
    encoded value = (0.67 - 0.45) ÷ 0.25 = 0.88

Decode third symbol
    0.88 is within [0.80, 1.00)
    0.88 encodes 'e'

The encoded string is "ace". In case you were sleeping, this is the string that was encoded in the encoding example.
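A matching C sketch of the floating point decode, again with the Table 1 ranges hard-coded and the string length known in advance (both assumptions that the implementation section removes):

```c
#include <stdio.h>

/* Table 1 ranges, indexed by symbol 'a' through 'e' */
static const double lows[]  = {0.00, 0.30, 0.45, 0.70, 0.80};
static const double highs[] = {0.30, 0.45, 0.70, 0.80, 1.00};

int main(void)
{
    double value = 0.20;                    /* the encoded input */
    int count;

    for (count = 0; count < 3; count++)     /* three symbols, by fiat */
    {
        int s;

        /* identify the symbol whose range contains the value */
        for (s = 0; value >= highs[s]; s++)
        {
        }

        putchar('a' + s);

        /* remove the effects of the symbol from the value */
        value = (value - lows[s]) / (highs[s] - lows[s]);
    }

    putchar('\n');                          /* prints "ace" */
    return 0;
}
```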
There are two issues I've glossed over with this example: knowing when to stop, and the required computational precision. The section on implementation addresses both of these issues.
This section discusses some of the issues of implementing an arithmetic encoder/decoder and the approach that I took to handle those issues. As I have already stated, my implementation is intended to be easy to develop and follow, not necessarily optimal.
One of the first questions that needs to be resolved before you start is "What is a symbol?". For my implementation a symbol is any 8 bit combination as well as an End Of File (EOF) marker. This means that there are 257 possible symbols in any encoded stream.
In my algorithm overview I noted that the lower and upper probability range bounds, as well as the encoded value, theoretically require infinite precision values on the range [0, 1). The method chosen to handle this largely impacts the rest of the implementation. Not wanting to blaze new trails, I chose to use the common solution: values on the probability range scale are represented as infinite length fixed point values, with the binary point before the most significant bit.
Example:
.00000... = 0
.10000... = 1/2
.01000... = 1/4
.11000... = 3/4
.11111... = 1
Fortunately for us, we don't need to operate on all the bits at once. As additional symbols are encoded, the lower and upper range bounds converge. As the bounds converge, their most significant bits stop changing. By choosing the precision of each symbol's range bounds, we can also limit the number of bits required for each computation.
As you read through this section, you will learn how it is possible to achieve good compression results using 16 bit computations.
I've implemented both static and adaptive order-0 probability models. My static model computes probability ranges based on symbol counts obtained by making a pass over the input prior to starting the encoding. Others may use a predetermined set of ranges. My adaptive model assumes a uniform distribution and updates itself as new symbols are processed.
For my static model, the first step in computing probability ranges for each symbol is to simply count occurrences of each symbol, as well as the total number of symbols.
The next step is to scale all of the symbol counts so that they can be used in N bit computations. To do this, symbol counts must be scaled to use no more than N - 2 bits. In the case of 16 bit computations, counts must be 14 bits or less. You'll see why in the next section.
Use the following steps to scale the counts for use with N bit integer math:

Step 1. Divide the total symbol count by 2^(N - 2).
Step 2. Using integer division, divide each individual symbol count by the ceiling of the result of Step 1.
Step 3. If a non-zero count became zero, make it one.
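Here's a C sketch of those three steps, assuming N = 16 (so counts must fit in 14 bits) and the 257 symbol alphabet described above. The names are illustrative, not the actual functions from my source:

```c
#define NUM_SYMBOLS 257                 /* 256 byte values + EOF */
#define MAX_TOTAL   (1UL << 14)         /* totals must fit in N - 2 = 14 bits */

/* scale raw symbol counts so the total fits in 14 bits (Steps 1 - 3) */
void ScaleCounts(const unsigned long counts[NUM_SYMBOLS],
                 unsigned short scaled[NUM_SYMBOLS])
{
    unsigned long total = 0;
    unsigned long divisor;
    int i;

    for (i = 0; i < NUM_SYMBOLS; i++)
    {
        total += counts[i];
    }

    /* Step 1: ceiling of total / 2^(N - 2) */
    divisor = (total + MAX_TOTAL - 1) / MAX_TOTAL;

    if (divisor < 1)
    {
        divisor = 1;                    /* guard against an empty input */
    }

    for (i = 0; i < NUM_SYMBOLS; i++)
    {
        /* Step 2: integer division by the ceiling from Step 1 */
        scaled[i] = (unsigned short)(counts[i] / divisor);

        /* Step 3: don't let a non-zero count scale to zero */
        if ((counts[i] != 0) && (scaled[i] == 0))
        {
            scaled[i] = 1;
        }
    }
}
```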
Now we have a scaled count for each symbol. We need to convert the scaled counts to range bounds on a probability line. For an alphabet with symbols S0, S1, ... Sn and scaled counts c0, c1, ... cn, we can define ranges as follows:
Symbol | Range
---|---
S0 | [0, c0)
S1 | [upper range of S0, upper range of S0 + c1)
S2 | [upper range of S1, upper range of S1 + c2)
... | ...
Sn | [upper range of Sn-1, upper range of Sn-1 + cn)
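Converting scaled counts to range bounds is a single accumulation pass. A sketch, reusing the definitions from the previous snippet:

```c
/* build cumulative range bounds from scaled counts; ranges[i] is the
 * lower bound of symbol i and ranges[i + 1] is its upper bound */
void BuildRanges(const unsigned short scaled[NUM_SYMBOLS],
                 unsigned short ranges[NUM_SYMBOLS + 1])
{
    int i;

    ranges[0] = 0;

    for (i = 0; i < NUM_SYMBOLS; i++)
    {
        ranges[i + 1] = ranges[i] + scaled[i];
    }

    /* ranges[NUM_SYMBOLS] is now the total count, ∑ci */
}
```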
For my adaptive model, we start with scaled counts, and the N - 2 bit restriction still applies. It's possible for the addition of a new symbol to push the total symbol count past what can be represented in N - 2 bits (see Adaptive Model and Symbol Range Updates). When this happens, I just rescale the current model to half the symbol count.
Use the following steps to rescale an adaptive model for use with N bit integer math:

Step 1. Determine the probability range for each symbol in the model.
Step 2. Halve each of the ranges from Step 1.
Step 3. The new range for symbol Si is [upper bound of symbol Si-1, upper bound of symbol Si-1 + range from Step 2).

NOTE: The upper bound of symbol S-1 = 0.
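A sketch of the rescale operation in C, written in terms of the cumulative bounds array from the previous snippets. The minimum count of 1 is my assumption, to keep every symbol in the model decodable:

```c
/* halve the model's ranges when the total count grows too large;
 * operating on individual counts and rebuilding the cumulative
 * bounds is equivalent to Steps 1 - 3 above */
void RescaleModel(unsigned short ranges[NUM_SYMBOLS + 1])
{
    unsigned short count;
    unsigned short prev;
    int i;

    prev = ranges[0];           /* always 0 */

    for (i = 0; i < NUM_SYMBOLS; i++)
    {
        /* Step 1: recover symbol i's individual range */
        count = ranges[i + 1] - prev;
        prev = ranges[i + 1];

        /* Step 2: halve it (assumed minimum of 1 so no symbol vanishes) */
        count /= 2;
        if (count == 0)
        {
            count = 1;
        }

        /* Step 3: rebuild the cumulative upper bound */
        ranges[i + 1] = ranges[i] + count;
    }
}
```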
The standard form of arithmetic coding's encoding is based on fractional ranges on a probability line between 0 and 1. We just assigned our symbols probability ranges on a probability line between 0 and ∑ci (the rescaled total symbol count). The encoding algorithm needs to be adapted to our new range scale.
The following pseudo code modifies the standard form of the encoding algorithm for a range scale of [0, ∑ci):
lower bound = 0
upper bound = ∑ci

while there are still symbols to encode
    current range = upper bound - lower bound + 1
    upper bound = lower bound + ((current range × upper bound of new symbol) ÷ ∑ci) - 1
    lower bound = lower bound + ((current range × lower bound of new symbol) ÷ ∑ci)
end while
There are a few important things to know about this algorithm:

- The bounds are treated as an inclusive pair [lower bound, upper bound], which stands for all of the values up to upper bound + 1 on the probability line. That is why 1 is added when computing the current range and subtracted when computing the new upper bound.
- The computations must fit in N bit integers, but the current range (max upper bound - min lower bound + 1) can be as large as 2^N, so intermediate values like (current range × upper bound of new symbol) need a wider integer type.
- Keeping ∑ci to N - 2 bits or less ensures that (current range ÷ ∑ci) ≥ 2, which in turn keeps the upper and lower bounds from crossing.
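In C, one iteration of this loop might look like the sketch below. The bounds are 16 bit values, but the intermediate math is done in an unsigned long because current range × symbol bound can exceed 16 bits. Here ranges[] is the cumulative bounds array built earlier, total is ∑ci (ranges[NUM_SYMBOLS]), and the function name is mine:

```c
/* restrict the current bounds by one symbol's range, using integer math */
void ApplySymbolRange(int symbol, const unsigned short ranges[],
                      unsigned short *lower, unsigned short *upper,
                      unsigned long total)
{
    /* the current range may be as large as 2^16, so compute it in a long */
    unsigned long range = (unsigned long)(*upper - *lower) + 1;

    /* upper must be updated first, while *lower still holds the old bound */
    *upper = (unsigned short)(*lower + (range * ranges[symbol + 1]) / total - 1);
    *lower = (unsigned short)(*lower + (range * ranges[symbol]) / total);
}
```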
The standard form of arithmetic coding's decoding is also based on fractional ranges on a probability line between 0 and 1. Like encoding, we have to rescale our calculations for the decoding process.
The following pseudo code modifies the standard form of the decoding algorithm for a range scale of [0, ∑ci):
encoded value = encoded input
lower bound = 0
upper bound = ∑ci

while string is not fully decoded
    // remove the scaling applied by the current bounds
    current range = upper bound - lower bound + 1
    unscaled value = ((encoded value - lower bound) + 1) × ∑ci ÷ current range
    identify the symbol containing unscaled value within its range
    update lower and upper bounds as in the encoding algorithm
end while
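The unscaling step translates to a one-line computation. The sketch below includes a - 1 that the pseudo code glosses over; without it, an encoded value equal to the upper bound would unscale to ∑ci, one past the last valid symbol index:

```c
/* recover the unscaled value used to identify the next symbol */
unsigned long UnscaleValue(unsigned long encoded,
                           unsigned long lower,
                           unsigned long upper,
                           unsigned long total)
{
    unsigned long range = upper - lower + 1;

    /* undo the scaling applied by the current bounds; the - 1 keeps
     * the result in [0, total - 1] when encoded equals the upper bound */
    return (((encoded - lower) + 1) * total - 1) / range;
}
```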
We have a way of representing fractional values as an infinite stream of bits, we have a way of scaling our probability ranges to N - 2 bits, and we can encode and decode using the scaled values. This still doesn't get us to a finite number of bits. There are a few facts that we can use to develop a solution that limits computations to N bits.
As the lower and upper probability range bounds start to converge, they are likely to reach a point where the have identical MSBs (most significant bits). Once the bounds have matching MSBs, the MSBs will remain matching for the rest of the encoding process. From that point on we don't need to worry about the MSBs anymore. Shift the MSB out to the encoded data stream, and shift in a 0 and a 1 for the LSBs of the lower and upper range limits respectively. Since we didn't perform an operation that affects bits beyond the Nth, we know the LSBs will be 0 and 1.
Example:
Lower Range Bound = 1011010111001101
Upper Range Bound = 1101001001001100
Shift out the MSB (1) and write it to the encoded output
Lower Range Bound = 011010111001101?
Upper Range Bound = 101001001001100?
Shift in the new LSB
Lower Range Bound = 0110101110011010
Upper Range Bound = 1010010010011001
Now that we have a rule for shifting bits through our N bit variables, we can do everything with N bit math. Well, almost. It turns out there's one thing I overlooked. What happens when the lower and upper range bounds start to converge around 1/2 (0111... and 1000...)? This condition is called an underflow condition. When an underflow condition occurs, the MSBs of the range bounds will never match.
The good news is that there's a fairly simple way to handle underflow. We can recognize that an underflow condition is pending when the two MSBs of the lower range bound are different from the two MSBs of the upper range bound. What we have to do in this situation is to remove the second bit from both range bounds, shift the other bits left, and remember that we had an underflow condition. We may still have an underflow condition so this process may need to be repeated. When we finally get an opportunity to shift out the MSB, follow it with the underflow bit(s) of the opposite value of the converged MSB.
Example:
Lower Range Bound = 0111010111001101
Upper Range Bound = 1011001001001100
This is an underflow condition; remove the second MSB and shift all the bits over.
Lower Range Bound = 0110101110011010
Upper Range Bound = 1110010010011001
Continue with encoding until MSBs match.
Lower Range Bound = 1011010111001101
Upper Range Bound = 1101001001001100
Write out the MSB (1) and underflow bit (0), and shift in the new LSB.
Lower Range Bound = 0110101110011010
Upper Range Bound = 1010010010011001
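Putting the MSB rule and the underflow rule together gives a renormalization loop something like the C sketch below, called after each symbol is encoded. WriteBit is an assumed bit-level output helper (my real code uses my bitfile library; these names are placeholders):

```c
extern void WriteBit(int bit);      /* assumed bit-level output helper */

#define MASK_BIT(x) ((unsigned short)(1 << (15 - (x))))

/* write out converged MSBs, track underflow, and expand the bounds */
void RenormalizeBounds(unsigned short *lower, unsigned short *upper,
                       int *underflow_bits)
{
    for (;;)
    {
        if ((*upper & MASK_BIT(0)) == (*lower & MASK_BIT(0)))
        {
            /* MSBs match: emit the bit, then any saved underflow bits,
             * which are the opposite of the converged MSB */
            int bit = (*upper & MASK_BIT(0)) != 0;
            WriteBit(bit);

            while (*underflow_bits > 0)
            {
                WriteBit(!bit);
                (*underflow_bits)--;
            }
        }
        else if ((*lower & MASK_BIT(1)) && !(*upper & MASK_BIT(1)))
        {
            /* pending underflow: lower = 01..., upper = 10...
             * remove the second MSB from both and remember we did */
            (*underflow_bits)++;
            *lower &= ~(MASK_BIT(0) | MASK_BIT(1));
            *upper |= MASK_BIT(1);
        }
        else
        {
            return;                 /* nothing left to shift out */
        }

        /* shift in a 0 for lower's LSB and a 1 for upper's LSB */
        *lower <<= 1;
        *upper <<= 1;
        *upper |= 1;
    }
}
```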
In my decoding example I stated that a three symbol string should be decoded, but if you notice, there is still a coded value when we're done. We stopped decoding because the problem said to only decode three symbols.
One of the methods for knowing when to stop the decoding process is to include a symbol count somewhere in the encoded file. The file header is an ideal place to store the symbol count.
Another approach is to implement a "bijective" coder. The idea of a "bijective" implementation seems like more of a hassle than a benefit. If you're interested in learning about "bijective" encoding/decoding, refer to SCOTT's "one to one" compression discussion.
I took the approach of encoding an EOF. The upside is that I don't need to include a total symbol count in my header. I don't even need to count EOFs, because there's only one. The downsides are that the EOF does take up space in the probability line, and encoding the EOF may also add extra bits to the encoded file.
I performed a few tests comparing including a total symbol count in the file header against encoding the EOF. Encoding the EOF appeared to result in the better average compression ratios. My testing was far from scientific. If I get some time I will actually do a mathematical analysis of the two methods to determine when one method is better than the other.
Since I stop my encoding process after encoding the EOF, that is the point at which I know the final probability range bounds. However, our probability range bounds have infinite precision, and we need to write enough bits to be sure that the value we output falls within the range for the string we just encoded. Since the decoder also uses N bits, we just need to make sure it will have N bits to work with when it decodes the EOF. To do this, output the lower range bound's second MSB, all remaining underflow bits, plus an additional underflow bit.
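A sketch of that flush step, reusing the placeholder WriteBit helper and MASK_BIT macro from the previous snippet:

```c
/* flush the encoder: emit enough bits for the decoder to uniquely
 * identify the final range */
void FlushEncoder(unsigned short lower, int underflow_bits)
{
    int bit = (lower & MASK_BIT(1)) != 0;   /* lower bound's second MSB */

    WriteBit(bit);

    /* follow with the saved underflow bits, plus one more, all the
     * opposite of the bit just written */
    underflow_bits++;
    while (underflow_bits > 0)
    {
        WriteBit(!bit);
        underflow_bits--;
    }
}
```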
The arithmetic coding algorithm is well suited for both static and adaptive probability models. Encoders and decoders using adaptive probability models start with a fixed model and use a set of rules to adjust the model as symbols are encoded/decoded. Encoders and decoders that use static symbol probability models start with a model that doesn't change during the encoding process. Static probability models may be constant regardless of what is being encoded or they may be generated based on the encoded input.
My static probability model implementation uses a file header. Symbol probabilities are computed prior to the start of the encoding process and they are stored in a file header for the decoder to use at the start of the decoding process.
The file header is a section of data prior to the encoded output that includes scaled range counts for each encoded symbol (except the EOF, whose count is always one). Since I've scaled my counts to all be 14 bit values, I cheat a little and just write 14 bit values to the encoded file. Handling the 14 bit values is only slightly more complicated than handling 16 bit values, and it makes the file 512 bits smaller.
In the section on Computing Probability Ranges I touched on the potential need to rescale adaptive probability ranges and how I did it. What I neglected to mention was why this happens and what it means.
As its name would suggest, the adaptive model updates with the data stream. I chose to update my model after a symbol is encoded or decoded. After a symbol is encoded or decoded, its upper bound and the bounds of every symbol after it must be incremented by one. On average that's 128 updates per encode/decode in a 256 symbol alphabet. Others have tried to reduce the number of updates required by placing the more common symbols near the end of the list of ranges. Laziness and my inability to see a great performance benefit kept me from trying that approach.
If adding a new symbol causes the total probability to exceed N - 2 bits, the probability ranges must be rescaled. One side effect of rescaling is that symbols encoded/decoded after a rescale operation will have twice the weight of symbols encoded prior to the rescale (the weight of prior symbols is halved). I have a feeling that this may be beneficial more often than not, but it's not intended to be a feature of my implementation.
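A sketch of the update, reusing the ranges array, MAX_TOTAL, and RescaleModel from the earlier snippets:

```c
/* adaptive model update: after coding a symbol, bump its upper bound
 * and the bounds of every symbol after it */
void UpdateModel(unsigned short ranges[NUM_SYMBOLS + 1], int symbol)
{
    int i;

    for (i = symbol + 1; i <= NUM_SYMBOLS; i++)
    {
        ranges[i]++;
    }

    /* rescale when the total count no longer fits in N - 2 bits */
    if (ranges[NUM_SYMBOLS] >= MAX_TOTAL)
    {
        RescaleModel(ranges);
    }
}
```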
Declaration:

    int ArEncodeFile(FILE *inFile, FILE *outFile, const model_t model);

Description:
This routine generates a list of arithmetic code ranges for an input file and then uses the ranges to write out an encoded version of that file.

Parameters:
inFile - The file stream to be encoded. It must be opened, and it must also be rewindable if a static model is used. If NULL, stdin will be used.
outFile - The file stream receiving the encoded results. It must be opened as binary. If NULL, stdout will be used.
model - A model_t value indicating whether a static model or a dynamic model is to be used.

Effects:
inFile is arithmetically encoded and written to outFile. Neither file is closed after exit.

Returned:
0 for success, non-zero for failure. errno will be set in the event of a failure.
Declaration:

    int ArDecodeFile(FILE *inFile, FILE *outFile, const model_t model);

Description:
This routine opens an arithmetically encoded file, reads its header, and builds a list of probability ranges which it then uses to decode the rest of the file.

Parameters:
inFile - The file stream containing the encoded input. It must be opened as binary. If NULL, stdin will be used.
outFile - The file stream receiving the decoded results. It must be opened as binary. If NULL, stdout will be used.
model - A model_t value indicating whether a static model or a dynamic model is to be used.

Effects:
The arithmetically encoded file inFile is decoded and the results are written to outFile. Neither file is closed after exit.

Returned:
0 for success, non-zero for failure. errno will be set in the event of a failure.
All of the C source code that I have provided is written in strict ANSI C. I would expect it to build correctly on any machine with an ANSI C compiler. I have tested the code compiled with gcc on several Linux distributions as well as mingw on Windows XP. (Un)fortunately, I don't have a modern Windows system to test on.
There are some compile time options that offer minimal speed-up when compiling code for a little endian target, but the little endian code is disabled by default.
My C implementation assumes that long integers (long) are bigger than short integers (short); if this is not the case with your compiler, a compile time error should be issued.
The C implementation is also incapable of handling files with greater than ULONG_MAX bytes to be encoded. The program will issue an error and gracefully terminate if the file being encoded is too large.
The Python code was tested using Python 2.6 on Linux and Windows XP. It's possible that minor tweaks will be required to get the code to run properly using other versions of Python.
I found Arturo Campos' articles on arithmetic coding to be a huge help. Unfortunately, all that is left of them is the Wayback Machine archive.
This article that Mark Nelson wrote on arithmetic coding back in 1991 was also a huge help. In 2014, 10 years after my original implementation, Mark published this article, which he encourages people to read first.
The articles from Arturo Campos and Mark Nelson provided most of the information that I used to implement the algorithm.
An excellent paper on arithmetic coding was published by Eric Bodden, Malte Clasen, and Joachim Kneis after my initial publication of this web page. It is recommended reading for anybody interested in arithmetic coding.
Lazar Sumar also has some interesting thoughts on arithmetic coding that he discusses on his blog.
I am releasing my implementations of the arithmetic coding algorithm under the LGPL. The source code repositories are available on GitHub:
C Version | https://github.com/michaeldipperstein/arcode |
---|---|
Python Version | https://github.com/michaeldipperstein/arcode-py |
If you have any further questions or comments, you may contact me by e-mail. My e-mail address is: mdipperstein@gmail.com
Last updated on December 23, 2018