The day after I posted my source and description for arithmetic coding, I became unable to resist the temptation to bang out something on RLE. I had a few hours of "free time" and what better way to fill them than to publish an Run Length Encoding (RLE) implementation? And so my version 0.1 library was born.

Almost two years after that, I found myself discussing various RLE techniques, and the discussion touched on the PackBits technique. Thus the temption to bang out more on RLE arose. The result of that effort brings you rle-0.2, which provides support for traditional and PackBits styles of encoding and decoding.

As with my other compression implementations, my intent is to publish an easy to follow ANSI C implementation of a run length encoding (RLE) algorithm. Anyone familiar with ANSI C and RLE should be able to follow and learn from my implementation. I'm sure that there's room for improvement of compression ratios, speed, and memory usage, but this project is about learning and sharing, not perfection.

Click here for a link to my RLE source. The rest of this page discusses RLE and my implementation.


Algorithm Overview

Run length encoding stands out from other methods of compression. It does not try to reduce the average symbol size like Huffman coding or arithmetic coding, and it doesn't replace strings with dictionary references like Lemple-Ziv and Lemple-Ziv-Welch style coding. RLE replaces a string of repeated symbols with a single symbol and a count (run length) indicating the number of times the symbol is repeated.

Example:
The string:
"aaaabbcdeeeeefghhhij"
may be replaced with
"a4b2c1d1e5f1g1h3i1j1"

The numbers are in bold to indicate that they are values, not symbols.

If you count the before and after string size, you'll notice that RLE didn't make this string any smaller. Part of the reason is because it now takes a symbol and a value to represent what used to take just single a single symbol.

One of the ways of preventing single symbols from expanding is to only apply coding to runs of more than one and use the run length to indicate the number of additional copies of the encoded symbol that are required. Ideally that would give us a string like this:
"a3b1cde4fgh2ij"

Now the string is shorter, but we have another problem. How do we know whether a symbol is being followed by another symbol or a run length? We could borrow a page out of LZSS and include an encoded/not encoded flag with each symbol. An other option is to follow each encoded symbol with a designated escape code that will indicate a run length follows.

The escape code solution is pretty easy to handle, except that you need to be able to make sure that the escape code doesn't occur in your uncoded input, or you need to be able to indicate if you're just handling a symbol that is the same as an escape code.

Someone much more creative than I am at this moment came up with a neat idea that solves this problem. The escape code does not need to be a fixed value, instead an escape code will be inferred whenever a symbol repeats itself. The value that follows the repeated symbol will indicate the number of additional symbols that follow. Using this scheme, the example string is encoded like this:
"aa2bb0cdee3fghh1ij"

For lack of a better name, I'll refer to this technique as traditional RLE. The downside to traditional RLE is that one more symbol is required for each run, and runs of length 2 take more space encoded than not encoded (two symbols and a count of 0). However, based on a few brief tests on files containing C source, the double symbol escape sequence produced better results than following all symbols with a run length.

One technique for reducing the cost of two symbol runs is to encode two types of blocks: blocks that are copied verbatim, and runs. Runs of two symbols may be placed in the blocks copied verbatim. However, this technique incurs a penalty in order to be able to indicate that a block should be copied verbatim.

This method was popularized by the PackBits algorithm. The PackBits algorithm will precede a block of data with a one byte header n, where n is interpreted as follows:

TABLE 1. PackBits Block Header Interpretation.
n Meaning
0 to 127 Copy the next n + 1 symbols verbatim
-127 to -1 Repeat the next symbol 1 - n times
-128 Do nothing

Hopefully you've been paying attention and noticed the following areas of improvements in the PackBits algorithm:

  1. The -128 header could be used for a run of 129 symbols, but it's not.
  2. It takes two bytes to indicate a run of two, so nothing is saved.

The library I provide implements a variant of the PackBits algorithm where the one byte header is interpreted as follows:

TABLE 2. Variant PackBits Block Header Interpretation.
n Meaning
0 to 127 Copy the next n + 1 symbols verbatim
-128 to -1 Repeat the next symbol 2 - n times

This variant algorithm allows runs to be two symbols longer, but cannot encode two byte runs. Using this scheme, the example string is encoded like this:
"-2a3bbcd-3e1fg-1h1ij"

Encoding Strings

Traditional RLE

Encoding using traditional RLE is fairly simple:

Step 1. Set the previous symbol equal to an unmatchable value.
Step 2. Read the next symbol from the input stream.
Step 3. If the symbol is an EOF exit.
Step 4. Write out the current symbol.
Step 5. If the symbol is an does not match the previous symbol, set the previous symbol to the current symbol, and go to step 2.
Step 6. Read and count additional symbols until a non-matching symbol is found. This is the run length.
Step 7. Write out the run length.
Step 8. Write out the non-matching symbol.
Step 9. Set the previous symbol to the non-matching symbol, and go to step 2.

When actually implementing traditional RLE, a little attention to detail is required in Step 6. The run length is stored in a finite number of bits (I used an unsigned char). Runs longer than the amount that can be counted need to be broken up into to smaller runs. When the maximum count is reached, just write the count value and start the process of looking for a run all over again. You also need to handle the case where a run is ended by an EOF. When a run is ended by an EOF, write out the run length and exit.

That's all there is to it.

PackBits Variant

Encoding using the PackBits variant is slightly more complicate than traditional RLE. The block header cannot be written until the type of block and it's length have been determined. Until then data must be held in a buffer of the maximum length that can be copied verbatim. The following steps describe PackBits style encoding:

Step 1. Read symbols from the input stream into the buffer until one of the following occurs:

  1. The buffer is full (go to step 2)
  2. An EOF is reached (go to step 3)
  3. The last three symbols are identical (go to step 4)

Step 2. If the buffer is full:

  1. write the buffer size - 1
  2. write contents of the buffer
  3. go to step 1

Step 3. If the symbol is an EOF:

  1. write the buffer size - 1
  2. write contents of the buffer
  3. exit

Step 4. If the last three symbols match, a run has been found. Determine the number of symbols in the buffer prior to the start of the run (n).
Step 5. Write n - 1 followed by the contents of the buffer up to the start of the run.
Step 6. Set the run length to 3.
Step 7. Read additional symbols until a non-matching symbol is found. Increment the run length for each matching symbol.
Step 8. Write out 2 - the run length the run length followed by the run symbol.
Step 9. Write the non-matching symbol to the buffer and go to step 1.

That's pretty much all there is. You need to stop counting your run length in Step 6 if it reaches the maximum length you can account for in your header. My actual implementation is also a little less greedy. When I reach the maximum number of symbols that can be copied verbatim, I read an extra symbol or two in case the symbols at the end of a buffer are actually the start of a run.

Decoding Strings

Traditional RLE

Decoding traditionally encoded strings is even easier than encoding. Not only are there less steps, but there are no caveats. To decode a traditionally encoded stream:

Step 1. Set the previous symbol equal to an unmatchable value.
Step 2. Read the next symbol from the input stream.
Step 3. If the symbol is an EOF exit.
Step 4. Write out the current symbol.
Step 5. If the symbol is an does not match the previous symbol, set the previous symbol to the current symbol, and go to step 2.
Step 6. Read the run length.
Step 7. Write out a run of the current symbol as long as indicated by the run length.
Step 8. Go to step 1.

PackBits Variant

If that wasn't easy enough, it is even easier to decode strings encoded by the variant PackBits algorithm. To decode a variant PackBits encoded stream:

Step 1. Read the block header (n).
Step 2. If the header is an EOF exit.
Step 3. If n is non-negative, copy the next n + 1 symbols to the output stream and go to step 1.
Step 4. If n is negative, write 2 - n copies of the next symbol to the output stream and go to step 1.


Library Code Interface

Encoding Data (Traditional or Packbits Variant)

RleEncodeFile
VPackBitsEncodeFile

Declaration:

int RleEncodeFile(FILE *inFile, FILE *outFile)
int VPackBitsEncodeFile(FILE *inFile, FILE *outFile)

Description:

This routine reads an input file and writes out a run length encoded (traditional or packbits Variant) version of that file.

Parameters:

inFile - The file stream to be encoded. It must opened. NULL pointers will return an error.
outFile - The file stream receiving the encoded results. It must be opened. NULL pointers will return an error.

Effects:

inFile is encoded using the run length encoding algorithm (traditional or packbits variant). The encoded output is written to outFile. Neither file is closed after exit.

Returned:

0 for success, -1 for failure. errno will be set in the event of a failure.

Decoding Data (Traditional or Packbits Variant)

RleDecodeFile
VPackBitsDeScodeFile

Declaration:

int RleDecodeFile(FILE *inFile, FILE *outFile)
int VPackBitsDecodeFile(FILE *inFile, FILE *outFile)

Description:

This routine a reads run length encoded (traditional or packbits variant) a input file and writes out a decoded version of that file.

Parameters:

inFile - The file stream to be decoded. It must opened. NULL pointers will return an error.
outFile - The file stream receiving the decoded results. It must be opened. NULL pointers will return an error.

Effects:

inFile is decoded using the run length encoding algorithm (traditional or packbits variant). The decoded output is written to outFile. Neither file is closed after exit.

Returned:

0 for success, -1 for failure. errno will be set in the event of a failure.


Further Information

As usual, I found Arturo Campos' discussion to be a help. Unfortunately all that is left of it is the wayback machine archive.

Take a look at Apple's Technical Note TN1023 for a description of the PackBits algorithm. Unfortunately all that is left of the technical note is the wayback machine archive too.


Actual Software

I am releasing my implementations of run length encoding (RLE) under the LGPL. The source code repository is available on GitHub. I recommend that you checkout the latest revision of the master branch, unless you're looking for something specific.

Repository Location https://github.com/michaeldipperstein/rle

Portability

All the source code that I have provided is written in strict ANSI C. I would expect it to build correctly on any machine with an ANSI C compiler. I have tested the code compiled with gcc on Linux and mingw on Windows XP.

If you have any further questions or comments, you may contact me by e-mail. My e-mail address is: mdipperstein@gmail.com

Home
Last updated on December 30, 2017