Rice (Golomb) Coding Encoding Discussion and Implementation

I had to scratch yet another compression itch. I know very little about compressing wave audio files, but some questions I received made me want to find out slightly more. It turns out that it's possible to convert wave files into a format where the bulk of the data can be represented by low value numbers. Once that is done, Rice Encoding (a special case of Golomb Coding) can be applied to reduce the bits required to represent the lower value numbers.

Rice's algorithm seemed easy to implement, so I felt compelled to give it a try. I was right (sometimes that happens). A rough cut of an implementation of the algorithm didn't take that long to implement. If you're going to try your own implementation, you'll need a way to read and write individual bits, the rest is not that big of a deal. I already have a bitfile library that handles the reading and writing of bits, so I was left with the stuff that is not a big deal.

As with my other compression implementations, my intent is to publish an easy to follow ANSI C implementation of the Rice encoding and decoding algorithms. Anyone familiar with ANSI C and Rice (or Golomb) Encoding should be able to follow and learn from my implementation. I'm sure that there's room for improvement of compression ratios, speed, and memory usage, but this project is about learning and sharing, not perfection.

Click here for a link to my Rice source. The rest of this page discusses Rice Encoding and my implementation.

Algorithm Overview

Given a constant M, any symbol S can be represented as a quotient (Q) and remainder (R), where:
S = Q × M + R.

If S is small (relative to M) then Q will also be small. Rice encoding is designed to reduce the number of bits required to represent symbols where Q is small.

Rather than representing both Q and R as binary values, Rice encoding represents Q as a unary value and R as a binary value.

For those not familiar with unary notation, a value N may be represented by N 1s followed by a 0.
Example: 3 = 1110 and 5 = 111110.

Note: The following is true for binary values, if

log₂(M) =
K

where K is an integer:

Encoding

Given a bit length, K. Compute the modulus, M using by the equation M = 2^K. Then do following for each symbol (S):

Decoding

As with encoding, given a bit length, K. Compute the modulus, M using by the equation

M = 2^K

. Then do following for each encoded symbol (S):

Effectiveness

So how well does Rice coding work for generic data? Not very well at all. Rice coding only works well when symbols are encoded with small values of Q. Since Q is unary, encoded symbols can become quite large for even slightly large Qs. It takes 8 bits just to represent the value 7.

One way to improve the compression obtained by Rice coding on generic data is to apply reversible transformation on the data that reduces the average value of a symbol. The Burrows-Wheeler Transform (BWT) with Move-To-Front (MTF) encoding is such a transform.

I used the Calgary Corpus as a data set to test the effectiveness of Rice coding compared to my implementations of Huffman coding and LZSS. The executive summary is that even with the help of BWT and MTF, Rice coding couldn't match the compression ratios of Huffman coding or LZSS. However BWT and MTF allowed Rice coding to actually reduce the size of the data sets. The results of my test appear in the following table:

Implementation Issues

Size of Symbol

TABLE 1. Calgary Corpus Test Results
File	Original	Huffman	BWT + MTF + Huffman	LZSS	BWT + MTF + LZSS	Rice (K = 4 bits)	BWT + MTF + Rice (K = 4 bits)	Rice (K = 2 bits)	BWT + MTF + Rice (K = 2 bits)
bib	111261	73173	50849	52551	69792	132690	73271	310945	63729
book1	768771	438792	375140	423905	531367	983146	498044	2411218	415556
book2	610856	368788	272857	285896	384448	780344	396846	1912774	327191
geo	102400	73845	74328	83349	80692	127322	92851	300282	164232
news	377109	246891	187912	194679	264605	466867	251512	1122082	232063
obj1	21504	17338	12835	12267	14233	27379	17040	66546	23962
obj2	246814	195384	103745	103090	129296	322336	179249	812091	203702
paper1	53161	33819	24041	24464	33116	66994	34729	162996	29119
paper2	82199	48077	37889	39693	52814	106376	53286	262680	43819
paper3	46526	27702	22428	23467	31066	60251	30261	149090	25339
paper4	13286	8267	6756	6795	8934	17006	8676	41738	7340
paper5	11954	7893	6069	6072	7862	14933	7873	36108	6827
paper6	38105	24495	17319	17619	23596	47046	24949	112869	21022
pic	513216	107354	102814	105373	118892	392391	327985	494376	227137
progc	39611	26381	17376	17559	23447	46306	26174	106976	22658
progl	71646	43425	24027	22536	32692	83408	46170	193202	34977
progp	49379	30666	17077	15449	22777	57154	32098	130982	25144
trans	93695	65720	36146	33796	49277	105406	61614	237650	51727
Total	3251493	1838010	1389608	1468560	1878906	3837355	2162628	8864605	1925544
Percent		56.53%	42.74%	45.17%	57.79%	118.02%	66.51%	272.63%	59.22%

Rice's algorithm does not place any restrictions on the size of symbols being encoded. However the size of the encoded symbols must be known when the algorithm is actually being applied to data. My implementation of Rice's algorithm only encodes bytes. This places the additional restriction that K is between 1 and 7.

Handling End-of-File (EOF)

The EOF is of particular importance, because it is likely that an encoded file will not have a number of bits that is a integral multiple of bytes. Most file systems require that files be stored in bytes, so it's likely that encoded files will have spare bits. If you don't know where the EOF is, the spare bits may be decoded as another symbol.

I choose the third option for my implementation. My decoder will not produce another symbol until it sees the 0 indicating the end of the unary portion followed by K bits. Filling the spare bits in the last byte of a file with ones is enough to keep my decoder from decoding an extra symbol.

Library Code Interface

Encoding Data

RiceEncodeFile

Declaration:

int RiceEncodeFile(FILE *inFile, FILE *outFile, const unsigned char k)

Description:

This routine reads an input file one character at a time and writes out a Rice-Golomb encoded version of that file.

Parameters:

inFile - The file stream to be encoded. It must opened. NULL pointers will return an error.
outFile - The file stream receiving the encoded results. It must be opened. NULL pointers will return an error.
k - The length of binary portion of encoded word.

Effects:

inFile is encoded using the Rice-Golomb algorithm with the binary portion of code words k bits long. The encoded output is written to outFile. Neither file is closed after exit.

Returned:

0 for success, -1 for failure. errno will be set in the event of a failure.

Decoding Data

RiceDecodeFile

Declaration:

int RiceDecodeFile(FILE *inFile, FILE *outFile, const unsigned char k)

Description:

This routine reads an input file one encoded string at a time and decodes it using the Rice-Golomb algorithm.

Parameters:

inFile - The file stream to be decoded. It must opened. NULL pointers will return an error.
outFile - The file stream receiving the decoded results. It must be opened. NULL pointers will return an error.
k - The length of binary portion of encoded word.

Effects:

inFile is decoded using the Rice-Golomb algorithm with the binary portion of code words k bits long. The decoded output is written to outFile. Neither file is closed after exit.

Returned:

0 for success, -1 for failure. errno will be set in the event of a failure.

Further Information

Actual Software

I am releasing my implementations of fRice encoding/decoding under the LGPL. The source code repository is available on GitHub. I recommend that you checkout the latest revision of the master branch, unless you're looking for something specific.

Portability

All the source code that I have provided is written in strict ANSI C. I would expect it to build correctly on any machine with an ANSI C compiler. I have tested the code compiled with gcc on Linux and mingw on Windows XP.

The software makes no assumptions about the endianess of the machine that it is executing on. However, it does make some assumptions about the size of data types. The software makes use of the #if and #error pre-processor directives as an attempt to check for violations of my assumptions at compile time.

If you have any further questions or comments, you may contact me by e-mail. My e-mail address is: mdipperstein@gmail.com