Adaptive Delta Coding Discussion and Implementation

My experience with Rice (Golomb) coding and a challenge that a student sent me had thinking about creating a library that uses a small number of bits to encode the differences between consecutive symbols, but it didn't seem useful or all that interesting. It became interesting and useful after I attended BodyNets 2009 and saw a presentation on "Adaptive Lossless Compression in Wireless Body Area Sensor Networks". The presenter concluded that adaptive delta encoding was a code space and time efficient method for compressing data sampled from heart and gait monitors. So with renewed enthusiasm, I threw together a quick hack that implements adaptive delta encoding.

As with my other compression implementations, my intent is to publish an easy to follow ANSI C implementation of the adaptive delta coding algorithm. Anyone familiar with ANSI C and the adaptive delta coding algorithm should be able to follow and learn from my implementation. There's a lot of room for improvement of compression ratios, speed, and memory usage, but this project is about learning and sharing, not perfection.

The rest of this page discusses adaptive delta coding and the results of my efforts so far.

Algorithm Overview

Delta encoding represents streams of compressed symbols as the difference between the current symbol and the previous symbol. This will result in compression for any data streams where it takes fewer bits to represent the differences between consecutive symbols then it does to represent the symbols themselves.

If you happen to have data that is well suited to delta encoding (like audio or heart monitor data), you will find the simplicity of this algorithm attractive. However, delta encoding is not well suited for all data types.

Encoding

Delta coding is fairly straightforward. The first symbol of a delta encoded stream is always written out unecoded. After that, every symbol is replaced with an S bit long signed value representing the value of the current symbol minus the value of the previous symbol (or the other way around if you like).

Using S bits to represent a signed value, allows for differences ranging from -2^S-1 to 2^S-1-1.

Keeping S small is what allows for compression, but if S is too small, the range of values that can be represented by S bits might not include the range of differences between consecutive symbols in the data steam. Rather than giving up when a delta is too large to be represented by S bits, an escape symbol may be issued to indicate that the next symbol is not delta encoded, then the current symbol may be output.

The Delta encoding algorithm does not specify an escape symbol. I used the smallest signed value that can be represented in S bits (-2^S-1) as my escape symbol.

Step 3. Compute the difference between the current symbol and the previous symbol.

Decoding

Decoding isn't any harder than encoding. Since first symbol of a delta encoded stream is always written out unecoded, it be read back and written without modification. After that, read in an S bit encoded symbol. If the S bit symbol is the escape code, read the next symbol and write it out, otherwise, write out the symbol that is the sum of the previous symbol and the difference that is represented by the S bit encoded symbol.

Based on the discussion above, decoding encoded input consists of the following steps:

Implementation Issues

Adapting to Delta Ranges

If you choose a value for the code word size (S) that is too small, the Delta encoding algorithm will generate a lot of escapes, and you'll end up with an encoded data stream that's larger than the unencoded stream. If you choose a value for S that is too large, your encoded stream will be far from optimal size. One solution to this problem is to implement an adaptive algorithm that allows the the value of S to change as the data stream is processed.

I'm not aware of any literature that provides details on different adaptation algorithms, so I invented my own. My goals for the algorithm and it's implementation were for it to be code space and time efficient, and easy to modify. It also had to produce results that were improvements over the results of the non-adaptive algorithm.

I incorporated an overflow counter to determine if the code word size (S) should be adjusted upward, and underflow counter to determine if S should be adjusted downward. When a delta value requires more (or less) bits than the full range of S, the overflow (underflow) counter is incremented. Otherwise the counter is decremented (but limited to 0). When a counter reaches 3, S is increased (decreased) by 1 and the counter is set to zero.

The source code used to adjust the code word size is all contained in the file adapt.c. If you want to experiment with different adaptive algorithms, you should start there. I look forward to hearing from anybody that has experimental results to report.

Handling End-of-File (EOF)

The EOF is of particular importance, because it is likely that an encoded file will not have a number of bits that is an integral multiple of bytes. Most file systems require that files be stored in bytes, so it's likely that encoded files will have spare bits. If you don't know where the EOF is, the spare bits may be decoded as another symbol.

I choose the third option for my implementation. I write an escape symbol at the end of an encoded file. The decoder expects an unencoded symbol to follow the escape, but file can always be ended in less bits than it takes to write an unencoded symbol, so the decoder will get an EOF before it gets the symbol it was expecting.

Effectiveness

So how well does delta coding work for generic data? Not very well at all. Delta coding only works well when consecutive symbols have similar values. The greater the difference between two consecutive symbols, the more bits are required to encode them.

One way to improve the compression obtained by delta coding on generic data is to apply a reversible transformation on the data that reduces the average value of a symbol as well as the difference between consecutive symbols. The Burrows-Wheeler Transform (BWT) with Move-To-Front (MTF) encoding is such a transform.

I used the Calgary Corpus as a data set to test the effectiveness of delta coding compared to my implementations of Rice coding, Huffman coding and LZSS. The executive summary is that even with the help of BWT and MTF, delta coding couldn't match the compression ratios of Huffman coding or LZSS. Rice coding with BWT and MTF even has better compression ratios if the proper value for K is selected. However delta coding without BWT and MTF performed better than Rice coding without BWT and MTF, which is the way they should be used in a system with limited memory and processing power. The adaptive delta coding algorithm that I used also seems robust enough that there is very little penalty for choosing the wrong value for s.

Library Code Interface

TABLE 1. Uncompressed and Traditionally Compressed Calgary Corpus Files
File	Original	Huffman + BWT	LZSS + BWT
bib	111261	50849	69792
book1	768771	375140	531367
book2	610856	272857	384448
geo	102400	74328	80692
news	377109	187912	264605
obj1	21504	12835	14233
obj2	246814	103745	129296
paper1	53161	24041	33116
paper2	82199	37889	52814
paper3	46526	22428	31066
paper4	13286	6756	8934
paper5	11954	6069	7862
paper6	38105	17319	23596
pic	513216	102814	118892
progc	39611	17376	23447
progl	71646	24027	32692
progp	49379	17077	22777
trans	93695	36146	49277
Total	3251493	1389608	1878906
Percent	100.00%	42.74%	57.79%

TABLE 2. Rice Calgary Corpus Files
File	Rice (K = 2)	Rice (K = 4)
No Processing	BWT + MTF	No Processing	BWT + MTF
bib	310945	63729	132690	73271
book1	2411218	415556	983146	498044
book2	1912774	327191	780344	396846
geo	300282	164232	127322	92851
news	1122082	232063	466867	251512
obj1	66546	23962	27379	17040
obj2	812091	203702	322336	179249
paper1	162996	29119	66994	34729
paper2	262680	43819	106376	53286
paper3	149090	25339	60251	30261
paper4	41738	7340	17006	8676
paper5	36108	6827	14933	7873
paper6	112869	21022	47046	24949
pic	494376	227137	392391	327985
progc	106976	22658	46306	26174
progl	193202	34977	83408	46170
progp	130982	25144	57154	32098
trans	237650	51727	105406	61614
Total	8864605	1925544	3837355	2162628
Percent	272.63%	59.22%	118.02%	66.51%

TABLE 3. Delta Calgary Corpus Files
File	Delta (S = 2)	Delta (S = 4)
No Processing	BWT + MTF	No Processing	BWT + MTF
bib	123019	84489	123017	84492
book1	847240	605189	847239	605192
book2	659954	466702	659953	466707
geo	117961	92639	117963	92640
news	404171	319905	404171	319906
obj1	22737	17481	22743	17482
obj2	295535	174819	295534	174821
paper1	57070	40565	57071	40567
paper2	89217	62711	89218	62711
paper3	49215	36416	49216	36416
paper4	14184	10398	14183	10403
paper5	13172	9394	13173	9394
paper6	41690	28949	41690	28949
pic	211592	182724	211593	182725
progc	41443	29544	41440	29544
progl	73771	44693	73772	44695
progp	51802	30931	51800	30934
trans	101770	66050	101770	66053
Total	3215543	2303599	3215546	2303631
Percent	98.89%	70.85%	98.89%	70.85%

Encoding Data

DeltaEncodeFile

Declaration:

int DeltaEncodeFile(FILE *inFile, FILE *outFile, unsigned charcodeSize);

Description:

This function reads from the specified input stream and writes an adaptive delta encoded version to the specified output stream.

Parameters:

inFile - The file stream to be encoded. It must opened. NULL pointers will return an error.
outFile - The file stream receiving the encoded results. It must be opened. NULL pointers will return an error.
charcodeSize - The number of bits used for code words at the start of coding.

Effects:

Data from the inFile stream will be delta encoded and written to the outFile stream.

Returned:

0 for success, -1 for failure. errno will be set in the event of a failure.

Decoding Data

DeltaDecodeFile

Declaration:

int DeltaDecodeFile(FILE *inFile, FILE *outFile, unsigned char codeSize);

Description:

This function reads from the specified adaptive delta encoded input stream and writes a decoded version to the specified output stream.

Parameters:

inFile - The file stream to be decoded. It must opened. NULL pointers will return an error.
outFile - The file stream receiving the decoded results. It must be opened. NULL pointers will return an error.
charcodeSize - The number of bits used for code words at the start of coding.

Effects:

Data from the inFile stream will be decoded and written to the outFile stream.

Returned:

0 for success, -1 for failure. errno will be set in the event of a failure.

Portability

All the source code that I have provided is written in strict ANSI C. I would expect it to build correctly on any machine with an ANSI C compiler. I have tested the code compiled with gcc on Linux and mingw on Windows XP.

The software makes no assumptions about the endianess of the machine that it is executing on. However, it does make some assumptions about the size of data types. The software makes use of the #if and #error pre-processor directives as an attempt to check for violations of my assumptions at compile time.

Actual Software

I am releasing my implementation of adaptive delta encoding and decoding under the GNU LGPL. The source code repository is available on GitHub. I recommend that you checkout the latest revision of the master branch, unless you're looking for something specific.

If you have any further questions or comments, you may contact me by e-mail. My e-mail address is: mdipperstein@gmail.com

Adaptive Delta Coding Discussion and Implementation

by Michael Dipperstein

Algorithm Overview

Encoding

Decoding

Implementation Issues

Adapting to Delta Ranges

Handling End-of-File (EOF)

Effectiveness

Library Code Interface

Encoding Data

DeltaEncodeFile

Decoding Data

DeltaDecodeFile

Portability

Actual Software