Frequency Substitution Encoding Discussion and Implementation

Frequency Substitution is a term I made up for a transformation that I saw discussed on another web site. I have since lost the link to the site, but if somebody can find it I'll be happy to reference it.

The idea behind Frequency Substitution is simple. The goal is to transform data by replacing the highest frequency symbols with the lowest values. It sounded simple enough, so I thought that I'd give it a try.

Normally this type of transformation doesn't produce useful results, but there are a few compression techniques, such as Rice (Golomb) Coding, that produce smaller code words for lower value symbols. By making the lower valued symbols the most frequent symbols, these encoding schemes may achieve better compression ratios.

As with my other compression implementations, my intent is to publish an easy to follow ANSI C implementation of frequency substitution encoding and decoding. Anyone familiar with ANSI C should be able to follow and learn from my implementation. I'm sure that there's room for improvement of speed and memory usage, but this project is about learning and sharing, not perfection.

Click here for a link to my frequency substitution source. The rest of this page discusses the concepts involved with frequency substitution and my implementation.

Algorithm Overview

The goal of frequency substitution is to transform data by replacing the highest frequency symbols with the lowest values. In a data set contains 256 unique symbols, the most common symbol will be replaced with 0 and the least common symbol will be replaced with 255.

Encoding

Encoding a data stream using frequency substitution is a two pass process. The first pass is used to count the frequency of the symbols. After the first pass, a table of substitution codes is generated. Then a second pass is made, where an encoded data set is generated using the substitution table.

The first pass should be straight forward, start with a table indicating a frequency of 0 for all possible symbols, then add 1 to a symbol's frequency count every time it is encountered in the data stream. Next sort the substitution table most frequent to least frequent.

On the second pass, write out enough information for the decoder to reconstruct the table, then read in each unencoded symbol and write out the sorted position of that symbol.

Example:
Given a symbol set of 0 ... 9, encode the following data stream:
3,1,4,1,5,9,2,6,5,3,5,8,9,7,9,3,2,3,8,4,6,2,6,4,3,3,8,3,2,7,9

TABLE 1. Empty Table
Symbol	Count
0	0
1	0
2	0
3	0
4	0
5	0
6	0
7	0
8	0
9	0

TABLE 2. Completed Table
Symbol	Count
0	0
1	2
2	4
3	7
4	3
5	3
6	3
7	2
8	3
9	4

TABLE 3. Sorted Frequencies
Symbol Position	Symbol	Count
0	3	7
1	2	4
2	9	4
3	4	3
4	5	3
5	6	3
6	8	3
7	1	2
8	7	2
9	0	0

Step 4. Write out the encoded data by substituting a symbol for it's position in the frequency table.

3,1,4,1,5,9,2,6,5,3,5,8,9,7,9,3,2,3,8,4,6,2,6,4,3,3,8,3,2,7,9
becomes
0,7,3,7,4,2,1,5,4,0,4,6,2,8,2,0,1,0,6,3,5,1,5,3,0,0,6,0,1,8,2

A quick sanity check shows that the sum of the symbols in the initial string (150) is greater than the sum of symbols in the encoded string (96) so we must have lowered the average symbol value.

Decoding

Decoding data that has been encoded by frequency substitution is a single pass process. Given a table of frequency substitutions, reverse the substitutions preformed during the encoding process.

Example:
The following sequence and table of frequency substitutions
0,7,3,7,4,2,1,5,4,0,4,6,2,8,2,0,1,0,6,3,5,1,5,3,0,0,6,0,1,8,2

TABLE 4. Sorted Frequencies
Symbol Position	Symbol	Count
0	3	7
1	2	4
2	9	4
3	4	3
4	5	3
5	6	3
6	8	3
7	1	2
8	7	2
9	0	0

Produces the following sequence
3,1,4,1,5,9,2,6,5,3,5,8,9,7,9,3,2,3,8,4,6,2,6,4,3,3,8,3,2,7,9

If you were paying attention, you're probably wondering how the encoder passed the table of frequency substitutions to the decoder. The next section provides some ideas and discusses my implementations.

The Substitution Key

Before the decoder can do anything with the encoded data, it needs a key that tells it how to undo the substitutions. Since the key is needed immediately, it makes sense for the encoder to output the key before it outputs the encoded data.

So what kind of information do we need to have in the key? In order for the key to be useful, it must contain the original symbol and the value that it was substituted for.

If there are N possible symbols, the simplest thing to do is write out the complete list of all N (symbol, substitution) pairs. Such a list contains 2 × N entries. My goal was to use frequency substitution for data compression, so it would be nice if the list could be made even smaller.

Position As Code Word Value

Since the substitution key is the first thing output by the encoder, the key for all N symbols may be output in positions 0 ... (N - 1). Position 0 contains the symbol represented by the substitution 0, position 1 contains the symbol represented by substitution 1, and so on This method requires just N entries for N (substitution, symbol) pairs.

Excluding Unused Symbols

Sometimes the data being encoded is sparse; it only uses a fraction of all of the symbols in the symbol set. In this case there will be a large number of unneeded entries in a position indexed key.

For spare data sets, we can omit any symbol that isn't encoded, but to do that, the there has to be a way to indicate to the decoder that there are no more (substitution, symbol) pairs. I do this by repeating the previous symbol value.

Example:
The substitution key for the decoding example is written as the sequence 3,2,9,4,5,6,8,1,7,7.

Effectiveness

If you actually read my introduction, you'd notice that I thought frequency substitution might be helpful when used with Rice (Golomb) Coding. In order to test that theory, I used the Calgary Corpus as a test data set.

The table below shows the results of applying frequency substitution to the Calgary Corpus and then using Rice coding (with K = 4 and K = 2) to compress encoded files. Both position indexed and sparse versions of my frequency substitution implemention were used. For reference, the table also provides results for Rice coding alone and Rice coding with BWT and MTF.

So why use frequency substitution? If your goal is to achieve the best possible compression ratios, most data sets don't provide a compelling reason. However, the computational and memory resources required for frequency substitution is less than that required by BWT and MFT. I didn't provide any proof of that, you'll just have to trust me or see for yourself.

Library Code Interface

TABLE 5. Effectiveness
File	Orig	K = 4	K = 2
Rice	BWT + Rice	FreqSub + Rice	Rice	BWT + Rice	FreqSub + Rice
bib	111261	132690	73271	76575	310945	63729	84694
book1	768771	983146	498044	500220	2411218	415556	455421
book2	610856	780344	396846	404871	1912774	327191	395796
geo	102400	127322	92851	90115	300282	164232	152588
news	377109	466867	251512	260672	1122082	232063	289901
obj1	21504	27379	17040	19192	66546	23962	33112
obj2	246814	322336	179249	212593	812091	203702	358858
paper1	53161	66994	34729	35926	162996	29119	37180
paper2	82199	106376	53286	53804	262680	43819	50000
pic	513216	392391	327985	325044	494376	227137	218680
progc	39611	46306	26174	27624	106976	22658	31321
progl	71646	83408	46170	47521	193202	34977	46736
progp	49379	57154	32098	33668	130982	25144	35320
trans	93695	105406	61614	66797	237650	51727	81433
Total	3141622	3698119	2090869	2154622	8524800	1865016	2197750
Percent		117.71%	66.55%	68.58%	271.35%	59.36%	69.96%

Encoding Data

FreqEncodeFile

Declaration:

int FreqEncodeFile(FILE *inFile, FILE *outFile);

Description:

This routine reads an input file and counts character frequencies. The file is then read again and an output file is created where each input character is encoded with a value associated with its frequency. The more frequent the symbol, the lower the value.

Parameters:

inFile - The file stream to be encoded. It must opened and rewindable. NULL pointers will return an error.
outFile - The file stream receiving the encoded results. It must be opened. NULL pointers will return an error.

Effects:

inFile is encoded using the using frequency substitution and written to outFile. Neither file is closed after exit.

Returned:

0 for success, -1 for failure. errno will be set in the event of a failure.

Decoding Data

FreqDecodeFile

Declaration:

int FreqDecodeFile(FILE *inFile, FILE *outFile);

Description:

This routine reads a frequency substitution encode file and writes a decoded version to the specified output file.

Parameters:

inFile - The file stream to be decoded. It must opened. NULL pointers will return an error.
outFile - The file stream receiving the decoded results. It must be opened. NULL pointers will return an error.

Effects:

inFile is decoded using the frequency substitution algorithm and written to outFile. Neither file is closed after exit.

Returned:

0 for success, -1 for failure. errno will be set in the event of a failure.

Portability

All the source code that I have provided is written in strict ANSI C. I would expect it to build correctly on any machine with an ANSI C compiler. I have only tested the code compiled with gcc on Linux.

Actual Software

I am releasing my implementations of frequency substitution encoding and decoding under the LGPL. The source code repository is available on GitHub

If you have any further questions or comments, you may contact me by e-mail. My e-mail address is: mdipperstein@gmail.com

Frequency Substitution Encoding Discussion and Implementation

by Michael Dipperstein

Algorithm Overview

Encoding

Decoding

The Substitution Key

Position As Code Word Value

Excluding Unused Symbols

Effectiveness

Library Code Interface

Encoding Data

FreqEncodeFile

Decoding Data

FreqDecodeFile

Portability

Actual Software