The saga continues. As with everything else that I've written, my LZSS implementation left (and still leaves) room for improvement. This web page attempts to discuss my implementation and the updates that have sporadically taken place since my original LZSS release.
I also toyed with an in-memory implementation of the LZSS algorithm that preforms all encode and decode operations on arrays rather than files. This implementation might be useful to those developing on systems that do not include a file system.
Information on downloading the source code for all of my LZSS implementations may be found here.
As with my Huffman code implementation, my intent is to publish an easy to follow ANSI C implementation of the LZSS compression algorithm. Anyone familiar with ANSI C and the LZSS algorithm should be able to follow and learn from my implementation.
The rest of this page discusses the results of my efforts so far.
LZSS is a dictionary encoding technique. Unlike Huffman coding, which attempts to reduce the average amount of bits required to represent a symbol, LZSS attempts to replace a string of symbols with a reference to a dictionary location for the same string. It is intended that the dictionary reference should be shorter than the string it replaces. In the case of LZ77, the predecessor to LZSS, that wasn't always the case.
The LZSS algorithm and it's predecessor LZ77 attempt to compress series of strings by converting the strings into a dictionary offset and string length. So if the string "abcd" appeared in the dictionary at position 1234, it may be encoded as {offset = 1234, length = 4}.
Okay, so where does the dictionary come from, and why can't I find my whole file in it?
The LZSS and LZ77 dictionary is not an external dictionary that lists all known symbol strings. Instead, the dictionary is a sliding window containing the last N symbols encoded/decoded. The larger N, the longer it takes to search the whole dictionary for a match and the more bits will be required to store the offset into the dictionary. Typically dictionaries contain an amount of symbols that can be represented by a whole power of 2. A 432 symbol dictionary would require 9 bits to represent all possible offsets. If you need to use 9 bits, you might as well have a 512 symbol dictionary so that you can have more entries.
Since dictionaries are sliding windows, once the (N + 1)th symbol is processed and added to the dictionary, the first symbol is removed. Additional new symbols cause an equal number of the oldest symbols to slide out.
In the example above, after encoding "abcd" as {offset = 1234, length = 4}, the sliding window would shift over 4 characters and the first 4 symbols (offsets 0 .. 3) would slide off the front of the sliding window. "a", "b", "c", and "d" would then be entered the dictionary into positions (N - 4), (N - 3), (N - 2), and (N - 1).
As stated above, encoded strings are represented as an offset and a length. Since the dictionary size is fixed at N the largest offset may be (N - 1), and the longest string matching a series of characters in the dictionary may be N characters. If we plan to store these values, that would require 2 * ceil(log2(N)) bits to represent such a string. For large N, it's highly unlikely that you'll ever read an N symbol string that matches the contents of dictionary, so the maximum allowed match length is often limited.
In the examples I have seen, N is typically 4096 or 8192 and the maximum length allowed for matching strings is typically between 10 and 20 characters.
In their original LZ77 algorithm, Lempel and Ziv proposed that all strings be encoded as a length and offset, even strings with no match. Storer and Szymanski observed that individual unmatched symbols or matched strings of one or two symbols take up more space to encode than they do to leave uncoded.
For example, encoding a string from a dictionary of 4096 symbols, and allowing for matches of up to 15 symbols, will require 16 bits to store an offset and a length. A single character is only 8 bits. Using this method, encoding a single character doubles the required storage.
Storer and Szymanski proposed preceding each symbol a with coded/uncoded flag. Using the above example it now takes 9 bits for each uncoded symbol and 17 bits for each encoded string.
Storer and Szymanski also observed that if you're not encoding strings of length 0 to M, then M may be used as an offset to the length of the match. Applying their observation, instead of using 4 bits to represent match lengths of 0 to 15, the same 4 bits may be used to represent match lengths of M to (15 + M).
Based on the discussion above, encoding input
requires the following steps:
Step 1. Initialize the dictionary to a known value.
Step 2. Read an uncoded string that is the length of the
maximum allowable match.
Step 3. Search for the longest matching string in the
dictionary.
Step 4. If a match is found greater than or equal to the
minimum allowable match length:
Step 5. Shift a copy of the symbols written to the encoded
output from the unencoded string to the dictionary.
Step 6. Read a number of symbols from the uncoded input
equal to the number of symbols written in Step 4.
Step 7. Repeat from Step 3, until all the entire input has
been encoded.
The LZSS decoding process is less resource intensive than the LZSS encoding process. The encoding process requires that a the dictionary is searched for matches to the string to be encoding. Decoding an offset and length combination only requires going to a dictionary offset and copying the specified number of symbols. No searching is required.
Decoding input requires the following steps:
Step 1. Initialize the dictionary to a known value.
Step 2. Read the encoded/not encoded flag.
Step 3. If the flag indicates an encoded string:
Step 4. Shift a copy of the symbols written to the decoded
output into the dictionary.
Step 5. Repeat from Step 2, until all the entire input has
been decoded.
When I set out to implement LZSS, I had the following goals in mind:
My version 0.2 and later code uses a bitfile library, so reading and writing individual bits isn't a big deal. All of my versions use codes that are integral bytes, and each of my newer versions are derived from my earlier versions so there is a lot of common design.
What follows is a discussion about how these goals were accomplished.
The dictionary size directly effects:
I chose to implement my dictionary as a 4096 character cyclic buffer (sliding window). I chose 4096 characters, because others have achieved good results with dictionaries of this size, and I can encode offsets of 0 to 4095 in 12 bits. If I encode the offset and length of a string in a 16 bit word, that leaves me with 4 bits for the length.
Keeping the goal of a 16 bit encoded string in mind, and the above requirement for a 12 bit offset, I'm left with 4 bits to encode the string length. With 4 bits, I can encode lengths of 0 through 15. However, as Storer and Szymanski observed, it only takes 1 byte to write out characters that match dictionary strings 0 or 1 character long, and 2 bytes to write out characters that match dictionary strings 2 characters long. Savings of one or more bytes per string doesn't occur unless the string is 3 characters or longer.
By not encoding strings that offer less than one byte of savings, the minimum encoded length is three characters. Since the minimum encoded length is three, I am able to add three to the 4 bit value stored in the length field, resulting in encoded string lengths of 3 to 18. So 18 is the maximum string length encoded by this implementation.
I've been calling my implementation a modified LZSS implementation. It's modified because of the way my version 0.1 code handled the encoded/not encoded flag. If I followed the traditional LZSS approach for handling the encoded/not encoded flag, writing an unencoded character would require a 9 bit write, 1 for the flag and 8 for the character. Similarly writing an encoded string would require 17 bits, 1 bit for the flag and 16 bits for the code.
While I was studying the algorithm, I came across some implementations that stored encoded flags in groups of eight followed by the characters or encoded strings that they represent. I don't know who to credit with the discovery of this technique, but it allowed my version 0.1 implementation to use standard one byte reads and writes instead of one bit reads and writes.
This technique also makes the EOF clear. By processing only bytes, there are no spare bits, os the EOF of the encoded dats is actually the EOF of the encoded file. Don't worry about it if I lost you on the EOF discussion, just relax, because there's no need to handle it specially.
After writing my version 0.1 implementation, I wrote a bitfile library that handles single and multiple bit writes. My version 0.2 and later code uses the bitfile library and handles the encoded/not encoded flag according to the traditional LZSS algorithm.
As stated above, one of my goals was to provide an LZSS implementation that is easy to read and easy to debug. So, to keep things simple, my first implementation used a brute force sequential search to match the string to be encoded to strings in the dictionary. Through experimentation and reading, I've discovered that the methods used for string matching significantly impact encoding time.
Other's have successfully improved the time required to find matching strings using the following techniques:
I have already experimented with some of these techniques and plan to experiment with others as time allows. The rest of this section documents some of what I have tried so far.
The first search I implemented was a sequential search. I simply start at the beginning of the sliding window dictionary and do a character by character comparisons with the string being encoded. If the first characters match, I check the characters that follow. If I don't have a match, I move to the next character in the sliding window. The source code implementing a sequential search is contained in the version 0.1 file lzss.c and the file brute.c in versions 0.2 and later. The logic is pretty simple, but may require a number of comparisons. If the string is length m and the dictionary is length n, a sequential search will be on the order of (n) × (m); O(n × m).
The addition of code implementing the KMP algorithm is a relatively new one (version 0.6.1). By the time I got around to including it Wikipedia had a reasonable description as well as pseudocode that I could reference.
The KMP algorithm is O(n × m), just like the brute force sequential search. However KMP attempts to use some information about the string we're attempting to find a match for and the comparisons already made in order skip some comparisons that must fail. This skipping makes KMP a Θ(m) search algorithm.
Example:
dictionary = abcabcabd
string = abcabd
The initial pass, will start by comparing string[0]
to
dictionary[0]
and fail when it compares
string[5]
to dictionary[5]
.
Dictionary | a | b | c | a | b | c | a | b | d |
String | a | b | c | a | b | d |
The brute force sequential search will resume by comparing
string[0]
to dictionary[1]
but we know this
will fail because string[1]
matches
dictionary[1]
and string[0]
and
string[1]
are not the same.
KMP is smart enough to skip ahead and resume by comparing
string[0]
to dictionary[3]
, which happens to
be a match in this example.
Dictionary | a | b | c | a | b | c | a | b | d |
String | a | b | c | a | b | d |
The KMP algorithm requires that the string being searched for be preprocessed. The preprocessing generates a look-up table used to determine how far back from the failed comparison, the algorithm must go to resume comparisons. The look-up table is know as a "partial match table" or a "failure function". The partial match table for our example string is depicted below:
String | a | b | c | a | b | d |
Partial Match Table | -1 | 0 | 0 | 0 | 1 | 2 |
When the example above failed to match
string[5]
to dictionary[5]
, position 5 of the
partial match table was used to determine that search should fallback 2
to dictionary[3]
.
The source code implementing the KMP algorithm is contained in the file kmp.c. The majority of the code follows the outline of the pseudocode provided by Wikipedia.
The sequential search algorithm moves through the dictionary one character at a time, checking for matches to the first character in the string to be encoded. Assuming equal distribution of characters, there's a 1 in alphabet size chance of characters matching.
One approach to ensuring that the first character of the string being encoded matches the dictionary entry it is being compared to is to keep a list of all dictionary entries that start with a character for each character in the alphabet. For example, there would be one list of all the entries that start with 'A', and another of all the entries that start with 'B'. In the case of a 256 character alphabet, 256 lists must be maintained.
Since the dictionary is a sliding window of the last characters encoded by the algorithm, the lists of strings starting with a given character must be updated as old characters are removed from the dictionary and new characters are added to the dictionary. Linked lists are a natural choice for implementing such dynamic lists.
The additional memory overhead required to implement a collection of linked lists is one pointer to the head of each of the lists, plus one pointer to the next character in the list for each character in the dictionary.
If the string is length m and the dictionary is length n, it turns out the worst case number of operations for finding a match is still of the order O(n × m), but the average case is of the order of (m × n) ÷ (alphabet size).
The source code implementing a linked list search is contained in the
version 0.1 file lzlist.c
and the file list.c
in
versions 0.2 and later. In my implementation, all pointers (list head and
next) are int
indices into the sliding window dictionary.
After implementing string matches with linked lists, it seemed like a wasn't much effort to try matching using hash tables. In actuality, the linked lists approach, is a solution involving hash tables where the hash key is the first character of the string and collisions are stored in a linked list. If that's all I wanted, I'd be done.
One problem with the linked list approach is that all the strings in a list only match the first character in the list, but a string will not be encoded in {offset, length} format until it matches at least some minimum (M) number of characters. To ensure that only strings matching M characters are searched, you can generate a hash key from the first M characters of the string. String matching will fly when you do that, however the hash table will need (alphabet size)M entries. For my implementation alphabet size is 256 and M is 3, so the hash table would need 2563 entries (that's 16,777,216 for anybody without a calculator). That makes the hash table 4096 times the size of the dictionary.
A 16M entry hash table doesn't strike me as a viable solution. I wanted to try much smaller tables. In order to be able to use smaller tables, I needed a key that would distribute M long strings fairly evenly through the hash table. Rather than inventing and testing a new key algorithm, I chose the key generation method discussed in K. Sadakane and H. Imai: Improving the Speed of LZ77 Compression by Hashing and Suffix Sorting, IEICE Trans. Fundamentals, Vol. E83-A, No. 12, pp. 2689--2798, 2000. The key algorithm is supposed to be the same algorithm used by gzip and may be implemented using the following C fragment:
key = 0;
for (i = 0; i < (M + 1); i++)
{
key = (key << d) ^ (string[i]);
key %= hash size;
}
For my implementation I ended up settling on d = 5 and hash size = 1024, though I found that the key distributed strings evenly for hash size of a power of 2 between 256 and 131072.
There's only one additional complication. In the case of linked lists, adding or removing a character from the dictionary required that one entry be added or removed from a linked list. When hashing on M characters, adding or removing the nth character from the dictionary requires that strings starting at (n - M + 1) through n be added to or removed from the hash table.
The source code implementing a hash table search is contained in the
version 0.1 file lzhash.c
and the file hash.c
in
version 0.2 and later. In my implementation, all pointers (hash table and
next) are int
indices into the sliding window dictionary.
The sequential search algorithm moves through the dictionary one character at a time, checking for matches to the string being encoded. Any failed match results in advancing the compare to the string starting with the next character in the dictionary.
A binary search tree is another structure that is added to the dictionary to help speed up the search process. Each string in the dictionary has a corresponding node in the binary search tree. Each node has two pointers, one to a subtree containing only nodes with strings less than the current node, and the other to a subtree containing only nodes with strings greater than the current node.
Unlike the sequential search, when a search fails against a string in a node of a binary tree, the next comparison will start with the string at the root of the subtree that may contain a match. If the current match failed because a the string being encoded is less than the string in the current node, the next string compared will be the one in root of the subtree containing all strings less than the string in the current node.
If the string is length m and the dictionary is length n, it turns out the worst case number of operations for finding a match is still of the order O(n × m), but the average case is O(log(n) × m). The worst case occurs when the binary search tree resembles a linked list and each node only has one child.
Since the dictionary is a sliding window of the last characters encoded by the algorithm, the binary search tree must be updated as old characters are removed from the dictionary and new characters are added to the dictionary. Wikipedia provides a description of the algorithm used to insert or remove entries from a binary search tree.
The additional memory overhead required to implement a binary search tree is one pointer to the root of the tree plus three pointers for each symbol in the sliding window dictionary. The pointers are for each node's left child, right child, and parent.
The source code implementing a binary tree search is contained in the
file tree.c
. In my implementation, all pointers (left, right,
and parent) are all unsigned int
indices into the sliding
window dictionary.
Declaration:
int EncodeLZSS(FILE *fpIn, FILE *fpOut);
Description:
This function will read an input file and write an output file encoded according to the traditional LZSS algorithm. This algorithm encodes strings as 16 bits (a 12 bit offset + a 4 bit length).
Parameters:
fpIn
- The file stream to be encoded. It must opened. NULL pointers
will return an error.
fpOut
- The file stream receiving the encoded results. It must be opened.
NULL pointers will return an error.
Effects:
fpIn
is encoded and written to fpOut
.
Neither file is closed after exit.
Returned:
0 for success, -1 for failure. errno
will be set in
the event of a failure.
Declaration:
int DecodeLZSS(FILE *fpIn, FILE *fpOut);
Description:
This function will read an LZSS encoded input file and write an output file. This algorithm decodes strings from 16 bit code words (a 12 bit offset + a 4 bit length).
Parameters:
fpIn
- The file stream to be decoded. It must opened. NULL pointers
will return an error.
fpOut
- The file stream receiving the decoded results. It must be opened.
NULL pointers will return an error.
Effects:
fpIn
is decoded and written to fpOut
.
Neither file is closed after exit.
Returned:
0 for success, -1 for failure. errno
will be set in
the event of a failure.
I am releasing my implementations of the LZSS algorithm under the LGPL. If you've actually read this page to get all the way down here, you already know that I have different implementations. In general earlier versions are simpler (maybe easier to follow) and later versions are easier to use as libraries and better suited for projects taking advantage of the LGPL. In some cases later version add new new search methods or fix minor bugs. All versions of my source code is available on GitHub. I recommend that you checkout the latest revision if you're not looking for something specific.
Repository Location | https://github.com/michaeldipperstein/lzss |
---|
If you have any further questions or comments, you may contact me by e-mail. My e-mail address is: mdipperstein@gmail.com
HomeLast updated on December 29, 2018