Lempel-Ziv-Welch (LZW) Encoding Discussion and Implementation

The need to explore a compression algorithm has struck again. After playing with LZSS (LZ77) I thought LZW (LZ78) was something I should eventually get around to studying. I've played with several other things since having that thought, but it finally took hold, and became the thing that consumed "free time"

As with my other compression implementations, my intent is to publish an easy to follow ANSI C implementation of the Lempel-Ziv-Welch (LZW) encoding/decoding algorithm. Anyone familiar with ANSI C and LZW or LZ78 should be able to follow and learn from my implementation. I'm sure that there's room for improvement of compression ratios, speed, and memory usage, but this project is about learning and sharing, not perfection.

Click here for a link to my LZW source. The rest of this page discusses LZW and my implementation.

Algorithm Overview

Like it's predecessor LZSS (LZ77), the Lempel-Ziv-Welch algorithm uses a dynamically generated dictionary and and encodes strings by a reference to the dictionary. It is intended that the dictionary reference should be shorter than the string it replaces. As you will see, LZW achieves it's goal for all strings larger than 1.

Encoding

The LZSS algorithm uses a sliding window dictionary. Each entry in the dictionary is a character. LZSS code words consist of an offset into the dictionary and the number of characters following the offset to include in an encoded string.

Unlike LZSS, entries in the LZW dictionary are strings, and every LZW code word is a reference to a string in the dictionary.

Okay, so where does the dictionary come from, and why can't I find an entry for my whole file in it?

The Dictionary and Encoded Strings

The LZW dictionary is not an external dictionary that lists all known symbol strings. Instead, the dictionary is initialized with an entry for every possible byte. Other strings are added as they are built from the input stream. The code word for a string is simply the next available value at the time it is added to the dictionary.

An "encoded" string is used to add new strings to the dictionary. The encoded string is built from the input stream. The input stream is read 1 byte at a time. If the string formed by concatenating the new byte to the encoded string is in the dictionary, the new byte is added to the end of the encoded string. Otherwise a dictionary entry is made for the new string and the code word for the coded string is written to the output stream. Then the byte that was just read becomes thet encoded string.

The Basic Encoding Algorithm

Based on the discussion above, encoding input consists of the following steps:
Step 1. Initialize dictionary to contain one entry for each byte. Initialize the encoded string with the first byte of the input stream.
Step 2. Read the next byte from the input stream.
Step 3. If the byte is an EOF goto step 6.
Step 4. If concatenating the byte to the encoded string produces a string that is in the dictionary:

Step 5. If concatenating the byte to the encoded string produces a string that is not in the dictionary:

In the example above, a 17 character string is represented by 13 code words. Any actual compression that would occur would be based on the size of the code words. In this example code words could be as short as 9 bits. Typically code words are 12 to 16 bits long. Of course the typical input stream is also longer than 17 characters.

Decoding

New Byte	Encoded String	New Code	Code Output
t	t	None	None
h	h	256 (th)	t
i	i	257 (hi)	h
s	s	258 (is)	i
_	_	259 (s_)	s
i	i	60 (_i)	_
s	is	None	None
_	_	261 (is_)	258 (is)
h	h	262 (_h)	_
i	hi	None	None
s	s	263 (his)	257 (hi)
_	s_	None	None
t	t	264 (s_t)	259 (s_)
h	th	None	None
i	i	265 (thi)	256 (th)
n	n	266 (in)	i
g	g	267 (ng)	n
None	None	None	g

It shouldn't be a big surprise that LZW data is decoded pretty much the opposite of how it's encoded. The dictionary is initialized so that it contains an entry for each byte. Instead of maintaining an encoded string, the decoding algorithm maintains the last code word and the first character in the string it encodes. New code words are read from the input stream one at a time and string encoded by the new code is output.

During the encoding process, the code prior to the current code is written because concatenating the first character of the current code with the string encoded by the prior code generated a code that wasn't in the dictionary. When that happened the string formed by the concatenation was added to the dictionary. The same string needs to be added to the dictionary when decoding.

The Basic Decoding Algorithm

Based on the discussion above, decoding input consists of the following steps:
Step 1. Initialize dictionary to contain one entry for each byte.
Step 2. Read the first code word from the input stream and write out the byte it encodes.
Step 3. Read the next code word from the input stream.
Step 4. If the code word is an EOF exit.
Step 5. Write out the string encoded by the code word.
Step 6. Concatenate the first character in the new code word to the string produced by the previous code word and add the resulting string to the dictionary.
Step 7. Go to step 3.

Example 2: Decode the stream 't' 'h' 'i' 's' '_' 258 '_' 257 259 256 'i' 'n' 'g' produced by the previous example

Input Code	Encoded String	Added Code	String Output
t	t	None	t
h	h	256 (th)	h
i	i	257 (hi)	i
s	s	258 (is)	s
_	_	259 (s_)	_
258	is	260 (_i)	is
_	_	261 (is_)	_
257	hi	262 (_h)	hi
259	s_	263 (his)	s_
256	th	264 (s_t)	th
i	i	265 (thi)	i
n	n	266 (in)	n
g	g	267 (ng)	g

The decode string matches the original encoded string, so I must have done something right. One of my favorite things about LZW is that the decoder doesn't require any additional information from the encoder. There's no need to include extra information commonly required by statistical algorithms like Huffman Code and Arithmetic Code. So compression ratios are never reduced by extra data cost.

Exception to the Rules

Whoever said that there's an exception for every rule must have studied LZW. It turns out that an exception may occur. When decoding certain input streams the decoder may see a code word that's one larger than anything that it has in it's dictionary. Fortunately for me others have figured out when the exception happens and how to deal with it.

The exception occurs if the dictionary contains an entry for string + character and the input stream is string + character + string + character + string.

When the exception occurs, concatenate the first character of the string encoded by the previous code word to the end of the string encoded by the previous code word. The resulting string is the value of the new code word. Write it to the output stream and add it to the dictionary.

Example 3: The string "abcabcabcabcabcabc" demonstrates an occurrence of the special exception:

Wow! That's a lot of work to force an exception to occur. The results still need to be decoded in order to witness the exception in action.

There you have it, the exception and how it's handled. It can be made to occur with a shorter pattern of repeated 2 character strings like "ababababababab", but I doesn't make as nice of an example.

That's all there is to the LZW algorithm. There are a few more issues to consider when actually implementing the algorithm on a computer. You should to read on if you care about them.

Implementation Issues

New Byte	Encoded String	New Code	Code Output
a	a	None	None
b	b	256 (ab)	a
c	c	257 (bc)	b
a	a	258 (ca)	c
b	ab	None	None
c	c	259 (abc)	ab (256)
a	ca	None	None
b	b	260 (cab)	ca (258)
c	bc	None	None
a	a	261 (bca)	257 (bc)
b	ab	None	None
c	abc	None	None
a	a	262 (abca)	259 (abc)
b	ab	None	None
c	abc	None	None
a	abca	None	None
b	b	263 (abcab)	262 (abca)
c	bc	None	None
None	None	None	257 (bc)

Input Code	Encoded String	Added Code	String Output
a	a	None	a
b	b	256 (ab)	b
c	c	257 (bc)	c
256	ab	258 (ca)	ab
258	ca	259 (abc)	ca
257	bc	60 (cab)	bc
259	abc	261 (bca)
262	Not In Dictionary	abca	abca
NOTE: The above step demonstrates the exception and how it is handled. (New encoded string = old encoded string + the first character of the old encoded string = abc + a = abca)
257	bc	263 (abcab)	bc

As I have stated the introduction, my intent is to publish an easy to follow ANSI C implementation of the Lempel-Ziv-Welch Encoding (LZW) encoding/decoding algorithm. I have also tried to make the code portable. Those goals drove much of this implementation.

Size of Code Word

One of the first things that needs to be decided when implementing LZW is how big of a code word to use. The following items should be considered when choosing a code word size:

I encode strings of bytes, so the size of a code word must be bigger than a byte. That rules out any code words 8 bits or less.

I also don't want the code word to be huge, because even size 1 strings will be written as a code word. Example 1 encodes a 17 byte (136 bit) string into 13 code words. If the code words were only 9 bits long, the encoded data would be 117 bits. However if the code words were 16 bits long, the encoded data would be 208 bits. Typically, a larger data set must be encoded before longer code words produce any compression.

Code word size also has an impact on the maximum number of strings in a dictionary. An n bit code word may have as many as 2ⁿ strings in it's dictionary. If you're encoding bytes the first 256 of those strings will be single bytes.

Larger dictionaries can contain codes for more strings and that typically improves compression. The downside is that the encoding algorithm needs to search the dictionary for matches to the current string. Search time increases with the size of the dictionary.

With so many factors to consider, I ended up using 12 bit code words for my version 0.1 implementation. It's really easy to modify my version 0.1 implementation to use 9 to 15 bits. I settled on 12 bits after trying all values between 9 and 15 on a random set of files between 1K and 128K in size. The test was highly unscientific, but the 12 bit code words had the best average compression results.

I got a little fancier with my version 0.2 implementation. I begin with a 9 bit code word, but allow the bit length to increase up to 15 bits as more code words are generated. It is usually the case that a data stream can be compressed with code words that start at 9 bits and grow to n bits better than it can be compressed with a fixed n bit code word. I use the word "usually", because there is a little overhead required to indicate that the code word size has changed (see Indicating Code Word Length).

Feel free to change the code word size for any reason that you might have. If you increase it beyond 15 bits, things will blow up on machines with 16 bit integers. If you increase the code word size to more than 16 bits, you'll need to modify the routines that read and write code words.

Representing Strings in the Dictionary

The dictionary in the Lempel-Ziv-Welch algorithm provides a way of associating strings with code words. Code words are of limited length. However, the LZW algorithm does not impose a limit on the length of strings that are encoded.

One way to represent strings of arbitrary length is by a NULL terminated array. With LZW, it's possible to have thousands of strings thousands of bytes long. As machine memory sizes increase, there may come a time when the memory requirements of NULL terminated strings isn't a big deal. I'm still using a vintage PC, so NULL terminated strings are out of the question.

Fortunately, somebody much smarter than me observed that all strings entered into the dictionary either have a size of 1, or consist of a character appended to a string that is in the dictionary. The string that's already in the dictionary must also have a code word associated with it. Rather then representing the new string as old string + character, it can be represented as code word for old string + character.

In Example 1 the code 263 represents the string "his". Rather than creating a dictionary entry for code word 263 and "his", an entry can be made for the code word 263, the prefix code 257, and byte 's'. Every dictionary entry consists of the code word, the prefix code, and the last character in the string.

The big advantage to this scheme is that all strings in the dictionary are stored as a small, fixed length value. The downside to this scheme is that you need to traverse the dictionary to determine what string a code word encodes. 263 encodes 257 + 's'. 257 encodes 'h' + 'i'. That's not too bad with 3 character strings, but it's not that fun with 1000 character strings. Still, having a known fixed length for dictionary entries outweighs the need to traverse the dictionary.

Encoding Issues

Fortunately, writing code words wasn't that big of a deal. I have a bit file library containing functions for writing files one bit at a time. I use the same library for all of my other compression implementations, and it works just fine for LZW.

Indicating Code Word Length

The problem of indicating how long code words are only occurs when variable length code words are used. If fixed length code words are used, there's no question about how many bits the decoder should use when reading the encoded data.

I'm not sure what the common practice is for variable length code words and leaving hints that allow the decoder to recognize a code word length change. My implementation is based on some ideas that kept me up one night, and it works just fine. There are two observations that lead me to my implementation:

With those observations, everything fell into place. When the encoder needs n + 1 bits to write out a code word, it writes an n bit indicator. Then all code words from that indicator until the next indicator will be written using n + 1 bits. Once an indicator signals increased code word size, there is no way to decrease code word size.

For my indicator I use a code word of all ones at the current code word length. The consequence of using an all ones indicator is that the code word represented by that value also requires an extra bit. Suppose the encoder is using 9 bits and it needs to output the code word 511. 511 is 9 bits of all ones. A single 511 will cause the decoder to switch to reading 10 bit code words without decoding code word 511. To get around this, the encoder must write an indicator (511 in this case) to switch to 10 bits, then it must write the code word 511 using 10 bits.

Version 0.2 of my implementation starts out using 9 bit code words and allows the code word length to grow to 15 bits.

Dictionary Layout

Determining the layout of the dictionary most definitely impacts finding and inserting strings. The dictionary stores strings represented as a code word prefix and a byte suffix.

Initially, the dictionary will contain one entry for every one character string. However, the number of strings in the dictionary may grow as the encoding process proceeds. Fortunately there is an upper bound on the number of strings that may be in the dictionary. If code words are n bits long, there can only be 2ⁿ unique code words.

The only real requirement for an LZW dictionary is that it can store 2ⁿ code word/string pairs. It seemed natural to use an array of 2ⁿ code word/string entries to me. I started out with a dictionary that was an array of 2ⁿ entries, but as I started playing with the algorithm I realized that all strings 1 byte long are encoded by the value of the byte. It's much faster to just write out the value of the byte than it is to look them up in the dictionary. So my implementation uses an array based dictionary only stores up to (2ⁿ - 256) strings. Single byte strings were handled without using the dictionary.

As natural as an array based dictionary seems, it can be really slow to search when the dictionary fills up (see below). Recently I have replaced the array based dictionary with a binary tree based dictionary to speed up average search time. Anybody who has seen a binary tree should be comfortable with my implementation.

Searching for and Inserting Strings in Dictionary

Though being able to store all the code words is the only real dictionary requirement, the ability to perform fast search and insertion operations is also desirable.

Every time the encoding algorithm reads a byte, it appends it to a string that it is building. The first thing it does with the string is look for it in the dictionary. If the string isn't in the dictionary and the dictionary isn't full, the encoding algorithm will insert the string into the dictionary. There is one dictionary search for each byte in the uncoded data stream, and there will be one insertion for each new code word.

The simplest way to search an array based dictionary is by brute force, from start to finish. Similarly the simplest way to insert strings into the dictionary is to start at the beginning and keep looking until an empty location is found. These simple brute force approaches are O(N), where N is the number of code words (which is O(2ⁿ) where n is the number of bits in a code word).

It would really be nice to have something faster without adding much computational overhead. Others have successfully used hashing to speed up searches and insertions. The ideal hash function will make searching and insertion O(1) operations. I haven't come up with that function yet. Rather than being creative I used a simple function to generate a hash key that is used as an initial index into the dictionary. First I shifted the string's code prefix by 8 bits and OR in it's final character. That gives me an n + 8 bit number. I used that number modulo the size of the dictionary to get my initial key.

The first dictionary index I checked is the value of the key. Unfortunately collisions are possible. In the case of searching for a match, a collision occurs when there's a string already located in the dictionary position, but it doesn't match the string that we're looking for. My collision resolution was simple, but no more efficient than a brute force source. I search the next dictionary indices until I've either: found a match, searched the entire dictionary, or found an empty dictionary location.

If the dictionary is full, the array based algorithm will take O(N) searches to discover that a string is not in the dictionary. I realized that I could increase the size of the dictionary array, so that it will always include empty entries and stop the search if I arrive at an empty entry, but that just cuts the search time to an average of O(N/2).

Rather than using a larger array, I chose to use a binary tree. It takes an average of O(log(N)) searches to discover that a string isn't in the full dictionary. I'm not making much of an effort to balance my tree, so worst case is still O(N), but I don't think there will be many naturally occurring data streams that produce that result.

In order to keep the binary tree close to balanced, dictionary entries are ordered using a simple key. I append the code for the prefix sting to the MS nibble of the new character, then I append the LS nibble to that.

It's possible that I may have future code release using balanced trees. But that will take motivation and time.

One nice side effect of these methods are that they don't require any additional work to find an insertion location for a new string. Strings are only inserted into the dictionary when a match isn't found and there is room to insert the string. If the array based search algorithm ends because it found an empty dictionary location, there wasn't a match and the dictionary isn't full. The new string and its new code word is inserted into the empty dictionary location where the search algorithm stopped. In the case of the tree based search, when it ends without a match, the new entry is inserted as the child of the leaf where the search ended.

Idea for Super-fast Search/Insertion

The whole dictionary search and insertion process may be trivialized by increasing the size of the dictionary so that it has one index for every possible code word prefix/byte suffix string. If code words are n bits long, then there are 2^{n + 8} different possible strings; no more than 2ⁿ of which will actually be used. The dictionary can then be implemented as a 2^{n + 8} array, where each entry is either empty or contains the code word for the code word prefix/byte suffix pair represented by that index.

Decoding Issues

As with writing code words, reading code words wasn't that big of a deal. I use my bit file library here too.

Determining Code Word Length

The problem of determining the length of code words is only an issue when variable length code words are allowed. In the section on encoding, I discuss a method that uses an all ones indicator when code word bit lengths are to increase by one. When my decoder reads a code word that is all ones bits, it must increase the code word length by one bit without decoding that code word. All code words that follow are read using the new length and operation proceeds normally.

Version 0.2 of my implementation starts out using 9 bit code words and allows the code word length to grow to 15 bits.

Dictionary Layout

The dictionary organization used for encoding would work just fine for decoding, however it's more complicated then it needs to be. During the encoding process strings are created and the dictionary is searched to see if code words already exits for that string. The decoding algorithm reads the encoded stream 1 code word at a time and uses the dictionary to find out how to decode a code word. Simply restated, encoding searches the dictionary based on string value. Decoding searches the dictionary based on code word value.

My decoding dictionary is just an array of 2ⁿ strings where the array index is the code word that encodes the string.

Decoding Code Words into Strings

After it reads a code word, the decoding algorithm will look up the code word in the dictionary and write out the encoded string that it encodes. Normally that wouldn't be an issue. However, strings in the dictionary are stored as a prefix code word + a suffix byte. In order to decode a code word into a string of bytes, the prefix code of the string must also be decoded. The process of decoding prefix strings continues until a prefix string decodes to a 1 byte string. The process of decoding strings builds the string being decoded in reverse order.

The processes of decoding code words into another code word and a byte and writing all the bytes out in the reverse order of their decoding made me think of recursion. I don't have much of an opportunity to use recursion in my paying job, so the idea of implementing a recursive decode routine seemed "fun". If you'd rather avoid recursion, store the decoded characters into an array that is at least 2ⁿ - 256 bytes long, then write out the results in reverse order.

Further Information

I found Mark Nelson's Dr. Dobb's Journal article be very helpful. He even provides source for a slightly different implementation.

C Library Code Interface

Encoding Data

LZWEncodeFile

Declaration:

int LZWEncodeFile(FILE *fpIn, FILE *fpOut);

Description:

This routine reads an input file one character at a time and writes out an LZW encoded version of that file.

Parameters:

fpIn - The file stream to be encoded. It must opened. NULL pointers will return an error.
fpOut - The file stream receiving the encoded results. It must be opened. NULL pointers will return an error.

Effects:

fpIn is encoded using the LZW algorithm with codes up to CODE_LEN long and written to fpOut. Neither file is closed after exit.

Returned:

0 for success, -1 for failure. errno will be set in the event of a failure.

Decoding Data

LZWDecodeFile

Declaration:

int LZWDecodeFile(FILE *fpIn, FILE *fpOut);

Description:

This routine reads an input file one encoded string at a time and decodes it using the LZW algorithm.

Parameters:

fpIn - The file stream to be decoded. It must opened. NULL pointers will return an error.
fpOut - The file stream receiving the decoded results. It must be opened. NULL pointers will return an error.

Effects:

fpIn is decoded using the LZW algorithm with codes up to CODE_LEN long and written to fpOut. Neither file is closed after exit.

Returned:

0 for success, -1 for failure. errno will be set in the event of a failure.

Actual Software

I am releasing my implementations of the LZW algorithm under the LGPL. If you've actually read this page to get all the way down here, you already know that I have different implementations. As I add enhancements or fix bugs, they will be pushed to my source repository, which is hosted on GitHub. I recommend that you checkout the latest revision if you're not looking for something specific.

Portability

All the source code that I have provided is written in strict ANSI C. I would expect it to build correctly on any machine with an ANSI C compiler. I have tested the code compiled with gcc on Linux and mingw on Windows XP.

The software makes no assumptions about the endianess of the machine that it is executing on. However, it does make some assumptions about the size of data types. The software makes use of the #if and #error pre-processor directives as an attempt to check for violations of my assumptions at compile time.

If you have any further questions or comments, you may contact me by e-mail. My e-mail address is: mdipperstein@gmail.com

Lempel-Ziv-Welch (LZW) Encoding Discussion and Implementation

by Michael Dipperstein

Algorithm Overview

Encoding

The Dictionary and Encoded Strings

The Basic Encoding Algorithm

Decoding

The Basic Decoding Algorithm

Exception to the Rules

Implementation Issues

Size of Code Word

Representing Strings in the Dictionary

Encoding Issues

Indicating Code Word Length

Dictionary Layout

Searching for and Inserting Strings in Dictionary

Idea for Super-fast Search/Insertion

Decoding Issues

Determining Code Word Length

Dictionary Layout

Decoding Code Words into Strings

Further Information

C Library Code Interface

Encoding Data

LZWEncodeFile

Decoding Data

LZWDecodeFile

Actual Software

Portability