The idea: First assign a number to all the letters of the alphabet. Say A is 41, B is 42, etc.
Now let’s say we’re compressing the string ABRACADABRA.
We see the first A and output 41. Then we see B: we output 42, but we also create a new entry in our lookup table for AB and give it a number (say, 81). When we see R, we encode its value and also add BR to the table, and so on. The payoff: if we see AB again later in the string, we won't encode A and B separately. We'll just output 81 for AB, and that's where the compression comes from. And of course, once we match AB, we'll add a new entry that is three characters long (ABR).
For this string, the new keys that are added are AB, BR, RA, AC, CA, AD, DA, and ABR (the second AB and the final RA are matched from the table; longer entries like RAB, BRA, and ABRA would only appear with a longer input).
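Here's a minimal sketch of the encoder in Python. The name `lzw_encode` is my own, the table is seeded with plain ASCII values (so A is 65 rather than 41, and AB gets code 128 rather than 81), and codes are emitted as Python ints instead of fixed-width bit patterns:

```python
def lzw_encode(text):
    # Greedy LZW: emit the code for the longest known prefix,
    # then add that prefix plus one more character to the table.
    table = {chr(i): i for i in range(128)}  # seed with single characters
    next_code = 128
    out = []
    prefix = ""
    for ch in text:
        if prefix + ch in table:
            prefix += ch                     # keep extending the match
        else:
            out.append(table[prefix])        # emit longest known prefix
            table[prefix + ch] = next_code   # new, one-character-longer entry
            next_code += 1
            prefix = ch
    if prefix:
        out.append(table[prefix])            # flush the final match
    return out
```

Running this on "ABRACADABRA" adds exactly the keys listed above.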
Decoding mirrors the encoding procedure. We start with the codes for all the single characters, then build up the rest of the table as we decode, one step behind the encoder: first we decode A, then B, and only then do we add AB to our symbol table.
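A matching decoder sketch (again with hypothetical names; note the one edge case where a code arrives before the decoder has added it, which happens when the encoder uses an entry on the same step it created it):

```python
def lzw_decode(codes):
    table = {i: chr(i) for i in range(128)}  # same single-character seed
    next_code = 128
    prev = table[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        else:
            # Code not in the table yet: the encoder created and used it
            # in one step, so the entry must be prev plus its own first char.
            entry = prev + prev[0]
        out.append(entry)
        table[next_code] = prev + entry[0]   # mirror the encoder's new entry
        next_code += 1
        prev = entry
    return "".join(out)
```

With these two sketches, `lzw_decode(lzw_encode("ABRACADABRA"))` round-trips to the original string.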
The symbol table can be stored as a trie. The benefit is that entries share storage for common prefixes (which, by design, we'll have a lot of), and finding the longest table entry that prefixes the remaining input becomes a single walk down the trie.
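As a sketch of what that lookup could look like (the `TrieNode` and `longest_match` names are hypothetical), the encoder's inner loop becomes one downward walk:

```python
class TrieNode:
    def __init__(self, code=None):
        self.code = code       # code for the string ending at this node
        self.children = {}     # next character -> child node

def longest_match(root, text, start):
    # Return (code, length) of the longest table entry that is a
    # prefix of text[start:], by walking down from the root.
    node, code, length = root, None, 0
    for i in range(start, len(text)):
        node = node.children.get(text[i])
        if node is None:
            break
        if node.code is not None:
            code, length = node.code, i - start + 1
    return code, length
```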
There are many implementation variations (e.g. how big the symbol table is allowed to grow, and what to do once it fills up).
Properties
- Whereas Huffman encoding uses variable-length sequences to represent a fixed-length string (usually a character), LZW compression uses a fixed-length sequence to represent a variable-length string.
- No need to transmit any table to the decoder. Since encoder and decoder run the same algorithm, the initial single-character codes can be built into it, and the decoder rebuilds the rest of the table as it goes.
- Usually gives better compression than Huffman.
- On the decoding side, we must decode from the very beginning, since the symbol table has to be rebuilt from the start.