`
`Bricolage: Data Compression
`
`© Copyright 1998 The Perl Journal. Reprinted with permission.
`
`Bricolage: Data Compression
`Morse Code
`Ambiguous codes
`Huffman Coding
`The Code
`The Rub
`Another Rub
`Other Methods
`Other Directions
`Bibliography
`Notes
`
`Bricolage: Data Compression
You are probably familiar with the Unix compress, gzip, and bzip2 utilities, or the DOS
`pkzip utility. These programs all make files smaller; we say that such files are
`compressed. Compressed files take less disk space and less network bandwidth. The
downside of compressed files is that they are full of unreadable gibberish; you usually
`have to run another program to uncompress them before you can use them again. In this article we'll see how file
`compression works, and I'll show a simple module that includes functions for compressing and uncompressing
`files.
`
`Morse Code
`The idea behind data compression is very simple. In a typical file, say a text file, every character takes up the
same amount of space: 8 bits. The letter e is represented by the 8 bits 01100101; the letter Z is represented by the
8 bits 01011010. But in a text file, e occurs much more frequently than Z---maybe about 75 times as frequently.
`If you could give the common symbols short codes and the uncommon symbols long codes, you'd have a net
`gain.
`This isn't a new idea. It was exploited by Samuel Morse in the Morse Code, a very early digital data transmission
`protocol. Morse Code was designed to send text files over telegraph wires. A telegraph is very simple; it has a
`switch at one end, and when you close the switch, an electric current travels through a wire to the other end,
`where there is a relay that makes a click. By tapping the switch at one end, you make the relay at the other end
`click. Letters and digits are encoded as sequences of short and long clicks. A short click is called a dot, and a
`long click is called a dash.
`The two most common letters in English text are E and T; in Morse code these are represented by a single dot and
`a single dash, respectively. The codes for I, A, N, and M, all common letters, are **, *-, -*, and --. In contrast, the
`codes for the uncommon letters Q and Z are --*- and --**.
`In computer file compression, we do a similar thing. We analyze the contents of the data, and figure out which
`symbols are frequent and which are infrequent. Then we assign short codes to the frequent symbols and long
`codes to the infrequent symbols. We write out the coded version of the file, and that usually makes it smaller.
`
`
`Ambiguous codes
`There's a problem with Morse Code: You need a third symbol, typically a long pause, to separate the dots and
dashes that make up one letter from the dots and dashes that make up the next. Otherwise, if you get *-, you
`don't know whether it's the single letter A or the two letters ET---or it might be the first bit of the letter R or L. In a
`long message, all the dots and dashes run together and you get a big mess that can't be turned back into text. In
`Morse code, it can be hard to tell `Eugenia' from `Sofia': Without the interletter pauses, they're both:
` ***---**-****-
`Those interletter spaces take up a lot of transmission time, and it would be nice if you didn't need them. It turns
`out that if you arrange the code properly, you don't. The ambiguity problem with Morse Code occurs because
`some codes are prefixes of others: There are some letters where the code for the first part of one letter is just the
`same as the code for the other letter, but with something extra tacked on. When you see the shorter code, you
`don't know if it's complete or if it should be combined with the following symbols. It turns out that if no code is a
`prefix of any other, then the code is unambiguous.
`Suppose for simplicity that we only needed to send the letters A, C, E, and S over the telegraph. Instead of Morse
`code, we could use the following code table:
` A -
` C **
` E *-*
` S *--
`Suppose we receive the message -*****-**--*--*-**--. What was the message? Well, the first symbol is -, so
`the first letter in the message must be A, because that's the only letter with a code that starts with a -. Then the
`next two symbols are **, so the second letter must be a C, because all the other codes that start with * have a -
`after the * instead of another *. Similar reasoning shows that the third letter is also C. After that, the code is *-*;
`it must be an E. We continue through the message, reading off one letter at a time, and eventually we get the
`whole thing this way.
It's so simple that a computer can decode it, if the computer is equipped with a decision tree like this one:

[Figure: the decision tree for this code. From the Start node, the - branch leads to A; the * branch leads to a second node whose * branch leads to C and whose - branch leads to a third node, with E on its * branch and S on its - branch.]

`Start at Start, and look at the symbols in the message one by one. At each stage, follow the appropriate labelled
`branch to the next node. If there's a letter at that node, output the letter and go back to the start node. If there's no
`letter at the node, look at the next symbol in the input and continue down the tree towards the leaves.
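To make this concrete, here is a short Perl sketch of the same idea. (The sketch is mine, not part of the module described later; it leans on the prefix property directly instead of walking a tree.)

 my %decode = ('-' => 'A', '**' => 'C', '*-*' => 'E', '*--' => 'S');
 my $message = '-*****-**--*--*-**--';
 my $output = '';
 while (length $message) {
     # Because no code is a prefix of any other, exactly one code
     # can match at the front of what remains.
     for my $code (keys %decode) {
         if (index($message, $code) == 0) {
             $output .= $decode{$code};
             substr($message, 0, length $code) = '';
             last;
         }
     }
 }
 print "$output\n";    # prints "ACCESSES"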
`
`
`Huffman Coding
Obviously, it's important to choose the right code. If Morse had made * the code for Z and **-* the code for E, he
`wouldn't be famous.
`Choosing the right code can be tricky. Consider the example of the previous section, where we only had to code
`messages that contain A, C, E, and S. The code I showed is good when we expect our messages to contain more
`A's than E's or S's. If S were very common, we clearly could have done better; less clearly, if all four letters were
about equally common, then we could still have done better, by assigning each letter a code of the same length:

[Figure: a decision tree assigning each of A, C, E, and S a two-symbol code.]

`Suppose, for example, our message happened to contain 200 of each of the four letters. Then the first code would
`use 1800 symbols, and the second code would use only 1600.
In 1952, David Huffman discovered a method for producing the optimal unambiguous code. For a given set of
`symbols, if you know the probability with which each symbol appears in the input, you can use Huffman's
`method to construct an unambiguous code that encodes the typical message with fewer *'s and -'s than any other
`code.
`The method is very simple and ingenious. For concreteness, let's suppose that the (rather silly) message is
` THE_THIRSTIEST_SISTERS_TEETH_RESIST_THIS_STRESS
`(I used _ instead of space so that it'll be easier to see.)
`Start with the table of relative probabilities; you can get this by counting the number of occurrences of every
`symbol in the message. This is called histogramming. (A histogram is a bar chart; histos is Greek for a beam or a
`mast.) Here's the histogram for the symbols in our sample message:
`S 11
`T 10
E 7
`_ 6
`I 5
`H 4
`R 4
`Now take the two least common entries in the table, that's H and R. They'll get the longest codes, because they're
`least common. We'll simplify this by pretending that H and R are the same, and lumping them together into one
`category, which we'll call HR. Then we'll assign codes to all the other letters and to HR. When we're done, we still
`have to distinguish between H and R. Now, HR has some code. We don't know what it is yet, so let's symbolize it
with <HR>. We don't really need to use <HR> in our message, because there is no such thing as the letter HR, so we'll
`split it in two, and let the code for H be <HR>* and the code for R be <HR>-. As a result of this, the codes for H and
R will be longer than the codes for the other letters, but if that has to happen, it's better for it to happen for H and R,
`because they are the least common letters in the message.
`So we will lump H and R together and pretend temporarily that they are only one letter. Our table then looks like
`this:
` S 11 R = <HR>-
` T 10 H = <HR>*
` HR 8
` E 7
` _ 6
` I 5
`Now we repeat the process. The two least common symbols are I and _. We'll lump them together into a new
`symbol' called I_, and we'll finish assigning the codes to S, T, HR, E, and I_. When we're done, I will get the
`code <I_>* and _ will get the code <I_>-.
` S 11 R = <HR>-
` I_ 11 H = <HR>*
` T 10 _ = <I_>-
` HR 8 I = <I_>*
` E 7
`Then we lump together HR and E:
` HRE 15 R = <HR>-
` S 11 H = <HR>*
` I_ 11 _ = <I_>-
` T 10 I = <I_>*
` HR = <HRE>-
` E = <HRE>*
`Then we lump together T and I_:
` I_T 21 R = <HR>-
` HRE 15 H = <HR>*
` S 11 _ = <I_>-
` I = <I_>*
` HR = <HRE>-
` E = <HRE>*
` I_ = <I_T>-
` T = <I_T>*
`Then we lump together S and HRE:
` SHRE 25 R = <HR>-
` I_T 21 H = <HR>*
 _ = <I_>-
 I = <I_>*
` HR = <HRE>-
` E = <HRE>*
` I_ = <I_T>-
` T = <I_T>*
` S = <SHRE>-
` HRE = <SHRE>*
`Now we only have two `symbols' left. There's only one way to assign a code to two symbols; one of them gets *
`and the other gets -. It doesn't matter which gets which, so let's say that SHRE gets * and I_T gets -.
`Now the codes fall out of the table we've built up in the right-hand column:
`
 SHRE = *
     S   = *-
     HRE = **
         HR = **-
             R = **--
             H = **-*
         E  = ***
 I_T = -
     I_ = --
         _ = ---
         I = --*
     T  = -*
`We throw away the codes for the fictitious compound symbols, and we're left with the real code:
` S = *-
` T = -*
` E = ***
` _ = ---
` I = --*
` H = **-*
` R = **--
`As promised, the code is unambiguous, because no code is a prefix of any other code. Our original message
`encodes like this:
` -***-****----***-*--***--*--*--*****--*---
` *---**--******--*-----*******-***-*---
` **--****---**--*----***-*--**----
` *--***--****-*-
That's a total of 128 *'s and -'s, an average of 2.72 symbols per character, and a 9.2% improvement over the 141
symbols we would have had to use if we had given every letter a three-symbol code.
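Here is a Perl sketch of the lumping procedure we just carried out by hand. It is not the module's actual code, and where the frequency table has ties it may produce a different, but equally short, set of codes:

 sub huffman_codes {
     my %count = @_;                                # symbol => occurrences
     my %code = map { $_ => '' } keys %count;
     my %members = map { $_ => [$_] } keys %count;  # real symbols inside each lump
     while (keys(%count) > 1) {
         # take the two least common entries...
         my ($x, $y) = (sort { $count{$a} <=> $count{$b} } keys %count)[0, 1];
         # ...decide how the lump will eventually split: members of $x
         # get * tacked on in front, and members of $y get -
         $code{$_} = '*' . $code{$_} for @{ $members{$x} };
         $code{$_} = '-' . $code{$_} for @{ $members{$y} };
         # ...and lump the two entries together into one
         $count{"$x$y"} = delete($count{$x}) + delete($count{$y});
         $members{"$x$y"} = [ @{ delete $members{$x} }, @{ delete $members{$y} } ];
     }
     return %code;
 }

 my %code = huffman_codes(S => 11, T => 10, E => 7, '_' => 6,
                          I => 5, H => 4, R => 4);
 print "$_ = $code{$_}\n" for sort keys %code;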
`
`The Code
For this article, I implemented a demonstration module that compresses files. You can retrieve it from The Perl
Journal web site at http://www.tpj.com/ or from my Perl Paraphernalia web site at
`http://www.plover.com/~mjd/perl/Huffman/. The program htest.pl compresses an input and saves it to the file
`/tmp/htest.out; then it opens this file, reads in the compressed data, decompresses it, and prints the result to
`standard output.
`Most of the real work is in the Huffman module that htest uses. Here are the important functions that htest
`calls:
` my $hist = Huffman::histogram(\@symbols);
`This generates the histogram of the input text. The histogram is the tally of the number of occurrences of each
`symbol. The argument to histogram is an array of symbols, passed by reference for efficiency, because it is
`likely to be very large. It might have been simpler to pass the input as a single string, but this way we can assign
`codes per-word instead of per-character if we want to, just by splitting the input into an array in a different way.
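The module's own histogram function may be fancier, but the essential operation is a one-line tally:

 sub histogram {
     my ($symbols) = @_;    # reference to the (possibly huge) symbol array
     my %tally;
     $tally{$_}++ for @$symbols;
     return \%tally;
 }

For per-word coding, you might split the input with something like split /\b/ instead of split //.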
` my $codes = Huffman::tabulate($hist);
The tabulate function generates the code table from the histogram using Huffman's method. (This has the side
`effect of destroying the histogram.) The code table is just a hash that maps symbols to codes; the codes
`themselves are strings like 0010011.
` Huffman::save_code_table(*FILE, $codes);
`This writes the code table to the file.
` Huffman::encode(*FILE, $codes, \@symbols);
`This encodes the input text and writes the result to the file.
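Conceptually, encoding is nothing more than looking up each symbol and concatenating the code strings; here is a one-line sketch of the idea (the real function must also pack the 0 and 1 characters into actual bits before writing them out):

 my $bits = join '', map { $codes->{$_} } @symbols;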
` my $codes = Huffman::load_code_table(*FILE);
` my $coded_data = <FILE>;
` my $text = Huffman::decode($codes, $coded_data);
This is the decompression process. The return value from load_code_table is a code table in a rather interesting
`form. It's a decision tree, just like the ones we saw earlier in the article. Each node of the tree is either a single
`string (which is the symbol that the decoder should output) or it's an array with two elements; the 0th element is
`the part of the tree on the branch labeled 0, and the 1st element is the part of the tree on the branch labeled 1. For
`example, taking * as equivalent to 0 and - as equivalent to 1, the decision tree from earlier in the article would be
`represented like this:
 [[C,
   [E,
    S
   ]
  ],
  A
 ]
`
`The Rub
`We can compress the file, but unless we include the code table in the compressed file, we won't be able to
`uncompress it again. Files that can't be uncompressed are not very useful. (Sometimes you don't need to be able
`to decompress the file; in these cases the Perl unlink function implements a much simpler and more efficient
`form of compression.) But sometimes the code table is bigger than the savings that we got from compressing the
`file, especially if the original file is small. On a sample file of comp.lang.perl.misc articles, the compressor
`reduced the file size by 32%, from 42733 bytes to 29114. But when I ran it on its own source code, the file size
`increased from 987 bytes to 1321. The compressed data was only 631 bytes long, but the code table and other
`overhead took up 690 bytes.
`
`Another Rub
`
`Huffman coding is optimal in the sense of compressing a given set of symbols into the smallest space possible.
But if you readjust your idea of a `symbol', you can get much better compression.
`Suppose you're compressing English text. If you histogram the characters, and use Huffman's method, you'll get
`the optimal way to encode the characters. Because each character has its own code, your code would also serve
`to code arbitrary random gibberish. It should be clear that this expressiveness has a price. For English text, we
`don't need so much expressiveness; it's clearly more efficient to assign a code to each word instead of to each
character. In doing so, we lose the ability to encode random gibberish, but our compressed data becomes much
`smaller. Suppose there are about 2**17 words in English; if we assign each one a different 17-bit code, we can
`then encode our original seven-word message:
` THE THIRSTIEST SISTERS TEETH RESIST THIS STRESS
into 7*17 = 119 bits, already an improvement on the 128 we used before. And if we use Huffman's method to
`assign long codes to rare words and short codes to common words, we can do even better.
`Using this method, I compressed the comp.lang.perl.misc articles by 63%. Unfortunately, the code table blew
`up as a result, and took up 43739 bytes, which was larger than the original uncompressed data.
`
`Other Methods
`Most modern data compression doesn't use Huffman coding directly. A better method was proposed in 1977 and
`1978 by Jakob Ziv and Abraham Lempel. The Lempel-Ziv methods scan the input data, and when they find a
substring that occurs twice, they replace the second occurrence with a reference back to the first. The references
`can then be Huffman-compressed.
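To give the flavor of the idea, here is a toy back-reference compressor in Perl. It is nothing like the tuned algorithms inside real tools such as gzip; it just shows the replace-the-second-occurrence trick:

 # Emit literal characters, or [distance, length] references back
 # to text that has already been seen.
 sub lz_compress {
     my ($in) = @_;
     my (@out, $pos);
     for ($pos = 0; $pos < length $in; ) {
         my ($best_len, $best_dist) = (0, 0);
         for my $dist (1 .. $pos) {
             my $len = 0;
             $len++ while $len < $dist       # keep matches non-overlapping
                 and $pos + $len < length $in
                 and substr($in, $pos - $dist + $len, 1)
                     eq substr($in, $pos + $len, 1);
             ($best_len, $best_dist) = ($len, $dist) if $len > $best_len;
         }
         if ($best_len >= 3) {    # a reference only pays off if it's long enough
             push @out, [ $best_dist, $best_len ];
             $pos += $best_len;
         } else {
             push @out, substr($in, $pos, 1);
             $pos++;
         }
     }
     return \@out;
 }

 # Copy literals, and resolve references against the output so far.
 sub lz_decompress {
     my ($tokens) = @_;
     my $out = '';
     for my $t (@$tokens) {
         $out .= ref $t ? substr($out, length($out) - $t->[0], $t->[1]) : $t;
     }
     return $out;
 }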
`These methods have several advantages over the basic scheme I implemented. One important benefit of Lempel-
`Ziv compression methods is that they don't have to read and analyze the entire input in advance; they can
`construct a code table as they go along, basing it on only the portion of the input seen so far. This is important for
`practical applications. The data compression in your modem wouldn't be very useful if the modem had to gather
`and analyze a week's worth of traffic before it could send or receive anything.
Another benefit of this method is that because the code table is implicit in the input seen so far, the code table
`doesn't need to be recorded explicitly. The decoding process can infer the code table from the coded data itself.
This avoids the embarrassing inflation that we saw, where the code table took up more space than was
`actually saved by the compression process.
`One extra payoff of building the code table on the fly is that if the algorithm notices that the code table it's using
`isn't performing well, it can throw it away and start over in the middle. If the data is in several pieces that have
very different character, say a graphic image embedded in a text file, this method will be a win over straight
`Huffman coding because it'll be able to apply an appropriate code table to each part of the data. LZW, the
`compression algorithm used by the DOS lha program, doesn't do this, but the variation of it employed by the
`Unix compress program does. The algorithm used by the GNU project's gzip and zlib is different but also
`periodically throws away the code tables and starts over.
`The best thing about Lempel-Ziv and related methods is that they don't need to decide in advance how to break
`up the input into symbols. LZW, for example, puts any string that it hasn't seen before into the code table, under
`the assumption that if it appeared once, it'll probably appear again. As the input comes in, longer and longer
`substrings of the input go into the code table; long substrings go in only when their shorter prefixes have
`appeared multiple times. This means that if the file is best broken up into words, the algorithm will eventually
`detect that; if it is best broken up into single characters, the algorithm will detect that too.
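Here is a bare-bones sketch of that table-building idea; real LZW implementations also pack the codes into bits and limit the table size:

 sub lzw_compress {
     my ($in) = @_;
     my %dict = map { chr($_) => $_ } 0 .. 255;   # start with single characters
     my $next_code = 256;
     my ($w, @out) = ('');
     for my $c (split //, $in) {
         if (exists $dict{$w . $c}) {
             $w .= $c;                        # seen this string before; extend it
         } else {
             push @out, $dict{$w};            # emit the longest known prefix
             $dict{$w . $c} = $next_code++;   # remember the new, longer string
             $w = $c;
         }
     }
     push @out, $dict{$w} if length $w;
     return \@out;
 }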
`
`
`Other Directions
`If you want an easy but possibly amusing project, try altering the code table functions in the module so that the
`code tables are smaller. At present, the module does not assume that the coding is done per-character, so you
`shouldn't break that.
`Actually, though, assigning the codes per-word seems to blow up the code table more than it saves. I haven't
`found a case yet where it's worthwhile. I'd be interested in seeing more analysis of this.
`Finally, the decision-tree data structure in the decoder is probably over-clever. It would be simpler to represent
`the same information with a state table, this way:
` 0 1
` --------------------
` 0 1 A
` 1 C 2
` 2 E S
`Each internal node in the decision tree is assigned a state, which is a number; the root node gets the number 0. To
`decode, the decoder keeps track of the state number, which says what internal node it's currently at. Each time it
reads a bit, it looks in the table to decide where to go next. If it sees a symbol, that means it reached a leaf, and
should output that symbol and start over at state 0; if it sees a number, that means it's now at a different internal
node and should read more bits. I'm really not sure whether this would be more efficient than the way I did it,
`but it would probably be easier to understand.
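Here is a sketch of that decoder (mine, not the module's; it assumes no symbol looks like a state number):

 my @table = (
     [ 1,   'A' ],    # state 0: on 0 go to state 1, on 1 output A
     [ 'C', 2   ],    # state 1: on 0 output C, on 1 go to state 2
     [ 'E', 'S' ],    # state 2: on 0 output E, on 1 output S
 );

 sub decode_with_states {
     my ($bits) = @_;
     my ($state, $output) = (0, '');
     for my $bit (split //, $bits) {
         my $next = $table[$state][$bit];
         if ($next =~ /^\d+$/) {     # a state number: an internal node
             $state = $next;
         } else {                    # a symbol: output it and start over
             $output .= $next;
             $state = 0;
         }
     }
     return $output;
 }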
`
`Bibliography
`Symbols, Signals, and Noise, J.R. Pierce, pp. 94--97. Harper, New York, 1961.
The comp.compression FAQ is an excellent starting source of information, especially part 2. It is available from
`ftp://rtfm.mit.edu/pub/usenet/news.answers/compression-faq/.
`
`Notes
`My column finally has a name! Bricolage is a hot word these days in the computer learning business; it's French
`for tinkering or pottering. Thanks to Deven Ullman, who suggested it, and to the other people who suggested
`other names.
`
`mjd-tpj-huffman@plover.com
`