`
Mastering Algorithms with C

Kyle Loudon

O'REILLY®
Beijing · Cambridge · Farnham · Köln · Sebastopol · Tokyo
`
`
`
Mastering Algorithms with C
by Kyle Loudon

Copyright © 1999 O'Reilly Media, Inc. All rights reserved.
Printed in the United States of America.

Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

Editor: Andy Oram

Production Editor: Jeffrey Liggett

Printing History:

August 1999: First Edition.

Nutshell Handbook, the Nutshell Handbook logo, and the O'Reilly logo are registered trademarks of O'Reilly Media, Inc. Mastering Algorithms with C, the image of sea horses, and related trade dress are trademarks of O'Reilly Media, Inc. Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O'Reilly Media, Inc. was aware of a trademark claim, the designations have been printed in caps or initial caps.

While every precaution has been taken in the preparation of this book, the publisher assumes no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

ISBN: 978-1-565-92453-6
[LS1]

[2010-11-30]
`
`
`
Data Compression

Data compression is the process of reducing the number of bits used to represent data. It is one of the most significant results of information theory, an area of mathematics that addresses various ways to manage and manipulate information. Data compression entails two processes: in one process the data is compressed, or encoded, to reduce its size; in a second process it is uncompressed, or decoded, to return it to its original state.
`
`
To understand why data compression is possible, we must first understand that all data can be characterized by some informational content, called its entropy (a term borrowed from thermodynamics). Compression is possible because most data is represented with more bits than its entropy suggests is optimal. To gauge the effectiveness of compression, we divide the size of the compressed data by its original size and subtract this value from 1. The result is known as the data's compression ratio.
`
In the broadest sense, data compression methods are divided into two classes: lossy and lossless. In lossy compression we accept a certain loss of accuracy in exchange for greater compression ratios. This is acceptable in some applications, such as graphics and sound processing, provided the degradation is managed carefully. However, frequently we use lossless compression, which ensures that an exact copy of the original data is reproduced when uncompressed.
`
This chapter focuses on lossless compression, for which there are two general approaches: minimum redundancy coding and dictionary-based methods. Minimum redundancy coding achieves compression by encoding symbols that occur with great frequency using fewer bits than for those that occur less often. Dictionary-based methods encode data in terms of tokens that take the place of redundant phrases. Example 14-1 is a header for the compression methods presented in this chapter.
`
`
`
`
`
`Chapter 14: Data Compression
`
`This chapter covers:
`
`Bit operations
`
An important part of data compression because most methods require operating on data one bit at a time to some degree. C provides a number of bitwise operators that can be used to implement an extended class of bit operations.
`
`Huffman coding
`
One of the oldest and most elegant forms of compression based on minimum redundancy coding. Fundamental to Huffman coding is the construction of a Huffman tree, which is used both to encode and decode the data. Huffman coding is not the most effective form of compression, but it runs fast both when compressing and uncompressing data.
`
LZ77 (Lempel-Ziv-1977)

One of the fundamental methods of dictionary-based compression. LZ77 uses a sliding window and a look-ahead buffer to encode symbols in terms of phrases encountered earlier in the data. LZ77 generally results in better compression ratios than Huffman coding, but with longer compression times. However, uncompressing data is generally very fast.
`
`Some applications of lossless data compression are:
`
Software distribution

The process of delivering software on various media. When distributing software on physical media, such as compact discs or magnetic tapes and diskettes, reducing the amount of storage required can produce considerable cost savings in mass distributions.
`
Archiving

Collecting groups of files into organized libraries. Typically, archives contain large amounts of data. Thus, after creating archives, frequently we compress them.
`
`Mobile computing
`
`An area of computing in which devices typically have limited amounts of
`memory and secondary storage. Mobile computing generally refers to comput-
`ing with small, portable devices such as advanced programmable calculators,
`electronic organizers, and other personal computing devices.
`
Optimized networking (illustrated in this chapter)

Compression is used especially when sending large amounts of data across wide-area networks. Bandwidth at certain points along wide-area networks is often limited. Although compressing and uncompressing data does require time, in many network applications the cost is well justified.
`
Embedded applications

An area of computing similar to mobile computing in that devices typically have somewhat limited amounts of memory and secondary storage. Examples of embedded applications are lab instruments, avionics (aircraft electronics), VCRs, home stereos, and other pieces of equipment built around microcontrollers.
`
Database systems

Typically, large systems that can be optimized by reducing their size to some extent. Databases may be compressed at the record or file level.

Online manuals

Manuals that are accessed directly on a computer. Online manuals are typically of considerable size, but many sections are not accessed on a regular basis. Therefore, it is common to store them in a compressed form and uncompress sections only as they are needed.
`
Example 14-1. Header for Data Compression

/*****************************************************************************
*
*  ------------------------------ compress.h -----------------------------
*
*****************************************************************************/

#ifndef COMPRESS_H
#define COMPRESS_H

#include "bitree.h"

/*****************************************************************************
*
*  Define a structure for nodes of Huffman trees.
*
*****************************************************************************/

typedef struct HuffNode_ {

unsigned char      symbol;
int                freq;

} HuffNode;

/*****************************************************************************
*
*  Define a structure for entries in Huffman code tables.
*
*****************************************************************************/

typedef struct HuffCode_ {

unsigned char      used;
unsigned short     code;
unsigned char      size;

} HuffCode;

/*****************************************************************************
*
*  Define the number of bits required for LZ77 token members.
*
*****************************************************************************/

#define LZ77_TYPE_BITS     1
#define LZ77_WINOFF_BITS   12
#define LZ77_BUFLEN_BITS   5
#define LZ77_NEXT_BITS     8

/*****************************************************************************
*
*  Define the size of the sliding window and the look-ahead buffer for
*  LZ77. Each must be less than or equal to 2 raised to LZ77_WINOFF_BITS
*  and LZ77_BUFLEN_BITS respectively.
*
*****************************************************************************/

#define LZ77_WINDOW_SIZE   4096
#define LZ77_BUFFER_SIZE   32

/*****************************************************************************
*
*  Define the number of bits for LZ77 phrase tokens.
*
*****************************************************************************/

#define LZ77_PHRASE_BITS   (LZ77_TYPE_BITS+LZ77_WINOFF_BITS\
                           +LZ77_NEXT_BITS+LZ77_BUFLEN_BITS)

/*****************************************************************************
*
*  Define the number of bits for LZ77 symbol tokens.
*
*****************************************************************************/

#define LZ77_SYMBOL_BITS   (LZ77_TYPE_BITS+LZ77_NEXT_BITS)

/*****************************************************************************
*
*  --------------------------- Public Interface --------------------------
*
*****************************************************************************/

int huffman_compress(const unsigned char *original, unsigned char
   **compressed, int size);

int huffman_uncompress(const unsigned char *compressed, unsigned char
   **original);

int lz77_compress(const unsigned char *original, unsigned char **compressed,
   int size);
`
`
`
`
Example 14-1. Header for Data Compression (continued)

int lz77_uncompress(const unsigned char *compressed, unsigned char
   **original);

#endif
`
Description of Bit Operations
`
When compressing and uncompressing data, often we need to perform operations on less than a single byte. Therefore, before discussing various methods of data compression, it is important to become familiar with some basic operations for working with data one bit at a time. These operations are necessary because the bit operators in C work only with intrinsic integral operands, which are small. The operations presented in this section work with buffers containing any number of bits. Note that the set of operations presented here is rather incomplete. Specifically, only those that are used in this chapter and in Chapter 15, Data Encryption, are defined.
`
`Interface for Bit Operations
`
bit_get

int bit_get(const unsigned char *bits, int pos);

Return Value   State of the desired bit: 1 or 0.

Description    Gets the state of the bit at position pos in the buffer bits. The leftmost position in the buffer is 0. The state returned is either 1 or 0.

Complexity     O(1)
`
bit_set

void bit_set(unsigned char *bits, int pos, int state);

Return Value   None.

Description    Sets the state of the bit at position pos in the buffer bits to the value specified by state. The leftmost position in the buffer is 0. The state must be 1 or 0.

Complexity     O(1)
`
`
bit_xor

void bit_xor(const unsigned char *bits1, const unsigned char *bits2,
   unsigned char *bitsx, int size);

Return Value   None.

Description    Computes the bitwise XOR (exclusive OR) of the two buffers bits1 and bits2, each containing size bits, and returns the result in bitsx. The bitwise XOR of two binary operands yields 0 in each position i of the result where in position i of the operands the bits are the same, and 1 in each position where the bits are different. For example, 11010 ⊕ 01011 = 10001 (⊕ denotes XOR). It is the responsibility of the caller to manage the storage required by bitsx.

Complexity     O(β), where β is the number of bits in each buffer.
`
bit_rot_left

void bit_rot_left(unsigned char *bits, int size, int count);

Return Value   None.

Description    Rotates the buffer bits, containing size bits, to the left count bits. After the operation, the leftmost count bits become the count rightmost bits in the buffer, and all other bits are shifted accordingly.

Complexity     O(nβ), where n is the number of bits rotated to the left and β is the number of bits in the buffer.
`
`Implementation and Analysis
`of Bit Operations
`
Each bit operation works with a buffer of data defined as a pointer to an unsigned character. This pointer points to as many bytes as are required to represent the number of bits in the buffer. If the number of bits in the buffer is not a multiple of 8, some bits in the final byte are not used.
`
bit_get

The bit_get operation gets the state of a bit in a buffer (see Example 14-2). To do this, we determine in which byte the desired bit resides and then use a mask to get the specific bit from that byte. The bit set to 1 in mask determines which bit will be read from the byte. We use a loop to shift this bit into the proper position. We fetch the desired bit by indexing to the appropriate byte in bits and applying the mask.
`
The runtime complexity of bit_get is O(1). This is because all of the steps in getting the state of a bit in a buffer run in a constant amount of time.
`
bit_set

The bit_set operation sets the state of a bit in a buffer (see Example 14-2). This operation works similarly to bit_get, except that it uses the mask to set the state of the specified bit rather than to get it.

The runtime complexity of bit_set is O(1). This is because all of the steps in setting the state of a bit in a buffer run in a constant amount of time.
`
bit_xor

The bit_xor operation computes the bitwise XOR (exclusive OR) of two buffers, bits1 and bits2, and places the result in another buffer, bitsx (see Example 14-2). To do this, we compare the bit in position i of bits1 with the bit in position i of bits2. If the bits are the same, we set the bit in position i of bitsx to 0; otherwise, we set the bit in position i of bitsx to 1. This process continues for as many bits as are in each buffer, as specified by size.

The runtime complexity of bit_xor is O(β), where β is the number of bits in each buffer. This is because the loop in the operation iterates once for each bit.
`
bit_rot_left

The bit_rot_left operation rotates a buffer a specified number of bits to the left (see Example 14-2). We begin by saving the leftmost bit of the leftmost byte and then shifting each byte one bit to the left. As we shift each byte, we set the rightmost bit of the preceding byte to the bit shifted off the left of the current byte. Once we have shifted the last byte, we set its rightmost bit to the bit shifted off the first byte. This process is repeated as many times as the number of bits to be rotated.

The runtime complexity of bit_rot_left is O(nβ), where n is the number of bits rotated to the left and β is the number of bits in the buffer. This is because for each rotation, (β/8) + 1 shifts are performed to the left.
`
Example 14-2. Implementation of Bit Operations

/*****************************************************************************
*
*  --------------------------------- bit.c --------------------------------
*
*****************************************************************************/

#include <string.h>

#include "bit.h"
`
`
Example 14-2. Implementation of Bit Operations (continued)

for (i = 0; i < (pos % 8); i++)
   mask = mask >> 1;
`
/*****************************************************************************
*
*  Set the bit.
*
*****************************************************************************/

if (state)
   bits[pos / 8] = bits[pos / 8] | mask;
else
   bits[pos / 8] = bits[pos / 8] & (~mask);

return;

}
`
/*****************************************************************************
*
*  -------------------------------- bit_xor -------------------------------
*
*****************************************************************************/

void bit_xor(const unsigned char *bits1, const unsigned char *bits2, unsigned
   char *bitsx, int size) {

int                i;

/*****************************************************************************
*
*  Compute the bitwise XOR (exclusive OR) of the two buffers.
*
*****************************************************************************/

for (i = 0; i < size; i++) {

   if (bit_get(bits1, i) != bit_get(bits2, i))
      bit_set(bitsx, i, 1);
   else
      bit_set(bitsx, i, 0);

}

return;

}
`i))
`
/*****************************************************************************
*
*  ------------------------------ bit_rot_left ----------------------------
*
*****************************************************************************/
Example 14-2. Implementation of Bit Operations (continued)

void bit_rot_left(unsigned char *bits, int size, int count) {

int                fbit,
                   lbit,
                   i,
                   j;
`
/*****************************************************************************
*
*  Rotate the buffer to the left the specified number of bits.
*
*****************************************************************************/

if (size > 0) {

   for (j = 0; j < count; j++) {

      for (i = 0; i <= ((size - 1) / 8); i++) {

         /********************************************************************
         *
         *  Get the bit about to be shifted off the current byte.
         *
         ********************************************************************/

         lbit = bit_get(&bits[i], 0);

         if (i == 0) {

            /*****************************************************************
            *
            *  Save the bit shifted off the first byte for later.
            *
            *****************************************************************/

            fbit = lbit;

            }

         else {

            /*****************************************************************
            *
            *  Set the rightmost bit of the previous byte to the leftmost
            *  bit about to be shifted off the current byte.
            *
            *****************************************************************/

            bit_set(&bits[i - 1], 7, lbit);

            }
`
Example 14-2. Implementation of Bit Operations (continued)

         /********************************************************************
         *
         *  Shift the current byte to the left.
         *
         ********************************************************************/

         bits[i] = bits[i] << 1;

         }

      /***********************************************************************
      *
      *  Set the rightmost bit of the buffer to the bit shifted off the
      *  first byte.
      *
      ***********************************************************************/

      bit_set(bits, size - 1, fbit);

      }

   }

return;

}
`
Description of Huffman Coding

One of the oldest and most elegant forms of data compression is Huffman coding, an algorithm based on minimum redundancy coding. Minimum redundancy coding suggests that if we know how often different symbols occur in a set of data, we can represent the symbols in a way that makes the data require less space. The idea is to encode symbols that occur more frequently with fewer bits than those that occur less frequently. It is important to realize that a symbol is not necessarily a character of text: a symbol can be any amount of data we choose, but it is often one byte's worth.
`
Entropy and Minimum Redundancy

To begin, let's revisit the concept of entropy introduced at the beginning of the chapter. Recall that every set of data has some informational content, which is called its entropy. The entropy of a set of data is the sum of the entropies of each of its symbols. The entropy S of a symbol z is defined as:

Sz = -lg Pz
`
where Pz is the probability of z being found in the data. If it is known exactly how many times z occurs, Pz is referred to as the frequency of z. As an example, if z occurs 8 times in 32 symbols, or one-fourth of the time, the entropy of z is:

-lg(1/4) = 2 bits

This means that using any more than two bits to represent z is more than we need. If we consider that normally we represent a symbol using eight bits (one byte), we see that compression here has the potential to improve the representation a great deal.
`
Table 14-1 presents an example of calculating the entropy of some data containing 72 instances of five different symbols. To do this, we sum the entropies contributed by each symbol. Using "U" as an example, the total entropy for a symbol is computed as follows. Since "U" occurs 12 times out of the 72 total, each instance of "U" has an entropy that is calculated as:

-lg(12/72) = 2.584963 bits

Consequently, because "U" occurs 12 times in the data, its contribution to the entropy of the data is calculated as:

(2.584963)(12) = 31.01955 bits

In order to calculate the overall entropy of the data, we sum the total entropies contributed by each symbol. To do this for the data in Table 14-1, we have:

31.01955 + 36.00000 + 23.53799 + 33.94552 + 36.95994 = 161.46300 bits

If using 8 bits to represent each symbol yields a data size of (72)(8) = 576 bits, we should be able to compress this data, in theory, by up to:

1 - (161.46300/576) = 72.0%
`
Table 14-1. The Entropy of a Set of Data Containing 72 Instances of 5 Different Symbols

Symbol   Probability   Entropy of Each Instance   Total Entropy
U        12/72         2.584963                   31.01955
V        18/72         2.000000                   36.00000
W        7/72          3.362570                   23.53799
X        15/72         2.263034                   33.94552
Y        20/72         1.847997                   36.95994
`
Building a Huffman Tree

Huffman coding presents a way to approximate the optimal representation of data based on its entropy. It works by building a data structure called a Huffman tree,
`
which is a binary tree (see Chapter 9, Trees) organized to generate Huffman codes. Huffman codes are the codes assigned to symbols in the data to achieve compression. However, Huffman codes result in compression that only approximates the data's entropy because, as you may have noticed in Table 14-1, the entropies of symbols often come out to be fractions of bits. Since the actual number of bits used in Huffman codes cannot be fractions in practice, some codes end up with slightly too many bits to be optimal.
`
Figure 14-1 illustrates the process of building a Huffman tree from the data in Table 14-1. Building a Huffman tree proceeds from its leaf nodes upward. To begin, we place each symbol and its frequency in its own tree (see Figure 14-1, step 1). Next, we merge the two trees whose root nodes have the smallest frequencies and store the sum of the frequencies in the new tree's root (see Figure 14-1, step 2). This process is then repeated until we end up with a single tree (see Figure 14-1, step 5), which is the final Huffman tree. The root node of this tree contains the total number of symbols in the data, and its leaf nodes contain the original symbols and their frequencies. Because Huffman coding continually seeks out the two trees that appear to be the best to merge at any given time, it is a good example of a greedy algorithm (see Chapter 1, Introduction).
`
Compressing and Uncompressing Data

Building a Huffman tree is part of both compressing and uncompressing data. To compress data using a Huffman tree, given a specific symbol, we start at the root of the tree and trace a path to the symbol's leaf. As we descend along the path, whenever we move to the left, we append 0 to the current code; whenever we move to the right, we append 1. Thus, in Figure 14-1, step 6, to determine the Huffman code for "U" we move to the right (1), then to the left (10), and then to the right again (101). The Huffman codes for all of the symbols in the figure are:

U = 101, V = 01, W = 100, X = 00, Y = 11

To uncompress data using a Huffman tree, we read the compressed data bit by bit. Starting at the tree's root, whenever we encounter 0 in the data, we move to the left in the tree; whenever we encounter 1, we move to the right. Once we reach a leaf node, we generate the symbol it contains, move back to the root of the tree, and repeat the process until we exhaust the compressed data. Uncompressing data in this manner is possible because Huffman codes are prefix free, which means that no code is a prefix of any other. This ensures that once we encounter a sequence of bits that matches a code, there is no ambiguity as to the symbol it represents. For example, notice that 01, the code for "V," is not a prefix of any of the other codes. Thus, as soon as we encounter 01 in the compressed data, we know that the code must represent "V."
`
[Figure 14-1 shows six panels: (1) the initial forest of single-node binary trees, one root per symbol with its frequency; (2) through (5) the successive merges of the two trees whose roots have the smallest frequencies, ending with the merge of the trees containing frequencies 33 and 39; and (6) generating the Huffman code for the symbol "U".]

Figure 14-1. Building a Huffman tree from the symbols and frequencies in Table 14-1
`
Effectiveness of Huffman Coding

To determine the reduced size of data compressed using Huffman coding, we calculate the product of each symbol's frequency times the number of bits in its Huffman code, then add them together. Thus, to calculate the compressed size of the data presented in Table 14-1 and Figure 14-1, we have:

(12)(3) + (18)(2) + (7)(3) + (15)(2) + (20)(2) = 163 bits

Assuming that without compression each of the 72 symbols would be represented with 8 bits, for a total data size of 576 bits, we end up with the following compression ratio:

1 - (163/576) = 71.7%
`
Once again, considering the fact that we cannot take into account fractional bits in Huffman coding, in many cases this value will not be quite as good as the data's entropy suggests, although in this case it is very close.
`
In general, Huffman coding is not the most effective form of compression, but it runs fast both when compressing and uncompressing data. Generally, the most time-consuming aspect of compressing data with Huffman coding is the need to scan the data twice: once to gather frequencies, and a second time actually to compress the data. Uncompressing the data is particularly efficient because decoding the sequence of bits for each symbol requires only a brief scan of the Huffman tree, which is bounded.
`
Interface for Huffman Coding

huffman_compress

int huffman_compress(const unsigned char *original, unsigned char **compressed,
   int size);

Return Value   Number of bytes in the compressed data if compressing the data is successful, or -1 otherwise.

Description    Uses Huffman coding to compress a buffer of data specified by original, which contains size bytes. The compressed data is written to a buffer returned in compressed. Since the amount of storage required in compressed is unknown to the caller, huffman_compress dynamically allocates the necessary storage using malloc. It is the responsibility of the caller to free this storage using free when it is no longer needed.

Complexity     O(n), where n is the number of symbols in the original data.
`
huffman_uncompress

int huffman_uncompress(const unsigned char *compressed, unsigned
   char **original);

Return Value   Number of bytes in the restored data if uncompressing the data is successful, or -1 otherwise.

Description    Uses Huffman coding to uncompress a buffer of data specified by compressed. It is assumed that the buffer contains data previously compressed with huffman_compress. The restored data is written to a buffer returned in original. Since the amount of storage required in original may not be known to the caller, huffman_uncompress dynamically allocates the necessary storage using malloc. It is the responsibility of the caller to free this storage using free when it is no longer needed.

Complexity     O(n), where n is the number of symbols in the original data.
`
`
Implementation and Analysis of Huffman Coding

With Huffman coding, we try to compress data by encoding symbols as Huffman codes generated in a Huffman tree. To uncompress the data, we rebuild the Huffman tree used in the compression process and convert each code back to the symbol it represents. In the implementation presented here, a symbol in the original data is one byte.
`
huffman_compress

The huffman_compress operation (see Example 14-3) compresses data using Huffman coding. It begins by scanning the data to determine the frequency of each symbol. The frequencies are placed in an array, freqs. After scanning the data, the frequencies are scaled so that each can be represented in a single byte. This is done by determining the maximum number of times any symbol occurs in the data and adjusting the other frequencies accordingly. Since symbols that do not occur in the data should be the only ones with frequencies of 0, we perform a simple test to ensure that any nonzero frequencies that scale to less than 1 end up being set to 1 instead of 0.
`
Once we have determined and scaled the frequencies, we call build_tree to build the Huffman tree. The build_tree function begins by inserting into a priority queue one binary tree for each symbol occurring at least once in the data. Nodes in the trees are HuffNode structures (see Example 14-1). This structure consists of two members: symbol is a symbol from the data (used only in leaf nodes), and freq is a frequency. Each tree initially contains only a single node, which stores one symbol and its scaled frequency as recorded and scaled in the freqs array.

To build the Huffman tree, we use a loop to perform size - 1 merges of the trees within the priority queue. On each iteration, we call pqueue_extract twice to extract the two binary trees whose root nodes have the smallest frequencies. We then sum the frequencies, merge the trees into a new one, store the sum of the frequencies in the new tree's root, and insert the new tree back into the priority queue. We continue this process until, after size - 1 iterations, the only tree remaining in the priority queue is the final Huffman tree.
`
Using the Huffman tree built in the previous step, we call build_table to build a table of the Huffman codes assigned to every symbol. Each entry in the table is a HuffCode structure. This structure consists of three members: used is a flag set to 1 or 0 indicating whether the entry has a code stored in it, code is the Huffman code stored in the entry, and size is the number of bits the code contains. Each code is a short integer because it can be proven (although this is not shown here)
that when all frequencies are scaled to fit within one byte, no code will be longer than 16 bits.
`
We build the table by traversing the Huffman tree using a preorder traversal (see Chapter 9). In each activation of build_table, code keeps track of the current Huffman code being generated, and size maintains the number of bits it contains. As we traverse the tree, each time we move to the left, we append 0 to the code; each time we move to the right, we append 1. Once we encounter a leaf node, we store the Huffman code into the table of codes at the appropriate entry. As we store each code, we call the network function htons as a convenient way to ensure that the code is stored in big-endian format. This is the format required when we actually generate the compressed data in the next step as well as when we uncompress it.
`
While generating the compressed data, we use ipos to keep track of the current byte being processed in the original data, and opos to keep track of the current bit we are writing to the buffer of compressed data. To begin, we write a header that will help to rebuild the Huffman tree in huffman_uncompress. The header contains a four-byte value for the number of symbols about to be encoded followed by the scaled frequencies of all 256 possible symbols, including those that are 0. Finally, to encode the data, we read one symbol at a time, look up its Huffman code in the table, and write each code to the compressed buffer. We allocate space for each byte in the compressed buffer as we need it.
`
The runtime complexity of huffman_compress is O(n), where n is the number of symbols in the original data. Only two parts of the algorithm depend on the size of the data: the part in which we determine the frequency of each symbol, and the part in which we read the data so we can compress it. Each of these runs in O(n) time. The time to build the Huffman tree does not affect the complexity of huffman_compress because the running time of this process depends only on the number of different symbols in the data, which in this implementation i