`
`Office of Business Enterprises
`Duplication Services Section
`
THIS IS TO CERTIFY that the collections of the Library of Congress contain a bound
volume entitled THE DATA COMPRESSION BOOK, call number QA 76.9.D33 N46, Copy
2. The attached pages (Cover Page, Title Page, Second Title Page, Copyright Page, Table
of Contents Pages, and Chapter 5) are a true and complete representation from that work.
`
THIS IS TO CERTIFY FURTHER, that the work is marked with a Library of Congress
`Copyright Office stamp dated January 29, 1992.
`
`IN WITNESS WHEREOF, the seal of the Library of Congress is affixed hereto on
`May 18, 2018.
`
[signature]
`Deirdre Scott
`Business Enterprises Officer
`Office of Business Enterprises
`Library of Congress
`
101 Independence Avenue, SE, Washington, DC 20540-4917  Tel 202.707.5650  www.loc.gov; duplicationservices@loc.gov
`
`
`NETFLIX, INC
`Exhibit 1018
`IPR2018-01630
`
`
`
Featuring fast, efficient data
compression techniques in C
`
`Mark Nelson
`
M&T
`
`
`
`
`The Data
`Compression
`Book
`
`
`
`
`The Data
`Compression
`Book
`
Featuring fast, efficient data
compression techniques in C
`
`Mark Nelson
`
`
`
`
`......
`
`• •
`
`• ••••
`
`.1•••••
`
M&T Books
`A Division of M&T Publishing, Inc.
`501 Galveston Drive
`Redwood City, CA 94063
`
© 1991 by M&T Publishing, Inc.
`
`Printed in the United States of America
`
All rights reserved. No part of this book may be reproduced or transmitted in any form or by any means,
electronic or mechanical, including photocopying, recording, or by any information storage and
retrieval system, without prior written permission from the Publisher. Contact the Publisher for
information on foreign rights.
`
`Limits of Liability and Disclaimer of Warranty
The Author and Publisher of this book have used their best efforts in preparing the book and the
`programs contained in it. These efforts include the development, research, and testing of the theories
`and programs to determine their effectiveness.
`
The Author and Publisher make no warranty of any kind, expressed or implied, with regard to these
programs or the documentation contained in this book. The Author and Publisher shall not be liable in
any event for incidental or consequential damages in connection with, or arising out of, the furnishing,
performance, or use of these programs.
`
`Library of Congress Cataloging-in-Publication Data
`
Nelson, Mark, 1958-
    The data compression book / by Mark Nelson.
        p. cm.
    Includes index.
    ISBN 1-55851-214-4 : $26.95 -- ISBN 1-55851-216-0 (Book/disk set) : $36.95
    1. Data Compression (Computer Science) I. Title.
    QA76.9.D33N46 1991
    005.74'6—dc20                                          91-23080
                                                                CIP
`
`Editor: Tova F. Hiegel
`Technical review: Robert Jung
`
`94 93 92 91
`
`4 3 2 1
`
`Cover Design: Lauren Smith Design
Layout: Marni Tapscott
`
`
`
`
`Contents
`
WHY THIS BOOK IS FOR YOU ........................................ 1

INTRODUCTION TO DATA COMPRESSION ................................ 3
    The Audience ................................................ 4
    Why C? ...................................................... 4
    Which C? .................................................... 5
    Keeping Score ............................................... 9
    The Structure .............................................. 10
`
THE DATA-COMPRESSION LEXICON, WITH A HISTORY ................... 13
    The Two Kingdoms ........................................... 13
    Data Compression = Modeling + Coding ....................... 14
    The Dawn Age ............................................... 15
    Coding ..................................................... 17
    An Improvement ............................................. 18
    Modeling ................................................... 20
    Statistical Modeling ....................................... 20
    Dictionary Schemes ......................................... 22
    Ziv and Lempel ............................................. 23
    LZ77 ....................................................... 23
    LZ78 ....................................................... 24
    Lossy Compression .......................................... 24
    Programs to Know ........................................... 25
`
`
`
`
`
THE DAWN AGE: MINIMUM REDUNDANCY CODING ........................ 29
    The Shannon-Fano Algorithm ................................. 31
    The Huffman Algorithm ...................................... 34
    Huffman in C ............................................... 38
`
`
`
`
`
    BITIO.C .................................................... 38
    A Reminder about Prototypes ................................ 47
    MAIN-C.C and MAIN-E.C ...................................... 49
    MAIN-C.C ................................................... 55
    ERRHAND.C .................................................. 56
    Into the Huffman Code ...................................... 58
    Counting the Symbols ....................................... 59
    Saving the Counts .......................................... 59
    Building the Tree .......................................... 61
    Using the Tree ............................................. 62
    The Compression Code ....................................... 63
    Putting It All Together .................................... 79
    Performance ................................................ 79

A SIGNIFICANT IMPROVEMENT: ADAPTIVE HUFFMAN CODING ............. 81
    Adaptive Coding ............................................ 82
    Updating the Huffman Tree .................................. 83
    What Swapping Does ......................................... 87
    The Algorithm .............................................. 88
    An Enhancement ............................................. 89
    The Escape Code ............................................ 90
    The Overflow Problem ....................................... 91
    A Rescaling Bonus .......................................... 95
    The Code ................................................... 95
    Initialization of the Array ................................ 97
    The Compress Main Program .................................. 98
    The Expand Main Program .................................... 99
    Encoding the Symbol ....................................... 100
    Decoding the Symbol ....................................... 108
    The Code .................................................. 109
`
`
`
`
`
`
`
`
HUFFMAN ONE BETTER: ARITHMETIC CODING ......................... 123
    Difficulties .............................................. 123
    Arithmetic Coding: A Step Forward ......................... 124
    Practical Matters ......................................... 128
    A Complication ............................................ 130
    Decoding .................................................. 132
    Where's the Beef? ......................................... 133
    The Code .................................................. 134
    The Compression Program ................................... 134
    The Expansion Program ..................................... 136
    Initializing the Model .................................... 137
    Reading the Model ......................................... 141
    Initializing the Encoder .................................. 141
    The Encoding Process ...................................... 142
    Flushing the Encoder ...................................... 145
    The Decoding Process ...................................... 145
    Summary ................................................... 148
    Code ...................................................... 148

STATISTICAL MODELING .......................................... 167
    Higher-Order Modeling ..................................... 167
    Finite Context Modeling ................................... 168
    Adaptive Modeling ......................................... 169
    A Simple Example .......................................... 170
    Improvements .............................................. 176
    Highest-Order Modeling .................................... 176
    Updating the Model ........................................ 178
    Escape Probabilities ...................................... 179
    Scoreboarding ............................................. 180
    Data Structures ........................................... 181
    The Finishing Touches: Tables 1 and 2 ..................... 185
`
`
`
`
`
    Model Flushing ............................................ 186
    Implementation ............................................ 186
    Conclusions ............................................... 187
    Enhancement ............................................... 187
    ARITH-N Listing ........................................... 188

DICTIONARY-BASED COMPRESSION .................................. 219
    An Example ................................................ 219
    Static vs. Adaptive ....................................... 220
    Adaptive Methods .......................................... 221
    A Representative Example .................................. 223
    Israeli Roots ............................................. 225
    History ................................................... 226
    ARC: The Father of MS-DOS Dictionary Compression .......... 227
    Dictionary Compression .................................... 228
    Danger Ahead—Patents ...................................... 230
    Conclusion ................................................ 231

SLIDING WINDOW COMPRESSION .................................... 233
    The Algorithm ............................................. 233
    Problems with LZ77 ........................................ 238
    An Encoding Problem ....................................... 240
    LZSS Compression .......................................... 240
    Data Structures ........................................... 242
    A Balancing Act ........................................... 244
    Greedy vs. Best Possible .................................. 246
    The Code .................................................. 247
    Constants and Macros ...................................... 247
    Global Variables .......................................... 251
    The Compression Code ...................................... 252
    Initialization ............................................ 253
    The Main Loop ............................................. 254
`
`
`
`
    The Exit Code ............................................. 256
    AddString() ............................................... 257
    DeleteString() ............................................ 260
    Binary Tree Support Routines .............................. 262
    The Expansion Routine ..................................... 263
    Improvements .............................................. 265
    The Code .................................................. 266

LZ78 COMPRESSION .............................................. 277
    Can LZ77 Improve? ......................................... 278
    Enter LZ78 ................................................ 279
    LZ78 Details .............................................. 279
    LZ78 Implementation ....................................... 282
    An Effective Variant ...................................... 285
    Decompression ............................................. 287
    The Catch ................................................. 288
    LZW Implementation ........................................ 290
    Tree Maintenance and Navigation ........................... 290
    Compression ............................................... 293
    Decompression ............................................. 294
    The Code .................................................. 296
    Improvements .............................................. 302
    Patents ................................................... 311

SPEECH COMPRESSION ............................................ 313
    Digital Audio Concepts .................................... 313
    Fundamentals .............................................. 314
    Sampling Variables ........................................ 319
    PC-Based Sound ............................................ 323
    Lossless Compression of Sound ............................. 324
    Problems and Results ...................................... 325
    Lossy Compression ......................................... 328
`
`
`
`
`.......
`
`.
`
    Silence Compression ....................................... 329
    Other Techniques .......................................... 345

LOSSY GRAPHICS COMPRESSION .................................... 347
    Enter Compression ......................................... 348
    Statistical and Dictionary Compression Methods ............ 348
    Lossy Compression ......................................... 349
    Differential Modulation ................................... 350
    Adaptive Coding ........................................... 351
    A Standard That Works: JPEG ............................... 353
    JPEG Compression .......................................... 354
    The Discrete Cosine Transform ............................. 354
    DCT Specifics ............................................. 357
    Why Bother? ............................................... 358
    Implementing the DCT ...................................... 359
    Matrix Multiplication ..................................... 360
    Continued Improvements .................................... 362
    Output of the DCT ......................................... 363
    Quantization .............................................. 364
    Selecting a Quantization Matrix ........................... 365
    Coding .................................................... 369
    The Zig-Zag Sequence ...................................... 369
    Entropy Encoding .......................................... 372
    What About Color? ......................................... 373
    The Sample Program ........................................ 373
    Input Format .............................................. 374
    The Code .................................................. 375
    Initialization ............................................ 376
    The Forward DCT Routine ................................... 378
    WriteDCTData() ............................................ 379
    File Expansion ............................................ 382
    ReadDCTData() ............................................. 383
`
`
`
`
    Input DCT Code ............................................ 384
    The Inverse DCT ........................................... 385
    The Complete Code Listing ................................. 387
    Support Programs .......................................... 400
    Some Compression Results .................................. 406

AN ARCHIVING PACKAGE .......................................... 409
    CAR and CARMAN ............................................ 409
    The CARMAN Command Set .................................... 410
    The CAR File .............................................. 412
    The Header ................................................ 413
    Storing the Header ........................................ 413
    The Header CRC ............................................ 417
    Command-Line Processing ................................... 418
    Generating the File List .................................. 421
    Opening the Archive Files ................................. 426
    The Main Processing Loop .................................. 428
    Skipping/Copying Input File ............................... 434
    File Insertion ............................................ 435
    File Extraction ........................................... 436
    Cleanup ................................................... 438
    The Code .................................................. 438

APPENDIX A: STATISTICS FOR COMPRESSION PROGRAMS ............... 489

APPENDIX B: TEST PROGRAMS ..................................... 495

GLOSSARY ...................................................... 505

BIBLIOGRAPHY .................................................. 517
    Other Resources ........................................... 518
`
`
`
`
`
`
AFTERWORD ..................................................... 521

INDEX ......................................................... 523
`
`
`
`
`CHAPTER 5
`
`Huffman One Better:
`Arithmetic Coding
`
`The last two chapters show that Huffman coding uses knowledge about information
`content to efficiently encode symbols. If the probability of a symbol's appearance in
`a message is known, Huffman techniques can encode that symbol using a minimal
`number of bits. Huffman coding has been proven the best fixed-length coding method
`available.
`
`Difficulties
`Huffman codes have to be an integral number of bits long, and this can sometimes
`be a problem. If the probability of a character is 1/3, for example, the optimum number
`of bits to code that character is around 1.6. Huffman coding has to assign either one
`or two bits to the code, and either choice leads to a longer compressed message than
`is theoretically possible.
`This nonoptimal coding becomes a noticeable problem when the probability of a
`character is very high. If a statistical method could assign a 90 percent probability to a given
`character, the optimal code size would be 0.15 bits. The Huffman coding system would
probably assign a 1-bit code to the symbol, which is six times longer than necessary.
`This would be a problem when compressing two-color images, like those coming
`from a fax machine. Since there are only two colors, an ordinary coding method would
`assign the 1 bit to one color and the 0 bit to the other. Since both codes have only a
`single bit, Huffman coding is not going to be able to compress this data at all. No
`matter how high the probability of one of the bits, we are still going to have to encode
`it using one bit.
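
These figures come straight from information theory: the optimal code length for a
symbol of probability p is its information content, -log2(p) bits. A quick C sketch
(not one of this chapter's listings) reproduces both numbers used above:

#include <stdio.h>
#include <math.h>

/* Optimal code length, in bits, for a symbol of probability p. */
double optimal_bits( double p )
{
    return -log( p ) / log( 2.0 );
}

int main( void )
{
    printf( "p = 1/3: %.2f bits\n", optimal_bits( 1.0 / 3.0 ) );  /* about 1.6  */
    printf( "p = 0.9: %.2f bits\n", optimal_bits( 0.9 ) );        /* about 0.15 */
    return 0;
}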
`
`123
`
`Page 14
`
`
`
`
The conventional solution to this problem is to group the bits into packets and
apply Huffman coding to the packets. But this workaround does not cure the underlying
weakness, which prevents Huffman coding from being a universal compressor.
`
`Arithmetic Coding: A Step Forward
`Only in the last ten years has a respectable candidate to replace Huffman coding
`been successfully demonstrated: arithmetic coding. Arithmetic coding bypasses the
`idea of replacing an input symbol with a specific code. It replaces a stream of input
`symbols with a single floating-point output number. More bits are needed in the output
`number for longer, complex messages. This concept has been known for some time,
`but only recently were practical methods found to implement arithmetic coding on
`computers with fixed-sized registers.
`The output from an arithmetic coding process is a single number less than 1 and
`greater than or equal to 0. This single number can be uniquely decoded to create the
`exact stream of symbols that went into its construction. To construct the output
number, the symbols are assigned a set of probabilities. The message "BILL GATES,"
`for example, would have a probability distribution like this:
`
Character        Probability
SPACE            1/10
A                1/10
B                1/10
E                1/10
G                1/10
I                1/10
L                2/10
S                1/10
T                1/10
`
`124
`
`Page 15
`
`
`
`ARITHMETIC CODING
`
`Once character probabilities are known, individual symbols need to be assigned a
`range along a "probability line," nominally 0 to 1. It doesn't matter which characters
`are assigned which segment of the range, as long as it is done in the same manner by
`both the encoder and the decoder. The nine-character symbol set used here would look
`like the following:
`
Character        Probability      Range
SPACE            1/10             0.00 <= r < 0.10
A                1/10             0.10 <= r < 0.20
B                1/10             0.20 <= r < 0.30
E                1/10             0.30 <= r < 0.40
G                1/10             0.40 <= r < 0.50
I                1/10             0.50 <= r < 0.60
L                2/10             0.60 <= r < 0.80
S                1/10             0.80 <= r < 0.90
T                1/10             0.90 <= r < 1.00
`
Each character is assigned the portion of the 0 to 1 range that corresponds to its
probability of appearance. Note that the character "owns" everything up to, but not
including, the higher number. So the letter T in fact has the range .90 to .9999 . . . .
`The most significant portion of an arithmetic-coded message belongs to the first
`symbol—or B, in the message "BILL GATES." To decode the first character properly,
`the final coded message has to be a number greater than or equal to .20 and less than
.30. To encode this number, track the range it could fall in. After the first character is
`encoded, the low end for this range is .20 and the high end is .30.
`During the rest of the encoding process, each new symbol will further restrict the
`possible range of the output number. The next character to be encoded, the letter I,
`owns the range .50 to .60 in the new subrange of .2 to .3. So the new encoded number
`will fall somewhere in the 50th to 60th percentile of the currently established range.
`Applying this logic will further restrict our number to .25 to .26. The algorithm to
`accomplish this for a message of any length is shown here:
`
low = 0.0;
high = 1.0;
while ( ( c = getc( input ) ) != EOF ) {
    range = high - low;
    high = low + range * high_range( c );
    low = low + range * low_range( c );
}
output( low );
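
Before tracing the message through by hand, it may help to see this loop as a
complete program. The toy version below is not one of this chapter's listings; the
"BILL GATES" ranges are simply hard-coded into two arrays. It prints the low and
high values after each character, matching the table that follows:

#include <stdio.h>
#include <string.h>

/* The ranges from the probability table, one entry per character. */
static const char *chars = " ABEGILST";
static const double lows[]  = { 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9 };
static const double highs[] = { 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9, 1.0 };

int main( void )
{
    const char *message = "BILL GATES";
    const char *p;
    double low = 0.0;
    double high = 1.0;

    for ( p = message; *p != '\0'; p++ ) {
        int i = (int) ( strchr( chars, *p ) - chars );  /* symbol's index */
        double range = high - low;
        high = low + range * highs[ i ];
        low  = low + range * lows[ i ];
        printf( "%c: low = %.10f  high = %.10f\n", *p, low, high );
    }
    return 0;
}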
`
`Following this process to its natural conclusion with our message results in the
`following table:
`
New Character    Low value        High Value
                 0.0              1.0
B                0.2              0.3
I                0.25             0.26
L                0.256            0.258
L                0.2572           0.2576
SPACE            0.25720          0.25724
G                0.257216         0.257220
A                0.2572164        0.2572168
T                0.25721676       0.2572168
E                0.257216772      0.257216776
S                0.2572167752     0.2572167756
`
`So the final low value, .2572167752, will uniquely encode the message "BILL
`GATES" using our present coding scheme.
`Given this encoding scheme, it is relatively easy to see how the decoding process
`operates. Find the first symbol in the message by seeing which symbol owns the space
`our encoded message falls in. Since .2572167752 falls between .2 and .3, the first
`character must be B. Then remove B from the encoded number. Since we know the
`low and high ranges of B, remove their effects by reversing the process that put them
`in. First, subtract the low value of B, giving .0572167752. Then divide by the width
`
`126
`
`Page 17
`
`
`
`ARITHMETIC CODING
`
`of the range of B, or .1. This gives a value of .572167752. Then calculate where that
`lands, which is in the range of the next letter, I. The algorithm for decoding the
`incoming number is shown next:
`
number = input_code();
for ( ; ; ) {
    symbol = find_symbol_straddling_this_range( number );
    putc( symbol );
    range = high_range( symbol ) - low_range( symbol );
    number = number - low_range( symbol );
    number = number / range;
}
`
`I have conveniently ignored the problem of how to decide when there are no more
`symbols left to decode. This can be handled either by encoding a special end-of-file
`symbol or by carrying the stream length with the encoded message. In the example in
`this chapter, as elsewhere in the book, I carry along a special end-of-stream symbol
`that tells the decoder when it is out of symbols. The decoding algorithm for the "BILL
`GATES" message will proceed as shown:
`
Encoded Number    Output Symbol    Low    High    Range
0.2572167752      B                0.2    0.3     0.1
0.572167752       I                0.5    0.6     0.1
0.72167752        L                0.6    0.8     0.2
0.6083876         L                0.6    0.8     0.2
0.041938          SPACE            0.0    0.1     0.1
0.41938           G                0.4    0.5     0.1
0.1938            A                0.1    0.2     0.1
0.938             T                0.9    1.0     0.1
0.38              E                0.3    0.4     0.1
0.8               S                0.8    0.9     0.1
0.0
`
`127
`
`Page 18
`
`
`
`THE DATA COMPRESSION 1300K
`
In summary, the encoding process is simply one of narrowing the range of possible
`numbers with every new symbol. The new range is proportional to the predefined
`probability attached to that symbol. Decoding is the inverse procedure, in which the
`range is expanded in proportion to the probability of each symbol as it is extracted.
`
`Practical Matters. Encoding and decoding a stream of symbols using arithmetic
`coding is not too complicated. But at first glance it seems completely impractical. Most
`computers support floating-point numbers of around 80 bits. So do you have to start over
`every time you encode ten or fifteen symbols? Do you need a floating-point processor?
`Can machines with different floating-point formats communicate through arithmetic
`coding?
`As it turns out, arithmetic coding is best accomplished using standard 16-bit and 32-
`bit integer math. Floating-point math is neither required nor helpful. What is required
`is an incremental transmission scheme in which fixed-size integer state variables
`receive new bits at the low end and shift them out at the high end, forming a single
`number that can be as long as the number of bits on the computer.
`Earlier, we saw that the algorithm works by keeping track of a high and low number
that brackets the range of the possible output number. When the algorithm first starts,
`the low number is set to .0 and the high number is set to 1. The first simplification
made to work with integer math is to change the 1 to .999 . . . , or .111 . . . in binary.
Mathematicians agree that .111 . . . binary is exactly the same as 1 binary, and we take
their assurance at face value. It simplifies encoding and decoding.
`To store these numbers in integer registers, first justify them so the implied decimal
`point is on the left side of the word. Then load as much of the initial high and low
`values as will fit into the word size we are working with. My implementation uses 16-
bit unsigned math, so the initial value of high is 0xFFFF, and low is 0. We know that
`the high value continues with Fs forever, and the low continues with zeros forever—
`so we can shift those extra bits in with impunity when needed.
`If you imagine our "BILL GATES" example in a five-decimal digit register (I use
`decimal digits in this example for clarity), the decimal equivalent of our setup would
`look like what follows on the next page.
`
`128
`
`Page 19
`
`
`
`ARITHMETIC CODING
`
HIGH:  99999   implied digits =>  999999999...
LOW:   00000   implied digits =>  000000000...
`
`To find the new range of numbers, apply the encoding algorithm from the previous
`section. First, calculate the range between the low and high values. The difference
`between the two registers will be 100000, not 99999. This is because we assume the
`high register has an infinite number of 9s added to it, so we need to increment the
`calculated difference. We then compute the new high value using the formula from the
`previous section:
`
high = low + range * high_range( symbol ).
`
`In this case, the high range was .30, which gives a new value for high of 30000.
`Before storing the new value of high, we need to decrement it, once again because of
`the implied digits appended to the integer value. So the new value of high is 29999.
`The calculation of low follows the same procedure, with a resulting new value of
`20000. So now high and low look like this:
`
high: 29999  (999...)
low:  20000  (000...)
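
In C, one encoding step reduces to a few lines of integer arithmetic. This sketch
assumes the count-based SYMBOL interface introduced later in the chapter, with
low_count/scale and high_count/scale playing the part of the decimal ranges:

/* One encoding step in 16-bit integer math.  For B in "BILL GATES",
 * low_count = 2, high_count = 3, and scale = 10.
 */
void encode_step( unsigned short int *low, unsigned short int *high,
                  unsigned short int low_count,
                  unsigned short int high_count,
                  unsigned short int scale )
{
    long range = (long) ( *high - *low ) + 1;       /* +1: implied digits */
    *high = *low + (unsigned short int)
            (( range * high_count ) / scale - 1 );  /* -1: restore the 9s */
    *low  = *low + (unsigned short int)
            (( range * low_count ) / scale );
}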
`
`At this point, the most significant digits of high and low match. Due to the nature
`of our algorithm, high and low can continue to grow closer together without quite ever
`matching. So once they match in the most significant digit, that digit will never
`change. We can now output that digit as the first digit of our encoded number. This is
`done by shifting both high and low left by one digit and shifting in a 9 in the least
`significant digit of high. The equivalent operations are performed in binary in the C
`implementation of this algorithm.
`As this process continues, high and low continually grow closer together, shifting digits
`out into the coded word. The process for our "BILL GATES" message is shown next.
`
`129
`
`Page 20
`
`
`
`THE DATA COMPRESSION
`
`BOOK
`
                        HIGH   LOW    RANGE   CUMULATIVE OUTPUT
Initial state           99999  00000
Encode B (0.2-0.3)      29999  20000  100000
Shift out 2             99999  00000          .2
Encode I (0.5-0.6)      59999  50000  100000  .2
Shift out 5             99999  00000          .25
Encode L (0.6-0.8)      79999  60000  100000  .25
Encode L (0.6-0.8)      75999  72000  20000   .25
Shift out 7             59999  20000          .257
Encode SPACE (0.0-0.1)  23999  20000  40000   .257
Shift out 2             39999  00000          .2572
Encode G (0.4-0.5)      19999  16000  40000   .2572
Shift out 1             99999  60000          .25721
Encode A (0.1-0.2)      67999  64000  40000   .25721
Shift out 6             79999  40000          .257216
Encode T (0.9-1.0)      79999  76000  40000   .257216
Shift out 7             99999  60000          .2572167
Encode E (0.3-0.4)      75999  72000  40000   .2572167
Shift out 7             59999  20000          .25721677
Encode S (0.8-0.9)      55999  52000  40000   .25721677
Shift out 5             59999  20000          .257216775
Shift out 2                                   .2572167752
Shift out 0                                   .25721677520
`
`After all the letters are accounted for, two extra digits need to be shifted out of either
the high or low value to finish the output word. This is so the decoder can properly
`track the input data. Part of the information about the data stream is still in the high and
`low registers, and we need to shift that information to the file for the decoder to use
`later.
`
`A Complication. This scheme works well for incrementally encoding a message.
`Enough accuracy is retained during the double-precision integer calculations to ensure
`that the message is accurately encoded. But there is potential for a loss of precision under
`certain circumstances.
`
`130
`
`Page 21
`
`
`
`ARITHMETIC CODING
`
If the encoded word has a string of 0s or 9s in it, the high and low values will slowly
`converge on a value, but they may not see their most significant digits match imme-
`diately. High may be 700004, and low may be 699995. At this point, the calculated
`range will be only a single digit long, which means the output word will not have
`enough precision to be accurately encoded. Worse, after a few more iterations, high
could be 70000, and low could be 69999.
`At this point, the values are permanently stuck. The range between high and low
`has become so small that any iteration through another symbol will leave high and low
`at their same values. But since the most significant digits of both words are not equal,
`the algorithm can't output the digit and shift. It seems to have reached an impasse.
`Defeat this underflow problem by preventing things from ever getting that bad. The
`original algorithm said something like, "If the most significant digit of high and low
`match, shift it out." If the two digits don't match, but are now on adjacent numbers, a
`second test needs to be applied. If high and low are one apart, we test to see if the
`second most significant digit in high is a 0 and if the second digit in low is a 9. If so,
`it means we are on the road to underflow and need to take action.
Head off underflow with a slightly different shift operation. Instead of shifting the
`most significant digit out of the word, delete the second digits from high and low and
`shift the rest of the digits left to fill the space. The most significant digit stays in place.
`Then set an underflow counter to remember that we threw away a digit and aren't quite
`sure whether it was going to be a 0 or a 9. This process is shown here:
`
             Before     After
High:        40344      43449
Low:         39810      38100
Underflow:   0          1
`
`After every recalculation, check for underflow digits again if the most significant digits
`don't match. If underflow digits are present, we shift them out and increment the counter.
`
`131
`
`Page 22
`
`
`
`THE DATA COMPRESSION BOOK
`
`When the most significant digits do finally converge to a single value, output that
`value. Then output the underflow digits previously discarded. The underflow digits
`will all be 9s or Os, depending on whether high and low converged on the higher or
`lower value. In the C implementation of this algorithm, the underflow counter keeps
`track of how many ones or zeros to output.
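
In binary, the whole test-and-shift loop is compact. The sketch below assumes
16-bit registers kept in the encoder's static state, plus the OutputBit() routine
from the BITIO module of chapter 3; the top bit plays the part of the decimal
example's leading digit, and the second bit is tested for the underflow condition:

/* After each symbol is encoded, shift out any settled bits and absorb
 * potential underflow bits.  high, low, and underflow_bits are the
 * encoder's state; OutputBit() writes a single bit to the stream.
 */
for ( ; ; ) {
    if ( ( high & 0x8000 ) == ( low & 0x8000 ) ) {
        OutputBit( stream, high & 0x8000 );      /* top bits match      */
        while ( underflow_bits > 0 ) {           /* flush saved digits, */
            OutputBit( stream, ~high & 0x8000 ); /* opposite of top bit */
            underflow_bits--;
        }
    } else if ( ( low & 0x4000 ) && !( high & 0x4000 ) ) {
        underflow_bits++;        /* underflow brewing: delete 2nd bit */
        low  &= 0x3fff;
        high |= 0x4000;
    } else
        break;                   /* nothing more can be settled       */
    low  <<= 1;                  /* shift out one bit                 */
    high <<= 1;
    high |= 1;                   /* high keeps its implied 1s         */
}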
`
`Decoding. In the "ideal" decoding process, we had the entire input number to work
`with, and the algorithm had to do things like "divide the encoded number by the symbol
`probability." In practice, we can't perform an operation like that on a number that could
`be billions of bytes long. As in the encoding process, however, the decoder can operate
`using 16- and 32-bit integers for calculations.
`Instead of using just two numbers, high and low, the decoder has to use three
`numbers. The first two, high and low, correspond exactly to the high and low values
`maintained by the encoder. The third number, code, contains the current bits being
`read in from the input bit stream. The code value always falls between the high and
`low values. As they come closer and closer to it, new shift operations will take place,
`and high and low will move back away from code.
`The high and low values in the decoder will be updated after every symbol, just as
`they were in the encoder, and they should have exactly the same values. By perform-
`ing the same comparison test on the upper digit of high and low, the decoder knows
`when it is time to shift a new digit into the incoming code. The same underflow tests
`are performed as well.
In the ideal algorithm, it was possible to determine what the current encoded
`symbol was just by finding the symbol whose probabilities enclosed the present value
`of the code. In the integer math algorithm, things are somewhat more complicated. In
`this case, the probability scale is determined by the difference between high and low.
So instead of the range being between .0 and 1.0, the range will be between two
positive 16-bit integer counts. Where the present code value falls along that range
determines the current probability. Divide (value - low) by (high - low + 1) to get the
`actual probability for the present symbol.
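
As a sketch in C, with scale being the total count the model has just supplied:

/* Map the decoder's state back to a count on the 0..scale range; the
 * model then looks up the symbol that owns that count.
 */
short int current_count( unsigned short int low, unsigned short int high,
                         unsigned short int code, unsigned short int scale )
{
    long range = (long) ( high - low ) + 1;
    return (short int)
        (( ( (long) ( code - low ) + 1 ) * scale - 1 ) / range );
}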
`
`132
`
`Page 23
`
`
`
`ARITHMETIC CODING
`
`Where's the Beef? It is not immediately obvious why this encoding process is an
`improvement over Huffman coding. It becomes clear when we examine a case in which
`the probabilities are a little different. If we have to encode the stream "AAAAAAA," and
`the probability of A is known to be .9, there is a 90 percent chance that any incoming
`character will be the letter A. We set up our probability table so that A occupies the .0
`to .9 range, and the end-of-message symbol occupies the .9 to 1 range. The encoding
`process is shown next:
`
New Character    Low value      High Value
                 0.0            1.0
A                0.0            0.9
A                0.0            0.81
A                0.0            0.729
A                0.0            0.6561
A                0.0            0.59049
A                0.0            0.531441
A                0.0            0.4782969
END              0.43046721     0.4782969
`
`Now that we know the range of high and low values, all that remains is to pick a
`number to encode this message. The number .45 will make this message uniquely
`decode to "AAAAAAA." Those two decimal digits take slightly less than seven bits
`to specify, which means we have encoded eight symbols in less than eight bits! An
`optimal Huffman message would have taken a minimum of nine bits.
`To take this point to an even further extreme, I set up a test that had only two
symbols. In it, 0 had a 16382/16383 probability, and an end-of-file symbol had a
1/16383 probability. I then created a file filled with 100,000 0s. After compression using
`arithmetic coding, the output file was only three bytes long! The minimum size using
`Huffman coding would have been 12,501 bytes. This is obviously a contrived ex-
`ample, but it shows that arithmetic coding compresses data at rates much better than
`one bit per byte when the symbol probabilities are right.
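
A quick back-of-the-envelope check shows why three bytes is no accident: each 0
carries only log2(16383/16382), or roughly .00009 bits, of information, so all 100,000
of them together need about 9 bits; the end-of-file symbol contributes log2(16383),
or about 14 bits more; and the total of roughly 23 bits rounds up to three bytes.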
`
`133
`
`Page 24
`
`
`
`THE DATA COMPRESSION BOOK
`
`The Code
The code supplied with this chapter in ARITH.C is a simple module that performs
arithmetic compression and decompression using a simple order-0 model. It works
`exactly like the nonadaptive Huffman coding program in chapter 3. It first makes a
`single pass over the data, counting the symbols. The data is then scaled down to make
`the counts fit into single, unsigned characters. The scaled counts are saved to the
`output file for the decompressor to get at later, then the arithmetic coding table is built.
`Finally, the compressor passes through the data, compressing each symbol as it
`appears. When done, the end-of-stream character is sent out, the arithmetic coder is
`flushed, and the program exits.
`
`The Compression Program. The compression portion of this program is shown
`shortly. The main module is called by the utility version of MAIN-E.C, which will have
`already taken care of opening files, parsing arguments, etc. Once we get to the
`compression phase of the program, things are ready to go.
`The compressor code breaks down neatly into three sections. The first two lines
`initialize the model and the encoder. The while loop plus an additional line after it
`performs the compression, and the last three lines shut things down.
`
build_model( input, output->file );
initialize_arithmetic_encoder();

while ( ( c = getc( input ) ) != EOF ) {
    convert_int_to_symbol( c, &s );
    encode_symbol( output, &s );
}
convert_int_to_symbol( END_OF_STREAM, &s );
encode_symbol( output, &s );
flush_arithmetic_encoder( output );
OutputBits( output, 0L, 16 );
`
`134
`
`Page 25
`
`
`
`ARITHMETIC CODING
`
`The build_model() routine has several responsibilities. It makes the first pass over
`the input data to count all the characters. It scales down the counts to fit in unsigned
`characters, then it takes those counts and builds the range table used by the coder.
`Finally, it writes the counts to the output file so the decompressor will have access to
`them later.
The initialize_arithmetic_encoder() routine is fairly simple. It just sets up the high- and
low-integer variables used during the encoding. The encoding loop calls two different
routines to encode the symbol. The first, convert_int_to_symbol(), takes the character
`read in from the file and looks up the range for the given symbol. The range is then
`stored in the symbol object, which has the structure shown:
`
typedef struct {
    unsigned short int low_count;
    unsigned short int high_count;
    unsigned short int scale;
} SYMBOL;
`
`These three values are all that are needed for the symbol to be encoded using the
arithmetic encoder. The low_count and high_count values uniquely define where on the 0 to
`1 range the symbol lies, and the scale value tells what the total span of the 0 to 1 scale is.
`If 1,000 characters had been counted in a text file, for example, the low_count and
`high_count for A might be 178 and 199, and the scale would be 1,000. This would indicate
`that on the 0 to 1 number scale, A would own the range .178 to .199.
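
In other words, for this hypothetical count of 1,000 characters, the model would
hand the encoder a symbol filled in like this:

SYMBOL s;

s.low_count  = 178;     /* 'A' starts at count 178...              */
s.high_count = 199;     /* ...and owns everything up to, but not   */
                        /* including, count 199                    */
s.scale      = 1000;    /* of 1,000 total: the range .178 to .199  */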
`Once the symbol object has been defined, it can be passed to the encoder. The
`arithmetic encoder needs only those three items to process the symbol and update the
`output number. It has the high- and low-range values and the underflow count stored
`internally as static variables, and it doesn't need anything else.
`The way we detached the modeling data from the coding data gives us a convenient
`mechanism for experimenting with new ways of modeling. We just have to come up with
`the numbers that get plugged into the symbol. The encoder doesn't care how we got those
`numbers as long as they were derived consistently so we can decode the file later.
`
`135
`
`Page 26
`
`
`
`THE DATA COMPRESSION BOOK
`
`When we reach the end of the input file, we encode and send the end-of-stream
`symbol. The decompression program will use it to determine when it is done. To
finish, call a routine to flush the arithmetic encoder, which takes care of any underflow
bits. Finally, we have to output an extra sixteen bits. The reason for this is simple.
When decoding symbols from the input bit stream, the effective bits are in the most
significant bit position of the input registers. To get the bits there, we have to load other
bits into the lower positions and shift them over. At least 15 insignificant bits are
needed to decode the last symbol. Outputting 16 bits at the end of the program ensures
`that the decoder won't get a premature end of file while decoding the input file.
`
`The Expansion Program. The main part of the expansion program follows the same
`pattern. First, the model is set up and the arithmetic coder is initialized. In this program,
`initializing the model means reading in the counts from the input file where the
`compressor put them. Initializing the arithmetic decoder means loading the low and high
registers with 0000 and FFFF, then reading the first 16 bits from the input file into the
`current code.
`
input_counts( input->file );
initialize_arithmetic_decoder( input );
for ( ; ; ) {
    get_symbol_scale( &s );
    count = get_current_count( &s );
    c = convert_symbol_to_int( count, &s );
    if ( c == END_OF_STREAM )
        break;
    remove_symbol_from_stream( input, &s );
    putc( (char) c, output );
}