LIBRARY OF CONGRESS

Office of Business Enterprises
Duplication Services Section

THIS IS TO CERTIFY that the collections of the Library of Congress contain a bound
volume entitled THE DATA COMPRESSION BOOK, call number QA 76.9.D33 N46, Copy
2. The attached — Cover Page, Title Page, Second Title Page, Copyright Page, Table of Contents
Pages and Chapter 5 — are a true and complete representation from that work.

THIS IS TO CERTIFY FURTHER, that the work is marked with a Library of Congress
Copyright Office stamp dated January 29, 1992.

IN WITNESS WHEREOF, the seal of the Library of Congress is affixed hereto on
May 18, 2018.

[signature]
Deirdre Scott
Business Enterprises Officer
Office of Business Enterprises
Library of Congress

101 Independence Avenue, SE, Washington, DC 20540-4917  Tel 202.707.5650  www.loc.gov; duplicationservices@loc.gov

Page 1

NETFLIX, INC.
Exhibit 1018
IPR2018-01630
Featuring fast, efficient data
compression techniques in C

Mark Nelson

M&T Books

Page 2

The Data
Compression
Book

Page 3
The Data
Compression
Book

Featuring fast, efficient data
compression techniques in C

Mark Nelson

Page 4
M&T Books
A Division of M&T Publishing, Inc.
501 Galveston Drive
Redwood City, CA 94063

© 1991 by M&T Publishing, Inc.

Printed in the United States of America

All rights reserved. No part of this book may be reproduced or transmitted in any form or by any means,
electronic or mechanical, including photocopying, recording, or by any information storage and
retrieval system, without prior written permission from the Publisher. Contact the Publisher for
information on foreign rights.

Limits of Liability and Disclaimer of Warranty
The Author and Publisher of this book have used their best efforts in preparing the book and the
programs contained in it. These efforts include the development, research, and testing of the theories
and programs to determine their effectiveness.

The Author and Publisher make no warranty of any kind, expressed or implied, with regard to these
programs or the documentation contained in this book. The Author and Publisher shall not be liable in
any event for incidental or consequential damages in connection with, or arising out of, the furnishing,
performance, or use of these programs.

Library of Congress Cataloging-in-Publication Data

Nelson, Mark, 1958-
The data compression book / by Mark Nelson.
p. cm.
Includes index.
ISBN 1-55851-214-4: $26.95
—ISBN 1-55851-216-0 (Book/disk set): $36.95
1. Data compression (Computer science) I. Title.
QA76.9.D33N46 1991
005.74'6—dc20

91-23080
CIP

Editor: Tova F. Hiegel
Technical review: Robert Jung

94 93 92 91    4 3 2 1

Cover Design: Lauren Smith Design
Layout: Marni Tapscott

Page 5
`

Contents

WHY THIS BOOK IS FOR YOU ... 1

INTRODUCTION TO DATA COMPRESSION ... 3
  The Audience ... 4
  Why C? ... 4
  Which C? ... 5
  Keeping Score ... 9
  The Structure ... 10

THE DATA-COMPRESSION LEXICON, WITH A HISTORY ... 13
  The Two Kingdoms ... 13
  Data Compression = Modeling + Coding ... 14
  The Dawn Age ... 15
  Coding ... 17
  An Improvement ... 18
  Modeling ... 20
  Statistical Modeling ... 20
  Dictionary Schemes ... 22
  Ziv and Lempel ... 23
  LZ77 ... 23
  LZ78 ... 24
  Lossy Compression ... 24
  Programs to Know ... 25

THE DAWN AGE: MINIMUM REDUNDANCY CODING ... 29
  The Shannon-Fano Algorithm ... 31
  The Huffman Algorithm ... 34
  Huffman in C ... 38

Page 6

  BITIO.C ... 38
  A Reminder about Prototypes ... 47
  MAIN-C.C and MAIN-E.C ... 49
  MAIN-C.C ... 55
  ERRHAND.C ... 56
  Into the Huffman Code ... 58
  Counting the Symbols ... 59
  Saving the Counts ... 59
  Building the Tree ... 61
  Using the Tree ... 62
  The Compression Code ... 63
  Putting It All Together ... 79
  Performance ... 79

A SIGNIFICANT IMPROVEMENT: ADAPTIVE HUFFMAN CODING ... 81
  Adaptive Coding ... 82
  Updating the Huffman Tree ... 83
  What Swapping Does ... 87
  The Algorithm ... 88
  An Enhancement ... 89
  The Escape Code ... 90
  The Overflow Problem ... 91
  A Rescaling Bonus ... 95
  The Code ... 95
  Initialization of the Array ... 97
  The Compress Main Program ... 98
  The Expand Main Program ... 99
  Encoding the Symbol ... 100
  Decoding the Symbol ... 108
  The Code ... 109

Page 7

HUFFMAN ONE BETTER: ARITHMETIC CODING ... 123
  Difficulties ... 123
  Arithmetic Coding: A Step Forward ... 124
  Practical Matters ... 128
  A Complication ... 130
  Decoding ... 132
  Where's the Beef? ... 133
  The Code ... 134
  The Compression Program ... 134
  The Expansion Program ... 136
  Initializing the Model ... 137
  Reading the Model ... 141
  Initializing the Encoder ... 141
  The Encoding Process ... 142
  Flushing the Encoder ... 145
  The Decoding Process ... 145
  Summary ... 148
  Code ... 148

STATISTICAL MODELING ... 167
  Higher-Order Modeling ... 167
  Finite Context Modeling ... 168
  Adaptive Modeling ... 169
  A Simple Example ... 170
  Improvements ... 176
  Highest-Order Modeling ... 176
  Updating the Model ... 178
  Escape Probabilities ... 179
  Scoreboarding ... 180
  Data Structures ... 181
  The Finishing Touches: Tables 1 and 2 ... 185

Page 8

  Model Flushing ... 186
  Implementation ... 186
  Conclusions ... 187
  Enhancement ... 187
  ARITH-N Listing ... 188

DICTIONARY-BASED COMPRESSION ... 219
  An Example ... 219
  Static vs. Adaptive ... 220
  Adaptive Methods ... 221
  A Representative Example ... 223
  Israeli Roots ... 225
  History ... 226
  ARC: The Father of MS-DOS Dictionary Compression ... 227
  Dictionary Compression ... 228
  Danger Ahead—Patents ... 230
  Conclusion ... 231

SLIDING WINDOW COMPRESSION ... 233
  The Algorithm ... 233
  Problems with LZ77 ... 238
  An Encoding Problem ... 240
  LZSS Compression ... 240
  Data Structures ... 242
  A Balancing Act ... 244
  Greedy vs. Best Possible ... 246
  The Code ... 247
  Constants and Macros ... 247
  Global Variables ... 251
  The Compression Code ... 252
  Initialization ... 253
  The Main Loop ... 254

Page 9

  The Exit Code ... 256
  AddString() ... 257
  DeleteString() ... 260
  Binary Tree Support Routines ... 262
  The Expansion Routine ... 263
  Improvements ... 265
  The Code ... 266

LZ78 COMPRESSION ... 277
  Can LZ77 Improve? ... 278
  Enter LZ78 ... 279
  LZ78 Details ... 279
  LZ78 Implementation ... 282
  An Effective Variant ... 285
  Decompression ... 287
  The Catch ... 288
  LZW Implementation ... 290
  Tree Maintenance and Navigation ... 290
  Compression ... 293
  Decompression ... 294
  The Code ... 296
  Improvements ... 302
  Patents ... 311

SPEECH COMPRESSION ... 313
  Digital Audio Concepts ... 313
  Fundamentals ... 314
  Sampling Variables ... 319
  PC-Based Sound ... 323
  Lossless Compression of Sound ... 324
  Problems and Results ... 325
  Lossy Compression ... 328

Page 10

  Silence Compression ... 329
  Other Techniques ... 345

LOSSY GRAPHICS COMPRESSION ... 347
  Enter Compression ... 348
  Statistical and Dictionary Compression Methods ... 348
  Lossy Compression ... 349
  Differential Modulation ... 350
  Adaptive Coding ... 351
  A Standard That Works: JPEG ... 353
  JPEG Compression ... 354
  The Discrete Cosine Transform ... 354
  DCT Specifics ... 357
  Why Bother? ... 358
  Implementing the DCT ... 359
  Matrix Multiplication ... 360
  Continued Improvements ... 362
  Output of the DCT ... 363
  Quantization ... 364
  Selecting a Quantization Matrix ... 365
  Coding ... 369
  The Zig-Zag Sequence ... 369
  Entropy Encoding ... 372
  What About Color? ... 373
  The Sample Program ... 373
  Input Format ... 374
  The Code ... 375
  Initialization ... 376
  The Forward DCT Routine ... 378
  WriteDCTData() ... 379
  File Expansion ... 382
  ReadDCTData() ... 383

Page 11

  Input DCT Code ... 384
  The Inverse DCT ... 385
  The Complete Code Listing ... 387
  Support Programs ... 400
  Some Compression Results ... 406

AN ARCHIVING PACKAGE ... 409
  CAR and CARMAN ... 409
  The CARMAN Command Set ... 410
  The CAR File ... 412
  The Header ... 413
  Storing the Header ... 413
  The Header CRC ... 417
  Command-Line Processing ... 418
  Generating the File List ... 421
  Opening the Archive Files ... 426
  The Main Processing Loop ... 428
  Skipping/Copying Input File ... 434
  File Insertion ... 435
  File Extraction ... 436
  Cleanup ... 438
  The Code ... 438

APPENDIX A: STATISTICS FOR COMPRESSION PROGRAMS ... 489

APPENDIX B: TEST PROGRAMS ... 495

GLOSSARY ... 505

BIBLIOGRAPHY ... 517
  Other Resources ... 518

Page 12

AFTERWORD ... 521

INDEX ... 523

Page 13

`

CHAPTER 5

Huffman One Better:
Arithmetic Coding

The last two chapters show that Huffman coding uses knowledge about information
content to efficiently encode symbols. If the probability of a symbol's appearance in
a message is known, Huffman techniques can encode that symbol using a minimal
number of bits. Huffman coding has been proven the best fixed-length coding method
available.
Difficulties
Huffman codes have to be an integral number of bits long, and this can sometimes
be a problem. If the probability of a character is 1/3, for example, the optimum number
of bits to code that character is around 1.6. Huffman coding has to assign either one
or two bits to the code, and either choice leads to a longer compressed message than
is theoretically possible.
This nonoptimal coding becomes a noticeable problem when the probability of a
character is very high. If a statistical method could assign a 90 percent probability to a given
character, the optimal code size would be 0.15 bits. The Huffman coding system would
probably assign a 1-bit code to the symbol, which is six times longer than necessary.
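The optimal code size quoted here is the information content of the symbol, -log2(p) bits. As a quick check (my own illustration, not one of the book's listings), this prints both figures used above:

#include <stdio.h>
#include <math.h>

int main( void )
{
    /* Optimal code lengths in bits: -log2( p ) */
    printf( "p = 1/3: %.2f bits\n", -log2( 1.0 / 3.0 ) );   /* about 1.58 */
    printf( "p = 0.9: %.2f bits\n", -log2( 0.9 ) );         /* about 0.15 */
    return 0;
}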
This would be a problem when compressing two-color images, like those coming
from a fax machine. Since there are only two colors, an ordinary coding method would
assign the 1 bit to one color and the 0 bit to the other. Since both codes have only a
single bit, Huffman coding is not going to be able to compress this data at all. No
matter how high the probability of one of the bits, we are still going to have to encode
it using one bit.

123

Page 14
The conventional solution to this problem is to group the bits into packets and
apply Huffman coding. But this weakness prevents Huffman coding from being
a universal compressor.
Arithmetic Coding: A Step Forward
Only in the last ten years has a respectable candidate to replace Huffman coding
been successfully demonstrated: arithmetic coding. Arithmetic coding bypasses the
idea of replacing an input symbol with a specific code. It replaces a stream of input
symbols with a single floating-point output number. More bits are needed in the output
number for longer, complex messages. This concept has been known for some time,
but only recently were practical methods found to implement arithmetic coding on
computers with fixed-sized registers.
The output from an arithmetic coding process is a single number less than 1 and
greater than or equal to 0. This single number can be uniquely decoded to create the
exact stream of symbols that went into its construction. To construct the output
number, the symbols are assigned a set of probabilities. The message "BILL GATES,"
for example, would have a probability distribution like this:

Character        Probability
SPACE            1/10
A                1/10
B                1/10
E                1/10
G                1/10
I                1/10
L                2/10
S                1/10
T                1/10

124

Page 15
Once character probabilities are known, individual symbols need to be assigned a
range along a "probability line," nominally 0 to 1. It doesn't matter which characters
are assigned which segment of the range, as long as it is done in the same manner by
both the encoder and the decoder. The nine-character symbol set used here would look
like the following:

Character        Probability      Range
SPACE            1/10             0.00 <= r < 0.10
A                1/10             0.10 <= r < 0.20
B                1/10             0.20 <= r < 0.30
E                1/10             0.30 <= r < 0.40
G                1/10             0.40 <= r < 0.50
I                1/10             0.50 <= r < 0.60
L                2/10             0.60 <= r < 0.80
S                1/10             0.80 <= r < 0.90
T                1/10             0.90 <= r < 1.00

Each character is assigned the portion of the 0 to 1 range that corresponds to its
probability of appearance. Note that the character "owns" everything up to, but not
including, the higher number. So the letter T in fact has the range .90 to .99999....
The most significant portion of an arithmetic-coded message belongs to the first
symbol—or B, in the message "BILL GATES." To decode the first character properly,
the final coded message has to be a number greater than or equal to .20 and less than
.30. To encode this number, track the range it could fall in. After the first character is
encoded, the low end for this range is .20 and the high end is .30.
During the rest of the encoding process, each new symbol will further restrict the
possible range of the output number. The next character to be encoded, the letter I,
owns the range .50 to .60 in the new subrange of .2 to .3. So the new encoded number
will fall somewhere in the 50th to 60th percentile of the currently established range.
Applying this logic will further restrict our number to .25 to .26. The algorithm to
accomplish this for a message of any length is shown here:
low = 0.0;
high = 1.0;
while ( ( c = getc( input ) ) != EOF ) {
    range = high - low;
    high = low + range * high_range( c );
    low = low + range * low_range( c );
}
output( low );

125

Page 16
Following this process to its natural conclusion with our message results in the
following table:
New Character    Low value        High value
                 0.0              1.0
B                0.2              0.3
I                0.25             0.26
L                0.256            0.258
L                0.2572           0.2576
SPACE            0.25720          0.25724
G                0.257216         0.257220
A                0.2572164        0.2572168
T                0.25721676       0.2572168
E                0.257216772      0.257216776
S                0.2572167752     0.2572167756

So the final low value, .2572167752, will uniquely encode the message "BILL
GATES" using our present coding scheme.
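The table is easy to reproduce mechanically. The sketch below is my own illustration (not one of the book's listings); the low_range() and high_range() lookups from the algorithm above are filled in with small tables:

#include <stdio.h>
#include <string.h>

static const char  *chars   = " ABEGILST";
static const double lows[]  = { 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9 };
static const double highs[] = { 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9, 1.0 };

static double low_range( int c )  { return lows[ strchr( chars, c ) - chars ]; }
static double high_range( int c ) { return highs[ strchr( chars, c ) - chars ]; }

int main( void )
{
    const char *msg = "BILL GATES";
    const char *p;
    double low = 0.0, high = 1.0;

    for ( p = msg; *p; p++ ) {
        double range = high - low;
        high = low + range * high_range( *p );
        low  = low + range * low_range( *p );
        printf( "%c  low=%.10f  high=%.10f\n", *p, low, high );
    }
    return 0;
}

Of course, a real message would quickly exhaust the precision of a double; the "Practical Matters" section below explains how production coders avoid floating point entirely.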
Given this encoding scheme, it is relatively easy to see how the decoding process
operates. Find the first symbol in the message by seeing which symbol owns the space
our encoded message falls in. Since .2572167752 falls between .2 and .3, the first
character must be B. Then remove B from the encoded number. Since we know the
low and high ranges of B, remove their effects by reversing the process that put them
in. First, subtract the low value of B, giving .0572167752. Then divide by the width

126

Page 17
of the range of B, or .1. This gives a value of .572167752. Then calculate where that
lands, which is in the range of the next letter, I. The algorithm for decoding the
incoming number is shown next:
number = input_code();
for ( ; ; ) {
    symbol = find_symbol_straddling_this_range( number );
    putc( symbol, output );
    range = high_range( symbol ) - low_range( symbol );
    number = number - low_range( symbol );
    number = number / range;
}
I have conveniently ignored the problem of how to decide when there are no more
symbols left to decode. This can be handled either by encoding a special end-of-file
symbol or by carrying the stream length with the encoded message. In the example in
this chapter, as elsewhere in the book, I carry along a special end-of-stream symbol
that tells the decoder when it is out of symbols. The decoding algorithm for the "BILL
GATES" message will proceed as shown:

Encoded Number   Output Symbol   Low   High   Range
0.2572167752     B               0.2   0.3    0.1
0.572167752      I               0.5   0.6    0.1
0.72167752       L               0.6   0.8    0.2
0.6083876        L               0.6   0.8    0.2
0.041938         SPACE           0.0   0.1    0.1
0.41938          G               0.4   0.5    0.1
0.1938           A               0.1   0.2    0.1
0.938            T               0.9   1.0    0.1
0.38             E               0.3   0.4    0.1
0.8              S               0.8   0.9    0.1
0.0
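A companion sketch to the encoder given earlier (again my own illustration, reusing the same hypothetical tables) peels the symbols back out. The starting value is picked safely inside the final [low, high) interval, since any number in that interval decodes to the same message and the exact endpoint sits on a knife edge once floating-point rounding is involved:

#include <stdio.h>

static const char  *chars   = " ABEGILST";
static const double lows[]  = { 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9 };
static const double highs[] = { 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9, 1.0 };

int main( void )
{
    double number = 0.2572167754;   /* inside [ .2572167752, .2572167756 ) */
    int i, n;

    for ( n = 0; n < 10; n++ ) {    /* this toy knows the message length   */
        for ( i = 0; i < 9; i++ )   /* find the straddling symbol          */
            if ( number >= lows[ i ] && number < highs[ i ] )
                break;
        putchar( chars[ i ] );
        number = ( number - lows[ i ] ) / ( highs[ i ] - lows[ i ] );
    }
    putchar( '\n' );                /* prints "BILL GATES"                 */
    return 0;
}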
127

Page 18
In summary, the encoding process is simply one of narrowing the range of possible
numbers with every new symbol. The new range is proportional to the predefined
probability attached to that symbol. Decoding is the inverse procedure, in which the
range is expanded in proportion to the probability of each symbol as it is extracted.
Practical Matters. Encoding and decoding a stream of symbols using arithmetic
coding is not too complicated. But at first glance it seems completely impractical. Most
computers support floating-point numbers of around 80 bits. So do you have to start over
every time you encode ten or fifteen symbols? Do you need a floating-point processor?
Can machines with different floating-point formats communicate through arithmetic
coding?
As it turns out, arithmetic coding is best accomplished using standard 16-bit and 32-
bit integer math. Floating-point math is neither required nor helpful. What is required
is an incremental transmission scheme in which fixed-size integer state variables
receive new bits at the low end and shift them out at the high end, forming a single
number that can be as many bits long as the computer has storage for.
Earlier, we saw that the algorithm works by keeping track of a high and low number
that brackets the range of the possible output number. When the algorithm first starts,
the low number is set to 0 and the high number is set to 1. The first simplification
made to work with integer math is to change the 1 to .999 . . . , or .111 . . . in binary.
Mathematicians agree that .111 . . . binary is exactly the same as 1 binary, and we take
their assurance at face value. It simplifies encoding and decoding.
To store these numbers in integer registers, first justify them so the implied decimal
point is on the left side of the word. Then load as much of the initial high and low
values as will fit into the word size we are working with. My implementation uses 16-
bit unsigned math, so the initial value of high is 0xFFFF, and low is 0. We know that
the high value continues with Fs forever, and the low continues with zeros forever—
so we can shift those extra bits in with impunity when needed.
If you imagine our "BILL GATES" example in a five-decimal digit register (I use
decimal digits in this example for clarity), the decimal equivalent of our setup would
look like what follows on the next page.
128

Page 19
HIGH: 99999  implied digits => 999999999...
LOW:  00000  implied digits => 000000000...
To find the new range of numbers, apply the encoding algorithm from the previous
section. First, calculate the range between the low and high values. The difference
between the two registers will be 100000, not 99999. This is because we assume the
high register has an infinite number of 9s added to it, so we need to increment the
calculated difference. We then compute the new high value using the formula from the
previous section:

high = low + range * high_range( symbol );

In this case, the high range was .30, which gives a new value for high of 30000.
Before storing the new value of high, we need to decrement it, once again because of
the implied digits appended to the integer value. So the new value of high is 29999.
The calculation of low follows the same procedure, with a resulting new value of
20000. So now high and low look like this:
high: 29999  (999...)
low:  20000  (000...)
At this point, the most significant digits of high and low match. Due to the nature
of our algorithm, high and low can continue to grow closer together without quite ever
matching. So once they match in the most significant digit, that digit will never
change. We can now output that digit as the first digit of our encoded number. This is
done by shifting both high and low left by one digit and shifting in a 9 in the least
significant digit of high. The equivalent operations are performed in binary in the C
implementation of this algorithm.
As this process continues, high and low continually grow closer together, shifting digits
out into the coded word. The process for our "BILL GATES" message is shown next.
129

Page 20
Operation               HIGH    LOW     RANGE    CUMULATIVE OUTPUT
Initial state           99999   00000
Encode B (0.2-0.3)      29999   20000   100000
Shift out 2             99999   00000            .2
Encode I (0.5-0.6)      59999   50000   100000   .2
Shift out 5             99999   00000            .25
Encode L (0.6-0.8)      79999   60000   100000   .25
Encode L (0.6-0.8)      75999   72000   20000    .25
Shift out 7             59999   20000            .257
Encode SPACE (0.0-0.1)  23999   20000   40000    .257
Shift out 2             39999   00000            .2572
Encode G (0.4-0.5)      19999   16000   40000    .2572
Shift out 1             99999   60000            .25721
Encode A (0.1-0.2)      67999   64000   40000    .25721
Shift out 6             79999   40000            .257216
Encode T (0.9-1.0)      79999   76000   40000    .257216
Shift out 7             99999   60000            .2572167
Encode E (0.3-0.4)      75999   72000   40000    .2572167
Shift out 7             59999   20000            .25721677
Encode S (0.8-0.9)      55999   52000   40000    .25721677
Shift out 5             59999   20000            .257216775
Shift out 2                                      .2572167752
Shift out 0                                      .25721677520
After all the letters are accounted for, two extra digits need to be shifted out of either
the high or low value to finish the output word. This is so the decoder can properly
track the input data. Part of the information about the data stream is still in the high and
low registers, and we need to shift that information to the file for the decoder to use
later.
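The table above can be generated with a short program. This sketch is my own (it uses the same hypothetical five-digit decimal registers as the walk-through; a real coder does the equivalent in binary) and prints the cumulative output digits:

#include <stdio.h>
#include <string.h>

int main( void )
{
    /* ranges for " ABEGILST", scaled to tenths (hypothetical tables) */
    const char *chars = " ABEGILST";
    const int lo[] = { 0, 1, 2, 3, 4, 5, 6, 8, 9 };
    const int hi[] = { 1, 2, 3, 4, 5, 6, 8, 9, 10 };
    const char *msg = "BILL GATES";
    const char *p;
    long low = 0, high = 99999;

    for ( p = msg; *p; p++ ) {
        int i = strchr( chars, *p ) - chars;
        long range = high - low + 1;            /* +1 for the implied 9s */
        high = low + range * hi[ i ] / 10 - 1;  /* -1: high ends in 9s   */
        low  = low + range * lo[ i ] / 10;
        while ( high / 10000 == low / 10000 ) { /* leading digits match, */
            printf( "%ld", high / 10000 );      /* so shift one out      */
            high = high % 10000 * 10 + 9;
            low  = low  % 10000 * 10;
        }
        /* underflow handling omitted; this message never needs it */
    }
    printf( "%ld%ld\n", low / 10000, low / 1000 % 10 );  /* two extra digits */
    return 0;
}

Running it prints 25721677520, matching the cumulative output column.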
A Complication. This scheme works well for incrementally encoding a message.
Enough accuracy is retained during the double-precision integer calculations to ensure
that the message is accurately encoded. But there is potential for a loss of precision under
certain circumstances.
130

Page 21
If the encoded word has a string of 0s or 9s in it, the high and low values will slowly
converge on a value, but they may not see their most significant digits match
immediately. High may be 700004, and low may be 699995. At this point, the calculated
range will be only a single digit long, which means the output word will not have
enough precision to be accurately encoded. Worse, after a few more iterations, high
could be 70000, and low could be 69999.
At this point, the values are permanently stuck. The range between high and low
has become so small that any iteration through another symbol will leave high and low
at their same values. But since the most significant digits of both words are not equal,
the algorithm can't output the digit and shift. It seems to have reached an impasse.
Defeat this underflow problem by preventing things from ever getting that bad. The
original algorithm said something like, "If the most significant digit of high and low
match, shift it out." If the two digits don't match, but are now on adjacent numbers, a
second test needs to be applied. If high and low are one apart, we test to see if the
second most significant digit in high is a 0 and if the second digit in low is a 9. If so,
it means we are on the road to underflow and need to take action.
Head off underflow with a slightly different shift operation. Instead of shifting the
most significant digit out of the word, delete the second digits from high and low and
shift the rest of the digits left to fill the space. The most significant digit stays in place.
Then set an underflow counter to remember that we threw away a digit and aren't quite
sure whether it was going to be a 0 or a 9. This process is shown here:
             Before    After
High:        40344     43449
Low:         39810     38100
Underflow:   0         1
After every recalculation, check for underflow digits again if the most significant digits
don't match. If underflow digits are present, we shift them out and increment the counter.
131

Page 22
When the most significant digits do finally converge to a single value, output that
value. Then output the underflow digits previously discarded. The underflow digits
will all be 9s or 0s, depending on whether high and low converged on the higher or
lower value. In the C implementation of this algorithm, the underflow counter keeps
track of how many ones or zeros to output.
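In binary, with 16-bit registers, the matching test looks at the top bit and the underflow test at the second bit. The fragment below is a sketch patterned on the decimal description above (the variable names and the OutputBit() routine are illustrative, not the book's exact listing):

for ( ; ; ) {
    if ( ( high & 0x8000 ) == ( low & 0x8000 ) ) {
        /* Top bits match: ship the bit, then any pending underflow
           bits, which are the opposite of the bit just shipped. */
        OutputBit( output, high & 0x8000 );
        while ( underflow_bits > 0 ) {
            OutputBit( output, ~high & 0x8000 );
            underflow_bits--;
        }
    } else if ( ( low & 0x4000 ) && !( high & 0x4000 ) ) {
        /* Second bit of low is 1 and of high is 0 (the binary analogue
           of the 9-and-0 test): discard the second bits and count it. */
        underflow_bits += 1;
        low  &= 0x3fff;
        high |= 0x4000;
    } else
        break;                /* nothing to shift until the next symbol */
    low  = ( low << 1 ) & 0xffff;           /* low shifts in a 0  */
    high = ( ( high << 1 ) | 1 ) & 0xffff;  /* high shifts in a 1 */
}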
Decoding. In the "ideal" decoding process, we had the entire input number to work
with, and the algorithm had to do things like "divide the encoded number by the symbol
probability." In practice, we can't perform an operation like that on a number that could
be billions of bytes long. As in the encoding process, however, the decoder can operate
using 16- and 32-bit integers for calculations.
Instead of using just two numbers, high and low, the decoder has to use three
numbers. The first two, high and low, correspond exactly to the high and low values
maintained by the encoder. The third number, code, contains the current bits being
read in from the input bit stream. The code value always falls between the high and
low values. As they come closer and closer to it, new shift operations will take place,
and high and low will move back away from code.
The high and low values in the decoder will be updated after every symbol, just as
they were in the encoder, and they should have exactly the same values. By performing
the same comparison test on the upper digit of high and low, the decoder knows
when it is time to shift a new digit into the incoming code. The same underflow tests
are performed as well.
In the ideal algorithm, it was possible to determine what the current encoded
symbol was just by finding the symbol whose probabilities enclosed the present value
of the code. In the integer math algorithm, things are somewhat more complicated. In
this case, the probability scale is determined by the difference between high and low.
So instead of the range being between .0 and 1.0, the range will be between two
positive 16-bit integer counts. Where the present code value falls along that range
determines the current probability. Divide (value - low) by (high - low + 1) to get the
actual probability for the present symbol.
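In code, that division is a single expression. A sketch (names illustrative; scale is the model's total count, carried in the SYMBOL structure described later in this chapter):

/* Map the decoder's position in [low, high] back onto the model's
   0..scale count line; the result is then looked up to find the
   symbol whose cumulative counts straddle it. */
range = (long) ( high - low ) + 1;
count = (unsigned short) ( ( ( (long) ( code - low ) + 1 ) * scale - 1 ) / range );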
132

Page 23
Where's the Beef? It is not immediately obvious why this encoding process is an
improvement over Huffman coding. It becomes clear when we examine a case in which
the probabilities are a little different. If we have to encode the stream "AAAAAAA," and
the probability of A is known to be .9, there is a 90 percent chance that any incoming
character will be the letter A. We set up our probability table so that A occupies the .0
to .9 range, and the end-of-message symbol occupies the .9 to 1 range. The encoding
process is shown next:

New Character    Low value     High value
                 0.0           1.0
A                0.0           0.9
A                0.0           0.81
A                0.0           0.729
A                0.0           0.6561
A                0.0           0.59049
A                0.0           0.531441
A                0.0           0.4782969
END              0.43046721    0.4782969

Now that we know the range of high and low values, all that remains is to pick a
number to encode this message. The number .45 will make this message uniquely
decode to "AAAAAAA." Those two decimal digits take slightly less than seven bits
to specify, which means we have encoded eight symbols in less than eight bits! An
optimal Huffman message would have taken a minimum of nine bits.
To take this point to an even further extreme, I set up a test that had only two
symbols. In it, 0 had a 16382/16383 probability, and an end-of-file symbol had a
1/16383 probability. I then created a file filled with 100,000 0s. After compression using
arithmetic coding, the output file was only three bytes long! The minimum size using
Huffman coding would have been 12,501 bytes. This is obviously a contrived
example, but it shows that arithmetic coding compresses data at rates much better than
one bit per byte when the symbol probabilities are right.
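A back-of-the-envelope entropy count (my own check, not from the book) makes the three-byte figure plausible: each 0 carries -log2(16382/16383) bits, about 0.000088 bits, and the end-of-file symbol carries -log2(1/16383), about 14 bits:

#include <stdio.h>
#include <math.h>

int main( void )
{
    double bits = 100000.0 * -log2( 16382.0 / 16383.0 )   /* the 0s      */
                + -log2( 1.0 / 16383.0 );                  /* end of file */
    printf( "about %.1f bits = %.1f bytes\n", bits, bits / 8.0 );
    return 0;
}

The total comes to roughly 23 bits, or just under three bytes.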
133

Page 24
The Code
The code supplied with this chapter in ARITH.C is a simple module that performs
arithmetic compression and decompression using a simple order 0 model. It works
exactly like the nonadaptive Huffman coding program in chapter 3. It first makes a
single pass over the data, counting the symbols. The data is then scaled down to make
the counts fit into single, unsigned characters. The scaled counts are saved to the
output file for the decompressor to get at later, then the arithmetic coding table is built.
Finally, the compressor passes through the data, compressing each symbol as it
appears. When done, the end-of-stream character is sent out, the arithmetic coder is
flushed, and the program exits.
The Compression Program. The compression portion of this program is shown
shortly. The main module is called by the utility version of MAIN-C.C, which will have
already taken care of opening files, parsing arguments, etc. Once we get to the
compression phase of the program, things are ready to go.
The compressor code breaks down neatly into three sections. The first two lines
initialize the model and the encoder. The while loop plus an additional line after it
performs the compression, and the last three lines shut things down.
build_model( input, output->file );
initialize_arithmetic_encoder();

while ( ( c = getc( input ) ) != EOF ) {
    convert_int_to_symbol( c, &s );
    encode_symbol( output, &s );
}
convert_int_to_symbol( END_OF_STREAM, &s );
encode_symbol( output, &s );
flush_arithmetic_encoder( output );
OutputBits( output, 0L, 16 );
134

Page 25
The build_model() routine has several responsibilities. It makes the first pass over
the input data to count all the characters. It scales down the counts to fit in unsigned
characters, then it takes those counts and builds the range table used by the coder.
Finally, it writes the counts to the output file so the decompressor will have access to
them later.
The initialize_arithmetic_encoder() routine is fairly simple. It just sets up the high- and
low-integer variables used during the encoding. The encoding loop calls two different
routines to encode the symbol. The first, convert_int_to_symbol(), takes the character
read in from the file and looks up the range for the given symbol. The range is then
stored in the symbol object, which has the structure shown:
typedef struct {
    unsigned short int low_count;
    unsigned short int high_count;
    unsigned short int scale;
} SYMBOL;
These three values are all that are needed for the symbol to be encoded using the
arithmetic encoder. The low_count and high_count values uniquely define where on the 0 to
1 range the symbol lies, and the scale value tells what the total span of the 0 to 1 scale is.
If 1,000 characters had been counted in a text file, for example, the low_count and
high_count for A might be 178 and 199, and the scale would be 1,000. This would indicate
that on the 0 to 1 number scale, A would own the range .178 to .199.
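Given those three fields, the encoder's narrowing step is the integer analogue of the floating-point loop shown earlier in the chapter. A sketch of the update (illustrative names, not the book's exact listing):

/* Narrow [low, high] to the slice of the current range owned by the
   symbol; the counts play the role of the 0 to 1 probabilities. */
range = (long) ( high - low ) + 1;
high  = low + (unsigned short) ( range * s->high_count / s->scale - 1 );
low   = low + (unsigned short) ( range * s->low_count  / s->scale );

For the A above, this pins the new interval to the .178 to .199 slice of the old one.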
Once the symbol object has been defined, it can be passed to the encoder. The
arithmetic encoder needs only those three items to process the symbol and update the
output number. It has the high- and low-range values and the underflow count stored
internally as static variables, and it doesn't need anything else.
The way we detached the modeling data from the coding data gives us a convenient
mechanism for experimenting with new ways of modeling. We just have to come up with
the numbers that get plugged into the symbol. The encoder doesn't care how we got those
numbers as long as they were derived consistently so we can decode the file later.
135

Page 26
When we reach the end of the input file, we encode and send the end-of-stream
symbol. The decompression program will use it to determine when it is done. To
finish, call a routine to flush the arithmetic encoder, which takes care of any underflow
bits. Finally, we have to output an extra sixteen bits. The reason for this is simple.
When decoding symbols from the input bit stream, the effective bits are in the most
significant bit position of the input registers. To get the bits there, we have to load other
bits into the lower positions and shift them over. At least 15 insignificant bits are
needed to decode the last symbol. Outputting 16 bits at the end of the program ensures
that the decoder won't get a premature end of file while decoding the input file.
The Expansion Program. The main part of the expansion program follows the same
pattern. First, the model is set up and the arithmetic coder is initialized. In this program,
initializing the model means reading in the counts from the input file where the
compressor put them. Initializing the arithmetic decoder means loading the low and high
registers with 0000 and FFFF, then reading the first 16 bits from the input file into the
current code.
input_counts( input->file );
initialize_arithmetic_decoder( input );

for ( ; ; ) {
    get_symbol_scale( &s );
    count = get_current_count( &s );
    c = convert_symbol_to_int( count, &s );
    if ( c == END_OF_STREAM )
        break;

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket