`compression techniques in C
`
`Mark Nelson
`
`M_§£Fm
`El;
`
`Veritas Techs. LLC
`Exhibit 1012
`Page 001
`
`
`
`M&T Books
`115 West 18th Street
`New York, NY 10011
`
`Slflflfllllllll
`
`© 1992 by M&T Publishing, Inc.
`
`Printed in the United States of America
`
`All rights reserved. No part of this book may be reproduced or transmitted in any form or by any means,
`electronic or mechanical, including photocopying, recording, or by any information storage and
`retrieval sytem, without prior written pemission from the Publisher. Contact the Publisher for
`information on foreign rights.
`
`Limits of Liability and Disclaimer of Warranty
`The Author and Publisher of this book have used their best efforts in preparing the book and the
`programs contained in it. These efforts include the development, research, and testing of the theories
`and programs to determine their effectiveness.
`
`The Author and Publisher make no warranty of any kind, expressed or implied, with regard to these
`programs or the documentation contained in this book. The Author and Publisher shall not be liable in
`any event for incidental or consequential damages in connection with, or arising out of, the furnishing,
`performance, or use of these programs.
`
`Library of Congress Cataloging-in-Publication Data
`
`Nelson, Mark, 1958 —
`The Data Compression Book/by Mark Nelson
`p. cm.
`Includes index.
`ISBN 1-55851-2144: $29.95
`1. Data Compression (Computer Science)
`QA76.9.D33N46
`1991
`005.74’6--dc20
`
`ISBN l-55851-216-0 (book/disk set): $39.95
`I. Title
`
`9l-23080
`CIP
`
`Project Editor: Tova Fliegel
`Copy Editor: Shayne Bell
`Technical Reviewer: Robert Jung
`
`95
`
`94
`
`93
`
`432
`
`Cover Design: Lauren Smith Design
`Layout: Marni Tapscott
`
`Veritas Techs. LLC
`Exhibit 1012
`Page 002
`
`
`
`Contents
`
`WHY THIS BOOK IS FOR YOU .................................................................. ..1
`
`INTRODUCTION TO DATA COMPRESSION ...................................... ..3
`
`The Audience ........................................................................................................ ..4
`
`Why C? ................................................................................................................. ..4
`Which C? .............................................................................................................. ..5
`
`Keeping Score ....................................................................................................... ..9
`The Structure ...................................................................................................... ..10
`
`THE DATA-COMPRESSION LEXICON, WITH A H|STORY....13
`
`The Two Kingdoms ............................................................................................. .. 13
`
`Data Compression = Modeling + Coding ............................................................ .. 14
`
`The Dawn Age .................................................................................................... .. 15
`
`Coding ................................................................................................................ .. 17
`
`An Improvement .............................................................................................. .. 18
`
`Modeling ............................................................................................................. ..20
`
`Statistical Modeling ......................................................................................... .. 20
`
`Dictionary Schemes ......................................................................................... ..22
`
`Ziv and Lempel ................................................................................................... .. 23
`LZ77 . .............................................................................................................. ..23
`
`LZ78. ............................................................................................................... .. 24
`
`Lossy Compression ............................................................................................. ..24
`Programs to Know .............................................................................................. .. 25
`
`THE DAWN AGE: MINIMUM REDUNDANCY CODING ............. ..29
`
`The Shannon—Fano Algorithm ............................................................................. ..31
`
`The Huffman Algorithm ...................................................................................... ..34
`HuffmaninC ...................................................................................................... ..38
`
`Veritas Techs. LLC
`Exhibit 1012
`Page 003
`
`
`
`BITIO.C.
`A Reminder about Prototypes
`MA1N—C.C and MAIN-E.C
`
`.......
`....
`
`....
`
`.....
`
`.......
`
`........
`
`......
`
`...........
`
`ERRHAND.C...
`Into the Huffman Code.
`Counting the Symbo1s........
`Saving the Counts................. ........ ..
`Building the Tree..........
`.......
`......
`Using the Tree ....
`The CompressionCode .63
`Putting It All Together..
`Performance.....................
`
`..........59
`
`.......
`
`......
`
`A SIGNIFICANT IMPROVEMENT: ADAPTIVE HUFFMAN
`
`Adaptive Coding
`Updating the Huffman Tree.............
`What SwappingDoes
`The Algorithm.
`........
`
`........
`
`.88
`..89
`
`The Escape Code .
`The Overflow Problem
`........ ..
`A Rescaling Bonus.
`TheCode
`97
`Initialization of theArray
`The Compress MainProgram .98
`The Expand Main Program.
`.....
`Encoding the Symbol.
`108
`Decoding the Symbol.
`TheCode ..........109
`
`..
`
`Veritas Techs. LLC
`Exhibit 1012
`Page 004
`
`
`
`HUFFMAN ONE BETTER:
`ARITHMETIC CODING ..................
`
`Difficulties ........................................................................................................ .. 123
`Arithmetic Coding: A Step Forward .................................................................. .. 124
`Practical Matters ............................................................................................ .. 128
`A Complication .............................................................................................. .. 130
`Decoding ....................................................................................................... .. 132
`Where’s the Beef? .......................................................................................... .. 133
`
`The Code .............................................................................................. .., ......... .. 134
`The Compression Program ............................................................................. .. 134
`The Expansion Program ................................................................................. .. 136
`Initializing the Model ..................................................................................... .. 137
`Reading the Model ......................................................................................... .. 141
`Initializirig the Encoder .................................................................................. .. 141
`The Encoding Process .................................................................................... .. 142
`Flushing the Encoder ..................................................................................... .. 145
`The Decoding Process ................................................................................... .. 145
`Summary ........................................................................................................... .. 148
`Code .................................................................................................................. .. 148
`
`STATISTICAL MODELING ......................................................................... ..167
`
`167
`Higher—Order Modeling ....................................................................................
`Finite Context Modeling ................................................................................... .. 168
`Adaptive Modeling ........................................................................................... .. 169
`A Simple Example ......................................................................................... .. 170
`Improvements ................................................................................................ .. 176
`Highest—Order Modeling ................................................................................... .. 176
`Updating the Model ....................................................................................... .. 178
`Escape Probabilities ....................................................................................... .. 179
`Scoreboarding ................................................................................................ .. 180
`Data Structures .............................................................................................. .. 181
`The Finishing Touches: Tables 1 and 2 .......................................................... .. 185
`
`Veritas Techs. LLC
`Exhibit 1012
`Page 005
`
`
`
`Model Flushing .............................................................................................. .. 186
`Implementation .............................................................................................. .. 186
`Conclusions ........................
`....
`................................................................... .. 187
`
`Enhancement ................................................................................................. .. 187
`
`ARITH—N Listing ................................................................................
`
`....... .. 188
`
`DICTIONARY-BASED COMPRESSION ...............................................219
`
`An Example ..................................................................................................... ...219
`Static VS. Adaptive ............................................................................................. .. 220
`Adaptive Methods .......................................................................................... .. 221
`A Representative Example ............................................................................. .. 223
`Israeli Roots .............................................................................
`.................... .. 225
`
`History ........................................................................................................... .. 226
`ARC: The Father of MS—DOS Dictionary Compression ................................... ..227
`
`Dictionary Compression ................................................................................. ..228
`Danger Ahead—Patents ..................
`............................................................... ..230
`Conclusion . ......... . .. ........................................................................................... .. 231
`
`S L I D I N G W I N D O W C O M P R E S S I O N
`
`3 3
`
`........................................................ ..233
`........
`.........
`The Algorithm ................
`Problems with LZ77 ...................................................................................... ..238
`
`............. .....240
`.........
`An Encoding Problem .....................................................
`LZSS Compression .....................................
`......
`.................................. ..240
`Data Structures .............................................................................................. ..242
`
`A Balancing Act ............................................................................................ ..244
`Greedy vs. Best Possible ......................................
`....................................... ..246
`The Code ...................
`.........................
`...........
`......................................... ..247
`Constants and Macros .................................................................................... ..247
`
`Global Variables ...............................................................
`
`........................... .. 251
`
`............................................ ..252
`...........
`The Compression Code ....................
`Initialization. ................... ... ................
`......................................................... . . 25 3
`
`The Main Loop ......................................................................................
`
`..... ..254
`
`Veritas Techs. LLC
`Exhibit 1012
`Page 006
`
`
`
`The Exit Code ................................................................................................ .. 256
`AddSt1ing() .................................................................................................... .. 257
`DeleteString() ................................................................................................. ..26O
`Binary Tree Support Routines ........................................................................ .. 262
`The Expansion Routine ..................................................................................... .. 263
`Improvements ................................................................................................ .. 265
`The Code .......................................................................................................... ..266
`
`L278 COMPRESSION ................................................................................... ..277
`Can LZ77 Improve? .......................................................................................... .. 278
`Enter LZ78 ........................................................................................................ ..279
`LZ78 Details .................................................................................................. ..279
`LZ78 Implementation. ................................................................................... ..282
`An Effective Variant .......................................................................................... .. 285
`Decompression .................................................................................................. .. 287
`The Catch ...................................................................................................... .. 288
`LZW Implementation .................................................................................... ..290
`Tree Maintenance and Navigation .................................................................. ..290
`Compression ..................................................................................................... .. 293
`Decompression .................................................................................................. ..294
`The Code .......................................................................................................... .. 296
`Improvements ................................................................................................... .. 302
`Patents ............................................................................................................... .. 3 ll
`
`SPEECH COMPRESSION .....................................; ..................................... ..313
`Digital Audio Concepts ..................................................................................... .. 313
`Fundamentals ................................................................................................. . . 3 14
`Sampling Variables ........................................................................................ .. 319
`PC—Based Sound ............................................................................................ .. 323
`Lossless Compression of Sound ........................................................................ ..324
`Problems and Results ..................................................................................... .. 325
`Lossy Compression ........................................................................................ .. 328
`
`Veritas Techs. LLC
`Exhibit 1012
`Page 007
`
`
`
`Silence Compression ...................................................................................... .. 329
`Other Techniques .............................................................................................. .. 345
`
`LOSSY GRAPHICS COMPRESSION ................................................... ..347
`Enter Compression ........................................ ..‘ .................................................. ..348
`Statistical and Dictionary Compression Methods ........................................... .. 348
`Lossy Compression ........................................................................................ ..349
`Differential Modulation .................................................................................. .. 350
`Adaptive Coding ............................................................................................ .. 351
`A Standard That Works: JPEG .......................................................................... .. 353
`JPEG Compression ........................................................................................ ..354
`The Discrete Cosine Transform ...................................................................... .. 354
`DCT Specifics ............................................................................................... .. 357
`Why Bother? ..................................................................................................... ..358
`Implementing the DCT ..................................................................................... .. 359
`Matrix Multiplication ..................................................................................... .. 360
`Continued Improvements .................................................................................. .. 362
`Output of the DCT ......................................................................................... .. 363
`Quantization ................................................................................................... .. 364
`Selecting a Quantization Matrix ................................................................... .. 365
`Coding ................................................................ .; ............................................ .. 369
`The Zig-Zag Sequence ................................................................................... .. 369
`Entropy Encoding .......................................................................................... .. 372
`What About Color? ........................................................................................ .. 373
`The Sample Program ......................................................................................... .. 373
`Input Format .................................................................................................. .. 374
`The Code ....................................................................................................... ..375
`Initialization ................................................................................................... .. 376
`The Forward DCT Routine ............................................................................ .. 378
`WriteDCTData() ............................................................................................ .. 379
`File Expansion ............................................................................................... .. 382
`ReadDCTData() ................................................................................ . L ........... . . 383
`
`Veritas Techs. LLC
`Exhibit 1012
`Page 008
`
`
`
`Input DCT Code ............................................................................................ .. 384
`The Inverse DCT ........................................................................................... .. 385
`The Complete Code Listing .............................................................................. .. 387
`Support Programs .............................................................................................. ..400
`Some Compression Results ............................................................................... ..406
`
`AN ARCHIVING PACKAGE .........................................................................409
`
`CAR and CARMAN ......................................................................................... ..409
`The CARMAN Command Set ....................................................................... ..41O
`
`The CAR File ................................................................................................ ..412
`
`The Header .................................................................................................... .. 413
`Storing the Header ......................................................................................... ..413
`The Header CRC ........................................................................................... ..417
`
`Command-Line Processing ............................................................................ .. 418
`Generating the File List ..................................................................................... .. 421
`Opening the Archive‘Files ............................................................................. ..426
`The Main Processing Loop ............................................................................... ..428
`Skipping/Copying Input File .......................................................................... ..434
`File Insertion .................................................................................................. ..435
`
`File Extraction ............................................................................................... .. 436
`
`Cleanup ............................................................................................................438
`The Code .......................................................................................................... ..438
`
`APPENDIX A:STAT|STlCS FOR
`COMPRESSION PROGRAMS .....................................................................489
`
`APPENDIX B:TEST PROGRAMS .......................................................... ..495
`
`GLOSSARY .............................................................................................................505
`
`BIBLIOGRAPHY ..................................................................................................517
`
`Other Resources ................................................................................................ ..518
`
`Veritas Techs. LLC
`Exhibit 1012
`Page 009
`
`
`
`AFTERWORDmmWmmwmmwmmwm”
`
`Wmm.mWmWmwmmm
`
`INDEX.
`
`Wmwmmmwmmmmmmmm
`
`.mmw
`
`Veritas Techs. LLC
`Exhibit 1012
`Page 010
`
`
`
`CHAPTER 1
`
`Introduction to Data
`Compression
`
`The primary purpose of this book is to explain various data—compression tech-
`niques using the C programming language. Data compression seeks to reduce the
`number of bits used to store or transmit information. It encompasses a wide variety of
`software and hardware compression techniques which can be so unlike one another
`that they have little in common except that they compress data. The LZW algorithm
`used in the Compuserve GIF specification, for example. has virtually nothing in
`common with the CCITT G.72l specification used to compress digitized voice over
`phone lines.
`This book will not take a comprehensive look at every variety of data compres-
`sion. The field has grown in the last 25 years to a point where this is simply not
`possible. What this book will cover are the various types of data compression com-
`monly used on personal and midsized computers, including compression of binary .
`programs, data, sound, and graphics.
`Furthermore, this book will either ignore or only lightly cover data—compression
`techniques that rely on hardware for practical use or that require hardware applica-
`tions. Many of today’s Voice—compression schemes were designed for the worldwide
`f1xed—bandwidth digital telecommunications networks. These compression schemes
`are intellectually interesting, but they require a specific type of hardware tuned to the
`fixed bandwidth of the communications channel. Different algorithms that don’t have
`to meet this requirement are used to compress digitized voice on a PC, and these
`algorithms generally offer better performance.
`
`Veritas Techs. LLC
`Exhibit 1012
`Page 011
`
`
`
`THE DATA COMPRESSION BOOK
`
`Some of the most interesting areas in data compression today, however, do concern
`compression techniques just becoming possible with new and more powerful hardware.
`Lossy image compression, like that used in multimedia systems, for example, can now be
`implemented on standard desktop platforms. This book will cover practical ways to both
`experiment with and implement some of the algorithms used in these techniques.
`
`The Audience
`You will need basic programming skills to adequately discuss data»compression code.
`"the ability to follow block-structured code, such as C or Pascal, is a requirement. In
`addition, understanding computer architecture well enough to follow bit—oriented opera-
`tions, such as shifting, logical ORing and ANDing, and so on, will be essential.
`This does not mean that you need to be a C guru for this book to be worthwhile. You
`don’t even have to be a programmer. But the ability to follow code will be essential,
`because the concepts discussed here will be illustrated with portable C programs. The
`C code in this book has been written with an eye toward simplicity in the hopes that
`C novices will still be able to follow the programs. We will avoid the more esoteric
`constructs of C, but the code will be C, not pseudocode or English.
`
`Why C?
`The use of C to illustrate data—compression algorithms may raise some hackles. A more
`traditional way to write this book would have been to use pseudocode to sketch out the
`algorithms. But the lack of rigor in a pseudocode “program” often leads to hazy or
`incomplete definitions full of lines like “PROCESS FILE UNTIL OUT OF DATA.” The
`result is that pseudocode is easy to read, but not so easy to translate into a working program.
`If pseudocode is unsatisfactory, the next best choice is to use a conventional
`programming language. Though hundreds of choices are available, C seems the best
`choice for this type of book for several good reasons. First, in many respects C has
`become the lingua franca of programmers. That C compilers support computers
`ranging from a lowly 8051 microcontroller to supercomputers capable of 100 million
`instructions per second (MIPS) has had much to do with this. It doesn’t mean that C
`is the language of choice for all programmers. What it does mean is that most
`
`Veritas Techs. LLC
`Exhibit 1012
`Page 012
`
`
`
`|NTRODUCTl0N
`
`programmers should have a C compiler available for their machines, and most are
`probably regularly exposed to C code. Because of this, many prograrmners who use
`other languages can still manage to code in C, and even more can at least read C.
`A second reason for using C is that it is a language without too many surprises. The few
`constructs it uses as basic language elements are easily translated to other languages. So a
`datacompression program that is illustrated using C can be converted to a working Pascal
`program through a relatively straightforward translation procedure. Even assembly—lan—
`guage programmers should find the process relatively painless.
`Perhaps the most importantreason for using C is simply one of efficiency. C is often
`thought of as ahigh—level assembly language, since it allows programmers to get close
`to the hardware. Despite the increasing optimization found in recent C compilers, it is
`not likely that C will ever exceed the speed or size possible ir1 hand-coded assembly
`language. That flaw is offset, however, by the ability to easily port C code to other
`machines. So for a book of this type, C is probably the most efficient choice.
`
`Which C?
`Despite being advertised as a “portable” language, a C program that compiles and
`executes on a given machine is not guaranteed to run on any other. It may not even
`compile using a different compiler on the same machine. The important thing to
`remember is not that C is portable, but that it can be portable. The code for this book
`has been written to be portable, and it compiles and runs cleanly using several
`compilers and environments.’Tl1e compilers/environments used here include:
`
`- Microsoft C 6.0, MS—DOS 3.3/5.0
`- Borland C++ 2.0, MS—DOS 3.3/5.0
`= Zortech C++, MS—DOS 3.3/5.0, with 286 and 386 DOS Extenders
`- Interactive Unix System 3.2 with the portable C compiler
`
`Veritas Techs. LLC
`Exhibit 1012
`Page 013
`
`
`
`THE DATA COMPRESSION BOOK
`
`One important portability issue is library function calls. Though the C program-
`ming language was fairly well defined by the original K&R book (Brian W. Kemighan
`and Dennis M. Ritchie, The C Programming Language [Englewood Cliffs, N.J.:
`Prentice-Hall, 1978]), the run-time library implementation was left totally up to the
`whims of the implementor. Fortunately, the American National Standards Institute
`was able to complete the C language specification in l990, and the result was
`published as ANSI standard XJl1.34. This standard not only expanded and pinned
`down the original K&R language specification, but it also took on the definition of a
`standard C run—tirne library. This makes it much easier to write code that works the
`same way from machine to machine. The code in this book will be written with the
`intention of using only ANSI C library calls. Compiler-dependent extensions to either
`the language or the library will be avoided wherever possible.
`Given the standardization of the libraries, the remaining portability issues center
`
`around two things: sizes of the basic data types and dealing with noncompliant
`compilers. The majority of data-type conflicts arise when switching between 16- and
`32-bit machines.
`
`is fairly easy to manage the change between 16- and 32-bit
`it
`Fortunately,
`machines. Though the basic integer data type switches between 16- and 32-bits, both
`machines have a 16-bit “short int” data type. Once again, a “long int” "is generally 32
`bits on both machines. So in cases where the size of an integer clearly matters, it can '
`
`be pirmed down to either 16- or 32—bits with the appropriate declaration.
`On the vast majority of machines used in the world today, the C compiler
`implementation of the “char” data type is 8 bits wide. In this book, we will gloss over
`the possibility that any other size exists and stick with 8-bit characters. In general,
`porting a program shown here to a machine with an unusual char size is not too
`difficult, but spending too much time on it will obscure the important point of the
`
`programs here, which is data compression.
`The final issue to deal with when writing portable code is the problem of
`
`noncompliant compilers. In the MS-DOS world, most C compilers undergo major
`releases and upgrades every two years or so. This means that most compiler vendors
`have been able to release new versions of their compilers that now conform closely to
`
`Veritas Techs. LLC
`Exhibit 1012
`Page 014
`
`
`
`INTRODUCHON
`
`the ANSI C standard. But this is not the case for users of many other operating
`systems. In particular, UNIX users will frequently be using a C compiler which came
`with their system and which confonns to the older K&R language definition. While
`the ANSI C committee went to great lengths to make ANSI C upwardly compatible
`from K&R C, we need to watch out for a few problems.
`The first problem lies in the use of function prototypes. Under K&R C, function
`prototypes were generally used only when necessary. The compiler assumed that any
`unseen function returned an integer, and it accepted this without complaint. If a
`function retumed something unusual—a pointer or along, for instance——the prograrn—
`mer would write a function prototype to inform the compiler:
`
`long locate_string();
`
`the prototype told the compiler to generate code that assumes that the
`Here,
`function returned a long instead of an int. Function prototypes didn’t have much more
`use than that. Because of this, many C programmers working under a K&R regime
`made little or no use of function prototypes, and their appearance in a program was
`
`something of an oddity.
`While the ANSI C committee tried not to alter the basic nature of C, they were
`unable to pass up the potential improvements to the language that were possible
`through the expansion of the prototyping facility. Under ANSI C, a function prototype
`defines not only the return type of a function, but also the type of all the arguments as
`well. The function shown earlier, for example, might have the following prototype
`
`with an ANSI C compiler:
`
`long locate_string< FILE *input_file, char *string );
`
`This lets the compiler generate the correct code for the return type and check for the
`correct type and number of arguments as well. Si