`
VOLUME 25, No. 10
`
`OCTOBER 1995
`
`EDITORS
`
`DOUGLAS COMER
`
`ANDY WELLINGS
`
`SOFTWARE
`
`ISSN 0038-0644
`
WILEY
Chichester - New York - Brisbane - Toronto

SPEXBL 25(10) 1065-1182 (1995)
`
`
`
`
SOFTWARE
`PRACTICE & EXPERIENCE
`Editors
Professor D. E. Comer, Computer Science Department, Purdue University, West
Lafayette, IN 47907, U.S.A.
Charlotte I. Tubis, U.S. Editorial Assistant, Computer Science Department, Purdue University, West Lafayette,
IN 47907, U.S.A.
Dr A. J. Wellings, Department of Computer Science, University of York,
Heslington, York YO1 5DD, U.K.
`
`Advisory Editorial Board
`Professor D. W. BARRON
`Department of Electronics and Computer Science,
`University of Southampton,
Southampton SO9 5NH, U.K.
`Professor P. J. BROWN
`Computing Laboratory, The University,
`Canterbury, Kent CT2 7NF, U.K.
`
`Professor J. A. CAMPBELL
`Department of Computer Science, University College London,
Gower Street, London WC1E 6BT, U.K.
`
`Professor F. J. CORBATO
`Electrical Engineering Department,
`Massachusetts Institute of Technology,
`545 Technology Square,
`Cambridge, Massachusetts 02139, U.S.A.
`
`Dr. Christopher W. FRASER
AT&T Bell Laboratories, 600 Mountain Ave 2C-464,
`Murray Hill, NJ 07974-0636, U.S.A.
`
`Professor D. E. KNUTH
`Department of Computer Science, Stanford University,
Stanford, California 94305, U.S.A.
`
`Dr. B. W. LAMPSON
`180 Lake View Ave.
`Cambridge,
MA 02138, U.S.A.
`
`Dr. C. A. LANG
`Three-Space Ltd,
70 Castle Street,
Cambridge CB3 0AJ, U.K.
`
`Professor B. RANDELL
`Computing Laboratory,
`University of Newcastle-upon-Tyne,
`Claremont Tower, Claremont Road,
Newcastle-upon-Tyne NE1 7RU, U.K.
`
`Professor J. S. ROHL
`Department of Computer Science,
The University of Western Australia,
`Nedlands, Western Australia 6009.
`
`D. T. ROSS
Softech Inc., 460 Totten Pond Road,
`Waltham, Massachusetts 02154, U.S.A.
`
`B. H. SHEARING
`The Software Factory,
`28 Padbrook, Limpsfield, Oxted,
Surrey RH8 0DW, U.K.
`
`Professor N. WIRTH
Institut für Computersysteme, ETH-Zentrum,
CH-8092 Zürich, Switzerland.
`
`Professor PER BRINCH HANSEN
`School of Computer and Information Science,
`4-116 CST, Syracuse University,
`Syracuse, New York 13210, U.S.A.
`Professor D. R. HANSON
`Department of Computer Science,
`Princeton University, Princeton,
`New Jersey 08544, U.S.A.
`Professor J. KATZENELSON
`Faculty of Electrical Engineering,
Technion-Israel Institute of Technology,
`Haifa, Israel
`Dr. B. W. KERNIGHAN
`AT&T Bell Laboratories, 600 Mountain Avenue,
`Murray Hill, New Jersey 07974, U.S.A.
`Aims and Scope
Software—Practice and Experience is an internationally respected and rigorously refereed vehicle for the dissemination and
discussion of practical experience with new and established software for both systems and applications. Contributions regularly:
(a) describe detailed accounts of completed software-system projects which can serve as 'how-to-do-it' models for future
work in the same field; (b) present short reports on programming techniques that can be used in a wide variety of areas; (c)
document new techniques and tools that aid in solving software construction problems; and (d) explain methods/techniques
that cope with the special demands of large scale software projects. The journal also features timely Short Communications
on rapidly developing new topics.
The editors actively encourage papers which result from practical experience with tools and methods developed and used
in both academic and industrial environments. The aim is to encourage practitioners to share their experiences with design,
implementation and evaluation of techniques and tools for software and software systems.
Papers cover software design and implementation, case studies describing the evolution of systems and the thinking behind
them, and critical appraisals of software systems. The journal has always welcomed tutorial articles describing well-tried
techniques not previously documented in computing literature. The emphasis is on practical experience; articles with theoretical
or mathematical content are included only in cases where an understanding of the theory will lead to better practical systems.
Articles range in length from a Short Communication (half to two pages) to the length required to give full treatment to a
substantial piece of software (40 or more pages).
`
Advertising: For details contact-
Michael J. Levermore, Advertisement Sales, John Wiley & Sons Ltd, Baffins Lane, Chichester, Sussex PO19 1UD, England (Telephone 01243
770351, Fax 01243 775878, Telex 862901).
Software—Practice and Experience (ISSN 0038-0644/USPS 890-920) is published monthly by John Wiley & Sons Limited, Baffins Lane, Chichester,
Sussex, England. Second class postage paid at Jamaica, N.Y. 11431. Air freight and mailing in the U.S.A. by Publications Expediting Services Inc.,
200 Meacham Avenue, Elmont, N.Y. 11003. © 1995 by John Wiley & Sons Ltd. Printed and bound in Great Britain by Page Bros, Norwich. Printed
on acid-free paper.
To subscribe: Orders should be addressed to Subscriptions Department, John Wiley & Sons Limited, Baffins Lane, Chichester, Sussex PO19 1UD,
England. 1995 subscription price (13 issues): U.S. $825.00.
`
U.S.A. POSTMASTER: Send address changes to Software—Practice and Experience, c/o Publications Expediting Services Inc., 200 Meacham
Avenue, Elmont, N.Y. 11003, U.S.A.
`
`
`
`
`SOFTWARE—PRACTICE AND EXPERIENCE
`(Softw. pract. exp.)
`
`VOLUME 25, ISSUE No. 10
`
`October 1995
`
`CONTENTS
`
Migration in Object-oriented Database Systems—A Practical Approach:
C. Huemer, G. Kappel and S. Vieweg .......................................................... 1065

Automatic Synthesis of Compression Techniques for Heterogeneous
Files: W. H. Hsu and A. E. Zwarico ............................................................. 1097

A Tool for Visualizing the Execution of Interactions on a Loosely-coupled
Distributed System: P. Ashton and J. Penny ................................................ 1117

Process Scheduling and UNIX Semaphores: N. Dunstan and I. Fris .............. 1141

Software Maintenance: An Approach to Impact Analysis of Objects
Change: S. A. Ajila ...................................................................................... 1155
`
`SPEXBL 25(10) 1065-1182 (1995)
`ISSN 0038-0644
`
Indexed or abstracted by Cambridge Scientific Abstracts, CompuMath Citation Index (ISI),
Compuscience Database, Computer Contents, Computer Literature Index, Computing
Reviews, Current Contents/Eng, Tech & Applied Sciences, Data Processing Digest, Deadline
Newsletter, Educational Technology Abstracts, Engineering Index, Engineering Societies
Library, IBZ (International Bibliography of Periodical Literature), Information Science Abstracts
(Plenum), INSPEC, Knowledge Engineering Review, Nat Centre for Software Technology,
Research Alert (ISI) and SCISEARCH Database (ISI).
`
`
`
`
`SOFTWARE—PRACTICE AND EXPERIENCE, VOL. 25(10), 1097-1116 (OCTOBER 1995)
`
`Automatic Synthesis of Compression Techniques for
Heterogeneous Files
`
WILLIAM H. HSU

Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, U.S.A.
(email: bhsu@cs.uiuc.edu, voice: (217) 244-1620)

AND

AMY E. ZWARICO

Department of Computer Science, The Johns Hopkins University, Baltimore, MD 21218, U.S.A.
(email: amy@cs.jhu.edu, voice: (410) 516-5304)
`
`SUMMARY
`
`We present a compression technique for heterogeneous files, those files which contain multiple types of
data such as text, images, binary, audio, or animation. The system uses statistical methods to determine
`the best algorithm to use in compressing each block of data in a file (possibly a different algorithm for
`each block). The file is then compressed by applying the appropriate algorithm to each block. We obtain
`better savings than possible by using a single algorithm for compressing the file. The implementation
`of a working version of this heterogeneous compressor is described, along with examples of its value
`toward improving compression both in theoretical and applied contexts. We compare our results with
those obtained using four commercially available compression programs, PKZIP, Unix compress, StuffIt,
`and Compact Pro, and show that our system provides better space savings.
`
`KEY WORDS:
`
adaptive/selective data compression algorithms; redundancy metrics; heterogeneous files; program synthesis
`
`INTRODUCTION
`
`The primary motivation in studying compression is the savings in space that it provides.
`Many compression algorithms have been implemented, and with the advent of new hard-
`ware standards, more techniques are under development. Historically, research in data com-
`pression has been devoted to the development of algorithms that exploit various types of
redundancy found in a file. The shortcoming of such algorithms is that they assume, often
inaccurately, that files are homogeneous throughout. Consequently, each exploits only a
subset of the redundancy found in the file.
Unfortunately, no algorithm is effective in compressing all files.1 For example, dynamic
Huffman coding works best on data files with a high variance in the frequency of individual
characters (including some graphics and audio data), achieves mediocre performance on
natural language text files, and performs poorly in general on high-redundancy binary data.
On the other hand, run length encoding works well on high-redundancy binary data, but
performs very poorly on text files. Textual substitution works best when multiple-character
strings tend to be repeated, as in English text, but this performance degrades as the average
`
`CCC 0038-0644/95/101097-20
`©1995 by John Wiley & Sons, Ltd.
`
`Received 20 April 1994
`Revised 5 February 1995
`
`
`
`
`
length of these strings decreases. These relative strengths and weaknesses become critical
when attempting to compress heterogeneous files. Heterogeneous files are those which contain
multiple types of data such as text, images, binary, audio, or animation. Consequently,
their constituent parts may have different degrees of compressibility. Because most compression
algorithms are either tailored to a few specific classes of data or are designed to
handle a single type of data at a time, they are not suited to the compression of heterogeneous
files. In attempting to apply a single method to such files, they forfeit the possibility
of greater savings achievable by compressing various segments of the file with different
methods.
To overcome this inherent weakness found in compression algorithms, we have developed
a heterogeneous compressor that automatically chooses the best compression algorithm to
use on a given variable-length block of a file, based on both the qualitative and quantitative
properties of that segment. The compressor determines and then applies the selected
algorithms to the blocks separately. Assembling compression procedures to create a specifically
tailored program for each file gives improved performance over using one program
for all files. This system produces better compression results than four commonly available
compression packages, PKZIP,2 Unix compress,3 StuffIt,4 and Compact Pro,5 for arbitrary
heterogeneous files.
The major contributions of this work are twofold. The first is an improved compression
system for heterogeneous files. The second is the development of a method of statistical
analysis of the compressibility of a file (its redundancy types). Although the concept
of redundancy types is not new,6,7 synthesis of compression techniques using redundancy
measurements is largely unprecedented. The approach presented in this paper uses a straightforward
program synthesis technique: a compression plan, consisting of instructions for each
block of input data, is generated, guided by the statistical properties of the input data. Because
of its use of algorithms specifically suited to the types of redundancy exhibited by
the particular input file, the system achieves consistent average performance throughout the
file, as shown by experimental evidence.
As an example of the type of savings our system produces, consider compressing a
heterogeneous file (such as a small multimedia data file) consisting of 10K of low redundancy
(non-natural language) ASCII data, 10K of English text, and 25K of graphics. In
this case, a reasonably sophisticated compression program might recognize the increased
savings achievable by employing Huffman compression, to better take advantage of the fact
that the majority of the data is graphical. However, none of the general-purpose compression
methods under consideration are optimal when used alone on this file. This is because
the text part of this file is best compressed by textual substitution methods (e.g., Lempel-Ziv)
rather than statistical methods, while the low-redundancy data* and graphics parts
are best compressed by alphabetic distribution-based methods (e.g., arithmetic or dynamic
Huffman coding) rather than Lempel-Ziv or run-length encoding. This particular file totals
45K in length before compression. A compressor using pure dynamic Huffman coding only
achieves about 7 per cent savings for a compressed file of length 42.2K. One of the best
general-purpose Lempel-Ziv compressors currently available achieves 18 per cent savings,
producing a compressed file of length 37.4K. Our system uses arithmetic coding on
the first and last segments and Lempel-Ziv compression on the text segment in the middle,
achieving a 22 per cent savings and producing a compressed file of length 35.6K. This is
a 4 per cent improvement over the best commercial system.
`
* This denotes, in our system, a file with a low rate of repeated strings.
`
`
`
`
`
The purpose of our experiments was to verify the conjecture that a selective system
for combining methods can improve savings on a significant range of heterogeneous files,
especially multimedia data. This combination differs from current adaptive methods in
that it switches among compression paradigms designed to remove very different types
of redundancy. By contrast, existing adaptive compression programs are sensitive only to
changes in particular types of redundancy, such as run-length, which do not require changing
the underlying algorithm used in compression. Thus they cannot adapt to changes in the
type of redundancy present, such as from high run-length to high character repetition. The
superiority of our approach is demonstrated in our experimental section.
This paper begins with a presentation of existing approaches to data compression, including
a discussion of hybrid and adaptive compression algorithms and a description of four
popular commercial compression packages. These are followed by documentation on the
design of the heterogeneous compression system, analysis of experimental results obtained
from test runs of the completed system, and comparison of the system's performance against
that of commercial systems. Finally, implications of the results and possibilities for future
work are presented.
`
`RELATED WORK
`
`It is a fundamental result of information theory that there is no single algorithm that per-
forms optimally in compressing all files. However, much work has been done to develop
`algorithms and techniques that work nearly optimally on all classes of files. In this sec-
`tion we discuss adaptive algorithms, composite algorithms, and four popular commercial
`compression packages.
`
`Adaptive compression algorithms and composite techniques
`
`Exploiting the heterogeneity in a file has been addressed in two ways: the development
`of adaptive compression algorithms, and the composition of various algorithms. Adaptive
`compression algorithms attune themselves gradually to changes in the redundancies within a
`file by modifying parameters used by the algorithm, such as the dictionary, during execution.
`For example, adaptive alphabetic distribution-based algorithms such as dynamic Huffman
coding maintain a tree structure to minimize the encoded length of the most frequently
occurring characters. This property can be made to change continuously as a file is processed.
An example of an adaptive textual substitution algorithm is Lempel-Ziv compression,
a title which refers to two distinct variants of a basic textual substitution scheme. The
first variant, known as LZ77 or the sliding dictionary or sliding window method, selects
positional references from a constant range of preceding strings. These 'back-pointers'
literally encode position and length of a repeated string. The second variant, known as
LZ78 or the dynamic dictionary method, uses a dictionary structure with a paging heuristic.
When the dictionary — a table of strings from the file — is completely filled, the contents
are not discarded. Instead, an auxiliary dictionary is created and updated while compression
continues using the main dictionary. Each time this auxiliary table is filled, its contents are
'swapped' into the main dictionary and it is cleared. The maintenance of dictionaries for
textual substitution is analogous to the semi-space method of garbage collection, in which
two pages are used but only one is 'active' — these are exchanged when one fills beyond
a preset threshold. Another adaptive variant of this algorithm is the Lempel-Ziv-Welch
`
`
`
`
`
(LZW) algorithm, a descendant of LZ78 used in Unix compress. Both LZW and LZ78
vary the length of strings used in compression.
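To make the paging heuristic concrete, the following sketch shows an LZ78-style encoder (Python, illustrative only; it is not the authors' code and not the exact compress/LZW scheme) that keeps using a full main dictionary while training an auxiliary one, then 'swaps' the auxiliary in once it also fills. A complete coder would keep the decoder's dictionaries synchronized in the same way.

# Illustrative sketch of the LZ78 paging heuristic described above.
# MAX_ENTRIES and the byte-level phrase model are assumptions for the example.

MAX_ENTRIES = 4096

def lz78_encode_with_paging(data: bytes):
    main = {b"": 0}                     # phrase -> index; entry 0 is the empty phrase
    aux = {b"": 0}                      # auxiliary dictionary, trained while main is full
    output = []                         # list of (phrase index, mismatching byte) pairs
    phrase = b""
    for byte in data:
        candidate = phrase + bytes([byte])
        if candidate in main:
            phrase = candidate          # keep extending the current match
            continue
        output.append((main[phrase], byte))
        if len(main) < MAX_ENTRIES:
            main[candidate] = len(main)             # normal dictionary growth
        else:
            if candidate not in aux and len(aux) < MAX_ENTRIES:
                aux[candidate] = len(aux)           # train the auxiliary table
            if len(aux) >= MAX_ENTRIES:
                main, aux = aux, {b"": 0}           # 'swap' pages and clear the auxiliary
        phrase = b""
    if phrase:
        output.append((main[phrase], None))         # flush a trailing match
    return output

codes = lz78_encode_with_paging(b"abracadabra " * 2000)
print(len(codes), "codes emitted")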
`Yet another adaptive (alphabetic distribution-based) compression scheme, the Move-To-
Front (MTF) method, was developed by Bentley et al. In MTF, the 'word code' for a
symbol is determined by the position of the word in a sequential list. The word list is ordered
so that frequently accessed words are near the front, thus shortening their encodings.
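A minimal sketch of MTF coding follows (Python, illustrative only; Bentley et al. describe a word-based scheme, whereas this sketch operates on bytes for brevity):

def mtf_encode(data: bytes):
    table = list(range(256))            # initial symbol ordering
    codes = []
    for byte in data:
        pos = table.index(byte)         # the emitted code is the symbol's current position
        codes.append(pos)
        table.pop(pos)
        table.insert(0, byte)           # move the symbol to the front
    return codes

def mtf_decode(codes):
    table = list(range(256))
    out = bytearray()
    for pos in codes:
        byte = table.pop(pos)
        out.append(byte)
        table.insert(0, byte)
    return bytes(out)

sample = b"aaaaabbbbbaaaaa"
assert mtf_decode(mtf_encode(sample)) == sample   # frequently used symbols get small codes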
Adaptive compression algorithms are not appropriate to use with heterogeneous files
because they are sensitive only to changes in the particular redundancy type with which
they are associated, such as a change in the alphabetic distribution. They do not exploit
changes across different redundancy types in the files. Therefore a so-called adaptive method
typically cannot directly handle drastic changes in file properties, such as an abrupt transition
from text to graphics. For example, adaptive Huffman compressors specially optimized for
text achieve disproportionately poor performance on certain image files, and vice versa. As
`the use of multimedia files increases, files exhibiting this sort of transition will become
`more prevalent.
`Our approach differs from adaptive compression because the system chooses each algo-
`rithm (as well as the duration of its applicability) before compression begins, rather than
`modifying the technique for each file during compression. In addition, while adaptive meth-
ods make modifications to their compression parameters on the basis of single bytes or fixed-length
strings of input, our heterogeneous compressor bases its compression upon statistics
gathered from larger blocks of five kilobytes. This allows us to handle much larger changes
in file redundancy types. This makes our system less sensitive to residual statistical fluctuations
from different parts of a file. We note that it is possible to use an adaptive algorithm
`as a primitive in the system.
`Another approach to handling heterogeneous files is the composition of compression
`algorithms. Composition can either be accomplished by running several algorithms in suc-
`cession or by combining the basic algorithms and heuristics to create a new technique. For
example, recent implementations of 'universal' compression programs execute the Lempel-Ziv
algorithm and dynamic Huffman coding in succession, thus improving performance
by combining the string repetition-based compression of Lempel-Ziv with the frequency-based
compression strategy of dynamic Huffman coding. One commercial implementation
is LHarc. Our system exploits the same savings since it uses the Freeze implementation
of the Lempel-Ziv algorithm, which filters Lempel-Ziv compressed output through a
Huffman coder. An example of a truly composite technique is the compression achieved
by using Shannon-Fano tries* in conjunction with the Fiala-Greene algorithm (a variant
of Lempel-Ziv) in the PKZIP commercial package. Tries are used to optimally encode
strings by character frequency. PKZIP was selected as the representative test program from
this group in our experiment due to its superior performance on industrial benchmarks.
Our approach generalizes the ideas of successively executing or combining different
compression algorithms by allowing any combination of basic algorithms within a file. This
includes switching among algorithms an arbitrary number of times within a file. The
algorithms themselves may be simple or composite and may be adaptive. All are treated as
atomic commands to be applied to portions of a file.
`
* A trie is a tree of variable degree > 2 such that (1) each edge is labelled with a character, and the depth of any node
represents one more than the number of characters required to identify it; (2) all internal nodes are intermediate and represent
prefixes of keys in the trie; (3) keys (strings) may be inserted as leaves using the minimum number of characters which
distinguish them uniquely. Thus a generic trie containing the strings computer and compare would have keys at a depth of
five which share a common prefix of length four.
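A minimal sketch of such a trie (Python, illustrative only; the node layout and method names are assumptions and are not taken from PKZIP or the system described here):

class TrieNode:
    def __init__(self):
        self.children = {}      # character -> child node (edges labelled with characters)
        self.is_key = False     # True if an inserted key ends at this node

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, key: str):
        node = self.root
        for ch in key:
            node = node.children.setdefault(ch, TrieNode())
        node.is_key = True

    def shared_prefix_depth(self, a: str, b: str) -> int:
        """Depth of the deepest node lying on the paths of both keys."""
        node, depth = self.root, 0
        while (depth < len(a) and depth < len(b)
               and a[depth] == b[depth] and a[depth] in node.children):
            node = node.children[a[depth]]
            depth += 1
        return depth

t = Trie()
t.insert("computer")
t.insert("compare")
assert t.shared_prefix_depth("computer", "compare") == 4   # they diverge at depth five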
`
`
`
`
`
The problem of heterogeneous files was addressed by Toal in a proposal for a naive
heterogeneous compression system similar to ours. In such a system, files would be segmented
into fixed-length encapsulated blocks; the optimal algorithm would be selected for
each block on the basis of a simple taxonomy (qualitative data type) only; and the blocks
would be independently compressed. Our system, however, performs more in-depth statistical
analysis in order to make a more informed selection from the database of algorithms.
This entails not only the determination of qualitative data properties but also the computation of
metrics for an entire block (as opposed to sporadic or random sampling from parts of each
block). Furthermore, normalization constants for selection parameters (i.e. the redundancy
metrics) are fitted to observed parameters for a test library. Finally, a straightforward but
crucial improvement to the naive encapsulated-block method is the implementation of a
multi-pass scheme. By determining the complete taxonomy (data type and dominant redundancy
type) in advance, any number of contiguous blocks which use the same compression
method will be treated as a single segment. Toal observed in preliminary experiments that
the overhead of changing compression schemes from one block to another dominated the
additional savings that resulted from selection of a superior compression method. This
overhead is attributable to the fact that blocks compressed independently (even if the same
`method is used) are essentially separate files and assume the same startup overhead of the
`compression algorithm used.* We have determined experimentally that merging contiguous
`blocks whenever possible obviates the large majority of changes in compression method.
`This eliminates a sufficient proportion of the overhead to make heterogeneous compression
`worthwhile.
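The block-merging optimization can be sketched as a run-coalescing pass over the per-block method map (Python, illustrative only; the method names and block size mirror the example discussed earlier but are otherwise assumptions):

from itertools import groupby

BLOCK_SIZE = 5 * 1024     # 5K blocks, as used by the system described here

def merge_blocks(method_map):
    """Collapse runs of identically-labelled blocks into (method, segment length) pairs."""
    segments = []
    for method, run in groupby(method_map):
        segments.append((method, sum(1 for _ in run) * BLOCK_SIZE))
    return segments

# e.g. a 45K file: 2 blocks of low-redundancy ASCII, 2 of English text, 5 of graphics
plan = ["arithmetic", "arithmetic",
        "lempel-ziv", "lempel-ziv",
        "arithmetic", "arithmetic", "arithmetic", "arithmetic", "arithmetic"]
print(merge_blocks(plan))
# [('arithmetic', 10240), ('lempel-ziv', 10240), ('arithmetic', 25600)]
# only three segments remain, so per-segment startup overhead is incurred three times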
`
`Commercial products
`
`One of the goals of this research was to develop a compression system which is gener-
`ally superior to commercially available systems. The four systems we studied are PKZIP,
developed for microcomputers running MS-DOS;2 Unix compress;3 and StuffIt Classic4
and Compact Pro,5 developed for the Apple Macintosh operating system. Each of these
products performs its compression in a single pass, with only one method selected per file.
Thus, the possibility of heterogeneous files is ignored.
Unix compress uses an adaptive version of the Lempel-Ziv algorithm. It operates by
substituting a fixed-length code for common substrings. compress, like other adaptive
textual substitution algorithms, periodically tests its own performance and reinitializes its
string table if the amount of compression has decreased.
StuffIt makes use of two sets of algorithms: it first detects special-type files such as
image files and processes them with algorithms suited for high-resolution color data; for the
remaining files, it queries the operating system for the explicit file type given when the file
was created, and uses this information to choose either the LZW variant of Lempel-Ziv,
dynamic Huffman coding, or run-length encoding. This is a much more limited selection
process than what we have implemented. Additionally, no selection of compression methods
is attempted within a file. Compact Pro uses the same methodology as StuffIt and compress,
but incorporates an improved Lempel-Ziv derived directly from LZ77. The public-domain
version of StuffIt is derived from Unix compress, as is evident from the similarity of their
performance results.
`
* For purposes of comparison, the block sizes tested by Toal were nearly identical to those used in our system (ranging
upwards from 4K).
`
`
`
`
`
Compression systems such as StuffIt perform simple selection among alternative compression
algorithms. The important problem is that they are underequipped for the task of
fitting a specific technique to each file (even when the uncompressed data is homogeneous).
StuffIt uses few heuristics, since its algorithms are intended to be 'multipurpose'. Furthermore,
only the file type is considered in selecting the algorithm — that is, no measures of
redundancy are computed. Earlier versions of StuffIt (which were extremely similar to Unix
compress) used composite alphabetic and textual compression, but made no selections on
the basis of data characteristics. The chief improvements of our heterogeneous compressor
over this approach are that it uses a two-dimensional lookup table, indexed by file properties
and quantitative redundancy metrics, and — more important — that it treats the file as a
collection of heterogeneous data sets.
`
`THE HETEROGENEOUS COMPRESSOR
`
`Our heterogeneous compressor treats a file as a collection of fixed size blocks (5K in
`the current implementation), each containing a potentially different type of data and thus
`best compressed using different algorithms. The actual compression is accomplished in
`two phases. In the first phase, the system determines the type and compressibility of each
`block. The compressibility of each block of data is determined by the values of three
`quantitative metrics representing the alphabetic distribution, the average run length and the
`string repetition ratio in the file. If these metrics are all below a certain threshold, then the
`block is considered fully compressed (uncompressible) and the program continues on to the
`next block. Otherwise, using the block type and largest metric, the appropriate compression
`algorithm (and possible heuristic) are chosen from the compression algorithm database. The
`compression method for the current block is then recorded in a small array-based map of
`the file, and the system continues.
`The second phase comprises the actual compression and an optimization that maximizes
the size of a segment of data to be compressed using a particular algorithm. In this optimization,
which is interleaved with the actual compression, adjacent blocks for which exactly
the same method has been chosen are merged into a single block. This merge technique
maximizes the length of segments requiring a single compression method by greedily scanning
ahead until a change of method is detected. Scanning is performed using the array
map of the file generated when compression methods were selected from the database. A
`compression history, needed for decompression, is automatically generated as part of this
`phase.
`The newly compressed segments are written to a buffer by the algorithm, which stores
`the output data with the compression history. The system then writes out the compressed
`file and exits with a signal to the operating system that compression was successful.
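The two-phase control flow just described can be sketched as follows (Python, for illustration only: the block classification, metric stand-ins, threshold, lookup table entries, and the use of zlib as a stand-in for the distinct coders are all assumptions for the example, not the authors' actual values or code):

import zlib

BLOCK_SIZE = 5 * 1024                  # 5K blocks, as in the current implementation
THRESHOLD = 0.05                       # illustrative 'uncompressible' cut-off

# Stand-in compressors; the real system dispatches to arithmetic, Lempel-Ziv
# ('Freeze') and run-length coders rather than a single library routine.
COMPRESSORS = {
    'store':      lambda segment: segment,
    'lempel-ziv': zlib.compress,
    'arithmetic': zlib.compress,
}

# Illustrative 2-D lookup table: (block type, dominant redundancy) -> method.
ALGORITHM_TABLE = {
    ('text',   'char_freq'):         'arithmetic',
    ('text',   'string_repetition'): 'lempel-ziv',
    ('binary', 'char_freq'):         'arithmetic',
    ('binary', 'string_repetition'): 'lempel-ziv',
}

def classify_block(block):
    """Crude stand-in for the qualitative block-type analysis."""
    printable = sum(32 <= b < 127 or b in (9, 10, 13) for b in block)
    return 'text' if printable > 0.9 * len(block) else 'binary'

def measure_block(block):
    """Crude stand-ins for the redundancy metrics."""
    freqs = [block.count(c) / len(block) for c in set(block)]
    repeats = sum(block[i] == block[i - 1] for i in range(1, len(block)))
    return {'char_freq': max(freqs) - min(freqs),
            'string_repetition': repeats / len(block)}

def compress_file(data):
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

    # Phase 1: choose a method for every block and record it in an array map.
    method_map = []
    for block in blocks:
        metrics = measure_block(block)
        if max(metrics.values()) < THRESHOLD:
            method_map.append('store')                  # treated as uncompressible
        else:
            dominant = max(metrics, key=metrics.get)
            method_map.append(ALGORITHM_TABLE[(classify_block(block), dominant)])

    # Phase 2: merge runs of blocks sharing a method, then compress each segment,
    # recording the compression history needed for decompression.
    history = []
    i = 0
    while i < len(blocks):
        j = i
        while j + 1 < len(blocks) and method_map[j + 1] == method_map[i]:
            j += 1                                      # greedy scan-ahead
        segment = b''.join(blocks[i:j + 1])
        history.append((method_map[i], COMPRESSORS[method_map[i]](segment)))
        i = j + 1
    return history

sample = (b'the quick brown fox ' * 600) + bytes(range(256)) * 40
for method, payload in compress_file(sample):
    print(method, len(payload))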
From this two-pass scheme it is straightforward to see why this system is less susceptible
`than traditional adaptive systems to biases accrued when the data type changes abruptly
`during compression. Adaptive compressors perform all operations myopically, sacrificing
`the ability to see ahead in the file or data stream to detect future fluctuations in the type
`of data. As a result, adaptive compressors retain the statistical vestiges of the old method
`until these are ‘flushed out’ by new data (or balanced out, depending upon the process for
paging and aging internal data structures such as dictionaries). Thus adaptive compressors
may continue to suffer the effects of bias, achieving suboptimal compression. On the other
hand, by abruptly changing compression algorithms, our technique completely discards all
remnants of the 'previous' method (i.e. the algorithm used on the preceding segment). This
`
`
`
`
`
`allows us to immediately capitalize on changes in data. In addition, merging contiguous
blocks of the same data type has the advantage of incurring all the overhead at once
for switching to what will be the best compression method for an entire variable-length
segment. The primary advantage of adaptive compression techniques over our technique is
that the adaptive compression algorithms are 'online' (single-pass). This property increases
compression speed and, more important, gives the ability to compress a data stream (for
instance, incoming data packets in a network or modem transmission) in addition to files
in secondary storage or variable-length buffers.
`The remainder of this section presents the system. We begin with a description of the
`calculation of the block types and the redundancy metrics. We also explain the use of the
`metrics as absolute indicators of compressibility, and then describe the compression algo-
`rithms used and the structure of the database of algorithms. A discussion of implementation
`details concludes the section.
`
`Property analysis
`
The compressibility of a block of data and the appropriate algorithm for compressing it are determined
by the type of data contained in a block and the type of redundancy (if any) in the
data. These two properties are represented by four parameters: the block type, and the three
redundancy metrics. The block type describes the data in the block — text, binary, graphical,
etc. The three redundancy metrics are the degree of variation in character frequency, average
run length in the file, and the string repetition ratio of the file. They provide a quantitative
measure of how compressible the block is and which type of redundancy is most evident
in the block. The use of both quantitative redundancy measures (redundancy metrics) and
qualitative characteristics (block types) as indicators for data compressibility is advocated
by Held and Salton. We have refined the process for computing those attributes referred
to as datanalysis results by Held and as statistical language characteristics by Salton to
obtain an actual guide for compression. The remainder of this section describes how these
four parameters are determined for each block.
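The three measurements can be sketched as follows (Python, illustrative only; the exact formulas and normalizations used by the system are not reproduced here, so these definitions, names and the window width are assumptions):

from collections import Counter

def char_frequency_variation(block: bytes) -> float:
    """Spread of the byte-frequency distribution (0 for perfectly uniform data)."""
    freqs = [n / len(block) for n in Counter(block).values()]
    mean = sum(freqs) / len(freqs)
    return sum((f - mean) ** 2 for f in freqs) / len(freqs)   # variance of the frequencies

def average_run_length(block: bytes) -> float:
    """Mean length of runs of identical bytes."""
    runs = 1
    for i in range(1, len(block)):
        if block[i] != block[i - 1]:
            runs += 1
    return len(block) / runs

def string_repetition_ratio(block: bytes, width: int = 4) -> float:
    """Fraction of positions starting a short string already seen earlier in the block."""
    seen, repeated = set(), 0
    for i in range(len(block) - width + 1):
        s = block[i:i + width]
        if s in seen:
            repeated += 1
        else:
            seen.add(s)
    return repeated / max(1, len(block) - width + 1)

block = b"the quick brown fox jumps over the lazy dog " * 100
print(char_frequency_variation(block), average_run_length(block), string_repetition_ratio(block))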
`
`Block types
`
The block type describes the nature of a segment of input data. There are ten classifications
of data in this system: ANSI text, non-natural language text (hexadecimal encodings of
binary data), natural language text, computer source code, low redundancy binary, digitized
`audio, low resolution graphics, high-resolution graphics, high-redundancy binary executable,
`and binary object data. ANSI text is composed of characters from a superset of the ASCII
`alphabet. Non-natural language text contains primarily ASCII text but does not follow a
`distribution of characters like that of human languages. Examples are computer typesetting
`data, uuencoded and BinHex encoded