`
Attachment 1h: Copy of Hsu from the
Illinois Institute of Technology Library

SOFTWARE

25

1995
`Veritas Techs. LLC
`Exhibit 1026
`Page 129
`
`
`
`
`
SOFTWARE—PRACTICE & EXPERIENCE

No. 10

OCTOBER 1995
`
`EDITORS
`
`DOUGLAS COMER
`
`ANDY WELLINGS
`
Software: Practice & Experience
Received On: 11-03-95
Illinois Institute of Technology Library
`
`WILEY
`
`Publishers Since 1807
`Chichester - New York - Brisbane - Toronto - Singapore
`A Wiley-Interscience Publication
`SPEXBL 25(10) 1065-1182 (1995)
`ISSN 0038-0644
`
`
`
`
`
`
`
`
SOFTWARE—PRACTICE & EXPERIENCE
`Editors
`Professor D. E. Comer, Computer Science Department, Purdue University, West
`Lafayette, IN 47907, U.S.A.
Charlotte I. Tubis, U.S. Editorial Assistant, Computer Science Department, Purdue University, West Lafayette,
`IN 47907, U.S.A.
`Dr A. J. Wellings, Department of Computer Science, University of York,
`Heslington, York YO1 5DD
`
`Advisory Editorial Board
`Professor D. W. BARRON
Department of Electronics and Computer Science,
University of Southampton,
Southampton SO9 5NH, U.K.
`Professor P. J. BROWN
`Computing Laboratory, The University,
`Canterbury, Kent CT2 7NF, U.K.
`Professor J. A. CAMPBELL
`Department of Computer Science, University College London,
Gower Street, London WC1E 6BT, U.K.
`Professor F. J. CORBATO
`Electrical Engineering Department,
`Massachusetts Institute of Technology,
`545 Technology Square,
`Cambridge, Massachusetts 02139, U.S.A.
`Dr. Christopher W. FRASER
`AT&T Bell Laboratories, 600 Mountain Ave 2C-464,
`Murray Hill, NJ 07974-0636, U.S.A.
`Professor PER BRINCH HANSEN
`School of Computer and Information Science,
`4-116 CST, Syracuse University,
`Syracuse, New York 13210, U.S.A.
`Professor D. R. HANSON
`Department of Computer Science,
`Princeton University, Princeton,
`New Jersey 08544, U.S.A.
`Professor J. KATZENELSON
`Faculty of Electrical Engineering,
Technion-Israel Institute of Technology,
`Haifa, Israel
Dr. B. W. KERNIGHAN
AT&T Bell Laboratories, 600 Mountain Avenue,
Murray Hill, New Jersey 07974, U.S.A.
`
`Professor D. E. KNUTH
`Department of Computer Science, Stanford University,
Stanford, California 94305, U.S.A.
`
`Dr. B. W. LAMPSON
`180 Lake View Ave,
`Cambridge,
`MA 02138, U.S.A.
`
`Dr. C. A. LANG
`Three-Space Ltd,
`70 Castle Street,
`Cambridge CB3 OAJ, U.K.
`Professor B. RANDELL
`Computing Laboratory,
`University of Newcastle-upon-Tyne,
`Claremont Tower, Claremont Road,
`Newcastle-upon-Tyne NE1 7RU, U.K.
`
`Professor J. S. ROHL
`Department of Computer Science,
`The University of Western Australia,
`Nedlands, Western Australia 6009.
`
`D. T. ROSS
`Softech Inc., 460 Totten Pond Road,
`Waltham, Massachusetts 02154, U.S.A.
`
`B. H. SHEARING
`The Software Factory,
`28 Padbrook, Limpsfield, Oxted,
Surrey RH8 0DW, U.K.
`
`Professor N. WIRTH
Institut für Computersysteme, ETH-Zentrum,
`CH-8092 Zurich, Switzerland.
`
`Aims and Scope
Software—Practice and Experience is an internationally respected and rigorously refereed vehicle for the dissemination and
discussion of practical experience with new and established software for both systems and applications. Contributions regularly:
(a) describe detailed accounts of completed software-system projects which can serve as 'how-to-do-it' models for future
work in the same field; (b) present short reports on programming techniques that can be used in a wide variety of areas; (c)
document new techniques and tools that aid in solving software construction problems; and (d) explain methods/techniques
that cope with the special demands of large scale software projects. The journal also features timely Short Communications
on rapidly developing new topics.
The editors actively encourage papers which result from practical experience with tools and methods developed and used
in both academic and industrial environments. The aim is to encourage practitioners to share their experiences with design,
implementation and evaluation of techniques and tools for software and software systems.
Papers cover software design and implementation, case studies describing the evolution of systems and the thinking behind
them, and critical appraisals of software systems. The journal has always welcomed tutorial articles describing well-tried techniques
not previously documented in computing literature. The emphasis is on practical experience; articles with theoretical
or mathematical content are included only in cases where an understanding of the theory will lead to better practical systems.
Articles range in length from a Short Communication (half to two pages) to the length required to give full treatment to a
substantial piece of software (40 or more pages).
Advertising: For details contact
Michael J. Levermore, Advertisement Sales, John Wiley & Sons Ltd, Baffins Lane, Chichester, Sussex PO19 1UD, England (Telephone 01243
770351, Fax 01243 775878, Telex 86290)
Software—Practice and Experience (ISSN 0038-0644/USPS 890-920) is published monthly by John Wiley & Sons Limited, Baffins Lane, Chichester,
Sussex, England. Second class postage paid at Jamaica, N.Y. 11431. Air freight and mailing in the U.S.A. by Publications Expediting Services Inc.,
200 Meacham Avenue, Elmont, N.Y. 11003. © 1995 by John Wiley & Sons Ltd. Printed and bound in Great Britain by Page Bros, Norwich. Printed
on acid-free paper.
To subscribe: Orders should be addressed to Subscriptions Department, John Wiley & Sons Limited, Baffins Lane, Chichester, Sussex, PO19 1UD,
England. 1995 subscription price (13 issues): U.S. $825.00.
U.S.A. POSTMASTER: Send address changes to Software—Practice and Experience, c/o Publications Expediting Services Inc., 200 Meacham
Avenue, Elmont, N.Y. 11003, U.S.A.
`
`
`
`
`
`
(Softw. pract. exp.)

SOFTWARE—PRACTICE & EXPERIENCE

VOLUME 25, ISSUE No. 10

October 1995
`
`CONTENTS
`
Migration in Object-oriented Database Systems—A Practical Approach:
C. Huemer, G. Kappel and S. Vieweg .......................................................... 1065

Automatic Synthesis of Compression Techniques for Heterogeneous
Files: W. H. Hsu and A. E. Zwarico ............................................................. 1097

A Tool for Visualizing the Execution of Interactions on a Loosely-coupled
Distributed System: P. Ashton and J. Penny ................................................ 1117

Process Scheduling and UNIX Semaphores: N. Dunstan and I. Fris .......... 1141

Software Maintenance: An Approach to Impact Analysis of Objects
Change: S. Ajila ............................................................................................ 1155
`
`SPEXBL 25(10) 1065-1182 (1995)
`ISSN 0038-0644
`
Indexed or abstracted by Cambridge Scientific Abstracts, CompuMath Citation Index (ISI),
Compuscience Database, Computer Contents, Computer Literature Index, Computing
Reviews, Current Contents/Eng, Tech & Applied Sciences, Data Processing Digest, Deadline
Newsletter, Educational Technology Abstracts, Engineering Index, Engineering Societies
Library, IBZ (International Bibliography of Periodical Literature), Information Science Abstracts
(Plenum), INSPEC, Knowledge Engineering Review, Nat Centre for Software Technology,
Research Alert (ISI) and SCISEARCH Database (ISI).
`
`
`
`
`
`
SOFTWARE—PRACTICE AND EXPERIENCE, VOL. 25(10), 1097-1116 (OCTOBER 1995)
`
`Automatic Synthesis of Compression Techniques for
`Heterogeneous Files
`
WILLIAM H. HSU
Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, U.S.A.
(email: bhsu@cs.uiuc.edu, voice: (217) 244-1620)

AND

AMY E. ZWARICO
Department of Computer Science, The Johns Hopkins University, Baltimore, MD 21218, U.S.A.
(email: amy@cs.jhu.edu, voice: (410) 516-5304)
`
`SUMMARY
`
We present a compression technique for heterogeneous files, those files which contain multiple types of
data such as text, images, binary, audio, or animation. The system uses statistical methods to determine
the best algorithm to use in compressing each block of data in a file (possibly a different algorithm for
each block). The file is then compressed by applying the appropriate algorithm to each block. We obtain
better savings than possible by using a single algorithm for compressing the file. The implementation
of a working version of this heterogeneous compressor is described, along with examples of its value
toward improving compression both in theoretical and applied contexts. We compare our results with
those obtained using four commercially available compression programs, PKZIP, Unix compress, StuffIt,
and Compact Pro, and show that our system provides better space savings.
`
`4"
`
`i
`
`Veritas Techs. LLC
`Exhibit 1026
`Page 133
`
`KEY WORDS:
`
adaptive/selective data compression algorithms; redundancy metrics; heterogeneous files; program synthesis
`
`INTRODUCTION
`
The primary motivation in studying compression is the savings in space that it provides.
Many compression algorithms have been implemented, and with the advent of new hardware
standards, more techniques are under development. Historically, research in data compression
has been devoted to the development of algorithms that exploit various types of
redundancy found in a file. The shortcoming of such algorithms is that they assume, often
inaccurately, that files are homogeneous throughout. Consequently, each exploits only a
subset of the redundancy found in the file.
Unfortunately, no algorithm is effective in compressing all files.1 For example, dynamic
Huffman coding works best on data files with a high variance in the frequency of individual
characters (including some graphics and audio data), achieves mediocre performance on
natural language text files, and performs poorly in general on high-redundancy binary data.
On the other hand, run length encoding works well on high-redundancy binary data, but
performs very poorly on text files. Textual substitution works best when multiple-character
strings tend to be repeated, as in English text, but this performance degrades as the average
CCC 0038-0644/95/101097-20
`©1995 by John Wiley & Sons, Ltd.
`
`Received 20 April 1994
`Revised 5 February 1995
`
`
`
`
`
W. H. HSU AND A. E. ZWARICO    1098

length of these strings decreases. These relative strengths and weaknesses become critical
when attempting to compress heterogeneous files. Heterogeneous files are those which contain
multiple types of data such as text, images, binary, audio, or animation. Consequently,
their constituent parts may have different degrees of compressibility. Because most compression
algorithms are either tailored to a few specific classes of data or are designed to
handle a single type of data at a time, they are not suited to the compression of heterogeneous
files. In attempting to apply a single method to such files, they forfeit the possibility
of greater savings achievable by compressing various segments of the file with different
methods.
To overcome this inherent weakness found in compression algorithms, we have developed
a heterogeneous compressor that automatically chooses the best compression algorithm to
use on a given variable-length block of a file, based on both the qualitative and quantitative
properties of that segment. The compressor determines and then applies the selected
algorithms to the blocks separately. Assembling compression procedures to create a specifically
tailored program for each file gives improved performance over using one program
for all files. This system produces better compression results than four commonly available
compression packages, PKZIP,2 Unix compress,3 StuffIt,4 and Compact Pro,5 for arbitrary
heterogeneous files.
The major contributions of this work are twofold. The first is an improved compression
system for heterogeneous files. The second is the development of a method of statistical
analysis of the compressibility of a file (its redundancy types). Although the concept
of redundancy types is not new,6,7 synthesis of compression techniques using redundancy
measurements is largely unprecedented. The approach presented in this paper uses a straightforward
program synthesis technique: a compression plan, consisting of instructions for each
block of input data, is generated, guided by the statistical properties of the input data. Because
of its use of algorithms specifically suited to the types of redundancy exhibited by
the particular input file, the system achieves consistent average performance throughout the
file, as shown by experimental evidence.
As an example of the type of savings our system produces, consider compressing a
heterogeneous file (such as a small multimedia data file) consisting of 10K of low-redundancy
(non-natural language) ASCII data, 10K of English text, and 25K of graphics. In
this case, a reasonably sophisticated compression program might recognize the increased
savings achievable by employing Huffman compression, to better take advantage of the fact
that the majority of the data is graphical. However, none of the general-purpose compression
methods under consideration are optimal when used alone on this file. This is because
the text part of this file is best compressed by textual substitution methods (e.g., Lempel-Ziv)
rather than statistical methods, while the low-redundancy data* and graphics parts
are best compressed by alphabetic distribution-based methods (e.g., arithmetic or dynamic
Huffman coding) rather than Lempel-Ziv or run-length encoding. This particular file totals
45K in length before compression. A compressor using pure dynamic Huffman coding only
achieves about 7 per cent savings for a compressed file of length 42.2K. One of the best
general-purpose Lempel-Ziv compressors currently available achieves 18 per cent savings,
producing a compressed file of length 37.4K. Our system uses arithmetic coding on
the first and last segments and Lempel-Ziv compression on the text segment in the middle,
achieving a 22 per cent savings and producing a compressed file of length 35.6K. This is
a 4 per cent improvement over the best commercial system.

* This denotes, in our system, a file with a low rate of repeated strings.
`
`
`
`
`
`
AUTOMATIC SYNTHESIS OF COMPRESSION TECHNIQUES FOR HETEROGENEOUS FILES    1099
`
The purpose of our experiments was to verify the conjecture that a selective system
for combining methods can improve savings on a significant range of heterogeneous files,
especially multimedia data. This combination differs from current adaptive methods in
that it switches among compression paradigms designed to remove very different types
of redundancy. By contrast, existing adaptive compression programs are sensitive only to
changes in particular types of redundancy, such as run-length, which do not require changing
the underlying algorithm used in compression. Thus they cannot adapt to changes in the
type of redundancy present, such as from high run-length to high character repetition. The
superiority of our approach is demonstrated in our experimental section.
This paper begins with a presentation of existing approaches to data compression, including
a discussion of hybrid and adaptive compression algorithms and a description of four
popular commercial compression packages. These are followed by documentation on the
design of the heterogeneous compression system, analysis of experimental results obtained
from test runs of the completed system, and comparison of the system's performance against
that of commercial systems. Finally, implications of the results and possibilities for future
work are presented.

RELATED WORK

It is a fundamental result of information theory that there is no single algorithm that performs
optimally in compressing all files.1 However, much work has been done to develop
algorithms and techniques that work nearly optimally on all classes of files. In this section
we discuss adaptive algorithms, composite algorithms, and four popular commercial
compression packages.

Adaptive compression algorithms and composite techniques

Exploiting the heterogeneity in a file has been addressed in two ways: the development
of adaptive compression algorithms, and the composition of various algorithms. Adaptive
compression algorithms attune themselves gradually to changes in the redundancies within a
file by modifying parameters used by the algorithm, such as the dictionary, during execution.
For example, adaptive alphabetic distribution-based algorithms such as dynamic Huffman
coding10 maintain a tree structure to minimize the encoded length of the most frequently
occurring characters. This property can be made to change continuously as a file is processed.
An example of an adaptive textual substitution algorithm is Lempel-Ziv compression,
a title which refers to two distinct variants of a basic textual substitution scheme. The
first variant, known as LZ77 or the sliding dictionary or sliding window method, selects
positional references from a constant range of preceding strings.1,11 These 'back-pointers'
literally encode position and length of a repeated string. The second variant, known as
LZ78 or the dynamic dictionary method, uses a dictionary structure with a paging heuristic.
When the dictionary (a table of strings from the file) is completely filled, the contents
are not discarded. Instead, an auxiliary dictionary is created and updated while compression
continues using the main dictionary. Each time this auxiliary table is filled, its contents are
'swapped' into the main dictionary and it is cleared. The maintenance of dictionaries for
textual substitution is analogous to the semi-space method of garbage collection, in which
two pages are used but only one is 'active'; these are exchanged when one fills beyond
a preset threshold. Another adaptive variant of this algorithm is the Lempel-Ziv-Welch
`
`
`
`
`
`
(LZW) algorithm, a descendant of LZ78 used in Unix compress. Both LZW and LZ78
vary the length of strings used in compression.
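As an illustration of the dictionary-growing behaviour described above, the following is a minimal LZW-style encoder in Python. It is a sketch for exposition only, not the implementation used in Unix compress (which additionally packs codes into variable-width bit strings and bounds the table size):

```python
def lzw_encode(data: bytes) -> list[int]:
    """Minimal LZW sketch: emit the code for the longest previously seen
    string, then add that string extended by one byte to the dictionary."""
    # Dictionary is seeded with all single-byte strings (codes 0-255).
    table = {bytes([i]): i for i in range(256)}
    next_code = 256
    codes = []
    current = b""
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in table:
            current = candidate           # keep extending the match
        else:
            codes.append(table[current])  # emit code for longest match
            table[candidate] = next_code  # grow the dictionary
            next_code += 1
            current = bytes([byte])
    if current:
        codes.append(table[current])
    return codes
```

On repetitive input such as `b"ababab"` the emitted codes quickly refer to multi-byte dictionary entries (here `[97, 98, 256, 256]`), which is where the length of coded strings grows as the text notes.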
Yet another adaptive (alphabetic distribution-based) compression scheme, the Move-To-Front
(MTF) method, was developed by Bentley et al.3 In MTF, the 'word code' for a
symbol is determined by the position of the word in a sequential list. The word list is ordered
so that frequently accessed words are near the front, thus shortening their encodings.
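The MTF scheme lends itself to a very short sketch. The byte-level encoder below is our illustration of the idea, not Bentley et al.'s implementation: each symbol is coded as its current position in a list and then moved to the front, so recently used symbols receive small codes.

```python
def mtf_encode(data: bytes) -> list[int]:
    """Move-To-Front sketch over single bytes: emit each byte's current
    position in the list, then move that byte to the front."""
    order = list(range(256))      # initial symbol list: 0..255 in order
    out = []
    for byte in data:
        pos = order.index(byte)   # position in the list is the code
        out.append(pos)
        order.pop(pos)            # move the symbol to the front
        order.insert(0, byte)
    return out
```

A run of identical bytes such as `b"aaa"` encodes as `[97, 0, 0]`: after the first occurrence the symbol sits at the front and costs the smallest code.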
Adaptive compression algorithms are not appropriate to use with heterogeneous files
because they are sensitive only to changes in the particular redundancy type with which
they are associated, such as a change in the alphabetic distribution. They do not exploit
changes across different redundancy types in the files. Therefore a so-called adaptive method
typically cannot directly handle drastic changes in file properties, such as an abrupt transition
from text to graphics. For example, adaptive Huffman compressors specially optimized for
text achieve disproportionately poor performance on certain image files, and vice versa. As
the use of multimedia files increases, files exhibiting this sort of transition will become
more prevalent.
Our approach differs from adaptive compression because the system chooses each algorithm
(as well as the duration of its applicability) before compression begins, rather than
modifying the technique for each file during compression. In addition, while adaptive methods
make modifications to their compression parameters on the basis of single bytes or fixed
length strings of input, our heterogeneous compressor bases its compression upon statistics
gathered from larger blocks of five kilobytes. This allows us to handle much larger changes
in file redundancy types. This makes our system less sensitive to residual statistical fluctuations
from different parts of a file. We note that it is possible to use an adaptive algorithm
as a primitive in the system.
Another approach to handling heterogeneous files is the composition of compression
algorithms. Composition can either be accomplished by running several algorithms in succession
or by combining the basic algorithms and heuristics to create a new technique. For
example, recent implementations of 'universal' compression programs execute the Lempel-Ziv
algorithm and dynamic Huffman coding in succession, thus improving performance
by combining the string repetition-based compression of Lempel-Ziv with the frequency-based
compression strategy of dynamic Huffman coding. One commercial implementation
is LHarc.14,15 Our system exploits the same savings since it uses the Freeze implementation
of the Lempel-Ziv algorithm, which filters Lempel-Ziv compressed output through a
Huffman coder. An example of a truly composite technique is the compression achieved
by using Shannon-Fano tries* in conjunction with the Fiala-Greene algorithm (a variant
of Lempel-Ziv)16 in the PKZIP2 commercial package. Tries are used to optimally encode
strings by character frequency.17 PKZIP was selected as the representative test program from
this group in our experiment due to its superior performance on industrial benchmarks.
* A trie is a tree of variable degree ≥ 2 such that (1) each edge is labelled with a character, and the depth of any node
represents one more than the number of characters required to identify it; (2) all internal nodes are intermediate and represent
prefixes of keys in the trie; (3) keys (strings) may be inserted as leaves using the minimum number of characters which
distinguish them uniquely. Thus a generic trie containing the strings computer and compare would have keys at a depth of
five which share a common prefix of length four.
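The footnote's arithmetic can be checked with a few lines of Python. `key_depth` is a hypothetical helper of ours implementing the 'minimum distinguishing prefix' rule from the footnote, not code from PKZIP:

```python
def key_depth(key: str, others: list[str]) -> int:
    """Depth at which `key` would be stored as a leaf in the footnote's
    trie: the smallest number of leading characters of `key` that no
    other key also begins with."""
    for depth in range(1, len(key) + 1):
        prefix = key[:depth]
        if not any(other.startswith(prefix) for other in others):
            return depth
    return len(key)  # key is a prefix of another key; full length needed
```

For the footnote's example, `"computer"` and `"compare"` share the four-character prefix `"comp"`, so each key is distinguished only at its fifth character and sits at depth five.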
`
`
`
`
`
The problem of heterogeneous files was addressed by Toal18 in a proposal for a naive
heterogeneous compression system similar to ours. In such a system, files would be segmented
into fixed-length encapsulated blocks; the optimal algorithm would be selected for
each block on the basis of their simple taxonomy (qualitative data type) only; and the blocks
would be independently compressed. Our system, however, performs more in-depth statistical
analysis in order to make a more informed selection from the database of algorithms.
This entails not only the determination of qualitative data properties but the computation of
metrics for an entire block (as opposed to sporadic or random sampling from parts of each
block). Furthermore, normalization constants for selection parameters (i.e. the redundancy
metrics) are fitted to observed parameters for a test library. Finally, a straightforward but
crucial improvement to the naive encapsulated-block method is the implementation of a
multi-pass scheme. By determining the complete taxonomy (data type and dominant redundancy
type) in advance, any number of contiguous blocks which use the same compression
method will be treated as a single segment. Toal observed in preliminary experiments that
the overhead of changing compression schemes from one block to another dominated the
additional savings that resulted from selection of a superior compression method.18 This
overhead is attributable to the fact that blocks compressed independently (even if the same
method is used) are essentially separate files and assume the same startup overhead of the
compression algorithm used.* We have determined experimentally that merging contiguous
blocks whenever possible obviates the large majority of changes in compression method.
This eliminates a sufficient proportion of the overhead to make heterogeneous compression
worthwhile.
`
`
Commercial products

One of the goals of this research was to develop a compression system which is generally
superior to commercially available systems. The four systems we studied are PKZIP,
developed for microcomputers running MS-DOS;2 Unix compress;3 and StuffIt Classic4
and Compact Pro,5 developed for the Apple Macintosh operating system. Each of these
products performs its compression in a single pass, with only one method selected per file.
Thus, the possibility of heterogeneous files is ignored.
Unix compress uses an adaptive version of the Lempel-Ziv algorithm. It operates by
substituting a fixed-length code for common substrings. compress, like other adaptive
textual substitution algorithms, periodically tests its own performance and reinitializes its
string table if the amount of compression has decreased.
StuffIt makes use of two sets of algorithms: it first detects special-type files such as
image files and processes them with algorithms suited for high-resolution color data; for the
remaining files, it queries the operating system for the explicit file type given when the file
was created, and uses this information to choose either the LZW variant of Lempel-Ziv,
dynamic Huffman coding, or run-length encoding. This is a much more limited selection
process than what we have implemented. Additionally, no selection of compression methods
is attempted within a file. Compact Pro uses the same methodology as StuffIt and compress,
but incorporates an improved Lempel-Ziv derived directly from LZ77.11 The public-domain
version of StuffIt is derived from Unix compress, as is evident from the similarity of their
performance results.
* For purposes of comparison, the block sizes tested by Toal were nearly identical to those used in our system (ranging
upwards from 4K).
`
`
`
`
Compression systems such as StuffIt perform simple selection among alternative compression
algorithms. The important problem is that they are underequipped for the task of
fitting a specific technique to each file (even when the uncompressed data is homogeneous).
StuffIt uses few heuristics, since its algorithms are intended to be 'multipurpose'. Furthermore,
only the file type is considered in selecting the algorithm; that is, no measures of
redundancy are computed. Earlier versions of StuffIt (which were extremely similar to Unix
compress) used composite alphabetic and textual compression, but made no selections on
the basis of data characteristics. The chief improvements of our heterogeneous compressor
over this approach are that it uses a two-dimensional lookup table, indexed by file properties
and quantitative redundancy metrics, and, more important, that it treats the file as a
collection of heterogeneous data sets.
`
`THE HETEROGENEOUS COMPRESSOR
`Our heterogeneous compressor treats a file as a collection of fixed size blocks (5K in
`the current implementation), each containing a potentially different type of data and thus
`best compressed using different algorithms. The actual compression is accomplished in
`two phases. In the first phase, the system determines the type and compressibility of each
`block. The compressibility of each block of data is determined by the values of three
`quantitative metrics representing the alphabetic distribution, the average run length and the
`string repetition ratio in the file. If these metrics are all below a certain threshold, then the
`block is considered fully compressed (uncompressible) and the program continues on to the
`next block. Otherwise, using the block type and largest metric, the appropriate compression
`algorithm (and possible heuristic) are chosen from the compression algorithm database. The
`compression method for the current block is then recorded in a small array-based map of
`the file, and the system continues.
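The first phase's metric gathering can be sketched in Python. The paper has not yet given exact formulas at this point, so the three functions below are hypothetical stand-ins of ours with the stated intent (variation in character frequency, average run length, string repetition ratio), operating on one 5K block at a time:

```python
from collections import Counter

BLOCK_SIZE = 5 * 1024  # 5K blocks, as in the implementation described above

def char_freq_variation(block: bytes) -> float:
    """Hypothetical proxy for alphabetic skew: ratio of the most frequent
    byte's count to the count expected under a uniform byte distribution
    (1.0 would mean a perfectly flat distribution over all 256 values)."""
    counts = Counter(block)
    expected = len(block) / 256
    return max(counts.values()) / expected

def average_run_length(block: bytes) -> float:
    """Mean length of maximal runs of identical bytes."""
    runs = 1
    for prev, cur in zip(block, block[1:]):
        if cur != prev:
            runs += 1
    return len(block) / runs

def string_repetition_ratio(block: bytes, width: int = 4) -> float:
    """Hypothetical proxy for string repetition: fraction of fixed-width
    substrings that duplicate an earlier substring (0.0 = all distinct)."""
    windows = [block[i:i + width] for i in range(len(block) - width + 1)]
    return 1 - len(set(windows)) / len(windows)
```

In the scheme described above, a block whose three metrics all fall below threshold would be marked uncompressible, while the largest metric otherwise indicates which redundancy type (and hence which algorithm family) dominates.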
The second phase comprises the actual compression and an optimization that maximizes
the size of a segment of data to be compressed using a particular algorithm. In this optimization,
which is interleaved with the actual compression, adjacent blocks for which exactly
the same method has been chosen are merged into a single block. This merge technique
maximizes the length of segments requiring a single compression method by greedily scanning
ahead until a change of method is detected. Scanning is performed using the array
map of the file generated when compression methods were selected from the database. A
compression history, needed for decompression, is automatically generated as part of this
phase. The newly compressed segments are written to a buffer by the algorithm, which stores
the output data with the compression history. The system then writes out the compressed
file and exits with a signal to the operating system that compression was successful.
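The per-block selection and the merge optimization described above can be sketched as follows. `choose` and `plan` are hypothetical names of ours standing in for the database lookup and the plan builder; the returned segment list plays the role of the compression history:

```python
def plan(blocks, choose):
    """Phase 1 + merge optimization sketch. `choose` maps one block of
    data to an algorithm name (the database lookup); adjacent blocks
    with the same choice are merged into (algorithm, block_count)
    segments, so each segment is compressed with one startup cost."""
    methods = [choose(block) for block in blocks]   # phase 1: per-block map
    segments = []
    for method in methods:                          # greedy scan-ahead merge
        if segments and segments[-1][0] == method:
            name, count = segments[-1]
            segments[-1] = (name, count + 1)        # extend current segment
        else:
            segments.append((method, 1))            # method change: new segment
    return segments
```

For instance, five blocks whose chosen methods are `lz, lz, lz, huff, huff` collapse into the two segments `[("lz", 3), ("huff", 2)]`, so only one change of compression method is paid for instead of four block boundaries.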
From this two-pass scheme it is straightforward to see why this system is less susceptible
than traditional adaptive systems to biases accrued when the data type changes abruptly
during compression. Adaptive compressors perform all operations myopically, sacrificing
the ability to see ahead in the file or data stream to detect future fluctuations in the type
of data. As a result, adaptive compressors retain the statistical vestiges of the old method
until these are 'flushed out' by new data (or balanced out, depending upon the process for
paging and aging internal data structures such as dictionaries). Thus adaptive compressors
may continue to suffer the effects of bias, achieving suboptimal compression. On the other
hand, by abruptly changing compression algorithms, our technique completely discards all
remnants of the 'previous' method (i.e. the algorithm used on the preceding segment). This
`
`
`
`
`
`
`
allows us to immediately capitalize on changes in data. In addition, merging contiguous
blocks of the same data type acquires the advantage of incurring all the overhead at once
for switching to what will be the best compression method for an entire variable-length
segment. The primary advantage of adaptive compression techniques over our technique is
that the adaptive compression algorithms are 'online' (single-pass). This property increases
compression speed and, more important, gives the ability to compress a data stream (for
instance, incoming data packets in a network or modem transmission) in addition to files
in secondary storage or variable-length buffers.
The remainder of this section presents the system. We begin with a description of the
calculation of the block types and the redundancy metrics. We also explain the use of the
metrics as absolute indicators of compressibility, and then describe the compression algorithms
used and the structure of the database of algorithms. A discussion of implementation
details concludes the section.
`
Property analysis

The compressibility of a block of data and the appropriate algorithm to do so are determined
by the type of data contained in a block and the type of redundancy (if any) in the
data. These two properties are represented by four parameters: the block type, and the three
redundancy metrics. The block type describes the data in the block (text, binary, graphical,
etc.). The three redundancy metrics are the degree of variation in character frequency, average
run length in the file, and the string repetition ratio of the file. They provide a quantitative
measure of how compressible the block is and which type of redundancy is most evident
in the block. The use of both quantitative redundancy measures (redundancy metrics) and
`qualitat