`
SOFTWARE
PRACTICE & EXPERIENCE

VOLUME 25, No. 10

OCTOBER 1995

EDITORS

DOUGLAS COMER

ANDY WELLINGS

WILEY
Publishers Since 1807

Chichester - New York - Brisbane - Toronto - Singapore
A Wiley-Interscience Publication
SPEXBL 25(10) 1065-1182 (1995)
ISSN 0038-0644
`
`
`
`
`
`
SOFTWARE
PRACTICE & EXPERIENCE

Editors
`
`Professor D. E. Comer, Computer Science Department, Purdue University, West
`Lafayette, IN 47907, U.S.A.
`Charlotte I. Tubis, U.S. Editorial Assistant, Computer Science Department, Purdue University, West Lafayette,
`IN 47907, U.S.A.
`
Dr A. J. Wellings, Department of Computer Science, University of York,
Heslington, York YO1 5DD, U.K.
`
Advisory Editorial Board
Professor D. W. BARRON
Department of Electronics and Computer Science,
University of Southampton,
Southampton SO9 5NH, U.K.
`
Professor P. J. BROWN
Computing Laboratory, The University,
Canterbury, Kent CT2 7NF, U.K.
Professor J. A. CAMPBELL
Department of Computer Science, University College London,
Gower Street, London WC1E 6BT, U.K.
`
Professor F. J. CORBATO
Electrical Engineering Department,
Massachusetts Institute of Technology,
545 Technology Square,
Cambridge, Massachusetts 02139, U.S.A.
Dr. Christopher W. FRASER
AT&T Bell Laboratories, 600 Mountain Avenue,
Murray Hill, NJ 07974-0636, U.S.A.
Professor PER BRINCH HANSEN
School of Computer and Information Science,
4-115 CST, Syracuse University,
Syracuse, New York 13210, U.S.A.
Professor D. R. HANSON
Department of Computer Science,
Princeton University, Princeton,
New Jersey 08544, U.S.A.
`
Professor J. KATZENELSON
Faculty of Electrical Engineering,
Technion-Israel Institute of Technology,
Haifa, Israel
`Dr. B. W. KERNIGHAN
`AT&T Bell Laboratories, 600 Mountain Avenue,
`Murray Hill, New Jersey 07974, U.S.A.
`
`Professor D. E. KNUTH
`Department of Computer Science, Stanford University,
`Stanford, California 94305, U.S.A.
Dr. B. W. LAMPSON
180 Lake View Ave,
Cambridge, Massachusetts, U.S.A.

Dr. C. A. LANG
Three-Space Ltd,
70 Castle Street,
Cambridge CB3 0AJ, U.K.
`
Professor B. RANDELL
Computing Laboratory,
University of Newcastle upon Tyne,
Claremont Tower, Claremont Road,
Newcastle upon Tyne NE1 7RU, U.K.

Professor J. S. ROHL
Department of Computer Science,
The University of Western Australia,
Nedlands, Western Australia 6009.

D. T. ROSS
SofTech Inc., 460 Totten Pond Road,
Waltham, Massachusetts 02154, U.S.A.
`
B. H. SHEARING
The Software Factory,
23 Padbrook, Limpsfield, Oxted,
Surrey RH8 0DW, U.K.

Professor N. WIRTH
Institut für Computersysteme, ETH-Zentrum,
CH-8092 Zürich, Switzerland.
`
`Aims and Scope
Software—Practice and Experience is an internationally respected and rigorously refereed vehicle for the dissemination and
discussion of practical experience with new and established software for both systems and applications. Contributions regu-
larly: (a) describe detailed accounts of completed software-system projects which can serve as ‘how-to-do-it’ models for future
work in the same field; (b) present short reports on programming techniques that can be used in a wide variety of areas; (c)
document new techniques and tools that aid in solving software construction problems; and (d) explain methods/techniques
that cope with the special demands of large-scale software projects. The journal also features timely Short Communications
on rapidly developing new topics.
The editors actively encourage papers which result from practical experience with tools and methods developed and used
in both academic and industrial environments. The aim is to encourage practitioners to share their experiences with design,
implementation and evaluation of techniques and tools for software and software systems.
Papers cover software design and implementation, case studies describing the evolution of systems and the thinking behind
them, and critical appraisals of software systems. The journal has always welcomed tutorial articles describing well-tried tech-
niques not previously documented in computing literature. The emphasis is on practical experience; articles with theoretical
or mathematical content are included only in cases where an understanding of the theory will lead to better practical systems.
Articles range in length from a Short Communication (half to two pages) to the length required to give full treatment to a
substantial piece of software (40 or more pages).
Advertising: For details contact:
Michael J. Levermore, Advertisement Sales, John Wiley & Sons Ltd, Baffins Lane, Chichester, Sussex PO19 1UD, England (Telephone 01243
770351, Fax 01243 775878, Telex 86290)
Software—Practice and Experience (ISSN 0038-0644/USPS 890-920) is published monthly, by John Wiley & Sons Limited, Baffins Lane, Chichester,
Sussex, England. Second class postage paid at Jamaica, NY 11431. Air freight and mailing in the U.S.A. by Publications Expediting Services Inc.,
200 Meacham Avenue, Elmont, N.Y. 11003. © 1995 by John Wiley & Sons Ltd. Printed and bound in Great Britain by Page Bros, Norwich. Printed
on acid-free paper.

To subscribe: Orders should be addressed to Subscriptions Department, John Wiley & Sons Limited, Baffins Lane, Chichester, Sussex, PO19 1UD,
England. 1995 subscription price (13 issues): U.S. $825.00.
U.S.A. POSTMASTER: Send address changes to Software—Practice and Experience, c/o Publications Expediting Services Inc., 200 Meacham
Avenue, Elmont, N.Y. 11003, U.S.A.
`
`
`
`
SOFTWARE—PRACTICE AND EXPERIENCE
`
`(Softw. pract. exp.)
`
`VOLUME 25, ISSUE No. 10
`
`October 1995
`
`CONTENTS
`
Migration in Object-oriented Database Systems—A Practical Approach:
C. Huemer, G. Kappel and S. Vieweg ............................................................ 1065

Automatic Synthesis of Compression Techniques for Heterogeneous
Files: W. H. Hsu and A. E. Zwarico ............................................................... 1097

A Tool for Visualizing the Execution of Interactions on a Loosely-coupled
Distributed System: P. Ashton and J. Penny ................................................. 1117

Process Scheduling and UNIX Semaphores: N. Dunstan and I. Fris ............. 1141

Software Maintenance: An Approach to Impact Analysis of Objects
Change: S. Ajila ............................................................................................. 1155
`
`SPEXBL 25(10) 1065-1182 (1995)
`ISSN 0038-0644
`
`
Indexed or abstracted by Cambridge Scientific Abstracts, CompuMath Citation Index (ISI),
Compuscience Database, Computer Contents, Computer Literature Index, Computing
Reviews, Current Contents/Eng, Tech & Applied Sciences, Data Processing Digest, Deadline
Newsletter, Educational Technology Abstracts, Engineering Index, Engineering Societies
Library, IBZ (International Bibliography of Periodical Literature), Information Science Abstracts
(Plenum), INSPEC, Knowledge Engineering Review, National Centre for Software Technology,
Research Alert (ISI) and SCISEARCH Database (ISI).
`
`
`
`
`
`
`
`
`
`
SOFTWARE—PRACTICE AND EXPERIENCE, VOL. 25(10), 1097-1116 (OCTOBER 1995)
`
`Automatic Synthesis of Compression Techniques for
`
`Heterogeneous Files
`
`WILLIAM H. HSU
`
`Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, U.S.A.
(email: bhsu@cs.uiuc.edu, voice: (217) 244-1620)
`
`AND
`
`AMY E. ZWARICO
`
`Department of Computer Science, The Johns Hopkins University, Baltimore, MD 21218, U.S.A.
(email: amy@cs.jhu.edu, voice: (410) 516-5304)
`
`SUMMARY
`
`We present a compression technique for heterogeneous files, those files which contain multiple types of
`data such as text, images, binary, audio, or animation. The system uses statistical methods to determine
`the best algorithm to use in compressing each block of data in a file (possibly a different algorithm for
`each block). The file is then compressed by applying the appropriate algorithm to each block. We obtain
`better savings than possible by using a single algorithm for compressing the file. The implementation
`of a working version of this heterogeneous compressor is described, along with examples of its value
toward improving compression in both theoretical and applied contexts. We compare our results with
those obtained using four commercially available compression programs, PKZIP, Unix compress, StuffIt,
and Compact Pro, and show that our system provides better space savings.
`
`KEY WORDS:
`
`adaptive/selective data compression algorithms; redundancy metrics; heterogeneous files; program synthesis
`
`INTRODUCTION
`
`The primary motivation in studying compression is the savings in space that it provides.
`Many compression algorithms have been implemented, and with the advent of new hard-
`ware standards, more techniques are under development. Historically, research in data com-
`pression has been devoted to the development of algorithms that exploit various types of
`redundancy found in a file. The shortcoming of such algorithms is that they assume, often
`inaccurately, that files are homogeneous throughout. Consequently, each exploits only a
`subset of the redundancy found in the file.
Unfortunately, no algorithm is effective in compressing all files.1 For example, dynamic
Huffman coding works best on data files with a high variance in the frequency of individ-
ual characters (including some graphics and audio data), achieves mediocre performance on
natural language text files, and performs poorly in general on high-redundancy binary data.
On the other hand, run-length encoding works well on high-redundancy binary data, but
`performs very poorly on text files. Textual substitution works best when multiple-character
`strings tend to be repeated, as in English text, but this performance degrades as the average
`
`CCC 0038-0644/95/101097-20
`©1995 by John Wiley & Sons, Ltd.
`
`Received 20 April 1994
`Revised 5 February 1995
`
`
`length of these strings decreases. These relative strengths and weaknesses become critical
`when attempting to compress heterogeneous files. Heterogeneous files are those which con-
`tain multiple types of data such as text, images, binary, audio, or animation. Consequently,
`their constituent parts may have different degrees of compressibility. Because most com-
`pression algorithms are either tailored to a few specific classes of data or are designed to
`handle a single type of data at a time, they are not suited to the compression of heteroge-
`neous files. In attempting to apply a single method to such files, they forfeit the possibility
`of greater savings achievable by compressing various segments of the file with different
`methods.
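
To make the contrast above concrete, the following minimal run-length encoding sketch (an editorial illustration in Python, not code from the paper) emits (count, byte) pairs: a run-heavy binary block collapses to a handful of pairs, while ordinary text, whose 'runs' are almost all of length one, actually doubles in size.

    def run_length_encode(data: bytes) -> bytes:
        # Emit (count, byte) pairs; counts are capped at 255 so each pair fits in two bytes.
        out = bytearray()
        i = 0
        while i < len(data):
            j = i
            while j < len(data) and data[j] == data[i] and j - i < 255:
                j += 1
            out += bytes([j - i, data[i]])
            i = j
        return bytes(out)

    print(len(run_length_encode(b"\x00" * 1000)))         # 8 bytes for a high-redundancy binary block
    print(len(run_length_encode(b"plain English text")))  # 36 bytes for 18 bytes of text: expansion
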
`
`To overcome this inherent weakness found in compression algorithms, we have developed
`a heterogeneous compressor that automatically chooses the best compression algorithm to
`use on a given variable-length block of a file, based on both the qualitative and quantita-
`tive properties of that segment. The compressor determines and then applies the selected
`algorithms to the blocks separately. Assembling compression procedures to create a specif-
`ically tailored program for each file gives improved performance over using one program
`for all files. This system produces better compression results than four commonly available
compression packages, PKZIP,2 Unix compress,3 StuffIt,4 and Compact Pro5 for arbitrary
`heterogeneous files.
`The major contributions of this work are twofold. The first is an improved compression
`system for heterogeneous files. The second is the development of a method of statisti-
`cal analysis of the compressibility of a file (its redundancy types). Although the concept
`of redundancy types is not new,“ synthesis of compression techniques using redundancy
`measurements is largely unprecedented. The approach presented in this paper uses a straight-
`forward program synthesis technique: a compression plan, consisting of instructions for each
`block of input data, is generated, guided by the statistical properties of the input data. Be-
`cause of its use of algorithms specifically suited to the types of redundancy exhibited by
`the particular input file, the system achieves consistent average performance throughout the
`file, as shown by experimental evidence.
`As an example of the type of savings our system produces, consider compressing a
`heterogeneous file (such as a small multimedia data file) consisting of 10K of low redun-
`dancy (non-natural language) ASCII data, 10K of English text, and 25K of graphics. In
`this case, a reasonably sophisticated compression program might recognize the increased
`savings achievable by employing Huffman compression, to better take advantage of the fact
`that the majority of the data is graphical. However, none of the general-purpose compres-
`sion methods under consideration are optimal when used alone on this file. This is because
`the text part of this file is best compressed by textual substitution methods (e.g., Lempel-
`Ziv) rather than statistical methods, while the low-redundancy data* and graphics parts
`are best compressed by alphabetic distribution-based methods (e.g., arithmetic or dynamic
`Huffman coding) rather than Lempel—Ziv or run—length encoding. This particular file totals
`45K in length before compression. A compressor using pure dynamic Huffman coding only
`achieves about 7 per cent savings for a compressed file of length 42.2K. One of the best
`general-purpose Lempel—Ziv compressors currently available” achieves 18 per cent sav-
`ings, producing a compressed file of length 37.4K. Our system uses arithmetic coding on
`the first and last segments and Lempel—Ziv compression on the text segment in the middle,
`achieving a 22 per cent savings and producing a compressed file of length 35.6K. This is
`a 4 per cent improvement over the best commercial system.
`
`* This denotes, in our system, a file with a low rate of repeated strings.
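
The percentages quoted above follow from the usual definition of space savings, 1 - (compressed size / original size). The short sketch below (an editorial illustration, not the authors' code) recomputes them from the rounded kilobyte figures; the results come out about one point lower than the quoted values, presumably because the paper works from exact byte counts.

    def savings(original_kb: float, compressed_kb: float) -> float:
        # Percentage space savings relative to the original size.
        return 100.0 * (1.0 - compressed_kb / original_kb)

    print(round(savings(45.0, 42.2)))  # 6  (quoted as about 7 per cent, pure dynamic Huffman coding)
    print(round(savings(45.0, 37.4)))  # 17 (quoted as 18 per cent, general-purpose Lempel-Ziv)
    print(round(savings(45.0, 35.6)))  # 21 (quoted as 22 per cent, the heterogeneous system)
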
`
`
`The purpose of our experiments was to verify the conjecture that a selective system
`for combining methods can improve savings on a significant range of heterogeneous files,
`especially multimedia data. This combination differs from current adaptive methods in
`that it switches among compression paradigms designed to remove very different types
`of redundancy. By contrast, existing adaptive compression programs are sensitive only to
`changes in particular types of redundancy, such as run-length, which do not require changing
`the underlying algorithm used in compression. Thus they cannot adapt to changes in the
`type of redundancy present, such as from high run-length to high character repetition. The
`superiority of our approach is demonstrated in our experimental section.
`This paper begins with a presentation of existing approaches to data compression, includ-
`ing a discussion of hybrid and adaptive compression algorithms and a description of four
`popular commercial compression packages. These are followed by documentation on the
`design of the heterogeneous compression system, analysis of experimental results obtained
`from test runs of the completed system, and comparison of the system’s performance against
`that of commercial systems. Finally, implications of the results and possibilities for future
`work are presented.
`
`RELATED WORK
`
It is a fundamental result of information theory that there is no single algorithm that per-
forms optimally in compressing all files.1 However, much work has been done to develop
`algorithms and techniques that work nearly optimally on all classes of files. In this sec-
`tion we discuss adaptive algorithms, composite algorithms, and four popular commercial
`compression packages.
`
`Adaptive compression algorithms and composite techniques
`
`Exploiting the heterogeneity in a file has been addressed in two ways: the development
`of adaptive compression algorithms, and the composition of various algorithms. Adaptive
`compression algorithms attune themselves gradually to changes in the redundancies within a
`file by modifying parameters used by the algorithm, such as the dictionary, during execution.
`For example, adaptive alphabetic distribution-based algorithms such as dynamic Huffman
`coding“) maintain a tree structure to minimize the encoded length of the most frequently
`occurring characters. This property can be made to change continuously as a file is pro-
`cessed.
`
`An example of an adaptive textual substitution algorithm is Lempel—Ziv compression,
`a title which refers to two distinct variants of a basic textual substitution scheme. The
`
`first variant, known as LZ77 or the sliding dictionary or sliding window method, selects
`positional references from a constant range of preceding strings." " These ‘back-pointers’
`literally encode position and length of a repeated string. The second variant, known as
`LZ78 or the dynamic dictionary method, uses a dictionary structure with a paging heuristic.
`When the dictionary — a table of strings from the file — is completely filled, the contents
`are not discarded. Instead, an auxiliary dictionary is created and updated while compression
`continues using the main dictionary. Each time this auxiliary table is filled, its contents are
`‘swapped’ into the main dictionary and it is cleared. The maintenance of dictionaries for
`textual substitution is analogous to the semi-space method of garbage collection, in which
`two pages are used but only one is ‘active’ — these are exchanged when one fills beyond
`a preset threshold. Another adaptive variant of this algorithm is the Lempel—Ziv—Welch
`
`
(LZW) algorithm, a descendant of LZ78 used in Unix compress.6,12 Both LZW and LZ78
vary the length of strings used in compression.11,12
`Yet another adaptive (alphabetic distribution-based) compression scheme, the Move-To-
Front (MTF) method, was developed by Bentley et al. In MTF, the ‘word code’ for a
`symbol is determined by the position of the word in a sequential list. The word list is ordered
`so that frequently accessed words are near the front, thus shortening their encodings.
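
A minimal sketch of the Move-To-Front idea just described (an editorial illustration, not Bentley et al.'s implementation): each symbol is encoded as its current position in a list, and the symbol is then moved to the front, so recently used symbols receive small codes.

    def mtf_encode(data: bytes) -> list:
        table = list(range(256))             # initial symbol ordering
        codes = []
        for b in data:
            pos = table.index(b)             # the output code is the symbol's current position
            codes.append(pos)
            table.insert(0, table.pop(pos))  # move the symbol to the front of the list
        return codes

    print(mtf_encode(b"aaabbb"))  # [97, 0, 0, 98, 0, 0]: repeated symbols quickly get code 0
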
`Adaptive compression algorithms are not appropriate to use with heterogeneous files
`because they are sensitive only to changes in the particular redundancy type with which
`they are associated, such as a change in the alphabetic distribution. They do not exploit
`changes across different redundancy types in the files. Therefore a so-called adaptive method
`typically cannot directly handle drastic changes in file properties, such as an abrupt transition
`from text to graphics. For example, adaptive Huffman compressors specially optimized for
`text achieve disproportionately poor performance on certain image files, and vice versa. As
`the use of multimedia files increases, files exhibiting this sort of transition will become
`more prevalent.
`Our approach differs from adaptive compression because the system chooses each algo—
`rithm (as well as the duration of its applicability) before compression begins, rather than
`modifying the technique for each file during compression. In addition, while adaptive meth-
`ods make modifications to their compression parameters on the basis of single bytes or fixed
`length strings of input, our heterogeneous compressor bases its compression upon statistics
`gathered from larger blocks of five kilobytes. This allows us to handle much larger changes
`in file redundancy types. This makes our system less sensitive to residual statistical fluctu—
`ations from different parts of a file. We note that it is possible to use an adaptive algorithm
`as a primitive in the system.
`Another approach to handling heterogeneous files is the composition of compression
`algorithms. Composition can either be accomplished by running several algorithms in suc-
`cession or by combining the basic algorithms and heuristics to create a new technique. For
`example, recent implementations of ‘universal’ compression programs execute the Lempel-
`Ziv algorithm and dynamic Huffman coding in succession, thus improving performance
`by combining the string repetition-based compression of Lempel—Ziv with the frequency-
`based compression strategy of dynamic Huffman coding. One commercial implementation
`is LHarc.”"5 Our system exploits the same savings since it uses the Freeze implementa-
`tion of the Lempel—Ziv algorithm, which filters Lempel—Ziv compressed output through a
`Huffman coder. An example of a truly composite technique is the compression achieved
`by using Shannon—Fano tries* in conjunction with the Fiala—Greene algorithm (a variant
of Lempel—Ziv) in the PKZIP2 commercial package. Tries are used to optimally encode
strings by character frequency. PKZIP was selected as the representative test program from
this group in our experiment due to its superior performance on industrial benchmarks.
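
To illustrate the idea of running compressors in succession, here is a hedged Python sketch; zlib and bz2 serve only as stand-in stages (they are not the Freeze or LHarc coders discussed above), and in the real implementations it is the Lempel-Ziv token stream, rather than the finished output, that is fed to the Huffman coder.

    import bz2
    import zlib

    def compose(stages):
        # Build a single compressor that feeds each stage the previous stage's output.
        def composed(data: bytes) -> bytes:
            for stage in stages:
                data = stage(data)
            return data
        return composed

    cascade = compose([zlib.compress, bz2.compress])
    sample = b"the quick brown fox jumps over the lazy dog " * 100
    print(len(sample), len(cascade(sample)))
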
`Our approach generalizes the ideas of successively executing or combining different
`compression algorithms by allowing any combination of basic algorithms within a file. This
`includes switching from among algorithms an arbitrary number of times within a file. The
`algorithms themselves may be simple or composite and may be adaptive. All are treated as
`atomic commands to be applied to portions of a file.
`
* A trie is a tree of variable degree ≥ 2 such that (1) each edge is labelled with a character, and the depth of any node
represents one more than the number of characters required to identify it; (2) all internal nodes are intermediate and represent
prefixes of keys in the trie; (3) keys (strings) may be inserted as leaves using the minimum number of characters which
distinguish them uniquely. Thus a generic trie containing the strings computer and compare would have keys at a depth of
five which share a common prefix of length four.
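
The footnote's example can be checked with a small sketch (ours, using a plain character trie rather than the leaf-compressed form defined above): children are indexed by character, so two keys share nodes exactly along their common prefix.

    class TrieNode:
        def __init__(self):
            self.children = {}   # one labelled edge per character
            self.is_key = False

    def insert(root: "TrieNode", word: str) -> None:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_key = True

    root = TrieNode()
    insert(root, "computer")
    insert(root, "compare")

    # Walk down while there is no branching: the shared prefix "comp" has length four,
    # so the two keys first separate at depth five.
    node, depth = root, 0
    while len(node.children) == 1 and not node.is_key:
        node = next(iter(node.children.values()))
        depth += 1
    print(depth)  # 4
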
`
`
`The problem of heterogeneous files was addressed by Toal” in a proposal for a naive
`heterogeneous compression system similar to ours. In such a system, files would be seg-
`mented into fixed-length encapsulated blocks; the optimal algorithm would be selected for
`each block on the basis of their simple taxonomy (qualitative data type) only; and the blocks
`would be independently compressed. Our system, however, performs more in-depth statis-
`tical analysis in order to make a more informed selection from the database of algorithms.
`This entails not only the determination of qualitative data properties but the computation of
`metrics for an entire block (as opposed to sporadic or random sampling from parts of each
`block). Furthermore, normalization constants for selection parameters (i.e. the redundancy
`metrics) are fitted to observed parameters for a test library. Finally, a straightforward but
`crucial improvement to the naive encapsulated-block method is the implementation of a
`multi-pass scheme. By determining the complete taxonomy (data type and dominant redun-
`dancy type) in advance, any number of contiguous blocks which use the same compression
`method will be treated as a single segment. Toal observed in preliminary experiments that
`the overhead of changing compression schemes from one block to another dominated the
`additional savings that resulted from selection of a superior compression method.” This
`overhead is attributable to the fact that blocks compressed independently (even if the same
`method is used) are essentially separate files and assume the same startup overhead of the
`compression algorithm used.* We have determined experimentally that merging contiguous
`blocks whenever possible obviates the large majority of changes in compression method.
`This eliminates a sufficient proportion of the overhead to make heterogeneous compression
`worthwhile.
`
`Commercial products
`
`One of the goals of this research was to develop a compression system which is gener-
`ally superior to commercially available systems. The four systems we studied are PKZIP,
developed for microcomputers running MS-DOS;2 Unix compress,3 and StuffIt Classic4
and Compact Pro,5 developed for the Apple Macintosh operating system. Each of these
`products performs its compression in a single pass, with only one method selected per file.
`Thus, the possibility of heterogeneous files is ignored.
`Unix compress uses an adaptive version of the Lempel—Ziv algorithm.‘‘ It operates by
`substituting a fixed-length code for common substrings. compress, like other adaptive
`textual substitution algorithms, periodically tests its own performance and reinitializes its
`string table if the amount of compression has decreased.
StuffIt makes use of two sets of algorithms: it first detects special-type files such as
`image files and processes them with algorithms suited for high-resolution color data; for the
`remaining files, it queries the operating system for the explicit file type given when the file
was created, and uses this information to choose either the LZW variant of Lempel—Ziv,6
`dynamic Huffman coding, or run-length encoding. This is a much more limited selection
`process than what we have implemented. Additionally, no selection of compression methods
is attempted within a file. Compact Pro uses the same methodology as StuffIt and compress,
but incorporates an improved Lempel—Ziv derived directly from LZ77.11 The public-domain
version of StuffIt is derived from Unix compress, as is evident from the similarity of their
`performance results.
`
`* For purposes of comparison, the block sizes tested by Toal were nearly identical to those used in our system (ranging
`upwards from 4K).
`
`
Compression systems such as StuffIt perform simple selection among alternative com-
pression algorithms. The problem is that they are underequipped for the task of
fitting a specific technique to each file (even when the uncompressed data is homogeneous).
StuffIt uses few heuristics, since its algorithms are intended to be ‘multipurpose’. Further-
more, only the file type is considered in selecting the algorithm; that is, no measures of
redundancy are computed. Earlier versions of StuffIt (which were extremely similar to Unix
compress) used composite alphabetic and textual compression, but made no selections on
the basis of data characteristics. The chief improvements of our heterogeneous compressor
over this approach are that it uses a two-dimensional lookup table, indexed by file proper-
ties and quantitative redundancy metrics, and, more important, that it treats the file as a
collection of heterogeneous data sets.
`
`THE HETEROGENEOUS COMPRESSOR
`
Our heterogeneous compressor treats a file as a collection of fixed-size blocks (5K in
`the current implementation), each containing a potentially different type of data and thus
`best compressed using different algorithms. The actual compression is accomplished in
`two phases. In the first phase, the system determines the type and compressibility of each
`block. The compressibility of each block of data is determined by the values of three
`quantitative metrics representing the alphabetic distribution, the average run length and the
`string repetition ratio in the file. If these metrics are all below a certain threshold, then the
`block is considered fully compressed (uncompressible) and the program continues on to the
`next block. Otherwise, using the block type and largest metric, the appropriate compression
`algorithm (and possible heuristic) are chosen from the compression algorithm database. The
compression method for the current block is then recorded in a small array-based map of
`the file, and the system continues.
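
A hedged sketch of this first phase follows; the metric functions, threshold and lookup table are placeholders, since the paper's actual normalization constants and algorithm database are not reproduced here.

    BLOCK_SIZE = 5 * 1024   # 5K blocks, as in the current implementation

    def plan_file(data, block_type_of, metrics_of, table, threshold):
        """Return one compression-method label per block (None means leave uncompressed)."""
        plan = []
        for start in range(0, len(data), BLOCK_SIZE):
            block = data[start:start + BLOCK_SIZE]
            # Assumed to return normalized (character-frequency variation,
            # average run length, string-repetition ratio) values.
            metrics = metrics_of(block)
            if max(metrics) < threshold:
                plan.append(None)            # block judged uncompressible
                continue
            dominant = metrics.index(max(metrics))
            plan.append(table[(block_type_of(block), dominant)])
        return plan
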
`The second phase comprises the actual compression and an optimization that maximizes
`the size of a segment of data to be compressed using a particular algorithm. In this optimiza-
`tion, which is interleaved with the actual compression, adjacent blocks for which exactly
the same method has been chosen are merged into a single block. This merge technique
`maximizes the length of segments requiring a single compression method by greedily scan-
`ning ahead until a change of method is detected. Scanning is performed using the array
`map of the file generated when compression methods were selected from the database. A
`compression history, needed for decompression, is automatically generated as part of this
`phase.
`The newly compressed segments are written to a buffer by the algorithm, which stores
`the output data with the compression history. The system then writes out the compressed
`file and exits with a signal to the operating system that compression was successful.
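
A matching sketch of the second phase (again illustrative, with hypothetical names): adjacent blocks assigned the same method are merged, each merged segment is compressed once, and a small history of (method, segment length, payload length) records what decompression will need.

    from itertools import groupby

    BLOCK_SIZE = 5 * 1024   # must agree with the block size used in the first phase

    def compress_file(data, plan, compressors):
        out, history = bytearray(), []
        block_index = 0
        for method, run in groupby(plan):
            n_blocks = len(list(run))
            start = block_index * BLOCK_SIZE
            segment = data[start:start + n_blocks * BLOCK_SIZE]
            payload = segment if method is None else compressors[method](segment)
            history.append((method, len(segment), len(payload)))   # compression history
            out += payload
            block_index += n_blocks
        return bytes(out), history
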
`From this two-pass scheme it is straightforward to see why this system is less susceptible
`than traditional adaptive systems to biases accrued when the data type changes abruptly
`during compression. Adaptive compressors perform all operations myopically, sacrificing
`the ability to see ahead in the file or data stream to detect future fluctuations in the type
`of data. As a result, adaptive compressors retain the statistical vestiges of the old method
`until these are ‘flushed out‘ by new data (or balanced out, depending upon the process for
`paging and aging internal data structures such as dictionaries). Thus adaptive compressors
`may continue to suffer the effects of bias, achieving suboptimal compression. On the other
`hand, by abruptly changing compression algorithms, our technique completely discards all
`remnants of the ‘previous’ method (i.e. the algorithm used on the preceding segment). This
`
`
`
`
`allows us to immediately capitalize on changes in data. In addition, merging contiguous
blocks of the same data type means that the overhead of switching to what will be the best
compression method is incurred only once for an entire variable-length
`segment. The primary advantage of adaptive compression techniques over our technique is
`that the adaptive compression algorithms are ‘online’ (single-pass). This property increases
`compression speed and, more important, gives the ability to compress a data stream (for
instance, incoming data packets in a network or modem transmission) in addition to files
in secondary storage or variable-length buffers.
`The remainder of this section presents the system. We begin with a description of the
`calculation of the block types and the redundancy metrics. We also explain the use of the
`metrics as absolute indicators of compressibility, and then describe the compression algo-
`rithms used and the structure of the database of algorithms. A discussion of implementation
`details concludes the section.
`
`Property analysis
`
The compressibility of a block of data and the appropriate algorithm for compressing it are deter-
`mined by the type of data contained in a block and the type of redundancy (if any) in the
`data. These two properties are represented by four parameters: the block type, and the three
`redundancy metrics. The block type describes the data in the block — text, binary, graphical,
`etc. The three redundancy metrics are the degree of variation in character frequency, average
`run length in the file, and the string repetition ratio of the file. They provide a quantitative
`measure of how compressible the block is and which type of redundancy is most evident
`in the block. The use of both quantitative redundancy measures (redundancy metrics) and
`qualitative characteristics (block types) as indicators for data compressibility is advocated
by Held7 and Salton. We have refined the process for computing those attributes referred
to as data analysis results by Held7 and as statistical language characteristics by Salton to
`obtain an actual guide for compression. The remainder of this section describes how these
`four parameters are determined for each block.
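
One plausible way to compute the three per-block measures is shown below as an editorial sketch only; the paper's exact definitions and normalization constants are not given here, so these formulas are assumptions.

    from collections import Counter

    def redundancy_metrics(block: bytes):
        n = len(block)
        if n == 0:
            return (0.0, 0.0, 0.0)
        # Degree of variation in character frequency: standard deviation of byte counts, scaled by n.
        counts = Counter(block)
        mean = n / 256.0
        freq_var = (sum((counts.get(b, 0) - mean) ** 2 for b in range(256)) / 256.0) ** 0.5 / n
        # Average run length.
        runs = 1 + sum(1 for i in range(1, n) if block[i] != block[i - 1])
        avg_run = n / runs
        # String repetition ratio: fraction of four-byte substrings seen earlier in the block.
        seen, repeats = set(), 0
        for i in range(n - 3):
            s = bytes(block[i:i + 4])
            repeats += s in seen
            seen.add(s)
        rep_ratio = repeats / max(1, n - 3)
        return (freq_var, avg_run, rep_ratio)

These raw values live on different scales, so they would still need the normalization against a test library, mentioned earlier, before a 'largest metric' comparison is meaningful.
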
`
`Block types
`
`The block type describes the nature of a segment of input data. There are ten classifica-
tions of data in this system: ANSI text, non-natural language text (hexadecimal encodings of
binary data), natural language text, computer source code, low-redundancy binary, digitized
audio, low-resolution graphics, high-resolution graphics, high-redundancy binary executable,
and binary object data. ANSI text is composed of characters from a superset of the ASCII
alphabet. Non-natural language text contains primarily ASCII text but does not follow a
distribution of characters like that of human languages. Examples are computer typesetting
data and uuencoded or BinHex encoded data (which have the same character distribution as
binary data but are converted to text for ease of transmission). Natural language text in-
`cludes text written in English as well as other languages which are representable by the
`Roman (ASCII) alphabet. Most European languages (including the ones using the Cyrillic
`alphabet), special symbols excluded, fall into