`
`'
`
`
`
`.
`L p sh-Pet'forman;ce
` Data C0mprcSSt®n
`
`
`
`
`
`‘
`
`
`
`
`
`strategy poses a different set of these problems and, consequent-4"‘
`ly. the use of each strategy is restricted to applications where its _
`inherent weaknesses present no critical problems.
`This article introduces a new compression algorithm that is
`based on principles not found in existing commercial methods.
`---'1"-his algorithm avoids many of the problems ‘associated with
`older methods in that it dynamically adapts to the redundancy
`characteristics of the data being compressed? An investigation
`into possible application of this algorithm yields insight into the
`compressibility of various types of data and serves to illustrate
`system problems inherent in using any compression scheme. For
`readers interested in simple but subtle procedures, some details
`of this algorithm and its implementations are also‘dcscribcd.5
`The focus throughout this article will be on transparent com-
`pression in which the computer programmer is not aware of the
`existencc of compression except in system performance. This
`form of compression is ‘ ‘noiseless,” the decomprcsséd data is an
`exact replica of the input data, and the compression apparatus is
`given no special program information, such as data type or usage
`statistics. Transparency is ‘jierceivcd to be important because put.-
`ting an extra .burden'on‘thé application programmer would can so
`- -‘»ootsg9:s2/st/oeoo4ioossn:.oo oassa IEEE ,_
`I OfiACLE 1114
`
`COMPUTER '
`
`
`6' taiies or transferred over com-
`‘ stTo'red"‘on"dis’lts
`ass links -in com ercial cotnpoter systems generally
`significant redundancy. A-méclianism or procedure
`_
`;
`_, .hio§_i ec‘odcs.the..data to lessen the redundancy could possibly
`effective data densitites in stored or com»
`. double or"t'ripie the
`
`'
`"
`orcoycr, ifgcomprcssion is automatic: it can
`
`n the rise of software developmgnt costs. A transparent
`._mi>léssion mechanism could permit the use of “sloppy” data
`"es, in that empty space or sparse encoding of data would
`_
`rf¢1t:ste2_r‘t1..vs>..<.:?;=~t:t<1.t?t¢:9§¢..9,t.st9sess.a:2ss9 or .t.1.f.=.»‘.I..*.5.f"“‘i’I\.<=: now-
`c"i'<*ix on 'east:trt4s‘a'- C\‘!1'tVp2‘€‘.'K<l'i3}‘l;‘{'F0C‘éd\Il7e.
`, Sex-em! protalctns encouzncred xx-lie:ri".¢ommo:1 compression
`’etltotls.hr'v il‘tlt\_:r£ttéd in.to:co:ttputcr systems have prevented
`tli wi‘dc§prcnd use oi"auro'rtiatic data compression. For example
`tfirttntime ex__
`‘ speeds interfér§i"‘iii' the attainment of
`‘{2}
`‘
`'_-'=d'a§a rat "
`‘ost techniques are not
`fi_exib_le"e'nough to proccss_,cl1t‘§é enittgipcs ‘of redundancy; (3)
`blocks-‘of compr I
`.
`4
`ve. unpredictable lengths pre-
`sent storage“spa‘ce‘m nage “ant
`
`
`
`
`
`
`
`
`
`.
`
`development costs which would-=oftcn exceed the value of
`compression. As itlustratcd"in Figure 1. compression is
`viewed a$'b'_eing very similar to a communications channel
`at‘-I codes a sequcnce ofbytes into an anirtcial format
`irod'pl1ysica1 handling and later decodes the corn-
`"I
`I
`..
`..
`tessed image back into a_ replica of the original message.
`The data compression described by this model is a
`reversible process-that-is unlike other forms of "data com~
`pr_essiorI_"-___—suclt as-_'d'ala-reduction ‘or data abstraction--in
`to soon relevance cri-
`
`-
`:2
`
`1-
`
`ten and
`anlclt lows“ on ¢9mct.=sti.rm‘_.<i.f.
`nit-ieric data-intermixed:-4n connects‘-_te*npuu ap-
`-' plicattons. Wher_e'_’u'
`' eit_icl_et' heve"pl_eccd per-
`
`tleulsr emphasis on text ootnpresslon"i'n their dtsetrsuonsfi_‘‘_'''
`of comprwsion algorithms for computer data’, "3"titis'
`cle will cover more ofthe systems
`compression
`into acomputé
`
`typical character string,
`Character distribution. In _a
`some characters are used mifiie frequentlsr than others.
`Spec-llicellyf,
`in eight"-bit ASCII representations neatly
`threetfourilis of the possible 256-bit combinations may
`not be used in a specific file. Consequently. nearly two bits
`of each eight-bit packet might be‘ saved. for a possible
`. as-percent '__apaceaftlrne savings.
`In English text.
`the
`ara.t:1t'.l's occur in a wellédocumeoted distn‘o-utlon, with e
`__ and "space" being the most popular. In an inventory
`record. numeric values are very comrnon--the existpnce of
`or packed-decimsl numbersicatt shift the statis-
`tics-—and constraints on field definition can cause
`charfler distributions to varisigttiftcantly from file to~ "
`‘file. For example; the choice of whether to use alphabetic
`or numeric values to identify warehouse sites can shift the
`distribution in theinventory rile. Likewise. the extent to"
`which descriptive tcxtappears in the inventory records will
`ittfluence the average number ofbits needed per character.
`
`Types at redundancy
`
`'
`
`Four types of redundancy can be found in commercial
`data. Redundancy hereis confined to what is observable in -
`a data stream {without knowledge of how the date is to-be -
`interpreted). Redundancy in the form of unused data or
`application-specific correlations am not considered here.
`For illustration. types of redundancy will be described as
`they might be found in two kinds of flies: English text and
`manufacturing parts inventory records (see Figure 2). The
`redundancy categories described arena: lndependent.'but -
`overlap to some extent-._'
`-
`
`_
`
`
`
`-.
`
`..
`
`'
`
`Cltentcter repetition. When a string of repetitions of a .
`single character occurs, the message can usually be-en--.
`'
`coded more compactly than by just repeating the character
`symbol. Such string are infrequent in text. occurring as
`blanlt spaces in place of tube or at the end of lines.
`However. in formatted business files. unused space is very
`common. An inventory record will it-cquently have strings
`of blanks in partially used alphabetic fields. strings of
`zeros in high-order positions of numeric fields. and
`perhaps strings of null symbols in unuscdftelds. Graphical
`images, especially the line drawings of business graphics.
`are mostly composed of homogeneous spaces. and their
`
`
`
`
`
`couratsstntt
`
`;-BOIIPBESSED
`_
`
`'-
`'
`nata
`COWHUNIGITIBHS
`uuss
`"
`
`OUTPUT
`SYMBOLS
`
`°9‘°"‘P3E3$l9.“.
`
`-
`
`'
`
`_
`
`.
`
`'- Figure 1. A model {or a compression system that per-
`. lorrne _l_r_ansparer.it cot-npreeeton.
`-IE‘:
`
`I
`...-g
`FRET HARE: HEX NUT 5'4 K20
`°55G.a.iP.TIUN: STEEL. SIRNQR-ND THBEJW
`COLORCWE: BLINK FIELO
`.
`it
`I
`.
`"""“E“°"-35 *5‘ m5"-" ""“*-'--—-———-..—-...__. sans utut rntuutttrtr .
`HEOGBUFIS IN THIS FILE.
`sroanus sure: tits”
`nuturtwjlti $Tltct_t:_an2‘n‘
`
`._‘_.
`
`.____
`
`E
`
`llAflllBl.E-LEHBTH TEXT
`ENELISH CHAHBACTER
`HBEUBBENCE DISTRIBUYION
`
`LHIIITEB URBIETT OF CMMCTERS
`
`'
`
`stances rautmtne
`
`utmtalc rims‘
`
`' FtgtnI.2.t'fypel"t'il ‘redundant code in a rim-tut_ecI_urlng pens Inventory record.
`
`.
`
`-'-Jurviétglt-i
`.~_.
`.. ,-.\;_I_\‘_;-,\__
`
`9
`
`
`
`Mfllhods oi compression
`
`The precedingxliscussion of redundancy types provides
`a basis for comparing several practical compression meth-
`ods. More theoretical comparisons are provided by Storcr
`and Szymanlti.‘ '
`-
`"
`
`I-lufirnaa coding. The most popular compression meth-
`od is to translate fixed-size pieces of input data into
`variabloiength symbols. The standard Huffman encoding
`procedure prescribes a way toasslgn codes to input sym-
`bols such that each code length in bits is approximately
`logfisymbol probability}
`
`where S)l'l1lh0l_,pl'0hflbilll¥ is the relative frequency or oc-
`currence of a given symbol (expressed as a probability).
`For example, if the seat of symbols (the input ensemble) is
`chosen to be the one-byte ASCII characters {a very typical
`case) and if thcblank character is used one-eighth of the
`time. then the blank character is encoded into a three-bit
`. symbol. ifthe letter Zis found to occur only 0.1 percent of
`‘the time, it is represented-by to bits. '
`ln_nortnal use. tlic size of input symbols is limited by the
`size of the translation table needed for compression. That
`is. a table is needed that lists each input symbol and its cor-
`_respondir_tg__code. Ifa-symbol is one eight-.bit byte (as is
`common), then a table of 256 entries is sufficient. such a
`table is quite economical in storage costs but limits the
`degree of compression achieved. Single-character en-
`coding can cope with character distribution redundancy,
`but not the other types..Cornpression could be improved
`by using input symbols of two bytes each. but that would
`require 9'. table of 64K entries at a cost that might not be
`warranted. Parsing the input data into symbols that are
`not byl.c'-aligned such as 12 bits, isnot likely to improve
`compression t-.fi'ectivene.ss_and would complicate system
`design.
`'
`‘
`,
`A second problem with HulTman encoding is the com-
`plexity of the decompression process. The length of each
`code to be in tcrpretcd for decompression is not known un-
`til the first few bits are interpreted. The basic method for
`intepreting each code is to interpret each bit in sequence
`and choose a translation subtablc according to Whl'.'thet,the
`bit is a one or zero. That’ is. the translation table is essen-
`tially a binary tree. Operationally. that-t,‘a logic decision is
`made for each code bit. In working with a disk drive that
`base 30M-bps transfer rate, the decompression logic must
`cycle at! that rate or faster to avoid creating a system bottle-
`neck. Decompression it that rate is possible but not sim-
`ple. in general. decompression with variable-length sym-
`bols gives a cost/performance disadvantage whenever the
`data. volume is high.
`‘
`A third problem with Huffman encoding is that we need
`to know the frequency distribution for the ensemble of
`possible input symbols.
`in the normal case of single-
`character symbols, character l'r'equcrtcy distribution in the
`data stream is the type of distribution use. Such a distribu-
`tion is known and is probably reasonably stable for Eng-
`lish text. However, the distributions for data tiles are very
`application specific and files such as object code. source
`code. and system tables will have dissimilar characteristic
`distributions. .While it might be possible to have a set of
`
`COMPUTER
`
`increasing integration into textual data can pose a com-
`pression challenge.
`
`I-ligh-usage patterns. Certain sequences of characters
`will reappear with relatively high frequency and can
`tltcrcforcllsc represented with rclatively fewer bits for at not
`saving in tithe;/space. in English. a period followed by two
`spaces l§'_r'rr_ore common than most other three-character
`combinations and could therefore
`rccoded to use fewer
`"bits.
`such as 213. an’: ljlore common than
`lllc ind_li3i5iual'leite"'
`babilities would imply and _might
`‘resistors be"
`__ _ithIewcr bits than the tiv0—charac-
`
`ter symbols tised togct_ll_
`lkewise. unlikely pairs. such
`"as GC, would be encoded witltvery long bit combinations
`to achieve better bit utilization. In particular instances of
`text, certain key words will be r.tsi':‘d" heavily. For example,
`the word “cm-npression" apiamrs frequently in this article;
`consequently, if this article were rrttrcleinto a system file.
`the word "compression" would warrant ‘use of a shorter
`~ "code.
`in inventory records. certain __t'derttil”ters __sut:l_t,.as_
`wareltouse_n-arrtes are exterisivcly reused. Numeric fields
`-' coniitin-préfetriid sequencésin the sense that they contain
`' only sequences of diets; with no letters or special symbols
`intermixed. These c'ould'bé encoded at less than four bits
`per digit,‘ rather than the five to eight bits needed for
`- --gct"tcI'.‘tti' text.
`
`'
`
`
`
` .. =1’-ositlooal. redundancy.
`
`
`.
`
`'
`
`lf certain characters appear
`= s.-_-consist_e'rttiy-._at a predictable place in each block of data.
`then they are at least ‘partially redundant. An e_samp_le of
`‘this. ism-“raster-scanned picture containing a vertical line;
`the vertical line appears as-a blip in the same position in
`each scan, and could be more compactly coded. in inven-
`. -' "tory.fil'es. certain record fields may almost always ha‘-‘E the
`. same entry, such -as .3 -"special handling" field which
`almost always has-Pndne‘--’ in it. Text, on the other hand,
`has virtually no positional redundancy in it.
`
`
`
`‘
`
`Thcsc'l‘qt_u:_.typ_e's of redundancy overlap to some extent.
`For example, inany i_rt__vir'v.='.;ttt't‘_>_t_'y’=_t'r"tr'r't'tet-ic fields will contain
`small integers prcpondertiitlly. These could be compressed
`as a small group of frequently used sequences {the in-
`tegers). as instances of zero strings in front of the integer
`values, as a weighting of the character frequency distribu-
`tion" toward zeros. or amp ositional bias, in that particular
`fields will almost always have nurrteric values with many
`high‘-order zeros. The point here is that almost any com-
`pre'3ssion"trit:_t':l_ta'r_tis'rtl"cancitploit that type of redundancy.
`intent here"-i§":"not to precisely analyze the com-
`ponents of redundancy. but rather to provide a better view
`of the opportunities and challenges for a compression
`algorithm.‘
`
`r-.
`
`.
`
`‘Other types ‘of reditnflhncy exist. beyond those listed here. for example in
`digltlm voice wnvcfo'I'11'IS. -There is ortcnslve redundancy in voice. but
`rnuchof it__b
`i_nth'_e_ frequency domain rather than the |ifl'Icdo=n‘aiI1-
`set-ievol'ee=ti_ata oollectcd can be eliminatecl by dropping unncoessttrif mer-
`sion iryeetttlin contests. but this technique should be seen as data reduction
`rather than data compression bccruseit is not reversible. in general. the
`..nI:thods used tdcomprcss alphanurncric data are not useful on voice, and
`__: vice _'Irl"rp_ (except perhaps for blank ialeqgal flooding}
`-'
`t"t‘.|t'Ii1 -
`
`.-
`
`to
`
`,3;
`
`'
`
`-
`
`”
`
`r
`
`-l..-l.-..
`
`..
`.
`
`10
`
`
`
`. 5
`
`generic translation tables to cover these various cases.
`problems arise in having the decornpressor use the cattle set
`of tables as the compressor. A common solution is to
`anelyie each data block individually to adapt the character
`- di_stribution- uniquely to that block. We must make two
`.jf ‘passes over the data: (1) a pass to count characters and per.
`form a sort operation-ttvtithe character table. and (2) a pass
`for encoding} The derived translation table must be con-
`veyed with the cornpi"esse'd data. which detracts from the--
`:_ compression effectiveness and /or strongly restricts the size
`:
`7 oFllttt'lféli’iIlfitl011‘tIi«ble.-This adaptable approach is accept-
`','
`_t_t_l_J_!_¢ if high transfer rates'throu'glt_{l{t'e compressor are not
`
`- required and’ll’ the dent"-‘blocks
`are very
`large relative‘ to the-size of the tmnsliiiéiii’-table.
`
`3
`
`Run-len5g'ti:t:'encoding. Sequences of identlcalfltuacters
`can be encoded as a count field plus an identifier of the
`
`
`
`-
`
`of characters '_i_avt'-'t"i‘t'~._i'_'g'_i_'5s"t"t"r'te for ASCI1text, st‘: not for ar-
`bitrary bit. pa't_terns such as those in binary integers.
`Tltpicfli‘-Y. three cliarlatgtersftire needed to mark each char-
`acter _run.___.so]-'tt'is‘re'r1
`in‘g' iv‘
`Id not be used for runs of
`
`
`presslon. While progiarnmed com-
`Progranii:rte___
`_
`pression doeshot meet" ‘the ‘transparency constraint desir-
`able for gene'r'al_-,us,_e systems, it is discussed here to show
`.th_e-_ types of technidues used. The programming is general-
`ly done by, the anplirations p_rogrammer or data manage-
`ment systern.'ln formatted daiti_'ill_és;"several techniques
`are used. _l_J_r_tuse'd blank or" hero spaces are eliminated by
`making l't'eIcl}.'v'a.'r_iablt;:':irt length and using an Index struc-
`ture with pointers'io‘iéadlt
`position E_'_r__e_‘.di‘étable field
`values are compactly "encoded
`__
`‘y.-5'l‘5f"'ti.' code table-
`such as when iivarehouse namcsare given as integer codes
`rather than as alphabetic English names. Each field has its
`own specializod code table that deals with positional
`redundancy. Since programmed ‘compression cannot ef-
`fectively handle character distnbtttién redundancy, it is a
`' nice complernc_n_t to _l-luffrnan coding.
`_
`Programmed compression has seiiéral serious disadvan-
`tages.
`it introduces increased _.program development eit-
`penses: the type of decompression used requires a knowl-
`edge of the record structure and the code tables; the choice
`of field size; and code tables may complicate or inhibit
`later chang'e§':to the data structure making the software
`more eitpens'tve_to maintain. Perhaps most irnportant.
`progtamrriersvvillitend to avoid the decompression process
`because it is relatively slow and. therefore, will work with
`data in main memory_ in its compressed format. which
`confuses and complicates the application program.
`
`
`H" '
`
`'
`
`hols into l"uted-length (or predictable length) codes. The
`symbol strings are selected so that all have almost equal
`prgpabnity of occtsrrcncc. Consequently, strings of fre-
`guintly occurring symbols will contain more symbols than
`ifistring has-ins infrequent symbols {see cttarrtple in Table
`1).. ilhis form of compression is effective at exploiting
`character frequency redundancy. character repetitions,
`and high—usage pattern redundancy. However. it is not
`generally effective on positional redundancy.
`_
`This type of algorithm is adaptive in the sense that it
`starts with an empty table o l‘ symbol strings and builds the
`table during both the compression 1.11161 decompression
`processes. These are one-pass procedures that require no
`prior information about the input data statistics and eat-
`ecute in time proportional to the length of the message.
`This adaptivity results in poor compression during the in-
`itial portion of each message: as a result. the message must
`be long enough for the procedure to build enough symbol
`frequency experience to achieve.goo_d compression over
`the full message. On the.other hand. most finite im-
`'-plementatiotts of an adaptive algorithm lose the ability to
`adapt after a certain amount or the message is processed.
`If the message is not homogeneous and its redundancy
`characteristics shift during the message, then compression
`efficiency declines if the message length significantly ex-
`oeedsthe adaptive range of the comptessiort~irnp'1_ernenta-.
`tion.
`- ~
`
`.
`
`.
`
`-~
`
`-
`
`‘.-
`
`,
`
`-
`
`-
`._
`
`-
`
`.
`
`_
`
`.
`
`.
`
`l
`
`.
`
`.
`
`‘
`
`_
`
`t
`
`,
`
`_
`
`-
`
`
`
`
`
`,
`
`cont
`1
`2
`3-
`4
`5
`6
`7
`3
`9
`10
`11
`12
`13
`1-I
`- 15
`iii
`17
`13
`19
`20
`
`Table 1. A compression string table with alphanumeric
`character strings‘ that are encoded Into 12-bit codes. in
`this example. intrequent letters. such as Z. are assigned
`individually to a 12-bit code. Frequent symbols. such as
`space and zero. appear in long strings which in practice
`can exceed so characters In length. Good contptesslon is
`achieved when along string is encountered in the Input. it
`is replaced by a 12-bit code. ellectlng slgnlllcant space
`savings.
`'
`.
` %_——¥4g
`SYMBOL
`-
`.
`STFllNG
`A
`ha
`AN
`atttt
`an
`z
`o
`no
`not
`5
`55
`this
`slits
`snubs
`utittm
`n
`no
`non
`0000
`000-01
`
`Adaptive'cozittt't§ession. The Len1pel~Ziv‘-5 and related
`algori‘thms7 ‘convert -variable-length strinsi of input sym-
`
`sss
`
`4095
`
`.t_ur.ie i984
`
`11
`
`
`
`Q_P':‘°Wed compression results.
`
`Effectiveness of__ compression is expressed as a ratio
`relating the number‘ of bits needed to express the message
`before and after compression. The compression ratio used
`here'wili be the uncompressed hit count divided by the
`compressed .b'rt count. The resulting value. usually greater
`than one, indicates the factor of increased data density
`achieved by compression. For example. compression that
`serves to eliminate half the bits of a particular message is
`pron-tried as achieving a 20 compression ratio. indicating
`that two-to-one compression has been achieved.
`Compression ratios.-develolriéd by software simulation.
`-are given in ‘Table 2'for several data types. Many of these
`observations
`from compression of backup/recovery
`tapes from a Sperry Univac ll00/60-‘maahiiie'- used in a
`scientific -environment. Most of the samples involved
`several different files and covered 105 to 107 bytes each to
`provide meaningful averages. Several versions of Lempe1-
`Ziv compression algorilhrns were used; however, since the
`various alzotithtns produced results that were consistent
`to within ‘a few percent. no attempt
`is made here to
`distinguish the specific algorithm used for each case.
`
`English text. Text samples for compression were ob-
`tained from ASCII word processing filesin-a technical en-
`vironment. Results were reasonably oortsistcnl for simple
`text, at a compression ratio of 1.8. Many word processing
`fries. however. compressed better tharttlmt when they con-
`tained figures. formatted data. or presentation material
`like viewgraphs. Surprisingly. long individual documents
`did not compress better than groups of short documents,
`indicating that the redundancy is not due very much to
`correlation in content. For comparison. Rubin’ achieves
`up to a 2.4 ratio using more complex algorithms. Pechural
`observed a 1.5 ratio using Huffman encoding. These com-
`parisons are not very reliable. since the subject texts may 1.
`have dissimilar properties.
`
`Coboi files. A significant number of large -Cobol Files
`from sev_eral types of applications were compressed. pro-
`ducins widely variable results. Compression depends on
`record format. homogeneity across data records. and the .
`‘extent of integer usage. These were eight-bit ASCII files.
`so the integer data would compress very well. A side ex-
`periment showed that one third to two thirds of the space
`in some of these files appeared as strings of repeated iden-
`tical eharacters. indicating a high fr_acti_on of blank space.
`After allowing for that much blank space, there normally
`was a further two-to-one compression within the useful
`data. Restricted character usage probably caused two
`thirds of this latter two-to-one compression. with repeated
`character patterns giving the rest.
`_
`
`Floating point numbers. Arrays of floating point
`numbers look pretty much like white noise and so they
`compress rather poorly. The fractional part is a nearly ran-
`dom bit pattern. The exponent does yield some compres-
`sion when most of the numbers have the same ntagttitudc.
`Some floating point arrays expand by 10 percent going
`through the compression algorithm. when the exponents
`
`COMPUTER
`
`a.,,.,‘-an-if,
`
`
`
`'i‘he compression procedure discussed here is called the
`LZW-method‘.-‘Miariatlon on the Lempel-Ziv procedure. -"
`it
`retains" the" ‘adaptive properties of Lempel-Ziv and
`achieves" about" the same compression ratios in normal
`con1merci'sl"'eomputer applications. Lzw is distinguished
`° "by its very simple logic. which yields relatively inexpensive
`implementations and the potential for very fast execution.
`A typical"‘-liilliltliitplementation operates at under three
`cloeit cycle's‘-'-"p‘e'i‘ symbol and-rachieves acceptably good
`¢0mr!r5S'Si'on='0h niessages-tr: the_'n:sgni_tude of tert-'titou-
`
`
`
`Contpresston parameters. The preceding discussion of
`redundancy and compressiontétn be summarized bypoint-
`out-the'pararneters that influence thechoice of a corn-
`pressililt-"s'tt'iiteg3't. -The redundancyttype-found in a certain
`application is important, but prediciitbiliiy of redundancy '
`-type-is also important. An adaptive capability in a com-
`pression procedure would=be"o'f 'littie'benefit for applica-
`- Lions-=with predictable redundancy such as in English text,
`'b'ut*'it"would be valuable for business-files. System data
`transfer rates pt__e__;__r_r_tu'ne if a one”-pass procedure is needed-
`for
`th§__'gre_a_tetf-"overhead of a two-pass ap-
`be'3ua"iI¥ed"*l5§5etter compression results. The
`length of the meesage"l5eing-transmitted or retrieved has
`soflteirhporunce because adaptive"te'cl1niques are awk-
`orineffectlve on short messages.
`it is more difficult to analyze the compresion costs that
`reflect hurnarr_"itilrolvernent- Nottadaptive algorithms re-
`r't”'iiil‘e'p'riol“-"'lt'l‘i'owlet_i_}g"e__:-of the data characteristics. which
`r'ie'rta'ips'provrui':'tt lf:'_y*’i-t'i'ari'nat-crass-irieatron of messages
`"‘ihto"categories having preanalyzed "redundancy profiles.
`' Oi‘ programmed compression offers the greatest -
`density improvement at the cost of very high development
`and software maintenance costs.
`The focus of this article is ortthe problems found in con-
`ventioiial data stortt§e"ht"'eoif'rtlneteiaJ computer systems.
`In that applicat'idi‘i§"‘3‘tlie*"‘hi§li ' data transfer
`rates of
`'i"ntlg'1'lelit: cliilts preclude extensive runtirhe overhead. while
`
`the diversity of the data stored essentially requires an
`adaptlve"'p‘i'T:i'ci3iiul‘e. 1111;;
`'tg'_F.Ԥlfori segments on a disk
`complicates the use of dalptive approach. but this is
`becomingless important as data block sizes on disk are in-
`ereasirtg. The Lzw method seems to satisfy these re-
`quirements better than other compression approaches.
`Therefore. its characteristics will betrsed in the following
`discussion. which points out the'opporturtities and prob-
`‘Isms of compression in data-storage.
`
`_
`"' 7
`-
`
`. '
`
`'-
`
`
`
`'
`
`
`
`"
`
`'
`
`*"
`
`,
`-"
`
`'
`
`'
`
`cl
`
`a’
`
`Table 3. compression results tor a variety of data types.
`.
`
`DATA TYPE
`COMPRESSION RATIO
`
`
`
`English ‘last
`__
`_
`Cobol Files
`.
`Flaunt Pvlnilwavs
`Fcnrriitteq-si:lsnIifre‘Dala'
`
`_
`
`.
`
`. Fragrant-‘Source coca
`--
`
`._-H“ _
`12 _
`
`12
`
`1.8 '
`2 to 6
`1.0
`2:1
`
`._-.;.__
`
`.
`
`.
`
`I.
`
`I
`
`I
`
`_
`
`E
`
`‘
`III‘.
`.=.i.._'
`
`2.3
`l«.5
`
`.
`
`.
`
`-
`
`-
`
`..
`i
`
`
`
`as nine-bit input symbols compressed about 2.8-to-one
`pthen processed as a sequence of eight-bit chunks. All of
`.t‘n1'.s -Evidence indicates tha't compression can be effective
`even when dealing with mixed character types.
`
`Recovery tapes. Computer backup/recovery tapes
`(called secure tapes at Sperry) contain copies of active user
`tiles. As such. each tape exposes a cross section of data
`used in its system. although that profile of active files may
`differ from the profile of all files found on the system
`disks. Cotnp_ressi_ng iheso tapes with the LZWK algorithm
`has produccd'soi1}sthi_nglcss than __a two-to-one space re-
`daction in s, n_sls1.lll_fl¢iitstallati.9_ii {I'll slish=lv'betterth9-n #-
`tw_o_-_to-one reduction in a prb‘gra'rti'-developrrtent-oriented
`inst}allation‘."r‘~‘rorm'these_results, we might expect a three-
`to-one reduction in traniiaction-oriente_d'st'stetns where
`formatted data predominates.
`.
`
`System considerations
`
`__
`
`', The availability of an appropriate compression method I
`does not assure an effective system. Several problems
`occur when integrating compression into eitisting com-
`puter systems. These are described here in terms of both
`problem type and impact. on specific computer p‘e_l'.ip_h§Ela_‘1§.
`
`i
`
`:1 E.-. '.
`
`Unpredictable message length. The length of a com
`pressed image for a given message is unpredictable because
`it depends on the content or the message. We have no
`assurance prior to compression that a message will com-
`press at all: in some casesit-may everrexpand a little.
`Therefore. the space ‘allocated for the compressed image
`must be at least as big as the space allocated for the original
`message; this requirement is a complication on allocated
`devices such as a disk. Further. an update to a data record
`that alters just a few characters can change the compressed
`size and rnity result in a required change in allocation-
`even for minor update operations. Unpredictability in
`length also impacts bandwidth requirements: as a result.
`data path capacities have to be overdesigned to handle
`peak demands.
`‘
`'
`
`\\
`
`reversible
`Error tolerance. A characteristic of most
`compression systems is that if one bit is altered in the com-
`pressed image. much ofthe-message is garbled. Therefore.
`a given error rate may be acceptable for the transmission"
`of uncompressed data. but the same rate may be unaccept-
`able for compressed data. Special efforts to detect errors
`are needed. especially to catch any errors in the compres-
`sion and decompression processes. Note that a parity bit"
`added to each data character before compression would be
`removed as redundancy by the compression
`so
`character pltrity has no value in checking the compressor.
`Cyclic checks over. the entire message are effective. how-
`CV61’ .
`.
`.
`
`'
`
`'
`
`This intolerance to errors often limits the use of com-
`pression. For examplendigitized speech can tolerate occa-
`sional bit losses. but compression turns these errors into
`serious losses. On the other hand. since most commercial
`data storage devices, such as disk and tape, have accept-
`ably low error rates_ and there is no attempt to use their
`
`13
`
`vary widely. Eltpansion'undct' an adverse data pattern can
`occur under alrnost any adaptive compression method.
`Randomized bit patterns usually cause this effect. since
`there are no redundancy savings to offset the overhead‘
`that: is. inherent in coding the choice of parameter tvalues
`or implicit in the procedure.
`
`
`
`'
`
`' "I-‘otntatted seleuti'flc_dataL Most data used by Fortran
`programs ten_d_ed_to compress about 50 percent. This data
`included input .d_ata._prirttarily_._integers. it also included
`._;_.--.:.p_rtrtt filespxyhich 'we_re_Asc11_ epceo, Most or these rues
`
`- 3 _- are ‘short. s9.nut-10-file-temple. dill-l40t.ene_ompass ta _larg.e
`
`quantity of data. __
`
`'
`
`System log tlv.t_tt..l_nfo_rmation describing past syitem ac-
`tivity. such as job start and stop times. is tnostlyiortttatted
`integers and is therefore reasonably compressible. This log
`data is used for recovery and constitutes perhaps 10 per-
`ccnt of the datastored on backup/recovery tapes. It tends
`to be in a tightly packed. fixed-length format. so the corn.
`pressiorl"achie'.ved._is due to null fields and r_epctitiq3n__mthe
`
`data-Rralues. .
`-.
`
`
`
`
`T“
`
` E75
`Source code cart,
`ontpressed by a factor
`__ _
`_
`of better than two. 1!. can be cOl'l‘tprcS8_Bi1 better than text
`because w0tds..a:e,-frequently repeated nnd_b_lanlt spaces
`are introduced "by--thesouree code format. iiighly struc-
`tured programrning yields source code with greater corn-
`pressibility than the ayerage; 213 factor cited here. Object
`code. on ute;.9fl39_F Iigiix-!t_CIl_)nl$lSl3'3__.9_f_3Fbllr3Yy bit patterns
`. does not eorttpressias 'wel1’.'Un'eiren_ usage of op codes
`' ‘arid incomplete; utilization of displacement fields would
`account-for. most-.0f—,the compression achieved. Relocat-l
`able code and absolute code differ in compressibility
`dcpc nding on the treatment of blanking for empty variable
`- areas. Actual program files in a development organization
`
`' were found to be mixtures of source. relocatable. and ab-'
`-2" solute codes. These mixed llles'weré-"compressed on the
`-3'
`' average by a factor of two.
`
`Character size. Sometimes data is recorded in nonstan-
`dard character sizes. Our Sperry 1 l'U0f60 has a 36-bit word
`and so. at times. has six-bit. eight-bit. and nine-bit data.
`The measurements cited in Table 2 were compressed using
`nine-bit characters throughout. _Wl-ten eight-bit ASCll_
`symbols were being stored in nine-b_i_t bytes, the compres-
`sion results were adjusted to re llefi results as if they had
`been observed on an eight-bit byte. That is. the observed
`-I=_I. compression was 9/8 greater than tlte_._nur_t_obcrs shown in
`-Table.-2. The co'rnt'ir'esse"ti data‘ stream is virtually the same
`regardless of whether the soured"-‘delta was in eight-bit or
`nine-hit"-syntbolsg since length of the compressed image '
`depends otfthe number of symbols encountered rather-
`than their enco'diug. Six-lair data compressed surprisingly
`vrell when compressed on a nine-bit basis. because
`repeated patterns. such as blank spaces. are word aligned
`and have the same pattern on each occurrence. even
`though that pattern is not meaningful when viewed on
`nine-bit increments. As a side iiiiperirnent, nine-bit data
`_ was compressed as.-tr it were eight-bit characters. Poorer
`but not unacceptably poor compression resulted. For ex-
`.. ,3n_1pi¢. a set of nine-bit files which compressed four-to-one
`
`June 1934
`
`"
`
`' "'
`
`13
`
`
`
`is related to the input
`pressed block sizeand, as such,
`block size factored by the compression ratio. The curve
`shown may be m._ovod vertically up or down without loss of
`meaning; the two-to-one compression ratio curve is shown
`here for illustration. The diagonal lines indicate fixed in-
`put block sizes as they would map i_nto compressed block
`sizes." Note that data which compresses very well needs to
`be handled in larger blocks to obtain the highest compres-
`sion efficiency.
`The "highest efliciency point shown in Figure 3-—about
`6K eight-bit bytes after compression-t-is implementation-
`denendent
`can its adjusted. lrlcwever. since this curve
`matches a very convenient Implementation, it will be com-
`mon practlce.-
`-
`.
`
`_
`
`Disk storage. Compression on disik is_ feasible because
`the intemal data‘ format on disk is generally invisible to
`'
`|
`
`COMPUTER '
`
`System location {or compression. When using compres-
`" sion on computer peripherals, a significant system con-
`sideration occurs in deciding where to implement compres-
`sion in the lift) path of Figure 4: in the device controller,
`channel. of central processor: If thecptnpression opera-
`"lion" is executed in the device, it woul