throbber
..-.....i...-M».-_-.»\.:..
`
`'
`
`
`
`.
`L p sh-Pet'forman;ce
` Data C0mprcSSt®n
`
`
`
`
`
`‘
`
`
`
`
`
`strategy poses a different set of these problems and, consequent-4"‘
`ly. the use of each strategy is restricted to applications where its _
`inherent weaknesses present no critical problems.
`This article introduces a new compression algorithm that is
`based on principles not found in existing commercial methods.
`---'1"-his algorithm avoids many of the problems ‘associated with
`older methods in that it dynamically adapts to the redundancy
`characteristics of the data being compressed? An investigation
`into possible application of this algorithm yields insight into the
`compressibility of various types of data and serves to illustrate
`system problems inherent in using any compression scheme. For
`readers interested in simple but subtle procedures, some details
`of this algorithm and its implementations are also‘dcscribcd.5
`The focus throughout this article will be on transparent com-
`pression in which the computer programmer is not aware of the
`existencc of compression except in system performance. This
`form of compression is ‘ ‘noiseless,” the decomprcsséd data is an
`exact replica of the input data, and the compression apparatus is
`given no special program information, such as data type or usage
`statistics. Transparency is ‘jierceivcd to be important because put.-
`ting an extra .burden'on‘thé application programmer would can so
`- -‘»ootsg9:s2/st/oeoo4ioossn:.oo oassa IEEE ,_
`I OfiACLE 1114
`
`COMPUTER '
`
`
`6' taiies or transferred over com-
`‘ stTo'red"‘on"dis’lts
`ass links -in com ercial cotnpoter systems generally
`significant redundancy. A-méclianism or procedure
`_
`;
`_, .hio§_i ec‘odcs.the..data to lessen the redundancy could possibly
`effective data densitites in stored or com»
`. double or"t'ripie the
`
`'
`"
`orcoycr, ifgcomprcssion is automatic: it can
`
`n the rise of software developmgnt costs. A transparent
`._mi>léssion mechanism could permit the use of “sloppy” data
`"es, in that empty space or sparse encoding of data would
`_
`rf¢1t:ste2_r‘t1..vs>..<.:?;=~t:t<1.t?t¢:9§¢..9,t.st9sess.a:2ss9 or .t.1.f.=.»‘.I..*.5.f"“‘i’I\.<=: now-
`c"i'<*ix on 'east:trt4s‘a'- C\‘!1'tVp2‘€‘.'K<l'i3}‘l;‘{'F0C‘éd\Il7e.
`, Sex-em! protalctns encouzncred xx-lie:ri".¢ommo:1 compression
`’etltotls.hr'v il‘tlt\_:r£ttéd in.to:co:ttputcr systems have prevented
`tli wi‘dc§prcnd use oi"auro'rtiatic data compression. For example
`tfirttntime ex__
`‘ speeds interfér§i"‘iii' the attainment of
`‘{2}
`‘
`'_-'=d'a§a rat "
`‘ost techniques are not
`fi_exib_le"e'nough to proccss_,cl1t‘§é enittgipcs ‘of redundancy; (3)
`blocks-‘of compr I
`.
`4
`ve. unpredictable lengths pre-
`sent storage“spa‘ce‘m nage “ant
`
`
`
`
`
`

`
`
`
`.
`
`development costs which would-=oftcn exceed the value of
`compression. As itlustratcd"in Figure 1. compression is
`viewed a$'b'_eing very similar to a communications channel
`at‘-I codes a sequcnce ofbytes into an anirtcial format
`irod'pl1ysica1 handling and later decodes the corn-
`"I
`I
`..
`..
`tessed image back into a_ replica of the original message.
`The data compression described by this model is a
`reversible process-that-is unlike other forms of "data com~
`pr_essiorI_"-___—suclt as-_'d'ala-reduction ‘or data abstraction--in
`to soon relevance cri-
`
`-
`:2
`
`1-
`
`ten and
`anlclt lows“ on ¢9mct.=sti.rm‘_.<i.f.
`nit-ieric data-intermixed:-4n connects‘-_te*npuu ap-
`-' plicattons. Wher_e'_’u'
`' eit_icl_et' heve"pl_eccd per-
`
`tleulsr emphasis on text ootnpresslon"i'n their dtsetrsuonsfi_‘‘_'''
`of comprwsion algorithms for computer data’, "3"titis'
`cle will cover more ofthe systems
`compression
`into acomputé
`
`typical character string,
`Character distribution. In _a
`some characters are used mifiie frequentlsr than others.
`Spec-llicellyf,
`in eight"-bit ASCII representations neatly
`threetfourilis of the possible 256-bit combinations may
`not be used in a specific file. Consequently. nearly two bits
`of each eight-bit packet might be‘ saved. for a possible
`. as-percent '__apaceaftlrne savings.
`In English text.
`the
`ara.t:1t'.l's occur in a wellédocumeoted distn‘o-utlon, with e
`__ and "space" being the most popular. In an inventory
`record. numeric values are very comrnon--the existpnce of
`or packed-decimsl numbersicatt shift the statis-
`tics-—and constraints on field definition can cause
`charfler distributions to varisigttiftcantly from file to~ "
`‘file. For example; the choice of whether to use alphabetic
`or numeric values to identify warehouse sites can shift the
`distribution in theinventory rile. Likewise. the extent to"
`which descriptive tcxtappears in the inventory records will
`ittfluence the average number ofbits needed per character.
`
`Types at redundancy
`
`'
`
`Four types of redundancy can be found in commercial
`data. Redundancy hereis confined to what is observable in -
`a data stream {without knowledge of how the date is to-be -
`interpreted). Redundancy in the form of unused data or
`application-specific correlations am not considered here.
`For illustration. types of redundancy will be described as
`they might be found in two kinds of flies: English text and
`manufacturing parts inventory records (see Figure 2). The
`redundancy categories described arena: lndependent.'but -
`overlap to some extent-._'
`-
`
`_
`
`
`
`-.
`
`..
`
`'
`
`Cltentcter repetition. When a string of repetitions of a .
`single character occurs, the message can usually be-en--.
`'
`coded more compactly than by just repeating the character
`symbol. Such string are infrequent in text. occurring as
`blanlt spaces in place of tube or at the end of lines.
`However. in formatted business files. unused space is very
`common. An inventory record will it-cquently have strings
`of blanks in partially used alphabetic fields. strings of
`zeros in high-order positions of numeric fields. and
`perhaps strings of null symbols in unuscdftelds. Graphical
`images, especially the line drawings of business graphics.
`are mostly composed of homogeneous spaces. and their
`
`
`
`
`
`couratsstntt
`
`;-BOIIPBESSED
`_
`
`'-
`'
`nata
`COWHUNIGITIBHS
`uuss
`"
`
`OUTPUT
`SYMBOLS
`
`°9‘°"‘P3E3$l9.“.
`
`-
`
`'
`
`_
`
`.
`
`'- Figure 1. A model {or a compression system that per-
`. lorrne _l_r_ansparer.it cot-npreeeton.
`-IE‘:
`
`I
`...-g
`FRET HARE: HEX NUT 5'4 K20
`°55G.a.iP.TIUN: STEEL. SIRNQR-ND THBEJW
`COLORCWE: BLINK FIELO
`.
`it
`I
`.
`"""“E“°"-35 *5‘ m5"-" ""“*-'--—-———-..—-...__. sans utut rntuutttrtr .
`HEOGBUFIS IN THIS FILE.
`sroanus sure: tits”
`nuturtwjlti $Tltct_t:_an2‘n‘
`
`._‘_.
`
`.____
`
`E
`
`llAflllBl.E-LEHBTH TEXT
`ENELISH CHAHBACTER
`HBEUBBENCE DISTRIBUYION
`
`LHIIITEB URBIETT OF CMMCTERS
`
`'
`
`stances rautmtne
`
`utmtalc rims‘
`
`' FtgtnI.2.t'fypel"t'il ‘redundant code in a rim-tut_ecI_urlng pens Inventory record.
`
`.
`
`-'-Jurviétglt-i
`.~_.
`.. ,-.\;_I_\‘_;-,\__
`
`9
`
`

`
`Mfllhods oi compression
`
`The precedingxliscussion of redundancy types provides
`a basis for comparing several practical compression meth-
`ods. More theoretical comparisons are provided by Storcr
`and Szymanlti.‘ '
`-
`"
`
`I-lufirnaa coding. The most popular compression meth-
`od is to translate fixed-size pieces of input data into
`variabloiength symbols. The standard Huffman encoding
`procedure prescribes a way toasslgn codes to input sym-
`bols such that each code length in bits is approximately
`logfisymbol probability}
`
`where S)l'l1lh0l_,pl'0hflbilll¥ is the relative frequency or oc-
`currence of a given symbol (expressed as a probability).
`For example, if the seat of symbols (the input ensemble) is
`chosen to be the one-byte ASCII characters {a very typical
`case) and if thcblank character is used one-eighth of the
`time. then the blank character is encoded into a three-bit
`. symbol. ifthe letter Zis found to occur only 0.1 percent of
`‘the time, it is represented-by to bits. '
`ln_nortnal use. tlic size of input symbols is limited by the
`size of the translation table needed for compression. That
`is. a table is needed that lists each input symbol and its cor-
`_respondir_tg__code. Ifa-symbol is one eight-.bit byte (as is
`common), then a table of 256 entries is sufficient. such a
`table is quite economical in storage costs but limits the
`degree of compression achieved. Single-character en-
`coding can cope with character distribution redundancy,
`but not the other types..Cornpression could be improved
`by using input symbols of two bytes each. but that would
`require 9'. table of 64K entries at a cost that might not be
`warranted. Parsing the input data into symbols that are
`not byl.c'-aligned such as 12 bits, isnot likely to improve
`compression t-.fi'ectivene.ss_and would complicate system
`design.
`'
`‘
`,
`A second problem with HulTman encoding is the com-
`plexity of the decompression process. The length of each
`code to be in tcrpretcd for decompression is not known un-
`til the first few bits are interpreted. The basic method for
`intepreting each code is to interpret each bit in sequence
`and choose a translation subtablc according to Whl'.'thet,the
`bit is a one or zero. That’ is. the translation table is essen-
`tially a binary tree. Operationally. that-t,‘a logic decision is
`made for each code bit. In working with a disk drive that
`base 30M-bps transfer rate, the decompression logic must
`cycle at! that rate or faster to avoid creating a system bottle-
`neck. Decompression it that rate is possible but not sim-
`ple. in general. decompression with variable-length sym-
`bols gives a cost/performance disadvantage whenever the
`data. volume is high.
`‘
`A third problem with Huffman encoding is that we need
`to know the frequency distribution for the ensemble of
`possible input symbols.
`in the normal case of single-
`character symbols, character l'r'equcrtcy distribution in the
`data stream is the type of distribution use. Such a distribu-
`tion is known and is probably reasonably stable for Eng-
`lish text. However, the distributions for data tiles are very
`application specific and files such as object code. source
`code. and system tables will have dissimilar characteristic
`distributions. .While it might be possible to have a set of
`
`COMPUTER
`
`increasing integration into textual data can pose a com-
`pression challenge.
`
`I-ligh-usage patterns. Certain sequences of characters
`will reappear with relatively high frequency and can
`tltcrcforcllsc represented with rclatively fewer bits for at not
`saving in tithe;/space. in English. a period followed by two
`spaces l§'_r'rr_ore common than most other three-character
`combinations and could therefore
`rccoded to use fewer
`"bits.
`such as 213. an’: ljlore common than
`lllc ind_li3i5iual'leite"'
`babilities would imply and _might
`‘resistors be"
`__ _ithIewcr bits than the tiv0—charac-
`
`ter symbols tised togct_ll_
`lkewise. unlikely pairs. such
`"as GC, would be encoded witltvery long bit combinations
`to achieve better bit utilization. In particular instances of
`text, certain key words will be r.tsi':‘d" heavily. For example,
`the word “cm-npression" apiamrs frequently in this article;
`consequently, if this article were rrttrcleinto a system file.
`the word "compression" would warrant ‘use of a shorter
`~ "code.
`in inventory records. certain __t'derttil”ters __sut:l_t,.as_
`wareltouse_n-arrtes are exterisivcly reused. Numeric fields
`-' coniitin-préfetriid sequencésin the sense that they contain
`' only sequences of diets; with no letters or special symbols
`intermixed. These c'ould'bé encoded at less than four bits
`per digit,‘ rather than the five to eight bits needed for
`- --gct"tcI'.‘tti' text.
`
`'
`
`
`
` .. =1’-ositlooal. redundancy.
`
`
`.
`
`'
`
`lf certain characters appear
`= s.-_-consist_e'rttiy-._at a predictable place in each block of data.
`then they are at least ‘partially redundant. An e_samp_le of
`‘this. ism-“raster-scanned picture containing a vertical line;
`the vertical line appears as-a blip in the same position in
`each scan, and could be more compactly coded. in inven-
`. -' "tory.fil'es. certain record fields may almost always ha‘-‘E the
`. same entry, such -as .3 -"special handling" field which
`almost always has-Pndne‘--’ in it. Text, on the other hand,
`has virtually no positional redundancy in it.
`
`
`
`‘
`
`Thcsc'l‘qt_u:_.typ_e's of redundancy overlap to some extent.
`For example, inany i_rt__vir'v.='.;ttt't‘_>_t_'y’=_t'r"tr'r't'tet-ic fields will contain
`small integers prcpondertiitlly. These could be compressed
`as a small group of frequently used sequences {the in-
`tegers). as instances of zero strings in front of the integer
`values, as a weighting of the character frequency distribu-
`tion" toward zeros. or amp ositional bias, in that particular
`fields will almost always have nurrteric values with many
`high‘-order zeros. The point here is that almost any com-
`pre'3ssion"trit:_t':l_ta'r_tis'rtl"cancitploit that type of redundancy.
`intent here"-i§":"not to precisely analyze the com-
`ponents of redundancy. but rather to provide a better view
`of the opportunities and challenges for a compression
`algorithm.‘
`
`r-.
`
`.
`
`‘Other types ‘of reditnflhncy exist. beyond those listed here. for example in
`digltlm voice wnvcfo'I'11'IS. -There is ortcnslve redundancy in voice. but
`rnuchof it__b
`i_nth'_e_ frequency domain rather than the |ifl'Icdo=n‘aiI1-
`set-ievol'ee=ti_ata oollectcd can be eliminatecl by dropping unncoessttrif mer-
`sion iryeetttlin contests. but this technique should be seen as data reduction
`rather than data compression bccruseit is not reversible. in general. the
`..nI:thods used tdcomprcss alphanurncric data are not useful on voice, and
`__: vice _'Irl"rp_ (except perhaps for blank ialeqgal flooding}
`-'
`t"t‘.|t'Ii1 -
`
`.-
`
`to
`
`,3;
`
`'
`
`-
`
`”
`
`r
`
`-l..-l.-..
`
`..
`.
`
`10
`
`

`
`. 5
`
`generic translation tables to cover these various cases.
`problems arise in having the decornpressor use the cattle set
`of tables as the compressor. A common solution is to
`anelyie each data block individually to adapt the character
`- di_stribution- uniquely to that block. We must make two
`.jf ‘passes over the data: (1) a pass to count characters and per.
`form a sort operation-ttvtithe character table. and (2) a pass
`for encoding} The derived translation table must be con-
`veyed with the cornpi"esse'd data. which detracts from the--
`:_ compression effectiveness and /or strongly restricts the size
`:
`7 oFllttt'lféli’iIlfitl011‘tIi«ble.-This adaptable approach is accept-
`','
`_t_t_l_J_!_¢ if high transfer rates'throu'glt_{l{t'e compressor are not
`
`- required and’ll’ the dent"-‘blocks
`are very
`large relative‘ to the-size of the tmnsliiiéiii’-table.
`
`3
`
`Run-len5g'ti:t:'encoding. Sequences of identlcalfltuacters
`can be encoded as a count field plus an identifier of the
`
`
`
`-
`
`of characters '_i_avt'-'t"i‘t'~._i'_'g'_i_'5s"t"t"r'te for ASCI1text, st‘: not for ar-
`bitrary bit. pa't_terns such as those in binary integers.
`Tltpicfli‘-Y. three cliarlatgtersftire needed to mark each char-
`acter _run.___.so]-'tt'is‘re'r1
`in‘g' iv‘
`Id not be used for runs of
`
`
`presslon. While progiarnmed com-
`Progranii:rte___
`_
`pression doeshot meet" ‘the ‘transparency constraint desir-
`able for gene'r'al_-,us,_e systems, it is discussed here to show
`.th_e-_ types of technidues used. The programming is general-
`ly done by, the anplirations p_rogrammer or data manage-
`ment systern.'ln formatted daiti_'ill_és;"several techniques
`are used. _l_J_r_tuse'd blank or" hero spaces are eliminated by
`making l't'eIcl}.'v'a.'r_iablt;:':irt length and using an Index struc-
`ture with pointers'io‘iéadlt
`position E_'_r__e_‘.di‘étable field
`values are compactly "encoded
`__
`‘y.-5'l‘5f"'ti.' code table-
`such as when iivarehouse namcsare given as integer codes
`rather than as alphabetic English names. Each field has its
`own specializod code table that deals with positional
`redundancy. Since programmed ‘compression cannot ef-
`fectively handle character distnbtttién redundancy, it is a
`' nice complernc_n_t to _l-luffrnan coding.
`_
`Programmed compression has seiiéral serious disadvan-
`tages.
`it introduces increased _.program development eit-
`penses: the type of decompression used requires a knowl-
`edge of the record structure and the code tables; the choice
`of field size; and code tables may complicate or inhibit
`later chang'e§':to the data structure making the software
`more eitpens'tve_to maintain. Perhaps most irnportant.
`progtamrriersvvillitend to avoid the decompression process
`because it is relatively slow and. therefore, will work with
`data in main memory_ in its compressed format. which
`confuses and complicates the application program.
`
`
`H" '
`
`'
`
`hols into l"uted-length (or predictable length) codes. The
`symbol strings are selected so that all have almost equal
`prgpabnity of occtsrrcncc. Consequently, strings of fre-
`guintly occurring symbols will contain more symbols than
`ifistring has-ins infrequent symbols {see cttarrtple in Table
`1).. ilhis form of compression is effective at exploiting
`character frequency redundancy. character repetitions,
`and high—usage pattern redundancy. However. it is not
`generally effective on positional redundancy.
`_
`This type of algorithm is adaptive in the sense that it
`starts with an empty table o l‘ symbol strings and builds the
`table during both the compression 1.11161 decompression
`processes. These are one-pass procedures that require no
`prior information about the input data statistics and eat-
`ecute in time proportional to the length of the message.
`This adaptivity results in poor compression during the in-
`itial portion of each message: as a result. the message must
`be long enough for the procedure to build enough symbol
`frequency experience to achieve.goo_d compression over
`the full message. On the.other hand. most finite im-
`'-plementatiotts of an adaptive algorithm lose the ability to
`adapt after a certain amount or the message is processed.
`If the message is not homogeneous and its redundancy
`characteristics shift during the message, then compression
`efficiency declines if the message length significantly ex-
`oeedsthe adaptive range of the comptessiort~irnp'1_ernenta-.
`tion.
`- ~
`
`.
`
`.
`
`-~
`
`-
`
`‘.-
`
`,
`
`-
`
`-
`._
`
`-
`
`.
`
`_
`
`.
`
`.
`
`l
`
`.
`
`.
`
`‘
`
`_
`
`t
`
`,
`
`_
`
`-
`
`
`
`
`
`,
`
`cont
`1
`2
`3-
`4
`5
`6
`7
`3
`9
`10
`11
`12
`13
`1-I
`- 15
`iii
`17
`13
`19
`20
`
`Table 1. A compression string table with alphanumeric
`character strings‘ that are encoded Into 12-bit codes. in
`this example. intrequent letters. such as Z. are assigned
`individually to a 12-bit code. Frequent symbols. such as
`space and zero. appear in long strings which in practice
`can exceed so characters In length. Good contptesslon is
`achieved when along string is encountered in the Input. it
`is replaced by a 12-bit code. ellectlng slgnlllcant space
`savings.
`'
`.
` %_——¥4g
`SYMBOL
`-
`.
`STFllNG
`A
`ha
`AN
`atttt
`an
`z
`o
`no
`not
`5
`55
`this
`slits
`snubs
`utittm
`n
`no
`non
`0000
`000-01
`
`Adaptive'cozittt't§ession. The Len1pel~Ziv‘-5 and related
`algori‘thms7 ‘convert -variable-length strinsi of input sym-
`
`sss
`
`4095
`
`.t_ur.ie i984
`
`11
`
`

`
`Q_P':‘°Wed compression results.
`
`Effectiveness of__ compression is expressed as a ratio
`relating the number‘ of bits needed to express the message
`before and after compression. The compression ratio used
`here'wili be the uncompressed hit count divided by the
`compressed .b'rt count. The resulting value. usually greater
`than one, indicates the factor of increased data density
`achieved by compression. For example. compression that
`serves to eliminate half the bits of a particular message is
`pron-tried as achieving a 20 compression ratio. indicating
`that two-to-one compression has been achieved.
`Compression ratios.-develolriéd by software simulation.
`-are given in ‘Table 2'for several data types. Many of these
`observations
`from compression of backup/recovery
`tapes from a Sperry Univac ll00/60-‘maahiiie'- used in a
`scientific -environment. Most of the samples involved
`several different files and covered 105 to 107 bytes each to
`provide meaningful averages. Several versions of Lempe1-
`Ziv compression algorilhrns were used; however, since the
`various alzotithtns produced results that were consistent
`to within ‘a few percent. no attempt
`is made here to
`distinguish the specific algorithm used for each case.
`
`English text. Text samples for compression were ob-
`tained from ASCII word processing filesin-a technical en-
`vironment. Results were reasonably oortsistcnl for simple
`text, at a compression ratio of 1.8. Many word processing
`fries. however. compressed better tharttlmt when they con-
`tained figures. formatted data. or presentation material
`like viewgraphs. Surprisingly. long individual documents
`did not compress better than groups of short documents,
`indicating that the redundancy is not due very much to
`correlation in content. For comparison. Rubin’ achieves
`up to a 2.4 ratio using more complex algorithms. Pechural
`observed a 1.5 ratio using Huffman encoding. These com-
`parisons are not very reliable. since the subject texts may 1.
`have dissimilar properties.
`
`Coboi files. A significant number of large -Cobol Files
`from sev_eral types of applications were compressed. pro-
`ducins widely variable results. Compression depends on
`record format. homogeneity across data records. and the .
`‘extent of integer usage. These were eight-bit ASCII files.
`so the integer data would compress very well. A side ex-
`periment showed that one third to two thirds of the space
`in some of these files appeared as strings of repeated iden-
`tical eharacters. indicating a high fr_acti_on of blank space.
`After allowing for that much blank space, there normally
`was a further two-to-one compression within the useful
`data. Restricted character usage probably caused two
`thirds of this latter two-to-one compression. with repeated
`character patterns giving the rest.
`_
`
`Floating point numbers. Arrays of floating point
`numbers look pretty much like white noise and so they
`compress rather poorly. The fractional part is a nearly ran-
`dom bit pattern. The exponent does yield some compres-
`sion when most of the numbers have the same ntagttitudc.
`Some floating point arrays expand by 10 percent going
`through the compression algorithm. when the exponents
`
`COMPUTER
`
`a.,,.,‘-an-if,
`
`
`
`'i‘he compression procedure discussed here is called the
`LZW-method‘.-‘Miariatlon on the Lempel-Ziv procedure. -"
`it
`retains" the" ‘adaptive properties of Lempel-Ziv and
`achieves" about" the same compression ratios in normal
`con1merci'sl"'eomputer applications. Lzw is distinguished
`° "by its very simple logic. which yields relatively inexpensive
`implementations and the potential for very fast execution.
`A typical"‘-liilliltliitplementation operates at under three
`cloeit cycle's‘-'-"p‘e'i‘ symbol and-rachieves acceptably good
`¢0mr!r5S'Si'on='0h niessages-tr: the_'n:sgni_tude of tert-'titou-
`
`
`
`Contpresston parameters. The preceding discussion of
`redundancy and compressiontétn be summarized bypoint-
`out-the'pararneters that influence thechoice of a corn-
`pressililt-"s'tt'iiteg3't. -The redundancyttype-found in a certain
`application is important, but prediciitbiliiy of redundancy '
`-type-is also important. An adaptive capability in a com-
`pression procedure would=be"o'f 'littie'benefit for applica-
`- Lions-=with predictable redundancy such as in English text,
`'b'ut*'it"would be valuable for business-files. System data
`transfer rates pt__e__;__r_r_tu'ne if a one”-pass procedure is needed-
`for
`th§__'gre_a_tetf-"overhead of a two-pass ap-
`be'3ua"iI¥ed"*l5§5etter compression results. The
`length of the meesage"l5eing-transmitted or retrieved has
`soflteirhporunce because adaptive"te'cl1niques are awk-
`orineffectlve on short messages.
`it is more difficult to analyze the compresion costs that
`reflect hurnarr_"itilrolvernent- Nottadaptive algorithms re-
`r't”'iiil‘e'p'riol“-"'lt'l‘i'owlet_i_}g"e__:-of the data characteristics. which
`r'ie'rta'ips'provrui':'tt lf:'_y*’i-t'i'ari'nat-crass-irieatron of messages
`"‘ihto"categories having preanalyzed "redundancy profiles.
`' Oi‘ programmed compression offers the greatest -
`density improvement at the cost of very high development
`and software maintenance costs.
`The focus of this article is ortthe problems found in con-
`ventioiial data stortt§e"ht"'eoif'rtlneteiaJ computer systems.
`In that applicat'idi‘i§"‘3‘tlie*"‘hi§li ' data transfer
`rates of
`'i"ntlg'1'lelit: cliilts preclude extensive runtirhe overhead. while
`
`the diversity of the data stored essentially requires an
`adaptlve"'p‘i'T:i'ci3iiul‘e. 1111;;
`'tg'_F.Ԥlfori segments on a disk
`complicates the use of dalptive approach. but this is
`becomingless important as data block sizes on disk are in-
`ereasirtg. The Lzw method seems to satisfy these re-
`quirements better than other compression approaches.
`Therefore. its characteristics will betrsed in the following
`discussion. which points out the'opporturtities and prob-
`‘Isms of compression in data-storage.
`
`_
`"' 7
`-
`
`. '
`
`'-
`
`
`
`'
`
`
`
`"
`
`'
`
`*"
`
`,
`-"
`
`'
`
`'
`
`cl
`
`a’
`
`Table 3. compression results tor a variety of data types.
`.
`
`DATA TYPE
`COMPRESSION RATIO
`
`
`
`English ‘last
`__
`_
`Cobol Files
`.
`Flaunt Pvlnilwavs
`Fcnrriitteq-si:lsnIifre‘Dala'
`
`_
`
`.
`
`. Fragrant-‘Source coca
`--
`
`._-H“ _
`12 _
`
`12
`
`1.8 '
`2 to 6
`1.0
`2:1
`
`._-.;.__
`
`.
`
`.
`
`I.
`
`I
`
`I
`
`_
`
`E
`
`‘
`III‘.
`.=.i.._'
`
`2.3
`l«.5
`
`.
`
`.
`
`-
`
`-
`
`..
`i
`
`

`
`as nine-bit input symbols compressed about 2.8-to-one
`pthen processed as a sequence of eight-bit chunks. All of
`.t‘n1'.s -Evidence indicates tha't compression can be effective
`even when dealing with mixed character types.
`
`Recovery tapes. Computer backup/recovery tapes
`(called secure tapes at Sperry) contain copies of active user
`tiles. As such. each tape exposes a cross section of data
`used in its system. although that profile of active files may
`differ from the profile of all files found on the system
`disks. Cotnp_ressi_ng iheso tapes with the LZWK algorithm
`has produccd'soi1}sthi_nglcss than __a two-to-one space re-
`daction in s, n_sls1.lll_fl¢iitstallati.9_ii {I'll slish=lv'betterth9-n #-
`tw_o_-_to-one reduction in a prb‘gra'rti'-developrrtent-oriented
`inst}allation‘."r‘~‘rorm'these_results, we might expect a three-
`to-one reduction in traniiaction-oriente_d'st'stetns where
`formatted data predominates.
`.
`
`System considerations
`
`__
`
`', The availability of an appropriate compression method I
`does not assure an effective system. Several problems
`occur when integrating compression into eitisting com-
`puter systems. These are described here in terms of both
`problem type and impact. on specific computer p‘e_l'.ip_h§Ela_‘1§.
`
`i
`
`:1 E.-. '.
`
`Unpredictable message length. The length of a com
`pressed image for a given message is unpredictable because
`it depends on the content or the message. We have no
`assurance prior to compression that a message will com-
`press at all: in some casesit-may everrexpand a little.
`Therefore. the space ‘allocated for the compressed image
`must be at least as big as the space allocated for the original
`message; this requirement is a complication on allocated
`devices such as a disk. Further. an update to a data record
`that alters just a few characters can change the compressed
`size and rnity result in a required change in allocation-
`even for minor update operations. Unpredictability in
`length also impacts bandwidth requirements: as a result.
`data path capacities have to be overdesigned to handle
`peak demands.
`‘
`'
`
`\\
`
`reversible
`Error tolerance. A characteristic of most
`compression systems is that if one bit is altered in the com-
`pressed image. much ofthe-message is garbled. Therefore.
`a given error rate may be acceptable for the transmission"
`of uncompressed data. but the same rate may be unaccept-
`able for compressed data. Special efforts to detect errors
`are needed. especially to catch any errors in the compres-
`sion and decompression processes. Note that a parity bit"
`added to each data character before compression would be
`removed as redundancy by the compression
`so
`character pltrity has no value in checking the compressor.
`Cyclic checks over. the entire message are effective. how-
`CV61’ .
`.
`.
`
`'
`
`'
`
`This intolerance to errors often limits the use of com-
`pression. For examplendigitized speech can tolerate occa-
`sional bit losses. but compression turns these errors into
`serious losses. On the other hand. since most commercial
`data storage devices, such as disk and tape, have accept-
`ably low error rates_ and there is no attempt to use their
`
`13
`
`vary widely. Eltpansion'undct' an adverse data pattern can
`occur under alrnost any adaptive compression method.
`Randomized bit patterns usually cause this effect. since
`there are no redundancy savings to offset the overhead‘
`that: is. inherent in coding the choice of parameter tvalues
`or implicit in the procedure.
`
`
`
`'
`
`' "I-‘otntatted seleuti'flc_dataL Most data used by Fortran
`programs ten_d_ed_to compress about 50 percent. This data
`included input .d_ata._prirttarily_._integers. it also included
`._;_.--.:.p_rtrtt filespxyhich 'we_re_Asc11_ epceo, Most or these rues
`
`- 3 _- are ‘short. s9.nut-10-file-temple. dill-l40t.ene_ompass ta _larg.e
`
`quantity of data. __
`
`'
`
`System log tlv.t_tt..l_nfo_rmation describing past syitem ac-
`tivity. such as job start and stop times. is tnostlyiortttatted
`integers and is therefore reasonably compressible. This log
`data is used for recovery and constitutes perhaps 10 per-
`ccnt of the datastored on backup/recovery tapes. It tends
`to be in a tightly packed. fixed-length format. so the corn.
`pressiorl"achie'.ved._is due to null fields and r_epctitiq3n__mthe
`
`data-Rralues. .
`-.
`
`
`
`
`T“
`
` E75
`Source code cart,
`ontpressed by a factor
`__ _
`_
`of better than two. 1!. can be cOl'l‘tprcS8_Bi1 better than text
`because w0tds..a:e,-frequently repeated nnd_b_lanlt spaces
`are introduced "by--thesouree code format. iiighly struc-
`tured programrning yields source code with greater corn-
`pressibility than the ayerage; 213 factor cited here. Object
`code. on ute;.9fl39_F Iigiix-!t_CIl_)nl$lSl3'3__.9_f_3Fbllr3Yy bit patterns
`. does not eorttpressias 'wel1’.'Un'eiren_ usage of op codes
`' ‘arid incomplete; utilization of displacement fields would
`account-for. most-.0f—,the compression achieved. Relocat-l
`able code and absolute code differ in compressibility
`dcpc nding on the treatment of blanking for empty variable
`- areas. Actual program files in a development organization
`
`' were found to be mixtures of source. relocatable. and ab-'
`-2" solute codes. These mixed llles'weré-"compressed on the
`-3'
`' average by a factor of two.
`
`Character size. Sometimes data is recorded in nonstan-
`dard character sizes. Our Sperry 1 l'U0f60 has a 36-bit word
`and so. at times. has six-bit. eight-bit. and nine-bit data.
`The measurements cited in Table 2 were compressed using
`nine-bit characters throughout. _Wl-ten eight-bit ASCll_
`symbols were being stored in nine-b_i_t bytes, the compres-
`sion results were adjusted to re llefi results as if they had
`been observed on an eight-bit byte. That is. the observed
`-I=_I. compression was 9/8 greater than tlte_._nur_t_obcrs shown in
`-Table.-2. The co'rnt'ir'esse"ti data‘ stream is virtually the same
`regardless of whether the soured"-‘delta was in eight-bit or
`nine-hit"-syntbolsg since length of the compressed image '
`depends otfthe number of symbols encountered rather-
`than their enco'diug. Six-lair data compressed surprisingly
`vrell when compressed on a nine-bit basis. because
`repeated patterns. such as blank spaces. are word aligned
`and have the same pattern on each occurrence. even
`though that pattern is not meaningful when viewed on
`nine-bit increments. As a side iiiiperirnent, nine-bit data
`_ was compressed as.-tr it were eight-bit characters. Poorer
`but not unacceptably poor compression resulted. For ex-
`.. ,3n_1pi¢. a set of nine-bit files which compressed four-to-one
`
`June 1934
`
`"
`
`' "'
`
`13
`
`

`
`is related to the input
`pressed block sizeand, as such,
`block size factored by the compression ratio. The curve
`shown may be m._ovod vertically up or down without loss of
`meaning; the two-to-one compression ratio curve is shown
`here for illustration. The diagonal lines indicate fixed in-
`put block sizes as they would map i_nto compressed block
`sizes." Note that data which compresses very well needs to
`be handled in larger blocks to obtain the highest compres-
`sion efficiency.
`The "highest efliciency point shown in Figure 3-—about
`6K eight-bit bytes after compression-t-is implementation-
`denendent
`can its adjusted. lrlcwever. since this curve
`matches a very convenient Implementation, it will be com-
`mon practlce.-
`-
`.
`
`_
`
`Disk storage. Compression on disik is_ feasible because
`the intemal data‘ format on disk is generally invisible to
`'
`|
`
`COMPUTER '
`
`System location {or compression. When using compres-
`" sion on computer peripherals, a significant system con-
`sideration occurs in deciding where to implement compres-
`sion in the lift) path of Figure 4: in the device controller,
`channel. of central processor: If thecptnpression opera-
`"lion" is executed in the device, it woul

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket