`US 6,624,761 B2
`(10) Patent No.:
`(12)
`Fallon
`(45) Date of Patent:
`*Sep. 23, 2003
`
`
`US006624761B2
`
`(54) CONTENT INDEPENDENT DATA
`COMPRESSION METHOD AND SYSTEM
`
`.
`
`(75)
`Inventor:
`James J. Fallon, Armonk, NY (US)
`:
`.
`(73) Assignee: (us) Data, LLC, New York, NY
`:
`:
`:
`:
`:
`(*) Notice:
`Subjectto any disclaimer, the term ofthis
`patent is extended or adjusted under 35
`US.C. 154(b) by 0 days.
`
`This patent is subject to a terminal dis-
`claimer.
`
`(21) Appl. No.: 10/016,355
`ad.
`(22)
`Filed:
`Oct. 29, 2001
`(65)
`Prior Publication Data
`
`US 2002/0097172 Al Jul. 25, 2002
`
`Related U.S. Application Data
`
`(63) Continuation-in-part of application No. 09/705,446,filed on
`Nov. 3, 2000, now Pat. No. 6,309,424, which is a continu-
`ation of application No. 09/210,491, filed on Dec. 11, 1998,
`now Pat. No. 6,195,024.
`
`(51)
`
`Int. Ch? eee GO6F 13/12; GO6F 13/38;
`H03M 7/34; H0O3M 7/38
`(52) US. C1. ee cneseteescscteneeenecneeeees 341/51; 710/68
`(58) Field of Search oe 341/51, 67, 106,
`341/50, 79; 382/166, 100, 239; 710/52,
`68; 348/458; 707/101; 235/431; 709/208;
`704/245; 375/240
`
`(56)
`
`References Cited
`U.S. PATENT DOCUMENTS
`
`4,682,150 A *
`4,872,009 A
`4,929,946 A
`5,045,852 A
`5,097,261 A
`5,212,742 A
`
`7/1987 Mathes et al. wesc. 235/431
`10/1989 Tsukiyamaetal.
`5/1990 O’Brienetal.
`9/1991 Mitchellet al.
`3/1992 Langdon, Jr. et al.
`5/1993 Normileetal.
`
`5,231,492 A
`7/1993 Dangietal.
`5,237,675 A *
`8/1993 Hannon, Jr oe 710/68
`5,243,341 A
`9/1993 Seroussiet al.
`5,243,348 A
`9/1993 Jackson —
`5,270,832 A
`12/1993 Balkanskietal.
`5,379,036 A
`1/1995 Storer
`5381145 A
`1/1995. Allen etal.
`5,394,534 A
`2/1995 Kulakowski etal.
`5,461,679 A
`10/1995 Normile
`et
`al.
`5,467,087 A
`tyt99, Chu “ae
`5,471,206 A
`11/1995 Allen etal.
`5,479,587 A
`12/1995 Campbell etal.
`5,486,826 A
`1/1996 Remillard
`5,495,244 A
`2/1996 Jeongetal.
`5,533,051 A
`7/1996 James
`5,557,749 A *
`9/1996 Norris woe: 709/208
`5,583,500 A
`12/1996 Allen et al.
`5,627,534 A
`5/1997 Craft
`5,654,703 A
`8/1997 Clark, I
`5668737 A
`9/1997 ler
`5,717,393 A
`2/1998 Nakanoetal.
`5,717,394 A
`2/1998 Schwartz etal.
`5,729,228 A
`3/1998 Franaszeketal.
`5,748,904 A
`5/1998 Huangetal.
`5,771,340 A
`6/1998 Nakazato etal.
`5,784,572 A
`7/1998 Rostokeretal.
`5,799,110 A
`8/1998 Israelsen etal.
`5,805,932 A
`9/1998 Kawashimaetal.
`:
`:
`(List continued on next page.)
`Primary Examiner—Patrick Wamsley
`(74) Attorney, Agent, or Firm—¥. Chau & Associates, LLP;
`Frank V. DeRosa, Esq.
`(57)
`ABSTRACT
`
`Systems and methods for providing fast and efficient data
`compression using a combination of content independent
`data compression and content dependent data compression.
`In one aspect, a method for compressing data comprises the
`steps of: analyzing a data block of an input data stream to
`identify a data type of the data block, the input data stream
`comprising a plurality of disparate data types; performing
`content dependentdata compression onthe data block,if the
`data type of the data block is identified; performing content
`independent data compression on the data block, if the data
`type of the data block is not identified.
`
`22 Claims, 34 Drawing Sheets
`
`Content Dependent Encoders
`
`
`
`
`
` Content
`
`
`Data Block
`Input Data
`Counter
`Buffer
`
`
`
`
`
`Data/File
`Recognition
`List(s) or
`Algorithm(s}
`
`51310
`
`[EncoderOr
`
`
`
`Content Independent
`Encoders
`
`
`Encoder E3
`
`No
`
`
`
`Commvault Ex. 1014
`Commvault v. Realtime
`
`US Patent No. 9,054,728
`
`Page1
`
`Page 1
`
`Commvault Ex. 1014
`Commvault v. Realtime
`US Patent No. 9,054,728
`
`
`
`US 6,624,761 B2
`Page 2
`
`U.S. PATENT DOCUMENTS
`
`5,809,176 A
`5,818,368 A
`5,818,530 A
`5,819,215 A
`5,825,424 A
`5,847,762 A
`5,861,824 A
`
`9/1998
`10/1998
`10/1998
`10/1998
`10/1998
`12/1998
`1/1999
`
`Yajima
`Langley
`Canfield et al.
`Dobsonetal.
`Canfield et al.
`Canfield et al.
`Ryuetal.
`
`5,917,438 A
`5,064,842 A
`5,991,515 A
`6,031,939 A
`6,195,024 Bl *
`6,309,424 B1 * 10/2001
`6,529,633 Bl *
`
`Ando
`Packard
`Fall etal.
`Gilbertet al.
`Fallon ......cccesseeeeeeeeeees 341/51
`
`Fallon ...........
`Easwaret al. wee 382/239
`
`* cited by examiner
`
`Page 2
`
`Page 2
`
`
`
`U.S. Patent
`
`Sep. 23, 2003
`
`Sheet 1 of 34
`
`US 6,624,761 B2
`
`INPUT DATA STREAM
`
`SIGNAL
`
`
`IDENTIFY INPUT DATA TYPE AND
`GENERATE DATA TYPE IDENTIFICATION
`
`2
`
`DATA TYPE
`ID SIGNAL
`
`COMPRESSDATA IN ACCORDANCE WITH
`IDENTIFIED DATA TYPE
`
`3
`
`COMPRESSED DATA STREAM
`
`RETRIEVE DATA TYPE
`INFORMATIONOFCOMPRESSED
`
`5
`
`WITH IDENTIFIED DATA TYPE
`
`DECOMPRESSDATA IN ACCORDANCE
`
`ao
`
`FIG. 1
`PRIOR ART
`
`[oo4
`
`Page 3
`
`TO——_—_—_—_—oT
`
`— | | |
`
`!
`
`| | | | |
`
`Page 3
`
`
`
`U.S. Patent
`
`Sep. 23, 2003
`
`Sheet 2 of 34
`
`US 6,624,761 B2
`
`NOISSSUdNOD
`
`
`
`NOISSSYdWODOlleLNdNI
`
`
`
`Viv¥dQ30d0ONE
`
`HOLAYOSAGé¥SaLNNOD
`
`rd3sd4Ng
`
`bYSINNOOD
`
`b3Y3QOON]
`
`
`
`WVSYLSVLYO
`
`
`NOLdidOS3a0INOLLYNINYSLSO€¥3BLNNODdina
`NOSIMWdNOD.
`
`
`
`09
`
`02Ob
`
`fdajina
`
`UYALNNODU3YSGOONS
`
`or-08
`
`éOld
`
`Page 4
`
`Page 4
`
`
`
`
`
`
`
`
`
`U.S. Patent
`
`Sep. 23, 2003
`
`Sheet 3 of 34
`
`US 6,624,761 B2
`
`RECEIVE INITIAL
`DATA BLOCK FROM
`
`INPUT DATA STREAM
`DATA BLOCK
`
`COUNTSIZE OF
`
`300
`
`302
`
`B
`
`BUFFER DATA BLOCK
`
`ENCODER
`
`COMPRESS DATA
`BLOCK WITH
`| ENABLED ENCODERS
`
`BUFFER ENCODED
`DATA BLOCK OUTPUT
`FROM EACH
`
`COUNTSIZE OF
`ENCODED DATA
`
`BLOCKS
`RATIOS
`THRESHOLDLIMIT
`
`CALCULATE
`COMPRESSION
`
`COMPARE
`COMPRESSION
`RATIOS WITH
`
`310
`
`312
`
`314
`
`FIG. 3a
`
`Page 5
`
`Page 5
`
`
`
`U.S. Patent
`
`Sep. 23, 2003
`
`Sheet 4 of 34
`
`US 6,624,761 B2
`
`A
`
`316
`
`
`iS
`COMPRESSION
`
`
`RATIO OF AT LEAST ONE
`
`ENCODED DATA BLOCK
`
`GREATER THAN
`
`THRESHOLD?
`
`NO
`
`
`
`
`APPEND NULL
`SELECT ENCODED
`
`DESCRIPTOR TO
`DATA BLOCK WITH
`
`
`UNENCODED INPUT
`
`GREATEST
`
`DATA BLOCK
`
`COMPRESSION RATIO
`
`
`
`
`
`
`
`
`
`APPEND
`
`CORRESPONDING
`DESCRIPTOR
`
`OUTPUT UNENCODED
`DATA BLOCK WITH
`NULL DESCRIPTOR
`
`
`332 RECEIVE NEXT DATA
`
`
`PROCESS 330
`
`
`
`
`
`DATA BLOCKSIN INPUT
`STREAM?
`
`YES
`
`TERMINATE DATA
`COMPRESSION
`
`BLOCK FROM INPUT
`STREAM
`
`B
`
`FIG. 3b
`
`318
`
`320
`
`Page 6
`
`Page 6
`
`
`
`U.S. Patent
`
`Sep. 23, 2003
`
`Sheet 5 of 34
`
`US 6,624,761 B2
`
`
`
`
`
`NOIdIYOSAG€HALNNODdYaasng
`OLASIOZYALNNOO
`SdALpgaing€3M30O9NA¥Lva
`
`
`
`NOLLVNINMSL3GSYOLOVS
`
`4OBYNDSISY3ICGOONA
`LISIALnavuisaa
`
`
`NOSIevdWOO
`
`A
`
`fdasina
`
`UYALNNOD|
`uzYSGOONS
`
`
`
`VLVGQ2000N3
`
`/MWWAYLS
`
`fd3ssAng
`
`bYALNNOD
`
`baYSaqOONa
`
`fa3aana
`
`23YSQOON3
`
`0g
`
`vOld
`
`NOISSSHdNOD||Lossay4WooINdNI
`
`
`
`AWVAYLSVLYCG
`
`Page 7
`
`Page 7
`
`
`
`
`
`
`
`
`
`
`
`
`
`U.S. Patent
`
`Sep. 23, 2003
`
`Sheet 6 of 34
`
`US 6,624,761 B2
`
`500
`
`502
`
`504
`
`RECEIVE INITIAL
`DATA BLOCK FROM
`INPUT DATA STREAM
`
`B
`
`COUNTSIZE OF
`
`DATA BLOCK
`
`BUFFER DATA BLOCK
`
`COMPRESS DATA
`BLOCK WITH
`ENABLED ENCODERS
`
`APPEND CORRESPONDING
`DESIRABILITY FACTORS TO
`ENCODED DATA BLOCKS
`
`BUFFER ENCODED DATA
`BLOCK OUTPUT
`FROM EACH ENCODER
`
`508
`
`510
`
`COUNTSIZE OF
`ENCODED DATA
`BLOCKS
`
`CALCULATE
`COMPRESSION
`RATIOS
`
`512
`
`514
`
`COMPARE COMPRESSION
`RATIOS WITH THRESHOLD
`LIMIT
`
`316
`
`A
`FIG. 5a
`
`Page 8
`
`Page 8
`
`
`
`U.S. Patent
`
`Sep. 23, 2003
`
`Sheet 7 of 34
`
`US 6,624,761 B2
`
`518
`
`
`iS
`COMPRESSION
`
`
`RATIO OF AT LEAST ONE
`
`
`ENCODED DATA BLOCK
`
`GREATER THAN
`THRESHOLD?
`
`
`
`
`
`
`APPEND NULL
`CALCULATE FIGURE OF
`DESCRIPTOR TO
`
`MERIT FOR EACH ENCODED
`
`
`
`
`DATA BLOCK WHICH EXCEED
`UNENCODEDINPUT
`DATA BLOCK
`THRESHOLD
`
`
`
`
`
`
`SELECT ENCODED DATA
`BLOCK WITH GREATEST
`
`FIGURE OF MERIT
`
`520
`
`
`
`
`
`APPEND
`
`
`CORRESPONDING
`DESCRIPTOR
`
`
`
`OUTPUT UNENCODED
`
`DATA BLOCK WITH
`NULL DESCRIPTOR
`
`DATA BLOCK WiTH
`DESCRIPTOR
`
`
`
`
` OUTPUT ENCODED
`
`
`
`
`MORE
`DATA BLOCKSIN
`
`
`INPUT STREAM?
`
`
`PROCESS
`
`
`
`
`
`
`YES
`
`TERMINATE DATA
`COMPRESSION
`
`RECEIVE NEXT DATA
`BLOCK FROM INPUT
`STREAM
`
`534
`
`B
`
`FIG. 5b
`
`Page 9
`
`Page 9
`
`
`
`U.S. Patent
`
`Sep. 23, 2003
`
`Sheet 8 of 34
`
`US 6,624,761 B2
`
`
`
`VLiVOGACOON2
`
`/MWVAALS
`
`YOLAYOSAG
`
`fdaaine
`
`éY¥aLNNOOD
`
`¢3Y3SQOON3
`
`fa3sina
`
`b¥SJINNOS
`
`b3Y3SQOONS
`
`
`
`
`
`WVSULSVLYO
`
`AdAL
`
`NOISSSYdWOO
`NOISSSYdNO0O
`
`OlLva
`
`NOWdIdOSs3a0
`JNOLLYNINYS130
`NOSIYVdWOO
`€YS.LNNOD
`
`
`
`fdassand
`
`€3Ya0OON3
`
`LACINI
`
`VLVO
`
`yadAng
`
`fdassine
`
`UYALNNOD
`
`9Dld
`
`
`
`FALLGaldloadsS
`
`“dasn
`
`Page 10
`
`Page 10
`
`
`
`
`
`
`
`
`
`
`
`U.S. Patent
`
`Sep. 23, 2003
`
`Sheet 9 of 34
`
`US 6,624,761 B2
`
`700
`
`70
`B
`
`INPUT INITIAL
`DATA BLOCK FROM
`INPUT DATA STREAM
`
`COUNT SIZE OF
`DATA BLOCK
`
`704
`
`BUFFER DATA BLOCK
`
`706
`
`INITIALIZE TIMER
`
`708
`
`
`
`BEGIN
`COMPRESSING
`DATA BLOCKWITH
`ENCODERS
`
`710
`
`NO
`
`NO
`
`ENCODING
`COMPLETE?
`
`
` STOP
`ENCODING
`
`PROCESS
`
`
`
`
`718
`
`
`SUFFER
`BUFFER ENCODED
`
`
`
`DATA BLOCK FOR EACH
`
`
`ENCeT ENCODER THAT
`
`
`
`OGM EACH
`COMPLETED ENCODING
`
`
`ENCODER
`PROCESS
`
`
`
`WITHIN TIME LIMIT
`
`
`
`
`
`
`
`COUNTSIZE OF
`
`ENCODED DATA
`BLOCKS
`
`720
`
`
`
`
`CALCULATE
`
`COMPRESSION
`RATIOS
`
`722
`
`
`
`COMPARE COMPRESSION
`
`
`RATIOS WITH THRESHOLD
`
`
`LIMIT
`
`
`
`FIG. 7a
`
`724
`
`Page 11
`
`Page 11
`
`
`
`U.S. Patent
`
`Sep. 23, 2003
`
`Sheet 10 of 34
`
`US 6,624,761 B2
`
`726
`
`
`iS
`COMPRESSION
`
`
`RATIO OF AT LEAST ONE
`
`ENCODED DATA BLOCK
`
`GREATER THAN
`
`THRESHOLD?
`
`
`
`
`
`
`APPEND NULL
`SELECT ENCODED
`
`
`
`DATA BLOCK WITH
`DESCRIPTOR TO
`
`
`UNENCODED INPUT
`GREATEST
`
`
`
`COMPRESSION RATIO
`DATA BLOCK
`
`
`
`
`
`APPEND
`
`CORRESPONDING
`DESCRIPTOR
`
`OUTPUT UNENCODED
`
`
`OUTPUT ENCODED
`DATA BLOCK WITH
`DATA BLOCK WITH
`
`NULL DESCRIPTOR
`DESCRIPTOR
`
`728
`
`730
`
`
`
`
`
`
`
`
`
`
`
`
`MORE
`
`DATA BLOCKSIN INPUT
`STREAM?
`
`
`
`TERMINATE DATA
`COMPRESSION
`
`
`PROCESS
`
`
`
`740
`
`RECEIVE NEXT DATA
`BLOCK FROMINPUT
`STREAM
`
`
`
`
`
`
`FIG. 7b
`
`Page 12
`
`Page 12
`
`
`
`U.S. Patent
`
`Sep. 23, 2003
`
`Sheet 11 of 34
`
`US 6,624,761 B2
`
`
`
`VLVOQ3009N3
`
`/MWVSYHLS
`
`YOLdIYISAG
`
`NOISSSYdWOO
`
`OLLWa
`
`faaaing
`
`bYSLNNOD
`
`f4asa3Ne@
`
`@YALNNOD
`
`fd3s3sne
`
`NOILdidOS30
`INOLLVNIWYSL30
`NOSI¥VdWOO
`£YyALNNOD
`
`baYAQOON3
`
`
`
`23)YAGOONA
`
`€3Y3QOONA
`
`LONI
`
`viva
`
`
`
`WY2eaLSVLVYC
`
`
`
`VIVdLONI
`
`M0018
`
`Yyasddne
`Pa8OSS3R900€d
`
`YSLNNOS
`
`403YNdId
`
`LIYan
`
`NOILVNIWYSL30
`
`08
`
`UYSALNNOD
`
`MaingUZYSGOONS
`
`YAGOONA
`
`SYOLOVA
`
`ALMavylsaa06
`
`Orog
`
`MAW
`
`
`
`JINILG3lAldads
`
`“¥asn
`
`0z
`
`OL
`
`Page 13
`
`Page 13
`
`
`
`
`
`
`
`
`
`
`
`
`
`U.S. Patent
`
`Sep. 23, 2003
`
`Sheet 12 of 34
`
`US 6,624,761 B2
`
`0S
`
`NOISS3SYdWOd
`
`OllvY
`
`NOSIYVdNOO/NOILLYNINYSLSG
`
`403gNoIs
`
`LIA
`
`NOLLYNINYSL3G
`
`09
`
`NOISSAYdWOD
`
`NOlidalwOS3G0
`
`3dAL
`
`
`
`V¥LVOGAG0ON3
`
`IMWVAYLS
`
`YOLdIYOSaG
`
`u'
`
`u'}7aaae
`
`YSIGOONA
`
`SUOLOV
`
`Page 14
`
`
`
`
`
`6‘Sid O24
`ALMISYUISaO06
`
`
`
`ALNdNIviva
`
`V1¥OAOOT
`
`
`
`w334ngySLNNOD
`
`
`
`WYFYLSVLVO
`
`-¥9SN
`
`MSWILShiaaidioads
`
`Page 14
`
`
`
`
`
`
`
`U.S. Patent
`
`Sep. 23, 2003
`
`Sheet 13 of 34
`
`US 6,624,761 B2
`
` 100
`
`RECEIVEINITIAL
`DATA BLOCK FROM.
`INPUT DATA STREAM
`
`102
`
`B
`
`COUNTSIZE OF
`
`DATA BLOCK
`
`|
`
`BUFFER DATA
`
`™
`
`106
`
`INITIALIZE TIMER
`
`408
`
`APPLY INPUT DATA
`BLOCK TO FIRST
`ENCODING STAGE
`IN CASCADED
`
`ENCODER PATHS
`
`110
`
`116
`
`
`
`
`
`
`
`
`
`
`
`
`APPLY OUTPUT
`OF COMPLETED
`ENCODING
`STAGE TO NEXT
`ENCODING
`STAGEIN
`CASCADE PATH
`
`
`
`
`
`
`
`
`
`
`
`
`
` BUFFER
`
`ENCODED DATA
`BLOCK OUTPUT
`FROM
`COMPLETED
`ENCODING
`
`STAGE
`
`
`
`
`
`STOP ENCODING
`PROCESS
`
`
`
`122
`
`124
`
`126
`
`
`
`
`SELECT BUFFERED OUTPUT OF LAST
`ENCODING STAGE IN ENCODER
`CASCADE THAT COMPLETED ENCODING
`
`
`PROCESS WITHIN TIME LIMIT
`
`
`
`
`
`COUNTSIZE OF
`
`ENCODED DATA
`BLOCKS
`
`
`
`CALCULATE
`
`COMPRESSION
`RATIOS
`
`
`
`COMPARE COMPRESSION
`
`RATIOS WITH THRESHOLD
`
`
`LIMIT
`
`
`FIG. 10a
`
`A
`
`Page 15
`
`Page 15
`
`
`
`U.S. Patent
`
`Sep. 23, 2003
`
`Sheet 14 of 34
`
`US 6,624,761 B2
`
`A
`
`128
`
`
`1S
`COMPRESSION
`
`
`RATIO OF AT LEAST ONE
`
`
`ENCODED DATA BLOCK
`
`GREATER THAN
`THRESHOLD?
`
`
`134
`
`
`
`
`CALCULATE FIGURE OF
`MERIT FOR EACH ENCODED
`
`
`DATA BLOCK WHICH EXCEED
`THRESHOLD
`
`
`APPEND NULL
`130
`DESCRIPTOR TO
`
`
`UNENCODED INPUT
`DATA BLOCK
`
`
`136
`
`SELECT ENCODED DATA
`BLOCK WITH GREATEST
`FIGURE OF MERIT
`
`
`
`
`138
`
`132
`
`
`
`
`APPEND
`OUTPUT UNENCODED
`
`
`
`DATA BLOCK WITH
`CORRESPONDING
`NULL DESCRIPTOR
`DESCRIPTOR
`
`
`
`
`
`
`OUTPUT ENCODED
`DATA BLOCK WITH
`
`DESCRIPTOR
`
`140
`
`
`
`
`
`
`142
`
`
`MORE
`DATA BLOCKSIN
`
`
`INPUT STREAM?
`
`
`PROCESS
`
`YES
`
`ERMINATE DATA
`COMPRESSION
`
`RECEIVE NEXT DATA
`BLOCK FROM INPUT
`STREAM
`
`
`
`B FIG. 10b
`
`Page 16
`
`Page 16
`
`
`
`U.S. Patent
`
`Sep. 23, 2003
`
`Sheet 15 of 34
`
`US 6,624,761 B2
`
`TINN/MWLVG
`
`YOLAIMOSAG
`
`20¥ad003G|60¥3009030|
`
`
`
`VLVOLADLNO
`
`WV3SYLS
`
`VLVOLAGLNO
`
`y3d3ng
`
`NOILLOVYLXS
`Y¥344dN8D018
`
`YOLdDIWOSAG
`
`VIVOLANI
`
`WAYSVLVG
`
`ug¥30003G0
`
`vOlL
`
`bbOld
`
`Page 17
`
`Page 17
`
`
`
`
`
`
`
`U.S. Patent
`
`Sep. 23, 2003
`
`Sheet 16 of 34
`
`US 6,624,761 B2
`
`
`
`RECEIVEINITIAL
`DATA BLOCK FROM
`INPUT DATA STREAM
`
`BUFFER DATA BLOCK
`
`1202
`
`
`
`EXTRACT DATA
`
`COMPRESSION TYPE
`DESCRIPTOR
`
`NULL?
`
`
`
`
`1206
`
`
`1S DATA
`
`COMPRESSION
`
`
`
`
`SELECT DECODER(S)
`
`CORRESPONDING TO
`
`DESCRIPTOR
`
`
`
`OUTPUT
`
`
`UNDECODED
`
` DECODE DATA BLOCK USING
`DATA BLOCK
`
`
`SELECTED DECODER(S)
`
`
`
`OUTPUT DECODED
`DATA BLOCK
`
`
`
`MORE DATA
`
`BLOCKSIN INPUT
`STREAM?
`
`
`
`TERMINATE
`DECODING PROCESS.
`
`
`NO
`
`1218
`
`FIG. 12
`
`Page 18
`
`Page 18
`
`
`
`U.S. Patent
`
`Sep. 23, 2003
`
`Sheet 17 of 34
`
`US 6,624,761 B2
`
`
`
`
`
`SJOPOoUZJUspUEdEGgjUE}UOD
`
`L3Jepoous
`
`ol4/e\8Q
`
`uonluBooey
`
`Jo(8)!
`
`(s)uquoBly
`
`OLE}$
`
`
`
`juepuedepu|yuE}UODONBegJeyngJeyunod
`
`
`siepoouzuonuBooeay
`
`yuspuedeq49018BFC|weeysBeqQynduy
`
`JUS}UODeed
`
`
`VebAYNSIS
`
`Page 19
`
`Page 19
`
`
`
`
`
`
`U.S. Patent
`
`Sep. 23, 2003
`
`Sheet 18 of 34
`
`US 6,624,761 B2
`
`pepooux
`
`LQsuajunoojieyng
`
`“WQsuejUNODyayng
`
`
`soyduoseg|Volsseidwoduolsseudiuos)
`
`uueC)pueddyeUIUUE}Eg
`Jo\duoseq(peunbex3!)
`
`
`edAyoney
`€GsigjunoojeyngVZQSiejUNOQjJeyNg
`€3SusjUNOdseyngqZqsuajunoDjeyng
`
`debFYNSIs
`
`LySiejuno>eyng
`
`uysuejuNOdjJeying]
`
`Page 20
`
`Page 20
`
`
`
`
`U.S. Patent
`
`Sep. 23, 2003
`
`Sheet 19 of 34
`
`US 6,624,761 B2
`
`
`
`Receive Initial
`
`Data Block From
`~ 1400
`Input Data
`Stream
`
`Count Size of
`Data Block
`
` |~ '402
`
`
`
`
`
`
`| Buffer Data Block |. 1404
`
`
`
`
`Apply Content
`
`~ 1406
`Dependent Data
`Recognition
`
`
`
`‘Recognize Data
`Stream Content
`?
`
`Page 21
`
`YES
`
`v C
`
`FIGURE 14A
`
`Page 21
`
`
`
`U.S. Patent
`
`Sep. 23, 2003
`
`Sheet 20 of 34
`
`US 6,624,761 B2
`
`
`
`
`
`
`
`Compress Data
`Block with
`Enabled Content
`Independent
`Encoders
`
`
`~ 1410
`
`
`
`
`
`
`Buffer Encoded
`
`Data Block Output
`
`
`~ 1412
`From Each
`
`
`Encoder
`
`
`
`
`
`
`
`Count Size of
`Encoded Data
`
`
`Blocks
`
`
`
`Caiculate
`
`Compression
`Ratio
`
`
`
`Compare
`
`Compression
`
`
`Ratios with
`
`Threshold Limit
`
`D
`
`FIGURE 14B
`
`Page 22
`
`Page 22
`
`
`
`U.S. Patent
`
`Sep. 23, 2003
`
`Sheet 21 of 34
`
`US 6,624,761 B2
`
`
`Compression Ratio
`
`of at Least One Encoded Data
`Block Greater Than Content
`
`ndependent Threshold
`?
`
`
`
`
`Append Null
`Select Encoded
`Descriptor to
`
`Data Block with
`
`Unencoded input
`Data Biock
`
`
`
`
`
`
`1424 ~
`
`Corresponding
`Encoding
`Descriptor
`
`
`
`
`
`
`
`
`
`
`More Data
`Terminate
`Receive Next
`
`
`
`Data Compression
`Blocks in input
`Data Block From
`
`Input Stream
`Stream
`Process
`
`
`2?
`
`
`
`
`
`
`
`1426
`
`Output Encoded
`Data Biock with
`
`Descriptor
`
`Output Unencoded
`Data Block with
`
`
`Null Descriptor
`
`Yes
`
`1430
`
`FIGURE 14C
`
`Page 23
`
`Page 23
`
`
`
`U.S. Patent
`
`Sep. 23, 2003
`
`Sheet 22 of 34
`
`US 6,624,761 B2
`
`1434~
`
`
`
`Select Recognized
`
`Data Type/File or Block
`
`
`in List or Aigorithm
`
`
`
`1436 ~
`
`Select Content
`Dependent
`Aigorithm(s)
`
`Compress Data Block
`with Enabied Content
`
`1440 ~
`
`| Buffer Encoded Data
`‘Block From Each
`
`Encoder
`
`
`
`
`Dependent Encoder(s)
`
`
`
`No-> F
`
`
`
`
` ~ 1448
`
`
`
`
`Compression
`Ratio of at Least One
`Encoded Data Block
`Greater Than
`Content Dependent
`Threshold
`
`
`
`Yes
`
`¥ E
`
`Count Size of Data
`
`Blocks
`
`Calculate
`Compression Ratio
`
`FIGURE 14D
`
`Page 24
`
`Page 24
`
`
`
`U.S. Patent
`
`Sep. 23, 2003
`
`Sheet 23 of 34
`
`US 6,624,761 B2
`
`
`
`
`
`SJePOoUsjuapuedeqjua}u04
`
` |WC]JePOoUS|BeQJeyngseyunog
`
`
`:juepuedeqByeqynduwordBC|"weans
`
`,yua}Uu0Dweg
`
`
`uonluBooey
`
`6/14/e}2q
`
`uomuBooay
`
`Jo(8)}S17
`
`(s)wiywobiy
`
`OLEl
`
`VSAYNSIS
`
`Page 25
`
`Page 25
`
`
`
`
`
`
`U.S. Patent
`
`Sep. 23, 2003
`
`Sheet 24 of 34
`
`US 6,624,761 B2
`
`pepooug
`
`UIABBQ
`
`Joyduoseq
`
`88
`
`
`voissaudluogwepuedece6sieUNOSHE_NGVplouseiy,
`
`
`uolsseudiuos€3Sue}UNOQeLNG[3sepoous|Jepoouy
`/oney-
`
`rorduoseaplouseuys
`
`GOA]uoissaidwo
`oneyuoissesdw09newioo6QseyuNOaLeUNA||~gee,
`
`
`enody| yuajuo
`ZaSJBJUNODPEYING23Jepoouy
`gSbAYNOlS
`UzSIBUNODeYNg
`L3suajunodyeyng
`
`PIOUSeJULMOjagONYUdissasduios
`
`
`SJ@POOUZJUapusdepu}yUE}UOD
`
`
`WC]YIEJUNODJeng
`LQsue}UNoO/JayNg
`
`BUIWE}8Q
`
`PIoYysesy]
`
`b3Jepooug
`
`~O€
`
`Page 26
`
`Page 26
`
`
`
`
`
`
`
`
`
`U.S. Patent
`
`Sep. 23, 2003
`
`Sheet 25 of 34
`
`US 6,624,761 B2
`
`
`
`
`Receive Initial
`Data Block From
`input Data
`Stream
`
`~ 1600
`
`
`
`
`
`
`
`Count Size of
`Data Block
`
`~~ 1602
`
`Buffer Data Block |~ 1604
`
`
`‘Apply Content
`Dependent Data
`Recognition
`
`No-> B
`
`Page 27
`
`Recognize Data
`Stream Content
`
`
`
`?
`
`~ 1608
`
`YES
`
`v C
`
`c
`
`FIGURE 16A
`
`Page 27
`
`
`
`U.S. Patent
`
`Sep. 23, 2003
`
`Sheet 26 of 34
`
`US 6,624,761 B2
`
`B
`
`Compress Data
`Block with Enabled
`Content Independent
`Encoders
`
`Buffer Encoded Data
`Block Output From
`Each Encoder
`
`Count Size of
`Encoded Data Blocks
`
`with Threshold Limit
`
`Caiculate
`Compression
`Ratio
`
`Compare
`Compression Ratios
`
`D
`
`FIGURE 16B
`
`Page 28
`
`Page 28
`
`
`
`U.S. Patent
`
`Sep. 23, 2003
`
`Sheet 27 of 34
`
`US 6,624,761 B2
`
`
`Compression Ratio
`
`of at Least One Encoded Data
`Block Greater Than Content
`
`
`ndependent Threshold
`
`1620 ~
`?
`
`
`
`
`
`
`
`Select Encoded
`Append Null
`Data Blockwith
`Descriptor to
`
`
`Greatest
`Unencoded Input
`
`
`1622 ~/Compression Ratio
`Data Block
`
`
`
`E
`
`F
`
`
`
`Append
`
`__| Corresponding
`
`Encoding
`
`Descriptor
`
`1624
`
`
`Receive Next
`Data Block From
`Input Stream
`
`
` 1626
`
`
`Output Unencoded
`Output Encoded
`Data Block with
`Data Block with
`
`
`
`
`Null Descriptor
`Descriptor
`
`
`
`
`
`More Data
`Terminate
`
`
`
`Data Compression
`Blocks in Input
`
`Stream
`Process
`
`
`?
`
`
`
`Yes
`
`1630
`
`FIGURE 16C
`
`Page 29
`
`Page 29
`
`
`
`U.S. Patent
`
`Sep. 23, 2003
`
`Sheet 28 of 34
`
`US 6,624,761 B2
`
`
`
`
`
`
`
`
`Compression
`
`Ratio of at Least
` Nos B
`One Encoded Data
`Block Greater Than
`Content Dependent
`Threshold
`9
`
`Yes
`
`v E
`
`Select Recognized
`1634 ~| Data Type/File or Block
`
`in List or Algorithm
`Algorithm(s)
`
`Select Content
`Dependent
`
`Compress Data Block
`with Enabled Content
`Dependent Encoder(s)
`
`
`
`Buffer Encoded Data
`Block From Each
`Encoder
`
`1638
`
`1640 ~
`
`Count Size of Data
`
`Blocks
`Compression Ratio
`
`Calculate
`
`FIGURE 16D
`
`Page 30
`
`Page 30
`
`
`
`U.S. Patent
`
`Sep. 23, 2003
`
`Sheet 29 of 34
`
`US 6,624,761 B2
`
`
`
`
`
`sJepoouzjuepuededjue}U0D
`
`
`
`
`
`sJapoouyjUepuEdepU|JUE]UOD
`
`yUe}U0D
`
`yuepuedepuy
`
`Beqg
`
`uonuBooey
`
`
`
`JeyngJeyuned
`
`Beg
`
`
`
`jyuepuedeqeyegjndu|yooBIO|weays
`
`
`pueuonjuBooey
`
`~OEwywobyuoyeunsy
`
`
`
`se|qe|df-4007]Jo
`
`
`
`edk|914/68q
`
`OLZL5
`
`
`
`VZbaYNSIS
`
`OOZt0Z5
`
`Page 31
`
`Page 31
`
`
`
`
`
`
`
`
`U.S. Patent
`
`Sep. 23, 2003
`
`Sheet 30 of 34
`
`US 6,624,761 B2
`
`LIMB}2Q
`
`Jo\doseg
`
`peposuywdssaqunog/eyng
`
`Joyduoseq=|uojsseidwodseeeieGe
`adh,ie
`OSELOFEL
`
`a2)AYNODIS
`ZQsuewuNogJeuNG||~oe)
`USSUBJUNOD/JSYyNg
`€3suejuNoQeyng
`73SJaUNOD/jJeyNg
`bySJeuUNODeng
`€OsuejuNooeyng
`LQssejunoDjeyng
`
`(peunbexJI)
`
`Page 32
`
`Page 32
`
`
`
`
`
`
`
`U.S. Patent
`
`Sep. 23, 2003
`
`Sheet 31 of 34
`
`US 6,624,761 B2
`
`Receive Initial
`Data Block From
`Input Data
`Stream
`
`~ 1800
`
`Count Size of
`Data Block|~ 1802
`
`Buffer Data Block |-~ 1804
`
`
`
`
`
`
`
`
`
`Apply Content
`Dependent /
`Independent Data
`Recognition
`
`
`
`
`~ 1806
`
`
`
`
`Recognize Data
`Stream Content
`?
`
`Page 33
`
`Yes
`
`v C
`
`FIGURE 18A
`
`Page 33
`
`
`
`U.S. Patent
`
`Sep. 23, 2003
`
`Sheet 32 of 34
`
`US 6,624,761 B2
`
`B
`
`Estimate Optimum
`Encoder(s) From
`Estimation Algorithms
`
`=|-~ 1850
`
`Compress Data
`Block with Enabled
`Content independent
`Encoders
`
`~ 1810
`
`Buffer Encoded Data
`Block Output From)
`Each Encoder
`
`|-~~ 1812
`
`Ratio Compression Ratios
`
`Count Size of
`Encoded Data Blocks |~ 1814
`
`Calculate
`Compression
`
`1816
`
`with Threshold Limit
`
`|~ 1818
`
`D
`
`FIGURE 18B
`
`Page 34
`
`Page 34
`
`
`
`U.S. Patent
`
`Sep. 23, 2003
`
`Sheet 33 of 34
`
`US 6,624,761 B2
`
`
`
`
`
`Compression Ratio
`
`of at Least One Encoded Data
`
`
`
`Block Greater Than Content
`ndependent Threshold
`
`
`1820 ~
`?
`
`
`
`
`
`
`Append Null
`Select Encaded
`Descriptor to
`
`Data Block with
`
`
`| Unencoded input
`
`
`
`Greatest
`Data Block
`
`1824 ~
`
`Encoding
`Descriptor
`
`1826
`
` Corresponding
`
`
`
`
`Output Unencoded
`Output Encoded
`Data Block with
`Data Block with
`
`
`
`Null Descriptor
`
`Descriptor
`
`
`
`
`
`Terminate
`More Data
`
`
`
`
`
`Data Compression
`Blocksin Input
`Process
`Stream
`
`
` Yes
`?
`
`
`
`Receive Next
`Data Block From
`Input Stream
`
`1830
`
`FIGURE 18C
`
`Page 35
`
`Page 35
`
`
`
`U.S. Patent
`
`Sep. 23, 2003
`
`Sheet 34 of 34
`
`US 6,624,761 B2
`
`1838 ~
`
`
`
`Select Recognized
`
`
`Data Type/File or Block
`
`
`in List or Algorithm
`
`
`
`
`1840 ~
`
`Compress Data Block
`with Enabled Content
`
`
`
`Select or Estimate
`
`Content Dependent
`Algorithm(s)
`Dependent Encoder(s)
`1844 ~
`Blocks
`Compression Ratio
`
`Buffer Encoded Data
`Biock From Each
`Encoder
`
`Count Size of Data
`
`Calculate
`
`
`
`Compression
`
`Ratio of at Least
`
`
`
`One Encoded Data
`Block Greater Than
`
`
`
`
`Content Dependent
`Threshold
`
`
`~ 1850
`?
`
`
`
`No» F
`
`Yes
`
`v E
`
`FIGURE 18D
`
`Page 36
`
`Page 36
`
`
`
`US 6,624,761 B2
`
`1
`CONTENT INDEPENDENT DATA
`COMPRESSION METHOD AND SYSTEM
`
`CROSS-REFERENCE TO RELATED
`APPLICATIONS
`
`This application is a Continuation-In-Part of U.S. patent
`application Ser. No. 09/705,446,
`is now a US. Pat. No.
`6,309,424 filed on Nov. 3, 2000, which is a Continuation of
`US. patent application Ser. No. 09/210,491, filed on Dec.
`11, 1998, which is now USS. Pat. No. 6,195, 024, issued on
`Feb. 27, 2001.
`
`BACKGROUND
`
`1. Technical Field
`
`The present invention relates generally to a data compres-
`sion and decompression and, more particularly, to systems
`and methods for data compression using content indepen-
`dent and content dependent data compression and decom-
`pression.
`2. Description of Related Art
`Information may be represented in a variety of manners.
`Discrete information such as text and numbers are easily
`represented in digital data. This type of data representation
`is known as symbolic digital data. Symbolic digital data is
`thus an absolute representation of data such as a letter,
`figure, character, mark, machine code, or drawing,
`Continuous information such as speech, music, audio,
`images and video, frequently exists in the natural world as
`analog information. As is well knownto those skilled in the
`art, recent advances in very large scale integration (VLSI)
`digital computer technology have enabled both discrete and
`analog information to be represented with digital data.
`Continuous information represented as digital data is often
`referred to as diffuse data. Diffuse digital data is thus a
`representation of data that is of low information density and
`is typically not easily recognizable to humansin its native
`form.
`
`There are many advantages associated with digital data
`representation. For instance, digital data is more readily
`processed, stored, and transmitted due to its inherently high
`noise immunity. In addition, the inclusion of redundancy in
`digital data representation enables error detection and/or
`correction. Error detection and/or correction capabilities are
`dependent upon the amount and type of data redundancy,
`available error detection and correction processing, and
`extent of data corruption.
`One outcomeof digital data representation is the continu-
`ing need for increased capacity in data processing, storage,
`and transmittal. This is especially true for diffuse data where
`increases in fidelity and resolution create exponentially
`greater quantities of data. Data compression is widely used
`to reduce the amount of data required to process, transmit,
`or store a given quantity of information. In general, there are
`two types of data compression techniques that may be
`utilized either separately or jointly to encode/decode data:
`lossless and lossy data compression.
`Lossy data compression techniques provide for an inexact
`representation of the original uncompressed data such that
`the decoded (or reconstructed) data differs from the original
`unencoded/uncompressed data. Lossy data compression is
`also knownasirreversible or noisy compression. Entropy is
`defined as the quantity of information in a givenset ofdata.
`Thus, one obvious advantage of lossy data compression is
`that the compression ratios can be larger than the entropy
`limit, all at the expense of information content. Many lossy
`
`10
`
`15
`
`20
`
`25
`
`30
`
`35
`
`40
`
`45
`
`50
`
`55
`
`60
`
`65
`
`2
`data compression techniques seek to exploit various traits
`within the human senses to eliminate otherwise impercep-
`tible data. For example, lossy data compression of visual
`imagery might seek to delete information content in excess
`of the display resolution or contrast ratio.
`Onthe other hand, lossless data compression techniques
`provide an exact representation of the original uncom-
`pressed data. Simply stated, the decoded (or reconstructed)
`data is identical to the original unencoded/uncompressed
`data. Lossless data compression is also known asreversible
`or noiseless compression. Thus, lossless data compression
`has, as its current limit, a minimum representation defined
`by the entropy of a given dataset.
`There are various problems associated with the use of
`lossless compression techniques. One fundamental problem
`encountered with most lossless data compression techniques
`are their content sensitive behavior. Thisis often referred to
`as data dependency. Data dependency implies that the com-
`pression ratio achieved is highly contingent upon the content
`of the data being compressed. For example, database files
`often have large unused fields and high data redundancies,
`offering the opportunity to losslessly compressdata at ratios
`of 5 to 1 or more. In contrast, concise software programs
`have little to no data redundancy and, typically, will not
`losslessly compress better than 2 to 1.
`Another problem with lossless compression is that there
`are significant variations in the compression ratio obtained
`whenusing a single lossless data compression technique for
`data streams having different data content and data size. This
`process is known as natural variation.
`A further problem is that negative compression may occur
`when certain data compression techniques act upon many
`types of highly compressed data. Highly compressed data
`appears random and many data compression techniques will
`substantially expand, not compress this type of data.
`Fora given application, there are manyfactors that govern
`the applicability of various data compression techniques.
`These factors include compression ratio, encoding and
`decoding processing requirements, encoding and decoding
`time delays, compatibility with existing standards, and
`implementation complexity and cost, along with the adapt-
`ability and robustness to variations in input data. A direct
`relationship exists in the current art between compression
`ratio and the amount and complexity of processing required.
`Oneofthe limiting factors in most existing prior art lossless
`data compression techniquesis the rate at which the encod-
`ing and decoding processes are performed. Hardware and
`software implementation tradeoffs are often dictated by
`encoder and decoder complexity along with cost.
`Another problem associated with lossless compression
`methods is determining the optimal compression technique
`for a given set of input data and intended application. To
`combat this problem, there are many conventional content
`dependent techniques that may beutilized. For instance, file
`type descriptors are typically appended to file names to
`describe the application programsthat normally act upon the
`data contained within thefile. In this manner data types, data
`structures, and formats within a given file may be ascer-
`tained. Fundamental limitations with this content dependent
`technique include:
`(1) the extremely large number of application programs,
`some of which do not possess published or documented
`file formats, data structures, or data type descriptors;
`(2)
`the ability for any data compression supplier or
`consortium to acquire, store, and access the vast
`amounts of data required to identify knownfile descrip-
`tors and associated data types, data structures, and
`formats; and
`
`Page 37
`
`Page 37
`
`
`
`US 6,624,761 B2
`
`3
`(3) the rate at which new application programsare devel-
`oped and the need to update file format data descrip-
`tions accordingly.
`An alternative technique that approaches the problem of
`selecting an appropriate lossless data compression technique
`is disclosed, for example, in U.S. Pat. No. 5,467,087 to Chu
`entitled “High Speed Lossless Data Compression System”
`(“Chu”). FIG.
`1 illustrates an embodiment of this data
`compression and decompression technique. Data compres-
`sion 1 comprises two phases, a data pre-compression phase
`2 and a data compression phase 3. Data decompression 4 of
`a compressed input data stream is also comprised of two
`phases, a data type retrieval phase 5 and a data decompres-
`sion phase 6. During the data compression process 1, the
`data pre-compressor 2 accepts an uncompressed data stream,
`identifies the data type of the input stream, and generates a
`data type identification signal. The data compressor3 selects
`a data compression method from a preselected set of meth-
`ods to compress the input data stream, with the intention of
`producing the best available compression ratio for that
`particular data type. There are several limitations associated
`with the Chu method. One such limitation is the need to
`unambiguously identify various data types. While these
`might include such commondata types as ASCII, binary, or
`unicode, there, in fact, exists a broad universe of data types
`that
`fall outside the three most common data types.
`Examples of these alternate data types include: signed and
`unsigned integers of various lengths, differing types and
`precision of floating point numbers, pointers, other forms of
`character text, and a multitude of user defined data types.
`Additionally, data types may be interspersed or partially
`compressed, making data type recognition difficult and/or
`impractical. Another limitation is that given a known data
`type, or mix of data types within a specific set or subset of
`input data, it may be difficult and/or impractical to predict
`which data encoding technique yields the highest compres-
`sion ratio.
`there is a need for a data compression
`Accordingly,
`system and method that would address limitations in con-
`ventional data compression techniques as described above.
`
`SUMMARYOF THE INVENTION
`
`The present invention is directed to systems and methods
`for providing fast and efficient data compression using a
`combination of content independent data compression and
`content dependent data compression. In one aspect of the
`invention, a method for compressing data comprises the
`steps of:
`analyzing a data block of an input data stream to identify
`a data type of the data block, the input data stream
`comprising a plurality of disparate data types;
`performing content dependent data compression on the
`data block, if the data type of the data block is identi-
`fied;
`performing content independent data compression on the
`data block, if the data type of the data block is not
`identified.
`In another aspect, the step of performing content inde-
`pendent data compression comprises: encoding the data
`block with a plurality of encoders to provide a plurality of
`encoded data blocks; determining a compression ratio
`obtained for each of the encoders; comparing each of the
`determined compression ratios with a first compression
`threshold; selecting for output
`the input data block and
`appending a null compression descriptor to the input data
`block,if all of the encoder compression ratios do not meet
`
`10
`
`15
`
`20
`
`25
`
`30
`
`35
`
`40
`
`45
`
`50
`
`55
`
`60
`
`65
`
`4
`the first compression threshold; and selecting for output the
`encoded data block having the highest compressionratio and
`appending a corresponding compression type descriptor to
`the selected encoded data block,
`if at
`least one of the
`compression ratios meet the first compression threshold.
`In another aspect, the step of performing content depen-
`dent compression comprises the steps of: selecting one or
`more encoders associated with the identified data type and
`encoding the data block with the selected encoders to
`provide a plurality of encoded data blocks; determining a
`compression ratio obtained for each of the selected encod-
`ers; comparing each of the determined compression ratios
`with a second compression threshold; selecting for output
`the input data block and appending a null compression
`descriptor to the input data block,
`if all of the encoder
`compression do not meet the second compression threshold;
`and selecting for output the encoded data block having the
`highest compression ratio and appending a corresponding
`compression type descriptor to the selected encoded data
`block, if at least one of the compression ratios meet the
`second compression threshold.
`In yet another aspect,
`the step of performing content
`independent data compression on the data block, if the data
`type of the data block is not identified, comprises the steps
`of: estimating a desirability of using of one or more encoder
`types based one characteristics of the data block; and
`compressing the data block using one or more desirable
`encoders.
`In another aspect, the step of performing content depen-
`dent data compression on the data block,if the data type of
`the data block is identified, comprises the steps of: