`
`[19]
`
`[11] Patent Number:
`
`5,978,791
`
`Farber et al.
`
`[45] Date of Patent:
`
`Nov. 2, 1999
`
`US005978791A
`
`.................... .. 364/200
`. . . .. 364/900
`395/600
`
`
`
`5/1990 Holloway et al.
`4,922,414
`11/1990 Burke . . . . . . .. . . . . . . .
`4,972,367
`4/1991 Bendert et al.
`5,007,658
`6/1991 C110 .................................. ..
`5,025,421
`9/1991 Marca ................................... .. 364/200
`5,050,074
`9/1991 Dyson ........ ..
`380/25
`5,050,212
`5,057,837 10 1991 Colwell et al.
`..
`341 55
`5,129,081
`741992 Kobayashi et al.
`. 395/6/00
`5,129,082
`7/1992 Tirfing etal.
`395/600
`5,144,667
`9/1992 Po
`e, Jr. et al.
`380/45
`5,179,680
`1/1993 Colgxlirell et al.
`......
`. 395/425
`5,202,982
`4/1993 Gramlich et al.
`. 395/600
`
`[54] DATA PROCESSING SYSTEM USING
`SUBSTANTIALLY UNIQUE IDENTIFIERS T0
`IDENTIFY DATA ITEMS, WHEREBY
`
`IDENTIFIERS
`
`[75]
`
`.
`.
`.
`.
`Inventors: David A. Farber, O]a1, Cal1f.; Ronald
`D- Lachmana Nofthbfooka 111-
`[73] Assigneei Kinetech, 1110-, Northbrook, 111.
`
`[21] App]. NO‘: 08/960,079
`
`[22]
`
`Filed:
`
`Oct. 24, 1997
`
`Related U_S_ Application Data
`
`[63] Continuation of application No. 08/425,160, Apr. 11, 1995,
`abandoned.
`
`5,208,858
`5,276,901
`5,301,286
`5,301,316
`5,343,527
`5,357,623
`57384565
`5,404,508
`
`5/1993 Vollert et al.
`1/1994 Howell et al.
`4/1994 Rajani
`. . . . . .. . . . . . . .
`4/1994 Hamilton et al.
`8/1994 M0016 . . . . . .. . . . . . . . . . .
`10/1994 Megory—Cohen
`1/1995 Cannon """ "
`4/1995 Konrad et al.
`
`380/43
`.. 395/800
`. . . .. 395/400
`. 395/600
`. . . .. 380/4
`395/425
`340/82544
`........................ .. 395/600
`
`
`
`OTHER PUBLICATIONS
`Witold Litwin et al, Linear Hashing for Distributed Files,
`ACM SIGMOD’ May, 1993 pp. 327_336.
`Ming—Ling Lo, et al, On Optimal Processor Allocation to
`Support Pipelined Hash Joins, ACM SIGMOD, pp. 69-78,
`1}/[lay 1992.13
`DE
`. 1C
`1
`. M 01232
`‘h
`omas
`.
`erson
`1 erentia
`r
`tana s1s
`o
`wit
`Applications to M135, 1313- 69-31911992 y
`
`(List continued on next page.)
`
`Primary Examiner—Paul V. Kulik
`Assistant Examiner—Jean
`Hgmere
`
`Attorney, Agent, or Firm—Pillsbury Madison & Sutro LLP
`
`[57]
`
`ABSTRACT
`
`In a data processing system, a mechanism identifies data
`items by substantially unique identifiers which depend on all
`of the data in the data items and only on the data in the data
`items. The system also determines whether a particular data
`item is present in the database by examining the identifiers
`of the plurality of data items
`'
`
`48 Claims, 31 Drawing Sheets
`
`[51]
`Int. Cl.6 .................................................... .. G06F 17/30
`
`[52] U_'S' Cl‘ ‘
`" 707/2; 707/1; 707/200
`[58] Fleld Of Search ..................................... 707/2, 1, 200
`,
`Referenees Clted
`U.S. PATENT DOCUMENTS
`6/1972 Evangelistietal.
`................. 340/172.5
`7/1980 Mitchell et al.
`..
`364/200
`9/1981 Cichelli et al.
`364/200
`3/1983 Rivest
`......... ..
`364/900
`9/1983 Rivest et al.
`..
`.. 178/22.1
`10/1983 Neches et al.
`364/200
`
`[56]
`
`3,668,647
`4,215,402
`4,290,105
`4,376,299
`4,405,829
`4,412,285
`
`364/200
`11/1983 Summer: J“ et a1~
`4>414a624
`364/200
`4/1984 Fletcher et al.
`..... ..
`4,441,155
`364/200
`8/1984 Benhase et al.
`..
`4,464,713
`364/200
`4,490,782 12/1984 Dixon et al.
`.... ..
`364/900
`4,571,700
`2/1986 Emry, Jr, et a1,
`,
`365/189
`4,577,293
`3/1986 Matick et al.
`364/900
`4,642,793
`2/1987 Meaden ~~~~~~~ ~-
`364/200
`4575310
`6/1987 Gfuner et a1-
`365/185
`436919299
`9/1987 Rwest et al‘ ““ “
`364/200
`4,725,945
`2/1988 Kronstadtetal.
`364/900
`4,773,039
`9/1988 Zamora ........... ..
`364/900
`4,887,235
`12/1989 Holloway et al.
`4,888,681
`12/1989 Barnes et al.
`......................... .. 364/200
`
`
`
`104
`sromes
`DEVICE
`
`104
`STORAGE
`DEVICE
`
`102
`PROCESSOR
`
`.
`
`.
`
`.
`
`102
`PROCESSOR
`
`ms
`
`102
`PROCESSOR
`
`102
`PROCESSOR
`
`102
`PROCESSOR
`
`PROCESSOR
`
`100
`CPU
`
`3
`3
`
`STORAGE
`nsvucz
`
`114
`
`MEMORY
`132
`A;
`134
`AL
`156
`SM];
`152
`an
`154
`5,“,
`
`126
`
`128
`
`1:10
`
`136
`
`153
`
`GOOG-1001-Page 1 of 56
`
`GOOG-1001-Page 1 of 56
`
`
`
`5,978,791
`Page 2
`
`OTHER PUBLICATIONS
`
`William Perrizo, et al., Distributed Join Processing Perfor-
`mance Evaluation, 1994. Twenty-Seventh Hawaii Interna-
`tional Conference on System Sciences, vol. II, pp. 236-244.
`A concurrency Control Mechanism based on Extendible
`Hashing for Main Memory Database Systems, Vij ay Kumar,
`pp. 109-113, ACM, vol. 3, 1989.
`Birgit Pfitzmann, Sorting Out Signature Schemes, Nov.
`1993, 1st Conf. Computer & Comm. Security ’93 pp. 74-85.
`Bert dem Boer, et al., Collisions for the compression func-
`tion of MDS pp. 292-304, 1994.
`Sakti Pramanik, et al., Multi-Directory Hashing, 1993, Info.
`Sys., vol. 18, No. 1, pp. 63-74.
`Murlidhar Koushik, Dynamic Hashing With Distributed
`Overflow Space: A File Organization With Good Insertion
`Performance, 1993, Info. Sys., vol. 18, No. 5, pp. 299-317.
`Witold Litwin, et al., LH*-Linear Hashing for Distributed
`Files, HP Labs Tech. Report No. HPL-93-21 Jun. 1993 pp.
`1-22.
`
`Yuliang Zheng, et al., HAVAL — A One-Way Hashing
`Algorithm with Variable Length of Output
`(Extended
`Abstract), pp. 83-105, Advances in Cryptology, AUSCRIPT
`’92, 1992.
`
`Chris Charnes and Josef Pieprzky, Linear Nonequivalence
`versus Nonlinearity, Pieprzky, pp. 156-164, 1993.
`Zhiyu Tian, et al., A New Hashing Function: Statistical
`Behaviour and Algorithm, pp. 3-13, SIGIR Forum, 1993.
`G. L. Friedman, Digital Camera With Apparatus For
`Authentication of Images Produced From an Image File,
`NASA Case No. NPO—19108—1—CU, Serial No. 08/159,980,
`Nov. 24, 1993.
`H. Goodman, Feb. 9, 1994 Ada, Object-Oriented Tech-
`niques, and Concurrency in Teaching Data Sructures and
`File Management Report Documentation P. AD-A275
`385 — 94-04277.
`
`Advances in Cryptology-EUROCRYPT ’93, Workshop on
`the Theory and Application of Cryptographic Techniques
`Lofthus, Norway, May 23-27, 1993 Proceedings.
`Proceedings of the 1993 ACM SIGMOD International Con-
`ference on Management of Data, vol. 22, Issue 2, Jun. 1993.
`Advances in Cryptology-AUSCRYPT ’92 — Workshop on
`the Theory and Application of Cryptographic Techniques
`Gold Coast, Queensland, Australia Dec. 13-16, 1992 Pro-
`ceedings.
`Search Report dated Jun. 24, 1996.
`
`GOOG-1001-Page 2 of 56
`
`GOOG-1001-Page 2 of 56
`
`
`
`U.S. Patent
`
`Nov. 2, 1999
`
`Sheet 1 of 31
`
`5,978,791
`
`2:
`
`
`
`
`
`m_OmwmOOm_n_mowmmoommmowwmoomm
`
`N2N9N3
`
`
`
`...mo_>mn_...m_o_>ma
`
`
`
`
`momwmoommmomwmoommmw<m2.mmwéohw
`
`N2N2a3G:.o_..._
`
`o_..
`
`GOOG-1001-Page 3 of 56
`
`GOOG-1001-Page 3 of 56
`
`
`
`U.S. Patent
`
`Nov. 2, 1999
`
`Sheet 2 of 31
`
`5,978,791
`
`>mos_ms_
`
`N9.
`
`n_<
`
`V2
`
`.2
`
`m2
`
`ezm
`
`Nm_.
`
`.75
`
`V9.
`
`o<m
`
`¢N_.
`
`mm:
`
`mm?
`
`mu:
`
`wN_.
`
`._.m
`
`5o?
`
`09
`
`._..._
`
`mfl
`
`40
`
`3:.o_.._
`
`GOOG-1001-Page 4 of 56
`
`GOOG-1001-Page 4 of 56
`
`
`
`
`
`U.S. Patent
`
`MN
`
`m
`
`MS
`
`13
`
`5,978,791
`
`
`
`9>mo_.ommE...>mo5mm_o>moBmm_o
`
`2.,2;2.m:
`
`Mm.__u_...m.__n_m.__u_Mourour8.
`
`NN_.
`
`N5.N2
`
`kzmswmw...
`
`
`
`wzmswmmhzmsomm
`
`GOOG-1001-Page 5 of 56
`
`w_._..m.=..._
`
`sm:.m>mN
`
`GOOG-1001-Page 5 of 56
`
`
`
`U.S. Patent
`
`Nov. 2, 1999
`
`Sheet 4 of 31
`
`5,978,791
`
`
`
`FIG. 3
`
`Pathname
`
`.38
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`Time of last access
`
`Time of last modification
`
`
`
`
`
`
` ‘
`
`
`
`
`
`
`:40
`
`True Name
`
`Com-ressed File ID
`
`
`
`
`
`
`
`Time of last access
`
`
`Groomino delete count
`
`I42
`
`
`
`
`
`
`
`
`
`GOOG-1001-Page 6 of 56
`
`
`
`
`
`
`F|G.5
`
`GOOG-1001-Page 6 of 56
`
`
`
`U.S. Patent
`
`Nov. 2, 1999
`
`Sheet 5 of 31
`
`5,978,791
`
`.EmummfiHB OEMZ
`
`
`
`HMCH.w.no
`
`N.o_n_
`
`¢¢_
`
`o:
`
`
`
`u«awnwHwm>mmouaom
`
`
`
`cowumooamousom
`
`m.sumuusom
`m.o_.._
`
`m¢_
`
`on.
`
`0502OUHH.
`
`OEMCSUMW
`
`uucouomumu
`
`
`
`050.20.D..H.H.
`
`.~
`
`mg“.
`
`mOE
`
`GOOG-1001-Page 7 of 56
`
`GOOG-1001-Page 7 of 56
`
`
`
`
`
`
`U.S. Patent
`
`Nov. 2, 1999
`
`Sheet 6 of 31
`
`5,978,791
`
`FIG. IO(a)
`
`SIMPKE
`
`DATA ITEM
`
`COMPUTE MD FUNCTION ON
`
`DATA ITEM
`
`8214
`
`APPEND LENGTH MODULO 32 OF
`
`DATA ITEM
`
`TRUE NAME
`
`1
`
`GOOG-1001-Page 8 of 56
`
`GOOG-1001-Page 8 of 56
`
`
`
`U.S. Patent
`
`Nov. 2, 1999
`
`Sheet 7 of 31
`
`5,978,791
`
`
`
`
`
`S220
`PARTITION DATA ITEM INTO
`SEGMENTS
`
`FIG. IO(b)
`
`S222
`
`_ _ _ _ ..._
`
`ASSIMILATE EACH SEGMENT
`
`(COMPUTING rrs TRUE NAME)
`
`S224
`
`CREATE INDIRECT BLOCK OF
`SEGMENT TRUE NAMES
`
`,r‘“''s:21‘e'”" \
`COMPUTE TRUE :
`; NAME OF SIMPLE 1
`1
`DATA ITEM
`:
`\ . _ _ _ _ _
`
`S226
`
`ASSIMILATE INDIRECT BLOCK
`
`(COMPUTING rrs TRUE NAME)
`
`ITEM
`
`S228
`
`REPLACE FINAL 32 BITS OF TRUE
`NAME WITH LENGHT MOD 32 OF DATA
`
`GOOG-1001-Page 9 of 56
`
`GOOG-1001-Page 9 of 56
`
`
`
`U.S. Patent
`
`Nov. 2, 1999
`
`Sheet 8 of 31
`
`5,978,791
`
`>m._.zmmmoo
`
`~.n=m.=n_m><:
`
`mm:
`
`8%
`
`
`
`n:m.=.._mmohm.
`
`wmmm
`
`a.m.__u_mfido
`
`mm:
`
`ms_<zM55mmoo
`
`m.=..._mnmpZ.Bum.
`
`~.>Em_omm
`
`8%
`
`mz_s_mm:.mo
`
`ms<zmam»
`
`__.o_.._
`
`ommm
`
`
`
`>mFzmzmzm:.<mmo..
`
`
`
`_.O...._..zDO0mm:Em..
`
`
`
`
`
`
`
`mo._m_u_mmE.oEm..D.m.=u_mmopw..
`
`GOOG-1001-Page 10 of 56
`
`GOOG-1001-Page 10 of 56
`
`
`
`
`
`
`
`U.S. Patent
`
`Nov. 2, 1999
`
`Sheet 9 of 31
`
`5,978,791
`
`FIG. I2
`
`
` S238
`S240
`UPDATE
`
`
`DEPENDENCY
`
`
`LOCKED?
`LIST
`
`
`
`FILE
`
`S242
`
`SEND MESSAGE TO
`
`
`
`
`
`CACHE SERVER TO
` S244
`
`UPDATE CACHE
`
`(IF DESIRED)
`
`COMPRESS
`
`
`
`MIRROR
`
`
`(IF DESIRED)
`
`
`
`S246
`
`GOOG-1001-Page 11 of 56
`
`GOOG-1001-Page 11 of 56
`
`
`
`U.S. Patent
`
`N0v.2, 1999
`
`Sheet 10 of 31
`
`5,978,791
`
`FIG. I3
`
`S250
`
`SEARCH FOR
`THE
`
`PATHNAME
`
`‘ .
`
`.
`
`‘ '
`
`FAIL
`
`
`
`
`LDE INCLUDES
`
`TRUE NAME?
`
`S258
`
`
`
`
`
`ASSIMILATE
`FILE ID
`
`
`
`
`
`LDE IDENTIFIES
`
`DIRECTORY?
`
`YS
`
`S256
`
`FREEZE
`
`DIRECTORY
`
`GOOG-1001-Page 12 of 56
`
`GOOG-1001-Page 12 of 56
`
`
`
`U.S. Patent
`
`Nov. 2, 1999
`
`Sheet 11 of 31
`
`5,978,791
`
`S260
`
`CONFIRM THAT
`
`TRUE NAME
`
`EXISTS LOCALLY
`
`FIG. I4
`
`
`
`
` S262
`SEARCH FOR
`
`PATHNAME IN
`
`LDE TABLE
`
`S264
`
`
`
`
`
`
`
`CONFIRM THAT
`
`DIRECTORY
`
`EXISTS
`
`S266
`
`
`
`S268
`
`NAMED FILE
`
`EXISTS?
`
`DELETE
`
`TRUE FILE
`
`S270
`
`
`
`CREATE
`
`GOOG-1001-Page 13 of 56
`
`ENTRY IN LDE
`
`
`
`
`
`& UPDATE
`
`GOOG-1001-Page 13 of 56
`
`
`
`U.S. Patent
`
`N0v.2, 1999
`
`Sheet 12 of 31
`
`5,978,791
`
`mmzommmm
`
`m>:._mo..
`
`Sum
`
`"Fmozmm
`
`._._<>>wmw<mmms_
`mo".
`
`mwzommmm
`
`2%
`
`
`
`m.__u_m:E.mmwzm
`
`
`
`O._.z_omzmamm
`
`mu:
`
`mmzomwmm
`
`m>:.<wmz
`
`mm;
`
`Eommmoom.<29200..2
`
`Nwmm
`
`
`
`mam»E_mm>
`
`"5m.__"_
`
`ammamo
`
`8mm
`
`m.__"_02.".
`
`m_.o_.._
`
`GOOG-1001-Page 14 of 56
`
`GOOG-1001-Page 14 of 56
`
`
`
`
`
`
`
`U.S. Patent
`
`mN
`
`S
`
`5,978,791
`
`1,>z<2.Sum
`om:.om._mW ommomwmoomm
`
`mm:
`
`Em.or.as
`
`
`1.._.zm_._o
`1moBmmzommmmM.m»m<oo<ommms:.mm:
`
`.5ms:
`
`MMZOQMMQ
`
`i-li+|iI-
`
`m\<R.01
`
`«mum
`
`..zm_._o
`
`..&.om.m_m
`
`.wEommmoom._
`
`GOOG-1001-Page 15 of 56
`
`GOOG-1001-Page 15 of 56
`
`
`
`
`U.S. Patent
`
`N0v.2, 1999
`
`Sheet 14 of 31
`
`5,978,791
`
`max»".0momaow
`
`
`
`_20mEmmm"_u__oms_<z
`
`~zo:<z:.wmo
`
`Sum
`
`mmopm
`
`n:mommmoomm
`
`Emmw
`
`mz_s_mm:.mo
`
`
`
`m_.<n_zo:<m_n_xm
`
`.5...O._.oo<oz<
`
`ca
`‘£
`
`4m_momaom
`
`oz_:m_._m:n_
`
`oommm
`
`>.~.s_m»m>m
`
`mommm
`
`mo".mE._.n5X004
`
`on?wms_<zmam»
`
`
`
`n:zO_._.<OO._momsow
`
`
`
`MO".mo.momzowO._.
`
`ms_<zm:E.
`
`
`
`
`
`O._.mo<mwm_zozmw
`
`momnom20
`
`mommmoomm
`
`
`
`m.__.._mamhm>mmwmx
`
`Baum
`
`
`
`35..o_..._
`
`GOOG-1001-Page 16 of 56
`
`GOOG-1001-Page 16 of 56
`
`
`
`
`
`
`
`
`
`U.S. Patent
`
`N0v.2, 1999
`
`Sheet 15 of 31
`
`5,978,791
`
` mm:
`
`MO".0.m.__n_
`
`
`
`~.>E.zmm=._._.
`
`O2
`
`
`
`mmmmm..m>
`
`8%
`
`mwmmmsoomo
`
`omwmmmmsoo
`
`S.m.__"_
`
`mm;
`
`.=<..._
`
`oz
`
`>Ezmm.__n_mam
`
`mzmkmo".E»z_
`
`~.ms_<z
`
`A32.2“.
`
`GOOG-1001-Page 17 of 56
`
`GOOG-1001-Page 17 of 56
`
`
`
`
`
`
`U.S. Patent
`
`Nov. 2, 1999
`
`Sheet 16 of 31
`
`5,978,791
`
`i
`
`88
`
`n:mmokm
`
`38
`
`5m._mw
`
`wemomaom
`
`9momaow
`
`NEOEO2
`
`88
`
`m:.<oo._
`
`
`
`m.=...._m.Ss_mm
`
`EVE.o_..._
`
`mm:
`
`
`
`mSE.m~_._<mm
`
`
`
`s_om”_m.__"_
`
`Emomsom
`
`GOOG-1001-Page 18 of 56
`
`GOOG-1001-Page 18 of 56
`
`
`
`
`
`
`U.S. Patent
`
`N0v.2, 1999
`
`Sheet 17 of 31
`
`5,978,791
`
`.3w_
`or.
`
`mo".0.m.=n_
`
`~.m.__"_may:
`
`«N8
`
`mama.M252
`
`._<oo._w.=n_
`
`
`
`0.50:0:o._.<mo
`
`max»"5>n_Oomm
`
`._.m.__n_
`
`m:.m._momSm
`
`m.__n_mam»
`
`mm_"=._.zmn__mo._
`
`mamwwz:m_xm
`
`3...".
`
`ommw
`
`zmzmbqmmo
`
`
`
`m.=..._zobaom
`
`GOOG-1001-Page 19 of 56
`
`GOOG-1001-Page 19 of 56
`
`
`
`
`
`
`
`
`
`U.S. Patent
`
`Nov. 2, 1999
`
`Sheet 18 of 31
`
`5,978,791
`
`wmmm
`
`m>o_2mmwn:m.Em><m
`m..:.
`
`>Ezm
`
`mm:
`
`ll-
`
`88
`
`zmzO._.w.E>n_Oo
`
`
`
`35..o_u_
`
`
`
`
`
`.m._m<..mo._Z_9..u._u=..._mmohw.m.=.._
`
`pzaoo
`
`
`
`mm:pzmsmmomo
`
`GOOG-1001-Page 20 of 56
`
`GOOG-1001-Page 20 of 56
`
`
`
`
`
`U.S. Patent
`
`Nov. 2, 1999
`
`Sheet 19 of 31
`
`5,978,791
`
`mmmm
`
`m.E.__s__mm<
`
`om:.<.=s__ww<z:
`
`m.=..._
`
`vmmm
`
`u:mNmmE
`
`>mEomma
`
`
`
`Em..o_.._
`
`$8
`
`_.zmsm_moz_
`
`
`
`X004mnmmmm
`
`_._o<mmo”.
`
`m._.<z_omom:m
`
`QZ<m.__n_
`
`mm...2.>m9.omm_n
`
`
`
`>moSmm_n_zm>_w
`
`GOOG-1001-Page 21 of 56
`
`GOOG-1001-Page 21 of 56
`
`
`
`
`
`
`
`
`U.S. Patent
`
`N0v.2, 1999
`
`Sheet 20 of 31
`
`5,978,791
`
`ovmm
`
`nmmamo
`
`
`..<ZO_._._Qo<<._.<Dzmzn_z<m.__n_
`
`DmoommOh>m._.zmoQ<m:.<z_nmom:m
`
`wmmm:o<mmo".
`
`zoP<s_mou_z_
`
`
`
`35..o_.._
`
`_2m._._MI...z_>mo5mm_o
`
`
`
`>mo5mm_azm>_c
`
`$8
`
`
`
`m:._._.m:.<.__s=mm<
`
`
`
`
`
`S_m_._._<._.<D>>mZ
`
`Emw
`
`pzmsmmumo
`
`mwmmmumm»
`
`K004
`
`GOOG-1001-Page 22 of 56
`
`GOOG-1001-Page 22 of 56
`
`
`
`
`
`
`U.S. Patent
`
`mN
`
`2.,
`
`m
`
`5,978,791
`
`w9mm:05mo".mmosozEmmw3%$8
`
`ovmm
`
`mam...mxsz
`
`
`
`._<UO._m.__n_
`
`ON.0_n_
`
`
`
`
`
`>mo.6mm_o>moBmm_omm_Ezmmzoo
`
`
`
` Mm_2<zE.<n_M._.S"_mbmmom88
`
`$8
`
`
`
`O._.I._.<n_X23
`
`m5<zm:mk
`
`>mpzm
`
`GOOG-1001-Page 23 of 56
`
`GOOG-1001-Page 23 of 56
`
`
`
`
`
`
`
`U.S. Patent
`
`N0v.2, 1999
`
`Sheet 22 of 31
`
`5,978,791
`
`
`
` S354
`WAIT FOR
`
`
`
`FREEZE LOCK
`
`
`
`TO TURN OFF
`
`FlG.2|
`
`
`
`S356
`
`FIND TFR
`
`ENTRY
`
`
`
` S358
`DECREMENT
`
`REFERENCE
`
`COUNT
`
`REFERENCE COUNT IS
`
`
`
`
`S362
`
`DELETE
`
`TRUE FILE
`
`
`
`ZERO & NO DEPENDENT
`
`SYSTEMS IN TFR?
`
`S364
`
`
`REMOVE FILE ID
`AND COMPRESSED
`
`FILE ID
`
`
`
`GOOG-1001-Page 24 of 56
`
`GOOG-1001-Page 24 of 56
`
`
`
`U.S. Patent
`
`N0v.2, 1999
`
`Sheet 23 of 31
`
`5,978,791
`
`S365
`
`GET
`
`OPERATION
`
`3365
`CREATE oR
`MODIFY?
`
`cow OR DELETE
`COMPOUND?
`
`
`
`
`YES
`
`YES
`
`S368
`
`ASSIMILATE
`
`F"-E
`
`3369
`
`NEW TRUE
`
`S378
`
`3370
`
`Mompy USE
`COUNT OF EACH
`COMPONENT
`
`RECORD TRUE
`NAME IN AUDIT
`
`FILE
`
`S379
`
`FOR EACH PARENT
`
`DIRECTORY OR FILE,
`
`UPDATE USE COUNT,
`LAST ACCESS AND
`
`MODIFY TIMES
`
`GOOG-1001-Page 25 of 56
`
`GOOG-1001-Page 25 of 56
`
`
`
`U.S. Patent
`
`N0v.2, 1999
`
`Sheet 24 of 31
`
`5,978,791
`
`FIG. 23
`
`
`
` S382
`VERIFY
`
`GROOMING
`
`LOCK OFF
`
`
`
`GROOMING
`
`LOCK
`
`S386
`
`SET GROOM
`
`COUNTS
`
`GOOG-1001-Page 26 of 56
`
`
`
`S384
`
`SET
`
`GOOG-1001-Page 26 of 56
`
`
`
`U.S. Patent
`
`N0v.2, 1999
`
`Sheet 25 of 31
`
`5,978,791
`
`
`
`
`FIND LDE
`
`S388
`
`RECORD
`
`
`
`S390
`
`FIND TFR
`
`RECORD
`
`S392
`
`FIG. 24
`
`GOOG-1001-Page 27 of 56
`
`
`
`
`
`DELETE COUNT
`
`
`S394
`ADJUST FILE
`
`
`
`SIZES
`
`INCREMENT
`
`GROOMING
`
`GOOG-1001-Page 27 of 56
`
`
`
`U.S. Patent
`
`N0v.2, 1999
`
`Sheet 26 of 31
`
`5,978,791
`
`FIG. 25
`
`S396
`
`DELETE
`
`FEE
`
`S398
`
`UNLOCK
`
`GROOMING
`
`LOCK
`
`GOOG-1001-Page 28 of 56
`
`GOOG-1001-Page 28 of 56
`
`
`
`U.S. Patent
`
`N0v.2, 1999
`
`Sheet 27 of 31
`
`5,978,791
`
`GEN.o_.._
`
`-...m
`
`:m=._omn_
`
`>._zo.o<mm
`
`>m0.5mm..
`
`ovm
`
`oz_mm
`
`.Em.Emmo
`
`GOOG-1001-Page 29 of 56
`
`GOOG-1001-Page 29 of 56
`
`
`
`
`
`
`U.S. Patent
`
`N0v.2, 1999
`
`Sheet 28 of 31
`
`5,978,791
`
`omvm
`
`._<oo._mxsz
`
`
`
`n:m._:zmammwzo_wmm>
`
`mu:E01".
`
`52n:X004
`
`9200..
`
`:5
`
`m:.<mmo
`
`zobaow
`
`>mO0
`
`§oN.o_n_
`
`m.__"_
`
`._.om:o<o
`
`25
`
`mzmm
`
`>._m._.m.Es_oo
`
`
`
`zmEEsm..
`
`§.m
`
`zmBum
`
`Q.
`
`
`
`m.__u_zobqmom
`
`m._Emm<mm
`
`85
`
`mbmmo
`
`
`
`m.=n__._0._.<m_ow
`
`GOOG-1001-Page 30 of 56
`
`GOOG-1001-Page 30 of 56
`
`
`
`
`
`
`
`
`
`U.S. Patent
`
`N0v.2, 1999
`
`Sheet 29 of 31
`
`5,978,791
`
`..._m_:oE
`
`zo_E._mo
`
`Nmvm
`
`
`
` >Ezm._.mama...mz_sEmEo
`
`
`
`mo".momoomm
`
`m.__“_
`
`.Oomoommmm:0.
`
`Z_moomxoo._m.=u_
`
`>._zo-o<mm
`
`tEo5mm_o
`
`vmvm
`
`
`
`mam...>n.=._.zmn=
`
`GKN
`or.
`
`ms_<z
`
`
`
`
`
`m:E.20¢".m.=..._
`
`GOOG-1001-Page 31 of 56
`
`GOOG-1001-Page 31 of 56
`
`
`
`
`
`
`U.S. Patent
`
`N0v.2, 1999
`
`Sheet 30 of 31
`
`5,978,791
`
`.\.N¢w
`
`mfido
`
`
`
`>n_OO:o.p<mom
`
`m.__.._u_O
`
`mmvm
`
`
`
`
`
`O...>m._.zm00.0.
`
`
`
`m.__u_.:a:<
`
`
`
`ozozm<:m.__u_mm;
`
`$s_<zmam»
`
`omvm
`
`m:m._mo
`
`m.__n_mam»
`
`BIN.o_..._
`
`w.m.__u_m:E.
`
`
`
`m.._.ZDOUmm:
`
`MZO
`
`55
`
`
`
`mm:moaomm
`
`M20E#2300
`
`GOOG-1001-Page 32 of 56
`
`GOOG-1001-Page 32 of 56
`
`
`
`
`
`
`U.S. Patent
`
`Nov. 2, 1999
`
`Sheet 31 of 31
`
`5,978,791
`
`
`
`mm0...Smaomm
`
`Smnm<2Eou_
`
`mmvm
`
`m>:.<omz
`
`mwzommmm
`
`Evm
`
`E229.
`
`«Sm
`
`n_m<2Eou.
`
`pmmaamm
`
`mmor.
`
`Nmvm
`
`n=§oo._
`
`m_z<zmamp
`
`02
`
`mm;
`
`ommmmmmzooMOQ.m..=u_
`wmo:._oz
`
`
`
`«D.m.__u_
`
`Slum
`
`m>Ewo._
`
`mmzoawmm
`
`GOOG-1001-Page 33 of 56
`
`GOOG-1001-Page 33 of 56
`
`
`
`
`
`
`
`
`
`
`5,978,791
`
`1
`DATA PROCESSING SYSTEM USING
`SUBSTANTIALLY UNIQUE IDENTIFIERS TO
`IDENTIFY DATA ITEMS, WHEREBY
`IDENTICAL DATA ITEMS HAVE THE SAME
`IDENTIFIERS
`
`This is a continuation of application Ser. No. 08/425,160,
`filed on Apr. 11, 1995, which was abandoned upon the filing
`hereof.
`
`BACKGROUND OF THE INVENTION
`
`1. Field of the Invention
`
`This invention relates to data processing systems and,
`more particularly, to data processing systems wherein data
`items are identified by substantially unique identifiers which
`depend on all of the data in the data items and only on the
`data in the data items.
`
`2. Background of the Invention
`Data processing (DP) systems, computers, networks of
`computers, or the like, typically offer users and programs
`various ways to identify the data in the systems.
`Users typically identify data in the data processing system
`by giving the data some form of name. For example, a
`typical operating system (OS) on a computer provides a file
`system in which data items are named by alphanumeric
`identifiers. Programs typically identify data in the data
`processing system using a location or address. For example,
`a program may identify a record in a file or database by using
`a record number which serves to locate that record.
`
`In all but the most primitive operating systems, users and
`programs are able to create and use collections of named
`data items, these collections themselves being named by
`identifiers. These named collections can then, themselves,
`be made part of other named collections. For example, an
`OS may provide mechanisms to group files (data items) into
`directories (collections). These directories can then, them-
`selves be made part of other directories. A data item may
`thus be identified relative to these nested directories using a
`sequence of names, or a so-called pathname, which defines
`a path through the directories to a particular data item (file
`or directory).
`As another example, a database management system may
`group data records (data items) into tables and then group
`these tables into database files (collections). The complete
`address of any data record can then be specified using the
`database file name, the table name, and the record number of
`that data record.
`
`Other examples of identifying data items include: identi-
`fying files in a network file system, identifying objects in an
`object-oriented database,
`identifying images in an image
`database, and identifying articles in a text database.
`In general, the terms “data” and “data item” as used herein
`refer to sequences of bits. Thus a data item may be the
`contents of a file, a portion of a file, a page in memory, an
`object in an object-oriented program, a digital message, a
`digital scanned image, a part of a video or audio signal, or
`any other entity which can be represented by a sequence of
`bits. The term “data processing” herein refers to the pro-
`cessing of data items, and is sometimes dependent on the
`type of data item being processed. For example, a data
`processor for a digital image may differ from a data pro-
`cessor for an audio signal.
`In all of the prior data processing systems the names or
`identifiers provided to identify data items (the data items
`being files, directories, records in the database, objects in
`
`10
`
`15
`
`20
`
`25
`
`30
`
`35
`
`40
`
`45
`
`50
`
`55
`
`60
`
`65
`
`2
`object-oriented programming, locations in memory or on a
`physical device, or the like) are always defined relative to a
`specific context. For instance, the file identified by a par-
`ticular file name can only be determined when the directory
`containing the file (the context) is known. The file identified
`by a pathname can be determined only when the file system
`(context) is known. Similarly, the addresses in a process
`address space, the keys in a database table, or domain names
`on a global computer network such as the Internet are
`meaningful only because they are specified relative to a
`context.
`
`In prior art systems for identifying data items there is no
`direct relationship between the data names and the data item.
`The same data name in two different contexts may refer to
`different data items, and two different data names in the
`same context may refer to the same data item.
`In addition, because there is no correlation between a data
`name and the data it refers to, there is no a priori way to
`confirm that a given data item is in fact the one named by a
`data name. For instance, in a DP system, if one processor
`requests that another processor deliver a data item with a
`given data name,
`the requesting processor cannot,
`in
`general, verify that the data delivered is the correct data
`(given only the name). Therefore it may require further
`processing, typically on the part of the requestor, to verify
`that the data item it has obtained is,
`in fact,
`the item it
`requested.
`A common operation in a DP system is adding a new data
`item to the system. When a new data item is added to the
`system, a name can be assigned to it only by updating the
`context in which names are defined. Thus such systems
`require a centralized mechanism for the management of
`names. Such a mechanism is required even in a multi-
`processing system when data items are created and identified
`at separate processors in distinct locations, and in which
`there is no other need for communication when data items
`are added.
`
`In many data processing systems or environments, data
`items are transferred between different
`locations in the
`
`system. These locations may be processors in the data
`processing system, storage devices, memory, or the like. For
`example, one processor may obtain a data item from another
`processor or from an external storage device, such as a
`floppy disk, and may incorporate that data item into its
`system (using the name provided with that data item).
`However, when a processor (or some location) obtains a
`data item from another location in the DP system,
`it is
`possible that this obtained data item is already present in the
`system (either at the location of the processor or at some
`other location accessible by the processor) and therefore a
`duplicate of the data item is created. This situation is
`common in a network data processing environment where
`proprietary software products are installed from floppy disks
`onto several processors sharing a common file server. In
`these systems, it is often the case that the same product will
`be installed on several systems, so that several copies of
`each file will reside on the common file server.
`
`In some data processing systems in which several pro-
`cessors are connected in a network, one system is designated
`as a cache server to maintain master copies of data items,
`and other systems are designated as cache clients to copy
`local copies of the master data items into a local cache on an
`as-needed basis. Before using a cached item, a cache client
`must either reload the cached item, be informed of changes
`to the cached item, or confirm that the master item corre-
`sponding to the cached item has not changed. In other words,
`
`GOOG-1001-Page 34 of 56
`
`GOOG-1001-Page 34 of 56
`
`
`
`5,978,791
`
`3
`a cache client must synchronize its data items with those on
`the cache server. This synchronization may involve reload-
`ing data items onto the cache client. The need to keep the
`cache synchronized or reload it adds significant overhead to
`existing caching mechanisms.
`In view of the above and other problems with prior art
`systems, it is therefore desirable to have a mechanism which
`allows each processor in a multiprocessor system to deter-
`mine a common and substantially unique identifier for a data
`item, using only the data in the data item and not relying on
`any sort of context.
`It is further desirable to have a mechanism for reducing
`multiple copies of data items in a data processing system and
`to have a mechanism which enables the identification of
`
`identical data items so as to reduce multiple copies. It is
`further desirable to determine whether two instances of a
`
`data item are in fact the same data item, and to perform
`various other systems’ functions and applications on data
`items without relying on any context information or prop-
`erties of the data item.
`
`It is also desirable to provide such a mechanism in such
`a way as to make it
`transparent
`to users of the data
`processing system, and it is desirable that a single mecha-
`nism be used to address each of the problems described
`above.
`
`SUMMARY OF THE INVENTION
`
`This invention provides, in a data processing system, a
`method and apparatus for identifying a data item in the
`system, where the identity of the data item depends on all of
`the data in the data item and only on the data in the data item.
`Thus the identity of a data item is independent of its name,
`origin, location, address, or other information not derivable
`directly from the data, and depends only on the data itself.
`This invention further provides an apparatus and a method
`for determining whether a particular data item is present in
`the system or at a location in the system, by examining only
`the data identities of a plurality of data items.
`Using the method or apparatus of the present invention,
`the efficiency and integrity of a data processing system can
`be improved. The present invention improves the design and
`operation of a data storage system, file system, relational
`database, object-oriented database, or the like that stores a
`plurality of data items, by making possible or improving the
`design and operation of at least some or all of the following
`features:
`
`the system stores at most one copy of any data item at a
`given location, even when multiple data names in the system
`refer to the same contents;
`the system avoids copying data from source to destination
`locations when the destination locations already have the
`data;
`the system provides transparent access to any data item by
`reference only to its identity and independent of its present
`location, whether it be local, remote, or offline;
`the system caches data items from a server, so that only
`the most recently accessed data items need be retained;
`when the system is being used to cache data items,
`problems of maintaining cache consistency are avoided;
`the system maintains a desired level of redundancy of data
`items in a network of servers, to protect against failure by
`ensuring that multiple copies of the data items are present at
`different locations in the system;
`the system automatically archives data items as they are
`created or modified;
`
`10
`
`15
`
`20
`
`25
`
`30
`
`35
`
`40
`
`45
`
`50
`
`55
`
`60
`
`65
`
`4
`the system provides the size, age, and location of groups
`of data items in order to decide whether they can be safely
`removed from a local file system;
`the system can efficiently record and preserve any col-
`lection of data items;
`the system can efficiently make a copy of any collection
`of data items, to support a version control mechanism for
`groups of the data items;
`the system can publish data items, allowing other, possi-
`bly anonymous, systems in a network to gain access to the
`data items and to rely on the availability of the data items;
`the system can maintain a local inventory of all the data
`items located on a given removable medium, such as a
`diskette or CD-ROM, the inventory is independent of other
`properties of the data items such as their name, location, and
`date of creation;
`the system allows closely related sets of data items, such
`as matching or corresponding directories on disconnected
`computers,
`to be periodically resynchronized with one
`another;
`the system can verify that data retrieved from another
`location is the desired or requested data, using only the data
`identifier used to retrieve the data;
`the system can prove possession of specific data items by
`content without disclosing the content of the data items, for
`purposes of later legal verification and to provide anonym-
`ity;
`the system tracks possession of specific data items accord-
`ing to content by owner, independent of the name, date, or
`other properties of the data item, and tracks the uses of
`specific data items and files by content for accounting
`purposes.
`
`Other objects, features, and characteristics of the present
`invention as well as the methods of operation and functions
`of the related elements of structure, and the combination of
`parts and economies of manufacture, will become more
`apparent upon consideration of the following description
`and the appended claims with reference to the accompany-
`ing drawings, all of which form a part of this specification.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`FIGS. 1(a) and 1(b) depicts a typical data processing
`system in which a preferred embodiment of the present
`invention operates;
`FIG. 2 depicts a hierarchy of data items stored at any
`location in such a data processing system;
`FIGS. 3-9 depict data structures used to implement an
`embodiment of the present invention; and
`FIGS. 10(a)—28 are flow charts depicting operation of
`various aspects of the present invention.
`DETAILED DESCRIPTION OF THE
`PRESENTLY PREFERRED EXEMPLARY
`EMBODIMENTS
`
`An embodiment of the present invention is now described
`with reference to a typical data processing system 100,
`which, with reference to FIGS. 1(a) and 1(b), includes one
`or more processors (or computers) 102 and various storage
`devices 104 connected in some way, for example by a bus
`106.
`
`Each processor 102 includes a CPU 108, a memory 110
`and one or more local storage devices 112. The CPU 108,
`memory 110, and local storage device 112 may be internally
`connected, for example by a bus 114. Each processor 102
`
`GOOG-1001-Page 35 of 56
`
`GOOG-1001-Page 35 of 56
`
`
`
`5,978,791
`
`5
`
`may also include other devices (not shown), such as a
`keyboard, a display, a printer, and the like.
`In a data processing system 100, wherein more than one
`processor 102 is used, that is, in a multiprocessor system, the
`processors may be in one of various relationships. For
`example,
`two processors 102 may be in a client/server,
`client/client, or a server/server relationship. These inter-
`processor relationships may be dynamic, changing depend-
`ing on particular situations and functions. Thus, a particular
`processor 102 may change its relationship to other proces-
`sors as needed, essentially setting up a peer-to-peer relation-
`ship with other processors. In a peer-to-peer relationship,
`sometimes a particular processor 102 acts as a client
`processor, whereas at other times the same processor acts as
`a server processor. In other words, there is no hierarchy
`imposed on or required of processors 102.
`In a multiprocessor system, the processors 102 may be
`homogeneous or heterogeneous. Further, in a multiprocessor
`data processing system 100, some or all of the processors
`102 may be disconnected from the network of processors for
`periods of time. Such disconnection may be part of the
`normal operation of the system 100 or it may be because a
`particular processor 102 is in need of repair.
`Within a data processing system 100, the data may be
`organized to form a hierarchy of data storage elements,
`wherein lower level data storage elements are combined to
`form higher level elements. This hierarchy can consist of, for
`example, processors, file systems, regions, directories, data
`files, segments, and the like. For example, with reference to
`FIG. 2, the data items on a particular processor 102 may be
`organized or structured as a file system 116 which comprises
`regions 117, each of which comprises directories 118, each
`of which can contain other directories 118 or files 120. Each
`
`file 120 being made up of one or more data segments 122.
`In a typical data processing system, some or all of these
`elements can be named by users given certain implementa-
`tion specific naming conventions, the name (or pathname) of
`an element being relative to a context. In the context of a
`data processing system 100, a pathname is fully specified by
`a processor name, a filesystem name, a sequence of zero or
`more directory names identifying nested directories, and a
`final file name. (Usually the lowest level elements, in this
`case segments 122, cannot be named by users.)
`In other words, a file system 116 is a collection of
`directories 118. Adirectory 118 is a collection of named files
`120—both data files 120 and other directory files 118. Afile
`120 is a named data item which is either a data file (which
`may be simple or compound) or a directory file 118. A
`simple file 120 consists of a single data segment 122. A
`compound file 120 consists of a sequence of data segments
`122. A data segment 122 is a fixed sequence of bytes. An
`important property of any data segment
`is its size,
`the
`number of bytes in the sequence.
`A single processor 102 may access one or more file
`systems 116, and a single storage device 104 may contain
`one or more file systems 116, or portions of a file system 116.
`For instance, a file system 116 may span several storage
`devices 104.
`
`In order to implement controls in a file system, file system
`116 may be divided into distinct regions, where each region
`is a unit of management and control. A region consists of a
`given directory 118 and is ident