`Mattis et al.
`
`[19]
`
`111111111111111111111111111111111111111111111111111111111111111111111111111
`US006128623A
`[11] Patent Number:
`[45] Date of Patent:
`
`6,128,623
`Oct. 3, 2000
`
`[54] HIGH PERFORMANCE
`
`OBJECT CACHE
`
`o 834 819 A2
`
`4/1998 European Pat. Off.
`
`[75]
`
`Inventors: Peter Mattis, Belmont; John Plevyak,
`San Francisco, both of Calif.; Matthew
`Haines, Laramie, Wyo.; Adam
`Beguetin, San Mateo, Calif.; Brian
`Totty, Foster City, Calif.; David
`Gourley, Palo Alto, Calif.
`
`[73]
`
`Assignee:
`
`Inktomi Corporation,
`Calif.
`
`San Mateo,
`
`[21]
`[22]
`[51]
`[52]
`[58]
`
`App!. No.: 09/060,866
`
`Apr. 15, 1998
`
`Filed:
`Int. C1.7
`U.S. CI.
`Field of Search
`
`.................... G06F 17/30
`707/103; 707/100; 711/118
`.....................................707/100, 103;
`711/118, 119, 120, 121, 122
`
`[56]
`
`References Cited
`U.S. PATENT DOCUMENTS
`7/1996 Neimat et al. ..
`5,542,087
`5,611,049
`3/1997 Pitts
`12/1997 Wong et al.
`5,701,432
`3/1998 Kikinis
`5,727,159
`5,732,267
`3/1998 Smith
`5/1998 Mauldin
`5,748,954
`5,826,242
`10/1998 Montulli
`12/1998 Bhide et al.
`5,852,717
`5,864,852
`1/1999 Luotonen
`5,870,763
`2/1999 Lornet
`2/1999 Copeland, et al.
`5,872,969
`3/1999 Singh et al.
`5,881,229
`6/1999 Wong et al.
`5,909,695
`5,924,116
`7/1999 Aggarwal et aJ.
`8/1999 Logue et al.
`5,935,207
`9/1999 Aviani,Jr
`5,950,205
`10/1999 Krishnaswarnyet al.
`5,974,421
`11/1999 Humphrey
`5,987,233
`FOREIGN PATENT DOCUMENTS
`o 811 939 A2 12/1997 European Pat. Off.
`
`....... 707/103
`395/200.09
`395/457
`395/200.76
`713/1
`395/610
`705/27
`395/200.33
`707/10
`707/202
`395/671
`395/200.33
`711/133
`711/122
`709/219
`707/103
`707/103
`395/200.47
`
`..
`
`Primary Examiner-Thomas G. Black
`Assistant Examiner-Frantz Coby
`Attorney, Agent, or Firm-McDermott, Will & Emery
`
`[57]
`
`ABSTRACT
`
`is
`is disclosed. The cacbe
`cache
`A higb-performance
`designed for time- and space-efficiency for a diverse range
`of information objects.
`Information objects are stored in
`portions of a non-volatile
`storage device called arenas,
`which are contiguous regions from wbicb space is allocated
`in parallel, Objects are substantially contiguously allocated
`within an arena and are mapped by name keys and content-
`based object keys to a tag table, an open directory, and a
`directory table. Tbe tag table is indexed by tbe uame keys,
`and stores references to sets in tbe directory table. The tag
`table is compact and therefore can be stored in fast main
`memory,
`facilitating rapid lookups. The directory table is
`organized so tbat at least a frequeutly-accessed
`portion of it
`also usually resides
`in fast main memory, wbicb further
`speeds lookups. The tag and directory tables are organized
`to quickly determine non-presence of objects. Large objects
`may be chunked into fragments, which are chained using a
`forward functional-iteration mechanism,
`to prevent the need
`for mutating existing on-disk data structures. Garbage col-
`lection periodically moves objects within an arena or to
`other arenas so that
`inactive objects are deleted and free
`space becomes contiguous. Because tbe objects are substan-
`tially contiguously allocated, reading and writing an typical
`object requires only one or two disk head actuator move-
`ments;
`tbus,
`the cacbe can efficiently and smoothly stream
`data off of the storage device, providing optimal delivery of
`multimedia
`objects. Tbe disclosure
`also encompasses
`a
`computer apparatus, computer program product, and com-
`puter data signal embodied in a carrier wave tbat are
`similarly configured.
`
`22 Claims, 26 Drawing Sheets
`
`~ B
`
`I
`
`Disklocation 118.
`
`Size 120
`
`\[~-
`
`
`
`I~l---l~~
`
`MICROSOFT
`
`EXHIBIT 1015
`
`Page 1 of 49
`
`
`
`u.s. Patent
`
`Oct. 3, 2000
`
`Sheet 1 of 26
`
`6,128,623
`
`FIGURE 1
`
`PRIOR ART
`
`Network 20
`
`3Q
`Proxy
`
`4.0.
`Server
`
`iDa
`Client
`
`1Qh
`Client
`
`Page 2 of 49
`
`
`
`u.s. Patent
`
`Oct. 3, 2000
`
`Sheet 2 of 26
`
`6,128,623
`
`:::r~
`LC
`
`9.O.b
`
`I
`
`9O.n
`
`•..•.
`
`C
`
`.•...
`
`L.....-.+
`
`...••
`.:::;
`
`...••
`
`Page 3 of 49
`
`....
`
`..::::
`
`.9Qa
`
`III
`
`(
`
`Proxy
`
`3.Q
`
`ZQ
`Protocol
`Engine
`
`FIGURE 2
`
`L
`
`BQ Cache
`
`I r!es
`
`I
`
`...
`110Core
`
`6Q
`
`i 2
`
`U
`Internet
`
`/ Reques~r object /
`
`J
`
`1D.a
`Client
`
`
`
`u.s. Patent
`
`Oct. 3, 2000
`
`Sheet 3 of 26
`
`6,128,623
`
`FIGURE 3A
`
`52
`Object
`
`SA
`Hash function
`
`fiB. Set Subkey
`
`sa Tag
`Subkey
`
`5fi: Remainder Subkey
`
`(Object Key 56
`
`FIGURE 38
`
`sa
`Object name
`
`t S
`
`A
`Hash function
`
`+
`
`1SUb~Y11/ Su:y
`
`21
`
`62
`Name Key
`
`Page 4 of 49
`
`
`
`ee
`='" ..•~N
`
`~N~
`
`••• o..., N
`tt~
`J).g
`
`0\
`
`~~
`
`of
`
`l
`
`Nooo r
`
`StoredObject
`
`124
`
`ObjectKey
`ResponseInfo
`RequestInfo
`123.a
`
`IVectorofAlternates122hI
`IVectorofAlternates122aI
`131n
`
`="
`
`""
`
`""~
`
`~~ ~~ "
`
`132a
`
`FIGURE4A
`
`Page 5 of 49
`
`Disklocation118.
`
`\:0 Subkey116
`
`,\
`
`,
`,
`,
`,
`,
`'
`
`'
`
`,
`
`,
`,
`
`\\\JSize120
`
`112n
`
`DirectoryTable110
`
`TagTable102
`
`'
`,
`'
`,
`,
`,
`
`'
`
`'
`
`'
`
`,\
`
`104n
`
`106
`
`,\
`
`"
`"
`"
`
`vv
`
`"
`
`'-
`
`I,
`
`1
`
`1.1.Qc
`
`1.1O.h
`
`11Qa
`
`110d
`
`112a
`
`1Mb
`
`10Aa
`
`
`
`u.s. Patent
`
`Oct. 3, 2000
`
`Sheet 5 of 26
`
`6,128,623
`
`FIGURE 48
`
`Storage device 90a
`
`Vector of Alternates 122a
`123b
`123a
`"IE4", 'IE4',
`'Enqllsh",
`'English',
`Object Key
`Object Key
`
`123c
`'Navigator,'
`Object Key
`
`.. - Stored Object 124a
`
`r-----------------------------------------------------
`
`l_
`
`- - ~1...__
`
`1_2_2_b__
`
`H Stored Object 124b HI...
`
`,
`J
`
`12_2_n __
`
`Page 6 of 49
`
`
`
`u.s. Patent
`
`Oct. 3, 2000
`
`/
`
`Sheet 6 of 26
`
`6,128,623
`
`DIRECTORY TABLE 110
`
`Page 7 of 49
`
`110a
`
`•• ••••
`
`-- ----------~----------
`•
`
`-- ----------~----------
`
`•••••••••
`
`110n
`
`SEGMENT 111a
`
`SEGMENT 111b
`
`SEGMENT 111c
`
`
`
`u.s. Patent
`
`Oct. 3, 2000
`
`Sheet 7 of 26
`
`6,128,623
`
`FIGURE 5
`
`Directory Table 110
`,---------------------l
`
`I I I
`
`Tail
`141e
`
`I Subkey 132 I
`
`Head
`-:-----..t 141a ,------,::=~
`
`I I I I I I I I I I I I I I II I I I I I I I I~
`
`141d
`
`141b
`
`---------.
`
`-.-..-..,~
`
`Fragment 208c
`
`Fragment 208a
`
`13Gb ------+--I •.•Fragment 2G8b
`
`---128c
`
`----------------------
`
`Page 8 of 49
`
`
`
`u.s. Patent
`
`Oct. 3, 2000
`
`Sheet 8 of 26
`
`6,128,623
`
`- - -
`
`,,
`
`, ,
`
`,
`
`, ,
`
`-,
`
`, ,
`
`, ,
`210
`
`Storage Device 9.Qa
`
`FIGURE 6
`
`Pool 2QQa
`
`Pool 2llilb.
`
`Fragment
`
`Fragment
`
`- --
`
`Fragment
`
`~208a
`
`,
`
`r
`
`\.208n
`
`~,
`
`\.208b
`
`r
`
`,
`,
`,
`,,,
`
`,,,
`
`2
`~
`
`I
`I
`
`J
`
`II I
`
`I \
`
`\
`
`\
`
`\
`
`,..•.
`
`Page 9 of 49
`
`208d Fragment
`Header
`
`2O.8.eFragment Data
`
`"
`
`" 1
`
`2illk Magic 11206a
`
`Key 112O£.b.Length I
`
`202 Pool Header
`Magic
`Version No.
`No. of Arenas
`20fia Arena Header
`Empty
`Corrupt
`
`,
`2.06.0. Arena Header
`Empty
`Corrupt
`
`
`
`u.s. Patent
`
`Oct. 3, 2000
`
`Sheet 9 of 26
`
`6,128,623
`
`FIGURE 7
`
`210
`
`I~
`
`204a
`(
`
`2QB..d Fragment
`Header
`
`2O.8..e Fragment
`Data
`
`Page 10 of 49
`
`
`
`u.s. Patent
`
`Oct. 3,2000
`
`Sheet 10 of 26
`
`6,128,623
`
`FIGURE 8A
`
`803
`
`"Low water mark"
`"High water mark"
`
`Last arena accessed
`Number of accesses
`
`Garbage
`Collection
`
`8.D2
`Select pool
`
`8M.
`Select arena
`
`8Qfi
`Select fragment
`
`Selection factors
`
`807
`
`YES
`
`8J2
`Mark fragment as
`deleted
`
`8.10.
`Write fragment to new
`arena
`
`ill
`Update Directory Table
`
`Page 11 of 49
`
`
`
`u.s. Patent
`
`Oct. 3, 2000
`
`Sheet 11 of 26
`
`6,128,623
`
`Write fragment
`to new arena
`...---...!--.....•578
`
`FIGURE 88
`
`90
`
`Update
`Open
`Index
`
`Step 806
`
`Write object data 1-----1
`in selected Arena
`
`Page 12 of 49
`
`
`
`u.s. Patent
`
`Oct. 3,2000
`
`Sheet 12 of 26
`
`6,128,623
`
`820
`
`822
`
`824
`
`Synchronization
`
`Write object to
`cache
`
`Create metadata
`in Open Directory
`
`1-----826
`
`Write and synch
`object data to disk
`
`1---028
`
`Get next metadata
`from Open Directory
`
`FIGURE8C
`
`B2l
`Garbage collection
`
`82.9.
`For each fragment in Arena, delete
`directory metadata that points to the
`fragment and write it to disk
`
`83.1
`Modify pool header in
`memory to show Arena as
`empty
`
`8.2.D:
`Write pool header and
`sync to disk
`
`aaa
`Copy metadata from Open
`Directory to Directory Table
`
`825.
`Sync changes
`
`Page 13 of 49
`
`
`
`u.s. Patent
`
`Oct. 3,2000
`
`Sheet 13 of 26
`
`6,128,623
`
`FIGURE 80
`
`Open Directory:
`'checkout_read"
`
`8..32
`Get lock for key
`YES
`
`NO
`
`8..3.4.
`Compute bucket for
`key
`
`ass
`Scan blocks in
`bucket's list for key
`FOUND
`
`NOT FOUND
`
`B3B.
`Is block being created
`or destroyed?
`
`B4.O.
`Return 'not available'
`error
`
`842
`Increment block
`reader count
`
`8M
`Return copy of open
`directory block
`
`Ma
`Return 'does not exist'
`error
`
`NOT FOUND
`
`BA.6.
`Cache Directory
`lookup
`FOUND
`
`B..5.Q
`Create Open Directory
`block; initialize block
`from Directory Index;
`add block to bucket's
`block list
`
`Page 14 of 49
`
`
`
`u.s. Patent
`
`Oct. 3,2000
`
`Sheet 14 of 26
`
`6,128,623
`
`FIGURE BE
`
`Key
`
`Open Directory:
`'checkout_write"
`
`B.5.4.
`Get lock for key
`YES
`
`NO
`
`B.5.6.
`Compute bucket
`key
`
`for
`
`8.5.8.
`Scan blocks in
`bucket's list for key
`
`864
`Is block being created
`or destroyed?
`
`84.Q
`Return 'not available"
`error
`
`M2.
`Increment block
`reader count
`
`844.
`Return copy of open
`directory block
`
`aez
`Return 'does not exist"
`error
`
`NOT FOUND
`
`aso
`Cache Directory
`lookup
`FOUND
`
`8.5.Q
`Create Open Directory
`block; initialize block
`from Directory Index;
`add block to bucket's
`block list
`
`Page 15 of 49
`
`
`
`u.s. Patent
`
`Oct. 3,2000
`
`Sheet 15 of 26
`
`6,128,623
`
`FIGURE 8F
`
`l
`
`r
`
`Key
`
`1'"E----1,
`
`NO
`
`J
`
`Open Directory:
`'checkout_create"
`
`r
`ill
`Get lock for key
`~YES
`8lB.
`Compute bucket for
`key
`
`NOT FOUND
`
`8.8..0.
`Scan blocks in
`bucket's list for key
`
`FOUND
`
`B..9..6.
`Return "already exists"
`error
`
`FOUND
`
`B.94
`Cache Directory
`lookup
`
`NOT FOUND
`
`BB2
`Is block deleted and has no
`writers or readers?
`
`8M
`NO
`.>----~ Return "not available"
`error
`
`\YES
`8.8..6.
`Increment block
`reader count
`
`8.8B.
`Initialize block from
`key
`
`\
`a9..8
`Create Open Directory
`block; initialize block
`from Directory Index;
`add block to bucket's
`block list
`
`asa
`Increment block writer count
`and mark copy as created
`I
`L-
`
`a92
`Return copy of open directory
`block and mark block as
`being modified
`
`----:~
`
`r;..{
`--.-J
`"-
`
`Done
`
`Page 16 of 49
`
`
`
`u.s. Patent
`
`Oct. 3,2000
`
`Sheet 16 of 26
`
`6,128,623
`
`9.02
`LOOKUP
`
`9lli1
`Convert object name
`in request
`to key value
`
`sos
`Look up request key
`value in Open Directory
`
`912
`Look up request key
`value in Tag Table
`
`aia
`Store Tag Table miss
`
`FIGURE9A
`
`91B
`Look up key value in
`Directory Table
`
`YES
`~~
`
`910
`Obtain object and
`deliver to client
`
`922
`Update Open
`Directory if needed
`
`m
`Cache Miss
`
`szs
`Obtain object from
`source server
`
`92B
`Done
`
`Page 17 of 49
`
`
`
`u.s. Patent
`
`Oct. 3,2000
`
`Sheet 17 of 26
`
`6,128,623
`
`Open Directory
`block checkin
`
`....--
`
`\,
`9.3.Q
`Get lock for key
`YES
`
`\1
`9J2
`Is block being
`modified?
`NO
`\'
`94.6.
`Decrement
`count
`
`reader
`
`FIGURE 98
`
`r
`I
`
`Block
`
`--
`--
`
`NO
`
`.,
`
`YES ••.
`
`9.3A
`Decrement writer count
`
`\1
`ass
`Is checkin successful?
`
`NO
`
`M2
`YES •..
`". Copy block info back to
`original block
`I
`
`NO
`
`'1/
`
`~
`
`\'
`93.a
`Is delete checkin flag
`set?
`YES
`
`8M
`Mark block as deleted
`....1
`"'"'\f
`
`9A!l Done
`
`Page 18 of 49
`
`
`
`u.s. Patent
`
`Oct. 3,2000
`
`Sheet 18 of 26
`
`6,128,623
`
`FIGURE 9C
`
`-
`
`Key
`
`-
`
`95D.
`..•....
`." Return not found
`
`r-
`Cache lookup \.."
`
`,I
`
`BAa
`Check out directory
`
`blocks,
`
`I
`9M
`Check in directory
`
`blocks,v
`~---,v
`
`ass
`Return found
`
`9.52 Done
`
`Page 19 of 49
`
`
`
`u.s. Patent
`
`Oct. 3,2000
`
`Sheet 19 of 26
`
`6,128,623
`
`FIGURE 90
`
`Key
`
`FAILURE
`
`9..5.8
`Check out directory
`blocks
`SUCCESS
`
`954
`Check in directory blocks
`with deletion flag set
`
`Page 20 of 49
`
`
`
`u.s. Patent
`
`Oct. 3,2000
`
`Sheet 20 of 26
`
`6,128,623
`
`Cache open read
`
`•.••<E----~
`
`FIGURE 9E
`
`-
`
`Object name
`Request info
`
`FAILURE
`964
`Check out vector for 1---------...,
`reading
`SUCCESS
`\1
`
`\1
`91.2
`Return no document
`error
`
`FAILURE
`
`~
`
`9fifi
`Use request inf0 to select
`alternate from vector
`SUCCESS
`~I
`96B.
`Check in vector
`
`9lQ
`Read object pointed to
`by alternate
`
`Page 21 of 49
`
`
`
`u.s. Patent
`
`Oct. 3,2000
`
`Sheet 21 of 26
`
`6,128,623
`
`Cache open write -
`
`FIGURE 9F
`
`Object name
`Request info
`Response info
`
`FAILURE
`
`,II
`98A
`Create vector
`
`9.B2
`FAILURE
`I-L..----~:::: Return no document
`error
`
`ill
`Check out vector for
`writing
`SUCCESS
`
`--
`v
`91.6.
`Use request info to define
`alternate for vector; add
`alternate to vector
`SUCCESS
`,II
`9I8.
`Check in vector
`
`9B.O.
`Write object pointed to
`by alternate
`
`Page 22 of 49
`
`
`
`u.s. Patent
`
`Oct. 3,2000
`
`Sheet 22 of 26
`
`6,128,623
`
`FIGURE 9G
`
`/
`
`'-
`
`Object name
`Old identifier
`Request
`info
`Response info
`
`"""\
`
`./
`
`FAILURE
`
`....
`
`--
`
`sse
`Return not available
`error
`
`--
`
`~
`
`Cache update
`
`W
`9.8.6.
`Check out vector for
`writing
`SUCCESS
`'V
`9.B.B.
`Construct new clone
`vector
`\V
`.9.9.Q
`Write new clone vector
`
`~
`.9.92
`Check in clone vector
`
`"
`
`9..9.4.Done
`
`Page 23 of 49
`
`
`
`u.s. Patent
`
`Oct. 3,2000
`
`Sheet 23 of 26
`
`6,128,623
`
`FIGURE 10A
`
`65
`
`,----l~ Request Arena with
`available space
`
`sse
`Store contents of write
`aggregation buffer in Arena
`
`660
`
`Update index
`
`640
`
`642
`
`Allocation
`and Write Object
`
`Look up and retrieve
`object from origin
`server
`
`Get object length
`from HTTP request,
`present
`
`if
`
`648
`
`Store object in write
`aggregation buffer
`
`fi.5.Q
`Write
`aggregation
`buffer full?
`
`Page 24 of 49
`
`
`
`u.s. Patent
`
`Oct. 3,2000
`
`Sheet 24 of 26
`
`6,128,623
`
`FIGURE 108
`
`620
`
`622
`
`624
`
`626
`
`Scaled Counter
`Updating
`
`Sum all
`counter values
`
`Compute Maximum
`Value that can be
`represented by a counter
`
`Decrement all
`counter values by 1
`
`632
`
`Page 25 of 49
`
`
`
`ee
`..•~N
`='"
`
`~N~
`
`tv0\
`(J1e..,
`=- tttt-tv
`
`'JJ
`
`112.4
`
`HOST
`
`1122
`
`NETWORK
`
`LOCAL
`
`~~
`":••
`
`o I
`
`tv===
`
`="
`
`""
`
`""~
`
`~~ ~~"
`
`1126
`
`1128
`
`ISP
`
`INTERNET
`
`113.Q
`
`SERVERh
`
`1120
`
`I
`
`1100I
`
`111.B.
`
`----------
`
`ILINK
`INETWORK
`
`INTERFACE
`COMMUNICATION
`
`I I I I I I II I I II I II I I I I I
`
`.111Q
`
`11.Qft
`
`DEVICE
`
`STORAGE
`
`BUS
`
`<'7
`
`OM
`
`oR
`
`:r
`
`~
`
`~
`
`llil2
`
`<'7
`
`L;;.
`
`---------
`
`llQ4
`PROCESSOR
`
`'7
`
`~
`
`~
`
`'«./
`
`11DB.
`
`~',>-
`
`MEMORY
`
`MAIN
`
`I I I II I I I II
`
`I I I II I II I I I I
`
`---------------------------
`
`FIGURE11
`
`Page 26 of 49
`
`1116.1"1
`
`A
`
`CONTROL
`CURSOR
`
`~
`
`INPUTDEV~h
`
`DISPLA:mI¢=ll
`
`
`
`u.s. Patent
`
`Oct. 3,2000
`
`Sheet 26 of 26
`FIGURE 12
`
`6,128,623
`
`1212
`(Information object has
`not been loaded recently)
`
`1lli
`Tell process "no"
`
`.12.1fi
`Update expiration date value
`of information object to the
`current date or time
`
`Re- Validation
`
`12.Q2
`External process asks cache
`whether an object has been
`loaded by a client recently
`
`12M
`Cache locates the information
`object in the cache
`
`12.0.6.
`Read a Read Counter value
`associated in directory tables
`with the information object
`
`NO
`
`1210
`Tell process "yes"
`
`Page 27 of 49
`
`
`
`acting as a
`service,
`gateway
`a "middleman"
`server provides
`server to the client, and a client to tbe server. A proxy server
`equipped witb a cache is called a caching proxy server, or
`commonly, a "proxy cache".
`The proxy cache 30 intercepts requests for resources that
`are directed from the clients lOa, lOb to the server 40. When
`tbe cacbe in tbe proxy 30 has a replica of the requested
`resource that meets certain freshness constraints,
`the proxy
`responds
`to the clients lOa,
`lOb and serves the resource
`10 directly. In this arrangement,
`the number and volume of data
`transfers along the link 42 are greatly reduced. As a result,
`network resources or objects are provided more rapidly to
`tbe clients lOa,
`lOb.
`A key problem in sucb caching is tbe efficient storage,
`15 location, and retrieval of objects in tbe cache. This document
`concerns
`technology related to the storage,
`location, and
`retrieval of multimedia objects within a cache. The object
`storage facility witbin a cache is called a "cache object
`store" or "object store".
`such as
`To effectively handle beavy traffic environments,
`tbe World Wide Web, a cache object store needs to be able
`to handle tens or bundreds of millions of different objects,
`while storing, deleting, and fetcbing tbe objects simulta-
`25 neously. Accordingly, cache performance must not degrade
`significantly witb object count. Performance
`is tbe driving
`goal of cacbe object stores.
`Finding an object
`in the cache is the most common
`operation and therefore the cache must be extremely fast in
`that
`limits cache
`30 carrying out searches. The key factor
`performance is 1001-'Uptime. It is desirable to have a cache
`tbat can determine whether an object is in the cache (a "hit")
`or not (a "miss") as fast as possible.
`In past approaches,
`caches capable of storing millions of objects have been
`stored in traditional
`file system storage structures. Tradi-
`tional file systems are poorly suited for multimedia object
`caches because they are tuned for particular object sizes and
`require multiple disk head movements to examine file sys-
`tem metadata. Object stores can obtain higher lookup per-
`formance by dedicating DRAM memory to the task of object
`1001-'Up,but because there are tens or hundreds of millions of
`objects,
`the memory lookup tables must be very compact.
`Once an object
`is located,
`it must be transferred to the
`client efficiently. Modem disk drives offer high performance
`45 when reading and writing sequential data, but suffer signifi-
`cant performance delays when incurring disk head move-
`ments to other parts of the disk. These disk bead movements
`are called "seeks". Disk performance
`is typically con-
`strained by the drive's rated seeks per second. To optimize
`it is desirable to minimize disk
`50 performance of a cache,
`seeks, by reading and writing contiguous blocks of data.
`Eventually,
`the object store will become full, and particu-
`lar objects must be expunged to make room for new content.
`This process is called "garbage collection". Garbage collec-
`tion must be efficient enough that
`it can run continually
`without
`providing
`a significant
`decrease
`in system
`performance, while removing objects tbat have the least
`impact on future cache performance.
`Past Approaches
`In the past, four approaches have been used to structure
`cache object stores: using the native file system, using a
`memory-blocked "page" cache, using a database, and using
`a "cyclone"
`circular
`storage
`structure.
`Each of these prior
`approaches has significant disadvantages.
`The native file system approach uses the file system of an
`operating system running on the server to create and manage
`a cache. File systems are designed for a particular applica-
`
`20
`
`35
`
`40
`
`BACKGROUND OF THE INVENTION
`Several
`important computer
`technologies
`rely, to a great
`extent, upon rapid delivery of information from a central
`storage location to remote devices. For example,
`in the
`client/server model of computing, one or more servers are
`used to store information. Client computers or processes are
`separated from the servers and are connected to the servers
`using a network. The clients request information from one of
`tbe servers by providing a network address of the informa-
`tion. Tbe server locates tbe information based on the pro-
`vided network address and transmits it over tbe network to
`tbe client, completing the transaction.
`tbe
`The World Wide Web is a popular application of
`client/server computing model. FIG. 1 is a simplified block
`diagram of the relationship between elements used in a Web
`system. One or more web clients lOa, lOb, each of which is
`a computer or a software process such as a browser program,
`are connected to a global information network 20 called tbe
`Internet, eitber directly or tbrough an intermediary such as
`an Internet Service Provider, or an online information ser-
`vice.
`A web server 40 is likewise connected to the Internet 20
`by a network link 42. The web server 40 bas one or more
`internet
`network
`addresses
`and textual
`bost names,
`associ-
`ated in an agreed-upon format
`that
`is indexed at a central
`Domain Name Server (DNS). The server contains multime-
`dia information resources, sucb as documents and images, to
`be provided to clients upon demand. The server 40 may
`additionally or alternatively contain software for dynami-
`cally generating such resources in response to requests.
`The clients lOa,
`lOb and server 40 communicate using
`one or more agreed-upon protocols that specify the format of
`tbe information tbat is communicated. A client lOa looks up
`network address of a particular
`server using DNS and
`establisbes a connection to the server using a communica-
`tion protocol
`called
`the Hypertext
`Transfer
`Protocol
`(HTTP). A Uniform Resource Locator
`(URL) uniquely
`identifies eacb information object stored on or dynamically
`generated by the server 40. A URL is a form of network
`address that identifies the location of information stored in
`a network.
`limits tbe performance of tbe World
`that
`A key factor
`Wide Web is tbe speed with which the server 40 can supply
`information to a client via the Internet 20. Performance is
`limited by the speed, reliability, and congestion level of the
`network route through the Internet, by geograpbical distance
`delays, and by server load level. Accordingly, client
`trans-
`action time can be reduced by storing replicas of popular
`information objects in repositories geographically dispersed
`from the server. Each local repository for object replicas is 60
`generally referred to as a cache. A client may be able to
`access replicas from a topologically proximate cache faster
`tban possible from the original web server, wbile at the same
`time reducing Internet server traffic.
`In one arrangement,
`as sbown in FIG. 1, the cacbe is 65
`located in a proxy server 30 tbat
`is logically interposed
`between the clients lOa,
`lOb and the server 40. The proxy
`
`6,128,623
`
`2
`
`1
`HIGH PERFORMANCE OBJECT CACHE
`
`FIELD OF THE INVENTION
`invention relates to information delivery, and
`The present
`relates more specifically to a cache for information objects
`tbat are to be delivered efficiently and at high speed over a
`network to a client
`
`55
`
`Page 28 of 49
`
`
`
`6,128,623
`
`10
`
`15
`
`25
`
`30
`
`50
`
`55
`
`60
`
`3
`tion in mind: storing and retrieving user and system data
`files. File systems
`are designed
`and optimized
`for
`file
`management
`applications. They are optimized for typical
`data file sizes and for a relatively small number of files (both
`total and within one folder/directory). Traditional
`file sys-
`tems are not optimized to minimize the number of seeks to
`open, read/write, and close files. Many file systems incur
`significant performance penalties
`to locate and open files
`when tbere are large numbers of files present. Typical file
`systems suffer fragmentation, with small disk blocks scat-
`tered around the drive surface, increasing the number of disk
`seeks required to access data, and wasting storage space.
`Also,
`file systems,
`being designed
`for user data
`file
`management,
`include facilities
`irrelevant
`to cache object
`stores, and indeed counter-productive
`to this application.
`Examples include: support for random access and selective
`modification,
`file permissions,
`support
`for moving files,
`support for renaming files, and support for appending to files
`over time. File systems are also invest significant energy to
`minimize any data loss, at the e"l'ense of performance, both 20
`at write time, and to reconstruct
`the file system after failure.
`The result
`is that
`file systems
`are relatively poorly for
`handling the millions of files that can be present
`in a cache
`of Web objects. File systems don't efficiently support
`the
`large variation in Internet multimedia object size-in
`par-
`ticular
`they typically do not support very small objects or
`very large objects efficiently. File systems require a large
`number of disk seeks
`for mctadata
`traversal
`and block
`chaining, poorly support garbage collection, and take time to
`ensure data integrity and to repair file systems on restart.
`The page cache extends file systems with a set of fixed
`sized memory buffers. Data is staged in and out of these
`buffers before
`transmission
`across
`the network. This
`for large objects being
`approach wastes
`significant memory
`sent across
`slow connections.
`The database system approach uses a database system as
`a cache. Generally, databases are structured to achieve goals
`that make tbem inappropriate for use as an object cache. For
`example,
`they are structured to optimize transaction pro-
`cessing. To preserve the integrity of each transaction,
`they 40
`use extensive locking. As a result, as a design goal they favor
`data integrity over performance
`factors such as speed.
`In
`contrast,
`it is acceptable for an object cache to lose data
`occasionally,
`provided
`that
`the cache does not corrupt
`objects, because the data always can be retrieved from the 45
`server that is original source of the data. Databases are often
`optimized for
`fast write performance,
`since write speed
`limits
`transaction
`processing
`speed. However,
`in an object
`cache, read speed is equally important. Further, databases
`are not naturally good at storing a vast variety of object sizes
`while
`supporting
`streaming,
`pipelined
`I/O in a virtual
`memory efficient manner. Databases commonly optimized
`for fixed record size sizes. Where databases support variable
`record sizes,
`they contain support
`for maintaining object
`relationships
`that are redundant, and typically employ slow,
`virtual memory paging techniques
`to support
`streaming,
`pipelined I/O.
`file system, data is allocated around a
`In a cyclonic
`circular
`storage structure. When space becomes
`full,
`the
`oldest data is simply removed. This approach allows for fast
`allocation of data, but makes it difficult
`to support
`large
`objects without first staging them in memory, suffers prob-
`lems with fragmentation of data, and typically entails naive
`garbage collection that throws out the oldest object, regard-
`less of its popularity, For a modest, active cache with a 65
`diverse working set, sucb first-in-first-out garbage collection
`can throw objects out before they get to be reused.
`
`35
`
`4
`The fundamental problem with the above approaches for
`the design of cache object stores is that
`the solution isn't
`optimized
`for
`the constraints
`of
`the problem. These
`approaches all represent reapplication of existing tecbnolo-
`gies to a new application. None of the applications above are
`ideally suited for
`the unique constraints
`of multimedia,
`streaming, object caches. Not only do the above solutions
`inherently encumber object caches with inefficiencies due to
`tbeir
`imperfect
`reapplication, but
`they also are unable to
`effectively support
`the more unique requirements of multi-
`media object caches. These unique requirements include the
`ability to disambiguate and share redundant content
`that is
`identical, but has different names, and the opposite ability to
`store multiple variants of content witb the same name,
`targeted for particular clients,
`languages, data types, etc.
`Based on the foregoing,
`tbere is a clear need to provide an
`object cache that overcomes the disadvantages of these prior
`approaches,
`and is more ideally suited for
`the unique
`requirements of multimedia object caches. In particular:
`1.
`there is a need for an object
`store that can store
`hundreds of millions of objects of disparate sizes, and
`a terabyte of content size in a memory efficient manner;
`2. there is a need for an object store that can determine if
`a document
`is a "hit" or a "miss" quickly, without
`lime-consuming file directory lookups;
`3. there is a need for a cache that minimizes the number
`of disk seeks to read and write objects;
`4. there is a need for an object store tbat permits efficient
`streaming of data to and from the cache;
`5. there is a need for an object store that supports multiple
`different versions of targeted alternates
`for the same
`name;
`6. there is a need for an object store that efficiently stores
`large numbers of objects witbout content duplication;
`7. there is a need for an object store that can be rapidly and
`efficiently garbage collected in real-time,
`insightfully
`selecting the documents to be replaced to improve user
`response speed, and traffic reduction;
`8. there is a need for an object store that that can restart
`to full operational capacity within seconds after soft-
`ware or hardware failure without data corruption and
`with minimal data loss.
`This document concerns tecbnology directed to accom-
`In particular,
`plishing the foregoing goals.
`this document
`describes met bods and structures related to the time-efficient
`and space-efficient
`storage,
`retrieval, and maintenance of
`objects in a large object store. The technology described
`herein provides
`for a cache object
`store
`for a high-
`performance,
`high-load application having the following
`general characteristics:
`1. High performance, measured in low latency and high
`throughput
`for object store operations, and large num-
`bers of concurrent operations;
`2. Large cache support, supporting terabyte caches and
`billions of objects,
`to handle the Internet's exponential
`content growth rate;
`3. Memory storage space efficiency, so expensive semi-
`conductor memory is used sparingly and effectively;
`4. Disk storage space efficiency, so large numbers of
`Internet object replicas can be stored within the finite
`disk capacit y of the object store;
`5. Alias free, so that multiple objects or object variants,
`with different names, but with the same content
`iden-
`tical object content, will have tbe object content cached
`only once, shared among the different names;
`
`Page 29 of 49
`
`
`
`6,128,623
`
`6
`of the requested information object in association with such
`block, wherein the size value indicates a storage size of the
`list and the plurality of versions of the information object;
`and wherein step (D) comprises the step of reading the list
`and tbe plurality of versions concurrently. Yet another fea-
`ture is consolidating streaming data transfers of different
`speeds into a write aggregation buffer.
`According to still another feature,
`the step of storing the
`information objects comprises the step of writing the infor-
`10 marion objects in contiguous available storage space of the
`mass storage device, while concurrently performing steps
`(A) through (D) with respect
`to another information object.
`The invention
`also encompasses
`an apparatus,
`computer
`system, computer program product, and a computer data
`15 signal embodied in a carrier wave configured according to
`tbe foregoing aspects, and other aspects.
`
`5
`6. Support for multimedia heterogeneity, efficiently sup-
`porting diverse multimedia objects of a multitude of
`types with size ranging over six orders of magnitude
`from a few bundred bytes to bundreds of megabytes;
`7. Fast, usage-aware garbage collection,
`so less useful
`objects can be efficiently removed from tbe object store
`to make room for new objects;
`8. Data consistency, so programmatic errors and hardware
`failures do not lead to corrupted data;
`9. Fast restartability, so an object cache can begin servic-
`ing requests within seconds of restart, witbout requiring
`a time-consuming database or file system check opera-
`tion;
`10. Streaming, so large objects can be efficiently pipelined
`from the object store to slow clients, without staging
`the entire object
`into memory;
`11. Support for content negotiation, so proxy caches can
`efficiently and flexibly store variants of objects for the
`same URL,
`targeted on client browser,
`language, or
`other attribute of the client request; and
`12. General-purpose
`applicability, so that the object store
`interface is sufficiently flexible to meet
`the needs of
`future media types and protocols.
`
`relating to data
`
`BRIEF DESCRIPTION OF 'II-IE DRAWINGS
`The present
`invention is illustrated by way of example,
`and not by way of limitation,
`in the figures of the accom-
`panying drawings and in which like reference numerals refer
`to similar elements and in which:
`relationship;
`FIG. 1 is a block diagram of a client/server
`FIG. 2 is a block diagram of a traffic server;
`FIG. 3A is a block diagram of transformation of an object
`into a key;
`FIG. 3B is a block diagram of transformation of an object
`name into a key;
`FIG. 4A is a block diagram of a cache;
`FIG. 4B is a block diagram of a storage mechanism for
`Vectors of Alternates;
`FIG. 4C is a block diagram of multi-segment directory
`table;
`FIG. 5 is a block diagram of pointers
`fragments;
`FIG. 6 is a block diagram of a storage device and its
`contents;
`FIG. 7 is a block diagram showing the structure of a pool;
`FIG. 8A is a flow diagram of a process of garbage
`collection;
`FIG. 8B is a flow diagram of a process of writing
`information
`in a storage device;
`FIG. 8C is a flow diagram of a process of synchronization;
`FIG. 8D is a flow diagram of a "checkout_read"
`process;
`FIG. 8E is a flow diagram of a "checkout_write"
`process;
`FIG. 8F is a flow diagram of a "checkout_create"
`pro-
`cess;
`FIG. 9A is a flow diagram of a cache lookup process;
`FIG. 9B is a flow diagram of a "checkin" process;
`FIG. 9C is a flow diagram of a cacbe lookup process;
`FIG. 9D is a flow diagram of a cache remove process;
`FIG. 9E is a flow diagram of a cache read process;
`FIG. 9F is a flow diagram of a cache write process;
`Fl G. 9G is a flow diagram of a cache update process;
`FIG. lOA is a flow diagram of a process of allocating and
`writing
`objects
`in a storage device;
`FIG. lOB is a flow diagram of a process of scaled counter
`updating;
`FIG. 11 is a block diagram of a computer system that can
`be used to implement
`the present
`invention;
`FIG. 12 is a flow diagram of a process
`re-validation.
`
`of object
`
`Page 30 of 49
`
`20
`
`25
`
`30
`
`35
`
`SUMMARY OF THE INVENTION
`The foregoing needs and other needs are addressed by the
`present
`invention, which provides,
`in one aspect,
`in a cache
`for information objects
`that are identified by key values
`based on names of the informati