US007315919B1

(12) United States Patent
     O'Krafka et al.

(10) Patent No.: US 7,315,919 B1
(45) Date of Patent: *Jan. 1, 2008
(54) BANDWIDTH REDUCTION TECHNIQUE IN A SNOOPING-BASED
     CACHE-COHERENT CLUSTER OF MULTIPROCESSING NODES

(75) Inventors: Brian W. O'Krafka, Austin, TX (US);
     Michael J. Koster, Fremont, CA (US)

(73) Assignee: Sun Microsystems, Inc., Santa Clara, CA (US)
6,981,097 B2    12/2005  Martin et al.
2002/0133674 A1  9/2002  Martin et al.
2005/0144395 A1  6/2005  Martin et al.
2005/0160430 A1  7/2005  Steely et al.
2005/0198187 A1  9/2005  Tierney et al.
2005/0240735 A1 10/2005  Shen et al.

Primary Examiner: Hyung Sough
Assistant Examiner: Mardochee Chery
(74) Attorney, Agent, or Firm: Osha Liang LLP
(*) Notice: Subject to any disclaimer, the term of this patent is
    extended or adjusted under 35 U.S.C. 154(b) by 490 days.

    This patent is subject to a terminal disclaimer.
(21) Appl. No.: 10/868,053

(22) Filed: Jun. 15, 2004

(51) Int. Cl.
     G06F 12/00 (2006.01)

(52) U.S. Cl. 711/141; 711/146; 711/147

(58) Field of Classification Search: None
     See application file for complete search history.

(56) References Cited

U.S. PATENT DOCUMENTS
6,883,070 B2  4/2005  Martin et al.

(57) ABSTRACT

A cluster of multiprocessing nodes uses snooping-based cache-coherence to maintain consistency among the cache memories of the multiprocessing nodes. One or more of the multiprocessing nodes each maintain a directory table that includes a list of addresses of data last transferred by cache-to-cache transfer transactions. Thus, upon a local cache miss for requested data, a multiprocessing node searches its directory table for an address of the requested data, and if the address is found in the directory table, the multiprocessing node obtains a copy of the requested data from the last destination of the requested data as indicated in the directory table. Thereafter, a message indicating the completion of a cache-to-cache transfer is broadcast to other connected multiprocessing nodes on a "best efforts" basis in which messages are relayed from multiprocessing node to multiprocessing node using low priority status and/or otherwise unused cycles.
15 Claims, 10 Drawing Sheets

[Front-page representative drawing: the FIG. 9 flow process, reproduced on Sheet 9 of 10]
[Drawing Sheets 1-10: figure artwork not recoverable from this OCR; captions per the Brief Description of Drawings]

Sheet 1 of 10: FIG. 1, a typical computer system.
Sheet 2 of 10: FIG. 2, a typical computer network.
Sheet 3 of 10: FIG. 3, a typical multiprocessor computer system.
Sheet 4 of 10: FIG. 4, a snooping cache-coherent multiprocessor computer system.
Sheet 5 of 10: FIG. 5, a directory-based cache-coherent multiprocessor computer system.
Sheet 6 of 10: FIG. 6, a computer system 110 in accordance with an embodiment of the present invention.
Sheet 7 of 10: FIG. 7, a portion of a computer system in accordance with an embodiment of the present invention.
Sheet 8 of 10: FIG. 8, an exemplary address directory table 170.
Sheet 9 of 10: FIG. 9, a flow process with the steps: access local cache (ST182); retrieve and use data (ST186); check local directory table for data location (ST188); cache data valid? (ST192); perform cache-to-cache transfer (ST194); broadcast data request (ST198); use data returned from home memory (ST202); update local cache and local directory table (ST204); broadcast cache-to-cache transfer destination on a "best efforts" basis (ST206).
Sheet 10 of 10: FIG. 10, a flow process in accordance with an embodiment of the present invention.

BANDWIDTH REDUCTION TECHNIQUE IN A SNOOPING-BASED CACHE-COHERENT CLUSTER OF MULTIPROCESSING NODES

BACKGROUND OF INVENTION

As shown in FIG. 1, a typical computer system 10 includes at least a microprocessor 12 and a main memory 14. The main memory 14 contains data for use by the microprocessor 12 to perform the operations of the computer system 10. However, because the speed of the microprocessor 12 is typically significantly faster than that of the main memory 14, memory of smaller size and faster speed (referred to and known as "cache" memory) is often implemented to allow the microprocessor 12 to access frequently and/or recently requested data faster than it would otherwise take to obtain such data from the main memory 14.

Still referring to FIG. 1, the microprocessor 12 has an "on-chip" (i.e., on the same semiconductor die as the microprocessor 12), or "L1," cache memory 16 and an "off-chip," or "L2," cache memory 18. When the microprocessor 12 requests data, a cache controller 20 causes the L1 cache memory 16 to be searched for the requested data, and if that search does not "hit" (i.e., a cache "miss" occurs), the L2 cache memory 18 is searched for the requested data. If the requested data is not found in the cache memories 16, 18, the requested data is retrieved from the relatively slow main memory 14.
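The L1-then-L2-then-memory lookup order described above can be sketched as follows. This is an illustration only; the class name, dictionary-based caches, and fill policy are assumptions of the sketch, not details from the patent:

```python
# Illustrative sketch: searching cache levels in order of decreasing
# proximity to the microprocessor, falling back to slow main memory.
class MemoryHierarchy:
    def __init__(self):
        self.l1 = {}           # small, fast "on-chip" cache
        self.l2 = {}           # larger "off-chip" cache
        self.main_memory = {}  # backing store; always holds the data

    def read(self, address):
        # The cache controller searches each cache level in turn.
        for level_name, level in (("L1", self.l1), ("L2", self.l2)):
            if address in level:
                return level_name, level[address]  # cache "hit"
        # Miss in both caches: retrieve from main memory and fill the
        # caches so later requests for this address hit.
        value = self.main_memory[address]
        self.l1[address] = value
        self.l2[address] = value
        return "memory", value
```

A first access to an address misses both caches and is served by main memory; a repeated access is then served by the L1 cache.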
Those skilled in the art will recognize that a microprocessor may have any number of cache memory levels, which are typically referred to by number in order of decreasing proximity to the microprocessor. Further, those skilled in the art will recognize that any number of cache memories may be on-chip and any number of cache memories may be off-chip.

A computer system, like the one shown in FIG. 1, may be used as a system that services requests from and provides data to other computers connected over a network. Such a client-server network model 30 is shown in FIG. 2. In FIG. 2, a stand-alone server 32 is connected over a network 34 to several client computers 36, 38, 40, 42. The server 32 may be used to store data, programs, etc. for use by the client computers 36, 38, 40, 42. Those skilled in the art will recognize that the server 32 may also be used to manage and control the client computers 36, 38, 40, 42.

Although some computer systems, like the one shown in FIG. 1, have a single microprocessor 12 (such a computer system being referred to and known as a "uniprocessor" computer system), other computer systems, like the server 32 shown in FIG. 2, may be formed of multiple microprocessors. FIG. 3 shows such a multiprocessing computer system 50.

The computer system 50 of FIG. 3 is shown as having multiple microprocessors 52, 54, 56, 58. The microprocessors 52, 54, 56, 58 communicate with one another and with a main memory 60 over a network (e.g., a bus) 62. The network 62 is implemented as a set of bits that propagate data in parallel from one location to another. The "bandwidth" of the network 62 (i.e., the number of bits propagated in parallel by the network 62) is an important factor in the overall performance of the computer system 50. FIG. 3 also shows an input/output interface 64 that is connected to the network 62 and serves to input and output data to other portions of the computer system 50 and/or components external to the computer system 50.

Those skilled in the art will recognize that the multiprocessing computer system 50 of FIG. 3 may represent a particular type of multiprocessing computer system used in networking and known and referred to as a symmetric multiprocessing (SMP) computer system. An SMP computer system is one in which multiple microprocessors share, for example, the same memory and input/output interface. Those skilled in the art will also recognize that an SMP computer system may share the same operating system. Although the multiple microprocessors in an SMP computer system share the same resources, each microprocessor may act independently. For example, while one microprocessor searches for data in memory, other microprocessors may update the memory and perform other tasks, thereby increasing the ability of the SMP computer system to handle intensive networking demands.

Those skilled in the art will recognize that SMP computer systems provide good scalability in that additional microprocessors may be added or removed with minimal changes to the system. Despite the benefits of SMP computer systems, bottlenecks may occur when several microprocessors on a board share a single memory bus. Rather than put too many microprocessors on the same SMP board, designers of network elements often distribute applications across a networked cluster of SMP boards, where each board has its own memory, I/O interface, and operating system.
SUMMARY OF INVENTION

According to one aspect of one or more embodiments of the present invention, a computer system comprises: a first processing node having a snooping-based cache-coherence controller, the first processing node arranged to maintain a set of addresses of data received by cache-to-cache transfers; and a second processing node operatively point-to-point connected to the first processing node, where, in response to a cache miss for data requested by the first processing node, the snooping-based cache-coherence controller is arranged to cause a return of the requested data directly from a cache memory of the second processing node dependent on the set of addresses.

According to another aspect of one or more embodiments of the present invention, a method of performing operations in a network of point-to-point connected processing nodes comprises: requesting data from a cache memory of a first processing node; if the requested data is not found in the cache memory, searching for an address of the requested data in a list of addresses of data transferred by cache-to-cache transfers in the network; if the address of the requested data is found in the list, accordingly returning the requested data directly from another processing node; and if the address of the requested data is not found in the list, broadcasting a request for the requested data across the network of point-to-point connected processing nodes.

According to another aspect of one or more embodiments of the present invention, a modular computer system comprises: a plurality of integrated circuits; and a snooping-based cache-coherence controller operatively connected to the plurality of integrated circuits, the snooping-based cache-coherence controller having a cache memory and capable of maintaining a directory of addresses of data transferred by cache-to-cache transfers, where the modular computer system is point-to-point connectable to other modular computer systems, and where, in response to a local cache miss for data requested by the modular computer system, the modular computer system is configured to search the directory for an address of the requested data.

According to another aspect of one or more embodiments of the present invention, a computer network comprises a cluster of individual SMP computer systems that are connected using point-to-point interconnect, at least one of the individual SMP computer systems having a snooping-based cache-coherence controller and a directory of addresses of data transferred by cache-to-cache transfers in the network, where, in response to a cache miss for requested data in the cache memory of the at least one of the individual SMP computers and dependent on the directory, the snooping-based cache-coherence controller is arranged to one of cause a return of the requested data from a particular one of the individual SMP computer systems and broadcast a request for the requested data across the cluster.

According to another aspect of one or more embodiments of the present invention, a computer system comprises a plurality of integrated circuits, a snooping-based cache-coherence controller connected to the plurality of integrated circuits and having a cache memory and a list of addresses of data transferred by cache-to-cache transfers, and memory comprising instructions to: selectively request data from the cache memory; if the requested data is not found in the cache memory, search the list for an address of the requested data; if the address of the requested data is found in the list, accordingly return the requested data from a location designated by the list; and if the address of the requested data is not found in the list, broadcast a request for the requested data to processing nodes connected to the computer system.

Other aspects and advantages of the invention will be apparent from the following description and the appended claims.
BRIEF DESCRIPTION OF DRAWINGS

FIG. 1 shows a typical computer system.
FIG. 2 shows a typical computer network.
FIG. 3 shows a typical multiprocessor computer system.
FIG. 4 shows a snooping cache-coherent multiprocessor computer system.
FIG. 5 shows a directory-based cache-coherent multiprocessor computer system.
FIG. 6 shows a computer system in accordance with an embodiment of the present invention.
FIG. 7 shows a portion of a computer system in accordance with an embodiment of the present invention.
FIG. 8 shows an exemplary representation of a portion of a computer system in accordance with an embodiment of the present invention.
FIG. 9 shows a flow process in accordance with an embodiment of the present invention.
FIG. 10 shows a flow process in accordance with an embodiment of the present invention.
DETAILED DESCRIPTION

In an SMP computer system, such as that shown in FIG. 3, each microprocessor has its own cache memory (see description of cache memories above with reference to FIG. 1). Thus, because a particular cache memory data item acted upon by one microprocessor may cause the copy of that data item to differ from other copies of that data item stored in the cache memories of the various microprocessors in an SMP computer system, "cache-coherency" techniques are implemented to ensure that the local cache memory of each microprocessor is consistent with respect to values that are stored in the cache memories of other microprocessors in the SMP computer system.

Cache-coherence problems arise in SMP computer systems when more than one microprocessor cache memory holds a copy of a data item. One type of cache-coherency technique, known and referred to as "snooping," relies on all cache memories to monitor a common network (e.g., a bus) that connects microprocessors to memory. In other words, a snooping-based cache-coherency technique depends on the ability of cache memories to observe every transaction on a network (e.g., a bus) common to the cache memories.
Now referring to FIG. 4, when microprocessor 70 requests data, a cache controller 72 local to microprocessor 70 searches for the requested data in a cache memory 74 local to microprocessor 70. If the requested data is not found in the local cache memory 74, the cache controller 72 broadcasts a data request on a bus 76 connected to other cache controllers (e.g., cache controller 78) (others not shown). The cache controllers (e.g., cache controller 78) "snoop" on the bus 76 to monitor all transactions on the bus 76. If a particular cache memory (e.g., cache memory 80 associated with microprocessor 84) has the data requested by the requesting cache controller 72, the cache controller (e.g., cache controller 78) associated with the cache memory (e.g., cache memory 80) having the requested data forwards (i.e., returns) the requested data to the requesting cache controller 72, which, in turn, updates its associated cache memory 74 with the returned requested data and provides the returned requested data to the requesting microprocessor 70. Alternatively, if the requested data is not held in any of the cache memories 74, 80, a copy of the requested data in the main memory 82 is returned to and used by the requesting microprocessor 70.
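The snooping sequence above (local lookup, bus broadcast, cache-to-cache forward, fall back to main memory) can be sketched as follows. The class names and the dictionary-based caches are assumptions of this sketch, not the patent's implementation:

```python
# Illustrative sketch of bus snooping: every controller observes every
# request on the shared bus; one holding the data forwards it.
class SnoopingBus:
    def __init__(self):
        self.controllers = []

    def broadcast(self, requester, address):
        # All other controllers "snoop" the request.
        for controller in self.controllers:
            if controller is not requester and address in controller.cache:
                return controller.cache[address]  # cache-to-cache return
        return None  # no cache holds the data

class CacheController:
    def __init__(self, bus):
        self.cache = {}
        self.bus = bus
        bus.controllers.append(self)

    def read(self, address, main_memory):
        if address in self.cache:                  # local hit
            return self.cache[address]
        data = self.bus.broadcast(self, address)   # miss: broadcast
        if data is None:                           # nobody has it
            data = main_memory[address]            # use main memory copy
        self.cache[address] = data                 # update local cache
        return data
```

Note that every miss costs a full broadcast, which is the bandwidth problem the patent's directory table is meant to reduce.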
Further, a cache controller, connected to the bus 76, that observes data being written from one cache memory to another may invalidate or update its own copy of that data. The next time the cache controller's microprocessor requests that data, the most recent value of the data is provided to the microprocessor, either because its local cache memory has the most recent value of the data or through obtaining that data by generating a data request on the bus 76.

Those skilled in the art will recognize that although a snooping-based cache-coherency technique obtains data relatively quickly (i.e., has relatively low latency), such a technique consumes relatively high bandwidth due to the parallel broadcast nature of its requests. As a result, snooping-based cache-coherency techniques are typically limited to small-scale systems.
Now referring to FIG. 5, in another type of cache-coherency technique, known and referred to as "directory-based cache-coherence," when a cache miss occurs in a local cache memory (e.g., local cache memory 98 or 100) of a microprocessor (e.g., microprocessor 94 or 96), a cache controller (e.g., cache controller 102 or 106) issues a data request over a network 104 to a "home" directory (e.g., directory 90 or 92) of the requested data, the "home" directory typically being associated with the "home" memory (e.g., memory 108 or 109) of the requested data. The "home" directory may indicate to the cache controller a location of the requested data. Alternatively, if the "home" directory indicates that no other cache memories connected to the network 104 have the requested data, the requested data may be returned by the "home" memory of the requested data.
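The directory indirection just described can be sketched as a single lookup function. The data shapes (a directory mapping each address to the set of holder nodes) and the function name are assumptions of this sketch:

```python
# Illustrative sketch of directory-based coherence: a miss is a
# point-to-point query to the address's "home" directory rather than a
# broadcast to every node.
def directory_read(address, home_directory, node_caches, home_memory):
    holders = home_directory.get(address, set())
    if holders:
        # The directory indicates a location; the data is forwarded from
        # that owning cache (an extra "hop" compared with snooping).
        owner = sorted(holders)[0]
        return node_caches[owner][address]
    # No cached copies anywhere: the home memory returns the data.
    return home_memory[address]
```

The extra directory hop is what makes this scheme higher-latency than snooping, as the comparison below explains.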
One advantage of directory-based cache-coherency techniques with respect to snooping-based cache-coherency techniques is that they keep track of which microprocessor nodes have copies of particular data, thereby eliminating the need for a high-bandwidth data request broadcast. This is valuable on read misses because a data request is subsequently satisfied either by the directory indicating the location of a copy of the requested data or by accessing the main memory.

Further, because directory-based cache-coherent techniques may rely on low-bandwidth interconnect rather than on high-bandwidth networks (e.g., buses) that are necessary for broadcasting in snooping-based cache-coherency techniques, directory-based cache-coherent SMP computer systems may be scalable to a large number of microprocessors. However, the indirection overheads associated with directory queries make directory-based cache-coherency techniques slower (i.e., have higher latency) than snooping-based cache-coherency techniques (e.g., a directory-based cache-coherence technique may often require three times the number of "hops" otherwise taken in a snooping-based cache-coherence technique).

For example, in a snooping-based cache-coherency technique, upon a cache miss, one set of parallel messages is broadcast over a bus and one response message with the requested data is sent back to the requesting processing node. On the other hand, in a directory-based cache-coherent technique, upon a cache miss, a data request message is sent to the home processing node, the home processing node forwards the data request message to the owning cache memory, and the owning cache memory returns the requested data to the requesting processing node. Thus, generally, in snooping-based cache-coherency techniques, there are more messages in parallel (relatively low average latency), while in directory-based cache-coherency techniques, there are more messages in series (relatively high average latency).

Often, several small SMP servers (e.g., near-commodity modular shelf servers) are connected together to provide increased processing capabilities. Due to the limited bandwidth of the cables connecting the servers, directory-based cache-coherency techniques are required to ensure cache-coherence among the servers. However, as described above, directory-based cache-coherency techniques have relatively high average latency compared to snooping-based cache-coherency techniques.
In one or more embodiments of the present invention, a cluster of multiprocessing nodes are connected together and use snooping-based cache-coherence to maintain consistency among cache memories of the multiprocessing nodes. As described further below, such snooping-based cache-coherence is made possible by, perhaps among other things, using high-speed point-to-point interconnect to connect the multiprocessing nodes. Further, embodiments of the present invention relate to a technique for reducing the bandwidth consumed in a snooping-based cache-coherent cluster of microprocessing nodes.

FIG. 6 shows an exemplary computer system 110 in accordance with an embodiment of the present invention. In FIG. 6, a plurality of multiprocessing nodes 112, 114, 116, 118, 120, 122, 124, 126, 128, 130, 132, 134, 136, 138, 140, 142 are point-to-point connected using high-bandwidth interconnect (shown but not labeled). Particularly, each multiprocessing node (also referred to as "processing node") 112, 114, 116, 118, 120, 122, 124, 126, 128, 130, 132, 134, 136, 138, 140, 142 is connected to an adjacent multiprocessing node (in FIG. 6, each peripheral multiprocessing node is shown as being connected to the opposite peripheral multiprocessing node, e.g., multiprocessing node 112 is connected to multiprocessing node 118). In one or more other embodiments of the present invention, a microprocessing node may be connected to a non-adjacent microprocessing node. Further, a processing or multiprocessing node is not limited to a server and may be any type of computer system.
Further, in one or more embodiments of the present invention, high-bandwidth interconnect for point-to-point connecting multiprocessing nodes may be implemented using interconnect technologies such as, for example, Infiniband or PCI Express. Moreover, in one or more other embodiments of the present invention, high-bandwidth interconnect used to point-to-point connect multiprocessing nodes may have a bandwidth greater than that of 16-bit 1 GHz interconnect.

Further, in one or more embodiments of the present invention, point-to-point interconnect may be used in cabling a plurality of servers together. Moreover, in one or more embodiments of the present invention, point-to-point interconnect may be used to connect a plurality of servers to a passive backplane.
Still referring to FIG. 6, each microprocessing node 112, 114, 116, 118, 120, 122, 124, 126, 128, 130, 132, 134, 136, 138, 140, 142 is shown as having an address directory table 113, 115, 117, 119, 121, 123, 125, 127, 129, 131, 133, 135, 137, 139, 141, 143, respectively. An address directory table in accordance with one or more embodiments of the present invention is further described below with reference to FIGS. 7 and 8. Further, in one or more other embodiments of the present invention, instead of all the microprocessing nodes in a network having an address directory table, only some (i.e., fewer than all) of the microprocessing nodes may have an address directory table.

As described above, embodiments of the present invention use snooping-based cache-coherence. Thus, in FIG. 6, cache-coherence among the high-bandwidth point-to-point connected multiprocessing nodes 112, 114, 116, 118, 120, 122, 124, 126, 128, 130, 132, 134, 136, 138, 140, 142 is achieved using a snooping-based cache-coherency technique. To implement snooping-based cache-coherency, each multiprocessing node 112, 114, 116, 118, 120, 122, 124, 126, 128, 130, 132, 134, 136, 138, 140, 142 has a cache controller (not shown) that is operatively connected by the high-bandwidth interconnect (e.g., shown in FIG. 6) to the cache controllers (not shown) of connected multiprocessing nodes (not shown).
FIG. 7 shows an exemplary multiprocessing node 150 in accordance with an embodiment of the present invention. In FIG. 7, the multiprocessing node 150 has four microprocessors 152, 154, 156, 158. However, in one or more other embodiments of the present invention, a multiprocessing node may have any number of microprocessors. In FIG. 7, each of the microprocessors 152, 154, 156, 158 is connected to a snooping-based cache-coherence controller 160. The cache-coherence controller (also referred to as "cache controller") 160 is connected over high-bandwidth interconnect to the cache controllers (not shown) of connected multiprocessing nodes (not shown). Further, each microprocessor 152, 154, 156, 158 may be connected to every other microprocessor 152, 154, 156, 158 in the multiprocessing node 150 for, for example, chip-to-chip communication.

Further, although the cache controller 160 in FIG. 7 is shown as being connected to adjacent processing nodes, in one or more other embodiments of the present invention, a cache-coherence controller may be connected to one or more non-adjacent processing nodes.

Still referring to FIG. 7, the cache controller 160 has a cache memory 162 (in addition to the cache memories (not shown) local to the microprocessors 152, 154, 156, 158). In one or more embodiments of the present invention, the cache memory 162 of the cache controller 160 may be sized relatively large (e.g., greater than 32 MB) so as to reduce the frequency of broadcasting data requests to other multiprocessing nodes upon a cache miss in one of its microprocessor's local cache memories. In other words, by sizing cache memory 162 to be relatively large, upon a cache miss in a local microprocessor's local cache memory, the likelihood of the cache controller 160 finding the requested data in the cache memory 162 is increased, thereby reducing the frequency of bandwidth-consuming data request broadcasts to other multiprocessing nodes (not shown). The cache memory 162 may hold copies of data that are frequently and/or recently requested by the local microprocessors 152, 154, 156, 158 of the multiprocessing node 150.
Still referring to FIG. 7, the cache controller 160 also has an address directory table 163. As is further described below with reference to FIG. 8, the address directory table maintains a list of addresses of data last transferred by cache-to-cache transfer transactions. Those skilled in the art will note that a cache-to-cache transfer is a transaction in which, upon a local cache miss, data (or an address thereof) is transferred from a remote cache to the local cache. In other words, a transfer of data (or an address thereof) between cache memories of different processing nodes is a cache-to-cache transfer. Those skilled in the art will note that in some workloads, a considerable percentage of bandwidth consumption may be attributable to cache-to-cache transfers.

Those skilled in the art will note that an address directory table in accordance with one or more embodiments of the present invention may maintain a list of addresses of data last transferred by cache-to-cache transfers throughout a network. In one or more other embodiments of the present invention, an address directory table may maintain a list of addresses of data last transferred by cache-to-cache transfers involving particular cache memories.

Still referring to FIG. 7, when a microprocessor (e.g., microprocessor 152) requests data that is not found in its local cache memory (or memories) (not shown), the cache controller 160 searches cache memory 162 for the requested data, and if the requested data is not found in cache memory 162 (i.e., a local cache miss has occurred), the cache controller 160 may either, depending on the contents of the address directory table 163, (i) perform a snooping-based cache-coherence operation (e.g., broadcast a data request to operatively connected cache controllers (not shown)) or (ii) have the requested data directly returned from a last known location of the requested data.
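This two-way decision can be sketched as a small dispatch function. The function name, table shape, and callback signatures are assumptions of the sketch; the patent does not prescribe a software interface:

```python
# Illustrative sketch: on a local cache miss, consult the address
# directory table; if the address was recently cache-to-cache
# transferred, request it directly from its last known destination,
# otherwise fall back to a bandwidth-consuming snoop broadcast.
def handle_local_miss(address, directory_table, request_direct, broadcast):
    entry = directory_table.get(address)
    if entry is not None:
        # (ii) requested data returned directly from last known location
        return request_direct(entry["last_destination"], address)
    # (i) ordinary snooping-based broadcast to connected controllers
    return broadcast(address)
```

A directory hit thus replaces an all-nodes broadcast with a single point-to-point request, which is the bandwidth reduction the patent describes.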
FIG. 8 shows an exemplary address directory table 170 in accordance with an embodiment of the present invention. Generally, the address directory table 170 maintains a list of locations (i.e., addresses) of data last transferred by cache-to-cache transfers. The address directory table 170 includes an Address field (or column) that references a number of addresses of data recently and/or frequently returned using cache-to-cache transfers. For example, if in FIG. 8, the top-most entry in the address directory table 170 represents the data most recently returned from a cache memory of a remote processing node (i.e., a processing node different than that of a cache controller (e.g., 160 in FIG. 7) to which the address directory table 170 is considered "local"), address G is the address of the last data returned to the cache controller by a remote cache memory.

Further, the address directory table 170 includes a Last Destination field (or column) that references the last destination of the data corresponding to the associated Address field entry. For example, in the address directory table 170, the data at address Z was last cache-to-cache transferred to processing node 9. As is further described below with reference to FIGS. 9 and 10, an entry in the Last Destination field may be updated in response to a cache-to-cache transfer that does not involve the processing node associated with the address directory table 170. For example, if (i) the address directory table 170 shown in FIG. 8 is local to processing node 7 and (ii) a cache-to-cache transfer involving data at address M occurs between processing nodes 4 and 9, then, as shown in FIG. 8, the Last Destination field associated with address M is updated (on a "best efforts" basis as described below with reference to FIGS. 9 and 10) to reflect the last destination of the data at address M.

The address directory table 170 may also include a Type field that indicates the type of the cache miss (e.g., read or write) of the last cache-to-cache transfer associated with a particular address. For example, in the address directory table 170, the data at address R was last cache-to-cache transferred to satisfy a read request by processing node 8.

Those skilled in the art will note that an address directory table in accordance with one or more embodiments of the present invention may be implemented using hardware and/or software and is not limited to a table structure. In other words, the information shown in the address directory table 170 may be maintained by any type of data structure. Further, an address directory table in accordance with one or more embodiments of the present invention may contain less, more, and/or different information than that shown in FIG. 8.
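One possible software realization of such a table is sketched below, keeping the Address, Last Destination, and Type fields with the most recently transferred address at the top. The fixed capacity and least-recently-transferred eviction are assumptions of this sketch, not requirements of the patent:

```python
from collections import OrderedDict

# Illustrative sketch: Address -> {Last Destination, Type}, ordered with
# the most recently transferred address first, as in FIG. 8.
class AddressDirectoryTable:
    def __init__(self, capacity=8):
        self.entries = OrderedDict()
        self.capacity = capacity

    def record_transfer(self, address, destination_node, miss_type):
        # Insert or refresh the entry, then move it to the front; this is
        # how an observed cache-to-cache transfer updates the table.
        self.entries[address] = {"last_destination": destination_node,
                                 "type": miss_type}
        self.entries.move_to_end(address, last=False)
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=True)  # evict the oldest entry

    def lookup(self, address):
        return self.entries.get(address)
```

Note that a node may call `record_transfer` for transfers between two other nodes that it merely hears about via the "best efforts" completion messages, which is why a Last Destination entry can change without local involvement.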
FIG. 9 shows an exemplary flow process in accordance with an embodiment of the present invention. Once a processing node (e.g., a multiprocessing server) requests data ST180, the requesting processing node

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket