(12) United States Patent
O'Krafka et al.

(10) Patent No.: US 7,315,919 B1
(45) Date of Patent: *Jan. 1, 2008

(54) BANDWIDTH REDUCTION TECHNIQUE IN A SNOOPING-BASED CACHE-COHERENT CLUSTER OF MULTIPROCESSING NODES

(75) Inventors: Brian W. O'Krafka, Austin, TX (US); Michael J. Koster, Fremont, CA (US)

(73) Assignee: Sun Microsystems, Inc., Santa Clara, CA (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 490 days. This patent is subject to a terminal disclaimer.

(21) Appl. No.: 10/868,053

(22) Filed: Jun. 15, 2004

(51) Int. Cl. G06F 12/00 (2006.01)

(52) U.S. Cl. ........ 711/141; 711/146; 711/147

(58) Field of Classification Search ........ None
See application file for complete search history.

(56) References Cited

U.S. PATENT DOCUMENTS

6,883,070 B2      4/2005  Martin et al.
6,981,097 B2     12/2005  Martin et al.
2002/0133674 A1   9/2002  Martin et al.
2005/0144395 A1   6/2005  Martin et al.
2005/0160430 A1   7/2005  Steely et al.
2005/0198187 A1   9/2005  Tierney et al.
2005/0240735 A1  10/2005  Shen et al.

Primary Examiner - Hyung Sough
Assistant Examiner - Mardochee Chery
(74) Attorney, Agent, or Firm - Osha Liang LLP

(57) ABSTRACT

A cluster of multiprocessing nodes uses snooping-based cache-coherence to maintain consistency among the cache memories of the multiprocessing nodes. One or more of the multiprocessing nodes each maintain a directory table that includes a list of addresses of data last transferred by cache-to-cache transfer transactions. Thus, upon a local cache miss for requested data, a multiprocessing node searches its directory table for an address of the requested data, and if the address is found in the directory table, the multiprocessing node obtains a copy of the requested data from the last destination of the requested data as indicated in the directory table. Thereafter, a message indicating the completion of a cache-to-cache transfer is broadcast to other connected multiprocessing nodes on a "best efforts" basis in which messages are relayed from multiprocessing node to multiprocessing node using low priority status and/or otherwise unused cycles.

15 Claims, 10 Drawing Sheets
[Sheet 1 of 10: FIG. 1 (prior art) - a typical computer system]

[Sheet 2 of 10: FIG. 2 (prior art) - a typical computer network]

[Sheet 3 of 10: FIG. 3 (prior art) - a typical multiprocessor computer system]

[Sheet 4 of 10: FIG. 4 - a snooping cache-coherent multiprocessor computer system]

[Sheet 5 of 10: FIG. 5 - a directory-based cache-coherent multiprocessor computer system]

[Sheet 6 of 10: FIG. 6 - a computer system in accordance with an embodiment of the present invention]

[Sheet 7 of 10: FIG. 7 - a portion of a computer system in accordance with an embodiment of the present invention]

[Sheet 8 of 10: FIG. 8 - an exemplary representation of a portion of a computer system in accordance with an embodiment of the present invention]
[Sheet 9 of 10: FIG. 9 - flow chart. Recoverable steps and labels:
ST182: access local cache; on a hit, ST186: retrieve and use data.
On a miss, ST188: check local directory table for data location.
ST192: cache data valid? If YES, ST194: perform cache-to-cache transfer.
If no directory entry is found, ST198: broadcast data request; ST202: use data returned from home memory.
ST204: update local cache and local directory table.
ST206: broadcast cache-to-cache transfer destination on "best efforts" basis.]
[Sheet 10 of 10: FIG. 10 - flow chart]
BANDWIDTH REDUCTION TECHNIQUE IN A SNOOPING-BASED CACHE-COHERENT CLUSTER OF MULTIPROCESSING NODES
BACKGROUND OF INVENTION
As shown in FIG. 1, a typical computer system 10 includes at least a microprocessor 12 and a main memory 14. The main memory 14 contains data for use by the microprocessor 12 to perform the operations of the computer system 10. However, because the speed of the microprocessor 12 is typically significantly faster than that of the main memory 14, memory of smaller size and faster speed (referred to and known as "cache" memory) is often implemented to allow the microprocessor 12 to access frequently and/or recently requested data faster than it would otherwise take to obtain such data from the main memory 14.

Still referring to FIG. 1, the microprocessor 12 has an "on-chip" (i.e., on the same semiconductor die as the microprocessor 12), or "L1," cache memory 16 and an "off-chip," or "L2," cache memory 18. When the microprocessor 12 requests data, a cache controller 20 causes the L1 cache memory 16 to be searched for the requested data, and if that search does not "hit" (i.e., a cache "miss" occurs), the L2 cache memory 18 is searched for the requested data. If the requested data is not found in the cache memories 16, 18, the requested data is retrieved from the relatively slow main memory 14.

Those skilled in the art will recognize that a microprocessor may have any number of cache memory levels, which are typically referred to by number in order of decreasing proximity to the microprocessor. Further, those skilled in the art will recognize that any number of cache memories may be on-chip and any number of cache memories may be off-chip.
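The L1-then-L2-then-main-memory search order described above can be sketched as a small simulation; the class and field names below are illustrative assumptions, not taken from the patent.

```python
# Illustrative sketch of the L1 -> L2 -> main-memory lookup order
# described above. Names are hypothetical, not from the patent.

class CacheHierarchy:
    def __init__(self, main_memory):
        self.l1 = {}                  # small, fast, on-chip cache
        self.l2 = {}                  # larger, slower, off-chip cache
        self.main_memory = main_memory

    def read(self, address):
        # Search caches in order of proximity to the microprocessor.
        if address in self.l1:
            return self.l1[address]          # L1 hit
        if address in self.l2:
            value = self.l2[address]         # L1 miss, L2 hit
            self.l1[address] = value         # promote to L1
            return value
        value = self.main_memory[address]    # miss in both: slow path
        self.l2[address] = value
        self.l1[address] = value
        return value
```

On a second read of the same address, the value is served from L1 without touching the slower levels.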
A computer system, like the one shown in FIG. 1, may be used as a system that services requests from and provides data to other computers connected over a network. Such a client-server network model 30 is shown in FIG. 2. In FIG. 2, a stand-alone server 32 is connected over a network 34 to several client computers 36, 38, 40, 42. The server 32 may be used to store data, programs, etc. for use by the client computers 36, 38, 40, 42. Those skilled in the art will recognize that the server 32 may also be used to manage and control the client computers 36, 38, 40, 42.

Although some computer systems, like the one shown in FIG. 1, have a single microprocessor 12 (such a computer system being referred to and known as a "uniprocessor" computer system), other computer systems, like the server 32 shown in FIG. 2, may be formed of multiple microprocessors. FIG. 3 shows such a multiprocessing computer system 50.

The computer system 50 of FIG. 3 is shown as having multiple microprocessors 52, 54, 56, 58. The microprocessors 52, 54, 56, 58 communicate with one another and with a main memory 60 over a network (e.g., a bus) 62. The network 62 is implemented as a set of bits that propagate data in parallel from one location to another. The "bandwidth" of the network 62 (i.e., the number of bits propagated in parallel by the network 62) is an important factor in the overall performance of the computer system 50. FIG. 3 also shows an input/output interface 64 that is connected to the network 62 and serves to input and output data to other portions of the computer system 50 and/or components external to the computer system 50.

Those skilled in the art will recognize that the multiprocessing computer system 50 of FIG. 3 may represent a particular type of multiprocessing computer system used in
networking and known and referred to as a symmetric multiprocessing (SMP) computer system. An SMP computer system is one in which multiple microprocessors share, for example, the same memory and input/output interface. Those skilled in the art will also recognize that an SMP computer system may share the same operating system. Although the multiple microprocessors in an SMP computer system share the same resources, each microprocessor may act independently. For example, while one microprocessor searches for data in memory, other microprocessors may update the memory and perform other tasks, thereby increasing the ability of the SMP computer system to handle intensive networking demands.

Those skilled in the art will recognize that SMP computer systems provide good scalability in that additional microprocessors may be added or removed with minimal changes to the system. Despite the benefits of SMP computer systems, bottlenecks may occur when several microprocessors on a board share a single memory bus. Rather than put too many microprocessors on the same SMP board, designers of network elements often distribute applications across a networked cluster of SMP boards, where each board has its own memory, I/O interface, and operating system.
SUMMARY OF INVENTION
According to one aspect of one or more embodiments of the present invention, a computer system comprises: a first processing node having a snooping-based cache-coherence controller, the first processing node arranged to maintain a set of addresses of data received by cache-to-cache transfers; and a second processing node operatively point-to-point connected to the first processing node, where, in response to a cache miss for data requested by the first processing node, the snooping-based cache-coherence controller is arranged to cause a return of the requested data directly from a cache memory of the second processing node dependent on the set of addresses.

According to another aspect of one or more embodiments of the present invention, a method of performing operations in a network of point-to-point connected processing nodes comprises: requesting data from a cache memory of a first processing node; if the requested data is not found in the cache memory, searching for an address of the requested data in a list of addresses of data transferred by cache-to-cache transfers in the network; if the address of the requested data is found in the list, accordingly returning the requested data directly from another processing node; and if the address of the requested data is not found in the list, broadcasting a request for the requested data across the network of point-to-point connected processing nodes.

According to another aspect of one or more embodiments of the present invention, a modular computer system comprises: a plurality of integrated circuits; and a snooping-based cache-coherence controller operatively connected to the plurality of integrated circuits, the snooping-based cache-coherence controller having a cache memory and capable of maintaining a directory of addresses of data transferred by cache-to-cache transfers, where the modular computer system is point-to-point connectable to other modular computer systems, and where, in response to a local cache miss for data requested by the modular computer system, the modular computer system is configured to search the directory for an address of the requested data.

According to another aspect of one or more embodiments of the present invention, a computer network comprises a cluster of individual SMP computer systems that are connected using point-to-point interconnect, at least one of the individual SMP computer systems having a snooping-based cache-coherence controller and a directory of addresses of data transferred by cache-to-cache transfers in the network, where, in response to a cache miss for requested data in the cache memory of the at least one of the individual SMP computers and dependent on the directory, the snooping-based cache-coherence controller is arranged to one of cause a return of the requested data from a particular one of the individual SMP computer systems and broadcast a request for the requested data across the cluster.

According to another aspect of one or more embodiments of the present invention, a computer system comprises a plurality of integrated circuits, a snooping-based cache-coherence controller connected to the plurality of integrated circuits and having a cache memory and a list of addresses of data transferred by cache-to-cache transfers, and memory comprising instructions to: selectively request data from the cache memory; if the requested data is not found in the cache memory, search the list for an address of the requested data; if the address of the requested data is found in the list, accordingly return the requested data from a location designated by the list; and if the address of the requested data is not found in the list, broadcast a request for the requested data to processing nodes connected to the computer system.

Other aspects and advantages of the invention will be apparent from the following description and the appended claims.
BRIEF DESCRIPTION OF DRAWINGS
FIG. 1 shows a typical computer system.
FIG. 2 shows a typical computer network.
FIG. 3 shows a typical multiprocessor computer system.
FIG. 4 shows a snooping cache-coherent multiprocessor computer system.
FIG. 5 shows a directory-based cache-coherent multiprocessor computer system.
FIG. 6 shows a computer system in accordance with an embodiment of the present invention.
FIG. 7 shows a portion of a computer system in accordance with an embodiment of the present invention.
FIG. 8 shows an exemplary representation of a portion of a computer system in accordance with an embodiment of the present invention.
FIG. 9 shows a flow process in accordance with an embodiment of the present invention.
FIG. 10 shows a flow process in accordance with an embodiment of the present invention.
DETAILED DESCRIPTION
In an SMP computer system, such as that shown in FIG. 3, each microprocessor has its own cache memory (see description of cache memories above with reference to FIG. 1). Thus, because a particular cache memory data item acted upon by one microprocessor may cause the copy of that data item to differ from other copies of that data item stored in the cache memories of the various microprocessors in an SMP computer system, "cache-coherency" techniques are implemented to ensure that the local cache memory of each microprocessor is consistent with respect to values that are stored in the cache memories of other microprocessors in the SMP computer system.

Cache-coherence problems arise in SMP computer systems when more than one microprocessor cache memory holds a copy of a data item. One type of cache-coherency technique, known and referred to as "snooping," relies on all cache memories to monitor a common network (e.g., a bus) that connects microprocessors to memory. In other words, a snooping-based cache-coherency technique depends on the ability of cache memories to observe every transaction on a network (e.g., a bus) common to the cache memories.
Now referring to FIG. 4, when microprocessor 70 requests data, a cache controller 72 local to microprocessor 70 searches for the requested data in a cache memory 74 local to microprocessor 70. If the requested data is not found in the local cache memory 74, the cache controller 72 broadcasts a data request on a bus 76 connected to other cache controllers (e.g., cache controller 78) (others not shown). The cache controllers (e.g., cache controller 78) "snoop" on the bus 76 to monitor all transactions on the bus 76. If a particular cache memory (e.g., cache memory 80 associated with microprocessor 84) has the data requested by the requesting cache controller 72, the cache controller (e.g., cache controller 78) associated with the cache memory (e.g., cache memory 80) having the requested data forwards (i.e., returns) the requested data to the requesting cache controller 72, which, in turn, updates its associated cache memory 74 with the returned requested data and provides the returned requested data to the requesting microprocessor 70. Alternatively, if the requested data is not held in any of the cache memories 74, 80, a copy of the requested data in the main memory 82 is returned to and used by the requesting microprocessor 70.

Further, a cache controller, connected to the bus 76, that observes data being written from one cache memory to another may invalidate or update its own copy of that data. The next time the cache controller's microprocessor requests that data, the most recent value of the data is provided to the microprocessor, either because its local cache memory has the most recent value of the data or through obtaining that data by generating a data request on the bus 76.

Those skilled in the art will recognize that although a snooping-based cache-coherency technique obtains data relatively quickly (i.e., has relatively low latency), such a technique consumes relatively high bandwidth due to the parallel broadcast nature of its requests. As a result, snooping-based cache-coherency techniques are typically limited to small-scale systems.
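The bus-snooping behavior described above can be sketched as a simple simulation. The class names, the shared "bus" list, and the write-invalidate choice on observed writes are illustrative assumptions, not the patent's implementation.

```python
# Minimal sketch of bus snooping: a miss is broadcast on a common bus,
# every controller observes it, and a holder returns the data
# cache-to-cache; observed writes invalidate stale copies.
# Hypothetical names, for illustration only.

class SnoopingController:
    def __init__(self, bus, main_memory):
        self.cache = {}
        self.bus = bus                 # all controllers on the common bus
        self.main_memory = main_memory
        bus.append(self)

    def read(self, address):
        if address in self.cache:      # local hit
            return self.cache[address]
        # Local miss: broadcast the request; every other controller snoops it.
        for peer in self.bus:
            if peer is not self and address in peer.cache:
                value = peer.cache[address]   # cache-to-cache return
                self.cache[address] = value
                return value
        # No cache holds it: fall back to main memory.
        value = self.main_memory[address]
        self.cache[address] = value
        return value

    def write(self, address, value):
        self.cache[address] = value
        # Peers observing the write invalidate their stale copies.
        for peer in self.bus:
            if peer is not self:
                peer.cache.pop(address, None)
```

The broadcast loop over every peer on every miss is what makes the technique bandwidth-hungry as the node count grows.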
Now referring to FIG. 5, in another type of cache-coherency technique known and referred to as "directory-based cache-coherence," when a cache miss occurs in a local cache memory (e.g., local cache memory 98 or 100) of a microprocessor (e.g., microprocessor 94 or 96), a cache controller (e.g., cache controller 102 or 106) issues a data request over a network 104 to a "home" directory (e.g., directory 90 or 92) of the requested data, the "home" directory typically being associated with the "home" memory (e.g., memory 108 or 109) of the requested data. The "home" directory may indicate to the cache controller a location of the requested data. Alternatively, if the "home" directory indicates that no other cache memories connected to the network 104 have the requested data, the requested data may be returned by the "home" memory of the requested data.

One advantage of directory-based cache-coherency techniques with respect to snooping-based cache-coherency techniques is that they keep track of which microprocessor nodes have copies of particular data, thereby eliminating the need for a high-bandwidth data request broadcast. This is valuable on read misses because a data request is subsequently satisfied either by the directory indicating the location of a copy of the requested data or by accessing the main memory.
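A minimal sketch of the home-directory lookup just described follows, under the simplifying assumption that each address has a single home node holding both its memory value and its directory entry; all names are hypothetical.

```python
# Sketch of directory-based coherence: a miss goes to the address's
# "home" node, whose directory either names a node currently caching
# the data or serves the value from home memory. Simplified,
# illustrative structure only.

class HomeNode:
    def __init__(self):
        self.memory = {}       # home memory for addresses it owns
        self.directory = {}    # address -> name of node caching it

class DirectoryNode:
    def __init__(self, name, homes):
        self.name = name
        self.cache = {}
        self.homes = homes     # address -> HomeNode mapping
        self.nodes = {}        # name -> DirectoryNode, filled externally

    def read(self, address):
        if address in self.cache:
            return self.cache[address]
        home = self.homes[address]           # hop 1: request to home
        owner = home.directory.get(address)
        if owner is not None:
            # hop 2: home forwards to owner; hop 3: owner returns data.
            value = self.nodes[owner].cache[address]
        else:
            value = home.memory[address]     # served from home memory
        self.cache[address] = value
        home.directory[address] = self.name  # directory records new holder
        return value
```

Note that no broadcast occurs: the directory pinpoints a single holder, which is the bandwidth saving the passage describes.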
Further, because directory-based cache-coherence techniques may rely on low-bandwidth interconnect rather than on the high-bandwidth networks (e.g., buses) that are necessary for broadcasting in snooping-based cache-coherency techniques, directory-based cache-coherent SMP computer systems may be scalable to a large number of microprocessors. However, the indirection overheads associated with directory queries make directory-based cache-coherency techniques slower (i.e., have higher latency) than snooping-based cache-coherency techniques (e.g., a directory-based cache-coherence technique may often require three times the number of "hops" otherwise taken in a snooping-based cache-coherence technique).

For example, in a snooping-based cache-coherency technique, upon a cache miss, one set of parallel messages is broadcast over a bus and one response message with the requested data is sent back to the requesting processing node. On the other hand, in a directory-based cache-coherence technique, upon a cache miss, a data request message is sent to the home processing node, the home processing node forwards the data request message to the owning cache memory, and the owning cache memory returns the requested data to the requesting processing node. Thus, generally, in snooping-based cache-coherency techniques, there are more messages in parallel (relatively low average latency), while in directory-based cache-coherency techniques, there are more messages in series (relatively high average latency).
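The serialized message legs in the example above can be tallied explicitly; the functions below are an illustrative accounting of the critical path, not part of the patent.

```python
# Count serialized message "hops" on the critical path of a read miss.
# Illustrative only: the snooping broadcast is parallel, so it adds a
# single serialized leg regardless of how many nodes receive it.

def snooping_hops():
    legs = [
        "broadcast request over the bus (parallel, one leg)",
        "holder returns data to requester",
    ]
    return len(legs)

def directory_hops():
    legs = [
        "request sent to home processing node",
        "home forwards request to owning cache memory",
        "owner returns data to requester",
    ]
    return len(legs)
```

The extra serialized leg through the home node is the indirection overhead that gives directory-based schemes their higher average latency.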
Often, several small SMP servers (e.g., near-commodity modular shelf servers) are connected together to provide increased processing capabilities. Due to the limited bandwidth of the cables connecting the servers, directory-based cache-coherency techniques are required to ensure cache-coherence among the servers. However, as described above, directory-based cache-coherency techniques have relatively high average latency compared to snooping-based cache-coherency techniques.
In one or more embodiments of the present invention, a cluster of multiprocessing nodes is connected together and uses snooping-based cache-coherence to maintain consistency among cache memories of the multiprocessing nodes. As described further below, such snooping-based cache-coherence is made possible by, perhaps among other things, using high-speed point-to-point interconnect to connect the multiprocessing nodes. Further, embodiments of the present invention relate to a technique for reducing the bandwidth consumed in a snooping-based cache-coherent cluster of multiprocessing nodes.

FIG. 6 shows an exemplary computer system 110 in accordance with an embodiment of the present invention. In FIG. 6, a plurality of multiprocessing nodes 112, 114, 116, 118, 120, 122, 124, 126, 128, 130, 132, 134, 136, 138, 140, 142 are point-to-point connected using high-bandwidth interconnect (shown but not labeled). Particularly, each multiprocessing node (also referred to as "processing node") 112, 114, 116, 118, 120, 122, 124, 126, 128, 130, 132, 134, 136, 138, 140, 142 is connected to an adjacent multiprocessing node (in FIG. 6, each peripheral multiprocessing node is shown as being connected to the opposite peripheral multiprocessing node, e.g., multiprocessing node 112 is connected to multiprocessing node 118). In one or more other embodiments of the present invention, a multiprocessing node may be connected to a non-adjacent multiprocessing node. Further, a processing or multiprocessing node is not limited to a server and may be any type of computer system.
Further, in one or more embodiments of the present invention, high-bandwidth interconnect for point-to-point connecting multiprocessing nodes may be implemented using interconnect technologies such as, for example, Infiniband or PCI Express. Moreover, in one or more other embodiments of the present invention, high-bandwidth interconnect used to point-to-point connect multiprocessing nodes may have a bandwidth greater than that of a 16-bit 1 GHz interconnect.

Further, in one or more embodiments of the present invention, point-to-point interconnect may be used in cabling a plurality of servers together. Moreover, in one or more embodiments of the present invention, point-to-point interconnect may be used to connect a plurality of servers to a passive backplane.
Still referring to FIG. 6, each multiprocessing node 112, 114, 116, 118, 120, 122, 124, 126, 128, 130, 132, 134, 136, 138, 140, 142 is shown as having an address directory table 113, 115, 117, 119, 121, 123, 125, 127, 129, 131, 133, 135, 137, 139, 141, 143, respectively. An address directory table in accordance with one or more embodiments of the present invention is further described below with reference to FIGS. 7 and 8. Further, in one or more other embodiments of the present invention, instead of all the multiprocessing nodes in a network having an address directory table, only some (i.e., fewer than all) of the multiprocessing nodes may have an address directory table.

As described above, embodiments of the present invention use snooping-based cache-coherence. Thus, in FIG. 6, cache-coherence among the high-bandwidth point-to-point connected multiprocessing nodes 112, 114, 116, 118, 120, 122, 124, 126, 128, 130, 132, 134, 136, 138, 140, 142 is achieved using a snooping-based cache-coherency technique. To implement snooping-based cache-coherency, each multiprocessing node 112, 114, 116, 118, 120, 122, 124, 126, 128, 130, 132, 134, 136, 138, 140, 142 has a cache controller (not shown) that is operatively connected by the high-bandwidth interconnect (e.g., shown in FIG. 6) to the cache controllers (not shown) of connected multiprocessing nodes (not shown).
FIG. 7 shows an exemplary multiprocessing node 150 in accordance with an embodiment of the present invention. In FIG. 7, the multiprocessing node 150 has four microprocessors 152, 154, 156, 158. However, in one or more other embodiments of the present invention, a multiprocessing node may have any number of microprocessors. In FIG. 7, each of the microprocessors 152, 154, 156, 158 is connected to a snooping-based cache-coherence controller 160. The cache-coherence controller (also referred to as "cache controller") 160 is connected over high-bandwidth interconnect to the cache controllers (not shown) of connected multiprocessing nodes (not shown). Further, each microprocessor 152, 154, 156, 158 may be connected to every other microprocessor 152, 154, 156, 158 in the multiprocessing node 150 for, for example, chip-to-chip communication.

Further, although the cache controller 160 in FIG. 7 is shown as being connected to adjacent processing nodes, in one or more other embodiments of the present invention, a cache-coherence controller may be connected to one or more non-adjacent processing nodes.

Still referring to FIG. 7, the cache controller 160 has a cache memory 162 (in addition to the cache memories (not shown) local to the microprocessors 152, 154, 156, 158). In one or more embodiments of the present invention, the cache memory 162 of the cache controller 160 may be sized
relatively large (e.g., greater than 32 MB) so as to reduce the frequency of broadcasting data requests to other multiprocessing nodes upon a cache miss in one of its microprocessors' local cache memories. In other words, by sizing cache memory 162 to be relatively large, upon a cache miss in a local microprocessor's local cache memory, the likelihood of the cache controller 160 finding the requested data in the cache memory 162 is increased, thereby reducing the frequency of bandwidth-consuming data request broadcasts to other multiprocessing nodes (not shown). The cache memory 162 may hold copies of data that are frequently and/or recently requested by the local microprocessors 152, 154, 156, 158 of the multiprocessing node 150.
Still referring to FIG. 7, the cache controller 160 also has an address directory table 163. As is further described below with reference to FIG. 8, the address directory table maintains a list of addresses of data last transferred by cache-to-cache transfer transactions. Those skilled in the art will note that a cache-to-cache transfer is a transaction in which, upon a local cache miss, data (or an address thereof) is transferred from a remote cache to the local cache. In other words, a transfer of data (or an address thereof) between cache memories of different processing nodes is a cache-to-cache transfer. Those skilled in the art will note that in some workloads, a considerable percentage of bandwidth consumption may be attributable to cache-to-cache transfers.

Those skilled in the art will note that an address directory table in accordance with one or more embodiments of the present invention may maintain a list of addresses of data last transferred by cache-to-cache transfers throughout a network. In one or more other embodiments of the present invention, an address directory table may maintain a list of addresses of data last transferred by cache-to-cache transfers involving particular cache memories.

Still referring to FIG. 7, when a microprocessor (e.g., microprocessor 152) requests data that is not found in its local cache memory (or memories) (not shown), the cache controller 160 searches cache memory 162 for the requested data, and if the requested data is not found in cache memory 162 (i.e., a local cache miss has occurred), the cache controller 160 may either, depending on the contents of the address directory table 163, (i) perform a snooping-based cache-coherence operation (e.g., broadcast a data request to operatively connected cache controllers (not shown)) or (ii) have the requested data directly returned from a last known location of the requested data.
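The miss-handling choice just described, broadcast versus direct return from the last known location, can be sketched as follows; the directory-table shape and the function names are assumptions for illustration, not the patent's interfaces.

```python
# Sketch of the cache controller's decision on a local cache miss:
# consult the address directory table; if the address is present, fetch
# directly from the recorded last destination, otherwise fall back to a
# snooping broadcast. Hypothetical structure, for illustration only.

def handle_local_miss(address, directory_table,
                      fetch_from_node, broadcast_request):
    """directory_table maps address -> last destination node id."""
    last_destination = directory_table.get(address)
    if last_destination is not None:
        # (ii) Direct return from the last known location of the data.
        return fetch_from_node(last_destination, address)
    # (i) Snooping-based operation: broadcast the data request.
    return broadcast_request(address)
```

A directory hit thus replaces a cluster-wide broadcast with a single point-to-point request, which is the bandwidth reduction the patent is after.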
FIG. 8 shows an exemplary address directory table 170 in accordance with an embodiment of the present invention. Generally, the address directory table 170 maintains a list of locations (i.e., addresses) of data last transferred by cache-to-cache transfers. The address directory table 170 includes an Address field (or column) that references a number of addresses of data recently and/or frequently returned using cache-to-cache transfers. For example, if in FIG. 8, the top-most entry in the address directory table 170 represents the data most recently returned from a cache memory of a remote processing node (i.e., a processing node different than that of a cache controller (e.g., 160 in FIG. 7) to which the address directory table 170 is considered "local"), address G is the address of the last data returned to the cache controller by a remote cache memory.

Further, the address directory table 170 includes a Last Destination field (or column) that references the last destination of the data corresponding to the associated Address field entry. For example, in the address directory table 170, the data at address Z was last cache-to-cache transferred to processing node 9. As is further described below with
reference to FIGS. 9 and 10, an entry in the Last Destination field may be updated in response to a cache-to-cache transfer that does not involve the processing node associated with the address directory table 170. For example, if (i) the address directory table 170 shown in FIG. 8 is local to processing node 7 and (ii) a cache-to-cache transfer involving data at address M occurs between processing nodes 4 and 9, then, as shown in FIG. 8, the Last Destination field associated with address M is updated (on a "best efforts" basis as described below with reference to FIGS. 9 and 10) to reflect the last destination of the data at address M.

The address directory table 170 may also include a Type field that indicates the type of cache miss (e.g., read or write) of the last cache-to-cache transfer associated with a particular address. For example, in the address directory table 170, the data at address R was last cache-to-cache transferred to satisfy a read request by processing node 8.

Those skilled in the art will note that an address directory table in accordance with one or more embodiments of the present invention may be implemented using hardware and/or software and is not limited to a table structure. In other words, the information shown in the address directory table 170 may be maintained by any type of data structure. Further, an address directory table in accordance with one or more embodiments of the present invention may contain less, more, and/or different information than that shown in FIG. 8.
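One possible in-software shape for such an address directory table, kept in most-recently-transferred order, is sketched below. The fixed capacity and the eviction policy are assumptions added for the sketch, since the patent expressly leaves the data structure open.

```python
from collections import OrderedDict

# Sketch of an address directory table with Address, Last Destination,
# and Type fields, ordered most-recent-first. The capacity bound and
# eviction policy are illustrative assumptions; the patent allows any
# data structure.

class AddressDirectoryTable:
    def __init__(self, capacity=64):
        self.capacity = capacity
        self.entries = OrderedDict()  # address -> (last_destination, miss_type)

    def record_transfer(self, address, destination_node, miss_type):
        # Insert or update, moving the entry to the most-recent position.
        self.entries.pop(address, None)
        self.entries[address] = (destination_node, miss_type)
        self.entries.move_to_end(address, last=False)
        # Evict the least recently transferred entry when over capacity.
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=True)

    def lookup(self, address):
        entry = self.entries.get(address)
        return entry[0] if entry else None   # last destination, if known
```

With this shape, the top-most entry corresponds to the most recent cache-to-cache transfer, matching the ordering described for FIG. 8.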
FIG. 9 shows an exemplary flow process in accordance with an embodiment of the present invention. Once a processing node (e.g., a multiprocessing server) requests data ST180, the requesting processing node