`(76)
`
`Inventors: 1?; iflévlaly llvlaw’ NCwhEOrFHgtIbIg/IAUS _
`(
`)’
`ar _
`eler’
`6 es 6y’
`(
`)’
`Todd Comms’ Chelmsford’ MA (Us)
`_
`Correspondence Address.
`NIELDS & LEMACK
`176 EAST MAIN STREET’ SUITE 7
`WESTBORO MA 01581
`,
`(Us)
`
`(21)
`(22)
`
`A 1' N ‘I
`pp
`0
`Filed:
`
`10 823 300
`/
`’
`Apt 13’ 2004
`
Publication Classification

(51) Int. Cl.7 ................................................... G06F 12/00
(52) U.S. Cl. ......................... 711/133; 711/144; 711/145
`
(19) United States
(12) Patent Application Publication       (10) Pub. No.: US 2005/0228952 A1
     Mayhew et al.                         (43) Pub. Date: Oct. 13, 2005
`CACHE COHERENCY MECHANISM
`
`(57)
`
`ABSTRACT
`
The present invention minimizes the amount of traffic that
traverses the fabric in support of the cache coherency
protocol. It also allows rapid transmission of all traffic
associated with the cache coherency protocol, so as to
minimize latency and maximize performance. A fabric is
used to interconnect a number of processing units together.
The switches are able to recognize incoming traffic related
to the cache coherency protocol and then move these mes-
sages to the head of that switch's output queue to insure fast
transmission. Also, the traffic related to the cache coherency
protocol can interrupt an outgoing message, further reducing
latency. The switch incorporates a memory element, dedi-
cated to the cache coherency protocol, which tracks the
contents of all of the caches of all of the processors con-
nected to the fabric. In this way, the fabric can selectively
transmit traffic only to the processors where it is relevant.
`
A distributed system with a single switch using shared memory
`
`
`
Figure 1. A distributed system with a single switch using shared memory
`
`.
`Disk
`Controller _ D'sk
`
`CPU 0
`
`CPU 1
`
`CPU 0
`
`CPU 1
`
`CPU 0
`
`CPU 1
`
`z ‘I 0
`
`11 u
`
`oea
`Memory
`Hiera
`
`Local
`Memory
`Hiera
`
`Local
`Memory
`Hiera
`11/0
`
`Memory
`Hiera
`
`1
`
`oca
`Memory
`Hierareh
`
`-
`
`26 0
`
`Figure 2. A distributed system with multiple switches using shared memory
`
`2
`
`
`
Figure 3a. Message inserted at tail of queue.

Figure 3b. Message inserted when packet in transmission is completed. Speed-up over Figure 3a is T0 - T1.

Figure 3c. Message inserted at earliest possible moment. Speed-up over Figure 3a is T0 - T2. Speed-up over Figure 3b is T1 - T2.
`
`
`
Figure 4. Representative Directory Structure

Figure 5. Directory Entries for Switch 100

Figure 6. Directory Entries for Switches 200, 205 and 210
`
`
`
`
`CACHE COHERENCY MECHANISM
`
`BACKGROUND OF THE INVENTION
`
[0001] Today's computer systems continue to become
increasingly complex. First, there were single central pro-
cessing units, or CPUs, used to perform a specific function.
As the complexity of software increased, new computer
systems emerged, such as symmetric multiprocessing, or
SMP, systems, which have multiple CPUs operating simul-
taneously, typically utilizing a common high-speed bus.
These CPUs all have access to the same memory and storage
elements, with each having the ability to read and write to
these elements. More recently, another form of multi-pro-
cessor system has emerged, known as Non-Uniform
Memory Access, or "NUMA". NUMA refers to a configu-
ration of CPUs, all sharing common memory space and disk
storage, but having distinct processor and memory sub-
systems. Computer systems having processing elements that
are not tightly coupled are also known as distributed com-
puting systems. NUMA systems can be configured to have
a global shared memory, or alternatively can be configured
such that the total amount of memory is distributed among
the various processors. In either embodiment, the processors
are not as tightly bound together as with SMP over a single
high-speed bus. Rather, they have their own high-speed bus
to communicate with their local resources, such as cache and
local memory. A different communication mechanism is
employed when the CPU requires data elements that are not
resident in its local subsystem. Because performance is
very different when the processor accesses data that is not
local to its subsystem, this configuration results in non-
uniform memory access. Information in a processor's local
memory will be accessed most quickly, while information in
another processor's local memory is still accessed more
quickly than data held on disk storage.
`
[0002] In most embodiments, these CPUs possess a dedi-
cated cache memory, which is used to store duplicate
versions of data found in the main memory and storage
elements, such as disk drives. Typically, these caches con-
tain data that the processor has recently used, or will use
shortly. These cache memories can be accessed extremely
quickly, at much lower latency than typical main memory,
thereby allowing the processor to execute instructions with-
out stalling to wait for data. Data elements are added to the
cache in "lines", each of which is a fixed number of bytes,
depending on the architecture of the processor and the
system.
`
[0003] Through the use of cache memory, the performance
of the machine therefore increases, since many software pro-
grams execute code that contains "loops", in which a set of
instructions is executed and then repeated several times.
Most programs typically execute code from sequential loca-
tions, allowing caches to predictively obtain data before the
CPU needs it, a concept known as prefetching. Caches,
which hold recently used data and prefetch data that is likely
to be used, allow the processor to operate more efficiently,
since the CPU does not need to stop and wait for data to be
read from main memory or disk.
`
[0004] With multiple CPUs each having their own cache
and the ability to modify data, it is desirable to allow the
caches to communicate with each other to minimize the
number of main memory and disk accesses. In addition, in
systems that allow a cache to modify its contents without
writing it back to main memory, it is essential that the caches
communicate to insure that the most recent version of the
data is used. Therefore, the caches monitor, or "snoop", each
other's activities, and can intercept memory read requests
when they have a local cached copy of the requested data.
`
[0005] In systems with multiple processors and caches, it
is imperative that the caches all contain consistent data; that
is, if one processor modifies a particular data element, that
change must be communicated and reflected in any other
caches containing that same data element. This feature is
known as "cache coherence".
`
[0006] Thus, a mechanism is needed to insure that all of
the CPUs are using the most recently updated data. For
example, suppose one CPU reads a memory location, copies
it into its cache, and later modifies that data element in its
cache. If a second CPU then reads that element from
memory, it will obtain the old, or "stale", version of the data,
since the most up-to-date, modified version of that data
element resides only in the cache of the first CPU.
`
[0007] The easiest mechanism to insure that all caches
have consistent data is to force the cache to write any
modification back to main memory immediately. In this way,
CPUs can continuously read items in their cache, but once
they modify a data element, it must be written to main
memory. This trivial approach to maintaining consistent
caches, or cache coherency, is known as write-through
caching. While it insures cache coherency, it affects perfor-
mance by forcing the system to wait whenever data needs to
be written to main memory, a process which is much slower
than accessing the cache.
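As a rough illustration of the write-through policy just described (this sketch is not part of the original disclosure; the structure layout and the 64-byte line size are assumptions):

```c
#include <stdint.h>

#define LINE_SIZE 64                /* assumed cache-line size in bytes */

struct cache_line {
    uint64_t tag;                   /* which memory block the line holds */
    uint8_t  data[LINE_SIZE];
    int      valid;
};

/* Write-through store: the cached copy (if present) is updated, and the
 * value is immediately written to main memory as well, so main memory is
 * never stale and other caches never depend on this cache's contents.   */
static void write_through_store(struct cache_line *line, uint8_t *main_memory,
                                uint64_t addr, uint8_t value)
{
    if (line->valid && line->tag == addr / LINE_SIZE)
        line->data[addr % LINE_SIZE] = value;   /* keep the cache current */

    main_memory[addr] = value;                  /* always update memory   */
}
```

The unconditional memory write on every store is exactly the performance penalty described above.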
`
[0008] There are several more sophisticated cache coher-
ency protocols that are widely used. The first is referred to
as "MESI", which is an acronym for Modified, Exclusive,
Shared, and Invalid. These four words describe the potential
states of each cache line.
`
[0009] To illustrate the use of the MESI protocol, assume
that CPU 1 needs a particular data element, which is not
contained in its cache. It issues a request for the particular
cache line. If none of the other caches has the data, it is
retrieved from main memory or disk and loaded into the
cache of CPU 1, and is marked "E" for exclusive, indicating
that it is the only cache that has this data element. If CPU 2
later needs the same data element, it issues the same request
that CPU 1 had issued earlier. However, in this case, the
cache for CPU 1 responds with the requested data. Recog-
nizing that the data came from another cache, the line is
saved in the cache of CPU 2 with a marking of "S", or
shared. The cache line of CPU 1 is now modified to "S",
since it shared the data with the cache of CPU 2, and
therefore no longer has exclusive access to it. Continuing on,
if CPU 2 (or CPU 1) needs to modify the data, it checks the
cache line marker and, since it is shared, issues an invalidate
message to the other caches, signaling that their copy of the
cache line is no longer valid since it has been modified by
CPU 2. CPU 2 also changes the marker for this cache line
to "M", to signify that the line has been modified and that
main memory does not have the correct data. Thus, CPU 2
must write this cache line back to main memory before other
caches can use it, to restore the integrity of main memory.
Therefore, if CPU 1 needs this data element, CPU 2 will
detect the request and will then write the modified cache line
`
`5
`
`
`
`US 2005/0228952 A1
`
`Oct. 13, 2005
`
back to main memory and change the cache line marker to
"S". Table 1 briefly describes the four states used in this
cache coherency protocol.
`
TABLE 1

State        Description

Modified     The cache line is valid in this cache and no other cache. A
             transition to this state requires an invalidate message to be
             broadcast to the other caches. Main memory is not up to date.

Exclusive    The cache line is in this cache and no other cache. Main
             memory is up to date.

Shared       The cache line is valid in this cache and at least one other
             cache. Main memory is up to date.

Invalid      The cache line does not reside in the cache or does not
             contain valid data.
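The walkthrough above and the states in Table 1 can be condensed into a short state-transition sketch (not part of the original disclosure; the names and the restriction to the two transitions described above are illustrative assumptions):

```c
/* MESI states from Table 1 (illustrative sketch only). */
enum mesi_state { MESI_INVALID, MESI_SHARED, MESI_EXCLUSIVE, MESI_MODIFIED };

/* The local CPU writes to a line: a Shared (or missing) line must first
 * broadcast an invalidate message so the other caches drop their copies;
 * the line then becomes Modified, so main memory is no longer up to date. */
static enum mesi_state mesi_local_write(enum mesi_state s,
                                        void (*broadcast_invalidate)(void))
{
    if (s != MESI_EXCLUSIVE && s != MESI_MODIFIED)
        broadcast_invalidate();
    return MESI_MODIFIED;
}

/* Another cache issues a read request for a line we hold: a Modified line
 * is first written back to main memory (restoring its integrity), and the
 * surviving copy is marked Shared, as in the CPU 1 / CPU 2 example above. */
static enum mesi_state mesi_snoop_read(enum mesi_state s,
                                       void (*writeback)(void))
{
    if (s == MESI_INVALID)
        return MESI_INVALID;        /* nothing to supply */
    if (s == MESI_MODIFIED)
        writeback();
    return MESI_SHARED;
}
```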
`
[0010] This scheme reduces the number of accesses to
main memory by having the various caches snoop each
other's requests before allowing them to access main
memory or disk. However, this scheme does not signifi-
cantly reduce the number of write accesses to main memory,
since once a cache line achieves a status of "M", it must be
written back to main memory before another cache can
access it, to insure main memory integrity. A second cache
coherency protocol, referred to as "MOESI", addresses this
issue. "MOESI" is the acronym for Modified, Owner,
Exclusive, Shared and Invalid. In most scenarios, it operates
like the MESI protocol described above. However, the added
state of Owner allows a reduction in the number of write
accesses to main memory.
`
[0011] To illustrate the use of the MOESI protocol, the
previous example will be repeated. Assume that CPU 1
needs a particular data element, which is not contained in its
cache. It issues a request for the particular cache line. If none
of the other caches has the data, it is retrieved from main
memory or disk and loaded into the cache of CPU 1, and is
marked "E" for exclusive, indicating that it is the only cache
that has this data element. If CPU 2 later needs the same data
element, it issues the same request that CPU 1 had issued
earlier. However, in this case, the cache for CPU 1 responds
with the requested data. Recognizing that the data came
from another cache, the line is entered into the cache of CPU
2 with a marking of "S", or shared. The cache line of CPU
1 is now modified to "S", since it shared the data with CPU
2, and therefore no longer has exclusive access to it. Con-
tinuing on, if CPU 2 (or CPU 1) needs to modify the data,
it checks the cache line marker and, since it is shared, issues
an invalidate message to the other caches, signaling that
their copy of the cache line is no longer valid since it has
been modified by CPU 2. CPU 2 also changes the marker for
this cache line to "M", to signify that the line has been
modified and that main memory does not have the correct
data. If CPU 1 requests the data that has been modified, the
cache of CPU 2 supplies it to CPU 1, and changes the marker
for this cache line to "O" for owner, signifying that it is
responsible for supplying the cache line whenever it is
requested. This state is roughly equivalent to a combined
state of Modified and Shared, where the data exists in
multiple caches, but is not current in main memory. If CPU
2 again modifies the data while in the "O" state, it changes
its marker to "M" and issues an invalidate message to the
other caches, since their copy is no longer current.
Similarly, if CPU 1 modifies the data that it received from
CPU 2, it changes the marker for the cache line to "M" and
issues an invalidate message to the other caches, including
CPU 2, which was the previous owner. In this way, the
modified data need not be written back, since the use of the
M, S, and O states allows the various caches to determine
which has the most recent version.
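A minimal sketch of how the Owner state changes the snoop-read path relative to MESI (again an illustration rather than the patent's implementation; the names and the reduced transition set are assumptions):

```c
/* MOESI states (illustrative sketch only). */
enum moesi_state { MO_INVALID, MO_SHARED, MO_EXCLUSIVE, MO_OWNER, MO_MODIFIED };

/* Another cache issues a read request for a line we hold.  Unlike MESI,
 * a Modified line is NOT written back to main memory first: the holder
 * supplies the dirty data directly and becomes the Owner, responsible for
 * supplying the line whenever it is requested.                            */
static enum moesi_state moesi_snoop_read(enum moesi_state s)
{
    switch (s) {
    case MO_MODIFIED:   return MO_OWNER;    /* supply data, keep ownership */
    case MO_EXCLUSIVE:  return MO_SHARED;   /* no longer the only copy     */
    case MO_OWNER:
    case MO_SHARED:     return s;           /* unchanged                   */
    default:            return MO_INVALID;  /* nothing to supply           */
    }
}

/* A write while Shared, Owner or Invalid must still invalidate the other
 * copies; the writer then holds the only valid (Modified) copy.           */
static enum moesi_state moesi_local_write(enum moesi_state s,
                                          void (*broadcast_invalidate)(void))
{
    if (s != MO_MODIFIED && s != MO_EXCLUSIVE)
        broadcast_invalidate();
    return MO_MODIFIED;
}
```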
`
[0012] These schemes are effective at reducing the number
of accesses to main memory and disk. However, each
requires a significant amount of communication between the
caches of the various CPUs. As the number of CPUs
increases, so too does the amount of communication
between the various caches. In one embodiment, the caches
share a single high-speed bus and all of the messages and
requests are transmitted via this bus. While this scheme is
effective with small numbers of CPUs, it becomes less
practical as the numbers increase. A second embodiment
uses a ring structure, where each cache is connected to
exactly two other caches, one from which it receives data
(upstream) and the other to which it sends data (down-
stream), via point-to-point connections. Messages and
requests from one cache are typically passed in one direction
to the downstream cache, which either replies or forwards
the original message to its downstream cache. This process
continues until the communication arrives back at the origi-
nal sender's cache. While the electrical characteristics are
simpler to manage than on a shared bus, the latency associated
with traversing a large ring can become unacceptable.
`
[0013] A third embodiment incorporates a network fabric,
using one or more switches to interconnect the various
CPUs. Fabrics overcome the electrical issues associated with
shared busses, since all connections are point-to-point. In
addition, they typically have lower latency than ring con-
figurations, especially in configurations with many CPUs.
However, large multiprocessor systems will create signifi-
cant amounts of traffic related to cache snooping. This traffic
has the effect of slowing down the entire system, as these
messages can cause congestion in the switch.
`
[0014] Therefore, it is an object of the present invention to
provide a system and method for operating a cache-coherent
NUMA system with a network fabric, while minimizing the
amount of traffic in the fabric. In addition, it is a further
object of the present invention to provide a system and
method to allow rapid transmission and reduced latency of
the cache snoop cycles and requests through the switch.
`
`SUMMARY OF THE INVENTION
`
[0015] The problems of the prior art have been overcome
by the present invention, which provides a system for
minimizing the amount of traffic that traverses the fabric in
support of the cache coherency protocol. The system of the
present invention also allows rapid transmission of all traffic
associated with the cache coherency protocol, so as to
minimize latency and maximize performance. Briefly, a
fabric is used to interconnect a number of processing units
together. The switches that comprise the fabric are able to
recognize incoming traffic related to the cache coherency
protocol. These messages are then moved to the head of the
switch's output queue to insure fast transmission throughout
`6
`
`
`
`US 2005/0228952 Al
`
`Oct. 13, 2005
`
the fabric. In another embodiment, the traffic related to the
cache coherency protocol can interrupt an outgoing mes-
sage, further reducing latency through the switch, since the
traffic does not need to wait for the current packet to be
transmitted.
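A minimal sketch of the queue-jumping behavior described above (purely illustrative; the structure and function names are assumptions, and the packet-interruption variant of FIG. 3c is not modeled here):

```c
#include <stdbool.h>
#include <stddef.h>

/* One packet waiting on a switch output port (illustrative only). */
struct pkt {
    bool        is_coherency;   /* read request, invalidate, response, ... */
    struct pkt *next;
};

struct out_queue {
    struct pkt *head;           /* next packet to be transmitted           */
    struct pkt *tail;
};

/* Cache coherency traffic is inserted at the head of the output queue so
 * it is transmitted next; all other traffic keeps normal FIFO order.      */
static void out_queue_enqueue(struct out_queue *q, struct pkt *p)
{
    p->next = NULL;
    if (p->is_coherency) {
        p->next = q->head;              /* jump ahead of queued packets    */
        q->head = p;
        if (q->tail == NULL)
            q->tail = p;
    } else {
        if (q->tail != NULL)
            q->tail->next = p;          /* append behind existing packets  */
        else
            q->head = p;
        q->tail = p;
    }
}
```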
`
[0016] The switches within the fabric also incorporate at
least one memory element, which is dedicated to analyzing
and storing transactions related to the cache coherency
protocol. This memory element tracks the contents of the
caches of all of the processors connected to the switch. In
this manner, traffic can be minimized. In a traditional
NUMA system, read requests and invalidate messages are
communicated to every other processor in the system. By
tracking the contents of each processor's cache, the fabric
can selectively transmit this traffic only to the processors
in whose caches the data is resident.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
[0017] FIG. 1 is a schematic diagram illustrating a first
embodiment of the present invention;

[0018] FIG. 2 is a schematic diagram illustrating a second
embodiment of the present invention;

[0019] FIG. 3a is a schematic diagram illustrating deliv-
ery of packets as performed in the prior art;

[0020] FIG. 3b is a schematic diagram illustrating a
mechanism for reducing the latency of message transmission
in accordance with one embodiment of the present inven-
tion;

[0021] FIG. 3c is a schematic diagram illustrating a
mechanism for reducing the latency of message transmission
in accordance with a second embodiment of the present
invention;

[0022] FIG. 4 is a chart illustrating a representative format
for a directory in accordance with an embodiment of the
present invention;

[0023] FIG. 5 is a chart illustrating the states of a directory
during a sequence of operations in a single-switch fabric in
accordance with an embodiment of the present invention;
and

[0024] FIG. 6 is a chart illustrating the states of directories
during a sequence of operations in a multi-switch fabric in
accordance with an embodiment of the present invention.
`
`DETAILED DESCRIPTION OF THE
`INVENTION
`
[0025] Cache coherency in a NUMA system requires
communication to occur among all of the various caches.
While this presents a manageable problem with small num-
bers of processors and caches, the complexity of the problem
increases as the number of processors increases. Not only
are there more caches to track, but there is also a significant
increase in the number of communications between these
caches necessary to insure coherency. The present invention
reduces the number of communications by tracking the
contents of each cache and sending communications only to
those caches that have the data resident in them. Thus, the
amount of traffic created is minimized.
`
[0026] FIG. 1 illustrates a distributed processing system,
or specifically, a shared-memory NUMA system 10 where
the various CPUs are all in communication with a single
switch. This switch is responsible for routing traffic between
the processors, as well as to and from the disk storage 115.
In this embodiment, CPU subsystem 120, CPU subsystem
130 and CPU subsystem 140 are each in communication
with switch 100 via an interconnection, such as a cable,
backplane, or an interconnect on a printed circuit board. CPU
subsystem 120 comprises a central processing unit 121,
which may include one or more processor elements, a cache
memory 122, and a local memory 123. Likewise, CPU
subsystems 130 and 140 comprise the same elements. In
communication with switch 100 is disk controller 110,
which comprises the control logic for the disk storage 115,
which may comprise one or more disk drives. The total
amount of memory in distributed system 10 is contained in,
and partitioned between, the various CPU subsystems. The
specific configuration of the memory can vary and is not
limited by the present invention. For illustrative purposes, it
will be assumed that the total system memory is equally
divided between the processor subsystems.
`
[0027] Since the total system memory is divided among
the processor subsystems, each processor must access
memory that is local to other processor subsystems. These
accesses are much slower than accesses by the processor to
its own local memory, and therefore impact performance. To
minimize the performance impact of these accesses, each
CPU subsystem is equipped with a cache memory. The
cache is used to store frequently used, or soon to be used,
data. The data stored in the cache might be associated with
any of the main memory elements in the system.
`
[0028] As described above, it is essential to maintain
cache coherency between the various cache elements in the
system. Referring again to FIG. 1, the traditional exchange
of information between the various caches using MOESI
will be described, as is currently performed in the prior art.
Suppose CPU 121 requires a data element that is not in its
cache. It issues a read request to switch 100. Switch 100
broadcasts this read request to all other CPU subsystems.
Since none of the other caches contains the data, the
requested data must be retrieved from main memory, such as
local memory 143. This data element is sent to switch 100,
which forwards it back to CPU subsystem 120, where it is
stored in cache 122. The cache line associated with this data
element is marked "E", since this is the only cache that has
the data.
`
[0029] At a later time, assume that CPU 131 requires the
same data element. In a similar fashion, it issues a read
request to switch 100, which sends a multicast message to
the other CPU subsystems. In this case, cache 122 has the
requested data and transmits it back to CPU 131, via switch
100. At this point, both cache 122 and cache 132 have a copy
of the data element, which is stored in main memory 143.
Both of these caches mark the cache line "S" to indicate
shared access.
`
[0030] Later, assume that CPU 141 requires the same data
element. It issues a read request to switch 100, which sends
a multicast message to the other CPU subsystems. In this
case, both cache 122 and cache 132 have the requested data
and transmit it back to CPU 141, via switch 100. At this
point, caches 122, 132 and 142 all have a copy of the data
element, which is stored in main memory 143. All of these
caches mark the cache line "S" to indicate shared access.
`
`7
`
`
`
`US 2005/0228952 A1
`
`Oct. 13, 2005
`
[0031] At a later point in time, assume that CPU 131
modifies the data element. It then also issues an invalidate
message informing the other caches that their copy of that
cache line is now stale and must be discarded. It also
modifies the marker associated with this line to "M", sig-
nifying that it has modified the cache line. The switch 100
receives the invalidate message and broadcasts it to all of the
other caches in the system, even though only caches 122 and
142 have that data element.
`
[0032] At a later point, assume that CPU 141 requests the
data element and sends a read request to switch 100. Switch
100 sends a multicast message to all of the other systems
requesting the data element. Since the modified data exists
only in cache 132, it returns the data element to CPU 141,
via switch 100. It is then stored in cache 142 and is marked
"S", since the item is now shared. Cache 132 marks the
cache line "O", since it is the owner and has shared a
modified copy of the cache line with another system.
`
[0033] At a subsequent time, assume that CPU 141 modi-
fies the data element. An invalidate message is sent to switch
100, which is then broadcast to all of the other CPU
subsystems, even though only CPU subsystem 130 has the
data element in its cache. The cache line in 142 is now
marked "M", since it has modified the element, and the line
is now invalidated in cache 132.
`
[0034] The mechanism described above continues accord-
ingly, depending on the type of data access and the CPU
requesting the data. While cache coherency is maintained
throughout this process, there are distinct disadvantages to
the mechanism described.
`
[0035] First, the cache protocol creates excessive traffic
throughout the network. Each read request and invalidate
message is sent to every CPU, even to those that are not
involved in the transaction. For example, in the previous
scenario, an invalidate message is sent to CPU subsystem
120 after the data element was modified by CPU 141.
However, at that point in time, cache 122 no longer holds the
data element and therefore does not need to invalidate the
cache line. Similarly, read requests are transmitted to CPU
subsystems that have the data neither in their main memory
nor in their caches. These requests unnecessarily use net-
work bandwidth.
`
[0036] In the scenario described above, each read request
from a CPU will generate a read request to every other CPU
in the system, and a corresponding response from every
other CPU. These responses are then all forwarded back to
the requesting CPU. To illustrate this, assume that there are
eight CPUs in the system. A read request is sent from one
of these CPUs to the switch. Then seven read requests are
forwarded to the other CPUs, which each generate a
response to the request. These seven responses are then
forwarded back to the requesting CPU. Thus, a single read
request generates 22 messages within the system. This
number grows with additional CPUs. In the general case, the
number of messages generated by a single read request is
3*(# of CPUs-1)+1. In a system with 16 CPUs, a total of 46
messages would be generated.
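The message count above can be checked with a one-line helper (illustrative only; the function name is an assumption):

```c
/* Messages generated by one read request when it is broadcast to every
 * other CPU: 1 request to the switch, (n-1) forwarded requests, (n-1)
 * responses, and (n-1) responses forwarded back to the requester.        */
static int messages_per_read(int n_cpus)
{
    return 3 * (n_cpus - 1) + 1;    /* 8 CPUs -> 22, 16 CPUs -> 46 */
}
```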
`
[0037] The second disadvantage of the current mechanism
is that latency through the switch cannot be determined. For
example, when the read request arrives at switch 100 and is
ready to be broadcast to all other CPU subsystems, the
switch queues the message at each output queue. If the
queue is empty, the read request will be sent immediately. If,
however, the queue contains other packets, the message is
not sent until it reaches the head of the queue. In fact, even
if the queue is empty, the read request must still wait until
any currently outgoing message has been transmitted. Simi-
larly, when a cache delivers the requested data back to the
switch, latency is incurred in delivering that data to the
requesting CPU.
`
[0038] The present invention reduces the amount of traffic
in the network, thus reducing congestion and latency. By
tracking the actions of the CPU subsystems, it is possible for
switch 100 to determine where the requested data resides
and limit network traffic to only those subsystems. The
present invention will be described in relation to the scenario
given above. Switch 100 contains internal memory, a portion
of which is used to create a directory of cache lines. The
specific format and configuration of the directory can be
implemented in various ways, and this disclosure is not
limited to any particular directory format. FIG. 4 shows one
possible format, which will be used to describe the inven-
tion. The table of FIG. 4 contains multiple entries. There is
an entry for each cache line, and this is accompanied by the
status of that cache line in each CPU subsystem. These
entries may be static, where the memory is large enough to
support every cache line. Alternatively, entries could be
added as cache lines become populated and deleted as lines
become invalidated, which may result in a smaller directory
structure. At initialization, the table is configured to accom-
modate all of the CPU subsystems in the NUMA system.
Each entry is marked "I" for invalid, since there is no
valid cache data yet. The state of the directory in switch 100
during the following sequence of operations can be seen in
FIG. 5.
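One way to picture the directory format of FIG. 4 is as an array of entries like the following (an illustrative layout only; the patent explicitly leaves the format open, and the field names and the 16-subsystem bound are assumptions):

```c
#include <stdint.h>
#include <string.h>

#define MAX_SUBSYSTEMS 16          /* assumed upper bound on attached CPUs */

/* One directory entry per tracked cache line: the line's address plus the
 * state of that line ('M', 'O', 'E', 'S' or 'I') in each CPU subsystem
 * connected to the switch, mirroring the rows of FIG. 5.                  */
struct dir_entry {
    uint64_t line_addr;
    char     state[MAX_SUBSYSTEMS];
};

/* At initialization every subsystem is marked 'I' (invalid), since no
 * valid cache data exists yet.                                            */
static void dir_entry_init(struct dir_entry *e, uint64_t line_addr,
                           int n_subsystems)
{
    e->line_addr = line_addr;
    memset(e->state, 'I', (size_t)n_subsystems);
}
```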
`
[0039] Returning to the previous scenario, suppose CPU
121 requires a data element that is not in its cache. It issues
a read request to switch 100. Switch 100 checks the direc-
tory and finds that there is no entry for the requested data.
It then broadcasts this read request to all other CPU sub-
systems. Since none of the other caches contains the data,
the requested data must be retrieved from main memory,
such as local memory 143. This data element is sent to
switch 100, which forwards it back to CPU subsystem 120,
where it is stored in cache 122. Within CPU subsystem 120,
the cache line associated with this data element is marked
"E", since this is the only cache that has the data. The switch
also updates its directory by adding an entry for this cache
line. It then denotes that CPU 120 has exclusive access "E",
while all other CPUs remain invalid "I". This is shown in
row 1 of FIG. 5.
`
[0040] At a later time, CPU 131 requires the same data
element. In a similar fashion, it issues a read request to
switch 100. Switch 100 checks the directory and in this case
discovers that the requested data element exists in CPU
subsystem 120. Rather than sending a multicast message to
all of the other CPU subsystems, switch 100 sends a request
only to CPU subsystem 120. Cache 122 has the requested
data and transmits it back to CPU subsystem 130, via switch
100. At this point, both cache 122 and cache 132 have a copy
of the data element, which is stored in main memory 143.
Both of these caches mark the cache line "S" for shared.
Switch 100 then updates its directory to reflect that CPU 120
and
`8
`
`
`
`US 2005/0228952 A1
`
`Oct. 13, 2005
`
CPU 130 both have shared access, "S", to this cache line.
This is illustrated in row 2 of FIG. 5.
`
[0041] Later, CPU subsystem 140 requires the same data
element. It issues a read request to switch 100, which
indexes into the directory and finds that both CPU sub-
system 120 and CPU subsystem 130 have the requested data
in their caches. Based on a pre-determined algorithm, the
switch sends a message to one of these CPU subsystems,
preferably the less busy of the two. The algorithm used is
not limited by this invention and may be based on various
network parameters, such as queue length or average
response time. The requested data is then transmitted back to
CPU subsystem 140, via switch 100. At this point, caches
122, 132 and 142 all have a copy of the data element, which
is stored in main memory 143. All of these caches mark the
cache line "S" for shared. The switch then updates the
directory to indicate that CPU 120, 130 and 140 all have
shared access "S" to this data, as illustrated in row 3 of FIG.
5.
`
[0042] At a later point in time, CPU subsystem 130
modifies the data element. It also issues an invalidate
message informing the other caches that their copy of that
cache line is now stale and must be discarded. It also
modifies the marker associated with this line to "M", sig-
nifying t