US 20050228952 A1

(19) United States
(12) Patent Application Publication    (10) Pub. No.: US 2005/0228952 A1
     Mayhew et al.                     (43) Pub. Date:     Oct. 13, 2005

(54) CACHE COHERENCY MECHANISM

(76) Inventors: David Mayhew, Northborough, MA (US);
                Karl Meier, Wellesley, MA (US);
                Todd Comins, Chelmsford, MA (US)

     Correspondence Address:
     NIELDS & LEMACK
     176 EAST MAIN STREET, SUITE 7
     WESTBORO, MA 01581 (US)

(21) Appl. No.: 10/823,300

(22) Filed: Apr. 13, 2004

               Publication Classification

(51) Int. Cl.7 .................................. G06F 12/00
(52) U.S. Cl. ................ 711/133; 711/144; 711/145

(57)                    ABSTRACT
The present invention minimizes the amount of traffic that traverses the fabric in support of the cache coherency protocol. It also allows rapid transmission of all traffic associated with the cache coherency protocol, so as to minimize latency and maximize performance. A fabric is used to interconnect a number of processing units together. The switches are able to recognize incoming traffic related to the cache coherency protocol and then move these messages to the head of that switch's output queue to insure fast transmission. Also, the traffic related to the cache coherency protocol can interrupt an outgoing message, further reducing latency. The switch incorporates a memory element, dedicated to the cache coherency protocol, which tracks the contents of all of the caches of all of the processors connected to the fabric. In this way, the fabric can selectively transmit traffic only to the processors where it is relevant.
[Representative drawing: A distributed system with a single switch using shared memory]
[FIG. 1: A distributed system with a single switch using shared memory]
[FIG. 2: A distributed system with multiple switches using shared memory]
[FIG. 3a: Message inserted at tail of queue.]
[FIG. 3b: Message inserted when packet in transmission is completed. Speed-up over Figure 3a is T0 - T1.]
[FIG. 3c: Message inserted at earliest possible moment. Speed-up over Figure 3a is T0 - T2; speed-up over Figure 3b is T1 - T2.]
[FIG. 4: Representative Directory Structure]
[FIG. 5: Directory Entries for Switch 100]
[FIG. 6: Directory Entries for Switches 200, 205 and 210]
CACHE COHERENCY MECHANISM

BACKGROUND OF THE INVENTION
[0001] Today's computer systems continue to become increasingly complex. First, there were single central processing units, or CPUs, used to perform a specific function. As the complexity of software increased, new computer systems emerged, such as symmetric multiprocessing, or SMP, systems, which have multiple CPUs operating simultaneously, typically utilizing a common high-speed bus. These CPUs all have access to the same memory and storage elements, with each having the ability to read and write to these elements. More recently, another form of multi-processor system has emerged, known as Non-Uniform Memory Access, or "NUMA". NUMA refers to a configuration of CPUs, all sharing common memory space and disk storage, but having distinct processor and memory subsystems. Computer systems having processing elements that are not tightly coupled are also known as distributed computing systems. NUMA systems can be configured to have a global shared memory, or alternatively can be configured such that the total amount of memory is distributed among the various processors. In either embodiment, the processors are not as tightly bound together as with SMP over a single high-speed bus. Rather, they have their own high-speed bus to communicate with their local resources, such as cache and local memory. A different communication mechanism is employed when the CPU requires data elements that are not resident in its local subsystem. Because the performance is very different when the processor accesses data that is not local to its subsystem, this configuration results in non-uniform memory access. Information in its local memory will be accessed most quickly, while information in other processors' local memory is accessed more quickly than data in disk storage.
[0002] In most embodiments, these CPUs possess a dedicated cache memory, which is used to store duplicate versions of data found in the main memory and storage elements, such as disk drives. Typically, these caches contain data that the processor has recently used, or will use shortly. These cache memories can be accessed extremely quickly, at much lower latency than typical main memory, thereby allowing the processor to execute instructions without stalling to wait for data. Data elements are added to the cache in "lines", each of which is typically a fixed number of bytes, depending on the architecture of the processor and the system.
[0003] Through the use of cache memory, performance of the machine therefore increases, since many software programs execute code that contains "loops" in which a set of instructions is executed and then repeated several times. Most programs typically execute code from sequential locations, allowing caches to predictively obtain data before the CPU needs it, a concept known as prefetching. Caches, which hold recently used data and prefetch data that is likely to be used, allow the processor to operate more efficiently, since the CPU does not need to stop and wait for data to be read from main memory or disk.
[0004] With multiple CPUs each having their own cache and the ability to modify data, it is desirable to allow the caches to communicate with each other to minimize the number of main memory and disk accesses. In addition, in systems that allow a cache to modify its contents without writing it back to main memory, it is essential that the caches communicate to insure that the most recent version of the data is used. Therefore, the caches monitor, or "snoop", each other's activities, and can intercept memory read requests when they have a local cached copy of the requested data.
[0005] In systems with multiple processors and caches, it is imperative that the caches all contain consistent data; that is, if one processor modifies a particular data element, that change must be communicated to, and reflected in, any other caches containing that same data element. This feature is known as "cache coherence".
[0006] Thus, a mechanism is needed to insure that all of the CPUs are using the most recently updated data. For example, suppose one CPU reads a memory location, copies it into its cache, and later modifies that data element in its cache. If a second CPU reads that element from memory, it will receive the old, or "stale", version of the data, since the most up-to-date, modified version of that data element resides only in the cache of the first CPU.
[0007] The easiest mechanism to insure that all caches have consistent data is to force the cache to write any modification back to main memory immediately. In this way, CPUs can continuously read items in their caches, but once they modify a data element, it must be written to main memory. This trivial approach to maintaining consistent caches, or cache coherency, is known as write-through caching. While it insures cache coherency, it affects performance by forcing the system to wait whenever data needs to be written to main memory, a process which is much slower than accessing the cache.
[0008] There are several more sophisticated cache coherency protocols that are widely used. The first is referred to as "MESI", which is an acronym for Modified, Exclusive, Shared, and Invalid. These four words describe the potential state of each cache line.
[0009] To illustrate the use of the MESI protocol, assume that CPU 1 needs a particular data element, which is not contained in its cache. It issues a request for the particular cache line. If none of the other caches has the data, it is retrieved from main memory or disk and loaded into the cache of CPU 1, and is marked "E" for exclusive, indicating that it is the only cache that has this data element. If CPU 2 later needs the same data element, it issues the same request that CPU 1 had issued earlier. However, in this case, the cache for CPU 1 responds with the requested data. Recognizing that the data came from another cache, the line is saved in the cache of CPU 2 with a marking of "S", or shared. The cache line of CPU 1 is now modified to "S", since it shared the data with the cache of CPU 2 and therefore no longer has exclusive access to it. Continuing on, if CPU 2 (or CPU 1) needs to modify the data, it checks the cache line marker and, since it is shared, issues an invalidate message to the other caches, signaling that their copy of the cache line is no longer valid since it has been modified by CPU 2. CPU 2 also changes the marker for this cache line to "M", to signify that the line has been modified and that main memory does not have the correct data. Thus, CPU 2 must write this cache line back to main memory before other caches can use it, to restore the integrity of main memory. Therefore, if CPU 1 needs this data element, CPU 2 will detect the request; it will then write the modified cache line back to main memory and change the cache line marker to "S". Table 1 briefly describes the four states used in this cache coherency protocol.
                             TABLE 1

State        Description

Modified     The cache line is valid in this cache and no other
             cache. A transition to this state requires an invalidate
             message to be broadcast to the other caches.
             Main memory is not up to date.

Exclusive    The cache line is in this cache and no other cache.
             Main memory is up to date.

Shared       The cache line is valid in this cache and at least one
             other cache. Main memory is up to date.

Invalid      The cache line does not reside in the cache or does not
             contain valid data.
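The MESI walkthrough above maps naturally onto a small state machine. The following C++ sketch is purely illustrative: the type and method names (CacheLine, onRemoteRead, and so on) are inventions of this description, not part of the patent, and the model tracks a single line in a single cache.

    // Illustrative model of one cache line's MESI state, per Table 1.
    enum class MesiState { Modified, Exclusive, Shared, Invalid };

    struct CacheLine {
        MesiState state = MesiState::Invalid;

        // Local read miss: if no other cache supplies the line, it is
        // loaded from main memory and marked Exclusive; if another cache
        // responds, the new copy is Shared.
        void onLocalRead(bool suppliedByAnotherCache) {
            if (state == MesiState::Invalid)
                state = suppliedByAnotherCache ? MesiState::Shared
                                               : MesiState::Exclusive;
        }

        // Local write: a Shared line first broadcasts an invalidate to
        // the other caches (not modeled here), then becomes Modified.
        void onLocalWrite() { state = MesiState::Modified; }

        // Another cache requests a line held here. A Modified line must
        // be written back to main memory before dropping to Shared.
        void onRemoteRead() {
            if (state == MesiState::Modified || state == MesiState::Exclusive)
                state = MesiState::Shared;
        }

        // Another cache modified its copy and broadcast an invalidate.
        void onRemoteInvalidate() { state = MesiState::Invalid; }
    };

Replaying the example with two such objects reproduces the sequence in the text: CPU 1's line goes "E", then "S" when CPU 2 reads, then "I" when CPU 2 writes, while CPU 2's line goes "S" and then "M".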
[0010] This scheme reduces the number of accesses to main memory by having the various caches snoop each other's requests before allowing them to access main memory or disk. However, this scheme does not significantly reduce the number of write accesses to main memory, since once a cache line achieves a status of "M", it must be written back to main memory before another cache can access it, to insure main memory integrity. A second cache coherency protocol, referred to as "MOESI", addresses this issue. "MOESI" represents the acronym for Modified, Owner, Exclusive, Shared and Invalid. In most scenarios, it operates like the MESI protocol described above. However, the added state of Owner allows a reduction in the number of write accesses to main memory.
[0011] To illustrate the use of the MOESI protocol, the previous example will be repeated. Assume that CPU 1 needs a particular data element, which is not contained in its cache. It issues a request for the particular cache line. If none of the other caches has the data, it is retrieved from main memory or disk and loaded into the cache of CPU 1, and is marked "E" for exclusive, indicating that it is the only cache that has this data element. If CPU 2 later needs the same data element, it issues the same request that CPU 1 had issued earlier. However, in this case, the cache for CPU 1 responds with the requested data. Recognizing that the data came from another cache, the line is entered into the cache of CPU 2 with a marking of "S", or shared. The cache line of CPU 1 is now modified to "S", since it shared the data with CPU 2 and therefore no longer has exclusive access to it. Continuing on, if CPU 2 (or CPU 1) needs to modify the data, it checks the cache line marker and, since it is shared, issues an invalidate message to the other caches, signaling that their copy of the cache line is no longer valid since it has been modified by CPU 2. CPU 2 also changes the marker for this cache line to "M", to signify that the line has been modified and that main memory does not have the correct data. If CPU 1 requests the data that has been modified, the cache of CPU 2 supplies it to CPU 1 and changes the marker for this cache line to "O" for owner, signifying that it is responsible for supplying the cache line whenever requested. This state is roughly equivalent to a combined state of Modified and Shared, where the data exists in multiple caches but is not current in main memory. If CPU 2 again modifies the data while in the "O" state, it changes its marker to "M" and issues an invalidate message to the other caches, since their copy is no longer current. Similarly, if CPU 1 modifies the data that it received from CPU 2, it changes the marker for the cache line to "M" and issues an invalidate message to the other caches, including CPU 2, which was the previous owner. In this way, the modified data need not be written back, since the use of the M, S, and O states allows the various caches to determine which has the most recent version.
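The difference introduced by the Owner state can be shown by changing one transition in the sketch above. Again a hedged illustration with invented names, not the patent's implementation: on a remote read of a Modified line, MOESI supplies the data directly and moves to Owner instead of writing back to main memory.

    // Illustrative MOESI variant: the Owner supplies modified data to
    // other caches without an intervening write-back to main memory.
    enum class MoesiState { Modified, Owner, Exclusive, Shared, Invalid };

    struct MoesiLine {
        MoesiState state = MoesiState::Invalid;

        void onRemoteRead() {
            switch (state) {
            case MoesiState::Modified:
                state = MoesiState::Owner;   // supply data, skip write-back
                break;
            case MoesiState::Exclusive:
                state = MoesiState::Shared;
                break;
            default:                         // Owner keeps supplying; Shared
                break;                       // and Invalid are unchanged
            }
        }

        // Any local write, including by the Owner, invalidates all other
        // copies (invalidate broadcast not modeled) and returns to Modified.
        void onLocalWrite() { state = MoesiState::Modified; }

        void onRemoteInvalidate() { state = MoesiState::Invalid; }
    };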
[0012] These schemes are effective at reducing the number of accesses to main memory and disk. However, each requires a significant amount of communication between the caches of the various CPUs. As the number of CPUs increases, so too does the amount of communication between the various caches. In one embodiment, the caches share a single high-speed bus, and all of the messages and requests are transmitted via this bus. While this scheme is effective with small numbers of CPUs, it becomes less practical as the numbers increase. A second embodiment uses a ring structure, where each cache is connected to exactly two other caches, one from which it receives data (upstream) and the other to which it sends data (downstream), via point-to-point connections. Messages and requests from one cache are typically passed in one direction to the downstream cache, which either replies or forwards the original message to its downstream cache. This process continues until the communication arrives back at the original sender's cache. While the electrical characteristics are simpler than on a shared bus, the latency associated with traversing a large ring can become unacceptable.
[0013] A third embodiment incorporates a network fabric, comprising one or more switches to interconnect the various CPUs. Fabrics overcome the electrical issues associated with shared busses, since all connections are point-to-point. In addition, they typically have lower latency than ring configurations, especially in configurations with many CPUs. However, large multiprocessor systems will create significant amounts of traffic related to cache snooping. This traffic has the effect of slowing down the entire system, as these messages can cause congestion in the switch.
[0014] Therefore, it is an object of the present invention to provide a system and method for operating a cache coherent NUMA system with a network fabric, while minimizing the amount of traffic in the fabric. In addition, it is a further object of the present invention to provide a system and method to allow rapid transmission and reduced latency of the cache snoop cycles and requests through the switch.

SUMMARY OF THE INVENTION
[0015] The problems of the prior art have been overcome by the present invention, which provides a system for minimizing the amount of traffic that traverses the fabric in support of the cache coherency protocol. The system of the present invention also allows rapid transmission of all traffic associated with the cache coherency protocol, so as to minimize latency and maximize performance. Briefly, a fabric is used to interconnect a number of processing units together. The switches that comprise the fabric are able to recognize incoming traffic related to the cache coherency protocol. These messages are then moved to the head of the switch's output queue to insure fast transmission throughout the fabric. In another embodiment, the traffic related to the cache coherency protocol can interrupt an outgoing message, further reducing latency through the switch, since the traffic does not need to wait for the current packet to be transmitted.
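One way to picture the queueing behavior of this paragraph (and of FIGS. 3b and 3c) is sketched below. The Packet fields and class names are hypothetical; coherency messages jump the queue as in FIG. 3b, while the mid-packet preemption of FIG. 3c happens at the link level and is only noted in a comment.

    #include <deque>
    #include <string>

    struct Packet {
        std::string payload;
        bool isCoherencyTraffic;   // recognized by the switch on ingress
    };

    class OutputPort {
        std::deque<Packet> queue_;
    public:
        // Ordinary traffic waits its turn; cache coherency messages are
        // moved to the head of the output queue so they transmit next.
        void enqueue(const Packet& p) {
            if (p.isCoherencyTraffic)
                queue_.push_front(p);
            else
                queue_.push_back(p);
        }

        // FIG. 3c goes further: a coherency message may interrupt a packet
        // already being transmitted, and the packet resumes afterwards,
        // saving T1 - T2; that link-level preemption is not modeled here.
        bool hasPending() const { return !queue_.empty(); }
        Packet dequeue() {
            Packet p = queue_.front();
            queue_.pop_front();
            return p;
        }
    };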
[0016] The switches within the fabric also incorporate at least one memory element, which is dedicated to analyzing and storing transactions related to the cache coherency protocol. This memory element tracks the contents of the caches of all of the processors connected to the switch. In this manner, traffic can be minimized. In a traditional NUMA system, read requests and invalidate messages are communicated to every other processor in the system. By tracking the contents of each processor's cache, the fabric can selectively transmit this traffic only to the processors where the data is resident in their caches.
BRIEF DESCRIPTION OF THE DRAWINGS

[0017] FIG. 1 is a schematic diagram illustrating a first embodiment of the present invention;

[0018] FIG. 2 is a schematic diagram illustrating a second embodiment of the present invention;

[0019] FIG. 3a is a schematic diagram illustrating delivery of packets as performed in the prior art;

[0020] FIG. 3b is a schematic diagram illustrating a mechanism for reducing the latency of message transmission in accordance with one embodiment of the present invention;

[0021] FIG. 3c is a schematic diagram illustrating a mechanism for reducing the latency of message transmission in accordance with a second embodiment of the present invention;

[0022] FIG. 4 is a chart illustrating a representative format for a directory in accordance with an embodiment of the present invention;

[0023] FIG. 5 is a chart illustrating the states of a directory during a sequence of operations in a single-switch fabric in accordance with an embodiment of the present invention; and

[0024] FIG. 6 is a chart illustrating the states of directories during a sequence of operations in a multi-switch fabric in accordance with an embodiment of the present invention.
DETAILED DESCRIPTION OF THE INVENTION
[0025] Cache coherency in a NUMA system requires communication to occur among all of the various caches. While this presents a manageable problem with small numbers of processors and caches, the complexity of the problem increases as the number of processors increases. Not only are there more caches to track, but there is also a significant increase in the number of communications between these caches necessary to insure coherency. The present invention reduces the number of communications by tracking the contents of each cache and sending communications only to those caches that have the data resident in them. Thus, the amount of traffic created is minimized.
[0026] FIG. 1 illustrates a distributed processing system, or specifically, a shared memory NUMA system 10 where the various CPUs are all in communication with a single switch. This switch is responsible for routing traffic between the processors, as well as to and from the disk storage 115. In this embodiment, CPU subsystem 120, CPU subsystem 130 and CPU subsystem 140 are each in communication with switch 100 via an interconnection, such as a cable, backplane or an interconnect on a printed circuit board. CPU subsystem 120 comprises a central processing unit 121, which may include one or more processor elements, a cache memory 122, and a local memory 123. Likewise, CPU subsystems 130 and 140 comprise the same elements. In communication with switch 100 is disk controller 110, which comprises the control logic for the disk storage 115, which may comprise one or more disk drives. The total amount of memory in distributed system 10 is contained in, and partitioned between, the various CPU subsystems. The specific configuration of the memory can vary and is not limited by the present invention. For illustrative purposes, it will be assumed that the total system memory is equally divided between the processor subsystems.
[0027] Since the total system memory is divided among the processor subsystems, each processor must access memory that is local to other processor subsystems. These accesses are much slower than accesses by the processor to its own local memory, and therefore impact performance. To minimize the performance impact of these accesses, each CPU subsystem is equipped with a cache memory. The cache is used to store frequently used, or soon to be used, data. The data stored in the cache might be associated with any of the main memory elements in the system.
[0028] As described above, it is essential to maintain cache coherency between the various cache elements in the system. Referring again to FIG. 1, the traditional exchange of information between the various caches using MOESI will be described, as is currently performed in the prior art. Suppose CPU 121 requires a data element that is not in its cache. It issues a read request to switch 100. Switch 100 broadcasts this read request to all other CPU subsystems. Since none of the other caches contains the data, the requested data must be retrieved from main memory, such as local memory 143. This data element is sent to switch 100, which forwards it back to CPU subsystem 120, where it is stored in cache 122. The cache line associated with this data element is marked "E", since this is the only cache that has the data.
[0029] At a later time, assume that CPU 131 requires the same data element. In a similar fashion, it issues a read request to switch 100, which sends a multicast message to the other CPU subsystems. In this case, cache 122 has the requested data and transmits it back to CPU 131, via switch 100. At this point, both caches 122 and 132 have a copy of the data element, which is stored in main memory 143. Both of these caches mark the cache line "S" to indicate shared access.
[0030] Later, assume that CPU 141 requires the same data element. It issues a read request to switch 100, which sends a multicast message to the other CPU subsystems. In this case, both cache 122 and cache 132 have the requested data and transmit it back to CPU 141, via switch 100. At this point, caches 122, 132 and 142 all have a copy of the data element, which is stored in main memory 143. All of these caches mark the cache line "S" to indicate shared access.
[0031] At a later point in time, assume that CPU 131 modifies the data element. It then also issues an invalidate message informing the other caches that their copy of that cache line is now stale and must be discarded. It also modifies the marker associated with this line to "M", signifying that it has modified the cache line. The switch 100 receives the invalidate message and broadcasts it to all of the other caches in the system, even though only caches 122 and 142 have that data element.
[0032] At a later point, assume that CPU 141 requests the data element and sends a read request to switch 100. Switch 100 sends a multicast message to all of the other systems requesting the data element. Since the modified data exists only in cache 132, it returns the data element to CPU 141, via switch 100. It is then stored in cache 142 and is marked "S", since the item is now shared. Cache 132 marks the cache line "O", since it is the owner and has shared a modified copy of the cache line with another system.
[0033] At a subsequent time, assume that CPU 141 modifies the data element. An invalidate message is sent to switch 100, which then broadcasts it to all of the other CPU subsystems, even though only CPU subsystem 130 has the data element in its cache. The cache line in 142 is now marked "M", since it has modified the element, and the line is now invalidated in cache 132.
[0034] The mechanism described above continues accordingly, depending on the type of data access and the CPU requesting the data. While cache coherency is maintained throughout this process, there are distinct disadvantages to the mechanism described.
[0035] First, the cache protocol creates excessive traffic throughout the network. Each read request and invalidate message is sent to every CPU, even to those that are not involved in the transaction. For example, in the previous scenario, an invalidate message is sent to CPU subsystem 120 after the data element was modified by CPU 141. However, at that point in time, cache 122 does not have the data element and therefore does not need to invalidate the cache line. Similarly, read requests are transmitted to CPU subsystems that have the data neither in their main memory nor in their caches. These requests unnecessarily use network bandwidth.
[0036] In the scenario described above, each read request from a CPU will generate a read request to every other CPU in the system, and a corresponding response from every other CPU. These responses are then all forwarded back to the requesting CPU. To illustrate this, assume that there are eight CPUs in the system. A read request is sent from one of these CPUs to the switch. Then seven read requests are forwarded to the other CPUs, which each generate a response to the request. These seven responses are then forwarded back to the requesting CPU. Thus, a single read request generates 22 messages within the system. This number grows with additional CPUs. In the general case, the number of messages generated by a single read request is 3*(# of CPUs - 1) + 1. In a system with 16 CPUs, a total of 46 messages would be generated.
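The arithmetic is easy to verify: one request into the switch, (N - 1) forwarded requests, (N - 1) responses, and (N - 1) responses forwarded back. A short check, with the formula exactly as the paragraph states it:

    #include <iostream>

    // Messages per read request in the broadcast scheme of [0036]:
    // 1 inbound request + (n-1) forwarded + (n-1) responses + (n-1) returned.
    int broadcastMessages(int n) { return 3 * (n - 1) + 1; }

    int main() {
        std::cout << broadcastMessages(8) << "\n";   // prints 22
        std::cout << broadcastMessages(16) << "\n";  // prints 46
    }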
[0037] The second disadvantage of the current mechanism is that latency through the switch cannot be determined. For example, when the read request arrives at switch 100 and is ready to be broadcast to all other CPU subsystems, the switch queues the message at each output queue. If the queue is empty, the read request will be sent immediately. If, however, the queue contains other packets, the message is not sent until it reaches the head of the queue. In fact, even if the queue is empty, the read request must still wait until any currently outgoing message has been transmitted. Similarly, when a cache delivers the requested data back to the switch, latency is incurred in delivering that data to the requesting CPU.
[0038] The present invention reduces the amount of traffic in the network, thus reducing congestion and latency. By tracking the actions of the CPU subsystems, it is possible for switch 100 to determine where the requested data resides and limit network traffic to only those subsystems. The present invention will be described in relation to the scenario given above. Switch 100 contains internal memory, a portion of which is used to create a directory of cache lines. The specific format and configuration of the directory can be implemented in various ways, and this disclosure is not limited to any particular directory format. FIG. 4 shows one possible format, which will be used to describe the invention. The table of FIG. 4 contains multiple entries. There is an entry for each cache line, and this is accompanied by the status of that cache line in each CPU subsystem. These entries may be static, where the memory is large enough to support every cache line. Alternatively, entries could be added as cache lines become populated and deleted as lines become invalidated, which may result in a smaller directory structure. At initialization, the table is configured to accommodate all of the CPU subsystems in the NUMA system. Each entry is marked "I" for invalid, since there is no valid cache data yet. The state of the directory in switch 100 during the following sequence of operations can be seen in FIG. 5.
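One plausible rendering of the FIG. 4 directory, under the dynamic (add-and-delete) alternative described above, is a map from cache line address to a per-subsystem state vector. All names here are assumptions for illustration; the patent deliberately leaves the format open.

    #include <array>
    #include <cstdint>
    #include <unordered_map>

    constexpr int kNumSubsystems = 3;   // subsystems 120, 130, 140 of FIG. 1

    enum class State : char { M, O, E, S, I };

    // One directory row: the state of a cache line in each CPU subsystem.
    using DirectoryEntry = std::array<State, kNumSubsystems>;

    class Directory {
        std::unordered_map<uint64_t, DirectoryEntry> entries_;
    public:
        // Entries are created on first use, every subsystem starting at "I".
        DirectoryEntry& lookup(uint64_t lineAddr) {
            auto [it, inserted] = entries_.try_emplace(
                lineAddr, DirectoryEntry{State::I, State::I, State::I});
            return it->second;
        }

        // Deleting fully invalidated lines keeps the structure small, as
        // the paragraph suggests; a static table over every line also works.
        void dropIfAllInvalid(uint64_t lineAddr) {
            auto it = entries_.find(lineAddr);
            if (it == entries_.end()) return;
            for (State s : it->second)
                if (s != State::I) return;
            entries_.erase(it);
        }
    };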
[0039] Returning to the previous scenario, suppose CPU 121 requires a data element that is not in its cache. It issues a read request to switch 100. Switch 100 checks the directory and finds that there is no entry for the requested data. It then broadcasts this read request to all other CPU subsystems. Since none of the other caches contains the data, the requested data must be retrieved from main memory, such as local memory 143. This data element is sent to switch 100, which forwards it back to CPU subsystem 120, where it is stored in cache 122. Within CPU subsystem 120, the cache line associated with this data element is marked "E", since this is the only cache that has the data. The switch also updates its directory, by adding an entry for this cache line. It then denotes that CPU 120 has exclusive access, "E", while all other CPUs remain invalid, "I". This is shown in row 1 of FIG. 5.
[0040] At a later time, CPU 131 requires the same data element. In a similar fashion, it issues a read request to switch 100. Switch 100 checks the directory and in this case discovers that the requested data element exists in CPU subsystem 120. Rather than sending a multicast message to all of the other CPU subsystems, switch 100 sends a request only to CPU subsystem 120. Cache 122 has the requested data and transmits it back to CPU subsystem 130, via switch 100. At this point, both caches 122 and 132 have a copy of the data element, which is stored in main memory 143. Both of these caches mark the cache line "S" for shared. Switch 100 then updates its directory to reflect that CPU 120 and CPU 130 both have shared access, "S", to this cache line. This is illustrated in row 2 of FIG. 5.
[0041] Later, CPU subsystem 140 requires the same data element. It issues a read request to switch 100, which indexes into the directory and finds that both CPU subsystem 120 and CPU subsystem 130 have the requested data in their caches. Based on a pre-determined algorithm, the switch sends a message to one of these CPU subsystems, preferably the less busy of the two. The algorithm used is not limited by this invention and may be based on various network parameters, such as queue length or average response time. The requested data is then transmitted back to CPU subsystem 140, via switch 100. At this point, caches 122, 132 and 142 all have a copy of the data element, which is stored in main memory 143. All of these caches mark the cache line "S" for shared. The switch then updates the directory to indicate that CPUs 120, 130 and 140 all have shared access, "S", to this data, as illustrated in row 3 of FIG. 5.
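Paragraphs [0039] through [0041] amount to a simple routing rule: no directory entry means broadcast and fetch from memory; otherwise, unicast to one holder, preferably the least busy. A sketch continuing the directory types above; the load metric stands in for whichever network parameter (queue length, average response time) an implementation might choose.

    #include <vector>

    // Pick the forwarding target for a read request at switch 100.
    // Returns the index of the least busy subsystem holding the line,
    // or -1 if no cache holds it (the caller then broadcasts and the
    // data is fetched from main memory, as in [0039]).
    int chooseTarget(const DirectoryEntry& entry,
                     const std::vector<int>& load) {
        int best = -1;
        for (int cpu = 0; cpu < kNumSubsystems; ++cpu) {
            if (entry[cpu] == State::I) continue;        // no copy here
            if (best < 0 || load[cpu] < load[best]) best = cpu;
        }
        return best;
    }

The same entry drives invalidations: an invalidate message need only go to subsystems whose state is not "I", which is the traffic reduction the invention claims over the broadcast of [0031].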
[0042] At a later point in time, CPU subsystem 130 modifies the data element. It also issues an invalidate message informing the other caches that their copy of that cache line is now stale and must be discarded. It also modifies the marker associated with this line to "M", signifying that it has modified the cache line.