Attorney Docket No. NWISP024

PATENT APPLICATION

METHODS AND APPARATUS FOR MANAGING PROBE REQUESTS

Inventor(s):

David B. Glasco
10337 Ember Glen Drive
Austin, TX 78726
Citizen of the U.S.

Assignee:

Newisys, Inc.
A Delaware corporation

BEYER WEAVER & THOMAS, LLP
P.O. Box 778
Berkeley, California 94704-0778
(510) 843-6200
METHODS AND APPARATUS FOR MANAGING PROBE REQUESTS
CROSS-REFERENCE TO RELATED APPLICATIONS

The present application is related to filed U.S. Application No. 10/106,426 titled Methods And Apparatus For Speculative Probing At A Request Cluster, U.S. Application No. 10/106,430 titled Methods And Apparatus For Speculative Probing With Early Completion And Delayed Request, and U.S. Application No. 10/106,299 titled Methods And Apparatus For Speculative Probing With Early Completion And Early Request, the entireties of which are incorporated by reference herein for all purposes.

The present application is also related to filed U.S. Application Nos. 10/157,340, 10/145,439, 10/145,438, and 10/157,388 titled Methods And Apparatus For Responding To A Request Cluster by David B. Glasco, the entireties of which are incorporated by reference for all purposes. The present application is also related to concurrently filed U.S. Application No. / (Attorney Docket No. NWISP025) with the same title and inventor, the entirety of which is incorporated by reference herein for all purposes.
BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention generally relates to accessing data in a multiple processor system. More specifically, the present invention provides techniques for improving data access efficiency while maintaining cache coherency in a multiple processor system having a multiple cluster architecture.

2. Description of Related Art

Data access in multiple processor systems can raise issues relating to cache coherency. Conventional multiple processor computer systems have processors coupled to a system memory through a shared bus. In order to optimize access to data in the system memory, individual processors are typically designed to work with cache memory. In one example, each processor has a cache that is loaded with data that the processor frequently accesses. The cache is read or written by a processor. However, cache coherency problems arise because multiple copies of the same data can co-exist in systems having multiple processors and multiple cache memories. For example, a frequently accessed data block corresponding to a memory line may be loaded into the cache of two different processors. In one example, if both processors attempt to write new values into the data block at the same time, different data values may result. One value may be written into the first cache while a different value is written into the second cache. A system might then be unable to determine what value to write through to system memory.
A variety of cache coherency mechanisms have been developed to address such problems in multiprocessor systems. One solution is to simply force all processor writes to go through to memory immediately and bypass the associated cache. The write requests can then be serialized before overwriting a system memory line. However, bypassing the cache significantly decreases the efficiency gained by using a cache.

Other cache coherency mechanisms have been developed for specific architectures. In a shared bus architecture, each processor checks or snoops on the bus to determine whether it can read or write a shared cache block. In one example, a processor only writes an object when it owns or has exclusive access to the object. Each corresponding cache object is then updated to allow processors access to the most recent version of the object.

Bus arbitration is used when both processors attempt to write the same shared data block in the same clock cycle. Bus arbitration logic decides which processor gets the bus first. Although cache coherency mechanisms such as bus arbitration are effective, using a shared bus limits the number of processors that can be implemented in a single system with a single memory space.
Other multiprocessor schemes involve individual processor, cache, and memory systems connected to other processors, cache, and memory systems using a network backbone such as Ethernet or Token Ring. Multiprocessor schemes involving separate computer systems each with its own address space can avoid many cache coherency problems because each processor has its own associated memory and cache. When one processor wishes to access data on a remote computing system, communication is explicit. Messages are sent to move data to another processor and messages are received to accept data from another processor using standard network protocols such as TCP/IP. Multiprocessor systems using explicit communication including transactions such as sends and receives are referred to as systems using multiple private memories. By contrast, multiprocessor systems using implicit communication including transactions such as loads and stores are referred to herein as using a single address space.
Multiprocessor schemes using separate computer systems allow more processors to be interconnected while minimizing cache coherency problems. However, it would take substantially more time to access data held by a remote processor using a network infrastructure than it would take to access data held by a processor coupled to a system bus. Furthermore, valuable network bandwidth would be consumed moving data to the proper processors. This can negatively impact both processor and network performance.

Performance limitations have led to the development of a point-to-point architecture for connecting processors in a system with a single memory space. In one example, individual processors can be directly connected to each other through a plurality of point-to-point links to form a cluster of processors. Separate clusters of processors can also be connected. The point-to-point links significantly increase the bandwidth for coprocessing and multiprocessing functions. However, using a point-to-point architecture to connect multiple processors in a multiple cluster system sharing a single memory space presents its own problems.

Consequently, it is desirable to provide techniques for improving data access and cache coherency in systems having multiple clusters of multiple processors connected using point-to-point links.
SUMMARY OF THE INVENTION

According to the present invention, methods and apparatus are provided for increasing the efficiency of data access in a multiple processor, multiple cluster system. Mechanisms for reducing the number of transactions in a multiple cluster system are provided. In one example, probe filter information is used to limit the number of probe requests transmitted to request and remote clusters.
In one embodiment, a computer system is provided. The computer system includes a home cluster having a first plurality of processors and a home cache coherence controller. The first plurality of processors and the home cache coherence controller are interconnected in a point-to-point architecture. The home cache coherence controller is configured to receive a probe request and probe one or more selected clusters. The one or more clusters are selected based on the characteristics associated with the probe request.
In another embodiment, a method for managing probes is provided. A probe request is received at a home cache coherence controller in a home cluster. The home cluster includes a first plurality of processors and the home cache coherence controller. The first plurality of processors and the home cache coherence controller are interconnected in a point-to-point architecture. One or more clusters are selected for probing based on the characteristics associated with the probe request. The one or more clusters are probed.

A further understanding of the nature and advantages of the present invention may be realized by reference to the remaining portions of the specification and the drawings.
BRIEF DESCRIPTION OF THE DRAWINGS

The invention may best be understood by reference to the following description taken in conjunction with the accompanying drawings, which are illustrative of specific embodiments of the present invention.

Figures 1A and 1B are diagrammatic representations depicting a system having multiple clusters.

Figure 2 is a diagrammatic representation of a cluster having a plurality of processors.

Figure 3 is a diagrammatic representation of a cache coherence controller.

Figure 4 is a diagrammatic representation showing a transaction flow for a data access request from a processor in a single cluster.

Figures 5A-5D are diagrammatic representations showing cache coherence controller functionality.

Figure 6 is a diagrammatic representation depicting a transaction flow for a probe request with multiple probe responses.

Figure 7 is a diagrammatic representation showing a cache coherence directory.

Figure 8 is a diagrammatic representation showing probe filter information that can be used to reduce the number of probes transmitted to various clusters.

Figure 9 is a diagrammatic representation showing a transaction flow for probing of a home cluster without probing of other clusters.

Figure 10 is a diagrammatic representation showing a transaction flow for probing of a single remote cluster.

Figure 11 is a flow process diagram showing the handling of a probe request with probe filter information.

Figure 12 is a diagrammatic representation showing memory controller filter information.

Figure 13 is a diagrammatic representation showing a transaction flow for probing a single remote cluster without probing a home cluster.

Figure 14 is a flow process diagram showing the handling of a probe request at a home cluster cache coherence controller using memory controller filter information.
DETAILED DESCRIPTION OF SPECIFIC EMBODIMENTS

Reference will now be made in detail to some specific embodiments of the invention including the best modes contemplated by the inventors for carrying out the invention. Examples of these specific embodiments are illustrated in the accompanying drawings. While the invention is described in conjunction with these specific embodiments, it will be understood that it is not intended to limit the invention to the described embodiments. On the contrary, it is intended to cover alternatives, modifications, and equivalents as may be included within the spirit and scope of the invention as defined by the appended claims. Multi-processor architectures having point-to-point communication among their processors are suitable for implementing specific embodiments of the present invention. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. The present invention may be practiced without some or all of these specific details. Well-known process operations have not been described in detail in order not to unnecessarily obscure the present invention. Furthermore, the present application's reference to a particular singular entity includes the possibility that the methods and apparatus of the present invention can be implemented using more than one entity, unless the context clearly dictates otherwise.

Techniques are provided for increasing data access efficiency in a multiple processor, multiple cluster system. In a point-to-point architecture, a cluster of processors includes multiple processors directly connected to each other through point-to-point links. By using point-to-point links instead of a conventional shared bus or external network, multiple processors are used efficiently in a system sharing the same memory space. Processing and network efficiency are also improved by avoiding many of the bandwidth and latency limitations of conventional bus and external network based multiprocessor architectures. According to various embodiments, however, linearly increasing the number of processors in a point-to-point architecture leads to an exponential increase in the number of links used to connect the multiple processors. In order to reduce the number of links used and to further modularize a multiprocessor system using a point-to-point architecture, multiple clusters are used.
According to various embodiments, the multiple processor clusters are interconnected using a point-to-point architecture. Each cluster of processors includes a cache coherence controller used to handle communications between clusters. In one embodiment, the point-to-point architecture used to connect processors is used to connect clusters as well.

By using a cache coherence controller, multiple cluster systems can be built using processors that may not necessarily support multiple clusters. Such a multiple cluster system can be built by using a cache coherence controller to represent non-local nodes in local transactions so that local nodes do not need to be aware of the existence of nodes outside of the local cluster. More detail on the cache coherence controller will be provided below.
In a single cluster system, cache coherency can be maintained by sending all data access requests through a serialization point. Any mechanism for ordering data access requests is referred to herein as a serialization point. One example of a serialization point is a memory controller. Various processors in the single cluster system send data access requests to the memory controller. In one example, the memory controller is configured to serialize or lock the data access requests so that only one data access request for a given memory line is allowed at any particular time. If another processor attempts to access the same memory line, the data access attempt is blocked until the memory line is unlocked. The memory controller allows cache coherency to be maintained in a multiple processor, single cluster system.
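For illustration only, the serialize-and-lock discipline described above can be sketched in software. The sketch below is not part of the described apparatus: the class and method names are invented for the example, and an actual memory controller implements this discipline in hardware.

```python
from collections import deque

class MemoryController:
    """Toy serialization point: at most one granted access per memory line."""

    def __init__(self):
        self.locked = {}   # memory line -> requester currently holding the line
        self.queued = {}   # memory line -> deque of blocked requesters

    def request(self, line, requester):
        """Grant the line if it is unlocked; otherwise block (queue) the requester."""
        if line not in self.locked:
            self.locked[line] = requester
            return True           # access granted
        self.queued.setdefault(line, deque()).append(requester)
        return False              # blocked until the line is unlocked

    def done(self, line):
        """Unlock the line, handing it to the next queued requester if any."""
        waiters = self.queued.get(line)
        if waiters:
            self.locked[line] = waiters.popleft()
        else:
            self.locked.pop(line)

mc = MemoryController()
assert mc.request(0x1000, "cpu0")        # cpu0 granted the memory line
assert not mc.request(0x1000, "cpu1")    # cpu1 blocked on the same line
mc.done(0x1000)                          # cpu0 finishes; cpu1 now holds the line
```

The queue preserves the order in which blocked requests arrived, which is what makes the controller a serialization point rather than a simple mutex.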
A serialization point can also be used in a multiple processor, multiple cluster system where the processors in the various clusters share a single address space. By using a single address space, internal point-to-point links can be used to significantly improve intercluster communication over traditional external network based multiple cluster systems. Various processors in various clusters send data access requests to a memory controller associated with a particular cluster such as a home cluster. The memory controller can similarly serialize all data requests from the different clusters. However, a serialization point in a multiple processor, multiple cluster system may not be as efficient as a serialization point in a multiple processor, single cluster system. That is, delay resulting from factors such as latency from transmitting between clusters can adversely affect the response times for various data access requests. It should be noted that delay also results from the use of probes in a multiple processor environment.
Although delay in intercluster transactions in an architecture using a shared memory space is significantly less than the delay in conventional message passing environments using external networks such as Ethernet or Token Ring, even minimal delay is a significant factor. In some applications, there may be millions of data access requests from a processor in a fraction of a second. Any delay can adversely impact processor performance.
According to various embodiments, probe management is used to increase the efficiency of accessing data in a multiple processor, multiple cluster system. A mechanism for eliciting a response from a node to maintain cache coherency in a system is referred to herein as a probe. In one example, a mechanism for snooping a cache is referred to as a probe. A response to a probe can be directed to the source or target of the initiating request. Any mechanism for filtering or reducing the number of probes and probe requests transmitted to various nodes is referred to herein as managing probes. In one example, managing probes entails characterizing a probe request to determine if a probe can be transmitted to a reduced number of entities.
In typical implementations, probe requests are sent to a memory controller that broadcasts probes to various nodes in a system. In such a system, no knowledge of the cache line state is known. All nodes in the system are probed and the request cluster receives a response from each node. In a system with a coherence directory, state information associated with various memory lines can be used to reduce the number of transactions. Any mechanism for maintaining state information associated with various memory lines is referred to herein as a coherence directory. A coherence directory typically includes information for memory lines in a local cluster that are cached in a remote cluster. According to various embodiments, a coherence directory is used to reduce the number of probes to remote quads by inferring the state of local caches. In other embodiments, a coherence directory is used to eliminate the transmission of a request to a memory controller in a home cluster.
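As an illustration of how directory state can cut probe traffic, the sketch below keeps a per-line record of which remote clusters may cache a line and probes only those. The data layout and the fall-back-to-broadcast policy are assumptions made for this example; they are not the directory format described in this application.

```python
ALL_REMOTE_CLUSTERS = ("cluster1", "cluster2", "cluster3")

# Coherence directory sketch: memory line -> remote clusters caching the line.
directory = {
    0x40: {"cluster1"},   # line 0x40 is cached only in cluster1
    0x80: set(),          # line 0x80 is cached in no remote cluster
}

def clusters_to_probe(line):
    """Without directory state every remote cluster must be probed; with an
    entry, probes go only to the clusters listed for the line."""
    cached_at = directory.get(line)
    if cached_at is None:
        return list(ALL_REMOTE_CLUSTERS)   # no state known: broadcast probes
    return sorted(cached_at)               # filtered probe set

assert clusters_to_probe(0x40) == ["cluster1"]            # one probe, not three
assert clusters_to_probe(0x80) == []                      # no remote probes needed
assert clusters_to_probe(0x123) == list(ALL_REMOTE_CLUSTERS)
```

The empty-set case is what allows a home cluster to answer some requests without any remote transactions at all.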
Figure 1A is a diagrammatic representation of one example of a multiple cluster, multiple processor system that can use the techniques of the present invention. Each processing cluster 101, 103, 105, and 107 can include a plurality of processors. The processing clusters 101, 103, 105, and 107 are connected to each other through point-to-point links 111a-f. In one embodiment, the multiple processors in the multiple cluster architecture shown in Figure 1A share the same memory space. In this example, the point-to-point links 111a-f are internal system connections that are used in place of a traditional front-side bus to connect the multiple processors in the multiple clusters 101, 103, 105, and 107. The point-to-point links may support any point-to-point coherence protocol.
Figure 1B is a diagrammatic representation of another example of a multiple cluster, multiple processor system that can use the techniques of the present invention. Each processing cluster 121, 123, 125, and 127 can be coupled to a switch 131 through point-to-point links 141a-d. It should be noted that using a switch and point-to-point links allows implementation with fewer point-to-point links when connecting multiple clusters in the system. A switch 131 can include a processor with a coherence protocol interface. According to various implementations, a multicluster system shown in Figure 1A is expanded using a switch 131 as shown in Figure 1B.
Figure 2 is a diagrammatic representation of a multiple processor cluster, such as the cluster 101 shown in Figure 1A. Cluster 200 includes processors 202a-202d, one or more Basic I/O systems (BIOS) 204, a memory subsystem comprising memory banks 206a-206d, point-to-point communication links 208a-208e, and a service processor 212. The point-to-point communication links are configured to allow interconnections between processors 202a-202d, I/O switch 210, and cache coherence controller 230. The service processor 212 is configured to allow communications with processors 202a-202d, I/O switch 210, and cache coherence controller 230 via a JTAG interface represented in Fig. 2 by links 214a-214f. It should be noted that other interfaces are supported. It should also be noted that in some implementations, a service processor is not included in multiple processor clusters. I/O switch 210 connects the rest of the system to I/O adapters 216 and 220.
According to specific embodiments, the service processor of the present invention has the intelligence to partition system resources according to a previously specified partitioning schema. The partitioning can be achieved through direct manipulation of routing tables associated with the system processors by the service processor, which is made possible by the point-to-point communication infrastructure. The routing tables are used to control and isolate various system resources, the connections between which are defined therein.
The processors 202a-d are also coupled to a cache coherence controller 230 through point-to-point links 232a-d. Any mechanism or apparatus that can be used to provide communication between multiple processor clusters while maintaining cache coherence is referred to herein as a cache coherence controller. The cache coherence controller 230 can be coupled to cache coherence controllers associated with other multiprocessor clusters. It should be noted that there can be more than one cache coherence controller in one cluster. The cache coherence controller 230 communicates with both processors 202a-d as well as remote clusters using a point-to-point protocol.

More generally, it should be understood that the specific architecture shown in Figure 2 is merely exemplary and that embodiments of the present invention are contemplated having different configurations and resource interconnections, and a variety of alternatives for each of the system resources shown. However, for purposes of illustration, specific details of server 200 will be assumed. For example, most of the resources shown in Fig. 2 are assumed to reside on a single electronic assembly. In addition, memory banks 206a-206d may comprise double data rate (DDR) memory which is physically provided as dual in-line memory modules (DIMMs). I/O adapter 216 may be, for example, an ultra direct memory access (UDMA) controller or a small computer system interface (SCSI) controller which provides access to a permanent storage device. I/O adapter 220 may be an Ethernet card adapted to provide communications with a network such as, for example, a local area network (LAN) or the Internet.
According to a specific embodiment and as shown in Fig. 2, both of I/O adapters 216 and 220 provide symmetric I/O access. That is, each provides access to equivalent sets of I/O. As will be understood, such a configuration would facilitate a partitioning scheme in which multiple partitions have access to the same types of I/O. However, it should also be understood that embodiments are envisioned in which partitions without I/O are created. For example, a partition including one or more processors and associated memory resources, i.e., a memory complex, could be created for the purpose of testing the memory complex.

According to one embodiment, service processor 212 is a Motorola MPC855T microprocessor which includes integrated chipset functions. The cache coherence controller 230 is an Application Specific Integrated Circuit (ASIC) supporting the local point-to-point coherence protocol. The cache coherence controller 230 can also be configured to handle a non-coherent protocol to allow communication with I/O devices. In one embodiment, the cache coherence controller 230 is a specially configured programmable chip such as a programmable logic device or a field programmable gate array.
Figure 3 is a diagrammatic representation of one example of a cache coherence controller 230. According to various embodiments, the cache coherence controller includes a protocol engine 305 configured to handle packets such as probes and requests received from processors in various clusters of a multiprocessor system. The functionality of the protocol engine 305 can be partitioned across several engines to improve performance. In one example, partitioning is done based on packet type (request, probe and response), direction (incoming and outgoing), or transaction flow (request flows, probe flows, etc.).

The protocol engine 305 has access to a pending buffer 309 that allows the cache coherence controller to track transactions such as recent requests and probes and associate the transactions with specific processors. Transaction information maintained in the pending buffer 309 can include transaction destination nodes, the addresses of requests for subsequent collision detection and protocol optimizations, response information, tags, and state information.
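A pending buffer of the sort just described can be modeled as a tag-indexed table. The field and method names in the sketch below are invented for illustration of the tracking behavior; they do not come from the pending buffer 309 itself.

```python
from dataclasses import dataclass

@dataclass
class PendingEntry:
    """One outstanding transaction: a tag, the request address (kept for
    collision detection), the source, destination nodes, and state."""
    tag: int
    address: int
    source: str
    destinations: tuple
    responses: int = 0
    state: str = "pending"

class PendingBuffer:
    def __init__(self):
        self.entries = {}                 # tag -> PendingEntry

    def track(self, entry):
        self.entries[entry.tag] = entry

    def collides(self, address):
        """Collision detection: is another transaction to this address in flight?"""
        return any(e.address == address for e in self.entries.values())

    def record_response(self, tag):
        """Count a probe response; retire the entry once all destinations reply."""
        entry = self.entries[tag]
        entry.responses += 1
        if entry.responses == len(entry.destinations):
            entry.state = "complete"
            return self.entries.pop(tag)
        return None

buf = PendingBuffer()
buf.track(PendingEntry(tag=7, address=0x40, source="cpu0",
                       destinations=("node1", "node2")))
assert buf.collides(0x40)
assert buf.record_response(7) is None          # one of two responses seen
assert buf.record_response(7).state == "complete"
assert not buf.collides(0x40)
```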
The cache coherence controller has an interface such as a coherent protocol interface 307 that allows the cache coherence controller to communicate with other processors in the cluster as well as external processor clusters. According to various embodiments, each interface 307 and 311 is implemented either as a full crossbar or as separate receive and transmit units using components such as multiplexers and buffers. The cache coherence controller can also include other interfaces such as a non-coherent protocol interface 311 for communicating with I/O devices. It should be noted, however, that the cache coherence controller 230 does not necessarily need to provide both coherent and non-coherent interfaces. It should also be noted that a cache coherence controller in one cluster can communicate with a cache coherence controller in another cluster.
Figure 4 is a diagrammatic representation showing the transactions for a cache request from a processor in a system having a single cluster without using a cache coherence controller. A processor 401-1 sends an access request such as a read memory line request to a memory controller 403-1. The memory controller 403-1 may be associated with this processor, another processor in the single cluster or may be a separate component such as an ASIC or specially configured Programmable Logic Device (PLD). To preserve cache coherence, only one processor is typically allowed to access a memory line corresponding to a shared address space at any one given time. To prevent other processors from attempting to access the same memory line, the memory line can be locked by the memory controller 403-1. All other requests to the same memory line are blocked or queued. Access by another processor is typically only allowed when the memory controller 403-1 unlocks the memory line.

The memory controller 403-1 then sends probes to the local cache memories 405, 407, and 409 to determine cache states. The local cache memories 405, 407, and 409 then in turn send probe responses to the same processor 401-2. The memory controller 403-1 also sends an access response such as a read response to the same processor 401-3. The processor 401-3 can then send a done response to the memory controller 403-2 to allow the memory controller 403-2 to unlock the memory line for subsequent requests. It should be noted that CPU 401-1, CPU 401-2, and CPU 401-3 refer to the same processor.
Figures 5A-5D are diagrammatic representations depicting cache coherence controller operation. The use of a cache coherence controller in multiprocessor clusters allows the creation of a multiprocessor, multicluster coherent domain without affecting the functionality of local nodes such as processors and memory controllers in each cluster. In some instances, processors may only support a protocol that allows for a limited number of processors in a single cluster without allowing for multiple clusters. The cache coherence controller can be used to allow multiple clusters by making local processors believe that the non-local nodes are merely a single local node embodied in the cache coherence controller. In one example, the processors in a cluster do not need to be aware of processors in other clusters. Instead, the processors in the cluster communicate with the cache coherence controller as though the cache coherence controller were representing all non-local nodes.

It should be noted that nodes in a remote cluster will be referred to herein as non-local nodes or as remote nodes. However, non-local nodes refer to nodes not in a request cluster generally and include nodes in both a remote cluster and nodes in a home cluster. A cluster from which a data access or cache access request originates is referred to herein as a request cluster. A cluster containing a serialization point is referred to herein as a home cluster. Other clusters are referred to as remote clusters. The home cluster and the remote cluster are also referred to herein as non-local clusters.
Figure 5A shows the cache coherence controller acting as an aggregate remote cache. When a processor 501-1 generates a data access request to a local memory controller 503-1, the cache coherence controller 509 accepts the probe from the local memory controller 503-1 and forwards it to non-local node portion 511. It should be noted that a coherence protocol can contain several types of messages. In one example, a coherence protocol includes four types of messages: data or cache access requests, probes, responses or probe responses, and data packets. Data or cache access requests usually target the home node memory controller. Probes are used to query each cache in the system. The probe packet can carry information that allows the caches to properly transition the cache state for a specified line. Responses are used to carry probe response information and to allow nodes to inform other nodes of the state of a given transaction. Data packets carry request data for both write requests and read responses.
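The four message classes just enumerated can be captured in a small sketch. The names and the target mapping below are invented for illustration and do not come from any particular coherence protocol specification.

```python
from enum import Enum, auto

class MessageType(Enum):
    """The four coherence message classes described in the text."""
    REQUEST = auto()    # data or cache access request
    PROBE = auto()      # queries a cache; may carry cache-state transition info
    RESPONSE = auto()   # carries probe responses and transaction state updates
    DATA = auto()       # carries write-request data or read-response data

def usual_target(msg_type):
    """Where each message class is usually directed, per the description above."""
    return {
        MessageType.REQUEST: "home node memory controller",
        MessageType.PROBE: "each cache in the system",
        MessageType.RESPONSE: "source or target of the initiating request",
        MessageType.DATA: "requester or memory",
    }[msg_type]

assert usual_target(MessageType.REQUEST) == "home node memory controller"
```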
According to various embodiments, the memory address resides at the local memory controller. As noted above, nodes including processors and cache coherence controllers outside of a local cluster are referred to herein as non-local nodes. The cache coherence controller 509 then accumulates the responses from the non-local nodes and sends a single response in the same manner that local nodes associated with cache blocks 505 and 507 send a single response to processor 501-2. Local processors may expect a single probe response for every local node probed. The use of a cache coherence controller allows the local processors to operate without concern as to whether non-local nodes exist.

It should also be noted that components such as processor 501-1 and processor 501-2 refer herein to the same component at different points in time during a transaction sequence. For example, processor 501-1 can initiate a data access request and the same processor 501-2 can later receive probe responses resulting from the request.
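The single-response aggregation described for Figure 5A can be sketched as follows. This is an illustrative model with invented names, not the controller's actual logic: it simply collects one reply per expected non-local node and emits a single combined response once all have arrived.

```python
class AggregateResponder:
    """Collects probe responses from non-local nodes and emits one combined
    response, so local processors see a single 'node' for all remote state."""

    def __init__(self, expected_nodes):
        self.expected = set(expected_nodes)
        self.responses = {}

    def accept(self, node, payload):
        """Record one non-local response; return the combined response when
        every expected node has replied, else None."""
        self.responses[node] = payload
        if set(self.responses) == self.expected:
            return {"combined": True, "payloads": dict(self.responses)}
        return None

agg = AggregateResponder(["remote_a", "remote_b"])
assert agg.accept("remote_a", "clean") is None      # still waiting on remote_b
final = agg.accept("remote_b", "shared")
assert final["combined"] and len(final["payloads"]) == 2
```

Collapsing the replies this way is what lets a local processor expect exactly one probe response per probed node, whether or not non-local nodes exist.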
Figure 5B shows the cache coherence controller acting as a probing agent pair. When the cache coherence controller 521-1 receives a probe from non-local nodes 531, the cache coherence controller 521-1 accepts the probe and forwards the probe to local nodes associated with cache blocks 523, 525, and 527. The cache coherence controller 521-2 then forwards a final response to the non-local node portion 531. In this example, the cache coherence controller is both the source and the destination of the probes. The local nodes associated with cache blocks 523, 525, and 527 behave as if the cache coherence controller were a local processor with a local memory request.
Figure 5C shows the cache coherence controller acting as a remote memory. When a local processor 541-1 generates an access request that targets remote memory, the cache coherence controller 543-1 forwards the request to the non-local nodes 553. When the remote request specifies local probing, the cache coherence controller 543-1 generates probes to local nodes and the probed nodes provide responses to the processor 541-2. Once the cache coherence controller 543-1 has received data from the non-local node portion 553, it forwards a read response to the processor 541-3. The cache coherence controller also forwards the final response to the remote memory controller associated with non-local nodes 553.
Figure 5D shows the cache coherence controller acting as a remote proce