Attorney Docket No. NWISP052

PATENT APPLICATION

REDUCING PROBE TRAFFIC IN MULTIPROCESSOR SYSTEMS

Inventors:

Eric Morton of
Austin, Texas
United States citizen

Rajesh Kota of
Austin, Texas
Citizen of India

Adnan Khaleel of
Austin, Texas
Citizen of India

David B. Glasco of
Austin, Texas
United States citizen

Assignee:

Newisys, Inc.
A Delaware corporation

BEYER WEAVER & THOMAS, LLP
P.O. Box 778
Berkeley, California 94704-0778
(510) 843-6200
`
`REDUCING PROBE TRAFFIC IN MULTIPROCESSOR SYSTEMS
`
`
`CROSS-REFERENCE TO RELATED APPLICATIONS
`
The present application is a continuation-in-part of and claims priority under 35 U.S.C. 120 to U.S. Patent Application No. 10/288,347 for METHODS AND APPARATUS FOR MANAGING PROBE REQUESTS filed on November 4, 2002 (Attorney Docket No. NWISP024), the entire disclosure of which is incorporated herein by reference for all purposes. The subject matter described in the present application is also related to U.S. Patent Application No. 10/288,399 for METHODS AND APPARATUS FOR MANAGING PROBE REQUESTS filed on November 4, 2002 (Attorney Docket No. NWISP025), the entire disclosure of which is incorporated herein by reference for all purposes.
`
BACKGROUND OF THE INVENTION
`
The present invention generally relates to accessing data in a multiple processor system. More specifically, the present invention provides techniques for reducing memory transaction traffic in a multiple processor system.
`
Data access in multiple processor systems can raise issues relating to cache coherency. Conventional multiple processor computer systems have processors coupled to a system memory through a shared bus. In order to optimize access to data in the system memory, individual processors are typically designed to work with cache memory. In one example, each processor has a cache that is loaded with data that the processor frequently accesses. The cache is read or written by a processor. However, cache coherency problems arise because multiple copies of the same data can co-exist in systems having multiple processors and multiple cache memories. For example, a frequently accessed data block corresponding to a memory line may be loaded into the cache of two different processors. In one example, if both processors attempt to write new values into the data block at the same time, different data values may result. One value may be written into the first cache while a different value is written into the second cache. A system might then be unable to determine what value to write through to system memory.
`
A variety of cache coherency mechanisms have been developed to address such problems in multiprocessor systems. One solution is to simply force all processor writes to go through to memory immediately and bypass the associated cache. The write requests can then be serialized before overwriting a system memory line. However, bypassing the cache significantly decreases the efficiency gained by using a cache.
`
Other cache coherency mechanisms have been developed for specific architectures. In a shared bus architecture, each processor checks or snoops on the bus to determine whether it can read or write a shared cache block. In one example, a processor only writes an object when it owns or has exclusive access to the object. Each corresponding cache object is then updated to allow processors access to the most recent version of the object.
`
Bus arbitration is used when both processors attempt to write the same shared data block in the same clock cycle. Bus arbitration logic decides which processor gets the bus first. Although cache coherency mechanisms such as bus arbitration are effective, using a shared bus limits the number of processors that can be implemented in a single system with a single memory space.
`
Other multiprocessor schemes involve individual processor, cache, and memory systems connected to other processors, cache, and memory systems using a network backbone such as Ethernet or Token Ring. Multiprocessor schemes involving separate computer systems each with its own address space can avoid many cache coherency problems because each processor has its own associated memory and cache. When one processor wishes to access data on a remote computing system, communication is explicit. Messages are sent to move data to another processor and messages are received to accept data from another processor using standard network protocols such as TCP/IP. Multiprocessor systems using explicit communication including transactions such as sends and receives are referred to as systems using multiple private memories. By contrast, multiprocessor systems using implicit communication including transactions such as loads and stores are referred to herein as using a single address space.
`
Multiprocessor schemes using separate computer systems allow more processors to be interconnected while minimizing cache coherency problems. However, it would take substantially more time to access data held by a remote processor using a network infrastructure than it would take to access data held by a processor coupled to a system bus. Furthermore, valuable network bandwidth would be consumed moving data to the proper processors. This can negatively impact both processor and network performance.
`
Performance limitations have led to the development of a point-to-point architecture for connecting processors in a system with a single memory space. In one example, individual processors can be directly connected to each other through a plurality of point-to-point links to form a cluster of processors. Separate clusters of processors can also be connected. The point-to-point links significantly increase the bandwidth for coprocessing and multiprocessing functions. However, using a point-to-point architecture to connect multiple processors in a multiple cluster system sharing a single memory space presents its own problems.
`
Consequently, it is desirable to provide techniques for improving data access and cache coherency in systems having multiple processors connected using point-to-point links.
`
`
`
`
SUMMARY OF THE INVENTION
`
According to the present invention, various techniques are provided for reducing traffic relating to memory transactions in multi-processor systems. According to various specific embodiments, a computer system having a plurality of processing nodes interconnected by a first point-to-point architecture is provided. Each processing node has a cache memory associated therewith. A probe filtering unit is operable to receive probes corresponding to memory lines from the processing nodes and to transmit the probes only to selected ones of the processing nodes with reference to probe filtering information. The probe filtering information is representative of states associated with selected ones of the cache memories.
`
According to other embodiments, methods and apparatus are provided for reducing probe traffic in a computer system comprising a plurality of processing nodes interconnected by a first point-to-point architecture. A probe corresponding to a memory line is transmitted from a first one of the processing nodes only to a probe filtering unit. The probe is evaluated with the probe filtering unit to determine whether a valid copy of the memory line is in any of the cache memories. The evaluation is done with reference to probe filtering information associated with the probe filtering unit and representative of states associated with selected ones of the cache memories. The probe is transmitted from the probe filtering unit only to selected ones of the processing nodes identified by the evaluating. Probe responses from the selected processing nodes are accumulated by the probe filtering unit. Only the probe filtering unit responds to the first processing node.
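
By way of illustration only, the following sketch models the flow just summarized: the probe reaches only a probe filtering unit, which forwards it to the nodes its probe filtering information identifies, accumulates their responses, and alone responds to the requesting node. All class and member names (ProbeFilteringUnit, filter_info, and so on) are hypothetical and are not taken from the present disclosure.

```python
# Illustrative model of the probe filtering flow summarized above.
from enum import Enum

class State(Enum):
    INVALID = 0
    SHARED = 1
    MODIFIED = 2

class Node:
    def __init__(self, node_id):
        self.node_id = node_id
        self.cache = {}                        # memory line -> State

    def probe(self, line):
        # A probed node reports its copy (modeled here as an
        # invalidating probe, so the copy is dropped).
        return self.cache.pop(line, State.INVALID)

class ProbeFilteringUnit:
    def __init__(self, nodes):
        self.nodes = {n.node_id: n for n in nodes}
        # Probe filtering information: which nodes may hold each line.
        self.filter_info = {}                  # memory line -> set of node ids

    def handle_probe(self, line, requester_id):
        # Forward the probe only to nodes that may hold a valid copy.
        targets = self.filter_info.get(line, set()) - {requester_id}
        responses = [self.nodes[t].probe(line) for t in targets]
        self.filter_info[line] = {requester_id}
        # Accumulate the responses; only the probe filtering unit
        # responds to the requesting node.
        return [r for r in responses if r is not State.INVALID]

nodes = [Node(i) for i in range(4)]
pfu = ProbeFilteringUnit(nodes)
nodes[2].cache[0x40] = State.SHARED
pfu.filter_info[0x40] = {2}
print(pfu.handle_probe(0x40, requester_id=0))  # probes node 2 only
```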
`
A further understanding of the nature and advantages of the present invention may be realized by reference to the remaining portions of the specification and the drawings.
`
`
`
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
The invention may best be understood by reference to the following description taken in conjunction with the accompanying drawings, which are illustrative of specific embodiments of the present invention.
`
Figures 1A and 1B are diagrammatic representations depicting a system having multiple clusters.

Figure 2 is a diagrammatic representation of a cluster having a plurality of processors.

Figure 3 is a diagrammatic representation of a cache coherence controller.

Figure 4 is a diagrammatic representation showing a transaction flow for a data access request from a processor in a single cluster.

Figures 5A-5D are diagrammatic representations showing cache coherence controller functionality.

Figure 6 is a diagrammatic representation depicting a transaction flow for a request with multiple probe responses.

Figure 7 is a diagrammatic representation showing a cache coherence directory.

Figure 8 is a diagrammatic representation showing probe filter information that can be used to reduce the number of probes transmitted to various clusters.

Figure 9 is a diagrammatic representation showing a transaction flow for probing of a home cluster without probing of other clusters.

Figure 10 is a diagrammatic representation showing a transaction flow for probing of a single remote cluster.

Figure 11 is a flow process diagram showing the handling of a request with probe filter information.

Figure 12 is a diagrammatic representation showing memory controller filter information.

Figure 13 is a diagrammatic representation showing a transaction flow for probing a single remote cluster without probing a home cluster.

Figure 14 is a flow process diagram showing the handling of a request at a home cluster cache coherence controller using memory controller filter information.

Figure 15 is a diagrammatic representation showing a transaction flow for a cache coherence directory eviction of an entry corresponding to a dirty memory line.

Figure 16 is a diagrammatic representation showing a transaction flow for a cache coherence directory eviction of an entry corresponding to a clean memory line.

Figure 17 is a diagrammatic representation of a cache coherence controller according to a specific embodiment of the invention.

Figure 18 is a diagrammatic representation of a cluster having a plurality of processing nodes and a probe filtering unit.

Figure 19 is an exemplary representation of a processing node.

Figure 20 is a flowchart illustrating local probe filtering according to a specific embodiment of the invention.

Figure 21 is a diagrammatic representation of a transaction flow in which local probe filtering is facilitated according to a specific embodiment of the invention.

Figure 22 is a diagrammatic representation of another transaction flow in which local probe filtering is facilitated according to a specific embodiment of the invention.
`
`
`
`DETAILED DESCRIPTION OF SPECIFIC EMBODIMENTS
`
Reference will now be made in detail to some specific embodiments of the invention including the best modes contemplated by the inventors for carrying out the invention. Examples of these specific embodiments are illustrated in the accompanying drawings. While the invention is described in conjunction with these specific embodiments, it will be understood that it is not intended to limit the invention to the described embodiments. On the contrary, it is intended to cover alternatives, modifications, and equivalents as may be included within the spirit and scope of the invention as defined by the appended claims. Multi-processor architectures having point-to-point communication among their processors are suitable for implementing specific embodiments of the present invention. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. The present invention may be practiced without some or all of these specific details. Well-known process operations have not been described in detail in order not to unnecessarily obscure the present invention. Furthermore, the present application's reference to a particular singular entity includes the possibility that the methods and apparatus of the present invention can be implemented using more than one entity, unless the context clearly dictates otherwise.
`
According to various embodiments, techniques are provided for increasing data access efficiency in a multiple processor system. In a point-to-point architecture, a cluster of processors includes multiple processors directly connected to each other through point-to-point links. By using point-to-point links instead of a conventional shared bus or external network, multiple processors are used efficiently in a system sharing the same memory space. Processing and network efficiency are also improved by avoiding many of the bandwidth and latency limitations of conventional bus and external network based multiprocessor architectures. According to various embodiments, however, linearly increasing the number of processors in a point-to-point architecture leads to an exponential increase in the number of links used to connect the multiple processors. In order to reduce the number of links used and to further modularize a multiprocessor system using a point-to-point architecture, multiple clusters may be used.
`
`
`
`
According to some embodiments, multiple processor clusters are interconnected using a point-to-point architecture. Each cluster of processors includes a cache coherence controller used to handle communications between clusters. In one embodiment, the point-to-point architecture used to connect processors is used to connect clusters as well.
`
`
By using a cache coherence controller, multiple cluster systems can be built using processors that may not necessarily support multiple clusters. Such a multiple cluster system can be built by using a cache coherence controller to represent non-local nodes in local transactions so that local nodes do not need to be aware of the existence of nodes outside of the local cluster. More detail on the cache coherence controller will be provided below.
`
In a single cluster system, cache coherency can be maintained by sending all data access requests through a serialization point. Any mechanism for ordering data access requests (also referred to herein as requests and memory requests) is referred to herein as a serialization point. One example of a serialization point is a memory controller. Various processors in the single cluster system send data access requests to one or more memory controllers. In one example, each memory controller is configured to serialize or lock the data access requests so that only one data access request for a given memory line is allowed at any particular time. If another processor attempts to access the same memory line, the data access attempt is blocked until the memory line is unlocked. The memory controller allows cache coherency to be maintained in a multiple processor, single cluster system.
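
The locking behavior described above can be pictured with the following minimal sketch, which assumes a per-line lock and a queue of blocked requests; the structure and names are illustrative only and are not prescribed by the present disclosure.

```python
# Hypothetical serialization point: at most one outstanding data
# access request per memory line.
from collections import deque

class MemoryController:
    def __init__(self):
        self.locked = set()      # memory lines with a request in flight
        self.waiting = {}        # memory line -> queue of blocked requesters

    def request(self, line, requester):
        if line in self.locked:
            # The line is locked: block/queue this access attempt.
            self.waiting.setdefault(line, deque()).append(requester)
            return False
        self.locked.add(line)    # serialize: lock the line
        return True

    def done(self, line):
        # Completion unlocks the line, or hands it to the next requester.
        queue = self.waiting.get(line)
        if queue:
            return queue.popleft()   # line stays locked for this requester
        self.locked.discard(line)
        return None

mc = MemoryController()
assert mc.request(0x100, "cpu0")       # cpu0 acquires the line
assert not mc.request(0x100, "cpu1")   # cpu1 is blocked
assert mc.done(0x100) == "cpu1"        # cpu1 proceeds next
```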
`
`
A serialization point can also be used in a multiple processor, multiple cluster system where the processors in the various clusters share a single address space. By using a single address space, internal point-to-point links can be used to significantly improve intercluster communication over traditional external network based multiple cluster systems. Various processors in various clusters send data access requests to a memory controller associated with a particular cluster such as a home cluster. The memory controller can similarly serialize all data requests from the different clusters.
`
`
`
However, a serialization point in a multiple processor, multiple cluster system may not be as efficient as a serialization point in a multiple processor, single cluster system. That is, delay resulting from factors such as latency from transmitting between clusters can adversely affect the response times for various data access requests. It should be noted that delay also results from the use of probes in a multiple processor environment.
`
Although delay in intercluster transactions in an architecture using a shared memory space is significantly less than the delay in conventional message passing environments using external networks such as Ethernet or Token Ring, even minimal delay is a significant factor. In some applications, there may be millions of data access requests from a processor in a fraction of a second. Any delay can adversely impact processor performance.
`
According to various embodiments, probe management is used to increase the efficiency of accessing data in a multiple processor, multiple cluster system. A mechanism for eliciting a response from a node to maintain cache coherency in a system is referred to herein as a probe. In one example, a mechanism for snooping a cache is referred to as a probe. A response to a probe can be directed to the source or target of the initiating request. Any mechanism for filtering or reducing the number of probes and requests transmitted to various nodes is referred to herein as managing probes. In one example, managing probes entails characterizing a request to determine if a probe can be transmitted to a reduced number of entities.
`
In typical implementations, requests are sent to a memory controller that broadcasts probes to various nodes in a system. In such a system, no knowledge of the cache line state needs to be maintained by the memory controller. All nodes in the system are probed and the request cluster receives a response from each node. In a system with a coherence directory, state information associated with various memory lines can be used to reduce the number of transactions. Any mechanism for maintaining state information associated with various memory lines is referred to herein as a coherence directory. According to some embodiments, a coherence directory includes information for memory lines in a local cluster that are cached in a remote cluster. According to others, such a directory includes information for locally cached lines. According to various embodiments, a coherence directory is used to reduce the number of probes to remote quads by inferring the state of local caches. According to some embodiments, such a directory mechanism is used in a single cluster system or within a cluster in a multi-cluster system to reduce the number of probes within a cluster.
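
As a rough illustration of such a directory (the entry layout shown is an assumption rather than a format taught herein), the following sketch records which clusters hold a copy of a locally homed memory line and returns only those clusters as probe targets:

```python
class CoherenceDirectory:
    """Hypothetical coherence directory: tracks, per locally homed
    memory line, which clusters hold a copy, so probes need not be
    broadcast to every cluster in the system."""

    def __init__(self):
        self.entries = {}   # line -> {"dirty": bool, "sharers": set()}

    def record(self, line, cluster, dirty=False):
        entry = self.entries.setdefault(line, {"dirty": False, "sharers": set()})
        entry["sharers"].add(cluster)
        entry["dirty"] = entry["dirty"] or dirty

    def probe_targets(self, line):
        entry = self.entries.get(line)
        # No entry recorded: no remote cluster holds the line, so the
        # probe can be filtered entirely instead of broadcast.
        return set() if entry is None else set(entry["sharers"])

directory = CoherenceDirectory()
directory.record(0x80, cluster=3, dirty=True)
print(directory.probe_targets(0x80))   # {3}: probe one cluster, not all
print(directory.probe_targets(0xC0))   # set(): no remote probes needed
```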
`
`
Figure 1A is a diagrammatic representation of one example of a multiple cluster, multiple processor system that can use the techniques of the present invention. Each processing cluster 101, 103, 105, and 107 can include a plurality of processors. The processing clusters 101, 103, 105, and 107 are connected to each other through point-to-point links 111a-f. In one embodiment, the multiple processors in the multiple cluster architecture shown in Figure 1A share the same memory space. In this example, the point-to-point links 111a-f are internal system connections that are used in place of a traditional front-side bus to connect the multiple processors in the multiple clusters 101, 103, 105, and 107. The point-to-point links may support any point-to-point protocol.
`
Figure 1B is a diagrammatic representation of another example of a multiple cluster, multiple processor system that can use the techniques of the present invention. Each processing cluster 121, 123, 125, and 127 can be coupled to a switch 131 through point-to-point links 141a-d. It should be noted that using a switch and point-to-point links allows implementation with fewer point-to-point links when connecting multiple clusters in the system. A switch 131 can include a processor with a coherence protocol interface. According to various implementations, a multicluster system shown in Figure 1A is expanded using a switch 131 as shown in Figure 1B.
`
Figure 2 is a diagrammatic representation of a multiple processor cluster, such as the cluster 101 shown in Figure 1A. Cluster 200 includes processors 202a-202d, one or more Basic I/O systems (BIOS) 204, a memory subsystem comprising memory banks 206a-206d, point-to-point communication links 208a-208e, and a service processor 212. The point-to-point communication links are configured to allow interconnections between processors 202a-202d, I/O switch 210, and cache coherence controller 230.
`
`
`
The service processor 212 is configured to allow communications with processors 202a-202d, I/O switch 210, and cache coherence controller 230 via a JTAG interface represented in Figure 2 by links 214a-214f. It should be noted that other interfaces are supported. It should also be noted that in some implementations, a service processor is not included in multiple processor clusters. I/O switch 210 connects the rest of the system to I/O adapters 216 and 220. It should further be noted that the terms node and processor are often used interchangeably herein. However, it should be understood that according to various implementations, a node (e.g., processors 202a-202d) may comprise multiple sub-units, e.g., CPUs, memory controllers, I/O bridges, etc.
`
`
According to specific embodiments, the service processor of the present invention has the intelligence to partition system resources according to a previously specified partitioning schema. The partitioning can be achieved through direct manipulation of routing tables associated with the system processors by the service processor, which is made possible by the point-to-point communication infrastructure. The routing tables are used to control and isolate various system resources, the connections between which are defined therein.
`
The processors 202a-d are also coupled to a cache coherence controller 230 through point-to-point links 232a-d. Any mechanism or apparatus that can be used to provide communication between multiple processor clusters while maintaining cache coherence is referred to herein as a cache coherence controller. The cache coherence controller 230 can be coupled to cache coherence controllers associated with other multiprocessor clusters. It should be noted that there can be more than one cache coherence controller in one cluster. The cache coherence controller 230 communicates with both processors 202a-d as well as remote clusters using a point-to-point protocol.
`
More generally, it should be understood that the specific architecture shown in Figure 2 is merely exemplary and that embodiments of the present invention are contemplated having different configurations and resource interconnections, and a variety of alternatives for each of the system resources shown. However, for purposes of illustration, specific details of server 200 will be assumed. For example, most of the resources shown in Figure 2 are assumed to reside on a single electronic assembly. In addition, memory banks 206a-206d may comprise double data rate (DDR) memory which is physically provided as dual in-line memory modules (DIMMs). I/O adapter 216 may be, for example, an ultra direct memory access (UDMA) controller or a small computer system interface (SCSI) controller which provides access to a permanent storage device. I/O adapter 220 may be an Ethernet card adapted to provide communications with a network such as, for example, a local area network (LAN) or the Internet.
`
`
According to a specific embodiment and as shown in Figure 2, both of I/O adapters 216 and 220 provide symmetric I/O access. That is, each provides access to equivalent sets of I/O. As will be understood, such a configuration would facilitate a partitioning scheme in which multiple partitions have access to the same types of I/O. However, it should also be understood that embodiments are envisioned in which partitions without I/O are created. For example, a partition including one or more processors and associated memory resources, i.e., a memory complex, could be created for the purpose of testing the memory complex.
`
According to one embodiment, service processor 212 is a Motorola MPC855T microprocessor which includes integrated chipset functions. The cache coherence controller 230 is an Application Specific Integrated Circuit (ASIC) supporting the local point-to-point coherence protocol. The cache coherence controller 230 can also be configured to handle a non-coherent protocol to allow communication with I/O devices. In one embodiment, the cache coherence controller 230 is a specially configured programmable chip such as a programmable logic device or a field programmable gate array.
`
Figure 3 is a diagrammatic representation of one example of a cache coherence controller 230. According to various embodiments, the cache coherence controller includes a protocol engine 305 configured to handle packets such as probes and requests received from processors in various clusters of a multiprocessor system. The functionality of the protocol engine 305 can be partitioned across several engines to improve performance. In one example, partitioning is done based on packet type (request, probe, and response), direction (incoming and outgoing), or transaction flow (request flows, probe flows, etc.).
`
The protocol engine 305 has access to a pending buffer 309 that allows the cache coherence controller to track transactions such as recent requests and probes and associate the transactions with specific processors. Transaction information maintained in the pending buffer 309 can include transaction destination nodes, the addresses of requests for subsequent collision detection and protocol optimizations, response information, tags, and state information.
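
A pending buffer of the kind just described might be pictured as follows; the field names simply mirror the categories of transaction information listed above and are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class PendingTransaction:
    """One tracked transaction; the fields mirror the kinds of
    information the text says the pending buffer can hold."""
    tag: int                 # identifies the transaction
    address: int             # kept for subsequent collision detection
    destinations: list       # transaction destination nodes
    response_info: dict = field(default_factory=dict)
    state: str = "outstanding"

class PendingBuffer:
    def __init__(self):
        self.entries = {}    # tag -> PendingTransaction

    def track(self, txn):
        self.entries[txn.tag] = txn

    def collides(self, address):
        # A new request collides if its address matches one in flight.
        return any(t.address == address for t in self.entries.values())

    def retire(self, tag):
        return self.entries.pop(tag, None)
```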
`
The cache coherence controller has an interface such as a coherent protocol interface 307 that allows the cache coherence controller to communicate with other processors in the cluster as well as external processor clusters. The cache coherence controller can also include other interfaces such as a non-coherent protocol interface 311 for communicating with I/O devices. According to various embodiments, each interface 307 and 311 is implemented either as a full crossbar or as separate receive and transmit units using components such as multiplexers and buffers. It should be noted, however, that the cache coherence controller 230 does not necessarily need to provide both coherent and non-coherent interfaces. It should also be noted that a cache coherence controller in one cluster can communicate with a cache coherence controller in another cluster.
`
Figure 4 is a diagrammatic representation showing the transactions for a cache request from a processor in a system having a single cluster without using a cache coherence controller or other probe management mechanism. A processor 401-1 sends an access request such as a read memory line request to a memory controller 403-1. The memory controller 403-1 may be associated with this processor, another processor in the single cluster, or may be a separate component such as an ASIC or specially configured Programmable Logic Device (PLD). To preserve cache coherence, only one processor is typically allowed to access a memory line corresponding to a shared address space at any one given time. To prevent other processors from attempting to access the same memory line, the memory line can be locked by the memory controller 403-1. All other requests to the same memory line are blocked or queued. Access by another processor is typically only allowed when the memory controller 403-1 unlocks the memory line.
`
The memory controller 403-1 then sends probes to the local cache memories 405, 407, and 409 to determine cache states. The local cache memories 405, 407, and 409 then in turn send probe responses to the same processor 401-2. The memory controller 403-1 also sends an access response such as a read response to the same processor 401-3. The processor 401-3 can then send a done response to the memory controller 403-2 to allow the memory controller 403-2 to unlock the memory line for subsequent requests. It should be noted that CPU 401-1, CPU 401-2, and CPU 401-3 refer to the same processor.
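
The complete sequence can be pictured with the following toy trace (structure and names assumed rather than taken from the disclosure); note that every local cache is probed on every request because no probe management mechanism is present:

```python
def single_cluster_read(requester, local_caches, memory):
    """Toy trace of the Figure 4 sequence: request, lock, probe
    broadcast, probe responses, read response, done, unlock."""
    memory["locked"] = True                          # controller locks the line
    probes = [c.get("state", "invalid") for c in local_caches]  # probes to 405/407/409
    read_response = memory["data"]                   # access response to the requester
    memory["locked"] = False                         # "done" unlocks the line
    return read_response, probes

caches = [{"state": "shared"}, {}, {}]               # caches 405, 407, 409
memory = {"data": 0xCAFE, "locked": False}
print(single_cluster_read("cpu401", caches, memory))
```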
`
Figures 5A-5D are diagrammatic representations depicting cache coherence controller operation. The use of a cache coherence controller in multiprocessor clusters allows the creation of a multiprocessor, multicluster coherent domain without affecting the functionality of local nodes in each cluster. In some instances, processors may only support a protocol that allows for a limited number of processors in a single cluster without allowing for multiple clusters. The cache coherence controller can be used to allow multiple clusters by making local processors believe that the non-local nodes are merely one or more local nodes embodied in the cache coherence controller. In one example, the processors in a cluster do not need to be aware of processors in other clusters. Instead, the processors in the cluster communicate with the cache coherence controller as though the cache coherence controller were representing all non-local nodes. In addition, although generally a node may correspond to one or a plurality of resources (including, for example, a processor), it should be noted that the terms node and processor are often used interchangeably herein. According to a particular implementation, a node comprises multiple sub-units, e.g., CPUs, memory controllers, I/O bridges, etc.
`
It should be noted that nodes in a remote cluster will be referred to herein as non-local nodes or as remote nodes. However, non-local nodes refer to nodes not in a request cluster generally and include nodes in both a remote cluster and nodes in a home cluster. A cluster from which a data access or cache access request originates is referred to herein as a request cluster. A cluster containing a serialization point is referred to herein as a home cluster. Other clusters are referred to as remote clusters. The home cluster and the remote cluster are also referred to herein as non-local clusters.
`
`
Figure 5A shows the cache coherence controller acting as an aggregate remote cache. When a processor 501-1 generates a data access request to a local memory controller 503-1, the cache coherence controller 509 accepts the probe from the local memory controller 503-1 and forwards it to non-local node portion 511. It should be noted that a coherence protocol can contain several types of messages. In one example, a coherence protocol includes four types of messages: data or cache access requests, probes, responses or probe responses, and data packets. Data or cache access requests usually target the home node memory controller. Probes are used to query each cache in the system. The probe packet can carry information that allows the caches to properly transition the cache state for a specified line. Responses are used to carry probe response information and to allow nodes to inform other nodes of the state of a given transaction. Data packets carry request data for both write requests and read responses.
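
For concreteness, the four message types can be rendered as the following hypothetical encoding; the actual packet formats are not specified in this description:

```python
from enum import Enum, auto

class CoherenceMessage(Enum):
    """The four message types named above, as an assumed encoding."""
    REQUEST = auto()    # data/cache access request; targets the home memory controller
    PROBE = auto()      # queries each cache; may carry state-transition info
    RESPONSE = auto()   # carries probe responses and transaction state
    DATA = auto()       # carries data for write requests and read responses
```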
`
According to various embodiments, the memory address resides at the local memory controller. As noted above, nodes including processors and cache coherence controllers outside of a local cluster are referred to herein as non-local nodes. The cache coherence controller 509 then accumulates the responses from the non-local nodes and sends a single response in the same manner that local nodes associated with cache blocks 505 and 507 send a single response to processor 501-2. Local processors may expect a single probe response for every local node probed. The use of a cache coherence controller allows the local processors to operate without concern as to whether non-local nodes exist.
`
It should also be noted that components such as processor 501-1 and processor 501-2 refer herein to the same component at different points in time during a transaction sequence. For example, processor 501-1 can initiate a data access request and the same processor 501-2 can later receive probe responses resulting from the request.
`
`Figure 5B shows the cache coherence controller acting as a probing ag