`__________________
`BEFORE THE PATENT TRIAL AND APPEAL BOARD
`__________________________________________________________________
`
`SONY MOBILE COMMUNICATIONS (USA) INC.
`Petitioner
`
`
`Patent No. 7,296,121
`Issue Date: Nov. 13, 2007
`Title: REDUCING PROBE TRAFFIC IN MULTIPROCESSOR SYSTEMS
`__________________________________________________________________
`
`EXHIBIT
`COMPARISON OF ’121 PATENT AND
`’633 PATENT SPECIFICATIONS
`
`No. IPR2015-00158
`__________________________________________________________________
`
`
`
`On the following pages, the specification of U.S. Pat. No. 7,296,121 (“the ’121
`Patent”) is compared to the specification of U.S. Pat. No. 7,003,633 (“the ’633 Patent”).
`Blue text with a double underline appears in the ’121 Patent’s specification but not in
`the ’633 Patent’s specification. Red text in strikeout appears in the ’633 Patent’s
`specification but not in the ’121 Patent’s specification. Black text appears in both the
`’121 Patent’s and the ’633 Patent’s specifications. Green text appears in both the ’121 Patent’s
`and the ’633 Patent’s specifications; however, the location of the text has been moved.
`
`
`
`Petition for Inter Partes Review of
`U.S. Pat. No. 7,296,121
`IPR2015‐00158
`EXHIBIT
`Sony‐
`
`Description
`
`
`CROSS-REFERENCE TO RELATED APPLICATIONS
`
`The present application is related to filed U.S. application Ser. No.
`10/106,426 titled Methods And Apparatus For Speculative Probing At
`A Request Cluster, U.S. application Ser. No. 10/106,430 titled
`Methods And Apparatus For Speculative Probing With Early Completion
`And Delayed Request, and U.S. application Ser. No. 10/106,299 titled
`Methods And Apparatus For Speculative Probing With Early Completion
`And Early Request, the entireties of which are incorporated by
`reference herein for all purposes. The present application is also
`related to filed U.S. application Ser. No. 10/157,340, now U.S. Pat.
`No. 6,865,595, Ser. No. 10/145,439, Ser. No. 10/145,438, and Ser. No.
`10/157,388 titled Methods And Apparatus For Responding To A Request
`Cluster by David B. Glasco, the entireties of which are incorporated
`by reference for all purposes. The present application is also related
`to concurrently filed U.S. application Ser. No. 10/288,399 with the
`same title and inventor, the entirety of which is incorporated by
`reference herein for all purposes.
`
`BACKGROUND OF THE INVENTION
`
`1. Field of the Invention
`
`The present invention generally relates to accessing data in a
`multiple processor system. More specifically, the present invention
`provides techniques for improving data access efficiency while
`maintaining cache coherency reducing memory transaction traffic in a
`multiple processor system having a multiple cluster architecture.
`
`2. Description of Related Art.
`
`Data access in multiple processor systems can raise issues relating
`to cache coherency. Conventional multiple processor computer systems
`have processors coupled to a system memory through a shared bus. In
`order to optimize access to data in the system memory, individual
`processors are typically designed to work with cache memory. In one
`example, each processor has a cache that is loaded with data that the
`processor frequently accesses. The cache is read or written by a
`processor. However, cache coherency problems arise because multiple
`copies of the same data can co-exist in systems having multiple
`processors and multiple cache memories. For example, a frequently
`accessed data block corresponding to a memory line may be loaded into
`the cache of two different processors. In one example, if both
`processors attempt to write new values into the data block at the same
`time, different data values may result. One value may be written into
`the first cache while a different value is written into the second
`cache. A system might then be unable to determine what value to write
`through to system memory.
`
`A variety of cache coherency mechanisms have been developed to address
`such problems in multiprocessor systems. One solution is to simply
`force all processor writes to go through to memory immediately and
`bypass the associated cache. The write requests can then be serialized
`before overwriting a system memory line. However, bypassing the cache
`significantly decreases efficiency gained by using a cache. Other
`cache coherency mechanisms have been developed for specific
`architectures. In a shared bus architecture, each processor checks
`or snoops on the bus to determine whether it can read or write a shared
`cache block. In one example, a processor only writes an object when
`it owns or has exclusive access to the object. Each corresponding
`cache object is then updated to allow processors access to the most
`recent version of the object.
`
`Bus arbitration is used when both processors attempt to write the same
`shared data block in the same clock cycle. Bus arbitration logic
`decides which processor gets the bus first. Although, cache coherency
`mechanisms such as bus arbitration are effective, using a shared bus
`limits the number of processors that can be implemented in a single
`system with a single memory space.
`
`Other multiprocessor schemes involve individual processor, cache,
`and memory systems connected to other processors, cache, and memory
`systems using a network backbone such as Ethernet or Token Ring.
`Multiprocessor schemes involving separate computer systems each with
`its own address space can avoid many cache coherency problems because
`each processor has its own associated memory and cache. When one
`processor wishes to access data on a remote computing system,
`communication is explicit. Messages are sent to move data to another
`processor and messages are received to accept data from another
`processor using standard network protocols such as TCP/IP.
`Multiprocessor systems using explicit communication including
`transactions such as sends and receives are referred to as systems
`using multiple private memories. By contrast, multiprocessor system
`using implicit communication including transactions such as loads and
`stores are referred to herein as using a single address space.
`
`Multiprocessor schemes using separate computer systems allow more
`processors to be interconnected while minimizing cache coherency
`problems. However, it would take substantially more time to access
`data held by a remote processor using a network infrastructure than
`it would take to access data held by a processor coupled to a system
`bus. Furthermore, valuable network bandwidth would be consumed moving
`data to the proper processors. This can negatively impact both
`processor and network performance.
`
`Performance limitations have led to the development of a
`point-to-point architecture for connecting processors in a system
`with a single memory space. In one example, individual processors can
`be directly connected to each other through a plurality of
`point-to-point links to form a cluster of processors. Separate
`clusters of processors can also be connected. The point-to-point
`links significantly increase the bandwidth for coprocessing and
`multiprocessing functions. However, using a point-to-point
`architecture to connect multiple processors in a multiple cluster
`system sharing a single memory space presents its own problems.
`
`Consequently, it is desirable to provide techniques for improving
`data access and cache coherency in systems having multiple clusters
`of multiple processors connected using point-to-point links.
`
`SUMMARY OF THE INVENTION
`
`According to the present invention, methods and apparatus are
`provided for increasing the efficiency of data access in a multiple
`processor, multiple cluster system. Mechanisms for reducing the
`number of transactions in a multiple cluster system are provided. In
`one example, probe filter information is used to limit the number of
`probe requests transmitted to request and remote clusters. various
`techniques are provided for reducing traffic relating to memory
`transactions in multi-processor systems. According to various
`specific embodiments, a computer system having a plurality of
`processing nodes interconnected by a first point-to-point
`architecture is provided. Each processing node has a cache memory
`associated therewith. A probe filtering unit is operable to receive
`probes corresponding to memory lines from the processing nodes and
`to transmit the probes only to selected ones of the processing nodes
`with reference to probe filtering information. The probe filtering
`information is representative of states associated with selected ones
`of the cache memories.
`
`In one embodiment, a computer system is provided. The computer system
`includes a home cluster having a first plurality of processors and
`a home cache coherence controller. The first plurality of processors
`and the home cache coherence controller are interconnected in a
`point-to-point architecture. The home cache coherence controller is
`configured to receive a probe request and probe one or more selected
`clusters. The one or more clusters are selected based on the
`characteristics associated with the probe request.
`
`In another embodiment, a method for managing probes is provided. A
`probe request is received at a home cache coherence controller in a
`home cluster. The home cluster includes a first plurality of
`processors and the home cache coherence controller. The first
`plurality of processors and the home cache coherence controller are
`interconnected in a point-to-point architecture. One or more clusters
`are selected for probing based on the characteristics associated with
`the probe request. The one or more clusters are probed. According to
`other embodiments, methods and apparatus are provided for reducing
`probe traffic in a computer system comprising a plurality of
`processing nodes interconnected by a first point-to-point
`architecture. A probe corresponding to a memory line is transmitted
`from a first one of the processing nodes only to a probe filtering
`unit. The probe is evaluated with the probe filtering unit to
`determine whether a valid copy of the memory line is in any of the
`cache memories. The evaluation is done with reference to probe
`filtering information associated with the probe filtering unit and
`representative of states associated with selected ones of the cache
`memories. The probe is transmitted from the probe filtering unit only
`to selected ones of the processing nodes identified by the evaluating.
`Probe responses from the selected processing nodes are accumulated
`by the probe filtering unit. Only the probe filtering unit responds
`to the first processing node.
`
`A further understanding of the nature and advantages of the present
`invention may be realized by reference to the remaining portions of
`the specification and the drawings.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`The invention may best be understood by reference to the following
`description taken in conjunction with the accompanying drawings,
`which are illustrative of specific embodiments of the present
`invention.
`
`FIGS. 1A and 1B are diagrammatic representation depicting a system
`having multiple clusters.
`
`FIG. 2 is a diagrammatic representation of a cluster having a
`plurality of processors.
`
`FIG. 3 is a diagrammatic representation of a cache coherence
`controller.
`
`FIG. 4 is a diagrammatic representation showing a transaction flow
`for a data access request from a processor in a single cluster.
`
`FIG. 5A -5D are diagrammatic representations showing cache coherence
`controller functionality.
`
`FIG. 6 is a diagrammatic representation depicting a transaction flow
`for a probe request with multiple probe responses.
`
`FIG. 7 is a diagrammatic representation showing a cache coherence
`directory.
`
`FIG. 8 is a diagrammatic representation showing probe filter
`information that can be used to reduce the number of probes
`transmitted to various clusters.
`
`FIG. 9 is a diagrammatic representation showing a transaction flow
`for probing of a home cluster without probing of other clusters.
`
`FIG. 10 is a diagrammatic representation showing a transaction flow
`for probing of a single remote cluster.
`
`FIG. 11 is a flow process diagram showing the handling of a probe
`request with probe filter information.
`
`FIG. 12 is a diagrammatic representation showing memory controller
`filter information.
`
`FIG. 13 is a diagrammatic representation showing a transaction flow
`for probing a single remote cluster without probing a home cluster.
`
`FIG. 14 is a flow process diagram showing the handling of a probe
`request at a home cluster cache coherence controller using memory
`controller filter information.
`
`FIG. 15 is a diagrammatic representation showing a transaction flow
`for a cache coherence directory eviction of an entry corresponding
`to a dirty memory line.
`
`FIG. 16 is a diagrammatic representation showing a transaction flow
`for a cache coherence directory eviction of an entry corresponding
`to a clean memory line.
`
`FIG. 17 is a diagrammatic representation of a cache coherence
`controller according to a specific embodiment of the invention.
`
`FIG. 18 is a diagrammatic representation of a cluster having a
`plurality of processing nodes and a probe filtering unit.
`
`FIG. 19 is an exemplary representation of a processing node.
`
`FIG. 20 is a flowchart illustrating local probe filtering according
`to a specific embodiment of the invention.
`
`FIG. 21 is a diagrammatic representation of a transaction flow in
`which local probe filtering is facilitated according to a specific
`embodiment of the invention.
`
`FIG. 22 is a diagrammatic representation of another transaction flow
`in which local probe filtering is facilitated according to a specific
`embodiment of the invention.
`
`DETAILED DESCRIPTION OF SPECIFIC EMBODIMENTS
`
`Reference will now be made in detail to some specific embodiments of
`the invention including the best modes contemplated by the inventors
`for carrying out the invention. Examples of these specific
`embodiments are illustrated in the accompanying drawings. While the
`invention is described in conjunction with these specific
`embodiments, it will be understood that it is not intended to limit
`the invention to the described embodiments. On the contrary, it is
`intended to cover alternatives, modifications, and equivalents as may
`be included within the spirit and scope of the invention as defined
`by the appended claims. Multi-processor architectures having
`point-to-point communication among their processors are suitable for
`implementing specific embodiments of the present invention. In the
`following description, numerous specific details are set forth in
`order to provide a thorough understanding of the present invention.
`The present invention may be practiced without some or all of these
`specific details. Well-known process operations have not been
`described in detail in order not to unnecessarily obscure the present
`invention. Furthermore, the present application's reference to a
`particular singular entity includes that possibility that the methods
`and apparatus of the present invention can be implemented using more
`than one entity, unless the context clearly dictates otherwise.
`
`Techniques According to various embodiments, techniques are provided
`for increasing data access efficiency in a multiple processor,
`multiple cluster system. In a point-to-point architecture, a cluster
`of processors includes multiple processors directly connected to each
`other through point-to-point links. By using point-to-point links
`instead of a conventional shared bus or external network, multiple
`processors are used efficiently in a system sharing the same memory
`space. Processing and network efficiency are also improved by
`avoiding many of the bandwidth and latency limitations of
`conventional bus and external network based multiprocessor
`architectures. According to various embodiments, however, linearly
`increasing the number of processors in a point-to-point architecture
`leads to an exponential increase in the number of links used to connect
`the multiple processors. In order to reduce the number of links used
`and to further modularize a multiprocessor system using a
`point-to-point architecture, multiple clusters are may be used.
`
`According to various some embodiments, the multiple processor
`clusters are interconnected using a point-to-point architecture.
`Each cluster of processors includes a cache coherence controller used
`to handle communications between clusters. In one embodiment, the
`point-to-point architecture used to connect processors are used to
`connect clusters as well.
`
`By using a cache coherence controller, multiple cluster systems can
`be built using processors that may not necessarily support multiple
`clusters. Such a multiple cluster system can be built by using a cache
`coherence controller to represent non-local nodes in local
`transactions so that local nodes do not need to be aware of the
`existence of nodes outside of the local cluster. More detail on the
`cache coherence controller will be provided below.
`
`In a single cluster system, cache coherency can be maintained by
`sending all data access requests through a serialization point. Any
`mechanism for ordering data access requests (also referred to herein
`as requests and memory requests) is referred to herein as a
`serialization point. One example of a serialization point is a memory
`controller. Various processors in the single cluster system send data
`access requests to the one or more memory controller controllers. In
`one example, the each memory controller is configured to serialize or
`lock the data access requests so that only one data access request
`for a given memory line is allowed at any particular time. If another
`processor attempts to access the same memory line, the data access
`attempt is blocked until the memory line is unlocked. The memory
`controller allows cache coherency to be maintained in a multiple
`processor, single cluster system.
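
As an illustration only, the per-line locking behavior described above
can be sketched in Python. The class name MemoryController, its method
names, and the choice to queue blocked requests in arrival order are
assumptions made for this sketch; the specification states only that a
conflicting access is blocked until the memory line is unlocked.

```python
from collections import deque


class MemoryController:
    """Toy serialization point: at most one active request per memory line.

    Hypothetical sketch; names and queueing policy are assumptions.
    """

    def __init__(self):
        self.locked = {}   # line address -> processor currently holding the line
        self.waiting = {}  # line address -> deque of blocked processors

    def request(self, proc, line):
        """Return True if the request proceeds, False if it is blocked."""
        if line in self.locked:
            # Another request to the same line is in flight: block this one.
            self.waiting.setdefault(line, deque()).append(proc)
            return False
        self.locked[line] = proc
        return True

    def done(self, proc, line):
        """Unlock the line; hand it to the next blocked processor, if any."""
        if self.locked.get(line) != proc:
            raise ValueError("line is not locked by this processor")
        queue = self.waiting.get(line)
        if queue:
            next_proc = queue.popleft()
            self.locked[line] = next_proc
            if not queue:
                del self.waiting[line]
            return next_proc
        del self.locked[line]
        return None
```

In this sketch, requests to different memory lines never block one
another; only requests to the same line are serialized.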
`
`A serialization point can also be used in a multiple processor,
`multiple cluster system where the processors in the various clusters
`share a single address space. By using a single address space,
`internal point-to-point links can be used to significantly improve
`intercluster communication over traditional external network based
`multiple cluster systems. Various processors in various clusters send
`data access requests to a memory controller associated with a
`particular cluster such as a home cluster. The memory controller can
`similarly serialize all data requests from the different clusters.
`However, a serialization point in a multiple processor, multiple
`cluster system may not be as efficient as a serialization point in
`a multiple processor, single cluster system. That is, delay resulting
`from factors such as latency from transmitting between clusters can
`adversely affect the response times for various data access requests.
`It should be noted that delay also results from the use of probes in
`a multiple processor environment.
`
`Although delay in intercluster transactions in an architecture using
`a shared memory space is significantly less than the delay in
`conventional message passing environments using external networks
`such as Ethernet or Token Ring, even minimal delay is a significant
`factor. In some applications, there may be millions of data access
`requests from a processor in a fraction of a second. Any delay can
`adversely impact processor performance.
`
`According to various embodiments, probe management is used to
`increase the efficiency of accessing data in a multiple processor,
`multiple cluster system. A mechanism for eliciting a response from
`a node to maintain cache coherency in a system is referred to herein
`as a probe. In one example, a mechanism for snooping a cache is
`referred to as a probe. A response to a probe can be directed to the
`source or target of the initiating request. Any mechanism for
`filtering or reducing the number of probes and probe requests
`transmitted to various nodes is referred to herein as managing probes.
`In one example, managing probe probes entails characterizing a probe
`request to determine if a probe can be transmitted to a reduced number
`of entities.
`
`In typical implementations, probe requests are sent to a memory
`controller that broadcasts probes to various nodes in a system. In
`such a system, no knowledge of the cache line state is known needs to
`be maintained by the memory controller. All nodes in the system are
`probed and the request cluster receives a response from each node.
`In a system with a coherence directory, state information associated
`with various memory lines can be used to reduce the number of
`transactions. Any mechanism for maintaining state information
`associated with various memory lines is referred to herein as a
`coherence directory. A According to some embodiments, a coherence
`directory typically includes information for memory lines in a local
`cluster that are cached in a remote cluster. According to others, such
`a directory includes information for locally cached lines. According
`to various embodiments, a coherence directory is used to reduce the
`number of probes to remote quads by inferring the state of local
`caches. In other According to some embodiments, a coherence directory
`is used to eliminate the transmission such a directory mechanism is
`used in a single cluster system or within a cluster in a multi-cluster
`system to reduce the number of probes within a request to a memory
`controller in a home cluster.
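
The contrast drawn above between broadcasting probes and consulting a
coherence directory can be sketched as follows. This is a minimal
illustration, not the patented mechanism: the function name
probe_targets, the representation of the directory as a mapping from
memory line to the set of nodes holding a copy, and the rule that a
line absent from the directory is treated as cached nowhere are all
assumptions made for this example.

```python
def probe_targets(line, all_nodes, directory=None):
    """Return the set of nodes that would be probed for `line`.

    Hypothetical sketch. `directory` maps a memory line address to the
    set of nodes believed to hold a copy (an assumed representation).
    """
    if directory is None:
        # No state information is kept: every node is probed (broadcast).
        return set(all_nodes)
    # With state information, probe only nodes recorded as holding the line.
    # Assumption: a line missing from the directory is cached nowhere.
    return set(directory.get(line, ())) & set(all_nodes)


nodes = {"C0", "C1", "C2", "C3"}
directory = {0x100: {"C2"}}

broadcast = probe_targets(0x100, nodes)             # all four nodes
filtered = probe_targets(0x100, nodes, directory)   # only "C2"
uncached = probe_targets(0x200, nodes, directory)   # no probes at all
```

Under these assumptions, the directory reduces four probes to one for
line 0x100 and to none for line 0x200.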
`
`FIG. 1A is a diagrammatic representation of one example of a multiple
`cluster, multiple processor system that can use the techniques of the
`present invention. Each processing cluster 101, 103, 105, and 107 can
`include a plurality of processors. The processing clusters 101, 103,
`105, and 107 are connected to each other through point-to-point links
`111a -f. In one embodiment, the multiple processors in the multiple
`cluster architecture shown in FIG. 1A share the same memory space.
`In this example, the point-to-point links 111a -f are internal system
`connections that are used in place of a traditional front-side bus
`to connect the multiple processors in the multiple clusters 101, 103,
`105, and 107. The point-to-point links may support any point-to-point
`coherence protocol.
`
`FIG. 1B is a diagrammatic representation of another example of a
`multiple cluster, multiple processor system that can use the
`techniques of the present invention. Each processing cluster 121,
`123, 125, and 127 can be coupled to a switch 131 through point-to-point
`links 141a -d. It should be noted that using a switch and
`point-to-point links allows implementation with fewer point-to-point
`links when connecting multiple clusters in the system. A switch 131
`can include a processor with a coherence protocol interface.
`According to various implementations, a multicluster system shown in
`FIG. 1A is expanded using a switch 131 as shown in FIG. 1B.
`
`FIG. 2 is a diagrammatic representation of a multiple processor
`cluster, such as the cluster 101 shown in FIG. 1A. Cluster 200 includes
`processors 202a -202d, one or more Basic I/O systems (BIOS) 204, a
`memory subsystem comprising memory banks 206a -206d, point-to-point
`communication links 208a -208e, and a service processor 212. The
`point-to-point communication links are configured to allow
`interconnections between processors 202a -202d, I/O switch 210, and
`cache coherence controller 230. The service processor 212 is
`configured to allow communications with processors 202a -202d, I/O
`switch 210, and cache coherence controller 230 via a JTAG interface
`represented in FIG. 2 by links 214a -214f. It should be noted that
`other interfaces are supported. It should also be noted that in some
`implementations, a service processor is not included in multiple
`processor clusters. I/O switch 210 connects the rest of the system
`to I/O adapters 216 and 220. It should further be noted that the terms
`node and processor are often used interchangeably herein. However,
`it should be understood that according to various implementations,
`a node (e.g., processors 202a-202d) may comprise multiple sub-units,
`e.g., CPUs, memory controllers, I/O bridges, etc.
`
`According to specific embodiments, the service processor of the
`present invention has the intelligence to partition system resources
`according to a previously specified partitioning schema. The
`partitioning can be achieved through direct manipulation of routing
`tables associated with the system processors by the service processor
`which is made possible by the point-to-point communication
`infrastructure. The routing tables are used to control and isolate
`various system resources, the connections between which are defined
`therein.
`
`The processors 202a -d are also coupled to a cache coherence
`controller 230 through point-to-point links 232a-d. Any mechanism or
`apparatus that can be used to provide communication between multiple
`processor clusters while maintaining cache coherence is referred to
`herein as a cache coherence controller. The cache coherence
`controller 230 can be coupled to cache coherence controllers
`associated with other multiprocessor clusters. It should be noted
`that there can be more than one cache coherence controller in one
`cluster. The cache coherence controller 230 communicates with both
`processors 202a -d as well as remote clusters using a point-to-point
`protocol.
`
`More generally, it should be understood that the specific
`architecture shown in FIG. 2 is merely exemplary and that embodiments
`of the present invention are contemplated having different
`configurations and resource interconnections, and a variety of
`alternatives for each of the system resources shown. However, for
`purpose of illustration, specific details of server 200 will be
`assumed. For example, most of the resources shown in FIG. 2 are assumed
`to reside on a single electronic assembly. In addition, memory banks
`206a -206d may comprise double data rate (DDR) memory which is
`physically provided as dual in-line memory modules (DIMMs). I/O
`adapter 216 may be, for example, an ultra direct memory access (UDMA)
`controller or a small computer system interface (SCSI) controller
`which provides access to a permanent storage device. I/O adapter 220
`may be an Ethernet card adapted to provide communications with a
`network such as, for example, a local area network (LAN) or the
`Internet.
`
`According to a specific embodiment and as shown in FIG. 2, both of
`I/O adapters 216 and 220 provide symmetric I/O access. That is, each
`provides access to equivalent sets of I/O. As will be understood, such
`a configuration would facilitate a partitioning scheme in which
`multiple partitions have access to the same types of I/O. However,
`it should also be understood that embodiments are envisioned in which
`partitions without I/O are created. For example, a partition
`including one or more processors and associated memory resources,
`i.e., a memory complex, could be created for the purpose of testing
`the memory complex.
`
`According to one embodiment, service processor 212 is a Motorola
`MPC855T microprocessor which includes integrated chipset functions.
`The cache coherence controller 230 is an Application Specific
`Integrated Circuit (ASIC) supporting the local point-to-point
`coherence protocol. The cache coherence controller 230 can also be
`configured to handle a non-coherent protocol to allow communication
`with I/O devices. In one embodiment, the cache coherence controller
`230 is a specially configured programmable chip such as a programmable
`logic device or a field programmable gate array.
`
`FIG. 3 is a diagrammatic representation of one example of a cache
`coherence controller 230. According to various embodiments, the cache
`coherence controller includes a protocol engine 305 configured to
`handle packets such as probes and requests received from processors
`in various clusters of a multiprocessor system. The functionality of
`the protocol engine 305 can be partitioned across several engines to
`improve performance. In one example, partitioning is done based on
`packet type (request, probe and response), direction (incoming and
`outgoing), or transaction flow (request flows, probe flows, etc).
`
`The protocol engine 305 has access to a pending buffer 309 that allows
`the cache coherence controller to track transactions such as recent
`requests and probes and associate the transactions with specific
`processors. Transaction information maintained in the pending buffer
`309 can include transaction destination nodes, the addresses of
`requests for subsequent collision detection and protocol
`optimizations, response information, tags, and state information.
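
One way to picture the transaction information listed above is as a
small record per outstanding transaction. The sketch below is
illustrative only: the class names PendingEntry and PendingBuffer, the
field types, the "pending"/"complete" state strings, and the
address-equality collision check are assumptions, not details taken
from the specification.

```python
from dataclasses import dataclass, field


@dataclass
class PendingEntry:
    """Hypothetical record for one tracked transaction."""
    tag: int                    # identifies the transaction
    address: int                # request address, kept for collision detection
    destination_nodes: set      # nodes expected to respond
    responses: dict = field(default_factory=dict)  # node -> response info
    state: str = "pending"      # assumed state labels


class PendingBuffer:
    """Hypothetical pending buffer keyed by transaction tag."""

    def __init__(self):
        self.entries = {}

    def add(self, entry):
        self.entries[entry.tag] = entry

    def collides(self, address):
        """True if a tracked transaction already targets this address."""
        return any(e.address == address for e in self.entries.values())

    def record_response(self, tag, node, response):
        """Store a response; mark the entry complete once all nodes answered."""
        entry = self.entries[tag]
        entry.responses[node] = response
        if set(entry.responses) >= entry.destination_nodes:
            entry.state = "complete"
        return entry.state
```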
`
`The cache coherence controller has an interface such as a coherent
`protocol interface 307 that allows the cache coherence controller to
`communicate with other processors in the cluster as well as external
`processor clusters. The cache coherence controller can also include
`other interfaces such as a non-coherent protocol interface 311 for
`communicating with I/O devices. According to various embodiments,
`each interface 307 and 311 is implemented either as a full crossbar
`or as separate receive and transmit units using components such as
`multiplexers and buffers. The cache coherence controller can also
`include other interfaces such as a non-coherent protocol interface
`311 for communicating with I/O devices. It should be noted, however,
`that the cache coherence controller 230 does not necessarily need to
`provide both coherent and non-coherent interfaces. It should also be
`noted that a cache coherence controller in one cluster can communicate
`with a cache coherence controller in another cluster.
`
`FIG. 4 is a diagrammatic representation showing the transactions for
`a cache request from a processor in a system having a single cluster
`without using a cache coherence controller or other probe management
`mechanism. A processor 401-1 sends an access request such as a read
`memory line request to a memory controller 403-1. The memory
`controller 403-1 may be associated with this processor, another
`processor in the single cluster or may be a separate component such
`as an ASIC or specially configured Programmable Logic Device (PLD).
`To preserve cache coherence, only one processor is typically allowed
`to access a memory line corresponding to a shared address space at
`anyone given time. To prevent other processors from attempting to
`access the same memory line, the memory line can be locked by the
`memory controller 403-1. All other requests to the same memory line
`are blocked or queued. Access by another processor is typically only
`allowed when the memory controller 403-1 unlocks the memory line.
`
`The memory controller 403-1 then sends probes to the local cache
`memories 405, 407, and 409 to determine cache states. The local cache
`memories 405, 407, and 409 then in turn send probe responses to the
`same processor 401-2. The memory controller 403-1 also sends an access
`response such as a read response to the same processor 401-3. The
`processor 401-3 can then send a done response to the memory controller
`403-2 to allow the memory controller 403-2 to unlock the memory line
`for subsequent requests. It should be noted that CPU 401-1, CPU 401-2,
`and CPU 401-3 refer to the same processor.
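
The FIG. 4 transaction flow described above can be traced as a list of
(sender, receiver, message) tuples. This is a sketch under stated
assumptions: the function name single_cluster_read and the message
labels are invented for illustration, and the relative order of the
probe responses and the read response is chosen arbitrarily, since the
text does not fix it.

```python
def single_cluster_read(processor, controller, caches):
    """Return the message trace for one read in a single cluster.

    Hypothetical sketch of the flow: request, probes, probe responses,
    read response, done.
    """
    trace = [(processor, controller, "read request")]
    # The memory controller locks the line here, then probes each local cache.
    for cache in caches:
        trace.append((controller, cache, "probe"))
    # Each probed cache responds to the requesting processor.
    for cache in caches:
        trace.append((cache, processor, "probe response"))
    # The memory controller also returns the access response.
    trace.append((controller, processor, "read response"))
    # The done message lets the memory controller unlock the line.
    trace.append((processor, controller, "done"))
    return trace


trace = single_cluster_read("CPU 401", "MC 403", ["405", "407", "409"])
```

With three local caches, this trace contains nine messages: one
request, three probes, three probe responses, one read response, and
one done message. More generally, N probed caches yield 2N + 3
messages in this sketch, which is the traffic that probe filtering
aims to reduce.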
`
`FIGS. 5A -5D are diagrammatic representations depicting cache
`coherence controller operation. The use of a cache coherence
`controller in multiprocessor clusters allows the creation of a
`multiprocessor, multicluster coherent domain without affecting the
`functionality of local nodes such as processors and memory
`controllers in each cluster. In some instances, processors may only
`support a protocol that allows for a limited number of processors in
`a single cluster without allowing for multiple clusters. The cache
`coherence controller can be used to allow multiple clusters by making
`local processors believe that the non-local nodes are merely a
`single one or more local node nodes embodied in the cache coherence
`controller. In one example, the processors in a cluster do not need
`to be aware of processors in other clusters. Instead, the processors
`in the cluster c