`
`PATENT APPLICATION
`
`METHODS AND APPARATUS FOR SPECULATIVE
PROBING OF A REMOTE CLUSTER
`
`Inventor(s):
`
`David B. Glasco
`10337 Ember Glen Drive
`
`Austin, TX 78726
`Citizen of the U.S.
`
`Assignee:
`
`Newisys, Inc.
`A Delaware corporation
`
`BEYER WEAVER & THOMAS, LLP
`P.O. Box 778
`
`Berkeley, California 94704-0778
`(510) 843-6200
`
`Attorney Docket No. NWISP009
`
`PATENT
`
`METHODS AND APPARATUS FOR SPECULATIVE
`PROBING OF A REMOTE CLUSTER
`
`CROSS—REFERENCE TO RELATED APPLICATIONS
`
`10
`
`15
`
`20
`
`25
`
`30
`
`The present application is related to U.S. Application No. 10/106,426 titled
`
`Methods And Apparatus For Speculative Probing At A Request Cluster, U.S.
`
`Application No. 10/106,430 titled Methods And Apparatus For Speculative Probing
`With Early Completion And Delayed Request, and U.S. Application No. 10/106,299
`
`titled Methods And Apparatus For Speculative Probing With Early Completion And
`
`Early Request, the entireties of which are incorporated by reference herein for all
`
`purposes. The present application is also related to U.S. Application Nos. ___/
`
`and _/
`both titled Methods And Apparatus For Responding To A Request
`Cluster (Attorney Docket Nos. NWISPOO7 and NWISP008) by David B. Glasco filed
`
`on May 13, 2002,
`
`the entireties of which are incorporated by reference for all
`
`purposes. Furthermore, the present application is related to concurrently filed U.S.
`
`Application No.
`
`/
`
`also titled Methods And Apparatus For Speculative
`
`Probing Of A Remote Cluster (Attorney Docket Nos. NWISPOO6) by David B.
`
`Glasco, the entirety of which is incorporated by reference for all purposes.
`
`The present application is also related to concurrently filed U.S. Application
`
`Nos. _/
`
`, _/
`
`, and
`
`/
`
`titled Transaction Management In
`
`Systems Having Multiple Multi-Processor
`
`Clusters
`
`(Attorney Docket No.
`
`NWISPOOIZ), Routing Mechanisms In Systems Having Multiple Multi-Processor
`
`Clusters (Attorney Docket No. NWISPOO13), and Address Space Management In
`
`Systems Having Multiple Multi-Processor Clusters
`
`(Attorney Docket No.
`
`NWISPO014) respectively, all by David B. Glasco, Carl Zeitler, Rajesh Kota, Guru
`
`Prasadh, and Richard R. Oehler, the entireties of which are incorporated by reference
`
`for all purposes.
`
`
`
`
`
`BACKGROUND OF THE INVENTION
`
`1. Field of the Invention.
`
The present invention generally relates to accessing data in a multiple processor system. More specifically, the present invention provides techniques for improving data access efficiency while maintaining cache coherency in a multiple processor system having a multiple cluster architecture.
`
`2. Description of Related Art
`
Data access in multiple processor systems can raise issues relating to cache coherency. Conventional multiple processor computer systems have processors coupled to a system memory through a shared bus. In order to optimize access to data in the system memory, individual processors are typically designed to work with cache memory. In one example, each processor has a cache that is loaded with data that the processor frequently accesses. The cache is read or written by a processor. However, cache coherency problems arise because multiple copies of the same data can co-exist in systems having multiple processors and multiple cache memories. For example, a frequently accessed data block corresponding to a memory line may be loaded into the cache of two different processors. In one example, if both processors attempt to write new values into the data block at the same time, different data values may result. One value may be written into the first cache while a different value is written into the second cache. A system might then be unable to determine what value to write through to system memory.
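The two-writer scenario above can be sketched in a few lines. This toy model (all names are illustrative and not taken from the application) simply shows two unsynchronized cached copies of one memory line ending up with different values:

```python
# Hypothetical sketch of the coherency problem described above: two
# processor caches hold copies of the same memory line, both write,
# and the copies diverge, so there is no single correct value to
# write through to system memory.

def incoherent_writes(initial_value):
    """Simulate two unsynchronized cached copies of one memory line."""
    cache_a = {"line42": initial_value}
    cache_b = {"line42": initial_value}

    # Both processors write "at the same time" with no coherence protocol.
    cache_a["line42"] = "value-from-A"
    cache_b["line42"] = "value-from-B"

    # The cached copies now disagree.
    return cache_a["line42"], cache_b["line42"]
```

A coherence protocol exists precisely to prevent the final state this sketch produces.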
`
A variety of cache coherency mechanisms have been developed to address such problems in multiprocessor systems. One solution is to simply force all processor writes to go through to memory immediately and bypass the associated cache. The write requests can then be serialized before overwriting a system memory line. However, bypassing the cache significantly decreases the efficiency gained by using a cache.
`
Other cache coherency mechanisms have been developed for specific architectures. In a shared bus architecture, each processor checks or snoops on the bus to determine whether it can read or write a shared cache block. In one example, a processor only writes an object when it owns or has exclusive access to the object. Each corresponding cache object is then updated to allow processors access to the most recent version of the object.
`
Bus arbitration is used when both processors attempt to write a shared data block in the same clock cycle. Bus arbitration logic decides which processor gets the bus first. Although cache coherency mechanisms such as bus arbitration are effective, using a shared bus limits the number of processors that can be implemented in a single system with a single memory space.
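As an illustration only (the application does not specify an implementation), the exclusive-ownership rule described above can be modeled as follows; the class name and simplified three-state scheme are assumptions:

```python
# Illustrative-only model of the ownership rule: a processor may write
# a shared block only when it holds exclusive access, which it obtains
# by invalidating the other cached copies via bus snooping.

class SnoopyCache:
    def __init__(self, name):
        self.name = name
        self.state = "INVALID"   # simplified: INVALID / SHARED / EXCLUSIVE
        self.peers = []          # other caches snooping the shared bus

    def snoop_invalidate(self):
        # Another cache is claiming exclusive access; drop our copy.
        self.state = "INVALID"

    def write(self):
        if self.state != "EXCLUSIVE":
            # Broadcast on the shared bus so peers invalidate their copies.
            for peer in self.peers:
                peer.snoop_invalidate()
            self.state = "EXCLUSIVE"
        return self.state
```

After one cache writes, every peer's copy is invalid, so the next reader must fetch the most recent version.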
`
Other multiprocessor schemes involve individual processor, cache, and memory systems connected to other processor, cache, and memory systems using a network backbone such as Ethernet or Token Ring. Multiprocessor schemes involving separate computer systems, each with its own address space, can avoid many cache coherency problems because each processor has its own associated memory and cache. When one processor wishes to access data on a remote computing system, communication is explicit. Messages are sent to move data to another processor and messages are received to accept data from another processor using standard network protocols such as TCP/IP. Multiprocessor systems using explicit communication, including transactions such as sends and receives, are referred to as systems using multiple private memories. By contrast, multiprocessor systems using implicit communication, including transactions such as loads and stores, are referred to herein as using a single address space.
`
Multiprocessor schemes using separate computer systems allow more processors to be interconnected while minimizing cache coherency problems. However, it would take substantially more time to access data held by a remote processor using a network infrastructure than it would take to access data held by a processor coupled to a system bus. Furthermore, valuable network bandwidth would be consumed moving data to the proper processors. This can negatively impact both processor and network performance.
`
`
`
`
Performance limitations have led to the development of a point-to-point architecture for connecting processors in a system with a single memory space. In one example, individual processors can be directly connected to each other through a plurality of point-to-point links to form a cluster of processors. Separate clusters of processors can also be connected. The point-to-point links significantly increase the bandwidth for coprocessing and multiprocessing functions. However, using a point-to-point architecture to connect multiple processors in a multiple cluster system sharing a single memory space presents its own problems.

Consequently, it is desirable to provide techniques for improving data access and cache coherency in systems having multiple clusters of multiple processors connected using point-to-point links.
`
`
`
`
`
SUMMARY OF THE INVENTION
`
According to the present invention, methods and apparatus are provided for increasing the efficiency of data access in a multiple processor, multiple cluster system. Techniques are provided for speculatively probing a remote cluster from either a request cluster or a home cluster. A speculative probe associated with a particular memory line is transmitted to the remote cluster before the cache access request associated with the memory line is serialized at a home cluster. When a non-speculative probe is received at a remote cluster, the information associated with the response to the speculative probe is used to provide a response to the non-speculative probe.
`
According to various embodiments, a computer system is provided. The computer system includes a request cluster, a home cluster, and a remote cluster. The request cluster has a plurality of interconnected request cluster processors and a request cluster cache coherence controller. The home cluster has a plurality of interconnected home processors, a serialization point, and a home cache coherence controller. The remote cluster has a plurality of interconnected remote processors and a remote cache coherence controller. The remote cluster is configured to receive a first probe corresponding to a cache access request from a request cluster processor in the request cluster and a second probe corresponding to the cache access request from the home cluster.
`
According to other embodiments, a method for a cache coherence controller to manage data access in a multiprocessor system is provided. The method includes receiving a cache access request from a request cluster processor associated with a request cluster, forwarding the cache access request to a home cluster, and sending a probe associated with the cache request to a remote cluster. The home cluster includes a home cluster cache coherence controller and a serialization point.
`
According to still other embodiments, a computer system is provided. The computer system includes a first cluster and a second cluster. The first cluster includes a first plurality of processors and a first cache coherence controller. The first plurality of processors and the first cache coherence controller are interconnected in a point-to-point architecture. The second cluster includes a second plurality of processors and a second cache coherence controller. The second plurality of processors and the second cache coherence controller are interconnected in a point-to-point architecture, and the first cache coherence controller is coupled to the second cache coherence controller. The first cache coherence controller is configured to receive a cache access request originating from the first plurality of processors and send a probe to a third cluster including a third plurality of processors before the cache access request is received by a serialization point in the second cluster.
`
In still other embodiments, a computer system is provided. The computer system includes a first cluster and a second cluster. The first cluster includes a first plurality of processors and a first cache coherence controller. The first plurality of processors and the first cache coherence controller are interconnected in a point-to-point architecture. The second cluster includes a second plurality of processors and a second cache coherence controller. The second plurality of processors and the second cache coherence controller are interconnected in a point-to-point architecture. The first cache coherence controller is coupled to the second cache coherence controller and constructed to receive a cache access request originating from the first plurality of processors and send a probe to a third cluster including a third plurality of processors before a memory line associated with the cache access request is locked.
`
In yet another embodiment, a cache coherence controller is provided. The cache coherence controller includes interface circuitry coupled to a request cluster processor in a request cluster and a remote cluster cache coherence controller in a remote cluster, and a protocol engine coupled to the interface circuitry. The protocol engine is configured to receive a cache access request from the request cluster processor and speculatively probe a remote node in the remote cluster.
`
A further understanding of the nature and advantages of the present invention may be realized by reference to the remaining portions of the specification and the drawings.
`
`
`
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
The invention may best be understood by reference to the following description taken in conjunction with the accompanying drawings, which are illustrative of specific embodiments of the present invention.
`
Figures 1A and 1B are diagrammatic representations depicting a system having multiple clusters.

Figure 2 is a diagrammatic representation of a cluster having a plurality of processors.

Figure 3 is a diagrammatic representation of a cache coherence controller.

Figure 4 is a diagrammatic representation showing a transaction flow for a data access request from a processor in a single cluster.

Figures 5A-5D are diagrammatic representations showing cache coherence controller functionality.

Figure 6 is a diagrammatic representation depicting a transaction flow probing a remote cluster.

Figure 7 is a diagrammatic representation showing a transaction flow for a speculative probing from a home cluster.

Figure 8 is a flow process diagram showing speculative probing from a home cluster.

Figure 9 is a flow process diagram showing speculative probing from a home cluster at a remote cluster.

Figure 10 is a diagrammatic representation showing a transaction flow for a speculative probing from a request cluster.

Figure 11 is a flow process diagram showing speculative probing from a request cluster.

Figure 12 is a flow process diagram showing speculative probing from a request cluster at a remote cluster.
`
`
`
`'
`t....st
`rs.
`
`...ss..
`=1:
`
`......,
`'li
`
`.-...£ ,.
`....x. ..
`.x
`z;::; "aw *‘;s=;,
`-:S.:;;:;g;:;;'
`
`x:
`
`—
`/y.— .....
`u 1
`a
`-
`J I V:
`.
`li...
`.
`" “*-
`.....i¥ ii”...
`:<.,.;.
`
`..— x.
`ll .3§ iiirl‘
`
`DETAILED DESCRIPTION OF SPECIFIC EMBODIMENTS
`
Reference will now be made in detail to some specific embodiments of the invention, including the best modes contemplated by the inventors for carrying out the invention. Examples of these specific embodiments are illustrated in the accompanying drawings. While the invention is described in conjunction with these specific embodiments, it will be understood that it is not intended to limit the invention to the described embodiments. On the contrary, it is intended to cover alternatives, modifications, and equivalents as may be included within the spirit and scope of the invention as defined by the appended claims. Multi-processor architectures having point-to-point communication among their processors are suitable for implementing specific embodiments of the present invention. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. The present invention may be practiced without some or all of these specific details. Well known process operations have not been described in detail in order not to unnecessarily obscure the present invention. Furthermore, the present application's reference to a particular singular entity includes the possibility that the methods and apparatus of the present invention can be implemented using more than one entity, unless the context clearly dictates otherwise.
`
Techniques are provided for increasing data access efficiency in a multiple processor, multiple cluster system. In a point-to-point architecture, a cluster of processors includes multiple processors directly connected to each other through point-to-point links. By using point-to-point links instead of a conventional shared bus or external network, multiple processors are used efficiently in a system sharing the same memory space. Processing and network efficiency are also improved by avoiding many of the bandwidth and latency limitations of conventional bus and external network based multiprocessor architectures. According to various embodiments, however, linearly increasing the number of processors in a point-to-point architecture leads to an exponential increase in the number of links used to connect the multiple processors. In order to reduce the number of links used and to further modularize a multiprocessor system using a point-to-point architecture, multiple clusters are used.
`
`
`
`
`
According to various embodiments, the multiple processor clusters are interconnected using a point-to-point architecture. Each cluster of processors includes a cache coherence controller used to handle communications between clusters. In one embodiment, the point-to-point architecture used to connect processors is used to connect clusters as well.

By using a cache coherence controller, multiple cluster systems can be built using processors that may not necessarily support multiple clusters. Such a multiple cluster system can be built by using a cache coherence controller to represent non-local nodes in local transactions so that local nodes do not need to be aware of the existence of nodes outside of the local cluster. More detail on the cache coherence controller will be provided below.
`
In a single cluster system, cache coherency can be maintained by sending all data access requests through a serialization point. Any mechanism for ordering data access requests is referred to herein as a serialization point. One example of a serialization point is a memory controller. Various processors in the single cluster system send data access requests to the memory controller. In one example, the memory controller is configured to serialize or lock the data access requests so that only one data access request for a given memory line is allowed at any particular time. If another processor attempts to access the same memory line, the data access attempt is blocked until the memory line is unlocked. The memory controller allows cache coherency to be maintained in a multiple processor, single cluster system.
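The locking behavior of a serialization point described above can be sketched as follows. This is a minimal model under assumed names (the class, method names, and queuing policy are illustrative, not the patented apparatus): a memory line is locked for the first requester, later requests for the same line are queued, and unlocking hands the line to the next waiter.

```python
from collections import defaultdict, deque

# Minimal sketch of a serialization point: only one data access
# request per memory line is outstanding at a time; other requests
# for that line are blocked (queued) until the line is unlocked.

class MemoryController:
    def __init__(self):
        self.locked = set()                 # memory lines currently locked
        self.waiting = defaultdict(deque)   # queued requesters per line

    def request(self, line, requester):
        if line in self.locked:
            self.waiting[line].append(requester)  # blocked until unlock
            return "queued"
        self.locked.add(line)
        return "granted"

    def unlock(self, line):
        self.locked.discard(line)
        if self.waiting[line]:
            nxt = self.waiting[line].popleft()
            self.locked.add(line)   # next queued requester now holds the line
            return nxt
        return None
```

Because every request for a line passes through this single point, requests are totally ordered per line, which is what maintains coherency in the single cluster case.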
`
A serialization point can also be used in a multiple processor, multiple cluster system where the processors in the various clusters share a single address space. By using a single address space, internal point-to-point links can be used to significantly improve intercluster communication over traditional external network based multiple cluster systems. Various processors in various clusters send data access requests to a memory controller associated with a particular cluster, such as a home cluster. The memory controller can similarly serialize all data requests from the different clusters. However, a serialization point in a multiple processor, multiple cluster system may not be as efficient as a serialization point in a multiple processor, single cluster system.
`
`
`
`..5!
`
`” 1‘
`
`ll
`
`‘.13:
`
`9‘
`
`
That is, delay resulting from factors such as latency from transmitting between clusters can adversely affect the response times for various data access requests. It should be noted that delay also results from the use of probes in a multiple processor environment.

Although delay in intercluster transactions in an architecture using a shared memory space is significantly less than the delay in conventional message passing environments using external networks such as Ethernet or Token Ring, even minimal delay is a significant factor. In some applications, there may be millions of data access requests from a processor in a fraction of a second. Any delay can adversely impact processor performance.
`
According to various embodiments, speculative probing is used to increase the efficiency of accessing data in a multiple processor, multiple cluster system. A mechanism for eliciting a response from a node to maintain cache coherency in a system is referred to herein as a probe. In one example, a mechanism for snooping a cache is referred to as a probe. A response to a probe can be directed to the source or target of the initiating request. Any mechanism for sending probes to nodes associated with cache blocks before a request associated with the probes is received at a serialization point is referred to herein as speculative probing.
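The definition above can be illustrated with a toy event ordering. The event names and the two-path comparison are assumptions made for clarity, not part of the claimed method; the point is simply that speculative probing moves the remote probe before the serialization step so the two latencies overlap:

```python
# Illustrative event ordering: with speculative probing, the probe to
# the remote cluster is sent before the request is ordered at the
# serialization point; without it, the probe follows serialization.

def transaction_events(speculative):
    events = []
    if speculative:
        events.append("probe remote cluster")      # sent immediately
    events.append("serialize request at home cluster")
    if not speculative:
        events.append("probe remote cluster")      # sent only after ordering
    events.append("collect probe responses")
    return events
```

In the speculative ordering, probe latency is hidden behind serialization latency instead of being added to it.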
`
`25
`
`30
`
According to various embodiments, the reordering or elimination of certain data access requests does not adversely affect cache coherency. That is, the end value in the cache is the same whether or not snooping occurs. For example, a local processor attempting to read the cache data block can be allowed to access the data block without sending the requests through a serialization point in certain circumstances. In one example, read access can be permitted when the cache block is valid and the associated memory line is not locked. Techniques for performing speculative probing generally are described in U.S. Application No. 10/106,426 titled Methods And Apparatus For Speculative Probing At A Request Cluster, U.S. Application No. 10/106,430 titled Methods And Apparatus For Speculative Probing With Early Completion And Delayed Request, and U.S. Application No. 10/106,299 titled Methods And Apparatus For Speculative Probing With Early Completion And Early Request, the entireties of which are incorporated by reference herein for all purposes. By completing a data access transaction within a local cluster, the delay associated with transactions in a multiple cluster system can be reduced or eliminated.
`
The techniques of the present invention recognize that other efficiencies can be achieved, particularly when speculative probing cannot be completed at a local cluster. In one example, a cache access request is forwarded from a local cluster to a home cluster. A home cluster then proceeds to send probes to remote clusters in the system. In typical implementations, the home cluster gathers the probe responses corresponding to the probe before sending an aggregated response to the request cluster. The aggregated response typically includes the results of the home cluster probes and the results of the remote cluster probes. The techniques of the present invention provide for more efficiently probing a remote cluster. In typical implementations, a remote cluster is probed after a cache access request is ordered at a home cluster serialization point. The remote cluster then waits for the results of the probe and sends the results back to the request cluster. In some examples, the results are sent directly to the request cluster or to the request cluster through the home cluster. According to various embodiments, a speculative probe is sent to the remote cluster first to begin the probing of the remote nodes. When the probe transmitted after the request is serialized arrives at the remote cluster, the results of the speculative probe can be used to provide a faster response to the request cluster.
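One way to picture the remote cluster behavior described above: the remote cache coherence controller records the speculative probe's result and reuses it when the post-serialization probe arrives. The class, method names, and cost constants below are illustrative assumptions introduced only to make the latency saving concrete:

```python
# Illustrative model: the remote cluster records the response to a
# speculative probe, then answers the later non-speculative probe
# from that record instead of re-probing its local nodes.

class RemoteClusterController:
    PROBE_COST = 10     # assumed cost of probing local nodes
    LOOKUP_COST = 1     # assumed cost of reusing a recorded response

    def __init__(self):
        self.speculative_results = {}

    def speculative_probe(self, line):
        # Probe local nodes early and remember the outcome.
        self.speculative_results[line] = f"state-of-{line}"
        return self.PROBE_COST

    def probe(self, line):
        # Non-speculative probe: reuse the recorded result if present.
        if line in self.speculative_results:
            return self.speculative_results[line], self.LOOKUP_COST
        return f"state-of-{line}", self.PROBE_COST
```

The speculative probe's cost is paid while the request is still being serialized at the home cluster, so the non-speculative probe completes almost immediately.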
`
Figure 1A is a diagrammatic representation of one example of a multiple cluster, multiple processor system that can use the techniques of the present invention. Each processing cluster 101, 103, 105, and 107 can include a plurality of processors. The processing clusters 101, 103, 105, and 107 are connected to each other through point-to-point links 111a-f. In one embodiment, the multiple processors in the multiple cluster architecture shown in Figure 1A share the same memory space. In this example, the point-to-point links 111a-f are internal system connections that are used in place of a traditional front-side bus to connect the multiple processors in the multiple clusters 101, 103, 105, and 107. The point-to-point links may support any point-to-point coherence protocol.
`
`
`
`
`
Figure 1B is a diagrammatic representation of another example of a multiple cluster, multiple processor system that can use the techniques of the present invention. Each processing cluster 121, 123, 125, and 127 can be coupled to a switch 131 through point-to-point links 141a-d. It should be noted that using a switch and point-to-point links allows implementation with fewer point-to-point links when connecting multiple clusters in the system. A switch 131 can include a processor with a coherence protocol interface. According to various implementations, a multicluster system shown in Figure 1A is expanded using a switch 131 as shown in Figure 1B.
`
Figure 2 is a diagrammatic representation of a multiple processor cluster, such as the cluster 101 shown in Figure 1A. Cluster 200 includes processors 202a-202d, one or more Basic I/O systems (BIOS) 204, a memory subsystem comprising memory banks 206a-206d, point-to-point communication links 208a-208e, and a service processor 212. The point-to-point communication links are configured to allow interconnections between processors 202a-202d, I/O switch 210, and cache coherence controller 230. The service processor 212 is configured to allow communications with processors 202a-202d, I/O switch 210, and cache coherence controller 230 via a JTAG interface represented in Fig. 2 by links 214a-214f. It should be noted that other interfaces are supported. I/O switch 210 connects the rest of the system to I/O adapters 216 and 220.
`
According to specific embodiments, the service processor of the present invention has the intelligence to partition system resources according to a previously specified partitioning schema. The partitioning can be achieved through direct manipulation of routing tables associated with the system processors by the service processor, which is made possible by the point-to-point communication infrastructure. The routing tables are used to control and isolate various system resources, the connections between which are defined therein. The service processor and computer system partitioning are described in Patent Application No. 09/932,456 titled Computer System Partitioning Using Data Transfer Routing Mechanism, filed on August 16, 2001 (Attorney Docket No. NWISP001), the entirety of which is incorporated by reference for all purposes.
`
`
`
`
`
The processors 202a-d are also coupled to a cache coherence controller 230 through point-to-point links 232a-d. Any mechanism or apparatus that can be used to provide communication between multiple processor clusters while maintaining cache coherence is referred to herein as a cache coherence controller. The cache coherence controller 230 can be coupled to cache coherence controllers associated with other multiprocessor clusters. It should be noted that there can be more than one cache coherence controller in one cluster. The cache coherence controller 230 communicates with both processors 202a-d as well as remote clusters using a point-to-point protocol.
`
More generally, it should be understood that the specific architecture shown in Figure 2 is merely exemplary and that embodiments of the present invention are contemplated having different configurations and resource interconnections, and a variety of alternatives for each of the system resources shown. However, for purposes of illustration, specific details of server 200 will be assumed. For example, most of the resources shown in Fig. 2 are assumed to reside on a single electronic assembly. In addition, memory banks 206a-206d may comprise double data rate (DDR) memory which is physically provided as dual in-line memory modules (DIMMs). I/O adapter 216 may be, for example, an ultra direct memory access (UDMA) controller or a small computer system interface (SCSI) controller which provides access to a permanent storage device. I/O adapter 220 may be an Ethernet card adapted to provide communications with a network such as, for example, a local area network (LAN) or the Internet.
`
According to a specific embodiment and as shown in Fig. 2, both of I/O adapters 216 and 220 provide symmetric I/O access. That is, each provides access to equivalent sets of I/O. As will be understood, such a configuration would facilitate a partitioning scheme in which multiple partitions have access to the same types of I/O. However, it should also be understood that embodiments are envisioned in which partitions without I/O are created. For example, a partition including one or more processors and associated memory resources, i.e., a memory complex, could be created for the purpose of testing the memory complex.
`
`
`
`"
`5.’! ‘HF “"
`1”“ '51.“
`..s ?:..!l ‘L J: é:r.‘.. 1l:§~ Lil 2”“
`
According to one embodiment, service processor 212 is a Motorola MPC855T microprocessor which includes integrated chipset functions. The cache coherence controller 230 is an Application Specific Integrated Circuit (ASIC) supporting the local point-to-point coherence protocol. The cache coherence controller 230 can also be configured to handle a non-coherent protocol to allow communication with I/O devices. In one embodiment, the cache coherence controller 230 is a specially configured programmable chip such as a programmable logic device or a field programmable gate array. In another embodiment, the cache coherence controller is a general purpose processor with an interface to the point-to-point links 232.
`
Figure 3 is a diagrammatic representation of one example of a cache coherence controller 230. According to various embodiments, the cache coherence controller includes a protocol engine 305 configured to handle packets such as probes and requests received from processors in various clusters of a multiprocessor system. The functionality of the protocol engine 305 can be partitioned across several engines to improve performance. In one example, partitioning is done based on packet type (request, probe, and response), direction (incoming and outgoing), or transaction flow (request flows, probe flows, etc.).
`
The protocol engine 305 has access to a pending buffer 309 that allows the cache coherence controller to track transactions such as recent requests and probes and associate the transactions with specific processors. Transaction information maintained in the pending buffer 309 can include transaction destination nodes, the addresses of requests for subsequent collision detection and protocol optimizations, response information, tags, and state information.
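The pending buffer bookkeeping described above might be sketched as follows; the field and method names are assumptions, chosen only to mirror the kinds of information the text says an entry can hold (destination nodes, request addresses for collision detection, response information, tags, and state):

```python
from dataclasses import dataclass, field

# Hypothetical sketch of pending-buffer bookkeeping: each outstanding
# transaction is tracked by tag, and addresses are checked for
# collisions against outstanding entries.

@dataclass
class PendingEntry:
    tag: int
    address: int
    destination_nodes: list = field(default_factory=list)
    responses: list = field(default_factory=list)
    state: str = "outstanding"

class PendingBuffer:
    def __init__(self):
        self.entries = {}   # tag -> PendingEntry

    def track(self, tag, address, destinations):
        self.entries[tag] = PendingEntry(tag, address, list(destinations))

    def collides(self, address):
        # Collision detection: is another transaction still outstanding
        # for the same request address?
        return any(e.address == address and e.state == "outstanding"
                   for e in self.entries.values())
```

A lookup like `collides` is what lets the controller detect that a new request targets a memory line already involved in an in-flight transaction.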
`
The cache coherence controller has an interface such as a coherent protocol interface 307 that allows the cache coherence controller to communicate with other processors in the cluster as well as external processor clusters. According to various embodiments, each interface 307 and 311 is implemented either as a full crossbar or as separate receive and transmit units using components such as multiplexers and buffers. The cache coherence controller can also include other interfaces such as a non-coherent protocol interface 311 for communicating with I/O devices. It should be noted,
`
`
`
`
however, that the cache coherence controller 230 does not necessarily need to provide both coherent and non-coherent interfaces. It should also be noted that a cache coherence controller in one cluster can communicate with a cache coherence controller in another cluster.
`
Figure 4 is a diagrammatic representation showing the transactions for a cache request from a processor in a system having a single cluster without using a cache coherence controller. A processor 401-1 sends an access request such as a read memory line request to a memory controller 403-1. The memory controller 403-1 may be associated with this processor, another processor in the single cluster, or may be a separate component such as an ASIC or specially configured Programmable Logic Device (PLD). To preserve cache coherence, only one processor is typically allowed to access a memory line corresponding to a shared address space at any one given time. To prevent other processors from attempting to access the same memory line, the memory line can be locked by the memory controller 403-1. All other requests to the same memory line are blocked or queued. Access by another processor is typically only allowed when the memory controller 403-1 unlocks the memory line.
`
The memory controller 403-1 then sends probes to the local cache memories 405, 407, and 409 to determine cache states. The local cache memories 405, 407,