`
`11\\1\111\11\\1\111\\\\1111\\111\1\11\\1
`101504
`
`'
`
`IN THE UNITED STATES PATENT AND TRADEMARK OFFICE
`
`CERTIFICATE OF EXPRESS MAILING
I hereby certify that this paper and the documents and/or fees referred to as attached therein are being deposited with the United States Postal Service on October 15, 2004 in an envelope as "Express Mail Post Office to Addressee" service under 37 CFR § 1.10, Mailing Label Number EV332820115US, addressed to the Commissioner for Patents, P.O. Box 1450, Alexandria, VA 22313-1450.
`
`Attorney Docket No. : NWISP052
`
`First Named Inventor: Morton
`
UTILITY PATENT APPLICATION TRANSMITTAL (37 CFR § 1.53(b))
(Continuation, Divisional or Continuation-in-part application)
`
Mail Stop Patent Application
`Commissioner for Patents
`P.O. Box 1450
`Alexandria, VA 22313-1450
`
`Sir:
`
This is a request for filing a patent application under 37 CFR § 1.53(b) in the name of inventors:
`Eric Morton, Rajesh Kota, Adnan Khaleel and David B. Glasco
`
`For: REDUCING PROBE TRAFFIC IN MULTIPROCESSOR SYSTEMS
`
Assigned to: Newisys, Inc.
`
This application is a:  [ ] Continuation   [ ] Divisional   [X] Continuation-in-part
of prior Application No. 10/288,347, from which priority under 35 U.S.C. § 120 is claimed.
`
`The specification has been amended to claim priority from the parent application, or such
`amendment is included in a separate sheet.
`
`Application Elements:
`
[X] 54 Pages of Specification, Claims and Abstract
[X] 25 Sheets of formal Drawings
[ ] Declaration   [ ] Newly executed
    [ ] Copy from a prior application (37 CFR 1.63(d) for a continuation or divisional).
`
`The entire disclosure of the prior application from which a copy of the declaration is
`herein supplied is considered as being part of the disclosure of the accompanying
`application and is hereby incorporated by reference therein.
`
[ ] Deletion of inventors. Signed statement attached deleting inventor(s) named in the prior application; see 37 CFR 1.63(d)(2) and 1.33(b).
`
`(Revised 04/03, Pat App Trans 53(b) ContDivCIP)
`
Page 1
`
`Petition for Inter Partes Review of
`U.S. Pat. No. 7,296,121
`IPR2015‐00158
`EXHIBIT
`Sony‐
`
`1
`
`
`
`Accompanying Application Parts:
`
`" 0 Do not publish this application. Nonpublication Request is attached.
`0 Assignment and Assignment Recordation Cover Sheet (recording fee of $40.00 enclosed)
`0 Power of Attorney
`0 37 CFR 3.73(b) Statement by Assignee
`0 Information Disclosure Statement with Form PT0-1449 D Copies of IDS Citations
`0 Preliminary Amendment (New claims numbered after highest original claim in prior
`C2J Return Receipt Postcard
`0 Other:
`
`application.)
`
`Claim For Foreign Priority
`
`0
`
`Application No.
`Priority of
`is claimed under 35 U.S.C. § 119.
`
`filed on
`
`D The certified copy has been filed in prior application U.S. Application No.
`D The certified copy will follow.
`
`Extension of Time for Prior Pending Application
`
[ ] A Petition for Extension of Time is being concurrently filed in the prior pending application. A copy of the Petition for Extension of Time is attached.
`
`Fee Calculation (37 CFR § 1.16)
`
[X] PLEASE DEFER THE PAYMENT OF THE FILING FEES AT THIS TIME.
`
`General Authorization for Petition for Extension of Time (37 CFR § 1.136)
[X] Applicants hereby make and generally authorize any Petitions for Extensions of Time as may be needed for any subsequent filings. The Commissioner is also authorized to charge any extension fees under 37 CFR § 1.17 as may be needed to Deposit Account No. 500388 (Order No. NWISP052).
`
[X] Please send correspondence to the following address:
`Customer Number 022434
`
`Date: October 15, 2004
`
`(Revised 04/03, Pat App Trans 53(b) ContDivCIP)
`
`Page 2
`
`2
`
`
`
`
`
`
`Attorney Docket No. NWISP052
`
`PATENT APPLICATION
`
`REDUCING PROBE TRAFFIC IN MULTIPROCESSOR SYSTEMS
`
`Inventors:
`
`Eric Morton of
`Austin, Texas
`United States citizen
`
`Rajesh Kota of
`Austin, Texas
`Citizen of India
`
`Adnan Khaleel of
`Austin, Texas
`Citizen of India
`
`David B. Glasco of
`Austin, Texas
`United States citizen
`
`Assignee:
`
`Newisys, Inc.
`A Delaware corporation
`
`BEYER WEAVER & THOMAS, LLP
`P.O. Box 778.
`Berkeley, California 94704-0778
`(510) 843-6200
`
`7
`
`
`
`PATENT
`Attorney Docket No. NWISP052
`
`REDUCING PROBE TRAFFIC IN MULTIPROCESSOR SYSTEMS
`
`5
`
`CROSS-REFERENCE TO RELATED APPLICATIONS
`
`The present application is a continuation-in-part of and claims priority under 35
`
`10
`
`U.S.C. 120 to U.S. Patent Application No. 10/288,347 for METHODS AND
`
`APPARATUS FOR MANAGING PROBE REQUESTS filed on November 4, 2002
`
`(Attorney Docket No. NWISP024), the entire disclosure of which is incorporated herein
`
`by reference for all purposes. The subject matter described in the present application is
`
`also related to U.S. Patent Application No. 10/288,399 for METHODS AND
`
`15.
`
`APPARATUS FOR MANAGING PROBE REQUESTS filed on November 4, 2002
`
`(Attorney Docket No. NWISP025), the entire disclosure of which· is incorporated herein
`
`by reference for all purposes.
`
`20
`
`BACKGROUND OF THE INVENTION
`
`The present invention generally relates to accessing data in a multiple processor
`
`system. More specifically, the present invention provides techniques for reducing
`
`memory transaction traffic in a multiple processor system.
`
`25
`
`Data access in multiple processor systems can raise issues relating to cache
`
`coherency. Conventional multiple processor computer systems have processors
`
`coupled to a system memory through a shared bus. In order to optimize access to data
`
`in the system memory, individual processors are typically designed to work with cache
`
`memory. In one example, each processor has a cache that is loaded with data that the
`
`30
`
`processor frequently accesses. The cache is read or written by a processor. However,
`
`cache coherency problems arise because multiple copies of the same data can co-exist
`
`in systems having multiple processors and multiple cache memories. For example, a
`
`frequently accessed data block corresponding to a memory line may be loaded into the
`
`cache of two different processors. In one example, if both processors attempt to write
`
`35
`
`new values into the data block at the same time, different data values may result. One
`
`value may be written into the first cache while a different value is written into the
`
`8
`
`
`
`second cache. A system might then be unable to determine what value to write through
`
`to system memory.
`
`•
`
A variety of cache coherency mechanisms have been developed to address such problems in multiprocessor systems. One solution is simply to force all processor writes to go through to memory immediately and bypass the associated cache. The write requests can then be serialized before overwriting a system memory line. However, bypassing the cache significantly decreases the efficiency gained by using a cache. Other cache coherency mechanisms have been developed for specific architectures. In a shared bus architecture, each processor checks, or snoops, on the bus to determine whether it can read or write a shared cache block. In one example, a processor only writes an object when it owns or has exclusive access to the object. Each corresponding cache object is then updated to allow processors access to the most recent version of the object.

Bus arbitration is used when both processors attempt to write the same shared data block in the same clock cycle. Bus arbitration logic decides which processor gets the bus first. Although cache coherency mechanisms such as bus arbitration are effective, using a shared bus limits the number of processors that can be implemented in a single system with a single memory space.
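The snooping rule described above, in which a processor writes a shared block only after gaining exclusive ownership and other caches drop their copies, can be sketched as follows. This is a minimal illustration with invented names, not an implementation from the application:

```python
# Sketch of bus snooping: a cache gains exclusive ownership of a block before
# writing it, which invalidates any copy held by the other caches on the bus.

class Bus:
    def __init__(self):
        self.caches = []

    def attach(self, cache):
        self.caches.append(cache)

    def invalidate_others(self, addr, requester):
        # Broadcast an invalidation that every other cache snoops.
        for c in self.caches:
            if c is not requester:
                c.snoop_invalidate(addr)

class SnoopyCache:
    def __init__(self, bus):
        self.state = {}          # block address -> "shared" or "exclusive"
        self.bus = bus
        bus.attach(self)

    def read(self, addr):
        # A read may leave the block shared with other caches.
        self.state.setdefault(addr, "shared")

    def write(self, addr):
        # Snoop protocol: obtain exclusive ownership before writing.
        if self.state.get(addr) != "exclusive":
            self.bus.invalidate_others(addr, requester=self)
            self.state[addr] = "exclusive"

    def snoop_invalidate(self, addr):
        # Another cache is writing this block; drop our copy.
        self.state.pop(addr, None)
```

After two caches read the same block, a write by one leaves it exclusive there and removes the block from the other, which is the behavior the shared-bus mechanism relies on.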
`
Other multiprocessor schemes involve individual processor, cache, and memory systems connected to other processors, cache, and memory systems using a network backbone such as Ethernet or Token Ring. Multiprocessor schemes involving separate computer systems, each with its own address space, can avoid many cache coherency problems because each processor has its own associated memory and cache. When one processor wishes to access data on a remote computing system, communication is explicit. Messages are sent to move data to another processor, and messages are received to accept data from another processor, using standard network protocols such as TCP/IP. Multiprocessor systems using explicit communication, including transactions such as sends and receives, are referred to herein as systems using multiple private memories. By contrast, multiprocessor systems using implicit communication, including transactions such as loads and stores, are referred to herein as using a single address space.
`
Multiprocessor schemes using separate computer systems allow more processors to be interconnected while minimizing cache coherency problems. However, it would take substantially more time to access data held by a remote processor using a network infrastructure than it would take to access data held by a processor coupled to a system bus. Furthermore, valuable network bandwidth would be consumed moving data to the proper processors. This can negatively impact both processor and network performance.
`
Performance limitations have led to the development of a point-to-point architecture for connecting processors in a system with a single memory space. In one example, individual processors can be directly connected to each other through a plurality of point-to-point links to form a cluster of processors. Separate clusters of processors can also be connected. The point-to-point links significantly increase the bandwidth for coprocessing and multiprocessing functions. However, using a point-to-point architecture to connect multiple processors in a multiple cluster system sharing a single memory space presents its own problems.

Consequently, it is desirable to provide techniques for improving data access and cache coherency in systems having multiple processors connected using point-to-point links.
`
`3
`
`10
`
`
`
SUMMARY OF THE INVENTION

According to the present invention, various techniques are provided for reducing traffic relating to memory transactions in multi-processor systems. According to various specific embodiments, a computer system having a plurality of processing nodes interconnected by a first point-to-point architecture is provided. Each processing node has a cache memory associated therewith. A probe filtering unit is operable to receive probes corresponding to memory lines from the processing nodes and to transmit the probes only to selected ones of the processing nodes with reference to probe filtering information. The probe filtering information is representative of states associated with selected ones of the cache memories.

According to other embodiments, methods and apparatus are provided for reducing probe traffic in a computer system comprising a plurality of processing nodes interconnected by a first point-to-point architecture. A probe corresponding to a memory line is transmitted from a first one of the processing nodes only to a probe filtering unit. The probe is evaluated with the probe filtering unit to determine whether a valid copy of the memory line is in any of the cache memories. The evaluation is done with reference to probe filtering information associated with the probe filtering unit and representative of states associated with selected ones of the cache memories. The probe is transmitted from the probe filtering unit only to selected ones of the processing nodes identified by the evaluating. Probe responses from the selected processing nodes are accumulated by the probe filtering unit. Only the probe filtering unit responds to the first processing node.

A further understanding of the nature and advantages of the present invention may be realized by reference to the remaining portions of the specification and the drawings.
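The probe-filtering flow summarized above can be sketched in a few lines. The names and data structures below are invented for illustration and are not taken from the application; the point is only the shape of the flow, in which the unit consults per-line sharer information, forwards the probe to just those nodes, and accumulates their responses:

```python
# Sketch of a probe filtering unit: a probe for a memory line is forwarded only
# to nodes that may hold a valid copy, and their responses are accumulated into
# the single reply returned to the requester.

class Node:
    def __init__(self, node_id):
        self.node_id = node_id
        self.cache = {}                    # memory line -> data

    def probe(self, line):
        return (self.node_id, self.cache.get(line))

class ProbeFilteringUnit:
    def __init__(self, nodes):
        self.nodes = nodes                 # node id -> Node
        self.sharers = {}                  # memory line -> set of node ids

    def handle_probe(self, line, requester):
        # Forward only to nodes the filtering information marks as possible
        # holders of the line, never back to the requester itself.
        targets = self.sharers.get(line, set()) - {requester}
        # Accumulate the per-node responses; only this accumulated result
        # goes back to the requesting node.
        return [self.nodes[n].probe(line) for n in sorted(targets)]
```

A node whose id is absent from the sharer set for a line is never probed, which is how the broadcast traffic of a directory-less system is avoided.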
`
`4
`
`11
`
`
`
BRIEF DESCRIPTION OF THE DRAWINGS

The invention may best be understood by reference to the following description taken in conjunction with the accompanying drawings, which are illustrative of specific embodiments of the present invention.

Figures 1A and 1B are diagrammatic representations depicting a system having multiple clusters.

Figure 2 is a diagrammatic representation of a cluster having a plurality of processors.

Figure 3 is a diagrammatic representation of a cache coherence controller.

Figure 4 is a diagrammatic representation showing a transaction flow for a data access request from a processor in a single cluster.

Figures 5A-5D are diagrammatic representations showing cache coherence controller functionality.

Figure 6 is a diagrammatic representation depicting a transaction flow for a request with multiple probe responses.

Figure 7 is a diagrammatic representation showing a cache coherence directory.

Figure 8 is a diagrammatic representation showing probe filter information that can be used to reduce the number of probes transmitted to various clusters.

Figure 9 is a diagrammatic representation showing a transaction flow for probing of a home cluster without probing of other clusters.

Figure 10 is a diagrammatic representation showing a transaction flow for probing of a single remote cluster.

Figure 11 is a flow process diagram showing the handling of a request with probe filter information.

Figure 12 is a diagrammatic representation showing memory controller filter information.

Figure 13 is a diagrammatic representation showing a transaction flow for probing a single remote cluster without probing a home cluster.

Figure 14 is a flow process diagram showing the handling of a request at a home cluster cache coherence controller using memory controller filter information.

Figure 15 is a diagrammatic representation showing a transaction flow for a cache coherence directory eviction of an entry corresponding to a dirty memory line.

Figure 16 is a diagrammatic representation showing a transaction flow for a cache coherence directory eviction of an entry corresponding to a clean memory line.

Figure 17 is a diagrammatic representation of a cache coherence controller according to a specific embodiment of the invention.

Figure 18 is a diagrammatic representation of a cluster having a plurality of processing nodes and a probe filtering unit.

Figure 19 is an exemplary representation of a processing node.

Figure 20 is a flowchart illustrating local probe filtering according to a specific embodiment of the invention.

Figure 21 is a diagrammatic representation of a transaction flow in which local probe filtering is facilitated according to a specific embodiment of the invention.

Figure 22 is a diagrammatic representation of another transaction flow in which local probe filtering is facilitated according to a specific embodiment of the invention.
`
`6
`
`13
`
`
`
DETAILED DESCRIPTION OF SPECIFIC EMBODIMENTS

Reference will now be made in detail to some specific embodiments of the invention including the best modes contemplated by the inventors for carrying out the invention. Examples of these specific embodiments are illustrated in the accompanying drawings. While the invention is described in conjunction with these specific embodiments, it will be understood that it is not intended to limit the invention to the described embodiments. On the contrary, it is intended to cover alternatives, modifications, and equivalents as may be included within the spirit and scope of the invention as defined by the appended claims. Multi-processor architectures having point-to-point communication among their processors are suitable for implementing specific embodiments of the present invention. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. The present invention may be practiced without some or all of these specific details. Well-known process operations have not been described in detail in order not to unnecessarily obscure the present invention. Furthermore, the present application's reference to a particular singular entity includes the possibility that the methods and apparatus of the present invention can be implemented using more than one entity, unless the context clearly dictates otherwise.
`
`20
`
`According to various embodiments, techniques are provided for increasing data
`
`access efficiency in a multiple processor system.
`
`In a point-to-point architecture, a
`
`cluster of processors includes multiple. processors directly connected to each other
`
`through point-to-point links. By using point-to-point links instead of a conventional
`
`25
`
`shared bus or external network, multiple processors are used efficiently in a system
`
`sharing the same memory space. Processing and network efficiency are also improved
`
`by avoiding many of the bandwidth and latency limitations of conventional bus and
`
`external network based multiprocessor architectures.
`
`According
`
`to various
`
`embodiments, however, linearly increasing the number of processors in a point-to-point
`
`30
`
`architecture leads to an exponential increase in the number of links used to connect the
`
`multiple processors.
`
`In order to reduce the number of links used and to further
`
`modularize a multiprocessor system using a point-to-point architecture, multiple
`
`clusters may be used.
`
`7
`
`14
`
`
`
According to some embodiments, multiple processor clusters are interconnected using a point-to-point architecture. Each cluster of processors includes a cache coherence controller used to handle communications between clusters. In one embodiment, the point-to-point architecture used to connect processors is used to connect clusters as well.

By using a cache coherence controller, multiple cluster systems can be built using processors that may not necessarily support multiple clusters. Such a multiple cluster system can be built by using a cache coherence controller to represent non-local nodes in local transactions so that local nodes do not need to be aware of the existence of nodes outside of the local cluster. More detail on the cache coherence controller will be provided below.
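The idea of the controller standing in for non-local nodes can be illustrated with a short sketch. All names here are invented for illustration; the point is that the local cluster sees the controller as one more node with a probe interface, while internally it forwards to remote clusters the local nodes never see:

```python
# Sketch: a cache coherence controller presents a single local probe interface
# while internally forwarding the probe to every remote cluster and merging
# their responses, so local nodes stay unaware of remote nodes.

class RemoteCluster:
    def __init__(self, cached):
        self.cached = cached                 # memory line -> list of responses

    def probe(self, line):
        return self.cached.get(line, [])

class CacheCoherenceController:
    def __init__(self, remote_clusters):
        self.remote_clusters = remote_clusters   # cluster id -> RemoteCluster

    def probe(self, line):
        # To the local cluster this looks like one node's probe response,
        # but it aggregates responses gathered from all remote clusters.
        responses = []
        for cluster in self.remote_clusters.values():
            responses.extend(cluster.probe(line))
        return responses
```

Because the controller exposes the same probe interface as an ordinary node, processors that have no notion of multiple clusters can participate in a multi-cluster system unchanged, which is the property the paragraph above relies on.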
`
`15
`
`In a single cluster system, cache coherency can be maintained by sending all
`
`data access requests through a serialization point. Any mechanism for ordering data
`
`access requests (also referred to herein as requests and memory requests) is referred to
`
`herein as a serialization point. One example of a serialization point is a memory
`
`controller. Various processors in the single cluster system send data access requests to
`
`20
`
`one or more memory controllers.
`
`In one example, each memory controller is
`
`configured to serialize or lock the data access requests so that only one data access
`
`request for a given memory line is allowed at any particular time. If another processor
`
`attempts to access the same memory line, the data access attempt is blocked until the
`
`memory line is unlocked. The memory controller allows cache coherency to be
`
`25
`
`maintained in a multiple processor, single cluster system.
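The per-line locking behavior of such a serialization point can be sketched as follows. This is a minimal illustration under assumed names, not the controller design from the application:

```python
# Sketch of a serialization point: a memory line is locked while one request is
# in flight; any other request for the same line is blocked (queued here) until
# the line is unlocked, at which point the next waiter is granted.

from collections import deque

class MemoryController:
    def __init__(self):
        self.locked = set()        # memory lines with a request in flight
        self.waiting = {}          # memory line -> queue of blocked requesters

    def request(self, line, requester):
        if line in self.locked:
            # Another access to this line is in progress; block this one.
            self.waiting.setdefault(line, deque()).append(requester)
            return "blocked"
        self.locked.add(line)
        return "granted"

    def complete(self, line):
        # Unlock the line and grant it to the next blocked requester, if any.
        self.locked.discard(line)
        queue = self.waiting.get(line)
        if queue:
            nxt = queue.popleft()
            self.locked.add(line)
            return nxt
        return None
```

Requests to different memory lines are granted independently; only concurrent requests to the same line serialize, which is exactly the ordering property the paragraph above attributes to the memory controller.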
`
A serialization point can also be used in a multiple processor, multiple cluster system where the processors in the various clusters share a single address space. By using a single address space, internal point-to-point links can be used to significantly improve intercluster communication over traditional external network based multiple cluster systems. Various processors in various clusters send data access requests to a memory controller associated with a particular cluster, such as a home cluster. The memory controller can similarly serialize all data requests from the different clusters. However, a serialization point in a multiple processor, multiple cluster system may not be as efficient as a serialization point in a multiple processor, single cluster system. That is, delay resulting from factors such as latency from transmitting between clusters can adversely affect the response times for various data access requests. It should be noted that delay also results from the use of probes in a multiple processor environment.
`
Although delay in intercluster transactions in an architecture using a shared memory space is significantly less than the delay in conventional message passing environments using external networks such as Ethernet or Token Ring, even minimal delay is a significant factor. In some applications, there may be millions of data access requests from a processor in a fraction of a second. Any delay can adversely impact processor performance.
`
`15
`
`According to various embodiments, probe management is used to increase the
`
`efficiency of accessing data in a multiple processor, multiple cluster system. A
`
`mechanism for eliciting a response from a node to maintain cache coherency in a
`
`system is referred to herein as a probe. In one example, a mechanism for snooping a
`
`cache is referred to as a probe. A response to a probe can be directed to the source or
`
`20
`
`target of the initiating request. Any mechanism for filtering or reducing the number of
`
`probes and requests transmitted to various nodes is referred to herein as managing
`
`probes. In one example, managing probes entails characterizing a request to determine
`
`if a probe can be transmitted to a reduced number of entities.
`
`25
`
`In typical implementations, requests are sent to a memory controller that
`
`broadcasts probes to various nodes in a system. In such a system, no knowledge of the
`
`cache line state needs to be maintained by the memory controller. All nodes in the
`
`system are probed and the request cluster receives a response from each node. In a
`
`system with a coherence directory, state information associated with various memory
`
`30
`
`lines can be used to reduce the number of transactions. Any mechanism for
`
`maintaining state information associated with various memory lines is referred to
`
`herein as a coherence directory. According to some embodiments, a coherence
`
`directory includes information for memory lines in a local cluster that are cached in a
`
`9
`
`16
`
`
`
`remote cluster. According to others, such a directory includes information for locally
`
`cached lines. According to various embodiments, a coherence directory is used to
`
`reduce the number of probes to remote quads by inferring the state of local caches.
`
`According to some embodiments, such a directory mechanism is used in a single cluster
`
`5
`
`system or within a cluster in a multi-cluster system to reduce the number of probes
`
`within a cluster.
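The contrast between broadcasting and directory-based probing can be made concrete with a small sketch. The structures below are invented for illustration under the assumption that the directory tracks, per memory line, which remote clusters may hold a cached copy:

```python
# Sketch of a coherence directory reducing probe traffic: instead of
# broadcasting a probe to every cluster, the home controller consults per-line
# state and probes only clusters that may hold a cached copy of the line.

class CoherenceDirectory:
    def __init__(self, all_clusters):
        self.all_clusters = set(all_clusters)
        self.entries = {}   # memory line -> set of remote clusters caching it

    def record_cached(self, line, cluster):
        # Update directory state when a remote cluster caches the line.
        self.entries.setdefault(line, set()).add(cluster)

    def probe_targets(self, line):
        # A line absent from the directory has never been cached remotely,
        # so no remote cluster needs to be probed at all.
        if line not in self.entries:
            return set()
        return set(self.entries[line])

    def probes_saved(self, line):
        # Probes avoided for this line relative to a full broadcast.
        return len(self.all_clusters) - len(self.probe_targets(line))
```

In a four-cluster system where only one remote cluster caches a given line, the directory answers the probe with one target instead of four, and an uncached line generates no remote probes at all.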
`
Figure 1A is a diagrammatic representation of one example of a multiple cluster, multiple processor system that can use the techniques of the present invention. Each processing cluster 101, 103, 105, and 107 can include a plurality of processors. The processing clusters 101, 103, 105, and 107 are connected to each other through point-to-point links 111a-f. In one embodiment, the multiple processors in the multiple cluster architecture shown in Figure 1A share the same memory space. In this example, the point-to-point links 111a-f are internal system connections that are used in place of a traditional front-side bus to connect the multiple processors in the multiple clusters 101, 103, 105, and 107. The point-to-point links may support any point-to-point protocol.
`
Figure 1B is a diagrammatic representation of another example of a multiple cluster, multiple processor system that can use the techniques of the present invention. Each processing cluster 121, 123, 125, and 127 can be coupled to a switch 131 through point-to-point links 141a-d. It should be noted that using a switch and point-to-point links allows implementation with fewer point-to-point links when connecting multiple clusters in the system. A switch 131 can include a processor with a coherence protocol interface. According to various implementations, a multicluster system shown in Figure 1A is expanded using a switch 131 as shown in Figure 1B.
`
Figure 2 is a diagrammatic representation of a multiple processor cluster, such as the cluster 101 shown in Figure 1A. Cluster 200 includes processors 202a-202d, one or more Basic I/O systems (BIOS) 204, a memory subsystem comprising memory banks 206a-206d, point-to-point communication links 208a-208e, and a service processor 212. The point-to-point communication links are configured to allow interconnections between processors 202a-202d, I/O switch 210, and cache coherence controller 230. The service processor 212 is configured to allow communications with processors 202a-202d, I/O switch 210, and cache coherence controller 230 via a JTAG interface represented in Figure 2 by links 214a-214f. It should be noted that other interfaces are supported. It should also be noted that in some implementations, a service processor is not included in multiple processor clusters. I/O switch 210 connects the rest of the system to I/O adapters 216 and 220. It should further be noted that the terms node and processor are often used interchangeably