US007107408B2

(12) United States Patent
Glasco

(10) Patent No.: US 7,107,408 B2
(45) Date of Patent: Sep. 12, 2006
`
(54) METHODS AND APPARATUS FOR SPECULATIVE PROBING WITH EARLY COMPLETION AND EARLY REQUEST
`
(75) Inventor: David B. Glasco, Austin, TX (US)

(73) Assignee: Newisys, Inc., Austin, TX (US)

( * ) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 454 days.
`
(21) Appl. No.: 10/106,299

(22) Filed: Mar. 22, 2002

(65) Prior Publication Data

US 2003/0182508 A1    Sep. 25, 2003
`
(51) Int. Cl.
G06F 12/08 (2006.01)
G06F 12/16 (2006.01)

(52) U.S. Cl. ........ 711/141; 711/118; 711/128; 711/130; 711/146

(58) Field of Classification Search ........ 711/141–146, 711/130, 147–149, 135–136, 119–122, 118, 117
See application file for complete search history.
`
(56) References Cited

U.S. PATENT DOCUMENTS

5,195,039 A *   3/1993  Sindhu et al. ......... 370/235
5,958,019 A     9/1999  Hagersten et al.
6,067,603 A *   5/2000  Carpenter et al. ...... 711/141
6,167,492 A    12/2000  Keller et al. ......... 711/154
6,292,705 B1 *  9/2001  Wang et al. ........... 700/5
6,338,122 B1 *  1/2002  Baumgartner et al. .... 711/141
6,374,331 B1 *  4/2002  Janakiraman et al. .... 711/141
6,385,705 B1    5/2002  Keller et al. ......... 711/154
6,490,661 B1   12/2002  Keller et al. ......... 711/150
6,615,319 B1 *  9/2003  Khare et al. .......... 711/141
6,633,945 B1 * 10/2003  Fu et al. ............. 710/316
6,754,782 B1 *  6/2004  Arimilli et al. ....... 711/144
6,760,819 B1 *  7/2004  Dhong et al. .......... 711/146
6,799,252 B1 *  9/2004  Bauman ................ 711/149
6,839,808 B1 *  1/2005  Gruner et al. ......... 711/130
`
OTHER PUBLICATIONS

HyperTransport™ I/O Link Specification, Revision 1.03, HyperTransport™ Consortium, Oct. 10, 2001. Copyright © 2001 HyperTransport Technology Consortium.
U.S. Appl. No. 10/106,426, filed Mar. 22, 2002, Office Action mailed Nov. 21, 2005.
U.S. Appl. No. 10/145,438, filed May 13, 2002, Office Action mailed Nov. 21, 2005.
U.S. Appl. No. 10/145,439, filed May 13, 2002, Office Action mailed Nov. 21, 2005.

* cited by examiner
`
Primary Examiner: Matthew Kim
Assistant Examiner: Zhuo H. Li
(74) Attorney, Agent, or Firm: Beyer Weaver & Thomas, LLP
`
(57) ABSTRACT

According to the present invention, methods and apparatus are provided for increasing the efficiency of data access in multiple processor, multiple cluster systems. A cache coherence controller associated with a first cluster of processors can determine whether speculative probing can be performed before forwarding a data access request to a second cluster. The cache coherence controller can also forward the data access request to the second cluster before receiving a probe response.

53 Claims, 15 Drawing Sheets
`
[Representative drawing: transaction flow among a request cluster, home cluster 920, and remote cluster 940]
`
`U.S. Patent
`
`Sep. 12, 2006
`
Sheet 1 of 15
`
`US 7,107,408 B2
`
Figure 1A

[Processing Clusters 101, 103, 105, and 107 interconnected by point-to-point links 111a–111f]

Figure 1B

[Processing Clusters 121, 123, 125, and 127 coupled to Switch 131 through point-to-point links 141a–141d]
`
`
[Sheet 2 of 15] Figure 2

[Multiprocessor cluster with processors, memory, I/O switch, and cache coherence controller; figure text rotated in the scan]
`
`
[Sheet 3 of 15] Figure 3

[Cache coherence controller, including protocol engine 305 and pending buffer 309; figure text rotated in the scan]
`
`
[Sheet 4 of 15] Figure 4

[Transaction flow for a data access request among CPUs and a memory controller; figure text rotated in the scan]
`
`
[Sheet 5 of 15] Figure 5A

[Cache coherence controller functionality; figure text rotated in the scan]
`
`
[Sheet 6 of 15] Figure 5B

[Cache coherence controller functionality, continued; figure text rotated in the scan]
`
`
[Sheet 7 of 15] Figures 5C–5D

[Cache coherence controller functionality with CPUs and non-local nodes; figure text rotated in the scan]
`
`
[Sheet 8 of 15] Figure 6

[Transaction flow for a data access request transmitted to a home cache coherence controller; figure text rotated in the scan]
`
`
[Sheet 9 of 15]

[Figure text not recoverable from the scan]
`
[Sheet 10 of 15] Figure 7

[Transaction flow for speculative probing at a request cluster: cache coherence controller 721 with links 721-4 and 723-2, home cluster 720, and remote cluster 740]
`
`
[Sheet 11 of 15] Figure 8

[Flow diagram for handling intervening requests: 801 identify the memory line associated with a request from a request cluster processor; 803 determine whether speculative probing can be performed, proceeding without speculative probing if not; 805 proceed with speculative probing; if a probe associated with the memory line is received, 823 provide probe information to the intervening processor; 811 provide probe information to the request cluster processor; 815 wait for responses; 813 end]
`
`
`
[Sheet 12 of 15] Figure 9

[Transaction flow for speculative probing with delayed request among request, home, and remote clusters; figure text rotated in the scan]
`
`
`
[Sheet 13 of 15] Figure 10

[Flow diagram for determining whether a data access request can complete locally: 1001 identify the cache state in the controller; 1003 cache state = shared? if yes, 1015 check the type of access requested, where a read leads to 1017 (can complete transaction locally); 1005 cache state = owned? 1007 cache state = exclusive? 1009 cache state = modified? a yes branch leads to 1017 (can complete transaction locally); if no state matches, 1011 cannot complete transaction locally; end]
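The determination sketched in Figure 10 can be written as a short routine. This is an illustrative reading of the flowchart, not code from the patent; the state names follow the MOESI convention the figure implies.

```python
def local_completion(cache_state, access):
    """Decide whether a transaction can complete locally (a Figure 10 reading).

    A shared copy satisfies only reads; owned, exclusive, or modified
    copies allow the transaction to complete locally; anything else
    (e.g. invalid) cannot complete locally.
    """
    if cache_state == "shared":
        return access == "read"
    return cache_state in {"owned", "exclusive", "modified"}

assert local_completion("shared", "read")
assert not local_completion("shared", "write")
assert local_completion("modified", "write")
assert not local_completion("invalid", "read")
```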
`
`
`
[Sheet 14 of 15] Figure 11

[Transaction flow for speculative probing with early request among request, home, and remote clusters; figure text rotated in the scan]
`
`
`
[Sheet 15 of 15] Figure 12

[Flow diagram for the maintenance of transaction information: 1201 allocate transaction identifier; 1203 probe local and remote clusters; 1205 local transaction completes; 1207 maintain transaction identifier; 1209 all remote probe responses received? when yes, 1211 clear transaction identifier]
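The identifier lifecycle shown in Figure 12 can be sketched as follows. This is an illustrative sketch, not the patent's implementation; the class and its fields are invented for the example.

```python
class Transaction:
    """Track a transaction identifier across local and remote completion."""

    def __init__(self, tid, expected_remote_responses):
        self.tid = tid
        self.pending = expected_remote_responses
        self.locally_complete = False

    def complete_locally(self):
        # The local transaction can complete while remote probes are pending.
        self.locally_complete = True

    def remote_response(self):
        # Maintain the identifier until every remote probe response arrives;
        # return True once the identifier can be cleared.
        self.pending -= 1
        return self.pending == 0

txn = Transaction(tid=7, expected_remote_responses=2)
txn.complete_locally()
assert not txn.remote_response()   # one remote response still outstanding
assert txn.remote_response()       # all responses received: clear identifier
```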
`
`
`
`METHODS AND APPARATUS FOR
`SPECULATIVE PROBING WITH EARLY
`COMPLETION AND EARLY REQUEST
`
CROSS-REFERENCE TO RELATED APPLICATIONS

This application is related to concurrently filed U.S. application Ser. No. 10/106,426, entitled METHODS AND APPARATUS FOR SPECULATIVE PROBING AT A REQUEST CLUSTER, and to concurrently filed U.S. application Ser. No. 10/106,430, entitled METHODS AND APPARATUS FOR SPECULATIVE PROBING WITH EARLY COMPLETION AND DELAYED REQUEST, the disclosures of which are incorporated by reference herein for all purposes.
`
BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention generally relates to accessing data in a multiple processor system. More specifically, the present invention provides techniques for improving data access efficiency while maintaining cache coherency in a multiple processor system having a multiple cluster architecture.

2. Description of Related Art

Data access in multiple processor systems can raise issues relating to cache coherency. Conventional multiple processor computer systems have processors coupled to a system memory through a shared bus. In order to optimize access to data in the system memory, individual processors are typically designed to work with cache memory. In one example, each processor has a cache that is loaded with data that the processor frequently accesses. The cache can be on-chip or off-chip. Each cache block can be read or written by the processor. However, cache coherency problems can arise because multiple copies of the same data can co-exist in systems having multiple processors and multiple cache memories. For example, a frequently accessed data block corresponding to a memory line may be loaded into the cache of two different processors. In one example, if both processors attempt to write new values into the data block at the same time, different data values may result. One value may be written into the first cache while a different value is written into the second cache. A system might then be unable to determine what value to write through to system memory.
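The write conflict just described can be demonstrated with a toy model in Python. This sketch is not part of the patent; the Cache class, addresses, and values are invented purely to illustrate how unsynchronized writes to cached copies diverge.

```python
# Two per-processor caches load the same memory line, then write it
# independently; nothing orders the writes against each other.
class Cache:
    def __init__(self):
        self.lines = {}

    def load(self, memory, addr):
        # Load a copy of the memory line into this processor's cache.
        self.lines[addr] = memory[addr]

    def write(self, addr, value):
        # The write hits only this cache; memory is not updated.
        self.lines[addr] = value

memory = {0x100: 0}
cache_a, cache_b = Cache(), Cache()
cache_a.load(memory, 0x100)
cache_b.load(memory, 0x100)

# Both processors write new values into the data block "at the same time".
cache_a.write(0x100, 1)
cache_b.write(0x100, 2)

# The copies now disagree, and memory still holds the stale value:
# the system cannot tell which value should be written through.
assert cache_a.lines[0x100] != cache_b.lines[0x100]
assert memory[0x100] == 0
```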
A variety of cache coherency mechanisms have been developed to address such problems in multiprocessor systems. One solution is to simply force all processor writes to go through to memory immediately and bypass the associated cache. The write requests can then be serialized before overwriting a system memory line. However, bypassing the cache significantly decreases the efficiency gained by using a cache. Other cache coherency mechanisms have been developed for specific architectures. In a shared bus architecture, each processor checks or snoops on the bus to determine whether it can read or write a shared cache block. In one example, a processor only writes an object when it owns or has exclusive access to the object. Each corresponding cache object is then updated to allow processors access to the most recent version of the object.

Bus arbitration can be used when both processors attempt to write the same shared data block in the same clock cycle. Bus arbitration logic can decide which processor gets the bus first. Although cache coherency mechanisms such as bus arbitration are effective, using a shared bus limits the number of processors that can be implemented in a single system with a single memory space.
Other multiprocessor schemes involve individual processor, cache, and memory systems connected to other processor, cache, and memory systems using a network backbone such as Ethernet or Token Ring. Multiprocessor schemes involving separate computer systems, each with its own address space, can avoid many cache coherency problems because each processor has its own associated memory and cache. When one processor wishes to access data on a remote computing system, communication is explicit. Messages are sent to move data to another processor and messages are received to accept data from another processor using standard network protocols such as TCP/IP. Multiprocessor systems using explicit communication, including transactions such as sends and receives, are referred to as systems using multiple private memories. By contrast, multiprocessor systems using implicit communication, including transactions such as loads and stores, are referred to herein as using a single address space.

Multiprocessor schemes using separate computer systems allow more processors to be interconnected while minimizing cache coherency problems. However, it would take substantially more time to access data held by a remote processor using a network infrastructure than it would take to access data held by a processor coupled to a system bus. Furthermore, valuable network bandwidth would be consumed moving data to the proper processors. This can negatively impact both processor and network performance.

Performance limitations have led to the development of a point-to-point architecture for connecting processors in a system with a single memory space. In one example, individual processors can be directly connected to each other through a plurality of point-to-point links to form a cluster of processors. Separate clusters of processors can also be connected. The point-to-point links significantly increase the bandwidth for coprocessing and multiprocessing functions. However, using a point-to-point architecture to connect multiple processors in a multiple cluster system sharing a single memory space presents its own problems.

Consequently, it is desirable to provide techniques for improving data access and cache coherency in systems having multiple clusters of multiple processors connected using point-to-point links.
`
SUMMARY OF THE INVENTION

According to the present invention, methods and apparatus are provided for increasing the efficiency of data access in a multiple processor, multiple cluster system. A cache coherence controller associated with a first cluster of processors can determine whether speculative probing can be performed before forwarding a data access request to a second cluster. The cache coherence controller can also forward the data access request to the second cluster before receiving a probe response.

According to specific embodiments, a computer system is provided. A first cluster includes a first plurality of processors and a first cache coherence controller. The first plurality of processors and the first cache coherence controller are interconnected in a point-to-point architecture. A second cluster includes a second plurality of processors and a second cache coherence controller. The second plurality of processors and the second cache coherence controller are interconnected in a point-to-point architecture. The first cache coherence controller is coupled to the second cache coherence controller. The first cache coherence controller is configured to receive a cache access request originating from the first plurality of processors and send a probe to the first plurality of processors in the first cluster before the cache access request is received by a serialization point in the second cluster. The first cache coherence controller can be further configured to forward the cache access request before determining if the cache access request can be completed locally.
In one embodiment, the serialization point is a memory controller in the second cluster. The probe can be associated with the memory line corresponding to the cache access request. The first cache coherence controller can be further configured to respond to the probe originating from the second cluster using information obtained from the probe of the first plurality of processors. The first cache coherence controller can also be associated with a pending buffer.

According to another embodiment, a cache coherence controller is provided. The cache coherence controller includes interface circuitry coupled to a plurality of local processors in a local cluster and a non-local cache coherence controller in a non-local cluster. The plurality of local processors are arranged in a point-to-point architecture. The cache coherence controller can also include a protocol engine coupled to the interface circuitry. The protocol engine can be configured to receive a cache access request from a first processor in the local cluster and speculatively probe a local node. The protocol engine can also forward the cache access request before receiving a probe response from the local node associated with the cache.

According to another embodiment, a method for a cache coherence controller to manage data access in a multiprocessor system is provided. A cache access request is received from a local processor associated with a local cluster of processors connected through a point-to-point architecture. It is determined if speculative probing of a local node associated with a cache can be performed before forwarding the cache request to a non-local cache coherence controller. The non-local cache coherence controller is associated with a remote cluster of processors connected through a point-to-point architecture. The remote cluster of processors shares an address space with the local cluster of processors. A cache access request can be sent before receiving a probe response from the local node associated with the cache.

A further understanding of the nature and advantages of the present invention may be realized by reference to the remaining portions of the specification and the drawings.
`
BRIEF DESCRIPTION OF THE DRAWINGS

The invention may best be understood by reference to the following description taken in conjunction with the accompanying drawings, which are illustrative of specific embodiments of the present invention.

FIGS. 1A and 1B are diagrammatic representations depicting a system having multiple clusters.

FIG. 2 is a diagrammatic representation of a cluster having a plurality of processors.

FIG. 3 is a diagrammatic representation of a cache coherence controller.

FIG. 4 is a diagrammatic representation showing a transaction flow for a data access request.

FIGS. 5A–5D are diagrammatic representations showing cache coherence controller functionality.

FIG. 6 is a diagrammatic representation depicting a transaction flow for a data access request from a processor transmitted to a home cache coherency controller.

FIG. 7 is a diagrammatic representation showing a transaction flow for speculative probing at a request cluster.

FIG. 8 is a process flow diagram depicting the handling of intervening requests.

FIG. 9 is a diagrammatic representation showing a transaction flow for speculative probing with delayed request.

FIG. 10 is a process flow diagram depicting the determination of whether a data access request can complete locally.

FIG. 11 is a diagrammatic representation showing a transaction flow for speculative probing with early request.

FIG. 12 is a process flow diagram depicting the maintenance of transaction information.
`
DETAILED DESCRIPTION OF SPECIFIC EMBODIMENTS

Reference will now be made in detail to some specific embodiments of the invention, including the best modes contemplated by the inventors for carrying out the invention. Examples of these specific embodiments are illustrated in the accompanying drawings. While the invention is described in conjunction with these specific embodiments, it will be understood that it is not intended to limit the invention to the described embodiments. On the contrary, it is intended to cover alternatives, modifications, and equivalents as may be included within the spirit and scope of the invention as defined by the appended claims. Multi-processor architectures having point-to-point communication among their processors are suitable for implementing specific embodiments of the present invention. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. The present invention may be practiced without some or all of these specific details. Well known process operations have not been described in detail in order not to unnecessarily obscure the present invention.
Techniques are provided for increasing data access efficiency in a multiple processor, multiple cluster system. In a point-to-point architecture, a cluster of processors includes multiple processors directly connected to each other through point-to-point links. By using point-to-point links instead of a conventional shared bus or external network, multiple processors are used efficiently in a system sharing the same memory space. Processing and network efficiency are also improved by avoiding many of the bandwidth and latency limitations of conventional bus and external network based multiprocessor architectures. According to various embodiments, however, linearly increasing the number of processors in a point-to-point architecture leads to an exponential increase in the number of links used to connect the multiple processors. In order to reduce the number of links used and to further modularize a multiprocessor system using a point-to-point architecture, multiple clusters are used.

According to various embodiments, the multiple processor clusters are interconnected using a point-to-point architecture. Each cluster of processors includes a cache coherence controller used to handle communications between clusters. In one embodiment, the point-to-point architecture used to connect processors is used to connect clusters as well.

By using a cache coherence controller, multiple cluster systems can be built using processors that may not necessarily support multiple clusters. Such a multiple cluster system can be built by using a cache coherence controller to represent non-local nodes in local transactions so that local nodes do not need to be aware of the existence of nodes outside of the local cluster. More detail on the cache coherence controller will be provided below.
In a single cluster system, cache coherency can be maintained by sending all data access requests through a serialization point. Any mechanism for ordering data access requests is referred to herein as a serialization point. One example of a serialization point is a memory controller. Various processors in the single cluster system send data access requests to the memory controller. The memory controller can be configured to serialize the data access requests so that only one data access request for a given memory line is allowed at any particular time. If another processor attempts to access the same memory line, the data access attempt is blocked until the memory line is unlocked. The memory controller allows cache coherency to be maintained in a multiple processor, single cluster system.
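A serialization point of this kind can be sketched in a few lines. This is an illustrative model only, not the patent's memory controller; the class and method names are invented.

```python
# A memory controller that admits one outstanding access per memory line.
class MemoryController:
    def __init__(self):
        self.locked_lines = set()

    def request(self, line):
        """Admit a data access request, or block it if the line is locked."""
        if line in self.locked_lines:
            return False              # another access owns the line
        self.locked_lines.add(line)   # lock the line for this access
        return True

    def complete(self, line):
        self.locked_lines.discard(line)   # unlock the memory line

mc = MemoryController()
assert mc.request(0x40)       # first access to the line is allowed
assert not mc.request(0x40)   # a concurrent access to the same line blocks
mc.complete(0x40)
assert mc.request(0x40)       # once unlocked, the line can be accessed again
```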
A serialization point can also be used in a multiple processor, multiple cluster system where the processors in the various clusters share a single address space. By using a single address space, internal point-to-point links can be used to significantly improve intercluster communication over traditional external network based multiple cluster systems. Various processors in various clusters send data access requests to a memory controller associated with a particular cluster, such as a home cluster. The memory controller can similarly serialize all data requests from the different clusters. However, a serialization point in a multiple processor, multiple cluster system may not be as efficient as a serialization point in a multiple processor, single cluster system. That is, delay resulting from factors such as latency from transmitting between clusters can adversely affect the response times for various data access requests. It should be noted that delay also results from the use of probes in a multiple processor environment.

Although delay in intercluster transactions in an architecture using a shared memory space is significantly less than the delay in conventional message passing environments using external networks such as Ethernet or Token Ring, even minimal delay is a significant factor. In some applications, there may be millions of data access requests from a processor in a single second. Any delay can adversely impact processor performance.
According to various embodiments, speculative probing is used to increase the efficiency of accessing data in a multiple processor, multiple cluster system. A mechanism for eliciting a response from a node to maintain cache coherency in a system is referred to herein as a probe. In one example, a mechanism for snooping a cache is referred to as a probe. A response to a probe can be directed to the source or target of the initiating request. Any mechanism for sending probes to nodes associated with cache blocks before a request associated with the probes is received at a serialization point is referred to herein as speculative probing.
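The ordering that defines speculative probing, probes issued to local nodes before the request reaches a serialization point, can be sketched as follows. The event tuples and names are invented for illustration and are not part of the patent.

```python
# Issue probes to local nodes first, then forward the request to the home
# cluster's serialization point without waiting for probe responses.
def handle_request(request, local_nodes, home_cluster):
    events = []
    for node in local_nodes:
        # Speculative probe: sent before the request is ordered remotely.
        events.append(("probe", node, request["line"]))
    # Forward the request; probe responses have not yet arrived.
    events.append(("forward", home_cluster, request["line"]))
    return events

events = handle_request({"line": 0x80}, ["cpu0", "cpu1"], "home")
assert all(e[0] == "probe" for e in events[:-1])   # probes go out first
assert events[-1][0] == "forward"                  # request forwarded after
```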
Techniques of the present invention recognize that the reordering or elimination of certain data access requests does not adversely affect cache coherency. That is, the end value in the cache is the same whether or not snooping occurs. For example, a local processor attempting to read the cache data block can be allowed to access the data block without sending the request through a serialization point in certain circumstances. In one example, read access can be permitted when the cache block is valid and the associated memory line is not locked. The techniques of the present invention provide mechanisms for determining when speculative probing can be performed and also provide mechanisms for determining when speculative probing can be completed without sending a request through a serialization point. Speculative probing will be described in greater detail below. By completing a data access transaction within a local cluster, the delay associated with transactions in a multiple cluster system can be reduced or eliminated.
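One concrete reading of the condition above, that a read may complete locally when the cache block is valid and the memory line is unlocked, is sketched below. The state names and function are illustrative assumptions, not the patent's protocol.

```python
# A read completes locally only for a valid cached copy of an unlocked line.
VALID_STATES = {"shared", "owned", "exclusive", "modified"}

def read_completes_locally(cache_state, line_locked):
    return cache_state in VALID_STATES and not line_locked

assert read_completes_locally("shared", line_locked=False)
assert not read_completes_locally("invalid", line_locked=False)
assert not read_completes_locally("shared", line_locked=True)
```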
To allow even more efficient speculative probing, the techniques of the present invention also provide mechanisms for handling transactions that may result from speculatively probing a local node before locking a particular memory line. In one example, a cache coherence protocol used in a point-to-point architecture may not allow for speculative probing. Nonetheless, mechanisms are provided to allow various nodes such as processors and memory controllers to continue operations within the cache coherence protocol without knowing that any protocol variations have occurred.
FIG. 1A is a diagrammatic representation of one example of a multiple cluster, multiple processor system that can use the techniques of the present invention. Each processing cluster 101, 103, 105, and 107 can include a plurality of processors. The processing clusters 101, 103, 105, and 107 are connected to each other through point-to-point links 111a–f. In one embodiment, the multiple processors in the multiple cluster architecture shown in FIG. 1A share the same memory space. In this example, the point-to-point links 111a–f are internal system connections that are used in place of a traditional front-side bus to connect the multiple processors in the multiple clusters 101, 103, 105, and 107. The point-to-point links may support any point-to-point coherence protocol.
FIG. 1B is a diagrammatic representation of another example of a multiple cluster, multiple processor system that can use the techniques of the present invention. Each processing cluster 121, 123, 125, and 127 can be coupled to a switch 131 through point-to-point links 141a–d. It should be noted that using a switch and point-to-point links allows implementation with fewer point-to-point links when connecting multiple clusters in the system. A switch 131 can include a processor with a coherence protocol interface. According to various implementations, a multicluster system shown in FIG. 1A is expanded using a switch 131 as shown in FIG. 1B.
FIG. 2 is a diagrammatic representation of a multiple processor cluster, such as the cluster 101 shown in FIG. 1A. Cluster 200 includes processors 202a–202d, one or more Basic I/O systems (BIOS) 204, a memory subsystem comprising memory banks 206a–206d, point-to-point communication links 208a–208e, and a service processor 212. The point-to-point communication links are configured to allow interconnections between processors 202a–202d, I/O switch 210, and cache coherence controller 230. The service processor 212 is configured to allow communications with processors 202a–202d, I/O switch 210, and cache coherence controller 230 via a JTAG interface represented in FIG. 2 by links 214a–214f. It should be noted that other interfaces are supported. I/O switch 210 connects the rest of the system to I/O adapters 216 and 220.

According to specific embodiments, the service processor of the present invention has the intelligence to partition system resources according to a previously specified partitioning schema. The partitioning can be achieved through direct manipulation of routing tables associated with the system processors by the service processor, which is made possible by the point-to-point communication infrastructure. The routing tables are used to control and isolate various system resources, the connections between which are defined therein. The service processor and computer system partitioning are described in patent application Ser. No. 09/932,456 titled Computer System Partitioning Using Data Transfer Routing Mechanism, filed on Aug. 16, 2001, the entirety of which is incorporated by reference for all purposes.
The processors 202a–d are also coupled to a cache coherence controller 230 through point-to-point links 232a–d. Any mechanism or apparatus that can be used to provide communication between multiple processor clusters while maintaining cache coherence is referred to herein as a cache coherence controller. The cache coherence controller 230 can be coupled to cache coherence controllers associated with other multiprocessor clusters. It should be noted that there can be more than one cache coherence controller in one cluster. The cache coherence controller 230 communicates with both processors 202a–d as well as remote clusters using a point-to-point protocol.
More generally, it should be understood that the specific architecture shown in FIG. 2 is merely exemplary and that embodiments of the present invention are contemplated having different configurations and resource interconnections, and a variety of alternatives for each of the system resources shown. However, for purposes of illustration, specific details of server 200 will be assumed. For example, most of the resources shown in FIG. 2 are assumed to reside on a single electronic assembly. In addition, memory banks 206a–206d may comprise double data rate (DDR) memory which is physically provided as dual in-line memory modules (DIMMs). I/O adapter 216 may be, for example, an ultra direct memory access (UDMA) controller or a small computer system interface (SCSI) controller which provides access to a permanent storage device. I/O adapter 220 may be an Ethernet card adapted to provide communications with a network such as, for example, a local area network (LAN) or the Internet.

According to a specific embodiment and as shown in FIG. 2, both of I/O adapters 216 and 220 provide symmetric I/O access. That is, each provides access to equivalent sets of I/O. As will be understood, such a configuration would facilitate a partitioning scheme in which multiple partitions have access to the same types of I/O. However, it should also be understood that embodiments are envisioned in which partitions without I/O are created. For example, a partition including one or more processors and associated memory resources, i.e., a memory complex, could be created for the purpose of testing the memory complex.
According to one embodiment, service processor 212 is a Motorola MPC855T microprocessor, which includes integrated chipset functions. The cache coherence controller 230 can be an Application Specific Integrated Circuit (ASIC) supporting the local point-to-point coherence protocol. The cache coherence controller 230 can also be configured to handle a non-coherent protocol to allow communication with I/O devices. In one embodiment, the cache coherence controller 230 is a specially configured programmable chip such as a programmable logic device or a field programmable gate array.
FIG. 3 is a diagrammatic representation of one example of a cache coherence controller 230. The cache coherence controller can include a protocol engine 305 configured to handle packets such as probes and requests received from processors in various clusters of a multiprocessor system. The functionality of the protocol engine 305 can be partitioned across several engines to improve performance. In one example, partitioning can be done based on individual transaction flows, packet type (request, probe, and response), direction (incoming and outgoing), or transaction flow (request flows, probe flows, etc.).
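One of the partitionings named above, by packet type, can be sketched as a dispatch table. The engine names and packet format are invented for illustration; the patent does not specify an implementation.

```python
# Route each packet to the engine that handles its type.
def dispatch(packet, engines):
    handler = engines[packet["type"]]
    return handler(packet)

engines = {
    "request":  lambda p: ("request-engine", p["id"]),
    "probe":    lambda p: ("probe-engine", p["id"]),
    "response": lambda p: ("response-engine", p["id"]),
}

assert dispatch({"type": "probe", "id": 1}, engines) == ("probe-engine", 1)
```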
`
The protocol engine 305 has access to a pending buffer 309 that allows the cache coherence controller to track transactions such as recent requests and probes and associate the transactions with specific processors. Transaction information maintained in the pending buffer 309 can include transaction destination nodes, the addresses of req