Glasco

(10) Patent No.: US 6,865,595 B2
(45) Date of Patent: Mar. 8, 2005

US006865595B2

(54) METHODS AND APPARATUS FOR SPECULATIVE PROBING OF A REMOTE CLUSTER

(75) Inventor: David B. Glasco, Austin, TX (US)
(73) Assignee: Newisys, Inc., Austin, TX (US)
(*) Notice: Subject to any disclaimer, the term of this
patent is extended or adjusted under 35
U.S.C. 154(b) by 491 days.

(21) Appl. No.: 10/157,340
(22) Filed: May 28, 2002
(65) Prior Publication Data
US 2003/0225978 A1 Dec. 4, 2003

(51) Int. Cl.7: G06F 13/14; G06F 12/16
(52) U.S. Cl.: 709/206; 709/216; 709/217; 711/141; 711/146
(58) Field of Search: 709/206, 213, 709/216, 217, 218, 219; 711/141, 146

(56) References Cited

U.S. PATENT DOCUMENTS

6,067,603 A      *   5/2000   Carpenter et al.      711/141
6,167,492 A         12/2000   Keller et al.         711/154
6,338,122 B1     *   1/2002   Baumgartner et al.    711/141
6,385,705 B1         5/2002   Keller et al.         711/154
6,490,661 B1        12/2002   Keller et al.         711/150
6,615,319 B2     *   9/2003   Khare et al.          711/141
6,633,960 B1     *  10/2003   Kessler et al.        711/144
6,704,842 B1     *   3/2004   Janakiraman et al.    711/141
6,751,698 B1     *   6/2004   Deneroff et al.       710/317
6,754,782 B2     *   6/2004   Arimilli et al.       711/144
2002/0087811 A1  *   7/2002   Khare et al.          711/146
2003/0196047 A1  *  10/2003   Kessler et al.        711/147

OTHER PUBLICATIONS

HyperTransport TM I/O Link Specification Revision 1.03,
HyperTransport TM Consortium, Oct. 10, 2001, Copyright ©
2001 HyperTransport Technology Consortium.

* cited by examiner

Primary Examiner: Kevin Verbrugge
(74) Attorney, Agent, or Firm: Beyer Weaver & Thomas, LLP

(57) ABSTRACT

According to the present invention, methods and apparatus
are provided for increasing the efficiency of data access in a
multiple processor, multiple cluster system. Techniques are
provided for speculatively probing a remote cluster from
either a request cluster or a home cluster. A speculative
probe associated with a particular memory line is transmitted
to the remote cluster before the cache access request
associated with the memory line is serialized at a home
cluster. When a non-speculative probe is received at a
remote cluster, the information associated with the response
to the speculative probe is used to provide a response to the
non-speculative probe.
`
`39 Claims, 15 Drawing Sheets
`
[Front-page drawing, not reproduced: a request cluster, a home cluster, and a remote cluster]
`
`
`
U.S. Patent    Mar. 8, 2005    Sheet 1 of 15    US 6,865,595 B2

Figure 1A: [drawing not reproduced] Processing Clusters 101, 103, 105, and 107
Figure 1B: [drawing not reproduced] Processing Clusters 121, 123, 125, and 127
`
`
`
[Sheets 2 through 10 of 15: drawings not recoverable from the scanned text]
`
`
`
Figure 8 (Sheet 11 of 15): Speculative Probing From A Home Cluster
[flowchart; box text as recovered from the scan, arrows not reproduced]

     Receive Cache Access Request Associated With A Memory Line
     [decision box, partly illegible] ... Requests Associated With Memory Line? (No: continue to 805)
805  Send Speculative Probe Corresponding To Request To Remote Cluster
807  Forward Request To Serialization Point
809  Receive Probe From Serialization Point
811  Broadcast Probes To Request And Remote Clusters
813  Receive Probe Response
815  Provide Probe Responses To Request Cluster
     End
`
`
`
Figure 9 (Sheet 12 of 15): Speculative Probing From A Home Cluster
[flowchart; box text as recovered from the scan, arrows not reproduced]

     Receive Speculative Probe Associated With A Memory Line From A Home Cluster
     Speculatively Probe Local Nodes
905  Receive Non-Speculative Probe Associated With The Memory Line From The Home Cluster
907  Receive Speculative Probe Responses
909  Maintain Speculative Probe Responses
911  Provide Non-Speculative Probe Responses To Home Cluster Or Request Cluster Using The Results Of The Speculative Probe
913  Remove Speculative Probe From The Pending Buffer
     End
`
`
`
[Sheet 13 of 15: drawing not recoverable from the scanned text]
`
Figure 11 (Sheet 14 of 15): Speculative Probing From A Request Cluster
[flowchart; box text as recovered from the scan, arrows not reproduced]

1101  Receive Cache Access Request Associated With A Memory Line
1103  Forward Cache Access Request To Home Cluster
1105  Send Speculative Probe Corresponding To Request To All Remote Clusters
1107  Receive Probe From Home Cluster
1109  Receive Probe Responses From Home Cluster And Remote Cluster
`
`
`
Figure 12 (Sheet 15 of 15): Speculative Probing From A Request Cluster
[flowchart; box text as recovered from the scan, arrows and branch layout not reproduced]

1201  Receive Speculative Probe Associated With A Memory Line From A Request Cluster
1203  [decision box, partly illegible] Non-Speculative Probe Associated With ... Line?
1205  Drop Speculative Probe
1207  Speculatively Probe Local Nodes
1209  Receive Non-Speculative Probe Associated With The Memory Line From The Home Cluster
1211  Provide Non-Speculative Probe Responses To Home Cluster Or Request Cluster Using The Results Of The Speculative Probe
1213  Remove Speculative Probe From The Pending Buffer
1215  Receive Speculative Probe Responses
1217  Maintain Speculative Probe Responses
      End
`
`
`
`METHODS AND APPARATUS FOR
`SPECULATIVE PROBING OF A REMOTE
`CLUSTER
`
`CROSS-REFERENCE TO RELATED
`APPLICATIONS
`
The present application is related to U.S. application Ser.
No. 10/106,426 titled Methods And Apparatus For Speculative
Probing At A Request Cluster, U.S. application Ser. No.
10/106,430 titled Methods And Apparatus For Speculative
Probing With Early Completion And Delayed Request, and U.S.
application Ser. No. 10/106,299 titled Methods And Apparatus
For Speculative Probing With Early Completion And Early
Request, the entireties of which are incorporated by
reference herein for all purposes. The present application
is also related to U.S. application Ser. Nos. 10/145,439 and
10/145,438 both titled Methods And Apparatus For Responding
To A Request Cluster by David B. Glasco filed on May 13,
2002, the entireties of which are incorporated by reference
for all purposes. Furthermore, the present application is
related to concurrently filed U.S. application Ser. No.
10/157,388 also titled Methods And Apparatus For Speculative
Probing Of A Remote Cluster by David B. Glasco, the entirety
of which is incorporated by reference for all purposes.
The present application is also related to concurrently
filed U.S. patent applications Ser. No. 10/157,384 titled
Transaction Management In Systems Having Multiple
Multi-Processor Clusters, Ser. No. 10/156,893 titled Routing
Mechanisms In Systems Having Multiple Multi-Processor
Clusters, and Ser. No. 10/157,409 titled Address Space
Management In Systems Having Multiple Multi-Processor
Clusters all by David B. Glasco, Carl Zeitler, Rajesh Kota,
Guru Prasadh, and Richard R. Oehler, the entireties of which
are incorporated by reference for all purposes.
`
`BACKGROUND OF THE INVENTION
`
`1. Field of the Invention
The present invention generally relates to accessing data
in a multiple processor system. More specifically, the
present invention provides techniques for improving data
access efficiency while maintaining cache coherency in a
multiple processor system having a multiple cluster
architecture.
`2. Description of Related Art
Data access in multiple processor systems can raise issues
relating to cache coherency. Conventional multiple processor
computer systems have processors coupled to a system
memory through a shared bus. In order to optimize access to
data in the system memory, individual processors are
typically designed to work with cache memory. In one example,
each processor has a cache that is loaded with data that the
processor frequently accesses. The cache is read or written
by a processor. However, cache coherency problems arise
because multiple copies of the same data can co-exist in
systems having multiple processors and multiple cache
memories. For example, a frequently accessed data block
corresponding to a memory line may be loaded into the
cache of two different processors. In one example, if both
processors attempt to write new values into the data block at
the same time, different data values may result. One value
may be written into the first cache while a different value is
written into the second cache. A system might then be unable
to determine what value to write through to system memory.
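By way of illustration only, the divergence described above can be sketched in a few lines of Python. The `Cache` class, the address `0x40`, and the stored values are hypothetical names chosen for this sketch and do not appear in the specification; the model deliberately has no coherence protocol, which is why the two caches end up disagreeing.

```python
# Toy model (an assumption for illustration): two private write-back caches
# in front of one shared memory, with no coherence protocol at all.
memory = {0x40: 0}  # one memory line at a hypothetical address


class Cache:
    def __init__(self):
        self.lines = {}  # address -> locally cached value

    def load(self, addr):
        # On a miss, copy the line from shared memory into this cache.
        if addr not in self.lines:
            self.lines[addr] = memory[addr]
        return self.lines[addr]

    def store(self, addr, value):
        # Write only the local copy; nothing is propagated to other caches.
        self.load(addr)
        self.lines[addr] = value


cache_a, cache_b = Cache(), Cache()
cache_a.store(0x40, 1)  # first processor writes one value
cache_b.store(0x40, 2)  # second processor writes a different value

# The two caches now hold different values for the same memory line, and
# nothing in this model says which one should be written back to memory.
diverged = cache_a.load(0x40) != cache_b.load(0x40)
```

In this sketch `diverged` is true and system memory still holds the stale value, which is the situation the coherency mechanisms discussed next are meant to prevent.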
A variety of cache coherency mechanisms have been
developed to address such problems in multiprocessor
systems. One solution is to simply force all processor writes to
go through to memory immediately and bypass the associated
cache. The write requests can then be serialized before
overwriting a system memory line. However, bypassing the
cache significantly decreases efficiency gained by using a
cache. Other cache coherency mechanisms have been developed
for specific architectures. In a shared bus architecture,
each processor checks or snoops on the bus to determine
whether it can read or write a shared cache block. In one
example, a processor only writes an object when it owns or
has exclusive access to the object. Each corresponding cache
object is then updated to allow processors access to the most
recent version of the object.
Bus arbitration is used when both processors attempt to
write a shared data block in the same clock cycle. Bus
arbitration logic decides which processor gets the bus first.
Although cache coherency mechanisms such as bus arbitration
are effective, using a shared bus limits the number of
processors that can be implemented in a single system with
a single memory space.
Other multiprocessor schemes involve individual
processor, cache, and memory systems connected to other
processors, cache, and memory systems using a network
backbone such as Ethernet or Token Ring. Multiprocessor
schemes involving separate computer systems each with its
own address space can avoid many cache coherency problems
because each processor has its own associated memory
and cache. When one processor wishes to access data on a
remote computing system, communication is explicit. Messages
are sent to move data to another processor and
messages are received to accept data from another processor
using standard network protocols such as TCP/IP.
Multiprocessor systems using explicit communication including
transactions such as sends and receives are referred to as
systems using multiple private memories. By contrast,
multiprocessor systems using implicit communication including
transactions such as loads and stores are referred to herein as
using a single address space.
Multiprocessor schemes using separate computer systems
allow more processors to be interconnected while minimizing
cache coherency problems. However, it would take
substantially more time to access data held by a remote
processor using a network infrastructure than it would take
to access data held by a processor coupled to a system bus.
Furthermore, valuable network bandwidth would be consumed
moving data to the proper processors. This can
negatively impact both processor and network performance.
Performance limitations have led to the development of a
point-to-point architecture for connecting processors in a
system with a single memory space. In one example,
individual processors can be directly connected to each other
through a plurality of point-to-point links to form a cluster
of processors. Separate clusters of processors can also be
connected. The point-to-point links significantly increase the
bandwidth for coprocessing and multiprocessing functions.
However, using a point-to-point architecture to connect
multiple processors in a multiple cluster system sharing a
single memory space presents its own problems.
Consequently, it is desirable to provide techniques for
improving data access and cache coherency in systems
having multiple clusters of multiple processors connected
using point-to-point links.
`
`SUMMARY OF THE INVENTION
According to the present invention, methods and apparatus
are provided for increasing the efficiency of data access
in a multiple processor, multiple cluster system. Techniques
are provided for speculatively probing a remote cluster from
either a request cluster or a home cluster. A speculative
probe associated with a particular memory line is transmitted
to the remote cluster before the cache access request
associated with the memory line is serialized at a home
cluster. When a non-speculative probe is received at a
remote cluster, the information associated with the response
to the speculative probe is used to provide a response to the
non-speculative probe.
According to various embodiments, a computer system
including a request cluster, a home cluster, and a remote
cluster is provided. The request cluster includes a plurality
of interconnected request cluster processors and a request
cluster cache coherence controller. The home cluster
includes a plurality of interconnected home processors, a
serialization point, and a home cache coherence controller.
The remote cluster includes a plurality of interconnected
remote processors and a remote cache coherence controller.
The remote cluster is configured to receive a first probe
corresponding to a cache access request from a home cluster
processor in the home cluster and a second probe
corresponding to the cache access request from the home cluster.
According to other embodiments, a method for managing
data access in a multiprocessor system is provided. The
method includes receiving a cache access request from a
request cluster, sending a first probe associated with the
cache access request from a home cluster to a remote cluster,
and sending a second probe associated with the cache access
request to the remote cluster. The home cluster includes a
home cluster cache coherence controller and a serialization
point.
In still other embodiments, a computer system is provided.
The computer system includes a first cluster and a
second cluster. The first cluster includes a first plurality of
processors and a first cache coherence controller. The first
plurality of processors and the first cache coherence
controller are interconnected in a point-to-point architecture.
The second cluster includes a second plurality of processors
and a second cache coherence controller. The second
plurality of processors and the second cache coherence
controller are interconnected in a point-to-point architecture.
The first cache coherence controller is coupled to the second
cache coherence controller. The second cache coherence
controller is configured to receive a cache access request
originating from the first plurality of processors and send a
first probe to a third cluster including a third plurality of
processors before the cache access request is received by a
serialization point in the second cluster.
In yet other embodiments, a computer system including a
first cluster and a second cluster is provided. The first cluster
includes a first plurality of processors and a first cache
coherence controller. The first plurality of processors and the
first cache coherence controller are interconnected in a
point-to-point architecture. The second cluster includes a
second plurality of processors and a second cache coherence
controller. The second plurality of processors and the second
cache coherence controller are interconnected in a
point-to-point architecture. The first cache coherence controller is
coupled to the second cache coherence controller and
constructed to receive a cache access request originating from
the first plurality of processors and send a probe to a third
cluster including a third plurality of processors before a
memory line associated with the cache access request is
locked.
According to other embodiments, a cache coherence
controller is provided. The cache coherence controller
includes interface circuitry coupled to a home cluster
processor in a home cluster and a remote cluster cache
coherence controller in a remote cluster and a protocol engine
coupled to the interface circuitry. The protocol engine is
configured to receive a cache access request from a request
cluster and speculatively probe a remote node in the remote
cluster.
A further understanding of the nature and advantages of
the present invention may be realized by reference to the
remaining portions of the specification and the drawings.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
The invention may best be understood by reference to the
following description taken in conjunction with the
accompanying drawings, which are illustrative of specific
embodiments of the present invention.
FIGS. 1A and 1B are diagrammatic representations depicting
a system having multiple clusters.
FIG. 2 is a diagrammatic representation of a cluster
having a plurality of processors.
FIG. 3 is a diagrammatic representation of a cache
coherence controller.
FIG. 4 is a diagrammatic representation showing a
transaction flow for a data access request from a processor in a
single cluster.
FIGS. 5A-5D are diagrammatic representations showing
cache coherence controller functionality.
FIG. 6 is a diagrammatic representation depicting a
transaction flow probing a remote cluster.
FIG. 7 is a diagrammatic representation showing a
transaction flow for a speculative probing from a home cluster.
FIG. 8 is a flow process diagram showing speculative
probing from a home cluster.
FIG. 9 is a flow process diagram showing speculative
probing from a home cluster at a remote cluster.
FIG. 10 is a diagrammatic representation showing a
transaction flow for a speculative probing from a request
cluster.
FIG. 11 is a flow process diagram showing speculative
probing from a request cluster.
FIG. 12 is a flow process diagram showing speculative
probing from a request cluster at a remote cluster.
`
`DETAILED DESCRIPTION OF SPECIFIC
`EMBODIMENTS
`
Reference will now be made in detail to some specific
embodiments of the invention including the best modes
contemplated by the inventors for carrying out the invention.
Examples of these specific embodiments are illustrated in
the accompanying drawings. While the invention is
described in conjunction with these specific embodiments, it
will be understood that it is not intended to limit the
invention to the described embodiments. On the contrary, it
is intended to cover alternatives, modifications, and
equivalents as may be included within the spirit and scope of the
invention as defined by the appended claims.
Multi-processor architectures having point-to-point communication
among their processors are suitable for implementing
specific embodiments of the present invention. In the
following description, numerous specific details are set forth in
order to provide a thorough understanding of the present
invention. The present invention may be practiced without
some or all of these specific details. Well known process
operations have not been described in detail in order not to
unnecessarily obscure the present invention. Furthermore,
the present application's reference to a particular singular
entity includes the possibility that the methods and
apparatus of the present invention can be implemented using
more than one entity, unless the context clearly dictates
otherwise.
Techniques are provided for increasing data access
efficiency in a multiple processor, multiple cluster system. In a
point-to-point architecture, a cluster of processors includes
multiple processors directly connected to each other through
point-to-point links. By using point-to-point links instead of
a conventional shared bus or external network, multiple
processors are used efficiently in a system sharing the same
memory space. Processing and network efficiency are also
improved by avoiding many of the bandwidth and latency
limitations of conventional bus and external network based
multiprocessor architectures. According to various
embodiments, however, linearly increasing the number of
processors in a point-to-point architecture leads to an
exponential increase in the number of links used to connect the
multiple processors. In order to reduce the number of links
used and to further modularize a multiprocessor system
using a point-to-point architecture, multiple clusters are
used.
According to various embodiments, the multiple processor
clusters are interconnected using a point-to-point
architecture. Each cluster of processors includes a cache
coherence controller used to handle communications between
clusters. In one embodiment, the point-to-point architecture
used to connect processors is used to connect clusters as
well.
By using a cache coherence controller, multiple cluster
systems can be built using processors that may not
necessarily support multiple clusters. Such a multiple cluster
system can be built by using a cache coherence controller to
represent non-local nodes in local transactions so that local
nodes do not need to be aware of the existence of nodes
outside of the local cluster. More detail on the cache
coherence controller will be provided below.
In a single cluster system, cache coherency can be
maintained by sending all data access requests through a
serialization point. Any mechanism for ordering data access
requests is referred to herein as a serialization point. One
example of a serialization point is a memory controller.
Various processors in the single cluster system send data
access requests to the memory controller. In one example,
the memory controller is configured to serialize or lock the
data access requests so that only one data access request for
a given memory line is allowed at any particular time. If
another processor attempts to access the same memory line,
the data access attempt is blocked until the memory line is
unlocked. The memory controller allows cache coherency to
be maintained in a multiple processor, single cluster system.
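As a minimal sketch of the locking behavior just described, and not a description of any particular memory controller, the following Python model allows one request per memory line at a time and blocks later requests until the line is unlocked. The class name `SerializationPoint`, its methods, the first-in-first-out wake-up order, and the requester labels are all assumptions made for this illustration.

```python
from collections import defaultdict, deque


class SerializationPoint:
    """Hypothetical model: orders requests so one is active per memory line."""

    def __init__(self):
        self.active = {}                   # line -> requester holding the lock
        self.waiting = defaultdict(deque)  # line -> blocked requesters (FIFO)

    def request(self, line, requester):
        """Return True if the request proceeds now, False if it is blocked."""
        if line in self.active:
            self.waiting[line].append(requester)
            return False
        self.active[line] = requester
        return True

    def complete(self, line):
        """Unlock the line; hand it to the next blocked requester, if any."""
        del self.active[line]
        if self.waiting[line]:
            next_requester = self.waiting[line].popleft()
            self.active[line] = next_requester
            return next_requester
        return None


sp = SerializationPoint()
first = sp.request(0x80, "cpu0")   # proceeds and locks line 0x80
second = sp.request(0x80, "cpu1")  # same line: blocked until unlock
other = sp.request(0xC0, "cpu2")   # different line: proceeds independently
unblocked = sp.complete(0x80)      # cpu0 finishes; cpu1 takes the line
```

Requests to different memory lines do not block each other in this sketch; only requests that target a line already locked are queued.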
A serialization point can also be used in a multiple
processor, multiple cluster system where the processors in
the various clusters share a single address space. By using a
single address space, internal point-to-point links can be
used to significantly improve intercluster communication
over traditional external network based multiple cluster
systems. Various processors in various clusters send data
access requests to a memory controller associated with a
particular cluster such as a home cluster. The memory
controller can similarly serialize all data requests from the
different clusters. However, a serialization point in a
multiple processor, multiple cluster system may not be as
efficient as a serialization point in a multiple processor,
single cluster system. That is, delay resulting from factors
such as latency from transmitting between clusters can
adversely affect the response times for various data access
requests. It should be noted that delay also results from the
use of probes in a multiple processor environment.
Although delay in intercluster transactions in an
architecture using a shared memory space is significantly less than
the delay in conventional message passing environments
using external networks such as Ethernet or Token Ring,
even minimal delay is a significant factor. In some
applications, there may be millions of data access requests
from a processor in a fraction of a second. Any delay can
adversely impact processor performance.
According to various embodiments, speculative probing
is used to increase the efficiency of accessing data in a
multiple processor, multiple cluster system. A mechanism
for eliciting a response from a node to maintain cache
coherency in a system is referred to herein as a probe. In one
example, a mechanism for snooping a cache is referred to as
a probe. A response to a probe can be directed to the source
or target of the initiating request. Any mechanism for
sending probes to nodes associated with cache blocks before
a request associated with the probes is received at a
serialization point is referred to herein as speculative probing.
According to various embodiments, the reordering or
elimination of certain data access requests does not adversely
affect cache coherency. That is, the end value in the cache is
the same whether or not snooping occurs. For example, a
local processor attempting to read the cache data block can
be allowed to access the data block without sending the
requests through a serialization point in certain
circumstances. In one example, read access can be permitted when
the cache block is valid and the associated memory line is
not locked. Techniques for performing speculative probing
generally are described in U.S. application Ser. No.
10/106,426 titled Methods And Apparatus For Speculative Probing
At A Request Cluster, U.S. application Ser. No. 10/106,430
titled Methods And Apparatus For Speculative Probing With
Early Completion And Delayed Request, and U.S. application
Ser. No. 10/106,299 titled Methods And Apparatus For
Speculative Probing With Early Completion And Early
Request, the entireties of which are incorporated by
reference herein for all purposes. By completing a data access
transaction within a local cluster, the delay associated with
transactions in a multiple cluster system can be reduced or
eliminated.
The techniques of the present invention recognize that
other efficiencies can be achieved, particularly when
speculative probing cannot be completed at a local cluster. In one
example, a cache access request is forwarded from a local
cluster to a home cluster. A home cluster then proceeds to
send probes to remote clusters in the system. In typical
implementations, the home cluster gathers the probe
responses corresponding to the probe before sending an
aggregated response to the request cluster. The aggregated
response typically includes the results of the home cluster
probes and the results of the remote cluster probes. The
techniques of the present invention provide techniques for
more efficiently probing a remote cluster. In typical
implementations, a remote cluster is probed after a cache
access request is ordered at a home cluster serialization
point. The remote cluster then waits for the results of the
probe and sends the results back to the request cluster. In
some examples, the results are sent directly to the request
cluster or to the request cluster through the home cluster.
According to various embodiments, a speculative probe is
sent to the remote cluster first to begin the probing of the
remote nodes. When the probe transmitted after the request
is serialized arrives at the remote cluster, the results of the
speculative probe can be used to provide a faster response to
the request cluster.
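The remote-cluster behavior described above can be sketched, under stated assumptions, as follows. The class `RemoteClusterController`, its method names, the dictionary-based pending buffer, and the cache state strings are hypothetical and chosen only for illustration. The sketch also assumes that local cache state does not change between the speculative and the non-speculative probe, a case an actual controller would have to handle.

```python
class RemoteClusterController:
    """Hypothetical remote-cluster cache coherence controller (sketch only)."""

    def __init__(self, local_caches):
        # local_caches: node name -> {memory line -> cache state string}
        self.local_caches = local_caches
        self.pending = {}            # memory line -> speculative probe responses
        self.local_probes_issued = 0

    def _probe_local_nodes(self, line):
        # Ask every local node for its state of the line (one probe round).
        self.local_probes_issued += 1
        return {node: cache.get(line, "invalid")
                for node, cache in self.local_caches.items()}

    def on_speculative_probe(self, line):
        # Begin probing before the request is serialized at the home cluster,
        # and keep the responses in the pending buffer.
        self.pending[line] = self._probe_local_nodes(line)

    def on_non_speculative_probe(self, line):
        # If speculative results are buffered, answer from them and remove
        # the entry; otherwise fall back to probing the local nodes now.
        if line in self.pending:
            return self.pending.pop(line)
        return self._probe_local_nodes(line)


ctrl = RemoteClusterController({"cpu0": {0x100: "shared"}, "cpu1": {}})
ctrl.on_speculative_probe(0x100)                  # arrives early
response = ctrl.on_non_speculative_probe(0x100)   # answered from the buffer
```

In this sketch the local nodes are probed only once for the line: the non-speculative probe is answered from the buffered speculative responses, which is where the faster response would come from.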
FIG. 1A is a diagrammatic representation of one example
of a multiple cluster, multiple processor system that can use
the techniques of the present invention. Each processing
cluster 101, 103, 105, and 107 can include a plurality of
processors. The processing clusters 101, 103, 105, and 107
are connected to each other through point-to-point links
111a-f. In one embodiment, the multiple processors in the
multiple cluster architecture shown in FIG. 1A share the
same memory space. In this example, the point-to-point
links 111a-f are internal system connections that are used in
place of a traditional front-side bus to connect the multiple
processors in the multiple clusters 101, 103, 105, and 107.
The point-to-point links may support any point-to-point
coherence protocol.
FIG. 1B is a diagrammatic representation of another
example of a multiple cluster, multiple processor system that
can use the techniques of the present invention. Each
processing cluster 121, 123, 125, and 127 can be coupled to a
switch 131 through point-to-point links 141a-d. It should be
noted that using a switch and point-to-point links allows
implementation with fewer point-to-point links when
connecting multiple clusters in the system. A switch 131 can
include a processor with a coherence protocol interface.
According to various implementations, a multicluster
system shown in FIG. 1A is expanded using a switch 131 as
shown in FIG. 1B.
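The link counts in the two figures can be checked with a short calculation. The sketch below assumes that FIG. 1A connects every pair of clusters directly and that FIG. 1B uses one link per cluster to a single switch; the function names are hypothetical.

```python
def full_mesh_links(clusters: int) -> int:
    # One direct link per unordered pair of clusters: n * (n - 1) / 2.
    return clusters * (clusters - 1) // 2


def switch_links(clusters: int) -> int:
    # One link from each cluster to the single switch.
    return clusters


mesh_4 = full_mesh_links(4)  # 6, matching links 111a-f in FIG. 1A
star_4 = switch_links(4)     # 4, matching links 141a-d in FIG. 1B
mesh_8 = full_mesh_links(8)  # 28
star_8 = switch_links(8)     # 8
```

Under these assumptions the gap widens as clusters are added, which is consistent with the observation that a switch allows implementation with fewer point-to-point links.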
FIG. 2 is a diagrammatic representation of a multiple
processor cluster, such as the cluster 101 shown in FIG. 1A.
Cluster 200 includes processors 202a-202d, one or more
Basic I/O systems (BIOS) 204, a memory subsystem
comprising memory banks 206a-206d, point-to-point
communication links 208a-208e, and a service processor 212. The
point-to-point communication links are configured to allow
interconnections between processors 202a-202d, I/O switch
210, and cache coherence controller 230. The service
processor 212 is configured to allow communications with
processors 202a-202d, I/O switch 210, and cache coherence
controller 230 via a JTAG interface represented in FIG. 2 by
links 214a-214f. It should be noted that other interfaces are
supported. I/O switch 210 connects the rest of the system to
I/O adapters 216 and 220.
According to specific embodiments, the service processor
of the present invention has the intelligence to partition
system resources according to a previously specified
partitioning schema. The partitioning can be achieved through
direct manipulation of routing tables associated with the
system processors by the service processor which is made
possible by the point-to-point communication infrastructure.
The routing tables are used to control and isolate various
system resources, the connections between which are
defined therein. The service processor and computer system
`partitionin