`
(12) United States Patent                          (10) Patent No.:      US 7,653,790 B2
     Glasco                                         (45) Date of Patent:  *Jan. 26, 2010

(54) METHODS AND APPARATUS FOR RESPONDING TO A REQUEST CLUSTER

(76) Inventor: David B. Glasco, 10337 Ember Glen Dr., Austin, TX (US) 78726

( * ) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted
      under 35 U.S.C. 154(b) by 849 days. This patent is subject to a terminal disclaimer.

(21) Appl. No.: 10/145,438

(22) Filed: May 13, 2002

(65) Prior Publication Data: US 2003/0210655 A1, Nov. 13, 2003

(51) Int. Cl.: G06F 12/00 (2006.01)

(52) U.S. Cl.: 711/146; 711/118; 711/128; 711/141; 711/144; 711/145; 711/147; 711/148

(58) Field of Classification Search: 711/141, 711/144-147, 128.
     See application file for complete search history.

(56) References Cited

     U.S. PATENT DOCUMENTS

     5,195,089 A     3/1993  Sindhu et al.
     5,659,710 A     8/1997  Sherman et al.
     5,893,151 A *   4/1999  Merchant ................ 711/140
     5,958,019 A     9/1999  Hagersten et al.
     5,966,729 A *  10/1999  Phelps .................. 711/146
     6,038,644 A *   3/2000  Irie et al. ............. 711/141
     6,067,603 A     5/2000  Carpenter et al.
     6,141,692 A *  10/2000  Loewenstein et al. ...... 709/234
     6,167,492 A    12/2000  Keller et al. ........... 711/154
     6,292,705 B1    9/2001  Wang et al.
     6,295,583 B1 *  9/2001  Razdan et al. ........... 711/137
     6,336,169 B1    1/2002  Arimilli et al.
     6,338,122 B1 *  1/2002  Baumgartner et al. ...... 711/141
     6,351,791 B1    2/2002  Freerksen et al.
     6,374,331 B1 *  4/2002  Janakiraman et al. ...... 711/141
     6,385,705 B1    5/2002  Keller et al. ........... 711/154
     6,490,661 B1   12/2002  Keller et al. ........... 711/150
     6,615,319 B2    9/2003  Khare et al.
     6,631,401 B1   10/2003  Keller et al.
     6,631,448 B2 * 10/2003  Weber ................... 711/141
     6,633,945 B1 * 10/2003  Fu et al. ............... 710/316
     6,728,843 B1 *  4/2004  Pong et al. ............. 711/150
     6,754,782 B2    6/2004  Arimilli et al.
     6,760,819 B2    7/2004  Dhong et al.

     (Continued)

     OTHER PUBLICATIONS

     Alan Charlesworth, "Starfire: extending the SMP envelope," published Feb. 1998 by IEEE, pp. 39-49.

     (Continued)

Primary Examiner: Tuan V. Thai
Assistant Examiner: Zhuo H. Li

(57) ABSTRACT

According to the present invention, methods and apparatus are provided for increasing the
efficiency of data access in a multiple processor, multiple cluster system. A home cluster of
processors receives a cache access request from a request cluster. The home cluster includes
mechanisms for instructing probed remote clusters to respond to the request cluster instead of
to the home cluster. The home cluster can also include mechanisms for reducing the number of
probes sent to remote clusters. Techniques are also included for providing the requesting
cluster with information to determine the number of responses to be transmitted to the
requesting cluster as a result of the reduction in the number of probes sent at the home
cluster.

35 Claims, 15 Drawing Sheets

[Front-page figure: Request Cluster 1000 (CPU 1001-1), Home Cluster 1020, Remote Cluster 1040, and Remote Cluster 1060.]
`
`
`
US 7,653,790 B2
Page 2

     U.S. PATENT DOCUMENTS

     6,799,252 B1 *      9/2004  Bauman .................. 711/149
     6,839,808 B2 *      1/2005  Gruner et al.
     6,973,543 B1 *     12/2005  Hughes
     2002/0053004 A1 *   5/2002  Pong
     2003/0095557 A1 *   5/2003  Keller et al. ........... 370/412

     OTHER PUBLICATIONS

HyperTransport I/O Link Specification Revision 1.03, HyperTransport Technology Consortium, Oct. 10, 2001, Copyright © 2001 HyperTransport Technology Consortium.
U.S. Appl. No. 10/106,426, Office Action dated Sep. 22, 2004.
U.S. Appl. No. 10/106,426, Office Action dated Mar. 7, 2005.
U.S. Appl. No. 10/106,426, Office Action dated Jul. 21, 2005.
U.S. Appl. No. 10/106,426, Office Action dated Nov. 21, 2005.
U.S. Appl. No. 10/106,430, Office Action dated Sep. 23, 2004.
U.S. Appl. No. 10/106,430, Office Action dated Mar. 10, 2005.
U.S. Appl. No. 10/106,430, Office Action dated Jul. 21, 2005.
U.S. Appl. No. 10/106,430, Office Action dated Nov. 2, 2005.
U.S. Appl. No. 10/106,299, Office Action dated Sep. 22, 2004.
U.S. Appl. No. 10/106,299, Office Action dated Mar. 10, 2005.
U.S. Appl. No. 10/106,299, Office Action dated Jul. 21, 2005.
U.S. Appl. No. 10/106,299, Office Action dated Nov. 21, 2005.
U.S. Appl. No. 10/145,439, Office Action dated Nov. 21, 2005.
U.S. Appl. No. 10/106,426, filed Mar. 22, 2002, Notice of Allowance mailed Apr. 21, 2006.
U.S. Appl. No. 10/106,426, filed Mar. 22, 2002, Allowed claims.
U.S. Appl. No. 10/106,430, filed Mar. 22, 2002, Notice of Allowance mailed Apr. 21, 2006.
U.S. Appl. No. 10/106,430, filed Mar. 22, 2002, Allowed claims.
U.S. Appl. No. 10/106,299, filed Mar. 22, 2002, Notice of Allowance mailed Apr. 28, 2006.
U.S. Appl. No. 10/106,299, filed Mar. 22, 2002, Allowed claims.
U.S. Appl. No. 10/145,439, filed May 13, 2002, Office Action mailed Aug. 13, 2007.
U.S. Appl. No. 10/145,439, filed May 13, 2002, Office Action mailed Apr. 17, 2007.
U.S. Appl. No. 10/145,439, filed May 13, 2002, Office Action mailed Aug. 22, 2006.
U.S. Appl. No. 10/145,439, filed May 13, 2002, Office Action mailed May 5, 2006.
U.S. Appl. No. 10/145,439, Notice of Allowance mailed Feb. 26, 2008.
U.S. Appl. No. 10/145,439, Allowed Claims, as of Feb. 26, 2008.

* cited by examiner
`
`
`
[Sheet 1 of 15: FIGS. 1A and 1B, systems having multiple clusters. FIG. 1A shows Processing Clusters 101, 103, 105, and 107 connected by point-to-point links 111a-f. FIG. 1B shows Processing Clusters 121, 123, 125, and 127 coupled to Switch 131 through point-to-point links 141a-d.]
`
`
`
`01026,2
`
`2w_h__S
`
`US 7,653,790 B2
`
`U
`
`S
`
`.NPBME
`
`P._STK.
`
`4|.lmneon38M£2
`
`
`
`m.§mEUBofimm
`
`
`
`mca5:380JpmomHommuooum3:29.500
`
`unomo
`
`pmat:on-
`
`
`
`
`
`
`
`BGNONuommooobmONONuommoooi
`
`3525
`
`moi'pofiofim
`
`«om
`
`Sowm
`
`ofiu§§mOn
`
`oom
`
`
`
`US. Patent
`
`Jan. 26, 2010
`
`Sheet 3 0f 15
`
`US 7,653,790 B2
`
`
`
[Sheet 4 of 15: FIG. 4, a transaction flow for a data access request from a processor in a single cluster.]

[Sheet 5 of 15: FIG. 5A, cache coherence controller functionality (local and non-local nodes).]
`
[Sheet 6 of 15: FIG. 5B, cache coherence controller functionality.]

[Sheet 7 of 15: FIG. 5C, cache coherence controller functionality (non-local nodes).]

[Sheet 8 of 15: FIG. 5D, cache coherence controller functionality.]
`
`
[Sheet 9 of 15: FIG. 6, a transaction flow for a remote cluster sending a probe response to a home cluster (request cluster, home cluster, remote cluster).]

[Sheet 10 of 15: FIG. 7, a transaction flow for a remote cluster sending a probe response to a requesting cluster (request cluster, home cluster, remote cluster).]
`
`
[Sheet 11 of 15: FIG. 8, flow diagram "Tag Handling At A Home Cache Coherence Controller": 801 Receive Cache Access Request; 803 Generate New Tag; 805 Maintain Tag In Pending Buffer; 807 Forward Request To Serialization Point; 811 Receive Probe From Serialization Point; decision: Is The Probe Resulting From A Locally Generated Request? (yes: 821 Use Newly Generated Tag From Home Cluster Tag Space; no: Use Tag Corresponding To Tag From Request Cluster); 823 Broadcast Probes To Nonlocal Clusters With Selected Tag Information.]
`
`
[Sheet 12 of 15: FIG. 9, flow diagram "Tag Handling At A Request Cache Coherence Controller": 901 Send Cache Access Request To Home Cluster; 903 Receive Probe From Home Cluster; 905 Probe Local Nodes; 907 Receive A Plurality Of Probe Responses; 909 Signal Processor Associated With The Request After Expected Probe Responses Are Received.]
`
`
[Sheet 13 of 15: FIG. 10, a transaction flow for a remote cluster sending a probe response to a requesting cluster (request cluster, home cluster, remote cluster).]
`
`
[Sheet 14 of 15: FIG. 11, flow diagram "Tag Management Before Probe Transmission" in a system with a coherence directory: 1101 Receive Cache Access Request; 1105 Maintain Tag In Pending Buffer; 1107 Forward Request To Serialization Point; 1111 Receive Probe From Serialization Point; decision: Is The Probe Resulting From A Locally Generated Request? (yes: 1121 Use Newly Generated Tag From Home Cluster Tag Space; no: Use Tag Corresponding To Tag From Request Cluster); 1123 Select Clusters To Send Probes To Based On Directory; 1131 Send Probe To Home Cluster With Coherence Information; 1133 Forward Probes To Selected Clusters With Tag Information.]
`
`
[Sheet 15 of 15: FIG. 12, flow diagram "Tag Handling Upon Receiving Probe Responses": 1201 Send Cache Access Request To Home Cluster; 1203 Receive Probe From Home Cluster; 1205 Extract Information From Probe To Determine Number Of Expected Probe Responses; 1207 Probe Local Nodes; 1209 Receive A Plurality Of Probe Responses; 1211 Signal Processor Associated With The Request After Expected Probe Responses Are Received.]
`
`
`
`METHODS AND APPARATUS FOR
`RESPONDING TO A REQUEST CLUSTER
`
`CROSS-REFERENCE TO RELATED
`APPLICATIONS
`
The present application is related to filed U.S. application Ser. No. 10/106,426 titled Methods And Apparatus For Speculative Probing At A Request Cluster, U.S. application Ser. No. 10/106,430 titled Methods And Apparatus For Speculative Probing With Early Completion And Delayed Request, and U.S. application Ser. No. 10/106,299 titled Methods And Apparatus For Speculative Probing With Early Completion And Early Request, the entireties of which are incorporated by reference herein for all purposes. The present application is also related to concurrently filed U.S. application Ser. No. 10/145,439 titled Methods And Apparatus For Responding To A Request Cluster by David B. Glasco, the entirety of which is incorporated by reference for all purposes.
`
`BACKGROUND OF THE INVENTION
`
1. Field of the Invention
The present invention generally relates to accessing data in a multiple processor system. More specifically, the present invention provides techniques for improving data access efficiency while maintaining cache coherency in a multiple processor system having a multiple cluster architecture.
2. Description of Related Art
Data access in multiple processor systems can raise issues relating to cache coherency. Conventional multiple processor computer systems have processors coupled to a system memory through a shared bus. In order to optimize access to data in the system memory, individual processors are typically designed to work with cache memory. In one example, each processor has a cache that is loaded with data that the processor frequently accesses. The cache is read or written by a processor. However, cache coherency problems arise because multiple copies of the same data can co-exist in systems having multiple processors and multiple cache memories. For example, a frequently accessed data block corresponding to a memory line may be loaded into the cache of two different processors. In one example, if both processors attempt to write new values into the data block at the same time, different data values may result. One value may be written into the first cache while a different value is written into the second cache. A system might then be unable to determine what value to write through to system memory.
A variety of cache coherency mechanisms have been developed to address such problems in multiprocessor systems. One solution is to simply force all processor writes to go through to memory immediately and bypass the associated cache. The write requests can then be serialized before overwriting a system memory line. However, bypassing the cache significantly decreases the efficiency gained by using a cache. Other cache coherency mechanisms have been developed for specific architectures. In a shared bus architecture, each processor checks or snoops on the bus to determine whether it can read or write a shared cache block. In one example, a processor only writes an object when it owns or has exclusive access to the object. Each corresponding cache object is then updated to allow processors access to the most recent version of the object.
Bus arbitration is used when both processors attempt to write the same shared data block in the same clock cycle. Bus arbitration logic decides which processor gets the bus first.
`
`10
`
`20
`
`25
`
`30
`
`35
`
`40
`
`45
`
`50
`
`55
`
`60
`
`65
`
`2
Although cache coherency mechanisms such as bus arbitration are effective, using a shared bus limits the number of processors that can be implemented in a single system with a single memory space.
Other multiprocessor schemes involve individual processor, cache, and memory systems connected to other processors, cache, and memory systems using a network backbone such as Ethernet or Token Ring. Multiprocessor schemes involving separate computer systems, each with its own address space, can avoid many cache coherency problems because each processor has its own associated memory and cache. When one processor wishes to access data on a remote computing system, communication is explicit. Messages are sent to move data to another processor and messages are received to accept data from another processor using standard network protocols such as TCP/IP. Multiprocessor systems using explicit communication, including transactions such as sends and receives, are referred to as systems using multiple private memories. By contrast, multiprocessor systems using implicit communication, including transactions such as loads and stores, are referred to herein as using a single address space.
Multiprocessor schemes using separate computer systems allow more processors to be interconnected while minimizing cache coherency problems. However, it would take substantially more time to access data held by a remote processor using a network infrastructure than it would take to access data held by a processor coupled to a system bus. Furthermore, valuable network bandwidth would be consumed moving data to the proper processors. This can negatively impact both processor and network performance.
Performance limitations have led to the development of a point-to-point architecture for connecting processors in a system with a single memory space. In one example, individual processors can be directly connected to each other through a plurality of point-to-point links to form a cluster of processors. Separate clusters of processors can also be connected. The point-to-point links significantly increase the bandwidth for coprocessing and multiprocessing functions. However, using a point-to-point architecture to connect multiple processors in a multiple cluster system sharing a single memory space presents its own problems.
Consequently, it is desirable to provide techniques for improving data access and cache coherency in systems having multiple clusters of multiple processors connected using point-to-point links.
`
`SUMMARY OF THE INVENTION
`
According to the present invention, methods and apparatus are provided for increasing the efficiency of data access in a multiple processor, multiple cluster system. A home cluster of processors receives a cache access request from a request cluster. The home cluster includes mechanisms for instructing probed remote clusters to respond to the request cluster instead of to the home cluster. The home cluster can also include mechanisms for reducing the number of probes sent to remote clusters. Techniques are also included for providing the requesting cluster with information to determine the number of responses to be transmitted to the requesting cluster as a result of the reduction in the number of probes sent from the home cluster.
According to various embodiments, a computer system is provided. A home cluster includes a first plurality of processors and a home cache coherence controller. The first plurality of processors and the home cache coherence controller are interconnected in a point-to-point architecture. The home cache coherence controller is configured to send a probe to a remote cluster upon receiving a cache access request from a request cluster. The probe includes information for the request cache coherence controller to determine the number of probe responses corresponding to the cache access request to be transmitted to the request cluster.
According to other embodiments, a method for managing data access is provided. A request is transmitted to a home cluster comprising a plurality of processors coupled to a home cache coherence controller. A probe is received from the home cluster. The probe corresponds to the request and includes information for determining the number of expected probe responses. A plurality of probe responses is received from a plurality of clusters.
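Purely as an illustrative sketch (the type names, fields, and message layout below are assumptions made for this example, not taken from the patent), the following C++ fragment shows one way a request-side coherence controller could use a response count carried in a probe to decide when a cache access request is complete:

    // Hypothetical sketch: a probe from the home cluster carries the number of
    // probe responses the request cluster should expect, so remote clusters can
    // reply to the requester directly without passing through the home cluster.
    #include <cstdint>
    #include <iostream>

    struct Probe {
        std::uint64_t memory_line;        // memory line being probed
        std::uint32_t expected_responses; // how many responses the requester waits for
    };

    class RequestClusterController {
    public:
        void on_probe(const Probe& p) {
            expected_ = p.expected_responses;  // extracted from the home cluster probe
            received_ = 0;
        }
        void on_probe_response() {
            if (++received_ == expected_) {
                // All expected responses arrived: signal the processor that
                // issued the original cache access request.
                std::cout << "request complete after " << received_ << " responses\n";
            }
        }
    private:
        std::uint32_t expected_ = 0;
        std::uint32_t received_ = 0;
    };

    int main() {
        RequestClusterController controller;
        controller.on_probe({0x1000, 3});  // home cluster indicates three responses
        controller.on_probe_response();
        controller.on_probe_response();
        controller.on_probe_response();    // third response completes the request
    }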
A further understanding of the nature and advantages of the present invention may be realized by reference to the remaining portions of the specification and the drawings.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
The invention may best be understood by reference to the following description taken in conjunction with the accompanying drawings, which are illustrative of specific embodiments of the present invention.
FIGS. 1A and 1B are diagrammatic representations depicting a system having multiple clusters.
FIG. 2 is a diagrammatic representation of a cluster having a plurality of processors.
FIG. 3 is a diagrammatic representation of a cache coherence controller.
FIG. 4 is a diagrammatic representation showing a transaction flow for a data access request from a processor in a single cluster.
FIGS. 5A-5D are diagrammatic representations showing cache coherence controller functionality.
FIG. 6 is a diagrammatic representation depicting a transaction flow for a remote cluster sending a probe response to a home cluster.
FIG. 7 is a diagrammatic representation showing a transaction flow for a remote cluster sending a probe response to a requesting cluster.
FIG. 8 is a flow process diagram showing tag management before probe transmission to remote nodes.
FIG. 9 is a process flow diagram showing a technique for receiving probe responses.
FIG. 10 is a diagrammatic representation showing a transaction flow for a remote cluster sending a probe response to a requesting cluster.
FIG. 11 is a flow process diagram showing tag management before probe transmission to remote nodes in a system with a coherence directory.
FIG. 12 is a process flow diagram showing a technique for receiving probe responses in a system with a coherence directory.
`
`DETAILED DESCRIPTION OF SPECIFIC
`EMBODIMENTS
`
Reference will now be made in detail to some specific embodiments of the invention including the best modes contemplated by the inventors for carrying out the invention. Examples of these specific embodiments are illustrated in the accompanying drawings. While the invention is described in conjunction with these specific embodiments, it will be understood that it is not intended to limit the invention to the described embodiments. On the contrary, it is intended to cover alternatives, modifications, and equivalents as may be
`20
`
`25
`
`30
`
`35
`
`40
`
`45
`
`50
`
`55
`
`60
`
`65
`
`4
included within the spirit and scope of the invention as defined by the appended claims. Multi-processor architectures having point-to-point communication among their processors are suitable for implementing specific embodiments of the present invention. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. The present invention may be practiced without some or all of these specific details. Well known process operations have not been described in detail in order not to unnecessarily obscure the present invention. Furthermore, the present application's reference to a particular singular entity includes the possibility that the methods and apparatus of the present invention can be implemented using more than one entity, unless the context clearly dictates otherwise.
Techniques are provided for increasing data access efficiency in a multiple processor, multiple cluster system. In a point-to-point architecture, a cluster of processors includes multiple processors directly connected to each other through point-to-point links. By using point-to-point links instead of a conventional shared bus or external network, multiple processors are used efficiently in a system sharing the same memory space. Processing and network efficiency are also improved by avoiding many of the bandwidth and latency limitations of conventional bus and external network based multiprocessor architectures. According to various embodiments, however, linearly increasing the number of processors in a point-to-point architecture leads to an exponential increase in the number of links used to connect the multiple processors. In order to reduce the number of links used and to further modularize a multiprocessor system using a point-to-point architecture, multiple clusters are used.
According to various embodiments, the multiple processor clusters are interconnected using a point-to-point architecture. Each cluster of processors includes a cache coherence controller used to handle communications between clusters. In one embodiment, the point-to-point architecture used to connect processors is used to connect clusters as well.
By using a cache coherence controller, multiple cluster systems can be built using processors that may not necessarily support multiple clusters. Such a multiple cluster system can be built by using a cache coherence controller to represent non-local nodes in local transactions so that local nodes do not need to be aware of the existence of nodes outside of the local cluster. More detail on the cache coherence controller will be provided below.
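As a rough illustration of this proxying role (the class and function names below are invented for the sketch; the patent does not prescribe an implementation), a cache coherence controller can present itself to local nodes as one more local node while hiding the remote clusters behind it:

    // Hypothetical sketch: local nodes probe the coherence controller exactly as
    // they would probe another local node; the controller forwards traffic to
    // remote clusters and later answers on their behalf.
    #include <cstdint>
    #include <iostream>

    struct ProbeResponse {
        bool line_cached_remotely;  // aggregate remote state reported back locally
    };

    class CacheCoherenceController {
    public:
        // Called by local nodes the same way they would probe an ordinary local node.
        ProbeResponse probe(std::uint64_t line) {
            forward_to_remote_clusters(line);  // intercluster protocol hidden here
            return ProbeResponse{false};       // aggregated answer for the local cluster
        }
    private:
        void forward_to_remote_clusters(std::uint64_t line) {
            std::cout << "forwarding probe for line 0x" << std::hex << line
                      << std::dec << " to remote clusters\n";
        }
    };

    int main() {
        CacheCoherenceController controller;
        ProbeResponse r = controller.probe(0x2000);  // local node never sees remote clusters
        std::cout << "remote copy present: " << r.line_cached_remotely << "\n";
    }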
In a single cluster system, cache coherency can be maintained by sending all data access requests through a serialization point. Any mechanism for ordering data access requests is referred to herein as a serialization point. One example of a serialization point is a memory controller. Various processors in the single cluster system send data access requests to the memory controller. In one example, the memory controller is configured to serialize or lock the data access requests so that only one data access request for a given memory line is allowed at any particular time. If another processor attempts to access the same memory line, the data access attempt is blocked until the memory line is unlocked. The memory controller allows cache coherency to be maintained in a multiple processor, single cluster system.
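A minimal sketch of such a serialization point, assuming a simple one-lock-per-memory-line model (the data structures are illustrative only, not the memory controller's actual design):

    // Hypothetical sketch: the memory controller allows only one outstanding data
    // access request per memory line; later requests are queued until the line is
    // unlocked.
    #include <cstdint>
    #include <deque>
    #include <iostream>
    #include <unordered_map>
    #include <unordered_set>

    class MemoryController {
    public:
        // Returns true if the request may proceed now, false if it is blocked.
        bool request_access(std::uint64_t line, int requester) {
            if (locked_.count(line)) {           // line already locked: block this requester
                waiting_[line].push_back(requester);
                return false;
            }
            locked_.insert(line);
            return true;
        }

        void release(std::uint64_t line) {
            locked_.erase(line);
            auto& queue = waiting_[line];
            if (!queue.empty()) {                // grant the line to the next serialized requester
                int next = queue.front();
                queue.pop_front();
                locked_.insert(line);
                std::cout << "line unlocked, granted to requester " << next << "\n";
            }
        }

    private:
        std::unordered_set<std::uint64_t> locked_;
        std::unordered_map<std::uint64_t, std::deque<int>> waiting_;
    };

    int main() {
        MemoryController mc;
        std::cout << mc.request_access(0x40, 1) << "\n";  // 1: request proceeds
        std::cout << mc.request_access(0x40, 2) << "\n";  // 0: blocked, same memory line
        mc.release(0x40);                                 // requester 2 is granted the line
    }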
A serialization point can also be used in a multiple processor, multiple cluster system where the processors in the various clusters share a single address space. By using a single address space, internal point-to-point links can be used to significantly improve intercluster communication over traditional external network based multiple cluster systems. Various processors in various clusters send data access requests to a memory controller associated with a particular cluster such as a home cluster. The memory controller can similarly serialize all data requests from the different clusters. However, a serialization point in a multiple processor, multiple cluster system may not be as efficient as a serialization point in a multiple processor, single cluster system. That is, delay resulting from factors such as latency from transmitting between clusters can adversely affect the response times for various data access requests. It should be noted that delay also results from the use of probes in a multiple processor environment.
Although delay in intercluster transactions in an architecture using a shared memory space is significantly less than the delay in conventional message passing environments using external networks such as Ethernet or Token Ring, even minimal delay is a significant factor. In some applications, there may be millions of data access requests from a processor in a fraction of a second. Any delay can adversely impact processor performance.
According to various embodiments, speculative probing is used to increase the efficiency of accessing data in a multiple processor, multiple cluster system. A mechanism for eliciting a response from a node to maintain cache coherency in a system is referred to herein as a probe. In one example, a mechanism for snooping a cache is referred to as a probe. A response to a probe can be directed to the source or target of the initiating request. Any mechanism for sending probes to nodes associated with cache blocks before a request associated with the probes is received at a serialization point is referred to herein as speculative probing.
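By way of illustration only (the field names are assumptions made for this sketch, not part of the patent), a probe descriptor in such a protocol might record both where its response should be sent and whether it was issued speculatively:

    // Hypothetical sketch of a probe descriptor: the response can be directed to
    // the source of the initiating request (the requester) or to its target (for
    // example, the home memory controller), and the probe may be speculative,
    // i.e. sent before the request reaches the serialization point.
    #include <cstdint>
    #include <iostream>

    enum class ResponseTarget { Source, Target };

    struct CoherenceProbe {
        std::uint64_t  line;         // memory line whose cached copies are snooped
        ResponseTarget reply_to;     // where the probed node sends its response
        bool           speculative;  // issued ahead of the serialization point?
    };

    int main() {
        CoherenceProbe p{0x3000, ResponseTarget::Source, true};
        std::cout << "probe line 0x" << std::hex << p.line << std::dec
                  << ", reply to "
                  << (p.reply_to == ResponseTarget::Source ? "source" : "target")
                  << ", speculative=" << p.speculative << "\n";
    }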
According to various embodiments, the reordering or elimination of certain data access requests does not adversely affect cache coherency. That is, the end value in the cache is the same whether or not snooping occurs. For example, a local processor attempting to read the cached data block can be allowed to access the data block without sending the request through a serialization point in certain circumstances. In one example, read access can be permitted when the cache block is valid and the associated memory line is not locked. Techniques for performing speculative probing generally are described in U.S. application Ser. No. 10/106,426 titled Methods And Apparatus For Speculative Probing At A Request Cluster, U.S. application Ser. No. 10/106,430 titled Methods And Apparatus For Speculative Probing With Early Completion And Delayed Request, and U.S. application Ser. No. 10/106,299 titled Methods And Apparatus For Speculative Probing With Early Completion And Early Request, the entireties of which are incorporated by reference herein for all purposes. By completing a data access transaction within a local cluster, the delay associated with transactions in a multiple cluster system can be reduced or eliminated.
The techniques of the present invention recognize that other efficiencies can be achieved, particularly when speculative probing cannot be completed at a local cluster. In one example, a cache access request is forwarded from a local cluster to a home cluster. The home cluster then proceeds to send probes to remote clusters in the system. In typical implementations, the home cluster gathers the probe responses corresponding to the probe before sending an aggregated response to the request cluster. The aggregated response typically includes the results of the home cluster probes and the results of the remote cluster probes. The techniques of the present invention allow responses to be aggregated more efficiently at the request cluster instead of at the home cluster. According to various embodiments, remote clusters send probe responses directly to the request cluster instead of sending the probe responses to the request cluster through a
`35
`
`40
`
`45
`
`50
`
`55
`
`60
`
`65
`
`6
home cluster. In one embodiment, techniques are provided for enabling a home cluster to send a reduced number of probes to remote clusters. Mechanisms are provided for allowing a home cluster to inform the request cluster that a reduced number of probes are being transmitted. The mechanisms can be implemented in a manner entirely transparent to remote clusters.
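One hedged sketch of the home-cluster side combines both ideas: probe only the clusters that may hold the line, tell them to answer the requester directly, and advertise the reduced response count to the request cluster. The directory format and all names below are assumptions made for this illustration; the figures described later outline the actual flows.

    // Hypothetical sketch: the home cache coherence controller consults a
    // directory to probe only clusters that may cache the line, directs their
    // responses to the request cluster, and tells the requester how many
    // responses to expect.
    #include <cstdint>
    #include <iostream>
    #include <vector>

    struct DirectoryEntry {
        std::vector<int> sharer_clusters;  // clusters that may hold a copy of the line
    };

    struct ProbeHeader {
        std::uint64_t line;
        int respond_to_cluster;   // the request cluster, not the home cluster
        int expected_responses;   // lets the requester detect completion
    };

    ProbeHeader build_probes(const DirectoryEntry& entry, std::uint64_t line,
                             int request_cluster) {
        ProbeHeader header{line, request_cluster,
                           static_cast<int>(entry.sharer_clusters.size())};
        for (int cluster : entry.sharer_clusters) {
            // Each probed remote cluster is told to reply to the request cluster.
            std::cout << "probe cluster " << cluster << ", reply to cluster "
                      << request_cluster << "\n";
        }
        return header;
    }

    int main() {
        DirectoryEntry entry{{2, 3}};                    // only two remote sharers
        ProbeHeader h = build_probes(entry, 0x2000, 0);  // request came from cluster 0
        std::cout << "request cluster expects " << h.expected_responses
                  << " responses\n";
    }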
FIG. 1A is a diagrammatic representation of one example of a multiple cluster, multiple processor system that can use the techniques of the present invention. Each processing cluster 101, 103, 105, and 107 can include a plurality of processors. The processing clusters 101, 103, 105, and 107 are connected to each other through point-to-point links 111a-f. In one embodiment, the multiple processors in the multiple cluster architecture shown in FIG. 1A share the same memory space. In this example, the point-to-point links 111a-f are internal system connections that are used in place of a traditional front-side bus to connect the multiple processors in the multiple clusters 101, 103, 105, and 107. The point-to-point links may support any point-to-point coherence protocol.
FIG. 1B is a diagrammatic representation of another example of a multiple cluster, multiple processor system that can use the techniques of the present invention. Each processing cluster 121, 123, 125, and 127 can be coupled to a switch 131 through point-to-point links 141a-d. It should be noted that using a switch and point-to-point links allows implementation with fewer point-to-point links when connecting multiple clusters in the system. A switch 131 can include a processor with a coherence protocol interface. According to various implementations, a multicluster system shown in FIG. 1A is expanded using a switch 131 as shown in FIG. 1B.
FIG. 2 is a diagrammatic representation of a multiple processor cluster, such as the cluster 101 shown in FIG. 1A. Cluster 200 includes processors 202a-202d, one or more Basic I/O systems (BIOS) 204, a memory subsystem comprising memory banks 206a-206d, point-to-point communication links 208a-208e, and a service processor 212. The point-to-point communication links are configured to allow interconnections between processors 202a-202d, I/O switch 210, and cache coherence controller 230. The service processor 212 is configured to allow communications with processors 202a-202d, I/O switch 210, and cache coherence controller 230 via a JTAG interface represented in FIG. 2 by links 214a-214f. It should be noted that other interfaces are supported. I/O switch 210 connects the rest of the system to I/O adapters 216 and 220.
According to specific embodiments, the service processor of the present invention has the intelligence to partition system resources according to a previously specified partitioning schema. The partitioning can be achieved through direct manipulation of routing tables associated with the system processors by the service processor, which is made possible by the point-to-point communication infrastructure. The routing tables are used to control and isolate various system resources, the connections between which are defined therein. The service processor and computer system partitioning are described in patent application Ser. No. 09/932,456 titled Computer System Partitioning Using Data Transfer Routing Mechanism, filed on Aug. 16, 2001, the entirety of which is incorporated by reference for all purposes.
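As a purely illustrative sketch of routing-table manipulation for partitioning (the table layout is an assumption made here; the referenced application defines the actual mechanism), removing cross-partition entries from per-node routing tables isolates the corresponding resources:

    // Hypothetical sketch: each node's routing table maps destination nodes to
    // outgoing links; erasing entries that cross a partition boundary isolates
    // the partition's resources from the rest of the system.
    #include <iostream>
    #include <map>
    #include <set>

    using RoutingTable = std::map<int /*destination node*/, int /*outgoing link*/>;

    void isolate_partition(std::map<int, RoutingTable>& tables,
                           const std::set<int>& partition) {
        for (auto& [node, table] : tables) {
            for (auto it = table.begin(); it != table.end();) {
                bool crosses = partition.count(node) != partition.count(it->first);
                if (crosses) {
                    it = table.erase(it);  // drop cross-partition routes
                } else {
                    ++it;
                }
            }
        }
    }

    int main() {
        std::map<int, RoutingTable> tables{
            {0, {{1, 0}, {2, 1}}},   // node 0 can reach nodes 1 and 2
            {1, {{0, 0}, {2, 1}}},
            {2, {{0, 0}, {1, 1}}},
        };
        isolate_partition(tables, {0, 1});  // partition containing nodes 0 and 1
        std::cout << "node 0 routes remaining: " << tables[0].size() << "\n";  // prints 1
    }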
The processors 202a-d are also coupled to a cache coherence controller 230 through point-to-point links 232a-d. Any mechanism or apparatus that can be used to provide communication between multiple processor clusters while maintaining cache coherence is referred to herein as a cache coherence controller. The cache coherence controller 230 can be coupled to cache coherence controllers associated with other multiprocessor clusters. It should be noted that there can be more than one cache coherence controller in one cluster. The cache coherence controller 230 communicates with b