US007003633B2

Glasco

(10) Patent No.: US 7,003,633 B2
(45) Date of Patent: *Feb. 21, 2006
`(54) METHODS AND APPARATUS FOR
`MANAGING PROBE REQUESTS
`
`(75)
`
`Inventor: David B. Glasco, Austin, TX (US)
`
`(73) Assignee: Newisys, Inc., Austin, TX (US)
`
( * ) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 205 days.

This patent is subject to a terminal disclaimer.
`
`(21) Appl. No.: 10/288,347
`
`(22)
`
`Filed:
`
`Nov. 4, 2002
`
`(65)
`
`Prior Publication Data
`
`US 2004/0088492 A1
`
`May 6, 2004
`
`(51)
`
`Int. Cl.
`(2006.01)
`0an 12/00
`(52) US. Cl.
`....................... 711/146; 711/141, 709/216;
`709/218
`
`(58) Field of Classification Search ................ 711/141,
`711/146, 144, 145; 709/206, 213, 216, 217,
`709/218, 219
`See application file for complete search history.
`
`(56)
`
`References Cited
`
`6,085,295 A
`6,108,737 A
`6,122,715 A
`6,148,378 A
`6,167,492 A
`6,173,393 B1
`6,189,078 B1
`6,192,451 B1
`6,205,520 B1
`6,209,055 B1
`6,292,705 B1
`6,292,906 B1
`6,330,643 B1
`6,334,172 B1
`
`7/2000 Ekanadham et 211.
`8/2000 Sharma et 211.
`9/2000 Palanca et 211.
`.............. 711/147
`11/2000 Bordaz et a1.
`................ 711/154
`12/2000 Keller et al.
`1/2001 Palanca et 211.
`2/2001 Bauman et a1.
`2/2001 Arimilli et al.
`3/2001 Palanca et 211.
`3/2001 Van Doren et 211.
`9/2001 Wang et 211.
`9/2001 Fu et 211.
`12/2001 Arimilli et al.
`12/2001 Arimilli et al.
`
`
`
`...... 711/141
`............. 711/144
`
`............. 711/156
`
`...... 711/141
`
(Continued)

FOREIGN PATENT DOCUMENTS

WO    WO 02/39242    5/2002
`
OTHER PUBLICATIONS

Bilir, E.E.; Dickson, R.M.; Ying Hu; Plakal, M.; Sorin, D.J.; Hill, M.D.; Wood, D.A., "Multicast snooping: a new coherence method using a multicast address network," Proceedings of the 26th International Symposium on Computer Architecture, May 2-4, 1999.*
`
`(Continued)
`
Primary Examiner—Brian R. Peugh
(74) Attorney, Agent, or Firm—Beyer Weaver & Thomas, LLP

U.S. PATENT DOCUMENTS

5,195,089 A      3/1993  Sindhu
5,394,555 A      2/1995  Hunter et al.
5,524,212 A *    6/1996  Somani et al.
5,692,123 A     11/1997  Logghe
5,751,995 A *    5/1998  Sarangdhar ................. 711/145
5,829,032 A     10/1998  Komuro et al.
5,893,151 A *    4/1999  Merchant ................... 711/140
6,018,791 A      1/2000  Arimilli et al. ............ 711/141
6,038,652 A      3/2000  Van Huben et al. ........... 712/21
6,052,769 A      4/2000  Huff et al.
6,067,603 A      5/2000  Carpenter et al.
6,073,210 A      6/2000  Palanca et al.

(57) ABSTRACT
`
`According to the present invention, methods and apparatus
`are provided for increasing the efficiency of data access in a
`multiple processor, multiple cluster system. Mechanisms for
`reducing the number of transactions in a multiple cluster
`system are provided. In one example, probe filter informa-
`tion is used to limit the number of probe requests transmitted
`to request and remote clusters.
`
`39 Claims, 17 Drawing Sheets
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
[Representative drawing: probe request handling using directory and probe filter information]

APPLE 1023
`
`
`US 7,003,633 B2
`Page 2
`
Sorin, D.J.; Plakal, M.; Condon, A.E.; Hill, M.D.; Martin, M.M.K.; Wood, D.A., "Specifying and verifying a broadcast and a multicast snooping cache coherence protocol," IEEE Transactions on Parallel and Distributed Systems, vol. 13, issue 6, Jun. 2002.*
Kim, et al., "Power-aware Partitioned Cache Architectures," © 2001 ACM, p. 64-67.*
Powell, et al., "Reducing Set-Associative Cache Energy Via Way-Prediction and Selective Direct-Mapping," © 2001 IEEE, p. 54-65.*
HyperTransport™ I/O Link Specification Revision 1.03, HyperTransport™ Consortium, Oct. 10, 2001, Copyright © HyperTransport Technology Consortium.
PCT Search Report PCT/US03/34756, Int'l filing date Oct. 30, 2003, Search report mailed Dec. 16, 2004.
U.S. Office Action mailed Sep. 22, 2004, from related U.S. Appl. No. 10/106,426.
U.S. Office Action mailed Mar. 7, 2005, from related U.S. Appl. No. 10/106,426.
U.S. Office Action mailed Jul. 21, 2005, from related U.S. Appl. No. 10/106,426.
U.S. Office Action mailed Sep. 23, 2004, from related U.S. Appl. No. 10/106,430.
U.S. Office Action mailed Mar. 10, 2005, from related U.S. Appl. No. 10/106,430.
U.S. Office Action mailed Jul. 21, 2005, from related U.S. Appl. No. 10/106,430.
U.S. Office Action mailed Sep. 22, 2004, from related U.S. Appl. No. 10/106,299.
U.S. Office Action mailed Mar. 10, 2005, from related U.S. Appl. No. 10/106,299.
U.S. Office Action mailed Jul. 21, 2005, from related U.S. Appl. No. 10/106,299.
D. E. Culler, J. P. Singh, A. Gupta, "Parallel Computer Architecture," 1999 Morgan Kaufmann, San Francisco, CA, USA, XP002277658.
`
`
`
`
`
`
`
U.S. PATENT DOCUMENTS

6,338,122 B1         1/2002  Baumgartner et al.
6,343,347 B1         1/2002  Arimilli et al.
6,385,705 B1         5/2002  Keller et al. .............. 711/154
6,405,289 B1         6/2002  Arimilli et al. ............ 711/145
6,463,529 B1        10/2002  Miller et al.
6,467,007 B1        10/2002  Armstrong et al.
6,490,661 B1        12/2002  Keller et al. .............. 711/150
6,542,926 B1         4/2003  Zalewski et al. ............ 709/213
6,615,319 B1         9/2003  Khare et al.
6,631,447 B1        10/2003  Morioka et al.
6,633,945 B1        10/2003  Fu et al. .................. 710/316
6,633,960 B1        10/2003  Kessler et al.
6,636,906 B1 *      10/2003  Sharma et al. .............. 710/22
6,640,287 B1        10/2003  Gharachorloo et al.
6,658,526 B1        12/2003  Nguyen et al.
6,665,767 B1        12/2003  Comisky et al.
6,704,842 B1         3/2004  Janakiraman et al.
6,738,870 B1         5/2004  Van Huben et al.
6,738,871 B1         5/2004  Van Huben et al.
6,751,698 B1         6/2004  Deneroff et al.
6,751,721 B1 *       6/2004  Webb et al. ................ 712/10
6,754,782 B1         6/2004  Arimilli et al.
6,760,809 B1         7/2004  Arimilli et al.
6,760,819 B1         7/2004  Dhong et al.
6,799,252 B1 *       9/2004  Bauman ..................... 711/149
6,865,595 B1         3/2005  Glasco
6,892,282 B1         5/2005  Hass et al. ................ 711/146
2001/0013089 A1      8/2001  Weber
2001/0037435 A1     11/2001  Van Doren
2002/0007463 A1      1/2002  Fung
2002/0046327 A1      4/2002  Gharachorloo et al.
2002/0052914 A1      5/2002  Zalewski et al. ............ 709/203
2002/0083149 A1 *    6/2002  Van Huben et al. ........... 709/215
2002/0083243 A1      6/2002  Van Huben .................. 710/107
2002/0087807 A1 *    7/2002  Gharachorloo et al. ........ 711/141
2002/0087811 A1      7/2002  Khare et al.
2003/0009623 A1      1/2003  Arimilli et al.
2003/0182508 A1      9/2003  Glasco
2003/0182509 A1      9/2003  Glasco
2003/0182514 A1      9/2003  Glasco
2003/0195939 A1     10/2003  Edirisooriya et al.
2003/0196047 A1     10/2003  Kessler et al.
2003/0210655 A1     11/2003  Glasco
2003/0212741 A1     11/2003  Glasco
2003/0233388 A1     12/2003  Glasco et al. .............. 711/144
2004/0073755 A1      4/2004  Webb et al.
2004/0088493 A1      5/2004  Glasco ..................... 711/141
2004/0088494 A1      5/2004  Glasco
2004/0255002 A1     12/2004  Kota et al.
`
OTHER PUBLICATIONS

Martin, M.M.K.; Sorin, D.J.; Hill, M.D.; Wood, D.A., "Bandwidth adaptive snooping," Proceedings of the Eighth International Symposium on High-Performance Computer Architecture, Feb. 2-6, 2002, pp. 251-262.*
Andrew Tanenbaum, "Computer Networks," London: Prentice Hall International, GB, 1996, pp. 345-403, XP002155220.
U.S. Office Action mailed Jul. 20, 2005, from related Application No. 10/608,846.
U.S. Office Action mailed Sep. 9, 2005, from related Application No. 10/462,015.
U.S. Office Action mailed Sep. 9, 2005, from related Application No. 10/426,084.
U.S. Office Action mailed Nov. 2, 2005, from related Application No. 10/106,430.
U.S. Office Action mailed Oct. 5, 2005, from related Application No. 10/635,703.

* cited by examiner
`
`
`
U.S. Patent        Feb. 21, 2006        Sheet 1 of 17        US 7,003,633 B2

Figure 1A: Processing Clusters 101, 103, 105, and 107 connected by point-to-point links.

Figure 1B: Processing Clusters 121, 123, 125, and 127 coupled through a switch.
`
`
`
[Sheet 2 of 17, Figure 2: a cluster having a plurality of processors]
`
`
`
`
`
`
`
`
`
[Sheet 3 of 17, Figure 3: a cache coherence controller]
`
`
`
`
`
[Sheet 4 of 17, Figure 4: transaction flow for a data access request in a single cluster]
`
`
`
[Sheet 5 of 17, Figure 5A: cache coherence controller functionality]
`
`
`
[Sheet 6 of 17, Figure 5B: cache coherence controller functionality]
`
`
`
`
`
[Sheet 7 of 17, Figure 5C: cache coherence controller functionality]
`
`
`
`
[Sheet 8 of 17, Figure 5D: cache coherence controller functionality]
`
`
`
[Sheet 9 of 17, Figure 6: transaction flow for a probe request with multiple probe responses]
`
`
`
[Sheet 10 of 17, Figure 7: Coherence Directory 701]

Memory Line 711    State 713    Dirty Data Owner 715    Occupancy Vector 717
Address 721        Invalid      N/A                     N/A
Address 731        Invalid      N/A                     N/A
Address 741        Shared       N/A                     Clusters 1, 3
Address 751        Shared       N/A                     Clusters 1, 2, 3, 4
Address 761        Owned        Cluster 4               Clusters 2, 3, 4
Address 771        Owned        Cluster 2               Clusters 2, 4
Address 781        Modified     Cluster 2               N/A
Address 791        Modified     Cluster 3               N/A
`
`
`
[Sheet 11 of 17, Figure 8: Probe Filter Information 821]

Probe Filter        Read Block (Read) 823          Read Block Modify (Read/Write) 825
Information 821
Invalid 831         Can use completion bit.        Can use completion bit.
                    Probe home cluster. (801)      Probe home cluster. (809)
Shared 833          Can use completion bit.        N/A (811)
                    Probe home cluster. (803)
Owned 835           Can use completion bit.        N/A (813)
                    Probe remote cluster with
                    line cached in owned
                    state. (805)
Modified 837        Can use completion bit.        Can use completion bit.
                    Probe remote cluster with      Probe remote cluster.
                    line cached in modified
                    state. (807)
`
`
`
[Sheet 12 of 17, Figure 9: transaction flow for probing of a home cluster without probing of other clusters]
`
`
`
[Sheet 13 of 17, Figure 10: transaction flow for probing of a single remote cluster]
`
`
`
`
`
[Sheet 14 of 17, Figure 11: Probe Request Handling At Home Cache Coherence Controller]

Receive probe request associated with a memory line (1101).
Forward request to memory controller (1105).
Receive probe from memory controller (1109).
Access coherence directory and probe filter information (1113).
Use filtering and completion indicator and probe remote cluster?
  Yes: Forward probe with completion indicator to remote cluster (1135).
       Receive home cluster probe responses and do not forward response
       to request cluster. Receive probe responses and forward response
       to request cluster with completion indicator (1141). Receive source
       done from request cluster and forward source done to memory controller.
  No:  Proceed using memory controller serialization with no filtering
       and completion bit (1145).
`
`
`
[Sheet 15 of 17, Figure 12: Memory Controller Filter Information 1221]

Memory Controller        Read Block [Read] 1223       Read Block Modify [Read/Write]
Filter Information 1221
Invalid 1231             Send request to target.      Send request to target.
                         (1201)                       (1209)
Shared 1233              Send request to target.      Send request to target.
                         (1203)                       (1211)
Owned 1235               Forward probe to owning      Send request to target.
                         cluster. (1205)              (1213)
Modified 1237            Forward probe to modified    Forward probe to modified
                         cluster. (1207)              cluster. (1215)
`
`
`
[Sheet 16 of 17, Figure 13: transaction flow for probing a single remote cluster without probing a home cluster]
`
`
`
`
`
[Sheet 17 of 17, Figure 14: Probe Request Handling At Home Cache Coherence Controller (1401)]

Receive probe request associated with a particular memory line (1403).
Identify characteristics associated with probe request (1405).
Bypass memory controller?
  No:  Forward request to serialization point (1411). Proceed with or
       without probe filtering and completion indicator (Figure 6 or
       Figure 11).
  Yes: Block probe requests and probe remote cluster (Figure 12) (1415).
       Unblock memory line after receiving source done from request cluster.
`
`
`
`METHODS AND APPARATUS FOR
`MANAGING PROBE REQUESTS
`
`CROSS-REFERENCE TO RELATED
`APPLICATIONS
`
The present application is related to filed U.S. application Ser. No. 10/106,426 titled Methods And Apparatus For Speculative Probing At A Request Cluster, U.S. application Ser. No. 10/106,430 titled Methods And Apparatus For Speculative Probing With Early Completion And Delayed Request, and U.S. application Ser. No. 10/106,299 titled Methods And Apparatus For Speculative Probing With Early Completion And Early Request, the entireties of which are incorporated by reference herein for all purposes. The present application is also related to filed U.S. application Ser. No. 10/157,340, now U.S. Pat. No. 6,865,595, Ser. No. 10/145,439, Ser. No. 10/145,438, and Ser. No. 10/157,388 titled Methods And Apparatus For Responding To A Request Cluster by David B. Glasco, the entireties of which are incorporated by reference for all purposes. The present application is also related to concurrently filed U.S. application Ser. No. 10/288,399 with the same title and inventor, the entirety of which is incorporated by reference herein for all purposes.
`
`BACKGROUND OF THE INVENTION
`
`1. Field of the Invention
`
The present invention generally relates to accessing data in a multiple processor system. More specifically, the present invention provides techniques for improving data access efficiency while maintaining cache coherency in a multiple processor system having a multiple cluster architecture.
`
`2. Description of Related Art
`Data access in multiple processor systems can raise issues
`relating to cache coherency. Conventional multiple proces-
`sor computer systems have processors coupled to a system
`memory through a shared bus. In order to optimize access to
`data in the system memory, individual processors are typi-
`cally designed to work with cache memory. In one example,
`each processor has a cache that is loaded with data that the
`processor frequently accesses. The cache is read or written
`by a processor. However, cache coherency problems arise
`because multiple copies of the same data can co-exist in
`systems having multiple processors and multiple cache
`memories. For example, a frequently accessed data block
`corresponding to a memory line may be loaded into the
`cache of two different processors. In one example, if both
`processors attempt to write new values into the data block at
`the same time, different data values may result. One value
`may be written into the first cache while a different value is
written into the second cache. A system might then be unable
to determine what value to write through to system memory.
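The divergence described above can be sketched in a few lines. This is an illustrative model only, with hypothetical names; it is not a mechanism from the patent.

```python
# Illustrative sketch (not from the patent): two per-processor write-back
# caches holding copies of the same memory line diverge after concurrent
# writes, while system memory still holds the stale value.

class Cache:
    """A trivial write-back cache: writes stay local until flushed."""
    def __init__(self):
        self.lines = {}

    def write(self, address, value):
        self.lines[address] = value  # not written through to memory

memory = {0x100: 0}           # system memory, one line at address 0x100
cache_a, cache_b = Cache(), Cache()

cache_a.write(0x100, 7)       # processor A writes 7 into its cached copy
cache_b.write(0x100, 9)       # processor B writes 9 into its cached copy

# The cached copies now disagree and memory holds the old value. Without
# a coherence protocol there is no rule for which value should "win".
print(cache_a.lines[0x100], cache_b.lines[0x100], memory[0x100])  # 7 9 0
```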
`A variety of cache coherency mechanisms have been
`developed to address such problems in multiprocessor sys-
`tems. One solution is to simply force all processor writes to
`go through to memory immediately and bypass the associ-
`ated cache. The write requests can then be serialized before
`overwriting a system memory line. However, bypassing the
`cache significantly decreases efficiency gained by using a
`cache. Other cache coherency mechanisms have been devel-
`oped for specific architectures. In a shared bus architecture,
`each processor checks or snoops on the bus to determine
`whether it can read or write a shared cache block. In one
`
`example, a processor only writes an object when it owns or
`
`2
`has exclusive access to the object. Each corresponding cache
`object is then updated to allow processors access to the most
`recent version of the object.
`Bus arbitration is used when both processors attempt to
`write the same shared data block in the same clock cycle.
`Bus arbitration logic decides which processor gets the bus
first. Although cache coherency mechanisms such as bus
`arbitration are effective, using a shared bus limits the num-
`ber of processors that can be implemented in a single system
`with a single memory space.
`Other multiprocessor schemes involve individual proces-
`sor, cache, and memory systems connected to other proces-
`sors, cache, and memory systems using a network backbone
`such as Ethernet or Token Ring. Multiprocessor schemes
`involving separate computer systems each with its own
`address space can avoid many cache coherency problems
`because each processor has its own associated memory and
`cache. When one processor wishes to access data on a
`remote computing system, communication is explicit. Mes-
`sages are sent
`to move data to another processor and
`messages are received to accept data from another processor
`using standard network protocols such as TCP/IP. Multipro-
`cessor systems using explicit communication including
`transactions such as sends and receives are referred to as
`
systems using multiple private memories. By contrast, multiprocessor systems using implicit communication including
`transactions such as loads and stores are referred to herein as
`
`using a single address space.
`Multiprocessor schemes using separate computer systems
`allow more processors to be interconnected while minimiz-
`ing cache coherency problems. However,
`it would take
`substantially more time to access data held by a remote
`processor using a network infrastructure than it would take
`to access data held by a processor coupled to a system bus.
`Furthermore, valuable network bandwidth would be con-
`sumed moving data to the proper processors. This can
`negatively impact both processor and network performance.
`Performance limitations have led to the development of a
`point-to-point architecture for connecting processors in a
`system with a single memory space. In one example, indi-
`vidual processors can be directly connected to each other
`through a plurality of point-to-point links to form a cluster
`of processors. Separate clusters of processors can also be
`connected. The point-to-point links significantly increase the
`bandwidth for coprocessing and multiprocessing functions.
`However, using a point-to-point architecture to connect
`multiple processors in a multiple cluster system sharing a
`single memory space presents its own problems.
`Consequently, it is desirable to provide techniques for
`improving data access and cache coherency in systems
`having multiple clusters of multiple processors connected
`using point-to-point links.
`
`SUMMARY OF THE INVENTION
`
`10
`
`15
`
`20
`
`25
`
`30
`
`35
`
`40
`
`45
`
`50
`
`55
`
`60
`
`According to the present invention, methods and appara-
`tus are provided for increasing the efficiency of data access
`in a multiple processor, multiple cluster system. Mecha-
`nisms for reducing the number of transactions in a multiple
`cluster system are provided. In one example, probe filter
`information is used to limit the number of probe requests
`transmitted to request and remote clusters.
`In one embodiment, a computer system is provided. The
`computer system includes a home cluster having a first
`plurality of processors and a home cache coherence con-
`troller. The first plurality of processors and the home cache
`coherence controller are interconnected in a point-to-point
`20
`
`65
`
`20
`
`
`
`US 7,003,633 B2
`
`3
`architecture. The home cache coherence controller is con-
`
`figured to receive a probe request and probe one or more
`selected clusters. The one or more clusters are selected based
`
`on the characteristics associated with the probe request.
`In another embodiment, a method for managing probes is
`provided. A probe request is received at a home cache
`coherence controller in a home cluster. The home cluster
`
`includes a first plurality of processors and the home cache
`coherence controller. The first plurality of processors and the
`home cache coherence controller are interconnected in a
`
`10
`
`point-to-point architecture. One or more clusters are selected
`for probing based on the characteristics associated with the
`probe request. The one or more clusters are probed.
`A further understanding of the nature and advantages of
`the present invention may be realized by reference to the
`remaining portions of the specification and the drawings.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`The invention may best be understood by reference to the
`following description taken in conjunction with the accom-
`panying drawings, which are illustrative of specific embodi-
`ments of the present invention.
FIGS. 1A and 1B are diagrammatic representations depicting a system having multiple clusters.
`FIG. 2 is a diagrammatic representation of a cluster
`having a plurality of processors.
`FIG. 3 is a diagrammatic representation of a cache coher-
`ence controller.
`
FIG. 4 is a diagrammatic representation showing a transaction flow for a data access request from a processor in a single cluster.
FIGS. 5A—5D are diagrammatic representations showing cache coherence controller functionality.
FIG. 6 is a diagrammatic representation depicting a transaction flow for a probe request with multiple probe responses.
`FIG. 7 is a diagrammatic representation showing a cache
`coherence directory.
`FIG. 8 is a diagrammatic representation showing probe
`filter information that can be used to reduce the number of
`
`probes transmitted to various clusters.
FIG. 9 is a diagrammatic representation showing a transaction flow for probing of a home cluster without probing of other clusters.
`
`FIG. 10 is a diagrammatic representation showing a
`transaction flow for probing of a single remote cluster.
`FIG. 11 is a flow process diagram showing the handling
`of a probe request with probe filter information.
`FIG. 12 is
`a diagrammatic representation showing
`memory controller filter information.
`FIG. 13 is a diagrammatic representation showing a
`transaction flow for probing a single remote cluster without
`probing a home cluster.
`FIG. 14 is a flow process diagram showing the handling
`of a probe request at a home cluster cache coherence
`controller using memory controller filter information.
`
`DETAILED DESCRIPTION OF SPECIFIC
`EMBODIMENTS
`
`Reference will now be made in detail to some specific
`embodiments of the invention including the best modes
`contemplated by the inventors for carrying out the invention.
Examples of these specific embodiments are illustrated in the accompanying drawings. While the invention is described in conjunction with these specific embodiments, it
`
`15
`
`20
`
`25
`
`30
`
`35
`
`40
`
`45
`
`50
`
`55
`
`60
`
`65
`
`21
`
will be understood that it is not intended to limit the
`
`invention to the described embodiments. On the contrary, it
`is intended to cover alternatives, modifications, and equiva-
`lents as may be included within the spirit and scope of the
`invention as defined by the appended claims. Multi-proces-
`sor architectures having point-to-point communication
`among their processors are suitable for implementing spe-
`cific embodiments of the present invention. In the following
`description, numerous specific details are set forth in order
`to provide a thorough understanding of the present inven-
`tion. The present invention may be practiced without some
`or all of these specific details. Well-known process opera-
tions have not been described in detail in order not to
`unnecessarily obscure the present invention. Furthermore,
the present application’s reference to a particular singular entity includes the possibility that the methods and apparatus of the present invention can be implemented using
`more than one entity, unless the context clearly dictates
`otherwise.
`
`Techniques are provided for increasing data access effi-
`ciency in a multiple processor, multiple cluster system. In a
`point-to-point architecture, a cluster of processors includes
`multiple processors directly connected to each other through
`point-to-point links. By using point-to-point links instead of
`a conventional shared bus or external network, multiple
`processors are used efficiently in a system sharing the same
`memory space. Processing and network efficiency are also
`improved by avoiding many of the bandwidth and latency
`limitations of conventional bus and external network based
`
`multiprocessor architectures. According to various embodi-
`ments, however, linearly increasing the number of proces-
`sors in a point-to-point architecture leads to an exponential
`increase in the number of links used to connect the multiple
`processors. In order to reduce the number of links used and
`to further modularize a multiprocessor system using a point-
`to-point architecture, multiple clusters are used.
`According to various embodiments, the multiple proces-
`sor clusters are interconnected using a point-to-point archi-
`tecture. Each cluster of processors includes a cache coher-
`ence controller used to handle communications between
`
clusters. In one embodiment, the point-to-point architecture used to connect processors is used to connect clusters as well.
`
`By using a cache coherence controller, multiple cluster
`systems can be built using processors that may not neces-
`sarily support multiple clusters. Such a multiple cluster
`system can be built by using a cache coherence controller to
`represent non-local nodes in local transactions so that local
`nodes do not need to be aware of the existence of nodes
`outside of the local cluster. More detail on the cache
`
`coherence controller will be provided below.
`In a single cluster system, cache coherency can be main-
`tained by sending all data access requests through a serial-
`ization point. Any mechanism for ordering data access
`requests is referred to herein as a serialization point. One
`example of a serialization point is a memory controller.
`Various processors in the single cluster system send data
`access requests to the memory controller. In one example,
`the memory controller is configured to serialize or lock the
`data access requests so that only one data access request for
`a given memory line is allowed at any particular time. If
`another processor attempts to access the same memory line,
`the data access attempt is blocked until the memory line is
`unlocked. The memory controller allows cache coherency to
`be maintained in a multiple processor, single cluster system.
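The serialization behavior described above can be modeled as a memory controller that grants one access per memory line at a time. The following is a minimal sketch with hypothetical names, not an implementation from the patent.

```python
# Hypothetical sketch of a serialization point: a memory controller that
# locks each memory line so only one data access request proceeds at a time.

class MemoryController:
    def __init__(self):
        self.locked = set()   # memory lines with an access in flight
        self.pending = []     # blocked (line, request) pairs, in arrival order

    def request(self, line, req):
        """Serialize access: grant if the line is free, otherwise block."""
        if line in self.locked:
            self.pending.append((line, req))
            return "blocked"
        self.locked.add(line)
        return "granted"

    def complete(self, line):
        """Unlock the line and wake the oldest blocked request, if any."""
        self.locked.discard(line)
        for i, (blocked_line, req) in enumerate(self.pending):
            if blocked_line == line:
                self.pending.pop(i)
                self.locked.add(line)
                return req  # this request is now granted
        return None

mc = MemoryController()
assert mc.request(0x40, "read from CPU 0") == "granted"
assert mc.request(0x40, "write from CPU 1") == "blocked"  # same line: serialized
assert mc.request(0x80, "read from CPU 2") == "granted"   # other line: allowed
assert mc.complete(0x40) == "write from CPU 1"            # unblocked in order
```

Because only one access per line is outstanding, two processors can never update the same line concurrently, which is the property the serialization point provides.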
`A serialization point can also be used in a multiple
`processor, multiple cluster system where the processors in
`
`21
`
`
`
`US 7,003,633 B2
`
`5
`the various clusters share a single address space. By using a
`single address space, internal point-to-point links can be
`used to significantly improve intercluster communication
`over traditional external network based multiple cluster
`systems. Various processors in various clusters send data
`access requests to a memory controller associated with a
`particular cluster such as a home cluster. The memory
`controller can similarly serialize all data requests from the
`different clusters. However, a serialization point in a mul-
`tiple processor, multiple cluster system may not be as
`efficient as a serialization point in a multiple processor,
`single cluster system. That is, delay resulting from factors
`such as latency from transmitting between clusters can
`adversely affect the response times for various data access
`requests. It should be noted that delay also results from the
`use of probes in a multiple processor environment.
`Although delay in intercluster transactions in an architec-
`ture using a shared memory space is significantly less than
`the delay in conventional message passing environments
`using external networks such as Ethernet or Token Ring,
`even minimal delay is a significant factor. In some applica-
`tions, there may be millions of data access requests from a
`processor in a fraction of a second. Any delay can adversely
`impact processor performance.
`According to various embodiments, probe management is
`used to increase the efficiency of accessing data in a multiple
`processor, multiple cluster system. A mechanism for elicit-
`ing a response from a node to maintain cache coherency in
`a system is referred to herein as a probe. In one example, a
`mechanism for snooping a cache is referred to as a probe. A
`response to a probe can be directed to the source or target of
`the initiating request. Any mechanism for filtering or reduc-
`ing the number of probes and probe requests transmitted to
`various nodes is referred to herein as managing probes. In
one example, managing probes entails characterizing a probe
`request to determine if a probe can be transmitted to a
`reduced number of entities.
`
`In typical implementations, probe requests are sent to a
`memory controller that broadcasts probes to various nodes
`in a system. In such a system, no knowledge of the cache line
`state is known. All nodes in the system are probed and the
`request cluster receives a response from each node. In a
`system with a coherence directory, state information asso-
`ciated with various memory lines can be used to reduce the
`number of transactions. Any mechanism for maintaining
`state information associated with various memory lines is
`referred to herein as a coherence directory. A coherence
`directory typically includes information for memory lines in
`a local cluster that are cached in a remote cluster. According
`to various embodiments, a coherence directory is used to
`reduce the number of probes to remote quads by inferring
`the state of local caches. In other embodiments, a coherence
`directory is used to eliminate the transmission of a request
`to a memory controller in a home cluster.
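The directory-based filtering described above can be sketched as follows. This is an illustrative model with hypothetical names, patterned on the states shown in Figure 7; the patent itself defines no code.

```python
# Illustrative sketch of probe filtering with a coherence directory.
# States mirror Figure 7: invalid, shared, owned, modified. For a read,
# only the clusters that may hold a dirty copy need to be probed.

class CoherenceDirectory:
    def __init__(self):
        # memory line -> (state, dirty data owner, occupancy vector)
        self.entries = {}

    def record(self, line, state, owner=None, occupancy=()):
        self.entries[line] = (state, owner, set(occupancy))

    def clusters_to_probe(self, line, home_cluster):
        """Return the clusters a read probe must visit, based on line state."""
        state, owner, occupancy = self.entries.get(
            line, ("invalid", None, set()))
        if state in ("invalid", "shared"):
            return {home_cluster}   # no remote dirty copy: probe home only
        if state in ("owned", "modified"):
            return {owner}          # probe the cluster holding dirty data
        raise ValueError(state)

directory = CoherenceDirectory()
directory.record(0x741, "shared", occupancy={1, 3})
directory.record(0x761, "owned", owner=4, occupancy={2, 3, 4})

# A read of a shared line probes only the home cluster (cluster 0 here);
# a read of an owned line probes only the remote owner, not every cluster.
assert directory.clusters_to_probe(0x741, home_cluster=0) == {0}
assert directory.clusters_to_probe(0x761, home_cluster=0) == {4}
```

The saving is the point of the technique: without the directory, every probe is broadcast to all clusters; with it, the state inference reduces each probe to a single target.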
`FIG. 1A is a diagrammatic representation of one example
`of a multiple cluster, multiple processor system that can use
`the techniques of the present invention. Each processing
`cluster 101, 103, 105, and 107 can include a plurality of
`processors. The processing clusters 101, 103, 105, and 107
are connected to each other through point-to-point links
111a—f. In one embodiment, the multiple processors in the
`multiple cluster architecture shown in FIG. 1A share the
`same memory space. In this example,
`the point-to-point
`links 111a—f are internal system connections that are used in
`place of a traditional front-side bus to connect the multiple
`
`6
`processors in the multiple clusters 101, 103, 105, and 107.
`The point-to-point
`links may support any point-to-point
`coherence protocol.
`FIG. 1B is a diagrammatic representation of another
`example of a multiple cluster, multiple processor system that
`can use the techniques of the present invention. Each pro-
`cessing cluster 121, 123, 125, and 127 can be coupled to a
`switch 131 through point-to-point links 141a—d. It should be
`noted that using a switch and point-to-point links allows
`implementation with fewer point-to-point links when con-
`necting multiple clusters in the system. A switch 131 can
`include a processor with a coherence protocol interface.
`According to various implementations, a multicluster sys-
`tem shown in FIG. 1A is expanded using a switch 131 as
`shown in FIG. 1B.
`
`FIG. 2 is a diagrammatic representation of a multiple
`processor cluster, such as the cluster 101 shown in FIG. 1A.
`Cluster 200 includes processors 202a—202d, one or more
`Basic I/O systems (BIOS) 204, a memory subsystem com-
prising memory banks 206a—206d, point-to-point communication links 208a—208e, and a service processor 212. The
`point-to-point communication links are configured to allow
`interconnections between processors 202a—202d, I/O switch
`210, and cache coherence controller 230. The service pro-
`cessor 212 is configured to allow communications with
`processors 202a—202d, I/O switch 210, and cache coherence
`controller 230 via a JTAG interface represented in FIG. 2 by
`links 214a—214f. It should be noted that other interfaces are
`supported. It should also be noted that in some implemen-
`tations, a service processor is not
`included in multiple
`processor clusters. I/O switch 210 connects the rest of the
`system to I/O adapters 216 and 220.
`According to specific embodiments, the service processor
`of the present invention has the intelligence to partition
`system resources ac