US007003633B2

Glasco

(10) Patent No.: US 7,003,633 B2
(45) Date of Patent: *Feb. 21, 2006
`
`(54) METHODS AND APPARATUS FOR
`MANAGING PROBE REQUESTS
`
(75) Inventor: David B. Glasco, Austin, TX (US)
`
`(73) Assignee: Newisys, Inc., Austin, TX (US)
`
( * ) Notice: Subject to any disclaimer, the term of this
patent is extended or adjusted under 35
U.S.C. 154(b) by 205 days.

This patent is subject to a terminal dis-
claimer.
`
6,085,295 A     7/2000  Ekanadham et al.
6,108,737 A     8/2000  Sharma et al.
6,122,715 A     9/2000  Palanca et al.
______ A                Bordaz et al.
6,167,492 A    12/2000  Keller et al. .......... 711/154
6,173,393 B1    1/2001  Palanca et al.
6,189,078 B1    2/2001  Bauman et al.
6,209,055 B1    3/2001  Van Doren et al.
6,292,705 B1    9/2001  Wang et al.
6,292,906 B1    9/2001  Fu et al.
6,330,643 B1   12/2001  Arimilli et al.
6,334,172 B1   12/2001  Arimilli et al.
`
(21) Appl. No.: 10/288,347

(22) Filed: Nov. 4, 2002

(65) Prior Publication Data

US 2004/0088492 A1   May 6, 2004

(Continued)

FOREIGN PATENT DOCUMENTS

WO   WO0239242   5/2002
`
`(51)
`
`Int‘ Cl‘
`(200601)
`G06F 12/00
`711/146' 711/141' 709/216'
`(52) U S C]
`-
`-
`-
`--------------------- --
`>
`>
`>
`709/218
`(58) Field of Classification Search .............. .. 711/141,
`711/146, 144, 145; 709/206, 213,9/22116é
`See application file for complete search history.’
`
`(56)
`
`References Cited
`
OTHER PUBLICATIONS

Multicast snooping: a new coherence method using a mul-
ticast address network, Bilir, E.E.; Dickson, R.M.; Ying Hu;
Plakal, M.; Sorin, D.J.; Hill, M.D.; Wood, D.A.; Computer
Architecture, 1999. Proceedings of the 26th International
Symposium on, May 2-4, 1999.*
`
`(Continued)
Primary Examiner—Brian R. Peugh
`(74) Attorney, Agent, or Firm—Beyer Weaver & Thomas,
`LLP
`
`U.S. PATENT DOCUMENTS
`
`(57)
`
`ABSTRACT
`
`According to the present invention, methods and apparatus
`are provided for increasing the efficiency of data access in a
`multiple processor, multiple cluster system. Mechanisms for
`reducing the number of transactions in a multiple cluster
`system are provided. In one example, probe filter informa-
`tion is used to limit the number of probe requests transmitted
`to request and remote clusters.
`
`39 Claims, 17 Drawing Sheets
`
`............. .. 711/148
`............ .. 711/121
`
`3/1993 Sindhu
`2/1995 Hunter et al.
`6/1996 Somani et al.
`11/1997 Logghe
`5/1998 Sarangdhar ............... .. 711/145
`10/1998 Komuro et al.
`.
`711/141
`4/1999 Merchant . . . . . . . .
`. . . .. 711/140
`1/2000 Arimilli et al.
`. . . . .
`. . . .. 711/141
`3/2000 Van Huben et al.
`........ .. 712/21
`4/2000 Huff et 211.
`5/2000 Carpenter et 211.
`6/2000 Palanca et al.
`
`
`
`5,195,089 A
`5,394,555 A
`5,524,212 A *
`5,692,123 A
`5,751,995 A *
`5,829,032 A
`5,893,151 A *
`6,018,791 A
`6,038,652 A
`6,052,769 A
`6,067,603 A
`6,073,210 A
`
`
`
APPLE 1023
`
`
OTHER PUBLICATIONS (Continued)
`
Specifying and Verifying a broadcast and a multicast snoop-
ing cache coherence protocol, Sorin, D.J.; Plakal, M.;
Condon, A.E.; Hill, M.D.; Martin, M.M.K.; Wood, D.A.;
Parallel and Distributed Systems, IEEE Transactions on,
vol. 13, Issue 6, Jun. 2002.*
Kim, et al., “Power-aware Partitioned Cache Architectures”,
© 2001 ACM, pp. 64-67.*
Powell, et al., “Reducing Set-Associative Cache Energy Via
Way-Prediction and Selective Direct-Mapping”, © 2001
IEEE, pp. 54-65.*
HyperTransport™ I/O Link Specification Revision 1.03,
HyperTransport™ Consortium, Oct. 10, 2001, Copyright ©
HyperTransport Technology Consortium.
PCT Search Report PCT/US03/34756, Int’l filing date Oct.
30, 2003, Search report mailed Dec. 16, 2004.
U.S. Office Action mailed Sep. 22, 2004, from related U.S.
Appl. No. 10/106,426.
U.S. Office Action mailed Mar. 7, 2005, from related U.S.
Appl. No. 10/106,426.
U.S. Office Action mailed Jul. 21, 2005, from related U.S.
Appl. No. 10/106,426.
U.S. Office Action mailed Sep. 23, 2004, from related U.S.
Appl. No. 10/106,430.
U.S. Office Action mailed Mar. 10, 2005, from related U.S.
Appl. No. 10/106,430.
U.S. Office Action mailed Jul. 21, 2005, from related U.S.
Appl. No. 10/106,430.
U.S. Office Action mailed Sep. 22, 2004, from related U.S.
Appl. No. 10/106,299.
U.S. Office Action mailed Mar. 10, 2005, from related U.S.
Appl. No. 10/106,299.
U.S. Office Action mailed Jul. 21, 2005, from related U.S.
Appl. No. 10/106,299.
D. E. Culler, J. P. Singh, A. Gupta, “Parallel Computer
Architecture”, 1999 Morgan Kaufmann, San Francisco, CA,
USA, XP002277658.
`
`
Andrew Tanenbaum, “Computer Networks”, Computer
Networks, London: Prentice Hall International, GB, 1996,
pp. 345-403, XP002155220.
U.S. Office Action mailed Jul. 20, 2005, from related
Application No. 10/608,846.
U.S. Office Action mailed Sep. 9, 2005, from related
Application No. 10/462,015.
U.S. Office Action mailed Sep. 9, 2005, from related
Application No. 10/426,084.
U.S. Office Action mailed Nov. 2, 2005, from related
Application No. 10/106,430.
U.S. Office Action mailed Oct. 5, 2005, from related
Application No. 10/635,703.
`
`(""'3("
`
`* cited by examiner
`
`...... .. 711/150
`...... .. 711/150
`
`........... .. 711/119
`
`1/2002 Baumgartner et 211.
`6,338,122 B1
`1/2002 Arimilli et 211.
`6,343,347 B1
`5/2002 Keller etal.
`.............. .. 711/154
`6,385,705 B1
`6/2002 Arimilli et al.
`........... .. 711/145
`6,405,289 B1
`10/2002 Miller et 211.
`6,463,529 B1
`10/2002 Armstrong et 211.
`6,467,007 B1
`12/2002 Keller etal.
`.............. .. 711/150
`6,490,661 B1
`4/2003 Zalewski et al.
`......... .. 709/213
`6,542,926 B1
`9/2003 Khare et 211.
`6,615,319 B1
`10/2003 Morioka et 211.
`6,631,447 B1
`10/2003 Fu et al.
`................... .. 710/316
`6,633,945 B1
`10/2003 Kessler et 211.
`6,633,960 B1
`.............. .. 710/22
`6,636,906 B1 * 10/2003 Sharma et al.
`6,640,287 B1
`10/2003 Gharachorloo et 211.
`6,658,526 B1
`12/2003 Nguyen et 211.
`6,665,767 B1
`12/2003 Comisky et 211.
`6,704,842 B1
`3/2004 Janakiraman et 211.
`6,738,870 B1
`5/2004 Van Huben et al.
`6,738,871 B1
`5/2004 Van Huben et al.
`6,751,698 B1
`6/2004 Deneroff et 211.
`6,751,721 B1 *
`6/2004 Webb et al.
`................ .. 712/10
`6,754,782 B1
`6/2004 Arimilli et 211.
`6,760,809 B1
`7/2004 Arimilli et al.
`6,760,819 B1
`7/2004 Dhong et 211.
`6,799,252 B1 *
`9/2004 Bauman ................... .. 711/149
`6,865,595 B1
`3/2005 Glasco
`6,892,282 B1
`5/2005 Hass et al. ................ .. 711/146
`2001/0013089 A1
`8/2001 Weber
`11/2001 Van Doren
`200 1/0037435 A1
`2002/0007463 A1
`1/2002 Fung
`2002/0046327 A1
`4/2002 Gharachorloo et al.
`2002/0052914 A1
`5/2002 Zalewski et al.
`......... .. 709/203
`2002/0083149 A1 *
`6/2002 Van Huben et al.
`...... .. 709/215
`2002/0083243 A1
`6/2002 Van Huben ............... .. 710/107
`2002/0087807 A1 *
`7/2002 Gharachorloo et al.
`711/141
`2002/0087811 A1
`7/2002 Khare et al.
`2003/0009623 A1
`1/2003 Arimilli et al.
`2003/0182508 A1
`9/2003 Glasco
`2003/0182509 A1
`9/2003 Glasco
`2003/0182514 A1
`9/2003 Glasco
`2003/0195939 A1
`10/2003 Edirisooriya et al.
`2003/0196047 A1
`10/2003 Kessler et al.
`2003/0210655 A1
`11/2003 Glasco
`2003/0212741 A1
`11/2003 Glasco
`2003/0233388 A1
`12/2003 Glasco et al.
`.............. .. 711/144
`2004/0073755 A1
`4/2004 Webb et al.
`2004/0088493 A1
`5/2004 Glasco ..................... .. 711/141
`2004/0088494 A1
`5/2004 Glasco
`2004/0255002 A1
`12/2004 Kota et 211.
`
`........... .. 711/119
`
`..... .. 709/212
`
`OTHER PUBLICATIONS
`
Bandwidth adaptive snooping, Martin, M.M.K.; Sorin, D.J.;
Hill, M.D.; Wood, D.A.; High-Performance Computer
Architecture, 2002. Proceedings. Eighth International Sym-
posium on, Feb. 2-6, 2002, pp. 251-262.*
`
`2
`
`
`
`U.S. Patent
`
`Feb. 21, 2006
`
`Sheet 1 of 17
`
`US 7,003,633 B2
`
Figure 1A: Processing Cluster 101, Processing Cluster 103,
Processing Cluster 105, and Processing Cluster 107 connected
by point-to-point links.

Figure 1B: Processing Cluster 121, Processing Cluster 123,
Processing Cluster 125, and Processing Cluster 127 coupled
through a switch.
`3
`
`
`
[FIG. 2 (Sheet 2 of 17): diagrammatic representation of a
cluster having a plurality of processors.]
`
`
`
`
`
`
`
`
[FIG. 3 (Sheet 3 of 17): diagrammatic representation of a
cache coherence controller.]
`
`
`
`
[FIG. 4 (Sheet 4 of 17): transaction flow for a data access
request from a processor in a single cluster.]
`
`
`
[FIG. 5A (Sheet 5 of 17): cache coherence controller
functionality.]
`
`
`
[FIG. 5B (Sheet 6 of 17): cache coherence controller
functionality.]
`
`
`
`
`
[FIG. 5C (Sheet 7 of 17): cache coherence controller
functionality.]
`
`
`
`
`U.S. Patent
`
`Feb. 21, 2006
`
`Sheet 8 of 17
`
`US 7,003,633 B2
`
`In
`OJ
`'U
`
`0 2'
`
`*l1'\
`ShOlri
`'-:3
`i0
`Z
`
`Figure5D
`
`10
`
`10
`
`
`
[FIG. 6 (Sheet 9 of 17): transaction flow for a probe request
with multiple probe responses; Request Cluster 600, Home
Cluster 620, Remote Cluster 640.]
`
`
`
`
Figure 7 (Sheet 10 of 17)

Coherence Directory 701

Memory Line 711    State 713    Dirty Data Owner    Occupancy Vector
Address 721        Invalid      N/A                 N/A
Address 731        Invalid      N/A                 N/A
Address 741        Shared       N/A                 Clusters 1, 3
Address 751        Shared       N/A                 Clusters 1, 2, 3, 4
Address 761        Owned        Cluster 2           Clusters 2, 3, 4
Address 771        Owned        Cluster 4           Clusters 2, 4
Address 781        Modified     Cluster 2           Cluster 2
Address 791        Modified     Cluster 3           N/A
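The directory rows above pair each memory line with a state, a dirty-data owner, and an occupancy vector of clusters that may cache the line. As an illustrative sketch only (the class and method names below are hypothetical, not taken from the patent), the structure can be modeled as:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional, Set

class LineState(Enum):
    """Directory states for a memory line, as listed in FIG. 7."""
    INVALID = "invalid"
    SHARED = "shared"
    OWNED = "owned"
    MODIFIED = "modified"

@dataclass
class DirectoryEntry:
    """One directory row: state, dirty-data owner, and occupancy vector."""
    state: LineState = LineState.INVALID
    dirty_owner: Optional[int] = None  # cluster holding dirty data (owned/modified)
    occupancy: Set[int] = field(default_factory=set)  # clusters caching the line

class CoherenceDirectory:
    """Maps memory-line addresses to directory entries."""
    def __init__(self) -> None:
        self.entries: Dict[int, DirectoryEntry] = {}

    def lookup(self, address: int) -> DirectoryEntry:
        # A line with no recorded entry is implicitly invalid.
        return self.entries.get(address, DirectoryEntry())

    def record_share(self, address: int, cluster: int) -> None:
        # A remote cluster caching a clean copy moves an invalid line to shared.
        entry = self.entries.setdefault(address, DirectoryEntry())
        entry.occupancy.add(cluster)
        if entry.state is LineState.INVALID:
            entry.state = LineState.SHARED
```

With this sketch, recording clean copies in clusters 1 and 3 reproduces the Address 741 row above (shared, occupancy {1, 3}).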
`
`12
`
`12
`
`
`
`
Figure 8 (Sheet 11 of 17)

Probe Filter Information 821

                Read Block (Read) 823          Read Block Modify (Read/Write) 825

Invalid 831     Can use completion bit.        Can use completion bit.
                Probe home cluster. (801)      Probe home cluster. (809)

Shared 833      Can use completion bit.        N/A (811)
                Probe home cluster. (803)

Owned 835       Can use completion bit.        N/A (813)
                Probe remote cluster with
                line cached in owned
                state. (805)

Modified 837    Can use completion bit.        Can use completion bit.
                Probe remote cluster with      Probe remote cluster. (815)
                line cached in modified
                state. (807)
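Read as a decision table, FIG. 8 maps a line's directory state and the request type to a reduced probe target, with N/A cells for combinations where filtering does not apply. A minimal sketch (the dictionary encoding and the function name are my own, not the patent's):

```python
# Encoding of FIG. 8's probe filter information:
# (directory state, request type) -> where to send the reduced set of probes.
# Missing keys correspond to the N/A cells; in every listed cell the
# completion bit can be used.
PROBE_FILTER = {
    ("invalid",  "read"):       "probe home cluster",
    ("invalid",  "read/write"): "probe home cluster",
    ("shared",   "read"):       "probe home cluster",
    ("owned",    "read"):       "probe remote cluster with line cached in owned state",
    ("modified", "read"):       "probe remote cluster with line cached in modified state",
    ("modified", "read/write"): "probe remote cluster",
}

def filter_probe(state: str, request: str):
    """Return the reduced probe target, or None for an N/A combination."""
    return PROBE_FILTER.get((state, request))
```

For example, a read of a shared line needs only a home-cluster probe, while a read/write of a shared line falls in an N/A cell.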
`
`13
`
`13
`
`
`
[FIG. 9 (Sheet 12 of 17): transaction flow for probing of a
home cluster without probing of other clusters.]
`
`
`
[FIG. 10 (Sheet 13 of 17): transaction flow for probing of a
single remote cluster.]
`
`
`
`
`
Figure 11 (Sheet 14 of 17): Probe Request Handling At Home
Cache Coherence Controller

- Receive probe request associated with a particular memory line (1101)
- Forward request to memory controller (1105)
- Receive probe from memory controller (1109)
- Access coherence directory and probe filter information (1113)
- Use filtering and completion indicator? If no: proceed using memory
  controller serialization with no filtering and completion bit (1141)
- Probe remote cluster? If yes: forward probe with completion indicator
  to remote cluster; receive home cluster probe responses and do not
  forward response to request cluster
- Receive probe responses and forward response to request cluster with
  completion indicator (1145)
- Receive source done from request cluster and forward source done to
  memory controller
`
`
`
`
Figure 12 (Sheet 15 of 17)

Memory Controller Filter Information 1221

                Read Block (Read) 1223        Read Block Modify (Read/Write) 1225

Invalid 1231    Send request to target.       Send request to target.
                (1201)                        (1209)

Shared 1233     Send request to target.       Send request to target.

Owned 1235      Forward probe to owning       Send request to target.
                cluster. (1205)               (1213)

Modified 1237   Forward probe to modified     Forward probe to modified
                cluster. (1207)               cluster. (1215)
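FIG. 12 can likewise be read as a routing rule: when the directory already identifies the single cluster holding the line in the owned or modified state, the home controller can forward the probe there directly instead of sending the request to the target memory controller. A hedged sketch (the function and return strings are illustrative only):

```python
def route_request(state: str, request: str) -> str:
    """Route a probe request using memory controller filter information.

    Mirrors FIG. 12: a modified line (and an owned line on a read) is
    probed directly at the cluster caching it; every other combination
    sends the request to the target for normal serialization.
    """
    if state == "modified":
        return "forward probe to modified cluster"
    if state == "owned" and request == "read":
        return "forward probe to owning cluster"
    return "send request to target"
```

A read/write of an owned line still goes to the target, since the requester must gain exclusive access through normal serialization.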
`
`17
`
`17
`
`
`
[FIG. 13 (Sheet 16 of 17): transaction flow for probing a
single remote cluster without probing a home cluster.]
`
`
`
`
`
Figure 14 (Sheet 17 of 17): Probe Request Handling At Home
Cache Coherence Controller

- Receive probe request associated with a particular memory line (1401)
- Identify characteristics associated with probe request (1403)
- Bypass memory controller? If no: forward request to serialization
  point (1405), then proceed with or without probe filtering and
  completion indicator (Figure 6 or Figure 11)
- If yes: block probe requests and probe remote cluster (Figure 12)
  (1415), then unblock memory line after receiving source done from
  request cluster
`
`
`
`METHODS AND APPARATUS FOR
`MANAGING PROBE REQUESTS
`
`CROSS-REFERENCE TO RELATED
`APPLICATIONS
`
`The present application is related to filed U.S. application
`Ser. No. 10/106,426 titled Methods And Apparatus For
`Speculative Probing At A Request Cluster, U.S. application
`Ser. No. 10/106,430 titled Methods And Apparatus For
`Speculative Probing With Early Completion And Delayed
`Request, and U.S. application Ser. No. 10/106,299 titled
`Methods And Apparatus For Speculative Probing With Early
`Completion And Early Request, the entireties of which are
`incorporated by reference herein for all purposes. The
`present application is also related to filed U.S. application
`Ser. No. 10/157,340, now U.S. Pat. No. 6,865,595, Ser. No.
`10/145,439, Ser. No. 10/145,438, and Ser. No. 10/157,388
titled Methods And Apparatus For Responding To A Request
`Cluster by David B. Glasco,
`the entireties of which are
`incorporated by reference for all purposes. The present
`application is also related to concurrently filed U.S. appli-
`cation Ser. No. 10/288,399 with the same title and inventor,
`the entirety of which is incorporated by reference herein for
`all purposes.
`
`BACKGROUND OF THE INVENTION
`
`1. Field of the Invention
`
`The present invention generally relates to accessing data
`in a multiple processor system. More specifically,
`the
`present invention provides techniques for improving data
`access efficiency while maintaining cache coherency in a
`multiple processor system having a multiple cluster archi-
`tecture.
`
`2. Description of Related Art
`Data access in multiple processor systems can raise issues
`relating to cache coherency. Conventional multiple proces-
`sor computer systems have processors coupled to a system
`memory through a shared bus. In order to optimize access to
`data in the system memory, individual processors are typi-
`cally designed to work with cache memory. In one example,
`each processor has a cache that is loaded with data that the
`processor frequently accesses. The cache is read or written
`by a processor. However, cache coherency problems arise
`because multiple copies of the same data can co-exist in
`systems having multiple processors and multiple cache
`memories. For example, a frequently accessed data block
`corresponding to a memory line may be loaded into the
`cache of two different processors. In one example, if both
`processors attempt to write new values into the data block at
`the same time, different data values may result. One value
`may be written into the first cache while a different value is
written into the second cache. A system might then be unable
`to determine what value to write through to system memory.
`A variety of cache coherency mechanisms have been
`developed to address such problems in multiprocessor sys-
`tems. One solution is to simply force all processor writes to
`go through to memory immediately and bypass the associ-
`ated cache. The write requests can then be serialized before
`overwriting a system memory line. However, bypassing the
`cache significantly decreases efficiency gained by using a
`cache. Other cache coherency mechanisms have been devel-
`oped for specific architectures. In a shared bus architecture,
`each processor checks or snoops on the bus to determine
`whether it can read or write a shared cache block. In one
`
example, a processor only writes an object when it owns or
has exclusive access to the object. Each corresponding cache
`object is then updated to allow processors access to the most
`recent version of the object.
`Bus arbitration is used when both processors attempt to
`write the same shared data block in the same clock cycle.
`Bus arbitration logic decides which processor gets the bus
`first. Although, cache coherency mechanisms such as bus
`arbitration are effective, using a shared bus limits the num-
`ber of processors that can be implemented in a single system
`with a single memory space.
`Other multiprocessor schemes involve individual proces-
`sor, cache, and memory systems connected to other proces-
`sors, cache, and memory systems using a network backbone
`such as Ethernet or Token Ring. Multiprocessor schemes
`involving separate computer systems each with its own
`address space can avoid many cache coherency problems
`because each processor has its own associated memory and
`cache. When one processor wishes to access data on a
`remote computing system, communication is explicit. Mes-
`sages are sent
`to move data to another processor and
`messages are received to accept data from another processor
`using standard network protocols such as TCP/IP. Multipro-
`cessor systems using explicit communication including
`transactions such as sends and receives are referred to as
`
systems using multiple private memories. By contrast, mul-
tiprocessor systems using implicit communication including
`transactions such as loads and stores are referred to herein as
`
`using a single address space.
`Multiprocessor schemes using separate computer systems
`allow more processors to be interconnected while minimiz-
`ing cache coherency problems. However,
`it would take
`substantially more time to access data held by a remote
`processor using a network infrastructure than it would take
`to access data held by a processor coupled to a system bus.
`Furthermore, valuable network bandwidth would be con-
`sumed moving data to the proper processors. This can
`negatively impact both processor and network performance.
`Performance limitations have led to the development of a
`point-to-point architecture for connecting processors in a
`system with a single memory space. In one example, indi-
`vidual processors can be directly connected to each other
`through a plurality of point-to-point links to form a cluster
`of processors. Separate clusters of processors can also be
`connected. The point-to-point links significantly increase the
`bandwidth for coprocessing and multiprocessing functions.
`However, using a point-to-point architecture to connect
`multiple processors in a multiple cluster system sharing a
`single memory space presents its own problems.
`Consequently, it is desirable to provide techniques for
`improving data access and cache coherency in systems
`having multiple clusters of multiple processors connected
`using point-to-point links.
`
`SUMMARY OF THE INVENTION
`
`10
`
`15
`
`20
`
`25
`
`30
`
`35
`
`40
`
`45
`
`50
`
`55
`
`60
`
`According to the present invention, methods and appara-
`tus are provided for increasing the efficiency of data access
`in a multiple processor, multiple cluster system. Mecha-
`nisms for reducing the number of transactions in a multiple
`cluster system are provided. In one example, probe filter
`information is used to limit the number of probe requests
`transmitted to request and remote clusters.
`In one embodiment, a computer system is provided. The
`computer system includes a home cluster having a first
`plurality of processors and a home cache coherence con-
`troller. The first plurality of processors and the home cache
`coherence controller are interconnected in a point-to-point
`20
`
`65
`
`20
`
`
`
`US 7,003,633 B2
`
`3
`architecture. The home cache coherence controller is con-
`
`figured to receive a probe request and probe one or more
`selected clusters. The one or more clusters are selected based
`
`on the characteristics associated with the probe request.
`In another embodiment, a method for managing probes is
`provided. A probe request is received at a home cache
`coherence controller in a home cluster. The home cluster
`
`includes a first plurality of processors and the home cache
`coherence controller. The first plurality of processors and the
home cache coherence controller are interconnected in a
point-to-point architecture. One or more clusters are selected
`for probing based on the characteristics associated with the
`probe request. The one or more clusters are probed.
`A further understanding of the nature and advantages of
`the present invention may be realized by reference to the
`remaining portions of the specification and the drawings.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`The invention may best be understood by reference to the
`following description taken in conjunction with the accom-
`panying drawings, which are illustrative of specific embodi-
`ments of the present invention.
FIGS. 1A and 1B are diagrammatic representations depict-
ing a system having multiple clusters.
`FIG. 2 is a diagrammatic representation of a cluster
`having a plurality of processors.
`FIG. 3 is a diagrammatic representation of a cache coher-
`ence controller.
`
`FIG. 4 is a diagrammatic representation showing a trans-
`action flow for a data access request from a processor in a
`single cluster.
FIGS. 5A—5D are diagrammatic representations showing
`cache coherence controller functionality.
`FIG. 6 is a diagrammatic representation depicting a trans-
`action flow for
`a probe request with multiple probe
`responses.
`FIG. 7 is a diagrammatic representation showing a cache
`coherence directory.
`FIG. 8 is a diagrammatic representation showing probe
`filter information that can be used to reduce the number of
`
`probes transmitted to various clusters.
`FIG. 9 is a diagrammatic representation showing a trans-
`action flow for probing of a home cluster without probing of
`other clusters.
`
`FIG. 10 is a diagrammatic representation showing a
`transaction flow for probing of a single remote cluster.
`FIG. 11 is a flow process diagram showing the handling
`of a probe request with probe filter information.
`FIG. 12 is
`a diagrammatic representation showing
`memory controller filter information.
`FIG. 13 is a diagrammatic representation showing a
`transaction flow for probing a single remote cluster without
`probing a home cluster.
`FIG. 14 is a flow process diagram showing the handling
`of a probe request at a home cluster cache coherence
`controller using memory controller filter information.
`
`DETAILED DESCRIPTION OF SPECIFIC
`EMBODIMENTS
`
`Reference will now be made in detail to some specific
`embodiments of the invention including the best modes
`contemplated by the inventors for carrying out the invention.
`Examples of these specific embodiments are illustrated in
`the
`accompanying drawings. While
`the
`invention is
`described in conjunction with these specific embodiments, it
`
`15
`
`20
`
`25
`
`30
`
`35
`
`40
`
`45
`
`50
`
`55
`
`60
`
`65
`
`21
`
`will be understood that
`
`it
`
`4
`is not
`
`intended to limit
`
`the
`
`invention to the described embodiments. On the contrary, it
`is intended to cover alternatives, modifications, and equiva-
`lents as may be included within the spirit and scope of the
`invention as defined by the appended claims. Multi-proces-
`sor architectures having point-to-point communication
`among their processors are suitable for implementing spe-
`cific embodiments of the present invention. In the following
`description, numerous specific details are set forth in order
`to provide a thorough understanding of the present inven-
`tion. The present invention may be practiced without some
`or all of these specific details. Well-known process opera-
`tions have not been described in detail
`in order not
`to
`
`unnecessarily obscure the present invention. Furthermore,
`the present application’s reference to a particular singular
entity includes the possibility that the methods and appa-
`ratus of the present invention can be implemented using
`more than one entity, unless the context clearly dictates
`otherwise.
`
`Techniques are provided for increasing data access effi-
`ciency in a multiple processor, multiple cluster system. In a
`point-to-point architecture, a cluster of processors includes
`multiple processors directly connected to each other through
`point-to-point links. By using point-to-point links instead of
`a conventional shared bus or external network, multiple
`processors are used efficiently in a system sharing the same
`memory space. Processing and network efficiency are also
`improved by avoiding many of the bandwidth and latency
`limitations of conventional bus and external network based
`
`multiprocessor architectures. According to various embodi-
`ments, however, linearly increasing the number of proces-
`sors in a point-to-point architecture leads to an exponential
`increase in the number of links used to connect the multiple
`processors. In order to reduce the number of links used and
`to further modularize a multiprocessor system using a point-
`to-point architecture, multiple clusters are used.
`According to various embodiments, the multiple proces-
`sor clusters are interconnected using a point-to-point archi-
`tecture. Each cluster of processors includes a cache coher-
`ence controller used to handle communications between
`
`clusters. In one embodiment, the point-to-point architecture
used to connect processors is used to connect clusters as
`well.
`
`By using a cache coherence controller, multiple cluster
`systems can be built using processors that may not neces-
`sarily support multiple clusters. Such a multiple cluster
`system can be built by using a cache coherence controller to
`represent non-local nodes in local transactions so that local
`nodes do not need to be aware of the existence of nodes
`outside of the local cluster. More detail on the cache
`
`coherence controller will be provided below.
`In a single cluster system, cache coherency can be main-
`tained by sending all data access requests through a serial-
`ization point. Any mechanism for ordering data access
`requests is referred to herein as a serialization point. One
`example of a serialization point is a memory controller.
`Various processors in the single cluster system send data
`access requests to the memory controller. In one example,
`the memory controller is configured to serialize or lock the
`data access requests so that only one data access request for
`a given memory line is allowed at any particular time. If
`another processor attempts to access the same memory line,
`the data access attempt is blocked until the memory line is
`unlocked. The memory controller allows cache coherency to
`be maintained in a multiple processor, single cluster system.
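The locking behavior described above can be sketched with per-line locks: the first data access request for a memory line proceeds, and later requests for the same line block until the line is unlocked. This is only a software analogy under assumed names (the class and method names are mine, not the patent's):

```python
import threading
from contextlib import contextmanager

class SerializationPoint:
    """Serialize data access requests per memory line, as a memory
    controller does: only one transaction for a given line may proceed
    at a time; later requests for that line block until it is unlocked."""

    def __init__(self) -> None:
        self._registry_lock = threading.Lock()
        self._line_locks = {}  # address -> threading.Lock

    @contextmanager
    def access(self, address: int):
        # Find or create the lock for this memory line.
        with self._registry_lock:
            line_lock = self._line_locks.setdefault(address, threading.Lock())
        line_lock.acquire()  # blocks while another access holds this line
        try:
            yield
        finally:
            line_lock.release()  # "unlock" the memory line
```

Two processors issuing read-modify-write sequences on the same line through such a serialization point are forced to run one at a time, so no update is lost.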
`A serialization point can also be used in a multiple
`processor, multiple cluster system where the processors in
`
`21
`
`
`
`US 7,003,633 B2
`
`5
`the various clusters share a single address space. By using a
`single address space, internal point-to-point links can be
`used to significantly improve intercluster communication
`over traditional external network based multiple cluster
`systems. Various processors in various clusters send data
`access requests to a memory controller associated with a
`particular cluster such as a home cluster. The memory
`controller can similarly serialize all data requests from the
`different clusters. However, a serialization point in a mul-
`tiple processor, multiple cluster system may not be as
`efficient as a serialization point in a multiple processor,
`single cluster system. That is, delay resulting from factors
`such as latency from transmitting between clusters can
`adversely affect the response times for various data access
`requests. It should be noted that delay also results from the
`use of probes in a multiple processor environment.
`Although delay in intercluster transactions in an architec-
`ture using a shared memory space is significantly less than
`the delay in conventional message passing environments
`using external networks such as Ethernet or Token Ring,
`even minimal delay is a significant factor. In some applica-
`tions, there may be millions of data access requests from a
`processor in a fraction of a second. Any delay can adversely
`impact processor performance.
`According to various embodiments, probe management is
`used to increase the efficiency of accessing data in a multiple
`processor, multiple cluster system. A mechanism for elicit-
`ing a response from a node to maintain cache coherency in
`a system is referred to herein as a probe. In one example, a
`mechanism for snooping a cache is referred to as a probe. A
`response to a probe can be directed to the source or target of
`the initiating request. Any mechanism for filtering or reduc-
`ing the number of probes and probe requests transmitted to
`various nodes is referred to herein as managing probes. In
one example, managing probes entails characterizing a probe
`request to determine if a probe can be transmitted to a
`reduced number of entities.
`
`In typical implementations, probe requests are sent to a
`memory controller that broadcasts probes to various nodes
`in a system. In such a system, no knowledge of the cache line
`state is known. All nodes in the system are probed and the
`request cluster receives a response from each node. In a
`system with a coherence directory, state information asso-
`ciated with various memory lines can be used to reduce the
`number of transactions. Any mechanism for maintaining
`state information associated with various memory lines is
`referred to herein as a coherence directory. A coherence
`directory typically includes information for memory lines in
`a local cluster that are cached in a remote cluster. According
`to various embodiments, a coherence directory is used to
`reduce the number of probes to remote quads by inferring
`the state of local caches. In other embodiments, a coherence
`directory is used to eliminate the transmission of a request
`to a memory controller in a home cluster.
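As a sketch of the probe reduction just described (the function name and state strings are illustrative assumptions), a directory lookup replaces a broadcast to every cluster with probes to only the clusters the occupancy vector names:

```python
def clusters_to_probe(state, occupancy, all_clusters):
    """Return the set of clusters that must be probed for a request.

    Without directory information every cluster would be probed (a
    broadcast); with it, an invalid line needs no remote probes, and
    other states need only the clusters recorded as caching the line.
    """
    if state is None:
        return set(all_clusters)  # no directory knowledge: broadcast
    if state == "invalid":
        return set()              # no remote copies exist to probe
    return set(occupancy)         # probe only recorded sharers/owner
```

For a shared line cached in clusters 1 and 3 of a four-cluster system, the directory cuts four probes down to two.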
`FIG. 1A is a diagrammatic representation of one example
`of a multiple cluster, multiple processor system that can use
`the techniques of the present invention. Each processing
`cluster 101, 103, 105, and 107 can include a plurality of
`processors. The processing clusters 101, 103, 105, and 107
`are connected to each other through point-to-point links
111a—f. In one embodiment, the multiple processors in the
`multiple cluster architecture shown in FIG. 1A share the
`same memory space. In this example,
`the point-to-point
`links 111a—f are internal system connections that are used in
place of a traditional front-side bus to connect the multiple
processors in the multiple clusters 101, 103, 105, and 107.
`The point-to-point
`links may support any point-to-point
`coherence protocol.
`FIG. 1B is a diagrammatic representation of another
`example of a multiple cluster, multiple processor system that
`can use the techniques of the present invention. Each pro-
`cessing cluster 121, 123, 125, and 127 can be coupled to a
switch 131 through point-to-point links 141a—d. It should be
`noted that using a switch and point-to-point links allows
`implementation with fewer point-to-point links when con-
`necting multiple clusters in the system. A switch 131 can
`include a processor with a coherence protocol interface.
`According to various implementations, a multicluster sys-
`tem shown in FIG. 1A is expanded using a switch 131 as
`shown in FIG. 1B.
`
`FIG. 2 is a diagrammatic representation of a multiple
`processor cluster, such as the cluster 101 shown in FIG. 1A.
Cluster 200 includes processors 202a—202d, one or more
Basic I/O systems (BIOS) 204, a memory subsystem com-
prising memory banks 206a—206d, point-to-point commu-
nication links 208a—208e, and a service processor 212. The
`point-to-point communication links are configured to allow
interconnections between processors 202a—202d, I/O switch
`210, and cache coherence controller 230. The service pro-
`cessor 212 is configured to allow communications with
`processors 202a—202d, I/O switch 210, and cache coherence
`controller 230 via a JTAG interface represented in FIG. 2 by
`links 214a—214f. It should be noted that other interfaces are
`supported. It should also be noted that in some implemen-
`tations, a service processor is not
`included in multiple
`processor clusters. I/O switch 210 connects the rest of the
`system to I/O adapters 216 and 220.
`According to specific embodiments, the service processor
`of the present invention has the intelligence to partition
`system resources according to a previously specified parti-
`tioning schema. The partitioning can be achieved through
`direct manipulation of routing tables associated with the
`system processors by the service processor which is made
`possible by the point-to-point communication infrastructure.
`The routing tables are used to control and isolate various
`system resources,
`the connection