`Glasco
`
`(10) Patent N0.:
`(45) Date of Patent:
`
`US 7,103,636 B2
`*Sep. 5,2006
`
`US007103636B2
`
`(54)
`
`(75)
`
`(73)
`(*)
`
`(21)
`(22)
`(65)
`
`(51)
`
`(52)
`
`(58)
`
`(56)
`
`METHODS AND APPARATUS FOR
`SPECULATIVE PROBING OF A REMOTE
`CLUSTER
`
`Inventor: David B. Glasco, Austin, TX (US)
`
`Assignee: NeWisys, Inc., Austin, TX (US)
`
`Notice:
`
`Subject to any disclaimer, the term of this
`patent is extended or adjusted under 35
`U.S.C. 154(b) by 413 days.
`
`This patent is subject to a terminal dis
`claimer.
`
`Appl. No.: 10/157,388
`Filed:
`May 28, 2002
`
`Prior Publication Data
`
`US 2003/0225979 A1 Dec. 4, 2003
`
`Int. Cl.
`G06F 13/14
`G06F 12/16
`
`(2006.01)
`(2006.01)
`
`US. Cl. ..................... .. 709/206; 709/216; 709/217;
`711/141; 711/146
`
`Field of Classi?cation Search ....... .. 709/206i2l3,
`709/2l6i2l8, 219; 711/141*146
`See application ?le for complete search history.
`
`References Cited
`
`U.S. PATENT DOCUMENTS
`
`5/1987 Allen et a1. ............... .. 709/234
`4,667,287 A
`5,166,674 A 11/1992 Baum et a1.
`714/752
`5,191,651 A
`3/1993 Halim et a1. .............. .. 709/250
`(Continued)
`FOREIGN PATENT DOCUMENTS
`
`OTHER PUBLICATIONS
`
`HyperTransport TM [/0 Link Specification Revision 1.03,
`HyperTransport TM Consortium, Oct. 10, 2001, Copyright ©
`2001 HyperTransport Technology Consortium.
`D. E. Culler, J. P. Singh, A. Gupta, “Parallel Computer
`Architecture”, 1999 Morgan Kaufmann, San Francisco, CA
`USA XP002277658.
`Andrew Tanenbaum, “Computer Networks”, Computer Net
`Works, London: Prentice Hall lntemational, GB, 1996, pp.
`3454403, XP002155220.
`GLASCO, David Brian, “Methods And Apparatus For
`Speculative Probing At A Request Cluster,” U.S. Appl. No.
`10/ 106,426, ?led Mar. 22, 2002, Of?ce Action mailed Sep.
`22, 2004.
`GLASCO, David Brian, “Methods And Apparatus For
`Speculative Probing With Early Completion And Delayed
`Request,” U.S. Appl. No. 10/106,430, ?led Mar. 22, 2002,
`Of?ce Action mailed Sep. 23, 2004.
`GLASCO, David Brian, “Methods And Apparatus For
`Speculative Probing With Early Completion And Early
`Request,” U.S. Appl. No. 10/106,299, ?led Mar. 22, 2002,
`Of?ce Action mailed Sep. 22, 2004.
`
`(Continued)
`Primary ExamineriKevin Verbrugge
`(74) Attorney, Agent, or F irmiBeyer Weaver & Thomas,
`LLP.
`
`(57)
`
`ABSTRACT
`
`According to the present invention, methods and apparatus
`are provided for increasing the ef?ciency of data access in a
`multiple processor, multiple cluster system. Techniques are
`provided for speculatively probing a remote cluster from
`either a request cluster or a home cluster. A speculative
`probe associated With a particular memory line is transmit
`ted to the remote cluster before the cache access request
`associated With the memory line is serialized at a home
`cluster. When a non-speculative probe is received at a
`remote cluster, the information associated With the response
`to the speculative probe is used to provide a response to the
`non-speculative probe.
`
`EP
`WO
`
`0978781
`WO0239242
`
`2/2000
`5/2002
`
`36 Claims, 15 Drawing Sheets
`
`CPU
`7014
`
`MC
`723-2
`
`Request
`Cluster 70o
`
`Rmwb:
`Cluster 740
`
`
`
`US 7,103,636 B2
`Page 2
`
`US. PATENT DOCUMENTS
`
`3/1993 Sindhu et 31
`5,195,089 A
`3/1993 Chen et a1. .................. .. 712/3
`5,197,130 A
`4/1994 Fushimi er a1~
`714/23
`5301311 A
`5,371,852 A 12/1994 Attanasio er a1- --------- -- 709/245
`5,394,555 A * 2/1995 Hunter et a1. ............. .. 7ll/l48
`5,561,768 A 10/1996 Smith ------- --
`-- 712/13
`5,623,644 A
`4/1997 Self et al. ................. .. 713/503
`5,682,512 A 10/1997 Tetrick ..................... .. 711/202
`5,692,123 A 11/1997 Logghe
`5,829,032 A 10/1998 Komufo er a1~ ----------- -- 711/141
`5931938 A
`8/1999 Drogichen er a1- --------- -- 712/15
`6,018,791 A * l/2000 Arimilliet al. .... ..
`. 711/141
`6,038,651 A * 3/2000 VanHuben et a1. ......... .. 712/21
`6,047,332 A
`4/2000 Viswanathan et a1. .... .. 709/245
`6,067,603 A * 5/2000 Carpenter et a1. ........ .. 711/141
`6,148,378 A 11/2000 BordaZ et al. ............ .. 711/147
`6,167,492 A 12/2000 Keller et al. .............. .. 711/154
`6,192,451 B1 * 2/2001 Arimilli er a1
`711/141
`6219775 B1
`4/2001 Wade er a1- ---------------- -- 712/11
`6,226,671 B1
`5/2001 Hagersten et a1. ........ .. 709/215
`6,256,671 B1
`7/2001 StrentZsch et a1.
`709/227
`6,259,701 B1
`7/2001 Shur et a1. ................ .. 370/401
`6’292’705 Bl
`9/2001 Wang e.t a1‘
`6,330,643 B1 * 12/2001 Ar1m1ll1 et a1. ........... .. 711/141
`6,331,983 B1
`12/2001 Haggerty et a1. ......... .. 370/400
`
`1/2002 Fung ........................ .. 713/320
`2002/0004915 A1
`1/2002 Fung
`2002/0007463 A1
`2002/0052914 A1 * 5/2002 Zalewski et a1. ......... .. 709/203
`2002/0083149 A1 >1<
`6/2002 Van Huben et a1‘ __
`709/215
`2002/0083243 A1 * 6/2002 Van Huben et a1. ...... .. 710/107
`2002/0087811 A1 * 7/2002 Khare et a1. .............. .. 711/146
`2002/0156ggg A1 10/2002 Lee et a1‘
`709/224
`2003/0009623 A1
`1/2003 Arimilli et a1. ........... .. 711/119
`2003/0182508 A1
`9/2003 Glasco
`2003/0182509 A1
`9/2003 Glasco
`2003/0182514 A1
`9/2003 Glasco
`2003/0195939 A1 10/2003 Edirisooriya et a1. ..... .. 709/212
`2003/0196047 A1 * 10/2003 Kessler et a1. ............ .. 711/147
`2003/0210655 A1
`11/2003 Glasco
`2003/0212741 A1 110003 Glasco
`2003/0225909 A1 12/2003 Glasco et a1‘
`2003/0225938 A1 l2/2003 Glasco et 31‘
`2003/0225978 A1 12/2003 Glasco
`2003/0233388 A1 12/2003 Glasco et 31,
`2004/0073755 A1 * 4/2004 Webb et a1. .............. .. 711/144
`2004/0098475 A1
`5/2004 Zeitleret a1. ............. .. 709/223
`
`OTHER PUBLICATIONS
`Zei?er et a1” “Methods And Apparatus For Distributing
`System Management Signals”, PCT/US03/34687, lnt’l ?l
`.
`dt 0 t 20 2003 P n. 1
`h
`n .1 dM 28
`mg ae 0'
`’
`’ a 12‘ Seam repo male
`ay
`’
`
`,
`
`,
`
`ms ong e a.
`
`6,334,172 B1 * 12/2001 Arimilli et a1. ........... .. 711/144
`6,338,122 B1 * 1/2002 Baumgartner et a1‘ ____ __ 711/141
`6,370,585 B1
`4/2002 Hagersten et a1. ........ .. 709/238
`6,385,705 B1
`5/2002 Keller et a1. .............. .. 711/154
`6,397,255 B1
`5/2002 Nurenberg et a1. ..
`709/228
`6,405,289 B1 * 6/2002 AI_imi11i et a1~ ~~~~ ~~
`- 711/145
`2’323333 51
`18/588? 2451161?‘ a1‘, 1
`6,490,661 B1
`12/2002 K611616131. .............. .. 711/150
`6,542,926 B1 * 4/2003 Zalewski et a1.
`709/213
`6,578,071 B1
`6/2003 Hagersten et a1.
`709/215
`6,615,319 B1 * 9/2003 Khare et a1. .... ..
`711/141
`6,631,447 B1 * 10/2003 Morioka et a1. ..
`711/141
`6,633,945 B1
`10/2003 Fu et a1. ......... ..
`710/316
`6,633,960 B1 * l0/2003 Kessler et a1. ..... ..
`7ll/l44
`6,704,842 B1 * 3/2004 Janakiraman et a1. ..... .. 711/141
`6,738,870 B1 : 5/2004 Van Huben et 31' ~~~~~~ " 711/150
`6,738,871 B1
`5/2004 Van Huben et a1. ...... .. 711/150
`6,751,698 B1 * 6/2004 Deneroifet a1.
`.. 710/317
`6,751,721 B1 * 6/2004 Webb, Jr. et a1. ........ .. 712/10
`6,754,782 B1 * 6/2004 Arimilli et a1.
`711/144
`6,760,809 B1
`7/2004 Arimilli et a1. ........... .. 711/119
`6,760,819 B1
`7/2004 Dhong et a1.
`6,799,252 B1
`9/2004 Bauman
`6,826,660 B1
`ll/2004 Hagersten et a1. ........ .. 7ll/l53
`6,847,993 B1
`V2005 Novaes et a1~
`-- 709/221
`2332235 51 * Z3882 2m“ {"1"
`ill/i912
`6,920,519 B1
`7/2005 Beukema et a1. .
`710/306
`2001/0014097 A1
`8/2001 Becket a1. ............... .. 370/401
`2001/0037435 A1
`11/2001 Van Doren
`
`,
`
`,
`
`ass e a.
`
`-
`
`~
`
`_
`_
`_
`2094-
`Ze1tler et al., “Methods And Apparatus For D1str1but1ng
`System Management Signals”, PCT/1180364687, lnt’l ?l
`ing date Oct. 20, 2003, lnt’l search report mailed Aug. 17,
`2004.
`U.S. Of?ce Action mailed Jul. 20, 2005, from related Appli
`Cation N°~ 1098,846
`U'tsi 056233118161 giggled NOV‘ 2’ ZOOS’frOm related Apph
`Calon 0'
`.’
`'.
`U-S- _O?_i°e Acnon malled S8P- 21’ 2005’ from related
`APPllcatlon NO- 10/157384
`US O?ice Action mailed Oct 20, 2005, from related
`Application NO. 10/156,893.
`U_S_ Of?ce Action mailed Sep, 21, 2005, from related
`Application NO_ 1()/157,4()9_
`U.S. Of?ce Action mailed Mar. 7, 2005, from related Appli
`Cation Now/106,426
`Us O?iCeACtiOnma?ed Jul 21 2005 fromrelatedA 1._
`'
`’
`’
`ppl
`
`~_
`
`Canon NO- 10009426- _
`U-S~ O?qlce ACUOH malled Mal 10> 2005, from related
`Application NO- 10/106,430
`U.S. Of?ce Action mailed Jul. 21, 2005, from related Appli
`cation No, lO/106,430,
`U.S. Of?ce Action mailed Mar. 10, 2005, from related
`APPHCa‘iOH “1010/1061”
`$318161 ggaglled Jul‘ 21’ 2005’ from related Apph
`'
`’
`'
`* cited by examiner
`
`~
`
`-
`
`-
`
`
`
`U.S. Patent
`
`Sep. 5, 2006
`
`Sheet 1 6f 15
`
`US 7,103,636 B2
`
`Figure 1A
`
`Processing
`Cluster 101
`
`‘
`
`/—l 1 1d
`;' 1,
`'
`"
`
`Processing
`Cluster 103
`
`
`
`llla?: :____::) Processing
`
`Cluster 105
`
`{
`
`i
`
`Processing
`
`Cluster 107
`
`Figure 1B
`
`Processing
`Cluster 121
`
`Processing
`Cluster 123
`
`141a-/“‘;“"'
`
`' i‘kmid
`
`Switch 131
`
`1411)
`
`1
`
`\II‘EII‘I
`
`1410
`
`rlrf
`
`Processing
`Cluster 125
`
`‘
`
`Processing
`Cluster 127
`
`
`
`U.S. Patent
`
`Sep. 5,2006
`
`Sheet 2 of 15
`
`US 7,103,636 B2
`
`8~n\
`
`£8
`
`Nousmfl
`
`
`
`22.930Bofium
`
`3888
`
`£30
`
`uucohosoo
`
`omm.HuSO.5fiOU
`
`III.S8Smsooa
`
`uo_.>.Em
`
`
`
`
`
`UmomuommooobmumomHommmooumm
`
`N5Hommuooum
`
`_SNBfimfim
`
`Boa83
`moaISfiofim
`
`EN§_amon
`
`ENOh
`
`
`
`
`
`
`U.S. Patent
`
`Sep. 5,2006
`
`Sheet 3 of 15
`
`US 7,103,636 B2
`
`momswam
`wfiunom
`
`
`
`momufimcm#8805
`
`
`
`a._..moommuuufiEobnoouoz
`
`mo.5mE
`
`
`
`pomuum.tB£E30300
`
`omm
`
`
`
`U.S. Patent
`
`Sep. 5, 2006
`
`Sheet 4 6f 15
`
`US 7,103,636 B2
`
`02
`
`NAB“.
`
`A
`
`w EswE
`
`DAD
`
`méow
`
`DmU
`
`02
`
`78¢
`
`DmU
`
`73v
`
`
`
`U.S. Patent
`
`Sep. 5, 2006
`
`Sheet 5 6f 15
`
`US 7,103,636 B2
`
`U2
`
`~-mom
`
`L
`
`DmU
`
`méom
`
`DmU
`
`i 2&5
`
`
`
`5602 EuoQéoZ
`
`\/
`
`OE
`
`Tmom
`
`DmU
`
`75m
`
`
`
`U.S. Patent
`
`Sep. 5, 2006
`
`Sheet 6 6f 15
`
`US 7,103,636 B2
`
`mm a?
`
`mémm
`
`\ k
`
`75m
`
`
`
`U.S. Patent
`
`Sep. 5, 2006
`
`Sheet 7 6f 15
`
`US 7,103,636 B2
`
`wrmvm
`
`l
`
`DmU
`
`MAE
`
`om 65mm
`
`Néwm
`
`\“/
`
`
`
`$602 $03.52
`
`mnm
`
`Tmwm
`
`DmU
`
`72%
`
`DB6 \ / mwm
`
`Cum
`
`mqm
`
`
`
`U.S. Patent
`
`Sep. 5, 2006
`
`Sheet 8 6f 15
`
`US 7,103,636 B2
`
`U2
`
`mlmwm
`
`mm 65mm
`
`
`
` 3% 32 Gm - 3% - u o A Q2 1
`
`,H
`
`a;
`
`7.5m
`
`V
`
`@8652
`
`3%
`
`
`
`U.S. Patent
`
`eS
`
`D..
`
`0
`
`Ll
`
`US 7,103,636 B2
`
`2,Q85:205«moswom
`
`moo4
`
`%A
`
`H 5:30MN-HSufiom9U
`%1mmN483%W30E0E83%ozuo4‘ozo
`
`o2&5
`
`28E8E8E838S88938E8280P602.6uPUEo0PB
`
`804
`
`Bofiom
`
`So~Bm3U
`
`
`
`
`U.S. Patent
`
`Sep. 5, 2006
`
`Sheet 10 6f 15
`
`US 7,103,636 B2
`
`02 o u 0 Q 02 o
`
`2i W5 #5 W5“ R6 112? I?
`
`
`
`
`
`32 IE 6.8“ 29. 32 2? 86 Na? Z2 ZR
`
`
`
`0 P6 0 P6 0 P6 T H ‘l o o ‘I P6
`
`w 25min.
`
`1H
`
`2; \
`A /
`
`5 Q
`
`0
`
`
`
`, 82236
`
`
`
`wig 050m
`
`
`
`mow “353%
`
`822620
`
`\ 6
`
`W5 33‘! E T I: u u A 0
`
`2R \
`
`q
`
`a: 895%
`
`82236
`
`
`
`U.S. Patent
`
`Sep. 5, 2006
`
`Sheet 11 6f 15
`
`US 7,103,636 B2
`
`Figure 8
`
`Speculative Probing
`From A Home Cluster
`
`y
`Receive Cache Access
`801 '\ Request Associated With
`A Memory Line
`
`803
`
`Yes
`
`805
`'\_
`
`Send Speculaéive Probe
`Correspon mg To
`Request To Remote
`Cluster
`
`807
`
`'
`Forvvard Request'To
`Serialization Pomt
`
`'
`
`7
`
`809 '\_
`
`Receive Probe From
`Serialization Point
`
`7
`
`311 -\ Broadcast Probes To
`Request And Remote
`Clusters
`
`Y
`
`813
`
`_ X RCCGIVC Probe Response
`
`I
`
`815 \_ Provide Probe Responses
`To Request Cluster
`
`
`
`U.S. Patent
`
`Sep. 5, 2006
`
`Sheet 12 0f 15
`
`US 7,103,636 B2
`
`Figure 9
`
`Speculative Probing
`From A Home Cluster
`
`7
`
`Receive Speculative
`901 "\_ Probe Associated With A
`Memory Line From A
`Home Cluster
`
`903 \ Speculatively Probe Local
`Nodes
`
`Receive Non-Speculative
`905 ’\ Probe Associated With
`The Memory Line From
`The Home Cluster
`
`I
`
`Receive Speculative
`Probe Responses
`
`909
`
`r
`
`Maintain Speculative
`Probe Responses
`
`7
`
`Provide Non-Speculative
`Probe Responses T 0
`Home Cluster Or Request k
`Cluster Using The Results
`Of The Speculative Probe
`
`Remove Speculative
`913 \ Probe From The Pending
`Buffer
`
`
`
`U.S. Patent
`
`Sep. 5,2006
`
`Sheet 13 of 15
`
`US 7,103,636 B2
`
`WE:
`
`
`
`U.S. Patent
`
`Sep. 5, 2006
`
`Sheet 14 6f 15
`
`US 7,103,636 B2
`
`Figure 11
`
`peculative Probni
`From A Request
`Cluster
`
`Receive Cache Access
`1101 '\_ Request Associated With
`A Memory Line
`
`I
`
`Forward Cache Access
`1103 —\ Request To Home Cluster
`
`1105 p\.
`
`Send Speculative Probe
`Corresponding To
`Request To All Remote
`Clusters
`
`1107 '\ Receive Probe From
`Home Cluster
`
`7
`
`1109 N Receive Probe Responses
`From Home Cluster And
`Remote Cluster
`
`
`
`U.S. Patent
`
`Sep. 5, 2006
`
`Sheet 15 6f 15
`
`US 7,103,636 B2
`
`Figure 12
`
`peculative Probm
`From A Request
`Cluster
`
`7
`Receive S eculative
`1201 X Probe Associated With A
`Memory Line From A
`Request Cluster
`
`1 203
`
`Yes
`
`Drop Speculative Probe
`
`'
`_
`1215 \ Receive Speculative
`Probe Responses
`
`1217
`
`\?
`
`V
`
`.
`
`.
`
`.
`
`Maintain S eculative
`Probe; Regponses
`l
`
`1207 \_ speculati?gilgobe Local
`
`7
`
`Receive Non-Speculative
`1209 \ Probe Associated With
`The Memory Line From
`The Home Cluster
`
`121 1
`
`V
`
`_
`_
`Provide Nonspeculative
`Probe Responses To
`\ Home Cluster Or Request ‘
`Cluster Using The Results
`Of The Speculative Probe
`
`7
`
`Remove Speculative
`1213 '\ Probe From The Pending
`Buffer
`
`
`
`US 7,103,636 B2
`
`1
`METHODS AND APPARATUS FOR
`SPECULATIVE PROBING OF A REMOTE
`CLUSTER
`
`CROSS-REFERENCE TO RELATED
`APPLICATIONS
`
`The present application is related to US. application Ser.
`No. 10/ 106,426 titled Methods And Apparatus For Specu
`lative Probing At A Request Cluster, US. application Ser.
`No. 10/ 106,430 titled Methods And Apparatus For Specu
`lative Probing With Early Completion And Delayed
`Request, and US. application Ser. No. 10/106,299 titled
`Methods And Apparatus For Speculative Probing With Early
`Completion And Early Request, the entireties of Which are
`incorporated by reference herein for all purposes. The
`present application is also related to US. application Ser.
`Nos. 10/145,439 and 10/145,438 both titled Methods And
`Apparatus For Responding To A Request Cluster by David
`B. Glasco ?led on May 13, 2002, the entireties of Which are
`incorporated by reference for all purposes. Furthermore, the
`present application is related to concurrently ?led US.
`application Ser. No. 10/ 157,340 also titled Methods And
`Apparatus For Speculative Probing Of A Remote Cluster by
`David B. Glasco, the entirety of Which is incorporated by
`reference for all purposes.
`The present application is also related to concurrently
`?led US. application Ser. Nos. 10/157,384, 10/156,893, and
`10/157,409 titled Transaction Management In Systems Hav
`ing Multiple Multi-Processor Clusters, Routing Mechanisms
`In Systems Having Multiple Multi-Processor Clusters, and
`Address Space Management In Systems Having Multiple
`Multi-Processor Clusters respectively, all by David B.
`Glasco, Carl Zeitler, Rajesh Kota, Guru Prasadh, and Rich
`ard R. Oehler, the entireties of Which are incorporated by
`reference for all purposes.
`
`BACKGROUND OF THE INVENTION
`
`1. Field of the Invention
`The present invention generally relates to accessing data
`in a multiple processor system. More speci?cally, the
`present invention provides techniques for improving data
`access ef?ciency While maintaining cache coherency in a
`multiple processor system having a multiple cluster archi
`tecture.
`2. Description of Related Art
`Data access in multiple processor systems can raise issues
`relating to cache coherency. Conventional multiple proces
`sor computer systems have processors coupled to a system
`memory through a shared bus. In order to optimiZe access to
`data in the system memory, individual processors are typi
`cally designed to Work With cache memory. In one example,
`each processor has a cache that is loaded With data that the
`processor frequently accesses. The cache is read or Written
`by a processor. HoWever, cache coherency problems arise
`because multiple copies of the same data can co-exist in
`systems having multiple processors and multiple cache
`memories. For example, a frequently accessed data block
`corresponding to a memory line may be loaded into the
`cache of tWo different processors. In one example, if both
`processors attempt to Write neW values into the data block at
`the same time, different data values may result. One value
`may be Written into the ?rst cache While a different value is
`Written into the second cache. A system might then be unable
`to determine What value to Write through to system memory.
`A variety of cache coherency mechanisms have been
`developed to address such problems in multiprocessor sys
`
`2
`tems. One solution is to simply force all processor Writes to
`go through to memory immediately and bypass the associ
`ated cache. The Write requests can then be serialized before
`overWriting a system memory line. HoWever, bypassing the
`cache signi?cantly decreases ef?ciency gained by using a
`cache. Other cache coherency mechanisms have been devel
`oped for speci?c architectures. In a shared bus architecture,
`each processor checks or snoops on the bus to determine
`Whether it can read or Write a shared cache block. In one
`example, a processor only Writes an object When it oWns or
`has exclusive access to the object. Each corresponding cache
`object is then updated to alloW processors access to the most
`recent version of the object.
`Bus arbitration is used When both processors attempt to
`Write a shared data block in the same clock cycle. Bus
`arbitration logic decides Which processor gets the bus ?rst.
`Although, cache coherency mechanisms such as bus arbi
`tration are effective, using a shared bus limits the number of
`processors that can be implemented in a single system With
`a single memory space.
`Other multiprocessor schemes involve individual
`processor, cache, and memory systems connected to other
`processors, cache, and memory systems using a netWork
`backbone such as Ethernet or Token Ring. Multiprocessor
`schemes involving separate computer systems each With its
`oWn address space can avoid many cache coherency prob
`lems because each processor has its oWn associated memory
`and cache. When one processor Wishes to access data on a
`remote computing system, communication is explicit. Mes
`sages are sent to move data to another processor and
`messages are received to accept data from another processor
`using standard netWork protocols such as TCP/IP. Multipro
`cessor systems using explicit communication including
`transactions such as sends and receives are referred to as
`systems using multiple private memories. By contrast, mul
`tiprocessor system using implicit communication including
`transactions such as loads and stores are referred to herein as
`using a single address space.
`Multiprocessor schemes using separate computer systems
`alloW more processors to be interconnected While minimiZ
`ing cache coherency problems. HoWever, it Would take
`substantially more time to access data held by a remote
`processor using a netWork infrastructure than it Would take
`to access data held by a processor coupled to a system bus.
`Furthermore, valuable netWork bandWidth Would be con
`sumed moving data to the proper processors. This can
`negatively impact both processor and netWork performance.
`Performance limitations have led to the development of a
`point-to-point architecture for connecting processors in a
`system With a single memory space. In one example, indi
`vidual processors can be directly connected to each other
`through a plurality of point-to-point links to form a cluster
`of processors. Separate clusters of processors can also be
`connected. The point-to-point links signi?cantly increase the
`bandWidth for coprocessing and multiprocessing functions.
`HoWever, using a point-to-point architecture to connect
`multiple processors in a multiple cluster system sharing a
`single memory space presents its oWn problems.
`Consequently, it is desirable to provide techniques for
`improving data access and cache coherency in systems
`having multiple clusters of multiple processors connected
`using point-to-point links.
`
`20
`
`25
`
`30
`
`35
`
`40
`
`45
`
`50
`
`55
`
`65
`
`SUMMARY OF THE INVENTION
`According to the present invention, methods and appara
`tus are provided for increasing the ef?ciency of data access
`
`
`
`US 7,103,636 B2
`
`3
`in a multiple processor, multiple cluster system. Techniques
`are provided for speculatively probing a remote cluster from
`either a request cluster or a home cluster. A speculative
`probe associated With a particular memory line is transmit
`ted to the remote cluster before the cache access request
`associated With the memory line is serialiZed at a home
`cluster. When a non-speculative probe is received at a
`remote cluster, the information associated With the response
`to the speculative probe is used to provide a response to the
`non-speculative probe.
`According to various embodiments, a computer system is
`provided. The computer system includes a request cluster, a
`home cluster, and a remote cluster. The request cluster has
`a plurality of interconnected request cluster processors and
`a request cluster cache coherence controller. The home
`cluster has a plurality of interconnected home processors, a
`serialiZation point, and a home cache coherence controller.
`The remote cluster has a plurality of interconnected remote
`processors and a remote cache coherence controller. The
`remote cluster is con?gured to receive a ?rst probe corre
`sponding to a cache access request from a request cluster
`processor in the request cluster and a second probe corre
`sponding to the cache access request from the home cluster.
`According to other embodiments, a method for a cache
`coherence controller to manage data access in a multipro
`cessor system is provided. The method includes receiving a
`cache access request from a request cluster processor asso
`ciated With a request cluster, forWarding the cache access
`request to a home cluster, and sending a probe associated
`With the cache request to a remote cluster. The home cluster
`includes a home cluster cache coherence controller and a
`serialization point.
`According to still other embodiments, a computer system
`is provided. The computer system includes a ?rst cluster and
`a second cluster. The ?rst cluster includes a ?rst plurality of
`processors and a ?rst cache coherence controller. The ?rst
`plurality of processors and the ?rst cache coherence con
`troller are interconnected in a point-to-point architecture.
`The second cluster includes a second plurality of processors
`and a second cache coherence controller. The second plu
`rality of processors and the second cache coherence con
`troller are interconnected in a point-to-point architecture, the
`?rst cache coherence controller is coupled to the second
`cache coherence controller. The ?rst cache coherence con
`troller is con?gured to receive a cache access request origi
`nating from the ?rst plurality of processors and send a probe
`to a third cluster including a third plurality of processors
`before the cache access request is received by a serialization
`point in the second cluster.
`In still other embodiments, a computer system is pro
`vided. The computer system includes a ?rst cluster and a
`second cluster. The ?rst cluster includes a ?rst plurality of
`processors and a ?rst cache coherence controller. The ?rst
`plurality of processors and the ?rst cache coherence con
`troller are interconnected in a point-to-point architecture.
`The second cluster includes a second plurality of processors
`and a second cache coherence controller. The second plu
`rality of processors and the second cache coherence con
`troller are interconnected in a point-to-point architecture.
`The ?rst cache coherence controller is coupled to the second
`cache coherence controller and constructed to receive a
`cache access request originating from the ?rst plurality of
`processors and send a probe to a third cluster including a
`third plurality of processors before a memory line associated
`With the cache access request is locked.
`In yet another embodiment, a cache coherence controller
`is provided. The cache coherence controller includes inter
`
`20
`
`25
`
`30
`
`35
`
`40
`
`45
`
`50
`
`55
`
`60
`
`65
`
`4
`face circuitry coupled to a request cluster processor in a
`request cluster and a remote cluster cache coherence con
`troller in a remote cluster and a protocol engine coupled to
`the interface circuitry. The protocol engine is con?gured to
`receive a cache access request from the request cluster
`processor and speculatively probe a remote node in the
`remote cluster.
`A further understanding of the nature and advantages of
`the present invention may be realiZed by reference to the
`remaining portions of the speci?cation and the draWings.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`The invention may best be understood by reference to the
`folloWing description taken in conjunction With the accom
`panying draWings, Which are illustrative of speci?c embodi
`ments of the present invention.
`FIGS. 1A and 1B are diagrammatic representation depict
`ing a system having multiple clusters.
`FIG. 2 is a diagrammatic representation of a cluster
`having a plurality of processors.
`FIG. 3 is a diagrammatic representation of a cache coher
`ence controller.
`FIG. 4 is a diagrammatic representation shoWing a trans
`action ?oW for a data access request from a processor in a
`single cluster.
`FIGS. 5Ai5D are diagrammatic representations shoWing
`cache coherence controller functionality.
`FIG. 6 is a diagrammatic representation depicting a trans
`action ?oW probing a remote cluster.
`FIG. 7 is a diagrammatic representation showing a trans
`action ?oW for a speculative probing from a home cluster.
`FIG. 8 is a How process diagram shoWing speculative
`probing from a home cluster.
`FIG. 9 is a How process diagram shoWing speculative
`probing from a home cluster at a remote cluster.
`FIG. 10 is a diagrammatic representation shoWing a
`transaction ?oW for a speculative probing from a request
`cluster.
`FIG. 11 is a How process diagram shoWing speculative
`probing from a request cluster.
`FIG. 12 is a How process diagram shoWing speculative
`probing from a request cluster at a remote cluster.
`
`DETAILED DESCRIPTION OF SPECIFIC
`EMBODIMENTS
`
`Reference Will noW be made in detail to some speci?c
`embodiments of the invention including the best modes
`contemplated by the inventors for carrying out the invention.
`Examples of these speci?c embodiments are illustrated in
`the accompanying draWings. While the invention is
`described in conjunction With these speci?c embodiments, it
`Will be understood that it is not intended to limit the
`invention to the described embodiments. On the contrary, it
`is intended to cover alternatives, modi?cations, and equiva
`lents as may be included Within the spirit and scope of the
`invention as de?ned by the appended claims. Multi
`processor architectures having point-to-point communica
`tion among their processors are suitable for implementing
`speci?c embodiments of the present invention. In the fol
`loWing description, numerous speci?c details are set forth in
`order to provide a thorough understanding of the present
`invention. The present invention may be practiced Without
`some or all of these speci?c details. Well knoWn process
`operations have not been described in detail in order not to
`
`
`
`US 7,103,636 B2
`
`5
`unnecessarily obscure the present invention. Furthermore,
`the present application’s reference to a particular singular
`entity includes that possibility that the methods and appa
`ratus of the present invention can be implemented using
`more than one entity, unless the context clearly dictates
`otherWise.
`Techniques are provided for increasing data access effi
`ciency in a multiple processor, multiple cluster system. In a
`point-to-point architecture, a cluster of processors includes
`multiple processors directly connected to each other through
`point-to-point links. By using point-to-point links instead of
`a conventional shared bus or external network, multiple
`processors are used e?iciently in a system sharing the same
`memory space. Processing and netWork ef?ciency are also
`improved by avoiding many of the bandWidth and latency
`limitations of conventional bus and external netWork based
`multiprocessor architectures. According to various
`embodiments, hoWever, linearly increasing the number of
`processors in a point-to-point architecture leads to an expo
`nential increase in the number of links used to connect the
`multiple processors. In order to reduce the number of links
`used and to further modularize a multiprocessor system
`using a point-to-point architecture, multiple clusters are
`used.
`According to various embodiments, the multiple proces
`sor clusters are interconnected using a point-to-point archi
`tecture. Each cluster of processors includes a cache coher
`ence controller used to handle communications betWeen
`clusters. In one embodiment, the point-to-point architecture
`used to connect processors are used to connect clusters as
`Well.
`By using a cache coherence controller, multiple cluster
`systems can be built using processors that may not neces
`sarily support multiple clusters. Such a multiple cluster
`system can be built by using a cache coherence controller to
`represent non-local nodes in local transactions so that local
`nodes do not need to be aWare of the existence of nodes
`outside of the local cluster. More detail on the cache
`coherence controller Will be provided beloW.
`In a single cluster system, cache coherency can be main
`tained by sending all data access requests through a serial
`ization point. Any mechanism for ordering data access
`requests is referred to herein as a serialization point. One
`example of a serialization point is a memory controller.
`Various processors in the single cluster system send data
`access requests to the memory controller. In one example,
`the memory controller is con?gured to serialize or lock the
`data access requests so that only one data access request for
`a given memory line is alloWed at any particular time. If
`another processor attempts to access the same memory line,
`the data access attempt is blocked until the memory line is
`unlocked. The memory controller alloWs cache coherency to
`be maintained in a multiple processor, single cluster system.
`A serialization point can also be used in a multiple
`processor, multiple cluster system Where the processors in
`the various clusters share a single address space. By using a
`single address space, internal point-to-point links can be
`used to signi?cantly improve intercluster communication
`over traditional external netWork based multiple cluster
`systems. Various processors in various clusters send data
`access requests to a memory controller associated With a
`particular cluster such as a home cluster. The memory
`controller can similarly serialize all data requests from the
`different clusters. HoWever, a serialization point in a mul
`tiple processor, multiple cluster system may not be as
`ef?cient as a serialization point in a multiple processor,
`
`20
`
`25
`
`30
`
`35
`
`40
`
`45
`
`50
`
`55
`
`60
`
`65
`
`6
`single cluster system. That is, delay resulting from factors
`such as latency from transmitting betWeen clusters can
`adversely affect the response times for various data access
`requests. It should be noted that delay also results from the
`use of probes in a multiple processor environment.
`Although delay in intercluster transactions in an architec
`ture using a shared memory space is signi?cantly less than
`the delay in conventional message passing environments
`using external netWorks such as Ethernet or Token Ring,
`even minimal delay is a signi?cant factor. In some
`applications, there may be millions of data access requests
`from a processor in a fraction of a second. Any delay can
`adversely impact processor performance.
`According to various embodiments, speculative probing
`is used to increase the e?iciency of accessing data in a
`multiple processor, multiple cluster system. A mechanism
`for eliciting a response from a node to maintain cache
`coherency in a system is referred to herein as a probe. In one
`example, a mechanism for snooping a cache is referred to as
`a probe. A response to a probe can be directed to the source
`or target of the initiating request. Any mechanism for
`sending probes to nodes associated With cache blocks before
`a request associated With the probes is received at a serial
`ization point is referred to herein as speculative probing.
`According to various embodiments, the reordering or
`elimination of certain data access requests do not adversely
`affect cache coherency. That is, the end value in the cache is
`the same Whether or not snooping occurs. For example, a
`local processor attempting to read the cache data block can
`be alloWed to access the data block Without sending the
`requests through a serial