`
`1111111111111111111111111111111111111111111111111111111111111
`US007003633B2
`
`(12) United States Patent
`Glasco
`
`(10) Patent No.:
`(45) Date of Patent:
`
`US 7,003,633 B2
`*Feb.21,2006
`
`(54) METHODS AND APPARATUS FOR
`MANAGING PROBE REQUESTS
`
`(75)
`
`Inventor: David B. Glasco, Austin, TX (US)
`
`(73) Assignee: Newisys, Inc., Austin, TX (US)
`
`( *) Notice:
`
`Subject to any disclaimer, the term of this
`patent is extended or adjusted under 35
`U.S.C. 154(b) by 205 days.
`
`This patent is subject to a terminal dis(cid:173)
`claimer.
`
`6,085,295 A
`6,108,737 A
`6,122,715 A
`6,148,378 A
`6,167,492 A
`6,173,393 B1
`6,189,078 B1
`6,192,451 B1
`6,205,520 B1
`6,209,055 B1
`6,292,705 B1
`6,292,906 B1
`6,330,643 B1
`6,334,172 B1
`
`7/2000 Ekanadham et a!.
`8/2000 Sharma eta!.
`9/2000 Palanca et a!.
`11/2000 Bordaz eta!. .............. 711!147
`12/2000 Keller et a!. ................ 711!154
`1!2001 Palanca et a!.
`2/2001 Bauman et a!. ............. 711!156
`2/2001 Arimilli eta!. ............. 711!141
`3/2001 Palanca et a!.
`3/2001 Van Doren et a!.
`9/2001 Wang eta!.
`9/2001 Fu eta!.
`12/2001 Arimilli et a!. ............. 711!141
`12/2001 Arimilli et a!. ............. 711!144
`
`(21) Appl. No.: 10/288,347
`
`(22) Filed:
`
`Nov. 4, 2002
`
`(65)
`
`Prior Publication Data
`
`US 2004/0088492 Al May 6, 2004
`
`(Continued)
`
`FOREIGN PATENT DOCUMENTS
`
`wo
`
`W00239242
`
`5!2002
`
`OTHER PUBLICATIONS
`
`(51)
`
`Int. Cl.
`(2006.01)
`G06F 12/00
`(52) U.S. Cl. ....................... 711/146; 711!141; 709/216;
`709/218
`(58) Field of Classification Search ................ 711/141,
`711/146,144, 145; 709/206,213,216,217,
`709/218, 219
`See application file for complete search history.
`
`(56)
`
`References Cited
`
`Multicast snooping: a new coherence method using a mul(cid:173)
`ticast address network Bilir, E.E.; Dickson, R.M.; Ying Hu;
`Plakal, M.; Sarin, D.J.; Hill, M.D.; Wood, D.A.; Computer
`Architecture, 1999. Proceedings of the 26th International
`Symposium on, May 2-4, 1999.*
`
`(Continued)
`
`Primary Examiner-Brian R Peugh
`(74) Attorney, Agent, or Firm-Beyer Weaver & Thomas,
`LLP
`
`U.S. PATENT DOCUMENTS
`
`(57)
`
`ABSTRACT
`
`5,195,089 A
`5,394,555 A
`5,524,212 A *
`5,692,123 A
`5,751,995 A *
`5,829,032 A
`5,893,151 A *
`6,018,791 A
`6,038,652 A
`6,052,769 A
`6,067,603 A
`6,073,210 A
`
`3/1993 Sindhu
`2/1995 Hunter et a!. ............... 711!148
`6/1996 Somani et a!. ...... ... ..... 711!121
`11/1997 Logghe
`5/1998 Sarangdhar ................. 711!145
`10/1998 Komuro eta!. .... .. ... .... 711!141
`4/1999 Merchant .................... 711!140
`1!2000 Arimilli et a!. ............. 711!141
`3/2000 Van Ruben et a!.
`.......... 712/21
`4/2000 Huff et a!.
`5!2000 Carpenter et a!.
`6/2000 Palanca et a!.
`
`According to the present invention, methods and apparatus
`are provided for increasing the efficiency of data access in a
`multiple processor, multiple cluster system. Mechanisms for
`reducing the number of transactions in a multiple cluster
`system are provided. In one example, probe filter informa(cid:173)
`tion is used to limit the number of probe requests transmitted
`to request and remote clusters.
`
`39 Claims, 17 Drawing Sheets
`
`r::.~
`~;~~~
`oro'T""t
`""'~"""~I
`1105 "-L foJ"wardROQUe£!To J
`
`~WilhA
`Mom Uno
`
`MomoryCorllroHer
`- - ·· - -
`
`Petition for Inter Partes Review of
`U.S. Pat. No. 7,296,121
`IPR2015‐00158
`EXHIBIT
`Sony‐
`
`1
`
`
`
`US 7,003,633 B2
`Page 2
`
`U.S. PATENT DOCUMENTS
`
`Specifying and verifying a broadcast and a multicast snoop(cid:173)
`ing cache coherence protocol Sarin, D.J.; Plakal, M.;
`Condon, A.E.; Hill, M.D.; Martin, M.M.K.; Wood, D.A.;
`Parallel and Distributed Systems, IEEE Transactions on ,
`vol.: 13 , Issue: 6 , Jun. 2002.*
`Kim, et al., "Power-aware Partitioned Cache Architectures",
`© 2001 ACM, p. 6467. *
`Powell, et al., "Reducing Set-Associative Cache Energy via
`Way-Prediction and Selective Direct-Mapping", © 2001
`IEEE, p. 54-65.*
`HyperTransport TM I/0 Link Specification Revision 1.03,
`HyperTransport TM Consortium, Oct. 10, 2001, Copyright©
`HyperTransport Technology Consortium.
`PCT Search Report PCT/US03/34756, Int'l filing date Oct.
`30, 2003, Search report Mailed Dec. 16, 2004.
`U.S. Office Action mailed Sep. 22, 2004, from related U.S.
`Appl. No. 10/106,426.
`U.S. Office Action mailed Mar. 7, 2005, from related U.S.
`Appl. No. 10/106,426.
`U.S. Office Action mailed Jul. 21, 2005, from related U.S.
`Appl. No. 10/106,426.
`U.S. Office Action mailed Sep. 23, 2004, from related U.S.
`Appl. No. 10/106,430.
`U.S. Office Action mailed Mar. 10, 2005, from related U.S.
`Appl. No. 10/106,430.
`U.S. Office Action mailed Jul. 21, 2005, from related U.S.
`Appl. No. 10/106,430.
`U.S. Office Action mailed Sep. 22, 2004, from related U.S.
`Appl. No. 10/106,299.
`U.S. Office Action mailed Mar. 10, 2005, from related U.S.
`Appl. No. 10/106,299.
`U.S. Office Action mailed Jul. 21, 2005, from related U.S.
`Appl. No. 10/106,299.
`D. E. Culler, J. P. Singh, A Gupta, "Parallel Computer
`Architecture", 1999 Morgan Kaufmann, San Francisco, CA
`USA XP002277658.
`Andrew Tanenbaum, "Computer Networks", Computer
`Networks, London: Prentice Hall International, GB, 1996,
`pp. 345-403, XP002155220.
`U.S. Office Action mailed Jul. 20, 2005, from related Ap(cid:173)
`plication No. 10/608,846.
`U.S. Office Action mailed Sep. 9, 2005, from related Ap(cid:173)
`plication No. 10/462,015.
`U.S. Office Action mailed Sep. 9, 2005, from related Ap(cid:173)
`plication No. 10/426,084.
`U.S. Office Action mailed Nov. 2, 2005, from related Ap(cid:173)
`plication No. 10/106/430.
`U.S. Office Action mailed Oct. 5, 2005, from related Ap(cid:173)
`plication No. 10/635,703.
`
`* cited by examiner
`
`6,338,122 B1
`6,343,347 B1
`6,385,705 B1
`6,405,289 B1
`6,463,529 B1
`6,467,007 B1
`6,490,661 B1
`6,542,926 B1
`6,615,319 B1
`6,631,447 B1
`6,633,945 B1
`6,633,960 B1
`6,636,906 B1 *
`6,640,287 B1
`6,658,526 B1
`6,665,767 B1
`6,704,842 B1
`6,738,870 B1
`6,738,871 B1
`6,751,698 B1
`6,751,721 B1 *
`6,754,782 B1
`6,760,809 B1
`6,760,819 B1
`6,799,252 B1 *
`6,865,595 B1
`6,892,282 B1
`2001!0013089 A1
`2001!0037435 A1
`2002/0007463 A1
`2002/0046327 A1
`2002/0052914 A1
`2002/0083149 A1 *
`2002/0083243 A1
`2002/0087807 A1 *
`2002/0087811 A1
`2003/0009623 A1
`2003/0182508 A1
`2003/0182509 A1
`2003/0182514 A1
`2003/0195939 A1
`2003/0196047 A1
`2003/0210655 A1
`2003/0212741 A1
`2003/0233388 A1
`2004/0073755 A1
`2004/0088493 A1
`2004/0088494 A1
`2004/0255002 A1
`
`711!150
`... ... .. 711!150
`
`............. 711!119
`
`1!2002 Baumgartner et a!.
`1!2002 Arimilli et a!.
`5!2002 Keller et a!. ................ 711!154
`6/2002 Arimilli et a!.
`............. 711!145
`10/2002 Miller et a!.
`10/2002 Armstrong et a!.
`12/2002 Keller et a!. ................ 711!150
`........... 709/213
`4/2003 Zalewski et a!.
`9/2003 Khare et a!.
`10/2003 Morioka et a!.
`10/2003 Fu et a!. ..................... 710/316
`10/2003 Kessler et a!.
`10/2003 Sharma eta!. ................ 710/22
`10/2003 Gharachorloo et a!.
`12/2003 Nguyen et a!.
`12/2003 Comisky et a!.
`3/2004 Janakiraman eta!.
`5!2004 Van Ruben et a!.
`5!2004 Van Ruben et a!.
`6/2004 Deneroff et a!.
`6/2004 Webb et a!.
`.................. 712/10
`6/2004 Arimilli et a!.
`7/2004 Arimilli et a!.
`7/2004 Dhong et a!.
`9/2004 Bauman ..................... 711!149
`3/2005 Glasco
`5!2005 Rass et a!. .................. 711!146
`8/2001 Weber
`11/2001 Van Doren
`1!2002 Fung
`4/2002 Gharachorloo et a!.
`5!2002 Zalewski eta!. ........... 709/203
`........ 709/215
`6/2002 Van Ruben et a!.
`6/2002 Van Ruben ................. 710/107
`7/2002 Gharachorloo eta!. ..... 711!141
`7/2002 Khare eta!.
`1!2003 Arimilli et a!.
`9/2003 Glasco
`9/2003 Glasco
`9/2003 Glasco
`10/2003 Edirisooriya et a!. ....... 709/212
`10/2003 Kessler et a!.
`11/2003 Glasco
`11/2003 Glasco
`12/2003 Glasco et a!.
`4/2004 Webb eta!. ................ 711!144
`5!2004 Glasco ....................... 711!141
`5!2004 Glasco
`12/2004 Kota eta!.
`
`............. 711!119
`
`01HER PUBLICATIONS
`
`Bandwidth adaptive snooping Martin, M.M.K.; Sarin, D.J.;
`Hill, M.D.; Wood, D.A.; High-Performance Computer
`Architecture, 2002. Proceedings. Eighth International Sym(cid:173)
`posium on , Feb. 2-6, 2002; pp. 251-262.*
`
`2
`
`
`
`U.S. Patent
`
`Feb.21,2006
`
`Sheet 1 of 17
`
`US 7,003,633 B2
`
`Figure lA
`
`Processing
`Cluster 101
`
`Processing
`Cluster 105
`
`'
`'
`'
`'
`
`'
`'
`'
`'
`'
`
`'
`'
`'
`'
`
`'
`'
`'
`'
`'
`
`Processing
`Cluster 103
`
`- ----- rl11c
`• .. ----- ~
`
`Processing
`Cluster 107
`
`Figure lB
`
`Processing
`Cluster 121
`
`Processing
`Cluster 123
`
`Switch 131
`.-----1_41---,b~J: ::>
`
`____ --~.--14_1_c ___ ---.
`
`1,_ · - --"
`
`Processing
`Cluster 125
`
`Processing
`Cluster 127
`
`3
`
`
`
`ll
`
`06a
`
`Figure 2
`
`Remote Clusters
`
`:'\/--208a
`
`· ... ·
`
`,-2oo
`
`~ I 206
`
`r - - - Processor 202a
`
`f\r232b
`:, ,1
`'·'
`
`Cache
`Coherence
`Controller 230
`
`.c;r-232a
`·,
`.'
`·-·
`
`Processor 202b -
`
`rl
`
`,- ·- ··:
`·- --[(_208e
`
`r-214f
`
`I
`
`I
`
`I
`
`1
`
`r - - - - - ' ' -_ - - - ,
`Service
`J--+1
`Processor 212
`
`214a
`
`;-214c
`
`214b_/
`
`.. - ---.
`V"""214c 232c:.P --·
`
`..-~· --.
`·- -- (_232d
`
`'-214d
`
`208b-~
`:· -- --)
`· - --
`
`I!
`
`r - Processor 202c
`
`(':
`
`(\
`
`Processor 202d f-------1
`
`208d-/
`
`~
`
`~
`
`'·\._208c
`
`W6c
`
`J; !10216
`
`V0220 Q =
`
`VO Switch 210
`
`i
`I BIOS I
`
`204
`
`I[
`206d -~
`
`d •
`\Jl
`•
`~
`~ ......
`~ = ......
`
`"!"j
`~
`?'
`N
`'"""' ~
`N c c
`
`0'1
`
`'JJ. =(cid:173)~
`~ .....
`N
`0 ......,
`'"""'
`---l
`
`e
`
`rJ'l
`
`"'""-l
`Q
`Q
`~
`
`~
`~
`~
`N
`
`4
`
`
`
`U.S. Patent
`US. Patent
`
`Feb.21,2006
`F
`60
`
`Sheet 3 of 17
`whS
`U
`
`US 7,003,633 B2
`US 7,003,633 B2
`
`
`
`magnum.fmomShammomofiwam608onM.
`
`
`
`
`
`<n
`0
`rl)
`
`Q)
`t=:
`.......
`bJ)
`t=:
`t:r.J.
`__,
`0
`(.)
`0
`+-'
`0
`1-o
`~
`
`.......
`......
`
`(Y)
`
`11.)
`(.)
`
`
`
`:mooflbfimE23852
`
`~
`1-o
`Q)
`.......
`.8
`+-'
`s:::
`11)
`1-o
`tU
`..Q
`0
`(.)
`~
`
`0 z
`
`
`
`
`
`
`
`("tj
`
`mopsmE
`
`(]..)
`~
`~
`bl)
`• .-I
`~
`
`
`
`t--
`0
`l')
`QJ
`(.)
`
`~
`1-o
`v
`+-'
`..8
`.....
`t=:
`Q)
`1-o
`Q)
`....c:
`0
`u
`
`
`
`m.pomuumfififiEEEoUn,f.momm
`
`
`
`5
`
`
`
`
`
`
`U.S. Patent
`US. Patent
`
`9
`Feb.21,2006
`Feb. 21
`2006
`
`Sheet 4 of 17
`Sheet 4 0f 17
`
`US 7,003,633 B2
`US 7,003,633 B2
`
`vogwfi
`
`
`
`
`
`6
`
`
`
`U.S. Patent
`US. Patent
`
`Feb.21,2006
`Feb. 21, 2006
`
`Sheet 5 of 17
`Sheet 5 0f 17
`
`US 7,003,633 B2
`US 7,003,633 B2
`
`u~
`;:88
`
`l()
`
`fl2&5
`
`3quHmoodéoz
`
`l()
`......:10
`l()
`
`2m
`
`7
`
`
`
`U.S. Patent
`US. Patent
`
`Feb. 21, 2006
`Feb.21,2006
`
`Sheet 6 0f 17
`Sheet 6 of 17
`
`US 7,003,633 B2
`US 7,003,633 B2
`
`0 Z
`
`H
`8m
`Otn
`*9
`C:
`
`U]
`a.)
`*0
`
`0 ZH
`
`Figure5B
`
`8
`
`
`
`U.S. Patent
`US. Patent
`
`Feb.21,2006
`Feb. 21, 2006
`
`Sheet 7 of 17
`Sheet 7 0f 17
`
`US 7,003,633 B2
`US 7,003,633 B2
`
`nu
`
`m-m<m
`
`meu
`
`m-HVm
`
`Hymnv
`
`N-HVm
`
`om05mm
`
`
`
`m¢m
`
`hwm
`
`awn
`
`Hmm
`
`.......
`u M ~~--------------~
`
`"¢
`lf)
`
`
`
`Hlva_-HVmUfigmU
`
`
`
`mowozQuoqéoz
`
`mmm
`
`9
`
`
`
`
`U.S. Patent
`US. Patent
`
`Feb.21,2006
`Feb. 21, 2006
`
`Sheet 8 of 17
`Sheet 8 0f 17
`
`US 7,003,633 B2
`US 7,003,633 B2
`
`10
`
`M
`
`
`u ~ ~--------------------~
`'D
`If)
`
`mm8&5
`
`
`
`“804952
`
`..-
`0 ~ ~--------------~
`\0
`tr)
`
`monoz
`
`2%
`
`10
`
`
`
`Figure 6
`
`C
`
`L ~CPU
`60:-2-\ 6:7
`60!-2
`
`609
`
`I
`
`J
`
`c
`621-3
`t
`
`L
`
`C
`
`L
`
`• C
`
`l/ 645 : \
`641-1 ~ 647 !--; 641-2
`L6~91
`
`~c 603-1
`
`Request
`Cluster 600
`
`L
`
`L
`627
`
`1/ 625 r-
`c f-+ MC f-Jo
`\ \ 62~-2
`
`621-1
`
`623-1
`
`Home
`Cluster 620
`
`Cluster 640
`
`c
`603-3
`
`CPU
`~
`601-3
`
`c ~
`
`CPU
`601-4
`
`603-4
`
`I
`/
`
`I
`c /
`621-4 I
`
`c
`~
`603-5
`
`...
`I c
`MC:
`~1-5 ~ 623-2
`
`d •
`\Jl
`•
`~
`~ ......
`~ = ......
`
`"'!"j
`~
`?'
`N
`'"""' ~
`N c c
`
`0'1
`
`'JJ. =(cid:173)~
`~ .....
`'0
`0 ......,
`'"""'
`-..J
`
`e
`
`rJ'l
`
`"'""-l
`Q
`Q
`~
`
`~
`~
`~
`N
`
`11
`
`
`
`U.S. Patent
`
`Feb.21,2006
`
`Sheet 10 of 17
`
`US 7,003,633 B2
`
`Figure 7
`
`Coherence Directory 701
`
`Memory Line 711
`
`State 713
`
`Address 721
`
`Address 731
`
`Address 741
`
`Address 751
`
`Address 761
`
`Address 771
`
`Invalid
`
`Invalid
`
`Shared
`
`Shared
`
`Owned
`
`Owned
`
`Dirty Data
`Owner715
`NIA
`
`NIA
`
`N/A
`
`N/A
`
`Occupancy Vector
`717
`
`N/A
`
`N/A
`
`Clusters 1 ,3
`
`Clusters 1, 2, 3, 4
`
`Cluster4
`
`Cluster 2, 3, 4
`
`Cluster 2
`
`Cluster 2, 4
`
`Address 781
`
`Modified
`
`Cluster 2
`
`Address 791
`
`Modified
`
`Cluster 3
`
`...
`
`...
`
`. ..
`
`NIA
`
`N/A
`
`...
`
`12
`
`
`
`U.S. Patent
`
`Feb.21,2006
`
`Sheet 11 of 17
`
`US 7,003,633 B2
`
`Figure 8
`
`Probe Filter
`Information 821
`
`Read Block (Read) 823
`
`Read Block Modify
`(Read/Write) 825
`
`Invalid 831
`
`Can use completion bit.
`Probe home cluster. (801)
`
`Can use completion bit.
`Probe home cluster.
`(809)
`
`Shared 833
`
`Can use completion bit.
`Probe home cluster. (803)
`
`N/A (811)
`
`Owned 835
`
`Can use completion bit.
`Probe remote cluster with
`line cached in owned
`state. (805)
`
`N/A (813)
`
`Modified 83 7
`
`Can use completion bit.
`Probe remote cluster with
`line cached in modified
`state. (807)
`
`Can use completion bit.
`Probe remote cluster.
`(815)
`
`13
`
`
`
`c ~ I~ 903-l
`
`Request
`Cluster 900
`
`L
`. 925
`
`l
`
`Figure 9
`
`c
`903-4
`
`CPU
`~
`901-4
`
`'-+ c
`903-5
`
`I I
`
`Completion Bit Set
`
`L
`927
`
`c
`921-3
`
`c
`921-4
`
`*-
`c H MC
`921-5
`
`1923-2
`
`c
`MC
`921-1 ~
`
`--+
`
`923-1 \
`
`c
`921-2
`
`\
`
`Remote
`Cluster 940
`
`d •
`\Jl
`•
`~
`~ ......
`~ = ......
`
`~
`?'
`N
`!""
`N c c
`
`0'1
`
`'Jl =(cid:173)~
`~ .....
`'"""' N
`0 ......,
`'"""'
`-..J
`
`e
`
`rJ'l
`
`"'""-l
`Q
`Q
`"'~
`~
`
`~
`~
`N
`
`14
`
`
`
`Figure 10
`
`c
`1003-4
`
`CPU
`1001-4
`
`c
`1003-5
`
`Completion Bit Set
`
`c
`1021-3
`
`c
`1021-4
`
`I
`\
`c
`1021-5
`
`MC
`1023-2
`
`L
`1025
`
`L
`1027
`
`c
`
`CPU
`100 l-l
`
`c
`1003-1
`
`Request
`Cluster lOOO
`
`c
`1021-1
`
`Home Cluster
`1020
`
`Remote
`Cluster 1040
`
`d •
`\Jl
`•
`~
`~ ......
`~ = ......
`
`"'!"j
`~
`?'
`N
`'"""' ~
`N c c
`
`0'1
`
`'JJ. =(cid:173)~
`~ .....
`'"""' ~
`0 ......,
`'"""'
`-..J
`
`e
`
`rJ'l
`
`"'""-l
`Q
`Q
`~
`
`~
`~
`~
`N
`
`15
`
`
`
`U.S. Patent
`
`Feb.21,2006
`
`Sheet 14 of 17
`
`US 7,003,633 B2
`
`Figure 11
`
`robe Request
`Handling At Home
`herence Contro
`
`Receive Probe Request
`Associated With A
`Memory Line
`
`Forward Request To
`Memmy Controller
`
`1101
`
`1105
`
`1109 \__
`
`Recieve Probe From
`Memory Controller
`
`1113
`
`1121
`
`1131
`
`No
`
`1141
`
`Access Coherence
`Directory And Probe
`Filter Infomntion
`
`Yes
`
`Yes
`
`No
`
`Receive Probe Responses
`And Forward Response
`To Request Cluster With
`Completion Indicator
`
`1145
`
`Proceed Using Memmy
`Controller Serialization
`With No Filtering And
`Completion Bit
`
`1135
`
`1139 -
`
`Forward Probe With
`Completion Indicator To
`Remote Cluster
`
`Receive Home Cluster
`Probe Responses And Do
`Not Forward Response To
`Request Cluster
`
`1149
`
`Receive Source Done
`From Request Cluster
`And Forward Source
`Done To Memmy
`Controller
`
`16
`
`
`
`U.S. Patent
`
`Feb.21,2006
`
`Sheet 15 of 17
`
`US 7,003,633 B2
`
`Figure 12
`
`Memory Controller Filter Infom1ation 1221
`
`Read Block [Read] 1223
`
`Read Block Modify
`[Read/Write] 1225
`
`Invalid 123 1
`
`Send request to target.
`(1201)
`
`Send request to target.
`(1209)
`
`Shared 1233
`
`Send request to target.
`(1203)
`
`Send request to target.
`(1211)
`
`Owned 1235
`
`Forward Probe To
`Owning Cluster. (1205)
`
`Send request to target.
`(1213)
`
`Modified 1237
`
`Forward Probe To
`Forward Probe To
`Modified Cluster. (1207) Modified Cluster. (1215)
`
`17
`
`
`
`c HCPUH c
`1303-5 T
`
`1301-4
`
`1303-4
`
`Completion Bit Set
`
`j,_sj
`
`Figure 13
`
`ICPul
`
`~·
`
`Request
`Cluster 1300
`
`Home Cluster
`1320
`
`Remote
`Cluster 1340
`
`d •
`\Jl
`•
`~
`~ ......
`~ = ......
`
`"'!"j
`~
`?'
`N
`'"""' ~
`N c c
`
`0'1
`
`'JJ. =-~
`~ .....
`'"""' 0'1
`0 ......,
`'"""'
`-..J
`
`e
`
`rJ'l
`
`"'""-l
`Q
`Q
`~
`
`~
`~
`~
`N
`
`18
`
`
`
`U.S. Patent
`
`Feb.21,2006
`
`Sheet 17 of 17
`
`US 7,003,633 B2
`
`Figure 14
`
`robe Request
`Handling At Home
`herence Contro
`
`1401
`
`1405
`
`1405
`
`Receive Probe Request
`Associated With A
`Particular Memory Line
`
`1403
`
`Identify Characteristics
`Associated With Probe
`Request
`
`Fmward Request To
`Serialization Point
`
`1411
`~--No---<:
`
`Proceed With Or Without
`Probe Filtering And
`Completion Indicator
`(Figure 6 Or Figure 11)
`
`1415
`
`1417
`
`Yes
`
`Block Probe Requests
`And Probe Remote
`Cluster (Figure 12)
`
`Unblock Memory Line
`After Receiving Source
`Done From Request
`Cluster
`
`19
`
`
`
`US 7,003,633 B2
`
`1
`METHODS AND APPARATUS FOR
`MANAGING PROBE REQUESTS
`
`CROSS-REFERENCE TO RELATED
`APPLICATIONS
`
`The present application is related to filed U.S. application
`Ser. No. 10/106,426 titled Methods And Apparatus For
`Speculative Probing At A Request Cluster, U.S. application
`Ser. No. 10/106,430 titled Methods And Apparatus For
`Speculative Probing With Early Completion And Delayed
`Request, and U.S. application Ser. No. 10/106,299 titled
`Methods And Apparatus For Speculative Probing With Early
`Completion And Early Request, the entireties of which are
`incorporated by reference herein for all purposes. The
`present application is also related to filed U.S. application
`Ser. No. 10/157,340, now U.S. Pat. No. 6,865,595, Ser. No.
`10/145,439, Ser. No. 10/145,438, and Ser. No. 10/157,388
`titled Methods And Apparatus For Responding To A Request
`Cluster by David B. Glasco, the entireties of which are
`incorporated by reference for all purposes. The present
`application is also related to concurrently filed U.S. appli(cid:173)
`cation Ser. No. 10/288,399 with the same title and inventor,
`the entirety of which is incorporated by reference herein for
`all purposes.
`
`BACKGROUND OF THE INVENTION
`
`1. Field of the Invention
`The present invention generally relates to accessing data
`in a multiple processor system. More specifically, the
`present invention provides techniques for improving data
`access efficiency while maintaining cache coherency in a
`multiple processor system having a multiple cluster archi(cid:173)
`tecture.
`2. Description of Related Art
`Data access in multiple processor systems can raise issues
`relating to cache coherency. Conventional multiple proces(cid:173)
`sor computer systems have processors coupled to a system
`memory through a shared bus. In order to optimize access to
`data in the system memory, individual processors are typi(cid:173)
`cally designed to work with cache memory. In one example,
`each processor has a cache that is loaded with data that the
`processor frequently accesses. The cache is read or written
`by a processor. However, cache coherency problems arise
`because multiple copies of the same data can co-exist in
`systems having multiple processors and multiple cache
`memories. For example, a frequently accessed data block
`corresponding to a memory line may be loaded into the
`cache of two different processors. In one example, if both
`processors attempt to write new values into the data block at
`the same time, different data values may result. One value
`may be written into the first cache while a different value is
`written into the second cache. A system might then be unable
`to determine what value to write through to system memory. 55
`A variety of cache coherency mechanisms have been
`developed to address such problems in multiprocessor sys(cid:173)
`tems. One solution is to simply force all processor writes to
`go through to memory immediately and bypass the associ(cid:173)
`ated cache. The write requests can then be serialized before 60
`overwriting a system memory line. However, bypassing the
`cache significantly decreases efficiency gained by using a
`cache. Other cache coherency mechanisms have been devel(cid:173)
`oped for specific architectures. In a shared bus architecture,
`each processor checks or snoops on the bus to determine 65
`whether it can read or write a shared cache block. In one
`example, a processor only writes an object when it owns or
`
`2
`has exclusive access to the object. Each corresponding cache
`object is then updated to allow processors access to the most
`recent version of the object.
`Bus arbitration is used when both processors attempt to
`5 write the same shared data block in the same clock cycle.
`Bus arbitration logic decides which processor gets the bus
`first. Although, cache coherency mechanisms such as bus
`arbitration are effective, using a shared bus limits the num(cid:173)
`ber of processors that can be implemented in a single system
`10 with a single memory space.
`Other multiprocessor schemes involve individual proces(cid:173)
`sor, cache, and memory systems connected to other proces(cid:173)
`sors, cache, and memory systems using a network backbone
`such as Ethernet or Token Ring. Multiprocessor schemes
`15 involving separate computer systems each with its own
`address space can avoid many cache coherency problems
`because each processor has its own associated memory and
`cache. When one processor wishes to access data on a
`remote computing system, communication is explicit. Mes-
`20 sages are sent to move data to another processor and
`messages are received to accept data from another processor
`using standard network protocols such as TCP/IP. Multipro(cid:173)
`cessor systems using explicit communication including
`transactions such as sends and receives are referred to as
`25 systems using multiple private memories. By contrast, mul(cid:173)
`tiprocessor system using implicit communication including
`transactions such as loads and stores are referred to herein as
`using a single address space.
`Multiprocessor schemes using separate computer systems
`30 allow more processors to be interconnected while minimiz(cid:173)
`ing cache coherency problems. However, it would take
`substantially more time to access data held by a remote
`processor using a network infrastructure than it would take
`to access data held by a processor coupled to a system bus.
`35 Furthermore, valuable network bandwidth would be con(cid:173)
`sumed moving data to the proper processors. This can
`negatively impact both processor and network performance.
`Performance limitations have led to the development of a
`point-to-point architecture for connecting processors in a
`40 system with a single memory space. In one example, indi(cid:173)
`vidual processors can be directly connected to each other
`through a plurality of point-to-point links to form a cluster
`of processors. Separate clusters of processors can also be
`connected. The point-to-point links significantly increase the
`45 bandwidth for coprocessing and multiprocessing functions.
`However, using a point-to-point architecture to connect
`multiple processors in a multiple cluster system sharing a
`single memory space presents its own problems.
`Consequently, it is desirable to provide techniques for
`50 improving data access and cache coherency in systems
`having multiple clusters of multiple processors connected
`using point-to-point links.
`
`SUMMARY OF THE INVENTION
`
`According to the present invention, methods and appara(cid:173)
`tus are provided for increasing the efficiency of data access
`in a multiple processor, multiple cluster system. Mecha(cid:173)
`nisms for reducing the number of transactions in a multiple
`cluster system are provided. In one example, probe filter
`information is used to limit the number of probe requests
`transmitted to request and remote clusters.
`In one embodiment, a computer system is provided. The
`computer system includes a home cluster having a first
`plurality of processors and a home cache coherence con(cid:173)
`troller. The first plurality of processors and the home cache
`coherence controller are interconnected in a point-to-point
`
`20
`
`
`
`US 7,003,633 B2
`
`3
`architecture. The home cache coherence controller is con(cid:173)
`figured to receive a probe request and probe one or more
`selected clusters. The one or more clusters are selected based
`on the characteristics associated with the probe request.
`In another embodiment, a method for managing probes is
`provided. A probe request is received at a home cache
`coherence controller in a home cluster. The home cluster
`includes a first plurality of processors and the home cache
`coherence controller. The first plurality of processors and the
`home cache coherence controller are interconnected in a 10
`point-to-point architecture. One or more clusters are selected
`for probing based on the characteristics associated with the
`probe request. The one or more clusters are probed.
`A further understanding of the nature and advantages of
`the present invention may be realized by reference to the 15
`remaining portions of the specification and the drawings.
`
`5
`
`4
`will be understood that it is not intended to limit the
`invention to the described embodiments. On the contrary, it
`is intended to cover alternatives, modifications, and equiva(cid:173)
`lents as may be included within the spirit and scope of the
`invention as defined by the appended claims. Multi-proces(cid:173)
`sor architectures having point-to-point communication
`among their processors are suitable for implementing spe(cid:173)
`cific embodiments of the present invention. In the following
`description, numerous specific details are set forth in order
`to provide a thorough understanding of the present inven(cid:173)
`tion. The present invention may be practiced without some
`or all of these specific details. Well-known process opera(cid:173)
`tions have not been described in detail in order not to
`unnecessarily obscure the present invention. Furthermore,
`the present application's reference to a particular singular
`entity includes that possibility that the methods and appa-
`ratus of the present invention can be implemented using
`more than one entity, unless the context clearly dictates
`otherwise.
`Techniques are provided for increasing data access effi-
`ciency in a multiple processor, multiple cluster system. In a
`point-to-point architecture, a cluster of processors includes
`multiple processors directly connected to each other through
`point-to-point links. By using point-to-point links instead of
`25 a conventional shared bus or external network, multiple
`processors are used efficiently in a system sharing the same
`memory space. Processing and network efficiency are also
`improved by avoiding many of the bandwidth and latency
`limitations of conventional bus and external network based
`multiprocessor architectures. According to various embodi(cid:173)
`ments, however, linearly increasing the number of proces-
`sors in a point-to-point architecture leads to an exponential
`increase in the number of links used to connect the multiple
`processors. In order to reduce the number of links used and
`to further modularize a multiprocessor system using a point(cid:173)
`to-point architecture, multiple clusters are used.
`According to various embodiments, the multiple proces(cid:173)
`sor clusters are interconnected using a point-to-point archi(cid:173)
`tecture. Each cluster of processors includes a cache coher-
`40 ence controller used to handle communications between
`clusters. In one embodiment, the point-to-point architecture
`used to connect processors are used to connect clusters as
`well.
`By using a cache coherence controller, multiple cluster
`45 systems can be built using processors that may not neces(cid:173)
`sarily support multiple clusters. Such a multiple cluster
`system can be built by using a cache coherence controller to
`represent non-local nodes in local transactions so that local
`nodes do not need to be aware of the existence of nodes
`50 outside of the local cluster. More detail on the cache
`coherence controller will be provided below.
`In a single cluster system, cache coherency can be main-
`tained by sending all data access requests through a serial(cid:173)
`ization point. Any mechanism for ordering data access
`requests is referred to herein as a serialization point. One
`example of a serialization point is a memory controller.
`Various processors in the single cluster system send data
`access requests to the memory controller. In one example,
`the memory controller is configured to serialize or lock the
`60 data access requests so that only one data access request for
`a given memory line is allowed at any particular time. If
`another processor attempts to access the same memory line,
`the data access attempt is blocked until the memory line is
`unlocked. The memory controller allows cache coherency to
`65 be maintained in a multiple processor, single cluster system.
`A serialization point can also be used in a multiple
`processor, multiple cluster system where the processors in
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`The invention may best be understood by reference to the 20
`following description taken in conjunction with the accom(cid:173)
`panying drawings, which are illustrative of specific embodi(cid:173)
`ments of the present invention.
`FIGS. 1A and 1B are diagrammatic representation depict(cid:173)
`ing a system having multiple clusters.
`FIG. 2 is a diagrammatic representation of a cluster
`having a plurality of processors.
`FIG. 3 is a diagrammatic representation of a cache coher(cid:173)
`ence controller.
`FIG. 4 is a diagrammatic representation showing a trans- 30
`action flow for a data access request from a processor in a
`single cluster.
`FIG. 5A-5D are diagrammatic representations showing
`cache coherence controller functionality.
`FIG. 6 is a diagrammatic representation depicting a trans- 35
`action flow for a probe request with multiple probe
`responses.
`FIG. 7 is a diagrammatic representation showing a cache
`coherence directory.
`FIG. 8 is a diagrammatic representation showing probe
`filter information that can be used to reduce the number of
`probes transmitted to various clusters.
`FIG. 9 is a diagrammatic representation showing a trans(cid:173)
`action flow for probing of a home cluster without probing of
`other clusters.
`FIG. 10 is a diagrammatic representation showing a
`transaction flow for probing of a single remote cluster.
`FIG. 11 is a flow process diagram showing the handling
`of a probe request with probe filter information.
`FIG. 12 is a diagrammatic representation showing
`memory controller filter information.
`FIG. 13 is a diagrammatic representation showing a
`transaction flow for probing a single remote cluster without
`probing a home cluster.
`FIG. 14 is a flow process diagram showing the handling 55
`of a probe request at a home cluster cache coherence
`controller using memory controller filter information.
`
`DETAILED DESCRIPTION OF SPECIFIC
`EMBODIMENTS
`
`Reference will now be made in detail to some specific
`embodiments of the invention including the best modes
`contemplated by the inventors for carrying out the invention.
`Examples of these specific embodiments are illustrated in
`the accompanying drawings. While
`the
`invention
`is
`described in conjunction with these specific embodiments, it
`
`21
`
`
`
`US 7,003,633 B2
`
`5
`the various clusters share a single address space. By using a
`single address space, internal point-to-point links can be
`used to significantly improve intercluster communication
`over traditional external network based multiple cluster
`systems. Various processors in various clusters send data 5
`access requests to a memory controller associated with a
`particular cluster such as a home cluster. The memory
`controller can similarly serialize all data requests from the
`different clusters. However, a serialization point in a mul(cid:173)
`tiple processor, multiple cluster system may not be as 10
`efficient as a serialization point in a multiple processor,
`single cluster system. That is, delay resulting from factors
`such as latency from transmitting between clusters can
`adversely affect the response times for various data access 15
`requests. It should be noted that delay also results from the
`use of probes in a multiple processor environment.
`Although delay in intercluster transactions in an architec(cid:173)
`ture using a shared memory space is significantly less than
`the delay in conventional message passing environments 20
`using external networks such as Ethernet or Token Ring,
`even minimal delay is a significant factor. In some applica(cid:173)
`tions, there may be millions of data access requests from a
`processor in a fraction of a second. Any delay can adversely
`impact processor performance.
`According to various embodiments, probe management is
`used to increase the efficiency of accessing data in a multiple
`processor, multiple cluster system. A mechanism for elicit(cid:173)
`ing a response from a node to maintain cache coherency in
`a system is referred to herein as a probe. In one example, a
`mechanism for snooping a cache is referred to as a probe. A
`response to a probe can be directed to the source or target of
`the initiating request. Any mechanism for filtering or reduc(cid:173)
`ing the number of probes and probe requests transmitted to
`various nodes is referred to herein as managing probes. In
`one example, managing probe entails characterizing a probe
`request to determine if a probe can be transmitted to a
`reduced number of entities.
`In typ