(12) United States Patent
Morton et al.

(10) Patent No.: US 7,296,121 B2
(45) Date of Patent: Nov. 13, 2007

US007296121B2

(54) REDUCING PROBE TRAFFIC IN
     MULTIPROCESSOR SYSTEMS

(75) Inventors: Eric Morton, Austin, TX (US); Rajesh
     Kota, Austin, TX (US); David B. Glasco,
     Austin, TX (US) [additional inventor names,
     if any, illegible in the OCR]

(73) Assignee: Newisys, Inc., Austin, TX (US)

( * ) Notice: Subject to any disclaimer, the term of this
     patent is extended or adjusted under 35
     U.S.C. 154(b) by 250 days.
`
5,524,212 A   6/1996  Somani et al.
5,692,123 A  11/1997  Logghe
5,751,995 A   5/1998  Sarangdhar
[two citations illegible in the OCR]
6,018,791 A   1/2000  Arimilli et al.
6,038,652 A   3/2000  Phillips et al.
6,052,769 A   4/2000  Huff et al.
6,067,603 A   5/2000  Carpenter et al.
6,073,210 A   6/2000  Palanca et al.
6,085,295 A   7/2000  Ekanadham et al.
6,108,737 A   8/2000  Sharma et al.
6,122,715 A   9/2000  Palanca et al.

(Continued)
`
(21) Appl. No.: 10/966,161

(22) Filed: Oct. 15, 2004
     (Under 37 CFR 1.47)
`
FOREIGN PATENT DOCUMENTS

WO    WO 0239242    5/2002

OTHER PUBLICATIONS

Guo, et al., "A Probe-Based Server Selection Protocol for Differ-
entiated Service Networks", © 2002 IEEE, pp. 2353-2357.*

(65) Prior Publication Data

     US 2007/0055826 A1    Mar. 8, 2007

Related U.S. Application Data

(63) Continuation-in-part of application No. 10/288,347,
     filed on Nov. 4, 2002, now Pat. No. 7,003,633.

(Continued)

Primary Examiner: Brian R. Peugh
(74) Attorney, Agent, or Firm: Beyer Weaver LLP
`
(51) Int. Cl.
     G06F 12/00    (2006.01)
(52) U.S. Cl.    711/148; 711/141
(58) Field of Classification Search    711/141,
     711/148, 131, 144, 145, 146; 709/206, 213,
     709/216, 217, 218, 219
     See application file for complete search history.

(56) References Cited

U.S. PATENT DOCUMENTS

5,195,089 A   3/1993  Sindhu et al.
5,394,555 A   2/1995  Hunter et al.
`
(57) ABSTRACT

A computer system having a plurality of processing nodes
interconnected by a first point-to-point architecture is
described. Each processing node has a cache memory asso-
ciated therewith. A probe filtering unit is operable to receive
probes corresponding to memory lines from the processing
nodes and to transmit the probes only to selected ones of the
processing nodes with reference to probe filtering informa-
tion. The probe filtering information is representative of
states associated with selected ones of the cache memories.

25 Claims, 25 Drawing Sheets
`
[Front-page representative figure: the "Local Probe Filtering" flowchart of FIG. 20 (steps 2002-2026), reproduced legibly on Sheet 23 of 25.]
`
`
`
`US 7,296,121 B2
`
`Page 2
`
U.S. PATENT DOCUMENTS

2004/0088494 A1    5/2004  Glasco
2004/0117559 A1    6/2004  Glasco
2004/0255002 A1   12/2004  Kota et al.
`
OTHER PUBLICATIONS
`
`* cited by examiner
`
`
`
[Cited references continued; the three-column layout is garbled in the OCR: U.S. patents 6,148,378 through 7,003,633 (dated 11/2000 through 2/2006) and published applications 2001/0013089 A1 through 2004/0088493 A1 (dated 8/2001 through 5/2004), to inventors including Keller, Arimilli, Palanca, Bauman, Glasco, Gharachorloo, Van Huben, Kessler, Khare, Zalewski, Webb, and others; three entries carry examiner classification notes (711/146, 711/130, 709/213).]
HyperTransport™ I/O Link Specification, Revision 1.03, HyperTransport™ Consortium, Oct. 10, 2001, Copyright © 2001 HyperTransport Technology Consortium.
PCT Search Report PCT/US03/34756, Int'l filing date Oct. 30, 2003, Search report mailed Dec. 16, 2004.
Bilir et al., "Multicast Snooping: A New Coherence Method Using a Multicast Address Network", Computer Architecture, 1999, Proceedings of the 26th International Symposium on, May 2-4, 1999.
Martin et al., "Bandwidth Adaptive Snooping", Proceedings of the Eighth International Symposium on High-Performance Computer Architecture, Feb. 2-6, 2002, pp. 251-262.
Sorin et al., "Specifying and Verifying a Broadcast and a Multicast Snooping Cache Coherence Protocol", IEEE Transactions on Parallel and Distributed Systems, vol. 13, No. 6, Jun. 2002.
U.S. Appl. No. 10/288,347 (now U.S. Pat. No. 7,003,633), Notice of Allowance, dated Sep. 12, 2005.
U.S. Appl. No. 10/288,347 (now U.S. Pat. No. 7,003,633), First Office Action, dated Nov. 18, 2004.
Kim et al., "Power-aware Partitioned Cache Architectures", 2001 ACM, pp. 64-67.
Powell et al., "Reducing Set-Associative Cache Energy via Way-Prediction and Selective Direct-Mapping", 2001 IEEE, pp. 54-65.
Culler, D. E., J. P. Singh, A. Gupta, "Parallel Computer Architecture", 1999, Morgan Kaufmann, San Francisco, CA, USA, XP002277658.
Tanenbaum, Andrew, "Computer Networks", London: Prentice Hall International, GB, 1996, pp. 345-403, XP002155220.
U.S. Appl. No. 10/288,347 (now U.S. Pat. No. 7,003,633), Final Office Action, dated May 12, 2005.
U.S. Office Action mailed Sep. 22, 2004, from U.S. Appl. No. 10/106,426 [NWISP002].
U.S. Office Action mailed Mar. 7, 2005, from U.S. Appl. No. 10/106,426 [NWISP002].
U.S. Office Action mailed Jul. 21, 2005, from U.S. Appl. No. 10/106,426 [NWISP002].
U.S. Office Action mailed Sep. 23, 2004, from U.S. Appl. No. 10/106,430 [NWISP003].
U.S. Office Action mailed Mar. 10, 2005, from U.S. Appl. No. 10/106,430 [NWISP003].
U.S. Office Action mailed Jul. 21, 2005, from U.S. Appl. No. 10/106,430 [NWISP003].
U.S. Office Action mailed Sep. 22, 2004, from U.S. Appl. No. 10/106,299 [NWISP004].
U.S. Office Action mailed Mar. 10, 2005, from U.S. Appl. No. 10/106,299 [NWISP004].
U.S. Office Action mailed Jul. 21, 2005, from U.S. Appl. No. 10/106,299 [NWISP004].
U.S. Office Action mailed Jul. 20, 2005, from U.S. Appl. No. 10/608,846 [NWISP030].
U.S. Office Action mailed Sep. 9, 2005, from U.S. Appl. No. 10/462,015 [NWISP040].
U.S. Office Action mailed Sep. 9, 2005, from U.S. Appl. No. 10/426,084 [NWISP033].
U.S. Office Action mailed Nov. 2, 2005, from U.S. Appl. No. 10/106,430 [NWISP003].
U.S. Office Action mailed Oct. 5, 2005, from U.S. Appl. No. 10/635,703 [NWISP036].
`
`
[U.S. Patent, Nov. 13, 2007, Sheet 1 of 25, US 7,296,121 B2]

Figure 1A: four interconnected Processing Clusters 101, 103, 105, and 107.

Figure 1B: four interconnected Processing Clusters 121, 123, 125, and 127.
`
[Sheet 2 of 25: FIG. 2 — a cluster having a plurality of processors; drawing text not recoverable from the OCR.]
`
`
[Sheet 3 of 25: FIG. 3 — a cache coherence controller; drawing text not recoverable from the OCR.]
`
[Sheet 4 of 25: FIG. 4 — transaction flow for a data access request from a processor in a single cluster; drawing text not recoverable from the OCR.]
`
[Sheet 5 of 25: FIG. 5A — cache coherence controller functionality; drawing text not recoverable from the OCR.]
`
[Sheet 6 of 25: FIG. 5B — cache coherence controller functionality; drawing text not recoverable from the OCR.]
`
[Sheet 7 of 25: FIG. 5C — cache coherence controller functionality; drawing text not recoverable from the OCR.]
`
[Sheet 8 of 25: FIG. 5D — cache coherence controller functionality; drawing text not recoverable from the OCR.]
`
[Sheet 9 of 25: FIG. 6 — transaction flow for a request with multiple probe responses; drawing text not recoverable from the OCR.]
`
[Sheet 10 of 25]

Figure 7

Coherence Directory 701

Memory Line 711   State 713    Dirty Data Owner 715   Occupancy Vector 717
Address 721       Invalid      N/A                    N/A
Address 731       Invalid      N/A                    N/A
Address 741       Shared       N/A                    Clusters 1, 3
Address 751       Shared       N/A                    Clusters 1, 2, 3, 4
Address 761       Owned        Cluster 4              Clusters 2, 3, 4
Address 771       Owned        Cluster 2              Clusters 2, 4
Address 781       Modified     Cluster 2              N/A
Address 791       Modified     Cluster 3              N/A
`
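The directory organization of FIG. 7 can be sketched in code. This is an illustrative reconstruction rather than the patent's implementation; all class, method, and field names are hypothetical, chosen to mirror fields 713, 715, and 717.

```python
from dataclasses import dataclass, field
from typing import Optional, Set

# Hypothetical sketch of one coherence directory entry (fields mirror FIG. 7).
@dataclass
class DirectoryEntry:
    state: str = "invalid"                            # state field 713
    dirty_owner: Optional[int] = None                 # dirty data owner field 715
    occupancy: Set[int] = field(default_factory=set)  # occupancy vector 717

class CoherenceDirectory:
    """Tracks, per memory line, which clusters may cache the line."""

    def __init__(self):
        self.entries = {}  # memory line address -> DirectoryEntry

    def lookup(self, address):
        # Lines with no entry are treated as invalid (cached nowhere).
        return self.entries.get(address, DirectoryEntry())

    def record_share(self, address, cluster):
        # A read by `cluster` leaves the line shared and marks occupancy.
        entry = self.entries.setdefault(address, DirectoryEntry())
        if entry.state == "invalid":
            entry.state = "shared"
        entry.occupancy.add(cluster)

    def record_modify(self, address, cluster):
        # A write gives `cluster` the only valid (dirty) copy.
        self.entries[address] = DirectoryEntry(state="modified",
                                               dirty_owner=cluster)
```

Recording reads by clusters 1 and 3, for instance, reproduces the "Shared / Clusters 1, 3" row shown for Address 741.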
`
`
[Sheet 11 of 25]

Figure 8

Probe Filter Information 821

              Read Block (Read) 823            Read Block Modify (Read/Write) 825
Invalid 831    Can use completion bit.          Can use completion bit.
               Probe home cluster. (801)        Probe home cluster. (809)
Shared 833     Can use completion bit.          N/A (811)
               Probe home cluster. (803)
Owned 835      Can use completion bit.          N/A (813)
               Probe remote cluster with line
               cached in owned state. (805)
Modified 837   Can use completion bit.          Can use completion bit.
               Probe remote cluster with line   Probe remote cluster. (815)
               cached in modified state. (807)
`
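In code, the per-state decisions tabulated in FIG. 8 amount to a small lookup. The following is a hedged sketch, not the patent's logic: function and constant names are hypothetical, and `None` stands in for the table's N/A cells.

```python
# Hypothetical sketch of the probe-filter decisions tabulated in FIG. 8.
READ = "read"                 # Read Block (823)
READ_MODIFY = "read_modify"   # Read Block Modify (825)

def clusters_to_probe(state, request_type, home_cluster, remote_owner=None):
    """Return the set of clusters to probe, or None where FIG. 8 says N/A."""
    if state == "invalid":
        # Blocks 801/809: only the home cluster need be probed.
        return {home_cluster}
    if state == "shared":
        # Block 803 for reads; read-modify is N/A (811).
        return {home_cluster} if request_type == READ else None
    if state == "owned":
        # Block 805: probe the remote cluster caching the line in owned
        # state; read-modify is N/A (813).
        return {remote_owner} if request_type == READ else None
    if state == "modified":
        # Blocks 807/815: probe the remote cluster holding the dirty line.
        return {remote_owner}
    raise ValueError(f"unknown state: {state}")
```

In every non-N/A case at most one cluster is probed, which is precisely the traffic reduction the table is meant to convey.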
`
`
[Sheet 12 of 25: FIG. 9 — transaction flow for probing of a home cluster without probing of other clusters; drawing text not recoverable from the OCR.]
`
[Sheet 13 of 25: FIG. 10 — transaction flow for probing of a single remote cluster; drawing text not recoverable from the OCR.]
`
[Sheet 14 of 25]

Figure 11 (Request Handling at the Home Coherence Controller; flowchart branches partially garbled in the OCR)

- Receive request associated with a memory line (1101)
- Forward request to memory controller (1105)
- Receive probe from memory controller (1109)
- Access coherence directory and probe filter information (1113)
- Use filtering and completion indicator?
  - No: proceed using memory controller serialization with no filtering and completion bit
  - Yes: receive home cluster probe responses and do not forward response to request cluster; if a remote cluster must be probed, forward probe with completion indicator to remote cluster (1135), then receive probe responses and forward response to request cluster with completion indicator (1141)
- Receive source done from request cluster and forward source done to memory controller (1131)
`
[Sheet 15 of 25]

Figure 12 (cell reference numerals partly illegible in the OCR)

Memory Controller Filter Information 1221

               Read Block (Read) 1223          Read Block Modify (Read/Write) 1225
Invalid 1231    Send request to target. (1201)  Send request to target. (1209)
Shared 1233     Send request to target. (1203)  Send request to target. (1211)
Owned 1235      Probe owning cluster. (1205)    Probe owning cluster. (1213)
Modified 1237   Probe modified cluster. (1207)  Probe modified cluster. (1215)
`
[Sheet 16 of 25: FIG. 13 — transaction flow for probing a single remote cluster without probing a home cluster; drawing text not recoverable from the OCR.]
`
`
`
`equest Handling
`Home Coherence
`
`
`Receive Request
`1403
`Identify Characteristics
`
`Associated With A
`Associated With Re uest
`Particular Memory Line
`q
`
`
`
`
`‘
`
`qu
`
`
`1411
`
`'5
`
`
`
`ass Memo v
`yp
`Controller?
`
`U.S. Patent
`
`Nov. 13, 2007
`
`Sheet 17 of 25
`
`US 7,296,121 B2
`
`Figure 14
`
`Controller
`
`1401
`
`1405
`
`1405
`
`Forward Re uest To
`q
`Serialization Point
`1
`.
`.
`Proceed with Or Without
`Probe Filtering/31nd
`Completlon Indlcator
`
`(Figure 6 Or Figure 11)
`
`
`
`Yes
`'
`Block Requests And
`Probe Remote Cluster
`(Figure 12)
`
`
`
`l
`
`.
`
`1415
`
`1417
`
`Unblock Memory Line
`After Receiving Source
`Done From Request
`Cluster
`
`
`
`19
`
`19
`
`
`
[Sheet 18 of 25: FIG. 15 — transaction flow for a cache coherence directory eviction of an entry corresponding to a dirty memory line; drawing text not recoverable from the OCR.]
`
[Sheet 19 of 25: FIG. 16 — transaction flow for a cache coherence directory eviction of an entry corresponding to a clean memory line; drawing text not recoverable from the OCR.]
`
[Sheet 20 of 25: FIG. 17 — a cache coherence controller according to a specific embodiment; legible fragments mention coherent and non-coherent interfaces, a transaction mapper, and a protocol engine.]
`
[Sheet 21 of 25: FIG. 18 — a cluster having a plurality of processing nodes and a probe filtering unit; drawing text largely not recoverable from the OCR.]
`
[Sheet 22 of 25: FIG. 19 — exemplary representation of a processing node; legible fragments mention CPUs, routing tables, and a request interface.]
`
`U.S. Patent
`
`Nov. 13, 2007
`
`Sheet 23 of 25
`
`US 7,296,121 B2
`
`Local Probe
`
`Filtering
`
`
`
`
`
`
`memory controller
`generates probe (2002)
`
`
`consult routing
`
`table (2004)
`
`
`
`
`
`
`send probe only
`
`to PFU (2006)
`
`
`
`
`
`
`
`
`PFU accepts probe and
`refers to cache coherence
`directory (2008)
`
`respond to
`requester (2012)
`
`
`
`Requested
`
`line stored in cache?
`
`
`(2010)
`
`
`
`yes
`
`PFU sends probe(s) only to nodes
`that may have line in cache (2014)
`
`
`
`
`
`Probed nodes look
`up line (2016)
`
`Probed nodes return
`responses to PFU (2018)
`
`Figure 20
`
`no
`
`PFU updates
`directory (2020)
`
`
`
` All
`
`probed nodes
`
`responded?
`(2022)
`
`
`
`yes
`
`PFU collates the
`responses (2024)
`
`
`PFU responds to
`
`
`requesting node (2026)
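The flow above (steps 2002-2026 of FIG. 20) can be summarized in a short sketch. This is illustrative only: the dictionary-based directory and cache objects are hypothetical stand-ins for the PFU's cache coherence directory and the node caches.

```python
# Hypothetical sketch of FIG. 20: the PFU receives each probe (2006),
# consults its coherence directory (2008/2010), and forwards probes only
# to nodes that may hold the requested line in cache (2014).
def filter_local_probe(pfu_directory, line, node_caches):
    candidates = pfu_directory.get(line, set())
    if not candidates:
        # Step 2012: no node can hold the line; respond to the
        # requester directly without probing anyone.
        return None
    # Steps 2014-2018: probe only the candidate nodes and gather responses.
    responses = {n: node_caches[n].get(line) for n in candidates}
    # Step 2020: nodes that answered "not cached" drop out of the directory.
    pfu_directory[line] = {n for n, v in responses.items() if v is not None}
    # Steps 2022-2026: once all probed nodes have responded, collate the
    # responses and answer the requesting node.
    return next((v for v in responses.values() if v is not None), None)
```

Note how a miss in the directory short-circuits the whole probe fan-out, which is the source of the traffic reduction.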
`
`
`
`
`
`
[Sheet 24 of 25: FIG. 21 — a transaction flow in which local probe filtering is facilitated; drawing text largely not recoverable from the OCR.]
`
[Sheet 25 of 25: FIG. 22 — another transaction flow in which local probe filtering is facilitated; drawing text largely not recoverable from the OCR.]
`
`REDUCING PROBE TRAFFIC IN
`MULTIPROCESSOR SYSTEMS
`
`CROSS-REFERENCE TO RELATED
`APPLICATIONS
`
The present application is a continuation-in-part of and
claims priority under 35 U.S.C. 120 to U.S. patent applica-
tion Ser. No. 10/288,347, now U.S. Pat. No. 7,003,633, for
METHODS AND APPARATUS FOR MANAGING
PROBE REQUESTS filed on Nov. 4, 2002, the entire
disclosure of which is incorporated herein by reference for
all purposes. The subject matter described in the present
application is also related to U.S. patent application Ser. No.
10/288,399, now U.S. Pat. No. 7,103,726, for METHODS
AND APPARATUS FOR MANAGING PROBE
REQUESTS filed on Nov. 4, 2002, the entire disclosure of
which is incorporated herein by reference for all purposes.
`
`BACKGROUND OF THE INVENTION
`
`The present invention generally relates to accessing data
`in a multiple processor system. More specifically,
`the
`present invention provides techniques for reducing memory
`transaction traffic in a multiple processor system.
`Data access in multiple processor systems can raise issues
`relating to cache coherency. Conventional multiple proces-
`sor computer systems have processors coupled to a system
`memory through a shared bus. In order to optimize access to
`data in the system memory, individual processors are typi-
`cally designed to work with cache memory. In one example,
`each processor has a cache that is loaded with data that the
`processor frequently accesses. The cache is read or written
`by a processor. However, cache coherency problems arise
`because multiple copies of the same data can co-exist in
`systems having multiple processors and multiple cache
`memories. For example, a frequently accessed data block
`corresponding to a memory line may be loaded into the
`cache of two different processors. In one example, if both
`processors attempt to write new values into the data block at
`the same time, different data values may result. One value
`may be written into the first cache while a different value is
`written into the second cache. A system might then be unable
`to determine what value to write through to system memory.
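The divergence just described is easy to illustrate with a toy model (real caches hold memory lines and follow a coherence protocol; the dictionaries here are only stand-ins):

```python
# Two processors cache the same memory line and write different values.
memory = {0x40: 0}
cache_a = {0x40: memory[0x40]}  # processor A loads the line
cache_b = {0x40: memory[0x40]}  # processor B loads the same line

cache_a[0x40] = 1  # A writes into its cached copy
cache_b[0x40] = 2  # B writes a different value into its copy

# Without a coherence mechanism the two cached copies now disagree, and
# the system cannot decide which value to write back to memory.
assert cache_a[0x40] != cache_b[0x40]
assert memory[0x40] == 0
```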
`A variety of cache coherency mechanisms have been
`developed to address such problems in multiprocessor sys-
`tems. One solution is to simply force all processor writes to
`go through to memory immediately and bypass the associ-
`ated cache. The write requests can then be serialized before
`overwriting a system memory line. However, bypassing the
`cache significantly decreases efficiency gained by using a
`cache. Other cache coherency mechanisms have been devel-
`oped for specific architectures. In a shared bus architecture,
`each processor checks or snoops on the bus to determine
`whether it can read or write a shared cache block. In one
`
`example, a processor only writes an object when it owns or
`has exclusive access to the object. Each corresponding cache
`object is then updated to allow processors access to the most
`recent version of the object.
`Bus arbitration is used when both processors attempt to
`write the same shared data block in the same clock cycle.
`Bus arbitration logic decides which processor gets the bus
`first. Although, cache coherency mechanisms such as bus
`arbitration are effective, using a shared bus limits the num-
`ber of processors that can be implemented in a single system
`with a single memory space.
`
`
`Other multiprocessor schemes involve individual proces-
`sor, cache, and memory systems connected to other proces-
`sors, cache, and memory systems using a network backbone
`such as Ethernet or Token Ring. Multiprocessor schemes
`involving separate computer systems each with its own
`address space can avoid many cache coherency problems
`because each processor has its own associated memory and
`cache. When one processor wishes to access data on a
`remote computing system, communication is explicit. Mes-
`sages are sent
`to move data to another processor and
`messages are received to accept data from another processor
`using standard network protocols such as TCP/IP. Multipro-
`cessor systems using explicit communication including
`transactions such as sends and receives are referred to as
`
systems using multiple private memories. By contrast,
multiprocessor systems using implicit communication including
`transactions such as loads and stores are referred to herein as
`
`using a single address space.
`Multiprocessor schemes using separate computer systems
`allow more processors to be interconnected while minimiz-
`ing cache coherency problems. However,
`it would take
`substantially more time to access data held by a remote
`processor using a network infrastructure than it would take
`to access data held by a processor coupled to a system bus.
`Furthermore, valuable network bandwidth would be con-
`sumed moving data to the proper processors. This can
`negatively impact both processor and network performance.
`Performance limitations have led to the development of a
`point-to-point architecture for connecting processors in a
`system with a single memory space. In one example, indi-
`vidual processors can be directly connected to each other
`through a plurality of point-to-point links to form a cluster
`of processors. Separate clusters of processors can also be
`connected. The point-to-point links significantly increase the
`bandwidth for coprocessing and multiprocessing functions.
`However, using a point-to-point architecture to connect
`multiple processors in a multiple cluster system sharing a
`single memory space presents its own problems.
`Consequently,
`it is desirable to provide techniques for
`improving data access and cache coherency in systems
`having multiple processors connected using point-to-point
`links.
`
`SUMMARY OF THE INVENTION
`
`According to the present invention, various techniques are
`provided for reducing traffic relating to memory transactions
`in multi-processor systems. According to various specific
`embodiments, a computer system having a plurality of
`processing nodes interconnected by a first point-to-point
`architecture is provided. Each processing node has a cache
`memory associated therewith. A probe filtering unit is oper-
`able to receive probes corresponding to memory lines from
`the processing nodes and to transmit the probes only to
`selected ones of the processing nodes with reference to
`probe filtering information. The probe filtering information
`is representative of states associated with selected ones of
`the cache memories.
`
`
`According to other embodiments, methods and apparatus
`are provided for reducing probe traffic in a computer system
`comprising a plurality of processing nodes interconnected
`by a first point-to-point architecture. A probe corresponding
`to a memory line is transmitted from a first one of the
`processing nodes only to a probe filtering unit. The probe is
`evaluated with the probe filtering unit to determine whether
`a valid copy of the memory line is in any of the cache
`memories. The evaluation is done with reference to probe
`filtering information associated with the probe filtering unit
`and representative of states associated with selected ones of
`the cache memories. The probe is transmitted from the probe
`filtering unit only to selected ones of the processing nodes
`identified by the evaluating. Probe responses from the
`selected processing nodes are accumulated by the probe
`filtering unit. Only the probe filtering unit responds to the
`first processing node.
`A further understanding of the nature and advantages of
`the present invention may be realized by reference to the
`remaining portions of the specification and the drawings.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`The invention may best be understood by reference to the
`following description taken in conjunction with the accom-
`panying drawings, which are illustrative of specific embodi-
`ments of the present invention.
FIGS. 1A and 1B are diagrammatic representations
depicting a system having multiple clusters.
`FIG. 2 is a diagrammatic representation of a cluster
`having a plurality of processors.
`FIG. 3 is a diagrammatic representation of a cache coher-
`ence controller.
`
`FIG. 4 is a diagrammatic representation showing a trans-
`action flow for a data access request from a processor in a
`single cluster.
FIGS. 5A-5D are diagrammatic representations showing
`cache coherence controller functionality.
`FIG. 6 is a diagrammatic representation depicting a trans-
`action flow for a request with multiple probe responses.
`FIG. 7 is a diagrammatic representation showing a cache
`coherence directory.
`FIG. 8 is a diagrammatic representation showing probe
filter information that can be used to reduce the number of
probes transmitted to various clusters.
`FIG. 9 is a diagrammatic representation showing a trans-
`action flow for probing of a home cluster without probing of
`other clusters.
`
`FIG. 10 is a diagrammatic representation showing a
`transaction flow for probing of a single remote cluster.
`FIG. 11 is a flow process diagram showing the handling
`of a request with probe filter information.
`FIG. 12 is a diagrammatic representation showing
`memory controller filter information.
`FIG. 13 is a diagrammatic representation showing a
`transaction flow for probing a single remote cluster without
`probing a home cluster.
`FIG. 14 is a flow process diagram showing the handling
`of a request at a home cluster cache coherence controller
`using memory controller filter information.
`FIG. 15 is a diagrammatic representation showing a
`transaction flow for a cache coherence directory eviction of
`an entry corresponding to a dirty memory line.
`FIG. 16 is a diagrammatic representation showing a
`transaction flow for a cache coherence directory eviction of
`an entry corresponding to a clean memory line.
`FIG. 17 is a diagrammatic representation of a cache
`coherence controller according to a specific embodiment of
`the invention.
`
`FIG. 18 is a diagrammatic representation of a cluster
`having a plurality of processing nodes and a probe filtering
`unit.
`
`
`FIG. 19 is an exemplary representation of a processing
`node.
`
`
`FIG. 20 is a flowchart illustrating local probe filtering
`according to a specific embodiment of the invention.
`
`4
`
`FIG. 21 is a diagrammatic representation of a transaction
`flow in which local probe filtering is facilitated according to
`a specific embodiment of the invention.
`FIG. 22 is a diagrammatic representation of another
`transaction flow in which local probe filtering is facilitated
`according to a specific embodiment of the invention.
`
`DETAILED DESCRIPTION OF SPECIFIC
`EMBODIMENTS
`
`Reference will now be made in detail to some specific
`embodiments of the invention including the best modes
`contemplated by the inventors for carrying out the invention.
`Examples of these specific embodiments are illustrated in
`the accompanying drawings. While the invention is
`described in conjunction with these specific embodiments, it
will be understood that it is not intended to limit the
`
`invention to the described embodiments. On the contrary, it
`is intended to cover alternatives, modifications, and equiva-
`lents as may be included within the spirit and scope of the
`invention as defined by the appended claims. Multi-proces-
`sor architectures having point-to-point communication
`among their processors are suitable for implementing spe-
`cific embodiments of the present invention. In the following
`description, numerous specific details are set forth in order
`to provide a thorough understanding of the present inven-
`tion. The present invention may be practiced without some
`or all of these specific details. Well-known process opera-
tions have not been described in detail in order not to
`unnecessarily obscure the present invention. Furthermore,
`the present application’s reference to a particular singular
entity includes the possibility that the methods and
apparatus of the present invention can be implemented using
`more than one entity, unless the context clearly dictates
`otherwise.
`
`According to various embodiments, techniques are pro-
`vided for increasing data access efficiency in a multiple
`processor system. In a point-to-point architecture, a cluster
`of processors includes multiple processors directly con-
`nected to each other through point-to-point links. By using
`point-to-point links instead of a conventional shared bus or
`external network, multiple processors are used efficiently in
`a system sharing the same memory space. Processing and
`network efficiency are also improved by avoiding many of
`the bandwidth and latency limitations of conventional bus
`and external network based multiprocessor architectures.
`According to various embodiments, however,
`linearly
`increasing the number of processors in a point-to-point
`architecture leads to an exponential increase in the number
`of links used to connect the multiple processors. In order to
reduce the number of links used and to further modularize a
multiprocessor system using a point-to-point architecture,
`multiple clusters may be used.
`According to some embodiments, multiple processor
`clusters are interconnected using a point-to-point architec-
`ture. Each cluster of processors includes a cache coherence
`controller used to handle communications between clusters.
`
In one embodiment, the point-to-point architecture used to
connect processors is also used to connect clusters.
`By using a cache coherence controller, multiple cluster
`systems can be built using processors that may not neces-
`sarily support multiple clusters. Such a multiple cluster
`system can be built by using a cache coherence controller to
`represent non-local nodes in local transactions so that local
`nodes do not need to be aware of the existence of nodes
outside of the local cluster. More detail on the cache
coherence controller will be provided below.
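The controller's role as a stand-in for non-local nodes can be sketched as a simple proxy. This is an illustrative sketch only; the class and method names are hypothetical.

```python
# Hypothetical sketch: the cache coherence controller appears to local
# nodes as one extra node, and fans local traffic out to remote clusters.
class CacheCoherenceController:
    def __init__(self, local_cluster, remote_clusters):
        self.local_cluster = local_cluster
        self.remote_clusters = list(remote_clusters)

    def forward_probe(self, probe):
        # A probe addressed to the controller is relayed to every remote
        # cluster on the local node's behalf; local nodes never need to
        # know those clusters exist.
        return [(cluster, probe) for cluster in self.remote_clusters]

    def aggregate_responses(self, responses):
        # Present the many remote answers as the single response a local
        # node expects from an ordinary node (here: first cache hit, if any).
        return next((r for r in responses if r is not None), None)
```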
`
`
`
`In a single cluster system, cache coherency can be main-
`tained by sending all data access requests through a serial-
`ization point. Any mechanism for ordering data access
`requests (also referred to herein as requests and memory
`requests) is referred to herein as a serialization point. One
`example of a serialization point is a memory controller.
`Various processors in the single cluster system send data
`access requests to one or more memory controllers. In one
`example, each memory controller is configured to serialize
`or lock the data access requests so that only one data access
`request for a given memory line is allowed at any particular
`time. If another processor attempts to access the same
`memory line, the data access attempt is blocked until the
`memory line is unlocked. The memory controller allows
`cache coherency to be maintained in a multiple processor,
`single cluster system.
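A serialization point of this kind can be sketched with per-line locks. The sketch is illustrative only: the names are hypothetical, and a hardware memory controller serializes in-flight requests rather than software threads.

```python
import threading

# Hypothetical sketch of a serialization point: at most one data access
# request for a given memory line proceeds at a time; any other request
# for that line blocks until the line is unlocked.
class SerializingMemoryController:
    def __init__(self):
        self._line_locks = {}           # memory line -> lock
        self._guard = threading.Lock()  # protects the lock table itself

    def _lock_for(self, line):
        with self._guard:
            return self._line_locks.setdefault(line, threading.Lock())

    def access(self, line, operation):
        with self._lock_for(line):   # lock the memory line
            return operation(line)   # perform the serialized access
```

Requests for different memory lines proceed independently; only requests for the same line serialize against each other.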
`A serialization point can also be used in a multiple
`processor, multiple cluster system where the processors in
`the various clusters share a single address space. By using a
`single address space, internal point-to-point links can be
`used to significantly improve intercluster communication
`over traditional external network based multiple cluster
`systems. Various processors in various clusters send data
`access requests to a memory controller associated with a
`particular cluster such as a home cluster. The memory
`controller can similarly serialize all data requests from the
`different clusters. However, a serialization point in a mul-
`tiple processor, multiple cluster system may not be as
`efficient as a serialization point in a multiple processor,
`single cluster system. That is, delay resulting from factors
`such as latency from transmitting between clusters can
`adversely affect the response times for various data access
`requests. It should be noted that delay also results from the
`use of probes in a multiple processor environment.
`Although delay in intercluster transactions in an architec-
`ture using a shared memory space is significantly less than
`the delay in conventional message passing environments
`using external networks such as Ethernet or Token Ring,
`even minimal delay is a significant factor. In some applica-
`tions, there may be millions of data access requests from a
`processor in a fraction of a second. Any delay can adversely
`impact processor performance.
`According to various embodiments, probe management is
`used to increase the efficiency of accessing data in a multiple
`processor, multiple cluster system. A mechanism for elicit-
`ing a response from a node to maintain cache coherency in
`a system is referred to herein as a probe. In one example, a
`mechanism for snooping a cache is referred to as a probe. A
`response to a probe can be directed to the source or target of
`the initiating request. Any mechanism for filtering or reduc-
`ing the number of probes and requests transmitted to various
`nodes is referred to herein as managing