`
`1111111111111111111111111111111111111111111111111111111111111
`US007296121B2
`
`c12) United States Patent
`Morton et al.
`
`(10) Patent No.:
`(45) Date of Patent:
`
`US 7,296,121 B2
`Nov. 13, 2007
`
`(54) REDUCING PROBE TRAFFIC IN
`MULTIPROCESSOR SYSTEMS
`
`(75)
`
`Inventors: Eric Morton, Austin, TX (US); Rajesh
`Kota, Austin, TX (US); Adnan
`Khaleel, Austin, TX (US); David B.
`Glasco, Austin, TX (US)
`
`(73) Assignee: Newisys, Inc., Austin, TX (US)
`
`( *) Notice:
`
`Subject to any disclaimer, the term of this
`patent is extended or adjusted under 35
`U.S.C. 154(b) by 250 days.
`
`(21) Appl. No.: 10/966,161
`
`(22) Filed:
`
`Oct. 15, 2004
`(Under 37 CFR 1.47)
`
`(65)
`
`Prior Publication Data
`
`US 2007/0055826 Al Mar. 8, 2007
`
`Related U.S. Application Data
`
`(63) Continuation-in-part of application No. 10/288,347,
`filed on Nov. 4, 2002, now Pat. No. 7,003,633.
`
`(51)
`
`Int. Cl.
`G06F 12100
`(2006.01)
`(52) U.S. Cl. ....................................... 7111148; 711/141
`(58) Field of Classification Search ................ 711/141,
`711/148, 131, 144, 145, 146; 709/206, 213,
`709/216,217,218,219
`See application file for complete search history.
`
`(56)
`
`References Cited
`
`U.S. PATENT DOCUMENTS
`
`5,195,089 A
`5,394,555 A
`
`3/1993 Sindhu et al.
`2/1995 Hunter et al.
`
`5,524,212 A
`5,692,123 A
`5,751,995 A
`5,829,032 A
`5,893,151 A
`6,018,791 A
`6,038,652 A
`6,052,769 A
`6,067,603 A
`6,073,210 A
`6,085,295 A
`6,108,737 A
`6,122,715 A
`
`6/1996 Somani et a!.
`1111997 Logghe
`511998 Sarangdhar
`10/1998 Komuro et a!.
`4/1999 Merchant
`1/2000 Arimilli et a!.
`3/2000 Phillips et a!.
`4/2000 Huff et a!.
`5/2000 Carpenter et a!.
`6/2000 Palanca et al.
`7/2000 Ekanadham eta!.
`8/2000 Sharma et al.
`9/2000 Palanca et al.
`
`(Continued)
`
`wo
`
`FOREIGN PATENT DOCUMENTS
`wo 0239242
`
`5/2002
`
`OTHER PUBLICATIONS
`
`Guo, et a!., "A Probe-Based Server Selection Protocol for Differ(cid:173)
`entiated Service Networks", © 2002 IEEE, p. 2353-2357.*
`
`(Continued)
`
`Primary Examiner-Brian R. Peugh
`(74) Attorney, Agent, or Firm-Beyer Weaver LLP
`
`(57)
`
`ABSTRACT
`
`A computer system having a plurality of processing nodes
`interconnected by a first point-to-point architecture is
`described. Each processing node has a cache memory asso(cid:173)
`ciated therewith. A probe filtering unit is operable to receive
`probes corresponding to memory lines from the processing
`nodes and to transmit the probes only to selected ones of the
`processing nodes with reference to probe filtering informa(cid:173)
`tion. The probe filtering information is representative of
`states associated with selected ones of the cache memories.
`
`25 Claims, 25 Drawing Sheets
`
`Petition for Inter Partes Review of
`U.S. Pat. No. 7,296,121
`IPR2015‐00158
`EXHIBIT
`Sony‐
`
`1
`
`
`
`US 7,296,121 B2
`Page 2
`
`U.S. PATENT DOCUMENTS
`
`6,148,378 A
`6,167,492 A
`6,173,393 B1
`6,189,078 B1
`6,192,451 B1
`6,205,520 B1
`6,209,065 B1
`6,292,705 B1
`6,292,906 B1
`6,330,643 B1
`6,334,172 B1
`6,338,122 B1
`6,343,347 B1
`6,385,705 B1
`6,405,289 B1
`6,463,529 B1
`6,467,007 B1
`6,490,661 B1
`6,542,926 B2
`6,615,319 B2
`6,631,447 B1
`6,633,945 B1
`6,633,960 B1
`6,636,906 B1
`6,640,287 B2
`6,658,526 B2
`6,665,767 B1
`6,704,842 B1
`6,738,870 B2
`6,738,871 B2
`6,751,698 B1
`6,751,721 B1
`6,754,782 B2
`6,760,809 B2
`6,760,819 B2
`6,775,749 B1 *
`6,799,252 B1
`6,865,595 B2
`6,892,282 B2
`7,003,633 B2
`200110013089 A1
`200110029574 A1 *
`200110037435 A1
`2002/0007463 A1
`2002/0046327 A1
`2002/0052914 A1
`2002/0083149 A1
`2002/0083243 A1
`2002/0087807 A1
`2002/0087811 A1
`2003/0009623 A1
`2003/0182508 A1
`2003/0182509 A1
`2003/0182514 A1
`2003/0195939 A1
`2003/0196047 A1
`2003/0210655 A1
`2003/0212741 A1
`2003/0233388 A1
`2004/0024836 A1 *
`2004/00737 55 A1
`2004/0088492 A1
`2004/0088493 A1
`
`1112000 Bordaz et a!.
`12/2000 Keller et a!.
`112001 Palanca et al.
`212001 Bauman et a!.
`212001 Arimilli et a!.
`3/2001 Palanca et al.
`3/2001 Durham et a!.
`9/2001 Wang et a!.
`9/2001 Fu et a!.
`12/2001 Arimilli et a!.
`12/2001 Arimilli et a!.
`112002 Baumgartner et a!.
`112002 Arimilli et a!.
`5/2002 Keller et a!.
`6/2002 Arimilli et a!.
`10/2002 Miller et a!.
`10/2002 Armstrong et a!.
`12/2002 Keller et a!.
`4/2003 Zalewski et a!.
`9/2003 Khare et al.
`10/2003 Morioka et a!.
`10/2003 Fu et a!.
`10/2003 Kessler et a!.
`10/2003 Sharma et a!.
`10/2003 Gharachorloo et a!.
`12/2003 Nguyen et al.
`12/2003 Comisky et a!.
`3/2004 Janakiraman et a!.
`5/2004 Van Ruben et al.
`5/2004 Van Ruben et al.
`6/2004 Deneroff et a!.
`6/2004 Webb et a!.
`6/2004 Arimilli et a!.
`7/2004 Arimilli eta!.
`7/2004 Dhong eta!.
`8/2004 Mudgett eta!. ............ 7111146
`9/2004 Bauman
`3/2005 Glasco
`5/2005 Hass et a!.
`2/2006 Glasco
`8/2001 Weber
`10/2001 Razdan eta!. .............. 7111130
`11/2001 Van Doren
`112002 Fung
`4/2002 Gharachorloo et a!.
`5/2002 Zalewski et a!.
`6/2002 Van Ruben et al.
`6/2002 Van Ruben
`7/2002 Gharachorloo et a!.
`7/2002 Khare et al.
`112003 Arimilli et a!.
`9/2003 Glasco
`9/2003 Glasco
`9/2003 Glasco
`10/2003 Edirisooriya et al.
`10/2003 Kessler et a!.
`11/2003 Giasco
`11/2003 Giasco
`12/2003 Giasco et al.
`2/2004 Keller eta!. ................ 709/213
`4/2004 Webb et a!.
`5/2004 Glasco
`5/2004 Glasco
`
`2004/0088494 A1
`2004/0117559 A1
`2004/0255002 A1
`
`5/2004 Giasco
`6/2004 Glasco
`12/2004 Kota et a!.
`
`OTHER PUBLICATIONS
`
`l/0 Link Specification Revision
`HyperTransport™
`1.03,
`HyperTransport™ Consortium, Oct. 10, 2001, Copynght © 2001
`HyperTransport Technology Consortium.
`PCT Search Report PCT/US03/34756, Int'l filing date Oct. 30,
`2003, Search report Mailed Dec. 16, 2004.
`.
`Bilir eta!., "Multicast Snooping: A New Coherence Method Usmg
`a Multicast Address Network", Computer Architecture, 1999. Pro(cid:173)
`ceedings of the 26th International Symposium on, May 2-4, 1999.
`Martinet al., "Bandwidth Adaptive Snooping", Proceedmgs of the
`Eighth International Symposium on High-Performance Computer
`Architecture on Feb. 2-6, 2002; pp. 251-262.
`Sorin et al., "SpecifYing and VerifYing a Broadcast and a Multicast
`Snooping Cache Coherence Protocol", IEEE Transactions on Par(cid:173)
`allel and Distributed Systems, vol. 13, No. 6, Jun. 2002.
`U.S. Appl. No.: 10/288,347 (Now U.S. Pat. No. 7,003,633), Notice
`of Allowance, dated Sep. 12, 2005.
`U.S. Appl. No. 10/288,347 (Now U.S. Pat. No. 7,003,633), First
`Office Action, dated Nov. 18, 2004.
`Kim eta!., "Power-aware Partitioned Cache Architectures", 2001
`ACM p. 6467.
`.
`Powell et a!., "Reducing Set-Associative Cache Energy v1a Way(cid:173)
`Prediction and Selective Direct-Mapping" 2001 IEEE, p. 54-65.
`Culler, D. E., J. P. Singh, A. Gupta, "Parallel Computer Architec(cid:173)
`ture", 1999 Morgan Kaufmann, San Francisco, CA USA
`XP002277658.
`Tanenbaum, Andrew, "Computer Networks", Computer Networks,
`London: Prentice Hall International, GB, 1996, pp. 345-403,
`XP002155220.
`U.S. Appl. No.: 10/288,347 (Now U.S. Pat. No. 7,003,633). Final
`Office Action, dated May 12, 2005.
`U.S. Office Action mailed Sep. 22, 2004, from U.S. Appl. No.
`10/106,426 [NWISP002].
`U.S. Office Action mailed Mar. 7, 2005, form U.S. Appl. No.
`10/106,426 [NWISP002].
`U.S. Office Action mailed Jul. 21, 2005, from U.S. Appl. No.
`10/106,426 [NWISP002].
`U.S. Office Action mailed Sep. 23, 2004, from U.S. Appl. No.
`10/106,430 [NWISP003].
`U.S. Office Action mailed Mar. 10, 2005, form U.S. Appl. No.
`10/106,430 [NWISP003].
`U.S. Office Action mailed Jul. 21, 2005, from U.S. Appl. No.
`10/106,430 [NWISP003].
`U.S. Office Action mailed Sep. 22, 2004, from U.S. Appl. No.
`10/106,299 [NWISP004].
`U.S. Office Action mailed Mar. 10, 2005, from U.S. Appl. No.
`10/106,299 [NWISP004].
`U.S. Office Action mailed Jul. 21, 2005, from U.S. Appl. No.
`10/106,299 [NWISP004].
`U.S. Office Action mailed Jul. 20, 2005, from U.S. Appl. No.
`10/608,846 [NWISP030].
`U.S. Office Action mailed Sep. 9, 2005, from U.S. Appl. No.
`10/462,015 [NWISP040].
`U.S. Office Action mailed Sep. 9, 2005, from U.S. Appl. No.
`10/426,084 [NWISP033].
`U.S. Office Action mailed Nov. 2, 2005, from U.S. Appl. No.
`10/106/430 [NWISP003].
`U.S. Office Action mailed Oct. 5, 2005, from U.S. Appl. No.
`10/635,703 [NWISP036].
`* cited by examiner
`
`2
`
`
`
`U.S. Patent
`
`Nov. 13, 2007
`
`Sheet 1 of 25
`
`US 7,296,121 B2
`
`Figure lA
`
`Processing
`Cluster 101
`
`'
`'
`'
`'
`'
`'
`'
`'
`'
`'
`,_,
`
`Processing
`Cluster 103
`
`Ill a~- ______ :·
`
`: _______ : f lllc
`
`Processing
`Cluster 105
`
`Processing
`Cluster 107
`
`Figure lB
`
`Processing
`Cluster 121
`
`Processing
`Cluster 123
`
`-
`
`Switch 131
`141b~: :::
`,:: ~~~141c
`
`Processing
`Cluster 125
`
`Processing
`Cluster 127
`
`3
`
`
`
`06a
`
`11
`
`Figure 2
`
`Remote Clusters
`
`(\/208a
`'
`'
`'
`'
`\)
`
`,r-200
`
`206
`
`~I
`
`-
`
`Processor 202a
`
`(\/232b
`: :
`\ . ./
`
`Cache
`Coherence
`Controller 230
`
`(\/232a
`:
`:
`'
`'
`\,'
`
`Processor 202b r------
`
`-..-
`
`:,- ~- ~-... ,
`·- -- --·
`1"--208e
`
`214a
`
`214b_/
`
`r------ Processor 202c
`
`1;214f
`
`Service
`Processor 212
`
`~
`
`i'-'
`
`v-214e
`
`< ~~ ~)
`232c-1/
`
`:·· -- ··~,
`·- -----
`1"--232d
`
`' ' '
`'
`'
`'
`'
`'
`1
`
`208d_;
`
`~
`
`W6c
`
`;-:: 110 216
`
`1/0 Switch 210
`
`i
`I BIOS I
`
`204
`
`;214c
`
`"--214d
`
`'
`'
`'
`'
`'
`'
`.,_,(_208c
`
`208b-,
`---- ---
`'~. -- ---
`
`Processor 202d 1 -
`
`206d -
`
`1/0220 ~
`
`~
`
`e •
`
`00
`•
`~
`~
`~
`
`~ = ~
`
`z 0
`~ ....
`
`(.H
`
`~
`
`N
`0
`0
`....:J
`
`('D
`('D
`
`rFJ =(cid:173)
`.....
`N
`0 .....
`N
`Ul
`
`d
`rJl
`-....l
`'N
`\C
`... o-..
`""""' N
`
`""""' = N
`
`4
`
`
`
`Figure 3
`
`230
`
`Coherent Interface 307
`
`Protocol Engine 305
`
`Pending
`Buffer 309
`
`Noncoherent Interface 311
`
`e •
`
`00
`•
`~
`~
`~
`
`~ = ~
`
`z 0
`~ ....
`
`(.H
`'"
`N
`0
`0
`-....l
`
`rFJ =(cid:173)
`.....
`
`('D
`('D
`
`(.H
`
`0 .....
`N
`Ul
`
`d
`rJl
`-....l
`'N
`\C
`'"0'1
`""""' N
`
`""""' = N
`
`5
`
`
`
`U.S. Patent
`U.S. Patent
`
`Nov. 13, 2007
`Nov. 13, 2007
`
`Sheet 4 of 25
`Sheet 4 of 25
`
`US 7,296,121 B2
`US 7,296,121 B2
`
`v8:me
`
`
`
`6
`
`
`
`U.S. Patent
`U.S. Patent
`
`Nov. 13, 2007
`NOV. 13, 2007
`
`Sheet 5 of 25
`Sheet 5 of 25
`
`US 7,296,121 B2
`US 7,296,121 B2
`
`gmama
`
`
`
`
`
` H3$602“$04.52
`
`7
`
`
`
`U.S. Patent
`U.S. Patent
`
`Nov. 13, 2007
`Nov. 13, 2007
`
`Sheet 6 of 25
`Sheet 6 of 25
`
`US 7,296,121 B2
`US 7,296,121 B2
`
`mmoSwE
`
`$qu384.52
`
`Hmm
`
`8
`
`
`
`
`
`
`U.S. Patent
`U.S. Patent
`
`Nov. 13,2007
`Nov. 13, 2007
`
`Sheet 7 of 25
`Sheet 7 of 25
`
`US 7,296,121 B2
`US 7,296,121 B2
`
`0m8&3
`
`
`
`mmmmoveZ€03.20Z
`
`9
`
`
`
`U.S. Patent
`
`Nov. 13, 2007
`
`Sheet 8 of 25
`
`US 7,296,121 B2
`
`u ~ ~--------------------,
`
`M
`
`\0
`lrl
`
`01
`
`I u(cid:173)\0
`
`lrl
`
`-u ~ ~----------------7
`
`\0
`V)
`
`r.rJ
`
`~
`0 z
`
`10
`
`
`
`Figure 6
`
`601-2
`
`L
`607
`
`·{Ll/
`LbJ
`
`I
`
`L
`
`Gog~ 601-3
`
`.........___~~~
`-~
`.,.
`
`c
`621-5
`
`MC
`623-2
`
`L
`645
`
`c
`641-1 f
`
`CPU H c
`
`601-l
`
`603-l
`
`Request
`Cluster 600
`
`Home
`Cluster 620
`
`Remote
`Cluster 640
`
`e •
`
`00
`•
`~
`~
`~
`
`~ = ~
`
`z 0
`~
`.....
`
`(.H
`
`~
`
`N
`0
`0
`-....l
`
`('D
`
`rFJ =-('D
`.....
`\0
`0 .....
`N
`Ul
`
`d
`rJl
`--..1
`'N
`\C
`'"0'1
`""""' N
`
`""""' = N
`
`11
`
`
`
`U.S. Patent
`
`Nov. 13, 2007
`
`Sheet 10 of 25
`
`US 7,296,121 B2
`
`Figure 7
`
`Coherence Directory 701
`
`Memory Line 711
`
`State 713
`
`Address 721
`
`Address 731
`
`Address 741
`
`Address 751
`
`Invalid
`
`Invalid
`
`Shared
`
`Shared
`
`Dirty Data
`Owner 715
`NIA
`
`NIA
`
`N/A
`
`NIA
`
`Occupancy Vector
`717
`
`N/A
`
`N/A
`
`Clusters 1,3
`
`Clusters 1, 2, 3, 4
`
`Address 761
`Owned
`! - - - - - - - - - - - - - - r--------------
`Address 771
`Owned
`
`Cluster 4
`
`Cluster 2, 3, 4
`
`Cluster 2
`
`Cluster 2, 4
`
`Address 781
`
`Modified
`
`Cluster 2
`
`Address 791
`
`Modified
`
`Cluster 3
`
`...
`
`...
`
`...
`
`N/A
`
`N/A
`
`...
`
`12
`
`
`
`U.S. Patent
`
`Nov. 13, 2007
`
`Sheet 11 of 25
`
`US 7,296,121 B2
`
`Figure 8
`
`Probe Filter
`Information 821
`
`Read Block (Read) 823
`
`Read Block Modify
`(Read/Write) 825
`
`Invalid 831
`
`Can use completion bit.
`Probe home cluster. (801)
`
`Can use completion bit.
`Probe home cluster.
`(809)
`
`Shared 833
`
`Can use completion bit.
`Probe home cluster. (803)
`
`N/A (811)
`
`Owned 835
`
`Can use completion bit.
`Probe remote cluster with
`line cached in owned
`state. (805)
`
`N/A (813)
`
`Modified 83 7
`
`Can use completion bit.
`Probe remote cluster with
`line cached in modified
`state. (807)
`
`Can use completion bit.
`Probe remote cluster.
`(815)
`
`13
`
`
`
`Figure 9
`
`c
`903-4
`
`CPU
`901-4
`
`c
`903-5
`
`lcPUl
`c
`~
`903-1
`
`Request
`Cluster 900
`
`L
`925
`
`192 ~ l ~ '--,---J
`
`19~7]
`fC I
`
`----------
`
`-, c
`
`1nt-3
`
`Completion Bit Set
`
`c
`921-4
`
`Home
`Cluster 920
`
`~
`
`Remote
`Cluster 940
`
`c ~~
`
`923-2
`
`921-5
`
`e •
`
`00
`•
`~
`~
`~
`
`~ = ~
`
`z 0
`~ ....
`
`(.H
`
`~
`
`N
`0
`0
`-....l
`
`('D
`
`rFJ =-('D
`.....
`....
`N
`0 .....
`N
`Ul
`
`d
`rJl
`"'--...1
`N
`\C
`"'0'1
`""""' N
`
`""""' = N
`
`14
`
`
`
`Figure 10
`
`c
`
`f---+ CPU
`1001-4
`
`I 003-4 I
`
`CPU
`1001-1
`
`c
`~
`1003-1
`
`Request
`Cluster 1000
`
`L
`
`I 1025 ;~~~-~
`
`L
`MC
`C
`1021-1 f---+ 1023-1- 1027
`
`C
`1021-3
`
`Home Cluster
`1020
`
`\ \ 10;1-2'\
`
`~ ,~4s f
`
`c
`1041-1
`
`L
`1047
`
`c
`1041-2
`
`1!~9 f
`
`Cluster I 040
`
`I
`
`Completion Bit Set
`
`I
`
`~
`
`c
`~
`1003-5
`I
`
`t
`c
`f---+ MC
`1021-5
`1023-2
`
`e •
`
`00
`•
`~
`~
`~
`
`~ = ~
`
`z 0
`~ ....
`
`~(H
`N
`0
`0
`-....l
`
`('D
`('D
`
`rFJ =(cid:173)
`.....
`....
`0 .....
`N
`Ul
`
`(H
`
`d
`rJl
`"'--...1
`N
`\C
`"'0'1
`""""' N
`
`""""' = N
`
`15
`
`
`
`U.S. Patent
`
`Nov. 13, 2007
`
`Sheet 14 of 25
`
`US 7,296,121 B2
`
`Figure 11
`
`1135
`
`1139
`
`Forward Probe With
`Completion Indicator To
`Remote Cluster
`
`Receive Home Cluster
`Probe Responses And Do
`Not Forward Response To
`Request Cluster
`
`equest Handling
`Home Coherence
`Controller
`
`Receive Request
`Associated With A
`Memory Line
`
`Forward Request To
`Memory Controller
`
`Receive Probe From
`Memory Controller
`
`Access Coherence
`Directory And Probe
`Filter Information
`
`Yes
`
`Yes
`
`Probe Remote
`Cluster?
`
`1101
`
`1105
`
`1109
`
`1113
`
`1121
`
`1131
`
`No
`
`1141
`
`No
`
`1149
`
`Receive Probe Responses
`And Forward Response
`To Request Cluster With
`Completion Indicator
`
`Receive Source Done
`From Request Cluster
`And Forward Source
`Done To Memory
`Controller
`
`1145
`
`Proceed Using Memory
`Controller Serialization
`With No Filtering And
`Completion Bit
`
`16
`
`
`
`U.S. Patent
`
`Nov. 13, 2007
`
`Sheet 15 of 25
`
`US 7,296,121 B2
`
`Figure 12
`
`Memory Controller Filter Information 1221
`
`Read Block [Read] 1223
`
`Read Block Modify
`[Read/Write] 1225
`
`Invalid 1231
`
`Send request to target.
`(1201)
`
`Send request to target.
`(1209)
`
`Shared 1233
`
`Send request to target.
`(1203)
`
`Send request to target.
`(1211)
`
`Owned 1235
`
`Forward Probe To
`Owning Cluster. (1205)
`
`Send request to target.
`(1213)
`
`Modified 123 7
`
`Forward Probe To
`Forward Probe To
`Modified Cluster. (1207) Modified Cluster. ( 1215)
`
`17
`
`
`
`Figure 13
`
`CPU H c
`
`1301-1
`
`1303-1
`
`c H CPU H c
`
`1303-4
`
`11301-4
`
`I
`
`1303-5
`~
`
`Completion Bit Set
`
`j_,l
`
`Request
`Cluster 1300
`
`c
`1321-1
`
`Home Cluster
`1320
`
`Remote
`Cluster 1340
`
`e •
`
`00
`•
`~
`~
`~
`
`~ = ~
`
`z 0
`~ ....
`
`~(H
`N
`0
`0
`-....l
`
`('D
`
`rFJ =-('D
`.....
`....
`0\
`0 .....
`N
`Ul
`
`d
`rJl
`"'--...1
`N
`\C
`"'0'1
`""""' N
`
`""""' = N
`
`18
`
`
`
`U.S. Patent
`
`Nov. 13, 2007
`
`Sheet 17 of 25
`
`US 7,296,121 B2
`
`Figure 14
`
`equest Handling
`Home Coherence
`Controller
`
`1401
`
`1405
`
`1405
`
`Receive Request
`Associated With A
`Particular Memory Line
`
`1403
`
`Identify Characteristics
`Associated With Request
`
`Forward Request To
`Serialization Point
`
`1411
`~-No--
`
`Proceed With Or Without
`Probe Filtering And
`Completion Indicator
`(Figure 6 Or Figure 11)
`
`1415
`
`1417
`
`Yes
`
`Block Requests And
`Probe Remote Cluster
`(Figure 12)
`
`Unblock Memory Line
`After Receiving Source
`Done From Request
`Cluster
`
`19
`
`
`
`sized write
`
`DIR
`1501-1
`
`1
`
`..,1 HMC
`
`HMC
`1502-2
`
`Home Cluster
`
`Remote Cluster
`
`CCC
`1506-2
`
`R
`
`1502-3
`
`HMC I TD~I DIR \ SD ,.I HMC
`
`1501-2
`
`1502-4
`
`e •
`
`00
`•
`~
`~
`~
`
`~ = ~
`
`z 0
`~ ....
`
`(.H
`
`~
`
`N
`0
`0
`-....l
`
`('D
`('D
`
`rFJ =(cid:173)
`.....
`....
`0 .....
`N
`Ul
`
`QO
`
`1
`
`• CCC
`1507-2
`
`Figure 15
`
`d
`rJl
`-....l
`'N
`\C
`'"0'1
`""""' N
`
`""""' = N
`
`20
`
`
`
`validate
`block
`
`I HMC
`
`.1602-1
`
`DIR I
`
`1601-1
`
`L
`
`HMC
`1602-2
`
`CCC
`1'606-2
`
`DIR.
`1601-2
`
`SD. HMC
`1602-3
`
`Home Cluster
`
`Remote Cluster(s)
`
`R
`
`CCC
`lt-----·1607-2
`
`Figure 16
`
`e •
`
`00
`•
`~
`~
`~
`
`~ = ~
`
`z 0
`~ ....
`
`(.H
`
`~
`
`N
`0
`0
`-....l
`
`('D
`('D
`
`rFJ =(cid:173)
`.....
`....
`\0
`0 .....
`N
`Ul
`
`d
`rJl
`-....l
`'N
`\C
`'"0'1
`"""' N
`
`"""' = N
`
`21
`
`
`
`1700
`
`Ql g.:g
`8~ ...10:::
`
`LocaVRerrote
`Transaction
`Interface
`
`Eviction
`Manager
`1702
`
`Cache
`Coherence
`Directory
`1701
`
`y
`
`I
`
`y
`Coherent hterface 1707
`I
`
`A"otocol Engine 1705
`
`Pending
`Buffer
`1709
`
`Gl
`1V
`~~
`
`I ! ~
`
`_§
`
`l---N;~-~e~~lnterfaoo 111-1--~
`
`Figure 17
`
`e •
`
`00
`•
`~
`~
`~
`
`~ = ~
`
`z 0
`~ ....
`
`(.H
`
`~
`
`N
`0
`0
`-....l
`
`rFJ =(cid:173)
`('D a
`
`N
`0
`0 .....
`N
`Ul
`
`d r.r;_
`
`-....l
`'N
`\C
`0'1
`"'
`""""' N
`
`""""' = N
`
`22
`
`
`
`Figure 18
`
`_lSUOa ll
`
`'
`'
`
`'
`'
`
`' . ~, ... •
`
`f - - - Processor 1802a
`
`/'.,r1832b
`: :
`\~.'
`
`Probe
`Filtering
`Unit
`1830
`
`(·.,r1832a
`I
`:
`'
`'
`'-~'
`
`(.- -- -.. ~,
`.. -- __ ..
`'-1808e
`
`:"- -- --.. ,
`.. _ -- --·
`1832c- /
`
`:"'" -- -.... ,
`.... -- --·
`"-1832d
`
`1808a
`
`r18oo
`
`16UO
`
`~I
`
`Processor 1802b -
`
`1808b- \ . -· --
`....
`'•- ... __ )
`
`f - - - Processor 1802c
`
`,·,.\
`: :
`
`t"\
`: :
`
`1808d_) ..
`
`~
`
`~
`
`.,_\__1808c
`
`Processor 1802d -
`
`806c
`
`~ I/0 1816
`
`1101820 Q
`
`1806d -
`
`-F1
`
`I/0 Switch 1810
`
`T
`I BIOS I
`1804
`
`e •
`
`00
`•
`~
`~
`~
`
`~ = ~
`
`z 0
`~ ....
`
`(.H
`
`~
`
`N
`0
`0
`--.l
`
`('D
`('D
`
`rFJ =(cid:173)
`.....
`N ....
`0 .....
`N
`Ul
`
`d
`rJl
`---1
`'N
`\C
`... o-.
`""""' N
`
`""""' = N
`
`23
`
`
`
`1802
`
`JTAG
`Handshake
`Registers
`
`[\_;908
`
`Processor
`
`CLK
`CTL
`CAD (n:O)
`
`....-
`
`I"
`
`""'["
`
`CLK
`CTL
`CAD (n:O)
`
`v
`1906b~
`
`Routing
`Table
`
`v--1902
`
`....-
`
`....-
`
`~
`""""~""
`
`1904a
`
`CLK
`CTL
`CAD (n:O)
`
`......
`
`CLK
`CTL
`.... CAD (n:O)
`,
`
`N906a
`
`Routing
`Table
`
`Routing
`Table
`
`1906~
`
`/
`
`CLK
`CTL
`CAD (n:O)
`
`~
`
`~
`
`~
`""'["
`
`CLK
`CTL
`CAD (n:O)
`
`Figure 19
`
`'
`
`1904b
`
`~
`
`~
`
`_..
`
`~
`
`)
`
`\
`
`)
`
`..
`__..
`...
`
`1904c
`
`e •
`
`00
`•
`~
`~
`~
`
`~ = ~
`
`z 0
`~ ....
`
`(.H
`
`~
`
`N
`0
`0
`-....l
`
`('D
`('D
`
`rFJ =(cid:173)
`.....
`N
`N
`0 .....
`N
`Ul
`
`d
`rJl
`-....l
`'N
`\C
`'"0'1
`""""' N
`
`""""' = N
`
`24
`
`
`
`U.S. Patent
`
`Nov. 13, 2007
`
`Sheet 23 of 25
`
`US 7,296,121 B2
`
`Local Probe
`Filtering
`
`PFU accepts probe and
`refers to cache coherence
`directory (2008)
`
`respond to
`requester (2012)
`
`no
`
`yes
`
`PFU sends probe(s) only to nodes
`that may have line in cache (2014)
`
`Figure 20
`
`yes
`
`end
`
`25
`
`
`
`ProbeSrc only goes to filtering unit from M CT,
`gets new TOT and resent out as a ProbeTgt.
`
`RR
`
`Req
`
`..
`
`CPU
`
`M
`
`/~ NO-_---~
`~
`p ~ N
`
`P PFU /
`
`PR
`
`PFU---+l:PU
`2 x P R
`_____________..
`
`SD
`
`..
`
`M
`
`Non probed nodes due to probe filtering are not
`aware ofthe transactoin.
`
`Nl
`
`N3
`
`Figure 21
`
`e •
`
`00
`•
`~
`~
`~
`
`~ = ~
`
`z 0
`~ ....
`
`(.H
`
`~
`
`N
`0
`0
`-....l
`
`('D
`('D
`
`rFJ =(cid:173)
`......
`N
`.j;o.
`
`0 ......
`N
`Ul
`
`d
`rJl
`-....l
`'N
`\C
`'"0'1
`""""' N
`
`""""' = N
`
`26
`
`
`
`RR
`
`Req
`
`CPU
`
`.,. M
`
`SD __ __...~ M
`~ PFU -·~CPU
`2xPR
`
`NO
`
`Nl
`
`Non probed nodes due to probe filtering are not
`aware ofthe transactoin.
`
`N2
`
`N3
`
`Figure 22
`
`e •
`
`00
`•
`~
`~
`~
`
`~ = ~
`
`z 0
`~ ....
`
`(.H
`
`~
`
`N
`0
`0
`-....l
`
`('D
`
`rFJ =-('D
`.....
`N
`Ul
`0 .....
`N
`Ul
`
`d
`rJl
`-....l
`'N
`\C
`'"0'1
`""""' N
`
`""""' = N
`
`27
`
`
`
`US 7,296,121 B2
`
`1
`REDUCING PROBE TRAFFIC IN
`MULTIPROCESSOR SYSTEMS
`
`CROSS-REFERENCE TO RELATED
`APPLICATIONS
`
`The present application is a continuation-in-part of and
`claims priority under 35 U.S.C. 120 to U.S. patent applica(cid:173)
`tion Ser. No. 10/288,347 now U.S. Pat. No. 7,003,633 for
`METHODS AND APPARATUS FOR MANAGING
`PROBE REQUESTS filed on Nov. 4, 2002 the entire
`disclosure of which is incorporated herein by reference for
`all purposes. The subject matter described in the present
`application is also related to U.S. patent application Ser. No.
`10/288,399 now U.S. Pat. No. 7,103,726 for METHODS 15
`AND APPARATUS
`FOR MANAGING
`PROBE
`REQUESTS filed on Nov. 4, 2002, the entire disclosure of
`which is incorporated herein by reference for all purposes.
`
`BACKGROUND OF THE INVENTION
`
`The present invention generally relates to accessing data
`in a multiple processor system. More specifically, the
`present invention provides techniques for reducing memory
`transaction traffic in a multiple processor system.
`Data access in multiple processor systems can raise issues
`relating to cache coherency. Conventional multiple proces(cid:173)
`sor computer systems have processors coupled to a system
`memory through a shared bus. In order to optimize access to
`data in the system memory, individual processors are typi(cid:173)
`cally designed to work with cache memory. In one example,
`each processor has a cache that is loaded with data that the
`processor frequently accesses. The cache is read or written
`by a processor. However, cache coherency problems arise
`because multiple copies of the same data can co-exist in
`systems having multiple processors and multiple cache
`memories. For example, a frequently accessed data block
`corresponding to a memory line may be loaded into the
`cache of two different processors. In one example, if both
`processors attempt to write new values into the data block at
`the same time, different data values may result. One value
`may be written into the first cache while a different value is
`written into the second cache. A system might then be unable
`to determine what value to write through to system memory.
`A variety of cache coherency mechanisms have been
`developed to address such problems in multiprocessor sys(cid:173)
`tems. One solution is to simply force all processor writes to
`go through to memory immediately and bypass the associ(cid:173)
`ated cache. The write requests can then be serialized before 50
`overwriting a system memory line. However, bypassing the
`cache significantly decreases efficiency gained by using a
`cache. Other cache coherency mechanisms have been devel(cid:173)
`oped for specific architectures. In a shared bus architecture,
`each processor checks or snoops on the bus to determine 55
`whether it can read or write a shared cache block. In one
`example, a processor only writes an object when it owns or
`has exclusive access to the object. Each corresponding cache
`object is then updated to allow processors access to the most
`recent version of the object.
`Bus arbitration is used when both processors attempt to
`write the same shared data block in the same clock cycle.
`Bus arbitration logic decides which processor gets the bus
`first. Although, cache coherency mechanisms such as bus
`arbitration are effective, using a shared bus limits the num(cid:173)
`ber of processors that can be implemented in a single system
`with a single memory space.
`
`45
`
`2
`Other multiprocessor schemes involve individual proces(cid:173)
`sor, cache, and memory systems connected to other proces(cid:173)
`sors, cache, and memory systems using a network backbone
`such as Ethernet or Token Ring. Multiprocessor schemes
`involving separate computer systems each with its own
`address space can avoid many cache coherency problems
`because each processor has its own associated memory and
`cache. When one processor wishes to access data on a
`remote computing system, communication is explicit. Mes-
`10 sages are sent to move data to another processor and
`messages are received to accept data from another processor
`using standard network protocols such as TCP/IP. Multipro(cid:173)
`cessor systems using explicit communication including
`transactions such as sends and receives are referred to as
`systems using multiple private memories. By contrast, mul(cid:173)
`tiprocessor system using implicit communication including
`transactions such as loads and stores are referred to herein as
`using a single address space.
`Multiprocessor schemes using separate computer systems
`20 allow more processors to be interconnected while minimiz(cid:173)
`ing cache coherency problems. However, it would take
`substantially more time to access data held by a remote
`processor using a network infrastructure than it would take
`to access data held by a processor coupled to a system bus.
`25 Furthermore, valuable network bandwidth would be con(cid:173)
`sumed moving data to the proper processors. This can
`negatively impact both processor and network performance.
`Performance limitations have led to the development of a
`point-to-point architecture for connecting processors in a
`30 system with a single memory space. In one example, indi(cid:173)
`vidual processors can be directly connected to each other
`through a plurality of point-to-point links to form a cluster
`of processors. Separate clusters of processors can also be
`connected. The point-to-point links significantly increase the
`35 bandwidth for coprocessing and multiprocessing functions.
`However, using a point-to-point architecture to connect
`multiple processors in a multiple cluster system sharing a
`single memory space presents its own problems.
`Consequently, it is desirable to provide techniques for
`40 improving data access and cache coherency in systems
`having multiple processors connected using point-to-point
`links.
`
`SUMMARY OF THE INVENTION
`
`According to the present invention, various techniques are
`provided for reducing traffic relating to memory transactions
`in multi-processor systems. According to various specific
`embodiments, a computer system having a plurality of
`processing nodes interconnected by a first point-to-point
`architecture is provided. Each processing node has a cache
`memory associated therewith. A probe filtering unit is oper(cid:173)
`able to receive probes corresponding to memory lines from
`the processing nodes and to transmit the probes only to
`selected ones of the processing nodes with reference to
`probe filtering information. The probe filtering information
`is representative of states associated with selected ones of
`the cache memories.
`According to other embodiments, methods and apparatus
`60 are provided for reducing probe traffic in a computer system
`comprising a plurality of processing nodes interconnected
`by a first point-to-point architecture. A probe corresponding
`to a memory line is transmitted from a first one of the
`processing nodes only to a probe filtering unit. The probe is
`65 evaluated with the probe filtering unit to determine whether
`a valid copy of the memory line is in any of the cache
`memories. The evaluation is done with reference to probe
`
`28
`
`
`
`3
`filtering information associated with the probe filtering unit
`and representative of states associated with selected ones of
`the cache memories. The probe is transmitted from the probe
`filtering unit only to selected ones of the processing nodes
`identified by the evaluating. Probe responses from the
`selected processing nodes are accumulated by the probe
`filtering unit. Only the probe filtering unit responds to the
`first processing node.
`A further understanding of the nature and advantages of
`the present invention may be realized by reference to the 10
`remaining portions of the specification and the drawings.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`60
`
`US 7,296,121 B2
`
`4
`FIG. 21 is a diagrammatic representation of a transaction
`flow in which local probe filtering is facilitated according to
`a specific embodiment of the invention.
`FIG. 22 is a diagrammatic representation of another
`transaction flow in which local probe filtering is facilitated
`according to a specific embodiment of the invention.
`
`DETAILED DESCRIPTION OF SPECIFIC
`EMBODIMENTS
`
`Reference will now be made in detail to some specific
`embodiments of the invention including the best modes
`contemplated by the inventors for carrying out the invention.
`Examples of these specific embodiments are illustrated in
`The invention may best be understood by reference to the
`15 the accompanying drawings. While
`the
`invention
`is
`following description taken in conjunction with the accom(cid:173)
`described in conjunction with these specific embodiments, it
`panying drawings, which are illustrative of specific embodi(cid:173)
`will be understood that it is not intended to limit the
`ments of the present invention.
`invention to the described embodiments. On the contrary, it
`FIGS. 1A and 1B are diagrammatic representation depict(cid:173)
`is intended to cover alternatives, modifications, and equiva-
`20 Ients as may be included within the spirit and scope of the
`ing a system having multiple clusters.
`FIG. 2 is a diagrammatic representation of a cluster
`invention as defined by the appended claims. Multi-proces(cid:173)
`having a plurality of processors.
`sor architectures having point-to-point communication
`FIG. 3 is a diagrammatic representation of a cache coher(cid:173)
`among their processors are suitable for implementing spe-
`ence controller.
`cific embodiments of the present invention. In the following
`FIG. 4 is a diagrmatic representation showing a trans- 25
`description, numerous specific details are set forth in order
`action flow for a data access request from a processor in a
`to provide a thorough understanding of the present inven-
`single cluster.
`tion. The present invention may be practiced without some
`FIG. SA-SD are diagrammatic representations showing
`or all of these specific details. Well-known process opera(cid:173)
`tions have not been described in detail in order not to
`cache coherence controller functionality.
`FIG. 6 is a diagrammatic representation depicting a trans- 30
`unnecessarily obscure the present invention. Furthermore,
`action flow for a request with multiple probe responses.
`the present application's reference to a particular singular
`FIG. 7 is a diagrmatic representation showing a cache
`entity includes that possibility that the methods and appa(cid:173)
`ratus of the present invention can be implemented using
`coherence directory.
`FIG. 8 is a diagrammatic representation showing probe
`more than one entity, unless the context clearly dictates
`filter information that can be used to reduce the number of 35 otherwise.
`According to various embodiments, techniques are pro(cid:173)
`probes transmitted to various clusters.
`FIG. 9 is a diagrmatic representation showing a trans(cid:173)
`vided for increasing data access efficiency in a multiple
`action flow for probing of a home cluster without probing of
`processor system. In a point-to-point architecture, a cluster
`other clusters.
`of processors includes multiple processors directly con-
`FIG. 10 is a diagrammatic representation showing a
`40 nected to each other through point-to-point links. By using
`transaction flow for probing of a single remote cluster.
`point-to-point links instead of a conventional shared bus or
`FIG. 11 is a flow process diagram showing the handling
`external network, multiple processors are used efficiently in
`of a request with probe filter information.
`a system sharing the same memory space. Processing and
`FIG. 12 is a diagrmatic representation showing
`network efficiency are also improved by avoiding many of
`45 the bandwidth and latency limitations of conventional bus
`memory controller filter information.
`FIG. 13 is a diagrammatic representation showing a
`and external network based multiprocessor architectures.
`transaction flow for probing a single remote cluster without
`According to various embodiments, however,
`linearly
`increasing the number of processors in a point-to-point
`probing a home cluster.
`FIG. 14 is a flow process diagram showing the handling
`architecture leads to an exponential increase in the number
`of a request at a home cluster cache coherence controller
`50 of links used to connect the multiple processors. In order to
`reduce the number of links used and to further modularize a
`using memory controller filter information.
`FIG. 15 is a diagrammatic representation showing a
`multiprocessor system using a point-to-point architecture,
`transaction flow for a cache coherence directory eviction of
`multiple clusters may be used.
`an entry corresponding to a dirty memory line.
`According to some embodiments, multiple processor
`FIG. 16 is a diagrammatic representation showing a 55
`clusters are interconnected using a point-to-point architec(cid:173)
`transaction flow for a cache coherence directory eviction of
`ture. Each cluster of processors includes a cache coherence
`an entry corresponding to a clean memory line.
`controller used to handle communications between clusters.
`FIG. 17 is a diagrmatic representation of a cache
`In one embodiment, the point-to-point architecture used to
`coherence controller according to a specific embodiment of
`connect processors are used