`
United States Patent
Fu et al.

(10) Patent No.: US 6,633,945 B1
(45) Date of Patent: Oct. 14, 2003
`
(54) FULLY CONNECTED CACHE COHERENT MULTIPROCESSING SYSTEMS

(75) Inventors: Daniel Fu, Sunnyvale, CA (US); Carlton T. Amdahl, Alameda County, CA (US); Walstein Bennett Smith, III, Palo Alto, CA (US)

(73) Assignee: Conexant Systems, Inc., Newport Beach, CA (US)
`
( * ) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 0 days.

(21) Appl. No.: 09/349,641
(22) Filed: Jul. 8, 1999
Related U.S. Application Data

(63) Continuation-in-part of application No. 09/281,749, filed on Mar. 30, 1999, now Pat. No. 6,516,442, which is a continuation-in-part of application No. 09/163,294, filed on Sep. 29, 1998, now Pat. No. 6,292,705, which is a continuation-in-part of application No. 08/986,430, filed on Dec. 7, 1997, now Pat. No. 6,065,077.
`
(51) Int. Cl.7 ................ G06F 13/00
(52) U.S. Cl. ................ 710/316; 710/317; 710/29; 709/213; 711/130
(58) Field of Search ........ 710/100, 305, 313, 315, 316, 317, 29; 711/144, 143, 130, 147; 709/213
`
`(56)
`
`References Clted
`U.S. PATENT DOCUMENTS
`
[entries illegible in source]
4,480,307 A 10/1984 Budde et al.
5,161,156 A 11/1992 Baum et al.
5,271,000 A 12/1993 Engbersen et al.
5,313,609 A 5/1994 Baylor et al.
5,335,335 A 8/1994 Jackson et al.
5,440,698 A 8/1995 Sindhu et al.
5,505,686 A 4/1996 Willis et al.
5,511,226 A 4/1996 Zilka
5,513,335 A 4/1996 McClure
5,524,234 A 6/1996 Martinez, Jr. et al.
5,526,380 A 6/1996 Izzard
[entries illegible in source]
5,537,575 A 7/1996 Foley
5,553,310 A 9/1996 Taylor et al.
5,561,779 A 10/1996 Jackson
5,568,620 A 10/1996 Sarangdhar et al.
5,574,868 A 11/1996 Marisetty
5,577,204 A 11/1996 Brewer et al.
5,581,729 A 12/1996 Nishtala et al.
5,588,131 A 12/1996 Borrill
5,594,886 A 1/1997 Smith et al.
5,602,814 A 2/1997 Jaquette et al.
5,606,686 A 2/1997 Tarui et al.
5,634,043 A 5/1997 Self et al.
5,634,068 A 5/1997 Nishtala et al.
5,644,754 A 7/1997 Weber
5,655,100 A 8/1997 Ebrahim et al.
5,657,472 A 8/1997 Van Loo et al.
(List continued on next page.)
`
OTHER PUBLICATIONS

Technical White Paper, Sun™ Enterprise™ 10000 Server, Sun Microsystems, Sep. 1998.
Alan Charlesworth, Starfire: Extending the SMP Envelope, IEEE Micro, Jan./Feb. 1998, pp. 39-49.
(List continued on next page.)

Primary Examiner—Sumati Lefkowitz
Assistant Examiner—X. Chung-Trans
(74) Attorney, Agent, or Firm—Keith Kind; Kelly H. Hale
`(57)
`ABSTRACT
`Fully connected multiple FCU-based architectures reduce
`requirements for Tag SRAM siZe and memory read laten
`cies. Apreferred embodiment of a symmetric multiprocessor
`system includes a switched fabric (switch matrix) for data
`transfers that provides multiple concurrent buses that enable
`greatly increased bandwidth between processors and shared
`memory. Ahlgh-speed pomt-to-pomt Channel couples com
`mand initiators and memory with the switch matrix and with
`I/O subsystems.
`
`10 Claims, 15 Drawing Sheets
`
[Representative drawing: four FCU-MCU nodes (0-3), each with DDR-SDRAM and an MP CPU bus, fully interconnected by PT-to-PT channels, with I/O bridge chips fanning out to PCI buses.]
`
`
U.S. PATENT DOCUMENTS

5,682,516 A 10/1997 Sarangdhar et al.
5,684,977 A 11/1997 Van Loo et al.
5,696,910 A 12/1997 Pawlowski
5,796,605 A 8/1998 Hagersten
5,829,034 A 10/1998 Hagersten et al.
5,895,495 A 4/1999 Arimilli et al.
5,897,656 A 4/1999 Vogt et al.
5,940,856 A 8/1999 Arimilli et al.
5,946,709 A 8/1999 Arimilli et al.
5,978,411 A 11/1999 Kitade et al.
6,044,122 A 3/2000 Ellersick et al.
6,065,077 A * 5/2000 Fu ................ 710/100
6,125,429 A * 9/2000 Goodwin et al. ... 711/143
6,145,007 A 11/2000 Dokic et al.
6,279,084 B1 8/2001 VanDoren et al.
6,289,420 B1 * 9/2001 Cypher ........... 711/144
6,292,705 B1 9/2001 Wang et al.
6,295,581 B1 * 9/2001 DeRoo ............ 711/135
`
OTHER PUBLICATIONS

Joseph Heinrich, Origin™ and Onyx2™ Theory of Operations Manual, Document No. 007-3439-002, Silicon Graphics, Inc., 1997.
White Paper, Sequent's NUMA-Q SMP Architecture, Sequent, 1997.
White Paper, Eight-way Multiprocessing, Hewlett-Packard, Nov. 1997.
George White & Pete Vogt, Profusion, a Buffered, Cache-Coherent Crossbar Switch, presented at Hot Interconnects Symposium V, Aug. 1997.
Alan Charlesworth, et al., Gigaplane-XB: Extending the Ultra Enterprise Family, presented at Hot Interconnects Symposium V, Aug. 1997.
James Laudon & Daniel Lenoski, The SGI Origin: A ccNUMA Highly Scalable Server, Silicon Graphics, Inc., presented at the Proc. of the 24th Int'l Symp. Computer Architecture, Jun. 1997.
Mike Galles, Spider: A High-Speed Network Interconnect, IEEE Micro, Jan./Feb. 1997, pp. 34-39.
T.D. Lovett, R.M. Clapp and R.J. Safranek, NUMA-Q: an SCI-based Enterprise Server, Sequent, 1996.
Daniel E. Lenoski & Wolf-Dietrich Weber, Scalable Shared-Memory Multiprocessing, Morgan Kaufmann Publishers, 1995, pp. 143-159.
David B. Gustavson, The Scalable Coherent Interface and Related Standards Projects (as reprinted in Advanced Multimicroprocessor Bus Architectures, Janusz Zalewski, IEEE Computer Society Press, 1995, pp. 195-207).
Kevin Normoyle, et al., UltraSPARC™ Port Architecture, Sun Microsystems, Inc., presented at Hot Interconnects III, Aug. 1995.
Kevin Normoyle, et al., UltraSPARC™ Port Architecture, Sun Microsystems, Inc., presented at Hot Interconnects III, Aug. 1995, UltraSparc Interfaces.
Kai Hwang, Advanced Computer Architecture: Parallelism, Scalability, Programmability, McGraw-Hill, 1993, pp. 355-357.
Jim Handy, The Cache Memory Book, Academic Press, 1993, pp. 161-169.
Angel L. Decegama, Parallel Processing Architectures and VLSI Hardware, Vol. 1, Prentice-Hall, 1989, pp. 341-344.

* cited by examiner
`
`
`
`
[Drawing sheets 1-15 are figure images not reproducible as text. They depict FIGS. 1-15 as enumerated in the Brief Description of Drawings below: the prior-art shared-bus system (FIG. 1), the FCU-centered switched-fabric system and its internal detail (FIGS. 2-3), per-CPU CCU and integrated CPU/CCU variations (FIGS. 4, 6, 8-10), a timing comparison (FIG. 5), system configurations (FIGS. 7a-7d, 11), the fully connected multiple-FCU architectures (FIGS. 12-13), and the cache line characteristics and definition (FIGS. 14-15).]
`
`
FULLY CONNECTED CACHE COHERENT MULTIPROCESSING SYSTEMS
`
`CROSS-REFERENCE TO RELATED APPLICATIONS
This patent application is a continuation-in-part of the following commonly-owned U.S. patent application Ser. Nos.: U.S. application Ser. No. 08/986,430, now U.S. Pat. No. 6,065,077, AN APPARATUS AND METHOD FOR A CACHE COHERENT SHARED MEMORY MULTIPROCESSING SYSTEM, filed Dec. 7, 1997; U.S. application Ser. No. 09/163,294, now U.S. Pat. No. 6,292,705, METHOD AND APPARATUS FOR ADDRESS TRANSFERS, SYSTEM SERIALIZATION, AND CENTRALIZED CACHE AND TRANSACTION CONTROL, IN A SYMMETRIC MULTIPROCESSOR SYSTEM, filed Sep. 29, 1998; and U.S. application Ser. No. 09/281,749, now U.S. Pat. No. 6,516,442, CACHE INTERFACE AND PROTOCOLS FOR CACHE COHERENCY IN A SCALABLE SYMMETRIC MULTIPROCESSOR SYSTEM, filed Mar. 30, 1999; all of which are incorporated by reference herein.
`
`10
`
`15
`
`20
`
BACKGROUND

FIGS. 2-11 show point-to-point cache coherent switch solutions for multiprocessor systems that are the subject of copending and coassigned applications.
Depending on the implementation specifics, these designs may be problematic in two respects:
1. Tag SRAM size is expensive.
2. Latency is greater than desired.
First, the SRAM Size Issue:
To support an L2 size of 4 MB, 64 GB of total memory, and a 64-byte line size:
the TAG array will have 4 MB/64 bytes = 64K entries;
the TAG size will be 14 bits;
the total TAG array size will be 14 bits * 64K = 917,504 bits per CPU.
To support an 8-way system, a duplicated TAG array will be 8 * 14 bits * 64K, about 8 Mbits of SRAM.
8 Mbits of SRAM is too large for single-silicon integration even with a 0.25 micron CMOS process.
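The arithmetic above can be restated as a short calculation. The following is a minimal sketch (Python; the variable names are ours, and the 36-bit physical address is implied by the 64 GB memory figure):

```python
# Duplicate cache-tag sizing for the example above:
# 4 MB L2 per CPU, 64-byte lines, 64 GB (36-bit) physical memory, 8 CPUs.
L2_SIZE = 4 * 2**20            # 4 MB L2 cache per CPU
LINE_SIZE = 64                 # 64-byte cache line
PHYS_ADDR_BITS = 36            # 64 GB of total memory

entries = L2_SIZE // LINE_SIZE                 # 4 MB / 64 B = 65,536 (64K) entries
index_bits = entries.bit_length() - 1          # 16 bits index the tag array
offset_bits = LINE_SIZE.bit_length() - 1       # 6 bits address bytes within a line
tag_bits = PHYS_ADDR_BITS - index_bits - offset_bits  # 36 - 16 - 6 = 14 bits

per_cpu_bits = tag_bits * entries              # 14 * 64K = 917,504 bits per CPU
eight_way_bits = 8 * per_cpu_bits              # 7,340,032 bits, i.e. "about 8 Mbits"
print(entries, tag_bits, per_cpu_bits, eight_way_bits)
```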
Second, the Latency Issue:
Although the switch fabric solutions of FIGS. 2-11 provide scalability in memory throughput, maximum transaction parallelism, and easy PCB board routing, the latency for memory read transactions is greater than desired.
Example for Memory Read Transactions:
A CPU read transaction is first latched by the CCU, which formats the transaction into a channel command and sends it through the channel. The FCU's IIF unit de-serializes the channel command or data and performs the cache coherency operation; the FCU then sends the memory read transaction to the MCU. The MCU de-serializes the channel command, sends the read command to the DRAM address bus, reads from the DRAM data bus, and sends the data to the FCU via the channel; the FCU sends the data to the CCU via the channel. Finally the data is presented at the CPU bus. A read transaction thus crosses the channel four times, and each crossing introduces additional latency. What is needed is an SMP architecture with the benefits of the present FCU architecture, but with reduced Tag SRAM size requirements per chip and with reduced latencies.
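The four crossings can be made explicit. A sketch follows (Python; the step descriptions paraphrase the example above, and no latency figures from this disclosure are implied):

```python
# Path of one memory read through the FCU architecture described above.
# A step marked True is a serialize/de-serialize crossing of the channel.
READ_PATH = [
    ("CCU latches CPU read and formats it into a channel command", True),
    ("FCU IIF de-serializes command, performs coherency operation", False),
    ("FCU forwards the read transaction to the MCU", True),
    ("MCU drives DRAM address bus and reads DRAM data bus", False),
    ("MCU sends the data back to the FCU", True),
    ("FCU sends the data to the CCU; data presented on CPU bus", True),
]

crossings = sum(1 for _, crossing in READ_PATH if crossing)
print(f"channel crossings per read: {crossings}")  # 4, each adding latency
```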
`
`25
`
`30
`
`35
`
`40
`
`45
`
`50
`
`55
`
`60
`
SUMMARY

Fully connected multiple FCU-based architectures reduce requirements for Tag SRAM size and memory read latencies. A preferred embodiment of a symmetric multiprocessor system includes a switched fabric (switch matrix) for data transfers that provides multiple concurrent buses that enable greatly increased bandwidth between processors and shared memory. A high-speed point-to-point Channel couples command initiators and memory with the switch matrix and with I/O subsystems.
`
BRIEF DESCRIPTION OF DRAWINGS

FIG. 1 is a drawing of a prior-art generic symmetric shared-memory multiprocessor system using a shared bus.
FIG. 2 is a drawing of a preferred embodiment symmetric shared-memory multiprocessor system using a switched fabric data path architecture centered on a Flow-Control Unit (FCU).
FIG. 3 is a drawing of the switched fabric data path architecture of FIG. 2, further showing internal detail of an FCU having a Transaction Controller (TC), Transaction Bus (TB), and Transaction Status Bus (TSB) according to the present invention.
FIG. 4 is a drawing of a variation of the embodiment of FIG. 2, in which each CPU has its own CCU, and in which the channel interface and control is abstractly represented as being composed of a physical (PHY) link layer and a transport layer.
FIG. 5 is a timing diagram comparing the memory transaction performance of a system based on a flow control unit according to the present invention and a prior art shared-bus system.
FIG. 6 is another view of the embodiment of FIG. 4.
FIG. 7 is a drawing of a number of system embodiments according to the present invention. FIG. 7a illustrates a minimal configuration, 7b illustrates a 4-way configuration, 7c illustrates an 8-way high-performance configuration, and 7d illustrates a configuration for I/O intensive applications.
FIG. 8 is a drawing of a CPU having an integral CCU.
FIG. 9 illustrates a variation of the embodiment of FIG. 6 using the integrated CPU/CCU of FIG. 8.
FIGS. 10a-d illustrate variations of the embodiments of FIG. 7 using the integrated CPU/CCU of FIG. 8.
FIG. 11 is a drawing of a 4-way embodiment of the present invention that includes coupling to an industry-standard switching fabric for coupling CPU/Memory complexes with I/O devices.
FIG. 12 is a drawing of an FCU-based architecture according to a first embodiment.
FIG. 13 is a drawing of an FCU-based architecture according to a second embodiment.
FIG. 14 defines the cache line characteristics of the systems of FIGS. 12 and 13.
FIG. 15 gives the cache line definition.
`
DETAILED DESCRIPTION

System Overview

FIG. 2 is a drawing of a preferred embodiment symmetric shared-memory multiprocessor system using a switched fabric data path architecture centered on a Flow-Control Unit (FCU) 220. In the illustrated embodiment, eight processors 120 are used and the configuration is referred to herein as an "8P" system.
The FCU (Flow Control Unit) 220 chip is the central core of the 8P system. The FCU internally implements a switched-fabric data path architecture. Point-to-Point (PP)
interconnect 112, 113, and 114 and an associated protocol define dedicated communication channels for all FCU I/O. The terms Channels and PP-Channel are references to the FCU's PP I/O. The FCU provides Point-to-Point Channel interfaces to up to ten Bus Bridge Units (BBUs) 240 and/or CPU Channel Units (CCUs, also known as Channel Interface Units or CIUs) and one to four Memory Control Units (MCUs) 230. Two of the ten Channels are fixed to connect to BBUs. The other eight Channels can connect to either BBUs or CCUs. In an illustrative embodiment the number of CCUs is eight. In one embodiment the CCUs are packaged as a pair referred to herein as a Dual CPU Interface Unit (DCIU) 210. In the 8P system shown, the Dual CPU Interface Unit (DCIU) 210 interfaces two CPUs with the FCU. Throughout this description, a reference to a "CCU" is understood to describe the logical operation of each half of a DCIU 210, and a reference to "CCUs" is understood to apply equally to an implementation that uses either single CCUs or DCIUs 210. CCUs act as a protocol converter between the CPU bus protocol and the PP-Channel protocol.
The FCU 220 provides a high-bandwidth and low-latency connection among these components via a Data Switch, also referred to herein as a Simultaneous Switched Matrix (SSM), or switched fabric data path. In addition to connecting all of these components, the FCU provides the cache coherency support for the connected BBUs and CCUs via a Transaction Controller and a set of cache-tags duplicating those of the attached CPUs' L2 caches. FIG. 5 is a timing diagram comparing the memory transaction performance of a system based on a flow control unit according to the present invention and a prior art shared-bus system.
In a preferred embodiment, the FCU provides support for two dedicated BBU channels, four dedicated MCU channels, up to eight additional CCU or BBU channels, and PCI peer-to-peer bridging. The FCU contains a Transaction Controller (TC) with reflected L2 states. The TC supports up to 200M cache-coherent transactions/second, MOESI and MESI protocols, and up to 39-bit addressing. The FCU contains the Simultaneous Switch Matrix (SSM) Dataflow Switch, which supports non-blocking data transfers.
In a preferred embodiment, the MCU supports flexible memory configurations, including one or two channels per MCU; up to 4 Gbytes per MCU (maximum of 16 Gbytes per system); one or two memory banks per MCU, with one to four DIMMs per bank, of SDRAM, DDR-SDRAM, or RDRAM; and non-interleaved or interleaved operation.
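The stated limits describe a small configuration space. A sketch of it as a checked record follows (Python; the field names are illustrative assumptions, while the bounds are those recited above):

```python
from dataclasses import dataclass

@dataclass
class McuConfig:
    """One MCU's memory configuration, per the limits recited above."""
    channels: int        # one or two channels per MCU
    banks: int           # one or two memory banks per MCU
    dimms_per_bank: int  # one to four DIMMs per bank
    dram_type: str       # "SDRAM", "DDR-SDRAM", or "RDRAM"
    interleaved: bool    # interleaved or non-interleaved operation
    gbytes: int          # up to 4 Gbytes per MCU

    def validate(self) -> None:
        assert self.channels in (1, 2)
        assert self.banks in (1, 2)
        assert 1 <= self.dimms_per_bank <= 4
        assert self.dram_type in ("SDRAM", "DDR-SDRAM", "RDRAM")
        assert 1 <= self.gbytes <= 4   # 4 MCUs x 4 Gbytes = 16 Gbytes/system max

McuConfig(channels=2, banks=2, dimms_per_bank=4,
          dram_type="DDR-SDRAM", interleaved=True, gbytes=4).validate()
```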
In a preferred embodiment, the BBU supports both 32 and 64 bit PCI bus configurations, including 32 bit/33 MHz, 32 bit/66 MHz, and 64 bit/66 MHz. The BBU is also 5V tolerant and supports AGP.
All connections between components occur as a series of "transactions." A transaction is a Channel Protocol request command and a corresponding Channel Protocol reply. For example, a processor, via a CCU, can perform a Read request that will be forwarded, via the FCU, to the MCU; the MCU will return a Read reply, via the FCU, back to the same processor. A Transaction Protocol Table (TPT) defines the system-wide behavior of every type of transaction, and a Point-to-Point Channel Protocol defines the command format for transactions.
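The request/reply pairing can be sketched as plain data types (Python; the field names are assumptions, and the actual TPT and command formats are defined by the Channel Protocol, not reproduced here):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ChannelCommand:
    """One Channel Protocol command (request or reply)."""
    kind: str       # e.g. "ReadRequest" or "ReadReply"
    source: str     # e.g. "CCU0" or "MCU1"
    target: str
    address: int

@dataclass
class Transaction:
    """A transaction: a request command plus its corresponding reply."""
    request: ChannelCommand
    reply: Optional[ChannelCommand] = None  # filled in when the target replies

# Example from the text: a Read forwarded via the FCU to the MCU and back.
txn = Transaction(ChannelCommand("ReadRequest", "CCU0", "MCU1", 0x4000))
txn.reply = ChannelCommand("ReadReply", "MCU1", "CCU0", 0x4000)
```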
The FCU assumes that initiators have converted addresses from other formats to conform with the PP-Channel definitions. The FCU does do target detection. Specifically, the FCU determines the correspondence between addresses and specific targets via address mapping tables. Note that this mapping hardware (contained in the CFGIF and the TC) maps from Channel Protocol addresses to targets. The mapping generally does not change or permute addresses.
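Target detection via address mapping tables might look like the following sketch (Python; the ranges and target names are illustrative assumptions, not a disclosed memory map):

```python
# The FCU resolves an address to a specific target; the address itself
# passes through unchanged (no permutation). Ranges below are made up.
ADDRESS_MAP = [
    (0x0_0000_0000, 0x1_0000_0000, "MCU0"),  # first 4 GB of DRAM
    (0x1_0000_0000, 0x2_0000_0000, "MCU1"),  # next 4 GB of DRAM
    (0xF_E000_0000, 0xF_F000_0000, "BBU0"),  # memory-mapped I/O window
]

def detect_target(addr: int) -> str:
    for lo, hi, target in ADDRESS_MAP:
        if lo <= addr < hi:
            return target
    raise ValueError(f"no target mapped at {addr:#x}")

assert detect_target(0x1_2345_6780) == "MCU1"
```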
`
Summary of Key Components

Transaction Controller (TC) 400. The most critical coherency principle obeyed by the FCU is the concept of a single system-serialization point. The system-serialization point is the "funnel" through which all transactions must pass. By guaranteeing that all transactions pass through the system-serialization point, a precise order of transactions can be defined. (And this in turn implies a precise order of tag state changes.) In the FCU, the system-serialization point is the Transaction Controller (TC). Coherency state is maintained by the duplicate set of processor L2 cache-tags stored in the TC.
The Transaction Controller (TC) acts as the central system-serialization and cache coherence point, ensuring that all transactions in the system happen in a defined order, obeying defined rules. All requests, cacheable or not, pass through the Transaction Controller. The TC handles the cache coherency protocol using a duplicate set of L2 cache-tags for each CPU. It also controls address mapping inside the FCU, dispatching each transaction request to the appropriate target interface.
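A behavioral sketch of the serialization point follows (Python; a real TC is hardware, and the two-state tag handling below is a deliberately simplified stand-in for the MESI/MOESI handling described in this disclosure):

```python
# Because every request passes through one point, tag-state changes
# acquire a single, precise global order.
class TransactionController:
    def __init__(self, n_cpus: int) -> None:
        # Duplicate L2 cache-tags, one tag store per CPU: address -> state.
        self.dup_tags = [dict() for _ in range(n_cpus)]
        self.serial = 0  # position counter at the system-serialization point

    def serialize(self, cpu: int, op: str, addr: int) -> int:
        self.serial += 1  # this transaction's place in the global order
        if op == "read":
            for other, tags in enumerate(self.dup_tags):
                if other != cpu and tags.get(addr) == "M":
                    tags[addr] = "S"      # modified owner downgrades to shared
            self.dup_tags[cpu][addr] = "S"
        elif op == "write":
            for other, tags in enumerate(self.dup_tags):
                if other != cpu:
                    tags.pop(addr, None)  # invalidate all other copies
            self.dup_tags[cpu][addr] = "M"
        return self.serial
```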
Transaction Bus (TB) 3104 and Transaction Status Bus (TSB) 3106. All request commands flow through the Transaction Bus. The Transaction Bus is designed to provide fair arbitration between all transaction sources (initiators) and the TC; it provides an inbound path to the TC, and distributes outbound status from the TC (via the Transaction Status Bus). The Transaction Bus (TB) is the address/control "highway" in the FCU. It includes an arbiter and the Transaction Bus itself. The TB pipelines the address over two cycles. The extent of pipelining is intended to support operation of the FCU at 200 MHz using contemporary fabrication technology at the time of filing of this disclosure.
Whereas the TB provides inputs to the Transaction Controller, the Transaction Status Bus delivers outputs from the Transaction Controller to each interface and/or target. The TSB outputs provide transaction confirmation, coherency state update information, etc. Note that while many signals on the TSB are common, the TC does drive unique status information (such as cache-state) to each interface. The Transaction Bus and Transaction Status Bus are discussed in detail later in this application.
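Fair arbitration onto the single Transaction Bus could be round-robin; a sketch follows (Python; the actual arbiter design is not detailed at this point in the disclosure):

```python
from typing import List, Optional

class TbArbiter:
    """Round-robin arbiter granting one initiator per TB cycle."""
    def __init__(self, n_initiators: int) -> None:
        self.n = n_initiators
        self.last = n_initiators - 1  # so initiator 0 is checked first

    def grant(self, requests: List[bool]) -> Optional[int]:
        # Start searching just past the last winner, so no initiator starves.
        for i in range(1, self.n + 1):
            cand = (self.last + i) % self.n
            if requests[cand]:
                self.last = cand
                return cand
        return None  # no requests this cycle

arb = TbArbiter(4)
print(arb.grant([True, False, True, False]))  # 0
print(arb.grant([True, False, True, False]))  # 2 (fairness: not 0 again)
```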
Switched Fabric Data Path (Data Switch). The Data Switch is an implementation of a Simultaneous Switched Matrix (SSM) or switched fabric data path architecture. It provides for parallel routing of transaction data between multiple initiators and multiple targets. The Data Switch is designed to let multiple, simultaneous data transfers take place to/from initiators and from/to targets (destinations of transactions). Note that the Data Switch is packet based. Every transfer over the Data Switch starts with a Channel Protocol command (playing the role of a packet header) and is followed by zero or more data cycles (the packet payload). All reply commands (some with data) flow through the Data Switch. Both write requests and read replies will have data cycles. Other replies also use the Data Switch and will only send a command header (no payload).
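The packet framing, a command header followed by zero or more data cycles, can be sketched as a record (Python; the field names are assumptions):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DataSwitchPacket:
    """One Data Switch transfer: a Channel Protocol command acts as the
    packet header; the payload is zero or more data cycles."""
    header: str                 # e.g. "WriteRequest", "ReadReply", "Ack"
    data_cycles: List[bytes] = field(default_factory=list)

# Write requests and read replies carry payload data cycles...
write_req = DataSwitchPacket("WriteRequest", [bytes(8)] * 8)  # a 64-byte line
# ...while other replies are a command header only (no payload).
ack_reply = DataSwitchPacket("Ack")
```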
IIF (Initiator InterFace) 3102. The IIF is the interface between the FCU and an initiator (a BBU or a CCU). The IIF transfers Channel Protocol commands to and from the initiator. The IIF must understand the cache coherency protocol and must be able to track all outstanding transactions. Note that the BBU/CCU can be both an initiator of commands and a target of commands (for CSR read/write if nothing else). Address and control buffering happen in the IIF; bulk data buffering is preferably done in the BBU/CCU (in order to save space in the FCU, which has ten copies of the IIF). The IIF needs configuration for CPU and I/O modes, and to handle differences between multiple types of processors that may be used in different system configurations.
Memory Interface (MIF) 3108. The Memory Interface (MIF) is the portal to the memory system, acting as the interface between the rest of the chipset and the MCU(s). The MIF is the interpreter/filter/parser that receives transaction status from the TB and TC, issues requests to the MCU, receives replies from the MCU, and forwards the replies to the initiator of the transaction via the Data Switch. It is a "slave" device in that it can never be an initiator on the TB. (The MIF is an initiator in another sense, in that it sources data to the Data Switch.) For higher performance, the MIF supports speculative reads. Speculative reads start the read process early using the data from the TB rather than waiting for the data on the TSB. There is one MIF (regardless of how many memory interfaces there are). The MIF contains the memory mapping logic that determines the relationship between addresses and MCUs (and memory ports). The memory mapping logic includes means to configure the MIF for various memory banking/interleaving schemes. The MIF also contains the GART (Graphics Address Remap Table). Addresses that hit in the GART region of memory will be mapped by the GART to the proper physical address.
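The MIF's lookup can be pictured as two stages, a GART remap first and then the address-to-MCU mapping; a sketch follows (Python; the page size, interleave granularity, and GART layout are illustrative assumptions, not disclosed values):

```python
from typing import Tuple

# Hypothetical MIF mapping tables; all values are illustrative.
GART_BASE, GART_SIZE, PAGE = 0xE000_0000, 0x0800_0000, 4096
GART_TABLE = {0: 0x1234_0000}        # GART page index -> physical page base
N_MCUS, INTERLEAVE = 4, 64           # cache-line-grain interleave across MCUs

def mif_route(addr: int) -> Tuple[int, int]:
    # Stage 1: addresses hitting the GART region remap to physical addresses.
    if GART_BASE <= addr < GART_BASE + GART_SIZE:
        page, offset = divmod(addr - GART_BASE, PAGE)
        addr = GART_TABLE[page] + offset
    # Stage 2: banking/interleaving logic selects the MCU for the address.
    return (addr // INTERLEAVE) % N_MCUS, addr

print(mif_route(0xE000_0040))  # GART hit, then routed to an MCU
```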
Configuration Register Interface (CFGIF) 410. This is where all the FCU's Control and Status Registers (CSRs) logically reside. The CFGIF is responsible for the reading and writing of all the CSRs in the FCU, as well as all of the diagnostic reads/writes (e.g., diagnostic accesses to the duplicate tag RAM).
Channel Interface Block (CIB). The CIBs are the transmit and receive interface for the Channel connections to and from the FCU. The FCU has 14 copies of the CIB: 10 for BBUs/CCUs, and 4 for MCUs. (The CIB is generic, but the logic on the core side of the Channel is an IIF or the MIF.)
Embodiments overview. FIG. 3 is a drawing showing internal detail of the switched fabric data path architecture within the FCU of FIG. 2. A first key component of the FCU is the Transaction Controller (TC) 400. A second key component of the FCU is an address and control bus 3100, which is actually an abstraction representing a Transaction Bus (TB) 3104 and Transaction Status Bus (TSB) 3106. A third key component of the FCU is the Data Path Switch (also referred to herein as the Data Switch, or the switched fabric data path). The Data Switch is composed of vertical buses 320, horizontal buses 340, and node switches 380. The node switches selectively couple the vertical and horizontal buses under control of the Data Path Switch Controller 360 and control signals 370. Additional key components of the FCU include one or more Initiator Interfaces (IIFs) 3102; a Memory Interface (MIF) 3108; and Channel Interface Blocks (CIBs) 305 at the periphery of the various interfaces.
A number of alternate embodiments exist. FIG. 4 is a drawing of a variation on the embodiment of FIG. 2, in which each CPU has its own CCU. In this view the channel interface and control that make up the IIFs and CCUs are abstractly represented as being composed of a physical (PHY) link layer and a transport layer. FIG. 6 is another view of the embodiment of FIG. 4. FIG. 7 is a drawing of a number of application-specific variations on the embodiment of FIG. 4. FIG. 7a illustrates a minimal configuration, 7b illustrates a 4-way configuration, 7c illustrates an 8-way high-performance configuration, and 7d illustrates a configuration for I/O intensive applications.
FIG. 8 is a drawing of a CPU having an integral CCU. FIG. 8 makes explicit a "backside" bus interface to an external cache (an L2 cache in the case illustrated). An IIF replaces the conventional CPU interface, such that the Channel is the frontside bus of the CPU of FIG. 8.
The embodiments of FIGS. 9 and 10 are respective variations of the embodiments of FIGS. 6 and 7, adapted for the use of the integrated CPU/CCU of FIG. 8. The embodiments of FIGS. 9 and 10 offer system solutions with lower CPU pin counts, higher throughput, lower latency, hot-pluggable CPUs (if an OS supports it), and reduced PCB board layout complexity compared with non-integrated solutions.
FIG. 11 is a drawing of a 4-way embodiment of the present invention that includes coupling to an industry-standard switching fabric for coupling CPU/Memory complexes with I/O devices.
FIG. 12 is a drawing of an FCU-based architecture according to a first embodiment.
FIG. 13 is a drawing of an FCU-based architecture according to a second embodiment.
FIG. 14 defines the cache line characteristics of the systems of FIGS. 12 and 13.
`
Additional Descriptions

U.S. application Ser. No. 08/986,430, AN APPARATUS AND METHOD FOR A CACHE COHERENT SHARED MEMORY MULTIPROCESSING SYSTEM, filed Dec. 7, 1997, incorporated by reference above, provides additional detail of the overall operation of the systems of FIGS. 2 and 3. U.S. application Ser. No. 09/163,294, METHOD AND APPARATUS FOR ADDRESS TRANSFERS, SYSTEM SERIALIZATION, AND CENTRALIZED CACHE AND TRANSACTION CONTROL, IN A SYMMETRIC MULTIPROCESSOR SYSTEM, filed Sep. 29, 1998, provides additional detail of particular transaction address bus embodiments, and was incorporated by reference previously herein. U.S. application Ser. No. 09/168,311, METHOD AND APPARATUS FOR EXTRACTING RECEIVED DIGITAL DATA FROM A FULL-DUPLEX POINT-TO-POINT SIGNALING CHANNEL USING SAMPLED DATA TECHNIQUES, filed Oct. 7, 1998, provides additional detail of particular transceiver embodiments, and was incorporated by reference previously herein. U.S. application Ser. No. 09/281,749, CHANNEL INTERFACE AND PROTOCOLS FOR CACHE COHERENCY IN A SCALABLE SYMMETRIC MULTIPROCESSOR SYSTEM, filed Mar. 30, 1999, provides additional detail of the channel interface blocks and the transport protocol, and was incorporated by reference previously herein. To the extent to which any discrepancies exist between the description in the above referenced applications and the instant application, the instant appli-