Wang et al.

US006516442B1

(10) Patent No.:     US 6,516,442 B1
(45) Date of Patent: Feb. 4, 2003
`
`(54) CHANNEL INTERFACE AND PROTOCOLS
`FOR CACHE COHERENCY IN A SCALABLE
`SYMMETRIC MULTIPROCESSOR SYSTEM
`
(75) Inventors: Yuanlong Wang, San Jose, CA (US);
     Brian R. Biard, Alameda County, CA (US);
     Daniel Fu, Sunnyvale, CA (US);
     Earl T. Cohen, Fremont, CA (US);
     Carl G. Amdahl, Alameda County, CA (US)
`
`(73) Assignee: Conexant Systems, Inc., Newport
`Beach, CA (US)
`
(*) Notice: Subject to any disclaimer, the term of this
    patent is extended or adjusted under 35
    U.S.C. 154(b) by 0 days.
`
`(21) Appl. No.: 09/281,749
`
(22) Filed: Mar. 30, 1999
`
Related U.S. Application Data

(63) Continuation-in-part of application No. 09/163,294, filed on
     Sep. 29, 1998, now Pat. No. 6,292,705, and a continuation-in-part
     of application No. 08/986,430, filed on Dec. 7, 1997,
     now Pat. No. 6,065,077.

(51) Int. Cl.7 .......................... H03M 13/00; G06C 13/00
(52) U.S. Cl. ............................... 714/776; 711/146
(58) Field of Search .................. 714/748, 18, 751, 714/20,
     4, 776; 370/244, 235; 711/141, 146

(56) References Cited
`
`U.S. PATENT DOCUMENTS
`
4,315,308 A      2/1982   Jackson
4,438,494 A      3/1984   Budde et al.
4,480,307 A     10/1984   Budde et al.
5,161,156 A  *  11/1992   Baum et al. .................... 714/4
5,271,000 A  *  12/1993   Engbersen et al. ............ 370/244
5,313,609 A      5/1994   Baylor et al.
5,335,335 A      8/1994   Jackson et al.
5,440,698 A      8/1995   Sindhu et al.
5,505,686 A      4/1996   Willis et al.
5,511,226 A      4/1996   Zilka
5,513,335 A      4/1996   McClure
5,524,234 A      6/1996   Martinez, Jr. et al.
`
`(List continued on next page.)
`
`OTHER PUBLICATIONS
`
Technical White Paper, Sun™ Enterprise™ 10000
Server, Sun Microsystems, Sep. 1998.
`
`(List continued on next page.)
`
`Primary Examiner-Albert Decady
`Assistant Examiner-Cynthia Harris
`(74) Attorney, Agent, or Firm-Keith Kind; Kelly H. Hale
`
(57) ABSTRACT
`
A preferred embodiment of a symmetric multiprocessor
system includes a switched fabric (switch matrix) for data
transfers that provides multiple concurrent buses that enable
greatly increased bandwidth between processors and shared
memory. A high-speed point-to-point Channel couples command
initiators and memory with the switch matrix and with
I/O subsystems. Each end of a channel is connected to a
Channel Interface Block (CIB). The CIB presents a logical
interface to the Channel, providing a communication path to
and from a CIB in another IC. CIB logic presents a similar
interface between the CIB and the core-logic and between
the CIB and the Channel transceivers. A channel transport
protocol is implemented in the CIB to reliably transfer
data from one chip to another in the face of errors and
limited buffering.
`
`44 Claims, 12 Drawing Sheets
`
[Cover figure: system block diagram 200 showing an AGP interface, Channel Interface Blocks 305, memory control channel 230, and SDRAM banks 1300-1303]
`
`NETAPP, INC. EXHIBIT 1017
`Page 1 of 35
`
`
`
US 6,516,442 B1
`Page 2
`
`U.S. PATENT DOCUMENTS
`
5,526,380 A      6/1996   Izzard
5,535,363 A      7/1996   Prince
5,537,569 A      7/1996   Masubuchi
5,537,575 A      7/1996   Foley
5,553,310 A      9/1996   Taylor et al.
5,561,779 A     10/1996   Jackson
5,568,620 A     10/1996   Sarangdhar et al.
5,574,868 A     11/1996   Marisetty
5,577,204 A     11/1996   Brewer et al.
5,581,729 A     12/1996   Nishtala et al.
5,588,131 A     12/1996   Borrill
5,594,886 A      1/1997   Smith et al.
5,602,814 A  *   2/1997   Jaquette et al. ............ 369/47.53
5,606,686 A      2/1997   Tarui et al.
5,634,043 A      5/1997   Self et al.
5,634,068 A      5/1997   Nishtala et al.
5,644,754 A      7/1997   Weber
5,655,100 A      8/1997   Ebrahim et al.
5,657,472 A      8/1997   Van Loo et al.
5,682,516 A     10/1997   Sarangdhar et al.
5,684,977 A     11/1997   Van Loo et al.
5,696,910 A     12/1997   Pawlowski
5,796,605 A      8/1998   Hagersten
5,829,034 A     10/1998   Hagersten et al.
5,895,495 A      4/1999   Arimilli et al.
5,897,656 A      4/1999   Vogt et al.
5,940,856 A      8/1999   Arimilli et al.
5,946,709 A      8/1999   Arimilli et al.
5,978,411 A     11/1999   Kitade et al.
6,044,122 A      3/2000   Ellersick et al.
6,065,077 A      5/2000   Fu
6,125,429 A      9/2000   Goodwin et al.
6,145,007 A  *  11/2000   Dokic et al. ................. 709/230
6,279,084 B1 *   8/2001   VanDoren et al. .............. 711/141
6,289,420 B1     9/2001   Cypher
6,292,705 B1     9/2001   Wang et al.
`
OTHER PUBLICATIONS
`
Alan Charlesworth, Starfire: Extending the SMP Envelope,
IEEE Micro, Jan./Feb. 1998, pp. 39-49.
Joseph Heinrich, Origin™ and Onyx2™ Theory of
Operations Manual, Document No. 007-3439-002, Silicon
Graphics, Inc., 1997.
White Paper, Sequent's NUMA-Q SMP Architecture,
Sequent, 1997.
White Paper, Eight-way Multiprocessing, Hewlett-Packard,
Nov. 1997.
George White & Pete Vogt, Profusion, a Buffered, Cache-Coherent
Crossbar Switch, presented at Hot Interconnects
Symposium V, Aug. 1997.
Alan Charlesworth, et al., Gigaplane-XB: Extending the
Ultra Enterprise Family, presented at Hot Interconnects
Symposium V, Aug. 1997.
James Laudon & Daniel Lenoski, The SGI Origin: A
ccNUMA Highly Scalable Server, Silicon Graphics, Inc.,
presented at the Proc. of the 24th Int'l Symp. Computer
Architecture, Jun. 1997.
Mike Galles, Spider: A High-Speed Network Interconnect,
IEEE Micro, Jan./Feb. 1997, pp. 34-39.
T. D. Lovett, R. M. Clapp and R. J. Safranek, NUMA-Q: an
SCI-based Enterprise Server, Sequent, 1996.
Daniel E. Lenoski & Wolf-Dietrich Weber, Scalable
Shared-Memory Multiprocessing, Morgan Kaufmann Publishers,
1995, pp. 143-159.
David B. Gustavson, The Scalable Coherent Interface and
Related Standards Projects (as reprinted in Advanced Multimicroprocessor
Bus Architectures, Janusz Zalewski, IEEE
Computer Society Press, 1995, pp. 195-207).
Kevin Normoyle, et al., UltraSPARC™ Port Architecture,
Sun Microsystems, Inc., presented at Hot Interconnects III,
Aug. 1995.
Kevin Normoyle, et al., UltraSPARC™ Port Architecture,
Sun Microsystems, Inc., presented at Hot Interconnects III,
Aug. 1995, UltraSparc Interfaces.
Kai Hwang, Advanced Computer Architecture: Parallelism,
Scalability, Programmability, McGraw-Hill, 1993, pp.
355-357.
Jim Handy, The Cache Memory Book, Academic Press,
1993, pp. 161-169.
Angel L. Decegama, Parallel Processing Architectures and
VLSI Hardware, Vol. 1, Prentice-Hall, 1989, pp. 341-344.
`
`* cited by examiner
`
`
`
`
U.S. Patent     Feb. 4, 2003     Sheet 1 of 12     US 6,516,442 B1

[FIG. 1: prior-art generic symmetric shared-memory multiprocessor system 100 using a shared system bus 110, with CPUs 120 (each with cache and cache controller) and memory 130]
`
`
`
`
U.S. Patent     Feb. 4, 2003     Sheet 2 of 12     US 6,516,442 B1

[FIG. 2: preferred-embodiment 8P system 200 centered on the Flow Control Unit (FCU) 220, with CPUs 120 paired through Dual CPU Interface Units (DCIUs) 210, point-to-point interconnect 111-115, Memory Control Units (MCUs) 230 with memory 1300-1303, and Bus Bridge Units (BBUs) 240 bridging to PCI and AGP]
`
`
`
U.S. Patent     Feb. 4, 2003     Sheet 3 of 12     US 6,516,442 B1

[FIG. 3: internal detail of the FCU 220 switched fabric data path: Initiator Interfaces (IIFs) 3102, Memory Interface/GART 3108, Channel Interface Blocks (CIBs) 305, dual CPU/cache interfaces 210 to CPUs 120 (CPU0-CPU7), memory control channels 230 with SDRAM 1300-1303, caching-device and I/O CIBs, node switches 380, and bus interface controls 240 to PCI 151/152 and AGP 153/154]
`
`
`
U.S. Patent     Feb. 4, 2003     Sheet 4 of 12     US 6,516,442 B1

[FIG. 4: variation of the embodiment of FIG. 2 in which each CPU has its own CCU; channel interfaces are shown as PHY link and transport layers connecting CPU cores, memory, and I/O control around a cache-coherent flow control unit and a non-blocking data switch]
`
`
`
U.S. Patent     Feb. 4, 2003     Sheet 5 of 12     US 6,516,442 B1

[FIG. 5: memory-transaction timing comparison. (A) FCU-based 4-way SMP: 64-byte cache line, up to 4 simultaneous transactions, 16 bytes transferred per cycle per way; peak memory read transfer rate 1.6 GB/s per way, 6.4 GB/s 4-way. (B) Prior-art shared-bus-based 4-way SMP: 32-byte cache line, 8 bytes transferred per cycle on the shared bus (address and data phases sequential, MCU1-MCU4 sharing the bus); peak transfer rate 800 MB/s]
`
`
`
U.S. Patent     Feb. 4, 2003     Sheet 6 of 12     US 6,516,442 B1

[FIG. 6: another view of the embodiment of FIG. 4, with CPUs coupled through CCUs and BBUs to the FCU]
`
`
`
U.S. Patent     Feb. 4, 2003     Sheet 7 of 12     US 6,516,442 B1

[FIG. 7: system embodiments: (a) minimal configuration, (b) 4-way configuration, (c) 8-way high-performance configuration, (d) configuration for I/O-intensive applications]
`
`
`
U.S. Patent     Feb. 4, 2003     Sheet 8 of 12     US 6,516,442 B1

[FIG. 8: CPU with an integral CCU, showing a "backside" bus interface to an external L2 cache; an IIF replaces the conventional CPU interface, making the Channel the CPU's frontside bus]
`
`
`
U.S. Patent     Feb. 4, 2003     Sheet 9 of 12     US 6,516,442 B1

[FIG. 9: variation of the embodiment of FIG. 6 using the integrated CPU/CCU of FIG. 8]
`
`
`
U.S. Patent     Feb. 4, 2003     Sheet 10 of 12     US 6,516,442 B1

[FIG. 10: variations of the embodiments of FIG. 7 using the integrated CPU/CCU of FIG. 8]
`
`
`
U.S. Patent     Feb. 4, 2003     Sheet 11 of 12     US 6,516,442 B1

[FIG. 11: 4-way embodiment including coupling to an industry standard switching fabric for coupling CPU/Memory complexes with I/O devices]
`
`
`
U.S. Patent     Feb. 4, 2003     Sheet 12 of 12     US 6,516,442 B1

[FIG. 12: 16-way embodiment in which multiple 4-way shared-bus systems are coupled via CCUs to the FCU, with coupling to two instances of an industry standard switching fabric for CPU/Memory complexes and I/O devices]
`
`
`
CHANNEL INTERFACE AND PROTOCOLS
FOR CACHE COHERENCY IN A SCALABLE
SYMMETRIC MULTIPROCESSOR SYSTEM

CROSS-REFERENCE TO RELATED APPLICATIONS

This patent application is a continuation-in-part of the
following commonly-owned U.S. patent application Ser. Nos.:

U.S. application Ser. No. 08/986,430, filed Dec. 7, 1997,
now U.S. Pat. No. 6,065,077; and

U.S. application Ser. No. 09/163,294, filed Sep. 29, 1998,
now U.S. Pat. No. 6,292,705;

all of which are incorporated by reference herein.

BACKGROUND

The system of FIG. 1 is a prototypical prior art symmetric
multiprocessor (SMP) system 100. This traditional approach
provides uniform access to memory 130 over a shared
system bus 110. Each processor 120 has an associated cache
and cache controller. The caches are individually managed
according to a common cache coherency protocol to insure
that all software is well behaved. The caches continually
monitor (snoop) the shared system bus 110, watching for
cache updates and other system transactions. Transactions
are often decomposed into different component stages,
controlled by different system bus signals, such that different
stages of multiple transactions may be overlapped in time to
permit greater throughput. Nevertheless, for each stage,
subsequent transactions make sequential use of the shared
bus. The serial availability of the bus insures that transactions
are performed in a well-defined order. Without strong
transaction ordering, cache coherency protocols fail and
system and application software will not be well behaved.

A first problem with the above-described traditional SMP
system is that the serial availability of the bus limits the
scalability of the SMP system. As more processors are
added, eventually system performance is limited by the
saturation of the shared system bus.

What is needed is an SMP system architecture that
provides greater scalability by permitting concurrent use of
multiple buses, while still providing a system serialization
point to maintain strong transaction ordering and cache
coherency. What is also needed is an SMP architecture that
further provides increased transaction throughput.

SUMMARY

A preferred embodiment of a symmetric multiprocessor
system includes a switched fabric (switch matrix) for data
transfers that provides multiple concurrent buses that enable
greatly increased bandwidth between processors and shared
memory. A high-speed point-to-point Channel couples command
initiators and memory with the switch matrix and with
I/O subsystems. Each end of a channel is connected to a
Channel Interface Block (CIB). The CIB presents a logical
interface to the Channel, providing a communication path to
and from a CIB in another IC. CIB logic presents a similar
interface between the CIB and the core-logic and between
the CIB and the Channel transceivers. A channel transport
protocol is implemented in the CIB to reliably transfer
data from one chip to another in the face of errors and
limited buffering.

BRIEF DESCRIPTION OF DRAWINGS

FIG. 1 is a drawing of a prior-art generic symmetric
shared-memory multiprocessor system using a shared bus.

FIG. 2 is a drawing of a preferred embodiment symmetric
shared-memory multiprocessor system using a switched
fabric data path architecture centered on a Flow-Control
Unit (FCU).

FIG. 3 is a drawing of the switched fabric data path
architecture of FIG. 2, further showing internal detail of an
FCU having a Transaction Controller (TC), Transaction Bus
(TB), and Transaction Status Bus (TSB) according to the
present invention.

FIG. 4 is a drawing of a variation on the embodiment of
FIG. 2, in which each CPU has its own CCU, and in which
the channel interface and control is abstractly represented as
being composed of a physical (PHY) link layer and a
transport layer.

FIG. 5 is a timing diagram comparing the memory
transaction performance of a system based on a flow control
unit according to the present invention and a prior art
shared-bus system.

FIG. 6 is another view of the embodiment of FIG. 4.

FIG. 7 is a drawing of a number of system embodiments
according to the present invention. FIG. 7a illustrates a
minimal configuration, 7b illustrates a 4-way configuration,
7c illustrates an 8-way high-performance configuration, and
7d illustrates a configuration for I/O intensive applications.

FIG. 8 is a drawing of a CPU having an integral CCU.

FIG. 9 illustrates a variation of the embodiment of FIG. 6
using the integrated CPU/CCU of FIG. 8.

FIG. 10 illustrates variations of the embodiments of
FIG. 7 using the integrated CPU/CCU of FIG. 8.

FIG. 11 is a drawing of a 4-way embodiment of the
present invention that includes coupling to an industry
standard switching fabric for coupling CPU/Memory
complexes with I/O devices.

FIG. 12 is a drawing of a 16-way embodiment of the
present invention, in which multiple 4-way shared-bus
systems are coupled via CCUs to the FCU, and which includes
coupling to two instances of an industry standard switching
fabric for coupling CPU/Memory complexes with I/O
devices.

DETAILED DESCRIPTION

System Overview

FIG. 2 is a drawing of a preferred embodiment symmetric
shared-memory multiprocessor system using a switched
fabric data path architecture centered on a Flow-Control
Unit (FCU) 220. In the illustrated embodiment, eight
processors 120 are used and the configuration is referred to
herein as an "8P" system.

The FCU (Flow Control Unit) 220 chip is the central core
of the 8P system. The FCU internally implements a
switched-fabric data path architecture. Point-to-Point (PP)
interconnect 112, 113, and 114 and an associated protocol
define dedicated communication channels for all FCU I/O.
The terms Channels and PP-Channel are references to the
FCU's PP I/O. The FCU provides Point-to-Point Channel
interfaces to up to ten Bus Bridge Units (BBUs) 240 and/or
CPU Channel Units (CCUs, also known as Channel Interface
Units or CIUs) and one to four Memory Control Units
(MCUs) 230. Two of the ten Channels are fixed to connect
to BBUs. The other eight Channels can connect to either
BBUs or CCUs. In an illustrative embodiment the number of
CCUs is eight. In one embodiment the CCUs are packaged
as a pair referred to herein as a Dual CPU Interface Unit
(DCIU) 210. In the 8P system shown, the Dual CPU
Interface Unit (DCIU) 210 interfaces two CPUs with the
FCU. Throughout this description, a reference to a "CCU"
`
`
`
`
is understood to describe the logical operation of each half
of a DCIU 210, and references to "CCUs" are understood to
apply equally to an implementation that uses either single
CCUs or DCIUs 210. CCUs act as a protocol converter
between the CPU bus protocol and the PP-Channel protocol.

The FCU 220 provides a high-bandwidth and low-latency
connection among these components via a Data Switch, also
referred to herein as a Simultaneous Switched Matrix (SSM),
or switched fabric data path. In addition to connecting all of
these components, the FCU provides the cache coherency
support for the connected BBUs and CCUs via a Transaction
Controller and a set of cache-tags duplicating those of the
attached CPUs' L2 caches. FIG. 5 is a timing diagram
comparing the memory transaction performance of a system
based on a flow control unit according to the present
invention and a prior art shared-bus system.
In a preferred embodiment, the FCU provides support for two
dedicated BBU channels, four dedicated MCU channels, up
to eight additional CCU or BBU channels, and PCI peer-to-peer
bridging. The FCU contains a Transaction Controller
(TC) with reflected L2 states. The TC supports up to 200M
cache-coherent transactions/second, MOESI and MESI
protocols, and up to 39-bit addressing. The FCU contains the
Simultaneous Switch Matrix (SSM) Dataflow Switch, which
supports non-blocking data transfers.

In a preferred embodiment, the MCU supports flexible
memory configurations, including one or two channels per
MCU, up to 4 Gbytes per MCU (maximum of 16 Gbytes per
system), with one or two memory banks per MCU, with one
to four DIMMs per bank, of SDRAM, DDR-SDRAM, or
RDRAM, and with non-interleaved or interleaved operation.

In a preferred embodiment, the BBU supports both 32 and
64 bit PCI bus configurations, including 32 bit/33 MHz, 32
bit/66 MHz, and 64 bit/66 MHz. The BBU is also 5 V
tolerant and supports AGP.
All connections between components occur as a series of
"transactions." A transaction is a Channel Protocol request
command and a corresponding Channel Protocol reply. For
example, a processor, via a CCU, can perform a Read
request that will be forwarded, via the FCU, to the MCU; the
MCU will return a Read reply, via the FCU, back to the same
processor. A Transaction Protocol Table (TPT) defines the
system-wide behavior of every type of transaction, and a
Point-to-Point Channel Protocol defines the command format
for transactions.
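The request/reply pattern above can be sketched in a few lines. This is an illustrative toy, not the Channel Protocol itself: the class and field names are invented, and the FCU here does nothing but forward.

```python
# Hypothetical sketch of the transaction pattern: a Channel Protocol
# request plus its corresponding reply, routed through the FCU.
from dataclasses import dataclass

@dataclass
class Command:
    kind: str        # e.g. "ReadRequest" or "ReadReply"
    initiator: str   # who issued the original request
    address: int
    data: bytes = b""

class MCU:
    """Toy memory controller: services a request, builds the reply."""
    def __init__(self, memory: dict):
        self.memory = memory
    def handle(self, req: Command) -> Command:
        return Command("ReadReply", req.initiator, req.address,
                       self.memory.get(req.address, b"\x00"))

class FCU:
    """Central switch: forwards requests to targets, replies to initiators."""
    def __init__(self, targets: dict):
        self.targets = targets
    def transact(self, req: Command, target: str) -> Command:
        reply = self.targets[target].handle(req)  # forward request to target
        return reply                              # route reply to initiator

fcu = FCU({"MCU0": MCU({0x1000: b"\x2a"})})
reply = fcu.transact(Command("ReadRequest", "CCU3", 0x1000), "MCU0")
```

Note that the reply carries the original initiator, so the switch can route it back to the same processor, exactly as the Read example in the text describes.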
The FCU assumes that initiators have converted addresses
from other formats to conform with the PP-Channel
definitions. The FCU does, however, perform target detection.
Specifically, the FCU determines the correspondence between
addresses and specific targets via address mapping tables.
Note that this mapping hardware (contained in the CFGIF
and the TC) maps from Channel Protocol addresses to
targets. The mapping generally does not change or permute
addresses.
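Target detection of this kind is naturally expressed as a range-based lookup table. The sketch below is illustrative only: the address ranges and target names are invented, and the key property shown is the one the text states, namely that the mapping selects a target without altering the address.

```python
# Illustrative address-to-target mapping table (ranges and names invented).
ADDRESS_MAP = [
    # (base, limit, target interface)
    (0x0000_0000, 0x3FFF_FFFF, "MCU0"),
    (0x4000_0000, 0x7FFF_FFFF, "MCU1"),
    (0xF000_0000, 0xFFFF_FFFF, "BBU0"),  # e.g. memory-mapped I/O
]

def target_for(addr: int) -> str:
    """Resolve a Channel Protocol address to its target interface."""
    for base, limit, target in ADDRESS_MAP:
        if base <= addr <= limit:
            return target
    raise ValueError(f"unmapped address {addr:#x}")

# The lookup picks a target; the address forwarded on is unchanged.
assert target_for(0x1234_0000) == "MCU0"
assert target_for(0xF000_0010) == "BBU0"
```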
Summary of Key Components

Transaction Controller (TC) 400. The most critical coherency
principle obeyed by the FCU is the concept of a single
system-serialization point. The system-serialization point is
the "funnel" through which all transactions must pass. By
guaranteeing that all transactions pass through the
system-serialization point, a precise order of transactions can be
defined. (And this in turn implies a precise order of tag state
changes.) In the FCU, the system-serialization point is the
Transaction Controller (TC). Coherency state is maintained
by the duplicate set of processor L2 cache-tags stored in the
TC.

The Transaction Controller (TC) acts as the central
system-serialization and cache coherence point, ensuring that all
transactions in the system happen in a defined order, obeying
defined rules. All requests, cacheable or not, pass through
the Transaction Controller. The TC handles the cache coherency
protocol using a duplicate set of L2 cache-tags for each
CPU. It also controls address mapping inside the FCU,
dispatching each transaction request to the appropriate target
interface.
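The "funnel" idea above can be sketched abstractly: routing all requests through one queue imposes a single global order on them. This toy is purely illustrative (the real TC is hardware pipelined, not a software queue):

```python
# Minimal sketch of a single system-serialization point: every request,
# cacheable or not, passes through one funnel, which fixes a global
# transaction order (and hence a precise order of tag-state changes).
from collections import deque

class TransactionController:
    def __init__(self):
        self.inbound = deque()  # requests arriving from the Transaction Bus
        self.order = []         # the single, system-wide transaction order

    def submit(self, initiator: str, txn: str) -> None:
        self.inbound.append((initiator, txn))

    def serialize(self) -> None:
        # Draining the one queue assigns each transaction a unique
        # position in the global order.
        while self.inbound:
            self.order.append(self.inbound.popleft())

tc = TransactionController()
tc.submit("CCU0", "ReadLine 0x100")
tc.submit("BBU1", "WriteLine 0x100")
tc.serialize()
# Every observer now agrees CCU0's read precedes BBU1's write.
```

Because both requests touch the same line, agreeing on which came first is exactly what keeps the duplicated cache-tags (and the caches they shadow) consistent.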
Transaction Bus (TB) 3104 and Transaction Status Bus
(TSB) 3106. All request commands flow through the Transaction
Bus. The Transaction Bus is designed to provide fair
arbitration between all transaction sources (initiators) and
the TC; it provides an inbound path to the TC, and
distributes outbound status from the TC (via the Transaction
Status Bus).

The Transaction Bus (TB) is the address/control "highway"
in the FCU. It includes an arbiter and the Transaction
Bus itself. The TB pipelines the address over two cycles. The
extent of pipelining is intended to support operation of the
FCU at 200 MHz using contemporary fabrication technology
at the time of filing of this disclosure.

Whereas the TB provides inputs to the Transaction
Controller, the Transaction Status Bus delivers outputs from
the Transaction Controller to each interface and/or target.
The TSB outputs provide transaction confirmation, coherency
state update information, etc. Note that while many
signals on the TSB are common, the TC does drive unique
status information (such as cache-state) to each interface.
The Transaction Bus and Transaction Status Bus are
discussed in detail later in this application.
Switched Fabric Data Path (Data Switch). The Data
Switch is an implementation of a Simultaneous Switched
Matrix (SSM) or switched fabric data path architecture. It
provides for parallel routing of transaction data between
multiple initiators and multiple targets. The Data Switch is
designed to let multiple, simultaneous data transfers take
place to/from initiators and from/to targets (destinations of
transactions). Note that the Data Switch is packet based.
Every transfer over the Data Switch starts with a Channel
Protocol command (playing the role of a packet header) and
is followed by zero or more data cycles (the packet payload).
All reply commands (some with data) flow through the Data
Switch. Both write requests and read replies will have data
cycles. Other replies also use the Data Switch and will only
send a command header (no payload).
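The header-plus-data-cycles framing can be sketched directly. The 16-byte cycle width below is an assumption chosen to match the per-way width in FIG. 5, not a stated property of the Data Switch:

```python
# Sketch of packet-based framing: a command header followed by zero or
# more data cycles. The 16-byte cycle width is an assumption.
def frame(command: str, payload: bytes = b"", cycle_bytes: int = 16) -> list:
    """Split one Data Switch transfer into per-cycle units."""
    cycles = [("header", command)]  # every transfer starts with a command
    for i in range(0, len(payload), cycle_bytes):
        cycles.append(("data", payload[i:i + cycle_bytes]))
    return cycles

# A read reply carries data cycles; a status-only reply is header-only.
read_reply = frame("ReadReply", b"\xaa" * 64)  # header + 4 data cycles
ack_reply = frame("WriteAck")                  # header only, no payload
```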
IIF (Initiator InterFace) 3102. The IIF is the interface
between the FCU and an initiator (a BBU or a CCU). The
IIF transfers Channel Protocol commands to and from the
initiator. The IIF must understand the cache coherency
protocol and must be able to track all outstanding
transactions. Note that the BBU/CCU can be both an initiator of
commands and a target of commands (for CSR read/write if
nothing else). Address and control buffering happen in the
IIF; bulk data buffering is preferably done in the BBU/CCU
(in order to save space in the FCU, which has ten copies of
the IIF). The IIF needs configuration for CPU and I/O
modes, and to handle differences between multiple types of
processors that may be used in different system configurations.

Memory Interface (MIF) 3108. The Memory Interface
(MIF) is the portal to the memory system, acting as the
interface between the rest of the chipset and the MCU(s).
The MIF is the interpreter/filter/parser that receives
transaction status from the TB and TC, issues requests to the
MCU, receives replies from the MCU, and forwards the
replies to the initiator of the transaction via the Data Switch.
It is a "slave" device in that it can never be an initiator on
the TB. (The MIF is an initiator in another sense, in that it
`
`
`
`
sources data to the Data Switch.) For higher performance,
the MIF supports speculative reads. Speculative reads start
the read process early using the data from the TB rather than
waiting for the data on the TSB. There is one MIF
(regardless of how many memory interfaces there are). The
MIF contains the memory mapping logic that determines the
relationship between addresses and MCUs (and memory
ports). The memory mapping logic includes means to
configure the MIF for various memory banking/interleaving
schemes. The MIF also contains the GART (Graphics
Address Remap Table). Addresses that hit in the GART
region of memory will be mapped by the GART to the
proper physical address.
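The speculative-read idea is that the memory access is launched from the early TB information, and the later TSB status either confirms the result or discards it. The sketch below is a hedged illustration of that control flow only; the actual MIF cancellation mechanism is not described at this level of detail in the text.

```python
# Hedged sketch of a speculative read: issue the access from the TB
# request immediately; the later TSB status confirms or discards it.
def mif_speculative_read(addr: int, memory: dict, tsb_confirms: bool):
    data = memory.get(addr)  # speculative: started from TB, before TSB status
    if not tsb_confirms:
        return None          # TSB denied the transaction: drop the result
    return data              # TSB confirmed: forward via the Data Switch

mem = {0x200: b"\x55"}
# Confirmed read returns data earlier than a TSB-gated read would.
assert mif_speculative_read(0x200, mem, tsb_confirms=True) == b"\x55"
# A denied transaction wastes the early access but stays invisible.
assert mif_speculative_read(0x200, mem, tsb_confirms=False) is None
```

The win is latency: the DRAM access overlaps the TC's decision instead of serializing after it, at the cost of occasionally discarded work.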
Configuration Register Interface (CFGIF) 410. This is
where all the FCU's Control and Status Registers (CSRs)
logically reside. CFGIF is responsible for the reading and
writing of all the CSRs in the FCU, as well as all of the
diagnostic reads/writes (e.g., diagnostic accesses to the
duplicate tag RAM).

Channel Interface Block (CIB). The CIBs are the transmit
and receive interface for the Channel connections to and
from the FCU. The FCU has 14 copies of the CIB, 10 for
BBUs/CCUs, and 4 for MCUs. (The CIB is generic, but the
logic on the core-side of the Channel is an IIF or the MIF.)
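The abstract states that the CIB implements a channel transport protocol that transfers data reliably despite errors and limited buffering. A classic way to meet both constraints is sequence-numbered frames with a checksum, acknowledgment, and retransmission; the toy stop-and-wait sketch below is an assumption about the general flavor of such a mechanism, not the patented protocol.

```python
# Toy reliable-transport sketch (assumed mechanism, not the CIB design):
# sequence-numbered frames with a CRC; corrupted or out-of-order frames
# are dropped so the sender's retransmission can repair the stream.
import zlib

def send_frame(seq: int, payload: bytes) -> bytes:
    body = seq.to_bytes(1, "big") + payload
    crc = zlib.crc32(body).to_bytes(4, "big")
    return body + crc

def recv_frame(frame: bytes, expected_seq: int):
    body, crc = frame[:-4], frame[-4:]
    if zlib.crc32(body).to_bytes(4, "big") != crc:
        return None, expected_seq    # corrupted: no ack, sender will retry
    seq, payload = body[0], body[1:]
    if seq != expected_seq:
        return None, expected_seq    # duplicate or out of order: drop
    return payload, expected_seq + 1  # accept and advance the window

good = send_frame(0, b"cache line")
payload, nxt = recv_frame(good, 0)
corrupted = good[:-1] + bytes([good[-1] ^ 0xFF])  # flip one CRC bit
dropped, _ = recv_frame(corrupted, nxt)
```

Bounded sequence numbers double as a flow-control window, which is how a scheme like this also copes with limited receive buffering.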
Embodiments Overview. FIG. 3 is a drawing showing
internal detail of the switched fabric data path architecture
within the FCU of FIG. 2. A first key component of the FCU
is the Transaction Controller (TC) 400. A second key
component of the FCU is an address and control bus 3100 that
is actually an abstraction representing a Transaction Bus
(TB) 3104 and Transaction Status Bus (TSB) 3106. A third
key component of the FCU is the Data Path Switch (also
referred to herein as the Data Switch, or the switched fabric
data path). The Data Switch is composed of vertical buses
320, horizontal buses 340, and node switches 380. The node
switches selectively couple the vertical and horizontal buses
under control of the Data Path Switch Controller 360 and
control signals 370. Additional key components of the FCU
include one or more Initiator Interfaces (IIFs) 3102; a
Memory Interface (MIF) 3108; and Channel Interface
Blocks (CIBs) 305 at the periphery of the various interfaces.
A number of alternate embodiments exist. FIG. 4 is a
drawing of a variation on the embodiment of FIG. 2, in
which each CPU has its own CCU. In this view the channel
interface and control that make up the IIFs and CCUs are
abstractly represented as being composed of a physical
(PHY) link layer and a transport layer. FIG. 6 is another
view of the embodiment of FIG. 4. FIG. 7 is a drawing of
a number of application-specific variations on the
embodiment of FIG. 4. FIG. 7a illustrates a minimal configuration,
7b illustrates a 4-way configuration, 7c illustrates an 8-way
high-performance configuration, and 7d illustrates a
configuration for I/O intensive applications.

FIG. 8 is a drawing of a CPU having an integral CCU.
FIG. 8 makes explicit a "backside" bus interface to an
external cache (an L2 cache in the case illustrated). An IIF
replaces the conventional CPU interface, such that the
Channel is the frontside bus of the CPU of FIG. 8. The
embodiments of FIGS. 9 and 10 are respective variations of
the embodiments of FIGS. 6 and 7, with adaptation for the
use of the integrated CPU/CCU of FIG. 8. The embodiments
of FIGS. 9 and 10 offer system solutions with lower CPU pin
counts, higher throughput, lower latency, hot-pluggable CPUs
(if an OS supports it), and reduced PCB board layout
complexity compared with non-integrated solutions.
FIG. 11 is a drawing of a 4-way embodiment of the
present invention that includes coupling to an industry
standard switching fabric for coupling CPU/Memory
complexes with I/O devices. FIG. 12 is a drawing of a 16-way
embodiment of the present invention, in which multiple
4-way shared-bus systems are coupled via CCUs to the
FCU, and which includes coupling to two instances of an
industry standard switching fabric for coupling CPU/
Memory complexes with I/O devices.
Additional Descriptions

U.S. application Ser. No. 08/986,430, AN APPARATUS
AND METHOD FOR A CACHE COHERENT SHARED
MEMORY MULTIPROCESSING SYSTEM, filed Dec. 7,
1997, incorporated by reference above, provides additional
detail of the overall operation of the systems of FIGS. 2 and
3. U.S. application Ser. No. 09/163,294, METHOD AND
APPARATUS FOR ADDRESS TRANSFERS, SYSTEM
SERIALIZATION, AND CENTRALIZED CACHE AND
TRANSACTION CONTROL, IN A SYMMETRIC
MULTIPROCESSOR SYSTEM, filed Sep. 29, 1998, incorporated
by reference above, provides additional detail of
particular transaction address bus embodiments. U.S.
application Ser. No. 09/168,311, METHOD