(12) United States Patent
Wang et al.

(10) Patent No.: US 6,516,442 B1
(45) Date of Patent: Feb. 4, 2003

(54) CHANNEL INTERFACE AND PROTOCOLS FOR CACHE COHERENCY IN A SCALABLE SYMMETRIC MULTIPROCESSOR SYSTEM

(75) Inventors: Yuanlong Wang, San Jose, CA (US); Brian R. Biard, Alameda County, CA (US); Daniel Fu, Sunnyvale, CA (US); Earl T. Cohen, Fremont, CA (US); Carl G. Amdahl, Alameda County, CA (US)

(73) Assignee: Conexant Systems, Inc., Newport Beach, CA (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 0 days.

(21) Appl. No.: 09/281,749

(22) Filed: Mar. 30, 1999

Related U.S. Application Data

(63) Continuation-in-part of application No. 09/163,294, filed on Sep. 29, 1998, now Pat. No. 6,292,705, and a continuation-in-part of application No. 08/986,430, filed on Dec. 7, 1997, now Pat. No. 6,065,077.

(51) Int. Cl.7: H03M 13/00; G06C 13/00
(52) U.S. Cl.: 714/776; 711/146
(58) Field of Search: 714/748, 18, 751, 714/20, 4, 776; 370/244, 235; 711/141, 146

(56) References Cited

U.S. PATENT DOCUMENTS

4,315,308 A     2/1982  Jackson
4,438,494 A     3/1984  Budde et al.
4,480,307 A    10/1984  Budde et al.
5,161,156 A  * 11/1992  Baum et al. .............. 714/4
5,271,000 A  * 12/1993  Engbersen et al. ...... 370/244
5,313,609 A     5/1994  Baylor et al.
5,335,335 A     8/1994  Jackson et al.
5,440,698 A     8/1995  Sindhu et al.
5,505,686 A     4/1996  Willis et al.
5,511,226 A     4/1996  Zilka
5,513,335 A     4/1996  McClure
5,524,234 A     6/1996  Martinez, Jr. et al.

OTHER PUBLICATIONS

Technical White Paper, Sun(TM) Enterprise(TM) 10000 Server, Sun Microsystems, Sep. 1998.

Primary Examiner: Albert Decady
Assistant Examiner: Cynthia Harris
(74) Attorney, Agent, or Firm: Keith Kind; Kelly H. Hale

(57) ABSTRACT
`
A preferred embodiment of a symmetric multiprocessor system includes a switched fabric (switch matrix) for data transfers that provides multiple concurrent buses that enable greatly increased bandwidth between processors and shared memory. A high-speed point-to-point Channel couples command initiators and memory with the switch matrix and with I/O subsystems. Each end of a channel is connected to a Channel Interface Block (CIB). The CIB presents a logical interface to the Channel, providing a communication path to and from a CIB in another IC. CIB logic presents a similar interface between the CIB and the core-logic and between the CIB and the Channel transceivers. A channel transport protocol is implemented in the CIB to reliably transfer data from one chip to another in the face of errors and limited buffering.
44 Claims, 12 Drawing Sheets
[Cover figure: system diagram; recovered labels include AGP, CIB 305, system 200, MCU 230, and SDRAM banks 1300-1303. Graphics not reproducible in this text extraction.]
NETAPP, INC. EXHIBIT 1017
Page 1 of 35
U.S. PATENT DOCUMENTS (continued)

5,526,380 A     6/1996  Izzard
5,535,363 A     7/1996  Prince
5,537,569 A     7/1996  Masubuchi
5,537,575 A     7/1996  Foley
5,553,310 A     9/1996  Taylor et al.
5,561,779 A    10/1996  Jackson
5,568,620 A    10/1996  Sarangdhar et al.
5,574,868 A    11/1996  Marisetty
5,577,204 A    11/1996  Brewer et al.
5,581,729 A    12/1996  Nishtala et al.
5,588,131 A    12/1996  Borrill
5,594,886 A     1/1997  Smith et al.
5,602,814 A  *  2/1997  Jaquette et al. ...... 369/47.53
5,606,686 A     2/1997  Tarui et al.
5,634,043 A     5/1997  Self et al.
5,634,068 A     5/1997  Nishtala et al.
5,644,754 A     7/1997  Weber
5,655,100 A     8/1997  Ebrahim et al.
5,657,472 A     8/1997  Van Loo et al.
5,682,516 A    10/1997  Sarangdhar et al.
5,684,977 A    11/1997  Van Loo et al.
5,696,910 A    12/1997  Pawlowski
5,796,605 A     8/1998  Hagersten
5,829,034 A    10/1998  Hagersten et al.
5,895,495 A     4/1999  Arimilli et al.
5,897,656 A     4/1999  Vogt et al.
5,940,856 A     8/1999  Arimilli et al.
5,946,709 A     8/1999  Arimilli et al.
5,978,411 A    11/1999  Kitade et al.
6,044,122 A     3/2000  Ellersick et al.
6,065,077 A     5/2000  Fu
6,125,429 A     9/2000  Goodwin et al.
6,145,007 A  * 11/2000  Dokic et al. .......... 709/230
6,279,084 B1 *  8/2001  VanDoren et al. ...... 711/141
6,289,420 B1    9/2001  Cypher
6,292,705 B1    9/2001  Wang et al.

OTHER PUBLICATIONS (continued)

Alan Charlesworth, Starfire: Extending the SMP Envelope, IEEE Micro, Jan./Feb. 1998, pp. 39-49.
Joseph Heinrich, Origin(TM) and Onyx2(TM) Theory of Operations Manual, Document No. 007-3439-002, Silicon Graphics, Inc., 1997.
White Paper, Sequent's NUMA-Q SMP Architecture, Sequent, 1997.
White Paper, Eight-way Multiprocessing, Hewlett-Packard, Nov. 1997.
George White & Pete Vogt, Profusion, a Buffered, Cache-Coherent Crossbar Switch, presented at Hot Interconnects Symposium V, Aug. 1997.
Alan Charlesworth, et al., Gigaplane-XB: Extending the Ultra Enterprise Family, presented at Hot Interconnects Symposium V, Aug. 1997.
James Laudon & Daniel Lenoski, The SGI Origin: A ccNUMA Highly Scalable Server, Silicon Graphics, Inc., presented at the Proc. of the 24th Int'l Symp. Computer Architecture, Jun. 1997.
Mike Galles, Spider: A High-Speed Network Interconnect, IEEE Micro, Jan./Feb. 1997, pp. 34-39.
T. D. Lovett, R. M. Clapp and R. J. Safranek, NUMA-Q: an SCI-based Enterprise Server, Sequent, 1996.
Daniel E. Lenoski & Wolf-Dietrich Weber, Scalable Shared-Memory Multiprocessing, Morgan Kaufmann Publishers, 1995, pp. 143-159.
David B. Gustavson, The Scalable Coherent Interface and Related Standards Projects (as reprinted in Advanced Multimicroprocessor Bus Architectures, Janusz Zalewski, IEEE Computer Society Press, 1995, pp. 195-207).
Kevin Normoyle, et al., UltraSPARC(TM) Port Architecture, Sun Microsystems, Inc., presented at Hot Interconnects III, Aug. 1995.
Kevin Normoyle, et al., UltraSPARC(TM) Port Architecture, Sun Microsystems, Inc., presented at Hot Interconnects III, Aug. 1995, UltraSparc Interfaces.
Kai Hwang, Advanced Computer Architecture: Parallelism, Scalability, Programmability, McGraw-Hill, 1993, pp. 355-357.
Jim Handy, The Cache Memory Book, Academic Press, 1993, pp. 161-169.
Angel L. Decegama, Parallel Processing Architectures and VLSI Hardware, Vol. 1, Prentice-Hall, 1989, pp. 341-344.

* cited by examiner
[Drawing Sheets 1-12: the figure graphics did not survive text extraction; only sheet headings and scattered block labels are recoverable.]

Sheet 1 (FIG. 1): prior-art shared-bus symmetric multiprocessor system; recovered labels include CPU/CACHE and MEMORY blocks on a shared bus.
Sheet 2 (FIG. 2): 8P system centered on the FLOW CONTROL UNIT (FCU) 220; recovered labels include CPU0-CPU7 120, DCIU 210, BBU 240, MCU 230, MEMORY 1300-1303, AGP, PCI, and system 200.
Sheet 3 (FIG. 3): FCU internal detail; recovered labels include CIB 305, IIF 3102, CHANNEL CONTROL/INTERFACE 0 through N-1, MEMORY CONTROL channels 230, SDRAM 1300-1303, DUAL CPU/CACHE INTERFACE 210, BUS INTERFACE CONTROL 240, CACHING DEVICE, I/O CIB, node switches 380, and the FCU.
Sheet 4 (FIG. 4): per-CPU channel variation; recovered labels include CPU CORE, TRANSPORT, PHY LINK, CACHE COHERENT FLOW CONTROL, NON-BLOCKING DATA SWITCH, I/O CONTROL, and MEMORY.
Sheet 5 (FIG. 5): timing comparison. (A) FCU-based 4-way SMP: 64-byte cache line, up to 4 simultaneous transactions, 16 bytes transferred per cycle per way, peak memory read transfer rate = 6.4 GB/s (4-way), 1.6 GB/s per way. (B) Prior-art shared-bus-based 4-way SMP: 32-byte cache line, shared address and data buses to MCU1-MCU4, 8 bytes transferred per cycle, peak transfer rate = 800 MB/s.
Sheet 6 (FIG. 6): another view of the FIG. 4 embodiment; recovered labels include FCU, CCU, BBU, and PCI 64/66.
Sheet 7 (FIG. 7): system configurations 7a-7d; labels largely unrecoverable.
Sheet 8 (FIG. 8): CPU with integral CCU; recovered labels include CPU CORE and L2.
Sheet 9 (FIG. 9): variation of FIG. 6 using the integrated CPU/CCU; recovered labels include FCU, BBU, and L2 $ (L2 cache).
Sheet 10 (FIG. 10): variations of FIG. 7 using the integrated CPU/CCU; labels largely unrecoverable.
Sheet 11 (FIG. 11): 4-way embodiment with coupling to an industry-standard switching fabric; recovered labels include FCU and BBU.
Sheet 12 (FIG. 12): 16-way embodiment with multiple 4-way shared-bus systems coupled to the FCU and two switching-fabric instances; labels largely unrecoverable.
CHANNEL INTERFACE AND PROTOCOLS FOR CACHE COHERENCY IN A SCALABLE SYMMETRIC MULTIPROCESSOR SYSTEM

CROSS-REFERENCE TO RELATED APPLICATIONS

This patent application is a continuation-in-part of the following commonly-owned U.S. patent applications:

U.S. application Ser. No. 08/986,430, filed Dec. 7, 1997, now U.S. Pat. No. 6,065,077; and
U.S. application Ser. No. 09/163,294, filed Sep. 29, 1998, now U.S. Pat. No. 6,292,705;

all of which are incorporated by reference herein.

BACKGROUND

The system of FIG. 1 is a prototypical prior art symmetric multiprocessor (SMP) system 100. This traditional approach provides uniform access to memory 130 over a shared system bus 110. Each processor 120 has an associated cache and cache controller. The caches are individually managed according to a common cache coherency protocol to ensure that all software is well behaved. The caches continually monitor (snoop) the shared system bus 110, watching for cache updates and other system transactions. Transactions are often decomposed into different component stages, controlled by different system bus signals, such that different stages of multiple transactions may be overlapped in time to permit greater throughput. Nevertheless, for each stage, subsequent transactions make sequential use of the shared bus. The serial availability of the bus ensures that transactions are performed in a well-defined order. Without strong transaction ordering, cache coherency protocols fail and system and application software will not be well behaved.

A first problem with the above-described traditional SMP system is that the serial availability of the bus limits the scalability of the SMP system. As more processors are added, system performance is eventually limited by saturation of the shared system bus.

What is needed is an SMP system architecture that provides greater scalability by permitting concurrent use of multiple buses, while still providing a system serialization point to maintain strong transaction ordering and cache coherency. What is also needed is an SMP architecture that further provides increased transaction throughput.

SUMMARY

A preferred embodiment of a symmetric multiprocessor system includes a switched fabric (switch matrix) for data transfers that provides multiple concurrent buses that enable greatly increased bandwidth between processors and shared memory. A high-speed point-to-point Channel couples command initiators and memory with the switch matrix and with I/O subsystems. Each end of a channel is connected to a Channel Interface Block (CIB). The CIB presents a logical interface to the Channel, providing a communication path to and from a CIB in another IC. CIB logic presents a similar interface between the CIB and the core-logic and between the CIB and the Channel transceivers. A channel transport protocol is implemented in the CIB to reliably transfer data from one chip to another in the face of errors and limited buffering.

BRIEF DESCRIPTION OF DRAWINGS

FIG. 1 is a drawing of a prior-art generic symmetric shared-memory multiprocessor system using a shared bus.

FIG. 2 is a drawing of a preferred embodiment symmetric shared-memory multiprocessor system using a switched fabric data path architecture centered on a Flow-Control Unit (FCU).

FIG. 3 is a drawing of the switched fabric data path architecture of FIG. 2, further showing internal detail of an FCU having a Transaction Controller (TC), Transaction Bus (TB), and Transaction Status Bus (TSB) according to the present invention.

FIG. 4 is a drawing of a variation on the embodiment of FIG. 2, in which each CPU has its own CCU, and in which the channel interface and control is abstractly represented as being composed of a physical (PHY) link layer and a transport layer.

FIG. 5 is a timing diagram comparing the memory transaction performance of a system based on a flow control unit according to the present invention and a prior art shared-bus system.

FIG. 6 is another view of the embodiment of FIG. 4.

FIG. 7 is a drawing of a number of system embodiments according to the present invention. FIG. 7a illustrates a minimal configuration, 7b illustrates a 4-way configuration, 7c illustrates an 8-way high-performance configuration, and 7d illustrates a configuration for I/O intensive applications.

FIG. 8 is a drawing of a CPU having an integral CCU.

FIG. 9 illustrates a variation of the embodiment of FIG. 6 using the integrated CPU/CCU of FIG. 8.

FIG. 10 illustrates variations of the embodiments of FIG. 7 using the integrated CPU/CCU of FIG. 8.

FIG. 11 is a drawing of a 4-way embodiment of the present invention that includes coupling to an industry standard switching fabric for coupling CPU/Memory complexes with I/O devices.

FIG. 12 is a drawing of a 16-way embodiment of the present invention, in which multiple 4-way shared-bus systems are coupled via CCUs to the FCU, and which includes coupling to two instances of an industry standard switching fabric for coupling CPU/Memory complexes with I/O devices.

DETAILED DESCRIPTION

System Overview

FIG. 2 is a drawing of a preferred embodiment symmetric shared-memory multiprocessor system using a switched fabric data path architecture centered on a Flow-Control Unit (FCU) 220. In the illustrated embodiment, eight processors 120 are used and the configuration is referred to herein as an "8P" system.

The FCU (Flow Control Unit) 220 chip is the central core of the 8P system. The FCU internally implements a switched-fabric data path architecture. Point-to-Point (PP) interconnect 112, 113, and 114 and an associated protocol define dedicated communication channels for all FCU I/O. The terms Channels and PP-Channel are references to the FCU's PP I/O. The FCU provides Point-to-Point Channel interfaces to up to ten Bus Bridge Units (BBUs) 240 and/or CPU Channel Units (CCUs, also known as Channel Interface Units or CIUs) and one to four Memory Control Units (MCUs) 230. Two of the ten Channels are fixed to connect to BBUs. The other eight Channels can connect to either BBUs or CCUs. In an illustrative embodiment the number of CCUs is eight. In one embodiment the CCUs are packaged as a pair referred to herein as a Dual CPU Interface Unit (DCIU) 210. In the 8P system shown, the Dual CPU Interface Unit (DCIU) 210 interfaces two CPUs with the FCU. Throughout this description, a reference to a "CCU"
is understood to describe the logical operation of each half of a DCIU 210, and references to "CCUs" are understood to apply equally to an implementation that uses either single CCUs or DCIUs 210. CCUs act as a protocol converter between the CPU bus protocol and the PP-Channel protocol.

The FCU 220 provides a high-bandwidth and low-latency connection among these components via a Data Switch, also referred to herein as a Simultaneous Switched Matrix (SSM), or switched fabric data path. In addition to connecting all of these components, the FCU provides the cache coherency support for the connected BBUs and CCUs via a Transaction Controller and a set of cache-tags duplicating those of the attached CPUs' L2 caches. FIG. 5 is a timing diagram comparing the memory transaction performance of a system based on a flow control unit according to the present invention and a prior art shared-bus system.
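The bandwidth advantage illustrated in FIG. 5 follows directly from its stated transfer widths. The sketch below reproduces the figure's arithmetic; the 100 MHz data clock is an inferred assumption chosen to be consistent with the figure's quoted rates, not a value stated in the text.

```python
# Peak-rate arithmetic behind the FIG. 5 comparison. The 100 MHz data
# clock is an inferred assumption that reproduces the figure's numbers;
# the text itself gives only the resulting bandwidths.
CLOCK_HZ = 100_000_000

# FCU-based 4-way SMP: 16 bytes per cycle per way, 4 concurrent ways.
fcu_per_way = 16 * CLOCK_HZ        # bytes/second per way
fcu_total = 4 * fcu_per_way        # up to 4 simultaneous transactions

# Prior-art shared bus: 8 bytes per cycle, one transfer at a time.
shared_bus = 8 * CLOCK_HZ

assert fcu_per_way == 1_600_000_000    # 1.6 GB/s per way
assert fcu_total == 6_400_000_000      # 6.4 GB/s aggregate
assert shared_bus == 800_000_000       # 800 MB/s
speedup = fcu_total / shared_bus       # 8x aggregate advantage
```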
In a preferred embodiment, the FCU provides support for two dedicated BBU channels, four dedicated MCU channels, up to eight additional CCU or BBU channels, and PCI peer-to-peer bridging. The FCU contains a Transaction Controller (TC) with reflected L2 states. The TC supports up to 200M cache-coherent transactions/second, MOESI and MESI protocols, and up to 39-bit addressing. The FCU contains the Simultaneous Switch Matrix (SSM) Dataflow Switch, which supports non-blocking data transfers.

In a preferred embodiment, the MCU supports flexible memory configurations, including one or two channels per MCU, up to 4 Gbytes per MCU (maximum of 16 Gbytes per system), with one or two memory banks per MCU, with one to four DIMMs per bank, of SDRAM, DDR-SDRAM, or RDRAM, and with non-interleaved or interleaved operation.

In a preferred embodiment, the BBU supports both 32 and 64 bit PCI bus configurations, including 32 bit/33 MHz, 32 bit/66 MHz, and 64 bit/66 MHz. The BBU is also 5 V tolerant and supports AGP.

All connections between components occur as a series of "transactions." A transaction is a Channel Protocol request command and a corresponding Channel Protocol reply. For example, a processor, via a CCU, can perform a Read request that will be forwarded, via the FCU, to the MCU; the MCU will return a Read reply, via the FCU, back to the same processor. A Transaction Protocol Table (TPT) defines the system-wide behavior of every type of transaction and a Point-to-Point Channel Protocol defines the command format for transactions.
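The request/reply transaction model above can be sketched in software. This is an illustrative model only: the command fields, function names, and routing shown are assumptions for exposition, not the patent's Channel Protocol encoding.

```python
from dataclasses import dataclass

# Hypothetical sketch of the transaction model: a CCU issues a Read
# request, the FCU forwards it to the target MCU, and the MCU's Read
# reply is routed back to the same initiator. All names are illustrative.
@dataclass
class Command:
    kind: str        # "ReadRequest", "ReadReply", ...
    initiator: str   # e.g. "CCU0"
    target: str      # e.g. "MCU1"
    address: int
    data: bytes = b""

def mcu_service(req: Command, memory: dict) -> Command:
    # The target answers a request with the corresponding reply command.
    assert req.kind == "ReadRequest"
    return Command("ReadReply", req.initiator, req.target,
                   req.address, memory.get(req.address, b"\x00"))

def fcu_forward(req: Command, memory: dict) -> Command:
    # The FCU routes the request to its target and the reply back
    # to the original initiator.
    reply = mcu_service(req, memory)
    assert reply.initiator == req.initiator  # reply returns to requester
    return reply

memory = {0x1000: b"\xab"}
reply = fcu_forward(Command("ReadRequest", "CCU0", "MCU1", 0x1000), memory)
```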
The FCU assumes that initiators have converted addresses from other formats to conform with the PP-Channel definitions. The FCU does do target detection. Specifically, the FCU determines the correspondence between addresses and specific targets via address mapping tables. Note that this mapping hardware (contained in the CFGIF and the TC) maps from Channel Protocol addresses to targets. The mapping generally does not change or permute addresses.
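Target detection of this kind can be modeled as a range-to-target table lookup. The ranges and target names below are invented for illustration; the point being modeled is that the table selects a target while passing the address through unpermuted.

```python
# Illustrative sketch of FCU target detection: an address-mapping table
# associates address ranges with targets without changing the address
# itself. Ranges and target names are invented for the example.
ADDRESS_MAP = [
    (0x0000_0000, 0x4000_0000, "MCU0"),   # [base, limit) -> target
    (0x4000_0000, 0x8000_0000, "MCU1"),
    (0xF000_0000, 0xF100_0000, "BBU0"),   # e.g. an I/O space
]

def detect_target(addr: int):
    for base, limit, target in ADDRESS_MAP:
        if base <= addr < limit:
            # The mapping selects a target; the address passes through
            # unchanged, as the text notes.
            return target, addr
    raise ValueError("unmapped address")

target, addr = detect_target(0x4000_1234)
```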
Summary of Key Components

Transaction Controller (TC) 400. The most critical coherency principle obeyed by the FCU is the concept of a single, system-serialization point. The system-serialization point is the "funnel" through which all transactions must pass. By guaranteeing that all transactions pass through the system-serialization point, a precise order of transactions can be defined. (And this in turn implies a precise order of tag state changes.) In the FCU, the system-serialization point is the Transaction Controller (TC). Coherency state is maintained by the duplicate set of processor L2 cache-tags stored in the TC.

The Transaction Controller (TC) acts as the central system-serialization and cache coherence point, ensuring that all transactions in the system happen in a defined order, obeying defined rules. All requests, cacheable or not, pass through the Transaction Controller. The TC handles the cache coherency protocol using a duplicate set of L2 cache-tags for each CPU. It also controls address mapping inside the FCU, dispatching each transaction request to the appropriate target interface.
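The serialization-point idea above can be sketched as a single controller that assigns every request a global sequence number and updates per-CPU duplicate tags in that order. The two-state update rule below is a simplified stand-in, not the patent's MOESI/MESI protocol tables.

```python
# Minimal sketch of a single system-serialization point: every request,
# cacheable or not, funnels through one controller that assigns a global
# order and updates duplicate L2 tag state. The tag states and update
# rule here are deliberately simplified illustrations.
class TransactionController:
    def __init__(self, n_cpus):
        self.order = 0
        self.dup_tags = [dict() for _ in range(n_cpus)]  # addr -> state

    def process(self, cpu, addr, write):
        self.order += 1                      # precise global order
        # A write invalidates other CPUs' copies; mark the requester.
        for i, tags in enumerate(self.dup_tags):
            if write and i != cpu and addr in tags:
                tags[addr] = "I"
        self.dup_tags[cpu][addr] = "M" if write else "S"
        return self.order  # a defined order implies a defined tag order

tc = TransactionController(n_cpus=2)
tc.process(cpu=0, addr=0x80, write=False)       # CPU0 reads -> shared
seq = tc.process(cpu=1, addr=0x80, write=True)  # CPU1 writes -> CPU0 invalid
```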
Transaction Bus (TB) 3104 and Transaction Status Bus (TSB) 3106. All request commands flow through the Transaction Bus. The Transaction Bus is designed to provide fair arbitration between all transaction sources (initiators) and the TC; it provides an inbound path to the TC, and distributes outbound status from the TC (via the Transaction Status Bus).

The Transaction Bus (TB) is the address/control "highway" in the FCU. It includes an arbiter and the Transaction Bus itself. The TB pipelines the address over two cycles. The extent of pipelining is intended to support operation of the FCU at 200 MHz using contemporary fabrication technology at the time of filing of this disclosure.
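The text asks for "fair arbitration" between sources but does not specify the scheme; a round-robin arbiter is one common way to get fairness, sketched below purely as an assumed illustration.

```python
# Hypothetical round-robin arbiter illustrating "fair arbitration between
# all transaction sources". The FCU's actual arbitration scheme is not
# detailed in the text; round-robin is an assumed example of fairness.
class RoundRobinArbiter:
    def __init__(self, n_sources):
        self.n = n_sources
        self.last = self.n - 1   # so source 0 has priority first

    def grant(self, requests):
        # requests: set of source ids currently asserting a request.
        # Scan starting just past the last grant, wrapping around.
        for i in range(1, self.n + 1):
            candidate = (self.last + i) % self.n
            if candidate in requests:
                self.last = candidate
                return candidate
        return None  # no requester this cycle

arb = RoundRobinArbiter(4)
grants = [arb.grant({1, 3}) for _ in range(3)]  # alternates fairly
```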
Whereas the TB provides inputs to the Transaction Controller, the Transaction Status Bus delivers outputs from the Transaction Controller to each interface and/or target. The TSB outputs provide transaction confirmation, coherency state update information, etc. Note that while many signals on the TSB are common, the TC does drive unique status information (such as cache-state) to each interface. The Transaction Bus and Transaction Status Bus are discussed in detail later in this application.

Switched Fabric Data Path (Data Switch). The Data Switch is an implementation of a Simultaneous Switched Matrix (SSM) or switched fabric data path architecture. It provides for parallel routing of transaction data between multiple initiators and multiple targets. The Data Switch is designed to let multiple, simultaneous data transfers take place to/from initiators and from/to targets (destinations of transactions). Note that the Data Switch is packet based. Every transfer over the Data Switch starts with a Channel Protocol command (playing the role of a packet header) and is followed by zero or more data cycles (the packet payload). All reply commands (some with data) flow through the Data Switch. Both write requests and read replies will have data cycles. Other replies also use the Data Switch and will only send a command header (no payload).
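The header-plus-payload framing described above can be sketched as follows; the 16-byte cycle width and the field layout are invented for illustration, since the text does not give the Data Switch's cycle width at this point.

```python
# Sketch of the packet-based Data Switch framing: every transfer is a
# command header followed by zero or more data cycles. The 16-byte data
# cycle and dict layout are invented for illustration.
CYCLE_BYTES = 16

def frame(command: str, payload: bytes):
    # Split the payload into data cycles; a header-only packet has no
    # data cycles, matching replies that carry no data.
    cycles = [payload[i:i + CYCLE_BYTES]
              for i in range(0, len(payload), CYCLE_BYTES)]
    return {"header": command, "data_cycles": cycles}

write_req = frame("WriteRequest", bytes(64))  # 64-byte cache line
inv_reply = frame("InvalidateReply", b"")     # command header only
```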
IIF (Initiator InterFace) 3102. The IIF is the interface between the FCU and an initiator (a BBU or a CCU). The IIF transfers Channel Protocol commands to and from the initiator. The IIF must understand the cache coherency protocol and must be able to track all outstanding transactions. Note that the BBU/CCU can be both an initiator of commands and a target of commands (for CSR read/write if nothing else). Address and control buffering happen in the IIF; bulk data buffering is preferably done in the BBU/CCU (in order to save space in the FCU, which has ten copies of the IIF). The IIF needs configuration for CPU and I/O modes, and to handle differences between multiple types of processors that may be used in different system configurations.

Memory Interface (MIF) 3108. The Memory Interface (MIF) is the portal to the memory system, acting as the interface between the rest of the chipset and the MCU(s). The MIF is the interpreter/filter/parser that receives transaction status from the TB and TC, issues requests to the MCU, receives replies from the MCU, and forwards the replies to the initiator of the transaction via the Data Switch. It is a "slave" device in that it can never be an initiator on the TB. (The MIF is an initiator in another sense, in that it
sources data to the Data Switch.) For higher performance, the MIF supports speculative reads. Speculative reads start the read process early using the data from the TB rather than waiting for the data on the TSB. There is one MIF (regardless of how many memory interfaces there are). The MIF contains the memory mapping logic that determines the relationship between addresses and MCUs (and memory ports). The memory mapping logic includes means to configure the MIF for various memory banking/interleaving schemes. The MIF also contains the GART (Graphics Address Remap Table). Addresses that hit in the GART region of memory will be mapped by the GART to the proper physical address.
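The MIF's two mapping duties (interleaving across MCUs, and GART remapping for addresses in the GART region) can be sketched together. The region bounds, page size, interleave granularity, and remap table below are all invented for the example; the text specifies only that such configurable mapping exists.

```python
# Illustrative MIF memory-map sketch: cache-line interleaving across
# MCUs plus a GART-style remap for addresses in the GART region.
# Region bounds, page size, and the remap table are invented values.
LINE = 64
N_MCUS = 4
GART_BASE, GART_SIZE, PAGE = 0xE000_0000, 0x0100_0000, 4096
GART_TABLE = {0: 0x0123_4000}   # GART page index -> physical page base

def route(addr: int):
    # Remap through the GART first if the address hits the GART region.
    if GART_BASE <= addr < GART_BASE + GART_SIZE:
        page_index, offset = divmod(addr - GART_BASE, PAGE)
        addr = GART_TABLE[page_index] + offset
    # Then interleave consecutive cache lines across the MCUs.
    mcu = (addr // LINE) % N_MCUS
    return mcu, addr

mcu_a, _ = route(0x0000_0040)      # second cache line -> next MCU
mcu_b, phys = route(0xE000_0010)   # GART hit: remapped, then routed
```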
Configuration Register Interface (CFGIF) 410. This is where all the FCU's Control and Status Registers (CSRs) logically reside. CFGIF is responsible for the reading and writing of all the CSRs in the FCU, as well as all of the diagnostic reads/writes (e.g., diagnostic accesses to the duplicate tag RAM).

Channel Interface Block (CIB). The CIBs are the transmit and receive interface for the Channel connections to and from the FCU. The FCU has 14 copies of the CIB, 10 for BBUs/CCUs, and 4 for MCUs. (The CIB is generic, but the logic on the core-side of the Channel is an IIF or the MIF.)
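The Summary states that a channel transport protocol in the CIB reliably transfers data "in the face of errors and limited buffering." One standard way to meet both constraints is a small windowed sender with cumulative acknowledgments and replay; the sketch below is an assumed illustration of that idea, not the patent's actual protocol.

```python
# Hedged sketch of reliable chip-to-chip transport with limited
# buffering: a go-back-N-style sender holding a small window of
# sequence-numbered payloads. Window size, sequencing, and the retry
# policy are illustrative assumptions only.
class Sender:
    def __init__(self, window=4):
        self.window = window   # limited buffering: bounded unacked data
        self.buffer = {}       # seq -> payload, retained until acked
        self.next_seq = 0
        self.base = 0

    def send(self, payload):
        if self.next_seq - self.base >= self.window:
            return None        # back-pressure: no buffer space left
        seq = self.next_seq
        self.buffer[seq] = payload
        self.next_seq += 1
        return (seq, payload)

    def ack(self, seq):
        # A cumulative ack releases buffers up to and including seq.
        for s in range(self.base, seq + 1):
            self.buffer.pop(s, None)
        self.base = max(self.base, seq + 1)

    def resend_from(self, seq):
        # On an error report, replay everything still buffered from seq.
        return [(s, self.buffer[s]) for s in range(seq, self.next_seq)]

tx = Sender(window=2)
tx.send(b"A"); tx.send(b"B")
stalled = tx.send(b"C")     # window full, sender must wait
tx.ack(0)                   # ack frees one buffer slot
ok = tx.send(b"C")
replay = tx.resend_from(1)  # receiver reported an error on seq 1
```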
Embodiments overview. FIG. 3 is a drawing showing internal detail of the switched fabric data path architecture within the FCU of FIG. 2. A first key component of the FCU is the Transaction Controller (TC) 400. A second key component of the FCU is an address and control bus 3100 that is actually an abstraction representing a Transaction Bus (TB) 3104 and Transaction Status Bus (TSB) 3106. A third key component of the FCU is the Data Path Switch (also referred to herein as the Data Switch, or the switched fabric data path). The Data Switch is composed of vertical buses 320, horizontal buses 340, and node switches 380. The node switches selectively couple the vertical and horizontal buses under control of the Data Path Switch Controller 360 and control signals 370. Additional key components of the FCU include one or more Initiator Interfaces (IIFs) 3102; a Memory Interface (MIF) 3108; and Channel Interface Blocks (CIBs) 305 at the periphery of the various interfaces.

A number of alternate embodiments exist. FIG. 4 is a drawing of a variation on the embodiment of FIG. 2, in which each CPU has its own CCU. In this view the channel interface and control that make up the IIFs and CCUs are abstractly represented as being composed of a physical (PHY) link layer and a transport layer. FIG. 6 is another view of the embodiment of FIG. 4. FIG. 7 is a drawing of a number of application specific variations on the embodiment of FIG. 4. FIG. 7a illustrates a minimal configuration, 7b illustrates a 4-way configuration, 7c illustrates an 8-way high-performance configuration, and 7d illustrates a configuration for I/O intensive applications.

FIG. 8 is a drawing of a CPU having an integral CCU. FIG. 8 makes explicit a "backside" bus interface to an external cache (an L2 cache in the case illustrated). An IIF replaces the conventional CPU interface, such that the Channel is the frontside bus of the CPU of FIG. 8. The embodiments of FIGS. 9 and 10 are respective variations of the embodiments of FIGS. 6 and 7, with adaptation for the use of the integrated CPU/CCU of FIG. 8. The embodiments of FIGS. 9 and 10 offer system solutions with lower CPU pin counts, higher throughput, lower latency, hot-pluggable CPUs (if an OS supports it), and reduced PCB board layout complexity compared with non-integrated solutions.

FIG. 11 is a drawing of a 4-way embodiment of the present invention that includes coupling to an industry standard switching fabric for coupling CPU/Memory complexes with I/O devices. FIG. 12 is a drawing of a 16-way embodiment of the present invention, in which multiple 4-way shared-bus systems are coupled via CCUs to the FCU, and which includes coupling to two instances of an industry standard switching fabric for coupling CPU/Memory complexes with I/O devices.

Additional Descriptions

U.S. application Ser. No. 08/986,430, AN APPARATUS AND METHOD FOR A CACHE COHERENT SHARED MEMORY MULTIPROCESSING SYSTEM, filed Dec. 7, 1997, incorporated by reference above, provides additional detail of the overall operation of the systems of FIGS. 2 and 3. U.S. application Ser. No. 09/163,294, METHOD AND APPARATUS FOR ADDRESS TRANSFERS, SYSTEM SERIALIZATION, AND CENTRALIZED CACHE AND TRANSACTION CONTROL, IN A SYMMETRIC MULTIPROCESSOR SYSTEM, filed Sep. 29, 1998, incorporated by reference above, provides additional detail of particular transaction address bus embodiments. U.S. application Ser. No. 09/168,311, METHOD
