Hagersten et al.
`
US005754877A
[11] Patent Number: 5,754,877
[45] Date of Patent: May 19, 1998
`
`[54] EXTENDED SYMMETRICAL
`MULTIPROCESSOR ARCHITECTURE
`
[75] Inventors: Erik E. Hagersten, Palo Alto, Calif.; Mark D. Hill, Madison, Wis.

[73] Assignee: Sun Microsystems, Inc., Palo Alto, Calif.
`
[21] Appl. No.: 675,363
[22] Filed: Jul. 2, 1996

[51] Int. Cl.6 .......................... G06F 15/163
[52] U.S. Cl. .......................... 395/800.29; 395/200.81; 395/200.73
[58] Field of Search .................. 395/200.68, 200.73, 200.81, 800.29
`
`[56]
`
`References Cited
`
`U.S. PATENT DOCUMENTS
`
4,819,232   4/1989  Krings .............................. 371/9
5,533,103   7/1996  Peavey et al.
5,579,512  11/1996  Goodrum et al. ............. 395/500
5,590,335  12/1996  Dubourreau et al. ......... 395/704
5,608,893   3/1997  Slingwine et al. ............ 395/468
5,655,103   8/1997  Cheng et al. ................. 395/479
`
`OTHER PUBLICATIONS
`
Cox et al., "Adaptive Cache Coherency for Detecting Migratory Shared Data," Proc. 20th Annual Symposium on Computer Architecture, May 1993, pp. 98-108.
Stenstrom et al., "An Adaptive Cache Coherence Protocol Optimized for Migratory Sharing," Proc. 20th Annual Symposium on Computer Architecture, May 1993 IEEE, pp. 109-118.
Wolf-Dietrich Weber et al., "Analysis of Cache Invalidation Patterns in Multiprocessors," Computer Systems Laboratory, Stanford University, CA, pp. 243-256.
Kourosh et al., "Two Techniques to Enhance the Performance of Memory Consistency Models," 1991 International Conference on Parallel Processing, pp. 1-10.
Li et al., "Memory Coherence in Shared Virtual Memory Systems," 1986 ACM, pp. 229-239.
D. Lenosky, PhD, "The Description and Analysis of DASH: A Scalable Directory-Based Multiprocessor," DASH Prototype System, Dec. 1991, pp. 36-56.
Hagersten et al., "Simple COMA Node Implementations," Ashley Saulsbury and Anders Landin, Swedish Institute of Computer Science, 12 pages.
Saulsbury et al., "An Argument for Simple COMA," Swedish Institute of Computer Science, 10 pages.
Hagersten et al., "Simple COMA," Ashley Saulsbury and Anders Landin, Swedish Institute of Computer Science, Jul. 1993, pp. 233-259.
Primary Examiner-William M. Treat
Attorney, Agent, or Firm-Conley, Rose & Tayon; B. Noel Kivlin

[57] ABSTRACT
An architecture for an extended multiprocessor (XMP) computer system is provided. The XMP computer system includes multiple SMP nodes. Each SMP node includes an XMP interface and a repeater structure coupled to the XMP interface. The SMP nodes are connected to each other by unidirectional point-to-point links. The repeater structure in each SMP node includes an upper level bus and one or more transaction repeaters coupled to the upper level bus. Each transaction repeater broadcasts transactions to bus devices attached to a lower level bus, wherein each transaction repeater is coupled to a separate lower level bus. Each transaction repeater includes a queue and a bypass path. Transactions originating in a particular SMP node are stored in the queue, whereas transactions originating in other SMP nodes bypass the incoming queue to the bus devices. Multiple transactions may be simultaneously broadcast across the point-to-point link connections between the SMP nodes. However, transactions are broadcast to the SMP nodes in a defined, uniform order. A control signal is asserted by the XMP interface so that a transaction is received by bus devices in the originating node from the incoming queues at the same time and in the same order as it is received by bus devices in non-originating nodes. Thus a hierarchical bus structure is provided that overcomes physical/electrical limitations of a single bus architecture while maximizing bus bandwidth utilization.
`
`16 Claims, 10 Drawing Sheets
`
[Representative drawing: an extended SMP system in which an XMP interface is coupled by an upper level bus to repeaters, each repeater driving a lower level bus of processor/memory bus devices with MTAGs and an I/O bus device.]
`
`
`
[Drawing Sheets 1 through 10 of U.S. Patent 5,754,877: FIGS. 1-10, as described below in the Brief Description of the Drawings.]
`EXTENDED SYMMETRICAL
`MULTIPROCESSOR ARCHITECTURE
`BACKGROUND OF THE INVENTION
`1. Field of the Invention
This invention relates to the field of multiprocessor computer systems and, more particularly, to the architectural connection of multiple processors within a multiprocessor computer system.
`2. Description of the Relevant Art
Multiprocessing computer systems include two or more processors which may be employed to perform computing tasks. A particular computing task may be performed upon one processor while other processors perform unrelated computing tasks. Alternatively, components of a particular computing task may be distributed among multiple processors to decrease the time required to perform the computing task as a whole. Generally speaking, a processor is a device configured to perform an operation upon one or more operands to produce a result. The operation is performed in response to an instruction executed by the processor.
A popular architecture in commercial multiprocessing computer systems is the symmetric multiprocessor (SMP) architecture. Typically, an SMP computer system comprises multiple processors connected through a cache hierarchy to a shared bus. Additionally connected to the bus is a memory, which is shared among the processors in the system. Access to any particular memory location within the memory occurs in a similar amount of time as access to any other particular memory location. Since each location in the memory may be accessed in a uniform manner, this structure is often referred to as a uniform memory architecture (UMA).
Processors are often configured with internal caches, and one or more caches are typically included in the cache hierarchy between the processors and the shared bus in an SMP computer system. Multiple copies of data residing at a particular main memory address may be stored in these caches. In order to maintain the shared memory model, in which a particular address stores exactly one data value at any given time, shared bus computer systems employ cache coherency. Generally speaking, an operation is coherent if the effects of the operation upon data stored at a particular memory address are reflected in each copy of the data within the cache hierarchy. For example, when data stored at a particular memory address is updated, the update may be supplied to the caches which are storing copies of the previous data. Alternatively, the copies of the previous data may be invalidated in the caches such that a subsequent access to the particular memory address causes the updated copy to be transferred from main memory. For shared bus systems, a snoop bus protocol is typically employed. Each coherent transaction performed upon the shared bus is examined (or "snooped") against data in the caches. If a copy of the affected data is found, the state of the cache line containing the data may be updated in response to the coherent transaction.
Unfortunately, shared bus architectures suffer from several drawbacks which limit their usefulness in multiprocessing computer systems. A bus is capable of a peak bandwidth (e.g. a number of bytes/second which may be transferred across the bus). As additional processors are attached to the bus, the bandwidth required to supply the processors with data and instructions may exceed the peak bus bandwidth. Since some processors are forced to wait for available bus bandwidth, performance of the computer system suffers when the bandwidth requirements of the processors exceed the available bus bandwidth.
`
`45
`
`SO
`
`55
`
`60
`
`65
`
`5.754.877
`
`10
`
`25
`
`30
`
`35
`
`2
Additionally, adding more processors to a shared bus increases the capacitive loading on the bus and may even cause the physical length of the bus to be increased. The increased capacitive loading and extended bus length increase the delay in propagating a signal across the bus. Due to the increased propagation delay, transactions may take longer to perform. Therefore, the peak bandwidth of the bus may decrease as more processors are added.
These problems are further magnified by the continued increase in operating frequency and performance of processors. The increased performance enabled by the higher frequencies and more advanced processor microarchitectures results in higher bandwidth requirements than previous processor generations, even for the same number of processors. Therefore, buses which previously provided sufficient bandwidth for a multiprocessing computer system may be insufficient for a similar computer system employing the higher performance processors.
A common way to address the problems incurred as more processors and devices are added to a shared bus system is to have a hierarchy of buses. In a hierarchical shared bus system, the processors and other bus devices are divided among several low level buses. These low level buses are connected by one or more high level buses. Transactions are originated on a low level bus, transmitted to the high level bus, and then driven back down to all the low level buses by repeaters. Thus, all the bus devices see the transaction at the same time and transactions remain ordered. The hierarchical shared bus logically appears as one large shared bus to all the devices. Additionally, the hierarchical structure overcomes the electrical constraints of a single large shared bus.
However, one problem with the above hierarchical shared bus structure is that transactions are always broadcast twice on the originating low level bus. This inefficiency can severely limit the available bandwidth on the low level buses. A possible solution would be to have separate unidirectional buses for transactions on the way up to higher levels of the bus hierarchy and for transactions on the way down from higher levels of the bus hierarchy. But this solution requires double the number of bus signals and double the number of pins on bus device packages. Obviously the solution imposes serious physical problems.
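The inefficiency is easy to see in a small model. The following Python sketch is purely illustrative (the bus names and the function are not from the patent); it counts bus slots consumed under the prior-art scheme, in which every transaction is driven once as an outgoing packet and then re-broadcast as an incoming packet on every low level bus, including the one it came from.

```python
from collections import defaultdict

def prior_art_broadcast(transactions):
    """transactions: list of (tx_id, originating_l1_bus) pairs.
    Returns the number of bus slots consumed on each bus, showing that the
    originating L1 bus carries its own transaction twice (outgoing plus
    incoming), while the other L1 bus carries it once."""
    slots = defaultdict(int)
    for tx_id, origin_l1 in transactions:
        slots[origin_l1] += 1            # outgoing packet on the originating L1 bus
        slots["L2"] += 1                 # broadcast up to the high level bus
        for l1 in ("L1.1", "L1.2"):      # repeaters drive it back down to every
            slots[l1] += 1               # L1 bus, including the originator
    return dict(slots)

# Two transactions, one from each L1 bus: each L1 bus uses 3 slots
# (2 for its own transaction plus 1 for the other node's), versus 2 on L2.
print(prior_art_broadcast([("P1", "L1.1"), ("P2", "L1.2")]))
```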
An example of an SMP computer system employing a traditional hierarchical bus structure is illustrated in FIG. 1. A two-level bus structure is shown. Bus devices 8A-B are connected to lower level L1.1 bus 4A and bus devices 8C-D are connected to lower level L1.2 bus 4B. The bus devices may be any local bus type devices found in modern computer systems, such as a processor/memory device or an I/O bridge device. Each separate L1 bus 4A-B is coupled to an upper level L2 bus 2 by a repeater 6A-B. Together, each repeater, L1 bus, and bus device group form a repeater node 5. For example, repeater 6A, L1 bus 4A, and bus devices 8A-B comprise repeater node 5A.
When a bus transaction (such as a memory read) is initiated by a bus device, the transaction is transmitted from the originating L1 bus (4A or 4B) to the L2 bus 2. The transaction is then simultaneously broadcast back to both L1 buses 4A-B by their respective repeaters 6A-B. In this manner the transaction is seen by all bus devices 8 at the same time. Furthermore, the hierarchical structure of FIG. 1 ensures that bus transactions appear to all bus devices 8 in the same order. Thus, the hierarchical bus structure logically appears to the bus devices 8A-D as a single shared bus.
The operation of the computer system of FIG. 1 may be illustrated by timing diagram 12 as shown in FIG. 2. Each
column of timing diagram 12 corresponds to a particular bus cycle. Eleven bus cycles increasing in time from left to right are represented by the eleven columns. The state of the L2 bus 2, L1.1 bus 4A, and L1.2 bus 4B is shown for each bus cycle according to rows 14-16 respectively.
During bus cycle 1, an outgoing packet (address and command) is driven by one of the bus devices 8 on the L1 bus 4 in each repeater node 5. In timing diagram 12, these outgoing packets are shown as P1(o) on the L1.1 bus 4A and P2(o) on the L1.2 bus 4B. Since two different bus transactions were issued during the same cycle, the order in which they appear on the L2 bus 2 depends upon the arbitration scheme chosen. For the embodiment illustrated in timing diagram 12, the transaction issued on the L1.1 bus 4A is transmitted to the L2 bus 2 first, as represented by P1 on the L2 bus in bus cycle 2. Transaction P2(o) is queued in its respective repeater 6B. Also during bus cycle 2, two new transactions are issued on the lower level buses 4, represented by outgoing bus transactions P3(o) and P4(o) on the L1.1 bus 4A and L1.2 bus 4B respectively.
During bus cycle 3, transaction P1 is broadcast as an incoming transaction on the L1 buses 4 of both repeater nodes 5, as represented by P1(i) on rows 15 and 16. Also during bus cycle 3, the second outgoing transaction P2(o) from bus cycle 1 broadcasts on the L2 bus 2 as shown in row 14 of timing diagram 12.
During bus cycle 4, transaction P2 is broadcast as an incoming transaction on the L1 buses 4, as represented by P2(i) on rows 15 and 16. Also during bus cycle 4, outgoing transaction P3(o) broadcasts on the L2 bus 2 as transaction P3 as shown in row 14 of timing diagram 12. Similarly, bus transactions P3 and P4 are broadcast to the L1 buses during bus cycles 5 and 6. Because the L1 bus bandwidth is consumed with repeater broadcasts of incoming transactions, new outgoing transactions cannot be issued until bus cycle 7. As a result the full bandwidth of the L2 bus 2 is not utilized, as illustrated by the gap on row 14 during bus cycles 6 and 7.
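The schedule just described can be reproduced with a few lines of simulation. The sketch below is only an illustration of the prior-art behavior (the structure and names are assumptions, not the patent's): outgoing packets reach the L2 bus one cycle after they are issued, the L2 bus carries one packet per cycle, and each L2 packet is re-broadcast to both L1 buses, including the originating one, on the following cycle, which blocks new outgoing packets and produces the idle L2 cycles noted above.

```python
def simulate_prior_art(num_cycles=7):
    """Cycle model of the FIG. 2 scenario: P1/P2 issued in cycle 1, P3/P4 in cycle 2."""
    outgoing = {1: {"L1.1": "P1", "L1.2": "P2"},
                2: {"L1.1": "P3", "L1.2": "P4"}}
    rows = {"L2": [], "L1.1": [], "L1.2": []}
    l2_queue = []      # packets that have reached the L2 arbiter
    incoming = None    # packet re-broadcast on both L1 buses this cycle
    for cycle in range(1, num_cycles + 1):
        staged = []    # packets issued this cycle; they reach the L2 bus next cycle
        if incoming:
            # The L1 buses are busy carrying the repeater re-broadcast.
            for bus in ("L1.1", "L1.2"):
                rows[bus].append(incoming + "(i)")
        else:
            # The L1 buses are free to drive new outgoing packets.
            for bus in ("L1.1", "L1.2"):
                pkt = outgoing.get(cycle, {}).get(bus)
                rows[bus].append(pkt + "(o)" if pkt else "--")
                if pkt:
                    staged.append(pkt)
        # One packet per cycle on the L2 bus; it is re-broadcast next cycle.
        incoming = l2_queue.pop(0) if l2_queue else None
        rows["L2"].append(incoming if incoming else "--")
        l2_queue.extend(staged)
    for name in ("L2", "L1.1", "L1.2"):
        print(f"{name:5}: " + " ".join(f"{cell:6}" for cell in rows[name]))

simulate_prior_art()   # shows the L2 gap after cycle 5 and the L1 buses blocked until cycle 7
```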
For systems requiring a large number of processors, the above hierarchical bus structure may require many levels of hierarchy. The delay associated with broadcasting each transaction to the top of the hierarchy and back down and the delay associated with bus arbitration may severely limit the throughput of large hierarchical structures.
Another structure for multiprocessing computer systems is a distributed shared memory architecture. A distributed shared memory architecture includes multiple nodes within which processors and memory reside. The multiple nodes communicate via a network coupled therebetween. When considered as a whole, the memory included within the multiple nodes forms the shared memory for the computer system. Typically, directories are used to identify which nodes have cached copies of data corresponding to a particular address. Coherency activities may be generated via examination of the directories.
However, distributed shared memory architectures also have drawbacks. Directory lookups, address translations, and coherency maintenance all add latency to transactions between nodes. Also, distributed shared memory architecture systems normally require more complicated hardware than shared bus architectures.
It is apparent from the above discussion that a more efficient architecture for connecting a large number of devices in a multiprocessor system is desirable. The present invention addresses this need.
`SUMMARY OF THE INVENTION
`The problems outlined above are in large part solved by
`a computer system in accordance with the present invention.
`
`20
`
`25
`
`35
`
`45
`
`55
`
`65
`
`4
Broadly speaking, the present invention contemplates a multiprocessor computer system including multiple repeater nodes interconnected by an upper level bus. Each repeater node includes multiple bus devices, a lower level bus, and an address repeater. The bus devices are interconnected on the lower level bus. The repeater couples the upper level bus to the lower level bus. The bus devices may be processor/memory devices, and each bus device includes an incoming queue. Processor/memory bus devices include a high performance processor such as a SPARC processor, DRAM memory, and a high speed second level cache memory. The physical DRAM memory located on each bus device collectively comprises the system memory for the multiprocessor computer system. Also, bus devices may be input/output bus devices. I/O devices also include an incoming queue. Furthermore, input/output bus devices may include an I/O bus bridge that supports a peripheral I/O bus such as the PCI bus. This peripheral I/O bus allows communication with I/O devices, such as graphics controllers, serial and parallel ports, and disk drives.
The bus devices communicate with each other by sending and receiving bus transactions. A bus transaction initiated by one bus device is broadcast as an outgoing transaction on the lower level bus to which the initiating bus device is attached. Each other bus device attached to the same lower level bus stores this outgoing transaction in its respective incoming queue. Also, the repeater attached to this lower level bus broadcasts the outgoing transaction to the upper level bus. The repeaters in each of the other repeater nodes receive this outgoing transaction and repeat it as an incoming transaction on their respective lower level buses. The repeater in the originating repeater node does not repeat the outgoing bus transaction as an incoming bus transaction on its lower level bus. Instead, when the other repeaters drive the outgoing transaction as incoming transactions on their respective lower level buses, the repeater in the originating repeater node asserts a control signal that alerts each bus device in the originating repeater node to treat the packet stored at the head of its incoming queue as the current incoming transaction. The repeaters in the nonoriginating repeater nodes assert control signals to the bus devices on their respective lower level buses indicating that those bus devices should bypass their incoming queues and receive the incoming transaction broadcast on their lower level buses. Storing the outgoing transaction in the incoming bus device queues in the originating repeater node frees up the lower level bus in the originating repeater node to broadcast another outgoing transaction while the first transaction is being broadcast on the lower level buses in the nonoriginating repeater nodes. Therefore, maximum utilization of the lower level bus bandwidth is achieved.
Generally speaking, every bus device on a given lower level bus stores all outgoing transactions that appear on that lower level bus in its incoming queue. Outgoing transactions are broadcast by the repeater to the upper level bus in the same order that they appear on the lower level bus. The repeater for each repeater node drives transactions appearing on the upper level bus as incoming packets on the lower level bus only when those transactions are incoming transactions from another repeater node. In this manner, all bus devices in the computer system see each particular transaction at the same time and in the same order. Also, each bus transaction appears only once on each bus. Thus, the hierarchical bus structure of the present invention appears as a single large, logically shared bus to all the bus devices in the multiprocessor computer system.
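As an illustration of the queue-and-control-signal protocol summarized above, here is a minimal Python sketch; the class and signal names are invented for illustration and are not taken from the patent. It shows that each transaction is observed exactly once by every device and in the same order everywhere: non-originating repeaters drive the packet down directly, while devices in the originating node pull their stored copy from the incoming queue when the control signal is asserted.

```python
from collections import deque

class BusDevice:
    def __init__(self, name):
        self.name = name
        self.incoming_queue = deque()
        self.received = []          # transactions observed, in order

    def observe(self, tx, incoming_signal):
        # Take the queued copy when the control signal is asserted,
        # otherwise take the transaction driven on the lower level bus.
        self.received.append(self.incoming_queue.popleft() if incoming_signal else tx)

class RepeaterNode:
    def __init__(self, name, device_names):
        self.name = name
        self.devices = [BusDevice(n) for n in device_names]

def issue(originating_node, nodes, tx):
    """Broadcast one outgoing transaction system-wide (simplified, one at a time)."""
    # 1. The outgoing transaction appears on the originating lower level bus;
    #    devices there store it in their incoming queues.
    for dev in originating_node.devices:
        dev.incoming_queue.append(tx)
    # 2. The repeater forwards it to the upper level bus; non-originating
    #    repeaters drive it down, while the originating repeater asserts
    #    the control signal instead of re-broadcasting.
    for node in nodes:
        incoming_signal = (node is originating_node)
        for dev in node.devices:
            dev.observe(tx, incoming_signal)

node_a = RepeaterNode("30A", ["38A", "38B"])
node_b = RepeaterNode("30B", ["38C", "38D"])
issue(node_a, [node_a, node_b], "P1")
issue(node_b, [node_a, node_b], "P2")
# Every device sees P1 then P2, exactly once and in the same order.
print({d.name: d.received for n in (node_a, node_b) for d in n.devices})
```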
Another embodiment of the present invention contemplates an extended multiprocessor computer architecture.
`
`
`
`
`5.754.877
`
`6
`particular node initiates a transaction. the MTAG in that
`node is examined to determine if that node has valid access
`rights for that transaction address. If the retrieved MTAG
`indicates proper access rights. then the completed transac
`tion is valid. Otherwise. the transaction must be reissued
`globally to the other nodes.
`In another embodiment of the extended multiprocessor
`computer system of the present invention. different regions
`of the system memory address space may be assigned to
`operate in one of three modes. The three modes are the
`replicate mode. the migrate mode. and normal mode. For
`memory regions operating in the normal mode. all memory
`transactions are attempted in the originating multiprocessor
`node without sending global transactions. Transactions are
`only sent globally if the MTAG indicates improper access
`rights or if the address corresponds to a memory region
`mapped to another multiprocessor node.
`In the replicate mode. the replicate memory region is
`mapped to memory located in each multiprocessor node.
`such that a duplicate copy of the memory region is stored in
`each node. Therefore. replicate mode transactions are
`always attempted locally in the originating multiprocessor
`node. Transactions are only sent globally in replicate mode
`if the MTAG indicates improper access rights. In migrate
`mode. transactions are always sent globally the ?rst time.
`Therefore there is no need to maintain the MTAG coherency
`states.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`10
`
`25
`
`5
`Several multiprocessor nodes are interconnected with uni
`directional point-to-point link connections. Each multipro
`cessor link node includes a top level interface device for
`interfacing to these point-to-point link connections. Each
`node also includes an upper level bus which couples the top
`level interface to one or more repeaters. Each repeater is also
`coupled to a separate lower level bus in a fashion similar to
`that described for the embodiment above. One or more bus
`devices are attached to each lower level bus.
`Each repeater in a given multiprocessor node includes an
`internal queue and a bypass path. Each repeater also receives
`control signals from the top level interface. The control
`signals are used to select either the bypass path or the queue
`for transmitting transactions from the upper level bus to the
`lower level bus. Transactions originating within a given
`repeater node are stored in the queue whereas transactions
`incoming from another multiprocessor node are transmitted
`to the lower level bus via the bypass path. The point-to-point
`linking structure between top level interfaces of the multi
`processor nodes allows transactions to be communicated
`simultaneously between each multiprocessor node.
`Therefore. no arbitration delay is associated with these top
`level communications. Transaction ordering is maintained
`on this top level interface by following a strict de?ned
`transaction order. Any order may be chosen. but a speci?c
`de?ned order must be consistently used. For example. one
`such ordering may be that in a system comprising three
`nodes. node A. node B. and node C. transactions originating
`from node A take priority over transactions originating from
`node B and transactions originating from node B take
`priority over transactions originating from node C. This
`de?ned order indicates the order that transactions commu
`nicated on the top level point-to-point link structure will be
`transmitted to the repeaters in each multiprocessor node.
`Transactions broadcast on the upper level bus of nonorigi
`nating repeater nodes are further transmitted by the bypass
`path to the lower level buses in those nodes. However. the
`same transaction is not broadcast to the upper level bus in
`the originating repeater node. Instead. the control signal is
`asserted to the repeaters indicating that the transaction is to
`be broadcast to the lower level buses from the repeater
`queues. This allows the upper level bus in the originating
`node to remain free for broadcasting of new transactions.
`From the operation described above for the extended
`multiprocessor computer system. it can be seen that bus
`transactions broadcast between multiprocessor nodes appear
`only once on each upper level bus and lower level bus of
`each multiprocessor node. This allows maximum bus band
`width to be utilized. Furthermore. the strict de?ned ordering
`for the top level point-to-point connections ensures that an
`ordered transaction broadcast will always occur and that
`each bus device in the system will see each transaction at the
`same time and in the same order.
`Each bus device may contain memory. The memory
`located on each bus device collectively forms the system
`memory for the extended multiprocessor computer system.
`The memory is split into ditferent regions such that each
`multiprocessor node is assigned one portion of the total
`address space. The size of each address space portion is
`inversely proportional to the number of multiprocessor
`nodes comprising the extended multiprocessor computer
`system. For example. if there are three nodes. each node is
`assigned one-third of the address space.
`In order to maintain memory coherency between each
`node. each cache line in the system memory is tagged with
`a coherency state for that node. These coherency state tags
`are referred to as an MTAG. When a bus device in a
`
BRIEF DESCRIPTION OF THE DRAWINGS

Other objects and advantages of the invention will become apparent upon reading the following detailed description and upon reference to the accompanying drawings, in which:
`FIG. 1 is a block diagram of a symmetric multiprocessor
`computer system employing a hierarchical bus structure.
`FIG. 2 is a timing diagram illustrating the operation of the
`computer system of FIG. 1.
`FIG. 3 is a block diagram of a symmetric multiprocessor
`computer system employing a hierarchical bus structure
`according to one embodiment of the present invention.
`FIG. 4 is a timing diagram illustrating the operation of the
`computer system of FIG. 3.
`FIG. 5 is a block diagram of a processor/memory bus
`device for one embodiment of the present invention.
FIG. 6 is a block diagram of an I/O bridge bus device according to one embodiment of the present invention.
FIG. 7 is a block diagram of an extended symmetric multiprocessor computer system according to one embodiment of the present invention.
`FIG. 8 is a block diagram of an SMP node of the extended
`symmetric multiprocessor computer system of FIG. 7.
FIG. 9 is a diagram of different addressing modes employed in one embodiment of the present invention.
`FIG. 10 is a timing diagram illustrating the operation of
`the extended symmetric multiprocessor computer system of
`FIG. 7.
While the invention is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and will herein be described in detail. It should be understood, however, that the drawings and detailed description thereto are not intended to limit the invention to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the
`
`35
`
`45
`
`50
`
`65
`
`NETAPP, INC. EXHIBIT 1004
`Page 14 of 22
`
`
`
`5.754.877
`
`7
spirit and scope of the present invention as defined by the appended claims.
`
DETAILED DESCRIPTION OF THE INVENTION
`
`8
computer system 20 shown in FIG. 3, a memory operation may include one or more transactions upon the L1 buses 32 and L2 bus 22. Bus transactions are broadcast as bit-encoded packets comprising an address, command, and source id. Other information may also be encoded in each packet, such as addressing modes or mask information.
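As a rough illustration of the bit-encoded packets described above, the sketch below models a transaction packet as a small Python data class. The field set mirrors the description (address, command, source id, optional mode and mask information), but the exact encoding, field widths, and command names are not specified here and are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class TransactionPacket:
    address: int                             # physical address of the transaction
    command: str                             # e.g. "read_to_share" (illustrative name)
    source_id: int                           # identifies the issuing bus device
    addressing_mode: Optional[str] = None    # e.g. "normal", "replicate", "migrate"
    byte_mask: Optional[int] = None          # optional mask information

pkt = TransactionPacket(address=0x1234_5678, command="read_to_share", source_id=3)
print(pkt)
```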
Generally speaking, I/O operations are similar to memory operations except the destination is an I/O bus device. I/O devices are used to communicate with peripheral devices, such as serial ports or a floppy disk drive. For example, an I/O read operation may cause a transfer of data from I/O element 50 to a processor in processor/memory bus device 38D. Similarly, an I/O write operation may cause a transfer of data from a processor in bus device 38D to the I/O element 50 in bus device 38B. In the computer system 20 shown in FIG. 3, an I/O operation may include one or more transactions upon the L1 buses 32 and L2 bus 22.
The architecture of the computer system 20 in FIG. 3 may be better understood by tracing the flow of typical bus transactions. For example, a bus transaction initiated by processor/memory element 48 of bus device 38A is issued on outgoing interconnect path 44A. The transaction is seen as outgoing packet P1(o) on L1.1 bus 32A. Each bus device connected to L1.1 bus 32A, including the initiating bus device (38A in this example), stores the outgoing packet P1(o) in its incoming queue 40. Also, repeater 34A broadcasts the packet P1(o) onto the L2 bus 22 where it appears as packet P1. The repeaters in each of the non-originating repeater nodes 30 receive the packet P1 and drive it as an incoming packet P1(i) on their respective L1 buses 32. Since the embodiment illustrated in FIG. 3 only shows two repeater nodes 30, repeater 34B would receive packet P1 on the L2 bus 22 and drive it as incoming packet P1(i) on L1.2 bus 32B in the above example. It is important to note that repeater 34A on the device node 30A, from which the packet P1 originated as outgoing packet P1(o), does not drive packet P1 back down to L1.1 bus 32A as an incoming packet. Instead, when the other repeaters, such as repeater 34B, drive packet P1 on their respective L1 buses, repeater 34A asserts incoming signal 36A. Incoming signal 36A alerts each bus device in the originating node to treat the packet stored in its incoming queue 40 as the current incoming packet. The repeater 34B in non-originating node 30B does not assert its incoming signal 36B. Thus devices 38C and 38D bypass their incoming queues 40 and receive the incoming packet P1(i) from L1.2 bus 32B. Multiplexors 42 are responsive to the incoming signal and allow each device to see either the packet on the L1 bus 32 or the packet at the head of incoming queue 40 as the current transaction packet.
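The role of multiplexors 42 can be summarized in a couple of lines of code. This is an illustrative sketch only, with invented names; it simply selects between the packet currently driven on the L1 bus and the packet at the head of the device's incoming queue, under control of the repeater's incoming signal.

```python
from collections import deque

def select_current_packet(incoming_signal, l1_bus_packet, incoming_queue):
    """Sketch of multiplexor 42: when the repeater asserts the incoming signal,
    the device consumes the packet it queued earlier; otherwise it takes the
    packet being driven on its lower level bus."""
    return incoming_queue.popleft() if incoming_signal else l1_bus_packet

queue = deque(["P1(o)"])                               # stored when P1 was issued locally
print(select_current_packet(True, None, queue))        # originating node: P1 from the queue
print(select_current_packet(False, "P2(i)", deque()))  # non-originating node: packet from the bus
```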
In the above example, storing the outgoing packet P1(o) in the incoming queues 40A-B of all bus devices 38A-B in the originating node 30A frees up the L1.1 bus 32A to broadcast another outgoing packet while t