to any IO device. IO devices can DMA to and from all memory in the system, not just their local memory.

While the two processors share the same bus connected to the Hub, they do not function as a snoopy cluster. Instead they operate as two separate processors multiplexed over the single physical bus (done to save Hub pins). This is different from many other ccNUMA systems, where the node is an SMP cluster. Origin does not employ an SMP cluster in order to reduce both the local and remote memory latency, and to increase remote memory bandwidth. Local memory latency is reduced because the bus can be run at a much higher frequency when it needs to support only one or two processors than when it must support large numbers of processors. Remote memory latency is also reduced by a higher frequency bus, and in addition because a request made in a snoopy bus cluster must generally wait for the result of the snoop before being forwarded to the remote node[7]. Remote bandwidth can be lower in a system with an SMP cluster node if memory data is sent across the remote data bus before being sent to the network, as is commonly done in DSM systems with SMP-based nodes[7][8]. For remote requests, the data will traverse the data bus at both the remote node and at the local node of the requestor, leading to the remote bandwidth being one-half the local bandwidth. One of the major goals for the Origin system was to keep both absolute memory latency and the ratio of remote to local latency as low as possible and to provide remote memory bandwidth equal to local memory bandwidth in order to provide an easy migration path for existing SMP software. As we will show in the paper, the Origin system does accomplish both goals, whereas in Section 6 we see that the snoopy-bus clustered ccNUMA systems do not achieve all of these goals.
In addition to keeping the ratio of remote memory to local memory latency low, Origin also includes architectural features to address the NUMA aspects of the machine. First, a combination of hardware and software features are provided for effective page migration and replication. Page migration and replication is important as it reduces effective memory latency by satisfying a greater percentage of accesses locally. To support page migration Origin provides per-page hardware memory reference counters, contains a block copy engine that is able to copy data at near peak memory speeds, and has mechanisms for reducing the cost of TLB updates.
Other performance features of the architecture include a high-performance local and global interconnect design, coherence protocol features to minimize latency and bandwidth per access, and a rich set of synchronization primitives. The intra-node interconnect consists of a single Hub chip that implements a full four-way crossbar between processors, local memory, and the I/O and network interfaces. The global interconnect is based on a six-ported router chip configured in a multi-level fat-hypercube topology.
The coherence protocol supports a clean-exclusive state to minimize latency on read-modify-write operations. Further, it allows cache dropping of clean-exclusive or shared data without notifying the directory in order to minimize the impact on memory/directory bandwidth caused by directory coherence. The architecture also supports request forwarding to reduce the latency of interprocessor communication.
For effective synchronization in large systems, the Origin system provides fetch-and-op primitives on memory in addition to the standard MIPS load-linked/store-conditional (LL/SC) instructions. These operations greatly reduce the serialization for highly contended locks and barrier operations.
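As a rough illustration of why memory-side fetch-and-op helps under contention, the hedged C sketch below contrasts a conventional LL/SC-style atomic increment, which pulls the cache line into each contender's cache, with a hypothetical uncached fetch-and-increment address serviced at the home memory. The fetchop_inc_addr mapping and its load-with-side-effect semantics are assumptions made for illustration, not SGI's documented interface.

#include <stdint.h>

/* LL/SC style: __atomic_fetch_add lowers to an LL/SC retry loop on
 * MIPS, so every contender must obtain the cache line exclusively,
 * serializing on cache-to-cache transfers. */
static inline uint64_t llsc_fetch_inc(uint64_t *ctr)
{
    return __atomic_fetch_add(ctr, 1, __ATOMIC_RELAXED);
}

/* Fetch-and-op style: one uncached load per requestor; the increment
 * is assumed to happen at the home memory as a side effect, so the
 * line never migrates between processor caches. */
static inline uint64_t fetchop_fetch_inc(volatile uint64_t *fetchop_inc_addr)
{
    return *fetchop_inc_addr;   /* assumed hardware side effect: counter += 1 */
}

A barrier arrival counter, for example, could use either primitive; under heavy contention the fetch-and-op version is serviced at the memory's serialization rate rather than at cache-miss latency.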
Origin includes many features to enhance reliability and availability. All external cache SRAM and main memory and directory DRAM are protected by a SECDED ECC code. Furthermore, all high-speed router and I/O links are protected by a full CRC code and a hardware link-level protocol that detects and automatically retries faulty packets. Origin's modular design provides the overall basis for a highly available hardware architecture. The flexible routing network supports multiple paths between nodes, partial population of the interconnect, and the hot plugging of cabled links that permits the bypass, service, and reintegration of faulty hardware.

[Figure 2: SPIDER ASIC block diagram]
To address software availability in large systems, Origin provides access protection rights on both memory and IO devices. These access protection rights prevent unauthorized nodes from being able to modify memory or IO and allow an operating system to be structured into cells or partitions with containment of most failures to within a given partition[10][12].

3 The Origin Implementation

While existence proofs for the DSM architecture have been available in the academic community for some time[1][6], the key to commercial success of this architecture will be an aggressive implementation that provides for a truly scalable system with low memory latency and no unexpected bandwidth bottlenecks. In this section we explore how the Origin 2000 implementation meets this goal. We start by exploring the global interconnect of the system. We then present an overview of the cache coherence protocol, followed with a discussion of the node design. The IO subsystem is explored next, and then the various subsystems are tied together with the presentation of the product design. Finally, this section ends with a discussion of interesting performance features of the Origin system.
3.1 Network Topology

The interconnect employed in the Origin 2000 system is based on the SGI SPIDER router chip[4]. A block diagram of this chip is shown in Figure 2. The main features of the SPIDER chip are:
• six pairs of unidirectional links per router
• low latency (41 ns pin-to-pin) wormhole routing
• DAMQ buffer structures[4] with global arbitration to maximize utilization under load
• four virtual channels per physical channel
• congestion control allowing messages to adaptively switch between two virtual channels
• support for 256 levels of message priority with increased priority via packet aging
• CRC checking on each packet with retransmission on error via a go-back-n sliding window protocol
• software programmable routing tables

[Figure 3: 32P and 64P Bristled Hypercubes]
[Figure 4: 128P Hierarchical Fat Bristled Hypercube]
The Origin 2000 employs SPIDER routers to create a bristled fat hypercube interconnect topology. The network topology is bristled in that two nodes are connected to a single router instead of one. The fat hypercube comes into play for systems beyond 32 nodes (64 processors). For up to 32 nodes, the routers connect in a bristled hypercube as shown in Figure 3. The SPIDER routers are labeled using R; the nodes are the black boxes connecting to the routers. In the 32 processor configuration, the otherwise unused SPIDER ports, shown as dotted lines, are used for Express Links which connect the corners of the cube, thereby reducing latency and increasing bisection bandwidth.
Beyond 64 processors, a hierarchical fat hypercube is employed. Figure 4 shows the topology of a 128 processor Origin system. The vertices of four 32-processor hypercubes are connected to eight meta-routers. To scale up to 1024 processors, each of the single meta-routers in the 128 processor system is replaced with a 5-D hypercube.
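To make the wiring pattern concrete, the small C sketch below (invented for this description, not taken from the paper) shows how a bristled hypercube pairs two nodes onto one router and links routers whose IDs differ in exactly one bit.

#include <stdio.h>

int router_of_node(int node_id)           /* bristled: two nodes per router */
{
    return node_id / 2;
}

void print_router_links(int router_id, int n_routers)
{
    for (int bit = 1; bit < n_routers; bit <<= 1) {
        int neighbor = router_id ^ bit;   /* flip one dimension bit */
        if (neighbor < n_routers)
            printf("router %d <-> router %d\n", router_id, neighbor);
    }
}

int main(void)
{
    /* 32-processor system: 16 nodes, 8 routers, a 3-D cube of routers */
    print_router_links(router_of_node(5), 8);
    return 0;
}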
3.2 Cache Coherence Protocol

The cache coherence protocol employed in Origin is similar to the Stanford DASH protocol[6], but has several significant performance improvements. Like the DASH protocol, the Origin cache coherence protocol is non-blocking. Memory can satisfy any incoming request immediately; it never buffers requests while waiting for another message to arrive. The Origin protocol also employs the request forwarding of the DASH protocol for three party transactions. Request forwarding reduces the latency of requests which target a cache line that is owned by another processor.
The Origin coherence protocol has several enhancements over the DASH protocol. First, the Clean-exclusive (CEX) processor cache state (also known as the exclusive state in MESI) is fully supported by the Origin protocol. This state allows for efficient execution of read-modify-write accesses since there is only a single fetch of the cache line from memory. The protocol also permits the processor to replace a CEX cache line without notifying the directory. The Origin protocol is able to detect a rerequest by a processor that had replaced a CEX cache line and immediately satisfy that request from memory. Support of CEX state in this manner is very important for single process performance as much of the gains from the CEX state would be lost if directory bandwidth was needed each time a processor replaced a CEX line. By adding protocol complexity to allow for the "silent" CEX replacement, all of the advantages of the CEX state are realized.
The second enhancement of the Origin protocol over DASH is full support of upgrade requests which move a line from a shared to exclusive state without the bandwidth and latency overhead of transferring the memory data.
For handling incoming I/O DMA data, Origin employs a write-invalidate transaction that uses only a single memory write as opposed to the processor's normal write-allocate plus writeback. This transaction is fully cache coherent (i.e., any cache invalidations/interventions required by the directory are sent), and increases I/O DMA bandwidth by as much as a factor of two.
Origin's protocol is fully insensitive to network ordering. Messages are allowed to bypass each other in the network and the protocol detects and resolves all of these out-of-order message deliveries. This allows Origin to employ adaptive routing in its network to deal with network congestion.
The Origin protocol uses a more sophisticated network deadlock avoidance scheme than DASH. As in DASH, two separate networks are provided for requests and replies (implemented in Origin via different virtual channels). The Origin protocol does have requests which generate additional requests (these additional requests are referred to as interventions or invalidations). This request-to-request dependency could lead to deadlock in the request network. In DASH, this deadlock was broken by detecting a potential deadlock situation and sending negative acknowledgments (NAKs) to all requests which needed to generate additional requests to be serviced until the potential deadlock situation was resolved. In Origin, rather than sending NAKs in such a situation, a backoff intervention or invalidate is sent to the requestor on the reply network. The backoff message contains either the target of the intervention or the list of sharers to invalidate, and is used to signal the requestor that the memory was unable to generate the intervention or invalidation directly and therefore the requestor must generate that message instead. The requestor can always sink the backoff reply, which causes the requestor to then queue up the intervention or invalidate for injection into the request network as soon as the request network allows. The backoff intervention or invalidate changes the request-intervention-reply chain to two request-reply chains (one chain being the request-backoff message, one being the intervention-reply chain), with the two networks preventing deadlock on these two request-reply chains. The ability to generate backoff interventions and invalidations allows for better forward progress in the face of very heavily loaded systems, since the deadlock detection in both DASH and Origin is done conservatively based on local information, and a processor that receives a backoff is guaranteed that it will eventually receive the data, while a processor that receives a NAK must retry its request.
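A hedged sketch of the requestor-side handling of a backoff reply is shown below. The message types, structure fields, and queue_request() helper are invented for illustration and do not correspond to the actual Hub implementation.

#include <stdint.h>
#include <stdio.h>

typedef enum { INTERVENTION, INVALIDATE } fwd_msg_t;

/* Stub for illustration: enqueue a request-network message to send
 * when the request network has space. */
static void queue_request(fwd_msg_t type, int dest, uint64_t line, int reply_to)
{
    printf("queue %s to node %d, line %llx, reply to %d\n",
           type == INTERVENTION ? "intervention" : "invalidate",
           dest, (unsigned long long)line, reply_to);
}

typedef struct {
    int      is_backoff;     /* reply carries work to forward          */
    int      target_owner;   /* owner to intervene on, or -1           */
    uint64_t sharer_vector;  /* sharers to invalidate, or 0            */
    uint64_t line_addr;
} reply_t;

/* The requestor can always sink a backoff on the reply network, then
 * re-emit the intervention or invalidates on the request network,
 * splitting request->intervention->reply into two request/reply
 * chains, one per network, so neither network deadlocks. */
void handle_backoff_reply(reply_t r, int my_node)
{
    if (!r.is_backoff)
        return;                                    /* normal data reply */
    if (r.target_owner >= 0)
        queue_request(INTERVENTION, r.target_owner, r.line_addr, my_node);
    for (int n = 0; n < 64; n++)
        if (r.sharer_vector & (1ULL << n))
            queue_request(INVALIDATE, n, r.line_addr, my_node);
}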
Since the Origin system is able to maintain coherence over 1024 processors, it obviously employs a more scalable directory scheme than in DASH. For tracking sharers, Origin supports a bit-vector directory format with either 16 or 64 bits. Each bit represents a node, so with a single bit to node correspondence the directory can track up to a maximum of 128 processors. For systems with greater than 64 nodes, Origin dynamically selects between a full bit vector and a coarse bit vector[12] depending on where the sharers are located. This dynamic selection is based on the machine being divided into up to eight 64-node octants. If all the processors sharing the cache line are from the same octant, the full bit vector is used (in conjunction with a 3-bit octant identifier). If the processors sharing the cache line are from different octants, a coarse bit vector where each bit represents eight nodes is employed.
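The C sketch below illustrates the full/coarse selection just described. The entry layout and the conversion routine are invented for illustration and are not the actual Origin directory encoding.

#include <stdint.h>
#include <stdbool.h>

#define NODES_PER_OCTANT 64

typedef struct {
    bool     coarse;      /* true: each bit covers 8 nodes            */
    uint8_t  octant;      /* valid only for the full-vector format    */
    uint64_t bits;        /* sharer bit vector                        */
} dir_entry_t;

/* Add a sharer, falling back to the coarse format when sharers span octants. */
void dir_add_sharer(dir_entry_t *e, unsigned node)
{
    unsigned octant = node / NODES_PER_OCTANT;

    if (!e->coarse && e->bits != 0 && octant != e->octant) {
        /* Sharers now span octants: convert to coarse, 8 nodes per bit. */
        uint64_t coarse_bits = 0;
        for (unsigned n = 0; n < NODES_PER_OCTANT; n++)
            if (e->bits & (1ULL << n))
                coarse_bits |= 1ULL << ((e->octant * NODES_PER_OCTANT + n) / 8);
        e->coarse = true;
        e->bits   = coarse_bits;
    }

    if (e->coarse) {
        e->bits |= 1ULL << (node / 8);
    } else {
        e->octant = (uint8_t)octant;
        e->bits  |= 1ULL << (node % NODES_PER_OCTANT);
    }
}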
Finally, the coherence protocol includes an important feature for effective page migration known as directory poisoning. The use of directory poisoning will be discussed in more detail in Section 3.6.
A slightly simplified flow of the cache coherence protocol is now presented for read, read-exclusive, and writeback requests. We start with the basic flow for a read request.
1. Processor issues read request.
2. Read request goes across network to home memory (requests to local memory only traverse Hub).
3. Home memory does memory read and directory lookup.
4. If directory state is Unowned or Exclusive with requestor as owner, transitions to Exclusive and returns an exclusive reply to the requestor. Go to 5a.
   If directory state is Shared, the requesting node is marked in the bit vector and a shared reply is returned to the requestor. Go to 5a.
   If directory state is Exclusive with another owner, transitions to Busy-shared with requestor as owner and sends out an intervention shared request to the previous owner and a speculative reply to the requestor. Go to 5b.
   If directory state is Busy, a negative acknowledgment is sent to the requestor, who must retry the request. QED
5a. Processor receives exclusive or shared reply and fills cache in CEX or shared (SHD) state respectively. QED
5b. Intervention shared received by owner. If owner has a dirty copy it sends a shared response to the requestor and a sharing writeback to the directory. If owner has a clean-exclusive or invalid copy it sends a shared ack (no data) to the requestor and a sharing transfer (no data) to the directory.
6a. Directory receives sharing writeback or sharing transfer, updates memory (only if sharing writeback) and transitions to the shared state.
6b. Processor receives both speculative reply and shared response or ack. Cache filled in SHD state with data from response (if shared response) or data from speculative reply (if shared ack). QED
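The directory-side decision in steps 3 and 4 can be summarized by the hedged C sketch below. The state names follow the text, but the data layout and the send_msg() helper are invented and are not the Hub's protocol-table encoding.

#include <stdint.h>
#include <stdio.h>

typedef enum { UNOWNED, SHARED, EXCLUSIVE, BUSY_SHARED, BUSY_EXCLUSIVE } dir_state_t;
typedef enum { EXCLUSIVE_REPLY, SHARED_REPLY, SPECULATIVE_REPLY,
               INTERVENTION_SHARED, NAK } msg_t;

static void send_msg(int node, msg_t m)           /* stub for illustration */
{
    printf("send message %d to node %d\n", (int)m, node);
}

/* Directory handling of a read request (steps 3-4 above). */
void dir_handle_read(dir_state_t *state, int *owner, uint64_t *sharers,
                     int requestor)
{
    switch (*state) {
    case UNOWNED:
        *state = EXCLUSIVE;
        *owner = requestor;
        send_msg(requestor, EXCLUSIVE_REPLY);              /* go to 5a */
        break;
    case EXCLUSIVE:
        if (*owner == requestor) {
            /* owner silently dropped its CEX copy and is rerequesting */
            send_msg(requestor, EXCLUSIVE_REPLY);          /* go to 5a */
        } else {
            *state = BUSY_SHARED;
            send_msg(*owner, INTERVENTION_SHARED);         /* go to 5b */
            send_msg(requestor, SPECULATIVE_REPLY);
            *owner = requestor;
        }
        break;
    case SHARED:
        *sharers |= 1ULL << requestor;
        send_msg(requestor, SHARED_REPLY);                 /* go to 5a */
        break;
    default:                                               /* Busy states */
        send_msg(requestor, NAK);                          /* requestor retries */
        break;
    }
}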
The following list details the basic flow for a read-exclusive request.
1. Processor issues read-exclusive request.
2. Read-exclusive request goes across network to home memory (only traverses Hub if local).
3. Home memory does memory read and directory lookup.
4. If directory state is Unowned or Exclusive with requestor as owner, transitions to Exclusive and returns an exclusive reply to the requestor. Go to 5a.
   If directory state is Shared, transitions to Exclusive and an exclusive reply with invalidates pending is returned to the requestor. Invalidations are sent to the sharers. Go to 5b.
   If directory state is Exclusive with another owner, transitions to Busy-Exclusive with requestor as owner and sends out an intervention exclusive request to the previous owner and a speculative reply to the requestor. Go to 5c.
   If directory state is Busy, a negative acknowledgment is sent to the requestor, who must retry the request. QED
5a. Processor receives exclusive reply and fills cache in dirty exclusive (DEX) state. QED
5b. Invalidates received by sharers. Caches invalidated and invalidate acknowledgments sent to requestor. Go to 6a.
5c. Intervention exclusive received by owner. If owner has a dirty copy it sends an exclusive response to the requestor and a dirty transfer (no data) to the directory. If owner has a clean-exclusive or invalid copy it sends an exclusive ack to the requestor and a dirty transfer to the directory. Go to 6b.
6a. Processor receives exclusive reply with invalidates pending and all invalidate acks. (Exclusive reply with invalidates pending has count of invalidate acks to expect.) Processor fills cache in DEX state. QED
6b. Directory receives dirty transfer and transitions to the exclusive state with new owner.
6c. Processor receives both speculative reply and exclusive response or ack. Cache filled in DEX state with data from response (if exclusive response) or data from speculative reply (if exclusive ack). QED
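Step 6a implies the requestor must count invalidate acknowledgments before the line becomes usable. The sketch below shows that bookkeeping with invented fields; the real coherent request buffer (CRB) is hardware state whose layout is not given here.

#include <stdbool.h>

typedef struct {
    int  acks_expected;   /* from the "exclusive reply, invalidates pending" */
    int  acks_received;
    bool have_reply;
} crb_entry_t;

void on_exclusive_reply_inval_pending(crb_entry_t *crb, int ack_count)
{
    crb->have_reply    = true;
    crb->acks_expected = ack_count;   /* count carried in the reply */
}

void on_invalidate_ack(crb_entry_t *crb)
{
    crb->acks_received++;
}

/* Line may be filled in DEX state only once the reply and all acks arrive. */
bool rdex_complete(const crb_entry_t *crb)
{
    return crb->have_reply && crb->acks_received == crb->acks_expected;
}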
The flow for an upgrade (write hit to SHD state) is similar to the read-exclusive, except it only succeeds for the case where the directory is in the shared state (and the equivalent reply to the exclusive reply with invalidates pending does not need to send the memory data). In all other cases a negative acknowledgment is sent to the requestor in response to the upgrade request.
Finally, the flow for a writeback request is presented. Note that if a writeback encounters the directory in one of the busy states, this means that the writeback was issued before an intervention targeting the cache line being written back made it to the writeback issuer. This race is resolved in the Origin protocol by "bouncing" the writeback data off the memory as a response to the processor that caused the intervention, and sending a special type of writeback acknowledgment that informs the writeback issuer to wait for (and then ignore) the intervention in addition to the writeback acknowledgment.
1. Processor issues writeback request.
2. Writeback request goes across network to home memory (only traverses Hub if local).
3. Home memory does memory write and directory lookup.
4. If directory state is Exclusive with requestor as owner, transitions to Unowned and returns a writeback exclusive acknowledgment to the requestor. Go to 5a.
   If directory state is Busy-shared, transitions to Shared, and a shared response is returned to the owner marked in the directory. A writeback busy acknowledgment is also sent to the requestor. Go to 5b.
   If directory state is Busy-exclusive, transitions to Exclusive, and an exclusive response is returned to the owner marked in the directory. A writeback busy acknowledgment is also sent to the requestor. Go to 5b.
5a. Processor receives writeback exclusive acknowledgment. QED
5b. Processor receives both a writeback busy acknowledgment and an intervention. QED
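The directory-side view of the writeback race is sketched below in the same hedged style: the state and message names mirror the flow above, while the send_wb() helper and argument layout are invented for illustration.

#include <stdio.h>

typedef enum { UNOWNED, SHARED, EXCLUSIVE, BUSY_SHARED, BUSY_EXCLUSIVE } dir_state_t;
typedef enum { WB_EXCLUSIVE_ACK, WB_BUSY_ACK, SHARED_RESPONSE, EXCLUSIVE_RESPONSE } wb_msg_t;

static void send_wb(int node, wb_msg_t m, const void *data)   /* stub */
{
    printf("send message %d to node %d (data %p)\n", (int)m, node, data);
}

/* Directory handling of a writeback. In the Busy states an intervention
 * for the same line is already in flight, so the writeback data is
 * "bounced" to the owner marked in the directory as the response that
 * intervention would have produced, and the writer is told (via the
 * busy ack) to expect and then ignore the stale intervention. */
void dir_handle_writeback(dir_state_t *state, int marked_owner,
                          int writer, const void *data)
{
    switch (*state) {
    case EXCLUSIVE:                       /* no race: normal case, step 5a */
        *state = UNOWNED;
        send_wb(writer, WB_EXCLUSIVE_ACK, NULL);
        break;
    case BUSY_SHARED:                     /* raced with a read, step 5b */
        *state = SHARED;
        send_wb(marked_owner, SHARED_RESPONSE, data);
        send_wb(writer, WB_BUSY_ACK, NULL);
        break;
    case BUSY_EXCLUSIVE:                  /* raced with a read-exclusive, step 5b */
        *state = EXCLUSIVE;
        send_wb(marked_owner, EXCLUSIVE_RESPONSE, data);
        send_wb(writer, WB_BUSY_ACK, NULL);
        break;
    default:
        break;
    }
}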
3.3 Node Design

The design of an Origin node fits on a single 16" x 11" printed circuit board. A drawing of the Origin node board is shown in Figure 5. At the bottom of the board are two R10000 processors with their secondary caches. The R10000 is a four-way out-of-order superscalar processor[14]. Current Origin systems run the processor at 195 MHz and contain 4 MB secondary caches. Each processor and its secondary cache is mounted on a horizontal in-line memory module (HIMM) daughter card. The HIMM is parallel to the main node card and connects via low-inductance fuzz-button processor and HIMM interposers. The system interface buses of the R10000s are connected to the Hub chip. The Hub chip also has connections to the memory and directory on the node board, and has two ports that exit the node board via the 300-pin CPOP (compression pad-on-pad) connector. These two ports are the Craylink connection to the router network and the XIO connection to the IO subsystem.

[Figure 5: An Origin node board]
As was mentioned in Section 3.2, a 16 bit-vector directory format and a 64 bit-vector format are supported by the Origin system. The directory that implements the 16-bit vector format is located on the same DIMMs as main memory. For systems larger than 32 processors, additional expansion directory is needed. These expansion directory slots, shown to the left of the Hub chip in Figure 5, operate by expanding the width of the standard directory included on the main memory boards. The Hub chip operates on standard 16-bit directory entries by converting them to expanded entries upon their entry into the Hub chip. All directory operations within the Hub chip are done on the expanded directory entries, and the results are then converted back to standard entries before being written back to the directory memory. Expanded directory entries obviously bypass the conversion stages.
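Purely to illustrate the widen-on-read, narrow-on-write flow described above, here is a minimal sketch; the real entry encodings are not given in the text, so the fields and zero-extension below are assumptions.

#include <stdint.h>

typedef struct {            /* invented internal working format */
    uint8_t  state;
    uint64_t sharers;
} expanded_entry_t;

/* Widen a standard 16-bit-vector entry on the way into the Hub. */
static expanded_entry_t expand_entry(uint16_t sharers16, uint8_t state)
{
    expanded_entry_t e = { .state = state, .sharers = sharers16 };
    return e;               /* zero-extend the 16-node sharer vector */
}

/* Narrow back to the standard format before writing to directory memory. */
static uint16_t shrink_entry(expanded_entry_t e)
{
    return (uint16_t)e.sharers;   /* upper bits unused on small systems */
}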
Figure 6 shows a block diagram of the Hub chip. The Hub chip is divided into five major sections: the crossbar (XB), the IO interface (II), the network interface (NI), the processor interface (PI), and the memory and directory interface (MD). All the interfaces communicate with each other via FIFOs that connect to the crossbar.

[Figure 6: Hub ASIC block diagram]

The IO interface contains the translation logic for interfacing to the XIO IO subsystem. The XIO subsystem is based on the same low-level signalling protocol as the Craylink network (and uses the same interface block to the XIO pins as in the SPIDER router of Figure 2), but utilizes a different higher level message protocol. The IO section also contains the logic for two block transfer engines (BTEs) which are able to do memory to memory copies at near the peak of a node's memory bandwidth. It also implements the IO request tracking portion of the cache coherence protocol via the IO request buffers (IRB) and the IO protocol table. The IRB tracks both full and partial cache line DMA requests by IO devices as well as full cache line requests by the BTEs.
The network interface takes messages from the II, PI, and MD and sends them out on the Craylink network. It also receives incoming messages for the MD, PI, II, and local Hub registers from the Craylink network. Routing tables for outgoing messages are provided in the NI as the software programmable routing of the SPIDER chip is pipelined by one network hop[4]. The NI is also responsible for taking a compact intra-Hub version of the invalidation message resulting from a coherence operation (a bit-vector representation) and generating the multiple unicast invalidate messages required by that message.
The processor interface contains the logic for implementing the request tracking for both processors. Read and write requests are tracked via a coherent request buffer (CRB), with one CRB per processor. The PI also includes the protocol table for its portion of the cache coherence protocol. The PI also has logic for controlling the flow of requests to and from the R10000 processors and contains the logic for generating interrupts to the processors.
Finally, the memory/directory section contains logic for sequencing the external memory and directory synchronous DRAMs (SDRAMs). Memory on a node is banked 4-32 way depending on how many memory DIMMs are populated. Requests to different banks and requests to the same page within a bank as the previous request can be serviced at minimum latency and full bandwidth. Directory operations are performed in parallel with the memory data access. A complete directory entry (and page reference counter, as will be discussed in Section 3.6) read-modify-write can be performed in the same amount of time it takes to fetch the 128B cache line from memory. The MD performs the directory portion of the cache coherence protocol via its protocol table and generates the appropriate requests and/or replies for all incoming messages. The MD also contains a small fetch-and-op cache which sits in front of the memory. This fetch-and-op cache allows fetch-and-op variables that hit in the cache to be updated at the minimum
network reply serialization rate of 41 ns instead of at the much slower SDRAM read-modify-write timing.
Note that all the protocol tables in the Hub are hard-wired. While programmable protocol engines can come close to achieving the performance of a hard-wired protocol state machine[5][9], we opted for hard-wiring the protocol to minimize latency and maximize bandwidth. We were also concerned about the variability in latency and bandwidth given the caching of directory information used by most programmable approaches. To ensure that the cache coherence protocol implemented in the tables was correct, we employed formal verification[3]. Formal verification worked extremely well; no bugs have been found in the Origin cache coherence protocol since the formal verification was completed.
The raw data bandwidth of the Hub chip ports is listed in Table 1. A summary of the sizes of the units is shown in Table 2. Note that most of the chip is allocated either to interfacing to the IO subsystem or to the crossbar itself, rather than to implementing global cache coherence.

[Table 1: Hub ASIC port bandwidths]
[Table 2: Hub ASIC gate count]

3.4 IO Subsystem

Not too surprisingly, the Origin system also utilizes crossbars in its IO subsystem. Figure 7 shows one possible configuration of IO cards connected to two nodes. Using the same link technology as in the Craylink interconnect, each Hub link provides a peak of 1.56 GB/sec of bandwidth to the six XIO cards connected to it (actually limited to half this amount if only local memory bandwidth is considered). At the heart of the IO subsystem is the Crossbow (Xbow) ASIC, which has many similarities with the SPIDER router. The primary difference between the Xbow and the router is a simplification of the Xbow buffering and arbitration protocols given the chip's more limited configuration. These simplifications reduce costs and permit eight ports to be integrated on a single chip. Some of the main features of the Xbow are:
• eight XIO ports, connected in Origin to 2 nodes and 6 XIO cards
• two virtual channels per physical channel
• low latency wormhole routing
• support for allocated bandwidth of messages from particular devices
• CRC checking on each packet with retransmission on error via a go-back-n sliding window protocol

[Figure 7: Example IO subsystem block diagram]

The Crossbow has support in its arbiter for allocating a portion of the bandwidth to a given IO device. This feature is important for certain system applications such as video on demand.
A large number of XIO cards are available to connect to the Crossbow. Table 3 contains a listing of the common XIO cards. The highest performance XIO cards connect directly to the XIO, but most of the cards bridge XIO to an embedded PCI bus with multiple external interfaces. The IO bandwidth together with this integration provides IO performance which is effectively added a PCI bus at a time versus individual PCI cards.

[Table 3: Origin IO boards]

3.5 Product Design

The Origin 2000 is a highly modular design. The basic building block is the deskside module, which has slots for 4 node boards, 2 router boards, and 12 XIO boards. The module also includes a CDROM and up to 5 Ultra SCSI devices. Figure 8 shows a block diagram of the deskside module, while Figure 9 shows a rear-view perspective of a deskside system. The system has a central midplane, which has two Crossbow chips mounted on it. The 4 node and 12 XIO boards plug into the midplane from the rear of the system, while the 2 router boards, the power supply and the UltraSCSI devices plug into the midplane from the front of the system.

[Figure 8: Deskside module block diagram]
[Figure 9: Deskside module, rear view]

A module can be used as a stand-alone deskside system or two modules (without the deskside plastic skins) can be mounted in a rack to form a 16 processor system. In addition to the two modules, the rack also includes a disk bay for up to 8 additional disks. One of the modules can be replaced with an InfiniteReality graphics module or with 4 additional 8-disk bays. An Origin Vault which contains 9 8-disk bays in a single rack is also available. Figure 10 depicts a configured rack supporting 16 processors, 24 XIO boards, and 18 UltraSCSI devices.
[Figure 10: 16 processor Origin system]
3.6 Performance Features

The Origin system has two features very important for achieving good performance in a highly scalable system. First, fetch-and-op primitives are provided as uncached operations that occur at the memory. Fetch-and-op variables are used for highly contended locks, barriers, and other synchronization mechanisms. The typical serialization rate (the rate at which a stream of requests can be serviced) for fetch-and-op variables is 41 ns. In Section 4.1 we will show how fetch-and-op variables can improve the performance of highly contended objects.
Second, Origin provides hardware and software support for page migration. Page migration is important for NUMA systems as it changes many of the cache misses which would have gone to remote memory into local misses. To help the OS in determining when and which page to migrate, the Origin system provides an array of per-page memory reference counters, which are stored in the directory memory. This array is indexed by the nodes in a system (up to 64 nodes; beyond this, 8 nodes share a single counter). When a request comes in, its reference counter is read out during the directory lookup and incremented. In addition, the reference counter of the home node is read out during the same directory lookup. The requestor's count and home count are compared and if the difference exceeds a software programmable threshold register (and the migration control bits stored with the requestor's reference counter say that this page is a candidate for migration), an interrupt is generated to the home node. This interrupt signals a potential migration candidate to the operating system.
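A hedged sketch of that migration hint check follows; the register and counter layout are invented, and the real comparison happens in the directory hardware during the lookup rather than in software.

#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint16_t count[64];   /* one counter per node (8 nodes per counter beyond 64) */
    bool     migratable;  /* migration control bit for this page */
} page_refcnt_t;

/* Returns true when the home node should be interrupted with a
 * migration hint for this page. */
bool migration_hint(page_refcnt_t *p, int requestor_node, int home_node,
                    uint16_t threshold)
{
    uint16_t req_cnt  = ++p->count[requestor_node];   /* counted with the lookup */
    uint16_t home_cnt = p->count[home_node];

    return p->migratable &&
           req_cnt > home_cnt &&
           (uint16_t)(req_cnt - home_cnt) > threshold;
}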
When the operating system determines it does indeed want to migrate the page[13], two operations need to be performed. First, the OS needs to copy the page from its current location to a free memory page on the requestor's node. Second, the OS needs to
invalidate all the translations to the old page cached in processors' TLBs and then update the translation for the migrated page to point to the new page.

Table 4: Origin 2000 latencies
Memory level               Latency (ns)
L1 cache                   5.1
L2 cache                   56.4
local memory               310
4P avg. remote memory      540
8P avg. remote memory      707
16P avg. remote memory     726
32P avg. remote memory     773
64P avg. remote memory     867
128P avg. remote memory    945

[Figure 11: STREAM results, one thread per node]
The block transfer engine allows a 16 KB page to be copied from one node's memory to another in under 30 microseconds. Unfortunately, in a very large Origin system, the cost to invalidate all the TLBs and update the translation using a conventional TLB shootdown algorithm can be 100 microseconds or more, removing much of the benefit of providing a fast memory to memory copy. Recent page migration research has also identified TLB shootdown as a significant cost of page migration[11]. To solve the TLB update problem, the directory supports a block transfer copy mode known as directory poisoning, which works as follows.
During the read phase of the poisoning block co