The Stanford FLASH Multiprocessor
Jeffrey Kuskin, David Ofelt, Mark Heinrich, John Heinlein, Richard Simoni, Kourosh Gharachorloo, John Chapin, David Nakahira, Joel Baxter, Mark Horowitz, Anoop Gupta, Mendel Rosenblum, and John Hennessy

Computer Systems Laboratory
Stanford University
Stanford, CA 94305
Abstract

The FLASH multiprocessor efficiently integrates support for cache-coherent shared memory and high-performance message passing, while minimizing both hardware and software overhead. Each node in FLASH contains a microprocessor, a portion of the machine's global memory, a port to the interconnection network, an I/O interface, and a custom node controller called MAGIC. The MAGIC chip handles all communication both within the node and among nodes, using hardwired data paths for efficient data movement and a programmable processor optimized for executing protocol operations. The use of the protocol processor makes FLASH very flexible - it can support a variety of different communication mechanisms - and simplifies the design and implementation.

This paper presents the architecture of FLASH and MAGIC, and discusses the base cache-coherence and message-passing protocols. Latency and occupancy numbers, which are derived from our system-level simulator and our Verilog code, are given for several common protocol operations. The paper also describes our software strategy and FLASH's current status.
1 Introduction

The two architectural techniques for communicating data among processors in a scalable multiprocessor are message passing and distributed shared memory (DSM). Despite significant differences in how programmers view these two architectural models, the underlying hardware mechanisms used to implement these approaches have been converging. Current DSM and message-passing multiprocessors consist of processing nodes interconnected with a high-bandwidth network. Each node contains a node processor, a portion of the physically distributed memory, and a node controller that connects the processor, memory, and network together. The principal difference between message-passing and DSM machines is in the protocol implemented by the node controller for transferring data both within and among nodes.

Perhaps more surprising than the similarity of the overall structure of these types of machines is the commonality in functions performed by the node controller. In both cases, the primary performance-critical function of the node controller is the movement of data at high bandwidth and low latency among the processor, memory, and network. In addition to these existing similarities, the architectural trends for both styles of machine favor further convergence in both the hardware and software mechanisms used to implement the communication abstractions. Message-passing machines are moving to efficient support of short messages and a uniform address space, features normally associated with DSM machines. Similarly, DSM machines are starting to provide support for message-like block transfers (e.g., the Cray T3D), a feature normally associated with message-passing machines.

The efficient integration and support of both cache-coherent shared memory and low-overhead user-level message passing is the primary goal of the FLASH (FLexible Architecture for SHared memory) multiprocessor. Efficiency involves both low hardware overhead and high performance. A major problem of current cache-coherent DSM machines (such as the earlier DASH machine [LLG+92]) is their high hardware overhead, while a major criticism of current message-passing machines is their high software overhead for user-level message passing. FLASH integrates and streamlines the hardware primitives needed to provide low-cost and high-performance support for global cache coherence and message passing. We aim to achieve this support without compromising the protection model or the ability of an operating system to control resource usage. The latter point is important since we want FLASH to operate well in a general-purpose multiprogrammed environment with many users sharing the machine as well as in a traditional supercomputer environment.
To accomplish these goals we are designing a custom node controller. This controller, called MAGIC (Memory And General Interconnect Controller), is a highly integrated chip that implements all data transfers both within the node and between the node and the network.
To deliver high performance, the MAGIC chip contains a specialized data path optimized to move data between the memory, network, processor, and I/O ports in a pipelined fashion without redundant copying. To provide the flexible control needed to support a variety of DSM and message-passing protocols, the MAGIC chip contains an embedded processor that controls the data path and implements the protocol. The separate data path allows the processor to update the protocol data structures (e.g., the directory for cache coherence) in parallel with the associated data transfers.

Figure 2.1. FLASH system architecture.

This paper describes the FLASH design and rationale. Section 2 gives an overview of FLASH. Section 3 briefly describes two example protocols, one for cache-coherent shared memory and one for message passing. Section 4 presents the microarchitecture of the MAGIC chip. Section 5 briefly presents our system software strategy and Section 6 presents our implementation strategy and current status. Section 7 discusses related work and we conclude in Section 8.
2 FLASH Architecture Overview

FLASH is a single-address-space machine consisting of a large number of processing nodes connected by a low-latency, high-bandwidth interconnection network. Every node is identical (see Figure 2.1), containing a high-performance off-the-shelf microprocessor with its caches, a portion of the machine's distributed main memory, and the MAGIC node controller chip. The MAGIC chip forms the heart of the node, integrating the memory controller, I/O controller, network interface, and a programmable protocol processor. This integration allows for low hardware overhead while supporting both cache-coherence and message-passing protocols in a scalable and cohesive fashion.¹

1. Our decision to use only one compute processor per node rather than multiple processors was driven mainly by pragmatic concerns. Using only one processor considerably simplifies the node design, and given the high bandwidth requirements of modern processors, it was not clear that we could support multiple processors productively. However, nothing in our approach precludes the use of multiple processors per node.
The MAGIC architecture is designed to offer both flexibility and high performance. First, MAGIC includes a programmable protocol processor for flexibility. Second, MAGIC's central location within the node ensures that it sees all processor, network, and I/O transactions, allowing it to control all node resources and support a variety of protocols. Third, to avoid limiting the node design to any specific protocol and to accommodate protocols with varying memory requirements, the node contains no dedicated protocol storage; instead, both the protocol code and protocol data reside in a reserved portion of the node's main memory. However, to provide high-speed access to frequently-used protocol code and data, MAGIC contains on-chip instruction and data caches. Finally, MAGIC separates data movement logic from protocol state manipulation logic. The hardwired data movement logic achieves low latency and high bandwidth by supporting highly-pipelined data transfers without extra copying within the chip. The protocol processor employs a hardware dispatch table to help service requests quickly, and a coarse-level pipeline to reduce protocol processor occupancy. This separation and specialization of data transfer and control logic ensures that MAGIC does not become a latency or bandwidth bottleneck.
FLASH nodes communicate by sending intra- and inter-node commands, which we refer to as messages. To implement a protocol on FLASH, one must define what kinds of messages will be exchanged (the message types), and write the corresponding code sequences for the protocol processor (the handlers); a sketch of this structure follows below. Each handler performs the necessary actions based on the machine state and the information in the message it receives. Handler actions include updating machine state, communicating with the local processor, and communicating with other nodes via the network.
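
As a minimal sketch of this structure (our own illustration, not the actual MAGIC handler interface; all names, fields, and types are hypothetical), a protocol reduces to an enumeration of message types plus one handler per type, selected through a dispatch table:

    /* Hypothetical sketch: a protocol is a set of message types plus
     * one handler per type, selected by table lookup.  Names and
     * fields are illustrative only. */

    typedef enum {
        MSG_READ_REQ, MSG_READ_REPLY, MSG_INVAL_REQ, MSG_INVAL_ACK,
        MSG_TYPE_COUNT
    } msg_type_t;

    typedef struct {
        msg_type_t type;       /* field the dispatch is keyed on          */
        int        src_node;   /* sending node                            */
        unsigned   addr;       /* cache line the message refers to        */
        int        data_buf;   /* on-chip data buffer number, if any      */
    } msg_header_t;

    typedef void (*handler_t)(const msg_header_t *hdr);

    /* Handlers update machine state, talk to the local processor, or
     * send further messages into the network (stubs here). */
    static void read_req_handler(const msg_header_t *hdr)   { (void)hdr; }
    static void read_reply_handler(const msg_header_t *hdr) { (void)hdr; }
    static void inval_req_handler(const msg_header_t *hdr)  { (void)hdr; }
    static void inval_ack_handler(const msg_header_t *hdr)  { (void)hdr; }

    static handler_t dispatch_table[MSG_TYPE_COUNT] = {
        [MSG_READ_REQ]   = read_req_handler,
        [MSG_READ_REPLY] = read_reply_handler,
        [MSG_INVAL_REQ]  = inval_req_handler,
        [MSG_INVAL_ACK]  = inval_ack_handler,
    };

    static void dispatch(const msg_header_t *hdr)
    {
        dispatch_table[hdr->type](hdr);  /* each invocation runs uninterrupted */
    }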
Multiple protocols can be integrated efficiently in FLASH by ensuring that messages in different protocols are assigned different message types. The handlers for the various protocols then can be dispatched as efficiently as if only a single protocol were resident on the machine. Moreover, although the handlers are dynamically interleaved, each handler invocation runs without interruption on MAGIC's embedded processor, easing the concurrent sharing of state and other critical resources. MAGIC also provides protocol-independent deadlock avoidance support, allowing multiple protocols to coexist without deadlocking the machine or having other negative interactions.

Since FLASH is designed to scale to thousands of processing nodes, a comprehensive protection and fault containment strategy is needed to assure acceptable system availability. At the user level, the virtual memory system provides protection against application software errors. However, system-level errors such as operating system bugs and hardware faults require a separate fault detection and containment mechanism. The hardware and operating system cooperate to identify, isolate, and contain these faults. MAGIC provides a hardware-based "firewall" mechanism that can be used to prevent certain operations (memory writes, for example) from occurring on unauthorized addresses. Error-detection codes ensure data integrity: ECC protects main memory and CRCs protect network traffic. Errors are reported to the operating system, which is responsible for taking suitable action.
3 FLASH Protocols

This section presents a base cache-coherence protocol and a base block-transfer protocol we have designed for FLASH. We use the term "base" to emphasize that these two protocols are simply the ones we chose to implement first; Section 3.3 discusses protocol extensions and alternatives.
3.1 Cache Coherence Protocol

The base cache-coherence protocol is directory-based and has two components: a scalable directory data structure, and a set of handlers. For a scalable directory structure, FLASH uses dynamic pointer allocation [Simoni92], illustrated in Figure 3.1. In this scheme, each cache line-sized block - 128 bytes in the prototype - of main memory is associated with an 8-byte state word called a directory header, which is stored in a contiguous section of main memory devoted solely to the cache-coherence protocol. Each directory header contains some boolean flags and a link field that points to a linked list of sharers. For efficiency, the first element of the sharer list is stored in the directory header itself. If a block of memory is cached by more than one processor, additional memory for its list of sharers is allocated from the pointer/link store. Like the directory headers, the pointer/link store is also a physically contiguous region of main memory. Each entry in the pointer/link store consists of a pointer to the sharing processor, a link to the next entry in the list, and an end-of-list bit. A free list is used to track the available entries in the pointer/link store. Pointer/link store entries are allocated from the free list as cache misses are satisfied, and are returned to the free list either when the line is written and invalidations are sent to each cache on the list of sharers, or when a processor notifies the directory that it is no longer caching a block.²
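
One plausible C rendering of the layout just described is sketched below; the field names, widths, and array sizes are our own assumptions for illustration, not the exact FLASH encoding:

    /* Sketch of the dynamic pointer allocation structures described above.
     * Field names, widths, and sizes are illustrative assumptions. */
    #include <stdint.h>

    typedef struct {
        uint8_t  flags;       /* boolean protocol state bits                  */
        uint16_t first_ptr;   /* first sharer, kept in the header itself      */
        uint32_t link;        /* index of the first pointer/link store entry  */
    } dir_header_t;           /* 8-byte state word, one per 128-byte line     */

    typedef struct {
        uint16_t ptr;         /* sharing processor                            */
        uint16_t end_of_list; /* last entry in this line's sharer list?       */
        uint32_t link;        /* next entry in the list (or on the free list) */
    } ptr_link_entry_t;

    #define LOCAL_MEM_LINES  (1u << 20)   /* illustrative sizes only */
    #define PTR_LINK_ENTRIES (1u << 18)   /* sized to total cache, not memory */

    /* Both structures live in contiguous, reserved regions of main memory;
     * unused pointer/link entries are chained on a free list. */
    static dir_header_t     dir_headers[LOCAL_MEM_LINES];
    static ptr_link_entry_t ptr_link_store[PTR_LINK_ENTRIES];
    static uint32_t         free_list_head;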
A significant advantage of dynamic pointer allocation is that the directory storage requirements are scalable. The amount of memory needed for the directory headers is proportional to the local memory per node, and scales as more processors are added. The total amount of memory needed in the machine for the pointer/link store is proportional to the total amount of cache in the system. Since the amount of cache is much smaller than the amount of main memory, the size of the pointer/link store is sufficient to maintain full caching information, as long as the loading on the different memory modules is uniform. When this uniformity does not exist, a node can run out of pointer/link storage. While a detailed discussion is beyond the scope of this paper, several heuristics can be used in this situation to ensure reasonable performance. Overall, the directory occupies 7% to 9% of main memory, depending on system configuration.
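
A rough accounting of that figure, using the sizes given above (this breakdown is our own reading, not one stated explicitly in the paper): the directory headers alone cost 8 bytes per 128-byte line, or 8/128 = 6.25% of main memory, so of the quoted 7% to 9% total, roughly the remaining 1% to 3% is the pointer/link store, whose size tracks the total cache in the system rather than the total main memory.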
Apart from the data structures used to maintain directory information, the base cache-coherence protocol is similar to the DASH protocol [LLG+90]. Both protocols utilize separate request and reply networks to eliminate request-reply cycles in the network. Both protocols forward dirty data from a processor's cache directly to a requesting processor, and both protocols use negative acknowledgments to avoid deadlock and to cause retries when a requested line is in a transient state. The main difference between the two protocols is that in DASH each cluster collects its own invalidation acknowledgments, whereas in FLASH invalidation acknowledgments are collected at the home node, that is, the node where the directory data is stored for that block.

2. The base cache-coherence protocol relies on replacement hints. The protocol could be modified to accommodate processors which do not provide these hints.
Figure 3.1. Data structures for the dynamic pointer allocation directory scheme.
Avoiding deadlock is difficult in any cache-coherence protocol. Below we discuss how the base protocol handles the deadlock problem, and illustrate some of the protocol-independent deadlock avoidance mechanisms of the MAGIC architecture. Although this discussion focuses on the base cache-coherence protocol, any protocol run on FLASH can use these mechanisms to eliminate the deadlock problem.

As a first step, the base protocol divides all messages into requests (e.g., read, read-exclusive, and invalidate requests) and replies (e.g., read and read-exclusive data replies, and invalidation acknowledgments). Second, the protocol uses the virtual lane support in the network routers to transmit requests and replies over separate logical networks. Next, it guarantees that replies can be sunk, that is, replies generate no additional outgoing messages. This eliminates the possibility of request-reply circular dependencies. To break request-request cycles, requests that cannot be sunk may be negatively acknowledged, effectively turning those requests into replies.

The final requirement for a deadlock solution is a restriction placed on all handlers: they must yield the protocol processor if they cannot run to completion. If a handler violates this constraint and stalls waiting for space on one of its output queues, the machine could potentially deadlock because it is no longer servicing messages from the network. To avoid this type of deadlock, the scheduling mechanism for the incoming queues is initialized to indicate which incoming queues contain messages that may require outgoing queue space. The scheduler will not select an incoming queue unless the corresponding outgoing queue space requirements are satisfied.
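
The scheduling rule can be pictured as a simple gate on each incoming queue. The sketch below is our own pseudocode-level rendering in C, not the inbox's actual logic; the reservation table and its fields are assumptions:

    /* Illustrative rendering of the inbox scheduling rule described above. */

    typedef struct { int space_free; } queue_t;

    typedef struct {
        queue_t *in;            /* an incoming queue                          */
        queue_t *out_needed;    /* outgoing queue its messages may require,   */
                                /* or NULL if its messages can always be sunk */
        int      min_out_space; /* space that must be free before scheduling  */
    } sched_entry_t;

    /* Select an incoming queue only if its outgoing-space requirement holds. */
    static int schedulable(const sched_entry_t *e)
    {
        return e->out_needed == 0 ||
               e->out_needed->space_free >= e->min_out_space;
    }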
However, in some cases, the number of outgoing messages a handler will send cannot be determined beforehand, preventing the scheduler from ensuring adequate outgoing queue space for these handlers. For example, an incoming request (for which only outgoing reply queue space is guaranteed) may need to be forwarded to a dirty remote node. If at this point the outgoing request queue is full, the protocol processor negatively acknowledges the incoming request, converting it into a reply. A second case not handled by the scheduler is an incoming write miss that is scheduled and finds that it needs to send N invalidation requests into the network. Unfortunately, the outgoing request queue may have fewer than N spots available. As stated above, the handler cannot simply wait for space to free up in the outgoing request queue to send the remaining invalidations. To solve this problem, the protocol employs the software queue where it can suspend messages to be rescheduled at a later time.

The software queue is a reserved region of main memory that any protocol can use to suspend message processing temporarily. For instance, each time MAGIC receives a write request to a shared line, the corresponding handler reserves space in the software queue for possible rescheduling. If the queue is already full, the incoming request is simply negatively acknowledged. This case should be extremely rare. If the handler discovers that it needs to send N invalidations, but only M < N spots are available in the outgoing request queue, the handler sends M invalidate requests and then places itself on the software queue. The list of sharers at this point contains only those processors that have not been invalidated. When the write request is rescheduled off of the software queue, the new handler invocation continues sending invalidation requests where the old one left off.
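
To make the suspend/resume pattern concrete, the following sketch is our own rendering of the "send M of N invalidations, then suspend" behavior; the helper functions (outgoing_req_space, send_invalidate, swq_suspend) and the state record are hypothetical:

    /* Sketch of a write-miss handler that sends as many invalidations as
     * the outgoing request queue allows and parks the rest on the software
     * queue.  All helpers and types are hypothetical. */

    extern int  outgoing_req_space(void);     /* free slots in request queue  */
    extern void send_invalidate(int sharer);  /* enqueue one invalidate       */
    extern void swq_suspend(void *state);     /* park handler on software q   */

    struct wr_miss_state {
        int *sharers;       /* sharers still to be invalidated                */
        int  remaining;     /* N at first invocation, smaller on each resume  */
    };

    /* Invoked both for the original write miss and when it is rescheduled
     * off the software queue; it continues where the last invocation stopped. */
    void write_miss_handler(struct wr_miss_state *s)
    {
        int m = outgoing_req_space();         /* only M slots may be free     */
        while (s->remaining > 0 && m-- > 0) {
            send_invalidate(s->sharers[0]);
            s->sharers++;                     /* sharer list now holds only   */
            s->remaining--;                   /* uninvalidated processors     */
        }
        if (s->remaining > 0)
            swq_suspend(s);                   /* never stall; resume later    */
    }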
3.2 Message Passing Protocol

In FLASH, we distinguish long messages, used for block transfer, from short messages, such as those required for synchronization. This section discusses the block transfer mechanism; Section 3.3 discusses short messages.
The design of the block transfer protocol was driven by three main goals: provide user-level access to block transfer without sacrificing protection; achieve transfer bandwidth and latency comparable to a message-passing machine containing dedicated hardware support for this task; and operate in harmony with other key attributes of the machine including cache coherence, virtual memory, and multiprogramming [HGG94]. We achieve high performance because MAGIC efficiently streams data to the receiver. The performance is further improved by the elimination of processor interrupts and system calls in the common case, and by the avoidance of extra copying of message data.
To distinguish a user-level message from the low-level messages MAGIC sends between nodes, this section explicitly refers to the former as a user message. Sending a user message in FLASH logically consists of three phases: initiation, transfer, and reception/completion.

To send a user message, an application process calls a library routine to communicate the parameters of the user-level message to MAGIC. This communication happens using a series of uncached writes to special addresses (which act as memory-mapped commands). Unlike standard uncached writes, these special writes invoke a different handler that accumulates information from the command into a message description record in MAGIC's memory. The final command is an uncached read, to which MAGIC replies with a value indicating if the message is accepted. Once the message is accepted, MAGIC invokes a transfer handler that takes over responsibility for transferring the user message to its destination, allowing the main processor to run in parallel with the message transfer.
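
A user-level send library routine might therefore look roughly like the sketch below. It is only an illustration of the memory-mapped command style the text describes; the command addresses, field order, and return code are hypothetical placeholders:

    /* Sketch of the initiation phase: parameters are handed to MAGIC with
     * uncached writes to special addresses, and a final uncached read
     * returns whether the message was accepted.  Addresses and fields are
     * hypothetical. */
    #include <stdint.h>

    #define MSG_CMD_BASE ((volatile uint64_t *)0x90000000UL) /* assumed mapping */
    #define MSG_CMD_DEST (MSG_CMD_BASE + 0)
    #define MSG_CMD_ADDR (MSG_CMD_BASE + 1)
    #define MSG_CMD_LEN  (MSG_CMD_BASE + 2)
    #define MSG_CMD_GO   (MSG_CMD_BASE + 3)   /* read triggers accept/reject */

    int user_msg_send(int dest_node, uint64_t src_addr, uint64_t len)
    {
        /* Each uncached write invokes a MAGIC handler that accumulates the
         * parameters into a message description record. */
        *MSG_CMD_DEST = (uint64_t)dest_node;
        *MSG_CMD_ADDR = src_addr;
        *MSG_CMD_LEN  = len;

        /* The final uncached read returns nonzero if MAGIC accepted the
         * message; the transfer handler then runs on MAGIC, in parallel
         * with this processor. */
        return (int)*MSG_CMD_GO;
    }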
The transfer handler sends the user message data as a series of independent, cache line-sized messages. The transfer handler keeps the user message data coherent by checking the directory state as the transfer proceeds, taking appropriate coherence actions as needed. Block transfers are broken into cache line-sized chunks because the system is optimized for data transfers of this size, and because block transfers can then utilize the deadlock prevention mechanisms implemented for the base cache-coherence protocol. From a deadlock avoidance perspective, the user message transfer is similar to sending a long list of invalidations: the transfer handler may only be able to send part of the user message in a single activation. To avoid filling the outgoing queue and to allow other handlers to execute, the transfer handler periodically marks its progress and suspends itself on the software queue.
When each component of the user-level message arrives at the destination node, a reception handler is invoked which stores the associated message data in memory and updates the number of message components received. Using information provided in advance by the receiving process, the handler can store the data directly in the user process's memory without extra copying. When all the user message data has been received, the handler notifies the local processor that a user message has arrived (the application can choose to poll for the user message arrival or be interrupted), and sends a single acknowledgment back to the sender, completing the transfer.

Section 4.3 discusses the anticipated performance of this protocol.
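
The receive side described above can be sketched as a per-component handler that stores each cache line-sized piece directly into the user buffer, counts arrivals, and notifies the processor and the sender only when the last piece lands. Again, the types and helper names are hypothetical:

    /* Sketch of the reception path for one user-message component. */

    extern void copy_to_user_buffer(void *user_base, long offset,
                                    const void *line, int line_bytes);
    extern void notify_local_processor(void);     /* poll flag or interrupt */
    extern void send_completion_ack(int sender);

    struct recv_state {
        void *user_base;   /* destination registered in advance by receiver */
        long  received;    /* components received so far                    */
        long  expected;    /* total components in this user message         */
        int   sender_node;
    };

    void reception_handler(struct recv_state *r, long offset,
                           const void *line, int line_bytes)
    {
        copy_to_user_buffer(r->user_base, offset, line, line_bytes);
        if (++r->received == r->expected) {
            notify_local_processor();             /* user message has arrived */
            send_completion_ack(r->sender_node);  /* single ack completes it  */
        }
    }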
3.3 Protocol Extensions and Alternatives

MAGIC's flexible design supports a variety of protocols, not just the two described in Section 3.1 and Section 3.2. By changing the handlers, one can implement other cache-coherence and message-passing protocols, or support completely different operations and communication models. Consequently, FLASH is ideal for experimenting with new protocols.

For example, the handlers can be modified to emulate the "attraction memory" found in a cache-only memory architecture, such as Kendall Square Research's ALLCACHE [KSR92]. A handler that normally forwards a remote request to the home node in the base cache-coherence protocol can be expanded to first check the local memory for the presence of the data. Because MAGIC stores protocol data structures in main memory, it has no difficulty accommodating the different state information (e.g., attraction memory tags) maintained by a COMA protocol.
Another possibility is to implement synchronization primitives as MAGIC handlers. Primitives executing on MAGIC avoid the cost of interrupting the main processor and can exploit MAGIC's ability to communicate efficiently with other nodes. In addition, guaranteeing the atomicity of the primitives is simplified since MAGIC handlers are non-interruptible. Operations such as fetch-and-op and tree barriers are ideal candidates for this type of implementation.
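
For instance, a fetch-and-add handler might look like the short sketch below; because a handler runs to completion on MAGIC, no extra locking is needed. The helper names are illustrative assumptions, not the MAGIC instruction set:

    /* Sketch of fetch-and-op implemented as a MAGIC handler.  Atomicity
     * comes for free: handlers are non-interruptible, so no lock is needed. */

    extern long read_local_memory(unsigned addr);
    extern void write_local_memory(unsigned addr, long value);
    extern void send_reply(int requester, long old_value);

    void fetch_and_add_handler(int requester, unsigned addr, long increment)
    {
        long old = read_local_memory(addr);
        write_local_memory(addr, old + increment);
        send_reply(requester, old);     /* a reply can always be sunk */
    }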
FLASH's short message support corresponds closely to the structuring of communication using active messages as advocated by von Eicken et al. [vECG+92]. However, the MAGIC chip supports fast active messages only at the system level, as opposed to the user level. While von Eicken et al. argue for user-level active messages, we have found that system-level active messages suffice and in many ways simplify matters. For example, consider the shared-memory model and the ordinary read/write requests issued by compute processors. Since the virtual addresses issued by the processor are translated into physical addresses and are protection-checked by the TLB before they reach the MAGIC chip, no further translation or protection checks
are needed at MAGIC. By not allowing user-level handlers, we ensure that malicious user-level handlers do not cause deadlock by breaking resource consumption conventions in the MAGIC chip. The MAGIC chip architecture could be extended to provide protection for user-level handlers (e.g., by providing time-outs), but this change would significantly complicate the chip and the protocols. Instead, we are investigating software techniques for achieving the required protection to allow user-level handlers to execute in the unprotected MAGIC environment. Overall, we believe the disadvantages of providing hardware support for user-level handlers in MAGIC outweigh the advantages. Operations that are truly critical to performance (e.g., support for tree barriers and other synchronization primitives) usually can be coded and provided at the system level by MAGIC. Disallowing user-level handlers should lead to a simpler and higher-performing design.
While the inherent complexity of writing handlers may be small, it is important to realize that errors in the MAGIC handlers directly impact the correctness and stability of the machine. We consider the verification of handlers to be analogous to hardware verification, since MAGIC handlers directly control the node's hardware resources. As a result, although new protocols may be simple to implement, they must be verified thoroughly to be trusted to run on MAGIC.
4 MAGIC Microarchitecture

Fundamentally, protocol handlers must perform two tasks: data movement and state manipulation. The MAGIC architecture exploits the relative independence of these tasks by separating control and data processing. As messages enter the MAGIC chip they are split into message headers and message data. Message headers flow through the control macropipeline while message data flows through the data transfer logic, as depicted in Figure 4.1. Data and control information are recombined as outgoing message headers are merged with the associated outgoing message data to form complete outgoing messages.
4.1 The Data Transfer Logic

Both message-passing and cache-coherence protocols require data connections among the network, local memory, and local processor. Because the structure of these connections is protocol-independent, the data transfer logic can be implemented completely in hardware without causing a loss of overall protocol processing flexibility. The hardwired implementation minimizes data access latency, maximizes data transfer bandwidth, and frees the protocol processor from having to perform data transfers itself.
Figure 4.1. Message flow in MAGIC.
Figure 4.2 shows the data transfer logic in detail. When messages arrive from the network, processor, or I/O subsystem, the network interface (NI), processor interface (PI), or I/O interface (I/O) splits the message into message header and message data, as noted above. If the message contains data, the data is copied into a data buffer, a temporary storage element contained on the MAGIC chip that is used to stage data as it is forwarded from source to destination. Sixteen data buffers are provided on-chip, each large enough to store one cache line.
Staging data through data buffers allows the data transfer logic to achieve low latency and high bandwidth through data pipelining and elimination of multiple data copies. Data pipelining is achieved by tagging each data buffer word with a valid bit. The functional unit reading data from the data buffer monitors the valid bits to pipeline the data from source to destination - the destination does not have to wait for the entire buffer to be written before starting to read data from the buffer. Multiple data copies are eliminated by placing data into a data buffer as it is received from the network, processor, or I/O subsystem, and keeping the data in the same buffer until it is delivered to its destination. The control macropipeline rarely manipulates data directly; instead, it uses the number of the data buffer associated with a message to cause the data transfer logic to deliver the data to the proper destination.

Figure 4.2. Data transfer logic.
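
The valid-bit handshake described above can be modeled as in the sketch below. This is a software analogy of our own, not the hardware itself, but it shows why the reader never waits for the whole buffer; the word count and types are assumptions:

    /* Software analogy of the per-word valid bits used for data pipelining. */

    #define WORDS_PER_LINE 16       /* assumed: 128-byte line, 8-byte words */

    struct data_buffer {
        unsigned long word[WORDS_PER_LINE];
        volatile int  valid[WORDS_PER_LINE];  /* set by the writer, word by word */
    };

    /* The reading unit forwards each word as soon as its valid bit is set,
     * instead of waiting for the entire cache line to arrive. */
    void forward_line(struct data_buffer *b, void (*emit)(unsigned long))
    {
        for (int i = 0; i < WORDS_PER_LINE; i++) {
            while (!b->valid[i])
                ;                   /* spin until this word has been written */
            emit(b->word[i]);
        }
    }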
4.2 The Control Macropipeline

The control macropipeline must satisfy two potentially conflicting goals: it must provide flexible support for a variety of protocols, yet it must process protocol operations quickly enough to ensure that control processing time does not dominate the data transfer time. A programmable controller provides the flexibility. Additional hardware support ensures that the controller can process protocol operations efficiently. First, a hardware-based message dispatch mechanism eliminates the need for the controller to perform message dispatch in software. Second, this dispatch mechanism allows a speculative memory operation to be initiated even before the controller begins processing the message, thereby reducing the data access time. Third, in addition to standard RISC instructions, the controller's instruction set includes bit-field manipulation and other special instructions to provide efficient support for common protocol operations. Fourth, the mechanics of outgoing message sends are handled by a separate hardware unit.

Figure 4.3 shows the structure of the control macropipeline. Message headers are passed from the PI, NI, and I/O to the inbox, which contains the hardware dispatch and speculative memory logic. The inbox passes the message header to the flexible controller - the protocol processor (PP) - where the actual protocol processing occurs. To improve performance, PP code and data are cached in the MAGIC instruction cache and MAGIC data cache, respectively. Finally, the outbox handles outgoing message sends on behalf of the PP, taking outgoing message headers and forwarding them to the PI, NI, and I/O for delivery to the processor, network, and I/O subsystem.

As soon as the inbox completes message preprocessing and passes the message header to the PP, it can begin processing a new message. Similarly, once the PP composes an outgoing message and passes it to the outbox, it can accept a new message from the inbox. Thus, the inbox, PP, and outbox operate independently, increasing message processing throughput by allowing up to three messages to be processed concurrently; hence the name "macropipeline." The following sections describe the operation of the inbox, PP, and outbox in greater detail.
4.2.1 Inbox Operation

The inbox processes messages in several steps. First, the scheduler selects the incoming queue from which the next message will be read. Second, the inbox uses portions of the selected message's header to index into a small memory called the jump table to determine the starting PP program counter (PC) appropriate for the message. The jump table also determines whether the inbox should initiate a speculative memory operation. Finally, the inbox passes the selected message header to the PP for processing.
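
The jump-table lookup can be thought of as in the sketch below; this is our own rendering, and the table size, index construction, and entry fields are assumptions rather than the actual inbox design:

    /* Sketch of the inbox jump table: header fields select a starting PP
     * program counter and a speculative-memory flag. */

    #define JUMP_TABLE_ENTRIES 256          /* assumed size */

    struct jt_entry {
        unsigned start_pc;       /* first PP instruction for this message kind */
        int      spec_mem_read;  /* start a speculative memory access?         */
    };

    static struct jt_entry jump_table[JUMP_TABLE_ENTRIES];

    static struct jt_entry lookup(unsigned msg_type, unsigned header_bits)
    {
        /* index built from portions of the incoming message header */
        unsigned idx = (msg_type ^ header_bits) & (JUMP_TABLE_ENTRIES - 1);
        return jump_table[idx];
    }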
The scheduler selects a message from one of several queues. The PI and I/O each provide a single queue of requests issued by the processor and I/O subsystem, respectively. The NI provides one queue for each network virtual lane. The last queue is the software queue. Unlike the other queues, the software queue is managed entirely by the PP. The inbox contains only the queue's head entry; the remainder of the queue is kept in main memory.
