`
`Jeffrey Kuskin, David Ofelt, Mark Heinrich, John Heinlein,
`Richard Simoni, Kourosh Gharachorloo, John Chapin, David Nakahira, Joel Baxter,
`Mark Horowitz, Anoop Gupta, Mendel Rosenblum, and John Hennessy
`
`Computer Systems Laboratory
`Stanford University
`Stanford, CA 94305
`
`Abstract
`The FLASH multiprocessor efficiently integrates support for
`cache-coherent shared memory and high-performance message
`passing, while minimizing both hardware and software overhead.
`Each node in FLASH contains a microprocessor, a portion of the
`machine's global memory, a port to the interconnection network,
an I/O interface, and a custom node controller called MAGIC.
`The MAGIC chip handles all communication both within the
`node and among nodes, using hardwired data paths for efficient
`data movement and a programmable processor optimized for
executing protocol operations. The use of the protocol processor makes FLASH very flexible (it can support a variety of different communication mechanisms) and simplifies the design and implementation.
`This paper presents the architecture of FLASH and MAGIC,
`and discusses the base cache-coherence and message-passing
`protocols. Latency and occupancy numbers, which are derived
`from our system-level simulator and our Verilog code, are given
`for several common protocol operations. The paper also
`describes our software strategy and FLASH's current status.
`
`1 Introduction
`
`The two architectural techniques for communicating
`data among processors in a scalable multiprocessor are
`message passing and distributed shared memory (DSM).
`Despite significant differences in how programmers view
`these two architectural models, the underlying hardware
`mechanisms used to implement these approaches have
`been converging. Current DSM and message-passing
multiprocessors consist of processing nodes interconnected with a high-bandwidth network. Each node contains a node processor, a portion of the physically distributed memory, and a node controller that connects the processor, memory, and network together. The principal difference between message-passing and DSM machines is in the protocol implemented by the node controller for transferring data both within and among nodes.
Perhaps more surprising than the similarity of the overall structure of these types of machines is the commonality in functions performed by the node controller. In both
`cases, the primary performance-critical function of the
`node controller is the movement of data at high bandwidth
and low latency among the processor, memory, and network. In addition to these existing similarities, the architectural trends for both styles of machine favor further convergence in both the hardware and software mechanisms used to implement the communication abstractions.
`Message-passing machines are moving to efficient support
`of short messages and a uniform address space, features
`normally associated with DSM machines. Similarly, DSM
`machines are starting to provide support for message-like
`block transfers (e.g., the Cray T3D), a feature normally
`associated with message-passing machines.
The efficient integration and support of both cache-coherent shared memory and low-overhead user-level message passing is the primary goal of the FLASH (FLexible Architecture for SHared memory) multiprocessor.
`Efficiency involves both low hardware overhead and high
`performance. A major problem of current cache-coherent
DSM machines (such as the earlier DASH machine [LLG+92]) is their high hardware overhead, while a major
`criticism of current message-passing machines is their
`high software overhead for user-level message passing.
`FLASH integrates and streamlines the hardware primitives
`needed to provide low-cost and high-performance support
`for global cache coherence and message passing. We aim
to achieve this support without compromising the protection model or the ability of an operating system to control resource usage. The latter point is important since we want FLASH to operate well in a general-purpose multiprogrammed environment with many users sharing the machine as well as in a traditional supercomputer environment.
`To accomplish these goals we are designing a custom
`node controller. This controller, called MAGIC (Memory
And General Interconnect Controller), is a highly integrated chip that implements all data transfers both within
`
`1063-6897194 $03.00 © 1994 IEEE
`
`302
`
`Petition for Inter Partes Review of
`U.S. Pat. No. 7,296,121
`IPR2015‐00158
`EXHIBIT
`Sony‐
`
`
`
Figure 2.1. FLASH system architecture.
`
`the node and between the node and the network. To deliver
`high performance, the MAGIC chip contains a specialized
data path optimized to move data between the memory, network, processor, and I/O ports in a pipelined fashion without redundant copying. To provide the flexible control needed to support a variety of DSM and message-passing protocols, the MAGIC chip contains an embedded processor that controls the data path and implements the protocol. The separate data path allows the processor to update the protocol data structures (e.g., the directory for cache coherence) in parallel with the associated data transfers.
`This paper describes the FLASH design and rationale.
`Section 2 gives an overview of FLASH. Section 3 briefly
`describes two example protocols, one for cache-coherent
shared memory and one for message passing. Section 4 presents the microarchitecture of the MAGIC chip. Section 5 briefly presents our system software strategy, and Section 6 presents our implementation strategy and current status. Section 7 discusses related work, and we conclude in Section 8.
`
`2 FLASH Architecture Overview
`
`FLASH is a single-address-space machine consisting of
a large number of processing nodes connected by a low-latency, high-bandwidth interconnection network. Every node is identical (see Figure 2.1), containing a high-performance off-the-shelf microprocessor with its caches, a portion of the machine's distributed main memory, and the MAGIC node controller chip. The MAGIC chip forms the heart of the node, integrating the memory controller, I/O controller, network interface, and a programmable protocol processor. This integration allows for low hardware overhead while supporting both cache-coherence and message-passing protocols in a scalable and cohesive fashion.¹
`
The MAGIC architecture is designed to offer both flexibility and high performance. First, MAGIC includes a programmable protocol processor for flexibility. Second, MAGIC's central location within the node ensures that it sees all processor, network, and I/O transactions, allowing it to control all node resources and support a variety of protocols. Third, to avoid limiting the node design to any specific protocol and to accommodate protocols with varying memory requirements, the node contains no dedicated protocol storage; instead, both the protocol code and protocol data reside in a reserved portion of the node's main memory. However, to provide high-speed access to frequently-used protocol code and data, MAGIC contains on-chip instruction and data caches. Finally, MAGIC separates data movement logic from protocol state manipulation logic. The hardwired data movement logic achieves low latency and high bandwidth by supporting highly-pipelined data transfers without extra copying within the chip. The protocol processor employs a hardware dispatch table to help service requests quickly, and a coarse-level pipeline to reduce protocol processor occupancy. This separation and specialization of data transfer and control logic ensures that MAGIC does not become a latency or bandwidth bottleneck.
`FLASH nodes communicate by sending intra- and
`inter-node commands, which we refer to as messages. To
`implement a protocol on FLASH, one must define what
`kinds of messages will be exchanged (the message types),
`
`1. Our decision to use only one compute processor per node rather than
`multiple processors was driven mainly by pragmatic concerns. Using
`only one processor considerably simplifies the node design, and given the
high bandwidth requirements of modern processors, it was not clear that
`we could support multiple processors productively. However, nothing in
`our approach precludes the use of multiple processors per node.
`
`303
`
`
`
and write the corresponding code sequences for the protocol processor (the handlers). Each handler performs the necessary actions based on the machine state and the information in the message it receives. Handler actions include updating machine state, communicating with the local processor, and communicating with other nodes via the network.
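To make the pairing of message types and handlers concrete, the following C sketch shows one way the dispatch could be organized. The type names, header fields, and table layout are our illustration; the paper does not specify MAGIC's actual encodings.

```c
/* Illustrative sketch only: hypothetical message types and a handler
 * dispatch table in the style described above. */
#include <stdint.h>
#include <stdio.h>

typedef enum {
    MSG_READ_REQ,     /* request for a shared copy       */
    MSG_READ_REPLY,   /* data reply                      */
    MSG_INVAL_REQ,    /* invalidate a cached copy        */
    MSG_INVAL_ACK,    /* invalidation acknowledgment     */
    MSG_NTYPES
} msg_type_t;

typedef struct {
    msg_type_t type;      /* selects the handler          */
    uint32_t   src_node;  /* sending node                 */
    uint64_t   addr;      /* address of the cache line    */
} msg_header_t;

typedef void (*handler_t)(const msg_header_t *);

static void read_req_handler(const msg_header_t *m) {
    /* update directory state, then send a data reply or forward */
    printf("read request for 0x%llx from node %u\n",
           (unsigned long long)m->addr, m->src_node);
}
static void stub_handler(const msg_header_t *m) { (void)m; }

static handler_t dispatch[MSG_NTYPES] = {
    [MSG_READ_REQ]   = read_req_handler,
    [MSG_READ_REPLY] = stub_handler,
    [MSG_INVAL_REQ]  = stub_handler,
    [MSG_INVAL_ACK]  = stub_handler,
};

int main(void) {
    msg_header_t m = { MSG_READ_REQ, 3, 0x1000 };
    dispatch[m.type](&m);   /* each invocation runs to completion */
    return 0;
}
```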
`Multiple protocols can be integrated efficiently in
`FLASH by ensuring that messages in different protocols
`are assigned different message types. The handlers for the
`various protocols then can be dispatched as efficiently as if
`only a single protocol were resident on the machine.
Moreover, although the handlers are dynamically interleaved, each handler invocation runs without interruption on MAGIC's embedded processor, easing the concurrent sharing of state and other critical resources. MAGIC also provides protocol-independent deadlock avoidance support, allowing multiple protocols to coexist without deadlocking the machine or having other negative interactions.
Since FLASH is designed to scale to thousands of processing nodes, a comprehensive protection and fault containment strategy is needed to assure acceptable system
`availability. At the user level, the virtual memory system
`provides protection against application software errors.
`However, system-level errors such as operating system
`bugs and hardware faults require a separate fault detection
`and containment mechanism. The hardware and operating
`system cooperate to identify, isolate, and contain these
`faults. MAGIC provides a hardware-based "firewall"
`mechanism that can be used to prevent certain operations
(memory writes, for example) from occurring on unauthorized addresses. Error-detection codes ensure data integrity: ECC protects main memory and CRCs protect network traffic. Errors are reported to the operating system, which is responsible for taking suitable action.
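As a rough sketch of the firewall idea (the per-region bit vector, 64-node limit, and field names below are our assumptions, not the actual MAGIC mechanism):

```c
/* Illustrative firewall check: a per-region bit vector of nodes that
 * are permitted to write it; other writes are rejected and can be
 * reported to the operating system. */
#include <stdbool.h>
#include <stdint.h>

#define MAX_NODES 64

typedef struct {
    uint64_t base;           /* start of protected physical range   */
    uint64_t limit;          /* end of protected physical range     */
    uint64_t write_allowed;  /* one bit per node (MAX_NODES <= 64)  */
} firewall_region_t;

/* Returns true if 'node' may write 'addr' under this region's policy. */
bool firewall_write_ok(const firewall_region_t *r, uint64_t addr, int node)
{
    if (addr < r->base || addr >= r->limit)
        return true;                        /* not covered by this region */
    return (r->write_allowed >> node) & 1;  /* authorized writer?         */
}
```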
`
`3 FLASH Protocols
`
`This section presents a base cache-coherence protocol
`and a base block-transfer protocol we have designed for
`FLASH. We use the term "base" to emphasize that these
`two protocols are simply the ones we chose to implement
first; Section 3.3 discusses protocol extensions and alternatives.
`
`3.1 Cache Coherence Protocol
`
`The base cache-coherence protocol is directory-based
and has two components: a scalable directory data structure and a set of handlers. For a scalable directory structure, FLASH uses dynamic pointer allocation [Simoni92], illustrated in Figure 3.1. In this scheme, each cache line-sized block (128 bytes in the prototype) of main memory is associated with an 8-byte state word called a directory header, which is stored in a contiguous section of
main memory devoted solely to the cache-coherence protocol. Each directory header contains some boolean flags
`and a link field that points to a linked list of sharers. For
`efficiency, the first element of the sharer list is stored in the
`directory header itself. If a block of memory is cached by
`more than one processor, additional memory for its list of
`sharers is allocated from the pointer/link store. Like the
`directory headers, the pointer/link store is also a physically
`contiguous region of main memory. Each entry in the
pointer/link store consists of a pointer to the sharing processor, a link to the next entry in the list, and an end-of-list
`bit. A free list is used to track the available entries in the
`pointer/link store. Pointer/link store entries are allocated
`from the free list as cache misses are satisfied, and are
`returned to the free list either when the line is written and
`invalidations are sent to each cache on the list of sharers,
`or when a processor notifies the directory that it is no
longer caching a block.²
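The C sketch below captures the directory header, pointer/link store, and free list just described. Field widths, the store size, and the NIL encoding are illustrative assumptions; only the overall structure follows the text.

```c
/* Illustrative layout of the dynamic pointer allocation structures. */
#include <stdbool.h>
#include <stdint.h>

#define STORE_ENTRIES 4096           /* sized to the system's cache  */
#define NIL 0xFFFFFFFFu              /* "no entry" marker            */

typedef struct {                     /* one per 128-byte memory line */
    bool     flags[4];               /* boolean protocol flags       */
    uint32_t first_sharer;           /* first sharer kept in header  */
    uint32_t link;                   /* list of further sharers      */
} dir_header_t;

typedef struct {                     /* pointer/link store entry     */
    uint32_t sharer;                 /* node caching the line        */
    uint32_t next;                   /* next entry in the list       */
    bool     end_of_list;
} ptr_link_t;

static ptr_link_t store[STORE_ENTRIES];
static uint32_t   free_head;         /* free list threaded via next */

void init_free_list(void) {
    for (uint32_t i = 0; i + 1 < STORE_ENTRIES; i++) store[i].next = i + 1;
    store[STORE_ENTRIES - 1].next = NIL;
    free_head = 0;
}

/* Allocate an entry when a miss adds a sharer; NIL means exhausted. */
uint32_t alloc_entry(void) {
    uint32_t e = free_head;
    if (e != NIL) free_head = store[e].next;
    return e;
}

/* Return an entry when a sharer is invalidated or sends a replacement hint. */
void free_entry(uint32_t e) {
    store[e].next = free_head;
    free_head = e;
}
```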
`A significant advantage of dynamic pointer allocation
`is that the directory storage requirements are scalable. The
amount of memory needed for the directory headers is proportional to the local memory per node, and scales as more
`processors are added. The total amount of memory needed
`in the machine for the pointer/link store is proportional to
`the total amount of cache in the system. Since the amount
of cache is much smaller than the amount of main memory,
`the size of the pointer/link store is sufficient to maintain
full caching information, as long as the loading on the different memory modules is uniform. When this uniformity
`does not exist, a node can run out of pointer/link storage.
`While a detailed discussion is beyond the scope of this
`paper, several heuristics can be used in this situation to
`ensure reasonable performance. Overall, the directory
`occupies 7% to 9% of main memory, depending on system
`configuration.
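As a rough worked example of where that figure comes from (the 8-byte header per 128-byte line is from the text above; attributing the remainder to the pointer/link store is our reading, not a breakdown the authors give):

$$\frac{8~\text{bytes (directory header)}}{128~\text{bytes (memory line)}} = 6.25\%~\text{of main memory,}$$

with the pointer/link store, sized in proportion to the total cache in the system, plausibly accounting for the remaining one to three percentage points of the quoted 7% to 9%.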
Apart from the data structures used to maintain directory information, the base cache-coherence protocol is
`similar to the DASH protocol [LLG+90]. Both protocols
`utilize separate request and reply networks to eliminate
request-reply cycles in the network. Both protocols forward dirty data from a processor's cache directly to a
`requesting processor, and both protocols use negative
`acknowledgments to avoid deadlock and to cause retries
when a requested line is in a transient state. The main difference between the two protocols is that in DASH each
`cluster collects its own invalidation acknowledgments,
whereas in FLASH invalidation acknowledgments are
`
`2. The base cache-coherence protocol relies on replacement hints. The
protocol could be modified to accommodate processors which do not provide these hints.
`
`304
`
`
`
Figure 3.1. Data structures for the dynamic pointer allocation directory scheme.
`
collected at the home node, that is, the node where the directory data is stored for that block.
`Avoiding deadlock is difficult in any cache-coherence
`protocol. Below we discuss how the base protocol handles
the deadlock problem, and illustrate some of the protocol-independent deadlock avoidance mechanisms of the MAGIC architecture. Although this discussion focuses on the base cache-coherence protocol, any protocol run on FLASH can use these mechanisms to eliminate the deadlock problem.
`As a first step, the base protocol divides all messages
`into requests (e.g., read, read-exclusive, and invalidate
`requests) and replies (e.g., read and read-exclusive data
`replies, and invalidation acknowledgments). Second, the
protocol uses the virtual lane support in the network routers to transmit requests and replies over separate logical networks. Next, it guarantees that replies can be sunk, that is, replies generate no additional outgoing messages. This eliminates the possibility of request-reply circular dependencies. To break request-request cycles, requests that cannot be sunk may be negatively acknowledged, effectively turning those requests into replies.
`The final requirement for a deadlock solution is a
restriction placed on all handlers: they must yield the protocol processor if they cannot run to completion. If a handler violates this constraint and stalls waiting for space on one of its output queues, the machine could potentially deadlock because it is no longer servicing messages from the network. To avoid this type of deadlock, the scheduling mechanism for the incoming queues is initialized to indicate which incoming queues contain messages that may require outgoing queue space. The scheduler will not select an incoming queue unless the corresponding outgoing queue space requirements are satisfied.
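A minimal sketch of that scheduling rule follows; the queue set and the reply-space table are hypothetical, and the real inbox implements this in hardware rather than software.

```c
/* Illustrative scheduling rule: an incoming queue is selected only if
 * the outgoing queue space its messages may need is available. */
#include <stdbool.h>

enum { Q_PROC, Q_IO, Q_NET_REQ, Q_NET_REPLY, Q_SOFTWARE, NQUEUES };

/* Worst-case outgoing reply slots a message from each incoming queue
 * may consume (a request may be NAKed, i.e. turned into a reply;
 * replies themselves can always be sunk and need no reply space). */
static const int reply_slots_needed[NQUEUES] = { 1, 1, 1, 0, 1 };

static bool queue_nonempty(int q)  { return q == Q_NET_REQ; }  /* stub */
static int  reply_slots_free(void) { return 2; }               /* stub */

/* Returns the next incoming queue to service, or -1 if none is both
 * nonempty and safe to schedule right now. */
int schedule(void)
{
    for (int q = 0; q < NQUEUES; q++)
        if (queue_nonempty(q) && reply_slots_free() >= reply_slots_needed[q])
            return q;
    return -1;
}
```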
However, in some cases, the number of outgoing messages a handler will send cannot be determined beforehand, preventing the scheduler from ensuring adequate
`outgoing queue space for these handlers. For example, an
`incoming request (for which only outgoing reply queue
`space is guaranteed) may need to be forwarded to a dirty
`remote node. If at this point the outgoing request queue is
`full, the protocol processor negatively acknowledges the
`incoming request, converting it into a reply. A second case
`not handled by the scheduler is an incoming write miss
that is scheduled and finds that it needs to send N invalidation requests into the network. Unfortunately, the outgoing
`request queue may have fewer than N spots available. As
`stated above, the handler cannot simply wait for space to
free up in the outgoing request queue to send the remaining invalidations. To solve this problem, the protocol employs the software queue, where it can suspend messages to be rescheduled at a later time.
The software queue is a reserved region of main memory that any protocol can use to suspend message processing temporarily. For instance, each time MAGIC receives a write request to a shared line, the corresponding handler reserves space in the software queue for possible rescheduling. If the queue is already full, the incoming request is
`simply negatively acknowledged. This case should be
`extremely rare. If the handler discovers that it needs to
`send N invalidations, but only M < N spots are available in
`the outgoing request queue, the handler sends M invalidate
`requests and then places itself on the software queue. The
`list of sharers at this point contains only those processors
`that have not been invalidated. When the write request is
`rescheduled off of the software queue, the new handler
`invocation continues sending invalidation requests where
`the old one left off.
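The handler logic just described follows a simple "send what fits, then suspend" pattern, sketched below. Names and the fixed-size sharer array are our simplification; the real handler walks the directory's sharer list.

```c
/* Illustrative invalidation handler that sends as many invalidations
 * as the outgoing request queue can take, then parks the remaining
 * work on the software queue for a later invocation. */
#include <stdint.h>

#define MAX_SHARERS 64

typedef struct {
    uint64_t addr;                  /* line being written             */
    int      sharers[MAX_SHARERS];  /* nodes not yet invalidated      */
    int      remaining;             /* N in the text                  */
} inval_work_t;

static int  outgoing_req_slots(void)                 { return 4; }  /* stub: M */
static void send_invalidate(int node, uint64_t addr) { (void)node; (void)addr; }
static void software_queue_push(inval_work_t *w)     { (void)w; }   /* stub */

void write_inval_handler(inval_work_t *w)
{
    int budget = outgoing_req_slots();         /* M free request slots */
    while (w->remaining > 0 && budget-- > 0) {
        w->remaining--;
        send_invalidate(w->sharers[w->remaining], w->addr);
    }
    if (w->remaining > 0)            /* could not send all N: suspend;  */
        software_queue_push(w);      /* a later invocation resumes here */
}
```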
`
`3.2 Message Passing Protocol
`
`In FLASH, we distinguish long messages, used for
`block transfer, from short messages, such as those required
`
`305
`
`
`
`for synchronization. This section discusses the block
transfer mechanism; Section 3.3 discusses short messages.
`The design of the block transfer protocol was driven by
three main goals: provide user-level access to block transfer without sacrificing protection; achieve transfer bandwidth and latency comparable to a message-passing machine containing dedicated hardware support for this task; and operate in harmony with other key attributes of the machine, including cache coherence, virtual memory, and multiprogramming [HGG94]. We achieve high performance because MAGIC efficiently streams data to the receiver. The performance is further improved by the elimination of processor interrupts and system calls in the common case, and by the avoidance of extra copying of message data.
`To distinguish a user-level message from the low-level
`messages MAGIC sends between nodes, this section
`explicitly refers to the former as a user message. Sending a
`user message in FLASH logically consists of three phases:
`initiation, transfer, and reception/completion.
`To send a user message, an application process calls a
library routine to communicate the parameters of the user-level message to MAGIC. This communication happens using a series of uncached writes to special addresses (which act as memory-mapped commands). Unlike standard uncached writes, these special writes invoke a different handler that accumulates information from the command into a message description record in MAGIC's memory. The final command is an uncached read, to which MAGIC replies with a value indicating if the message is accepted. Once the message is accepted, MAGIC invokes a transfer handler that takes over responsibility for transferring the user message to its destination, allowing the main processor to run in parallel with the message transfer.
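The library-side send sequence might look roughly like the sketch below. The command offsets, the simulated register array, and the msg_send() signature are all hypothetical; in the real system the writes would go, uncached, to MAGIC's special command addresses.

```c
/* Illustrative send initiation: uncached writes build the message
 * description, and a final uncached read returns accept/reject. */
#include <stdint.h>

/* Stand-in for MAGIC's memory-mapped command region, so the sketch is
 * self-contained; really this would be a fixed uncached address range. */
static volatile uint64_t cmd[5];

enum { CMD_DEST, CMD_SRC_ADDR, CMD_DST_ADDR, CMD_LENGTH, CMD_GO };

/* Returns nonzero if MAGIC accepted the user message for transfer. */
int msg_send(int dest_node, uint64_t src, uint64_t dst, uint64_t len)
{
    cmd[CMD_DEST]     = (uint64_t)dest_node;  /* each write invokes the  */
    cmd[CMD_SRC_ADDR] = src;                  /* accumulation handler    */
    cmd[CMD_DST_ADDR] = dst;
    cmd[CMD_LENGTH]   = len;
    return (int)cmd[CMD_GO];                  /* final uncached read:    */
}                                             /* accept / reject value   */
```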
`The transfer handler sends the user message data as a
`series of independent, cache line-sized messages. The
transfer handler keeps the user message data coherent by checking the directory state as the transfer proceeds, taking appropriate coherence actions as needed. Block transfers are broken into cache line-sized chunks because the system is optimized for data transfers of this size, and because block transfers can then utilize the deadlock prevention mechanisms implemented for the base cache-coherence protocol. From a deadlock avoidance perspective, the user message transfer is similar to sending a long list of invalidations: the transfer handler may only be able to send part of the user message in a single activation. To avoid filling the outgoing queue and to allow other handlers to execute, the transfer handler periodically marks its progress and suspends itself on the software queue.
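Combining the chunking and the suspension rule gives a transfer handler shaped roughly like the sketch below (all names and the xfer_t record are illustrative).

```c
/* Illustrative transfer handler: sends the block as cache-line-sized
 * messages, checking coherence per line, and suspends itself on the
 * software queue when the outgoing queue budget is used up. */
#include <stdint.h>

#define LINE_SIZE 128   /* cache line size in the prototype */

typedef struct { int dest; uint64_t src, dst, len, sent; } xfer_t;

static int  lines_sendable_now(void)                        { return 8; }    /* stubs */
static void fetch_line_coherently(uint64_t addr)            { (void)addr; }
static void send_line(int node, uint64_t dst, uint64_t src) { (void)node; (void)dst; (void)src; }
static void software_queue_push_xfer(xfer_t *x)             { (void)x; }

void transfer_handler(xfer_t *x)
{
    int budget = lines_sendable_now();
    while (x->sent < x->len && budget-- > 0) {
        fetch_line_coherently(x->src + x->sent);   /* directory check +   */
        send_line(x->dest, x->dst + x->sent,       /* coherence actions   */
                  x->src + x->sent);
        x->sent += LINE_SIZE;
    }
    if (x->sent < x->len)                /* mark progress and resume in a */
        software_queue_push_xfer(x);     /* later handler invocation      */
}
```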
`When each component of the user-level message
`arrives at the destination node, a reception handler is
invoked which stores the associated message data in memory and updates the number of message components
`received. Using information provided in advance by the
`receiving process, the handler can store the data directly in
`the user process's memory without extra copying. When
`all the user message data has been received, the handler
`notifies the local processor that a user message has arrived
`(the application can choose to poll for the user message
arrival or be interrupted), and sends a single acknowledgment back to the sender, completing the transfer.
`Section 4.3 discusses the anticipated performance of
`this protocol.
`
`3.3 Protocol Extensions and Alternatives
`
MAGIC's flexible design supports a variety of protocols, not just the two described in Section 3.1 and Section 3.2. By changing the handlers, one can implement other cache-coherence and message-passing protocols, or support completely different operations and communication models. Consequently, FLASH is ideal for experimenting with new protocols.
`For example, the handlers can be modified to emulate
`the "attraction memory" found in a cache-only memory
architecture, such as Kendall Square Research's ALLCACHE [KSR92]. A handler that normally forwards a remote request to the home node in the base cache-coherence protocol can be expanded to first check the local
`memory for the presence of the data. Because MAGIC
`stores protocol data structures in main memory, it has no
`difficulty accommodating the different state information
`(e.g., attraction memory tags) maintained by a COMA
`protocol.
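A minimal sketch of that modification, assuming a hypothetical attraction-memory tag lookup:

```c
/* Illustrative COMA-style extension: check local attraction memory
 * before falling back to the base protocol's forward-to-home path. */
#include <stdbool.h>
#include <stdint.h>

static bool attraction_memory_hit(uint64_t addr)            { (void)addr; return false; }  /* stub */
static void reply_from_local_memory(uint64_t addr, int req) { (void)addr; (void)req; }     /* stub */
static void forward_to_home(uint64_t addr, int req)         { (void)addr; (void)req; }     /* stub */

void remote_read_handler(uint64_t addr, int requester)
{
    if (attraction_memory_hit(addr))              /* COMA extension     */
        reply_from_local_memory(addr, requester);
    else
        forward_to_home(addr, requester);         /* base protocol path */
}
```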
`Another possibility is to implement synchronization
`primitives as MAGIC handlers. Primitives executing on
`MAGIC avoid the cost of interrupting the main processor
and can exploit MAGIC's ability to communicate efficiently with other nodes. In addition, guaranteeing the atomicity of the primitives is simplified since MAGIC handlers are non-interruptible. Operations such as fetch-and-op and tree barriers are ideal candidates for this type of implementation.
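For instance, fetch-and-add might be written as the handler sketched below; the message layout and the stand-in memory array are our assumptions. Atomicity falls out of the fact that handlers run to completion on the protocol processor.

```c
/* Illustrative fetch-and-add handler: atomic because no other handler
 * can run on the protocol processor until this one completes. */
#include <stdint.h>

typedef struct {
    uint32_t src_node;   /* requester to receive the old value          */
    uint32_t index;      /* synchronization variable (index into local  */
    int64_t  operand;    /* memory here, a physical address in reality) */
} fetch_add_msg_t;

static int64_t sync_vars[1024];                  /* stand-in for memory */
static void send_reply(uint32_t node, int64_t v) { (void)node; (void)v; }

void fetch_and_add_handler(const fetch_add_msg_t *m)
{
    int64_t old = sync_vars[m->index % 1024];
    sync_vars[m->index % 1024] = old + m->operand;  /* read-modify-write  */
    send_reply(m->src_node, old);                   /* return prior value */
}
```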
`FLASH's short message support corresponds closely to
`the structuring of communication using active messages as
`advocated by von Eicken et al. [vECG+92]. However, the
MAGIC chip supports fast active messages only at the system level, as opposed to the user level. While von Eicken et al. argue for user-level active messages, we have found that system-level active messages suffice and in many ways simplify matters. For example, consider the shared-memory model and the ordinary read/write requests issued
`by compute processors. Since the virtual addresses issued
`by the processor are translated into physical addresses and
`are protection-checked by the TLB before they reach the
`MAGIC chip, no further translation or protection checks
`
`306
`
`
`
are needed at MAGIC. By not allowing user-level handlers, we ensure that malicious user-level handlers do not cause deadlock by breaking resource consumption conventions in the MAGIC chip. The MAGIC chip architecture could be extended to provide protection for user-level handlers (e.g., by providing time-outs), but this change would significantly complicate the chip and the protocols. Instead, we are investigating software techniques for achieving the required protection to allow user-level handlers to execute in the unprotected MAGIC environment. Overall, we believe the disadvantages of providing hardware support for user-level handlers in MAGIC outweigh the advantages. Operations that are truly critical to performance (e.g., support for tree barriers and other synchronization primitives) usually can be coded and provided at the system level by MAGIC. Disallowing user-level handlers should lead to a simpler and higher-performing design.
`While the inherent complexity of writing handlers may
be small, it is important to realize that errors in the MAGIC handlers directly impact the correctness and stability of the machine. We consider the verification of handlers to be analogous to hardware verification, since MAGIC handlers directly control the node's hardware resources. As a result, although new protocols may be simple to implement, they must be verified thoroughly to be trusted to run on MAGIC.
`
`4 MAGIC Microarchitecture
`
Fundamentally, protocol handlers must perform two tasks: data movement and state manipulation. The MAGIC architecture exploits the relative independence of these tasks by separating control and data processing. As messages enter the MAGIC chip they are split into message headers and message data. Message headers flow through the control macropipeline while message data flows through the data transfer logic, depicted in Figure 4.1. Data and control information are recombined as outgoing message headers are merged with the associated outgoing message data to form complete outgoing messages.
`
`4.1 The Data Transfer Logic
`
`Both message-passing and cache-coherence protocols
require data connections among the network, local memory, and local processor. Because the structure of these connections is protocol-independent, the data transfer logic can be implemented completely in hardware without causing a loss of overall protocol processing flexibility. The hardwired implementation minimizes data access latency, maximizes data transfer bandwidth, and frees the protocol processor from having to perform data transfers itself.
`
Figure 4.1. Message flow in MAGIC.
`
Figure 4.2 shows the data transfer logic in detail. When messages arrive from the network, processor, or I/O subsystem, the network interface (NI), processor interface (PI), or I/O interface (I/O) splits the message into message header and message data, as noted above. If the message contains data, the data is copied into a data buffer, a temporary storage element contained on the MAGIC chip that is used to stage data as it is forwarded from source to destination. Sixteen data buffers are provided on-chip, each large enough to store one cache line.
Staging data through data buffers allows the data transfer logic to achieve low latency and high bandwidth
`through data pipelining and elimination of multiple data
`copies. Data pipelining is achieved by tagging each data
`buffer word with a valid bit. The functional unit reading
`data from the data buffer monitors the valid bits to pipeline
`
Figure 4.2. Data transfer logic.
`
`307
`
`
`
the data from source to destination - the destination does not have to wait for the entire buffer to be written before starting to read data from the buffer. Multiple data copies are eliminated by placing data into a data buffer as it is received from the network, processor, or I/O subsystem, and keeping the data in the same buffer until it is delivered to its destination. The control macropipeline rarely manipulates data directly; instead, it uses the number of the data buffer associated with a message to cause the data transfer logic to deliver the data to the proper destination.
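The valid-bit scheme amounts to a per-word handshake between the unit filling a buffer and the unit draining it, roughly as sketched below (the 16 x 8-byte layout matches the 128-byte line; the interface is otherwise our invention).

```c
/* Illustrative data buffer with per-word valid bits: the consumer can
 * stream words as they become valid instead of waiting for the full
 * cache line to be staged. */
#include <stdbool.h>
#include <stdint.h>

#define WORDS_PER_LINE 16              /* 16 x 8 bytes = 128-byte line */

typedef struct {
    uint64_t word[WORDS_PER_LINE];
    bool     valid[WORDS_PER_LINE];    /* one valid bit per word       */
} data_buffer_t;

/* Producer: stage an arriving word and mark it valid. */
void buffer_write(data_buffer_t *b, int i, uint64_t w)
{
    b->word[i]  = w;
    b->valid[i] = true;
}

/* Consumer: deliver word i as soon as its valid bit is set; returns
 * false if the word has not arrived yet (caller polls again). */
bool buffer_read(const data_buffer_t *b, int i, uint64_t *out)
{
    if (!b->valid[i])
        return false;
    *out = b->word[i];
    return true;
}
```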
`
`4.2 The Control Macropipeline
`
The control macropipeline must satisfy two potentially conflicting goals: it must provide flexible support for a variety of protocols, yet it must process protocol operations quickly enough to ensure that control processing time does not dominate the data transfer time. A programmable controller provides the flexibility. Additional hardware support ensures that the controller can process protocol operations efficiently. First, a hardware-based message dispatch mechanism eliminates the need for the controller to perform message dispatch in software. Second, this dispatch mechanism allows a speculative memory operation to be initiated even before the controller begins processing the message, thereby reducing the data access time. Third, in addition to standard RISC instructions, the controller's instruction set includes bitfield manipulation and other special instructions to provide efficient support for common protocol operations. Fourth, the mechanics of outgoing message sends are handled by a separate hardware unit.
Figure 4.3 shows the structure of the control macropipeline. Message headers are passed from the PI, NI, and I/O to the inbox, which contains the hardware dispatch and speculative memory logic. The inbox passes the message header to the flexible controller, the protocol processor (PP), where the actual protocol processing occurs. To improve performance, PP code and data are cached in the MAGIC instruction cache and MAGIC data cache, respectively. Finally, the outbox handles outgoing message sends on behalf of the PP, taking outgoing message headers and forwarding them to the PI, NI, and I/O for delivery to the processor, network, and I/O subsystem.

As soon as the inbox completes message preprocessing and passes the message header to the PP, it can begin processing a new message. Similarly, once the PP composes an outgoing message and passes it to the outbox, it can accept a new message from the inbox. Thus, the inbox, PP, and outbox operate independently, increasing message processing throughput by allowing up to three messages to be processed concurrently; hence the name "macropipeline." The following sections describe the operation of the inbox, PP, and outbox in greater detail.
`
4.2.1 Inbox Operation
`The inbox processes messages in several steps. First,
`the scheduler selects the incoming queue from which the
`next message will be read. Second, the inbox uses portions
`of the selected message's header to index into a small
`memory called the jump table to determine the starting PP
`program counter (PC) appropriate for the message. The
jump table also determines whether the inbox should initiate a speculative memory operation. Finally, the inbox passes the selected message header to the PP for processing.
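The dispatch step can be pictured as the table lookup sketched below; the index function, table size, and entry fields are illustrative, and in MAGIC this is performed in hardware.

```c
/* Illustrative jump-table dispatch: header bits index a small table
 * that supplies the handler's starting PC and whether to begin a
 * speculative memory access before the PP starts running. */
#include <stdbool.h>
#include <stdint.h>

#define JT_SIZE 256

typedef struct {
    uint32_t start_pc;       /* first PP instruction of the handler */
    bool     spec_mem_read;  /* start the memory access early?      */
} jump_entry_t;

static jump_entry_t jump_table[JT_SIZE];

static void start_speculative_read(uint64_t addr) { (void)addr; }   /* stubs */
static void start_pp_handler(uint32_t pc)         { (void)pc;   }

/* Hypothetical index: message type plus a local/remote address bit. */
static unsigned jt_index(uint8_t msg_type, bool addr_is_local)
{
    return (((unsigned)msg_type << 1) | addr_is_local) & (JT_SIZE - 1);
}

void inbox_dispatch(uint8_t msg_type, bool addr_is_local, uint64_t addr)
{
    jump_entry_t e = jump_table[jt_index(msg_type, addr_is_local)];
    if (e.spec_mem_read)
        start_speculative_read(addr);  /* overlap memory with dispatch */
    start_pp_handler(e.start_pc);
}
```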
`The scheduler selects a message from one of several
queues. The PI and I/O each provide a single queue of requests issued by the processor and I/O subsystem,
`respectively. The NI provides one queue for each network
`virtual lane. The last queue is the software queue. Unlike
`the other queues, the software queue is managed entirely
`by the PP. The inbox contains only the queue's head entry;
`the remainder of the queue is