`
The Design of Nectar:
A Network Backplane for Heterogeneous Multicomputers
`
Emmanuel A. Arnould
Francois J. Bitz
Eric C. Cooper
H. T. Kung
Robert D. Sansom
Peter A. Steenkiste

January 1989
`
`CMU-CS-89-101
`
`School of Computer Science
`Carnegie Mellon University
`Pittsburgh, Pennsylvania 15213
`
`Also published in The Proceedings of the Third International Conference on
`Architectural Support for Programming Languages and Operating Systems, ACM, April 1989
`
`ABSTRACT
`
`Nectar is a "network backplane" for use in heterogeneous multicomputers. The initial system consists
`of a star-shaped fiber-optic network with an aggregate bandwidth of 1.6 gigabits/second and a switching
`latency of 700 nanoseconds. The system can be scaled up by connecting hundreds of these networks
`together.
`The Nectar architecture provides a flexible way to handle heterogeneity and task-level parallelism.
`A wide variety of machines can be connected as Nectar nodes and the Nectar system software allows
`applications to communicate at a high level. Protocol processing is off-loaded to powerful communication
`processors so that nodes do not have to support a suite of network protocols.
`We have designed and built a prototype Nectar system that has been operational since November
`1988. This paper presents the motivation and goals for Nectar and describes its hardware and software.
`The presentation emphasizes how the goals influenced the design decisions and led to the novel aspects
`of Nectar.
`
This research was supported in part by the Defense Advanced Research Projects Agency (DOD), monitored by the
`Space and Naval Warfare Systems Command under Contract N00039-87-C-0251, and in part by the Office of Naval
`Research under Contracts N00014-87-K-0385 and N00014-87-K-0533.
`
`1
`
`Introduction
`
`Parallel processing is widely accepted as the most promising way to reach the next level of computer
`system performance. Currently, most parallel machines provide efficient support only for homogeneous,
`fine-grained parallel applications. There are three problems with such machines.
`First, there is a limit to how far the performance of these machines can be scaled up. When that
`limit is reached, it becomes desirable to exploit coarse-grained, or task-level, parallelism by connecting
`together several such machines as nodes in a multicomputer. Second, there is a limit to the usefulness of
`homogeneous systems; as we will argue below, heterogeneity (at hardware and software levels) is inherent
`in a whole class of important applications. Third, parallel machines are typically built from custom-made
`processor boards, although they sometimes use standard microprocessor components. These machines
`cannot readily take advantage of rapid advances in commercially-available sequential processors.
`Current local area networks (LANs) can be used to connect together existing machines, but this
`approach is unsatisfactory for a heterogeneous multicomputer with both general-purpose and specialized,
`high-performance machines. It is often not possible to implement efficiently the required communication
`protocols on special-purpose machines, and typical applications for such systems require higher bandwidth
`and lower latency than current LANs can provide.
The Nectar (network computer architecture) project attacks the problem of heterogeneous, coarse-grained
parallelism on several fronts, from the underlying hardware, through the communication protocols
`and node operating system support, to the application interface to communication. The solution embodied
`in the Nectar architecture is a two-level structure, with fine-grained parallelism within tasks at individual
`nodes, and coarse-grained parallelism among tasks on different nodes. This system-level approach is
influenced by our experience with previous projects such as the Warp¹ systolic array machine [1] and the
`Mach multiprocessor operating system [12].
`The Nectar architecture provides a general and systematic way to handle heterogeneity and task-level
`parallelism. A variety of existing systems can be plugged into a flexible, extensible network backplane.
`The Nectar system software allows applications to communicate at a high level, without requiring each
`node to support a suite of network protocols; instead, protocol processing is off-loaded to powerful
`network interface processors.
`We have designed and built a prototype Nectar system that has been operational since November
`1988. This paper discusses the motivation and goals of the Nectar project, the hardware and software
`architectures, and our design decisions for the prototype system. We will evaluate the system with real
`applications in the coming year.
`Section 2 of this paper summarizes the goals for Nectar. An overview of the Nectar system and the
`prototype implementation is given in Section 3. The two major functional units of the system, the HUB
`and CAB, are described in Sections 4 and 5. Section 6 describes the Nectar software. Some of the
`applications that Nectar will support are discussed in Section 7. The paper concludes with Section 8.
`
`2 Nectar Goals
`
There are three major technical goals for the Nectar system: heterogeneity, scalability, and low-latency,
high-bandwidth communication. These goals follow directly from the desire to support an emerging class
`of large-grained parallel programs whose characteristics are described below.
`
¹Warp is a service mark of Carnegie Mellon University.
`
`1
`
`INTEL EX.1222.004
`
`
`
`2.1 Heterogeneity
`One of the characteristics of these applications is the need to process information at multiple, qualitatively
`different levels. For example, a computer vision system may require image processing on its raw input
`at the lowest level, and scene recognition using a knowledge base at the highest level. A speech
`understanding system has a similar structure, with low-level signal processing and high-level natural
`language parsing. The processing required by an autonomous robot might range from handling sensor
`inputs to high-level planning.
`At the lowest levels, these applications deal with simple data structures and highly regular number-
crunching algorithms. The large amounts of data arriving at high rates often require specialized hardware. At the
`highest level, these applications may use complicated symbolic data structures and data-dependent flow
`of control. Specialized inference engines or database machines might be appropriate for these tasks. The
`very nature of these applications dictates a heterogeneous hardware environment, with varied instruction
`sets, data representations, and performance.
`Software heterogeneity is equally significant. The most natural programming language for each task
`ranges from Fortran and C to query languages and production systems. As a result, the system must
`handle differences in programming languages, operating systems, and data representations.
`
`2.2 Scalability
`Often the most cost-effective way of extending a system to support new applications is to add hardware
`rather than replacing the entire system. Also, by including a variety of processors, the system can take
`advantage of performance improvements in commercially available computers. In the Nectar system, it
`must therefore be possible to add or replace nodes without disruption: the bandwidth and latency between
`existing tasks should not be affected significantly, and it should not be necessary to change existing
`system software. Using the same hardware design, Nectar should scale up to a network of hundreds of
`supercomputer-class machines.
`
2.3 Low-Latency, High-Bandwidth Communication
The structure of these parallel applications requires communication among different tasks, both "horizontally"
(among tasks operating at the same level of representation) and "vertically" (between levels).
`The lower levels in particular require a high data rate (megabyte images at video rates, for example).
`Moreover, applications often have response-time requirements that can only be satisfied by low-latency
`communication; two examples are continuous speech recognition and the control of autonomous vehicles.
In general, by providing low-latency, high-bandwidth communication the system can rapidly distribute
computations to multiple processors. This allows the efficient parallel implementation of many
`applications.
`Lowering latency is much more challenging than increasing bandwidth, since the latter can always be
`achieved by using pipelined architectures with wide data paths and high-bandwidth communication media
`such as fiber-optic lines. Latency can be particularly difficult to minimize in a large system where multi-
`hop communication is necessary. Nectar has the following performance goals for communication latency:
`excluding the transmission delays of the optical fibers, the latency for a message sent between processes
`on two CABs should be under 30 microseconds; the corresponding latency for processes residing in nodes
`should be under 100 microseconds; and the latency to establish a connection through a single HUB should
`be under 1 microsecond.
`
`2
`
`INTEL EX.1222.005
`
`
`
Figure 1: Nectar system overview
`
`3 The Nectar System
`3.1 System Overview
`
`The Nectar system consists of a Nectar-net and a set of CABs (communication accelerator boards), as
`illustrated in Figure 1. It connects a number of existing systems called nodes. The Nectar-net is built
`from fiber-optic lines and one or more HUBs. A HUB is a crossbar switch with a flexible datalink
`protocol implemented in hardware. A CAB is a RISC-based processor board serving three functions:
`it implements higher-level network protocols; it provides the interface between the Nectar-net and the
`nodes; and it off-loads application tasks from nodes whenever appropriate. Every CAB is connected to
`a HUB via a pair of fiber lines carrying signals in opposite directions. A HUB together with its directly
`connected CABs forms a HUB cluster.
`
Figure 2: A single-HUB system
`
`In a system with a single HUB, all the CABs are connected to the same HUB (Figure 2). The number
`of CABs in the system is therefore limited by the number of I/O ports of the HUB.
`
`3
`
`INTEL EX.1222.006
`
`
`
Figure 3: HUB cluster
`
`To build larger systems, multiple HUBs are needed. In such systems, some of the I/O ports on each
`HUB are used for inter-HUB fiber connections, as shown in Figure 3. The HUB clusters may be connected
`in any topology appropriate to the application environment. Since the I/O ports used for HUB-HUB and
`for CAB-HUB connections are identical, there is no a priori restriction on how many links can be used
`for inter-HUB connections. Figure 4 depicts a multi-HUB system using a 2-dimensional mesh to connect
`its clusters.
`The Nectar-net offers at least an order of magnitude improvement in bandwidth and latency over current
`LANs. Moreover, the use of crossbar switches substantially reduces network contention. Moderate-size,
high-speed crossbars (with setup latency under one microsecond) are now practical; 8-bit wide 32 x 32
`crossbars can be built with off-the-shelf parts, and 128 x 128 crossbars are possible with custom VLSI.
`Re-engineering the software in the critical path of communication is as important for achieving low
latency as building fast network hardware. Typical profiles of networking implementations on UNIX²
`show that the time spent in the software dominates the time spent on the wire [3,5,11].
`There are three main sources of inefficiency in current networking implementations. First, existing
`application interfaces incur excessive costs due to context switching and data copying between the user
`process and the node operating system. Second, the node must incur the overhead of higher-level protocols
`that ensure reliable communications for applications. Third, the network interface burdens the node with
`interrupt handling and header processing for each packet.
`The Nectar software architecture alleviates these problems by restructuring the way applications
`communicate. User processes have direct access to a high-level network interface mapped into their
`address spaces. Communication overhead on the node is substantially reduced for three reasons. First, no
system calls are required during communication. Second, protocol processing is off-loaded to the CAB.
`Third, interrupts are required only for high-level events in which the application is interested, such as
`delivery of complete messages, rather than low-level events such as the arrival of control packets or timer
`expiration.
`
²UNIX is a trademark of AT&T Bell Laboratories.
`
`4
`
`INTEL EX.1222.007
`
`
`
Figure 4: Multi-HUB system connected in a 2-D mesh
`
`3.2 Prototype Implementation
`
`We have built a prototype Nectar system to carry out extensive systems and applications experiments.
`The system includes three board types: CAB, HUB I/O board, and HUB backplane. As of early 1989 the
`prototype consists of 2 HUBs and 4 CABs. The system will be expanded to about 30 CABs in Spring
`1989.
In the prototype, a node can be any system running UNIX or Mach [12] with a VME interface. The
`initial Nectar system at Carnegie Mellon will have Sun-3s, Sun-4s and Warp systems as nodes.
`To speed up hardware construction, the prototype uses only off-the-shelf parts and 16 x 16 crossbars.
`The serial to parallel conversion is performed by a pair of TAXI chips manufactured by Advanced Micro
`Devices. The effective bandwidth per fiber line is 100 megabits/second, a limit imposed by the TAXI
`chips.
When the prototype has demonstrated that the Nectar architecture and software work well for
`applications, we plan to re-implement the system in custom or semi-custom VLSI. This will lead to
`larger systems with higher performance and lower cost.
`
`4 The HUB
`
`The Nectar HUB establishes connections and passes messages between its input and output fiber lines.
`There are four design goals for the HUB:
`
`1. Low latency. The HUB provides custom hardware to minimize latency. In the prototype system,
`the latency to set up a connection and transfer the first byte of a packet through a single HUB is
`ten cycles (700 nanoseconds). Once a connection has been established, the latency to transfer a
`byte is five cycles (350 nanoseconds), but the transfer of multiple bytes is pipelined to match the
`100 megabits/second peak bandwidth of the fibers.
`
`2. High switching rate. In the prototype Nectar system, the HUB central controller can set up a new
`connection through the crossbar switch every 70 nanosecond cycle.
`
`5
`
`INTEL EX.1222.008
`
`
`
`3. Efficient support for multi-HUB systems. Because of the low switching and transfer latency of
`a single HUB, the latency of process to process communication in a multi-HUB system is not
`significantly higher. Flow control for inter-HUB communication is implemented in hardware (see
`Section 4.2.3).
`
`4. Flexibility and high efficiency. The HUB hardware implements a set of simple commands for the
`most frequently used operations such as opening and closing connections. These commands can be
`executed in one cycle by the central HUB controller. By sending different combinations of these
`simple commands to the HUB, CABs can implement more complicated datalink protocols such
`as multicast and multi-HUB connections. The HUB hardware is flexible enough to implement
`point-to-point and multicast connections using either circuit or packet switching.
`In addition,
`HUB commands can be used to implement various network management functions such as testing,
`reconfiguration, and recovery from hardware failures.
`
`4.1 HUB Overview
`The HUB has a number of I/O ports, each capable of connecting to a CAB or a HUB via a pair of fiber
`lines. The I/O port contains circuitry for optical to electrical and electrical to optical conversion. From
`the functional viewpoint, a port consists of an input queue and an output register as depicted in Figure 5.
`
`Figure 5: HUB overview
`
`The HUB has a crossbar switch, which can connect the input queue of a port to the output register of
`any other port (see Figure 7). An input queue can be connected to multiple output registers (for multicast),
`but only one input queue can be connected to an output register at a time. A status table is used to keep
`track of existing connections and to ensure that no new connections are made to output registers that
`are already in use. The status table is maintained by a central controller and can be interrogated by the
`CABs.
`The I/O port extracts commands from the incoming byte stream, and inserts replies to the commands
`in the outgoing byte stream. Commands that require serialization, such as establishing a connection,
`are forwarded to the central controller, while "localized" commands, such as breaking a connection, are
`executed inside the I/O port.
`
`6
`
`INTEL EX.1222.009
`
`
`
Figure 6: HUB packaging in the Nectar prototype
`
`For the prototype Nectar system, the HUB has 16 I/O ports. Two HUB I/O boards, each consisting of
`eight I/O ports, can be plugged into the HUB backplane. The backplane contains an 8-bit wide 16 x 16
`crossbar and the central controller. Each I/O port interfaces to a pair of fiber lines at the front of the
`I/O board. This packaging scheme is depicted in Figure 6. An additional instrumentation board can be
`plugged into the backplane, as shown in the figure; it can monitor and record events related to the crossbar
`and its controller.
`Each I/O board in the prototype uses 305 chips and has a typical power consumption of 110 watts;
`the boards are 15 x 17 inches. The backplane uses 92 chips for the 16 x 16 crossbar and 132 chips for the
`central controller. (47 chips in the crossbar and 20 chips in the controller are for hardware debugging.)
`The backplane has a typical power consumption of 70 watts.
`
`4.2 HUB Commands and Usage
`
`The HUB hardware supports 38 user commands and 14 supervisor commands for various datalink
`protocols. Supervisor commands are for system testing and reconfiguration purposes, whereas user
`commands are for operations concerning connections, locks, status, and flow control.
`For the Nectar prototype each command is a sequence of three bytes:
`
`command HUB ID param
`
`The first byte specifies a HUB command, the second byte specifies the HUB to which the command is
`directed, and the third byte is a parameter for the command, typically the ID of one of the ports on that
`HUB.
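As an illustration, the C sketch below packs one such command into its three-byte form. The opcode value and the code itself are placeholders, not the actual Nectar encoding.

/* Illustrative sketch of the 3-byte HUB command format described above.
 * The opcode value is a placeholder, not the actual Nectar encoding. */
#include <stdint.h>
#include <stdio.h>

#define OPEN_WITH_RETRY 0x01    /* placeholder opcode */

/* Pack one command: opcode, HUB ID, parameter (typically a port ID). */
static void pack_hub_command(uint8_t out[3], uint8_t op, uint8_t hub_id, uint8_t param)
{
    out[0] = op;        /* command */
    out[1] = hub_id;    /* HUB ID  */
    out[2] = param;     /* param   */
}

int main(void)
{
    uint8_t cmd[3];
    pack_hub_command(cmd, OPEN_WITH_RETRY, 2, 8);   /* "open with retry  HUB2  P8" */
    printf("%02x %02x %02x\n", cmd[0], cmd[1], cmd[2]);
    return 0;
}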
`In the following we mention some of the user commands and describe how they can be used to
`implement several datalink protocols. The four-HUB system depicted in Figure 7 will be used in all the
`examples.
`
`4.2.1 Circuit Switching
`
Using circuit switching, the entire route is set up first using a command packet, before a data packet is
transmitted. (A data packet is framed by start of packet and end of packet.) To send data from CAB3
`to CAB1, CAB3 first establishes the route by sending out the following command packet:
`
`7
`
`INTEL EX.1222.010
`
`
`
`Figure 7: Connections on a 4-HUB system
`
open with retry              HUB2   P8
open with retry and reply    HUB1   P8
`
`HUB2 will keep trying to open the connection from P4 to P8. After the connection is made, the open
`with retry and reply command is forwarded to HUB1 over the connection. After the connection from
`P3 to P8 in HUB1 is established, HUB1 sends a reply over the route established in the opposite direction,
`using another set of fiber lines, input queues, and output registers. By stealing cycles from these resources
`whenever necessary, the reply is never blocked and can reach CAB3 within a bounded amount of time.
`After receiving the reply, CAB3 knows that all the requested connections have been established. Then
`CAB3 sends data, followed by a close all command that travels over the established route.
`The close all command is recognized at the output register of each HUB in the route. After detecting
`the close all, the HUB closes the connection leading to the output register. Therefore all the connections
`will be closed after the data has flowed through them. Alternatively, close all can be replaced with a set
`of individual close commands, closing the connections in reverse order.
`If CAB3 does not receive a reply soon enough, it can try to get the connection status of the HUBs
`involved to find out what connections have been made, and can send another command packet requesting
`a different route (possibly starting from some existing connections). CAB3 can also decide to take down
`all the existing connections by using close all, and attempt to re-establish an entire route.
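The following C sketch illustrates the sequence just described for the CAB3-to-CAB1 route. The opcode and framing values are placeholders, and send_bytes and wait_for_reply stand in for CAB run-time routines that this paper does not specify.

/* Hedged sketch of the circuit-switched send from CAB3 to CAB1
 * (route: HUB2 P4 -> P8, then HUB1 P3 -> P8). */
#include <stdint.h>
#include <stddef.h>

#define OPEN_WITH_RETRY            0x01   /* placeholder opcodes */
#define OPEN_WITH_RETRY_AND_REPLY  0x02
#define CLOSE_ALL                  0x04
#define START_OF_PACKET            0xF0   /* placeholder framing values */
#define END_OF_PACKET              0xF1

extern void send_bytes(const uint8_t *buf, size_t len);   /* assumed: write to outgoing fiber */
extern int  wait_for_reply(unsigned timeout_usec);        /* assumed: 1 if HUB1's reply arrived */

static void send_cmd(uint8_t op, uint8_t hub, uint8_t port)
{
    uint8_t c[3] = { op, hub, port };
    send_bytes(c, 3);
}

int circuit_send_cab3_to_cab1(const uint8_t *data, size_t len)
{
    /* Command packet: open the two connections along the route. */
    send_cmd(OPEN_WITH_RETRY,           2, 8);   /* HUB2: connect to P8 */
    send_cmd(OPEN_WITH_RETRY_AND_REPLY, 1, 8);   /* HUB1: connect to P8, then reply */

    if (!wait_for_reply(1000))
        return -1;   /* no reply: query HUB status or try another route */

    /* Data packet, framed by start/end of packet, then tear the route down. */
    uint8_t sop = START_OF_PACKET, eop = END_OF_PACKET;
    send_bytes(&sop, 1);
    send_bytes(data, len);
    send_bytes(&eop, 1);
    send_cmd(CLOSE_ALL, 0, 0);
    return 0;
}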
`
`8
`
`INTEL EX.1222.011
`
`
`
`4.2.2 Circuit Switching for Multicasting
`
`To illustrate how multicasting is implemented, consider the case in which CAB2 wants to send a data
`packet to both CAB4 and CAB5, as depicted in Figure 7. To establish the required connections, CAB2
`can use the following command packet:
`
open with retry              HUB1   P6
open with retry and reply    HUB4   P5
open with retry              HUB4   P3
open with retry and reply    HUB3   P4
`
`After receiving replies to both of the open with retry and reply commands, CAB2 sends the data packet.
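Illustratively, the same four commands can be emitted as an array of three-byte sequences; as before, the opcode values and the send_bytes helper are placeholders rather than the actual Nectar interface.

/* Hedged sketch of the multicast route setup from CAB2 to CAB4 and CAB5. */
#include <stdint.h>
#include <stddef.h>

#define OPEN_WITH_RETRY            0x01   /* placeholder opcodes */
#define OPEN_WITH_RETRY_AND_REPLY  0x02

extern void send_bytes(const uint8_t *buf, size_t len);   /* assumed: write to outgoing fiber */

void open_multicast_route_from_cab2(void)
{
    /* One 3-byte command per connection, in the order given above. */
    static const uint8_t cmds[4][3] = {
        { OPEN_WITH_RETRY,           1, 6 },   /* HUB1  P6 */
        { OPEN_WITH_RETRY_AND_REPLY, 4, 5 },   /* HUB4  P5 */
        { OPEN_WITH_RETRY,           4, 3 },   /* HUB4  P3 */
        { OPEN_WITH_RETRY_AND_REPLY, 3, 4 },   /* HUB3  P4 */
    };
    send_bytes(&cmds[0][0], sizeof cmds);
    /* CAB2 then waits for both replies before sending the data packet. */
}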
`
`4.2.3 Packet Switching
`
`The HUB has two facilities to support packet switching:
`
`1. An input queue for each port. For the prototype Nectar system the length of the input queue, and
`thus the maximum packet size, is 1 kilobyte. (Circuit switching must be used for larger packets
`but, since the overhead of circuit setup is small compared to the packet transmission time, this does
`not add significantly to latency.)
`
2. Support for flow control. A ready bit is associated with each port of a HUB. The ready bit indicates
whether the input queue of the next HUB connected to it is ready to store a new packet. Consider
for example port P8 of HUB2 in Figure 7. This port is connected to port P3 of HUB1. If the ready
bit associated with P8 of HUB2 is 1, then the input queue of P3 of HUB1 is guaranteed to be ready
to store a new packet.
`
`The ready bit associated with each port is set to 1 initially. When start of packet is detected at the
`output register of the port, the ready bit is set to 0. Upon receipt of a signal from the next HUB
`indicating that the start of packet has emerged from the input queue connected to the port, the bit
`is set to 1.
`
`Suppose that CAB3 wants to send a data packet to CAB1 using the route shown in Figure 7. Using
`packet switching, CAB3 can send out the following packet:
`
`test open with retry HUB2 P8
`test open with retry HUB1 P8
`data
`close all
`
`The test open with retry command is used to enforce flow control. For example, the first test open
`with retry command ensures that HUB2 will not succeed in making the connection from P4 to P8 until
port P3 of HUB1 is ready to store the entire data packet. Otherwise HUB2 will keep trying to make
`the connection. Thus the packet is forwarded to the next HUB as soon as the input queue in that HUB
`becomes available.
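A corresponding sketch of the packet-switched send, under the same placeholder encodings and assumed send_bytes helper as above, is shown below; since the HUB input queue is 1 kilobyte, the data is assumed to fit in a single packet.

/* Hedged sketch of the packet-switched send from CAB3 to CAB1. */
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

#define TEST_OPEN_WITH_RETRY  0x03   /* placeholder opcodes */
#define CLOSE_ALL             0x04
#define START_OF_PACKET       0xF0   /* placeholder framing values */
#define END_OF_PACKET         0xF1
#define MAX_PACKET            1024   /* HUB input queue length in the prototype */

extern void send_bytes(const uint8_t *buf, size_t len);   /* assumed: write to outgoing fiber */

void packet_send_cab3_to_cab1(const uint8_t *data, size_t len)
{
    assert(len <= MAX_PACKET);

    /* Each test open succeeds only when the downstream input queue can
     * accept a whole packet (ready bit = 1), so the packet is forwarded
     * one HUB at a time as queues become available. */
    const uint8_t cmds[2][3] = {
        { TEST_OPEN_WITH_RETRY, 2, 8 },   /* HUB2  P8 */
        { TEST_OPEN_WITH_RETRY, 1, 8 },   /* HUB1  P8 */
    };
    send_bytes(&cmds[0][0], sizeof cmds);

    const uint8_t sop = START_OF_PACKET, eop = END_OF_PACKET;
    send_bytes(&sop, 1);
    send_bytes(data, len);
    send_bytes(&eop, 1);

    const uint8_t close_all[3] = { CLOSE_ALL, 0, 0 };
    send_bytes(close_all, sizeof close_all);
}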
`
`4.2.4 Packet Switching for Multicasting
`
`Consider again the multicasting example of Section 4.2.2. Using packet switching, CAB2 can use the
`following commands to multicast a packet to CAB4 and CAB5:
`
`9
`
`INTEL EX.1222.012
`
`
`
test open with retry         HUB1   P6
test open with retry         HUB4   P5
test open with retry         HUB4   P3
test open with retry         HUB3   P4
data
close all
`
`5 The CAB
`
The CAB is the interface between a node and the Nectar-net. It handles the transmission and reception of
data over the fibers connected to the network. Communication protocol processing is off-loaded from the
node to the CAB, thus freeing the node from the burden of handling packet interrupts, processing packet
headers, retransmitting lost packets, fragmenting large messages, and calculating checksums.
`
Figure 8: CAB block diagram
`
`5.1 CAB Design Issues
`The design of the CAB is driven by three requirements:
1. The CAB must be able to keep up with the transmission rate of the optical fibers (100 megabits/second
in each direction).
`2. The CAB should ensure that messages can be transmitted over the Nectar-net with low latency. The
`HUB can set up a connection and begin transferring a data packet in less than one microsecond.
`Thus any latency added by the CAB can contribute significantly to the overall latency of message
`transmission.
`3. The CAB should provide a flexible environment for the efficient implementation of protocols and
`selected applications. Specifically, it should be possible to implement a simple operating system on
`the CAB that allows multiple lightweight processes to share CAB resources.
`
`10
`
`INTEL EX.1222.013
`
`
`
`Together these design goals require that the CAB be able to handle incoming and outgoing data at
`the same time as meeting local processing needs. This is accomplished by including a hardware DMA
`controller on the CAB. The DMA controller is able to manage simultaneous data transfers between the
`incoming and outgoing fibers and CAB memory, as well as between VME and CAB memory, leaving the
`CAB CPU free for protocol and application processing.
`To meet the protocol and application processing requirements, we designed the CAB around a high
`performance RISC CPU and fast local memory. The choice of a high-speed CPU, rather than a custom
`microengine or lower performance CPU, distinguishes the CAB from many I/O controllers. The latest
`RISC chips are competitive with microsequencers in speed, but offer additional flexibility and a familiar
`development environment. Older general-purpose microprocessors would be unable to keep up with the
`protocol processing requirements at fiber speeds, let alone provide cycles for user tasks.
`Allowing application software to run on the CAB is important to many applications but has dangers. In
`particular, incorrect application software may corrupt CAB operating system data structures. To prevent
`such problems, the CAB provides memory protection on a per-page basis and hardware support for
`multiple protection domains.
`The CAB design also includes various devices to support high-speed communication: hardware
`checksum computation removes this burden from protocol software; hardware timers allow time-outs
`to be set by the software with low overhead.
`
`5.2 CAB Implementation
`
The prototype CAB implementation uses as its RISC CPU a SPARC processor running at 16 megahertz.
`A block diagram of the CAB is shown in Figure 8.
`A VME interface to the node was the natural choice in our environment, allowing Sun workstations
`and Warp systems to be used in the Nectar prototype. The initial CAB implementation supports a VME
`bandwidth of 10 megabytes/second, which is close to the speed of the current fiber interface.
`Two fibers (one for each direction) connect each CAB to the HUB. The fiber interface uses the same
circuit as the HUB I/O port (see Section 4.1). Data can be read or written to the fiber input or output
queue by the CPU, but for data transfers of more than small numbers of words the DMA controller should
be used to achieve higher transmission rates. The DMA controller also handles flow control during a
transfer: the DMA controller waits for data to arrive if the input queue is empty, or for data to drain if
the output queue is full.
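As a purely illustrative sketch, CAB software might start such a transfer by programming a few memory-mapped registers; the register layout, addresses, and bit values below are invented, since the paper does not describe the DMA controller's programming interface.

/* Invented register model of starting a DMA transfer from CAB data
 * memory to the outgoing fiber queue. */
#include <stdint.h>
#include <stddef.h>

#define DMA_BASE        0xFFF00000u   /* hypothetical register addresses */
#define DMA_SRC_ADDR    (*(volatile uint32_t *)(DMA_BASE + 0x0))
#define DMA_LENGTH      (*(volatile uint32_t *)(DMA_BASE + 0x4))
#define DMA_CONTROL     (*(volatile uint32_t *)(DMA_BASE + 0x8))
#define DMA_STATUS      (*(volatile uint32_t *)(DMA_BASE + 0xC))

#define DMA_DIR_TO_FIBER  0x1   /* data memory -> outgoing fiber queue */
#define DMA_START         0x2
#define DMA_BUSY          0x1

/* Copy len bytes from data memory to the fiber output queue.  The
 * controller itself stalls when the output queue fills, so software
 * does not have to watch the queue level. */
void dma_to_fiber(const void *src, size_t len)
{
    DMA_SRC_ADDR = (uint32_t)(uintptr_t)src;
    DMA_LENGTH   = (uint32_t)len;
    DMA_CONTROL  = DMA_DIR_TO_FIBER | DMA_START;

    while (DMA_STATUS & DMA_BUSY)
        ;   /* or continue protocol processing and take a completion interrupt */
}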
`The on-board CAB memory is split into two regions: one intended for use as program memory, the
`other as data memory. DMA transfers are supported for data memory only; transfers to and from program
`memory must be performed by the CPU. The memory architecture is thus optimized for the expected
`usage pattern, although still allowing code to be executed from data memory or packets to be sent from
`program memory.
`In the prototype, the total bandwidth of the data memory is 66 megabytes/second, sufficient to support
`the following concurrent accesses: CPU reads or writes, DMA to the outgoing fiber, DMA from the
`incoming fiber, and DMA to or from VME memory. The program memory has the same bandwidth as
`data memory and is thus able to sustain the peak CPU execution rate.
`The program memory region contains 128 kilobytes of PROM and 512 kilobytes of RAM. The data
`memory region contains 1 megabyte of RAM. Both memories are implemented using fast (35 nanosecond)
`static RAM. Using static RAM in the prototype rather than less expensive dynamic RAM plus caching
`was worth the additional cost—it allowed us to focus on the more innovative aspects of the design instead
`of expending effort on a cache.
`The CAB's memory protection facility allows each 1 kilobyte page to be protected separately. Each
`page of the CAB address space (including the CAB registers and devices) can be assigned any subset of
`
`11
`
`INTEL EX.1222.014
`
`
`
`read, write, and execute permissions. All accesses from the CAB CPU or from over the VME bus are
`checked in parallel with the operation so that no latency is added to memory accesses. The flexibility,
safety, and debugging support that memory protection affords the CAB software are worth the non-trivial
`cost in design time and board area.
`The memory protection includes hardware support for multiple protection domains, with a separate
`page protection table for each domain. Currently the CAB supports 32 protection domains. The assignment
`of protection domains is under the control of the CAB operating system kernel. The kernel can therefore
`ensure that the CAB system software is protected from user tasks and that user tasks are protected from
`one another. In addition, accesses from over the VME bus are assigned to a VME-specific protection
`domain.
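The following C fragment is a software model of the per-domain page protection lookup described above, with invented table names and sizes; on the CAB the check is performed by hardware in parallel with each access, so it adds no latency.

/* Software model of the per-domain, per-page protection check. */
#include <stdint.h>

#define PAGE_SIZE    1024u           /* 1-kilobyte protection granularity       */
#define NUM_DOMAINS  32u             /* protection domains in the prototype     */
#define NUM_PAGES    (16u * 1024u)   /* assumes a 16-megabyte CAB address space */

#define PERM_READ   0x1
#define PERM_WRITE  0x2
#define PERM_EXEC   0x4

/* One permission byte per page, per domain; managed by the CAB kernel. */
static uint8_t prot_table[NUM_DOMAINS][NUM_PAGES];

/* Does `domain` hold permission `perm` on the page containing `addr`?
 * A failed check would abort the access. */
int access_allowed(unsigned domain, uint32_t addr, uint8_t perm)
{
    uint32_t page = (addr % (NUM_PAGES * PAGE_SIZE)) / PAGE_SIZE;
    return (prot_table[domain % NUM_DOMAINS][page] & perm) == perm;
}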
`The CAB occupies a 24-bit region of the node's VME address space. Every device accessible to the
`CAB CPU is also visible to the node, allowing complete control of the CAB from the node. In normal
`operation, however, the node and CAB communicate through shared buffers, DMA, and VME interrupts.
`The CAB prototype is a 15 x 17 inch board, with a typical power consumption of 100 watts. Of the
`nearly 360 components on the densely packed board about 25% are for the data memory and DMA ports,
`15% for the VME interface, 15% for the CPU and program memory, and 13% for the I/O ports. The
`remaining 120 or so chips are divided among the DMA controller, CAB registers, hardware checksum
`computation, memory protection, and clocks and timers.
`
`6 Software
`
`The design of the Nectar software has two goals:
`
1. Minimize communication latency between user processes. Since processing overhead on the sending
`and receiving nodes accounts for most of the communication latency over local area networks
`[3,5,11], the software organization plays a critical role in reducing the latency. To achieve low
`latency, data copying and context switching must be minimized.
`
`2. Provide a flexible software environment on the CAB. This will convert a bare "protocol engine"
`into a customizable network interface. When appropriate, the CAB can also be used to off-load
`application tasks from the node.
`
`The software currently running on the prototype system consists of the CAB kernel, communica