The Memory-Integrated Network Interface

Ron Minnich
David Sarnoff Research Center

Dan Burns
Frank Hady
Supercomputing Research Center
`
Our zero-copy ATM Memory-Integrated Network Interface targets 1-Gbps bandwidth with
1.2-µs latency for applications that need to send or receive data at very frequent intervals.
Applications can send or receive packets without operating system support, initiate packet
transmission with one memory write, and determine packet arrival status. At the same time the
operating system can use MINI for communications and for supporting standard networking
software. Simulations show MINI “bounces” a single ATM cell in a round-trip time of 3.9 µs
at 10 Mbytes/s.
`
The bandwidth and high latency of cur-
`rent networks limit the set of cluster
`computing applications to those that
`require little use of the network.
`Running a distributed computation on a network
`of 100 or more machines for five minutes allows
`only a few seconds of communications if
`speedup is to reach acceptable levels. Thus such
`applications must require only a byte of com-
`munication for every thousand to ten thousand
`floating-point operations. While these sorts of low
`communication rate applications exist, studies of
other types of applications show a much higher
`frequency of communication with large amounts
`of data, requiring application-to-application laten-
`cy on the order of one microsecond.
`New high-bandwidth networks should widen
`the scope of cluster computing applications to
`include those requiring substantial communica-
`tions. Asynchronous transfer mode (ATM) in par-
`ticular provides a clear long-term path to
`gigabit-per-second bandwidths. However, the
`commercial network interfaces designed for
`these new networks are not suitable for a large
`set of distributed and cluster computing appli-
`cations. These interfaces require operating sys-
`tem interaction each time a message is sent or
`received, greatly increasing latency and decreas-
`ing throughput. We needed an interface that
`does not involve the operating system in send-
`ing or receiving messages but does support opti-
`
mized versions of TCP/IP (Transmission Control
Protocol/Internet Protocol) and NFS (Network
File System).
Our Memory-Integrated Network Interface
`(MINI) architecture integrates the network inter-
`face into the memory hierarchy, not the input/
`output hierarchy, as shown in Figure 1, next
`page. Typical implementations of network inter-
faces use the I/O bus. They suffer from band-
`width and latency limitations imposed by DMA
start-up times, slow I/O cycle times (compared to
the CPU cycle times), limited I/O address space,
`multiple data copying, and CPU bus contention.
`Placing the network interface on the other side of
`main memory avoids these problems.
`MINI meets certain application-driven perfor-
`mance goals. That is, we based the architecture
of the interface on our experiences with appli-
cations that have succeeded (and failed) in a clus-
ter and a distributed programming environment.
`We specified the performance shown later for
`two point-to-point connected workstations. We
`did not consider a switched system in the goals,
`as the ATM switch market remains immature.
`The goals are as follows:
`
1. 1-µs application-to-application latency for
`small, single ATM cell messages. (Measured
`from the initiation of a host send command
`until the data is loaded into the target node’s
`memory, excluding fiber delay.)
`
2. 1-Gbps sustained application-to-application bandwidth
for large (~4-Kbyte) messages.
3. A multiuser environment. (Measurements on 1 and 2
take place in this environment.)
4. Interoperability with an existing network standard
(ATM).
`5. Support for high-performance (zero-copy) TCP/IP and
`NFS.
`6. Direct network access to 256 Mbytes of main memory,
`64 Mbytes mapped at a time.
`7. Nonblocking polling of message receipt status.
`8. Support for many virtual channels (1K to 4K).
`
`Hardware architecture
`Figure 2 shows the major components of the MINI archi-
`tecture. MINI has several memory-addressable areas that allow
`communication between the host and network interface. The
`physical layer interface card (PIC) block controls host access
`to these memory-addressable areas, converting the DRAM
`accesses issued by the host. This block is the only part of the
`architecture that must be redesigned when porting the design
`to another host. Note that we currently use a 72-pin SIMM bus,
`the timing and pinout of which is standard on most worksta-
`tions and personal computers.
`Before using the interface, a process must reserve and ini-
`tialize channels into the network. MINI allows user process-
`es or the operating system to have one or more channels,
`each corresponding to an ATM virtual channel (VC), into the
`network. The VC/CRC (cyclic redundancy check) control
`table contains two control word entries
`for each VC, a send and a receive. Each
`control word contains a pointer to a dou-
`ble-page area in main memory, the VC
`status, and the ATM cell header informa-
`tion. The operating system initializes con-
`trol words whenever a process requests a
`channel into the network. Also, part of
`the channel initialization process allocates
`to the requesting process the double page
indicated by the control word. MINI sup-
`ports a maximum of 4K VCs when page
`size is 4 Kbytes, 2K VCs for 8-Kbyte
`pages, and 1K VCs for 16-Kbyte pages.
`Setting up the channel is relatively slow,
`since it involves creating an ATM circuit
`with a remote host; once complete, the
`channel operates with very little overhead.
`User processes or the operating system
`can initiate a send with a write to either
`the high- or low-priority FIFO buffer.
`The MINI control logic uses the infor-
`mation in the high- or low-priority FIFO
`along with the proper entry of the VC
`control table to read the required data
`directly from main memory, segment the
`data into ATM Adaptation Layer 5 cells,
`and send it into the network. (AAL5
specifies ATM cells for computer-to-
computer communications.)

Figure 1. MINI network interface hierarchy.

Figure 2. MINI architecture.

Figure 3. Host memory and maps: the memory map and page maps cover main memory (64 to 256 Mbytes), the 16-Mbyte VC/CRC control table (send and receive control entries and CRCs), and the high-/low-priority FIFO page.

Figure 4. Double-page data handling.

An update
`of the VC control table send word notifies the sending
`process of a completed send. The control logic reassembles
`cells arriving from the network directly into main memory at
`a location stored in the VC control table. User processes or
`the operating system can check arrival status with VC con-
`trol table reads at any time.
`The main memory map of a machine featuring a MINI inter-
`face appears in Figure 3. The structure of the memory map
`is key to allowing a user process to send and receive data
`without operating system involvement in a multiuser envi-
`ronment. At channel setup time, the control logic allocates
`six pages to a channel. Two pages in main memory space
`hold the receive data and two more hold the send data. One
page within the VC/CRC control table space is user read-only
`and initialized by the operating system at setup time. This
`page contains the send and receive control words, and a send
`CRC and receive CRC. The final page allows a user process
to write to the high- or low-priority FIFO.

Table 1. The VC table transmit control word.

Item                        Bits
Upper VCI bits              4 to 6
GFC*                        4
Stop offset                 10 to 12
Double-page pointer         13 to 15
Current offset              10 to 12
Header HEC                  8
Last cell header HEC**      8

*Generic flow control
**Header error correction

`
Table 2. The VC table receive control word.

Item                        Bits
PDU count                   8
Status                      7
Dropped-cell count          10 (MSB is sticky)
Double-page pointer         13 to 15
Current offset              10 to 12
Stop offset                 10 to 12

Using the address bus to indicate the channel number on which the send is to take place achieves multiuser protection.
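
The control words of Tables 1 and 2 can be pictured as a pair of C records. The following is a minimal sketch, assuming 4-Kbyte host pages (10-bit offsets, 15-bit double-page pointer); the field names, ordering, and bit-field packing are illustrative only, not the hardware layout.

#include <stdint.h>

struct mini_tx_word {             /* VC table transmit control word (Table 1) */
    uint64_t upper_vci   : 4;     /* upper VCI bits */
    uint64_t gfc         : 4;     /* generic flow control */
    uint64_t stop_offset : 10;    /* word offset at which sending stops */
    uint64_t dp_pointer  : 15;    /* double-page pointer into main memory */
    uint64_t cur_offset  : 10;    /* advanced after each cell is sent */
    uint64_t header_hec  : 8;     /* header error correction */
    uint64_t last_hec    : 8;     /* HEC for the final cell of a PDU */
};

struct mini_rx_word {             /* VC table receive control word (Table 2) */
    uint64_t pdu_count   : 8;     /* increments per completed AAL5 PDU */
    uint64_t status      : 7;     /* allocation and error bits */
    uint64_t dropped     : 10;    /* dropped-cell count; MSB is sticky */
    uint64_t dp_pointer  : 15;
    uint64_t cur_offset  : 10;
    uint64_t stop_offset : 10;
};
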
`VC/CRC control table. This dual-port SRAM table holds
`setup and status information for each channel. Information
`used to send messages into the network remains in the 64-
`bit transmit word, while information used upon receiving
`data stays in the 64-bit receive word. The transmit control
`word (see Table 1) includes a pointer to a double page allo-
`cated by the operating system as the channel's data space.
`MINI sends data from within the double page, as shown in
`Figure 4, and generates ATM headers using the data stored
`in the send word.
`Sending starts at the current offset, which is updated after
`each ATM cell transmits. MINI stops sending after reaching
`the stop offset. When MINI reaches the end of a page of
`memory during a send or receive, it wraps to the beginning
`of the double page. It can place the data in page 1, with the
`header at the end of page 0 (right before the data) and the
`trailer at the start of page 0 (right after the data). This layout
`permits sending and receiving data in a page-aligned man-
`ner, even if a header and trailer are present. An example of
`its use appears in the NFS discussion later.
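
The wraparound can be expressed in one line. The sketch below assumes 4-Kbyte pages (512 64-bit words per page, 1,024 per double page); the helper name is hypothetical.

#include <stdint.h>

#define WORDS_PER_DOUBLE_PAGE 1024u   /* two 4-Kbyte pages of 64-bit words */

/* Word offset within the double page of the i-th word of a transfer
 * that begins at start_off; offsets simply wrap around. */
static inline uint32_t mini_wrap_offset(uint32_t start_off, uint32_t i)
{
    return (start_off + i) % WORDS_PER_DOUBLE_PAGE;
}

With this layout, a header written near the end of page 0 is followed by page-aligned data at offset 512 (the start of page 1), and a trailer written after the data wraps to offset 0.
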
`The receive control word (Table 2) includes a current off-
`set, stop offset, and double-page pointer used as described
`for the transmit word. A protocol data unit (PDU) count incre-
ments each time a full AAL5 message is received, giving the
`process using the channel a convenient place to poll for mes-
`sage arrival. When the interface drops arriving cells, possibly
`due to contention for main memory, the drops are recorded
`in the dropped cell count field. This field features a "sticky"
`most significant bit so that no drops go unnoticed. Status bits
`denote whether the channel represented by the control word
`has been allocated and record possible errors. They also
`determine whether to place arriving cells on 6-word bound-
aries (packed) or on 8-word boundaries with AAL5 CRCs and
`real-time clock values placed in two extra words.
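
A sketch of the resulting nonblocking poll follows; the local structure repeats only the fields read here, and its packing is again illustrative. A process remembers the last PDU count it saw and compares.

#include <stdbool.h>
#include <stdint.h>

struct rx_poll_view {             /* the receive-word fields polled here */
    uint64_t pdu_count : 8;
    uint64_t dropped   : 10;      /* MSB is sticky */
};

/* True once a new AAL5 message has arrived on the channel.  The 8-bit
 * count wraps, so compare for inequality rather than ordering. */
static inline bool mini_msg_arrived(const volatile struct rx_poll_view *rx,
                                    unsigned last_seen)
{
    return rx->pdu_count != (last_seen & 0xffu);
}

/* True if any cell on the channel has ever been dropped (sticky MSB). */
static inline bool mini_cells_dropped(const volatile struct rx_poll_view *rx)
{
    return (rx->dropped & 0x200u) != 0;
}
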
`Tables 1 and 2 contain a number of fields specified as
`ranges. The size of the pages used by the host determines the
`size of the current and stop offsets held. A 512-word page (4
`Kbytes) requires an offset of only 10 bits while a 2K-word (16
`Kbytes) page requires a 12-bit offset. MINI provides address
`space for 256 Mbytes of network-mappable memory. This
`requires a 15-bit double-page pointer for 4-Kbyte pages and
`a 13-bit double-page pointer for 16-Kbyte pages. MINI allows
`64 Mbytes of memory to be mapped into the network at any
one time. This number follows from MINI's support for 4K
`channels, each referencing a pair of 8-Kbyte double pages.
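
The ranges in Tables 1 and 2 follow directly from the page size; this small, self-contained check (names hypothetical) reproduces the 10- to 12-bit offsets and 13- to 15-bit double-page pointers quoted above.

#include <stdio.h>

static unsigned log2u(unsigned x) { unsigned n = 0; while (x > 1) { x >>= 1; n++; } return n; }

int main(void)
{
    for (unsigned page = 4096; page <= 16384; page *= 2) {
        unsigned words_per_page = page / 8;                         /* 64-bit words */
        unsigned offset_bits    = log2u(2 * words_per_page);        /* offset spans the double page */
        unsigned pointer_bits   = log2u((256u << 20) / (2 * page)); /* 256 Mbytes of double pages */
        printf("%5u-byte pages: %2u-bit offsets, %2u-bit double-page pointer\n",
               page, offset_bits, pointer_bits);
    }
    return 0;
}
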
`The same dual-port SRAM used to hold the VC control words
also holds a transmit and receive AAL5 CRC value for each VC.
`Each time a cell passes across the network interface bus, the
`AAL5 CRC calculator fetches the channel's CRC value. (This
occurs in parallel with the network interface controller's fetch of
the VC control word.) The AAL5 CRC calculator updates the
`channel's CRC to include the passing cell, and then stores the
`result in the CRC table (again in parallel with the VC control
`word store). This method of calculating the CRC allows for
`arbitrary interleaving of cells from different VCs.
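
The mechanism can be pictured with a short sketch: each cell folds its payload into the channel's stored partial CRC, so cells from different VCs can interleave freely. The sketch uses a plain bitwise form of the CRC-32 polynomial (0x04C11DB7); the real AAL5 bit ordering, initialization, and final complement are omitted, and the function names are hypothetical.

#include <stdint.h>
#include <stddef.h>

static uint32_t crc32_update(uint32_t crc, const uint8_t *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint32_t)buf[i] << 24;
        for (int b = 0; b < 8; b++)
            crc = (crc & 0x80000000u) ? (crc << 1) ^ 0x04C11DB7u : crc << 1;
    }
    return crc;
}

/* Per cell: fetch the channel's partial CRC, fold in the 48-byte AAL5
 * payload, and store it back alongside the VC control word. */
void mini_update_vc_crc(uint32_t *vc_crc_table, unsigned vci,
                        const uint8_t payload[48])
{
    vc_crc_table[vci] = crc32_update(vc_crc_table[vci], payload, 48);
}
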
`High- and low-priority FIFOs. A user or operating sys-
`tem host write to either the high- or low-priority FIFO initi-
`ates message sends. During this write, MINI takes the start
`and stop offsets within the channel's double page from the
`host data bus. For user process-initiated sends, MINI latch-
es the virtual channel identifier (VCI) on which the send is
`to take place from the address bus. Therefore a process can
`initiate a write on a given VC only if it gains write access to
`the proper address. A special case of VCI equal to zero allows
`the kernel to access all channels without having to gain
`access (and fetch TLB entries) to a separate page for each
`channel. When the host address bus indicates a request to
`send on channel zero, MINI latches the VC number from the
`data bus rather than the address bus. The kernel always occu-
`pies channel zero.
`Two FIFOs assure that low-latency messages can be sent
`even when the interface is currently sending very large mes-
`sages. A process performing bulk data transfer will send large
`messages using the low-priority FIFO. Another process can
`still send a small, low-latency message by writing to the high-
`priority FIFO. MINI will interrupt any low-priority sends
`between ATM cells, send the high-priority message, and then
`
`resume sending the low-priority message.
`MINI does not stall unsuccessful writes to the command
`FIFOs. To determine if a send command write was success-
`ful, the process can read the send command status word
`associated with each channel.
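
In user-level code a send therefore reduces to a single store, as in this hypothetical sketch; the packing of the start and stop offsets into the written word is an assumption.

#include <stdint.h>

/* fifo_page is this channel's mapped high- or low-priority FIFO page;
 * MINI latches the VCI from the address of the page being written and
 * the offsets from the data bus. */
static inline void mini_send(volatile uint32_t *fifo_page,
                             uint32_t start_off, uint32_t stop_off)
{
    *fifo_page = (stop_off << 16) | (start_off & 0xffffu);
}

Because an unsuccessful write is not stalled, the caller would then read the channel's send command status word to confirm the command was accepted.
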
`Pseudo dual-port memory. This section of the host's
`main memory is directly accessible to the network interface.
`We decided to make this space very large, possibly encom-
`passing all of the host's memory. Due to its size, storage for
`this block must be constructed from dynamic RAM.
Since dual-port DRAMs do not exist, we will initially con-
struct this block with off-the-shelf SIMMs and multiplexers,
forming a pseudo dual-port main memory. The network interface
`will take control of the memory during unused processor-
`memory cycles. When the processor accesses main memo-
`ry, MINI will relinquish its use of the SIMMs in time for the
`processor to identify a normal access.
`Arbitration for use of main memory is much less complex
`when the host has a memory busy line back to the CPU.
When the CPU initiates a main memory access, MINI could
`assert the memory busy line, causing the CPU to wait until
`MINI completes its memory access.
`The best method for sharing main memory between the
`host and network interface would be to implement a true
`dual-port DRAM. Such a DRAM would simultaneously fetch
`two rows and decode two column addresses. Collisions for
`the same row would cause the second requester, either the
`host or network interface, to wait. Unfortunately, such chips
`do not yet exist.
`MINI control logic. The state machine within the MINI
`control logic moves data between the transmit and receive
`FIFOs and main memory, updating VC/CRC table entries,
`processing transmit commands, and segmenting or reassem-
`bling ATM cells. The MINI control logic operates on one ATM
`cell at a time and services the received cells first. This mini-
`mizes the probability of receive logic FIFO overflow. When
`no received cells are present, MINI moves cells from main
`memory to the transmit logic in response to send commands
`from the host.
`Cell receipt involves the following steps:
`
`1. Using the VCI in the received cell as an index into the
`VC/CRC control table, MINI fetches the channel's
`receive control word and CRC.
`2. The receive control word forms a main memory address,
`and MINI initiates a store and updates the CRC.
`3. When the pseudo dual-port main memory begins the
`requested write, MINI increments the offset counter,
`requests the next write, and updates the CRC.
`4. Step 3 repeats until six words have been written to main
`memory.
`5. MINI stores the updated receive control word and CRC
in the VC/CRC control table.
`
`
When the final cell of an AAL5 PDU is received, MINI
`stores two extra words (the locally calculated AAL5 CRC and
`the real-time clock value) at main memory addresses suc-
`ceeding the final data words. Also on a final cell, MINI ini-
`tializes the channel’s CRC control table entry. If the flag called
`Pack Arriving Cells has not been set, MINI stores the real-
`time clock value and the current AAL5 CRC value after each
`cell. Note that if the stop offset is reached while storing
`incoming cells, no further writes to the main memory occur,
though VC/CRC control table updates still take place.
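
A compact software model of this receive sequence follows, assuming 4-Kbyte pages (10-bit offsets); mem[] stands in for the pseudo dual-port main memory, the structure packing is illustrative, and the CRC fold is a placeholder rather than the real AAL5 CRC.

#include <stdint.h>

#define WORDS_PER_CELL 6                 /* 48-byte AAL5 payload as 64-bit words */

struct rx_word_view {                    /* subset of Table 2; packing illustrative */
    uint64_t dp_pointer  : 15;
    uint64_t cur_offset  : 10;
    uint64_t stop_offset : 10;
};

static uint32_t crc_fold(uint32_t crc, uint64_t w)   /* placeholder, not the AAL5 CRC */
{
    return crc ^ (uint32_t)w ^ (uint32_t)(w >> 32);
}

void mini_receive_cell(struct rx_word_view *rx_table, uint32_t *crc_table,
                       uint64_t *mem, unsigned vci,
                       const uint64_t payload[WORDS_PER_CELL])
{
    struct rx_word_view rx = rx_table[vci];            /* step 1: fetch control word and CRC */
    uint32_t crc = crc_table[vci];

    for (int w = 0; w < WORDS_PER_CELL; w++) {          /* steps 2-4: store words, update CRC */
        if (rx.cur_offset != rx.stop_offset) {           /* past the stop offset, writes are dropped */
            mem[((uint64_t)rx.dp_pointer << 10) + rx.cur_offset] = payload[w];
            rx.cur_offset = (rx.cur_offset + 1) & 0x3ffu; /* wrap within the 1,024-word double page */
        }
        crc = crc_fold(crc, payload[w]);
    }

    rx_table[vci] = rx;                                 /* step 5: store updated word and CRC */
    crc_table[vci] = crc;
}
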
`Cell transmission involves the following steps:
`
1. MINI uses the VCI held in the high- or low-priority FIFO to read
the transmit word and CRC from the VC/CRC control table.
`2. MINI sends the ATM cell’s header, formed using infor-
mation from the VC control table transmit word, to the
`transmit FIFO while a main memory read is requested.
`3. When granted use of main memory, the network inter-
`face initiates a read from the address specified by the
`FIFO and VC control table contents.
`4. MINI sends the word read from main memory to both
`the transmit FIFO and the CRC logic, increments the off-
`set counter, and initiates a read from the next address
`location.
5. Step 4 repeats until six words have been read from main
`memory.
`6. MINI stores the updated VC control send word and CRC
in the VC/CRC control table, and the VCI within the net-
`work interface controller for use on future cell sends.
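
A matching software model of the transmit sequence, under the same assumptions as the receive sketch (4-Kbyte pages, illustrative packing, simplified header formation):

#include <stdint.h>

#define WORDS_PER_CELL 6

struct tx_word_view {                    /* subset of Table 1; packing illustrative */
    uint64_t dp_pointer : 15;
    uint64_t cur_offset : 10;
    uint64_t header_hec : 8;
};

/* Produce one header word plus six payload words for the transmit FIFO. */
void mini_transmit_cell(struct tx_word_view *tx_table, const uint64_t *mem,
                        unsigned vci, uint64_t cell_out[1 + WORDS_PER_CELL])
{
    struct tx_word_view tx = tx_table[vci];             /* step 1: fetch the transmit word */
    cell_out[0] = ((uint64_t)vci << 8) | tx.header_hec; /* step 2: header from the control word (simplified) */

    for (int w = 0; w < WORDS_PER_CELL; w++) {           /* steps 3-5: read six words from the double page */
        cell_out[1 + w] = mem[((uint64_t)tx.dp_pointer << 10) + tx.cur_offset];
        tx.cur_offset = (tx.cur_offset + 1) & 0x3ffu;
    }

    tx_table[vci] = tx;                                  /* step 6: store the updated word */
}
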
`
Control/status, transmit, and receive logic. The
control/status block contains a 100-ns-resolution real-time clock
`readable by both the host and network interface, along with
`board configuration and test logic. The transmit and receive
`logic blocks contain FIFOs that perform asynchronous trans-
`fer of ATM cells between the physical layer interface and the
`network interface.
`Packaging. Implementation of the MINI architecture
`requires the six different boards shown in Figure 5. Two of
`the boards, the Finisar transmit and receive modules, are off-
`the-shelf parts required to implement the physical link into
`the network. These modules have a 16-bit interface that runs
`at 75 MHz. The PIC board is the network interface to the
`Finisar conversion board that translates 64-bit network inter-
`face words into Finisar 16-bit words. The network interface
`board contains all the hardware necessary to implement the
`MINI architecture. The main memory is off-the-shelf, 16-
`Mbyte SIMMs residing on the network interface board. SIMM
`extenders plug into the host’s SIMM slots, extending the
`memory bus to the network interface board.
`The separation of the PIC functions from the network inter-
`face board eases the migration of the design to other hosts or
physical media. For instance, to use the design over different
physical media, designers would replace the PIC board and
Finisar modules. To change from an Indy host to a Sun, DEC,
or HP system, designers would redesign host-dependent por-
tions of the network interface board, using the same 64-bit
interface to the PIC. This eliminates the need to redesign the
physical layer.

Figure 5. MINI boards: Finisar transmit and receive modules, PIC board, network interface board with SIMMs, and SIMM extenders.

`Flow control. MINI provides hardware support for the
`low-level form of flow control developed by Seitz to keep the
`network from dropping cells.’ MINI sends a stop message
`from a receiving node to the transmitting node (or switch port)
`when the receiving FIFO fills past some predetermined high-
`water mark. Once the receiving FIFO is less full, the receiv-
`ing node restarts transmission by sending a start message.
`The MINI architecture also supports software-implemented
`flow control. In the event that it becomes necessary to drop
`incoming cells, the MINI hardware keeps a count of the num-
`ber of cells dropped on a per-VC basis within the VC control
`table. Real-time clock values automatically passed within mes-
`sages may also prove useful.
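
The start/stop discipline can be summarized in a few lines; the thresholds and the send_ctrl_cell() helper below are hypothetical, standing in for the hardware's stop and start messages.

#include <stdbool.h>

enum ctrl { CTRL_STOP, CTRL_START };

struct rx_flow {
    unsigned high_water, low_water;   /* FIFO occupancy thresholds, in cells */
    bool stopped;
};

/* Call whenever the receive FIFO occupancy changes. */
void flow_control_update(struct rx_flow *f, unsigned fifo_cells,
                         void (*send_ctrl_cell)(enum ctrl))
{
    if (!f->stopped && fifo_cells >= f->high_water) {
        send_ctrl_cell(CTRL_STOP);     /* tell the sender (or switch port) to pause */
        f->stopped = true;
    } else if (f->stopped && fifo_cells <= f->low_water) {
        send_ctrl_cell(CTRL_START);    /* FIFO has drained; restart transmission */
        f->stopped = false;
    }
}
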
`Performance. A detailed VHDL simulation of the initial
`MINI design gathered some performance statistics.
Throughput. ATM cells are packaged and transmitted at a
rate of one cell every 400 ns (20 MINI control logic state
`machine states), or 0.96 Gbps. Note that this is the real data
`transfer rate. The total bit transfer rate, including ATM head-
er information, is 1.12 Gbps. The PIC board removes the cells
64 bits at a time from a transmit FIFO at a rate of 1.2 Gbps
`and sends them 16 bits at a time (at 75 MHz) to a Finisar
`
transmit module. This module serializes and encodes the
data, adding 4 bits to every 16, and optically sends the data
out the fiber link at a rate of 1.5 Gbaud.

Figure 6. A typical network interface data management structure. Each queue represents a TCP connection or UDP port.

`Latency. A single cell message can be sent from one host
to another (assuming almost no link delay) in 1.2 µs. This
`time breaks down as follows: the host’s high-priority FIFO
`write requires 230 ns, and the cell send takes 400 ns. To get
`the head of the cell through the transmit logic, the physical
`layer interface out, the physical layer interface in, and the
`receive logic requires 270 ns, while the cell receive (includ-
`ing setting the VC control word flag notifying the receiving
`software) needs 300 ns.
`
`Software architecture
`Software tasks on traditional interface architectures include
`buffer pool management, interrupt handling, and device I/O
`(especially for DMA devices). These software activities can
`consume as much time as it takes to move the packets over
`the wire. Our measurements on several architectures show
that the cumulative overhead of these activities is a minimum
`of several hundred microseconds per Ethernet packet; no
`workstation we have measured has even come close to 250
µs. At MINI data rates, one 4K packet can arrive every 30 µs.
`Most current network interfaces use the model of a ring of
`buffers, managed by a combination of operating system soft-
`ware and interface hardware, which are either emptied or
`filled by the network interface. When buffers become full or
`empty, the interface interrupts the processor to signal buffer
`availability. The processor can then use the available buffers
`as needed, reusing them for output or attaching them to other
`queues as input. In these interfaces, the DMA engine is a sin-
`gle-threaded resource, which means the system software
`must manage the time-sharing of that resource. Mapping vir-
`tual to physical addresses is also a large source of overhead.
`Figure 6 gives an example of a ring buffer system. The
`protocol software manages queues of incoming and outgo-
`ing data. Each queue represents either a UDP (User Datagram
`Protocol) port or a TCP socket. As ring buffer slots become
available, the software copies data (or “loans” it) to or from
`the ring buffers so that the DMA engine can access it. The
`management and control of these buffers and interrupts is
`time-consuming, so that fielding an interrupt for each pack-
`et or even every 10 packets at Gbit rates is impractical.
`All the overheads required to manage a traditional net-
`work interface add up, sometimes to a surprisingly large
number. For example, on an SBus interface we built at SRC
`we could obtain only 20 Mbytes/s of the interface’s poten-
`tial 48-Mbytes/s bandwidth; DMA setup and operating sys-
`tem overhead consumed the other 28 Mbytes/s.
`We are determined to reduce MINI overhead by reducing
`the complexity of the data path from the network to the appli-
`cation. In the best case, processes on different machines should
`be only a few memory writes away from each other in time.
`This low latency is difficult to accomplish if there are queues
`and linked lists between the communications processes.
`MINI supports up to 4,096 DMA channels, one per virtu-
`al circuit, with each channel mapped to an 8K area (two 4K
`pages). Part of the process of setting up a virtual circuit
`involves allocating the area of memory associated with that
`circuit. We intend to eliminate buffer pool management over-
`head by, in effect, turning the buffer pools sideways. For
`applications or protocols that need more than 8K of receive
`or send buffering, we expect the application to allocate mul-
`tiple ATM channels. Thus, instead of looking down a long
`pipe at a stream of buffers coming in from an interface, one
`can imagine looking at an array of buffers standing shoulder-
to-shoulder, as shown in Figure 7.
`Removing ring buffers and queues from the software
`model of the interface allows us to greatly simplify the hard-
`ware and software that controls the interface. The simple
`interface has much less work to do for each packet, and
`hence can run faster. The result is that we achieve our laten-
`cy and throughput goals.
`
Figure 7. How MINI uses multiple ATM circuits to support buffering on a connection or port: several incoming and outgoing virtual circuits map to one TCP connection or UDP port.
`
`The design of the MINI architecture explicitly considers
`communication protocols. MINI provides low-level message
`status information including corruption or loss of data noti-
`fication, time of arrival, and network link activity. MINI allows
`access to data with very little overhead. Users can directly
`access data as well as set start and stop offsets for message
`sends or receives without operating system involvement.
`Through its allocation of double-page regions of memory,
`MINI reduces copying and conserves main-memory band-
`width. These features allow efficient implementation of stan-
dard communication protocols such as SunRPC, TCP, and NFS
`as well as specific user-programmed protocols. A more
`detailed description of NFS along with a set of examples
`describing different uses of MINI follow.
Zero-copy NFS protocol. This protocol is the basis of the
`NFS, which permits file sharing in a networked environment.
`NFS is an integral part of most Unix workstation-based net-
`works. Our measurements show that over 60 percent of the
`data that flows on our networks is NFS read and write traffic;
`well over 66 percent of the packets are NFS control packets.
`Proper optimization of higher level protocols such as NFS is
`essential to achieving high network performance.
`From the beginning, the MINI team included support for
`a highly optimized NFS within the MINI architecture. This
`support, along with changes we have identified to the NFS
`packet structure, will allow NFS to use our Gbps network.
`One of the most important optimizations ensures that NFS
`does not unnecessarily copy data. With MINI, NFS will not
`copy; data passes directly from the network to the correct
`place in memory and comes from its original location in
`memory directly into the network. We call this optimization
`zero-copy NFS, since the data is never copied.
`
Figure 8. A typical NFS read reply or write request packet.
`
`The fact that a given NFS implementation uses a zero-copy
`TCP or UDP does not imply that the NFS protocol itself is zero-
`copy. One of the optimizations that makes zero-copy proto-
`cols work well is data alignment on page boundaries. But
`since the data and header portions of an NFS packet look like
`data to the lower levels of the protocol stack (that is, TCP),
`the NFS packet may be page-aligned while the actual NFS data
`is not (requiring copying so that it is aligned by page). Even
`when lower level protocols benefit from page-aligned data,
`higher level protocols may still pay data-copying penalties.
`To take advantage of MINI’S support for zero-copy NFS,
`we changed the format of the NFS packets. To explain why,
`we must first give a quick overview of NFS packet formats.
`Figure 8 shows typical packets for NFS versions two and
`three. NFS packets consist of either client requests, such as
`read or write requests; and server replies, which are answers
to requests. Since NFS uses either TCP/IP or UDP/IP, the first
`part of the packet is a standard IP header. Following this
header is the NFS section, which consists of a standard NFS
header (the same for all NFS requests, regardless of type),
a request-dependent part (a set of parameters which
is different for every type of request and reply), and data.
`The length of the data field equals zero on everything but
`NFS write requests and read replies.
`A high-performance implementation requires that NFS data
`begin and end on a fixed boundary. In a standard NFS, vari-
`able-length data follows variable-length header information.
`Moreover, due to the manner in which the header is encod-
`ed, the software cannot easily determine its size. The head-
`er’s embedded-length information (much more in version
`three than in version two) cannot be practically decoded
`with hardware. The only way to have the data begin and
`end on fixed memory boundaries is to reorder the fields in
`the NFS packets so that data is positioned at a fixed offset in
`the packet and to fix the length of the data.
Figure 9 shows the modified NFS packet with
`a fixed-size NFSAP header, followed by fixed-size data
`placed in the middle of the packet, followed by the variable-
`length NFS header. Because MINI works with packets that
`start in the middle of a buffer and wrap around to the top,
we place the packet so that the data is aligned by page, as
shown in Figure 10. When the software sends the packet
over the fiber link, it sends the fixed-size NFS header first, fol-
lowed by the data, and then by the variable-size trailer.
Packets move over the network and arrive at MINI, header
first. These packets are then stored so that the trailer comes
first in memory, followed by the header, and then by the
data.

Figure 9. Modified NFS packet with fixed-size data in the middle (fixed-size IP header and NFS reply/request parts, then data, then variable-size NFS attribute information).

Figure 10. Placement of the modified NFS packet in MINI memory.
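
As a rough sketch of the reordered layout, a read reply carrying one page of data might look like the structure below; the sizes and field names are illustrative, not the revised NFS format.

#include <stdint.h>

struct mini_nfs_read_reply {
    uint8_t fixed_header[128];   /* fixed-size IP + NFS header, sent first */
    uint8_t data[4096];          /* fixed-size, page-aligned data block */
    uint8_t trailer[256];        /* variable-length NFS attribute trailer (bounded), sent last */
};

With the double-page wraparound, MINI can store the trailer at the start of page 0, the fixed header just before the page boundary, and the data page-aligned in page 1, so no copy is needed to realign the data.
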
Virtual multicast. Because the data associated with a VC
`can come from any page pair in memory, we can associate
`a single page with many VCs. The data can be stored once,
`and then sent on each VC by simply writing to the high- or
low-priority FIFO. Thus the cost of sending the same N-byte
message on M channels is N + M memory writes. This virtu-
`al multicast technique has some of the advantages of multi-
`cast and is fairly easy to use.
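
A hypothetical sketch of virtual multicast: every channel's transmit control word already points at the same double page, so the payload is stored once and one FIFO write per channel sends it. The offset packing matches the earlier hypothetical mini_send() sketch.

#include <stdint.h>
#include <string.h>

void mini_multicast(volatile uint32_t *fifo_pages[], unsigned nchannels,
                    uint64_t *shared_double_page,
                    const uint64_t *msg, uint32_t nwords)
{
    memcpy(shared_double_page, msg, nwords * sizeof(uint64_t));  /* store the data once */
    for (unsigned c = 0; c < nchannels; c++)                     /* one write per channel */
        *fifo_pages[c] = (nwords << 16) | 0u;                    /* stop offset | start offset */
}
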
`Barrier synchronization. Many parallel applications
`must synchronize component processes to stabilize the state
`of the application prior to moving to a new activity. A mes-
`sage-passing machine usually accomplishes this synchro-
`nization by creating a master process. Each slave process to
`be synchronized sends a message to the master process. After
`receiving a message from each slave process, the master
`
`sends a message to each slave process informing the process
`that it may proceed. This technique is typically called barri-
`er synchronization.
`Single-cell messages are particularly useful for barrier syn-
`chronization. The component processes set up links to the
`master program. When they are ready to synchronize, they
`send one cell to the master. The master waits as the single
`cell messages come in, until a cell has been received from
`each component process. The receive time per cell is 300
`ns. The master can then use virtual multicast to send one
“go” cell to all the other processes, at a cost of 400 ns per cell.
`Thus barrier synchronization using this simple algorithm
`costs 700 ns per participant.
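
The algorithm fits in a few lines; the pdu_count() and send_go() helpers below are hypothetical, standing in for the receive control word read and the FIFO write.

#include <stddef.h>

void barrier_master(unsigned nslaves,
                    unsigned (*pdu_count)(unsigned slave),   /* read a slave channel's PDU count */
                    void (*send_go)(unsigned slave),         /* one-cell send on a slave's channel */
                    unsigned last_count[])
{
    for (unsigned s = 0; s < nslaves; s++)                   /* wait for one cell from each slave */
        while (pdu_count(s) == last_count[s])
            ;                                                 /* nonblocking poll; ~300 ns per arrived cell */

    for (unsigned s = 0; s < nslaves; s++) {
        last_count[s] = pdu_count(s);
        send_go(s);                                           /* virtual multicast of the "go" cell; ~400 ns each */
    }
}
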
`Network graphics example. We used an array of work-
`stations, assembled into a custom rack, to display pictures as
`though they were one large bitmap;’ this example of network
`graphics illustrates some of the software capabilities. To sup-
`port this large display, we broke the problem into
`
`1. drawing on a 16-Mpixel virtual bitmap (16 displays
`worth of image), and
2. copying from a piece of the large virtual bitmap onto
`the frame buffer of each of the 16 physical displays.
`
`A renderer program draws on the large virtual bitmap,
`viewing the bitmap as a large frame buffer, while a painter
`program copies graphics from the large virtual bitmap to the
individual display. We used the Mether-NFS (MNFS) dis-
`tributed shared-memory system to support the large virtual
`bitmap. MNFS supports a shared memory that ca