The Memory-Integrated Network Interface

Ron Minnich
David Sarnoff Research Center

Dan Burns
Frank Hady
Supercomputing Research Center

Our zero-copy ATM Memory-Integrated Network Interface targets 1-Gbps bandwidth with 1.2-µs latency for applications that need to send or receive data at very frequent intervals. Applications can send or receive packets without operating system support, initiate packet transmission with one memory write, and determine packet arrival. At the same time, the operating system can use MINI for communications and for supporting standard networking software. Simulations show MINI "bounces" a single ATM cell in a round-trip time of 3.9 µs at 10 Mbytes/s.

The bandwidth and high latency of current networks limit the set of cluster computing applications to those that require little use of the network. Running a distributed computation on a network of 100 or more machines for five minutes allows only a few seconds of communications if speedup is to reach acceptable levels. Thus such applications must require only a byte of communication for every thousand to ten thousand floating-point operations. While these sorts of low-communication-rate applications exist, studies of other types of applications¹ show a much higher frequency of communication with large amounts of data, requiring application-to-application latency on the order of one microsecond.

New high-bandwidth networks should widen the scope of cluster computing applications to include those requiring substantial communications. Asynchronous transfer mode (ATM) in particular provides a clear long-term path to gigabit-per-second bandwidths. However, the commercial network interfaces designed for these new networks are not suitable for a large set of distributed and cluster computing applications. These interfaces require operating system interaction each time a message is sent or received, greatly increasing latency and decreasing throughput. We needed an interface that does not involve the operating system in sending or receiving messages but does support optimized versions of TCP/IP (Transmission Control Protocol/Internet Protocol) and NFS (Network File System).

Our Memory-Integrated Network Interface (MINI) architecture integrates the network interface into the memory hierarchy, not the input/output hierarchy, as shown in Figure 1. Typical implementations of network interfaces use the I/O bus. They suffer from bandwidth and latency limitations imposed by DMA start-up times, slow I/O cycle times (compared to the CPU cycle times), limited I/O address space, multiple data copying, and CPU bus contention. Placing the network interface on the other side of main memory avoids these problems.

MINI meets certain application-driven performance goals. That is, we based the architecture of the interface on our experiences with applications that have succeeded (and failed) in a cluster and a distributed programming environment. We specified the performance shown later for two point-to-point connected workstations. We did not consider a switched system in the goals, as the ATM switch market remains immature. The goals are as follows:

1. 1-µs application-to-application latency for small, single ATM cell messages. (Measured from the initiation of a host send command until the data is loaded into the target node's memory, excluding fiber delay.)
2. 1-Gbps sustained application-to-application bandwidth for large (~4-Kbyte) messages.
3. A multiuser environment. (Measurements on 1 and 2 take place in this environment.)
4. Interoperability with an existing network standard (ATM).
5. Support for high-performance (zero-copy) TCP/IP and NFS.
6. Direct network access to 256 Mbytes of main memory, 64 Mbytes mapped at a time.
7. Nonblocking polling of message receipt status.
8. Support for many virtual channels (1K to 4K).

Hardware architecture
Figure 2 shows the major components of the MINI architecture. MINI has several memory-addressable areas that allow communication between the host and network interface. The physical layer interface card (PIC) block controls host access to these memory-addressable areas, converting the DRAM accesses issued by the host. This block is the only part of the architecture that must be redesigned when porting the design to another host. Note that we currently use a 72-pin SIMM bus, the timing and pinout of which is standard on most workstations and personal computers.

Before using the interface, a process must reserve and initialize channels into the network. MINI allows user processes or the operating system to have one or more channels, each corresponding to an ATM virtual channel (VC), into the network. The VC/CRC (cyclic redundancy check) control table contains two control word entries for each VC, a send and a receive. Each control word contains a pointer to a double-page area in main memory, the VC status, and the ATM cell header information. The operating system initializes control words whenever a process requests a channel into the network. Also, part of the channel initialization process allocates to the requesting process the double page indicated by the control word. MINI supports a maximum of 4K VCs when page size is 4 Kbytes, 2K VCs for 8-Kbyte pages, and 1K VCs for 16-Kbyte pages. Setting up the channel is relatively slow, since it involves creating an ATM circuit with a remote host; once complete, the channel operates with very little overhead.

User processes or the operating system can initiate a send with a write to either the high- or low-priority FIFO buffer.

The MINI control logic uses the information in the high- or low-priority FIFO along with the proper entry of the VC control table to read the required data directly from main memory, segment the data into ATM Adaptation Layer 5 cells, and send it into the network. (AAL5 specifies ATM cells for computer-to-computer communications.) An update of the VC control table send word notifies the sending process of a completed send. The control logic reassembles cells arriving from the network directly into main memory at a location stored in the VC control table. User processes or the operating system can check arrival status with VC control table reads at any time.

Figure 1. MINI network interface hierarchy.
Figure 2. MINI architecture.
Figure 3. Host memory and maps.
Figure 4. Double-page data handling.

The main memory map of a machine featuring a MINI interface appears in Figure 3. The structure of the memory map is key to allowing a user process to send and receive data without operating system involvement in a multiuser environment. At channel setup time, the control logic allocates six pages to a channel. Two pages in main memory space hold the receive data and two more hold the send data. One page within the VC/CRC control table space is user read-only and initialized by the operating system at setup time. This page contains the send and receive control words, and a send CRC and receive CRC. The final page allows a user process to write to the high- or low-priority FIFO. Using the address bus to indicate the channel number on which the send is to take place achieves multiuser protection.

Table 1. The VC table transmit control word.

    Item                      Bits
    Upper VCI bits            4 to 6
    GFC*                      4
    Stop offset               10 to 12
    Double-page pointer       13 to 15
    Current offset            10 to 12
    Header HEC                8
    Last cell header HEC**    8

    *Generic flow control
    **Header error correction

Table 2. The VC table receive control word.

    Item                      Bits
    PDU count                 8
    Status                    7
    Dropped-cell count        10 (MSB is sticky)
    Double-page pointer       13 to 15
    Current offset            10 to 12
    Stop offset               10 to 12

VC/CRC control table. This dual-port SRAM table holds setup and status information for each channel. Information used to send messages into the network remains in the 64-bit transmit word, while information used upon receiving data stays in the 64-bit receive word. The transmit control word (see Table 1) includes a pointer to a double page allocated by the operating system as the channel's data space. MINI sends data from within the double page, as shown in Figure 4, and generates ATM headers using the data stored in the send word.

Sending starts at the current offset, which is updated after each ATM cell transmits. MINI stops sending after reaching the stop offset. When MINI reaches the end of a page of memory during a send or receive, it wraps to the beginning of the double page. It can place the data in page 1, with the header at the end of page 0 (right before the data) and the trailer at the start of page 0 (right after the data). This layout permits sending and receiving data in a page-aligned manner, even if a header and trailer are present. An example of its use appears in the NFS discussion later.

The receive control word (Table 2) includes a current offset, stop offset, and double-page pointer used as described for the transmit word. A protocol data unit (PDU) count increments each time a full AAL5 message is received, giving the process using the channel a convenient place to poll for message arrival. When the interface drops arriving cells, possibly due to contention for main memory, the drops are recorded in the dropped-cell count field. This field features a "sticky" most significant bit so that no drops go unnoticed. Status bits denote whether the channel represented by the control word has been allocated and record possible errors. They also determine whether to place arriving cells on 6-word boundaries (packed) or on 8-word boundaries with AAL5 CRCs and real-time clock values placed in two extra words.

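To make the polling interface concrete, the sketch below shows how a user process might check a channel's receive control word. The article specifies only that the word holds a PDU count, status bits, a sticky dropped-cell counter, and the offsets; the field positions, macro names, and function signature here are assumptions for illustration, not the actual MINI layout.

    #include <stdint.h>

    /* Assumed packing: PDU count in bits 0-7, status in bits 8-14,
     * dropped-cell count in bits 15-24 (its MSB is the sticky bit). */
    #define RX_PDU_COUNT(w)  ((unsigned)((w) & 0xff))
    #define RX_STATUS(w)     ((unsigned)(((w) >> 8) & 0x7f))
    #define RX_DROPPED(w)    ((unsigned)(((w) >> 15) & 0x3ff))
    #define RX_DROP_STICKY   0x200u          /* sticky MSB of the dropped-cell count */

    /* Poll a channel for a newly arrived AAL5 PDU.  'rx_word' points at the
     * read-only receive entry in the VC/CRC control table page mapped into
     * the process; 'last_count' is the PDU count seen on the previous poll.
     * Returns 1 if a complete message has arrived since then.             */
    int mini_poll_rx(const volatile uint64_t *rx_word,
                     unsigned *last_count, int *saw_drops)
    {
        uint64_t w = *rx_word;                        /* one memory read, no system call */

        *saw_drops = (RX_DROPPED(w) & RX_DROP_STICKY) != 0;   /* cells were lost */

        unsigned count = RX_PDU_COUNT(w);
        if (count != *last_count) {
            *last_count = count;                      /* data is already in the double page */
            return 1;
        }
        return 0;
    }
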
Tables 1 and 2 contain a number of fields specified as ranges. The size of the pages used by the host determines the size of the current and stop offsets held. A 512-word page (4 Kbytes) requires an offset of only 10 bits, while a 2K-word page (16 Kbytes) requires a 12-bit offset. MINI provides address space for 256 Mbytes of network-mappable memory. This requires a 15-bit double-page pointer for 4-Kbyte pages and a 13-bit double-page pointer for 16-Kbyte pages. MINI allows 64 Mbytes of memory to be mapped into the network at any one time. This number follows from MINI's support for 4K channels, each referencing a pair of 8-Kbyte double pages.

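For one concrete configuration, this sizing argument pins down every field width. The sketch below spells it out for 4-Kbyte pages (512-word pages, 4K VCs): it illustrates the arithmetic only, since the article gives the widths (Tables 1 and 2) but not the actual bit packing; the field order and the reserved padding are assumptions.

    #include <stdint.h>

    /* 4-Kbyte pages: 512-word page -> 10-bit offsets; 256 Mbytes of 8-Kbyte
     * double pages -> 15-bit double-page pointer; 4K VCs -> 12-bit channel
     * index, leaving 4 upper bits of the 16-bit ATM VCI.                  */
    struct mini_tx_word_4k {            /* 64-bit transmit control word (Table 1) */
        uint64_t upper_vci     : 4;     /* upper bits of the ATM VCI              */
        uint64_t gfc           : 4;     /* generic flow control                   */
        uint64_t stop_offset   : 10;    /* word offset at which sending stops     */
        uint64_t dpage_pointer : 15;    /* which 8-Kbyte double page in memory    */
        uint64_t curr_offset   : 10;    /* word offset of the next cell payload   */
        uint64_t header_hec    : 8;     /* header error check for the cell header */
        uint64_t last_hec      : 8;     /* header error check for the last cell   */
        uint64_t reserved      : 5;     /* assumed padding to fill 64 bits        */
    };

    struct mini_rx_word_4k {            /* 64-bit receive control word (Table 2)  */
        uint64_t pdu_count     : 8;     /* increments per complete AAL5 PDU       */
        uint64_t status        : 7;     /* allocated / error / packing flags      */
        uint64_t dropped       : 10;    /* dropped-cell count, MSB is sticky      */
        uint64_t dpage_pointer : 15;
        uint64_t curr_offset   : 10;
        uint64_t stop_offset   : 10;
        uint64_t reserved      : 4;     /* assumed padding to fill 64 bits        */
    };
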
The same dual-port SRAM used to hold the VC control words also holds a transmit and receive AAL5 CRC value for each VC. Each time a cell passes across the network interface bus, the AAL5 CRC calculator fetches the channel's CRC value. (This occurs in parallel with the network interface controller fetch of the VC control word.) The AAL5 CRC calculator updates the channel's CRC to include the passing cell, and then stores the result in the CRC table (again in parallel with the VC control word store). This method of calculating the CRC allows for arbitrary interleaving of cells from different VCs.

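The per-VC CRC state is what makes interleaving safe: each cell folds into its own channel's running CRC-32, no matter which channel's cell preceded it on the bus. The sketch below models this in software; crc32_update() is only a stand-in for the AAL5 CRC-32 hardware, and the flat table indexed by VC is an assumption.

    #include <stdint.h>
    #include <stddef.h>

    #define MINI_MAX_VC  4096
    #define CELL_PAYLOAD 48                     /* AAL5 payload bytes per ATM cell */

    static uint32_t crc_state[MINI_MAX_VC];     /* one running CRC-32 per VC */

    /* Software stand-in for the CRC-32 used by AAL5 (polynomial 0x04C11DB7). */
    static uint32_t crc32_update(uint32_t crc, const uint8_t *p, size_t n)
    {
        while (n--) {
            crc ^= (uint32_t)*p++ << 24;
            for (int i = 0; i < 8; i++)
                crc = (crc & 0x80000000u) ? (crc << 1) ^ 0x04C11DB7u : crc << 1;
        }
        return crc;
    }

    /* Called once per cell, in arrival (or transmission) order.  Because the
     * CRC state is fetched and stored per VC, cells from different VCs may be
     * freely interleaved without corrupting any channel's checksum.         */
    void mini_crc_per_cell(unsigned vc, const uint8_t payload[CELL_PAYLOAD])
    {
        crc_state[vc] = crc32_update(crc_state[vc], payload, CELL_PAYLOAD);
    }
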
High- and low-priority FIFOs. A user or operating system host write to either the high- or low-priority FIFO initiates message sends. During this write, MINI takes the start and stop offsets within the channel's double page from the host data bus. For user process-initiated sends, MINI latches the virtual channel identifier (VCI) on which the send is to take place from the address bus. Therefore a process can initiate a write on a given VC only if it gains write access to the proper address. A special case of VCI equal to zero allows the kernel to access all channels without having to gain access (and fetch TLB entries) to a separate page for each channel. When the host address bus indicates a request to send on channel zero, MINI latches the VC number from the data bus rather than the address bus. The kernel always occupies channel zero.

Two FIFOs assure that low-latency messages can be sent even when the interface is currently sending very large messages. A process performing bulk data transfer will send large messages using the low-priority FIFO. Another process can still send a small, low-latency message by writing to the high-priority FIFO. MINI will interrupt any low-priority sends between ATM cells, send the high-priority message, and then resume sending the low-priority message.

MINI does not stall unsuccessful writes to the command FIFOs. To determine if a send command write was successful, the process can read the send command status word associated with each channel.

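Put together, a user-level send reduces to a single store to the channel's FIFO page, with the offsets carried on the data bus and the VCI implied by the page's address, followed by an optional status read. The sketch below is only an illustration of that idea: the command word layout, the status bit, and the structure names are assumptions, not the documented MINI encoding.

    #include <stdint.h>

    /* Per-channel addresses as the operating system mapped them into this
     * process at channel-setup time (illustrative).                        */
    struct mini_channel {
        volatile uint64_t *fifo_word;       /* write here to queue a send command  */
        const volatile uint64_t *tx_word;   /* read-only send control/status word  */
    };

    /* Queue a send of the words between start_off and stop_off within the
     * channel's double page: one memory write, no system call, no copy.
     * The VCI comes from the address of the FIFO page itself; channel 0 is
     * reserved for the kernel, which supplies the VC number on the data bus. */
    void mini_send(struct mini_channel *ch, unsigned start_off, unsigned stop_off)
    {
        /* Assumed command layout: start offset in the low half, stop offset
         * in the high half of the 64-bit store.                             */
        *ch->fifo_word = ((uint64_t)stop_off << 32) | start_off;
    }

    /* MINI does not stall a write to a full command FIFO, so the process
     * checks the per-channel send status word to see whether it was taken. */
    int mini_send_accepted(struct mini_channel *ch)
    {
        return (int)((*ch->tx_word >> 63) & 1);   /* assumed "command accepted" bit */
    }
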
Pseudo dual-port memory. This section of the host's main memory is directly accessible to the network interface. We decided to make this space very large, possibly encompassing all of the host's memory. Due to its size, storage for this block must be constructed from dynamic RAM.

Since dual-port DRAMs do not exist, we will initially construct this block with off-the-shelf SIMMs and multiplexers, forming a pseudo dual-port main memory. The network interface will take control of the memory during unused processor-memory cycles. When the processor accesses main memory, MINI will relinquish its use of the SIMMs in time for the processor to perform a normal access.

Arbitration for use of main memory is much less complex when the host has a memory busy line back to the CPU. When the CPU initiates a main memory access, MINI could assert the memory busy line, causing the CPU to wait until MINI completes its memory access.

The best method for sharing main memory between the host and network interface would be to implement a true dual-port DRAM. Such a DRAM would simultaneously fetch two rows and decode two column addresses. Collisions for the same row would cause the second requester, either the host or network interface, to wait. Unfortunately, such chips do not yet exist.

MINI control logic. The state machine within the MINI control logic moves data between the transmit and receive FIFOs and main memory, updating VC/CRC table entries, processing transmit commands, and segmenting or reassembling ATM cells. The MINI control logic operates on one ATM cell at a time and services the received cells first. This minimizes the probability of receive logic FIFO overflow. When no received cells are present, MINI moves cells from main memory to the transmit logic in response to send commands from the host.

Cell receipt involves the following steps:

1. Using the VCI in the received cell as an index into the VC/CRC control table, MINI fetches the channel's receive control word and CRC.
2. The receive control word forms a main memory address, and MINI initiates a store and updates the CRC.
3. When the pseudo dual-port main memory begins the requested write, MINI increments the offset counter, requests the next write, and updates the CRC.
4. Step 3 repeats until six words have been written to main memory.
5. MINI stores the updated receive control word and CRC in the VC/CRC control table.

When the final cell of an AAL5 PDU is received, MINI stores two extra words (the locally calculated AAL5 CRC and the real-time clock value) at main memory addresses succeeding the final data words. Also on a final cell, MINI initializes the channel's CRC control table entry. If the flag called Pack Arriving Cells has not been set, MINI stores the real-time clock value and the current AAL5 CRC value after each cell. Note that if the stop offset is reached while storing incoming cells, no further writes to the main memory occur, though VC/CRC control table updates still take place.

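In software terms, the receive sequence amounts to a read-modify-write of the channel's table entry wrapped around a burst of six word stores, plus the final-cell handling just described. The sketch below is a behavioral model of those steps, not the hardware: the structure, the helper functions, and the wrap and stop-offset details are assumptions made to keep the example self-contained.

    #include <stdint.h>

    #define WORDS_PER_CELL 6                /* 48-byte AAL5 payload as 64-bit words */

    struct vc_entry {                       /* per-VC receive state (behavioral model) */
        uint64_t *dpage;                    /* base of the channel's double page   */
        unsigned  curr_off, stop_off, page_words;
        unsigned  pdu_count;
        uint32_t  crc;
        int       pack_cells;               /* "Pack Arriving Cells" flag          */
    };

    extern uint32_t crc32_update64(uint32_t crc, uint64_t w);   /* hypothetical word-wide CRC helper */
    extern uint64_t mini_clock(void);                           /* 100-ns-resolution real-time clock */

    void mini_receive_cell(struct vc_entry *vc,
                           const uint64_t payload[WORDS_PER_CELL],
                           int last_cell_of_pdu)
    {
        /* Steps 1-2: the VCI indexed this entry; its receive word and CRC were fetched. */
        for (int i = 0; i < WORDS_PER_CELL; i++) {              /* steps 3-4 */
            if (vc->curr_off != vc->stop_off) {                 /* past stop: suppress memory writes */
                vc->dpage[vc->curr_off] = payload[i];
                vc->curr_off = (vc->curr_off + 1) % (2 * vc->page_words);   /* wrap in double page */
            }
            vc->crc = crc32_update64(vc->crc, payload[i]);      /* table updates still happen */
        }
        if (last_cell_of_pdu || !vc->pack_cells) {              /* two extra words per PDU (or per cell) */
            vc->dpage[vc->curr_off]     = vc->crc;              /* locally calculated AAL5 CRC  */
            vc->dpage[vc->curr_off + 1] = mini_clock();         /* arrival timestamp            */
            vc->curr_off = (vc->curr_off + 2) % (2 * vc->page_words);
        }
        if (last_cell_of_pdu) {
            vc->pdu_count++;                                    /* what user code polls on      */
            vc->crc = 0;                                        /* re-initialize the channel CRC */
        }
        /* Step 5: the updated receive word and CRC go back to the VC/CRC table. */
    }
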
Cell transmission involves the following steps:

1. MINI uses the VCI held in the high- or low-priority FIFO to read the send word and CRC from the VC/CRC control table.
2. MINI sends the ATM cell's header, formed using information from the VC control table send word, to the transmit FIFO while a main memory read is requested.
3. When granted use of main memory, the network interface initiates a read from the address specified by the FIFO and VC control table contents.
4. MINI sends the word read from main memory to both the transmit FIFO and the CRC logic, increments the offset counter, and initiates a read from the next address location.
5. Step 4 repeats until six words have been read from main memory.
6. MINI stores the updated VC control send word and CRC in the VC/CRC control table, and the VCI within the network interface controller for use on future cell sends.

Control/status, transmit, and receive logic. The control/status block contains a 100-ns-resolution, real-time clock readable by both the host and network interface, along with board configuration and test logic. The transmit and receive logic blocks contain FIFOs that perform asynchronous transfer of ATM cells between the physical layer interface and the network interface.

Packaging. Implementation of the MINI architecture requires the six different boards shown in Figure 5. Two of the boards, the Finisar transmit and receive modules, are off-the-shelf parts required to implement the physical link into the network. These modules have a 16-bit interface that runs at 75 MHz. The PIC board is the network interface-to-Finisar conversion board that translates 64-bit network interface words into Finisar 16-bit words. The network interface board contains all the hardware necessary to implement the MINI architecture. The main memory is off-the-shelf, 16-Mbyte SIMMs residing on the network interface board. SIMM extenders plug into the host's SIMM slots, extending the memory bus to the network interface board.

Figure 5. MINI boards.

The separation of the PIC functions from the network interface board eases the migration of the design to other hosts or physical media. For instance, to use the design over different physical media, designers would replace the PIC board and Finisar modules. To change from an Indy host to a Sun, DEC, or HP system, designers would redesign host-dependent portions of the network interface board, using the same 64-bit interface to the PIC. This eliminates the need to redesign the physical layer.

Flow control. MINI provides hardware support for the low-level form of flow control developed by Seitz to keep the network from dropping cells.² MINI sends a stop message from a receiving node to the transmitting node (or switch port) when the receiving FIFO fills past some predetermined high-water mark. Once the receiving FIFO is less full, the receiving node restarts transmission by sending a start message.

The MINI architecture also supports software-implemented flow control. In the event that it becomes necessary to drop incoming cells, the MINI hardware keeps a count of the number of cells dropped on a per-VC basis within the VC control table. Real-time clock values automatically passed within messages may also prove useful.

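A minimal model of this start/stop flow control appears below: the receive FIFO's occupancy is compared against a high-water mark on the way up and a low-water mark on the way down, and a stop or start message is sent on each crossing. The threshold values, message codes, and function names are illustrative; the article specifies only the high-water mark and the start/stop messages.

    /* Behavioral sketch of the link-level start/stop flow control.
     * Called whenever the receive FIFO occupancy changes.             */
    enum fc_msg { FC_STOP, FC_START };
    extern void send_fc_message(enum fc_msg m);   /* to the upstream node or switch port */

    #define FIFO_DEPTH  64            /* cells; illustrative                     */
    #define HIGH_WATER  48            /* send STOP when occupancy rises past it  */
    #define LOW_WATER   16            /* send START once it has drained below it */

    void fifo_occupancy_changed(unsigned occupancy, int *stopped)
    {
        if (!*stopped && occupancy >= HIGH_WATER) {
            send_fc_message(FC_STOP);     /* ask the transmitter to pause */
            *stopped = 1;
        } else if (*stopped && occupancy <= LOW_WATER) {
            send_fc_message(FC_START);    /* FIFO is less full; resume    */
            *stopped = 0;
        }
    }
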
Performance. A detailed VHDL simulation of the initial MINI design gathered some performance statistics.

Throughput. ATM cells are packaged and transmitted at a rate of one cell every 400 ns (20 MINI control logic state machine states), or 0.96 Gbps. Note that this is the real data transfer rate. The total bit transfer rate, including ATM header information, is 1.12 Gbps. The PIC board removes the cells 64 bits at a time from a transmit FIFO at a rate of 1.2 Gbps and sends them 16 bits at a time (at 75 MHz) to a Finisar transmit module. This module serializes and encodes the data, adding 4 bits to every 16, and optically sends the data out the fiber link at a rate of 1.5 Gbaud.

Figure 6. A typical network interface data management structure. Each queue represents a TCP connection or UDP port.

Latency. A single-cell message can be sent from one host to another (assuming almost no link delay) in 1.2 µs. This time breaks down as follows: the host's high-priority FIFO write requires 230 ns, and the cell send takes 400 ns. To get the head of the cell through the transmit logic, the physical layer interface out, the physical layer interface in, and the receive logic requires 270 ns, while the cell receive (including setting the VC control word flag notifying the receiving software) needs 300 ns.

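Summing the four stages confirms the quoted figure:

    230 ns + 400 ns + 270 ns + 300 ns = 1,200 ns = 1.2 µs.
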
Software architecture
Software tasks on traditional interface architectures include buffer pool management, interrupt handling, and device I/O (especially for DMA devices). These software activities can consume as much time as it takes to move the packets over the wire. Our measurements on several architectures show that the cumulative overhead of these activities is a minimum of several hundred microseconds per Ethernet packet; no workstation we have measured has even come close to 250 µs. At MINI data rates, one 4-Kbyte packet can arrive every 30 µs.

Most current network interfaces use the model of a ring of buffers, managed by a combination of operating system software and interface hardware, which are either emptied or filled by the network interface. When buffers become full or empty, the interface interrupts the processor to signal buffer availability. The processor can then use the available buffers as needed, reusing them for output or attaching them to other queues as input. In these interfaces, the DMA engine is a single-threaded resource, which means the system software must manage the time-sharing of that resource. Mapping virtual to physical addresses is also a large source of overhead.

Figure 6 gives an example of a ring buffer system. The protocol software manages queues of incoming and outgoing data. Each queue represents either a UDP (User Datagram Protocol) port or a TCP socket. As ring buffer slots become available, the software copies data (or "loans" it) to or from the ring buffers so that the DMA engine can access it. The management and control of these buffers and interrupts is time-consuming, so that fielding an interrupt for each packet, or even every 10 packets, at Gbit rates is impractical.

All the overheads required to manage a traditional network interface add up, sometimes to a surprisingly large number. For example, on an Sbus interface we built at SRC, we could obtain only 20 Mbytes/s of the interface's potential 48-Mbytes/s bandwidth; DMA setup and operating system overhead consumed the other 28 Mbytes/s.

We are determined to reduce MINI overhead by reducing the complexity of the data path from the network to the application. In the best case, processes on different machines should be only a few memory writes away from each other in time. This low latency is difficult to accomplish if there are queues and linked lists between the communicating processes.

MINI supports up to 4,096 DMA channels, one per virtual circuit, with each channel mapped to an 8-Kbyte area (two 4-Kbyte pages). Part of the process of setting up a virtual circuit involves allocating the area of memory associated with that circuit. We intend to eliminate buffer pool management overhead by, in effect, turning the buffer pools sideways. For applications or protocols that need more than 8 Kbytes of receive or send buffering, we expect the application to allocate multiple ATM channels. Thus, instead of looking down a long pipe at a stream of buffers coming in from an interface, one can imagine looking at an array of buffers standing shoulder to shoulder, as shown in Figure 7.

Removing ring buffers and queues from the software model of the interface allows us to greatly simplify the hardware and software that controls the interface. The simple interface has much less work to do for each packet, and hence can run faster. The result is that we achieve our latency and throughput goals.

Figure 7. How MINI uses multiple ATM circuits to support buffering on a connection or port.

The design of the MINI architecture explicitly considers communication protocols. MINI provides low-level message status information, including corruption or loss-of-data notification, time of arrival, and network link activity. MINI allows access to data with very little overhead. Users can directly access data as well as set start and stop offsets for message sends or receives without operating system involvement. Through its allocation of double-page regions of memory, MINI reduces copying and conserves main-memory bandwidth. These features allow efficient implementation of standard communication protocols such as SunRPC, TCP, and NFS, as well as specific user-programmed protocols. A more detailed description of NFS along with a set of examples describing different uses of MINI follow.

Zero-copy NFS protocol. This protocol is the basis of NFS, which permits file sharing in a networked environment. NFS is an integral part of most Unix workstation-based networks. Our measurements show that over 60 percent of the data that flows on our networks is NFS read and write traffic; well over 66 percent of the packets are NFS control packets. Proper optimization of higher level protocols such as NFS is essential to achieving high network performance.

From the beginning, the MINI team included support for a highly optimized NFS within the MINI architecture. This support, along with changes we have identified to the NFS packet structure, will allow NFS to use our Gbps network. One of the most important optimizations ensures that NFS does not unnecessarily copy data. With MINI, NFS will not copy; data passes directly from the network to the correct place in memory and comes from its original location in memory directly into the network. We call this optimization zero-copy NFS, since the data is never copied.

Figure 8. A typical NFS read reply or write request packet.

The fact that a given NFS implementation uses a zero-copy TCP or UDP does not imply that the NFS protocol itself is zero-copy. One of the optimizations that makes zero-copy protocols work well is data alignment on page boundaries. But since the data and header portions of an NFS packet look like data to the lower levels of the protocol stack (that is, TCP), the NFS packet may be page-aligned while the actual NFS data is not (requiring copying so that it is aligned by page). Even when lower level protocols benefit from page-aligned data, higher level protocols may still pay data-copying penalties.

To take advantage of MINI's support for zero-copy NFS, we changed the format of the NFS packets. To explain why, we must first give a quick overview of NFS packet formats. Figure 8 shows typical packets for NFS versions two and three. NFS packets consist of either client requests, such as read or write requests, or server replies, which are answers to requests. Since NFS uses either TCP/IP or UDP/IP, the first part of the packet is a standard IP header. Following this header is the NFS section, which consists of a standard NFS header (the same for all NFS requests, regardless of type), the request-dependent part (a set of parameters that is different for every type of request and reply), and data. The length of the data field equals zero on everything but NFS write requests and read replies.

A high-performance implementation requires that NFS data begin and end on a fixed boundary. In a standard NFS, variable-length data follows variable-length header information. Moreover, due to the manner in which the header is encoded, the software cannot easily determine its size. The header's embedded length information (much more in version three than in version two) cannot be practically decoded with hardware. The only way to have the data begin and end on fixed memory boundaries is to reorder the fields in the NFS packets so that data is positioned at a fixed offset in the packet and to fix the length of the data.

Figure 9 shows the modified NFS packet with a fixed-size NFS/IP header, followed by fixed-size data placed in the middle of the packet, followed by the variable-length NFS header. Because MINI works with packets that start in the middle of a buffer and wrap around to the top, we place the packet so that the data is aligned by page, as shown in Figure 10. When the software sends the packet over the fiber link, it sends the fixed-size NFS header first, followed by the data, and then by the variable-size trailer. Packets move over the network and arrive at MINI, header first. These packets are then stored so that the trailer comes first in memory, followed by the header, and then by the data.

Figure 9. Modified NFS packet with fixed-size data in the middle.
Figure 10. Placement of the modified NFS packet in MINI memory.

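The placement in Figure 10 can be written down as offset arithmetic within the channel's double page: the fixed-size header ends at the page boundary, the data fills page 1 page-aligned, and the variable-size trailer wraps to the start of page 0. The sketch below computes those offsets; it assumes the fixed-size data field fills page 1 exactly, and the structure and names are invented for the illustration.

    #include <stddef.h>

    struct nfs_placement {
        size_t header_off;   /* fixed-size NFS/IP header, ends at the page boundary   */
        size_t data_off;     /* fixed-size data, page-aligned at the start of page 1  */
        size_t trailer_off;  /* variable-size NFS trailer, wraps to the top of page 0 */
        size_t stop_off;     /* where the send or receive stops after the wrap        */
    };

    /* Layout of a modified NFS read-reply or write-request packet inside a
     * double page of two 'page_size' pages (e.g., 8 Kbytes total).          */
    struct nfs_placement nfs_layout(size_t page_size,
                                    size_t header_len, size_t trailer_len)
    {
        struct nfs_placement p;
        p.header_off  = page_size - header_len;   /* end of page 0, just before the data */
        p.data_off    = page_size;                /* all of page 1, page-aligned         */
        p.trailer_off = 0;                        /* wrapped to the start of page 0      */
        p.stop_off    = trailer_len;              /* end of the packet after the wrap    */
        return p;
    }
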
Virtual multicast. Because the data associated with a VC can come from any page pair in memory, we can associate a single page with many VCs. The data can be stored once, and then sent on each VC by simply writing to the high- or low-priority FIFO. Thus the cost of sending the same N-byte message on M channels is N + M memory writes. This virtual multicast technique has some of the advantages of multicast and is fairly easy to use.

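A sketch of the idea, reusing the hypothetical mini_send() and struct mini_channel from the earlier send example: the message is stored once into a double page shared by the M channels, and each additional destination then costs only one FIFO write.

    #include <stdint.h>

    struct mini_channel;                            /* as in the earlier send sketch */
    extern void mini_send(struct mini_channel *ch,
                          unsigned start_off, unsigned stop_off);

    /* Virtual multicast: one copy of the data, M send commands.
     * All channels' VCs have been set up to share the same double page.   */
    void mini_vmcast(uint64_t *shared_dpage, const uint64_t *msg, unsigned nwords,
                     struct mini_channel *channels, unsigned m)
    {
        for (unsigned i = 0; i < nwords; i++)       /* store the data once            */
            shared_dpage[i] = msg[i];

        for (unsigned j = 0; j < m; j++)            /* one FIFO write per destination */
            mini_send(&channels[j], 0, nwords);
    }
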
Barrier synchronization. Many parallel applications must synchronize component processes to stabilize the state of the application prior to moving to a new activity. A message-passing machine usually accomplishes this synchronization by creating a master process. Each slave process to be synchronized sends a message to the master process. After receiving a message from each slave process, the master sends a message to each slave process informing the process that it may proceed. This technique is typically called barrier synchronization.

Single-cell messages are particularly useful for barrier synchronization. The component processes set up links to the master program. When they are ready to synchronize, they send one cell to the master. The master waits as the single-cell messages come in, until a cell has been received from each component process. The receive time per cell is 300 ns. The master can then use virtual multicast to send one "go" cell to all the other processes, at a cost of 400 ns per cell. Thus barrier synchronization using this simple algorithm costs 700 ns per participant.

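Under the same assumed user-level primitives sketched earlier (mini_poll_rx() and mini_vmcast()), the master side of this barrier looks roughly as follows; with a 300-ns receive and a 400-ns multicast send per participant, the per-participant cost is the 700 ns quoted above. The slave_link structure and all names are illustrative.

    #include <stdint.h>

    struct mini_channel;                                       /* from the send sketch      */
    extern int  mini_poll_rx(const volatile uint64_t *rx_word,
                             unsigned *last_count, int *saw_drops);  /* from the polling sketch */
    extern void mini_vmcast(uint64_t *shared_dpage, const uint64_t *msg, unsigned nwords,
                            struct mini_channel *channels, unsigned m); /* from the multicast sketch */

    struct slave_link {
        const volatile uint64_t *rx_word;   /* receive control word of this slave's VC */
        unsigned pdu_seen;                  /* PDU count observed on the last poll     */
        int done, drops;
    };

    /* Master side of the barrier: wait for one single-cell message from every
     * participant, then release them all with one virtually multicast "go" cell. */
    void mini_barrier_master(struct slave_link *slaves, unsigned n,
                             uint64_t *go_dpage, struct mini_channel *go_vcs)
    {
        unsigned arrived = 0;
        while (arrived < n) {                                  /* ~300 ns per arriving cell */
            for (unsigned i = 0; i < n; i++) {
                if (!slaves[i].done &&
                    mini_poll_rx(slaves[i].rx_word, &slaves[i].pdu_seen, &slaves[i].drops)) {
                    slaves[i].done = 1;
                    arrived++;
                }
            }
        }
        uint64_t go = 1;                                       /* one "go" cell per slave, ~400 ns each */
        mini_vmcast(go_dpage, &go, 1, go_vcs, n);
    }
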
Network graphics example. We used an array of workstations, assembled into a custom rack, to display pictures as though they were one large bitmap;³ this example of network graphics illustrates some of the software capabilities. To support this large display, we broke the problem into

1. drawing on a 16-Mpixel virtual bitmap (16 displays' worth of image), and
2. copying from a piece of the large virtual bitmap onto the frame buffer of each of the 16 physical displays.

A renderer program draws on the large virtual bitmap, viewing the bitmap as a large frame buffer, while a painter program copies graphics from the large virtual bitmap to the individual display. We used the Mether-NFS (MNFS)⁴ distributed shared-memory system to support the large virtual bitmap. MNFS supports a shared memory that ca
