`The Association for Computing Machinery
`11 West 42nd Street
`New York, New York 10036
`
Copyright © 1988 by the Association for Computing Machinery, Inc. Copying without fee is permitted provided that the copies are not made or distributed for direct commercial advantage, and credit to the source is given. Abstracting with credit is permitted. For other copying of articles that carry a code at the bottom of the first page, copying is permitted provided that the per-copy fee indicated in the code is paid through the Copyright Clearance Center, 27 Congress Street, Salem, MA 01970. For permission to republish write to: Director of Publications, Association for Computing Machinery. To copy otherwise, or republish, requires a fee and/or specific permission.
`
`ISBN 0-89791-279-9
`
`
`The VMP Network Adapter Board (NAB):
`High-Performance Network Communication
`for Multiprocessors
`
`Hemant Kanakia
`Computer Systems Laboratory
Stanford University
`
`David R. Cheriton
`Computer Science Department
`Stanford University
`
`Abstract
`
High performance computer communication between multiprocessor nodes requires significant improvements over conventional host-to-network adapters. Current host-to-network adapter interfaces impose excessive processing, system bus and interrupt overhead on a multiprocessor host. Current network adapters are either limited in function, wasting key host resources such as the system bus and the processors, or else intelligent but too slow, because of complex transport protocols and because of an inadequate internal memory architecture. Conventional transport protocols are too complex for hardware implementation and too slow without it.
In this paper, we describe the design of a network adapter board for the VMP multiprocessor machine that addresses these issues. The adapter uses a host interface that is designed for minimal latency, minimal interrupt processing overhead and minimal system bus and memory access overhead. The network adapter itself has a novel internal memory and processing architecture that implements some of the key performance-critical transport layer functions in hardware. This design is integrated with VMTP, a new transport protocol specifically designed for efficient implementation on an intelligent high-performance network adapter. Although targeted for the VMP system, the design is applicable to other multiprocessors as well as uni-processors.
`
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.

© 1988 ACM 0-89791-279-9/88/008/0175 $1.50

1 Introduction

Performance of transport protocols on multi-megabit communication networks tends to be limited by overhead at both the transmitter and receiver. For example, measurements of the V kernel [5] indicate that network transmission time on Ethernet accounts for only about 20 percent of the elapsed time for transport-level communication operations, even with its highly optimized protocol. Similar performance figures have been reported in [15, 17]. Although processor and memory cycle times keep improving, with communication networks moving to the gigabit range, we expect the processing to persist as a bottleneck unless significant improvements in network adapter and transport protocol designs are achieved. We identify three major problems with current designs.
First, the host-to-network adapter interface imposes excessive overhead, particularly on a multiprocessor host, in the form of processor cycles, system bus capacity and host interrupts. The processing overhead arises from calculating end-to-end checksums, packetizing and depacketizing as well as encryption, in the case of secure communication. The memory-intensive processing required by these functions reduces the average instruction execution rate, especially for a high-performance processor such as MIPS [14] in which memory reference operations are proportionally much slower than register-only operations. This processing causes the data to move at least twice over the system bus: once from global memory to the processor (or its cache), and once when the packet is copied to a network adapter. The increased traffic wastes system bus bandwidth, a critical resource in multiprocessor machines. In current host-to-network adapter interfaces, a host is interrupted for each packet received or transmitted. These per-packet interrupts force frequent context switching with the attendant overheads, with a high penalty in a multiprocessor system with processor caches, where it may be necessary to fault code and data into the cache before responding to the interrupt. In addition, the context switch may also incur contention overhead when data associated with the network module is resident in another cache. The problem is further aggravated by the prospect of networks moving to the 100-megabit to gigabit range using fiber optics [11, 13, 16]. For instance, in a file server attached to a 100 megabit network with the interface interrupting on every 2 kilobyte packet, the network interrupts every 200 microseconds under load, hardly sufficient to do even minimal packet processing.
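
The interrupt interval in the file-server example can be checked with a rough calculation. The sketch below is an illustration only: it assumes a 2 kilobyte packet means 2048 bytes of payload and ignores headers and inter-packet gaps, so it gives the raw wire time per packet, which is of the same order as the roughly 200 microseconds quoted above. The function name is hypothetical.

```python
def interrupt_interval_us(packet_bytes: int, link_bits_per_sec: float) -> float:
    """Wire time of one packet in microseconds.

    On a saturated link with one interrupt per packet, this is also the
    time between interrupts (headers and inter-packet gaps are ignored).
    """
    return packet_bytes * 8 / link_bits_per_sec * 1e6


# Assumed values: 2048-byte packets on 100 Mbit/s and 1 Gbit/s links.
for rate in (100e6, 1e9):
    interval = interrupt_interval_us(2048, rate)
    print(f"{rate / 1e6:.0f} Mbit/s: one interrupt every {interval:.1f} us")
```

At gigabit rates the same per-packet interrupt scheme would leave roughly a tenth of that time per packet, which is the motivation for moving per-packet work off the host.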
Second, the so-called "intelligent" network adapters that implement transport-level functions have lower performance at the transport level than the alternative system where a network adapter does programmed I/O transfers and a host performs transport protocol functions. The primary reason is an inadequate internal memory architecture. Currently, the data transfers into and out of the buffer memory reduce the number of memory cycles available for packet processing. Future system bus technology, with a high transfer rate and burst-mode transfer, and future networks, with a high data rate, will make this problem even more acute.
Finally, conventional transport protocols are too complex or awkward for hardware implementation and too slow without it. For a large packet, the processing cost incurred in checksumming and encryption dominates the packet processing, since the cost increases in proportion to the size of a packet. Hardware implementation for such key performance-critical functions would substantially increase performance, but the packet formats of conventional transport protocols do not facilitate hardware support or implementations.
An additional factor that motivates rethinking network adapter architecture is the problem of a host being bombarded by packets from one or more other hosts. The packet arrival rate, especially in a high-speed network, can exceed the rate at which a host can process and discard these packets, effectively incapacitating the host for useful computation. Excessive packet traffic can arise either from failures or malicious behavior of a remote host. A well-designed network adapter acts as a "firewall" between the network and the host.
In this paper, we describe the Network Adapter Board (NAB) for the VMP multiprocessor [3], focusing on the architectural issues in the host interface, the adapter board, and the transport protocol. The adapter host interface architecture is designed for minimal latency, minimal interrupt processing overhead and minimal data transfer on the system bus. The NAB uses a novel internal memory and pipelined processing architecture that implements some of the key performance-critical transport layer functions in hardware. The design is coupled to a new transport protocol called Versatile Message Transaction Protocol (VMTP) [1]. VMTP is a request-response transport protocol specifically designed to facilitate implementation by a high-performance network adapter. VMTP assumes an underlying network (or internetwork) service providing a datagram packet service, and includes support for multicast, real-time, security and streaming. The interested reader is referred to an Internet RFC [1] for further details. For brevity in this report, we focus on aspects of the protocol relevant to the cost of communication and the design of the VMP network adapter. We describe the expected performance of the NAB prototype being built. Although targeted for the VMP system, the design is applicable to other multiprocessors as well as uni-processors.
The next section describes the host-to-network adapter interface architecture. Section 3 describes the internal architecture of the network adapter board. Section 4 describes details of a prototype of the network adapter. Section 5 indicates the (expected) performance for this adapter. Section 6 compares our design to related work on this problem. We conclude with a summary of the current status, our plans to further evaluate this design and the current open problems in this area.
`
`2 Host-Network Adapter Interface
`
A request/response model of communications is used for information exchange between a network device and a host. A host requests network services with a control block sent to the device, and the device returns a response after the service is provided. These messages are transferred via an interface that appears to the host software as a 1024-byte control register.
To transmit data, the host software writes a control block, the Transmit Authorization Record (TAR), to this control register. The TAR contains control information describing the data to be sent, including the pointer to the data in physical memory. If the data fits entirely within the control register, the data segment description is omitted from the TAR. In both cases, the network adapter transmits the data, checksumming and encrypting (if required). The interface interrupts the host to report that a TAR has been used to successfully transmit the data.
To receive data, the host software writes a control block, a Receive Authorization Record (RAR), that "arms" the network interface to receive packets, specifying the maximum size to receive and providing a list of pointers to memory pages in host memory into which to deliver the data. The RAR can specify any one of: (1) a specific source, (2) a class of allowable sources, or (3) any source for the received data, where source is either a transport-level or network-level address.
`The interface interrupts the host when the received packet(s)
`satisfies one of these RARs, returning the RAR, along with the
`packet header, to the appropriate host via the control register.
`The returned RAR itself may contain small amounts of data in
`addition to the corresponding packet header. When the RAR is
`returned, the data has been already stored in the host memory at
`the location pointer(s) contained in it, unless the data is contained
`in the RAR. Incoming packets are discarded if they cannot be
`matched to an outstanding RAR.
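
The authorization model described above can be sketched in a few lines. This is a minimal illustration, not the NAB's matching logic: the `RAR` fields, the `deliver` helper, the use of integer source addresses, the first-match policy and the simplification of one packet per RAR are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RAR:
    """Hypothetical, simplified Receive Authorization Record."""
    match: str                         # "specific", "class", or "any"
    sources: frozenset = frozenset()   # allowed source addresses; unused for "any"

    def accepts(self, source: int) -> bool:
        # "specific" and "class" both reduce to set membership here:
        # a specific source is a one-element set, a class is a larger set.
        return self.match == "any" or source in self.sources


def deliver(outstanding: List[RAR], source: int) -> Optional[RAR]:
    """Consume and return the first outstanding RAR that authorizes the packet.

    Returning None models the adapter discarding the packet without
    involving the host.
    """
    for i, rar in enumerate(outstanding):
        if rar.accepts(source):
            return outstanding.pop(i)
    return None


# Usage: the host has armed one RAR for source 7 only.
armed = [RAR("specific", frozenset({7}))]
print(deliver(armed, 9))   # unauthorized source: discarded
print(deliver(armed, 7))   # authorized: the RAR is returned to the host
```

In this sketch a packet from an unauthorized source never reaches host code at all, which is the property the interface relies on.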
Type, a subfield in the control block, distinguishes the records passed via the control register. The four major types of records used in transmission and reception of data are an RAR with small amounts of data, an RAR with data descriptors, a TAR with small amounts of data and a TAR with data descriptors. To optimize host-network adapter interactions, the host is allowed to combine issuing of a TAR and an RAR, using the same buffers to send and to receive data. Additional types of records are used by a host for various purposes such as to add or delete acceptable destinations, to restrict traffic from a source, to provide decryption/encryption keys to the NAB, to get status information, and to reset the NAB.
The RARs and TARs include a packet header including a network-level header, either small amounts of data or a list of data descriptors, and various control information. The buffer descriptors in a TAR point to locations in the physical memory space where data to transmit is available. The buffer descriptors in an RAR point to locations in the physical memory space where the data is to be received. The returned RAR or TAR contains, in addition to the buffer descriptors, the number of data words actually received or transmitted. The control information, used by the NAB, includes a link field, the type of RAR matching to be used, transport-level source and destination addresses, interrupt control, timeout control, a local host number, and a local process identifier. The buffer descriptors are omitted for TARs and RARs containing small amounts of data. A link field, used by the adapter, allows chaining of these records as necessary. The type of RAR matching to be done indicates if the RAR can be used for receiving traffic only from the specified source or from any source.
`Interrupt control is used to determine when to interrupt the
`host identified in the record. On reception, the host, indicated
`by the host number, is interrupted either when the data of the
`first packet is stored in the system memory or when the data is
`completely received or both. The completed RAR is returned via
`the control register with the length of data received before the
`host indicated in RAR is interrupted. On transmission, one may
`have the host interrupted either when the NAB begins processing
`a TAR, or when the last of the data segment is transmitted.
`The TAR is written into the control register before a host is
`interrupted.
`In the following, we discuss how this interface efficiently
`handles small amounts of data, large amounts of data, and also
`allows the interface to act as a firewall, protecting the host from
`the network.
`
`2.1 Short Message Handling
`
The latency with short messages is minimized because the short message is written to the interface as part of the TAR and read as part of the returned RAR on reception. The operating system interrupt handlers for the network adapter can directly copy the message data between the control register and the operating system data structures, moving the higher-layer data to its intended destination with minimal cost. For multiprocessor hosts with per-processor caches this procedure also avoids additional cache and bus activity, as we expect the small amount of data to be already available in the cache before being sent with a TAR or to be used immediately after having been received in an RAR. Thus, the delay introduced in transmitting and receiving a packet with a small amount of data is no more than that incurred with a host directly handling the packet and using the interface as a staging area to send and receive packets.
`Note that including the packet header in the TAR means that
`the processor writes a small amount of data to the interface for
`transmission yet the network adapter has minimal processing
`on the data to prepare packets for transmission. In particular,
`for small data appended to header, the network adapter need
`not do anything before starting network transmission, given that
`checksumming and encryption occur as part of transmission.
`
`2.2 Long Message Handling
`
Host overhead is minimized for the transmission and reception of large amounts of data, typically in the range of 4-16 kilobytes, by passing descriptors rather than actual data. On transmission, the host writes one TAR and receives one completion interrupt, with the network adapter transferring the data from host memory with minimal bus overhead.¹ The network adapter handles the per-packet overhead of packetizing, checksumming, encryption and per-packet coordination. On reception, the host receives a single interrupt for each RAR returned after the data has been transferred into global memory. Again, the per-packet interrupt overhead is handled by the network adapter.
Latency in the transfer of large amounts of data is reduced by ensuring minimal bus and memory references. Moving byte-based processing functions such as checksumming and encryption to the network adapter reduces memory references to one per data word. Passing buffer descriptors to the network adapter, which retrieves data from host memory, ensures that only one bus transfer per word transferred is required. For multiprocessors with per-processor caches, the buffer-passing model helps reduce the number of cache misses one would otherwise incur, as a cache is neither likely to hold most of the data transferred nor going to use it in the near future. The cache pollution resulting from using the cache for network data transfers would also increase the cache miss ratio for other applications. An additional factor reducing latency is the use of burst-mode transfer on the host bus, which decreases the transfer time of data on the bus by a factor of 4.
In sending and receiving groups of packets, the interface can afford to introduce some latency for the first packet of the group as long as the whole transmission and reception has less delay as compared to a host processor handling per-packet processing. That is, for a small number of packets $K$, it should be the case that

$$K \cdot P_{\text{host}} > D + K \cdot P_{\text{interface}}$$

where writing the control record to the interface introduces a delay $D$ in transmission over the host processor writing the data directly to the network, $K$ is the number of packets to be transmitted, $P_{\text{host}}$ is the time for the host to packetize and send one full-sized packet and $P_{\text{interface}}$ is the time for the interface to transmit one full-sized packet. The value of $K$ for which this is true should be as small as possible, ideally 1 but certainly less than the common size of a multi-packet packet group. In this interface, the value is close to 1 primarily because $P_{\text{interface}}$ is much smaller than $P_{\text{host}}$.

¹ The VMP memory supports block transfer using the VME serial bus protocol, thereby minimizing bus occupancy and arbitration overhead.
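
The break-even condition can be rearranged to give the smallest group size at which the interface wins: K must exceed D divided by the per-packet saving (host time minus interface time). The sketch below computes that threshold. The function name and the timing values in the usage lines are invented for illustration; they are not measurements from the NAB.

```python
import math


def break_even_packets(d: float, p_host: float, p_interface: float) -> int:
    """Smallest integer K satisfying K * p_host > d + K * p_interface.

    All three arguments must use the same time unit.
    """
    if p_host <= p_interface:
        raise ValueError("interface must be faster per packet than the host")
    # K must be strictly greater than d / (p_host - p_interface).
    return math.floor(d / (p_host - p_interface)) + 1


# Made-up example values in microseconds.
print(break_even_packets(d=50, p_host=300, p_interface=200))   # small setup delay
print(break_even_packets(d=250, p_host=300, p_interface=200))  # larger setup delay
```

The larger the per-packet saving relative to the setup delay, the closer the threshold gets to a single packet, which matches the argument in the text.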
`
`2.3 Network Firewall
`
The interface architecture is designed to allow the network adapter to function as a firewall, protecting the host from network packet pollution, both accidental and malicious. In essence, a host incurs no overhead for network packets whose reception it has not authorized; the interface discards all packets that are not compatible with an RAR provided by the host and are not directed to an end-point registered with the adapter. Some examples of its use follow. If the host does not provide an RAR for broadcast packets, then garbage broadcast packets incur zero overhead on the host processor(s). Multiple responses generated by a multicast request can be limited to only those that fit into the buffer area provided; the rest are discarded without incurring any host overhead. By providing RARs for only those sources it wants to listen to, a process can avoid the host overhead otherwise required for pruning out the traffic from unauthorized sources. In general, the authorization model of packet reception plus the speed of the network adapter insulates the host from packet pollution on the network.
`
`3 Network Adapter Internal Architecture
`
The network adapter internal architecture is designed to provide maximal performance between the host interface and the network architecture. The internal architecture is structured as five major components, interconnected as shown in Figure 1. These components serve the following functions:
`
Network access controller (NAC): implements the network access protocol and transfers data between the network and the packet pipeline.

Packet pipeline: generates and checks transport-level checksums and performs encryption and decryption of data in secure communication.

Buffer memory: a staging and speed-matching area for data in transit between the packet pipeline and the host memory. Its specialized buffer memory permits fast block data transfers between the network and host and provides the on-board processor with contention-free memory access to the packet data.

Host block copier: moves data between the buffer memory and the host memory using a burst-transfer bus protocol, minimizing the latency as well as the bus and memory overhead for transfers.

On-board processor: a general-purpose processor that manages the packet processing pipeline and various bookkeeping functions associated with the protocol.
`
[Figure 1: Network Adapter Internal Architecture. Block diagram of the Network Adapter Board (NAB). Labels recoverable from the diagram: checksum logic, encryption logic, network access controller, packet pipeline, buffer memory, controller, host interface, host block copier, network link, host bus.]
`
Transmission is handled in three steps. When a Transmit Authorization Record is written to the interface, it is moved into the interface from the host memory by the host block copier. The TAR provides a description of the segment of data in the host memory to transmit. Next, the on-board processor forms the first packet from the TAR and the first data blocks of the message and queues the packet for the packet pipeline using the information provided by the TAR. Finally, the packet is processed and transmitted by the packet processing pipeline and NAC at the network data rate. The pipeline calculates a checksum and optionally encrypts the data as the packet is transmitted.
`On reception, a packet is accepted by the NAC and passed
`through the packet pipeline which decrypts the data, verifies
`the checksum, and deposits the received packet into the buffer
`memory. If the received checksum is verified, the packet is
`matched to the appropriate RAR. For the first packet in a group of
`packets, this may involve locating and allocating a non-specific
`RAR, making it dedicated to receiving more packets from this
`source. The packet data is then delivered into the host memory
`associated with this RAR. The reception of a packet into the
`buffer memory proceeds concurrently with both the checksum
`verification and the transfer to the host of previous packets. On
`reception of the final packet or on timeout, the host is interrupted
`and informed of the receipt of this packet group by returning
`the RAR in the interface control register. Thus, the host is
`interrupted only once per RAR used by the NAB.
If there is no matching RAR for a packet, the adapter determines whether the destination address is locally acceptable. An address is locally acceptable if the address is in the local host's list of destination addresses, both individual and group addresses. The adapter then transmits a response to the local host indicating that the packet was discarded. When the destination address is valid but no RAR is found, the interface discards the segment data and returns an indication to the destination host via the control register. The interface does not time out a partially-filled RAR; the partially-filled RAR is returned to the host only on an explicit request from a host. The host handles sending out retransmission requests and various timeouts. The host also sends acknowledgments and handles retransmission requests by issuing a new TAR.
`Three key aspects to the design are the buffer memory, the
`packet processing pipeline, and the use of the general-purpose
`processor, as discussed in the following sections.
`
`3.1 The Buffer Memory
`
The network adapter requires buffer memory in order to speed-match between the host bus and the network as well as to provide a staging area for the transmission and reception of packets. Three issues arise with the buffer memory design. First, the buffer memory must provide sufficient contention-free memory access to support simultaneous uses by the on-board processor, NAC, and block copier, as occurs under load. The performance of many so-called "smart" network adapters suffers from this contention. Second, the buffer memory must minimize latency for packet transmission and reception over direct transmission between the host memory and the network. Finally, a provision is required to prevent overcommitting the buffer memory to either transmission or reception, which would interfere with the functioning of the adapter. Our approach to each of these issues is discussed below.
`
`3.1.1 Buffer Memory Contention
`
To minimize contention, the buffer memory uses dual-port static column RAM components, also referred to as Video RAM ICs. The Video RAM-based buffer memory, shown in Figure 2, provides multiple buffers to hold and to process packets while a packet is being received or being transmitted. This IC provides two independently accessed ports: one providing high-speed burst-mode transfer, and the other providing random-access. The serial-access port is used to move a packet from the network into the buffer memory or the data from host memory to the buffer memory. The random-access port provides memory access for on-board processing of packets by the adapter's general-purpose processor. The serial access does not need address set-up and decoding time so read-write times on this port are faster than read-write times for a RAM array. For instance, in our prototype, memory is 32 bits wide, the serial access time is 40 ns per word and the cycle time for random read/write access is 200 ns. This gives an effective transfer rate of 800 megabits/second over the serial port and 160 megabits/second over the random-access port. To provide the equivalent memory bandwidth on a single ported standard 32-bit wide memory would require a memory IC with read/write cycle time of 33 ns. Currently, such fast memories are available, but they cost more and have less memory density than video RAMs.

[Figure 2: Video RAM-based Buffer Memory at Network Adapter Board (NAB). Labels recoverable from the diagram: Video RAM, 32-bit word, address, random-access data port.]
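
The bandwidth figures quoted for the prototype follow directly from the word width and the two cycle times. The short sketch below reproduces them and derives the single-port cycle time needed to match the combined rate. The helper name is hypothetical, and the only inputs are the 32-bit width, 40 ns serial access time and 200 ns random-access cycle time stated in the text.

```python
WORD_BITS = 32  # memory width stated for the prototype


def port_rate_mbps(cycle_ns: float) -> float:
    """Transfer rate in megabits/second for one 32-bit word per cycle."""
    return WORD_BITS * 1000 / cycle_ns


serial = port_rate_mbps(40)            # serial-access port
random_access = port_rate_mbps(200)    # random-access port
combined = serial + random_access

# Cycle time a single-ported 32-bit memory would need for the same total rate.
single_port_cycle_ns = WORD_BITS * 1000 / combined

print(f"serial: {serial:.0f} Mbit/s, random-access: {random_access:.0f} Mbit/s")
print(f"equivalent single-port cycle time: {single_port_cycle_ns:.1f} ns")
```

The result is consistent with the 33 ns cycle time given above for an equivalent single-ported memory.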
`Data transfer operations in this memory proceed as follows.
`A packet (or data) is received from the network (or the host) via
`the serial port into the shift register contained in Video RAM
`ICs. The shift register acts as the temporary storage. When
`the block is completely received, it is transferred from the shift
`register, in a single memory cycle, to a row of the memory cell
`array constituting the buffer memory. The processor manipulates
`header fields of packets stored in the array via the random-access
`port. The processing of a packet continues without interference
`while the next packet is being received, except for one memory
`cycle stolen for each received packet transferred to the memory
`cell array from the shift register.
Video RAM ICs provide performance that closely approximates the performance of true multiport memories, but at a fraction of the cost. A triple-port memory cell would triple the area of the memory cell, reducing memory density and increasing its access time. Video RAMs provide full memory bandwidth to the processor at low cost. They also allow high-speed block data transfer between the buffer memory and the host and the network. The separate serial port avoids the processor losing memory bandwidth to arbitration overhead on the random access port. The serial transfer ports, accessed in parallel across a bank of video RAM ICs, maximize the data unit to be transferred per arbitration between the host block copier and the NAC and minimize the transfer time, thereby minimizing the arbitration penalty.
High speed FIFOs could be used as an alternative to the buffer memory described above to speed-match between the host and the network. Intuitively, a FIFO-based design should have minimal delay for two reasons. With short FIFOs, one begins to process and transmit the packet before the entire packet has been copied. With FIFOs, one avoids the software overhead of managing input and output packet queues. Nevertheless, our buffer memory design provides better performance for reasons outlined below.
`When the system bus is lightly loaded, our design compares
`favorably with the FIFO approach for large and small amounts
`of data transfers. For large data transfers, our design amortizes
`the buffering latency of the first packet over multiple packets.
`For a short single packet transfer, the difference in delay is small
`because the main source of delay in this case is the processing
`done at a host and the adapter, not the time of copying data to
`the interface.
However, at even moderate levels of bus traffic, our design outperforms the FIFO-based approach. When the bus is congested, the demand for bus access may not be satisfied in time to avoid underrunning or overrunning a FIFO, resulting in packet loss at a sender and at a receiver. The time thus lost in transmitting aborted packets as well as the retransmission delay increases the total time needed to transfer a data segment.
Mismatch between transmission data rates on the host bus and the network channel also supports using the buffer memory. When host bus speed is much higher than network channel speed, the additional delay for bringing the first packet in full before beginning transmission is negligible compared to the total transmission time for a large data segment. When host bus speed is comparable to (or lower than) the network channel speed, bringing in a full packet before starting transmission is necessary to avoid (frequent) loss of packets by contention on the bus.
`
`3.1.2 Buffer Latency
`
Buffer latency refers here to the additional delay caused by the on-board buffering of a packet over the direct transmission by host to network. We minimize it by using a hardware block copier and by using contention-less memory accessing. A block copier transfers data between host memory and NAB using a serial memory transfer protocol, which minimizes transfer time and bus occupancy. Moreover, the NAB memory design described above facilitates high speed block data transfers² and provides contention-less memory accessing, both minimizing the buffering latency.
In this design, the cost of the buffering latency of the first packet is amortized over the subsequent packets sent in the packet group. The subsequent packets are copied from the host memory or the network link in parallel with the processing of the previous packet transferred, hiding the cost of their buffering latency. For large amounts of data, the buffering latency of the first packet forms only a small fraction of the total delay. For instance, the minimum transmission time for 16 Kbytes of data over a 100 megabits/sec network is 1.31 ms, whereas the buffering latency for the first packet in this packet group is approximately 25 microseconds,³ which is less than 2 percent of the total transmission time. The latency is even less significant compared to request-response delay for large data transfers, as this delay includes processing times at the host processors.
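
The two figures in this example can be verified with a short calculation, assuming that 16 Kbytes means 16 × 1024 bytes and ignoring packet headers:

```latex
t_{\text{tx}} = \frac{16 \times 1024 \times 8\ \text{bits}}{100 \times 10^{6}\ \text{bits/s}}
             = 1310.72\ \mu\text{s} \approx 1.31\ \text{ms},
\qquad
\frac{25\ \mu\text{s}}{1310.72\ \mu\text{s}} \approx 0.019 = 1.9\%
```

The ratio of about 1.9 percent agrees with the "less than 2 percent" stated above.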
We note that the data transfer between host and interface buffer memory does not constitute an extra copy because the NAB performs all the functions that the host processor would have otherwise performed if it were performing the transfer. That is, the interface memory is required as a staging area for incoming and outgoing network traffic in any case.
`
`3.1.3 Reception of Garbage Packets
`
Ideally, the mechanism for matching a packet to an RAR is fast enough so that it cannot be overrun by receiving garbage packets from the network. The packets with correct node address or multicast packets may still be undesirable, if