A Host Interface Architecture for High-Speed Networks
Peter A. Steenkiste (a), Brian D. Zill (a), H.T. Kung (a), Steven J. Schlick (a), Jim Hughes (b), Bob Kowalski (b), and John Mullaney (b)

(a) School of Computer Science, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213-3890, USA

(b) Network Systems Corporation, 7600 Boone Avenue North, Brooklyn Park, MN 55428, USA

Abstract
This paper describes a new host interface architecture for high-speed networks operating at 800 Mbit/second or higher rates. The architecture is targeted to achieve several 100s of Mbit/second application-to-application performance for a wide range of host architectures. It achieves this goal by providing a streamlined execution environment for the entire path between the host application and the network interface. In particular, a "Communication Accelerator Block" (CAB) is used to minimize data copies, reduce host interrupts, support DMA and hardware checksumming, and control network access.
This host architecture is applicable to a large class of hosts with high-speed I/O busses. Two implementations for the 800 Mbit/second HIPPI network are under development. One is for a distributed-memory supercomputer (iWarp) and the other is for a high-performance workstation (DECstation 5000). We describe and justify both implementations.

Keyword Codes: B.4.1; C.2.1
Keywords: Data Communications Devices; Network Architecture and Design

1 Introduction

Recent advances in network technology have made it feasible to build high-speed networks using links operating at 100s of Mbit/second or higher rates. HIPPI networks based on the ANSI High-Performance Parallel Interface (HIPPI) protocol [1] are an example. HIPPI supports a data rate of 800 Mbit/second or 1.6 Gbit/second, and almost all commercially available supercomputers have a HIPPI interface. As a result, HIPPI networks have become popular in supercomputing centers. In addition to HIPPI, a number of high-speed network standards are in various stages of development by standards bodies, including ATM (Asynchronous Transfer Mode) [2] and Fibre Channel [3].
As network speeds increase, it is important that host interface speeds increase proportionally, so that applications can benefit from the increased network performance. Several recent developments should simplify the task of building host interfaces that can operate at high rates. First, most computer systems, including many workstations, have I/O busses with a raw hardware capacity of 100 MByte/second or more. Second, existing transport protocols,
in particular the Transmission Control Protocol/Internet Protocol (TCP/IP), can be implemented efficiently [4, 5]. Finally, special-purpose high-speed circuits, such as the AMCC HIPPI chip set, can be used to handle low-level, time-critical network interface operations.
However, these elements do not automatically translate into good network performance for applications. The problem is that the host interface involves several interacting functions, such as data movement, protocol processing and the operating system, and it is necessary to take a global, end-to-end view in the design of the network interface to achieve good throughput and latency. Optimizing individual functions is not sufficient.
We have designed a host-network interface architecture optimized to achieve high application-to-application throughput. Our interface architecture is based on a Communication Accelerator Block (CAB) that provides support for key communication operations. The CAB is a network interface architecture that can be used for a wide range of hosts, as opposed to an implementation for a specific host. Two CAB implementations for HIPPI networks are under development. One is for the iWarp parallel machine [6] and the other is for DEC workstations using the TURBOchannel bus [7]. These two CAB implementations should allow applications to use a high percentage of the 100 MByte/second available on HIPPI. The interfaces will be used in the context of the Gigabit Nectar testbed at Carnegie Mellon University [8]. The goal of the testbed is to distribute large scientific applications across a number of computers connected by a high-speed network. The network traffic will consist of both small control messages, for which latency is important, and large data transfers, for which throughput is critical.
In the remainder of the paper we first discuss the requirements for the host interface (Section 2). We then present the motivation and the hardware and software architecture of the CAB-based interface (Section 3) and the design decisions for the two CAB implementations (Sections 4 and 5). We conclude with a comparison with earlier work.

2 Requirements for a Network Interface Design

In local area networks, throughput and latency are typically limited by overhead on the sending and receiving systems, i.e. they are limited by CPU or memory bandwidth constraints on the hosts. This means that the efficiency of the host-network interface plays a central role. Consuming fewer CPU and bus cycles not only makes it possible to communicate at higher rates, since the communication bottleneck has been reduced, but also leaves more cycles for the application for the same communication load. Efficiency is critical both for applications whose only task is communication (e.g. ftp) and for applications that are communication intensive, but for which communication is not the main task.
Since host architecture has a big impact on communication performance, we considered the communication bottlenecks for different classes of computer systems. A first class consists of workstations, currently characterized by one or a few CPUs and a memory bandwidth of a few 100 MByte/second. Existing network interfaces typically allow these workstations to achieve a throughput of a few MByte/second, without leaving any cycles to the application. Some projects have been successful at improving throughput over FDDI, but these efforts concentrate on achieving up to 100 Mbit/second for workstations only [9]. This communication performance is not adequate for many applications [10].
General-purpose supercomputers such as the Cray also have a small number of processors accessing a shared memory. They have, however, very high memory and computing bandwidth, and they have I/O subsystems that manage I/O devices with minimal involvement from the CPU. These resources allow them to communicate at near gigabit rates while using only a fraction of
their computing resources [5].
Special-purpose supercomputers such as iWarp [6] and the Connection Machine [11] have a very different architecture. Although these systems have a lot of computing power, the computing cycles are spread out over a large number of relatively slow processors, and they are not well suited to support communication over general-purpose networks. A single cell (for iWarp), or a front end (for the CM), can do the protocol processing, but the resulting network performance will match the speed of a single processor and will not be sufficient for the entire system. The issue is the efficiency of the network interface: can we optimize the interface so that a single processor can manage the network communication for a parallel machine?
Our goal is to define a "Communication Acceleration Block" that can support efficient communication on a variety of architectures. Specifically, this architecture must have the following properties:

1. High throughput, while leaving sufficient computing resources to the application. The goal is to demonstrate that applications on high-performance workstations can achieve several 100 Mbit/second end-to-end bandwidth. It is not acceptable to devote most of the CPU and memory resources of a host to network-related activities, so the network interface should use the resources of the host as efficiently as possible, and brute-force solutions that might work for supercomputers should be avoided. The exact performance will depend on the capabilities of the host.

2. Modular architecture. The portions of the architecture that depend on specific host busses (such as TURBOchannel) and network interfaces (such as HIPPI) should be contained in separate modules. By replacing these modules, other hosts and networks can be supported. For example, by using different host interfacing modules, the CAB architecture can interface with the TURBOchannel or with an iWarp parallel machine.

3. Inherently low-cost architecture. The host interface should cost only a small fraction of the host itself. It is essential that the host interface can eventually be implemented cheaply using ASICs, similar to existing Ethernet or FDDI controller chips. Early implementations may be more expensive, but the interface architecture should be amenable to low-cost ASIC implementation.

4. Use of standards. We concentrate on the implementation of the TCP and UDP internet protocols since they are widely used and have been shown to work at high transfer rates [5]. We use UNIX sockets as the primary communication interface for portability reasons. We also want to better understand how protocol and interface features influence the performance and complexity of the host-network interface, and other interfaces that are more appropriate for network-based multicomputer applications will be developed in parallel or on top of sockets.

3 The Host-Network Interface Architecture

Many papers have been published that report measurements of the overheads associated with communicating over networks [12, 4, 13, 14, 15, 16]. Even though it is difficult to compare these results because the measurements were made for different architectures, protocols, communication interfaces, and benchmarks, there is a common pattern: there is no single source of overhead. The time spent on sending and receiving data is distributed over several operations such as copying data, buffer management, protocol processing, interrupt handling and system calls, and different overheads dominate depending on the circumstances (e.g. packet size). The conclusion is that implementing an efficient network interface requires looking at all the functions in the network interface, and not just at a single function such as, for example, protocol processing.
Figure 1: Network data processing overheads (application write, copies to system buffers and user space, TCP protocol processing, device access, network)

Figure 1 shows the operations involved in sending and receiving data over a network using the socket interface. These operations fall into different categories. First, there are overheads associated with every application write (socket call - white), and with every packet sent over the network (TCP, IP, physical layer protocol processing and interrupt handling - light grey); these operations involve mainly CPU processing. There is also overhead that scales with the number of bytes sent (copying and checksumming - dark grey); this overhead is largely limited by memory bandwidth. In the remainder of this section we first look at how we can minimize both types of overhead. We then present the CAB architecture, and we describe how the CAB is seen and used by the host.

3.1 Optimizing per-byte operations
As networks get faster, data copying and checksumming will become the dominating overheads, both because the other overheads are amortized over larger packets and because these operations make heavy use of a critical resource: the memory bus. Figure 2 shows the data flow when sending a message using a traditional host interface; receives follow the inverse path. The dashed line is the checksum calculation. There are a total of five bus transfers for every word sent. On some hosts there is an additional CPU copy to move the data between "system buffers" and "device buffers", which results in two more bus transfers.
Figure 2: Data flow in a traditional network interface

We can reduce the number of bus transfers by moving the system buffers that are used to buffer the data outboard, as shown in Figure 3. The checksum is calculated while the data is copied. The number of data transfers is reduced to three. This interface corresponds to the "WITLESS" interface proposed by Van Jacobson [17]. Besides using the bus more efficiently, outboard buffering also allows packets to be sent over the network at the full media rate, independent of the speed of the internal host bus.
Figure 4 shows how the number of data transfers can be further reduced by using DMA for the data transfer between main memory and the buffers on the CAB. This is the minimum number with the socket interface. Checksumming is still done while copying the data, i.e. checksumming is done in hardware. Besides reducing the load on the bus, DMA has the advantage that it allows the use of burst transfers. This is necessary to get good throughput on today's high-speed I/O busses. For example, the DEC TURBOchannel throughput is about 11.1 MByte/second for single-word transfers, but 76.0 MByte/second for 32-word transfers. However, on some systems DMA adds enough overhead that it is sometimes more attractive to copy and checksum the data using the CPU (see Section 5).

Figure 3: Data flow in a network interface with outboard buffering

Figure 4: Data flow in a network interface with DMA

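To make the checksum-while-copy combination concrete, the sketch below folds the Internet (ones'-complement) checksum into a single pass over the data, so each word crosses the memory system only once; this is the software analogue of what the CAB does in hardware as data streams into network memory. The routine is illustrative only: the function name is ours, and byte-order and odd-length handling are simplified.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy 'len' bytes from 'src' to 'dst' while accumulating the Internet
 * (ones'-complement) checksum over the copied data.  The caller places
 * the result in the TCP or UDP header.
 */
uint16_t copy_and_checksum(void *dst, const void *src, size_t len)
{
    const uint8_t *s = src;
    uint8_t *d = dst;
    uint32_t sum = 0;
    uint16_t w;

    while (len >= 2) {                 /* copy and sum one 16-bit word at a time */
        memcpy(&w, s, 2);
        memcpy(d, &w, 2);
        sum += w;
        s += 2; d += 2; len -= 2;
    }
    if (len) {                         /* trailing odd byte, padded with zero */
        *d = *s;
        sum += *s;
    }
    while (sum >> 16)                  /* fold carries back into 16 bits */
        sum = (sum & 0xffffu) + (sum >> 16);

    return (uint16_t)~sum;
}
```

In the Figure 3 configuration such a routine replaces the separate copy and checksum passes made by the CPU; in the Figure 4 configuration the same combination is performed by the CAB hardware during the DMA transfer.
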
3.2 Optimizing per-packet operations
The CAB should support Media Access Control (MAC), as is the case for well-established network interfaces such as Ethernet and FDDI, so that the host does not have to be involved in negotiating access to the network for every packet. That involvement would be interrupt intensive (with an interrupt-based interface) or CPU intensive (with polling).
The remaining network interface functions on the host are TCP and IP protocol processing, including the creation of the TCP and IP headers. Measurements for optimized protocol implementations show that the combined cost of protocol processing on the send and receive side is about 200 instructions [4], or about 10 microseconds on a 20 MIPS workstation.
The main (and likely only) benefit of moving protocol processing outboard is that it potentially frees up cycles for the application. The most obvious drawback is that the network interface becomes more complicated and expensive, since it requires a high-performance general-purpose CPU (with a matching memory system) for protocol processing. A second drawback is that the host and network interface have to share state, and the host-interface pair must be viewed as a multiprocessor. Earlier experiments [18, 16] show that this can make interactions between the host and CAB considerably more complex and expensive compared with a master-slave model. Given these drawbacks and the limited advantages, we decided to perform protocol processing on the host, and to make the CAB a pure slave.
Given the high cost of crossing the I/O bus and of synchronization (e.g. interrupts), the CAB architecture minimizes the number of host-CAB interactions and their complexity. The host can request a small set of operations from the CAB, and for each operation it can specify whether it should be interrupted when the operation is finished. The CAB generates a return tag for every request it finishes, but it only interrupts the host if requested. The host processes all accumulated return tags every time it is interrupted. This limits the number of interrupts to one per user write on transmit, and at most one per packet and one per user read on receive.

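The host-CAB interaction described here can be sketched as follows. The structure layout, field names, and queue primitives are hypothetical stand-ins for the CAB register-file interface; the point is the master-slave pattern, in which requests carry an optional interrupt flag and the interrupt handler drains every accumulated return tag in one pass.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical encoding of a host request to the CAB (SDMA or MDMA). */
struct cab_request {
    enum { CAB_SDMA, CAB_MDMA } op;
    uint32_t net_mem_addr;    /* offset into outboard network memory          */
    uint32_t length;          /* bytes to transfer                            */
    uint32_t cksum_offset;    /* where the CAB should deposit the checksum    */
    bool     interrupt;       /* interrupt the host when this request is done */
    uint32_t tag;             /* echoed back in the return tag                */
};

/* Hypothetical access routines standing in for the CAB register files. */
void cab_post_request(const struct cab_request *req);
bool cab_next_return_tag(uint32_t *tag);   /* false once the queue is empty */
void complete_request(uint32_t tag);       /* host-side bookkeeping         */

/*
 * Interrupt handler: the CAB interrupts only for requests that asked for it,
 * but several earlier requests may also have finished by then, so the host
 * drains every accumulated return tag in a single pass.
 */
void cab_interrupt_handler(void)
{
    uint32_t tag;

    while (cab_next_return_tag(&tag))
        complete_request(tag);
}
```

This batching is what limits interrupts to one per user write on transmit, and to at most one per packet and one per user read on receive.
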
3.3 The CAB Architecture
Figure 5 shows a block diagram of the CAB architecture. The CAB consists of a transmit and a receive half. The core of each half is a memory used for outboard buffering of packets (network memory). Each memory has two ports, each running at 100 MByte/second. Network memory can, for example, be implemented using VRAM, with the serial access port on the network side. Data is transferred between main memory and network memory using system DMA (SDMA), and between network memory and the network using media DMA (MDMA).

Figure 5: Block diagram of the generic network interface (transmit and receive halves, each with a host bus interface, registers, SDMA and checksum logic, network memory, and MAC logic connecting to the network)

The most natural place to calculate the checksum on transmit is while the data is placed on the network. This is, however, not possible since TCP and UDP place the checksum in the header of the packet. As a result, the checksum is calculated while the data flows into network memory, and it is placed in the header by the CAB at a location that is specified by the host as part of the SDMA request. On receive, the checksum is calculated while the data flows from the network into network memory, so that it is available to the host as soon as the message is available.
Media access control is performed by hardware on the CAB, under control of the host. We concentrate on MAC support for switch-based networks, specifically HIPPI networks. The simplest MAC algorithm for a switch-based network is to send packets in FIFO order. If the destination is busy, i.e. one or more links between the source and the destination are being used for another connection, the sender waits until the destination becomes free (called camp-on). This simple algorithm does not make good use of the network bandwidth because of the Head of Line (HOL) problem: if the destination of the packet at the head of the queue is busy, the node cannot send, even if the destinations of other packets are reachable. Analysis has shown that one can utilize at most 58% of the network bandwidth, assuming random traffic [19].
MAC on the CAB is based on multiple "logical channels", queues of packets with different destinations. The CAB attempts to send a packet from each queue in round-robin fashion. If a destination is busy, the CAB moves to the next queue, which holds packets with a different destination. The exact MAC algorithm is controlled by the host through retry frequency and timeout parameters for each logical channel. The host can also specify that camp-on should be used after a number of tries, to improve the chances that packets to a busy destination will get through (i.e. to balance fairness against throughput).
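In software, the logical-channel scheduling just described might look roughly as follows; on the CAB it is implemented in hardware under host-supplied retry and timeout parameters. The data structures, channel count, and connection primitives are illustrative assumptions, not the actual CAB design.

```c
#include <stdbool.h>
#include <stddef.h>

#define NCHAN 8                        /* number of logical channels (illustrative) */

struct packet {
    struct packet *next;               /* next packet queued for the same destination */
    /* headers and payload live in network memory; omitted here */
};

struct logical_channel {
    struct packet *head;               /* FIFO of packets for one destination      */
    unsigned tries;                    /* failed connection attempts so far        */
    unsigned camp_on_after;            /* host parameter: attempts before camp-on  */
};

/* Hypothetical HIPPI connection primitives. */
bool try_connect(const struct packet *p, bool camp_on);  /* false if destination busy */
void send_packet(const struct packet *p);

/*
 * One scheduling pass: visit the logical channels round-robin and send a
 * packet from each channel whose destination can be reached.  If a
 * destination stays busy, switch to camp-on after 'camp_on_after' attempts
 * so the channel is not starved (fairness versus throughput).
 */
void mac_schedule(struct logical_channel chan[NCHAN])
{
    for (int i = 0; i < NCHAN; i++) {
        struct packet *p = chan[i].head;
        if (p == NULL)
            continue;                            /* nothing queued on this channel */

        bool camp = chan[i].tries >= chan[i].camp_on_after;
        if (try_connect(p, camp)) {
            send_packet(p);
            chan[i].head  = p->next;             /* dequeue and reset retry count */
            chan[i].tries = 0;
        } else {
            chan[i].tries++;                     /* busy: move on to the next channel */
        }
    }
}
```
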
The register files on both the transmit and receive halves of the CAB are used to queue host requests and return tags. The host interface implements the bus protocol for the specific host. Depending on implementation considerations such as the speed of the bus, the transmit and receive halves can have their own bus interfaces, or they can share a bus interface.
The CAB architecture is a general model for a network interface. Although the details of the host interface, checksum and MAC blocks depend on the specific host, protocol and network, the architecture should apply to a wide range of hosts, protocols and networks. In Sections 4 and 5 we describe implementations of this architecture for the iWarp parallel machine and for the DECstation 5000 workstation, two very different systems.

3.4 Host View of Network Interface
From the viewpoint of the host system software, the CAB is a large bank of memory accompanied by a means for transferring data into and out of that memory. The transmit half of the CAB also provides a set of commands for issuing media operations using data in the memory, while the receive side provides notification that new data has arrived in the memory from the media.
Several features of the CAB have an impact on the structure of the networking software. First, to ensure full bandwidth to the media, packets must start on a page boundary in CAB memory, and all but the last page must be full pages. This, together with the fact that checksum calculation for internet packet transmissions is performed during the transfer into CAB memory, dictates that individual packets should be fully formed when they are transferred to the CAB.
To make the most efficient use of this interface, data should be transferred directly from user space to CAB memory and vice versa. This model is different from that currently found in Berkeley Unix operating systems, where data is channeled through the system's network buffer pool [20]. The difference in the models, together with the restriction that data in CAB memory should be formatted into complete packets, means that decisions about partitioning user data into packets must be made before the data is transferred out of user space. This means that instead of a conventional "layered" protocol stack implementation (Figure 1), where decisions about packet formation are the sole domain of the transport protocol, some of that functionality must now be shared at a higher level.
To illustrate how host software interacts with the CAB hardware in normal usage, we present a walk-through of a typical read and write. To handle a user write, the system first examines the size of the write and other factors to determine how many packets will be needed on the media, and then it issues SDMA requests to the CAB, one per packet. The CAB transfers the data from the user's address space to the CAB network memory. In most cases, i.e. if the TCP window is open, an MDMA request to perform the actual media transfer can be issued at the same time, freeing the processor from any further involvement with individual packets. Only the final packet's SDMA request needs to be flagged to interrupt the host upon completion, so that the user process can be scheduled. No interrupt is needed to flag the end of MDMA for TCP packets, since the TCP acknowledgement will confirm that the data was sent.
Upon receiving a packet from the network, the CAB interrupts the host, which performs protocol processing. For TCP and UDP, only the packet's header needs to be examined, as the data checksum has already been calculated by the hardware. The packet is then logically queued for the appropriate user process. A user read is handled by issuing one or more SDMA operations to copy the data out of the interface memory. The last SDMA operation is flagged to generate an interrupt upon completion so that the user process can be scheduled.

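The transmit side of this walk-through is summarized in the sketch below: the host decides the packet boundaries up front, issues one SDMA request per packet, pairs it with an MDMA request when the TCP window permits, and asks for an interrupt only on the last SDMA. All names (cab_post_sdma, tcp_window_open, alloc_network_memory, PKT_DATA_MAX) are hypothetical driver and protocol hooks, not the actual interface.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PKT_DATA_MAX  16384u       /* illustrative per-packet payload limit */

/* Hypothetical driver/protocol hooks (not the real CAB interface). */
uint32_t alloc_network_memory(size_t len);          /* page-aligned CAB buffer */
void cab_post_sdma(const void *user_buf, uint32_t net_addr, size_t len,
                   size_t cksum_offset, bool interrupt, uint32_t tag);
void cab_post_mdma(uint32_t net_addr, size_t len);
bool tcp_window_open(size_t len);

/*
 * Handle a user write of 'len' bytes at 'user_buf': decide the packet
 * boundaries up front (data must arrive in CAB memory as fully formed
 * packets starting on page boundaries), then hand each packet to the CAB.
 * Only the final SDMA is flagged to interrupt, so the user process is
 * rescheduled exactly once per write.
 */
void cab_write(const char *user_buf, size_t len)
{
    uint32_t tag = 0;

    while (len > 0) {
        size_t plen = len > PKT_DATA_MAX ? PKT_DATA_MAX : len;
        bool   last = (plen == len);
        uint32_t net = alloc_network_memory(plen);

        /* Copy the data outboard; the CAB deposits the checksum at the
         * header offset given here (0 is only a placeholder for the real
         * TCP/UDP checksum position within the packet headers). */
        cab_post_sdma(user_buf, net, plen, /*cksum_offset=*/0, last, tag++);

        /* If the TCP window is open, queue the media transfer right away,
         * freeing the CPU from further involvement with this packet. */
        if (tcp_window_open(plen))
            cab_post_mdma(net, plen);

        user_buf += plen;
        len      -= plen;
    }
}
```

As noted above, no interrupt is requested for the MDMA completions of TCP packets; the TCP acknowledgement confirms that the data was sent.
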
4 The iWarp CAB Implementation

The first implementation of the CAB architecture is for iWarp, a distributed-memory parallel computer. Each iWarp cell has a CPU, memory, and four pathways supporting communication with neighboring cells. The cells are interconnected as a torus [6]. An iWarp array communicates with other computer systems by inserting "interface boards" in the back loops of the torus (Figure 6). Interface boards are, besides being linked into the iWarp interconnect, also connected to an external bus or network, which allows them to forward data between iWarp and the outside world.
Figure 6: Connecting the network interface to iWarp (HIB interface boards inserted in the back loops of an 8 by 8 iWarp torus; cells numbered 0 to 63)

The iWarp-Nectar interface board, or HIPPI Interface Board (HIB), consists of two iWarp interface cells and a CAB (Figure 7). Each interface cell is linked into the iWarp torus independently through four iWarp communication pathways, as shown in Figure 6 for an 8 by 8 iWarp array. The two iWarp cells play the role of host on the network. They are responsible for TCP and UDP protocol processing and for distributing data to, and collecting data from, the cells in the iWarp array.
To transmit data over the HIPPI network, cells in the iWarp array send data over the iWarp interconnect to the "HIB OUT" interface cell. The interface cell combines the data, stores it in staging memory as a single data stream, and issues a write call to invoke the TCP/IP protocol. The TCP/IP protocol stack does the protocol processing, using the CAB to DMA the data from staging memory into network memory, calculate the checksum, and transmit the data over the HIPPI network. The inverse process is used to receive data. The HIB can also be used as a raw HIPPI interface, for example to send data to a framebuffer.
The motivation for using two cells on the network interface is to provide enough bandwidth between the network and the iWarp array. An iWarp cell has a bandwidth of 160 MByte/second to its memory system (Figure 7); this bandwidth is shared by the data stream and program execution. Since HIPPI has a bandwidth of 100 MByte/second in each direction, a single iWarp cell would clearly not be able to sustain the peak HIPPI bandwidth. A two-cell architecture doubles the available memory bandwidth, but requires that TCP protocol processing be distributed between the transmit and receive cells. A shared memory is provided to simplify the sharing of the TCP state (open connections, open window, unacknowledged data, etc.).
The iWarp-Nectar interface physically consists of a transmit board and a receive board, plus a board that contains the HIPPI interfaces (necessary to provide space for connectors). All boards are VME boards, but they use VME only for power and ground. Most of the board space is used for the iWarp components, memory and support chips. Network memory consists of two banks of 2 MBytes of VRAM (which can be upgraded to 8 MBytes each). Each staging memory (see Figure 7) consists of 128 KBytes of dual-ported static RAM. The transmit board is currently operational, and the full interface is scheduled for completion in December 1992.

Figure 7: Block diagram of the iWarp network interface (two iWarp interface cells with local and staging memories and a shared memory, connected to the CAB transmit and receive halves; 40 MByte/second per iWarp pathway, 160 MByte/second cell memory bandwidth, 100 MByte/second HIPPI)

5 The Workstation CAB Implementation

5.1 Target Machine
The CAB architecture is being implemented for DEC workstations using the TURBOchannel I/O bus [7]; the initial platform will be the DEC 5000/200. The workstation interface is a single-board implementation with a shared bus interface for transmit and receive. The interface is being built using off-the-shelf components and will be operational in Fall 1992.
The DEC workstation was selected as the target for our network interface after a careful comparison of the DEC, HP, IBM and Sun workstations. One interesting result was that all four classes of workstations have similar I/O subsystems: all are based on 100 MByte/second I/O busses that rely heavily on burst transfers for high sustained throughput. TURBOchannel was chosen because it is an open, simple bus architecture and because there is ongoing operating systems research at CMU, based mainly on DEC workstations, that we want to use. Based on this similarity in I/O architecture, we expect that our results for the TURBOchannel should largely carry over to other workstations.

5.2 Processor Cache and Virtual Memory
On workstations, the use of DMA to transfer data between network memory and host memory is made more complicated by the presence of a cache and virtual memory. Because of these extra overheads, it might sometimes be more efficient to use the CPU to copy packets between user and system space (programmed I/O - the grey arrow in Figure 8).
Figure 8: DMA-cache interaction

DMA can create inconsistencies between the cache and main memory, resulting in wrong data being sent over the network or being read by the application. Hosts can avoid this problem by flushing the data to memory before transferring it using DMA on transmit (black arrows in Figure 8), and by invalidating the data in the cache before DMAing on receive. Write-through caches only require cache invalidation on receive, which can be performed in parallel with the data transfer from the device to main memory. For performance reasons, most workstations are moving towards write-back caches, where both cache flushing and invalidation add overhead to the DMA. Fortunately, these operations make efficient use of the bus since they use bursts to transfer cache lines to main memory.
On workstations, user pages have to be wired in memory to ensure that they are not paged out while the DMA is in progress. This cost, however, is acceptable and becomes smaller as the CPU speed increases. For example, the combined cost of a page wire and unwire on a 15 MIPS DECstation 5000/200 running Mach 2.6 is 134 microseconds, while the time to transfer a 16 KByte page from device to memory is 262 microseconds (using a 64-byte burst size). While this reduces the throughput from 62.4 MByte/second to 41.3 MByte/second, it is still much faster than the maximum CPU copy rate (11.1 MByte/second). This is one area where moving away from sockets can help: if the application builds messages in wired-down buffers (of limited size), the overhead is eliminated. This is similar to mailboxes in Nectar [18].
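As a check on these numbers, the effective DMA rate for a 16 KByte page follows directly from the measured costs; the short calculation below reproduces the quoted figures to within about 0.1 MByte/second of rounding (treating 1 MByte as 10^6 bytes).

```c
#include <stdio.h>

/* Measured costs quoted above for a DECstation 5000/200 running Mach 2.6. */
#define PAGE_BYTES      16384.0     /* one 16 KByte page                         */
#define DMA_US            262.0     /* device-to-memory transfer, 64-byte bursts */
#define WIRE_UNWIRE_US    134.0     /* combined page wire and unwire             */

int main(void)
{
    double raw   = PAGE_BYTES / (DMA_US * 1e-6) / 1e6;
    double wired = PAGE_BYTES / ((DMA_US + WIRE_UNWIRE_US) * 1e-6) / 1e6;

    printf("DMA alone:           %.1f MByte/second\n", raw);    /* ~62.5 */
    printf("DMA + wire/unwire:   %.1f MByte/second\n", wired);  /* ~41.4 */
    printf("CPU copy (measured): 11.1 MByte/second\n");
    return 0;
}
```
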
Figure 9 compares the transmit and receive throughput that can be obtained across the TURBOchannel on the DEC 5000/200 with programmed I/O and with DMA using a block size of 16 words. The overheads for cache invalidation (the DEC 5000/200 has a write-through cache) and for page wiring and unwiring have been included. For small read and write sizes, programmed I/O is faster. The exact crossover point between programmed I/O and DMA is hard to calculate, because it depends not only on the hardware features of the system, but also on how much data is in the cache at the time of the read or write and on the cost of cache pollution.
For larger transfers DMA is faster, even considering the high overheads, because it can use burst transfers that make much more efficient use of the bus than the single-word transfers used in programmed I/O. On other workstation architectures (such as the HP Snake and the IBM RS/6000) the CPU can also burst data across the bus. This functionality is intended for use with graphics devices, but it can also be used for the network interface. Although the I/O bus throughput that can be obtained using the CPU is typically still lower than the raw DMA rate, it makes programmed I/O more attractive, and it increases the crossover point between the two.

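Because the crossover depends on a fixed per-transfer cost (page wiring, cache maintenance, request setup) versus per-byte rates, a driver could choose between programmed I/O and DMA per request with a simple cost model, as sketched below. The constants are illustrative placeholders in the spirit of the DECstation numbers, not measured values, and the model ignores cache contents and pollution, which as noted above also shift the crossover.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative, machine-dependent constants; on a real system these would be
 * measured, not hard-coded (MByte/second equals bytes per microsecond here). */
#define PIO_BYTES_PER_US   11.1     /* single-word CPU copies                 */
#define DMA_BYTES_PER_US   76.0     /* 32-word burst transfers                */
#define DMA_FIXED_US      200.0     /* page wiring, cache maintenance, setup  */

/*
 * Choose between programmed I/O and DMA for a transfer of 'len' bytes by
 * comparing estimated completion times: small transfers avoid the fixed DMA
 * cost, large transfers win from burst transfers on the I/O bus.
 */
bool use_dma(size_t len)
{
    double pio_us = (double)len / PIO_BYTES_PER_US;
    double dma_us = DMA_FIXED_US + (double)len / DMA_BYTES_PER_US;

    return dma_us < pio_us;
}
```

On machines where the CPU can burst data across the bus, the programmed I/O rate rises and the crossover moves to larger transfers.
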
Figure 9: Transmit (T) and receive (R) transfer rates across the TURBOchannel for programmed I/O and DMA (throughput in MByte/second versus read/write size from 4 KByte to 1 MByte)

5.3 Performance Estimates
To validate the design of the CAB for workstations, we traced where the time is spent when sending and receiving data on a DECstation 5000/200 running Mach 2.6, using existing interfaces to Ethernet and FDDI. The networking code consists primarily of Tahoe BSD TCP/IP code with an optimized checksum calculation routine [21]. It does not include optimizations such as header prediction, but we are in the process of including these. We use a special TURBOchannel-based timer board with a 40 nanosecond resolution clock to measure the overhead of different network-related operations.
Using these measurements, we constructed graphs that show how much of the host memory bandwidth, and indirectly of the CPU cycles, is consumed by the different communication-related operations, and how many cycles are left for the application. Figure 10 shows the result for communication over FDDI. The horizontal axis shows the throughput and the vertical axis shows the percentage of the DS5000/200 CPU that is left for the application at a given throughput. We observe, for example, that using 100% of the processor we can only drive FDDI at about 21 Mbit/second, assuming 4500-byte packets. This result is in line with actual measurements. It is clear that even after optimizing the TCP protocol processing, the existing network interface will not allow us to communicate at the full FDDI media rate, let alone at HIPPI rates.
Based on these measurements we can estimate the performance for a HIPPI network interface based on the CAB architecture. The result is shown in Figure 11. As expected, network throughput is limited by the bandwidth of the memory bus (peak of 100 MByte/second). The graph shows that we should be able to achieve about 300 Mbit/second throughput