`
Peter A. Steenkiste (a), Brian D. Zill (a), H.T. Kung (a), Steven J. Schlick (a), Jim Hughes (b),
Bob Kowalski (b), and John Mullaney (b)

(a) School of Computer Science, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh,
PA 15213-3890, USA

(b) Network Systems Corporation, 7600 Boone Avenue North, Brooklyn Park, MN 55428, USA
`
`Abstract
This paper describes a new host interface architecture for high-speed networks operating
at 800 Mbit/second or higher rates. The architecture is targeted to achieve several 100s
of Mbit/second application-to-application performance for a wide range of host architectures.
The architecture achieves this goal by providing a streamlined execution environment for the
entire path between host application and network interface. In particular, a Communication
Accelerator Block (CAB) is used to minimize data copies, reduce host interrupts, support DMA
and hardware checksumming, and control network access.
This host architecture is applicable to a large class of hosts with high-speed I/O busses. Two
implementations for the 800 Mbit/second HIPPI network are under development. One is for a
distributed-memory supercomputer (iWarp) and the other is for a high-performance workstation
(DECstation 5000). We describe and justify both implementations.
`
`Keyword Codes: B.4.1; C.2.1
`Keywords: Data Communications Devices; Network Architecture and Design
`
`1 Introduction
`
`Recent advances in network technology have made it feasible to build high-speed networks
`using links operating at 100s of Mbit/second or higher rates. HIPPI networks based on
`the ANSI High-Performance Parallel Interface (HIPPI) protocol [1] are an example. HIPPI
`supports a data rate of 800 Mbit/second or 1.6 Gbit/second and almost all commercially
`available supercomputers have a HIPPI interface. As a result, HIPPI networks have become
`popular in supercomputing centers. In addition to HIPPI, there are a number of high-speed
`network standards in various stages of development by standards bodies. These include ATM
`(Asynchronous Transfer Mode) [2] and Fibre Channel [3].
As network speeds increase, it is important that host interface speeds increase proportionally,
so that applications can benefit from the increased network performance. Several recent
`developments should simplify the task of building host interfaces that can operate at high
`rates. First, most computer systems, including many workstations, have I/O busses with
raw hardware capacity of 100 MByte/second or more. Second, existing transport protocols,
in particular Transmission Control Protocol/Internet Protocol (TCP/IP), can be implemented
efficiently [4, 5]. Finally, special-purpose high-speed circuits, such as the AMCC HIPPI chip
`set, can be used to handle low-level, time-critical network interface operations.
`However, these elements do not automatically translate into good network performance for
`applications. The problem is that the host interface involves several interacting functions such
`as data movement, protocol processing and the operating system, and it is necessary to take a
`global, end-to-end view in the design of the network interface to achieve good throughput and
latency. Optimizing individual functions is not sufficient.
We have designed a host-network interface architecture optimized to achieve high application-
to-application throughput. Our interface architecture is based on a Communication Accelerator
Block (CAB) that provides support for key communication operations. The CAB is a network
interface architecture that can be used for a wide range of hosts, as opposed to an implementation
for a specific host. Two CAB implementations for HIPPI networks are under development. One
is for the iWarp parallel machine [6] and the other one is for the DEC workstation using the
TURBOchannel bus [7]. These two CAB implementations should allow applications to use a
high percentage of the 100 MByte/second available on HIPPI. The interfaces will be used in
the context of the Gigabit Nectar testbed at Carnegie Mellon University [8]. The goal of the
testbed is to distribute large scientific applications across a number of computers connected by a
high-speed network. The network traffic will consist of both small control messages, for which
latency is important, and large data transfers, for which throughput is critical.
In the remainder of the paper we first discuss the requirements for the host interface (Section
2). We then present the motivation and the hardware and software architecture of the CAB-based
interface (Section 3) and the design decisions for the two CAB implementations (Sections 4 and
5). We conclude with a comparison with earlier work.
`
`2 Requirements for a Network Interface Design
`
In local area networks, throughput and latency are typically limited by overhead on the sending
and receiving systems, i.e. they are limited by CPU or memory bandwidth resource constraints on
the hosts. This means that the efficiency of the host-network interface plays a central role.
Consuming fewer CPU and bus cycles will not only make it possible to communicate at higher
rates, since the communication bottleneck has been reduced, but for the same communication load
more cycles will be available to the application. Efficiency is critical both for applications whose
only task is communication (e.g. ftp) and for applications that are communication intensive, but
for which communication is not the main task.
Since host architecture has a big impact on communication performance, we considered the
communication bottlenecks for different classes of computer systems. A first class consists of
workstations, currently characterized by one or a few CPUs, and a memory bandwidth of a few
100 MByte/second. Existing network interfaces typically allow these workstations to achieve
a throughput of a few MByte/second, without leaving any cycles to the application. Some
projects have been successful at improving throughput over FDDI, but these efforts concentrate
on achieving up to 100 Mbit/second for workstations only [9]. This communication performance
is not adequate for many applications [10].
General-purpose supercomputers such as Cray also have a small number of processors
accessing a shared memory. They have, however, a very high memory and computing bandwidth,
and they have I/O subsystems to manage I/O devices with minimal involvement from the CPU.
These resources allow them to communicate at near gigabit rates while using only a fraction of
their computing resources [5].
Special-purpose supercomputers such as iWarp [6] and the Connection Machine [11] have
a very different architecture. Although these systems have a lot of computing power, the
computing cycles are spread out over a large number of relatively slow processors, and they are
not suited to support communication over general-purpose networks. A single cell (for iWarp)
or a front-end (for the CM) can do the protocol processing, but the resulting network performance
will match the speed of a single processor, and will not be sufficient for the entire system. The
issue is the efficiency of the network interface: can we optimize the interface so that a single
processor can manage the network communication for a parallel machine?
Our goal is to define a "Communication Acceleration Block" that can support efficient
communication on a variety of architectures. Specifically, this architecture must have the
`following properties:
`
1. High throughput, while leaving sufficient computing resources to the application. The
goal is to demonstrate that applications on high-performance workstations can achieve
several 100 Mbit/second end-to-end bandwidth. It is not acceptable to devote most of
the CPU and memory resources of a host to network-related activities, so the network
interface should use the resources of the host as efficiently as possible, and brute-force
solutions that might work for supercomputers should be avoided. The exact performance
will depend on the capabilities of the host.
`
2. Modular architecture. The portions of the architecture that depend on specific host
busses (such as the TURBOchannel) and network interfaces (such as HIPPI) should be
contained in separate modules. By replacing these modules, other hosts and networks
can be supported. For example, by using different host interfacing modules, the CAB
architecture can interface with the TURBOchannel or with an iWarp parallel machine.
`
3. Inherently low-cost architecture. The host interface should cost only a small fraction
of the host itself. It is essential that the host interface can eventually be cheaply
implemented using ASICs, similar to existing Ethernet or FDDI controller chips.
Early implementations may be more expensive, but the interface architecture should
be amenable to low-cost ASIC implementation.
`
4. Use of standards. We concentrate on the implementation of the TCP and UDP internet
protocols since they are widely used and have been shown to work at high transfer
rates [5]. We use UNIX sockets as the primary communication interface for portability
reasons. We also want to better understand how protocol and interface features influence
the performance and complexity of the host-network interface, and other interfaces that
are more appropriate for network-based multicomputer applications will be developed
in parallel or on top of sockets.
`
`3 The Host-Network Interface Architecture
`
Many papers have been published that report measurements of the overheads associated with
communicating over networks [12, 4, 13, 14, 15, 16]. Even though it is difficult to compare these
results because the measurements are made for different architectures, protocols, communication
interfaces, and benchmarks, there is a common pattern: there is no single source of overhead.
The time spent on sending and receiving data is distributed over several operations such as
copying data, buffer management, protocol processing, interrupt handling and system calls, and
different overheads dominate depending on the circumstances (e.g. packet size). The conclusion
is that implementing an efficient network interface involves looking at all the functions in the
network interface, and not just a single function such as, for example, protocol processing.

[Figure 1: Network data processing overheads. The figure shows the operations between application and network: copying data to system buffers, TCP protocol processing, device access, and copying data to user space.]
Figure 1 shows the operations involved in sending and receiving data over a network using
the socket interface. These operations fall in different categories. First, there are overheads
associated with every application write (socket call - white), and with every packet sent over
the network (TCP, IP, physical layer protocol processing and interrupt handling - light grey);
these operations involve mainly CPU processing. There is also overhead that scales with the
number of bytes sent (copying and checksumming - dark grey); this overhead is largely limited
by memory bandwidth. In the remainder of this section we first look at how we can minimize
both types of overhead. We then present the CAB architecture, and we describe how the CAB
is seen and used by the host.
`
`3.1 Optimizing per-byte operations
`As networks get faster, data copying and checksumming will become the dominating
`overheads, both because the other overheads are amortized over larger packets and because
`these operations make heavy use of a critical resource: the memory bus. Figure 2 shows the
data flow when sending a message using a traditional host interface; receives follow the inverse
path. The dashed line is the checksum calculation. There are a total of five bus transfers for
every word sent. On some hosts there is an additional CPU copy to move the data between
"system buffers" and "device buffers", which results in two more bus transfers.

[Figure 2: Data flow in a traditional network interface]
`We can reduce the number of bus transfers by moving the system buffers that are used to
`buffer the data outboard, as is shown in Figure 3. The checksum is calculated while the data
`is copied. The number of data transfers has been reduced to three. This interface corresponds
to the "WITLESS" interface proposed by Van Jacobson [17]. Besides using the bus more
efficiently, outboard buffering also allows packets to be sent over the network at the full media
`rate, independent of the speed of the internal host bus.
`Figure 4 shows how the number of data transfers can be further reduced by using DMA
`for the data transfer between main memory and the buffers on the CAB. This is the minimum
`number with the socket interface. Checksumming is still done while copying the data, i.e.
`checksumming is done in hardware. Besides reducing the load on the bus, DMA has the
`advantage that it allows the use of burst transfers. This is necessary to get good throughput
`on today's high-speed I/O busses. For example, the DEC TURBOchannel throughput is about
`11.1 MByte/second for single word transfers, but 76.0 MByte/second for 32 word transfers.
However, on some systems, DMA adds enough overhead that it is sometimes more attractive to
`copy and checksum the data using the CPU (see Section 5).
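
As a concrete illustration of the per-byte cost argument, the sketch below shows a software version of the combined copy-and-checksum step that the CAB performs in hardware: the 16-bit one's-complement Internet checksum is accumulated while each word is copied, so the data crosses the memory system only once. This function is ours, not part of the CAB or BSD code, and it ignores byte-order, alignment and odd-length handling.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy 'len' bytes (assumed even and 16-bit aligned) from src to dst
 * while accumulating the 16-bit one's-complement Internet checksum.
 * This mirrors in software what the checksum hardware does during an
 * SDMA transfer; a production version must also handle a trailing odd
 * byte, unaligned buffers, and byte order. */
uint16_t copy_and_checksum(uint16_t *dst, const uint16_t *src, size_t len)
{
    uint32_t sum = 0;

    for (size_t i = 0; i < len / 2; i++) {
        dst[i] = src[i];          /* the copy              */
        sum   += src[i];          /* the running checksum  */
    }
    while (sum >> 16)             /* fold carries back into 16 bits */
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)~sum;        /* one's complement of the sum */
}
```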
`
[Figure 3: Data flow in a network interface with outboard buffering]
`
`
`
`
`
`
[Figure 4: Data flow in a network interface with DMA]
`
`3.2 Optimizing per-packet operations
The CAB should have support for Media Access Control (MAC), as is the case for well-
established network interfaces such as Ethernet and FDDI, so that the host does not have to be
involved in negotiating access to the network for every packet. This would be interrupt-intensive
(interrupt-based interface) or CPU-intensive (based on polling).
`The remaining network interface functions on the host are TCP and IP protocol processing,
including the creation of the TCP and IP headers. Measurements of optimized protocol
implementations show that the combined cost of protocol processing on the send and receive
`side is about 200 instructions [4], or about 10 microseconds on a 20 MIPS workstation.
The main (likely only) benefit of moving protocol processing outboard is that it potentially
`frees up cycles for the application. The most obvious drawback is that the network interface
`becomesmorecomplicatedand expensive since it requiresa high-performancegeneral-purpose
`CPU (with matching memory system) for protocol processing. A second drawback is that the
`host and network interface have to share state and the host-interface pair must be viewed as a
`multiprocessor. Earlier experiments [18, 16] show that this can make interactions between the
`host and CAB considerably more complex and expensive compared with a master-slave model.
`Given these drawbacks and the limited advantages, we decided to perform protocol processing
`on the host, and to make the CAB a pure slave.
Given the high cost of crossing the I/O bus and of synchronization (e.g. interrupts), the CAB
architecture minimizes the number of host-CAB interface interactions and their complexity. The
`host can request a small set of operations from the CAB, and for each operation, it can specify
whether it should be interrupted when the operation is finished. The CAB generates a return tag
for every request it finishes, but it only interrupts the host if requested. The host processes all
accumulated return tags every time it is interrupted. This limits the number of interrupts to one
per user write on transmit, and at most one per packet and one per user read on receive.
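
The C sketch below illustrates this request/return-tag discipline. The descriptor layout, the flag name, and the helpers (cab_pop_return_tag, complete_request) are hypothetical stand-ins for the actual CAB register interface, which is not specified here.

```c
#include <stdint.h>

#define CAB_F_INTERRUPT 0x1   /* interrupt the host when this request completes */

/* Hypothetical request descriptor queued in the CAB register file. */
struct cab_request {
    uint32_t opcode;     /* SDMA transfer, MDMA send, ...        */
    uint64_t host_addr;  /* main-memory address (SDMA requests)  */
    uint32_t cab_addr;   /* offset into CAB network memory       */
    uint32_t length;     /* transfer length in bytes             */
    uint32_t flags;      /* e.g. CAB_F_INTERRUPT                 */
    uint32_t tag;        /* echoed back in the return tag        */
};

struct cab_device;                                        /* opaque device state  */
int  cab_pop_return_tag(struct cab_device *, uint32_t *); /* hypothetical helpers */
void complete_request(struct cab_device *, uint32_t);

/* Interrupt handler: drain all accumulated return tags so that a single
 * interrupt can retire many outstanding requests. */
void cab_interrupt(struct cab_device *cab)
{
    uint32_t tag;

    while (cab_pop_return_tag(cab, &tag))
        complete_request(cab, tag);
}
```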
`
`3.3 The CAB Architecture
Figure 5 shows a block diagram of the CAB architecture. The CAB consists of a transmit and
a receive half. The core of each half is a memory used for outboard buffering of packets (network
memory). Each memory has two ports, each running at 100 MByte/second. Network memory
can for example be implemented using VRAM, with the serial access port at the network side.
Data is transferred between main memory and network memory using system DMA (SDMA)
and between network memory and the network using media DMA (MDMA).
`
`
`
`
`
`
[Figure 5: Block diagram of the generic network interface. The transmit and receive halves each contain a host bus interface, request registers, SDMA and MDMA engines with checksum hardware, network memory, and a MAC block connecting to the network.]
`
The most natural place to calculate the checksum on transmit is while the data is placed on
the network. This is however not possible since TCP and UDP place the checksum in the header
of the packet. As a result, the checksum is calculated when the data flows into network memory,
and it is placed in the header by the CAB in a location that is specified by the host as part of the
SDMA request. On receive, the checksum is calculated when the data flows from the network
into network memory, so that it is available to the host as soon as the message is available.
Media access control is performed by hardware on the CAB, under control of the host. We
concentrate on MAC support for switch-based networks, specifically HIPPI networks. The
simplest MAC algorithm for a switch-based network is to send packets in FIFO order. If the
destination is busy, i.e. one or more links between the source and the destination are being used
for another connection, the sender waits until the destination becomes free (called camp-on).
This simple algorithm does not make good use of the network bandwidth because of the Head
of Line (HOL) problem: if the destination of the packet at the head of the queue is busy, the
node cannot send, even if the destinations of other packets are reachable. Analysis has shown
that one can utilize at most 58% of the network bandwidth, assuming random traffic [19].
MAC on the CAB is based on multiple "logical channels", queues of packets with different
destinations. The CAB attempts to send a packet from each queue in round-robin fashion. If
a destination is busy, the CAB moves to the next queue, which holds packets with a different
destination. The exact MAC algorithm is controlled by the host through retry frequency and
timeout parameters for each logical channel. The host can also specify that camp-on should be
used after a number of tries to improve the chances that packets to a busy destination will get
through (i.e. balance fairness versus throughput).
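
The control loop below sketches this round-robin logical-channel scheme in C. The data structures, the camp-on threshold, and the connection helpers are assumptions made for illustration; the real MAC is implemented in CAB hardware and parameterized by the host as described above.

```c
#include <stddef.h>

struct packet {
    int            dst;    /* destination address on the switch */
    struct packet *next;
};

/* One logical channel: a FIFO of packets for a single destination. */
struct logical_channel {
    struct packet *head;
    unsigned       retries;        /* consecutive failed attempts    */
    unsigned       camp_on_after;  /* host-set threshold (0 = never) */
};

int  try_connect(int dst, int camp_on);  /* hypothetical switch primitives */
void send_packet(struct packet *p);

/* One round-robin pass over all logical channels.  A busy destination
 * only delays its own channel; after 'camp_on_after' failures the CAB
 * camps on the destination, trading fairness for guaranteed progress. */
void mac_round(struct logical_channel *ch, size_t nchannels)
{
    for (size_t i = 0; i < nchannels; i++) {
        struct logical_channel *c = &ch[i];

        if (c->head == NULL)
            continue;

        int camp_on = c->camp_on_after && c->retries >= c->camp_on_after;

        if (try_connect(c->head->dst, camp_on)) {
            send_packet(c->head);
            c->head    = c->head->next;
            c->retries = 0;
        } else {
            c->retries++;
        }
    }
}
```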
The register files on both the transmit and receive halves of the CAB are used to queue host
requests and return tags. The host interface implements the bus protocol for the specific host.
Depending on implementation considerations such as the speed of the bus, the transmit and
receive halves can have their own bus interfaces, or they can share a bus interface.
The CAB architecture is a general model for a network interface. Although the details of the
host interface, checksum and MAC blocks depend on the specific host, protocol and network,
the architecture should apply to a wide range of hosts, protocols and networks. In Sections 4
and 5 we describe implementations of this architecture for the iWarp parallel machine and for
the DECstation 5000 workstation, two very different systems.
`
`
`
`
`
`
`3.4 Host View of Network Interface
From the viewpoint of the host system software, the CAB is a large bank of memory
accompanied by a means for transferring data into and out of that memory. The transmit half
of the CAB also provides a set of commands for issuing media operations using data in the
memory, while the receive side provides notification that new data has arrived in the memory
from the media.
Several features of the CAB have an impact on the structure of the networking software.
First, to ensure full bandwidth to the media, packets must start on a page boundary in CAB
memory, and all but the last page must be full pages. This, together with the fact that checksum
calculation for internet packet transmissions is performed during the transfer into CAB memory,
dictates that individual packets should be fully formed when they are transferred to the CAB.
To make the most efficient use of this interface, data should be transferred directly from user
space to CAB memory and vice versa. This model is different from that currently found in
Berkeley Unix operating systems, where data is channeled through the system's network buffer
pool [20]. The difference in the models, together with the restriction that data in CAB memory
should be formatted into complete packets, means that decisions about partitioning of user data
into packets must be made before the data is transferred out of user space. This means that
instead of a conventional "layered" protocol stack implementation (Figure 1), where decisions
about packet formation are the sole domain of the transport protocol, some of that functionality
must now be shared at a higher level.
To illustrate how host software interacts with the CAB hardware in normal usage, we present
a walk-through of a typical read and write. To handle a user write, the system first examines
the size of the write and other factors to determine how many packets will be needed on the
media, and then it issues SDMA requests to the CAB, one per packet. The CAB transfers the
data from the user's address space to the CAB network memory. In most cases, i.e. if the TCP
window is open, an MDMA request to perform the actual media transfer can be issued at the
same time, freeing the processor from any further involvement with individual packets. Only
the final packet's SDMA request needs to be flagged to interrupt the host upon completion, so
that the user process can be scheduled. No interrupt is needed to flag the end of MDMA of TCP
packets, since the TCP acknowledgement will confirm that the data was sent.
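
A sketch of this transmit path in C follows. The socket and request helpers (tcp_mss, tcp_window_open, cab_sdma, cab_mdma) and the flag name are placeholders, not the real kernel interface; the point is the partitioning into packets, one SDMA request per packet, and an interrupt only on the last one.

```c
#include <stddef.h>

#define CAB_F_INTERRUPT 0x1

struct socket;                                      /* opaque protocol state */
size_t tcp_mss(struct socket *);                    /* hypothetical helpers  */
int    tcp_window_open(struct socket *);
void   cab_sdma(struct socket *, const char *, size_t, unsigned flags);
void   cab_mdma(struct socket *, size_t, unsigned flags);

/* Handle a user write: partition it into packets, move each packet into
 * CAB network memory with SDMA (the checksum is computed in hardware
 * during this transfer), and queue an MDMA media transfer when the TCP
 * window allows it.  Only the final SDMA requests a completion interrupt. */
void cab_write(struct socket *so, const char *ubuf, size_t len)
{
    size_t mss = tcp_mss(so);
    size_t off = 0;

    while (off < len) {
        size_t plen = (len - off < mss) ? len - off : mss;
        int    last = (off + plen == len);

        cab_sdma(so, ubuf + off, plen, last ? CAB_F_INTERRUPT : 0);

        if (tcp_window_open(so))
            cab_mdma(so, plen, 0);   /* no interrupt: the TCP ACK confirms it */

        off += plen;
    }
}
```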
Upon receiving a packet from the network, the CAB interrupts the host, which performs
protocol processing. For TCP and UDP, only the packet's header needs to be examined, as
the data checksum has already been calculated by the hardware. The packet is then logically
queued for the appropriate user process. A user read is handled by issuing one or more SDMA
operations to copy the data out of the interface memory. The last SDMA operation is flagged to
generate an interrupt upon completion so that the user process can be scheduled.
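
The receive side can be sketched in the same hypothetical style: the packet interrupt does header-only protocol processing while the data stays in CAB network memory, and a later user read drains it with SDMA, interrupting only on the last transfer. All helper names below are illustrative assumptions.

```c
#include <stddef.h>

#define CAB_F_INTERRUPT 0x1

struct cab_device;
struct socket;
struct pkt_desc { struct socket *so; size_t len; };   /* illustrative only */

int    cab_next_packet(struct cab_device *, struct pkt_desc *);  /* hypothetical */
int    tcp_udp_input_header(struct pkt_desc *);   /* header-only processing      */
void   queue_for_socket(struct pkt_desc *);       /* data remains on the CAB     */
struct pkt_desc *dequeue_for_socket(struct socket *);
int    socket_has_data(struct socket *);
void   cab_sdma_to_user(char *, struct pkt_desc *, unsigned flags);

/* Per-packet receive interrupt: only the header is touched by the CPU,
 * since the data checksum was computed by the CAB on the way in. */
void cab_rx_interrupt(struct cab_device *cab)
{
    struct pkt_desc d;

    while (cab_next_packet(cab, &d))
        if (tcp_udp_input_header(&d) == 0)
            queue_for_socket(&d);
}

/* User read: copy queued packet data out of CAB memory with SDMA and
 * request an interrupt only on the last transfer issued. */
size_t cab_read(struct socket *so, char *ubuf, size_t len)
{
    size_t done = 0;
    struct pkt_desc *d;

    while (done < len && (d = dequeue_for_socket(so)) != NULL) {
        size_t n    = (d->len < len - done) ? d->len : len - done;
        int    last = (done + n == len) || !socket_has_data(so);

        cab_sdma_to_user(ubuf + done, d, last ? CAB_F_INTERRUPT : 0);
        done += n;
    }
    return done;
}
```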
`
`4 The iWarp CAB Implementation
`
The first implementation of the CAB architecture is for iWarp, a distributed-memory parallel
computer. Each iWarp cell has a CPU, memory, and four pathways supporting communication
with neighboring cells. The cells are interconnected as a torus [6]. An iWarp array communicates
with other computer systems by inserting "interface boards" in the backloops of the torus (Figure
6). Interface boards are, besides being linked into the iWarp interconnect, also connected to
an external bus or network, which allows them to forward data between iWarp and the outside
world.
[Figure 6: Connecting the network interface to iWarp. The two interface cells of the HIB are inserted into the backloops of an 8 by 8 (64-cell) iWarp torus.]

The iWarp-Nectar interface board, or HIPPI Interface Board (HIB), consists of two iWarp
interface cells and a CAB (Figure 7). Each interface cell is linked into the iWarp torus
`independently through four iWarp communication pathways, as is shown in Figure 6 for an
`8 by 8 iWarp array. The two iWarp cells play the role of host on the network. They are
`responsible for TCP and UDP protocol processing and for distributing data to, and collecting
`data from, the cells in the iWarp array.
To transmit data over the HIPPI network, cells in the iWarp array send data over the iWarp
interconnect to the "HIB OUT" interface cell. The interface cell combines the data, stores it in
staging memory as a single data stream, and issues a write call to invoke the TCP/IP protocol.
The TCP/IP protocol stack does the protocol processing, using the CAB to DMA the data from
staging memory into network memory, calculate the checksum, and transmit the data over the
HIPPI network. The inverse process is used to receive data. The HIB can also be used as a raw
HIPPI interface, for example to send data to a framebuffer.
The motivation for using two cells on the network interface is to provide enough bandwidth
between the network and the iWarp array. An iWarp cell has a bandwidth of 160 MByte/second to
its memory system (Figure 7); this bandwidth is shared by the data stream and program execution.
Since HIPPI has a bandwidth of 100 MByte/second in each direction, a single iWarp cell would
clearly not be able to sustain the peak HIPPI bandwidth. A two-cell architecture doubles the
available memory bandwidth, but requires that TCP protocol processing be distributed between
the transmit and receive cells. A shared memory is provided to simplify the sharing of the TCP
state (open connections, open window, unacknowledged data, etc.).
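
As a rough illustration, the per-connection state that the transmit and receive cells must share might look like the following; the field set comes from the list above, while the layout and names are purely illustrative, not the actual implementation.

```c
#include <stdint.h>

/* Illustrative per-connection record kept in the shared memory so that
 * the transmit cell and the receive cell see a consistent view of TCP. */
struct shared_tcp_state {
    uint32_t local_addr,  remote_addr;
    uint16_t local_port,  remote_port;
    uint32_t snd_una;        /* oldest unacknowledged sequence number */
    uint32_t snd_nxt;        /* next sequence number to send          */
    uint32_t snd_wnd;        /* send window advertised by the peer    */
    uint32_t rcv_nxt;        /* next sequence number expected         */
    uint32_t rcv_wnd;        /* receive window we advertise           */
    uint8_t  state;          /* ESTABLISHED, FIN_WAIT, ...            */
    uint8_t  lock;           /* simple mutual exclusion between cells */
};
```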
The iWarp Nectar interface physically consists of a transmit and a receive board, plus a board
that contains the HIPPI interfaces (necessary to provide space for connectors). All boards are
VME boards, but only use VME for power and ground. Most of the board space is used for
the iWarp components, memory and support chips. Network memory consists of two banks of
2 MBytes of VRAM (each can be upgraded to 8 MBytes). Each staging memory (see Figure 7)
consists of 128 KBytes of dual-ported static RAM. The transmit board is currently operational,
and the full interface is scheduled for completion in December 1992.
`
`
`
`
`
`
[Figure 7: Block diagram of the iWarp network interface. Two iWarp interface cells, each with a 160 MByte/second local memory and a staging memory, share a memory for TCP state; the CAB provides SDMA and MDMA engines with checksum hardware and network memory, connecting to HIPPI at 100 MByte/second.]
`
`5 The Workstation CAB Implementation
`
`5.1 Target Machine
The CAB architecture is being implemented for DEC workstations using the TURBOchannel
I/O bus [7]; the initial platform will be the DEC 5000/200. The workstation interface is a
single-board implementation with a shared bus interface for transmit and receive. The interface
is being built using off-the-shelf components and will be operational in Fall 1992.
`The DEC workstation was selected as the target for our network interface after a careful
comparison of the DEC, HP, IBM and Sun workstations. One interesting result was that all four
classes of workstations have similar I/O subsystems: all are based on 100 MByte/second I/O
`busses that rely heavily on burst transfers for high sustained throughput. TURBOchannel was
`chosen because it is an open, simple bus architecture and because there is research ongoing at
`CMU in the operating systems area, mainly based on DEC workstations, that we want to use.
`Based on this similarity in I/O architecture, we expect that our results for the TURBOchannel
`should largely carry over to other workstations.
`
5.2 Processor Cache and Virtual Memory
On workstations, the use of DMA to transfer data between network memory and host memory
is made more complicated by the presence of a cache and virtual memory. Because of these
extra overheads, it might sometimes be more efficient to use the CPU to copy packets between
user and system space (programmed I/O - grey arrow in Figure 8).
DMA can create inconsistencies between the cache and main memory, resulting in wrong
data being sent over the network or being read by the application. Hosts can avoid this problem
by flushing the data to memory before transferring it using DMA on transmit (black arrows in
Figure 8), and by invalidating the data in the cache before DMAing on receive. Write-through
caches only require cache invalidation on receive, which can be performed in parallel with the
data transfer from the device to main memory. For performance reasons, most workstations are
moving towards write-back caches, where both cache flushing and invalidation add overhead to
the DMA. Fortunately, these operations make efficient use of the bus since they use bursts to
transfer cache lines to main memory.

[Figure 8: DMA-cache interaction]
On workstations, user pages have to be wired in memory to ensure that they are not paged
out while the DMA is in progress. This cost, however, is acceptable and becomes smaller as the
CPU speed increases. For example, the combined cost of a page wire and unwire on a 15 MIPS
DECstation 5000/200 running Mach 2.6 is 134 microseconds, while the time to transfer a 16
KByte page from device to memory is 262 microseconds (using a 64-byte burst size). While this
reduces the throughput from 62.4 MByte/second to 41.3 MByte/second, it is still much faster
than the maximum CPU copy rate (11.1 MByte/second). This is one area where moving away
from sockets can help. If the application builds messages in wired-down buffers (of limited
size), the overhead is eliminated. This is similar to mailboxes in Nectar [18].
Figure 9 compares the transmit and receive throughput that can be obtained across the
TURBOchannel on the DEC 5000/200 with programmed I/O and with DMA using a block size of 16
words. The overheads for cache invalidation (the DEC 5000/200 has a write-through cache) and
page wiring and unwiring have been included. For small read and write sizes, programmed I/O
is faster. The exact crossover point between programmed I/O and DMA is hard to calculate,
because it depends not only on the hardware features of the system, but also on how much data
is in the cache at the time of the read or write and the cost of cache pollution.
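
The crossover can be illustrated with a back-of-the-envelope model using the numbers quoted above (11.1 MByte/second programmed I/O copy rate, 62.4 MByte/second raw DMA rate, 134 microseconds of page wire/unwire per 16 KByte page). Treating the wire/unwire as the only fixed DMA cost is a deliberate simplification; as noted, the real crossover also depends on cache effects.

```c
#include <math.h>
#include <stdio.h>

/* Rough comparison of programmed I/O and DMA transfer times on the
 * DECstation 5000/200, using the rates quoted in the text.  The only
 * fixed DMA cost modeled is the per-page wire/unwire; cache flushing,
 * invalidation and pollution are ignored, so the crossover shown here
 * is only indicative. */
int main(void)
{
    const double pio_rate  = 11.1e6;  /* bytes/s, single-word CPU copies */
    const double dma_rate  = 62.4e6;  /* bytes/s, burst DMA              */
    const double wire_cost = 134e-6;  /* seconds per 16 KByte page       */
    const double page      = 16384.0;

    for (double n = 512; n <= 64 * 1024; n *= 2) {
        double t_pio = n / pio_rate;
        double t_dma = ceil(n / page) * wire_cost + n / dma_rate;

        printf("%6.0f bytes: PIO %6.1f us, DMA %6.1f us -> %s\n",
               n, 1e6 * t_pio, 1e6 * t_dma,
               t_pio <= t_dma ? "PIO" : "DMA");
    }
    return 0;
}
```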
DMA is faster, even when considering the high overheads, because it can use burst transfers
that make much more efficient use of the bus than the single-word transfers used in programmed
I/O. On other workstation architectures (such as the HP Snake and the IBM RS/6000) the CPU
`can also burst data across the bus. This functionality is intended for use with graphics devices,
`but can also be used for the network interface. Although the I/O bus throughput that can be
`obtained using the CPU is typically still lower than the raw DMA rate, it makes programmed
`I/O more attractive, and it increases the crossover point between the two.
`
`
`
`
`
`
[Figure 9: Transmit (T) and receive (R) transfer rates across the TURBOchannel. The horizontal axis shows the transfer size in KBytes (4 KByte to 1 MByte); the vertical axis shows throughput in MByte/second (0 to 60).]
`
`5.3 Performance Estimates
To validate the design of the CAB for workstations, we traced where the time is spent when
sending and receiving data on a DECstation 5000/200 running Mach 2.6, using existing interfaces
to Ethernet and FDDI. The networking code consists primarily of Tahoe BSD TCP/IP code with
an optimized checksum calculation routine [21]. It does not include optimizations such as header
prediction, but we are in the process of including these. We use a special TURBOchannel-based
timer board with a 40 nanosecond resolution clock to measure the overhead of different network-
related operations.
Using these measurements, we constructed graphs that show how much of the host memory
bandwidth, and indirectly of the CPU cycles, is consumed by the different communication-
related operations, and how many cycles are left for the application. Figure 10 shows the
result for communication over FDDI. The horizontal axis shows the throughput and the vertical
axis shows the percentage of the DS5000/200 CPU that is left for the application for a given
throughput. We observe, for example, that using 100% of the processor we can only drive
FDDI at about 21 Mbit/second assuming 4500-byte packets. This result is in line with actual
measurements. It is clear that even after optimizing the TCP protocol processing, the existing
network interface will not allow us to communicate at the full FDDI media rate, let alone HIPPI
rates.
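
The shape of curves such as the one in Figure 10 can be reproduced with a simple budget model: per-packet host costs (system call, protocol processing, interrupts) plus per-byte costs (copying, checksumming) are charged against the CPU, and whatever remains is application time. The two cost constants below are assumptions chosen so the model saturates near the 21 Mbit/second figure quoted above; they are not the measured DECstation values.

```c
#include <stdio.h>

/* Toy model of "CPU left for the application" as a function of FDDI
 * throughput with 4500-byte packets.  The per-packet and per-byte cost
 * constants are placeholders picked to saturate around 21 Mbit/second,
 * matching the behaviour described in the text, not measured values. */
int main(void)
{
    const double pkt_size      = 4500.0;   /* bytes per packet           */
    const double per_pkt_cost  = 360e-6;   /* host seconds per packet    */
    const double per_byte_cost = 300e-9;   /* host seconds per data byte */

    for (double mbps = 5.0; mbps <= 25.0; mbps += 5.0) {
        double bytes = mbps * 1e6 / 8.0;          /* bytes per second   */
        double pkts  = bytes / pkt_size;          /* packets per second */
        double load  = pkts * per_pkt_cost + bytes * per_byte_cost;
        double left  = load < 1.0 ? 1.0 - load : 0.0;

        printf("%5.1f Mbit/s: %3.0f%% of the CPU left for the application\n",
               mbps, 100.0 * left);
    }
    return 0;
}
```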
Based on these measurements we can estimate the performance for a HIPPI network interface
based on the CAB architecture. The result is shown in Figure 11. As expected, network
throughput is limited by the bandwidth of the memory bus (peak of 100 MByte/second). The
graph shows that we should be able to achieve about 300 Mbit/second throughput. This