An Evaluation of an Attempt at Offloading
TCP/IP Protocol Processing onto an
i960RN-based iNIC

Boon S. Ang
Computer Systems and Technology Laboratory
HP Laboratories Palo Alto
HPL-2001-8

January 9th, 2001*
`
TCP/IP networking, intelligent network interface
This report presents an evaluation of a TCP/IP offload implementation that
utilizes a 100BaseT intelligent Network Interface Card (iNIC) equipped with
a 100 MHz i960RN processor. The entire FreeBSD-derived networking stack
from socket downward is implemented on the iNIC with the goal of reducing
host processor workload. For large messages that result in MTU packets, the
offload implementation can sustain wire-speed on receive but only about
80% of wire-speed on transmit. Utilizing hardware-based profiling of TTCP
benchmark runs, our evaluation pieced together a comprehensive picture of
transmit behavior on the iNIC. Our first surprise was the number of i960RN
processor cycles consumed in transmitting large messages: around 17
thousand processor cycles per 1.5kbyte (Ethernet MTU) packet. Further
investigation reveals that this high cost is due to a combination of i960RN
architectural shortcomings, poor buffering strategy in the TCP/IP code
running on the iNIC, and limitations imposed by the I2O-based host-iNIC
interface. We also found room for improvement in the implementation of
the socket buffer data structure. This report presents profiling statistics, as
well as code-path analysis, that back up these conclusions. Our results call
into question the hypothesis that a specialized networking software
environment coupled with cheap embedded processors is a cost-effective way
of improving system performance. At least in the case of the offload
implementation on the i960RN-based iNIC, neither was the performance
adequate nor the system cheap. This conclusion, however, does not imply
that offload is a bad idea. In fact, measurements we made with Alacritech's
SLIC NIC, which partially offloads TCP/IP protocol processing to an ASIC,
suggest that offloading can confer advantages in a cost-effective way.
Taking the right implementation approach is critical.
`
* Internal Accession Date Only
© Copyright Hewlett-Packard Company 2001

Approved for External Publication
`
`
`
`
`
`
`
`
`
1 Introduction
`
This report presents an evaluation of a TCP/IP implementation that performs network
protocol stack processing on a 100BaseT intelligent network interface card (iNIC)
equipped with an i960RN embedded processor. Offloading TCP/IP protocol processing
from the host processors to a specialized environment was proposed as a means to reduce
the workload on the host processors. The initial arguments proffered were that network
protocol processing is consuming an increasingly large portion of processor cycles and
that a specialized software environment on an iNIC can perform the same task more
efficiently using cheaper processors.
`
`
Since the inception of the project, alternate motivations for offloading protocol stack
processing to an iNIC have been proposed. One is that the iNIC offers a point of network
traffic control independent of the host -- a useful capability in the Distributed Service
Utility (DSU) architecture [9] for Internet data centers, where the host system is not
necessarily trusted. More generally, the iNIC is viewed as a point where additional
specialized functions, such as firewalls and web caching, can be added in a way that scales
performance with the number of iNICs in a system.
`
The primary motivation of this evaluation is to understand the behavior of a specific
TCP/IP offload design implemented in the Platform System Software department of HP
Laboratories' Computer Systems and Technology Laboratory. Despite initial optimism,
this implementation, using Cyclone's PCI-981 iNIC, while able to reduce host processor
cycles spent on networking, is unable to deliver the same networking performance as
Windows NT's native protocol stack for 100BaseT Ethernet. Furthermore, transmit
performance lags behind receive performance for reasons that were not well understood.
`
Another goal of this work is to arrive at a good understanding of the processing
requirements, implementation issues, and hardware and software architectural needs of
TCP/IP processing. This understanding will feed into future iNIC projects targeting very
high bandwidth networking in highly distributed data center architectures. At a higher
level, information from this project provides concrete data points for understanding the
merits, if any, of offloading TCP/IP processing from the host processors to an iNIC.
`
1.1 Summary of Results
Utilizing hardware-assisted profiling of TTCP benchmark runs, our evaluation pieced
together a comprehensive picture of transmit behavior on the iNIC. All our
measurements assume that checksum computation, an expensive operation on generic
microprocessors, is done by specialized hardware in Ethernet MAC/PHY devices, as is the
case with commodity devices appearing in late 2000.
`
1 The team that implemented the TCP/IP offload recently informed us that they found a software
problem that was causing the transmit and receive performance disparity and had worked around it.
Unfortunately, we were unable to obtain further details in time for inclusion in this report.
2 The Cyclone PCI-981 iNIC uses Ethernet devices that do not have checksum support. To factor out the
cost of checksum computation, our offload implementation simply does not compute checksums on either
transmit or receive during our benchmark runs. To accommodate this, machines receiving packets from
our offload implementation run specially doctored TCP/IP stacks that do not verify checksums. Error rates
on today's networking hardware in a local switched network are low enough that running the TTCP
benchmark is not a problem.
`
`
`
`
Our first surprise was the number of i960RN processor cycles consumed in transmitting
large messages over TCP/IP -- around 17 thousand processor cycles per 1.5kbyte
(Ethernet MTU) packet. This cost increases very significantly for smaller messages
because aggregation (Nagle Algorithm) is done on the iNIC, thus incurring the overhead
of a handshake between the host processor and the iNIC for every message. In an extreme
case, with host software sending 1-byte messages that are aggregated into approximately
940-byte packets, each packet consumes 4.5 million i960RN processor cycles. Even
transmitting a pure acknowledgement packet costs over 10 thousand processor cycles.
`
`
Further investigation reveals that this high cost is due to a combination of i960RN
architectural shortcomings, poor buffering strategy in the TCP/IP code running on the
iNIC, and limitations imposed by the I2O-based host-iNIC interface. We also found
room for improvement in the implementation of the socket buffer data structure and
some inefficiency due to the gcc960 compiler.
`
Our study of the host-side behavior is done at a coarser level. Using NT's Performance
Monitor tool, we quantified the processor utilization and the number of interrupts during
each TTCP benchmark run. We compare these metrics for our offload implementation
with those of NT's native TCP/IP networking code and another partially offloaded iNIC
implementation from Alacritech. To deal with the fact that different implementations
achieve different networking bandwidth, the metrics are accumulated over the course of
complete runs transferring the same amount of data.
`
`
The measurements show that, compared against the native NT implementation, our offload
implementation achieves significantly lower processor utilization for large messages, but
much higher processor utilization for small messages, with the crossover point at around
1000-byte messages. The interrupt statistics show a similar trend, though with the crossover
at a smaller message size. Furthermore, the number of interrupts is reduced by a much
more significant percentage than the reduction in host processor utilization, suggesting
that costs other than interrupt processing contribute quite significantly to the remaining
host-side cost in our offload implementation. Based on other researchers' results [1], we
believe that the host processor copying data between user and system buffers is the major
remaining cost.
`
The Alacritech NIC is an interesting comparison because it represents a very lean and
low-cost approach. Whereas the Cyclone board is a full-length PCI card that is
essentially a complete computer system decked with a processor, memory and supporting
logic chips, the Alacritech NIC looks just like another normal NIC card, except that its
MAC/PHY ASIC has additional logic to process the TCP/IP protocol for "fast path" cases.
`
3 The aggregation is not controlled by a fixed size threshold, and thus the actual packet size varies
dynamically, subject to an MTU of 1460 data bytes.
`
`
`
`
A limitation of this approach is that it does not allow any modification or additions to the
offloaded functions once the hardware is designed.
`
Our measurements show that the Alacritech NIC is able to sustain network bandwidth
comparable to that of native NT for large messages, which is close to wire-speed. Its
accumulated host processor utilization, while lower than native NT's, is higher than that
with our offload implementation. Its performance degrades for smaller messages
because it has no means of aggregating outgoing messages (i.e. no Nagle Algorithm).
`
`
This study calls into question the hypothesis that a specialized software environment
together with a cheap embedded processor can effectively offload TCP/IP protocol stack
processing. The i960RN-based implementation studied in this work is unable to
adequately handle traffic even at the 100Mbit/s level, much less at the 1 Gigabit/s or 10
Gigabit/s levels that are the bandwidths of interest in the near future. While bad
buffering strategy is partly responsible for the less than satisfactory performance of our
offload implementation, its BSD-derived TCP/IP protocol stack is in fact better than
NT's when executed on the same host platform. Clearly, the "better" stack and the
advantage from the specialized interrupt-handling environment are insufficient to make
up for the loss of a good processor and the additional overhead of interactions between
the host and the iNIC. Ultimately, this approach is at best moving work from one place
to another without conferring any advantage in efficiency, performance or price, and at
worst, a performance limiter.
`
This conclusion does not imply that offload is a bad idea. In fact, the performance of the
Alacritech NIC suggests that it can confer advantages. What is critical is taking the right
implementation approach. We believe there is still unfinished research in this area, and that a
fixed hardware implementation, such as that from Alacritech, is not the solution. From a
functional perspective, having a flexible and extensible iNIC implementation not only
enables tracking changes to protocol standards, but also allows additional functions to be
added over time. The key is a cost-effective, programmable iNIC microarchitecture that
pays attention to the interface between iNIC and host. There are a number of promising
alternate microarchitecture components, ranging from specialized queue management
hardware, to field programmable hardware, to multi-threaded and/or multi-core, possibly
systolic-like, processors. This is the research topic of our next project.
`
1.2 Organization of this Report

The next section gives an overview of our i960RN-based TCP/IP offload
implementation. We briefly cover both the hardware and the software aspects of this
implementation to provide the background for the rest of this report. Section 3 examines
the behavior on the iNIC. Detailed profiling statistics giving breakdowns for various
steps of the processing, and utilization of the hardware resources, are presented. This is
followed in Section 4 with an examination of the host-side statistics. Section 5 presents
some related work covering both previous studies of TCP/IP implementations and other
offload implementations. Finally, we conclude in Section 6 with what we learned from
this study and areas for future work.
`
`
`
`
`
`
`
2 TCP/IP Offload Implementation Overview
`
Our TCP/IP offload implementation uses the Cyclone PCI-981 iNIC described in Section
2.1. The embedded processor on this iNIC runs a lean, HP custom-designed run-
time/operating system called RTX, described in Section 2.2.1. The networking protocol
stack code is derived from FreeBSD's Reno version of the TCP/IP protocol code.
Interaction between the host and the iNIC occurs through a hardware-implemented I2O
messaging infrastructure. Section 2.2.2 presents salient features of the networking code.
`
2.1 Cyclone PCI-981 iNIC hardware
`
`
[Figure: Block diagram of the Cyclone PCI-981 iNIC, showing the primary PCI bus (host
interface), the secondary PCI bus (device interface) with four 10/100BaseT Ethernet ports,
and the i960RN processor in between.]
`
`
`
The Cyclone PCI-981 iNIC, illustrated in the above figure, supports four 10/100BaseT
Ethernet ports on a private 32/64-bit, 33MHz PCI bus which will be referred to as the
secondary PCI bus. Although the bus is capable of 64-bit performance, each Ethernet
device only supports a 32-bit PCI interface, so that this bus is effectively 32-bit, 33MHz.
The iNIC presents a 32/64-bit, 33MHz PCI external interface which plugs into the
host PCI bus, referred to as the primary PCI bus. In our experiments, the host is only
equipped with a 32-bit PCI bus, thus constraining the iNIC's interface to operate at 32-
bit, 33MHz. Located between the two PCI buses is an i960RN highly integrated
embedded processor, marked by the dashed box in the above figure. It contains an i960
processor core running at 100MHz, a primary address translation unit (PATU) interfacing
with the primary PCI bus, a secondary address translation unit (SATU) interfacing with
the secondary PCI bus, and a memory controller (MC) used to control SDRAM
DIMMs. (Our evaluation uses an iNIC equipped with 16Mbytes of 66MHz SDRAM.) An
internal 64-bit, 66MHz bus connects these four components together.
`
`
`
`
The PATU is equipped with two DMA engines in addition to bridging the internal bus and
the primary PCI bus. It also implements I2O messaging and doorbell facilities in
hardware. The SATU has one DMA engine and bridges the internal bus and the
secondary PCI bus. The i960 processor core implements a simple single-issue processing
pipeline with none of the fancy superscalar, branch prediction and out-of-order
capabilities of today's mainstream processors. Not shown in the above figure are a PCI-
to-PCI bus bridge and an application accelerator in the i960RN chip. These are not used
in our offload implementation. Further details of the i960RN chip can be found in the
i960 RM/RN I/O Processor Developer's Manual [3].
`
2.2 Offload Software
`
Two pieces of software running on the i960 processor are relevant to this study. One is a
run-time system, called RTX, that defines the operating or execution environment. This
is described in the next section. The other piece of software is the networking protocol
code itself, which is described in Section 2.2.2.
`
2.2.1 Execution Environment
`
RTX is designed to be a specialized networking environment that avoids some well-
known system-imposed networking costs. More specifically, interrupts are not structured
into the layered, multiple-invocation framework found in most general-purpose operating
systems. Instead, interrupt handlers are allowed to run to completion. The motivation is
to avoid the cost of repeatedly storing aside information and subsequently re-invoking
processing code at a lower priority. RTX also only supports a single address space
without a clear notion of system vs. user-level address spaces.
`
RTX is a simple, ad hoc run-time system. Although it provides basic threading and pre-
emptive multi-threading support, the lack of enforceable restrictions on interrupt handling
makes it impossible to guarantee any fair share of processor cycles to each thread. In
fact, with the networking code studied in this report, interrupt handlers disable all forms
of interrupts for the full duration of their execution, including timer interrupts. Coupled
with the fact that interrupt handlers sometimes run for very long periods (e.g., we
routinely observed the Ethernet device driver running for tens of thousands of processor
clocks), this forces processor scheduling to be hand coded into the Ethernet device
interrupt handling code in the form of explicit software polls for events. Without these,
starvation is a real problem.
`
Overall, the execution of networking code is invoked either by interrupt, when no other
interrupt handler is running, or by software polling for the presence of masked interrupts.
Our measurements show that for TTCP runs, most invocations are through software
polling -- for every hardware-interrupt-dispatched invocation, we saw several tens of
software-polling-dispatched invocations.
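To make this dispatch structure concrete, the following is a minimal sketch, not the actual RTX source, of how a run-to-completion handler can interleave explicit software polls for other masked events; every function name in it is hypothetical.

    /* Hypothetical sketch of an RTX-style run-to-completion handler that
     * hand-codes scheduling via software polls; all names are illustrative. */
    void disable_all_interrupts(void);
    void enable_all_interrupts(void);
    int  ethernet_rx_ring_has_packets(void);
    void process_one_rx_packet(void);
    int  i2o_doorbell_pending(void);
    void service_i2o_messages(void);
    int  timer_tick_pending(void);
    void run_timer_callouts(void);

    static void ethernet_isr(void)
    {
        disable_all_interrupts();          /* handlers run with interrupts masked */

        while (ethernet_rx_ring_has_packets()) {
            process_one_rx_packet();       /* may walk far up the protocol stack */

            /* Because nothing can preempt this handler, it must explicitly poll
             * for other pending (masked) events, such as I2O doorbells from the
             * host or timer ticks; otherwise those events would starve. */
            if (i2o_doorbell_pending())
                service_i2o_messages();
            if (timer_tick_pending())
                run_timer_callouts();
        }

        enable_all_interrupts();           /* re-enabled only at completion */
    }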
`
`
`
4 This raises questions about how accurately timers are implemented on the offload design. It is quite
possible that software timer "ticks" occur less frequently than intended. We did not look closely into this
because it is beyond the scope of this study.
`
`
`
`
The RTX environment clearly presents challenges for the addition of new code. Without a
strict discipline for time-sharing the processor, every piece of code is tangled with every
other piece of code when it comes to avoiding starvation and ensuring timely handling of
events. Clearly, a better scheduling framework is needed, especially to support any kind
of service quality provisions.
`
2.2.2 Networking code
The network protocol code is derived from the Reno version of the BSD networking code
and uses the fxp device driver. The host interfaces to the iNIC at the socket level, using
the I2O messaging facility as the underlying means of communication. On the iNIC side,
glue code is added to splice into the protocol code at the socket level. The following two
sections briefly trace the transmit and receive paths.
`
2.2.2.1 Transmit path

[Figure: Transmit paths for large and small messages. On the host side, data is software-
copied from user space into system buffers (large messages) or directly into I2O message
frames (small messages). On the iNIC (IOP) side, large-message data is moved by hardware
DMA into mbufs and later "copied" by reference only, while small-message data is
software-copied out of the I2O frames and copied again to compress mbuf usage.]
`
The above diagram shows the transmit paths for large and small messages. The paths are
slightly different for different message sizes. The green portions (lightly shaded in non-
color prints) occur on the host side while the blue portions (darkly shaded in non-color
prints) happen on the iNIC. Unless otherwise stated, the host in this study is an HP
Kayak XU 6/300 with a 300MHz Pentium-II processor and 64Mbytes of memory,
running Windows NT 4.0 service pack 5. (The amount of memory, though small by
today's standards, is adequate for TTCP runs, especially with the -s option that we used in
our runs, which causes artificially generated data to be sourced at the transmit side, and
incoming data to be discarded at the receive end.)
`
When transmitting large messages, the message data is first copied on the host side from
user space into pre-pinned system buffers. Next, I2O messages are sent to the iNIC with
references to the data residing in host-side main memory. On the iNIC side, servicing of
the I2O messages includes setting up DMA requests. Hardware DMA engines in the
PATU perform the actual data transfer from host main memory into iNIC SDRAM,
`
`
`
`
where it is placed in mbuf data structures. (We will have a more detailed discussion of
mbufs later, in Section 3.)
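To make the hand-off concrete, the sketch below shows roughly what such a large-message transmit request could carry. The real message layout follows the I2O specification and the offload code's own conventions; every field name and size here is hypothetical.

    /* Hypothetical shape of a socket-level transmit request carried in an I2O
     * message frame; field names and sizes are illustrative only. */
    #include <stdint.h>

    struct xmit_sg_entry {
        uint64_t host_phys_addr;       /* physical address of a pinned host buffer */
        uint32_t length;               /* number of bytes in this fragment */
    };

    struct xmit_request {
        uint32_t socket_handle;        /* identifies the offloaded connection */
        uint32_t total_length;         /* sum of all fragment lengths */
        uint32_t num_fragments;        /* entries used in the scatter-gather list */
        struct xmit_sg_entry frag[8];  /* references to data still sitting in host
                                          memory; the PATU DMA engines pull it
                                          into iNIC SDRAM (into mbufs) later */
    };

The important property is that for large messages only these descriptors travel in the I2O frame; the payload itself is pulled across the bus in the separate DMA step described above.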
`
When the DMA completes, iNIC code is once again invoked to queue the transferred data
with the relevant socket data structure and push it down the protocol stack. For large
messages, the enqueuing simply involves linking the already allocated mbufs into a
linked list. Without interrupting this thread of execution, an attempt is next made to send
this message out. The main decision point is at the TCP level, where the tcp_output
function will decide if any of the data should be sent at this time. This decision is based
on factors such as the amount of data that is ready to go (the Nagle Algorithm will wait if
there is too little data) and whether there is any transmit window space left (which is
determined by the amount of buffer space advertised by the receiver and the dynamic
actions of TCP's congestion control protocol). If tcp_output decides not to send any data
at this point, the thread suspends. Transmission of data will be re-invoked by other
events, such as the arrival of more data from the host side, expiration of a timeout timer
to stop waiting for more data, or the arrival of acknowledgements that open up the transmit
window.
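In condensed form, the decision amounts to the familiar Nagle and window checks. The sketch below is a simplified stand-in, not the actual FreeBSD tcp_output code, and its structure fields are invented for illustration.

    /* Condensed stand-in for the Nagle/window decision made in tcp_output();
     * this struct is illustrative, not the real BSD tcpcb. */
    struct conn_state {
        long unsent;        /* bytes queued in the send socket buffer, not yet sent */
        long send_window;   /* min(receiver-advertised window, congestion window) */
        long unacked;       /* bytes sent but not yet acknowledged */
        long mss;           /* TCP segment size, matched here to the Ethernet MTU */
        int  nodelay;       /* nonzero if the Nagle Algorithm is disabled */
    };

    static int should_send_now(const struct conn_state *c)
    {
        long len = (c->unsent < c->send_window) ? c->unsent : c->send_window;

        if (len >= c->mss)
            return 1;   /* a full-MTU segment is always worth sending */
        if (len > 0 && c->unacked == 0)
            return 1;   /* nothing in flight: Nagle permits a small segment */
        if (len > 0 && c->nodelay)
            return 1;   /* the application disabled Nagle */
        return 0;       /* otherwise wait; more data from the host, an ACK that
                           opens the window, or a timer will re-invoke the attempt */
    }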
`
Once tcp_output decides to send out data, a "copy" of the data is made. The new "copy"
is passed down the protocol stack through the IP layer and then to the Ethernet device
driver. This copy is deallocated once the packet is transmitted onto the wire. The
original copy is kept as a source copy until an acknowledgement sent by the receiver is
received. To copy large messages, the mbuf-based BSD buffering strategy merely copies
an "anchor" data structure with a reference to the actual data.
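Conceptually, the reference-only copy works as in the sketch below; these structures are simplified stand-ins for the real BSD mbuf, cluster and m_copy machinery, which also handles chains and more careful reference counting.

    /* Simplified stand-ins for the BSD mbuf cluster mechanism; the real mbuf,
     * m_copy() and its reference counting are more involved. */
    struct cluster {
        char data[2048];    /* the actual payload storage */
        int  refcount;      /* number of mbufs currently referencing this cluster */
    };

    struct mbuf_sketch {
        struct mbuf_sketch *next;   /* chain link */
        struct cluster *clust;      /* external storage holding the data */
        int offset, len;            /* region of the cluster covered by this mbuf */
    };

    /* "Copying" a large message duplicates only the small anchor structure and
     * bumps the cluster reference count; no payload bytes are moved. */
    static void ref_copy(struct mbuf_sketch *dst, const struct mbuf_sketch *src)
    {
        *dst = *src;             /* copy the anchor (header) only */
        dst->clust->refcount++;  /* both anchors now share the same cluster */
    }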
`
Another copy may occur at the IP layer if fragmentation occurs. With proper setting of the
TCP segment size to match that of the underlying network's MTU, no fragmentation
occurs. Execution may again be suspended at the IP layer if address lookup results in an
external query using ARP. The IP layer caches the result of such a query so that in most
cases during a bulk transfer over a TCP connection, no suspension occurs here.
`
Transfer of data from iNIC SDRAM onto the wire is undertaken by DMA engines on the
Ethernet devices. When transmission completes, the i960 processor is notified via
interrupts.
`
The transmit path for small messages is very similar to that for large messages, with a few
differences. One difference is that data is passed from the host processor to the iNIC directly
in the I2O messages. (The specific instruction-level mechanism on IA32 platforms is
PIO operations.) Thus, on the host side, data is copied from user memory directly into
I2O message frames. At the iNIC side, data is now available during servicing of a
transmit-request I2O message. The iNIC software directly copies the data into mbufs
instead of using DMA because for small messages, it is cheaper this way than using
DMA. The mbuf data structure behaves differently for small messages (< 208 bytes)
than for large messages. When small messages are enqueued into a socket, an attempt is
made to conserve buffer usage by "compressing" data into partly used mbufs. This may
result in significant software copying. Further down the protocol stack, when a copy of
`
`
`
message data is made in the tcp_output function for passing further down the
protocol stack, an actual copy of the data occurs for small messages.
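The copy-to-compress step behaves roughly as sketched below. This is a simplified stand-in for BSD-style sbcompress() behavior; the structure is illustrative, with the 208-byte figure taken from the small-message boundary mentioned above.

    /* Rough sketch of copy-to-compress when small messages are appended to a
     * socket buffer; structure and sizes are illustrative, not the real mbuf. */
    #include <string.h>

    #define MBUF_DATA_SIZE 208     /* small-message boundary noted above */

    struct small_mbuf {
        struct small_mbuf *next;
        int len;                             /* bytes currently stored */
        char data[MBUF_DATA_SIZE];           /* payload held inside the mbuf */
    };

    /* Append new data, compressing into the tail mbuf's free space when possible.
     * Returns how many bytes were absorbed by the existing tail mbuf; the caller
     * chains fresh mbufs for whatever remains. */
    static int sb_compress_append(struct small_mbuf *tail,
                                  const char *src, int srclen)
    {
        int space = MBUF_DATA_SIZE - tail->len;
        int n = (srclen < space) ? srclen : space;

        memcpy(tail->data + tail->len, src, n);   /* the extra software copy */
        tail->len += n;
        return n;
    }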
`
Clearly, many copies or pseudo-copies occur in the transmit path. In Section 3 we will
re-examine mbuf and this issue of copying. Actual measurements of the cost involved
will be presented.
`
2.2.2.2 Receive path
The following diagram illustrates the receive path. Broadly speaking, it is the reverse of
the transmit path. (As before, the green or lightly shaded portions execute on the host
side, while the blue or darkly shaded portions execute on the iNIC.) Again, we will first
consider the path for large packets. The first action is taken by the Ethernet device. Its
device driver pre-queues empty mbufs that are filled by the device as data packets arrive.
The i960 processor is notified of the presence of new packets via interrupts, which may
be invoked either by hardware interrupt dispatch or software poll.

[Figure: Receive paths for large and small messages. The Ethernet MAC hardware copies
incoming packets into pre-queued mbufs; on the iNIC (IOP) side, large-message data is
moved to host system buffers by hardware DMA while small-message data is software-
copied (and may be copied again to compress mbuf usage); on the host side, data is finally
software-copied into user memory.]
`
The Ethernet device interrupt handler begins the walk up the network protocol stack. The
logical processing of data packets may involve reassembly at the IP layer and dealing
with out-of-order segments at the TCP layer. In practice, these are very infrequent, and
the BSD TCP/IP code provides a "fast-path" optimization that reduces processing when
packets arrive in order. An incoming packet's header is parsed to identify the protocol,
and if TCP, the connection. This enables the packet's mbufs to be queued into the
relevant socket's sockbuf data structure. Again, for large messages, this simply links
mbufs into a singly linked list.
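In outline, the per-packet fast-path work reduces to a header parse, a connection lookup, and a list append, as in the much-condensed sketch below; the lookup and append helpers are hypothetical stand-ins for the real BSD routines.

    /* Condensed sketch of in-order TCP receive processing; the helper
     * functions and structures are hypothetical stand-ins. */
    #include <stdint.h>

    struct connection;   /* opaque per-connection state */
    struct connection *find_connection(const uint8_t *src_ip, const uint8_t *dst_ip,
                                       const uint8_t *src_port, const uint8_t *dst_port);
    void sockbuf_append(struct connection *c, void *mbuf_chain);
    void send_ack_if_needed(struct connection *c);

    struct pkt {
        const uint8_t *ip_hdr;   /* start of the IP header */
        void *mbuf_chain;        /* mbufs holding the packet data */
    };

    static void tcp_fast_receive(struct pkt *p)
    {
        const uint8_t *ip = p->ip_hdr;
        int ihl = (ip[0] & 0x0f) * 4;            /* IP header length in bytes */

        if (ip[9] != 6)                          /* IP protocol field: 6 = TCP */
            return;

        const uint8_t *tcp = ip + ihl;
        struct connection *c = find_connection(ip + 12, ip + 16,    /* src, dst IP */
                                               tcp + 0, tcp + 2);   /* src, dst port */
        if (c == NULL)
            return;

        /* Fast path for an in-order segment: its mbufs are simply linked onto
         * the receiving socket's sockbuf; no payload bytes are copied here. */
        sockbuf_append(c, p->mbuf_chain);
        send_ack_if_needed(c);
    }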
`
Next, the data is handed over to the host. As an optimization, our offload implementation
attempts to aggregate more packets before interrupting the host to notify it of the arrived
data. The i960 processor is responsible for using the hardware DMA engine in the PATU to
move data from iNIC SDRAM into host memory before notifying the host about the
newly arrived data through I2O messages. Data is transferred into host-side system
`
`
`
`
buffers that have been pinned and handed to the iNIC ahead of time. Eventually, when
software does a receive on the host side, data is copied on the host side from the system
buffer into user memory.
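A minimal sketch of this batching policy is shown below; the threshold value and all function names are invented for illustration, and the real implementation's trigger may well differ (for example, it may also flush on a timer or when host buffers fill).

    /* Illustrative sketch of batching received data before notifying the host;
     * the threshold and every function name here are hypothetical. */
    void dma_to_host_buffers(long nbytes);            /* PATU DMA into pre-pinned host buffers */
    void post_i2o_receive_notification(long nbytes);  /* I2O message, raising a host interrupt */

    #define NOTIFY_THRESHOLD_BYTES (8 * 1024)

    struct rx_batch {
        long pending_bytes;   /* data queued on the iNIC but not yet handed to the host */
    };

    static void on_packet_queued_to_socket(struct rx_batch *b, long nbytes)
    {
        b->pending_bytes += nbytes;

        /* Aggregate several packets before involving the host: only when enough
         * data is pending is it DMA'd into host buffers and announced with a
         * single I2O notification (and hence a single host interrupt).  Until
         * then the data occupies iNIC SDRAM, so buffer space is not released
         * and pure ACKs back to the sender are also delayed. */
        if (b->pending_bytes >= NOTIFY_THRESHOLD_BYTES) {
            dma_to_host_buffers(b->pending_bytes);
            post_i2o_receive_notification(b->pending_bytes);
            b->pending_bytes = 0;
        }
    }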
`
For small messages, similar differences as in the case of transmit apply. Thus, queuing
into the sockbuf may involve copying to compress and conserve mbuf usage. Just as in the
case of transmit, small-size data is passed between the iNIC and host in I2O messages.
On the host side, this enables an optimization if the receiver has previously posted a
receive -- data is copied by the host processor from the I2O messages (physically residing
on the iNIC) into user memory directly. If no receive has been posted yet, the data is first
copied into a system buffer.
`
3 iNIC-Side Behavior
`
This section presents the profiling statistics we collected on the iNIC side. Most
measurements rely on a free-running cycle counter on the i960 processor to provide
timing information. Timing code is added manually to the networking and RTX source
code to measure durations of interest. In most cases, this is straightforward because the
code operates as interrupt handlers in a non-preemptable mode. Timing a chunk of code
simply involves noting the starting and ending times and accumulating the differences if
aggregated data over multiple invocations is being collected. The exception is the
collection of the statistics reported in Section 3.2, which are gathered using the i960RN
processor's hardware resource usage performance registers. More will be said about this
in that section.
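The instrumentation amounts to reading the free-running cycle counter around each region of interest and accumulating the differences. A minimal sketch is shown below; read_cycle_counter() is a stand-in for whatever counter access the i960 build actually uses.

    /* Minimal sketch of the manual timing instrumentation; the cycle-counter
     * accessor is a hypothetical stand-in for the real i960 register read. */
    #include <stdint.h>

    extern uint32_t read_cycle_counter(void);   /* free-running 100MHz counter */

    struct timing_bucket {
        uint64_t total_pclks;   /* accumulated cycles for this code region */
        uint32_t invocations;   /* number of times the region was entered */
    };

    static struct timing_bucket tcp_output_bucket;

    /* Because handlers run to completion with interrupts masked, simple
     * start/stop bracketing is safe: the region cannot be preempted. */
    #define TIME_REGION(bucket, stmt)                                           \
        do {                                                                    \
            uint32_t _t0 = read_cycle_counter();                                \
            stmt;                                                               \
            (bucket).total_pclks += (uint32_t)(read_cycle_counter() - _t0);     \
            (bucket).invocations++;                                             \
        } while (0)

A call site then wraps the region being measured, for example TIME_REGION(tcp_output_bucket, run_tcp_output()); with run_tcp_output() standing for the measured code, and the per-invocation average is total_pclks / invocations.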
`
The commonly used TTCP benchmark is used for our studies. This provides a
convenient micro-benchmark for getting a detailed look at various aspects of the
networking code during bulk transfer.
`
In the next section, we begin with some general performance statistics that show the
breakdown of processing time. Our study emphasizes transmit behavior, as that has the
bigger problem in our implementation. Furthermore, for many server applications that
are the targets of the offload technology, transmit performance is more important than
receive performance.
`
The general statistics first lead us to look for hardware bottlenecks because the cost
numbers look surprisingly large. During close manual inspection of certain code
fragments, we also came across cases where a relatively small number of instructions, say
tens of instructions in an inner loop, take hundreds of processor cycles per iteration. Our
investigation into hardware bottlenecks is reported in Section 3.2.
`
Next, we shift our attention to the components responsible for the largest share of
processing cost, as indicated by the processor usage breakdown reported in Section 3.1.
We examine tcp_output, which has one of the largest shares of the processing cycles, to
get a better understanding of what happens in that function. The results are reported in
Section 3.3. This took us onto the track of the overall buffering strategy, a closer look at
which is reported in Section 3.4. One other significant source of overhead is the host-
iNIC interface. We examine that in Section 3.5.
`
`
`
`
Finally, a number of miscellaneous inefficiencies that we came across during our study
are reported in Section 3.6.
`
3.1 General Performance Statistics

We begin with a summary of the iNIC processor usage breakdown, roughly categorized by
network protocol layer. We added a layer for the host-iNIC interface, which includes the
cost of moving data between the host and iNIC memories. We report the numbers in
100MHz processor clock cycles, labeled as "pclks". We use the term message (msg) to
refer to the logical unit used by host software sending or receiving data, and the term
packet (pkt) to refer to the Ethernet frame that actually goes on the wire. For all our
experiments, the TCP segment size and IP MTU are set to match the Ethernet frame
MTU of 1.5kbyte so that a packet is also the unit handled by the IP and TCP layers.
`
The breakdown contains data for three transmit instances with representative message
sizes -- 8kbyte as a large message, 200 bytes as a small but common message size, and 1
byte to highlight the behavior of the protocol stack. We only report one instance of receive,
where TTCP is receiving with 8kbyte buffers. In all cases, the machine on the other end
of TTCP is a 500MHz Pentium-III FreeBSD box equipped with a normal NIC card. This
end is able to handle the workload easily and is not a bottleneck.
`
The row labeled "Per msg cost" and the corresponding cost breakdown include both
transmit and receive costs; they are averages derived by dividing the total processor usage
by the number of messages. While this gives a total cost that is most readily correlated to
host software transmit and receive actions, it bears little direct correspondence to actions
on the iNIC, except in the cases of the host-iNIC interface and socket layers for transmit.
For receive, even these layers' numbers have marginal correspondence to the per-message
cost number because action in all network layers is driven by incoming packets.
`
Nevertheless, there are several interesting points about the per-message numbers. One is
that for 8kbyte messages, transmit costs 66% more than receive. It was not immediately
obvious why that should be the case. While transmit and receive processing is obviously
different, the per-packet costs do not indicate as big a cost difference. Our investigation
found that a different number of acknowledgement packets is involved in the two runs.
When the iNIC transmits data, the FreeBSD box that is receiving sends a pure
acknowledgement packet for every two data packets it receives. In contrast, when the
iNIC is the data receiver, it only sends out a pure ack packet after every six data packets.
This happens because incoming packets are aggregated before being sent to the host.
Only when data is sent to the host is buffering space released and acknowledged to the
sender. With transmission of each pure ack packet costing about ten thousand pclks, this
is a significant "saving" that improves receive performance.
`
The per-message cost breakdown highlights the layers that are most costly in terms of
processor usage. For the 8kbyte message transmit (large message transmit), the TCP layer
is responsible for the bulk of the iNIC processor cycles. (See Section 3.3 for further
accounting.)
`
`
`
`
For small message transmit, however, the host-iNIC interface accounts for the bulk of the
cost. This is due to Nagle Algorithm aggregation, which causes the per-packet costs to be
amortized over a number of messages, around a thousand in the case of 1-byte messages.
In our offload implementation, the aggregation is done on the iNIC, so the host-iNIC
interaction cost remains a per-message occurrence. The host-iNIC interface cost dominates
receive as well, though with a less dominant percentage. (See Section 3.5 for further
accounting.)
`
The host-iNIC interface cost is important because it constrains the frequency of host-
iNIC interaction