An Evaluation of an Attempt at Offloading
TCP/IP Protocol Processing onto an
i960RN-based iNIC

Boon S. Ang
Computer Systems and Technology Laboratory
HP Laboratories Palo Alto
HPL-2001-8

January 9th, 2001*

TCP/IP networking, Intelligent Network Interface
This report presents an evaluation of a TCP/IP offload implementation that utilizes a 100BaseT intelligent Network Interface Card (iNIC) equipped with a 100 MHz i960RN processor. The entire FreeBSD-derived networking stack from socket downward is implemented on the iNIC with the goal of reducing host processor workload. For large messages that result in MTU packets, the offload implementation can sustain wire-speed on receive but only about 80% of wire-speed on transmit. Utilizing hardware-based profiling of TTCP benchmark runs, our evaluation pieced together a comprehensive picture of transmit behavior on the iNIC. Our first surprise was the number of i960RN processor cycles consumed in transmitting large messages: around 17 thousand processor cycles per 1.5kbyte (Ethernet MTU) packet. Further investigation reveals that this high cost is due to a combination of i960RN architectural shortcomings, poor buffering strategy in the TCP/IP code running on the iNIC, and limitations imposed by the I2O-based host-iNIC interface. We also found room for improvements in the implementation of the socket buffer data structure. This report presents profiling statistics, as well as code-path analysis, that back up these conclusions. Our results call into question the hypothesis that a specialized networking software environment coupled with cheap embedded processors is a cost-effective way of improving system performance. At least in the case of the offload implementation on the i960RN-based iNIC, neither was the performance adequate nor the system cheap. This conclusion, however, does not imply that offload is a bad idea. In fact, measurements we made with Alacritech's SLIC NIC, which partially offloads TCP/IP protocol processing to an ASIC, suggest that offloading can confer advantages in a cost-effective way. Taking the right implementation approach is critical.
* Internal Accession Date Only

© Copyright Hewlett-Packard Company 2001

Approved for External Publication
1 Introduction

This report presents an evaluation of a TCP/IP implementation that performs network protocol stack processing on a 100BaseT intelligent network interface card (iNIC) equipped with an i960RN embedded processor. Offloading TCP/IP protocol processing from the host processors to a specialized environment was proposed as a means to reduce the workload on the host processors. The initial arguments proffered were that network protocol processing consumes an increasingly larger portion of processor cycles and that a specialized software environment on an iNIC can perform the same task more efficiently using cheaper processors.

Since the inception of the project, alternate motivations for offloading protocol stack processing to an iNIC have been proposed. One is that the iNIC offers a point of network traffic control independent of the host -- a useful capability in the Distributed Service Utility (DSU) architecture [9] for Internet data-centers, where the host system is not necessarily trusted. More generally, the iNIC is viewed as a point where additional specialized functions, such as firewall and web caching, can be added in a way that scales performance with the number of iNICs in a system.

The primary motivation of this evaluation is to understand the behavior of a specific TCP/IP offload design implemented in the Platform System Software department of HP Laboratories' Computer Systems and Technology Laboratory. Despite initial optimism, this implementation using Cyclone's PCI-981 iNIC, while able to reduce host processor cycles spent on networking, is unable to deliver the same networking performance as Windows NT's native protocol stack for 100BaseT Ethernet. Furthermore, transmit performance lags behind receive performance for reasons that were not well understood.

Another goal of this work is to arrive at a good understanding of the processing requirements, implementation issues, and hardware and software architectural needs of TCP/IP processing. This understanding will feed into future iNIC projects targeting very high bandwidth networking in highly distributed data center architectures. At a higher level, information from this project provides concrete data-points for understanding the merits, if any, of offloading TCP/IP processing from the host processors to an iNIC.

1.1 Summary of Results

Utilizing hardware-assisted profiling of TTCP benchmark runs, our evaluation pieced together a comprehensive picture of transmit behavior on the iNIC. All our measurements assume that checksum computation, an expensive operation on generic microprocessors, is done by specialized hardware in Ethernet MAC/Phy devices, as is the case with commodity devices appearing in late 2000.

1 The team that implemented the TCP/IP offload recently informed us that they found some software problem that was causing the transmit and receive performance disparity and had worked around it. Unfortunately, we were unable to obtain further detail in time for inclusion in this report.

2 The Cyclone PCI-981 iNIC uses Ethernet devices that do not have checksum support. To factor out the cost of checksum computation, our offload implementation simply does not compute checksums on both transmit and receive during our benchmark runs. To accommodate this, machines receiving packets from our offload implementation run specially doctored TCP/IP stacks that do not verify checksums. Error rates on today's networking hardware in a local switched network are low enough that running the TTCP benchmark is not a problem.
Our first surprise was the number of i960RN processor cycles consumed in transmitting large messages over TCP/IP -- around 17 thousand processor cycles per 1.5kbyte (Ethernet MTU) packet. This cost increases very significantly for smaller messages because aggregation (Nagle Algorithm) is done on the iNIC, thus incurring the overhead of a handshake between host processor and iNIC for every message. In an extreme case, with host software sending 1-byte messages that are aggregated into approximately 940-byte packets, each packet consumes 4.5 million i960RN processor cycles. Even transmitting a pure acknowledgement packet costs over 10 thousand processor cycles.

3 The aggregation is not controlled by a fixed size threshold, and thus the actual packet size varies dynamically, subject to an MTU of 1460 data bytes.

Further investigation reveals that this high cost is due to a combination of i960RN architectural shortcomings, poor buffering strategy in the TCP/IP code running on the iNIC, and limitations imposed by the I2O-based host-iNIC interface. We also found room for improvements in the implementation of the socket buffer data structure and some inefficiency due to the gcc960 compiler.

Our study of the host-side behavior is done at a coarser level. Using NT's Performance Monitor tool, we quantified the processor utilization and the number of interrupts during each TTCP benchmark run. We compare these metrics for our offload implementation with those of NT's native TCP/IP networking code and another partially offloaded iNIC implementation from Alacritech. To deal with the fact that different implementations achieve different networking bandwidth, the metrics are accumulated over the course of complete runs transferring the same amount of data.

The measurements show that, compared against the native NT implementation, our offload implementation achieves significantly lower processor utilization for large messages, but much higher processor utilization for small messages, with the crossover point at around 1000-byte messages. The interrupt statistics show a similar trend, though with the crossover at a smaller message size. Furthermore, the number of interrupts is reduced by a much more significant percentage than the reduction in host processor utilization, suggesting that costs other than interrupt processing contribute quite significantly to the remaining host-side cost in our offload implementation. Based on other researchers' results [1], we believe that host processor copying of data between user and system buffers is the major remaining cost.

The Alacritech NIC is an interesting comparison because it represents a very lean and low cost approach. Whereas the Cyclone board is a full-length PCI card that is essentially a complete computer system decked with a processor, memory and supporting logic chips, the Alacritech NIC looks just like another normal NIC card, except that its MAC/Phy ASIC has additional logic to process the TCP/IP protocol for "fast path" cases.
A limitation of this approach is that it does not allow any modification or additions to the offloaded functions once the hardware is designed.

Our measurement shows that an Alacritech NIC is able to sustain network bandwidth comparable to that of native NT for large messages, which is close to wire-speed. Its accumulated host processor utilization, while lower than native NT's, is higher than that with our offload implementation. Its performance degrades when messages are smaller than 2Kbytes because it has no means of aggregating outgoing messages (i.e. no Nagle Algorithm).

This study calls into question the hypothesis that a specialized software environment together with a cheap embedded processor can effectively offload TCP/IP protocol stack processing. The i960RN-based implementation studied in this work is unable to adequately handle traffic even at the 100Mbit/s level, much less at the 1 Gigabit/s or 10 Gigabit/s levels that are the bandwidths of interest in the near future. While bad buffering strategy is partly responsible for the less than satisfactory performance of our offload implementation, its BSD-derived TCP/IP protocol stack is in fact better than NT's when executed on the same host platform. Clearly, the "better" stack and the advantage from the specialized interrupt-handling environment are insufficient to make up for the loss of a good processor and the additional overhead of interactions between the host and the iNIC. Ultimately, this approach is at best moving work from one place to another without conferring any advantage in efficiency, performance or price, and at worst, a performance limiter.

This conclusion does not imply that offload is a bad idea. In fact, the performance of the Alacritech NIC suggests that it can confer advantages. What is critical is taking the right implementation approach. We believe there is still unfinished research in this area, and that a fixed hardware implementation, such as that from Alacritech, is not the solution. From a functional perspective, having a flexible and extensible iNIC implementation not only enables tracking changes to protocol standards, but also allows additional functions to be added over time. The key is a cost-effective, programmable iNIC microarchitecture that pays attention to the interface between iNIC and host. There are a number of promising alternate microarchitecture components, ranging from specialized queue management hardware, to field programmable hardware, to multi-threaded and/or multi-core, possibly systolic-like, processors. This is the research topic of our next project.

1.2 Organization of this Report

The next section gives an overview of our i960RN-based TCP/IP offload implementation. We briefly cover both the hardware and the software aspects of this implementation to pave the background for the rest of this report. Section 3 examines the behavior on the iNIC. Detailed profiling statistics giving breakdowns for various steps of the processing, and utilization of the hardware resources, are presented. This is followed in Section 4 with an examination of the host-side statistics. Section 5 presents some related work covering both previous studies of TCP/IP implementations and other offload implementations. Finally, we conclude in Section 6 with what we learned from this study and areas for future work.
`
2 TCP/IP Offload Implementation Overview

Our TCP/IP offload implementation uses the Cyclone PCI-981 iNIC described in Section 2.1. The embedded processor on this iNIC runs a lean, HP custom-designed run-time/operating system called RTX, described in Section 2.2.1. The networking protocol stack code is derived from FreeBSD's Reno version of the TCP/IP protocol code. Interaction between the host and the iNIC occurs through a hardware-implemented I2O messaging infrastructure. Section 2.2.2 presents salient features of the networking code.
`
2.1 Cyclone PCI-981 iNIC hardware

[Figure: block diagram of the Cyclone PCI-981 iNIC. Four 10/100BaseT Ethernet ports sit on the secondary PCI bus (device interface); the i960RN chip, shown as a dashed box containing the PATU, SATU, i960 core and memory controller, connects the SDRAM, the secondary PCI bus and the primary PCI bus (host interface).]

The Cyclone PCI-981 iNIC, illustrated in the above figure, supports four 10/100BaseT Ethernet ports on a private 32/64-bit, 33MHz PCI bus which will be referred to as the secondary PCI bus. Although the bus is capable of 64-bit operation, each Ethernet device only supports a 32-bit PCI interface, so that this bus is effectively 32-bit, 33MHz. The iNIC presents a 32/64-bit, 33MHz PCI external interface which plugs into the host PCI bus, referred to as the primary PCI bus. In our experiments, the host is only equipped with a 32-bit PCI bus, thus constraining the iNIC's interface to operate at 32-bit, 33MHz. Located between the two PCI buses is an i960RN highly integrated embedded processor, marked by the dashed box in the above figure. It contains an i960 processor core running at 100MHz, a primary address translation unit (PATU) interfacing with the primary PCI bus, a secondary address translation unit (SATU) interfacing with the secondary PCI bus, and a memory controller (MC) used to control SDRAM DIMMs. (Our evaluation uses an iNIC equipped with 16Mbytes of 66MHz SDRAM.) An internal 64-bit, 66MHz bus connects these four components together.
`
The PATU is equipped with two DMA engines in addition to bridging the internal bus and the primary PCI bus. It also implements I2O messaging and doorbell facilities in hardware. The SATU has one DMA engine and bridges the internal bus and the secondary PCI bus. The i960 processor core implements a simple single-issue processing pipeline with none of the fancy superscalar, branch prediction and out-of-order capabilities of today's mainstream processors. Not shown in the above figure are a PCI-to-PCI bus bridge and an application accelerator in the i960RN chip. These are not used in our offload implementation. Further details of the i960RN chip can be found in the i960 RM/RN I/O Processor Developer's Manual [3].
`
2.2 Offload Software

Two pieces of software running on the i960 processor are relevant to this study. One is a run-time system, called RTX, that defines the operating or execution environment. This is described in the next section. The other piece of software is the networking protocol code itself, which is described in Section 2.2.2.
`
2.2.1 Execution Environment

RTX is designed to be a specialized networking environment that avoids some well-known system-imposed networking costs. More specifically, interrupts are not structured into the layered, multiple-invocation framework found in most general purpose operating systems. Instead, interrupt handlers are allowed to run to completion. The motivation is to avoid the cost of repeatedly storing aside information and subsequently re-invoking processing code at a lower priority. RTX also only supports a single address space without a clear notion of system vs. user-level address spaces.
`
RTX is a simple, ad hoc run-time system. Although it provides basic threading and pre-emptive multi-threading support, the lack of enforceable restrictions on interrupt handling makes it impossible to guarantee any fair share of processor cycles to each thread. In fact, with the networking code studied in this report, interrupt handlers disable all forms of interrupts for the full duration of their execution, including timer interrupts. Coupled with the fact that interrupt handlers sometimes run for very long periods (e.g., we routinely observed the Ethernet device driver running for tens of thousands of processor clocks), this forces processor scheduling to be hand-coded into the Ethernet device interrupt handling code in the form of explicit software polls for events. Without these, starvation is a real problem.

4 This raises questions about how accurately timers are implemented on the offload design. It is quite possible that software timer "ticks" occur less frequently than intended. We did not look closely into this because it is beyond the scope of this study.

Overall, the execution of networking code is either invoked by interrupt when no other interrupt handler is running, or by software polling for the presence of masked interrupts. Our measurement shows that for TTCP runs, most invocations are through software polling -- for every hardware-interrupt-dispatched invocation, we saw several tens of software-polling-dispatched invocations.
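To make the dispatch structure concrete, the following is a minimal C sketch of the pattern described above. All identifiers (disable_interrupts, fxp_work_pending, i2o_doorbell_pending, and so on) are illustrative stand-ins, not the actual RTX or driver entry points.

    /* Illustrative prototypes; the real RTX/driver entry points differ. */
    extern void disable_interrupts(void);
    extern void enable_interrupts(void);
    extern int  fxp_work_pending(void);
    extern void fxp_process_one_packet(void);
    extern int  i2o_doorbell_pending(void);
    extern void i2o_service_messages(void);
    extern int  timer_tick_pending(void);

    /* Hypothetical sketch of the RTX dispatch pattern: interrupt handlers
     * run to completion with interrupts masked, so a long-running handler
     * must explicitly poll for other pending event sources to keep them
     * from starving. */
    void fxp_intr_handler(void)
    {
        disable_interrupts();            /* handlers are non-preemptable */

        while (fxp_work_pending()) {
            fxp_process_one_packet();    /* may run for tens of thousands
                                            of processor clocks */

            /* Hand-coded scheduling point: service other masked
             * interrupt sources inline, since nothing can preempt us. */
            if (i2o_doorbell_pending())
                i2o_service_messages();
            if (timer_tick_pending())
                timer_service();
        }

        enable_interrupts();
    }

Under this structure, fairness depends entirely on how often a long-running handler chooses to poll, which is the hand-coded scheduling referred to above.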
`
`The RTX environment clearly presents challenges for addition of new code. Without a
`strict discipline for time-sharing the processor, every piece of code is tangled with every
`other piece of code when it comes to avoiding starvation and ensuring timely handling of
`events. Clearly, a better scheduling framework is needed, especially to support any kind
`of service quality provisions.
`
2.2.2 Networking code

The network protocol code is derived from the Reno version of BSD networking code and uses the fxp device driver. The host interfaces to the iNIC at the socket level, using the I2O messaging facility as the underlying means of communication. On the iNIC side, glue code is added to splice into the protocol code at the socket level. The following two sections briefly trace the transmit and receive paths.
`
2.2.2.1 Transmit path

[Figure: transmit paths for large and small messages. For large messages: host software copy into system buffers, IOP hardware DMA copy into iNIC memory, then a reference-only copy down the stack. For small messages: host software copy into I2O message frames, IOP software copy into mbufs, and copy-to-compress in the socket buffer.]
`
The above diagram shows the transmit paths for large and small messages. The paths are slightly different for different message sizes. The green portions (lightly shaded in non-color prints) occur on the host side while the blue portions (darkly shaded in non-color prints) happen on the iNIC. Unless otherwise stated, the host in this study is an HP Kayak XU 6/300 with a 300MHz Pentium-II processor and 64Mbytes of memory, running Windows NT 4.0 service pack 5. (The amount of memory, though small by today's standards, is adequate for TTCP runs, especially with the -s option that we used in our runs, which causes artificially generated data to be sourced at the transmit side, and incoming data to be discarded at the receive end.)
`
When transmitting large messages, the message data is first copied on the host side from user space into pre-pinned system buffers. Next, I2O messages are sent to the iNIC with references to the data residing in host-side main memory. On the iNIC side, servicing of the I2O messages includes setting up DMA requests. Hardware DMA engines in the PATU perform the actual data transfer from host main memory into iNIC SDRAM,
where it is placed in mbuf data structures. (We will have a more detailed discussion of mbufs in Section 3.)
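The shape of such a transmit request can be sketched as follows. The structure layout, field names, and the i2o_post_message helper are hypothetical; they are meant only to illustrate the host passing physical buffer references that the iNIC turns into PATU DMA descriptors.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical large-message transmit request carried in an I2O
     * message frame.  The host describes pre-pinned system buffers by
     * physical (bus) address; the iNIC services the message by queueing
     * one PATU DMA transfer per segment. */
    #define XMIT_MAX_SEGS 8

    struct xmit_request {
        uint32_t socket_handle;            /* which offloaded connection  */
        uint32_t num_segments;             /* entries used in seg[]       */
        struct {
            uint64_t host_phys_addr;       /* pinned host buffer address  */
            uint32_t length;               /* bytes in this segment       */
        } seg[XMIT_MAX_SEGS];
    };

    /* Illustrative messaging primitive; the real hardware unit differs. */
    extern void i2o_post_message(const void *frame, size_t len);

    /* Host side: describe one pinned buffer and post the request. */
    void send_large_message(uint32_t sock, uint64_t phys, uint32_t len)
    {
        struct xmit_request req = {0};

        req.socket_handle         = sock;
        req.num_segments          = 1;
        req.seg[0].host_phys_addr = phys;
        req.seg[0].length         = len;
        i2o_post_message(&req, sizeof req);
    }

On the iNIC, servicing such a message amounts to turning each segment descriptor into a DMA request for the PATU engines, as described above.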
`
When the DMA completes, iNIC code is once again invoked to queue the transferred data with the relevant socket data structure and push it down the protocol stack. For large messages, the enqueuing simply involves linking the already allocated mbufs into a linked list. Without interrupting this thread of execution, an attempt is next made to send this message out. The main decision point is at the TCP level, where the tcp_output function will decide if any of the data should be sent at this time. This decision is based on factors such as the amount of data that is ready to go (the Nagle Algorithm will wait if there is too little data) and whether there is any transmit window space left (which is determined by the amount of buffer space advertised by the receiver and the dynamic actions of TCP's congestion control protocol). If tcp_output decides not to send any data at this point, the thread suspends. Transmission of data will be re-invoked by other events, such as the arrival of more data from the host side, expiration of a timeout timer to stop waiting for more data, or the arrival of acknowledgements that open up the transmit window.
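A condensed sketch of that decision is shown below. The field names follow BSD conventions (snd_una, snd_nxt, snd_wnd, snd_cwnd, t_maxseg), but the structure here is a simplified stand-in and the logic is a summary of, not an excerpt from, the actual tcp_output code.

    #include <stdint.h>

    /* Simplified stand-in for the BSD TCP control block. */
    struct tcb {
        uint32_t snd_una;    /* oldest unacknowledged sequence number */
        uint32_t snd_nxt;    /* next sequence number to send          */
        uint32_t snd_wnd;    /* window advertised by the receiver     */
        uint32_t snd_cwnd;   /* congestion window                     */
        uint32_t t_maxseg;   /* TCP segment size (MSS)                */
    };

    /* sb_bytes: bytes queued in the send socket buffer, counted from
     * snd_una.  Returns nonzero if a segment should go out now, zero if
     * the connection should wait for more data, window, or a timeout. */
    int should_send_now(const struct tcb *tp, uint32_t sb_bytes)
    {
        uint32_t win       = tp->snd_wnd < tp->snd_cwnd ?
                             tp->snd_wnd : tp->snd_cwnd;
        uint32_t in_flight = tp->snd_nxt - tp->snd_una;
        uint32_t sendable  = sb_bytes < win ? sb_bytes : win;
        uint32_t len       = sendable > in_flight ? sendable - in_flight : 0;

        if (len >= tp->t_maxseg)
            return 1;            /* a full segment is ready              */
        if (len > 0 && in_flight == 0)
            return 1;            /* idle connection: small send allowed  */
        return 0;                /* otherwise wait (Nagle / window)      */
    }

When this check declines to send, the iNIC thread simply suspends until one of the re-invoking events listed above (more data, a timeout, or an arriving acknowledgement) occurs.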
`
Once tcp_output decides to send out data, a "copy" of the data is made. The new "copy" is passed down the protocol stack through the IP layer and then to the Ethernet device driver. This copy is deallocated once the packet is transmitted onto the wire. The original copy is kept as a source copy until an acknowledgement sent by the receiver is received. To copy large messages, the mbuf-based BSD buffering strategy merely copies an "anchor" data structure with a reference to the actual data.
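For cluster-backed mbufs, such a reference copy duplicates only the small mbuf headers and bumps a shared reference count on the underlying data, so no payload bytes move. The sketch below uses a deliberately simplified, hypothetical mbuf (xmbuf) rather than the real FreeBSD structure, whose cluster bookkeeping differs in detail.

    #include <stdlib.h>

    /* Hypothetical, simplified mbuf with a shared cluster refcount. */
    struct xmbuf {
        struct xmbuf *m_next;
        char         *m_data;      /* points into a shared data cluster */
        int           m_len;
        int          *m_refcnt;    /* reference count for that cluster  */
    };

    /* "Copy" a chain by reference: new headers, same cluster payload. */
    struct xmbuf *copy_by_reference(const struct xmbuf *m)
    {
        struct xmbuf *head = NULL, **tail = &head;

        for (; m != NULL; m = m->m_next) {
            struct xmbuf *n = malloc(sizeof *n);
            if (n == NULL)
                break;                     /* error handling elided      */
            n->m_next   = NULL;
            n->m_data   = m->m_data;       /* no payload bytes copied    */
            n->m_len    = m->m_len;
            n->m_refcnt = m->m_refcnt;
            (*n->m_refcnt)++;              /* cluster is freed only when
                                              the last reference goes    */
            *tail = n;
            tail  = &n->m_next;
        }
        return head;
    }

This is why a large-message "copy" in this path costs only header manipulation, whereas the small-message path described later involves real byte copies.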
`
Another copy may occur at the IP layer if fragmentation occurs. With proper setting of the TCP segment size to match that of the underlying network's MTU, no fragmentation occurs. Execution may again be suspended at the IP layer if address lookup results in an external query using ARP. The IP layer caches the result of such a query so that in most cases during a bulk transfer over a TCP connection, no suspension occurs here.
`
Transfer of data from iNIC SDRAM onto the wire is undertaken by DMA engines on the Ethernet devices. When transmission completes, the i960 processor is notified via interrupts.
`
The transmit path for small messages is very similar to that for large messages, with a few differences. One difference is that data is passed from the host processor to the iNIC directly in the I2O messages. (The specific instruction-level mechanism on IA32 platforms is PIO operations.) Thus, on the host side, data is copied from user memory directly into I2O message frames. At the iNIC side, data is now available during servicing of a transmit-request I2O message. The iNIC software directly copies the data into mbufs instead of using DMA because for small messages, it is cheaper this way than using DMA. The mbuf data structure behaves differently for small messages (< 208 bytes) than for large messages. When small messages are enqueued into a socket, an attempt is made to conserve buffer usage by "compressing" data into partly used mbufs. This may result in significant software copying. Further down the protocol stack, when a copy of
message data is made in the tcp_output function for passing further down the protocol stack, an actual copy of the data occurs for small messages.
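The "compression" step is essentially the copy-to-compress shown in the transmit-path figure: when a small message is appended to a socket buffer, its bytes are copied into trailing free space of the last mbuf rather than linking another, mostly empty one (the analogous FreeBSD routine is sbcompress in the socket buffer code). A hedged sketch, with an illustrative mbuf layout and size constant:

    #include <string.h>

    #define SMALL_MBUF_SPACE 108        /* illustrative payload capacity */

    struct small_mbuf {
        struct small_mbuf *next;
        int   len;                      /* bytes currently used          */
        char  data[SMALL_MBUF_SPACE];
    };

    /* Append a small message to the socket buffer whose last mbuf is
     * 'tail'.  If it fits in the trailing space, copy it in place; this
     * memcpy is the software copying cost discussed in the text. */
    void sockbuf_append_small(struct small_mbuf *tail,
                              const char *msg, int msglen)
    {
        if (msglen <= SMALL_MBUF_SPACE - tail->len) {
            memcpy(tail->data + tail->len, msg, msglen);
            tail->len += msglen;
        } else {
            /* otherwise allocate and link a fresh mbuf (not shown) */
        }
    }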
`
Clearly, many copies or pseudo-copies occur in the transmit path. In Section 3 we will re-examine mbuf and this issue of copying. Actual measurements of the cost involved will be presented.
`
2.2.2.2 Receive path

The following diagram illustrates the receive path. Broadly speaking, it is the reverse of the transmit path. (As before, the green or lightly shaded portions execute on the host side, while the blue or darkly shaded portions execute on the iNIC.) Again, we will first consider the path for large packets. The first action is taken by the Ethernet device. Its device driver pre-queues empty mbufs that are filled by the device as data packets arrive. The i960 processor is notified of the presence of new packets via interrupts, which may be invoked either by hardware interrupt dispatch or software poll.

[Figure: receive paths for large and small messages, showing an Ethernet MAC hardware copy into mbufs, an IOP software copy / copy-to-compress for small messages, an IOP hardware DMA copy into host memory, and a host software copy into user buffers.]
The Ethernet device interrupt handler begins the walk up the network protocol stack. The logical processing of data packets may involve reassembly at the IP layer and dealing with out-of-order segments at the TCP layer. In practice, these are very infrequent, and the BSD TCP/IP code provides a "fast-path" optimization that reduces processing when packets arrive in order. An incoming packet's header is parsed to identify the protocol, and if TCP, the connection. This enables the packet's mbufs to be queued into the relevant socket's sockbuf data structure. Again, for large messages, this simply links mbufs into a singly linked list.
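The fast path hinges on a cheap test, known in the BSD code as header prediction, that the arriving segment is the next expected, unremarkable one. The sketch below simplifies that test (for example, it ignores sequence-number wraparound, which the real code handles with comparison macros), and its structure is an illustrative stand-in for the TCP control block.

    #include <stdint.h>

    #define TH_FIN  0x01
    #define TH_SYN  0x02
    #define TH_RST  0x04
    #define TH_PUSH 0x08
    #define TH_ACK  0x10
    #define TH_URG  0x20

    /* Illustrative subset of the receive-side TCP state. */
    struct tcb_rx {
        uint32_t rcv_nxt;    /* next in-order sequence number expected   */
        uint32_t snd_wnd;    /* send window last advertised by the peer  */
    };

    /* Nonzero if the segment can take the in-order fast path: it starts
     * exactly at rcv_nxt, carries no control flags beyond ACK/PUSH, and
     * does not change the advertised window. */
    int header_prediction_hit(const struct tcb_rx *tp,
                              uint32_t seg_seq, uint32_t seg_win,
                              uint8_t seg_flags)
    {
        if (seg_seq != tp->rcv_nxt)
            return 0;
        if (seg_flags & (TH_FIN | TH_SYN | TH_RST | TH_URG))
            return 0;
        if (seg_win != tp->snd_wnd)
            return 0;
        return 1;
    }

When the test succeeds, queuing the data is just appending the packet's mbuf chain to the tail of the socket's singly linked receive list, as described above.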
`
Next, the data is handed over to the host. As an optimization, our offload implementation attempts to aggregate more packets before interrupting the host to notify it of the arrived data. The i960 processor is responsible for using the hardware DMA engine in the PATU to move data from iNIC SDRAM into host memory before notifying the host about the newly arrived data through I2O messages. Data is transferred into host-side system
buffers that have been pinned and handed to the iNIC ahead of time. Eventually, when software does a receive on the host side, data is copied on the host side from the system buffer into user memory.
`
For small messages, similar differences as in the case of transmit apply. Thus, queuing into the sockbuf may involve copying to compress and conserve mbuf usage. Just as in the case of transmit, small-size data is passed between the iNIC and host in I2O messages. On the host side, this enables an optimization if the receiver has previously posted a receive -- data is copied by the host processor from the I2O messages (physically residing on the iNIC) into user memory directly. If no receive has been posted yet, the data is first copied into a system buffer.
`
3 iNIC-Side Behavior
`
This section presents the profiling statistics we collected on the iNIC side. Most measurements rely on a free-running cycle counter on the i960 processor to provide timing information. Timing code is added manually to the networking and RTX source code to measure durations of interest. In most cases, this is straightforward because the code operates as interrupt handlers in a non-preemptable mode. Timing a chunk of code simply involves noting the starting and ending times and accumulating the difference if aggregated data over multiple invocations is being collected. The exception is the collection of statistics reported in Section 3.2, done using the i960RN processor's hardware resource usage performance registers. More will be said about this in that section.
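The instrumentation pattern is simple enough to show in a few lines. In the sketch below, read_cycle_counter() is a placeholder for whatever register read the i960RN provides (it is not a real API name), and the bucket structure is likewise illustrative.

    #include <stdint.h>

    extern uint32_t read_cycle_counter(void);   /* placeholder, not a real API */

    struct timing_bucket {
        uint32_t invocations;
        uint64_t total_pclks;    /* accumulated 100MHz processor clocks */
    };

    /* Wrap a region of interest: note the start and end counts and
     * accumulate the delta.  Unsigned subtraction keeps the delta correct
     * even if the free-running counter wraps between the two reads. */
    void timed_region(struct timing_bucket *b, void (*region)(void *), void *arg)
    {
        uint32_t start = read_cycle_counter();
        region(arg);
        uint32_t delta = read_cycle_counter() - start;

        b->invocations += 1;
        b->total_pclks += delta;
    }

Because the measured code runs as non-preemptable interrupt handlers, no locking is needed around the accumulation.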
`
The commonly used TTCP benchmark is used for our studies. This provides a convenient micro-benchmark for getting a detailed look at various aspects of the networking code during bulk transfer.
`
In the next section, we begin with some general performance statistics that show the breakdown of processing time. Our study emphasizes transmit behavior, as that has the bigger problem in our implementation. Furthermore, for many server applications that are the targets of the offload technology, transmit performance is more important than receive performance.
`
The general statistics first lead us to look for hardware bottlenecks because the cost numbers look surprisingly large. During close manual inspection of certain code fragments, we also came across cases where a relatively small number of instructions, say 10's of instructions in an inner loop, take hundreds of processor cycles per iteration. Our investigation into hardware bottlenecks is reported in Section 3.2.
`
Next, we shift our attention to the components responsible for the largest share of processing cost, as indicated by the processor usage breakdown reported in Section 3.1. We examine tcp_output, which has one of the largest shares of the processing cycles, to get a better understanding of what happens in that function. The results are reported in Section 3.3. This took us onto the track of the overall buffering strategy, a closer look at which is reported in Section 3.4. One other significant source of overhead is the host-iNIC interface.
We examine that in Section 3.5. Finally, a number of miscellaneous inefficiencies that we came across during our study are reported in Section 3.6.
`
3.1 General Performance Statistics

Table 1 is a summary of the iNIC processor usage breakdown, roughly categorized by network protocol layer. We added a layer for the host-iNIC interface, which includes the cost of moving data between the host and iNIC memories. We report the numbers in 100MHz processor clock cycles, labeled as "pclks". We use the term message (msg) to refer to the logical unit used by host software sending or receiving data, and the term packet (pkt) to refer to the Ethernet frame that actually goes on the wire. For all our experiments, the TCP segment size and IP MTU are set to match the Ethernet frame MTU of 1.5kbyte, so that a packet is also the unit handled by the IP and TCP layers.
`
Table 1 contains the data for three transmit instances with representative message sizes -- 8kbyte as a large message, 200 byte as a small but common message size, and 1 byte to highlight the behavior of the protocol stack. We only report one instance of receive, where TTCP is receiving with 8kbyte buffers. In all cases, the machine on the other end of TTCP is a 500MHz Pentium-III FreeBSD box equipped with a normal NIC card. This end is able to handle the workload easily and is not a bottleneck.
`
The row labeled "Per msg cost" and the corresponding cost breakdown in Table 1 include both transmit and receive costs and are the average derived by dividing the total processor usage by the number of messages. While this gives a total cost that is most readily correlated to host software transmit and receive actions, it bears little direct correspondence to actions on the iNIC, except in the cases of the host-iNIC interface and socket layers for transmit. For receive, even these layers' numbers have marginal correspondence to the per-message cost number because action in all network layers is driven by incoming packets.
`
Nevertheless, there are several interesting points about the per-message numbers. One is that for 8kbyte messages, transmit costs 66% more than receive. It was not immediately obvious why that should be the case. While transmit and receive processing is obviously different, the per-packet costs do not indicate as big a cost difference. Our investigation found that a different number of acknowledgement packets are involved in the two runs. When the iNIC transmits data, the FreeBSD box that is receiving sends a pure acknowledgement packet for every two data packets it receives. In contrast, when the iNIC is the data receiver, it only sends out a pure ack packet after every six data packets. This happens because incoming packets are aggregated before being sent to the host. Only when data is sent to the host is buffering space released and acknowledged to the sender. With transmission of each pure ack packet costing about ten thousand pclks, this is a significant "saving" that improves receive performance.
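As a rough back-of-the-envelope estimate using only the numbers above, acknowledging every sixth data packet instead of every second avoids about

    10,000/2 - 10,000/6 = 5,000 - 1,667 ≈ 3,300 pclks

of ack-transmission cost per received data packet, which accounts for a sizeable part of the transmit/receive cost gap.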
`
The per-message cost breakdown in Table 1 highlights the layers that are most costly in terms of processor usage. For the 8kbyte message transmit (large message transmit), the TCP layer is responsible for the bulk of the iNIC processor cycles. (See Section 3.3 for further accounting.)
For small message transmit, however, the host-iNIC interface accounts for the bulk of the cost. This is due to the Nagle Algorithm aggregation, which causes the per-packet costs to be amortized over a number of messages, around a thousand in the case of 1-byte messages. In our offload implementation, the aggregation is done on the iNIC, so the host-iNIC interaction cost remains a per-message occurrence. The host-iNIC interface cost dominates receive as well, though with a less dominant percentage. (See Section 3.5 for further accounting.)
The host-iNIC interface cost is important because it constrains the frequency of host-iNIC interaction
