An Evaluation of an Attempt at Offloading TCP/IP Protocol Processing onto an i960RN-based iNIC

Boon S. Ang
Computer Systems and Technology Laboratory
HP Laboratories Palo Alto
HPL-2001-8
January 9, 2001*

Keywords: TCP/IP networking, intelligent network interface
This report presents an evaluation of a TCP/IP offload implementation that utilizes a 100BaseT intelligent Network Interface Card (iNIC) equipped with a 100 MHz i960RN processor. The entire FreeBSD-derived networking stack from socket downward is implemented on the iNIC with the goal of reducing host processor workload. For large messages that result in MTU packets, the offload implementation can sustain wire-speed on receive but only about 80% of wire-speed on transmit. Utilizing hardware-based profiling of TTCP benchmark runs, our evaluation pieced together a comprehensive picture of transmit behavior on the iNIC. Our first surprise was the number of i960RN processor cycles consumed in transmitting large messages: around 17 thousand processor cycles per 1.5kbyte (Ethernet MTU) packet. Further investigation reveals that this high cost is due to a combination of i960RN architectural shortcomings, poor buffering strategy in the TCP/IP code running on the iNIC, and limitations imposed by the I2O-based host-iNIC interface. We also found room for improvements in the implementation of the socket buffer data structure. This report presents profiling statistics, as well as code-path analysis, that back up these conclusions. Our results call into question the hypothesis that a specialized networking software environment coupled with cheap embedded processors is a cost-effective way of improving system performance. At least in the case of the offload implementation on the i960RN-based iNIC, neither was the performance adequate nor the system cheap. This conclusion, however, does not imply that offload is a bad idea. In fact, measurements we made with Alacritech's SLIC NIC, which partially offloads TCP/IP protocol processing to an ASIC, suggest that offloading can confer advantages in a cost-effective way. Taking the right implementation approach is critical.
* Internal Accession Date Only
© Copyright Hewlett-Packard Company 2001
Approved for External Publication
1 Introduction
This report presents an evaluation of a TCP/IP implementation that performs network protocol stack processing on a 100BaseT intelligent network interface card (iNIC) equipped with an i960RN embedded processor. Offloading TCP/IP protocol processing from the host processors to a specialized environment was proposed as a means to reduce the workload on the host processors. The initial arguments proffered were that network protocol processing is consuming an increasingly larger portion of processor cycles and that a specialized software environment on an iNIC can perform the same task more efficiently using cheaper processors.
Since the inception of the project, alternate motivations for offloading protocol stack processing to an iNIC have been proposed. One was that the iNIC offers a point of network traffic control independent of the host, a useful capability in the Distributed Service Utility (DSU) architecture [9] for Internet data centers, where the host system is not necessarily trusted. More generally, the iNIC is viewed as a point where additional specialized functions, such as firewalls and web caching, can be added in a way that scales performance with the number of iNICs in a system.
The primary motivation of this evaluation is to understand the behavior of a specific TCP/IP offload design implemented in the Platform System Software department of HP Laboratories' Computer Systems and Technology Laboratory. Despite initial optimism, this implementation using Cyclone's PCI-981 iNIC, while able to reduce host processor cycles spent on networking, is unable to deliver the same networking performance as Windows NT's native protocol stack for 100BaseT Ethernet. Furthermore, transmit performance lags behind receive performance for reasons that were not well understood.¹
Another goal of this work is to arrive at a good understanding of the processing requirements, implementation issues, and hardware and software architectural needs of TCP/IP processing. This understanding will feed into future iNIC projects targeting very high bandwidth networking in highly distributed data center architectures. At a higher level, information from this project provides concrete data points for understanding the merits, if any, of offloading TCP/IP processing from the host processors to an iNIC.
1.1 Summary of Results

Utilizing hardware-assisted profiling of TTCP benchmark runs, our evaluation pieced together a comprehensive picture of transmit behavior on the iNIC. All our measurements assume that checksum computation, an expensive operation on generic microprocessors, is done by specialized hardware in Ethernet MAC/Phy devices, as is the case with commodity devices appearing in late 2000.²
¹ The team that implemented the TCP/IP offload recently informed us that they found some software problem that was causing the transmit and receive performance disparity and had worked around it. Unfortunately, we were unable to obtain further detail in time for inclusion in this report.

² The Cyclone PCI-981 iNIC uses Ethernet devices that do not have checksum support. To factor out the cost of checksum computation, our offload implementation simply does not compute checksums on either transmit or receive during our benchmark runs. To accommodate this, machines receiving packets from our offload implementation run specially doctored TCP/IP stacks that do not verify checksums. Error rates on today's networking hardware in a local switched network are low enough that running the TTCP benchmark is not a problem.
Our first surprise was the number of i960RN processor cycles consumed in transmitting large messages over TCP/IP: around 17 thousand processor cycles per 1.5kbyte (Ethernet MTU) packet. This cost increases very significantly for smaller messages because aggregation (Nagle algorithm) is done on the iNIC, thus incurring the overhead of a handshake between the host processor and the iNIC for every message. In an extreme case, with host software sending 1-byte messages that are aggregated into approximately 940-byte packets³, each packet consumes 4.5 million i960RN processor cycles. Even transmitting a pure acknowledgement packet costs over 10 thousand processor cycles.
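As a rough cross-check (our own back-of-the-envelope arithmetic, not a figure from the report's tables), the snippet below works out what a budget of roughly 17,000 pclks per MTU packet implies for a 100 MHz core; the 1460 payload bytes per packet is an assumption based on the MTU footnote.

/* Back-of-the-envelope check of the per-packet transmit cost quoted above.
 * Assumptions (ours, not the report's tables): 1460 data bytes per MTU
 * packet and the full 100 MHz core devoted to transmit processing. */
#include <stdio.h>

int main(void)
{
    const double cpu_hz        = 100e6;    /* i960RN core clock                 */
    const double pclks_per_pkt = 17e3;     /* measured cost per 1.5kbyte packet */
    const double payload_bytes = 1460.0;   /* TCP payload per MTU packet        */

    double pkts_per_sec = cpu_hz / pclks_per_pkt;               /* ~5,900 pkt/s */
    double goodput_mbps = pkts_per_sec * payload_bytes * 8.0 / 1e6;

    /* Prints roughly 69 Mbit/s of payload, well below what 100BaseT can carry
     * with MTU frames, which is consistent with transmit falling short of
     * wire-speed even before any other overhead is counted. */
    printf("%.0f pkt/s, %.1f Mbit/s goodput\n", pkts_per_sec, goodput_mbps);
    return 0;
}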
Further investigation reveals that this high cost is due to a combination of i960RN architectural shortcomings, poor buffering strategy in the TCP/IP code running on the iNIC, and limitations imposed by the I2O-based host-iNIC interface. We also found room for improvements in the implementation of the socket buffer data structure and some inefficiency due to the gcc960 compiler.
Our study of the host-side behavior is done at a coarser level. Using NT's Performance Monitor tool, we quantified the processor utilization and the number of interrupts during each TTCP benchmark run. We compare these metrics for our offload implementation with those of NT's native TCP/IP networking code and another partially offloaded iNIC implementation from Alacritech. To deal with the fact that different implementations achieve different networking bandwidth, the metrics are accumulated over the course of complete runs transferring the same amount of data.
The measurements show that, compared against the native NT implementation, our offload implementation achieves significantly lower processor utilization for large messages, but much higher processor utilization for small messages, with the crossover point at around 1000-byte messages. The interrupt statistics show a similar trend, though with the crossover at a smaller message size. Furthermore, the number of interrupts is reduced by a much more significant percentage than the reduction in host processor utilization, suggesting that costs other than interrupt processing contribute quite significantly to the remaining host-side cost in our offload implementation. Based on other researchers' results [1], we believe that the host processor copying data between user and system buffers is the major remaining cost.
The Alacritech NIC is an interesting comparison because it represents a very lean and low-cost approach. Whereas the Cyclone board is a full-length PCI card that is essentially a complete computer system decked with a processor, memory and supporting logic chips, the Alacritech NIC looks just like a normal NIC card, except that its MAC/Phy ASIC has additional logic to process the TCP/IP protocol for "fast path" cases. A limitation of this approach is that it does not allow any modification or additions to the offloaded functions once the hardware is designed.

³ The aggregation is not controlled by a fixed size threshold, and thus the actual packet size varies dynamically, subject to an MTU of 1460 data bytes.
Our measurements show that an Alacritech NIC is able to sustain network bandwidth comparable to that of native NT for large messages, which is close to wire-speed. Its accumulated host processor utilization, while lower than native NT's, is higher than that with our offload implementation. Its performance degrades when messages are smaller than 2k bytes because it has no means of aggregating outgoing messages (i.e., no Nagle algorithm).
This study calls into question the hypothesis that a specialized software environment together with a cheap embedded processor can effectively offload TCP/IP protocol stack processing. The i960RN-based implementation studied in this work is unable to adequately handle traffic even at the 100Mbit/s level, much less at the 1 Gigabit/s or 10 Gigabit/s levels that are the bandwidths of interest in the near future. While bad buffering strategy is partly responsible for the less than satisfactory performance of our offload implementation, its BSD-derived TCP/IP protocol stack is in fact better than NT's when executed on the same host platform. Clearly, the "better" stack and the advantage from the specialized interrupt-handling environment are insufficient to make up for the loss of a good processor and the additional overhead of interactions between the host and the iNIC. Ultimately, this approach is at best moving work from one place to another without conferring any advantage in efficiency, performance or price, and at worst, a performance limiter.
This conclusion does not imply that offload is a bad idea. In fact, the performance of the Alacritech NIC suggests that it can confer advantages. What is critical is taking the right implementation approach. We believe there is still unfinished research in this area, and that a fixed hardware implementation, such as that from Alacritech, is not the solution. From a functional perspective, having a flexible and extensible iNIC implementation not only enables tracking changes to protocol standards, but also allows additional functions to be added over time. The key is a cost-effective, programmable iNIC micro-architecture that pays attention to the interface between iNIC and host. There are a number of promising alternate micro-architecture components, ranging from specialized queue management hardware, to field programmable hardware, to multi-threaded and/or multi-core, possibly systolic-like, processors. This is the research topic of our next project.
1.2 Organization of this Report

The next section gives an overview of our i960RN-based TCP/IP offload implementation. We briefly cover both the hardware and the software aspects of this implementation to pave the background for the rest of this report. Section 3 examines the behavior on the iNIC. Detailed profiling statistics giving breakdowns for various steps of the processing, and utilization of the hardware resources, are presented. This is followed in Section 4 with an examination of the host-side statistics. Section 5 presents some related work covering both previous studies of TCP/IP implementations and other offload implementations. Finally, we conclude in Section 6 with what we learned from this study and areas for future work.
2 TCP/IP Offload Implementation Overview

Our TCP/IP offload implementation uses the Cyclone PCI-981 iNIC described in Section 2.1. The embedded processor on this iNIC runs a lean, HP custom-designed run-time/operating system called RTX, described in Section 2.2.1. The networking protocol stack code is derived from FreeBSD's Reno version of the TCP/IP protocol code. Interaction between the host and the iNIC occurs through a hardware-implemented I2O messaging infrastructure. Section 2.2.2 presents salient features of the networking code.
2.1 Cyclone PCI-981 iNIC hardware

[Figure: Block diagram of the Cyclone PCI-981 iNIC. The primary PCI bus (host interface) and the secondary PCI bus (device interface, carrying the 10/100BaseT Ethernet devices) are connected through the i960RN, whose components sit on an internal bus.]
The Cyclone PCI-981 iNIC, illustrated in the above figure, supports four 10/100BaseT Ethernet ports on a private 32/64-bit, 33MHz PCI bus, which will be referred to as the secondary PCI bus. Although the bus is capable of 64-bit operation, each Ethernet device only supports a 32-bit PCI interface, so this bus is effectively 32-bit, 33MHz. The iNIC presents a 32/64-bit, 33MHz PCI external interface which plugs into the host PCI bus, referred to as the primary PCI bus. In our experiments, the host is only equipped with a 32-bit PCI bus, thus constraining the iNIC's interface to operate at 32-bit, 33MHz. Located between the two PCI buses is an i960RN highly integrated embedded processor, marked by the dashed box in the above figure. It contains an i960 processor core running at 100MHz, a primary address translation unit (PATU) interfacing with the primary PCI bus, a secondary address translation unit (SATU) interfacing with the secondary PCI bus, and a memory controller (MC) used to control 66MHz SDRAM DIMMs. (Our evaluation uses an iNIC equipped with 16Mbytes of 66MHz SDRAM.) An internal 64-bit, 66MHz bus connects these four components together.
The PATU is equipped with two DMA engines in addition to bridging the internal bus and the primary PCI bus. It also implements I2O messaging and doorbell facilities in hardware. The SATU has one DMA engine and bridges the internal bus and the secondary PCI bus. The i960 processor core implements a simple single-issue processing pipeline with none of the fancy superscalar, branch prediction, and out-of-order capabilities of today's mainstream processors. Not shown in the above figure are a PCI-to-PCI bus bridge and an application accelerator in the i960RN chip. These are not used in our offload implementation. Further details of the i960RN chip can be found in the i960 RM/RN I/O Processor Developer's Manual [3].
2.2 Offload Software

Two pieces of software running on the i960 processor are relevant to this study. One is a run-time system, called RTX, that defines the operating or execution environment. This is described in the next section. The other piece of software is the networking protocol code itself, which is described in Section 2.2.2.
2.2.1 Execution Environment
RTX is designed to be a specialized networking environment that avoids some well-known system-imposed networking costs. More specifically, interrupts are not structured into the layered, multiple-invocation framework found in most general-purpose operating systems. Instead, interrupt handlers are allowed to run to completion. The motivation is to avoid the cost of repeatedly storing aside information and subsequently re-invoking processing code at a lower priority. RTX also only supports a single address space without a clear notion of system vs. user-level address spaces.
RTX is a simple, ad hoc run-time system. Although it provides basic threading and pre-emptive multi-threading support, the lack of enforceable restrictions on interrupt handling makes it impossible to guarantee any fair share of processor cycles to each thread. In fact, with the networking code studied in this report, interrupt handlers disable all forms of interrupts for the full duration of their execution, including timer interrupts.⁴ Coupled with the fact that interrupt handlers sometimes run for very long periods (e.g., we routinely observe the Ethernet device driver running for tens of thousands of processor clocks), this forces processor scheduling to be hand-coded into the Ethernet device interrupt handling code in the form of explicit software polls for events. Without these, starvation is a real problem.
Overall, the execution of networking code is either invoked by interrupt when no other interrupt handler is running, or by software polling for the presence of masked interrupts. Our measurements show that for TTCP runs, most invocations are through software polling: for every hardware-interrupt-dispatched invocation, we saw several tens of software-polling-dispatched invocations.
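As an illustration only (none of these identifiers come from the RTX source), the sketch below shows the dispatch style just described: a handler runs to completion with interrupts masked, and long-running driver code hand-codes its own scheduling by periodically polling for other masked events.

/* Illustrative sketch of run-to-completion dispatch with explicit software
 * polling.  All identifiers here are invented; this is not the RTX code. */

typedef void (*handler_fn)(void);

extern unsigned   pending_events(void);        /* assumed: read masked-event status      */
extern handler_fn lookup_handler(unsigned);    /* assumed: map an event bit to a handler */
extern int        more_rx_descriptors(void);   /* assumed device-driver helpers          */
extern void       service_one_descriptor(void);

static void poll_for_events(void)
{
    unsigned pending;
    while ((pending = pending_events()) != 0) {
        unsigned bit = pending & (0u - pending);   /* lowest pending event */
        lookup_handler(bit)();                     /* run it to completion */
    }
}

void ethernet_interrupt_handler(void)
{
    unsigned pkts = 0;
    while (more_rx_descriptors()) {
        service_one_descriptor();
        if ((++pkts & 0x3F) == 0)      /* every 64 packets...                 */
            poll_for_events();         /* ...a hand-coded point that services */
    }                                  /*    other starving handlers          */
}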
`
`
`
⁴ This raises questions about how accurately timers are implemented in the offload design. It is quite possible that software timer "ticks" occur less frequently than intended. We did not look closely into this because it is beyond the scope of this study.
The RTX environment clearly presents challenges for the addition of new code. Without a strict discipline for time-sharing the processor, every piece of code is tangled with every other piece of code when it comes to avoiding starvation and ensuring timely handling of events. Clearly, a better scheduling framework is needed, especially to support any kind of service quality provisions.
2.2.2 Networking code

The network protocol code is derived from the Reno version of the BSD networking code and uses the fxp device driver. The host interfaces to the iNIC at the socket level, using the I2O messaging facility as the underlying means of communication. On the iNIC side, glue code is added to splice into the protocol code at the socket level. The following two sections briefly trace the transmit and receive paths.
2.2.2.1 Transmit path

[Figure: Transmit paths for large and small messages. On the host side, a software copy moves user data into system buffers or I2O message frames; on the iNIC, large-message data is moved by IOP hardware DMA (the only copy), while small-message data is moved by an IOP software copy and may incur a further copy-to-compress, followed by another software copy down the stack.]
The above diagram shows the transmit paths for large and small messages. The paths are slightly different for different message sizes. The green portions (lightly shaded in non-color prints) occur on the host side, while the blue portions (darkly shaded in non-color prints) happen on the iNIC. Unless otherwise stated, the host in this study is an HP Kayak XU 6/3000 with a 300MHz Pentium-II processor and 64Mbytes of memory, running Windows NT 4.0 Service Pack 5. (The amount of memory, though small by today's standards, is adequate for TTCP runs, especially with the -s option that we used in our runs, which causes artificially generated data to be sourced at the transmit side, and incoming data to be discarded at the receive end.)
When transmitting large messages, the message data is first copied on the host side from user space into pre-pinned system buffers. Next, I2O messages are sent to the iNIC with references to the data residing in host-side main memory. On the iNIC side, servicing of the I2O messages includes setting up DMA requests. Hardware DMA engines in the PATU perform the actual data transfer from host main memory into iNIC SDRAM,
where it is placed in mbuf data structures. (We will have a more detailed discussion of mbufs in Section 3.4.)
When the DMA completes, iNIC code is once again invoked to queue the transferred data with the relevant socket data structure and push it down the protocol stack. For large messages, the enqueuing simply involves linking the already-allocated mbufs into a linked list. Without interrupting this thread of execution, an attempt is next made to send this message out. The main decision point is at the TCP level, where the tcp_output function will decide if any of the data should be sent at this time. This decision is based on factors such as the amount of data that is ready to go (the Nagle algorithm will wait if there is too little data) and whether there is any transmit window space left (which is determined by the amount of buffer space advertised by the receiver and the dynamic actions of TCP's congestion control protocol). If tcp_output decides not to send any data at this point, the thread suspends. Transmission of data will be re-invoked by other events, such as the arrival of more data from the host side, expiration of a time-out timer to stop waiting for more data, or the arrival of acknowledgements that open up transmit window space.
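The decision just described can be summarized by the following simplified sketch. It paraphrases the BSD-style tcp_output logic in broad strokes; the structure and field names below are invented for illustration and are not the code running on the iNIC.

/* Simplified sketch of the tcp_output send decision (Nagle plus window
 * check).  Invented for illustration; not the offload implementation. */

struct send_state {
    unsigned long ready_bytes;    /* unsent data queued in the socket buffer        */
    unsigned long window_avail;   /* usable send window (advertised and congestion) */
    unsigned long mss;            /* TCP segment size (matched to the MTU here)     */
    int           idle;           /* nothing currently outstanding on the wire      */
    int           nodelay;        /* Nagle disabled (TCP_NODELAY)                   */
};

static int should_send_now(const struct send_state *s)
{
    unsigned long len = s->ready_bytes < s->window_avail
                      ? s->ready_bytes : s->window_avail;

    if (len == 0)
        return 0;             /* window closed or nothing queued: suspend and wait */
    if (len >= s->mss)
        return 1;             /* a full segment is always worth sending            */
    if (s->idle || s->nodelay)
        return 1;             /* a small amount of data is sent only when idle     */
    return 0;                 /* otherwise wait for an ack, more data, or a timer  */
}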
Once tcp_output decides to send out data, a "copy" of the data is made. The new "copy" is passed down the protocol stack through the IP layer and then to the Ethernet device driver. This copy is deallocated once the packet is transmitted onto the wire. The original copy is kept as a source copy until an acknowledgement sent by the receiver is received. To copy large messages, the mbuf-based BSD buffering strategy merely copies an "anchor" data structure with a reference to the actual data.
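The sketch below illustrates why this large-message "copy" is cheap. It follows the general BSD mbuf/m_copy idea rather than the exact structures in the offload code; the field names and the 208-byte inline area are illustrative assumptions (the report only states that messages well under 208 bytes are handled differently).

#include <string.h>

/* Illustrative mbuf-style buffer: large data lives in a shared external
 * cluster, small data lives inline in the mbuf itself. */
struct cluster {
    char     *data;
    unsigned  refcount;
};

struct mbuf {
    struct mbuf    *next;            /* next mbuf in the chain                     */
    unsigned        len;             /* bytes described by this mbuf               */
    char           *dat;             /* points into inline_buf or into the cluster */
    struct cluster *ext;             /* NULL when the data is stored inline        */
    char            inline_buf[208]; /* inline storage for small data              */
};

/* "Copy" an mbuf for hand-off down the stack. */
void mbuf_pseudo_copy(const struct mbuf *src, struct mbuf *dst)
{
    *dst = *src;
    dst->next = 0;
    if (src->ext) {
        src->ext->refcount++;              /* large message: just share the cluster */
    } else {
        dst->dat = dst->inline_buf;        /* small message: a real byte copy       */
        memcpy(dst->dat, src->dat, src->len);
    }
}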
Another copy may occur at the IP layer if fragmentation occurs. With proper setting of the TCP segment size to match that of the underlying network's MTU, no fragmentation occurs. Execution may again be suspended at the IP layer if address lookup results in an external query using ARP. The IP layer caches the result of such a query, so that in most cases during a bulk transfer over a TCP connection, no suspension occurs here.
Transfer of data from iNIC SDRAM onto the wire is undertaken by DMA engines on the Ethernet devices. When transmission completes, the i960 processor is notified via interrupts.
The transmit path for small messages is very similar to that for large messages, with a few differences. One difference is that data is passed from the host processor to the iNIC directly in the I2O messages. (The specific instruction-level mechanism on IA32 platforms is PIO operations.) Thus, on the host side, data is copied from user memory directly into I2O message frames. At the iNIC side, data is now available during servicing of a transmit-request I2O message. The iNIC software directly copies the data into mbufs instead of using DMA because, for small messages, it is cheaper this way than using DMA. The mbuf data structure behaves differently for small messages (<< 208 bytes) than for large messages. When small messages are enqueued into a socket, an attempt is made to conserve buffer usage by "compressing" data into partly used mbufs. This may result in significant software copying. Further down the protocol stack, when a copy of
message data is made in the tcp_output function for passing further down the protocol stack, an actual copy of the data occurs for small messages.
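A minimal sketch of the "copy to compress" step is shown below, in the spirit of BSD's sbcompress; the helper and its 208-byte inline area are invented for illustration, not taken from the iNIC code.

#include <string.h>

struct small_mbuf {       /* minimal stand-in for an mbuf with inline storage */
    unsigned len;
    char     buf[208];
};

/* Fold a short message into the trailing, partly used mbuf of a socket
 * buffer.  Returns the number of bytes absorbed; the memcpy here is the
 * software copy that shows up in the small-message profiles. */
unsigned sb_compress_append(struct small_mbuf *tail, const char *msg, unsigned len)
{
    unsigned room = sizeof(tail->buf) - tail->len;
    if (len > room)
        len = room;
    memcpy(tail->buf + tail->len, msg, len);
    tail->len += len;
    return len;
}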
Clearly, many copies or pseudo-copies occur in the transmit path. In Section 3.4 we will re-examine mbufs and this issue of copying. Actual measurements of the costs involved will be presented.
2.2.2.2 Receive path

The following diagram illustrates the receive path. Broadly speaking, it is the reverse of the transmit path. (As before, the green or lightly shaded portions execute on the host side, while the blue or darkly shaded portions execute on the iNIC.) Again, we will first consider the path for large packets. The first action is taken by the Ethernet device. Its device driver pre-queues empty mbufs that are filled by the device as data packets arrive. The i960 processor is notified of the presence of new packets via interrupts, which may be invoked either by hardware interrupt dispatch or software poll.
[Figure: Receive paths for large and small messages, showing the Ethernet MAC hardware copy into mbufs on the iNIC, the IOP hardware DMA (large messages) or IOP software copy (small messages) into host memory, copy-to-compress on the iNIC, and the host software copy into user memory.]
The Ethernet device interrupt handler begins the walk up the network protocol stack. The logical processing of data packets may involve reassembly at the IP layer and dealing with out-of-order segments at the TCP layer. In practice, these are very infrequent, and the BSD TCP/IP code provides a "fast-path" optimization that reduces processing when packets arrive in order. An incoming packet's header is parsed to identify the protocol and, if TCP, the connection. This enables the packet's mbufs to be queued into the relevant socket's sockbuf data structure. Again, for large messages, this simply links mbufs into a singly linked list.
Next, the data is handed over to the host. As an optimization, our offload implementation attempts to aggregate more packets before interrupting the host to notify it of the arrived data. The i960 processor is responsible for using a hardware DMA engine in the PATU to move data from iNIC SDRAM into host memory before notifying the host about the newly arrived data through I2O messages. Data is transferred into host-side system
buffers that have been pinned and handed to the iNIC ahead of time. Eventually, when software does a receive on the host side, data is copied on the host side from the system buffer into user memory.
For small messages, similar differences as in the case of transmit apply. Thus, queuing into the sockbuf may involve copying to compress and conserve mbuf usage. Just as in the case of transmit, small-size data is passed between the iNIC and host in I2O messages. On the host side, this enables an optimization if the receiver has previously posted a receive: data is copied by the host processor from the I2O messages (physically residing on the iNIC) into user memory directly. If no receive has been posted yet, the data is first copied into a system buffer.
3 iNIC Side Behavior

This section presents the profiling statistics we collected on the iNIC side. Most measurements rely on a free-running cycle counter on the i960 processor to provide timing information. Timing code is added manually to the networking and RTX source code to measure durations of interest. In most cases, this is straightforward because the code operates as interrupt handlers in a non-preemptable mode. Timing a chunk of code simply involves noting the starting and ending times, and accumulating these when aggregated data over multiple invocations is being collected. The exception is the collection of statistics reported in Section 3.2, done using the i960RN processor's hardware resource usage performance registers. More will be said about this in that section.
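The instrumentation style amounts to something like the sketch below; read_cycle_counter() is a stand-in for the actual i960 counter access, and the probe structure and wrapped function are invented for illustration.

/* Illustrative manual timing probes built on a free-running cycle counter. */
extern unsigned long read_cycle_counter(void);   /* assumed pclk counter access */

struct probe {
    const char   *name;
    unsigned long start;
    unsigned long total;    /* accumulated pclks across invocations */
    unsigned long calls;
};

#define PROBE_BEGIN(p)  ((p)->start = read_cycle_counter())
#define PROBE_END(p)    ((p)->total += read_cycle_counter() - (p)->start, (p)->calls++)

/* Example use around one region of interest.  Because the instrumented code
 * runs to completion with interrupts masked, the counters need no locking. */
static struct probe ip_output_probe = { "ip_output", 0, 0, 0 };

void timed_ip_output(void)
{
    PROBE_BEGIN(&ip_output_probe);
    /* ... call the real ip_output() here ... */
    PROBE_END(&ip_output_probe);
}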
The commonly used TTCP benchmark is used for our studies. It provides a convenient micro-benchmark for getting a detailed look at various aspects of the networking code during bulk transfer.
In the next section, we begin with some general performance statistics that show the breakdown of processing time. Our study emphasizes transmit behavior, as that has the bigger problem in our implementation. Furthermore, for many server applications that are the targets of the offload technology, transmit performance is more important than receive performance.
The general statistics first led us to look for hardware bottlenecks because the cost numbers look surprisingly large. During close manual inspection of certain code fragments, we also came across cases where a relatively small number of instructions, say tens of instructions in an inner loop, take hundreds of processor cycles per iteration. Our investigation into hardware bottlenecks is reported in Section 3.2.
Next, we shift our attention to the components responsible for the largest share of processing cost, as indicated by the processor usage breakdown reported in Section 3.1. We examine tcp_output, which has some of the largest share of the processing cycles, to get a better understanding of what happens in that function. The results are reported in Section 3.3. This took us onto the track of the overall buffering strategy, a closer look at which is reported in Section 3.4. One other significant source of overhead is the host-
iNIC interface; we examine that in Section 3.5. Finally, a number of miscellaneous inefficiencies that we came across during our study are reported in Section 3.6.
3.1 General Performance Statistics

Table 1 is a summary of the iNIC processor usage breakdown, roughly categorized by network protocol layer. We added a layer for the host-iNIC interface, which includes the cost of moving data between the host and iNIC memories. We report the numbers in 100MHz processor clock cycles, labeled "pclks". We use the term message (msg) to refer to the logical unit used by host software sending or receiving data, and the term packet (pkt) to refer to the Ethernet frame that actually goes on the wire. For all our experiments, the TCP segment size and IP MTU are set to match the Ethernet frame MTU of 1.5kbyte, so that a packet is also the unit handled by the IP and TCP layers.
Table 1 contains the data for three transmit instances with representative message sizes: 8kbyte as a large message, 200 bytes as a small but common message size, and 1 byte to highlight the behavior of the protocol stack. We only report one instance of receive, where TTCP is receiving with 8kbyte buffers. In all cases, the machine on the other end of TTCP is a 500MHz Pentium-III FreeBSD box equipped with a normal NIC card. This end is able to handle the workload easily and is not a bottleneck.
The row labeled "Per msg cost" and the corresponding cost breakdown include both transmit and receive costs and are the averages derived by dividing the total processor usage by the number of messages. While this gives the total cost that is most readily correlated to host software transmit and receive actions, it bears little direct correspondence to actions on the iNIC, except in the cases of the host-iNIC interface and socket layers for transmit. For receive, even these layers' numbers have marginal correspondence to the per-message cost number because action in all network layers is driven by incoming packets.
Nevertheless, there are several interesting points about the per-message numbers. One is that for 8kbyte messages, transmit costs 66% more than receive. It was not immediately obvious why that should be the case. While transmit and receive processing is obviously different, the per-packet costs do not indicate as big a cost difference. Our investigation found that a different number of acknowledgement packets are involved in the two runs. When the iNIC transmits data, the FreeBSD box that is receiving sends a pure acknowledgement packet for every two data packets it receives. In contrast, when the iNIC is the data receiver, it only sends out a pure ack packet after every six data packets. This happens because incoming packets are aggregated before being sent to the host. Only when data is sent to the host is buffering space released and acknowledged to the sender. With transmission of each pure ack packet costing about ten thousand pclks, this is a significant "saving" that improves receive performance.
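The rough arithmetic below (our own estimate, using the ~10,000-pclk ack cost quoted above and assuming about six MTU packets per 8kbyte message) gives a feel for the size of this saving.

/* Rough estimate of the ack-transmit work avoided by acking every six data
 * packets instead of every two.  Assumptions: ~1460 payload bytes per packet
 * and ~10,000 pclks to transmit a pure ack, as stated in the text. */
#include <stdio.h>

int main(void)
{
    const double pkts_per_msg  = 8192.0 / 1460.0;   /* ~5.6 data packets per 8kbyte message */
    const double pclks_per_ack = 10e3;

    double acks_every_2 = pkts_per_msg / 2.0;   /* conventional ack-every-other-packet */
    double acks_every_6 = pkts_per_msg / 6.0;   /* behavior observed on the iNIC       */
    double saving       = (acks_every_2 - acks_every_6) * pclks_per_ack;

    /* Prints roughly 19,000 pclks per 8kbyte message: one contributor to the
     * transmit/receive per-message cost gap. */
    printf("ack-transmit saving: ~%.0f pclks per 8kbyte message\n", saving);
    return 0;
}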
The per-message cost breakdown highlights the layers that are most costly in terms of processor usage. For the 8kbyte message transmit (large message transmit), the TCP layer is responsible for the bulk of the iNIC processor cycles. (See Section 3.3 for
further accounting.) For small message transmit, however, the host-iNIC interface accounts for the bulk of the cost. This is due to Nagle algorithm aggregation, which causes the per-packet costs to be amortized over a number of messages, around a thousand in the case of 1-byte messages. In our offload implementation, the aggregation is done on the iNIC, so the host-iNIC interaction cost remains a per-message occurrence. The host-iNIC interface cost dominates receive as well, though with a less dominant percentage. (See Section 3.5 for further accounting.)
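A quick division (ours, using the figures quoted in Section 1.1) shows how completely the per-packet protocol cost is amortized for 1-byte messages, leaving the per-message host-iNIC interaction as the dominant term.

/* ~4.5 million pclks per packet when 1-byte messages are aggregated into
 * ~940-byte packets implies the per-message cost printed below. */
#include <stdio.h>

int main(void)
{
    const double pclks_per_pkt = 4.5e6;   /* measured, 1-byte message run   */
    const double msgs_per_pkt  = 940.0;   /* roughly one message per byte   */

    /* Prints roughly 4,800 pclks per 1-byte message, dominated by the
     * per-message host-iNIC (I2O) interaction rather than by TCP/IP work. */
    printf("~%.0f pclks per 1-byte message\n", pclks_per_pkt / msgs_per_pkt);
    return 0;
}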
The host-iNIC interface cost is important because it constrains the frequency of host-iNIC interaction, and the granularity of work that is worth offloading. The per-message host-iNIC layer cost breakdown for transmit corresponds roughly to one host-iNIC