An Evaluation of an Attempt at Offloading
TCP/IP Protocol Processing onto an
i960RN-based iNIC

Boon S. Ang
Computer Systems and Technology Laboratory
HP Laboratories Palo Alto
HPL-2001-8
January 9, 2001*

TCP/IP networking, Intelligent Network Interface

This report presents an evaluation of a TCP/IP offload implementation that utilizes a 100BaseT intelligent Network Interface Card (iNIC) equipped with a 100MHz i960RN processor. The entire FreeBSD-derived networking stack from the socket downward is implemented on the iNIC with the goal of reducing host processor workload. For large messages that result in MTU packets, the offload implementation can sustain wire speed on receive but only about 80% of wire speed on transmit. Utilizing hardware-based profiling of TTCP benchmark runs, our evaluation pieced together a comprehensive picture of transmit behavior on the iNIC. Our first surprise was the number of i960RN processor cycles consumed in transmitting large messages -- around 17 thousand processor cycles per 1.5kbyte (Ethernet MTU) packet. Further investigation reveals that this high cost is due to a combination of i960RN architectural shortcomings, poor buffering strategy in the TCP/IP code running on the iNIC, and limitations imposed by the I2O-based host-iNIC interface. We also found room for improvement in the implementation of the socket buffer data structure. This report presents profiling statistics, as well as code-path analysis, that back up these conclusions. Our results call into question the hypothesis that a specialized networking software environment coupled with cheap embedded processors is a cost-effective way of improving system performance. At least in the case of the offload implementation on the i960RN-based iNIC, neither was the performance adequate nor the system cheap. This conclusion, however, does not imply that offload is a bad idea. In fact, measurements we made with Alacritech's SLIC NIC, which partially offloads TCP/IP protocol processing to an ASIC, suggest that offloading can confer advantages in a cost-effective way. Taking the right implementation approach is critical.

* Internal Accession Date Only
© Copyright Hewlett-Packard Company 2001
Approved for External Publication

1 Introduction

This report presents an evaluation of a TCP/IP implementation that performs network protocol stack processing on a 100BaseT intelligent network interface card (iNIC) equipped with an i960RN embedded processor. Offloading TCP/IP protocol processing from the host processors to a specialized environment was proposed as a means to reduce the workload on the host processors. The initial arguments proffered were that network protocol processing is consuming an increasingly larger portion of processor cycles and that a specialized software environment on an iNIC can perform the same task more efficiently using cheaper processors.

Since the inception of the project, alternate motivations for offloading protocol stack processing to an iNIC have been proposed. One is that the iNIC offers a point of network traffic control independent of the host -- a useful capability in the Distributed Service Utility (DSU) architecture [9] for Internet data-centers, where the host system is not necessarily trusted. More generally, the iNIC is viewed as a point where additional specialized functions, such as firewalls and web caching, can be added in a way that scales performance with the number of iNICs in a system.

The primary motivation of this evaluation is to understand the behavior of a specific TCP/IP offload design implemented in the Platform System Software department of HP Laboratories' Computer Systems and Technology Laboratory. Despite initial optimism, this implementation using Cyclone's PCI-981 iNIC, while able to reduce host processor cycles spent on networking, is unable to deliver the same networking performance as Windows NT's native protocol stack for 100BaseT Ethernet. Furthermore, transmit performance lags behind receive performance for reasons that were not well understood.¹

Another goal of this work is to arrive at a good understanding of the processing requirements, implementation issues, and hardware and software architectural needs of TCP/IP processing. This understanding will feed into future iNIC projects targeting very high bandwidth networking in highly distributed data center architectures. At a higher level, information from this project provides concrete data points for understanding the merits, if any, of offloading TCP/IP processing from the host processors to an iNIC.

1.1 Summary of Results
Utilizing hardware-assisted profiling of TTCP benchmark runs, our evaluation pieced together a comprehensive picture of transmit behavior on the iNIC. All our measurements assume that checksum computation, an expensive operation on generic microprocessors, is done by specialized hardware in Ethernet MAC/Phy devices, as is the case with commodity devices appearing in late 2000.²

¹ The team that implemented the TCP/IP offload recently informed us that they found a software problem that was causing the transmit and receive performance disparity and had worked around it. Unfortunately, we were unable to obtain further detail in time for inclusion in this report.
² The Cyclone PCI-981 iNIC uses Ethernet devices that do not have checksum support. To factor out the cost of checksum computation, our offload implementation simply does not compute checksums on either transmit or receive during our benchmark runs. To accommodate this, machines receiving packets from our offload implementation run specially doctored TCP/IP stacks that do not verify checksums. Error rates on today's networking hardware in a local switched network are low enough that running the TTCP benchmark is not a problem.

Our first surprise was the number of i960RN processor cycles consumed in transmitting large messages over TCP/IP -- around 17 thousand processor cycles per 1.5kbyte (Ethernet MTU) packet. This cost increases very significantly for smaller messages because aggregation (Nagle Algorithm) is done on the iNIC, thus incurring the overhead of a handshake between host processor and iNIC for every message. In an extreme case, with host software sending 1-byte messages that are aggregated into approximately 940-byte packets,³ each packet consumes 4.5 million i960RN processor cycles. Even transmitting a pure acknowledgement packet costs over 10 thousand processor cycles.

³ The aggregation is not controlled by a fixed size threshold, and thus the actual packet size varies dynamically, subject to an MTU of 1460 data bytes.

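To put the 17 thousand pclks per packet in perspective, a back-of-envelope check (our arithmetic, not a measured figure) shows that this per-packet cost alone nearly saturates the 100MHz i960 core at wire speed:

    100 Mbit/s wire speed / (1500 bytes/pkt x 8 bits/byte) ~ 8,300 pkt/s
    8,300 pkt/s x 17,000 pclks/pkt ~ 141 million pclks/s

Against the 100 million pclks/s the core supplies, transmit cannot reach wire speed on processing cost alone, which accords with the sub-wire-speed transmit bandwidth observed.
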
Further investigation reveals that this high cost is due to a combination of i960RN architectural shortcomings, poor buffering strategy in the TCP/IP code running on the iNIC, and limitations imposed by the I2O-based host-iNIC interface. We also found room for improvement in the implementation of the socket buffer data structure and some inefficiency due to the gcc960 compiler.

Our study of the host-side behavior is done at a coarser level. Using NT's Performance Monitor tool, we quantified the processor utilization and the number of interrupts during each TTCP benchmark run. We compare these metrics for our offload implementation with those of NT's native TCP/IP networking code and another partially offloaded iNIC implementation from Alacritech. To deal with the fact that different implementations achieve different networking bandwidth, the metrics are accumulated over the course of complete runs transferring the same amount of data.

The measurements show that, compared against the native NT implementation, our offload implementation achieves significantly lower processor utilization for large messages but much higher processor utilization for small messages, with the crossover point at around 1000-byte messages. The interrupt statistics show a similar trend, though with the crossover at a smaller message size. Furthermore, the number of interrupts is reduced by a much more significant percentage than the reduction in host processor utilization, suggesting that costs other than interrupt processing contribute quite significantly to the remaining host-side cost in our offload implementation. Based on other researchers' results [1], we believe that the host processor copying data between user and system buffers is the major remaining cost.

The Alacritech NIC is an interesting comparison because it represents a very lean and low-cost approach. Whereas the Cyclone board is a full-length PCI card that is essentially a complete computer system decked out with a processor, memory and supporting logic chips, the Alacritech NIC looks just like a normal NIC card, except that its MAC/Phy ASIC has additional logic to process the TCP/IP protocol for "fast path" cases. A limitation of this approach is that it does not allow any modification or addition to the offloaded functions once the hardware is designed.

Our measurement shows that an Alacritech NIC is able to sustain network bandwidth comparable to that of native NT for large messages, which is close to wire speed. Its accumulated host processor utilization, while lower than native NT's, is higher than that with our offload implementation. Its performance degrades when messages are smaller than 2kbytes because it has no means of aggregating outgoing messages (i.e., no Nagle Algorithm).

This study calls into question the hypothesis that a specialized software environment together with a cheap embedded processor can effectively offload TCP/IP protocol stack processing. The i960RN-based implementation studied in this work is unable to adequately handle traffic even at the 100Mbit/s level, much less at the 1 Gigabit/s or 10 Gigabit/s levels that are the bandwidths of interest in the near future. While bad buffering strategy is partly responsible for the less-than-satisfactory performance of our offload implementation, its BSD-derived TCP/IP protocol stack is in fact better than NT's when executed on the same host platform. Clearly, the "better" stack and the advantage from the specialized interrupt-handling environment are insufficient to make up for the loss of a good processor and the additional overhead of interactions between the host and the iNIC. Ultimately, this approach is at best moving work from one place to another without conferring any advantage in efficiency, performance or price, and at worst, a performance limiter.

This conclusion does not imply that offload is a bad idea. In fact, the performance of the Alacritech NIC suggests that it can confer advantages. What is critical is taking the right implementation approach. We believe there is still unfinished research in this area, and that a fixed hardware implementation, such as that from Alacritech, is not the solution. From a functional perspective, having a flexible and extensible iNIC implementation not only enables tracking changes to protocol standards, but also allows additional functions to be added over time. The key is a cost-effective, programmable iNIC micro-architecture that pays attention to the interface between iNIC and host. There are a number of promising alternate micro-architecture components, ranging from specialized queue management hardware, to field-programmable hardware, to multi-threaded and/or multi-core, possibly systolic-like, processors. This is the research topic of our next project.

1.2 Organization of this Report
The next section gives an overview of our i960RN-based TCP/IP offload implementation. We briefly cover both the hardware and the software aspects of this implementation to pave the background for the rest of this report. Section 3 examines the behavior on the iNIC. Detailed profiling statistics, giving breakdowns for various steps of the processing and utilization of the hardware resources, are presented. This is followed in Section 4 with an examination of the host-side statistics. Section 5 presents some related work covering both previous studies of TCP/IP implementations and other offload implementations. Finally, we conclude in Section 6 with what we learned from this study and areas for future work.

2 TCP/IP Offload Implementation Overview

Our TCP/IP offload implementation uses the Cyclone PCI-981 iNIC described in Section 2.1. The embedded processor on this iNIC runs a lean, HP custom-designed run-time/operating system called RTX, described in Section 2.2.1. The networking protocol stack code is derived from FreeBSD's Reno version of the TCP/IP protocol code. Interaction between the host and the iNIC occurs through a hardware-implemented I2O messaging infrastructure. Section 2.2.2 presents salient features of the networking code.

2.1 Cyclone PCI-981 iNIC hardware

[Figure: block diagram of the Cyclone PCI-981 iNIC. The primary PCI bus (host interface) and the secondary PCI bus (device interface, with four 10/100BaseT Ethernet ports) are bridged by the i960RN, which also connects to local SDRAM over an internal bus.]

The Cyclone PCI-981 iNIC, illustrated in the above figure, supports four 10/100BaseT Ethernet ports on a private 32/64-bit, 33MHz PCI bus, which will be referred to as the secondary PCI bus. Although the bus is capable of 64-bit operation, each Ethernet device only supports a 32-bit PCI interface, so this bus is effectively 32-bit, 33MHz. The iNIC presents a 32/64-bit, 33MHz PCI external interface which plugs into the host PCI bus, referred to as the primary PCI bus. In our experiments, the host is only equipped with a 32-bit PCI bus, thus constraining the iNIC's interface to operate at 32-bit, 33MHz. Located between the two PCI buses is an i960RN highly integrated embedded processor, marked by the dashed box in the above figure. It contains an i960 processor core running at 100MHz, a primary address translation unit (PATU) interfacing with the primary PCI bus, a secondary address translation unit (SATU) interfacing with the secondary PCI bus, and a memory controller (MC) used to control 66MHz SDRAM DIMMs. (Our evaluation uses an iNIC equipped with 16Mbytes of 66MHz SDRAM.) An internal 64-bit, 66MHz bus connects these four components together.

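A quick tally of the raw bus bandwidths involved (our arithmetic) is useful context for the profiling results later in this report. Each transmitted byte crosses the primary PCI bus once (host memory to PATU), the internal bus twice (into and out of SDRAM), and the secondary PCI bus once (to the Ethernet device):

    primary / secondary PCI: 32 bits x 33MHz ~ 133 Mbyte/s each
    internal bus:            64 bits x 66MHz ~ 528 Mbyte/s
    100BaseT payload rate:                   ~ 12.5 Mbyte/s

At 100BaseT rates, none of these paths is close to saturation; the interesting bottlenecks are the processor and memory system, not raw bus capacity.
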
The PATU is equipped with two DMA engines in addition to bridging the internal bus and the primary PCI bus. It also implements I2O messaging and doorbell facilities in hardware. The SATU has one DMA engine and bridges the internal bus and the secondary PCI bus. The i960 processor core implements a simple single-issue processing pipeline with none of the fancy superscalar, branch prediction, and out-of-order capabilities of today's mainstream processors. Not shown in the above figure are a PCI-to-PCI bus bridge and an application accelerator in the i960RN chip. These are not used in our offload implementation. Further details of the i960RN chip can be found in the i960 RM/RN I/O Processor Developer's Manual [3].

2.2 Offload Software

Two pieces of software running on the i960 processor are relevant to this study. One is a run-time system, called RTX, that defines the operating or execution environment. This is described in the next section. The other piece of software is the networking protocol code itself, which is described in Section 2.2.2.

2.2.1 Execution Environment

RTX is designed to be a specialized networking environment that avoids some well-known system-imposed networking costs. More specifically, interrupts are not structured into the layered, multiple-invocation framework found in most general-purpose operating systems. Instead, interrupt handlers are allowed to run to completion. The motivation is to avoid the cost of repeatedly storing aside information and subsequently re-invoking processing code at a lower priority. RTX also only supports a single address space, without a clear notion of system vs. user-level address spaces.

RTX is a simple, ad hoc run-time system. Although it provides basic threading and preemptive multi-threading support, the lack of enforceable restrictions on interrupt handling makes it impossible to guarantee any fair share of processor cycles to each thread. In fact, with the networking code studied in this report, interrupt handlers disable all forms of interrupts for the full duration of their execution, including timer interrupts.⁴ Coupled with the fact that interrupt handlers sometimes run for very long periods (e.g., we routinely observe the Ethernet device driver running for tens of thousands of processor clocks), this forces processor scheduling to be hand-coded into the Ethernet device interrupt handling code in the form of explicit software polls for events. Without these, starvation is a real problem.

Overall, the execution of networking code is either invoked by interrupt when no other interrupt handler is running, or by software polling for the presence of masked interrupts. Our measurement shows that for TTCP runs, most invocations are through software polling -- for every hardware-interrupt-dispatched invocation, we saw several tens of software-polling-dispatched invocations.

⁴ This raises questions about how accurately timers are implemented in the offload design. It is quite possible that software timer "ticks" occur less frequently than intended. We did not look closely into this because it is beyond the scope of this study.

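As a concrete illustration of this dispatch structure, the sketch below shows the shape of a run-to-completion handler with explicit software polls. All names here (pending_irq_mask, ethernet_rx_work, i2o_msg_work, IRQ_I2O_DOORBELL) are hypothetical; the actual RTX source is not reproduced in this report.

    /* Hypothetical sketch of RTX-style run-to-completion dispatch.
     * Interrupts stay masked for the handler's entire execution, so the
     * handler itself must poll for other pending work to avoid starvation. */

    #include <stdint.h>

    #define IRQ_I2O_DOORBELL (1u << 3)      /* assumed bit assignment */

    extern uint32_t pending_irq_mask(void); /* assumed: masked-interrupt status */
    extern int      ethernet_rx_work(void); /* assumed: process a batch of packets */
    extern int      i2o_msg_work(void);     /* assumed: service one host I2O message */

    void ethernet_isr(void)
    {
        /* All interrupt sources remain disabled for the full duration. */
        for (;;) {
            int did_work = 0;

            did_work |= ethernet_rx_work();

            /* Explicit software poll: without this, a long-running Ethernet
             * handler would starve the host-interface path. */
            if (pending_irq_mask() & IRQ_I2O_DOORBELL)
                did_work |= i2o_msg_work();

            if (!did_work)
                break;          /* run to completion, then return */
        }
    }
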
The RTX environment clearly presents challenges for the addition of new code. Without a strict discipline for time-sharing the processor, every piece of code is entangled with every other piece of code when it comes to avoiding starvation and ensuring timely handling of events. Clearly, a better scheduling framework is needed, especially to support any kind of service quality provisions.

2.2.2 Networking code
The network protocol code is derived from the Reno version of the BSD networking code and uses the fxp device driver. The host interfaces to the iNIC at the socket level, using the I2O messaging facility as the underlying means of communication. On the iNIC side, glue code is added to splice into the protocol code at the socket level. The following two sections briefly trace the transmit and receive paths.

2.2.2.1 Transmit path

[Figure: transmit paths for large and small messages. For large messages: host-side software copy into system buffers, reference passing to the iNIC, and an IOP hardware DMA ("only copy") into iNIC mbufs. For small messages: host software copy into I2O message frames, IOP software copy into mbufs, a copy-to-compress on socket enqueue, and a further software copy down the stack.]

The above diagram shows the transmit paths for large and small messages. The paths are slightly different for different message sizes. The green portions (lightly shaded in non-color prints) occur on the host side, while the blue portions (darkly shaded in non-color prints) happen on the iNIC. Unless otherwise stated, the host in this study is an HP Kayak XU 6/300 with a 300MHz Pentium-II processor and 64Mbytes of memory, running Windows NT 4.0 Service Pack 5. (The amount of memory, though small by today's standards, is adequate for TTCP runs, especially with the -s option that we used in our runs, which causes artificially generated data to be sourced at the transmit side and incoming data to be discarded at the receive end.)

When transmitting large messages, the message data is first copied on the host side from user space into pre-pinned system buffers. Next, I2O messages are sent to the iNIC with references to the data residing in host-side main memory. On the iNIC side, servicing of the I2O messages includes setting up DMA requests. Hardware DMA engines in the PATU perform the actual data transfer from host main memory into iNIC SDRAM, where it is placed in mbuf data structures. (We will have a more detailed discussion of mbufs in Section 3.4.)

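To make the reference-passing scheme concrete, the sketch below shows one plausible shape for a large-message transmit request. The real I2O message format, and this implementation's actual descriptors, differ; all field names and the MAX_SGL limit are ours.

    /* Hypothetical layout of a large-message transmit request, in the
     * spirit of I2O messaging: the host passes physical references to
     * pre-pinned system buffers rather than the data itself. */

    #include <stdint.h>

    #define MAX_SGL 8               /* assumed per-message limit */

    struct sg_entry {
        uint32_t host_phys_addr;    /* physical address in host memory */
        uint32_t len;               /* bytes at that address */
    };

    struct i2o_tx_request {
        uint32_t socket_handle;     /* identifies the offloaded connection */
        uint32_t total_len;
        uint32_t n_entries;
        struct sg_entry sgl[MAX_SGL];
    };

    /* On the iNIC, servicing one of these means queueing one PATU DMA
     * transfer per scatter-gather entry into freshly allocated mbufs. */
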
When DMA completes, iNIC code is once again invoked to queue the transferred data with the relevant socket data structure and push it down the protocol stack. For large messages, the enqueuing simply involves linking the already-allocated mbufs into a linked list. Without interrupting this thread of execution, an attempt is next made to send this message out. The main decision point is at the TCP level, where the tcp_output function decides whether any of the data should be sent at this time. This decision is based on factors such as the amount of data that is ready to go (the Nagle Algorithm will wait if there is too little data) and whether there is any transmit window space left (which is determined by the amount of buffer space advertised by the receiver and the dynamic actions of TCP's congestion control protocol). If tcp_output decides not to send any data at this point, the thread suspends. Transmission of data will be re-invoked by other events, such as the arrival of more data from the host side, expiration of a time-out timer to stop waiting for more data, or the arrival of acknowledgements that open up the transmit window.

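The essence of that decision, paraphrasing the BSD tcp_output logic, is sketched below. The control block is trimmed to the fields used, and this is our condensation of the logic, not the FreeBSD source.

    /* Sketch of the tcp_output send decision: Nagle plus window checks. */

    struct tcpcb_lite {
        unsigned long snd_una;    /* oldest unacknowledged sequence number */
        unsigned long snd_nxt;    /* next sequence number to send */
        unsigned long snd_wnd;    /* peer-advertised window */
        unsigned long snd_cwnd;   /* congestion window */
        unsigned int  t_maxseg;   /* TCP segment size (matched to the MTU) */
    };

    static long lmin(long a, long b) { return a < b ? a : b; }

    /* Returns nonzero if a segment should go out now. */
    int tcp_should_send(const struct tcpcb_lite *tp, long sb_cc /* bytes queued */)
    {
        long off = (long)(tp->snd_nxt - tp->snd_una);           /* data in flight */
        long win = lmin((long)tp->snd_wnd, (long)tp->snd_cwnd);
        long len = lmin(sb_cc, win) - off;                      /* sendable new data */

        if (len <= 0)
            return 0;             /* window closed, or nothing new queued */
        if (len >= (long)tp->t_maxseg)
            return 1;             /* a full-size segment: always send */
        if (off == 0 && len == sb_cc)
            return 1;             /* nothing in flight and sending all we have */
        return 0;                 /* Nagle: wait for an ack, more data, or a timer */
    }
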
Once tcp_output decides to send out data, a "copy" of the data is made. The new "copy" is passed down the protocol stack through the IP layer and then to the Ethernet device driver. This copy is deallocated once the packet is transmitted onto the wire. The original copy is kept as a source copy until an acknowledgement sent by the receiver is received. To copy large messages, the mbuf-based BSD buffering strategy merely copies an "anchor" data structure with a reference to the actual data.

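In simplified form, the pseudo-copy duplicates only the small mbuf header and takes a reference on the shared storage holding the payload. The sketch below is our distillation of the idea behind BSD's m_copym for cluster-backed mbufs, not the actual implementation.

    #include <stdlib.h>

    /* Reference-counted external storage shared among mbuf "copies". */
    struct cluster {
        int  refcnt;
        char data[2048];
    };

    struct mbuf {
        struct mbuf    *m_next;
        struct cluster *m_ext;    /* external (cluster) storage, if any */
        char           *m_data;   /* start of valid data */
        int             m_len;
    };

    /* "Copy" a cluster-backed mbuf: duplicate the anchor, share the payload. */
    struct mbuf *m_pseudo_copy(const struct mbuf *m)
    {
        struct mbuf *n = malloc(sizeof *n);
        if (n == NULL)
            return NULL;
        *n = *m;                  /* copies pointers and lengths, not payload */
        n->m_next = NULL;
        if (n->m_ext != NULL)
            n->m_ext->refcnt++;   /* cluster freed only when count reaches 0 */
        return n;
    }
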
Another copy may occur at the IP layer if fragmentation occurs. With proper setting of the TCP segment size to match the underlying network's MTU, no fragmentation occurs. Execution may again be suspended at the IP layer if address lookup results in an external query using ARP. The IP layer caches the result of such a query, so in most cases during a bulk transfer over a TCP connection, no suspension occurs here.

Transfer of data from iNIC SDRAM onto the wire is undertaken by DMA engines on the Ethernet devices. When transmission completes, the i960 processor is notified via interrupts.

The transmit path for small messages is very similar to that for large messages, with a few differences. One difference is that data is passed from the host processor to the iNIC directly in the I2O messages. (The specific instruction-level mechanism on IA32 platforms is PIO operations.) Thus, on the host side, data is copied from user memory directly into I2O message frames. At the iNIC side, data is then available during servicing of a transmit-request I2O message. The iNIC software directly copies the data into mbufs instead of using DMA because, for small messages, this is cheaper than DMA. The mbuf data structure behaves differently for small messages (less than 208 bytes) than for large messages. When small messages are enqueued into a socket, an attempt is made to conserve buffer usage by "compressing" data into partly used mbufs. This may result in significant software copying. Further down the protocol stack, when tcp_output makes a copy of message data to pass to the IP layer, an actual copy of the data occurs for small messages.

Clearly, many copies or pseudo-copies occur in the transmit path. In Section 3.4, we will re-examine mbuf and this issue of copying; actual measurements of the cost involved will be presented.

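The "compress on enqueue" behavior for small messages can be sketched as follows; this is modeled on the idea behind BSD's sbcompress, simplified, with the 208-byte figure taken from this report rather than the stock BSD constants.

    #include <string.h>

    #define SMALL_MLEN 208    /* interior data bytes per small mbuf (per this report) */

    struct smbuf {
        struct smbuf *m_next;
        int           m_len;
        char          m_dat[SMALL_MLEN];
    };

    /* Append up to 'len' bytes into the tail mbuf of a socket buffer,
     * returning how many bytes were absorbed.  This memcpy is the
     * "software copy to compress" cost discussed above; the caller
     * allocates a fresh mbuf for any remainder. */
    int sb_compress_tail(struct smbuf *tail, const char *src, int len)
    {
        int space = SMALL_MLEN - tail->m_len;
        int n = len < space ? len : space;

        memcpy(tail->m_dat + tail->m_len, src, n);
        tail->m_len += n;
        return n;
    }
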
2.2.2.2 Receive path
The following diagram illustrates the receive path. Broadly speaking, it is the reverse of the transmit path. (As before, the green or lightly shaded portions execute on the host side, while the blue or darkly shaded portions execute on the iNIC.) Again, we will first consider the path for large packets. The first action is taken by the Ethernet device. Its device driver pre-queues empty mbufs that are filled by the device as data packets arrive. The i960 processor is notified of the presence of new packets via interrupts, which may be invoked either by hardware interrupt dispatch or software poll.

[Figure: receive paths for large and small messages, showing the Ethernet MAC hardware copy into mbufs, an IOP software copy/compress for small messages, the IOP hardware DMA into host memory, and the host software copy into user memory.]

The Ethernet device interrupt handler begins the walk up the network protocol stack. The logical processing of data packets may involve reassembly at the IP layer and dealing with out-of-order segments at the TCP layer. In practice, these are very infrequent, and the BSD TCP/IP code provides a "fast-path" optimization that reduces processing when packets arrive in order. An incoming packet's header is parsed to identify the protocol and, if TCP, the connection. This enables the packet's mbufs to be queued into the relevant socket's sockbuf data structure. Again, for large messages, this simply links mbufs into a singly linked list.

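The in-order test that gates this fast path is cheap. Below is our condensed paraphrase of the BSD-style header-prediction check, with abbreviated field names; the real test examines a few more header bits.

    #include <stdint.h>

    struct tcb_lite {
        uint32_t rcv_nxt;        /* next sequence number expected */
        uint32_t snd_wnd;        /* last window advertised by the peer */
        int      established;    /* connection in ESTABLISHED state */
    };

    /* Take the fast path only for a plain, exactly in-order data segment. */
    int tcp_fast_path(const struct tcb_lite *tp,
                      uint32_t seg_seq,        /* sequence number of segment */
                      uint16_t seg_wnd,        /* window field in segment */
                      int flags_plain,         /* only ACK set: no SYN/FIN/RST/URG */
                      int reass_queue_empty)   /* nothing awaiting reassembly */
    {
        return tp->established
            && flags_plain
            && seg_seq == tp->rcv_nxt          /* exactly the next byte expected */
            && seg_wnd == tp->snd_wnd          /* no window update to process */
            && reass_queue_empty;
    }
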
Next, the data is handed over to the host. As an optimization, our offload implementation attempts to aggregate more packets before interrupting the host to notify it of the arrived data. The i960 processor is responsible for using a hardware DMA engine in the PATU to move data from iNIC SDRAM into host memory before notifying the host about the newly arrived data through I2O messages. Data is transferred into host-side system buffers that have been pinned and handed to the iNIC ahead of time. Eventually, when software does a receive on the host side, data is copied on the host side from system buffer into user memory.

For small messages, differences similar to those in the transmit case apply. Thus, queuing into sockbuf may involve copying to compress and conserve mbuf usage. Just as in the case of transmit, small-size data is passed between the iNIC and host in I2O messages. On the host side, this enables an optimization if the receiver has previously posted a receive -- data is copied by the host processor from the I2O messages (physically residing on the iNIC) into user memory directly. If no receive has been posted yet, the data is first copied into a system buffer.

3 iNIC Side Behavior

This section presents the profiling statistics we collected on the iNIC side. Most measurements rely on a free-running cycle counter on the i960 processor to provide timing information. Timing code is added manually to the networking and RTX source code to measure durations of interest. In most cases, this is straightforward because the code operates as interrupt handlers in a non-preemptable mode. Timing a chunk of code simply involves noting the starting and ending times, and accumulating these when aggregated data over multiple invocations is being collected. The exception is the collection of statistics reported in Section 3.2, done using the i960RN processor's hardware resource usage performance registers. More will be said about this in that section.

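The instrumentation pattern is simple. The sketch below shows the shape of it; read_pclk is our stand-in for reading the i960's free-running counter, and the bucket names are ours.

    #include <stdint.h>

    extern uint32_t read_pclk(void);   /* assumed: 100MHz free-running counter */

    struct stat_bucket {
        uint64_t total_pclks;          /* accumulated cycles in this category */
        uint32_t invocations;          /* number of timed executions */
    };

    /* Time a region of handler code and accumulate into a bucket.  The
     * unsigned subtraction tolerates counter wraparound.  Handlers run
     * to completion with interrupts masked, so no locking is needed
     * around the accumulation. */
    #define TIME_REGION(bucket, body)                                   \
        do {                                                            \
            uint32_t t0_ = read_pclk();                                 \
            body;                                                       \
            (bucket)->total_pclks += (uint32_t)(read_pclk() - t0_);     \
            (bucket)->invocations++;                                    \
        } while (0)
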
The commonly used TTCP benchmark is used for our studies. This provides a convenient micro-benchmark for getting a detailed look at various aspects of the networking code during bulk transfer.

In the next section, we begin with some general performance statistics that show the breakdown of processing time. Our study emphasizes transmit behavior, as that is the bigger problem in our implementation. Furthermore, for many server applications that are the targets of the offload technology, transmit performance is more important than receive performance.

The general statistics first led us to look for hardware bottlenecks, because the cost numbers looked surprisingly large. During close manual inspection of certain code fragments, we also came across cases where a relatively small number of instructions, say tens of instructions in an inner loop, take hundreds of processor cycles per iteration. Our investigation into hardware bottlenecks is reported in Section 3.2.

Next, we shift our attention to the components responsible for the largest share of processing cost, as indicated by the processor usage breakdown reported in Section 3.1. We examine tcp_output, which has some of the largest share of the processing cycles, to get a better understanding of what happens in that function. The results are reported in Section 3.3. This took us onto the track of the overall buffering strategy, a closer look at which is reported in Section 3.4. One other significant source of overhead is the host-iNIC interface. We examine that in Section 3.5. Finally, a number of miscellaneous inefficiencies that we came across during our study are reported in Section 3.6.

3.1 General Performance Statistics

Table 1 is a summary of the iNIC processor usage breakdown, roughly categorized by network protocol layer. We added a layer for the host-iNIC interface, which includes the cost of moving data between the host and iNIC memories. We report the numbers in 100MHz processor clock cycles, labeled "pclks". We use the term message (msg) to refer to the logical unit used by host software sending or receiving data, and the term packet (pkt) to refer to the Ethernet frame that actually goes on the wire. For all our experiments, the TCP segment size and IP MTU are set to match the Ethernet frame MTU of 1.5kbyte, so that a packet is also the unit handled by the IP and TCP layers.

Table 1 contains the data for three transmit instances with representative message sizes -- 8kbyte as a large message, 200 bytes as a small but common message size, and 1 byte to highlight the behavior of the protocol stack. We only report one instance of receive, where TTCP is receiving with 8kbyte buffers. In all cases, the machine on the other end of TTCP is a 500MHz Pentium-III FreeBSD box equipped with a normal NIC card. This end is able to handle the workload easily and is not a bottleneck.

The row labeled "Per msg cost" and the corresponding cost breakdown include both transmit and receive costs, and give the average derived by dividing the total processor usage by the number of messages. While this gives the total cost most readily correlated to host software transmit and receive actions, it bears little direct correspondence to actions on the iNIC, except in the cases of the host-iNIC interface and socket layers for transmit. For receive, even these layers' numbers have marginal correspondence to the per-message cost number, because action in all network layers is driven by incoming packets.

Nevertheless, there are several interesting points about the per-message numbers. One is that for 8kbyte messages, transmit costs 66% more than receive. It was not immediately obvious why that should be the case. While transmit and receive processing is obviously different, the per-packet costs do not indicate as big a cost difference. Our investigation found that a different number of acknowledgement packets are involved in the two runs. When the iNIC transmits data, the FreeBSD box that is receiving sends a pure acknowledgement packet for every two data packets it receives. In contrast, when the iNIC is the data receiver, it only sends out a pure ack packet after every six data packets. This happens because incoming packets are aggregated before being sent to the host. Only when data is sent to the host is buffering space released and acknowledged to the sender. With transmission of each pure ack packet costing about ten thousand pclks, this is a significant "saving" that improves receive performance.

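A rough calculation (ours) indicates the size of this saving. At 100BaseT wire speed, data packets arrive at roughly 8,300 per second:

    ack every 2 data pkts: ~4,150 acks/s x 10,000 pclks ~ 41 million pclks/s
    ack every 6 data pkts: ~1,400 acks/s x 10,000 pclks ~ 14 million pclks/s

The difference, on the order of a quarter of the core's 100 million pclks/s, is indeed a significant saving.
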
The per-message cost breakdown highlights the layers that are most costly in terms of processor usage. For the 8kbyte message transmit (large message transmit), the TCP layer is responsible for the bulk of the iNIC processor cycles. (See Section 3.3 for further accounting.)

For small message transmit, however, the host-iNIC interface accounts for the bulk of the cost. This is due to Nagle Algorithm aggregation, which causes the per-packet costs to be amortized over a number of messages, around a thousand in the case of 1-byte messages. In our offload implementation, the aggregation is done on the iNIC, so the host-iNIC interaction cost remains a per-message occurrence. The host-iNIC interface cost dominates receive as well, though with a less dominant percentage. (See Section 3.5 for further accounting.)

The host-iNIC interface cost is important because it constrains the frequency of host-iNIC interaction, and the granularity of work that is worth offloading. The per-message host-iNIC layer cost breakdown for transmit corresponds roughly to one host-iN