Server Network Scalability and TCP Offload

Doug Freimuth, Elbert Hu, Jason LaVoie, Ronald Mraz,
Erich Nahum, Prashant Pradhan, John Tracey
IBM T. J. Watson Research Center
Hawthorne, NY, 10532
{dmfreim, elbert, lavoie, mraz, nahum, pradhan, tracey}@us.ibm.com

Abstract

Server network performance is increasingly dominated by poorly scaling operations such as I/O bus crossings, cache misses, and interrupts. Their overhead prevents performance from scaling even with increased CPU, link, or I/O bus bandwidths. These operations can be reduced by redesigning the host/adapter interface to exploit additional processing on the adapter. Offloading processing to the adapter is beneficial not only because it allows more cycles to be applied but also because of the changes it enables in the host/adapter interface. As opposed to other approaches such as RDMA, TCP offload provides benefits without requiring changes to either the transport protocol or API.

We have designed a new host/adapter interface that exploits offloaded processing to reduce poorly scaling operations. We have implemented a prototype of the design including both host and adapter software components. Experimental evaluation with simple network benchmarks indicates our design significantly reduces I/O bus crossings and holds promise to reduce other poorly scaling operations as well.

1 Introduction

Server network throughput is not scaling with CPU speeds. Various studies have reported CPU scaling factors of 43% [23], 60% [15], and 33% to 68% [22], which fall short of an ideal scaling of 100%. In this paper, we show that even increasing CPU speeds and link and bus bandwidths does not generate a commensurate increase in server network throughput. This lack of scalability points to an increasing tendency for server network throughput to become the key bottleneck limiting system performance. It motivates the need for an alternative design with better scalability.

Server network scalability is limited by operations heavily used in current designs that themselves do not scale well, most notably bus crossings, cache misses, and interrupts. Any significant improvement in scalability must reduce these operations. Given that the problem is one of scalability and not simply performance, it will not be solved by faster processors. Faster processors merely expend more cycles on poorly scaling operations.

Research in server network performance over the years has yielded significant improvements including: integrated checksum and copy, checksum offload, copy avoidance, interrupt coalescing, fast-path protocol processing, efficient state lookup, efficient timer management, and segmentation offload, a.k.a. large send. Another technique, full TCP offload, has been pursued for many years. Work on offload has generated both promising and less than compelling results [1, 38, 40, 42]. Good performance data and analysis on offload is scarce.

Many improvements in server scalability were described more than fifteen years ago by Clark et al. [9]. The authors demonstrated that the overhead incurred by network protocol processing, per se, is small compared to both per-byte (memory access) costs and operating system overhead, such as buffer and timer management. This motivated work to reduce or eliminate data-touching operations, such as copies, and to improve the efficiency of operating system services heavily used by the network stack. Later work [19] showed that the overhead of non-data-touching operations is, in fact, significant for real workloads, which tend to feature a preponderance of small messages. Today, per-byte overhead has been greatly reduced through checksum offload and zero-copy send. This leaves per-packet overhead, operating system services, and zero-copy receive as the main remaining areas for further improvement.

Nearly all of the enhancements described by Clark et al. have seen widespread adoption. The one notable exception is "an efficient network interface." This is a network adapter with a fast general-purpose processor that provides a much more efficient interface to the network than the current frame-based interface devised decades ago. In this paper, we describe an effort to develop a much more efficient network interface and to make this enhancement a reality as well.

Our work is pursued in the context of TCP for three reasons: 1) TCP's enormous installed base, 2) the methodology employed with TCP will transfer to other protocols, and 3) the expectation that key new architectural features, such as zero-copy receive, will ultimately demonstrate their viability with TCP.

The work described here is part of a larger effort to improve server network scalability. We began by analyzing server network performance and recognizing, as others have, a significant scalability problem. Next, we identified specific operations as the cause, specifically: bus crossings, cache misses, and interrupts. We formulated a design that reduces the impact of these operations. This design exploits additional processing at the network adapter, i.e., offload, to improve the efficiency of the host/adapter interface, which is our primary focus. We have implemented a prototype of the new design, which consists of host and adapter software components, and have analyzed the impact of the new design on bus crossings. Our findings indicate that offload can substantially decrease bus crossings and holds promise to reduce other scalability-limiting operations such as cache misses. Ultimately, we intend to evaluate the design in a cycle-accurate hardware simulator. This will allow us to comprehensively quantify the impact of design alternatives on cache misses, interrupts, and overall performance over several generations of hardware.

This paper is organized as follows. Section 2 provides motivation and background. Section 3 presents our design, and the current prototype implementation is described in Section 4. Section 5 presents our experimental infrastructure and results. Section 6 surveys and contrasts related work, and Section 7 summarizes our contributions and plans for future work.

2 Motivation and Background

To provide the proper motivation and background for our work, we first describe the current best practices of techniques and optimizations for network server performance. Using industry-standard benchmarks, we then show that, despite these practices, servers are still not scaling with CPU speeds. Since TCP offload has been a controversial topic in the research community, we review the critiques of offload, providing counterarguments to each point. How TCP offload addresses these scaling issues is described in more detail in Section 3.

2.1 Current Best Practices

Current high-performance servers have adopted many techniques to maximize performance. We provide a brief overview of them here.

Sendfile with zero copy. Most operating systems have a sendfile or transmitfile operation that allows sending a file over a socket without copying the contents of the file into user space. This can have substantial performance benefits [30]. However, the benefits are limited to send-side processing; it does not affect receive-side processing. In addition, it requires the server application to maintain its data in the kernel, which may not be feasible for systems such as application servers, which generate content dynamically.

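As a concrete illustration, the minimal C sketch below (assuming a Linux host and an already connected TCP socket; send_static_file() is an illustrative helper, and error handling is abbreviated) serves a file with sendfile(2), so the payload never passes through a user-space buffer.

    /* Minimal sketch of a zero-copy static-file response using sendfile(2)
     * on Linux; HTTP framing and robust error handling are elided. */
    #include <sys/sendfile.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Send the file at 'path' over the connected socket 'conn_fd'.
     * The kernel moves file pages directly to the socket; the payload
     * is never copied into a user-space buffer. */
    static int send_static_file(int conn_fd, const char *path)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;

        struct stat st;
        if (fstat(fd, &st) < 0) {
            close(fd);
            return -1;
        }

        off_t offset = 0;
        while (offset < st.st_size) {
            ssize_t sent = sendfile(conn_fd, fd, &offset, st.st_size - offset);
            if (sent <= 0)
                break;          /* real code would handle EINTR/EAGAIN */
        }

        close(fd);
        return (offset == st.st_size) ? 0 : -1;
    }
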
Checksum offload. Researchers have shown that calculating the IP checksum over the body of the data can be expensive [19]. Most high-performance adapters have the ability to perform the IP checksum over both the contents of the data and the TCP/IP headers. This removes an expensive data-touching operation on both send and receive. However, adapter-level checksums will not catch errors introduced by transferring data over the I/O bus, which has led some to advocate caution with checksum offload [41].

Interrupt coalescing. Researchers have shown that interrupts are costly, and generating an interrupt for each packet arrival can severely throttle a system [28]. In response, adapter vendors have enabled the ability to delay interrupts by a certain amount of time or number of packets in an effort to batch packets per interrupt and amortize the costs [14]. While effective, it can be difficult to determine the proper trigger thresholds for firing interrupts, and large amounts of batching may cause unacceptable latency for an individual connection.

Large send/segmentation offload. TCP/IP implementers have long known that larger MTU sizes provide greater efficiency, both in terms of network utilization (fewer headers per byte transferred) and in terms of host CPU utilization (fewer per-packet operations incurred per byte sent or received). Unfortunately, larger MTU sizes are not usually available due to Ethernet's 1516-byte frame size. Gigabit Ethernet provides "jumbo frames" of 9 KB, but these are only useful in specialized local environments and cannot be preserved across the wide-area Internet. As an approximation, certain operating systems, such as AIX and Linux, provide large send or TCP segmentation offload (TSO), where the TCP/IP stack interacts with the network device as if it had a large MTU size. The device in turn segments the larger buffers into 1516-byte Ethernet frames and adjusts the TCP sequence numbers and checksums accordingly. However, this technique is also limited to send-side processing. In addition, as we demonstrate in Section 2.2, the technique is limited by the way TCP performs congestion control.

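To make the mechanism concrete, the following sketch (not driver code) shows the core bookkeeping a TSO-capable device performs: a large buffer handed down by the stack is cut into MSS-sized pieces, each sent with an advancing sequence number. emit_frame() is a hypothetical stand-in for constructing and transmitting one Ethernet frame; header templating and per-frame checksum updates are omitted.

    /* Illustrative sketch of TSO segmentation bookkeeping. */
    #include <stdint.h>
    #include <stddef.h>

    void tso_segment(const uint8_t *payload, size_t len,
                     uint32_t start_seq, uint16_t mss,
                     void (*emit_frame)(const uint8_t *seg, size_t seg_len,
                                        uint32_t seq))
    {
        size_t off = 0;
        while (off < len) {
            size_t seg_len = (len - off < mss) ? (len - off) : mss;
            /* Each segment reuses the template TCP/IP header with the
             * sequence number advanced by the bytes already emitted;
             * the checksum is recomputed per frame (not shown). */
            emit_frame(payload + off, seg_len, start_seq + (uint32_t)off);
            off += seg_len;
        }
    }
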
Efficient connection management. Early networked servers did not handle large numbers of TCP connections efficiently, for example by using a linear linked list to manage state [26]. This led to operating systems using hash-table-based approaches [24] and separating table entries in the TIME_WAIT state [2].

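As an illustration of the kind of lookup structure involved, the sketch below hashes the connection four-tuple into a bucket of chained connections; the names, fields, and hash function are illustrative and are not taken from any particular kernel.

    /* Minimal sketch of hash-based TCP connection lookup. */
    #include <stdint.h>
    #include <stddef.h>

    struct tcp_conn {
        uint32_t local_ip, remote_ip;
        uint16_t local_port, remote_port;
        struct tcp_conn *next;        /* chain within one hash bucket */
        /* ... protocol state ... */
    };

    #define TCP_HASH_BUCKETS 65536
    static struct tcp_conn *tcp_hash[TCP_HASH_BUCKETS];

    static unsigned tcp_hash_fn(uint32_t lip, uint32_t rip,
                                uint16_t lport, uint16_t rport)
    {
        uint32_t h = lip ^ rip ^ ((uint32_t)lport << 16) ^ rport;
        h ^= h >> 16;
        return h & (TCP_HASH_BUCKETS - 1);
    }

    /* Expected O(1) lookup on the four-tuple instead of walking a
     * global linked list of all connections. */
    struct tcp_conn *tcp_lookup(uint32_t lip, uint32_t rip,
                                uint16_t lport, uint16_t rport)
    {
        struct tcp_conn *c = tcp_hash[tcp_hash_fn(lip, rip, lport, rport)];
        for (; c != NULL; c = c->next)
            if (c->local_ip == lip && c->remote_ip == rip &&
                c->local_port == lport && c->remote_port == rport)
                return c;
        return NULL;
    }
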
Asynchronous interfaces. To maximize concurrency, high-performance servers use asynchronous interfaces so as not to block on long-latency operations [33]. Server applications interact using an event-notification interface such as select() or poll(), which in turn can have performance implications [5]. Unfortunately, these interfaces are typically only for network I/O and not file I/O, so they are not as general as they could be.

  Machine              BIOS Release Date   Cycle Time (ns)
  Workstation-Class
    500 MHz P3         Jul 2000            2.000
    933 MHz P3         Mar 2001            1.070
    1.7 GHz P4         Sep 2003            0.590
  Server-Class
    450 MHz P2-Xeon    Jan 2000
    1.6 GHz P4-Xeon    Oct 2001
    3.2 GHz P4-Xeon    May 2004

Table 1: Properties for Multiple Generations of Machines

In-kernel implementations. Context switches, data copies, and system calls can be avoided altogether by implementing the server completely in kernel space [17, 18]. While this provides the best performance, in-kernel implementations are difficult to implement and maintain, and the approach is hard to generalize across multiple applications.

RDMA. Others have also noticed these scaling problems, particularly with respect to data copying, and have offered RDMA as a solution. Interest in RDMA and Infiniband [4] is growing in the local-area case, such as in storage networks or cluster-based supercomputing. However, RDMA requires modifications to both sides of a conversation, whereas offload can be deployed incrementally on the server side only. Our interest is in supporting existing applications in an inter-operable way, which precludes using RDMA.

While effective, these optimizations are limited in that they do not address the full range of scenarios seen by a server. The main restrictions are: 1) they do not apply to the receive side, 2) they are not fully asynchronous in the way they interact with the operating system, 3) they do not minimize the interaction with the network interface, or 4) they are not inter-operable. Additionally, many techniques do not address what we believe to be the fundamental performance issue, which is overall server scalability.

2.2 Server Scalability

The recent arrival of 10 gigabit Ethernet and the promise of 40 and 100 gigabit Ethernet in the near future show that raw network bandwidth is scaling at least as quickly as CPU speed. However, it is well known that memory speeds are not scaling as quickly as CPU speed increases [16]. As a consequence of this and other factors, researchers have observed that the performance of host TCP/IP implementations is not scaling at the same rate as CPU speeds in spite of raw network bandwidth increases.

To quantify how performance scales over time, we ran a number of experiments using several generations of machines, described in detail in Table 1. We break the machines into two classes: desk-side workstations and rack-mounted servers with aggressive memory systems and I/O busses. The workstations include a 500 MHz Intel Pentium 3, a 933 MHz Intel Pentium 3, and a 1.7 GHz Pentium 4. The servers include a 450 MHz Pentium 2-Xeon, a 1.6 GHz P4-Xeon, and a 3.2 GHz P4-Xeon. In addition, each of the P4-Xeon servers has a 1 MB L3 cache. Each machine runs Linux 2.6.9 and has a number of Intel E1000 MT server gigabit Ethernet adapters, connected via a Dell gigabit switch. Load is generated by five 3.2 GHz P4-Xeons acting as clients, each using an E1000 client gigabit adapter and running Linux 2.6.5. We chose the E1000 MT adapters for the servers since these have been shown to be one of the highest-performing conventional adapters on the market [32], and we did not have access to a 10 gigabit adapter.

We measured the time to access various locations in the memory hierarchy for these machines, including the L1 and L2 caches, main memory, and the memory-mapped I/O registers on the E1000. Memory hierarchy times were measured using LMBench [25]. To measure the device I/O register times, we added some modifications to the initialization routine of the Linux 2.6.9 E1000 device driver code. Table 2 presents the results. Note that while L1 and L2 access times remain relatively consistent in terms of processor cycles, the time to access main memory and the device registers is increasing over time. If access times were improving at the same rate as CPU speeds, the number of clock cycles would remain constant.

To see how actual server performance is scaling over time, we ran the static portion of SPECweb99 [12] using a recent version of Flash [33, 37].

Table 2: Memory Access Times for Multiple Generations of Machines. For each machine (500 MHz P3, 933 MHz P3, and 1.7 GHz P4 in the workstation class; 450 MHz P2-Xeon, 1.6 GHz Xeon, and 3.2 GHz Xeon in the server class), the table reports L1 cache hit, main memory, I/O register read, and I/O register write access times, each in nanoseconds and in clock cycles; the individual measurements are not reproduced here.

In these experiments, Flash exploits all the available performance optimizations on Linux, including sendfile() with zero copy, TSO, and checksum offload on the E1000. Table 3 shows the results. Observe that server performance is not scaling with CPU speed, even though this is a heavily optimized server making use of all current best practices. This is not because of limitations in the network bandwidth; for example, the 3.2 GHz Xeon-based machine has 4 gigabit interfaces and multiple 10 gigabit PCI-X busses.

2.3 Offload: Critiques and Responses

In this paper, we study TCP offload as a solution to the scalability problem. However, TCP offload has been hotly debated by the research community, perhaps best exemplified by Mogul's paper, "TCP offload is a dumb idea whose time has come" [27]. That paper effectively summarizes the criticisms of TCP offload, and so we use the structure of that paper to offer our counterarguments here.

Limited processing requirements: One argument is that Clark et al. [9] show that the main issue in TCP performance is implementation, not the TCP protocol itself, and a major factor is data movement; thus offload does not address the real problem. We point out that offload does not simply mean TCP header processing; it includes the entire TCP/IP stack, including poorly scaling, performance-critical components such as data movement, bus crossings, interrupts, and device interaction. Offload provides an improved interface to the adapter that reduces the use of these scalability-limiting operations.

Moore's Law: Moore's Law states that CPU speeds are doubling every 18 months, and thus one claim is that offload cannot compete with general-purpose CPUs. Historically, chips used by adapter vendors have not increased at the same rate as general-purpose CPUs due to the economies of scale. However, offload can use commodity CPUs with software implementations, which we believe is the proper approach. In addition, speed needs only to be matched with the interface (e.g., 10 gigabit Ethernet), and we argue proper design reduces the code path relative to the non-offloaded case (e.g., with fewer memory copies). Sarkar et al. [38] and Ang [1] show that when the NIC CPU is under-provisioned with respect to the host CPU, performance can actually degrade. Clearly the NIC processing capacity must be sized properly. Finally, increasing CPU speeds does not address the scalability issue, which is what we focus on here.

Efficient host interface: Early critiques are that TCP Offload Engine (TOE) vendors recreated "TCP over a bus." Development of an elegant and efficient host/adapter interface for offload is a fundamental research challenge, one we are addressing in this paper.

Bad buffer management: Unless offload engines understand higher-level protocols, there is still an application-layer header copy. While true, copying of application headers is not as performance-critical as copying application data. One complication is the application combining its own headers on the same connection with its data. This can only be solved by changing the application, which is already proposed in RDMA extensions for NFS and iSCSI [7, 8].

Connection management overhead: Unlike conventional NICs, offload adapters must maintain per-connection state. Opponents argue that offload cannot handle large numbers of connections, but Web server workloads have forced host TCP stacks to discover techniques to efficiently manage 10,000s of connections. These techniques are equally applicable for an interface-based implementation.

Resource management overhead: Critics argue that tracking resource management is "more difficult" for offload. We do not believe this is the case. It is straightforward to extend the notion of resource management across the interface without making the adapter aware of every process, as we will show in Sections 3 and 4.

  Machine             Throughput   Requested     Conforming    Scale        Scale    Ratio
                      (ops/sec)    Connections   Connections   (achieved)   (ideal)  (%)
  Workstation-Class
    500 MHz P3            1231         375            378        1.00        1.00     100
    933 MHz P3            1318         400            399        1.06        1.87      56
    1.7 GHz P4            3457        1200           1169        3.20        3.46      94
  Server-Class
    450 MHz P2-Xeon       2230         700            699        1.00        1.00     100
    1.6 GHz P4-Xeon       8893        2800           2792        4.00        3.56     112
    3.2 GHz P4-Xeon      11614        3500           3490        5.00        7.10      71

Table 3: SPECweb99 Performance Scalability over Multiple Generations of Machines

Event management: The claim is that offload does not address managing the large numbers of events that occur in high-volume servers. It is true that offload, per se, does not address application-visible events, which are better addressed by the API. However, offload can shield the host operating system from spurious, unnecessary adapter events, such as TCP acknowledgments or window advertisements. In addition, it allows batching of other events to amortize the cost of interrupts and bus crossings.

Partial offload is sufficiently effective: Partial offload approaches include checksum offload and large send (or TCP segmentation offload), as discussed in Section 2.1. While useful, they have limited value and do not fully solve the scalability problem, as was shown in Section 2.2. Other arguments include that checksum offload actually masks errors to the host [41]. In contrast, offload allows larger batching and the opportunity to perform more rigorous error checking (by including the CRC in the data descriptors).

Maintainability: Opponents argue that offload-based approaches are more difficult to update and maintain in the presence of security and bug patches. While this is true of an ASIC-based approach, it is not true of a software-based approach using general-purpose hardware.

Quality assurance: The argument here is that offload is harder to test to determine bugs. However, testing tools such as TBIT [31] and ANVL [11] allow remote testing of the offload interface. In addition, software-based approaches based on open-source TCP implementations such as Linux or FreeBSD facilitate both maintainability and quality assurance.

System management interface: Opponents claim that offload adapters cannot have the same management interface as the host OS. This is incorrect: one example is SNMP, and it is trivial to extend this to an offload adapter.

Concerns about NIC vendors: Third-party vendors may go out of business and strand the customer. This has nothing to do with offload; it is true of any I/O device: disk, NIC, or graphics card. Economic incentives seem to address customer needs. In addition, one of the largest NIC vendors is Intel.

3 System Design

In this section we describe our offload design and how it addresses scalability.

3.1 How Offload Addresses Scalability

A higher-level interface. Offload allows the host operating system to interact with the device at a higher level of abstraction. Rather than simply queuing MTU-sized packets for transmission or reception, the host issues commands at the transport layer (e.g., connect(), accept(), send(), close()). This allows the adapter to shield the host from transport-layer events (and their attendant interrupt costs) that may be of no interest to the host, such as arrivals of TCP acknowledgments or window updates. Instead, the host is only notified of meaningful events. Examples include a completed connection establishment or termination (rather than every packet arrival for the 3-way handshake or 4-way tear-down) or application-level data units. Sufficient intelligence on the adapter can determine the appropriate time to transfer data to the host, either through knowledge of standardized higher-level protocols (such as HTTP or NFS) or through a programmable interface that can provide an application signature (i.e., an application-level equivalent to a packet filter). By interacting at this higher level of abstraction, the host will transfer less data over the bus and incur fewer interrupts and device register accesses.

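To make this level of abstraction concrete, the hypothetical declarations below sketch what transport-level commands from host to adapter, and the coarse-grained events returned, might look like; the names and fields are purely illustrative and are not the descriptor formats used in the prototype described in Section 4.

    /* Hypothetical sketch of transport-level host/adapter descriptors. */
    #include <stdint.h>

    enum ofld_cmd_type {            /* host -> adapter */
        OFLD_CMD_LISTEN,            /* begin accepting on a port */
        OFLD_CMD_CONNECT,           /* open an outbound connection */
        OFLD_CMD_SEND,              /* transmit a buffer on a connection */
        OFLD_CMD_CLOSE
    };

    enum ofld_event_type {          /* adapter -> host: meaningful events only */
        OFLD_EV_CONN_ESTABLISHED,   /* 3-way handshake already completed */
        OFLD_EV_DATA_ARRIVED,       /* application data ready; ACKs are hidden */
        OFLD_EV_SEND_COMPLETE,
        OFLD_EV_CONN_CLOSED         /* 4-way tear-down already completed */
    };

    struct ofld_cmd {
        uint32_t type;              /* enum ofld_cmd_type */
        uint32_t conn_id;           /* adapter-assigned connection handle */
        uint64_t buf_phys;          /* pinned physical buffer, if any */
        uint32_t buf_len;
        uint32_t flags;
    };

    struct ofld_event {
        uint32_t type;              /* enum ofld_event_type */
        uint32_t conn_id;
        uint64_t buf_phys;          /* host buffer the adapter DMA'ed into */
        uint32_t byte_count;
        uint32_t status;
    };

Several such commands or events can be packed into a single transfer, which is one way the per-crossing cost is amortized.
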
Ability to move data in larger sizes. As described in Section 2.1, the ability to use large MTUs has a significant impact on performance for both sending and receiving data.

Figure 1: Conventional Protocol Stack. The application sits in user space above the socket layer, TCP/IP, and device driver in the kernel; the host CPU communicates with the Ethernet NIC over the PCI bus via DMA, PIO, and interrupts.

Figure 2: Offload Architecture. The application remains in user space and the socket layer and device driver in the kernel, while TCP/IP runs on the offload card; the host communicates with the card over the PCI bus via DMA, PIO, and interrupts.

Large send/TSO only approximates this optimization, and only for the send side. In contrast, offload allows the host to send and receive data in large chunks unaffected by the underlying MTU size. This reduces use of poorly scaling components by making more efficient use of the I/O bus. Utilization of the I/O bus is not only affected by the data sent over it, but also by the DMA descriptors required to describe that data; offload reduces both. In addition, data that is typically DMA'ed over the I/O bus in the conventional case is not transferred here, for example TCP/IP and Ethernet headers.

Improving memory reference behavior. We believe offload will not only increase available cycles to the application but improve application memory reference behavior. By reducing cache and TLB pollution, cache hit rates and CPI will improve, increasing application performance.

3.2 Current Adapter Designs

Perhaps the simplest way to understand an architecture that offloads all TCP/IP processing is to outline the ways in which offload differs from conventional adapters in the way it interacts with the OS. Figure 1 illustrates a conventional protocol architecture in an operating system. Operating systems tend to communicate with conventional adapters only in terms of data transfer by providing them with two queues of buffers. One queue is made up of ready-made packets for transmission; the other is a queue of empty buffers to use for packet reception. Each queue of buffers is identified, in turn, by a descriptor table that describes the size and location of each buffer in the queue. Buffers are typically described in physical memory and must be pinned to ensure that they are accessible to the card, i.e., so that they are not paged out. The adapter provides a memory-mapped I/O interface for telling the adapter where the descriptor tables are located in physical memory, and provides an interface for some control information, such as what interrupt number to raise when a packet arrives. Communication between the host CPU and the adapter tends to be in one of three forms, as is shown in Figure 1: DMAs of buffers and descriptors to and from the adapter; reads and writes of control information to and from the adapter; and interrupts generated by the adapter.

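The sketch below restates this conventional interface as representative C declarations: two descriptor rings in pinned physical memory plus a handful of memory-mapped registers. The names and layout are illustrative rather than those of any specific adapter.

    /* Sketch of a frame-level NIC interface: descriptor rings plus
     * memory-mapped control registers. */
    #include <stdint.h>

    struct frame_desc {
        uint64_t buf_phys;       /* physical address of a pinned frame buffer */
        uint16_t buf_len;        /* frame length (TX) or buffer size (RX) */
        uint16_t status;         /* e.g., done / owned-by-adapter bits */
    };

    #define RING_SIZE 256

    struct nic_ring {
        struct frame_desc desc[RING_SIZE];   /* contiguous, pinned memory */
        uint32_t head;                       /* next descriptor the host fills */
        uint32_t tail;                       /* next descriptor the NIC consumes */
    };

    /* Memory-mapped I/O registers as seen by the driver: programming them
     * tells the adapter where the descriptor tables live and controls
     * interrupts. Touching these registers is one of the poorly scaling
     * operations measured in Table 2. */
    struct nic_regs {
        volatile uint64_t tx_ring_base;      /* physical address of TX ring */
        volatile uint64_t rx_ring_base;      /* physical address of RX ring */
        volatile uint32_t intr_mask;
        volatile uint32_t intr_status;
    };
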
3.3 Offloaded Adapter Design

An architecture that seeks to offload the full TCP/IP stack has both similarities and differences in the way it interacts with the adapter. Figure 2 illustrates our offload architecture. As in the conventional scenario, queues of buffers and descriptor tables are passed between the host CPU and the adapter, and DMAs, reads, writes, and interrupts are used to communicate. In the offload architecture, however, the host and the adapter communicate using a higher level of abstraction. Buffers have more explicit data structures imposed on them that indicate both control and data interfaces. As with a conventional adapter, passed buffers must be expressed as physical addresses and must be in pinned memory. The control interface allows the host to command the adapter (e.g., what port numbers to listen on) and the adapter to instruct the host (e.g., to notify the host of the arrival of a new connection). The control interface is invoked, for example, by conventional socket functions that control connections: socket(), bind(), listen(), connect(), accept(), setsockopt(), etc.

The data interface provides a way to transfer data on established connections for both sending and receiving and is invoked by socket functions such as send(), sendto(), write(), writev(), read(), readv(), etc. Even the data interface is at a higher layer of abstraction, since the passed buffers consist of application-specific data rather than fully formed Ethernet frames with TCP/IP headers attached. In addition, these buffers need to identify which connection the data is for. Buffers containing data can be in units much larger than the packet MTU size. While conceptually they could be of any size, in practice they are unlikely to be larger than a VM page size.

As with a conventional adapter, the interface to the offload adapter need not be synchronous. The host OS can queue requests to the adapter, continue doing other processing, and then receive a notification (perhaps in the form of an interrupt) that the operation is complete. The host can implement synchronous socket operations by using the asynchronous interface and then blocking the application until the results are returned from the adapter. We believe asynchronous operation is key in order to ameliorate and amortize fixed overheads. Asynchrony allows larger-scale batching and enables other optimizations such as polling-based approaches to servers [3, 28].

The offload interface allows supporting conventional user-level APIs, such as the socket interface, as well as newer APIs that allow more direct access to user memory such as DAFS, SDP, and RDMA. In addition, offload allows performing zero-copy sends and receives without changes to the socket API. The term zero-copy refers to the elimination of memory-to-memory copies by the host. Even in the zero-copy case, data is still transferred across the I/O bus by the adapter via DMA.

For example, in the case of a send using a conventional adapter, the host typically copies the data from user space into a pinned kernel buffer, which is then queued to the adapter for transmission. With an intelligent adapter, the host can block the user application and pin its buffers, then invoke the adapter to DMA the data directly from the user application buffer. This is similar to previous "single-copy" approaches [13, 20], except that the transfer across the bus is done by the adapter DMA and not via an explicit copy by the host CPU.

Observe from Figure 2 that the interaction between the host and the adapter now occurs between the socket and TCP layers. A naive implementation may make unnecessary transfers across the PCI bus for achieving socket functionality. For example, accept() would now cause a bus crossing in addition to a kernel crossing, as could setsockopt() for actions such as changing the send or receive buffer sizes or the Nagle algorithm. However, each of these costs can be amortized by batching multiple requests into a single request that crosses the bus. For example, multiple arrived connections can be aggregated into a single accept() crossing, which then translates into multiple accept() system calls. On the other hand, certain events that would generate bus crossings with a conventional adapter might not do so with an offload adapter, such as ACK processing and generation. The relative weight of these advantages and disadvantages depends on the implementation and workload of the application using the adapter.

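A hypothetical sketch of the accept() batching just described: the adapter delivers one batch describing several newly established connections in a single crossing, and the host unpacks it into individual accept() returns. The structures and names are illustrative only.

    /* Sketch of a batched accept notification from adapter to host. */
    #include <stdint.h>

    #define ACCEPT_BATCH_MAX 32

    struct accepted_conn {
        uint32_t conn_id;            /* adapter handle for the new connection */
        uint32_t remote_ip;
        uint16_t remote_port;
        uint16_t local_port;
    };

    struct accept_batch {            /* filled in by the adapter, DMA'ed once */
        uint32_t count;
        struct accepted_conn conns[ACCEPT_BATCH_MAX];
    };

    /* Host-side dispatch: a single bus crossing satisfies up to 'count'
     * pending accept() calls. 'deliver_to_accept' stands in for handing
     * one connection to a blocked or queued accept(). */
    void dispatch_accept_batch(const struct accept_batch *b,
                               void (*deliver_to_accept)(const struct accepted_conn *))
    {
        for (uint32_t i = 0; i < b->count; i++)
            deliver_to_accept(&b->conns[i]);
    }
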
4 System Implementation

To evaluate our design and the impact of design decisions, we implemented a software prototype. Our decision to implement the prototype purely in software, rather than building or modifying actual adapter hardware, was motivated by several factors. Since our goal is to study not just performance, but scalability, we ultimately intend to model different hardware characteristics, for both the host and adapter, using a cycle-accurate hardware simulator. Limiting our analysis to only currently available hardware would hinder our evaluation for future hardware generations. Ultimately, we envision an adapter with a general-purpose processor, in addition to specialized hardware to accelerate specific operations such as checksum calculation. Our prototype software is intended to serve as a reference implementation for a production adapter.

Our prototype is composed of three main components:

  • OSLayer, an operating system layer that provides the socket interface to applications and maps it to the descriptor interface shared with the adapter;
  • Event-driven TCP, our offloaded TCP implementation;
  • IOLib, a library that encapsulates interaction between OSLayer and Event-driven TCP.

At the moment, OSLayer is implemented as a library that is statically linked with the application. Ultimately, it will be decomposed into two components: a library linked with applications and a component built into the kernel. Event-driven TCP currently runs as a user-level process that accesses the actual network via a raw socket. It will eventually become the main software loop on the adapter. The IOLib implementation currently communicates via TCP sockets, but the design allows for implementations that communicate over a PCI bus or other interconnects such as Infiniband. This provides a vehicle for experimentation and analysis and allows us to measure bus traffic without having to build a detailed simulation of a PCI bus or other interconnect.

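As an illustration of how such a transport-agnostic shim could frame host/adapter exchanges and account for would-be bus traffic, the sketch below (purely hypothetical; not the actual IOLib interface) sends one framed message over a socket and accumulates the bytes that would cross the interconnect.

    /* Hypothetical sketch of framed host/adapter messages over a socket,
     * with per-message byte accounting standing in for bus traffic. */
    #include <stdint.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    struct iolib_hdr {
        uint32_t msg_type;       /* command, event, or data descriptor */
        uint32_t payload_len;    /* bytes that follow the header */
    };

    /* Send one framed message; the byte count gives a direct measure of
     * the traffic that would cross a PCI bus or other interconnect. */
    ssize_t iolib_send(int fd, uint32_t msg_type,
                       const void *payload, uint32_t payload_len,
                       uint64_t *bytes_on_wire)
    {
        struct iolib_hdr hdr = { msg_type, payload_len };
        struct iovec iov[2] = {
            { &hdr, sizeof(hdr) },
            { (void *)payload, payload_len }
        };
        struct msghdr msg = { 0 };
        msg.msg_iov = iov;
        msg.msg_iovlen = payload ? 2 : 1;

        ssize_t n = sendmsg(fd, &msg, 0);
        if (n > 0 && bytes_on_wire)
            *bytes_on_wire += (uint64_t)n;   /* accumulate "bus crossing" bytes */
        return n;
    }
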
We used the Flash Web server for our evaluation, with Flash and OSLayer running on one machine and Event-driven TCP running on another. We use httperf [29] run-
