Server Network Scalability and TCP Offload
`
Doug Freimuth, Elbert Ho, Jason LaVoie, Ronald Mraz,
Erich Nahum, Prashant Pradhan, John Tracey

IBM T. J. Watson Research Center
Hawthorne, NY, 10532
{dmfreim, elbert, lavoie, mraz, nahum, ppradhan, tracey}@us.ibm.com
Abstract

Server network performance is increasingly dominated by poorly scaling operations such as I/O bus crossings, cache misses, and interrupts. Their overhead prevents performance from scaling even with increased CPU, link, or I/O bus bandwidths. These operations can be reduced by redesigning the host/adapter interface to exploit additional processing on the adapter. Offloading processing to the adapter is beneficial not only because it allows more cycles to be applied but also because of the changes it enables in the host/adapter interface. As opposed to other approaches such as RDMA, TCP offload provides benefits without requiring changes to either the transport protocol or API.

We have designed a new host/adapter interface that exploits offloaded processing to reduce poorly scaling operations. We have implemented a prototype of the design including both host and adapter software components. Experimental evaluation with simple network benchmarks indicates our design significantly reduces I/O bus crossings and holds promise to reduce other poorly scaling operations as well.
`
1 Introduction
`
`
`
Server network throughput is not scaling with CPU speeds. Various studies have reported CPU scaling factors of 43% [23], 60% [15], and 33% to 68% [22], which fall short of an ideal scaling of 100%. In this paper, we show that even increasing CPU speeds and link and bus bandwidths does not generate a commensurate increase in server network throughput. This lack of scalability points to an increasing tendency for server network throughput to become the key bottleneck limiting system performance. It motivates the need for an alternative design with better scalability.

Server network scalability is limited by operations heavily used in current designs that themselves do not scale well, most notably bus crossings, cache misses, and interrupts. Any significant improvement in scalability must reduce these operations. Given that the problem is one of scalability and not simply performance, it will not be solved by faster processors. Faster processors merely expend more cycles on poorly scaling operations.
Research in server network performance over the years has yielded significant improvements, including: integrated checksum and copy, checksum offload, copy avoidance, interrupt coalescing, fast path protocol processing, efficient state lookup, efficient timer management, and segmentation offload, a.k.a. large send. Another technique, full TCP offload, has been pursued for many years. Work on offload has generated both promising and less than compelling results [1, 38, 40, 42]. Good performance data and analysis on offload is scarce.
Many improvements in server scalability were described more than fifteen years ago by Clark et al. [9]. The authors demonstrated that the overhead incurred by network protocol processing, per se, is small compared to both per-byte (memory access) costs and operating system overhead, such as buffer and timer management. This motivated work to reduce or eliminate data touching operations, such as copies, and to improve the efficiency of operating system services heavily used by the network stack. Later work [19] showed that the overhead of non-data touching operations is, in fact, significant for real workloads, which tend to feature a preponderance of small messages. Today, per-byte overhead has been greatly reduced through checksum offload and zero-copy send. This leaves per-packet overhead, operating system services, and zero-copy receive as the main remaining areas for further improvement.
Nearly all of the enhancements described by Clark et al. have seen widespread adoption. The one notable exception is "an efficient network interface." This is a network adapter with a fast general-purpose processor that provides a much more efficient interface to the network than the current frame-based interface devised decades ago. In this paper, we describe an effort to develop a much more efficient network interface and to make this enhancement a reality as well.

Our work is pursued in the context of TCP for three reasons: 1) TCP's enormous installed base, 2) the methodology employed with TCP will transfer to other protocols, and 3) the expectation that key new architectural features, such as zero-copy receive, will ultimately demonstrate their viability with TCP.
`
The work described here is part of a larger effort to improve server network scalability. We began by analyzing server network performance and recognizing, as others have, a significant scalability problem. Next, we identified specific operations to be the cause, specifically: bus crossings, cache misses, and interrupts. We formulated a design that reduces the impact of these operations. This design exploits additional processing at the network adapter, i.e., offload, to improve the efficiency of the host/adapter interface, which is our primary focus. We have implemented a prototype of the new design, which consists of host and adapter software components, and have analyzed the impact of the new design on bus crossings. Our findings indicate that offload can substantially decrease bus crossings and holds promise to reduce other scalability-limiting operations such as cache misses. Ultimately, we intend to evaluate the design in a cycle-accurate hardware simulator. This will allow us to comprehensively quantify the impact of design alternatives on cache misses, interrupts, and overall performance over several generations of hardware.
This paper is organized as follows. Section 2 provides motivation and background. Section 3 presents our design, and the current prototype implementation is described in Section 4. Section 5 presents our experimental infrastructure and results. Section 6 surveys and contrasts related work, and Section 7 summarizes our contributions and plans for future work.
`
2 Motivation and Background

To provide the proper motivation and background for our work, we first describe the current best practices of techniques and optimizations for network server performance. Using industry-standard benchmarks, we then show that, despite these practices, servers are still not scaling with CPU speeds. Since TCP offload has been a controversial topic in the research community, we review the critiques of offload, providing counterarguments to each point. How TCP offload addresses these scaling issues is described in more detail in Section 3.
`
2.1 Current Best Practices

Current high-performance servers have adopted many techniques to maximize performance. We provide a brief overview of them here.
Sendfile with zero copy. Most operating systems have a sendfile or transmitfile operation that allows sending a file over a socket without copying the contents of the file into user space. This can have substantial performance benefits [30]. However, the benefits are limited to send-side processing; it does not affect receive-side processing. In addition, it requires the server application to maintain its data in the kernel, which may not be feasible for systems such as application servers, which generate content dynamically.
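As an illustration of the send-side path this optimization targets, the following sketch (ours, not from the paper) serves a file over an already-accepted TCP socket using Linux's sendfile(2); the file pages move from the kernel page cache to the socket without ever being copied into user space.

    /* Minimal sketch: conn_fd is assumed to be a connected TCP socket;
     * error handling is abbreviated. */
    #include <sys/sendfile.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>

    ssize_t serve_file(int conn_fd, const char *path)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;

        struct stat st;
        if (fstat(fd, &st) < 0) {
            close(fd);
            return -1;
        }

        off_t offset = 0;
        ssize_t total = 0;
        while (offset < st.st_size) {
            /* The kernel moves file pages directly to the socket; the
             * application never touches the data. */
            ssize_t n = sendfile(conn_fd, fd, &offset, st.st_size - offset);
            if (n <= 0)
                break;
            total += n;
        }
        close(fd);
        return total;
    }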
Checksum offload. Researchers have shown that calculating the IP checksum over the body of the data can be expensive [19]. Most high-performance adapters have the ability to perform the IP checksum over both the contents of the data and the TCP/IP headers. This removes an expensive data-touching operation on both send and receive. However, adapter-level checksums will not catch errors introduced by transferring data over the I/O bus, which has led some to advocate caution with checksum offload [41].
Interrupt coalescing. Researchers have shown that interrupts are costly, and generating an interrupt for each packet arrival can severely throttle a system [28]. In response, adapter vendors have enabled the ability to delay interrupts by a certain amount of time or number of packets in an effort to batch packets per interrupt and amortize the costs [14]. While effective, it can be difficult to determine the proper trigger thresholds for firing interrupts, and large amounts of batching may cause unacceptable latency for an individual connection.
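The sketch below is illustrative only (the structure, field names, and thresholds are ours, not any vendor's): the adapter raises an interrupt only once enough packets have accumulated or enough time has passed since the first unreported packet. On Linux hosts such thresholds are typically tuned through the driver, e.g., via ethtool -C (rx-usecs, rx-frames).

    #include <stdbool.h>
    #include <stdint.h>

    struct coalesce_state {
        uint32_t pending_frames;   /* packets received but not yet reported */
        uint64_t first_arrival_us; /* timestamp of the oldest pending packet */
        uint32_t max_frames;       /* e.g. 32 packets */
        uint32_t max_usecs;        /* e.g. 125 microseconds */
    };

    /* Called by the (hypothetical) adapter on every packet arrival. */
    bool should_raise_interrupt(struct coalesce_state *cs, uint64_t now_us)
    {
        if (cs->pending_frames == 0)
            cs->first_arrival_us = now_us;
        cs->pending_frames++;

        bool fire = cs->pending_frames >= cs->max_frames ||
                    (now_us - cs->first_arrival_us) >= cs->max_usecs;
        if (fire)
            cs->pending_frames = 0;   /* the whole batch is reported in one interrupt */
        return fire;
    }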
Large send/segmentation offload. TCP/IP implementers have long known that larger MTU sizes provide greater efficiency, both in terms of network utilization (fewer headers per byte transferred) and in terms of host CPU utilization (fewer per-packet operations incurred per byte sent or received). Unfortunately, larger MTU sizes are not usually available due to Ethernet's 1516-byte frame size. Gigabit Ethernet provides "jumbo frames" of 9 KB, but these are only useful in specialized local environments and cannot be preserved across the wide-area Internet. As an approximation, certain operating systems, such as AIX and Linux, provide large send or TCP segmentation offload (TSO), where the TCP/IP stack interacts with the network device as if it had a large MTU size. The device in turn segments the larger buffers into 1516-byte Ethernet frames and adjusts the TCP sequence numbers and checksums accordingly. However, this technique is also limited to send-side processing. In addition, as we demonstrate in Section 2.2, the technique is limited by the way TCP performs congestion control.
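To make the mechanism concrete, here is a simplified sketch of our own (real devices build full Ethernet/IP/TCP headers and compute checksums in hardware; names and constants here are illustrative): a TSO engine walks one large host buffer, emitting an MTU-sized segment at a time and advancing the TCP sequence number for each.

    #include <stdio.h>
    #include <stddef.h>
    #include <stdint.h>

    #define TCPIP_HDRS  40                      /* 20-byte IP + 20-byte TCP, no options */
    #define MAX_PAYLOAD (1500 - TCPIP_HDRS)     /* payload per 1500-byte Ethernet MTU */

    struct tso_ctx {
        uint32_t next_seq;                      /* TCP sequence number of the next byte */
    };

    /* Stand-in for "build headers, fill in checksum, hand frame to the MAC". */
    static void emit_frame(uint32_t seq, const uint8_t *payload, size_t len)
    {
        printf("segment: seq=%u len=%zu\n", seq, len);
        (void)payload;
    }

    /* The host posted one large buffer; the device cuts it into wire-sized
     * segments, so the host pays per-buffer rather than per-packet costs. */
    void tso_send(struct tso_ctx *ctx, const uint8_t *buf, size_t len)
    {
        size_t off = 0;
        while (off < len) {
            size_t chunk = len - off;
            if (chunk > MAX_PAYLOAD)
                chunk = MAX_PAYLOAD;
            emit_frame(ctx->next_seq, buf + off, chunk);
            ctx->next_seq += (uint32_t)chunk;
            off += chunk;
        }
    }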
Efficient connection management. Early networked servers did not handle large numbers of TCP connections efficiently, for example by using a linear linked list to manage state [26]. This led to operating systems using hash-table-based approaches [24] and separating table entries in the TIME_WAIT state [2].
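A minimal sketch of the idea, not taken from any particular stack: per-connection state is located by hashing the connection 4-tuple into a bucket array, replacing the linear search of early implementations with an expected O(1) lookup.

    #include <stdint.h>
    #include <stddef.h>

    #define TCP_HASH_BUCKETS 65536   /* power of two so masking works below */

    struct tcp_conn {
        uint32_t saddr, daddr;
        uint16_t sport, dport;
        struct tcp_conn *next;       /* chain within a bucket */
        /* ... per-connection TCP state ... */
    };

    static struct tcp_conn *tcp_hash[TCP_HASH_BUCKETS];

    static unsigned tcp_hash_fn(uint32_t saddr, uint32_t daddr,
                                uint16_t sport, uint16_t dport)
    {
        uint32_t h = saddr ^ daddr ^ (((uint32_t)sport << 16) | dport);
        h ^= h >> 16;
        return h & (TCP_HASH_BUCKETS - 1);
    }

    struct tcp_conn *tcp_lookup(uint32_t saddr, uint32_t daddr,
                                uint16_t sport, uint16_t dport)
    {
        unsigned b = tcp_hash_fn(saddr, daddr, sport, dport);
        for (struct tcp_conn *c = tcp_hash[b]; c; c = c->next)
            if (c->saddr == saddr && c->daddr == daddr &&
                c->sport == sport && c->dport == dport)
                return c;            /* expected O(1) instead of O(n) */
        return NULL;
    }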
Asynchronous interfaces. To maximize concurrency, high-performance servers use asynchronous interfaces so as not to block on long-latency operations [33]. Server applications interact using an event notification interface such as select() or poll(), which in turn can have performance implications [5]. Unfortunately,
`
Machine              BIOS Release Date    Clock Cycle Time (ns)
Workstation-Class
  500 MHz P3         Jul 2000             2.000
  933 MHz P3         Mar 2001             1.070
  1.7 GHz P4         Sep 2003             0.590
Server-Class
  450 MHz P2-Xeon    Jan 2000             2.222
  1.6 GHz P4-Xeon    Oct 2001             0.625
  3.2 GHz P4-Xeon    May 2004             0.313

Table 1: Properties for Multiple Generations of Machines
`
these interfaces are typically only for network I/O and not file I/O, so they are not as general as they could be.

In-kernel implementations. Context switches, data copies, and system calls can be avoided altogether by implementing the server completely in kernel space [17, 18]. While this provides the best performance, in-kernel implementations are difficult to implement and maintain, and the approach is hard to generalize across multiple applications.
RDMA. Others have also noticed these scaling problems, particularly with respect to data copying, and have offered RDMA as a solution. Interest in RDMA and Infiniband [4] is growing in the local-area case, such as in storage networks or cluster-based supercomputing. However, RDMA requires modifications to both sides of a conversation, whereas offload can be deployed incrementally on the server side only. Our interest is in supporting existing applications in an inter-operable way, which precludes using RDMA.
While effective, these optimizations are limited in that they do not address the full range of scenarios seen by a server. The main restrictions are: 1) they do not apply to the receive side, 2) they are not fully asynchronous in the way they interact with the operating system, 3) they do not minimize the interaction with the network interface, or 4) they are not inter-operable. Additionally, many techniques do not address what we believe to be the fundamental performance issue, which is overall server scalability.
`
2.2 Server Scalability

The recent arrival of 10 gigabit Ethernet and the promise of 40 and 100 gigabit Ethernet in the near future show that raw network bandwidth is scaling at least as quickly as CPU speed. However, it is well known that memory speeds are not scaling as quickly as CPU speed increases [16]. As a consequence of this and other factors, researchers have observed that the performance of host TCP/IP implementations is not scaling at the same rate as CPU speeds in spite of raw network bandwidth increases.
`
To quantify how performance scales over time, we ran a number of experiments using several generations of machines, described in detail in Table 1. We break the machines into 2 classes: desk-side workstations and rack-mounted servers with aggressive memory systems and I/O busses. The workstations include a 500 MHz Intel Pentium 3, a 933 MHz Intel Pentium 3, and a 1.7 GHz Pentium 4. The servers include a 450 MHz Pentium 2 Xeon, a 1.6 GHz P4 Xeon, and a 3.2 GHz P4 Xeon. In addition, each of the P4-Xeon servers has 1 MB L3 caches. Each machine runs Linux 2.6.9 and has a number of Intel E1000 MT server gigabit Ethernet adapters, connected via a Dell gigabit switch. Load is generated by five 3.2 GHz P4 Xeons acting as clients, each using an E1000 client gigabit adapter and running Linux 2.6.5. We chose the E1000 MT adapters for the servers since these have been shown to be one of the highest-performing conventional adapters on the market [32], and we did not have access to a 10 gigabit adapter.
We measured the time to access various locations in the memory hierarchy for these machines, including from the L1 and L2 caches, main memory, and the memory-mapped I/O registers on the E1000. Memory hierarchy times were measured using LMBench [25]. To measure the device I/O register times, we added some modifications to the initialization routine of the Linux 2.6.9 E1000 device driver code. Table 2 presents the results. Note that while L1 and L2 access times remain relatively consistent in terms of processor cycles, the time to access main memory and the device registers is increasing over time. If access times were improving at the same rate as CPU speeds, the number of clock cycles would remain constant.
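The paper's measurement instruments the driver's initialization routine; the user-space sketch below is our own rough equivalent of the same idea. It assumes root privileges, an x86 TSC, and a placeholder PCI address whose BAR0 is exposed through sysfs, and it times a single uncached register read in CPU cycles (no serializing instructions, so it is illustrative rather than precise).

    #include <stdio.h>
    #include <stdint.h>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static inline uint64_t rdtsc(void)
    {
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

    int main(void)
    {
        /* Placeholder path: BAR0 of the NIC under test. */
        int fd = open("/sys/bus/pci/devices/0000:02:00.0/resource0", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        volatile uint32_t *bar = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);
        if (bar == MAP_FAILED) { perror("mmap"); return 1; }

        uint64_t start = rdtsc();
        uint32_t val = bar[0];          /* uncached MMIO read crosses the I/O bus */
        uint64_t cycles = rdtsc() - start;

        printf("register=0x%08x read took ~%llu cycles\n",
               (unsigned)val, (unsigned long long)cycles);
        munmap((void *)bar, 4096);
        close(fd);
        return 0;
    }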
`
To see how actual server performance is scaling over time, we ran the static portion of SPECweb99 [12] us-
`
[Table 2: Memory Access Times for Multiple Generations of Machines. For each machine (workstation-class: 500 MHz P3, 933 MHz P3, 1.7 GHz P4; server-class: 450 MHz P2-Xeon, 1.6 GHz Xeon, 3.2 GHz Xeon) the table reports memory-hierarchy and I/O register read times in nanoseconds and in clock cycles; the numeric entries are not legible in this copy.]

ing a recent version of Flash [33, 37]. In these experiments, Flash exploits all the available performance optimizations on Linux, including sendfile() with zero copy, TSO, and checksum offload on the E1000. Table 3 shows the results. Observe that server performance is not scaling with CPU speed, even though this is a heavily optimized server making use of all current best practices. This is not because of limitations in the network bandwidth; for example, the 3.2 GHz Xeon-based machine has 4 gigabit interfaces and multiple 10 gigabit PCI-X busses.
`
2.3 Offload: Critiques and Responses

In this paper, we study TCP offload as a solution to the scalability problem. However, TCP offload has been hotly debated by the research community, perhaps best exemplified by Mogul's paper, "TCP offload is a dumb idea whose time has come" [27]. That paper effectively summarizes the criticisms of TCP offload, and so we use the structure of that paper to offer our counterarguments here.
`
Limited processing requirements. One argument is that Clark et al. [9] show that the main issue in TCP performance is implementation, not the TCP protocol itself, and a major factor is data movement; thus offload does not address the real problem. We point out that offload does not simply mean TCP header processing; it includes the entire TCP/IP stack, including poorly scaling, performance-critical components such as data movement, bus crossings, interrupts, and device interaction. Offload provides an improved interface to the adapter that reduces the use of these scalability-limiting operations.
Moore's Law: Moore's Law states that CPU speeds are doubling every 18 months, and thus one claim is that offload cannot compete with general-purpose CPUs. Historically, chips used by adapter vendors have not increased at the same rate as general-purpose CPUs due to the economics of scale. However, offload can use commodity CPUs with software implementations, which we believe is the proper approach. In addition, speed needs only to be matched with the interface (e.g., 10 gigabit Ethernet), and we argue proper design reduces the code path relative to the non-offloaded case (e.g., with fewer memory copies). Sarkar et al. [38] and Ang [1] show that when the NIC CPU is under-provisioned with respect to the host CPU, performance can actually degrade. Clearly the NIC processing capacity must be sized properly. Finally, increasing CPU speeds does not address the scalability issue, which is what we focus on here.
`
Efficient host interface: Early critiques are that TCP Offload Engine (TOE) vendors recreated "TCP over a bus". Development of an elegant and efficient host/adapter interface for offload is a fundamental research challenge, one we are addressing in this paper.
Bad buffer management: Unless offload engines understand higher-level protocols, there is still an application-layer header copy. While true, copying of application headers is not as performance-critical as copying application data. One complication is the application combining its own headers on the same connection with its data. This can only be solved by changing the application, which is already proposed in RDMA extensions for NFS and iSCSI [7, 8].
Connection management overhead: Unlike conventional NICs, offload adapters must maintain per-connection state. Opponents argue that offload cannot handle large numbers of connections, but Web server workloads have forced host TCP stacks to discover techniques to efficiently manage 10,000s of connections. Those techniques are equally applicable for an interface-based implementation.
Resource management overhead: Critics argue that tracking resource management is "more difficult" for offload. We do not believe this is the case; it is straight-
`
`
Machine              Requested     Conforming    Scale        Scale     Ratio    Throughput
                     Connections   Connections   (achieved)   (ideal)   (%)      (ops/sec)
Workstation-Class
  500 MHz P3         375           375           1.00         1.00      100      1235
  933 MHz P3         400           399           1.06         1.87      56       (not legible)
  1.7 GHz P4         1200          1169          3.20         3.40      94       3457
Server-Class
  450 MHz P2-Xeon    700           699           1.00         1.00      100      2236
  1.6 GHz P4-Xeon    2800          2792          4.00         3.56      112      8893
  3.2 GHz P4-Xeon    3500          3490          5.00         7.11      70       (not legible)

Table 3: SPECweb99 Performance Scalability over Multiple Generations of Machines
`
forward to extend the notion of resource management across the interface without making the adapter aware of every process, as we will show in Sections 3 and 4.
Event management: The claim is that offload does not address managing the large numbers of events that occur in high-volume servers. It is true that offload, per se, does not address application-visible events, which are better addressed by the API. However, offload can shield the host operating system from spurious and unnecessary adapter events, such as TCP acknowledgments or window advertisements. In addition, it allows batching of other events to amortize the cost of interrupts and bus crossings.
Partial offload is sufficiently effective: Partial offload approaches include checksum offload and large send (or TCP Segmentation Offload), as discussed in Section 2.1. While useful, they have limited value and do not fully solve the scalability problem, as was shown in Section 2.2. Other arguments include that checksum offload actually masks errors to the host [41]. In contrast, offload allows larger batching and the opportunity to perform more rigorous error checking (by including the CRC in the data descriptors).
Maintainability: Opponents argue that offload-based approaches are more difficult to update and maintain in the presence of security and bug patches. While this is true of an ASIC-based approach, it is not true of a software-based approach using general-purpose hardware.
`
Quality assurance: The argument here is that offload is harder to test to determine bugs. However, testing tools such as TBIT [31] and ANVL [15] allow remote testing of the offload interface. In addition, software-based approaches based on open-source TCP implementations such as Linux or FreeBSD facilitate both maintainability and quality assurance.
System management interface: Opponents claim that offload adapters cannot have the same management interface as the host OS. This is incorrect; one example is SNMP. It is trivial to extend this to an offload adapter.
Concerns about NIC vendors: Third-party vendors may go out of business and strand the customer. This has nothing to do with offload; it is true of any I/O device: disk, NIC, or graphics card. Economic incentives seem to address customer needs. In addition, one of the largest NIC vendors is Intel.
`
3 System Design

In this section we describe our offload design and how it addresses scalability.
`
3.1 How Offload Addresses Scalability

A higher-level interface. Offload allows the host operating system to interact with the device at a higher level of abstraction. Rather than simply queuing MTU-sized packets for transmission or reception, the host issues commands at the transport layer (e.g., connect(), accept(), send(), close()). This allows the adapter to shield the host from transport-layer events (and their attendant interrupt costs) that may be of no interest to the host, such as arrivals of TCP acknowledgments or window updates. Instead, the host is only notified of meaningful events. Examples include a completed connection establishment or termination (rather than every packet arrival for the 3-way handshake or 4-way teardown) or application-level data units. Sufficient intelligence on the adapter can determine the appropriate time to transfer data to the host, either through knowledge of standardized higher-level protocols (such as HTTP or NFS) or through a programmable interface that can provide an application signature (i.e., an application-level equivalent to a packet filter). By interacting at this higher level of abstraction, the host will transfer less data over the bus and incur fewer interrupts and device register accesses.
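A minimal sketch of what such a transport-level host/adapter interface could look like; the structures, enumerators, and field names below are our own hypothetical illustration, not the interface the prototype actually defines.

    #include <stdint.h>

    enum ol_cmd_type {
        OL_CMD_LISTEN,     /* begin accepting connections on a port */
        OL_CMD_ACCEPT,     /* take ownership of (a batch of) new connections */
        OL_CMD_SEND,       /* transmit application data on a connection */
        OL_CMD_CLOSE       /* terminate a connection */
    };

    /* Host-to-adapter command descriptor. */
    struct ol_command {
        uint32_t type;       /* one of enum ol_cmd_type */
        uint32_t conn_id;    /* adapter-assigned connection identifier */
        uint64_t buf_phys;   /* physical address of a pinned data buffer, if any */
        uint32_t buf_len;    /* buffer length in bytes */
        uint32_t cookie;     /* host token echoed back in the completion event */
    };

    /* Adapter-to-host event: "connection established", "N bytes of
     * application data arrived", and so on, rather than raw frames or
     * individual TCP acknowledgments. */
    struct ol_event {
        uint32_t type;
        uint32_t conn_id;
        uint64_t buf_phys;
        uint32_t buf_len;
        uint32_t cookie;
    };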
Ability to move data in larger sizes. As described in Section 2.1, the ability to use large MTUs has a significant impact on performance for both sending and re-
`
`
[Figure 1: Conventional Protocol Stack. The TCP, IP, and Ethernet layers sit in kernel space beneath the socket layer and above the device driver; the host CPU communicates with the NIC on the PCI card via DMA, PIO, and interrupts over the PCI bus.]

[Figure 2: Offload Architecture. The TCP, IP, and Ethernet layers move onto the PCI card; the host retains the application, socket layer, and device driver, and still communicates with the adapter via DMA, PIO, and interrupts over the PCI bus.]
`
ceiving data. Large send/TSO only approximates this optimization, and only for the send side. In contrast, offload allows the host to send and receive data in large chunks unaffected by the underlying MTU size. This reduces use of poorly scaling components by making more efficient use of the I/O bus. Utilization of the I/O bus is not only affected by the data sent over it, but also by the DMA descriptors required to describe that data; offload reduces both. In addition, data that is typically DMA'ed over the I/O bus in the conventional case is not transferred here, for example TCP/IP and Ethernet headers.

Improving memory reference behavior. We believe offload will not only increase available cycles to the application but improve application memory reference behavior. By reducing cache and TLB pollution, cache hit rates and CPI will improve, increasing application performance.
`
3.2 Current Adapter Designs

Perhaps the simplest way to understand an architecture that offloads all TCP/IP processing is to outline the ways in which offload differs from conventional adapters in the way it interacts with the OS. Figure 1 illustrates a conventional protocol architecture in an operating system. Operating systems tend to communicate with conventional adapters only in terms of data transfer by providing them with two queues of buffers. One queue is made up of ready-made packets for transmission; the other is a queue of empty buffers to use for packet reception. Each queue of buffers is identified, in turn, by a descriptor table that describes the size and location of each buffer in the queue. Buffers are typically described in physical memory and must be pinned to ensure that they are accessible to the card, i.e., so that they are not paged out. The adapter provides a memory-mapped I/O interface for telling the adapter where the descriptor tables are located in physical memory, and provides an interface for some control information, such as what interrupt number to raise when a packet arrives. Communication between the host CPU and the adapter tends to be in one of three forms, as is shown in Figure 1: DMAs of buffers and descriptors to and from the adapter; reads and writes of control information to and from the adapter; and interrupts generated by the adapter.
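For contrast with the offload interface of Section 3.3, the following sketch (ours, loosely modeled on common NICs rather than any specific device) shows the kind of transmit descriptor ring a conventional adapter consumes: each entry names a pinned, physically addressed buffer holding a fully formed frame.

    #include <stdint.h>

    #define TX_RING_SIZE 256

    /* One entry in the transmit descriptor table: it points at a pinned
     * buffer (by physical address) containing a complete Ethernet frame. */
    struct tx_desc {
        uint64_t buf_phys;   /* physical address of the frame buffer */
        uint16_t length;     /* frame length in bytes */
        uint8_t  flags;      /* e.g. "end of packet", "request interrupt" */
        uint8_t  status;     /* written back by the NIC when the DMA completes */
    };

    /* The host tells the adapter where this table lives (via memory-mapped
     * I/O) and then produces entries at 'tail' while the NIC consumes at
     * 'head'; the receive side uses a second ring of empty buffers. */
    struct tx_ring {
        struct tx_desc desc[TX_RING_SIZE];
        uint32_t head;       /* next descriptor the NIC will fetch */
        uint32_t tail;       /* next descriptor the host will fill */
    };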
`
3.3 Offloaded Adapter Design

An architecture that seeks to offload the full TCP/IP stack has both similarities and differences in the way it interacts with the adapter. Figure 2 illustrates our offload architecture. As in the conventional scenario, queues of buffers and descriptor tables are passed between the host CPU and the adapter, and DMAs, reads, writes, and interrupts are used to communicate. In the offload architecture, however, the host and the adapter communicate using a higher level of abstraction. Buffers have more explicit data structures imposed on them that indicate both control and data interfaces. As with a conventional adapter, passed buffers must be expressed as physical addresses and must be in pinned memory. The control interface allows for the host to command the adapter (e.g., what port numbers to listen on) and for the adapter to instruct the host (e.g., to notify the host of the arrival of a new connection). The control interface is invoked, for example, by conventional socket functions that control connections: socket(), bind(), listen(), connect(), accept(),
`

setsockopt(), etc. The data interface provides a way to transfer data on established connections for both sending and receiving and is invoked by socket functions such as send(), sendfile(), write(), writev(), read(), readv(), etc. Even the data interface is at a higher layer of abstraction, since the passed buffers consist of application-specific data rather than fully-formed Ethernet frames with TCP/IP headers attached. In addition, these buffers need to identify which connection the data is for. Buffers containing data can be in units much larger than the packet MTU size. While conceptually they could be of any size, in practice they are unlikely to be larger than a VM page size.
As with a conventional adapter, the interface to the offload adapter need not be synchronous. The host OS can queue requests to the adapter, continue doing other processing, and then receive a notification (perhaps in the form of an interrupt) that the operation is complete. The host can implement synchronous socket operations by using the asynchronous interface and then block the application until the results are returned from the adapter. We believe asynchronous operation is key in order to ameliorate and amortize fixed overheads. Asynchrony allows larger-scale batching and enables other optimizations such as polling-based approaches to servers [3, 28].

The offload interface allows supporting conventional user-level APIs, such as the socket interface, as well as newer APIs that allow more direct access to user memory such as DAFS, SDP, and RDMA. In addition, offload allows performing zero-copy sends and receives without changes to the socket API. The term zero-copy refers to the elimination of memory-to-memory copies by the host. Even in the zero-copy case, data is still transferred across the I/O bus by the adapter via DMA.
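As a sketch of the synchronous-over-asynchronous layering described above (the primitives named here are hypothetical placeholders for OSLayer/IOLib mechanisms, not the prototype's real API), a blocking send() simply posts an asynchronous request to the adapter and then blocks only the calling application until the matching completion arrives.

    #include <stddef.h>
    #include <stdint.h>
    #include <sys/types.h>

    struct ol_request {
        uint32_t conn_id;
        const void *buf;    /* pinned buffer handed to the adapter */
        size_t len;
        uint32_t cookie;    /* matches the eventual completion event */
    };

    /* Hypothetical primitives assumed to be provided by the host layer. */
    extern uint32_t ol_post_send(struct ol_request *req);  /* non-blocking */
    extern ssize_t  ol_wait_completion(uint32_t cookie);   /* blocks caller */

    ssize_t offload_send_blocking(uint32_t conn_id, const void *buf, size_t len)
    {
        struct ol_request req = { .conn_id = conn_id, .buf = buf, .len = len };

        /* Queue the request to the adapter and return immediately... */
        req.cookie = ol_post_send(&req);

        /* ...then block this application only; the host keeps processing
         * other work and is woken when the adapter posts the completion. */
        return ol_wait_completion(req.cookie);
    }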
For example, in the case of a send using a conventional adapter, the host typically copies the data from user space into a pinned kernel buffer, which is then queued to the adapter for transmission. With an intelligent adapter, the host can block the user application and pin its buffers, then invoke the adapter to DMA the data directly from the user application buffer. This is similar to previous "single-copy" approaches [13, 20], except that the transfer across the bus is done by the adapter DMA and not via an explicit copy by the host CPU.
Observe from Figure 2 that the interaction between the host and the adapter now occurs between the socket and TCP layers. A naive implementation may make unnecessary transfers across the PCI bus for achieving socket functionality. For example, accept() would now cause a bus crossing in addition to a kernel crossing, as could setsockopt() for actions such as changing the send or receive buffer sizes or the Nagle algorithm. However, each of these costs can be amortized via batching multiple requests into a single request that crosses the bus. For example, multiple arrived connections can be aggregated into a single accept() crossing which then translates into multiple accept() system calls. On the other hand, certain events that would generate bus crossings with a conventional adapter might not do so with an offload adapter, such as ACK processing and generation. The relative weight of these advantages and disadvantages depends on the implementation and workload of the application using the adapter.
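A small sketch of the accept-batching idea (the data structures are hypothetical, not the prototype's): the adapter reports newly established connections in groups, and the host satisfies successive accept() calls from the current batch without touching the bus until the batch runs dry.

    #include <stdint.h>

    #define OL_ACCEPT_BATCH 32

    /* One DMA'ed notification describing up to OL_ACCEPT_BATCH connections. */
    struct ol_accept_batch {
        uint32_t count;
        uint32_t conn_ids[OL_ACCEPT_BATCH];
    };

    struct accept_queue {
        struct ol_accept_batch batch;  /* most recent batch from the adapter */
        uint32_t next;                 /* next unconsumed entry */
    };

    /* Called once per application-level accept(); only when the batch is
     * exhausted does the host go back to the adapter (a bus crossing). */
    int32_t offload_accept(struct accept_queue *q)
    {
        if (q->next >= q->batch.count)
            return -1;   /* batch empty: caller must fetch a new one */
        return (int32_t)q->batch.conn_ids[q->next++];
    }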
`
4 System Implementation

To evaluate our design and the impact of design decisions, we implemented a software prototype. Our decision to implement the prototype purely in software, rather than building or modifying actual adapter hardware, was motivated by several factors. Since our goal is to study not just performance, but scalability, we ultimately intend to model different hardware characteristics, for both the host and adapter, using a cycle-accurate hardware simulator. Limiting our analysis to only currently available hardware would hinder our evaluation for future hardware generations. Ultimately, we envision an adapter with a general-purpose processor, in addition to specialized hardware to accelerate specific operations such as checksum calculation. Our prototype software is intended to serve as a reference implementation for a production adapter.
Our prototype is composed of three main components:

• OSLayer, an operating system layer that provides the socket interface to applications and maps it to the descriptor interface shared with the adapter;

• Event-driven TCP, our offloaded TCP implementation;

• IOLib, a library that encapsulates interaction between OSLayer and Event-driven TCP.
`
At the moment, OSLayer is implemented as a library that is statically linked with the application. Ultimately, it will be decomposed into two components: a library linked with applications and a component built in the kernel. Event-driven TCP currently runs as a user-level process that accesses the actual network via a raw socket. It will eventually become the main software loop on the adapter. The IOLib implementation currently communicates via TCP sockets, but the design allows for implementations that communicate over a PCI bus or other interconnects such as Infiniband. This provides a vehicle for experimentation and analysis and allows us to measure bus traffic witho
