Doug Freimuth, Elbert Hu, Jason LaVoie, Ronald Mraz,
Erich Nahum, Prashant Pradhan, John Tracey

IBM T. J. Watson Research Center
Hawthorne, NY 10532
{dmfreim, elbert, lavoie, mraz, nahum, ppradhan, traceyj}@us.ibm.com

Abstract

Server network performance is increasingly dominated by poorly scaling operations such as I/O bus crossings, cache misses and interrupts. Their overhead prevents performance from scaling even with increased CPU, link or I/O bus bandwidths. These operations can be reduced by redesigning the host/adapter interface to exploit additional processing on the adapter. Offloading processing to the adapter is beneficial not only because it allows more cycles to be applied but also because of the changes it enables in the host/adapter interface. As opposed to other approaches such as RDMA, TCP offload provides benefits without requiring changes to either the transport protocol or API.

We have designed a new host/adapter interface that exploits offloaded processing to reduce poorly scaling operations. We have implemented a prototype of the design including both host and adapter software components. Experimental evaluation with simple network benchmarks indicates our design significantly reduces I/O bus crossings and holds promise to reduce other poorly scaling operations as well.

1 Introduction

Server network throughput is not scaling with CPU speeds. Various studies have reported CPU scaling factors of 43% [23], 60% [15], and 33% to 68% [22], which fall short of an ideal scaling of 100%. In this paper, we show that even increasing CPU speeds and link and bus bandwidths does not generate a commensurate increase in server network throughput. This lack of scalability points to an increasing tendency for server network throughput to become the key bottleneck limiting system performance. It motivates the need for an alternative design with better scalability.

Server network scalability is limited by operations heavily used in current designs that themselves do not scale well, most notably bus crossings, cache misses and interrupts. Any significant improvement in scalability must reduce these operations. Given that the problem is one of scalability and not simply performance, it will not be solved by faster processors. Faster processors merely expend more cycles on poorly scaling operations.

Research in server network performance over the years has yielded significant improvements including: integrated checksum and copy, checksum offload, copy avoidance, interrupt coalescing, fast path protocol processing, efficient state lookup, efficient timer management and segmentation offload, aka large send. Another technique, full TCP offload, has been pursued for many years. Work on offload has generated both promising and less than compelling results [1, 38, 40, 42]. Good performance data and analysis on offload is scarce.

Many improvements in server scalability were described more than fifteen years ago by Clark et al. [9]. The authors demonstrated that the overhead incurred by network protocol processing, per se, is small compared to both per-byte (memory access) costs and operating system overhead, such as buffer and timer management. This motivated work to reduce or eliminate data touching operations, such as copies, and to improve the efficiency of operating system services heavily used by the network stack. Later work [19] showed that the overhead of non-data touching operations is, in fact, significant for real workloads, which tend to feature a preponderance of small messages. Today, per-byte overhead has been greatly reduced through checksum offload and zero-copy send. This leaves per-packet overhead, operating system services and zero-copy receive as the main remaining areas for further improvement.

Nearly all of the enhancements described by Clark et al. have seen widespread adoption. The one notable exception is "an efficient network interface." This is a network adapter with a fast general-purpose processor that provides a much more efficient interface to the network than the current frame-based interface devised decades ago.

In this paper, we describe an effort to develop a much more efficient network interface and to make this enhancement a reality as well. Our work is pursued in the context of TCP for three reasons: 1) TCP's enormous installed base, 2) the expectation that the methodology employed with TCP will transfer to other protocols, and 3) the expectation that key new architectural features, such as zero-copy receive, will ultimately demonstrate their viability with TCP.

The work described here is part of a larger effort to improve server network scalability. We began by analyzing server network performance and recognizing, as others have, a significant scalability problem. Next, we identified specific operations to be the cause, specifically: bus crossings, cache misses, and interrupts. We formulated a design that reduces the impact of these operations. This design exploits additional processing at the network adapter, i.e. offload, to improve the efficiency of the host/adapter interface, which is our primary focus. We have implemented a prototype of the new design, which consists of host and adapter software components, and have analyzed the impact of the new design on bus crossings. Our findings indicate that offload can substantially decrease bus crossings and holds promise to reduce other scalability limiting operations such as cache misses. Ultimately, we intend to evaluate the design in a cycle-accurate hardware simulator. This will allow us to comprehensively quantify the impact of design alternatives on cache misses, interrupts and overall performance over several generations of hardware.

This paper is organized as follows. Section 2 provides motivation and background. Section 3 presents our design, and the current prototype implementation is described in Section 4. Section 5 presents our experimental infrastructure and results. Section 6 surveys and contrasts related work, and Section 7 summarizes our contributions and plans for future work.

2 Motivation and Background

To provide the proper motivation and background for our work, we first describe the current best practices of techniques and optimizations for network server performance. Using industry standard benchmarks, we then show that, despite these practices, servers are still not scaling with CPU speeds. Since TCP offload has been a controversial topic in the research community, we review the critiques of offload, providing counterarguments to each point. How TCP offload addresses these scaling issues is described in more detail in Section 3.

2.1 Current Best Practices

Current high-performance servers have adopted many techniques to maximize performance. We provide a brief overview of them here.

Sendfile with zero copy. Most operating systems have a sendfile or transmitfile operation that allows sending a file over a socket without copying the contents of the file into user space. This can have substantial performance benefits [30]. However, the benefits are limited to send-side processing: it does not affect receive-side processing. In addition, it requires the server application to maintain its data in the kernel, which may not be feasible for systems such as application servers, which generate content dynamically.

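As a concrete illustration (ours, not drawn from the paper), the sketch below sends an already-opened file over a connected socket using Linux's sendfile(2); the helper name is hypothetical and error handling is abbreviated.

    /* Minimal sketch: transmit a whole file over a connected TCP socket with
     * sendfile(2), letting the kernel move the data without copying it into
     * user space. */
    #include <sys/sendfile.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>

    static int send_whole_file(int sock_fd, const char *path)
    {
        int file_fd = open(path, O_RDONLY);
        if (file_fd < 0)
            return -1;

        struct stat st;
        if (fstat(file_fd, &st) < 0) {
            close(file_fd);
            return -1;
        }

        off_t offset = 0;
        while (offset < st.st_size) {
            ssize_t sent = sendfile(sock_fd, file_fd, &offset,
                                    st.st_size - offset);
            if (sent <= 0)
                break;              /* error or connection closed */
        }

        close(file_fd);
        return offset == st.st_size ? 0 : -1;
    }
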
Checksum offload. Researchers have shown that calculating the IP checksum over the body of the data can be expensive [19]. Most high-performance adapters have the ability to perform the IP checksum over both the contents of the data and the TCP/IP headers. This removes an expensive data-touching operation on both send and receive. However, adapter-level checksums will not catch errors introduced by transferring data over the I/O bus, which has led some to advocate caution with checksum offload [41].

Interrupt coalescing. Researchers have shown that interrupts are costly, and generating an interrupt for each packet arrival can severely throttle a system [28]. In response, adapter vendors have added the ability to delay interrupts by a certain amount of time or number of packets in an effort to batch packets per interrupt and amortize the costs [14]. While effective, it can be difficult to determine the proper trigger thresholds for firing interrupts, and large amounts of batching may cause unacceptable latency for an individual connection.

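The trade-off is visible in the trigger logic itself. The sketch below is a hypothetical coalescing policy (ours, not any particular vendor's scheme): an interrupt is raised only once enough packets have accumulated or the oldest unreported packet has waited long enough.

    /* Illustrative interrupt-coalescing trigger: batch packet notifications
     * until either a packet-count or a delay threshold is reached. The
     * structure and thresholds are hypothetical. */
    #include <stdbool.h>
    #include <stdint.h>

    struct coalesce_state {
        uint32_t pending_pkts;     /* packets received but not yet reported */
        uint64_t first_pending_us; /* arrival time of the oldest unreported one */
        uint32_t max_pkts;         /* fire after this many packets ...        */
        uint64_t max_delay_us;     /* ... or after this much accumulated delay */
    };

    static bool should_interrupt(const struct coalesce_state *s, uint64_t now_us)
    {
        if (s->pending_pkts == 0)
            return false;
        return s->pending_pkts >= s->max_pkts ||
               now_us - s->first_pending_us >= s->max_delay_us;
    }

Raising max_pkts or max_delay_us amortizes interrupt cost over more packets but adds latency to the first packet in each batch, which is exactly the tuning difficulty noted above.
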
Large send/segmentation offload. TCP/IP implementers have long known that larger MTU sizes provide greater efficiency, both in terms of network utilization (fewer headers per byte transferred) and in terms of host CPU utilization (fewer per-packet operations incurred per byte sent or received). Unfortunately, larger MTU sizes are not usually available due to Ethernet's 1516 byte frame size. Gigabit Ethernet provides "jumbo frames" of 9 KB, but these are only useful in specialized local environments and cannot be preserved across the wide-area Internet. As an approximation, certain operating systems, such as AIX and Linux, provide large send or TCP segmentation offload (TSO), where the TCP/IP stack interacts with the network device as if it had a large MTU size. The device in turn segments the larger buffers into 1516-byte Ethernet frames and adjusts the TCP sequence numbers and checksums accordingly. However, this technique is also limited to send-side processing. In addition, as we demonstrate in Section 2.2, the technique is limited by the way TCP performs congestion control.

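The per-segment bookkeeping the device performs is simple; the sketch below is our illustration (hypothetical types, with header construction and checksum updates elided) of carving one large send into MSS-sized TCP segments with advancing sequence numbers.

    /* Illustrative TSO-style segmentation: split one large send into MSS-sized
     * segments, advancing the sequence number by the bytes carried in each. */
    #include <stdint.h>
    #include <stddef.h>

    struct segment {
        uint32_t seq;        /* TCP sequence number of the first payload byte */
        const uint8_t *data; /* start of this segment's payload */
        size_t len;          /* payload bytes in this segment */
    };

    static size_t segment_large_send(const uint8_t *buf, size_t total_len,
                                     uint32_t start_seq, size_t mss,
                                     struct segment *out, size_t max_segs)
    {
        size_t nsegs = 0, off = 0;

        while (off < total_len && nsegs < max_segs) {
            size_t chunk = total_len - off < mss ? total_len - off : mss;
            out[nsegs].seq  = start_seq + (uint32_t)off; /* wraps mod 2^32, as TCP does */
            out[nsegs].data = buf + off;
            out[nsegs].len  = chunk;
            off += chunk;
            nsegs++;
        }
        return nsegs;   /* number of segments produced */
    }
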
Efficient connection management. Early networked servers did not handle large numbers of TCP connections efficiently, for example by using a linear linked list to manage state [26]. This led to operating systems using hash table based approaches [24] and separating table entries in the TIME_WAIT state [2].

Asynchronous interfaces. To maximize concurrency, high-performance servers use asynchronous interfaces so as not to block on long-latency operations [33]. Server applications interact using an event notification interface such as select() or poll(), which in turn can have performance implications [5]. Unfortunately, these interfaces are typically only for network I/O and not file I/O, so they are not as general as they could be.

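As a concrete illustration (not drawn from the paper), the sketch below shows the kind of single-threaded poll() loop such servers build around these interfaces; a real server would track per-connection state and also watch for writability.

    /* Minimal sketch of an event-driven server loop using poll(): one thread
     * multiplexes a listening socket and its accepted connections. */
    #include <poll.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #define MAX_FDS 1024

    void event_loop(int listen_fd)
    {
        struct pollfd fds[MAX_FDS];
        int nfds = 1;
        char buf[4096];

        fds[0].fd = listen_fd;
        fds[0].events = POLLIN;

        for (;;) {
            if (poll(fds, nfds, -1) < 0)      /* block until something is ready */
                continue;

            int cur = nfds;                   /* do not scan entries added below */
            for (int i = 0; i < cur; i++) {
                if (fds[i].revents == 0)
                    continue;

                if (fds[i].fd == listen_fd) {
                    /* New connection: add it to the polled set. */
                    int conn = accept(listen_fd, NULL, NULL);
                    if (conn >= 0 && nfds < MAX_FDS) {
                        fds[nfds].fd = conn;
                        fds[nfds].events = POLLIN;
                        fds[nfds].revents = 0;
                        nfds++;
                    } else if (conn >= 0) {
                        close(conn);          /* table full */
                    }
                } else {
                    /* Data (or EOF/error) on an existing connection. */
                    ssize_t n = read(fds[i].fd, buf, sizeof(buf));
                    if (n <= 0) {
                        close(fds[i].fd);
                        fds[i].fd = -1;       /* poll() ignores negative fds; a
                                                 real server would reclaim slots */
                    }
                    /* else: parse the request and queue a response */
                }
            }
        }
    }
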
Machine              BIOS Release Date    Cycle Time (ns)
Workstation-Class
  500 MHz P3         Jul 2000             2.000
  933 MHz P3         Mar 2001             1.070
  1.7 GHz P4         Sep 2003             0.590
Server-Class
  450 MHz P2-Xeon    Jan 2000             2.222
  1.6 GHz P4-Xeon    Oct 2001             0.625
  3.2 GHz P4-Xeon    May 2004             0.313

Table 1: Properties for Multiple Generations of Machines

In-kernel implementations. Context switches, data copies, and system calls can be avoided altogether by implementing the server completely in kernel space [17, 18]. While this provides the best performance, in-kernel implementations are difficult to implement and maintain, and the approach is hard to generalize across multiple applications.

RDMA. Others have also noticed these scaling problems, particularly with respect to data copying, and have offered RDMA as a solution. Interest in RDMA and Infiniband [4] is growing in the local-area case, such as in storage networks or cluster-based supercomputing. However, RDMA requires modifications to both sides of a conversation, whereas offload can be deployed incrementally on the server side only. Our interest is in supporting existing applications in an inter-operable way, which precludes using RDMA.

While effective, these optimizations are limited in that they do not address the full range of scenarios seen by a server. The main restrictions are: 1) they do not apply to the receive side, 2) they are not fully asynchronous in the way they interact with the operating system, 3) they do not minimize the interaction with the network interface, or 4) they are not inter-operable. Additionally, many techniques do not address what we believe to be the fundamental performance issue, which is overall server scalability.

2.2 Server Scalability

The recent arrival of 10 gigabit Ethernet and the promise of 40 and 100 gigabit Ethernet in the near future show that raw network bandwidth is scaling at least as quickly as CPU speed. However, it is well known that memory speeds are not scaling as quickly as CPU speed increases [10]. As a consequence of this and other factors, researchers have observed that the performance of host TCP/IP implementations is not scaling at the same rate as CPU speeds in spite of raw network bandwidth increases.

To quantify how performance scales over time, we ran a number of experiments using several generations of machines, described in detail in Table 1. We break the machines into 2 classes: desk-side workstations and rack-mounted servers with aggressive memory systems and I/O busses. The workstations include a 500 MHz Intel Pentium 3, a 933 MHz Intel Pentium 3, and a 1.7 GHz Pentium 4. The servers include a 450 MHz Pentium II Xeon, a 1.6 GHz P4 Xeon, and a 3.2 GHz P4 Xeon. In addition, each of the P4-Xeon servers has 1 MB L3 caches. Each machine runs Linux 2.6.9 and has a number of Intel E1000 MT server gigabit Ethernet adapters, connected via a Dell gigabit switch. Load is generated by five 3.2 GHz P4-Xeons acting as clients, each using an E1000 client gigabit adapter and running Linux 2.6.5. We chose the E1000 MT adapters for the servers since these have been shown to be one of the highest-performing conventional adapters on the market [32], and we did not have access to a 10 gigabit adapter.

We measured the time to access various locations in the memory hierarchy for these machines, including from the L1 and L2 caches, main memory, and the memory-mapped I/O registers on the E1000. Memory hierarchy times were measured using LMBench [25]. To measure the device I/O register times, we added some modifications to the initialization routine of the Linux 2.6.9 E1000 device driver code. Table 2 presents the results. Note that while L1 and L2 access times remain relatively consistent in terms of processor cycles, the time to access main memory and the device registers is increasing over time. If access times were improving at the same rate as CPU speeds, the number of clock cycles would remain constant.

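As an illustration of the kind of instrumentation involved (our sketch, not the authors' actual driver patch), a memory-mapped register read can be bracketed with cycle-counter reads inside the driver's init path; the register offset and the surrounding kernel context are assumptions.

    /* Sketch of timing a single memory-mapped I/O register read from inside a
     * driver's initialization path. get_cycles() reads the CPU cycle counter;
     * hw_addr is assumed to be the ioremap()'ed base of the device registers. */
    #include <linux/types.h>   /* u8, u32 */
    #include <linux/timex.h>   /* get_cycles(), cycles_t */
    #include <linux/kernel.h>  /* printk() */
    #include <asm/io.h>        /* readl() */

    static void time_register_read(u8 __iomem *hw_addr)
    {
        cycles_t before, after;
        u32 val;

        before = get_cycles();
        val = readl(hw_addr + 0x8);   /* offset of a device status register (assumed) */
        after = get_cycles();

        printk(KERN_INFO "MMIO read took %llu cycles (value 0x%x)\n",
               (unsigned long long)(after - before), val);
    }
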
To see how actual server performance is scaling over time, we ran the static portion of SPECweb99 [12] using a recent version of Flash [33, 37].

[Table 2: Memory Access Times for Multiple Generations of Machines: read latency, in ns and in clock cycles, for the L1 cache, L2 cache, main memory, and E1000 I/O registers on each workstation-class and server-class machine.]

In these experiments, Flash exploits all the available performance optimizations on Linux, including sendfile() with zero copy, TSO, and checksum offload on the E1000. Table 3 shows the results. Observe that server performance is not scaling with CPU speed, even though this is a heavily optimized server making use of all current best practices. This is not because of limitations in the network bandwidth; for example, the 3.2 GHz Xeon-based machine has 4 gigabit interfaces and multiple 10 gigabit PCI-X busses.

2.3 Offload: Critiques and Responses

In this paper, we study TCP offload as a solution to the scalability problem. However, TCP offload has been hotly debated by the research community, perhaps best exemplified by Mogul's paper, "TCP offload is a dumb idea whose time has come" [27]. That paper effectively summarizes the criticisms of TCP offload, and so we use the structure of that paper to offer our counterarguments here.

Limited processing requirements: One argument is that Clark et al. [9] show that the main issue in TCP performance is implementation, not the TCP protocol itself, and a major factor is data movement; thus offload does not address the real problem. We point out that offload does not simply mean TCP header processing; it includes the entire TCP/IP stack, including poorly scaling, performance-critical components such as data movement, bus crossings, interrupts, and device interaction. Offload provides an improved interface to the adapter that reduces the use of these scalability-limiting operations.

Moore's Law: Moore's Law states that CPU speeds are doubling every 18 months, and thus one claim is that offload cannot compete with general-purpose CPUs. Historically, chips used by adapter vendors have not increased at the same rate as general-purpose CPUs due to the economics of scale. However, offload can use commodity CPUs with software implementations, which we believe is the proper approach. In addition, speed needs only to be matched with the interface (e.g., 10 gigabit Ethernet), and we argue proper design reduces the code path relative to the non-offloaded case (e.g., with fewer memory copies). Sarkar et al. [38] and Ang [1] show that when the NIC CPU is under-provisioned with respect to the host CPU, performance can actually degrade. Clearly the NIC processing capacity must be sized properly. Finally, increasing CPU speeds does not address the scalability issue, which is what we focus on here.

Efficient host interface: Early critiques are that TCP Offload Engine (TOE) vendors recreated "TCP over a bus". Development of an elegant and efficient host/adapter interface for offload is a fundamental research challenge, one we are addressing in this paper.

Bad buffer management: Unless offload engines understand higher-level protocols, there is still an application-layer header copy. While true, copying of application headers is not as performance-critical as copying application data. One complication is the application combining its own headers on the same connection with its data. This can only be solved by changing the application, which is already proposed in RDMA extensions for NFS and iSCSI [7, 8].

Connection management overhead: Unlike conventional NICs, offload adapters must maintain per-connection state. Opponents argue that offload cannot handle large numbers of connections, but Web server workloads have forced host TCP stacks to discover techniques to efficiently manage 10,000s of connections. These techniques are equally applicable for an interface-based implementation.

Resource management overhead: Critics argue that tracking resource management is "more difficult" for offload. We do not believe this is the case. It is straightforward to extend the notion of resource management across the interface without making the adapter aware of every process, as we will show in Sections 3 and 4.

Machine             Requested     Conforming    Scale        Scale     Ratio   Throughput
                    Connections   Connections   (achieved)   (ideal)   (%)     (ops/sec)
Workstation-Class
  500 MHz P3            375           375          1.00        1.00     100       1235
  933 MHz P3            400           399          1.06        1.87      56
  1.7 GHz P4           1200          1169          3.20        3.40      94       3457
Server-Class
  450 MHz P2-Xeon       700           699          1.00        1.00     100       2236
  1.6 GHz P4-Xeon      2800          2792          4.00        3.56     112       8893
  3.2 GHz P4-Xeon      3500          3490          5.00        7.11      70

Table 3: SPECweb99 Performance Scalability over Multiple Generations of Machines

Event management: The claim is that offload does not address managing the large numbers of events that occur in high-volume servers. It is true that offload, per se, does not address application-visible events, which are better addressed by the API. However, offload can shield the host operating system from spurious and unnecessary adapter events, such as TCP acknowledgments or window advertisements. In addition, it allows batching of other events to amortize the cost of interrupts and bus crossings.

Partial offload is sufficiently effective: Partial offload approaches include checksum offload and large send (or TCP Segmentation Offload), as discussed in Section 2.1. While useful, they have limited value and do not fully solve the scalability problem, as was shown in Section 2.2. Other arguments include that checksum offload actually masks errors to the host [41]. In contrast, offload allows larger batching and the opportunity to perform more rigorous error checking (by including the CRC in the data descriptors).

Maintainability: Opponents argue that offload-based approaches are more difficult to update and maintain in the presence of security and bug patches. While this is true of an ASIC-based approach, it is not true of a software-based approach using general-purpose hardware.

Quality assurance: The argument here is that offload is harder to test to determine bugs. However, testing tools such as TBIT [31] and ANVL [15] allow remote testing of the offload interface. In addition, software-based approaches based on open-source TCP implementations such as Linux or FreeBSD facilitate both maintainability and quality assurance.

System management interface: Opponents claim that offload adapters cannot have the same management interface as the host OS. This is incorrect; one example is SNMP. It is trivial to extend this to an offload adapter.

Concerns about NIC vendors: Third-party vendors may go out of business and strand the customer. This has nothing to do with offload; it is true of any I/O device: disk, NIC, or graphics card. Economic incentives seem to address customer needs. In addition, one of the largest NIC vendors is Intel.

3 System Design

In this section we describe our offload design and how it addresses scalability.

3.1 How Offload Addresses Scalability

A higher-level interface. Offload allows the host operating system to interact with the device at a higher level of abstraction. Rather than simply queuing MTU-sized packets for transmission or reception, the host issues commands at the transport layer (e.g., connect(), accept(), send(), close()). This allows the adapter to shield the host from transport layer events (and their attendant interrupt costs) that may be of no interest to the host, such as arrivals of TCP acknowledgments or window updates. Instead, the host is only notified of meaningful events. Examples include a completed connection establishment or termination (rather than every packet arrival for the 3-way handshake or 4-way teardown) or application-level data units. Sufficient intelligence on the adapter can determine the appropriate time to transfer data to the host, either through knowledge of standardized higher-level protocols (such as HTTP or NFS) or through a programmable interface that can provide an application signature (i.e., an application-level equivalent to a packet filter). By interacting at this higher level of abstraction, the host will transfer less data over the bus and incur fewer interrupts and device register accesses.

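To make the flavor of such an interface concrete, the sketch below shows how transport-level commands from the host and the correspondingly coarse-grained events from the adapter might be encoded; this is our illustration, and all names and fields are hypothetical rather than the design's actual descriptor format.

    /* Illustrative host/adapter descriptors for an offloaded transport
     * interface: the host posts commands at socket granularity and the
     * adapter reports back only coarse-grained, meaningful events. */
    #include <stdint.h>

    enum ofl_command {
        OFL_CMD_LISTEN,     /* start listening on a port */
        OFL_CMD_CONNECT,    /* open an outbound connection */
        OFL_CMD_SEND,       /* transmit a buffer on a connection */
        OFL_CMD_RECV_POST,  /* post a receive buffer for a connection */
        OFL_CMD_CLOSE       /* terminate a connection */
    };

    enum ofl_event {
        OFL_EVT_CONN_ESTABLISHED,  /* 3-way handshake finished on the adapter */
        OFL_EVT_CONN_CLOSED,       /* teardown finished on the adapter */
        OFL_EVT_DATA_ARRIVED,      /* an application-level data unit is ready */
        OFL_EVT_SEND_COMPLETE      /* a posted send buffer may be reclaimed */
    };

    struct ofl_command_desc {
        uint32_t command;          /* enum ofl_command */
        uint32_t conn_id;          /* adapter-assigned connection identifier */
        uint64_t buf_phys_addr;    /* pinned, physically addressed buffer */
        uint32_t buf_len;
        uint32_t flags;
    };

    struct ofl_event_desc {
        uint32_t event;            /* enum ofl_event */
        uint32_t conn_id;
        uint64_t buf_phys_addr;    /* buffer the event refers to, if any */
        uint32_t byte_count;
        uint32_t status;
    };

A real interface would carry more state (buffer accounting, error codes, and so on), but the essential point is that descriptors describe socket-level operations rather than individual Ethernet frames.
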
[Figure 1: Conventional Protocol Stack. The application and socket layer sit above TCP, IP, Ethernet, and the device driver on the host CPU; the NIC on the PCI card is reached over the PCI bus via DMA, PIO, and interrupts.]

[Figure 2: Offload Architecture. The application and socket layer remain on the host CPU, while TCP, IP, and Ethernet processing move to the NIC on the PCI card; host and NIC communicate over the PCI bus via DMA, PIO, and interrupts.]

Ability to move data in larger sizes. As described in Section 2.1, the ability to use large MTUs has a significant impact on performance for both sending and receiving data. Large send/TSO only approximates this optimization, and only for the send side. In contrast, offload allows the host to send and receive data in large chunks unaffected by the underlying MTU size. This reduces use of poorly scaling components by making more efficient use of the I/O bus. Utilization of the I/O bus is not only affected by the data sent over it, but also by the DMA descriptors required to describe that data; offload reduces both. In addition, data that is typically DMA'ed over the I/O bus in the conventional case is not transferred here, for example TCP/IP and Ethernet headers.

Improving memory reference behavior. We believe offload will not only increase available cycles to the application but improve application memory reference behavior. By reducing cache and TLB pollution, cache hit rates and CPI will improve, increasing application performance.

3.2 Current Adapter Designs

Perhaps the simplest way to understand an architecture that offloads all TCP/IP processing is to outline the ways in which offload differs from conventional adapters in the way it interacts with the OS. Figure 1 illustrates a conventional protocol architecture in an operating system. Operating systems tend to communicate with conventional adapters only in terms of data transfer by providing them with two queues of buffers. One queue is made up of ready-made packets for transmission; the other is a queue of empty buffers to use for packet reception. Each queue of buffers is identified, in turn, by a descriptor table that describes the size and location of each buffer in the queue. Buffers are typically described in physical memory and must be pinned to ensure that they are accessible to the card, i.e., so that they are not paged out. The adapter provides a memory-mapped I/O interface for telling the adapter where the descriptor tables are located in physical memory, and provides an interface for some control information, such as what interrupt number to raise when a packet arrives. Communication between the host CPU and the adapter tends to be in one of three forms, as is shown in Figure 1: DMAs of buffers and descriptors to and from the adapter; reads and writes of control information to and from the adapter; and interrupts generated by the adapter.

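The sketch below illustrates the kind of per-frame descriptor and descriptor ring the host hands to such an adapter; the field names are hypothetical, though the layout is typical of conventional NICs.

    /* Illustrative descriptor layout for a conventional frame-based NIC:
     * one descriptor per Ethernet frame, with separate rings for transmit
     * (ready-made packets) and receive (empty buffers). */
    #include <stdint.h>

    struct frame_desc {
        uint64_t buf_phys_addr;    /* physical address of a pinned frame buffer */
        uint16_t length;           /* frame length (TX) or buffer size (RX) */
        uint16_t status;           /* e.g., descriptor-done bit set by the NIC */
        uint32_t reserved;
    };

    struct desc_ring {
        struct frame_desc *descs;  /* array of descriptors, itself DMA-able */
        uint32_t size;             /* number of entries */
        uint32_t head;             /* next descriptor the NIC will process */
        uint32_t tail;             /* next descriptor the driver will fill */
    };

    /* The driver tells the adapter where each ring lives via memory-mapped
     * registers (base address, length, head and tail indices); the adapter
     * then DMAs frames and descriptors and raises an interrupt on completion. */
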
3.3 Offloaded Adapter Design

An architecture that seeks to offload the full TCP/IP stack has both similarities and differences in the way it interacts with the adapter. Figure 2 illustrates our offload architecture. As in the conventional scenario, queues of buffers and descriptor tables are passed between the host CPU and the adapter, and DMAs, reads, writes and interrupts are used to communicate. In the offload architecture, however, the host and the adapter communicate using a higher level of abstraction. Buffers have more explicit data structures imposed on them that indicate both control and data interfaces. As with a conventional adapter, passed buffers must be expressed as physical addresses and must be in pinned memory. The control interface allows for the host to command the adapter (e.g., what port numbers to listen on) and for the adapter to instruct the host (e.g., to notify the host of the arrival of a new connection). The control interface is invoked, for example, by conventional socket functions that control connections: socket(), bind(), listen(), connect(), accept(), setsockopt(), etc.

The data interface provides a way to transfer data on established connections for both sending and receiving and is invoked by socket functions such as send(), sendto(), write(), writev(), read(), readv(), etc. Even the data interface is at a higher layer of abstraction, since the passed buffers consist of application-specific data rather than fully-formed Ethernet frames with TCP/IP headers attached. In addition, these buffers need to identify which connection the data is for. Buffers containing data can be in units much larger than the packet MTU size. While conceptually they could be of any size, in practice they are unlikely to be larger than a VM page size.

As with a conventional adapter, the interface to the offload adapter need not be synchronous. The host OS can queue requests to the adapter, continue doing other processing, and then receive a notification (perhaps in the form of an interrupt) that the operation is complete. The host can implement synchronous socket operations by using the asynchronous interface and then blocking the application until the results are returned from the adapter. We believe asynchronous operation is key in order to ameliorate and amortize fixed overheads. Asynchrony allows larger-scale batching and enables other optimizations such as polling-based approaches to servers [3, 28].

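For example, a blocking socket call can be layered over the asynchronous interface by submitting a request and sleeping until the adapter's completion arrives; the sketch below is our illustration, with a hypothetical request structure and submit function.

    /* Illustrative blocking wrapper over an asynchronous adapter interface:
     * submit the request, then sleep until the completion path signals it. */
    #include <pthread.h>

    struct ofl_request {
        pthread_mutex_t lock;
        pthread_cond_t  done_cv;
        int             done;    /* set by the completion path */
        int             status;  /* result reported by the adapter */
    };

    /* Hypothetical: queues the request to the adapter (provided elsewhere). */
    int ofl_submit(struct ofl_request *req);

    /* Called from the adapter's completion/interrupt path. */
    void ofl_complete(struct ofl_request *req, int status)
    {
        pthread_mutex_lock(&req->lock);
        req->status = status;
        req->done = 1;
        pthread_cond_signal(&req->done_cv);
        pthread_mutex_unlock(&req->lock);
    }

    /* Synchronous call built on the asynchronous interface. */
    int ofl_call_blocking(struct ofl_request *req)
    {
        pthread_mutex_init(&req->lock, NULL);
        pthread_cond_init(&req->done_cv, NULL);
        req->done = 0;

        if (ofl_submit(req) != 0)
            return -1;

        pthread_mutex_lock(&req->lock);
        while (!req->done)
            pthread_cond_wait(&req->done_cv, &req->lock);
        pthread_mutex_unlock(&req->lock);
        return req->status;
    }
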
The offload interface allows supporting conventional user-level APIs, such as the socket interface, as well as newer APIs that allow more direct access to user memory such as DAFS, SDP, and RDMA. In addition, offload allows performing zero-copy sends and receives without changes to the socket API. The term zero-copy refers to the elimination of memory-to-memory copies by the host. Even in the zero-copy case, data is still transferred across the I/O bus by the adapter via DMA.

For example, in the case of a send using a conventional adapter, the host typically copies the data from user space into a pinned kernel buffer, which is then queued to the adapter for transmission. With an intelligent adapter, the host can block the user application and pin its buffers, then invoke the adapter to DMA the data directly from the user application buffer. This is similar to previous "single-copy" approaches [13, 20], except that the transfer across the bus is done by the adapter DMA and not via an explicit copy by the host CPU.

Observe from Figure 2 that the interaction between the host and the adapter now occurs between the socket and TCP layers. A naive implementation may make unnecessary transfers across the PCI bus for achieving socket functionality. For example, accept() would now cause a bus crossing in addition to a kernel crossing, as could setsockopt() for actions such as changing the send or receive buffer sizes or the Nagle algorithm. However, each of these costs can be amortized by batching multiple requests into a single request that crosses the bus. For example, multiple arrived connections can be aggregated into a single accept() crossing, which then translates into multiple accept() system calls. On the other hand, certain events that would generate bus crossings with a conventional adapter might not do so with an offload adapter, such as ACK processing and generation. The relative weight of these advantages and disadvantages depends on the implementation and workload of the application using the adapter.

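Concretely, one crossing might carry a batch of newly established connections that the host-side socket layer then drains across several accept() calls; the sketch below is our illustration, with hypothetical structures and names.

    /* Illustrative batching of connection arrivals: the adapter aggregates
     * several completed handshakes into one message that crosses the bus once;
     * the host then satisfies multiple accept() calls from it. */
    #include <stdint.h>

    struct new_conn_info {
        uint32_t conn_id;      /* adapter-assigned connection identifier */
        uint32_t peer_addr;    /* remote IPv4 address */
        uint16_t peer_port;    /* remote TCP port */
        uint16_t local_port;   /* listening port the connection arrived on */
    };

    struct accept_batch {
        uint32_t listen_id;            /* which listening endpoint */
        uint32_t count;                /* number of entries that follow */
        struct new_conn_info conns[];  /* 'count' established connections */
    };

    /* Host-side sketch: hand out one connection per accept() call until the
     * batch is exhausted, then fetch (or wait for) the next batch. */
    static int next_accepted_conn(struct accept_batch *batch, uint32_t *cursor)
    {
        if (*cursor >= batch->count)
            return -1;                 /* batch drained; wait for the next one */
        return (int)batch->conns[(*cursor)++].conn_id;
    }
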
4 System Implementation

To evaluate our design and the impact of design decisions, we implemented a software prototype. Our decision to implement the prototype purely in software, rather than building or modifying actual adapter hardware, was motivated by several factors. Since our goal is to study not just performance, but scalability, we ultimately intend to model different hardware characteristics, for both the host and adapter, using a cycle-accurate hardware simulator. Limiting our analysis to only currently available hardware would hinder our evaluation for future hardware generations. Ultimately, we envision an adapter with a general-purpose processor, in addition to specialized hardware to accelerate specific operations such as checksum calculation. Our prototype software is intended to serve as a reference implementation for a production adapter.

Our prototype is composed of three main components:

• OSLayer, an operating system layer that provides the socket interface to applications and maps it to the descriptor interface shared with the adapter;

• Event-driven TCP, our offloaded TCP implementation;

• IOLib, a library that encapsulates interaction between OSLayer and Event-driven TCP (sketched below).

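As an illustration of this message-passing style (our sketch; the actual IOLib interface is not shown in this excerpt, so all names here are hypothetical), each descriptor can be framed with a small header and written over whatever channel connects OSLayer and Event-driven TCP, a TCP socket today or a PCI-based transport later.

    /* Hypothetical IOLib-style framing: a descriptor travels over the
     * host/adapter channel behind a small header giving its type and length. */
    #include <stdint.h>
    #include <sys/uio.h>

    struct iolib_hdr {
        uint32_t type;    /* e.g., command, event, or data descriptor */
        uint32_t length;  /* payload bytes that follow the header */
    };

    /* Write one framed message (header plus payload) to the channel. */
    static int iolib_send(int chan_fd, uint32_t type, const void *payload,
                          uint32_t length)
    {
        struct iolib_hdr hdr = { .type = type, .length = length };
        struct iovec iov[2] = {
            { .iov_base = &hdr,            .iov_len = sizeof(hdr) },
            { .iov_base = (void *)payload, .iov_len = length }
        };
        ssize_t want = (ssize_t)(sizeof(hdr) + length);

        /* A real implementation would loop on short writes. */
        return writev(chan_fd, iov, 2) == want ? 0 : -1;
    }
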
At the moment, OSLayer is implemented as a library that is statically linked with the application. Ultimately, it will be decomposed into two components: a library linked with applications and a component built in the kernel. Event-driven TCP currently runs as a user-level process that accesses the actual network via a raw socket. It will eventually become the main software loop on the adapter. The IOLib implementation currently communicates via TCP sockets, but the design allows for implementations that communicate over a PCI bus or other interconnects such as Infiniband. This provides a vehicle for experimentation and analysis and allows us to measure bus traffic witho