`
IEEE GLOBECOM '91
`Phoenix, Arizona • December 2-5, 1991
`
`"COUNTDOWN TO THE NEW MILLENNIUM"
`Featuring a Mini-Theme on:
`Personal Communications Services (PCS)
`
`·
`
`Conference Record
`
`Vol. 3 of 3
`
VOLUME   DAY         SESSIONS   PAGES
1        Tuesday     1-20       1-694
2        Wednesday   21-42      695-1531
3        Thursday    43-60      1533-2150
`
`
IEEE
ICC
GLOBECOM
`
`Sponsored by the
IEEE Communications Society
`and the Phoenix IEEE Section
`
`
`
`
`Additional sets of Volumes 1, 2, and 3 may be ordered from:
`
`IEEE Service Center
Publications Sales Department
`445 Hoes Lane
`P.O. Box 1331
`Piscataway, New Jersey 08855-1331
`
`IEEE GLOBECOM '91
`
IEEE Catalog No.: 91CH2980-1
ISBN Numbers:     0-87942-697-7   Softbound
                  0-87942-698-5   Casebound
                  0-87942-699-3   Microfiche
Library of Congress No.: 87-640337 (Serial)
`
`
COPYRIGHT AND REPRINT PERMISSIONS

Abstracting is permitted with credit to the source. Libraries are permitted to photocopy beyond the limits of U.S. copyright law for private use of patrons those articles in this volume that carry a code at the bottom of the first page, provided the per-copy fee indicated in the code is paid through the Copyright Clearance Center, 20 Congress St., Salem, Mass. 01970. Instructors are permitted to photocopy isolated articles for noncommercial classroom use without fee. For other copying, reprint or republication permission, write to Director, Publishing Services, IEEE, 345 East 47th St., New York, NY 10017. All rights reserved. Copyright © 1991 by the Institute of Electrical and Electronics Engineers, Inc.
`
`
`
`
`
`
AN OUTBOARD PROCESSOR FOR HIGH PERFORMANCE IMPLEMENTATION OF
`TRANSPORT LAYER PROTOCOLS
`
`R. ANDREW MACLEAN and SCOTT E. BARVICK 1
`
`Bellcore, 444 Hoes Lane, Piscataway, NJ 08854
`
ABSTRACT: The high throughputs promised by emerging network technologies are often difficult to achieve application-to-application because of host transport protocol bottlenecks. This paper describes an experimental prototype implementation of an outboard protocol processor which eliminates these bottlenecks by performing transport layer functions in dedicated hardware. The architecture consists of separate transmit and receive CPUs, each with checksum and DMA circuits. Measurements made using an implementation of the TCP protocol indicate that this architecture can support end-to-end throughputs in excess of 15,000 packets/sec between UNIX hosts.
`
1. INTRODUCTION
`
MEASUREMENTS of the end-to-end performance of LANs indicate that transport and network layer protocol implementations often limit throughput between communicating applications [1]. Several solutions have been proposed, the most common of which can be categorized as follows:
1) increase the size of the transport protocol data unit (TPDU),
2) optimize the protocol software,
3) change to a more efficient protocol and
4) use a more powerful host processor.
The idea behind the first method is to increase the ratio of data bits to header bits so that the protocol processing required per data bit is reduced. Depending upon the underlying network, however, this approach may lead to increased latency in the network, or the need for fragmentation and reassembly in the lower layers.
Optimization of the communications software for throughput has been discussed for the case of TCP/IP [2]. The methodology adopted is to change the existing implementation software to reduce the protocol processing overhead. Performance comparisons of optimized and non-optimized implementations of TCP/IP found in [1] indicate that significant performance increases can be realized using such techniques.
The TCP and OSI TP4 protocols have been designed primarily for robustness and utility rather than throughput. Several 'lightweight' transport layer protocols have been proposed which offer performance improvements: NETBLT [3], XTP [4], NACK [5], SNR [7], and VMTP [6] are examples of such protocols. With respect to the hardware, typically, the data link layer (to use OSI terminology) has been handled by some kind of host network adapter and the network layer and above has been the responsibility of the host processor. Thus there are potentially two areas where processing power can be added, the host or the communications adapter.

1. Scott Barvick is now with Wellfleet Communications, 15 Crosby Drive, Bedford, MA 01730.
2. UNIX is a registered trademark of UNIX System Laboratories Inc.
`
While the use of more powerful host processors is a common solution, there is a growing trend towards increasing the front end intelligence of the I/O. There are two primary reasons for this; firstly, the I/O subsystem often demands response times which cannot be guaranteed by the host processor or would cause the host to behave inefficiently or erratically. Secondly, it is often the case that certain compute intensive functions can be performed more quickly or more cost effectively by specialized hardware than by the host processor. Several outboard processor designs have been reported [7]-[12]. Our objective for this project has been to explore the outboard approach by designing an experimental prototype processor as a platform for analyzing transport layer protocols. We call this processor the Protocol Accelerator (PA), and it is described in the sections which follow. One of our ultimate aims is to explore different high speed protocols using this processor as a platform, in order to determine the most appropriate techniques for transport of data on high speed Metropolitan Area Networks.
`
`2. PROTOCOL ACCELERATOR
`
2.1 System Configuration
The Protocol Accelerator is a board on the VME system bus. Figure 1 shows how the PA integrates into the host system. On the network side, the PA is equipped with both input and output 32 bit parallel ports, each supporting data transfer rates in excess of 320 Mbits/sec. Intentionally, no media access circuitry has been included on the PA; this provides us with the capability to connect the Protocol Accelerator to a variety of network types via appropriate adapters, or, as in the case of loop-back testing, to leave out the network circuitry completely. In future communications subsystems, the transport protocol acceleration circuitry probably would physically reside on the network adapter card.
`
FIGURE 1. System Configuration
`
`1'728
`
`49.4.1
`CH2980-119110000·1728 $1.00 © 1991 IEEE
`
`GLOBECOM '91
`
`
`
`
2.2 Functionality and Data Flow
The previously reported outboard processors can be categorized into single and multiple CPU architectures. The single CPU implementations [4, 10, 11] include peripheral support for high data throughputs. The multiprocessor implementations [7, 8, 9, 12] use up to eight CPUs, typically with no special peripheral devices. In our design, we exploit features from both of these approaches with a dual CPU design and special purpose peripheral circuitry in an architecture optimized for transport layer processing.
The internal functions and data flows of the protocol accelerator are shown in Figure 2. We use a dual CPU approach to protocol processing, with one CPU subsystem dedicated to transmission, and the other to reception. The transmit and receive CPUs are both 68020 (25 MHz) based, each with its own private resources: ROM, parallel I/O, interrupt circuitry and 128 kilobytes of random access memory (RAM). In addition there is 128 kilobytes of RAM shared by both CPUs which is also accessible to the two host busses, VME and VSB. Using both host busses simultaneously, it is possible for the PA to move data blocks both to and from host memory. The transmit and receive CPUs have VME bus master capability. All the data paths shown are 32 bits wide.
2.3 Operation
On transmit, data can be piped from one of three locations: host memory, shared memory, or transmit CPU memory into the network port, while the transmit CPU supervises transfers and compiles headers. No intermediate buffering of the application data takes place. We believe this is key to high speed operation.
On receive, parallel data from the network is pipelined to the host memory, local receive CPU memory, or shared memory. In normal operation, the receive CPU will DMA the header to local memory, perform initial processing to establish header integrity and payload destination, and then start a DMA process to transfer the data segment of the packet either to host memory or to the shared memory region. While storage of the payload is proceeding, the receive CPU completes its processing of the header information. Messages are passed between the transmit/receive CPUs and the host either by using the shared memory region or by using interrupt mechanisms which exist among all CPUs (host, transmit or receive).
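For illustration, the receive-side sequencing just described can be written out as the following C sketch. The helper routines (dma_start, dma_wait, header_ok, choose_destination, finish_header_processing) are hypothetical stand-ins for the PA's DMA controller and firmware; this is a reading of the data flow above, not the actual receive CPU code.

/* Illustrative sketch of the receive-side flow described above.  All helper
 * functions are hypothetical stand-ins for the PA's DMA controller and
 * header-processing firmware. */
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint32_t  header[16];    /* header DMA'd into local receive CPU memory */
    uint32_t  payload_len;   /* payload length, in 32 bit words            */
    uint32_t *payload_dst;   /* host memory or shared memory destination   */
} rx_packet_t;

extern void      dma_start(void *dst, uint32_t nwords);  /* start network-port DMA */
extern void      dma_wait(void);                         /* wait for DMA complete  */
extern bool      header_ok(const uint32_t *hdr);         /* integrity checks       */
extern uint32_t *choose_destination(const uint32_t *hdr, uint32_t *len_out);
extern void      finish_header_processing(const uint32_t *hdr);  /* e.g. TCB update */

void receive_one_packet(rx_packet_t *pkt)
{
    /* 1. DMA the header from the network port into local memory. */
    dma_start(pkt->header, sizeof pkt->header / sizeof pkt->header[0]);
    dma_wait();

    /* 2. Initial processing: establish header integrity and payload destination. */
    if (!header_ok(pkt->header))
        return;                                  /* error path kept minimal */
    pkt->payload_dst = choose_destination(pkt->header, &pkt->payload_len);

    /* 3. Start the payload DMA to host or shared memory ...                */
    dma_start(pkt->payload_dst, pkt->payload_len);

    /* 4. ... and complete header processing while the payload is moving.   */
    finish_header_processing(pkt->header);

    dma_wait();                                  /* payload now in place    */
}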
The Protocol Accelerator enables a rapid data flow to and from the network by introducing a high degree of concurrency into the communication mechanism. Several activities execute simultaneously:
1) host processing (of higher layers and application),
2) transmit protocol processing,
3) receive protocol processing,
4) data transfer from host memory to the network adapter,
5) data transfer from the network adapter to host memory,
6) receive data checksumming,
7) transmit data checksumming, and
8) MAC frame processing.
2.4 Direct Memory Access
An important feature of the hardware architecture is the dual direct memory access controllers (DMACs). The DMACs are capable of moving 32 bit data words at rates of up to 33 Mbytes/sec over VME or 30 Mbytes/sec over the VSB bus directly to and from the network ports. Scatter-gather type operations are fully supported both to and from the application memory. Data paths also exist for DMA of data between the network ports and any other RAM area on the PA (i.e. CPU RAM space or shared RAM space) so that intermediate data buffering is possible whenever necessary.
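To make the scatter-gather idea concrete, the short C sketch below models a chained descriptor list of the kind such a DMAC typically walks; the structure layout and names are assumptions for illustration, not the PA's actual register model.

/* Plausible scatter-gather descriptor list; layout and field names are
 * assumed for illustration, not taken from the actual DMAC design. */
#include <stdint.h>
#include <stddef.h>

typedef struct sg_descriptor {
    const uint32_t       *addr;     /* fragment in host or PA RAM            */
    uint32_t              nwords;   /* fragment length, in 32 bit words      */
    struct sg_descriptor *next;     /* NULL terminates the chain             */
} sg_descriptor_t;

/* Gather the fragments into one contiguous buffer, emulating in software
 * what the DMAC does when feeding the network output port directly from
 * scattered application memory.  Returns the number of words moved. */
size_t sg_gather(uint32_t *port_buf, const sg_descriptor_t *d)
{
    size_t total = 0;
    for (; d != NULL; d = d->next) {
        for (uint32_t i = 0; i < d->nwords; i++)
            port_buf[total + i] = d->addr[i];
        total += d->nwords;
    }
    return total;
}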
2.5 Checksumming
The on-board CPUs are capable of checksumming data at a rate in the region of 75 Mbits/sec [13] but operation at this rate would leave no time for protocol processing. To maximize our data throughput, it was decided to include hardware checksumming in preference to using faster, more sophisticated processors for this function.
`
FIGURE 2. Protocol Accelerator Functional Block Diagram
`
`49.4.2
`
`1729
`
`
`
`
We use an 'on-the-fly' technique for the checksum and, in order to take advantage of this hardware, the checksum field is required to trail the fields being checked. In the case of TCP this requires that the checksum field be moved from its usual place in the header. Although hardware schemes can be devised which leave the checksum field in place, these are somewhat more difficult to implement, and because the intent with this design was to produce a testbed suitable for many protocols, the checksumming circuitry was not designed to be protocol specific.
The circuits utilize 32 bit ones complement adders and operate in tandem with the DMA controllers; every word moved by the DMA controller is simultaneously applied to the checksum circuits. On the transmit processor, the circuit automatically inserts a 32 bit checksum at the end of both the header and the data segments. On the receiver these fields are checked, again automatically.
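The checksum scheme is the standard ones complement accumulation, applied word by word as the DMA moves data. A minimal software analogue, written for clarity rather than taken from the PA firmware, is shown below.

/* Minimal software analogue of the on-the-fly checksum hardware: fold each
 * 32 bit word into a ones complement sum while it is being moved, then
 * append the complemented result as the trailing checksum word. */
#include <stdint.h>
#include <stddef.h>

static uint32_t ones_complement_add32(uint32_t sum, uint32_t word)
{
    uint64_t t = (uint64_t)sum + word;                  /* 32 bit add ...       */
    return (uint32_t)((t & 0xFFFFFFFFu) + (t >> 32));   /* ... end-around carry */
}

/* Copy nwords 32 bit words (standing in for the DMA transfer) and compute
 * the checksum in tandem; the caller writes the returned value after the
 * fields that were checked, as the hardware does for header and data. */
uint32_t copy_and_checksum(uint32_t *dst, const uint32_t *src, size_t nwords)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < nwords; i++) {
        dst[i] = src[i];                                /* the data movement    */
        sum = ones_complement_add32(sum, src[i]);       /* the tandem checksum  */
    }
    return ~sum;
}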
`
`3. TCP IMPLEMENTATION
`3.1 Introduction
The transport protocol used to demonstrate the capabilities of the Protocol Accelerator is the Transmission Control Protocol (TCP). It was chosen because it is currently one of the most wide-spread transport protocols in use on networks today. It is also the subject of a great deal of continuing research, and the expanding use of UNIX-based desktop workstations is increasing its penetration. Along with the increased use of TCP is the fear that as network rates rise, network throughput may not keep pace or may even decline as connection-oriented, window flow-controlled protocols such as TCP become a bottleneck. Some researchers do not believe that the protocol is to blame for the low throughput observed when current TCP implementations are run over experimental high speed networks [2]. Instead, they cite inefficient implementations of the protocol and interactions with the host operating system (UNIX) as the causes of the poor performance. The intensity of this debate will grow as more high performance machines running TCP observe less than ideal performance over high speed networks. Therefore, TCP was implemented to show that with an efficient implementation on the proper hardware platform, even a protocol designed for moderate-speed, error-prone networks can achieve high throughput.
`3.2 Implementation Details
The high performance aspects of the Protocol Accelerator such as the dual processors, DMA, and on-the-fly checksum require the design of the transport protocol implementation to be specific to the PA. Therefore, a custom implementation of TCP was developed. The modules were written in C for the main protocol processing functions with embedded 68020 assembly language code to perform many of the hardware specific tasks, such as setting up the DMA controller, controlling timers, and polling status flags. No operating system is used on the PA. Many implementation decisions were made to optimize speed and efficiency for the 'main path' of protocol processing at the expense of processing for infrequent error conditions. These decisions are justified given the low error rates characteristic of high speed fiber networks.
The dual processor hardware architecture of the Protocol Accelerator leads naturally to the software architecture. The TCP implementation consists of separate transmit and receive processes running their respective tasks on separate microprocessors. The transmitter and receiver do most of their processing on data stored in private memory, but they do communicate through the shared memory. The bulk of this communication occurs through the Transmission Control Block (TCB), the main TCP state information structure which resides in shared memory. At appropriate times, state changes in either the transmitter or receiver are updated in the TCB which may then be read by the other processor in the course of its work.
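The paper does not give the TCB layout, but a rough sketch of the kind of structure implied, i.e. the standard TCP send and receive state variables placed in the shared RAM where both 68020s can read and update them, might look like the following; all field names and the base address are assumptions.

/* Hypothetical sketch of a Transmission Control Block in the 128 kilobyte
 * shared RAM, readable and writable by both the transmit and receive CPUs.
 * Field choice follows the usual TCP state variables; the PA's real layout
 * is not documented in the paper. */
#include <stdint.h>

typedef volatile struct {
    /* Send state, updated mainly by the transmit CPU. */
    uint32_t snd_una;          /* oldest unacknowledged sequence number    */
    uint32_t snd_nxt;          /* next sequence number to send             */
    uint32_t snd_wnd;          /* send window advertised by the peer       */

    /* Receive state, updated mainly by the receive CPU. */
    uint32_t rcv_nxt;          /* next sequence number expected            */
    uint32_t rcv_wnd;          /* receive window to advertise              */

    /* Connection identification and state. */
    uint32_t local_addr, remote_addr;
    uint16_t local_port, remote_port;
    uint8_t  state;            /* e.g. ESTABLISHED                         */
    uint8_t  pad;
} tcb_t;

/* Both processors address the same structure; 'volatile' reflects that the
 * other CPU may change it between accesses.  The address is illustrative. */
#define SHARED_TCB ((tcb_t *)0x00C00000u)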
As noted in the hardware description, the generic nature of the PA requires that checksums be placed after the header and after the data. Other than this difference, the implementation provides all of the TCP functions required to transmit and receive data in the TCP (call) ESTABLISHED state. Among others, these functions include maintaining the retransmission queue, providing resequencing for out of order packets, supporting retransmission timers, and packetizing host data into TCP segments, or TPDUs. It must be noted that to minimize data movement, host data is moved directly between host memory and the network interface. This precludes further segmentation/reassembly of the data at what would be the Internet Protocol (IP) layer. Therefore, although certain functions of the IP layer are rolled into the TCP header (IP address, length, protocol), IP functionality is not claimed.
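As an illustration of the resulting segment format, one plausible layout is sketched below: a TCP-like header that also carries the IP-derived address, length and protocol fields, followed by its trailing 32 bit checksum, then the payload and the payload's trailing checksum. The field ordering and sizes are assumptions; the paper does not specify the exact format.

/* Illustrative (assumed) on-the-wire layout of a PA TPDU.  The hardware
 * inserts/verifies the two trailing checksum words; the IP-derived fields
 * (address, length, protocol) ride in the header as described in the text. */
#include <stdint.h>

typedef struct {
    uint32_t dst_addr;         /* IP-style destination address             */
    uint32_t length;           /* total segment length                     */
    uint8_t  protocol;         /* IP-style protocol identifier             */
    uint8_t  flags;            /* TCP control flags                        */
    uint16_t window;
    uint16_t src_port;
    uint16_t dst_port;
    uint32_t seq;
    uint32_t ack;
    uint32_t header_checksum;  /* trails the header fields being checked   */
} pa_tpdu_header_t;

/* The header is followed by the payload and then a second 32 bit checksum
 * covering the payload, also placed by the transmit checksum circuit. */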
Another objective of this work is to quantify the effects of the end host system on outboard protocol processing. To this end a UNIX device driver, applications programming interface (API), and application were developed for the host UNIX system. The relationships among the components are shown in Figure 3.
`
FIGURE 3. System Software Architecture
`
Because many variations are possible with software in the UNIX environment, attempts were made to keep the device driver, API, and application code as simple as possible while maintaining functional similarity to current methodologies for protocol/system interfaces. The results may then be used to extrapolate meaningful performance expectations of other systems with different basic parameters such as processor capability, network packet size, or bus speeds.
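The paper does not list the API calls, but a hypothetical host-side use of such a BSD-like interface, with invented pa_* names, might look like this:

/* Hypothetical host application using the PA's BSD-like API.  The pa_*
 * names are invented for illustration; the paper only states that the API
 * mirrors familiar socket/connect/write semantics via a UNIX device driver. */
#include <stddef.h>

extern int  pa_socket(void);                                 /* open endpoint    */
extern int  pa_connect(int s, unsigned long addr, unsigned short port);
extern long pa_write(int s, const void *buf, size_t len);    /* one host message */
extern void pa_close(int s);

int send_bulk(const void *msg, size_t len)
{
    int s = pa_socket();
    if (s < 0)
        return -1;
    if (pa_connect(s, 0x0A000001ul, 4000) < 0) {   /* illustrative address, port  */
        pa_close(s);
        return -1;
    }
    long sent = pa_write(s, msg, len);             /* the per-message path timed  */
    pa_close(s);                                   /* in Section 4.3              */
    return (sent == (long)len) ? 0 : -1;
}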
`
`1730
`
`49.4.3
`
`
`
`
4. PERFORMANCE MEASUREMENTS / ANALYSIS
`
4.1 Test Configuration
For these initial results, the circuit was operated in a hardware loopback configuration (see Figure 1) with the network output port connected to the network input port via FIFO buffers. The loopback causes no loss, errors, or reordering of packets and is thus a best case. For the host we used a VME based single board computer based on the Motorola 68030 processor operating at 25 MHz. This processor was equipped with an area of shared dynamic RAM (4 Mbytes) accessible from both the host processor subsystem and the VME bus.
The communications model used to test the capabilities of the Protocol Accelerator is that of a file server/client which can either provide or receive large messages at the throughput rates of the PA. Performance measurements were taken based on the UNIX host transmitting bulk data over the network which, in this system, is a loopback to the receiver side of the PA. In the initial measurements we found the transmitter to be the rate-determining element in the end-to-end process, being roughly fifty percent slower than the receiver. We therefore focus our discussion on the transmitter. In order to fully test the capabilities of the transmitter, the received data is not run back to the UNIX host because that would cause excessive contention for the VME bus. For these measurements only one host bus (VME) was used.
4.2 Protocol Accelerator Performance
Figure 4 shows the timing diagram for the flow of TPDUs through the PA transmitter. The header generation time is the time taken for the processor to access the current state information and populate the header fields. Once the payload DMA transfer is underway, further transport protocol processing proceeds in parallel and does not impede data transfer unless the payload transfer time is shorter than the overall transport processing time.
`
`-1•
`
`()jl.KC •l
`
`53µ.soc
`
`I llllADO!. I
`
`Cl:NCRATION
`
`TRANSPORT PROTOCOL PROCESSING
`
`PAYLOAD OMA TRANSrtlR
`(TIME IS PACKET SIZE DEPENDliNl)
`
`TIMC:
`
`FlGURE 4. Transmitter Per Packet Processing lime
`
The payload transfer time is simply the product of the payload size (in 32 bit words) and the data transfer cycle time. The data transfer cycle time for our system can be calculated by adding the minimum DMA cycle time of 80 nsec (asynchronous) to the source (or destination) memory access time. The shared PA static RAM has an access time of 40 nsec, leading to a cycle time of 120 nsec per 32 bit word (i.e. peak data transfer rate in excess of 250 Mbits/sec). The VME memory on the available host computer, on the other hand, has a measured response time averaging 580 nsec, which reduces the peak throughput to below 50 Mbits/sec when this memory is used. These figures assume no other activity on the system bus. For these preliminary findings we use a constant packet size of 2016 bits of which 1728 bits was payload. The total transport processing time of our transmitter TCP implementation is 66 µsec and this becomes the limiting throughput factor for small payload packets. The header generation part of this time is 13 µsec.
`
Using these results, it is straightforward to extrapolate the performance of the PA transmitter for any memory access time and packet size. An example is shown in Figure 5. Here, the payload throughput and the packet throughput are plotted against packet size for different memory access times. The flat region of the packet throughput curves is the region where the protocol processor limits throughput. To the right of this region, throughput is limited by memory bandwidth. Two cases are shown, 120 nsec and 660 nsec memory cycle times; these represent the memory cycle times for the two cases described previously, PA shared SRAM and host VME DRAM.
`
`161-~---~
`14
`
`/ ..
`
`··--·····s····---·· 250
`
`/
`I
`I
`
`I
`
`cc
`l2Q
`llM"'CYCU.
`JI'
`,
`.
`.
`
`12
`10
`lrl 8
`e
`lS
`~
`
`I<!
`
`0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0
`PACKET SIZE (IUJ YTESl
`
`200
`
`150
`
`100 ~
`~
`50 ~
`
`0
`
`FIGURES. PA Tuoughput Based on Memory Access lime
`and Packet Size
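The curves of Figure 5 follow from the timing model of Figure 4: the per packet time is the larger of the fixed transport processing time and the payload DMA transfer time. The following C sketch reproduces that model using the constants quoted above (80 nsec DMA cycle, 66 µsec transport processing); it is our reading of those numbers, not code from the PA.

/* Transmitter throughput model behind Figures 4 and 5: per-packet time is
 * max(transport processing time, payload words x (DMA cycle + memory access)).
 * Constants come from the text; the model itself is an interpretation. */
#include <stdio.h>

#define DMA_CYCLE_NS      80.0      /* minimum asynchronous DMA cycle          */
#define TRANSPORT_PROC_US 66.0      /* transmitter TCP processing per packet   */

static double packets_per_sec(double payload_bits, double mem_access_ns)
{
    double words       = payload_bits / 32.0;
    double transfer_us = words * (DMA_CYCLE_NS + mem_access_ns) / 1000.0;
    double per_pkt_us  = (transfer_us > TRANSPORT_PROC_US) ? transfer_us
                                                           : TRANSPORT_PROC_US;
    return 1.0e6 / per_pkt_us;
}

int main(void)
{
    const double payload_bits[] = { 1728.0, 8192.0, 32768.0 };  /* ~0.2, 1, 4 kbytes */
    for (int i = 0; i < 3; i++) {
        double b = payload_bits[i];
        printf("%6.0f payload bits: SRAM %6.0f pkt/s (%5.1f Mbit/s), "
               "VME %6.0f pkt/s (%5.1f Mbit/s)\n", b,
               packets_per_sec(b, 40.0),  packets_per_sec(b, 40.0)  * b / 1.0e6,
               packets_per_sec(b, 580.0), packets_per_sec(b, 580.0) * b / 1.0e6);
    }
    return 0;    /* the 1728 bit case reproduces the ~15,000 packets/sec figure */
}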
`
4.3 PA Performance in UNIX Environment
We now examine the effect of host system software on the observed throughput to the application. Host system interactions with outboard processors are critical because regardless of how fast the data may depart from or arrive at the outboard processor, it is not until the data is actually in user space on the host system that the communication is complete.
The UNIX overhead time was measured for a number of host message sizes transmitted from the PA. The UNIX overhead consists of the time between the user call to the API's BSD-like write socket call and the receipt of the command by the PA on the inbound side, plus the time between the generation of the UNIX interrupt (signaling the end of PA processing of the host message) and return of control to the user at the end of message transmission. The connection establishment functions of socket and connect are not included in these measurements because they are not directly associated with the per message data transfer. The system was kept unloaded except for standard system daemons in order to minimize contention for processor and bus resources. The results of the UNIX overhead measurements for message transmission are shown in Figure 6.
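Expressed as arithmetic on per-message timestamps, the overhead defined above is the host-resident time on either side of the PA's work. The small C sketch below pins down that bookkeeping; the timestamp names are hypothetical, and only the decomposition comes from the text.

/* The measured UNIX overhead as arithmetic on per-message timestamps.
 * Names are hypothetical; only the decomposition (host time before the PA
 * receives the command, plus host time after the completion interrupt)
 * follows the text.  Connection setup (socket/connect) is excluded. */
typedef struct {
    double t_user_write;     /* user calls the BSD-like write socket call     */
    double t_pa_got_cmd;     /* PA receives the command on the inbound side   */
    double t_pa_interrupt;   /* UNIX interrupt: PA finished the host message  */
    double t_user_return;    /* control returns to the user process           */
} msg_timestamps_t;

double unix_overhead(const msg_timestamps_t *t)
{
    double inbound  = t->t_pa_got_cmd  - t->t_user_write;    /* before PA work */
    double outbound = t->t_user_return - t->t_pa_interrupt;  /* after PA work  */
    return inbound + outbound;
}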
`
`49.4.4
`
`1731
`
`
`
`
FIGURE 6. Absolute and Relative UNIX Overhead vs. Host Message Size
`
As the message size increases, the absolute UNIX overhead increases while its percentage of the total operation time falls. If the TCP interface process had been the only process running on the host system, the absolute overhead would remain constant as host message size increases. This is because the only action required of the host is to change the command message length delivered to the transmitter. The observed results are due to the accumulation of higher priority host processes while the PA DMA is in control of the UNIX memory. As the message size increases, the DMA controls the host bus for more time per message. The linear increase in absolute overhead also indicates that the higher priority functions are generally periodic. It should be noted that even as the absolute overhead increases with message size, the on-board TCP processing time remains constant on a per segment basis. This shows that the increase in absolute overhead is not due to increases in time waiting for the host bus or other per segment processing; it occurs at the end of the packet processing before control is returned to the user process.
The declining percentage of overhead as host message size is increased confirms our expectations. As the host message size increases, the amount of time spent on UNIX overhead processing relative to the overall transmission time decreases. This again is due to the constant amount of code which must be executed. The observed asymptotic approach to 25% represents a balance between the decreasing percentage of host processing time per message and the increased amount of system backlog which is serviced before the user process regains control. Figure 7 shows the overall performance of the PA in and out of the UNIX environment.

FIGURE 7. Throughput vs. Host Message Size

5. CONCLUSIONS
An outboard protocol accelerator optimized for transport layer processing has been implemented and tested. The design consists of separate transmit and receive units, each with its own CPU, checksum, and DMA hardware. Using an implementation of TCP, we demonstrated that this approach can support end-to-end throughput rates in excess of 15,000 packets/sec between network and application. When integrated into the UNIX environment, the protocol accelerator freed the host of the traditional protocol processing tasks but led to a 25% decrease in average throughput due to the UNIX overhead.

ACKNOWLEDGEMENTS
We would like to thank Mike Koblentz for his major contributions and Michael Stanton for his part in the initial design of the hardware.

REFERENCES
[1] L. Svobodova, "Measured Performance of Transport Service in LANs", Computer Networks and ISDN Systems, vol. 18, pp. 31-45, 1989.
[2] D. D. Clark, V. Jacobson, J. Romkey and H. Salwen, "An Analysis of TCP Processing Overhead", IEEE Comm. Mag., pp. 23-29, June 1989.
[3] D. D. Clark, M. L. Lambert and L. Zhang, "NETBLT: Bulk Data Transfer Protocol", Network Information Center RFC-998, SRI International, Menlo Park, CA, 1987.
[4] G. Chesson, E. Brendan, V. Schryver, A. Cherenson and A. Whaley, "XTP Protocol Definition," Protocol Engines Inc., 1900 State St., Santa Barbara, CA 93101.
[5] R. P. Singh and A. Erramilli, "Protocol Design and Modeling Issues in Broadband Networks", Proc. ICC '90, Atlanta, GA, pp. 1458, 1990.
[6] D. R. Cheriton, "VMTP: Versatile Message Transaction Protocol, Protocol Specification", Network Information Center RFC-1045, SRI International, Menlo Park, CA, 1987.
[7] A. N. Netravali, W. D. Roome, K. Sabnani, "Design and Implementation of a High-Speed Transport Protocol", IEEE Trans. on Communications, vol. 38, no. 11, pp. 2010, Nov. 1990.
[8] D. Giarrizzo, M. Kaiserswerth, T. Wicki, R. C. Williamson, "High Speed Parallel Protocol Implementation", in H. Rudin and R. C. Williamson, Protocols for High Speed Networks, North-Holland, 1989.
[9] M. Zitterbart, "High Speed Transport Components", IEEE Network Magazine, January 1991, pp. 54-63.
[10] H. Kanakia and D. R. Cheriton, "The VMP Network Adapter Board (NAB): High Performance Network Communication for Multiprocessors," Proc. ACM SIGCOMM '88 Symposium on Communications Architectures and Protocols, Stanford University, CA, 1988, pp. 175-187.
[11] E. C. Cooper, P. A. Steenkiste, R. D. Sansom, B. D. Zill, "Protocol Implementation on the Nectar Communication Processor", Proc. SIGCOMM '90, Philadelphia, PA, 1990, pp. 135.
[12] N. Jain, M. Schwartz, T. R. Bashkow, "Transport Protocol Processing at GBPS Rates", Proc. SIGCOMM '90, Philadelphia, PA, 1990, pp. 188.
[13] R. Braden, D. Borman, C. Partridge, "Computing the Internet Checksum", Network Information Center RFC-1071, SRI International, Menlo Park, CA, 1988.
`
`1732
`
`49.4.5
`
`
`