`
ELSEVIER SCIENCE PUBLISHERS B.V.
Sara Burgerhartstraat 25
P.O. Box 211, 1000 AE Amsterdam, The Netherlands
`
Keywords are chosen from the ACM Computing Reviews Classification System, ©1991, with permission.
Details of the full classification system are available from
ACM, 11 West 42nd St., New York, NY 10036, USA
`
ISBN: 0 444 81481 7
ISSN: 0926-549X
`
© 1993 IFIP. All rights reserved.
No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any
means, electronic, mechanical, photocopying, recording or otherwise, without the prior written permission of
the publisher, Elsevier Science Publishers B.V., Copyright & Permissions Department, P.O. Box 521, 1000 AM
Amsterdam, The Netherlands.
`
Special regulations for readers in the U.S.A. - This publication has been registered with the Copyright Clearance
Center Inc. (CCC), Salem, Massachusetts. Information can be obtained from the CCC about conditions under
which photocopies of parts of this publication may be made in the U.S.A. All other copyright questions, including
photocopying outside of the U.S.A., should be referred to the publisher, Elsevier Science Publishers B.V., unless
otherwise specified.
`
No responsibility is assumed by the publisher or by IFIP for any injury and/or damage to persons or property
as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products,
instructions or ideas contained in the material herein.
`
pp. 119-134, 199-218, 267-281, 367-381: Copyright not transferred
`
`This book is printed on acid-free paper.
`
`Printed in The Netherlands
`
A High Speed Data Link Control Protocol   81
Ahmed N. Tantawy, IBM Res. Div., T.J. Watson Research Center, USA,
Hanafy Meleis, DEC, Reading, UK

Session C: Parallel Implementation and Transport Protocols   101
Chair: Guy Pujolle, Universite P. et M. Curie, France

Parallel TCP/IP for Multiprocessor Workstations   103
Kurt Maly, S. Khanna, A. Mukkamala, C.M. Overstreet, A. Yerraballi,
E.C. Foudriat, B. Madan, Old Dominion University, USA

TCP/IP on the Parallel Protocol Engine   119
Erich Rütsche, Matthias Kaiserswerth,
IBM Research Division, Zurich Research Laboratory, Switzerland

A High-Speed Protocol Parallel Implementation: Design and Analysis   135
Thomas F. La Porta, AT&T Bell Laboratories, USA,
Mischa Schwartz, Columbia University, New York, USA

Session D: Multimedia Communication Systems   151
Chair: Radu Popescu-Zeletin, GMD FOKUS, Germany

Orchestration Services for Distributed Multimedia Synchronisation   153
Andrew Campbell, Geoff Coulson, Francisco Garcia, David Hutchison,
Lancaster University, UK

Towards an Integrated Quality of Service Architecture (QOS-A) for
Distributed Multimedia Communications   169
Helmut Leopold, Alcatel ELIN Research, Austria,
Andrew Campbell, David Hutchison, Lancaster University, UK,
Niklaus Singer, Alcatel ELIN Research, Austria

JVTOS - A Reference Model for a New Multimedia Service   183
Gabriel Dermler, University of Stuttgart, Germany,
Konrad Froitzheim, University of Ulm, Germany

Experiences with the Heidelberg Multimedia Communication System:
Multicast, Rate Enforcement and Performance   199
Andreas Cramer, Manny Farber, Brian McKellar, Ralf Steinmetz,
IBM European Networking Center, Germany
`
Session E: QoS Semantics and Management   219
Chair: Martina Zitterbart, IBM Res. Div., Watson Research Center, USA

Client-Network Interactions in Quality of Service Communication
Environments   221
Domenico Ferrari, Jean Ramaekers, Giorgio Ventre,
International Computer Science Institute, USA

The OSI 95 Connection-mode Transport Service: The Enhanced QoS   235
Andre Danthine, Yves Baguette, Guy Leduc, Luc Leonard,
University of Liege, Belgium

QoS: From Definition to Management   253
Noemie Simoni, Simon Znaty, TELECOM Paris, France

Session F: Evaluation of High Speed Communication Systems   265
Chair: Otto Spaniol, Technical University Aachen, Germany

ISO OSI FTAM and High Speed File Transfer: No Contradiction   267
Martin Bever, Ulrich Schaffer, Claus Schottmüller,
IBM European Networking Center, Germany

Analysis of a Delay Based Congestion Avoidance Algorithm   283
Walid Dabbous, INRIA, France

Performance Issues in Designing Network Interfaces: A Case Study   299
K.K. Ramakrishnan, Digital Equipment Corporation, USA

Session G: High Performance Protocol Mechanisms   315
Chair: Craig Partridge, BBN, USA

Multicast Provision for High Speed Networks   317
A.G. Waters, University of Essex, UK

Transport Layer Multicast: An Enhancement for XTP Bucket
Error Control   333
Harry Santoso, MASI, Universite P. et M. Curie, France,
Serge Fdida, MASI, Universite Rene Descartes, France

A Performance Study of the XTP Error Control   351
Arne A. Nilsson, Meejeong Lee,
North Carolina State University, USA
`
`
Session H: Protocol Implementation   365
Chair: Samir Tohme, E.N.S.T., France

ADAPTIVE: An Object-Oriented Framework for Flexible and Adaptive
Communication Protocols   367
Donald F. Box, Douglas C. Schmidt, Tatsuya Suda,
University of California, Irvine, USA

HIPOD: An Architecture for High-Speed Protocol Implementations   383
A.S. Krishnakumar, J.G. Kneuer, A.J. Shaw, AT&T Bell Laboratories, USA

Parallel Transport System Design   397
Torsten Braun, University of Karlsruhe, Germany,
Martina Zitterbart, IBM Res. Div., T.J. Watson Research Center, USA

Session I: Network Interconnection   413
Chair: Augusto Casaca, INESC, Portugal

A Rate-based Congestion Avoidance Scheme for Interconnected
DQDB Metropolitan Area Networks   415
Nen-Fu Huang, Chiung-Shien Wu, Chung-Ching Chiou,
National Tsing Hua University, Rep. of China

Interconnection of LANs/802.6 Customer Premises Equipments (CPEs)
via SMDS on Top of ATM: a case description   431
W. Rozenblad, B. Li, R. Peschi,
Alcatel Bell Telephone, Research Centre, Belgium

Architectures for Interworking between B-ISDN and Frame Relay   443
J. Vozmediano, J. Berrocal, J. Vinyes,
ETSI Telecomunicacion, Spain

Author Index   455
`
`
High Performance Networking, IV (C-14)
A. Danthine and O. Spaniol (Editors)
Elsevier Science Publishers B.V. (North-Holland)
© 1993 IFIP.
`
`119
`
`TCP/IP on the Parallel Protocol Engine
`
Erich Rütsche and Matthias Kaiserswerth
`
IBM Research Division, Zurich Research Laboratory
Säumerstrasse 4, 8803 Rüschlikon, Switzerland
`
`Abstract
`
In this paper, a parallel implementation of the TCP/IP protocol suite on the Parallel Protocol
Engine (PPE), a multiprocessor-based communication subsystem, is described. The execution
times of the various protocol functions are used to analyze the system's performance in two
scenarios. In the first scenario we execute the test application on the PPE; in the second we
evaluate the potential performance of our TCP/IP implementation when it is driven by an
application on the workstation. For the second scenario, the end-to-end performance of our
implementation on a four-processor PPE system is more than 3300 TCP segments per second.
`
Keyword Codes: C.1.2; C.2.2; D.1.3
Keywords: Multiple Data Stream Architectures (Multiprocessors); Network Protocols;
Concurrent Programming
`
`1. INTRODUCTION
`
Progress in high-speed networking technologies such as fiber optics has shifted the bottleneck
in communications from the limited bandwidth of the transmission media to protocol processing
and the operating system overhead in the workstation. So-called lightweight protocols and
protocol offload to programmable adapters are two approaches proposed to cope with this problem.
Protocols such as the Xpress Transfer Protocol (XTP)1 [PEI 92] and VMTP [Cheriton 88] try to
simplify the control mechanisms and packet structures such that the protocol implementation
becomes less complex and can possibly be done in hardware. We took the second approach in
building the Parallel Protocol Engine (PPE) [Kaiserswerth 92], a multiprocessor-based
communication adapter, upon which protocol processing can be offloaded from a host system.
The Nectar CAB [Arnould 89] and the VMP Network Adapter Board [Kanakia 88] are other
programmable adapters, each based on a single protocol processor. The XTP chipset [Chesson 87]
is a very specialized set of RISC processors designed to execute the XTP protocol. Our objective
was to investigate and exploit parallelism in many different protocols. Therefore we decided to
develop a general purpose communication subsystem capable of supporting standard protocols
efficiently in software.
`
`1
`
`.\pre" Tran>lcr PrntlK:OI and XTP arc rcgl\tcrcd tradcmarh ol XTP Forum
`
`
In this paper our goal is to demonstrate that a careful implementation of a standard transport
protocol stack on a general-purpose multiprocessor architecture allows efficient use of the
bandwidth available in today's high-speed networks. As an example, we chose to implement the
TCP/IP protocol suite on our 4-processor prototype of the PPE.
`
We implemented the socket interface and a test application directly on the PPE to facilitate our
performance measurements. In this test scenario we analyze the performance of TCP/IP and the
socket layer. We also examined a second scenario to understand how our implementation would
perform when integrated into a workstation, where protocol processing up to the transport layer is
performed on the PPE and applications can access the transport service via the socket interface on
the workstation.
`
In Section 2 our hardware platform, the PPE, is presented. Section 3 introduces TCP/IP. In the
following section we explain our approach to parallel protocol implementation. Section 5 presents
the results and discusses the impact of the hardware and software architecture on performance.
The last section gives the conclusion and an outlook on our future work.
`
2. THE PARALLEL PROTOCOL ENGINE
`
The PPE is presented only briefly here. It is described in greater detail in [Wicki 90] and
[Kaiserswerth 91, 92]. We will first concentrate on the hardware and then present the programming
environment.
`
The PPE is a hybrid shared-memory/message-passing multiprocessor. Message passing is used
for synchronization, whereas shared memory is used to store service primitives and protocol
frames. Figure 1 shows the architecture of the PPE and its use as a communication subsystem.
`
The PPE uses two separate memories, one for transmitting and one for receiving data. Both of
these memories are mapped into the address space of the workstation. In our implementation, four
T425 transputers [INMOS 89] are used as protocol processors. On each side of the adapter, two
T425s have access to the shared memory. Each processor uses private memory to store its program
and local data. We decided against using a single shared memory for storing both inbound and
outbound protocol data, although this would make the adapter more flexible and facilitate
programming, for the following reason. High-speed network interfaces work in a synchronous
fashion, with data being clocked in and out of memory, possibly at the same time, at the
transmission speed of the physical network. Splitting the adapter into separate receive and transmit
parts accommodates simultaneous transmission and reception and only requires memory with half
the speed of that required for a single-memory solution. This architecture results in significant cost
savings, especially when transmission speeds exceed 100 Mb/s.
`
The network interface has read access to the transmit side and write access to the receive side of
the adapter. We emulate a physical network by means of an 8-bit wide parallel interface, which
allows a point-to-point connection between two PPE systems operating with a bidirectional
transmission rate of up to 120 Mb/s. The transputer links are used exclusively for signalling and
control message transfer within the PPE and to and from the host system.
`
The programming language which best describes the transputer's programming model is OCCAM
[Pountain 88]. It is based on the theory of Communicating Sequential Processes (CSP) developed
by Hoare [Hoare 78]. The structuring elements are processes that communicate and synchronize
via messages. Message transfer is unbuffered; communicating processes must reach
`
`
`4.1 IP and ICMP
Because IP is a datagram protocol, the normal flow of data through IP in an end-system requires
no interaction between the receiving and transmitting part. Routing information and exception
handling, however, require a data exchange. The handling of exception and control messages is
the function of ICMP. We therefore partitioned IP into two independent processes, icmp_demux
and ip_demux. To guarantee the timely handling of incoming packets, we dedicated a separate
process on the receive side of the PPE to the handling of the physical network interface.
`
The routing table is shared between both processes on the transmit and receive side of the PPE. An
RPC is used if icmp_demux needs to send out an ICMP message.
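
As an illustration of this partitioning, the dispatch loop of ip_demux on the receive side can be
sketched in C as follows. This is a simplified sketch only; the pkt type and the helper routines are
placeholders introduced for the illustration, not the names used in our source code.

    /* Sketch: dispatch loop of ip_demux on the receive side of the PPE.   */
    struct pkt;                                /* datagram held in shared receive memory     */
    extern struct pkt *next_datagram(void);    /* pointer forwarded by the interrupt handler */
    extern int  ip_proto(const struct pkt *);  /* protocol field of the IP header            */
    extern void to_tcp_recv(struct pkt *);     /* internal channel to tcp_recv               */
    extern void to_icmp_demux(struct pkt *);   /* internal channel to icmp_demux             */
    extern void discard(struct pkt *);

    void ip_demux_loop(void)
    {
        for (;;) {
            struct pkt *p = next_datagram();   /* blocks until a datagram is available */
            switch (ip_proto(p)) {
            case 6:  to_tcp_recv(p);   break;  /* TCP  */
            case 1:  to_icmp_demux(p); break;  /* ICMP */
            default: discard(p);       break;
            }
        }
    }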
`
`4.2 TCP
Splitting the PPE hardware into a separate send and receive side had more impact on how we had
to deal with TCP, the socket layer, and the application layer, than it had on IP.
`
We decided to split the finite state machine (FSM) responsible for implementing a TCP connection
into two separate FSMs once the connection is in the data phase. The actions of these FSMs
are implemented on the receive side through two processes, rtask and tcp_recv. On the transmit
side one process, xtask, implements the FSM. Owing to the duplex nature of TCP and the
piggybacking of control information in data packets, these processes need to share the protocol's
send and receive state variables maintained in the transmission control block (TCB).
`
tcp_recv demultiplexes incoming TCP segments, locates the appropriate TCB and executes the
required action for the FSM state. Header prediction is used to speed up packet handling for
packets arriving consecutively on the same connection. Correctly received segments are appended
to the receive queue and the application process waiting on this connection is then woken up to
move the data to its own buffers. When the received data exceeds the acknowledgement threshold,
which is specified as a percentage of the advertised receive window, tcp_recv makes an RPC to
the transmit side to generate an acknowledgement. The acknowledgement is sent as a separate
packet, unless this information can be piggybacked onto an outgoing data segment.
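
The acknowledgement-threshold logic of tcp_recv can be written schematically as in the following
C sketch. The structure fields and helper routines (rcv_unacked, rpc_send_ack, and so on) are
invented for the illustration and do not correspond to names in our implementation.

    /* Sketch: after a segment has been accepted, tcp_recv decides whether    */
    /* to ask the transmit side, via an RPC, to generate an acknowledgement.  */
    struct tcb {
        unsigned rcv_wnd;            /* advertised receive window (bytes)          */
        unsigned rcv_unacked;        /* correctly received but unacknowledged data */
        unsigned ack_threshold_pct;  /* threshold as a percentage of rcv_wnd       */
    };

    extern void append_to_recv_queue(struct tcb *, const void *, unsigned);
    extern void wake_application(struct tcb *);   /* reader copies data to its own buffer */
    extern void rpc_send_ack(struct tcb *);       /* RPC to the transmit side             */

    void deliver_segment(struct tcb *tcb, const void *data, unsigned len)
    {
        append_to_recv_queue(tcb, data, len);
        wake_application(tcb);

        tcb->rcv_unacked += len;
        if (tcb->rcv_unacked * 100 >= tcb->rcv_wnd * tcb->ack_threshold_pct) {
            rpc_send_ack(tcb);        /* piggybacked onto outgoing data if possible */
            tcb->rcv_unacked = 0;
        }
    }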
`
rtask is driven by two timers, one responsible for delayed acknowledgements, the other for keep-alive
messages. In steady-state data transmission, rtask should never generate an acknowledgement,
as tcp_recv already generates acknowledgements while data are received. Only when the
timer runs out and new unacknowledged data have been received since the last acknowledgement
will rtask generate an acknowledgement. Similarly, keep-alive messages are also sent only when
no activity has taken place on a connection for some time. Again, both acknowledgements and
keep-alive messages are generated via RPCs to the transmit side.
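
In outline, the timer handling of rtask behaves as in the sketch below; again, the names are
illustrative only. The essential point is that an acknowledgement is generated only when new data
have arrived since the last acknowledgement, and a keep-alive only after a period of inactivity.

    /* Sketch: delayed-acknowledgement and keep-alive timers in rtask. */
    struct tcb {
        unsigned rcv_nxt;          /* next sequence number expected                  */
        unsigned last_acked_seq;   /* sequence number covered by the last ack sent   */
        long     idle_time;        /* time since the last activity on the connection */
    };

    extern void rpc_send_ack(struct tcb *);        /* RPC to the transmit side */
    extern void rpc_send_keepalive(struct tcb *);  /* RPC to the transmit side */

    void rtask_timer_tick(struct tcb *tcb, long keepalive_interval)
    {
        if (tcb->rcv_nxt != tcb->last_acked_seq) { /* new, unacknowledged data */
            rpc_send_ack(tcb);
            tcb->last_acked_seq = tcb->rcv_nxt;
        }
        if (tcb->idle_time >= keepalive_interval)  /* connection has been idle */
            rpc_send_keepalive(tcb);
    }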
`
On the transmit side the process xtask manages the transmit queue and the retransmission timers.
To send data, xtask creates the TCP header and fills in the necessary information from the TCB,
such as addresses and sequence numbers for the data and acknowledgements. The header and a
pointer to the data are then passed to the IP process (procedure ip_send), which embeds this
information into an IP datagram.
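
The send path of xtask then looks roughly as follows. The sketch uses a strongly simplified TCP
header and an idealized signature for ip_send; both are assumptions made for the illustration.

    /* Sketch: xtask builds the TCP header in front of the data and passes */
    /* a pointer to ip_send, which prepends the IP header.                 */
    struct tcb {
        unsigned short sport, dport;  /* connection identifiers from the TCB       */
        unsigned snd_nxt;             /* next send sequence number                 */
        unsigned rcv_nxt;             /* acknowledgement number to be piggybacked  */
    };

    extern void ip_send(void *segment, unsigned len, unsigned dst_addr);

    void tcp_send_segment(struct tcb *tcb, void *seg_buf, unsigned data_len,
                          unsigned dst_addr)
    {
        struct tcp_hdr {              /* simplified: flags, window, checksum omitted */
            unsigned short sport, dport;
            unsigned seq, ack;
        } *th = (struct tcp_hdr *)seg_buf;  /* header area precedes the data in seg_buf */

        th->sport = tcb->sport;
        th->dport = tcb->dport;
        th->seq   = tcb->snd_nxt;           /* sequence number of this segment's data */
        th->ack   = tcb->rcv_nxt;           /* piggybacked acknowledgement            */

        ip_send(seg_buf, (unsigned)sizeof *th + data_len, dst_addr);
        tcb->snd_nxt += data_len;           /* retransmission timer is managed by xtask */
    }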
`
4.3 Socket Layer and Application
To facilitate our experiments with TCP/IP, we decided as a first step to implement the entire socket
layer as well as the test application on the PPE. A detailed description of the interactions be-
`
`
written only from the receive side (e.g., the updated transmit window), and the other written only
from the transmit side (e.g., the last send sequence number).
`
Since we do not have a locking protocol for accessing shared data structures, it is possible that for
a brief period after the local update and before the remote update has been propagated, the same
field in the shared data structure contains two different values. Because of the properties of TCP
and the way we have split the protocol onto the transmit and receive side of the PPE, this
inconsistency will only be of importance if it is the reason for the protocol state to change. As an
example consider the following: assume the retransmission timer (it is also maintained in the TCB)
in xtask expires and, because the acknowledgement field in the TCB does not indicate reception of
an acknowledgement, xtask decides to retransmit the unacknowledged TCP segments. On the
receive side, however, an acknowledgement has been received in the meantime which makes this
retransmission unnecessary4. To avoid this problem, before actually going to a retransmit state,
xtask will reread the acknowledgement field, now however with the value on the receive side, to
make sure that a retransmission is warranted. Reading a remote field is similar to writing; a
message with the address and size of the variable is sent to the remote peek_poke process, which
then returns the value of that field.
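
The retransmission check just described can be summarized by the following C sketch; peek_remote
stands for the message exchange with the remote peek_poke process and, like the field names, is an
invented name used only for this illustration.

    /* Sketch: before retransmitting, xtask rereads the acknowledgement field */
    /* with the value currently held on the receive side of the adapter.      */
    struct tcb {
        unsigned snd_una;   /* oldest unacknowledged sequence number (local copy) */
        unsigned snd_nxt;   /* next sequence number to be sent                    */
    };

    extern unsigned peek_remote(const void *addr, unsigned size);  /* ask remote peek_poke */
    extern void retransmit_from(struct tcb *, unsigned seq);

    void retransmission_timeout(struct tcb *tcb)
    {
        if (tcb->snd_una == tcb->snd_nxt)
            return;                                /* nothing outstanding */

        /* the local copy may be stale: fetch the value maintained on the receive side */
        tcb->snd_una = peek_remote(&tcb->snd_una, sizeof tcb->snd_una);

        if (tcb->snd_una != tcb->snd_nxt)          /* retransmission really warranted */
            retransmit_from(tcb, tcb->snd_una);
    }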
`
RPCs from the receive to the transmit side have been implemented as follows: any process on the
receive side can format an RPC message, which is then sent via a dedicated transputer link to the
rpc_process. This process will then execute the remote procedure or, in the case of transmission
requests, pass the request via a local (internal) channel to the appropriate write process, one of
which exists for each TCP connection. Return values are sent, again via a dedicated transputer
link, back to the receive side to rpc_demux, which forwards these values over a local channel to
the process that had initiated the RPC. Upon receiving the return value, the caller becomes ready
again and can continue its execution.
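
A minimal sketch of such an RPC from the caller's point of view is given below. The message
layout and the routine names are assumptions made for the illustration; in the implementation the
message travels over a dedicated transputer link to rpc_process, and the result comes back through
rpc_demux.

    /* Sketch: RPC from the receive side to the transmit side of the PPE. */
    enum rpc_op { RPC_SEND_ACK, RPC_SEND_CTRL, RPC_WRITE_DATA };

    struct rpc_msg {
        enum rpc_op op;        /* remote procedure to execute                 */
        int         conn_id;   /* selects the TCP connection / write process  */
        unsigned    arg[4];    /* procedure arguments                         */
    };

    extern void link_send(int link, const void *msg, unsigned len); /* transputer link out */
    extern void chan_recv(int chan, void *buf, unsigned len);       /* local channel in    */
    extern int  LINK_TO_RPC_PROCESS;     /* link leading to rpc_process on the transmit side */
    extern int  my_reply_channel(void);  /* channel on which rpc_demux delivers the result   */

    unsigned rpc_call(const struct rpc_msg *m)
    {
        unsigned result;
        link_send(LINK_TO_RPC_PROCESS, m, sizeof *m);
        chan_recv(my_reply_channel(), &result, sizeof result);  /* caller blocks here    */
        return result;                                          /* caller is ready again */
    }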
`
`4.5 Example
Sending a TCP data segment: The normal data flow is shown in Figure 3. The send data are in a
remotely allocated buffer on the transmit side. The application creates a socket and establishes a
TCP connection. The socket send call causes an RPC to the remote write process, which in turn
copies the data into the TCP send buffer. xtask then controls the transmission and eventual
retransmissions of the data. The send procedure builds the TCP segment and forwards the pointer
to the segment and the associated control block to ip_send. Here the IP header is placed in front of
the TCP segment and then the packet is sent to the network. The data is copied twice: first from the
application buffer to the send queue in shared memory and from there to the network.
Receiving a TCP data segment: Upon receipt the data is also copied twice: first from the network
to the receive queue and from there to the application buffer. The interrupt handler process serves
the physical interface and forwards pointers to received datagrams to ip_demux, which checks
the header and forwards the packet depending on its type to tcp_recv or icmp_demux.
`
tcp_recv analyzes the TCP header and calls the appropriate handler function for a given protocol
state. To send an acknowledgement or a control packet, tcp_recv uses RPCs to the transmit side.
Correctly received segments are appended to the receive queue. rtask wakes up the application
process which is blocked in the socket receive procedure. This procedure then fills the user buffer
with data from the receive queue.
`
4 Note: the logic of the protocol would allow for a retransmission in any case.
`
`
the possible performance in case the socket-based application programming interface (API) were
implemented on the workstation. The socket layer would then be split into two parts. The upper
half resides in the workstation. Calls to the API result in control flows to and from the lower half
of the socket layer, which runs on the PPE. Copying data to and from the TCP layer must be done
by the workstation processor, because the current PPE only functions as a bus slave. Therefore the
copy operations in the socket layer can be combined with the copy between the workstation and
the PPE. In this scenario we measure the throughput between the lower half of the socket layer on
two PPEs. The results of scenario 2 provide an upper bound for the expected performance of such
an integrated system. As such they are valid if one manages - as shown for our implementation of
the ISO 8802.2 Logical Link Control protocol [Kaiserswerth 91] - to fully overlap the copy
operations and the exchange of control between the workstation and the PPE with the protocol
execution on the PPE.
`
We did not implement TCP checksumming, because it should really be done in hardware
[Lumley 92]. To do the software checksum calculation on the transputer would cost 3 µs per 16-bit
word. We did, however, implement IP header checksumming.
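
The IP header checksum is the standard one's-complement Internet checksum; a straightforward C
version of this computation is sketched below. This is generic, illustrative code, not the routine
used on the transputer.

    /* Sketch: one's-complement checksum over the IP header.                     */
    /* hdr_words is the header length in 16-bit words (10 for a 20-byte header). */
    unsigned short ip_header_checksum(const unsigned short *hdr, int hdr_words)
    {
        unsigned long sum = 0;
        int i;

        for (i = 0; i < hdr_words; i++)
            sum += hdr[i];

        while (sum >> 16)                     /* fold carries into the low 16 bits */
            sum = (sum & 0xffff) + (sum >> 16);

        /* when generating, the checksum field is zero beforehand; when verifying, */
        /* the field is included and the returned value is zero for a good header. */
        return (unsigned short)~sum;
    }
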
The Zählmonitor 4 (ZM4) [Dauphin 91] monitoring and tracing system was used to record
execution traces of the PPE subsystem. ZM4 allows gathering of trace events from multiple
processors. These events are timestamped with a global clock operating with a resolution of
100 ns. A powerful toolset [Mohr 91] provides trace analysis and visualization.
`
5.2 Measurements
Because we wanted to see the effects of pipelining and parallel execution of the protocol, we
measured the time spent in the various parts of the device driver, IP, TCP and the socket layer. To
judge the performance of our implementation we measured the number of TCP segments the
implementation can handle per second. Given the segment size, the expected maximum throughput
can easily be calculated.
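
For example, with the 4096-byte segments used in these measurements, the rate of more than 3300
segments per second quoted for the second scenario corresponds to 3300 x 4096 bytes x 8 bits,
i.e. roughly 108 Mb/s of user data.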
`
Process (Procedure) on Receiver         µs/Segment   µs/32-bit word
tcp_recv                                   235
user_task (socket recv/copy)                31           0.545
ip_intrsvc                                   9
ip_demux                                    23

Process (Procedure) on Transmitter
write                                       30           0.545
tcp_snd_data                               147
ip_send                                     23
driver_send                                 17           0.27
Access to Shared Memory (poke call)         18.6         2.4

Table 1: Measured Execution Times
`Table I hsts the e11ecution times of the major processes of our impkmemation. We used segments
`of 4096 bytes in these measurements. The times are reported for the first test scenario. The execu
`tion times per segmem are approximately 41)1- lower for the second ~enario because of reduced
`contention for accesses to the shared memory. The times per 32-blt word for user_task and write
`
`
The prototype PPE interface to the workstation (IBM RISC System/6000) allows a copy throughput
of only 33 Mb/s7. If the application were to be executed on the workstation, all copying would
be done from the workstation's processor, and if we assume code similar to the second test scenario
running on the PPE, then the limited copy throughput rather than the protocol processing will
be the bottleneck and we should expect the performance of the integrated system to be around 30
Mb/s.
`
`6. CONCLUSIONS
`
Our measurements show that a full implementation of TCP/IP on the PPE can cope with data rates
in the range of 100 Mb/s. The throughput is much higher than the bandwidth of our hardware
interface to the workstation.
`
It turns out, however, that using a total of four processors, two for IP and two for TCP, offers only
very little improvement over a two-processor solution, because of the vastly different processing
requirements in the two protocol layers. For full duplex traffic, however, the split onto a receiver
and a transmitter processor improves protocol performance by a factor of 1.7. Partitioning protocols
to obtain even load and linear speedup is a hard problem, in particular for protocols which
clearly were not designed with parallel execution in mind. [Zitterbart 91], for example, reports
even poorer speedup factors: with an 8-transputer implementation of the OSI CLNP she achieves
a performance increase of only 3.73 over the single-processor version.
`
Using a DOS implementation of TCP/IP as the basis for our parallel implementation was a
sound decision. Our implementation runs efficiently when one compares it with other transport
protocol implementations. For example, Zitterbart describes a parallel implementation of OSI
TP4 written for a system of 8 transputers which was able to process 460 PDUs/s [Zitterbart 91]. In
[Braun 91] a parallel implementation of XTP is described; there the performance is 1330 PDUs/s.
`
Once new, faster processors, such as the 100 MIPS T9000 transputer, become available, the gains
for pipelined execution of protocols will have to be reevaluated. While the T9000 will be 10 times
as fast as the T425, the delays for interprocessor communication will not have shrunk by the same
factor. Therefore the relative overhead for pipelining the protocol execution within a layer and
even between layers will grow. We claim, however, that the parallel execution of transmit and
receive functions is still a suitable form of parallelism to increase protocol throughput. Distributed
shared memory, implemented with transputer links, easily allows protocol state information
to be shared between the two sides of the adapter and impacts the performance of the transport
protocol much less than expected. First evaluations of a new architecture, which is based on two
T9000s supported by dedicated hardware for checksumming and extraction of header information,
indicate a performance of over 30000 TCP segments/s.
`
7 The reason why this interface is so slow is that the clocks on the workstation and the PPE run asynchronously.
When arbitrating an access from the Micro Channel to the shared memory on the PPE, we are forced to use
the Micro Channel's Asynchronous Extended cycle [IBM 90] of at least 300 ns. This cycle then may even need
to be extended by up to 487 ns to match it with the appropriate access cycle of the PPE shared memory. In
a new design for the Micro Channel interface this problem would be addressed by buffering in the interface,
which would allow write-behind and read-ahead. For consecutive accesses, the arbitration cycle for the next
word access to the shared memory could then be overlapped with the current word access cycle, thus being
able to use regular Micro Channel cycles of 200 ns, and consequently increasing the throughput to more than
80 Mb/s. A busmaster interface using the Micro Channel's streaming mode would allow even higher
throughput.
`
`
Our measurements are in line with Clark's observation [Clark 89] that the actual protocol processing
is not the reason for poor protocol performance. In the PPE, buffer copying and management
cost twice as much as the protocol processing. The second scenario shows how throughput can be
tripled if the user data were copied by the workstation processor, overlapped with the protocol
execution on the PPE. In a future design of the PPE, we will concentrate on improving the interface
to the shared memory for the protocol processors8 and the workstation.
`
We also plan to work on the design of efficient software interfaces between our subsystem and the
host system. As can be seen from results published for the Nectar CAB and from our own work,
crossing the software interface between the host processor and the communication subsystem is a
costly operation. Many researchers who advocate the offloading of protocol functions into a
dedicated subsystem ignore this issue. For our TCP/IP implementation only a host API based on
sockets will be acceptable, as this interface has become the de-facto standard. These sockets must
be lightweight enough to provide efficient pipelined execution between the communication
subsystem and the host processor to exploit the full power of the PPE.
`
`7. REFERENCES
`
[Arnould 89]   Arnould, E. A., Bitz, F. J., Cooper, E. C., Kung, H. T., Sansom, R. D., Steenkiste, P. A.,
               The Design of Nectar: A Network Backplane for Heterogeneous Multicomputers,
               Proceedings of ASPLOS-III, pp. 205-216, April 1989.

[Braun 91]     Braun, T., Zitterbart, M., A Parallel Implementation of XTP on Transputers,
               Proc. 16th Annual Conf. on Local Computer Networks, Minneapolis, Oct. 1991.

[Chesson 87]   Chesson, G., The Protocol Engine Project, Unix Review, Vol. 5, No. 9, Sept. 1987,
               pp. 70-77.

[Cheriton 88]  Cheriton, D.R., VMTP: Versatile Message Transaction Protocol - Protocol
               Specification, Network Working Group, Request For Comments, RFC 1045,
               February 1988.

[Clark 89]     Clark, D., Lambert, M.L., Romkey, J., Salwen, H., An Analysis of the TCP Processing
               Overhead, IEEE Communications Magazine, Vol. 27, No. 6 (June 1989), pp. 23-29.

[Clark 90]     Clark, D., Tennenhouse, D., Architectural Considerations for a New Generation of
               Protocols, Proceedings of the SIGCOMM '90 Symposium, Sept. 1990, pp. 200-208.

[Dauphin 91]   Dauphin, P., Hofmann, R., Klar, R., Mohr, B., Quick, A., Siegle, M., Sötz, F.,
               ZM4/SIMPLE: A General Approach to Performance-Measurement and -Evaluation
               of Distributed Systems, Technical Report 1/91, Erlangen, January 1991.

[Hoare 78]     Hoare, C.A.R., Communicating Sequential Processes, Communications of the ACM,
               Vol. 21, No. 8, August 1978, pp. 666-677.
`
8 In the PPE a shared memory cycle of the transputer is twice a local memory cycle.
`
`
[IBM 90]           IBM RISC System/6000 POWERstation and POWERserver Hardware Technical
                   Reference - Micro Channel Architecture, 1990.

[INMOS 89]         Inmos Limited, The Transputer Databook, First Ed. 1989, Document No.
                   72 TRN 20300, pp. 23-43 and 113-179.

[Kaiserswerth 91]  Kaiserswerth, M., A Parallel Implementation of the ISO 8802.2-2 LLC Protocol,
                   IEEE Tricomm '91 - Communications for Distributed Applications and Systems,
                   Chapel Hill NC, April 17-19, 1991.

[Kaiserswerth 92]  Kaiserswerth, M., The Parallel Protocol Engine, IBM Research Report, RZ 2298
                   (#77818), March 1992.

[Kanakia 88]       Kanakia, H., Cheriton, D.R., The VMP Network Adapter Board (NAB): High
                   Performance Network Communication on Multiprocessors, ACM SIGCOMM 88,
                   pp. 175-187.

[Lumley 92]        Lumley, J., A High-Throughput Network Interface to a RISC Workstation,
                   Proceedings of the IEEE Workshop on the Architecture and Implementation of
                   High Performance Communication Subsystems, Tucson, AZ, Feb. 17-19, 1992.

[LS-C 89]          Logical Systems, Transputer Toolset, Version 88.4, Feb. 1989.

[Mohr 91]          Mohr, B., SIMPLE: A Performance Evaluation Tool Environment for Parallel and
                   Distributed Systems, in A. Bode, Editor, Distributed Memory Computing, 2nd
                   European Conference, EDMCC2, pp. 80-89, Munich, Germany, April 1991,
                   Springer Verlag Berlin LNCS 487.

[PEI 92]           Protocol Engines Incorporated, XTP Protocol Definition, Revision 3.6, Edited by
                   Protocol Engines, Mountain View, CA, January 11, 1992.

[Pountain 88]      Pountain, D., May, D., A Tutorial on OCCAM2, BSP Professional Books,
                   London 1988.

[UM 90]            IBM Corporation, University of Maryland, Network Communications Package,
                   Milford 1990.

[Wicki 90]         Wicki, T., A Multiprocessor-Based Controller Architecture for High-Speed
                   Communication Protocol Processing, Doctoral Thesis, IBM Research Report,
                   RZ 2053 (#72078), Vol. 6, 1990.

[Zitterbart 91]    Zitterbart, M., Funktionsbezogene Parallelität in transportorientierten
                   Kommunikationsprotokollen, Dissertation, VDI-Reihe 10 Nr. 183, Düsseldorf:
                   VDI-Verlag, 1991.
`