`Communication Systems
`
`International Federation for Information Processing
`
`Technical Committee 6
`
`Communication Systems
`
`IFIP Transactions Editorial Policy Board
`
The IFIP Transactions Editorial Policy Board is responsible for the overall scientific
quality of the IFIP Transactions through a stringent review and selection process.
`
`Chairman
`G.J. Morris, UK
`Members
`D. Khakhar, Sweden
`Lee Poh Aun, Malaysia
`M. Tienari, Finland
`P.C. Poole (TC2)
`P. Bollerslev (TC3)
M. Tomljanovich (TC5)
`
O. Spaniol (TC6)
P. Thott-Christensen (TC7)
`G.B. Davis (TC8)
`K. Brunnstein (TC9)
`G.L. Reijns (TC10)
W.J. Caelli (TC11)
R. Meersman (TC12)
`B. Shackel (TC13)
`J. Gruska (SG14)
`
IFIP Transactions Abstracted/Indexed in:
`INSPEC Information Services
`
`
`
`C-14
`
`HIGH
`PERFORMANCE
`NETWORKING, IV
`
Proceedings of the IFIP TC6/WG6.4 Fourth International Conference on
High Performance Networking
Liege, Belgium, 14-18 December 1992
`
`Edited by
`
A. DANTHINE
Institut d'Electricite B28
Universite de Liege
Liege, Belgium

O. SPANIOL
RWTH Aachen
Informatik IV
Aachen, Germany
`
`1993
`
`NORTH-HOLLAND
`AMSTERDAM • LONDON • NEW YORK • TOKYO
`
`
`
`~1'7
`II<
`5/t?.?-.?
`, :;?1?.?1/
`/19?-
`
ELSEVIER SCIENCE PUBLISHERS B.V.
`Sara Burgerhartstraat 25
`P.O. Box 211, 1000 AE Amsterdam, The Netherlands
`
Keywords are chosen from the ACM Computing Reviews Classification System, ©1991, with permission.
Details of the full classification system are available from
ACM, 11 West 42nd St., New York, NY 10036, USA
`
ISBN: 0 444 81481 7
ISSN: 0926-549X
`
© 1993 IFIP. All rights reserved.
No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any
means, electronic, mechanical, photocopying, recording or otherwise, without the prior written permission of
the publisher, Elsevier Science Publishers B.V., Copyright & Permissions Department, P.O. Box 521, 1000 AM
Amsterdam, The Netherlands.
`
Special regulations for readers in the U.S.A. - This publication has been registered with the Copyright Clearance
Center Inc. (CCC), Salem, Massachusetts. Information can be obtained from the CCC about conditions under
which photocopies of parts of this publication may be made in the U.S.A. All other copyright questions, including
photocopying outside of the U.S.A., should be referred to the publisher, Elsevier Science Publishers B.V., unless
otherwise specified.
`
No responsibility is assumed by the publisher or by IFIP for any injury and/or damage to persons or property
as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products,
instructions or ideas contained in the material herein.
`
pp. 119-134, 199-218, 267-281, 367-381: Copyright not transferred
`
`This book is printed on acid-free paper.
`
`Printed in The Netherlands
`
`
`
Table of Contents

Preface  v
Program Committee  xi
List of Reviewers  xii

Session A: MAC Layer Enhancements  1
Chair: Harmen van As, IBM Research, Switzerland

OQDB for Time Constrained Services  3
Guven Mercankosk, Z.L. Budrikis, QPSX Communications Ltd, Australia,
A. Cantoni, Australian Telecommunications Research Institute, Australia

A New Reservation Scheme for CRMA High-Speed Networks  15
Nen-Fu Huang, Chung-Ching Chiou, Chiung-Shien Wu,
National Tsing Hua University, Republic of China

A Host Interface Architecture for High-Speed Networks  31
Peter A. Steenkiste, Brian D. Zill, H.T. Kung, Steven J. Schlick,
Carnegie Mellon University, USA,
Jim Hughes, Bob Kowalski, John Mullaney,
Network Systems Corporation, USA

Session B: Flow and Rate Control  47
Chair: Marjory Johnson, RIACS, USA

Dynamic Bandwidth Allocation and Access Control of Virtual Paths
in ATM Broadband Networks  49
Ibrahim Wahby Habib, Tarek N. Saadawi, City University of New York, USA

Congestion Control - Effective Bandwidth Allocation in ATM Networks  65
E.D. Sykas, K.M. Vlakos, K.P. Tsoukatos, E.N. Protonotarios,
National Technical University of Athens, Greece
`
`
`
A High Speed Data Link Control Protocol  81
Ahmed N. Tantawy, IBM Res. Div., T.J. Watson Research Center, USA,
Hanafy Meleis, DEC, Reading, UK

Session C: Parallel Implementation and Transport Protocols  101
Chair: Guy Pujolle, Universite P. et M. Curie, France

Parallel TCP/IP for Multiprocessor Workstations  103
Kurt Maly, S. Khanna, R. Mukkamala, C.M. Overstreet, R. Yerraballi,
E.C. Foudriat, B. Madan, Old Dominion University, USA

TCP/IP on the Parallel Protocol Engine  119
Erich Rütsche, Matthias Kaiserswerth,
IBM Research Division, Zurich Research Laboratory, Switzerland

A High-Speed Protocol Parallel Implementation: Design and Analysis  135
Thomas F. La Porta, AT&T Bell Laboratories, USA,
Mischa Schwartz, Columbia University, New York, USA

Session D: Multimedia Communication Systems  151
Chair: Radu Popescu-Zeletin, GMD FOKUS, Germany

Orchestration Services for Distributed Multimedia Synchronisation  153
Andrew Campbell, Geoff Coulson, Francisco Garcia, David Hutchison,
Lancaster University, UK

Towards an Integrated Quality of Service Architecture (QOS-A) for
Distributed Multimedia Communications  169
Helmut Leopold, Alcatel ELIN Research, Austria,
Andrew Campbell, David Hutchison, Lancaster University, UK,
Niklaus Singer, Alcatel ELIN Research, Austria

JVTOS - A Reference Model for a New Multimedia Service  183
Gabriel Dermler, University of Stuttgart, Germany,
Konrad Froitzheim, University of Ulm, Germany

Experiences with the Heidelberg Multimedia Communication System:
Multicast, Rate Enforcement and Performance  199
Andreas Cramer, Manny Farber, Brian McKellar, Ralf Steinmetz,
IBM European Networking Center, Germany
`
`
`
Session E: QoS Semantics and Management  219
Chair: Martina Zitterbart, IBM Res. Div., Watson Research Center, USA

Client-Network Interactions in Quality of Service Communication
Environments  221
Domenico Ferrari, Jean Ramaekers, Giorgio Ventre,
International Computer Science Institute, USA

The OSI 95 Connection-mode Transport Service: The Enhanced QoS  235
Andre Danthine, Yves Baguette, Guy Leduc, Luc Leonard,
University of Liege, Belgium

QoS: From Definition to Management  253
Noemie Simoni, Simon Znaty, TELECOM Paris, France

Session F: Evaluation of High Speed Communication Systems  265
Chair: Otto Spaniol, Technical University Aachen, Germany

ISO OSI FTAM and High Speed File Transfer: No Contradiction  267
Martin Bever, Ulrich Schaffer, Claus Schottmüller,
IBM European Networking Center, Germany

Analysis of a Delay Based Congestion Avoidance Algorithm  283
Walid Dabbous, INRIA, France

Performance Issues in Designing Network Interfaces: A Case Study  299
K.K. Ramakrishnan, Digital Equipment Corporation, USA
`
Session G: High Performance Protocol Mechanisms  315
Chair: Craig Partridge, BBN, USA

Multicast Provision for High Speed Networks  317
A.G. Waters, University of Essex, UK

Transport Layer Multicast: An Enhancement for XTP Bucket
Error Control  333
Harry Santoso, MASI, Universite P. et M. Curie, France,
Serge Fdida, MASI, Universite Rene Descartes, France

A Performance Study of the XTP Error Control  351
Arne A. Nilsson, Meejeong Lee,
North Carolina State University, USA
`
`
`
Session H: Protocol Implementation  365
Chair: Samir Tohme, E.N.S.T., France

ADAPTIVE: An Object-Oriented Framework for Flexible and Adaptive
Communication Protocols  367
Donald F. Box, Douglas C. Schmidt, Tatsuya Suda,
University of California, Irvine, USA

HIPOD: An Architecture for High-Speed Protocol Implementations  383
A.S. Krishnakumar, J.G. Kneuer, A.J. Shaw, AT&T Bell Laboratories, USA

Parallel Transport System Design  397
Torsten Braun, University of Karlsruhe, Germany,
Martina Zitterbart, IBM Res. Div., T.J. Watson Research Center, USA

Session I: Network Interconnection  413
Chair: Augusto Casaca, INESC, Portugal

A Rate-based Congestion Avoidance Scheme for Interconnected
DQDB Metropolitan Area Networks  415
Nen-Fu Huang, Chiung-Shien Wu, Chung-Ching Chiou,
National Tsing Hua University, Rep. of China

Interconnection of LANs/802.6 Customer Premises Equipments (CPEs)
via SMDS on Top of ATM: A Case Description  431
W. Rozenblad, B. Li, R. Peschi,
Alcatel Bell Telephone, Research Centre, Belgium

Architectures for Interworking between B-ISDN and Frame Relay  443
J. Vozmediano, J. Berrocal, J. Vinyes,
ETSI Telecomunicacion, Spain

Author Index  455
`
`
`
High Performance Networking, IV (C-14)
A. Danthine and O. Spaniol (Editors)
Elsevier Science Publishers B.V. (North-Holland)
© 1993 IFIP.
`
TCP/IP on the Parallel Protocol Engine

Erich Rütsche and Matthias Kaiserswerth

IBM Research Division, Zurich Research Laboratory
Säumerstrasse 4, 8803 Rüschlikon, Switzerland
`
Abstract

In this paper, a parallel implementation of the TCP/IP protocol suite on the Parallel Protocol Engine (PPE), a multiprocessor-based communication subsystem, is described. The execution times of the various protocol functions are used to analyze the system's performance in two scenarios. In the first scenario we execute the test application on the PPE; in the second we evaluate the potential performance of our TCP/IP implementation when it is driven by an application on the workstation. For the second scenario, the end-to-end performance of our implementation on a four-processor PPE system is more than 3300 TCP segments per second.

Keyword Codes: C.1.2; C.2.2; D.1.3
Keywords: Multiple Data Stream Architectures (Multiprocessors); Network Protocols;
Concurrent Programming
`
1. INTRODUCTION

Progress in high-speed networking technologies such as fiber optics has shifted the bottleneck in communications from the limited bandwidth of the transmission media to protocol processing and the operating system overhead in the workstation. So-called lightweight protocols and protocol offload to programmable adapters are two approaches proposed to cope with this problem. Protocols such as the Xpress Transfer Protocol (XTP)¹ [PEI 92] and VMTP [Cheriton 88] try to simplify the control mechanisms and packet structures such that the protocol implementation becomes less complex and can possibly be done in hardware. We took the second approach in building the Parallel Protocol Engine (PPE) [Kaiserswerth 92], a multiprocessor-based communication adapter, upon which protocol processing can be offloaded from a host system. The Nectar CAB [Arnould 89] and the VMP Network Adapter Board [Kanakia 88] are other programmable adapters, each based on a single protocol processor. The XTP chipset [Chesson 87] is a very specialized set of RISC processors designed to execute the XTP protocol. Our objective was to investigate and exploit parallelism in many different protocols. Therefore we decided to develop a general-purpose communication subsystem capable of supporting standard protocols efficiently in software.

¹ Xpress Transfer Protocol and XTP are registered trademarks of the XTP Forum.
`
`
`
`120
`
`In this paper our goal is to demonMrate that a careful 1mplementation of a Mandard tmnspon pro(cid:173)
`tocol stack on a general-purpose multiprocessor architecture allows efficient use of the band(cid:173)
`width availabk Ill today's high-,peed networks. A, an example, we chose to implement the
`TCP/IP protocol l.uite on our -t-processor prototype of the PPE.
`
`We implemented the \ockct interface and a test application directly on the PPE to facilitate our
`performance measurements. In this tesr \Cenario we analyze the performance of TCP/IP and the
`socl..ct layer. We also exanuncd a second scenario to unden.tand how our 1mplementauon would
`pcrfonn when integrated into a workstation. where protocol processing up to the transport layer 15
`perfom1ed on the PPE and applicauons can access the transpon service via the socket interface on
`the:: workstation.
`
`In Section 2 our hardware platform, the PPE. is presented. Section 3 introduces TCP/IP. In the
`follm-.mg secuon we C\pl<tin our approach to parallt:I protoc.:ol unplementation. Section 5 prc\(cid:173)
`e111s the results and discusses the impact of the hardware and software architecture on perfor(cid:173)
`mance. The last section gives the condusion and an outlook on our furure work.
`
2. THE PARALLEL PROTOCOL ENGINE

The PPE is presented only briefly here. It is described in greater detail in [Wicki 90] and [Kaiserswerth 91, 92]. We will first concentrate on the hardware and then present the programming environment.

The PPE is a hybrid shared-memory/message-passing multiprocessor. Message passing is used for synchronization, whereas shared memory is used to store service primitives and protocol frames. Figure 1 shows the architecture of the PPE and its use as a communication subsystem.

The PPE uses two separate memories, one for transmitting, one for receiving data. Both of these memories are mapped into the address space of the workstation. In our implementation, four T425 transputers [INMOS 89] are used as protocol processors. On each side of the adapter, two T425s have access to the shared memory. Each processor uses private memory to store its program and local data. We decided against using a single shared memory for storing both inbound and outbound protocol data, although this would make the adapter more flexible and facilitate programming, for the following reason. High-speed network interfaces work in a synchronous fashion, with data being clocked in and out of memory, possibly at the same time, at the transmission speed of the physical network. Splitting the adapter into separate receive and transmit parts accommodates simultaneous transmission and reception and only requires memory with half the speed of that required for a single-memory solution. This architecture results in significant cost savings, especially when transmission speeds exceed 100 Mb/s.

The network interface has read access to the transmit side and write access to the receive side of the adapter. We emulate a physical network by means of an 8-bit wide parallel interface, which allows a point-to-point connection between two PPE systems operating with a bidirectional transmission rate of up to 120 Mb/s. The transputer links are used exclusively for signalling and control message transfer within the PPE and to and from the host system.

The programming language which best describes the transputer's programming model is OCCAM [Pountain 88]. It is based on the theory of Communicating Sequential Processes (CSP) developed by Hoare [Hoare 78]. The structuring elements are processes that communicate and synchronize via messages.
`
`
`
[Figure 1. Architecture of the PPE: the Micro Channel interface, shared TRANSMIT memory, and shared RECEIVE memory connect the protocol processors between the application/protocol layers and the physical layer.]
`
Message transfer is unbuffered; communicating processes must reach a rendezvous before the message is copied directly from the sender's to the receiver's address space. This behavior maps directly to the transputer's register model and microcode, which support efficient context switches and transparent message passing via four external links and any number of internal soft channels. However, because OCCAM discourages the use of pointers and shared memory between different processes and offers very little support for user-defined structured data types, we chose to do our implementation in the C programming language [LS-C 89]. Access to the transputer-specific facilities, such as synchronous message passing and process control, is provided through library functions, which can, in part, also be generated as more efficient inline code by the compiler.
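As a minimal illustration of this programming model, the following C sketch shows two processes rendezvousing over a channel. The channel routines are hypothetical names modeled on the message-passing library functions that transputer C compilers provide; they are not the actual interface used in our implementation.

#include <stddef.h>

/* Hypothetical channel API, modeled on transputer C library routines. */
typedef struct Channel Channel;
extern void chan_out(Channel *c, const void *buf, size_t len); /* blocks until receiver ready */
extern void chan_in(Channel *c, void *buf, size_t len);        /* blocks until sender ready   */

/* Producer: blocks in chan_out() until the consumer has issued the
 * matching chan_in() -- the CSP rendezvous. */
void producer(Channel *to_consumer)
{
    int value = 42;
    chan_out(to_consumer, &value, sizeof value);
}

/* Consumer: the message is copied directly from the sender's to the
 * receiver's address space; there is no intermediate buffering. */
void consumer(Channel *from_producer)
{
    int value;
    chan_in(from_producer, &value, sizeof value);
    /* ... use value ... */
}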
`
3. THE TCP/IP PROTOCOL STACK

We implemented the full TCP/IP protocol stack on the PPE. It consists of the Internet Protocol (IP), the Internet Control Message Protocol (ICMP), and the Transmission Control Protocol (TCP). Applications interface to the protocol implementation via sockets, similar to the BSD version of Unix².

² Unix is a registered trademark of AT&T in the United States and other countries.
`
`
`
`122
`
IP is a datagram protocol that implements functions similar to those of the OSI Connectionless Network Protocol (CLNP). ICMP, which is an integral part of IP, is used to exchange control messages between internet clients; e.g., it generates a destination-unreachable message when the addressing information in a received datagram does not allow forwarding or local delivery. TCP, which roughly implements the ISO Transport Layer functions, provides an error- and flow-controlled end-to-end transport connection between applications. TCP thus builds reliable data transmission services on top of the unreliable IP datagram service. A TCP connection is specified through the pair of Internet addresses and the TCP port identifiers of the two communicating partners. The socket is the local end point of a TCP/IP connection. The application program accesses sockets through local identifiers, similar to file descriptors in Unix.
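As a minimal illustration (ours, not taken from the implementation), the 4-tuple that names a TCP connection can be represented in C as:

#include <stdint.h>

/* Illustrative only: the 4-tuple identifying a TCP connection. */
struct conn_id {
    uint32_t local_addr;    /* local Internet address   */
    uint32_t remote_addr;   /* remote Internet address  */
    uint16_t local_port;    /* local TCP port           */
    uint16_t remote_port;   /* remote TCP port          */
};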
`
As we did not want to implement TCP/IP from scratch, we based our work on a version of TCP/IP for MS/DOS from the University of Maryland [UM 90].
`
4. PARALLEL IMPLEMENTATION OF TCP/IP

To develop a parallel solution one needs to partition the problem into a set of subproblems that can be executed in parallel. The algorithms solving these subproblems are typically encapsulated in cooperating processes which are mapped to the parallel-processor hardware. Depending on the underlying hardware and the implementation model chosen, these processes communicate and synchronize via shared memory or message passing.
`
[Figure 2. MS/DOS IP process structure: the application and user_task exchange data through buffers with tcp_task and the IP layer; the procedures ip_send and driver_send lead to the network adapter.]
`
`
`
`123
`
`The source code, which served as a basi' for lhis implemen1a11on, was already struc1urcd imo
`111ul11plc processes 1ha1 run on 1op of a simple, non-precmp1ive mullimskmg kemcl. Figure 2
`shew•' 1he original spli1into 1hree processes and one in1errupt service routine. Having such a pro(cid:173)
`ce5S s1ruc1ure a11m~t:d us 10 May fairly close to the original source.
`
`As we: wan1ed to execute the IP layer on d1ffert:n1 processors dwn the TCP layer, we first isolated
`the IP rclcvan1 functions from bo1h tcp_task and ip_task into separate processes. Because of the
`f um:tional division of the PPE 11110 a rransmit and receive side, we 1hen spli11hc remainder, i.e. !he
`core ol the TCP protocol, of tcp_task and lp_task vemcallr m10 three processes (rtask,
`tcp_recv running on the receive side and xtask running on 1hc 1ransmi1 side). We will describe
`1he func11ons of 1hc various pro1ocol processes, 1hu1 implemen1 IP, TCP and tht: socket layer in
`1urn. Pigurt: 3 shows the high-level process struc1ure we derived for our implcmenta1ion.
`
[Figure 3. High-level process structure: on both the transmit and the receive side, the stack spans the application, socket layer, Transmission Control Protocol, Internet Protocol (with processes such as ip_demux and ip_intersvc) and media access control, down to the network adapter.]
`
In the following we present our parallel solution in a top-down approach, first showing the high-level process graph of the main processes in our implementation. These processes have access to data shared between the transmit and receive side and can interact with one another via high-level primitives such as remote procedure calls (RPCs) and queues. In a second step, we will then show how these services, in particular shared data between the receive and transmit side as well as RPCs from the receive to the transmit side, have been realized on the PPE.
`
`
`
`124
`
4.1 IP and ICMP
Because IP is a datagram protocol, the normal flow of data through IP in an end-system requires no interaction between the receiving and transmitting part. Routing information and exception handling, however, require a data exchange. The handling of exception and control messages is the function of ICMP. We therefore partitioned IP into two independent processes, icmp_demux and ip_demux. To guarantee the timely handling of incoming packets, we dedicated a separate process on the receive side of the PPE to the handling of the physical network interface.

The routing table is shared between the processes on the transmit and receive side of the PPE. An RPC is used if icmp_demux needs to send out an ICMP message.
`
4.2 TCP
Splitting the PPE hardware into a separate send and receive side had more impact on how we had to deal with TCP, the socket layer, and the application layer than it had on IP.

We decided to split the finite state machine (FSM) responsible for implementing a TCP connection into two separate FSMs once the connection is in the data phase. The actions of these FSMs are implemented on the receive side through two processes, rtask and tcp_recv. On the transmit side one process, xtask, implements the FSM. Owing to the duplex nature of TCP and the piggybacking of control information in data packets, these processes need to share the protocol's send and receive state variables maintained in the transmission control block (TCB).

tcp_recv demultiplexes incoming TCP segments, locates the appropriate TCB and executes the required action for the FSM state. Header prediction is used to speed up packet handling for packets arriving consecutively on the same connection. Correctly received segments are appended to the receive queue and the application process waiting on this connection is then woken up to move the data to its own buffers. When the received data exceeds the acknowledgement threshold, which is specified as a percentage of the advertised receive window, tcp_recv makes an RPC to the transmit side to generate an acknowledgement. The acknowledgement is sent as a separate packet, unless this information can be piggybacked onto an outgoing data segment.
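The threshold test in tcp_recv can be pictured as follows. This is a minimal sketch in C with assumed names and TCB layout; the paper does not give the actual code.

/* Sketch of the acknowledgement-threshold check in tcp_recv.
 * All names and fields are illustrative assumptions. */
struct tcb {
    unsigned long rcv_unacked;     /* bytes received but not yet acknowledged */
    unsigned long rcv_window;      /* advertised receive window, in bytes     */
    unsigned int  ack_thresh_pct;  /* threshold as a percentage of the window */
};

extern void rpc_generate_ack(struct tcb *tcb);   /* RPC to the transmit side */

void tcp_recv_note_data(struct tcb *tcb, unsigned long nbytes)
{
    tcb->rcv_unacked += nbytes;
    if (tcb->rcv_unacked * 100 >= tcb->rcv_window * tcb->ack_thresh_pct) {
        /* Ask the transmit side for an acknowledgement; it goes out as a
         * separate packet unless it can be piggybacked on outgoing data. */
        rpc_generate_ack(tcb);
        tcb->rcv_unacked = 0;
    }
}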
`
rtask is driven by two timers, one responsible for delayed acknowledgements, the other for keep-alive messages. In steady-state data transmission, rtask should never generate an acknowledgement, as tcp_recv already generates acknowledgements while data are received. Only when the timer runs out and new unacknowledged data have been received since the last acknowledgement will rtask generate an acknowledgement. Similarly, keep-alive messages are also sent only when no activity has taken place on a connection for some time. Again, both acknowledgements and keep-alive messages are generated via RPCs to the transmit side.

On the transmit side the process xtask manages the transmit queue and the retransmission timers. To send data, xtask creates the TCP header and fills in the necessary information from the TCB, such as addresses and sequence numbers for the data and acknowledgements. The header and a pointer to the data are then passed to the IP process (procedure ip_send), which embeds this information into an IP datagram.
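In outline, xtask's send step might look like this. The field and helper names are assumed for illustration; only ip_send is named in the text above.

/* Sketch of xtask's send step: build the TCP header from the TCB and
 * hand the header plus a data pointer to ip_send. Illustrative names. */
struct tcb_snd {
    unsigned short local_port, remote_port;
    unsigned long  local_addr, remote_addr;
    unsigned long  snd_next;    /* next send sequence number    */
    unsigned long  rcv_next;    /* acknowledgement to piggyback */
};

struct tcp_header {
    unsigned short src_port, dst_port;
    unsigned long  seq, ack;
    /* ... flags, window, checksum ... */
};

extern void ip_send(struct tcp_header *hdr, void *data, unsigned long len,
                    unsigned long src, unsigned long dst);

void xtask_send(struct tcb_snd *tcb, void *data, unsigned long len)
{
    struct tcp_header hdr;

    hdr.src_port = tcb->local_port;
    hdr.dst_port = tcb->remote_port;
    hdr.seq      = tcb->snd_next;
    hdr.ack      = tcb->rcv_next;   /* piggybacked acknowledgement */
    /* ... fill in flags, advertised window, checksum ... */

    /* ip_send prepends the IP header and passes the datagram on. */
    ip_send(&hdr, data, len, tcb->local_addr, tcb->remote_addr);
    tcb->snd_next += len;           /* advance the send sequence space */
}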
`
4.3 Socket Layer and Application
To facilitate our experiments with TCP/IP, we decided as a first step to implement the entire socket layer as well as the test application on the PPE. A detailed description of the interactions between an application on the host system and a protocol on the PPE can be found in [Kaiserswerth 92].

In our implementation, the socket layer, although tightly coupled with TCP, is part of the application process. It is accessed via a procedural interface, used to create a socket, bind an address to it and establish a TCP connection with a remote socket. As the FSM logic to establish connections is also part of tcp_recv, we decided to place the socket and the application code on the receive side. Because we wanted to avoid moving data to be sent from the receive side to the transmit side via a transputer link³, we also allow the application to use buffers on the transmit side of the PPE. When data is to be transmitted, the send procedure simply makes an RPC with the buffer address on the transmit side, thus causing the write process to copy the data from this application buffer to the TCP send queue. When the application wants to receive data, the receive procedure checks the receive queue for this connection and blocks the application process if the queue is empty.
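A minimal sketch of this send path follows; the socket call name and the RPC message layout are assumptions, not the actual interface.

/* Sketch of the socket send procedure. The data already reside in a
 * buffer allocated on the transmit side, so only a descriptor crosses
 * the transputer link. Names are illustrative. */
struct rpc_send_req {
    int           sock;     /* local socket identifier             */
    void         *tx_buf;   /* buffer address on the transmit side */
    unsigned long len;      /* number of bytes to send             */
};

#define RPC_OP_SEND 1       /* illustrative opcode */

extern long rpc_call(int opcode, void *req, unsigned long req_len);

long sock_send(int sock, void *tx_buf, unsigned long len)
{
    struct rpc_send_req req;

    req.sock   = sock;
    req.tx_buf = tx_buf;
    req.len    = len;

    /* The RPC makes the per-connection write process copy the data
     * from the application buffer into the TCP send queue. */
    return rpc_call(RPC_OP_SEND, &req, sizeof req);
}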
`
4.4 Low-Level Primitives
Before giving an example of how TCP data segments are sent and received, we describe how we maintain shared TCBs and routing tables on the PPE, which does not have shared memory between its transmit and receive side, and how we realize RPCs from the receive to the transmit side. Figure 4 shows the process graph of the additional processes required to implement these functions.
`
[Figure 4. Low-level primitives: rpc_process, the write processes and a peek_poke server on the transmit side; rpc_demux and a peek_poke server on the receive side. Dedicated transputer links carry messages between the two sides; internal channels connect these servers to their client processes.]
`
We implement distributed shared memory between the transmit and receive side by placing the data structures that are to be shared at identical physical addresses in the local memories of the two processors which access the data structure. Whenever a value is written onto the local copy of the data structure, the address of the variable and its value are sent via a dedicated transputer link to a server process, peek_poke, on the remote side. This process then updates the memory area identified through the address with the accompanying value. The peek_poke processes run at high priority to ensure that the exchange of a message with a remote process takes place immediately and is not delayed by scheduling overhead, which would then also delay the remote process because of the transputer's synchronous message passing. Serializing write accesses to the shared data structures is not necessary in our case. Each replicated data structure falls into two parts, one written only from the receive side (e.g., the updated transmit window), and the other written only from the transmit side (e.g., the last send sequence number).

³ We measured an effective throughput rate of approximately 14 Mb/s across a transputer link, clearly much lower than via our high-speed parallel interface.
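The write-through replication just described can be sketched as follows; the helper and message names are assumptions, as the paper does not give the actual layout. Because the shared structures sit at identical addresses on both sides, the local address of a field also identifies its remote copy.

#include <stddef.h>

/* Sketch of the peek_poke write-through replication (names assumed). */
struct poke_msg {
    void         *addr;    /* address of the field, valid on both sides */
    unsigned long value;   /* new value to install                      */
};

extern void link_out(const void *buf, size_t len);  /* dedicated transputer link */
extern void link_in(void *buf, size_t len);

/* Writer side: update the local copy, then propagate the change. */
void shared_write(unsigned long *field, unsigned long value)
{
    struct poke_msg msg;

    *field    = value;              /* local update                 */
    msg.addr  = field;
    msg.value = value;
    link_out(&msg, sizeof msg);     /* remote peek_poke installs it */
}

/* Remote side: the high-priority peek_poke server loop. */
void peek_poke(void)
{
    struct poke_msg msg;

    for (;;) {
        link_in(&msg, sizeof msg);
        *(unsigned long *)msg.addr = msg.value;
    }
}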
`
Since we do not have a locking protocol for accessing shared data structures, it is possible that for a brief period, after the local update and before the remote update has been propagated, the same field in the shared data structure contains two different values. Because of the properties of TCP and the way we have split the protocol onto the transmit and receive side of the PPE, this inconsistency will only be of importance if it is the reason for the protocol state to change. As an example consider the following: assume the retransmission timer (it is also maintained in the TCB) in xtask expires and, because the acknowledgement field in the TCB does not indicate reception of an acknowledgement, xtask decides to retransmit the unacknowledged TCP segments. On the receive side, however, an acknowledgement has been received in the meantime which makes this retransmission unnecessary⁴. To avoid this problem, before actually going to a retransmit state, xtask will reread the acknowledgement field, now however with the value on the receive side, to make sure that a retransmission is warranted. Reading a remote field is similar to writing; a message with the address and size of the variable is sent to the remote peek_poke process, which then returns the value of that field.
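This safeguard might be expressed as follows (a sketch with assumed names; shared_read_remote stands for the remote peek operation described above):

/* Sketch of xtask's retransmission safeguard (illustrative names). */
struct retx_state {
    unsigned long snd_una;    /* highest acknowledged sequence number */
    unsigned long snd_next;   /* next sequence number to be sent      */
};

extern unsigned long shared_read_remote(unsigned long *field); /* remote peek */
extern void retransmit_from(struct retx_state *st, unsigned long seq);

void on_retransmit_timeout(struct retx_state *st)
{
    if (st->snd_una < st->snd_next) {   /* local copy: data look unacked */
        /* Reread the acknowledgement field with the value held on the
         * receive side before committing to a retransmission. */
        unsigned long una = shared_read_remote(&st->snd_una);
        if (una < st->snd_next)
            retransmit_from(st, una);   /* retransmission is warranted */
    }
}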
`
RPCs from the receive to the transmit side have been implemented as follows: any process on the receive side can format an RPC message, which is then sent via a dedicated transputer link to the rpc_process. This process will then execute the remote procedure or, in the case of transmission requests, pass the request via a local (internal) channel to the appropriate write process, one of which exists for each TCP connection. Return values are sent, again via a dedicated transputer link, back to the receive side to rpc_demux, which forwards these values over a local channel to the process that had initiated the RPC. Upon receiving the return value, the caller becomes ready again and can continue its execution.
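The dispatch loop of rpc_process might look like this; opcodes, message layout and helpers are assumptions for illustration.

#include <stddef.h>

/* Sketch of the transmit-side rpc_process dispatch loop (names assumed). */
struct rpc_msg {
    int           opcode;   /* which remote procedure to run               */
    int           conn;     /* connection index, for transmission requests */
    unsigned long arg;      /* illustrative argument word                  */
};

#define RPC_OP_SEND 1       /* forwarded to the per-connection write process */

extern void link_in(void *buf, size_t len);                    /* from receive side */
extern void link_out(const void *buf, size_t len);             /* back to rpc_demux */
extern void chan_out_write(int conn, const struct rpc_msg *m); /* internal channel  */
extern unsigned long execute_local(const struct rpc_msg *m);

void rpc_process(void)
{
    struct rpc_msg msg;
    unsigned long result;

    for (;;) {
        link_in(&msg, sizeof msg);
        if (msg.opcode == RPC_OP_SEND) {
            /* Transmission request: hand off over a local channel. */
            chan_out_write(msg.conn, &msg);
        } else {
            /* Execute the remote procedure and return its value to
             * rpc_demux, which unblocks the caller on the receive side. */
            result = execute_local(&msg);
            link_out(&result, sizeof result);
        }
    }
}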
`
4.5 Example
Sending a TCP data segment: The normal data flow is shown in Figure 3. The send data are in a remotely allocated buffer on the transmit side. The application creates a socket and establishes a TCP connection. The socket send call causes an RPC to the remote write process, which in turn copies the data into the TCP send buffer. xtask then controls the transmission and eventual retransmissions of the data. The send procedure builds the TCP segment and forwards the pointer to the segment and the associated control block to ip_send. Here the IP header is placed in front of the TCP segment and then the packet is sent to the network. The data is copied twice: first from the application buffer to the send queue in shared memory, and from there to the network.

Receiving a TCP data segment: Upon receipt the data is also copied twice: first from the network to the receive queue, and from there to the application buffer. The interrupt handler process serves the physical interface and forwards pointers to received datagrams to ip_demux, which checks the header and forwards the packet, depending on its type, to tcp_recv or icmp_demux.

tcp_recv analyzes the TCP header and calls the appropriate handler function for a given protocol state. To send an acknowledgement or a control packet, tcp_recv uses RPCs to the transmit side. Correctly received segments are appended to the receive queue. rtask wakes up the application process which is blocked in the socket receive procedure. This procedure then fills the user buffer with data from the receive queue.

⁴ Note: the logic of the protocol would allow for a retransmission in any case.
`
`
`
`127
`
4.6 Configuration
On each side of the PPE only one of the two processors is physically able to control the interface to the network. Thus we placed the device driver and the IP layer processes on those two processors. TCP, the socket layer, and the application execute on the second processor on each side of the PPE.
`
4.7 Memory Management
The buffer memory in the protocols and the socket is managed in an mbuf-like linked list. There is only one buffer size, to simplify these functions. The buffer size determines the maximum TCP segment size. Provided there is sufficient physical memory (up to 4 MB on each side of the PPE), large fixed-size buffers help avoid costly memory management functions. Buffer queues and the free buffer list are protected by semaphores to serialize access to the data structures from different processes on the same processor.

The data and control flow in the PPE is organized such that only one processor requests buffers and the other only releases buffers. We ensure that one buffer element always remains in the queue; thus one processor can always append to the end of the buffer queue and the other processor can consume the first element without requiring any additional queue access protocol between the two processors.
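This discipline amounts to a single-producer/single-consumer queue that always keeps one element as a separator, sketched below (illustrative code, not from the original source; the queue is assumed to be initialized already holding one element):

/* Sketch of the two-processor buffer queue: one processor only appends
 * at the tail, the other only consumes at the head, and one element
 * always remains, so head and tail are never updated on the same link
 * concurrently. Names are illustrative. */
struct buf {
    struct buf *next;
    /* ... buffer payload ... */
};

struct bufq {
    struct buf *head;    /* touched only by the consuming processor */
    struct buf *tail;    /* touched only by the producing processor */
};

/* Producer: append after the current tail. */
void bufq_append(struct bufq *q, struct buf *b)
{
    b->next = 0;
    q->tail->next = b;
    q->tail = b;
}

/* Consumer: take the head only if a successor exists, so that at
 * least one element always remains in the queue. */
struct buf *bufq_take(struct bufq *q)
{
    struct buf *h = q->head;

    if (h->next == 0)
        return 0;        /* taking it would empty the queue */
    q->head = h->next;
    return h;
}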
`
5. PERFORMANCE

5.1 Test Setup
To measure the performance of the TCP/IP implementation we used a simple test driver running on the PPE: a source process on one system that sends data over a socket and TCP/IP to a sink process on the other system, which receives the data. The setup is shown in Figure 5.
`
[Figure 5. Test environment: a source application on one PPE system sends through its socket/TCP transmit side to the receive side, socket, and sink application on the other PPE system.]
`
As the final goal of this work is to offload protocol processing from the workstation to the adapter, we examined the following two scenarios:

Scenario 1. The complete socket layer is implemented on the subsystem. Upon receip