Evaluation of Bypass Architecture for High-Speed Bulk
Data Transfer through deeply layered protocol stacks
Yong Hua Thia, BSc, MSc, ole
A Thesis submitted to the Faculty of Graduate Studies and Research in partial
fulfillment of the requirements for the degree of
Doctor of Philosophy
Ottawa-Carleton Institute for Electrical Engineering,
Faculty of Engineering,
Department of Systems and Computer Engineering,
Carleton University,
Ottawa, Ontario, Canada, K1S 5B6
March 5, 1994
© Yong Hua Thia, 1993


`ISBN 0-315-92944-8








Techniques for handling high data throughput in end-systems attached to data networks
are the subject of this thesis. The advent of Fibre Optic technology, which offers
high bandwidth and low bit error rates, has shifted the performance bottleneck in data
communications from the channel to the processing in the end-points of the system. To
alleviate the end-system bottleneck one may consider new protocols, improved software
implementation of existing protocols, parallel processing techniques, special protocol
structures, and hardware assist by offloading all or part of the protocol functions to an
attached processor or adapter.
In this thesis, speedup in protocol processing is accomplished by using a combination
of approaches based on general software performance engineering principles. A novel
concept for protocol implementation, bypassing, which is a generalization of Jacobson's
"Header Prediction" algorithm for TCP/IP, is introduced. Bypassing identifies a fast path
for bulk data transfer, by which frequent simple operations are found and optimized. We
propose and evaluate three different methods for bypassing. In the first approach,
throughput performance is increased by pure software improvements due to bypass. The
second approach uses dedicated VLSI hardware for offloading protocol functions in the
bypass path. A feasibility study was conducted using the industry standard VHDL language
to model the bypass chip, which we call ROPE (Reduced Operation Protocol Engine).
Results show that it is feasible to implement the bypass path for the Session and Transport
layers of OSI protocols in VLSI with current gate array technology, and that ROPE is
capable of supporting a very high data rate: 313 Mbps with TP4 checksum. Finally, in
the third approach, a scalable packet-level architecture, the Parallel Optimistic Bypass
Architecture (POBA), is proposed and evaluated. It uses digital signal processing
components which are readily available. A simulation with realistic instruction
execution times was performed, and its results were validated with an analytical model.
With POBA, a system executing the three layers (transport, session and presentation)
can scale up to speeds close to the host memory bandwidth.


This research was supported in large part by the Telecommunications Research
Institute of Ontario (TRIO).
The debt of gratitude that a PhD student owes his advisor at the end of a project
this size is enormous. I am greatly indebted to Prof. Murray Woodside for his excellent
guidance, continued encouragement, and generous support, as well as his extraordinary
patience in improving my writing and communication skills.
Discussions with Craig Scratchley, Erik Dravnieks, Govindachari Raghunath, Reza
Etemadi, Srinivas Vunn:J.v:J., Weimin Ma, Yao Li and many others have been both
invigorating and informative. Greg Franks helped me get started on the testbed
implementation and Curtis Hrischuk was always there to help with questions pertaining
to PARASOL.
I would also like to thank Bell-Northern Research, which provided the VHDL tools
and facility used in this study. I am particularly indebted to Dr. Simon Curry, Hemi
Thakar, Dr. Parviz Yousefpour, Bernard Dnray, Mike Majid and Mustapha r:
who helped with the use of their tools and made my stay at BNR an enjoyable and
memorable one.
I am grateful to all my friends and fellow students in the Systems and Computer
Engineering department. In particular, I enjoyed playing squash and going on ski trips
with Norm Lo, Francois Vien, Robert Dussault and Quyen Vi Bo. Special thanks are due
to the office and support staff, Naren Mehta, John Pope, Dave Sword, Danny Lemay,
Darlene Hebert, Elena Keen, Vivienne Gilchrist, and others who assisted in one way
or another.
Last but not least, I am deeply grateful to my parents for their support and
encouragement. My wife Irene and son Brandon endured years of sacrifice and hardships to
see me through this program. I owe my family so much that I could find no words to
express my deepest gratitude.


List of Figures
List of Tables
1.1 The problem statement and motivation
1.2 The proposed solution
1.3 Summary of contributions
1.4 Thesis Outline
2.1 Introduction
2.2 OSI Architecture and layering
2.2.1 Why Layering?
2.2.2 Layering Principles
2.3 Overview of the Seven Layers of the OSI Architecture
2.4 Why are protocol implementations slow?
2.4.1 Processing and data path of a typical application PDU
2.4.2 Host Architecture
2.4.3 Adapter
2.4.4 Host/Adapter interface
2.4.5 Operating system support
2.4.6 Buffer Management
2.4.7 Timer Management
2.4.8 Software Implementation
2.4.9 Protocol structure and mechanism
2.5 Software Performance Engineering Principles
3 Review of the State of the Art in Performance Improvement of Upper Layer Protocols
3.1 Introduction
3.2 Lightweight Transport Protocols and their mechanisms to support high-speed networks
3.2.1 Summary of Transport Service Functions that affect performance of high-speed networks
 Connection Management
 Flow Control and Acknowledgment
 Error Handling
3.2.2 Lightweight Transport Protocols
3.3 Special Implementation Techniques
3.3.1 Header Prediction
3.4 Hardware Implementation
3.5 Parallel Hardware Implementation
3.6 Special Architecture or Special Protocol Structures
3.6.1 Integrated Layer Processing
3.6.2 AXON
3.6.3 HOPS
3.6.4 Flexible High-Performance Communication Subsystem
3.7 High-Speed Presentation Layer
4 Problem Statement and Proposed Solution
4.1 Characterization of Gbps processing requirement
4.2 Goals of this thesis
4.3 State-of-the-art techniques from the SPE point of view
4.4 Characteristics of the Proposed Solution
4.5 Proposed Optimistic Principle
5 The Bypass Concept and Software bypass
5.1 Bypass Concept
5.2 Bypass Architecture and its Implementation
5.2.1 Send Bypass Test
5.2.2 Receive Bypass Test
5.2.3 Bypass Stack
5.2.4 Shared Data
5.2.5 Isolating Data Transfer phase from Connection Establishment and Release phase
5.3 Multiple-layer bypass
5.3.1 Difficulties of multiple-layer bypass
5.4 Bypass without window flow control
5.5 Bypass Algorithm with window flow control
5.6 Experimental Setup
5.6.1 Software structure
5.7 Results of Bypass Implementation
5.8 Summary
6 A Reduced Operation Protocol Engine (ROPE): Design and Implementation of Bypass chip
6.1 Introduction to VHDL concepts
6.1.1 Design
6.1.2 Architectural Description
6.1.3 Experimental Setup
6.1.4 Behavioral description
6.1.5 Extensions to include major procedures for Transport Class 4 (Implemented)
 OSI Checksum
 Timers
 Retransmission and Resequencing
6.2 Results
6.3 Extensions to support Session BAS and BSS, and the Presentation layer (Not Implemented)
6.3.1 Session BSS and BAS
6.3.2 Presentation Layer processing
6.4 Summary
7 POBA - Parallel Optimistic Bypass Architecture
7.1 Description of POBA
7.1.1 Hardware Architecture
7.1.2 Sequence of Operations
7.1.3 Protocol processing at each PPN Node
7.1.4 Advantages of POBA
7.2 Characterization of processing requirements and throughput analysis of POBA
7.2.1 Workload Analysis for POBA
7.2.2 Performance bounds
7.2.3 Scalability
7.3 An implementation using standard off-the-shelf DSP microprocessors
7.3.1 Proposed Hardware for the adapter
7.3.2 Task Architecture
7.3.3 Simulation technique
7.4 Simulation results
7.5 Simulation versus analytical results
7.6 Bypass Issues
7.7 Predicted performance
7.8 Summary
8 Conclusions and Future Work
8.1 Conclusions
8.2 Advantages and What it solves
8.3 Future Work
8.3.1 New Applications for Bypass
8.3.2 Applying performance principles for fast connection setup in connection-oriented protocols
List of Abbreviations
Appendix A Pseudo Code for bypass
Appendix B TP4 Checksum
Appendix C ASN.1 Basic Encoding Rule for Integers
Appendix D Schematic Diagram Generated by SYNOPSYS synthesis tool (See 6.1.3)
Appendix E Timeline Snapshot (See 7.3.3)
Appendix F Simulation Results for POBA (See 7.4)

List of Figures
Figure 1: Open System Interconnection Reference Model
Figure 2: Layering Interface
Figure 3: Layer Interface
Figure 4: Data Movement in End-Systems
Figure 5: State of the Art Review
Figure 6: Zitterbart: Structural and Functional Parallelism
Figure 7: Jain, Schwartz and Bashkow: Packet level Parallel processing of TP4
Figure 8: Sterbenz and Parulkar: AXON Block Diagram
Figure 9: Haas: HOPS Architecture
Figure 10: Effect on throughput by varying X and B
Figure 11: Processor requirement for 1 Gbps protocol processing
Figure 12: Time Sequence Diagram
Figure 13: Finite State Machine of TP0/TP2
Figure 14: Bypass Architecture
Figure 15: Critical path taken by bypassed packets
Figure 16: Sender and Receiver Header Templates
Figure 17: ACK TPDU
Figure 18: Header Templates
Figure 19: Experimental testbed
Figure 20: Procedural Flow Chart of Protocol Testbed
Figure 21: Software and Hardware processing path of a non-bypass versus a bypass system
Figure 22: Block Diagram of VLSI bypass system
Figure 23: Design flow diagram
Figure 24: Organization of internal bypass chip memory
Figure 25: Activity management, Dialogue units and Synchronization points
Figure 26: Send Bypass Stack Parallel Architecture
Figure 27: Bus Utilization
Figure 28: Performance Bounds
Figure 29: Effects of varying X and Brl' on throughput Bounds
Figure 30: Scalability of POBA with packet size (P) and PPN workload (\ P)
Figure 31: Software task architecture
Figure 32: 3D plot of Throughput versus X and N (1 Kbyte)
Figure 33: Throughput versus N (1 Kbyte)
Figure 34: Throughput versus X (1 Kbyte)
Figure 35: Average PPN utilization versus N (1 Kbyte)
Figure 36: Average PPN utilization versus X (1 Kbyte)
Figure 37: Control Processor Utilization versus N (1 Kbyte)
Figure 38: Control Processor Utilization versus X (1 Kbyte)
Figure 39: POBA with 2 Processors
Figure 40: POBA with 4 Processors
Figure 41: POBA with 8 Processors
Figure 42: POBA with 10 Processors
Figure 43: Throughput versus X - Simulation results compared with analysis
Figure 44: Throughput versus N - Simulation results compared with analysis
Figure 45: Throughput versus message length - Simulation results compared with analysis
Figure 46: Effect of switch in processing paths, causing the bypass path to be flushed
Figure 47: Extrapolation of throughput bounds for improved technology
Figure 48: Optimistic Connection Setup
Figure 49: POBA under various Workloads
Figure 50: POBA with multiple active Connections
Figure 51: Effect of switch in processing path for POBA with N = 8
Figure 52: 3D plot of Throughput versus X and N (4 Kbytes)
Figure 53: Throughput versus X (4 Kbytes)
Figure 54: Throughput versus N (4 Kbytes)
Figure 55: 3D plot of Control Processor Utilization versus X and N (4 Kbytes)
Figure 56: Control Processor Utilization versus X (4 Kbytes)
Figure 57: Control Processor Utilization versus N (4 Kbytes)
Figure 58: 3D plot of PPN Utilization versus X and N (4 Kbytes)
Figure 59: PPN Utilization versus X (4 Kbytes)
Figure 60: PPN Utilization versus N (4 Kbytes)


List of Tables
Table 1: Comparison of Lightweight Protocols
Table 2: Common procedures for Presentation, Session and Transport layers during data transfer phase
Table 3: Bypass of Session/TP0
Table 4: Bypass of Session/TP2
Table 5: Throughput Performance and gate count of bypass VLSI chip (ROPE)
Table 6: Partitioning of protocol functions to host, PPNs and the Control processor


`Chapter 1
Gigabit networks are currently the subject of much research attention [123].
Applications for future Gigabit per second (Gbps) networks will probably include multimedia
communications, distributed high-speed real-time graphics and support for the bulk
transfer of scientific data, such as that of the High Energy Physics community [76].
1.1 The problem statement and motivation
The advent of Fibre Optic technology, which offers high bandwidth and low bit
error rates, has shifted the performance bottleneck in data communications from the
channel to the processing in the end-points of the system. The heavy processing load
is due to a combination of operating system overhead, protocol complexity, and
per-octet processing on the data stream. The operating system overhead includes
non-protocol related processing like context switching, interrupt handling, movement of
data across layer boundaries and process scheduling associated with handling several
concurrent streams of events. The protocol complexity is partly historical [105, 183]
and partly to provide very flexible standard capabilities. The per-octet processing is
found in checksums and presentation processing.
The problem is even worse for the Session [35, 91, 98] and Presentation [99, 100]
layers, which have a larger set of control Protocol Data Units (PDUs), each with varying
data formats and optional control fields. For bulk data transfer, per-octet operations
like presentation encoding/decoding and checksum processing will significantly affect
throughput performance.
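To make the per-octet cost concrete, the following is a minimal sketch of a Fletcher-style checksum of the kind used by OSI transport class 4, where two running sums are updated for every octet of the PDU. The function names and the 0-based `pos` parameter are illustrative assumptions, not taken from the thesis, and the sketch ignores any special handling a standard may require for zero-valued check octets.

```python
def fletcher_sums(pdu: bytes) -> tuple[int, int]:
    """Return the running sums (c0, c1), each modulo 255, over every octet."""
    c0 = c1 = 0
    for octet in pdu:            # one pass, two additions per octet
        c0 = (c0 + octet) % 255
        c1 = (c1 + c0) % 255
    return c0, c1


def insert_checksum(pdu: bytearray, pos: int) -> None:
    """Fill the two check octets at pdu[pos] and pdu[pos + 1] (0-based, assumed layout)."""
    pdu[pos] = pdu[pos + 1] = 0
    c0, c1 = fletcher_sums(bytes(pdu))
    n = pos + 1                  # 1-based position of the first check octet
    length = len(pdu)
    # Chosen so that both sums over the completed PDU come out as zero.
    pdu[pos] = ((length - n) * c0 - c1) % 255
    pdu[pos + 1] = (c1 - (length - n + 1) * c0) % 255


def checksum_ok(pdu: bytes) -> bool:
    """A received PDU passes when both running sums are zero."""
    return fletcher_sums(pdu) == (0, 0)
```

Every octet of user data passes through the loop in `fletcher_sums`, which is why this work grows with message size while header processing does not.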
The performance of upper layer protocols also depends on the hardware they are executing
on. This includes the host processor, the memory, the bus and the network adapter.
Current high performance workstation architectures have been optimized for data processing
rather than data movement [51]. The RISC based processors common in these
workstations typically rely on cache memory to prevent "bubbles" in the processor pipeline.
As host processing speed continues to outpace improvements in memory bandwidth, the
penalty for access from main memory due to cache misses increases [71, 87]. Also, as
the network bandwidth approaches the memory bandwidth, it is important to keep data
movement on the workstations down to the minimum, especially for bulk data transfer.
The ideal situation will be to transfer the application data units directly from the user
space to the adapter buffer, with minimal host intervention. This suggests that some
protocol functions must be offloaded.
The key problems associated with offboard processing of protocols include:
o It is often difficult to partition the functionality of protocols in a clear-cut way for
off-board processing. Poor partitioning of functions leads to a complex protocol for
the host software to communicate with the adapter, which may offset the potential
gain in performance due to offloading [174].
o It is not sufficient to offload the protocol functions and hope to achieve a dramatic
improvement in performance. Non-protocol related processing is a large part of the
total load, as shown in [201]. Examples include interrupt handling, context switching
and data copying at layer boundaries in deeply layered protocol stacks.
o It is important to consider multiple layers together [43]. However, it increases the
complexity of the adapter, especially if the full protocol logic of all the layers is
offloaded [21, 106].
o The hardware in the adapter might range from several microprocessors to single-chip
VLSI implementations. Probably because of the complexity of existing protocols,
VLSI [118] implementation above the data link layer has been disappointing so
far. The choice of hardware depends largely upon the complexity of the functions
supported by the adapter. In [13], where the entire transport protocol layer is
offloaded, or in [106], where the full protocol stack can be offloaded, general purpose
microprocessors are used. In [59], dedicated VLSI chips are used to support TCP
checksums. Also, some newer lightweight transport protocols are designed for VLSI
implementation [11, 41].
1.2 The proposed solution
This thesis uses a combination of approaches based on general software performance
engineering principles to solve the problem addressed in the previous section. It employs
the protocol bypass concept [210], which is a generalization of Jacobson's "Header
Prediction" algorithm [107] for TCP/IP. Bypassing identifies a fast path for bulk data
transfer, by which frequent simple operations are found and optimized. It addresses the
problems associated with off-board processing by selecting a particular set of functions
which are complete in themselves for offloading. The primary goal of this thesis is to
verify and extend the bypass concept in order to deliver very high throughput rates (up
to the host bus/memory bandwidth) to the application program while still using standard
"heavyweight" protocols. By this, we hope to achieve first, a usable engineering
approach, and second, a deeper understanding of performance issues which would guide
future designs of protocol and host system architecture.
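The idea of a fast path guarded by a cheap test can be sketched as follows. This is only an illustration in the spirit of header prediction, under simplifying assumptions: the `Packet` fields, the `Connection` class and the single-field prediction are hypothetical, and the slow path is a stand-in for the full layered protocol logic rather than the bypass algorithm developed in the thesis.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    kind: str        # illustrative labels such as "DATA" or "CONNECT"
    seq: int         # sequence number carried in the header
    payload: bytes


@dataclass
class Connection:
    established: bool = False
    next_seq: int = 0        # the header value the receiver predicts
    fast_hits: int = 0
    slow_hits: int = 0

    def receive(self, pkt: Packet) -> bytes:
        # Bypass test: a few comparisons against the predicted header.
        if self.established and pkt.kind == "DATA" and pkt.seq == self.next_seq:
            return self._bypass(pkt)
        return self._standard_stack(pkt)

    def _bypass(self, pkt: Packet) -> bytes:
        # Fast path: update the shared state and hand the data up.
        self.next_seq += 1
        self.fast_hits += 1
        return pkt.payload

    def _standard_stack(self, pkt: Packet) -> bytes:
        # Slow path: placeholder for connection setup, error handling, etc.
        self.slow_hits += 1
        if pkt.kind == "CONNECT":
            self.established = True
            self.next_seq = pkt.seq
        return b""
```

During bulk transfer almost every packet is an in-sequence data packet, so nearly all traffic would take `_bypass`, and only exceptional packets pay for the full stack.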
This thesis also addresses the following questions associated with bypassing:
o How to improve performance in the bypass path?
o How to minimize bypass test overhead and synchronization between the bypass path
and the standard protocol stack?
o What are the problems associated with multiple-layer bypass?
o Partitioning of functionality for bypass and its impact on performance.
Three approaches are proposed, each of which serves to address one or more of
the above questions.
In the first approach, throughput performance is increased by
implementing changes to software alone. This study used an experimental protocol
testbed, an algorithm for multiple-layer bypass with window flow control, and an efficient
logic for the bypass test [196]. It verified the bypass logic and concept.
The second approach proposes the use of dedicated VLSI hardware for offloading
protocol functions in the bypass path. Since the bypass system identifies a simple subset
of protocol processing, it is more amenable to VLSI implementation than an entire
protocol. This results in an architecture that offloads critical functions that are inefficiently
handled by current host systems to dedicated VLSI hardware, leaving only a little header
processing and a DMA transfer to the adapter for the host, and requiring only a simple
command protocol. A feasibility study was conducted using the industry standard VHDL
language to model the bypass chip, which is called ROPE (Reduced Operation Protocol
Engine) [195]. A behavioral model of the complete bypass system and a structural model
of ROPE were constructed. Results show both that it is feasible to implement the bypass
path for the Session and Transport layers of OSI protocols in VLSI with current gate
array technology, and that it is capable of supporting a very high data rate.
Finally, in the third approach, a scalable packet-level Parallel Optimistic Bypass
Architecture (POBA) [194] was designed and evaluated. Current off-the-shelf DSP
microprocessors are used to evaluate the architecture experimentally with very realistic
simulation. Its results are compared with an analytical model. With POBA, a system
using a small number of available DSP components can execute the three layers (transport,
session, presentation) at speeds close to the host memory bandwidth [194]. Its efficiency
for parallel implementation is significantly higher compared to architectures that
implement the full protocol logic [21, 106].
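One rough way to see why a packet-level parallel design can approach the memory bandwidth is a simple bound: processing capacity grows with the number of processors until the shared data path saturates. The sketch below assumes each processor spends a fixed time on each packet and that packets are spread evenly; the parameter names and the numbers in the usage lines are hypothetical and are not the thesis's analytical model or measurements.

```python
def throughput_bound_mbps(n_procs: int, packet_bytes: int,
                          per_packet_us: float, bus_mbps: float) -> float:
    """Upper bound on throughput in Mbit/s for an idealized parallel adapter."""
    # Bits per microsecond is numerically the same as Mbit/s.
    processing_mbps = n_procs * packet_bytes * 8 / per_packet_us
    # The shared bus or memory path caps whatever the processors can deliver.
    return min(processing_mbps, bus_mbps)


# Hypothetical figures: 1000-byte packets, 100 us per packet, 800 Mbit/s bus.
for n in (1, 4, 16):
    print(n, throughput_bound_mbps(n, 1000, 100.0, 800.0))
```

Under these made-up figures the bound scales linearly with the processor count at first and then flattens at the bus limit, which is the qualitative behaviour the paragraph above describes.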
The bypass concept does not address the processing requirement for fast connection
setup. We believe that the main issue in high-speed networks is not the speed in processing
connection packets, but the number of message exchanges required between two
application entities before a connection can be established.
Recent projects on high-speed protocol processing have explored various approaches,
each of which has a connection to this research:
• New "lightweight" protocols - Our approach can make standard protocols nearly
as fast as these lightweight protocols, at least for bulk data transfer. In bypass,
lightweight data transfer can be associated with a complete or heavyweight protocol
for connections.
• Special software implementation techniques - Bypass reduces inter-layer overhead
when it is extended to a multiple-layer protocol stack. Inter-layer processing overhead
includes queue and buffer management, context switching and movement of data
across layers, all of which are a significant overhead in communications processing.
The advantage of this work is the ability to integrate the critical processing functions
of a deeply layered protocol stack without a major redesign of current standards.
• Hardware implementation - The proposed architecture moves all the per-octet
processing to the adapter, leaving only a little header processing and a DMA transfer
to the adapter for the host, and requiring only a simple command protocol.
• Parallel hardware implementation - The bypass approach is a good way to organize
the processing within the parallel hardware architecture described by Jain et al. [111]
and Goldberg et al. [82], and will significantly enhance their proposal. Bypass reduces
the complexity of protocol processing, resulting in a significant reduction in processor
contention for shared connection information.
• Special Architecture and protocol structures - Multiple-layer bypass merges
adjoining layers to form a "super-efficient-stack" during the data transfer phase. The
functionality of these layers can then be organized for optimal performance.
The main difference of this work, compared with other ideas in the literature, is that
it addresses the processing requirements of a multiple-layer protocol stack by efficiently
offloading per-octet protocol related operations and minimizing the key non-protocol
related per-packet operations. It succeeds by isolating functions that are typically handled
inefficiently by

