`Data Transfer through deeply layered protocol stacks
`
`by
`
Yong Hua Thia, BSc, MSc, DIC
`
`A Thesis submitted to the Faculty of Graduate Studies and Research in partial
fulfillment of the requirements for the degree of
`
`Doctor of Philosophy
`
`Ottawa-Carleton Institute for Electrical Engineering,
`Faculty of Engineering,
`Department of Systems and Computer Engineering.
`Carleton University.
Ottawa, Ontario, Canada, K1S 5B6
`March 5, 1994
`
`© Yong Hua, Thia 1993
`
`Ex.1069.001
`
`DELL
`
`
`
National Library of Canada
Acquisitions and Bibliographic Services Branch
395 Wellington Street
Ottawa, Ontario
K1A 0N4

Bibliothèque nationale du Canada
Direction des acquisitions et des services bibliographiques
395, rue Wellington
Ottawa (Ontario)
K1A 0N4
`
The author has granted an irrevocable non-exclusive licence allowing the National Library of Canada to reproduce, loan, distribute or sell copies of his/her thesis by any means and in any form or format, making this thesis available to interested persons.

The author retains ownership of the copyright in his/her thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without his/her permission.
`
`ISBN 0-315-92944-8
`
`Canada
`
This is an authorized facsimile, made from the microfilm master copy of the original dissertation or master thesis published by UMI.

The bibliographic information for this thesis is contained in UMI's Dissertation Abstracts database, the only central source for accessing almost every doctoral dissertation accepted in North America since 1861.

UMI Dissertation Services
From: ProQuest Company
300 North Zeeb Road
P.O. Box 1346
Ann Arbor, Michigan 48106-1346 USA

734.761.4700
800.521.0600
web www.il.proquest.com

Printed in 2006 by digital xerographic process on acid-free paper
`
`
NOTICE

The quality of this microform is heavily dependent upon the quality of the original thesis submitted for microfilming. Every effort has been made to ensure the highest quality of reproduction possible.

If pages are missing, contact the university which granted the degree.

Some pages may have indistinct print, especially if the original pages were typed with a poor typewriter ribbon or if the university sent us an inferior photocopy.

Reproduction in full or in part of this microform is governed by the Canadian Copyright Act, R.S.C. 1970, c. C-30, and subsequent amendments.
`
`ABSTRACT
`
Techniques for handling high data throughput in end-systems attached to data networks are the subject of this thesis. The advent of fibre optic technology, which offers high bandwidth and low bit error rates, has shifted the performance bottleneck in data communications from the channel to the processing in the end-points of the system. To alleviate the end-system bottleneck one may consider new protocols, improved software implementation of existing protocols, parallel processing techniques, special protocol structures, and hardware assist by offloading all or part of the protocol functions to an attached processor or adapter.

In this thesis, speedup in protocol processing is accomplished by using a combination of approaches based on general software performance engineering principles. A novel concept for protocol implementation, which is a generalization of Jacobson's "Header Prediction" algorithm for TCP/IP, is introduced. Bypassing identifies a fast path for bulk data transfer, by which frequent simple operations are found and optimized. We propose and evaluate three different methods for bypassing. In the first approach, throughput performance is increased by pure software improvements due to bypass. The second approach uses dedicated VLSI hardware for offloading protocol functions in the bypass path. A feasibility study was conducted using the industry standard VHDL language to model the bypass chip, which we call ROPE (Reduced Operation Protocol Engine). Results show that it is feasible to implement the bypass path for the Session and Transport layers of OSI protocols in VLSI with current gate array technology, and that ROPE is capable of supporting a very high data rate: 313 Mbps with TP4 checksum. Finally, in the third approach, a scalable packet-level architecture, the Parallel Optimistic Bypass Architecture (POBA), is proposed and evaluated. It uses digital signal processing components which are readily available. A simulation model with realistic instruction execution times was run, and its results were validated with an analytical model. With POBA, a system executing the three layers (transport, session and presentation) can scale up to speeds close to the host memory bandwidth.
`
`ACKNOWLEDGMENTS
`
This research was supported in large part by the Telecommunications Research Institute of Ontario (TRIO).

The debt of gratitude that a PhD student owes his advisor at the end of a project this size is enormous. I am greatly indebted to Prof. Murray Woodside for his excellent guidance, continued encouragement, and generous support, as well as his extraordinary patience in improving my writing and communication skills.

Discussions with Craig Scratchley, Erik Dravnieks, Govindachari Raghunath, Reza Etemadi, Srinivas Vunnava, Weimin Ma, Yao Li and many others have been both invigorating and informative. Greg Franks helped me get started on the testbed implementation and Curtis Hrischuk was always there to help with questions pertaining to PARASOL.

I would also like to thank Bell-Northern Research, which provided the VHDL tools and facility used in this study. I am particularly indebted to Dr. Simon Curry, Hemi Thakar, Dr. Parviz Yousefpour, Bernard Dnray, Mike Majid and Mustapha r:ahla, who helped with the use of their tools and made my stay at BNR an enjoyable and memorable one.

I am grateful to all my friends and fellow students in the Systems and Computer Engineering department. In particular, I enjoyed playing squash and going on ski trips with Norm Lo, Francois Vien, Robert Dussault and Quyen Vi Bo. Special thanks are due to the office and support staff, Naren Mehta, John Pope, Dave Sword, Danny Lemay, Darlene Hebert, Elena Keen, Vivienne Gilchrist, and others who assisted in one way or another.

Last but not least, I am deeply grateful to my parents for their support and encouragement. My wife, Irene, and son, Brandon, endured years of sacrifice and hardships to see me through this program. I owe my family so much that I could find no words to express my deepest gratitude.
`
Contents

ABSTRACT
ACKNOWLEDGMENTS
List of Figures
List of Tables
1 Introduction
1.1 The problem statement and motivation
1.2 The proposed solution
1.3 Summary of contributions
1.4 Thesis Outline
2 Background
2.1 Introduction
2.2 OSI Architecture and layering
2.2.1 Why Layering?
2.2.2 Layering Principles
2.3 Overview of the Seven Layers of the OSI Architecture
2.4 Why are protocol implementations slow?
2.4.1 Processing and data path of a typical application PDU
2.4.2 Host Architecture
2.4.3 Adapter
2.4.4 Host/Adapter interface
2.4.5 Operating system support
2.4.6 Buffer Management
2.4.7 Timer Management
2.4.8 Software Implementation
2.4.9 Protocol structure and mechanism
2.5 Software Performance Engineering Principles
3 Review of the State of the Art in Performance Improvement of Upper Layer Protocols
3.1 Introduction
3.2 Lightweight Transport Protocols and their mechanisms to support high-speed networks
3.2.1 Summary of Transport Service Functions that affect performance of high-speed networks
3.2.1.1 Connection Management
3.2.1.2 Flow Control and Acknowledgment
3.2.1.3 Error Handling
3.2.2 Lightweight Transport Protocols
3.3 Special Implementation Techniques
3.3.1 Header Prediction
3.4 Hardware Implementation
3.5 Parallel Hardware Implementation
3.6 Special Architecture or Special Protocol Structures
3.6.1 Integrated Layer Processing
3.6.2 AXON
3.6.3 HOPS
3.6.4 Flexible High-Performance Communication Subsystem
3.7 High-Speed Presentation Layer
4 Problem Statement and Proposed Solution
4.1 Characterization of Gbps processing requirement
4.2 Goals of this thesis
4.3 State-of-the-art techniques from the SPE point of view
4.4 Characteristics of the Proposed Solution
4.5 Proposed Optimistic Principle
5 The Bypass Concept and Software bypass
5.1 Bypass Concept
5.2 Bypass Architecture and its Implementation
5.2.1 Send Bypass Test
5.2.2 Receive Bypass Test
5.2.3 Bypass Stack
5.2.4 Shared Data
5.2.5 Isolating Data Transfer phase from Connection Establishment and Release phase
5.3 Multiple-layer bypass
5.3.1 Difficulties of multiple-layer bypass
5.4 Bypass without window flow control
5.5 Bypass Algorithm with window flow control
5.6 Experimental Setup
5.6.1 Software structure
5.7 Results of Bypass Implementation
5.8 Summary
6 A Reduced Operation Protocol Engine (ROPE): Design and Implementation of Bypass chip
6.1 Introduction to VHDL concepts
6.1.1 Design
6.1.2 Architectural Description
6.1.3 Experimental Setup
6.1.4 Behavioral description
6.1.5 Extensions to include major procedures for Transport Class 4 (Implemented)
6.1.5.1 OSI Checksum
6.1.5.2 Timers
6.1.5.3 Retransmission and Resequencing
6.2 Results
6.3 Extensions to support Session BAS and BSS, and the Presentation layer (Not Implemented)
6.3.1 Session BSS and BAS
6.3.2 Presentation Layer processing
6.4 Summary
7 POBA - Parallel Optimistic Bypass Architecture
7.1 Description of POBA
7.1.1 Hardware Architecture
7.1.2 Sequence of Operations
7.1.3 Protocol processing at each PPN Node
7.1.4 Advantages of POBA
7.2 Characterization of processing requirements and throughput analysis of POBA
7.2.1 Workload Analysis for POBA
7.2.2 Performance bounds
7.2.3 Scalability
7.3 An implementation using standard off-the-shelf DSP microprocessors
7.3.1 Proposed Hardware for the adapter
7.3.2 Task Architecture
7.3.3 Simulation technique
7.4 Simulation results
7.5 Simulation versus analytical results
7.6 Bypass Issues
7.7 Predicted performance
7.8 Summary
8 Conclusions and Future Work
8.1 Conclusions
8.2 Advantages and What it solves
8.3 Future Work
8.3.1 New Applications for Bypass
8.3.2 Applying performance principles for fast connection setup in connection-oriented protocols
List of Abbreviations
Bibliography
Appendix A Pseudo Code for bypass
Appendix B TP4 Checksum
Appendix C ASN.1 Basic Encoding Rule for Integers
Appendix D Schematic Diagram Generated by SYNOPSYS synthesis tool (See 6.1.3)
Appendix E Timeline Snapshot (See 7.3.3)
Appendix F Simulation Results for POBA (See 7.4)
`
List of Figures

Figure 1 Open System Interconnection Reference Model
Figure 2 Layering Interface
Figure 3 Layer Interface
Figure 4 Data Movement in End-Systems
Figure 5 State of the Art Review
Figure 6 Zitterbart: Structural and Functional Parallelism
Figure 7 Jain, Schwartz and Bashkow: Packet level Parallel processing of TP4
Figure 8 Sterbenz and Parulkar: AXON Block Diagram
Figure 9 Haas: HOPS Architecture
Figure 10 Effect on throughput by varying X and B
Figure 11 Processor requirement for 1 Gbps protocol processing
Figure 12 Time Sequence Diagram
Figure 13 Finite State Machine of TP0/TP2
Figure 14 Bypass Architecture
Figure 15 Critical path taken by bypassed packets
Figure 16 Sender and Receiver Header Templates
Figure 17 ACK TPDU
Figure 18 Header Templates
Figure 19 Experimental testbed
Figure 20 Procedural Flow Chart of Protocol Testbed
Figure 21 Software and Hardware processing path of a non-bypass versus a bypass system
Figure 22 Block Diagram of VLSI bypass system
Figure 23 Design flow diagram
Figure 24 Organization of internal bypass chip memory
Figure 25 Activity management, Dialogue units and Synchronization points
Figure 26 Send Bypass Stack Parallel Architecture
Figure 27 Bus Utilization
Figure 28 Performance Bounds
Figure 29 Effects of varying X and B on throughput Bounds
Figure 30 Scalability of POBA with packet size (P) and PPN workload (X_P)
Figure 31 Software task architecture
Figure 32 3D plot of Throughput versus X and N (1 Kbyte)
Figure 33 Throughput versus N (1 Kbyte)
Figure 34 Throughput versus X (1 Kbyte)
Figure 35 Average PPN utilization versus N (1 Kbyte)
Figure 36 Average PPN utilization versus X (1 Kbyte)
Figure 37 Control Processor Utilization versus N (1 Kbyte)
Figure 38 Control Processor Utilization versus X (1 Kbyte)
Figure 39 POBA with 2 Processors
Figure 40 POBA with 4 Processors
Figure 41 POBA with 8 Processors
Figure 42 POBA with 10 Processors
Figure 43 Throughput versus X - Simulation results compared with analysis
Figure 44 Throughput versus N - Simulation results compared with analysis
Figure 45 Throughput versus message length - Simulation results compared with analysis
Figure 46 Effect of switch in processing paths, causing the bypass path to be flushed
Figure 47 Extrapolation of throughput bounds for improved technology
Figure 48 Optimistic Connection Setup
Figure 49 POBA under various Workloads
Figure 50 POBA with multiple active Connections
Figure 51 Effect of switch in processing path for POBA with N = 8
Figure 52 3D plot of Throughput versus X and N (4 Kbytes)
Figure 53 Throughput versus X (4 Kbytes)
Figure 54 Throughput versus N (4 Kbytes)
Figure 55 3D plot of Control Processor Utilization versus X and N (4 Kbytes)
Figure 56 Control Processor Utilization versus X (4 Kbytes)
Figure 57 Control Processor Utilization versus N (4 Kbytes)
Figure 58 3D plot of PPN Utilization versus X and N (4 Kbytes)
Figure 59 PPN Utilization versus X (4 Kbytes)
Figure 60 PPN Utilization versus N (4 Kbytes)
`
List of Tables

Table 1 Comparison of Lightweight Protocols
Table 2 Common procedures for Presentation, Session and Transport layers during data transfer phase
Table 3 Bypass of Session/TP0
Table 4 Bypass of Session/TP2
Table 5 Throughput Performance and gate count of bypass VLSI chip (ROPE)
Table 6 Partitioning of protocol functions to host, PPNs and the Control processor
`
`Chapter 1
`
`Introduction
`
Gigabit networks are currently the subject of much research attention [123]. Applications for future Gigabit per second (Gbps) networks will probably include multimedia communications, distributed high-speed real-time graphics and support for the bulk transfer of scientific data, such as that of the High Energy Physics community [76].
`
1.1 The problem statement and motivation
`
The advent of fibre optic technology, which offers high bandwidth and low bit error rates, has shifted the performance bottleneck in data communications from the channel to the processing in the end-points of the system. The heavy processing load is due to a combination of operating system overhead, protocol complexity, and per-octet processing on the data stream. The operating system overhead includes non-protocol related processing like context switching, interrupt handling, movement of data across layer boundaries and process scheduling associated with handling several concurrent streams of events. The protocol complexity is partly historical [105, 183] and partly to provide very flexible standard capabilities. The per-octet processing is
`
found in checksums and presentation processing. Conventional protocols like TCP and OSI TP4, which include complex procedures for connection establishment, release and exception handling, are considered inefficient for high-speed communications and too complicated to implement in hardware.
`
The problem is even worse for the Session [35, 91, 98] and Presentation [99, 100] layers, which have a larger set of control Protocol Data Units (PDUs), each with varying data formats and optional control fields. For bulk data transfer, per-octet operations like presentation encoding/decoding and checksum processing will significantly affect throughput performance.
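To make the cost of per-octet processing concrete, the following is a minimal sketch of a Fletcher-style running checksum, the family of algorithm on which the OSI transport checksum is based. It is an illustration only and not the TP4 procedure described in Appendix B: the function name `fletcher_sums` is hypothetical, and the sketch computes only the two running sums, without deriving or placing the check octets in a header. The point is that every octet of the payload is touched once and costs two additions, so the work grows linearly with message length.

```python
from typing import Tuple


def fletcher_sums(octets: bytes) -> Tuple[int, int]:
    """Return the two running sums (c0, c1) of a Fletcher-style checksum, modulo 255.

    Illustrative sketch only; a real transport checksum also derives
    check octets from these sums and stores them in the PDU header.
    """
    c0 = 0  # running sum of the octets
    c1 = 0  # running sum of the successive c0 values
    for octet in octets:  # a single pass: two additions per octet
        c0 = (c0 + octet) % 255
        c1 = (c1 + c0) % 255
    return c0, c1


if __name__ == "__main__":
    # The loop body runs once per payload octet, which is why this kind
    # of operation dominates the cost of bulk data transfer.
    print(fletcher_sums(b"abcde"))
```

Because the loop is simple and regular, it is the kind of function that lends itself to being moved off the host, which is the direction the following paragraphs argue for.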
`
The performance of upper layer protocols also depends on the hardware they execute on. This includes the host processor, the memory, the bus and the network adapter. Current high performance workstation architectures have been optimized for data processing rather than data movement [51]. The RISC based processors common in these workstations typically rely on cache memory to prevent "bubbles" in the processor pipeline. As host processing speed continues to outpace improvements in memory bandwidth, the penalty for access from main memory due to cache misses increases [71, 87]. Also, as the network bandwidth approaches the memory bandwidth, it is important to keep data movement on the workstations down to the minimum, especially for bulk data transfer. The ideal situation would be to transfer the application data units directly from the user space to the adapter buffer, with minimal host intervention. This suggests that some protocol functions must be offloaded.
`
The key problems associated with offboard processing of protocols include:

o It is often difficult to partition the functionality of protocols in a clear-cut way for off-board processing. Poor partitioning of functions leads to a complex protocol for the host software to communicate with the adapter, which may offset the potential gain in performance due to offloading [174].

o It is not sufficient to offload the protocol functions and hope to achieve a dramatic improvement in performance. Non-protocol related processing is a large part of the total load, as shown in [201]. Examples include interrupt handling, context switching and data copying at layer boundaries in deeply layered protocol stacks.

o It is important to consider multiple layers together [43]. However, this increases the complexity of the adapter, especially if the full protocol logic of all the layers is offloaded [21, 106].

o The hardware in the adapter might range from several microprocessors to single-chip VLSI implementations. Probably because of the complexity of existing protocols, VLSI [118] implementation above the data link layer has been disappointing so far. The choice of hardware depends largely upon the complexity of the functions supported by the adapter. In [13], where the entire transport protocol layer is offloaded, or in [106], where the full protocol stack can be offloaded, general purpose microprocessors are used. In [59], dedicated VLSI chips are used to support TCP checksums. Also, some newer lightweight transport protocols are designed for VLSI implementation [11, 41].
`
`1.2 The proposed solution
`
This thesis uses a combination of approaches based on general software performance engineering principles to solve the problem addressed in the previous section. It employs the protocol bypass concept [210], which is a generalization of Jacobson's "Header Prediction" algorithm [107] for TCP/IP. Bypassing identifies a fast path for bulk data transfer, by which frequent simple operations are found and optimized. It addresses the problems associated with off-board processing by selecting a particular set of functions which are complete in themselves for offloading. The primary goal of this thesis is to verify and extend the bypass concept in order to deliver very high throughput rates (up to the host bus/memory bandwidth) to the application program while still using standard "heavyweight" protocols. By this, we hope to achieve first, a usable engineering approach, and second, a deeper understanding of performance issues which would guide future designs of protocol and host system architecture.
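The idea of a bypass test in front of a standard protocol stack can be sketched as follows. This is a hedged illustration of the general fast-path principle, not the algorithm developed later in the thesis (whose pseudo code is in Appendix A). All names here (`Packet`, `Connection`, `receive`, `standard_stack`, and the `"DT"` and `"CR"` type labels) are hypothetical, and the test is reduced to two checks: the PDU is a plain data PDU and it carries the expected sequence number.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Packet:
    pdu_type: str   # hypothetical labels: "DT" = data, anything else = control
    seq: int        # sequence number carried in the header
    payload: bytes


@dataclass
class Connection:
    expected_seq: int = 0                      # the "predicted" next header value
    delivered: List[bytes] = field(default_factory=list)
    slow_path_count: int = 0                   # how often the full stack was used


def standard_stack(conn: Connection, pkt: Packet) -> None:
    """Stand-in for the full layered processing; here it only counts invocations."""
    conn.slow_path_count += 1


def receive(conn: Connection, pkt: Packet) -> str:
    """Run the bypass test and dispatch the packet to the fast or the standard path."""
    # Bypass test: the common case during bulk transfer is an in-sequence data PDU.
    if pkt.pdu_type == "DT" and pkt.seq == conn.expected_seq:
        conn.delivered.append(pkt.payload)     # fast path: deliver directly
        conn.expected_seq += 1
        return "bypass"
    # Anything unusual (control PDUs, out-of-sequence data) takes the standard stack.
    standard_stack(conn, pkt)
    return "standard"


if __name__ == "__main__":
    conn = Connection()
    for pkt in (Packet("DT", 0, b"a"), Packet("DT", 1, b"b"), Packet("CR", 0, b"")):
        print(pkt.pdu_type, pkt.seq, receive(conn, pkt))
```

The design choice worth noting is that the fast path handles only the frequent, simple case, while the standard stack remains available and unchanged for everything else. A real implementation would also have to keep the state shared between the two paths consistent, which this sketch does not attempt.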
`
This thesis also addresses the following questions associated with bypassing:

o How to improve performance in the bypass path?

o How to minimize bypass test overhead and synchronization between the bypass path and the standard protocol stack?

o What are the problems associated with multiple-layer bypass?

o Partitioning of functionality for bypass and its impact on performance.
`
Three approaches are proposed, each of which serves to address one or more of the above questions. In the first approach, throughput performance is increased by implementing changes to software alone. This study used an experimental protocol testbed, an algorithm for multiple-layer bypass with window flow control, and an efficient logic for the bypass test [196]. It verified the bypass logic and concept.
`
The second approach proposes the use of dedicated VLSI hardware for offloading protocol functions in the bypass path. Since the bypass system identifies a simple subset of protocol processing, it is more amenable to VLSI implementation than an entire protocol. This results in an architecture that offloads critical functions that are inefficiently handled by current host systems to dedicated VLSI hardware, leaving only a little header processing and a DMA transfer to the adapter for the host, and requiring only a simple command protocol. A feasibility study was conducted using the industry standard VHDL language to model the bypass chip, which is called ROPE (Reduced Operation Protocol Engine) [195]. A behavioral model of the complete bypass system and a structural model of ROPE were constructed. Results show that it is feasible to implement the bypass path for the Session and Transport layers of OSI protocols in VLSI with current gate array technology, and that it is capable of supporting a very high data rate.
`
Finally, in the third approach, a scalable packet-level Parallel Optimistic Bypass Architecture (POBA) [194] was designed and evaluated. Current off-the-shelf DSP microprocessors are used to evaluate the architecture experimentally with very realistic simulation. Its results are compared with an analytical model. With POBA, a system using a small number of available DSP components can execute the three layers (transport, session, presentation) at speeds close to the host memory bandwidth [194]. The POBA architecture is less complex, with a significant reduction in processor contention for shared connection information. Hence its efficiency for parallel implementation is significantly higher compared to architectures that implement the full protocol logic [21, 106].

The bypass concept could also address the processing requirement for fast connection setup. We believe that the main issue in high-speed networks is not the speed of processing connection packets, but the number of message exchanges required between two application entities before a connection can be established.
`
Recent projects on high-speed protocol processing have explored various approaches, each of which has a connection to this research:

• New "lightweight" protocols - Our approach can make standard protocols nearly as fast as these lightweight protocols, at least for bulk data transfer. In bypass, lightweight data transfer can be associated with a complete or heavyweight protocol for connections.

• Special software implementation techniques - Bypass reduces inter-layer overhead when it is extended to a multiple-layer protocol stack. Inter-layer processing overhead includes queue and buffer management, context switching and movement of data across layers, all of which are a significant overhead in communications processing. The advantage of this work is the ability to integrate the critical processing functions of a deeply layered protocol stack without a major redesign of current standards.

• Hardware implementation - The proposed architecture moves all the per-octet processing to the adapter, leaving only a little header processing and a DMA transfer to the adapter for the host, and requiring only a simple command protocol.

• Parallel hardware implementation - The bypass approach is a good way to organize the processing within the parallel hardware architecture described by Jain et al. [111] and Goldberg et al. [82], and will significantly enhance their proposal. Bypass reduces the complexity of protocol processing, resulting in a significant reduction in processor contention for shared connection information.

• Special architecture and protocol structures - Multiple-layer bypass merges adjoining layers to form a "super-efficient-stack" during the data transfer phase. The functionality of these layers can then be organized for optimal performance.
`
The main difference of this work, compared with other ideas in the literature, is that it addresses the processing requirements of a multiple-layer protocol stack by efficiently offloading per-octet protocol related operations and minimizing the key non-protocol related per-packet operations. It succeeds by isolating functions that are typically handled inefficiently by