`FOR HIGH-SPEED
`NETWORKS, III
`
`Proceedings of the IFIP WG6.1/WG6.4 Third International Workshop on
`Protocols for High-Speed Networks,
`Stockholm, Sweden, 13-15 May, 1992
`
`Edited by
`
`B. PEHRSON
`
`P.
`
`NINGBERG
`
`.P|NK
`
`Swedish Institute of Computer Science
`Box 1263
`
`8- 164 28 Kista, Sweden
`
`1 993
`
`NORTH-
`
`AMSTERDAM ° LOND
`
`AN
`
`EW
`
`K 0 TOKYO
`
`(cid:36)(cid:80)(cid:68)(cid:93)(cid:82)(cid:81)(cid:17)(cid:70)(cid:82)(cid:80)(cid:15)(cid:3)(cid:44)(cid:81)(cid:70)(cid:17)(cid:3)(cid:72)(cid:87)(cid:3)(cid:68)(cid:79)(cid:17)(cid:3)(cid:3)(cid:3)(cid:40)(cid:91)(cid:75)(cid:76)(cid:69)(cid:76)(cid:87)(cid:3)(cid:20)(cid:19)(cid:19)(cid:26)(cid:3)
`Amazon.com, Inc. et al. Exhibit
`
`
`
`ELSEVIER SCIENCE PUBLISHERS B.V.
`
`’
`Sara Burgerhartstraat 25
`P.O. Box 211, 1000 AE Amsterdam, The Netherlands
`
`Keywords are chosen from the ACM Computing Reviews Classification System, ©1991, with permission.
`Details of the full classification system are available from
`ACM, 11 West 42nd St., New York, NY 10036, USA.
`
`ISBN: 0 444 89925 1
`
`ISSN 0926-549X
`
`© 1993 IFIP. All rights reserved.
`No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any
`means, electronic, mechanical, photocopying, recording or otherwise, without the prior written permission of
`the publisher, Elsevier Science Publishers B.V., Copyright & Permissions Department, P.0. Box 521, 1000 AM
`Amsterdam, The Netherlands.
`
`Special regulations for readers in the U.S.A. - This publication has been registered with the Copyright Clearance
`Center Inc. (CCC), Salem, Massachusetts. Information can be obtained from the CCC about conditions under
`which photocopies of parts of this publication may be made in the U.S.A. All other copyright questions, including
`photocopying outside of the U.S.A., should be referred to the publisher, Elsevier Science Publishers B.V., unless
`otherwise specified.
`“
`
`No responsibility is assumed by the publisher or by IFIP for any injury and/or damage to persons or property
`as a matter of products liability, negligence or otherwise, ortrom any use or operation of any methods, products,
`instructions or ideas contained in the material herein.
`
`This book is printed on acid—tree paper.
`
`Printed in The Netherlands
`
`(cid:36)(cid:80)(cid:68)(cid:93)(cid:82)(cid:81)(cid:17)(cid:70)(cid:82)(cid:80)(cid:15)(cid:3)(cid:44)(cid:81)(cid:70)(cid:17)(cid:3)(cid:72)(cid:87)(cid:3)(cid:68)(cid:79)(cid:17)(cid:3)(cid:3)(cid:3)(cid:40)(cid:91)(cid:75)(cid:76)(cid:69)(cid:76)(cid:87)(cid:3)(cid:20)(cid:19)(cid:19)(cid:26)(cid:3)
`Amazon.com, Inc. et al. Exhibit
`
`
`
`Protocols for High—Speed Networks, III (C-9)
`B. Pehrson, P. Gunningberg and S. Pink (Editors)
`Elsevier Science Publishers B.V.
`(North-Holland)
`© 1993 IFIP. All rights reserved.
`
`219
`
`A Parallel Approach to OSI Connection-Oriented
`Protocolsl
`
`Murray VV. Goldberg, Gerald W. Neufeld and Mabo R. Ito
`
`University of British Columbia, Vancouver, British Columbia, Canada
`
`Abstract
`
`This paper describes our work in the parallelization of ISO connection—oriented proto-
`cols. Our attention focuses on (though is not restricted to) the upper layers. Our scheme
`performs parallelization on a per—packet basis using widely available, general purpose,
`shared memory multiprocessors. Our parallel implementation of the ISO stack includes
`the transport, session, presentation, and application (ACSE and ROSE) layers. The im-
`plementation has been tested on a four—processor risc machine and shows that parallelism
`can provide significant performance advantages in protocol processing.
`
`1
`
`Introduction
`
`The performance of communication protocol processing is not keeping pace with the
`present rate of improvement in network transmission bandwidth. Several options are
`available to increase protocol throughput. These include new low—overhead or paral-
`lelizable protocols, parallelizing the execution of existing protocols, specialized hardware
`architectures, or offloading protocol processing to auxiliary subsystems. Some of these
`options overlap or can be employed together.
`
`The solution we explore here is the parallel execution of existing protocols on a general
`purpose multiprocessor. This is attractive for several reasons. First,
`it is applied to
`existing protocols. Existing oflicial a11d de facto protocol standards are well entrenched
`and not likely to change quickly, even for the sake of performance. Second, this solution is
`applicable to all protocol layers, including the presentation layer where the overhead and
`potential for improvement are the greatest. Third, existing protocols are well studied and
`understood. Much work has already gone into making efficient implementations. Finally,
`the protocol processing hardware is general purpose and therefore widely available.
`
`This paper describes our work on a parallel implementation of the ISO stack2 of
`connection—oriented protocols.
`
`‘This work was made possible by grants from the Natural Sciences and Engineering Research
`Council of Canada.
`'
`
`2Actually, our stack is not completely ISO—based as it normally runs on top of TCP.
`
`(cid:36)(cid:80)(cid:68)(cid:93)(cid:82)(cid:81)(cid:17)(cid:70)(cid:82)(cid:80)(cid:15)(cid:3)(cid:44)(cid:81)(cid:70)(cid:17)(cid:3)(cid:72)(cid:87)(cid:3)(cid:68)(cid:79)(cid:17)(cid:3)(cid:3)(cid:3)(cid:40)(cid:91)(cid:75)(cid:76)(cid:69)(cid:76)(cid:87)(cid:3)(cid:20)(cid:19)(cid:19)(cid:26)(cid:3)
`Amazon.com, Inc. et al. Exhibit
`
`
`
`220
`
`Parallelism issues, the protocol framework, implementation details and performance
`measurements are presented. Section 2 covers approaches to protocol parallelism. Sec-
`tion 3 discusses the approach we take to ISO protocol parallelism. Section 4 details the
`parallel protocol implementation, including the parallel thread environment designed and
`implemented for this work. Finally, Sections 5 and 6 provide performance measurements
`and detail future work.
`
`2 Parallel Protocol Processing
`
`There are several ways to introduce parallelism into protocol processing. Most of these
`are discussed in [DGW89], [NJB90] and [CMW89] and are briefly presented here.
`
`2.1 Processor Per Connection
`
`The granularity of the first option is at the level of a single connection. In this case there is
`no parallelism within a connection; packets received for the same connection are processed
`one at a time. Parallelism occurs where multiple connections are active concurrently. The
`separate connections are processed by separate processors and competition between the
`connections for computing power is reduced. This schemeis suitable where communication
`resources are shared by multiple applications or where a single application uses concurrent
`connections.
`In either case multiple concurrent connections will exist and parallelism is
`achieved.
`
`The advantage of this scheme is its simplicity. Concurrency control between parallel
`streams is limited to interactions on resources shared by all connections. This set of
`resources and data structures is fairly small.
`It is a reasonably simple task to take a
`serial protocol implementation and introduce concurrency at this level. The drawback of
`this scheme is the lack of benefit for applications with a small number of heavily used
`connections.
`In this case, even with a large volume of data being sent and received,
`processors will still be idle and little or no parallelism is possible.
`
`2.2 Processor Per Function
`
`At the other end of the granularity spectrum is parallelism within a protocol layer. Gran-
`ularity is at the level of individual protocol functions. This may be of benefit depending
`on the number and size of parallelizable tasks. ,.However, even for protocols designed for
`this sort of parallelism, the benefit is questionable due to the extra synchronization and
`communication overhead among the many small tasks.
`
`2.3 Processor Per Layer
`
`The granularity of the third option is at the level of a single protocol layer. Here, one
`or more processors are dedicated to a single layer and the protocol stack is made of a
`pipeline of processors. At each stage in the pipeline a packet moves to the next processor
`for processing at that layer. The vacated stage is free to begin processing the next packet.
`
`(cid:36)(cid:80)(cid:68)(cid:93)(cid:82)(cid:81)(cid:17)(cid:70)(cid:82)(cid:80)(cid:15)(cid:3)(cid:44)(cid:81)(cid:70)(cid:17)(cid:3)(cid:72)(cid:87)(cid:3)(cid:68)(cid:79)(cid:17)(cid:3)(cid:3)(cid:3)(cid:40)(cid:91)(cid:75)(cid:76)(cid:69)(cid:76)(cid:87)(cid:3)(cid:20)(cid:19)(cid:19)(cid:26)(cid:3)
`Amazon.com, Inc. et al. Exhibit
`
`
`
`221
`
`The number of packets being processed in parallel could be as high as the number of
`stages in the pipeline.
`
`The advantage of this scheme is simple implementation, especially if pipeline stages
`are drawn along protocol layer boundaries.
`It is natural to implement each layer as a
`separate process. The pipeline will run at the speed of the slowest segment.
`It may be
`possible to divide slow stages to increase overall pipeline speed. However, if such divisions
`do not fall on natural boundaries, implementation becomes complex. The drawback is
`the context switching, communication and synchronization overhead at layer boundaries.
`Between each segment of the pipeline the packet will be handed off or queued for the next
`stage. If there are more stages than there are processors, several context switches may be
`necessary for each packet processed. Without a solution to these problems such a scheme
`is unsuitable for a high speed network.
`
`2.4 Processor Per Packet
`
`The fourth option (and the one selected) is parallelization at the level of a single packet.
`Here, each packet is assigned its own processor to carry it through the stack. There
`are several advantages to this scheme. First, there is no extra communication overhead
`between the layers. Second, performance improvements are made whenever trafiic is
`heavy regardless of how the traffic is distributed over connections. Third, while there
`is synchronization required between the parallel tasks, it is kept to a reasonable level.
`Finally, in the general case, no extra context switching overhead is incurred.
`
`3 Parallelization of the ISO Stack
`
`While much of the protocol processing task is parallelizable, there are areas where syn-
`chronization must occur and others where no parallelism is possible at all.
`
`3.1 Where is Parallelism Possible?
`
`Protocol processing for connection—oriented protocols (and some connection—less protocols)
`can be roughly divided into two unequal parts: packet processing and connection con-
`trol block (CCB) manipulation. Packet processing involves external data representation
`conversion functions, checksurn calculations, encryption and decryption, and encoding
`and decoding of packet headers. CCB manipulation includes state machine transitions,
`enqueuing received segments, etc. In some implementations these parts are intertwined,
`though our experience shows they are not difficult to isolate. The relative processing over-
`head of these two parts varies from layer to layer. For most layers, CCB manipulation for
`a single packet involves only a few lines of code to enqueue a segment, update sequence
`numbers or initiate an acknowledgement (often not even these are required). Packet pro-
`cessing is normally more complex. Even at layers with very simple protocol data unit
`(PDU) formats, such as the network and transport layers, packet manipulation is a sig-
`nificant proportion of the protocol code. At higher layers, such as the presentation and
`application, packet processing overhead completely dominates all other processing. Clark
`and Tennenhouse [CT90] indicate that up to 97% of total protocol stack overhead can
`
`(cid:36)(cid:80)(cid:68)(cid:93)(cid:82)(cid:81)(cid:17)(cid:70)(cid:82)(cid:80)(cid:15)(cid:3)(cid:44)(cid:81)(cid:70)(cid:17)(cid:3)(cid:72)(cid:87)(cid:3)(cid:68)(cid:79)(cid:17)(cid:3)(cid:3)(cid:3)(cid:40)(cid:91)(cid:75)(cid:76)(cid:69)(cid:76)(cid:87)(cid:3)(cid:20)(cid:19)(cid:19)(cid:26)(cid:3)
`Amazon.com, Inc. et al. Exhibit
`
`
`
`222
`
`be attributed to the presentation conversion function (a packet processing task). While
`this number varies depending on the complexity and size of the application data, informal
`experimentation using our own protocol stack has confirmed the dominant overhead of
`presentation layer data conversion.
`
`Whether CCB manipulation and packet processing are parallelizable depends on how
`parallelism is introduced. For example, if parallelism is at the level of a single connection
`(i.e. one process per connection) then there is no parallelism other than that available
`across connection boundaries.
`If parallelism is at the level of the protocol function (one
`process per protocol function) then CCB manipulation and packet processing are both
`parallelizable. This parallelism, however, comes at the expense of significant synchroniza-
`tion and communication complexity unless the protocol is specifically designed for this
`purpose.
`
`If parallelism is at the packet level (one process per packet), significant parallelism is
`available with reasonable levels of synchronization and communication complexity. Packet
`processing does not require access to shared data structures. Packets are generally context
`free in the sense that state information and synchronization are not required for encoding
`and decoding; only the packet is required for decoding and a reference to the packet
`contents for encoding.
`lf only one process is responsible for the encoding or decoding of a
`single packet, then there is no competition for these data structures. This makes packet
`processing an excellent candidate for parallelization. CCB manipulation is more difficult
`using per—packet parallelism.
`111 this case more than one process may be contending
`for access to shared data in the connection control block. A simple example is that of
`enqueuing a segment of a multi—segment PDU. If multiple segments are being enqueued in
`parallel, access to the list structure must be controlled to maintain list integrity. Likewise,
`state examination and transitions must be made in a consistent manner. It is therefore
`
`difficult to parallelize CCB manipulation.
`
`3.2 How is This Parallelism Achieved?
`
`As indicated, our implementation achieves parallelism on a per-packet basis. This is
`implemented by creating some number of processes designed to run in parallel, each
`capable of taking a packet through the protocol stack. However, the difficult work in
`parallelizing a set of protocols is not introducing parallelism, but is rather introducing
`synchronization and mutual exclusion where parallelism is not desirable.
`
`3.2.1 Mutual Exclusion
`
`Making CCB manipulation a critcial section is possible using mutual exclusion constructs
`such as semaphores, critical regions, or monitors. This mutual exclusion is at the level of
`a single CCB. Since a CCB pertains to only one connection at a single layer, competition
`for access to a CCB is restricted to CCB manipulation occurring at the same layer for the
`same connection. No competition (other than that for shared hardware resources) occurs
`across layer or connection boundaries.
`
`(cid:36)(cid:80)(cid:68)(cid:93)(cid:82)(cid:81)(cid:17)(cid:70)(cid:82)(cid:80)(cid:15)(cid:3)(cid:44)(cid:81)(cid:70)(cid:17)(cid:3)(cid:72)(cid:87)(cid:3)(cid:68)(cid:79)(cid:17)(cid:3)(cid:3)(cid:3)(cid:40)(cid:91)(cid:75)(cid:76)(cid:69)(cid:76)(cid:87)(cid:3)(cid:20)(cid:19)(cid:19)(cid:26)(cid:3)
`Amazon.com, Inc. et al. Exhibit
`
`
`
`223
`
`3.2.2 Event Ordering
`
`The main synchronization problem is that of event ordering3. That is, when a layer
`receives an event, how does it determine the its relative order in the event stream?
`Correct event ordering must be determined for in—order queueing of segments of a PDU,
`in—order delivery of data to the application, and proper processing of control events.
`Normally, event ordering is determined either through external clues such as sequence
`numbering (as at the data link layer), or through internal clues such as event arrival order
`(as at all layers above the network or transport layer). When unsynchronized parallelism
`is introduced, external (PDU based) sequence numbers still supply sufficient ordering
`information for data events. However, layers which normally rely on arrival order (such
`as the session, presentation and application, as well as the lower layers for control events)
`can no longer do so.
`In a parallel implementation it is possible for a later event to be
`processed more quickly than an earlier event and therefore arrive at some layer out of
`order.
`
`Our solution for determining proper event ordering is a11 internal sequence number
`scheme. This scheme is applied to all events at all layers.
`In this case, every generated
`event is accompanied by an internal sequence number. The sequence number is calcu-
`lated by the layer which generates the event. Even though events can be received out
`of order, their correct ordering can now be determined by the accompanying internal se-
`quence number. This solution does not impose any restrictions on parallelism, but leaves
`some questions unanswered. For example, how are out—of—order data and control events
`handled?
`
`3.2.3 Out of Order Events
`
`There are two possibilities. The first is to process the event in order of arrival based on
`the correct order as indicated by the sequence number. The second is to postpone event
`processing until all intervening events have arrived. The first option varies in difficulty
`depending on the event and protocol being processed. Certain events such as a data
`indication or a request for retransmission of a set of packets could perhaps be processed
`out of order without adverse effects. Most events which have no significant effect on the
`protocol state fall into this category. In general, however, ISO connection based protocols
`are all state—oriented. The correct action taken on receipt of an event depends on the
`current state of the protocol state machine. Likewise, receipt of a control (and to a
`lesser extent, data) event generally alters the state of the protocol. Therefore, the action
`taken on receipt of an out—of—order event depends on the nature and order of intervening
`events not yet received. Given these difficulties, we adopt the second option and postpone
`out—of—order events until they can be processed in—order.
`
`To summarize, we have adopted the following approach. Parallelism is at the level
`of one processor—per—packet.
`In general, at each layer, packet processing and CCB ma-
`
`3An event is any protocol information passed to a protocol layer. Examples include data
`arrival and timer expiry. Events are divided into two categories: control and data. A control
`event is one which passes control information to a layer. Examples are connection requests
`and indications. Control events generally change the state of the protocol state machine. Data
`events are those that pass packets from one layer to another. Examples are data requests and
`indications.
`
`(cid:36)(cid:80)(cid:68)(cid:93)(cid:82)(cid:81)(cid:17)(cid:70)(cid:82)(cid:80)(cid:15)(cid:3)(cid:44)(cid:81)(cid:70)(cid:17)(cid:3)(cid:72)(cid:87)(cid:3)(cid:68)(cid:79)(cid:17)(cid:3)(cid:3)(cid:3)(cid:40)(cid:91)(cid:75)(cid:76)(cid:69)(cid:76)(cid:87)(cid:3)(cid:20)(cid:19)(cid:19)(cid:26)(cid:3)
`Amazon.com, Inc. et al. Exhibit
`
`
`
`224
`
`nipulation are disjoint stages. All packet processing is done in parallel. There is no syn-
`chronization or competition for resources (other then general hardware resources) during
`encoding and decoding. CCB manipulation is difficult to parallelize and is of relatively
`short duration. Therefore the CCB manipulation stage for each CCB is made mutually
`exclusive. Competition for CCB access occurs only when processes arrive at the CCB
`manipulation stage at the same layer, on the same connection, and at the same time.
`CCB manipulation for out—of—order events is difficult to perform on event arrival. These
`events are therefore delayed until the intervening events have arrived. This approach
`allows parallelism during all packet processing activities (which are extensive, especially
`at the presentation layer). It suffers syncl1ro11izatio11 overhead only when processes arrive
`together at a CCB manipulation stage (same layer, same connection) or when an event
`passes another in the stack.
`
`4
`
`Implementation
`
`This section describes details of the parallel threads environment, the protocol stack
`code, the structure of the threads executing the code, and the mechanisms used for event
`sequencing and CCB mutual exclusion.
`
`4. 1 Pt hreads
`
`A parallel execution environment is required to support the parallel communication stack.
`Pthreads supplies such an environment.
`
`Pthreads is a user—leVel kernel for use on multiprocessors. It provides an environment
`suitable for running a group of parallel cooperating threads. This first implementation of
`Pthreads runs on a shared memory Motorola 88100 based multiprocessor running UNIX
`System V. It can be ported to other multiprocessors (or uniprocessors) and other host
`operating systems so long as shared memory is supported by the target.
`
`Pthreads allows efficient process creation and destruction, interprocess communica—
`tion and memory allocation in a preemptive parallel environment. Pthreads provides
`semaphores (as well as IPC) for application thread synchronization.
`It also provides a
`sleep facility and 0(1) memory allocation routines allowing fast memory allocation. The
`Pthreads ready queue is shared by all processors. The highest priority ready thread is
`run by the next available processor and may therefore migrate between processors over its
`lifetime. Pthreads shared kernel data structures are protected by a system lock. Access
`to a critical section requires continued examination of the lock location until it becomes
`free. Once it is found to be free, locking is attempted using the XMEM (88100 test and
`set) instruction. This process repeats until the XMEM return value verifies locking. Re-
`peated lock examination, rather than repeated XMEM execution, is done to avoid CPU
`instruction synchronization overhead and take advantage of the data cache.
`
`(cid:36)(cid:80)(cid:68)(cid:93)(cid:82)(cid:81)(cid:17)(cid:70)(cid:82)(cid:80)(cid:15)(cid:3)(cid:44)(cid:81)(cid:70)(cid:17)(cid:3)(cid:72)(cid:87)(cid:3)(cid:68)(cid:79)(cid:17)(cid:3)(cid:3)(cid:3)(cid:40)(cid:91)(cid:75)(cid:76)(cid:69)(cid:76)(cid:87)(cid:3)(cid:20)(cid:19)(cid:19)(cid:26)(cid:3)
`Amazon.com, Inc. et al. Exhibit
`
`
`
`225
`
`4.2 Protocol Implementation
`
`4.2.1 The Parallel Stack
`
`Our work includes a parallel implementation of the ISO seven layer connection—oriented
`stack. These protocols provide the remote operations service defined by the CCITT
`standard X219 (ISO standard 9072-1) [CI87a] using the protocol defined by the CCITT
`standard X229 (ISO standard 9072-2) [CI87b]. To support this protocol, the OSI trans-
`port, session, presentation and association control layers are also provided. The remote
`operations layer supports all association classes (1, 2 and 3), and supports operation
`classes I and 2. The association control service conforms to CCITT recommendation
`
`X227 [CCl87a]. The presentation service provides the kernel functional unit of CCITT
`recommendation X226 [CCI87b]. The session service provides the full-duplex basic com-
`bined subset of functional units described by CCITT recommendation X225 [CCI89a].
`Class 0 of the transport layer of CCITT recommendation X224 [CCI89b]is also provided.
`The transport layer normally makes use of TCP4 as its network service using the interface
`described by [RC86].
`
`4.2.2 Process Structure
`
`Many threads cooperate to provide the protocol service. At the top level is the application
`thread. Only the client side of the application uses a thread which exists over the life
`of the application. This thread runs the client application initiating requests of the
`server application by way of the protocol service. The server application uses a separate
`thread to initialize its services and request initialization of the communication service.
`Following this initialization the server thread exits. Events for the server are delivered
`asynchronously by way of upcalls. The upcalls are executed by threads belonging to the
`communication service.
`
`The communication service maintains a set of threads called packet shepherds. The
`number of packet shepherds is configurable and varies with the number of network con-
`nections. Each packet shepherd executes a loop. This loop blocks awaiting an event from
`the network (a packet), and then shepherds the packet through the protocol stack. When
`the packet shepherd completes processing its packet (including the possible application
`upcall and resulting transmission), it returns to the top of the loop, awaiting its next
`packet. Any number of packet shepherds may be active concurrently in the stack.
`
`There is also a set of worker threads, each responsible for a different function. For
`example, there is an acceptor thread responsible for fielding requests for new network
`connections. There is also a timer thread responsible for generating and executing timer
`events.
`
`4.2.3
`
`Internal Sequence Numbers
`
`Parallel processing of events (eg. received packets) allows the possibility that events may
`pass each other in the stack as they are being processed. This can cause a logically
`earlier event to arrive at some layer after a logically later event. In order for correct event
`
`‘*For testing purposes, TCP network access is replaced by a fast bufler scheme.
`
`(cid:36)(cid:80)(cid:68)(cid:93)(cid:82)(cid:81)(cid:17)(cid:70)(cid:82)(cid:80)(cid:15)(cid:3)(cid:44)(cid:81)(cid:70)(cid:17)(cid:3)(cid:72)(cid:87)(cid:3)(cid:68)(cid:79)(cid:17)(cid:3)(cid:3)(cid:3)(cid:40)(cid:91)(cid:75)(cid:76)(cid:69)(cid:76)(cid:87)(cid:3)(cid:20)(cid:19)(cid:19)(cid:26)(cid:3)
`Amazon.com, Inc. et al. Exhibit
`
`
`
`226
`
`processing the protocol machine must have knowledge of the logical order of received
`events. This is accomplished using internal sequence numbers. These sequence numbers
`are internal in the sense that they are private within a protocol implementation. They
`are an implementation feature, invisible from a functional point of view.
`
`Internal sequence numbers are generated by all layers which generate events and are
`associated one-to—o11e with these events. A sequence number stream is a monotonically
`increasing-by—one series. There is a separate sequence number stream for each event
`destination of each connection. For example, a session connection generates two indepe11—
`dent sequence number streams. One for events destined for the associated presentation
`connection, and one for events destined for the associated transport connection.
`If one
`connection splits events onto several service providers, separate sequence number streams
`are generated for each. The sequence number associated with each event determines its
`logical position in the event sequence regardless of arrival order.
`
`There are also events which have no place (or perhaps more accurately, have any
`place) in the event stream. These are called unordered events. Unordered events are
`generally asynchronous events such as timer expiry. These events are assigneda special
`null sequence number indicating their unordered status.
`
`4.2.4 The Event Gate
`
`The event gate (depicted in figure l) is a software mechanism that controls the entry of
`threads into a connection’s CCB manipulation stage. There is one event gate for each
`connection active within each layer. Again, our parallelism structure requires no synchro-
`nization between separate connections, even at the same layer. Synchronization is only
`required between threads entering the CCB manipulation stage for the same connection
`at the same layer. There are two aspects to the synchronization performed by the event
`gate: mutual exclusion and sequencing.
`
`Manipulations involving a CCB must be mutually exclusive. These manipulations can
`involve adding segments to a list, counting list elements, or examining and updating state
`variables.
`If two or more processes make these manipulations concurrently, the obvious
`data consistency problems ensue. Therefore the event gate allows only one process at
`a time into the CCB manipulation stage of a connection. An event arriving while the
`CCB manipulation stage is occupied is suspended and enqueued until the stage is clear.
`The resultant context switching is expensive but the relatively short duration of the
`CCB manipulation stage (when compared to the packet processing stages) minimizes
`contention.
`
`Event gates also sequence events entering CCB manipulation stages. Events are al-
`lowed access sequentailly in order of internal sequence number. An event arriving out
`of order is suspended until all intervening in-order events have passed through the CCB
`manipulation stage. These events may arrive from more than one source. For example,
`a connection receives events from at least its service user and service provider. Therefore
`each event gate must keep track of the separate sequence number streams. If more than
`one event is eligible to enter the CCB manipulation stage of a connection (these would
`be events from separate streams) the choice of which is allowed in first is arbitrary. The
`event gate also receives unordered events. These events are allowed in ahead of all other
`waiting events as soon as the CCB manipulation stage is free.
`
`(cid:36)(cid:80)(cid:68)(cid:93)(cid:82)(cid:81)(cid:17)(cid:70)(cid:82)(cid:80)(cid:15)(cid:3)(cid:44)(cid:81)(cid:70)(cid:17)(cid:3)(cid:72)(cid:87)(cid:3)(cid:68)(cid:79)(cid:17)(cid:3)(cid:3)(cid:3)(cid:40)(cid:91)(cid:75)(cid:76)(cid:69)(cid:76)(cid:87)(cid:3)(cid:20)(cid:19)(cid:19)(cid:26)(cid:3)
`Amazon.com, Inc. et al. Exhibit
`
`
`
`227
`
`Protocol Stack
`
`Event Gate (Controls
`access to CCB
`
`Manipulation Stage)
`
`
`
`Figure 1: The Division of Each Layer Into CCB Manipulation and Packet Processing
`Tasks
`
`5 Performance
`
`This section presents the results of performance measurements of the parallel stack. For
`testing purposes, the TCP layer has been replaced with a buffer scheme. The buffer
`scheme makes data continuously available to the stack for processing. Therefore the stack
`will process data at its maximum sustainable rate. This is intended to simulate a high
`bandwidth network.
`
`5 . 1 Data Transfers
`
`The application data is intended to mimic the format and complexity of X.500 requests
`and responses. The data is encoded using basic encoding rules and is represented here
`using ASN.1. The requests represent a set of keys to be searched for in a database.
`Performance results do not include any application processing other than that done by
`ROSE. The total size of the SearchKey set varies from O to 1920 bytes over the various
`tests. The request format is as follows:
`
`SearchKeys ::= SET OF SearchKey
`
`SearchKey
`goesBy
`empNo
`
`= [APPLICATION 0] SET
`Name,
`EmployeeNumber
`
`{
`
`(cid:36)(cid:80)(cid:68)(cid:93)(cid:82)(cid:81)(cid:17)(cid:70)(cid:82)(cid:80)(cid:15)(cid:3)(cid:44)(cid:81)(cid:70)(cid:17)(cid:3)(cid:72)(cid:87)(cid:3)(cid:68)(cid:79)(cid:17)(cid:3)(cid:3)(cid:3)(cid:40)(cid:91)(cid:75)(cid:76)(cid:69)(cid:76)(cid:87)(cid:3)(cid:20)(cid:19)(cid:19)(cid:26)(cid:3)
`Amazon.com, Inc. et al. Exhibit
`
`
`
`228
`
`}
`
`Name ::= [APPLICATION 2]
`
`IMPLICIT SEQUENCE {
`
`IA5String,
`givenName
`IA5String,
`initial
`fami1yName IA5String
`
`}
`
`EmployeeNumber
`
`::= [APPLICATION 3]
`
`IMPLICIT INTEGER
`
`The responses consist of a set of personnel records. The size of the PersonnelRecord
`set varies from 0 to 16768 bytes over the various tests. The response format is as follows:
`
`Records
`
`::= SET OF Personne1Record
`
`Personne1Record ::= [APPLICATION 1]
`
`IMPLICIT SET {
`
`goesBy
`title
`empNo
`dateOfHire
`
`[0]
`
`Name,
`IA5String,
`Emp1oyeeNumber ,
`[1] Date,
`
`nameOfSpouse [2] Name,
`children
`[3]
`IMPLICIT SEQUENCE OF Childlnformation DEFAULT {}
`
`}
`
`Childlnformation ::= SET {
`
`Name,
`goesBy
`dateOfBirth [0] Date
`
`}
`
`Name
`
`::= [APPLICATION 2]
`
`IMPLICIT SEQUENCE {
`
`IA5String,
`givenName
`IA5String,
`initial
`familyName IA5String
`
`} D
`
`ate ::= [APPLICATION 4]
`
`IMPLICIT IA5String -- YYYYMMDD
`
`5.2 Overall Performance
`
`Tests were conducted which measure the time required to process one request /result pair
`using 1, 2, 3, and 4 processors over 1 or 3 connections. The data size for all tests is
`represented as A/B where A is the request size in bytes, and B is the result size in
`bytes. Figure 2 gives a comparative View of test results for all data sizes and numbers of
`processors over a single connection. As expected, the time required to process each re-
`quest / response pair diminishes with increasing number of processors and decreasing data
`size. Figures 3 and 4 display with greater detail the time required to process each re-
`quest / response pair. Notice that the paralle