A Reduced Operation Protocol Engine (ROPE) for a multiple-layer bypass architecture

Y. H. Thia (*)¹ and C. M. Woodside (**)

Newbridge Networks, Inc., Ottawa, Canada (*) and
Dept. of Systems and Computer Engineering, Carleton University, Ottawa, Canada (**)

¹ This research was done while Dr. Thia was at Carleton University.

Abstract - The Reduced Operation Protocol Engine (ROPE) presented here offloads critical functions of a multiple-layer protocol stack, based on the "bypass concept" of a fast path for data transfer. The motivation for identifying this separate processing path is that it involves only a small subset of the complete protocol, which can then be implemented in hardware. Multiple-layer bypass also eliminates some inter-layer operations such as queue and buffer management, context switching and movement of data across layers, all of which are a significant overhead. ROPE is intended to support high-speed bulk data transfer. The paper describes the design of a ROPE chip for the OSI Session and Transport layer protocols, using VHDL. The design is practical in terms of chip complexity and area, using current gate array technology, and simulation shows that it can support a data rate approaching 1 gigabit per second, in a connection attached to an end-system.

Keyword codes: C.2.2, B.4.1
Keywords: Network Protocols, Data Communications Devices

1 Introduction

The advent of Fibre Optic technology, which offers high bandwidth and low bit error rates, has shifted the performance bottleneck from the communications channel to the communications processing in the end-points of the system [26]. Other trends such as improved quality-of-service guarantees will reinforce this effect. The heavy processing load is due to a combination of operating system overhead, protocol complexity, and per-octet processing on the data stream. To alleviate the end-system bottleneck one may consider new protocols [10], improved software implementation of existing protocols [5, 35], parallel processing techniques [14, 21, 38], special protocol structures [15, 30] and hardware assist [22] by offloading all or part of the protocol functions to an adaptor. This paper takes the latter approach.

The key problems associated with offboard processing include:

□ Partitioning the functionality between the host and the adaptor is difficult and may easily lead to a complex additional protocol between the two parts, which may cancel out or offset the potential gain from offloading. For example, the buffer management task [36] may be offloaded, but this leaves the problem of control for accessing it within the full protocol logic.
□ Non-protocol-specific processing is a large part of the total load, as shown in [35]. Examples include interrupt handling, context switching and data copying at layer boundaries in deeply layered protocol stacks.

□ The choice of hardware for the adaptor depends on the complexity of the functions it supports. In [2, 22], where the transport protocol layer is offloaded, or in [7], where the full protocol stack can be offloaded, general purpose microprocessors are used. Probably because of the complexity of existing protocols, VLSI implementation [24] above the data link layer has been disappointing so far. In [8], dedicated VLSI chips are used to support TCP checksums. Also, some newer lightweight transport protocols are specially designed for VLSI implementation [1, 3].

□ There is a tradeoff between performance, flexibility and cost. If the key functions of the frequently executed portion of the protocol remain relatively stable, there can be significant advantage in providing hardware support for these functions, leaving the other tasks in the host software for flexibility.

□ As host processing speed continues to outpace memory bandwidth, and as the network bandwidth approaches the processor memory bandwidth, it is important to keep data movement on the workstations down to the minimum [4, 9, 28].

This paper presents a feasibility study for a new approach to hardware assistance. It combines the relatively simple operations needed for data transfer across multiple layers and provides a hardware "fast path" for them, which will be efficient for bulk data transfer. It is based on the "protocol bypass concept" [37], which is a generalization of Jacobson's "Header Prediction" algorithm [20] for TCP/IP. Bypass solves the problems identified above, which may limit the use of offboard processing, by implementing an entire service through all layers for certain cases. This simplifies the interface between the host and the adaptor chip and minimizes their interaction, which is supported by an access test, some DMA processing and a simple command protocol. The chip design based on bypassing is called ROPE, for Reduced Operation Protocol Engine. The contribution of this paper is to define the host/chip interface and the chip operation, and to report on a VHDL-based feasibility study of the chip design. It appears to be feasible to support an end-system single-connection data rate approaching 1 Gbps.

The next section introduces the bypass concept, its architecture and implementation. Section 3 analyzes the key protocol processing overheads and discusses the requirements for a bypass VLSI implementation. Sections 4, 5 and 6 describe a design study of a ROPE chip using the industry standard hardware description language, VHDL, with conclusions in Section 7.

2 The Bypass Concept

A bypass adds an additional path for certain operations, with minimal changes to the original software. Conformance to the protocol is maintained by doing all the other operations through the normal "heavyweight" path. A bypass path can be provided for send, for receive, or for both together, and is compatible with other end-systems implemented without a bypass.

Figure 1 Bypass Architecture: each end-system contains a standard (multi-layer) protocol stack (SPS) and a bypass stack (e.g. a ROPE chip), joined by the send bypassed path and the receive bypassed path.

2.1 Bypass Architecture

Figure 1 illustrates the architecture of a bypass implementation for any standard protocol. The standard protocol stack (SPS) is the processing path taken by all PDUs during a connection without the bypass. The SPS may refer to a single layer or to multiple adjoining layers of a layered protocol stack. The bypass has 4 key components:

□ Send Bypass Test;
□ Receive Bypass Test;
□ Bypass Stack;
□ Shared data for access by the two tests, the Bypass Stack and the SPS.

The send bypass test identifies outgoing packets that are data packets in the data transfer phase. The receive bypass test matches the incoming PDU headers with a template that identifies the predicted bypassable headers. The bypass stack performs all the relevant protocol processing in the data transfer phase. The shared data are used to maintain state consistency between the SPS and the bypass stack, including window flow control parameters and connection identifiers. Whenever there is a change in the processing path between the SPS and the bypass stack, checks are performed to ensure that there are no outstanding packets in the current path, i.e. "no in-transit PDUs", before the change is made. A more detailed discussion of this is presented in another paper [33].

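To make these components concrete, the C sketch below models the shared per-connection data and the receive bypass test. The structure layout and field names (window edges, connection reference, header template and mask) are illustrative assumptions; the paper describes the components only in prose and does not prescribe a data layout.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Shared per-connection state, visible to the SPS, the two bypass
     * tests and the bypass stack (hypothetical layout). */
    struct bypass_shared {
        uint16_t dst_ref;             /* connection identifier (DST-REF)        */
        uint32_t snd_nxt;             /* next send sequence number              */
        uint32_t rcv_nxt;             /* next expected receive sequence number  */
        uint32_t upper_window;        /* upper edge of the flow-control window  */
        bool     bypass_active;       /* true while traffic stays on the fast path */
        uint8_t  header_template[16]; /* predicted header bytes                 */
        uint8_t  header_mask[16];     /* 0xFF where the header must match       */
    };

    /* Receive bypass test: compare the incoming header against the predicted
     * template, masking out fields (e.g. the sequence number) that are allowed
     * to vary.  A match routes the PDU to the bypass stack. */
    static bool receive_bypass_test(const struct bypass_shared *s,
                                    const uint8_t *hdr, size_t hdr_len)
    {
        if (hdr_len < sizeof s->header_template)
            return false;
        for (size_t i = 0; i < sizeof s->header_template; i++) {
            if ((hdr[i] & s->header_mask[i]) != s->header_template[i])
                return false;         /* not a predicted data PDU */
        }
        return true;                  /* bypassable data PDU */
    }

The send bypass test would apply the same kind of template comparison to outgoing service requests before they enter the standard stack.
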
2.2 Efficient logic for the bypass test

The "no-in-transit PDU" test can often be avoided. At the beginning of data transfer on a new connection, it is automatically satisfied. It holds as long as no packet fails a bypass test, and it is sufficient to maintain a flag to indicate this. Once a packet fails and goes to the SPS, a full "no-in-transit PDU" test must be performed for each packet until the test succeeds, after which control can go back to the flag. Token management and synchronization points of the session layer [18, 19], which are mapped to equivalent application and presentation service primitives, can be used as synchronization points within the bypass architecture as it switches between the SPS and the bypass stack. Since they are only inserted periodically in bulk data transfer and can be controlled by the application process, these overheads are not excessive.

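A minimal C sketch of this switching rule is shown below. The flag and the in-transit counters are assumptions introduced for illustration; the paper states the rule only in prose.

    #include <stdbool.h>

    /* Hypothetical per-connection bookkeeping for path switching. */
    struct path_state {
        bool     clean;          /* true while no packet has failed a bypass test */
        unsigned sps_in_transit; /* PDUs currently being handled by the SPS       */
        unsigned bp_in_transit;  /* PDUs currently being handled by the bypass    */
    };

    /* Returns true when it is safe to switch processing paths. */
    static bool safe_to_switch(const struct path_state *p, bool to_bypass)
    {
        if (p->clean)
            return true;                          /* cheap flag test */
        /* Full "no-in-transit PDU" test: the path being left must be empty. */
        return to_bypass ? (p->sps_in_transit == 0)
                         : (p->bp_in_transit == 0);
    }

    static void on_bypass_test_failed(struct path_state *p)
    {
        p->clean = false;    /* fall back to the full test from now on */
    }

    static void on_paths_drained(struct path_state *p)
    {
        p->clean = true;     /* both paths empty: the flag is sufficient again */
    }
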
2.3 Multiple-layer bypass

A bypass for multiple layers instead of just one gives additional gains by avoiding:

□ Overhead of encoding and decoding the interface control information passed between layers;
□ Executing the full general protocol logic for the layers to decide how to manipulate the data;
□ Queueing of data at layer boundaries.

The advantage is increased further in cases where some layers, like the network and application layers, have been further subdivided into sublayers.

A multiple-layer bypass path is a concatenation of the processing procedures performed by the adjacent layers when they are simultaneously in the data transfer phase. Meanwhile, the separate layers in the SPS path handle the other phases.

In summary, the separation of the bypass path offers the following advantages:

□ The processing path of data PDUs can be optimized;
□ The number of possible PDU formats in the bypass path is reduced to data transfer PDUs;
□ The finite state machine of the protocol is reduced to only the "OPEN" state, for as long as processing remains in the bypass path. The state of the system does not change during the entire data transfer phase, and the protocol processing is reduced to ensuring reliable transfer of data across the communications network.

3 Design Considerations for a Hardware Bypass

3.1 Factors affecting system performance

This section summarizes the major factors affecting throughput performance in a deeply layered stack on an end system. It follows the description by Heatley and Stokesberry [16]. Protocol procedures can be characterized as per-octet, per-packet or per-group-of-packets operations. The per-octet operations take an average time A per octet (e.g., checksum), and the per-packet procedures take time B per packet (e.g., address decoding and multiplexing). Per-group-of-packets operations include, for example, transmission of acknowledgments; their frequency is implementation-dependent, and their timing is aggregated and included in parameter B.

The throughput bound imposed by protocol processing alone, \lambda_{max} in octets per second, is then given by the equation:

    \lambda_{max}(x) = \frac{x}{A \cdot x + B \cdot \lceil x/M \rceil}        (3.1.1)

where x is the size of the user message in octets and M is the maximum PDU size. In bulk data transfer, as x becomes large,

    \lim_{x \to \infty} \lambda_{max}(x) = \frac{1}{A + B/M}        (3.1.2)

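To make the bound concrete, the short C program below evaluates equations (3.1.1) and (3.1.2) for illustrative parameter values; the numbers chosen for A, B and M are assumptions for the sake of the example, not measurements from this study.

    #include <math.h>
    #include <stdio.h>

    /* Equation (3.1.1): throughput bound in octets/second for an x-octet
     * user message, per-octet cost A (s), per-packet cost B (s) and
     * maximum PDU size M (octets). */
    static double lambda_max(double x, double A, double B, double M)
    {
        return x / (A * x + B * ceil(x / M));
    }

    int main(void)
    {
        /* Illustrative values only: 50 ns per octet, 20 us per packet,
         * 1 Kbyte maximum PDU. */
        double A = 50e-9, B = 20e-6, M = 1024.0;

        printf("bound for a 1 Moctet message: %.1f Mbit/s\n",
               8e-6 * lambda_max(1e6, A, B, M));
        /* Equation (3.1.2): asymptotic bound for bulk transfer. */
        printf("asymptotic bound            : %.1f Mbit/s\n",
               8e-6 / (A + B / M));
        return 0;
    }

With these assumed costs the asymptotic bound is roughly 115 Mbit/s, which illustrates how strongly the per-octet term A dominates once B is amortized over a large PDU.
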
The protocol processing load on an end system is typically shared between the host and the network adaptor. As the raw data bit rate supported by optical networks approaches the main memory bandwidth of the end system, the cost of moving data and of per-octet processing limits the effective throughput presented to the application process, especially for bulk data transfer. The data portion of a PDU may be physically moved for the following reasons:

□ Copying between the adaptor buffer and the host system memory;
□ Crossing protection domains (address spaces), e.g. at the user/kernel boundary. This problem becomes more pronounced in microkernels, which treat a protocol task as a server process outside the kernel domain [11];
□ Per-octet processing like presentation conversion and checksum routines.

Hardware implementation is particularly efficient for per-octet operations.

3.2 Requirements of a bypass VLSI implementation

The problems associated with separate or offboard processing, which were discussed in the introduction, are addressed by the VLSI design as follows:

□ A clean separation of functionality requiring only a simple protocol to communicate between the host and adaptor is desired, and is provided by a bypass. Its particular set of functions is complete in itself and has a focussed interface with the host software at the packet entry point. There is relatively infrequent switching between the SPS and the bypass stack;

□ Reduced non-protocol-specific processing overhead. For example, the processing of acknowledgment packets is dominated by interrupt handling, typically a few hundred instructions, rather than by the protocol processing itself. Our approach removes acknowledgment handling altogether from the host. Also, the bypass system can be extended to incorporate multiple-layer stacks and remove overhead that way;

□ VLSI implementation complexity: only the data transfer functions are implemented;

□ Tradeoff between performance, flexibility and cost. The host software processes non-data-transfer packets, which are typically small but require more flexible and more complex processing. They are also only per-packet, so they have less impact overall. They can be efficiently handled by the host given the projected increase in host processing speed [17]. The operations implemented in VLSI are understood to have less need of flexibility;

□ Remove operations that are inefficient on the host. It is important to look at both the processing requirements and the data movement of the Application PDUs. The critical resource is often the host bus/memory path, and caching is used to increase its efficiency. However, long traverses through the data for per-octet operations like checksum and encoding interfere with efficient caching, so it is particularly appropriate to offload them.

Table 1 Bypassable versus non-bypassable functions. Columns: Layer; Procedure; Bypass Chip; Host; Per-Octet (A); Per-Packet (B); Per-Group-Of-Packets aggregated to Per-Packet for bulk data transfer (B); Remarks.

    Presentation:         Encoding; Encryption; Compression; Context Alteration
    Session:              Synchronization Management; Token Management
    Transport (Class 4):  Checksum (optional); Timer Management; Generation of ACK packets (Flow Control); Resequencing; Header Construction; Header Decode; Buffer Management
    All 3 layers:         Context Switching; Data Copying

    Remarks: Depends on implementation; Minimized (simple scheme); Moved away from host OS; Use of dual-ported memory and DMA; With multiple-layer bypass, data copying within layers is eliminated.

Table 1 identifies procedures which are strong candidates for implementation in the bypass chip, and those which are better handled by the host, during the data transfer phase. Besides these, the send bypass test is done on the host and the receive bypass test is done on the Network Interface Adaptor (NIA).

4 VLSI implementation of a Reduced Operation Protocol Engine (ROPE) chip using VHDL

The VHSIC Hardware Description Language (VHDL) [6, 27] was used to model and synthesize the chip. VHDL is an industry standard language which can be used to represent all levels of abstraction, from logic gates to the system level. By utilizing VHDL as a specification tool, it is possible to begin simulation and debugging of complex systems before details regarding the implementation are fully specified. It also offers the potential of an automatic path from the protocol specification to VLSI implementation, in which any modifications to the specification can be easily propagated to the gate level design.

Figure 2 Block Diagram of VLSI bypass system: the host processor and host memory sit on the host processor bus; the ROPE chip contains a DMA unit, a presentation module, a checksum module, internal registers, a protocol processing engine and internal dual-ported memory, and connects through address/data paths to the Network Interface Adaptor (NIA) (e.g. FDDI or ATM) and the transmission medium.

4.2 Architectural Description

Figure 2 shows the block diagram of the system. The host processor and NIA components provide logical interfaces for simulation of behaviour, but they insert no timing delays. For modeling purposes they were described as being infinitely fast, either as a source or as a sink. This places the maximum stress on the ROPE chip. The architectural considerations involved in the chip design can be summarized as follows:

□ Movement of data across the host bus interface is minimized by using an on-chip DMA for fast block data transfer to/from the host system memory.
□ On-chip dual-ported memory is used, rather than the host memory, to avoid critical constraints on bus access latency and throughput.
□ The control registers of the bypass chip are I/O mapped to the host processor. This enables the host processor to configure the bypass chip directly. A register-level sketch is given below.

The presentation module shown in the figure was allowed for in the data structures but was not fully designed.

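As one possible view of this I/O-mapped interface from host software, the C fragment below declares a hypothetical register block. The field names, their grouping and the base address are illustrative assumptions inferred from the commands and parameters of Section 4.4 and Figure 4; they are not the chip's actual register map.

    #include <stdint.h>

    /* Hypothetical view of the ROPE control registers as seen by the host
     * through I/O-mapped addresses (volatile: the chip may change them
     * asynchronously). */
    struct rope_regs {
        volatile uint32_t command;        /* BYPASS_START / _DMA / _SYNC / _RESTART */
        volatile uint32_t status;         /* chip-full flag, completion bits        */
        volatile uint32_t dma_start_addr; /* host address of the PDU to fetch       */
        volatile uint32_t dma_length;     /* PDU length in octets                   */
        volatile uint32_t conn_select;    /* which on-chip control block (0..4)     */
        volatile uint32_t seq_number;     /* mirrored connection state              */
        volatile uint32_t lower_window;
        volatile uint32_t upper_window;
        volatile uint32_t dst_ref;
    };

    /* The host maps the chip's register window at some bus address
     * (illustrative constant) and configures the chip directly. */
    #define ROPE_REG_BASE 0x40000000u
    #define ROPE_REGS ((struct rope_regs *)ROPE_REG_BASE)
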
4.3 First Design: Design Steps

Figure 3 shows the steps followed in this study. There were three stages: a behavioural model, a structural or RTL model, and a gate level design. These gave us two kinds of feasibility check: that the logic we specified will execute the protocol within the environment we envisage, and that the design is technically feasible, for instance in a reasonable chip area.

A VHDL behavioural model for the system was initially written and tested. It includes the bypass chip, the host processor with a simplified bus/memory architecture, and a vestigial high-speed network interface adapter which acts simply as an infinite source/sink for data packets. The first design implemented the Basic Combined Subset (BCS) of the Session layer and the Transport protocol class 2 (TP2) protocols. Only the logic for the data transfer phase is included in this model, as it is during this phase that the bypass chip is active. A host processor submodel generates a simple test sequence that initiates bypass connections and a series of bypassable data packets. Non-bypassable packets are also inserted, to simulate for example the arrival of a connection release PDU. The test sequence allowed us to verify the following:

□ Window flow control logic.
□ Generation of the header field, including the sequence number.
□ Generation of acknowledgment packets on receive.
□ Synchronization between the SPS and the bypass stack.

Next, the behavioural model of the bypass chip was manually converted to a structural (RTL) model for synthesis, leaving the other components as behavioural constructs. A behavioural description has no implied architecture in its representation, while an RTL description has a definite architecture and clocking scheme, and characterizes the system in terms of registers, switches (multiplexors), and operations. An initial assignment of operations to clock cycles is also made. These descriptions, like behavioural descriptions, are technology-independent (silicon library independent). In a VHDL RTL description, not all of the functionality of VHDL can be used, because some of the language features do not map into the RTL model. For example, a WAIT FOR statement has no meaning in hardware, but is useful in behavioural simulation. Features that can be easily mapped to hardware include IF THEN ELSE statements and signal assignments. The same test sequence was again generated by the host processor, and its results were compared with the original model to ensure behavioural consistency.

The structural model was then passed through the SYNOPSYS synthesis tool [31] for gate level generation with the 0.8 µm BiCMOS macro library from Texas Instruments [32]. The timing information generated by this process was back-annotated to the structural model to obtain the simulation throughput results presented in Section 5. This was sufficient for our present purposes, to estimate the space complexity and timing of the chip, and the final step of generating a chip layout for fabrication and fault analysis was not performed. As the model was too large for the SYNOPSYS package we were using, it was divided into 3 submodels. The dual-ported SRAM was not synthesized, as its gate count and performance characteristics can be easily extracted from data books.

The second design, with additional functionality, is described in the next section.

Figure 3 Design flow diagram: the behavioural model and its simulations are manually transformed into the structural (RTL) model, which is synthesized to gates; timings are back-annotated to obtain the throughput results.

4.4 Behavioural description

The sequence of operation in the bypass system is summarized as follows:

1) Four high level procedures are made available to the host processor to control the bypass chip, namely: BYPASS_START, BYPASS_DMA, BYPASS_SYNC and BYPASS_RESTART. On receiving the first bypassable PDU, the BYPASS_START procedure is called. This procedure sets up a bypassable connection by sending information such as its initial window flow control parameters and the DST_REF field to the bypass chip. These are stored in a process control block for the particular connection, in fixed on-chip memory locations, and are also accessible by the host (I/O mapped). The maximum number of connections allowed for simultaneous bypassing is equal to the number of these control blocks allocated in the bypass chip (5 in this study; see Figure 4).

2) For subsequent bypassable packets, the host processor initiates the BYPASS_DMA procedure, which checks for free buffer space in the bypass chip and programs the DMA by sending the starting address pointer where the PDU is located, and its total length. The destination address is supplied by the bypass chip. Arbitration for the host processor bus between the host and the DMA is provided by the DMAreq and DMAack lines. The DMA transfers the PDU into the internal dual-ported SRAM (Static RAM). Buffers are pre-allocated in fixed sizes and are accessed by a simple round robin scheme using a set of buffer pointers (a host-side sketch of this command sequence is given after this list).

3) The protocol engine polls the status field of a packet to check if there are any data to be processed. If the status is FILLED, protocol processing of the bypass stack can proceed. The dual-ported structure allows protocol processing to proceed concurrently with any DMA transfer to/from the host processor. A precomputed header template is used to construct the header field very quickly.

4) Processed in-sequence packets within the transmit window are passed to the network interface adapter, which in this study acted as an instantaneous sink.

5) Whenever the host processor encounters a switch in the processing path, i.e. from the bypass stack to the SPS, it issues a BYPASS_SYNC procedure. This procedure flushes the bypass chip of any "in-transit PDU" for that particular connection and returns any updated information, for example window control parameters, from the bypass chip to the host, in order to maintain state consistency between the two paths. This may occur when the host receives, for example, a connection release primitive or a session control primitive of the session layer (see Section 6) during the data transfer phase.

6) Whenever the host processor wishes to re-enter the bypass path after a switch in the processing path, it issues a BYPASS_RESTART procedure, which passes only those data that were updated in the standard protocol stack, such as window flow control parameters, back to the bypass chip. Parameters like the DST_REF field, which does not change for the duration of the connection, need not be updated.

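The C pseudo-driver below strings the four procedures together as a host-side send path, following the steps just listed. It is a sketch only: the function signatures, command codes and the dma_desc layout are assumptions, and the real interface is defined by the chip's register map rather than by this code.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Illustrative command codes and entry points. */
    enum rope_cmd { BYPASS_START, BYPASS_DMA, BYPASS_SYNC, BYPASS_RESTART };

    bool send_bypass_test(const void *pdu, size_t len);            /* host software  */
    void rope_issue(enum rope_cmd cmd, int conn, const void *arg); /* I/O write      */
    void sps_send(const void *pdu, size_t len);                    /* standard stack */

    struct dma_desc { uint32_t host_addr; uint32_t length; };

    void host_send(int conn, const void *pdu, size_t len, bool first_pdu)
    {
        if (!send_bypass_test(pdu, len)) {
            /* Non-bypassable packet: drain the chip, then use the SPS. */
            rope_issue(BYPASS_SYNC, conn, NULL);    /* step 5: flush, copy state back */
            sps_send(pdu, len);
            rope_issue(BYPASS_RESTART, conn, NULL); /* step 6: push updated state down */
            return;
        }
        if (first_pdu)
            rope_issue(BYPASS_START, conn, NULL);   /* step 1: set up control block */

        /* Step 2: program the DMA with the PDU's host address and length;
         * the chip supplies the destination buffer and performs the transfer. */
        struct dma_desc d = { (uint32_t)(uintptr_t)pdu, (uint32_t)len };
        rope_issue(BYPASS_DMA, conn, &d);
    }
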
4.5 Second Design, including major procedures for Transport Class 4 (Implemented)

This section describes extensions to the first design, which only supports Session BCS and TP2 functionality, to include some common TP4 functionality. Procedures for checksum, retransmission on timeout and resequencing were implemented. Extensions to the Session layer functionality and procedures for Presentation layer conversion were not implemented, but are discussed in Section 6.

4.5.1 OSI Checksum

The transport protocol class 4 checksum algorithm [12] was implemented. It is often difficult to perform it on the fly at the sender end, as the two-byte checksum field is placed in the variable part of the header. However, with the bypass system, the header structure and the position of the checksum field are known in advance. Also, calculation of the checksum can be simplified further by precomputing the partial checksum of the header fields.

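The TP4 checksum of [12] is a Fletcher-style pair of running sums taken modulo 255. The C sketch below shows the two sums, the receive-side acceptance test, and how a partial checksum precomputed over the fixed header prefix can be combined with the sums over the data, which is the simplification the known header layout makes possible. The function names are illustrative, and the combination rule follows directly from the checksum recurrences.

    #include <stddef.h>
    #include <stdint.h>

    /* Fletcher-style running sums used by the TP4 checksum (modulo 255). */
    struct osi_sum { uint32_t c0, c1; };

    static struct osi_sum osi_sum_bytes(const uint8_t *p, size_t n)
    {
        struct osi_sum s = { 0, 0 };
        for (size_t i = 0; i < n; i++) {
            s.c0 = (s.c0 + p[i]) % 255;
            s.c1 = (s.c1 + s.c0) % 255;
        }
        return s;
    }

    /* Combine a checksum precomputed over a fixed header prefix H with the
     * sums over the data D of length n that follows it:
     *   c0 = c0H + c0D,   c1 = c1H + n*c0H + c1D   (all modulo 255). */
    static struct osi_sum osi_sum_concat(struct osi_sum hdr,
                                         struct osi_sum data, size_t data_len)
    {
        struct osi_sum s;
        s.c0 = (hdr.c0 + data.c0) % 255;
        s.c1 = (hdr.c1 + (uint32_t)(data_len % 255) * hdr.c0 + data.c1) % 255;
        return s;
    }

    /* On receive, the TPDU checks out when both sums over the whole TPDU,
     * including the checksum octets, come to zero. */
    static int osi_checksum_ok(const uint8_t *tpdu, size_t len)
    {
        struct osi_sum s = osi_sum_bytes(tpdu, len);
        return s.c0 == 0 && s.c1 == 0;
    }
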
4.5.2 Timers

During the data transfer phase TP4 uses a Retransmission timer (T1), a Window timer (W) and an Inactivity timer (I). Only the retransmission timer, with one interval per connection, and the window timer were implemented here.

Timer management in software is an expensive process, due to software interrupt handling overhead and update processing of the timer queue [34], but it can be easily handled by a VLSI implementation. On-chip timers are very efficient and can run concurrently with other protocol processes. The only overhead is in starting and stopping the timers. Once started, a timer is an autonomous process until an interrupt signal is activated. A separate area of on-chip memory is reserved to store the state information of the timer list.

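Functionally, each on-chip timer reduces to a per-connection down-counter driven by a clock tick, as in the C sketch below; the tick granularity, the field names and the callback interface are assumptions made for illustration.

    #include <stdint.h>

    #define MAX_CONN 5   /* number of on-chip control blocks in this study */

    struct conn_timers {
        uint32_t t1_ticks;  /* retransmission timer T1: 0 = stopped */
        uint32_t w_ticks;   /* window timer W:          0 = stopped */
    };

    static struct conn_timers timers[MAX_CONN];

    /* Called once per timer clock tick; expiry raises an interrupt-like
     * event toward the protocol engine. */
    void timer_tick(void (*on_t1_expiry)(int conn), void (*on_w_expiry)(int conn))
    {
        for (int c = 0; c < MAX_CONN; c++) {
            if (timers[c].t1_ticks && --timers[c].t1_ticks == 0)
                on_t1_expiry(c);   /* trigger Go-back-N retransmission */
            if (timers[c].w_ticks && --timers[c].w_ticks == 0)
                on_w_expiry(c);    /* send a window/ACK update */
        }
    }

    /* Starting and stopping are the only interactions the protocol engine
     * needs, which is why the overhead is low. */
    void t1_start(int c, uint32_t ticks) { timers[c].t1_ticks = ticks; }
    void t1_stop(int c)                  { timers[c].t1_ticks = 0; }
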
4.5.3 Retransmission and Resequencing

At the receiver end, out-of-sequence PDUs outside the flow-control window are discarded. Otherwise, a PDU is buffered for resequencing.
Duplicate TPDUs can be detected easily because the sequence number matches that of a previously received TPDU. At the sender end, if timer T1 expires, the transport entity can retransmit either the first TPDU or all TPDUs waiting for acknowledgment (Go-back-N). In this design the Go-back-N retransmission strategy was used. For a large window, the on-chip buffer may not be sufficient to hold the unacknowledged data packets for retransmission or to buffer data packets for resequencing, and slower external memory would be needed.

Figure 4 Organization of internal bypass chip memory (32 bits wide). Fields shown include: DMA Start Address, DMA Length, Bypass chip FULL and Host Tag; per-connection control blocks (Control Block 1 ... Control Block N) containing Sequence Number, Lower Window, Upper Window, DST-REF field, Class, Format Type, Options, Presentation Context_ID and a Block Address pointer; and packet buffers containing STATUS, Block Address Pointer, Protocol Header, Header Pointer, Data Pointer, reserved fields and buffer space for data.

Host Tag: this tag is set on receipt of a host command, e.g. BYPASS_START, BYPASS_DMA, BYPASS_SYNC or BYPASS_RESTART.

STATUS: indicates the status of the buffer, e.g. EMPTY, FILLING, FILLED or CLOSED.

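Tying the figure to the retransmission procedure, the C sketch below declares structures resembling the control block and packet buffer of Figure 4 and shows the receive acceptance test and Go-back-N retransmission on T1 expiry. The field names, the buffer sizing and the helper nia_transmit are illustrative assumptions, not the chip's actual memory map.

    #include <stdbool.h>
    #include <stdint.h>

    enum buf_status { EMPTY, FILLING, FILLED, CLOSED };

    struct pkt_buf {                 /* one fixed-size on-chip buffer */
        enum buf_status status;
        uint32_t seq;                /* sequence number of the TPDU held here */
        uint16_t hdr_len, data_len;
        uint8_t  bytes[1024];        /* 1 Kbyte buffers in this study */
    };

    struct conn_block {              /* per-connection control block (cf. Figure 4) */
        uint16_t dst_ref;
        uint32_t snd_una;            /* lower window edge: oldest unacked seq */
        uint32_t snd_nxt;            /* next sequence number to send          */
        uint32_t rcv_nxt;            /* next in-order seq expected on receive */
        uint32_t upper_window;       /* upper edge granted by the peer        */
        struct pkt_buf *retx[64];    /* unacked TPDUs, window of 64           */
    };

    void nia_transmit(const struct pkt_buf *b);   /* hand a TPDU to the NIA */

    /* Receive side: discard duplicates and out-of-window TPDUs; in-window
     * TPDUs are delivered in order or buffered for resequencing. */
    bool rx_accept(const struct conn_block *c, uint32_t seq)
    {
        if (seq < c->rcv_nxt || seq >= c->rcv_nxt + 64)
            return false;            /* duplicate or outside the window */
        return true;                 /* in window: deliver or resequence */
    }

    /* Sender side, on T1 expiry: Go-back-N resends every TPDU still
     * awaiting acknowledgment, oldest first. */
    void on_t1_expiry(struct conn_block *c)
    {
        for (uint32_t s = c->snd_una; s < c->snd_nxt; s++) {
            struct pkt_buf *b = c->retx[s % 64];
            if (b && b->status == FILLED)
                nia_transmit(b);
        }
    }
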
Table 2 Throughput performance and gate count of the bypass VLSI chip (areas in equivalent NAND2 gates; throughput measured with 1 Kbyte packet length)

    Session (BCS) / Transport Class 2: combinational area 6318, non-combinational area 2929, total area 9247, throughput 2,362.8 Mbps.
    Dual-ported SRAM (4 Kbyte): total area approximately 41,984; combinational/non-combinational split and throughput not applicable.
    Session (BCS) / Transport Class 4 with the addition of the checksum procedure and timer circuitry: combinational area 8334, non-combinational area 4227, total area 12,561, throughput 313.3 Mbps.

5 Results

The timing information obtained from the netlist was back-annotated to the structural model to obtain throughput performance results. The operating parameters and assumptions made in this study are:

□ A chip clock rate of 66 MHz.
□ 1 Kbyte data packets.
□ Window size of 64. An acknowledgment packet is sent for every 20 packets received.
□ The host bus/memory subsystem and network interface adapter were assumed to be infinite sources/sinks of data packets.
□ One thousand packets were processed for each iteration.
□ 4 Kbyte of internal dual-ported SRAM.

Table 2 shows the throughput performance of the ROPE chip. The throughput value includes the time taken to move the data packet out from the internal memory of the bypass chip to the network interface adapter, but not the data copy operation from the host memory. Hence the throughput result includes just one copy operation of the data packet. The total gate count for the bypass chip with Session (BCS), TP2 and 4 Kbyte of internal dual-ported SRAM is 51,231 equivalent NAND2 gates. With the additional TP4 procedures, such as the checksum and timer circuitry, the total gate count increased to 54,545 gates. Texas Instruments offers a
gate array package with 112,000 usable gates (TGB 1150). This leaves enough chip area for extra on-chip memory or hardwired presentation procedures. With no per-octet operations, i.e. just protocol processing of packet headers (TP2), the bypass stack could effectively achieve a throughput of 2.362 Gbps. This is consistent with results presented by Clark [5], which claim that TCP processing can achieve up to 800 Mbps (without checksum) on current RISC-based processors. With the OSI checksum, the throughput performance drops to 313.3 Mbps. This is still more than an order of magnitude higher than the equivalent 19.2 Mbps of an optimized, hand-assembled software implementation of the OSI checksum algorithm on the VAX 8800, reported by Sklower [29].

6 Considerations for the Session and Presentation layers (not implemented)

The next steps would logically be to enhance the Session [18, 19] processing to the BSS (Basic Synchronized Subset) or BAS (Basic Activity Subset) level, or to add Presentation processing.

In the BSS, synchronization points occur but the session services do not actually save session SDUs and do not themselves perform the recovery operations. These activities are the responsibility of the application layer or the application program. The session layer merely decrements the serial number back to the synchronization point, and the user must apply it to determine where to begin recovery procedures. Hence these activities are best handled on the host processor and are not very suitable for bypassing. The benefit they would confer (if bypassed) would be to reduce the frequency of switching paths.

Presentation processing can definitely be bypassed. During the data transfer phase, it consists only of the Presentation data encoding/decoding functions. Substantial performance gains could result if the presentation conversions are simple and are used consistently, although the inflexibility of a hardware version is an evident weakness. One possible application of ROPE with hardwired presentation conversion is in video servers using proposed encoding standards such as MPEG [13].

7 Summary

It can be concluded from this study that it is feasible to implement the bypass stack (at least for the Transport and Session layers) in VLSI, and that the performance would be at least an order of magnitude higher than software protocol processing. The bypass system offloads the critical protocol functions and the associated non-protocol-specific functions onto a "Reduced Operation Protocol Engine" (ROPE). The gate count for the bypass chip can easily fit into a commercially available gate array integrated circuit. Per-octet operations are particularly efficient when performed on the chip. The host processor is relieved of a significant proportion of protocol processing and can concentrate on the application processing. The speed of communication processing in the host system can now match the transmission bandwidth of high-speed networks.