The Architecture of a Gb/s
Multimedia Protocol Adapter
Erich Rutsche
IBM Research Division,
Zurich Research Laboratory
Säumerstrasse 4, 8803 Rüschlikon, Switzerland
In this paper a new multiprocessor-based communication adapter is presented. The adapter architecture supports isochronous multimedia traffic and asynchronous data traffic by handling them separately. The adapter architecture and its components are explained, and the protocol processing performance for TCP/IP and for ST-II is evaluated. The architecture supports the processing of ST-II at the network speed of 622 Mb/s. The calculated performance for TCP/IP is more than 30000 segments/s. The architecture can be extended to protocol processing at one Gb/s.
Keywords: Multimedia Communication Subsystems; Network Protocols; Parallel Protocol Processing
1. Introduction
As data transmission speeds have increased dramatically in recent years, the processing of protocols has become one of the major bottlenecks in data communications. Current experimental networks provide a bandwidth in the Gb/s range. New multimedia applications require that networks guarantee the quality of service of bulk data streams for video or HDTV. The protocol processing bottleneck has been overcome by dedicated communication subsystems which off-load protocol processing from the workstation. Many of the communication subsystems proposed in the literature are multiprocessor architectures [Braun 92, Jain 90, Steenkiste 92, Wicki 90]. In this paper we present a new multiprocessor communication subsystem architecture, the Multimedia Protocol Adapter (MPA), which is based on the experience with the Parallel Protocol Engine (PPE) [Kaiserswerth 92] and is designed to connect to a 622 Mb/s ATM network. The MPA architecture exploits the inherent parallelism between the transmitter and receiver parts of a protocol and provides support for the handling of new multimedia protocols.
The goal of this architecture is to speed up the handling of multiple protocol stacks and of multimedia protocols such as the Internet Stream Protocol (ST-II) [Topolcic 90]. Multimedia traffic often requires isochronous transmission, in contrast to conventional asynchronous traffic for file transfer or for remote procedure call. To guarantee the isochronous processing of multimedia data streams, the asynchronous and isochronous traffic are handled separately. A Header Parser scans incoming packets, detects the header fields and extracts the header information. This information is used to separate isochronous and asynchronous traffic and to split the header and the data portions of a packet. Dedicated header and data memories are used to store the header and data portions of a packet. The separation of receiver and transmitter and the dedicated memories decrease memory contention.
Computer Communication Review


In Section 2 the concepts of the MPA architecture are presented. Section 3 explains protocol processing on the MPA. In Section 4 the performance of the MPA is evaluated by adapting the measurements of our TCP/IP implementation on the PPE to the MPA architecture. The last section gives the conclusions.
2. Architecture
The architecture of the MPA is based on our experiments with the PPE [Kaiserswerth 92], [Rutsche 92]. The PPE is a four-processor system based on the transputer T425 with a network interface running at 120 Mb/s. On each of the separate transmit and receive sides, two processors, the host system and the network interface use a shared memory for storing and processing protocol data. The transmit and receive sides are connected only via serial transputer links.
2.1 Concept
The protocol processing requirements of multimedia protocols are very different from the requirements of traditional transport protocols. Isochronous multimedia traffic may require the processing of bulk data streams with low delay and low jitter but may accept bit errors or packet loss. Asynchronous traffic, such as file transfer or remote procedure call, requires more moderate throughput but tolerates no errors. In a file transfer between a file server and a client the throughput is limited by the I/O bus and the disk speed. Errors in the data are not acceptable, whereas a bit error in uncompressed video is not visible.
To guarantee the requirements of multimedia connections, the processing of multimedia data must be separated from the processing of asynchronous data. A Header Parser detects the connection to which an incoming packet belongs. Multimedia packets are then forwarded to dedicated multimedia devices while other packets go through normal protocol processing.
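The separation described above can be pictured with a small software model. This is only a sketch under stated assumptions: the class `Demux`, its connection table and the two queues are hypothetical names introduced for illustration, and in the MPA this decision is made by the Header Parser in hardware, not by code like this.

```python
from dataclasses import dataclass, field


@dataclass
class Demux:
    """Hypothetical software model of the traffic separation step."""
    # Assumed table: connection id -> True if the connection is isochronous.
    isochronous: dict = field(default_factory=dict)
    # Stand-in for a dedicated multimedia device interface.
    multimedia_fifo: list = field(default_factory=list)
    # Stand-in for the path into normal protocol processing.
    protocol_queue: list = field(default_factory=list)

    def dispatch(self, conn_id, payload):
        """Route a packet according to the connection it belongs to."""
        if self.isochronous.get(conn_id, False):
            self.multimedia_fifo.append(payload)
            return "multimedia"
        # Unknown or asynchronous connections take the normal path.
        self.protocol_queue.append((conn_id, payload))
        return "protocol"
```

For example, with connection 7 registered as isochronous, `Demux({7: True}).dispatch(7, b"video")` returns `"multimedia"`, while a packet on any other connection is queued for protocol processing.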
Protocol processing must be done in software to handle a multitude of protocols. Only functions that are common to all or most of the protocols are implemented in hardware. The MAC layer for ATM and the ATM Adaptation Layer (AAL) must be implemented in hardware or firmware to achieve the full network bandwidth of 622 Mb/s.
Our measurements of TCP/IP on the PPE have shown that the processors were not equally loaded because of the different processing requirements of the protocol layers and because of the very high costs of the memory operations [Rutsche 92]. The loose coupling via serial links between the receive and transmit parts had only a minor impact on the performance. An optimal speedup of 1.7 was calculated for two processors. Therefore we chose a two-processor architecture for the MPA. One processor on the transmit side is connected via serial links to one processor on the receive side. The processors are supported by an intelligent Direct Memory Access Unit and dedicated devices for header parsing and checksumming. The memory of both parts is split into a header memory and a data memory to lower memory contention. The two halves of the MPA are connected only by serial message links.
2.2 Main Building Blocks
The MPA is split into two parts, a receiver and a transmitter, as shown in Figure 1. The various components and their function are presented in the following.


Figure 1. MPA Architecture (block diagram; on both the receive and the transmit side: Media Access Control Unit (MACU), Header Parser (HP), Checksum Generator (CG), DMA Unit (DMAU), Bus Controller (BC) and Workstation Bus, connected by data and control paths)
Media Access Control Unit (MACU): The MPA is designed to be connected to any high-speed network. The design of the MAC is beyond the scope of this paper. [Traw 91], for example, describes an interface to the Aurora ATM network.
Header Parser (HP): The HP is similar to the ProtoParser¹ [Chin 92]. The HP detects on the fly the protocol type of an incoming packet and extracts the relevant header information. This information is forwarded to the DMA Unit and the Checksum Generator.
Checksum Generator (CG): The CG is triggered by the HP to calculate the appropriate checksum or Cyclic Redundancy Check (CRC) for the packet on the fly. The algorithms are implemented in hardware and selected by decoding the HP signal. On the sender side the CG is triggered by the DMA Unit. [Birch 92], for example, describes a programmable CRC generator which is capable of processing 800 Mb/s.
The Protocol Processor T9000: The selection of the inmos² T9000 [inmos 91] is based on our good experience with the transputer family of processors in the PPE. The most significant improvements of the T9000 over the T425 for protocol processing are faster programmable link interfaces, a faster memory interface, and a cache. The serial message passing link provides a transmission speed of 100 Mb/s plus a set of instructions to use the links for control purposes. The peek and poke instructions issue read and write operations in the address space of the second transputer connected to the other end of the link. These commands allow distributed 'shared memory' between transputers. Two transputers may allocate a block of memory at identical physical addresses in their local memory. Whenever a value is written into the local copy of the data structure, the address of the variable and its value are also sent via a control link to the second transputer.
¹ ProtoParser is a trademark of Protocol Engines, Inc.
² inmos is a trademark of INMOS Limited.
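The write-through behaviour of this distributed 'shared memory' can be illustrated with a toy model. It is a sketch only: the class `Transputer`, the helper `connect` and the method names are hypothetical, memory is modelled as a Python list, and the serial control link is reduced to a direct call. It is not transputer code and says nothing about the real instruction encoding.

```python
class Transputer:
    """Toy model of one transputer with local memory and one control link."""

    def __init__(self, size):
        self.memory = [0] * size   # local memory, word addressed
        self.peer = None           # transputer at the other end of the link

    def poke(self, address, value):
        """Write into the address space of the peer (models the poke call)."""
        self.peer.memory[address] = value

    def peek(self, address):
        """Read from the address space of the peer (models the peek call)."""
        return self.peer.memory[address]

    def shared_write(self, address, value):
        """Update the local copy, then mirror address and value to the peer."""
        self.memory[address] = value
        self.poke(address, value)


def connect(a, b):
    """Join two transputers by a (modelled) control link."""
    a.peer, b.peer = b, a
```

Because both sides keep the block at the same address, a value written with `shared_write` on the transmit side can afterwards be read locally at that address on the receive side.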
The Memories: The memory is split into dedicated parts for each flow of data through the MPA to lower memory contention and to provide high bandwidth to those components that access the memory most. The following memory split is used:
Header memory: stores the protocol headers. Fast static memory operating at cache speed is used to avoid wait cycles.
Data memory: stores the data part of the packets. Inexpensive video memory (VRAM) is used. The serial port of the VRAM provides guaranteed access via the DMA Unit to the network. The parallel port of the VRAM is used in normal processing by the Bus Controller only. The processor can access the parallel port, e.g. for exception handling.
Local memory: stores the program code of the processor and the control information of the connections.
Multimedia FIFO: stores multimedia data and is the interface to a multimedia device. It can be controlled by the processor for synchronization with asynchronous data streams. Multiple multimedia FIFOs can be arranged in parallel.
The design does not employ physically shared memory between the transmitter and the receiver, because the implementation costs are too high compared to a software implementation using transputer links.
Table 1. Memory Access Time

Memory Access: Processor to    Memory Type     Average Access Time
Data Memory                    Video RAM       60 ns
Header Memory                  Static RAM      30 ns
Local Memory                   Dynamic RAM     60 ns
Direct Memory Access Unit (DMAU): The DMAU directs the incoming and outgoing data streams to the correct destination. The DMAU splits an incoming packet into its header and data part and moves the parts to the respective memories. A pointer to the header structure is written to the receive queue. To send a packet the DMAU gathers the data from the data memory and the header from the header memory. For multimedia traffic the data are gathered from the multimedia FIFO. The memory buffers are handled in a linked list format. The DMAU handles this linked list in hardware and thereby off-loads part of the memory management from the protocol processor.
Bus Controller (BC): The BC is a programmable busmaster DMA controller. It provides a small FIFO and a table for DMA requests. The FIFO contains a pointer to the linked list of source data and a connection identifier. The BC determines the destination memory address through the connection identifier in the table. The list format is the same for the BC and the DMAU. In the transmit BC the host writes to the FIFO and the protocol processor to the table. In the receive BC the protocol processor writes to the FIFO and the host to the table.
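The interplay of the request FIFO and the destination table can be sketched in software. This is an illustration under assumptions, not the BC design: the class `BusController` and its method names are hypothetical, the linked list of source buffers is modelled as a Python list of byte strings, and the destination memory is a `bytearray`.

```python
from collections import deque


class BusController:
    """Toy model of the BC: a request FIFO plus a per-connection table."""

    def __init__(self):
        self.fifo = deque()   # entries: (source buffer list, connection id)
        self.table = {}       # connection id -> destination address

    def set_destination(self, conn_id, dest_addr):
        """Table write (host on the receive BC, processor on the transmit BC)."""
        self.table[conn_id] = dest_addr

    def request(self, buffers, conn_id):
        """FIFO write: queue one copy request for a list of source buffers."""
        self.fifo.append((buffers, conn_id))

    def run_one(self, memory):
        """Serve the oldest request; return the address after the last byte."""
        buffers, conn_id = self.fifo.popleft()
        addr = self.table[conn_id]          # destination looked up by connection
        for chunk in buffers:               # walk the buffer list in order
            memory[addr:addr + len(chunk)] = chunk
            addr += len(chunk)
        return addr
```

Splitting the request (FIFO) from the destination (table) lets two different parties, the host and the protocol processor, each supply their half of a transfer independently, which is the division of writers the paragraph above describes.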
2.3 Packet Processing
Packets are processed in a hardware pipeline which runs at network speed. The pipelined packet processing is shown in Figure 2.
Receiver
The MACU receives cells from the ATM network, processes the AAL, and triggers the receive pipeline to start. The receive pipeline is run by the DMAU. The HP and the CG process the data as they are copied from the MACU to the destination address in the memories or to the multimedia FIFO. The HP extracts the relevant header information from the packet and forwards the information to the DMAU and the CG. The CG uses this information to detect which checksum or CRC it must calculate. The CG calculates the checksum on the fly as the packet is copied by the DMAU and forwards the result to the DMAU. The DMAU uses the information generated by the HP to determine the format and the connection of the packet. For a multimedia connection the DMAU removes the header from the packet and writes the data part to the Multimedia FIFO.
Figure 2. Pipelined Packet Processing (transmit pipeline and receive pipeline; recoverable stage labels: parse header, write header, calculate and write checksum)
For asynchronous traffic the DMAU writes a structure to the header memory which holds the header, the header information extracted by the HP, the checksum calculated by the CG, and the pointer to the data in data memory. The data part of the packet is written to the data memory. The DMAU writes a pointer to the header structure to the receive queue. The protocol processor is then responsible for processing the header structure. The addresses of free buffers in header and data memory are obtained from a linked list of free buffers.
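The receive-side bookkeeping for asynchronous traffic can be modelled as follows. This is a sketch under assumptions: `HeaderStructure` and `ReceiveSide` are hypothetical names, buffer "pointers" are small integers, the free lists are `deque` objects rather than hardware-managed linked lists, and the header length, parsed fields and checksum are passed in as arguments where the MPA would obtain them from the HP and the CG.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class HeaderStructure:
    header: bytes    # raw protocol header
    info: dict       # fields extracted by the Header Parser (assumed format)
    checksum: int    # value delivered by the Checksum Generator
    data_ptr: int    # index of the packet's buffer in data memory


class ReceiveSide:
    """Toy model of header memory, data memory, free lists and receive queue."""

    def __init__(self, n_buffers):
        self.free_header = deque(range(n_buffers))   # free header buffers
        self.free_data = deque(range(n_buffers))     # free data buffers
        self.header_memory = {}
        self.data_memory = {}
        self.receive_queue = deque()

    def receive(self, packet, header_len, info, checksum):
        """Split a packet, store both parts, and queue the header pointer."""
        h = self.free_header.popleft()
        d = self.free_data.popleft()
        self.data_memory[d] = packet[header_len:]
        self.header_memory[h] = HeaderStructure(
            packet[:header_len], info, checksum, d)
        self.receive_queue.append(h)   # only the pointer goes into the queue
        return h
```

The protocol processor in this model would pop a pointer from `receive_queue`, inspect only the small `HeaderStructure`, and touch the data buffer through `data_ptr` only when it has to.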


If the HP does not recognize a packet header, the entire packet is written to the data memory. In this case, the protocol processor performs the processing of the packet header in data memory. For a new connection the protocol processor builds up the connection and programs the HP to recognize the header.
Sender
On the transmit side the protocol processor builds the layered protocol header in the header memory. It builds a structure which holds the pointers to the header and to the data, the length of the header and data, and the connection type. This structure is written to the send queue. The DMAU runs the send pipeline. It interprets the structure and forwards the connection type to the CG. The CG calculates the checksum on the fly as the packet is written to the MACU memory. In the MACU the packets are stored to process the AAL and to segment the AAL frame into ATM cells. Once the CG has finished, it writes the checksum to its position in the packet frame and triggers the MACU to send the packet. Once the packet is sent, the DMAU appends the buffers to the corresponding free lists.
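The gather-then-patch order on the send side can be illustrated with a software model. The sketch below assumes the 16-bit ones' complement Internet checksum used by TCP/IP as the selected algorithm and an even checksum offset; the function names `internet_checksum` and `send_frame` and the example offset are hypothetical. In the MPA this work is done by the CG and DMAU in hardware while the data are copied.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum over data (zero padded to even length)."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:                      # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def send_frame(header: bytes, data: bytes, checksum_offset: int) -> bytes:
    """Gather header and data, then write the checksum into its position."""
    frame = bytearray(header + data)        # gather step
    frame[checksum_offset:checksum_offset + 2] = b"\x00\x00"
    csum = internet_checksum(bytes(frame))  # computed over the whole frame
    frame[checksum_offset:checksum_offset + 2] = csum.to_bytes(2, "big")
    return bytes(frame)
```

A receiver that recomputes the checksum over a frame produced this way obtains zero, which is the usual validity test for this checksum.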
3. Protocol Processing
3.1 Transport Protocol Stacks
Transport protocol processing on the MPA is shown in Figure 3, using TCP/IP as the example. The socket layer is split into a lower half serving TCP and an upper half which interfaces to the application. A more detailed description of our parallel TCP/IP implementation can be found in [Rutsche 92].
Sending a packet: The send data are in a buffer allocated on the host. The application creates a socket and establishes a TCP/IP connection. The socket send call triggers the write process, which copies the data to the MPA and gives control over the data to xtask. The xtask process is then responsible for the transmission and possible retransmissions of the data. It builds the TCP packet and forwards the pointer to the packet to ip_send. Here the IP header is placed in front of the TCP segment. Then the pointer to the packet is written to the send queue and the DMAU sends the packet via the MACU to the network.
Receiving a packet: Upon receipt of a packet the DMAU writes the pointer to the packet to the receive queue. ip_demux reads the receive queue, checks the header and, if no error or exception occurred, forwards the packet to tcp_recv, or else to icmp_demux. The tcp_recv process analyzes the TCP header and calls the appropriate handler function for a given protocol state. To send an acknowledgement or a control packet, tcp_recv uses a Remote Procedure Call (RPC) to the transmit side. Correctly received packets are appended to the receive list. rtask forwards the received segments to the application process, which is blocked in the socket receive procedure. This procedure then fills the user buffer with data from the receive list.
3.2 Multimedia Protocols
Multimedia traffic often requires real-time data and continuous data streams. ST-II is a good example of a protocol that supports this type of traffic. After a connection has been set up, the reception of data packets requires only the detection of the connection and the calculation of the header checksum. For sending, the header can be built only once in the header memory and then used for all the data packets of the connection. These functions are done in a hardware pipeline by the HP and the CG (see Figure 2). The DMAU scatters and gathers the header and the data without any interaction of the protocol processor. Therefore real-time processing of ST-II at the network speed of 622 Mb/s is possible. The interaction of the processor is only required to handle the Stream Control Message Protocol (SCMP), which is responsible for creating and keeping most of the state in an ST-II protocol connection.

Figure 3. Parallel TCP/IP Implementation (processes on the transmit side and the receive side, between the application and the DMAU, coupled through virtual shared memory)
4. Performance Estimation
4.1 The Method
The measurements of the TCP/IP implementation on the PPE were used and adapted to the MPA architecture [Rutsche 92]. The execution times of program segments accessing local memory, $T_{localmem}$, and data memory, $T_{datamem}$, are calculated from the execution times on the PPE minus the time saved by the hardware devices replacing software functions. These execution times are multiplied by a speedup factor $S$, which is determined by the memory timing and the faster processor, and summed to get the execution time $T_{MPA}$ on the MPA.
$$T_{MPA} = T_{localmem} \cdot S_{localmem} + T_{datamem} \cdot S_{datamem} \qquad (1)$$
This approach is valid for protocol processing because most operations are memory operations to build a header or to compare header data with expected data in a control block. The control information is built with simple arithmetical and logical operations such as add, multiply, and, or, etc.
4.2 Cost of Basic Operations
The transputer T9000 is downward compatible with the T425 used in the PPE. The main differences are a higher link speed of 100 Mb/s, a sustained performance of more than 70 MIPS and a peak rate of 200 MIPS. A function call or process switch costs less than 1 µs. The sustained MIPS rate improves the performance at least seven times for the simple protocol processing operations. The memory functions are determined by the memory access time shown in Table 1. The access time to the header memory decreases by a factor of 19, the access time to the data memory by a factor of 9.5. Therefore the full power of the processor can be utilized, and the typical speedup factor of 1/10 [inmos 91] can be assumed for $S_{datamem}$. The speedup factor to the local memory is determined by the processor speedup, because the local memory and the cache provide an optimal memory interface to the processor. We assume a conservative speedup factor of $S_{localmem} = 1/7$ for the local memory.
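As a worked illustration of the estimation method with the two factors stated above, the following sketch evaluates the sum of the scaled local-memory and data-memory times. The function name `t_mpa` and the input times in the usage note are hypothetical and chosen only to make the arithmetic easy to follow; they are not measurements from the PPE.

```python
from fractions import Fraction

# Speedup factors as stated in the text (multipliers smaller than one).
S_LOCALMEM = Fraction(1, 7)    # conservative factor for local memory
S_DATAMEM = Fraction(1, 10)    # typical factor assumed for data memory


def t_mpa(t_localmem_us, t_datamem_us):
    """Estimated MPA execution time in microseconds.

    Both inputs are PPE times in microseconds from which the time saved
    by hardware devices has already been subtracted.
    """
    return float(t_localmem_us * S_LOCALMEM + t_datamem_us * S_DATAMEM)
```

With hypothetical inputs of 70 µs of local-memory work and 100 µs of data-memory work, `t_mpa(70, 100)` gives 10 µs + 10 µs = 20.0 µs.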
The costs of basic operations for protocol handling are listed in Table 2. The connection detection and the calculation of a CRC or a checksum are implemented in hardware. These operations run at network speed as the data is clocked in from the MACU. The T9000 improves the implementation of the distributed shared memory, because the peek and poke calls are already implemented in the microcode. Therefore the costs of the distributed 'shared memory' are only the issuing of a peek or poke instruction.
Table 2. Cost of Basic Operations

Processor Operation          Number of Processor Instructions    Estimated Time in ns
Queue read / write           [illegible]                         [illegible]
Linked List add/remove       [illegible]                         [illegible]
Distributed shared memory    [illegible]                         300 + size[word] * 460
Connection detection         0 (implemented in hardware)         network speed
Checksum/CRC calculation     0 (implemented in hardware)         network speed
4.3 TCP/IP Performance
The performance of TCP/IP is evaluated using the measurements of our TCP/IP implementation on the PPE. The TCP stack, the socket layer and a test application run on the MPA. The cost of the individual processes of TCP/IP is calculated using (1). Table 3 lists the execution times on the PPE and the calculated execution times on the MPA. In the PPE implementation of the IP processes 60% of the accesses go to the shared memory, in tcp_send 47% and in tcp_recv 10%. In the MPA architecture all of these accesses are replaced by accesses to the header memory. In the user_task and the ip_intrsvc most processing is replaced by the list handling in the DMAU and the BC. However, the write process is still needed to control the send queue. All copy operations are implemented in the BC. The ip_demux process is supported by the HP, which extracts the header information.


Table 3. Process Execution Times (in µs/packet; the table body is only partly recoverable)
Row groups: Process (Procedure) on Receiver, including user_task (socket_recv/copy); Process (Procedure) on Transmitter; Access to Shared Memory (poke call).
Surviving entries: 31 + 0.545 µs/word; 30 + 0.545 µs/word; 17 + 0.27 µs/word; 18.6 + 2.4 µs/word; 1.7; 4.3; 9.8; 0.3 + 0.46 µs/word.
The TCP/IP process pipeline for bidirectional traffic is shown in Figure 4. The throughput is determined by tcp_recv and ip_demux, which add up to 26.7 µs. The transmitter is less costly than in the PPE implementation because of the faster network speed.
Figure 4. TCP/IP Processing (process pipeline; recoverable stage labels: ip_send, driver_send, ip_demux, write, tcp_snd_data; stage times shown: 1.7, 4.3 and 9.8 µs)
The throughput calculated for unidirectional TCP/IP traffic between two MPA systems is 35720 TCP segments/s. For bidirectional traffic, the throughput is 20290 segments/s, if an acknowledgment packet is sent for every eight packets. The throughput numbers are independent of the packet size because all data copying is done in hardware, overlapped with protocol processing. However, for large packets the network becomes the bottleneck (4 kByte segments would result in more than 1 Gb/s). If we assume a segment size of 1024 bytes, the throughput is 292 Mb/s, which is more than most current workstations can handle.
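The conversion from segment rate to bit rate quoted above can be checked with a few lines of arithmetic. The function names are hypothetical; the segment rate of 35720 segments/s, the segment sizes and the 622 Mb/s network speed are taken from the text, and payload bits only are counted (no header overhead), which is an assumption of this sketch.

```python
NETWORK_MBPS = 622   # network speed of the ATM link named in the paper


def tcp_throughput_mbps(segments_per_s, segment_bytes):
    """Payload throughput in Mb/s for a given segment rate and segment size."""
    return segments_per_s * segment_bytes * 8 / 1e6


def deliverable_mbps(segments_per_s, segment_bytes, network_mbps=NETWORK_MBPS):
    """Throughput after capping at the network speed (network as bottleneck)."""
    return min(tcp_throughput_mbps(segments_per_s, segment_bytes), network_mbps)
```

For 1024-byte segments the first function gives about 292.6 Mb/s, matching the 292 Mb/s figure; for 4096-byte segments it gives about 1170 Mb/s, which exceeds the link rate, so the capped value is 622 Mb/s.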
5. Conclusion
The separation of isochronous and asynchronous traffic permits processing of isochronous multimedia traffic at the network speed. The separate header and data memories provide optimized access to the critical components. The parallelism between the transmitter and receiver is the most suitable form of moderate parallelism to speed up protocol processing and to lower hardware contention.
The hardware components such as the HP, CG and DMAU can be built to process one Gb/s. The MPA could then process multimedia streams at one Gb/s. A multimedia application could, for example, look as follows. The multimedia interface handles 700 Mb/s in hardware. The protocol processors perform transport protocol processing at a throughput of 300 Mb/s and forward the data to the workstation. This split of the bandwidth would make sense, because applications which require reliable transport connections in the Gb/s range do not seem feasible in the near future because of the I/O bottleneck of the workstations. However, transport protocol processing at one Gb/s is already possible with an architecture based on the MPA.
The efficient attachment of the subsystem to the workstation is as yet unsolved. To take advantage of the high bandwidth available on the network and on the MPA, the current workstation hardware and software interfaces must be changed. Designing these interfaces especially for multimedia will be one of the goals of future work.
6. References
[Birch 92] Birch, J., Christensen, L. G., Skov, M., "A programmable 800 Mbit/s CRC check / generator unit for LANs and MANs", Computer Networks and ISDN Systems, Nr. 24, North-Holland, 1992.
[Chin 92] Chin, H. W., Edholm, Ph., Schwaderer, D. W., "Implementing PE-1000 Based Internetworking Nodes, Part 2 of 3", Transfer, Volume 5, Nr. 3, March/April 1992.
[Braun 92] Braun, T., Zitterbart, M., "Parallel Transport System Design", IFIP Conference on High Performance Networking, Liege (Belgium), 1992.
[inmos 91] "The T9000 Transputer Products Overview Manual", inmos, 1991.
[Jain 90] Jain, N., Schwartz, M., Bashkow, T. R., "Transport Protocol Processing at GBPS Rates", Proceedings of the SIGCOMM '90 Symposium, Sept. 1990.
[Kaiserswerth 92] Kaiserswerth, M., "The Parallel Protocol Engine", IBM Research Report, RZ 2298 (#77818), March 1992.
[Rutsche 92] Rutsche, E., Kaiserswerth, M., "TCP/IP on the Parallel Protocol Engine", Proceedings, IFIP Conference on High Performance Networking, Liege (Belgium), Dec. 1992.
[Topolcic 90] Topolcic, C. (Editor), "Experimental Internet Stream Protocol, Version 2 (ST-II)", RFC 1190, Oct. 1990.
[Traw 91] Traw, B., Smith, J., "A High-Performance Host Interface for ATM Networks", Proceedings ACM SIGCOMM '91, Zurich, Switzerland, Sept. 1991.
[Steenkiste 92] Steenkiste, P., et al., "A Host Interface Architecture for High-Speed Networks", Proceedings IFIP Conference on High Performance Networking, Liege (Belgium), Dec. 1992.
[Wicki 90] Wicki, T., "A Multiprocessor-Based Controller Architecture for High-Speed Communication Protocol Processing", Doctoral Thesis, IBM Research Report, RZ 2053 (#72078), Vol 6, 1990.