SIGCOMM 2003

Proceedings of the
ACM SIGCOMM 2003
Workshops

MoMeTools
RIPQoS
NICELI
FDNA
The Association for Computing Machinery
1515 Broadway
New York, New York 10036

Copyright © 2003 by the Association for Computing Machinery, Inc. (ACM). Permission to make digital or hard copies of portions of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permission to republish from: Publications Dept., ACM, Inc. Fax +1 (212) 869-0481 or <permissions@acm.org>.

For other copying of articles that carry a code at the bottom of the first or last page, copying is permitted provided that the per-copy fee indicated in the code is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923.

Notice to Past Authors of ACM-Published Articles

ACM intends to create a complete electronic archive of all articles and/or other material previously published by ACM. If you have written a work that has been previously published by ACM in any journal or conference proceedings prior to 1978, or any SIG Newsletter at any time, and you do NOT want this work to appear in the ACM Digital Library, please inform permissions@acm.org, stating the title of the work, the author(s), and where and when published.

ISBN: 1-58113-748-6

Additional copies may be ordered prepaid from:

ACM Order Department
PO Box 11405
New York, NY 10286-1405

Phone: 1-800-342-6626 (US and Canada)
+1-212-626-0500 (all other countries)
Fax: +1-212-944-1318
E-mail: acmhelp@acm.org

ACM Order Number 534032
Printed in the USA
ACM SIGCOMM 2003 Workshops

MoMeTools   pages 1-104
RIPQoS      pages 105-158
NICELI      pages 159-242
FDNA        pages 243-339

Proceedings of
ACM SIGCOMM
Workshop on Network-I/O Convergence: Experience, Lessons, Implications (NICELI)

NICELI 2003 Workshop Organizers

Program Co-chairs
Allyn Romanow, Cisco Systems
Jeff Mogul, HP Labs

Program Committee
Stephen Bailey, Sandburst Corporation
Jeff Chase, Duke University
Patrick Crowley, University of Washington
Uri Elzur, Broadcom Corp.
Dirk Grunwald, University of Colorado, Boulder
Liviu Iftode, University of Maryland
Kostas Magoutis, Harvard University
Vijay S. Pai, Rice University
Prasenjit Sarkar, IBM Research
Peter Steenkiste, Carnegie Mellon University

SIGCOMM Liaisons
Craig Partridge, BBN Technologies
Chris Edmondson-Yurkanan, University of Texas at Austin

Program

Invited Talk: David R. Cheriton, Stanford University. Network-I/O Convergence in "Too Fast" Networks: Threats and Countermeasures.

Session 1: Promises and Reality (Session chair: Prasenjit Sarkar, IBM Research)

Server I/O Networks Past, Present, and Future. Renato Recio (IBM) ... 163

On the Elusive Benefits of Protocol Offload. Piyush Shivam, Jeff Chase (Duke University) ... 179

Performance Measurements of a User-Space DAFS Server with a Database Workload. Samuel A. Fineberg, Don Wilson (HP NonStop Labs) ... 185

Invited Talk: Wu-chun Feng, Los Alamos National Laboratory and the Ohio State University. Bridging the Disconnect Between the Network and Large-Scale Scientific Applications.

Session 2: Storage Protocol Designs (Session chair: Jim Pinkerton, Microsoft Corporation)

NFS over RDMA. Brent Callaghan, Theresa Lingutla-Raj, Alex Chiu, Peter Staubach, Omer Asad (Sun Microsystems, Inc.) ... 196

A Study of iSCSI Extensions for RDMA (iSER). Mallikarjun Chadalapaka (HP), Uri Elzur (Broadcom), Michael Ko (IBM Almaden Research Center), Hemal Shah (Intel), Patricia Thaler (Agilent Technologies) ... 209

Session 3: Novel Approaches (Session chair: Jeff Chase, Duke University)

High-Speed I/O: The Operating System As A Signalling Mechanism. Matthew Burnside, Angelos Keromytis (Columbia University) ... 220

Engineering a User-Level TCP for the CLAN Network. Kieran Mansley (University of Cambridge) ... 228

A Case for Virtual Channel Processors. Derek McAuley, Rolf Neugebauer (Intel Research Cambridge) ... 237

Letter from the NICELI 2003 co-chairs

Welcome to the Workshop on Network-I/O Convergence: Experience, Lessons, Implications (NICELI) at SIGCOMM 2003.

This year, for the first time, SIGCOMM has enlarged its scope to include several workshops on hot topics. Our goals in creating the NICELI workshop were to raise research community awareness of current industry work on high-speed I/O; to tease out research topics from the current work; and especially to foresee future issues and developments that are likely with the widespread availability of high-speed NICs. NICELI is an opportunity to consider the implications of convergence between networking and I/O technologies, and to report what we have already learned about it.

The development of technology to address high-speed networking and I/O has been an interesting interplay between research, industry and science. The problem first arose in the context of scientific computing, which has long needed high-speed I/O. The operating systems community has made progress in developing various approaches over the last 10 years. Recently, vendors have pushed hard to commercialize and commoditize high-speed I/O capabilities, especially those based on RDMA over standard networks. With the imminence of IETF standardization and inexpensive hardware, now is a good time to consider what we have learned, what still needs to be done, and where this new approach might lead us.

We received twelve papers, and accepted eight. The quality of the submissions was remarkably high, which allowed us to make our decisions mostly based on which papers best fit the topic of the workshop, rather than based on overall technical quality. We thank all of the authors for their hard work. We used a blind-review process for "technical paper" submissions (author names were unknown to the reviewers), but not for "position papers." Every paper received at least four reviews; most received more. Program Committee members, including the co-chairs, were not allowed to review papers submitted from their own institutions, nor to participate in the discussion of such papers.

We are fortunate in having invited talks from two well-known experts. The first talk is by David Cheriton of Stanford University, who is also the SIGCOMM award recipient this year. David has had a rich history of contribution in networking and distributed systems, and initiated original efforts to standardize RDMA in the IETF. The second talk is by Wu-chun Feng of Los Alamos National Laboratory and The Ohio State University, on his extensive work with networking and I/O for scientific applications.

We thank the Program Committee for their good-natured hard work and responsiveness; the workshop could not have taken shape without their efforts. A number of SIGCOMM volunteers contributed extensively to NICELI. We particularly wish to thank Chris Edmondson-Yurkanan and Craig Partridge, who put much effort into organizing details for the workshops.

We are grateful to HP Labs for helping to sponsor the student travel grants for NICELI and SIGCOMM 2003.

Allyn Romanow
Jeff Mogul

Engineering a User-Level TCP for the CLAN Network

Position paper

Kieran Mansley
Laboratory for Communication Engineering
University of Cambridge
Cambridge, England
kjm25@cam.ac.uk

ABSTRACT

As networks and I/O systems converge and the bandwidth of networks increases, conventional approaches to networking are struggling to deliver the performance and flexibility required. CLAN (Collapsed LAN) is a high performance user-level network targeted at the server room. It supports RDMA and programmed I/O (PIO). We have implemented a set of IP based protocols at user level, and shown how true zero copy transmission (without modifying the sockets API) and reception can be achieved.

In this paper we discuss the problems associated with placing protocol stacks at user level and the architectural decisions required to obtain high performance. We also introduce our work using the network gateway which connects CLAN to the Internet to assist a server cluster in protocol processing.
1. INTRODUCTION

The line speed of local area networks has increased by orders of magnitude in recent years. As it reaches a gigabit per second, the network itself is often no longer the bottleneck in transferring data from one host to another. Instead, the overhead of moving the data between the application and the network [25] and performing protocol processing [23] has become critical.

The overhead of traditional networking is due to a number of factors [11] including copying data between buffers, demultiplexing, interrupts, system calls, and inefficient protocols. These use up CPU cycles that could be doing useful work for applications.

Networks are starting to be used in a number of unconventional ways, and the roles of networks and storage are converging. For example, iSCSI [22] is an emerging standard aimed at Storage Area Networks. It allows SCSI commands to be issued over TCP/IP to remote devices. This presents problems for the conventional structure of operating systems. Each request for data by the application must go through two stacks (filesystem and network) in the kernel and there is a dependence between them.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ACM SIGCOMM 2003 Workshops, August 25-29, 2003, Karlsruhe, Germany.
Copyright 2003 ACM 1-58113-748-6/03/0008 ... $5.00.
This leads to a more complex implementation and reduced performance. To perform an iSCSI operation requires many more CPU operations than an equivalent SCSI one. To address this problem there have been attempts to perform TCP/IP on a processor on the Network Interface Card (NIC), and have it present a SCSI interface to the operating system. This dramatically increases the cost of the network hardware. McAuley and Neugebauer [24] suggest using virtual machines as an alternative to increasing the number of processors.

The recent IETF draft proposals for Remote Direct Data Placement (RDDP) and Remote Direct Memory Access (RDMA) [5] describe another unconventional way of using networks. This style of transfer is becoming increasingly important in high performance networking, and has again motivated the move toward placing processors on NICs. Co-processors (while attractive for specialised operations such as graphics) are being increasingly deployed in I/O systems. Although this removes load from the main CPU it may not be beneficial in the long term as co-processors continue to lag behind the speed of CPUs. Co-processors also add latency to the data path in the NIC.

User-level networking has the potential to address many of these problems. In this style of networking, the application is able to communicate directly with the NIC, bypassing the operating system for the majority of operations. Similarly the NIC is able to deliver received data directly into the application's buffers. We have developed a user-level network which provides an API with similar semantics to the above IETF draft standards. We have also used this network to implement a suite of protocols at user level.

The IP suite of protocols (including TCP, UDP, ICMP) is used for many Internet communications, and it is important that these protocols perform well. Despite a significant amount of research into efficient implementations of user-level protocol stacks [21, 31] there are still many areas that have not been fully resolved.

In this paper we describe how we have addressed the issue of providing efficient protocol implementations for use in innovative networks. In particular we have created a high performance suite of IP based protocols for the CLAN network. Our work is focused on TCP/IP in a server cluster network and bridging this to the Internet. Section 2 gives an overview of the hardware and key software abstractions that the CLAN network provides. In Section 3 we describe the architecture and structure of our user level protocol stack. Section 4 describes our approach to user level reception, while Section 5 deals with user level transmission and introduces how we have developed a system to use the gateway to assist the cluster nodes in protocol processing. Finally we outline how we plan to measure the performance of this setup and highlight future work.
2. THE CLAN NETWORK

CLAN is a low latency, high performance, user-level network. It has a raw bandwidth of 1 Gbps. Its primary targets are the server room and cluster computing.

In the rest of this section we provide a brief introduction to CLAN. In previous work Riddoch et al have published detailed descriptions of the hardware and software support, as well as Tripwire [30] (the synchronisation mechanism), and its use to support a variety of protocols and applications [28] including VI [29].

2.1 Low Level Data Transfer

At the lowest level CLAN is a Distributed Shared Memory (DSM) interface. Regions of memory can be mapped from one host across the network to another, allowing very low latency transfers using standard processor write instructions. In this way it is similar to the SHRIMP [6] network, which uses reflective memory. In addition to Programmed I/O (PIO) the CLAN NIC also provides a DMA engine to allow longer transfers to be offloaded from the CPU.

The format of data packets on the wire is similar to write bursts on a memory bus. The packet consists of a start address in memory, followed by the data to be written to that address. There is no length field. This is an interesting property, particularly when it comes to switching; packets can be arbitrarily split or merged part way through transmission (as you do not need to alter the header). The switch is therefore able to prevent long packets hogging contended ports. Small packets can be combined in the network to reduce overheads (this is particularly likely to happen when congestion occurs, so increasing efficiency and relieving the congestion). It also enables NICs to start transmitting as soon as data is available, without waiting for an entire packet.
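
As a rough illustration of why this framing helps, the sketch below models a write burst as a destination address followed by payload bytes. The struct and function names are our own invention, not the actual CLAN wire encoding; the point is that splitting a burst needs no length-field rewrite, only a new header with an adjusted address.

#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a CLAN-style write burst: a destination
 * address in the receiver's aperture followed by the payload.
 * There is no length field on the wire, so a switch can cut a
 * burst at any point by emitting a fresh header. */
struct clan_burst {
    uint64_t       dest_addr;  /* where the first payload byte lands */
    const uint8_t *data;       /* payload bytes */
    size_t         len;        /* known to the sender, not carried on the wire */
};

/* Split one burst into two at 'offset' bytes; the tail simply starts
 * at a higher destination address, so neither part needs rewriting. */
static void clan_burst_split(const struct clan_burst *in, size_t offset,
                             struct clan_burst *head, struct clan_burst *tail)
{
    *head = *in;
    head->len = offset;

    tail->dest_addr = in->dest_addr + offset;
    tail->data      = in->data + offset;
    tail->len       = in->len - offset;
}

Merging is the inverse check: if one burst's destination address is exactly the end of the previous one, the two can be coalesced into a single burst.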
The prototype network does not provide any security against malicious code writing to apertures that belong to other endpoints. This would be essential in a commercial implementation, and a solution similar to that developed for Hamlyn [37] could be used.

The API for CLAN takes a different approach to other user-level networks [7, 36, 32]. It presents a single, low-level network interface that supports communication with low overhead and latency, high bandwidth, and efficient and flexible synchronisation. More complex interfaces can be built on top of this without considerable additional overhead. It is implemented using simple hardware without on board processors.

Although at the lowest level CLAN is a DSM network, it is not intended that the normal DSM style of communication is used by applications. Instead, the DSM support is used as the base for building higher level communication abstractions.
2.2 Distributed Message Queue

One of these abstractions is the Distributed Message Queue (DMQ) as shown in Figure 1. It is essentially a flow controlled messaging abstraction, and is described here in its simplest form. A DMQ is similar to a circular message queue with two pointers, one to indicate the current read position (read_i), the other to indicate the current write position (write_i). Both the sender and receiver keep a "lazy" copy (in the shared address space) of the pointer they are not responsible for. The buffer for the circular queue physically resides in the memory of the receiver, and the sender has a mapping of it in its own address space. By writing packets to these mappings and updating the queue pointers the two nodes can communicate.

Figure 1: A Distributed Message Queue (host memory, remote aperture, and Tripwire)

To perform transfers in this way requires only a few processor write instructions. As a result it represents very low overhead. The amount of physical memory required for the buffer is also small (around 10KB for full Gbps throughput) due to the low latency.

Synchronisation is performed using Tripwires [30] which provide a low-overhead mechanism for notifying the application of changes to the queue.
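
The following is a minimal sketch of the sender side of such a queue, assuming fixed-size slots for brevity. The names (dmq_tx, lazy_read_i, dmq_send) are ours, and the real CLAN implementation differs in detail (variable-sized messages, Tripwire-based wakeups); the sketch only shows why a send costs a handful of processor writes.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DMQ_SLOTS     64
#define DMQ_SLOT_SIZE 256

/* Hypothetical sender-side view of a Distributed Message Queue.
 * 'slots' and 'write_i' are mappings of memory that physically lives
 * on the receiver; 'lazy_read_i' is the sender's local, possibly
 * stale, copy of the receiver's read pointer, which the receiver
 * updates by writing back across the network. */
struct dmq_tx {
    uint8_t           (*slots)[DMQ_SLOT_SIZE]; /* remote aperture mapping */
    volatile uint32_t  *write_i;               /* receiver's copy of write index */
    volatile uint32_t   lazy_read_i;           /* written remotely by receiver */
    uint32_t            local_write_i;
};

/* Enqueue one message: copy the payload into the next slot, then
 * publish the new write index.  Returns false if the queue looks
 * full (the lazy read pointer may be stale, in which case the sender
 * retries or blocks on a Tripwire). */
static bool dmq_send(struct dmq_tx *q, const void *msg, size_t len)
{
    uint32_t next = (q->local_write_i + 1) % DMQ_SLOTS;

    if (len > DMQ_SLOT_SIZE || next == q->lazy_read_i)
        return false;                 /* full, as far as we know */

    memcpy(q->slots[q->local_write_i], msg, len);
    /* On a real DSM NIC a write barrier is needed here so the payload
     * is visible remotely before the index update. */
    q->local_write_i = next;
    *q->write_i = next;               /* publish to the receiver */
    return true;
}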
This user-level API can also be applied to other server room interconnects. In particular we are working with a Gigabit Ethernet based system called EtherFabric from Level 5 Networks Ltd. [1] This new hardware should allow easier implementation and has the potential for further interesting experiments with the technology described in the rest of this paper.
2.3 CLAN Hardware

We have a prototype hardware implementation consisting of a number of Network Interface Cards (NICs), two 8-port switches, and a bridge (between CLAN and Gigabit Ethernet) in development. The current hardware has some weaknesses (for example the DMA engine only allows a single request at a time, and generates an interrupt after each transfer) but represents a viable platform for research into software support for user-level networking. These weaknesses will hopefully be solved by using the new hardware available from Level 5 Networks.

While the hardware used is proprietary, it is all fabricated using cheap off the shelf components, and as a result would compare favourably to existing Gigabit Ethernet NICs in terms of cost when produced in volume.

The bridge was still under development when AT&T Laboratories Cambridge Ltd closed in April 2002. As a result, it is currently unfinished. To allow bridging experiments to continue we are using an Intel STL2 dual processor server PC equipped with CLAN and Gigabit Ethernet NICs to perform this role. A new version of the NICs designed to run at 3 Gbps was also in the pipeline when the laboratory closed.
All the hardware used by CLAN is simple and lightweight compared to other similarly performing networks. This results in a more scalable implementation. Because there are few on board resources used by each endpoint (Tripwires being the only one) and no on board processor, the hardware itself does not impose as many limits on the number of concurrent connections as other technologies. Co-processors on NICs result in a more complex data path, and as network speeds are currently increasing by orders of magnitude every few years (outstripping the increase in speed of specialised processors) this is likely to become more critical.
3. STACK ARCHITECTURE

Traditional kernel protocol stacks are executed in a different context to the application they are serving. The large overhead associated with context switching is one of the primary factors that motivated the move to user-level networking. However, initial attempts at developing user-level network stacks have used a similar architecture to their kernel ancestors [8]. To ease implementation (many user-level stacks are direct ports of kernel stacks [26]) the protocol processing generally occurs in a separate thread to the application. This has a number of disadvantages. Firstly, although context switches have been exchanged for thread switches, both are considerably more expensive operations than a function call. Secondly, protocol processing is done at some undetermined time after an application issues a request to send or receive data (at the mercy of the scheduler), and this can lead to artificially increased latency. TCP's window size is sensitive to latency, so by acknowledging in a timely manner you will increase the window, and increase the throughput.

Although separate threads for different tasks make dealing with multiple connections, timers, etc, considerably easier, it was decided for the CLAN user-level TCP stack to attempt to do the majority of protocol processing in the same thread as the application.1

1 This approach is becoming popular in other areas of computing where high performance is required. For example omniORB [33] avoids thread switches on the call path by performing all work in the calling thread.

The CLAN TCP/IP suite is based on lwIP [14, 15], a lightweight implementation of IP, TCP and UDP. It is designed for low-memory systems, such as embedded processors. We have heavily modified it to support high performance rather than its design goal of low memory usage. In particular the threading model has been changed. lwIP was chosen for its clean and simple code base which easily adapted to our needs. This has proved considerably easier than taking a higher performance stack (such as the Linux kernel stack) and attempting to re-architect it at user level.

lwIP has a linear model for its threads as shown in Figure 2. There is a thread for the application and sockets interface, a thread for the TCP/IP stack, and a thread for the network interface. Data must pass through each of these threads when either sent or received. It can function in an operating system without thread support, but in this case it cannot use the sockets API and it still requires some external path of execution to call its timer and incoming packet functions at the appropriate times. Because all TCP/IP processing is done by one thread, all accesses to the TCP/IP stack are serialised through the use of message queues, which require semaphores to provide coherency.

Figure 2: lwIP TCP/IP Architecture

For the CLAN TCP/IP suite this has been adapted so that all protocol processing and network card access is performed in the same thread as the application (with the exception of timers, which have a separate thread; we are currently investigating how TCP timers can be more efficiently implemented). As a result, the data path has no thread switches. As each application thread can access the TCP/IP stack directly without having to go through a semaphore controlled message queue there are fewer locking overheads (although some effort had to be expended to ensure the TCP/IP code was thread safe). This has led to an architecture as illustrated in Figure 3 where, rather than the components being arranged in a linear fashion, they are arranged with the CLAN network code acting as a hub. The TCP/IP stack, instead of being the means by which the application accesses the network (via the sockets API), is now a tool for the network interface to use, and the application accesses the CLAN network code directly (but still via the sockets API).

Figure 3: CLAN TCP/IP Architecture
To support this change in the way protocol processing activity is driven does not require any modifications to the application, other than to link against a different shared library. (Some modifications are required however to achieve the separate issue of zero copy reception of data as described in Section 4.2.) A common criticism of other user level network libraries is that applications cannot use select() with a combination of the user-level socket file descriptors and traditional OS file descriptors. In our case this is possible due to the way the asynchronous event queues that select responds to are implemented (see [30, Section 3.5] for further details), and is invisible to the application.

A good example of the implications of the change in threading to the stack is the implementation of blocking reads and writes. In normal circumstances these would block at the thread interface between the application/sockets API and the TCP/IP stack. For example, writes would block waiting for space in the TCP send queue. By removing this thread boundary we are no longer able to block in this way. Instead, because the stack is executing in the same thread as the application, we use the processor time that would otherwise have been released (due to the application blocking) to perform the protocol processing. This can contribute to reduced latency as protocol processing occurs as soon as something is queued for writing rather than when the TCP/IP stack thread is next run. For receives, protocol processing is done lazily (i.e. when the application asks for it). This should result in improved cache performance as the data is touched by the stack just before the application makes use of it. The advantages of this technique have been demonstrated by Druschel & Banga [13].
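
A minimal sketch of this idea for a blocking send follows. The helper names (tcp_send_queue_has_space, clan_poll_rx_and_run_stack, and so on) are invented stand-ins rather than the real CLAN or lwIP entry points; the structure shows the caller itself running the protocol code instead of sleeping until a separate stack thread frees space.

#include <stddef.h>
#include <sys/types.h>

/* Hypothetical helpers standing in for the real stack internals. */
struct ul_socket;                                        /* user-level socket */
int    tcp_send_queue_has_space(struct ul_socket *s, size_t need);
size_t tcp_enqueue_for_tx(struct ul_socket *s, const void *buf, size_t len);
void   clan_poll_rx_and_run_stack(struct ul_socket *s);  /* process ACKs etc. */
void   clan_block_on_tripwire(struct ul_socket *s);      /* last-resort sleep */

/* Sketch of a blocking write when the protocol stack runs in the
 * application's own thread: time that would have been spent blocked
 * is spent doing the protocol processing that frees send-queue space. */
ssize_t ul_tcp_write(struct ul_socket *s, const void *buf, size_t len)
{
    size_t done = 0;

    while (done < len) {
        if (tcp_send_queue_has_space(s, 1)) {
            done += tcp_enqueue_for_tx(s, (const char *)buf + done, len - done);
            continue;
        }
        /* No space: process incoming segments (e.g. ACKs) ourselves,
         * which may advance the send window and free buffer space. */
        clan_poll_rx_and_run_stack(s);
        if (!tcp_send_queue_has_space(s, 1))
            clan_block_on_tripwire(s);  /* nothing left to do; wait for an event */
    }
    return (ssize_t)done;
}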
Implementing this change in architecture has been interesting. Often changing assumptions about the way things are organised is a good way to expose the limitations and fragility of code. However, the stack we have chosen has coped very well with this ordeal. The most interesting problems encountered have been:

Connections. Having a connection oriented network (CLAN) beneath a connectionless protocol (IP) brings us advantages in demultiplexing, but in turn presents its own problems. For incoming packets we have more knowledge than is usual (because we know which connection the packet arrived on) and we need some way to propagate this knowledge into the protocol stack so that it is not deduced again in the normal way. Similarly, for outgoing packets the application layer knows which socket a write has occurred on, and propagating this knowledge to the network layer is helpful. We have provided hooks into the data structures that track each packet to allow this information to be passed (a minimal sketch of the idea follows this list). To complicate matters there are also numerous special cases (e.g. a reset sent by the TCP stack) for which there is no mapping to a socket.

Understanding all of the interfaces involved. The protocols are generally well documented, but the sockets interface has evolved over many years. Its documentation (understandably) focuses on how to use it in the simple case, rather than how to understand all the different ways in which it can be used, and the significance of the details.2

Optimising the common case. Making the common case (data reception and transmission) fast is good, but it can result in increased complexity for less common operations. To allow us to judge the overall benefit of a change we have developed a profiling system to graphically compare the time taken to do a particular operation in two (or more) different implementations.

API to the network interface. In traditional architectures the API between IP and the network interface is simple (in essence, one function to call to transmit data, and another to call when data has been received). We have changed the architecture to make the network interface code a hub for all communication with the application. As a result the code must now provide a much richer interface and make this accessible to both the application (through sockets) and the protocol stack.

2 Just like footnotes, "The devil is in the detail".
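
As a rough illustration of the first point, the sketch below extends a packet descriptor with the endpoint it belongs to. The struct and field names (clan_pbuf, clan_endpoint, bound_socket) are hypothetical, not the actual CLAN or lwIP data structures; the point is that demultiplexing becomes a pointer dereference rather than a fresh lookup on addresses and ports.

#include <stddef.h>

struct ul_socket;                    /* user-level socket state */

struct clan_endpoint {               /* hypothetical per-connection endpoint */
    struct ul_socket *bound_socket;  /* set at connection setup */
};

/* Hypothetical packet descriptor: because CLAN is connection oriented,
 * the receive path already knows which endpoint (and therefore which
 * socket) a packet belongs to, so it records that alongside the data. */
struct clan_pbuf {
    void                 *payload;
    size_t                len;
    struct clan_endpoint *ep;        /* connection this packet arrived on or
                                        will leave on; NULL for stack-generated
                                        specials such as a RST */
};

/* Demultiplex is now a dereference instead of a table lookup. */
static struct ul_socket *demux(const struct clan_pbuf *p)
{
    return p->ep ? p->ep->bound_socket : NULL;
}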
4. USER-LEVEL DELIVERY

One of the most important differences between traditional networking and user-level networking is the way incoming data are delivered to the application.

In the kernel-based architecture incoming packets are delivered to a pool of packet buffers, which are then examined by the kernel to determine the application for which they are destined, and queued waiting for the application to perform a read. The data must be copied from the kernel packet buffers to the application memory space.

User level networks have taken a variety of approaches to delivery. The most difficult part is the one that was performed by the kernel - that of demultiplexing the incoming packets; i.e. determining which application they should be delivered to. Some user-level networks have left this functionality in the kernel [35, 16]. Some have even chosen to leave IP in the kernel [8], but in doing either of these they take a large performance hit compared to a pure user-level network. Others have chosen to implement this (and possibly other) functionality in the NIC itself [26, 7], but this requires more complex (and therefore expensive) hardware. Also, as the NIC must store state for each connection, the available hardware resources place a limit on the number of concurrent connections that can be supported.

In our implementation we were keen to avoid both of these pitfalls, and this is simplified by the physical network. A transfer within a CLAN network is analogous to a write burst on a memory bus, or an RDMA Write. Each write consists of a start memory address where the first word should be written, and is followed by the data. This means that the network is send-directed, whereas the majority of others are receive-directed; i.e. it is the sender that determines the final location (in the receiver's memory) of the data, not the receiver. This makes the receiver's role in the demultiplex much simpler. However, in order to ensure the data ends up in the correct place the receiver must inform the sender where the data should go in advance. (This is performed as part of connection setup.) This style of transfer has recently been proposed as a draft standard for Remote Direct Data Placement (RDDP) and Remote Direct Memory Access (RDMA) by the IETF.
4.1 Implementation

The model used to transport IP over CLAN is built around a structure similar to the Distributed Message Queue discussed in Section 2.2. A circular queue is shared across the network, with the remote host writing data, and the local host reading data. There is one DMQ (or more) per socket. For TCP/IP there are essentially two operations that need to be performed on each packet.

• Firstly, it must be processed by the relevant protocol stacks.

• Secondly, if it contains valid data, it must be passed to the application.

As a result the queue in this case requires three pointers, rather than the normal two (read and write). The write pointer is the same as before, but we now subdivide "read" into a protocol pointer and a delivery pointer, which keep track of the respective tasks' progress.

In this way, we are able to perform both delivery and protocol processing directly on the data, in place, without copying it. It also separates the act of protocol processing from the act of delivery.
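
A minimal sketch of such a three-pointer queue follows, with invented names (rxq, proto_i, deliv_i) and fixed-size slots for brevity; the real queue holds variable-length packets in memory written directly by the sender.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RXQ_SLOTS 64

struct rx_slot { const void *pkt; size_t len; };

/* Hypothetical receive queue with three indices instead of two:
 *   write_i - advanced (remotely) by the sender as data arrives,
 *   proto_i - advanced as the TCP/IP stack processes each packet in place,
 *   deliv_i - advanced as valid payload is handed to the application.
 * Slots between deliv_i and proto_i are processed but not yet read;
 * slots between proto_i and write_i are received but not yet processed. */
struct rxq {
    struct rx_slot    slots[RXQ_SLOTS];
    volatile uint32_t write_i;   /* written by the remote sender */
    uint32_t          proto_i;
    uint32_t          deliv_i;
};

/* Called lazily from the application's read path: first run protocol
 * processing over anything newly arrived, then deliver data in place. */
static bool rxq_next_payload(struct rxq *q, struct rx_slot *out,
                             void (*process)(const struct rx_slot *))
{
    while (q->proto_i != q->write_i) {           /* protocol step */
        process(&q->slots[q->proto_i]);
        q->proto_i = (q->proto_i + 1) % RXQ_SLOTS;
    }
    if (q->deliv_i == q->proto_i)
        return false;                            /* nothing to deliver yet */
    *out = q->slots[q->deliv_i];                 /* zero copy: point into queue */
    q->deliv_i = (q->deliv_i + 1) % RXQ_SLOTS;
    return true;
}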
We also separate the headers from the payloads and transfer them into two separate queues.
