`
`SPECIAL ISSUE
`
`VOLUME 28 No. 1
`
`JUNE 2000
`
`Proceedings
`ACM SIGMETRICS '2000
`International Conference on Measurement
`and Modeling of Computer Systems
`
`Sponsored
`
`by ACM SIGMETRICS
`
`EVAS COLLECT\U^
`UCV.^
`
`•Mi
`
`Santa Clara, CA, USA
`
`' U
`
`BUNGIE - EXHIBIT 1040
`
`
`
`The Association for Computing Machinery
`1515 Broadway
`New York, N.Y. 10036
`
`Copyright © 2000 by Association for Computing Machinery, Inc. (ACM). Permission to make
`digital or hard copies of portions of this work for personal or classroom use is granted without fee
`provided that the copies are not made or distributed for profit or commercial advantage and that
`copies bear this notice and the full citation on the first page. Copyright for components of this
`work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy
`otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific
`permission and/or a fee. Request permission to republish from: Publications Dept. ACM, Inc.
`Fax +1 (212) 869-0481 or E-mail permissions(a),acm.org
`
`For other copying of articles that carry a code at the bottom of the first or last page, copying is
`permitted provided that the per-copy fee indicated in the code is paid through the Copyright
`Clearance Center, 222 Rosewood Drive, Danvers, MA 01923, +1-978-750-8400, +1-978-750
`4470 (fax).
`
`Notice to Contributing Authors to SIG Newsletters:
`
`By submitting your article for distribution in this Special Interest Group publication,
`you hereby grant to ACM the following non-exclusive, perpetual, worldwide rights:
`
`-to publish in print on condition of acceptance by the editor
`-to digitize and post your article in the electronic version of this publication
`-to include the article in the ACM Digital Library
`-to allow users to copy and distribute the article for noncommercial, educational or research
`purposes
`
`However, as a contributing author, you retain copyright to your article and
`ACM will make every effort to refer requests for commercial use directly to
`you.
`
`ACM ISBN: 1-58113-194-1
`
`Additional copies may be ordered prepaid from:
`
`ACM Order Department
`P.O. Box 11405
`New York, N.Y. 10286-1405
`
`Phone: 1-800-342-6626
`(U.S.A. and Canada)
`+1-212-626-0500
`(All other countries)
`Fax:+1-212-944-1318
`E-mail: acmhelp@acm.org
`
`ACM Order Number: 488000
`Printed in the U.S.A.
`
`
`
`TABLE OF CONTENTS
`
`Organizing Committee
`Referees
`Message from General Chair
`Message from Program Co-Chairs
`Message from Sigmetrics Chair ...
`
`Keynote Address: The Internet and Its Future
`Leonard Kleinrock, Chairman and Founder, Nomadic Inc., Professor, Department of Computer
`Science, UCLA
`
`Session 1: Network Architecture and Protocols
`
`A Case for End System Multicast....:
`Yang-hua Chu, Sanjay Rao, Hui Zhang, Carnegie Mellon University
`
`PLM: Fast Convergence for Cumulative Layered Multicast Transmission Schemes
`Amaud Legout, Ernst W. Biersack, Institut EURECOM
`
`On Achievable Service Differentiation with Token Bucket Marking for TCP
`Sambit Sahu, University of Massachusetts, Philippe Nain, INRIA Sophia Antipolis,
`Don Towsley, University of Massachusetts, Christophe Diot, Sprint ATL, Victor Firiou, Nortel
`Networks
`
`in
`IV
`v
`VI
`vn
`
`1
`
`13
`
`23
`
`Session 2: File and Storage Systems
`
`Feasibility of a Serverless Distributed File System Deployed on an Existing Set of
`34
`Desktop PCs
`William J. Bolosky, John R. Douceur, Microsoft Research, David Ely, University of Washington,
`Marvin Theimer, Microsoft Research
`
`44
`Comparing Random Data Allocation and Data Striping in Multimedia Servers
`Jose Renato Santos, Richard Muntz, UCLA, Berthier Ribeiro-Neto, Universidade Federal de Minas
`Gerais
`
`56
`Modeling and Performance of MEMS-Based Storage Devices
`John Linwood Griffin, Steven W. Schlosser, Gregory R. Ganger, David F.Nagle, Carnegie Mellon
`University.
`
`VIM
`
`
`
`Session 3: Web and Multimedia Servers
`
`Implications of Proxy Caching for Provisioning Networks and Servers
`Mohammad S. Raunak, Prashant Shenoy, University of Massachusetts, Pawan Goyal, Ensim
`Corporation, Krithi Ramamritham, University of Massachusetts
`
`Collaborative Web Caching Based on Proxy Affinities,
`Jiong Yang, Wei Wang, IBM, Richard Muntz, UCLA
`
`66
`
`78
`
`Cluster Reserves: A Mechanism for Resource Management in Cluster-Based Network Servers... 90
`Mohit Aron, Peter Druschel, Willy Zwaenepoel, Rice University
`
`Poster Session
`
`Analysis of the Phenomenon of Several Consecutive Slow Start Phases in TCP,
`Chadi Barakat, Eitan Altman, INR1A Sophia Antipolis
`.
`
`Providing Guaranteed Quality of Service for Interactive Visualization Applications
`Wai-Man R. Wong, Richard R. Muntz, UCLA
`
`IP Multicast Fault Recovery in PIM over OSPF
`Xin Wang, Chienming Yu, Henning Schulzrinne, Columbia University,
`Paul Stirpe, Wei Wu, Reuters
`
`102
`
`104
`
`106
`
`Cell-based Multicast Grouping in Large-Scale Virtual Environments
`Emmanuel Lety, Thierry Turletti, INRIA Sophia Antipolis, Francois Baccelli, ENRIA/ENS
`
`108
`
`Temporal Locality in Web Request Streams: Sources, Characteristics, and Caching
`Implications
`Azer Bestavros, Shudong Jin, Boston University
`
`Automated Disk Drive Characterization
`Jiri Schindler, Gregory Ganger, Carnegie Mellon University
`
`Online Superpage Promotion Revisited
`Zhen Fang, Lixin Zhang, John Carter, Sally McKee, Wilson C. Hsieh University of Utah
`
`110
`
`112
`
`114
`
`An Inherently Loss-Less and Bandwidth Efficient Periodic Broadcast Scheme for VBR Video.... 116
`loanis Nikolaidis, Fulu Li, Ailan Hu University of Alberta
`
`An Analysis of Short-Term Fairness in Wireless Media Access Protocols
`Can Emre Koksal, Hisham Kassab, Hari Balakrishnan, MIT
`
`RESCU: Dynamic Hybrid Packet Loss Recovery for Video Transmission over the Internet
`Srinath R. Joshi, Injong Rhee, North Carolina State University
`
`118
`
`120
`
`IX
`
`
`
`The Content and Access Dynamics of a Busy Web Server
`Venkata N. Padmanabhan, Microsoft Research, Lili Qiu, Cornell University
`
`Session 4: Networking: Congestion Control
`
`TCP in presence of Bursty Losses
`Eitan Altman, Konstantin Avrachenkov, Chadi Barakat, INRIA Sophia Antipolis
`
`122
`
`124
`
`The Incremental Deployability of RTT-Based Congestion Avoidance for High Speed TCP Internet
`134
`Connections
`Jim Martin, GartnerGroup, Arne Nilsson, Injong Rhee, North Carolina State University
`
`Detecting Shared Congestion of Flows Via End-to-end Measurement
`Dan Rubenstein, Jim Kurose, Don Towsley, University of Massachusetts
`Best Student Paper Award
`
`145
`
`Session 5: Network Measurement and Performance Modeling
`
`156
`Measurement and Analysis ofLDAP Performance
`Xin Wang, Henning Schulzrinne, Columbia University, Dilip Kandlur, Dinesh Verma, IBM T.J.
`Watson Research Center
`
`IP Packet Generation: Statistical Models for TCP Start Times Based on Connection-Rate
`Superposition
`William S. Cleveland, Dong Lin, Don X. Sun, Bell Laboratories Lucent Technologies
`
`On the Impact of Soft Hand-off in Cellular Systems
`Nidhi Hegde, Khosrow Sohraby, University of Missouri-Kansas City
`
`Session 6: Queueing and Performance Evaluation Techniques
`
`Delay Asymptotics for a Priority Queueing System
`Sanjay Shakkottai, R. Srikant, University of Illinois at Urbana-Champaign
`
`166
`
`178
`
`188
`
`A Fast and Accurate Iterative Solution of a Multi-Class Threshold-based Queueing System with
`196
`Hysteresis
`Leana Golubchik, University of Maryland, John C.S. Lui, The Chinese University of Hong Kong
`
`Using the Exact State Space of a Markov Model to Compute Approximate Stationary Measures 207
`Andrew S Miner, Gianfranco Ciardo, College of William and Mary,
`Susanna Donatelli, Universita di Torino
`
`x
`
`
`
`AMVA Techniques for High Service Time Variability.
`217
`Derek L. Eager, Univ. of Saskatchewan, Daniel J. Sorin, Mary K. Vernon, University of Wisconsin
`
`Session 7: Tools and Benchmarking
`
`Eff icient Performance Prediction for Modem Microprocessors
`David Ofelt, Juniper Networks, John Hennessy, Stanford University
`
`Improving Interactive System Performance Using TIPME
`Yasuhiro Endo, Network Appliance, Margo Seltzer, Harvard University
`
`Quantifying the Energy Consumption of a Pocket Computer and a Java Virtual Machine.
`Keith Farkas, Compaq Western Research Lab, Jason Flinn, Carnegie Mellon University,
`Godmar Back, University of Utah, Dirk Grunwald, University of Colorado,
`Jennifer Anderson, VMware
`
`Session 8: Memory Management and Databases
`
`Memory System Behavior of Java Programs: Methodology and Analysis
`Jin-Soo Kim, Seoul National University, Yarsun Hsu, National Tsing Hua University
`Best Systems Paper Award
`
`229
`
`240
`
`252
`
`264
`
`An Analytical Model of the Working-Set Sizes in Decision-Support Systems
`Magnus Karlsson, HP Labs, Fredrik Dahlgren, Ericsson, Per Stenstroem, Chalmers University
`
`275
`
`Towards Application/File-level Characterization of Block References:
`A Case for Fine-Grained Buffer Management
`Jongmoo Choi, Seoul National University, Sam H. Noh, Hong-Ik University, Sang Lyul Min,
`Yookun Cho, Seoul National University
`
`286
`
`Session 9: Network Routing
`
`Multicast Routing with Bandwidth Guarantees: A New Approach Using Multicast Network
`Flow
`Murali S. Kodialam, T. V. Lakshman, Bell Laboratories Lucent Technologies,
`Sudipta Sengupta, MIT
`
`Stable Internet Routing Without Global Coordination
`Lixin Gao, Smith College, Jennifer Rexford, AT&T Labs Research
`
`296
`
`307
`
`An Efficient Algorithm for Finding a Path Subject to Two Additive Constraints
`318
`Turgay Korkmaz, Marwan Krunz, University of Arizona, Spyros Tragoudas, Southern Illinois Univ
`Author Index
`
`328
`
`XI
`
`
`
`A Case for End System Multicast *
`
`Yang-hua Chu, Sanjay G. Rao, and Hui Zhang
`Carnegie Mellon University
`{yhchu, sanjay, hzhang}@cs.cmu.edu
`
`ABSTRACT
`The conventional wisdom has been that IP is the natural
`protocol layer for implementing multicast related function
`ality. However, ten years after its initial proposal, IP Multi
`cast is still plagued with concerns pertaining to scalability,
`network management, deployment and support for higher
`layer functionality such as error, flow and congestion con
`trol. In this paper, we explore an alternative architecture for
`small and sparse groups, where end systems implement all
`multicast related functionality including membership man
`agement and packet replication. We call such a scheme End
`System Multicast. This shifting of multicast support from
`routers to end systems has the potential to address most
`problems associated with IP Multicast. However, the key
`concern is the performance penalty associated with such a
`model. In particular, End System Multicast introduces du
`plicate packets on physical links and incurs larger end-to-
`end delay than IP Multicast. In this paper, we study this
`question in the context of the Narada protocol. In Narada,
`end systems self-organize into an overlay structure using a
`fully distributed protocol. In addition, Narada attempts to
`optimize the efficiency of the overlay based on end-to-end
`measurements. We present details of Narada and evaluate
`it using both simulation and Internet experiments. Prelimi
`nary results are encouraging. In most simulations and Inter
`net experiments, the delay and bandwidth penalty are low.
`We believe the potential benefits of repartitioning multicast
`functionality between end systems and routers significantly
`outweigh the performance penalty incurred.
`
`1. INTRODUCTION
`Traditional network architectures distinguish between two
`types of entities: end systems (hosts) and the network (switches
`
`*This research was sponsored by DARPA under contract
`numbers N66001-96-C-8528, F30602-99-1-0518, and E30602-
`97-2-0287, and by NSF under grant numbers Career Award
`NCR-9624979 ANI-9730105, and ANI-9814929. Additional
`support was provided by Intel, Lucent, and Ericsson. Views
`and conclusions contained in this document are those of the
`authors and should not be interpreted as representing
`the of-
`ficial policies, either expressed or implied, of DARPA, NSF,
`Intel, Lucent, Ericsson or the U.S. government.
`
`Perrriission to make digital or hard copies of all or part of this work for
`personal or classroom use is granted without fee provided that
`copies are not made or distributed for profit or commercial advant
`-age and that copies bear this r»otice and the full citation on the first page.
`To copy otherwise, to republish, to post on servers or to
`redistribute to lists, requires prior specific permission and/or a fee.
`SIGMETRICS 2000 6/00 Santa Clara, California, USA
`© 2000 ACM 1-58113-194-1/00/0006...$5.00
`
`and routers). One of the most important architectural deci
`sions is then the division of functionality between end sys
`tems and networks.
`
`In the Internet architecture, the internetworking layer, or IP,
`implements a minimal functionality — a best-effort unicast
`datagram service, and end systems implement all other im
`portant functionality such as error, congestion, and flow con
`trol. Such a minimalist approach is probably the single most
`important technical reason for the Internet's growth from
`a small research network into a global, commercial infras
`tructure with heterogeneous technologies, applications, and
`administrative authorities. The growth of the network in
`turn unleashes the development of new applications, which
`require richer network functionality.
`
`The key architecture question is: what new features should
`be added to the IP layer? Multicast and QoS are the two
`most important features that have been or are being added
`to the IP layer. While QoS is a functionality that cannot be
`provided by end systems alone and thus has to be supported
`at the IP layer, this is not the case for multicast. In partic
`ular, it is possible for end systems to implement multicast
`services on top of the IP unicast service.
`
`In deciding whether to implement multicast services at the
`IP layer or at end systems, there are two conflicting consid
`erations that we need to reconcile. According to the end-to-
`end arguments [17], a functionality should be (a) pushed to
`higher layers if possible, (b) unless implementing it at the
`lower layer can achieve large performance benefit that out
`weighs the cost of additional complexity at the lower layer.
`
`In his seminal work in 1989 [4], Deering argues that this sec
`ond consideration should prevail and multicast should be im
`plemented at the IP layer. This view so far has been widely
`accepted. IP Multicast is the first significant feature that
`has been added to the IP layer since its original design and
`most routers today implement IP Multicast. Despite this, IP
`Multicast has several drawbacks that have so far prevented
`the service from being widely deployed. First, IP Multicast
`requires routers to maintain per group state, which not only
`violates the "stateless'' architectural principle of the origi
`nal design, but also introduces high complexity and serious
`scaling constraints at the IP layer. Second, the current IP
`Multicast model allows for an arbitrary source to send data
`to an arbitrary group. This makes the network vulnerable
`to flooding attacks by malicious sources, and complicates
`network management and provisioning. Third, IP Multi
`cast requires every group to dynamically obtain a globally
`unique address from the multicast address space and it is
`
`1
`
`
`
`difficult to ensure this in a scalable, distributed and consis
`tent fashion. Fourth, IP Multicast is a best effort service.
`Providing higher level features such as reliability, congestion
`control, flow control, and security has been shown to be more
`difficult, than in the unicast case. Finally, IP Multicast calls
`for changes at the infrastructural level, and this slows down
`the pace of deployment. While there have been attempts
`to partially address some of the issues at the IP layer [9,
`15, 21], fundamental concerns pertaining to the "stateful"
`architecture of IP Multicast and support for higher layer
`functionality have remained unresolved.
`
`In this paper, we revisit the issue of whether multicast re
`lated functionality should be implemented at the IP layer or
`at the end systems. In particular, we consider a model in
`which multicast related features, such as group membership,
`multicast routing and packet duplication, are implemented
`at end systems, assuming only unicast IP service. We call
`the scheme End System Multicast. Here, end systems par
`ticipating in the multicast group communicate via an overlay
`structure. The structure is an overlay in the sense that each
`of its edges corresponds to a unicast path between two end
`systems in the underlying Internet.
`
`We believe that End System Multicast has the potential to
`address most problems associated with IP Multicast. Since
`all packets are transmitted as unicast packets, network pro
`visioning is not affected and deployment may be acceler
`ated. End System Multicast maintains the stateless nature
`of the network by requiring end systems, which subscribe
`only to a small number of groups, to perform additional
`complex processing for any given group. In addition, we be
`lieve that solutions for supporting higher layer features such
`as error, flow, and congestion control can be significantly
`simplified by leveraging well understood unicast solutions
`for these problems. We hope to demonstrate this in future
`works. Finally, an end system based architecture no longer
`requires globed consistency in naming of groups and allows
`for application specific naming.
`
`While End System Multicast has many advantages, several
`issues need to be resolved before it become a practical alter
`native to IP Multicast. In particular, an overlay approach
`to multicast, however efficient, cannot perform as well as
`IP Multicast. It is impossible to completely prevent mul
`tiple overlay edges from traversing the same physical link
`and thus some redundant traffic on physical links is unavoid
`able. Further, communication between end systems involves
`traversing other end systems, potentially increasing latency.
`In this paper, we focus on two fundamental questions per
`taining to the End System Multicast architecture: (i) what
`are the performance implications of using an overlay archi
`tecture for multicast? and (ii) how do end systems with
`limited topological information cooperate to construct good
`overlay structures?
`
`In this paper, we seek to answer these questions in the con
`text of a protocol that we have developed called Narada.
`Narada constructs an overlay structure among participat
`ing end systems in a self-organizing and fuliy distributed
`manner. Narada is robust to the failure of end systems and
`to dynamic changes in group membership. End systems
`begin with no knowledge of the underlying physical topol
`ogy, and they determine latencies to other end systems by
`probing them in a controlled fashion. Narada continually
`refines the overlay structure as more probe information is
`available. Narada may be distinguished from many other
`
`self-organizing protocols in that it does not require a native
`multicast medium. We present details in Section 3.
`
`We evaluate the performance penalty of the overlay Narada
`produces using simulations. In a group of 128 members, the
`delay between at least 90% of pairs of members increases
`by a factor of at most 4 compared to the unicast delay be
`tween them. Further, no physical link carries more than
`9 identical copies of a given packet. We have also imple
`mented Narada and conducted preliminary Internet exper
`iments. For a group of 13 members, the delay between at
`least 90% of pairs of members increases by a factor of at
`most 1.5 compared to the unicast delay between them.
`
`While we refer to end systems in this paper, it is with the
`understanding that this may easily be generalized to nodes
`at the edge of the network such as campus wide proxies and
`edge routers. Proxies can exploit native multicast available
`at the LAN level, have larger processing power than hosts,
`are better connected to the Internet and allow for organi
`zation level agreements. Further, they could potentially ex
`ploit temporal and spatial locality in group formations, At
`the other end of the spectrum, tree building mechanisms
`might be built into actual applications running on hosts,
`and could exploit application specific requirements and pe
`culiarities. Irrespective of these differences, common to all
`these architectures is the design and evaluation of an actual
`self-organization protocol for construction of overlay struc
`tures for data delivery. An end system in our paper refers to
`the entity that actually takes part in the self-organization
`protocol, and could be a host, or an application or a proxy.
`
`We believe End System Multicast is more appropriate for
`small size and sparse groups such as audio/video conferenc
`ing, virtual classroom and multiparty network games, but is
`not appropriate for handling applications such as Internet
`TV that involve a single source streaming high bandwidth
`and real time data to several hundred thousand recipients.
`Regardless, we believe End System Multicast is of interest
`as we hypothesize that there will be an explosion of small
`sized and sparse groups in the near future.
`
`2. END SYSTEM MULTICAST
`In this section, we contrast IP Multicast, End System Mul
`ticast and naive unicast with an example, and outline fun
`damental issues in the design of an End System Multicast
`protocol.
`
`Consider Figure 1(a) which depicts an example physical
`topology, /il and R2 are routers, while A, B, C and D are
`end systems. Link delays are as indicated. Thus Rl — R2
`may be imagined to be a costly transcontinental link, while
`all other links are cheaper local links: Further, let us assume
`A wishes to send data to all other nodes.
`
`Figure 1(b) depicts the IP Multicast tree constructed by
`DVMRP [4]. DVMRP is the classical IP Multicast protocol,
`where data is delivered from the source to recipients using
`an IP Multicast tree composed of the shortest paths from
`each recipient to the source. R\ and R2 receive a single
`copy of the packet but forward it along multiple interfaces.
`At most one copy of a packet is sent over any physical link.
`Each recipient receives data with the same delay as though
`A were sending to it directly by unicast.
`
`End System Multicast does not rely on router support for
`
`2
`
`
`
`27
`
`27
`
`28
`
`28
`(c)
`
`27
`
`A
`
`B
`
`I
`
`2
`
`r1
`
`25
`
`r2
`
`c
`
`a
`
`1
`
`3
`
`1
`
`d
`
`b
`
`(a)
`
`a
`
`1
`
`|r1
`2
`
`25
`
`b
`
`c
`
`a
`
`3
`
`1
`r2
`
`27
`
`1 (d
`
`b
`
`fa
`
`T
`
`25
`
`(e)
`
`27
`
`c
`
`a
`
`2
`
`d
`
`b
`
`1
`
`2
`
`Mc
`
`a
`
`3
`
`D
`
`b
`
`(b)
`
`(d)
`
`a
`
`c
`1
`
`1 (D
`
`b
`
`1
`
`r2
`
`1
`R1
`
`2
`
`25
`
`c
`
`1 (D
`
`•Kc
`
`2
`
`d
`
`3
`
`g>
`
`(g)
`
`27
`
`28
`(h)
`
`i
`
`D
`
`(0
`Figure 1: Examples to illustrate IP Multicast, naive unicast and End System Multicast
`3. NARADA DESIGN
`In this section, we present Narada, a protocol we designed
`that implements End System Multicast. In designing Narada,
`we have the following objectives in mind:
`
`multicast. It abstracts the physical topology sis a Complete
`Virtual Graph (CVG) as shown in Figure 1(c). Further, it
`tries to construct a spanning-tree of the CVG along which
`A could send data to other recipients. In particular, this
`scheme could degenerate to naive unicast transmission as
`shown in Figure 1(d). Figure 1(e) depicts how naive unicast
`transmission maps onto the underlying physical network. It
`is seen that links Rl — R2 and A — Rl carry 2 and 3 copies
`of a transmission by A respectively. We refer to the number
`of identical copies of a packet carried by a physical link as
`the stress of a physical link. Thus, unicast in general leads
`to a high stress on the link nearest the source.
`
`We could build smarter spanning trees of the CVG such
`as the one shown in Figure 1(f). Figure 1(g) depicts how
`this tree maps onto the underlying physical network. It is
`seen that all physical links have a stress of at most 2. Not
`only does this tree reduce the worst case physical link stress,
`but it also reduces the stress of the costlier link Rl — R2
`to 1. In order to capture this, we introduce the notion of
`resource usage. We define resource usage as
`* 3,,
`where, L is the number of links active in data transmission,
`di is the delay of link i and s; is the stress of link i. The
`resource usage is a metric of the network resources consumed
`in the process of data delivery to all receivers. Implicit here
`is the assumption that links with higher delay tend to be
`associated with higher cost. The resource usage might be
`computed to be 30 in the case of transmission by DVMRP,
`57 for naive unicast and 32 for the smarter tree , shown in
`Figure 1(b), Figure 1(d) and Figure 1(f) respectively.
`
`While the spanning tree of Figure 1(f) improves link stress
`and resource usage as compared to naive unicast transmis
`sion, it increases delay from the source to some of the recip
`ients. Thus, the delay from A to D has increased to 29 from
`27. We refer to the ratio of the delay between two members
`along the overlay to the unicast delay between them as the
`Relative Delay Penalty (RDP). Thus, <A,D> has an RDP
`of
`while <A,B> and <A,C> have an RDP of 1.
`
`The out-degree oi a node in the overlay tree is its number of
`children in the tree. Thus, the out-degree of A in the smarter
`tree is 2, while it is 3 in naive unicast. The out-degree of a
`node in the overlay spanning tree directly impacts the stress
`of the links close to the node.
`
`3
`
`• Self-organizing: The construction of the end system over
`lay must take place in a fully distributed fashion and must
`be robust to dynamic changes in group membership.
`
`• Overlay efficiency: The tree constructed must ideally have
`low stress, low RDP and low resource usage. Rirther, the
`out-degree of each member in the overlay must reflect the
`bandwidth of its connection to the Internet.
`
`• Self-improving in an incremental fashion: The overlay con
`struction must include mechanisms by which end systems
`gather network information in a scalable fashion. The pro
`tocol must allow for the overlay to incrementally evolve into
`a better structure as more information becomes available.
`
`There are two basic methods for construction of overlay
`spanning trees for data delivery. A first approach is to
`construct the tree directly - that is, members explicitly se
`lect their parents from among the members they know [8].
`Narada however constructs trees in a two-step process. First
`it constructs a richer connected graph that we term mesh.
`The mesh could in general be an arbitrary connected sub
`graph of the CVG (though Narada tries to ensure the mesh
`has desirable performance properties as discussed later.) In
`the second step, Narada constructs (reverse) shortest path
`spanning trees of the mesh, each tree rooted at the corre
`sponding source using well known routing algorithms. Fig
`ure 1(h) presents an example mesh that Narada constructs
`for the physical topology shown in Figure 1(a), along with
`the shortest path spanning tree rooted at A. We have several
`reasons for this two step process. First, group management
`functions are abstracted out and handled at the mesh rather
`than replicated across multiple (per source) trees. Second,
`distributed heuristics for repairing mesh partition and mesh
`optimization are greatly simplified as loop avoidance is no
`longer a constraint. Third, we may leverage standard rout
`ing algorithms for construction of data delivery trees. Fi
`nally, a mesh is more resilient to the failure of members
`than a tree and heavy weight partition repair mechanisms
`are invoked less frequently.
`
`
`
`In our approach, there is no control over the resulting span
`ning trees for a given mesh. Hence, it becomes important
`to construct a good mesh so that good quality trees may
`be produced. In particular, we attempt to ensure the fol
`lowing properties: (i) the shortest path delay between any
`pair of members along the mesh is at most K times the uni-
`cast delay between them, where K is a small constant and
`(ii) each member has a limited number of neighbors in the
`mesh which does not exceed a given (per-member) bound
`chosen to reflect the bandwidth of the member's connection
`to the Internet.1 Limiting the number of neighbors regu
`lates the fanout of members in the spanning trees. Second,
`it controls the overhead of running routing algorithms on
`the mesh. The extreme case where the mesh is chosen to be
`the Complete Virtual Graph incurs all the overhead of rout
`ing with none of its benefits as the resulting shortest path
`spanning trees degenerates to naive unicast transmission.
`
`Narad a has striking differences from self-configuring proto
`cols developed in other contexts. First, Narada distinguishes
`itself from normal routing protocols in that it changes the
`very topology over which routing is performed. Second,
`most existing self-configuring protocols [12, 13, 14, 22] as
`sume native IP Multicast support. Narada attempts self-
`configuration in the absence of a lower level multicast ser
`vice, and this is fundamentally more challenging.
`
`We explain the distributed algorithms that Narada uses to
`construct and maintain the mesh in Section 3.1. We present
`heuristics that Narada uses to improve mesh quality in Sec
`tion 3.2. Narada runs a variant of standard distance vector
`algorithms on top of the mesh and uses well known algo
`rithms to construct per-source (reverse) shortest path span
`ning trees for data delivery. We discuss this in Section 3.3.
`
`3.1 Group Management
`We have seen that Narada tries to construct a mesh among
`end systems participating in the multicast group. In this sec
`tion, we present mechanisms Narada uses to keep the mesh
`connected, to incorporate new members into the mesh and
`to repair possible partitions that may be caused by members
`leaving the group or by member failure.
`
`As we do not wish to rely on a single non-failing entity to
`keep track of group membership, the burden of group main
`tenance is shared jointly by all members. To achieve a high
`degree of robustness, our approach is to have every member
`maintain a list of all other members in the group. Since
`Narada is targeted towards small sized groups, maintaining
`the complete group membership list is not a major over
`head. Every member's list needs to be updated when a new
`member joins or an existing member leaves. The challenge
`is to disseminate changes in group membership efficiently,
`especially in the absence of a multicast service provided by
`the lower layer. We tackle this by exploiting the mesh to
`propagate such information. However, this strategy is com
`plicated by the fact that the mesh might itself become par
`titioned when a member leaves. To handle this, we require
`that each member periodically generate a refresh message
`with monotonically increasing sequence number, which is
`disseminated along the mesh. Each member % keeps track of
`
`1An ideal mesh is a "Degree-Bounded K-spanner" [11] of
`the Complete Virtual Graph. The problem of construct
`ing Degree-Bounded K-spanners of a graph has been widely
`studied in centralized settings that assume complete infor-
`mation and is NP-complete even in such scenarios [11].
`
`Let i receive refresh message from neighbor j at t ' s loca]
`time t. Let < k, s^j > be an entry in j s refresh message.
`* if i does not have em entry for k , then i inserts the
`entry < k, S k j, t > into its table
`* else if t's entry for k is < k , S k i ^ k i >» then
`• if s ^ i > sk j * ignores the entry pertaining to k
`• else i updates its entry for k to <
`>
`
`Figure 2: Actions taken by a member i on receiving
`a refresh message from member j.
`
`the following information for every other member k in the
`group: (i) member address k; (ii) last sequence number Ski
`that i knows k has issued; and (iii) local time at i when i
`first received information that k issued
`If member i has
`not received an update from member k for Tm time, then,
`i assumes that k is either dead or potentially partitioned
`from t. It then initiates a set of actions to determine the
`existence of a partition and repair it if present as discussed
`in Section 3.1.3.
`
`Propagation of refresh messages from every member along
`the mesh could potentially be quite expensive. Instead, we
`require that each member periodically exchange its knowl
`edge of group membership with its neighbors in the mesh.
`A message from member i to a neighbor j contains a list of
`entries, one entry for each member k that i knows is part of
`the group. Each entry has the following fields: (i) membe