`Edited by G. Goos, J. Hartmanis, and J. van Leeuwen
`
`1734
`
`
`
`Berlin
`Heidelberg
`New York
`Barcelona
`Hong Kong
`London
`Milan
`Paris
`Singapore
`Tokyo
`
`
`
`Hermann Hellwagner
`Alexander Reinefeld (Eds.)
`
`SCI: Scalable
`Coherent Interface
`
`Architecture and Software
`for High-Performance Compute Clusters
`
`
`
`Series Editors
`
`Gerhard Goos, Karlsruhe University, Germany
`Juris Hartmanis, Cornell University, NY, USA
`Jan van Leeuwen, Utrecht University, The Netherlands
`
`Volume Editors
`
`Hermann Hellwagner
`University of Klagenfurt, Institute of Information Technology
`A-9020 Klagenfurt, Austria
`E-mail: hermann.hellwagner@uni-klu.ac.at
`
`Alexander Reinefeld
`Konrad-Zuse-Zentrum für Informationstechnik Berlin (ZIB)
`Takustr. 7, D-14195 Berlin-Dahlem, Germany
`E-mail: ar@zib.de
`
`Cataloging-in-Publication data applied for
`
`Die Deutsche Bibliothek - CIP-Einheitsaufnahme
`
`SCI - Scalable coherent interface : architecture and software for
`high-performance compute clusters / Hermann Hellwagner ; Alexander Reinefeld
`(ed.). - Berlin ; Heidelberg ; New York ; Barcelona ; Hong Kong ; London ;
`Milan ; Paris ; Singapore ; Tokyo : Springer, 1999
`(Lecture notes in computer science ; Vol. 1734)
`ISBN 3-540-66696-6
`
`CR Subject Classification (1998): C.2, D.1-4, B.2-8
`
`ISSN 0302-9743
`ISBN 3-540-66696-6 Springer-Verlag Berlin Heidelberg New York
`
`This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
`concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting,
`reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication
`or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965,
`in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are
`liable for prosecution under the German Copyright Law.
`© Springer-Verlag Berlin Heidelberg 1999
`Printed in Germany
`
`Typesetting: Camera-ready by author
`SPIN: 10704208
`06/3142 – 5 4 3 2 1 0
`
`Printed on acid-free paper
`
`
`
`Preface
`
`Background
`
`System interconnection networks have become a critical component of the
`computing technology of the late 1990s, and they are likely to have a great
`impact on the design, architecture, and use of future high-performance
`computers. Indeed, it is today not only the sheer computational speed that
`distinguishes high-performance computers from desktop systems, but the
`efficient integration of the computing nodes into tightly coupled
`multiprocessor systems. Network adapters, switches, and device driver
`software are increasingly becoming performance-critical components in
`modern supercomputers.
`Due to the recent availability of fast commodity network adapter cards
`and switches, tightly integrated clusters of PCs or workstations have
`emerged on the market, now filling the gap between desktop systems and
`supercomputers. The use of commercial off-the-shelf (COTS) technology for
`both computing and networking enables scalable computing at relatively low
`costs. Some may disagree, but even the world champion in high-performance
`computing, Sandia Lab’s ASCI Red machine, may be seen as a COTS system.
`With just one hardware upgrade (pertaining to the Intel processors, not the
`network), this system has constantly been number one in the TOP-500 list of
`the worldwide fastest supercomputers since its installation in 1997. Clearly,
`the system area network plays a decisive role in overall performance.
`The Scalable Coherent Interface (SCI, ANSI/IEEE Standard 1596-1992)
`specifies one such fast system interconnect, emphasizing the flexibility,
`scalability, and high performance of the network. In recent years, SCI has
`become an innovative and widely discussed approach to interconnecting
`multiple processing nodes in various ways. SCI’s flexibility stems mainly
`from its communication protocols: in contrast to many other interconnects,
`SCI is not restricted to either message-based or shared-memory
`communication models. Instead, it combines both, taking advantage of similar
`properties that have been investigated in such hybrid machines as Stanford’s
`FLASH or MIT’s Alewife architectures. Since SCI also defines a distributed
`directory-based cache coherence protocol, it is up to the computer architect
`to choose from a broad range of communication and execution models,
`including efficient message-passing architectures, as well as shared-memory
`models, in either the NUMA or CC-NUMA variants.
`European industry and research institutions have played a key role in the
`SCI standardization process. Based on SCI adapter cards, switches, and fully
`integrated cluster systems manufactured by European companies, the SCI
`community in Europe has contributed, and continues to contribute, significant
`developments and state-of-the-art research on this important interconnect.
`
`Purpose of the Book
`
`From many discussions with friends, colleagues, and potential users, we found
`that one significant barrier to the widespread deployment and use of SCI is
`the lack of a clear vision of how SCI works, how it is being used in building
`clusters, and how obstacles in its deployment can be avoided. Our goal in
`compiling this book is to address these barriers by providing in-depth
`information on the technology and applications of SCI from various
`perspectives.
`The book focuses on SCI clusters built from commodity PCs or workstations
`and SCI adapters, since they represent the mainstream and most
`cost-effective application of SCI to date.
`In addition, some challenging research issues, mostly pertaining to
`shared-memory programming on SCI clusters, are discussed and potential
`improvements for SCI cluster equipment are highlighted.
`Who is the intended audience? The relevance of the book for computer
`architects is obvious, given the importance of system area networks for
`modern high-performance computers. But the book is also intended for system
`administrators and compute center managers who plan to invest in cluster
`technology with COTS components. Furthermore, researchers and students
`wanting to contribute to this interesting technology with their own hardware
`or software developments might find this book helpful.
`
`Organization of the Book
`
`The book consists of nine parts, each subdivided into chapters covering
`individual topics. On the whole, the contributions cover the complete
`hardware/software spectrum of SCI clusters, ranging from the major concepts
`of SCI, through SCI hardware, networking, and low-level software issues,
`various programming models and environments, up to tools and application
`experiences.
`Part I introduces the SCI standard and its application in practical computer
`systems. SCI is put into context by comparing its concepts, architecture,
`and performance with its strongest competitor, Myrinet, and also with the
`proprietary Cray T3D interconnection network, which set the standards back
`in 1993.
`Part II looks at the hardware. It describes two implementations of SCI
`adapters, the commercial, widely used Dolphin SCI cards for the PCI and
`SBus I/O buses, and the prototype adapter developed at TU München which
`can be extended by special hardware for monitoring the SCI packet flow.
`Building on the hardware, Part III explores how to build SCI interconnection
`networks and analyzes various critical aspects of SCI networks, among them
`ringlet scalability and potential performance degradation by
`hardware-generated retry traffic.
`Part IV moves on to software, describing the functionality and concrete
`implementations of SCI device drivers and introducing a low-level API that
`abstracts away SCI’s distributed shared memory (DSM) implementation details
`from higher-level software.
`The first class of parallel and distributed programming models, namely
`message-passing libraries on top of SCI, is covered in Part V. The chapters
`report on projects which implemented sockets, TCP/IP, PVM, and MPI with
`high efficiency on top of SCI, by making judicious use of the SCI DSM and
`related features.
`As pointed out by the contributions in Part VI, developing shared-memory
`programming environments on SCI clusters with current SCI hardware and
`driver software is more challenging than implementing message-passing
`libraries. Partly due to the lack of well established shared-memory
`standards, the approaches described are widely diverse. They range from
`specific shared virtual memory systems on top of SCI to a fully transparent,
`distributed thread system and to shared, parallel objects extending a CORBA
`middleware implementation. The chapters discuss some of the limitations of
`current SCI cluster equipment and present potential routes for future
`developments.
`Real-world experiences with SCI clusters are reported in Part VII. As
`a reference, benchmark and application performance results from the very
`large SCI clusters that are operated at PC² Paderborn are given first. The
`parallelization approaches and performance results from two projects, a
`complex molecular dynamics code and a real-time data acquisition and
`filtering application prototype for high-energy physics, are described as
`examples of real-world uses of SCI clusters.
`Part VIII deals with tools for SCI clusters, which apparently are still in
`their infancy. Therefore, only two basic SCI monitors, one implemented in
`hardware, the other in software, and their potential applications are
`presented here. In addition, a powerful system management tool, developed to
`operate the large Paderborn clusters as general-purpose, multi-user compute
`servers, is introduced.
`Both SCI and SCI interconnects are still evolving in terms of
`standardization, product development, research findings, and applications.
`In the final part, Part IX, therefore, one of the designers of SCI, David
`Gustavson, describes the perspectives that he sees for SCI.
`
`Acknowledgements
`
`With great pleasure, we acknowledge the efforts of the many individuals who
`have contributed to the development of this book. First and foremost, we
`thank the authors for their enthusiasm, time, and expertise which made this
`book possible. We are also grateful to the people who helped in organizing the
`book, especially Oliver Heinz (PC² Paderborn), Hans-Hermann Frese (ZIB
`Berlin), and Angelika Rossak (University of Klagenfurt). The European
`Commission provided financial support through the ESPRIT IV Programme’s SCI
`Working Group (EP 22582). Finally, we acknowledge the help of Alfred
`Hofmann and Antje Endemann of Springer-Verlag, who were always competent,
`professional, and efficient partners to work with.
`
`September 1999
`
`Hermann Hellwagner
`Alexander Reinefeld
`
`
`
`Table of Contents
`
`Part I. SCI and Competitive Interconnects for Cluster Computing
`
`1. The SCI Standard and Applications of SCI
`Hermann Hellwagner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
`1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
`1.2 SCI Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
`1.2.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
`1.2.2 Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
`1.2.3 Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
`1.2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
`1.3 The SCI Standard and Some Extensions . . . . . . . . . . . . . . . . . . . 11
`1.3.1 Logical Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
`1.3.2 Cache Coherence Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
`1.3.3 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
`1.4 Applications of SCI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
`1.4.1 System Area Network for Clusters . . . . . . . . . . . . . . . . . . 23
`1.4.2 Memory Interconnect for Cache-Coherent
`Multiprocessors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
`1.4.3 I/O Subsystem Interconnect . . . . . . . . . . . . . . . . . . . . . . . 30
`1.4.4 Large-Scale Data Acquisition System . . . . . . . . . . . . . . . 31
`1.5 Related Communication Networks and Concepts . . . . . . . . . . . 31
`1.6 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
`
`2. A Comparison of Three Gigabit Technologies:
`SCI, Myrinet and SGI/Cray T3D
`Christian Kurmann, Thomas Stricker . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
`2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
`2.2 Levels of Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
`2.2.1 Direct Deposit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
`2.2.2 Message Passing (MPI/PVM) . . . . . . . . . . . . . . . . . . . . . . 42
`2.2.3 Protocol Emulation (TCP/IP) . . . . . . . . . . . . . . . . . . . . . 44
`2.3 Gigabit Network Technologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
`2.3.1 The Intel 80686 Hardware Platform . . . . . . . . . . . . . . . . . 46
`2.3.2 Myricom Myrinet Technology . . . . . . . . . . . . . . . . . . . . . . 47
`2.3.3 Dolphin PCI-SCI Technology . . . . . . . . . . . . . . . . . . . . . . 48
`2.3.4 The SGI/Cray T3D – A Reference Point . . . . . . . . . . . . 48
`2.3.5 ATM: QoS – But Still Short of a Gigabit/s . . . . . . . . . . 50
`2.3.6 Gigabit Ethernet – An Outlook . . . . . . . . . . . . . . . . . . . . 50
`2.4 Transfer Modes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
`2.4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
`2.4.2 "Native" and "Alternate" Transfer Modes in the Three
`Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
`2.5 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
`2.5.1 Performance of Local Memory Copy . . . . . . . . . . . . . . . . 58
`2.5.2 Performance of Direct Transfers to Remote Memory . . 58
`2.5.3 Performance of MPI/PVM Transfers . . . . . . . . . . . . . . . . 61
`2.5.4 Performance of TCP/IP Transfers . . . . . . . . . . . . . . . . . . 64
`2.5.5 Discussion and Comparison . . . . . . . . . . . . . . . . . . . . . . . . 65
`2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
`
`Part II. SCI Hardware
`
`3. Dolphin SCI Adapter Cards
`Marius Christian Liaaen, Hugo Kohmann . . . . . . . . . . . . . . . . . . . . . . . 71
`3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
`3.2 Overview of the Adapter Cards . . . . . . . . . . . . . . . . . . . . . . . . . . 71
`3.3 Operating Modes of the SCI Cards . . . . . . . . . . . . . . . . . . . . . . . 73
`3.4 SCI Requester . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
`3.4.1 Address Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
`3.4.2 SCI Transaction Handling . . . . . . . . . . . . . . . . . . . . . . . . . 75
`3.4.3 SCI Packet Requester . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
`3.5 SCI Responder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
`3.5.1 Mailbox . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
`3.5.2 Access Protection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
`3.5.3 Atomic Access . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
`3.5.4 Host Bridge Capabilities . . . . . . . . . . . . . . . . . . . . . . . . . . 80
`3.6 DMA Transfers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
`3.6.1 DMA Transfers on the SBus Card . . . . . . . . . . . . . . . . . . 80
`3.6.2 DMA Transfers on the PCI Card . . . . . . . . . . . . . . . . . . . 80
`3.7 Interrupter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
`3.8 Concurrency Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
`3.8.1 Write Assembly . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
`3.8.2 Efficient Store Barrier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
`3.9 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
`3.10 Applications and Topologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
`3.10.1 SAN Interface Adapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
`3.10.2 Remote I/O Connection and Data Acquisition . . . . . . . 83
`3.10.3 Switches and Topologies . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
`3.11 Cluster Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
`
`4. The TUM PCI/SCI Adapter
`Georg Acher, Wolfgang Karl, Markus Leberecht . . . . . . . . . . . . . . . . . . 89
`4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
`4.2 The PCI/SCI Adapter Architecture . . . . . . . . . . . . . . . . . . . . . . . 90
`4.3 SCI Packet Encoding and Decoding . . . . . . . . . . . . . . . . . . . . . . . 92
`4.3.1 Overview of Packet Processing . . . . . . . . . . . . . . . . . . . . . 92
`4.3.2 Choosing the Technology . . . . . . . . . . . . . . . . . . . . . . . . . . 92
`4.3.3 Internal Structure of the FPGA . . . . . . . . . . . . . . . . . . . . 93
`4.3.4 Structure of the Packet Manager as a Microcode
`Sequencer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
`4.3.5 Microcode Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
`4.3.6 Benefits of the Micro Sequencer . . . . . . . . . . . . . . . . . . . . 98
`4.4 The SCI Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
`4.5 Preliminary Results for the PCI/SCI Adapter . . . . . . . . . . . . . . 99
`4.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
`4.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
`
`Part III. Interconnection Networks with SCI
`
`5. Low-Level SCI Protocols and Their Application to
`Flexible Switches
`Andreas C. Döring, Wolfgang Obelöer, Gunther Lustig, Erik Maehle 105
`5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
`5.2 Data Format of SCI Packets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
`5.3 Flow Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
`5.3.1 Flow Control in Rings . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
`5.3.2 Packet Sequence in SCI . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
`5.3.3 Determination of State Transitions . . . . . . . . . . . . . . . . . 109
`5.4 Bandwidth Multiplexing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
`5.4.1 Bandwidth Management in One Ring . . . . . . . . . . . . . . . 110
`5.4.2 Idle Symbols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
`5.4.3 Time-Out Determination . . . . . . . . . . . . . . . . . . . . . . . . . . 113
`5.5 Network Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
`5.5.1 Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
`5.5.2 Products . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
`5.6 Routers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
`5.6.1 Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
`5.6.2 Products and Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . 116
`5.6.3 Flexible Router . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
`5.6.4 Strip-off Decision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
`5.6.5 Routing Decision and Topology . . . . . . . . . . . . . . . . . . . . 119
`5.7 Rule-Based Routing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
`5.8 Conclusion and Outlook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
`
`6. SCI Rings, Switches, and Networks for Data Acquisition
`Systems
`Harald Richter, Richard Kleber, Matthias Ohlenroth . . . . . . . . . . . . . 125
`6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
`6.2 SCI-based Data Acquisition Systems . . . . . . . . . . . . . . . . . . . . . . 126
`6.3 SCINET Test Beds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
`6.4 Measurement Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
`6.5 SCI Switches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
`6.6 Efficient Use of SCI Switches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
`6.7 Multistage SCI Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
`6.8 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
`6.9 Summary and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
`
`7. Scalability of SCI Ringlets
`Geir Horn . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
`7.1 Do SCI Ringlets Scale in Number of Nodes? . . . . . . . . . . . . . . 151
`7.2 Ringlet Bandwidth Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
`7.2.1 Transaction Formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
`7.2.2 Packet Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
`7.2.3 Address Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
`7.2.4 Locality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
`7.2.5 Bypass Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
`7.2.6 Echo Packet Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
`7.2.7 Output Link Utilization Factor . . . . . . . . . . . . . . . . . . . . . 160
`7.3 Scalability Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
`7.3.1 Common Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
`7.3.2 Uniform Ringlet Traffic . . . . . . . . . . . . . . . . . . . . . . . . . . 162
`7.3.3 Non-uniform Ringlet Traffic . . . . . . . . . . . . . . . . . . . . . . 162
`7.3.4 Changing Packet Lengths . . . . . . . . . . . . . . . . . . . . . . . . 163
`7.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
`7.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
`
`8. Affordable Scalability Using Multi-Cubes
`Håkon Bugge, Knut Omang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
`8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
`8.2 Interconnect Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
`8.3 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
`8.4 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
`8.4.1 "Hot-Link" Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
`8.4.2 "Hot-B-Link" Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
`8.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
`8.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
`
`Part IV. Device Driver Software and Low-Level APIs
`
`9. Interfacing SCI Device Drivers to Linux
`Roger Butenuth, Hans-Ulrich Heiss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
`9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
`9.2 Layers of Functionality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
`9.2.1 Address Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
`9.2.2 Levels of Hardware Abstraction . . . . . . . . . . . . . . . . . . . . 180
`9.2.3 Resource Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
`9.2.4 Virtual Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
`9.2.5 Robustness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
`9.3 Why Linux? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
`9.4 Interfaces of the Driver . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
`9.4.1 Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
`9.4.2 Linux . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
`9.4.3 User Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
`9.4.4 SCI Drivers on Other Nodes . . . . . . . . . . . . . . . . . . . . . . . 188
`9.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
`
`10. SCI Physical Layer API
`Volker Lindenstruth, David B. Gustavson . . . . . . . . . . . . . . . . . . . . . . 191
`10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
`10.1.1 Scope of the Standard . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
`10.2 SCI Physical Layer API Architecture and Features. . . . . . . . . . 193
`10.2.1 Exception Handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
`10.2.2 Endianness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
`10.3 Supported Data Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
`10.4 Miscellaneous Procedures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
`10.5 Address Translation Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
`10.5.1 Global Object Identifier . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
`10.5.2 SCI Global Address Resolution . . . . . . . . . . . . . . . . . . . . . 200
`10.6 Shared Memory Transactions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
`10.7 Packet Transactions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
`10.8 Block Transactions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
`10.9 Message Passing Transactions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
`10.10 Cache Transactions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
`10.11 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
`
`Part V. Message Passing Libraries
`
`11. SCI Sockets Library
`Hermann Hellwagner, Josef Weidendorfer . . . . . . . . . . . . . . . . . . . . . . 209
`11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
`11.1.1 Rationale . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
`11.1.2 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
`11.2 Features and Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
`11.2.1 Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
`11.2.2 Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
`11.2.3 Communication via the SSLib . . . . . . . . . . . . . . . . . . . . . . 212
`11.2.4 Connection Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
`11.2.5 Handling Special System Calls . . . . . . . . . . . . . . . . . . . . . 216
`11.2.6 Other Calls Intercepted and Handled by the SSLib . . . 218
`11.2.7 Out-of-Band Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
`11.3 Implementation Aspects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
`11.3.1 Communication Among Components . . . . . . . . . . . . . . . . 218
`11.3.2 SSLib Layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
`11.3.3 Choice of Most Efficient Communication Mechanism . . 220
`11.3.4 SSLib Implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
`11.3.5 Control Transfers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
`11.4 Functional Tests and Performance . . . . . . . . . . . . . . . . . . . . . . . . 222
`11.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
`11.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
`
`12. TCP/IP over SCI under Linux
`Hüseyin Taskin, Roger Butenuth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
`12.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
`12.2 SCIP Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
`12.2.1 Packet Driver Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
`12.2.2 Hardware Address Resolution . . . . . . . . . . . . . . . . . . . . . . 232
`12.2.3 Other Implementation Issues . . . . . . . . . . . . . . . . . . . . . . . 233
`12.3 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
`12.3.1 Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
`12.3.2 Latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
`12.3.3 Throughput . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
`12.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
`
`13. PVM for SCI Clusters
`Markus Fischer, Alexander Reinefeld . . . . . . . . . . . . . . . . . . . . . . . . . . 239
`13.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
`13.2 Parallel Virtual Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
`13.2.1 PVM Implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
`13.2.2 Models for Zero-Memory-Copy Data Transfer . . . . . . . . 241
`13.3 SCI Communication Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
`13.4 PVM-SCI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
`13.4.1 System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
`13.4.2 Supporting Multiple Interconnects . . . . . . . . . . . . . . . . . . 245
`13.4.3 Reducing Memory Copies . . . . . . . . . . . . . . . . . . . . . . . . . 245
`13.4.4 Ring Buffer Management . . . . . . . . . . . . . . . . . . . . . . . . . . 246
`13.4.5 Performance Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
`13.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
`
`14. ScaMPI – Design and Implementation
`L.P. Huse, K. Omang, H. Bugge, H. Ry, A.T. Haugsdal, E. Rustad 249
`14.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
`14.2 Scali Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
`14.3 The SCI Memory Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
`14.3.1 Coordinating Use of Shared Locations. . . . . . . . . . . . . . . 251
`14.3.2 Ensuring Safe Data Transport in SCI – Checkpointing 252
`14.3.3 Shared Address Space Programming without the
`Drawbacks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .