Scalable Parallel Computing

Kai Hwang
University of Hong Kong
University of Southern California

Zhiwei Xu
Chinese Academy of Sciences
University of Hong Kong

WCB/McGraw-Hill

Boston  Burr Ridge, IL  Dubuque, IA  Madison, WI  New York  San Francisco  St. Louis
Bangkok  Bogota  Caracas  Lisbon  London  Madrid
Mexico City  Milan  New Delhi  Seoul  Singapore  Sydney  Taipei  Toronto

WCB/McGraw-Hill
A Division of The McGraw-Hill Companies

Scalable Parallel Computing
Technology, Architecture, Programming

Copyright © 1998, by The McGraw-Hill Companies, Inc. All rights reserved. Printed in the United States of America. Except as permitted under the United States Copyright Act of 1976, no part of this publication may be reproduced or distributed in any form or by any means, or stored in a data base or retrieval system, without the prior written permission of the publisher.

This book is printed on acid-free paper.

1234567890 DOC DOC 90098

ISBN 0-07-031798-4

Editorial Director: Kevin Kane
Publisher: Tom Casson
Sponsoring editor: Lynn Cox
Marketing manager: John Wannemacher
Project manager: Richard DeVitto
Production supervisor: Richard DeVitto
Cover designer: Francis Owens
Editorial Assistant: Nina Kreiden
Printer: Quebecor Printing Dubuque
`
Library of Congress Cataloging-in-Publication Data
Hwang, Kai.
    Scalable parallel computing : technology, architecture,
  programming / Kai Hwang, Zhiwei Xu.
      p. cm.
    Includes index.
    ISBN 0-07-031798-4
    1. Parallel processing (Electronic computers)  2. Computer
  architecture.  3. Computer networks.  I. Xu, Zhiwei.  II. Title.
  QA76.58.H85  1997
  004'.358--dc21                                          97-41663
                                                          CIP

http://www.mhhe.com
`
`
`
Trademark Notice

Tera is a trademark of Tera Computer Systems.
Cosmic Cube and Mosaic C are trademarks of California Institute of Technology.
UNIX 4.3 BSD, NOW, GLUnix, IRAM, xFS, and 4.4 BSD are trademarks of the University of California at Berkeley.
DEC, DECsafe, VAX, VAXCluster, VMS, Digital Unix, TruCluster, Digital NT Cluster, Alpha 21164, and Alpha AXP are trademarks of Digital Equipment Corporation.
Apollo Domain, HP, Series 9000, PA-RISC, and UX are trademarks of the Hewlett-Packard Company.
IBM PC, AIX, ESSL, 801, RP3, EUI, HACMP, IBM, LoadLeveller, NetView, POWER, PowerPC, RS/6000, S/370, S/390, SP2, SP, and Sysplex are trademarks of International Business Machines Corporation.
Lifekeeper, NCR, and Teradata are trademarks of National Cash Register.
M680x0 is a trademark of Motorola Corporation.
OSF/1, DCE, and DFS are trademarks of the Open Software Foundation, Inc.
Oracle Parallel Server is a trademark of Oracle Corporation.
Express is a trademark of Parasoft Corporation.
Load Sharing Facility is a trademark of Platform Computing Corporation.
Origin 2000, Power C, POWER CHALLENGEarray, and Cellular IRIX are trademarks of Silicon Graphics Computer Systems.
Sybase is a trademark of Sybase, Inc.
UNIX and SVR5 are trademarks of X/Open Company, Ltd.
Microsoft, Wolfpack, and Windows NT are trademarks of Microsoft Corporation.
OpenMP is a trademark of the OpenMP Standards Evaluation Board.
Myrinet is a trademark of Myricom, Inc.
ANSI Standards X3T11, X3T9, and X3H5 are trademarks of the American National Standards Institute.
ALLCACHE and KSR-1 are trademarks of Kendall Square Research.
X Window, Alewife, J-Machine, and *T are trademarks of Massachusetts Institute of Technology.
CM-2, CM-5, C*, *Lisp, CM Fortran, and CMOST are trademarks of Thinking Machines Corporation.
HP/Convex Exemplar is a trademark of Hewlett-Packard Company and Convex Computer Systems.
Cray 1, Cray 2, Y-MP, C90, CRAFT, T3D, T3E, and UNICOS are trademarks of Cray Research, Inc.
Dash and SPLASH are trademarks of Stanford University.
Ethernet is a trademark of Xerox Corporation.
Gigabit Ethernet is a trademark of the Gigabit Ethernet Alliance.
SCI, POSIX Threads, and Futurebus+ are trademarks of the Institute of Electrical and Electronics Engineers, Inc.
Illiac IV, Cedar, and Parafrase are trademarks of the University of Illinois.
Intel 80x86, i960, i860, SHV, Paragon, Intel Hypercube, Pentium, Pentium Pro, and Pentium-II are trademarks of Intel Corporation.
MIPS R2000, R4000, and R10000 are trademarks of SGI/MIPS Technology, Inc.
Mach/OS, C.mmp, and Cm* are trademarks of Carnegie-Mellon University.
NYU Ultracomputer is a trademark of New York University.
SX/4 is a trademark of NEC Information Systems.
Symmetry and NUMA-Q are trademarks of Sequent Computers.
Gigaplane, Gigaplane-XB, S-Bus, TSO, PSO, NFS, SPARC, Ultra Enterprise, SPARCcluster, SPARCstation, and Solaris MC are trademarks of Sun Microsystems, Inc.
`
About the Authors

Kai Hwang presently holds a Chair Professorship of Computer Engineering at the University of Hong Kong (HKU), while taking a research leave from the University of Southern California (USC). He has been engaged in higher education and computer research for 26 years, after earning the Ph.D. in Electrical Engineering and Computer Science from the University of California at Berkeley. His work on this book started at USC and was mostly completed at HKU.

An IEEE Fellow, he has published extensively in the areas of computer architecture, digital arithmetic, parallel processing, and distributed computing. He is the founding Editor-in-Chief of the Journal of Parallel and Distributed Computing. He has chaired the international conferences ICPP86, ARITH-7, IPPS96, ICAPP96, and HPCA-4 in 1998. He has received several achievement awards for outstanding research and academic contributions to the field of parallel computing.

He has lectured worldwide and performed consulting and advisory work for the US National Academy of Sciences, MIT Lincoln Laboratory, IBM Fishkill, TriTech in Singapore, Fujitsu and ETL in Japan, GMD in Germany, the CERN School of Computing, and Academia Sinica in China. Presently, he leads a research group at HKU in developing an ATM-based multicomputer cluster for high-performance computing and distributed multimedia, Intranet, and Internet applications.

Zhiwei Xu is a Professor and Chief Architect at the National Center for Intelligent Computing Systems (NCIC), Chinese Academy of Sciences, Beijing, China. He is also an Honorary Research Fellow of HKU. He received a Ph.D. in Computer Engineering from the University of Southern California. He participated in the STAP/MPP benchmark projects led by Dr. Hwang at USC and HKU during the past 5 years.

He has taught at Rutgers University and New York Polytechnic University. He has published in the areas of parallel languages, pipelined vector processing, and benchmark evaluation of massively parallel processors. Presently, he leads a design group at NCIC in building a series of cluster-based superservers in China. His current research interest lies mainly in network-based cluster computing and the software environments for parallel programming.
`
Dedication

To my family, for their love and support -- k.h.

To my high-school teacher,
Ms. Zhang Dahua, who taught me the
serenity of science at a tumultuous time -- z.x.
`
`
`
`
`
`
Table of Contents

About the Authors
Foreword
Preface
Guide to Instructors/Readers

Part I  Scalability and Clustering

Chapter 1  Scalable Computer Platforms and Models
  1.1  Evolution of Computer Architecture
         1.1.1  Computer Generations
         1.1.2  Scalable Computer Architectures
         1.1.3  Converging System Architectures
  1.2  Dimensions of Scalability
         1.2.1  Resource Scalability
         1.2.2  Application Scalability
         1.2.3  Technology Scalability
  1.3  Parallel Computer Models
         1.3.1  Semantic Attributes
         1.3.2  Performance Attributes
         1.3.3  Abstract Machine Models
         1.3.4  Physical Machine Models
  1.4  Basic Concepts of Clustering
         1.4.1  Cluster Characteristics
         1.4.2  Architectural Comparisons
         1.4.3  Benefits and Difficulties of Clusters
  1.5  Scalable Design Principles
         1.5.1  Principle of Independence
         1.5.2  Principle of Balanced Design
         1.5.3  Design for Scalability
  1.6  Bibliographic Notes and Problems

Chapter 2  Basics of Parallel Programming
  2.1  Parallel Programming Overview
         2.1.1  Why Is Parallel Programming Difficult?
         2.1.2  Parallel Programming Environments
         2.1.3  Parallel Programming Approaches
  2.2  Processes, Tasks, and Threads
         2.2.1  Definitions of an Abstract Process
         2.2.2  Execution Mode
         2.2.3  Address Space
         2.2.4  Process Context
         2.2.5  Process Descriptor
         2.2.6  Process Control
         2.2.7  Variations of Process
  2.3  Parallelism Issues
         2.3.1  Homogeneity in Processes
         2.3.2  Static versus Dynamic Parallelism
         2.3.3  Process Grouping
         2.3.4  Allocation Issues
  2.4  Interaction/Communication Issues
         2.4.1  Interaction Operations
         2.4.2  Interaction Modes
         2.4.3  Interaction Patterns
         2.4.4  Cooperative versus Competitive Interactions
  2.5  Semantic Issues in Parallel Programs
         2.5.1  Program Termination
         2.5.2  Determinacy of Programs
  2.6  Bibliographic Notes and Problems

Chapter 3  Performance Metrics and Benchmarks
  3.1  System and Application Benchmarks
         3.1.1  Micro Benchmarks
         3.1.2  Parallel Computing Benchmarks
         3.1.3  Business and TPC Benchmarks
         3.1.4  SPEC Benchmark Family
  3.2  Performance versus Cost
         3.2.1  Execution Time and Throughput
         3.2.2  Utilization and Cost-Effectiveness
  3.3  Basic Performance Metrics
         3.3.1  Workload and Speed Metrics
         3.3.2  Caveats in Sequential Performance
  3.4  Performance of Parallel Computers
         3.4.1  Computational Characteristics
         3.4.2  Parallelism and Interaction Overheads
         3.4.3  Overhead Quantification
  3.5  Performance of Parallel Programs
         3.5.1  Performance Metrics
         3.5.2  Available Parallelism in Benchmarks
  3.6  Scalability and Speedup Analysis
         3.6.1  Amdahl's Law: Fixed Problem Size
         3.6.2  Gustafson's Law: Fixed Time
         3.6.3  Sun and Ni's Law: Memory Bounding
         3.6.4  Isoperformance Models
  3.7  Bibliographic Notes and Problems

Part II  Enabling Technologies

Chapter 4  Microprocessors as Building Blocks
  4.1  System Development Trends
         4.1.1  Advances in Hardware
         4.1.2  Advances in Software
         4.1.3  Advances in Applications
  4.2  Principles of Processor Design
         4.2.1  Basics of Instruction Pipeline
         4.2.2  From CISC to RISC and Beyond
         4.2.3  Architectural Enhancement Approaches
  4.3  Microprocessor Architecture Families
         4.3.1  Major Architecture Families
         4.3.2  Superscalar versus Superpipelined Processors
         4.3.3  Embedded Microprocessors
  4.4  Case Studies of Microprocessors
         4.4.1  Digital's Alpha 21164 Microprocessor
         4.4.2  Intel Pentium Pro Processor
  4.5  Post-RISC, Multimedia, and VLIW
         4.5.1  Post-RISC Processor Features
         4.5.2  Multimedia Extensions
         4.5.3  The VLIW Architecture
  4.6  The Future of Microprocessors
         4.6.1  Hardware Trends and Physical Limits
         4.6.2  Future Workloads and Challenges
         4.6.3  Future Microprocessor Architectures
  4.7  Bibliographic Notes and Problems

Chapter 5  Distributed Memory and Latency Tolerance
  5.1  Hierarchical Memory Technology
         5.1.1  Characteristics of Storage Devices
         5.1.2  Memory Hierarchy Properties
         5.1.3  Memory Capacity Planning
  5.2  Cache Coherence Protocols
         5.2.1  Cache Coherency Problem
         5.2.2  Snoopy Coherency Protocols
         5.2.3  The MESI Snoopy Protocol
  5.3  Shared-Memory Consistency
         5.3.1  Memory Event Ordering
         5.3.2  Memory Consistency Models
         5.3.3  Relaxed Memory Models
  5.4  Distributed Cache/Memory Architecture
         5.4.1  NORMA, NUMA, COMA, and DSM Models
         5.4.2  Directory-Based Coherency Protocol
         5.4.3  The Stanford Dash Multiprocessor
         5.4.4  Directory-Based Protocol in Dash
  5.5  Latency Tolerance Techniques
         5.5.1  Latency Avoidance, Reduction, and Hiding
         5.5.2  Distributed Coherent Caches
         5.5.3  Data Prefetching Strategies
         5.5.4  Effects of Relaxed Memory Consistency
  5.6  Multithreaded Latency Hiding
         5.6.1  Multithreaded Processor Model
         5.6.2  Context-Switching Policies
         5.6.3  Combining Latency Hiding Mechanisms
  5.7  Bibliographic Notes and Problems

Chapter 6  System Interconnects and Gigabit Networks
  6.1  Basics of Interconnection Network
         6.1.1  Interconnection Environments
         6.1.2  Network Components
         6.1.3  Network Characteristics
         6.1.4  Network Performance Metrics
  6.2  Network Topologies and Properties
         6.2.1  Topological and Functional Properties
         6.2.2  Routing Schemes and Functions
         6.2.3  Networking Topologies
  6.3  Buses, Crossbar, and Multistage Switches
         6.3.1  Multiprocessor Buses
         6.3.2  Crossbar Switches
         6.3.3  Multistage Interconnection Networks
         6.3.4  Comparison of Switched Interconnects
  6.4  Gigabit Network Technologies
         6.4.1  Fiber Channel and FDDI Rings
         6.4.2  Fast Ethernet and Gigabit Ethernet
         6.4.3  Myrinet for SAN/LAN Construction
         6.4.4  HiPPI and SuperHiPPI
  6.5  ATM Switches and Networks
         6.5.1  ATM Technology
         6.5.2  ATM Network Interfaces
         6.5.3  Four Layers of ATM Architecture
         6.5.4  ATM Internetwork Connectivity
  6.6  Scalable Coherence Interface
         6.6.1  SCI Interconnects
         6.6.2  Implementation Issues
         6.6.3  SCI Coherence Protocol
  6.7  Comparison of Network Technologies
         6.7.1  Standard Networks and Perspectives
         6.7.2  Network Performance and Applications
  6.8  Bibliographic Notes and Problems

Chapter 7  Threading, Synchronization, and Communication
  7.1  Software Multithreading
         7.1.1  The Thread Concept
         7.1.2  Threads Management
         7.1.3  Thread Synchronization
  7.2  Synchronization Mechanisms
         7.2.1  Atomicity versus Mutual Exclusion
         7.2.2  High-Level Synchronization Constructs
         7.2.3  Low-Level Synchronization Primitives
         7.2.4  Fast Locking Mechanisms
  7.3  The TCP/IP Communication Protocol Suite
         7.3.1  Features of the TCP/IP Suite
         7.3.2  UDP, TCP, and IP
         7.3.3  The Sockets Interface
  7.4  Fast and Efficient Communication
         7.4.1  Key Problems in Communication
         7.4.2  The LogP Communication Model
         7.4.3  Low-Level Communications Support
         7.4.4  Communication Algorithms
  7.5  Bibliographic Notes and Problems

Part III  Systems Architecture

Chapter 8  Symmetric and CC-NUMA Multiprocessors
  8.1  SMP and CC-NUMA Technology
         8.1.1  Multiprocessor Architecture
         8.1.2  Commercial SMP Servers
         8.1.3  The Intel SHV Server Board
  8.2  Sun Ultra Enterprise 10000 System
         8.2.1  The Ultra E-10000 Architecture
         8.2.2  System Board Architecture
         8.2.3  Scalability and Availability Support
         8.2.4  Dynamic Domains and Performance
  8.3  HP/Convex Exemplar X-Class
         8.3.1  The Exemplar X System Architecture
         8.3.2  Exemplar Software Environment
  8.4  The Sequent NUMA-Q 2000
         8.4.1  The NUMA-Q 2000 Architecture
         8.4.2  Software Environment of NUMA-Q
         8.4.3  Performance of the NUMA-Q
  8.5  The SGI/Cray Origin 2000 Superserver
         8.5.1  Design Goals of Origin 2000 Series
         8.5.2  The Origin 2000 Architecture
         8.5.3  The Cellular IRIX Environment
         8.5.4  Performance of the Origin 2000
  8.6  Comparison of CC-NUMA Architectures
  8.7  Bibliographic Notes and Problems

Chapter 9  Support of Clustering and Availability
  9.1  Challenges in Clustering
         9.1.1  Classification of Clusters
         9.1.2  Cluster Architectures
         9.1.3  Cluster Design Issues
  9.2  Availability Support for Clustering
         9.2.1  The Availability Concept
         9.2.2  Availability Techniques
         9.2.3  Checkpointing and Failure Recovery
  9.3  Support for Single System Image
         9.3.1  Single System Image Layers
         9.3.2  Single Entry and Single File Hierarchy
         9.3.3  Single I/O, Networking, and Memory Space
  9.4  Single System Image in Solaris MC
         9.4.1  Global File System
         9.4.2  Global Process Management
         9.4.3  Single I/O System Image
  9.5  Job Management in Clusters
         9.5.1  Job Management System
         9.5.2  Survey of Job Management Systems
         9.5.3  Load-Sharing Facility (LSF)
  9.6  Bibliographic Notes and Problems

Chapter 10  Clusters of Servers and Workstations
  10.1  Cluster Products and Research Projects
         10.1.1  Supporting Trend of Cluster Products
         10.1.2  Cluster of SMP Servers
         10.1.3  Cluster Research Projects
  10.2  Microsoft Wolfpack for NT Clusters
         10.2.1  Microsoft Wolfpack Configurations
         10.2.2  Hot Standby Multiserver Clusters
         10.2.3  Active Availability Clusters
         10.2.4  Fault-Tolerant Multiserver Cluster
  10.3  The IBM SP System
         10.3.1  Design Goals and Strategies
         10.3.2  The SP2 System Architecture
         10.3.3  I/O and Internetworking
         10.3.4  The SP System Software
         10.3.5  The SP2 and Beyond
  10.4  The Digital TruCluster
         10.4.1  The TruCluster Architecture
         10.4.2  The Memory Channel Interconnect
         10.4.3  Programming the TruCluster
         10.4.4  The TruCluster System Software
  10.5  The Berkeley NOW Project
         10.5.1  Active Messages for Fast Communication
         10.5.2  GLUnix for Global Resource Management
         10.5.3  The xFS Serverless Network File System
  10.6  TreadMarks: A Software-Implemented DSM Cluster
         10.6.1  Boundary Conditions
         10.6.2  User Interface for DSM
         10.6.3  Implementation Issues
  10.7  Bibliographic Notes and Problems

Chapter 11  MPP Architecture and Performance
  11.1  An Overview of MPP Technology
         11.1.1  MPP Characteristics and Issues
         11.1.2  MPP Systems: An Overview
  11.2  The Cray T3E System
         11.2.1  The System Architecture of T3E
         11.2.2  The System Software in T3E
  11.3  New Generation of ASCI/MPPs
         11.3.1  ASCI Scalable Design Strategy
         11.3.2  Hardware and Software Requirements
         11.3.3  Contracted ASCI/MPP Platforms
  11.4  Intel/Sandia ASCI Option Red
         11.4.1  The Option Red Architecture
         11.4.2  Option Red System Software
  11.5  Parallel NAS Benchmark Results
         11.5.1  The NAS Parallel Benchmarks
         11.5.2  Superstep Structure and Granularity
         11.5.3  Memory, I/O, and Communications
  11.6  MPI and STAP Benchmark Results
         11.6.1  MPI Performance Measurements
         11.6.2  MPI Latency and Aggregate Bandwidth
         11.6.3  STAP Benchmark Evaluation of MPPs
         11.6.4  MPP Architectural Implications
  11.7  Bibliographic Notes and Problems

Part IV  Parallel Programming

Chapter 12  Parallel Paradigms and Programming Models
  12.1  Paradigms and Programmability
         12.1.1  Algorithmic Paradigms
         12.1.2  Programmability Issues
         12.1.3  Parallel Programming Examples
  12.2  Parallel Programming Models
         12.2.1  Implicit Parallelism
         12.2.2  Explicit Parallel Models
         12.2.3  Comparison of Four Models
         12.2.4  Other Parallel Programming Models
  12.3  Shared-Memory Programming
         12.3.1  The ANSI X3H5 Shared-Memory Model
         12.3.2  The POSIX Threads (Pthreads) Model
         12.3.3  The OpenMP Standard
         12.3.4  The SGI Power C Model
         12.3.5  C//: A Structured Parallel C Language
  12.4  Bibliographic Notes and Problems

Chapter 13  Message-Passing Programming
  13.1  The Message-Passing Paradigm
         13.1.1  Message-Passing Libraries
         13.1.2  Message-Passing Modes
  13.2  Message-Passing Interface (MPI)
         13.2.1  MPI Messages
         13.2.2  Message Envelope in MPI
         13.2.3  Point-to-Point Communications
         13.2.4  Collective MPI Communications
         13.2.5  The MPI-2 Extensions
  13.3  Parallel Virtual Machine (PVM)
         13.3.1  Virtual Machine Construction
         13.3.2  Process Management in PVM
         13.3.3  Communication with PVM
  13.4  Bibliographic Notes and Problems

Chapter 14  Data-Parallel Programming
  14.1  The Data-Parallel Model
  14.2  The Fortran 90 Approach
         14.2.1  Parallel Array Operations
         14.2.2  Intrinsic Functions in Fortran 90
  14.3  High-Performance Fortran
         14.3.1  Support for Data Parallelism
         14.3.2  Data Mapping in HPF
         14.3.3  Summary of Fortran 90 and HPF
  14.4  Other Data-Parallel Approaches
         14.4.1  Fortran 95 and Fortran 2001
         14.4.2  The pC++ and Nesl Approaches
  14.5  Bibliographic Notes and Problems

Bibliography
Web Resources List
Subject Index
Author Index
`
`
`
Chapter 1

Scalable Computer
Platforms and Models
`
This chapter presents basic models of parallel and cluster computers. Fundamental design issues and operational principles of scalable computer platforms are introduced. We review the computer technology over the last 50 years. Scalable and cluster computer systems are modeled with key architectural distinctions. Scalability will be introduced in three orthogonal dimensions: resource, application, and technology.

Abstract and physical machine models are specified in Section 1.3. In Section 1.4, we introduce basic concepts of multicomputer clustering. The differences among symmetric multiprocessors, clusters of computers, and distributed computer systems are clarified. Three basic principles are studied in Section 1.5 to guide the design and application of scalable parallel computers.
`
Bits, Bytes, and Words  The following units are widely used in the computer field, but they are sometimes used incorrectly, with confusing notations and ambiguous meanings. To cope with this problem, we present below a set of notations that will be used throughout the book. In particular, readers should not confuse the shorthand notations for the basic units of time, byte, and bit.

The basic unit of time is the second, abbreviated as s. The two basic information units are the byte and the bit. One byte (1 B) is 8 bits (8 b). Byte is always abbreviated as B and bit as b. Other information units are the word (16 b or 2 B), the doubleword (32 b or 4 B), and the quadword (64 b or 8 B). This is based on the convention used by Intel, Motorola, and Digital Equipment. Mainframe vendors consider a word to have 32 b. Some supercomputer designers consider a word to be 64 b. A frequently used workload unit is the number of floating-point operations, abbreviated as flop. A unit of computing speed is the number of floating-point operations per second (flop/s). A unit of information transfer rate is the number of bytes per second (B/s). The execution rate of a processor is often measured in million instructions per second (MIPS), which is equivalent to the notation Mi/s used in Europe.
`
`
`
`
Notations and Conventions  We will use the conventions and notations specified in Table 1.1. Note that decimal and binary interpretations differ to some extent in the high-order units. Readers should be aware of their differences. For instance, a kilo could mean either 10³ = 1000 or 2¹⁰ = 1024. In computer science, K is usually used to mean 1024; in other scientific disciplines, k denotes 1000. This ambiguity must be clarified, even though the relative difference is small. In general, the binary interpretation is applied to information units such as bits, bytes, or words. The decimal interpretation is often used in reference to other units such as execution time, power, weight, pin count, and computational workload.

The convention for plurals is as follows: the unit is pluralized when the whole word is used, and not pluralized when an abbreviation is used. For instance, we say 2 KB, 2 kilobytes, or even 2 Kbytes. They all mean the same thing: 2048 bytes. But we do not use 2 KBs or 2 Kbyte. Note that 2 Kb means a completely different thing: 2048 bits, or 256 bytes. Similarly, 2 Mflop/s is a proper notation, meaning a speed of executing 2 million floating-point operations per second. We should avoid using 2 Mflops/s, 2 MFLOPS, or 2 Mflops, even though some people use them. Throughout this book, the logarithmic function is always base 2. That is, log n means log₂ n, unless otherwise specified.
`
Table 1.1  Notations and Conventions for Numbers

  Prefix   Abbreviation   Meaning              Value
  milli    m              One thousandth       10⁻³
  micro    µ              One millionth        10⁻⁶
  nano     n              One billionth        10⁻⁹
  pico     p              One trillionth       10⁻¹²
  femto    f              One quadrillionth    10⁻¹⁵
  atto     a              One quintillionth    10⁻¹⁸
  kilo     K              One thousand         10³ or 2¹⁰
  mega     M              One million          10⁶ or 2²⁰
  giga     G              One billion          10⁹ or 2³⁰
  tera     T              One trillion         10¹² or 2⁴⁰
  peta     P              One quadrillion      10¹⁵ or 2⁵⁰
  exa      E              One quintillion      10¹⁸ or 2⁶⁰
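To make these conventions concrete, the short C program below is a minimal illustrative sketch added here as an example (the 2 KB, 2 Kb, and 2 Mflop/s figures are the ones used in the discussion above). It applies the binary interpretation of K to information units and the decimal interpretation of M to a computing speed.

    /* Minimal sketch illustrating the unit conventions above. */
    #include <stdio.h>

    int main(void)
    {
        /* Binary interpretation of K for information units: K = 1024. */
        long two_KB_bytes = 2L * 1024;         /* 2 KB = 2048 bytes            */
        long two_Kb_bytes = (2L * 1024) / 8;   /* 2 Kb = 2048 bits = 256 bytes */

        /* Decimal interpretation of M for rates: M = 10^6.  Executing
           2 million floating-point operations in 1 s is a speed of 2 Mflop/s. */
        double flop   = 2.0e6;                 /* workload in flop       */
        double time_s = 1.0;                   /* execution time in s    */
        double mflops = (flop / time_s) / 1.0e6;

        printf("2 KB = %ld B, 2 Kb = %ld B\n", two_KB_bytes, two_Kb_bytes);
        printf("speed = %.1f Mflop/s\n", mflops);
        return 0;
    }

Compiled with any C compiler, the sketch prints 2 KB = 2048 B, 2 Kb = 256 B, and a speed of 2.0 Mflop/s.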
`
1.1 Evolution of Computer Architecture

Computers have gone through 50 years of development. We summarize below five generations of development, approximately 10 to 15 years per generation, since the 1940s. A formal definition of a scalable computer is given below. Then we describe the converging architecture for future parallel computers.
`
1.1.1 Computer Generations

Over the past 50 years, digital electronic computers have gone through five generations of development. Table 1.2 provides a summary of the technology, architecture, software, application, and representative systems in each generation of computers. Each generation has improved over the previous generations in the hardware and software technologies used and in application levels.
`
Table 1.2  Computer Generations in 50 Years

First Generation (1946-1956)
  Technology and architecture: Vacuum tubes and relay memory, single-bit CPU with accumulator-based instruction set.
  Software and operating system: Machine/assembly languages, programs without subroutines.
  Representative systems: ENIAC, IBM 701, Princeton IAS.

Second Generation (1956-1967)
  Technology and architecture: Discrete transistors, core memory, floating-point accelerator, I/O channels.
  Software and operating system: Algol and Fortran with compilers, batch processing OS.
  Representative systems: IBM 7030, CDC 1604, Univac LARC.

Third Generation (1967-1978)
  Technology and architecture: Integrated circuits, pipelined CPU, microprogrammed control unit.
  Software and operating system: C language, multiprogramming, timesharing OS.
  Representative systems: PDP-11, IBM 360/370, CDC 6600.

Fourth Generation (1978-1989)
  Technology and architecture: VLSI microprocessors, solid-state memory, multiprocessors, vector supercomputers.
  Software and operating system: Symmetric multiprocessing, parallelizing compilers, message-passing libraries.
  Representative systems: IBM PC, VAX 9000, Cray X/MP.

Fifth Generation (1990-present)
  Technology and architecture: ULSI circuits, scalable parallel computers, workstation clusters, Intranet, Internet.
  Software and operating system: Java, microkernels, multithreading, distributed OS, World-Wide Web.
  Representative systems: IBM SP2, SGI Origin 2000, Digital TruCluster.
`
In hardware technology, the first generation used vacuum tubes and relay memory. The second generation was marked by the use of discrete transistors and core memory. The third generation began to use small-scale integrated (SSI) circuits. The fourth generation began with the use of very large-scale integrated (VLSI) microprocessors. The fifth generation uses ultra large-scale integrated (ULSI) circuits. For example, in 1997, 10M-transistor microprocessors and 256M-transistor DRAMs (dynamic random-access memories) are in use.
In the software area, the first generation used machine/assembly languages. The second generation began to use high-level languages (HLLs), such as Algol and Fortran. The third generation was marked by the use of the C language, multiprogramming, and time-sharing operating systems. The fourth generation emphasized parallel and vector processing. The fifth generation emphasizes scalable and clustered computing as covered in this book. As we cross into the 21st century, future generations of computers will evolve with even higher capability and user friendliness.
`
1.1.2 Scalable Computer Architectures

A common feature of modern computers is parallelism. Even within a single processor, parallelism has been exploited in many ways; this will be discussed in detail in Chapter 4. More recently, the concept of scalability, which includes parallelism, has been reemphasized. The scalability concept is central to this book and will be elaborated on later. Given below is a short definition of scalable systems.
`
Definition 1.1  A computer system, including all its hardware and software resources, is called scalable if it can scale up (i.e., improve its resources) to accommodate ever-increasing performance and functionality demand and/or scale down (i.e., decrease its resources) to reduce cost.
`
Although we will focus on the scaling-up aspect most of the time, scalability is not equivalent to being big. Scalability involves the ability to scale down as well. Scaling issues arise even on a uniprocessor system. More specifically, saying that a system is scalable implies the following:
• Functionality and Performance: The scaled-up system should provide more functionality or better performance. The total computing power of the system should increase proportionally to the increase in resources. Ideally, the users would like to see computing power increase close to n times when the system resource is improved n times.

• Scaling in Cost: The cost paid for scaling up must be reasonable. A rule of thumb is that scaling up n times should incur a cost of no more than n or n log n times. A small numeric sketch of these first two requirements follows this list.

• Compatibility: The same components, including hardware, system software, and application software, should still be usable with little change. It is unreasonable to expect the users to pay for a completely new operating system and to redevelop their application codes. Often scaling involves only a part of a system, e.g., adding more processors or upgrading a processor to the next generation. The improvement should be compatible with the rest of the system. That is, the old memory, disk, interconnect, and peripherals should still be usable.
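To make the first two requirements concrete, the following C program is a minimal illustrative sketch added as an example (the normalized base power and base cost of 1.0 are assumed values, not figures from the text). For a few scaling factors n it tabulates the ideal n-fold increase in computing power and the n log n cost bound from the rule of thumb above.

    /* Sketch of ideal power scaling and the n log n cost rule of thumb
       for scaling a system up n times.  Compile with: cc scaling.c -lm   */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double base_power = 1.0;  /* normalized power of original system (assumed) */
        double base_cost  = 1.0;  /* normalized cost of original system (assumed)  */

        for (int n = 2; n <= 64; n *= 2) {
            double ideal_power = n * base_power;                  /* close-to-linear goal  */
            double cost_bound  = n * log2((double)n) * base_cost; /* n log n rule of thumb */
            printf("n = %2d: ideal power = %4.0fx, cost should stay below about %5.1fx\n",
                   n, ideal_power, cost_bound);
        }
        return 0;
    }

For n = 2 the bound equals 2, so for small scalings the n and n log n rules coincide; only for larger n does the n log n rule allow noticeably more cost (for example, about 384 times the base cost at n = 64).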
`
The Computer Pyramid  The main motivation for a scalable system is to provide a flexible, cost-effective information processing tool. As illustrated in Fig. 1.1, the computer classes form a pyramid with respect to sales volume, performance, and cost. At the bottom of the pyramid is the class of personal computers (PCs), which have a quite affordable price and a huge annual sales volume. Their performance is also the lowest.
`
`
`
[Figure 1.1  The pyramid of computer classes. Performance and cost increase toward the top of the pyramid and annual sales volume toward the bottom: millions of PCs at $1500 per system form the base, stand-alone workstations and SMP servers and clusters occupy the middle tiers, and tens of supercomputers at $15M per system sit at the top.]
`
At the top is the class of supercomputers. Their sales volume is very low because of their high price. However, a supercomputer integrates many resources into a single system, employs cutting-edge technology, and has the highest performance.
The scalability concept advocates a common scalable architecture across all computer classes. This has several advantages:

• It satisfies different performance and cost requirements. For instance, a user could first purchase a low-end system. When his performance requirement increases, he could scale up the system, knowing that the original software and hardware components can still be used in the new system, and his investment is protected.

• High-end machines may use low-end components to cut cost. For example, with high volume, PCs have low-cost, off-the-shelf components. With a scalable architecture, supercomputers can use such components to reduce system cost. In fact, using commodity components (processors, memory chips, disks, I/O controllers, etc.) has become a trend in high-performance systems development.

• The cutting-edge technology developed for high-end systems may eventually migrate down the pyramid to improve the performance of low-end systems, if cost-effectiveness can be increased with improved production technology.
`
1.1.3 Converging System Architectures

Scalable parallel computers are converging toward three architectures, which share much commonality, with increasing levels of resource sharing, as shown in Fig. 1.2.
`
[Figure 1.2  Common architecture for scalable parallel computers: panel (b) shows the shared-disk architecture and panel (c) the shared-memory architecture. Legend: P = processor, C = cache, M = memory, D = disk, NIC = network interface circuitry.]
`
`
`
`
`
`
`The sha