Scalable Parallel Computing

Kai Hwang
University of Hong Kong
University of Southern California

Zhiwei Xu
Chinese Academy of Sciences
University of Hong Kong

WCB/McGraw-Hill

Boston  Burr Ridge, IL  Dubuque, IA  Madison, WI  New York  San Francisco  St. Louis
Bangkok  Bogota  Caracas  Lisbon  London  Madrid
Mexico City  Milan  New Delhi  Seoul  Singapore  Sydney  Taipei  Toronto

WCB/McGraw-Hill
A Division of The McGraw-Hill Companies

Scalable Parallel Computing
Technology, Architecture, Programming

Copyright © 1998, by The McGraw-Hill Companies, Inc. All rights reserved. Printed in the United States of America. Except as permitted under the United States Copyright Act of 1976, no part of this publication may be reproduced or distributed in any form or by any means, or stored in a data base or retrieval system, without the prior written permission of the publisher.

This book is printed on acid-free paper.

1234567890 DOC DOC 90098

ISBN 0-07-031798-4

Editorial Director: Kevin Kane
Publisher: Tom Casson
Sponsoring editor: Lynn Cox
Marketing manager: John Wannemacher
Project manager: Richard DeVitto
Production supervisor: Richard DeVitto
Cover designer: Francis Owens
Editorial Assistant: Nina Kreiden
Printer: Quebecor Printing Dubuque

Library of Congress Cataloging-in-Publication Data
Hwang, Kai.
    Scalable parallel computing : technology, architecture, programming / Kai Hwang, Zhiwei Xu.
        p. cm.
    Includes index.
    ISBN 0-07-031798-4
    1. Parallel processing (Electronic computers)  2. Computer architecture.  3. Computer networks.  I. Xu, Zhiwei.  II. Title.
    QA76.58.H85  1997
    004'.358-dc21        97-41663
                         CIP

http://www.mhhe.com

Trademark Notice

Myrinet is a trademark of Myricom, Inc.
ANSI Standards X3T11, X3T9, and X3H5 are trademarks of the American National Standards Institute.
ALLCACHE and KSR-1 are trademarks of Kendall Square Research.
X Window, Alewife, J-Machine, and *T are trademarks of Massachusetts Institute of Technology.
CM-2, CM-5, C*, *Lisp, CM Fortran, and CMOST are trademarks of Thinking Machines Corporation.
HP/Convex Exemplar is a trademark of Hewlett-Packard Company and Convex Computer Systems.
Cray 1, Cray 2, Y-MP, C90, CRAFT, T3D, T3E, and UNICOS are trademarks of Cray Research, Inc.
Dash and SPLASH are trademarks of Stanford University.
Ethernet is a trademark of Xerox Corporation.
Gigabit Ethernet is a trademark of the Gigabit Ethernet Alliance.
SCI, POSIX Threads, and Futurebus+ are trademarks of the Institute of Electrical and Electronic Engineering, Inc.
Illiac IV, Cedar, and Parafrase are trademarks of the University of Illinois.
Intel 80x86, i960, i860, SHV, Paragon, Intel Hypercube, Pentium, Pentium Pro, and Pentium-II are trademarks of Intel Corporation.
MIPS R2000, R4000, and R10000 are trademarks of SGI/MIPS Technology, Inc.
Mach/OS, C.mmp, and Cm* are trademarks of Carnegie-Mellon University.
NYU Ultracomputer is a trademark of New York University.
SX/4 is a trademark of NEC Information Systems.
Symmetry and NUMA-Q are trademarks of Sequent Computers.
Gigaplane, Gigaplane-XB, S-Bus, TSO, PSO, NFS, SPARC, Ultra Enterprise, SPARCcluster, SPARCstation, and Solaris MC are trademarks of Sun Microsystems, Inc.
Tera is a trademark of Tera Computer Systems.
Cosmic Cube and Mosaic C are trademarks of the California Institute of Technology.
UNIX 4.3 BSD, NOW, GLUnix, IRAM, xFS, and 4.4BSD are trademarks of the University of California at Berkeley.
DEC, DECsafe, VAX, VAXCluster, VMS, Digital Unix, TruCluster, Digital NT Cluster, Alpha 21164, and Alpha AXP are trademarks of Digital Equipment Corporation.
Apollo Domain, HP, Series 9000, PA-RISC, and UX are trademarks of the Hewlett-Packard Company.
IBM PC, AIX, ESSL, 801, RP3, EUI, HACMP, IBM, LoadLeveller, NetView, POWER, PowerPC, RS/6000, S/370, S/390, SP2, SP, and Sysplex are trademarks of International Business Machines Corporation.
Lifekeeper, NCR, and Teradata are trademarks of National Cash Register.
M680x0 is a trademark of Motorola Corporation.
OSF/1, DCE, and DFS are trademarks of the Open Software Foundation, Inc.
Oracle Parallel Server is a trademark of Oracle Corporation.
Express is a trademark of Parasoft Corporation.
Load Sharing Facility is a trademark of Platform Computing Corporation.
Origin 2000, Power C, POWER CHALLENGEarray, and Cellular IRIX are trademarks of Silicon Graphics Computer Systems.
Sybase is a trademark of Sybase, Inc.
UNIX and SVR5 are trademarks of X/Open Company, Ltd.
Microsoft, Wolfpack, and Windows NT are trademarks of Microsoft Corporation.
OpenMP is a trademark of the OpenMP Standards Evaluation Board.

About the Authors

Kai Hwang presently holds a Chair Professorship of Computer Engineering at the University of Hong Kong (HKU), while taking a research leave from the University of Southern California (USC). He has been engaged in higher education and computer research for 26 years, after earning the Ph.D. in Electrical Engineering and Computer Science from the University of California at Berkeley. His work on this book started at USC and was mostly completed at HKU.

An IEEE Fellow, he has published extensively in the areas of computer architecture, digital arithmetic, parallel processing, and distributed computing. He is the founding Editor-in-Chief of the Journal of Parallel and Distributed Computing. He has chaired the international conferences ICPP86, ARITH-7, IPPS96, ICAPP 96, and HPCA-4 in 1998. He has received several achievement awards for outstanding research and academic contributions to the field of parallel computing.

He has lectured worldwide and performed consulting and advisory work for the US National Academy of Sciences, MIT Lincoln Laboratory, IBM Fishkill, TriTech in Singapore, Fujitsu and ETL in Japan, GMD in Germany, the CERN School of Computing, and Academia Sinica in China. Presently, he leads a research group at HKU in developing an ATM-based multicomputer cluster for high-performance computing and distributed multimedia, Intranet, and Internet applications.

Zhiwei Xu is a Professor and Chief Architect at the National Center for Intelligent Computing Systems (NCIC), Chinese Academy of Sciences, Beijing, China. He is also an Honorary Research Fellow of HKU. He received a Ph.D. in Computer Engineering from the University of Southern California. He participated in the STAP/MPP benchmark projects led by Dr. Hwang at USC and HKU during the past 5 years.

He has taught at Rutgers University and New York Polytechnic University. He has published in the areas of parallel languages, pipelined vector processing, and benchmark evaluation of massively parallel processors. Presently, he leads a design group at NCIC in building a series of cluster-based superservers in China. His current research interest lies mainly in network-based cluster computing and the software environments for parallel programming.

Dedication

To my family, for their love and support -- k.h.

To my high-school teacher, Ms. Zhang Dahua, who taught me the serenity of science at a tumultuous time -- z.x.

Table of Contents

About the Authors
Foreword
Preface
Guide to Instructors/Readers

Part I  Scalability and Clustering

Chapter 1  Scalable Computer Platforms and Models
  1.1  Evolution of Computer Architecture
    1.1.1  Computer Generations
    1.1.2  Scalable Computer Architectures
    1.1.3  Converging System Architectures
  1.2  Dimensions of Scalability
    1.2.1  Resource Scalability
    1.2.2  Application Scalability
    1.2.3  Technology Scalability
  1.3  Parallel Computer Models
    1.3.1  Semantic Attributes
    1.3.2  Performance Attributes
    1.3.3  Abstract Machine Models
    1.3.4  Physical Machine Models
  1.4  Basic Concepts of Clustering
    1.4.1  Cluster Characteristics
    1.4.2  Architectural Comparisons
    1.4.3  Benefits and Difficulties of Clusters
  1.5  Scalable Design Principles
    1.5.1  Principle of Independence
    1.5.2  Principle of Balanced Design
    1.5.3  Design for Scalability
  1.6  Bibliographic Notes and Problems

Chapter 2  Basics of Parallel Programming
  2.1  Parallel Programming Overview
    2.1.1  Why Is Parallel Programming Difficult?
    2.1.2  Parallel Programming Environments
    2.1.3  Parallel Programming Approaches
  2.2  Processes, Tasks, and Threads
    2.2.1  Definitions of an Abstract Process
    2.2.2  Execution Mode
    2.2.3  Address Space
    2.2.4  Process Context
    2.2.5  Process Descriptor
    2.2.6  Process Control
    2.2.7  Variations of Process
  2.3  Parallelism Issues
    2.3.1  Homogeneity in Processes
    2.3.2  Static versus Dynamic Parallelism
    2.3.3  Process Grouping
    2.3.4  Allocation Issues
  2.4  Interaction/Communication Issues
    2.4.1  Interaction Operations
    2.4.2  Interaction Modes
    2.4.3  Interaction Patterns
    2.4.4  Cooperative versus Competitive Interactions
  2.5  Semantic Issues in Parallel Programs
    2.5.1  Program Termination
    2.5.2  Determinacy of Programs
  2.6  Bibliographic Notes and Problems

Chapter 3  Performance Metrics and Benchmarks
  3.1  System and Application Benchmarks
    3.1.1  Micro Benchmarks
    3.1.2  Parallel Computing Benchmarks
    3.1.3  Business and TPC Benchmarks
    3.1.4  SPEC Benchmark Family
  3.2  Performance versus Cost
    3.2.1  Execution Time and Throughput
    3.2.2  Utilization and Cost-Effectiveness
  3.3  Basic Performance Metrics
    3.3.1  Workload and Speed Metrics
    3.3.2  Caveats in Sequential Performance
  3.4  Performance of Parallel Computers
    3.4.1  Computational Characteristics
    3.4.2  Parallelism and Interaction Overheads
    3.4.3  Overhead Quantification
  3.5  Performance of Parallel Programs
    3.5.1  Performance Metrics
    3.5.2  Available Parallelism in Benchmarks
  3.6  Scalability and Speedup Analysis
    3.6.1  Amdahl's Law: Fixed Problem Size
    3.6.2  Gustafson's Law: Fixed Time
    3.6.3  Sun and Ni's Law: Memory Bounding
    3.6.4  Isoperformance Models
  3.7  Bibliographic Notes and Problems

Part II  Enabling Technologies

Chapter 4  Microprocessors as Building Blocks
  4.1  System Development Trends
    4.1.1  Advances in Hardware
    4.1.2  Advances in Software
    4.1.3  Advances in Applications
  4.2  Principles of Processor Design
    4.2.1  Basics of Instruction Pipeline
    4.2.2  From CISC to RISC and Beyond
    4.2.3  Architectural Enhancement Approaches
  4.3  Microprocessor Architecture Families
    4.3.1  Major Architecture Families
    4.3.2  Superscalar versus Superpipelined Processors
    4.3.3  Embedded Microprocessors
  4.4  Case Studies of Microprocessors
    4.4.1  Digital's Alpha 21164 Microprocessor
    4.4.2  Intel Pentium Pro Processor
  4.5  Post-RISC, Multimedia, and VLIW
    4.5.1  Post-RISC Processor Features
    4.5.2  Multimedia Extensions
    4.5.3  The VLIW Architecture
  4.6  The Future of Microprocessors
    4.6.1  Hardware Trends and Physical Limits
    4.6.2  Future Workloads and Challenges
    4.6.3  Future Microprocessor Architectures
  4.7  Bibliographic Notes and Problems

Chapter 5  Distributed Memory and Latency Tolerance
  5.1  Hierarchical Memory Technology
    5.1.1  Characteristics of Storage Devices
    5.1.2  Memory Hierarchy Properties
    5.1.3  Memory Capacity Planning
  5.2  Cache Coherence Protocols
    5.2.1  Cache Coherency Problem
    5.2.2  Snoopy Coherency Protocols
    5.2.3  The MESI Snoopy Protocol
  5.3  Shared-Memory Consistency
    5.3.1  Memory Event Ordering
    5.3.2  Memory Consistency Models
    5.3.3  Relaxed Memory Models
  5.4  Distributed Cache/Memory Architecture
    5.4.1  NORMA, NUMA, COMA, and DSM Models
    5.4.2  Directory-Based Coherency Protocol
    5.4.3  The Stanford Dash Multiprocessor
    5.4.4  Directory-Based Protocol in Dash
  5.5  Latency Tolerance Techniques
    5.5.1  Latency Avoidance, Reduction, and Hiding
    5.5.2  Distributed Coherent Caches
    5.5.3  Data Prefetching Strategies
    5.5.4  Effects of Relaxed Memory Consistency
  5.6  Multithreaded Latency Hiding
    5.6.1  Multithreaded Processor Model
    5.6.2  Context-Switching Policies
    5.6.3  Combining Latency Hiding Mechanisms
  5.7  Bibliographic Notes and Problems

Chapter 6  System Interconnects and Gigabit Networks
  6.1  Basics of Interconnection Network
    6.1.1  Interconnection Environments
    6.1.2  Network Components
    6.1.3  Network Characteristics
    6.1.4  Network Performance Metrics
  6.2  Network Topologies and Properties
    6.2.1  Topological and Functional Properties
    6.2.2  Routing Schemes and Functions
    6.2.3  Networking Topologies
  6.3  Buses, Crossbar, and Multistage Switches
    6.3.1  Multiprocessor Buses
    6.3.2  Crossbar Switches
    6.3.3  Multistage Interconnection Networks
    6.3.4  Comparison of Switched Interconnects
  6.4  Gigabit Network Technologies
    6.4.1  Fiber Channel and FDDI Rings
    6.4.2  Fast Ethernet and Gigabit Ethernet
    6.4.3  Myrinet for SAN/LAN Construction
    6.4.4  HiPPI and SuperHiPPI
  6.5  ATM Switches and Networks
    6.5.1  ATM Technology
    6.5.2  ATM Network Interfaces
    6.5.3  Four Layers of ATM Architecture
    6.5.4  ATM Internetwork Connectivity
  6.6  Scalable Coherence Interface
    6.6.1  SCI Interconnects
    6.6.2  Implementation Issues
    6.6.3  SCI Coherence Protocol
  6.7  Comparison of Network Technologies
    6.7.1  Standard Networks and Perspectives
    6.7.2  Network Performance and Applications
  6.8  Bibliographic Notes and Problems

Chapter 7  Threading, Synchronization, and Communication
  7.1  Software Multithreading
    7.1.1  The Thread Concept
    7.1.2  Threads Management
    7.1.3  Thread Synchronization
  7.2  Synchronization Mechanisms
    7.2.1  Atomicity versus Mutual Exclusion
    7.2.2  High-Level Synchronization Constructs
    7.2.3  Low-Level Synchronization Primitives
    7.2.4  Fast Locking Mechanisms
  7.3  The TCP/IP Communication Protocol Suite
    7.3.1  Features of the TCP/IP Suite
    7.3.2  UDP, TCP, and IP
    7.3.3  The Sockets Interface
  7.4  Fast and Efficient Communication
    7.4.1  Key Problems in Communication
    7.4.2  The LogP Communication Model
    7.4.3  Low-Level Communications Support
    7.4.4  Communication Algorithms
  7.5  Bibliographic Notes and Problems

Part III  Systems Architecture

Chapter 8  Symmetric and CC-NUMA Multiprocessors
  8.1  SMP and CC-NUMA Technology
    8.1.1  Multiprocessor Architecture
    8.1.2  Commercial SMP Servers
    8.1.3  The Intel SHV Server Board
  8.2  Sun Ultra Enterprise 10000 System
    8.2.1  The Ultra E-10000 Architecture
    8.2.2  System Board Architecture
    8.2.3  Scalability and Availability Support
    8.2.4  Dynamic Domains and Performance
  8.3  HP/Convex Exemplar X-Class
    8.3.1  The Exemplar X System Architecture
    8.3.2  Exemplar Software Environment
  8.4  The Sequent NUMA-Q 2000
    8.4.1  The NUMA-Q 2000 Architecture
    8.4.2  Software Environment of NUMA-Q
    8.4.3  Performance of the NUMA-Q
  8.5  The SGI/Cray Origin 2000 Superserver
    8.5.1  Design Goals of Origin 2000 Series
    8.5.2  The Origin 2000 Architecture
    8.5.3  The Cellular IRIX Environment
    8.5.4  Performance of the Origin 2000
  8.6  Comparison of CC-NUMA Architectures
  8.7  Bibliographic Notes and Problems

Chapter 9  Support of Clustering and Availability
  9.1  Challenges in Clustering
    9.1.1  Classification of Clusters
    9.1.2  Cluster Architectures
    9.1.3  Cluster Design Issues
  9.2  Availability Support for Clustering
    9.2.1  The Availability Concept
    9.2.2  Availability Techniques
    9.2.3  Checkpointing and Failure Recovery
  9.3  Support for Single System Image
    9.3.1  Single System Image Layers
    9.3.2  Single Entry and Single File Hierarchy
    9.3.3  Single I/O, Networking, and Memory Space
  9.4  Single System Image in Solaris MC
    9.4.1  Global File System
    9.4.2  Global Process Management
    9.4.3  Single I/O System Image
  9.5  Job Management in Clusters
    9.5.1  Job Management System
    9.5.2  Survey of Job Management Systems
    9.5.3  Load-Sharing Facility (LSF)
  9.6  Bibliographic Notes and Problems

Chapter 10  Clusters of Servers and Workstations
  10.1  Cluster Products and Research Projects
    10.1.1  Supporting Trend of Cluster Products
    10.1.2  Cluster of SMP Servers
    10.1.3  Cluster Research Projects
  10.2  Microsoft Wolfpack for NT Clusters
    10.2.1  Microsoft Wolfpack Configurations
    10.2.2  Hot Standby Multiserver Clusters
    10.2.3  Active Availability Clusters
    10.2.4  Fault-Tolerant Multiserver Cluster
  10.3  The IBM SP System
    10.3.1  Design Goals and Strategies
    10.3.2  The SP2 System Architecture
    10.3.3  I/O and Internetworking
    10.3.4  The SP System Software
    10.3.5  The SP2 and Beyond
  10.4  The Digital TruCluster
    10.4.1  The TruCluster Architecture
    10.4.2  The Memory Channel Interconnect
    10.4.3  Programming the TruCluster
    10.4.4  The TruCluster System Software
  10.5  The Berkeley NOW Project
    10.5.1  Active Messages for Fast Communication
    10.5.2  GLUnix for Global Resource Management
    10.5.3  The xFS Serverless Network File System
  10.6  TreadMarks: A Software-Implemented DSM Cluster
    10.6.1  Boundary Conditions
    10.6.2  User Interface for DSM
    10.6.3  Implementation Issues
  10.7  Bibliographic Notes and Problems

Chapter 11  MPP Architecture and Performance
  11.1  An Overview of MPP Technology
    11.1.1  MPP Characteristics and Issues
    11.1.2  MPP Systems - An Overview
  11.2  The Cray T3E System
    11.2.1  The System Architecture of T3E
    11.2.2  The System Software in T3E
  11.3  New Generation of ASCI/MPPs
    11.3.1  ASCI Scalable Design Strategy
    11.3.2  Hardware and Software Requirements
    11.3.3  Contracted ASCI/MPP Platforms
  11.4  Intel/Sandia ASCI Option Red
    11.4.1  The Option Red Architecture
    11.4.2  Option Red System Software
  11.5  Parallel NAS Benchmark Results
    11.5.1  The NAS Parallel Benchmarks
    11.5.2  Superstep Structure and Granularity
    11.5.3  Memory, I/O, and Communications
  11.6  MPI and STAP Benchmark Results
    11.6.1  MPI Performance Measurements
    11.6.2  MPI Latency and Aggregate Bandwidth
    11.6.3  STAP Benchmark Evaluation of MPPs
    11.6.4  MPP Architectural Implications
  11.7  Bibliographic Notes and Problems

Part IV  Parallel Programming

Chapter 12  Parallel Paradigms and Programming Models
  12.1  Paradigms and Programmability
    12.1.1  Algorithmic Paradigms
    12.1.2  Programmability Issues
    12.1.3  Parallel Programming Examples
  12.2  Parallel Programming Models
    12.2.1  Implicit Parallelism
    12.2.2  Explicit Parallel Models
    12.2.3  Comparison of Four Models
    12.2.4  Other Parallel Programming Models
  12.3  Shared-Memory Programming
    12.3.1  The ANSI X3H5 Shared-Memory Model
    12.3.2  The POSIX Threads (Pthreads) Model
    12.3.3  The OpenMP Standard
    12.3.4  The SGI Power C Model
    12.3.5  C//: A Structured Parallel C Language
  12.4  Bibliographic Notes and Problems

Chapter 13  Message-Passing Programming
  13.1  The Message-Passing Paradigm
    13.1.1  Message-Passing Libraries
    13.1.2  Message-Passing Modes
  13.2  Message-Passing Interface (MPI)
    13.2.1  MPI Messages
    13.2.2  Message Envelope in MPI
    13.2.3  Point-to-Point Communications
    13.2.4  Collective MPI Communications
    13.2.5  The MPI-2 Extensions
  13.3  Parallel Virtual Machine (PVM)
    13.3.1  Virtual Machine Construction
    13.3.2  Process Management in PVM
    13.3.3  Communication with PVM
  13.4  Bibliographic Notes and Problems

Chapter 14  Data-Parallel Programming
  14.1  The Data-Parallel Model
  14.2  The Fortran 90 Approach
    14.2.1  Parallel Array Operations
    14.2.2  Intrinsic Functions in Fortran 90
  14.3  High-Performance Fortran
    14.3.1  Support for Data Parallelism
    14.3.2  Data Mapping in HPF
    14.3.3  Summary of Fortran 90 and HPF
  14.4  Other Data-Parallel Approaches
    14.4.1  Fortran 95 and Fortran 2001
    14.4.2  The pC++ and Nesl Approaches
  14.5  Bibliographic Notes and Problems

Bibliography
Web Resources List
Subject Index
Author Index

Chapter 1

Scalable Computer Platforms and Models

This chapter presents basic models of parallel and cluster computers. Fundamental design issues and operational principles of scalable computer platforms are introduced. We review the computer technology over the last 50 years. Scalable and cluster computer systems are modeled with key architectural distinctions. Scalability will be introduced in three orthogonal dimensions: resource, application, and technology.

Abstract and physical machine models are specified in Section 1.3. In Section 1.4, we introduce basic concepts of multicomputer clustering. The differences among symmetric multiprocessors, clusters of computers, and distributed computer systems are clarified. Three basic principles are studied in Section 1.5 to guide the design and application of scalable parallel computers.

Bits, Bytes, and Words  The following units are widely used in the computer field, but they are sometimes used with confusing notations and ambiguous meanings. To cope with this problem, we present below a set of notations that will be used throughout the book. In particular, readers should not confuse the shorthand notations for the basic units of time, byte, and bit.

The basic unit of time is the second, abbreviated as s. The two basic information units are byte and bit. One byte (1 B) is 8 bits (8 b). Byte is always abbreviated as B and bit as b. Other information units are word (16 b or 2 B), doubleword (32 b or 4 B), and quadword (64 b or 8 B). This is based on the convention used by Intel, Motorola, and Digital Equipment. Mainframe vendors consider a word to have 32 b. Some supercomputer designers consider 64 b in a word. A frequently used workload unit is the number of floating-point operations, abbreviated as flop. A unit for computing speed is the number of floating-point operations per second (flop/s). A unit for information transfer rate is the number of bytes per second (B/s). The execution rate of a processor is often measured in million instructions per second (MIPS), which is equivalent to the notation Mi/s used in Europe.

Notations and Conventions  We will use the conventions and notations specified in Table 1.1. Note that decimal and binary interpretations differ to some extent in the high-order units. Readers should be aware of their differences. For instance, a kilo could mean either 10^3 = 1000 or 2^10 = 1024. In computer science, K is usually used to mean 1024; in other scientific disciplines, k denotes 1000. This ambiguity must be clarified, even though the relative difference is small. In general, the binary interpretation is applied to information units such as bits, bytes, or words. The decimal interpretation is often used in reference to other units such as execution time, power, weight, pin count, and computational workload.

The convention for plurals is as follows: the unit is pluralized when the whole word is used, and not pluralized when an abbreviation is used. For instance, we say 2 KB, 2 kilobytes, or even 2 Kbytes. They all mean the same thing: 2048 bytes. But we do not use 2 KBs or 2 Kbyte. Note that 2 Kb means a completely different thing: 2048 bits, or 256 bytes. Similarly, 2 Mflop/s is a proper notation, meaning a speed of executing 2 million floating-point operations per second. We should avoid using 2 Mflops/s, 2 MFLOPS, or 2 Mflops, even though some people use them. Throughout this book, the logarithmic function is always base 2. That is, log n means log_2 n, unless otherwise specified.

Table 1.1 Notations and Conventions for Numbers

Prefix   Abbreviation   Meaning
kilo     K              Thousand (10^3 or 2^10)
mega     M              Million (10^6 or 2^20)
giga     G              Billion (10^9 or 2^30)
tera     T              Trillion (10^12 or 2^40)
peta     P              Quadrillion (10^15 or 2^50)
exa      E              Quintillion (10^18 or 2^60)
milli    m              One thousandth (10^-3)
micro    µ              One millionth (10^-6)
nano     n              One billionth (10^-9)
pico     p              One trillionth (10^-12)
femto    f              One quadrillionth (10^-15)
atto     a              One quintillionth (10^-18)

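To make these notational conventions concrete, the short C program below (our own illustration, not taken from the book) checks that 2 KB and 2 Kb denote very different quantities and computes a rate in Mflop/s. It follows the rules above: the binary interpretation (K = 1024) is applied to the information units, and the decimal interpretation (M = 10^6) to the speed; the numeric values are arbitrary examples.

```c
/* Illustrative only -- not from the book.  A quick check of the unit
 * conventions defined above: B = byte, b = bit, K = 1024 for
 * information units, M = 10^6 for rates such as Mflop/s.              */
#include <stdio.h>

int main(void)
{
    long two_KB_in_bytes = 2 * 1024;           /* 2 KB = 2048 bytes            */
    long two_Kb_in_bits  = 2 * 1024;           /* 2 Kb = 2048 bits             */
    long two_Kb_in_bytes = two_Kb_in_bits / 8; /* ... which is only 256 bytes  */

    double flops  = 4.0e6;                     /* 4 million floating-point ops */
    double time_s = 2.0;                       /* executed in 2 seconds        */
    double rate   = flops / time_s / 1.0e6;    /* speed in Mflop/s (decimal M) */

    printf("2 KB = %ld B, 2 Kb = %ld B\n", two_KB_in_bytes, two_Kb_in_bytes);
    printf("speed = %.1f Mflop/s\n", rate);    /* prints 2.0 Mflop/s           */
    return 0;
}
```
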
1.1 Evolution of Computer Architecture

Computers have gone through 50 years of development. We summarize below five generations of development, approximately 10 to 15 years per generation, since the 1940s. A formal definition of a scalable computer is given below. Then we describe the converging architecture for future parallel computers.

1.1.1 Computer Generations

Over the past 50 years, digital electronic computers have gone through five generations of development. Table 1.2 provides a summary of the technology, architecture, software, application, and representative systems in each generation of computers. Each generation has improved on the previous generations in the hardware and software technologies used and in application levels.

Table 1.2 Computer Generations in 50 Years

First generation (1946-1956)
  Technology and architecture: vacuum tubes and relay memory; single-bit CPU with accumulator-based instruction set.
  Software and operating system: machine/assembly languages; programs without subroutines.
  Representative systems: ENIAC, IBM 701, Princeton IAS.

Second generation (1956-1967)
  Technology and architecture: discrete transistors, core memory, floating-point accelerator, I/O channels.
  Software and operating system: Algol and Fortran with compilers; batch processing OS.
  Representative systems: IBM 7030, CDC 1604, Univac LARC.

Third generation (1967-1978)
  Technology and architecture: integrated circuits, pipelined CPU, microprogrammed control unit.
  Software and operating system: C language, multiprogramming, timesharing OS.
  Representative systems: PDP-11, IBM 360/370, CDC 6600.

Fourth generation (1978-1989)
  Technology and architecture: VLSI microprocessors, solid-state memory, multiprocessors, vector supercomputers.
  Software and operating system: symmetric multiprocessing, parallelizing compilers, message-passing libraries.
  Representative systems: IBM PC, VAX 9000, Cray X/MP.

Fifth generation (1990-present)
  Technology and architecture: ULSI circuits, scalable parallel computers, workstation clusters, Intranet, Internet.
  Software and operating system: Java, microkernels, multithreading, distributed OS, World-Wide Web.
  Representative systems: IBM SP2, SGI Origin 2000, Digital TruCluster.

In hardware technology, the first generation used vacuum tubes and relay memory. The second generation was marked by the use of discrete transistors and core memory. The third generation began to use small-scale integrated (SSI) circuits. The fourth generation began with the use of very large-scale integrated (VLSI) microprocessors. The fifth generation uses ultra large-scale integrated (ULSI) circuits. For example, in 1997, 10M-transistor microprocessors and 256M-transistor DRAMs (dynamic random-access memories) are in use.

In the software area, the first generation used machine/assembly languages. The second generation began to use high-level languages (HLLs), such as Algol and Fortran. The third generation was marked by the use of the C language, multiprogramming, and timesharing operating systems. The fourth generation emphasized parallel and vector processing. The fifth generation emphasizes scalable and clustered computing as covered in this book. As we cross into the 21st century, future generations of computers will evolve with even higher capability and user friendliness.

1.1.2 Scalable Computer Architectures

A common feature of modern computers is parallelism. Even within a single processor, parallelism has been exploited in many ways. This will be discussed in detail in Chapter 4. More recently, the concept of scalability has been reemphasized, which includes parallelism. The scalability concept is central to this book and will be elaborated on later. Given below is a short definition of scalable systems.

Definition 1.1  A computer system, including all its hardware and software resources, is called scalable if it can scale up (i.e., improve its resources) to accommodate ever-increasing performance and functionality demand and/or scale down (i.e., decrease its resources) to reduce cost.

`equivalentto being big. Scalability involvesthe ability to scale down. Scaling issues arise
`even on a uniprocessor system. Morespecifically, saying that a system is scalable implies
`the following:
`e Functionality and Performance: The scaled-up system should provide more
`functionality or better performance. The total computing power of the system
`should increase proportionally to the increase in resources. Ideally,
`the users
`would like to see computing powerincrease close to n times when the system
`resourceis improved n times.
`Scaling in Cost: The costpaid for scaling up mustbe reasonable. A rule of thumb
`is that scaling up 7 times should incur a cost of no morethan n or n logntimes.
`Compatibility: The same components,including hardware, system software, and
`application software, shouldstill be usable withlittle change.It is unreasonable to
`
`18
`
`

`

`
`
`Section 1.1 Evolution of Computer Architecture 7
`
`expect the users to pay for a completely new operating system and to redevelop
`their application codes. Often scaling involvesonly a part of a system, e.g., adding
`more processors or upgrading a processor
`to the next-generation. The
`improvement should be compatible with the rest of the system. That is, the old
`memory, disk, interconnect, and peripherals should still be usable.
The Computer Pyramid  The main motivation for a scalable system is to provide a flexible, cost-effective information processing tool. As illustrated in Fig. 1.1, the computer classes form a pyramid with respect to sales volume, performance, and cost. At the bottom of the pyramid is the class of personal computers (PCs), which have a quite affordable price and a huge annual sales volume. Their performance is also the lowest.

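As a hedged illustration of the first two points (ours, not the book's), the C fragment below compares the ideal n-fold computing power increase with n-times and n log n cost budgets for a few system sizes. The base cost and normalized power are made-up placeholder values, not figures from the text.

```c
/* Illustrative sketch only.  For a base system of cost C1 delivering power P1,
 * an n-times scaled system should ideally deliver about n*P1 and, by the rule
 * of thumb above, cost no more than roughly n*C1 to n*log2(n)*C1.            */
#include <math.h>
#include <stdio.h>

int main(void)
{
    const double base_cost  = 10000.0;  /* hypothetical cost of a 1-node system */
    const double base_power = 1.0;      /* normalized computing power of 1 node */

    for (int n = 2; n <= 64; n *= 2) {
        double ideal_power       = n * base_power;                 /* linear scale-up */
        double cost_budget_n     = n * base_cost;                  /* n-times budget  */
        double cost_budget_nlogn = n * log2((double)n) * base_cost;

        printf("n=%2d  ideal power=%5.1f  cost budget: %.0f (n) .. %.0f (n log n)\n",
               n, ideal_power, cost_budget_n, cost_budget_nlogn);
    }
    return 0;
}
```
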
[Figure 1.1  The pyramid of computer classes. The vertical axis is performance and cost; the width of each layer indicates annual sales volume. From bottom to top: millions of PCs at $1500 per system, stand-alone workstations, SMP servers and clusters, and tens of supercomputers at $15 M per system.]

At the top is the class of supercomputers. Their sales volume is very low because of the expensive price. However, a supercomputer integrates many resources into a single system, employs cutting-edge technology, and has the highest performance.

The scalability concept advocates a common scalable architecture across all computer classes. This has several advantages:

• It satisfies different performance and cost requirements. For instance, a user could first purchase a low-end system. When his performance requirement increases, he could scale up the system, knowing that the original software and hardware components can still be used in the new system, and his investment is protected.

• High-end machines may use low-end components to cut cost. For example, with high volume, PCs have low-cost, off-the-shelf components. With a scalable architecture, supercomputers can use such components to reduce system cost. In fact, using commodity components (processors, memory chips, disks, I/O controllers, etc.) has become a trend in high-performance systems development.

• The cutting-edge technology developed for high-end systems may eventually migrate down the pyramid to improve the performance of low-end systems, if cost-effectiveness can be increased with improved production technology.

1.1.3 Converging System Architectures

Scalable parallel computers are converging toward three architectures, which share much commonality, with increasing levels of resource sharing as shown in Fig. 1.2.

[Figure 1.2  Common architecture for scalable parallel computers. Legend: P = processor, C = cache, M = memory, D = disk, NIC = network interface circuitry. The panels shown include (b) the shared-disk architecture and (c) the shared-memory architecture.]

The sha