`
Proceedings

Frontiers '96

The Sixth Symposium on the
Frontiers of Massively Parallel Computing

October 27-31, 1996
Annapolis, Maryland

Sponsored by

IEEE Computer Society

In cooperation with

NASA Goddard Space Flight Center
USRA/CESDIS

IEEE Computer Society Press
Los Alamitos, California • Washington • Brussels • Tokyo
`
`
`
`
`
`
`
`
IEEE Computer Society Press
10662 Los Vaqueros Circle
P.O. Box 3014
Los Alamitos, CA 90720-1264

Copyright © 1996 by The Institute of Electrical and Electronics Engineers, Inc.
All rights reserved.

Copyright and Reprint Permissions: Abstracting is permitted with credit to the source. Libraries may photocopy beyond the limits of US copyright law, for private use of patrons, those articles in this volume that carry a code at the bottom of the first page, provided that the per-copy fee indicated in the code is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923. Other copying, reprint, or republication requests should be addressed to: IEEE Copyrights Manager, IEEE Service Center, 445 Hoes Lane, P.O. Box 1331, Piscataway, NJ 08855-1331.

The papers in this book comprise the proceedings of the meeting mentioned on the cover and title page. They reflect the authors' opinions and, in the interests of timely dissemination, are published as presented and without change. Their inclusion in this publication does not necessarily constitute endorsement by the editors, the IEEE Computer Society Press, or the Institute of Electrical and Electronics Engineers, Inc.

IEEE Computer Society Press Order Number PR07551
IEEE Order Plan Catalog Number 96TB100062
ISBN 0-8186-7551-9
Microfiche ISBN 0-8186-7553-5
ISSN 1088-4955

Additional copies may be ordered from:

IEEE Computer Society Press
Customer Service Center
10662 Los Vaqueros Circle
P.O. Box 3014
Los Alamitos, CA 90720-1314
Tel: +1-714-821-8380
Fax: +1-714-821-4641
Email: cs.books@computer.org

IEEE Service Center
445 Hoes Lane
P.O. Box 1331
Piscataway, NJ 08855-1331
Tel: +1-908-981-1393
Fax: +1-908-981-9667
mis.custserv@computer.org

IEEE Computer Society
13, Avenue de l'Aquilon
B-1200 Brussels
BELGIUM
Tel: +32-2-770-2198
Fax: +32-2-770-8505
euro.ofc@computer.org

IEEE Computer Society
Ooshima Building
2-19-1 Minami-Aoyama
Minato-ku, Tokyo 107
JAPAN
Tel: +81-3-3408-3118
Fax: +81-3-3408-3553
tokyo.ofc@computer.org

Editorial production by Penny Storms
Cover by Kerry Bedford and Alex Torres
Printed in the United States of America by KNI, Inc.

The Institute of Electrical and Electronics Engineers, Inc.
`
`
`
`
`
`
`
Contents

Message from the General Chair .......................................................................................... ix
Message from the Program Chair .......................................................................................... x
Conference Committee ........................................................................................................... xi
Referees ................................................................................................................................. xiii

Session 1: Invited Speaker
From ASCI to Teraflops
John Hopson, Accelerated Strategic Computing Initiative (ASCI)

Session 2A: Scheduling 1
Gang Scheduling for Highly Efficient Distributed Multiprocessor Systems .............................. 4
H. Franke, P. Pattnaik, and L. Rudolph
Integrating Polling, Interrupts, and Thread Management ........................................................ 13
K. Langendoen, J. Romein, R. Bhoedjang, and H. Bal
A Practical Processor Design for Multithreading ....................................................................... 23
M. Amamiya, T. Kawano, H. Tomiyasu, and S. Kusakabe

Session 2B: Routing
Analysis of Deadlock-Free Path-Based Wormhole Multicasting in
Meshes in Case of Contentions .................................................................................................... 34
E. Fleury and P. Fraigniaud
Efficient Multicast in Wormhole-Routed 2D Mesh/Torus Multicomputers:
A Network-Partitioning Approach ............................................................................................... 42
S-Y. Wang, Y-C. Tseng, and C-W. Ho
Turn Grouping for Efficient Multicast in Wormhole Mesh Networks ...................................... 50
K-P. Fan and C-T. King

Session 3A: Applications and Algorithms
A3: A Simple and Asymptotically Accurate Model for Parallel Computation ........................... 60
A. Grama, V. Kumar, S. Ranka, and V. Singh
Fault Tolerant Matrix Operations Using Checksum and Reverse Computation ..................... 70
Y. Kim, J.S. Plank, and J.J. Dongarra
A Statistically-Based Multi-Algorithmic Approach for Load-Balancing
Sparse Matrix Computations ...................................................................................................... 78
S. Nastea, T. El-Ghazawi, and O. Frieder

Session 3B: Petaflops Computing / Point Design Studies
Pursuing a Petaflop: Point Designs for 100 TF Computers Using PIM Technologies ............. 88
P.M. Kogge, S.C. Bass, J.B. Brockman, D.Z. Chen, and E. Sha
Hybrid Technology Multithreaded Architecture ........................................................................ 98
G. Gao, K.K. Likharev, P.C. Messina, and T.L. Sterling
The Illinois Aggressive Coma Multiprocessor Project (I-ACOMA) .......................................... 106
J. Torrellas and D. Padua
`
`
`
`
`
`
Panel Session—How Do We Break the Barrier to the Software Frontier?
Panel Chair: Rick Stevens, Argonne National Laboratory

Session 4: Invited Speaker

Session 5A: Scheduling 2
Largest-Job-First-Scan-All Scheduling Policy for 2D Mesh-Connected Systems ................... 118
S-M. Yoo and H.Y. Youn
Scheduling for Large-Scale Parallel Video Servers .................................................................. 126
M-Y. Wu and W. Shu
Effect of Variation in Compile Time Costs on Scheduling Tasks on
Distributed Memory Systems .................................................................................................... 134
S. Darbha and S. Pande

Session 5B: SIMD
Processor Autonomy and Its Effect on Parallel Program Execution ....................................... 144
R.M. Hawver and G.B. Adams III
Particle-Mesh Techniques on the MasPar ................................................................................ 154
P. MacNeice, C. Mobarry, and K. Olson
MIMD Programs on SIMD Architectures ................................................................................. 162
M-Y. Wu and W. Shu

Session 6A: I/O Techniques
Intelligent, Adaptive File System Policy Selection ................................................................... 172
T.M. Madhyastha and D.A. Reed
An Abstract-Device Interface for Implementing Portable Parallel-I/O Interfaces ................. 180
R. Thakur, W. Gropp, and E. Lusk
PMPIO - A Portable Implementation of MPI-IO ...................................................................... 188
S.A. Fineberg, P. Wong, B. Nitzberg, and C. Kuszmaul
Disk Resident Arrays: An Array-Oriented I/O Library for Out-of-Core
Computations .............................................................................................................................. 196
J. Nieplocha and I. Foster

Session 6B: Memory Management
Hardware-Controlled Prefetching in Directory-Based Cache Coherent Systems ................. 206
W. Hu and P. Xia
Preliminary Insights on Shared Memory PIC Code Performance on the
Convex Exemplar SPP-1000 ...................................................................................................... 214
P. MacNeice, C.M. Mobarry, J. Crawford, and T.L. Sterling
Scalability of Dynamic Storage Allocation Algorithms ............................................................ 223
A. Iyengar
An Interprocedural Framework for Determining Efficient Data
Redistributions in Distributed Memory Machines ................................................................ 233
S.K.S. Gupta and S. Krishnamurthy

vi
`
`
`
`
`
`
Panel Session—Petaflops Alternative Paths
Panel Chair: Paul Messina, California Institute of Technology

Session 7: Invited Speaker
Independence Day
Steven Wallach, HP-Convex

Session 8A: Synchronization
A Fair Fast Distributed Concurrent-Reader Exclusive-Writer Synchronization ................... 246
T.J. Johnson and H. Yoon
Lock Improvement Technique for Release Consistency in Distributed
Shared Memory Systems ............................................................................................................ 255
S.S. Fu and N-F. Tzeng
A Quasi-Barrier Technique to Improve Performance of an Irregular Application ................ 263
H.V. Shah and J.A.B. Fortes

Session 8B: Networks
Performance Analysis and Fault Tolerance of Randomized Routing on
Clos Networks ............................................................................................................................. 272
M. Bhatia and A. Youssef
Performing BMMC Permutations in Two Passes through the Expanded
Delta Network and MasPar MP-2 ............................................................................................. 282
L.F. Wisniewski, T.H. Cormen, and T. Sundquist
Macro-Star Networks: Efficient Low-Degree Alternatives to Star Graphs
for Large-Scale Parallel Architectures ...................................................................................... 290
C-H. Yeh and E. Varvarigos

Session 9A: Performance Analysis
Modeling and Identifying Bottlenecks in EOSDIS ................................................................... 300
J. Demmel, M.Y. Ivory, and S.L. Smith
Tools-Supported HPF and MPI Parallelization of the NAS Parallel Benchmarks ................ 309
C. Clémençon, K.M. Decker, V.R. Deshpande, A. Endo,
J. Fritscher, P.A.R. Lorenzo, N. Masuda, A. Müller,
R. Rühl, W. Sawyer, B.J.N. Wylie, and F. Zimmerman
A Comparison of Workload Traces from Two Production Parallel Machines ........................ 319
K. Windisch, V. Lo, D. Feitelson, R. Moore, and B. Nitzberg
Morphological Image Processing on Three Parallel Machines ............................................... 327
M.D. Theys, R.M. Born, M.D. Allemang, and H.J. Siegel

Session 9B: Petaflops Computing / Point Design Studies
MORPH: A System Architecture for Robust High Performance Using
Customization (An NSF 100 TeraOps Point Design Study) .................................................... 336
A.A. Chien and R.K. Gupta
Architecture, Algorithms and Applications for Future Generation
Supercomputers .......................................................................................................................... 346
V. Kumar, A. Sameh, A. Grama, and G. Karypis

vii
`
`
`
`
`
`
`
Hierarchical Processors-and-Memory Architecture for High Performance
Computing ................................................................................................................................... 355
Z.B. Miled, R. Eigenmann, J.A.B. Fortes, and V. Taylor
A Low-Complexity Parallel System of Gracious, Scalable Performance:
Case Study for Near PetaFLOPS Computing ........................................................................... 363
S.G. Ziavras, H. Grebel, and A. Chronopoulos

Author Index ............................................................................................................................ 371

viii
`
`
`
`
MORPH: A System Architecture for Robust High Performance
Using Customization

(An NSF 100 TeraOps Point Design Study)

Andrew A. Chien and Rajesh K. Gupta
Department of Computer Science
University of Illinois
Urbana, Illinois 61801
{achien, rgupta}@cs.uiuc.edu
`
Abstract

Achieving 100 TeraOps performance within a ten-year horizon will require massively-parallel architectures that exploit both commodity software and hardware technology for cost efficiency. Increasing clock rates and system diameter in clock periods will make efficient management of communication and coordination increasingly critical. Configurable logic presents a unique opportunity to customize the bindings, mechanisms, and policies which comprise the interaction of processing, memory, I/O, and communication resources. This programming flexibility, or customizability, can provide the key to achieving robust high performance. The MultiprocessOr with Reconfigurable Parallel Hardware (MORPH) uses reconfigurable logic blocks integrated with the system core to control policies, interactions, and interconnections. This integrated configurability can improve the performance of the local memory hierarchy, increase the efficiency of interprocessor coordination, or better utilize the network bisection of the machine. MORPH provides a framework for exploring such integrated application-specific customizability. Rather than complicate the situation, MORPH's configurability supports component software and interoperability frameworks, allowing direct support for application-specified patterns, objects, and structures. This paper reports the motivation and initial design of the MORPH system.
`
1 Introduction
`
Increasing reliance on computational techniques for scientific inquiry, complex systems design, faster-than-real-life simulations, and higher-fidelity human-computer interaction continues to drive the need for ever higher performance computing systems. Despite rapid progress in basic device technology [1], and even in uniprocessor computing technology [2], these applications demand systems scalable in every aspect: processing, memory, I/O, and particularly communication. In addition to advances in raw processing power, we must also achieve dramatic improvements in system usability. Current-day scalable systems are still quite difficult to program, in many cases effectively precluding use of the most sophisticated (and most efficient) algorithms. Even when successfully used, most systems exhibit substantial performance fragility due to rigid architectural choices that do not work well across different applications.
`
Based on technology projections for the 2007 design window for the NSF point design studies, our analyses indicate that increases in communication cost relative to computation (gate speeds) make configurable logic practical in an ever broader range of the system. The benefit of configurable logic is that it can be used to customize the machine's behavior to better match that required by the application: in essence, a machine can be tuned for each application with little or no performance penalty for this generality. While a broad variety of such architectures are possible, MORPH is a design point which explores the potential of integrating configurability deep into the system core.
`
Because technology trends continue to increase the importance of communication, the MORPH architecture focuses on exploiting configurability to manage locality, communication, and coordination. In particular, the MORPH design study is exploring improved efficiency and scalability through novel mechanisms for the bindings and interactions of processing, memory, I/O, and communication resources. Other innovations explore flexible hardware granularity (e.g., mechanisms and association with processors and memory) and memory system management (e.g., cache coherence, prefetching, and other data management policies).

1088-4955/96 $5.00 © 1996 IEEE

336
Because a wealth of studies indicate that no fixed policies (or even wiring configurations) are optimal, MORPH seeks to exploit application structure and behavior to adapt for efficient execution. Traditionally in high performance computing systems, this information has been provided by optimizing compilers (vectorizing or parallelizing compilers). However, as the use of component software and interoperability frameworks proliferates, we expect such analysis to become increasingly difficult. Therefore, we are exploring a broad range of techniques to identify opportunities for customization based on aggressive compiler analysis, user annotations, profiling, and even on-the-fly monitoring and adaptation.
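To illustrate the last of these approaches, on-the-fly monitoring and adaptation can be as simple as a feedback loop that watches runtime statistics and rebinds a policy when observed behavior crosses a threshold. The sketch below is our own illustration, not MORPH code; the counters, thresholds, and policy names are hypothetical.

```python
# Illustrative sketch (not MORPH code): rebind a prefetch policy on the fly
# from observed miss behavior. Counters and thresholds are hypothetical.

def choose_prefetch_policy(misses, accesses, sequential_fraction):
    """Pick a policy name from simple runtime statistics."""
    miss_rate = misses / accesses if accesses else 0.0
    if miss_rate < 0.02:
        return "none"      # cache already effective; avoid wasted bandwidth
    if sequential_fraction > 0.8:
        return "stream"    # mostly sequential access: streaming prefetch
    return "stride"        # irregular but strided access: stride predictor

def adapt(samples):
    """Re-evaluate the policy binding after each monitoring interval."""
    return [choose_prefetch_policy(m, a, s) for (m, a, s) in samples]

# A program phase change (strided -> sequential -> cache-friendly) rebinds
# the policy each interval.
phases = [(500, 10000, 0.1), (900, 10000, 0.9), (100, 10000, 0.5)]
print(adapt(phases))   # ['stride', 'stream', 'none']
```

In a configurable machine the `return` values would instead select among synthesized hardware policy blocks; the feedback structure is the same.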
`
Because the field of configurable computing is still in its nascence, the range of possible architectures has only begun to be explored. The primary innovations of MORPH are its focus on integrating configurability into the system core, and its exploitation of opportunities to optimize communication and coordination. In the remainder of this paper, we sketch the MORPH architecture. Since the project is in the early stages of design, many of the detailed design issues are intentionally left open. Up-to-date information on the project and results is available on the Web site: http://www-csag.cs.uiuc.edu/projects/morph.html.
`
The remainder of the paper is organized as follows. In Section 2 we present the technology trends and parameters underlying the design of MORPH. Section 3 presents the overall hardware organization of MORPH, while the software architecture is presented in Section 4. Section 5 describes our evaluation environment and examples of customizability used by the MORPH machine to achieve high performance. We summarize the design and open issues in Section 6.
`
2 Key Technology Trends

The MORPH design is targeted for construction in 2007, based on the commodity technology that will be available then. Because these technology extrapolations are critical underpinnings for the design, we discuss them in some detail. The base processor, memory, and packaging technology form the fundamental global system constraints. Advances in programmable logic and computer-aided design will make per-application configuration not only feasible but desirable, and the widespread use of component software will make flexible execution engines desirable for achieving high performance.
`
Processor, Memory, and Packaging Technology  The continued advance of Si technology will produce remarkable processors and memory chips, and areal bonding and multi-chip carriers will provide significant improvements in interchip wiring. However, communication will remain critical in achieving high performance.
`
Projections from the Semiconductor Industry Association (SIA) [1] for a decade hence indicate advanced processes using 0.1 μm feature size. However, at the deep sub-micron level, feature size as measured by transistor channel length is increasingly irrelevant for velocity-saturated carrier transport [3]. Both logic density and speed are dominated by the interconnect density. Pitch for the finest interconnect is projected at 0.4-0.6 μm. On logic devices, average interconnect lengths are anywhere from 1,000x to 10,000x the pitch, that is, up to 6 mm of intra-chip interconnect delay. This limits on-chip clock periods to roughly 1 nanosecond, implying that system performance is going to be increasingly dominated by interconnect delay. With multiple issue and multiple processors on a chip (for example, four 8-way super-scalar on-chip CPUs), processor chips are anticipated to achieve 32 billion instructions per second (BIPS) at 1 GHz clock rates. Memory integration will continue its four-fold increase in density every 3 years, achieving 16 Gbit DRAM and 4 Gbit SRAM chips [4].
At a voltage level of 1200-1500 mV, per-chip power consumption is expected to be limited to below 200 Watts. Total system power is expected to be 163 kilowatts for the MORPH 100 TeraOp configuration, including a 10% loss in power distribution and management. With packaging advances using reduced on-chip solder bumps, chip I/Os are expected to increase by 10x to approximately 5000, with 90% usable as signal pins. Assuming one word of communication every 100 operations requires signaling rates of 30-40 GHz; even one word every 1,000 operations requires 3-4 GHz, well in excess of the SIA's 375 MHz projected off-chip signaling rate [4], indicating communication is again a critical issue.
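The signaling-rate figures above can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes a 32 BIPS chip and roughly 100 signaled bits per transferred word, serialized over a narrow port; the 100-bit overhead figure is our assumption, chosen so the result lands in the paper's stated range, not a number taken from the text.

```python
# Back-of-the-envelope check of the communication-rate claims.
# Assumption (ours): ~100 signaled bits per word (64 data bits plus
# address/protocol overhead), serialized over a narrow port.
CHIP_OPS_PER_SEC = 32e9   # 32 BIPS: four 8-way CPUs at 1 GHz
BITS_PER_WORD = 100

def signaling_rate_ghz(ops_per_word):
    """Serial signaling rate needed to move one word every ops_per_word ops."""
    words_per_sec = CHIP_OPS_PER_SEC / ops_per_word
    return words_per_sec * BITS_PER_WORD / 1e9

print(signaling_rate_ghz(100))    # 32.0 -> the paper's "30-40 GHz" regime
print(signaling_rate_ghz(1000))   # 3.2  -> the paper's "3-4 GHz" regime
# Either figure dwarfs the SIA-projected 0.375 GHz per-pin off-chip rate.
```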
`
Programmable Logic and Computer-Aided Design (CAD)  Advances in programmable logic will make its use viable in a much broader range of system elements. In addition, the maturation of CAD technology will allow the flexible exploitation of configurable resources to customize for individual programs. Programmable logic densities and speeds are increasing in similar fashion to SRAM, approximately four times every three years. The routability and gate utilization in reprogrammable devices will continue to increase at a rapid rate of 5-15% per year due to

337
`
`
`
`
better algorithms, CAD tools, and additional routing resources. State-of-the-art programmable logic devices, implemented in a 0.35 μm process technology, provide 100,000 gates and achieve propagation delays of less than 5 ns. As an alternative organization, current FPGA devices also offer up to 40 Kbits of multifunctional memory in addition to over 60,000 usable gates for implementation of specialized hardware functions such as arithmetic or DSP functions. Unlike standard SRAM parts, the embedded memory can be used as multiple-ported SRAM or FIFO, providing much greater flexibility in system organization. Current efforts in FPGA design and architecture show significant improvements in the efficiency of the on-chip memory blocks. This trend in memory efficiency and utilization is expected to continue and to close the gap between FPGA and SRAM densities, and up to 10 Mbits of multi-functional embedded memory will be available in addition to the logic blocks in single-chip reprogrammable devices. In the ten-year period, programmable logic integration levels are expected to reach 2 million gates.
MORPH's design exploits small blocks of reprogrammable interconnect logic to achieve application customizability, rather than application-specific functional units or compute elements. Perhaps the most compelling reason for incorporating (small) reprogrammable logic blocks even in custom processors is the comparatively low and decreasing delay penalty for reprogrammable logic blocks relative to medium and long interconnect delays. As transistor switching delays scale down to the 10 ps range (a factor of 1/10 from current delays), the interconnect delays scale down much more slowly. This is because the local interconnect length scales down with pitch (1/3), while the global interconnect length actually increases with increasing die size [5]. As a result, the marginal cost of adding switching logic in the interconnect decreases tremendously, and it may even become necessary to provide signal "repeaters" for moderate-length interconnects. We are already beginning to see the prevalent use of SRAM-based reprogrammable logic (FPGA and CPLD) parts in final product designs. This trend toward field-reprogrammable parts is expected to continue with increasing time-to-market pressures.
Coupled with the advances in technology, advances in CAD tools and algorithms are beginning to have an impact on how designs are done today. With the emphasis on system models in hardware description languages (HDLs) such as Verilog and VHDL, the process of hardware design is increasingly a language-level activity, supported by compilation and synthesis tools [6]. These tools are beginning to support a variety of design constraints on performance, size, power, and even the pin-outs. Locking I/O maps, to ensure that the physical design remains unchanged while logical connections are modified per application, will soon be a common feature, allowing programmable logic to be embedded in key modules of a system and providing on-line programmability to change hardware functionality. Tools for distributed hardware control synthesis to allow dynamic binding of hardware resources [7], and for synthesis of protocols to low-latency hardware [9, 10, 8], have been successfully demonstrated. With these CAD and synthesis capabilities, embedded programmable logic can be inserted into the key parts of systems and used to alter behavior dramatically with modest performance overhead.
The architectural implications of these technology trends are enormous: the underlying assumption that reprogrammability is expensive is already beginning to be challenged. As gate switching continues to scale well below today's 100 ps range, local decision making will cost significantly less than sending information. Such changes will pave the way for a new class of system architectures that exploit flexibility to deliver robust, high performance to applications.
`
Component Software  The exploding complexity of application software is driving increased use of component software (libraries, toolkits, shared abstractions) and interoperability frameworks (to glue components together). These application structures leverage programmer effort, but also imply that programs will increasingly be written in terms of user abstractions (e.g., objects) rather than machine abstractions (registers, cache lines, memory locations).

Over the past ten years, there has been an accelerating trend toward component software, enabling the construction of larger, more complex applications. For example, graphical user interface programmers are leveraged by large graphical user interface libraries such as X Windows or MS Windows '95. Likewise, scientific programmers are increasingly leveraged by libraries such as LAPACK [11, 12], or by domain- or application-specific libraries such as A++/P++, AMR++, POOMA, and LPARX++/KeLP [13, 14, 15], whose millions of lines of code can not only dramatically increase programmer productivity but, in some cases, actually deliver higher levels of performance through the sophistication of the libraries. This phenomenon is occurring in many domains, programming languages, and system contexts because component software allows application programmers to focus on the critical new problems, not on solving or reimplementing solutions to the same old ones. Experts expect this trend to continue [16], with useful libraries proliferating and increasing in complexity, thereby supporting increased application complexity and programmer productivity.

Beyond library-based sharing in a single program, the use of coordination/interoperability frameworks is having a major impact on application structure and the ease of building large, complex applications. Interoperability frameworks such as OLE, CORBA, SOM, etc. [17, 18] define standard interfaces for modules, making it possible to build modular software and compose it with little knowledge of the internal software structure. Thus, large complex programs will be composed of heterogeneous applications including computation, visualization, persistent storage, and even on-line interaction. In the future, we expect widespread use of coordination/interoperability frameworks to build complex applications from variegated individual programs. This means that efforts and tools for optimizing applications must address individual programs, coordination frameworks, and entire applications; all are essential.

338

[Figure 1: A Flexible 100 TeraOp Architecture. High-performance processing elements and high-density, high-performance memory elements, each fronted by a network-interface programmable-logic layer, are joined by a scalable interconnect (modules to subsystems) and a high-bandwidth system-level interconnect (optical or other).]

either a cache-coherent machine, a non-cache-coherent machine, or even clusters of cache-coherent machines connected by put/get or message passing. Varying the mix of processing and memory elements supports a wide range of machine configurations and balances. Examples of other possible changes include changes in cache block size, branch predictors, or prefetch policies.

Our proposed configuration uses 32 BIPS processors and 8192 processing nodes. The physical memory configuration is driven primarily by cost and packaging (power budgeting) factors. A cost-balanced system (based on today's processor-to-memory price ratios and SIA predictions) would have approximately two memory chips for every processor, or approximately 4 gigabytes/processor. Each processing node consists of 8-16 KB of L1 cache and 128 MB of L2 cache that can be configured as private or shared among on-chip CPUs. Using MCM and areal interconnect technology, our system could be integrated with ≈30 nodes per card; with 20 cards per rack, the core of the system would fit in eight racks, with room needed for power supplies, I/O, cooling fans, etc. The communication bandwidth is limited by the wiring of the packages to about 90 GB/sec for a pin-out of 1800 usable MCM pins. The bisection bandwidth would be in the range of 10 TB/sec.

3.1 Architectural Adaptation in MORPH

The critical configurabilities or adaptation flexibilities that this architecture provides include:

• control over computing node granularity (processor-memory association)
• interleaving (address-to-physical-memory-element mapping)
• cache policies (consistency model, coherence protocol, object method protocols)

339

The i
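As a rough sanity check, the headline numbers in the proposed configuration hang together. The sketch below recomputes them from the stated parameters; the peak (not sustained) operation count and the per-pin rate comparison are quantities we derive, not figures from the text.

```python
# Sanity-check the proposed MORPH configuration from its stated parameters.
NODES = 8192                 # processing nodes
BIPS_PER_NODE = 32e9         # four 8-way CPUs at 1 GHz per chip
DRAM_CHIPS_PER_NODE = 2      # cost-balanced processor:memory ratio
DRAM_BITS_PER_CHIP = 16e9    # projected 16 Gbit DRAM
NODE_BW_BYTES = 90e9         # 90 GB/s per node
USABLE_MCM_PINS = 1800

peak_teraops = NODES * BIPS_PER_NODE / 1e12
mem_gb_per_node = DRAM_CHIPS_PER_NODE * DRAM_BITS_PER_CHIP / 8 / 1e9
per_pin_mbps = NODE_BW_BYTES * 8 / USABLE_MCM_PINS / 1e6

print(peak_teraops)    # 262.144 -> ample peak headroom over the 100 TeraOps target
print(mem_gb_per_node) # 4.0     -> matches "approximately 4 gigabytes/processor"
print(per_pin_mbps)    # 400.0   -> near the SIA-projected 375 Mb/s per-pin rate
```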