`
Frontiers ’96
`
`The Sixth Symposium on the
`Frontiers of Massively Parallel Computing
`
`October 27-31, 1996
`Annapolis, Maryland
`
`Sponsored by
`IEEE Computer Society
`
`In cooperation with
`NASA Goddard Space Flight Center
USRA/CESDIS
`
IEEE Computer Society Press
Los Alamitos, California
Washington • Brussels • Tokyo
`
`
`IEEE Computer Society Press
`10662 Los Vaqueros Circle
P.O. Box 3014
Los Alamitos, CA 90720-1264
`
Copyright © 1996 by The Institute of Electrical and Electronics Engineers, Inc.
`All rights reserved.
`
`Copyright and Reprint Permissions: Abstracting is permitted with credit to the source. Libraries may
`photocopy beyond the limits of US copyright law, for private use of patrons, those articles in this volume
`that carry a code at the bottom of the first page, provided that the per-copy fee indicated in the code is paid
`through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923.
`
Other copying, reprint, or republication requests should be addressed to: IEEE Copyrights Manager, IEEE
Service Center, 445 Hoes Lane, P.O. Box 1331, Piscataway, NJ 08855-1331.
`
`The papers in this book comprise the proceedings of the meeting mentioned on the cover and title page. They
`reflect the authors’ opinions and, in the interests of timely dissemination, are published as presented and
`without change. Their inclusion in this publication does not necessarily constitute endorsement by the
`editors, the IEEE Computer Society Press, or the Institute of Electrical and Electronics Engineers, Inc.
`
IEEE Computer Society Press Order Number PR07551
IEEE Order Plan Catalog Number 96TB100062
ISBN 0-8186-7551-9
Microfiche ISBN 0-8186-7553-5
`ISSN 1088-4955
`
`Additional copies may be ordered from:
`
`IEEE Computer Society Press
`Customer Service Center
`10662 Los Vaqueros Circle
`P.O. Box 3014
`Los Alamitos, CA 90720-1314
`Tel: +1-714-821-8380
Fax: +1-714-821-4641
`Email: cs.books@computer.org
`
`IEEE Service Center
`445 Hoes Lane
`P.O. Box 1331
`Piscataway, NJ 08855-1331
Tel: +1-908-981-1393
Fax: +1-908-981-9667
misc.custserv@computer.org
`
`IEEE Computer Society
13, Avenue de l’Aquilon
`B-1200 Brussels
`BELGIUM
`Tel: +32-2-770-2198
Fax: +32-2-770-8505
euro.ofc@computer.org
`
`IEEE Computer Society
`Ooshima Building
`2-19-1 Minami-Aoyama
`Minato-ku, Tokyo 107
`JAPAN
`Tel: +81-3-3408-3118
`Fax: +81-3-3408-3553
tokyo.ofc@computer.org
`
`Editorial production by Penny Storms
`Cover by Kerry Bedford and Alex Torres
`Printed in the United States of America by KNI, Inc.
`
`The Institute of Electrical and Electronics Engineers, Inc.
`
`
`Contents
`
Message from the General Chair ............................................................................................ ix
Message from the Program Chair ........................................................................................... x
Conference Committee ............................................................................................................. xi
Referees ...................................................................................................................................... xiii
`
`Session 1: Invited Speaker
`From ASCI to Teraflops
`John Hopson, Accelerated Strategic Computing Initiative (ASCI)
`
Session 2A: Scheduling 1
Gang Scheduling for Highly Efficient Distributed Multiprocessor Systems .............................. 4
H. Franke, P. Pattnaik, and L. Rudolph
Integrating Polling, Interrupts, and Thread Management ........................................................ 13
K. Langendoen, J. Romein, R. Bhoedjang, and H. Bal
A Practical Processor Design for Multithreading ....................................................................... 23
M. Amamiya, T. Kawano, H. Tomiyasu, and S. Kusakabe
`
Session 2B: Routing
Analysis of Deadlock-Free Path-Based Wormhole Multicasting in
Meshes in Case of Contentions .................................................................................................... 34
E. Fleury and P. Fraigniaud
Efficient Multicast in Wormhole-Routed 2D Mesh/Torus Multicomputers:
A Network-Partitioning Approach ............................................................................................... 42
S-Y. Wang, Y-C. Tseng, and C-W. Ho
Turn Grouping for Efficient Multicast in Wormhole Mesh Networks ...................................... 50
K-P. Fan and C-T. King
`
Session 3A: Applications and Algorithms
A3: A Simple and Asymptotically Accurate Model for Parallel Computation ........................... 60
A. Grama, V. Kumar, S. Ranka, and V. Singh
Fault Tolerant Matrix Operations Using Checksum and Reverse Computation .................... 70
Y. Kim, J.S. Plank, and J.J. Dongarra
A Statistically-Based Multi-Algorithmic Approach for Load-Balancing
Sparse Matrix Computations ....................................................................................................... 78
S. Nastea, T. El-Ghazawi, and O. Frieder
`
Session 3B: Petaflops Computing / Point Design Studies
Pursuing a Petaflop: Point Designs for 100 TF Computers Using PIM Technologies ............. 88
P.M. Kogge, S.C. Bass, J.B. Brockman, D.Z. Chen, and E. Sha
Hybrid Technology Multithreaded Architecture ........................................................................ 98
G. Gao, K.K. Likharev, P.C. Messina, and T.L. Sterling
The Illinois Aggressive Coma Multiprocessor Project (I-ACOMA) ......................................... 106
J. Torrellas and D. Padua
`
`
Panel Session: How Do We Break the Barrier to the Software Frontier?
`Panel Chair: Rick Stevens, Argonne National Laboratory
`
`Session 4: Invited Speaker
`
Session 5A: Scheduling 2
Largest-Job-First-Scan-All Scheduling Policy for 2D Mesh-Connected Systems ................... 118
S-M. Yoo and H.Y. Youn
Scheduling for Large-Scale Parallel Video Servers .................................................................. 126
M-Y. Wu and W. Shu
Effect of Variation in Compile Time Costs on Scheduling Tasks on
Distributed Memory Systems .................................................................................................... 134
S. Darbha and S. Pande
`
Session 5B: SIMD
Processor Autonomy and Its Effect on Parallel Program Execution ....................................... 144
D.M. Hawver and G.B. Adams III
Particle-Mesh Techniques on the MasPar ................................................................................ 154
P. MacNeice, C. Mobarry, and K. Olson
MIMD Programs on SIMD Architectures ................................................................................. 162
M-Y. Wu and W. Shu
`
Session 6A: I/O Techniques
Intelligent, Adaptive File System Policy Selection ................................................................... 172
T.M. Madhyastha and D.A. Reed
An Abstract-Device Interface for Implementing Portable Parallel-I/O Interfaces ................. 180
R. Thakur, W. Gropp, and E. Lusk
PMPIO - A Portable Implementation of MPI-IO ...................................................................... 188
S.A. Fineberg, P. Wong, B. Nitzberg, and C. Kuszmaul
Disk Resident Arrays: An Array-Oriented I/O Library for Out-Of-Core
Computations .............................................................................................................................. 196
J. Nieplocha and I. Foster
`
Session 6B: Memory Management
Hardware-Controlled Prefetching in Directory-Based Cache Coherent Systems .................. 206
W. Hu and P. Xia
Preliminary Insights on Shared Memory PIC Code Performance on the
Convex Exemplar SPP1000 ....................................................................................................... 214
P. MacNeice, C.M. Mobarry, J. Crawford, and T.L. Sterling
Scalability of Dynamic Storage Allocation Algorithms ............................................................ 223
A. Iyengar
An Interprocedural Framework for Determining Efficient Data
Redistributions in Distributed Memory Machines ................................................................... 233
S.K.S. Gupta and S. Krishnamurthy
`
`
Panel Session: Petaflops Alternative Paths
`Panel Chair: Paul Messina, California Institute of Technology
`
`Session 7: Invited Speaker
`Independence Day
Steven Wallach, HP-Convex
`
Session 8A: Synchronization
A Fair Fast Distributed Concurrent-Reader Exclusive-Writer Synchronization ................... 246
T.J. Johnson and H. Yoon
Lock Improvement Technique for Release Consistency in Distributed
Shared Memory Systems ........................................................................................................... 255
S.S. Fu and N-F. Tzeng
A Quasi-Barrier Technique to Improve Performance of an Irregular Application ................ 263
H.V. Shah and J.A.B. Fortes
`
Session 8B: Networks
Performance Analysis and Fault Tolerance of Randomized Routing on
Clos Networks ............................................................................................................................. 272
M. Bhatia and A. Youssef
Performing BMMC Permutations in Two Passes through the Expanded
Delta Network and MasPar MP-2 ............................................................................................. 282
L.F. Wisniewski, T.H. Cormen, and T. Sundquist
Macro-Star Networks: Efficient Low-Degree Alternatives to Star Graphs
for Large-Scale Parallel Architectures ...................................................................................... 290
C-H. Yeh and E. Varvarigos
`
Session 9A: Performance Analysis
Modeling and Identifying Bottlenecks in EOSDIS ................................................................... 300
J. Demmel, M.Y. Ivory, and S.L. Smith
Tools-Supported HPF and MPI Parallelization of the NAS Parallel Benchmarks ................ 309
C. Clémençon, K.M. Decker, V.R. Deshpande, A. Endo,
J. Fritscher, P.A.R. Lorenzo, N. Masuda, A. Müller,
R. Rühl, W. Sawyer, B.J.N. Wylie, and F. Zimmerman
A Comparison of Workload Traces from Two Production Parallel Machines ....................... 319
K. Windisch, V. Lo, D. Feitelson, R. Moore, and B. Nitzberg
Morphological Image Processing on Three Parallel Machines ............................................... 327
M.D. Theys, R.M. Born, M.D. Allemang, and H.J. Siegel
`
Session 9B: Petaflops Computing / Point Design Studies
MORPH: A System Architecture for Robust High Performance Using
Customization (An NSF 100 TeraOps Point Design Study) .................................................... 336
A.A. Chien and R.K. Gupta
Architecture, Algorithms and Applications for Future Generation
Supercomputers .......................................................................................................................... 346
V. Kumar, A. Sameh, A. Grama, and G. Karypis
`
`
Hierarchical Processors-and-Memory Architecture for High Performance
Computing ................................................................................................................................... 355
Z.B. Miled, R. Eigenmann, J.A.B. Fortes, and V. Taylor
A Low-Complexity Parallel System of Gracious, Scalable Performance:
Case Study for Near PetaFLOPS Computing ........................................................................... 363
S.G. Ziavras, H. Grebel, and A. Chronopoulos
`
Author Index ............................................................................................................................ 371
`
`
`MORPH: A System Architecture for Robust High Performance
`Using Customization
`(An NSF 100 TeraOps Point Design Study)
Rajesh K. Gupta
`Andrew A. Chien
`Department of Computer Science
`University of Illinois
`Urbana, Illinois 61801
{achien, rgupta}@cs.uiuc.edu
`
Abstract

Achieving 100 TeraOps performance within a ten-year horizon will require massively-parallel architectures that exploit both commodity software and hardware technology for cost efficiency. Increasing clock rates and system diameter in clock periods will make efficient management of communication and coordination increasingly critical. Configurable logic presents a unique opportunity to customize the bindings, mechanisms, and policies which comprise the interaction of processing, memory, I/O, and communication resources. This programming flexibility, or customizability, can provide the key to achieving robust high performance. The MultiprocessOr with Reconfigurable Parallel Hardware (MORPH) uses reconfigurable logic blocks integrated with the system core to control policies, interactions, and interconnections. This integrated configurability can improve the performance of the local memory hierarchy, increase the efficiency of interprocessor coordination, or better utilize the network bisection of the machine. MORPH provides a framework for exploring such integrated application-specific customizability. Rather than complicate the situation, MORPH's configurability supports component software and interoperability frameworks, allowing direct support for application-specified patterns, objects, and structures. This paper reports the motivation and initial design of the MORPH system.
1 Introduction

Increasing reliance on computational techniques for scientific inquiry, complex systems design, faster-than-real-life simulations, and higher-fidelity human-computer interaction continues to drive the need for ever higher performance computing systems. Despite rapid progress in basic device technology [1], and even in uniprocessor computing technology [2], these applications demand systems scalable in every aspect: processing, memory, I/O, and particularly communication. In addition to advances in raw processing power, we must also achieve dramatic improvements in system usability. Current-day scalable systems are still quite difficult to program, in many cases effectively precluding use of the most sophisticated (and most efficient) algorithms. Even when successfully used, most systems exhibit substantial performance fragility due to rigid architectural choices that do not work well across different applications.

Based on technology projections for the 2007 design window for the NSF point design studies, our analyses indicate that increases in communication cost relative to computation (gate speeds) make configurable logic practical in an ever broader range of the system. The benefit of configurable logic is that it can be used to customize the machine's behavior to better match that required by the application; in essence, a machine can be tuned for each application with little or no performance penalty for this generality. While a broad variety of such architectures is possible, MORPH is a design point which explores the potential of integrating configurability deep into the system core.

Because technology trends continue to increase the importance of communication, the MORPH architecture focuses on exploiting configurability to manage locality, communication, and coordination. In particular, the MORPH design study is exploring improved efficiency and scalability through novel mechanisms for binding and mechanisms which comprise the interaction of processing, memory, I/O, and communication resources. Other innovations explore flexible hardware granularity (e.g., mechanisms and association with processors and memory) and memory system management (e.g., cache coherence, prefetching, and other data management policies).

Because a wealth of studies indicate that no fixed policies (or even wiring configurations) are optimal, MORPH seeks to exploit application structure and behavior to adapt for efficient execution. Traditionally in high performance computing systems, this information has been provided by optimizing compilers (vectorizing or parallelizing compilers). However, as the use of component software and interoperability frameworks proliferates, we expect such analysis to become increasingly difficult. Therefore, we are exploring a broad range of techniques to identify opportunities for customization based on aggressive compiler analysis, user annotations, profiling, and even on-the-fly monitoring and adaptation.

Because the field of configurable computing is still in its nascence, the range of possible architectures has only begun to be explored. The primary innovations of MORPH are its focus on integrating configurability into the system core and its exploitation of opportunities to optimize communication and coordination. In the remainder of this paper, we sketch the MORPH architecture. Since the project is in the early stages of design, many of the detailed design issues are intentionally left open. Up-to-date information on the project and its results is available on the Web site: http://www-csag.cs.uiuc.edu/projects/morph.html.

The remainder of the paper is organized as follows. Section 2 presents the technology trends and parameters underlying the design of MORPH. Section 3 presents the overall hardware organization of MORPH, while the software architecture is presented in Section 4. Section 5 describes our evaluation environment and examples of customizability used by the MORPH machine to achieve high performance. We summarize the design and open issues in Section 6.
`
2 Key Technology Trends

The MORPH design is targeted for construction in 2007, based on the commodity technology that will be available then. Because these technology extrapolations are critical underpinnings for the design, we discuss them in some detail. The base processor, memory, and packaging technology form the fundamental global system constraints. Advances in programmable logic and computer-aided design will make per-application configuration not only feasible but desirable, and the widespread use of component software will make flexible execution engines desirable for achieving high performance.

Processor, Memory, and Packaging Technology  The continued advance of Si technology will produce remarkable processors and memory chips, and areal bonding and multi-chip carriers will provide significant improvements in interchip wiring. However, communication will remain critical in achieving high performance.

Projections from the Semiconductor Industry Association (SIA) [1] for a decade hence indicate advanced processes using 0.1 µm feature size. However, at the deep sub-micron level, feature size as measured by transistor channel length is increasingly irrelevant for velocity-saturated carrier transport [3]. Both logic density and speed are dominated by interconnect density. Pitch for the finest interconnect is projected at 0.4-0.6 µm. On logic devices, average interconnect lengths are anywhere from 1,000× to 10,000× the pitch, that is, up to 6 mm of intra-chip interconnect. This limits on-chip clock periods to roughly 1 nanosecond, implying that system performance will be increasingly dominated by interconnect delay. With multiple issue and multiple processors on a chip, for example four 8-way super-scalar on-chip CPUs, processor chips are anticipated to achieve 32 billion instructions per second (BIPS) at 1 GHz clock rates. Memory integration will continue its four-fold increase in density every 3 years, achieving 16 Gbit DRAM and 4 Gbit SRAM chips [4].

At a voltage level of 1200-1500 mV, per-chip power consumption is expected to be limited to below 200 Watts. Total system power is expected to be 163 kilowatts for the MORPH 100 TeraOp configuration, including a 10% loss in power distribution and management. With packaging advances using reduced on-chip solder bumps, chip I/Os are expected to increase by 10× to approximately 5000, with 90% usable as signal pins. Assuming one word of communication every 100 operations would require signaling rates of 30-40 GHz; even one word every 1,000 operations requires 3-4 GHz, well in excess of the SIA's projected 375 MHz off-chip signaling rate [4], indicating that communication is again a critical issue.
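
As a rough sanity check, the short C sketch below (our illustration, not part of the original study) reproduces this signaling-rate arithmetic. The roughly 100 bits per transferred word, covering 64 data bits plus address and control overhead on a bit-serial channel, is our assumption; the 32 BIPS figure comes from the text.

    /* Back-of-envelope check of the off-chip signaling-rate argument. */
    #include <stdio.h>

    int main(void) {
        const double ops_per_sec    = 32e9;   /* 32 BIPS per chip (from the text) */
        const double bits_per_word  = 100.0;  /* ASSUMED: 64 data + addr/ctl bits */
        const double ops_per_word[] = { 100.0, 1000.0 };

        for (int i = 0; i < 2; i++) {
            double words_per_sec = ops_per_sec / ops_per_word[i];
            double rate_ghz = words_per_sec * bits_per_word / 1e9;
            printf("1 word per %4.0f ops -> %4.1f GHz serial signaling\n",
                   ops_per_word[i], rate_ghz);
        }
        puts("SIA projection: 0.375 GHz off-chip signaling per pin");
        return 0;
    }

Under these assumptions the sketch prints 32.0 GHz and 3.2 GHz, consistent with the 30-40 GHz and 3-4 GHz figures quoted above.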
`
Programmable Logic and Computer-Aided Design (CAD)  Advances in programmable logic will make its use viable in a much broader range of system elements. In addition, the maturation of CAD technology will allow flexible exploitation of configurable resources to customize for individual programs.

Programmable logic density and speed are increasing in similar fashion to SRAM, approximately four times every three years. The routability and gate utilization of reprogrammable devices will continue to increase at a rapid rate of 5-15% per year due to better algorithms, CAD tools, and additional routing resources.
`
State-of-the-art programmable logic devices, implemented in a 0.35 µm process technology, provide 100,000 gates and achieve propagation delays of less than 5 ns. As an alternative organization, current FPGA devices also offer up to 40 Kbits of multifunctional memory, in addition to over 60,000 usable gates, for implementing specialized hardware functions such as arithmetic or DSP functions. Unlike standard SRAM parts, the embedded memory can be used as multiple-ported SRAM or FIFO, providing much greater flexibility in system organization. Current efforts in FPGA design and architecture show significant improvements in the efficiency of the on-chip memory blocks. This trend in memory efficiency and utilization is expected to continue and to close the gap between FPGA and SRAM densities; up to 10 Mbits of multi-functional embedded memory would then be available in addition to the logic blocks in single-chip reprogrammable devices. Over the ten-year period, programmable logic integration levels are expected to reach 2 million gates.

MORPH's design exploits small blocks of reprogrammable interconnect logic to achieve application customizability, rather than application-specific functional units or compute elements. Perhaps the most compelling reason for incorporating (small) reprogrammable logic blocks even in custom processors is the comparatively low and decreasing delay penalty of reprogrammable logic blocks relative to medium and long interconnect delays. As transistor switching delays scale down to the 10 ps range (a factor of 1/10 from current delays), interconnect delays scale down much more slowly. This is because local interconnect length scales down with pitch (1/3), while global interconnect length actually increases with increasing die size [5]. As a result, the marginal cost of adding switching logic in the interconnect decreases tremendously, and such logic may even become necessary to provide signal "repeaters" for moderate-length interconnects. We are already beginning to see the prevalent use of SRAM-based reprogrammable logic (FPGA and CPLD) parts in final product designs. This transition to field-reprogrammable parts is expected to continue with increasing time-to-market pressures.
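
The small C sketch below (our illustration) makes this delay argument concrete. The scaling factors, gate delay 1/10 and local interconnect 1/3, come from the text; the present-day local interconnect delay of 300 ps is an assumed round number.

    /* Interconnect-to-gate delay ratio across the 1996 -> 2007 scaling. */
    #include <stdio.h>

    int main(void) {
        const double gate_now  = 100.0, gate_2007 = 10.0; /* ps; 1/10 (from text) */
        const double local_now = 300.0;                   /* ps; ASSUMED */
        const double local_2007 = local_now / 3.0;        /* tracks pitch: 1/3 */

        printf("interconnect/gate delay ratio: %.1fx now, %.1fx in 2007\n",
               local_now / gate_now, local_2007 / gate_2007);
        /* The ratio grows ~3.3x even for local wires (global wires fare
         * worse), so inserting a configurable switch into a wire path
         * becomes relatively cheap, which is the argument made above. */
        return 0;
    }
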
Coupled with these advances in technology, advances in CAD tools and algorithms are beginning to have an impact on how designs are done today. With the emphasis on system models in hardware description languages (HDLs) such as Verilog and VHDL, the process of hardware design is increasingly a language-level activity, supported by compilation and synthesis tools [6]. These tools are beginning to support a variety of design constraints on performance, size, power, and even pin-outs. Locking I/O maps, so that the physical design remains unchanged while logical connections are modified per application, will soon be a common feature, allowing programmable logic to be embedded in key modules of a system and providing on-line programmability to change hardware functionality. Tools for distributed hardware control synthesis to allow dynamic binding of hardware resources [7], and synthesis of protocols to low-latency hardware [9, 10, 8], have been successfully demonstrated. With these CAD and synthesis capabilities, embedded programmable logic can be inserted into key parts of systems and used to alter behavior dramatically with modest performance overhead.

The architectural implications of these technology trends are enormous: the underlying assumption that reprogrammability is expensive is already beginning to be challenged. As gate switching delays continue to scale well below today's 100 ps range, local decision making will cost significantly less than sending information. Such changes will pave the way for a new class of system architectures that exploit flexibility to deliver robust, high performance to applications.
`
Component Software  The exploding complexity of application software is driving increased use of component software (libraries, toolkits, shared abstractions) and interoperability frameworks (to glue components together). These application structures leverage programmer effort, but also imply that programs will increasingly be written in terms of user abstractions (e.g., objects) rather than machine abstractions (registers, cache lines, memory locations).

Over the past ten years, there has been an accelerating trend towards component software, enabling the construction of larger, more complex applications. For example, graphical user interface programmers are leveraged by large graphical user interface libraries such as X Windows or MS Windows 95. Likewise, scientific programmers are increasingly leveraged by libraries such as LAPACK [11, 12], or increasingly domain- or application-specific libraries such as A++/P++, AMR++, POOMA, and LPARX++/KeLP [13, 14, 15], whose millions of lines of code can not only dramatically increase programmer productivity but, in some cases, through the sophistication of the libraries, actually deliver higher levels of performance. This phenomenon is occurring in many domains, programming languages, and system contexts because component software allows application programmers to focus on the critical new problems, not on solving or reimplementing solutions to the same old ones.
Experts expect this trend to continue [16], with useful libraries proliferating and increasing in complexity, thereby supporting increased application complexity and programmer productivity.

Beyond library-based sharing in a single program, the use of coordination/interoperability frameworks is having a major impact on application structure and on the ease of building large, complex applications. Interoperability frameworks such as OLE, CORBA, SOM, etc. [17, 18] define standard interfaces for modules, making it possible to build modular software and compose it with little knowledge of the internal software structure. Thus, large complex programs will be composed of heterogeneous applications including computation, visualization, persistent storage, and even on-line interaction. In the future, we expect widespread use of coordination/interoperability frameworks to build complex applications from variegated individual programs. This means that tools and efforts for optimizing applications must address individual programs, coordination frameworks, and entire applications alike.

The increasing complexity of software demands a hardware structure whose performance can be easily exploited. However, this need for easily accessible performance comes at a time when hardware technology trends place increasing importance on issues such as locality, partitioning, and mapping of computations, that is, on managing communication. Our solution is a machine architecture which leverages a small amount of programmable logic in several key places to implement a flexible hardware composition structure. This flexibility allows the machine to be configured as convenient or to optimize machine communication, supporting customizability by software and high performance.

`3 System Architecture & Organization
The basic architecture of MORPH reflects the observations that, in the 2007 technology window, (1) communication is already critical and becoming increasingly so [19], and (2) flexible interconnects can be used to replace static wires at competitive performance [20, 21, 22, 23]. The key elements of the MORPH architecture are processing elements and memory elements embedded in a scalable interconnect. The scalable interconnect flexibly connects all parts of the system with fast packet routing, efficiently exploiting the wiring resources provided by the system packaging [21, 22, 20, 23]. The hardware structure allows adaptation of data transport, coordination, association (for granularity), and efficient computation. As an example of its flexibility, MORPH could be used to implement either a cache-coherent machine, a non-cache-coherent machine, or even clusters of cache-coherent machines connected by put/get or message passing. Varying the mix of processing and memory elements supports a wide range of machine configurations and balances. Examples of other possible changes include changes in cache block size, branch predictors, or prefetch policies.

[Figure 1: A Flexible 100 TeraOp Architecture. Processing elements and memory elements embedded in a scalable interconnect (modules to subsystems), with a high-bandwidth system-level interconnect (optical or other).]
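
To make this concrete, the C sketch below illustrates the kind of per-application configuration descriptor such a machine might expose. It is our own illustration under stated assumptions; the type and field names are hypothetical and do not represent MORPH's actual interface.

    /* Hypothetical per-application configuration descriptor capturing the
     * policy bindings discussed above. All names are illustrative only. */
    typedef enum { COHERENT, NON_COHERENT, COHERENT_CLUSTERS } sharing_model_t;
    typedef enum { PREFETCH_OFF, PREFETCH_SEQUENTIAL, PREFETCH_STRIDED } prefetch_t;
    typedef enum { XFER_SHARED_MEM, XFER_PUT_GET, XFER_MSG_PASSING } transport_t;

    typedef struct {
        sharing_model_t sharing;           /* whole machine or clusters */
        transport_t     inter_cluster;     /* put/get or message passing */
        int             cache_block_bytes; /* e.g., 32..256, tuned per app */
        prefetch_t      prefetch;
        int             branch_predictor;  /* selects a predictor config */
    } machine_config;

    /* Example binding for a regular stencil computation: clusters of
     * cache-coherent nodes communicating by put/get, strided prefetch. */
    static const machine_config stencil_config = {
        .sharing           = COHERENT_CLUSTERS,
        .inter_cluster     = XFER_PUT_GET,
        .cache_block_bytes = 128,
        .prefetch          = PREFETCH_STRIDED,
        .branch_predictor  = 0,
    };
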
Our proposed configuration uses 32 BIPS processors and 8192 processing nodes. The physical memory configuration is driven primarily by cost and packaging (power budgeting) factors. A cost-balanced system (based on today's processor-to-memory price ratios and SIA predictions) would have approximately two memory chips for every processor, or