Proceedings

Frontiers '96

The Sixth Symposium on the
Frontiers of Massively Parallel Computing

October 27-31, 1996
Annapolis, Maryland

Sponsored by
IEEE Computer Society

In cooperation with
NASA Goddard Space Flight Center
USRA/CESDIS

IEEE Computer Society Press
Los Alamitos, California
Washington • Brussels • Tokyo

Intel Exhibit 1005 - 1
`
IEEE Computer Society Press
10662 Los Vaqueros Circle
P.O. Box 3014
Los Alamitos, CA 90720-1264

Copyright © 1996 by The Institute of Electrical and Electronics Engineers, Inc.
All rights reserved.

Copyright and Reprint Permissions: Abstracting is permitted with credit to the source. Libraries may
photocopy beyond the limits of US copyright law, for private use of patrons, those articles in this volume
that carry a code at the bottom of the first page, provided that the per-copy fee indicated in the code is paid
through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923.

Other copying, reprint, or republication requests should be addressed to: IEEE Copyrights Manager, IEEE
Service Center, 445 Hoes Lane, P.O. Box 1331, Piscataway, NJ 08855-1331.

The papers in this book comprise the proceedings of the meeting mentioned on the cover and title page. They
reflect the authors' opinions and, in the interests of timely dissemination, are published as presented and
without change. Their inclusion in this publication does not necessarily constitute endorsement by the
editors, the IEEE Computer Society Press, or the Institute of Electrical and Electronics Engineers, Inc.

IEEE Computer Society Press Order Number PR07551
IEEE Order Plan Catalog Number 96TB100062
ISBN 0-8186-7551-9
Microfiche ISBN 0-8186-7553-5
ISSN 1088-4955

Additional copies may be ordered from:

IEEE Computer Society Press
Customer Service Center
10662 Los Vaqueros Circle
P.O. Box 3014
Los Alamitos, CA 90720-1314
Tel: +1-714-821-8380
Fax: +1-714-821-4641
Email: cs.books@computer.org

IEEE Service Center
445 Hoes Lane
P.O. Box 1331
Piscataway, NJ 08855-1331
Tel: +1-908-981-1393
Fax: +1-908-981-9667
misc.custserv@computer.org

IEEE Computer Society
13, Avenue de l'Aquilon
B-1200 Brussels
BELGIUM
Tel: +32-2-770-2198
Fax: +32-2-770-8505
euro.ofc@computer.org

IEEE Computer Society
Ooshima Building
2-19-1 Minami-Aoyama
Minato-ku, Tokyo 107
JAPAN
Tel: +81-3-3408-3118
Fax: +81-3-3408-3553
tokyo.ofc@computer.org

Editorial production by Penny Storms
Cover by Kerry Bedford and Alex Torres
Printed in the United States of America by KNI, Inc.

The Institute of Electrical and Electronics Engineers, Inc.
`
Contents

Message from the General Chair
Message from the Program Chair
Conference Committee
Referees

Session 1: Invited Speaker
From ASCI to Teraflops
John Hopson, Accelerated Strategic Computing Initiative (ASCI)

Session 2A: Scheduling 1
Gang Scheduling for Highly Efficient Distributed Multiprocessor Systems
H. Franke, P. Pattnaik, and L. Rudolph
Integrating Polling, Interrupts, and Thread Management
K. Langendoen, J. Romein, R. Bhoedjang, and H. Bal
A Practical Processor Design for Multithreading
M. Amamiya, T. Kawano, H. Tomiyasu, and S. Kusakabe

Session 2B: Routing
Analysis of Deadlock-Free Path-Based Wormhole Multicasting in Meshes in Case of Contentions
E. Fleury and P. Fraigniaud
Efficient Multicast in Wormhole-Routed 2D Mesh/Torus Multicomputers: A Network-Partitioning Approach
S-Y. Wang, Y-C. Tseng, and C-W. Ho
Turn Grouping for Efficient Multicast in Wormhole Mesh Networks
K-P. Fan and C-T. King

Session 3A: Applications and Algorithms
A3: A Simple and Asymptotically Accurate Model for Parallel Computation
A. Grama, V. Kumar, S. Ranka, and V. Singh
Fault Tolerant Matrix Operations Using Checksum and Reverse Computation
Y. Kim, J.S. Plank, and J.J. Dongarra
A Statistically-Based Multi-Algorithmic Approach for Load-Balancing Sparse Matrix Computations
S. Nastea, T. El-Ghazawi, and O. Frieder

Session 3B: Petaflops Computing / Point Design Studies
Pursuing a Petaflop: Point Designs for 100 TF Computers Using PIM Technologies
P.M. Kogge, S.C. Bass, J.B. Brockman, D.Z. Chen, and E. Sha
Hybrid Technology Multithreaded Architecture
G. Gao, K.K. Likharev, P.C. Messina, and T.L. Sterling
The Illinois Aggressive Coma Multiprocessor Project (I-ACOMA)
J. Torrellas and D. Padua
`
Panel Session: How Do We Break the Barrier to the Software Frontier?
Panel Chair: Rick Stevens, Argonne National Laboratory

Session 4: Invited Speaker

Session 5A: Scheduling 2
Largest-Job-First-Scan-All Scheduling Policy for 2D Mesh-Connected Systems
S-M. Yoo and H.Y. Youn
Scheduling for Large-scale Parallel Video Servers
M-Y. Wu and W. Shu
Effect of Variation in Compile Time Costs on Scheduling Tasks on Distributed Memory Systems
S. Darbha and S. Pande

Session 5B: SIMD
Processor Autonomy and Its Effect on Parallel Program Execution
D.M. Hawver and G.B. Adams III
Particle-Mesh Techniques on the MasPar
P. MacNeice, C. Mobarry, and K. Olson
MIMD Programs on SIMD Architectures
M-Y. Wu and W. Shu

Session 6A: I/O Techniques
Intelligent, Adaptive File System Policy Selection
T.M. Madhyastha and D.A. Reed
An Abstract-Device Interface for Implementing Portable Parallel-I/O Interfaces
R. Thakur, W. Gropp, and E. Lusk
PMPIO - A Portable Implementation of MPI-IO
S.A. Fineberg, P. Wong, B. Nitzberg, and C. Kuszmaul
Disk Resident Arrays: An Array-Oriented I/O Library for Out-Of-Core Computations
J. Nieplocha and I. Foster

Session 6B: Memory Management
Hardware-Controlled Prefetching in Directory-Based Cache Coherent Systems
W. Hu and P. Xia
Preliminary Insights on Shared Memory PIC Code Performance on the Convex Exemplar SPP1000
P. MacNeice, C.M. Mobarry, J. Crawford, and T.L. Sterling
Scalability of Dynamic Storage Allocation Algorithms
A. Iyengar
An Interprocedural Framework for Determining Efficient Data Redistributions in Distributed Memory Machines
S.K.S. Gupta and S. Krishnamurthy
`
Panel Session: Petaflops Alternative Paths
Panel Chair: Paul Messina, California Institute of Technology

Session 7: Invited Speaker
Independence Day
Steven Wallach, HP-Convex

Session 8A: Synchronization
A Fair Fast Distributed Concurrent-Reader Exclusive-Writer Synchronization
T.J. Johnson and H. Yoon
Lock Improvement Technique for Release Consistency in Distributed Shared Memory Systems
S.S. Fu and N-F. Tzeng
A Quasi-Barrier Technique to Improve Performance of an Irregular Application
H.V. Shah and J.A.B. Fortes

Session 8B: Networks
Performance Analysis and Fault Tolerance of Randomized Routing on Clos Networks
M. Bhatia and A. Youssef
Performing BMMC Permutations in Two Passes through the Expanded Delta Network and MasPar MP-2
L.F. Wisniewski, T.H. Cormen, and T. Sundquist
Macro-Star Networks: Efficient Low-Degree Alternatives to Star Graphs for Large-scale Parallel Architectures
C-H. Yeh and E. Varvarigos

Session 9A: Performance Analysis
Modeling and Identifying Bottlenecks in EOSDIS
J. Demmel, M.Y. Ivory, and S.L. Smith
Tools-Supported HPF and MPI Parallelization of the NAS Parallel Benchmarks
C. Clémençon, K.M. Decker, V.R. Deshpande, A. Endo, J. Fritscher, P.A.R. Lorenzo, N. Masuda, A. Muller, R. Ruhl, W. Sawyer, B.J.N. Wylie, and F. Zimmerman
A Comparison of Workload Traces from Two Production Parallel Machines
K. Windisch, V. Lo, D. Feitelson, R. Moore, and B. Nitzberg
Morphological Image Processing on Three Parallel Machines
M.D. Theys, R.M. Born, M.D. Allemang, and H.J. Siegel

Session 9B: Petaflops Computing / Point Design Studies
MORPH: A System Architecture for Robust High Performance Using Customization (An NSF 100 TeraOps Point Design Study)
A.A. Chien and R.K. Gupta
Architecture, Algorithms and Applications for Future Generation Supercomputers
V. Kumar, A. Sameh, A. Grama, and G. Karypis
`
Hierarchical Processors-and-Memory Architecture for High Performance Computing
Z.B. Miled, R. Eigenmann, J.A.B. Fortes, and V. Taylor
A Low-Complexity Parallel System of Gracious, Scalable Performance. Case Study for Near PetaFLOPS Computing
S.G. Ziavras, H. Grebel, and A. Chronopoulos

Author Index
`
MORPH: A System Architecture for Robust High Performance
Using Customization
(An NSF 100 TeraOps Point Design Study)
Andrew A. Chien
Rajesh K. Gupta
Department of Computer Science
University of Illinois
Urbana, Illinois 61801
{achien, rgupta}@cs.uiuc.edu
`
Abstract
Achieving 100 TeraOps performance within a ten-year horizon will require
massively-parallel architectures that exploit both commodity software and
hardware technology for cost efficiency. Increasing clock rates and system
diameter in clock periods will make efficient management of communication and
coordination increasingly critical. Configurable logic presents a unique
opportunity to customize bindings, mechanisms, and policies which comprise the
interaction of processing, memory, I/O and communication resources. This
programming flexibility, or customizability, can provide the key to achieving
robust high performance.
The MultiprocessOr with Reconfigurable Parallel Hardware (MORPH) uses
reconfigurable logic blocks integrated with the system core to control
policies, interactions, and interconnections. This integrated configurability
can improve the performance of the local memory hierarchy, increase the
efficiency of interprocessor coordination, or better utilize the network
bisection of the machine. MORPH provides a framework for exploring such
integrated application-specific customizability. Rather than complicate the
situation, MORPH's configurability supports component software and
interoperability frameworks, allowing direct support for application-specified
patterns, objects, and structures. This paper reports the motivation and
initial design of the MORPH system.
1 Introduction
Increasing reliance on computational techniques for scientific inquiry,
complex systems design, faster-than-real-life simulations, and higher-fidelity
human-computer interaction continues to drive the need for ever higher
performance computing systems. Despite rapid progress in basic device
technology [1], and even in uniprocessor computing technology [2], these
applications demand systems scalable in every aspect: processing, memory, I/O,
and particularly communication. In addition to advances in raw processing
power, we must also achieve dramatic improvements in system usability.
Present-day scalable systems are still quite difficult to program, in many
cases effectively precluding use of the most sophisticated (and most
efficient) algorithms. Even when successfully used, most systems exhibit
substantial performance fragility due to rigid architectural choices that do
not work well across different applications.
Based on technology projections for the 2007 design window for the NSF point
design studies, our analyses indicate that increases in communication cost
relative to computation (gate speeds) make configurable logic practical in an
ever broader range of the system. The benefit of configurable logic is that it
can be used to customize the machine's behavior to better match that required
by the application; in essence, a machine can be tuned for each application
with little or no performance penalty for this generality. While a broad
variety of such architectures are possible, MORPH is a design point which
explores the potential of integrating configurability deep into the system
core.
Because technology trends continue to increase the importance of
communication, the MORPH architecture focuses on exploiting configurability to
manage locality, communication, and coordination. In particular, the MORPH
design study seeks improved efficiency and scalability by exploring novel
mechanisms for binding and for the interaction of processing, memory, I/O, and
communication resources. Other innovations explore flexible hardware
granularity (e.g. mechanisms and association with processors and memory) and
memory system management (e.g. cache coherence, prefetching, and other data
management policies).

1088-4955/96 $5.00 © 1996 IEEE

Authorized licensed use limited to: UCLA Library. Downloaded on June 17, 2020 at 17:44:03 UTC from IEEE Xplore. Restrictions apply.
Because a wealth of studies indicates that no fixed policies (or even wiring
configurations) are optimal, MORPH seeks to exploit application structure and
behavior to adapt for efficient execution. Traditionally in high performance
computing systems, this information has been provided by optimizing compilers
(vectorizing or parallelizing compilers). However, as the use of component
software and interoperability frameworks proliferates, we expect such analysis
to become increasingly difficult. Therefore, we are exploring a broad range of
techniques to identify opportunities for customization based on aggressive
compiler analysis, user annotations, profiling, and even on-the-fly monitoring
and adaptation.
Because the field of configurable computing is still in its nascence, the
range of possible architectures has only begun to be explored. The primary
innovations of MORPH are its focus on integrating configurability into the
system core and its exploitation of opportunities to optimize communication
and coordination. In the remainder of this paper, we sketch the MORPH
architecture. Since the project is in the early stages of design, many of the
detailed design issues are intentionally left open. Up-to-date information on
the project and results is available on the Web site
(http://www-csag.cs.uiuc.edu/projects/morph.html).
The remainder of the paper is organized as follows. Section 2 presents the
technology trends and parameters underlying the design of MORPH. Section 3
presents the overall hardware organization of MORPH, and Section 4 presents
the software architecture. Section 5 describes our evaluation environment and
examples of customizability used by the MORPH machine to achieve high
performance. We summarize the design and open issues in Section 6.
`
2 Key Technology Trends
The MORPH design is targeted for construction in 2007, based on the commodity
technology that will be available. Because these technology extrapolations are
critical underpinnings for the design, we discuss them in some detail. The
base processor, memory and packaging technology form the fundamental global
system constraints. Advances in programmable logic and computer-aided design
will make per-application configuration not only feasible but desirable, and
the widespread use of component software will make flexible execution engines
desirable for achieving high performance.
`
Processor, Memory, and Packaging Technology
The continued advance of Si technology will produce remarkable processors and
memory chips, and areal bonding and multi-chip carriers will provide
significant improvements in interchip wiring. However, communication will
remain critical in achieving high performance.
Projections from the Semiconductor Industry Association (SIA) [1] for a decade
hence indicate advanced processes using 0.1 µm feature size. However, at the
deep sub-micron level, feature size as measured by transistor channel length
is increasingly irrelevant for velocity-saturated carrier transport [3]. Both
logic density and speed are dominated by the interconnect density. Pitch for
the finest interconnect is projected at 0.4-0.6 µm. On logic devices, average
interconnect lengths are anywhere from 1,000× to 10,000× the pitch, that is,
up to 6 mm of intra-chip interconnect, with corresponding delay. This limits
the on-chip clock periods to approximately 1 nanosecond, implying that system
performance is going to be increasingly dominated by interconnect delay. With
multiple issue and multiple processors on a chip, for example, four 8-way
superscalar on-chip CPUs, processor chips are anticipated to achieve 32
billion instructions per second (BIPS) at 1 GHz clock rates. Memory
integration will continue its four-fold increase in density every 3 years,
achieving 16 Gbit DRAM and 4 Gbit SRAM chips [4].
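The two headline figures in this paragraph follow from simple arithmetic. The sketch below reproduces them under the assumption that peak rate is the product of CPU count, issue width, and clock rate; the function names are illustrative and do not come from the paper.

```python
# Back-of-envelope check of two figures quoted in the text.
# Function and parameter names are hypothetical, chosen for illustration.

def peak_bips(cpus_per_chip, issue_width, clock_ghz):
    """Peak instruction rate in billions of instructions per second,
    assuming every CPU issues at full width on every cycle."""
    return cpus_per_chip * issue_width * clock_ghz


def interconnect_length_mm(pitch_um, pitch_multiple):
    """Length in millimetres of a wire that spans `pitch_multiple` pitches."""
    return pitch_um * pitch_multiple / 1000.0


# Four 8-way superscalar CPUs at 1 GHz.
chip_bips = peak_bips(cpus_per_chip=4, issue_width=8, clock_ghz=1.0)

# Shortest and longest average interconnect for the projected pitch range.
shortest_wire_mm = interconnect_length_mm(pitch_um=0.4, pitch_multiple=1_000)
longest_wire_mm = interconnect_length_mm(pitch_um=0.6, pitch_multiple=10_000)

print(chip_bips, shortest_wire_mm, longest_wire_mm)
```

The upper end of the wire-length range (0.6 µm pitch at 10,000 pitches) is where the 6 mm figure comes from.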
At a voltage level of 1200-1500 mV, per-chip power consumption is expected to
be limited to below 200 watts. Total system power is expected to be 163
kilowatts for the MORPH 100 TeraOp configuration, including a 10% loss in
power distribution and management. Packaging advances using reduced on-chip
solder bumps are expected to increase I/Os by 10× to approximately 5000, with
90% usable as signal pins. One word of communication every 100 operations
would require signaling rates of 30-40 GHz, and even one word every 1,000
operations would require 3-4 GHz, well in excess of the SIA's 375 MHz
projected off-chip signaling rate [4], indicating that communication is again
a critical issue.
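To make the size of this gap concrete, the following sketch divides the required signaling rates quoted above by the projected 375 MHz off-chip rate. It uses only numbers stated in the text; how the authors derived the 30-40 GHz and 3-4 GHz requirements is not reproduced here, and the names are illustrative.

```python
# Ratio between required and projected off-chip signaling rates.
# All rates are taken from the paragraph above; names are hypothetical.

SIA_OFFCHIP_MHZ = 375.0  # projected off-chip signaling rate [4]


def shortfall(required_ghz, available_mhz=SIA_OFFCHIP_MHZ):
    """How many times faster the required rate is than the available rate."""
    return required_ghz * 1000.0 / available_mhz


# operations per communicated word -> (low, high) required rate in GHz
required = {100: (30.0, 40.0), 1000: (3.0, 4.0)}

for ops_per_word, (low_ghz, high_ghz) in required.items():
    print(ops_per_word, round(shortfall(low_ghz), 1), round(shortfall(high_ghz), 1))
```

Even the relaxed case of one word per 1,000 operations asks for roughly an order of magnitude more than the projected rate.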
`
Programmable Logic and Computer-Aided Design (CAD)
Advances in programmable logic will make its use viable in a much broader
range of system elements. In addition, the maturation of CAD technology will
allow the flexible exploitation of configurable resources to customize for
individual programs.
Programmable logic densities and speed are increasing in similar fashion to
SRAM, approximately four times every three years. The routability and gate
utilization in reprogrammable devices will continue to increase at a rapid
rate of 5-15% per year due to better algorithms, CAD tools, and additional
routing resources. State of the art programmable logic devices, implemented in
a 0.35 µm process technology, provide 100,000 gates and achieve propagation
delays of less than 5 ns. As an alternative organization, current FPGA devices
also offer up to 40 Kbits of multifunctional memory in addition to over 60,000
usable gates for the implementation of specialized hardware functions such as
arithmetic or DSP functions. Unlike standard SRAM parts, the embedded memory
can be used as multiple-ported SRAM or FIFO, providing much greater
flexibility in system organization. Current efforts in FPGA design and
architecture show significant improvements in the efficiency of the on-chip
memory blocks. This trend in memory efficiency and utilization is expected to
continue and to close the gap between FPGA and SRAM densities, so that up to
10 Mbits of multi-functional embedded memory would be available in addition to
the logic blocks in single-chip reprogrammable devices. In the ten-year
period, programmable logic integration levels are expected to reach 2 million
gates.
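The "four times every three years" rule used here, and for memory density earlier, is ordinary compound growth. The helper below is a minimal sketch of that rule; its names and defaults are assumptions for illustration, not part of the paper.

```python
# Compound growth at a fixed factor per period.
# Names and default values are hypothetical.

def growth_factor(years, factor_per_period=4.0, period_years=3.0):
    """Total multiplicative growth after `years`."""
    return factor_per_period ** (years / period_years)


def project(base_value, years, **kwargs):
    """Scale `base_value` forward by `years` of compound growth."""
    return base_value * growth_factor(years, **kwargs)


one_period = growth_factor(3)    # a single three-year period
four_periods = growth_factor(12)  # four periods compound multiplicatively
print(one_period, four_periods)
```

Four periods compound to a factor of 256. The ten-year figures quoted in the text (2 million gates, 10 Mbits of embedded memory) are the authors' projections and are not derived from this formula.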
MORPH's design exploits small blocks of reprogrammable interconnect logic to
achieve application customizability rather than application-specific
functional units or compute elements. Perhaps the most compelling reason for
the incorporation of (small) reprogrammable logic blocks even in custom
processors has to do with the comparatively low and decreasing delay penalty
for reprogrammable logic blocks compared to medium and long interconnect
delays. As transistor switching delays scale down to the 10 ps range (a factor
of 1/10 from current delays), the interconnect delays scale down much more
slowly. This is because the local interconnect length scales down with pitch
(1/3), while the global interconnect length actually increases with increasing
die size [5]. As a result, the marginal cost of adding switching logic in the
interconnect decreases tremendously, and such logic may even become necessary
to provide signal "repeaters" for moderate length interconnects. We are
already beginning to see SRAM-based reprogrammable logic (FPGA and CPLD) parts
used widely in final product designs. This transition to field reprogrammable
parts is expected to continue with increasing time-to-market pressures.
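The scaling argument above can be illustrated with a rough calculation. The sketch assumes, purely for illustration, that wire delay is proportional to wire length; real interconnect delay also depends on resistance and capacitance per unit length, which the paper does not quantify here. The function name is hypothetical.

```python
# Relative growth of wire delay versus gate delay under scaling.
# ASSUMPTION (not from the paper): wire delay is proportional to wire length.

def relative_wire_cost(gate_delay_scale, wire_length_scale):
    """Factor by which wire delay grows relative to gate delay."""
    return wire_length_scale / gate_delay_scale


# Gate delay shrinks to 1/10; local wires shrink with pitch to 1/3.
local_wires = relative_wire_cost(gate_delay_scale=1 / 10, wire_length_scale=1 / 3)

# Global wires do not shrink at all (they grow with die size), so a
# length scale of 1.0 gives a lower bound.
global_wires_lower_bound = relative_wire_cost(gate_delay_scale=1 / 10, wire_length_scale=1.0)

print(local_wires, global_wires_lower_bound)
```

Under this simplification, local wires become a little over three times more expensive relative to gates and global wires at least ten times more expensive, which is the sense in which added switching logic in the interconnect becomes comparatively cheap.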
Coupled with the advances in technology, the advances in CAD tools and
algorithms are beginning to have an impact on how designs are done today. With
the emphasis on system models in hardware description languages (HDLs) such as
Verilog and VHDL, the process of hardware design is increasingly a
language-level activity, supported by compilation and synthesis tools [6].
These tools are beginning to support a variety of design constraints on
performance, size, power, and even the pin-outs. Locking I/O maps to ensure
that physical design remains unchanged while logical connections are modified
based on applications will soon be a common feature, allowing programmable
logic to be embedded in key modules of a system and to provide on-line
programmability to change hardware functionality. Tools for distributed
hardware control synthesis to allow dynamic binding of hardware resources [7],
and synthesis of protocols to low latency hardware [9, 10, 8] have been
successfully demonstrated. With these CAD and synthesis capabilities, embedded
programmable logic can be inserted into the key parts of systems, and used to
alter behavior dramatically with modest performance overhead.
The architectural implications of these technology trends are enormous: the
underlying assumption that reprogrammability is expensive is already beginning
to be challenged. As gate switching continues to scale well below the 100 ps
range (today), local decision making would cost significantly less than
sending information. Such changes will pave the way for a new class of system
architectures that exploit flexibility to deliver robust, high performance to
applications.
`
Component Software
The exploding complexity of application software is driving increased use of
component software (libraries, toolkits, shared abstractions) and
interoperability frameworks (to glue components together). These application
structures leverage programmer effort, but also imply that programs will
increasingly be written in terms of user abstractions (e.g. objects) rather
than machine abstractions (registers, cache lines, memory locations).
Over the past ten years, there has been an accelerating trend towards
component software, enabling the construction of larger, more complex
applications. For example, graphical user interface programmers are leveraged
by large graphical user interface libraries such as X Windows or MS Windows
'95. Likewise, scientific programmers are increasingly leveraged by libraries
such as LAPACK [11, 12], or increasingly domain or application specific
libraries such as A++/P++, AMR++, POOMA, LPARX++/KeLP [13, 14, 15], whose
millions of lines of code can not only dramatically increase programmer
productivity but, in some cases, through the sophistication of the libraries,
actually deliver higher levels of performance. This phenomenon is occurring in
many domains, programming languages, and system contexts because component
software allows application programmers to focus on the critical new problems,
not solving or reimplementing solutions to the same old ones. Experts expect
this trend to continue [16], with useful libraries proliferating and
increasing in complexity, thereby supporting increased application complexity
and programmer productivity.
Beyond library-based sharing in a single program, the use of
coordination-interoperability frameworks is having a major impact on
application structure and the ease of building large complex applications.
Interoperability frameworks such as OLE, CORBA, SOM, etc. [17, 18] define
standard interfaces for modules, making it possible to build modular software
and compose it with little knowledge of the internal software structure. Thus,
large complex programs will be composed of heterogeneous applications
including computation, visualization, persistent storage, and even on-line
interaction. In the future, we expect widespread use of
coordination/interoperability frameworks to build complex applications from
variegated individual programs. This means that efforts and tools for
optimizing individual programs, coordination frameworks, and entire
applications are all essential.
The increasing complexity in software demands a hardware structure which can
be easily exploited. However, this need for easily accessible performance
comes at a time when hardware technology trends place an increasing importance
on issues such as locality, partitioning, and mapping of computations, that
is, on managing communication. Our solution is a machine architecture which
leverages a small amount of programmable logic in several key places to
implement a flexible hardware composition structure. This flexibility allows
the machine to be configured as convenient or to optimize machine
communication, supporting customizability by software and high performance.
3 System Architecture & Organization
The basic architecture of MORPH reflects the observations that in the 2007
technology window, (1) communication is already critical and getting
increasingly so [19], and (2) flexible interconnects can be used to replace
static wires at competitive performance [20, 21, 22, 23]. The key elements of
the MORPH architecture include processing elements and memory elements
embedded in a scalable interconnect. The scalable interconnect flexibly
connects all parts of the system with fast packet routing, efficiently
exploiting the wiring resources provided by the system packaging [21, 22, 20,
23]. The hardware structure allows adaptation of data transport, coordination,
association (for granularity), and efficient computation. As an example of its
flexibility, MORPH could be used to implement either a cache-coherent machine,
a non-cache coherent machine, or even clusters of cache coherent machines
connected by put/get or message passing. Varying the mix of processing and
memory elements supports a wide range of machine configurations and balances.
Examples of other possible changes include changes in cache block size, branch
predictors, or prefetch policies.

[Figure 1: A Flexible 100 TeraOp Architecture. Recoverable labels: Processing Element; Scalable Interconnect (modules to subsystems); High BW System Level Interconnect (optical or other).]
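One way to picture the configuration space just described is as a small record of per-application choices. The sketch below is purely hypothetical: the paper defines no such interface, and every class name, field, default, and validation rule here is an assumption made for illustration.

```python
# Hypothetical record of the machine-level choices listed in the text.
# None of these names, defaults, or checks come from the MORPH paper.
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class CoherenceMode(Enum):
    CACHE_COHERENT = "cache-coherent machine"
    NON_COHERENT = "non-cache-coherent machine"
    COHERENT_CLUSTERS = "clusters of cache-coherent machines"


class ClusterLink(Enum):
    PUT_GET = "put/get"
    MESSAGE_PASSING = "message passing"


@dataclass(frozen=True)
class MachineConfig:
    coherence: CoherenceMode
    cache_block_bytes: int = 64        # illustrative default
    prefetch_policy: str = "none"      # placeholder policy name
    branch_predictor: str = "static"   # placeholder predictor name
    cluster_link: Optional[ClusterLink] = None

    def __post_init__(self):
        clustered = self.coherence is CoherenceMode.COHERENT_CLUSTERS
        # Clusters need an inter-cluster mechanism; other modes must not set one.
        if clustered and self.cluster_link is None:
            raise ValueError("clustered configurations need a cluster_link")
        if not clustered and self.cluster_link is not None:
            raise ValueError("cluster_link only applies to clustered configurations")
        # Power-of-two block size is an assumed constraint for this sketch.
        size = self.cache_block_bytes
        if size <= 0 or size & (size - 1):
            raise ValueError("cache_block_bytes must be a positive power of two")


clustered_config = MachineConfig(
    coherence=CoherenceMode.COHERENT_CLUSTERS,
    cache_block_bytes=128,
    prefetch_policy="stride",
    cluster_link=ClusterLink.PUT_GET,
)
print(clustered_config)
```

Making the record immutable and validating it at construction is a design choice of the sketch; it reflects the idea that a configuration is selected per application rather than mutated piecemeal.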
Our proposed configuration uses 32 BIPS processors and 8192 processing nodes.
The physical memory configuration is driven primarily by cost and packaging
(power budgeting) factors. A cost balanced system (based on today's processor
to memory price ratios and SIA predictions) would be approximately two memory
chips for every processor, or
