`
`IEEE Computer Society
`Workshop on VLSI 2000
`
`
`
`System Design for a System-on-Chip Era
`
`Intel Exhibit 1004 - 1
`
`
`
`
`Proceedings
`
`IEEE Computer Society
`Workshop on VLSI 2000
`
`
`
`System Design for a System-on-Chip Era
`
`27-28 April 2000
`
`Orlando, Florida
`
`Edited by
`
`
`
`Asim Smailagic, Robert Brodersen, and Hugo De Man
`
`Sponsored by the
`IEEE Computer Society
`Technical Committee on VLSI
`
IEEE Computer Society
`
`Los Alamitos, California
`Washington • Brussels • Tokyo
`
`
`
`
Copyright © 2000 by The Institute of Electrical and Electronics Engineers, Inc.
All rights reserved
`
Copyright and Reprint Permissions: Abstracting is permitted with credit to the source. Libraries may photocopy beyond the limits of US copyright law, for private use of patrons, those articles in this volume that carry a code at the bottom of the first page, provided that the per-copy fee indicated in the code is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923.

Other copying, reprint, or republication requests should be addressed to: IEEE Copyrights Manager, IEEE Service Center, 445 Hoes Lane, P.O. Box 1331, Piscataway, NJ 08855-1331.
`
The papers in this book comprise the proceedings of the meeting mentioned on the cover and title page. They reflect the authors' opinions and, in the interests of timely dissemination, are published as presented and without change. Their inclusion in this publication does not necessarily constitute endorsement by the editors, the IEEE Computer Society, or the Institute of Electrical and Electronics Engineers, Inc.
`
IEEE Computer Society Order Number PR00534
ISBN 0-7695-0534-1
ISBN 0-7695-0536-8 (microfiche)
Library of Congress Number 99-069215
`
`
`
Additional copies may be ordered from:

IEEE Computer Society
Customer Service Center
10662 Los Vaqueros Circle
P.O. Box 3014
Los Alamitos, CA 90720-1314
Tel: +1-714-821-8380
Fax: +1-714-821-4641
E-mail: cs.books@computer.org

IEEE Service Center
445 Hoes Lane
P.O. Box 1331
Piscataway, NJ 08855-1331
Tel: +1-732-981-0060
Fax: +1-732-981-9667
http://shop.ieee.org/store/
customer-service@ieee.org

IEEE Computer Society
Asia/Pacific Office
Watanabe Bldg., 1-4-2
Minami-Aoyama
Minato-ku, Tokyo 107-0062
JAPAN
Tel: +81-3-3408-3118
Fax: +81-3-3408-3553
tokyo.ofc@computer.org
`
`
`
Editorial production by Anne Rawlinson

Cover art production by Joe Daigle/Studio Productions

Printed in the United States of America by The Printing House
`
`
`
`
Table of Contents

Message from the General Chairs ...................................................................................... ix

Message from the Technical Program Chairs .................................................................... xi

Workshop Committees ..................................................................................................... xiii

Steering Committee ........................................................................................................... xv

System Level Design Methods and Examples I

Mead-Conway VLSI Design Approach and System Design Challenges Ahead ................. 3
   Lynn Conway

Alternative Architectures for Video Signal Processing ........................................................ 5
   W. Wolf

PicoRadio: Ad-hoc Wireless Networking of Ubiquitous Low-energy
Sensor/Monitor Nodes ......................................................................................................... 9
   J. Rabaey, J. Ammer, J. L. da Silva Jr., and D. Patel

System Level Design Methods and Examples II

A System-level Approach to Power/Performance Optimization in Wearable
Computers ......................................................................................................................... 15
   A. Smailagic, D. Reilly, and D. P. Siewiorek

Emerging Trends in VLSI Test and Diagnosis ................................................................... 21
   Y. Zorian

Multilanguage Design of a Robot Arm Controller: Case Study ......................................... 29
   G. Nicolescu, P. Coste, F. Hessel, P. LeMarrec, and A. A. Jerraya

Low Power Design

Instruction Scheduling Based on Energy and Performance Constraints .......................... 37
   A. Parikh, M. Kandemir, N. Vijaykrishnan, and M. J. Irwin

Dynamic Voltage Scaling Techniques for Distributed Microsensor Networks .................. 43
   R. Min, T. Furrer, and A. Chandrakasan
`
`v
`
`
`
`
Reducing the Power Consumption in FPGAs with Keeping a High Performance
Level .................................................................................................................................. 47
   A. D. Garcia G., W. P. Burleson, and J. L. Danger

Multiple Access Caches: Energy Implications ................................................................... 53
   H. S. Kim, N. Vijaykrishnan, M. Kandemir, and M. J. Irwin

System Level Design Examples

Improved Synchronization Methodologies for High Performance Digital
Systems ............................................................................................................................. 61
   S. Welch and K. Kornegay

Low Power VLSI Architecture for 2D-Mesh Video Object Motion Tracking ...................... 67
   W. Badawy and M. Bayoumi

Timing Issues in System Design

Architectural Adaptation in AMRM Machines .................................................................... 75
   R. Gupta

An Empirical and Analytical Comparison of Delay Elements and a New
Delay Element Design ....................................................................................................... 81
   N. Mahapatra, S. Garimella, and A. Tareen

Software System Design and Design Environment

Specification and Validation of Information Processing Systems by Process
Encapsulation and Symbolic Execution ............................................................................ 89
   W. Boßung, T. Geyer, S. A. Huss, and L. Wehmeyer

Interaction in Language Based System Level Design using an Advanced
Compiler Generator Environment ..................................................................................... 97
   I. Poulakis, G. Economakos, and P. Tsanakas

On Design-for-reusability in Hardware Description Languages ...................................... 103
   J. M. Chang and S. K. Agun

Analysis and Synthesis of Asynchronous Circuits

Fine-grain Pipelined Asynchronous Adders for High-speed DSP Applications .............. 111
   M. Singh and S. M. Nowick

A Low-latency FIFO for Mixed-clock Systems ................................................................ 119
   T. Chelcea and S. M. Nowick

Advances in Multiplier Design

Reconfigurable Low Energy Multiplier for Multimedia System Design ........................... 129
   S. Kim and M. C. Papaefthymiou

An Algorithmic Approach to Building Datapath Multipliers Using (3,2) Counters ........... 135
   H. Al-Twaijry and M. Aloqeely
`
`vi
`
`
`
`
Issues in System Design

RT-level Interconnect Optimization in DSM Regime ....................................................... 143
   S. Katkoori and S. Alupoaei

Validation of Complex Designs through Hardware Prototyping ...................................... 149
   G. Stoler

On-line Error Detection in Multiplexor Based FPGAs ..................................................... 155
   A. Jaekel

Author Index .................................................................................................................... 161
`
`vii
`
`
`
`
`Architectural Adaptation in AMRM Machines
`
`Rajesh Gupta
`
`Information and Computer Science
`University of California, Irvine
`Irvine, CA 92697
rgupta@ics.uci.edu
`
Abstract

Application adaptive architectures use architectural mechanisms and policies to achieve system level performance goals. The AMRM project at UC Irvine focuses on adaptation of the memory hierarchy and its role in latency and bandwidth management. This paper describes the architectural principles and the first implementation of the AMRM machine proof-of-concept prototype.

1. Introduction

Modern computer system architectures represent design tradeoffs and optimizations involving a large number of variables in a very large design space. Even when successfully implemented for high performance, which is benchmarked against a set of representative applications, the performance optimization is only in an average sense. Indeed, the performance variation across applications, and against changing data sets even in a given application, can easily be an order of magnitude [1]. In other words, delivered performance can be less than one tenth of the system performance that the underlying hardware is capable of.

A primary reason for this fragility in performance is that rigid architectural choices related to the organization of major system blocks (CPU, cache, memory, I/O) do not work well across different applications.

Architectural adaptivity provides an attractive means to ensure robust high performance. Architectural adaptation refers to the capability of a machine to support multiple architectural mechanisms and policies that can be tailored to application and/or data needs [2]. There are a number of places where architectural adaptivity can be used, for instance, in tailoring the interaction of processing with I/O, or in the customization of CPU elements (e.g., splittable ALU resources).

In view of the microelectronic technology trends that emphasize the increasing importance of communication at all levels, from network interfaces to on-chip interconnection fabrics, communication represents the focus of our studies in architectural adaptation. In this context, memory system latency and bandwidth are key determining factors in the performance of high performance machines because they can provide a constant multiplier on the achievable system performance [1]. Further, this multiplier worsens as memory latency fails to improve as fast as processor clock speeds.

Consider a hypothetical machine with processing elements running at 2 GHz with eight-way superscalar pipelines. Assuming a typical 1 microsecond round-trip latency for a cache miss, a miss costs about 16K instruction slots, of which an average 30%, or 4800 instructions, are loads/stores. For single-thread execution, a miss rate as low as 0.02% reduces computing efficiency by as much as 50%. This points to a need for very low miss rates to ensure that high-throughput CPUs can be kept busy. A similar analysis of the bisection bandwidth concludes that active bandwidth management is required to reduce the need for communication and to increase the number of operations between communications.

2. The AMRM Project

The Adaptive Memory Reconfiguration Management (AMRM) project at the University of California, Irvine aims to find ways to improve the memory system performance of a computing system. The basic system architecture reflects the view that communication is already critical and getting increasingly so [3], and flexible
`
0-7695-0534-1/00 $10.00 © 2000 IEEE
`
`75
`
`
`
`
interconnects can be used to replace static wires at competitive performance in interconnect-dominated microelectronic technologies [4,5,6]. The AMRM machine uses reconfigurable logic blocks integrated with the system core to control policies, interactions, and interconnections of memory to processing [7]. The basic machine architecture supports application-specific cache organization and policies, hardware-assisted blocking, prefetching, and dynamic cache structures (such as stream and victim caches, stride prediction, and miss history buffers) that optimize the movement and placement of application data through the memory hierarchy. Depending upon the hardware technology used and the support available from the runtime environment, this adaptation can be done statically or at run-time. In the following section we describe a specific mechanism for latency management that is shown to provide a significant performance boost for the class of applications characterized by frequent accesses to linked data structures scattered in physical memory. This includes algorithms that operate on sparse matrices and linked trees.
`
2.1 Adaptation for Latency Management

Latency management refers to techniques for hiding the long latencies of memory accesses behind useful computation. The most common technique for latency hiding is prefetching of data to the CPU. Prefetching is combined with smaller and faster (cache) memory elements that attempt to prefetch application context(s) rather than single data elements. Pointer-based accesses to data items in memory hierarchies typically yield poor results because the indirection introduces main memory and memory hierarchy latencies into the innermost computational loop. Techniques such as software prefetching (loop unrolling and hoisting of loads) do not adequately solve the problem, as prefetching at the processor leaves the multi-level memory hierarchy in the critical path. Purely hardware prefetching [8] is also often ineffective because the address references generated by an application may contain no particular address structure. We use an application-specific prefetching scheme that can reside in dedicated hardware at any level of the memory hierarchy, in all of them, or bypass them completely. This hardware performs application-specific prefetching based on the address ranges of the data structures used. When there is a reference to an address inside this range, the prefetch hardware prefetches the "next" element pointed to by the current element. The pointer field for the next element can be changed at runtime.
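The range-triggered pointer prefetch described above can be sketched in software as follows. This is a minimal behavioral model, not the AMRM hardware interface: the class name, the word-addressed dictionary memory, and the single-entry-deep prefetch buffer are all illustrative assumptions.

```python
class PrefetchUnit:
    """Behavioral sketch: watch demand accesses, and when one falls inside the
    configured address range of a linked structure, fetch the element that the
    current node's pointer field names."""

    def __init__(self, lo, hi, next_offset, memory):
        self.lo, self.hi = lo, hi        # address range of the linked structure
        self.next_offset = next_offset   # offset of the "next" pointer in a node
        self.memory = memory             # word-addressed backing store (dict)
        self.buffer = {}                 # tiny prefetch buffer: addr -> value

    def observe(self, addr):
        """Called on every demand access seen by the hardware."""
        if self.lo <= addr < self.hi:
            nxt = self.memory.get(addr + self.next_offset)
            if nxt:                      # a null next-pointer ends the chain
                self.buffer[nxt] = self.memory.get(nxt)
```

For example, with a two-node list whose nodes hold a payload word followed by a next-pointer word, observing an access to the head brings the second node into the prefetch buffer before the CPU dereferences the pointer itself.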
`
`76
`
This prefetch hardware is combined, for some applications, with address translation and compaction hardware in the memory controller, which works well with data structures that do not quite fit into a single cache line. The address translation is done transparently to the application, using a hardware assist (translate) in the cache controller and a corresponding hardware assist (gather) in the memory controller. Simulation results using this prefetch hardware show a 10X reduction in read miss rates and a 100X reduction in data volume for sparse matrix multiply operations [9].
`
3. The AMRM System Prototype

While AMRM simulation results continue to provide valuable insights into the space of architectural mechanisms and their effectiveness [7][9][10], a system implementation is needed to bring together different parts of the AMRM project (including compiler and runtime system algorithms to support adaptivity). The AMRM system prototype is divided into two phases. The first phase consists of the implementation of a board-level prototype; it is followed by a second-phase single-chip implementation of the cache memory system. At the time of this writing, the first-phase prototype implementation has recently been completed. The rest of this paper describes the system design and implementation of the Phase I prototype and its relationship to the ongoing second-phase ASIC prototype.

The AMRM Phase I prototype board is designed to serve two purposes. It can simulate a range of memory hierarchies for applications running on a host processor; the board supports configurability of the cache memory via an on-board FPGA-based memory controller. The board is also designed to be used, in the future, as a complete system platform, with on-board memory serving as main memory, via a mezzanine card containing the Phase II AMRM ASIC implementation.

Whereas the goal of the AMRM prototype is to build an adaptive cache memory system, in general the memory hierarchy performance cannot be decoupled from the processor instruction set architecture, its implementation, and the compiler implementation. (This is particularly true of the CPU-L1 path, which is often pipelined using non-blocking caches.) It would, therefore, be desirable to evaluate any proposed changes to the memory hierarchy for several processor architectures rather than being tied to one specific processor type and its
`
`
`
software. In order to be able to evaluate the effect with different processor architectures, as well as to circumvent the implementation difficulties, our Phase I implementation attaches an additional memory hierarchy to a system through a standard peripheral bus. Thus, the board provides a PCI interface that allows a host processor to use the board as a part of its memory hierarchy. Applications running on the host processor are instrumented automatically, using the AMRM compiler, to use the memory on the AMRM board. Direct program execution can therefore proceed on the host processor while the extra memory hierarchy is being exercised.
`
3.1 AMRM System Goals

One goal of the AMRM prototype system is that it be adaptable to many different memory hierarchy architectures. Another goal is that it be useful for real program execution or even memory simulations. The latter is accomplished by making the AMRM memory available to the user and converting the user program to access this memory "directly". The former is accomplished through the use of a sequence of address/command type requests "run" through various memory system configurations. The AMRM system is intended to be fast enough to support extensive execution or simulation.
`
A CPU interfaces to the reconfigurable AMRM memory system through the PCI bus. AMRM accepts CPU PCI requests for memory operations, issues them to the attached memory system, and sends back the data for memory read operations as well as memory access time information.
`
3.2 AMRM prototype architecture

Figure 1 shows the main components of the AMRM prototype board. It consists of a general 3-level memory hierarchy plus support for the AMRM ASIC chip implementing architectural assists within the CPU-L1 datapath. The host interface is managed by a PLX 9080 chip. The FPGAs on the board contain controllers for the SRAM, DRAM, and L1 cache. A 1 MB SRAM is used as the tag and data store for the L1 cache. A total of 512 MB of DRAM is provided to implement part of the cache hierarchy (and also to serve as main memory by reloading the memory controller into the FPGA).

The board implementation necessarily hard-wires certain parameters of the memory hierarchy, including the board's clock. In order to perform detailed and accurate simulation of diverse memory hierarchy configurations at any clock speed, a hardware "virtual" clock has been implemented as part of the performance monitoring hardware. The performance monitoring hardware primarily consists of various event counters, which are memory-mapped and readable from the host processor. The "virtual clock" emulates a target system's clock: the clock rate is determined by the target system's memory hierarchy design and technology parameters. For example, the delays for an L1 cache hit, a miss fetch, etc., in terms of virtual clocks can be configured by the host to emulate a given target cache design. The use of the virtual clock thus allows us to simplify the hardware implementation. For instance, the tag and data stores of the L1 cache can be a single RAM while the timing may reflect a design with two separate RAMs.
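As an illustration, the host-side timing configuration could be modeled as in the sketch below. The parameter names and values are hypothetical, not the board's actual register map; the point is only that per-event virtual-clock costs are programmable, so one physical board can mimic many target cache designs.

```python
# Hypothetical virtual-clock timing parameters for one emulated target design.
# These values are illustrative assumptions, not AMRM's documented defaults.
target_timing = {
    "l1_hit_vclks": 1,          # cost of an L1 hit, in virtual clocks
    "l1_miss_fetch_vclks": 12,  # extra cost of a miss fetch from the next level
}

def access_cost_vclks(hit):
    """Virtual-clock cost of one access under the configured target design."""
    base = target_timing["l1_hit_vclks"]
    return base if hit else base + target_timing["l1_miss_fetch_vclks"]
```

Re-running the same workload after rewriting `target_timing` then yields virtual-time results for a different emulated cache without any hardware change.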
`
3.3 Command Interface to the AMRM Board
`
`The memory hierarchy on the AMRM board can
`be used by an application running on the host
`processor by writing commands to specific addresses
`in the PCI address space. Each command consists of
`a set of four words that specify the operation (e.g.,
`memory read/write, register read/write), the address
`of the location to access and data in case of a write.
`For read commands, a read response is generated and
`data is written into the host's memory.
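The four-word command format might be modeled as in the following sketch. The opcode values, field order, and the use of the fourth word are assumptions for illustration, not the board's documented encoding.

```python
# Illustrative opcodes; the actual AMRM command encoding is not specified here.
MEM_READ, MEM_WRITE, REG_READ, REG_WRITE = range(4)

def make_command(op, addr, data=0):
    """Pack one command as the four words the host writes to the board's
    command region in PCI address space: operation, address, data (for
    writes), and a spare word (an assumption in this sketch)."""
    return [op, addr, data, 0]

# A host-side memory write to the board's hierarchy would then look like:
cmd = make_command(MEM_WRITE, 0x1000, 0xDEADBEEF)
```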
`
`For debugging purposes and to enable the cache
`to be flushed by the host, there are commands to
`access memory banks directly, i.e., without going
`through the caches. Commands are also available to
`read/write the status, configuration registers and
`performance counters.
`
The onboard command processor reads a command and launches its execution on the AMRM board. Data is read from the cache and sent back to the processor if it hits in the cache; this takes m virtual clock cycles. Otherwise the data is requested from the next level in the hierarchy. Writes take n virtual clocks. Upon load command completion the data can be written into the system memory for access by the host processor. Both parameters m and n can be programmed under compiler control.
`
3.4 Virtual Clock System

The virtual clock system consists of a master clock counter, Vtime; the virtual clock signal, Vclk;
`
`
`
`
Ready inputs; and the associated virtual clock generation logic. The Ready input from each major memory hierarchy module specifies that the unit has completed the current virtual clock cycle's activities. A new Vclk edge/period is generated when all Ready inputs reach 1. Vtime can be read out by the host processor to determine the current virtual time. It can also be automatically supplied to the CPU via host memory (as opposed to the AMRM board memory).

Each major unit in the memory hierarchy is designed to generate Ready and wait for the Vclk, when appropriate. In most cases the designs actually use AutoReady counters local to each unit, which can be loaded with a programmable number of cycles. An AutoReady counter generates Ready using the Vclk while its output is non-zero. An idle unit not processing any requests also outputs a Ready signal every Vclk. For instance, consider an AMRM Read command addressed to the on-board memory hierarchy. It includes a delay (ΔT) from the previous memory access. This delay is used to advance the virtual clock forward before starting the new access; this is accomplished by loading it into the AutoReady counter. This allows other memory hierarchy activity to proceed in parallel with CPU computation. For instance, a prefetch unit may be accessing memory during the ΔT-cycle delay.
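The handshake above can be sketched in software as follows. Class and method names are illustrative, not from the AMRM design: a Vclk edge fires only when every unit asserts Ready, Vtime counts the edges, and an AutoReady counter simply drains a pre-loaded number of virtual cycles while the unit keeps asserting Ready.

```python
class AutoReadyUnit:
    """Unit that asserts Ready while idle or while its loaded delay drains;
    it withholds Ready only while real (physical) work is still in flight."""

    def __init__(self):
        self.delay = 0       # AutoReady counter: virtual cycles still to burn
        self.busy = False    # True while the unit's real hardware is not done

    def ready(self):
        return not self.busy  # idle and auto-counting units both assert Ready

    def on_vclk(self):
        if self.delay > 0:
            self.delay -= 1   # counter drains one virtual cycle per Vclk edge

class VirtualClock:
    """Generates a Vclk edge when all Ready inputs reach 1; Vtime counts edges."""

    def __init__(self, units):
        self.units = units
        self.vtime = 0        # master virtual-time counter

    def tick(self):
        """One evaluation of the generation logic: fire an edge if possible."""
        if all(u.ready() for u in self.units):
            self.vtime += 1
            for u in self.units:
                u.on_vclk()
            return True
        return False
```

Loading a ΔT delay into a unit's AutoReady counter lets virtual time advance for ΔT edges while, as the paper notes, other units (e.g., a prefetcher) do real work in parallel; a unit marked busy stalls the whole virtual clock until it finishes.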
`
Figure 1: AMRM Phase I Prototype Board

78

3.5 AMRM chip functionality

The AMRM board provides for incorporation of an AMRM chip that uses an ASIC implementation of
the AMRM cache assist mechanisms. It is positioned between the L1 cache and the rest of the system and can be accessed in parallel with the L2 cache. It can thus accept and supply data coming from or going to the L1 cache; for instance, it may contain a write buffer or a prefetch unit to access L2. It also has access to the memory interface and can thus, for instance, prefetch from memory. The AMRM ASIC design is currently in progress. This chip will include a processor core with an adaptive memory hierarchy. When plugged into the AMRM board, the ASIC will use onboard DRAM as main memory by simply reconfiguring the memory controller in FPGA1.
`
4. Summary

Traditional computer system architectures are designed for the best machine performance averaged across applications. Due to the static nature of these architectures, such machines are limited in exploiting application characteristics unless these are common to a large number of applications. Unfortunately, a number of studies have shown that no single machine organization fits all applications; therefore, the delivered performance is often only a small fraction of the peak machine performance. We therefore believe that there are significant opportunities for application-specific architectural adaptation. The focus of the AMRM project is on architectural adaptations that close the gap between processor and memory speed by intelligent placement of data through the memory hierarchy. Our current work has demonstrated performance gains due to adaptive cache organizations and cache prefetch assists. In the future, we envision adaptive machines that provide a menu of application-specific assists that alter architectural mechanisms and policies in view of the application characteristics. The application developer, with the help of compilation tools, selects appropriate hardware assists to customize the machine to match application needs without having to rewrite the application.
`
5. Acknowledgments

The AMRM project was conceived through the joint effort of Andrew Chien, Alex Nicolau, and Alex Veidenbaum. The team includes Prashant Arora, Chun Chang, Dan Nicolaescu, Rajesh Satapathy, Weiyu Tang, and Xiao Mei Ji. The project is sponsored by DARPA under contract DABT63-98-C-0045.
`
`
`
`
6. References

[1] P. M. Kogge, "Summary of the architecture group findings," in PetaFlops Architecture Workshop (PAWS), April 1996.

[2] A. A. Chien and R. K. Gupta, "MORPH: A System Architecture for Robust High Performance Using Customization," in Proceedings of the Sixth Symposium on the Frontiers of Massively Parallel Computation (Frontiers '96), October 1996.

[3] W. Wulf and S. McKee, "Hitting the memory wall: Implications of the obvious," Computer Architecture News, 1995.

[4] C. L. Seitz, "Let's route packets instead of wires," in Proceedings of the 6th MIT Conference on Advanced Research in VLSI, W. J. Dally, editor, MIT Press, pp. 133-137, 1996.

[5] W. J. Dally and P. Song, "Design of a self-timed VLSI multicomputer communication controller," in Proceedings of ICCD, 1987.

[6] R. K. Gupta, "Analysis of Technology Trends: A Case of Architectural Adaptation in Custom Datapaths," in Proceedings of the SPIE Session on Configurable Computing, Boston, November 1998.

[7] AMRM Website: www.ics.uci.edu/~amrm.

[8] D. F. Zucker, M. J. Flynn, and R. B. Lee, "A comparison of hardware prefetching techniques for multimedia benchmarks," Technical Report CSL-TR-95-683, Computer Systems Laboratory, Stanford University, Stanford, CA 94305, December 1995.

[9] X. Zhang et al., "Architectural adaptation for memory latency and bandwidth optimization," in Proceedings of ICCD, October 1997.

[10] A. Veidenbaum et al., "Adapting Cache Line Size to Application Behavior," in Proceedings of the International Conference on Supercomputing (ICS), June 1999.
`
`79
`