Proceedings
IEEE Computer Society
Workshop on VLSI 2000

System Design for a System-on-Chip Era

Intel Exhibit 1004 - 1
Proceedings

IEEE Computer Society
Workshop on VLSI 2000

System Design for a System-on-Chip Era

27-28 April 2000
Orlando, Florida

Edited by
Asim Smailagic, Robert Brodersen, and Hugo De Man

Sponsored by the
IEEE Computer Society
Technical Committee on VLSI

IEEE Computer Society
Los Alamitos, California
Washington • Brussels • Tokyo
`
Copyright © 2000 by The Institute of Electrical and Electronics Engineers, Inc.
All rights reserved

Copyright and Reprint Permissions: Abstracting is permitted with credit to
the source. Libraries may photocopy beyond the limits of US copyright law,
for private use of patrons, those articles in this volume that carry a code
at the bottom of the first page, provided that the per-copy fee indicated in
the code is paid through the Copyright Clearance Center, 222 Rosewood Drive,
Danvers, MA 01923.

Other copying, reprint, or republication requests should be addressed to:
IEEE Copyrights Manager, IEEE Service Center, 445 Hoes Lane, P.O. Box 1331,
Piscataway, NJ 08855-1331.

The papers in this book comprise the proceedings of the meeting mentioned on
the cover and title page. They reflect the authors' opinions and, in the
interests of timely dissemination, are published as presented and without
change. Their inclusion in this publication does not necessarily constitute
endorsement by the editors, the IEEE Computer Society, or the Institute of
Electrical and Electronics Engineers, Inc.

IEEE Computer Society Order Number PR00534
ISBN 0-7695-0534-1
ISBN 0-7695-0536-8 (microfiche)
Library of Congress Number 99-069215
`
`
`
Additional copies may be ordered from:

IEEE Computer Society
Customer Service Center
10662 Los Vaqueros Circle
P.O. Box 3014
Los Alamitos, CA 90720-1314
Tel: +1-714-821-8380
Fax: +1-714-821-4641
E-mail: cs.books@computer.org

IEEE Service Center
445 Hoes Lane
P.O. Box 1331
Piscataway, NJ 08855-1331
Tel: +1-732-981-0060
Fax: +1-732-981-9667
http://shop.ieee.org/store/
customer-service@ieee.org

IEEE Computer Society
Asia/Pacific Office
Watanabe Bldg., 1-4-2
Minami-Aoyama
Minato-ku, Tokyo 107-0062
JAPAN
Tel: +81-3-3408-3118
Fax: +81-3-3408-3553
tokyo.ofc@computer.org
`
`
`
Editorial production by Anne Rawlinson
Cover art production by Joe Daigle/Studio Productions
Printed in the United States of America by The Printing House
`
Table of Contents

Message from the General Chairs .... ix
Message from the Technical Program Chairs .... xi
Workshop Committees .... xiii
Steering Committee .... xv

System Level Design Methods and Examples I

Mead-Conway VLSI Design Approach and System Design Challenges Ahead .... 3
    Lynn Conway
Alternative Architectures for Video Signal Processing .... 5
    W. Wolf
PicoRadio: Ad-hoc Wireless Networking of Ubiquitous Low-energy
Sensor/Monitor Nodes .... 9
    J. Rabaey, J. Ammer, J. L. da Silva Jr., and D. Patel

System Level Design Methods and Examples II

A System-level Approach to Power/Performance Optimization in Wearable
Computers .... 15
    A. Smailagic, D. Reilly, and D. P. Siewiorek
Emerging Trends in VLSI Test and Diagnosis .... 21
    Y. Zorian
Multilanguage Design of a Robot Arm Controller: Case Study .... 29
    G. Nicolescu, P. Coste, F. Hessel, P. LeMarrec, and A. A. Jerraya

Low Power Design

Instruction Scheduling Based on Energy and Performance Constraints .... 37
    A. Parikh, M. Kandemir, N. Vijaykrishnan, and M. J. Irwin
Dynamic Voltage Scaling Techniques for Distributed Microsensor
Networks .... 43
    R. Min, T. Furrer, and A. Chandrakasan
`
Reducing the Power Consumption in FPGAs with Keeping a High Performance
Level .... 47
    A. D. Garcia G, W. P. Burleson, and J. L. Danger
Multiple Access Caches: Energy Implications .... 53
    H. S. Kim, N. Vijaykrishnan, M. Kandemir, and M. J. Irwin

System Level Design Examples

Improved Synchronization Methodologies for High Performance Digital
Systems .... 61
    S. Welch and K. Kornegay
Low Power VLSI Architecture for 2D-Mesh Video Object Motion Tracking .... 67
    W. Badawy and M. Bayoumi

Timing Issues in System Design

Architectural Adaptation in AMRM Machines .... 75
    R. Gupta
An Empirical and Analytical Comparison of Delay Elements and a New
Delay Element Design .... 81
    N. Mahapatra, S. Garimella, and A. Tareen

Software System Design and Design Environment

Specification and Validation of Information Processing Systems by Process
Encapsulation and Symbolic Execution .... 89
    W. Boßung, T. Geyer, S. A. Huss, and L. Wehmeyer
Interaction in Language Based System Level Design using an Advanced
Compiler Generator Environment .... 97
    I. Poulakis, G. Economakos, and P. Tsanakas
On Design-for-reusability in Hardware Description Languages .... 103
    J. M. Chang and S. K. Agun

Analysis and Synthesis of Asynchronous Circuits

Fine-grain Pipelined Asynchronous Adders for High-speed DSP Applications .... 111
    M. Singh and S. M. Nowick
A Low-latency FIFO for Mixed-clock Systems .... 119
    T. Chelcea and S. M. Nowick

Advances in Multiplier Design

Reconfigurable Low Energy Multiplier for Multimedia System Design .... 129
    S. Kim and M. C. Papaefthymiou
An Algorithmic Approach to Building Datapath Multipliers Using (3,2)
Counters .... 135
    H. Al-Twaijry and M. Aloqeely
`
Issues in System Design

RT-level Interconnect Optimization in DSM Regime .... 143
    S. Katkoori and S. Alupoaei
Validation of Complex Designs through Hardware Prototyping .... 149
    G. Stoler
On-line Error Detection in Multiplexor Based FPGAs .... 155
    A. Jaekel

Author Index .... 161
`
Architectural Adaptation in AMRM Machines

Rajesh Gupta

Information and Computer Science
University of California, Irvine
Irvine, CA 92697
rgupta@ics.uci.edu

0-7695-0534-1/00 $10.00 © 2000 IEEE

Abstract

Application adaptive architectures use architectural mechanisms and
policies to achieve system level performance goals. The AMRM project at UC
Irvine focuses on adaptation of the memory hierarchy and its role in latency
and bandwidth management. This paper describes the architectural principles
and first implementation of the AMRM machine proof-of-concept prototype.

1. Introduction

Modern computer system architectures represent design tradeoffs and
optimizations involving a large number of variables in a very large design
space. Even when successfully implemented for high performance, which is
benchmarked against a set of representative applications, the performance
optimization is only in an average sense. Indeed, the performance variation
across applications, and against changing data sets even in a given
application, can easily be an order of magnitude [1]. In other words,
delivered performance can be less than one tenth of the system performance
that the underlying hardware is capable of.

A primary reason for this fragility in performance is that rigid
architectural choices related to the organization of major system blocks
(CPU, cache, memory, I/O) do not work well across different applications.

Architectural adaptivity provides an attractive means to ensure robust high
performance. Architectural adaptation refers to the capability of a machine
to support multiple architectural mechanisms and policies that can be
tailored to application and/or data needs [2]. There are a number of places
where architectural adaptivity can be used, for instance, in tailoring the
interaction of processing with I/O, customization of CPU elements (e.g.,
splittable ALU resources), etc.

In view of the microelectronic technology trends that emphasize the
increasing importance of communication at all levels, from network
interfaces to on-chip interconnection fabrics, communication represents the
focus of our studies in architectural adaptation. In this context, memory
system latency and bandwidth issues are key determining factors in the
performance of high performance machines because these can provide a
constant multiplier on the achievable system performance [1]. Further, this
multiplier decreases as the memory latency fails to improve as fast as
processor clock speeds.

Consider a hypothetical machine with processing elements running at 2 GHz
with eight-way superscalar pipelines. Assuming a typical 1 microsecond
round-trip latency for a cache miss, this corresponds to about 16K
instructions, with an average 30% or 4800 instructions being load/store. For
a single-thread execution, a miss rate as low as 0.02% reduces computing
efficiency by as much as 50%. This points to a need for very low miss rates
to ensure that high-throughput CPUs can be kept busy. A similar analysis of
the bisection bandwidth concludes that active bandwidth management is
required to reduce the need for communication and to increase the number of
operations before a communication is necessary.

2. The AMRM Project

The Adaptive Memory Reconfiguration Management, or AMRM, project at the
University of California, Irvine aims to find ways to improve the memory
system performance of a computing system. The basic system architecture
reflects the view that communication is already critical and getting
increasingly so [3], and flexible
interconnects can be used to replace static wires at competitive
performance in interconnect-dominated microelectronic technologies [4,5,6].
The AMRM machine uses reconfigurable logic blocks integrated with the system
core to control policies, interactions, and interconnections of memory to
processing [7]. The basic machine architecture supports application-specific
cache organization and policies, hardware-assisted blocking, prefetching,
and dynamic cache structures (such as stream and victim caches, stride
prediction, and miss history buffers) that optimize the movement and
placement of application data through the memory hierarchy. Depending upon
the hardware technology used and the support available from the runtime
environment, this adaptation can be done statically or at run-time. In the
following section we describe a specific mechanism for latency management
that is shown to provide a significant performance boost for the class of
applications characterized by frequent accesses to linked data structures
scattered in the physical memory. This includes algorithms that operate on
sparse matrices and linked trees.
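
The need for latency management can be made concrete by reproducing the estimate given in the Introduction for the hypothetical 2 GHz, eight-way superscalar machine. The following is a minimal sketch, not the authors' model: it assumes a single thread, that every miss stalls the machine for the full 1 microsecond round trip, and that misses do not overlap. The function and variable names are illustrative only.

```python
def miss_cost_in_instructions(clock_hz, issue_width, miss_latency_s):
    """Instruction slots that pass while one cache miss is outstanding."""
    return clock_hz * issue_width * miss_latency_s


def efficiency(miss_rate, load_store_fraction, miss_cost):
    """Fraction of issue slots doing useful work under a full-stall model.

    Assumed model: a fraction load_store_fraction of instructions are
    loads/stores, each of those misses with probability miss_rate, and
    every miss wastes miss_cost instruction slots.
    """
    stall_slots_per_instruction = load_store_fraction * miss_rate * miss_cost
    return 1.0 / (1.0 + stall_slots_per_instruction)


# Parameters taken from the text: 2 GHz, eight-way issue, 1 us miss latency,
# 30% load/store instructions, 0.02% miss rate.
cost = miss_cost_in_instructions(clock_hz=2e9, issue_width=8, miss_latency_s=1e-6)
loads_stores = 0.30 * cost
eff = efficiency(miss_rate=0.0002, load_store_fraction=0.30, miss_cost=cost)

print(f"instruction slots per miss: {cost:.0f}")
print(f"load/store instructions in that window: {loads_stores:.0f}")
print(f"efficiency at 0.02% miss rate: {eff:.2f}")
```

Under these assumptions one miss costs about 16,000 instruction slots, about 4800 of which would have been loads or stores, and a 0.02% miss rate leaves roughly half of the issue slots doing useful work, which agrees with the figures quoted in the Introduction.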
`
2.1 Adaptation for Latency Management

Latency management refers to techniques for hiding long latencies of memory
accesses by useful computation. The most common technique for latency hiding
is prefetching of data to the CPU. Prefetching is combined with smaller and
faster (cache) memory elements that attempt to prefetch application
context(s) rather than single data elements. Pointer-based accesses to data
items in memory hierarchies typically yield poor results because the
indirection introduces main memory and memory hierarchy latencies into the
innermost computational loop. Techniques such as software prefetching (loop
unrolling and hoisting of loads) do not adequately solve the problem, as
prefetching at the processor leaves the multi-level memory hierarchy latency
in the critical path. Purely hardware prefetching [8] is also often
ineffective because the address references generated by an application may
contain no particular address structure. We use an application-specific
prefetching scheme that can reside in dedicated hardware at arbitrary levels
of the memory hierarchy or in all of them, or can bypass them completely.
This hardware performs application-specific prefetching based on the address
ranges of the data structures used. When there is a reference to an address
inside this range, the prefetch hardware will prefetch the "next" element
pointed to by the current element. The pointer field for the next element
can be changed at runtime.
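
The range-triggered prefetching described above can be illustrated with a small software model. This is a sketch under stated assumptions, not the AMRM hardware: memory is a word-addressed dictionary, every reference targets the first word of a two-word element, a prefetched element arrives before it is next used, and the names `RangePrefetcher`, `next_offset`, `build_list`, and `traverse` are hypothetical.

```python
class RangePrefetcher:
    """Toy model of an address-range-triggered pointer prefetcher."""

    def __init__(self, memory, lo, hi, next_offset):
        self.memory = memory            # word address -> stored value
        self.lo, self.hi = lo, hi       # watched address range [lo, hi)
        self.next_offset = next_offset  # pointer field offset, changeable at runtime
        self.buffer = set()             # element addresses already fetched
        self.misses = 0

    def access(self, addr):
        """Reference the element that starts at addr and return its value."""
        if addr not in self.buffer:
            self.misses += 1            # demand fetch from memory
            self.buffer.add(addr)
        if self.lo <= addr < self.hi:
            nxt = self.memory[addr + self.next_offset]
            if nxt is not None:
                self.buffer.add(nxt)    # prefetch the "next" element
        return self.memory[addr]


def build_list(addresses):
    """Lay out a linked list of two-word elements: [value, next pointer]."""
    memory = {}
    for i, addr in enumerate(addresses):
        memory[addr] = i * 10
        memory[addr + 1] = addresses[i + 1] if i + 1 < len(addresses) else None
    return memory


def traverse(prefetcher, head):
    """Application loop: sum the values along the list."""
    total, addr = 0, head
    while addr is not None:
        total += prefetcher.access(addr)
        addr = prefetcher.memory[addr + prefetcher.next_offset]
    return total


scattered = [0x100, 0x740, 0x2C0, 0x910, 0x480]   # elements scattered in memory
mem = build_list(scattered)
with_pf = RangePrefetcher(mem, lo=0x000, hi=0x1000, next_offset=1)
without_pf = RangePrefetcher(mem, lo=0, hi=0, next_offset=1)  # empty range: never triggers

sum_with = traverse(with_pf, scattered[0])
sum_without = traverse(without_pf, scattered[0])
print(sum_with, with_pf.misses)
print(sum_without, without_pf.misses)
```

In this idealized model only the first element misses when the watched range covers the list, whereas every element misses when the range is empty. A real implementation would also have to account for prefetch timeliness and buffer capacity, which the sketch ignores.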
`
This prefetch hardware is combined, for some applications, with address
translation and compaction hardware in the memory controller that works well
with data structures that do not quite fit into a single cache line. The
address translation is done transparently to the application using a
hardware assist, translate, in the cache controller and a corresponding
hardware assist, gather, in the memory controller. Simulation results using
this prefetch hardware show a 10X reduction in read miss rates and a 100X
reduction in data volume for sparse matrix multiply operations [9].
`
3. The AMRM System Prototype

While AMRM simulation results continue to provide valuable insights into the
space of architectural mechanisms and their effectiveness [7][9][10], a
system implementation is needed to bring together different parts of the
AMRM project (including compiler and runtime system algorithms to support
adaptivity). The AMRM system prototype is divided into two phases. The first
phase consists of the implementation of a board-level prototype, followed by
a second-phase single-chip implementation of the cache memory system. At the
time of this writing, the first phase of the project prototype
implementation has recently been completed. The rest of this paper describes
the system design and implementation of the Phase I prototype and its
relationship to the ongoing second-phase ASIC prototype.
`
The AMRM Phase I prototype board is designed to serve two purposes. It can
simulate a range of memory hierarchies for applications running on a host
processor. The board supports configurability of the cache memory via an
on-board FPGA-based memory controller. The board is also designed to be
used, in the future, as a complete system platform, with on-board memory
serving as main memory, via a mezzanine card containing the Phase II AMRM
ASIC implementation.
`
Whereas the goal of the AMRM prototype is to build an adaptive cache memory
system, in general, the memory hierarchy performance cannot be decoupled
from the processor instruction set architecture, implementation, and
compiler implementation. (This is particularly true of the CPU-L1 path that
is often pipelined using non-blocking caches.) It would, therefore, be
desirable to evaluate any proposed changes to the memory hierarchy for
several processor architectures rather than being tied to one specific
processor type and its
software. In order to be able to evaluate the effect with different
processor architectures as well as to circumvent the implementation
difficulties, our Phase I implementation attaches an additional memory
hierarchy to a system through a standard peripheral bus. Thus, the board
provides a PCI interface that allows a host processor to use the board as a
part of its memory hierarchy. Applications running on the host processor are
instrumented automatically using the AMRM compiler to use the memory on the
AMRM board. Thus, direct program execution can proceed on the host processor
while the extra memory hierarchy is being exercised.
`
3.1 AMRM System Goals

One goal of the AMRM prototype system is that it be adaptable to many
different memory hierarchy architectures. Another goal of the AMRM system is
that it be useful for running real-time program execution or even memory
simulations. The latter is accomplished by making the AMRM memory available
to the user and converting the user program to access this memory
"directly". The former is accomplished through the use of a sequence of
address/command type requests "run" through various memory system
configurations. The AMRM system is to be fast enough to support extensive
execution or simulation.

A CPU interfaces to the reconfigurable AMRM memory system through the PCI
bus. AMRM accepts CPU PCI requests for memory operations, issues them to the
attached memory system, and sends back the data for memory read operations
as well as memory access time information.
`
3.2 AMRM prototype architecture

Figure 1 shows the main components of the AMRM prototype board. It consists
of a general 3-level memory hierarchy plus support for the AMRM ASIC chip
implementing architectural assists within the CPU-L1 datapath. The host
interface is managed by a Motorola PLX 9080 processor. The FPGAs on the
board contain controllers for the SRAM, DRAM, and L1 cache. A 1 MB SRAM is
used for tag and data store for the L1 cache. A total of 512 MB of DRAM is
provided to implement part of the cache hierarchy (and also to serve as main
memory by reloading the memory controller into the FPGA).
`
The board implementation necessarily hard-wires certain parameters of the
memory hierarchy. This includes the board's clock. In order to perform
detailed and accurate simulation of diverse memory hierarchy configurations
at any clock speed, a hardware "virtual" clock has been implemented as part
of the performance monitoring hardware. Performance monitoring hardware
primarily includes various event counters, which are memory-mapped and
readable from the host processor. The "virtual clock" emulates a target
system's clock: the clock rate is determined by the target system's memory
hierarchy design and technology parameters. For example, the delay for an L1
cache hit, miss fetch, etc., in terms of virtual clocks can be configured by
the host to emulate a given target cache design. Thus, the use of the
virtual clock allows us to simplify the hardware implementation. For
instance, the tag and data stores of the L1 cache can be a single RAM while
the timing may reflect a design with two separate RAMs.
`
3.3 Command Interface to the AMRM Board

The memory hierarchy on the AMRM board can be used by an application running
on the host processor by writing commands to specific addresses in the PCI
address space. Each command consists of a set of four words that specify the
operation (e.g., memory read/write, register read/write), the address of the
location to access, and the data in case of a write. For read commands, a
read response is generated and data is written into the host's memory.
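
A four-word command of this kind could be represented on the host side as follows. This is an illustrative sketch only: the paper states that a command has four words carrying the operation, the address, and write data, but it does not give the word size, the field order, the opcode values, or the use of the fourth word. The 32-bit little-endian layout, the `Op` codes, and the `delta_t` field in the fourth word are all assumptions made for the example.

```python
import struct
from enum import IntEnum


class Op(IntEnum):
    # Hypothetical opcode values; the paper names the operations, not their codes.
    MEM_READ = 0
    MEM_WRITE = 1
    REG_READ = 2
    REG_WRITE = 3


COMMAND_FORMAT = "<4I"  # four little-endian 32-bit words (word size assumed)


def encode_command(op, address, data=0, delta_t=0):
    """Pack one command as four words: operation, address, data, delay."""
    return struct.pack(COMMAND_FORMAT, int(op), address, data, delta_t)


def decode_command(raw):
    """Unpack four words back into (operation, address, data, delay)."""
    op, address, data, delta_t = struct.unpack(COMMAND_FORMAT, raw)
    return Op(op), address, data, delta_t


raw = encode_command(Op.MEM_WRITE, address=0x0000_1000, data=0xDEADBEEF, delta_t=12)
decoded = decode_command(raw)
print(len(raw), decoded)
```

On a real host the packed bytes would be written to the board's command addresses in PCI address space; that step is omitted here because it depends on the host driver.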
`
For debugging purposes and to enable the cache to be flushed by the host,
there are commands to access memory banks directly, i.e., without going
through the caches. Commands are also available to read/write the status and
configuration registers and the performance counters.
`
The onboard command processor reads a command and launches its execution in
the AMRM board. Data is read from the cache and sent back to the processor
if it hits in the cache; a hit takes m virtual clock cycles. Otherwise the
data is requested from the next level in the hierarchy. Writes take n
virtual clocks. Upon load command completion, the data can be written into
the system memory for access by the host processor. Both parameters n and m
can be programmed under compiler control.
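
The effect of the programmable parameters m and n can be sketched with a small trace-driven model that accounts time in virtual clocks. This is not the board's controller logic: the direct-mapped organization, the allocate-on-write policy, and the `miss_penalty` parameter (extra virtual clocks to reach the next level) are assumptions chosen to keep the example short.

```python
class TimedCache:
    """Direct-mapped cache model that accounts time in virtual clocks."""

    def __init__(self, num_lines, line_words, m, n, miss_penalty):
        self.num_lines = num_lines
        self.line_words = line_words
        self.m = m                        # virtual clocks for a read that hits
        self.n = n                        # virtual clocks for a write
        self.miss_penalty = miss_penalty  # extra clocks for the next level (assumed)
        self.tags = [None] * num_lines
        self.vtime = 0
        self.hits = 0
        self.misses = 0

    def _lookup(self, addr):
        """Return True on a hit; allocate the line in either case."""
        line_addr = addr // self.line_words
        index = line_addr % self.num_lines
        tag = line_addr // self.num_lines
        hit = self.tags[index] == tag
        self.tags[index] = tag
        return hit

    def read(self, addr):
        if self._lookup(addr):
            self.hits += 1
            self.vtime += self.m
        else:
            self.misses += 1
            self.vtime += self.m + self.miss_penalty

    def write(self, addr):
        self._lookup(addr)
        self.vtime += self.n


def run(trace, **config):
    """Replay a list of (operation, word address) requests on one configuration."""
    cache = TimedCache(**config)
    for op, addr in trace:
        if op == "R":
            cache.read(addr)
        else:
            cache.write(addr)
    return cache


trace = [("R", 0), ("R", 1), ("W", 2), ("R", 64), ("R", 0)]
small = run(trace, num_lines=4, line_words=4, m=1, n=2, miss_penalty=20)
large = run(trace, num_lines=64, line_words=4, m=2, n=2, miss_penalty=20)
print(small.vtime, small.misses)
print(large.vtime, large.misses)
```

Replaying the same request sequence against two configurations shows the trade-off the virtual clock is meant to expose: the smaller cache has the faster hit time but suffers a conflict miss on the final read, so it accumulates more virtual time than the larger, slower cache.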
`
3.4 Virtual Clock System

The virtual clock system consists of a master clock counter, Vtime, the
virtual clock signal, Vclk,
Ready inputs, and associated virtual clock generation logic. The Ready input
from each major memory hierarchy module specifies that this unit has
completed the current virtual clock cycle activities. A new Vclk edge/period
is generated when all Ready inputs reach 1. Vtime can be read out by the
host processor to determine the current virtual time. It can also be
automatically supplied to the CPU via host memory (as opposed to the AMRM
board memory).
`
Each major unit in the memory hierarchy is designed to generate Ready and
wait for Vclk, when appropriate. In most cases the designs actually use
AutoReady counters local to each unit, which can be loaded with a
programmable number of cycles. An AutoReady counter generates Ready using
Vclk while its output is non-zero. An idle unit not processing any requests
also outputs a Ready signal every Vclk. For instance, consider the AMRM Read
command addressed to the on-board memory hierarchy. It includes a delay (ΔT)
from the previous memory access. This delay is used to advance the virtual
clock forward before starting the new access. This is accomplished by
loading it into the AutoReady counter. This allows other memory hierarchy
activity to proceed in parallel with CPU computation. For instance, a
prefetch unit may be accessing memory during the ΔT-cycle delay.
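
The Ready/Vclk handshake can be sketched in software as a barrier: virtual time advances by one only when every unit reports Ready. The model below is one possible reading of the description, not the FPGA logic. It assumes that a unit with a non-zero AutoReady count is automatically Ready and decrements the count on each Vclk, that an idle unit is always Ready, and that a unit with pending work withholds Ready for a number of board-clock cycles (`real_work`, a hypothetical parameter) before allowing the next Vclk.

```python
class Unit:
    """One memory hierarchy module taking part in virtual clock generation."""

    def __init__(self, name):
        self.name = name
        self.auto_ready = 0   # virtual cycles during which Ready is automatic
        self.real_work = 0    # board-clock cycles of work still pending (assumed)

    def ready(self):
        # Idle units and units counting down AutoReady always report Ready.
        return self.auto_ready > 0 or self.real_work == 0

    def on_vclk(self):
        if self.auto_ready > 0:
            self.auto_ready -= 1

    def on_board_clock(self):
        if self.auto_ready == 0 and self.real_work > 0:
            self.real_work -= 1


class VirtualClock:
    def __init__(self, units):
        self.units = units
        self.vtime = 0          # master virtual clock counter (Vtime)
        self.board_cycles = 0   # real board clock cycles elapsed

    def step(self):
        """Advance one board clock; generate Vclk only if all units are Ready."""
        self.board_cycles += 1
        if all(u.ready() for u in self.units):
            self.vtime += 1
            for u in self.units:
                u.on_vclk()
        else:
            for u in self.units:
                u.on_board_clock()


cache = Unit("L1")
dram = Unit("DRAM")           # stays idle, so it is Ready on every Vclk
vc = VirtualClock([cache, dram])

# A Read command arrives with a delay of 5 virtual cycles since the previous
# access; the access itself then needs 3 board cycles within one virtual cycle.
cache.auto_ready = 5
cache.real_work = 3

for _ in range(9):
    vc.step()
print(vc.vtime, vc.board_cycles)
```

In this run the virtual clock first advances through the five delay cycles, then holds while the cache unit finishes its work, and finally ticks once more, so six virtual cycles are accounted for in nine board cycles. The point of the sketch is that virtual time is decoupled from the board clock, which is what lets the board emulate target timings that differ from its own.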
`
3.5 AMRM chip functionality

The AMRM board provides for incorporation of an AMRM chip that uses an ASIC
implementation of the AMRM cache assist mechanisms. It is positioned between
the L1 cache and the rest of the system and can be accessed in parallel with
the L2 cache. It can thus accept and supply data coming from or going to the
L1 cache. For instance, it may contain a write buffer or a prefetch unit to
access L2. It also has access to the memory interface and thus can, for
instance, prefetch from memory. The AMRM ASIC design is currently in
progress. This chip will include a processor core with an adaptive memory
hierarchy. When plugged into the AMRM board, the ASIC will use onboard DRAM
as main memory by simply reconfiguring the memory controller in FPGA1.

[Figure 1: AMRM Phase I Prototype Board. Block diagram; recoverable labels:
32-bit, 33 MHz PCI bus; PCI interface; 32-bit local bus; FPGA0
(configuration, virtual clock); FPGA2.]
`
4. Summary

Traditional computer system architectures are designed for best machine
performance averaged across applications. Due to the static nature of these
architectures, such machines are limited in exploiting application
characteristics unless these are common to a large number of applications.
Unfortunately, a number of studies have shown that no single machine
organization fits all applications; therefore, the delivered performance is
often only a small fraction of the peak machine performance. We believe that
there are significant opportunities for application-specific architectural
adaptation. The focus of the AMRM project is on architectural adaptations
that close the gap between processor and memory speed by intelligent
placement of data through the memory hierarchy. Our current work has
demonstrated performance gains due to adaptive cache organizations and cache
prefetch assists. In the future, we envision adaptive machines that provide
a menu of application-specific assists that alter architectural mechanisms
and policies in view of the application characteristics. The application
developer, with the help of compilation tools, selects appropriate hardware
assists to customize the machine to match application needs without having
to rewrite the application.
`
5. Acknowledgement

The AMRM project was conceived through the joint effort of Andrew Chien,
Alex Nicolau, and Alex Veidenbaum. The team includes Prashant Arora, Chun
Chang, Dan Nicolaescu, Rajesh Satapathy, Weiyu Tang, and Xiao Mei Ji. The
project is sponsored by DARPA under contract DABT63-98-C-0045.
6. References

[1] P. M. Kogge, "Summary of the architecture group findings," In PetaFlops
Architecture Workshop (PAWS), April 1996.

[2] A. A. Chien, R. K. Gupta, "MORPH: A System Architecture for Robust High
Performance Using Customization," In Proceedings of the Sixth Symposium on
the Frontiers of Massively Parallel Computation (Frontiers '96), October
1996.

[3] W. Wulf, S. McKee, "Hitting the memory wall: Implications of the
obvious," In Computer Architecture News, 1995.

[4] C. L. Seitz, "Let's route packets instead of wires," In Proceedings of
the 6th MIT Conference on Advanced Research in VLSI, W. J. Dally, editor,
MIT Press, pp. 133-37, 1996.

[5] W. J. Dally, P. Song, "Design of a self-timed VLSI multicomputer
communication controller," In Proceedings of ICCD, 1987.

[6] R. K. Gupta, "Analysis of Technology Trends: A Case of Architectural
Adaptation in Custom Datapaths," In Proceedings of SPIE session on
Configurable Computing, Boston, November 1998.

[7] AMRM Website: www.ics.uci.edu/~amrm.

[8] D. F. Zucker, M. J. Flynn, R. B. Lee, "A comparison of hardware
prefetching techniques for multimedia benchmarks," Technical report
CSL-TR-95-683, Computer Systems Laboratory, Stanford University, Stanford,
CA 94305, December 1995.

[9] X. Zhang, et al., "Architectural adaptation for memory latency and
bandwidth optimization," In Proceedings of ICCD, October 1997.

[10] A. Veidenbaum, et al., "Adapting Cache Line Size to Application
Behavior," In Proceedings of International Conference on Supercomputing
(ICS), June 1999.
`