`
LIBRARY OF CONGRESS
`
`Office of Business Enterprises
`Duplication Services Section
`
THIS IS TO CERTIFY that the collections of the Library of Congress contain a book
entitled HIGH-SPEED COMPUTING, DIGITAL SIGNAL PROCESSING, AND FILTERING
USING RECONFIGURABLE LOGIC : 20-21 November 1996, Boston, Massachusetts / John
Schewel ... [et al.], chairs/editors; sponsored ... by SPIE--the International Society for Optical
Engineering. Published: Bellingham, Wash., USA: SPIE, c1996. Call number TK7895.G36 H55
1996, and that the attached photocopies (Cover, Title, Copyright, Table of Contents, and the
article, "Using Reconfigurable Hardware to Customize Memory Hierarchies," Peixin Zhong and
Margaret Martonosi [pages 237-248]) are a true representation from that work.
`
THIS IS TO CERTIFY FURTHER, that the work is marked with a Library of Congress
Copyright Office stamp that bears the date November 19, 1996.
`
`IN WITNESS WHEREOF, the seal of the Library of Congress is affixed hereto on
`October 11, 2018.
`
`101 Independence Avenue, SE Washington, DC 20540-4917 Tel 202.707.5650 www.loc.gov;
`duplicationservices@loc.gov
`
`Petitioners Amazon
`Ex. 1008, p. 1
`
`
`
`PROCEEDINGS
SPIE-The International Society for Optical Engineering
`
High-Speed Computing, Digital
`Signal Processing, and Filtering
`Using Reconfigurable Logic
`
`/
`
John Schewel
`Peter M. Athanas
`V. Michael Bove, Jr.
`John Watson
`Chairs/Editors
`
`20-21 November 1996
Boston, Massachusetts
`
`Volume 2914
`
`
`
`
High-Speed Computing, Digital
`Signal Processing, and Filtering
`Using Reconfigurable Logic
`
John Schewel
`Peter M. Athanas
`V. Michael Bove, Jr.
`John Watson
`Chairs/Editors
`
`20-21 November 1996
`Boston, Massachusetts
`
`Sponsored and Published by
`SPIE-The International Society for Optical Engineering
`
`Volume 2914
`
`SPIE is an international technical society dedicated to advancing engineering and scientific
`applications of optical, photonic, imaging, electronic, and optoelectronic technologies.
`
`
`
`
`/
`
The papers appearing in this book comprise the proceedings of the meeting mentioned on the
cover and title page. They reflect the authors' opinions and are published as presented and
without change, in the interests of timely dissemination. Their inclusion in this publication does
not necessarily constitute endorsement by the editors or by SPIE.
`
Please use the following format to cite material from this book:
Author(s), "Title of paper," in High-Speed Computing, Digital Signal Processing, and Filtering
Using Reconfigurable Logic, John Schewel, Peter M. Athanas, V. Michael Bove, Jr., John
Watson, Editors, Proc. SPIE 2914, page numbers (1996).
`
`Library of Congress Catalog Card No. 96-69766
`ISBN 0-8194-2316-5
`
`Published by
`SPIE-The International Society for Optical Engineering
P.O. Box 10, Bellingham, Washington 98227-0010 USA
`Telephone 360/676-3290 (Pacific Time) • Fax 360/647-1445
`
Copyright ©1996, The Society of Photo-Optical Instrumentation Engineers.
`
Copying of material in this book for internal or personal use, or for the internal or personal use
of specific clients, beyond the fair use provisions granted by the U.S. Copyright Law is
authorized by SPIE subject to payment of copying fees. The Transactional Reporting Service
base fee for this volume is $6.00 per article (or portion thereof), which should be paid directly
to the Copyright Clearance Center (CCC), 222 Rosewood Drive, Danvers, MA 01923. Payment
may also be made electronically through CCC Online at http://www.directory.net/copyright/.
Other copying for republication, resale, advertising or promotion, or any form of systematic or
multiple reproduction of any material in this book is prohibited except with permission in
writing from the publisher. The CCC fee code is 0-8194-2316-5/96/$6.00.
`
Printed in the United States of America.
`
`
`
`
`Contents
`
vii    Conference Committee
ix     Introduction

SESSION 1

BREAKTHROUGHS IN ALGORITHMS, IMAGING, DSP, AND NUMERICAL ANALYSIS I

2      Signed-digit online floating-point arithmetic for FPGAs [2914-01]
       A. Tangtrakul, B. Yeung, T. A. Cook, Rutgers Univ.

14     Design of high-radix digit slices for online computations [2914-02]
       A. F. Tenca, M. D. Ercegovac, Univ. of California/Los Angeles

26     Performance evaluation of FPGA implementations of high-speed addition algorithms
       [2914-03]
       W. W. H. Yu, S. Xing, Univ. of Hong Kong

34     250-MHz correlation using high-performance reconfigurable computing engines [2914-04]
       B. Von Herzen, Rapid Prototypes, Inc.

44     Implementation of a finite difference method on a custom computing platform [2914-05]
       K. J. Paar, P. M. Athanas, Virginia Polytechnic Institute and State Univ.; C. M. Edwards,
       Naval Underwater Warfare Ctr.

54     User-configurable data acquisition systems [2914-06]
       G. Brown, Cornell Univ.

SESSION 2

BREAKTHROUGHS IN ALGORITHMS, IMAGING, DSP, AND NUMERICAL ANALYSIS II

66     Internal sorting and FPGA [2914-07]
       A. Beechick, Arrow Connection; S. Casselman, Virtual Computer Corp.; L. D. Yarbrough,
       Mathematical and Computational Sciences

72     Polynomial-transform-based approach to computing 2D DFTs using reconfigurable computers
       [2914-08]
       C. H. Dick, La Trobe Univ. (Australia); F. J. Harris, San Diego State Univ.

84     Configurable adaptive signal processing subsystem for various applications in telemetry,
       navigation, and telecommunication [2914-09]
       H. Chandran, Navtel Systems SA, Ltd. (France)

90     Dynamic hardware video processing platform [2914-10]
       R. Andraka, Andraka Consulting Group

100    DSP filters in FPGAs for image processing applications [2914-11]
       B. Taylor, Giga Operations Corp.

110    Image processing using reconfigurable FPGAs [2914-12]
       L. Ferguson, Atmel Corp.

SPIE Vol. 2914 / iii
`
`
`
`
SESSION 3

ARCHITECTURES FOR HIGH-PERFORMANCE RECONFIGURABLE COMPUTING

124    Design tips and experiences in using reconfigurable FLEX logic [2914-13]
       P. J. Covert, California Microwave Inc.

133    Programmable hardware for reconfigurable computing systems [2914-14]
       S. J. Smith, Altera Corp.

141    Hierarchical decomposition model for reconfigurable architecture [2914-15]
       S. S. Erdogan, A. B. Wahab, Nanyang Technological Univ. (Singapore)

152    Review of field-programmable analog arrays [2914-16]
       G. Gulak, D. R. D'Mello, Univ. of Toronto (Canada)

170    PCI-based WILDFIRE reconfigurable computing engines [2914-17]
       B. K. Fross, R. L. Donaldson, D. J. Palmer, Annapolis Micro Systems, Inc.

180    Reconfigurable-logic-based fiber channel network card [2914-18]
       S. Casselman, Virtual Computer Corp.

187    Colt: an experiment in wormhole run-time reconfiguration [2914-19]
       R. A. Bittner, Jr., P. M. Athanas, M. D. Musgrove, Virginia Polytechnic Institute and State Univ.

195    Computational acceleration methodologies: advantages of reconfigurable acceleration
       subsystems [2914-20]
       Y. Savaria, G. Bois, Ecole Polytechnique de Montreal (Canada); P. Popovic, Micro Tech
       Microsystems Inc. (Canada); A. Wayne, Andrew Wayne and Associates (Canada)
`
SESSION 4

DESIGN METHODS AND TOOLS FOR RECONFIGURABLE COMPUTING

208    Malleable architecture generator for FPGA computing [2914-21]
       M. Gokhale, J. Kaba, A. Marks, J. Kim, David Sarnoff Research Ctr.

218    Resource pools: an abstraction for configurable computing codesign [2914-22]
       J. B. Peterson, P. M. Athanas, Virginia Polytechnic Institute and State Univ.

225    Solving graph problems with dynamic computation structures [2914-23]
       J. W. Babb, M. I. Frank, A. Agarwal, MIT Lab. for Computer Science

237    Using reconfigurable hardware to customize memory hierarchies [2914-24]
       P. Zhong, M. Martonosi, Princeton Univ.

249    Compiling high-level languages for configurable computers: applying lessons from
       heterogeneous processing [2914-25]
       G. E. Weaver, C. C. Weems, K. S. McKinley, Univ. of Massachusetts

259    Rapid prototyping of datapath intensive architectures with HML: an abstract hardware
       description language [2914-26]
       M. E. Leeser, S. Tarafdar, Northeastern Univ.; Y. Li, Princeton Univ.

271    Codesign and high-performance computing: scenes and crisis [2914-28]
       R. W. Hartenstein, J. Becker, M. Herz, U. Nageldinger, Univ. of Kaiserslautern (FRG)

283    FPGA applications in digital video systems [2914-29]
       B. K. Fawcett, J. Watson, Xilinx, Inc.
`
`
`
`
`295
`
`300
`
`308
`
`32 1
`
`332
`
`341
`
`Application of FPGA technology to performance limitations in radiation therapy [29 14-30]
`J. J. DeMarco, J. B. Smathers, T. D. Solberg, Univ. of California/Los Angeles;~- Casselman,
`Virtual Com puter Corp.
`
`Embedding large multidimensional DSP computations in reconfigurable logic [2914-31]
`A . Elnagga r, H. M. Alnuweiri, M. R. Ito, Univ. of British Columbia (Canada)
`
`FPGA-based transformable coprocessor for MPEG video processing [2914-32]
`H. A. Chow, H. M. Alnuweiri, Univ. of British Columbia (Canada)
`
`Guide to us ing field programmable gate arrays (f PG As) for application-spec ific digita l s ignal
`processing performance [2914-33]
`G. R. Goslin, Xi:in x. Inc.
`
`Reconfigurable hardware accelerator for embedde d DSP [2914-34]
`K. Reeves, Pentek Inc.; K. Sienski, C. Field, Red River Engineering
`
`large-scale logic array computation [2914-3 5]
`N. Ma rgo I us, Boston Univ.
`
`354
`
`Author Index
`
`SPIE Vol 2914 I v
`
`
`
`
Using Reconfigurable Hardware to Customize Memory Hierarchies
`
`Peixin Zhong and Margaret Martonosi
`Department of Electrical Engineering
`Princeton University
`Princeton, NJ 08544-5263
{pzhong, martonosi}@ee.princeton.edu
`
Abstract

Over the past decade or more, processor speeds have increased much more quickly than
memory speeds. As a result, a large, and still increasing, processor-memory performance gap
has formed. Many significant applications suffer from substantial memory bottlenecks, and
their memory performance problems are often either too unusual or extreme to be mitigated by
cache memories alone. Such specialized performance "bugs" require specialized solutions, but
it is impossible to provide case-by-case memory hierarchies or caching strategies on
general-purpose computers.

We have investigated the potential of implementing mechanisms like victim caches and
prefetch buffers in reconfigurable hardware to improve application memory behavior. Based on
technology and commercial trends, our simulation-based studies use a forward-looking model
in which configurable logic is located on the CPU chip. Given such assumptions, our results
show that the flexibility of being able to specialize configurable hardware to an application's
memory referencing behavior more than balances the slightly slower response times of
configurable memory hierarchy structures. For our three applications, small, specialized
memory hierarchy additions such as victim caches and prefetch buffers can reduce miss rates
substantially and can drop total execution times for these programs to between 60 and 80% of
their original execution times. Our results also indicate that different memory specializations
may be most effective for each application; this highlights the usefulness of configurable
memory hierarchies that are specialized on a per-application basis.

Keywords: memory latency, configurable computing, victim cache, prefetching.

1 Introduction

Due to rapid increases in microprocessor speeds, the performance gap between processors and
main memory is widening. Cache memories are typically used in computer systems to bridge
this performance gap and reduce the average memory access time. Although caches work well
in many cases, they may still fail to provide high performance for certain applications.

Several hardware and software techniques have been proposed to improve cache performance
in such cases. For example, prefetching techniques aim to hide the large latency out to main
memory by bringing data to the cache before it is referenced. Victim caches attempt to reduce
conflict misses in low-associativity caches. These hardware techniques have variable results
depending on the application's memory referencing behavior. Their disadvantage is that they
represent wasted transistor space on the CPU chip for those applications where they are
ineffective. On the other hand, the drawback to purely software-based techniques (such as
blocked matrix accesses or compiler-inserted prefetching directives) is that it can be difficult to
statically analyze a program's memory behavior and determine when such techniques will be
useful. For these reasons, this paper explores implementing memory hierarchy additions in
programmable hardware.

Programmable logic, such as field-programmable gate arrays (FPGAs), has gained tremendous
popularity in the past decade. Programmable logic is popular because a given chip's behavior
can be configured and customized for different functions during different sessions.
Customization on a per-application basis is feasible because the reconfiguration process is fast
and can be done with the device in the system. Some FPGAs can even be partially
reconfigured while the rest of the device is in use.
`
`0-8194-23 16-5!96/$6.00
`
`
`
`
The configurability of FPGAs makes them heavily used for prototyping, but they have also
been used to build high-performance application-specific compute engines [12]. In addition,
there has been work (such as PRISC [21] and PRISM [2]) on supplementing conventional
processors with configurable coprocessors to accelerate performance for different applications.

Thus far, most configurable computing projects have focused heavily on computation as
opposed to data access. In this paper, we explore the potential of using configurable hardware
to make application-specific improvements to memory behavior. We envision a library of
parameterized memory hierarchy additions that can be invoked to target cases where the cache
system does not work well for a particular application.

As fabrication technology improves, more and more transistors fit on a single chip. It seems
likely that we will soon see processor chips that include a region of on-chip configurable logic.
This configurable logic can clearly have many uses; our work does not preclude configuring
the logic for more traditional compute-oriented uses, but simply attempts to explore an
alternative use.

In Section 2, we will first discuss the structure of the configurable memory unit that we
envision. Following that, in Section 3, we present case studies of applying the ideas of
configurable memory to several applications. In Section 4, we discuss some of the hardware
organization and implementation issues inherent in our approach. A brief account of related
work is included in Section 5, and Section 6 presents some discussion and our conclusions.
`
2 Configurable Hardware Memory Hierarchies

We believe that configurable logic can result in significant performance improvement by
improving average memory access latencies. Our overriding goal is to minimize the number of
cache misses that result in accesses to main memory. Researchers have proposed methods that
promise performance improvements of up to 3X by reducing cache misses using full-custom
hardware [19]. These methods are rarely included in commercial processors, however, because
they do not provide performance improvements for a broad enough set of applications. Our
current research shows the potential of these approaches using flexible, configurable hardware
instead.

Integrated circuits are expected to soon grow to contain over 100 million transistors. As this
growth takes place, we must determine ways to best make use of these additional transistors.
Rather than simply devoting increased chip areas to increased cache sizes, our research
explores other methods for using the transistors to reduce (or better tolerate) memory access
latencies.

In current research projects, configurable logic is typically incorporated into the architecture
using an attached processor array model. As shown on the left hand side of Figure 1, an
accelerator based on FPGAs and dedicated static RAM (SRAM) is attached to the I/O bus of a
conventional host processor. The conventional processor and configurable logic array operate
asynchronously. The host supplies control signals and monitors results while the logic array
processes data obtained from an external source such as a frame buffer. The major problem
with this model is the high communication latency between processor and configurable logic,
due to their physical and logical separation.

As shown on the right hand side of Figure 1, the expected integration of configurable logic
on-chip gives us more flexibility not only in its logical placement within the architecture, but
also in its expected uses. In addition, on-chip configurable logic has more flexibility in its
connectivity to processors and caches. We can now begin considering uses for configurable
logic that are infeasible (because of latency and connectivity constraints) with the attached
processor model.

Figure 1: Traditional (left) and more forward-looking (right) uses of configurable logic.

2.1 Logical Placement, Connectivity

The logical placement of the configurable logic is driven by its intended memory optimization
function. The right hand side of Figure 1 shows two distinct possibilities, and in fact, each of
these configurable blocks may expend their transistor budgets on some combination of
configurable gates and associated SRAM. (In this figure, the logical positions for configurable
logic are indicated by diamonds labelled "C1" and "C2".)

The configurable logic closest to the processor can detect L1 cache misses and manage
prefetch buffers, stream buffers, or a victim cache. Similar functions can be performed by the
second block of configurable logic, for the L2 cache. In addition, this logic could observe
memory accesses between nodes of a distributed memory multiprocessor or hold specialized
multiprocessor coherence protocols.
`·
In this paper, we focus primarily on two particular optimizations: victim caches and
prefetching engines, implemented in configurable hardware. Victim caches are small,
fully-associative caches that lie between the cache and main memory. Victim caches primarily
aim to reduce conflict misses by caching data that has been evicted from the cache as a result
of a miss on another location. Even small (four-entry) victim caches can reduce conflict misses
by about 50% and reduce total cache misses by 5-30% [19].

Prefetching engines are hardware structures intended to fetch pieces of data from main
memory before references to them occur, in order to hide the large latencies out to main
memory. They have been studied in several contexts before [6, 7, 20]. Although such hardware
mechanisms have been evaluated via simulation studies, widespread adoption in custom
hardware is unlikely, since their benefits vary from application to application. Assuming that
future processors will include a block of configurable logic on-chip, this paper uses simulations
to re-evaluate these mechanisms within the context of a dynamically customizable memory
hierarchy.

For prefetch or stream buffers, the configurable logic must be able to detect a miss from the
L1 (or L2) cache, recognize that the address matches the address of data in the prefetch buffer,
abort the memory request, and supply the data to the CPU. Simultaneous to servicing the
access request, the prefetch controller may initiate a memory request for the next location from
which to prefetch. For a victim cache, the logic must once again be able to detect the primary
cache miss, recognize the presence of the data in its SRAM, abort the memory request, and
supply the data to the CPU. These functions can be easily performed using a small state
machine controller.
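The controller behavior just described can be illustrated in software. The following Python sketch (class and method names are ours, purely for illustration, not from the paper) models the victim-cache path: on an L1 miss the small fully-associative store is probed, a hit aborts the memory request and supplies the line, and the line evicted by the L1 is intercepted and stored.

```python
# Hypothetical software model of the victim-cache controller's
# miss-handling sequence; not the authors' hardware implementation.

LINE_SIZE = 32  # bytes per cache line, as assumed in the paper


def line_addr(addr):
    """Address of the cache line containing addr."""
    return addr - (addr % LINE_SIZE)


class VictimCacheController:
    def __init__(self, entries=4):
        self.entries = entries
        self.lines = {}  # line address -> data, in insertion order

    def on_l1_miss(self, addr, evicted_line=None):
        """Handle an L1 miss.

        Returns the data if the victim cache holds the line (the
        memory request would be aborted), else None (the request
        proceeds to the next level of memory).
        """
        tag = line_addr(addr)
        hit = self.lines.pop(tag, None)  # a hit moves the line to L1
        if evicted_line is not None:
            # Intercept the line the L1 evicts on this miss.
            ev_tag, ev_data = evicted_line
            self.lines[ev_tag] = ev_data
            while len(self.lines) > self.entries:
                # Drop the oldest entry (approximates LRU).
                self.lines.pop(next(iter(self.lines)))
        return hit
```

The hardware equivalent is a comparator bank over the tag store plus a small state machine sequencing the probe, abort, and fill steps.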
`
2.2 Logical vs. Physical Placement
`
It is important to make a distinction between the logical and the physical placement of the
configurable logic. The discussion above described the logical placement, i.e. where the
configurable logic needs to be positioned in terms of the signals it needs access to. Physically,
we anticipate future generations of microprocessors to have a fixed amount of configurable
logic which may be used for a variety of tasks including but not limited to accelerating
computation and memory accesses. Since each application will have different requirements on
where the configurable logic should be placed, it is not reasonable to expect it to be positioned
in exactly the positions marked in Figure 1. The key demand that our logical placement makes
on configurable logic's physical placement is that of connectivity; there must be signal paths
that provide access to the signals required by the logical
`
`
`
`
usage. The physical placement impacts performance in two ways. First, the interconnect delay
may be significant. Second, the interconnect topology may lead to contention for shared
resources such as buses. This contention has not been modelled in the initial study presented
here.

2.3 Programming Model

In addition to hardware concerns, our configurable memory hierarchies raise software issues as
well. Figure 2 gives a flowchart that represents the path a program would go through in order
to make use of configurable memory hierarchy additions.

Figure 2: Flowchart of programming model with configurable memory hierarchy additions.
(Stages: Source code, Evaluation, Code with Config. info, Run program, Result, with a
Feedback path from the program run back to Evaluation.)

Starting from conventional source code, some performance analysis is needed to determine
what types of performance bottlenecks are present. Based on this analysis, configuration
directives may be added into the code in order to use particular configurable memory hierarchy
additions. The configuration directives may either make use of hardware from a parameterized
library of pre-made designs, or they may specify custom-made configurations. Our flowchart
also allows for the possibility that these configuration directives will either be hand-inserted
into the code by the programmer, or will be automatically inserted as part of the compilation
process. At this point, the code is executed, and the memory hierarchy additions are configured
as part of the program run. If the code is to be run repeatedly, then the performance statistics
about this program run can be used as feedback to modify the configuration for future runs.
Clearly, the widespread use of this sort of hardware relies on automating the analysis and
choice of configurations as much as possible. In Section 3's preliminary case studies, however,
we set the configuration directives manually.

3 Application Case Studies

In order to quantitatively evaluate our idea, we present results on three case study applications.

3.1 Configurable Memory Hardware

In the case studies presented here, we assume the configurable memory hardware sits on the
same chip as the processor and first-level (L1) cache. This corresponds to location C1 in
Figure 1. For these studies, we examine two configurable hierarchy additions separately: a
victim cache and a prefetch buffer.

3.1.1 Victim Cache

A victim cache is a fully-associative cache with typically no more than 16 entries. When an L1
cache miss occurs, the referenced line is pulled into the L1 cache and another line currently
stored in the same cache line is evicted. The victim cache is a small fully-associative buffer
that holds these evicted lines. Using victim caches can have the effect of increasing the set
associativity at low hardware cost.

Figure 3 shows a diagram of the victim cache we consider. The address and data lines go to
both the L1 cache and the victim cache. On an L1 cache hit, the data is provided directly from
the L1 cache, and the configurable hardware is not involved. On an L1 cache miss, the victim
cache is probed in parallel with the request to the next level of memory. If the reference hits in
the victim cache, the victim cache provides the data to the processor. When the data is pulled
into the L1 cache, another line must be evicted from the cache. The victim cache intercepts
this line and stores it.

3.1.2 Prefetching Buffer

The goal of a prefetch buffer is to initiate main memory accesses in advance, so that the data
will be closer
`
`
`
Figure 3: Block diagram of victim cache.
to the processor when referenced. Prefetching can be especially beneficial when the main
memory bandwidth is large, but the access latency is also quite high. The prefetch buffer in
our model has several independent "slots". Each slot holds several cache lines and works like
a FIFO. It can issue prefetching commands to the main memory. We need several slots
because in a typical program, there will be several concurrent data access streams. For
example, when a program is performing matrix addition, there will be three access streams for
the two source arrays and the destination array. Figure 4 is the schematic of a prefetch buffer
with four slots; each slot holds four cache lines of data.

If there is an L1 cache hit, it does not trigger any operation in the prefetch buffer. If there is
an L1 miss, the request is sent to the prefetch buffer. If the referenced item is available in one
of the slots, this line is pulled into the CPU as well as the L1 cache. The remaining lines move
toward the top of the FIFO. The vacancy is filled with data prefetched from main memory.
The prefetching engine must determine which line to prefetch next. In the simple case, the
memory address for the subsequent line is calculated by incrementing the current address by
the cache line size.

If the L1 cache miss also misses in the prefetch buffer, the least recently used slot is
designated to begin prefetching the subsequent data following this
`
`Figure 4: Block diagram of prefetch buffers.
`
referenced address. By doing this, we can avoid compulsory cache misses if the initial
reference to a memory line is already covered by the prefetching.

To make better use of the slots, we may program them with address ranges and data access
strides. For example, when we have many accesses on one array, we can set one slot with the
address range of the data array. The slot will only respond to references in this range. If the
access pattern has an address increment stride larger than the cache line size, we can assign
the stride so we only prefetch the data that is useful.
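One slot of the prefetch buffer described above can be modeled as a short FIFO of line addresses with a programmable stride and address range. The Python sketch below is our own illustration under those assumptions (names and structure are hypothetical, not the authors' hardware): a prefetch hit pulls the line out, refills the vacancy at the next strided address, and a miss restarts prefetching past the referenced address.

```python
# Hypothetical model of one prefetch-buffer "slot": a FIFO of
# prefetched cache-line addresses with a programmable stride and
# address range. Illustrative only; not the paper's implementation.
from collections import deque

LINE_SIZE = 32  # bytes per cache line


class PrefetchSlot:
    def __init__(self, depth=4, stride=LINE_SIZE, lo=0, hi=2**32):
        self.fifo = deque(maxlen=depth)  # prefetched line addresses
        self.stride = stride             # address increment per prefetch
        self.lo, self.hi = lo, hi        # slot only responds in this range
        self.next_addr = None            # next address to prefetch

    def responds_to(self, addr):
        """Programmed address range filter for this slot."""
        return self.lo <= addr < self.hi

    def on_l1_miss(self, addr):
        """Return True on a prefetch hit; keep the FIFO filled."""
        line = addr - addr % LINE_SIZE
        if line in self.fifo:
            self.fifo.remove(line)           # line is pulled into the L1
            self.fifo.append(self.next_addr) # refill the vacancy
            self.next_addr += self.stride
            return True
        # Missed in the slot too: restart prefetching past this address.
        self.fifo.clear()
        self.next_addr = line + self.stride
        for _ in range(self.fifo.maxlen):
            self.fifo.append(self.next_addr)
            self.next_addr += self.stride
        return False
```

Setting `stride` to the row size of a column-accessed array is exactly the per-slot specialization the matrix-multiplication case study below exploits.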
`
`3.2 Evaluation Methodology
`
In this study, we use software simulations to estimate the performance for our applications. In
particular, we used the TangoLite memory reference generator to perform high-level
simulations. TangoLite uses a direct-execution simulation approach. The original program
written in C or FORTRAN is compiled and instrumented with calls out to routines that
simulate memory behavior and keep statistics.

In our simulation model, we have a single CPU that issues no more than one instruction per
cycle. The direct-mapped L1 data cache has a capacity of 8 KB and a line size of 32 bytes. In
order to focus
`
`
`
on data references, we have assumed ideal memory performance in the instruction stream.

Our simulations assume a hit in the L1 cache costs one cycle, the same as a simple ALU
instruction. If a reference misses in the L1 cache, additional cycles are needed depending on
where the data is found. We assume that data found in the configurable cache takes 2 extra
cycles, and a miss in both L1 and the configurable cache will cost 10 extra cycles to go to
main memory.

The victim cache we simulate is fully-associative with 4 cache-line-sized entries, updated
according to an LRU policy. The prefetch buffer has four slots with 4 cache lines each. Our
simulator ignores any additional contention that prefetching may introduce.
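Under these assumed latencies, total data-access time reduces to simple cycle accounting. The helper below restates the model's parameters from the text; the function itself is a hypothetical illustration, not part of the authors' simulator.

```python
# Cycle-cost model stated in the text: an L1 hit costs 1 cycle, a hit
# in the configurable structure adds 2 cycles, and a miss in both adds
# 10 cycles for main memory. The code is an illustrative sketch.

L1_HIT_CYCLES = 1       # every data reference pays this base cost
CONFIG_EXTRA = 2        # extra cycles for a configurable-cache hit
MEMORY_EXTRA = 10       # extra cycles for a main-memory access


def access_cycles(l1_hits, config_hits, memory_accesses):
    """Total data-access cycles for the given reference counts."""
    refs = l1_hits + config_hits + memory_accesses
    return (refs * L1_HIT_CYCLES
            + config_hits * CONFIG_EXTRA
            + memory_accesses * MEMORY_EXTRA)
```

For example, converting main-memory accesses into configurable-cache hits saves 8 cycles per converted reference, which is the lever the victim cache and prefetch buffer pull.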
`
3.3 Per-Application Results
`
With that background, we will now look in more detail at three applications. Our goal is to see
the potential performance improvement available by implementing memory hierarchy additions
in configurable hardware. The three applications studied include two kernel applications
known for their poor cache performance (Matrix Multiplication and Fast Fourier Transform)
as well as a more substantial program: the Tomcatv benchmark from the SPEC92 suite [9].
`
`3.3.1 Matrix Multiplication
`
In the first example, we study a matrix multiplication program multiplying two 100 by 100
matrices. The elements are double-precision floating point. This means that a total of 80000
bytes are needed for each matrix. This greatly exceeds the size of the L1 cache. Clearly the
calculation involves two source matrices and one destination matrix. As shown in Figure 5,
one of the source matrices is accessed in sequential column order while the other is accessed
in sequential row order.
`
Figure 5: Memory access pattern for arrays in matrix multiplication.
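Assuming the usual row-major layout of C arrays, the contrast between the two access streams is simple arithmetic. The numbers below restate the parameters given in the text; the code itself is only an illustration of that arithmetic.

```python
# Illustrative arithmetic for the 100x100 double-precision matrices
# described above, assuming row-major storage.

N = 100       # matrix dimension
ELEM = 8      # bytes per double-precision element
LINE = 32     # bytes per cache line

matrix_bytes = N * N * ELEM   # bytes per matrix (exceeds the 8 KB L1)
row_stride = ELEM             # byte distance between row neighbors
col_stride = N * ELEM         # byte distance between column neighbors

elems_per_line = LINE // ELEM # row-order access misses at most once
                              # per this many references
# Column-order access strides by col_stride bytes, so every reference
# touches a different cache line; stride-one prefetching then fetches
# neighboring lines that will not be used until much later.
```

This is why the per-slot stride programming matters: setting one slot's stride to `col_stride` matches the columnar stream that stride-one prefetching cannot help.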
`
Results The simulated performance of 100x100 matrix multiplication is summarized in
Table 1. The standard cache does not perform well, both because of the size of the data set and
also because one of the arrays is accessed down the columns. Overall, the data miss rate is 21%.

The second column of the table shows the new results obtained when a configurable victim
cache is used. In this case, some of the conflict misses can be avoided. The overall
performance is only slightly better than in the original case, though. The third column of the
table shows the results when a configurable prefetch engine is used. In this "stride-one
prefetching," each prefetch slot always attempts to prefetch the next cache line. This
prefetching allows the program to avoid some compulsory misses, but does not dramatically
improve overall performance. The difficulty here is that one of the arrays has a columnar
memory access pattern that does not benefit from the stride-one prefetching.

Since simple stride-one prefetching is not fully effective, we also investigated allowing the
prefetching engine to adjust its prefetch stride on a per-slot basis. The fourth column shows
the results of this experiment. Here, we have allocated one slot to each matrix. For the source
matrix that accesses elements across a row, the prefetch stride is still one cache line. For the
source matrix whose elements are accessed down the columns, we set the stride to be the size
of one row in the matrix. Every reference to an address within the matrix will only update the
prefetch buffer for