LIBRARY OF CONGRESS

Office of Business Enterprises
Duplication Services Section

THIS IS TO CERTIFY that the collections of the Library of Congress contain a book entitled HIGH-SPEED COMPUTING, DIGITAL SIGNAL PROCESSING, AND FILTERING USING RECONFIGURABLE LOGIC : 20-21 November 1996, Boston, Massachusetts / John Schewel ... [et al.], chairs/editors; sponsored ... by SPIE--the International Society for Optical Engineering. Published: Bellingham, Wash., USA : SPIE, c1996. Call number TK7895.G36 H55 1996, and that the attached photocopies (Cover, Title, Copyright, Table of Contents, and the article, "Using Reconfigurable Hardware to Customize Memory Hierarchies," Peixin Zhong and Margaret Martonosi [pages 237-248]) are a true representation from that work.

THIS IS TO CERTIFY FURTHER, that the work is marked with a Library of Congress Copyright Office stamp that bears the date November 19, 1996.

IN WITNESS WHEREOF, the seal of the Library of Congress is affixed hereto on October 11, 2018.

101 Independence Avenue, SE, Washington, DC 20540-4917  Tel 202.707.5650  www.loc.gov; duplicationservices@loc.gov


PROCEEDINGS
SPIE--The International Society for Optical Engineering

High-Speed Computing, Digital Signal Processing, and Filtering Using Reconfigurable Logic

John Schewel
Peter M. Athanas
V. Michael Bove, Jr.
John Watson
Chairs/Editors

20-21 November 1996
Boston, Massachusetts

Volume 2914


High-Speed Computing, Digital Signal Processing, and Filtering Using Reconfigurable Logic

John Schewel
Peter M. Athanas
V. Michael Bove, Jr.
John Watson
Chairs/Editors

20-21 November 1996
Boston, Massachusetts

Sponsored and Published by
SPIE--The International Society for Optical Engineering

Volume 2914

SPIE is an international technical society dedicated to advancing engineering and scientific applications of optical, photonic, imaging, electronic, and optoelectronic technologies.


The papers appearing in this book comprise the proceedings of the meeting mentioned on the cover and title page. They reflect the authors' opinions and are published as presented and without change, in the interests of timely dissemination. Their inclusion in this publication does not necessarily constitute endorsement by the editors or by SPIE.

Please use the following format to cite material from this book:
Author(s), "Title of paper," in High-Speed Computing, Digital Signal Processing, and Filtering Using Reconfigurable Logic, John Schewel, Peter M. Athanas, V. Michael Bove, Jr., John Watson, Editors, Proc. SPIE 2914, page numbers (1996).

Library of Congress Catalog Card No. 96-69766
ISBN 0-8194-2316-5

Published by
SPIE--The International Society for Optical Engineering
P.O. Box 10, Bellingham, Washington 98227-0010 USA
Telephone 360/676-3290 (Pacific Time) • Fax 360/647-1445

Copyright © 1996, The Society of Photo-Optical Instrumentation Engineers.

Copying of material in this book for internal or personal use, or for the internal or personal use of specific clients, beyond the fair use provisions granted by the U.S. Copyright Law is authorized by SPIE subject to payment of copying fees. The Transactional Reporting Service base fee for this volume is $6.00 per article (or portion thereof), which should be paid directly to the Copyright Clearance Center (CCC), 222 Rosewood Drive, Danvers, MA 01923. Payment may also be made electronically through CCC Online at http://www.directory.net/copyright/. Other copying for republication, resale, advertising or promotion, or any form of systematic or multiple reproduction of any material in this book is prohibited except with permission in writing from the publisher. The CCC fee code is 0-8194-2316-5/96/$6.00.

Printed in the United States of America.


Contents

vii    Conference Committee
ix     Introduction

SESSION 1    BREAKTHROUGHS IN ALGORITHMS, IMAGING, DSP, AND NUMERICAL ANALYSIS I

2      Signed-digit on-line floating-point arithmetic for FPGAs [2914-01]
       A. Tangtrakul, B. Yeung, T. A. Cook, Rutgers Univ.

14     Design of high-radix digit slices for online computations [2914-02]
       A. F. Tenca, M. D. Ercegovac, Univ. of California/Los Angeles

26     Performance evaluation of FPGA implementations of high-speed addition algorithms [2914-03]
       W. W. H. Yu, S. Xing, Univ. of Hong Kong

34     250-MHz correlation using high-performance reconfigurable computing engines [2914-04]
       B. Von Herzen, Rapid Prototypes, Inc.

44     Implementation of a finite difference method on a custom computing platform [2914-05]
       K. J. Paar, P. M. Athanas, Virginia Polytechnic Institute and State Univ.; C. M. Edwards, Naval Underwater Warfare Ctr.

54     User-configurable data acquisition systems [2914-06]
       G. Brown, Cornell Univ.

SESSION 2    BREAKTHROUGHS IN ALGORITHMS, IMAGING, DSP, AND NUMERICAL ANALYSIS II

66     Internal sorting and FPGA [2914-07]
       A. Beechick, Arrow Connection; S. Casselman, Virtual Computer Corp.; L. D. Yarbrough, Mathematical and Computational Sciences

72     Polynomial-transform-based approach to computing 2D DFTs using reconfigurable computers [2914-08]
       C. H. Dick, La Trobe Univ. (Australia); F. J. Harris, San Diego State Univ.

84     Configurable adaptive signal processing subsystem for various applications in telemetry, navigation, and telecommunication [2914-09]
       H. Chandran, Navtel Systems SA, Ltd. (France)

90     Dynamic hardware video processing platform [2914-10]
       R. Andraka, Andraka Consulting Group

100    DSP filters in FPGAs for image processing applications [2914-11]
       B. Taylor, Giga Operations Corp.

110    Image processing using reconfigurable FPGAs [2914-12]
       L. Ferguson, Atmel Corp.


SESSION 3    ARCHITECTURES FOR HIGH-PERFORMANCE RECONFIGURABLE COMPUTING

124    Design tips and experiences in using reconfigurable FLEX logic [2914-13]
       P. J. Covert, California Microwave Inc.

133    Programmable hardware for reconfigurable computing systems [2914-14]
       S. J. Smith, Altera Corp.

141    Hierarchical decomposition model for reconfigurable architecture [2914-15]
       S. S. Erdogan, A. B. Wahab, Nanyang Technological Univ. (Singapore)

152    Review of field-programmable analog arrays [2914-16]
       G. Gulak, D. R. D'Mello, Univ. of Toronto (Canada)

170    PCI-based WILDFIRE reconfigurable computing engines [2914-17]
       B. K. Fross, R. L. Donaldson, D. J. Palmer, Annapolis Micro Systems, Inc.

180    Reconfigurable-logic-based fiber channel network card [2914-18]
       S. Casselman, Virtual Computer Corp.

187    Colt: an experiment in wormhole run-time reconfiguration [2914-19]
       R. A. Bittner, Jr., P. M. Athanas, M. D. Musgrove, Virginia Polytechnic Institute and State Univ.

195    Computational acceleration methodologies: advantages of reconfigurable acceleration subsystems [2914-20]
       Y. Savaria, G. Bois, Ecole Polytechnique de Montreal (Canada); P. Popovic, Micro Tech Microsystems Inc. (Canada); A. Wayne, Andrew Wayne and Associates (Canada)

SESSION 4    DESIGN METHODS AND TOOLS FOR RECONFIGURABLE COMPUTING

208    Malleable architecture generator for FPGA computing [2914-21]
       M. Gokhale, J. Kaba, A. Marks, J. Kim, David Sarnoff Research Ctr.

218    Resource pools: an abstraction for configurable computing codesign [2914-22]
       J. B. Peterson, P. M. Athanas, Virginia Polytechnic Institute and State Univ.

225    Solving graph problems with dynamic computation structures [2914-23]
       J. W. Babb, M. I. Frank, A. Agarwal, MIT Lab. for Computer Science

237    Using reconfigurable hardware to customize memory hierarchies [2914-24]
       P. Zhong, M. Martonosi, Princeton Univ.

249    Compiling high-level languages for configurable computers: applying lessons from heterogeneous processing [2914-25]
       G. E. Weaver, C. C. Weems, K. S. McKinley, Univ. of Massachusetts

259    Rapid prototyping of datapath intensive architectures with HML: an abstract hardware description language [2914-26]
       M. E. Leeser, S. Tarafdar, Northeastern Univ.; Y. Li, Princeton Univ.

271    Codesign and high-performance computing: scenes and crisis [2914-28]
       R. W. Hartenstein, J. Becker, M. Herz, U. Nageldinger, Univ. of Kaiserslautern (FRG)

283    FPGA applications in digital video systems [2914-29]
       B. K. Fawcett, J. Watson, Xilinx, Inc.


295    Application of FPGA technology to performance limitations in radiation therapy [2914-30]
       J. J. DeMarco, J. B. Smathers, T. D. Solberg, Univ. of California/Los Angeles; S. Casselman, Virtual Computer Corp.

300    Embedding large multidimensional DSP computations in reconfigurable logic [2914-31]
       A. Elnaggar, H. M. Alnuweiri, M. R. Ito, Univ. of British Columbia (Canada)

308    FPGA-based transformable coprocessor for MPEG video processing [2914-32]
       H. A. Chow, H. M. Alnuweiri, Univ. of British Columbia (Canada)

321    Guide to using field programmable gate arrays (FPGAs) for application-specific digital signal processing performance [2914-33]
       G. R. Goslin, Xilinx, Inc.

332    Reconfigurable hardware accelerator for embedded DSP [2914-34]
       K. Reeves, Pentek Inc.; K. Sienski, C. Field, Red River Engineering

341    Large-scale logic array computation [2914-35]
       N. Margolus, Boston Univ.

354    Author Index


Using Reconfigurable Hardware to Customize Memory Hierarchies

Peixin Zhong and Margaret Martonosi
Department of Electrical Engineering
Princeton University
Princeton, NJ 08544-5263
{pzhong, martonosi}@ee.princeton.edu

Abstract

Over the past decade or more, processor speeds have increased much more quickly than memory speeds. As a result, a large, and still increasing, processor-memory performance gap has formed. Many significant applications suffer from substantial memory bottlenecks, and their memory performance problems are often either too unusual or extreme to be mitigated by cache memories alone. Such specialized performance "bugs" require specialized solutions, but it is impossible to provide case-by-case memory hierarchies or caching strategies on general-purpose computers.

We have investigated the potential of implementing mechanisms like victim caches and prefetch buffers in reconfigurable hardware to improve application memory behavior. Based on technology and commercial trends, our simulation-based studies use a forward-looking model in which configurable logic is located on the CPU chip. Given such assumptions, our results show that the flexibility of being able to specialize configurable hardware to an application's memory referencing behavior more than balances the slightly slower response times of configurable memory hierarchy structures. For our three applications, small, specialized memory hierarchy additions such as victim caches and prefetch buffers can reduce miss rates substantially and can drop total execution times for these programs to between 60 and 80% of their original execution times. Our results also indicate that different memory specializations may be most effective for each application; this highlights the usefulness of configurable memory hierarchies that are specialized on a per-application basis.

Keywords: memory latency, configurable computing, victim cache, prefetching.

1 Introduction

Due to rapid increases in microprocessor speeds, the performance gap between processors and main memory is widening. Cache memories are typically used in computer systems to bridge this performance gap and reduce the average memory access time. Although caches work well in many cases, they may still fail to provide high performance for certain applications.

Several hardware and software techniques have been proposed to improve cache performance in such cases. For example, prefetching techniques aim to hide the large latency out to main memory by bringing data to the cache before it is referenced. Victim caches attempt to reduce conflict misses in low-associativity caches. These hardware techniques have variable results depending on the application's memory referencing behavior. Their disadvantage is that they represent wasted transistor space on the CPU chip for those applications where they are ineffective. On the other hand, the drawback to purely software-based techniques (such as blocked matrix accesses or compiler-inserted prefetching directives) is that it can be difficult to statically analyze a program's memory behavior and determine when such techniques will be useful. For these reasons, this paper explores implementing memory hierarchy additions in programmable hardware.

Programmable logic, such as field-programmable gate arrays (FPGAs), has gained tremendous popularity in the past decade. Programmable logic is popular because a given chip's behavior can be configured and customized for different functions during different sessions. Customization on a per-application basis is feasible because the reconfiguration process is fast and can be done with the device in the system. Some FPGAs can even be partially reconfigured while the rest of the device is in use.


The configurability of FPGAs makes them heavily used for prototyping, but they have also been used to build high-performance application-specific compute engines [12]. In addition, there has been work (such as PRISC [21] and PRISM [2]) on supplementing conventional processors with configurable coprocessors to accelerate performance for different applications.

Thus far, most configurable computing projects have focused heavily on computation as opposed to data access. In this paper, we explore the potential of using configurable hardware to make application-specific improvements to memory behavior. We envision a library of parameterized memory hierarchy additions that can be invoked to target cases where the cache system does not work well for a particular application.

As fabrication technology improves, more and more transistors fit on a single chip. It seems likely that we will soon see processor chips that include a region of on-chip configurable logic. This configurable logic can clearly have many uses; our work does not preclude configuring the logic for more traditional compute-oriented uses, but simply attempts to explore an alternative use.

In Section 2, we will first discuss the structure of the configurable memory unit that we envision. Following that, in Section 3, we present case studies of applying the ideas of configurable memory to several applications. In Section 4, we discuss some of the hardware organization and implementation issues inherent in our approach. A brief account of related work is included in Section 5, and Section 6 presents some discussion and our conclusions.

2 Configurable Hardware Memory Hierarchies

We believe that configurable logic can result in significant performance improvement by improving average memory access latencies. Our overriding goal is to minimize the number of cache misses that result in accesses to main memory. Researchers have proposed methods that promise performance improvements of up to 3X by reducing cache misses using full-custom hardware [19]. These methods are rarely included in commercial processors, however, because they do not provide performance improvements for a broad enough set of applications. Our current research shows the potential of these approaches using flexible, configurable hardware instead.

Integrated circuits are expected to soon grow to contain over 100 million transistors. As this growth takes place, we must determine ways to best make use of these additional transistors. Rather than simply devoting increased chip areas to increased cache sizes, our research explores other methods for using the transistors to reduce (or better tolerate) memory access latencies.

In current research projects, configurable logic is typically incorporated into the architecture using an attached processor array model. As shown on the left hand side of Figure 1, an accelerator based on FPGAs and dedicated static RAM (SRAM) is attached to the I/O bus of a conventional host processor. The conventional processor and configurable logic array operate asynchronously. The host supplies control signals and monitors results while the logic array processes data obtained from an external source such as a frame-buffer. The major problem with this model is the high communication latency between processor and configurable logic, due to their physical and logical separation.

As shown on the right hand side of Figure 1, the expected integration of configurable logic on-chip gives us more flexibility not only in its logical placement within the architecture, but also in its expected uses. In addition, on-chip configurable logic has more flexibility in its connectivity to processors and caches. We can now begin considering uses for configurable logic that are infeasible (because of latency and connectivity constraints) with the attached processor model.

2.1 Logical Placement, Connectivity

The logical placement of the configurable logic is driven by its intended memory optimization function. The right hand side of Figure 1 shows two distinct possibilities, and in fact, each of these configurable blocks may expend their transistor budgets on some combination of configurable gates and associated SRAM. (In this figure, the logical positions for configurable logic are indicated by diamonds labelled "C1" and "C2".)

The configurable logic closest to the processor can detect L1 cache misses and manage prefetch buffers, stream buffers, or a victim cache. Similar functions can be performed by the second block of configurable logic, for the L2 cache. In addition, this logic could observe memory accesses between nodes of a distributed memory multiprocessor or hold specialized multiprocessor coherence protocols.

Figure 1: Traditional (left) and more forward-looking (right) uses of configurable logic.

In this paper, we focus primarily on two particular optimizations: victim caches and prefetching engines, implemented in configurable hardware. Victim caches are small, fully-associative caches that lie between the cache and main memory. Victim caches primarily aim to reduce conflict misses by caching data that has been evicted from the cache as a result of a miss on another location. Even small (four-entry) victim caches can reduce conflict misses by about 50% and reduce total cache misses by 5-30% [19].

Prefetching engines are hardware structures intended to fetch pieces of data from main memory before references to them occur, in order to hide the large latencies out to main memory. They have been studied in several contexts before [6, 7, 20]. Although such hardware mechanisms have been evaluated via simulation studies, widespread adoption in custom hardware is unlikely, since their benefits vary from application to application. Assuming that future processors will include a block of configurable logic on-chip, this paper uses simulations to re-evaluate these mechanisms within the context of a dynamically customizable memory hierarchy.

For prefetch or stream buffers, the configurable logic must be able to detect a miss from the L1 (or L2) cache, recognize that the address matches the address of data in the prefetch buffer, abort the memory request, and supply the data to the CPU. Simultaneous to servicing the access request, the prefetch controller may initiate a memory request for the next location from which to prefetch. For a victim cache, the logic must once again be able to detect the primary cache miss, recognize the presence of the data in its SRAM, abort the memory request, and supply the data to the CPU. These functions can be easily performed using a small state machine controller.
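
As a concrete sketch of such a controller, consider the C fragment below. It is a minimal illustration under our own naming assumptions; the state names and input signals are not taken from the paper or from any particular FPGA toolflow.

    /* A minimal model of the state machine controller described above.
     * On an L1 miss it checks its local SRAM (prefetch buffer or victim
     * cache); on a match it aborts the memory request and supplies the
     * data, otherwise the main-memory access proceeds. */
    enum ctrl_state { IDLE, CHECK, SUPPLY, FORWARD };

    struct ctrl_inputs {
        int l1_miss;    /* the primary cache signalled a miss       */
        int local_hit;  /* the address matched a line in local SRAM */
    };

    enum ctrl_state ctrl_step(enum ctrl_state s, const struct ctrl_inputs *in)
    {
        switch (s) {
        case IDLE:      /* wait for an L1 miss */
            return in->l1_miss ? CHECK : IDLE;
        case CHECK:     /* compare the miss address against local tags */
            return in->local_hit ? SUPPLY : FORWARD;
        case SUPPLY:    /* abort the memory request, return the data to the
                           CPU, and (for a prefetch slot) start the next fetch */
            return IDLE;
        case FORWARD:   /* no local copy: let the memory request complete */
            return IDLE;
        }
        return IDLE;
    }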

2.2 Logical vs. Physical Placement

It is important to make a distinction between the logical and the physical placement of the configurable logic. The discussion above described the logical placement, i.e., where the configurable logic needs to be positioned in terms of the signals it needs access to. Physically, we anticipate future generations of microprocessors to have a fixed amount of configurable logic which may be used for a variety of tasks including but not limited to accelerating computation and memory accesses. Since each application will have different requirements on where the configurable logic should be placed, it is not reasonable to expect it to be positioned in exactly the positions marked in Figure 1. The key demand that our logical placement makes on configurable logic's physical placement is that of connectivity; there must be signal paths that provide access to the signals required by the logical usage.
The physical placement impacts performance in two ways. First, the interconnect delay may be significant. Second, the interconnect topology may lead to contention of shared resources such as buses. This contention has not been modelled in the initial study presented here.

2.3 Programming Model

In addition to hardware concerns, our configurable memory hierarchies raise software issues as well. Figure 2 gives a flowchart that represents the path a program would go through in order to make use of configurable memory hierarchy additions.

Figure 2: Flowchart of programming model with configurable memory hierarchy additions. (Flow: Source code -> Evaluation -> Code with config. info -> Run program -> Result, with a Feedback path from the program run back to the configuration step.)

Starting from conventional source code, some performance analysis is needed to determine what types of performance bottlenecks are present. Based on this analysis, configuration directives may be added into the code in order to use particular configurable memory hierarchy additions. The configuration directives may either make use of hardware from a parameterized library of pre-made designs, or they may specify custom-made configurations. Our flowchart also allows for the possibility that these configuration directives will either be hand-inserted into the code by the programmer, or will be automatically inserted as part of the compilation process. At this point, the code is executed, and the memory hierarchy additions are configured as part of the program run. If the code is to be run repeatedly, then the performance statistics about this program run can be used as feedback to modify the configuration for future runs. Clearly, the widespread use of this sort of hardware relies on automating the analysis and choice of configurations as much as possible. In Section 3's preliminary case studies, however, we set the configuration directives manually.
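
As a hedged illustration of what such hand-inserted directives might look like in C source, consider the sketch below. The pragma spelling and parameter names are hypothetical; the paper does not define a directive syntax, and a real system could equally use library calls.

    /* Hypothetical configuration directives. The "config_mem" pragma,
     * its parameters, and the entry/slot counts are illustrative only;
     * they mirror the structures simulated in Section 3. */
    #pragma config_mem victim_cache(entries=4, replacement=LRU)
    #pragma config_mem prefetch_buffer(slots=4, lines_per_slot=4, stride=1)

    void kernel(double *a, double *b, double *c, int n)
    {
        for (int i = 0; i < n; i++)  /* memory-bound loop the additions target */
            c[i] = a[i] + b[i];
    }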

3 Application Case Studies

In order to quantitatively evaluate our idea, we present results on three case study applications.

3.1 Configurable Memory Hardware

In the case studies presented here, we assume the configurable memory hardware sits on the same chip as the processor and first-level (L1) cache. This corresponds to location C1 in Figure 1. For these studies, we examine two configurable hierarchy additions separately: a victim cache and a prefetch buffer.

3.1.1 Victim Cache

A victim cache is a fully-associative cache with typically no more than 16 entries. When an L1 cache miss occurs, the referenced line is pulled into the L1 cache and another line currently stored in the same cache line is evicted. The victim cache is a small fully-associative buffer that holds these evicted lines. Using victim caches can have the effect of increasing the set associativity at low hardware cost.

Figure 3 shows a diagram of the victim cache we consider. The address and data lines go to both the L1 cache and the victim cache. On an L1 cache hit, the data is provided directly from the L1 cache, and the configurable hardware is not involved. On an L1 cache miss, the victim cache is probed in parallel with the request to the next level of memory. If the reference hits in the victim cache, the victim cache provides the data to the processor. When the data is pulled into the L1 cache, another line must be evicted from the cache. The victim cache intercepts this line and stores it.

Figure 3: Block diagram of victim cache.
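
As a minimal software model of the behavior just described, the following C sketch probes and refills a four-entry victim cache. The structure layout and function names are our own assumptions; the entry count, line size, and LRU policy follow the simulation parameters given in Section 3.2.

    #include <string.h>

    #define VC_ENTRIES 4           /* fully associative, 4 cache-line entries */
    #define LINE_SIZE  32          /* bytes per cache line */

    struct vc_entry {
        unsigned long tag;         /* line-aligned address */
        int           valid;
        int           lru;         /* larger value = more recently used */
        unsigned char data[LINE_SIZE];
    };

    static struct vc_entry vc[VC_ENTRIES];

    /* Probe the victim cache on an L1 miss. Returns 1 on a hit and copies
     * the line out; on a miss the caller lets the memory request proceed. */
    int vc_lookup(unsigned long addr, unsigned char *line_out, int now)
    {
        unsigned long tag = addr / LINE_SIZE;
        for (int i = 0; i < VC_ENTRIES; i++) {
            if (vc[i].valid && vc[i].tag == tag) {
                memcpy(line_out, vc[i].data, LINE_SIZE);
                vc[i].lru = now;   /* mark most recently used */
                return 1;          /* hit: abort the memory request */
            }
        }
        return 0;
    }

    /* Intercept a line evicted from the L1 cache, replacing the LRU entry. */
    void vc_insert(unsigned long addr, const unsigned char *line, int now)
    {
        int victim = 0;
        for (int i = 1; i < VC_ENTRIES; i++)
            if (vc[i].lru < vc[victim].lru)
                victim = i;
        vc[victim].tag   = addr / LINE_SIZE;
        vc[victim].valid = 1;
        vc[victim].lru   = now;
        memcpy(vc[victim].data, line, LINE_SIZE);
    }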

3.1.2 Prefetching Buffer

The goal of a prefetch buffer is to initiate main memory accesses in advance, so that the data will be closer to the processor when referenced. Prefetching can be especially beneficial when the main memory bandwidth is large, but the access latency is also quite high. The prefetch buffer in our model has several independent "slots". Each slot holds several cache lines and works like a FIFO. It can issue prefetching commands to the main memory. We need several slots because in a typical program, there will be several concurrent data access streams. For example, when a program is performing matrix addition, there will be three access streams for the two source arrays and the destination array. Figure 4 is the schematic of a prefetch buffer with four slots; each slot holds four cache lines of data.

Figure 4: Block diagram of prefetch buffers.

If there is an L1 cache hit, it does not trigger any operation in the prefetch buffer. If there is an L1 miss, the request is sent to the prefetch buffer. If the referenced item is available in one of the slots, this line is pulled into the CPU as well as the L1 cache. The remaining lines move toward the top of the FIFO. The vacancy is filled with data prefetched from main memory. The prefetching engine must determine which line to prefetch next. In the simple case, the memory address for the subsequent line is calculated by incrementing the current address by the cache line size.

If the L1 cache miss also misses in the prefetch buffer, the least recently used slot is designated to begin prefetching the subsequent data following this referenced address. By doing this, we can avoid compulsory cache misses if the initial reference to a memory line is already covered by the prefetching.

To make better use of the slots, we may program them with address ranges and data access strides. For example, when we have a lot of accesses on one array, we can set one slot with the address range of the data array. The slot will only respond to references in this range. If the access pattern has an address increment stride larger than the cache line size, we can assign the stride so we only prefetch the data that is useful.
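
The slot behavior described above can be modelled in C as follows. This is a sketch under our own naming assumptions (pf_slot, pf_lookup, the range fields); the paper specifies only the behavior, four slots, and four lines per slot.

    #define SLOTS      4
    #define SLOT_DEPTH 4               /* cache lines held per slot (FIFO) */
    #define LINE_SIZE  32

    struct pf_slot {
        unsigned long next_addr;       /* next line address to prefetch     */
        unsigned long stride;          /* address increment between fetches */
        unsigned long range_lo;        /* slot responds only inside this    */
        unsigned long range_hi;        /* range (set to all of memory for   */
                                       /* an unfiltered slot)               */
        unsigned long lines[SLOT_DEPTH]; /* line addresses currently held   */
        int           count;
        int           last_used;       /* for LRU slot reallocation         */
    };

    /* On an L1 miss, check each slot; a hit pops the line and triggers a
     * prefetch of next_addr to refill the vacancy at the tail of the FIFO. */
    int pf_lookup(struct pf_slot *s, int nslots, unsigned long addr, int now)
    {
        unsigned long line = addr / LINE_SIZE * LINE_SIZE;
        for (int i = 0; i < nslots; i++) {
            if (line < s[i].range_lo || line >= s[i].range_hi)
                continue;              /* slot only responds within its range */
            for (int j = 0; j < s[i].count; j++) {
                if (s[i].lines[j] == line) {
                    /* shift remaining lines toward the head of the FIFO */
                    for (int k = j; k + 1 < s[i].count; k++)
                        s[i].lines[k] = s[i].lines[k + 1];
                    /* refill the vacancy with the next prefetched line */
                    s[i].lines[s[i].count - 1] = s[i].next_addr;
                    s[i].next_addr += s[i].stride;
                    s[i].last_used = now;
                    return 1;          /* prefetch-buffer hit */
                }
            }
        }
        return 0;                      /* miss: caller retargets the LRU slot */
    }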

3.2 Evaluation Methodology

In this study, we use software simulations to estimate the performance for our applications. In particular, we used the TangoLite memory reference generator to perform high-level simulations. TangoLite uses a direct-execution simulation approach. The original program written in C or FORTRAN is compiled and instrumented with calls out to routines that simulate memory behavior and keep statistics.

In our simulation model, we have a single CPU that issues no more than one instruction per cycle. The direct-mapped L1 data cache has a capacity of 8 KB and a line size of 32 bytes. In order to focus on data references, we have assumed ideal memory performance in the instruction stream.

Our simulations assume a hit in the L1 cache costs one cycle, the same as a simple ALU instruction. If a reference misses in the L1 cache, additional cycles are needed depending on where the data is found. We assume that data found in the configurable cache takes 2 extra cycles and a miss in both L1 and configurable cache will cost 10 extra cycles to go to main memory.

The victim cache we simulate is fully-associative with 4 cache-line-sized entries, updated according to an LRU policy. The prefetch buffer has four slots with 4 cache lines each. Our simulator ignores any additional contention that prefetching may introduce.
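
Stated as code, the latency model reduces to a few constants; the C function below simply restates the cycle counts given above (the names are ours).

    /* Cycle costs from the simulation model: L1 hit = 1 cycle; hit in the
     * configurable structure = 1 + 2 cycles; main memory = 1 + 10 cycles. */
    enum where { HIT_L1, HIT_CONFIGURABLE, HIT_MEMORY };

    static int access_cycles(enum where w)
    {
        switch (w) {
        case HIT_L1:           return 1;       /* same as a simple ALU op */
        case HIT_CONFIGURABLE: return 1 + 2;   /* victim cache or prefetch buffer */
        case HIT_MEMORY:       return 1 + 10;  /* miss in both L1 and configurable cache */
        }
        return 0;
    }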

3.3 Per-Application Results

With that background, we will now look in more detail at three applications. Our goal is to see the potential performance improvement available by implementing memory hierarchy additions in configurable hardware. The three applications studied include two kernel applications known for their poor cache performance (Matrix Multiplication and Fast Fourier Transform) as well as a more substantial program: the Tomcatv benchmark from the SPEC92 suite [9].

3.3.1 Matrix Multiplication

In the first example, we study a matrix multiplication program multiplying two 100 by 100 matrices. The elements are double-precision floating point. This means that a total of 80000 bytes are needed for each matrix. This greatly exceeds the size of the L1 cache. Clearly the calculation involves two source matrices and one destination matrix. As shown in Figure 5, one of the source matrices is accessed in sequential column order while the other is accessed in sequential row order.

Figure 5: Memory access pattern for arrays in matrix multiplication.
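
For reference, a plain C triple loop such as the sketch below (our reconstruction, not code reproduced from the paper) exhibits exactly the two access streams shown in Figure 5: a[i][k] walks along a row one element at a time, while b[k][j] walks down a column with a stride of one 800-byte row.

    #define N 100

    double a[N][N], b[N][N], c[N][N];

    void matmul(void)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                double sum = 0.0;
                for (int k = 0; k < N; k++)
                    /* a is accessed in row order (stride of one element);
                       b is accessed in column order (stride of one row,
                       i.e. 800 bytes), which defeats stride-one prefetch */
                    sum += a[i][k] * b[k][j];
                c[i][j] = sum;
            }
    }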

Results. The simulated performance of 100x100 matrix multiplication is summarized in Table 1. The standard cache does not perform well, both because of the size of the data set and also because one of the arrays is accessed down the columns. Overall, the data miss rate is 21%.

The second column of the table shows the new results obtained when a configurable victim cache is used. In this case, some of the conflict misses can be avoided. The overall performance is only slightly better than in the original case, though. The third column of the table shows the results when a configurable prefetch engine is used. In this "stride-one prefetching" each prefetch slot always attempts to prefetch the next cache line. This prefetching allows the program to avoid some compulsory misses, but does not dramatically improve overall performance. The difficulty here is that one of the arrays has a columnar memory access pattern that does not benefit from the stride-one prefetching.

Since simple stride-one prefetching is not fully effective, we also investigated allowing the prefetching engine to adjust its prefetch stride on a per-slot basis. The fourth column shows the results of this experiment. Here, we have allocated one slot to each matrix. For the source matrix that accesses elements across a row, the prefetch stride is still one cache line. For the source matrix whose elements are accessed down the columns, we set the stride to be the size of one row in the matrix. Every reference to an address within the matrix will only update t