`
`
PROCEEDINGS
SPIE-The International Society for Optical Engineering

High-Speed Computing, Digital
Signal Processing, and Filtering
Using Reconfigurable Logic

John Schewel
Peter M. Athanas
V. Michael Bove, Jr.
John Watson
Chairs/Editors

20-21 November 1996
Boston, Massachusetts

Sponsored and Published by
SPIE-The International Society for Optical Engineering

Volume 2914

SPIE is an international technical society dedicated to advancing engineering and scientific applications of optical, photonic, imaging, electronic, and optoelectronic technologies.
`
`
`
`
The papers appearing in this book comprise the proceedings of the meeting mentioned on the cover and title page. They reflect the authors' opinions and are published as presented and without change, in the interests of timely dissemination. Their inclusion in this publication does not necessarily constitute endorsement by the editors or by SPIE.

Please use the following format to cite material from this book:
Author(s), "Title of paper," in High-Speed Computing, Digital Signal Processing, and Filtering Using Reconfigurable Logic, John Schewel, Peter M. Athanas, V. Michael Bove, Jr., John Watson, Editors, Proc. SPIE 2914, page numbers (1996).

Library of Congress Catalog Card No. 96-69766
ISBN 0-8194-2316-5

Published by
SPIE-The International Society for Optical Engineering
P.O. Box 10, Bellingham, Washington 98227-0010 USA
Telephone 360/676-3290 (Pacific Time) • Fax 360/647-1445

Copyright © 1996, The Society of Photo-Optical Instrumentation Engineers.

Copying of material in this book for internal or personal use, or for the internal or personal use of specific clients, beyond the fair use provisions granted by the U.S. Copyright Law is authorized by SPIE subject to payment of copying fees. The Transactional Reporting Service base fee for this volume is $6.00 per article (or portion thereof), which should be paid directly to the Copyright Clearance Center (CCC), 222 Rosewood Drive, Danvers, MA 01923. Payment may also be made electronically through CCC Online at http://www.directory.net/copyright/. Other copying for republication, resale, advertising or promotion, or any form of systematic or multiple reproduction of any material in this book is prohibited except with permission in writing from the publisher. The CCC fee code is 0-8194-2316-5/96/$6.00.

Printed in the United States of America.
`
`
`
`
Using Reconfigurable Hardware to Customize Memory Hierarchies

Peixin Zhong and Margaret Martonosi
Department of Electrical Engineering
Princeton University
Princeton, NJ 08544-5263
{pzhong, martonosi}@ee.princeton.edu
`
Abstract
`
Over the past decade or more, processor speeds have increased much more quickly than memory speeds. As a result, a large, and still increasing, processor-memory performance gap has formed. Many significant applications suffer from substantial memory bottlenecks, and their memory performance problems are often either too unusual or extreme to be mitigated by cache memories alone. Such specialized performance "bugs" require specialized solutions, but it is impossible to provide case-by-case memory hierarchies or caching strategies on general-purpose computers.

We have investigated the potential of implementing mechanisms like victim caches and prefetch buffers in reconfigurable hardware to improve application memory behavior. Based on technology and commercial trends, our simulation-based studies use a forward-looking model in which configurable logic is located on the CPU chip. Given such assumptions, our results show that the flexibility of being able to specialize configurable hardware to an application's memory referencing behavior more than balances the slightly slower response times of configurable memory hierarchy structures. For our three applications, small, specialized memory hierarchy additions such as victim caches and prefetch buffers can reduce miss rates substantially and can drop total execution times for these programs to between 60 and 80% of their original execution times. Our results also indicate that different memory specializations may be most effective for each application; this highlights the usefulness of configurable memory hierarchies that are specialized on a per-application basis.

Keywords: memory latency, configurable computing, victim cache, prefetching.
`
1 Introduction

Due to rapid increases in microprocessor speeds, the performance gap between processors and main memory is widening. Cache memories are typically used in computer systems to bridge this performance gap and reduce the average memory access time. Although caches work well in many cases, they may still fail to provide high performance for certain applications.

Several hardware and software techniques have been proposed to improve cache performance in such cases. For example, prefetching techniques aim to hide the large latency out to main memory by bringing data to the cache before it is referenced. Victim caches attempt to reduce conflict misses in low-associativity caches. These hardware techniques have variable results depending on the application's memory referencing behavior. Their disadvantage is that they represent wasted transistor space on the CPU chip for those applications where they are ineffective. On the other hand, the drawback to purely software-based techniques (such as blocked matrix accesses or compiler-inserted prefetching directives) is that it can be difficult to statically analyze a program's memory behavior and determine when such techniques will be useful. For these reasons, this paper explores implementing memory hierarchy additions in programmable hardware.

Programmable logic, such as field-programmable gate arrays (FPGAs), has gained tremendous popularity in the past decade. Programmable logic is popular because a given chip's behavior can be configured and customized for different functions during different sessions. Customization on a per-application basis is feasible because the reconfiguration process is fast and can be done with the device in the system. Some FPGAs can even be partially reconfigured while the rest of the device is in use.
`
`
`
`
The configurability of FPGAs makes them heavily used for prototyping, but they have also been used to build high-performance application-specific compute engines [12]. In addition, there has been work (such as PRISC [21] and PRISM [2]) on supplementing conventional processors with configurable coprocessors to accelerate performance for different applications.

Thus far, most configurable computing projects have focused heavily on computation as opposed to data access. In this paper, we explore the potential of using configurable hardware to make application-specific improvements to memory behavior. We envision a library of parameterized memory hierarchy additions that can be invoked to target cases where the cache system does not work well for a particular application.

As fabrication technology improves, more and more transistors fit on a single chip. It seems likely that we will soon see processor chips that include a region of on-chip configurable logic. This configurable logic can clearly have many uses; our work does not preclude configuring the logic for more traditional compute-oriented uses, but simply attempts to explore an alternative use.

In Section 2, we will first discuss the structure of the configurable memory unit that we envision. Following that, in Section 3, we present case studies of applying the ideas of configurable memory to several applications. In Section 4, we discuss some of the hardware organization and implementation issues inherent in our approach. A brief account of related work is included in Section 5, and Section 6 presents some discussion and our conclusions.
`
2 Configurable Hardware Memory Hierarchies

We believe that configurable logic can result in significant performance improvement by improving average memory access latencies. Our overriding goal is to minimize the number of cache misses that result in accesses to main memory. Researchers have proposed methods that promise performance improvements of up to 3X by reducing cache misses using full-custom hardware [19]. These methods are rarely included in commercial processors, however, because they do not provide performance improvements for a broad enough set of applications. Our current research shows the potential of these approaches using flexible, configurable hardware instead.

Integrated circuits are expected to soon grow to contain over 100 million transistors. As this growth takes place, we must determine ways to best make use of these additional transistors. Rather than simply devoting increased chip area to increased cache sizes, our research explores other methods for using the transistors to reduce (or better tolerate) memory access latencies.

In current research projects, configurable logic is typically incorporated into the architecture using an attached processor array model. As shown on the left hand side of Figure 1, an accelerator based on FPGAs and dedicated static RAM (SRAM) is attached to the I/O bus of a conventional host processor. The conventional processor and configurable logic array operate asynchronously. The host supplies control signals and monitors results while the logic array processes data obtained from an external source such as a frame buffer. The major problem with this model is the high communication latency between processor and configurable logic, due to their physical and logical separation.

As shown on the right hand side of Figure 1, the expected integration of configurable logic on-chip gives us more flexibility not only in its logical placement within the architecture, but also in its expected uses. In addition, on-chip configurable logic has more flexibility in its connectivity to processors and caches. We can now begin considering uses for configurable logic that are infeasible (because of latency and connectivity constraints) with the attached processor model.

Figure 1: Traditional (left) and more forward-looking (right) uses of configurable logic.

2.1 Logical Placement, Connectivity

The logical placement of the configurable logic is driven by its intended memory optimization function. The right hand side of Figure 1 shows two distinct possibilities, and in fact, each of these configurable blocks may expend their transistor budgets on some combination of configurable gates and associated SRAM. (In this figure, the logical positions for configurable logic are indicated by diamonds labelled "C1" and "C2".)

The configurable logic closest to the processor can detect L1 cache misses and manage prefetch buffers, stream buffers, or a victim cache. Similar functions can be performed by the second block of configurable logic, for the L2 cache. In addition, this logic could observe memory accesses between nodes of a distributed memory multiprocessor, or hold specialized multiprocessor coherence protocols.
In this paper, we focus primarily on two particular optimizations: victim caches and prefetching engines, implemented in configurable hardware. Victim caches are small, fully-associative caches that lie between the cache and main memory. Victim caches primarily aim to reduce conflict misses by caching data that has been evicted from the cache as a result of a miss on another location. Even small (four-entry) victim caches can reduce conflict misses by about 50% and reduce total cache misses by 5-30% [19].

Prefetching engines are hardware structures intended to fetch pieces of data from main memory before references to them occur, in order to hide the large latencies out to main memory. They have been studied in several contexts before [6, 7, 20]. Although such hardware mechanisms have been evaluated via simulation studies, widespread adoption in custom hardware is unlikely, since their benefits vary from application to application. Assuming that future processors will include a block of configurable logic on-chip, this paper uses simulations to re-evaluate these mechanisms within the context of a dynamically customizable memory hierarchy.

For prefetch or stream buffers, the configurable logic must be able to detect a miss from the L1 (or L2) cache, recognize that the address matches the address of data in the prefetch buffer, abort the memory request, and supply the data to the CPU. Simultaneous to servicing the access request, the prefetch controller may initiate a memory request for the next location from which to prefetch. For a victim cache, the logic must once again be able to detect the primary cache miss, recognize the presence of the data in its SRAM, abort the memory request, and supply the data to the CPU. These functions can be easily performed using a small state machine controller.
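To make the control logic concrete, the following sketch (ours, not the authors') writes the miss-handling sequence just described as a small four-state controller in C; the state names and the function miss_controller_step() are invented for illustration only.

    #include <stdbool.h>

    typedef enum { IDLE, PROBE_LOCAL, SUPPLY_DATA, WAIT_MEMORY } ctrl_state_t;

    typedef struct {
        ctrl_state_t state;
    } miss_ctrl_t;

    /* One controller step per cycle: watch for an L1 miss, probe the local
     * configurable structure (victim cache or prefetch buffer), and either
     * abort the outstanding memory request and supply the data ourselves or
     * let the access fall through to the next level of memory. */
    ctrl_state_t miss_controller_step(miss_ctrl_t *c, bool l1_miss, bool local_hit)
    {
        switch (c->state) {
        case IDLE:
            if (l1_miss)
                c->state = PROBE_LOCAL;          /* L1 miss detected           */
            break;
        case PROBE_LOCAL:
            c->state = local_hit ? SUPPLY_DATA   /* hit: abort memory request  */
                                 : WAIT_MEMORY;  /* miss: memory services it   */
            break;
        case SUPPLY_DATA:                        /* drive data back to the CPU */
        case WAIT_MEMORY:                        /* wait for main memory reply */
            c->state = IDLE;
            break;
        }
        return c->state;
    }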
`
2.2 Logical vs. Physical Placement

It is important to make a distinction between the logical and the physical placement of the configurable logic. The discussion above described the logical placement, i.e., where the configurable logic needs to be positioned in terms of the signals it needs access to. Physically, we anticipate future generations of microprocessors to have a fixed amount of configurable logic which may be used for a variety of tasks including but not limited to accelerating computation and memory accesses. Since each application will have different requirements on where the configurable logic should be placed, it is not reasonable to expect it to be positioned in exactly the positions marked in Figure 1. The key demand that our logical placement makes on configurable logic's physical placement is that of connectivity; there must be signal paths that provide access to the signals required by the logical usage. The physical placement impacts performance in two ways. First, the interconnect delay may be significant. Second, the interconnect topology may lead to contention for shared resources such as buses. This contention has not been modelled in the initial study presented here.
`
2.3 Programming Model

In addition to hardware concerns, our configurable memory hierarchies raise software issues as well. Figure 2 gives a flowchart that represents the path a program would go through in order to make use of configurable memory hierarchy additions.

Figure 2: Flowchart of programming model with configurable memory hierarchy additions. (The flowchart steps are: source code, evaluation, compilation, code with configuration info, program run, and result, with a feedback loop.)

Starting from conventional source code, some performance analysis is needed to determine what types of performance bottlenecks are present. Based on this analysis, configuration directives may be added into the code in order to use particular configurable memory hierarchy additions. The configuration directives may either make use of hardware from a parameterized library of pre-made designs, or they may specify custom-made configurations. Our flowchart also allows for the possibility that these configuration directives will either be hand-inserted into the code by the programmer, or will be automatically inserted as part of the compilation process. At this point, the code is executed, and the memory hierarchy additions are configured as part of the program run. If the code is to be run repeatedly, then the performance statistics about this program run can be used as feedback to modify the configuration for future runs. Clearly, the widespread use of this sort of hardware relies on automating the analysis and choice of configurations as much as possible. In Section 3's preliminary case studies, however, we set the configuration directives manually.
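To make the directive step concrete, here is a purely hypothetical sketch of what such source-level configuration directives might look like in C. The paper does not define a directive syntax, so the pragma name config_mem and its parameters are invented for illustration; an ordinary compiler simply ignores unrecognized pragmas.

    #include <stdio.h>

    #define N 1024
    static double a[N], b[N], c[N];

    /* Hypothetical directives: ask for a 4-entry victim cache behind the L1
     * data cache, and one prefetch slot per source array with a 32-byte
     * (one cache line) stride restricted to that array's address range. */
    #pragma config_mem victim_cache(entries = 4)
    #pragma config_mem prefetch_slot(range = a, stride = 32)
    #pragma config_mem prefetch_slot(range = b, stride = 32)

    int main(void)
    {
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];      /* the program being configured */
        printf("%f\n", c[0]);
        return 0;
    }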
`
3 Application Case Studies

In order to quantitatively evaluate our idea, we present results on three case study applications.

3.1 Configurable Memory Hardware

In the case studies presented here, we assume the configurable memory hardware sits on the same chip as the processor and first-level (L1) cache. This corresponds to location C1 in Figure 1. For these studies, we examine two configurable hierarchy additions separately: a victim cache and a prefetch buffer.
`
3.1.1 Victim Cache

A victim cache is a fully-associative cache with typically no more than 16 entries. When an L1 cache miss occurs, the referenced line is pulled into the L1 cache and another line currently stored in the same cache line is evicted. The victim cache is a small fully-associative buffer that holds these evicted lines. Using victim caches can have the effect of increasing the set associativity at low hardware cost.

Figure 3 shows a diagram of the victim cache we consider. The address and data lines go to both the L1 cache and the victim cache. On an L1 cache hit, the data is provided directly from the L1 cache, and the configurable hardware is not involved. On an L1 cache miss, the victim cache is probed in parallel with the request to the next level of memory. If the reference hits in the victim cache, the victim cache provides the data to the processor. When the data is pulled into the L1 cache, another line must be evicted from the cache. The victim cache intercepts this line and stores it.

Figure 3: Block diagram of victim cache.
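A minimal software sketch of this behaviour is shown below, assuming the parameters used later in the simulations (four fully-associative entries, 32-byte lines, LRU replacement); the data types and function names are ours, not the paper's, and only the tag bookkeeping is modelled.

    #include <stdbool.h>
    #include <stdint.h>

    #define VC_ENTRIES 4
    #define LINE_BYTES 32

    typedef struct {
        uint64_t tag[VC_ENTRIES];    /* line-aligned addresses of held lines */
        bool     valid[VC_ENTRIES];
        unsigned lru[VC_ENTRIES];    /* larger value = more recently used    */
        unsigned clock;
    } victim_cache_t;

    static uint64_t line_of(uint64_t addr) { return addr / LINE_BYTES; }

    /* Probe on an L1 miss: true means the buffer holds the requested line,
     * so the request to the next level of memory can be aborted. */
    bool vc_probe(victim_cache_t *vc, uint64_t addr)
    {
        for (int i = 0; i < VC_ENTRIES; i++)
            if (vc->valid[i] && vc->tag[i] == line_of(addr)) {
                vc->lru[i] = ++vc->clock;
                return true;
            }
        return false;
    }

    /* Intercept the line evicted from the L1 cache, replacing the LRU entry. */
    void vc_insert(victim_cache_t *vc, uint64_t evicted_addr)
    {
        int victim = 0;
        for (int i = 0; i < VC_ENTRIES; i++) {
            if (!vc->valid[i]) { victim = i; break; }      /* free entry first */
            if (vc->lru[i] < vc->lru[victim]) victim = i;  /* else LRU entry   */
        }
        vc->valid[victim] = true;
        vc->tag[victim]   = line_of(evicted_addr);
        vc->lru[victim]   = ++vc->clock;
    }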
`
3.1.2 Prefetching Buffer

The goal of a prefetch buffer is to initiate main memory accesses in advance, so that the data will be closer to the processor when referenced. Prefetching can be especially beneficial when the main memory bandwidth is large, but the access latency is also quite high. The prefetch buffer in our model has several independent "slots". Each slot holds several cache lines and works like a FIFO. It can issue prefetching commands to the main memory. We need several slots because in a typical program, there will be several concurrent data access streams. For example, when a program is performing matrix addition, there will be three access streams for the two source arrays and the destination array. Figure 4 is the schematic of a prefetch buffer with four slots; each slot holds four cache lines of data.

Figure 4: Block diagram of prefetch buffers.

If there is an L1 cache hit, it does not trigger any operation in the prefetch buffer. If there is an L1 miss, the request is sent to the prefetch buffer. If the referenced item is available in one of the slots, this line is pulled into the CPU as well as the L1 cache. The remaining lines move toward the top of the FIFO. The vacancy is filled with data prefetched from main memory. The prefetching engine must determine which line to prefetch next. In the simple case, the memory address for the subsequent line is calculated by incrementing the current address by the cache line size.

If the L1 cache miss also misses in the prefetch buffer, the least recently used slot is designated to begin prefetching the subsequent data following this referenced address. By doing this, we can avoid compulsory cache misses if the initial reference to a memory line is already covered by the prefetching.

To make better use of the slots, we may program them with address ranges and data access strides. For example, when we have many accesses to one array, we can set one slot with the address range of the data array. The slot will only respond to references in this range. If the access pattern has an address increment stride larger than the cache line size, we can assign the stride so we only prefetch the data that is useful.
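The sketch below is our own restatement of the slot policy just described, not the authors' simulator: four slots, each a FIFO of four line addresses, with an optional per-slot address range and stride, and LRU re-targeting of a slot when a miss hits no slot at all.

    #include <stdbool.h>
    #include <stdint.h>

    #define SLOTS      4
    #define SLOT_DEPTH 4
    #define LINE_BYTES 32

    typedef struct {
        uint64_t fifo[SLOT_DEPTH]; /* line addresses currently prefetched   */
        int      count;
        uint64_t next;             /* next address this slot will prefetch  */
        uint64_t base, limit;      /* optional address range (base 0 = any) */
        uint64_t stride;           /* prefetch stride in bytes              */
        unsigned last_used;        /* for LRU slot replacement              */
    } pf_slot_t;

    typedef struct { pf_slot_t slot[SLOTS]; unsigned clock; } pf_buffer_t;

    static void pf_issue(pf_slot_t *s)             /* model one prefetch issue */
    {
        if (s->count < SLOT_DEPTH) {
            s->fifo[s->count++] = s->next;
            s->next += s->stride;
        }
    }

    /* Called on an L1 miss; returns true if a slot already holds the line. */
    bool pf_on_l1_miss(pf_buffer_t *pb, uint64_t addr)
    {
        uint64_t line_addr = addr - addr % LINE_BYTES;
        pb->clock++;

        for (int i = 0; i < SLOTS; i++) {
            pf_slot_t *s = &pb->slot[i];
            if (s->base && (addr < s->base || addr >= s->limit))
                continue;                             /* outside this slot's range */
            for (int j = 0; j < s->count; j++)
                if (s->fifo[j] == line_addr) {
                    for (int k = j; k + 1 < s->count; k++)
                        s->fifo[k] = s->fifo[k + 1];  /* pop; others move up */
                    s->count--;
                    s->last_used = pb->clock;
                    pf_issue(s);                      /* refill the vacancy  */
                    return true;
                }
        }

        /* Missed in every slot: re-target the least recently used slot to
         * prefetch sequentially after the address that just missed. */
        pf_slot_t *lru = &pb->slot[0];
        for (int i = 1; i < SLOTS; i++)
            if (pb->slot[i].last_used < lru->last_used)
                lru = &pb->slot[i];
        if (lru->stride == 0)
            lru->stride = LINE_BYTES;                 /* default: next cache line */
        lru->count     = 0;
        lru->next      = line_addr + lru->stride;
        lru->last_used = pb->clock;
        for (int j = 0; j < SLOT_DEPTH; j++)
            pf_issue(lru);
        return false;
    }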
`
3.2 Evaluation Methodology

In this study, we use software simulations to estimate the performance for our applications. In particular, we used the Tangolite memory reference generator to perform high-level simulations. Tangolite uses a direct-execution simulation approach. The original program written in C or FORTRAN is compiled and instrumented with calls out to routines that simulate memory behavior and keep statistics.

In our simulation model, we have a single CPU that issues no more than one instruction per cycle. The direct-mapped L1 data cache has a capacity of 8 KB and a line size of 32 bytes. In order to focus on data references, we have assumed ideal memory performance in the instruction stream.

Our simulations assume a hit in the L1 cache costs one cycle, the same as a simple ALU instruction. If a reference misses in the L1 cache, additional cycles are needed depending on where the data is found. We assume that data found in the configurable cache takes 2 extra cycles, and a miss in both the L1 and the configurable cache will cost 10 extra cycles to go to main memory.

The victim cache we simulate is fully-associative with 4 cache-line-sized entries, updated according to an LRU policy. The prefetch buffer has four slots with 4 cache lines each. Our simulator ignores any additional contention that prefetching may introduce.
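Written out as code, the per-reference cost model used in these simulations is simply the following (our restatement of the numbers above, not the simulator itself):

    enum where_found { IN_L1, IN_CONFIG_STRUCTURE, IN_MAIN_MEMORY };

    /* Cycle cost charged per data reference in the simulations. */
    static int access_cycles(enum where_found w)
    {
        switch (w) {
        case IN_L1:               return 1;      /* same cost as a simple ALU op    */
        case IN_CONFIG_STRUCTURE: return 1 + 2;  /* L1 miss, configurable-cache hit */
        default:                  return 1 + 10; /* miss in both, go to main memory */
        }
    }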
`
3.3 Per-Application Results

With that background, we will now look in more detail at three applications. Our goal is to see the potential performance improvement available by implementing memory hierarchy additions in configurable hardware. The three applications studied include two kernel applications known for their poor cache performance (Matrix Multiplication and Fast Fourier Transform) as well as a more substantial program: the Tomcatv benchmark from the SPEC92 suite [9].
`
3.3.1 Matrix Multiplication

In the first example, we study a matrix multiplication program multiplying two 100 by 100 matrices. The elements are double-precision floating point. This means that a total of 80000 bytes are needed for each matrix, which greatly exceeds the size of the L1 cache. The calculation involves two source matrices and one destination matrix. As shown in Figure 5, one of the source matrices is accessed in sequential column order while the other is accessed in sequential row order.
`
Figure 5: Memory access pattern for arrays in matrix multiplication.
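For reference, the access pattern in question comes from the ordinary triple loop below; matrix a is read along rows (unit stride) while matrix b is read down columns, so consecutive b references are one full row apart. (This is our own illustration of the kernel, not the exact source the authors simulated.)

    #define N 100

    /* c = a x b over 100x100 double-precision matrices. */
    void matmul(double c[N][N], const double a[N][N], const double b[N][N])
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                double sum = 0.0;
                for (int k = 0; k < N; k++)
                    sum += a[i][k] * b[k][j];   /* a: row order; b: column order */
                c[i][j] = sum;
            }
    }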
`
Results. The simulated performance of 100x100 matrix multiplication is summarized in Table 1. The standard cache does not perform well, both because of the size of the data set and also because one of the arrays is accessed down the columns. Overall, the data miss rate is 21%.

The second column of the table shows the new results obtained when a configurable victim cache is used. In this case, some of the conflict misses can be avoided. The overall performance is only slightly better than in the original case, though. The third column of the table shows the results when a configurable prefetch engine is used. In this "stride-one prefetching", each prefetch slot always attempts to prefetch the next cache line. This prefetching allows the program to avoid some compulsory misses, but does not dramatically improve overall performance. The difficulty here is that one of the arrays has a columnar memory access pattern that does not benefit from the stride-one prefetching.

Since simple stride-one prefetching is not fully effective, we also investigated allowing the prefetching engine to adjust its prefetch stride on a per-slot basis. The fourth column shows the results of this experiment. Here, we have allocated one slot to each matrix. For the source matrix that accesses elements across a row, the prefetch stride is still one cache line. For the source matrix whose elements are accessed down the columns, we set the stride to be the size of one row in the matrix. Every reference to an address within a matrix will only update the prefetch buffer for its matrix. This technique yields much better performance. The miss rate is reduced by 15% and (as shown in Figure 6) the total program execution time is reduced to 61% of the original.
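The per-slot set-up used for this "column prefetch" experiment can be pictured as follows; pf_slot_config() is a hypothetical interface invented here, since the paper sets these directives by hand rather than through a documented API.

    #include <stddef.h>

    #define N          100
    #define LINE_BYTES 32

    /* Hypothetical call: give slot 'slot' an address range and a stride. */
    extern void pf_slot_config(int slot, const void *base, size_t len,
                               size_t stride_bytes);

    void configure_matmul_slots(const double a[N][N], const double b[N][N])
    {
        /* a is walked along rows: the next useful line is the next line.   */
        pf_slot_config(0, a, sizeof(double[N][N]), LINE_BYTES);
        /* b is walked down a column: the next useful element sits one full
         * row (100 * 8 = 800 bytes) ahead, so prefetch with that stride.   */
        pf_slot_config(1, b, sizeof(double[N][N]), N * sizeof(double));
    }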
`
3.3.2 Fast Fourier Transformation

The next example we examine is the Fast Fourier Transform (FFT), a commonly used algorithm in signal processing. FFT's access pattern is not as regular as other matrix computations; the butterfly memory access pattern is shown in Figure 7 for an 8-point radix-2 FFT. In this figure, the data is prepared in bit-reversed order and the result is in normal order. The calculation is done in place, i.e., only one data array is used for source, intermediate, and destination. The computation is done in dual pairs as follows:

$X_{m+1}(p) = X_m(p) + W_N^r \, X_m(q)$

where $W_N = e^{-i(2\pi/N)}$ is the coefficient, which is precomputed in our program. The subscript of X stands for the stage of computation and the number in brackets stands for the position in the data array.
`
`
`
`
                           Orig.    Victim   Stride-One   Column
                           Cache    Cache    Prefetch     Prefetch
L1 cache hit rate           79%      79%       79%          79%
Config. cache hit rate       -        2%        2%          15%
Miss rate                   21%      19%       19%           7%
Memory Lat. (10^6 cycles)   4.2      3.9       3.9          1.7
Exec. Time (10^6 cycles)    6.2      5.9       5.9          3.8

Table 1: Performance results for matrix multiplication.
`
Figure 6: Execution times with different configurable hierarchy additions, normalized to execution time with original cache hierarchy.
`
For this paper, we examine a 1024-point complex FFT. Since there are 1024 points and each point needs 2 doubles, a data array of 16 KB is involved. The corresponding coefficients are precalculated and stored as an 8 KB array. The calculation requires 10 stages. The whole data array is referenced in each stage, and the memory access is in dual pairs as shown in Figure 7. For an 8 KB direct-mapped cache, the misses come from conflicts between the data and coefficient arrays.
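For readers less familiar with the algorithm, a standard iterative radix-2 decimation-in-time implementation is sketched below (a textbook form, not the authors' program); it shows how the distance between the dual-pair elements X(p) and X(q) doubles at each of the log2(N) stages, which is what drives the prefetch-slot behaviour discussed later.

    #include <complex.h>

    /* In-place radix-2 DIT FFT stages; x[] must already be in bit-reversed
     * order, exactly as described above. */
    void fft_stages(double complex x[], int n)
    {
        const double pi = 3.14159265358979323846;

        for (int half = 1; half < n; half *= 2) {        /* one pass per stage */
            double complex wn = cexp(-I * pi / half);    /* W for this stage   */
            for (int start = 0; start < n; start += 2 * half) {
                double complex w = 1.0;
                for (int r = 0; r < half; r++) {
                    int p = start + r;
                    int q = p + half;                    /* dual pair, 'half' apart */
                    double complex t = w * x[q];         /* W_N^r * X_m(q) */
                    x[q] = x[p] - t;                     /* X_{m+1}(q)     */
                    x[p] = x[p] + t;                     /* X_{m+1}(p)     */
                    w *= wn;
                }
            }
        }
    }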
`
                           Orig.    Victim   Stride-One
                           Cache    Cache    Prefetch
L1 cache hit rate           78%      78%       78%
Config. cache hit rate       -        6%       14%
Miss rate                   22%      16%        8%
Memory Lat. (10^3 cycles)    96       76        47
Exec. time (10^3 cycles)    293      273       244

Table 2: Performance results for FFT.
`
`
`
`
Figure 7: Butterfly access pattern in FFT.
`
`
Results. Table 2 shows the simulation results for FFT. As before, the first column shows the results with the original memory hierarchy. Here, the data miss rate is 22%. Unlike in matrix multiply, the victim cache is fairly effective here in improving FFT's memory performance. The program's miss rate drops to 16%. This is due to the significant data reuse in loops and the reduction of conflict misses.

With stride-one prefetching the program's performance is improved even more. This is because of the good spatial locality and the sequential access pattern of computing the dual pairs. The miss rate in this case is reduced by 14%, and the total execution time is reduced to 83% of that with the original cache. In each stage, there is one reference sequence (i.e., one slot is used) if the dual-pair stride is less than a cache line. In later stages, the dual-pair stride will be more than four cache lines (the size of one prefetch slot), and in those cases, two slots are used. In intermediate stages (i.e., when the stride falls between one line and four lines) the access to the dual pair may cause the prefetch buffer to be updated prematurely, and useful data already in the prefetch buffer is lost. More sophisticated prefetching may avoid this problem but may be too complex to be implemented in reconfigurable hardware.
`
3.3.3 Tomcatv

Tomcatv, a SPECfp92 benchmark, is a vectorized mesh generation program written in Fortran. There are 7 data arrays used in the program. Each array is a 257 by 257 matrix of double-precision floating point. Thus, the data arrays take about 3.7 MB in total and are far larger than the cache size. The arrays are mainly accessed sequentially in row order and data elements are reused several times in each loop. Due to the large active data set, however, the miss rate is still quite high.

                           Orig.    Victim   Stride-One
                           Cache    Cache    Prefetch
L1 cache hit rate           67%      67%       67%
Config. cache hit rate       -       19%       8.3%
Miss rate                   33%      14%       25%
Memory Lat. (10^6 cycles)   912      497       739
Exec. Time (10^6 cycles)   1313      898      1130

Table 3: Performance results for tomcatv.
`
Results. Table 3 shows the results for tomcatv. Although the active data set is very large, there is a good deal of spatial and temporal locality in the memory accesses. Thus, it may be somewhat surprising that the original miss rate is as high as 33%. This high miss rate is due to a large number of conflict misses in the direct-mapped cache. As shown in the second column of the table, the victim cache gives very good results on this application by greatly reducing the conflict misses. The total miss rate drops to 14% and execution time is reduced to 69% of its original value.

The prefetch buffer, however, does not do well
`
`
`
`
technology similar to that found in a Xilinx XC4000-series FPGA [24]. This class of FPGAs uses 4-input lookup tables (LUTs) to implement logic functions. In 4000-series FPGAs, a configurable logic block (CLB) includes two 4-input LUTs. When these two are combined, any 5-input function and some 9-input functions can be implemented. Each LUT can be used as a 16x1 bit SRAM as well. In addition to the lookup tables, each CLB has two registers that can be used independently of the LUTs. Based on this information, Table 4 shows the number of CLBs required to implement certain basic functions. Although our discussion gives calculations on a per-hardware-structure basis, these estimates compare well with CLB estimates given by automated synthesis from VHDL down to Xilinx parts.
`
Function             Number of CLBs
4:1 multiplexer             1
16:1 multiplexer            5
16x2 bit memory             1
4-bit comparator            1

Table 4: CLB counts for common logic functions.
`
`in this case becaus