PROCEEDINGS
SPIE-The International Society for Optical Engineering

High-Speed Computing, Digital
Signal Processing, and Filtering
Using Reconfigurable Logic

John Schewel
Peter M. Athanas
V. Michael Bove, Jr.
John Watson
Chairs/Editors

20-21 November 1996
Boston, Massachusetts

Sponsored and Published by
SPIE-The International Society for Optical Engineering

Volume 2914

SPIE is an international technical society dedicated to advancing engineering and scientific applications of optical, photonic, imaging, electronic, and optoelectronic technologies.

The papers appearing in this book comprise the proceedings of the meeting mentioned on the cover and title page. They reflect the authors' opinions and are published as presented and without change, in the interests of timely dissemination. Their inclusion in this publication does not necessarily constitute endorsement by the editors or by SPIE.

Please use the following format to cite material from this book:
Author(s), "Title of paper," in High-Speed Computing, Digital Signal Processing, and Filtering Using Reconfigurable Logic, John Schewel, Peter M. Athanas, V. Michael Bove, Jr., John Watson, Editors, Proc. SPIE 2914, page numbers (1996).

Library of Congress Catalog Card No. 96-69766
ISBN 0-8194-2316-5

Published by
SPIE-The International Society for Optical Engineering
P.O. Box 10, Bellingham, Washington 98227-0010 USA
Telephone 360/676-3290 (Pacific Time) • Fax 360/647-1445

Copyright © 1996, The Society of Photo-Optical Instrumentation Engineers.

Copying of material in this book for internal or personal use, or for the internal or personal use of specific clients, beyond the fair use provisions granted by the U.S. Copyright Law is authorized by SPIE subject to payment of copying fees. The Transactional Reporting Service base fee for this volume is $6.00 per article (or portion thereof), which should be paid directly to the Copyright Clearance Center (CCC), 222 Rosewood Drive, Danvers, MA 01923. Payment may also be made electronically through CCC Online at http://www.directory.net/copyright/. Other copying for republication, resale, advertising or promotion, or any form of systematic or multiple reproduction of any material in this book is prohibited except with permission in writing from the publisher. The CCC fee code is 0-8194-2316-5/96/$6.00.

Printed in the United States of America.

Using Reconfigurable Hardware to Customize Memory Hierarchies

Peixin Zhong and Margaret Martonosi
Department of Electrical Engineering
Princeton University
Princeton, NJ 08544-5263
{pzhong, martonosi}@ee.princeton.edu

Abstract
Over the past decade or more, processor speeds have increased much more quickly than memory speeds. As a result, a large, and still increasing, processor-memory performance gap has formed. Many significant applications suffer from substantial memory bottlenecks, and their memory performance problems are often either too unusual or extreme to be mitigated by cache memories alone. Such specialized performance "bugs" require specialized solutions, but it is impossible to provide case-by-case memory hierarchies or caching strategies on general-purpose computers. We have investigated the potential of implementing mechanisms like victim caches and prefetch buffers in reconfigurable hardware to improve application memory behavior. Based on technology and commercial trends, our simulation-based studies use a forward-looking model in which configurable logic is located on the CPU chip. Given such assumptions, our results show that the flexibility of being able to specialize configurable hardware to an application's memory referencing behavior more than balances the slightly slower response times of configurable memory hierarchy structures. For our three applications, small, specialized memory hierarchy additions such as victim caches and prefetch buffers can reduce miss rates substantially and can drop total execution times for these programs to between 60 and 80% of their original execution times. Our results also indicate that different memory specializations may be most effective for each application; this highlights the usefulness of configurable memory hierarchies that are specialized on a per-application basis.

Keywords: memory latency, configurable computing, victim cache, prefetching.
1 Introduction

Due to rapid increases in microprocessor speeds, the performance gap between processors and main memory is widening. Cache memories are typically used in computer systems to bridge this performance gap and reduce the average memory access time. Although caches work well in many cases, they may still fail to provide high performance for certain applications.

Several hardware and software techniques have been proposed to improve cache performance in such cases. For example, prefetching techniques aim to hide the large latency out to main memory by bringing data to the cache before it is referenced. Victim caches attempt to reduce conflict misses in low-associativity caches. These hardware techniques have variable results depending on the application's memory referencing behavior. Their disadvantage is that they represent wasted transistor space on the CPU chip for those applications where they are ineffective. On the other hand, the drawback to purely software-based techniques (such as blocked matrix accesses or compiler-inserted prefetching directives) is that it can be difficult to statically analyze a program's memory behavior and determine when such techniques will be useful. For these reasons, this paper explores implementing memory hierarchy additions in programmable hardware.

Programmable logic, such as field-programmable gate arrays (FPGAs), has gained tremendous popularity in the past decade. Programmable logic is popular because a given chip's behavior can be configured and customized for different functions during different sessions. Customization on a per-application basis is feasible because the reconfiguration process is fast and can be done with the device in the system. Some FPGAs can even be partially reconfigured while the rest of the device is in use.

The configurability of FPGAs makes them heavily used for prototyping, but they have also been used to build high-performance application-specific compute engines [12]. In addition, there has been work (such as PRISC [21] and PRISM [2]) on supplementing conventional processors with configurable coprocessors to accelerate performance for different applications.

Thus far, most configurable computing projects have focused heavily on computation as opposed to data access. In this paper, we explore the potential of using configurable hardware to make application-specific improvements to memory behavior. We envision a library of parameterized memory hierarchy additions that can be invoked to target cases where the cache system does not work well for a particular application.

As fabrication technology improves, more and more transistors fit on a single chip. It seems likely that we will soon see processor chips that include a region of on-chip configurable logic. This configurable logic can clearly have many uses; our work does not preclude configuring the logic for more traditional compute-oriented uses, but simply attempts to explore an alternative use.

In Section 2, we will first discuss the structure of the configurable memory unit that we envision. Following that, in Section 3, we present case studies of applying the ideas of configurable memory to several applications. In Section 4, we discuss some of the hardware organization and implementation issues inherent in our approach. A brief account of related work is included in Section 5, and Section 6 presents some discussion and our conclusions.
2 Configurable Hardware Memory Hierarchies

We believe that configurable logic can result in significant performance improvement by improving average memory access latencies. Our overriding goal is to minimize the number of cache misses that result in accesses to main memory. Researchers have proposed methods that promise performance improvements of up to 3X by reducing cache misses using full-custom hardware [19]. These methods are rarely included in commercial processors, however, because they do not provide performance improvements for a broad enough set of applications. Our current research shows the potential of these approaches using flexible, configurable hardware instead.

Integrated circuits are expected to soon grow to contain over 100 million transistors. As this growth takes place, we must determine ways to best make use of these additional transistors. Rather than simply devoting increased chip areas to increased cache sizes, our research explores other methods for using the transistors to reduce (or better tolerate) memory access latencies.

In current research projects, configurable logic is typically incorporated into the architecture using an attached processor array model. As shown on the left hand side of Figure 1, an accelerator based on FPGAs and dedicated static RAM (SRAM) is attached to the I/O bus of a conventional host processor. The conventional processor and configurable logic array operate asynchronously. The host supplies control signals and monitors results while the logic array processes data obtained from an external source such as a frame buffer. The major problem with this model is the high communication latency between processor and configurable logic, due to their physical and logical separation.

As shown on the right hand side of Figure 1, the expected integration of configurable logic on-chip gives us more flexibility not only in its logical placement within the architecture, but also in its expected uses. In addition, on-chip configurable logic has more flexibility in its connectivity to processors and caches. We can now begin considering uses for configurable logic that are infeasible (because of latency and connectivity constraints) with the attached processor model.

2.1 Logical Placement, Connectivity

The logical placement of the configurable logic is driven by its intended memory optimization function. The right hand side of Figure 1 shows two distinct possibilities, and in fact, each of these configurable blocks may expend their transistor budgets on some combination of configurable gates and associated SRAM. (In this figure, the logical positions for configurable logic are indicated by diamonds labelled "C1" and "C2".)

The configurable logic closest to the processor can detect L1 cache misses and manage prefetch buffers, stream buffers, or a victim cache. Similar functions can be performed by the second block of configurable logic, for the L2 cache. In addition, this logic could observe memory accesses between nodes of a distributed memory multiprocessor, or hold specialized multiprocessor coherence protocols.

Figure 1: Traditional (left) and more forward-looking (right) uses of configurable logic.
In this paper, we focus primarily on two particular optimizations: victim caches and prefetching engines, implemented in configurable hardware. Victim caches are small, fully-associative caches that lie between the cache and main memory. Victim caches primarily aim to reduce conflict misses by caching data that has been evicted from the cache as a result of a miss on another location. Even small (four-entry) victim caches can reduce conflict misses by about 50% and reduce total cache misses by 5-30% [19].

Prefetching engines are hardware structures intended to fetch pieces of data from main memory before references to them occur, in order to hide the large latencies out to main memory. They have been studied in several contexts before [6, 7, 20]. Although such hardware mechanisms have been evaluated via simulation studies, widespread adoption in custom hardware is unlikely, since their benefits vary from application to application. Assuming that future processors will include a block of configurable logic on-chip, this paper uses simulations to re-evaluate these mechanisms within the context of a dynamically customizable memory hierarchy.

For prefetch or stream buffers, the configurable logic must be able to detect a miss from the L1 (or L2) cache, recognize that the address matches the address of data in the prefetch buffer, abort the memory request, and supply the data to the CPU. Simultaneous to servicing the access request, the prefetch controller may initiate a memory request for the next location from which to prefetch. For a victim cache, the logic must once again be able to detect the primary cache miss, recognize the presence of the data in its SRAM, abort the memory request, and supply the data to the CPU. These functions can be easily performed using a small state machine controller.
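For concreteness, the C fragment below sketches the decision sequence such a controller steps through on each L1 miss, written as a software model; the state names, structure fields, and stubbed-out signal functions are illustrative assumptions on our part, not a hardware design from this work.

    /* Sketch of the miss-handling control flow described above.  All types
     * and names are illustrative assumptions for a software model. */
    #include <stdbool.h>
    #include <stdint.h>

    typedef enum { IDLE, PROBE, ABORT_MEM, SUPPLY, PREFETCH_NEXT } ctrl_state_t;

    typedef struct {
        uint32_t tag;        /* line-aligned address held in this entry */
        bool     valid;
    } buf_entry_t;

    #define N_ENTRIES 4
    #define LINE_SIZE 32u

    /* One activation of the controller: called when the L1 cache signals a miss.
     * Returns true if the configurable buffer serviced the miss. */
    static bool handle_l1_miss(buf_entry_t buf[N_ENTRIES], uint32_t miss_addr,
                               bool is_prefetch_buf)
    {
        ctrl_state_t state = PROBE;
        uint32_t line = miss_addr & ~(LINE_SIZE - 1);
        int hit_idx = -1;

        while (state != IDLE) {
            switch (state) {
            case PROBE:                       /* compare all tags (a CAM in hardware) */
                for (int i = 0; i < N_ENTRIES; i++)
                    if (buf[i].valid && buf[i].tag == line) hit_idx = i;
                state = (hit_idx >= 0) ? ABORT_MEM : IDLE;   /* miss: let the memory request proceed */
                break;
            case ABORT_MEM:                   /* cancel the outstanding main-memory request */
                /* abort_memory_request(line);   -- hardware signal, stubbed out */
                state = SUPPLY;
                break;
            case SUPPLY:                      /* drive the line onto the CPU/L1 fill path */
                /* supply_line_to_cpu(&buf[hit_idx]); */
                state = is_prefetch_buf ? PREFETCH_NEXT : IDLE;
                break;
            case PREFETCH_NEXT:               /* prefetch engine only: request the next line */
                buf[hit_idx].tag = line + LINE_SIZE;          /* stride-one assumption */
                /* issue_memory_request(buf[hit_idx].tag); */
                state = IDLE;
                break;
            default:
                state = IDLE;
            }
        }
        return hit_idx >= 0;
    }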
2.2 Logical vs. Physical Placement

It is important to make a distinction between the logical and the physical placement of the configurable logic. The discussion above described the logical placement, i.e., where the configurable logic needs to be positioned in terms of the signals it needs access to. Physically, we anticipate future generations of microprocessors to have a fixed amount of configurable logic which may be used for a variety of tasks including but not limited to accelerating computation and memory accesses. Since each application will have different requirements on where the configurable logic should be placed, it is not reasonable to expect it to be positioned in exactly the positions marked in Figure 1. The key demand that our logical placement makes on configurable logic's physical placement is that of connectivity; there must be signal paths that provide access to the signals required by the logical
usage. The physical placement impacts performance in two ways. First, the interconnect delay may be significant. Second, the interconnect topology may lead to contention of shared resources such as buses. This contention has not been modelled in the initial study presented here.

2.3 Programming Model

In addition to hardware concerns, our configurable memory hierarchies raise software issues as well. Figure 2 gives a flowchart that represents the path a program would go through in order to make use of configurable memory hierarchy additions.

Figure 2: Flowchart of programming model with configurable memory hierarchy additions (source code -> evaluation -> compilation -> code with configuration info -> program run -> result, with a feedback path from the run back to evaluation).

Starting from conventional source code, some performance analysis is needed to determine what types of performance bottlenecks are present. Based on this analysis, configuration directives may be added into the code in order to use particular configurable memory hierarchy additions. The configuration directives may either make use of hardware from a parameterized library of pre-made designs, or they may specify custom-made configurations. Our flowchart also allows for the possibility that these configuration directives will either be hand-inserted into the code by the programmer, or will be automatically inserted as part of the compilation process. At this point, the code is executed, and the memory hierarchy additions are configured as part of the program run. If the code is to be run repeatedly, then the performance statistics about this program run can be used as feedback to modify the configuration for future runs. Clearly, the widespread use of this sort of hardware relies on automating the analysis and choice of configurations as much as possible. In Section 3's preliminary case studies, however, we set the configuration directives manually.
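To give a flavor of what such configuration directives might look like in source code, the sketch below annotates a simple loop with directives drawn from a hypothetical parameterized library; the pragma spelling, its parameters, and the library interface are illustrative assumptions on our part and are not defined by this paper or by any existing compiler.

    /* Hypothetical configuration directives for the programming model above.
     * The pragma names and parameters are illustrative assumptions only. */
    #define N 100

    void scale(double dst[N][N], double src[N][N], double k)
    {
        /* Request a prefetch-buffer configuration from a parameterized library:
         * 2 slots, 4 lines per slot, stride of one 32-byte cache line. */
        #pragma config_mem prefetch_buffer(slots=2, lines_per_slot=4, stride=32)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                dst[i][j] = k * src[i][j];

        #pragma config_mem release   /* free the configurable logic for other uses */
    }

A compiler pass, or the programmer, would translate such directives into the configuration loaded before the program run, following the flowchart of Figure 2.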
3 Application Case Studies

In order to quantitatively evaluate our idea, we present results on three case study applications.

3.1 Configurable Memory Hardware

In the case studies presented here, we assume the configurable memory hardware sits on the same chip as the processor and first-level (L1) cache. This corresponds to location C1 in Figure 1. For these studies, we examine two configurable hierarchy additions separately: a victim cache and a prefetch buffer.
3.1.1 Victim Cache

A victim cache is a fully-associative cache with typically no more than 16 entries. When an L1 cache miss occurs, the referenced line is pulled into the L1 cache and another line currently stored in the same cache line is evicted. The victim cache is a small fully-associative buffer that holds these evicted lines. Using victim caches can have the effect of increasing the set associativity at low hardware cost.

Figure 3 shows a diagram of the victim cache we consider. The address and data lines go to both the L1 cache and the victim cache. On an L1 cache hit, the data is provided directly from the L1 cache, and the configurable hardware is not involved. On an L1 cache miss, the victim cache is probed in parallel with the request to the next level of memory. If the reference hits in the victim cache, the victim cache provides the data to the processor. When the data is pulled into the L1 cache, another line must be evicted from the cache. The victim cache intercepts this line and stores it.

Figure 3: Block diagram of victim cache.
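A minimal software model of this behavior, in the simulation-oriented spirit of Section 3.2, is sketched below; the structure and function names are our own illustrative assumptions rather than the paper's implementation.

    #include <stdbool.h>
    #include <stdint.h>

    #define VC_ENTRIES 4
    #define LINE_SIZE  32u

    typedef struct {
        uint32_t line_addr;   /* line-aligned address of the stored victim line */
        bool     valid;
        unsigned lru;         /* larger value = more recently used */
    } vc_entry_t;

    static vc_entry_t vc[VC_ENTRIES];
    static unsigned   vc_clock;

    /* Probe on an L1 miss: returns true if the victim cache can supply the line. */
    static bool vc_probe(uint32_t addr)
    {
        uint32_t line = addr & ~(LINE_SIZE - 1);
        for (int i = 0; i < VC_ENTRIES; i++) {
            if (vc[i].valid && vc[i].line_addr == line) {
                vc[i].lru = ++vc_clock;      /* hit: refresh LRU, supply the line */
                return true;
            }
        }
        return false;                        /* miss: request proceeds to main memory */
    }

    /* Capture the line evicted from L1 when the missing line is filled. */
    static void vc_insert_victim(uint32_t evicted_line_addr)
    {
        int victim = 0;
        for (int i = 1; i < VC_ENTRIES; i++)             /* replace the LRU entry */
            if (!vc[i].valid || vc[i].lru < vc[victim].lru) victim = i;
        vc[victim].line_addr = evicted_line_addr & ~(LINE_SIZE - 1);
        vc[victim].valid = true;
        vc[victim].lru = ++vc_clock;
    }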
3.1.2 Prefetching Buffer

The goal of a prefetch buffer is to initiate main memory accesses in advance, so that the data will be closer
to the processor when referenced. Prefetching can be especially beneficial when the main memory bandwidth is large, but the access latency is also quite high. The prefetch buffer in our model has several independent "slots". Each slot holds several cache lines and works like a FIFO. It can issue prefetching commands to the main memory. We need several slots because in a typical program, there will be several concurrent data access streams. For example, when a program is performing matrix addition, there will be three access streams for the two source arrays and the destination array. Figure 4 is the schematic of a prefetch buffer with four slots; each slot holds four cache lines of data.

Figure 4: Block diagram of prefetch buffers.

If there is an L1 cache hit, it does not trigger any operation in the prefetch buffer. If there is an L1 miss, the request is sent to the prefetch buffer. If the referenced item is available in one of the slots, this line is pulled into the CPU as well as the L1 cache. The remaining lines move toward the top of the FIFO. The vacancy is filled with data prefetched from main memory. The prefetching engine must determine which line to prefetch next. In the simple case, the memory address for the subsequent line is calculated by incrementing the current address by the cache line size.

If the L1 cache miss also misses in the prefetch buffer, the least recently used slot is designated to begin prefetching the subsequent data following this referenced address. By doing this, we can avoid compulsory cache misses if the initial reference to a memory line is already covered by the prefetching.

To make better use of the slots, we may program them with address ranges and data access strides. For example, when we have a lot of accesses on one array, we can set one slot with the address range of the data array. The slot will only respond to references in this range. If the access pattern has an address increment stride larger than the cache line size, we can assign the stride so we only prefetch the data that is useful.
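The following C sketch models one such programmable slot with an address range and a stride, matching the description above; the interface and field names are illustrative assumptions, not the paper's hardware design.

    #include <stdbool.h>
    #include <stdint.h>

    #define SLOT_DEPTH 4        /* cache lines buffered per slot */
    #define LINE_SIZE  32u

    typedef struct {
        uint32_t base, limit;            /* address range this slot responds to  */
        uint32_t stride;                 /* prefetch stride in bytes (>= 1 line) */
        uint32_t fifo[SLOT_DEPTH];       /* line addresses currently buffered    */
        int      head, count;
        uint32_t next_fetch;             /* next line address to prefetch        */
    } pf_slot_t;

    static bool in_range(const pf_slot_t *s, uint32_t addr)
    {
        return addr >= s->base && addr < s->limit;
    }

    /* Called on an L1 miss.  Returns true if this slot already holds the line. */
    static bool slot_access(pf_slot_t *s, uint32_t miss_addr)
    {
        if (!in_range(s, miss_addr)) return false;
        uint32_t line = miss_addr & ~(LINE_SIZE - 1);

        for (int i = 0; i < s->count; i++) {
            if (s->fifo[(s->head + i) % SLOT_DEPTH] == line) {
                /* Hit: pop entries up to and including the referenced line, then
                 * refill the vacancies by prefetching at the slot's stride. */
                s->head  = (s->head + i + 1) % SLOT_DEPTH;
                s->count -= i + 1;
                while (s->count < SLOT_DEPTH) {
                    s->fifo[(s->head + s->count) % SLOT_DEPTH] = s->next_fetch;
                    s->next_fetch += s->stride;    /* issue_memory_request(...) in hardware */
                    s->count++;
                }
                return true;
            }
        }
        /* Miss in the slot: restart the stream just past the referenced line. */
        s->head = 0; s->count = 0;
        s->next_fetch = line + s->stride;
        while (s->count < SLOT_DEPTH) {
            s->fifo[s->count++] = s->next_fetch;
            s->next_fetch += s->stride;
        }
        return false;
    }

For the matrix multiplication study in Section 3.3.1, one slot of this kind would be given the column-accessed array's address range and a stride equal to one matrix row.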
3.2 Evaluation Methodology

In this study, we use software simulations to estimate the performance for our applications. In particular, we used the TangoLite memory reference generator to perform high-level simulations. TangoLite uses a direct-execution simulation approach. The original program written in C or FORTRAN is compiled and instrumented with calls out to routines that simulate memory behavior and keep statistics.

In our simulation model, we have a single CPU that issues no more than one instruction per cycle. The direct-mapped L1 data cache has a capacity of 8 KB and a line size of 32 bytes. In order to focus
on data references, we have assumed ideal memory performance in the instruction stream.

Our simulations assume a hit in the L1 cache costs one cycle, the same as a simple ALU instruction. If a reference misses in the L1 cache, additional cycles are needed depending on where the data is found. We assume that data found in the configurable cache takes 2 extra cycles and a miss in both L1 and configurable cache will cost 10 extra cycles to go to main memory.

The victim cache we simulate is fully-associative with 4 cache-line-sized entries, updated according to an LRU policy. The prefetch buffer has four slots with 4 cache lines each. Our simulator ignores any additional contention that prefetching may introduce.
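These costs reduce to simple arithmetic over reference counts; the short C functions below restate the model (our own restatement for illustration, not TangoLite code).

    /* Cost model described above: 1-cycle L1 hit, +2 cycles for a
     * configurable-cache hit, +10 cycles for an access to main memory. */
    typedef struct {
        unsigned long l1_hits;        /* references that hit in the L1 cache       */
        unsigned long config_hits;    /* L1 misses serviced by the configurable HW */
        unsigned long mem_accesses;   /* L1 misses that go all the way to memory   */
    } ref_counts_t;

    /* Extra cycles spent in the memory hierarchy beyond the 1-cycle hits. */
    unsigned long memory_latency_cycles(const ref_counts_t *c)
    {
        return 2ul * c->config_hits + 10ul * c->mem_accesses;
    }

    /* Total time = base cycles (instructions, incl. 1-cycle data hits) + memory latency. */
    unsigned long execution_time_cycles(unsigned long base_cycles, const ref_counts_t *c)
    {
        return base_cycles + memory_latency_cycles(c);
    }

As a rough check, the original-cache column of Table 1 implies roughly 2 million data references; 2.0 x 10^6 x 0.21 x 10 gives about 4.2 x 10^6 cycles of memory latency, which together with about 2.0 x 10^6 base cycles matches the 6.2 x 10^6-cycle execution time reported there.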
3.3 Per-Application Results

With that background, we will now look in more detail at three applications. Our goal is to see the potential performance improvement available by implementing memory hierarchy additions in configurable hardware. The three applications studied include two kernel applications known for their poor cache performance (Matrix Multiplication and Fast Fourier Transform) as well as a more substantial program: the Tomcatv benchmark from the SPEC92 suite [9].
3.3.1 Matrix Multiplication

In the first example, we study a matrix multiplication program multiplying two 100 by 100 matrices. The elements are double-precision floating point. This means that a total of 80000 bytes are needed for each matrix. This greatly exceeds the size of the L1 cache. Clearly the calculation involves two source matrices and one destination matrix. As shown in Figure 5, one of the source matrices is accessed in sequential column order while the other is accessed in sequential row order.
Figure 5: Memory access pattern for arrays in matrix multiplication.
Results. The simulated performance of 100x100 matrix multiplication is summarized in Table 1. The standard cache does not perform well both because of the size of the data set and also because one of the arrays is accessed down the columns. Overall, the data miss rate is 21%.

The second column of the table shows the new results obtained when a configurable victim cache is used. In this case, some of the conflict misses can be avoided. The overall performance is only slightly better than in the original case, though. The third column of the table shows the results when a configurable prefetch engine is used. In this "stride-one prefetching," each prefetch slot always attempts to prefetch the next cache line. This prefetching allows the program to avoid some compulsory misses, but does not dramatically improve overall performance. The difficulty here is that one of the arrays has a columnar memory access pattern that does not benefit from the stride-one prefetching.

Since simple stride-one prefetching is not fully effective, we also investigated allowing the prefetching engine to adjust its prefetch stride on a per-slot basis. The fourth column shows the results of this experiment. Here, we have allocated one slot to each matrix. For the source matrix that accesses elements across a row, the prefetch stride is still one cache line. For the source matrix whose elements are accessed down the columns, we set the stride to be the size of one row in the matrix. Every reference to an address within the matrix will only update the prefetch buffer for its matrix. This technique yields much better performance. The miss rate is reduced by 15% and (as shown in Figure 6) the total program execution time is reduced to 61% of the original.
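The loop below shows the access patterns in question for a C formulation of the kernel, with the per-slot settings of the column-prefetch experiment noted in comments; the configuration syntax is hypothetical, and only the stride choices (one cache line for the row-order array, one matrix row for the column-order array) are taken from the experiment described above.

    #define N 100                 /* 100 x 100 double matrices, 80000 bytes each */

    /* In this C version, a[i][k] is read across a row (unit stride), while
     * b[k][j] is read down a column (stride of one matrix row, N*8 bytes). */
    void matmul(double c[N][N], const double a[N][N], const double b[N][N])
    {
        /* Hypothetical per-slot configuration for the "Column Prefetch" case:
         *   slot 0: address range of a, stride = 32 bytes (one cache line)
         *   slot 1: address range of b, stride = N * sizeof(double) = 800 bytes (one row)
         */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                double sum = 0.0;
                for (int k = 0; k < N; k++)
                    sum += a[i][k] * b[k][j];
                c[i][j] = sum;
            }
    }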
3.3.2 Fast Fourier Transformation

The next example we examine is Fast Fourier Transformation (FFT), a commonly used algorithm in signal processing. FFT's access pattern is not as regular as that of other matrix computations; the butterfly memory access pattern is shown in Figure 7 for an 8-point radix-2 FFT. In this figure, the data is prepared in bit-reversed order and the result is in normal order. The calculation is done in-place, i.e. only one data array is used for source, intermediate and destination. The computation is done in dual pairs as follows:

    X_{m+1}(p) = X_m(p) + W_N^r X_m(q)

where

    W_N = e^{-i(2π/N)}

is the coefficient which is precomputed in our program. The subscript of X stands for the stage of computation and the number in brackets stands for the position in the data array.

                                  Orig.    Victim   Stride-One   Column
                                  Cache    Cache    Prefetch     Prefetch
    L1 cache hit rate              79%      79%        79%         79%
    Config. cache hit rate          -        2%         2%         15%
    Miss rate                      21%      19%        19%          7%
    Memory Lat. (10^6 cycles)      4.2      3.9        3.9         1.7
    Exec. Time (10^6 cycles)       6.2      5.9        5.9         3.8

Table 1: Performance results for matrix multiplication.
Figure 6: Execution times with different configurable hierarchy additions, normalized to execution time with original cache hierarchy.
For this paper, we examine a 1024-point complex FFT. Since there are 1024 points and each point needs 2 doubles, a data array of 16 KB is involved. The corresponding coefficients are precalculated and stored as an 8 KB array. The calculation requires 10 stages. The whole data array is referenced in each stage and the memory access is in dual pairs as shown in Figure 7.
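For reference, the loop below is a generic in-place radix-2 butterfly stage written in C; it is our own illustrative code rather than the program that was simulated, but it shows the property that matters for the memory system: each butterfly touches the dual pair data[p] and data[q] = data[p + half], and the pair distance, and hence the access stride, doubles from stage to stage.

    #include <complex.h>

    /* One stage of an in-place radix-2 FFT over n complex points (illustrative
     * only).  'half' doubles each stage, so the distance between the two
     * elements of a dual pair grows from 1 element up to n/2 elements across
     * the log2(n) stages. */
    static void fft_stage(double complex *data, int n, int half,
                          const double complex *w /* precomputed coefficients */)
    {
        for (int start = 0; start < n; start += 2 * half) {
            for (int r = 0; r < half; r++) {
                int p = start + r, q = p + half;
                double complex t = w[r * (n / (2 * half))] * data[q];
                data[q] = data[p] - t;          /* the "dual pair" update */
                data[p] = data[p] + t;
            }
        }
    }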
                                  Orig.    Victim   Stride-One
                                  Cache    Cache    Prefetch
    L1 cache hit rate              78%      78%        78%
    Config. cache hit rate          -        6%        14%
    Miss rate                      22%      16%         8%
    Memory Lat. (10^3 cycles)       96       76         47
    Exec. time (10^3 cycles)       293      273        244

Table 2: Performance results for FFT.

Figure 7: Butterfly access pattern in FFT.
For an 8 KB direct-mapped cache, the misses come from conflicts between the data and coefficient arrays.
Results. Table 2 shows the simulation results for FFT. As before, the first column shows the results with the original memory hierarchy. Here, the data miss rate is 22%. Unlike in matrix multiply, the victim cache is fairly effective here in improving FFT's memory performance. The program's miss rate drops to 16%. This is due to the significant data reuse in loops and the reduction of conflict misses.

With stride-one prefetching the program's performance is improved even more. This is because of the good spatial locality and the sequential access pattern of computing the dual pairs. The miss rate in this case is reduced by 14% and the total execution time is reduced to 83% of the original. In each stage, there is one reference sequence (i.e. one slot is used) if the dual pair stride is less than a cache line. In later stages, the dual-pair stride will be more than four cache lines (the size of one prefetch slot), and in those cases, two slots are used. In intermediate stages (i.e. the stride falls between one line and four lines) the access to the dual pair may cause the prefetch buffer to be updated prematurely, and useful data already in the prefetch buffer is lost. More sophisticated prefetching may avoid this problem but may be too complex to be implemented by reconfigurable hardware.
3.3.3 Tomcatv

Tomcatv, a SPECfp92 benchmark, is a vectorized mesh generation program written in Fortran. There are 7 data arrays used in the program. Each array is a 257 by 257 matrix of double precision floating point. Thus, the data arrays take about 3.7 MB in total, and are far larger than the cache size. The arrays are mainly accessed sequentially in row order and data elements are reused several times in each loop. Due to the large active data set, however, the miss rate is still quite high.
                                  Orig.    Victim   Stride-One
                                  Cache    Cache    Prefetch
    L1 cache hit rate              67%      67%        67%
    Config. cache hit rate          -       19%        8.3%
    Miss rate                      33%      14%        25%
    Memory Lat. (10^6 cycles)      912      497        739
    Exec. Time (10^6 cycles)      1313      898       1130

Table 3: Performance results for tomcatv.
Results. Table 3 shows the results for tomcatv. Although the active data set is very large, there is a good deal of spatial and temporal locality in the memory accesses. Thus, it may be somewhat surprising that the original miss rate is as high as 33%. This high miss rate is due to a large number of conflict misses in the direct-mapped cache. As shown in the second column of the table, the victim cache gives very good results on this application by greatly reducing the conflict misses. The total miss rate drops to 14% and execution time is reduced to 69% of its original value.

The prefetch buffer, however, does not do well

technology similar to that found in a Xilinx XC4000-series FPGA [24]. This class of FPGAs uses 4-input lookup tables to implement logic functions. In 4000-series FPGAs, a configurable logic block (CLB) includes two 4-input LUTs. When these two are combined, any 5-input function and some 9-input functions can be implemented. Each LUT can be used as a 16x1 bit SRAM as well. In addition to the lookup tables, each CLB has two registers that can be used independently of the LUTs. Based on this information, Table 4 shows the number of CLBs required to implement certain basic functions. Although our discussion gives calculations on a per-hardware-structure basis, these estimates compare well with CLB estimates given by automated synthesis from VHDL down to Xilinx parts.

    Function             Number of CLBs
    4:1 multiplexer            1
    16:1 multiplexer           5
    16x2 bit memory            1
    4 bit comparator           1

Table 4: CLB counts for common logic functions.
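As an example of this kind of per-structure arithmetic, the snippet below combines the Table 4 primitive counts into a back-of-envelope estimate for one piece of a victim cache; the tag width and the structure being sized are our own illustrative assumptions, not figures from this paper.

    /* Back-of-envelope CLB arithmetic using only the primitive counts in
     * Table 4; the 24-bit tag width and four-entry victim cache sized here
     * are illustrative assumptions. */
    #include <stdio.h>

    int main(void)
    {
        const int clb_per_4bit_cmp = 1;    /* from Table 4            */
        const int tag_bits = 24;           /* assumed tag width        */
        const int entries  = 4;            /* four-entry victim cache  */

        int cmp_per_entry = (tag_bits + 3) / 4 * clb_per_4bit_cmp;   /* 6 CLBs  */
        int total = cmp_per_entry * entries;                         /* 24 CLBs */

        printf("tag comparators: %d CLBs per entry, %d CLBs total\n",
               cmp_per_entry, total);
        return 0;
    }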
in this case becaus
