`
I, Rachel J. Watters, am a librarian, and the Director of Wisconsin TechSearch (“WTS”), located at 728 State Street, Madison, Wisconsin, 53706. WTS is an interlibrary loan department at the University of Wisconsin-Madison. I have worked as a librarian at the University of Wisconsin library system since 1998. I have been employed at WTS since 2002, first as a librarian and, beginning in 2011, as the Director. Through the course of my employment, I have become well informed about the operations of the University of Wisconsin library system, which follows standard library practices.
`
`This Declaration relates to the dates of receipt and availability of the following:
`
Zhong, P., & Martonosi, M. (1996, October). Using
reconfigurable hardware to customize memory hierarchies. In
High-Speed Computing, Digital Signal Processing, and Filtering
Using Reconfigurable Logic (Vol. 2914, pp. 237-248).
International Society for Optics and Photonics.
`
Standard operating procedures for materials at the University of Wisconsin-Madison Libraries. When a volume was received by the Library, it would be checked in, stamped with the date of receipt, added to library holdings records, and made available to readers as soon after its arrival as possible. The procedure normally took a few days or at most 2 to 3 weeks.
`
Exhibit A to this Declaration is a true and accurate copy of the title page with library date stamp of High-Speed Computing, Digital Signal Processing, and Filtering Using Reconfigurable Logic (1996), from the University of Wisconsin-Madison Library
`
`
collection. Exhibit A also includes an excerpt of pages 237 to 248 of that volume, showing the article entitled Using reconfigurable hardware to customize memory hierarchies (1996). Based on this information, the date stamp on the volume cover page indicates that Using reconfigurable hardware to customize memory hierarchies (1996) was received by University of Wisconsin-Madison Libraries on December 12, 1996.

Based on the information in Exhibit A, it is clear that the volume was received by the library on or before December 12, 1996, and was catalogued and available to library patrons within a few days or at most 2 to 3 weeks after December 12, 1996.
`
`I declare that all statements made herein of my own knowledge are true and that
`
`all statements made on information and belief are believed to be true; and further that
`
`these statements were made with the knowledge that willful false statements and the like
`
`so made are punishable by fine or imprisonment, or both, under Section 1001 of Title 18
`
`of the United States Code.
`
Date: October 2, 2018

Rachel J. Watters
Director, Wisconsin TechSearch

Memorial Library
728 State Street
Madison, Wisconsin 53706
`
`
`
`
TK 7895
PROCEEDINGS
1996
© SPIE—The International Society for Optical Engineering
`
`High-Speed Computing, Digital
`Signal Processing, and Filtering
`Using Reconfigurable Logic
`
`John Schewel
`Peter M. Athanas
`
`V. Michael Bove, Jr.
`John Watson
`Chairs/Editors
`
20-21 November 1996
`
`Boston, Massachusetts
`
`
`
`Volume 2914
`
`
`
`
`
`
`PROCEEDINGS
© SPIE—The International Society for Optical Engineering
`
`High-Speed Computing, Digital
`Signal Processing, and Filtering
`Using Reconfigurable Logic
`
`John Schewel
`Peter M. Athanas
`
`V. Michael Bove, Jr.
`John Watson
`Chairs/Editors
`
20-21 November 1996
`
`Boston, Massachusetts
`
Sponsored and Published by
SPIE—The International Society for Optical Engineering
`
`
`
`Volume 2914
`
`SPIE is an international technical society dedicated to advancing engineering and scientific
`applications of optical, photonic, imaging, electronic, and optoelectronic technologies.
`
`
`
`
`
`
`
`
`
`
`The papers appearing in this book comprise the proceedings of the meeting mentioned on the
`cover and title page. They reflect the authors’ opinions and are published as presented and
`without change, in the interests of timely dissemination. Their inclusion in this publication does
`not necessarily constitute endorsement by the editors or by SPIE.
`
Please use the following format to cite material from this book:
Author(s), “Title of paper,” in High-Speed Computing, Digital Signal Processing and Filtering
Using Reconfigurable Logic, John Schewel, Peter M. Athanas, V. Michael Bove, Jr., John
Watson, Editors, Proc. SPIE 2914, page numbers (1996).
`
Library of Congress Catalog Card No. 96-69766
ISBN 0-8194-2316-5
`
Published by
SPIE—The International Society for Optical Engineering
P.O. Box 10, Bellingham, Washington 98227-0010 USA
Telephone 360/676-3290 (Pacific Time) • Fax 360/647-1445
`
Copyright © 1996, The Society of Photo-Optical Instrumentation Engineers.
`
Copying of material in this book for internal or personal use, or for the internal or personal use
of specific clients, beyond the fair use provisions granted by the U.S. Copyright Law is
authorized by SPIE subject to payment of copying fees. The Transactional Reporting Service
base fee for this volume is $6.00 per article (or portion thereof), which should be paid directly
to the Copyright Clearance Center (CCC), 222 Rosewood Drive, Danvers, MA 01923. Payment
may also be made electronically through CCC Online at http://www.directory.net/copyright/.
Other copying for republication, resale, advertising or promotion, or any form of systematic or
multiple reproduction of any material in this book is prohibited except with permission in
writing from the publisher. The CCC fee code is 0-8194-2316-5/96/$6.00.
`
`Printed in the United States of America.
`
`
`
`
`Using Reconfigurable Hardware to Customize Memory Hierarchies
`
`Peixin Zhong and Margaret Martonosi
`Department of Electrical Engineering
`Princeton University
`Princeton, NJ 08544-5263
{pzhong, martonosi}@ee.princeton.edu
`
Abstract
`
`Over the past decade or more, processor speeds have
`increased much more quickly than memory speeds.
`As a result, a large, and still increasing, processor-
`memory performance gap has formed. Many signifi-
`cant applications suffer from substantial memory bot-
`tlenecks, and their memory performance problems are
`often either too unusual or extreme to be mitigated by
`cache memories alone. Such specialized performance
`“bugs” require specialized solutions, but it is impos-
sible to provide case-by-case memory hierarchies or
`caching strategies on general-purpose computers.
`We have investigated the potential of implementing
`mechanisms like victim caches and prefetch buffers in
`reconfigurable hardware to improve application mem-
`ory behavior. Based on technology and commercial
`trends, our simulation-based studies use a forward-
`looking model in which configurable logic is located
`on the CPU chip. Given such assumptions, our re-
`sults show that the flexibility of being able to special-
`ize configurable hardware to an application’s memory
`referencing behavior more than balances the slightly
`slower response times of configurable memory hier-
`archy structures. For our three applications, small,
`specialized memory hierarchy additions such as vic-
tim caches and prefetch buffers can reduce miss rates
`substantially and can drop total execution times for
`these programs to between 60 and 80% of their orig-
`inal execution times. Our results also indicate that
`different memory specializations may be most effec-
`tive for each application; this highlights the useful-
`ness of configurable memory hierarchies that are spe-
`cialized on a per-application basis.
Keywords: memory latency, configurable computing, victim cache, prefetching.

1 Introduction
`
`Due to rapid increases in microprocessor speeds, the
`performance gap between processors and main mem-
`ory is widening. Cache memories are typically used in
`computer systems to bridge this performance gap and
`reduce the average memory access time. Although
`caches work well in many cases, they may still fail to
`provide high performance for certain applications.
`Several hardware and software techniques have
`been proposed to improve cache performance in such
`cases. For example, prefetching techniques aim to
`hide the large latency out to main memory by bring-
ing data to the cache before it is referenced. Vic-
`tim caches attempt to reduce conflict misses in low-
`associativity caches. These hardware techniques have
`variable results depending on the application’s mem-
`ory referencing behavior. Their disadvantage is that
`they represent wasted transistor space on the CPU
`chip for those applications where they are ineffec-
tive. On the other hand, the drawback to purely software-based techniques (such as blocked matrix accesses or compiler-inserted prefetching directives) is that it can be difficult to statically analyze a program's memory behavior and determine when such techniques will be useful. For these reasons, this paper explores implementing memory hierarchy additions in programmable hardware.
`Programmable logic, such as field-programmable
`gate arrays (FPGAs), has gained tremendous popu-
`larity in the past decade. Programmable logic is pop-
`ular because a given chip’s behavior can be configured
and customized for different functions during differ-
`ent sessions. Customization on a per-application ba-
`sis is feasible because the reconfiguration process is
`fast and can be done with the device in the system.
`Some FPGAs can even be partially reconfigured while
`the rest of the device is in use.
`
`
`
`
The configurability of FPGAs makes them heavily used for prototyping, but they have also been used to build high-performance application-specific compute engines [12]. In addition, there has been work (such as PRISC [21] and PRISM [2]) on supplementing conventional processors with configurable coprocessors to accelerate performance for different applications.

Thus far, most configurable computing projects have focused heavily on computation as opposed to data access. In this paper, we explore the potential of using configurable hardware to make application-specific improvements to memory behavior. We envision a library of parameterized memory hierarchy additions that can be invoked to target cases where the cache system does not work well for a particular application.

As fabrication technology improves, more and more transistors fit on a single chip. It seems likely that we will soon see processor chips that include a region of on-chip configurable logic. This configurable logic can clearly have many uses; our work does not preclude configuring the logic for more traditional compute-oriented uses, but simply attempts to explore an alternative use.

In Section 2, we will first discuss the structure of the configurable memory unit that we envision. Following that, in Section 3, we present case studies of applying the ideas of configurable memory to several applications. In Section 4, we discuss some of the hardware organization and implementation issues inherent in our approach. A brief account of related work is included in Section 5, and Section 6 presents some discussion and our conclusions.

2 Configurable Hardware In Memory Hierarchies

We believe that configurable logic can result in significant performance improvement by improving average memory access latencies. Our overriding goal is to minimize the number of cache misses that result in accesses to main memory. Researchers have proposed methods that promise performance improvements of up to 3X by reducing cache misses using full-custom hardware [19]. These methods are rarely included in commercial processors, however, because they do not provide performance improvements for a broad enough set of applications. Our current research shows the potential of these approaches using flexible, configurable hardware instead.

Integrated circuits are expected to soon grow to contain over 100 million transistors. As this growth takes place, we must determine ways to best make use of these additional transistors. Rather than simply devoting increased chip areas to increased cache sizes, our research explores other methods for using the transistors to reduce (or better tolerate) memory access latencies.

In current research projects, configurable logic is typically incorporated into the architecture using an attached processor array model. As shown on the left hand side of Figure 1, an accelerator based on FPGAs and dedicated static RAM (SRAM) is attached to the I/O bus of a conventional host processor. The conventional processor and configurable logic array operate asynchronously. The host supplies control signals and monitors results while the logic array processes data obtained from an external source such as a frame buffer. The major problem with this model is the high communication latency between processor and configurable logic, due to their physical and logical separation.

As shown on the right hand side of Figure 1, the expected integration of configurable logic on-chip gives us more flexibility not only in its logical placement within the architecture, but also in its expected uses. In addition, on-chip configurable logic has more flexibility in its connectivity to processors and caches. We can now begin considering uses for configurable logic that are infeasible (because of latency and connectivity constraints) with the attached processor model.

2.1 Logical Placement, Connectivity

The logical placement of the configurable logic is driven by its intended memory optimization function. The right hand side of Figure 1 shows two distinct possibilities, and in fact, each of these configurable blocks may expend their transistor budgets on some combination of configurable gates and associated SRAM. (In this figure, the logical positions for configurable logic are indicated by diamonds labelled “C1” and “C2”.)

The configurable logic closest to the processor can detect L1 cache misses and manage prefetch buffers, stream buffers, or a victim cache. Similar functions can be performed by the second block of configurable logic, for the L2 cache. In addition, this logic could observe memory accesses between nodes of a distributed memory multiprocessor, or hold specialized multiprocessor coherence protocols.
`
`
`
`
`
`Figure 1: Traditional (left) and more forward-looking (right) uses of configurable logic.
`
In this paper, we focus primarily on two particular optimizations: victim caches and prefetching engines, implemented in configurable hardware. Victim caches are small, fully-associative caches that lie between the cache and main memory. Victim caches primarily aim to reduce conflict misses by caching data that has been evicted from the cache as a result of a miss on another location. Even small (four-entry) victim caches can reduce conflict misses by about 50% and reduce total cache misses by 5-30% [19].
`Prefetching engines are hardware structures in-
`tended to fetch pieces of data from main memory
`before references to them occur, in order to hide the
`large latencies out to main memory. They have been
studied in several contexts before [6, 7, 20]. Although such hardware mechanisms have been evaluated via simulation studies, widespread adoption in custom hardware is unlikely, since their benefits vary from application to application. Assuming that future processors will include a block of configurable logic on-chip, this paper uses simulations to re-evaluate these mechanisms within the context of a dynamically customizable memory hierarchy.
For prefetch or stream buffers, the configurable logic must be able to detect a miss from the L1 (or L2)
`cache, recognize that the address matches the address
`of data in the prefetch buffer, abort the memory re-
`
`quest, and supply the data to the CPU. Simultaneous
`to servicing the access request, the prefetch controller
`may initiate a memory request for the next location
`from which to prefetch. For a victim cache, the logic
`must once again be able to detect the primary cache
`miss, recognize the presence of the data in its SRAM,
`abort the memory request, and supply the data to the
`CPU. These functions can be easily performed using
`a small state machine controller.
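To make this control flow concrete, the following C fragment sketches the decision such a state-machine controller makes on each miss. This is a minimal sketch under our own assumptions; the buffer structure and the abort/supply hooks are illustrative stand-ins rather than an actual hardware interface.

#include <stdio.h>

#define BUF_LINES 16

typedef struct {
    unsigned long tag[BUF_LINES];   /* line addresses held in the buffer */
    int           valid[BUF_LINES];
} cfg_buffer;

/* Illustrative hooks standing in for hardware actions. */
static void abort_memory_request(unsigned long line) { printf("abort fetch of line %#lx\n", line); }
static void supply_line_to_cpu(unsigned long line)   { printf("supply line %#lx from buffer\n", line); }

/* On an L1 (or L2) miss, probe the buffer in parallel with the memory request. */
void service_miss(const cfg_buffer *b, unsigned long line)
{
    for (int i = 0; i < BUF_LINES; i++) {
        if (b->valid[i] && b->tag[i] == line) {
            abort_memory_request(line);   /* buffer hit: cancel the in-flight fetch */
            supply_line_to_cpu(line);     /* and return the data from the buffer    */
            return;
        }
    }
    /* Buffer miss: the outstanding memory request simply completes. */
}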
`
`2.2 Logical vs. Physical Placement
`
`It is important to make a distinction between the log-
`ical and the physical placement of the configurable
`logic. The discussion above described the logical
`placement, i.e. where the configurable logic needs to
`be positioned in terms of the signals it needs access
`to. Physically, we anticipate future generations of mi-
`croprocessors to have a fixed amount of configurable
`logic which may be used for a variety of tasks includ-
`ing but not limited to accelerating computation and
`memory accesses. Since each application will have
`different requirements on where the configurable logic
`should be placed, it is not reasonable to expect it to
`be positioned in exactly the positions marked in Fig-
`ure 1. The key demand that our logical placement
`makes on configurable logic’s physical placement is
`that of connectivity; there must be signal paths that
`provide access to the signals required by the logical
`
`
`
`
usage. The physical placement impacts performance in two ways. First, the interconnect delay may be significant. Second, the interconnect topology may lead to contention of shared resources such as buses. This contention has not been modelled in the initial study presented here.
`
2.3 Programming Model
`
In addition to hardware concerns, our configurable memory hierarchies raise software issues as well. Figure 2 gives a flowchart that represents the path a program would go through in order to make use of configurable memory hierarchy additions.
`
Figure 2: Flowchart of programming model with configurable memory hierarchy additions.
`
Starting from conventional source code, some performance analysis is needed to determine what types of performance bottlenecks are present. Based on this analysis, configuration directives may be added into the code in order to use particular configurable memory hierarchy additions. The configuration directives may either make use of hardware from a parameterized library of pre-made designs, or they may specify custom-made configurations. Our flowchart also allows for the possibility that these configuration directives will either be hand-inserted into the code by the programmer, or will be automatically inserted as part of the compilation process. At this point, the code is executed, and the memory hierarchy additions are configured as part of the program run. If the code is to be run repeatedly, then the performance statistics about this program run can be used as feedback to modify the configuration for future runs. Clearly, the widespread use of this sort of hardware relies on automating the analysis and choice of configurations as much as possible. In Section 3's preliminary case studies, however, we set the configuration directives manually.
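The flowchart leaves the directive notation unspecified; purely as a hypothetical illustration of the "add directives" step, annotations of roughly this flavor could bind pre-made designs from the parameterized library to a loop:

/* Invented directive syntax, for illustration only; no such pragmas exist,
   and conforming C compilers ignore unknown pragmas. */
double a[1024], b[1024], c[1024];

void vector_add(void)
{
    /* Request a 4-entry victim cache plus one stride-one prefetch slot. */
    #pragma config_mem victim_cache(entries = 4)
    #pragma config_mem prefetch(slot = 0, range = a, stride = 32)
    for (int i = 0; i < 1024; i++)
        c[i] = a[i] + b[i];
}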
`
3 Application Case Studies

In order to quantitatively evaluate our idea, we present results on three case study applications.
`
`3.1 Configurable Memory Hardware
`
In the case studies presented here, we assume the configurable memory hardware sits on the same chip as the processor and first-level (L1) cache. This corresponds to location C1 in Figure 1. For these studies, we examine two configurable hierarchy additions separately: a victim cache and a prefetch buffer.
`
`3.1.1 Victim Cache
`
`A victim cache is a fully-associative cache with typ-
`ically no more than 16 entries. When an L1 cache
`miss occurs, the referenced line is pulled into the L1
`cache and another line currently stored in the same
`cache line is evicted. The victim cache is a small
`fully-associative buffer that holds these evicted lines.
`Using victim caches can have the effect of increasing
`the set associativity at low hardware cost.
`
Figure 3 shows a diagram of the victim cache we consider. The address and data lines go to both the L1 cache and the victim cache. On an L1 cache hit, the data is provided directly from the L1 cache, and the configurable hardware is not involved. On an L1 cache miss, the victim cache is probed in parallel with the request to the next level of memory. If the reference hits in the victim cache, the victim cache provides the data to the processor. When the data is pulled into the L1 cache, another line must be evicted from the cache. The victim cache intercepts this line and stores it.
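A minimal C model of this probe-and-intercept behavior follows; it is our own sketch, sized to the 4-entry LRU configuration simulated in Section 3.2, with illustrative structure and function names.

#define VC_ENTRIES 4   /* small, fully-associative, LRU replacement */

typedef struct {
    unsigned long tag[VC_ENTRIES];  /* line address of each held victim */
    int           age[VC_ENTRIES];  /* 0 = most recently used           */
    int           valid[VC_ENTRIES];
} victim_cache;

static void vc_touch(victim_cache *vc, int idx)
{
    for (int i = 0; i < VC_ENTRIES; i++)
        if (vc->valid[i] && i != idx)
            vc->age[i]++;           /* every other entry gets older */
    vc->age[idx] = 0;               /* idx is now the most recent   */
}

/* Probe on an L1 miss: returns 1 if the line can be supplied to the CPU. */
int vc_probe(victim_cache *vc, unsigned long line)
{
    for (int i = 0; i < VC_ENTRIES; i++)
        if (vc->valid[i] && vc->tag[i] == line) {
            vc_touch(vc, i);
            return 1;   /* hit: abort the parallel request to memory */
        }
    return 0;           /* miss: the memory request proceeds         */
}

/* Intercept a line evicted from L1, displacing the least recently used victim. */
void vc_insert(victim_cache *vc, unsigned long evicted_line)
{
    int lru = 0;
    for (int i = 0; i < VC_ENTRIES; i++) {
        if (!vc->valid[i]) { lru = i; break; }     /* use a free entry first */
        if (vc->age[i] > vc->age[lru]) lru = i;    /* otherwise evict oldest */
    }
    vc->tag[lru]   = evicted_line;
    vc->valid[lru] = 1;
    vc_touch(vc, lru);
}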
`
`3.1.2 Prefetching Buffer
`
The goal of a prefetch buffer is to initiate main memory accesses in advance, so that the data will be closer to the processor when referenced.
`
`
`
`
`
`
`
`
`
Figure 3: Block diagram of victim cache.
`
`
Prefetching can be especially beneficial when the main memory bandwidth is large, but the access latency is also quite high. The prefetch buffer in our model has several independent "slots". Each slot holds several cache lines and works like a FIFO. It can issue prefetching commands to the main memory. We need several
`slots because in a typical program, there will be sev-
`eral concurrent data access streams. For example,
`when a program is performing matrix addition, there
`will be three access streams for the two source arrays
`and the destination array. Figure 4 is the schematic
`of a prefetch buffer with four slots; each slot holds
`four cache lines of data.
`If there is an L1 cache hit, it does not trigger any
`operation in the prefetch buffer.
`If there is an L1
`miss, the request is sent to the prefetch buffer.
`If
`the referenced item is available in one of the slots,
`this line is pulled into the CPU as well as the L1
`cache. The remaining lines move toward the top of
`the FIFO. The vacancy is filled with data prefetched
`from main memory. The prefetching engine must de-
`termine which line to prefetch next.
`In the simple
`case, the memory address for the subsequent line is
`calculated by incrementing the current address by the
`cache line size.
`
If the L1 cache miss also misses in the prefetch buffer, the least recently used slot is designated to begin prefetching the subsequent data following this referenced address. By doing this, we can avoid compulsory cache misses if the initial reference to a memory line is already covered by the prefetching.

Figure 4: Block diagram of prefetch buffers.
`To make better use of the slots, we may program
`them with address ranges and data access strides. For
`example, when we have a lot of accesses on one array,
`we can set one slot with the address range of the data
`array. The slot will only respond to references in this
`range. If the access pattern has an address increment
stride larger than the cache line size, we can assign
`the stride so we only prefetch the data that is useful.
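The per-slot matching and refill logic described above can be sketched in C as follows (our own illustration: the field names are assumptions, and each slot holds a four-line FIFO as in Figure 4).

#define SLOT_LINES 4    /* each slot holds four cache lines (Figure 4) */
#define LINE_SIZE  32   /* bytes, matching the simulated L1 line size  */

typedef struct {
    unsigned long base, limit;       /* address range this slot responds to */
    unsigned long stride;            /* bytes between successive prefetches */
    unsigned long fifo[SLOT_LINES];  /* line addresses currently buffered   */
} prefetch_slot;

/* A slot programmed with an address range responds only to misses inside it. */
int slot_matches(const prefetch_slot *s, unsigned long addr)
{
    return addr >= s->base && addr < s->limit;
}

/* After the head line is consumed by the CPU, shift the FIFO up and issue a
   prefetch one stride beyond the last line currently held in the slot. */
void slot_refill(prefetch_slot *s)
{
    unsigned long next = s->fifo[SLOT_LINES - 1] + s->stride;
    for (int i = 1; i < SLOT_LINES; i++)
        s->fifo[i - 1] = s->fifo[i];
    s->fifo[SLOT_LINES - 1] = next;  /* the prefetch command targets 'next' */
}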
`
`3.2 Evaluation Methodology
`
`In this study, we use software simulations to estimate
`the performance for our applications. In particular,
`we used the TangoLite memory reference generator
`to perform high-level simulations. TangoLite uses a
`direct-execution simulation approach. The original
`program written in C or FORTRAN is compiled and
`instrumented with calls out to routines that simulate
`memory behavior and keep statistics.
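We do not reproduce TangoLite's actual interface here, but the direct-execution idea can be illustrated with a hypothetical hook inserted before each data reference; sim_load() below is an invented name for the routine that models the reference and updates hit/miss statistics.

/* Hypothetical simulator hook (invented signature, for illustration). */
void sim_load(unsigned long addr, unsigned long size);

/* Original loop:  for (i = 0; i < n; i++) sum += a[i]; */
double instrumented_sum(const double *a, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        sim_load((unsigned long)&a[i], sizeof a[i]);  /* model the reference      */
        sum += a[i];                                  /* then execute it directly */
    }
    return sum;
}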
`In our simulation model, we have a single CPU
`that issues no more than one instruction per cycle.
`The direct-mapped L1 data cache has a capacity of
`8 KB and a line size of 32 bytes.
`In order to focus
`
`
`
`
`
`
`on data references, we have assumed ideal memory
`performance in the instruction stream.
`Our simulations assume a hit in the L1 cache costs
`one cycle, the same as a simple ALU instruction. If a
`reference misses in the L1 cache, additional cycles are
`needed depending on where the data is found. We as-
`sume that data found in the configurable cache takes
`2 extra cycles and a miss in both L1 and configurable
`cache will cost 10 extra cycles to go to main memory.
The victim cache we simulate is fully-associative
with 4 cache-line-sized entries, updated according to
`an LRU policy. The prefetch buffer has four slots
`with 4 cache lines each. Our simulator ignores any
`additional contention that prefetching may introduce.
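Collecting these costs, a per-reference accounting function consistent with the stated assumptions might look like the following sketch (the names are ours).

enum ref_result { L1_HIT, CONFIG_HIT, MEMORY_MISS };

/* Cycles charged for one data reference under the simulation model above. */
int reference_cycles(enum ref_result r)
{
    switch (r) {
    case L1_HIT:     return 1;       /* same cost as a simple ALU instruction      */
    case CONFIG_HIT: return 1 + 2;   /* hit in the victim cache or prefetch buffer */
    default:         return 1 + 10;  /* miss in both: go out to main memory        */
    }
}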
`
`3.3 Per-Application Results
`
`With that background, we will now look in more de-
`tail at three applications. Our goal is to see the po-
`tential performance improvement available by imple-
`menting memory hierarchy additions in configurable
`hardware. The three applications studied include
`two kernel applications known for their poor cache
`performance (Matrix Multiplication and Fast Fourier
`Transform) as well as a more substantial program:
`the Tomcatv benchmark from the SPEC92 suite [9].
`
`3.3.1 Matrix Multiplication
`
`In the first example, we study a matrix multiplication
`program multiplying two 100 by 100 matrices. The
`elements are double-precision floating point. This
`means that a total of 80000 bytes are needed for each
`matrix. This greatly exceeds the size of the L1 cache.
`Clearly the calculation involves two source matrices
and one destination matrix. As shown in Figure 5, one of the source matrices is accessed in sequential column order while the other is accessed in sequential row order.
`
`
`
`Figure 5: Memory access pattern for arrays in matrix
`multiplication.
`
`
Results The simulated performance of 100x100 matrix multiplication is summarized in Table 1. The standard cache does not perform well both because of the size of the data set and also because one of the arrays is accessed down the columns. Overall, the data miss rate is 21%.

The second column of the table shows the new results obtained when a configurable victim cache is used. In this case, some of the conflict misses can be avoided. The overall performance is only slightly better than in the original case though. The third column of the table shows the results when a configurable prefetch engine is used. In this "stride-one prefetching" each prefetch slot always attempts to prefetch the next cache line. This prefetching allows the program to avoid some compulsory misses, but does not dramatically improve overall performance. The difficulty here is that one of the arrays has a columnar memory access pattern that does not benefit from the stride-one prefetching.
`
Since simple stride-one prefetching is not fully effective, we also investigated allowing the prefetching engine to adjust its prefetch stride on a per-slot basis. The fourth column shows the results of this experiment. Here, we have allocated one slot to each matrix. For the source matrix that accesses elements across a row, the prefetch stride is still one cache line. For the source matrix whose elements are accessed down the columns, we set the stride to be the size of one row in the matrix. Every reference to an address within the matrix will only update the prefetch buffer for its matrix. This technique yields much better performance. The miss rate is reduced by 15% and (as shown in Figure 6) the total program execution time is reduced to 61% of the original.
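This per-slot configuration could be expressed along the following lines (a sketch with our own structure and helper names; the strides follow the text: one 32-byte cache line for the row-order matrix and one 800-byte row for the column-order matrix).

#include <stdint.h>

#define N         100
#define LINE_SIZE 32                    /* bytes per cache line     */
#define ROW_BYTES (N * sizeof(double))  /* 800 bytes per matrix row */

typedef struct { uintptr_t base, limit; unsigned long stride; } slot_config;

/* One prefetch slot per source matrix. */
void configure_matmul_slots(double (*row_mat)[N], double (*col_mat)[N],
                            slot_config *s0, slot_config *s1)
{
    s0->base   = (uintptr_t)row_mat;                /* accessed across rows  */
    s0->limit  = s0->base + N * N * sizeof(double);
    s0->stride = LINE_SIZE;                         /* stride-one prefetch   */

    s1->base   = (uintptr_t)col_mat;                /* accessed down columns */
    s1->limit  = s1->base + N * N * sizeof(double);
    s1->stride = ROW_BYTES;                         /* same column, next row */
}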
`
`3.3.2 Fast Fourier Transformation
`
The next example we examine is Fast Fourier Transformation (FFT), a commonly used algorithm in signal processing. FFT's access pattern is not as regular as other matrix computations; the butterfly memory access pattern is shown in Figure 7 for an 8-point radix-2 FFT. In this figure, the data is prepared in bit-reversed order and the result is in normal order. The calculation is done in-place, i.e. only one data array is used for source, intermediate and destination. The computation is done in dual pairs as follows:
X_{m+1}(p) = X_m(p) + W_N^k X_m(q)
X_{m+1}(q) = X_m(p) - W_N^k X_m(q)

where W_N^k = e^{-j(2πk/N)} is the coefficient, which is precomputed in our program. The subscript of X stands for the stage of computation and the number in brackets stands for the position in the data array.
`
`
`
`
`
`
Table 1: Performance results for matrix multiplication. [Table body illegible in scan; columns: Orig. Cache, Victim Cache, Stride-One Prefetch, Column Prefetch; rows: L1 cache hit rate, config. cache hit rate, miss rate, memory latency (10^6 cycles), execution time (10^6 cycles).]
`
Figure 6: Execution times with different configurable hierarchy additions, normalized to execution time with original cache hierarchy.
`
For this paper, we examine a 1024-point complex FFT. Since there are 1024 points and each point needs 2 doubles, a data array of 16 KB is involved. The corresponding coefficients are precalculated and stored as an 8 KB array. The calculation requires 10 stages. The whole data array is referenced in each stage and the memory access is in dual pairs as shown in Figure 7.
`
Table 2: Performance results for FFT. [Table body illegible in scan; columns: Orig. Cache, Victim Cache, Stride-One Prefetch; rows: L1 cache hit rate, config. cache hit rate, miss rate, memory latency (10^3 cycles), execution time (10^3 cycles).]
`
`
`
`
Figure 7: Butterfly access pattern in FFT.
`
For an 8 KB direct-mapped cache, the misses come from conflicts between the data and coefficient arrays.
`
`Results Table 2 shows the simulation results for
`FFT. As before, the first column shows the results
`with the original memory hierarchy. Here, the data
`miss rate is 22%. Unlike in matrix multiply, the vic-
`tim cache is fairly effective here in improving FFT’s
`memory performance. The program’s miss rate drops
to 16%. This is due to the significant data reuse in loops and the reduction of conflict misses.

With stride-one prefetching, the program's performance is improved even more. This is because of the good spatial locality and the sequential access pattern of computing the dual pairs. The miss rate in this case is reduced by 14% and the total execution time is reduced to 83% of the original. In each
`stage, there is one reference sequence (i.e. one slot
`is used) if the dual pair stride is less than a cache
`line. In later stages, the dual-pair stride will be more
`than four cache lines (the size of one prefetch slot),
`and in those cases, two slots are used. In intermedi-
`ate stages (i.e.
`the stride falls between one line and
`four lines) the access to the dual pair may cause the
`prefetch buffer to be updated prematurely and use-
`ful data already in the prefetch buffer is lost. More
`sophisticated prefetching may avoid this problem but
`may be too complex to be implemented by reconfig-
`urable hardware.
`
3.3.3 Tomcatv

Tomcatv, a SPECfp92 benchmark, is a vectorized mesh generation program written in Fortran. There are 7 data arrays used in the program. Each array is a 257 by 257 matrix of double precision floating point.
`
`
`
`
Table 3: Performance results for tomcatv. [Table body illegible in scan; columns: Orig. Cache, Victim Cache, Stride-One Prefetch; rows: L1 cache hit rate, config. cache hit rate, miss rate, memory latency (10^6 cycles), execution time (10^6 cycles).]
`
Thus, the data arrays take about 3.7 MB in total, and are far larger than the cache size. The arrays are mainly accessed sequentially in row order and data elements are reused several times in each loop. Due to the large active data set, however, the miss rate is still quite high.
`
`Results Table 3 shows the results for tomcatv. Al-
`though the active data set is very large, there is a
`good deal of spatial and temporal locality in the mem-
`ory accesses. Thus, it may be somewhat surprising
`that the or