Declaration of Rachel J. Watters on Authentication of Publication
I, Rachel J. Watters, am a librarian, and the Director of Wisconsin TechSearch (“WTS”), located at 728 State Street, Madison, Wisconsin, 53706. WTS is an interlibrary loan department at the University of Wisconsin-Madison. I have worked as a librarian at the University of Wisconsin library system since 1998. I have been employed at WTS since 2002, first as a librarian and, beginning in 2011, as the Director. Through the course of my employment, I have become well informed about the operations of the University of Wisconsin library system, which follows standard library practices.
This Declaration relates to the dates of receipt and availability of the following:

Zhong, P., & Martonosi, M. (1996, October). Using reconfigurable hardware to customize memory hierarchies. In High-Speed Computing, Digital Signal Processing, and Filtering Using Reconfigurable Logic (Vol. 2914, pp. 237-248). International Society for Optics and Photonics.
Standard operating procedures for materials at the University of Wisconsin-Madison Libraries. When a volume was received by the Library, it would be checked in, stamped with the date of receipt, added to library holdings records, and made available to readers as soon after its arrival as possible. The procedure normally took a few days or at most 2 to 3 weeks.
Exhibit A to this Declaration is a true and accurate copy of the title page with library date stamp of High-Speed Computing, Digital Signal Processing, and Filtering Using Reconfigurable Logic (1996), from the University of Wisconsin-Madison Library
collection. Exhibit A also includes an excerpt of pages 237 to 248 of that volume, showing the article entitled Using reconfigurable hardware to customize memory hierarchies (1996). Based on this information, the date stamp on the volume cover page indicates Using reconfigurable hardware to customize memory hierarchies (1996) was received by University of Wisconsin-Madison Libraries on December 12, 1996.

Based on the information in Exhibit A, it is clear that the volume was received by the library on or before December 12, 1996, catalogued and available to library patrons within a few days or at most 2 to 3 weeks after December 12, 1996.
I declare that all statements made herein of my own knowledge are true and that all statements made on information and belief are believed to be true; and further that these statements were made with the knowledge that willful false statements and the like so made are punishable by fine or imprisonment, or both, under Section 1001 of Title 18 of the United States Code.
Date: October 2, 2018

Rachel J. Watters
Director
Wisconsin TechSearch
Memorial Library
728 State Street
Madison, Wisconsin 53706
PROCEEDINGS
SPIE—The International Society for Optical Engineering

High-Speed Computing, Digital Signal Processing, and Filtering Using Reconfigurable Logic

John Schewel
Peter M. Athanas
V. Michael Bove, Jr.
John Watson
Chairs/Editors

20-21 November 1996
Boston, Massachusetts

Volume 2914
PROCEEDINGS
SPIE—The International Society for Optical Engineering

High-Speed Computing, Digital Signal Processing, and Filtering Using Reconfigurable Logic

John Schewel
Peter M. Athanas
V. Michael Bove, Jr.
John Watson
Chairs/Editors

20-21 November 1996
Boston, Massachusetts

Sponsored and Published by
SPIE—The International Society for Optical Engineering

Volume 2914

SPIE is an international technical society dedicated to advancing engineering and scientific applications of optical, photonic, imaging, electronic, and optoelectronic technologies.
The papers appearing in this book comprise the proceedings of the meeting mentioned on the cover and title page. They reflect the authors’ opinions and are published as presented and without change, in the interests of timely dissemination. Their inclusion in this publication does not necessarily constitute endorsement by the editors or by SPIE.

Please use the following format to cite material from this book:
Author(s), “Title of paper,” in High-Speed Computing, Digital Signal Processing, and Filtering Using Reconfigurable Logic, John Schewel, Peter M. Athanas, V. Michael Bove, Jr., John Watson, Editors, Proc. SPIE 2914, page numbers (1996).

Library of Congress Catalog Card No. 96-69766
ISBN 0-8194-2316-5

Published by
SPIE—The International Society for Optical Engineering
P.O. Box 10, Bellingham, Washington 98227-0010 USA
Telephone 360/676-3290 (Pacific Time) • Fax 360/647-1445

Copyright ©1996, The Society of Photo-Optical Instrumentation Engineers.

Copying of material in this book for internal or personal use, or for the internal or personal use of specific clients, beyond the fair use provisions granted by the U.S. Copyright Law is authorized by SPIE subject to payment of copying fees. The Transactional Reporting Service base fee for this volume is $6.00 per article (or portion thereof), which should be paid directly to the Copyright Clearance Center (CCC), 222 Rosewood Drive, Danvers, MA 01923. Payment may also be made electronically through CCC Online at http://www.directory.net/copyright/. Other copying for republication, resale, advertising or promotion, or any form of systematic or multiple reproduction of any material in this book is prohibited except with permission in writing from the publisher. The CCC fee code is 0-8194-2316-5/96/$6.00.

Printed in the United States of America.
Using Reconfigurable Hardware to Customize Memory Hierarchies

Peixin Zhong and Margaret Martonosi
Department of Electrical Engineering
Princeton University
Princeton, NJ 08544-5263
{pzhong, martonosi}@ee.princeton.edu

Abstract

Over the past decade or more, processor speeds have increased much more quickly than memory speeds. As a result, a large, and still increasing, processor-memory performance gap has formed. Many significant applications suffer from substantial memory bottlenecks, and their memory performance problems are often either too unusual or extreme to be mitigated by cache memories alone. Such specialized performance “bugs” require specialized solutions, but it is impossible to provide case-by-case memory hierarchies or caching strategies on general-purpose computers.

We have investigated the potential of implementing mechanisms like victim caches and prefetch buffers in reconfigurable hardware to improve application memory behavior. Based on technology and commercial trends, our simulation-based studies use a forward-looking model in which configurable logic is located on the CPU chip. Given such assumptions, our results show that the flexibility of being able to specialize configurable hardware to an application’s memory referencing behavior more than balances the slightly slower response times of configurable memory hierarchy structures. For our three applications, small, specialized memory hierarchy additions such as victim caches and prefetch buffers can reduce miss rates substantially and can drop total execution times for these programs to between 60 and 80% of their original execution times. Our results also indicate that different memory specializations may be most effective for each application; this highlights the usefulness of configurable memory hierarchies that are specialized on a per-application basis.

Keywords: memory latency, configurable computing, victim cache, prefetching.

1 Introduction

Due to rapid increases in microprocessor speeds, the performance gap between processors and main memory is widening. Cache memories are typically used in computer systems to bridge this performance gap and reduce the average memory access time. Although caches work well in many cases, they may still fail to provide high performance for certain applications.

Several hardware and software techniques have been proposed to improve cache performance in such cases. For example, prefetching techniques aim to hide the large latency out to main memory by bringing data to the cache before it is referenced. Victim caches attempt to reduce conflict misses in low-associativity caches. These hardware techniques have variable results depending on the application’s memory referencing behavior. Their disadvantage is that they represent wasted transistor space on the CPU chip for those applications where they are ineffective. On the other hand, the drawback to purely software-based techniques (such as blocked matrix accesses or compiler-inserted prefetching directives) is that it can be difficult to statically analyze a program’s memory behavior and determine when such techniques will be useful. For these reasons, this paper explores implementing memory hierarchy additions in programmable hardware.

Programmable logic, such as field-programmable gate arrays (FPGAs), has gained tremendous popularity in the past decade. Programmable logic is popular because a given chip’s behavior can be configured and customized for different functions during different sessions. Customization on a per-application basis is feasible because the reconfiguration process is fast and can be done with the device in the system. Some FPGAs can even be partially reconfigured while the rest of the device is in use.
The configurability of FPGAs makes them heavily used for prototyping, but they have also been used to build high-performance application specific compute engines [12]. In addition, there has been work (such as PRISC [21] and PRISM [2]) on supplementing conventional processors with configurable coprocessors to accelerate performance for different applications.

Thus far, most configurable computing projects have focused heavily on computation as opposed to data access. In this paper, we explore the potential of using configurable hardware to make application-specific improvements to memory behavior. We envision a library of parameterized memory hierarchy additions that can be invoked to target cases where the cache system does not work well for a particular application.

As fabrication technology improves, more and more transistors fit on a single chip. It seems likely that we will soon see processor chips that include a region of on-chip configurable logic. This configurable logic can clearly have many uses; our work does not preclude configuring the logic for more traditional compute-oriented uses, but simply attempts to explore an alternative use.

In Section 2, we will first discuss the structure of the configurable memory unit that we envision. Following that, in Section 3, we present case studies of applying the ideas of configurable memory to several applications. In Section 4, we discuss some of the hardware organization and implementation issues inherent in our approach. A brief account of related work is included in Section 5 and Section 6 presents some discussion and our conclusions.
2 Configurable Hardware in Memory Hierarchies

We believe that configurable logic can result in significant performance improvement by improving average memory access latencies. Our overriding goal is to minimize the number of cache misses that result in accesses to main memory. Researchers have proposed methods that promise performance improvements of up to 3X by reducing cache misses using full-custom hardware [19]. These methods are rarely included in commercial processors, however, because they do not provide performance improvements for a broad enough set of applications. Our current research shows the potential of these approaches using flexible, configurable hardware instead.

Integrated circuits are expected to soon grow to contain over 100 million transistors. As this growth takes place, we must determine ways to best make use of these additional transistors. Rather than simply devoting increased chip areas to increased cache sizes, our research explores other methods for using the transistors to reduce (or better tolerate) memory access latencies.

In current research projects, configurable logic is typically incorporated into the architecture using an attached processor array model. As shown on the left hand side of Figure 1, an accelerator based on FPGAs and dedicated static RAM (SRAM) is attached to the I/O bus of a conventional host processor. The conventional processor and configurable logic array operate asynchronously. The host supplies control signals and monitors results while the logic array processes data obtained from an external source such as a frame-buffer. The major problem with this model is the high communication latency between processor and configurable logic, due to their physical and logical separation.

As shown on the right hand side of Figure 1, the expected integration of configurable logic on-chip gives us more flexibility not only in its logical placement within the architecture, but also in its expected uses. In addition, on-chip configurable logic has more flexibility in its connectivity to processors and caches. We can now begin considering uses for configurable logic that are infeasible (because of latency and connectivity constraints) with the attached processor model.
2.1 Logical Placement, Connectivity

The logical placement of the configurable logic is driven by its intended memory optimization function. The right hand side of Figure 1 shows two distinct possibilities, and in fact, each of these configurable blocks may expend their transistor budgets on some combination of configurable gates and associated SRAM. (In this figure, the logical positions for configurable logic are indicated by diamonds labelled “C1” and “C2”.)

Figure 1: Traditional (left) and more forward-looking (right) uses of configurable logic.

The configurable logic closest to the processor can detect L1 cache misses and manage prefetch buffers, stream buffers, or a victim cache. Similar functions can be performed by the second block of configurable logic, for the L2 cache. In addition, this logic could observe memory accesses between nodes of a distributed memory multiprocessor, or hold specialized multiprocessor coherence protocols.
In this paper, we focus primarily on two particular optimizations: victim caches and prefetching engines, implemented in configurable hardware. Victim caches are small, fully-associative caches that lie between the cache and main memory. Victim caches primarily aim to reduce conflict misses by caching data that has been evicted from the cache as a result of a miss on another location. Even small (four-entry) victim caches can reduce conflict misses by about 50% and reduce total cache misses by 5-30% [19].

Prefetching engines are hardware structures intended to fetch pieces of data from main memory before references to them occur, in order to hide the large latencies out to main memory. They have been studied in several contexts before [6, 7, 20]. Although such hardware mechanisms have been evaluated via simulation studies, widespread adoption in custom hardware is unlikely, since their benefits vary from application to application. Assuming that future processors will include a block of configurable logic on-chip, this paper uses simulations to re-evaluate these mechanisms within the context of a dynamically customizable memory hierarchy.

For prefetch or stream buffers, the configurable logic must be able to detect a miss from the L1 (or L2) cache, recognize that the address matches the address of data in the prefetch buffer, abort the memory request, and supply the data to the CPU. Simultaneous to servicing the access request, the prefetch controller may initiate a memory request for the next location from which to prefetch. For a victim cache, the logic must once again be able to detect the primary cache miss, recognize the presence of the data in its SRAM, abort the memory request, and supply the data to the CPU. These functions can be easily performed using a small state machine controller.
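A minimal sketch of the miss-handling flow just described, assuming a software model of the controller (the function and type names below are invented for illustration; in the paper this logic is a small hardware state machine, not C code):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LINE_SIZE 32          /* bytes per cache line, as in the paper's model */

typedef struct {
    bool     valid;
    uint32_t tag;             /* line-aligned address held in this entry */
    uint8_t  data[LINE_SIZE];
} entry_t;

/* Placeholder for the hardware signal that cancels the outstanding
 * request to the next level of memory. */
static void abort_memory_request(uint32_t line_addr) { (void)line_addr; }

/* Invoked when the L1 cache reports a miss on 'line_addr'.  If the
 * configurable structure (victim cache or prefetch buffer) holds the
 * line, it supplies the data and aborts the main-memory request;
 * otherwise the request to the next memory level proceeds normally. */
bool handle_l1_miss(entry_t *entries, int n, uint32_t line_addr, uint8_t *out)
{
    for (int i = 0; i < n; i++) {
        if (entries[i].valid && entries[i].tag == line_addr) {
            memcpy(out, entries[i].data, LINE_SIZE);  /* supply data to the CPU */
            abort_memory_request(line_addr);
            return true;
        }
    }
    return false;
}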
2.2 Logical vs. Physical Placement

It is important to make a distinction between the logical and the physical placement of the configurable logic. The discussion above described the logical placement, i.e. where the configurable logic needs to be positioned in terms of the signals it needs access to. Physically, we anticipate future generations of microprocessors to have a fixed amount of configurable logic which may be used for a variety of tasks including but not limited to accelerating computation and memory accesses. Since each application will have different requirements on where the configurable logic should be placed, it is not reasonable to expect it to be positioned in exactly the positions marked in Figure 1. The key demand that our logical placement makes on configurable logic’s physical placement is that of connectivity; there must be signal paths that provide access to the signals required by the logical usage. The physical placement impacts performance in two ways. First, the interconnect delay may be significant. Second, the interconnect topology may lead to contention of shared resources such as buses. This contention has not been modelled in the initial study presented here.

2.3 Programming Model

In addition to hardware concerns, our configurable memory hierarchies raise software issues as well. Figure 2 gives a flowchart that represents the path a program would go through in order to make use of configurable memory hierarchy additions.

Figure 2: Flowchart of programming model with configurable memory hierarchy additions.

Starting from conventional source code, some performance analysis is needed to determine what types of performance bottlenecks are present. Based on this analysis, configuration directives may be added into the code in order to use particular configurable memory hierarchy additions. The configuration directives may either make use of hardware from a parameterized library of pre-made designs, or they may specify custom-made configurations. Our flowchart also allows for the possibility that these configuration directives will either be hand-inserted into the code by the programmer, or will be automatically inserted as part of the compilation process. At this point, the code is executed, and the memory hierarchy additions are configured as part of the program run. If the code is to be run repeatedly, then the performance statistics about this program run can be used as feedback to modify the configuration for future runs. Clearly, the widespread use of this sort of hardware relies on automating the analysis and choice of configurations as much as possible. In Section 3’s preliminary case studies, however, we set the configuration directives manually.

3 Application Case Studies

In order to quantitatively evaluate our idea, we present results on three case study applications.

3.1 Configurable Memory Hardware

In the case studies presented here, we assume the configurable memory hardware sits on the same chip as the processor and first-level (L1) cache. This corresponds to location C1 in Figure 1. For these studies, we examine two configurable hierarchy additions separately: a victim cache and a prefetch buffer.

3.1.1 Victim Cache

A victim cache is a fully-associative cache with typically no more than 16 entries. When an L1 cache miss occurs, the referenced line is pulled into the L1 cache and another line currently stored in the same cache line is evicted. The victim cache is a small fully-associative buffer that holds these evicted lines. Using victim caches can have the effect of increasing the set associativity at low hardware cost.

Figure 3 shows a diagram of the victim cache we consider. The address and data lines go to both the L1 cache and the victim cache. On an L1 cache hit, the data is provided directly from the L1 cache, and the configurable hardware is not involved. On an L1 cache miss, the victim cache is probed in parallel with the request to the next level of memory. If the reference hits in the victim cache, the victim cache provides the data to the processor. When the data is pulled into the L1 cache, another line must be evicted from the cache. The victim cache intercepts this line and stores it.

Figure 3: Block diagram of victim cache.
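As a rough software model of the victim-cache behavior described above (our own sketch; the structure and names are assumptions, and the real design is a small fully-associative SRAM with an LRU policy), probing on an L1 miss and intercepting the evicted line might look like this:

#include <stdbool.h>
#include <stdint.h>

#define VC_ENTRIES 4              /* four-entry victim cache, as simulated in Section 3 */

typedef struct {
    bool     valid;
    uint32_t tag;                 /* line-aligned address */
    uint32_t lru;                 /* smaller value = older */
} vc_entry_t;

typedef struct {
    vc_entry_t e[VC_ENTRIES];
    uint32_t   clock;             /* monotonically increasing use counter */
} victim_cache_t;

/* Probe the victim cache on an L1 miss; returns true on a hit. */
bool vc_lookup(victim_cache_t *vc, uint32_t line_addr)
{
    for (int i = 0; i < VC_ENTRIES; i++) {
        if (vc->e[i].valid && vc->e[i].tag == line_addr) {
            vc->e[i].lru = ++vc->clock;   /* refresh LRU state */
            return true;                  /* data would be supplied to the CPU here */
        }
    }
    return false;
}

/* Intercept the line evicted from L1 and store it, replacing the LRU entry. */
void vc_insert(victim_cache_t *vc, uint32_t evicted_line_addr)
{
    int victim = 0;
    for (int i = 1; i < VC_ENTRIES; i++)
        if (!vc->e[i].valid || vc->e[i].lru < vc->e[victim].lru)
            victim = i;
    vc->e[victim].valid = true;
    vc->e[victim].tag   = evicted_line_addr;
    vc->e[victim].lru   = ++vc->clock;
}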
3.1.2 Prefetching Buffer

The goal of a prefetch buffer is to initiate main memory accesses in advance, so that the data will be closer to the processor when referenced. Prefetching can be especially beneficial when the main memory bandwidth is large, but the access latency is also quite high. The prefetch buffer in our model has several independent “slots”. Each slot holds several cache lines and works like a FIFO. It can issue prefetching commands to the main memory. We need several slots because in a typical program, there will be several concurrent data access streams. For example, when a program is performing matrix addition, there will be three access streams for the two source arrays and the destination array. Figure 4 is the schematic of a prefetch buffer with four slots; each slot holds four cache lines of data.

Figure 4: Block diagram of prefetch buffers.

If there is an L1 cache hit, it does not trigger any operation in the prefetch buffer. If there is an L1 miss, the request is sent to the prefetch buffer. If the referenced item is available in one of the slots, this line is pulled into the CPU as well as the L1 cache. The remaining lines move toward the top of the FIFO. The vacancy is filled with data prefetched from main memory. The prefetching engine must determine which line to prefetch next. In the simple case, the memory address for the subsequent line is calculated by incrementing the current address by the cache line size.

If the L1 cache miss also misses in the prefetch buffer, the least recently used slot is designated to begin prefetching the subsequent data following this referenced address. By doing this, we can avoid compulsory cache misses if the initial reference to a memory line is already covered by the prefetching.

To make better use of the slots, we may program them with address ranges and data access strides. For example, when we have a lot of accesses on one array, we can set one slot with the address range of the data array. The slot will only respond to references in this range. If the access pattern has an address increment stride larger than the cache line size, we can assign the stride so we only prefetch the data that is useful.
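A sketch of what a programmable prefetch slot with an address range and stride could look like; the field and function names are our own assumptions, since the paper does not give a concrete data layout:

#include <stdbool.h>
#include <stdint.h>

#define LINES_PER_SLOT 4            /* each slot holds four cache lines (FIFO) */

typedef struct {
    uint32_t range_lo, range_hi;    /* only respond to misses in this address range */
    uint32_t stride;                /* bytes between consecutive prefetches          */
    uint32_t next_addr;             /* next line the slot will prefetch              */
    uint32_t fifo[LINES_PER_SLOT];  /* addresses of the lines currently buffered     */
    int      count;
} prefetch_slot_t;

/* Does this slot claim the missed address? */
bool slot_matches(const prefetch_slot_t *s, uint32_t addr)
{
    return addr >= s->range_lo && addr < s->range_hi;
}

/* After servicing a hit in the slot, schedule the next prefetch one stride ahead. */
void slot_advance(prefetch_slot_t *s)
{
    s->next_addr += s->stride;      /* stride-one prefetching would use the cache line size here */
}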
3.2 Evaluation Methodology

In this study, we use software simulations to estimate the performance for our applications. In particular, we used the TangoLite memory reference generator to perform high-level simulations. TangoLite uses a direct-execution simulation approach. The original program written in C or FORTRAN is compiled and instrumented with calls out to routines that simulate memory behavior and keep statistics.

In our simulation model, we have a single CPU that issues no more than one instruction per cycle. The direct-mapped L1 data cache has a capacity of 8 KB and a line size of 32 bytes.
In order to focus on data references, we have assumed ideal memory performance in the instruction stream.

Our simulations assume a hit in the L1 cache costs one cycle, the same as a simple ALU instruction. If a reference misses in the L1 cache, additional cycles are needed depending on where the data is found. We assume that data found in the configurable cache takes 2 extra cycles and a miss in both L1 and configurable cache will cost 10 extra cycles to go to main memory.

The victim cache we simulate is fully-associative with 4 cache-line-sized entries, updated according to an LRU policy. The prefetch buffer has four slots with 4 cache lines each. Our simulator ignores any additional contention that prefetching may introduce.
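The cycle accounting just described can be summarized in a small cost function; this is our own restatement of the stated model, not part of the TangoLite simulator:

/* Extra cycles charged for one data reference under the paper's model:
 * an L1 hit costs 1 cycle (no extra), a hit in the configurable structure
 * adds 2 cycles, and a miss in both adds 10 cycles for main memory. */
typedef enum { L1_HIT, CONFIG_HIT, MAIN_MEMORY } where_found_t;

int extra_cycles(where_found_t where)
{
    switch (where) {
    case L1_HIT:      return 0;   /* 1-cycle access, same as a simple ALU op */
    case CONFIG_HIT:  return 2;   /* victim cache or prefetch buffer hit     */
    case MAIN_MEMORY: return 10;  /* miss in both L1 and configurable cache  */
    }
    return 10;
}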
3.3 Per-Application Results

With that background, we will now look in more detail at three applications. Our goal is to see the potential performance improvement available by implementing memory hierarchy additions in configurable hardware. The three applications studied include two kernel applications known for their poor cache performance (Matrix Multiplication and Fast Fourier Transform) as well as a more substantial program: the Tomcatv benchmark from the SPEC92 suite [9].
3.3.1 Matrix Multiplication

In the first example, we study a matrix multiplication program multiplying two 100 by 100 matrices. The elements are double-precision floating point. This means that a total of 80000 bytes are needed for each matrix. This greatly exceeds the size of the L1 cache. Clearly the calculation involves two source matrices and one destination matrix. As shown in Figure 5, one of the source matrices is accessed in sequential column order while the other is accessed in sequential row order.

Figure 5: Memory access pattern for arrays in matrix multiplication.
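For reference, the access streams shown in Figure 5 arise from the standard triple loop; the sketch below uses hypothetical array names and simply makes the row-order and column-order accesses explicit:

#define N 100

/* c = a * b for 100x100 double-precision matrices (80000 bytes each). */
void matmul(double a[N][N], double b[N][N], double c[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                /* a is read across a row (sequential, one element at a time);
                 * b is read down a column (stride of one full row, 800 bytes). */
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
}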
Results The simulated performance of 100x100 matrix multiplication is summarized in Table 1. The standard cache does not perform well both because of the size of the data set and also because one of the arrays is accessed down the columns. Overall, the data miss rate is 21%.

The second column of the table shows the new results obtained when a configurable victim cache is used. In this case, some of the conflict misses can be avoided. The overall performance is only slightly better than in the original case though. The third column of the table shows the results when a configurable prefetch engine is used. In this “stride-one prefetching” each prefetch slot always attempts to prefetch the next cache line. This prefetching allows the program to avoid some compulsory misses, but does not dramatically improve overall performance. The difficulty here is that one of the arrays has a columnar memory access pattern that does not benefit from the stride-one prefetching.

Since simple stride-one prefetching is not fully effective, we also investigated allowing the prefetching engine to adjust its prefetch stride on a per-slot basis. The fourth column shows the results of this experiment. Here, we have allocated one slot to each matrix. For the source matrix that accesses elements across a row, the prefetch stride is still one cache line. For the source matrix whose elements are accessed down the columns, we set the stride to be the size of one row in the matrix. Every reference to an address within the matrix will only update the prefetch buffer for its matrix. This technique yields much better performance. The miss rate is reduced by 15% and (as shown in Figure 6) the total program execution time is reduced to 61% of the original.
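One way the two prefetch slots might be configured for this experiment, sketched with invented structure and function names; only the 80000-byte matrix size, the 32-byte cache line, and the 800-byte row stride follow from the text:

#include <stdint.h>
#include <stddef.h>

typedef struct { uint32_t range_lo, range_hi, stride; } slot_cfg_t;

/* Hypothetical per-slot setup for the two source matrices of the 100x100
 * double-precision multiply (assumes 32-bit addresses for simplicity). */
void configure_slots(slot_cfg_t *row_slot, slot_cfg_t *col_slot,
                     uintptr_t a_base, uintptr_t b_base)
{
    size_t matrix_bytes = 100 * 100 * sizeof(double);    /* 80000 bytes */

    row_slot->range_lo = (uint32_t)a_base;                /* row-order matrix  */
    row_slot->range_hi = (uint32_t)(a_base + matrix_bytes);
    row_slot->stride   = 32;                              /* next cache line   */

    col_slot->range_lo = (uint32_t)b_base;                /* column-order matrix */
    col_slot->range_hi = (uint32_t)(b_base + matrix_bytes);
    col_slot->stride   = 100 * sizeof(double);            /* one row = 800 bytes */
}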
Table 1: Performance results for matrix multiplication.

                        | Orig. Cache | Victim Cache | Stride-One Prefetch | Column Prefetch
L1 cache hit rate       |      —      |      —       |          —          |        —
Config. cache hit rate  |      —      |      —       |          —          |        —
Miss rate               |     21%     |     19%      |         19%         |        7%
Memory Lat.             |      —      |      —       |          —          |        —
Exec. Time              |      —      |      —       |          —          |        —

Figure 6: Execution times with different configurable hierarchy additions, normalized to execution time with original cache hierarchy.

3.3.2 Fast Fourier Transformation

The next example we examine is Fast Fourier Transformation (FFT), a commonly used algorithm in signal processing. FFT's access pattern is not as regular as other matrix computations; the butterfly memory access pattern is shown in Figure 7 for an 8-point radix-2 FFT. In this figure, the data is prepared in bit-reversed order and the result is in normal order. The calculation is done in-place, i.e. only one data array is used for source, intermediate and destination. The computation is done in dual pairs as follows:

X_{m+1}(p) = X_m(p) + W_N X_m(q)
X_{m+1}(q) = X_m(p) - W_N X_m(q)

where

W_N = e^{-j(2π/N)}

is the coefficient which is precomputed in our program. The subscript of X stands for the stage of computation and the number in bracket stands for the position in the data array.

For this paper, we examine a 1024-point complex FFT. Since there are 1024 points and each point needs 2 doubles, a data array of 16 KB is involved. The corresponding coefficients are precalculated and stored as an 8 KB array. The calculation requires 10 stages. The whole data array is referenced in each stage and the memory access is in dual pairs as shown in Figure 7. For an 8 KB direct-mapped cache, the misses come from conflicts between the data and coefficient arrays.

Figure 7: Butterfly access pattern in FFT.
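The dual-pair update above is the familiar radix-2 butterfly; the following is our own brief C sketch of one in-place butterfly step (not code from the paper), using C99 complex arithmetic:

#include <complex.h>

/* One radix-2 butterfly on the in-place data array x:
 *   x[p] <- x[p] + W * x[q]
 *   x[q] <- x[p] - W * x[q]   (using the old value of x[p])
 * where W is the precomputed twiddle factor derived from W_N = e^{-j(2π/N)}. */
static void butterfly(double complex *x, int p, int q, double complex w)
{
    double complex t = w * x[q];
    double complex u = x[p];
    x[p] = u + t;
    x[q] = u - t;
}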
Table 2: Performance results for FFT.

                        | Orig. Cache | Victim Cache | Stride-One Prefetch
L1 cache hit rate       |     78%     |     78%      |        78%
Config. cache hit rate  |      —      |      6%      |        14%
Miss rate               |     22%     |     16%      |         8%
Memory Lat.             |     96      |     76       |        47
Exec. Time              |      —      |      —       |         —

Results Table 2 shows the simulation results for FFT. As before, the first column shows the results with the original memory hierarchy. Here, the data miss rate is 22%. Unlike in matrix multiply, the victim cache is fairly effective here in improving FFT's memory performance. The program's miss rate drops to 16%. This is due to the significant data reuse in loops and the reduction of conflict misses.

With stride-one prefetching the program's performance is improved even more. This is because of the good spatial locality and the sequential access pattern of computing the dual pairs. The miss rate in this case is reduced by 14% and the total execution time is reduced to 83% of the original cache. In each stage, there is one reference sequence (i.e. one slot is used) if the dual pair stride is less than a cache line. In later stages, the dual-pair stride will be more than four cache lines (the size of one prefetch slot), and in those cases, two slots are used. In intermediate stages (i.e. the stride falls between one line and four lines) the access to the dual pair may cause the prefetch buffer to be updated prematurely and useful data already in the prefetch buffer is lost. More sophisticated prefetching may avoid this problem but may be too complex to be implemented by reconfigurable hardware.

Table 3: Performance results for tomcatv.

                        | Orig. Cache | Victim Cache | Stride-One Prefetch
L1 cache hit rate       |     67%     |     67%      |         —
Config. cache hit rate  |      —      |      —       |         —
Miss rate               |     33%     |     14%      |        25%
Memory Lat.             |      —      |      —       |         —
Exec. Time              |      —      |      —       |         —
point. Thus, the data arrays take about 3.7 MB each, and are far larger than the cache size. The arrays are mainly accessed sequentially in row order and data elements are reused several times in each loop. Due to the large active data set, however, the miss rate is still quite high.

Results Table 3 shows the results for tomcatv. Although the active data set is very large, there is a good deal of spatial and temporal locality in the memory accesses. Thus, it may be somewhat surprising that the original miss rate is as high as 33%. This high miss rate is due to a large number of conflict misses in the direct-mapped cache. As shown in the second column of the table, the victim cache gives very good results on this application by greatly reducing the conflict misses. The total miss rate drops to 14% and execution time is reduced to 69% of its original value. The prefetch buffer, however, does
