Declaration of Rachel J. Watters on Authentication of Publication
I, Rachel J. Watters, am a librarian, and the Director of Wisconsin TechSearch (“WTS”), located at 728 State Street, Madison, Wisconsin, 53706. WTS is an interlibrary loan department at the University of Wisconsin-Madison. I have worked as a librarian at the University of Wisconsin library system since 1998. I have been employed at WTS since 2002, first as a librarian and, beginning in 2011, as the Director. Through the course of my employment, I have become well informed about the operations of the University of Wisconsin library system, which follows standard library practices.

This Declaration relates to the dates of receipt and availability of the following:
`
`Lange, H., & Koch, A. (2000). Memory access schemes for
`configurable processors. In International Workshop on Field
`Programmable Logic and Applications (pp. 615-625).
`Springer : Berlin, Heidelberg.
`
`Standard operating procedures for materials at the University of Wisconsin-
`
`Madison Libraries. When a volume wasreceived by the Library, it would be
`
`checked in, stamped with the date ofreceipt, addedto library holdings records, and
`
`madeavailable to readers as soonafterits arrival as possible. The procedure
`
`normally took a few days or at most 2 to 3 weeks.
`
`Exhibit A to this Declaration is true and accurate copy of the front matter of
`
`the International Workshop on Field Programmable Logic and Applications
`
Petitioners Amazon
Ex. 1006, p. 1
(2000) publication, which includes a stamp on the verso page showing that this book is the property of the Kurt F. Wendt Library at the University of Wisconsin-Madison.

Attached as Exhibit B is the cataloging system record of the University of Wisconsin-Madison Libraries for its copy of the International Workshop on Field Programmable Logic and Applications (2000) publication. As shown in the “Receiving date” field of this Exhibit, the University of Wisconsin-Madison Libraries owned this book and had it cataloged in the system as of November 7, 2000.

Members of the interested public could locate the International Workshop on Field Programmable Logic and Applications (2000) publication after it was cataloged by searching the public library catalog or requesting a search through WTS. The search could be done by title, author, and/or subject key words. Members of the interested public could access the publication by locating it on the library’s shelves or requesting it from WTS.

I declare that all statements made herein of my own knowledge are true and that all statements made on information and belief are believed to be true; and further that these statements were made with the knowledge that willful false statements and the like so made are punishable by fine or imprisonment, or both, under Section 1001 of Title 18 of the United States Code.
Date: October 2, 2018

Wisconsin TechSearch
Memorial Library
728 State Street
Madison, Wisconsin 53706

Rachel J. Watters
Director
Reiner W. Hartenstein
Herbert Grünbacher (Eds.)

Field-Programmable Logic and Applications

The Roadmap to Reconfigurable Computing

10th International Conference, FPL 2000
Villach, Austria, August 27-30, 2000
Proceedings

Springer
Series Editors

Gerhard Goos, Karlsruhe University, Germany
Juris Hartmanis, Cornell University, NY, USA
Jan van Leeuwen, Utrecht University, The Netherlands

Volume Editors

Reiner W. Hartenstein
University of Kaiserslautern, Computer Science Department
P. O. Box 30 49, 67653 Kaiserslautern, Germany
E-mail: hartenst@rhrk.uni-kl.de

Herbert Grünbacher
Carinthia Tech Institute
Richard-Wagner-Str. 19, 9500 Villach, Austria
E-mail: hg@cti.ac.at

Cataloging-in-Publication Data applied for

Die Deutsche Bibliothek - CIP-Einheitsaufnahme

Field programmable logic and applications: the roadmap to reconfigurable computing; 10th international conference; proceedings / FPL 2000, Villach, Austria, August 27-30, 2000. Reiner W. Hartenstein; Herbert Grünbacher (ed.). - Berlin; Heidelberg; New York; Barcelona; Hong Kong; London; Milan; Paris; Singapore; Tokyo: Springer, 2000
(Lecture notes in computer science; Vol. 1896)

[Library stamp: Kurt F. Wendt Library, University of Wisconsin-Madison, 215 N. Randall Avenue, Madison, WI 53706-1688]

CR Subject Classification (1998): B.6-7, J.6

ISSN 0302-9743
ISBN 3-540-67899-9 Springer-Verlag Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law.

Springer-Verlag Berlin Heidelberg New York
a member of BertelsmannSpringer Science+Business Media GmbH
© Springer-Verlag Berlin Heidelberg 2000
Printed in Germany

Typesetting: Camera-ready by author, data conversion by Steingraber Satztechnik GmbH, Heidelberg
Printed on acid-free paper
SPIN 10722573    06/3142    5 4 3 2 1 0
Memory Access Schemes for Configurable Processors
Holger Lange and Andreas Koch

Tech. Univ. Braunschweig (E.I.S.), Gaußstr. 11, D-38106 Braunschweig, Germany
lange, koch@eis.cs.tu-bs.de

Abstract. This work discusses the Memory Architecture for Reconfigurable Computers (MARC), a scalable, device-independent memory interface that supports both irregular (via configurable caches) and regular accesses (via pre-fetching stream buffers). By hiding specifics behind a consistent abstract interface, it is suitable as a target environment for automatic hardware compilation.
1 Introduction
Reconfigurable compute elements can achieve considerable performance gains over standard CPUs [1] [2] [3] [4]. In practice, these configurable elements are often combined with a conventional processor, which provides the control and I/O services that are implemented more efficiently in fixed logic. Recent single-chip architectures following this approach include NAPA [5], GARP [6], OneChip [7], OneChip98 [8], Triscend E5 [9], and Altera Excalibur [10]. Board-level configurable processors either include a dedicated CPU [11] [12] or rely on the host CPU for support [13] [14].

Design tools targeting one of these hybrid systems, such as GarpCC [15], Nimble [16] or Napa-C [17], have to deal with software and hardware issues separately as well as with the creation of interfaces between these parts. On the software side, basic services such as I/O and memory management are often provided by an operating system of some kind. This can range from a full-scale general-purpose OS over more specialized real-time embedded OSes down to tiny kernels offering only a limited set of functions tailored to a very specific class of applications. Usually, a suitable OS is either readily available on the target platform, or can be ported to it with relative ease.

This level of support is unfortunately not present on the hardware side of the hybrid computer. Since no standard environment is available for even the most primitive tasks such as efficient memory access or communication with the host, the research and development of new design tools often requires considerable effort to provide a reliable environment into which the newly-created hardware can be embedded. This environment is sometimes called a wrapper around the custom datapath. It goes beyond a simple assignment of chip pads to memory pins. Instead, a structure of on-chip busses and access protocols to various resources (e.g., memory, the conventional processor, etc.) must be defined and implemented.

In this paper, we present our work on the Memory Architecture for Reconfigurable Computers (MARC). It can act as a “hardware target” for a variety of hybrid compilers, analogously to a software target for conventional compilers. Before describing its specifics, we will justify our design decisions by giving a brief overview of current configurable architectures and showing the custom hardware architectures created by some hybrid compilers.
R.W. Hartenstein and H. Grünbacher (Eds.): FPL 2000, LNCS 1896, pp. 615-625, 2000.
© Springer-Verlag Berlin Heidelberg 2000

`
`
`616
`
`H. Lange and A. Koch
`
`2 Hybrid Processors
`
`Static and reconfigurable compute elements may be combined in many ways. The degree
`of integration can range from individual reconfigurable function units (e.g., OneChip
`[7]) to an entirely separate coprocessor attachedto a peripheral bus (e.g., SPLASH[4],
`SPARXIL [18]).
Figure 1. Single-chip hybrid processor
Figure 1 sketches the architecture of a single-chip hybrid processor that combines fixed (CPU) and reconfigurable (RC) compute units behind a common cache (D$). Such an architecture was proposed, e.g., for GARP [6] and NAPA [5]. It offers very high bandwidth, low latency, and cache coherency between the CPU and the RC when accessing the shared DRAM.
Figure 2. Hybrid processor emulated by multi-chip system
The board-level systems more common today use an architecture similar to Figure 2. Here, a conventional CPU is attached by a bus interface unit (BIU) to a system-wide I/O bus (e.g., SBus [18] or PCI [11] [12]). Another BIU connects the RC to the I/O bus. Due to the high communication latencies over the I/O bus, the RC is often attached directly to a limited amount of dedicated memory (commonly a few KB to a few MB of SRAM). In some systems, the RC has access to the main DRAM by using the I/O bus as a master to contact the CPU memory controller (MEMC). With this capability, the CPU and the RC are sharing a logically homogeneous address space: Pointers in the CPU main memory can be freely exchanged between software on the CPU and hardware in the RC.
Table 1 shows the latencies measured on [12] for the RC accessing data residing in local Zero-Bus Turnaround (ZBT) SRAM (latched in the FPGA I/O blocks) and in main DRAM (via the PCI bus). In both cases, one word per cycle is transferred after the initial latency.

Table 1. Data access latencies (single word transfers)

It is obvious from these numbers that any useful wrapper must be able to deal efficiently with access to high-latency memories. This problem, colloquially known as the “memory bottleneck”, has already been tackled for conventional processors using memory hierarchies (multiple cache levels) combined with techniques such as pre-fetching and streaming to improve their performance. As we will see later, these approaches are also applicable to reconfigurable systems.
3 Reconfigurable Datapaths

The structure of the compute elements implemented on the RC is defined either manually or by automatic tools. A common architecture [6] [16] [18] is shown in Figure 3.
Figure 3. Common RC datapath architecture
The datapath is formed by a number of hardware operators, often created using module generators, which are placed in a regular fashion. While the linear placement shown in the figure is often used in practice, more complicated layouts are of course possible. All hardware operators are connected to a central datapath controller that orchestrates their execution.

In this paper, we focus on the interface blocks attaching the datapath to the rest of the system. They allow communication with the CPU and main memory using the system bus or access to the local RC RAM. The interface blocks themselves are accessed by the datapath using a structure of uni- and bidirectional busses that transfer data, addresses, and control information.
For manually implemented RC applications, the protocols used here are generally developed ad-hoc and heavily influenced by the specific hardware environment targeted (e.g., the data sheets of the actual SRAM chips on a PCB). In practice, they may even vary between different applications running on the same hardware (e.g., usage of burst modes, fixed access sizes etc.).

This approach is not applicable for automatic design flows: These tools require pre-defined access mechanisms to which they strictly adhere for all designs. An example for such a well-defined protocol suitable as a target for automatic compilation is employed on GARP [6], a single-chip hybrid processor architecture. It includes standardized protocols for sending and retrieving data to/from the RC using specialized CPU instructions and supported by dedicated decoding logic in silicon. Memory requests are routed over a single address and four data busses that can supply up to four words per cycle for regular (streaming) accesses.

None of these capabilities is available when using off-the-shelf silicon to implement the RC. Instead, each user is faced with implementing the required access infrastructure anew.
4 MARC

Our goal was to learn from these past experiences and develop a single, scalable, and portable memory interface scheme for reconfigurable datapaths. MARC strives to be applicable for both single-chip and board-level systems, and to hide the intricacies of different memory systems from the datapath. Figure 4 shows an overview of this architecture.
Figure 4. MARC architecture
Using MARC, the datapath accesses memory through abstract front-end interfaces. Currently, we support two front-ends specialized for different access patterns: Caching ports provide for efficient handling of irregular accesses. Streaming ports offer a non-unit stride access to regular data structures (such as matrices or images) and perform address generation automatically. In both cases, data is pre-fetched/cached to reduce the impact of high latencies (especially for transfers using the I/O bus). Both ports use stall signals to indicate delays in the data transfer (e.g., due to cache miss or stream queue refill). A byte-steering logic aligns 8- and 16-bit data on bits 7:0 and 15:0 of the data bus regardless of where the datum occurred in the 32-bit memory or bus words.

The specifics of hardware memory chips or system bus protocols are implemented in various back-end interfaces. E.g., dedicated back-ends encapsulate the mechanisms for accessing SRAM or communicating over the PCI bus using the BIU.

The MARC core is located between front- and back-ends, where it acts as the main controller and data switchboard. It performs address decoding and arbitration between transfer initiators in the datapath and transfer receivers in the individual memories and busses. Logically, it can map an arbitrary number of front-ends to an arbitrary number of back-ends. In practice, though, the number of resources managed is of course limited by the finite FPGA capacity. Furthermore, the probability of conflicts between initiators increases when they share a smaller number of back-ends. However, the behavior visible to the datapath remains identical: The heterogeneous hardware resources handled by the back-ends are mapped into a homogeneous address space and accessed by a common protocol.
4.1 Irregular Cached Access

Caching ports are set up to provide read data one cycle after an address has been applied, and accept one write datum/address per clock cycle. If this is not possible (e.g., a cache miss occurs), the stall signal is asserted for the affected port, stopping the initiator. When the stall signal is de-asserted, data that was “in-flight” due to a previous request will remain valid to allow the initiator to restart cleanly.

Table 2(a) describes the interface to a caching port. The architecture currently allows for 32-bit data ports, which is the size most relevant when compiling software into hybrid solutions. Should the need arise for wider words, the architecture can easily be extended. Arbitrary memory ranges (e.g., memory-mapped I/O registers) can be marked as non-cacheable. Accesses to these regions will then bypass the cache. Furthermore, since all of the cache machinery is implemented in configurable logic, cache port characteristics such as number of cache lines and cache line length can be adapted to the needs of the application. As discussed in [19], this can result in a 3% to 10% speed-up over using a single cache configuration for all applications.
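The hit/miss timing behavior of such a port can be sketched in software. This is an illustrative model, not the hardware implementation; the 10-cycle miss latency is an invented placeholder:

```python
# Illustrative model of a caching port: a read "returns" the cycle after
# the address is applied on a hit; on a miss the port would assert its
# stall signal while the line is fetched. The miss latency is invented.

class CachingPort:
    MISS_LATENCY = 10          # invented back-end fetch time in cycles

    def __init__(self, memory, line_words=8):
        self.memory = memory
        self.line_words = line_words
        self.lines = {}        # tag -> list of words (fully associative)

    def read(self, addr):
        """Return (data, cycles_taken): 1 cycle on a hit, more on a miss."""
        tag = addr // self.line_words
        if tag not in self.lines:                   # miss: stall while refilling
            base = tag * self.line_words
            self.lines[tag] = [self.memory[base + i]
                               for i in range(self.line_words)]
            return self.lines[tag][addr % self.line_words], 1 + self.MISS_LATENCY
        return self.lines[tag][addr % self.line_words], 1

mem = list(range(64))
port = CachingPort(mem)
_, t_miss = port.read(5)    # first access to the line misses
_, t_hit = port.read(6)     # same line: hit, single cycle
print(t_miss, t_hit)        # -> 11 1
```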
4.2 Regular Streamed Access

Streaming ports transfer a number of data words from or to a memory area without the need for the datapath to generate addresses. After setting the parameters of the transfer (by switching the port into a “load parameter” mode), the port presents/accepts one data item per clock cycle until it has to refill or flush its internal FIFOs. In that case, the stall
Table 2. Port interfaces

(a) Caching port interface: stall (asserted on cache miss), output enable, write enable, and cache flush.

(b) Streaming port interface: parameter registers for start address, stride (increment), access width (8-, 16-, or 32-bit), FIFO size, transfer length, and read-or-write direction; the data item bus; stall (wait during FIFO flush/refill); hold (pause data flow); an end-of-stream indicator; and a signal to accept new parameters.
signal stops the initiator using the port. When the FIFO becomes ready again, the stall signal is de-asserted and the transfer continues. The datapath can pause the transfer by asserting the hold signal. As before, our current implementation calls for a 32-bit wide data bus. Table 2(b) lists the parameter registers and the port interface.

The ‘Block’ register plays a crucial role in matching the stream characteristics to the specific application requirements. E.g., if the application has to process a very large string (such as in DNA matching), it makes sense for the datapath to request a large block size. The longer start-up delay (for the buffer to be filled) is amortized over the long run-time of the algorithm. For smaller amounts of data (e.g., part of a matrix row for blocking matrix multiplication), it makes much more sense to pre-fetch only the precise amount of data required. [20] suggests compile-time algorithms to estimate the FIFO depth to use.

The cache is bypassed by the streaming ports in order to avoid cache pollution. However, since logic guaranteeing the consistency between caches and streams for arbitrary accesses would be very expensive to implement (especially when non-unit strided streams are used), our current design requires that accesses through the caching ports do not overlap streamed memory ranges. This restriction must be enforced by the compiler. If that is not possible, streaming ports cannot be used. As an alternative, a cache with longer cache lines (e.g., 128 bytes) might be used to limit the performance loss due to memory latency.
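The address sequence a streaming port generates from its parameters (start address, stride, element width, length) can be sketched as follows; this is an illustrative model with invented parameter values, not the port's actual register encoding:

```python
# Illustrative model of streaming-port address generation: from the
# transfer parameters, the port produces the whole access sequence
# without any datapath involvement. Parameter values are invented.

def stream_addresses(start, stride, width_bits, length):
    """Yield byte addresses for `length` elements of `width_bits` each,
    spaced `stride` elements apart (stride 1 = unit stride)."""
    step = (width_bits // 8) * stride
    return [start + i * step for i in range(length)]

# Non-unit stride example: fetch every 4th 32-bit word starting at
# byte 0x1000, e.g. one column of a 4-word-wide row-major matrix.
print(stream_addresses(0x1000, 4, 32, 4))
# -> [4096, 4112, 4128, 4144]
```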
4.3 Multi-threading

Note that all stall or flow-control signals are generated/accepted on a per-port basis. This allows true multi-threaded hardware execution where different threads of control are assigned to different ports. MARC can accommodate more logical ports on the front-ends than actually exist physically on the back-ends. For certain applications, this can be exploited to allow the compiler to schedule a larger number of memory accesses in parallel. The MARC core will resolve any inter-port conflicts (if they occur at all,
see Section 5) at run-time. The current implementation uses a round-robin policy; later versions might extend this to a priority-based scheme.
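The round-robin conflict resolution can be sketched as follows (an illustrative software model of the policy, not the arbiter hardware itself):

```python
# Illustrative round-robin arbiter: each cycle, the grant goes to the
# first requesting port at or after the one following the last winner,
# so no requesting port can be starved by the others.

def round_robin(requests, last_grant):
    """requests: one bool per port. Returns (granted_port, new_last),
    or (None, last_grant) if no port is requesting."""
    n = len(requests)
    for offset in range(1, n + 1):
        port = (last_grant + offset) % n
        if requests[port]:
            return port, port
    return None, last_grant

# Ports 0 and 2 keep requesting; the grant alternates between them.
last = -1
grants = []
for _ in range(4):
    g, last = round_robin([True, False, True], last)
    grants.append(g)
print(grants)   # -> [0, 2, 0, 2]
```

A priority-based variant would simply replace the rotating search order with a fixed one.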
4.4 Flexibility

A separate back-end is used for each memory or bus resource. For example, in a system with four ZBT SRAM memories, four instances of the ZBT SRAM back-end would be instantiated. The back-ends present the same interface as a caching port (Table 2(a)) to the MARC core. They encapsulate the state and access mechanisms to manage each of the physical resources. E.g., a PCI back-end might know how to access a PCI BIU and initiate a data transfer.

In this manner, additional back-ends handling more memory banks can be attached easily. Analogously, MARC can be adapted to different FPGA technologies. For example, on the Xilinx Virtex [24] series of FPGAs, the L1 cache of a caching port might be implemented using the on-chip memories. On the older XC4000XL series, which has only a limited amount of on-chip storage, the cache could be implemented in a direct-mapped fashion that has the cache lines placed in external memory.

5 Implementation Issues

Our first MARC implementation is targeting the prototyping environment described in [12]. The architecture details relevant for this paper are shown in Figure 5.

Figure 5. Architecture of prototype hardware

A SUN microSPARC-IIep RISC [21] [22] is employed as conventional CPU. The RC is composed of a Xilinx Virtex XCV1000 FPGA [24].

5.1 Status

At this point, we have implemented and intensively simulated a parameterized Verilog model of the MARC Core and back-ends. On the front-end side, caching ports are already operational, while streaming ports are still under development. The design is currently only partially floorplanned, thus the given performance numbers are preliminary.
5.2 Physical Resources

The RC has four 128Kx36b banks of ZBT SRAM as dedicated memory and can access the main memory (64MB DRAM managed by the CPU) over the PCI bus. To this end, a PLX 9080 PCI Accelerator [23] is used as BIU that translates the RC bus (i960-like) into PCI and back. The MARC core will thus need PLX and ZBT SRAM back-ends. All of their instances can operate in parallel.
5.3 MARC Core

The implementation follows the architecture described in Section 4: An arbitrary number of caching and streaming ports can be managed. In this implementation (internally relying on the Virtex memories in dual-ported mode), two cache ports are guaranteed to operate without conflicts, and three to four cache ports may operate without conflicts. If five or more cache ports are in use, a conflict will occur and be resolved by the arbitration unit (Section 4.3). This version of the core currently supports a 24-bit address space into which the physical resources are mapped.
5.4 Configurable Cache

We currently provide three cache configurations: 128 lines of 8 words, 64 lines of 16 words, or 32 lines of 32 words. Non-cacheable areas may be configured at compile time (the required comparators are then synthesized directly into specialized logic). The datapath can explicitly request a cache flush at any time (e.g., after the end of a computation).

The cache is implemented as a fully associative L1 cache. It uses 4KB of Virtex BlockSelectRAM to hold the cache lines on-chip and implements write-back and random line replacement. The BlockSelectRAM is used in dual-port mode to allow up to two accesses to occur in parallel. Conflicts are handled by the MARC Core arbitration logic. The CAMs needed for the associative lookup are composed from SRL16E shift registers as suggested in [25]. This allows a single-cycle read (compare and match detection) and 16 clock cycles to write a new tag into the CAM. Since this operation occurs simultaneously with the loading of the cache lines from memory, the CAM latency is completely hidden in the longer memory latencies. As this 16-cycle delay would also occur (and could not be hidden) when reading the tag, e.g., when writing back a dirty cache line, the tags are additionally stored in a conventional memory composed from RAM32x1S elements that allow single-cycle reading.
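The three configurations trade line count against line length at constant capacity, which a quick check confirms (line counts and lengths from the text; 4 bytes per 32-bit word):

```python
# Quick check that the three cache configurations described in the text
# all occupy the same 4KB of on-chip cache-line storage.

CONFIGS = [(128, 8), (64, 16), (32, 32)]   # (lines, 32-bit words per line)

for lines, words in CONFIGS:
    size = lines * words * 4               # bytes of cache-line storage
    print(f"{lines}x{words}: {size} bytes")
# each configuration -> 4096 bytes (4KB), i.e. 8 of the 512x8b BlockRAMs
```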
For each caching port, a dedicated CAM bank is used to allow lookups to occur in parallel. Each cache line requires 5 4-bit CAMs, thus the per-port CAM area requirements range from 160 to 640 4-LUTs. In addition to the CAMs, each cache port includes a small state-machine controlling the cache operation for different scenarios (e.g., read hit, read miss, write hit, write miss, flush, etc.). The miss penalty c_m in cycles for a clean cache line is given by

    c_m = 7 + t_be + w + 4

where t_be is the data transfer latency for the back-end used (Table 1) and w is the number of 32-bit words per cache line. 7 and 4 are the MARC Core operation startup and shutdown times in cycles, respectively. For a dirty cache line, an additional w cycles are required to write the modified data back.

For comparison with [19], note that according to [26], the performance of 4KB of fully-associative cache is equivalent to that of 8KB of direct-mapped cache.

5.5 Performance and Area

The performance and area requirements of the MARC Core, the technology modules and two cache ports are shown in Table 3.

Table 3. Performance and area requirements

Configuration   RAM32X1Ss   BlockRAMs   Clock
32x32           12          8           31 MHz
64x16           26          8           30 MHz
128x8           56          8           29 MHz

For the three configuration choices, the area requirements vary between 10%-30% of the chip logic capacity. Since all configurations use 4KB of on-chip memory for cache line storage, 8 of the 32 512x8b BlockSelectRAMs are required.

5.6 Scalability and Extensibility

As shown in Table 3, scaling an L1 on-chip cache above 128x8 is probably not a wise choice given the growing area requirements. However, as already mentioned in Section 4.4, part of the ZBT SRAM could be used to hold the cache lines of a direct-mapped L2 cache. In this scenario, only the tags would be held inside of the FPGA.

A sample cache organization for this approach could partition the 24-bit address into an 8-bit tag, 12-bit index and 4-bit block offset. The required tag RAM would be organized as 4096x8b, and would thus require 8 BlockSelectRAMs (in addition to those used for the on-chip L1 cache). The cache would have 4096 lines of 16 words each, and would thus require 64K words of the 128K words available in one of the ZBT SRAM chips. The cache hit/miss determination could occur in two cycles, with the data arriving after another two cycles. A miss going to PCI memory would take 66 cycles to refill a clean cache line and deliver the data to the initiator.

6 Related Work

[19] gives experimental results for the cache-dependent performance behavior of 6 of the 8 benchmarks in the SPECint95 suite. Due to the temporal configurability we suggest for the MARC caches (adapting cache parameters to applications), they expect a performance improvement between 3% to 10% over static caches. [27] describes the use of the configurable logic in a hybrid processor to either add a victim cache or pre-fetch buffers to an existing dedicated direct-mapped L1 cache on a per-application basis. They quote improvements in L1 miss rate of up to 19%. [28] discusses the addition of 1MB of L1 cache memory managed by a dedicated cache controller to a configurable processor. Another approach proposed in [29] re-maps non-contiguous strided physical addresses into contiguous cache line entries. A similar functionality is provided in MARC by the pre-fetching of data into the FIFOs of streaming ports. [30] suggests a scheme which adds configurable logic to a cache instead of a cache to configurable logic. They hope to avoid the memory bottleneck by putting processing (the configurable logic) very close to the data. The farthest step with regard to data pre-fetching is suggested in [31], which describes a memory system that is cognizant of high-level memory access patterns. E.g., once a certain member in a structure is accessed, a set of associated members is fetched automatically. However, the automatic generation of the required logic from conventional software is not discussed. On the subject of streaming accesses, [20] is an exhaustive source. The ‘Block’ register of our streaming ports was motivated by their discussion of overly long startup times for large amounts of pre-fetched data.
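Two of the calculations above can be checked numerically. This sketch assumes the clean-miss penalty of Section 5.4 is simply the sum of the 7-cycle startup, the back-end transfer latency, one cycle per word, and the 4-cycle shutdown, as the text describes; the back-end latency value used is an invented placeholder, and the address split follows the sample L2 organization of Section 5.6:

```python
# Numeric sketch of two calculations from the text. The clean-miss
# penalty is modeled as startup + back-end latency + w words + shutdown
# (7 and 4 cycles per the text); t_be=20 is an invented placeholder.

def miss_penalty(t_be, w, dirty=False):
    """Cycles to service a miss on a w-word line; +w if the line is dirty."""
    cycles = 7 + t_be + w + 4
    return cycles + w if dirty else cycles

print(miss_penalty(t_be=20, w=8))              # clean line
print(miss_penalty(t_be=20, w=8, dirty=True))  # dirty line: +8 for write-back

# Section 5.6's sample L2 split of a 24-bit address:
# 8-bit tag | 12-bit index | 4-bit block offset
def split_l2(addr):
    return addr >> 16, (addr >> 4) & 0xFFF, addr & 0xF

tag, index, offset = split_l2(0xAB123C)
print(hex(tag), hex(index), hex(offset))       # -> 0xab 0x123 0xc
```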
7 Summary

We presented an overview of hybrid processor architectures and some memory access needs often occurring in applications. For the most commonly used RC components (off-the-shelf FPGAs), we identified a lack of support for even the most basic of these requirements.

As a solution, we propose a general-purpose Memory Architecture for Reconfigurable Computers that allows device-independent access both for regular (streamed) and irregular (cached) patterns. We discussed one real-world implementation of MARC on an emulated hybrid processor combining a SPARC CPU with a Virtex FPGA. The sample implementation fully supports multi-threaded access to multiple memory banks as well as the creation of “virtual” memory ports attached to on-chip cache memory. The configurable caches in the current version can reduce the latency from 46 cycles (for access to DRAM via PCI) down to a single cycle on a cache hit.
References

1. Amerson, R., “Teramac — Configurable Custom Computing”, Proc. IEEE Symp. on FCCMs, Napa 1995
2. Bertin, P., Roncin, D., Vuillemin, J., “Programmable Active Memories: A Performance Assessment”, Proc. Symp. Research on Integrated Systems, Cambridge (Mass.) 1993
3. Box, B., “Field-Programmable Gate Array-based Reconfigurable Preprocessor”, Proc. IEEE Symp. on FCCMs, Napa 1994
4. Buell, D., Arnold, J., Kleinfelder, W., “Splash 2 - FPGAs in Custom Computing Machines”, IEEE Press, 1996
5. Rupp, C., Landguth, M., Garverick, et al., “The NAPA Adaptive Processing Architecture”, Proc. IEEE Symp. on FCCMs, Napa 1998
6. Hauser, J., Wawrzynek, J., “Garp: A MIPS Processor with a Reconfigurable Coprocessor”, Proc. IEEE Symp. on FCCMs, Napa 1997
7. Wittig, R., Chow, P., “OneChip: An FPGA Processor with Reconfigurable Logic”, Proc. IEEE Symp. on FCCMs, Napa 1996
8. Jacob, J., Chow, P., “Memory Interfacing and Instruction Specification for Reconfigurable Processors”, Proc. ACM Intl. Symp. on FPGAs, Monterey 1999
9. Triscend, “Triscend E5 CSoC Family”, http://www.triscend.com/products/IndexES.html, 2000
10. Altera, “Excalibur Embedded Processor Solutions”, http://www.altera.com/html/products/excalibur.html, 2000
11. TSI-Telsys, “ACE2card User’s Manual”, hardware documentation, 1998
12. Koch, A., “A Comprehensive Platform for Hardware-Software Co-Design”, Proc. Intl. Workshop on Rapid-Systems Prototyping, Paris 2000
13. Annapolis Microsystems, http://www.annapmicro.com, 2000
14. Virtual Computer Corp., http://www.vcc.com, 2000
15. Callahan, T., Hauser, J.R., Wawrzynek, J., “The Garp Architecture and C Compiler”, IEEE Computer, April 2000
16. Li, Y., Callahan, T., Darnell, E., Harr, R., et al., “Hardware-Software Co-Design of Embedded Reconfigurable Architectures”, Proc. 37th Design Automation Conference, 2000
17. Gokhale, M.B., Stone, J.M., “NAPA C: Compiling for a Hybrid RISC/FPGA Machine”, Proc. IEEE Symp. on FCCMs, 1998
18. Koch, A., Golze, U., “Practical Experiences with the SPARXIL Co-Processor”, Proc. Asilomar Conference on Signals, Systems, and Computers, 11/1997
19. Fung, J.M.L.E., Pan, J., “Configurable Cache”, CMU EE742, http://www.ece.cmu.edu/ee742/proj-s98/fung, 1998
20. McKee, S.A., “Maximizing Bandwidth for Streamed Computations”, dissertation, U. of Virginia, School of Engineering and Applied Science, 1995
21. Sun Microelectronics, “microSPARC-IIep User’s Manual”, http://www.sun.
