`
I, Rachel J. Watters, am a librarian, and the Director of Wisconsin TechSearch (“WTS”), located at 728 State Street, Madison, Wisconsin, 53706. WTS is an interlibrary loan department at the University of Wisconsin-Madison. I have worked as a librarian at the University of Wisconsin library system since 1998. I have been employed at WTS since 2002, first as a librarian and, beginning in 2011, as the Director. Through the course of my employment, I have become well informed about the operations of the University of Wisconsin library system, which follows standard library practices.

This Declaration relates to the dates of receipt and availability of the following:
`
Lange, H., & Koch, A. (2000). Memory access schemes for
configurable processors. In International Workshop on Field
Programmable Logic and Applications (pp. 615-625).
Springer, Berlin, Heidelberg.
`
Standard operating procedures for materials at the University of Wisconsin-Madison Libraries: When a volume was received by the Library, it would be checked in, stamped with the date of receipt, added to library holdings records, and made available to readers as soon after its arrival as possible. The procedure normally took a few days or at most 2 to 3 weeks.
`
Exhibit A to this Declaration is a true and accurate copy of the front matter of the International Workshop on Field Programmable Logic and Applications
`
Petitioners Amazon
Ex. 1006, p. 1
`
`
`
`Declaration of Rachel J. Watters on Authentication of Publication
`
(2000) publication, which includes a stamp on the verso page showing that this book is the property of the Kurt F. Wendt Library at the University of Wisconsin-Madison.
`
Attached as Exhibit B is the cataloging system record of the University of Wisconsin-Madison Libraries for its copy of the International Workshop on Field Programmable Logic and Applications (2000) publication. As shown in the “Receiving date” field of this Exhibit, the University of Wisconsin-Madison Libraries owned this book and had it cataloged in the system as of November 7, 2000.
`
Members of the interested public could locate the International Workshop on Field Programmable Logic and Applications (2000) publication after it was cataloged by searching the public library catalog or requesting a search through WTS. The search could be done by title, author, and/or subject key words. Members of the interested public could access the publication by locating it on the library’s shelves or requesting it from WTS.
`
I declare that all statements made herein of my own knowledge are true and that all statements made on information and belief are believed to be true; and further that these statements were made with the knowledge that willful false statements and the like so made are punishable by fine or imprisonment, or both, under Section 1001 of Title 18 of the United States Code.
`
`
`
`
`
Date: October 2, 2018

Rachel J. Watters
Director
Wisconsin TechSearch
Memorial Library
728 State Street
Madison, Wisconsin 53706
`
`
`
`
Reiner W. Hartenstein
Herbert Grünbacher (Eds.)

Field-Programmable
Logic and Applications

The Roadmap to Reconfigurable Computing

10th International Conference, FPL 2000
Villach, Austria, August 27-30, 2000
Proceedings

Springer
`
`
`
`
`
`
Series Editors

Gerhard Goos, Karlsruhe University, Germany
Juris Hartmanis, Cornell University, NY, USA
Jan van Leeuwen, Utrecht University, The Netherlands

Volume Editors

Reiner W. Hartenstein
University of Kaiserslautern, Computer Science Department
P.O. Box 30 49, 67653 Kaiserslautern, Germany
E-mail: hartenst@rhrk.uni-kl.de

Herbert Grünbacher
Carinthia Tech Institute
Richard-Wagner-Str. 19, 9500 Villach, Austria
E-mail: hg@cti.ac.at
`
Cataloging-in-Publication Data applied for

Die Deutsche Bibliothek - CIP-Einheitsaufnahme

Field programmable logic and applications : the roadmap to reconfigurable computing ; 10th international conference ; proceedings / FPL 2000, Villach, Austria, August 27 - 30, 2000. Reiner W. Hartenstein ; Herbert Grünbacher (ed.). - Berlin ; Heidelberg ; New York ; Barcelona ; Hong Kong ; London ; Milan ; Paris ; Singapore ; Tokyo : Springer, 2000
(Lecture notes in computer science ; Vol. 1896)
ISBN 3-540-67899-9

[Library stamp:]
Wendt Library
University of Wisconsin-Madison
215 N. Randall Avenue
Madison, WI 53706-1688
`
CR Subject Classification (1998): B.6-7, J.6

ISSN 0302-9743
ISBN 3-540-67899-9 Springer-Verlag Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law.

Springer-Verlag Berlin Heidelberg New York
a member of BertelsmannSpringer Science+Business Media GmbH
© Springer-Verlag Berlin Heidelberg 2000
Printed in Germany

Typesetting: Camera-ready by author, data conversion by Steingräber Satztechnik GmbH, Heidelberg
Printed on acid-free paper
SPIN 10722573
06/3142
5 4 3 2 1 0
`
`
`
`
`
`
`
`
`
`
Memory Access Schemes for Configurable Processors

Holger Lange and Andreas Koch

Tech. Univ. Braunschweig (E.I.S.), Gaußstr. 11, D-38106 Braunschweig, Germany
{lange, koch}@eis.cs.tu-bs.de
`
Abstract. This work discusses the Memory Architecture for Reconfigurable Computers (MARC), a scalable, device-independent memory interface that supports both irregular (via configurable caches) and regular accesses (via pre-fetching stream buffers). By hiding specifics behind a consistent abstract interface, it is suitable as a target environment for automatic hardware compilation.
`
1 Introduction
`
Reconfigurable compute elements can achieve considerable performance gains over standard CPUs [1] [2] [3] [4]. In practice, these configurable elements are often combined with a conventional processor, which provides the control and I/O services that are implemented more efficiently in fixed logic. Recent single-chip architectures following this approach include NAPA [5], GARP [6], OneChip [7], OneChip-98 [8], Triscend E5 [9], and Altera Excalibur [10]. Board-level configurable processors either include a dedicated CPU [11] [12] or rely on the host CPU for support [13] [14].

Design tools targeting one of these hybrid systems, such as GarpCC [15], Nimble [16] or Napa-C [17], have to deal with software and hardware issues separately as well as with the creation of interfaces between these parts. On the software side, basic services such as I/O and memory management are often provided by an operating system of some kind. This can range from a full-scale general-purpose OS over more specialized real-time embedded OSes down to tiny kernels offering only a limited set of functions tailored to a very specific class of applications. Usually, a suitable OS is either readily available on the target platform, or can be ported to it with relative ease.

This level of support is unfortunately not present on the hardware side of the hybrid computer. Since no standard environment is available for even the most primitive tasks such as efficient memory access or communication with the host, the research and development of new design tools often requires considerable effort to provide a reliable environment into which the newly-created hardware can be embedded. This environment is sometimes called a wrapper around the custom datapath. It goes beyond a simple assignment of chip pads to memory pins. Instead, a structure of on-chip busses and access protocols to various resources (e.g., memory, the conventional processor, etc.) must be defined and implemented.

In this paper, we present our work on the Memory Architecture for Reconfigurable Computers (MARC). It can act as a “hardware target” for a variety of hybrid compilers, analogously to a software target for conventional compilers. Before describing its specifics, we will justify our design decisions by giving a brief overview of current configurable architectures and showing the custom hardware architectures created by some hybrid compilers.
`
R.W. Hartenstein and H. Grünbacher (Eds.): FPL 2000, LNCS 1896, pp. 615-625, 2000.
© Springer-Verlag Berlin Heidelberg 2000
`
`
`
`
`
`
`
`
`
`2 Hybrid Processors
`
Static and reconfigurable compute elements may be combined in many ways. The degree of integration can range from individual reconfigurable function units (e.g., OneChip [7]) to an entirely separate coprocessor attached to a peripheral bus (e.g., SPLASH [4], SPARXIL [18]).
`
Figure 1. Single-chip hybrid processor
`
Figure 1 sketches the architecture of a single-chip hybrid processor that combines fixed (CPU) and reconfigurable (RC) compute units behind a common cache (D$). Such an architecture was proposed, e.g., for GARP [6] and NAPA [5]. It offers very high bandwidth, low latency, and cache coherency between the CPU and the RC when accessing the shared DRAM.
`
Figure 2. Hybrid processor emulated by multi-chip system
`
The board-level systems more common today use an architecture similar to Figure 2. Here, a conventional CPU is attached by a bus interface unit (BIU) to a system-wide I/O bus (e.g., SBus [18] or PCI [11] [12]). Another BIU connects the RC to the I/O bus. Due to the high communication latencies over the I/O bus, the RC is often attached directly to a limited amount of dedicated memory (commonly a few KB to a few MB of
`
`
`
`
`
`
`
`
`SRAM). In some systems, the RC has access to the main DRAM by using the I/O bus as
`a master to contact the CPU memory controller (MEMC). With this capability, the CPU
`and the RC are sharing a logically homogeneous address space: Pointers in the CPU
`main memory can be freely exchanged between software on the CPU and hardware in
`the RC.
`
Table 1 shows the latencies measured on [12] for the RC accessing data residing in local Zero-Bus Turnaround (ZBT) SRAM (latched in the FPGA I/O blocks) and in main DRAM (via the PCI bus). In both cases, one word per cycle is transferred after the initial latency.

Table 1. Data access latencies (single word transfers)
ZBT SRAM write: 4 cycles (remaining entries illegible in this copy)

It is obvious from these numbers that any useful wrapper must be able to deal efficiently with access to high latency memories. This problem, colloquially known as the “memory bottleneck”, has already been tackled for conventional processors using memory hierarchies (multiple cache levels) combined with techniques such as pre-fetching and streaming to improve their performance. As we will see later, these approaches are also applicable to reconfigurable systems.
`
`3 Reconfigurable Datapaths
`
The structure of the compute elements implemented on the RC is defined either manually or by automatic tools. A common architecture [6] [16] [18] is shown in Figure 3.
`
Figure 3. Common RC datapath architecture
`
`The datapath is formed by a number of hardware operators, often created using
`module generators, which are placed in a regular fashion. While the linear placement
`shown in the figure is often used in practice, more complicated layouts are of course
`possible. All hardware operators are connected to a central datapath controller that
`orchestrates their execution.
`
`In this paper, we focus on the interface blocks attaching the datapath to the rest of the
`system. They allow communication with the CPU and main memory using the system
`bus or access to the local RC RAM. The interface blocks themselves are accessed by the
datapath using a structure of uni- and bidirectional busses that transfer data, addresses,
`and control information.
`
`
`
`
`
`
`
`
`
`
`
For manually implemented RC applications, the protocols used here are generally developed ad-hoc and heavily influenced by the specific hardware environment targeted (e.g., the data sheets of the actual SRAM chips on a PCB). In practice, they may even vary between different applications running on the same hardware (e.g., usage of burst-modes, fixed access sizes, etc.).

This approach is not applicable for automatic design flows: These tools require pre-defined access mechanisms to which they strictly adhere for all designs. An example for such a well-defined protocol suitable as a target for automatic compilation is employed on GARP [6], a single-chip hybrid processor architecture. It includes standardized protocols for sending and retrieving data to/from the RC using specialized CPU instructions and supported by dedicated decoding logic in silicon. Memory requests are routed over a single address and four data busses that can supply up to four words per cycle for regular (streaming) accesses.

None of these capabilities is available when using off-the-shelf silicon to implement the RC. Instead, each user is faced with implementing the required access infrastructure anew.
`
`4 MARC
`
Our goal was to learn from these past experiences and develop a single, scalable, and portable memory interface scheme for reconfigurable datapaths. MARC strives to be applicable for both single-chip and board-level systems, and to hide the intricacies of different memory systems from the datapath. Figure 4 shows an overview of this architecture.
`
`
`
`Back—Ends
`
`Front-Ends
`
`
`UserLogic
`
`
`Figure 4. MARC architecture
`
`
`
`
`
`
`
Using MARC, the datapath accesses memory through abstract front-end interfaces. Currently, we support two front-ends specialized for different access patterns: Caching ports provide for efficient handling of irregular accesses. Streaming ports offer a non-unit stride access to regular data structures (such as matrices or images) and perform address generation automatically. In both cases, data is pre-fetched/cached to reduce the impact of high latencies (especially for transfers using the I/O bus). Both ports use stall signals to indicate delays in the data transfer (e.g., due to cache miss or stream queue refill). A byte-steering logic aligns 8- and 16-bit data on bits 7:0 and 15:0 of the data bus regardless of where the datum occurred in the 32-bit memory or bus words.
The specifics of hardware memory chips or system bus protocols are implemented in various back-end interfaces. E.g., dedicated back-ends encapsulate the mechanisms for accessing SRAM or communicating over the PCI bus using the BIU.

The MARC core is located between front- and back-ends, where it acts as the main controller and data switchboard. It performs address decoding and arbitration between transfer initiators in the datapath and transfer receivers in the individual memories and busses. Logically, it can map an arbitrary number of front-ends to an arbitrary number of back-ends. In practice, though, the number of resources managed is of course limited by the finite FPGA capacity. Furthermore, the probability of conflicts between initiators increases when they share a smaller number of back-ends. However, the behavior visible to the datapath remains identical: The heterogeneous hardware resources handled by the back-ends are mapped into a homogeneous address space and accessed by a common protocol.
`
4.1 Irregular Cached Access
`
Caching ports are set up to provide read data one cycle after an address has been applied, and accept one write datum/address per clock cycle. If this is not possible (e.g., a cache miss occurs), the stall signal is asserted for the affected port, stopping the initiator. When the stall signal is de-asserted, data that was “in-flight” due to a previous request will remain valid to allow the initiator to restart cleanly.

Table 2(a) describes the interface to a caching port. The architecture currently allows for 32-bit data ports, which is the size most relevant when compiling software into hybrid solutions. Should the need arise for wider words, the architecture can easily be extended. Arbitrary memory ranges (e.g., memory-mapped I/O registers) can be marked as non-cacheable. Accesses to these regions will then bypass the cache. Furthermore, since all of the cache machinery is implemented in configurable logic, cache port characteristics such as number of cache lines and cache line length can be adapted to the needs of the application. As discussed in [19], this can result in a 3% to 10% speed-up over using a single cache configuration for all applications.
`
4.2 Regular Streamed Access

Streaming ports transfer a number of data words from or to a memory area without the need for the datapath to generate addresses. After setting the parameters of the transfer (by switching the port into a “load parameter” mode), the port presents/accepts one data item per clock cycle until it has to refill or flush its internal FIFOs. In that case, the stall
`
`
`
`
`
`
`
`
`
Table 2. Port interfaces

(a) Caching port interface: address (in); data (in/out: data item); width (in: 8, 16, 32-bit access); stall (out: asserted on cache miss); output enable (in); write enable (in); flush (in: flush cache).

(b) Streaming port interface: start address; stride (increment); width (8, 16, 32-bit access); block (FIFO size); length of transfer; direction (read or write); data (data item); stall (wait on FIFO flush/refill); hold (pause data flow); end of stream reached; load parameters (accept new parameters).
`
signal stops the initiator using the port. When the FIFO becomes ready again, the stall signal is de-asserted and the transfer continues. The datapath can pause the transfer by asserting the hold signal. As before, our current implementation calls for a 32-bit wide data bus. Table 2(b) lists the parameter registers and the port interface.
The ‘Block’ register plays a crucial role in matching the stream characteristics to the specific application requirements. E.g., if the application has to process a very large string (such as in DNA matching), it makes sense for the datapath to request a large block size. The longer start-up delay (for the buffer to be filled) is amortized over the long run-time of the algorithm. For smaller amounts of data (e.g., part of a matrix row for blocking matrix multiplication), it makes much more sense to pre-fetch only the precise amount of data required. [20] suggests compile-time algorithms to estimate the FIFO depth to use.
The cache is bypassed by the streaming ports in order to avoid cache pollution. However, since logic guaranteeing the consistency between caches and streams for arbitrary accesses would be very expensive to implement (especially when non-unit strided streams are used), our current design requires that accesses through the caching ports do not overlap streamed memory ranges. This restriction must be enforced by the compiler. If that is not possible, streaming ports cannot be used. As an alternative, a cache with longer cache lines (e.g., 128 bytes) might be used to limit the performance loss due to memory latency.
`
4.3 Multi-threading
`
Note that all stall or flow-control signals are generated/accepted on a per-port basis. This allows true multi-threaded hardware execution where different threads of control are assigned to different ports. MARC can accommodate more logical ports on the front-ends than actually exist physically on the back-ends. For certain applications, this can be exploited to allow the compiler to schedule a larger number of memory accesses in parallel. The MARC core will resolve any inter-port conflicts (if they occur at all,
`
`
`
`
`
see Section 5) at run-time. The current implementation uses a round-robin policy; later versions might extend this to a priority-based scheme.
`
`4.4 Flexibility
`
A separate back-end is used for each memory or bus resource. For example, in a system with four ZBT SRAM memories, four instances of the ZBT SRAM back-end would be instantiated. The back-ends present the same interface as a caching port (Table 2(a)) to the MARC core. They encapsulate the state and access mechanisms to manage each of the physical resources. E.g., a PCI back-end might know how to access a PCI BIU and initiate a data transfer.

In this manner, additional back-ends handling more memory banks can be attached easily. Analogously, MARC can be adapted to different FPGA technologies. For example, on the Xilinx Virtex [24] series of FPGAs, the L1 cache of a caching port might be implemented using the on-chip memories. On the older XC4000XL series, which has only a limited amount of on-chip storage, the cache could be implemented in a direct-mapped fashion that has the cache lines placed in external memory.
`
5 Implementation Issues

Our first MARC implementation is targeting the prototyping environment described in [12]. The architecture details relevant for this paper are shown in Figure 5.
`
Figure 5. Architecture of prototype hardware
`
A SUN microSPARC-IIep RISC [21] [22] is employed as conventional CPU. The RC is composed of a Xilinx Virtex XCV1000 FPGA [24].
`
`5.1 Status
`
At this point, we have implemented and intensively simulated a parameterized Verilog model of the MARC Core and back-ends. On the front-end side, caching ports are already operational, while streaming ports are still under development. The design is currently only partially floorplanned; thus, the given performance numbers are preliminary.
`
`
`
`
`
5.2 Physical Resources

The RC has four 128Kx36b banks of ZBT SRAM as dedicated memory and can access the main memory (64MB DRAM managed by the CPU) over the PCI bus. To this end, a PLX 9080 PCI Accelerator [23] is used as BIU that translates the RC bus (i960-like) into PCI and back. The MARC core will thus need PLX and ZBT SRAM back-ends. All of their instances can operate in parallel.
`
5.3 MARC Core

The implementation follows the architecture described in Section 4: An arbitrary number of caching and streaming ports can be managed. In this implementation (internally relying on the Virtex memories in dual-ported mode), two cache ports are guaranteed to operate without conflicts, and three to four cache ports may operate without conflicts. If five or more cache ports are in use, a conflict will occur and be resolved by the arbitration unit (Section 4.3). This version of the core currently supports a 24-bit address space into which the physical resources are mapped.
`
5.4 Configurable Cache

We currently provide three cache configurations: 128 lines of 8 words, 64 lines of 16 words, or 32 lines of 32 words. Non-cacheable areas may be configured at compile time (the required comparators are then synthesized directly into specialized logic). The datapath can explicitly request a cache flush at any time (e.g., after the end of a computation).
The cache is implemented as a fully associative L1 cache. It uses 4KB of Virtex BlockSelectRAM to hold the cache lines on-chip and implements write-back and random line replacement. The BlockSelectRAM is used in dual-port mode to allow up to two accesses to occur in parallel. Conflicts are handled by the MARC Core arbitration logic. The CAMs needed for the associative lookup are composed from SRL16E shift registers as suggested in [25]. This allows a single-cycle read (compare and match detection) and 16 clock cycles to write a new tag into the CAM. Since this operation occurs simultaneously with the loading of the cache lines from memory, the CAM latency is completely hidden in the longer memory latencies. As this 16-cycle delay would also occur (and could not be hidden) when reading the tag, e.g., when writing back a dirty cache line, the tags are additionally stored in a conventional memory composed from RAM32X1S elements that allow single-cycle reading.

For each caching port, a dedicated CAM bank is used to allow lookups to occur in parallel. Each cache line requires 5 4-bit CAMs, thus the per-port CAM area requirements range from 160 to 640 4-LUTs. In addition to the CAMs, each cache port includes a small state-machine controlling the cache operation for different scenarios (e.g., read hit, read miss, write hit, write miss, flush, etc.). The miss penalty c_m in cycles for a clean cache line is given by

    c_m = 7 + c_be + w + 4

where c_be is the data transfer latency for the back-end used (Table 1) and w is the number of 32-bit words per cache line. 7 and 4 are the MARC Core operation startup
`
`
`
`
`
`
`
`
`
and shutdown times in cycles, respectively. For a dirty cache line, additional cycles are required to write the modified data back.

For comparison with [19], note that according to [26], the performance of 4KB of fully-associative cache is equivalent to that of 8KB of direct-mapped cache.
`
`5.5 Performance and Area
`
The performance and area requirements of the MARC Core, the technology modules and two cache ports are shown in Table 3.
`
Table 3. Performance and area requirements

(columns: 4-LUTs, RAM32X1Ss, BlockSelectRAMs, clock; some cells illegible in this copy)
32x32: 976, 12, 8, 31 MHz
64x16: 26, 8, 30 MHz
128x8: 56, 8, 29 MHz
XCV1000 avail.: 24576 4-LUTs, 24576 RAM32X1Ss (each uses 2 4-LUTs)
`
For the three configuration choices, the area requirements vary between 10% and 30% of the chip logic capacity. Since all configurations use 4KB of on-chip memory for cache line storage, 8 of the 32 512x8b BlockSelectRAMs are required.
`
`5.6 Scalability and Extensibility
`
As shown in Table 3, scaling an on-chip L1 above 128x8 is probably not a wise choice given the growing area requirements. However, as already mentioned in Section 4.4, part of the ZBT SRAM could be used to hold the cache lines of a direct-mapped L2 cache. In this scenario, only the tags would be held inside of the FPGA.

A sample cache organization for this approach could partition the 24-bit address into an 8-bit tag, 12-bit index and 4-bit block offset. The required tag RAM would be organized as 4096x8b, and would thus require 8 BlockSelectRAMs (in addition to those used for the on-chip L1 cache). The cache would have 4096 lines of 16 words each, and would thus require 64K words of the 128K words available in one of the ZBT SRAM chips. The cache hit/miss determination could occur in two cycles, with the data arriving after another two cycles. A miss going to PCI memory would take 66 cycles to refill a clean cache line and deliver the data to the initiator.
`
`6 Related Work
`
[19] gives experimental results for the cache-dependent performance behavior of 6 of the 8 benchmarks in the SPECint95 suite. Due to the temporal configurability we suggest for the MARC caches (adapting cache parameters to applications), they expect a performance improvement between 3% to 10% over static caches. [27] describes the use of the
`
`
`
`
`
`
`
configurable logic in a hybrid processor to either add a victim cache or pre-fetch buffers to an existing dedicated direct-mapped L1 cache on a per-application basis. They quote improvements in L1 miss rate of up to 19%. [28] discusses the addition of 1MB of L1 cache memory managed by a dedicated cache controller to a configurable processor. Another approach proposed in [29] re-maps non-contiguous strided physical addresses into contiguous cache line entries. A similar functionality is provided in MARC by the pre-fetching of data into the FIFOs of streaming ports. [30] suggests a scheme which adds configurable logic to a cache instead of a cache to configurable logic. They hope to avoid the memory bottleneck by putting processing (the configurable logic) very close to the data. The farthest step with regard to data pre-fetching is suggested in [31], which describes a memory system that is cognizant of high-level memory access patterns. E.g., once a certain member in a structure is accessed, a set of associated members is fetched automatically. However, the automatic generation of the required logic from conventional software is not discussed. On the subject of streaming accesses, [20] is an exhaustive source. The ‘Block’ register of our streaming ports was motivated by their discussion of overly long startup times for large amounts of pre-fetched data.
`
`7 Summary
`
We presented an overview of hybrid processor architectures and some memory access needs often occurring in applications. For the most commonly used RC components (off-the-shelf FPGAs), we identified a lack of support for even the most basic of these requirements.

As a solution, we propose a general-purpose Memory Architecture for Reconfigurable Computers that allows device-independent access both for regular (streamed) and irregular (cached) patterns. We discussed one real-world implementation of MARC on an emulated hybrid processor combining a SPARC CPU with a Virtex FPGA. The sample implementation fully supports multi-threaded access to multiple memory banks as well as the creation of “virtual” memory ports attached to on-chip cache memory. The configurable caches in the current version can reduce the latency from 46 cycles (for access to DRAM via PCI) down to a single cycle on a cache hit.
`
References

1. Amerson, R., "Teramac - Configurable Custom Computing", Proc. IEEE Symp. on FCCMs, Napa 1995
2. Bertin, P., Roncin, D., Vuillemin, J., "Programmable Active Memories: A Performance Assessment", Proc. Symp. Research on Integrated Systems, Cambridge (Mass.) 1993
3. Box, B., "Field-Programmable Gate Array-based Reconfigurable Preprocessor", Proc. IEEE Symp. on FCCMs, Napa 1994
4. Buell, D., Arnold, J., Kleinfelder, W., "Splash 2 - FPGAs in Custom Computing Machines", IEEE Press, 1996
5. Rupp, C., Landguth, M., Garverick, et al., "The NAPA Adaptive Processing Architecture", Proc. IEEE Symp. on FCCMs, Napa 1998
6. Hauser, J., Wawrzynek, J., "Garp: A MIPS Processor with a Reconfigurable Coprocessor", Proc. IEEE Symp. on FCCMs, Napa 1997
`
`
`
`
7. Wittig, R., Chow, P., "OneChip: An FPGA Processor with Reconfigurable Logic", Proc. IEEE Symp. on FCCMs, Napa 1996
8. Jacob, J., Chow, P., "Memory Interfacing and Instruction Specification for Reconfigurable Processors", Proc. ACM Intl. Symp. on FPGAs, Monterey 1999
9. Triscend, "Triscend E5 CSoC Family", http://www.triscend.com/products/IndexE5.html, 2000
10. Altera, "Excalibur Embedded Processor Solutions", http://www.altera.com/html/products/excalibur.html, 2000
11. TSI-Telsys, "ACEcard User's Manual", hardware documentation, 1998
12. Koch, A., "A Comprehensive Platform for Hardware-Software Co-Design", Proc. Intl. Workshop on Rapid System Prototyping, Paris 2000
13. Annapolis Microsystems, http://www.annapmicro.com, 2000
14. Virtual Computer Corp., http://www.vcc.com, 2000
15. Callahan, T., Hauser, J.R., Wawrzynek, J., "The Garp Architecture and C Compiler", IEEE Computer, April 2000
16. Li, Y., Callahan, T., Darnell, E., Harr, R., et al., "Hardware-Software Co-Design of Embedded Reconfigurable Architectures", Proc. 37th Design Automation Conference, 2000
17. Gokhale, M.B., Stone, J.M., "NAPA C: Compiling for a Hybrid R