Declaration of Rachel J. Watters on Authentication of Publication
I, Rachel J. Watters, am a librarian, and the Director of Wisconsin TechSearch ("WTS"), located at 728 State Street, Madison, Wisconsin, 53706. WTS is an interlibrary loan department at the University of Wisconsin-Madison. I have worked as a librarian at the University of Wisconsin library system since 1998. I have been employed at WTS since 2002, first as a librarian and, beginning in 2011, as the Director. Through the course of my employment, I have become well informed about the operations of the University of Wisconsin library system, which follows standard library practices.

This Declaration relates to the dates of receipt and availability of the following:
Lange, H., & Koch, A. (2000). Memory access schemes for configurable processors. In International Workshop on Field Programmable Logic and Applications (pp. 615-625). Springer: Berlin, Heidelberg.
Standard operating procedures for materials at the University of Wisconsin-Madison Libraries. When a volume was received by the Library, it would be checked in, stamped with the date of receipt, added to library holdings records, and made available to readers as soon after its arrival as possible. The procedure normally took a few days or at most 2 to 3 weeks.

Exhibit A to this Declaration is a true and accurate copy of the front matter of the International Workshop on Field Programmable Logic and Applications (2000) publication, which includes a stamp on the verso page showing that this book is the property of the Kurt F. Wendt Library at the University of Wisconsin-Madison.
Attached as Exhibit B is the cataloging system record of the University of Wisconsin-Madison Libraries for its copy of the International Workshop on Field Programmable Logic and Applications (2000) publication. As shown in the "Receiving date" field of this Exhibit, the University of Wisconsin-Madison Libraries owned this book and had it cataloged in the system as of November 7, 2000.

Members of the interested public could locate the International Workshop on Field Programmable Logic and Applications (2000) publication after it was cataloged by searching the public library catalog or requesting a search through WTS. The search could be done by title, author, and/or subject key words. Members of the interested public could access the publication by locating it on the library's shelves or requesting it from WTS.

I declare that all statements made herein of my own knowledge are true and that all statements made on information and belief are believed to be true; and further that these statements were made with the knowledge that willful false statements and the like so made are punishable by fine or imprisonment, or both, under Section 1001 of Title 18 of the United States Code.
Date: October 2, 2018

Wisconsin TechSearch
Rachel J. Watters, Director
Memorial Library
728 State Street
Madison, Wisconsin 53706
Reiner W. Hartenstein
Herbert Grünbacher (Eds.)

Field-Programmable Logic and Applications

The Roadmap to Reconfigurable Computing

10th International Conference, FPL 2000
Villach, Austria, August 27-30, 2000
Proceedings

Springer
Series Editors

Gerhard Goos, Karlsruhe University, Germany
Juris Hartmanis, Cornell University, NY, USA
Jan van Leeuwen, Utrecht University, The Netherlands

Volume Editors

Reiner W. Hartenstein
University of Kaiserslautern, Computer Science Department
P.O. Box 30 49, 67653 Kaiserslautern, Germany
E-mail: hartenst@rhrk.uni-kl.de

Herbert Grünbacher
Carinthia Tech Institute
Richard-Wagner-Str. 19, 9500 Villach, Austria
E-mail: hg@cti.ac.at
Cataloging-in-Publication Data applied for

Die Deutsche Bibliothek - CIP-Einheitsaufnahme

Field programmable logic and applications : the roadmap to reconfigurable computing ; 10th international conference ; proceedings / FPL 2000, Villach, Austria, August 27 - 30, 2000. Reiner W. Hartenstein ; Herbert Grünbacher (ed.). - Berlin ; Heidelberg ; New York ; Barcelona ; Hong Kong ; London ; Milan ; Paris ; Singapore ; Tokyo : Springer, 2000
(Lecture notes in computer science ; Vol. 1896)
ISBN 3-540-67899-9

[Stamp: Wendt Library, University of Wisconsin-Madison, 215 N. Randall Avenue, Madison, WI 53706-1688]

CR Subject Classification (1998): B.6-7, J.6

ISSN 0302-9743
ISBN 3-540-67899-9 Springer-Verlag Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law.

Springer-Verlag Berlin Heidelberg New York
a member of BertelsmannSpringer Science+Business Media GmbH

© Springer-Verlag Berlin Heidelberg 2000
Printed in Germany

Typesetting: Camera-ready by author, data conversion by Steingräber Satztechnik GmbH, Heidelberg
Printed on acid-free paper
SPIN 10722573   06/3142   5 4 3 2 1 0
Memory Access Schemes for Configurable Processors

Holger Lange and Andreas Koch

Tech. Univ. Braunschweig (E.I.S.), Gaußstr. 11, D-38106 Braunschweig, Germany
{lange, koch}@eis.cs.tu-bs.de
Abstract. This work discusses the Memory Architecture for Reconfigurable Computers (MARC), a scalable, device-independent memory interface that supports both irregular (via configurable caches) and regular accesses (via pre-fetching stream buffers). By hiding specifics behind a consistent abstract interface, it is suitable as a target environment for automatic hardware compilation.
1 Introduction
Reconfigurable compute elements can achieve considerable performance gains over standard CPUs [1] [2] [3] [4]. In practice, these configurable elements are often combined with a conventional processor, which provides the control and I/O services that are implemented more efficiently in fixed logic. Recent single-chip architectures following this approach include NAPA [5], GARP [6], OneChip [7], OneChip98 [8], Triscend E5 [9], and Altera Excalibur [10]. Board-level configurable processors either include a dedicated CPU [11] [12] or rely on the host CPU for support [13] [14].

Design tools targeting one of these hybrid systems, such as GarpCC [15], Nimble [16], or NAPA-C [17], have to deal with software and hardware issues separately as well as with the creation of interfaces between these parts. On the software side, basic services such as I/O and memory management are often provided by an operating system of some kind. This can range from a full-scale general-purpose OS through more specialized real-time embedded OSes down to tiny kernels offering only a limited set of functions tailored to a very specific class of applications. Usually, a suitable OS is either readily available on the target platform, or can be ported to it with relative ease.

This level of support is unfortunately not present on the hardware side of the hybrid computer. Since no standard environment is available for even the most primitive tasks such as efficient memory access or communication with the host, the research and development of new design tools often requires considerable effort to provide a reliable environment into which the newly-created hardware can be embedded. This environment is sometimes called a wrapper around the custom datapath. It goes beyond a simple assignment of chip pads to memory pins. Instead, a structure of on-chip busses and access protocols to various resources (e.g., memory, the conventional processor, etc.) must be defined and implemented.

In this paper, we present our work on the Memory Architecture for Reconfigurable Computers (MARC). It can act as a "hardware target" for a variety of hybrid compilers, analogously to a software target for conventional compilers. Before describing its specifics, we will justify our design decisions by giving a brief overview of current configurable architectures and showing the custom hardware architectures created by some hybrid compilers.
R.W. Hartenstein and H. Grünbacher (Eds.): FPL 2000, LNCS 1896, pp. 615-625, 2000.
© Springer-Verlag Berlin Heidelberg 2000
2 Hybrid Processors

Static and reconfigurable compute elements may be combined in many ways. The degree of integration can range from individual reconfigurable function units (e.g., OneChip [7]) to an entirely separate coprocessor attached to a peripheral bus (e.g., SPLASH [4], SPARXIL [18]).
[Figure 1. Single-chip hybrid processor (labels legible: Hybrid Processor Chip, I/O Bus)]
Figure 1 sketches the architecture of a single-chip hybrid processor that combines fixed (CPU) and reconfigurable (RC) compute units behind a common cache (D$). Such an architecture was proposed, e.g., for GARP [6] and NAPA [5]. It offers very high bandwidth, low latency, and cache coherency between the CPU and the RC when accessing the shared DRAM.
[Figure 2. Hybrid processor emulated by multi-chip system (labels legible: Fixed Processor Chip, Reconfigurable Array)]
The board-level systems more common today use an architecture similar to Figure 2. Here, a conventional CPU is attached by a bus interface unit (BIU) to a system-wide I/O bus (e.g., SBus [18] or PCI [11] [12]). Another BIU connects the RC to the I/O bus. Due to the high communication latencies over the I/O bus, the RC is often attached directly to a limited amount of dedicated memory (commonly a few KB to a few MB of SRAM).
In some systems, the RC has access to the main DRAM by using the I/O bus as a master to contact the CPU memory controller (MEMC). With this capability, the CPU and the RC share a logically homogeneous address space: pointers in the CPU main memory can be freely exchanged between software on the CPU and hardware in the RC.

Table 1 shows the latencies measured on [12] for the RC accessing data residing in local Zero-Bus Turnaround (ZBT) SRAM (latched in the FPGA I/O blocks) and in main DRAM (via the PCI bus). In both cases, one word per cycle is transferred after the initial latency.

[Table 1. Data access latencies (single word transfers). The table body is garbled in this copy; the one legible row gives a ZBT SRAM write latency of 4 cycles, and Section 7 quotes 46 cycles for a DRAM access over PCI.]

It is obvious from these numbers that any useful wrapper must be able to deal efficiently with access to high-latency memories. This problem, colloquially known as the "memory bottleneck", has already been tackled for conventional processors using memory hierarchies (multiple cache levels) combined with techniques such as pre-fetching and streaming to improve their performance. As we will see later, these approaches are also applicable to reconfigurable systems.
3 Reconfigurable Datapaths

The structure of the compute elements implemented on the RC is defined either manually or by automatic tools. A common architecture [6] [16] [18] is shown in Figure 3.

[Figure 3. Common RC datapath architecture (labels legible: Hardware Operators, Datapath Controller, Processor Interface/System Bus, RC-local Memory Interface)]

The datapath is formed by a number of hardware operators, often created using module generators, which are placed in a regular fashion. While the linear placement shown in the figure is often used in practice, more complicated layouts are of course possible. All hardware operators are connected to a central datapath controller that orchestrates their execution.

In this paper, we focus on the interface blocks attaching the datapath to the rest of the system. They allow communication with the CPU and main memory using the system bus, or access to the local RC RAM. The interface blocks themselves are accessed by the datapath using a structure of uni- and bidirectional busses that transfer data, addresses, and control information.
For manually implemented RC applications, the protocols used here are generally developed ad hoc and heavily influenced by the specific hardware environment targeted (e.g., the data sheets of the actual SRAM chips on a PCB). In practice, they may even vary between different applications running on the same hardware (e.g., usage of burst modes, fixed access sizes, etc.).

This approach is not applicable to automatic design flows: these tools require pre-defined access mechanisms to which they strictly adhere for all designs. An example of such a well-defined protocol suitable as a target for automatic compilation is employed on GARP [6], a single-chip hybrid processor architecture. It includes standardized protocols for sending and retrieving data to/from the RC using specialized CPU instructions, supported by dedicated decoding logic in silicon. Memory requests are routed over a single address and four data busses that can supply up to four words per cycle for regular (streaming) accesses.

None of these capabilities is available when using off-the-shelf silicon to implement the RC. Instead, each user is faced with implementing the required access infrastructure anew.
4 MARC

Our goal was to learn from these past experiences and develop a single, scalable, and portable memory interface scheme for reconfigurable datapaths. MARC strives to be applicable to both single-chip and board-level systems, and to hide the intricacies of different memory systems from the datapath. Figure 4 shows an overview of this architecture.
[Figure 4. MARC architecture (labels legible: User Logic, Front-Ends, Back-Ends)]
Using MARC, the datapath accesses memory through abstract front-end interfaces. Currently, we support two front-ends specialized for different access patterns: caching ports provide for efficient handling of irregular accesses, while streaming ports offer non-unit-stride access to regular data structures (such as matrices or images) and perform address generation automatically. In both cases, data is pre-fetched/cached to reduce the impact of high latencies (especially for transfers using the I/O bus). Both ports use stall signals to indicate delays in the data transfer (e.g., due to a cache miss or stream queue refill). A byte-steering logic aligns 8- and 16-bit data on bits 7:0 and 15:0 of the data bus regardless of where the datum occurred in the 32-bit memory or bus words.
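To make the byte-steering rule concrete, here is a minimal Python sketch (our illustration, not MARC code) of aligning an 8- or 16-bit datum from a 32-bit bus word onto bits 7:0 or 15:0; the little-endian lane numbering is an assumption.

```python
def steer(word: int, byte_offset: int, width: int) -> int:
    """Align an 8/16-bit datum from a 32-bit bus word to bits 7:0 / 15:0.

    word        -- the 32-bit word as it appeared on the memory or bus side
    byte_offset -- byte lane (0..3) where the datum sits; little-endian
                   numbering is an assumption made for this sketch
    width       -- access width in bits: 8, 16, or 32
    """
    if width == 32:
        return word & 0xFFFFFFFF
    mask = (1 << width) - 1          # 0xFF or 0xFFFF
    return (word >> (8 * byte_offset)) & mask

# A 16-bit datum in the upper half and an 8-bit datum in lane 1
# both come out right-aligned, as the byte-steering logic guarantees:
assert steer(0xBEEF0000, 2, 16) == 0xBEEF
assert steer(0x0000AB00, 1, 8) == 0xAB
```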
The specifics of hardware memory chips or system bus protocols are implemented in various back-end interfaces. E.g., dedicated back-ends encapsulate the mechanisms for accessing SRAM or communicating over the PCI bus using the BIU.

The MARC core is located between front- and back-ends, where it acts as the main controller and data switchboard. It performs address decoding and arbitration between transfer initiators in the datapath and transfer receivers in the individual memories and busses. Logically, it can map an arbitrary number of front-ends to an arbitrary number of back-ends. In practice, though, the number of resources managed is of course limited by the finite FPGA capacity. Furthermore, the probability of conflicts between initiators increases when they share a smaller number of back-ends. However, the behavior visible to the datapath remains identical: the heterogeneous hardware resources handled by the back-ends are mapped into a homogeneous address space and accessed by a common protocol.
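As an illustration of that homogeneous address space, the sketch below decodes a 24-bit address into a back-end plus local offset. The concrete map (four ZBT SRAM banks, then a PCI window) is our assumption, loosely modelled on the prototype resources of Section 5.2; the paper does not fix a layout.

```python
ZBT_BANKS = 4
ZBT_BANK_WORDS = 128 * 1024      # 128K words per bank (Section 5.2)

def decode(addr: int) -> tuple[str, int]:
    """Map a 24-bit MARC-style word address to (back-end, local offset).

    Layout assumed here: banks zbt0..zbt3 first, everything above them
    routed to the PCI back-end (and thus to the CPU's DRAM).
    """
    assert 0 <= addr < (1 << 24)
    bank, offset = divmod(addr, ZBT_BANK_WORDS)
    if bank < ZBT_BANKS:
        return f"zbt{bank}", offset
    return "pci", addr - ZBT_BANKS * ZBT_BANK_WORDS

print(decode(0x000010))   # ('zbt0', 16)
print(decode(0x060000))   # ('zbt3', 0)
print(decode(0x0A0000))   # ('pci', 131072)
```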
4.1 Irregular Cached Access

Caching ports are set up to provide read data one cycle after an address has been applied, and accept one write datum/address per clock cycle. If this is not possible (e.g., a cache miss occurs), the stall signal is asserted for the affected port, stopping the initiator. When the stall signal is de-asserted, data that was "in-flight" due to a previous request will remain valid to allow the initiator to restart cleanly.

Table 2(a) describes the interface to a caching port. The architecture currently allows for 32-bit data ports, which is the size most relevant when compiling software into hybrid solutions. Should the need arise for wider words, the architecture can easily be extended. Arbitrary memory ranges (e.g., memory-mapped I/O registers) can be marked as non-cacheable. Accesses to these regions will then bypass the cache. Furthermore, since all of the cache machinery is implemented in configurable logic, cache port characteristics such as the number of cache lines and the cache line length can be adapted to the needs of the application. As discussed in [19], this can result in a 3% to 10% speed-up over using a single cache configuration for all applications.
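The stall protocol and the configurable parameters above can be captured in a small behavioral model. The following Python sketch is our illustration (the real front-end is a Verilog design, Section 5.1): a fully associative, write-back port with random line replacement (Section 5.4), configurable line count and length, non-cacheable ranges, and stall cycle counts taken from the miss penalty formula as reconstructed in Section 5.4.

```python
import random

class Memory:
    """Stand-in back-end: a word store plus an initial access latency in
    cycles (e.g., 4 for ZBT SRAM, 46 for DRAM over PCI; cf. Table 1)."""
    def __init__(self, latency):
        self.latency, self.words = latency, {}
    def read(self, addr):
        return self.words.get(addr, 0)
    def write(self, addr, value):
        self.words[addr] = value
    def line(self, tag, w):
        return [self.words.get(tag * w + i, 0) for i in range(w)]
    def store_line(self, tag, line):
        for i, v in enumerate(line):
            self.words[tag * len(line) + i] = v

class CachingPort:
    """Behavioral sketch of a MARC caching port: fully associative,
    write-back, random replacement (Section 5.4). Not the actual HDL."""
    def __init__(self, lines, words_per_line, backend, noncacheable=()):
        self.n, self.w, self.be = lines, words_per_line, backend
        self.data, self.dirty = {}, set()       # tag -> line (list of words)
        self.noncacheable = noncacheable        # [(lo, hi)) address ranges

    def access(self, addr, value=None):
        """One read (value=None) or write; returns (stall_cycles, datum).
        Zero stall cycles model the hit case (data valid the next cycle)."""
        if any(lo <= addr < hi for lo, hi in self.noncacheable):
            if value is None:                   # bypass the cache entirely
                return self.be.latency, self.be.read(addr)
            self.be.write(addr, value)
            return self.be.latency, value
        tag, off = divmod(addr, self.w)
        stall = 0 if tag in self.data else self._refill(tag)
        line = self.data[tag]
        if value is not None:
            line[off] = value
            self.dirty.add(tag)                 # write-back policy
        return stall, line[off]

    def _refill(self, tag):
        stall = 0
        if len(self.data) >= self.n:            # evict a random line
            victim = random.choice(list(self.data))
            if victim in self.dirty:            # dirty: write back first
                self.be.store_line(victim, self.data.pop(victim))
                self.dirty.discard(victim)
                stall += self.be.latency + self.w   # assumed write-back cost
            else:
                del self.data[victim]
        self.data[tag] = self.be.line(tag, self.w)
        # c_m = 7 + c_be + w + 4 (Section 5.4, as reconstructed there)
        return stall + 7 + self.be.latency + self.w + 4

port = CachingPort(lines=32, words_per_line=32, backend=Memory(latency=46))
print(port.access(0x100))   # miss: (89, 0) -- 7 + 46 + 32 + 4 stall cycles
print(port.access(0x101))   # same line: (0, 0), data after one cycle
```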
4.2 Regular Streamed Access

Streaming ports transfer a number of data words from or to a memory area without the need for the datapath to generate addresses. After setting the parameters of the transfer (by switching the port into a "load parameter" mode), the port presents/accepts one data item per clock cycle until it has to refill or flush its internal FIFOs. In that case, the stall signal stops the initiator using the port. When the FIFO becomes ready again, the stall signal is de-asserted and the transfer continues. The datapath can pause the transfer by asserting the hold signal. As before, our current implementation calls for a 32-bit wide data bus. Table 2(b) lists the parameter registers and the port interface.
Table 2. Port interfaces

(a) Caching port interface:
    Address (in) - address of the access.
    Data (in/out) - data item.
    Width (in) - 8-, 16-, or 32-bit access.
    Stall (out) - asserted on a cache miss.
    OE (in) - output enable.
    WE (in) - write enable.
    Flush (in) - flush the cache.

(b) Streaming port interface:
    Start (parameter) - start address.
    Stride (parameter) - address increment.
    Width (parameter) - 8-, 16-, or 32-bit access.
    Block (parameter) - FIFO size.
    Length (parameter) - length of the transfer.
    Direction (parameter) - read or write.
    Data (in/out) - data item.
    Stall (out) - wait for FIFO flush/refill.
    Hold (in) - pause the data flow.
    EOS (out) - end of stream reached.
    Load (in) - accept new parameters.

[Signal names are partly illegible in this copy and have been reconstructed from their descriptions.]
The 'Block' register plays a crucial role in matching the stream characteristics to the specific application requirements. E.g., if the application has to process a very large string (such as in DNA matching), it makes sense for the datapath to request a large block size. The longer start-up delay (for the buffer to be filled) is amortized over the long run-time of the algorithm. For smaller amounts of data (e.g., part of a matrix row for blocking matrix multiplication), it makes much more sense to pre-fetch only the precise amount of data required. [20] suggests compile-time algorithms to estimate the FIFO depth to use.

The cache is bypassed by the streaming ports in order to avoid cache pollution. However, since logic guaranteeing consistency between caches and streams for arbitrary accesses would be very expensive to implement (especially when non-unit strided streams are used), our current design requires that accesses through the caching ports do not overlap streamed memory ranges. This restriction must be enforced by the compiler. If that is not possible, streaming ports cannot be used. As an alternative, a cache with longer cache lines (e.g., 128 bytes) might be used to limit the performance loss due to memory latency.
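Behaviorally, a streaming port is a parameterized address generator feeding a FIFO. The Python sketch below (our illustration) mirrors the Table 2(b) parameters: start address, stride, length, and the 'Block' (FIFO) size that sets how much is pre-fetched per refill; a None yield stands in for stalled cycles while the FIFO refills.

```python
from collections import deque

def read_stream(mem, start, stride, length, block):
    """Sketch of a read streaming port: addresses are start + i*stride,
    pre-fetched 'block' items at a time; the datapath never computes them.
    Yields one item per 'cycle'; None marks a stall while the FIFO refills."""
    fifo = deque()
    for i in range(length):
        if not fifo:                  # FIFO drained: stall and refill
            yield None                # refill latency collapsed to one cycle
            for j in range(i, min(i + block, length)):
                fifo.append(mem[start + j * stride])
        yield fifo.popleft()

# Every 4th word of a row, pre-fetched 8 items at a time (non-unit stride):
mem = {a: a for a in range(64)}
got = [x for x in read_stream(mem, start=0, stride=4, length=16, block=8)
       if x is not None]
print(got)    # [0, 4, 8, ..., 60]; two stall cycles occurred along the way
```

A large 'Block' amortizes the refill stalls over long runs (the DNA-matching case above), while a small one avoids over-fetching for short bursts.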
4.3 Multi-threading

Note that all stall and flow-control signals are generated/accepted on a per-port basis. This allows true multi-threaded hardware execution, where different threads of control are assigned to different ports. MARC can accommodate more logical ports on the front-ends than actually exist physically on the back-ends. For certain applications, this can be exploited to allow the compiler to schedule a larger number of memory accesses in parallel. The MARC core will resolve any inter-port conflicts (if they occur at all, see Section 5) at run-time. The current implementation uses a round-robin policy; later versions might extend this to a priority-based scheme.
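A minimal sketch of the per-cycle round-robin grant implied here (our illustration; the actual arbiter is part of the MARC core HDL): requests from logical ports contend for one shared back-end, and the scan starts just after the last winner so every requester is eventually served.

```python
def round_robin(requests, n_ports, last=-1):
    """Grant one requesting port per cycle, scanning from last+1 upward."""
    for i in range(1, n_ports + 1):
        port = (last + i) % n_ports
        if port in requests:
            return port
    return None                      # no one is requesting this cycle

# Three logical ports contending for a single physical back-end:
last = -1
for cycle in range(4):
    last = round_robin({0, 1, 2}, n_ports=3, last=last)
    print(cycle, "-> grant port", last)   # grants rotate 0, 1, 2, 0
```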
4.4 Flexibility

A separate back-end is used for each memory or bus resource. For example, in a system with four ZBT SRAM memories, four instances of the ZBT SRAM back-end would be instantiated. The back-ends present the same interface as a caching port (Table 2(a)) to the MARC core. They encapsulate the state and access mechanisms to manage each of the physical resources. E.g., a PCI back-end might know how to access a PCI BIU and initiate a data transfer.

In this manner, additional back-ends handling more memory banks can be attached easily. Analogously, MARC can be adapted to different FPGA technologies. For example, on the Xilinx Virtex [24] series of FPGAs, the L1 cache of a caching port might be implemented using the on-chip memories. On the older XC4000XL series, which has only a limited amount of on-chip storage, the cache could be implemented in a direct-mapped fashion that has the cache lines placed in external memory.
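In software terms, Section 4.4 says every back-end presents one fixed interface (the caching-port protocol of Table 2(a)) and hides its resource-specific machinery behind it. A Python analogy of that structure (ours; the real back-ends are HDL modules):

```python
from abc import ABC, abstractmethod

class BackEnd(ABC):
    """The one interface every back-end presents to the MARC core."""
    @abstractmethod
    def read(self, addr: int) -> int: ...
    @abstractmethod
    def write(self, addr: int, value: int) -> None: ...

class ZbtSramBackEnd(BackEnd):
    """One instance per ZBT SRAM bank; would encapsulate the SRAM timing."""
    def __init__(self):
        self.mem = {}
    def read(self, addr):
        return self.mem.get(addr, 0)
    def write(self, addr, value):
        self.mem[addr] = value

class PciBackEnd(BackEnd):
    """Would drive the PCI BIU (the PLX 9080 on the prototype, Section 5.2)
    to reach the CPU's DRAM; a plain dict stands in for all of that here."""
    def __init__(self):
        self.dram = {}
    def read(self, addr):
        return self.dram.get(addr, 0)
    def write(self, addr, value):
        self.dram[addr] = value

# The core instantiates one back-end per resource and treats them uniformly:
backends = [ZbtSramBackEnd() for _ in range(4)] + [PciBackEnd()]
backends[0].write(0, 42)
print(backends[0].read(0), backends[4].read(0))   # 42 0
```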
5 Implementation Issues

Our first MARC implementation targets the prototyping environment described in [12]. The architecture details relevant for this paper are shown in Figure 5.
[Figure 5. Architecture of prototype hardware (labels legible: SUN microSPARC-IIep CPU, BIU (PLX 9080), I/O bus, four 128Kx36b ZBT SRAM banks)]

A SUN microSPARC-IIep RISC [21] [22] is employed as the conventional CPU. The RC is composed of a Xilinx Virtex XCV1000 FPGA [24].
5.1 Status

At this point, we have implemented and intensively simulated a parameterized Verilog model of the MARC core and back-ends. On the front-end side, caching ports are already operational, while streaming ports are still under development. The design is currently only partially floorplanned, thus the given performance numbers are preliminary.
5.2 Physical Resources

The RC has four 128Kx36b banks of ZBT SRAM as dedicated memory and can access the main memory (64MB DRAM managed by the CPU) over the PCI bus. To this end, a PLX 9080 PCI Accelerator [23] is used as the BIU that translates the RC bus (i960-like) into PCI and back. The MARC core will thus need PLX and ZBT SRAM back-ends. All of their instances can operate in parallel.
5.3 MARC Core

The implementation follows the architecture described in Section 4: an arbitrary number of caching and streaming ports can be managed. In this implementation (internally relying on the Virtex memories in dual-ported mode), two cache ports are guaranteed to operate without conflicts, and three to four cache ports may operate without conflicts. If five or more cache ports are in use, a conflict will occur and be resolved by the arbitration unit (Section 4.3). This version of the core currently supports a 24-bit address space into which the physical resources are mapped.
5.4 Configurable Cache

We currently provide three cache configurations: 128 lines of 8 words, 64 lines of 16 words, or 32 lines of 32 words. Non-cacheable areas may be configured at compile time (the required comparators are then synthesized directly into specialized logic). The datapath can explicitly request a cache flush at any time (e.g., after the end of a computation).
The cache is implemented as a fully associative L1 cache. It uses 4KB of Virtex BlockSelectRAM to hold the cache lines on-chip and implements write-back and random line replacement. The BlockSelectRAM is used in dual-port mode to allow up to two accesses to occur in parallel. Conflicts are handled by the MARC core arbitration logic. The CAMs needed for the associative lookup are composed from SRL16E shift registers as suggested in [25]. This allows a single-cycle read (compare and match detection) and 16 clock cycles to write a new tag into the CAM. Since this operation occurs simultaneously with the loading of the cache lines from memory, the CAM latency is completely hidden in the longer memory latencies. As this 16-cycle delay would also occur (and could not be hidden) when reading the tag, e.g., when writing back a dirty cache line, the tags are additionally stored in a conventional memory composed from RAM32X1S elements that allows single-cycle reading.
For each caching port, a dedicated CAM bank is used to allow lookups to occur in parallel. Each cache line requires five 4-bit CAMs, thus the per-port CAM area requirements range from 160 to 640 4-LUTs. In addition to the CAMs, each cache port includes a small state machine controlling the cache operation for different scenarios (e.g., read hit, read miss, write hit, write miss, flush, etc.). The miss penalty c_m in cycles for a clean cache line is given by

    c_m = 7 + c_be + w + 4,

where c_be is the data transfer latency for the back-end used (Table 1) and w is the number of 32-bit words per cache line; 7 and 4 are the MARC core operation startup
and shutdown times in cycles, respectively. For a dirty cache line, additional cycles are required first to write the modified data back.

For comparison with [19], note that according to [26], the performance of 4KB of fully associative cache is equivalent to that of 8KB of direct-mapped cache.
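Plugging numbers into the formula (as reconstructed above) makes the trade-off visible; the 46-cycle PCI latency is the figure quoted in Section 7, the 4-cycle ZBT latency comes from Table 1, and the dirty-line write-back cost is our assumption.

```python
def miss_penalty(c_be: int, w: int, dirty: bool = False) -> int:
    """c_m = 7 + c_be + w + 4 for a clean line (Section 5.4, reconstructed);
    a dirty line is assumed to pay an extra c_be + w to write back first."""
    write_back = (c_be + w) if dirty else 0
    return write_back + 7 + c_be + w + 4

print(miss_penalty(46, 32))             # 89: 32-word line refilled over PCI
print(miss_penalty(4, 32))              # 47: same line from local ZBT SRAM
print(miss_penalty(4, 8))               # 23: 128x8 configuration, ZBT
print(miss_penalty(46, 32, dirty=True)) # 167: dirty line via PCI (assumed)
```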
5.5 Performance and Area

The performance and area requirements of the MARC core, the technology modules, and two cache ports are shown in Table 3.
[Table 3. Performance and area requirements; the table is garbled in this copy. Legible fragments: columns for 4-LUTs and RAM32X1Ss; rows for the 32x32 (976, 12, 8, 31 MHz), 64x16 (26, 8, 30 MHz), and 128x8 (56, 8, 29 MHz) configurations; XCV1000 available: 24576 4-LUTs (each RAM32X1S uses 2 4-LUTs).]
For the three configuration choices, the area requirements vary between 10% and 30% of the chip logic capacity. Since all configurations use 4KB of on-chip memory for cache line storage, 8 of the 32 512x8b BlockSelectRAMs are required.
5.6 Scalability and Extensibility

As shown in Table 3, scaling an on-chip L1 cache above 128x8 is probably not a wise choice given the growing area requirements. However, as already mentioned in Section 4.4, part of the ZBT SRAM could be used to hold the cache lines of a direct-mapped L2 cache. In this scenario, only the tags would be held inside the FPGA.

A sample cache organization for this approach could partition the 24-bit address into an 8-bit tag, a 12-bit index, and a 4-bit block offset. The required tag RAM would be organized as 4096x8b, and would thus require 8 BlockSelectRAMs (in addition to those used for the on-chip L1 cache). The cache would have 4096 lines of 16 words each, and would thus require 64K words of the 128K words available in one of the ZBT SRAM chips. The cache hit/miss determination could occur in two cycles, with the data arriving after another two cycles. A miss going to PCI memory would take 66 cycles to refill a clean cache line and deliver the data to the initiator.
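The proposed partition can be sanity-checked in a few lines of Python; placing the tag in the high bits is the conventional choice and an assumption here, since the paper only gives the field widths.

```python
def split_l2(addr: int) -> tuple[int, int, int]:
    """Split a 24-bit address into (tag, index, offset) = (8, 12, 4) bits."""
    assert 0 <= addr < (1 << 24)
    offset = addr & 0xF             # word within the 16-word cache line
    index = (addr >> 4) & 0xFFF     # selects one of the 4096 lines
    tag = (addr >> 16) & 0xFF       # stored on-chip in the 4096x8b tag RAM
    return tag, index, offset

print([hex(f) for f in split_l2(0xAB123C)])   # ['0xab', '0x123', '0xc']
assert 1 << 12 == 4096                         # 4096 cache lines
assert 4096 * 16 == 64 * 1024                  # 64K of the 128K SRAM words
assert (4096 * 8) // (512 * 8) == 8            # tag RAM: 8 BlockSelectRAMs
```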
6 Related Work

[19] gives experimental results for the cache-dependent performance behavior of 6 of the 8 benchmarks in the SPECint95 suite. Due to the temporal configurability we suggest for the MARC caches (adapting cache parameters to applications), they expect a performance improvement of between 3% and 10% over static caches. [27] describes the use of the
configurable logic in a hybrid processor to either add a victim cache or pre-fetch buffers to an existing dedicated direct-mapped L1 cache on a per-application basis. They quote improvements in L1 miss rate of up to 19%. [28] discusses the addition of 1MB of L1 cache memory managed by a dedicated cache controller to a configurable processor. Another approach proposed in [29] re-maps non-contiguous strided physical addresses into contiguous cache line entries. A similar functionality is provided in MARC by the pre-fetching of data into the FIFOs of streaming ports. [30] suggests a scheme which adds configurable logic to a cache instead of a cache to configurable logic. They hope to avoid the memory bottleneck by putting processing (the configurable logic) very close to the data. The farthest step with regard to data pre-fetching is suggested in [31], which describes a memory system that is cognizant of high-level memory access patterns. E.g., once a certain member in a structure is accessed, a set of associated members is fetched automatically. However, the automatic generation of the required logic from conventional software is not discussed. On the subject of streaming accesses, [20] is an exhaustive source. The 'Block' register of our streaming ports was motivated by their discussion of overly long startup times for large amounts of pre-fetched data.
7 Summary

We presented an overview of hybrid processor architectures and some memory access needs often occurring in applications. For the most commonly used RC components (off-the-shelf FPGAs), we identified a lack of support for even the most basic of these requirements.

As a solution, we propose a general-purpose Memory Architecture for Reconfigurable Computers that allows device-independent access both for regular (streamed) and irregular (cached) patterns. We discussed one real-world implementation of MARC on an emulated hybrid processor combining a SPARC CPU with a Virtex FPGA. The sample implementation fully supports multi-threaded access to multiple memory banks as well as the creation of "virtual" memory ports attached to on-chip cache memory. The configurable caches in the current version can reduce the latency from 46 cycles (for access to DRAM via PCI) down to a single cycle on a cache hit.
References

1. Amerson, R., "Teramac - Configurable Custom Computing", Proc. IEEE Symp. on FCCMs, Napa 1995
2. Bertin, P., Roncin, D., Vuillemin, J., "Programmable Active Memories: A Performance Assessment", Proc. Symp. Research on Integrated Systems, Cambridge (Mass.) 1993
3. Box, B., "Field-Programmable Gate Array-based Reconfigurable Preprocessor", Proc. IEEE Symp. on FCCMs, Napa 1994
4. Buell, D., Arnold, J., Kleinfelder, W., "Splash 2 - FPGAs in Custom Computing Machines", IEEE Press, 1996
5. Rupp, C., Landguth, M., Garverick, et al., "The NAPA Adaptive Processing Architecture", Proc. IEEE Symp. on FCCMs, Napa 1998
6. Hauser, J., Wawrzynek, J., "Garp: A MIPS Processor with a Reconfigurable Coprocessor", Proc. IEEE Symp. on FCCMs, Napa 1997
7. Wittig, R., Chow, P., "OneChip: An FPGA Processor with Reconfigurable Logic", Proc. IEEE Symp. on FCCMs, Napa 1996
8. Jacob, J., Chow, P., "Memory Interfacing and Instruction Specification for Reconfigurable Processors", Proc. ACM Intl. Symp. on FPGAs, Monterey 1999
9. Triscend, "Triscend E5 CSoC Family", http://www.triscend.com/products/IndexE5.html, 2000
10. Altera, "Excalibur Embedded Processor Solutions", http://www.altera.com/html/products/excalibur.html, 2000
11. TSI-Telsys, "ACEcard User's Manual", hardware documentation, 1998
12. Koch, A., "A Comprehensive Platform for Hardware-Software Co-Design", Proc. Intl. Workshop on Rapid System Prototyping, Paris 2000
13. Annapolis Microsystems, http://www.annapmicro.com, 2000
14. Virtual Computer Corp., http://www.vcc.com, 2000
15. Callahan, T., Hauser, J.R., Wawrzynek, J., "The Garp Architecture and C Compiler", IEEE Computer, April 2000
16. Li, Y., Callahan, T., Darnell, E., Harr, R., et al., "Hardware-Software Co-Design of Embedded Reconfigurable Architectures", Proc. 37th Design Automation Conference, 2000
17. Gokhale, M.B., Stone, J.M., "NAPA C: Compiling for a Hybrid R[...]

[The remainder of entry 17 and entries 18 through 31 are cut off in this copy of the document.]
