Reiner W. Hartenstein
Herbert Grünbacher (Eds.)
`
Field-Programmable
`Logic and Applications
`
`The Roadmap to Reconfigurable Computing
`
`10th International Conference, FPL 2000
`Villach, Austria, August 27-30, 2000
Proceedings
`
`Springer
`
`Petitioners Amazon
`Ex. 1003, p. 1
`
`

`

`Series Editors
`
`Gerhard Goos, Karlsruhe University, Germany
`Juris Hartmanis, Cornell University, NY, USA
`Jan van Leeuwen, Utrecht University, The Netherlands
`
`Volume Editors
`
`Reiner W. Hartenstein
University of Kaiserslautern, Computer Science Department
P.O. Box 30 49, 67653 Kaiserslautern, Germany
E-mail: hartenst@rhrk.uni-kl.de
Herbert Grünbacher
`Carinthia Tech Institute
`Richard-Wagner-Str. 19, 9500 Villach, Austria
`E-mail: hg@cti.ac.at
`
`Cataloging-in-Publication Data applied for
`
Die Deutsche Bibliothek - CIP-Einheitsaufnahme
`
Field programmable logic and applications : the roadmap to
reconfigurable computing ; 10th international conference ; proceedings
/ FPL 2000, Villach, Austria, August 27 - 30, 2000. Reiner W.
Hartenstein ; Herbert Grünbacher (ed.). - Berlin ; Heidelberg ; New
York ; Barcelona ; Hong Kong ; London ; Milan ; Paris ; Singapore ;
Tokyo : Springer, 2000
(Lecture notes in computer science ; Vol. 1896)
`
`
`CR Subject Classification (1998): B.6-7, J.6
`
`ISSN 0302-9743
`ISBN 3-540-67899-9 Springer-Verlag Berlin Heidelberg New York
`
`This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
`concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting,
`reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication
`or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965,
`in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are
`liable for prosecution under the German Copyright Law.
`
`Springer-Verlag Berlin Heidelberg New York
`a member of BertelsmannSpringer Science+Business Media GmbH
`© Springer-Verlag Berlin Heidelberg 2000
`Printed in Germany
`
`Typesetting: Camera-ready by author, data conversion by Steingraber Satztechnik GmbH, Heidelberg
`Printed on acid-free paper
`SPIN 10722573
`06/3142
5 4 3 2 1 0
`
`Memory Access Schemes for Configurable Processors
`
Holger Lange and Andreas Koch
`
Tech. Univ. Braunschweig (E.I.S.), Gaußstr. 11, D-38106 Braunschweig, Germany
`lange,koch@eis.cs.tu-bs.de
`
Abstract. This work discusses the Memory Architecture for Reconfigurable Computers
(MARC), a scalable, device-independent memory interface that supports
both irregular (via configurable caches) and regular accesses (via pre-fetching
stream buffers). By hiding specifics behind a consistent abstract interface, it is
suitable as a target environment for automatic hardware compilation.
`
`1 Introduction
`
Reconfigurable compute elements can achieve considerable performance gains over
standard CPUs [1] [2] [3] [4]. In practice, these configurable elements are often combined
with a conventional processor, which provides the control and I/O services that are
implemented more efficiently in fixed logic. Recent single-chip architectures following
this approach include NAPA [5], GARP [6], OneChip [7], OneChip98 [8], Triscend
E5 [9], and Altera Excalibur [10]. Board-level configurable processors either include a
dedicated CPU [11] [12] or rely on the host CPU for support [13] [14].
Design tools targeting one of these hybrid systems, such as GarpCC [15], Nimble
[16], or Napa-C [17], have to deal with software and hardware issues separately as well as
with the creation of interfaces between these parts. On the software side, basic services
such as I/O and memory management are often provided by an operating system of
some kind. This can range from a full-scale general-purpose OS through more specialized
real-time embedded OSes down to tiny kernels offering only a limited set of functions
tailored to a very specific class of applications. Usually, a suitable OS is either readily
available on the target platform, or can be ported to it with relative ease.
This level of support is unfortunately not present on the hardware side of the hybrid
computer. Since no standard environment is available for even the most primitive tasks
such as efficient memory access or communication with the host, the research and
development of new design tools often requires considerable effort to provide a reliable
environment into which the newly created hardware can be embedded. This environment
is sometimes called a wrapper around the custom datapath. It goes beyond a simple
assignment of chip pads to memory pins. Instead, a structure of on-chip busses and access
protocols to various resources (e.g., memory, the conventional processor, etc.) must be
defined and implemented.
In this paper, we present our work on the Memory Architecture for Reconfigurable
Computers (MARC). It can act as a "hardware target" for a variety of hybrid compilers,
analogously to a software target for conventional compilers. Before describing its
specifics, we will justify our design decisions by giving a brief overview of current
configurable architectures and showing the custom hardware architectures created by some
hybrid compilers.
`
R.W. Hartenstein and H. Grünbacher (Eds.): FPL 2000, LNCS 1896, pp. 615-625, 2000.
`© Springer-Verlag Berlin Heidelberg 2000
`
`
`2 Hybrid Processors
`
Static and reconfigurable compute elements may be combined in many ways. The degree
of integration can range from individual reconfigurable function units (e.g., OneChip
[7]) to an entirely separate coprocessor attached to a peripheral bus (e.g., SPLASH [4],
SPARXIL [18]).
`
Figure 1. Single-chip hybrid processor (diagram not reproduced; legible label: I/O Bus)
`
Figure 1 sketches the architecture of a single-chip hybrid processor that combines
fixed (CPU) and reconfigurable (RC) compute units behind a common cache (D$).
Such an architecture was proposed, e.g., for GARP [6] and NAPA [5]. It offers very
high bandwidth, low latency, and cache coherency between the CPU and the RC when
accessing the shared DRAM.
`
Figure 2. Hybrid processor emulated by multi-chip system (diagram not reproduced;
legible labels: Fixed Processor Chip, Reconfigurable Array, DRAM, SRAM)
`
The board-level systems more common today use an architecture similar to Figure 2.
Here, a conventional CPU is attached by a bus interface unit (BIU) to a system-wide
I/O bus (e.g., SBus [18] or PCI [11] [12]). Another BIU connects the RC to the I/O
bus. Due to the high communication latencies over the I/O bus, the RC is often attached
directly to a limited amount of dedicated memory (commonly a few KB to a few MB of
SRAM). In some systems, the RC has access to the main DRAM by using the I/O bus as
a master to contact the CPU memory controller (MEMC). With this capability, the CPU
and the RC share a logically homogeneous address space: pointers in the CPU
main memory can be freely exchanged between software on the CPU and hardware in
the RC.
Table 1 shows the latencies measured on [12] for the RC accessing data residing in local
Zero-Bus Turnaround (ZBT) SRAM (latched in the FPGA I/O blocks) and in main DRAM
(via the PCI bus). In both cases, one word per cycle is transferred after the initial latency.

Table 1. Data access latencies (single word transfers)

Operation         Cycles
ZBT SRAM read     4
ZBT SRAM write    4
PCI read          46-47
PCI write         10

It is obvious from these numbers that any useful wrapper must be able to deal efficiently
with access to high-latency memories. This problem, colloquially known as the
"memory bottleneck", has already been tackled for conventional processors using memory
hierarchies (multiple cache levels) combined with techniques such as pre-fetching and
streaming to improve their performance. As we will see later, these approaches are also
applicable to reconfigurable systems.
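To make the impact of the Table 1 latencies concrete, the following minimal Python sketch estimates the cycles needed for a burst. It assumes that a burst of n words costs the initial latency plus n cycles, which matches the way the latency enters the miss-penalty formula in Section 5.4; the names `LATENCY` and `burst_cycles` are illustrative and not part of MARC.

```python
# Initial latencies in cycles, taken from Table 1.
LATENCY = {
    "zbt_read": 4,
    "zbt_write": 4,
    "pci_read": 46,   # Table 1 reports 46-47; the lower bound is used here
    "pci_write": 10,
}


def burst_cycles(operation: str, words: int) -> int:
    """Cycles for a burst, assuming initial latency plus one cycle per word."""
    if words < 1:
        raise ValueError("a transfer moves at least one word")
    return LATENCY[operation] + words


def words_per_cycle(operation: str, words: int) -> float:
    """Average throughput of the burst in words per cycle."""
    return words / burst_cycles(operation, words)


if __name__ == "__main__":
    for n in (1, 16, 256):
        print(n, burst_cycles("pci_read", n), round(words_per_cycle("pci_read", n), 3))
```

Under this assumption, short PCI reads are dominated by the initial latency, while long bursts approach one word per cycle. This is the motivation for the pre-fetching and streaming techniques mentioned above.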
`
`3 Reconfigurable Datapaths
`
`The structure of the compute elements implemented on the RC is defined either manually
`or by automatic tools. A common architecture [6] [16] [18] is shown in Figure 3.
`
Figure 3. Common RC datapath architecture (diagram not reproduced; legible label:
Hardware Operators)
`
`The datapath is formed by a number of hardware operators, often created using
`module generators, which are placed in a regular fashion. While the linear placement
`shown in the figure is often used in practice, more complicated layouts are of course
`possible. All hardware operators are connected to a central datapath controller that
`orchestrates their execution.
`In this paper, we focus on the interface blocks attaching the datapath to the rest of the
`system. They allow communication with the CPU and main memory using the system
`bus or access to the local RC RAM. The interface blocks themselves are accessed by the
`datapath using a structure of uni- and bidirectional busses that transfer data, addresses,
`and control information.
`
`
For manually implemented RC applications, the protocols used here are generally
developed ad hoc and heavily influenced by the specific hardware environment targeted
(e.g., the data sheets of the actual SRAM chips on a PCB). In practice, they may even
vary between different applications running on the same hardware (e.g., usage of burst
modes, fixed access sizes, etc.).
This approach is not applicable to automatic design flows: these tools require predefined
access mechanisms to which they strictly adhere for all designs. An example of
such a well-defined protocol suitable as a target for automatic compilation is employed on
GARP [6], a single-chip hybrid processor architecture. It includes standardized protocols
for sending and retrieving data to/from the RC using specialized CPU instructions and
supported by dedicated decoding logic in silicon. Memory requests are routed over a
single address bus and four data busses that can supply up to four words per cycle for
regular (streaming) accesses.
`None of these capabilities is available when using off-the-shelf silicon to implement
`the RC. Instead, each user is faced with implementing the required access infrastructure
`anew.
`
`4 MARC
`
Our goal was to learn from these past experiences and develop a single, scalable, and
portable memory interface scheme for reconfigurable datapaths. MARC strives to be
applicable for both single-chip and board-level systems, and to hide the intricacies of
different memory systems from the datapath. Figure 4 shows an overview of this
architecture.
`
Figure 4. MARC architecture (diagram not reproduced; legible labels: Back-Ends,
Front-Ends, RC Array, BIU, I/O Bus)
`
`
Using MARC, the datapath accesses memory through abstract front-end interfaces.
Currently, we support two front-ends specialized for different access patterns: Caching
ports provide for efficient handling of irregular accesses. Streaming ports offer
non-unit stride access to regular data structures (such as matrices or images) and perform
address generation automatically. In both cases, data is pre-fetched/cached to reduce the
impact of high latencies (especially for transfers using the I/O bus). Both ports use stall
signals to indicate delays in the data transfer (e.g., due to a cache miss or stream queue
refill). A byte-steering logic aligns 8- and 16-bit data on bits 7:0 and 15:0 of the data
bus regardless of where the datum occurred in the 32-bit memory or bus words.
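The byte-steering behavior described above can be illustrated with a small Python sketch. The byte order is an assumption: the text does not state it, so the function defaults to big-endian lane numbering and exposes it as a parameter. The function name `steer` and its arguments are hypothetical.

```python
def steer(word: int, byte_addr: int, width: int, big_endian: bool = True) -> int:
    """Right-align an 8-, 16- or 32-bit datum taken from a 32-bit word.

    byte_addr selects the byte lane within the word (only its low two bits
    matter). The result has the datum on bits width-1:0, all other bits zero.
    """
    if width not in (8, 16, 32):
        raise ValueError("width must be 8, 16 or 32")
    nbytes = width // 8
    lane = byte_addr % 4
    if lane % nbytes != 0:
        raise ValueError("unaligned access")
    if big_endian:
        # Lane 0 holds the most significant byte of the word (assumption).
        shift = 8 * (4 - nbytes - lane)
    else:
        shift = 8 * lane
    return (word >> shift) & ((1 << width) - 1)


if __name__ == "__main__":
    w = 0x11223344
    print(hex(steer(w, 0, 8)), hex(steer(w, 2, 16)), hex(steer(w, 0, 32)))
```

With this logic in place, the datapath always finds narrow data on the low bits of the bus and does not need its own shifters.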
`The specifics of hardware memory chips or system bus protocols are implemented
`in various back-end interfaces. E.g., dedicated back-ends encapsulate the mechanisms
`for accessing SRAM or communicating over the PCI bus using the BIU.
`The MARC core is located between front- and back-ends, where it acts as the main
`controller and data switchboard. It performs address decoding and arbitration between
`transfer initiators in the datapath and transfer receivers in the individual memories and
`busses. Logically, it can map an arbitrary number of front-ends to an arbitrary number
`of back-ends. In practice, though, the number of resources managed is of course limited
`by the finite FPGA capacity. Furthermore, the probability of conflicts between initiators
`increases when they share a smaller number of back-ends. However, the behavior visible
`to the datapath remains identical: The heterogeneous hardware resources handled by the
`back-ends are mapped into a homogeneous address space and accessed by a common
`protocol.
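As an illustration of how heterogeneous back-ends might be mapped into one homogeneous address space, the sketch below decodes an address into a back-end name and a local offset. The address map is purely hypothetical: the base addresses, the region names, and the PCI window are assumptions chosen to fit four 128K-word SRAM banks into a 24-bit space, and are not the layout used by MARC.

```python
from typing import NamedTuple, Optional, Tuple


class Region(NamedTuple):
    name: str   # back-end instance serving this range (hypothetical names)
    base: int   # first address of the range
    size: int   # number of addressable units in the range


# Hypothetical map: four SRAM banks plus a window onto main memory over PCI.
ADDRESS_MAP = [
    Region("zbt0", 0x000000, 0x020000),
    Region("zbt1", 0x020000, 0x020000),
    Region("zbt2", 0x040000, 0x020000),
    Region("zbt3", 0x060000, 0x020000),
    Region("pci", 0x800000, 0x800000),
]


def decode(addr: int) -> Optional[Tuple[str, int]]:
    """Return (back-end name, offset within back-end), or None if unmapped."""
    for region in ADDRESS_MAP:
        if region.base <= addr < region.base + region.size:
            return region.name, addr - region.base
    return None


if __name__ == "__main__":
    print(decode(0x020010))
    print(decode(0x900000))
    print(decode(0x100000))
```

The datapath only ever sees the flat address. Which memory or bus serves a request is decided by the decoder, so back-ends can be added or exchanged without changing the datapath.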
`
4.1 Irregular Cached Access
`
Caching ports are set up to provide read data one cycle after an address has been applied,
and accept one write datum/address per clock cycle. If this is not possible (e.g., a cache
miss occurs), the stall signal is asserted for the affected port, stopping the initiator.
When the stall signal is de-asserted, data that was "in-flight" due to a previous request
will remain valid to allow the initiator to restart cleanly.
Table 2(a) describes the interface to a caching port. The architecture currently allows
for 32-bit data ports, which is the size most relevant when compiling software into hybrid
solutions. Should the need arise for wider words, the architecture can easily be extended.
Arbitrary memory ranges (e.g., memory-mapped I/O registers) can be marked as
non-cacheable. Accesses to these regions will then bypass the cache. Furthermore, since all
of the cache machinery is implemented in configurable logic, cache port characteristics
such as the number of cache lines and the cache line length can be adapted to the needs of
the application. As discussed in [19], this can result in a 3% to 10% speed-up over using a
single cache configuration for all applications.
`
`4.2 Regular Streamed Access
`
Streaming ports transfer a number of data words from or to a memory area without the
need for the datapath to generate addresses. After setting the parameters of the transfer
(by switching the port into a "load parameter" mode), the port presents/accepts one data
item per clock cycle until it has to refill or flush its internal FIFOs. In that case, the stall
signal stops the initiator using the port. When the FIFO becomes ready again, the stall
signal is de-asserted and the transfer continues. The datapath can pause the transfer by
asserting the hold signal. As before, our current implementation calls for a 32-bit wide
data bus. Table 2(b) lists the parameter registers and the port interface.

Table 2. Port interfaces

(a) Caching port interface

Signal   Kind     Function
Addr     in       Address.
Data     in/out   Data item.
Width    in       8, 16, 32-bit access.
Stall    out      Asserted on cache miss.
OE       in       Output enable.
WE       in       Write enable.
Flush    in       Flush cache.

(b) Streaming port interface

Sig/Reg  Kind     Function
Addr     reg      Start address.
Stride   reg      Stride (increment).
Width    reg      8, 16, 32-bit access.
Block    reg      FIFO size.
Count    reg      Length, transfer.
R/W      reg      Read or write.
Data     i/o      Data item.
Stall    out      Wait, FIFO flush/refill.
Hold     in       Pause data flow.
EOS      out      End of stream reached.
Load     in       Accept new parameters.
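The automatic address generation of a streaming port can be sketched in a few lines of Python. The sketch uses the register names Addr, Stride, Count, and Block from the streaming port interface. It assumes that the stride is expressed in the same units as the address and that the FIFO is refilled in chunks of Block items; neither detail is spelled out in the text, and the function names are illustrative.

```python
from typing import Iterator, List


def stream_addresses(addr: int, stride: int, count: int) -> Iterator[int]:
    """Yield the address of each item of a stream of `count` items."""
    for i in range(count):
        yield addr + i * stride


def stream_blocks(addr: int, stride: int, count: int, block: int) -> List[List[int]]:
    """Group the stream addresses into FIFO fills of at most `block` items."""
    if block < 1:
        raise ValueError("block size must be positive")
    addrs = list(stream_addresses(addr, stride, count))
    return [addrs[i:i + block] for i in range(0, count, block)]


if __name__ == "__main__":
    # One column of a row-major 4x4 matrix of 32-bit words (byte addresses).
    print([hex(a) for a in stream_addresses(0x100, 16, 4)])
    print(stream_blocks(0, 4, 5, 2))
```

Each inner list corresponds to one refill, during which the port would assert its stall signal. The datapath itself never computes an address.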
The 'Block' register plays a crucial role in matching the stream characteristics to
the specific application requirements. E.g., if the application has to process a very large
string (such as in DNA matching), it makes sense for the datapath to request a large
block size. The longer start-up delay (for the buffer to be filled) is amortized over the
long run-time of the algorithm. For smaller amounts of data (e.g., part of a matrix row for
blocking matrix multiplication), it makes much more sense to pre-fetch only the precise
amount of data required. [20] suggests compile-time algorithms to estimate the FIFO
depth to use.
The cache is bypassed by the streaming ports in order to avoid cache pollution.
However, since logic guaranteeing the consistency between caches and streams for
arbitrary accesses would be very expensive to implement (especially when non-unit strided
streams are used), our current design requires that accesses through the caching ports do
not overlap streamed memory ranges. This restriction must be enforced by the compiler.
If that is not possible, streaming ports cannot be used. As an alternative, a cache with
longer cache lines (e.g., 128 bytes) might be used to limit the performance loss due to
memory latency.
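The non-overlap restriction could be checked by a compiler along the following lines. This is a minimal, conservative sketch and not the authors' method: it treats a strided stream as the whole address interval it spans, so it may report an overlap even when the cached accesses fall into the gaps between stream items. All names, and the use of byte addresses, are assumptions.

```python
from typing import Tuple

Interval = Tuple[int, int]  # half-open byte range [low, high)


def stream_footprint(addr: int, stride: int, count: int, width_bytes: int) -> Interval:
    """Smallest interval covering every item touched by a strided stream."""
    if count < 1:
        raise ValueError("a stream has at least one item")
    last = addr + (count - 1) * stride
    low = min(addr, last)              # handles negative strides as well
    high = max(addr, last) + width_bytes
    return low, high


def may_overlap(stream: Interval, cached: Interval) -> bool:
    """True if the two half-open intervals share at least one byte."""
    return stream[0] < cached[1] and cached[0] < stream[1]


if __name__ == "__main__":
    s = stream_footprint(0x100, 16, 4, 4)
    print([hex(x) for x in s])
    print(may_overlap(s, (0x134, 0x200)), may_overlap(s, (0x130, 0x200)))
```

If `may_overlap` returns True for any cached range, a compiler following this sketch would fall back to a caching port for that data, as the text suggests.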
`
`4.3 Multi-threading
`
Note that all stall or flow-control signals are generated/accepted on a per-port basis.
This allows true multi-threaded hardware execution where different threads of control
are assigned to different ports. MARC can accommodate more logical ports on the
front-ends than actually exist physically on the back-ends. For certain applications, this can
be exploited to allow the compiler to schedule a larger number of memory accesses
in parallel. The MARC core will resolve any inter-port conflicts (if they occur at all,
see Section 5) at run-time. The current implementation uses a round-robin policy; later
versions might extend this to a priority-based scheme.
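A round-robin policy of the kind mentioned above can be modeled in software as follows. This is a generic textbook arbiter written as a behavioral Python sketch, not the MARC Verilog implementation; the class and method names are hypothetical.

```python
from typing import Optional, Sequence


class RoundRobinArbiter:
    """Grant one requesting port per call, rotating priority after each grant."""

    def __init__(self, num_ports: int) -> None:
        if num_ports < 1:
            raise ValueError("need at least one port")
        self.num_ports = num_ports
        self._last = num_ports - 1  # so that port 0 is examined first

    def grant(self, requests: Sequence[bool]) -> Optional[int]:
        """Return the granted port index, or None if no port is requesting."""
        for offset in range(1, self.num_ports + 1):
            port = (self._last + offset) % self.num_ports
            if requests[port]:
                self._last = port
                return port
        return None


if __name__ == "__main__":
    arb = RoundRobinArbiter(3)
    print([arb.grant([True, True, False]) for _ in range(3)])
```

Ports that are not granted would see their stall signal asserted until their turn comes. Because the search starts just after the most recently served port, no requesting port is starved.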
`
`4.4 Flexibility
`
A separate back-end is used for each memory or bus resource. For example, in a system
with four ZBT SRAM memories, four instances of the ZBT SRAM back-end would be
instantiated. The back-ends present the same interface as a caching port (Table 2(a)) to
the MARC core. They encapsulate the state and access mechanisms to manage each of
the physical resources. E.g., a PCI back-end might know how to access a PCI BIU and
initiate a data transfer.
In this manner, additional back-ends handling more memory banks can be attached
easily. Analogously, MARC can be adapted to different FPGA technologies. For example,
on the Xilinx Virtex [24] series of FPGAs, the L1 cache of a caching port might
be implemented using the on-chip memories. On the older XC4000XL series, which
has only a limited amount of on-chip storage, the cache could be implemented in a
direct-mapped fashion that has the cache lines placed in external memory.
`
`5 Implementation Issues
`
Our first MARC implementation targets the prototyping environment described in
[12]. The architecture details relevant for this paper are shown in Figure 5.
`
Figure 5. Architecture of prototype hardware (diagram not reproduced; legible labels:
four 128Kx36b memory banks)
`
A SUN microSPARC-IIep RISC [21] [22] is employed as conventional CPU. The
RC is composed of a Xilinx Virtex XCV1000 FPGA [24].
`
`5.1 Status
`
`At this point, we have implemented and intensively simulated a parameterized Verilog
`model of the MARC Core and back-ends. On the front-end side, caching ports are already
`operational, while streaming ports are still under development. The design is currently
`only partially floorplanned, thus the given performance numbers are preliminary.
`
`
`5.2 Physical Resources
`
`The RC has four 128Kx36b banks of ZBT SRAM as dedicated memory and can access
`the main memory (64MB DRAM managed by the CPU) over the PCI bus. To this end,
`a PLX 9080 PCI Accelerator [23] is used as BIU that translates the RC bus (i960-like)
`into PCI and back. The MARC core will thus need PLX and ZBT SRAM back-ends.
`All of their instances can operate in parallel.
`
`5.3 MARC Core
`
`The implementation follows the architecture described in Section 4 : An arbitrary number
`of caching and streaming ports can be managed. In this implementation (internally
`relying on the Virtex memories in dual-ported mode), two cache ports are guaranteed to
`operate without conflicts, and three to four cache ports may operate without conflicts. If
`five or more cache ports are in use, a conflict will occur and be resolved by the Arbitration
`unit ( Section 4.3 ). This version of the core currently supports a 24-bit address space
`into which the physical resources are mapped.
`
`5.4 Configurable Cache
`
`We currently provide three cache configurations: 128 lines of 8 words, 64 lines of 16
`words, or 32 lines of 32 words. Non-cacheable areas may be configured at compile
`time (the required comparators are then synthesized directly into specialized logic).
`The datapath can explicitly request a cache flush at any time (e.g., after the end of a
`computation).
The cache is implemented as a fully associative L1 cache. It uses 4KB of Virtex
BlockSelectRAM to hold the cache lines on-chip and implements write-back and random line
replacement. The BlockSelectRAM is used in dual-port mode to allow up to two accesses
to occur in parallel. Conflicts are handled by the MARC Core Arbitration logic. The
CAMs needed for the associative lookup are composed from SRL16E shift registers as
suggested in [25]. This allows a single-cycle read (compare and match detection) and 16
clock cycles to write a new tag into the CAM. Since this operation occurs simultaneously
with the loading of the cache lines from memory, the CAM latency is completely hidden
in the longer memory latencies. As this 16-cycle delay would also occur (and could not
be hidden) when reading the tag, e.g., when writing back a dirty cache line, the tags are
additionally stored in a conventional memory composed from RAM32X1S elements that
allow single-cycle reading.
For each caching port, a dedicated CAM bank is used to allow lookups to occur in
parallel. Each cache line requires five 4-bit CAMs, thus the per-port CAM area requirements
range from 160 to 640 4-LUTs. In addition to the CAMs, each cache port includes a
small state machine controlling the cache operation for different scenarios (e.g., read
hit, read miss, write hit, write miss, flush, etc.). The miss penalty $C_m$ in cycles for a clean
cache line is given by

$$C_m = 7 + l_e + w + 4,$$

where $l_e$ is the data transfer latency for the back-end used (Table 1) and $w$ is the
number of 32-bit words per cache line. 7 and 4 are the MARC Core operation startup
and shutdown times in cycles, respectively. For a dirty cache line, an additional $w$ cycles
are required to write the modified data back.
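The miss-penalty formula can be evaluated directly. The following Python sketch plugs in the startup and shutdown constants from the text and takes the back-end latency from Table 1 as a parameter; the function and constant names are illustrative.

```python
CORE_STARTUP = 7    # MARC Core operation startup time in cycles (from the text)
CORE_SHUTDOWN = 4   # MARC Core operation shutdown time in cycles (from the text)


def miss_penalty(backend_latency: int, words_per_line: int, dirty: bool = False) -> int:
    """Cycles lost on a cache miss: startup + latency + line transfer + shutdown.

    For a dirty line, another `words_per_line` cycles are added for the
    write-back, as stated in the text.
    """
    cycles = CORE_STARTUP + backend_latency + words_per_line + CORE_SHUTDOWN
    if dirty:
        cycles += words_per_line
    return cycles


if __name__ == "__main__":
    # PCI read latency of 46 cycles (lower bound in Table 1), 8-word lines.
    print(miss_penalty(46, 8), miss_penalty(46, 8, dirty=True))
    # ZBT SRAM read latency of 4 cycles, 32-word lines.
    print(miss_penalty(4, 32))
```

For the 128x8 configuration refilled over PCI, this gives 65 cycles for a clean line and 73 cycles for a dirty one, assuming the 46-cycle lower bound for a PCI read.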
`For comparison with [19], note that according to [26], the performance of 4KB of
`fully-associative cache is equivalent to that of 8KB of direct-mapped cache.
`
`5.5 Performance and Area
`
`The performance and area requirements of MARC Core, the technology modules and
`two cache ports are shown in Table 3 .
`
Table 3. Performance and area requirements

Configuration    4LUTs   FFs     RAM32X1Ss             BlockRAMs   Clock
32x32            2844    976     12                    8           31 MHz
64x16            4182    1038    26                    8           30 MHz
128x8            7132    1531    56                    8           29 MHz
XCV1000 avail.   24576   24576   (each uses 2 4LUTs)   32          -
`
`For the three configuration choices, the area requirements vary between 10%-30%
`of the chip logic capacity. Since all configurations use 4KB of on-chip memory for cache
`line storage, 8 of the 32 512x8b BlockSelectRAMs are required.
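The area figures can be cross-checked with a little arithmetic. The sketch below recomputes the share of 4-LUTs from Table 3, the per-port CAM size from the five 4-bit CAMs per cache line quoted in Section 5.4, and the cache capacity of each configuration. It assumes that the 4LUTs column is the complete LUT count for a configuration; the helper names are illustrative.

```python
AVAILABLE_LUTS = 24576  # 4-LUTs available on the XCV1000 (Table 3)

# 4-LUT usage per cache configuration "lines x words" (Table 3).
LUTS_USED = {"32x32": 2844, "64x16": 4182, "128x8": 7132}


def lut_share(config: str) -> float:
    """Fraction of the device's 4-LUTs used by a configuration."""
    return LUTS_USED[config] / AVAILABLE_LUTS


def cam_luts(lines: int, cams_per_line: int = 5) -> int:
    """Per-port CAM size in 4-LUTs, at one LUT per 4-bit CAM."""
    return lines * cams_per_line


def cache_bytes(lines: int, words_per_line: int) -> int:
    """Capacity of the cache line storage in bytes (32-bit words)."""
    return lines * words_per_line * 4


if __name__ == "__main__":
    for name in LUTS_USED:
        lines, words = (int(x) for x in name.split("x"))
        print(name, f"{100 * lut_share(name):.1f}%", cam_luts(lines), cache_bytes(lines, words))
```

All three configurations hold 4096 bytes of line storage, and the LUT share grows from roughly 12% to 29%, consistent with the 10%-30% range stated above.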
`
`5.6 Scalability and Extensibility
`
As shown in Table 3, scaling an on-chip L1 cache above 128x8 is probably not a wise choice
given the growing area requirements. However, as already mentioned in Section 4.4,
part of the ZBT SRAM could be used to hold the cache lines of a direct-mapped L2
cache. In this scenario, only the tags would be held inside of the FPGA.
A sample cache organization for this approach could partition the 24-bit address
into an 8-bit tag, a 12-bit index, and a 4-bit block offset. The required tag RAM would be
organized as 4096x8b, and would thus require 8 BlockSelectRAMs (in addition to those
used for the on-chip L1 cache). The cache would have 4096 lines of 16 words each, and
would thus require 64K words of the 128K words available in one of the ZBT SRAM
chips. The cache hit/miss determination could occur in two cycles, with the data arriving
after another two cycles. A miss going to PCI memory would take 66 cycles to refill a
clean cache line and deliver the data to the initiator.
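The address partition for the proposed direct-mapped L2 cache can be written down as a short Python sketch. It assumes word addresses, so that the 4-bit offset selects one of the 16 words in a line; the function name `split_address` is hypothetical.

```python
TAG_BITS, INDEX_BITS, OFFSET_BITS = 8, 12, 4   # partition of the 24-bit address


def split_address(addr: int) -> tuple:
    """Split a 24-bit address into (tag, index, offset) for the L2 lookup."""
    if not 0 <= addr < 1 << (TAG_BITS + INDEX_BITS + OFFSET_BITS):
        raise ValueError("address outside the 24-bit space")
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


if __name__ == "__main__":
    lines = 1 << INDEX_BITS           # number of cache lines
    words = lines << OFFSET_BITS      # words of SRAM needed for line storage
    print(lines, words)
    print([hex(f) for f in split_address(0xABCDEF)])
```

The index selects one of 4096 tag RAM entries and the corresponding line in the ZBT SRAM. A hit is declared when the stored 8-bit tag equals the tag field of the address.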
`
`6 Related Work
`
[19] gives experimental results for the cache-dependent performance behavior of 6 of the
8 benchmarks in the SPECint95 suite. With the kind of temporal configurability we suggest
for the MARC caches (adapting cache parameters to applications), they expect a
performance improvement of 3% to 10% over static caches. [27] describes the use of the
configurable logic in a hybrid processor to either add a victim cache or pre-fetch buffers
to an existing dedicated direct-mapped L1 cache on a per-application basis. They quote
improvements in L1 miss rate of up to 19%. [28] discusses the addition of 1MB of L1
cache memory managed by a dedicated cache controller to a configurable processor.
Another approach proposed in [29] re-maps non-contiguous strided physical addresses
into contiguous cache line entries. A similar functionality is provided in MARC by the
pre-fetching of data into the FIFOs of streaming ports. [30] suggests a scheme which
adds configurable logic to a cache instead of a cache to configurable logic. They hope to
avoid the memory bottleneck by putting processing (the configurable logic) very close
to the data. The farthest step with regard to data pre-fetching is suggested in [31], which
describes a memory system that is cognizant of high-level memory access patterns.
E.g., once a certain member in a structure is accessed, a set of associated members is
fetched automatically. However, the automatic generation of the required logic from
conventional software is not discussed. On the subject of streaming accesses, [20] is an
exhaustive source. The 'Block' register of our streaming ports was motivated by their
discussion of overly long startup times for large amounts of pre-fetched data.
`
`7 Summary
`
We presented an overview of hybrid processor architectures and some memory access
needs often occurring in applications. For the most commonly used RC components
(off-the-shelf FPGAs), we identified a lack of support for even the most basic of these
requirements.
As a solution, we propose a general-purpose Memory Architecture for Reconfigurable
Computers that allows device-independent access both for regular (streamed)
and irregular (cached) patterns. We discussed one real-world implementation of MARC
on an emulated hybrid processor combining a SPARC CPU with a Virtex FPGA. The
sample implementation fully supports multi-threaded access to multiple memory banks
as well as the creation of "virtual" memory ports attached to on-chip cache memory.
The configurable caches in the current version can reduce the latency from 46 cycles
(for access to DRAM via PCI) down to a single cycle on a cache hit.
`
`References
`
1. Amerson, R., "Teramac - Configurable Custom Computing", Proc. IEEE Symp. on FCCMs,
Napa 1995
2. Bertin, P., Roncin, D., Vuillemin, J., "Programmable Active Memories: A Performance
Assessment", Proc. Symp. Research on Integrated Systems, Cambridge (Mass.) 1993
3. Box, B., "Field-Programmable Gate Array-based Reconfigurable Preprocessor", Proc. IEEE
Symp. on FCCMs, Napa 1994
4. Buell, D., Arnold, J., Kleinfelder, W., "Splash 2 - FPGAs in Custom Computing Machines",
IEEE Press, 1996
5. Rupp, C., Landguth, M., Garverick, et al., "The NAPA Adaptive Processing Architecture",
Proc. IEEE Symp. on FCCMs, Napa 1998
6. Hauser, J., Wawrzynek, J., "Garp: A MIPS Processor with a Reconfigurable Coprocessor",
Proc. IEEE Symp. on FCCMs, Napa 1997
7. Wittig, R., Chow, P., "OneChip: An FPGA Processor with Reconfigurable Logic", Proc. IEEE
Symp. on FCCMs, Napa 1996
8. Jacob, J., Chow, P., "Memory Interfacing and Instruction Specification for Reconfigurable
Processors", Proc. ACM Intl. Symp. on FPGAs, Monterey 1999
9. Triscend, "Triscend E5 CSoC Family", http://www.triscend.com/products/lndexE5.html,
2000
10. Altera, "Excalibur Embedded Processor Solutions",
http://www.altera.com/html/products/excalibur.html, 2000
11. TSI-Telsys, "ACE2card User's Manual", hardware documentation, 1998
12. Koch, A., "A Comprehensive Platform for Hardware-Software Co-Design", Proc. Intl.
Workshop on Rapid-Systems Prototyping, Paris 2000
13. Annapolis Microsystems, http://www.annapmicro.com, 2000
14. Virtual Computer Corp., http://www.vcc.com, 2000
15. Callahan, T., Hauser, J.R., Wawrzynek, J., "The Garp Architecture and C Compiler", IEEE
Computer, April 2000
16. Li, Y., Callahan, T., Darnell, E., Harr, R., et al., "Hardware-Software Co-Design of Embedded
Reconfigurable Architectures", Proc. 37th Design Automation Conference, 2000
17. Gokhale, M.B., Stone, J.M., "NAPA C: Compiling for a Hybrid RISC/FPGA Machine", Proc.
IEEE Symp. on FCCMs, 1998
18. Koch, A., Golze, U., "Practical Experiences with the SPARXIL Co-Processor", Proc.
Asilomar Conference on Signals, Systems, and Computers, 11/1997
19. Fung, J.M.L.F., Pan, J., "Configurable Cache", CMU EE742 course project,
http://www.ece.cmu.edu/ ee742/proj-s98/fung, 1998
20. McKee, S.A., "Maximizing Bandwidth for Streamed Computations", dissertation, U. of
Virginia, School of Engineering and Applied Science, 1995
21. Sun Microelectronics, "microSPARC-IIep User's Manual", http://www.sun.com/sparc, 1997
22. Weaver, D.L., Germond, T., "The SPARC Architecture Manual, Version 8", Prentice-Hall,
1992
23. PLX Technology, "PCI 9080 Data Book", http://www.plxtech.com, 1998
24. Xilinx, Inc., "Virtex 2.5V Field-Programmable Gate Arrays", http://www.xilinx.com, 1999
25. Xilinx, Inc., "Designing Flexible, Fast CAMs with Virtex Family FPGAs", Xilinx Application
Note 203, 1999
26. Hennessy, J., Patterson, D., "Computer Architecture: A Quantitative Approach",
Morgan Kaufmann, 1990
27. Zhong, P., Martonosi, M., "Using Reconfigurable Hardware to Customize Memory
Hierarchies", Proc. SPIE, vol. 2914, 1996
28. Kimura, S., Yukishita, M., Itou, Y., et al., "A Hardware/Software Codesign Method for a
General-Purpose Reconfigurable Co-Processor", Proc. 5th CODES/CASHE, 1997
29. Carter, J., Hsieh, W., Stoller, L., et al., "Impulse: Building a Smarter Memory Controller",
Proc. 5th Intl. Symp. on High Perf. Comp. Arch. (HPCA), 1999
30. Nakkar, M., Harding, J., Schwartz, D., et al., "Dynamically programmable cache", Proc.
SPIE, vol. 3526, 1998
31. Zhang, X., Dasdan, A., Schulz, M., et al., "Architectural Adaptation for Application-Specific
Locality Optimizations", Proc. Intl. Conf. on Comp. Design (ICCD), 1997