INVITED PAPER

Reconfigurable Computing Architectures

This paper provides an overview of the broad body-of-knowledge developed in the field of reconfigurable computing.

By Russell Tessier, Senior Member IEEE, Kenneth Pocek, Life Member IEEE, and André DeHon, Member IEEE

ABSTRACT | Reconfigurable architectures can bring unique capabilities to computational tasks. They offer the performance and energy efficiency of hardware with the flexibility of software. In some domains, they are the only way to achieve the required real-time performance without fabricating custom integrated circuits. Their functionality can be upgraded and repaired during their operational lifecycle and specialized to the particular instance of a task. We survey the field of reconfigurable computing, providing a guide to the body-of-knowledge accumulated in architecture, compute models, tools, run-time reconfiguration, and applications.

KEYWORDS | Field programmable gate arrays; reconfigurable architectures; reconfigurable computing; reconfigurable logic

Manuscript received August 18, 2014; revised November 13, 2014; accepted December 18, 2014. Date of current version April 14, 2015.
R. Tessier is with the ECE Department, University of Massachusetts, Amherst, MA 01003 USA (e-mail: tessier@umass.edu).
K. Pocek, retired, was with Intel USA.
A. DeHon is with the ESE Department, University of Pennsylvania, Philadelphia, PA 19104 USA.
Digital Object Identifier: 10.1109/JPROC.2014.2386883

I. INTRODUCTION

Field-programmable gate arrays (FPGAs) were introduced in the mid-1980s (e.g., [1]) as a larger capacity platform for glue logic than their programmable array logic (PAL) ancestors. By the early 1990s, they had grown in capacity and were being used for logic emulation (e.g., Quickturn [2]) and prototyping, and the notion of customizing a computer to a particular task using the emerging capacity of these FPGAs became attractive. These custom computers could satisfy the processing requirements for many important and enabling real-time tasks (video and signal processing, vision, control, instrumentation, networking) that were too high for microprocessors. On simulation and optimization tasks, spatial computation and specialization made it possible to achieve supercomputer-level performance at workstation-level costs. Furthermore, by reprogramming the FPGA, a specialized computer could be reconfigured to different tasks and new algorithms. In 2015, the use of FPGAs for computation and communication is firmly established. FPGA implementations of applications are now prevalent in signal processing, cryptography, arithmetic, scientific computing, and networking. Commercial acceptance is growing, as illustrated by numerous products that employ FPGAs for more than just glue logic.

Reconfigurable computing (RC), that is, performing computations with spatially programmable architectures such as FPGAs, inherited a wide body-of-knowledge from many disciplines, including custom hardware design, digital signal processing (DSP), general-purpose computing on sequential and multiple processors, and computer-aided design (CAD). As such, RC demanded that engineers integrate knowledge across these disciplines. It also opened up a unique design space and introduced its own challenges and opportunities. Over the past 25 years, a new community has emerged and begun to integrate and develop a body-of-knowledge for building, programming, and exploiting this new class of machines.

How do you organize programmable computations spatially? The design space of architecture and organization for the computation is larger when you can perform gate-level customization to the task, and when the machine can change its organization during the computation. When you get to choose the organization of your machine for a particular application or algorithm, or even a particular data set, how do you maximize performance, minimize area, and minimize energy?

How should you program these machines? How can they be programmed to exploit the opportunity to exquisitely optimize them to the problem? How can they be made accessible to domain and application experts? Should they be programmed more like software or more like hardware? It became clear that neither traditional hardware nor traditional software forms of design capture can exploit the full flexibility that these machines offer. What abstractions are useful for these machines? for programmers? for platform scaling? How do you manage the resources in these machines during runtime?

How do we map and optimize to these machines? One way to ease the programmer burden is to automate the lower-level details of customizing a computation to the capabilities of these reconfigurable computers. Again, off-the-shelf CAD tools and software compilers did not address the full opportunities. Many CAD challenges, at least initially, were fairly unique to RC, including heterogeneous parallelism, hybrid space-time computations, an organizational structure that can change during an algorithm, operator customization, memory structure tuning (e.g., cache, memory sizes, and placement), and custom data encoding to the task. In some cases, optimization problems that one would reasonably have performed manually for custom designs (e.g., bitwidth selection, cache sizing) became optimization problems to solve for every application program and dataset, thereby demanding a new level of automation.

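To make the bitwidth-selection problem concrete: once widths become a per-program, per-dataset decision, a tool can derive them mechanically from profiled value ranges. The following sketch is our own illustration (the signal names and ranges are hypothetical, not from the paper):

```python
import math

def bits_for_range(lo: int, hi: int) -> int:
    """Minimal two's-complement width representing every value in [lo, hi]."""
    if lo >= 0:
        return max(1, math.ceil(math.log2(hi + 1)))            # unsigned
    mag_neg = math.ceil(math.log2(-lo))                        # bits for |lo|
    mag_pos = math.ceil(math.log2(hi + 1)) if hi >= 0 else 0   # bits for hi
    return 1 + max(mag_neg, mag_pos)                           # plus sign bit

# Hypothetical ranges observed by profiling one application on one dataset.
signal_ranges = {"pixel": (0, 255), "coeff": (-128, 127), "acc": (0, 1 << 20)}

for name, (lo, hi) in signal_ranges.items():
    print(f"{name}: {bits_for_range(lo, hi)} bits")
# pixel: 8 bits, coeff: 8 bits, acc: 21 bits -- per-signal widths a tool
# flow could use to shrink each datapath instead of fixing a 32b word size.
```
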
What algorithms make sense for these machines, and how does one tailor algorithms to them? The cost structure is different, often enabling or demanding different algorithms to solve the problem. Approaches that were previously inconceivable (e.g., dedicating a set of gates or a processing element to each data item) become viable and often superior solutions to problems. How do we exploit the massive level of fine-grained parallelism? These demands encouraged the RC community to identify highly parallel algorithms with regular communications before Graphics Processing Units (GPUs) and multicore chips became available with similar demands.

In this paper, we provide a guide to this accumulated body-of-knowledge. We start by reviewing the exploration of architectures for RC machines (Section II). Section III reviews approaches to program these RCs. We then review developments in tools to automate and optimize design for RCs (Section IV) and models and tools for Run-Time Reconfiguration (RTR, Section V). Section VI highlights important application domains. We conclude with some final observations in Section VII.

`
`I I . A R C H I T E C T U R E A N D T E C HN O L O G Y
`
`The invention of FPGAs in the early 1980s [1] seeded the
`field that has become known as reconfigurable computing.
`FPGAs offered the lure of hardware performance. It was
`well known that dedicated hardware could offer orders of
`magnitude better performance than software solutions on
`a general-purpose computer and that machines custom-
`built for a particular purpose could be much faster than
`their counterparts. However, building custom hardware is
`
`Tessier et al.: Reconfigurable Computing Architectures
`
`expensive and time consuming. Custom VLSI was the
`domain of a select few; the numbers that could profitably
`play in that domain were already small in the 1990s and
`have shrunk considerably since then. Microprocessor
`performance scaled with Moore’s Law, often delivering
`performance improvements faster than a custom hardware
`design could be built. FPGAs provided a path to the
`promise of hardware customization without
`the huge
`development and manufacturing costs and lead times of
`custom VLSI. It was possible to extract more computa-
`tional
`throughput per unit silicon from FPGAs than
`processors [3], and it was possible to do so with less
`energy [4].
In the late 1980s, it was still possible to provide hardware differentiation by assembling packaged integrated circuits in different ways at the printed-circuit-board level. However, as chips grew in capacity and chip speeds increased, the cost of accessing functions off chip grew as well. There was an increasing benefit to integrating more functionality on chip. Opportunities for board-level differentiation decreased, increasing the demand for design customization and differentiation on chip. FPGAs provided a way to get differentiation without using custom VLSI fabrication.

The challenge, then, is how we should organize our computation and customization on the FPGA. How should the specialized FPGA be incorporated into a system? How can the computation take advantage of the FPGA's capabilities? As FPGAs grow in capacity and take on larger roles in computing systems than their initial glue-logic niche, how should FPGAs evolve to support computing, communication, and integration tasks?

A. Pioneers
Just a few years after the introduction of FPGAs, multiple pioneering efforts demonstrated the potential benefits of FPGA-based RC. SPLASH arranged 16 FPGAs into a linear systolic array and outperformed contemporary supercomputers (CM-2, Cray-2) on a DNA sequence matching problem with a system whose cost was comparable to that of workstations [5]. Programmable active memories (PAM) arranged 16 FPGAs into a two-dimensional (2-D) grid and demonstrated high performance on a collection of applications in signal and image processing, scientific computing, cryptography, neural networks, and high-bandwidth image acquisition [6]. PAM held the record for the fastest RSA encryption and decryption speeds across all platforms, even exceeding the performance of custom chips.

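To give a flavor of the systolic organization SPLASH exemplified, here is a minimal software model of a linear array of cells matching a streamed text against a fixed pattern. This is our own sketch, not SPLASH's actual design (which computed sequence edit distances):

```python
# Software model of a linear systolic array: cell i stores one pattern
# character; the text streams one character per "clock" through the array,
# and all cells compare in parallel.

def systolic_match(pattern: str, text: str):
    """Yield the start position of each occurrence of pattern in text."""
    k = len(pattern)
    cells = pattern[::-1]            # cell i holds pattern[k-1-i]
    pipe = [None] * k                # per-cell text register
    for t, ch in enumerate(text):
        pipe = [ch] + pipe[:-1]      # one clock: text shifts down the array
        if all(pipe[i] == cells[i] for i in range(k)):
            yield t - k + 1          # every cell matched this cycle
# In hardware, the final AND would itself be a pipelined chain of per-cell
# match bits rather than this global reduction.

print(list(systolic_match("GATC", "AGATCCGATC")))   # [1, 6]
```
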
B. Accelerators
A key idea from the beginning was that FPGAs could serve as generic, programmable hardware accelerators for general-purpose computers. Floating-point units were well known and successful at accelerating numerical applications. Many applications might benefit from their own custom units. The question is how they should be interfaced with the general-purpose CPU, and how the RC should be managed.

When floating-point co-processors were still separate chips, PRISM showed how an FPGA added as an external co-processor could accelerate bit-level operations that were inefficient on CPUs [7]. As chip capacities grew, research explored architectures that integrated an FPGA or reconfigurable array on the same die as the processor [8]–[10]. Key concerns included the architectural model visible for the array, how state was shared with the configurable array, how the different timing requirements of the processor and array should be accommodated, and how to maximize the bandwidth and minimize the latency of communication with the array. The reconfigurable logic could be integrated as a programmable functional unit for a RISC processor [8], [11], sharing state with the conventional register file and managing the reconfigurable array as a cache of programmable instructions [12].

It was soon apparent that the processor would become a bottleneck if it had to mediate the movement of all data to and from the reconfigurable array. GARP showed how to integrate the array as a co-processor and how the array could have direct access to the memory system [13]. The architecture for dynamically reconfigurable embedded systems (ADRES) also provided direct memory access in a co-processor model; ADRES combined a VLIW processor core with the reconfigurable array, sharing functional units between the VLIW core and the array [14]. Later work explored interfacing the reconfigurable logic with on-chip caches and virtual memory [15] and streaming operations that used scoreboard interlocks on blocks of memory [16].

Accelerators can also have an impact beyond raw application performance. Offloading tasks to a reconfigurable array can be effective at reducing the energy required for a computation [17], as is important for embedded systems. For a survey of both fixed and reconfigurable accelerators and estimates of their energy and performance efficiency, see [18]. To simplify design representation, mapping, and portability, the Queue Machine [19] showed how a processor and an array could run the same machine-level instructions.

These designs at least anticipated, and perhaps helped motivate, commercial processor-FPGA hybrids. Xilinx offered a PowerPC on the Virtex-II Pro and now integrates ARM cores on their Zynq devices. Altera includes ARM cores on Cyclone V and Arria V SoC FPGAs. Stretch provided an integrated processor and FPGA device [20]. Intel integrated an Atom processor and an Altera device in a multichip package for embedded computations and also integrates a Xeon with an FPGA for server applications.

Multiprocessors may also benefit from attached accelerators. The logic could be used for custom acceleration, similar to its use in single-node machines, or for improving communication and synchronization [21], [22]. Cray integrated FPGAs into their XD1, SRC offered a parallel supercomputer with FPGA accelerators, and Convey Computer now ships a supercomputer with an FPGA acceleration board.

C. Fabric
Using FPGAs as computing substrates creates different needs and opportunities for optimization than using FPGAs for glue logic or as general-purpose register-transfer-level (RTL) implementation engines. As a result, there is a significant body of work exploring how reconfigurable arrays might evolve beyond traditional FPGAs, which we highlight in this section. For the evolution of FPGAs for traditional usage, including the impact of technology, see the companion article [23]. For a theoretical and quantitative comparison of reconfigurable architectures, see [24].

1) Organization: Fine-grained reconfigurable architectures, such as FPGAs, give us freedom to organize our computation in just about any way, but what organizational patterns are actually beneficial on the FPGA? This freedom alone provides no guidance in how to use the FPGA. As a result, there has been considerable exploration of computing patterns to help conceptualize the computation. These ideas manifest as conceptual guides, as tools, as overlay architectures, and as directions for specializing the reconfigurable fabric itself. Overlay architectures are implemented on top of existing FPGAs and often support tuning to specialize the generic architecture to the specific needs of a particular application.

Cellular automata are a natural model for an array of identical computing cells. On the FPGA, the cellular processing element can be customized to the application task. The CAL array from Algotronix was designed specifically with this model in mind [25]. Others showed how to build cellular computations on top of FPGAs [26], including how to deal with cellular automata that are larger than the physical FPGA available [27]–[29].

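As a concrete (software) picture of the model, the sketch below steps a small 2-D cellular automaton; Conway's Life rule stands in for an application-customized cell, and the whole grid updates synchronously, which is exactly what an FPGA implementation parallelizes. When the automaton exceeds the physical array, the same update can be applied window by window, the essence of the virtualization in [27]–[29]. This is our own illustration, not code from the cited systems:

```python
# Minimal cellular-automaton model. On an FPGA, each cell would be a small
# logic block updating in parallel; here we sweep the grid sequentially.

def ca_step(grid):
    """One synchronous update of a 2-D toroidal cellular automaton (Life rule)."""
    h, w = len(grid), len(grid[0])
    nxt = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            n = sum(grid[(r + dr) % h][(c + dc) % w]
                    for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                    if (dr, dc) != (0, 0))
            nxt[r][c] = 1 if n == 3 or (grid[r][c] and n == 2) else 0
    return nxt

glider = [[0,1,0,0,0],[0,0,1,0,0],[1,1,1,0,0],[0,0,0,0,0],[0,0,0,0,0]]
print(ca_step(glider))
```
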
Dataflow models are useful in abstracting spatially communicating computational operators. Dataflow handshaking can simply be used to synchronize data- or implementation-dependent timing [30], but it can also be used to control operator sharing [31].

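A minimal software model of such handshaking (names and structure are our own invention): a producer fires only when the channel has space ("ready"), and a consumer fires only when data is present ("valid"), so timing variation on either side is absorbed automatically:

```python
from collections import deque

class Channel:
    """Toy valid/ready handshake channel with bounded buffering."""
    def __init__(self, depth=2):
        self.q, self.depth = deque(), depth
    def ready(self):        # producer may push this cycle
        return len(self.q) < self.depth
    def valid(self):        # consumer may pop this cycle
        return len(self.q) > 0
    def push(self, v): self.q.append(v)
    def pop(self):     return self.q.popleft()

ch, src, results = Channel(), iter(range(5)), []
while len(results) < 5:
    if ch.ready():                      # producer side fires when ready
        v = next(src, None)
        if v is not None:
            ch.push(v)
    if ch.valid():                      # consumer side fires when valid
        results.append(ch.pop() * 2)    # operator with its own timing
print(results)   # [0, 2, 4, 6, 8]
```
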
When you must sequentialize a larger graph onto limited resources, a very long instruction word (VLIW) organization provides a straightforward model for operator control. As an overlay architecture, the composition of operators can be selected to match the needs of the application [32], [33]. As we will see later (Section II-C4 and C5), when we time-switch the fabric itself, the result is essentially a VLIW organization.

Highly pipelined vector processors can be efficient for data-parallel tasks [34]. The vector lanes provide a model for trading area and time [35] and can be extended with custom accelerators [36].

Some have found multicore and manycore overlays to be useful models for mapping applications onto FPGAs [37]–[39].

2) Coarse-Grained Blocks: While the gate-level configurability in FPGAs allows great customization, it also means that common computing blocks used in DSP and numerical computations are inefficient compared to custom building blocks (e.g., multipliers, floating-point units). Would it be better to provide custom elements of these common computing blocks alongside or inside the FPGA fabric? When FPGAs were small, it was reasonable to consider adding dedicated chips for floating point at the board level [40]. The virtual embedded block model was introduced to explore hard logic integration in an FPGA array and illustrated the benefits of coarse-grain hardware to support floating-point computations [41]. Embedded multipliers and memories help narrow the performance, energy, and area gap between application-specific integrated circuits (ASICs) and FPGAs [42]. The modern extensions to Toronto's FPGA CAD flow support exploration of customized blocks [43]. Both Altera and Xilinx now embed hard logic DSP blocks that support wide-word addition and multiplication inside their fine-grained logic arrays.

3) Synchronous, Asynchronous: Processors and ASICs are typically designed around a fixed clock. This allows the integrated circuit (IC) designer to carefully optimize for performance and means the user of the processor does not have to worry about timing closure for their designs. As we consider high-throughput computational fabrics, fixed-frequency, clocked arrays can offer similar advantages to spatial computing arrays like FPGAs, as shown by the high-speed, hierarchical synchronous reconfigurable array (HSRA) [44] and SFRA [45]. A different way of avoiding issues with timing closure is to drop the clock completely for asynchronous handshaking, as shown in [46]–[48].

4) Multicontext: Since FPGAs started out as gate-array replacements, they contained a static configuration to control the gates and interconnect. The fact that this configuration could be changed opened up new opportunities to change the circuit during operation. However, FPGA reconfiguration was slow. Adding multiple configurations to the FPGA allows it to change rapidly among a set of behaviors [49], [50]. By time-multiplexing the expensive logic and interconnect, this approach effectively provides higher logic capacity per unit silicon, often with little performance impact. Tabula now offers a commercial multicontext FPGA [51]. The companion paper [24] identifies conditions under which a multicontext FPGA can consume less energy than a single-context FPGA.

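A toy model of the idea (our own illustration, not any vendor's actual mechanism): each logic site stores several configurations on chip and switches among them in a single cycle, rather than reloading a full bitstream:

```python
# Toy model of a multicontext fabric: each "LUT site" holds several
# configurations on chip and swaps among them in one cycle, instead of a
# full (slow) bitstream reload.

class MultiContextLUT:
    def __init__(self, contexts):
        self.contexts = contexts      # one 2-input truth table per context
        self.active = 0
    def switch(self, ctx):
        self.active = ctx             # single-cycle context switch
    def eval(self, a, b):
        return self.contexts[self.active][(a << 1) | b]

# Context 0 implements AND; context 1 implements XOR on the same site.
lut = MultiContextLUT([[0, 0, 0, 1], [0, 1, 1, 0]])
print(lut.eval(1, 1))   # 1 (AND)
lut.switch(1)
print(lut.eval(1, 1))   # 0 (XOR)
```
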
5) Coarse-Grained: Early FPGAs contained fine-grained logic largely because of their limited capacity, their gate-array roots, and their use as glue logic. As their application domain moved to address signal processing and computing problems, there was interest in more efficiently supporting larger, wide-word logic. Could we have flexible, spatially configurable, field-programmable machines with wide-word, coarse-grained computing blocks? Would a small amount of time-multiplexed sharing of these units be useful?

Designed to prototype DSP algorithms, PADDI [52] was one of the first coarse-grained reconfigurable architectures. It essentially used a VLIW architecture, with 16b words and 8 contexts per processing element. Other VLIW-style architectures have been used for low-power multimedia processing, including the 16b ALU-register-file Reconfigurable Multimedia Array Coprocessor (REMARC) [53], the 32b-word ADRES [54] with 32 local contexts and 8-entry register files, and the 8b-word, 16-entry-register-file dynamically reconfigurable processor (DRP) [55].

The reconfigurable datapath architecture (rDPA) used 32b-wide ALUs in a cellular arrangement with data-presence flow control [56]. The reconfigurable pipelined datapath (RaPiD) arranged coarse-grain elements (16b ALUs, multipliers, and RAM blocks) in a one-dimensional (1-D) datapath for systolic, pipelined computations, using a parallel, configurable bit-level control path for managing the dynamic behavior of the computation [57]. PipeRench also used the model of a directional pipeline, adding a technique to incrementally reconfigure resources and virtualize the depth of the pipeline, allowing developers to be abstracted from the number of physical pipe stages in the design and allowing designs to scale across a range of implementations [58].

To reduce the cost of reconfiguring the cells during a multimedia computation, MorphoSys used a single-instruction, multiple-data (SIMD) architecture to control a coarse-grained reconfigurable array with tiles composed of 16b ALUs, multipliers, and 4-element register files [59]. MorphoSys uses a 32-context instruction memory at the periphery of the array that is as wide as a side of the 2-D array. Instructions can be shared across rows or columns in the 2-D array, and the memory is wide enough to provide separate control for each row or column.

Graphics processors evolved from fixed-function rendering pipes to general-purpose computing engines in the form of general-purpose graphics processing units (GPGPUs) around the same time that FPGAs moved into computing and Coarse-Grained Reconfigurable Arrays (CGRAs) were developing. At a high level, GPGPUs share many characteristics with CGRAs, providing a high-density spatial array of processing units that can be programmed to perform regular computing tasks. GPGPUs were initially focused on SIMD, single-precision, floating-point computing tasks, and do not support the spatially local communications typical of FPGAs and CGRAs. The companion paper [24] shows why locality exploitation can be a key advantage for reconfigurable architectures. For compiled CUDA, the native language of many GPGPUs, the GPGPUs provide 1–4× higher throughput than FPGAs, but at a cost of over 4–16× the energy per operation [60]. The peak, single-precision, floating-point throughput of GPGPUs exceeds that of processors and FPGAs, but the delivered performance on applications can be lower than FPGAs' and the energy per operation can be higher [61]. FPGAs can outperform GPGPUs [62] for video processing, depending on the nature and complexity of the task.

A key differentiator among the architectures we have seen so far is whether the reconfigurable resources are controlled with a static configuration, like FPGAs, or with multicontext memories, like the VLIW-style CGRAs above. A second differentiator is how many resources can share the same instruction, as the SIMD designs exploit. This motivated a set of designs that made it possible to use configuration to select between these choices. MATRIX [63] explored configuring instruction distribution to efficiently compose a wide range of architectures (systolic, VLIW, SIMD/vector, MIMD) using a coarse-grain architecture based around an 8b ALU-multiplier cell that includes a local 256 × 8 RAM. The RAMs could be used either as embedded data memories (e.g., register files, FIFOs) or as instruction stores, and the programmable interconnect could carry both data streams and instruction streams, including streams that could control the interconnect. The CHESS reconfigurable arithmetic array used 4b ALUs and datapaths with local 32 × 4 memories, where the ALUs could be configured statically or sequenced dynamically from local memories, but the routing could not [64].

If the domain is large enough, it makes sense to create a custom reconfigurable array specialized for a specific domain. The Totem design showed how to optimize a reconfigurable array for a specific domain and illustrated the area savings in the DSP domain [65], how to perform function allocation for a domain-customized architecture [66], how to provide spare capacity for later changes [67], how to automate the layout of domain-optimized reconfigurable arrays, and how to compile applications to a specialized instance of the architecture.

Some of the ideas from the MATRIX CGRA led to the array of SpiceEngine vector processors in Broadcom's Calisto architecture [68]. The Samsung reconfigurable processor (SRP) is a derivative of the ADRES CGRA architecture [69], [70].

6) Configuration Compression and Management: While FPGAs are in-field reconfigurable, the slow reconfiguration times limit the use of reconfiguration during the execution of a task. One key reason configurations are slow is the large number of configuration bits required to specify the FPGA. Consequently, one way to accelerate reconfiguration is to compress the transfer of configuration data. By adapting ideas from covering in two-level logic minimization, the wildcard scheme on the Xilinx XC6200 could be used effectively to reduce bitstream size and load time [71]. When the sequence of configurations is not static, a cache of frequently used configurations can be used to accelerate FPGA reconfiguration [72].

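The sketch below models wildcard-style writes in the spirit of the XC6200 scheme, though the address format and data here are invented: one write whose address contains don't-care bits updates every matching configuration location, so rows of identical configuration data are covered much like cubes in two-level logic minimization:

```python
# Toy model of wildcard configuration compression: a single write with
# "don't care" address bits ('*') updates all matching addresses, so N
# identical configuration frames cost one transfer instead of N.

def expand(addr_pattern):
    """'0**' -> every concrete address matching the wildcard pattern."""
    if '*' not in addr_pattern:
        return [addr_pattern]
    return (expand(addr_pattern.replace('*', '0', 1)) +
            expand(addr_pattern.replace('*', '1', 1)))

config = {}
writes = [("0**", 0xA5), ("110", 0x3C)]   # 2 transfers configure 5 locations
for pattern, data in writes:
    for addr in expand(pattern):
        config[addr] = data

print(sorted(config.items()))
# {'000'..'011'} all get 0xA5 from one wildcard write; '110' gets 0x3C.
```
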
7) On-Chip Dynamic Networking: The load-time-configured interconnect in FPGAs is suitable for connecting together gates or performing systolic, pipelined, and cellular computations. However, as FPGAs started hosting more diverse tasks, including computation that used interconnect less continuously, it became valuable to explore disciplines for dynamically sharing limited on-chip FPGA communication bandwidth: to provide network-on-chip (NoC) designs on FPGAs. This prompted the design of packet-switched overlay networks for FPGAs [73], [74]. Often a time-multiplexed overlay network can be more efficient than a packet-switched network [75]. Because of the different cost structure between ASICs and FPGAs, overlay NoCs on FPGAs should be designed differently from ASIC NoCs [76], [77]. Ultimately, it may make sense to integrate packet-switched NoC support directly into the FPGA fabric [78].

D. Emulation
Since FPGAs are programmable gate arrays, an obvious use for them was to emulate custom logic designs. However, since FPGA capacity is lower than that of contemporary and future ASICs, there is typically a need to assemble large numbers of FPGAs to support an ASIC design [2]. The lower-density I/O between chips, compared to on-chip interconnect, meant that direct partitioning of gate-level netlists onto multiple FPGAs suffered bottlenecks at the chip I/O that left most FPGA logic capacity underutilized. To avoid this bottleneck, Virtual Wires virtualized the pins on the FPGA, exploiting the ability to time-multiplex the I/O pins many times within an ASIC emulation cycle [79]. The FPGA-based Veloce logic emulators sold by Mentor Graphics are based on these techniques. Today's FPGAs include hardware support for high-speed serial links to address the chip I/O bandwidth bottleneck.

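The pin-multiplexing idea can be captured in a few lines (our own model; the wire names and schedule format are hypothetical): logical inter-FPGA wires are statically scheduled onto time slots of a few physical pins within each emulation cycle:

```python
# Toy model of Virtual Wires-style pin multiplexing: many logical inter-FPGA
# wires share a few physical pins by time-slicing within one emulation cycle.

logical_wires = ["carry", "ack", "data0", "data1", "data2"]
PHYS_PINS = 2

# Static schedule: slot t on pin p carries one logical wire.
schedule = [logical_wires[p::PHYS_PINS] for p in range(PHYS_PINS)]

def emulation_cycle(values):
    """Serialize all logical wires over PHYS_PINS pins, then reassemble."""
    received = {}
    slots = max(len(s) for s in schedule)
    for t in range(slots):                    # micro-cycles inside one
        for p in range(PHYS_PINS):            # emulated ASIC clock
            if t < len(schedule[p]):
                wire = schedule[p][t]
                received[wire] = values[wire] # one bit crosses pin p at slot t
    return received

vals = {w: i % 2 for i, w in enumerate(logical_wires)}
assert emulation_cycle(vals) == vals
print("5 wires over 2 pins in", max(len(s) for s in schedule), "time slots")
```
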
As FPGAs grew in size, natural building blocks could fit onto a single FPGA, including complete processors [80]. The emulation focus turned to the system level. Boards of FPGAs were used to directly support algorithm development [81]. With an FPGA now large compared to a processor core, single FPGAs and boards of FPGAs have been used to support multicore architecture research [82], [83]. Nonetheless, directly implementing one instantiated processor core on the FPGA for each emulated processor core can demand very large multi-FPGA systems. Since these designs include many identical processor cores, an alternative is to time-multiplex multiple virtual processor cores over each physically instantiated core [84]. Furthermore, infrequently exercised core functionality can be run on a host processor, allowing the physically instantiated core to be smaller and easier to develop.

While it is possible to simulate the entire processor on an FPGA, defining and maintaining the model for the FPGA and keeping it running at high frequency can be difficult. FPGA-accelerated simulation technologies (FAST) observes that cycle-accurate simulation on a processor is dominated by the timing modeling, not the functional modeling. Consequently, one can move the timing-model logic to the FPGA and achieve significant simulation speedups without significantly changing the design and reuse flow for the functional simulator [85]. A-Port networks provide a generalization for explicitly modeling the simulated timing of a structure where the FPGA primitive clock cycles do not necessarily correspond directly to the modeled timing cycles in the emulated system [86].

E. Integrated Memory
As FPGA capacity grew, it became possible to integrate memory onto the FPGA die, and it became essential to do so to avoid memory bottlenecks. This created opportunities to exploit high, application-customized bandwidth and raised questions of how to organize, manage, and exploit the memory.

Irregular memory accesses often stall conventional processor and memory systems. Consequently, an FPGA accelerator that gathers and filters data can accelerate irregular accesses on a processor [87]. Similarly, conventional processors perform poorly on irregular graph operations. The GraphStep architecture showed how to organize active computations around embedded memories in the FPGA to accelerate graph processing [88].

When it is not possible to store the entire dataset on chip, it is often useful to stream, buffer, or cache the data in the embedded memories. Windows into spatial regions are important for cellular automata [27] and many image and signal processing and scientific computing problems [89]. CoRAM provided an abstraction for using the on-chip memories as windows on a larger, unified off-chip memory and a methodology for providing the control and communication infrastructure needed to support the abstraction [90].

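The windowing pattern at the heart of this abstraction is simple to model (our own sketch, not CoRAM's actual interface): a small on-chip buffer slides over a large off-chip array, with one refill per step:

```python
def stream_windows(offchip, window, step):
    """Slide a small on-chip buffer over a large off-chip array."""
    for base in range(0, len(offchip) - window + 1, step):
        yield offchip[base:base + window]    # one DMA-like refill per step

# E.g., a 3-point stencil computed entirely out of the on-chip window:
data = list(range(10))                       # stand-in for off-chip memory
for w in stream_windows(data, window=3, step=1):
    print(sum(w) / 3, end=" ")               # 1.0 2.0 ... 8.0
```
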
The embedded memories in FPGAs are simple RAMs, often with native support for dual-port access. It is straightforward to build scratchpad memories, simple register files, FIFOs, and direct-mapped caches [91] from these memories. Recent work has demonstrated efficient ways to implement multiported memories [92] and (near) associative memories [93].

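One known technique for multiported FPGA memories builds them from replicated dual-ported banks plus a small "live-value table" (LVT) that records which bank holds the freshest copy of each address; the sketch below is our own simplified model of that idea (the class and port names are invented, and we do not claim this is exactly the construction of [92]):

```python
# Sketch of a 2-write/1-read-port memory built from single-write banks
# using a live-value table (LVT).

class MultiPortMem:
    def __init__(self, size):
        self.bank = [[0] * size, [0] * size]  # one bank per write port
        self.lvt = [0] * size                 # which bank wrote last
    def write(self, port, addr, value):
        self.bank[port][addr] = value         # each port writes its own bank
        self.lvt[addr] = port                 # LVT remembers the freshest copy
    def read(self, addr):
        return self.bank[self.lvt[addr]][addr]

m = MultiPortMem(8)
m.write(0, 3, 11)   # both writes can land in the same cycle,
m.write(1, 5, 22)   # since they target independent banks
print(m.read(3), m.read(5))   # 11 22
```
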
F. Defect and Fault Tolerance
We can exploit the homogeneous set of uncommitted resources in FPGA-like architectures to tolerate defects. Because of this regularity, FPGAs can use design-independent defect-tolerance techniques such as row and column sparing, as is familiar from memories. Nonetheless, the tile in an FPGA is much larger than a memory cell in a RAM, so it may be more efficient to spare resources at the level of interconnect links rather than tiles [94]. However, this kind of sparing builds in two levels of reconfigurability: one to tolerate fabrication defects and one to support the design. It is more efficient to unify our freedom for design mapping and defect avoidance. With today's cluster-based FPGAs, it is simple to reserve a spare lookup table (LUT) in a cluster to tolerate LUT defects [95]. When a natural cluster does not already exist, it is possible to overlay a conceptual cluster and reserve one space within every small grid of compute blocks, so that it is easy to precompute logic permutations that avoid any single compute-block failure within the overlay grid [96]. Interconnect faults can be repaired quickly by consulting precomputed alternate paths for nets that are disrupted by defects [97]. These techniques repair the design locally, avoiding the need to perform a complete remapping of the design; the cost of this simplicity is the need to reserve a small fraction of resources as spares even though most will not be used.

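The precomputed-repair idea reduces to a table lookup at configuration time, as in this toy model (the grid size, names, and encoding are our own): each overlay grid reserves one spare slot, and a placement avoiding any single failed slot is computed in advance:

```python
# Toy model of overlay-grid sparing: 4 block positions per grid, 3 used,
# 1 reserved. For each possible single failure, a placement avoiding the
# failed slot is precomputed, so repair needs no full re-mapping.

SLOTS = [0, 1, 2, 3]           # block positions in one overlay grid
LOGIC = ["A", "B", "C"]        # logic functions; one slot stays spare

repair_table = {
    failed: dict(zip(LOGIC, [s for s in SLOTS if s != failed]))
    for failed in SLOTS
}

print(repair_table[2])   # {'A': 0, 'B': 1, 'C': 3} -- slot 2 avoided
```
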
If we must tolerate higher defect rates or reduce the overhead for spares, we can map around the specific defects in a device. TERAMAC pioneered the more aggressive, large-scale approach of identifying defects, both in the FPGA and in the interconnect between FPGAs, and mapping an application to avoid them [98]. This style of design- and component-specific mapping can be used to tolerate defects in Programmable Logic Arrays (PLAs)