`
`Reference 34
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2146, p. 1
`
`
`
INVITED PAPER
`
`Reconfigurable Computing
`Architectures
`
`This paper provides an overview of the broad body-of-knowledge developed in the
`field of reconfigurable computing.
`
By Russell Tessier, Senior Member IEEE, Kenneth Pocek, Life Member IEEE, and
André DeHon, Member IEEE
`
`ABSTRACT | Reconfigurable architectures can bring unique
`capabilities to computational tasks. They offer the perfor-
`mance and energy efficiency of hardware with the flexibility of
`software. In some domains, they are the only way to achieve
`the required, real-time performance without fabricating cus-
`tom integrated circuits. Their functionality can be upgraded
`and repaired during their operational lifecycle and specialized
`to the particular instance of a task. We survey the field of
`reconfigurable computing, providing a guide to the body-of-
`knowledge accumulated in architecture, compute models,
`tools, run-time reconfiguration, and applications.
`
`KEYWORDS | Field programmable gate arrays; reconfigurable
`architectures; reconfigurable computing; reconfigurable logic
`
`optimization tasks, spatial computation and specialization
`made it possible to achieve supercomputer-level perfor-
`mance at workstation-level costs. Furthermore, by repro-
`gramming the FPGA, a specialized computer could be
`reconfigured to different tasks and new algorithms. In
`2015, the use of FPGAs for computation and communica-
`tion is firmly established. FPGA implementations of
applications are now prevalent in signal processing,
`cryptography, arithmetic, scientific computing, and net-
`working. Commercial acceptance is growing, as illustrated
`by numerous products that employ FPGAs for more than
`just glue logic.
Reconfigurable computing (RC), performing computations
with spatially programmable architectures such as
FPGAs, inherited a wide body-of-knowledge from many
`disciplines including custom hardware design, digital
`signal processing (DSP), general-purpose computing on
`sequential and multiple processors, and computer-aided
`design (CAD). As such, RC demanded that engineers
`integrate knowledge across these disciplines.
It also opened up a unique design space and introduced its own
`challenges and opportunities. Over the past 25 years, a
`new community has emerged and begun to integrate and
`develop a body-of-knowledge for building, programming,
`and exploiting this new class of machines.
`How do you organize programmable computations
`spatially? The design space of architecture and organiza-
`tion for the computation is larger when you can perform
`gate-level customization to the task, and when the
`machine can change its organization during the computa-
`tion. When you get to choose the organization of your
`machine for a particular application or algorithm, or even a
`particular data set, how do you maximize performance,
`minimize area, and minimize energy?
`How should you program these machines? How can
they be programmed to exploit the opportunity to
`exquisitely optimize them to the problem? How can they
`be made accessible to domain and application experts?
`Digital Object Identifier: 10.1109/JPROC.2014.2386883
`0018-9219 Ó 2015 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/
`redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
`332 Proceedings of the IEEE | Vol. 103, No. 3, March 2015
`
I. INTRODUCTION
`
`Field-programmable gate arrays (FPGAs) were introduced
`in the mid-1980s (e.g., [1]) as a larger capacity platform for
`glue logic than their programmable array logic (PAL)
`ancestors. By the early 1990s, they had grown in capacity
`and were being used for logic emulation (e.g., Quickturn
`[2]) and prototyping, and the notion of customizing a
`computer to a particular task using the emerging capacity
`of these FPGAs became attractive. These custom compu-
`ters could satisfy the processing requirements for many
`important and enabling real-time tasks (video and signal
`processing, vision, control, instrumentation, networking)
`that were too high for microprocessors. On simulation and
`
`Manuscript received August 18, 2014; revised November 13, 2014; accepted
`December 18, 2014. Date of current version April 14, 2015.
`R. Tessier is with the ECE Department, University of Massachusetts, Amherst,
`MA 01003 USA (e-mail: tessier@umass.edu).
`K. Pocek, retired, was with Intel USA.
`A. DeHon is with the ESE Department, University of Pennsylvania, Philadelphia,
`PA 19104 USA.
`
`
`
`
`Should they be programmed more like software or more
`like hardware? It became clear that neither traditional
`hardware nor traditional software forms of design capture
`can exploit the full flexibility that these machines offer.
What abstractions are useful for these machines? for
`programmers? for platform scaling? How do you manage
`the resources in these machines during runtime?
`How do we map and optimize to these machines? One
`way to ease the programmer burden is to automate the
`lower-level details of customizing a computation to the
`capabilities of these reconfigurable computers. Again, off-
`the-shelf CAD tools and software compilers did not address
`the full opportunities. Many CAD challenges, at least
`initially, were fairly unique to RC, including heteroge-
`neous parallelism, hybrid space-time computations, an
`organizational structure that can change during an
`algorithm, operator customization, memory structure
`tuning (e.g., cache, memory sizes, and placement), and
`custom data encoding to the task.
`In some cases,
`optimization problems that one would reasonably have
`performed manually for custom designs (e.g., bitwidth
`selection, cache sizing) became optimization problems to
`solve for every application program and dataset, thereby
`demanding a new level of automation.
`What algorithms make sense for these machines, and
`how does one tailor algorithms to them? The cost structure
`is different, often enabling or demanding different
`algorithms to solve the problem. Approaches that were
`previously inconceivable (e.g., dedicating a set of gates or a
`processing element to each data item) become viable and
`often superior solutions to problems. How do we exploit
the massive level of fine-grained parallelism? These
`demands encouraged the RC community to identify highly
`parallel algorithms with regular communications before
`Graphics Processing Units (GPUs) and multicore chips
`became available with similar demands.
`In this paper, we provide a guide to this accumulated
`body-of-knowledge. We start by reviewing the exploration
`of architectures for RC machines (Section II). Section III
`reviews approaches to program these RCs. We then review
`developments in tools to automate and optimize design for
`RCs (Section IV) and models and tools for Run-Time
`Reconfiguration (RTR, Section V). Section VI highlights
`important application domains. We conclude with some
`final observations in Section VII.
`
II. ARCHITECTURE AND TECHNOLOGY
`
`The invention of FPGAs in the early 1980s [1] seeded the
`field that has become known as reconfigurable computing.
`FPGAs offered the lure of hardware performance. It was
`well known that dedicated hardware could offer orders of
`magnitude better performance than software solutions on
`a general-purpose computer and that machines custom-
`built for a particular purpose could be much faster than
`their counterparts. However, building custom hardware is
`
`
`expensive and time consuming. Custom VLSI was the
`domain of a select few; the numbers that could profitably
`play in that domain were already small in the 1990s and
`have shrunk considerably since then. Microprocessor
`performance scaled with Moore’s Law, often delivering
`performance improvements faster than a custom hardware
`design could be built. FPGAs provided a path to the
promise of hardware customization without the huge
development and manufacturing costs and lead times of
custom VLSI. It was possible to extract more computational
throughput per unit silicon from FPGAs than
`processors [3], and it was possible to do so with less
`energy [4].
`In the late 1980s it was still possible to provide
`hardware differentiation by assembling packaged integrat-
`ed circuits in different ways at the printed-circuit board
`level. However, as chips grew in capacity and chip speeds
`increased, the cost of accessing functions off chip grew as
`well. There was an increasing benefit to integrating more
`functionality on chip. Opportunities for board-level
differentiation decreased, increasing the demand for
`design customization and differentiation on chip. FPGAs
`provided a way to get differentiation without using custom
`VLSI fabrication.
`The challenge then is how should we organize our
`computation and customization on the FPGA? How should
`the specialized FPGA be incorporated into a system? How
can the computation take advantage of the FPGA
`capabilities? As FPGAs grow in capacity and take on larger
`roles in computing systems than their initial glue-logic
`niche, how should FPGAs evolve to support computing,
`communication, and integration tasks?
`
`A. Pioneers
`Just a few years after the introduction of FPGAs,
`multiple pioneering efforts demonstrated the potential
`benefits of FPGA-based RC. SPLASH arranged 16 FPGAs
`into a linear systolic array and outperformed contemporary
`supercomputers (CM-2, Cray-2) on a DNA sequence
`matching problem with a system whose cost was compa-
`rable to workstations [5]. Programmable active memories
`(PAM) arranged 16 FPGAs into a two-dimensional (2-D)
`grid and demonstrated high performance on a collection of
`applications in signal and image processing, scientific
`computing, cryptography, neural networks, and high
`bandwidth image acquisition [6]. PAM held the record
`for the fastest RSA encryption and decryption speeds
`across all platforms, including exceeding the performance
`of custom chips.
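The systolic organization that SPLASH exploited can be made concrete with a toy software model. The Python sketch below is purely illustrative and is not SPLASH's actual algorithm (which performed dynamic-programming sequence comparison): each processing element (PE) holds one pattern character, and a match flag ripples down the chain as the data streams past.

```python
def systolic_match(pattern, stream):
    """Toy model of a linear systolic array for exact pattern matching.

    Each PE i holds pattern[i]; the data stream passes the array one
    character per cycle.  PE i raises its flag when PE i-1 matched on
    the previous cycle and PE i matches now, so the last PE fires
    exactly when the whole pattern has streamed past in order.
    """
    n = len(pattern)
    flags = [False] * n            # registered per-PE match flags
    hits = []
    for t, ch in enumerate(stream):
        # Update from the tail so each PE reads its neighbor's *old* flag.
        for i in range(n - 1, 0, -1):
            flags[i] = flags[i - 1] and (ch == pattern[i])
        flags[0] = (ch == pattern[0])
        if flags[n - 1]:
            hits.append(t - n + 1)  # start index of the match
    return hits
```

Every PE does the same small amount of work each cycle, which is why a chain of FPGAs could keep an entire database streaming through at full rate.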
`
`B. Accelerators
`A key idea from the beginning was that FPGAs could
`serve as generic, programmable hardware accelerators for
`general-purpose computers. Floating-point units were well
`known and successful at accelerating numerical applica-
`tions. Many applications might benefit from their own
`
`
`custom units. The question is how should they be
`interfaced with the general-purpose CPU, and how should
`the RC be managed?
`When floating-point co-processors were still separate
`chips, PRISM showed how an FPGA added as an external
`co-processor could accelerate bit-level operations that
`were inefficient on CPUs [7]. As chip capacities grew,
`research explored architectures that integrated an FPGA or
`reconfigurable array on the same die with the processor
`[8]–[10]. Key concerns included the visible architectural
model for the array, how state was shared with the
`configurable array, how different timing requirements of
`the processor and array should be accommodated, and how
`to maximize the bandwidth and minimize the latency
`required to communicate with the array. The reconfigur-
`able logic could be integrated as a programmable
`functional unit for a RISC processor [8], [11], share state
`with the conventional register file, and manage the
`reconfigurable array as a cache of programmable
`instructions [12].
`It was soon apparent that the processor would become
`a bottleneck if it had to mediate the movement of all data
`to and from the reconfigurable array. GARP showed how to
`integrate the array as a co-processor and how the array
`could have direct access to the memory system [13]. The
`architecture for dynamically reconfigurable embedded
`systems (ADRES) also provided direct memory access in
`a co-processor model; ADRES combined a VLIW processor
`core with the reconfigurable array, sharing functional
`units between the VLIW core and the array [14]. Later
`work explored interfacing the reconfigurable logic with
`on-chip caches and virtual memory [15] and streaming
`operations that used scoreboard interlocks on blocks of
`memory [16].
`Accelerators can also have an impact beyond raw
`application performance. Offloading tasks to a reconfigur-
`able array can be effective at reducing the energy required
`for a computation [17], as is important for embedded
`systems. For a survey of both fixed and reconfigurable
`accelerators and estimates of their energy and perfor-
`mance efficiency, see [18]. To simplify design represen-
`tation, mapping, and portability, the Queue Machine [19]
`showed how a processor and an array could run the same
`machine-level instructions.
`These designs at least anticipated and perhaps helped
`motivate commercial processor-FPGA hybrids. Xilinx
`offered a PowerPC on the Virtex2-Pro and now integrates
`ARM cores on their Zynq devices. Altera includes ARM
`cores on Cyclone V and Arria V SoC FPGAs. Stretch
`provided an integrated processor and FPGA device [20].
`Intel integrated an Atom processor and an Altera device
`in a multichip package for embedded computations and
also integrates a Xeon with an FPGA for server
applications.
Multiprocessors may also benefit from attached
accelerators. The logic could be used for custom acceleration,
similar to its use in single node machines, or for
`improving communication and synchronization [21], [22].
`Cray integrated FPGAs into their XD1, SRC offered a
`parallel supercomputer with FPGA accelerators, and
`Convey Computer now ships a supercomputer with an
`FPGA acceleration board.
`
`C. Fabric
`Using FPGAs as computing substrates creates different
`needs and opportunities for optimization than using
`FPGAs for glue logic or general-purpose register-transfer
`level (RTL) implementation engines. As a result, there is a
`significant body of work exploring how reconfigurable
`arrays might evolve beyond traditional FPGAs, which we
`highlight in this section. For the evolution of FPGAs for
`traditional usage, including the impact of technology, see
`the companion article [23]. For a theoretical and
`quantitative comparison of reconfigurable architectures,
`see [24].
`
`1) Organization: Fine-grained reconfigurable architec-
`tures, such as FPGAs, give us freedom to organize our
`computation in just about any way, but what organizational
`patterns are actually beneficial for use on the FPGA? This
`freedom fails to provide guidance in how to use the FPGA.
`As a result, there has been considerable exploration of
`computing patterns to help conceptualize the computa-
`tion. These ideas manifest as conceptual guides, as tools, as
`overlay architectures, and as directions for specializing the
`reconfigurable fabric itself. Overlay architectures are
`implemented on top of existing FPGAs and often support
`tuning to specialize the generic architecture to the specific
`needs of a particular application.
`Cellular automata is a natural model for an array of
`identical computing cells. On the FPGA, the cellular
`processing element can be customized to the application
`task. The CAL array from Algotronix was designed
`specifically with this model in mind [25]. Others showed
`how to build cellular computations on top of FPGAs [26]
`including how to deal with cellular automata that are
`larger than the physical FPGA available [27]–[29].
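As a concrete illustration of the cellular model, the sketch below (Python, all names hypothetical) updates a 1-D array of identical cells; on an FPGA, the per-cell rule would become each cell's customized logic. Rule 110 is used only as an arbitrary example rule.

```python
def ca_step(cells, rule=110):
    """One synchronous update of a 1-D, binary cellular automaton.

    Every cell applies the same rule to its 3-cell neighborhood; the
    8-bit `rule` number is the cell's truth table (Wolfram numbering).
    Boundaries wrap around, as on a physically tiled array.
    """
    n = len(cells)
    out = []
    for i in range(n):
        neigh = (cells[(i - 1) % n] << 2) | (cells[i] << 1) | cells[(i + 1) % n]
        out.append((rule >> neigh) & 1)  # look the neighborhood up in the rule
    return out
```

On hardware, all cells update in the same clock cycle; the sequential loop here only models that parallel step.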
Dataflow models are useful in abstracting spatially
`communicating computational operators. Dataflow hand-
`shaking can simply be used to synchronize data- or
`implementation-dependent timing [30], but it can also be
`used to control operator sharing [31].
`When you must sequentialize a larger graph onto
`limited resources, very long instruction word (VLIW)
`organization provides a straightforward model for operator
`control. As an overlay architecture, the composition of
`operators can be selected to match the needs of the
`application [32], [33]. As we will see later (Section II-C4
`and C5), when we time-switch the fabric itself, the result is
`essentially a VLIW organization.
`Highly pipelined vector processors can be efficient for
`data parallel tasks [34]. The vector lanes provide a model
`
`
`
`
`for trading area and time [35] and can be extended with
`custom accelerators [36].
`Some have found multicore and manycore overlays to
`be useful models for mapping applications onto FPGAs
`[37]–[39].
`
`2) Coarse-Grained Blocks: While the gate-level configur-
`ability in FPGAs allows great customization, it also means
`that common computing blocks used in DSP and
`numerical computations are inefficient compared to
`custom building blocks (e.g., multipliers, floating-point
`units). Would it be better to provide custom elements of
`these common computing blocks alongside or inside the
`FPGA fabric? When FPGAs were small, it was reasonable
`to consider adding dedicated chips for floating point at the
`board level [40]. The virtual embedded block model was
`introduced to explore hard logic integration in an FPGA
`array and illustrated the benefits of coarse-grain hardware
`to support floating-point computations [41]. Embedded
`multipliers and memories help narrow the performance,
`energy, and area gap between application-specific inte-
`grated circuits (ASICs) and FPGAs [42]. The modern
extensions to Toronto’s FPGA CAD flow support explo-
ration of customized blocks [43]. Both Altera and Xilinx
`now embed hard logic DSP blocks that support wide-
`word addition and multiplication inside their fine-grained
`logic arrays.
`
`3) Synchronous, Asynchronous: Processors and ASICs are
`typically designed around a fixed clock. This allows the
`integrated circuit (IC) designer to carefully optimize for
`performance and means the user of the processor does
`not have to worry about timing closure for their designs.
`As we consider high throughput computational fabrics,
`fixed-frequency, clocked arrays can offer similar advan-
`tages to spatial computing arrays like FPGAs, as shown by
`the high-speed, hierarchical synchronous reconfigurable
`array (HSRA) [44] and SFRA [45]. A different way of
`avoiding issues with timing closure is to drop the clock
`completely for asynchronous handshaking, as shown in
`[46]–[48].
`
`4) Multicontext: Since FPGAs started out as gate array
`replacements, they contained a static configuration to
`control the gates and interconnect. The fact that this
`configuration could be changed opened up new opportu-
`nities to change the circuit during operation. However,
`FPGA reconfiguration was slow. Adding multiple config-
`urations to the FPGA allows the FPGA to change rapidly
among a set of behaviors [49], [50]. Time-multiplexing
the expensive logic and interconnect effectively
allowed higher logic capacity per unit silicon, often with
`little performance impact. Tabula now offers a commercial
`multicontext FPGA [51]. The companion paper [24]
`identifies conditions under which a multicontext FPGA
`can be lower energy than a single-context FPGA.
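A multicontext cell can be modeled in a few lines of Python. This is a hypothetical sketch, not any vendor's architecture: several stored configurations share one physical lookup table, and a context counter selects the active one each cycle.

```python
class MulticontextLUT:
    """Toy model of one time-multiplexed logic cell."""

    def __init__(self, contexts):
        self.contexts = contexts   # one stored configuration per context
        self.ctx = 0               # context counter

    def step(self, *inputs):
        out = self.contexts[self.ctx](*inputs)
        self.ctx = (self.ctx + 1) % len(self.contexts)  # advance context
        return out

# Two logical gates share one physical cell: AND on even cycles, OR on odd.
cell = MulticontextLUT([lambda a, b: a & b, lambda a, b: a | b])
```

The silicon for the lookup table is paid for once but does the work of several gates, which is the capacity gain the multicontext papers report.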
`
`
`5) Coarse-Grained: Early FPGAs contained fine-grained
`logic largely because of their limited capacity, their gate
`array roots, and their use as glue logic. As their application
`domain moved to address signal processing and computing
`problems, there was interest in more efficiently supporting
`larger, wide-word logic. Could we have flexible, spatially
configurable, field-programmable machines with wide-
`word, coarse-grained computing blocks? Would a small
`amount of time-multiplexed sharing of these units be
`useful?
`Designed to prototype DSP algorithms, PADDI [52]
was one of the first coarse-grained reconfigurable
`architectures. It essentially used a VLIW architecture,
`with 16b words and 8 contexts per processing element.
`Other VLIW-style architectures have been used for low-
power multimedia processing, including the 16b ALU-
`register-file Reconfigurable Multimedia Array Coprocessor
`(REMARC) array [53], the 32b-word ADRES [54] with
`32 local contexts and 8 entry register files, and the
`8b-word, 16 entry register file dynamically reconfigurable
`processor (DRP) [55].
`The reconfigurable datapath architecture (rDPA) used
`32b-wide ALUs in a cellular arrangement with data
`presence flow control [56]. The reconfigurable pipelined
`datapath (RaPiD) arranged coarse-grain elements, 16b
`ALUs, multipliers and RAM blocks in a one-dimensional
`(1-D) datapath for systolic, pipelined computations, using
`a parallel, configurable bit-level control path for managing
`the dynamic behavior of the computation [57]. PipeRench
`also used the model of a directional pipeline, adding a
`technique to incrementally reconfigure resources and
`virtualize the depth in the pipeline, allowing developers to
`be abstracted from the number of physical pipe stages in
`the design and allowing designs to scale across a range of
`implementations [58].
`To reduce the cost of reconfiguring the cells during a
`multimedia computation, MorphoSys used a single
`instruction, multiple data (SIMD) architecture to control
`a coarse-grained reconfigurable array with tiles composed
`of 16b ALUs, multipliers, and 4 element register files [59].
MorphoSys uses a 32-context instruction memory at the
`periphery of the array that is as wide as a side of the 2-D
`array. Instructions can be shared across rows or columns in
`the 2-D array, and the memory is wide enough to provide
`separate control for each row or column.
`Graphics processors evolved from fixed-function ren-
`dering pipes to general-purpose computing engines in the
`form of general-purpose graphics processing units
`(GPGPUs) around the same time that FPGAs moved into
`computing and Coarse-Grained Reconfigurable Arrays
`(CGRAs) were developing. At a high-level, GPGPUs share
`many characteristics with CGRAs, providing a high-density
`spatial array of processing units that can be programmed
`to perform regular computing tasks. GPGPUs were
`initially focused on SIMD, single-precision, floating-point
`computing tasks, and do not support the spatially local
`
`
`communications typical of FPGAs and CGRAs. The
`companion paper [24] shows why locality exploitation
`can be a key advantage for reconfigurable architectures.
For compiled CUDA, the native language of many
GPGPUs, the GPGPUs provide 1–4× higher throughput
than FPGAs, but at a cost of over 4–16× the energy per
`operation [60]. The peak, single-precision, floating-point
`throughput of GPGPUs exceeds processors and FPGAs, but
`the delivered performance on applications can be lower
`than FPGAs and the energy per operation can be higher
`[61]. FPGAs can outperform GPGPUs [62] for video pro-
`cessing, depending on the nature and complexity of the task.
`A key differentiator in the architectures we have seen
`so far is whether the reconfigurable resources are
`controlled with a static configuration,
`like FPGAs, or
`with multicontext memories, like the VLIW-style CGRAs
`above. A second differentiator is how many resources can
`share the same instructions as the SIMD designs exploit.
`This motivated a set of designs that made it possible to use
`configuration to select between these choices. MATRIX
`[63] explored configuring instruction distribution to
`efficiently compose a wide range of architectures (systolic,
`VLIW, SIMD/Vector, MIMD) using a coarse-grain archi-
tecture based around an 8b ALU-multiplier cell that
includes a local 256×8 RAM. The RAMs could be used
`either as embedded data memories (e.g., register files,
`FIFOs) or instruction stores and the programmable
`interconnect could carry both data streams and instruction
`streams, including streams that could control the inter-
`connect. The CHESS reconfigurable arithmetic array used
4b ALUs and datapaths with local 32×4 memories where
`the ALUs could be configured statically or sequenced
`dynamically from local memories, but the routing could
`not [64].
`If the domain is large enough, it makes sense to create a
`custom reconfigurable array specialized for a specific
`domain. The Totem design showed how to optimize a
`reconfigurable array for a specific domain and illustrated
`the area savings in the DSP domain [65], how to perform
`function allocation for a domain-customized architecture
`[66], how to provide spare capacity for later changes [67],
`how to automate the layout of domain-optimized reconfi-
`gurable arrays, and how to compile applications to a
`specialized instance of the architecture.
`Some of the ideas from the MATRIX CGRA led to the
`array of SpiceEngine vector processors in Broadcom’s
`Calisto architecture [68]. The Samsung reconfigurable
`processor (SRP) is a derivative of the ADRES CGRA
`architecture [69], [70].
`
`6) Configuration Compression and Management: While
`FPGAs are in-field reconfigurable, the slow reconfigura-
`tion times limit the use of reconfiguration during the
`execution of a task. One key reason configurations are slow
`is the large number of configuration bits required to
`specify the FPGA. Consequently, one way to accelerate
`
`
`reconfiguration is to compress the transfer of configura-
`tion data. By adapting ideas from covering in two-level
`logic minimization, the wildcard scheme on the Xilinx
`XC6200 could be effectively used to reduce bitstream size
`and load time [71]. When the sequence of configurations is
`not static, a cache for frequently used configurations can
`be used to accelerate FPGA reconfiguration [72].
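The wildcard idea can be sketched abstractly (this Python model is illustrative and not the actual XC6200 register interface): one write carries a mask of "don't care" address bits, so a single transfer fills every matching configuration location.

```python
def wildcard_write(mem, addr, mask, data):
    """Write `data` to every address that matches `addr` on the bits
    NOT set in `mask`; bits set in `mask` are wildcards.  One such
    write replaces up to 2**popcount(mask) identical plain writes."""
    for a in range(len(mem)):
        if (a & ~mask) == (addr & ~mask):
            mem[a] = data

frames = [0] * 16
wildcard_write(frames, 0b0100, 0b0011, 7)   # one write fills addresses 4-7
```

Regular structures (carry chains, datapaths) produce many identical configuration frames, which is what makes covering-style compression effective.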
`
`7) On-Chip Dynamic Networking: The load-time config-
`ured interconnect in FPGAs is suitable for connecting
`together gates or performing systolic, pipelined, and
`cellular computations. However, as FPGAs started hosting
more diverse tasks, including computation that used
interconnect less continuously, it became valuable to
explore disciplines for dynamically sharing limited on-chip
FPGA communication bandwidth, that is, to provide network-
on-chip (NoC) designs on FPGAs. This prompted the
`design of packet-switch overlay networks for FPGAs [73],
`[74]. Often a time-multiplexed overlay network can be
more efficient than a packet-switched network [75].
`Because of the different cost structure between ASICs
`and FPGAs, overlay NoCs on FPGAs should be designed
`differently from ASIC NoCs [76], [77]. Ultimately, it may
`make sense to integrate packet-switched NoC support
`directly into the FPGA fabric [78].
`
`D. Emulation
`Since FPGAs are programmable gate arrays, an obvious
`use for them was to emulate custom logic designs.
`However, since FPGA capacity is lower than contemporary
`and future ASICs, there is typically a need to assemble
`large numbers of FPGAs to support an ASIC design [2].
`The lower density I/O between chips compared to on-chip
`interconnect meant that direct partitioning of gate-level
`netlists onto multiple FPGAs suffered bottlenecks at the
`chip I/O that left most FPGA logic capacity underutilized.
`To avoid this bottleneck, Virtual Wires virtualized the pins
`on the FPGA, exploiting the ability to time multiplex the
`I/O pins many times within an ASIC emulation cycle [79].
`FPGA-based Veloce logic emulators sold by Mentor
`Graphics are based on these techniques. Today’s FPGAs
`include hardware support for high-speed serial links to
`address the chip I/O bandwidth bottleneck.
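The pin-virtualization idea reduces to a mux/demux schedule. The toy Python model below (names hypothetical) packs many logical inter-FPGA wires onto a few physical pins, one batch per fast pin-clock tick inside a single slow emulation cycle.

```python
def mux_wires(wire_values, n_pins):
    """Schedule the logical wire values onto n_pins physical pins.
    Returns one list per fast pin-clock tick, each carrying at most
    n_pins values; all ticks fit inside one emulation clock cycle."""
    return [wire_values[i:i + n_pins]
            for i in range(0, len(wire_values), n_pins)]

def demux_wires(ticks):
    """Receiver side: reassemble the logical wires in schedule order."""
    return [v for tick in ticks for v in tick]
```

With 2 pins and 5 logical wires, three pin-clock ticks per emulation cycle suffice, so the partitioner is no longer bounded by the package pin count.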
`As FPGAs grew in size, natural building blocks could fit
`onto a single FPGA, including complete processors [80].
`The emulation focus turned to the system level. Boards of
`FPGAs were used to directly support algorithm develop-
`ment [81]. With an FPGA now large compared to a
`processor core, single FPGAs and boards of FPGAs have
`been used to support multicore architecture research [82],
`[83]. Nonetheless, directly implementing one instantiated
`processor core on the FPGA for one emulated processor
`core can demand very large multi-FPGA systems. Since
`these designs include many, identical processor cores, an
`alternative is to time-multiplex multiple virtual processor
`cores over each physically instantiated core [84].
`
`
`
`
Furthermore, infrequently exercised core functionality
`can be run on a host processor, allowing the physically
`instantiated core to be smaller and easier to develop.
`While it is possible to simulate the entire processor on
`an FPGA, defining and maintaining the model for the
`FPGA and keeping it running at high frequency can be
`difficult. FPGA-accelerated simulation technologies
`(FAST) observes that cycle-accurate simulation on a
`processor is dominated by the timing modeling, not the
`functional modeling. Consequently, one can move the
`timing model logic to the FPGA and achieve significant
`simulation speedups without significantly changing the
`design and reuse flow for the functional simulator [85].
`A-Port networks provide a generalization for explicitly
`modeling the simulated timing of a structure where the
`FPGA primitive clock cycles do not necessarily corre-
`spond directly to the modeled timing cycles in the
`emulated system [86].
`
`E. Integrated Memory
`As FPGA capacity grew, it became possible to integrate
`memory onto the FPGA die, and it became essential to do
`so to avoid memory bottlenecks. This created opportuni-
`ties to exploit high, application-customized bandwidth and
`raised questions of how to organize, manage, and exploit
`the memory.
`Irregular memory accesses often stall conventional
`processor and memory systems. Consequently, an FPGA
`accelerator that gathers and filters data can accelerate
`irregular accesses on a processor [87]. Similarly, conven-
`tional processors perform poorly on irregular graph
`operations. The GraphStep architecture showed how to
`organize active computations around embedded memories
`in the FPGA to accelerate graph processing [88].
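The gather-and-filter pattern is simple to state in software. In the hypothetical Python sketch below, the "accelerator" chases an index list and forwards only elements passing a predicate, so the processor consumes a dense stream instead of stalling on scattered loads.

```python
def gather_filter(memory, indices, keep):
    """Model of an FPGA-side gather/filter stage: fetch memory[i] for
    each index and forward only the values satisfying `keep`."""
    return [memory[i] for i in indices if keep(memory[i])]
```

On the FPGA, the irregular fetches are overlapped and pipelined against the memory system rather than serialized through a cache miss at a time.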
`When it is not possible to store the entire dataset on
`chip, it is often useful to stream, buffer, or cache the data
`in the embedded memories. Windows into spatial regions
`are important for cellular automata [27] and many image
`and signal processing and scientific computing problems
`[89]. CoRAM provided an abstraction for using the on-chip
`memories as windows on a larger, unified off-chip memory
`and a methodology for providing the control and
`communication infrastructure needed to support
`the
`abstraction [90].
`The embedded memories in FPGAs are simple RAMs,
often with native support for dual-port access. It is
straightforward to build scratchpad memories, simple
`register files, FIFOs, and direct mapped caches [91] from
`these memories. Recent work has demonstrated efficient
`ways to implement multiported memories [92] and (near)
`associative memories [93].
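One published style of multiported memory is the live-value-table design; the sketch below is a simplified Python illustration of the idea, not the exact circuit of [92]. It replicates a simple RAM per write port and records which writer last touched each address.

```python
class TwoWriteMem:
    """Two-write-port memory built from two one-write banks plus a
    live value table (LVT) naming the bank holding the newest data."""

    def __init__(self, depth):
        self.banks = [[0] * depth, [0] * depth]  # one bank per write port
        self.lvt = [0] * depth                   # last writer per address

    def write(self, port, addr, data):
        self.banks[port][addr] = data
        self.lvt[addr] = port                    # remember who wrote last

    def read(self, addr):
        return self.banks[self.lvt[addr]][addr]  # steer to the live bank
```

The trade is area (replicated banks) for ports, a good fit for FPGAs where block RAMs are plentiful but fixed at two ports.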
`
`F. Defect and Fault Tolerance
`We can exploit the homogeneous set of uncommitted
`resources in FPGA-like architectures to tolerate defects.
`
`
`Because of this regularity, FPGAs can use design-indepen-
`dent defect-tolerance techniques such as row and column
`sparing, as is familiar from memories. Nonetheless, the tile
`in an FPGA is much larger than a memory cell in a RAM,
`so it may be more efficient to spare resources at the level of
`interconnect links rather than tiles [94]. However, this
`kind of sparing builds in two levels of reconfigurability:
`one to tolerate fabrication defects and one to support the
`design. It is more efficient to unify our freedom for design-
`mapping and defect-avoidance. With today’s cluster-based
`FPGAs, it is simple to reserve a spare lookup table (LUT) in
`a cluster to tolerate LUT defects [95]. When a natural
`cluster does not already exist, it is possible to overlay a
`conceptual cluster and reserve one space within every
small grid of compute blocks so that it is easy to
precompute logic permutations that avoid any single
compute block failure within the overlay grid [96].
`compute block failure within the overlay grid [96].
`Interconnect faults can be repaired quickly by consulting
`precomputed alternate paths for nets that are disrupted by
`defects [97]. These techniques repair the design locally,
`avoiding the need to perform a complete mapping of the
`design; the cost for this simplicity is the need to reserve a
`small fraction of resources for spares even though most
`will not be used.
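The spare-LUT repair step amounts to a local remapping. The Python sketch below is purely illustrative: each cluster keeps one unassigned slot, and a defect is repaired by moving the affected function onto the spare, with no global re-place-and-route.

```python
def repair_cluster(assignment, defective_slot, spare_slot):
    """assignment: slot -> logic function name, with None for the spare.
    Returns a new assignment that avoids the defective slot."""
    if assignment[defective_slot] is None:
        return dict(assignment)          # defect landed on the unused spare
    fixed = dict(assignment)
    fixed[spare_slot] = fixed[defective_slot]   # shift function to the spare
    fixed[defective_slot] = None                # retire the bad slot
    return fixed
```

Because the repair is confined to one cluster, only that cluster's routing needs touching, which is exactly the locality the sparing schemes above rely on.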
`If we must tolerate higher defect rates or reduce the
`overhead for spares, we can map around the specific
`defects in a device. TERAMAC pioneered the more
aggressive, large-scale approach of identifying defects,
`both in the FPGA and in the interconnect between FPGAs,
`and mapping an application to avoid them [98]. This style
`of design- and component-specific mapping can be used to
`tolerate defects in Programmable Logic Arrays (PL