RECONFIGURABLE COMPUTING
THE THEORY AND PRACTICE OF FPGA-BASED COMPUTATION

Edited by

Scott Hauck and Andre DeHon

AMSTERDAM • BOSTON • HEIDELBERG • LONDON • NEW DELHI • NEW YORK • OXFORD
PARIS • SAN DIEGO • SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO

MORGAN KAUFMANN
Morgan Kaufmann Publishers is an imprint of Elsevier
`
`
`
Reconfigurable Computing
Hauck and DeHon

MORGAN KAUFMANN PUBLISHERS
An imprint of Elsevier
30 Corporate Drive, Suite 400, Burlington, MA 01803-4255

Copyright © 2008 by Elsevier Inc.

Original ISBN: 978-0-12-370522-8
`
`
All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any
means, electronic or mechanical, including photocopy, recording, or any information
storage and retrieval system, without permission in writing from the publisher.
`
`
`
`
`
First Printed in India 2011

Indian Reprint ISBN: 978-93-80931-86-9

This edition has been authorized by Elsevier for sale in the following countries:
India, Pakistan, Nepal, Sri Lanka and Bangladesh. Sale and purchase of this book
outside these countries is not authorized and is illegal.

Published by Elsevier, a division of Reed Elsevier India Private Limited.

Registered Office: 622, Indraprakash Building, 21 Barakhamba Road,
New Delhi-110 001.

Corporate Office: 14th floor, Building No. 10B, DLF Cyber City Phase-II,
Gurgaon-122 002, Haryana, India.

Printed and bound in India by Sanat Printers, Kundli-131 028
`
`
`
`
`CHAPTER 21
`
IMPLEMENTING APPLICATIONS WITH FPGAs
`Brad L. Hutchings, Brent E. Nelson
`Department of Electrical and Computer Engineering
`Brigham Young University
`
Developers can choose various devices when implementing electronic systems:
field-programmable gate arrays (FPGAs), microprocessors, and other standard
products such as ASSPs, and custom chips or application-specific integrated
circuits (ASICs). This chapter discusses how FPGAs compare to other digital
devices, outlines the considerations that will help designers to determine when
FPGAs are appropriate for a specific application, and presents implementation
strategies that exploit features specific to FPGAs.

The chapter is divided into four major sections. Section 21.1 discusses the
strengths and weaknesses of FPGAs, relative to other available devices.
Section 21.2 suggests when FPGA devices are suitable choices for specific
applications/algorithms, based upon their I/O and computation requirements.
Section 21.3 discusses general implementation strategies appropriate for FPGA
devices. Then Section 21.4 discusses FPGA-specific arithmetic design techniques.
`
`21.1 STRENGTHS AND WEAKNESSES OF FPGAs
`
Developers can choose from three general classes of devices when implementing an
algorithm or application: microprocessor, FPGA, or ASIC (for simplicity, ASSPs are
not considered here). This section provides a brief summary of the advantages and
disadvantages of these devices in terms of time to market, cost, development time,
power consumption, and debug and verification.
`
`21.1.1 Time to Market
Time to market is often touted as one of the FPGA's biggest strengths, at least
relative to ASICs. With an ASIC, going from specification to product requires (at
least): (1) design, (2) verification, (3) fabrication, (4) packaging, and (5) device
test. In addition, software development requires access to the ASIC device (or an
emulation of such) before it can be verified and completed. As immediately available
standard devices, FPGAs have already been fabricated, packaged, and tested by the
vendor, thereby eliminating at least four months from time to market.
`
`
More difficult to quantify but perhaps more important are: (1) refabrications
(respins) caused by either errors in the design or late changes to the
specification, due to a change in an evolving standard, for example, and
(2) software development schedules that depend on access to the ASIC. Both of
these items impact product production schedules; a respin can easily consume an
additional four months, and early access to hardware can greatly accelerate
software development and debug, particularly for the embedded software that
communicates directly with the device.

In light of these considerations, a conservative estimate of the time-to-market
advantage of FPGAs relative to ASICs is 6 to 12 months. Such a reduction is
significant; in consumer electronics markets, many products have only a 24-month
lifecycle.
`
`21.1.2 Cost
`Per device, FPGAs can be much less expensive than ASICs, especially in lower
`volumes, because the nonrecurring costs of FPGA fabrication are borne by many
`users. However, because of their reprogrammability, FPGAs require much more
`silicon area to implement equivalent functionality. Thus, at the highest volumes
`possible in consumer electronics, FPGA device cost will eventually exceed ASIC
`device cost.
`
`21.1.3 Development Time
FPGA application development is most often approached as hardware design:
applications are described in Verilog or VHDL, simulated to determine correctness,
and synthesized using commercial logic synthesis tools. Commercial tools are
available that synthesize behavioral programs written in sequential languages such
as C to FPGAs. However, in most cases, much better performance and higher densities
are achieved using HDLs, because they allow the user to directly describe and
exploit the intrinsic parallelism available in an application. Exploiting
application parallelism is the single best way to achieve high FPGA performance.
However, designing highly parallel implementations of applications in HDLs requires
significantly more development effort than software development with conventional
sequential programming languages such as Java or C++.
`
`21.1.4 Power Consumption
`FPGAs consume more power
`than ASICs simply because programmability
`requires many more transistors, relative to a customized integrated circuit (IC).
`FPGAs may consume more or less power than a microprocessor or digital signal
`processor (DSP), depending on the application.
`
`21.1.5 Debug and Verification
`FPGAs are developed with standard hardware design techniques and tools.
`Coded in VHDL or Verilog and synthesized, FPGA designs can be debugged
`
in simulators just as typical ASIC designs are. However, many designers verify
their designs directly, by downloading them into an FPGA and testing them in a
system. With this approach the application can be tested at speed (a million times
faster than simulation) in the actual operating environment, where it is exposed to
real-world conditions. If thorough, this testing provides a stronger form of
functional verification than simulation. However, debugging applications in an FPGA
can be difficult because vendor tools provide much less observability and
controllability than, for example, a hardware description language (HDL) simulator.
`
`21.1.6 FPGAs and Microprocessors
As discussed previously, FPGAs are most often contrasted with custom ASICs.
However, if a programmable solution is dictated because of changing application
requirements or other factors, it is important to study the application carefully
to determine if it is possible to meet performance requirements with a programmable
processor (microprocessor or DSP). Code development for programmable processors
requires much less effort than that required for FPGAs or ASICs, because developing
software with sequential languages such as C or Java is much less taxing than
writing parallel descriptions with Verilog or VHDL. Moreover, the coding and
debugging environments for programmable processors are far richer than their HDL
counterparts. Microprocessors are also generally much less expensive than FPGAs.
If the microprocessor can meet application requirements (performance, power, etc.),
it is almost always the best choice.

In general, FPGAs are well suited to applications that demand extremely high
performance and reprogrammability, for interfacing components that communicate with
many other devices (so-called glue-logic), and for implementing hardware systems at
volumes that make their economies of scale feasible. They are less well suited to
products that will be produced at the highest possible volumes or for systems that
must run at the lowest possible power.
`
`21.2 APPLICATION CHARACTERISTICS AND PERFORMANCE
`
Application performance is largely determined by the computational and I/O
requirements of the system. Computational requirements dictate how much hardware
parallelism can be used to increase performance. I/O system limitations and
requirements determine how much performance can actually be exploited from the
parallel hardware.
`
`21.2.1 Computational Characteristics and Performance
FPGAs can outperform today's processors only by exploiting massive amounts of
parallelism. Their technology has always suffered from a significant clock-rate
disadvantage; FPGA clock rates have always been slower than CPU clock rates by
about a factor of 10. This remains true today, with clock rates for FPGAs
`limited to about 300 to 350 MHz and CPUs operating at approximately 3 GHz.
`As a result, FPGAs must perform at least 10 times the computational work
`per cycle to perform on par with processors. To be a compelling alternative,
`an FPGA-based solution should exceed the performance of a processor-based
`solution by 5 to 10 times and hence must actually perform 50 to 100 times
`the computational work per clock cycle. This kind of performance
`is feasible
`only if the target application exhibits a corresponding amount of exploitable
`parallelism.
`The guideline of 5 to 10 times is suggested for two main reasons. First of all,
`prior to actual implementation,
`it is difficult or impossible to foresee the impact
of various system and I/O issues on eventual performance.
`In our experience,
`5 times can quickly become 2 times or less as various system and algorithmic
`issues arise during implementation. Second, application development for FPGAs
is much more difficult than conventional software development. For that reason,
the additional development effort must be carefully weighed against the
`potential performance advantages. A guideline of 5 to 10 times provides some
`insurance
`that any FPGA-specific performance advantages will not completely
`vanish during the implementation phase.
`Ultimately, the intrinsic characteristics of the application place an upper
`bound on FPGA performance. They determine how much raw parallelism exists,
how exploitable it is, and how fast the clock can operate. A review of the
literature [3-6, 11, 16, 19-21, 23, 26, 28] shows that the application
characteristics that have the most impact on application performance are: data
parallelism, amenability to pipelining, data element size and arithmetic
complexity, and simple control requirements.
`
`Data parallelism
Large datasets with few or no data dependencies are ideal for FPGA implementation
for two reasons: (1) they enable high performance because many computations can
occur concurrently, and (2) they allow operations to be extensively rescheduled.
As previously mentioned, concurrency is extremely important because FPGA
applications must be able to achieve 50 to 100 times the operations per clock cycle
of a microprocessor to be competitive. The ability to reschedule computations is
also important because it makes it feasible to tailor the circuit design to FPGA
hardware and achieve higher performance. For example, computations can be scheduled
to maximize data reuse to increase performance and reduce memory bandwidth
requirements. Image-processing algorithms with their attendant data parallelism
have been among the highest-performing algorithms mapped to FPGA devices.
`
`Data element size and arithmetic complexity
Data element size and arithmetic complexity are important because they strongly
influence circuit size and speed. For applications with large amounts of
exploitable parallelism, the upper limit on this parallelism is often determined
by how many operations can be performed concurrently on the FPGA device. Larger
data elements and greater arithmetic complexity lead to larger
`and fewer computational elements and less parallelism. Moreover, larger and
`more complex circuits exhibit more delay that slows clock rate and impacts
`performance. Not surprisingly, representing data with the fewest possible bits
`and performing computation with the simplest operators generally lead to the
`highest performance. Designing high-performance applications in FPGAs almost
`always involves a precision/performance
`trade-off.
`
`Pipelining
Pipelining is essential to achieving high performance in FPGAs. Because FPGA
performance is limited primarily by interconnect delay, pipelining (inserting
registers on long circuit pathways) is an essential way to improve clock rate (and
therefore throughput) at the cost of latency. In addition, pipelining allows
computational operations to be overlapped in time and leads to more parallelism in
the implementation. Generally speaking, because pipelining is used extensively
throughout FPGA-based designs, applications must be able to tolerate some latency
(via pipelining) to be suitable candidates for FPGA implementation.
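To make this concrete, the following minimal Verilog sketch (added here for
illustration, not from the original text; the 16-bit operands, 32-bit result, and
two-stage split are assumptions) inserts pipeline registers between a multiplier
and an adder in a multiply-accumulate datapath:

// A minimal sketch of pipelining a multiply-accumulate datapath.
// The register prod_r breaks the multiply-to-add path so the clock can run
// faster, at the cost of one extra cycle of latency; c_r keeps the second
// operand aligned with the delayed product.
module pipelined_mac (
  input  wire        clk,
  input  wire [15:0] a, b,
  input  wire [31:0] c,
  output reg  [31:0] y
);
  reg [31:0] prod_r;              // pipeline register after the multiplier
  reg [31:0] c_r;                 // balancing register for the c operand
  always @(posedge clk) begin
    prod_r <= a * b;              // stage 1: multiply
    c_r    <= c;
    y      <= prod_r + c_r;       // stage 2: add
  end
endmodule

The extra register shortens the longest combinational path, which is exactly the
clock-rate-for-latency trade described above.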
`
`Simple control requirements
FPGAs achieve the highest performance if all operations can be statically scheduled
as much as possible (this is true of many technologies). Put simply, it takes time
to make decisions and decision-making circuitry is often on the critical path for
many algorithms. Replacing runtime decision circuitry with static control
eliminates circuitry and speeds up execution. It makes it much easier to construct
circuit pipelines that are heavily utilized with few or no pipeline bubbles. In
addition, statically scheduled controllers require less circuitry, making room for
more datapath operators, for example. In general, datasets with few or no
dependencies often have simple control requirements.
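A minimal Verilog sketch of static control (added here for illustration, not from
the original text; the three-cycle schedule and the enable signal names are
assumptions) replaces data-dependent decision logic with a small counter that
enables each datapath stage on a fixed cycle:

// A minimal sketch of a statically scheduled controller: a free-running
// modulo-3 counter asserts one enable per cycle, so no runtime
// decision-making sits on the datapath's critical path.
module static_ctrl (
  input  wire clk,
  input  wire rst,
  output wire load_en,   // cycle 0: fetch operands
  output wire mul_en,    // cycle 1: multiply
  output wire store_en   // cycle 2: write result
);
  reg [1:0] cycle;
  always @(posedge clk)
    if (rst)                cycle <= 2'd0;
    else if (cycle == 2'd2) cycle <= 2'd0;
    else                    cycle <= cycle + 2'd1;
  assign load_en  = (cycle == 2'd0);
  assign mul_en   = (cycle == 2'd1);
  assign store_en = (cycle == 2'd2);
endmodule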
`
21.2.2 I/O and Performance
`As mentioned previously, FPGA clock rates are at least one order of magnitude
`slower than those of CPUs. Thus, significant parallelism (either data parallelism
`or pipelining) is required for an FPGA to be an attractive alternative to a CPU.
However, I/O performance is just as important: Data must be transmitted at
`rates that can keep all of the parallel hardware busy.
Algorithms can be loosely grouped into two categories: I/O bound and compute bound
[17, 18]. At the simplest level, if the number of I/O operations is equal to or
greater than the number of calculations in the computation, the computation is said
to be I/O bound. To increase its performance requires an increase in memory
bandwidth; doing more computation in parallel will have no effect. Conversely, if
the number of computations is greater than the number of I/O operations,
computational parallelism may provide a speedup.
A simple example of this, provided by Kung [18], is matrix-matrix multiplication.
The total number of I/Os in the computation, for n-by-n matrices, is 3n²: each
matrix must be read and the product written back. The total number of computations
to be done, however, is n³. Thus, this computation is
compute bound. In contrast, matrix-matrix addition requires 3n² I/Os and 3n²
calculations and is thus I/O bound. Another way to see this is to note that each
source element read from memory in a matrix-matrix multiplication is used n times
and each result is produced using n multiply-accumulate operations. In
matrix-matrix addition, each element fetched from memory is used only once and each
result is produced from only a single addition.
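One way to make this contrast explicit is to form the ratio of computations to I/O
operations implied by the counts above (a short derivation added here; it is not
part of the original text):

\[
\text{matrix multiply: } \frac{n^{3}}{3n^{2}} = \frac{n}{3},
\qquad
\text{matrix add: } \frac{3n^{2}}{3n^{2}} = 1 .
\]

Because the multiply ratio grows with n, additional parallel hardware can be kept
busy as long as operands are reused on chip; the addition ratio is constant, so
only additional memory bandwidth can improve its performance.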
Carefully coordinating data transfer, I/O movement, and computation order is
crucial to achieving enough parallelism to provide effective speedup. The entire
field of systolic array design is based on the concepts of (1) arranging the I/O
and computation in a compute-bound application so that each data element fetched
from memory is reused multiple times, and (2) keeping many processing elements busy
operating in parallel on that data.
FPGAs offer a wide variety of memory elements that can be used to coordinate I/O
and computation: flip-flops to provide single-bit storage (10,000s of bits);
LUT-based RAM to provide many small blocks of randomly distributed memory
(100,000s of bits); and larger RAM or ROM memories (1,000,000s of bits). Some
vendors' FPGAs contain multiple sizes of random access memories, and these memories
are often easily configured into special-purpose structures such as dynamic-length
shift registers, content-addressable memories (CAMs), and so forth. In addition to
these types of on-chip memory, most FPGA platforms provide off-chip memory as well.
Increasing the I/O bandwidth to memory is usually critical in harnessing the
parallelism inherent in a computation. That is, after some point, further
multiplying the number of processing elements (PEs) in a design (to increase
parallelism) usually requires a corresponding increase in I/O. This additional I/O
can often be provided by the many on-chip memories in a typical modern FPGA. The
work of Graham and Nelson [8] describes a series of early experiments to map
time-delay SONAR beam forming to an FPGA platform where memory bandwidth was the
limiting factor in design speedup. While the data to be processed were an infinite
stream of large data blocks, many of the other data structures in the computation
were not large (e.g., coefficients, delay values). In this computation, it was not
the total amount of memory that limited the speedup but rather the number of memory
ports available. Thus, the use of multiple small memories in parallel was able to
provide the needed bandwidth.
`The availability of many small memories in today's FPGAs further supports
`the idea of trading off computation for table lookup. Conventional FPGA fabrics
`are based on a foundation of 4-input LUTs; in addition, larger on-chip memories
`can be used to support larger lookup structures. Because the memories already
`exist on chip, unlike in ASIC technology, using them adds no additional cost to
`the system. A common approach in FPGA-based design, therefore, is to evaluate
`which parts of the system's computations might lend themselves to table lookup
`and use the available RAM blocks for these lookups.
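As an illustration, the following minimal Verilog sketch (added here, not from the
original text; the 256-entry, 16-bit sine table and the initialization file name
are assumptions) replaces a runtime function evaluation with a registered read from
an on-chip memory, which synthesis tools can typically map to a block RAM or ROM:

// A minimal sketch of trading computation for table lookup: a 256-entry,
// 16-bit table holds precomputed values (e.g., a sine function), and the
// registered read lets synthesis infer an on-chip block RAM/ROM.
module sine_lut (
  input  wire       clk,
  input  wire [7:0] phase,   // table index
  output reg [15:0] sine     // looked-up value, available one cycle later
);
  reg [15:0] rom [0:255];
  initial $readmemh("sine_table.hex", rom);  // hypothetical initialization file
  always @(posedge clk)
    sine <= rom[phase];
endmodule

The registered read adds one cycle of latency but keeps the lookup off the
combinational critical path.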
In summary, the performance of FPGA-based applications is largely determined by how
much exploitable parallelism is available, and by the ability of the system to
provide data to keep the parallel hardware operational.
`
21.3 GENERAL IMPLEMENTATION STRATEGIES FOR FPGA-BASED SYSTEMS

In contrast with other programmable technologies such as microprocessors or DSPs,
FPGAs provide an extremely rich and complex set of implementation alternatives.
Designers have complete control over arithmetic schemes and number representation
and can, for example, trade precision for performance. In addition, reprogrammable,
SRAM-based FPGAs can be configured any number of times to provide additional
implementation flexibility for further tailoring the implementation to lower cost
and make better use of the device.

There are two general configuration strategies for FPGAs: configure-once, where the
application consists of a single configuration that is downloaded for the duration
of the application's operation, and runtime reconfiguration (RTR), where the
application consists of multiple configurations that are "swapped" in and out as
the application operates [14].
`
`21.3.1 Configure-once
Configure-once (during operation) is the simplest and most common way to implement
applications with reconfigurable logic. The distinctive feature of configure-once
applications is that they consist of a single system-wide configuration. Prior to
operation, the FPGAs comprising the reconfigurable resource are loaded with their
respective configurations. Once operation commences, they remain in this
configuration until the application completes. This approach is very similar to
using an ASIC for application acceleration. From the application point of view, it
matters little whether the hardware used to accelerate the application is an FPGA
or a custom ASIC because it remains constant throughout its operation.
The configure-once approach can also be applied to reconfigurable applications to
achieve significant acceleration. There are classes of applications, for example,
where the input data varies but remains constant for hours, days, or longer. In
some cases, data-specific optimizations can be applied to the application circuitry
and lead to dramatic speedup. Of course, when the data changes, the circuit-specific
optimizations need to be reapplied and the bitstream regenerated. Applications of
this sort consist of two elements: (1) the FPGA and system hardware, and (2) an
application-specific compiler that regenerates the bitstream whenever the
application-specific data changes. This approach has been used, for example, to
accelerate SNORT, a popular packet filter used to improve network security [13].
SNORT data consists of regular expressions that detect malicious packets by their
content. It is relatively static, and new regular expressions are occasionally
added as new attacks are detected. The application-specific compiler translates
these regular expressions into FPGA hardware that matches packets many times faster
than software SNORT. When new regular expressions are added to the SNORT database,
the compiler is rerun and a new configuration is created and downloaded to the FPGA.
`
`21.3.2 Runtime Reconfiguration
Whereas configure-once applications statically allocate logic for the duration of
an application, RTR applications use a dynamic allocation scheme that re-allocates
hardware at runtime. Each application consists of multiple configurations per FPGA,
with each one implementing some fraction of it. Whereas a configure-once
application configures the FPGA once before execution, an RTR application typically
reconfigures it many times during the normal operation.

There are two basic approaches that can be used to implement RTR applications:
global and local (sometimes referred to as partial configuration in the
literature). Both techniques use multiple configurations for a single application,
and both reconfigure the FPGA during application execution. The principal
difference between the two is the way the dynamic hardware is allocated.
`
`Global RTR
`Global RTR allocates all (FPGA) hardware resources in each configuration step.
`More specifically, global RTR applications are divided into distinct temporal
`phases, with each phase implemented as a single system-wide configuration that
`occupies all system FPGA resources. At runtime, the application steps through
each phase by loading all of the system FPGAs with the appropriate configuration
data associated with a given phase.
`
Local RTR
Local RTR takes an even more flexible approach to reconfiguration than does global
RTR. As the name implies, these applications locally (or selectively) reconfigure
subsets of the logic as they execute. Local RTR applications may configure any
percentage of the reconfigurable resources at any time, individual FPGAs may be
configured, or even single FPGA devices may themselves be partially reconfigured on
demand. This flexibility allows hardware resources to be tailored to the runtime
profile of the application with finer granularity than that possible with global
RTR. Whereas global RTR approaches implement the execution process by loading
relatively large, global application partitions, local RTR applications need load
only the necessary functionality at each point in time. This can reduce the amount
of time spent downloading configurations and can lead to a more efficient runtime
hardware allocation.
The organization of local RTR applications is based more on a functional division
of labor than the phased partitioning used by global RTR applications. Typically,
local RTR applications are implemented by functionally partitioning an application
into a set of fine-grained operations. These operations need not be temporally
exclusive; many of them may be active at one time. This is in direct contrast to
global RTR, where only one configuration (per FPGA) may be active at any given
time. Still, with local RTR it is important to organize the operations such that
idle circuitry is eliminated or greatly reduced. Each operation is implemented as a
distinct circuit module, and these circuit modules are then downloaded to the FPGAs
as necessary during operation. Note that, unlike global RTR, several of these
operations may be loaded simultaneously, and each may consume any portion of the
system FPGA resources.
`
`RTR applications
Runtime Reconfigured Artificial Neural Network (RRANN) is an early example of a
global RTR application [7]. RRANN divided the back-propagation algorithm (used to
train neural networks) into three temporally exclusive configurations that were
loaded into the FPGA in rapid succession during operation. It demonstrated a 500
percent increase in density by eliminating idle circuitry in individual algorithm
phases.
`RRANN was followed up with RRANN-2 [9], an application using local RTR.
`Like RRANN, the algorithm was still divided into three distinct phases. However,
`unlike the earlier version, the phases were carefully designed so that they shared
`common circuitry, which was placed and routed into identical physical locations
`for each phase. Initially, only the first configuration was loaded; thereafter, the
`common circuitry remained resident and only circuit differences were loaded
`during operation. This reduced configuration overhead by 25 percent over the
`global RTR approach.
The Dynamic Instruction Set Computer (DISC) [29] used local RTR to create a
sequential control processor with a very small fixed core that remained resident at
all times. This resident core was augmented by circuit modules that were
dynamically loaded as required by the application. DISC was used to implement an
image-processing application that consisted of various filtering operations. At
runtime, the circuit modules were loaded as necessary. Although the application
used all of the filtering circuit modules, it did not require all of them to be
loaded simultaneously. Thus, DISC loaded circuit modules on demand as required.
Only a few active circuit modules were ever resident at any time, allowing the
application to fit in a much smaller device than possible with global RTR.
`
`21.3.3 Summary of Implementation Issues
`
Of the two general implementation techniques, configure-once is the simplest and is
best supported by commercially available tool flows. This is not surprising, as all
FPGA CAD tools are derivations of conventional ASIC CAD flows. While the two RTR
implementation approaches (local and global) can provide significant performance
and capacity advantages, they are much more challenging to employ, primarily
because of a lack of specific tool support.
The designer's primary task when implementing global RTR applications is to
temporally divide the application into roughly equal-size partitions to efficiently
use reconfigurable resources. This is largely a manual process; although the
academic community has produced some partitioning tools, no commercial offerings
are currently available. The main disadvantage of global RTR is the need for
equal-size partitions. If it is not possible to evenly partition the application,
inefficient use of FPGA resources will result.
The main advantage of local RTR over global RTR is that it uses fine-grained
functional operators that may make more efficient use of FPGA resources. This is
important for applications that are not easily divided into equal-size, temporally
exclusive circuit partitions. However, partitioning a local RTR design may require
an inordinate amount of designer effort. For example, unlike global
`RTR, where circuit interfaces typically remain fixed between configurations,
`local RTR allows these interfaces to change with each configuration. When
`circuit configurations become small enough for multiple configurations to fit
into a single device, the designer needs to ensure that all configurations will
`interface correctly one with another. Moreover, the designer may have to ensure
`not only structural compliance but physical compliance as well. That is, when
`the designer creates circuit configurations that do not occupy an entire FPGA,
`he or she will have to ensure that the physical footprint of each is compatible
`with that of others that may be loaded concurrently.
`
`21.4
`
`IMPLEMENTING ARITHMETIC IN FPGAs
`
Almost since their invention, FPGAs have employed dedicated circuitry to accelerate
arithmetic computation. In earlier devices, dedicated circuitry sped up the
propagation of carry signals for ripple-carry, full-adder blocks. Later devices
added dedicated multipliers, DSP function blocks, and more complex fixed-function
circuitry. The presence of such dedicated circuitry can dramatically improve
arithmetic performance, but also restricts designers to a very small subset of
choices when implementing arithmetic.
Well-known approaches such as carry-look-ahead, carry-save, signed-digit, and so
on, generally do not apply to FPGAs. Though these techniques are commonly used to
create very high-performance arithmetic blocks in custom ICs, they are not
competitive when applied to FPGAs simply because they cannot access the faster,
dedicated circuitry and must be constructed using slower, general-purpose user
logic. Instead, FPGA designers accelerate arithmetic in one of two ways with FPGAs:
(1) using dedicated blocks if they fit the needs of the application, and
(2) avoiding the computation entirely, if possible. Designers apply the second
option by, for example, replacing full-blown floating-point computation with
simpler, though not equivalent, fixed-point or block floating-point computations.
In some cases, they can eliminate multiplication entirely with constant
propagation. Of course, the feasibility of replacing slower, complex functions with
simpler, faster ones is application dependent.
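As a small illustration of the constant-propagation point, the following Verilog
sketch (added here, not from the original text; the constant 10 and the operand
widths are assumptions) replaces a general multiplier with two shifts and an add:

// A minimal sketch of constant propagation: x*10 = x*8 + x*2, so the
// product reduces to two shifts and an add and needs no multiplier block.
module mul_by_const_10 (
  input  wire [15:0] x,
  output wire [19:0] y    // wide enough to hold 65535 * 10
);
  assign y = (x << 3) + (x << 1);
endmodule

Because the constant is known at design time, the circuit reduces to ordinary LUT
and carry logic rather than consuming a dedicated multiplier.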
`
`21.4.1 Fixed-point Number Representation and Arithmetic
A fixed-point number representation is simply an integer representation with an
implied binary point, usually in 2's complement format to enable the representation
of both positive and negative values. A common way of describing the structure of a
fixed-point number is to use a tuple: n, m, where n is the numbe