RECONFIGURABLE COMPUTING
THE THEORY AND PRACTICE OF FPGA-BASED COMPUTATION

Edited by
Scott Hauck and André DeHon

MORGAN KAUFMANN PUBLISHERS
An imprint of Elsevier

AMSTERDAM • BOSTON • HEIDELBERG • LONDON • NEW DELHI • NEW YORK • OXFORD • PARIS • SAN DIEGO • SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO
Reconfigurable Computing
Hauck and DeHon

MORGAN KAUFMANN PUBLISHERS
An imprint of Elsevier
30 Corporate Drive, Suite 400, Burlington, MA 01803-4255

Copyright © 2008 by Elsevier Inc.
Original ISBN: 978-0-12-370522-8

All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission in writing from the publisher.

First printed in India 2011
Indian Reprint ISBN: 978-93-80931-86-9

This edition has been authorized by Elsevier for sale in the following countries: India, Pakistan, Nepal, Sri Lanka and Bangladesh. Sale and purchase of this book outside these countries is not authorized and is illegal.

Published by Elsevier, a division of Reed Elsevier India Private Limited.
Registered Office: 622, Indraprakash Building, 21 Barakhamba Road, New Delhi-110 001.
Corporate Office: 14th floor, Building No. 10B, DLF Cyber City Phase-II, Gurgaon-122 002, Haryana, India.

Printed and bound in India by Sanat Printers, Kundli-131 028
CHAPTER 21

IMPLEMENTING APPLICATIONS WITH FPGAs

Brad L. Hutchings, Brent E. Nelson
Department of Electrical and Computer Engineering
Brigham Young University
Developers can choose among various devices when implementing electronic systems: field-programmable gate arrays (FPGAs), microprocessors and other standard products such as application-specific standard products (ASSPs), and custom chips or application-specific integrated circuits (ASICs). This chapter discusses how FPGAs compare to other digital devices, outlines the considerations that will help designers to determine when FPGAs are appropriate for a specific application, and presents implementation strategies that exploit features specific to FPGAs.

The chapter is divided into four major sections. Section 21.1 discusses the strengths and weaknesses of FPGAs relative to other available devices. Section 21.2 suggests when FPGA devices are suitable choices for specific applications/algorithms, based upon their I/O and computation requirements. Section 21.3 discusses general implementation strategies appropriate for FPGA devices. Then Section 21.4 discusses FPGA-specific arithmetic design techniques.

21.1 STRENGTHS AND WEAKNESSES OF FPGAs

Developers can choose from three general classes of devices when implementing an algorithm or application: microprocessor, FPGA, or ASIC (for simplicity, ASSPs are not considered here). This section provides a brief summary of the advantages and disadvantages of these devices in terms of time to market, cost, development time, power consumption, and debug and verification.

21.1.1 Time to Market

Time to market is often touted as one of the FPGA's biggest strengths, at least relative to ASICs. With an ASIC, going from specification to product requires (at least): (1) design, (2) verification, (3) fabrication, (4) packaging, and (5) device test. In addition, software development requires access to the ASIC device (or an emulation of it) before it can be verified and completed. As immediately available standard devices, FPGAs have already been fabricated, packaged, and tested by the vendor, thereby eliminating at least four months from time to market.

More difficult to quantify but perhaps more important are: (1) refabrications (respins) caused by either errors in the design or late changes to the specification, due to a change in an evolving standard, for example, and (2) software development schedules that depend on access to the ASIC. Both of these items impact product production schedules; a respin can easily consume an additional four months, and early access to hardware can greatly accelerate software development and debug, particularly for the embedded software that communicates directly with the device.

In light of these considerations, a conservative estimate of the time-to-market advantage of FPGAs relative to ASICs is 6 to 12 months. Such a reduction is significant; in consumer electronics markets, many products have only a 24-month lifecycle.

21.1.2 Cost

Per device, FPGAs can be much less expensive than ASICs, especially in lower volumes, because the nonrecurring costs of FPGA fabrication are borne by many users. However, because of their reprogrammability, FPGAs require much more silicon area to implement equivalent functionality. Thus, at the highest volumes possible in consumer electronics, FPGA device cost will eventually exceed ASIC device cost.

21.1.3 Development Time

FPGA application development is most often approached as hardware design: applications are described in Verilog or VHDL, simulated to determine correctness, and synthesized using commercial logic synthesis tools. Commercial tools are available that synthesize behavioral programs written in sequential languages such as C to FPGAs. However, in most cases, much better performance and higher densities are achieved using HDLs, because they allow the user to directly describe and exploit the intrinsic parallelism available in an application. Exploiting application parallelism is the single best way to achieve high FPGA performance. However, designing highly parallel implementations of applications in HDLs requires significantly more development effort than software development with conventional sequential programming languages such as Java or C++.
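
To make the contrast concrete, the following is a minimal Verilog sketch (ours, not from the text; module and signal names are illustrative) that describes eight 16-bit additions all executing in the same clock cycle. A sequential C loop would express the same work as eight successive iterations; the HDL states directly that the eight adders exist side by side.

    // Eight 16-bit adders operating concurrently every clock cycle.
    // A sequential C loop would perform these additions one at a time.
    module parallel_add8 (
        input  wire         clk,
        input  wire [127:0] a_bus,   // eight packed 16-bit operands
        input  wire [127:0] b_bus,   // eight packed 16-bit operands
        output reg  [127:0] sum_bus  // eight packed 16-bit sums
    );
        genvar i;
        generate
            for (i = 0; i < 8; i = i + 1) begin : lane
                always @(posedge clk)
                    sum_bus[16*i +: 16] <= a_bus[16*i +: 16] + b_bus[16*i +: 16];
            end
        endgenerate
    endmodule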

21.1.4 Power Consumption

FPGAs consume more power than ASICs simply because programmability requires many more transistors, relative to a customized integrated circuit (IC). FPGAs may consume more or less power than a microprocessor or digital signal processor (DSP), depending on the application.

21.1.5 Debug and Verification

FPGAs are developed with standard hardware design techniques and tools. Coded in VHDL or Verilog and synthesized, FPGA designs can be debugged in simulators just as typical ASIC designs are. However, many designers verify their designs directly, by downloading them into an FPGA and testing them in a system. With this approach the application can be tested at speed (a million times faster than simulation) in the actual operating environment, where it is exposed to real-world conditions. If thorough, this testing provides a stronger form of functional verification than simulation. However, debugging applications in an FPGA can be difficult because vendor tools provide much less observability and controllability than, for example, a hardware description language (HDL) simulator.

21.1.6 FPGAs and Microprocessors

As discussed previously, FPGAs are most often contrasted with custom ASICs. However, if a programmable solution is dictated because of changing application requirements or other factors, it is important to study the application carefully to determine if it is possible to meet performance requirements with a programmable processor (a microprocessor or DSP). Code development for programmable processors requires much less effort than that required for FPGAs or ASICs, because developing software with sequential languages such as C or Java is much less taxing than writing parallel descriptions with Verilog or VHDL. Moreover, the coding and debugging environments for programmable processors are far richer than their HDL counterparts. Microprocessors are also generally much less expensive than FPGAs. If the microprocessor can meet application requirements (performance, power, etc.), it is almost always the best choice.

In general, FPGAs are well suited to applications that demand extremely high performance and reprogrammability, for interfacing components that communicate with many other devices (so-called glue logic), and for implementing hardware systems at volumes that make their economies of scale feasible. They are less well suited to products that will be produced at the highest possible volumes or for systems that must run at the lowest possible power.

21.2 APPLICATION CHARACTERISTICS AND PERFORMANCE

Application performance is largely determined by the computational and I/O requirements of the system. Computational requirements dictate how much hardware parallelism can be used to increase performance. I/O system limitations and requirements determine how much performance can actually be exploited from the parallel hardware.

21.2.1 Computational Characteristics and Performance

FPGAs can outperform today's processors only by exploiting massive amounts of parallelism. Their technology has always suffered from a significant clock-rate disadvantage; FPGA clock rates have always been slower than CPU clock rates by about a factor of 10. This remains true today, with clock rates for FPGAs
limited to about 300 to 350 MHz and CPUs operating at approximately 3 GHz. As a result, FPGAs must perform at least 10 times the computational work per cycle to perform on par with processors. To be a compelling alternative, an FPGA-based solution should exceed the performance of a processor-based solution by 5 to 10 times and hence must actually perform 50 to 100 times the computational work per clock cycle. This kind of performance is feasible only if the target application exhibits a corresponding amount of exploitable parallelism.
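
A rough worked example of the arithmetic behind this guideline (our numbers; the one-operation-per-cycle processor figure is an assumption for illustration):

    speedup = (FPGA ops/cycle × FPGA clock) / (CPU ops/cycle × CPU clock)
            = (100 × 300 MHz) / (1 × 3 GHz) = 10

That is, 100 times the per-cycle work at one-tenth the clock rate nets a 10 times speedup; 50 times the per-cycle work nets 5 times.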

The guideline of 5 to 10 times is suggested for two main reasons. First of all, prior to actual implementation, it is difficult or impossible to foresee the impact of various system and I/O issues on eventual performance. In our experience, 5 times can quickly become 2 times or less as various system and algorithmic issues arise during implementation. Second, application development for FPGAs is much more difficult than conventional software development. For that reason, the additional development effort must be carefully weighed against the potential performance advantages. A guideline of 5 to 10 times provides some insurance that any FPGA-specific performance advantages will not completely vanish during the implementation phase.

Ultimately, the intrinsic characteristics of the application place an upper bound on FPGA performance. They determine how much raw parallelism exists, how exploitable it is, and how fast the clock can operate. A review of the literature [3-6, 11, 16, 19-21, 23, 26, 28] shows that the application characteristics that have the most impact on application performance are: data parallelism, amenability to pipelining, data element size and arithmetic complexity, and simple control requirements.

Data parallelism
Large datasets with few or no data dependencies are ideal for FPGA implementation for two reasons: (1) they enable high performance because many computations can occur concurrently, and (2) they allow operations to be extensively rescheduled. As previously mentioned, concurrency is extremely important because FPGA applications must be able to achieve 50 to 100 times the operations per clock cycle of a microprocessor to be competitive. The ability to reschedule computations is also important because it makes it feasible to tailor the circuit design to FPGA hardware and achieve higher performance. For example, computations can be scheduled to maximize data reuse to increase performance and reduce memory bandwidth requirements. Image-processing algorithms with their attendant data parallelism have been among the highest-performing algorithms mapped to FPGA devices.

Data element size and arithmetic complexity
Data element size and arithmetic complexity are important because they strongly influence circuit size and speed. For applications with large amounts of exploitable parallelism, the upper limit on this parallelism is often determined by how many operations can be performed concurrently on the FPGA device. Larger data elements and greater arithmetic complexity lead to larger
and fewer computational elements and less parallelism. Moreover, larger and more complex circuits exhibit more delay that slows clock rate and impacts performance. Not surprisingly, representing data with the fewest possible bits and performing computation with the simplest operators generally lead to the highest performance. Designing high-performance applications in FPGAs almost always involves a precision/performance trade-off.

Pipelining
Pipelining is essential to achieving high performance in FPGAs. Because FPGA performance is limited primarily by interconnect delay, pipelining (inserting registers on long circuit pathways) is an essential way to improve clock rate (and therefore throughput) at the cost of latency. In addition, pipelining allows computational operations to be overlapped in time and leads to more parallelism in the implementation. Generally speaking, because pipelining is used extensively throughout FPGA-based designs, applications must be able to tolerate some latency (via pipelining) to be suitable candidates for FPGA implementation.
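
As an illustration, the following is a minimal Verilog sketch (ours, not from the text) of a two-stage pipelined multiply-add. The registers split the multiply from the add, so the clock period is set by the slower of the two stages rather than by their sum; the price is two cycles of latency, but a new operand set can be accepted every cycle.

    // Two-stage pipelined multiply-add: y = a*b + c.
    // Stage 1 registers the product; stage 2 registers the sum.
    // Throughput: one result per cycle. Latency: two cycles.
    module pipelined_mac (
        input  wire        clk,
        input  wire [15:0] a, b,
        input  wire [31:0] c,
        output reg  [31:0] y
    );
        reg [31:0] prod_r;  // stage 1 pipeline register
        reg [31:0] c_r;     // delays c to stay aligned with prod_r

        always @(posedge clk) begin
            prod_r <= a * b;        // stage 1: multiply
            c_r    <= c;
            y      <= prod_r + c_r; // stage 2: add
        end
    endmodule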

Simple control requirements
FPGAs achieve the highest performance if all operations can be statically scheduled as much as possible (this is true of many technologies). Put simply, it takes time to make decisions, and decision-making circuitry is often on the critical path for many algorithms. Replacing runtime decision circuitry with static control eliminates circuitry and speeds up execution. It makes it much easier to construct circuit pipelines that are heavily utilized with few or no pipeline bubbles. In addition, statically scheduled controllers require less circuitry, making room for more datapath operators, for example. In general, datasets with few or no dependencies often have simple control requirements.

21.2.2 I/O and Performance

As mentioned previously, FPGA clock rates are at least one order of magnitude slower than those of CPUs. Thus, significant parallelism (either data parallelism or pipelining) is required for an FPGA to be an attractive alternative to a CPU. However, I/O performance is just as important: Data must be transmitted at rates that can keep all of the parallel hardware busy.

Algorithms can be loosely grouped into two categories: I/O bound and compute bound [17, 18]. At the simplest level, if the number of I/O operations is equal to or greater than the number of calculations in the computation, the computation is said to be I/O bound. To increase its performance requires an increase in memory bandwidth; doing more computation in parallel will have no effect. Conversely, if the number of computations is greater than the number of I/O operations, computational parallelism may provide a speedup.

A simple example of this, provided by Kung [18], is matrix-matrix multiplication. The total number of I/Os in the computation, for n-by-n matrices, is 3n²: each matrix must be read and the product written back. The total number of computations to be done, however, is n³. Thus, this computation is
compute bound. In contrast, matrix-matrix addition requires 3n² I/Os and 3n² calculations and is thus I/O bound. Another way to see this is to note that each source element read from memory in a matrix-matrix multiplication is used n times and each result is produced using n multiply-accumulate operations. In matrix-matrix addition, each element fetched from memory is used only once and each result is produced from only a single addition.
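
Dividing the two counts makes the distinction explicit (a restatement of Kung's figures above):

    matrix-matrix multiply:  n³ operations / 3n² I/Os = n/3  (grows with n: compute bound)
    matrix-matrix addition:  3n² operations / 3n² I/Os = 1   (constant: I/O bound)

For 1000-by-1000 matrices, the multiply thus performs roughly 333 operations per element transferred, so additional parallel hardware can be kept busy; the addition performs only one.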

Carefully coordinating data transfer, I/O movement, and computation order is crucial to achieving enough parallelism to provide effective speedup. The entire field of systolic array design is based on the concepts of (1) arranging the I/O and computation in a compute-bound application so that each data element fetched from memory is reused multiple times, and (2) keeping many processing elements busy operating in parallel on that data.
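
The following minimal Verilog sketch (ours, not from the text) shows one multiply-accumulate processing element of the kind a systolic matrix multiplier chains together: each operand is registered and forwarded to the neighboring PE, so a value fetched from memory once is reused by every PE it flows through.

    // One systolic processing element: accumulates a*b while passing
    // both operands downstream, so each memory fetch is reused across
    // the entire chain of PEs.
    module systolic_pe (
        input  wire        clk,
        input  wire        clear,          // start a new dot product
        input  wire [15:0] a_in, b_in,
        output reg  [15:0] a_out, b_out,   // forwarded to the next PE
        output reg  [39:0] acc             // running dot-product sum
    );
        always @(posedge clk) begin
            a_out <= a_in;                 // pass operands downstream
            b_out <= b_in;
            if (clear)
                acc <= a_in * b_in;
            else
                acc <= acc + a_in * b_in;
        end
    endmodule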

FPGAs offer a wide variety of memory elements that can be used to coordinate I/O and computation: flip-flops to provide single-bit storage (10,000s of bits); LUT-based RAM to provide many small blocks of randomly distributed memory (100,000s of bits); and larger RAM or ROM memories (1,000,000s of bits). Some vendors' FPGAs contain multiple sizes of random access memories, and these memories are often easily configured into special-purpose structures such as dynamic-length shift registers, content-addressable memories (CAMs), and so forth. In addition to these types of on-chip memory, most FPGA platforms provide off-chip memory as well.
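
As an example of the coding style involved, the following Verilog sketch (ours; the exact mapping is tool dependent) describes a small simple-dual-port memory that synthesis tools typically infer as LUT-based RAM or block RAM:

    // Simple dual-port RAM: one write port, one registered read port.
    // FPGA synthesis tools commonly map this pattern onto LUT RAM or
    // block RAM, depending on depth and width.
    module sdp_ram #(
        parameter WIDTH = 16,
        parameter DEPTH = 256
    ) (
        input  wire                     clk,
        input  wire                     we,
        input  wire [$clog2(DEPTH)-1:0] waddr, raddr,
        input  wire [WIDTH-1:0]         wdata,
        output reg  [WIDTH-1:0]         rdata
    );
        reg [WIDTH-1:0] mem [0:DEPTH-1];

        always @(posedge clk) begin
            if (we)
                mem[waddr] <= wdata;
            rdata <= mem[raddr];  // registered read (block-RAM friendly)
        end
    endmodule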

Increasing the I/O bandwidth to memory is usually critical in harnessing the parallelism inherent in a computation. That is, after some point, further multiplying the number of processing elements (PEs) in a design (to increase parallelism) usually requires a corresponding increase in I/O. This additional I/O can often be provided by the many on-chip memories in a typical modern FPGA. The work of Graham and Nelson [8] describes a series of early experiments to map time-delay SONAR beam forming to an FPGA platform where memory bandwidth was the limiting factor in design speedup. While the data to be processed were an infinite stream of large data blocks, many of the other data structures in the computation were not large (e.g., coefficients, delay values). In this computation, it was not the total amount of memory that limited the speedup but rather the number of memory ports available. Thus, the use of multiple small memories in parallel was able to provide the needed bandwidth.

The availability of many small memories in today's FPGAs further supports the idea of trading off computation for table lookup. Conventional FPGA fabrics are based on a foundation of 4-input LUTs; in addition, larger on-chip memories can be used to support larger lookup structures. Because the memories already exist on chip, unlike in ASIC technology, using them adds no additional cost to the system. A common approach in FPGA-based design, therefore, is to evaluate which parts of the system's computations might lend themselves to table lookup and use the available RAM blocks for these lookups.
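
As a small example of the technique, the following Verilog sketch (ours) replaces a runtime computation with a lookup into an initialized on-chip memory. Here the table holds the square of every 8-bit input, but any fixed function of the address could be stored the same way:

    // Table lookup in place of computation: a 256-entry ROM returns
    // f(x) = x*x for an 8-bit x. The table is filled at elaboration
    // time, so no multiplier is needed at runtime; synthesis maps the
    // array onto block RAM or LUTs.
    module square_lut (
        input  wire        clk,
        input  wire [7:0]  x,
        output reg  [15:0] fx
    );
        reg [15:0] rom [0:255];
        integer i;
        initial begin
            for (i = 0; i < 256; i = i + 1)
                rom[i] = i * i;
        end

        always @(posedge clk)
            fx <= rom[x];
    endmodule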

In summary, the performance of FPGA-based applications is largely determined by how much exploitable parallelism is available, and by the ability of the system to provide data to keep the parallel hardware operational.
21.3 GENERAL IMPLEMENTATION STRATEGIES FOR FPGA-BASED SYSTEMS

In contrast with other programmable technologies such as microprocessors or DSPs, FPGAs provide an extremely rich and complex set of implementation alternatives. Designers have complete control over arithmetic schemes and number representation and can, for example, trade precision for performance. In addition, reprogrammable, SRAM-based FPGAs can be configured any number of times to provide additional implementation flexibility for further tailoring the implementation to lower cost and make better use of the device.

There are two general configuration strategies for FPGAs: configure-once, where the application consists of a single configuration that is downloaded for the duration of the application's operation, and runtime reconfiguration (RTR), where the application consists of multiple configurations that are "swapped" in and out as the application operates [14].

21.3.1 Configure-once

Configure-once (during operation) is the simplest and most common way to implement applications with reconfigurable logic. The distinctive feature of configure-once applications is that they consist of a single system-wide configuration. Prior to operation, the FPGAs comprising the reconfigurable resource are loaded with their respective configurations. Once operation commences, they remain in this configuration until the application completes. This approach is very similar to using an ASIC for application acceleration. From the application point of view, it matters little whether the hardware used to accelerate the application is an FPGA or a custom ASIC because it remains constant throughout its operation.

The configure-once approach can also be applied to reconfigurable applications to achieve significant acceleration. There are classes of applications, for example, where the input data varies but remains constant for hours, days, or longer. In some cases, data-specific optimizations can be applied to the application circuitry and lead to dramatic speedup. Of course, when the data changes, the circuit-specific optimizations need to be reapplied and the bitstream regenerated. Applications of this sort consist of two elements: (1) the FPGA and system hardware, and (2) an application-specific compiler that regenerates the bitstream whenever the application-specific data changes. This approach has been used, for example, to accelerate SNORT, a popular packet filter used to improve network security [13]. SNORT data consists of regular expressions that detect malicious packets by their content. It is relatively static, and new regular expressions are occasionally added as new attacks are detected. The application-specific compiler translates these regular expressions into FPGA hardware that matches packets many times faster than software SNORT. When new regular expressions are added to the SNORT database, the compiler is rerun and a new configuration is created and downloaded to the FPGA.
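
As a much-simplified illustration of the kind of circuit such a compiler might emit (ours, not from the cited work; real SNORT rules are full regular expressions), the following Verilog sketch matches one fixed 4-byte pattern against a byte stream, one byte per clock:

    // Hardwired matcher for the fixed byte sequence "EVIL" in a
    // stream, one byte per clock. A pattern-to-hardware compiler
    // would emit one such chain (or an NFA) per rule; changing the
    // rule set means regenerating and reloading the bitstream.
    module match_evil (
        input  wire       clk,
        input  wire       rst,
        input  wire [7:0] byte_in,
        output wire       hit
    );
        reg [3:0] state;  // state[i] = first i+1 pattern bytes matched

        always @(posedge clk) begin
            if (rst)
                state <= 4'b0000;
            else begin
                state[0] <= (byte_in == "E");
                state[1] <= state[0] & (byte_in == "V");
                state[2] <= state[1] & (byte_in == "I");
                state[3] <= state[2] & (byte_in == "L");
            end
        end

        assign hit = state[3];
    endmodule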
21.3.2 Runtime Reconfiguration

Whereas configure-once applications statically allocate logic for the duration of an application, RTR applications use a dynamic allocation scheme that re-allocates hardware at runtime. Each application consists of multiple configurations per FPGA, with each one implementing some fraction of it. Whereas a configure-once application configures the FPGA once before execution, an RTR application typically reconfigures it many times during the normal operation.

There are two basic approaches that can be used to implement RTR applications: global and local (sometimes referred to as partial configuration in the literature). Both techniques use multiple configurations for a single application, and both reconfigure the FPGA during application execution. The principal difference between the two is the way the dynamic hardware is allocated.

Global RTR
Global RTR allocates all (FPGA) hardware resources in each configuration step. More specifically, global RTR applications are divided into distinct temporal phases, with each phase implemented as a single system-wide configuration that occupies all system FPGA resources. At runtime, the application steps through each phase by loading all of the system FPGAs with the appropriate configuration data associated with a given phase.

Local RTR
Local RTR takes an even more flexible approach to reconfiguration than does global RTR. As the name implies, these applications locally (or selectively) reconfigure subsets of the logic as they execute. Local RTR applications may configure any percentage of the reconfigurable resources at any time; individual FPGAs may be configured, or even single FPGA devices may themselves be partially reconfigured on demand. This flexibility allows hardware resources to be tailored to the runtime profile of the application with finer granularity than that possible with global RTR. Whereas global RTR approaches implement the execution process by loading relatively large, global application partitions, local RTR applications need load only the necessary functionality at each point in time. This can reduce the amount of time spent downloading configurations and can lead to a more efficient runtime hardware allocation.

The organization of local RTR applications is based more on a functional division of labor than the phased partitioning used by global RTR applications. Typically, local RTR applications are implemented by functionally partitioning an application into a set of fine-grained operations. These operations need not be temporally exclusive; many of them may be active at one time. This is in direct contrast to global RTR, where only one configuration (per FPGA) may be active at any given time. Still, with local RTR it is important to organize the operations such that idle circuitry is eliminated or greatly reduced. Each operation is implemented as a distinct circuit module, and these circuit modules are then downloaded to the FPGAs as necessary during operation. Note that, unlike global RTR, several of these operations may be loaded simultaneously, and each may consume any portion of the system FPGA resources.
RTR applications
Runtime Reconfigured Artificial Neural Network (RRANN) is an early example of a global RTR application [7]. RRANN divided the back-propagation algorithm (used to train neural networks) into three temporally exclusive configurations that were loaded into the FPGA in rapid succession during operation. It demonstrated a 500 percent increase in density by eliminating idle circuitry in individual algorithm phases.

RRANN was followed up with RRANN-2 [9], an application using local RTR. Like RRANN, the algorithm was still divided into three distinct phases. However, unlike the earlier version, the phases were carefully designed so that they shared common circuitry, which was placed and routed into identical physical locations for each phase. Initially, only the first configuration was loaded; thereafter, the common circuitry remained resident and only circuit differences were loaded during operation. This reduced configuration overhead by 25 percent over the global RTR approach.

The Dynamic Instruction Set Computer (DISC) [29] used local RTR to create a sequential control processor with a very small fixed core that remained resident at all times. This resident core was augmented by circuit modules that were dynamically loaded as required by the application. DISC was used to implement an image-processing application that consisted of various filtering operations. At runtime, the circuit modules were loaded as necessary. Although the application used all of the filtering circuit modules, it did not require all of them to be loaded simultaneously. Thus, DISC loaded circuit modules on demand as required. Only a few active circuit modules were ever resident at any time, allowing the application to fit in a much smaller device than possible with global RTR.

21.3.3 Summary of Implementation Issues

Of the two general implementation techniques, configure-once is the simplest and is best supported by commercially available tool flows. This is not surprising, as all FPGA CAD tools are derivations of conventional ASIC CAD flows. While the two RTR implementation approaches (local and global) can provide significant performance and capacity advantages, they are much more challenging to employ, primarily because of a lack of specific tool support.

The designer's primary task when implementing global RTR applications is to temporally divide the application into roughly equal-size partitions to efficiently use reconfigurable resources. This is largely a manual process; although the academic community has produced some partitioning tools, no commercial offerings are currently available. The main disadvantage of global RTR is the need for equal-size partitions. If it is not possible to evenly partition the application, inefficient use of FPGA resources will result.

The main advantage of local RTR over global RTR is that it uses fine-grained functional operators that may make more efficient use of FPGA resources. This is important for applications that are not easily divided into equal-size, temporally exclusive circuit partitions. However, partitioning a local RTR design may require an inordinate amount of designer effort. For example, unlike global
RTR, where circuit interfaces typically remain fixed between configurations, local RTR allows these interfaces to change with each configuration. When circuit configurations become small enough for multiple configurations to fit into a single device, the designer needs to ensure that all configurations will interface correctly one with another. Moreover, the designer may have to ensure not only structural compliance but physical compliance as well. That is, when the designer creates circuit configurations that do not occupy an entire FPGA, he or she will have to ensure that the physical footprint of each is compatible with that of others that may be loaded concurrently.

21.4 IMPLEMENTING ARITHMETIC IN FPGAs

Almost since their invention, FPGAs have employed dedicated circuitry to accelerate arithmetic computation. In earlier devices, dedicated circuitry sped up the propagation of carry signals for ripple-carry, full-adder blocks. Later devices added dedicated multipliers, DSP function blocks, and more complex fixed-function circuitry. The presence of such dedicated circuitry can dramatically improve arithmetic performance, but also restricts designers to a very small subset of choices when implementing arithmetic.

Well-known approaches such as carry-look-ahead, carry-save, signed-digit, and so on, generally do not apply to FPGAs. Though these techniques are commonly used to create very high-performance arithmetic blocks in custom ICs, they are not competitive when applied to FPGAs simply because they cannot access the faster, dedicated circuitry and must be constructed using slower, general-purpose user logic. Instead, FPGA designers accelerate arithmetic in one of two ways: (1) using dedicated blocks if they fit the needs of the application, and (2) avoiding the computation entirely, if possible. Designers apply the second option by, for example, replacing full-blown floating-point computation with simpler, though not equivalent, fixed-point or block floating-point computations. In some cases, they can eliminate multiplication entirely with constant propagation. Of course, the feasibility of replacing slower, complex functions with simpler, faster ones is application dependent.
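
For instance, when one multiplicand is a compile-time constant, constant propagation reduces the multiplier to shifts and adds. The following Verilog sketch (ours) computes y = 10x as 8x + 2x, which costs one adder and no multiplier, since the shifts are free in the routing:

    // Constant propagation eliminates the multiplier:
    // 10*x = 8*x + 2*x = (x << 3) + (x << 1).
    module mul_by_10 (
        input  wire [15:0] x,
        output wire [19:0] y   // wide enough for 10 * (2^16 - 1)
    );
        assign y = (x << 3) + (x << 1);
    endmodule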

21.4.1 Fixed-point Number Representation and Arithmetic

A fixed-point number representation is simply an integer representation with an implied binary point, usually in 2's complement format to enable the representation of both positive and negative values. A common way of describing the structure of a fixed-point number is to use a tuple: n, m, where n is the num
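
The source is cut off at this point. As an illustration of the idea, the following Verilog sketch (ours) assumes the common convention that the tuple gives a total width of n bits with m of them fractional, here (n, m) = (16, 8); it multiplies two such 2's-complement values and rescales the product back to the input format:

    // Fixed-point multiply under an assumed (n, m) = (16, 8) format:
    // 16-bit 2's-complement values with an implied binary point 8 bits
    // from the right (Q8.8). The raw product of two Q8.8 values is
    // Q16.16; an arithmetic right shift by m = 8 returns it to Q8.8
    // (truncating; no rounding or overflow saturation).
    module fixmul_q8_8 (
        input  wire signed [15:0] a, b,  // Q8.8 operands
        output wire signed [15:0] p      // Q8.8 product (truncated)
    );
        wire signed [31:0] full = a * b;  // Q16.16 intermediate
        assign p = full >>> 8;            // arithmetic shift keeps sign
    endmodule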
