`
`John Villasenor, Brian Schoncr, Kang-Ngee Chia, Charles Zapata,
`Hea Joung Kim, Chris Jones, Shane Lansing. and Bill Mangione-Smith
`Electrical Engineering Department
`'
`University of California, Los Angeles
`Los Angeles, CA 90095-1594
`
`FPGAs can be used to build systems for automatic target
`recognition (ATR) that achieve an order of magnitude
`increase in performance over systems built using general
`purpose processors. This improvement is possible because
`the bit~ievei operations that comprise much of the ATR
`computational burden map extremely efiicientiy into
`FPGAs, and because the specificity of ATR target tern»
`plates can be leveraged via fast
`reconfiguration. We
`describe here algorithms, design tools, and implementa-
`tion strategies that are being used in a configurable com-
`puting system for ATR.
`
`1. Introduction
`
`The ability to rapidly modify the gate level logic of
`an FPGA during execution opens a number of new com-
`puting possibilities that have only recently begun to be
`explored. Configurable computing machines that exploit
`this ability will involve architectures that differ in impor~
`taut ways from those used in current machines, and will
`support a wide range of new and powerful capabilities. At
`the simplest level. a single FPGA can implement an arbi-
`trary number of designs in rapid succession, and can there-
`fore deliver the functionality of a device many times its
`size. More sophisticated implementations in which the
`configuration control receives input from the results of
`previous computations or frOm the external operating envi-
`ronment can also be envisioned. Finally, rapid reconfigura-
`tion makes feasible the implementation of dedicated logic
`circuits to support large numbers of highly specific compu-
`tational tasks that are wholly unsuited to ASIC implemen-
`tation.
`
`Configurable computing requires 1) commercially
`available FPGAs with Sufficiently fast configuration times,
`2) design tools that understand and take advantage of hard
`ware dynamisms, and 3) boards and other higher level
`interface and support hardware that will make fast-recon—
`figurable FPC‘rAs viable in real systems. Although there is
`
`not yet a base of commercial FPGAs with submillisecond
`reconfiguration times to satisfy the first of these require-
`ments.
`there has been prototype development
`in both
`industry and academia to explore the hardware issues of
`fast reconfiguration. It is our belief that this is an area
`which the FPGA vendors are well prepared to address. and
`that fast reconfiguration will receive increased commercial
`attention as the payoffs on the application side become
`clear. By contrast, design tools and systems to support use
`of dynamic computing devices have been in a state of rela-
`tive immaturity. We describe here results from an ongoing
`project to build a configurable computing system for auto-
`matic target recognition (ATR). Focusing on this applica-
`tion has furnished quantitative results on design time and
`challenges, configuration overhead, and computational
`efficiency of configurable computing systems. More gener-
`ally, it has led to an understanding of the requirements and
`hurdles involved in extending configurable computing to
`more general applications.
`
`The rest of this paper is organized as follows: The
`remainder of the introductiou includes a description of
`related work in configurable computing, and an overview
`of the automatic target recognition (ATR) problem that
`serves as the application focus for the new results pre~
`sented here. Section 2 introduces the basic mapping of
`ATR target templates into FPGA adder trees that forms the
`core of the processing, and discusses some of the trade-offs
`in using rapid reconfiguration to support ATR. Section 3
`discusses design and partitioning issues to support rapid
`implementation of a large set of target templates. Section 4
`presents a more detailed analysis of design trade-offs and
`considers some the issues of U0 and board design for a
`FPGA—based ATR system. Conclusions and a brief
`description of ongoing and future work are contained in
`Section 5.
`
`1.1 Overview of related work
`
`Configurable computing has been explored in both
`
`0-8186-7548-986 $05.00 © 1996 IEEE
`
`'70
`
`Petitioner Microsoft Corporation - Ex. 1051, p. 70
`Petitioner Microsoft Corporation - EX. 1051, p. 70
`
`
`
`academia and industry for several years. On the hardware
`side, several fast-reconfiguring FPGAs have been devel-
`oped. One of these is the Configurable Logic Array
`(CLAy) from National Semiconductor, which is a fine-
`grained device consisting of an array of 56 by 56 cells,
`each of which is roughly equivalent to a half adder and D-
`flip flop. Configuration bitstreams can be loaded into the
`CLAy using 8 pins, allowing a complete reconfiguration
`of the device in approximately 750 microseconds. The
`CLAy also supports partial reconfiguration, which reduces
`the reconfiguration time linearly in accordance with the
`fraction of the gate array concerned.
`
`An alternative to external loading of bitstrearns used
`in the CLAy and in most other FPGAs is the context-
`switched approach advocated by a group at MIT [1,2]. In
`this device, referred to as a Dynamically Programmable
`Gate Array (DPGA), multiple configurations reside simul-
`taneously on chip. One of these configurations occupies
`the active layer, and is the one which is actually executing.
`Any of the others can be switched to the active layer in
`one clock cycle. The increase in chip area to support three
`extra contexts is approximately 20%.
`
`Utilization of FPGAs as dynamic computing devices
`has been explored by several groups including a team led
`by Hutchings at Brigham Young University. Hutchings
`has performed a series of thorough studies in which the
`benefits of partial reconfiguration was explored using the
`application example of neural nets [3]. The Brigham
`Young group has also investigated the use of partial recon-
`figuration as a means to construct a computer with a
`dynamic instruction set [4]. This idea, which has also been
`discussed by Athanas and Silverman in [5], achieves
`
`increased efliciency by using FPGA resources to hold the
`instructions that are needed on an application-specific
`basis. These experiences have led to the formulation of
`design methodologies for partially reconfiguring systems
`[6], and to important quantitative results on the benefits of
`partial reconfiguration. For example, for the neural net
`application partial reconfiguration enabled a 25% reduc—
`tion in configuration time and a 50% increase in functional
`density compared with a system based on complete recon-
`figuration.
`
`In a previous publication [7] we described the imple
`mentation of a video communications system imple-
`mented using configurable computing techniques. This
`system delivers real time video at a rate of 8 frameslsec—
`end, and includes the steps of image transformation, quan-
`tization and run-length coding. and BPSK modulation!
`demodulation. These functions are implemented using a
`single 5000 gate CLAy, with rapid swapping of designs
`used to time share the gate array hardware. The rapid
`swapping of designs used in the video system has some
`commonalities, as well as some significant differences
`with the approach for ATR that we describe in the present
`paper.
`
`1.2 Application Description :: ATR
`
`Automatic target recognition is among the most
`demanding real time computational problems in existence.
`The challenge addressed by an ATR system is conceptu-
`ally simple -- to analyze a digitally represented input
`image or video sequence in order to automatically locate
`and identify all objects within the scene of interest to the
`observer. Since there are many types of imaging devices
`and many algorithmic choices available to a designer,
`
`Input SAR image
`several K
`‘—.-
`
`radar absorption). Templates with highest correlation are selected in the peak detection step.
`
`Bright template
`correlation result
`
`a-Peak Detect
`
`—b
`Surround temp ate
`correlation result
`
`Tom - late 1
`
`H Computation
`1301119an
`
`Figure 1 High level block diagram for ATR processing. The focus of attention algorithm identifies regions of interest,
`referred to as “chips”, in SAR images. Chips are correlated against a series of binary target template pairs, with each
`pair containing a bright template (identifying pixels of strong expected radar return) and a sun'ound template (strong
`
`Petitioner Microsoft Corporation - Ex. 1051, p. 71
`Petitioner Microsoft Corporation - EX. 1051, p. 71
`
`
`
`
`
`there are clearly a large number of possible ways to imple-
`ment an ATR system. In this paper we focus on a particular
`approach which is currently being applied in the US.
`Department of Defense Joint STARS airborne radar imag-
`ing platform, and which therefore has high current rele-
`vance and interest.
`
`The processing used in ATR is illustrated in simpli-
`fied format in Figure 1. Synthetic aperture radar (SAR)
`images consisting of 8-bit pixels and measuring several
`thousand pixels on a side and are generated in real time by
`the radar imager. Images are input to a focus—of—attention
`processor which identifies a set of regions of interest, each
`of which contains a potential target. These regions of inter-
`est, known by the potentially confusing term “chip“, must
`then be correlated with a very large number of target tern-
`plates. Target templates are binary; e. g. each pixel is repre-
`sented using one bit. The correlation results are output to a
`peak detector which identifies the template and relative
`offset at which the peak correlation value occurs. The cor-
`relation of chips with templates is the computational bot-
`tleneck in
`the
`system,
`involving
`data
`rates
`and
`computational requirements that exceed by several orders
`of magnitude the processing load in any other steps in the
`algorithm. While the precise system parameters vary with
`implementation, in the work described here we use chip
`sizes of 128 by 128 and template sizes of 16 by 16. A cor-
`relation of a single chip with a single template in this case
`involves consideration of approximately 1282 relative off-
`sets, corresponding to 105 bits of output data if the correla-
`
`tion outputs are represented using 6 bits. If there are 103
`templates to be evaluated per chip, the magnitude of the
`processing task becomes readily apparent when one con-
`siders that the imaging system produces many frames per
`second, each of which contains many chips.
`
`Figure 2 illustrates the correlation operation targeted
`for FPGA implementation in more detail. Target templates
`occur in pairs, one member of which is called the bright
`template and contains pixels from which a strong radar
`return is expected, and the other member of which is the
`surround template and identifies pixels where strong radar
`absorption is expected. In both cases the template is of size
`16 by 16, with pixels represented using only one bit. The
`templates tend to be sparsely populated, with only a rela—
`tively small percentage of the pixels set to 1. As will be
`discussed later, this preperty is important in obtaining high
`performance in F'PGA implementations. The first step of
`the correlation is known as a shapesum calculation,
`in
`which the 8—bit SAR chip is correlated with the bright tem»
`plate, providing for every pixel in the chip a number which
`is used for local gain control. The second step is the actual
`correlation, which is performed in parallel on eight differ-
`ent binary images. each of which is created by applying a
`different threshold to the chip. The binary images are cor-
`related with both the bright template and the surround tem-
`plate. producing eight pairs of correlation outputs. The
`shapesum value is used to select which output pair will be
`processed in the peak detection step.
`
`BitSlice
`
`Size 16x16
`540% populated
`
`Shapesum
`
`Size 1 6x16
`10-50% populated
`
`Threshold
`. select
`
`Chlp (128 x128 x 8 bits)
`
`= threshold (fixed
`for all templates
`
`= 1 bit correlate:
`
`
`
`Figure 2 Correlation operations showing shapesum calculation (top) and parallel correlations (bottom)
`
`72
`
`Petitioner Microsoft Corporation - Ex. 1051, p. 72
`Petitioner Microsoft Corporation - EX. 1051, p. 72
`
`
`
`2. Mapping and dynamic reconfiguration of
`target templates
`
`FPGAs offer an extremely attractive solution to the
`correlation problem. First of all, the operations being per-
`formed occur directly at the bit level and are dominated by
`shifts and adds, making them easy to map into the hard-
`ware provided by the FPGA. This contrasts. for example,
`with multiply-intensive algorithms which would make rel-
`atively poor utilization of FPGA resources. More impor-
`tantly, the sparse nature of the templates can be utilized to
`achieve a far more efficient implementation in the FPGA
`than could be realized in a general purpose correlator. This
`can be illustrated using the example of the simple template
`shown in Figure 3.
`
`Template 1
`
`
`
`Figure 3 Example binary template with five “on"
`pixels (top) and corresponding adder tree (bottom).
`
`
`
`
`
`In this example template, only 5 of the 24 pixels are
`“on". At any given relative offset between the template
`and chip, the correlation output is the sum of the five
`binary pixels in the chip that lie immediately above the
`“on" pixels in the template. The template can therefore be
`implemented in the FPGA as a simple adder tree as shown
`in Figure 3. The chip pixel values can be stored in flip-
`flops. and are shifted to the right by one flip flop with each
`clock cycle. Though correlation of a large image with a
`small mask is often understood conceptually in terms of
`the mask being scanned across the image, in this case the
`opposite is occurring - the template is hard-wired into the
`
`FPGA while the image pixels are clocked past it.
`
`Another important opportunity for increased effi—
`ciency lies in the potential to combine multiple templates
`on a single FPGA. The simplest way to do this is to spa-
`tially partition the FPGA into several smaller blocks. each
`of which handles the logic for a single template. Alterna-
`tively. one can seek to identify templates having some
`topological commonality, and which can therefore share
`parts of adder trees. This is illustrated in Figure 4, which
`
`Template 1
`
`Template 2
`
`tiple correlations.
`
`Output 1
`
`Output 2
`
`Figure 4 Template commonalities are exploited to
`reduce hardware requirements for computing mul-
`
`shows two templates which share several pixels in com-
`mon. and which can be mapped using a set of adder trees
`which leverage this overlap.
`
`The advantage of using FPGAs over ASICs is that
`FPGAS can be dynamically optimized at the gate level to
`exploit template characteristics. An ASIC would have to
`provide large general purpose adder trees to handle the
`worst case condition of summing all possible template
`bits. The FPGA, however, exploits the sparse nature of
`the templates, and only constructs the small adder trees
`required. We have also shown that FPGAs can exploit
`other factors such as collapsing adder trees with conunon
`elements, and packing unused data points into space-sav-
`ing RAM—based shift registers. The end result is that a sin-
`gle FPGA can efficiently compute several templates in
`parallel more efficiently than several general purpose cor-
`relating ASICs.
`
`'73
`
`Petitioner Microsoft Corporation - Ex. 1051, p. 73
`Petitioner Microsoft Corporation - EX. 1051, p. 73
`
`
`
`There are many factors that determine the perfor-
`mance gain that results from using an FPGA. One of the
`most important is FPGA reconfiguration time. Assuming
`that performing the correlation of a single chip requires
`approximately 1282 = 16K clock cycles, the reconfigura-
`tion must be performed in ~IO3 or fewer clock cycles to
`avoid prohibitive overhead.
`In this respect a context-
`switched FPGA with two contexts would be extremely
`useful. If the idle context could be loaded while the active
`
`then reconfiguration overhead
`context was processing,
`would vanish. The achievable parallelism is also a critical
`parameter. Based on the work to date, we estimate that we
`can map an average of 20 bright templates or 5 surround
`templates on a single 13000-gate FPGA.
`
`2.1 Experimental results for FPGA resource utilization
`
`The approach of using a template-specific adder tree
`achieves significant reduction in routing complexity over a
`general correlator which must include logic to support
`arbitrary templates. To a first approximation, the extent of
`this reduction is inversely propertional to the fraction of
`“on” pixels in the template. While this complexity reduc-
`tion is important, it alone is not sufficient to lead to effi-
`cient implementations on FPGAs. This is due primarily to
`the limited number of flip flops available on commercial
`FPGAs (for example, the Xilinx XC4010 and ATT ORCA
`2C10 contain 800 and 1024 flip flops respectively). This
`would not generally be sufficient to support buffering the
`112 pixels per chip row that are not actually under the tem-
`plate, but need to be wrapped around to the next row of the
`template. The total number of l-bit storage elements
`needed to hold buffered pixel values for all 16 rows is 16 *
`112 = 1792. Implementing these on the FPGA using the
`usual flip-flops based shift registers is inefficient. and for
`many FPGAs impossible.
`
`This problem can be resolved by collapsing the long
`Strings of image pixels that are not being actively corre-
`lated against a template into shift registers, which can be
`implemented very efficiently on some look-up—table based
`FPGAs. For example, RAMs in the Xilinx XC4000 library
`can be used as shift registers which delay data by some
`predetermined number of clock cycles. Each 16x1 bit
`RAM primitive uses up a function generator on the FPGA,
`and can implement an element which is effectively a lt'r
`bit shift register in which the internal bits cannot be
`accessed. A flip-flop is also needed at the output of each
`RAM to act as a buffer and synchronizer. A single control
`circuit is used to control the stepping of the address lines
`and the timely assertion of the write—enable and output-
`enable signals for all
`the RAM-based shift register ele-
`ments. This is a small over head price to pay for the saw
`ings in CLB usage relative to a brute force implementation
`using flip flops.
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`most templates.
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`Figure 5 Four templates that were mapped onto the
`Xilinx 4010 to explore resource utilization. These
`examples were generated by 4 rotations of a syn-
`thetic template. The number of “on" pixels is 91,
`which is higher than what would be expected for
`
`By contrast. the 256 image pixels that lie within the
`16 by 16 template boundary at any given time can be
`stored easily using flip-flop based registers, since there are
`sufficient flip-flops available to do this. and the adder tree
`structures do not consume flip-flops. Also, using standard
`flip-flop based shift registers for image pixels within the
`template simplifies the mapping process by allowing
`access to every pixel in the template. New templates can
`be implemented by simply connecting the template pixels
`of concern to the inputs of the adder tree structures. This
`leads to significant simplification of automated template
`mapping tools.
`
`To gain a fuller understanding of the FPGA resource
`trade-offs involved in template mapping, we implemented
`in parallel the four templates shown in Figure 5 onto the
`Xilinx 4010 using the techniques described above. Each
`template had 91 “on” pixels, few of which are shared with
`other templates. Since parallelism was low, this exercise
`gaves a worst—case estimate for the capacity of the 4010.
`The resulting FPGA resource utilization. as summarized in
`Table 1, shows that 313 flip—flops and 756 function genera-
`tors were used. The resources used by the two components
`of target correlation, namely storage of active pixels on the
`FPGA (lst row of table) and implementation of the adder
`tree corresponding to the templates (2nd row) are indepen-
`dent of each other. The resources used by the pixel storage
`are determined by the template Size, and are independent
`of the number of templates being implemented. Adding
`templates involves adding new adder tree structures, and
`will hence increase the number of function generators
`being used. The total number of templates implementable
`
`74
`
`Petitioner Microsoft Corporation - Ex. 1051, p. 74
`Petitioner Microsoft Corporation - EX. 1051, p. 74
`
`
`
`
`
`Flip
`Flops
`
`Fen.
`Gen.
`
`U0
`pins
`
`
`
`
`
`“um
`
`
`
`
`
`
`
`
`
`
`
`
`Table 1 Xilinx 4010 resource utilization for the four
`
`
`
`sample templates shown in Figure 5. Resources for
`the two basic functions implemented (pixel storage
`and adder trees) are shown separately. The adder
`trees account for the majority of the function gener—
`ator utilzation, and are a key factor in limiting the
`number of templates that can be implemented simul-
`taneeously in a single configuration.
`
`
`
`on an FPGA will be bounded by the number of usable
`function generators.
`
`The four templates in the example above exhibit a
`very low number of common pixels and would probably
`not be grouped together by the partitioning algorithm.
`These results suggest that in practice, we can expect to fit
`6-10 surround templates having a higher number of over-
`lapping pixels onto a 13000 gate FPGA. Since the bright
`templates are less populated than the surround templates.
`we estimate that 15-20 bright templates can be mapped
`onto a 13000 gate FPGA.
`
`3. ATR template partitioning issues
`
`To minimize the number of FPGA reconfigurations
`necessary to correlate a given target image against the
`entire set of templates,
`it is necessary to maximize the
`number of templates placed in every configuration of the
`FPGA. To accomplish this optimization goal. we want to
`partition the set of templates into groups that can share
`computations (i.e.,
`share adder
`trees) so that
`fewer
`resources are used per template. Since the set of templates
`may number in the thousands, and the goal may be to
`place ten to twenty templates per configuration, exhaustive
`enumeration of all of the possible groupings is not an
`option. Instead, it is best to seek a heuristic method which
`will furnish a good, although perhaps suboptimal, solu-
`tion.
`
`Correlation between two templates can establish the
`number of pixels in common, and is a good starting point
`for comparing and selecting templates. However, some
`extra analysis beyond performing iterative correlations on
`
`the template set is necessary. For example, a template with
`a large surface area may correlate well with several
`smaller templates, perhaps even completely subsuming
`them, but the smaller templates may not correlate with
`each other at all, and would involve no redundant compu-
`tations. There are two possible solutions to this. The first is
`to ensure that any template added to an existing group is
`approximately the same size as templates in the group.
`The second option is to compute the number of additions
`required each time a new template is brought in . effec-
`tively recomputing the adder IIee every time.
`
`Recomputing the entire adder tree is computation-
`ally expensive, and is not a good method of partitioning a
`set of templates into subsets. However. one of the heurst-
`ics used in deciding whether or not to include a template
`into a newly formed partition. is to determine the number
`of new terms that adding the template would create in the
`partition‘s adder tree. The assumption is that more terms
`would result in a significant. number of new additions,
`resulting in a wider and deeper adder tree. By keeping the
`number of new terms created to a minimum, newly added
`templates do not increase the number of additions by a sig-
`inificant amount. Using C-I—t, we have created a design
`tool to implement the partitioning process. C-H- allows us
`to abstract templates and sets of templates into easily
`manipulated objects. New methods of comparison can be
`implemented by adding new methods to the objects, mak-
`ing this system adaptable to different heuristics. Further-
`more,
`the system is not
`tied down to any particular
`platform, as C++ compilers are available for many differw
`ent platforms.
`
`The tool uses an iterative approach to partitioning
`templates. Templates which compare well
`to a chosen
`“base" template (usually selected by largest area) are
`removed from the main template set and placed in a sepa-
`rate partition. This process is repeated until all templates
`are partitioned. After the partitions have been selected, the
`design tool computes the adder tree for each partition. Fig—
`ures 6a-6c show the process of creating an adder tree from
`the templates in a partition. Within each partition. the tem-
`plates (shown in Figure 63) are searched for shared subsets
`of pixels (Figure 6b). These subsets, called terms, can be
`automatically added together.
`leading to a template
`description using terms instead of pixels (Figure 6c). The
`most commonly occurring addition of two terms is chosen
`to be added together, forming a new term which can be
`used by the templates. In this way. each template is rebuilt
`by combining terms in such a way that the most redundant
`additions are shared between templates, and the final result
`of the process are terms which compute entire templates.
`For the sample templates shown here, 54 additions would
`be required to compute the correlations for all five tem-
`plates in a naive approach. However, after combining the
`
`75
`
`Petitioner Microsoft Corporation - Ex. 1051, p. 75
`Petitioner Microsoft Corporation - EX. 1051, p. 75
`
`
`
`Figure 6: Example of template grouping to leverage
`pixels common to multiple templates. In this exam-
`ple,
`the grouping reduces the number of adders
`needed from 54 to 29.
`
`Figure 63: Five sample templates to be allocated on
`the same FPGA configuration.
`
`hardware required for efficient implementation and debug-
`ging. Minimizing
`these
`costs
`requires
`innovative
`approaches to system design.
`
`Figure 7 compares the configurable computing archi-
`tecture discussed here (to) with a traditional processor
`(7a) and a customizable computing architecture (7b). The
`traditional processor receives simple operands from a data
`memory. performs a simple operation in the program. and
`returns the result
`to data memory. Custom computers
`attempt to gain performance advantages over traditional
`processors by implementing common operations in hard-
`ware. Some fine-grain systems have been proposed [4] but
`the majority of successful custom computers implement
`complex operations in a large array of FPGAs [8,9] as
`depicted in Figure 'i'b. Large custom computers deliver tre-
`mendous performance for some applications. but have def-
`inite shortcomings. The number of FPGAs is typically
`fixed, or only slightly variable through the use of parallel
`cards. Large applications may not fit. and small applica-
`tions may inefficiently use available resources. Also, large
`
`
`
`Figure 6b. The terms that result from overlapping
`templates. Each term corresponds to a group of pix-
`els needed by a particular set of templates. These
`terms correspond to nodes on an adder tree.
`
`'Ibmplate A
`
`®+©+®+@+®+®
`
`Template B
`
`Template C
`
`Template D
`
`'Ibmplate E
`
`@+®+©+®+®
`
`®+@+©+®+©
`
`®+©
`
`®+©+©+®+O
`
`Figure 6c. The templates rewritten as sums of terms.
`The most common additions are coalesced into new
`
`terms. Here, 5 + 6 occurs four times, and is the first
`candidate to be replaced with a new term, 11.
`
`templates through the process described here, only 29
`additions are required.
`
`4. Configurable computing system design
`
`The increased performance of configurable systems
`comes with several costs. These include the time and
`
`bandwidth required for reconfiguration. memory and U0
`required to store intermediate results, and additional
`
`76
`
`Dynamic Computer
`
`Figure 7c
`
`Petitioner Microsoft Corporation - Ex. 1051, p. 76
`Petitioner Microsoft Corporation - EX. 1051, p. 76
`
`Figure 7a
`Traditional Processor
`
`Figure 7b
`Custom Computer
`
`media a
`Result
`Storage
`
`Memory
`
`Configuration .
`
`7 Architecture
`Figure
`comparisons. Traditional
`processors (7a) take sim-
`ple operands from mem—
`ory, perform an operation,
`and return the result
`to
`
`memory. Custom comput-
`ers (7b) gain performance
`advantages
`by
`putting
`common operations
`into
`hardware.
`In
`dynamic
`machines (7c) logic is pro-
`grammed
`by
`selecting
`configuration
`bitstrearns
`stored in nearby RAM.
`
`
`
`bit—plane 0 I-
`bit-plane l
`'C
`
`\
`“H .—p
`summation
`/'
`
`
`
`bit-plane 7
`
`:
`I-
`Figure 8a Eight FPGAs, each performing one bit
`correlations
`
`SRAM Holding "
`
`Figure 8b One FPGA, performing the correlations
`and adding the results serially
`
`custom computers can be difficult to partition and pro—
`gram.
`Our approach to configurable computing is to use a
`small number of rapidly reconfiguring FPGAs tightly cou-
`pled to an intermediate result memory and a configuration
`memory. This
`configurable
`computing
`architecture
`attempts to gain some of the advantages of a traditional
`programmable processor and some of the application_spe_
`cific performance of a custom computer. We have had suc-
`cess implementing a video compression and transmission
`system to this architecture [7]. and will now briefly
`describe some of the trade-offs involved in implementing
`an automatic target recognition (ATR) algorithm on this
`architecture.
`
`Partitioning of computations plays a major role in
`any parallel processing environment. For a coafigurable
`computing system, partitioning is particularly interesting
`because it is both a hardware and a software issue. Con-
`
`sider two methods for performing addition of shapesum
`bit~planes shown in Figure 8a. Method A is a straightfor-
`ward parallel implementation requiring several FPGAs,
`and has several drawbacks. First. the outputs from several
`FPGAs converge at the addition operation. This may cre-
`ate a severe U0 bottleneck. Second, the system is not scal-
`able. If the system requires more precision, and therefore
`more bit-planes, more FPGAs must be added to the sys—
`tern.
`
`Method B in Figure 8 illustrates our approach. Each
`bit plane is correlated individually. and then added to the
`
`previous results in the temporary storage. Method B is
`completely scalable to any image or template precision,
`
`and this simple architecture can implement all the correla-
`tion, normalization, and peak: detection routines required
`for ATR. One drawback ofmethod B is the cost and power
`required for the wide temporary result SRAM. Another
`.
`.
`.
`.
`.
`possible drawback is the extra execution time required to
`run ATR correlations in serial. The ratio of performance to
`number of FPGAs is roughly equivalent for the two meth-
`ods, and the performance gap can be closed by simply
`using more of the smaller method B boards.
`
`The approach of a reconfigurable FPGA connected to
`an intermediate memory allows us to have a fairly compli-
`cated flow of control. For example, the shapesum calcula-
`tion in ATR tends to be more difficult than the image-
`template correlation. A single FPGA might compute two
`shapcsums or four correlations. For this reason. one may
`wish to have a program that performs two shapesum oper-
`ations and forwards the results to a single correlation oper-
`ation as shown in Figure 9- LooPing and branching
`operations can also find uses for adjustable precision and
`adaptive filtering.
`
`Reconfigurations for 10k gate FPGAs are typically
`around 20113 in length. Reconfiguring every 20ms gives a
`reconfiguration bandwidth of approximately 1MB per
`FPGA per second. This reconfiguration bandwidth, cou-
`pled With the complexity 0f the flow 0f control can be han-
`dled by placement Of a
`small mierocontroller and
`configuration RAM next to every FPGA- The microcon-
`lroller permits complicated flow of canto}. and Since [he
`microcontroller addresses the configuration RAM» it frees
`UP valuable U0 0" the FPGA- The microcontroller 3150 is
`important for debugging. which is a major issue in config-
`
`Operation 2
`
`
`
`4 results
`
`Figure 9 Reconfiguration Flow Of Control Example
`
`7'?
`
`Petitioner Microsoft Corporation - Ex. 1051, p. 77
`Petitioner Microsoft Corporation - EX. 1051, p. 77
`
`
`
`urable systems because the many different hardware con~
`figurations can make, it difficult to isolate problems.
`
`Figure 10 shows the current (as of April 1996) ver-
`sion of the evolving configurable computing system for
`performing ATR. The principal components ate labeled in
`the graph and include a “dynamic" FPGA which is recon-
`figured on the fly, and which performs most of the comput—
`ing functions. and a “static” FPGA which is configured
`only once, and which performs control and some computa-
`tional functions. The EPROM holds configuration bit-
`streams, and the SRAM holds the input image data (eg.
`the chip). Because the correlation operation involves the
`application of a