Configurable Computing Solutions for Automatic Target Recognition

John Villasenor, Brian Schoner, Kang-Ngee Chia, Charles Zapata, Hea Joung Kim, Chris Jones, Shane Lansing, and Bill Mangione-Smith
Electrical Engineering Department
University of California, Los Angeles
Los Angeles, CA 90095-1594

FPGAs can be used to build systems for automatic target recognition (ATR) that achieve an order of magnitude increase in performance over systems built using general purpose processors. This improvement is possible because the bit-level operations that comprise much of the ATR computational burden map extremely efficiently into FPGAs, and because the specificity of ATR target templates can be leveraged via fast reconfiguration. We describe here algorithms, design tools, and implementation strategies that are being used in a configurable computing system for ATR.

1. Introduction

The ability to rapidly modify the gate-level logic of an FPGA during execution opens a number of new computing possibilities that have only recently begun to be explored. Configurable computing machines that exploit this ability will involve architectures that differ in important ways from those used in current machines, and will support a wide range of new and powerful capabilities. At the simplest level, a single FPGA can implement an arbitrary number of designs in rapid succession, and can therefore deliver the functionality of a device many times its size. More sophisticated implementations in which the configuration control receives input from the results of previous computations or from the external operating environment can also be envisioned. Finally, rapid reconfiguration makes feasible the implementation of dedicated logic circuits to support large numbers of highly specific computational tasks that are wholly unsuited to ASIC implementation.

Configurable computing requires 1) commercially available FPGAs with sufficiently fast configuration times, 2) design tools that understand and take advantage of hardware dynamism, and 3) boards and other higher-level interface and support hardware that will make fast-reconfigurable FPGAs viable in real systems. Although there is not yet a base of commercial FPGAs with submillisecond reconfiguration times to satisfy the first of these requirements, there has been prototype development in both industry and academia to explore the hardware issues of fast reconfiguration. It is our belief that this is an area which the FPGA vendors are well prepared to address, and that fast reconfiguration will receive increased commercial attention as the payoffs on the application side become clear. By contrast, design tools and systems to support use of dynamic computing devices have been in a state of relative immaturity. We describe here results from an ongoing project to build a configurable computing system for automatic target recognition (ATR). Focusing on this application has furnished quantitative results on design time and challenges, configuration overhead, and computational efficiency of configurable computing systems. More generally, it has led to an understanding of the requirements and hurdles involved in extending configurable computing to more general applications.

The rest of this paper is organized as follows: the remainder of the introduction includes a description of related work in configurable computing, and an overview of the automatic target recognition (ATR) problem that serves as the application focus for the new results presented here. Section 2 introduces the basic mapping of ATR target templates into FPGA adder trees that forms the core of the processing, and discusses some of the trade-offs in using rapid reconfiguration to support ATR. Section 3 discusses design and partitioning issues to support rapid implementation of a large set of target templates. Section 4 presents a more detailed analysis of design trade-offs and considers some of the issues of I/O and board design for an FPGA-based ATR system. Conclusions and a brief description of ongoing and future work are contained in Section 5.

1.1 Overview of related work

Configurable computing has been explored in both academia and industry for several years. On the hardware side, several fast-reconfiguring FPGAs have been developed. One of these is the Configurable Logic Array (CLAy) from National Semiconductor, which is a fine-grained device consisting of an array of 56 by 56 cells, each of which is roughly equivalent to a half adder and D flip-flop. Configuration bitstreams can be loaded into the CLAy using 8 pins, allowing a complete reconfiguration of the device in approximately 750 microseconds. The CLAy also supports partial reconfiguration, which reduces the reconfiguration time linearly in accordance with the fraction of the gate array concerned.

0-8186-7548-9/96 $05.00 © 1996 IEEE
Petitioner Microsoft Corporation - Ex. 1051, p. 70

An alternative to the external loading of bitstreams used in the CLAy and in most other FPGAs is the context-switched approach advocated by a group at MIT [1,2]. In this device, referred to as a Dynamically Programmable Gate Array (DPGA), multiple configurations reside simultaneously on chip. One of these configurations occupies the active layer, and is the one which is actually executing. Any of the others can be switched to the active layer in one clock cycle. The increase in chip area to support three extra contexts is approximately 20%.

Utilization of FPGAs as dynamic computing devices has been explored by several groups including a team led by Hutchings at Brigham Young University. Hutchings has performed a series of thorough studies in which the benefits of partial reconfiguration were explored using the application example of neural nets [3]. The Brigham Young group has also investigated the use of partial reconfiguration as a means to construct a computer with a dynamic instruction set [4]. This idea, which has also been discussed by Athanas and Silverman in [5], achieves increased efficiency by using FPGA resources to hold the instructions that are needed on an application-specific basis. These experiences have led to the formulation of design methodologies for partially reconfiguring systems [6], and to important quantitative results on the benefits of partial reconfiguration. For example, for the neural net application partial reconfiguration enabled a 25% reduction in configuration time and a 50% increase in functional density compared with a system based on complete reconfiguration.

In a previous publication [7] we described the implementation of a video communications system built using configurable computing techniques. This system delivers real-time video at a rate of 8 frames/second, and includes the steps of image transformation, quantization and run-length coding, and BPSK modulation/demodulation. These functions are implemented using a single 5000-gate CLAy, with rapid swapping of designs used to time-share the gate array hardware. The rapid swapping of designs used in the video system has some commonalities with, as well as some significant differences from, the approach for ATR that we describe in the present paper.

1.2 Application description: ATR

Automatic target recognition is among the most demanding real-time computational problems in existence. The challenge addressed by an ATR system is conceptually simple: to analyze a digitally represented input image or video sequence in order to automatically locate and identify all objects within the scene of interest to the observer. Since there are many types of imaging devices and many algorithmic choices available to a designer, there are clearly a large number of possible ways to implement an ATR system. In this paper we focus on a particular approach which is currently being applied in the U.S. Department of Defense Joint STARS airborne radar imaging platform, and which therefore has high current relevance and interest.

Figure 1 High-level block diagram for ATR processing. The focus-of-attention algorithm identifies regions of interest, referred to as "chips", in SAR images. Chips are correlated against a series of binary target template pairs, with each pair containing a bright template (identifying pixels of strong expected radar return) and a surround template (strong radar absorption). Templates with highest correlation are selected in the peak detection step.

The processing used in ATR is illustrated in simplified form in Figure 1. Synthetic aperture radar (SAR) images consisting of 8-bit pixels and measuring several thousand pixels on a side are generated in real time by the radar imager. Images are input to a focus-of-attention processor which identifies a set of regions of interest, each of which contains a potential target. These regions of interest, known by the potentially confusing term "chip", must then be correlated with a very large number of target templates. Target templates are binary; i.e., each pixel is represented using one bit. The correlation results are output to a peak detector which identifies the template and relative offset at which the peak correlation value occurs. The correlation of chips with templates is the computational bottleneck in the system, involving data rates and computational requirements that exceed by several orders of magnitude the processing load in any other steps in the algorithm. While the precise system parameters vary with implementation, in the work described here we use chip sizes of 128 by 128 and template sizes of 16 by 16. A correlation of a single chip with a single template in this case involves consideration of approximately 128² relative offsets, corresponding to 10⁵ bits of output data if the correlation outputs are represented using 6 bits. If there are 10³ templates to be evaluated per chip, the magnitude of the processing task becomes readily apparent when one considers that the imaging system produces many frames per second, each of which contains many chips.

Figure 2 illustrates the correlation operation targeted for FPGA implementation in more detail. Target templates occur in pairs, one member of which is called the bright template and contains pixels from which a strong radar return is expected, and the other member of which is the surround template and identifies pixels where strong radar absorption is expected. In both cases the template is of size 16 by 16, with pixels represented using only one bit. The templates tend to be sparsely populated, with only a relatively small percentage of the pixels set to 1. As will be discussed later, this property is important in obtaining high performance in FPGA implementations. The first step of the correlation is known as a shapesum calculation, in which the 8-bit SAR chip is correlated with the bright template, providing for every pixel in the chip a number which is used for local gain control. The second step is the actual correlation, which is performed in parallel on eight different binary images, each of which is created by applying a different threshold to the chip. The binary images are correlated with both the bright template and the surround template, producing eight pairs of correlation outputs. The shapesum value is used to select which output pair will be processed in the peak detection step.

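The two-step flow just described can be sketched in NumPy. The sizes (128×128 chip, 16×16 templates, eight thresholds) follow the text; the template densities, the threshold values, and the final selection rule are placeholders, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
chip = rng.integers(0, 256, size=(128, 128), dtype=np.uint16)  # 8-bit SAR chip
bright = rng.random((16, 16)) < 0.2     # sparse binary bright template (assumed density)
surround = rng.random((16, 16)) < 0.3   # sparse binary surround template (assumed density)

def correlate2d(image, template):
    """Correlation at every valid offset: sum of image values under 'on' pixels."""
    H, W = image.shape
    h, w = template.shape
    out = np.zeros((H - h + 1, W - w + 1), dtype=np.int32)
    for r, c in zip(*np.nonzero(template)):
        out += image[r:r + H - h + 1, c:c + W - w + 1]
    return out

# Step 1: shapesum -- the 8-bit chip correlated with the bright template,
# giving a local gain-control value at every offset.
shapesum = correlate2d(chip, bright)

# Step 2: eight binary images, one per threshold, each correlated with both
# templates (the thresholds here are evenly spaced placeholders).
thresholds = np.arange(1, 9) * 32
pairs = [(correlate2d((chip >= t).astype(np.int32), bright),
          correlate2d((chip >= t).astype(np.int32), surround)) for t in thresholds]

# The shapesum selects which of the eight output pairs feeds peak detection;
# the selection rule is not specified in the text, so index 0 is a stub.
sel = pairs[0]
```

Each correlation surface has shape 113×113 (128 − 16 + 1 valid offsets per axis).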
Figure 2 Correlation operations showing the shapesum calculation (top) and the parallel correlations (bottom). Figure labels: chip (128 × 128 × 8 bits); bit slice; bright and surround templates of size 16 × 16, 10-50% populated; thresholds fixed for all templates; 1-bit correlators; threshold select.

2. Mapping and dynamic reconfiguration of target templates

FPGAs offer an extremely attractive solution to the correlation problem. First of all, the operations being performed occur directly at the bit level and are dominated by shifts and adds, making them easy to map into the hardware provided by the FPGA. This contrasts, for example, with multiply-intensive algorithms which would make relatively poor utilization of FPGA resources. More importantly, the sparse nature of the templates can be utilized to achieve a far more efficient implementation in the FPGA than could be realized in a general-purpose correlator. This can be illustrated using the example of the simple template shown in Figure 3.

Figure 3 Example binary template with five "on" pixels (top) and corresponding adder tree (bottom).

In this example template, only 5 of the 24 pixels are "on". At any given relative offset between the template and chip, the correlation output is the sum of the five binary pixels in the chip that lie immediately above the "on" pixels in the template. The template can therefore be implemented in the FPGA as a simple adder tree as shown in Figure 3. The chip pixel values can be stored in flip-flops, and are shifted to the right by one flip-flop with each clock cycle. Though correlation of a large image with a small mask is often understood conceptually in terms of the mask being scanned across the image, in this case the opposite is occurring: the template is hard-wired into the FPGA while the image pixels are clocked past it.

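A software analogy of this arrangement, simplified to one dimension with an invented tap set, keeps the template fixed and clocks the pixels past it:

```python
from collections import deque

TAPS = [0, 3, 4, 7, 9]   # "on" pixel positions of a hypothetical template row
WINDOW = 10              # template row length (illustrative)

def stream_correlate(pixels):
    """Clock pixels past a fixed template: each cycle, shift the register
    chain by one and sum the pixels sitting above the 'on' positions
    (the hard-wired adder tree)."""
    regs = deque([0] * WINDOW, maxlen=WINDOW)  # flip-flop chain
    outputs = []
    for p in pixels:
        regs.appendleft(p)                     # one shift per clock
        outputs.append(sum(regs[t] for t in TAPS))
    return outputs

out = stream_correlate([1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1])
```

After k clocks, register j holds the pixel input k − j cycles ago, so the tap sum is exactly the correlation value at that offset.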
Another important opportunity for increased efficiency lies in the potential to combine multiple templates on a single FPGA. The simplest way to do this is to spatially partition the FPGA into several smaller blocks, each of which handles the logic for a single template. Alternatively, one can seek to identify templates having some topological commonality, and which can therefore share parts of adder trees. This is illustrated in Figure 4, which shows two templates which share several pixels in common, and which can be mapped using a set of adder trees which leverage this overlap.

Figure 4 Template commonalities are exploited to reduce hardware requirements for computing multiple correlations.

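A first-order estimate of the saving from such overlap simply counts shared "on" pixels, since a group of pixels common to both templates needs to be summed only once. The two 3×3 templates below are invented for illustration; the actual tool works with shared terms rather than raw pixel counts:

```python
import numpy as np

# Two invented 3x3 binary templates with overlapping "on" pixels.
t1 = np.array([[1, 1, 0],
               [0, 1, 0],
               [1, 0, 1]], dtype=bool)
t2 = np.array([[1, 1, 0],
               [0, 1, 1],
               [0, 0, 1]], dtype=bool)

shared = int(np.count_nonzero(t1 & t2))  # pixels whose partial sum can be shared

# Summing k operands takes k - 1 additions.
adds_separate = (int(t1.sum()) - 1) + (int(t2.sum()) - 1)
adds_shared = (shared - 1) + (int(t1.sum()) - shared) + (int(t2.sum()) - shared)
```

Here each template has five "on" pixels and four are shared, so independent adder trees need 8 additions while sharing the common sub-tree needs only 5.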
The advantage of using FPGAs over ASICs is that FPGAs can be dynamically optimized at the gate level to exploit template characteristics. An ASIC would have to provide large general-purpose adder trees to handle the worst-case condition of summing all possible template bits. The FPGA, however, exploits the sparse nature of the templates, and only constructs the small adder trees required. We have also shown that FPGAs can exploit other factors such as collapsing adder trees with common elements, and packing unused data points into space-saving RAM-based shift registers. The end result is that a single FPGA can compute several templates in parallel more efficiently than several general-purpose correlating ASICs.

There are many factors that determine the performance gain that results from using an FPGA. One of the most important is FPGA reconfiguration time. Assuming that performing the correlation of a single chip requires approximately 128² = 16K clock cycles, the reconfiguration must be performed in ~10³ or fewer clock cycles to avoid prohibitive overhead. In this respect a context-switched FPGA with two contexts would be extremely useful. If the idle context could be loaded while the active context was processing, then reconfiguration overhead would vanish. The achievable parallelism is also a critical parameter. Based on the work to date, we estimate that we can map an average of 20 bright templates or 5 surround templates on a single 13000-gate FPGA.

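The overhead argument can be checked with two lines of arithmetic, using the 16K-cycle correlation figure and the ~10³-cycle reconfiguration bound from the text:

```python
# Sizes from the text: one chip correlation takes about 128^2 clock cycles;
# the reconfiguration budget is the ~10^3-cycle bound quoted above.
correlate_cycles = 128 * 128
reconfig_cycles = 10**3

# Fraction of total time spent reconfiguring rather than correlating.
overhead = reconfig_cycles / (correlate_cycles + reconfig_cycles)
```

With a ~10³-cycle reconfiguration, the overhead stays below about 6% of the total cycle budget per chip.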
2.1 Experimental results for FPGA resource utilization

The approach of using a template-specific adder tree achieves a significant reduction in routing complexity over a general correlator which must include logic to support arbitrary templates. To a first approximation, the extent of this reduction is inversely proportional to the fraction of "on" pixels in the template. While this complexity reduction is important, it alone is not sufficient to lead to efficient implementations on FPGAs. This is due primarily to the limited number of flip-flops available on commercial FPGAs (for example, the Xilinx XC4010 and ATT ORCA 2C10 contain 800 and 1024 flip-flops respectively). This would not generally be sufficient to support buffering the 112 pixels per chip row that are not actually under the template, but need to be wrapped around to the next row of the template. The total number of 1-bit storage elements needed to hold buffered pixel values for all 16 rows is 16 × 112 = 1792. Implementing these on the FPGA using the usual flip-flop-based shift registers is inefficient, and for many FPGAs impossible.

This problem can be resolved by collapsing the long strings of image pixels that are not being actively correlated against a template into shift registers, which can be implemented very efficiently on some look-up-table-based FPGAs. For example, RAMs in the Xilinx XC4000 library can be used as shift registers which delay data by some predetermined number of clock cycles. Each 16x1-bit RAM primitive uses up a function generator on the FPGA, and can implement an element which is effectively a 16-bit shift register in which the internal bits cannot be accessed. A flip-flop is also needed at the output of each RAM to act as a buffer and synchronizer. A single control circuit is used to control the stepping of the address lines and the timely assertion of the write-enable and output-enable signals for all the RAM-based shift register elements. This is a small overhead price to pay for the savings in CLB usage relative to a brute-force implementation using flip-flops.

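The RAM-based element can be modeled as a 16-entry circular buffer with a single stepping address and an output flip-flop; as in the hardware, the internal bits are never read directly. This is a behavioral sketch, not a description of the actual XC4000 primitive timing:

```python
class Ram16Shift:
    """16x1 RAM used as a fixed 16-cycle delay line: one address counter,
    read-then-write on each clock, with an output flip-flop for buffering."""
    def __init__(self):
        self.ram = [0] * 16
        self.addr = 0
        self.q = 0            # output flip-flop

    def clock(self, d):
        self.q = self.ram[self.addr]   # read the oldest bit into the output FF
        self.ram[self.addr] = d        # overwrite it with the incoming bit
        self.addr = (self.addr + 1) % 16
        return self.q

delay = Ram16Shift()
outs = [delay.clock(b) for b in [1, 0, 1, 1] + [0] * 16]
```

A bit written on cycle i reappears at the output on cycle i + 16, so one RAM primitive replaces sixteen flip-flops of row buffering.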
Figure 5 Four templates that were mapped onto the Xilinx 4010 to explore resource utilization. These examples were generated by 4 rotations of a synthetic template. The number of "on" pixels is 91, which is higher than what would be expected for most templates.

By contrast, the 256 image pixels that lie within the 16 by 16 template boundary at any given time can be stored easily using flip-flop-based registers, since there are sufficient flip-flops available to do this, and the adder tree structures do not consume flip-flops. Also, using standard flip-flop-based shift registers for image pixels within the template simplifies the mapping process by allowing access to every pixel in the template. New templates can be implemented by simply connecting the template pixels of concern to the inputs of the adder tree structures. This leads to significant simplification of automated template mapping tools.

To gain a fuller understanding of the FPGA resource trade-offs involved in template mapping, we implemented in parallel the four templates shown in Figure 5 onto the Xilinx 4010 using the techniques described above. Each template had 91 "on" pixels, few of which are shared with other templates. Since parallelism was low, this exercise gives a worst-case estimate for the capacity of the 4010. The resulting FPGA resource utilization, as summarized in Table 1, shows that 313 flip-flops and 756 function generators were used. The resources used by the two components of target correlation, namely storage of active pixels on the FPGA (1st row of table) and implementation of the adder tree corresponding to the templates (2nd row), are independent of each other. The resources used by the pixel storage are determined by the template size, and are independent of the number of templates being implemented. Adding templates involves adding new adder tree structures, and will hence increase the number of function generators being used. The total number of templates implementable on an FPGA will be bounded by the number of usable function generators.

Table 1 Xilinx 4010 resource utilization (flip-flops, function generators, and I/O pins) for the four sample templates shown in Figure 5. Resources for the two basic functions implemented (pixel storage and adder trees) are shown separately. The adder trees account for the majority of the function generator utilization, and are a key factor in limiting the number of templates that can be implemented simultaneously in a single configuration.

The four templates in the example above exhibit a very low number of common pixels and would probably not be grouped together by the partitioning algorithm. These results suggest that in practice, we can expect to fit 6-10 surround templates having a higher number of overlapping pixels onto a 13000-gate FPGA. Since the bright templates are less populated than the surround templates, we estimate that 15-20 bright templates can be mapped onto a 13000-gate FPGA.

3. ATR template partitioning issues

To minimize the number of FPGA reconfigurations necessary to correlate a given target image against the entire set of templates, it is necessary to maximize the number of templates placed in every configuration of the FPGA. To accomplish this optimization goal, we want to partition the set of templates into groups that can share computations (i.e., share adder trees) so that fewer resources are used per template. Since the set of templates may number in the thousands, and the goal may be to place ten to twenty templates per configuration, exhaustive enumeration of all of the possible groupings is not an option. Instead, it is best to seek a heuristic method which will furnish a good, although perhaps suboptimal, solution.

Correlation between two templates can establish the number of pixels in common, and is a good starting point for comparing and selecting templates. However, some extra analysis beyond performing iterative correlations on the template set is necessary. For example, a template with a large surface area may correlate well with several smaller templates, perhaps even completely subsuming them, but the smaller templates may not correlate with each other at all, and would involve no redundant computations. There are two possible solutions to this. The first is to ensure that any template added to an existing group is approximately the same size as the templates in the group. The second option is to compute the number of additions required each time a new template is brought in, effectively recomputing the adder tree every time.

Recomputing the entire adder tree is computationally expensive, and is not a good method of partitioning a set of templates into subsets. However, one of the heuristics used in deciding whether or not to include a template in a newly formed partition is to determine the number of new terms that adding the template would create in the partition's adder tree. The assumption is that more terms would result in a significant number of new additions, resulting in a wider and deeper adder tree. By keeping the number of new terms created to a minimum, newly added templates do not increase the number of additions by a significant amount. Using C++, we have created a design tool to implement the partitioning process. C++ allows us to abstract templates and sets of templates into easily manipulated objects. New methods of comparison can be implemented by adding new methods to the objects, making this system adaptable to different heuristics. Furthermore, the system is not tied down to any particular platform, as C++ compilers are available for many different platforms.

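The term-sharing idea behind the tool can be sketched as a greedy procedure (here in Python rather than the C++ of the actual tool, and on invented toy data): every template starts as a sum of single-pixel terms, and the pair of terms summed by the most templates is repeatedly fused into a new shared term:

```python
from collections import Counter
from itertools import combinations

# Toy partition: three templates, each a frozenset of "on" pixel coordinates.
templates = [frozenset({(0, 0), (0, 1), (1, 1)}),
             frozenset({(0, 0), (0, 1), (2, 2)}),
             frozenset({(0, 1), (1, 1), (2, 2)})]

def count_additions(templates):
    """Greedy term combining: while some pair of terms is summed by at least
    two templates, fuse it into a new shared term (one addition, reused)."""
    exprs = [{frozenset({p}) for p in t} for t in templates]
    adds = 0
    while True:
        pairs = Counter()
        for e in exprs:
            pairs.update(combinations(sorted(e, key=sorted), 2))
        if not pairs or pairs.most_common(1)[0][1] < 2:
            break
        (a, b), _ = pairs.most_common(1)[0]
        adds += 1                       # the shared addition a + b, built once
        for e in exprs:
            if a in e and b in e:
                e.discard(a); e.discard(b); e.add(a | b)
    # Each template still needs len(e) - 1 additions to sum its final terms.
    return adds + sum(len(e) - 1 for e in exprs)

naive = sum(len(t) - 1 for t in templates)   # independent adder trees
shared = count_additions(templates)
```

On this toy set the naive count is 6 additions and sharing reduces it to 5; the paper's five-template example shows the same effect at larger scale (54 down to 29).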
The tool uses an iterative approach to partitioning templates. Templates which compare well to a chosen "base" template (usually selected by largest area) are removed from the main template set and placed in a separate partition. This process is repeated until all templates are partitioned. After the partitions have been selected, the design tool computes the adder tree for each partition. Figures 6a-6c show the process of creating an adder tree from the templates in a partition. Within each partition, the templates (shown in Figure 6a) are searched for shared subsets of pixels (Figure 6b). These subsets, called terms, can be automatically added together, leading to a template description using terms instead of pixels (Figure 6c). The most commonly occurring addition of two terms is chosen to be added together, forming a new term which can be used by the templates. In this way, each template is rebuilt by combining terms in such a way that the most redundant additions are shared between templates, and the final result of the process is a set of terms which compute entire templates. For the sample templates shown here, 54 additions would be required to compute the correlations for all five templates in a naive approach. However, after combining the templates through the process described here, only 29 additions are required.

Figure 6: Example of template grouping to leverage pixels common to multiple templates. In this example, the grouping reduces the number of adders needed from 54 to 29.

Figure 6a: Five sample templates to be allocated on the same FPGA configuration.

Figure 6b: The terms that result from overlapping templates. Each term corresponds to a group of pixels needed by a particular set of templates. These terms correspond to nodes on an adder tree.

Figure 6c: The templates A through E rewritten as sums of numbered terms. The most common additions are coalesced into new terms. Here, 5 + 6 occurs four times, and is the first candidate to be replaced with a new term, 11.

4. Configurable computing system design

The increased performance of configurable systems comes with several costs. These include the time and bandwidth required for reconfiguration, memory and I/O required to store intermediate results, and additional hardware required for efficient implementation and debugging. Minimizing these costs requires innovative approaches to system design.

Figure 7 compares the configurable computing architecture discussed here (7c) with a traditional processor (7a) and a customizable computing architecture (7b). The traditional processor receives simple operands from a data memory, performs a simple operation in the program, and returns the result to data memory. Custom computers attempt to gain performance advantages over traditional processors by implementing common operations in hardware. Some fine-grain systems have been proposed [4], but the majority of successful custom computers implement complex operations in a large array of FPGAs [8,9] as depicted in Figure 7b. Large custom computers deliver tremendous performance for some applications, but have definite shortcomings. The number of FPGAs is typically fixed, or only slightly variable through the use of parallel cards. Large applications may not fit, and small applications may inefficiently use available resources. Also, large custom computers can be difficult to partition and program.

Figure 7 Architecture comparisons. Traditional processors (7a) take simple operands from memory, perform an operation, and return the result to memory. Custom computers (7b) gain performance advantages by putting common operations into hardware. In dynamic machines (7c) logic is programmed by selecting configuration bitstreams stored in nearby RAM.

Figure 8a Eight FPGAs, each performing one-bit correlations on bit-planes 0-7, with the outputs converging at a summation stage.

Figure 8b One FPGA, performing the correlations and adding the results serially into a holding SRAM.

Our approach to configurable computing is to use a small number of rapidly reconfiguring FPGAs tightly coupled to an intermediate result memory and a configuration memory. This configurable computing architecture attempts to gain some of the advantages of a traditional programmable processor and some of the application-specific performance of a custom computer. We have had success implementing a video compression and transmission system on this architecture [7], and will now briefly describe some of the trade-offs involved in implementing an automatic target recognition (ATR) algorithm on this architecture.

Partitioning of computations plays a major role in any parallel processing environment. For a configurable computing system, partitioning is particularly interesting because it is both a hardware and a software issue. Consider the two methods for performing addition of shapesum bit-planes shown in Figure 8. Method A is a straightforward parallel implementation requiring several FPGAs, and has several drawbacks. First, the outputs from several FPGAs converge at the addition operation. This may create a severe I/O bottleneck. Second, the system is not scalable. If the system requires more precision, and therefore more bit-planes, more FPGAs must be added to the system.

Method B in Figure 8 illustrates our approach. Each bit-plane is correlated individually, and then added to the previous results in the temporary storage. Method B is completely scalable to any image or template precision, and this simple architecture can implement all the correlation, normalization, and peak detection routines required for ATR. One drawback of method B is the cost and power required for the wide temporary result SRAM. Another possible drawback is the extra execution time required to run ATR correlations in serial. The ratio of performance to number of FPGAs is roughly equivalent for the two methods, and the performance gap can be closed by simply using more of the smaller method B boards.

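Because correlation is linear in the image, method B's serial bit-plane accumulation is numerically identical to correlating the full 8-bit chip in one pass. A quick check with reduced sizes and random data:

```python
import numpy as np

rng = np.random.default_rng(1)
chip = rng.integers(0, 256, size=(32, 32))   # reduced-size 8-bit chip
template = rng.random((8, 8)) < 0.3          # sparse binary template

def correlate2d(image, template):
    """Sum of image values under the template's 'on' pixels, at every offset."""
    H, W = image.shape
    h, w = template.shape
    out = np.zeros((H - h + 1, W - w + 1), dtype=np.int64)
    for r, c in zip(*np.nonzero(template)):
        out += image[r:r + H - h + 1, c:c + W - w + 1]
    return out

# Method A reference: one correlation of the full 8-bit image.
direct = correlate2d(chip, template)

# Method B: correlate one bit-plane at a time and accumulate the weighted
# partial results in "temporary storage".
acc = np.zeros_like(direct)
for b in range(8):
    acc += correlate2d((chip >> b) & 1, template) << b
```

The accumulated result matches the direct correlation exactly, which is why scaling to more bit-planes costs only serial time, not extra FPGAs.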
The approach of a reconfigurable FPGA connected to an intermediate memory allows us to have a fairly complicated flow of control. For example, the shapesum calculation in ATR tends to be more difficult than the image-template correlation. A single FPGA might compute two shapesums or four correlations. For this reason, one may wish to have a program that performs two shapesum operations and forwards the results to a single correlation operation as shown in Figure 9. Looping and branching operations can also find uses for adjustable precision and adaptive filtering.

Reconfigurations for 10k-gate FPGAs are typically around 20 ms in length. Reconfiguring every 20 ms gives a reconfiguration bandwidth of approximately 1 MB per FPGA per second. This reconfiguration bandwidth, coupled with the complexity of the flow of control, can be handled by placement of a small microcontroller and configuration RAM next to every FPGA. The microcontroller permits complicated flow of control, and since the microcontroller addresses the configuration RAM, it frees up valuable I/O on the FPGA. The microcontroller also is important for debugging, which is a major issue in configurable systems because the many different hardware configurations can make it difficult to isolate problems.

Figure 9 Reconfiguration flow-of-control example.

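The quoted bandwidth is consistent with a configuration bitstream on the order of 20 KB per load, which is an assumed (not stated) size for a 10k-gate FPGA of this era:

```python
# Back-of-the-envelope check of the quoted reconfiguration bandwidth.
bitstream_bytes = 20_000       # assumed ~20 KB bitstream for a 10k-gate FPGA
reconfig_period_s = 0.020      # one reconfiguration every 20 ms

bandwidth_bytes_per_s = bitstream_bytes / reconfig_period_s
```

One such load every 20 ms works out to the 1 MB per FPGA per second cited in the text.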
Figure 10 shows the current (as of April 1996) version of the evolving configurable computing system for performing ATR. The principal components are labeled in the graph and include a "dynamic" FPGA which is reconfigured on the fly, and which performs most of the computing functions, and a "static" FPGA which is configured only once, and which performs control and some computational functions. The EPROM holds configuration bitstreams, and the SRAM holds the input image data (e.g., the chip). Because the correlation operation involves the application of a
