Automated Target Recognition on SPLASH 2¹

Michael Rencher and Brad L. Hutchings
Department of Electrical and Computer Engineering
Brigham Young University
Provo, UT 84602
`
Abstract
Automated target recognition is an application area that requires special-purpose hardware to achieve reasonable performance. FPGA-based platforms can provide a high level of performance for ATR systems if the implementation can be adapted to the limited FPGA and routing resources of these architectures. This paper discusses a mapping experiment where a linear-systolic implementation of an ATR algorithm is mapped to the SPLASH 2 platform. Simple column-oriented processors were used throughout the design to achieve high performance with limited nearest-neighbor communication. The distributed SPLASH 2 memories are also exploited to achieve a high degree of parallelism. The resulting design is scalable and can be spread across multiple SPLASH 2 boards with a linear increase in performance.
`
1 Introduction
Automated target recognition (ATR) is a computationally demanding application area that typically requires special-purpose hardware to achieve desirable performance. ASICs are not an option for these systems due to high non-recurring engineering (NRE) costs and because the algorithms are constantly evolving. Existing FPGA-based computing platforms can potentially provide the necessary performance and flexibility for evolving ATR systems; however, mapping applications to these existing platforms can be very challenging because they lack abundant interconnect and FPGA resources. The key to achieving a high-performance implementation of ATR algorithms with existing platforms is to carefully organize the design of the ATR implementation so that it can communicate via the limited interconnect and can be easily partitioned among the FPGA devices.
This paper presents a linear systolic implementation of an existing ATR algorithm on SPLASH 2 that is well-suited to the SPLASH 2 architecture. Inter-FPGA communication is limited and easily accommodated by the SPLASH 2 interconnect. Moreover, the implementation can be scaled across any number of SPLASH 2 boards and achieves high performance with limited resources.

¹This work was supported by DARPA/CSTO under contract number DABT63-94-C-0085 under a subcontract to National Semiconductor.
`
This paper briefly discusses the entire ATR algorithm as developed by Sandia National Labs, and then overviews the design and implementation of the most computationally demanding part of the algorithm: Chunky SLD. The SPLASH 2 implementation is presented in some detail along with future directions and possible improvements.
`
2 Automatic Target Recognition
The goal of a typical ATR system is to analyze a digital representation of a scene and locate/identify objects that are of interest. Although this goal is conceptually simple, ATR systems have extremely demanding I/O and computational requirements: image data are large, can be generated in real-time, and must be processed quickly so that results remain relevant in a dynamic environment. The common use of special-purpose hardware in nearly all high-performance ATR systems is a clear indication of the computational complexity of these systems.
This paper details the implementation of an existing ATR algorithm on SPLASH 2. The algorithm in question was developed at Sandia National Laboratories and was designed to detect partially obscured targets in Synthetic Aperture Radar (SAR) images. It is commonly referred to as Chunky SLD, so named for the second step of the algorithm that differentiates this algorithm from others developed at Sandia. This algorithm consists of the following three steps: (1) Focus of Attention (FOA), (2) Second-Level Detection (SLD), and (3) Final Identification (FI). Each of these steps will now be introduced so that the algorithm implementation can be understood in its operating context.
0-8186-8159-4/97 $10.00 © 1997 IEEE

Petitioner Microsoft Corporation - Ex. 1066, p. 117

Figure 1: ATR Block Diagram.
`
2.1 Focus of Attention (FOA)
Focus of attention is the first step of the ATR process and uses image-morphology techniques to detect potential targets in SAR data. FOA operates on "down-sampled" SAR images that are approximately 500-1000 pixels on a side. Once FOA detects a potential target, it determines the approximate center of the potential target and creates 2× down-sampled subimages of the original SAR data where each subimage contains a single target centered within the subimage. These subimages are referred to as chips and are 128 × 128 pixels.
`
2.2 Second Level Detection (SLD)
The SLD step processes the chips generated by the FOA step. SLD further restricts the areas of interest by giving the potential targets coordinates and angular orientation. SLD does this by correlating predefined binary templates to the areas of interest. The templates represent different object orientation angles. Templates are oriented between 5 and 10 degrees apart. SLD also uses adaptive threshold levels determined by the overall image intensity.
The algorithm studied in this paper is a variation of SLD called Chunky SLD. Chunky SLD adds a level of complexity to SLD by using more templates to represent objects that have been partially obscured (partially hidden by camouflage or objects overhead). This allows better target recognition at a cost of higher computational requirements. Chunky SLD is discussed in more detail later in this section.
`
2.3 Final Identification (FI)
The FI algorithm correlates full-resolution image data and templates with finer angular resolution (3 to 5 degrees). FI also uses adaptive threshold levels. The output of FI is a location of the target and a confidence level corresponding to the level of correlation between the object and the FI templates.
`
2.4 The Chunky SLD Algorithm
The general goal of the Chunky SLD algorithm is to recognize targets that are partially concealed or obscured in some way. To achieve this goal, the designers of this algorithm treat the target as a set of 40 template pairs where each pair of templates is a digital representation of some salient feature of the specific target. If the majority of the template pairs strongly correlate with the image data, then a match of the overall target is assumed. Each pair of templates consists of a Bright template and a Surround template. The Bright template is a representation of expected reflections directly from surfaces of a salient target feature, while the Surround template represents expected absorption in the immediate area surrounding the target feature. Each pair of a Bright and Surround template is referred to as a chunk, so called because each pair of templates represents a "chunk" of the overall target. Each set of 40 chunks represents a single target at a specific rotation. There are 72 orientations, each representing a different target orientation and radar incidence angle. Each set of 72 orientations is referred to as a class and is the complete set of templates that must be correlated with a chip to detect the presence of a specific target.

Figure 2: Chunky SLD
`
Figure 3: Template Organization (a class, all 72 orientations of 40 chunks, for one object).
`
The first step of the Chunky SLD algorithm is to correlate the chip and the Bright template. This correlation value is used to compute a value that will be used to threshold the incoming chip data, converting the 8-bit chip image into a binary image. The equations describing this process are shown below.

Shapesum(x, y) = Σ_{a,b=0}^{15} Btemp(a, b) × Chip(x + a, y + b)

Threshold(x, y) = Shapesum(x, y) / BrightTemplatePixelCount
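As a concrete reference, the two equations above can be sketched in software. The names (shapesum, bright_template, chip) are illustrative, not taken from the SPLASH 2 design, and integer division stands in for the hardware divide unit.

```python
def shapesum(chip, bright_template, x, y):
    """Correlate the 16x16 Bright template with the 8-bit chip at offset (x, y)."""
    return sum(
        bright_template[a][b] * chip[x + a][y + b]
        for a in range(16)
        for b in range(16)
    )

def threshold(chip, bright_template, x, y):
    """Average chip intensity under the template's 'on' pixels."""
    on_pixels = sum(sum(row) for row in bright_template)  # Bright template pixel count
    return shapesum(chip, bright_template, x, y) // on_pixels
```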
The values obtained by correlating the Bright and Surround templates with the binarized chip (Bsum and Ssum) are checked against minimum values to generate a "hit value" for each offset in the chip. The threshold value is also checked to see if it falls in an acceptable range when generating the hit values.

    if ((Tmax ≥ T ≥ Tmin) AND
        (Bsum ≥ Bmin) AND
        (Ssum ≥ Smin))
    then
        Hit = 1;
    else
        Hit = 0;                                    (1)
`
The hit values are accumulated for each offset for a specific orientation (40 chunks). The highest values are used to identify the areas of interest for the final identification step.
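Equation 1 and the accumulation step can be modeled as follows. The parameter names (t_min, t_max, b_min, s_min) and the dictionary-based hit table are illustrative stand-ins for the hardware's comparators and memory-resident table.

```python
def hit(t, b_sum, s_sum, t_min, t_max, b_min, s_min):
    """Return 1 if the offset passes all three tests of Equation 1."""
    if t_max >= t >= t_min and b_sum >= b_min and s_sum >= s_min:
        return 1
    return 0

def accumulate_hits(results, params, hit_table):
    """Accumulate hits per offset; results maps (x, y) -> (T, Bsum, Ssum)."""
    for (x, y), (t, b_sum, s_sum) in results.items():
        hit_table[(x, y)] = hit_table.get((x, y), 0) + hit(t, b_sum, s_sum, *params)
    return hit_table
```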
`
2.4.1 Template Characteristics

Template pairs exhibit two useful properties: sparseness and mutual exclusivity. The Bright template consists mostly of zeros; only 3 to 10 percent of the template values are '1's, and this limits the magnitude of the Shapesum and Bsum values. The Bright and Surround templates are also mutually exclusive; that is, if the two templates are overlaid, no "on" pixels will overlap. When carefully exploited, both of these properties lead to more compact and higher-performance hardware.
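Both properties are easy to state as checks. The sketch below is illustrative: the 3-10 percent bounds come from the text, while the template layouts and function names are assumptions.

```python
def is_sparse(template, lo=0.03, hi=0.10):
    """Bright templates are 3-10% ones, bounding Shapesum's magnitude."""
    ones = sum(sum(row) for row in template)
    total = len(template) * len(template[0])
    return lo <= ones / total <= hi

def mutually_exclusive(bright, surround):
    """No 'on' pixel may appear in both templates at the same position."""
    return all(
        not (b and s)
        for b_row, s_row in zip(bright, surround)
        for b, s in zip(b_row, s_row)
    )
```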
`
3 Other Implementations of Chunky SLD

As explained, the ATR application is computationally demanding. There are (128 − 15) × (128 − 15) offsets per chunk × 40 chunks × 72 orientations ≈ 36 × 10^6 hit values to compute per targeted object (or per class; see Figure 3). The computational rate and I/O requirements of this algorithm make it impossible to use current microprocessors. Thus any high-performance implementation of this algorithm will require special-purpose hardware to meet performance goals.
However, custom ASICs are also not an option because the algorithm is constantly evolving and also because commercial off-the-shelf (COTS) components are often dictated by the ultimate customers of ATR systems. The only remaining options are to construct the system with commercially available fixed-function devices such as correlators, multipliers, etc., or to use programmable logic, e.g., FPGAs [1, 2]. Thus all known implementations of Chunky SLD use either fixed-function devices or programmable logic.
3.1 Sandia

Current Sandia implementations of ATR are based on commercially available one-bit correlator chips. Sandia designers adapted the basic Chunky SLD algorithm so they could exploit the capabilities of these components to achieve high performance. Rather than process the Shapesum and then process the final correlation, the two steps were done in parallel. The correlation was done at 8 discrete threshold levels and the Shapesum determined which threshold to use for each offset.
`
Figure 4: Sandia's Implementation.
`
3.2 UCLA

A group at UCLA (headed by John Villasenor) [3] is working on an FPGA-based SLD implementation. By doing bit-level correlations they are able to build very compact adder trees that take advantage of template sparseness and the FPGAs' on-board lookup-table memory capability. Their approach compiles template information directly into the hardware and relies on fast reconfiguration to switch template information. They also try to take advantage of template overlap by computing the results of multiple correlations simultaneously.
`
4 Chunky SLD on SPLASH 2
On SPLASH 2 a static implementation was done to avoid the overhead of reconfiguring the hardware during execution. In order to reduce the hardware requirements without reducing throughput, a deeply pipelined design (~400 cycle latency) was implemented. The Shapesum unit generates the threshold value, which is then used to generate the final correlation values. (Note: there is a unique threshold for each offset.) By doing this, only two final correlations have to be computed per offset (one Bsum and one Ssum).
The Sandia implementation computes the Shapesum and final correlation in parallel, which forces them to compute multiple final correlations, while our implementation does them serially. This allows us to use an exact threshold value. Also, only one final correlation needs to be computed because the threshold value is computed before the final correlation begins. The technique used was to look at the correlations by column, compute the partial correlation for that column, and sum up the partial sums for all 16 columns. In this method 16 different column correlations proceed in parallel, but only one column of data needs to be available for processing.
4.1 Implementing the Correlation as Column Sums

Figure 5 depicts a simple example that demonstrates a correlation of a 3×3 template with a binary image. Each row in the table represents one clock cycle. The first column is the clock cycle number; corresponding numbers are found in the "Pixel load order" box at the right. A new pixel is brought in on each clock cycle. The P1, P2, and P3 columns represent the three column processing units needed for a three-column template. The last column represents the actual output. Clock cycles 9 through 12 have been expanded to show how data (pixels and partial sums) are passed from column to column and to illustrate the data format. Once the pipeline is full, a new correlation value is computed as each column arrives (three pixels/cycles).
Note that valid output comes every three cycles because the template is three rows tall. All processing elements are actively processing 3 pixel values at all times. The SPLASH 2 implementation works just like the example except for the size of the columns (16 pixels instead of three) and the data format (eight-bit instead of one-bit).
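A functional software model of the column-sum scheme may help. This sketch computes the same per-column partial sums that the systolic array forms and passes to the right, without modeling cycle-level timing; the names are ours, not the design's.

```python
def column_correlate(image, template):
    """Correlate a binary template with a binary image by summing
    per-column partial correlations, as the systolic array does."""
    rows, cols = len(image), len(image[0])
    t_rows, t_cols = len(template), len(template[0])
    out = {}
    for x in range(cols - t_cols + 1):
        for y in range(rows - t_rows + 1):
            total = 0
            for c in range(t_cols):  # one column processor per template column
                partial = sum(template[r][c] * image[y + r][x + c]
                              for r in range(t_rows))
                total += partial     # partial sum passed to the next column
            out[(x, y)] = total
    return out
```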
`
Figure 5: Example Column Sum. (O = output, T = clock cycle; pixels are summed in at the marked points as partial sums shift from column to column.)
`
4.2 Platforms/Implementations
Chunky SLD was implemented on the SPLASH 2 board. SPLASH 2 has shown itself to be a useful platform and has had numerous applications mapped to it [4, 5, 6, 7, 8, 9, 10, 11]. The implementation was done in VHDL and simulated/synthesized using Synopsys. All place and route was done automatically using Xilinx place and route tools.

One of the goals of the implementation was to run the system so that it consumed a pixel per cycle. This means that on each cycle all processing elements (PEs) need to be able to process a new pixel. This implementation follows the Sandia National Labs algorithm (not implementation) as closely as possible (see Section 2.4).
The SPLASH 2 board was developed by SRC (Supercomputing Research Center, Institute for Defense Analyses) [12]. The SPLASH 2 board is a linear systolic array of processing elements (FPGAs), each with their own memory.
`
4.2.1 SPLASH 2 Hardware

From a conceptual point of view, the SPLASH 2 system consists of a linear array of processing elements. This makes SPLASH 2 a good candidate for linear-systolic applications with limited neighbor-to-neighbor interconnect. Because of limited routing resources, SPLASH 2 has difficulty implementing multi-chip systems that are not linear systolic, though they are possible [8]. The actual SPLASH 2 platform consists of a board with 16 Xilinx 4010 chips (plus one for control) arranged in a linear systolic array. Each chip has a limited 36-bit connection to its two nearest neighbors. Each Xilinx 4010 is connected to a 512 kbyte memory
(16-bit word size). The memory can handle back-to-back reads or back-to-back writes, but requires one 'dead' (or turn-around) cycle when changing from a write to a read. There is also a crossbar connected to all of the chips that allows some level of random connection between chips. Up to 16 boards can be daisy-chained together to provide a large linear-systolic array of 256 elements.

Figure 6: Single Processing Element of SPLASH 2.

Figure 7: SPLASH 2 platform.

4.3 ATR Implementation on SPLASH 2
Similar to the example, the SPLASH 2 implementation processes one pixel at a time and loads them in column order so that the partial sums can be generated and passed from column to column. All template data are stored in the memories adjacent to the FPGAs on the SPLASH 2 boards. Each memory can hold several thousand templates, thus making it possible to store all of the templates for a single class (5760) on a single SPLASH 2 board. There is sufficient room in the FPGA design to store a single template. The templates are switched by reading the new template data out of the memory and storing it within the FPGA. However, because this implementation is deeply pipelined, it is necessary to flush all current data from the system when switching to a new template. The overhead from the flushing operation is minimal (≈0.14%).
During each clock cycle, a new pixel arrives at the FPGA. If the template bit corresponding to this pixel is on, then the incoming pixel is added to the current partial sum. Every 16 clock cycles, this partial sum is passed on to the next column and a new partial sum is received from the previous column. The last column computes a complete Shapesum every 16 cycles (one column). The final correlation of the Bright and Surround templates with the thresholded chip data works similarly, except there are two correlations (one for each template).
Intermediate hit values are stored in a table, referred to as the hit-table, in one of the local memories. Each location in the table corresponds to an x,y offset of a chip, the origin of a single correlation. For each offset, if a chunk "hits", then the corresponding location in this table is incremented. Thus the table contains the accumulated hit values for all chunks and all offsets that have been computed to that point.
Hits are computed according to Equation 1. First, each Bsum and Ssum value is compared to its corresponding minimum value. Second, the threshold value corresponding to each Bsum and Ssum is checked to see if it is between a certain minimum and maximum value. For reasons of efficiency, the threshold value is actually examined earlier in the process and a zero for the threshold is stored in lookup-table memory if it is out of bounds. This works correctly because if the
`
`
`
threshold is zero, it will cause the Bsum to be zero, which will in turn cause the Bsum comparison to fail. Otherwise, if all three of these tests come back true, then a hit has been found for the corresponding offset (see Equation 1) and the corresponding location in the hit-table is incremented. After the 40 templates are tested against the same chip, the two offsets with the highest accumulated hit values are written into memory where the host computer can read them. This is accomplished by examining the hit values during this process and retaining the top two values in special memory locations. These final two hit values (which represent the top two orientations for a specific class) are used in the FI step.
For the SPLASH 2 board, as with most FPGA systems, partitioning is a major issue. The design needed to be modular so that different design modules could be reassigned to different FPGAs as necessary. This is where the column modules were so valuable (see Figure 8).
`
4.4 Special Features

This implementation has several notable characteristics. They include control distribution, modular design for partitioning, and memory utilization.
`
4.4.1 Distributed Control

The control in this system is distributed throughout the array. Each column module has its own state-machine-based control. Module synchronization is achieved by distributing a control token through the pipeline along with the data. When a module receives this signal, it resets its internal state machines and retrieves template data from its local memory. A memory controller resides in each processing element (FPGA) to retrieve template data and give memory access to all other modules.
`
4.4.2 Modular Design (Design for Partitioning)

Each column in both the Shapesum and final correlation uses totally self-contained modules that can be easily migrated from processing element to processing element. This was done to simplify the partitioning onto SPLASH 2 [13]. Memory data had to be carefully partitioned as well so that the data could follow the module to which it applied. The regularity of the design was an important concern; it allowed the placement of specific circuit modules to be dictated by the requirements of the algorithm and not by the limited interconnect of the platform. There are 16 identical modules in the Shapesum and 16 more identical modules in the final correlation. Along with these there is a divide, a hit accumulator, and 2 delay modules (see Figure 8).
`
4.4.3 Memory Usage

The memories in SPLASH 2 serve several purposes. The template information is stored in them. They are also used to implement video shift registers that correct for the latency incurred during threshold computation. These shift registers require that two memories be used in tandem because every clock cycle a new pixel (8 bits) must be written to memory and a delayed pixel must be read from memory. The bandwidth of one memory is such that it can handle two pixels (load and store) every three cycles. Thus one memory delays two pixels and skips two pixels, while the other memory delays the two pixels that the first memory skipped and skips the two pixels that the first memory delayed. The divide unit and the final result, including the accumulated hit values, are also stored in memory.
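The tandem shift-register arrangement can be sketched as follows. The group-of-four interleave mirrors the delay-two/skip-two description above, while the delay length and names are illustrative rather than the design's actual latency.

```python
from collections import deque

def tandem_delay(pixels, delay):
    """Model two memories that each delay alternating pixel pairs.

    Each memory sees only two of every four pixels, so each handles one
    write and one read per pair, within its two-pixels-per-three-cycles
    bandwidth. The combined effect is a delay line of 2*delay pixels.
    """
    mem_a = deque([0] * delay)  # delays pixels 0 and 1 of each group of 4
    mem_b = deque([0] * delay)  # delays pixels 2 and 3 of each group of 4
    out = []
    for i, p in enumerate(pixels):
        mem = mem_a if (i % 4) < 2 else mem_b
        mem.append(p)            # write the new pixel
        out.append(mem.popleft())  # read the delayed pixel
    return out
```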
`
4.5 Performance Metrics

There are many metrics that could be used to measure the value of this implementation. This section is devoted to discussing some of these metrics.

4.5.1 Performance

This implementation runs at a frequency of 19 MHz using a test template that tests all column modules. Xdelay (a Xilinx timing tool) reports a guaranteed frequency of 13.2 MHz. Designs that run in the 10 to 20 MHz range are typical [8, 10].
Using the above frequencies, a single system could process one orientation every .487 seconds (.701 seconds using a 13.2 MHz clock).
`
(128 col × (128 − 15) rows × 16 cycles/col) / 19 MHz = 12 ms/chunk
(12 ms/chunk) × (40 chunks/orientation) = 487 ms/orientation @ 19 MHz

or

(128 col × (128 − 15) rows × 16 cycles/col) / 13.2 MHz = 17.5 ms/chunk
(17.5 ms/chunk) × (40 chunks/orientation) = 701 ms/orientation @ 13.2 MHz
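The arithmetic above can be checked directly; the constants come from the text, while the function name is ours.

```python
def orientation_time(freq_hz, cols=128, rows=128 - 15, cycles_per_col=16,
                     chunks=40):
    """Seconds to process one orientation (40 chunks) at a given clock rate."""
    cycles_per_chunk = cols * rows * cycles_per_col
    return chunks * cycles_per_chunk / freq_hz

print(round(orientation_time(19e6), 3))    # seconds per orientation at 19 MHz
print(round(orientation_time(13.2e6), 3))  # seconds per orientation at 13.2 MHz
```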
`
Figure 8: SPLASH 2 Implementation (Shapesum, Hit Computation and Accumulation, Final Correlation).
`
Using a brute-force approach, it takes 59 seconds to compute the same data on an HP 770 (110 MHz); that is two orders of magnitude slower. On one SPLASH 2 board two processing units can be installed, yielding the ability to process 1 orientation (40 chunks) every .244 seconds (.350 seconds at 13.2 MHz).
`
4.5.2 Multi-board SPLASH 2

If a 16-board system were used, then 256 processing elements would be available. This would make 42 processing units (256 PEs ÷ 6 PEs per PU) and 42× the throughput. Thus one orientation can be processed every 11.6 ms (16.7 ms at 13.2 MHz). This is equal to covering 86 orientations per second (60 orientations per second at 13.2 MHz).
`
4.5.3 Memory Bandwidth

The distributed memory and aggregate memory bandwidth are what make this implementation possible. Because the memory is distributed to each chip, this implementation can be extended to a multi-board system quite easily. Consider that a processing unit consists of 6 FPGAs and their corresponding memories. Each processing unit performs 10 reads and 13 writes every 16 clock cycles. The SPLASH 2 memories require 3 clock cycles to perform a read and a write, for a total available bandwidth of 2 reads and 2 writes every cycle (6 memories ÷ 3 cycles). Thus a processing unit consumes approximately 1/2 of the available memory bandwidth. If there were only two or three memories available for the six-chip system, this implementation would not be possible.
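A quick back-of-the-envelope check of this claim, using the constants from the text. The exact fraction depends on how reads and writes are counted against the read-plus-write pairs each memory sustains; the point is that usage stays safely under the available bandwidth.

```python
MEMORIES_PER_PU = 6   # one memory per FPGA in a processing unit
CYCLES = 16           # accounting window from the text

accesses_used = 10 + 13                              # reads + writes per 16 cycles
accesses_avail = MEMORIES_PER_PU * 2 * (CYCLES / 3)  # each memory: 1 read + 1 write per 3 cycles

print(accesses_used / accesses_avail)  # fraction of memory bandwidth consumed
```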
`
4.5.4 Memory-stored Templates

The decision to use statically configured hardware and memory-stored templates was made early in the design process. Another approach would be to use hard-coded templates, e.g., hardware with the template data compiled in as constants. A hard-coded template implementation could possibly use less hardware by removing unused accumulators and template-reading logic; however, it would require the hardware to be reconfigured during run-time in order to switch templates. This would take significantly longer than the memory read done in the memory-stored template version. The SPLASH 2 system requires approximately 17 ms per chip to configure, and the system would have to be reconfigured each time a new template was used. In order to process one orientation the system would have to be configured 40 different times. This would adversely impact system throughput. In fact, even if the system could be reconfigured in one cycle, the performance wouldn't improve because the time spent reading templates from memory is hidden while the pipeline is initially filling (during the flush cycles). There are other advantages as well. Design time is simplified because new hardware bit-streams don't have to be generated and tested to use new templates. Re-compilation may improve hardware utilization; however, that would come at a cost of repartitioning difficulty.
`
5 Future Work

In the future, additional related research will be conducted on possible improvements to the implementation described in this paper. In addition, alternative implementations that are not currently possible with the SPLASH 2 platform are under examination. Finally, work is underway on a bit-serial version of this implementation.
`
`
`
5.1 SPLASH 2 Improvements
Although SPLASH 2 was suitable for this project, there are many possible architectural improvements that could be made that would increase performance significantly. A faster memory interface that supports a two-cycle read-write operation, or perhaps a dual-ported memory interface, would provide improved memory bandwidth. A larger processing element would give more resources and make implementations more flexible. Updated VHDL libraries that conform to standard data types would help design portability and simplicity.
If chip-to-chip I/O were increased on SPLASH 2, a full column could be processed each clock cycle. This would require at least 339 I/O pins ([128 pixel bits + 13-bit partial sum + 8-bit threshold + 2 bits of control] × 2 (in and out) + 16-bit memory data + 21-bit memory address = 339 pins). The current chips (Xilinx 4010) have enough logic and routing to process an entire column at one time. If this kind of I/O bandwidth were available [14], the system would consume 2 SPLASH 2 boards but would have 16 times the throughput because 16 times as much data would be processed each clock cycle.
Another option would be to distribute the chip image to each FPGA memory and have a wide memory data path. This would require at least 195 I/O pins ([13-bit partial sum + 8-bit threshold + 2-bit control] × 2 (in and out) + 128-bit memory data path + 21-bit memory address = 195) but would have the same throughput.
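The two pin budgets can be reproduced with simple arithmetic; the widths come from the text, and the function names are ours.

```python
def pins_full_column():
    """Pins for a full-column datapath between neighboring chips."""
    per_direction = 128 + 13 + 8 + 2      # pixel bits, partial sum, threshold, control
    return per_direction * 2 + 16 + 21    # in + out, memory data, memory address

def pins_wide_memory():
    """Pins when the chip image is distributed to each FPGA's memory."""
    per_direction = 13 + 8 + 2            # partial sum, threshold, control
    return per_direction * 2 + 128 + 21   # in + out, wide memory data, memory address

print(pins_full_column())   # pin count for the full-column option
print(pins_wide_memory())   # pin count for the wide-memory option
```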
`
5.1.1 Interface Improvements

The current implementation assumes that all data are loaded on the SPLASH 2 board before processing commences. A more realistic implementation would stream data to the system as it is produced. Thus, DMA channels or other high-performance I/O hardware, if added to the system, would make performance more sustainable across multiple images.
5.2 Other Implementations
There are several ways this algorithm could be implemented. The most obvious way is to do a fully parallel version that looks at a full 16×16 pixels at a time. Another is a linear systolic version that looks at one pixel at a time. A linear systolic implementation was designed because SPLASH 2 is organized that way.
This implementation takes a relatively brute-force approach and feeds the entire chip into the system for processing. Whether or not a given pixel actually contributes to the calculation is determined after the image data are already in the pipeline. Other approaches may be more selective and only access image pixels that actually contribute to the calculation. This may reduce overall memory bandwidth requirements and suggest other implementation strategies. An implementation that only examined pixels that will be used could potentially give a performance increase when correlating the Bright template, because the majority of pixels in the template represent chip locations that do not contribute to the computation. However, such an approach would require a complicated control scheme and may not map well onto SPLASH 2.
`
Here at BYU there is ongoing work targeting three different FPGA architectures, each looking to better understand the hardware and ATR algorithms. These include a bit-serial version targeted at National Semiconductor's CLAy FPGA and a Xilinx 6200. Another project is looking at a fully parallel implementation targeting an Altera Flex 10K part.
`
6 Conclusion

SPLASH 2 is organized as a linear array of FPGA chips, each connected to a separate memory. Because of this, applications can only make effective use of SPLASH 2 resources if (1) individual circuit modules are small enough to easily fit into the limited resources available within a single FPGA (Xilinx 4010) and (2) circuit modules can be grouped together in such a way that the interconnect requirements between modules on separate chips will fit in the limited chip-to-chip interconnect available on SPLASH 2.
Both of these requirements were met in this application by carefully organizing all correlations (both Shapesum and final correlation) as independent column-oriented processing elements (PEs). Each PE occupied only a small amount of FPGA resources, thereby making it possible to place several PEs into a single FPGA. PEs that are integrated on a single FPGA communicate via the available on-chip FPGA routing; PEs that are located on different FPGAs can easily communicate via the limited inter-chip routing on SPLASH 2.
`
This repetitive, column-oriented organization also eases the programming of SPLASH 2 considerably. Much of the VHDL code was reused, and because the fourth and fifth chips are identical (because of the linear column layout, see Figure 8) it was possible to download the same bit-stream into both chips.
This column organization presents a good tradeoff between a fully parallel design and a design that fits within the restrictions of the SPLASH 2 architecture. Each column operates in parallel, computing a column
`
`
`
`
`
`
sum that, when complete, is passed on to its neighboring column. While it is possible to achieve higher levels of parallelism, it is likely that such designs will require denser interconnect than SPLASH 2 is capable of providing (see Section 5.1). By limiting communication to only adjacent columns, significant parallelism is still achieved with only moderate amounts of interconnect.
This implementation can also take advantage of SPLASH 2's extensibility, giving it even higher performance capability. Simply by passing the pixel data from chip 6 (see Figure 8) to the next chip on the SPLASH 2 board and configuring the next 6 chips in the array with the same bit-stream as the first 6 chips, the throughput of the system would be doubled. This extensibility can also be expanded to multiple boards, giving a linear increase in system throughput.
`
References

[1] S. Trimberger. A reprogrammable gate array and applications. Proceed