Automated Target Recognition on SPLASH 2¹

Michael Rencher and Brad L. Hutchings
Department of Electrical and Computer Engineering
Brigham Young University
Provo, UT 84602
Abstract

Automated target recognition is an application area that requires special-purpose hardware to achieve reasonable performance. FPGA-based platforms can provide a high level of performance for ATR systems if the implementation can be adapted to the limited FPGA and routing resources of these architectures. This paper discusses a mapping experiment in which a linear-systolic implementation of an ATR algorithm is mapped to the SPLASH 2 platform. Simple column-oriented processors were used throughout the design to achieve high performance with limited nearest-neighbor communication. The distributed SPLASH 2 memories are also exploited to achieve a high degree of parallelism. The resulting design is scalable and can be spread across multiple SPLASH 2 boards with a linear increase in performance.
1 Introduction
Automated target recognition (ATR) is a computationally demanding application area that typically requires special-purpose hardware to achieve desirable performance. ASICs are not an option for these systems due to high non-recurring engineering (NRE) costs and because the algorithms are constantly evolving. Existing FPGA-based computing platforms can potentially provide the necessary performance and flexibility for evolving ATR systems; however, mapping applications to these existing platforms can be very challenging because they lack abundant interconnect and FPGA resources. The key to achieving a high-performance implementation of ATR algorithms with existing platforms is to carefully organize the design of the ATR implementation so that it can communicate via the limited interconnect and can be easily partitioned among the FPGA devices.
This paper presents a linear systolic implementation of an existing ATR algorithm on SPLASH 2 that is well-suited to the SPLASH 2 architecture. Inter-FPGA communication is limited and easily accommodated by the SPLASH 2 interconnect. Moreover, the implementation can be scaled across any number of SPLASH 2 boards and achieves high performance with limited resources.

¹This work was supported by DARPA/CSTO under contract number DABT63-94-C-0058 under a subcontract to National Semiconductor.
This paper briefly discusses the entire ATR algorithm as developed by Sandia National Labs, and then overviews the design and implementation of the most computationally demanding part of the algorithm: Chunky SLD. The SPLASH 2 implementation is presented in some detail, along with future directions and possible improvements.
2 Automatic Target Recognition

The goal of a typical ATR system is to analyze a digital representation of a scene and locate/identify objects that are of interest. Although this goal is conceptually simple, ATR systems have extremely demanding I/O and computational requirements: image data are large, can be generated in real-time, and must be processed quickly so that results remain relevant in a dynamic environment. The common use of special-purpose hardware in nearly all high-performance ATR systems is a clear indication of the computational complexity of these systems.
This paper details the implementation of an existing ATR algorithm on SPLASH 2. The algorithm in question was developed at Sandia National Laboratories and was designed to detect partially obscured targets in Synthetic Aperture Radar (SAR) images. It is commonly referred to as Chunky SLD, so named for the second step of the algorithm that differentiates this algorithm from others developed at Sandia. The algorithm consists of the following three steps: (1) Focus of Attention (FOA), (2) Second-Level Detection (SLD), and (3) Final Identification (FI). Each of these steps is introduced below so that the algorithm implementation can be understood in its operating context.
Figure 1: ATR Block Diagram.
2.1 Focus of Attention (FOA)

Focus of attention is the first step of the ATR process and uses image morphology techniques to detect potential targets in SAR data. FOA operates on "down-sampled" SAR images that are approximately 500-1000 pixels on a side. Once FOA detects a potential target, it determines the approximate center of the potential target and creates 2x down-sampled subimages of the original SAR data, where each subimage contains a single target centered within the subimage. These subimages are referred to as chips and are 128 x 128 pixels.
2.2 Second Level Detection (SLD)

The SLD step processes the chips generated by the FOA step. SLD further restricts the areas of interest by giving the potential targets coordinates and angular orientation. SLD does this by correlating predefined binary templates with the areas of interest. The templates represent different object orientation angles and are oriented between 5 and 10 degrees apart. SLD also uses adaptive threshold levels determined by the overall image intensity.

The algorithm studied in this paper is a variation of SLD called Chunky SLD. Chunky SLD adds a level of complexity to SLD by using more templates to represent objects that have been partially obscured (partially hidden by camouflage or objects overhead). This allows better target recognition at a cost of higher computational requirements. Chunky SLD is discussed in more detail later in this section.
2.3 Final Identification (FI)

The FI algorithm correlates full-resolution image data with templates of finer angular resolution (3 to 5 degrees). FI also uses adaptive threshold levels. The output of FI is the location of the target and a confidence level corresponding to the degree of correlation between the object and the FI templates.
2.4 The Chunky SLD Algorithm

The general goal of the Chunky SLD algorithm is to recognize targets that are partially concealed or obscured in some way. To achieve this goal, the designers of this algorithm treat the target as a set of 40 template pairs, where each pair of templates is a digital representation of some salient feature of the specific target. If the majority of the template pairs strongly correlate with the image data, then a match of the
overall target is assumed.

Figure 2: Chunky SLD.

Each pair of templates consists of a Bright template and a Surround template.
The Bright template is a representation of expected reflections directly from surfaces of a salient target feature, while the Surround template represents expected absorption in the immediate area surrounding the target feature. Each pair of a Bright and a Surround template is referred to as a chunk, so called because each pair of templates represents a "chunk" of the overall target. Each set of 40 chunks represents a single target at a specific rotation. There are 72 orientations, each representing a different target orientation and radar incidence angle. Each set of 72 orientations is referred to as a class and is the complete set of templates that must be correlated with a chip to detect the presence of a specific target.
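Putting these numbers together (and matching the per-class template count quoted in Section 4.3), one class comprises

$$72\ \text{orientations} \times 40\ \tfrac{\text{chunks}}{\text{orientation}} \times 2\ \tfrac{\text{templates}}{\text{chunk}} = 5760\ \text{templates per class}.$$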
Figure 3: Template Organization (a class: 72 orientations for one object, each orientation consisting of 40 chunks).
The first step of the Chunky SLD algorithm is to correlate the chip with the Bright template. This correlation value is used to compute a value that will be used to threshold the incoming chip data, converting the 8-bit chip image into a binary image. The equations describing this process are shown below.

$$\mathit{Shapesum}(x, y) = \sum_{a,b=0}^{15} B_{temp}(a, b) \cdot \mathit{Chip}(x+a,\, y+b)$$

$$\mathit{Threshold}(x, y) = \frac{\mathit{Shapesum}(x, y)}{\text{Bright template pixel count}}$$
The values obtained by correlating the Bright and Surround templates with the binarized chip (B_sum and S_sum) are checked against minimum values to generate a "hit value" for each offset in the chip. The threshold value is also checked to see if it falls in an acceptable range when generating the hit values.

    if ((T_max >= T >= T_min) AND
        (B_sum >= B_min) AND
        (S_sum >= S_min))
    then
        Hit = 1;
    else
        Hit = 0;                                   (1)
The hit values are accumulated for each offset for a specific orientation (40 chunks). The highest values are used to identify the areas of interest for the final identification step.
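As a software restatement of the two equations and the hit test above, the following C sketch computes the hit value for one chunk at one chip offset. It is a minimal model for clarity, not the hardware design: all identifiers are ours, and the binarization polarity and the correlation of the Surround template against below-threshold pixels are assumptions based on the absorption description above.

/* Software sketch of the per-offset Chunky SLD hit test.
   Names and polarities are illustrative assumptions. */
#include <stdint.h>

#define TSIZE 16    /* templates are 16x16 (offsets run to 128-15) */

typedef struct {
    uint8_t bright[TSIZE][TSIZE];    /* 0/1 template values */
    uint8_t surround[TSIZE][TSIZE];  /* 0/1 template values */
    int     bright_pixel_count;      /* number of 1s in bright */
} Chunk;

/* Hit value (0 or 1) for chunk c at chip offset (x, y). */
int chunk_hit(const uint8_t chip[128][128], const Chunk *c,
              int x, int y, int t_min, int t_max, int b_min, int s_min)
{
    int shapesum = 0, b_sum = 0, s_sum = 0;

    /* Shapesum: Bright template correlated with the 8-bit chip. */
    for (int a = 0; a < TSIZE; a++)
        for (int b = 0; b < TSIZE; b++)
            shapesum += c->bright[a][b] * chip[x + a][y + b];

    /* Adaptive threshold, unique to this offset. */
    int t = shapesum / c->bright_pixel_count;

    /* Binarize under the threshold and correlate both templates:
       Bright matches above-threshold (reflecting) pixels, Surround
       matches below-threshold (absorbing) pixels (our assumption). */
    for (int a = 0; a < TSIZE; a++)
        for (int b = 0; b < TSIZE; b++) {
            int bit = chip[x + a][y + b] >= t;
            b_sum += c->bright[a][b] * bit;
            s_sum += c->surround[a][b] * (1 - bit);
        }

    /* Equation 1: all three tests must pass. */
    return (t <= t_max && t >= t_min) && (b_sum >= b_min) && (s_sum >= s_min);
}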
2.4.1 Template Characteristics

Template pairs exhibit two useful properties: sparseness and mutual exclusivity. The Bright template consists mostly of zeros; only 3 to 10 percent of the template values are '1's, and this limits the magnitude of the Shapesum and B_sum values. The Bright and Surround templates are also mutually exclusive; that is, if the two templates are overlaid, no "on" pixels will overlap. When carefully exploited, both of these properties lead to more compact and higher-performance hardware.
3 Other Implementations of Chunky-SLD

As explained above, the ATR application is computationally demanding. There are (128-15) x (128-15) offsets per chunk x 40 chunks x 72 orientations ≈ 36 x 10⁶ hit values to compute per targeted object (or per class, see Figure 3). The computational rate and I/O requirements of this algorithm make it impossible to use current microprocessors. Thus any high-performance implementation of this algorithm will require special-purpose hardware to meet performance goals.
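Spelled out, the per-class hit count is

$$(128-15)\times(128-15)\ \tfrac{\text{offsets}}{\text{chunk}} \times 40\ \tfrac{\text{chunks}}{\text{orientation}} \times 72\ \text{orientations} = 36{,}774{,}720 \approx 36 \times 10^{6}\ \text{hit values}.$$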
However, custom ASICs are also not an option, because the algorithm is constantly evolving and because commercial-off-the-shelf (COTS) components are often dictated by the ultimate customers of ATR systems. The only remaining options are to construct the system with commercially available fixed-function devices such as correlators, multipliers, etc., or to use programmable logic, e.g., FPGAs [1, 2]. Thus all known implementations of Chunky SLD use either fixed-function devices or programmable logic.
3.1 Sandia

Current Sandia implementations of ATR are based on commercially available one-bit correlator chips. Sandia designers adapted the basic Chunky SLD algorithm so they could exploit the capabilities of these components to achieve high performance. Rather than compute the Shapesum and then the final correlation, the two steps were done in parallel: the correlation was done at 8 discrete threshold levels, and the Shapesum determined which threshold to use for each offset.
Figure 4: Sandia's Implementation.
3.2 UCLA

A group at UCLA (headed by John Villasenor) [3] is working on an FPGA-based SLD implementation. By doing bit-level correlations they are able to build very compact adder trees that take advantage of template sparseness and of the FPGAs' on-board lookup-table memory. Their approach compiles template information directly into the hardware and relies on fast reconfiguration to switch templates. They also try to take advantage of template overlap by computing the results of multiple correlations simultaneously.
4 Chunky-SLD on SPLASH 2

On SPLASH 2, a static implementation was chosen to avoid the overhead of reconfiguring the hardware during execution. To reduce the hardware requirements without reducing throughput, a deeply pipelined design (~400-cycle latency) was implemented. The Shapesum unit generates the threshold value, which is then used to generate the final correlation values. (Note that there is a unique threshold for each offset.) By doing this, only two final correlations have to be computed per offset (one B_sum and one S_sum).

The Sandia implementation computes the Shapesum and final correlation in parallel, which forces it to compute multiple final correlations, while our implementation does them serially. This allows us to use an exact threshold value. Also, only one final correlation (per template) needs to be computed, because the threshold value is computed before the final correlation begins. The technique used was to look at the correlations by column, compute the partial correlation for that column, and sum up the partial sums for all 16 columns. In this method 16 different column correlations proceed in parallel, but only one column of data needs to be available for processing.
4.1 Implementing the Correlation as Column Sums
Figure 5 depicts a simple example that demonstrates a correlation of a 3x3 template with a binary image. Each row in the table represents one clock cycle. The first column is the clock cycle number; corresponding numbers are found in the pixel-load-order box at the right. A new pixel is brought in on each clock cycle. The P1, P2, and P3 columns represent the three column processing units needed for a three-column template. The last column represents the actual output. Clock cycles 9 through 12 have been expanded to show how data (pixels and partial sums) are passed from column to column and to illustrate the data format. Once the pipeline is full, a new correlation value is computed as each column arrives (three pixels/cycles).

Note that valid output comes every three cycles because the template is three rows tall. All three processing elements are actively processing pixel values at all times. The SPLASH 2 implementation works just like the example except for the size of the columns (16 pixels instead of three) and the data format (eight-bit instead of one-bit).
Figure 5: Example Column Sum. (O = output; T = clock cycle; pixels are summed in at the marked points.)
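To make the dataflow concrete, the following C program is a small software model of the Figure 5 example, advanced one column period (three clock cycles) per iteration rather than cycle by cycle. The template and image values are invented for illustration; the point is the movement of the partial sums, which mirrors the description above.

/* Column-granularity model of the 3-column systolic correlation
   in Figure 5. All values and names are illustrative. */
#include <stdio.h>

#define ROWS 3      /* template height = pixels per image column   */
#define COLS 3      /* template width  = number of column PEs      */
#define IMG_COLS 6  /* image width (the image is ROWS pixels tall) */

int main(void)
{
    /* Invented 3x3 binary template and 3-row binary image, stored
       column-major to mirror the column-order pixel stream. */
    int tmpl[COLS][ROWS]    = {{1,0,1}, {0,1,0}, {1,1,0}};
    int img[IMG_COLS][ROWS] = {{1,1,0}, {0,1,1}, {1,0,1},
                               {1,1,1}, {0,0,1}, {1,0,0}};
    int sum[COLS] = {0};    /* partial sum resident in each PE */

    /* One iteration = one column period (ROWS clock cycles). */
    for (int c = 0; c < IMG_COLS; c++) {
        /* Between columns the partial sums advance one PE downstream
           and a fresh zero enters the first PE. */
        for (int p = COLS - 1; p > 0; p--)
            sum[p] = sum[p - 1];
        sum[0] = 0;

        /* Each PE adds the dot product of its template column with
           the image column now streaming past it (each PE runs one
           cycle behind its upstream neighbor). */
        for (int p = 0; p < COLS; p++)
            for (int r = 0; r < ROWS; r++)
                sum[p] += tmpl[p][r] * img[c][r];

        /* Once the pipeline is full, the last PE emits one complete
           correlation per column; earlier outputs are discarded. */
        if (c >= COLS - 1)
            printf("window %d: %d\n", c - COLS + 1, sum[COLS - 1]);
    }
    return 0;
}

Because a completed sum has visited each template column against successive image columns, each emitted value is one position of the sliding-window correlation, exactly as in the figure.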
4.2 Platforms/Implementations

Chunky SLD was implemented on the SPLASH 2 board. SPLASH 2 has shown itself to be a useful platform and has had numerous applications mapped to it [4, 5, 6, 7, 8, 9, 10, 11]. The implementation was done in VHDL and simulated/synthesized using Synopsys. All place and route was done automatically using Xilinx place and route tools.

One of the goals of the implementation was to run the system so that it consumed a pixel per cycle. This means that on each cycle every processing element (PE) must be able to process a new pixel. The implementation follows Sandia National Labs' algorithm (not its implementation) as closely as possible (see Section 2.4).

The SPLASH 2 board was developed by SRC (the Supercomputing Research Center, Institute for Defense Analyses) [12]. The SPLASH 2 board is a linear systolic array of processing elements (FPGAs), each with its own memory.
4.2.1 Splash 2 Hardware

From a conceptual point of view, the SPLASH 2 system consists of a linear array of processing elements. This makes SPLASH 2 a good candidate for linear-systolic applications with limited neighbor-to-neighbor interconnect. Because of its limited routing resources, SPLASH 2 has difficulty implementing multi-chip systems that are not linear systolic, though they are possible [8].

The actual SPLASH 2 platform consists of a board with 16 Xilinx 4010 chips (plus one for control) arranged in a linear systolic array. Each chip has a limited 36-bit connection to its two nearest neighbors, and each Xilinx 4010 is connected to a 512 kbyte memory (16-bit word size). The memory can handle back-to-back reads or back-to-back writes, but requires one 'dead' (turn-around) cycle when changing from write to read. There is also a crossbar connected to all of the chips that allows some level of random connection between chips. Up to 16 boards can be daisy-chained together to provide a large linear-systolic array of 256 elements.

Figure 6: Single Processing Element of SPLASH 2.

Figure 7: SPLASH 2 platform.
4.3 ATR Implementation on Splash 2

Similar to the example, the SPLASH 2 implementation processes one pixel at a time and loads pixels in column order so that the partial sums can be generated and passed from column to column. All template data are stored in the memories adjacent to the FPGAs on the SPLASH 2 boards. Each memory can hold several thousand templates, making it possible to store all of the templates for a single class (5760) on a single SPLASH 2 board. There is sufficient room in the FPGA design to store a single template. Templates are switched by reading the new template data out of the memory and storing it within the FPGA. However, because this implementation is deeply pipelined, it is necessary to flush all current
data from the system when switching to a new template. The overhead of the flushing operation is minimal (approximately 0.14%).

During each clock cycle, a new pixel arrives at the FPGA. If the template bit corresponding to this pixel is on, the incoming pixel is added to the current partial sum. Every 16 clock cycles, this partial sum is passed on to the next column and a new partial sum is received from the previous column. The last column computes a complete Shapesum every 16 cycles (one column). The final correlation of the Bright and Surround templates with the thresholded chip data works similarly, except there are two correlations (one for each template).

Intermediate hit values are stored in a table, referred to as the hit-table, in one of the local memories. Each location in the table corresponds to an x-y offset of a chip, the origin of a single correlation. For each offset, if a chunk "hits", the corresponding location in this table is incremented. Thus the table contains the accumulated hit values for all chunks and all offsets that have been computed to that point.

Hits are computed according to Equation 1. First, each B_sum and S_sum value is compared to its corresponding minimum value. Second, the threshold value corresponding to each B_sum and S_sum is checked to see whether it lies between a certain minimum and maximum value. For reasons of efficiency, the threshold value is actually examined earlier in the process, and a zero is stored in the lookup-table memory if the threshold is out of bounds. This works correctly because if the
threshold is zero, it will cause the B_sum to be zero, which will in turn cause the B_sum comparison to fail. Otherwise, if all three of these tests come back true, then a hit has been found for the corresponding offset (see Equation 1) and the corresponding location in the hit-table is incremented. After the 40 templates have been tested against the same chip, the two offsets with the highest accumulated hit values are written into memory where the host computer can read them. This is accomplished by examining the hit values during this process and retaining the top two values in special memory locations. These final two hit values (which represent the top two orientations for a specific class) are used in the FI step.
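The following C fragment is a minimal software sketch of this bookkeeping: accumulate a hit per offset and retain the top two totals. The names and the exact update logic are ours (the paper keeps the hit-table and the two best values in a SPLASH 2 local memory); it illustrates the logic, not the hardware.

/* Sketch of the hit-table update and top-two tracking described
   above. The 113x113 offset geometry follows Section 3. */
#include <stdint.h>

#define OFFSETS (128 - 15)                  /* valid offsets per axis */

static uint8_t hit_table[OFFSETS][OFFSETS]; /* one counter per x-y offset */

typedef struct { int x, y, hits; } Best;    /* a candidate offset */

/* Called once per chunk and offset with the hit (0 or 1) from
   Equation 1. Initialize best[] to {{-1,-1,0},{-1,-1,0}} before
   the first chunk of an orientation. */
void record_hit(int x, int y, int hit, Best best[2])
{
    if (!hit)
        return;
    int h = ++hit_table[x][y];           /* accumulate over all 40 chunks */

    if (x == best[0].x && y == best[0].y) {
        best[0].hits = h;                /* current leader extends its lead */
    } else if (h > best[0].hits) {
        best[1] = best[0];               /* old leader becomes runner-up */
        best[0] = (Best){ x, y, h };
    } else if (h > best[1].hits) {
        best[1] = (Best){ x, y, h };     /* new runner-up */
    }
}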
For the SPLASH 2 board, as with most FPGA systems, partitioning is a major issue. The design needed to be modular so that different design modules could be reassigned to different FPGAs as necessary. This is where the column modules were so valuable (see Figure 8).
4.4 Special Features

This implementation has several notable characteristics, including distributed control, modular design for partitioning, and memory utilization.
4.4.1 Distributed Control

Control in this system is distributed throughout the array. Each column module has its own state-machine-based control. Module synchronization is achieved by distributing a control token through the pipeline along with the data. When a module receives this token, it resets its internal state machines and retrieves template data from its local memory. A memory controller resides in each processing element (FPGA) to retrieve template data and grant memory access to all other modules.
4.4.2 Modular Design (Design for Partitioning)

Each column in both the Shapesum and the final correlation uses totally self-contained modules that can be easily migrated from processing element to processing element. This was done to simplify the partitioning onto SPLASH 2 [13]. Memory data had to be carefully partitioned as well, so that the data could follow the module to which it applied. The regularity of the design was an important concern; it allowed the placement of specific circuit modules to be dictated by the requirements of the algorithm and not by the limited interconnect of the platform. There are 16 identical modules in the Shapesum and 16 more identical modules in the final correlation. Along with these there are a divide module, a hit accumulator, and 2 delay modules (see Figure 8).
4.4.3 Memory Usage

The memories in SPLASH 2 serve several purposes. The template information is stored in them. They are also used to implement video shift registers that correct for the latency incurred during threshold computation. These shift registers require that two memories be used in tandem, because every clock cycle a new pixel (8 bits) must be written to memory and a delayed pixel must be read from memory, while the bandwidth of one memory is such that it can handle two pixels (load and store) every three cycles. Thus one memory delays two pixels and skips two pixels, while the other memory delays the two pixels that the first memory skipped and skips the two pixels that the first memory delayed. The divide unit and the final results, including the accumulated hit values, also make use of the memories.
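The interleaving is easier to see in software form. The sketch below models the two-memory delay line at the addressing level: pixels alternate between the memories in pairs, so each memory services only every other pair, and the value read from a slot is the one written DELAY pixels earlier. The delay length and all names are illustrative assumptions, and the cycle-level timing (the three-cycle read/write pattern) is deliberately not modeled.

/* Addressing-level model of the tandem video shift registers. */
#include <stdint.h>

#define DELAY 400   /* illustrative delay, in pixels; a multiple of 4 */

static uint8_t mem[2][DELAY / 2];   /* the two single-ported memories */

/* Store incoming pixel i and return the pixel from DELAY pixels ago.
   Memory 0 takes pixels 0,1 and skips 2,3; memory 1 delays the pair
   that memory 0 skipped; and so on, alternating in pairs. */
uint8_t delay_pixel(long i, uint8_t in)
{
    int  m    = (int)((i / 2) % 2);              /* memory owning pixel i */
    long slot = ((i / 4) * 2 + (i % 2)) % (DELAY / 2);
    uint8_t out = mem[m][slot];                  /* read the DELAY-old pixel */
    mem[m][slot] = in;                           /* overwrite with the new one */
    return out;
}

Because each memory's slots advance two positions per four pixels, a slot is revisited exactly DELAY pixels later, which is what makes the pair of half-size buffers behave as one full-length shift register.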
4.5 Performance Metrics

There are many metrics that could be used to measure the value of this implementation. This section discusses some of them.

4.5.1 Performance

This implementation runs at a frequency of 19 MHz using a test template that exercises all column modules. XDelay (a Xilinx timing tool) reports a guaranteed frequency of 13.2 MHz. Designs that run in the 10 to 20 MHz range are typical [8, 10]. Using these frequencies, a single system can process one orientation every 0.487 seconds (0.701 seconds using a 13.2 MHz clock):
$$\frac{128\ \text{col} \times (128-15)\ \text{row} \times 16\ \tfrac{\text{cycles}}{\text{col}}}{19\ \text{MHz}} \times 40\ \tfrac{\text{chunks}}{\text{orientation}} = \left(12.2\ \tfrac{\text{ms}}{\text{chunk}}\right) \times 40 \approx 487\ \text{ms at 19 MHz}$$

or

$$\frac{128\ \text{col} \times (128-15)\ \text{row} \times 16\ \tfrac{\text{cycles}}{\text{col}}}{13.2\ \text{MHz}} \times 40\ \tfrac{\text{chunks}}{\text{orientation}} = \left(17.5\ \tfrac{\text{ms}}{\text{chunk}}\right) \times 40 \approx 701\ \text{ms at 13.2 MHz}$$
Figure 8: SPLASH 2 Implementation (Shapesum, Final Correlation, Hit Computation and Accumulation).
Using a brute-force approach, it takes 59 seconds to compute the same data on an HP 770 (110 MHz); that is two orders of magnitude slower. On one SPLASH 2 board, two processing units can be installed, yielding the ability to process one orientation (40 chunks) every 0.244 seconds (0.350 seconds at 13.2 MHz).
4.5.2 Multi-board Splash 2

If a 16-board system were used, 256 processing elements would be available. This would make 42 processing units (256 PEs ÷ 6 PEs per PU) and provide 42 times the throughput. Thus one orientation can be processed every 11.6 ms (16.7 ms at 13.2 MHz). This is equal to covering 86 orientations per second (60 orientations per second at 13.2 MHz).
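Both figures follow directly from the single-unit numbers above:

$$\left\lfloor \frac{256\ \text{PEs}}{6\ \text{PEs/PU}} \right\rfloor = 42\ \text{PUs}, \qquad \frac{487\ \text{ms}}{42} \approx 11.6\ \tfrac{\text{ms}}{\text{orientation}} \;\Rightarrow\; \approx 86\ \tfrac{\text{orientations}}{\text{s}}.$$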
4.5.3 Memory Bandwidth

The distributed memory and its aggregate bandwidth are what make this implementation possible. Because the memory is distributed to each chip, this implementation can be extended to a multi-board system quite easily. Consider that a processing unit consists of 6 FPGAs and their corresponding memories. Each processing unit performs 10 reads and 13 writes every 16 clock cycles. The SPLASH 2 memories require 3 clock cycles to perform a read and a write, for a total available bandwidth of 2 reads and 2 writes every cycle (6 memories x 2 accesses / 3 cycles). Thus a processing unit consumes approximately half of the available memory bandwidth. If there were only two or three memories available for the six-chip system, this implementation would not be possible.
4.5.4 Memory-stored Templates

The decision to use statically configured hardware and memory-stored templates was made early in the design process. Another approach would be to use hard-coded templates, i.e., hardware with the template data compiled in as constants. A hard-coded template implementation could possibly use less hardware by removing unused accumulators and template-reading logic; however, it would require the hardware to be reconfigured at run-time in order to switch templates. This would take significantly longer than the memory read done in the memory-stored template version. The SPLASH 2 system requires approximately 17 ms per chip to configure, and the system would have to be reconfigured each time a new template was used. To process one orientation, the system would have to be configured 40 different times, which would adversely impact system throughput. In fact, even if the system could be reconfigured in one cycle, the performance would not improve, because the time spent reading templates from memory is hidden while the pipeline is initially filling (during the flush cycles). There are other advantages as well: design time is simplified because new hardware bit streams do not have to be generated and tested to use new templates. Re-compilation may improve hardware utilization, but that would come at the cost of repartitioning difficulty.
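For scale, assuming one full reconfiguration per template at the stated 17 ms per chip, configuration alone would cost more per orientation than the entire compute time of the static design:

$$40\ \tfrac{\text{reconfigurations}}{\text{orientation}} \times 17\ \tfrac{\text{ms}}{\text{chip}} = 680\ \text{ms per chip per orientation} > 487\ \text{ms}.$$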
5 Future Work

In the future, additional related research will be conducted on possible improvements to the implementation described in this paper. In addition, alternative implementations that are not currently possible with the SPLASH 2 platform are under examination. Finally, work is underway on a bit-serial version of this implementation.
5.1 Splash 2 Improvements

Although SPLASH 2 was suitable for this project, there are many possible architectural improvements that would increase performance significantly. A faster memory interface that supports a two-cycle read-write operation, or perhaps a dual-ported memory interface, would provide improved memory bandwidth. A larger processing element would give more resources and make implementations more flexible. Updated VHDL libraries that conform to standard data types would help design portability and simplicity.
If chip-to-chip I/O were increased on SPLASH 2, a full column could be processed each clock cycle. This would require at least 339 I/O pins ([128 pixel bits + 13-bit partial sum + 8-bit threshold + 2 bits of control] x 2 (in and out) + 16-bit memory data + 21-bit memory address = 339 pins). The current chip (Xilinx 4010) has enough logic and routing to process an entire column at one time. If this kind of I/O bandwidth were available [14], the system would consume 2 SPLASH 2 boards but would have 16 times the throughput, because 16 times as much data would be processed each clock cycle.

Another option would be to distribute the chip image to each FPGA memory and use a wide memory data path. This would require at least 195 I/O pins ([13-bit partial sum + 8-bit threshold + 2-bit control] x 2 (in and out) + 128-bit memory data path + 21-bit memory address = 195) but would have the same throughput.
5.1.1 Interface Improvements

The current implementation assumes that all data are loaded on the SPLASH 2 board before processing commences. A more realistic implementation would stream data to the system as it is produced. Thus DMA channels or other high-performance I/O hardware, if added to the system, would make performance more sustainable across multiple images.
5.2 Other Implementations

There are several ways this algorithm could be implemented. The most obvious is a fully parallel version that looks at a full 16x16 block of pixels at a time; another is a linear systolic version that looks at one pixel at a time. A linear systolic implementation was designed because SPLASH 2 is organized that way.

This implementation takes a relatively brute-force approach and feeds the entire chip into the system for processing. Whether or not a given pixel actually contributes to the calculation is determined after the image data are already in the pipeline. Other approaches might be more selective and access only the image pixels that actually contribute to the calculation. This could reduce overall memory bandwidth requirements and suggest other implementation strategies. An implementation that only examined pixels that will be used could potentially give a performance increase when correlating the Bright template, because the majority of pixels in the template represent chip locations that do not contribute to the computation. However, such an approach would require a complicated control scheme and may not map well onto SPLASH 2.
At BYU there is ongoing work targeting three different FPGA architectures, each aiming to better understand the hardware and the ATR algorithms. This includes bit-serial versions targeting National Semiconductor's CLAy FPGA and the Xilinx 6200, and another project looking at a fully parallel implementation targeting an Altera FLEX 10K part.
6 Conclusion

SPLASH 2 is organized as a linear array of FPGA chips, each connected to a separate memory. Because of this, applications can only make effective use of SPLASH 2 resources if (1) individual circuit modules are small enough to fit easily into the limited resources available within a single FPGA (Xilinx 4010), and (2) circuit modules can be grouped together in such a way that the interconnect requirements between modules on separate chips fit in the limited chip-to-chip interconnect available on SPLASH 2.

Both of these requirements were met in this application by carefully organizing all correlations (both Shapesum and final correlation) as independent column-oriented processing elements (PEs). Each PE occupied only a small amount of FPGA resources, thereby making it possible to place several PEs into a single FPGA. PEs that are integrated on a single FPGA communicate via the available on-chip FPGA routing; PEs that are located on different FPGAs can easily communicate via the limited inter-chip routing on SPLASH 2.
This repetitive, column-oriented organization also eases the programming of SPLASH 2 considerably. Much of the VHDL code was reused, and because the fourth and fifth chips are identical (because of the linear column layout, see Figure 8), it was possible to download the same bitstream into both chips.

This column organization presents a good tradeoff between a fully parallel design and a design that fits within the restrictions of the SPLASH 2 architecture. Each column operates in parallel, computing a column
sum that, when complete, is passed on to its neighboring column. While it is possible to achieve higher levels of parallelism, it is likely that such designs would require denser interconnect than SPLASH 2 is capable of providing (see Section 5.1). By limiting communication to only adjacent columns, significant parallelism is still achieved with only moderate amounts of interconnect.

This implementation can also take advantage of SPLASH 2's extensibility, giving it even higher performance capability. Simply by passing the pixel data from chip 6 (see Figure 8) to the next chip on the SPLASH 2 board and configuring the next 6 chips in the array with the same bit stream as the first 6 chips, the throughput of the system would be doubled. This extensibility can also be expanded to multiple boards, giving a linear increase in system throughput.
References

[1] S. Trimberger. A reprogrammable gate array and applications. Proceed