`
`Chapter 27 ■ SPIHT Image Compression
`
`63
`
`63
`
`48 47
`
`Coefficient 4
`
`32 31
`
`Coefficient 2
`Coefficient 3
`Left memory port
`
`16 15
`
`Coefficient 1
`
`0
`
`0
`32 31 25 24
`4 Children and parent's threshold data
`Threshold and sign data for 16 children
`Right memory port
`
`FIGURE 27.15 ■ Data passed to the SPIHT coder to calculate a single block.
`
[Figure: a spatial orientation tree traversed with a stack holding the current coefficients, child coefficients, and child maximum magnitudes; each coefficient is stored as a sign bit plus its magnitude.]
`FIGURE 27.16 ■ A depth-first search of the spatial orientation trees.
`
The second block in the second level is now complete, and its maximum magnitude can now be calculated, shown as the dark gray block in the stack's highest level. In addition, the 16 child coefficients in the lowest level were saved and are available. There are no child values for the lowest level since there are no children.
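The bottom-up maximum-magnitude computation can be modeled in software. The following is a minimal sketch (the function name and interface are ours, not the hardware design's): each 2 x 2 block's maximum is formed from the level below and passed one level up the stack.

```python
# Illustrative model: collapse a level of coefficient magnitudes into
# the per-2x2-block maxima that the stack passes up to the next level.
def block_max_magnitudes(level):
    """level: 2^n x 2^n list of coefficient magnitudes.
    Returns a half-size grid holding each 2x2 block's maximum."""
    half = len(level) // 2
    return [[max(level[2*i][2*j],     level[2*i][2*j + 1],
                 level[2*i + 1][2*j], level[2*i + 1][2*j + 1])
             for j in range(half)] for i in range(half)]
```

Applying the function repeatedly, from the lowest wavelet level upward, yields the maximum magnitude of every spatial orientation subtree.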
Another benefit of scanning the image in a depth-first search order is that Morton Scan Ordering is naturally realized within each level, although it is intermixed between levels. By writing data from each level to a separate area of memory and later reading the data from the highest wavelet level to the lowest, the Morton
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2137, p. 601
`
`
`
`FIGURE 27.17 ■ A block diagram of the SPIHT maximum magnitude phase.
`
Scan Ordering is naturally realized. A block diagram of the maximum magnitude phase is provided in Figure 27.17. Since two pixels are read together and the image is scanned only once, the runtime of this phase is half a clock cycle per pixel. Because the maximum magnitude phase computes in less time than the wavelet phase, the throughput of the overall system is not affected.
`
27.4.5 The SPIHT Coding Phase
The final SPIHT coding phase performs the Fixed Order SPIHT encoding in parallel, based on the data from the maximum magnitude phase. Coefficient blocks are read from the highest wavelet level to the lowest. As information is loaded from memory it is shifted from the variable fixed-point representation to a common fixed-point representation for every wavelet level. Once each block has been adjusted to an identical numerical representation, the parallel version of SPIHT is used to calculate what information each block will contribute to each bit plane.
The information is grouped and counted before being added to three separate variable FIFOs for each bit plane. The data that the variable FIFO components receive range in size from 0 to 37 bits, and the variable FIFOs arrange the block data into regular-sized 32-bit words for memory access. Care is also taken to stall the algorithm if any of the variable FIFOs becomes too full.
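As a software sketch of this packing step (the bit ordering and interface here are our assumptions, not the actual hardware's), a variable FIFO can be modeled as a bit accumulator that accepts 0- to 37-bit chunks and emits full 32-bit words:

```python
# Hypothetical model of a variable FIFO: variable-width input chunks
# (0 to 37 bits) are packed LSB-first into fixed 32-bit memory words.
class VariableFIFO:
    def __init__(self):
        self.bits = 0     # pending bits, packed LSB-first
        self.count = 0    # number of pending bits
        self.words = []   # completed 32-bit words ready for memory

    def push(self, value, nbits):
        assert 0 <= nbits <= 37, "block data ranges from 0 to 37 bits"
        self.bits |= (value & ((1 << nbits) - 1)) << self.count
        self.count += nbits
        while self.count >= 32:          # emit regular-sized words
            self.words.append(self.bits & 0xFFFFFFFF)
            self.bits >>= 32
            self.count -= 32
```

A "too full" condition in the real design would correspond to `words` growing faster than the memory port drains it, which is when the algorithm stalls.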
`Data from each buffer is output to a fixed location in memory and the number
`of bits in each bitstream is output as well. Given that data is added dynamically
`to each bitstream, there needs to be a dynamic scheduler to select which buffer
`
`
`should be written to memory. Since there are a large number of FIFOs that all
`require a BlockRAM, the FIFOs are spread across the FPGA, and some type
`of staging is required to prevent a signal from traveling too far. The scheduler
`selects which FIFO to read based on both how full a FIFO is and when it was
`last accessed.
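The selection rule itself is not detailed here, so the following is only a plausible policy sketch (our own, not the actual design): rank FIFOs by fullness and break ties in favor of the one serviced least recently.

```python
# Hypothetical scheduler policy: pick the FIFO to drain next based on
# fill level, breaking ties by time since last access.
def select_fifo(fifos, now):
    """fifos: list of dicts with 'fill' (buffered words) and 'last_read'
    (cycle of last service). Returns the index of the FIFO to read."""
    return max(range(len(fifos)),
               key=lambda i: (fifos[i]['fill'], now - fifos[i]['last_read']))
```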
Our studies showed that the LSP bitstream is roughly the same size as the LIP and LIS streams combined. Because of this, the LSP bitstream transfers more data to memory than the other two lists. In our design the LIP and LIS bitstreams share a memory port while the LSP stream writes to a separate memory port. Since a 2 x 2 block of coefficients is processed every clock cycle, the design takes one-quarter of a clock cycle per pixel, which is far less than the three-quarters of a clock cycle per pixel for the DWT. The block diagram for the SPIHT coding phase is given in Figure 27.18.
With 22 total bit planes to calculate, the design involves 66 individual data grouping and variable FIFO blocks. Although none consumes a significant amount of FPGA resources individually, 66 blocks do. The entire design required 160 percent of the resources in a Virtex 2000E, and would not fit in the target system. However, by removing the lower bit planes, fewer FPGA resources are needed, and the architecture can easily be adjusted to fit the FPGA being used. Depending on the size of the final bitstream required, the FPGA size used in the SPIHT phase can be varied to handle the number of intermediate bitstreams generated.
Removing lower bit planes is possible since the final bitstream transmits data from the highest bit plane to the lowest. In our design the lower 9 bit planes
`
`FIGURE 27.18 ■ A block diagram of the SPIHT coding phase.
`
`
were eliminated. Yet, without these lower planes, bitrates of up to 6 bpp can still be achieved. We found the constraint to be acceptable because we are interested in high compression ratios using low bitrates, and 6 bpp is practically a lossless signal. Since SPIHT is optimized for lower bitrates, the ability to calculate higher bitrates was not considered necessary. Alternatively, the use of a larger FPGA would alleviate the size constraint.
`
27.5 DESIGN RESULTS
The system was designed using VHDL with models provided by Annapolis Micro Systems to access the PCI bus and memory ports. Simulations for debugging purposes were carried out with ModelSim EE 5.4e from Mentor Graphics. Synplify 6.2 from Synplicity was used to compile the VHDL code and generate a netlist. The Xilinx Foundation Series 3.1i tool set was used to place and route the design. Lastly, the peutil.exe utility from Annapolis Micro Systems generated the FPGA configuration streams.
Table 27.3 shows the speed and runtime specifications of the final architecture. All performance numbers are measured results from the actual hardware implementation. Each phase computes on separate memory blocks, which can operate at different clock rates. The design can process any square image where the dimensions are a power of 2: 16 x 16, 32 x 32, up to 1024 x 1024.
Since the WildStar board is connected to the host computer by a relatively slow PCI bus, the throughput of the entire system we built is constrained by the throughput of the PCI bus. However, since the study is on how image compression routines could be implemented on a satellite, such a system would be designed differently, and would not contain a reconfigurable board connected to some host platform through a PCI bus. Instead, the image compression routines would be inserted directly into the data path and the data transfer times would not be the bottleneck of the system. For this reason we analyzed the throughput of just the SPIHT compression engine and analyzed how quickly the FPGAs can process the images.
The throughput of the system was constrained by the discrete wavelet transform at 100 MPixels/sec. One method to increase this rate is to compute more rows in parallel. If the available memory ports accessed 128 bits of data instead of the 64 bits with our WildStar board, the number of clock cycles per pixel could be reduced by half and the throughput could double.
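The throughput arithmetic above can be checked directly. This small sketch restates the cycles-per-pixel figures; the 128-bit-port case is the hypothetical upgrade, not the system as built.

```python
# Throughput = clock rate / cycles-per-pixel.
def throughput_mpixels(clock_mhz, cycles_per_pixel):
    return clock_mhz / cycles_per_pixel

dwt_64bit  = throughput_mpixels(75, 3 / 4)   # WildStar's 64-bit ports
dwt_128bit = throughput_mpixels(75, 3 / 8)   # hypothetical 128-bit ports
```

With 64-bit ports the DWT sustains 100 MPixels/sec; halving the cycles per pixel would double this to 200 MPixels/sec.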
`
TABLE 27.3 ■ Performance numbers

Phase       Clock cycles   Clock cycles per   Clock rate   Throughput        FPGA area (%)
            per pixel      512 x 512 image

Wavelet     3/4            182465             75 MHz       100 MPixels/sec   62
Magnitude   1/2            131132             73 MHz       146 MPixels/sec   34
SPIHT       1/4            65793              56 MHz       224 MPixels/sec   98
`
`
`Assuming the original image consists of 8 bpp, images are processed at a rate of
`800 Mbits/sec.
The entire throughput of the architecture is less than one clock cycle for every pixel, which is lower than parallel versions of the DWT. Parallel versions of the DWT used complex scheduling to compute multiple wavelet levels simultaneously, which left limited resources to process multiple rows at a time. Given more resources though, they would obtain higher data rates than our architecture by processing multiple rows simultaneously. In the future, a DWT architecture other than the one we implemented could be selected for additional speed improvements.
We compared our results to the original software version of SPIHT provided on the SPIHT web site [15]. The comparison was made without arithmetic coding since our hardware implementation does not perform any arithmetic coding on the final bitstream. Additionally, in our testing on sample NASA images, arithmetic coding added little to overall compression rates and thus was dropped [11]. An IBM RS/6000 Model 270 workstation was used for the comparison, and we used a combination of standard image compression benchmark images and satellite images from NASA's web site. The software version of SPIHT compressed a 512 x 512 image in 1.101 seconds on average without including disk access. The wavelet phase, which constrains the hardware implementation, computes in 2.48 milliseconds, yielding a speedup of 443 times for the SPIHT engine. In addition, by creating a parallel implementation of the wavelet phase, further improvements to the runtimes of the SPIHT engine are possible.
While this is the speedup we will obtain if the data transfer times are not a factor, the design may be used to speed up SPIHT on a general-purpose processor. On such a system the time to read and write data must be included as well. Our WildStar board is connected to the host processor over a PCI bus, which writes images in 13 milliseconds and reads the final datastream in 20.75 milliseconds. Even with the data transfer delay, the total speedup still yields an improvement of 31.4 times.
Both the magnitude and SPIHT phases yield higher throughputs than the wavelet phase, even though they operate at lower clock rates. The reason for the higher throughputs is that both of these phases need fewer clock cycles per pixel to compute an image. The magnitude phase takes half a clock cycle per pixel and the SPIHT phase requires just a quarter. The fact that the SPIHT phase computes in less than one clock cycle per pixel, let alone a quarter, is a striking result considering that the original SPIHT algorithm is very sequential in nature and had to consider each pixel in an image multiple times per bit plane.
`
27.6 SUMMARY AND FUTURE WORK
`
In this chapter we demonstrated a viable image compression routine on a reconfigurable platform. We showed how by analyzing the range of data processed by each section of the algorithm, it is advantageous to create optimized memory
`
`
structures as with our variable fixed-point work. Doing so minimizes memory usage and yields efficient data transfers. Here each bit transferred between memory and the processor board directly impacted the final results. In addition, our Fixed Order SPIHT modifications illustrate how by making slight adjustments to an existing algorithm, it is possible to dramatically increase the performance in a custom hardware implementation and simultaneously yield essentially identical results. With Fixed Order SPIHT the throughput of the system increased by over an order of magnitude while still matching the original algorithm's PSNR curve.
This SPIHT work was part of a development effort funded by NASA.
`
`References
`
[1] V. R. Algazi, R. R. Estes. Analysis-based coding of image transform and subband coefficients. Applications of Digital Image Processing XVIII, SPIE Proceedings 2564, 1995.
[2] Annapolis Microsystems. WildStar Reference Manual, Annapolis Microsystems, 2000.
[3] A. Benkrid, D. Crookes, K. Benkrid. Design and implementation of generic 2D biorthogonal discrete wavelet transform on an FPGA. IEEE Symposium on Field-Programmable Custom Computing Machines, April 2001.
[4] M. Carreau. Hubble Servicing Mission: Hubble is fitted with a new "eye." http://www.chron.com/content/interactive/space/missions/sts-103/hubble/archive/931207.html, December 7, 1993.
[5] C. M. Chakrabarti, M. Vishwanath. Efficient realization of the discrete and continuous wavelet transforms: From single chip implementations to mappings in SIMD array computers. IEEE Transactions on Signal Processing 43, March 1995.
[6] C. M. Chakrabarti, M. Vishwanath, R. M. Owens. Architectures for wavelet transforms: A survey. Journal of VLSI Signal Processing 14, 1996.
[7] T. Cormen, C. Leiserson, R. Rivest. Introduction to Algorithms, MIT Press, 1997.
[8] T. W. Fry. Hyper Spectral Image Compression on Reconfigurable Platforms, Master's thesis, University of Washington, Seattle, 2001.
[9] R. C. Gonzalez, R. E. Woods. Digital Image Processing, Addison-Wesley, 1993.
[10] A. Graps. An introduction to wavelets. IEEE Computational Science and Engineering 2(2), 1995.
[11] T. Owen, S. Hauck. Arithmetic Compression on SPIHT Encoded Images, Technical report UWEETR-2002-2007, Department of Electrical Engineering, University of Washington, Seattle, 2002.
[12] K. K. Parhi, T. Nishitani. VLSI architectures for discrete wavelet transforms. IEEE Transactions on VLSI Systems 1(2), 1993.
[13] J. Ritter, P. Molitor. A pipelined architecture for partitioned DWT based lossy image compression using FPGAs. ACM/SIGDA Ninth International Symposium on Field-Programmable Gate Arrays, February 2001.
[14] A. Said, W. A. Pearlman. A new fast and efficient image codec based on set partitioning in hierarchical trees. IEEE Transactions on Circuits and Systems for Video Technology 6, June 1996.
[15] A. Said, W. A. Pearlman. SPIHT image compression: Properties of the method. http://www.cipr.rpi.edu/research/SPIHT/spiht1.html.
[16] H. Sava, M. Fleury, A. C. Downton, A. Clark. Parallel pipeline implementations of wavelet transforms. IEEE Proceedings Part 1 (Vision, Image and Signal Processing) 144(6), 1997.
`
`
[17] J. M. Shapiro. Embedded image coding using zerotrees of wavelet coefficients. IEEE Transactions on Signal Processing 41(12), 1993.
[18] W. Sweldens. The Lifting Scheme: A new philosophy in biorthogonal wavelet constructions. Wavelet Applications in Signal and Image Processing 3, 1995.
[19] NASA. TERRA: The EOS flagship. The EOS Data and Information System (EOSDIS). http://terra.nasa.gov/Brochure/Sect_5-1.html.
[20] C. Valens. A really friendly guide to wavelets. http://perso.wanadoo.fr/polyvalens/clemens/wavelets/wavelets.html.
[21] M. Vishwanath, R. M. Owens, M. J. Irwin. VLSI architectures for the discrete wavelet transform. IEEE Transactions on Circuits and Systems, Part II, May 1995.
[22] Xilinx, Inc. The Programmable Logic Data Book, Xilinx, Inc., 2000.
[23] Xilinx, Inc. Serial Distributed Arithmetic FIR Filter, Xilinx, Inc., 1998.
`
`
`
`
`CHAPTER 28
`
`AUTOMATIC TARGET RECOGNITION SYSTEMS
`ON RECONFIGURABLE DEVICES
`
`Young H. Cho
`Open Acceleration Systems Research
`
`An Automatic Target Recognition (ATR) system analyzes a digital image or video
`sequence to locate and identify all objects of a certain class. There are several
`ways to implement ATR systems, and the right one is dependent, in large part,
`on the operating environment and the signal source. In this chapter we focus
`on the implementations of reconfigurable ATR designs based on the algorithms
`from Sandia National Laboratories (SNL) for the U.S. Department of Defense
`Joint STARS airborne radar imaging platform. STARS is similar to an aircraft
`AWACS system, but detects ground targets.
ATR in Synthetic Aperture Radar (SAR) imagery requires tremendous processing throughput. In this application, data come from high-bandwidth sensors, and the processing is time critical. On the other hand, there is limited space and power for processing the data in the sensor platforms. One way to meet the high computational requirement is to build custom circuits as an ASIC. However, very high nonrecurring engineering (NRE) costs for low-volume ASICs, and often evolving algorithms, limit the feasibility of using custom hardware. Therefore, reconfigurable devices can play a prominent role in meeting the challenges with greater flexibility and lower costs.
This chapter is organized as follows: Section 28.1 describes a highly parallelizable Automatic Target Recognition (ATR) algorithm. The system based on it is implemented using a mix of software and hardware processing, where the most computationally demanding tasks are accelerated using field-programmable gate arrays (FPGAs). We present two high-performance implementations that exercise the FPGA's benefits. Section 28.2 describes the system that automatically builds algorithm-specific and resource-efficient "hardwired" accelerators. It relies on the dynamic reconfiguration feature of FPGAs to obtain high performance using limited logic resources.
The system in Section 28.3 is based on an architecture that does not require frequent reconfiguration. The architecture is modular, easily scalable, and highly tuned for the ATR application. These application-specific processors are automatically generated based on application and environment parameters. In Section 28.4 we compare the implementations to discuss the benefits and the trade-offs of designing ATR systems using FPGAs. In Section 28.5, we draw our conclusions on FPGA-based ATR system design.
`
`
`28.1
`
`AUTOMATIC TARGET RECOGNITION ALGORITHMS
`
`Sandia real-time SAR ATR systems use a hierarchy of algorithms to reduce the
`processing demands for SAR images in order to yield a high probability of detec
`tion (PD) and a low false alarm rate (FAR).
`
28.1.1 Focus of Attention
As shown in Figure 28.1, the first step in the SNL algorithm is a Focus of Attention (FOA) algorithm that runs over a downsampled version of the entire image to find regions of interest that are of approximately the right size and brightness. These regions are then extracted and processed by an indexing stage to further reduce the datastream, which includes target hypotheses, orientation estimations, and target center locations. The surviving hypotheses have the full resolution data sent to an identification executive that schedules multiple identification algorithms and then fuses their results.
The FOA stage identifies interesting image areas called "chips." Then it composes a list of targets suspected to be in a chip. Having access to range and altitude information, the FOA algorithm also determines the elevation for the chip, without having to identify the target first. It then tasks the next stage with evaluating the likelihood that the suspected targets are actually in the given image chip and exactly where.
`
`28.1.2 Second-level Detection
`The next stage of the algorithm, called Second Level Detection (SLD), takes the
`extracted imagery (an image chip), matches it against a list of provided target
`
[Figure: the processing pipeline runs from synthetic aperture radar sensors through focus of attention and a second-level detection driver to a reporting module, which emits results such as "M-47 Tank, Angle: 355°, Elevation: 10 ft".]
`FIGURE 28.1 ■ The Sandia Automatic Target Recognition algorithm.
`
`
hypotheses, and returns the hit information for each image chip consisting of the best two orientation matches and other relevant information.
The system has a database of target models. For each target, and for each of its three different elevations, 72 templates are defined corresponding to its all-around views. The orientations of adjacent views are separated by 5 degrees.
SLD is a binary silhouette matcher that has a bright mask and a surround mask that are mutually exclusive. Each template is composed of several parameters along with a "bright mask" and a "surround mask," where the former defines the image pixels that should be bright for a match, and the latter defines the ones that should not. The bright and surround masks are 32 x 32 bitmaps, each with about 100 asserted bits. "Bright" is defined relative to a dynamic threshold.
On receiving tasks from the FOA, the SLD unit compares all of the stored templates for this target and elevation and the applicable orientations with the image chip, and computes the level of matching (the "hit quality"). The two hits with the highest quality are reported to the SLD driver as the most likely candidates to include targets. For each hit, the template index number, the exact position of the hit in the search area, and the hit quality are provided. After receiving this information, the SLD driver reports it to the ATR system.
The purpose of the first step in the SLD algorithm, called the shape sum, is to distinguish the target from its surrounding background. This consists of adaptively estimating the illumination for each position in the search area, assuming that the target is at that orientation and location. If the energy is too little or too much, no further processing for that position for that template match is required. Hence, for each mask position in the search area, a specific threshold value is computed as in equation 28.1.
`
`31 31
`
`SMx,y
`
`= L [.Bu,vMx+u,y+v
`u=0v=0
`
`THx y
`
`'
`
`SMx,y
`BC
`
`= --
`
`-Bias
`
`(28.1)
`
`(28.2)
`
The next step in the algorithm distinguishes the target from the background by thresholding each image pixel with respect to the threshold of the current mask position, as computed before. The same pixel may be above the threshold for some mask positions but below it for others. This threshold calculation determines the actual bright and surround pixels for each position. As shown in equation 28.2, it consists of dividing the shape sum by the number of pixels in the bright mask and subtracting a template-specific Bias constant.
`As shown in equation 28.3, the pixel values under the bright mask that are
`greater than or equal to the threshold are counted; if this count exceeds the
`minimal bright sum, the processing continues. On the other hand, the pixel
`
`
`values under the surround mask that are less than the threshold are counted to
`calculate the surround sum as shown in equation 28.4. If this count exceeds the
`minimal surround sum, it is declared a hit.
`
`31 31
`
`BSx,y = L E,Bu,v [Mx+u,y+v 2'.'. THx,y]
`u=0v=0
`SSx,y = L E,Su,v [Mx+u,y+v < THx,y]
`u=0v=0
`
`31 31
`
`(28.3)
`
`(28.4)
`
`Once the position of the hit is determined, we can calculate its quality by
`taking the average of bright and surround pixels that were correct, as shown in
`equation 28.5. This quality value is sent back to the driver with the position to
`determine the two best targets.
`
Q_{x,y} = \frac{1}{2} \left( \frac{BS_{x,y}}{BC} + \frac{SS_{x,y}}{SC} \right)    (28.5)
`
`28.2 DYNAMICALLY RECONFIGURABLE DESIGNS
`
`FPGAs can be reconfigured to perform multiple functions with the same logic
`resources by providing a number of corresponding configuration bit files. This
`ability allows us to develop dynamically reconfigurable designs. In this section,
`we present an ATR system implementation of UCLA's Mojave project that uses
`an FPGA's dynamic reconfigurability.
`
28.2.1 Algorithm Modifications
As described previously, the current Sandia system uses 64 x 64 pixel chips and 32 x 32 pixel templates. However, the Mojave system uses chip sizes of 128 x 128 pixels and template sizes of 8 x 8 pixels. It uses different chip and template sizes in order to map into existing FPGA devices that are relatively small. A single template moves through a single chip to yield 14,641 (121 x 121) image correlation results. Assuming that each output can be represented with 6 bits, 87,846 bits are produced by the system.
There is also a divide step in the Sandia algorithm that follows the shape sum operation and guides the selection of the threshold bin for the chip. This system does not implement the divide, mainly because it is expensive relative to available FPGA resources for the design platform.
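The result counts above follow directly from the chip and template dimensions; a quick check (variable names are ours):

```python
# Valid offsets when an 8x8 template slides over a 128x128 chip, and the
# resulting output volume at 6 bits per correlation result.
chip_size, template_size, bits_per_result = 128, 8, 6
positions_per_axis = chip_size - template_size + 1   # 121
total_positions = positions_per_axis ** 2            # 121 x 121
total_bits = total_positions * bits_per_result
```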
`
28.2.2 Image Correlation Circuit
FPGAs offer an extremely attractive solution to the correlation problem. First of all, the operations being performed occur directly at the bit level and are dominated by shifts and adds, making them easy to map into the hardware provided by the FPGA. This contrasts, for example, with multiply-intensive algorithms
`
`
that would make relatively poor utilization of FPGA resources. More important, the sparse nature of the templates can be utilized to achieve a far more efficient implementation in the FPGA than could be realized in a general-purpose correlation device. This can be illustrated using the example of the simple template shown in Figure 28.2.
In the example template shown in the figure, only 5 of the 20 pixels are asserted. At any given relative offset between the template and the chip, the correlation output is the sum of the 5 binary pixels in the chip that match the asserted bits in the template. The template can therefore be implemented in the FPGA as a simple multiple-port adder. The chip pixel values can be stored in flip-flops and are shifted to the right by one flip-flop with each clock cycle. Though correlation of a large image with a small mask is often understood conceptually in terms of the mask being scanned across the image, in this case the opposite is occurring: the template is hardwired into the FPGA while the image pixels are clocked past it.
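In software terms, the hardwired adder tree amounts to summing only the template's asserted taps at each offset. A rough behavioral model (ours, not the Mojave code):

```python
# Software analogue of the hardwired adder tree: for a sparse binary
# template, the correlation at each offset is the sum of the few image
# bits under the template's asserted pixels (one adder input per tap).
def correlate(image, template):
    """image, template: 2D lists of 0/1 values.
    Returns the correlation result for every valid offset."""
    taps = [(u, v) for u, row in enumerate(template)
                   for v, bit in enumerate(row) if bit]
    h = len(image) - len(template) + 1
    w = len(image[0]) - len(template[0]) + 1
    return [[sum(image[y + u][x + v] for u, v in taps)
             for x in range(w)] for y in range(h)]
```

A template with 5 asserted pixels thus needs only a 5-input adder, regardless of the template's bounding-box size.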
`Another important opportunity for increased efficiency lies in the potential to
`combine multiple templates on a single FPGA. The simplest way to do this is to
`spatially partition the FPGA into several smaller blocks, each of which handles
`the logic for a single template. Alternatively, we can try to identify templates
`that have some topological commonality and can therefore share parts of their
`adder trees. This is illustrated in Figure 28.3, which shows two templates sharing
`several pixels that can be mapped using a set of adder trees to leverage this
`overlap.
A potential advantage FPGAs have over ASICs is that they can be dynamically optimized at the gate level to exploit template characteristics. For our application, a programmable ASIC design would need to provide large general-purpose adder trees to handle the worst-case condition of summing all possible template bits, as shown in Figure 28.4. In contrast, an FPGA exploits the sparse nature of the templates and constructs only the small adder trees required. Additionally, FPGAs can optimize the design based on other application-specific characteristics.
`
`FIGURE 28.2 ■ An example template and a corresponding register chain with an adder tree.
`
`FIGURE 28.3 ■ Common hardware shared between two templates.
`
`FIGURE 28.4 ■ The ASIC version of the equivalent function.
`
28.2.3 Performance Analysis
Using a template-specific adder tree achieves significant reduction in routing complexity over a general correlation device, which must include logic to support arbitrary templates. The extent of this reduction is inversely proportional to the fraction of asserted pixels in the template. While this complexity reduction is important, alone it is not sufficient to lead to efficient implementations on FPGAs. The number of D-flip-flops required for storing the data points can cause inefficiencies in the design. Implementing these on the FPGA using the usual flip-flop-based shift registers is inefficient.
This problem can be resolved by collapsing the long strings of image pixels (those not being actively correlated against a template) into shift registers, which can be implemented very efficiently on some lookup table (LUT)-based FPGAs. For example, LUTs in the Xilinx XC4000 library can be used as shift registers that delay data by some predetermined number of clock cycles. Each 16 x 1-bit
`
`
LUT can implement an element that is effectively a 16-bit shift register in which the internal bits cannot be accessed. A flip-flop is also needed at the output of each RAM to act as a buffer and synchronizer. A single control circuit is used to control the stepping of the address lines and the timely assertion of the write-enable and output-enable signals for all RAM-based shift register elements. This is a small price to pay for the savings in configurable logic block (CLB) usage relative to a brute-force implementation using flip-flops.
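A behavioral sketch of one such element (a 16-deep LUT delay line plus its output flip-flop; the modeling choices are ours):

```python
# Toy model of a LUT-based shift register: internal bits are not
# accessible, and a flip-flop buffers/synchronizes the output, so a
# bit emerges depth + 1 clocks after it is shifted in.
class LUTShiftRegister:
    def __init__(self, depth=16):
        self.mem = [0] * depth   # LUT-RAM delay line (no internal taps)
        self.out_ff = 0          # output flip-flop (buffer/synchronizer)

    def clock(self, bit_in):
        self.out_ff = self.mem[-1]           # FF captures the oldest bit
        self.mem = [bit_in] + self.mem[:-1]  # shift the line by one
        return self.out_ff
```

In this model a depth-4 instance returns an input bit on the fifth clock, matching the depth + 1 latency of the RAM plus its output flip-flop.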
In contrast, the 256-pixel template images, like those shown in Figure 28.5, can be stored easily using flip-flop-based registers. This is because sufficient flip-flops are available to do this, and the adder tree structures do not consume them. Also, using standard flip-flop-based shift registers for image pixels in the template simplifies the mapping process by allowing every pixel to be accessed. New templates can be implemented by simply connecting the template pixels of concern to the inputs of the adder tree structures. This leads to significant simplification of automated template-mapping tools.
The resources used by the two components of target correlation, namely storage of active pixels on the FPGA and implementation of the adder trees corresponding to the templates, are independent of each other. The resources used by the pixel storage are determined by the template size and are independent of the number of templates being implemented. Adding templates involves adding new adder tree structures and hence increases the number of function generators being used. The total number of templates on an FPGA is bounded by the number of usable function generators.
The experimental results suggest that in practice we can expect to fit 6 to 10 surround templates having a higher number of overlapping pixels onto a 13,000-gate FPGA. However, intelligent grouping of compatible templates is important. Because the bright templates are less populated than the surround templates, we estimate that 15 to 20 of them can be mapped onto the same FPGA.
`
`■
`•• ••
`•••
`
`FIGURE 28.5 ■ Example of eight rotation templates of a SAR 16 x 16 bitmap image.
`
`
`28.2.4 Template Partitioning
`To min