FIGURE 27.15 ■ Data passed to the SPIHT coder to calculate a single block. (Left memory port: four coefficients packed into 16-bit fields of a 64-bit word. Right memory port: the four children's and parent's threshold data, plus threshold and sign data for the 16 children.)
FIGURE 27.16 ■ A depth-first search of the spatial orientation trees. (A spatial orientation tree; each coefficient word holds a sign bit and the coefficient magnitude; a stack holds the current coefficients, the child coefficients, and the child maximum magnitudes.)
The second block in the second level is now complete, and its maximum magnitude can now be calculated, shown as the dark gray block in the stack's highest level. In addition, the 16 child coefficients in the lowest level were saved and are available. There are no child values for the lowest level since there are no children.

Another benefit of scanning the image in a depth-first search order is that Morton Scan Ordering is naturally realized within each level, although it is intermixed between levels. By writing data from each level to a separate area of memory and later reading the data from the highest wavelet level to the lowest, the Morton Scan Ordering is naturally realized.
FIGURE 27.17 ■ A block diagram of the SPIHT maximum magnitude phase. (A read memory port feeds the magnitude calculation and the depth-first search state machine and control logic; the coefficient stack and maximum magnitude calculation feeds a unit that encodes maximum magnitudes and groups block data, which drives a memory buffer and address generator connected to two write memory ports.)
A block diagram of the maximum magnitude phase is provided in Figure 27.17. Since two pixels are read together and the image is scanned only once, the runtime of this phase is half a clock cycle per pixel. Because the maximum magnitude phase computes in less time than the wavelet phase, the throughput of the overall system is not affected.
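To make the traversal concrete, the short Python sketch below walks simplified spatial orientation trees depth first, recording each coefficient's magnitude and the maximum magnitude of its children and collecting one output list per wavelet level. The array layout, the Morton-ordered root scan, and the omission of SPIHT's special-case handling of the LL band are assumptions made only for this sketch; in the hardware this job is done by the depth-first search state machine of Figure 27.17.

```python
import numpy as np

def morton(r):
    """Yield (x, y) for an r x r grid in Morton (Z) order; r is a power of 2."""
    for i in range(r * r):
        x = y = 0
        for b in range(r.bit_length()):
            x |= ((i >> (2 * b)) & 1) << b
            y |= ((i >> (2 * b + 1)) & 1) << b
        yield x, y

def magnitude_scan(coeffs, levels):
    """Depth-first scan of simplified spatial orientation trees.

    Assumptions for this sketch: `coeffs` is a square array with the
    coarsest subbands in the top-left corner, every node outside the
    finest level has the four children (2x, 2y), (2x+1, 2y), (2x, 2y+1),
    (2x+1, 2y+1), and the LL-band special cases of full SPIHT are ignored.
    Returns one list per level of (position, magnitude, max child
    magnitude) entries; because the scan is depth first with
    Morton-ordered roots, each list comes out in Morton order within
    its subband.
    """
    n = coeffs.shape[0]
    r = n >> levels                          # side of the coarsest subbands
    per_level = [[] for _ in range(levels)]

    def walk(x, y, level):
        mag = int(abs(coeffs[y, x]))
        child_max = 0
        if level + 1 < levels:               # the finest level has no children
            for dy in (0, 1):
                for dx in (0, 1):
                    child_max = max(child_max,
                                    walk(2 * x + dx, 2 * y + dy, level + 1))
        per_level[level].append(((x, y), mag, child_max))
        return max(mag, child_max)

    # The tree roots live in the three coarsest detail subbands (HL, LH, HH).
    for oy, ox in ((0, r), (r, 0), (r, r)):
        for x, y in morton(r):
            walk(ox + x, oy + y, 0)
    return per_level

# Example: a 64 x 64 coefficient array with 4 decomposition levels.
levels_out = magnitude_scan(np.random.randint(-512, 512, (64, 64)), 4)
print([len(lst) for lst in levels_out])      # [48, 192, 768, 3072]
```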
27.4.5 The SPIHT Coding Phase

The final SPIHT coding phase performs the Fixed Order SPIHT encoding in parallel, based on the data from the maximum magnitude phase. Coefficient blocks are read from the highest wavelet level to the lowest. As information is loaded from memory, it is shifted from the variable fixed-point representation to a common fixed-point representation shared by every wavelet level. Once each block has been adjusted to an identical numerical representation, the parallel version of SPIHT is used to calculate what information each block will contribute to each bit plane.
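The adjustment to a common representation amounts to a level-dependent shift. The per-level fraction-bit allocations below are assumptions chosen for illustration, not the allocations used in the design.

```python
# Illustrative variable fixed-point formats: fraction bits per wavelet level.
# These allocations are assumptions for the sketch, not the design's actual ones.
FRACTION_BITS = {0: 6, 1: 4, 2: 2, 3: 0}
COMMON_FRACTION_BITS = 6          # fraction bits of the shared representation

def to_common_fixed_point(raw, level):
    """Shift a coefficient from its level's format into the common format."""
    shift = COMMON_FRACTION_BITS - FRACTION_BITS[level]
    return raw << shift if shift >= 0 else raw >> -shift

print(to_common_fixed_point(0b101, 2))   # 0b101 << 4 == 80
```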
The information is grouped and counted before being added to three separate variable FIFOs for each bit plane. The data that the variable FIFO components receive range in size from 0 to 37 bits, and the variable FIFOs arrange the block data into regular-sized 32-bit words for memory access. Care is also taken to stall the algorithm if any of the variable FIFOs becomes too full.
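The following toy model sketches the job a variable FIFO performs: accept items of 0 to 37 bits and repack them into 32-bit words, with a simple fullness check standing in for the stall signal. The class name, interface, and fullness limit are invented for the sketch.

```python
class VariableFIFO:
    """Toy model of a variable FIFO: it accepts items of 0 to 37 bits and
    repacks them into fixed 32-bit words for the memory port."""

    WORD_BITS = 32

    def __init__(self):
        self._bits = 0          # packed bits, newest item in the low positions
        self._count = 0         # number of valid buffered bits
        self.words = []         # completed 32-bit words, oldest first

    def push(self, value, width):
        assert 0 <= width <= 37 and value >> width == 0   # item fits in `width` bits
        self._bits = (self._bits << width) | value
        self._count += width
        while self._count >= self.WORD_BITS:
            self._count -= self.WORD_BITS
            self.words.append((self._bits >> self._count) & 0xFFFFFFFF)
        self._bits &= (1 << self._count) - 1

    def nearly_full(self, limit=8):
        """The coder stalls when a FIFO gets too full; here that is just a
        check on how many finished words are waiting."""
        return len(self.words) > limit

fifo = VariableFIFO()
fifo.push(0b101, 3)
fifo.push((1 << 37) - 1, 37)          # a full-width, 37-bit item
print([hex(w) for w in fifo.words])   # ['0xbfffffff'], with 8 bits still buffered
```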
Data from each buffer is output to a fixed location in memory, and the number of bits in each bitstream is output as well. Given that data is added dynamically to each bitstream, there needs to be a dynamic scheduler to select which buffer should be written to memory.
Since there are a large number of FIFOs that all require a BlockRAM, the FIFOs are spread across the FPGA, and some type of staging is required to prevent a signal from traveling too far. The scheduler selects which FIFO to read based on both how full a FIFO is and when it was last accessed.
Our studies showed that the LSP bitstream is roughly the same size as the LIP and LIS streams combined. Because of this, the LSP bitstreams transfer more data to memory than the other two lists. In our design the LIP and LIS bitstreams share a memory port, while the LSP stream writes to a separate memory port. Since a 2 x 2 block of coefficients is processed every clock cycle, the design takes one-quarter of a clock cycle per pixel, which is far less than the three-quarters of a clock cycle per pixel for the DWT. The block diagram for the SPIHT coding phase is given in Figure 27.18.
With 22 total bit planes to calculate, the design involves 66 individual data-grouping and variable FIFO blocks. Although none consumes a significant amount of FPGA resources individually, 66 blocks together do. The entire design required 160 percent of the resources in a Virtex 2000E and would not fit in the target system. However, by removing the lower bit planes, fewer FPGA resources are needed, and the architecture can easily be adjusted to fit the FPGA being used. Depending on the size of the final bitstream required, the FPGA size used in the SPIHT phase can be varied to handle the number of intermediate bitstreams generated.

Removing lower bit planes is possible since the final bitstream transmits data from the highest bit plane to the lowest. In our design the lower 9 bit planes were eliminated.
FIGURE 27.18 ■ A block diagram of the SPIHT coding phase. (Two read memory ports feed an address generator and control logic, which selects and reads FIFOs and shifts the data; parallel units calculate bit plane 21 down to bit plane 0, each producing LIP, LIS, and LSP data; a dynamic FIFO scheduler and the LIP/LIS and LSP address generators drive two write memory ports.)
Yet, without these lower planes, bitrates of up to 6 bpp can still be achieved. We found the constraint to be acceptable because we are interested in high compression ratios using low bitrates, and 6 bpp is practically a lossless signal. Since SPIHT is optimized for lower bitrates, the ability to calculate higher bitrates was not considered necessary. Alternatively, the use of a larger FPGA would alleviate the size constraint.
27.5 DESIGN RESULTS

The system was designed using VHDL, with models provided by Annapolis Micro Systems to access the PCI bus and memory ports. Simulations for debugging purposes were carried out with ModelSim EE 5.4e from Mentor Graphics. Synplify 6.2 from Synplicity was used to compile the VHDL code and generate a netlist. The Xilinx Foundation Series 3.1i tool set was used to place and route the design. Lastly, the peutil.exe utility from Annapolis Micro Systems generated the FPGA configuration streams.

Table 27.3 shows the speed and runtime specifications of the final architecture. All performance numbers are measured results from the actual hardware implementation. Each phase computes on separate memory blocks, which can operate at different clock rates. The design can process any square image where the dimensions are a power of 2: 16 x 16, 32 x 32, up to 1024 x 1024.
Since the WildStar board is connected to the host computer by a relatively slow PCI bus, the throughput of the entire system we built is constrained by the throughput of the PCI bus. However, since the study is on how image compression routines could be implemented on a satellite, such a system would be designed differently and would not contain a reconfigurable board connected to a host platform through a PCI bus. Instead, the image compression routines would be inserted directly into the data path, and the data transfer times would not be the bottleneck of the system. For this reason we analyzed the throughput of just the SPIHT compression engine and how quickly the FPGAs can process the images.

The throughput of the system was constrained by the discrete wavelet transform at 100 MPixels/sec. One method to increase this rate is to compute more rows in parallel. If the available memory ports accessed 128 bits of data instead of the 64 bits with our WildStar board, the number of clock cycles per pixel could be reduced by half and the throughput could double.
TABLE 27.3 ■ Performance numbers

Phase       Clock cycles   Clock cycles per    Clock rate   Throughput        FPGA area (%)
            per pixel      512 x 512 image
Wavelet     3/4            182465              75 MHz       100 MPixels/sec   62
Magnitude   1/2            131132              73 MHz       146 MPixels/sec   34
SPIHT       1/4            65793               56 MHz       224 MPixels/sec   98
Assuming the original image consists of 8 bpp, images are processed at a rate of 800 Mbits/sec.
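These throughput figures follow directly from the clock rates and the clock cycles per pixel listed in Table 27.3, as the small check below shows.

```python
# Reproduce the throughput figures quoted in Table 27.3 and the text.
phases = {            # clock rate in MHz, clock cycles per pixel
    "Wavelet":   (75, 0.75),
    "Magnitude": (73, 0.50),
    "SPIHT":     (56, 0.25),
}

for name, (mhz, cycles_per_pixel) in phases.items():
    mpixels = mhz / cycles_per_pixel
    print(f"{name:9s}: {mpixels:.0f} MPixels/sec")
# Wavelet: 100, Magnitude: 146, SPIHT: 224 MPixels/sec

# The wavelet phase limits the pipeline, so at 8 bits per pixel:
print(100 * 8, "Mbits/sec")   # 800 Mbits/sec
```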
Overall, the architecture requires less than one clock cycle per pixel, which is fewer cycles than parallel versions of the DWT need. Parallel versions of the DWT used complex scheduling to compute multiple wavelet levels simultaneously, which left limited resources to process multiple rows at a time. Given more resources, though, they would obtain higher data rates than our architecture by processing multiple rows simultaneously. In the future, a DWT architecture other than the one we implemented could be selected for additional speed improvements.
We compared our results to the original software version of SPIHT provided on the SPIHT web site [15]. The comparison was made without arithmetic coding, since our hardware implementation does not perform any arithmetic coding on the final bitstream. Additionally, in our testing on sample NASA images, arithmetic coding added little to the overall compression rates and thus was dropped [11]. An IBM RS/6000 Model 270 workstation was used for the comparison, and we used a combination of standard image compression benchmark images and satellite images from NASA's web site. The software version of SPIHT compressed a 512 x 512 image in 1.101 seconds on average, not including disk access. The wavelet phase, which constrains the hardware implementation, computes in 2.48 milliseconds, yielding a speedup of 443 times for the SPIHT engine. In addition, by creating a parallel implementation of the wavelet phase, further improvements to the runtime of the SPIHT engine are possible.
While this is the speedup we would obtain if the data transfer times are not a factor, the design may also be used to speed up SPIHT on a general-purpose processor. On such a system the time to read and write data must be included as well. Our WildStar board is connected to the host processor over a PCI bus, which writes images in 13 milliseconds and reads the final datastream in 20.75 milliseconds. Even with the data transfer delay, the total speedup still yields an improvement of 31.4 times.
Both the magnitude and SPIHT phases yield higher throughputs than the wavelet phase, even though they operate at lower clock rates. The reason is that both of these phases need fewer clock cycles per pixel to compute an image. The magnitude phase takes half a clock cycle per pixel and the SPIHT phase requires just a quarter. The fact that the SPIHT phase computes in less than one clock cycle per pixel, let alone a quarter, is a striking result considering that the original SPIHT algorithm is highly sequential in nature and had to consider each pixel in an image multiple times per bit plane.
27.6 SUMMARY AND FUTURE WORK

In this chapter we demonstrated a viable image compression routine on a reconfigurable platform. We showed that, by analyzing the range of data processed by each section of the algorithm, it is advantageous to create optimized memory structures, as with our variable fixed-point work.
Doing so minimizes memory usage and yields efficient data transfers. Here, each bit transferred between memory and the processor board directly impacted the final results. In addition, our Fixed Order SPIHT modifications illustrate how, by making slight adjustments to an existing algorithm, it is possible to dramatically increase the performance of a custom hardware implementation and simultaneously yield essentially identical results. With Fixed Order SPIHT the throughput of the system increased by over an order of magnitude while still matching the original algorithm's PSNR curve.

This SPIHT work was part of a development effort funded by NASA.
References

[1] V. R. Algazi, R. R. Estes. Analysis-based coding of image transform and subband coefficients. Applications of Digital Image Processing XVIII, SPIE Proceedings 2564, 1995.
[2] Annapolis Microsystems. WildStar Reference Manual, Annapolis Microsystems, 2000.
[3] A. Benkrid, D. Crookes, K. Benkrid. Design and implementation of generic 2D biorthogonal discrete wavelet transform on an FPGA. IEEE Symposium on Field-Programmable Custom Computing Machines, April 2001.
[4] M. Carreau. Hubble Servicing Mission: Hubble is fitted with a new "eye." http://www.chron.com/content/interactive/space/missions/sts-103/hubble/archive/931207.html, December 7, 1993.
[5] C. M. Chakrabarti, M. Vishwanath. Efficient realization of the discrete and continuous wavelet transforms: From single chip implementations to mappings in SIMD array computers. IEEE Transactions on Signal Processing 43, March 1995.
[6] C. M. Chakrabarti, M. Vishwanath, R. M. Owens. Architectures for wavelet transforms: A survey. Journal of VLSI Signal Processing 14, 1996.
[7] T. Cormen, C. Leiserson, R. Rivest. Introduction to Algorithms, MIT Press, 1997.
[8] T. W. Fry. Hyper Spectral Image Compression on Reconfigurable Platforms, Master's thesis, University of Washington, Seattle, 2001.
[9] R. C. Gonzalez, R. E. Woods. Digital Image Processing, Addison-Wesley, 1993.
[10] A. Graps. An introduction to wavelets. IEEE Computational Science and Engineering 2(2), 1995.
[11] T. Owen, S. Hauck. Arithmetic Compression on SPIHT Encoded Images, Technical report UWEETR-2002-2007, Department of Electrical Engineering, University of Washington, Seattle, 2002.
[12] K. K. Parhi, T. Nishitani. VLSI architectures for discrete wavelet transforms. IEEE Transactions on VLSI Systems 1(2), 1993.
[13] J. Ritter, P. Molitor. A pipelined architecture for partitioned DWT based lossy image compression using FPGAs. ACM/SIGDA Ninth International Symposium on Field-Programmable Gate Arrays, February 2001.
[14] A. Said, W. A. Pearlman. A new fast and efficient image codec based on set partitioning in hierarchical trees. IEEE Transactions on Circuits and Systems for Video Technology 6, June 1996.
[15] A. Said, W. A. Pearlman. SPIHT image compression: Properties of the method. http://www.cipr.rpi.edu/research/SPIHT/spiht1.html.
[16] H. Sava, M. Fleury, A. C. Downton, A. Clark. Parallel pipeline implementations of wavelet transforms. IEEE Proceedings Part 1 (Vision, Image and Signal Processing) 144(6), 1997.
[17] J. M. Shapiro. Embedded image coding using zerotrees of wavelet coefficients. IEEE Transactions on Signal Processing 41(12), 1993.
[18] W. Sweldens. The Lifting Scheme: A new philosophy in biorthogonal wavelet constructions. Wavelet Applications in Signal and Image Processing 3, 1995.
[19] NASA. TERRA: The EOS flagship. The EOS Data and Information System (EOSDIS). http://terra.nasa.gov/Brochure/Sect_5-1.html.
[20] C. Valens. A really friendly guide to wavelets. http://perso.wanadoo.fr/polyvalens/clemens/wavelets/wavelets.html.
[21] M. Vishwanath, R. M. Owens, M. J. Irwin. VLSI architectures for the discrete wavelet transform. IEEE Transactions on Circuits and Systems, Part II, May 1995.
[22] Xilinx, Inc. The Programmable Logic Data Book, Xilinx, Inc., 2000.
[23] Xilinx, Inc. Serial Distributed Arithmetic FIR Filter, Xilinx, Inc., 1998.
CHAPTER 28

AUTOMATIC TARGET RECOGNITION SYSTEMS ON RECONFIGURABLE DEVICES

Young H. Cho
Open Acceleration Systems Research
An Automatic Target Recognition (ATR) system analyzes a digital image or video sequence to locate and identify all objects of a certain class. There are several ways to implement ATR systems, and the right one depends, in large part, on the operating environment and the signal source. In this chapter we focus on the implementations of reconfigurable ATR designs based on the algorithms from Sandia National Laboratories (SNL) for the U.S. Department of Defense Joint STARS airborne radar imaging platform. STARS is similar to an aircraft AWACS system, but it detects ground targets.
ATR in Synthetic Aperture Radar (SAR) imagery requires tremendous processing throughput. In this application, data come from high-bandwidth sensors, and the processing is time critical. On the other hand, there is limited space and power for processing the data in the sensor platforms. One way to meet the high computational requirement is to build custom circuits as an ASIC. However, the very high nonrecurring engineering (NRE) costs of low-volume ASICs, and often-evolving algorithms, limit the feasibility of using custom hardware. Therefore, reconfigurable devices can play a prominent role in meeting these challenges with greater flexibility and lower costs.
This chapter is organized as follows: Section 28.1 describes a highly parallelizable Automatic Target Recognition (ATR) algorithm. The system based on it is implemented using a mix of software and hardware processing, where the most computationally demanding tasks are accelerated using field-programmable gate arrays (FPGAs). We present two high-performance implementations that exercise the FPGA's benefits. Section 28.2 describes the system that automatically builds algorithm-specific and resource-efficient "hardwired" accelerators. It relies on the dynamic reconfiguration feature of FPGAs to obtain high performance using limited logic resources.

The system in Section 28.3 is based on an architecture that does not require frequent reconfiguration. The architecture is modular, easily scalable, and highly tuned for the ATR application. These application-specific processors are automatically generated based on application and environment parameters. In Section 28.4 we compare the implementations to discuss the benefits and the trade-offs of designing ATR systems using FPGAs. In Section 28.5 we draw our conclusions on FPGA-based ATR system design.
28.1 AUTOMATIC TARGET RECOGNITION ALGORITHMS

Sandia real-time SAR ATR systems use a hierarchy of algorithms to reduce the processing demands for SAR images in order to yield a high probability of detection (PD) and a low false alarm rate (FAR).

28.1.1 Focus of Attention

As shown in Figure 28.1, the first step in the SNL algorithm is a Focus of Attention (FOA) algorithm that runs over a downsampled version of the entire image to find regions of interest that are of approximately the right size and brightness. These regions are then extracted and processed by an indexing stage to further reduce the datastream, which includes target hypotheses, orientation estimations, and target center locations. For the surviving hypotheses, the full-resolution data are sent to an identification executive that schedules multiple identification algorithms and then fuses their results.

The FOA stage identifies interesting image areas called "chips." Then it composes a list of targets suspected to be in a chip. Having access to range and altitude information, the FOA algorithm also determines the elevation for the chip, without having to identify the target first. It then tasks the next stage with evaluating the likelihood that the suspected targets are actually in the given image chip, and exactly where.
28.1.2 Second-level Detection

The next stage of the algorithm, called Second Level Detection (SLD), takes the extracted imagery (an image chip), matches it against a list of provided target hypotheses, and returns the hit information for each image chip, consisting of the best two orientation matches and other relevant information.
FIGURE 28.1 ■ The Sandia Automatic Target Recognition algorithm. (Synthetic aperture radar sensors feed a focus-of-attention stage, then a second-level detection driver, then a reporting module; the example report reads "M-47 Tank, Angle: 355°, Elevation: 10 ft.")
The system has a database of target models. For each target, and for each of its three different elevations, 72 templates are defined corresponding to its all-around views. The orientations of adjacent views are separated by 5 degrees. SLD is a binary silhouette matcher that has a bright mask and a surround mask that are mutually exclusive. Each template is composed of several parameters along with a "bright mask" and a "surround mask," where the former defines the image pixels that should be bright for a match, and the latter defines the ones that should not. The bright and surround masks are 32 x 32 bitmaps, each with about 100 asserted bits. "Bright" is defined relative to a dynamic threshold.
On receiving tasks from the FOA, the SLD unit compares all of the stored templates for this target and elevation, over the applicable orientations, with the image chip and computes the level of matching (the "hit quality"). The two hits with the highest quality are reported to the SLD driver as the most likely candidates to include targets. For each hit, the template index number, the exact position of the hit in the search area, and the hit quality are provided. After receiving this information, the SLD driver reports it to the ATR system.
The purpose of the first step in the SLD algorithm, called the shape sum, is to distinguish the target from its surrounding background. This consists of adaptively estimating the illumination for each position in the search area, assuming that the target is at that orientation and location. If the energy is too little or too much, no further processing of that position for that template match is required. Hence, for each mask position in the search area, a specific threshold value is computed as in equation 28.1.
SM_{x,y} = \sum_{u=0}^{31} \sum_{v=0}^{31} B_{u,v} \, M_{x+u,\,y+v}                      (28.1)

TH_{x,y} = \frac{SM_{x,y}}{BC} - \mathit{Bias}                                            (28.2)
The next step in the algorithm distinguishes the target from the background by thresholding each image pixel with respect to the threshold of the current mask position, as computed before. The same pixel may be above the threshold for some mask positions but below it for others. This threshold calculation determines the actual bright and surround pixels for each position. As shown in equation 28.2, it consists of dividing the shape sum by the number of pixels in the bright mask and subtracting a template-specific Bias constant.

As shown in equation 28.3, the pixel values under the bright mask that are greater than or equal to the threshold are counted; if this count exceeds the minimal bright sum, the processing continues. On the other hand, the pixel
values under the surround mask that are less than the threshold are counted to calculate the surround sum, as shown in equation 28.4. If this count exceeds the minimal surround sum, it is declared a hit.
BS_{x,y} = \sum_{u=0}^{31} \sum_{v=0}^{31} B_{u,v} \, [\, M_{x+u,\,y+v} \ge TH_{x,y} \,]   (28.3)

SS_{x,y} = \sum_{u=0}^{31} \sum_{v=0}^{31} S_{u,v} \, [\, M_{x+u,\,y+v} < TH_{x,y} \,]     (28.4)
Once the position of the hit is determined, we can calculate its quality by taking the average of the fractions of bright and surround pixels that were correct, as shown in equation 28.5. This quality value is sent back to the driver with the position to determine the two best targets.
Q_{x,y} = \frac{1}{2} \left( \frac{BS_{x,y}}{BC} + \frac{SS_{x,y}}{SC} \right)             (28.5)
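A software sketch of equations 28.1 through 28.5 for a single mask position is shown below. The function and parameter names (the chip array, the Boolean bright and surround masks, the Bias constant, and the minimal bright and surround sums) are illustrative choices, and the initial test that the shape sum falls within an acceptable energy range is omitted because its limits are not given here.

```python
import numpy as np

def sld_match(chip, bright, surround, bias, min_bright, min_surround, x, y):
    """Evaluate one mask position (x, y) following equations 28.1-28.5.

    chip             : 2-D array of image pixels (the image chip)
    bright, surround : 32 x 32 Boolean masks (mutually exclusive)
    bias             : template-specific Bias constant
    min_bright, min_surround : minimal bright / surround sums for a hit
    Returns the hit quality Q, or None if the position is rejected.
    """
    window = chip[y:y + 32, x:x + 32]
    bc, sc = bright.sum(), surround.sum()

    sm = window[bright].sum()                # shape sum, eq. 28.1
    th = sm / bc - bias                      # threshold, eq. 28.2

    bs = int((window[bright] >= th).sum())   # bright count, eq. 28.3
    if bs < min_bright:
        return None
    ss = int((window[surround] < th).sum())  # surround count, eq. 28.4
    if ss < min_surround:
        return None

    return 0.5 * (bs / bc + ss / sc)         # hit quality, eq. 28.5
```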
28.2 DYNAMICALLY RECONFIGURABLE DESIGNS

FPGAs can be reconfigured to perform multiple functions with the same logic resources by providing a number of corresponding configuration bit files. This ability allows us to develop dynamically reconfigurable designs. In this section, we present an ATR system implementation from UCLA's Mojave project that uses an FPGA's dynamic reconfigurability.
28.2.1 Algorithm Modifications

As described previously, the current Sandia system uses 64 x 64 pixel chips and 32 x 32 pixel templates. However, the Mojave system uses chip sizes of 128 x 128 pixels and template sizes of 8 x 8 pixels. It uses different chip and template sizes in order to map into existing FPGA devices that are relatively small. A single template moves through a single chip to yield 14,641 (121 x 121) image correlation results. Assuming that each output can be represented with 6 bits, 87,846 bits are produced by the system.
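The output-size figures follow directly from the chip and template dimensions:

```python
chip_size, template_size, bits_per_result = 128, 8, 6
positions = (chip_size - template_size + 1) ** 2     # 121 x 121 offsets
print(positions, positions * bits_per_result)        # 14641 87846
```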
There is also a divide step in the Sandia algorithm that follows the shape sum operation and guides the selection of the threshold bin for the chip. This system does not implement the divide, mainly because it is expensive relative to the FPGA resources available on the design platform.
28.2.2 Image Correlation Circuit

FPGAs offer an extremely attractive solution to the correlation problem. First of all, the operations being performed occur directly at the bit level and are dominated by shifts and adds, making them easy to map into the hardware provided by the FPGA. This contrasts, for example, with multiply-intensive algorithms
that would make relatively poor utilization of FPGA resources. More important, the sparse nature of the templates can be exploited to achieve a far more efficient implementation in the FPGA than could be realized in a general-purpose correlation device. This can be illustrated using the example of the simple template shown in Figure 28.2.
In the example template shown in the figure, only 5 of the 20 pixels are asserted. At any given relative offset between the template and the chip, the correlation output is the sum of the 5 binary pixels in the chip that match the asserted bits in the template. The template can therefore be implemented in the FPGA as a simple multiple-port adder. The chip pixel values can be stored in flip-flops and are shifted to the right by one flip-flop with each clock cycle. Though correlation of a large image with a small mask is often understood conceptually in terms of the mask being scanned across the image, in this case the opposite occurs: the template is hardwired into the FPGA while the image pixels are clocked past it.
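In software the same idea reduces to summing the chip pixels at the few asserted template offsets, one addition per asserted pixel, which is what the hardwired adder tree computes every clock cycle. The sketch below is only an illustration of that structure; the template and chip contents are random stand-ins.

```python
import numpy as np

def sparse_correlate(chip, template):
    """Correlate a binary chip with a sparse binary template.

    The FPGA hardwires the asserted template positions into an adder tree
    and clocks the chip pixels past it; here the same idea reduces to
    summing the chip pixels at the few asserted offsets.
    """
    taps = list(zip(*np.nonzero(template)))      # asserted template pixels
    th, tw = template.shape
    out_h = chip.shape[0] - th + 1
    out_w = chip.shape[1] - tw + 1
    result = np.zeros((out_h, out_w), dtype=np.int32)
    for dy, dx in taps:                          # one "adder input" per tap
        result += chip[dy:dy + out_h, dx:dx + out_w]
    return result

# Example: the 5-of-20 template of Figure 28.2 would need only a five-input
# adder; here a random sparse 8 x 8 template plays that role.
rng = np.random.default_rng(0)
chip = rng.integers(0, 2, size=(128, 128), dtype=np.int32)
template = np.zeros((8, 8), dtype=np.int32)
template[rng.integers(0, 8, 5), rng.integers(0, 8, 5)] = 1
print(sparse_correlate(chip, template).shape)    # (121, 121)
```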
Another important opportunity for increased efficiency lies in the potential to combine multiple templates on a single FPGA. The simplest way to do this is to spatially partition the FPGA into several smaller blocks, each of which handles the logic for a single template. Alternatively, we can try to identify templates that have some topological commonality and can therefore share parts of their adder trees. This is illustrated in Figure 28.3, which shows two templates sharing several pixels that can be mapped using a set of adder trees to leverage this overlap.
A potential advantage FPGAs have over ASICs is that they can be dynamically optimized at the gate level to exploit template characteristics. For our application, a programmable ASIC design would need to provide large general-purpose adder trees to handle the worst-case condition of summing all possible template bits, as shown in Figure 28.4. In contrast, an FPGA exploits the sparse nature of the templates and constructs only the small adder trees required. Additionally, FPGAs can optimize the design based on other application-specific characteristics.
FIGURE 28.2 ■ An example template and a corresponding register chain with an adder tree. (A template with 5 of its 20 pixels asserted, the chain of image registers it is correlated against, and the adder tree that produces Result A.)
FIGURE 28.3 ■ Common hardware shared between two templates. (Templates A and B feed partially shared adder trees that produce Result A and Result B.)
FIGURE 28.4 ■ The ASIC version of the equivalent function. (Registers for the image chip and templates feed AND gates used to perform the dot product, followed by a large adder.)
28.2.3 Performance Analysis

Using a template-specific adder tree achieves a significant reduction in routing complexity over a general correlation device, which must include logic to support arbitrary templates. The extent of this reduction is inversely proportional to the fraction of asserted pixels in the template. While this complexity reduction is important, it alone is not sufficient to lead to efficient implementations on FPGAs. The number of D flip-flops required for storing the data points can cause inefficiencies in the design. Implementing these on the FPGA using the usual flip-flop-based shift registers is inefficient.
This problem can be resolved by collapsing the long strings of image pixels, those not being actively correlated against a template, into shift registers, which can be implemented very efficiently on some lookup table (LUT)-based FPGAs. For example, LUTs in the Xilinx XC4000 library can be used as shift registers that delay data by some predetermined number of clock cycles. Each 16 x 1-bit LUT can implement an element that is effectively a 16-bit shift register in which the internal bits cannot be accessed. A flip-flop is also needed at the output of each RAM to act as a buffer and synchronizer. A single control circuit is used to control the stepping of the address lines and the timely assertion of the write-enable and output-enable signals for all RAM-based shift register elements. This is a small price to pay for the savings in configurable logic block (CLB) usage relative to a brute-force implementation using flip-flops.
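A toy model of such a RAM-based shift register is sketched below: a fixed-length delay line whose internal bits cannot be examined, with a buffering flip-flop at its output. The delay length used in the example (a 128-pixel chip row minus a 16-pixel active window) is an illustrative assumption.

```python
from collections import deque

class RamShiftRegister:
    """Toy model of a LUT/RAM-based shift register: a fixed-length delay
    whose internal bits cannot be examined, only the value falling out."""

    def __init__(self, delay):
        self._ram = deque([0] * delay, maxlen=delay)
        self._out = 0                  # the flip-flop that buffers the output

    def shift(self, bit_in):
        self._out = self._ram[0]       # oldest entry appears at the output
        self._ram.append(bit_in)       # newest entry enters; oldest is dropped
        return self._out

# Pixels not under the active window are parked in such delay lines; e.g.,
# for a 128-pixel-wide chip and a 16-pixel-wide window, each bypassed row
# needs a (128 - 16)-cycle delay before its pixels re-enter the template
# registers.
line = RamShiftRegister(128 - 16)
for t in range(130):
    out = line.shift(t)
print(out)   # the value inserted 112 shifts earlier: 129 - 112 = 17
```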
In contrast, the 256-pixel template images, like those shown in Figure 28.5, can be stored easily using flip-flop-based registers. This is because sufficient flip-flops are available to do this, and the adder tree structures do not consume them. Also, using standard flip-flop-based shift registers for image pixels in the template simplifies the mapping process by allowing every pixel to be accessed. New templates can be implemented by simply connecting the template pixels of concern to the inputs of the adder tree structures. This leads to significant simplification of the automated template-mapping tools.
The resources used by the two components of target correlation, namely storage of active pixels on the FPGA and implementation of the adder trees corresponding to the templates, are independent of each other. The resources used by pixel storage are determined by the template size and are independent of the number of templates being implemented. Adding templates involves adding new adder tree structures and hence increases the number of function generators being used. The total number of templates on an FPGA is bounded by the number of usable function generators.
The experimental results suggest that, in practice, we can expect to fit 6 to 10 surround templates having a higher number of overlapping pixels onto a 13,000-gate FPGA. However, intelligent grouping of compatible templates is important. Because the bright templates are less populated than the surround templates, we estimate that 15 to 20 of them can be mapped onto the same FPGA.
`
`■
`•• ••
`•••
`
`FIGURE 28.5 ■ Example of eight rotation templates of a SAR 16 x 16 bitmap image.
28.2.4 Template Partitioning

To min
