`
IEEE Symposium on FPGAs for Custom Computing Machines
`April 16-18, 1997
`Napa Valley, California
`
Edited by Kenneth L. Pocek and Jeffrey Arnold
`
`Sponsored by the IEEE Computer Society Technical Committee on Computer Architecture
`
`
`Petitioner Microsoft Corporation - Ex. 1011, Cover 1
`
`
`
`PROCEEDINGS
`
`The 5th Annual IEEE Symposium on
`Field-Programmable
`Custom Computing Machines
`
April 16-18, 1997
`
`Napa Valley, California
`
`Sponsored by
`IEEE Computer Society
`IEEE Computer Society Technical Committee on Computer Architecture
`
`
Los Alamitos, California • Washington • Brussels • Tokyo
`
`
`
`
Copyright © 1997 by The Institute of Electrical and Electronics Engineers, Inc.
`All rights reserved
`
Copyright and Reprint Permissions: Abstracting is permitted with credit to the source. Libraries may
photocopy beyond the limits of US copyright law, for private use of patrons, those articles in this volume that
carry a code at the bottom of the first page, provided that the per-copy fee indicated in the code is paid
through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923.
Other copying, reprint, or republication requests should be addressed to: IEEE Copyrights Manager, IEEE
Service Center, 445 Hoes Lane, P.O. Box 1331, Piscataway, NJ 08855-1331.
The papers in this book comprise the proceedings of the meeting mentioned on the cover and title page. They
reflect the authors' opinions and, in the interests of timely dissemination, are published as presented and
without change. Their inclusion in this publication does not necessarily constitute endorsement by the
editors, the IEEE Computer Society, or the Institute of Electrical and Electronics Engineers, Inc.
`
`IEEE Computer Society Order Number PRO8159
ISBN 0-8186-8159-4
ISBN 0-8186-8160-8 (case)
ISBN 0-8186-8161-6 (microfiche)
`IEEE Order Plan Catalog Number 97TB100186
`ISSN 1082-3409
`
Additional copies may be ordered from:

IEEE Computer Society
Customer Service Center
10662 Los Vaqueros Circle
P.O. Box 3014
Los Alamitos, CA 90720-1314
Tel: +1-714-821-8380
Fax: +1-714-821-4641
E-mail: cs.books@computer.org

IEEE Service Center
445 Hoes Lane
P.O. Box 1331
Piscataway, NJ 08855-1331
Tel: +1-908-981-1393
Fax: +1-908-981-9667
mis.custserv@computer.org

IEEE Computer Society
13, Avenue de l'Aquilon
B-1200 Brussels
BELGIUM
Tel: +32-2-770-2198
Fax: +32-2-770-8505
euro.ofc@computer.org

IEEE Computer Society
Ooshima Building
2-19-1 Minami-Aoyama
Minato-ku, Tokyo 107
JAPAN
Tel: +81-3-3408-3118
Fax: +81-3-3408-3553
tokyo.ofc@computer.org
`
Editorial production by Bob Werner
Cover art production by Joe Daigle/Studio Productions
`Printed in the United States of America by Technical Communication Services
`
`
`
`
`
`
`Table of Contents
`
`Symposium on Field-Programmable Custom Computing Machines — FCCM’97
`
Introduction
Program Committee
`
`Session 1: Device Architecture
`
An FPGA Architecture for DRAM-based Systolic Computations .......................... 2
N. Margolus

Garp: A MIPS Processor with a Reconfigurable Coprocessor .......................... 12
J. Hauser, J. Wawrzynek

A Time-Multiplexed FPGA .......................... 22
S. Trimberger, D. Carberry, A. Johnson, J. Wong

Session 2: Communication Applications

An FPGA-Based Coprocessor for ATM Firewalls .......................... 30
J. McHenry, P. Dowd, T. Carrozzi, F. Pellegrino, W. Cocks

A Wireless LAN Demodulator in a Pamette: Design and Experience .......................... 40
T. McDermott, P. Ryan, M. Shand, D. Skellern, T. Percival, N. Weste

Session 3: Run Time Reconfiguration

Incremental Reconfiguration for Pipelined Applications .......................... 47
H. Schmit

Compilation Tools for Run-Time Reconfigurable Designs .......................... 56
W. Luk, N. Shirazi, P. Cheung

A Dynamic Reconfiguration Run-Time System .......................... 66
J. Burns, A. Donlin, J. Hogg, S. Singh, M. de Wit

Session 4: Architectures for Run Time Reconfiguration

The Swappable Logic Unit: A Paradigm for Virtual Hardware .......................... 77
G. Brebner

The Chimaera Reconfigurable Functional Unit .......................... 87
S. Hauck, T. Fry, M. Hosier, J. Kao

Session 5: Architecture

Computing Kernels Implemented with a Wormhole RTR CCM .......................... 98
R. Bittner Jr., P. Athanas

Mapping Applications to the RaPiD Configurable Architecture .......................... 106
C. Ebeling, D. Cronquist, P. Franklin, J. Secosky, S. Berg

Defect Tolerance on the Teramac Custom Computer .......................... 116
B. Culbertson, R. Amerson, R. Carter, P. Kuekes, G. Snider

Session 6: Performance

Systems Performance Measurement on PCI Pamette .......................... 125
L. Moll, M. Shand

The RAW Benchmark Suite: Computation Structures for General Purpose Computing .......................... 134
J. Babb, M. Frank, V. Lee, E. Waingold, R. Barua, M. Taylor, J. Kim, S. Devabhaktuni, A. Agarwal

Session 7: Software Tools

Automated Field-Programmable Compute Accelerator Design using Partial Evaluation .......................... 145
Q. Wang, D. Lewis

FPGA Synthesis on the XC6200 using IRIS and Trianus/Hades (Or, from Heaven to Hell and back again!) .......................... 155
R. Woods, S. Ludwig, J. Heron, D. Trainor, S. Gehring

High Level Compilation for Fine Grained FPGAs .......................... 165
M. Gokhale, E. Gomersall

Session 8: CAD Applications

Acceleration of an FPGA Router .......................... 175
P. Chan, M. Schlag

Fault Simulation on Reconfigurable Hardware .......................... 182
M. Abramovici, P. Menon

Session 9: Image Processing Applications

Automated Target Recognition on SPLASH 2 .......................... 192
M. Rencher, B. Hutchings

Real-Time Stereo Vision on the PARTS Reconfigurable Computer .......................... 201
J. Woodfill, B. Von Herzen

Increased FPGA Capacity Enables Scalable, Flexible CCMs: An Example from Image Processing .......................... 211
J. Greenbaum, M. Baxter

Session 10: Arithmetic Applications

Comparison of Arithmetic Architectures for Reed-Solomon Decoders in Reconfigurable Hardware .......................... 219
C. Paar, M. Rosner

Implementation of Single Precision Floating Point Square Root on FPGAs .......................... 226
Y. Li, W. Chu

Poster Papers

Datapath-Oriented FPGA Mapping and Placement for Configurable Computing .......................... 234
T. Callahan, J. Wawrzynek

Mapping a Real-Time Video Algorithm to a Context-Switched FPGA .......................... 236
S. Kelem

A Parallel Hardware Evolvable Computer POLYP .......................... 238
U. Tangen, L. Schulte, J. McCaskill

Laser Defect Correction Applications to FPGA Based Custom Computers .......................... 240
G. Chapman, B. Dufort

Speech Recognition HMM Training on Reconfigurable Parallel Processor .......................... 242
H. Yun, A. Smith, H. Silverman

Efficient Implementation of the DCT on Custom Computers .......................... 244
N. Bergmann, Y. Chung, B. Gunther

On Acceleration of the Check Tautology Logic Synthesis Algorithm using an FPGA-based Reconfigurable Coprocessor .......................... 246
J. Cong, J. Peck

Index of Authors .......................... 249
`
`
`|
`
`ins material may be protected by Copyright law (Title 17 U.S. Code)
`
`|
`
Automated Target Recognition on SPLASH 2†
`
`Michael Rencher and Brad L. Hutchings
`Department of Electrical and Computer Engineering
Brigham Young University
`Provo, UT 84602
`
`Abstract
`Automated target recognition is an application area
`that requires special-purpose hardware to achieve rea-
`sonable performance. FPGA-based platforms can pro-
`vide a high level of performance for ATR systems if
`the implementation can be adapted to the limited
FPGA and routing resources of these architectures.
`This paper discusses a mapping experiment where a
`linear-systolic implementation of an ATR algorithm is
`mapped to the SPLASH 2 platform. Simple column-
`oriented processors were used throughout the design
`to achieve high performance with limited nearest-
`neighbor communication. The distributed SPLASH 2
`memories are also exploited to achieve a high degree
`of parallelism. The resulting design is scalable and
`can be spread across multiple SPLASH 2 boards with
`a linear increase in performance.
`
`1
`
`Introduction
`Automated target recognition (ATR) is a compu-
`tationally demanding application area that typically
`requires special-purpose hardware to achieve desirable
`performance. ASICs are not an option for these sys-
`tems due to high non-recurring engineering (NRE)
`costs and because the algorithms are constantly evolv-
`ing. Existing FPGA-based computing platforms can
`potentially provide the necessary performance and
`flexibility for evolving ATR systems; however, map-
`ping applications to these existing platforms can be
`very challenging because they lack abundant intercon-
`nect and FPGA resources. The key to achieving a
`high-performance implementation of ATR algorithms
`with existing platforms is to carefully organize the de-
`sign of the ATR implementation so that it can commu-
nicate via the limited interconnect and can be easily
`partitioned among the FPGA devices.
`This paper presents a linear systolic implementa-
`tion of an existing ATR algorithm on SPLASH 2 that
`
†This work was supported by DARPA/CSTO under contract
number DABT63-94-C-0085 under a subcontract to National
Semiconductor.
`
is well-suited to the SPLASH 2 architecture. Inter-
FPGA communication is limited and easily accommo-
`dated by the SPLASH 2 interconnect. Moreover, the
`implementation can be scaled across any number of
`SPLASH 2 boards and achieves high performance with
`limited resources.
`This paper briefly discusses the entire ATR algo-
`rithm as developed by Sandia National Labs, and
`then overviews the design and implementation of the
`most computationally demanding part of the algo-
rithm: Chunky SLD. The SPLASH 2 implementation
`is presented in some detail with future directions and
`possible improvements.
`
`2 Automatic Target Recognition
The goal of a typical ATR system is to analyze a
`digital representation of a scene and locate/identify
`objects that are of interest. Although this goal is
`conceptually simple, ATR systems have extremely de-
`manding I/O and computational requirements: image
`data are large, can be generated in real-time, and must
`be processed quickly so that results remain relevant in
`a dynamic environment. The common use of special-
purpose hardware in nearly all high-performance ATR
systems is a clear indication of the computational com-
`plexity of these systems.
`This paper details the implementation of an exist-
`ing ATR algorithm on SPLASH 2. The algorithm in
`question was developed at Sandia National Labora-
`tories and was designed to detect partially obscured
`targets in Synthetic Aperture Radar (SAR) images.
`It is commonly referred to as Chunky SLD, so named
`for the second step of the algorithm that differenti-
ates this algorithm from others developed at Sandia.
`This algorithm consists of the following three steps:
`(1) Focus of Attention (FOA), (2) Second-Level De-
`tection (SLD), and (3) Final Identification (FI). Each
`of these steps will now be introduced so that the al-
gorithm implementation can be understood in its op-
erating context.
`
`0-8186-8159-4/97 $10.00 © 1997 IEEE
`
`192
`
`
`
`
`
`
Figure 1: ATR Block Diagram (indicating the section implemented on SPLASH 2).
`
2.1 Focus of Attention (FOA)

Focus of attention is the first step of the ATR process and uses image morphology techniques to detect potential targets in SAR data. FOA operates on "down-sampled" SAR images that are approximately 600-1000 pixels on a side. Once FOA detects a potential target, it determines the approximate center of the potential target and creates 2x down-sampled subimages of the original SAR data where each subimage contains a single target centered within the subimage. These subimages are referred to as chips and are 128 x 128 pixels.

2.2 Second Level Detection (SLD)

The SLD step processes the chips generated by the FOA step. SLD further restricts the areas of interest by giving the potential targets coordinates and angular orientation. SLD does this by correlating predefined binary templates to the areas of interest. The templates represent different object orientation angles. Templates are oriented between 5 and 10 degrees apart. SLD also uses adaptive threshold levels determined by the overall image intensity.

The algorithm studied in the paper is a variation of SLD called Chunky SLD. Chunky SLD adds a level of complexity to SLD by using more templates to represent objects that have been partially obscured (partially hidden by camouflage or objects overhead). This allows better target recognition at a cost of higher computational requirements. Chunky SLD is discussed in more detail later in this section.

2.3 Final Identification (FI)

The FI algorithm correlates full resolution image data and templates with finer angular resolution (3 to 5 degrees). FI also uses adaptive threshold levels. The output of FI is a location of the target, and a confidence level corresponding to the level of correlation between the object and the FI templates.

2.4 The Chunky SLD Algorithm

The general goal of the Chunky-SLD algorithm is to recognize targets that are partially concealed or obscured in some way. To achieve this goal, the designers of this algorithm treat the target as a set of 40 template pairs where each pair of templates is a digital representation of some salient feature of the specific target. If the majority of the template pairs strongly correlate with the image data, then a match of the overall target is assumed. Each pair of templates consists of a Bright template and a Surround template. The Bright template is a representation of expected reflections directly from surfaces of a salient target feature while the Surround template represents expected absorption in the immediate area surrounding the target feature. Each pair of a Bright and Surround template is referred to as a chunk, so called because each pair of templates represents a "chunk" of the overall target. Each set of 40 chunks represents a single target at a specific rotation. There are 72 orientations, each representing a different target orientation and radar incidence angle. Each set of 72 orientations is referred to as a class and is the complete set of templates that must be correlated with a chip to detect the presence of a specific target.

Figure 2: Chunky SLD

Figure 3: Template Organization (a class = 72 orientations; each orientation = 40 chunks; each chunk = a Bright/Surround template pair).

The first step of the Chunky SLD algorithm is to correlate the chip and the Bright template. This correlation value is used to compute a value that will be used to threshold the incoming chip data, converting the 8-bit chip image into a binary image. The equations describing this process are shown below.
`15
`Shapesum(a, y) = >> Bremp(a, 6) *Chip(x +a, y+6)
`a,6=0
`
`Threshold(«, y) =
`
`Shapesum(cz, y)
`Bright_template_pizel_count
The values obtained by correlating the Bright and
Surround templates with the binarized chip (Bsum and
Ssum) are checked against minimum values to generate
a "hit value" for each offset in the chip. The threshold
value is also checked to see if it falls in an acceptable
range when generating the hit values:
`
if ([Tmax > T ≥ Tmin] AND
    [Bsum > Bmin] AND
    [Ssum > Smin])
then
    Hit = 1;
else
    Hit = 0;                                (1)
`
`The hit values are accumulated for each offset for a
`specific orientation (40 chunks). The highest values
`are used to identify the areas of interest for the final
`identification step.
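To make the equations and the hit test above concrete, here is a small NumPy sketch. It is not the SPLASH 2 implementation: the parameter values, the reading of Ssum as Surround pixels landing on dark binarized pixels (Surround models expected absorption), and the exact strict/non-strict comparison directions are illustrative assumptions.

```python
import numpy as np

def hit_value(chip, b_temp, s_temp, t_min, t_max, b_min, s_min, x, y):
    """Hit computation for one 16x16 offset (x, y) of an 8-bit chip.

    b_temp / s_temp are 16x16 0/1 Bright / Surround templates.
    """
    win = chip[x:x + 16, y:y + 16].astype(np.int64)
    shapesum = int(np.sum(b_temp * win))               # Shapesum(x, y)
    threshold = shapesum // max(int(b_temp.sum()), 1)  # Shapesum / pixel count
    binary = (win >= threshold).astype(np.int64)       # binarized chip window
    b_sum = int(np.sum(b_temp * binary))               # Bright correlation
    s_sum = int(np.sum(s_temp * (1 - binary)))         # Surround: dark pixels
    in_range = t_min <= threshold < t_max              # Tmax > T >= Tmin
    return 1 if (in_range and b_sum > b_min and s_sum > s_min) else 0
```

One hit value is produced per offset; the accumulation over 40 chunks happens outside this function.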
`
`2.4.1 Template Characteristics
Template pairs exhibit useful properties: sparseness
and mutual exclusivity. The Bright template consists
mostly of zeros; only 3 to 10 percent of the template
values are '1's and this limits the magnitude of the
Shapesum and Bsum values. The Bright and Surround
templates are also mutually exclusive; that is, if the
two templates are overlaid no "on" pixels will overlap.
When carefully exploited, both of these properties lead
to more compact and higher performance hardware.
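Both properties are easy to state as checks on a template pair. The templates below are synthetic stand-ins (not Sandia data), with the Surround mask contrived so the two never overlap:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 16x16 Bright template at roughly 5% density
# (inside the 3-10% range quoted in the text).
bright = rng.random((16, 16)) < 0.05
# Contrived Surround mask guaranteed not to overlap Bright.
surround = np.roll(bright, 1, axis=0) & ~bright

assert not np.any(bright & surround)   # mutual exclusivity
density = bright.mean()                # sparseness: fraction of '1' pixels
```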
3 Other Implementations of Chunky-SLD

As explained, the ATR application is computationally demanding. There are (128 − 15) × (128 − 15) offsets per chunk × 40 chunks × 72 orientations ≈ 36 × 10^6 hit values to compute per targeted object (or per class, see Figure 3). The computational rate and I/O requirements of this algorithm make it impossible to use current microprocessors. Thus any high-performance implementation of this algorithm will require special-purpose hardware to meet performance goals.

However, custom ASICs are also not an option because the algorithm is constantly evolving and also because commercial-off-the-shelf (COTS) components are often dictated by the ultimate customers of ATR systems. The only remaining options are to construct the system with commercially available fixed-function devices such as correlaters, multipliers, etc., or to use programmable logic, e.g., FPGAs [1, 2]. Thus all known implementations of Chunky-SLD use either fixed-function devices or programmable logic.

3.1 Sandia

Current Sandia implementations of ATR are based on commercially available one-bit correlater chips. Sandia designers adapted the basic Chunky-SLD algorithm so they could exploit the capabilities of these components to achieve high performance. Rather than process the Shapesum and then process the final correlation, the two steps were done in parallel. The correlation was done at 8 discrete threshold levels and the Shapesum determined which threshold to use for each offset.

Figure 4: Sandia's Implementation.

Figure 5: Example Column Sum (pixels are loaded in column order: 1-2-3 down the first column, 4-5-6 down the next, and so on).

3.2 UCLA

A group at UCLA (headed by John Villasenor) [3] is working on an FPGA-based SLD implementation. By doing bit level correlations they are able to build very compact adder trees that take advantage of template sparseness and FPGA on-board lookup-table memory capability. Their approach compiles template information directly into the hardware and relies on fast reconfiguration to switch template information. They also try to take advantage of template overlap by computing the results of multiple correlations simultaneously.
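The workload figure quoted in this section follows directly from the stated dimensions:

```python
# Valid 16x16 template offsets in a 128x128 chip: (128 - 15) per axis.
offsets_per_chunk = (128 - 15) * (128 - 15)      # 12,769
chunks_per_orientation = 40
orientations_per_class = 72

hits_per_class = (offsets_per_chunk * chunks_per_orientation
                  * orientations_per_class)      # 36,774,720, i.e. ~36 x 10^6
```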
4 Chunky-SLD on Splash 2

On SPLASH 2 a static implementation was done to avoid the overhead of reconfiguring the hardware during execution. In order to reduce the hardware requirements without reducing throughput, a deeply pipelined design (~400 cycle latency) was implemented. The Shapesum unit generates the threshold value which is then used to generate the final correlation values. (Note: there is a unique threshold for each offset.) By doing this only two final correlations have to be computed per offset (one Bsum and one Ssum).

The Sandia implementation computes the Shapesum and final correlation in parallel, which forces them to compute multiple final correlations, while our implementation does them serially. This allows us to use an exact threshold value. Also, only one final correlation needs to be computed because the threshold value is computed before the final correlation begins. The technique used was to look at the correlations by column, compute the partial correlation for that column, and sum up the partial sums for all 16 columns. In this method 16 different column correlations are going on in parallel but only one column of data needs to be available for processing.

4.1 Implementing the Correlation as Column Sums

Figure 5 depicts a simple example that demonstrates a correlation of a 3x3 template with a binary image. Each row in the table represents one clock cycle. The first column is the clock cycle number. Corresponding numbers are found in the Pixel load order box at the right. A new pixel is brought in on each clock cycle. The P1, P2, and P3 columns represent the three column processing units needed for a three column template. The last column represents the actual output. Clock cycles 9 through 12 have been expanded to show how data (pixels and partial sums) are passed from column to column and illustrate the data format. Once the pipeline is full, a new correlation value is computed as each column arrives (three pixels/cycles).

Note that valid output comes every three cycles because the template is three rows tall. All processing elements are actively processing 3 pixel values at all times. The SPLASH 2 implementation works just like the example except for the size of the columns (16 pixels instead of three) and the data format (eight-bit instead of one-bit).

4.2 Platforms/Implementations

Chunky SLD was implemented on the SPLASH 2 board. SPLASH 2 has shown itself to be a useful platform and has had numerous applications mapped to it [4, 5, 6, 7, 8, 9, 10, 11]. The implementation was done in VHDL and simulated/synthesized using Synopsys. All place and route was done automatically using Xilinx place and route tools.

One of the goals of the implementation was to run the system so that it consumed a pixel per cycle. This means that each cycle all processing elements (PE) need to be able to process a new pixel. This implementation follows Sandia National Labs algorithms (not implementation) as closely as possible (see Section 2.4).

The SPLASH 2 board was developed by SRC (Supercomputing Research Center, Institute for Defense Analyses) [12]. The SPLASH 2 board is a linear systolic array of processing elements (FPGAs), each with their own memory.
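The column-sum scheme of Section 4.1 can be modelled in a few lines of Python. This is a behavioural sketch only (the hardware streams one pixel per clock through the column units); it simply shows that summing per-column partial correlations reproduces the full 2-D correlation:

```python
import numpy as np

def correlate_by_columns(image, template):
    """2-D correlation computed as a sum of per-column partial sums,
    one partial sum per template column (the P1..Pn units of Figure 5)."""
    th, tw = template.shape
    ih, iw = image.shape
    out = np.zeros((ih - th + 1, iw - tw + 1), dtype=int)
    for x in range(ih - th + 1):
        for y in range(iw - tw + 1):
            total = 0
            for c in range(tw):  # each column unit contributes one partial sum
                total += int(template[:, c] @ image[x:x + th, y + c])
            out[x, y] = total
    return out
```

The SPLASH 2 design does the same thing with 16-pixel columns and 8-bit data, keeping only one column of chip data in flight at a time.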
`
`4.2.1 Splash 2 Hardware
`
From a conceptual point of view, the SPLASH 2 system
`consists of a linear array of processing elements. This
`
`
`
`
makes SPLASH 2 a good candidate for linear-systolic applications with limited neighbor-to-neighbor interconnect. Because of limited routing resources SPLASH 2 has difficulty implementing multi-chip systems that are not linear systolic, though they are possible [8].

The actual SPLASH 2 platform consists of a board with 16 Xilinx 4010 chips (plus one for control) arranged in a linear systolic array. Each chip has a limited 36-bit connection to its two nearest neighbors. Each Xilinx 4010 is connected to a 512 kbyte memory (16-bit word size). The memory can handle back-to-back reads, or back-to-back writes, but requires one 'dead' (or turn-around) cycle when changing from write to read. There is also a crossbar connected to all of the chips that allows some level of random connection between chips. Up to 16 boards can be daisy-chained together to provide a large linear-systolic array of 256 elements.

Figure 6: Single Processing Element of SPLASH 2.

Figure 7: SPLASH 2 platform (processing elements connected to their left and right neighbors and to the crossbar).

4.3 ATR Implementation on Splash 2

Similar to the example, the SPLASH 2 implementation processes one pixel at a time and loads them in column order so that the partial sums can be generated and passed from column to column. All template data are stored in the memories adjacent to the FPGAs on the SPLASH 2 boards. Each memory can hold several thousand templates, thus making it possible to store all of the templates for a single class (5760) on a single SPLASH 2 board. There is sufficient room in the FPGA design to store a single template. The templates are switched by reading the new template data out of the memory and storing it within the FPGA. However, because this implementation is deeply pipelined, it is necessary to flush all current data from the system when switching to a new template. The overhead from the flushing operation is minimal (318 flush cycles per template, about 0.14% of the compute cycles).

During each clock cycle, a new pixel arrives at the FPGA. If the template bit corresponding to this pixel is on then the incoming pixel is added to the current partial sum. Each 16 clock cycles, this partial sum is then passed on to the next column and a new partial sum is received from the previous column. The last column computes a complete Shapesum every 16 cycles (one column). The final correlation of the Bright and Surround templates with the thresholded chip data works similarly except there are two correlations (one for each template).

Intermediate hit values are stored in a table, referred to as the hit-table, in one of the local memories. Each location in the table corresponds to an x-y offset of a chip, the origin of a single correlation. For each offset, if a chunk "hits", then the corresponding location in this table is incremented. Thus the table contains the accumulated hit values for all chunks and all offsets that have been computed to that point.

Hits are computed according to Equation 1. First, each Bsum and Ssum value is compared to its corresponding minimum value. Second, the threshold value corresponding to each Bsum and Ssum is checked to see if it is between a certain minimum and maximum value. For reasons of efficiency, the threshold value is actually examined earlier in the process and a zero for the threshold is stored in lookup-table memory if it is out of bounds. This works correctly because if the
`
`196
`
`Petitioner Microsoft Corporation - Ex. 1011, p. 196
`Petitioner Microsoft Corporation - Ex.-1011, p. 196
`
`
`
threshold is zero, it will cause the Bsum to be zero, which will in turn cause the Bsum comparison to fail. Otherwise, if all three of these tests come back true then a hit has been found for the corresponding offset (see Equation 1) and the corresponding location in the hit-table is incremented. After the 40 templates are tested against the same chip the two offsets with the highest accumulated hit values are written into memory where the host computer can read them. This is accomplished by examining the hit values during this process and retaining the top two values in special memory locations. These final two hit values (which represent the top two orientations for a specific class) are used in the FI step.

For the SPLASH 2 board, as with most FPGA systems, partitioning is a major issue. The design needed to be modular so that different design modules could be reassigned to different FPGAs as necessary. This is where the column modules were so valuable (see Figure 8).

4.4 Special Features

This implementation has several notable characteristics. They include control distribution, modular design for partitioning and memory utilization.

4.4.1 Distributed Control

The control in this system is distributed throughout the array. Each column module has its own state-machine-based control. Module synchronization is achieved by distributing a control token through the pipeline along with the data. When a module receives this signal, it resets its internal state machines and retrieves template data from its local memory. A memory controller resides in each processing element (FPGA) to retrieve template data and give memory access to all other modules.

4.4.2 Modular Design (Design for Partitioning)

Each column in both the Shapesum and final correlation uses totally self-contained modules that can be easily migrated from processing element to processing element. This was done to simplify the partitioning onto SPLASH 2 [13]. Memory data had to be carefully partitioned as well so that the data could follow the module to which it applied. The regularity of the design was an important concern; it allowed the placement of specific circuit modules to be dictated by the requirements of the algorithm and not by the limited interconnect of the platform. There are 16 identical modules in the Shapesum and 16 more identical modules in the final correlation. Along with these there is a divide, a hit accumulator and 2 delay modules (see Figure 8).

4.4.3 Memory Usage

The memories in SPLASH 2 serve several purposes. The template information is stored in them. They are used to implement video shift registers that correct for the latency incurred during threshold computation. These shift registers require that two memories be used in tandem because every clock cycle a new pixel (8 bits) had to be written to memory and a delayed pixel had to be read from memory. The bandwidth of one memory was such that it can handle two pixels (load and store) every three cycles. Thus one memory would delay two pixels and skip two pixels, while the other memory would delay the two pixels that the first memory skipped and skip the two pixels that the first memory delayed. The divide unit and the final result including the accumulated hit values are also stored in memory.

4.5 Performance Metrics

There are many metrics that could be used to measure the value of this implementation. This section is devoted to discussing some of these metrics.

4.5.1 Performance

This implementation runs at a frequency of 19 MHz using a test template that tests all column modules. Xdelay (a Xilinx timing tool) reports a guaranteed frequency of 13.2 MHz. Designs that will run in the 10 to 20 MHz range are typical [8, 10].

Using the above frequency a single system could process one orientation every .487 seconds (.701 seconds using a 13.2 MHz clock):

(128 col × (128 − 15) rows × 16 cycles/col) / 19 MHz ≈ 12 ms/chunk
(12 ms/chunk) × 40 chunks/orientation ≈ 487 ms @ 19 MHz

(128 col × (128 − 15) rows × 16 cycles/col) / 13.2 MHz ≈ 17.5 ms/chunk
(17.5 ms/chunk) × 40 chunks/orientation ≈ 701 ms @ 13.2 MHz
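The chunk and orientation timings reduce to simple arithmetic (cycle count and clock rates as given in the text):

```python
# columns x valid rows x cycles per 16-pixel column
CYCLES_PER_CHUNK = 128 * (128 - 15) * 16          # 231,424 cycles

def ms_per_orientation(freq_hz, chunks=40):
    """Milliseconds to process one orientation (40 chunks) at freq_hz."""
    return CYCLES_PER_CHUNK * chunks / freq_hz * 1e3

t19 = ms_per_orientation(19e6)     # ~487 ms at 19 MHz
t13 = ms_per_orientation(13.2e6)   # ~701 ms at 13.2 MHz
```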
`
`
`
`
`
`
Figure 8: SPLASH 2 Implementation (Shapesum stage feeding the Final Correlation stage).
`
`Using a brute force approach it takes 59 seconds to
`compute the same data on an HP 770 (110 MHz).
`That is two orders of magnitude slower. On one
`SPLASH 2 board two processing units can be in-
`stalled yielding the ability to process 1 orientation
`(40 chunks) every .244 seconds (.350 seconds at 13.2
`MHz).
`
`4.5.2 Multi-board Splash 2
If a 16-board system were used then 256 processing
elements would be available. This would make 42 pro-
cessing units (256 PEs ÷ 6 PEs per PU) and 42x the
throughput. Thus one orientation can be processed
every 11.6 ms (16.7 ms at 13.2 MHz). This is equal
to covering 86 orientations per second (60 orientations
per second at 13.2 MHz).
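The multi-board scaling is the same arithmetic again, using the PE and PU counts from the text:

```python
pes = 16 * 16              # 16 boards x 16 PEs per board
pus = pes // 6             # 6 PEs per processing unit -> 42 PUs
ms_19 = 487 / pus          # ~11.6 ms per orientation at 19 MHz
ms_13 = 701 / pus          # ~16.7 ms per orientation at 13.2 MHz
rate_19 = 1000 / ms_19     # ~86 orientations per second
rate_13 = 1000 / ms_13     # ~60 orientations per second
```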
`
`4.5.3 Memory Bandwidth
The distributed memory and aggregate memory band-
width is what makes this implementation possible. Be-
cause the memory is distributed to each chip this im-
plementation can be extended to a mul