Proceedings

IEEE Symposium on FPGAs for Custom Computing Machines

April 16-18, 1997
Napa Valley, California

Edited by Kenneth L. Pocek and Jeffrey Arnold

Sponsored by the IEEE Computer Society Technical Committee on Computer Architecture
Petitioner Microsoft Corporation - Ex. 1011, Cover 1

PROCEEDINGS

The 5th Annual IEEE Symposium on
Field-Programmable Custom Computing Machines

April 16-18, 1997
Napa Valley, California

Sponsored by
IEEE Computer Society
IEEE Computer Society Technical Committee on Computer Architecture

Los Alamitos, California • Washington • Brussels • Tokyo
Petitioner Microsoft Corporation - Ex. 1011, Cover 2

Copyright © 1997 by The Institute of Electrical and Electronics Engineers, Inc.
All rights reserved

Copyright and Reprint Permissions: Abstracting is permitted with credit to the source. Libraries may photocopy beyond the limits of US copyright law, for private use of patrons, those articles in this volume that carry a code at the bottom of the first page, provided that the per-copy fee indicated in the code is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923.

Other copying, reprint, or republication requests should be addressed to: IEEE Copyrights Manager, IEEE Service Center, 445 Hoes Lane, P.O. Box 1331, Piscataway, NJ 08855-1331.

The papers in this book comprise the proceedings of the meeting mentioned on the cover and title page. They reflect the authors' opinions and, in the interests of timely dissemination, are published as presented and without change. Their inclusion in this publication does not necessarily constitute endorsement by the editors, the IEEE Computer Society, or the Institute of Electrical and Electronics Engineers, Inc.

IEEE Computer Society Order Number PR08159
ISBN 0-8186-8159-4
ISBN 0-8186-8160-8 (case)
ISBN 0-8186-8161-6 (microfiche)
IEEE Order Plan Catalog Number 97TB100186
ISSN 1082-3409
Additional copies may be ordered from:

IEEE Computer Society
Customer Service Center
10662 Los Vaqueros Circle
P.O. Box 3014
Los Alamitos, CA 90720-1314
Tel: +1-714-821-8380
Fax: +1-714-821-4641
E-mail: cs.books@computer.org

IEEE Service Center
445 Hoes Lane
P.O. Box 1331
Piscataway, NJ 08855-1331
Tel: +1-908-981-1393
Fax: +1-908-981-9667
mis.custserv@computer.org

IEEE Computer Society
13, Avenue de l'Aquilon
B-1200 Brussels
BELGIUM
Tel: +32-2-770-2198
Fax: +32-2-770-8505
euro.ofc@computer.org

IEEE Computer Society
Ooshima Building
2-19-1 Minami-Aoyama
Minato-ku, Tokyo 107
JAPAN
Tel: +81-3-3408-3118
Fax: +81-3-3408-3553
tokyo.ofc@computer.org
Editorial production by Bob Werner
Cover art production by Joe Daigle/Studio Productions
Printed in the United States of America by Technical Communication Services
Petitioner Microsoft Corporation - Ex. 1011, Cover 3

Table of Contents

Symposium on Field-Programmable Custom Computing Machines — FCCM '97

Introduction
Program Committee

Session 1: Device Architecture

An FPGA Architecture for DRAM-based Systolic Computations ... 2
N. Margolus

Garp: A MIPS Processor with a Reconfigurable Coprocessor ... 12
J. Hauser, J. Wawrzynek

A Time-Multiplexed FPGA ... 22
S. Trimberger, D. Carberry, A. Johnson, J. Wong

Session 2: Communication Applications

An FPGA-Based Coprocessor for ATM Firewalls ... 30
J. McHenry, P. Dowd, T. Carrozzi, F. Pellegrino, W. Cocks

A Wireless LAN Demodulator in a Pamette: Design and Experience ... 40
T. McDermott, P. Ryan, M. Shand, D. Skellern, T. Percival, N. Weste

Session 3: Run Time Reconfiguration

Incremental Reconfiguration for Pipelined Applications ... 47
H. Schmit

Compilation Tools for Run-Time Reconfigurable Designs ... 56
W. Luk, N. Shirazi, P. Cheung

A Dynamic Reconfiguration Run-Time System ... 66
J. Burns, A. Donlin, J. Hogg, S. Singh, M. de Wit

Session 4: Architectures for Run Time Reconfiguration

The Swappable Logic Unit: A Paradigm for Virtual Hardware ... 77
G. Brebner

Petitioner Microsoft Corporation - Ex. 1011, ToC i

The Chimaera Reconfigurable Functional Unit ... 87
S. Hauck, T. Fry, M. Hosier, J. Kao

Session 5: Architecture

Computing Kernels Implemented with a Wormhole RTR CCM ... 98
R. Bittner Jr., P. Athanas

Mapping Applications to the RaPiD Configurable Architecture ... 106
C. Ebeling, D. Cronquist, P. Franklin, J. Secosky, S. Berg

Defect Tolerance on the Teramac Custom Computer ... 116
B. Culbertson, R. Amerson, R. Carter, P. Kuekes, G. Snider

Session 6: Performance

Systems Performance Measurement on PCI Pamette ... 125
L. Moll, M. Shand

The RAW Benchmark Suite: Computation Structures for
General Purpose Computing ... 134
J. Babb, M. Frank, V. Lee, E. Waingold, R. Barua, M. Taylor, J. Kim,
S. Devabhaktuni, A. Agarwal

Session 7: Software Tools

Automated Field-Programmable Compute Accelerator Design using
Partial Evaluation ... 145
Q. Wang, D. Lewis

FPGA Synthesis on the XC6200 using IRIS and Trianus/Hades
(Or, from Heaven to Hell and back again) ... 155
R. Woods, S. Ludwig, J. Heron, D. Trainor, S. Gehring

High Level Compilation for Fine Grained FPGAs ... 165
M. Gokhale, E. Gomersall

Session 8: CAD Applications

Acceleration of an FPGA Router ... 175
P. Chan, M. Schlag

Fault Simulation on Reconfigurable Hardware ... 182
M. Abramovici, P. Menon

vi

Petitioner Microsoft Corporation - Ex. 1011, ToC ii

Session 9: Image Processing Applications

Automated Target Recognition on SPLASH 2 ... 192
M. Rencher, B. Hutchings

Real-Time Stereo Vision on the PARTS Reconfigurable Computer ... 201
J. Woodfill, B. Von Herzen

Increased FPGA Capacity Enables Scalable, Flexible CCMs:
An Example from Image Processing ... 211
J. Greenbaum, M. Baxter

Session 10: Arithmetic Applications

Comparison of Arithmetic Architectures for Reed-Solomon Decoders in
Reconfigurable Hardware ... 219
C. Paar, M. Rosner

Implementation of Single Precision Floating Point Square Root on FPGAs ... 226
Y. Li, W. Chu

Poster Papers

Datapath-Oriented FPGA Mapping and Placement for
Configurable Computing ... 234
T. Callahan, J. Wawrzynek

Mapping a Real-Time Video Algorithm to a Context-Switched FPGA ... 236
S. Kelem

A Parallel Hardware Evolvable Computer POLYP ... 238
U. Tangen, L. Schulte, J. McCaskill

Laser Defect Correction Applications to FPGA Based Custom Computers ... 240
G. Chapman, B. Dufort

Speech Recognition HMM Training on Reconfigurable Parallel Processor ... 242
H. Yun, A. Smith, H. Silverman

Efficient Implementation of the DCT on Custom Computers ... 244
N. Bergmann, Y. Chung, B. Gunther

On Acceleration of the Check Tautology Logic Synthesis Algorithm using an
FPGA-based Reconfigurable Coprocessor ... 246
J. Cong, J. Peck

Index of Authors ... 249

vii

Petitioner Microsoft Corporation - Ex. 1011, ToC iii

This material may be protected by Copyright law (Title 17 U.S. Code)

Automated Target Recognition on SPLASH 2†

Michael Rencher and Brad L. Hutchings
Department of Electrical and Computer Engineering
Brigham Young University
Provo, UT 84602

Abstract

Automated target recognition is an application area that requires special-purpose hardware to achieve reasonable performance. FPGA-based platforms can provide a high level of performance for ATR systems if the implementation can be adapted to the limited FPGA and routing resources of these architectures. This paper discusses a mapping experiment where a linear-systolic implementation of an ATR algorithm is mapped to the SPLASH 2 platform. Simple column-oriented processors were used throughout the design to achieve high performance with limited nearest-neighbor communication. The distributed SPLASH 2 memories are also exploited to achieve a high degree of parallelism. The resulting design is scalable and can be spread across multiple SPLASH 2 boards with a linear increase in performance.

1 Introduction

Automated target recognition (ATR) is a computationally demanding application area that typically requires special-purpose hardware to achieve desirable performance. ASICs are not an option for these systems due to high non-recurring engineering (NRE) costs and because the algorithms are constantly evolving. Existing FPGA-based computing platforms can potentially provide the necessary performance and flexibility for evolving ATR systems; however, mapping applications to these existing platforms can be very challenging because they lack abundant interconnect and FPGA resources. The key to achieving a high-performance implementation of ATR algorithms with existing platforms is to carefully organize the design of the ATR implementation so that it can communicate via the limited interconnect and can be easily partitioned among the FPGA devices.

This paper presents a linear systolic implementation of an existing ATR algorithm on SPLASH 2 that is well-suited to the SPLASH 2 architecture. Inter-FPGA communication is limited and easily accommodated by the SPLASH 2 interconnect. Moreover, the implementation can be scaled across any number of SPLASH 2 boards and achieves high performance with limited resources.

This paper briefly discusses the entire ATR algorithm as developed by Sandia National Labs, and then overviews the design and implementation of the most computationally demanding part of the algorithm: Chunky SLD. The SPLASH 2 implementation is presented in some detail with future directions and possible improvements.

†This work was supported by DARPA/CSTO under contract number DABT63-94-C-0085 under a subcontract to National Semiconductor.

2 Automatic Target Recognition

The goal of a typical ATR system is to analyze a digital representation of a scene and locate/identify objects that are of interest. Although this goal is conceptually simple, ATR systems have extremely demanding I/O and computational requirements: image data are large, can be generated in real-time, and must be processed quickly so that results remain relevant in a dynamic environment. The common use of special-purpose hardware in nearly all high-performance ATR systems is a clear indication of the computational complexity of these systems.

This paper details the implementation of an existing ATR algorithm on SPLASH 2. The algorithm in question was developed at Sandia National Laboratories and was designed to detect partially obscured targets in Synthetic Aperture Radar (SAR) images. It is commonly referred to as Chunky SLD, so named for the second step of the algorithm that differentiates this algorithm from others developed at Sandia. This algorithm consists of the following three steps: (1) Focus of Attention (FOA), (2) Second-Level Detection (SLD), and (3) Final Identification (FI). Each of these steps will now be introduced so that the algorithm implementation can be understood in its operating context.

0-8186-8159-4/97 $10.00 © 1997 IEEE

192

Petitioner Microsoft Corporation - Ex. 1011, p. 192
Figure 1: ATR Block Diagram (the diagram marks the section of the ATR process implemented on SPLASH 2).

2.1 Focus of Attention (FOA)

Focus of attention is the first step of the ATR process and uses image morphology techniques to detect potential targets in SAR data. FOA operates on "down-sampled" SAR images that are approximately 600-1000 pixels on a side. Once FOA detects a potential target, it determines the approximate center of the potential target and creates 2x down-sampled subimages of the original SAR data where each subimage contains a single target centered within the subimage. These subimages are referred to as chips and are 128 x 128 pixels.
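As a concrete illustration of the chip-generation step, here is a minimal software sketch. It is not from the paper: the function name, the use of NumPy, and the edge-clamping behavior are all assumptions; only the 128 x 128 chip size comes from the text.

```python
import numpy as np

def extract_chip(sar_image, center_row, center_col, chip_size=128):
    """Cut a chip_size x chip_size subimage centered on a detected target.

    `sar_image` is assumed to be the down-sampled SAR scene; the center
    coordinates would come from the morphology-based FOA detector (not
    shown).  Coordinates are clamped so the chip stays inside the image.
    """
    half = chip_size // 2
    r0 = min(max(center_row - half, 0), sar_image.shape[0] - chip_size)
    c0 = min(max(center_col - half, 0), sar_image.shape[1] - chip_size)
    return sar_image[r0:r0 + chip_size, c0:c0 + chip_size]

# Example: a 600x800 down-sampled scene with a detection near (300, 400).
scene = np.zeros((600, 800), dtype=np.uint8)
chip = extract_chip(scene, 300, 400)
assert chip.shape == (128, 128)
```

Even a detection at the very edge of the scene yields a full-size chip, which keeps every downstream correlation the same shape.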
2.2 Second Level Detection (SLD)

The SLD step processes the chips generated by the FOA step. SLD further restricts the areas of interest by giving the potential targets coordinates and angular orientation. SLD does this by correlating predefined binary templates to the areas of interest. The templates represent different object orientation angles. Templates are oriented between 5 and 10 degrees apart. SLD also uses adaptive threshold levels determined by the overall image intensity.

The algorithm studied in this paper is a variation of SLD called Chunky SLD. Chunky SLD adds a level of complexity to SLD by using more templates to represent objects that have been partially obscured (partially hidden by camouflage or objects overhead). This allows better target recognition at a cost of higher computational requirements. Chunky SLD is discussed in more detail later in this section.

2.3 Final Identification (FI)

The FI algorithm correlates full resolution image data and templates with finer angular resolution (3 to 5 degrees). FI also uses adaptive threshold levels. The output of FI is a location of the target, and a confidence level corresponding to the level of correlation between the object and the FI templates.

2.4 The Chunky SLD Algorithm

The general goal of the Chunky SLD algorithm is to recognize targets that are partially concealed or obscured in some way. To achieve this goal, the designers of this algorithm treat the target as a set of 40 template pairs where each pair of templates is a digital representation of some salient feature of the specific target. If the majority of the template pairs strongly correlate with the image data, then a match of the overall target is assumed. Each pair of templates consists of a Bright template and a Surround template. The Bright template is a representation of expected reflections directly from surfaces of a salient target feature while the Surround template represents expected absorption in the immediate area surrounding the target feature. Each pair of a Bright and Surround template is referred to as a chunk, so called because each pair of templates represents a "chunk" of the overall target. Each set of 40 chunks represents a single target at a specific rotation. There are 72 orientations, each representing a different target orientation and radar incidence angle. Each set of 72 orientations is referred to as a class and is the complete set of templates that must be correlated with a chip to detect the presence of a specific target.

Figure 2: Chunky SLD

Figure 3: Template Organization (a class — all the templates for one object — contains 72 orientations; each orientation contains 40 chunks, and each chunk is a Bright/Surround template pair).

The first step of the Chunky SLD algorithm is to correlate the chip and the Bright template. This correlation value is used to compute a value that will be
193

Petitioner Microsoft Corporation - Ex. 1011, p. 193


`used to threshold the incoming chip data, converting
`the 8-bit chip imageinto a binary image. The equa-
`tions describing this process are shown below.
`15
`Shapesum(a, y) = >> Bremp(a, 6) *Chip(x +a, y+6)
`a,6=0
`
`Threshold(«, y) =
`
`Shapesum(cz, y)
`Bright_template_pizel_count
`The values obtained by correlating the Bright and
`Surround templates with the binarized chip (Bsum and
`Ssum) are checked against minimum values to generate
`a “hit value” for each offset in the chip. The threshold
`value is also checked tosee if it falls in an acceptable
`range when generatingthe hit values,
`
`if((Tmaz > T 2 Tmin] AND
`[Bsum > Bmin] AND
`[Ses > Smin])
`then
`Hit =1;
`else
`
`Ait=0;
`
`(1)
`
`The hit values are accumulated for each offset for a
`specific orientation (40 chunks). The highest values
`are used to identify the areas of interest for the final
`identification step.
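The thresholding step and Equation 1 can be prototyped directly in software. The sketch below is illustrative only: the 16 x 16 template size and the equations follow the text, but the direction of the binarization comparison and the use of the complemented binary image for the Surround correlation are assumptions, as are all the names.

```python
import numpy as np

def hit_at_offset(chip, b_temp, s_temp, x, y, t_min, t_max, b_min, s_min):
    """Compute the hit value of Equation 1 for a single (x, y) offset."""
    window = chip[x:x + 16, y:y + 16].astype(np.int64)

    # Shapesum: Bright template correlated with the raw 8-bit pixels.
    shapesum = int(np.sum(b_temp * window))
    threshold = shapesum / int(np.sum(b_temp))  # Bright_template_pixel_count

    # Binarize the window with the per-offset threshold (assumed >=).
    binary = (window >= threshold).astype(np.int64)

    b_sum = int(np.sum(b_temp * binary))        # bright pixels that survived
    s_sum = int(np.sum(s_temp * (1 - binary)))  # surround expects absorption

    if (t_max > threshold >= t_min) and b_sum > b_min and s_sum > s_min:
        return 1
    return 0
```

Accumulating this value over all 40 chunks of an orientation reproduces the hit-table behavior described later in the paper.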
2.4.1 Template Characteristics

Template pairs exhibit useful properties: sparseness and mutual exclusivity. The Bright template consists mostly of zeros; only 3 to 10 percent of the template values are '1's and this limits the magnitude of the Shapesum and Bsum values. The Bright and Surround templates are also mutually exclusive; that is, if the two templates are overlaid no "on" pixels will overlap. When carefully exploited, both of these properties lead to more compact and higher performance hardware.
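The sparseness property maps naturally to a coordinate-list representation. This sketch is illustrative software, not the paper's hardware: it stores only the "on" pixel positions so each Shapesum costs one add per '1' in the template rather than 256 multiply-accumulates.

```python
import numpy as np

def sparsify(template):
    """Return the (row, col) coordinates of the 'on' template pixels."""
    rows, cols = np.nonzero(template)
    return list(zip(rows.tolist(), cols.tolist()))

def sparse_shapesum(chip, on_pixels, x, y):
    """Shapesum at offset (x, y) using only the on-pixel list."""
    return sum(int(chip[x + r, y + c]) for r, c in on_pixels)

# With 3-10% density, a 16x16 Bright template yields only ~8-26
# coordinates, which also bounds the magnitude of Shapesum and Bsum.
```

The same idea underlies the compact adder trees in the UCLA implementation discussed in Section 3.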
3 Other Implementations of Chunky-SLD

As explained, the ATR application is computationally demanding. There are (128-15) x (128-15) offsets per chunk x 40 chunks x 72 orientations ≈ 36 x 10^6 hit values to compute per targeted object (or per class, see Figure 3). The computational rate and I/O requirements of this algorithm make it impossible to use current microprocessors. Thus any high-performance implementation of this algorithm will require special-purpose hardware to meet performance goals.

However, custom ASICs are also not an option because the algorithm is constantly evolving and also because commercial-off-the-shelf components (COTS) are often dictated by the ultimate customers of ATR systems. The only remaining options are to construct the system with commercially available fixed-function devices such as correlaters, multipliers, etc., or to use programmable logic, e.g., FPGAs [1, 2]. Thus all known implementations of Chunky-SLD use either fixed-function devices or programmable logic.

3.1 Sandia

Current Sandia implementations of ATR are based on commercially available one-bit correlater chips. Sandia designers adapted the basic Chunky-SLD algorithm so they could exploit the capabilities of these components to achieve high performance. Rather than process the Shapesum and then process the final correlation, the two steps were done in parallel. The correlation was done at 8 discrete threshold levels and the Shapesum determined which threshold to use for each offset.

Figure 4: Sandia's Implementation.

3.2 UCLA

A group at UCLA (headed by John Villasenor) [3] is working on an FPGA based SLD implementation. By doing bit level correlations they are able to do very compact adder trees that take advantage of template sparseness and FPGA on board lookup-table memory capability. Their approach compiles template information directly into the hardware and relies on fast reconfiguration to switch template information. They

194

Petitioner Microsoft Corporation - Ex. 1011, p. 194

Figure 5: Example Column Sum.
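The per-class workload quoted in Section 3 is easy to verify with a few lines of arithmetic:

```python
offsets_per_chunk = (128 - 15) * (128 - 15)   # 113 x 113 valid template offsets
chunks_per_orientation = 40
orientations_per_class = 72

hits_per_class = (offsets_per_chunk * chunks_per_orientation
                  * orientations_per_class)
print(hits_per_class)   # 36774720, i.e. the ~36 x 10^6 figure in the text
```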

`
also try and take advantage of template overlap by computing the results of multiple correlations simultaneously.

4 Chunky-SLD on Splash 2

On SPLASH 2 a static implementation was done to avoid the overhead of reconfiguring the hardware during execution. In order to reduce the hardware requirements without reducing throughput, a deeply pipelined design (~400 cycle latency) was implemented. The Shapesum unit generates the threshold value which is then used to generate the final correlation values. (Note: there is a unique threshold for each offset.) By doing this only two final correlations have to be computed per offset (one Bsum and one Ssum).

The Sandia implementation computes the Shapesum and final correlation in parallel, which forces them to compute multiple final correlations, while our implementation does them serially. This allows us to use an exact threshold value. Also, only one final correlation needs to be computed because the threshold value is computed before the final correlation begins. The technique used was to look at the correlations by column, compute the partial correlation for that column, and sum up the partial sums for all 16 columns. In this method 16 different column correlations are going on in parallel but only one column of data needs to be available for processing.

4.1 Implementing the Correlation as Column Sums

Figure 5 depicts a simple example that demonstrates a correlation of a 3x3 template with a binary image. Each row in the table represents one clock cycle. The first column is the clock cycle number. Corresponding numbers are found in the Pixel load order box at the right. A new pixel is brought in on each clock cycle. The P1, P2, and P3 columns represent the three column processing units needed for a three column template. The last column represents the actual output. Clock cycles 9 through 12 have been expanded to show how data (pixels and partial sums) are passed from column to column and illustrate the data format. Once the pipeline is full, a new correlation value is computed as each column arrives (three pixels/cycles).

Note that valid output comes every three cycles because the template is three rows tall. All processing elements are actively processing 3 pixel values at all times. The SPLASH 2 implementation works just like the example except for the size of the columns (16 pixels instead of three) and the data format (eight-bit instead of one-bit).

4.2 Platforms/Implementations

Chunky SLD was implemented on the SPLASH 2 board. SPLASH 2 has shown itself to be a useful platform and has had numerous applications mapped to it [4, 5, 6, 7, 8, 9, 10, 11]. The implementation was done in VHDL and simulated/synthesized using Synopsys. All place and route was done automatically using Xilinx place and route tools.

One of the goals of the implementation was to run the system so that it consumed a pixel per cycle. This means that each cycle all processing elements (PE) need to be able to process a new pixel. This implementation follows Sandia National Labs algorithms (not implementation) as closely as possible (see Section 2.4).

The SPLASH 2 board was developed by SRC (Supercomputing Research Center, Institute for Defense Analyses) [12]. The SPLASH 2 board is a linear systolic array of processing elements (FPGAs), each with their own memory.
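A software model of this column-sum scheme may help: each of the K template columns acts as a processor holding one partial sum per image column, and a full correlation value is the chain of K neighboring partials. This sketch is illustrative (it is not the paper's VHDL, and all names are invented); it can be checked against a direct 2-D correlation.

```python
import numpy as np

def column_systolic_correlate(image, template):
    """Correlate `template` over `image` using per-column partial sums,
    the software analogue of the column processors of Section 4.1."""
    ih, iw = image.shape
    th, tw = template.shape
    out = np.zeros((ih - th + 1, iw - tw + 1), dtype=np.int64)
    for r in range(ih - th + 1):            # vertical offset
        # partial[c][p]: template column p against image column c
        partial = [[int(np.dot(template[:, p], image[r:r + th, c]))
                    for p in range(tw)] for c in range(iw)]
        for c in range(iw - tw + 1):        # horizontal offset
            # processor p contributes its partial for image column c + p
            out[r, c] = sum(partial[c + p][p] for p in range(tw))
    return out
```

As in the hardware, only one image column's worth of data is needed to advance every partial sum by one step; the chaining of neighboring partials is what the nearest-neighbor interconnect carries.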
4.2.1 Splash 2 Hardware

From a conceptual point of view, the SPLASH 2 system consists of a linear array of processing elements. This
195

Petitioner Microsoft Corporation - Ex. 1011, p. 195

makes SPLASH 2 a good candidate for linear-systolic applications with limited neighbor-to-neighbor interconnect. Because of limited routing resources SPLASH 2 has difficulty implementing multi-chip systems that are not linear systolic, though they are possible [8]. The actual SPLASH 2 platform consists of a board with 16 Xilinx 4010 chips (plus one for control) arranged in a linear systolic array. Each chip has a limited 36-bit connection to its two nearest neighbors. Each Xilinx 4010 is connected to a 512 kbyte memory (16-bit word size). The memory can handle back-to-back reads, or back-to-back writes, but requires one 'dead' (or turn-around) cycle when changing from write to read. There is also a crossbar connected to all of the chips that allows some level of random connection between chips. Up to 16 boards can be daisy-chained together to provide a large linear-systolic array of 256 elements.

Figure 6: Single Processing Element of SPLASH 2 (a Xilinx 4010 with its local memory, connected to its left and right neighbors and to the crossbar).

Figure 7: SPLASH 2 platform.

4.3 ATR Implementation on Splash 2

Similar to the example, the SPLASH 2 implementation processes one pixel at a time and loads them in column order so that the partial sums can be generated and passed from column to column. All template data are stored in the memories adjacent to the FPGAs on the SPLASH 2 boards. Each memory can hold several thousand templates, thus making it possible to store all of the templates for a single class (5760) on a single SPLASH 2 board. There is sufficient room in the FPGA design to store a single template. The templates are switched by reading the new template data out of the memory and storing it within the FPGA. However, because this implementation is deeply pipelined, it is necessary to flush all current data from the system when switching to a new template. The overhead from the flushing operation is minimal (318 flush cycles / 231,424 compute cycles ≈ 0.14%).

During each clock cycle, a new pixel arrives at the FPGA. If the template bit corresponding to this pixel is on then the incoming pixel is added to the current partial sum. Each 16 clock cycles, this partial sum is then passed on to the next column and a new partial sum is received from the previous column. The last column computes a complete Shapesum every 16 cycles (one column). The final correlation of the Bright and Surround templates with the thresholded chip data works similarly except there are two correlations (one for each template).

Intermediate hit values are stored in a table, referred to as the hit-table, in one of the local memories. Each location in the table corresponds to an x-y offset of a chip, the origin of a single correlation. For each offset, if a chunk "hits", then the corresponding location in this table is incremented. Thus the table contains the accumulated hit values for all chunks and all offsets that have been computed to that point.

Hits are computed according to Equation 1. First, each Bsum and Ssum value is compared to its corresponding minimum value. Second, the threshold value corresponding to each Bsum and Ssum is checked to see if it is between a certain minimum and maximum value. For reasons of efficiency, the threshold value is actually examined earlier in the process and a zero for the threshold is stored in lookup-table memory if it is out of bounds. This works correctly because if the
196

Petitioner Microsoft Corporation - Ex. 1011, p. 196

`modules in the Shapesum and 16 more identical mod-
`ules in the final correlation. Along with these thereis
`a divide, a hit accumulator and 2 delay modules (see
`Figure 8).
`
`4.4.3- Memory Usage
`he memories in SPLASH 2 serve several purposes.
`The template information is stored in them. They
`are used to implementvideo shift registers that cor-
`rect for the latency incurred during threshold compu-
`tation. These shift registers require that two mem-
`ories be used in tandem because every clock cycle a
`new pixel (8 bits) had to be written to memory and a
`delayed pixel had to be read from memory. The band-
`width of one memory was such that it can handle two
`pixels (load and store) every three cycles. Thus one
`memory would delay two pixels and skip two pixels,
`while the other memory would delay the two pixels
`that the first memory skipped and skip the twopixels
`that the first memory delayed. The divide unit and
`the final result including the accumulated hit values
`are also stored in memory.
`4.5 Performance Metrics
`There are many metrics that could be used to mea-
`sure the value of this implementation. This section is
`devoted to discussing some of these metrics.
`
`4.5.1 Performance
`
`threshold is zero, it will cause the Bsum to be zero,
`which will in turn cause the Byum comparison to fail.
`Otherwise, if all three of these tests come back true
`then a hit has been foundfor the correspondingoffset
`(see Equation 1) and the corresponding locationin the
`hit-table is incremented. After the 40 templates are
`tested against the samechip the two offsets with the
`highest accumulated hit values are written into mem-
`ory where the host computer can read them. This is
`accomplished by examiningthehit values during this
`Process and retaining the top two values in special
`memory locations. These final two hit values (which
`represent the top two orientations for a specific class)
`are used in the FI step.
`:
`For the SPLASH 2 board, as with most FPGA sys-
`tems, partitioning is a major issue. The design needed
`to be modular so that different design modules could
`be reassigned to different FPGAs as necessary. This
`is where the column modules were so valuable (see
`Figure 8).
`4.4 Special Features
`This implementation has several notable character-
`istics. They include control distribution, modular de-
`sign for partitioning and memoryutilization.
`
`4.4.1 Distributed Control
`
`The control in this system is distributed throughout
`the array. Each column module has it’s own state
`machine based control. Module synchronization is
`achieved by distributing a control token through the
`pipeline along with the data. When a module re
`ceives this signal, it resets its internal state machines
`and retrieves template data from its local memory. A
`memory controller resides in each processing element
`(FPGA) to retrieve template data and give memory
`access to all other modules.
`
`4.4.2 Modular Design (Design for Partition-
`ing)
`
`|waveerrtone
`
`
`
`oroe
`
Each column in both the Shapesum and the final correlation uses totally self-contained modules that can be easily migrated from processing element to processing element. This was done to simplify the partitioning onto SPLASH 2 [13]. Memory data had to be carefully partitioned as well, so that the data could follow the module to which it applied. The regularity of the design was an important concern; it allowed the placement of specific circuit modules to be dictated by the requirements of the algorithm and not by the limited interconnect of the platform. There are 16 identical column modules.

4.5 Performance

4.5.1 Single-board SPLASH 2

This implementation runs at a frequency of 19 MHz using a test template that exercises all column modules. Xdelay (a Xilinx timing tool) reports a guaranteed frequency of 13.2 MHz. Designs that run in the 10 to 20 MHz range are typical [8, 10].

Using the above frequencies, a single system can process one orientation every .487 seconds (.701 seconds using a 13.2 MHz clock):

    128 col x (128 - 15) row x 16 / 19 MHz = 12 ms/Chunk
    (12 ms/Chunk) x 40 Chunks/Orientation = 487 ms @ 19 MHz

or

    128 col x (128 - 15) row x 16 / 13.2 MHz = 17.5 ms/Chunk
    (17.5 ms/Chunk) x 40 Chunks/Orientation = 701 ms @ 13.2 MHz
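The quoted per-chunk and per-orientation times can be checked numerically. The factor of 16 cycles per pixel position is an assumption on my part; it is the value that reproduces the quoted 12 ms and 17.5 ms chunk times.

```python
# Arithmetic check of the single-board throughput figures above,
# assuming 16 cycles per pixel position of the 128 x (128 - 15) window.
cols, rows, cycles_per_pixel = 128, 128 - 15, 16
cycles_per_chunk = cols * rows * cycles_per_pixel    # 231,424 cycles
for f_mhz in (19.0, 13.2):
    t_chunk = cycles_per_chunk / (f_mhz * 1e6)       # seconds per chunk
    t_orient = t_chunk * 40                          # 40 chunks/orientation
    print(f"{f_mhz} MHz: {t_chunk*1e3:.1f} ms/chunk, "
          f"{t_orient*1e3:.0f} ms/orientation")
```

At 19 MHz this yields roughly 487 ms per orientation, and at 13.2 MHz roughly 701 ms, matching the figures in the text.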
`
Petitioner Microsoft Corporation - Ex. 1011, p. 197
Shapesum <-- Final Correlation

Figure 8: SPLASH 2 Implementation.

Using a brute-force approach, it takes 59 seconds to compute the same data on an HP 770 (110 MHz); that is two orders of magnitude slower. On one SPLASH 2 board, two processing units can be installed, yielding the ability to process one orientation (40 chunks) every .244 seconds (.350 seconds at 13.2 MHz).
`
4.5.2 Multi-board SPLASH 2

If a 16-board system were used, then 256 processing elements would be available. This would make 42 processing units (256 PEs / 6 PEs per PU) and give 42x the throughput of a single processing unit. Thus one orientation can be processed every 11.6 ms (16.7 ms at 13.2 MHz). This is equal to covering 86 orientations per second (60 orientations per second at 13.2 MHz).
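The multi-board scaling above can be checked with the same arithmetic: 42 processing units work on different orientations in parallel, so the effective per-orientation period is the single-unit time divided by 42.

```python
# Arithmetic check of the 16-board figures above: 16 boards x 16 PEs
# gives 256 PEs; at 6 PEs per processing unit that is 42 units.
pes, pes_per_pu = 16 * 16, 6
pus = pes // pes_per_pu                        # 42 processing units
for t_per_pu, label in ((0.487, "19 MHz"), (0.701, "13.2 MHz")):
    period = t_per_pu / pus                    # seconds per orientation
    print(f"{label}: {period*1e3:.1f} ms/orientation, "
          f"{1/period:.0f} orientations/s")
```

This reproduces the 11.6 ms / 86 orientations-per-second figure at 19 MHz and the 16.7 ms / 60 orientations-per-second figure at 13.2 MHz.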
`
4.5.3 Memory Bandwidth

The distributed memory and aggregate memory bandwidth are what make this implementation possible. Because the memory is distributed to each chip, this implementation can be extended to a multi-board system.
