IEEE Symposium on
FPGAs for Custom Computing Machines

April 16-18, 1997
Napa Valley, California

Edited by Kenneth L. Pocek and Jeffrey Arnold

Sponsored by the IEEE Computer Society Technical Committee on Computer Architecture

IEEE Computer Society
`
`Petitioner Microsoft Corporation - Ex. 1011, Cover 1
`
`
`
`PROCEEDINGS
`
`The 5th Annual IEEE Symposium on
Field-Programmable
`Custom Computing Machines
`
April 16-18, 1997
`
`Napa Valley, California
`
`Sponsored by
`
`IEEE Computer Society
`IEEE Computer Society Technical Committee on Computer Architecture
`
IEEE Computer Society
`
`Los Alamitos, California
`
Washington • Brussels • Tokyo
`
`Petitioner Microsoft Corporation - Ex. 1011, Cover 2
`
`
`
Copyright © 1997 by The Institute of Electrical and Electronics Engineers, Inc.
All rights reserved.

Copyright and Reprint Permissions: Abstracting is permitted with credit to the source. Libraries may photocopy beyond the limits of US copyright law, for private use of patrons, those articles in this volume that carry a code at the bottom of the first page, provided that the per-copy fee indicated in the code is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923.

Other copying, reprint, or republication requests should be addressed to: IEEE Copyrights Manager, IEEE Service Center, 445 Hoes Lane, P.O. Box 1331, Piscataway, NJ 08855-1331.

The papers in this book comprise the proceedings of the meeting mentioned on the cover and title page. They reflect the authors' opinions and, in the interests of timely dissemination, are published as presented and without change. Their inclusion in this publication does not necessarily constitute endorsement by the editors, the IEEE Computer Society, or the Institute of Electrical and Electronics Engineers, Inc.
`
IEEE Computer Society Order Number PR08159
ISBN 0-8186-8159-4
ISBN 0-8186-8160-8 (case)
ISBN 0-8186-8161-6 (microfiche)
IEEE Order Plan Catalog Number 97TB100186
ISSN 1082-3409
`
IEEE Computer Society
Customer Service Center
10662 Los Vaqueros Circle
P.O. Box 3014
Los Alamitos, CA 90720-1314
Tel: +1-714-821-8380
Fax: +1-714-821-4641
E-mail: cs.books@computer.org

Additional copies may be ordered from:

IEEE Service Center
445 Hoes Lane
P.O. Box 1331
Piscataway, NJ 08855-1331
Tel: +1-908-981-1393
Fax: +1-908-981-9667
mis.custserv@computer.org

IEEE Computer Society
13, Avenue de l'Aquilon
B-1200 Brussels
BELGIUM
Tel: +32-2-770-2198
Fax: +32-2-770-8505
euro.ofc@computer.org

IEEE Computer Society
Ooshima Building
2-19-1 Minami-Aoyama
Minato-ku, Tokyo 107
JAPAN
Tel: +81-3-3408-3118
Fax: +81-3-3408-3553
tokyo.ofc@computer.org
`
`Editorial production by Bob Werner
`
Cover art production by Joe Daigle/Studio Productions
`
`Printed in the United States of America by Technical Communication Services
`
`
`Petitioner Microsoft Corporation - Ex. 1011, Cover 3
`
`
`
Table of Contents

Symposium on Field-Programmable Custom Computing Machines - FCCM '97

Introduction ..... ix
Program Committee ..... x

Session 1: Device Architecture

An FPGA Architecture for DRAM-Based Systolic Computations ..... 2
   N. Margolus

Garp: A MIPS Processor with a Reconfigurable Coprocessor ..... 12
   J. Hauser, J. Wawrzynek

A Time-Multiplexed FPGA ..... 22
   S. Trimberger, D. Carberry, A. Johnson, J. Wong

Session 2: Communication Applications

An FPGA-Based Coprocessor for ATM Firewalls ..... 30
   J. McHenry, P. Dowd, T. Carrozzi, F. Pellegrino, W. Cocks

A Wireless LAN Demodulator in a Pamette: Design and Experience ..... 40
   T. McDermott, P. Ryan, M. Shand, D. Skellern, T. Percival, N. Weste

Session 3: Run Time Reconfiguration

Incremental Reconfiguration for Pipelined Applications ..... 47
   H. Schmit

Compilation Tools for Run-Time Reconfigurable Designs ..... 56
   W. Luk, N. Shirazi, P. Cheung

A Dynamic Reconfiguration Run-Time System ..... 66
   J. Burns, A. Donlin, J. Hogg, S. Singh, M. de Wit

Session 4: Architectures for Run Time Reconfiguration

The Swappable Logic Unit: A Paradigm for Virtual Hardware ..... 77
   G. Brebner

Petitioner Microsoft Corporation - Ex. 1011, ToC i
The Chimaera Reconfigurable Functional Unit ..... 87
   S. Hauck, T. Fry, M. Hosler, J. Kao

Session 5: Architecture

Computing Kernels Implemented with a Wormhole RTR CCM ..... 98
   R. Bittner Jr., P. Athanas

Mapping Applications to the RaPiD Configurable Architecture ..... 106
   C. Ebeling, D. Cronquist, P. Franklin, J. Secosky, S. Berg

Defect Tolerance on the Teramac Custom Computer ..... 116
   B. Culbertson, R. Amerson, R. Carter, P. Kuekes, G. Snider

Session 6: Performance

Systems Performance Measurement on PCI Pamette ..... 125
   L. Moll, M. Shand

The RAW Benchmark Suite: Computation Structures for General Purpose Computing ..... 134
   J. Babb, M. Frank, V. Lee, E. Waingold, R. Barua, M. Taylor, J. Kim, S. Devabhaktuni, A. Agarwal

Session 7: Software Tools

Automated Field-Programmable Compute Accelerator Design using Partial Evaluation ..... 145
   Q. Wang, D. Lewis

FPGA Synthesis on the XC6200 using IRIS and Trianus/Hades (Or, from Heaven to Hell and back again) ..... 155
   R. Woods, S. Ludwig, J. Heron, D. Trainor, S. Gehring

High Level Compilation for Fine Grained FPGAs ..... 165
   M. Gokhale, E. Gomersall

Session 8: CAD Applications

Acceleration of an FPGA Router ..... 175
   P. Chan, M. Schlag

Fault Simulation on Reconfigurable Hardware ..... 182
   M. Abramovici, P. Menon

Petitioner Microsoft Corporation - Ex. 1011, ToC ii

vi
`
`
`
Real-Time Stereo Vision on the PARTS Reconfigurable Computer ..... 201
   J. Woodfill, B. Von Herzen

Increased FPGA Capacity Enables Scalable, Flexible CCMs: An Example from Image Processing ..... 211
   J. Greenbaum, M. Baxter

Comparison of Arithmetic Architectures for Reed-Solomon Decoders in Reconfigurable Hardware ..... 219
   C. Paar, M. Rosner

Implementation of Single Precision Floating Point Square Root on FPGAs ..... 226
   Y. Li, W. Chu

Poster Papers

Oriented FPGA Mapping and Placement for Configurable Computing

A Parallel Hardware Evolvable Computer POLYP ..... 238
   U. Tangen, L. Schulte, J. McCaskill

Laser Defect Correction Applications to FPGA Based Custom Computers ..... 240
   G. Chapman, B. Dufort

Speech Recognition HMM Training on Reconfigurable Parallel Processor ..... 242
   H. Yun, A. Smith, H. Silverman

Efficient Implementation of the DCT on Custom Computers ..... 244
   N. Bergmann, Y. Chung, B. Gunther

On Acceleration of the Check Tautology Logic Synthesis Algorithm using an FPGA-based Reconfigurable Coprocessor ..... 246
   J. Cong, J. Peck

Index of Authors ..... 249

vii

Petitioner Microsoft Corporation - Ex. 1011, ToC iii
`
`
`
This material may be protected by Copyright law (Title 17 U.S. Code)
`
Automated Target Recognition on SPLASH 2*
`
Michael Rencher and Brad L. Hutchings
Department of Electrical and Computer Engineering
Brigham Young University
Provo, UT 84602
`
Abstract

Automated target recognition is an application area that requires special-purpose hardware to achieve reasonable performance. FPGA-based platforms can provide a high level of performance for ATR systems if the implementation can be adapted to the limited FPGA and routing resources of these architectures. This paper discusses a mapping experiment where a linear-systolic implementation of an ATR algorithm is mapped to the SPLASH 2 platform. Simple column-oriented processors were used throughout the design to achieve high performance with limited nearest-neighbor communication. The distributed SPLASH 2 memories are also exploited to achieve a high degree of parallelism. The resulting design is scalable and can be spread across multiple SPLASH 2 boards with a linear increase in performance.
`
1 Introduction
`
Automated target recognition (ATR) is a computationally demanding application area that typically requires special-purpose hardware to achieve desirable performance. ASICs are not an option for these systems due to high non-recurring engineering (NRE) costs and because the algorithms are constantly evolving. Existing FPGA-based computing platforms can potentially provide the necessary performance and flexibility for evolving ATR systems; however, mapping applications to these existing platforms can be very challenging because they lack abundant interconnect and FPGA resources. The key to achieving a high-performance implementation of ATR algorithms with existing platforms is to carefully organize the design of the ATR implementation so that it can communicate via the limited interconnect and can be easily partitioned among the FPGA devices.
This paper presents a linear systolic implementation of an existing ATR algorithm on SPLASH 2 that is well-suited to the SPLASH 2 architecture. Inter-FPGA communication is limited and easily accommodated by the SPLASH 2 interconnect. Moreover, the implementation can be scaled across any number of SPLASH 2 boards and achieves high performance with limited resources.

This paper briefly discusses the entire ATR algorithm as developed by Sandia National Labs, and then overviews the design and implementation of the most computationally demanding part of the algorithm: Chunky SLD. The SPLASH 2 implementation is presented in some detail, with future directions and possible improvements.

*This work was supported by DARPA/CSTO under contract number DABT63-94-C-0085 under a subcontract to National Semiconductor.
`
2 Automatic Target Recognition

The goal of a typical ATR system is to analyze a digital representation of a scene and locate/identify objects that are of interest. Although this goal is conceptually simple, ATR systems have extremely demanding I/O and computational requirements: image data are large, can be generated in real-time, and must be processed quickly so that results remain relevant in a dynamic environment. The common use of special-purpose hardware in nearly all high-performance ATR systems is a clear indication of the computational complexity of these systems.

This paper details the implementation of an existing ATR algorithm on SPLASH 2. The algorithm in question was developed at Sandia National Laboratories and was designed to detect partially obscured targets in Synthetic Aperture Radar (SAR) images. It is commonly referred to as Chunky SLD, so named for the second step of the algorithm that differentiates this algorithm from others developed at Sandia. This algorithm consists of the following three steps: (1) Focus of Attention (FOA), (2) Second-Level Detection (SLD), and (3) Final Identification (FI). Each of these steps will now be introduced so that the algorithm implementation can be understood in its operating context.
`
0-8186-8159-4/97 $10.00 © 1997 IEEE
`
`192
`
`Petitioner Microsoft Corporation - Ex. 1011, p. 192
`
`
`
2.1 Focus of Attention (FOA)

Focus of attention is the first step of the ATR process and uses image morphology techniques to detect potential targets in SAR data. FOA operates on "down-sampled" SAR images that are approximately 600-1000 pixels on a side. Once FOA detects a potential target, it determines the approximate center of the potential target and creates 2x down-sampled subimages of the original SAR data where each subimage contains a single target centered within the subimage. These subimages are referred to as chips and are 128 x 128 pixels.

Figure 1: ATR Block Diagram. (FOA produces 2x down-sampled chips; Second Level Detection, the section implemented on SPLASH 2, processes them.)

2.2 Second Level Detection (SLD)

The SLD step processes the chips generated by the FOA step. SLD further restricts the areas of interest by giving the potential targets coordinates and angular orientation. SLD does this by correlating predefined binary templates to the areas of interest. The templates represent different object orientation angles. Templates are oriented between 5 and 10 degrees apart. SLD also uses adaptive threshold levels determined by the overall image intensity.

The algorithm studied in this paper is a variation of SLD called Chunky SLD. Chunky SLD adds a level of complexity to SLD by using more templates to represent objects that have been partially obscured (partially hidden by camouflage or objects overhead). This allows better target recognition at a cost of higher computational requirements. Chunky SLD is discussed in more detail later in this section.

2.3 Final Identification (FI)

The FI algorithm correlates full resolution image data and templates with finer angular resolution (3 to 5 degrees). FI also uses adaptive threshold levels. The output of FI is a location of the target, and a confidence level corresponding to the level of correlation between the object and the FI templates.

2.4 The Chunky SLD Algorithm

The general goal of the Chunky SLD algorithm is to recognize targets that are partially concealed or obscured in some way. To achieve this goal, the designers of this algorithm treat the target as a set of 40 template pairs where each pair of templates is a digital representation of some salient feature of the specific target. If the majority of the template pairs strongly correlate with the image data, then a match of the overall target is assumed. Each pair of templates consists of a Bright template and a Surround template. The Bright template is a representation of expected reflections directly from surfaces of a salient target feature, while the Surround template represents expected absorption in the immediate area surrounding the target feature. Each pair of a Bright and Surround template is referred to as a chunk, so called because each pair of templates represents a "chunk" of the overall target. Each set of 40 chunks represents a single target at a specific rotation. There are 72 orientations, each representing a different target orientation and radar incidence angle. Each set of 72 orientations is referred to as a class and is the complete set of templates that must be correlated with a chip to detect the presence of a specific target.

Figure 2: Chunky SLD.

Figure 3: Template Organization. (A class consists of 72 orientations; each orientation consists of 40 chunks, all the templates for one chunk forming a Bright/Surround pair.)

193

Petitioner Microsoft Corporation - Ex. 1011, p. 193

The first step of the Chunky SLD algorithm is to correlate the chip and the Bright template. This correlation value is used to compute a value that will be used to threshold the incoming chip data, converting the 8-bit chip image into a binary image. The equations describing this process are shown below.

                   15
  Shapesum(x,y) = SUM  Btemp(a,b) * Chip(x+a, y+b)
                 a,b=0

  Threshold(x,y) = Shapesum(x,y) / Bright_template_pixel_count

The values obtained by correlating the Bright and Surround templates with the binarized chip (Bsum and Ssum) are checked against minimum values to generate a "hit value" for each offset in the chip. The threshold value is also checked to see if it falls in an acceptable range when generating the hit values.

  if ([Tmax >= T >= Tmin] AND
      [Bsum >= Bmin] AND
      [Ssum >= Smin])
  then
      Hit = 1;                                 (1)
  else
      Hit = 0;

The hit values are accumulated for each offset for a specific orientation (40 chunks). The highest values are used to identify the areas of interest for the final identification step.
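To make the threshold and hit test concrete, the process above can be modeled in a few lines of software. This is an illustrative sketch only, not the hardware design: the function names are ours, and the assumption that a pixel binarizes to 1 when it is at least the threshold is our reading of the text.

```python
def shapesum(chip, btemp, x, y):
    """Correlate the Bright template with the 8-bit chip at offset (x, y)."""
    n = len(btemp)
    return sum(btemp[a][b] * chip[x + a][y + b]
               for a in range(n) for b in range(n))

def hit(chip, btemp, stemp, x, y, t_min, t_max, b_min, s_min):
    """Return 1 if offset (x, y) passes the three tests of Equation 1."""
    n = len(btemp)
    ones = sum(map(sum, btemp))              # Bright_template_pixel_count
    t = shapesum(chip, btemp, x, y) // ones  # adaptive threshold for this offset
    # Binarize the chip window with the per-offset threshold (assumed >= t).
    binary = [[1 if chip[x + a][y + b] >= t else 0 for b in range(n)]
              for a in range(n)]
    b_sum = sum(btemp[a][b] * binary[a][b] for a in range(n) for b in range(n))
    s_sum = sum(stemp[a][b] * binary[a][b] for a in range(n) for b in range(n))
    return 1 if (t_max >= t >= t_min and b_sum >= b_min and s_sum >= s_min) else 0
```

With mutually exclusive toy 2x2 Bright and Surround templates, an offset registers a hit only when the bright chip pixels line up with the Bright template and the threshold lands in its allowed range.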
`
2.4.1 Template Characteristics

Template pairs exhibit two useful properties: sparseness and mutual exclusivity. The Bright template consists mostly of zeros; only 3 to 10 percent of the template values are '1's, and this limits the magnitude of the Shapesum and Bsum values. The Bright and Surround templates are also mutually exclusive; that is, if the two templates are overlaid, no "on" pixels will overlap. When carefully exploited, both of these properties lead to more compact and higher performance hardware.
`
3 Other Implementations of Chunky-SLD

As explained, the ATR application is computationally demanding. There are (128-15) x (128-15) offsets per chunk x 40 chunks x 72 orientations, or roughly 36 x 10^6 hit values to compute per targeted object (or per class, see Figure 3). The computational rate and I/O requirements of this algorithm make it impossible to use current microprocessors. Thus any high-performance implementation of this algorithm will require special-purpose hardware to meet performance goals.
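The workload figure quoted above is easy to verify directly (variable names are ours):

```python
# One hit value per template offset, per chunk, per orientation.
offsets_per_chunk = (128 - 15) * (128 - 15)  # valid 16x16 template offsets in a 128x128 chip
chunks_per_orientation = 40
orientations_per_class = 72
hit_values = offsets_per_chunk * chunks_per_orientation * orientations_per_class
print(hit_values)  # 36774720, i.e. roughly 36 x 10^6 hit values per class
```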
`
`194
`
However, custom ASICs are also not an option because the algorithm is constantly evolving and also because commercial-off-the-shelf (COTS) components are often dictated by the ultimate customers of ATR systems. The only remaining options are to construct the system with commercially available fixed-function devices such as correlators, multipliers, etc., or to use programmable logic, e.g., FPGAs [1, 2]. Thus all known implementations of Chunky-SLD use either fixed-function devices or programmable logic.
3.1 Sandia

Current Sandia implementations of ATR are based on commercially available one-bit correlator chips. Sandia designers adapted the basic Chunky-SLD algorithm so they could exploit the capabilities of these components to achieve high performance. Rather than process the Shapesum and then process the final correlation, the two steps were done in parallel. The correlation was done at 8 discrete threshold levels, and the Shapesum determined which threshold to use for each offset.

Figure 4: Sandia's Implementation.
`
3.2 UCLA

A group at UCLA (headed by John Villasenor) [3] is working on an FPGA-based SLD implementation. By doing bit-level correlations they are able to build very compact adder trees that take advantage of template sparseness and of on-board FPGA lookup-table memory capability. Their approach compiles template information directly into the hardware and relies on fast reconfiguration to switch template information. They also try to take advantage of template overlap by computing the results of multiple correlations simultaneously.

Petitioner Microsoft Corporation - Ex. 1011, p. 194
`
4 Chunky-SLD on SPLASH 2

On SPLASH 2 a static implementation was done to avoid the overhead of reconfiguring the hardware during execution. In order to reduce the hardware requirements without reducing throughput, a deeply pipelined design (~400 cycle latency) was implemented. The Shapesum unit generates the threshold value, which is then used to generate the final correlation values. (Note: there is a unique threshold for each offset.) By doing this, only two final correlations have to be computed per offset (one Bsum and one Ssum).

The Sandia implementation computes the Shapesum and final correlation in parallel, which forces it to compute multiple final correlations, while our implementation does them serially. This allows us to use an exact threshold value. Also, only one final correlation needs to be computed because the threshold value is computed before the final correlation begins. The technique used was to look at the correlations by column, compute the partial correlation for that column, and sum up the partial sums for all 16 columns. In this method 16 different column correlations proceed in parallel, but only one column of data needs to be available for processing.
4.1 Implementing the Correlation as Column Sums

Figure 5 depicts a simple example that demonstrates a correlation of a 3x3 template with a binary image. Each row in the table represents one clock cycle. The first column is the clock cycle number; corresponding numbers are found in the pixel load order box at the right. A new pixel is brought in on each clock cycle. The P1, P2, and P3 columns represent the three column processing units needed for a three-column template. The last column represents the actual output. Clock cycles 9 through 12 have been expanded to show how data (pixels and partial sums) are passed from column to column and to illustrate the data format. Once the pipeline is full, a new correlation value is computed as each column arrives (every three pixels/cycles).

Note that valid output comes every three cycles because the template is three rows tall. All processing elements are actively processing pixel values at all times. The SPLASH 2 implementation works just like the example except for the size of the columns (16 pixels instead of three) and the data format (eight-bit instead of one-bit).
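The column-sum decomposition can be mimicked in software. The sketch below (illustrative names, 3x3 case as in Figure 5) computes one partial sum per template column and accumulates them left to right, which is exactly the quantity each column processor forwards to its neighbor:

```python
def column_sum_correlate(image, template):
    """Correlate template against image at each horizontal offset (row offset 0).

    Each template column k contributes one partial sum; summing the partial
    sums left to right mirrors how the column processors pass partial results.
    """
    h, w = len(template), len(template[0])
    results = []
    for x in range(len(image[0]) - w + 1):  # each placement of the template
        total = 0
        for k in range(w):                  # one partial sum per column PE
            partial = sum(template[r][k] * image[r][x + k] for r in range(h))
            total += partial                # forwarded to the next column
        results.append(total)
    return results
```

The hardware version differs in that all 16 (or here, 3) column partial sums are in flight simultaneously, one per processing element, rather than computed in sequence.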
`
Figure 5: Example Column Sum. (Pixel load order, the three column processors P1-P3, the partial sums passed between them, and the output produced every three clock cycles.)
`
4.2 Platforms/Implementations

Chunky SLD was implemented on the SPLASH 2 board. SPLASH 2 has shown itself to be a useful platform and has had numerous applications mapped to it [4, 5, 6, 7, 8, 9, 10, 11]. The implementation was done in VHDL and simulated/synthesized using Synopsys. All place and route was done automatically using Xilinx place and route tools.

One of the goals of the implementation was to run the system so that it consumed a pixel per cycle. This means that on each cycle all processing elements (PEs) need to be able to process a new pixel. This implementation follows Sandia National Labs' algorithm (not its implementation) as closely as possible (see Section 2.4).

The SPLASH 2 board was developed by SRC (Supercomputing Research Center, Institute for Defense Analyses) [12]. The SPLASH 2 board is a linear systolic array of processing elements (FPGAs), each with its own memory.
`
4.2.1 SPLASH 2 Hardware

From a conceptual point of view, the SPLASH 2 system consists of a linear array of processing elements. This makes SPLASH 2 a good candidate for linear-systolic applications with limited neighbor-to-neighbor interconnect. Because of its limited routing resources, SPLASH 2 has difficulty implementing multi-chip systems that are not linear systolic, though they are possible [8]. The actual SPLASH 2 platform consists of a board with 16 Xilinx 4010 chips (plus one for control) arranged in a linear systolic array. Each chip has a limited 36-bit connection to its two nearest neighbors. Each Xilinx 4010 is connected to a 512-kbyte memory (16-bit word size). The memory can handle back-to-back reads or back-to-back writes, but requires one 'dead' (or turn-around) cycle when changing from write to read. There is also a crossbar connected to all of the chips that allows some level of random connection between chips. Up to 16 boards can be daisy-chained together to provide a large linear-systolic array of 256 elements.

Figure 6: Single Processing Element of SPLASH 2. (A Xilinx 4010 with its local memory, left/right neighbor connections, and crossbar connection.)

195

Petitioner Microsoft Corporation - Ex. 1011, p. 195
`
4.3 ATR Implementation on SPLASH 2

Similar to the example, the SPLASH 2 implementation processes one pixel at a time and loads pixels in column order so that the partial sums can be generated and passed from column to column. All template data are stored in the memories adjacent to the FPGAs on the SPLASH 2 boards. Each memory can hold several thousand templates, making it possible to store all of the templates for a single class (5760) on a single SPLASH 2 board. There is sufficient room in the FPGA design to store a single template. The templates are switched by reading the new template data out of the memory and storing it within the FPGA. However, because this implementation is deeply pipelined, it is necessary to flush all current data from the system when switching to a new template. The overhead from the flushing operation is minimal (about 0.14% of the 231,424 compute cycles per chunk).

Figure 7: SPLASH 2 platform.

During each clock cycle, a new pixel arrives at the FPGA. If the template bit corresponding to this pixel is on, then the incoming pixel is added to the current partial sum. Every 16 clock cycles, this partial sum is passed on to the next column and a new partial sum is received from the previous column. The last column computes a complete Shapesum every 16 cycles (one column). The final correlation of the Bright and Surround templates with the thresholded chip data works similarly, except there are two correlations (one for each template).

Intermediate hit values are stored in a table, referred to as the hit-table, in one of the local memories. Each location in the table corresponds to an x-y offset of a chip, the origin of a single correlation. For each offset, if a chunk "hits", then the corresponding location in this table is incremented. Thus the table contains the accumulated hit values for all chunks and all offsets that have been computed to that point.
Hits are computed according to Equation 1. First, each Bsum and Ssum value is compared to its corresponding minimum value. Second, the threshold value corresponding to each Bsum and Ssum is checked to see if it is between a certain minimum and maximum value. For reasons of efficiency, the threshold value is actually examined earlier in the process, and a zero for the threshold is stored in lookup-table memory if it is out of bounds. This works correctly because if the threshold is zero, it will cause the Bsum to be zero, which will in turn cause the Bsum comparison to fail. Otherwise, if all three of these tests come back true, then a hit has been found for the corresponding offset (see Equation 1) and the corresponding location in the hit-table is incremented. After the 40 templates are tested against the same chip, the two offsets with the highest accumulated hit values are written into memory where the host computer can read them. This is accomplished by examining the hit values during this process and retaining the top two values in special memory locations. These final two hit values (which represent the top two orientations for a specific class) are used in the FI step.

196

Petitioner Microsoft Corporation - Ex. 1011, p. 196

For the SPLASH 2 board, as with most FPGA systems, partitioning is a major issue. The design needed to be modular so that different design modules could be reassigned to different FPGAs as necessary. This is where the column modules were so valuable (see Figure 8).
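The hit-table bookkeeping described above can be sketched in a few lines. This is an illustrative software model (names ours), not the memory layout used on the board:

```python
def accumulate_hits(per_chunk_hits):
    """per_chunk_hits: one {(x, y): 0 or 1} dict per chunk of an orientation.

    Returns the accumulated hit-table and the two offsets with the highest
    accumulated hit values, mirroring the special memory locations the
    hardware reserves for the top two results.
    """
    table = {}
    for chunk in per_chunk_hits:
        for offset, h in chunk.items():
            table[offset] = table.get(offset, 0) + h  # increment on a hit
    top_two = sorted(table.items(), key=lambda kv: kv[1], reverse=True)[:2]
    return table, top_two
```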
`
4.4 Special Features

This implementation has several notable characteristics, including distributed control, modular design for partitioning, and memory utilization.
`
4.4.1 Distributed Control

The control in this system is distributed throughout the array. Each column module has its own state-machine-based control. Module synchronization is achieved by distributing a control token through the pipeline along with the data. When a module receives this signal, it resets its internal state machines and retrieves template data from its local memory. A memory controller resides in each processing element (FPGA) to retrieve template data and give memory access to all other modules.
`
4.4.2 Modular Design (Design for Partitioning)

Each column in both the Shapesum and final correlation uses totally self-contained modules that can be easily migrated from processing element to processing element. This was done to simplify the partitioning onto SPLASH 2 [13]. Memory data had to be carefully partitioned as well, so that the data could follow the module to which it applied. The regularity of the design was an important concern; it allowed the placement of specific circuit modules to be dictated by the requirements of the algorithm and not by the limited interconnect of the platform. There are 16 identical modules in the Shapesum and 16 more identical modules in the final correlation. Along with these there is a divide, a hit accumulator, and 2 delay modules (see Figure 8).
`
4.4.3 Memory Usage

The memories in SPLASH 2 serve several purposes. The template information is stored in them. They are also used to implement video shift registers that correct for the latency incurred during threshold computation. These shift registers require that two memories be used in tandem, because every clock cycle a new pixel (8 bits) has to be written to memory and a delayed pixel has to be read from memory. The bandwidth of one memory is such that it can handle two pixels (load and store) every three cycles. Thus one memory would delay two pixels and skip two pixels, while the other memory would delay the two pixels that the first memory skipped and skip the two pixels that the first memory delayed. The divide unit and the final result, including the accumulated hit values, are also stored in memory.
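The tandem-memory shift register can be modeled abstractly as two memories that alternate in groups of two pixels, so each memory sees only the accesses its bandwidth allows. The group-of-two interleaving shown is our reading of the description, not the exact controller timing:

```python
def tandem_delay(pixels, delay):
    """Delay a pixel stream using two memories that alternate in pairs.

    Memory A stores pixels 0,1, skips 2,3, stores 4,5, ...; memory B stores
    the pixels A skips. Each memory therefore handles only two of every four
    pixels, within its two-accesses-per-three-cycles bandwidth limit.
    """
    mem_a = [p for i, p in enumerate(pixels) if (i // 2) % 2 == 0]
    mem_b = [p for i, p in enumerate(pixels) if (i // 2) % 2 == 1]
    # Reading the two memories back in the same interleaved order reproduces
    # the original stream; the first `delay` outputs are not yet valid.
    merged, ia, ib = [], 0, 0
    for i in range(len(pixels)):
        if (i // 2) % 2 == 0:
            merged.append(mem_a[ia]); ia += 1
        else:
            merged.append(mem_b[ib]); ib += 1
    return [None] * delay + merged[: len(pixels) - delay]
```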
`
4.5 Performance Metrics

There are many metrics that could be used to measure the value of this implementation. This section is devoted to discussing some of them.
`
4.5.1 Performance

This implementation runs at a frequency of 19 MHz using a test template that exercises all column modules. xdelay (a Xilinx timing tool) reports a guaranteed frequency of 13.2 MHz. Designs that run in the 10 to 20 MHz range are typical [8, 10].

Using the above frequencies, a single system could process one orientation every .487 seconds (.701 seconds using the 13.2 MHz clock):

  128 col x (128-15) rows x 16 cycles/col x 40 chunks/orientation
  --------------------------------------------------------------- =
                             19 MHz
  (12.2 ms/chunk) x 40 chunks/orientation = 487 ms @ 19 MHz

  128 col x (128-15) rows x 16 cycles/col x 40 chunks/orientation
  --------------------------------------------------------------- =
                            13.2 MHz
  (17.5 ms/chunk) x 40 chunks/orientation = 701 ms @ 13.2 MHz
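The timing arithmetic above can be checked directly (cycle counts taken from the expressions in the text):

```python
# Cycles per orientation divided by clock frequency gives processing time.
cycles_per_chunk = 128 * (128 - 15) * 16        # 128 cols x 113 rows x 16 cycles/col
cycles_per_orientation = cycles_per_chunk * 40  # 40 chunks per orientation
print(round(cycles_per_orientation / 19.0e6, 3))   # 0.487 s at 19 MHz
print(round(cycles_per_orientation / 13.2e6, 3))   # 0.701 s at 13.2 MHz
```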
`
`
`
197

Petitioner Microsoft Corporation - Ex. 1011, p. 197
`
`
`
`
`
Figure 8: SPLASH 2 Implementation. (The Shapesum stage feeds the final correlation stage.)
`
Using a brute-force approach, it takes 59 seconds to compute the same data on an HP 770 (110 MHz); that is two orders of magnitude slower. On one SPLASH 2 board two processing units can be installed, yielding the ability to process 1 orientation (40 chunks) every .244 seconds (.350 seconds at 13.2 MHz).
`
4.5.2 Multi-board SPLASH 2

If a 16-board system were used, then 256 processing elements would be available. This would make 42 processing units (256 PEs at 6 PEs per PU) and 42x the throughput. Thus one orientation can be processed every 11.6 ms (16.7 ms at 13.2 MHz). This is equal
`to coveri