IEEE Symposium on
FPGAs for Custom Computing Machines

April 16-18, 1997
Napa Valley, California

Edited by Kenneth L. Pocek and Jeffrey Arnold

Sponsored by the IEEE Computer Society Technical Committee on Computer Architecture

IEEE Computer Society
`
`Petitioner Microsoft Corporation - Ex. 1011, Cover 1
`
`
`
`PROCEEDINGS
`
`The 5th Annual IEEE Symposium on
Field-Programmable
`Custom Computing Machines
`
April 16-18, 1997
`
`Napa Valley, California
`
`Sponsored by
`
`IEEE Computer Society
`IEEE Computer Society Technical Committee on Computer Architecture
`
IEEE Computer Society
`
`Los Alamitos, California
`
Washington • Brussels • Tokyo
`
`Petitioner Microsoft Corporation - Ex. 1011, Cover 2
`
`
`
Copyright © 1997 by The Institute of Electrical and Electronics Engineers, Inc.
All rights reserved.

Copyright and Reprint Permissions: Abstracting is permitted with credit to the source. Libraries may photocopy beyond the limits of US copyright law, for private use of patrons, those articles in this volume that carry a code at the bottom of the first page, provided that the per-copy fee indicated in the code is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923.

Other copying, reprint, or republication requests should be addressed to: IEEE Copyrights Manager, IEEE Service Center, 445 Hoes Lane, P.O. Box 1331, Piscataway, NJ 08855-1331.

The papers in this book comprise the proceedings of the meeting mentioned on the cover and title page. They reflect the authors' opinions and, in the interests of timely dissemination, are published as presented and without change. Their inclusion in this publication does not necessarily constitute endorsement by the editors, the IEEE Computer Society, or the Institute of Electrical and Electronics Engineers, Inc.
`
IEEE Computer Society Order Number PR08159
ISBN 0-8186-8159-4
ISBN 0-8186-8160-8 (case)
ISBN 0-8186-8161-6 (microfiche)
IEEE Order Plan Catalog Number 97TB100186
ISSN 1082-3409
`
IEEE Computer Society
Customer Service Center
10662 Los Vaqueros Circle
P.O. Box 3014
Los Alamitos, CA 90720-1314
Tel: +1-714-821-8380
Fax: +1-714-821-4641
E-mail: cs.books@computer.org

Additional copies may be ordered from:

IEEE Service Center
445 Hoes Lane
P.O. Box 1331
Piscataway, NJ 08855-1331
Tel: +1-908-981-1393
Fax: +1-908-981-9667
mis.custserv@computer.org

IEEE Computer Society
13, Avenue de l'Aquilon
B-1200 Brussels
BELGIUM
Tel: +32-2-770-2198
Fax: +32-2-770-8505
euro.ofc@computer.org

IEEE Computer Society
Ooshima Building
2-19-1 Minami-Aoyama
Minato-ku, Tokyo 107
JAPAN
Tel: +81-3-3408-3118
Fax: +81-3-3408-3553
tokyo.ofc@computer.org
`
`Editorial production by Bob Werner
`
Cover art production by Joe Daigle/Studio Productions
`
`Printed in the United States of America by Technical Communication Services
`
`
`Petitioner Microsoft Corporation - Ex. 1011, Cover 3
`
`
`
Table of Contents

Symposium on Field-Programmable Custom Computing Machines - FCCM '97

Introduction ..... ix
Program Committee ..... x

Session 1: Device Architecture

An FPGA Architecture for DRAM-Based Systolic Computations ..... 2
   N. Margolus

Garp: A MIPS Processor with a Reconfigurable Coprocessor ..... 12
   J. Hauser, J. Wawrzynek

A Time-Multiplexed FPGA ..... 22
   S. Trimberger, D. Carberry, A. Johnson, J. Wong

Session 2: Communication Applications

An FPGA-Based Coprocessor for ATM Firewalls ..... 30
   J. McHenry, P. Dowd, T. Carrozzi, F. Pellegrino, W. Cocks

A Wireless LAN Demodulator in a Pamette: Design and Experience ..... 40
   T. McDermott, P. Ryan, M. Shand, D. Skellern, T. Percival, N. Weste

Session 3: Run Time Reconfiguration

Incremental Reconfiguration for Pipelined Applications ..... 47
   H. Schmit

Compilation Tools for Run-Time Reconfigurable Designs ..... 56
   W. Luk, N. Shirazi, P. Cheung

A Dynamic Reconfiguration Run-Time System ..... 66
   J. Burns, A. Donlin, J. Hogg, S. Singh, M. de Wit

Session 4: Architectures for Run Time Reconfiguration

The Swappable Logic Unit: A Paradigm for Virtual Hardware ..... 77
   G. Brebner

Petitioner Microsoft Corporation - Ex. 1011, ToC i
The Chimaera Reconfigurable Functional Unit ..... 87
   S. Hauck, T. Fry, M. Hosler, J. Kao

Session 5: Architecture

Computing Kernels Implemented with a Wormhole RTR CCM ..... 98
   R. Bittner Jr., P. Athanas

Mapping Applications to the RaPiD Configurable Architecture ..... 106
   C. Ebeling, D. Cronquist, P. Franklin, J. Secosky, S. Berg

Defect Tolerance on the Teramac Custom Computer ..... 116
   B. Culbertson, R. Amerson, R. Carter, P. Kuekes, G. Snider

Session 6: Performance

Systems Performance Measurement on PCI Pamette ..... 125
   L. Moll, M. Shand

The RAW Benchmark Suite: Computation Structures for General Purpose Computing ..... 134
   J. Babb, M. Frank, V. Lee, E. Waingold, R. Barua, M. Taylor, J. Kim, S. Devabhaktuni, A. Agarwal

Session 7: Software Tools

Automated Field-Programmable Compute Accelerator Design using Partial Evaluation ..... 145
   Q. Wang, D. Lewis

FPGA Synthesis on the XC6200 using IRIS and Trianus/Hades (Or, from Heaven to Hell and back again) ..... 155
   R. Woods, S. Ludwig, J. Heron, D. Trainor, S. Gehring

High Level Compilation for Fine Grained FPGAs ..... 165
   M. Gokhale, E. Gomersall

Session 8: CAD Applications

Acceleration of an FPGA Router ..... 175
   P. Chan, M. Schlag

Fault Simulation on Reconfigurable Hardware ..... 182
   M. Abramovici, P. Menon

Petitioner Microsoft Corporation - Ex. 1011, ToC ii

vi
`
`
`
Real-Time Stereo Vision on the PARTS Reconfigurable Computer ..... 201
   J. Woodfill, B. Von Herzen

Increased FPGA Capacity Enables Scalable, Flexible CCMs: An Example from Image Processing ..... 211
   J. Greenbaum, M. Baxter

Comparison of Arithmetic Architectures for Reed-Solomon Decoders in Reconfigurable Hardware ..... 219
   C. Paar, M. Rosner

Implementation of Single Precision Floating Point Square Root on FPGAs ..... 226
   Y. Li, W. Chu

Poster Papers

Oriented FPGA Mapping and Placement for Configurable Computing

A Parallel Hardware Evolvable Computer POLYP ..... 238
   U. Tangen, L. Schulte, J. McCaskill

Laser Defect Correction Applications to FPGA Based Custom Computers ..... 240
   G. Chapman, B. Dufort

Speech Recognition HMM Training on Reconfigurable Parallel Processor ..... 242
   H. Yun, A. Smith, H. Silverman

Efficient Implementation of the DCT on Custom Computers ..... 244
   N. Bergmann, Y. Chung, B. Gunther

On Acceleration of the Check Tautology Logic Synthesis Algorithm using an FPGA-based Reconfigurable Coprocessor ..... 246
   J. Cong, J. Peck

Index of Authors ..... 249

vii

Petitioner Microsoft Corporation - Ex. 1011, ToC iii
`
`
`
This material may be protected by Copyright law (Title 17 U.S. Code)
`
Automated Target Recognition on SPLASH 2*
`
Michael Rencher and Brad L. Hutchings
Department of Electrical and Computer Engineering
Brigham Young University
Provo, UT 84602
`
Abstract

Automated target recognition is an application area that requires special-purpose hardware to achieve reasonable performance. FPGA-based platforms can provide a high level of performance for ATR systems if the implementation can be adapted to the limited FPGA and routing resources of these architectures. This paper discusses a mapping experiment where a linear-systolic implementation of an ATR algorithm is mapped to the SPLASH 2 platform. Simple column-oriented processors were used throughout the design to achieve high performance with limited nearest-neighbor communication. The distributed SPLASH 2 memories are also exploited to achieve a high degree of parallelism. The resulting design is scalable and can be spread across multiple SPLASH 2 boards with a linear increase in performance.
`
1 Introduction
`
Automated target recognition (ATR) is a computationally demanding application area that typically requires special-purpose hardware to achieve desirable performance. ASICs are not an option for these systems due to high non-recurring engineering (NRE) costs and because the algorithms are constantly evolving. Existing FPGA-based computing platforms can potentially provide the necessary performance and flexibility for evolving ATR systems; however, mapping applications to these existing platforms can be very challenging because they lack abundant interconnect and FPGA resources. The key to achieving a high-performance implementation of ATR algorithms with existing platforms is to carefully organize the design of the ATR implementation so that it can communicate via the limited interconnect and can be easily partitioned among the FPGA devices.
This paper presents a linear systolic implementation of an existing ATR algorithm on SPLASH 2 that is well-suited to the SPLASH 2 architecture. Inter-FPGA communication is limited and easily accommodated by the SPLASH 2 interconnect. Moreover, the implementation can be scaled across any number of SPLASH 2 boards and achieves high performance with limited resources.

This paper briefly discusses the entire ATR algorithm as developed by Sandia National Labs, and then overviews the design and implementation of the most computationally demanding part of the algorithm: Chunky SLD. The SPLASH 2 implementation is presented in some detail, with future directions and possible improvements.

*This work was supported by DARPA/CSTO under contract number DABT63-94-C-0085 under a subcontract to National Semiconductor.
`
2 Automatic Target Recognition

The goal of a typical ATR system is to analyze a digital representation of a scene and locate/identify objects that are of interest. Although this goal is conceptually simple, ATR systems have extremely demanding I/O and computational requirements: image data are large, can be generated in real-time, and must be processed quickly so that results remain relevant in a dynamic environment. The common use of special-purpose hardware in nearly all high-performance ATR systems is a clear indication of the computational complexity of these systems.

This paper details the implementation of an existing ATR algorithm on SPLASH 2. The algorithm in question was developed at Sandia National Laboratories and was designed to detect partially obscured targets in Synthetic Aperture Radar (SAR) images. It is commonly referred to as Chunky SLD, so named for the second step of the algorithm that differentiates this algorithm from others developed at Sandia. This algorithm consists of the following three steps: (1) Focus of Attention (FOA), (2) Second-Level Detection (SLD), and (3) Final Identification (FI). Each of these steps will now be introduced so that the algorithm implementation can be understood in its operating context.
`
0-8186-8159-4/97 $10.00 © 1997 IEEE
`
`192
`
`Petitioner Microsoft Corporation - Ex. 1011, p. 192
`
`
`
2.1 Focus of Attention (FOA)

Focus of attention is the first step of the ATR process and uses image morphology techniques to detect potential targets in SAR data. FOA operates on "down-sampled" SAR images that are approximately 600-1000 pixels on a side. Once FOA detects a potential target, it determines the approximate center of the potential target and creates 2x down-sampled subimages of the original SAR data where each subimage contains a single target centered within the subimage. These subimages are referred to as chips and are 128 x 128 pixels.

Figure 1: ATR Block Diagram. (FOA produces 2x down-sampled chips; Second Level Detection, the section implemented on SPLASH 2, processes them.)

2.2 Second Level Detection (SLD)

The SLD step processes the chips generated by the FOA step. SLD further restricts the areas of interest by giving the potential targets coordinates and angular orientation. SLD does this by correlating predefined binary templates to the areas of interest. The templates represent different object orientation angles. Templates are oriented between 5 and 10 degrees apart. SLD also uses adaptive threshold levels determined by the overall image intensity.

The algorithm studied in this paper is a variation of SLD called Chunky SLD. Chunky SLD adds a level of complexity to SLD by using more templates to represent objects that have been partially obscured (partially hidden by camouflage or objects overhead). This allows better target recognition at a cost of higher computational requirements. Chunky SLD is discussed in more detail later in this section.

2.3 Final Identification (FI)

The FI algorithm correlates full resolution image data and templates with finer angular resolution (3 to 5 degrees). FI also uses adaptive threshold levels. The output of FI is a location of the target, and a confidence level corresponding to the level of correlation between the object and the FI templates.

2.4 The Chunky SLD Algorithm

The general goal of the Chunky SLD algorithm is to recognize targets that are partially concealed or obscured in some way. To achieve this goal, the designers of this algorithm treat the target as a set of 40 template pairs where each pair of templates is a digital representation of some salient feature of the specific target. If the majority of the template pairs strongly correlate with the image data, then a match of the overall target is assumed. Each pair of templates consists of a Bright template and a Surround template. The Bright template is a representation of expected reflections directly from surfaces of a salient target feature, while the Surround template represents expected absorption in the immediate area surrounding the target feature. Each pair of a Bright and Surround template is referred to as a chunk, so called because each pair of templates represents a "chunk" of the overall target. Each set of 40 chunks represents a single target at a specific rotation. There are 72 orientations, each representing a different target orientation and radar incidence angle. Each set of 72 orientations is referred to as a class and is the complete set of templates that must be correlated with a chip to detect the presence of a specific target.

Figure 2: Chunky SLD.

Figure 3: Template Organization. (A class consists of 72 orientations; each orientation consists of 40 chunks, all the templates for one chunk forming a Bright/Surround pair.)

193

Petitioner Microsoft Corporation - Ex. 1011, p. 193

The first step of the Chunky SLD algorithm is to correlate the chip and the Bright template. This correlation value is used to compute a value that will be used to threshold the incoming chip data, converting the 8-bit chip image into a binary image. The equations describing this process are shown below.

                   15
  Shapesum(x,y) = SUM  Btemp(a,b) * Chip(x+a, y+b)
                 a,b=0

  Threshold(x,y) = Shapesum(x,y) / Bright_template_pixel_count

The values obtained by correlating the Bright and Surround templates with the binarized chip (Bsum and Ssum) are checked against minimum values to generate a "hit value" for each offset in the chip. The threshold value is also checked to see if it falls in an acceptable range when generating the hit values.

  if ([Tmax >= T >= Tmin] AND
      [Bsum >= Bmin] AND
      [Ssum >= Smin])
  then
      Hit = 1;                                 (1)
  else
      Hit = 0;

The hit values are accumulated for each offset for a specific orientation (40 chunks). The highest values are used to identify the areas of interest for the final identification step.
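To make the threshold and hit test concrete, the process above can be modeled in a few lines of software. This is an illustrative sketch only, not the hardware design: the function names are ours, and the assumption that a pixel binarizes to 1 when it is at least the threshold is our reading of the text.

```python
def shapesum(chip, btemp, x, y):
    """Correlate the Bright template with the 8-bit chip at offset (x, y)."""
    n = len(btemp)
    return sum(btemp[a][b] * chip[x + a][y + b]
               for a in range(n) for b in range(n))

def hit(chip, btemp, stemp, x, y, t_min, t_max, b_min, s_min):
    """Return 1 if offset (x, y) passes the three tests of Equation 1."""
    n = len(btemp)
    ones = sum(map(sum, btemp))              # Bright_template_pixel_count
    t = shapesum(chip, btemp, x, y) // ones  # adaptive threshold for this offset
    # Binarize the chip window with the per-offset threshold (assumed >= t).
    binary = [[1 if chip[x + a][y + b] >= t else 0 for b in range(n)]
              for a in range(n)]
    b_sum = sum(btemp[a][b] * binary[a][b] for a in range(n) for b in range(n))
    s_sum = sum(stemp[a][b] * binary[a][b] for a in range(n) for b in range(n))
    return 1 if (t_max >= t >= t_min and b_sum >= b_min and s_sum >= s_min) else 0
```

With mutually exclusive toy 2x2 Bright and Surround templates, an offset registers a hit only when the bright chip pixels line up with the Bright template and the threshold lands in its allowed range.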
`
2.4.1 Template Characteristics

Template pairs exhibit two useful properties: sparseness and mutual exclusivity. The Bright template consists mostly of zeros; only 3 to 10 percent of the template values are '1's, and this limits the magnitude of the Shapesum and Bsum values. The Bright and Surround templates are also mutually exclusive; that is, if the two templates are overlaid, no "on" pixels will overlap. When carefully exploited, both of these properties lead to more compact and higher performance hardware.
`
3 Other Implementations of Chunky-SLD

As explained, the ATR application is computationally demanding. There are (128-15) x (128-15) offsets per chunk x 40 chunks x 72 orientations, or roughly 36 x 10^6 hit values to compute per targeted object (or per class, see Figure 3). The computational rate and I/O requirements of this algorithm make it impossible to use current microprocessors. Thus any high-performance implementation of this algorithm will require special-purpose hardware to meet performance goals.
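The workload figure quoted above is easy to verify directly (variable names are ours):

```python
# One hit value per template offset, per chunk, per orientation.
offsets_per_chunk = (128 - 15) * (128 - 15)  # valid 16x16 template offsets in a 128x128 chip
chunks_per_orientation = 40
orientations_per_class = 72
hit_values = offsets_per_chunk * chunks_per_orientation * orientations_per_class
print(hit_values)  # 36774720, i.e. roughly 36 x 10^6 hit values per class
```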
`
`194
`
However, custom ASICs are also not an option because the algorithm is constantly evolving and also because commercial-off-the-shelf (COTS) components are often dictated by the ultimate customers of ATR systems. The only remaining options are to construct the system with commercially available fixed-function devices such as correlators, multipliers, etc., or to use programmable logic, e.g., FPGAs [1, 2]. Thus all known implementations of Chunky-SLD use either fixed-function devices or programmable logic.
3.1 Sandia

Current Sandia implementations of ATR are based on commercially available one-bit correlator chips. Sandia designers adapted the basic Chunky-SLD algorithm so they could exploit the capabilities of these components to achieve high performance. Rather than process the Shapesum and then process the final correlation, the two steps were done in parallel. The correlation was done at 8 discrete threshold levels, and the Shapesum determined which threshold to use for each offset.

Figure 4: Sandia's Implementation.
`
3.2 UCLA

A group at UCLA (headed by John Villasenor) [3] is working on an FPGA-based SLD implementation. By doing bit-level correlations they are able to build very compact adder trees that take advantage of template sparseness and of on-board FPGA lookup-table memory capability. Their approach compiles template information directly into the hardware and relies on fast reconfiguration to switch template information. They also try to take advantage of template overlap by computing the results of multiple correlations simultaneously.

Petitioner Microsoft Corporation - Ex. 1011, p. 194
`
4 Chunky-SLD on SPLASH 2

On SPLASH 2 a static implementation was done to avoid the overhead of reconfiguring the hardware during execution. In order to reduce the hardware requirements without reducing throughput, a deeply pipelined design (~400 cycle latency) was implemented. The Shapesum unit generates the threshold value, which is then used to generate the final correlation values. (Note: there is a unique threshold for each offset.) By doing this, only two final correlations have to be computed per offset (one Bsum and one Ssum).

The Sandia implementation computes the Shapesum and final correlation in parallel, which forces it to compute multiple final correlations, while our implementation does them serially. This allows us to use an exact threshold value. Also, only one final correlation needs to be computed because the threshold value is computed before the final correlation begins. The technique used was to look at the correlations by column, compute the partial correlation for that column, and sum up the partial sums for all 16 columns. In this method 16 different column correlations proceed in parallel, but only one column of data needs to be available for processing.
4.1 Implementing the Correlation as Column Sums

Figure 5 depicts a simple example that demonstrates a correlation of a 3x3 template with a binary image. Each row in the table represents one clock cycle. The first column is the clock cycle number; corresponding numbers are found in the pixel load order box at the right. A new pixel is brought in on each clock cycle. The P1, P2, and P3 columns represent the three column processing units needed for a three-column template. The last column represents the actual output. Clock cycles 9 through 12 have been expanded to show how data (pixels and partial sums) are passed from column to column and to illustrate the data format. Once the pipeline is full, a new correlation value is computed as each column arrives (every three pixels/cycles).

Note that valid output comes every three cycles because the template is three rows tall. All processing elements are actively processing pixel values at all times. The SPLASH 2 implementation works just like the example except for the size of the columns (16 pixels instead of three) and the data format (eight-bit instead of one-bit).
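The column-sum decomposition can be mimicked in software. The sketch below (illustrative names, 3x3 case as in Figure 5) computes one partial sum per template column and accumulates them left to right, which is exactly the quantity each column processor forwards to its neighbor:

```python
def column_sum_correlate(image, template):
    """Correlate template against image at each horizontal offset (row offset 0).

    Each template column k contributes one partial sum; summing the partial
    sums left to right mirrors how the column processors pass partial results.
    """
    h, w = len(template), len(template[0])
    results = []
    for x in range(len(image[0]) - w + 1):  # each placement of the template
        total = 0
        for k in range(w):                  # one partial sum per column PE
            partial = sum(template[r][k] * image[r][x + k] for r in range(h))
            total += partial                # forwarded to the next column
        results.append(total)
    return results
```

The hardware version differs in that all 16 (or here, 3) column partial sums are in flight simultaneously, one per processing element, rather than computed in sequence.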
`
Figure 5: Example Column Sum. (Pixel load order, the three column processors P1-P3, the partial sums passed between them, and the output produced every three clock cycles.)
`
4.2 Platforms/Implementations

Chunky SLD was implemented on the SPLASH 2 board. SPLASH 2 has shown itself to be a useful platform and has had numerous applications mapped to it [4, 5, 6, 7, 8, 9, 10, 11]. The implementation was done in VHDL and simulated/synthesized using Synopsys. All place and route was done automatically using Xilinx place and route tools.

One of the goals of the implementation was to run the system so that it consumed a pixel per cycle. This means that on each cycle all processing elements (PEs) need to be able to process a new pixel. This implementation follows Sandia National Labs' algorithm (not its implementation) as closely as possible (see Section 2.4).

The SPLASH 2 board was developed by SRC (Supercomputing Research Center, Institute for Defense Analyses) [12]. The SPLASH 2 board is a linear systolic array of processing elements (FPGAs), each with its own memory.
`
4.2.1 SPLASH 2 Hardware

From a conceptual point of view, the SPLASH 2 system consists of a linear array of processing elements. This makes SPLASH 2 a good candidate for linear-systolic applications with limited neighbor-to-neighbor interconnect. Because of its limited routing resources, SPLASH 2 has difficulty implementing multi-chip systems that are not linear systolic, though they are possible [8]. The actual SPLASH 2 platform consists of a board with 16 Xilinx 4010 chips (plus one for control) arranged in a linear systolic array. Each chip has a limited 36-bit connection to its two nearest neighbors. Each Xilinx 4010 is connected to a 512-kbyte memory (16-bit word size). The memory can handle back-to-back reads or back-to-back writes, but requires one 'dead' (or turn-around) cycle when changing from write to read. There is also a crossbar connected to all of the chips that allows some level of random connection between chips. Up to 16 boards can be daisy-chained together to provide a large linear-systolic array of 256 elements.

Figure 6: Single Processing Element of SPLASH 2. (A Xilinx 4010 with its local memory, left/right neighbor connections, and crossbar connection.)

195

Petitioner Microsoft Corporation - Ex. 1011, p. 195
`
4.3 ATR Implementation on SPLASH 2

Similar to the example, the SPLASH 2 implementation processes one pixel at a time and loads pixels in column order so that the partial sums can be generated and passed from column to column. All template data are stored in the memories adjacent to the FPGAs on the SPLASH 2 boards. Each memory can hold several thousand templates, making it possible to store all of the templates for a single class (5760) on a single SPLASH 2 board. There is sufficient room in the FPGA design to store a single template. The templates are switched by reading the new template data out of the memory and storing it within the FPGA. However, because this implementation is deeply pipelined, it is necessary to flush all current data from the system when switching to a new template. The overhead from the flushing operation is minimal (about 0.14% of the 231,424 compute cycles per chunk).

Figure 7: SPLASH 2 platform.

During each clock cycle, a new pixel arrives at the FPGA. If the template bit corresponding to this pixel is on, then the incoming pixel is added to the current partial sum. Every 16 clock cycles, this partial sum is passed on to the next column and a new partial sum is received from the previous column. The last column computes a complete Shapesum every 16 cycles (one column). The final correlation of the Bright and Surround templates with the thresholded chip data works similarly, except there are two correlations (one for each template).

Intermediate hit values are stored in a table, referred to as the hit-table, in one of the local memories. Each location in the table corresponds to an x-y offset of a chip, the origin of a single correlation. For each offset, if a chunk "hits", then the corresponding location in this table is incremented. Thus the table contains the accumulated hit values for all chunks and all offsets that have been computed to that point.
Hits are computed according to Equation 1. First, each Bsum and Ssum value is compared to its corresponding minimum value. Second, the threshold value corresponding to each Bsum and Ssum is checked to see if it is between a certain minimum and maximum value. For reasons of efficiency, the threshold value is actually examined earlier in the process, and a zero for the threshold is stored in lookup-table memory if it is out of bounds. This works correctly because if the threshold is zero, it will cause the Bsum to be zero, which will in turn cause the Bsum comparison to fail. Otherwise, if all three of these tests come back true, then a hit has been found for the corresponding offset (see Equation 1) and the corresponding location in the hit-table is incremented. After the 40 templates are tested against the same chip, the two offsets with the highest accumulated hit values are written into memory where the host computer can read them. This is accomplished by examining the hit values during this process and retaining the top two values in special memory locations. These final two hit values (which represent the top two orientations for a specific class) are used in the FI step.

196

Petitioner Microsoft Corporation - Ex. 1011, p. 196

For the SPLASH 2 board, as with most FPGA systems, partitioning is a major issue. The design needed to be modular so that different design modules could be reassigned to different FPGAs as necessary. This is where the column modules were so valuable (see Figure 8).
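The hit-table bookkeeping described above can be sketched in a few lines. This is an illustrative software model (names ours), not the memory layout used on the board:

```python
def accumulate_hits(per_chunk_hits):
    """per_chunk_hits: one {(x, y): 0 or 1} dict per chunk of an orientation.

    Returns the accumulated hit-table and the two offsets with the highest
    accumulated hit values, mirroring the special memory locations the
    hardware reserves for the top two results.
    """
    table = {}
    for chunk in per_chunk_hits:
        for offset, h in chunk.items():
            table[offset] = table.get(offset, 0) + h  # increment on a hit
    top_two = sorted(table.items(), key=lambda kv: kv[1], reverse=True)[:2]
    return table, top_two
```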
`
4.4 Special Features

This implementation has several notable characteristics, including distributed control, modular design for partitioning, and memory utilization.
`
4.4.1 Distributed Control

The control in this system is distributed throughout the array. Each column module has its own state-machine-based control. Module synchronization is achieved by distributing a control token through the pipeline along with the data. When a module receives this signal, it resets its internal state machines and retrieves template data from its local memory. A memory controller resides in each processing element (FPGA) to retrieve template data and give memory access to all other modules.
`
4.4.2 Modular Design (Design for Partitioning)

Each column in both the Shapesum and final correlation uses totally self-contained modules that can be easily migrated from processing element to processing element. This was done to simplify the partitioning onto SPLASH 2 [13]. Memory data had to be carefully partitioned as well, so that the data could follow the module to which it applied. The regularity of the design was an important concern; it allowed the placement of specific circuit modules to be dictated by the requirements of the algorithm and not by the limited interconnect of the platform. There are 16 identical modules in the Shapesum and 16 more identical modules in the final correlation. Along with these there is a divide, a hit accumulator, and 2 delay modules (see Figure 8).
`
4.4.3 Memory Usage

The memories in SPLASH 2 serve several purposes. The template information is stored in them. They are also used to implement video shift registers that correct for the latency incurred during threshold computation. These shift registers require that two memories be used in tandem, because every clock cycle a new pixel (8 bits) has to be written to memory and a delayed pixel has to be read from memory. The bandwidth of one memory is such that it can handle two pixels (load and store) every three cycles. Thus one memory would delay two pixels and skip two pixels, while the other memory would delay the two pixels that the first memory skipped and skip the two pixels that the first memory delayed. The divide unit and the final result, including the accumulated hit values, are also stored in memory.
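The tandem-memory shift register can be modeled abstractly as two memories that alternate in groups of two pixels, so each memory sees only the accesses its bandwidth allows. The group-of-two interleaving shown is our reading of the description, not the exact controller timing:

```python
def tandem_delay(pixels, delay):
    """Delay a pixel stream using two memories that alternate in pairs.

    Memory A stores pixels 0,1, skips 2,3, stores 4,5, ...; memory B stores
    the pixels A skips. Each memory therefore handles only two of every four
    pixels, within its two-accesses-per-three-cycles bandwidth limit.
    """
    mem_a = [p for i, p in enumerate(pixels) if (i // 2) % 2 == 0]
    mem_b = [p for i, p in enumerate(pixels) if (i // 2) % 2 == 1]
    # Reading the two memories back in the same interleaved order reproduces
    # the original stream; the first `delay` outputs are not yet valid.
    merged, ia, ib = [], 0, 0
    for i in range(len(pixels)):
        if (i // 2) % 2 == 0:
            merged.append(mem_a[ia]); ia += 1
        else:
            merged.append(mem_b[ib]); ib += 1
    return [None] * delay + merged[: len(pixels) - delay]
```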
`
4.5 Performance Metrics

There are many metrics that could be used to measure the value of this implementation. This section is devoted to discussing some of them.
`
4.5.1 Performance

This implementation runs at a frequency of 19 MHz using a test template that exercises all column modules. xdelay (a Xilinx timing tool) reports a guaranteed frequency of 13.2 MHz. Designs that run in the 10 to 20 MHz range are typical [8, 10].

Using the above frequencies, a single system could process one orientation every .487 seconds (.701 seconds using the 13.2 MHz clock):

  128 col x (128-15) rows x 16 cycles/col x 40 chunks/orientation
  --------------------------------------------------------------- =
                             19 MHz
  (12.2 ms/chunk) x 40 chunks/orientation = 487 ms @ 19 MHz

  128 col x (128-15) rows x 16 cycles/col x 40 chunks/orientation
  --------------------------------------------------------------- =
                            13.2 MHz
  (17.5 ms/chunk) x 40 chunks/orientation = 701 ms @ 13.2 MHz
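The timing arithmetic above can be checked directly (cycle counts taken from the expressions in the text):

```python
# Cycles per orientation divided by clock frequency gives processing time.
cycles_per_chunk = 128 * (128 - 15) * 16        # 128 cols x 113 rows x 16 cycles/col
cycles_per_orientation = cycles_per_chunk * 40  # 40 chunks per orientation
print(round(cycles_per_orientation / 19.0e6, 3))   # 0.487 s at 19 MHz
print(round(cycles_per_orientation / 13.2e6, 3))   # 0.701 s at 13.2 MHz
```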
`
`
`
197

Petitioner Microsoft Corporation - Ex. 1011, p. 197
`
`
`
`
`
Figure 8: SPLASH 2 Implementation. (The Shapesum stage feeds the final correlation stage.)
`
Using a brute-force approach, it takes 59 seconds to compute the same data on an HP 770 (110 MHz); that is two orders of magnitude slower. On one SPLASH 2 board two processing units can be installed, yielding the ability to process 1 orientation (40 chunks) every .244 seconds (.350 seconds at 13.2 MHz).
`
4.5.2 Multi-board SPLASH 2

If a 16-board system were used, then 256 processing elements would be available. This would make 42 processing units (256 PEs at 6 PEs per PU) and 42x the throughput. Thus one orientation can be processed every 11.6 ms (16.7 ms at 13.2 MHz). This is equal
`to coveri