`
IEEE Symposium on
FPGAs for Custom Computing Machines
`
`April 16-18, 1997
`Napa Valley, California
`
`Edited by Kenneth L. Pocek and Jeffrey Arnold
`
`Sponsored by the IEEE Computer Society Technical Committee on Computer Architecture
`
`
`Petitioner Microsoft Corporation - Ex. 1009, Cover 1
`
`
`
FCCM'97
`
`
`
`09 1997
`
`
`
`
Copyright © 1997 by The Institute of Electrical and Electronics Engineers, Inc.
All rights reserved.

Copyright and Reprint Permissions: Abstracting is permitted with credit to the source. Libraries may photocopy beyond the limits of US copyright law, for private use of patrons, those articles in this volume that carry a code at the bottom of the first page, provided that the per-copy fee indicated in the code is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923.

Other copying, reprint, or republication requests should be addressed to: IEEE Copyrights Manager, IEEE Service Center, 445 Hoes Lane, P.O. Box 1331, Piscataway, NJ 08855-1331.

The papers in this book comprise the proceedings of the meeting mentioned on the cover and title page. They reflect the authors' opinions and, in the interests of timely dissemination, are published as presented and without change. Their inclusion in this publication does not necessarily constitute endorsement by the editors, the IEEE Computer Society, or the Institute of Electrical and Electronics Engineers, Inc.
`
`
`IEEE Computer Society Order Number PRO8159
`ISBN 0-8186-8159-4
`ISBN 0-8186-8160-8 (case)
`ISBN 0-8186-8161-6 (microfiche)
IEEE Order Plan Catalog Number 97TB100186
`ISSN 1082-3409
`
Additional copies may be ordered from:

IEEE Computer Society
Customer Service Center
10662 Los Vaqueros Circle
P.O. Box 3014
Los Alamitos, CA 90720-1314
Tel: +1-714-821-8380
Fax: +1-714-821-4641
E-mail: cs.books@computer.org

IEEE Service Center
445 Hoes Lane
P.O. Box 1331
Piscataway, NJ 08855-1331
Tel: +1-908-981-1393
Fax: +1-908-981-9667
mis.custserv@computer.org

IEEE Computer Society
13, Avenue de l'Aquilon
B-1200 Brussels
BELGIUM
Tel: +32-2-770-2198
Fax: +32-2-770-8505
euro.ofc@computer.org

IEEE Computer Society
Ooshima Building
2-19-1 Minami-Aoyama
Minato-ku, Tokyo 107
JAPAN
Tel: +81-3-3408-3118
Fax: +81-3-3408-3553
tokyo.ofc@computer.org
`
Editorial production by Bob Werner
Cover art production by Joe Daigle/Studio Productions
Printed in the United States of America by Technical Communication Services
`
`
`
`
`
`Table of Contents
`
`Symposium on Field-Programmable Custom Computing Machines — FCCM’97
`
Preface .......................................................................................................................................... ix
`
Program Committee ...................................................................................................................... x
`
`Session 1: Device Architecture
`
An FPGA Architecture for DRAM-based Systolic Computations .............................................. 2
N. Margolus

Garp: A MIPS Processor with a Reconfigurable Coprocessor .................................................. 12
J. Hauser, J. Wawrzynek

A Time-Multiplexed FPGA ........................................................................................................ 22
S. Trimberger, D. Carberry, A. Johnson, J. Wong
`
`Session 2: Communication Applications
`
An FPGA-Based Coprocessor for ATM Firewalls .................................................................... 30
J. McHenry, P. Dowd, T. Carrozzi, F. Pellegrino, W. Cocks

A Wireless LAN Demodulator in a Pamette: Design and Experience ...................................... 40
T. McDermott, P. Ryan, M. Shand, D. Skellern, T. Percival, N. Weste
`
`Session 3: Run Time Reconfiguration
`
Incremental Reconfiguration for Pipelined Applications .......................................................... 47
H. Schmit

Compilation Tools for Run-Time Reconfigurable Designs ....................................................... 56
W. Luk, N. Shirazi, P. Cheung

A Dynamic Reconfiguration Run-Time System ........................................................................ 66
J. Burns, A. Donlin, J. Hogg, S. Singh, M. de Wit
`
`Session 4: Architectures for Run Time Reconfiguration
`
The Swappable Logic Unit: A Paradigm for Virtual Hardware ................................................ 77
G. Brebner

The Chimaera Reconfigurable Functional Unit ......................................................................... 87
S. Hauck, T. Fry, M. Hosler, J. Kao
`
`Session 5: Architecture
`
Computing Kernels Implemented with a Wormhole RTR CCM ............................................... 98
R. Bittner Jr., P. Athanas

Mapping Applications to the RaPiD Configurable Architecture ............................................ 106
C. Ebeling, D. Cronquist, P. Franklin, J. Secosky, S. Berg

Defect Tolerance on the Teramac Custom Computer ............................................................. 116
B. Culbertson, R. Amerson, R. Carter, P. Kuekes, G. Snider

Session 6: Performance

Systems Performance Measurement on PCI Pamette ............................................................. 125
L. Moll, M. Shand

The RAW Benchmark Suite: Computation Structures for
General Purpose Computing .................................................................................................... 134
J. Babb, M. Frank, V. Lee, E. Waingold, R. Barua, M. Taylor, J. Kim, S. Devabhaktuni, A. Agarwal
`
`Session 7: Software Tools
`
Automated Field-Programmable Compute Accelerator Design using
Partial Evaluation ..................................................................................................................... 145
Q. Wang, D. Lewis

FPGA Synthesis on the XC6200 using IRIS and Trianus/Hades
(Or, from Heaven to Hell and back again) ............................................................................... 155
R. Woods, S. Ludwig, J. Heron, D. Trainor, S. Gehring

High Level Compilation for Fine Grained FPGAs .................................................................. 165
M. Gokhale, E. Gomersall

Session 8: CAD Applications

Acceleration of an FPGA Router ............................................................................................. 175
P. Chan, M. Schlag

Fault Simulation on Reconfigurable Hardware ....................................................................... 182
M. Abramovici, P. Menon
`
`
`
`
`Session 9: Image Processing Applications
Automated Target Recognition on SPLASH 2 ........................................................................ 192
M. Rencher, B. Hutchings

Real-Time Stereo Vision on the PARTS Reconfigurable Computer ....................................... 201
J. Woodfill, B. Von Herzen

Increased FPGA Capacity Enables Scalable, Flexible CCMs:
An Example from Image Processing ....................................................................................... 211
J. Greenbaum, M. Baxter

Session 10: Arithmetic Applications

Comparison of Arithmetic Architectures for Reed-Solomon Decoders in
Reconfigurable Hardware ........................................................................................................ 219
C. Paar, M. Rosner

Implementation of Single Precision Floating Point Square Root on FPGAs .......................... 226
Y. Li, W. Chu
`
`Poster Papers
`
Datapath-Oriented FPGA Mapping and Placement for
Configurable Computing ......................................................................................................... 234
T. Callahan, J. Wawrzynek

Mapping a Real-Time Video Algorithm to a Context-Switched FPGA .................................. 236
S. Kelem

A Parallel Hardware Evolvable Computer POLYP ................................................................ 238
U. Tangen, L. Schulte, J. McCaskill

Laser Defect Correction Applications to FPGA Based Custom Computers ........................... 240
G. Chapman, B. Dufort

Speech Recognition HMM Training on Reconfigurable Parallel Processor ........................... 242
H. Yun, A. Smith, H. Silverman

Efficient Implementation of the DCT on Custom Computers ................................................ 244
N. Bergmann, Y. Chung, B. Gunther

On Acceleration of the Check Tautology Logic Synthesis Algorithm using an
FPGA-based Reconfigurable Coprocessor .............................................................................. 246
J. Cong, J. Peck

Index of Authors ...................................................................................................................... 249
`
`
`
`
`
`
Mapping Applications to the RaPiD Configurable Architecture*
Carl Ebeling, Darren C. Cronquist, Paul Franklin, Jason Secosky, and Stefan G. Berg
Department of Computer Science and Engineering
University of Washington
Box 352350
Seattle, WA 98195-2350
`
Abstract

The goal of the RaPiD (Reconfigurable Pipelined Datapath) architecture is to provide high-performance configurable computing for a range of computationally-intensive applications that demand special-purpose hardware. This is accomplished by mapping the computation into a deep pipeline using a configurable array of coarse-grained computational units. A key feature of RaPiD is the combination of static and dynamic control. While the underlying computational pipelines are configured statically, a limited amount of dynamic control is provided which greatly increases the range and capability of applications that can be mapped to RaPiD. This paper illustrates this mapping and configuration for several important applications including a FIR filter, 2-D DCT, motion estimation, and parametric curve generation; it also shows how static and dynamic control are used to perform complex computations.
`
`1.
`
`Introduction
`
`custom computing machines
`Field-programmable
`have attracted a lot of attention recently because of
`their promise to deliver the high performance provided
`by special purpose hardware along with the flexibil-
`ity of general purpose processors. Unfortunately, the
`promise of configurable computing has yet to be real-
`ized in spite of some very successful examples [1, 9].
`There are two main reasonsforthis.
`First, configurable computing platforms are cur-
`rently implemented using commercial FPGAs. These
`FPGAsare necessarily very fine-grained so they can
`be used to implement arbitrary circuits, but the over-
`head of this generality exacts a high price in density
`and performance. Compared to general purpose pro-
`cessors (including digital signal processors), which use
`highly optimized functional units that operate in bit-
`parallel fashion on long data words, FPGAsare some-
`whatinefficient for performing logical operations and
`“This work was supported in part by the Defense Advanced
`Research Projects Agency under Contract DAAH04-94-G0272. D.
`Cronquist was supported in
`part by an IBM fellowship. P. Franklin
`was supported by an NSF fellowship
`
`even worsefor ordinary arithmetic. FPGA-based com-
`puting hasthe advantage only on complex bit-oriented
`computations like count-ones, find-first-one, or com-
`plicated bit-level masking and filtering.
`Second, programming an FPGA-based configurable
`computer is akin to designing an ASIC. The program-
`mer either uses synthesis tools that deliver poor den-
`sity and performance or designs the circuit manually,
`which requires both intimate knowledge of the FPGA
`architecture and substantial design time. Neither al-
`ternative is attractive, particularly for simple compu-
`tations that can be described in a few lines of C code.
`Our response to these two problems is RaPiD, a
`coarse-grained configurable architecture for construct-
`ing deep computational pipelines. RaPiD is aimedat.
`regular, computation-intensive tasks like those found
`in digital signal processing (DSP). RaPiD provides a
`large numberof ALUs, multipliers, registers and mem-
`ory modules that can be configured into the appropri-
`ate pipelined datapath. The datapaths constructed
`in RaPiD are linear arrays of functional units com-
`municating in mostly nearest-neighborfashion. Sys-
`tolic algorithms [4], for example, map very well into
`RaPiD datapaths, allowing us to take advantage of
`the considerable research on compiling computations
`to systolic arrays[5, 7]. However, RaPiDis notlimited
`to implementingsystolic algorithms; a pipeline can be
`constructed which comprises different computations at,
`different stages and at different times.
`Webegin with an overview of the RaPiD architec-
`ture; for a more detailed description see [3]. We then
`give a general description of how computations map
`to RaPiD using a FIR filter as an example, and then
`present how the architectural features of RaPiD are
`used to perform more complex computations found in
`2-D DCT, motion estimation, and parametric curve
`generation.
`
2 The RaPiD Datapath Architecture

RaPiD is a linear array of functional units which is configured to form a mostly linear computational pipeline. This array of functional units is divided into identical cells which are replicated to form a complete array. Figure 1 shows the cell used in RaPiD-1, the
`
`0-8186-8159-4/97 $10.00 © 1997 IEEE
`
`106
`
`Petitioner Microsoft Corporation - Ex. 1009, p. 106
`Petitioner Microsoft Corporation - Ex. 1009, p. 106
`Me
`
`
`
first version of the RaPiD architecture. This cell comprises an integer multiplier, three integer ALUs, six general-purpose "datapath registers" and three small local memories. A typical single-chip RaPiD array would contain between 8 and 32 of these cells.

Figure 1: A basic RaPiD cell which is replicated left to right to form a complete chip. RaPiD-1 contains 16 cells similar to this one, with fourteen 16-bit buses.
`
2.1 Datapath Composition
The functional units are interconnected using a set of segmented buses that run the length of the datapath. The functional units use an n:1 multiplexer to select their data inputs from one of the n-2 bus segments in the adjacent tracks. The additional inputs provide fixed zero or feedback lines, which can be used to clear and hold register values, or to use an ALU as an accumulator. Each functional unit output includes optional registers to accommodate pipeline delays and a set of tristate drivers to drive their output onto one or more bus segments.

The buses in different tracks are segmented into different lengths to make the most efficient use of the connection resources. In some tracks, adjacent bus segments can be connected together by a bus connector as shown in Figure 1. This connection can be programmed in either direction via a unidirectional buffer or pipelined with up to three register delays, allowing data pipelines to be built in the bus structure itself.

RaPiD's ALUs perform the usual logical and arithmetic operations on one word of data. The ALUs can be chained for wide-integer operations. The multiplier inputs two single-word numbers and produces a double-word result, shifted by a statically programmed amount to maintain a given fixed-point representation. Both words of the result are available as separate outputs.
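As a rough illustration of this statically programmed post-multiply shift, the sketch below models a single-word by single-word multiply whose double-word result is realigned for a fixed-point format. The Q8.8 format and the function name are our own illustrative choices, not details from the paper:

```python
# Sketch of a RaPiD-style multiplier: 16x16 -> 32-bit product, shifted by a
# statically programmed amount to keep a fixed-point alignment. Both words
# of the (shifted) result are available as separate outputs.
def rapid_multiply(a, b, static_shift):
    full = (a * b) >> static_shift       # double-word result, statically shifted
    hi = (full >> 16) & 0xFFFF           # high word output
    lo = full & 0xFFFF                   # low word output
    return hi, lo

# Example in Q8.8: 1.5 (= 384/256) * 2.25 (= 576/256) = 3.375 (= 864/256),
# so a Q8.8 * Q8.8 -> Q8.8 multiply needs a static shift of 8.
hi, lo = rapid_multiply(384, 576, 8)
assert (hi << 16) | lo == 864
```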
The datapath registers serve a variety of purposes in RaPiD. These registers can be used to store constants loaded during initialization and temporary values. They can be used as additional multiplexers to simplify control; like any functional unit, the registers can be disabled. They are also used while routing RaPiD applications to connect bus segments in different tracks and/or for additional pipeline delays.

In many applications, the data is partitioned into blocks which are loaded once, saved locally, reused as needed, and then discarded. The local memories provided in each cell of the datapath serve this purpose. Each local memory has a specialized datapath register used as an address register; one of the bus inputs to this address register is replaced by an incrementing feedback path. Like the SILOs found in the Philips VSP [8], this supports the common case of sequential memory accesses. More complex addressing patterns can be generated using registers and ALUs in the datapath.
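A minimal software model of such a local memory can make the two addressing modes concrete: the address register either loads a value from a bus input or advances itself through the incrementing feedback path on each sequential access. The class below is our own sketch; the sizes and interface are illustrative, not the RaPiD-1 hardware interface:

```python
# Sketch of a RaPiD-style local memory with a specialized address register.
class LocalMemory:
    def __init__(self, words=32):
        self.mem = [0] * words
        self.addr = 0

    def set_address(self, value):
        # Address register loads from a bus input (non-sequential access).
        self.addr = value % len(self.mem)

    def step(self, write_data=None):
        # One access; afterwards the incrementing feedback path advances
        # the address register, giving the common sequential pattern.
        if write_data is not None:
            self.mem[self.addr] = write_data
        data = self.mem[self.addr]
        self.addr = (self.addr + 1) % len(self.mem)
        return data

m = LocalMemory()
for x in [10, 20, 30]:
    m.step(write_data=x)          # sequential writes at addresses 0, 1, 2
m.set_address(0)
assert [m.step(), m.step(), m.step()] == [10, 20, 30]
```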
Input and output data enter and exit RaPiD via I/O streams at each end of the datapath. Each stream contains a FIFO filled with data required or with results produced by the computation. The datapath explicitly reads from an input stream to obtain the next input data value and writes to an output stream to store a result.

External memory operations are carried out independent of the RaPiD array via three I/O streams by placing FIFOs between the array and a memory controller. In addition to carrying out the memory operations, the memory controller generates statically determined sequences of addresses for each stream. If the datapath reads a value from an empty FIFO or writes a value to a full FIFO, the datapath is stalled until the FIFO is ready.
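The stall semantics can be pictured with a tiny bounded-FIFO model. This is a sketch under an assumed depth; the paper does not specify the depth or interface of RaPiD-1's actual stream FIFOs:

```python
from collections import deque

# A bounded stream FIFO: the datapath stalls (here, gets False back) when
# it reads from an empty FIFO or writes to a full one.
class StreamFIFO:
    def __init__(self, depth=4):
        self.q = deque()
        self.depth = depth

    def write(self, value):
        # Datapath -> output stream; False means "full, stall this cycle".
        if len(self.q) >= self.depth:
            return False
        self.q.append(value)
        return True

    def read(self):
        # Input stream -> datapath; (False, None) means "empty, stall".
        if not self.q:
            return False, None
        return True, self.q.popleft()

f = StreamFIFO(depth=2)
assert f.write(1) and f.write(2)
assert f.write(3) is False            # full: the datapath would stall
ok, v = f.read()
assert ok and v == 1
```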
`
2.2 Control Path
For the most part, the structure of a pipeline is statically configured. However, there are almost always some pipeline control signals that must be dynamic. For example, constants are loaded into datapath registers during initialization but then remain unchanged. The load signals of the datapath registers thus take on different values during initialization and computation. More complex examples include double-buffering the local memories and performing data-dependent calculations.

The control signals are thus divided into static control signals provided by configuration memory as in ordinary FPGAs, and control signals which can be dynamically programmed on every cycle. RaPiD is programmed for a particular application by first mapping the computation onto a datapath pipeline. The static programming bits are used to construct this pipeline and the dynamic programming bits are used to schedule the datapath operations over time. These dynamic control bits are provided by a small pipelined control path, not by the more typical local microprogrammed, SIMD, or VLIW control.
`
`107
`
`Petitioner Microsoft Corporation - Ex. 1009, p. 107
`Petitioner Microsoft Corporation - Ex. 1009, p. 107
`
`
`
Dynamic control is implemented by inserting a few "context" bits each cycle into a pipelined control path that parallels the datapath. This context contains all the information required by the various pipeline stages to compute their dynamic control signals. The control path contains 1-bit segmented buses similar in structure to the datapath buses, as shown in Figure 2. (Signals which can be dynamic but do not need to change during a particular computation are connected to the static zero line.) Control values are inserted by a global pipeline controller at one end of the control path and are passed from stage to stage where they are applied to the appropriate control signals. Since applications generally use only a few dynamic control signals and use similar pipeline stages, the number of control signals in the control path is relatively small.

Figure 2: Dynamic control generation for part of a RaPiD cell; these control buses are one bit wide.

Each dynamic control signal is derived from the information contained in the control path. Usually the signal is simply connected to one of the bits in the control path, but in more complex cases lookup-tables embedded in the control path are used to compute the control signal based on more information including bits in the control path, status from ALUs in the datapath, or feedback when implementing simple FSM controllers. The generation of dynamic control is illustrated in detail in the applications that follow.
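A hedged sketch of such an embedded lookup-table follows. The encoding (two inputs, and the particular truth table) is invented for illustration; the paper does not specify RaPiD-1's LUT sizes or contents:

```python
# A small LUT computes one dynamic control signal from two per-cycle inputs:
# a context bit arriving on the 1-bit control path and an ALU status flag.
# The LUT contents are static configuration; only the inputs change per cycle.
def make_lut(truth_table):
    # truth_table maps (context_bit, alu_status) -> control signal value.
    return lambda ctx, status: truth_table[(ctx, status)]

# Hypothetical example: assert a "clear accumulator" signal only when the
# context bit is set AND the ALU reports zero (e.g. end of an accumulation).
clear_accum = make_lut({
    (0, 0): 0, (0, 1): 0,
    (1, 0): 0, (1, 1): 1,
})

assert clear_accum(1, 1) == 1
assert clear_accum(1, 0) == 0
```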
`
2.3 RaPiD-1 Design Features
Most of the design and layout of the RaPiD-1 chip, the first implementation of the RaPiD architecture, is complete. This section presents those details of RaPiD-1 useful in understanding the performance results discussed for each application presented in the following sections.

RaPiD-1's datapath is based on 16-bit fixed-point integers; to accommodate this, the multipliers can be statically programmed to shift their 32-bit output arbitrarily. Each RaPiD-1 cell contains three ALUs, one multiplier, and three 32-word local memories. Fourteen tracks are provided for the segmented data buses, which are supplemented by the zero and feedback inputs available to each functional input. The 16 cells each have the functional units shown in Figure 1, in addition to control logic and up to 15 control buses. The RaPiD-1 array is designed to be clocked at 100 MHz, and reconfiguration time for the array is conservatively estimated to be 2000 cycles.
`
3 Programming Model
Mapping applications to RaPiD involves designing the underlying datapath and providing the dynamic control required for the different parts of the computation. The control design can be complicated because control signals are generated at different times and travel at different rates. We have designed the RaPiD B programming language to accommodate these control patterns. Our RaPiD B compiler, which produces a placed and routed implementation along with the dynamic control program, is nearly complete. This section first describes a FIR (Finite Impulse Response) filter, a simple application useful for illustrating some of the basic features of RaPiD. It then briefly presents the timing models used by RaPiD B and by the remainder of this paper.
`
`3.1
`
`FIR Filter Computation
`Digital FIR
`filters are used in many signal processing
`applications
`, typically for eliminating unwanted fre-
`quency components from a signal. Figure 3a Bives a
`specificationfor a FIR filter with NumTaps taps and
`NumxXinputs. Thefilter weights are stored in the W
`array, the in
`put in the X array, and the output in the
`Y array (starting
`at array location NumT'aps — 1).
`Figure 3b shows theentire computation required for a
`single output of a 4-tap FIRfilter.
`
Y[i] := 0
for j := 0 to NumTaps-1
    Y[i] := Y[i] + X[i-j] * W[j]
end

Figure 3: FIR filter. (a) Algorithm. (b) Computation for NumTaps=4 and i=6.
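The specification in Figure 3a can be exercised directly with a short runnable sketch (plain Python; the weight and input values below are made up for illustration):

```python
# Direct FIR filter following Figure 3a: Y[i] = sum over j of X[i-j] * W[j].
# Outputs are defined starting at index NumTaps-1, as in the text.
def fir(X, W):
    num_taps = len(W)
    Y = {}
    for i in range(num_taps - 1, len(X)):
        acc = 0
        for j in range(num_taps):
            acc += X[i - j] * W[j]
        Y[i] = acc
    return Y

# 4-tap example matching Figure 3b: Y[6] = X6*W0 + X5*W1 + X4*W2 + X3*W3.
W = [1, 2, 3, 4]
X = [0, 1, 2, 3, 4, 5, 6, 7]
Y = fir(X, W)
assert Y[6] == 6*1 + 5*2 + 4*3 + 3*4   # = 40
```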
`
The circuit in Figure 4a performs the entire computation for one output value in a single cycle; it is easily obtained by unrolling the inner loop of the program in Figure 3a. Unfortunately, the circuit shown in Figure 4a has poor performance characteristics (note the combinational path through all of the adders, which scales linearly with the number of weights). A retimed version of this circuit is shown in Figure 4b; the retimed circuit performs substantially better than the original, particularly for larger computations.

Figure 4: Schematic diagrams for four-tap FIR filter (a) as viewed in RaPiD B, grouping related computation and (b) as a high-performance pipelined implementation.

Specifying this retimed circuit directly is difficult because of the complexity of the relative timing of the internal data and control signals. It is much easier to specify the computation somewhat naively as in Figure 4a, knowing that retiming can transform it into a high-performance, pipelined circuit. This becomes particularly evident in circuits with more complicated control, and when more aggressive steps, such as using the pipeline stage available in RaPiD's multiplier, are needed to achieve the desired performance. Therefore, the RaPiD B compiler retimes the resulting netlist based on [6].

All of the applications presented in the following sections have been specified in a preliminary version of RaPiD B and simulated to validate the implementations described and the accompanying cycle count. For ease of explanation, the computations shown throughout this paper are shown before the full retiming performed by the RaPiD B compiler. A preliminary version of the RaPiD B toolset is nearly complete, including compilation, retiming, control synthesis, and full placement and routing of the resulting RaPiD circuit.

4 FIR Filter Implementation

4.1 Simple Case
As with most applications, there are a variety of ways to map a FIR filter to RaPiD. The choice of mapping is driven by the parameters of both the RaPiD array and the application. For example, if the number of taps is less than the number of RaPiD multipliers, then each multiplier is assigned to multiply a specific weight. The weights are first preloaded into datapath registers whose outputs drive the input of a specific multiplier. Pipeline registers are used to stream the X inputs and Y outputs. Since each Y output must see NumTaps inputs, the X and Y buses must be pipelined at different rates. Figure 5a shows one cell of the FIR filter (several stages are shown in Figure 4b) with the X input bus doubly pipelined and the Y input bus singly pipelined.

Figure 5: (a) Netlist for one cell of the simple FIR filter. (b) One tap of the FIR filter mapped to the RaPiD array (this is replicated to form more taps).

This implementation maps easily to the RaPiD array, as shown for one tap in Figure 5b. For clarity, all unused functional units are removed, and used buses are highlighted. The bus connectors from Figure 1 are left open to represent no connection and boxed to represent a register. The control for this mapping consists of two phases of execution: loading the weights and computing the output results. In the first phase, the weights are sent down the IN double pipeline along with a singly pipelined control bit which connects the
`
`
input of each datapath register to the IN bus. When the final weight is inserted, the control bit is switched, and the input is connected to the feedback line. Since the control bit travels twice as fast as the weights, each datapath register will hold a unique weight. No special signals are required to begin the computation; the second phase implicitly starts when the control bit goes low.

4.2 Increasing the Number of Taps
If the number of taps exceeds the number of RaPiD multipliers, the multipliers must be time-shared between several taps. This can be achieved in RaPiD by using a local memory to store several weights per stage. Figure 6 shows our implementation for this mapping. Unlike the simple case, we make the arbitrary choice of doubly pipelining the Y output values and singly pipelining the X input values.

Figure 6: Netlist for one cell of the extended FIR filter. Weights and X values stream in on the top pipelined bus (the weights during initialization) while the bottom bus streams out the intermediate Y values. The left RAM holds weights to be multiplied with X inputs; the right RAM holds intermediate Y values that shift down until they are sent to the next stage.

As a new X is read from external memory, the first stage replicates it and presents it to the input datapath for several cycles. Each stage can multiply this X by its weights in turn and add it to one of its stored intermediate values. At this point a new X value will be fetched from memory and the cycle repeats.

There are the same number of intermediate values as there are weights per stage. These intermediate values are stored in a second local memory. Let's examine the stage holding weights W55, W54, W53, and W52 (four taps per stage). A new input value X20 appears on the input datapath. In four cycles the partial sums for Y75, Y74, Y73, and Y72 will be computed. These are stored in that order in the local memory holding the intermediate values. At this point, X20 moves to the next pipeline stage followed by the intermediate value Y72. The next input, X21, appears on the input datapath along with the intermediate value Y76 from the previous stage. Now the partial sums for Y76, Y75, Y74, and Y73 are computed.

4.3 FIR Performance
When the number of taps is a multiple of 16 the weights can be partitioned evenly across the stages and the allocated functional units are fully utilized. RaPiD-1 (Section 2.3) can therefore operate at very near its peak performance of 1.6 GOPS (where GOPS is a billion multiply-accumulates per second).

5 Discrete Cosine Transform
The discrete cosine transform (DCT) is used frequently in signal processing and graphics applications. For example, the 2-D DCT is used in JPEG/MPEG data compression to convert an image from the spatial domain to the frequency domain. A 2-D DCT can be decomposed into two sequential 1-D DCTs. We first describe how the 1-D DCT can be computed on RaPiD and then show how two 1-D DCTs can be composed to perform a 2-D DCT.

5.1 1-D DCT
An N-point 1-D DCT partitions an input vector A into N-element sub-vectors, and for each resulting sub-vector A_k computes

    y_ki = Σ_{n=0}^{N-1} a_kn cos(πi(2n+1)/2N)    (1)

for 0 ≤ i ≤ N-1, where a_kn is the n-th element of sub-vector A_k (and the (kN+n)-th element of vector A). The N² cosine terms are constant over all sub-vectors and hence can be read once as precomputed weights W where w_ni = cos(πi(2n+1)/2N). This reduces Equation 1 to

    y_ki = Σ_{n=0}^{N-1} a_kn w_ni    (2)

for 0 ≤ i ≤ N-1. By viewing input vector A and weights W as matrices A and W, Equation 2 reduces to the matrix multiply Y = A × W. Thus, to compute a 1-D DCT, RaPiD performs a matrix multiply.¹

To implement an 8-point 1-D DCT on an 8 × 8 input matrix A (i.e., a 64-element vector), the entire 8 × 8 weight matrix W is stored in RaPiD's local memories, one column per cell. Each cell of the resulting pipeline is configured as shown in Figure 7. The A matrix is passed through the array in row-major order. Within each cell, the local memory address is incremented each cycle, and a register accumulates the dot product of the stored column and the incoming row. When a cell receives the last element of a row, the resulting product is passed down an output pipeline, the address is cleared, and the cell is ready to compute the product of the next row on the next cycle. This effectively computes the matrix multiply of A × W.

¹To produce the final DCT result each y_ki must be multiplied by √(E_i/4), where E_i = 1/2 if i = 0 and E_i = 1 otherwise. For our purposes we ignore this scaling factor and focus on the computation of y_ki.
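The reduction from Equation 1 to Equation 2 (precompute the cosine weights once, then take one dot product per output) can be checked numerically with a small sketch (plain Python; the input data is made up):

```python
import math

N = 8
# Precomputed weight matrix of Equation 2: w[n][i] = cos(pi*i*(2n+1) / (2N)).
w = [[math.cos(math.pi * i * (2 * n + 1) / (2 * N)) for i in range(N)]
     for n in range(N)]

a = [float(n * n % 7) for n in range(N)]   # one sub-vector A_k (made-up data)

for i in range(N):
    # Equation 1: direct cosine sum.
    direct = sum(a[n] * math.cos(math.pi * i * (2 * n + 1) / (2 * N))
                 for n in range(N))
    # Equation 2: dot product against the precomputed weights.
    via_weights = sum(a[n] * w[n][i] for n in range(N))
    assert abs(direct - via_weights) < 1e-9
```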
`
`
`
Figure 7: Netlist for one cell of a matrix multiply. The top pipelined bus streams in the A matrix (in row-major order) while the bottom bus streams out the resulting matrix product (also in row-major order). The top bus also streams the W columns into the local memories prior to the computation.

5.2 2-D DCT
An N × N 2-D DCT partitions an input matrix into sub-matrices of size N × N, and for each resulting sub-matrix A, computes

    y_ij = Σ_{m=0}^{N-1} Σ_{n=0}^{N-1} a_mn cos(πi(2m+1)/2N) cos(πj(2n+1)/2N)    (3)

for 0 ≤ i, j ≤ N-1. As with the 1-D DCT, Equation 3 is reduced using the N² precomputed W weights, yielding

    y_ij = Σ_{m=0}^{N-1} Σ_{n=0}^{N-1} a_mn w_mi w_nj    (4)

for 0 ≤ i, j ≤ N-1. Extracting w_mi from the inner summation leaves

    Y = ((A × W)ᵀ × W)ᵀ

Figure 8: 2-D N × N DCT. A row of matrix A streams into an N-point 1-D DCT, the result is transposed, and a second N-point 1-D DCT streams out a row of the resulting product.

We show the implementation of an 8 × 8 2-D DCT on a 16-cell RaPiD array. Consider an M × N image and an 8 × 8 weight matrix W. First, the image is divided into sub-images of size 8 × 8. The computation for each sub-image A is outlined in Figure 9.

Figure 9: To compute the 2-D DCT, an M × N image is partitioned into 8 × 8 sub-images. RaPiD computes each 1-D DCT by multiplying the sub-image by an 8 × 8 weight matrix: the first 8 stages compute the first 1-D DCT, intermediate results are stored in RAM and transposed by control, and the last 8 stages compute the second 1-D DCT.

Since a 2-D DCT performs two multiplies by the same weight matrix, W is loaded only once: one column per cell in both the first 8 cells and the last 8 cells. The transpose in between matrix multiplies is performed with two local memories per cell: one to store products of the current sub-image and the other to store the products of the previous sub-image. During the computation of the current sub-image, the transpose of the previous sub-image computation is passed to the next 8 cells. The datapath for one RaPiD cell of a 2-D DCT is shown in Figure 10. [Figure 10 labels: one RAM stores the current 1-D DCT results; the other stores the previous results used to perform the transpose.]
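The factored matrix form can be verified against the double summation of Equation 4 with a plain-Python sketch (the matmul and transpose helpers and the sample data are ours):

```python
import math

N = 8
# Weight matrix of Equation 4: w[m][i] = cos(pi*i*(2m+1) / (2N)).
w = [[math.cos(math.pi * i * (2 * m + 1) / (2 * N)) for i in range(N)]
     for m in range(N)]
A = [[float((3 * m + n) % 5) for n in range(N)] for m in range(N)]  # made-up sub-image

def matmul(P, Q):
    return [[sum(P[r][k] * Q[k][c] for k in range(len(Q)))
             for c in range(len(Q[0]))] for r in range(len(P))]

def T(P):
    return [list(col) for col in zip(*P)]

# Two multiplies by the same weight matrix with a transpose in between.
Y_factored = T(matmul(T(matmul(A, w)), w))     # Y = ((A x W)^T x W)^T

for i in range(N):
    for j in range(N):
        direct = sum(A[m][n] * w[m][i] * w[n][j]
                     for m in range(N) for n in range(N))   # Equation 4
        assert abs(Y_factored[i][j] - direct) < 1e-9
```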