Proceedings

IEEE Symposium on FPGAs for Custom Computing Machines

April 16-18, 1997
Napa Valley, California

Edited by Kenneth L. Pocek and Jeffrey Arnold

Sponsored by the IEEE Computer Society Technical Committee on Computer Architecture

Petitioner Microsoft Corporation - Ex. 1009, Cover 1

FCCM'97

Copyright © 1997 by The Institute of Electrical and Electronics Engineers, Inc.
All rights reserved.

Copyright and Reprint Permissions: Abstracting is permitted with credit to the source. Libraries may photocopy beyond the limits of US copyright law, for private use of patrons, those articles in this volume that carry a code at the bottom of the first page, provided that the per-copy fee indicated in the code is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923.

Other copying, reprint, or republication requests should be addressed to: IEEE Copyrights Manager, IEEE Service Center, 445 Hoes Lane, P.O. Box 1331, Piscataway, NJ 08855-1331.

The papers in this book comprise the proceedings of the meeting mentioned on the cover and title page. They reflect the authors' opinions and, in the interests of timely dissemination, are published as presented and without change. Their inclusion in this publication does not necessarily constitute endorsement by the editors, the IEEE Computer Society, or the Institute of Electrical and Electronics Engineers, Inc.
IEEE Computer Society Order Number PR08159
ISBN 0-8186-8159-4
ISBN 0-8186-8160-8 (case)
ISBN 0-8186-8161-6 (microfiche)
IEEE Order Plan Catalog Number 97TB100186
ISSN 1082-3409
Additional copies may be ordered from:

IEEE Computer Society
Customer Service Center
10662 Los Vaqueros Circle
P.O. Box 3014
Los Alamitos, CA 90720-1314
Tel: +1-714-821-8380
Fax: +1-714-821-4641
E-mail: cs.books@computer.org

IEEE Service Center
445 Hoes Lane
P.O. Box 1331
Piscataway, NJ 08855-1331
Tel: +1-908-981-1393
Fax: +1-908-981-9667
mis.custserv@computer.org

IEEE Computer Society
13, Avenue de l'Aquilon
B-1200 Brussels
BELGIUM
Tel: +32-2-770-2198
Fax: +32-2-770-8505
euro.ofc@computer.org

IEEE Computer Society
Ooshima Building
2-19-1 Minami-Aoyama
Minato-ku, Tokyo 107
JAPAN
Tel: +81-3-3408-3118
Fax: +81-3-3408-3553
tokyo.ofc@computer.org
Editorial production by Bob Werner
Cover art production by Joe Daigle/Studio Productions
Printed in the United States of America by Technical Communication Services

Table of Contents

Symposium on Field-Programmable Custom Computing Machines — FCCM'97

Preface
Program Committee .......... x

Session 1: Device Architecture

An FPGA Architecture for DRAM-based Systolic Computations .......... 2
N. Margolus

Garp: A MIPS Processor with a Reconfigurable Coprocessor .......... 12
J. Hauser, J. Wawrzynek

A Time-Multiplexed FPGA .......... 22
S. Trimberger, D. Carberry, A. Johnson, J. Wong

Session 2: Communication Applications

An FPGA-Based Coprocessor for ATM Firewalls .......... 30
J. McHenry, P. Dowd, T. Carrozzi, F. Pellegrino, W. Cocks

A Wireless LAN Demodulator in a Pamette: Design and Experience .......... 40
T. McDermott, P. Ryan, M. Shand, D. Skellern, T. Percival, N. Weste

Session 3: Run Time Reconfiguration

Incremental Reconfiguration for Pipelined Applications .......... 47
H. Schmit

Compilation Tools for Run-Time Reconfigurable Designs .......... 56
W. Luk, N. Shirazi, P. Cheung

A Dynamic Reconfiguration Run-Time System .......... 66
J. Burns, A. Donlin, J. Hogg, S. Singh, M. de Wit

Session 4: Architectures for Run Time Reconfiguration

The Swappable Logic Unit: A Paradigm for Virtual Hardware .......... 77
G. Brebner

The Chimaera Reconfigurable Functional Unit .......... 87
S. Hauck, T. Fry, M. Hosler, J. Kao

Session 5: Architecture

Computing Kernels Implemented with a Wormhole RTR CCM .......... 98
R. Bittner Jr., P. Athanas

Mapping Applications to the RaPiD Configurable Architecture .......... 106
C. Ebeling, D. Cronquist, P. Franklin, J. Secosky, S. Berg

Defect Tolerance on the Teramac Custom Computer .......... 116
B. Culbertson, R. Amerson, R. Carter, P. Kuekes, G. Snider

Session 6: Performance

Systems Performance Measurement on PCI Pamette .......... 125
L. Moll, M. Shand

The RAW Benchmark Suite: Computation Structures for General Purpose Computing .......... 134
J. Babb, M. Frank, V. Lee, E. Waingold, R. Barua, M. Taylor, J. Kim, S. Devabhaktuni, A. Agarwal

Session 7: Software Tools

Automated Field-Programmable Compute Accelerator Design using Partial Evaluation .......... 145
Q. Wang, D. Lewis

FPGA Synthesis on the XC6200 using IRIS and Trianus/Hades (Or, from Heaven to Hell and back again) .......... 155
R. Woods, S. Ludwig, J. Heron, D. Trainor, S. Gehring

High Level Compilation for Fine Grained FPGAs .......... 165
M. Gokhale, E. Gomersall

Session 8: CAD Applications

Acceleration of an FPGA Router .......... 175
P. Chan, M. Schlag

Fault Simulation on Reconfigurable Hardware .......... 182
M. Abramovici, P. Menon

Session 9: Image Processing Applications

Automated Target Recognition on SPLASH 2 .......... 192
M. Rencher, B. Hutchings

Real-Time Stereo Vision on the PARTS Reconfigurable Computer .......... 201
J. Woodfill, B. Von Herzen

Increased FPGA Capacity Enables Scalable, Flexible CCMs: An Example from Image Processing .......... 211
J. Greenbaum, M. Baxter

Session 10: Arithmetic Applications

Comparison of Arithmetic Architectures for Reed-Solomon Decoders in Reconfigurable Hardware .......... 219
C. Paar, M. Rosner

Implementation of Single Precision Floating Point Square Root on FPGAs .......... 226
Y. Li, W. Chu

Poster Papers

Datapath-Oriented FPGA Mapping and Placement for Configurable Computing .......... 234
T. Callahan, J. Wawrzynek

Mapping a Real-Time Video Algorithm to a Context-Switched FPGA .......... 236
S. Kelem

A Parallel Hardware Evolvable Computer POLYP .......... 238
U. Tangen, L. Schulte, J. McCaskill

Laser Defect Correction Applications to FPGA Based Custom Computers .......... 240
G. Chapman, B. Dufort

Speech Recognition HMM Training on Reconfigurable Parallel Processor .......... 242
H. Yun, A. Smith, H. Silverman

Efficient Implementation of the DCT on Custom Computers .......... 244
N. Bergmann, Y. Chung, B. Gunther

On Acceleration of the Check Tautology Logic Synthesis Algorithm using an FPGA-based Reconfigurable Coprocessor .......... 246
J. Cong, J. Peck

Index of Authors .......... 249
Mapping Applications to the RaPiD Configurable Architecture*

Carl Ebeling, Darren C. Cronquist, Paul Franklin, Jason Secosky, and Stefan G. Berg
Department of Computer Science and Engineering
University of Washington
Box 352350
Seattle, WA 98195-2350
Abstract

The goal of the RaPiD (Reconfigurable Pipelined Datapath) architecture is to provide high performance configurable computing for a range of computationally-intensive applications that demand special-purpose hardware. This is accomplished by mapping the computation into a deep pipeline using a configurable array of coarse-grained computational units. A key feature of RaPiD is the combination of static and dynamic control. While the underlying computational pipelines are configured statically, a limited amount of dynamic control is provided which greatly increases the range and capability of applications that can be mapped to RaPiD. This paper illustrates this mapping and configuration for several important applications including a FIR filter, 2-D DCT, motion estimation, and parametric curve generation; it also shows how static and dynamic control are used to perform complex computations.
1 Introduction
Field-programmable custom computing machines have attracted a lot of attention recently because of their promise to deliver the high performance provided by special purpose hardware along with the flexibility of general purpose processors. Unfortunately, the promise of configurable computing has yet to be realized in spite of some very successful examples [1, 9]. There are two main reasons for this.

First, configurable computing platforms are currently implemented using commercial FPGAs. These FPGAs are necessarily very fine-grained so they can be used to implement arbitrary circuits, but the overhead of this generality exacts a high price in density and performance. Compared to general purpose processors (including digital signal processors), which use highly optimized functional units that operate in bit-parallel fashion on long data words, FPGAs are somewhat inefficient for performing logical operations and

*This work was supported in part by the Defense Advanced Research Projects Agency under Contract DAAH04-94-G0272. D. Cronquist was supported in part by an IBM fellowship. P. Franklin was supported by an NSF fellowship.
`
even worse for ordinary arithmetic. FPGA-based computing has the advantage only on complex bit-oriented computations like count-ones, find-first-one, or complicated bit-level masking and filtering.

Second, programming an FPGA-based configurable computer is akin to designing an ASIC. The programmer either uses synthesis tools that deliver poor density and performance or designs the circuit manually, which requires both intimate knowledge of the FPGA architecture and substantial design time. Neither alternative is attractive, particularly for simple computations that can be described in a few lines of C code.

Our response to these two problems is RaPiD, a coarse-grained configurable architecture for constructing deep computational pipelines. RaPiD is aimed at regular, computation-intensive tasks like those found in digital signal processing (DSP). RaPiD provides a large number of ALUs, multipliers, registers and memory modules that can be configured into the appropriate pipelined datapath. The datapaths constructed in RaPiD are linear arrays of functional units communicating in mostly nearest-neighbor fashion. Systolic algorithms [4], for example, map very well into RaPiD datapaths, allowing us to take advantage of the considerable research on compiling computations to systolic arrays [5, 7]. However, RaPiD is not limited to implementing systolic algorithms; a pipeline can be constructed which comprises different computations at different stages and at different times.

We begin with an overview of the RaPiD architecture; for a more detailed description see [3]. We then give a general description of how computations map to RaPiD using a FIR filter as an example, and then present how the architectural features of RaPiD are used to perform more complex computations found in 2-D DCT, motion estimation, and parametric curve generation.
`
`2 The RaPiD Datapath Architecture
`RaPiD is a linear array of functional units which
`is configured to form a mostly linear computational
`pipeline. This array of functional units is divided into
`identical cells which are replicated to form a complete
`array. Figure 1 shows the cell used in RaPiD-1, the
`
0-8186-8159-4/97 $10.00 © 1997 IEEE
first version of the RaPiD architecture. This cell comprises an integer multiplier, three integer ALUs, six general-purpose "datapath registers" and three small local memories. A typical single-chip RaPiD array would contain between 8 and 32 of these cells.

Figure 1: A basic RaPiD cell which is replicated left to right to form a complete chip. RaPiD-1 contains 16 cells similar to this one, with fourteen 16-bit buses. (The cell comprises datapath registers and input muxes, bus connectors, and output drivers.)
`
2.1 Datapath Composition

The functional units are interconnected using a set of segmented buses that run the length of the datapath. The functional units use an n:1 multiplexer to select their data inputs from one of the n−2 bus segments in the adjacent tracks. The additional inputs provide fixed zero or feedback lines, which can be used to clear and hold register values, or to use an ALU as an accumulator. Each functional unit output includes optional registers to accommodate pipeline delays and a set of tristate drivers to drive their output onto one or more bus segments.

The buses in different tracks are segmented into different lengths to make the most efficient use of the connection resources. In some tracks, adjacent bus segments can be connected together by a bus connector as shown in Figure 1. This connection can be programmed in either direction via a unidirectional buffer or pipelined with up to three register delays, allowing data pipelines to be built in the bus structure itself.

RaPiD's ALUs perform the usual logical and arithmetic operations on one word of data. The ALUs can be chained for wide-integer operations. The multiplier inputs two single-word numbers and produces a double-word result, shifted by a statically programmed amount to maintain a given fixed-point representation. Both words of the result are available as separate outputs.
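As a sketch of the multiplier behavior just described, the following C model (names and struct layout are ours, not RaPiD's) takes two single-word 16-bit operands, forms the double-word product, applies a statically programmed shift, and exposes both result words separately:

```c
#include <stdint.h>

/* Toy model of the RaPiD multiplier: two 16-bit words in, a 32-bit
   double-word product out, shifted left by a statically programmed
   amount to maintain a chosen fixed-point alignment.  Both halves of
   the result are available as separate outputs, as in the paper. */
typedef struct { int16_t hi; uint16_t lo; } mul_result;

mul_result rapid_mul(int16_t a, int16_t b, unsigned shift) {
    /* widen before shifting so the static shift cannot overflow */
    int64_t p = (int64_t)a * (int64_t)b << shift;
    mul_result r;
    r.hi = (int16_t)((p >> 16) & 0xFFFF);  /* high result word */
    r.lo = (uint16_t)(p & 0xFFFF);         /* low result word  */
    return r;
}
```

For example, treating both operands as Q8.8 fixed point, 1.0 × 1.5 (0x0100 × 0x0180) yields a Q16.16 product whose high word is the integer part.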
The datapath registers serve a variety of purposes in RaPiD. These registers can be used to store constants loaded during initialization and temporary values. They can be used as additional multiplexers to simplify control; like any functional unit, the registers can be disabled. They are also used while routing RaPiD applications to connect bus segments in different tracks and/or for additional pipeline delays.

In many applications, the data is partitioned into blocks which are loaded once, saved locally, reused as needed, and then discarded. The local memories provided in each cell of the datapath serve this purpose. Each local memory has a specialized datapath register used as an address register; one of the bus inputs to this address register is replaced by an incrementing feedback path. Like the SILOs found in the Philips VSP [8], this supports the common case of sequential memory accesses. More complex addressing patterns can be generated using registers and ALUs in the datapath.
Input and output data enter and exit RaPiD via I/O streams at each end of the datapath. Each stream contains a FIFO filled with data required or with results produced by the computation. The datapath explicitly reads from an input stream to obtain the next input data value and writes to an output stream to store a result.

External memory operations are carried out independent of the RaPiD array via three I/O streams by placing FIFOs between the array and a memory controller. In addition to carrying out the memory operations, the memory controller generates statically determined sequences of addresses for each stream. If the datapath reads a value from an empty FIFO or writes a value to a full FIFO, the datapath is stalled until the FIFO is ready.
`
2.2 Control Path

For the most part, the structure of a pipeline is statically configured. However, there are almost always some pipeline control signals that must be dynamic. For example, constants are loaded into datapath registers during initialization but then remain unchanged. The load signals of the datapath registers thus take on different values during initialization and computation. More complex examples include double-buffering the local memories and performing data-dependent calculations.

The control signals are thus divided into static control signals provided by configuration memory as in ordinary FPGAs, and control signals which can be dynamically programmed on every cycle. RaPiD is programmed for a particular application by first mapping the computation onto a datapath pipeline. The static programming bits are used to construct this pipeline and the dynamic programming bits are used to schedule the datapath operations over time. These dynamic control bits are provided by a small pipelined control path, not by the more typical local microprogrammed, SIMD, or VLIW control.
`

Dynamic control is implemented by inserting a few "context" bits each cycle into a pipelined control path that parallels the datapath. This context contains all the information required by the various pipeline stages to compute their dynamic control signals. The control path contains 1-bit segmented buses similar in structure to the datapath buses, as shown in Figure 2. (Signals which can be dynamic but do not need to change during a particular computation are connected to the static zero line.) Control values are inserted by a global pipeline controller at one end of the control path and are passed from stage to stage where they are applied to the appropriate control signals. Since applications generally use only a few dynamic control signals and use similar pipeline stages, the number of control signals in the control path is relatively small.

Figure 2: Dynamic control generation for part of a RaPiD cell; these control buses are one bit wide.
`
Each dynamic control signal is derived from the information contained in the control path. Usually the signal is simply connected to one of the bits in the control path, but in more complex cases lookup-tables embedded in the control path are used to compute the control signal based on more information including bits in the control path, status from ALUs in the datapath, or feedback when implementing simple FSM controllers. The generation of dynamic control is illustrated in detail in the applications that follow.
`
2.3 RaPiD-1 Design Features

Most of the design and layout of the RaPiD-1 chip, the first implementation of the RaPiD architecture, is complete. This section presents those details of RaPiD-1 useful in understanding the performance results discussed for each application presented in the following sections.

RaPiD-1's datapath is based on 16-bit fixed-point integers; to accommodate this, the multipliers can be statically programmed to shift their 32-bit output arbitrarily. Each RaPiD-1 cell contains three ALUs, one multiplier, and three 32-word local memories. Fourteen tracks are provided for the segmented data buses, which are supplemented by the zero and feedback inputs available to each functional input. The 16 cells each have the functional units shown in Figure 1, in addition to control logic and up to 15 control buses. The RaPiD-1 array is designed to be clocked at 100 MHz, and reconfiguration time for the array is conservatively estimated to be 2000 cycles.
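These parameters imply the peak multiply-accumulate rate quoted later (Section 4.3); a one-line sanity check, with values taken from this section:

```c
/* Peak multiply-accumulate rate of RaPiD-1: 16 cells, one multiplier
   per cell, one multiply-accumulate per cell per cycle, 100 MHz clock. */
double rapid1_peak_gops(void) {
    return 16 * 100e6 / 1e9;  /* billions of multiply-accumulates per second */
}
```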
`
3 Programming Model

Mapping applications to RaPiD involves designing the underlying datapath and providing the dynamic control required for the different parts of the computation. The control design can be complicated because control signals are generated at different times and travel at different rates. We have designed the RaPiD B programming language to accommodate these control patterns. Our RaPiD B compiler, which produces a placed and routed implementation along with the dynamic control program, is nearly complete. This section first describes a FIR (Finite Impulse Response) filter, a simple application useful for illustrating some of the basic features of RaPiD. It then briefly presents the timing models used by RaPiD B and by the remainder of this paper.
`
3.1 FIR Filter Computation

Digital FIR filters are used in many signal processing applications, typically for eliminating unwanted frequency components from a signal. Figure 3a gives a specification for a FIR filter with NumTaps taps and NumX inputs. The filter weights are stored in the W array, the input in the X array, and the output in the Y array (starting at array location NumTaps − 1). Figure 3b shows the entire computation required for a single output of a 4-tap FIR filter.
`
Y[i] := 0
for j := 0 to NumTaps-1
    Y[i] := Y[i] + X[i-j]*W[j]
end

Figure 3: FIR filter. (a) Algorithm. (b) Computation for NumTaps=4 and i=6: Y[6] = X[6]*W[0] + X[5]*W[1] + X[4]*W[2] + X[3]*W[3].
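The algorithm of Figure 3a translates directly into C; this sketch follows the figure's indexing, with outputs beginning at index NumTaps−1 where a full window of inputs exists:

```c
/* FIR filter of Figure 3a: Y[i] = sum over j of X[i-j] * W[j].
   X holds NumX input samples, W holds NumTaps weights, and the
   caller provides a Y array of at least NumX entries. */
void fir(const int *X, const int *W, int *Y, int NumTaps, int NumX) {
    for (int i = NumTaps - 1; i < NumX; i++) {
        Y[i] = 0;
        for (int j = 0; j < NumTaps; j++)
            Y[i] += X[i - j] * W[j];  /* one multiply-accumulate per tap */
    }
}
```

For NumTaps = 4 and i = 6 this computes exactly the sum shown in Figure 3b.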
`
The circuit in Figure 4a performs the entire computation for one output value in a single cycle; it is easily obtained by unrolling the inner loop of the program
`
`

`

in Figure 3a. Unfortunately, the circuit shown in Figure 4a has poor performance characteristics (note the combinational path through all of the adders, which scales linearly with the number of weights). A retimed version of this circuit is shown in Figure 4b; the retimed circuit performs substantially better than the original, particularly for larger computations.

Figure 4: Schematic diagrams for four-tap FIR filter (a) as viewed in RaPiD B, grouping related computation and (b) as a high-performance pipelined implementation.

Specifying this retimed circuit directly is difficult because of the complexity of the relative timing of the internal data and control signals. It is much easier to specify the computation somewhat naively as in Figure 4a, knowing that retiming can transform it into a high-performance, pipelined circuit. This becomes particularly evident in circuits with more complicated control, and when more aggressive steps, such as using the pipeline stage available in RaPiD's multiplier, are needed to achieve the desired performance. Therefore, the RaPiD B compiler retimes the resulting netlist based on [6].

All of the applications presented in the following sections have been specified in a preliminary version of RaPiD B and simulated to validate the implementations described and the accompanying cycle count. For ease of explanation, the computations shown throughout this paper are shown before the full retiming performed by the RaPiD B compiler. A preliminary version of the RaPiD B toolset is nearly complete, including compilation, retiming, control synthesis, and full placement and routing of the resulting RaPiD circuit.

4 FIR Filter Implementation

4.1 Simple Case

As with most applications, there are a variety of ways to map a FIR filter to RaPiD. The choice of mapping is driven by the parameters of both the RaPiD array and the application. For example, if the number of taps is less than the number of RaPiD multipliers, then each multiplier is assigned to multiply a specific weight. The weights are first preloaded into datapath registers whose outputs drive the input of a specific multiplier. Pipeline registers are used to stream the X inputs and Y outputs. Since each Y output must see NumTaps inputs, the X and Y buses must be pipelined at different rates. Figure 5a shows one cell of the FIR filter (several stages are shown in Figure 4b) with the X input bus doubly pipelined and the Y input bus singly pipelined.

Figure 5: (a) Netlist for one cell of the simple FIR filter. (b) One tap of the FIR filter mapped to the RaPiD array (this is replicated to form more taps).

This implementation maps easily to the RaPiD array, as shown for one tap in Figure 5b. For clarity, all unused functional units are removed, and used buses are highlighted. The bus connectors from Figure 1 are left open to represent no connection and boxed to represent a register. The control for this mapping consists of two phases of execution: loading the weights and computing the output results.
`

`

In the first phase, the weights are sent down the IN double pipeline along with a singly pipelined control bit which connects the input of each datapath register to the IN bus. When the final weight is inserted, the control bit is switched, and the input is connected to the feedback line. Since the control bit travels twice as fast as the weights, each datapath register will hold a unique weight. No special signals are required to begin the computation; the second phase implicitly starts when the control bit goes low.

4.2 Increasing the Number of Taps

If the number of taps exceeds the number of RaPiD multipliers, the multipliers must be time-shared between several taps. This can be achieved in RaPiD by using a local memory to store several weights per stage. Figure 6 shows our implementation for this mapping. Unlike the simple case, we make the arbitrary choice of doubly pipelining the Y output values and singly pipelining the X input values.

Figure 6: Netlist for one cell of the extended FIR filter. The top pipelined bus streams in the X inputs (the weights during initialization) while the bottom bus streams out the intermediate Y values. The left RAM holds the weights to be multiplied with the X inputs; the right RAM holds the intermediate Y values, which shift down until they are sent to the next stage.

As a new X is read from external memory, the first stage replicates it and presents it to the input datapath for several cycles. Each stage can multiply this X by its weights in turn and add it to one of its stored intermediate values. At this point a new X value will be fetched from memory and the cycle repeats.

There are the same number of intermediate values as there are weights per stage. These intermediate values are stored in a second local memory. Let's examine the stage holding weights W55, W54, W53, and W52 (four taps per stage). A new input value X20 appears on the input datapath. In four cycles the partial sums for Y75, Y74, Y73, and Y72 will be computed. These are stored in that order in the local memory holding the intermediate values. At this point, X20 moves to the next pipeline stage followed by the intermediate value Y72. The next input, X21, appears on the input datapath along with the intermediate value Y76 from the previous stage. Now the partial sums for Y76, Y75, Y74, and Y73 are computed.

4.3 FIR Performance

When the number of taps is a multiple of 16 the weights can be partitioned evenly across the stages and the allocated functional units are fully utilized. RaPiD-1 (Section 2.3) can therefore operate at very near its peak performance of 1.6 GOPS (where GOPS is a billion multiply-accumulates per second).

5 Discrete Cosine Transform

The discrete cosine transform (DCT) is used frequently in signal processing and graphics applications. For example, the 2-D DCT is used in JPEG/MPEG data compression to convert an image from the spatial domain to the frequency domain. A 2-D DCT can be decomposed into two sequential 1-D DCTs. We first describe how the 1-D DCT can be computed on RaPiD and then show how two 1-D DCTs can be composed to perform a 2-D DCT.

5.1 1-D DCT

An N-point 1-D DCT partitions an input vector A into N-element sub-vectors, and for each resulting sub-vector A_k computes

    y_{ki} = \sum_{n=0}^{N-1} a_{kn} \cos\frac{\pi i}{2N}(2n+1)    (1)

for 0 ≤ i ≤ N−1, where a_{kn} is the n-th element of sub-vector A_k (and the (kN+n)-th element of vector A). The N² cosine terms are constant over all sub-vectors and hence can be read once as precomputed weights W where w_{ni} = \cos\frac{\pi i}{2N}(2n+1). This reduces Equation 1 to

    y_{ki} = \sum_{n=0}^{N-1} a_{kn} w_{ni}    (2)

for 0 ≤ i ≤ N−1.¹ By viewing input vector A and weights W as matrices A and W, Equation 2 reduces to the matrix multiply Y = A × W. Thus, to compute a 1-D DCT, RaPiD performs a matrix multiply.

To implement an 8-point 1-D DCT on an 8 × 8 input matrix A (i.e., a 64-element vector), the entire 8 × 8 weight matrix W is stored in RaPiD's local memories, one column per cell. Each cell of the resulting pipeline is configured as shown in Figure 7. The A matrix is passed through the array in row-major order. Within each cell, the local memory address is incremented each cycle, and a register accumulates the dot product of the stored column and the incoming row. When a cell receives the last element of a row, the resulting product is passed down an output pipeline, the address is cleared, and the cell is ready to compute the product of the next row on the next cycle. This effectively computes the matrix multiply of A × W.

¹To produce the final DCT result each y_{ki} must be multiplied by a scaling factor involving E_i, where E_i = 4 if i = 0 and E_i = 1 otherwise. For our purposes we ignore this scaling factor and focus on the computation of y_{ki}.
`
`

`

`fate te
`Amn! N-Point N-Point|Vis|_Zai_ Zim_|
`
`fy
`one
`-
`trans ose
`Pt 7
`1-D DCT
`\-D DCT
`A RowofmatrixAstreamsin Le|R
`
`
`
`
`
`Figure 8: 2-D N x N DCT
`
`Row of the resuliing praduct srcams out
`
`A)
`
`We show the implementation of an 8 x 8 2-D DCT
`on a 16-cell RaPiD array. Consider an M x N image
`and an 8 x 8 weight matrix W. First, the image is
`divided into AN sub-images of size 8 x 8. The com-
`putation for each sub-image A is outlined in Figure 9,
`
Figure 9: To compute the 2-D DCT, an M × N image is partitioned into 8 × 8 sub-images. RaPiD computes each 1-D DCT by multiplying the sub-image by an 8 × 8 weight matrix. [Diagram labels: the first 1-D DCT is computed by the first 8 stages and the second by the last 8 stages; intermediate results are stored in RAM and transposed by control.]
`
Figure 7: Netlist for one cell of a matrix multiply. The top pipelined bus streams in the A matrix (in row-major order) while the bottom bus streams out the resulting matrix product (also in row-major order). The top bus also streams the W columns into the local memories prior to the computation.
`
5.2 2-D DCT
`
An N × N 2-D DCT partitions an input matrix into sub-matrices of size N × N, and for each resulting sub-matrix A, computes

Y_ij = Σ_{m=0}^{N−1} Σ_{n=0}^{N−1} A_mn cos(πi(2m+1)/2N) cos(πj(2n+1)/2N)    (3)

for 0 ≤ i, j ≤ N−1.² As with the 1-D DCT, Equation 3 is reduced using the N² precomputed W weights, yielding

Y_ij = Σ_{m=0}^{N−1} Σ_{n=0}^{N−1} A_mn w_mi w_nj    (4)

for 0 ≤ i, j ≤ N−1. Extracting w_mi from the inner summation leaves

Y_ij = Σ_{m=0}^{N−1} w_mi ( Σ_{n=0}^{N−1} A_mn w_nj ),    (5)

which in matrix form is Y = ((A × W)ᵀ × W)ᵀ.
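The reduction from the double summation to the factored two-multiply form can be checked numerically. A small NumPy sketch (N = 8 and the random test matrix are our own choices, not from the paper):

```python
import numpy as np

N = 8
# Precomputed weights w_mi = cos(pi * i * (2m + 1) / 2N), as in the 1-D DCT.
m = np.arange(N).reshape(-1, 1)
i = np.arange(N).reshape(1, -1)
W = np.cos(np.pi * i * (2 * m + 1) / (2 * N))

rng = np.random.default_rng(0)
A = rng.standard_normal((N, N))

# Direct double summation: Y_ij = sum_m sum_n A_mn * w_mi * w_nj
Y_direct = np.einsum('mn,mi,nj->ij', A, W, W)

# Factored form: Y = ((A x W)^T x W)^T -- two 1-D DCTs with a transpose between.
Y_factored = ((A @ W).T @ W).T

assert np.allclose(Y_direct, Y_factored)
```

The assertion holds because the transpose simply re-aligns the intermediate Z = A × W so that the second multiply by W sums over m, exactly as in the extracted form above.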
`
Since a 2-D DCT performs two multiplies by the same weight matrix, W is loaded only once: one column per cell in both the first 8 cells and the last 8 cells. The transpose in between matrix multiplies is performed with two local memories per cell: one to store products of the current sub-image and the other to store the products of the previous sub-image. During the computation of the current sub-image, the transpose of the previous sub-image computation is passed to the next 8 cells. The datapath for one RaPiD cell of a 2-D DCT is shown in Figure 10.
[Figure 10: datapath for one cell of the 2-D DCT. One RAM stores the current 1-D DCT results; the other stores the previous results, which are read out to perform the transpose.]
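The two-local-memory transpose scheme can likewise be sketched behaviorally. The class and method names below are illustrative assumptions, not RaPiD constructs:

```python
# Behavioral sketch of the ping-pong transpose between the two 1-D DCT
# passes: while the current sub-image's results fill one RAM in row-major
# order, the previous sub-image's results are read out of the other RAM
# in transposed (column-major) order and fed to the last 8 cells.

class TransposeBuffer:
    def __init__(self, n=8):
        self.n = n
        self.rams = [[0] * (n * n), [0] * (n * n)]
        self.current = 0  # which RAM receives the current sub-image

    def swap(self):
        # Called between sub-images: the roles of the two RAMs flip.
        self.current ^= 1

    def write(self, row, col, value):
        # Row-major write of the current sub-image's 1-D DCT results.
        self.rams[self.current][row * self.n + col] = value

    def read_transposed(self, row, col):
        # Column-major read of the previous sub-image's results,
        # which delivers the transpose to the second 1-D DCT.
        return self.rams[self.current ^ 1][col * self.n + row]
```

A value written at (row, col) before a swap is read back at (col, row) afterward, which is exactly the transpose the second set of 8 cells needs.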
