`
`IEEE Symposium on
`
FPGAs for Custom Computing Machines
`
April 16-18, 1997
`Napa Valley, California
`
Edited by Kenneth L. Pocek and Jeffrey Arnold
`
`Sponsored by the IEEE Computer Society Technical Committee on Computer Architecture
`
IEEE Computer Society
`
`Petitioner Microsoft Corporation - Ex. 1009, Cover 1
`
`
`
`FCCM’97
`
`
`
1997
`
`
`
`
Copyright © 1997 by The Institute of Electrical and Electronics Engineers, Inc.
`All rights reserved
`
Copyright and Reprint Permissions: Abstracting is permitted with credit to the source. Libraries may
photocopy beyond the limits of US copyright law, for private use of patrons, those articles in this volume that
carry a code at the bottom of the first page, provided that the per-copy fee indicated in the code is paid
through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923.
`
Other copying, reprint, or republication requests should be addressed to: IEEE Copyrights Manager, IEEE
Service Center, 445 Hoes Lane, P.O. Box 1331, Piscataway, NJ 08855-1331.
`
The papers in this book comprise the proceedings of the meeting mentioned on the cover and title page. They
reflect the authors' opinions and, in the interests of timely dissemination, are published as presented and
without change. Their inclusion in this publication does not necessarily constitute endorsement by the
editors, the IEEE Computer Society, or the Institute of Electrical and Electronics Engineers, Inc.
`
IEEE Computer Society Order Number PR08159
ISBN 0-8186-8159-4
ISBN 0-8186-8160-8 (case)
ISBN 0-8186-8161-6 (microfiche)
IEEE Order Plan Catalog Number 97TB100186
ISSN 1082-3409
`
`
IEEE Computer Society
Customer Service Center
10662 Los Vaqueros Circle
P.O. Box 3014
Los Alamitos, CA 90720-1314
Tel: +1-714-821-8380
Fax: +1-714-821-4641
E-mail: cs.books@computer.org

Additional copies may be ordered from:

IEEE Service Center
445 Hoes Lane
P.O. Box 1331
Piscataway, NJ 08855-1331
Tel: +1-908-981-1393
Fax: +1-908-981-9667
mis.custserv@computer.org

IEEE Computer Society
13, Avenue de l'Aquilon
B-1200 Brussels
BELGIUM
Tel: +32-2-770-2198
Fax: +32-2-770-8505
euro.ofc@computer.org

IEEE Computer Society
Ooshima Building
2-19-1 Minami-Aoyama
Minato-ku, Tokyo 107
JAPAN
Tel: +81-3-3408-3118
Fax: +81-3-3408-3553
tokyo.ofc@computer.org
`
`Editorial production by Bob Werner
`
Cover art production by Joe Daigle/Studio Productions
`
`Printed in the United States of America by Technical Communication Services
`
`
`
`
`
`Table of Contents
`
`Symposium on Field-Programmable Custom Computing Machines — FCCM’97
`
Introduction ................................................................................................................................................... ix

Program Committee ....................................................................................................................................... x
`
`Session 1: Device Architecture
`
An FPGA Architecture for DRAM-based Systolic Computations ............................................................. 2
N. Margolus
`
`Garp: A MIPS Processor with a Reconfigurable Coprocessor............................................................... 12
`
`J. Hauser, J. Wawrzynek
`
A Time-Multiplexed FPGA ......................................................................................................................... 22
S. Trimberger, D. Carberry, A. Johnson, J. Wong
`
`Session 2: Communication Applications
`
An FPGA-Based Coprocessor for ATM Firewalls .................................................................................... 30
J. McHenry, P. Dowd, T. Carrozzi, F. Pellegrino, W. Cocks
`
A Wireless LAN Demodulator in a Pamette: Design and Experience .................................................... 40
T. McDermott, P. Ryan, M. Shand, D. Skellern, T. Percival, N. Weste
`
`Session 3: Run Time Reconfiguration
`
`Incremental Reconfiguration for Pipelined Applications ....................................................................... 47
`H. Schmit
`
Compilation Tools for Run-Time Reconfigurable Designs ...................................................................... 56
W. Luk, N. Shirazi, P. Cheung
`
`A Dynamic Reconfiguration Run-Time System ......................................................................................... 66
`
J. Burns, A. Donlin, J. Hogg, S. Singh, M. de Wit
`
`Session 4: Architectures for Run Time Reconfiguration
`
The Swappable Logic Unit: A Paradigm for Virtual Hardware ............................................................... 77
G. Brebner
`
`Petitioner Microsoft Corporation - Ex. 1009, ToC i
`Petitioner Microsoft Corporation - EX. 1009, ToC iW
`
`
`
The Chimaera Reconfigurable Functional Unit ......................................................................................... 87
S. Hauck, T. Fry, M. Hosler, J. Kao
`
`Session 5: Architecture
`
Computing Kernels Implemented with a Wormhole RTR CCM ............................................................. 98
R. Bittner Jr., P. Athanas

Mapping Applications to the RaPiD Configurable Architecture ........................................................... 106
C. Ebeling, D. Cronquist, P. Franklin, J. Secosky, S. Berg

Defect Tolerance on the Teramac Custom Computer ............................................................................ 116
B. Culbertson, R. Amerson, R. Carter, P. Kuekes, G. Snider
`
`Session 6: Performance
`
Systems Performance Measurement on PCI Pamette ............................................................................. 125
L. Moll, M. Shand

The RAW Benchmark Suite: Computation Structures for General Purpose Computing ........................
J. Babb, M. Frank, V. Lee, E. Waingold, R. Barua, M. Taylor, J. Kim, S. Devabhaktuni, A. Agarwal
`
`Session 7: Software Tools
`
Automated Field-Programmable Compute Accelerator Design using
Partial Evaluation ........................................................................................................................................ 145
Q. Wang, D. Lewis

FPGA Synthesis on the XC6200 using IRIS and Trianus/Hades
(Or, from Heaven to Hell and back again) ................................................................................................ 155
R. Woods, S. Ludwig, J. Heron, D. Trainor, S. Gehring
`
High Level Compilation for Fine Grained FPGAs .................................................................................. 165
M. Gokhale, E. Gomersall
`
`Session 8: CAD Applications
`
`Acceleration of an FPGA Router .................................................................................................................... 175
`P. Chen, M. Schlag
`
Fault Simulation on Reconfigurable Hardware ....................................................................................... 182
M. Abramovici, P. Menon
`
`
`vi
`
`
`
`Session 9: Image Processing Applications
`
Automated Target Recognition on SPLASH 2 ........................................................................................ 192
M. Rencher, B. Hutchings

Real-Time Stereo Vision on the PARTS Reconfigurable Computer ..................................................... 201
J. Woodfill, B. Von Herzen

Increased FPGA Capacity Enables Scalable, Flexible CCMs:
An Example from Image Processing ..............................................................................................................
J. Greenbaum, M. Baxter
`
`Session 10: Arithmetic Applications
`
Comparison of Arithmetic Architectures for Reed-Solomon Decoders in
Reconfigurable Hardware .......................................................................................................................... 219
C. Paar, M. Rosner

Floating Point Square Root on FPGAs ..................................................................................................... 226
`
`Poster Papers
`
Datapath-Oriented FPGA Mapping and Placement for Configurable Computing ...................................
T. Callahan, J. Wawrzynek

Mapping a Real-Time Video Algorithm to a Context-Switched FPGA ............................................... 236
S. Kelem

A Parallel Hardware Evolvable Computer POLYP ................................................................................ 238
U. Tangen, L. Schulte, J. McCaskill
`
`Laser Defect Correction Applications to FPGA Based Custom Computers .................................. 240
`G. Chapman, B. Dufort
`
Speech Recognition HMM Training on Reconfigurable Parallel Processor ........................................ 242
H. Yun, A. Smith, H. Silverman
`
Efficient Implementation of the DCT on Custom Computers ............................................................... 244
N. Bergmann, Y. Chung, B. Gunther
`
On Acceleration of the Check Tautology Logic Synthesis Algorithm using an
FPGA-based Reconfigurable Coprocessor ............................................................................................... 246
J. Cong, J. Peck
`
Index of Authors ......................................................................................................................................... 249
`
`vii
`
`
`
`
`
`Mapping Applications to the RaPiD Configurable Architecture*
`
`Carl Ebeling, Darren C. Cronquist, Paul Franklin, Jason Secosky, and Stefan G. Berg
`Department of Computer Science and Engineering
`University of Washington
`Box 352350
`Seattle, WA 98195—2350
`
Abstract

The goal of the RaPiD (Reconfigurable Pipelined Datapath) architecture is to provide high performance configurable computing for a range of computationally-intensive applications that demand special-purpose hardware. This is accomplished by mapping the computation into a deep pipeline using a configurable array of coarse-grained computational units. A key feature of RaPiD is the combination of static and dynamic control. While the underlying computational pipelines are configured statically, a limited amount of dynamic control is provided which greatly increases the range and capability of applications that can be mapped to RaPiD. This paper illustrates this mapping and configuration for several important applications including a FIR filter, 2-D DCT, motion estimation, and parametric curve generation; it also shows how static and dynamic control are used to perform complex computations.
`
1 Introduction
`
Field-programmable custom computing machines have attracted a lot of attention recently because of their promise to deliver the high performance provided by special purpose hardware along with the flexibility of general purpose processors. Unfortunately, the promise of configurable computing has yet to be realized in spite of some very successful examples [1, 9]. There are two main reasons for this.

First, configurable computing platforms are currently implemented using commercial FPGAs. These FPGAs are necessarily very fine-grained so they can be used to implement arbitrary circuits, but the overhead of this generality exacts a high price in density and performance. Compared to general purpose processors (including digital signal processors), which use highly optimized functional units that operate in bit-parallel fashion on long data words, FPGAs are somewhat inefficient for performing logical operations and
`
*This work was supported in part by the Defense Advanced Research Projects Agency under Contract DAAH04-94-G-0272. D. Cronquist was supported in part by an IBM fellowship. P. Franklin was supported by an NSF fellowship.
`
even worse for ordinary arithmetic. FPGA-based computing has the advantage only on complex bit-oriented computations like count-ones, find-first-one, or complicated bit-level masking and filtering.
`Second, programming an FPGA—based configurable
`computer is akin to designing an ASIC. The program-
`mer either uses synthesis tools that deliver poor den—
`sity and performance or designs the circuit manually,
`which requires both intimate knowledge of the FPGA
`architecture and substantial design time. Neither al—
`ternative is attractive, particularly for simple compu-
`tations that can be described in a few lines of C code.
`Our response to these two problems is RaPiD, a
`coarse-grained configurable architecture for construct-
`ing deep computational pipelines. RaPiD is aimed at
`regular, computation—intensive tasks like those found
`in digital signal processing (DSP). RaPiD provides a
large number of ALUs, multipliers, registers and memory modules that can be configured into the appropriate pipelined datapath. The datapaths constructed in RaPiD are linear arrays of functional units communicating in mostly nearest-neighbor fashion. Systolic algorithms [4], for example, map very well into
`RaPiD datapaths, allowing us to take advantage of
`the considerable research on compiling computations
`to systolic arrays [5, 7]. However, RaPiD is not limited
`to implementing systolic algorithms; a pipeline can be
`constructed which comprises different computations at
`different stages and at different times.
`We begin with an overview of the RaPiD architec—
`ture; for a more detailed description see [3]. We then
`give a general description of how computations map
`to RaPiD using a FIR filter as an example, and then
`present how the architectural features of RaPiD are
`used to perform more complex computations found in
`2-D DCT, motion estimation, and parametric curve
`generation.
`
`2 The RaPiD Datapath Architecture
`
`RaPiD is a linear array of functional units which
`is configured to form a mostly linear computational
`pipeline. This array of functional units is divided into
`identical cells which are replicated to form a complete
`array. Figure 1 shows the cell used in RaPiD-l, the
`
`0—8 18681594197 $10.00 © 199'? IEEE
`
`106
`
`
`
`
first version of the RaPiD architecture. This cell comprises an integer multiplier, three integer ALUs, six general-purpose "datapath registers" and three small local memories. A typical single-chip RaPiD array would contain between 8 and 32 of these cells.

[Figure 1 labels: datapath registers, bus connectors, input muxes, output drivers]

Figure 1: A basic RaPiD cell which is replicated left to right to form a complete chip. RaPiD-1 contains 16 cells similar to this one, with fourteen 16-bit buses.

2.1 Datapath Composition

The functional units are interconnected using a set of segmented buses that run the length of the datapath. The functional units use an n:1 multiplexer to select their data inputs from one of the n-2 bus segments in the adjacent tracks. The additional inputs provide fixed zero or feedback lines, which can be used to clear and hold register values, or to use an ALU as an accumulator. Each functional unit output includes optional registers to accommodate pipeline delays and a set of tristate drivers to drive their output onto one or more bus segments.

The buses in different tracks are segmented into different lengths to make the most efficient use of the connection resources. In some tracks, adjacent bus segments can be connected together by a bus connector as shown in Figure 1. This connection can be programmed in either direction via a unidirectional buffer or pipelined with up to three register delays, allowing data pipelines to be built in the bus structure itself.

RaPiD's ALUs perform the usual logical and arithmetic operations on one word of data. The ALUs can be chained for wide-integer operations. The multiplier inputs two single-word numbers and produces a double-word result, shifted by a statically programmed amount to maintain a given fixed-point representation. Both words of the result are available as separate outputs.

The datapath registers serve a variety of purposes in RaPiD. These registers can be used to store constants loaded during initialization and temporary values. They can be used as additional multiplexers to simplify control; like any functional unit, the registers can be disabled. They are also used while routing RaPiD applications to connect bus segments in different tracks and/or for additional pipeline delays.

In many applications, the data is partitioned into blocks which are loaded once, saved locally, reused as needed, and then discarded. The local memories provided in each cell of the datapath serve this purpose. Each local memory has a specialized datapath register used as an address register; one of the bus inputs to this address register is replaced by an incrementing feedback path. Like the STLOs found in the Philips VSP [3], this supports the common case of sequential memory accesses. More complex addressing patterns can be generated using registers and ALUs in the datapath.

Input and output data enter and exit RaPiD via I/O streams at each end of the datapath. Each stream contains a FIFO filled with data required or with results produced by the computation. The datapath explicitly reads from an input stream to obtain the next input data value and writes to an output stream to store a result.

External memory operations are carried out independent of the RaPiD array via three I/O streams by placing FIFOs between the array and a memory controller. In addition to carrying out the memory operations, the memory controller generates statically determined sequences of addresses for each stream. If the datapath reads a value from an empty FIFO or writes a value to a full FIFO, the datapath is stalled until the FIFO is ready.

2.2 Control Path

For the most part, the structure of a pipeline is statically configured. However, there are almost always some pipeline control signals that must be dynamic. For example, constants are loaded into datapath registers during initialization but then remain unchanged. The load signals of the datapath registers thus take on different values during initialization and computation. More complex examples include double-buffering the local memories and performing data-dependent calculations.

The control signals are thus divided into static control signals provided by configuration memory as in ordinary FPGAs, and control signals which can be dynamically programmed on every cycle. RaPiD is programmed for a particular application by first mapping the computation onto a datapath pipeline. The static programming bits are used to construct this pipeline and the dynamic programming bits are used to schedule the datapath operations over time. These dynamic control bits are provided by a small pipelined control path, not by the more typical local microprogrammed, SIMD, or VLIW control.

107
`
`
`
Dynamic control is implemented by inserting a few "context" bits each cycle into a pipelined control path that parallels the datapath. This context contains all the information required by the various pipeline stages to compute their dynamic control signals. The control path contains 1-bit segmented buses similar in structure to the datapath buses, as shown in Figure 2. (Control signals that do not change during a particular computation are connected to the static zero line.) Control values are inserted by a global pipeline controller at one end of the control path and are passed from stage to stage where they are applied to the appropriate control signals. Since applications generally use only a few dynamic control signals and use similar pipeline stages, the number of control signals in the control path is relatively small.

Figure 2: Dynamic control generation for part of a RaPiD cell; these control buses are one bit wide.

Each dynamic control signal is derived from the information contained in the control path. Usually the signal is simply connected to one of the bits in the control path, but in more complex cases lookup-tables embedded in the control path are used to compute the control signal based on more information including bits in the control path, status from ALUs in the datapath, or feedback when implementing simple FSM controllers. The generation of dynamic control is illustrated in detail in the applications that follow.

2.3 RaPiD-1 Design Features

Most of the design and layout of the RaPiD-1 chip, the first implementation of the RaPiD architecture, is complete. This section presents those details of RaPiD-1 useful in understanding the performance results discussed for each application presented in the following sections.

RaPiD-1's datapath is based on 16-bit fixed-point integers; to accommodate this, the multipliers can be statically programmed to shift their 32-bit output arbitrarily. Each RaPiD-1 cell contains three ALUs, one multiplier, and three 32-word local memories. Fourteen tracks are provided for the segmented data buses, which are supplemented by the zero and feedback inputs available to each functional input. The 16 cells each have the functional units shown in Figure 1, in addition to control logic and up to 15 control buses. The RaPiD-1 array is designed to be clocked at 100 MHz, and reconfiguration time for the array is conservatively estimated to be 2000 cycles.

3 Programming Model

Mapping applications to RaPiD involves designing the underlying datapath and providing the dynamic control required for the different parts of the computation. The control design can be complicated because control signals are generated at different times and travel at different rates. We have designed the RaPiD B programming language to accommodate these control patterns. Our RaPiD B compiler, which produces a placed and routed implementation along with the dynamic control program, is nearly complete. This section first describes a FIR (Finite Impulse Response) filter, a simple application useful for illustrating some of the basic features of RaPiD. It then briefly presents the timing models used by RaPiD B and by the remainder of this paper.

3.1 FIR Filter Computation
Digital FIR filters are used in many signal processing applications, typically for eliminating unwanted frequency components from a signal. Figure 3a gives a specification for a FIR filter with NumTaps taps and NumX inputs. The filter weights are stored in the W array, the input in the X array, and the output in the Y array (starting at array location NumTaps - 1). Figure 3b shows the entire computation required for a single output of a 4-tap FIR filter.
[Figure 3a: the algorithm, reconstructed from the text]
    for i = NumTaps-1 to NumX-1
        Y[i] = 0
        for j = 0 to NumTaps-1
            Y[i] = Y[i] + W[j] * X[i-j]
        end
    end

[Figure 3b: the x stream ...x8 x7 x6 x5 x4 x3 x2 x1 x0 multiplied by weights w0..w3 to form y5]

Figure 3: FIR filter. (a) Algorithm. (b) Computation for NumTaps = 4 and i = 5.
`The circuit in Figure 4a performs the entire compu—
`tation for one output value in a single cycle; it is easily
`obtained by unrolling the inner loop of the program
`
`
`108
`
`
`
`in Figure 3a. Unfortunately,
`the circuit shown in Fig-
`ure 4a has poor performance characteristics (note the
`combinational path through all of the adders, which
`scales linearly with the number of weights). A retimed
`version of this circuit is shown in Figure 4b; the re-
`timed circuit performs substantially better than the
`original, particularly for larger computations.
`
Figure 4: Schematic diagrams for four-tap FIR filter (a) as viewed in RaPiD B, grouping related computation and (b) as a high-performance pipelined implementation.
`
Specifying this retimed circuit directly is difficult because of the complexity of the relative timing of the internal data and control signals. It is much easier to specify the computation somewhat naively as in Figure 4a, knowing that retiming can transform it into a high-performance, pipelined circuit. This becomes particularly evident in circuits with more complicated control, and when more aggressive steps, such as using the pipeline stage available in RaPiD's multiplier, are needed to achieve the desired performance. Therefore, the RaPiD B compiler retimes the resulting netlist based on [6].

All of the applications presented in the following sections have been specified in a preliminary version of RaPiD B and simulated to validate the implementations described and the accompanying cycle counts. For ease of explanation, the computations shown throughout this paper are shown before the full retiming performed by the RaPiD B compiler. A preliminary version of the RaPiD B toolset is nearly complete, including compilation, retiming, control synthesis, and full placement and routing of the resulting RaPiD circuit.
`
`4 FIR Filter Implementation
`
`4.1 Simple Case
`
As with most applications, there are a variety of ways to map a FIR filter to RaPiD. The choice of mapping is driven by the parameters of both the RaPiD array and the application. For example, if the number of taps is less than the number of RaPiD multipliers, then each multiplier is assigned to multiply a specific weight. The weights are first preloaded into datapath registers whose outputs drive the input of a specific multiplier. Pipeline registers are used to stream the X inputs and Y outputs. Since each Y output must see NumTaps inputs, the X and Y buses must be pipelined at different rates. Figure 5a shows one cell of the FIR filter (several stages are shown in Figure 4b) with the X input bus doubly pipelined and the Y input bus singly pipelined.
`
`
`
`
`
Figure 5: (a) Netlist for one cell of the simple FIR filter. (b) One tap of the FIR filter mapped to the RaPiD array (this is replicated to form more taps).
`
This implementation maps easily to the RaPiD array, as shown for one tap in Figure 5b. For clarity, all unused functional units are removed, and used buses are highlighted. The bus connectors from Figure 1 are left open to represent no connection and boxed to represent a register. The control for this mapping consists of two phases of execution: loading the weights and computing the output results. In the first phase, the weights are sent down the IN double pipeline along with a singly pipelined control bit which connects the
`
`109
`
`
`
`input of each datapath register to the IN bus. When
`the final weight is inserted, the control bit is switched,
`and the input is connected to the feedback line. Since
`the control bit travels twice as fast as the weights,
`each datapath register will hold a unique weight. No
`special signals are required to begin the computation;
`the second phase implicitly starts when the control bit
`goes low.
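The relative rates described above can be made concrete with a small cycle-accurate software model. This is our own sketch of the pipeline of Figures 4b and 5a, not code from the paper: each tap holds a stationary weight, two X registers, and one Y register, and all registers clock simultaneously.

```python
def systolic_fir(w, xs):
    """Model of the simple FIR pipeline: X doubly pipelined, Y singly."""
    taps = len(w)
    xr1, xr2, yr = [0] * taps, [0] * taps, [0] * taps
    collected = []
    for t in range(len(xs) + taps + 1):
        x_in = xs[t] if t < len(xs) else 0
        # combinational logic reads the current register values...
        new_yr = [(yr[s - 1] if s > 0 else 0) + w[s] * xr1[s]
                  for s in range(taps)]
        new_xr1 = [x_in] + xr2[:-1]   # X advances one tap every two cycles
        new_xr2 = xr1[:]
        # ...then every register is clocked at once
        xr1, xr2, yr = new_xr1, new_xr2, new_yr
        collected.append(yr[-1])
    # output y[i] (valid for i >= NumTaps-1) emerges at end of cycle i + NumTaps
    return [collected[i + taps] for i in range(taps - 1, len(xs))]

# matches the sequential program of Figure 3a
w, xs = [2, 3, 5, 7], list(range(10))
ref = [sum(w[j] * xs[i - j] for j in range(len(w)))
       for i in range(len(w) - 1, len(xs))]
assert systolic_fir(w, xs) == ref
```

Because each Y partial sum advances one stage per cycle while each X advances one stage every two cycles, a given Y value meets a fresh X at every tap, which is exactly why the two buses must be pipelined at different rates.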
`
`4.2
`
`Increasing the Number of Taps
`
`If the number of taps exceeds the number of RaPiD
`multipliers, the multipliers must be time-shared be—
`tween several taps. This can be achieved in RaPiD
`by using a local memory to store several weights per
`stage. Figure 6 shows our implementation for this
mapping. Unlike the simple case, we make the arbitrary choice of doubly pipelining the Y output values and singly pipelining the X input values.
`
[Figure 6 labels: the left RAM holds weights to be multiplied with X inputs; the right RAM holds intermediate Y values, which are sent to the next stage as the weights and X values stream in.]

Figure 6: Netlist for one cell of extended FIR filter. The top pipelined bus streams in the X inputs (the weights during initialization) while the bottom bus streams out the intermediate Y values.
`
As a new X is read from external memory, the first stage replicates it and presents it to the input datapath for several cycles. Each stage can multiply this X by its weights in turn and add it to one of its stored intermediate values. At this point a new X value will be fetched from memory and the cycle repeats.

There are the same number of intermediate values as there are weights per stage. These intermediate values are stored in a second local memory. Let's examine the stage holding weights W55, W54, W53, and W52 (four taps per stage). A new input value X20 appears on the input datapath. In four cycles the partial sums for Y75, Y74, Y73, and Y72 will be computed. These are stored in that order in the local memory holding the intermediate values. At this point, X20 moves to the next pipeline stage followed by the intermediate value Y72. The next input, X21, appears on the input datapath along with the intermediate value Y76 from the previous stage. Now the partial sums for Y76, Y75, Y74, and Y73 are computed.
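The index bookkeeping in this example follows from the FIR definition and can be sketched in a few lines. This is our illustration; the W59..W56 indices for the upstream stage are inferred from the layout, not stated in the text:

```python
def outputs_touched(weight_indices, n):
    """Since y_i = sum_j w_j * x_{i-j}, the term w_j * x_n belongs to
    output y_{n+j}: one intermediate sum is updated per stored weight."""
    return sorted(n + j for j in weight_indices)

# the stage holding W55..W52 turns input X20 into partial sums Y72..Y75
assert outputs_touched({52, 53, 54, 55}, 20) == [72, 73, 74, 75]
# the upstream stage (holding W59..W56) is the one that hands down Y76
assert outputs_touched({56, 57, 58, 59}, 20)[0] == 76
```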
`
4.3 FIR Performance

When the number of taps is a multiple of 16, the weights can be partitioned evenly across the stages and the allocated functional units are fully utilized. RaPiD-1 (Section 2.3) can therefore operate at very near its peak performance of 1.6 GOPS (where GOPS is a billion multiply-accumulates per second).

110
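The peak figure follows directly from the RaPiD-1 parameters of Section 2.3, one multiply-accumulate per cell per cycle:

```python
cells = 16                  # RaPiD-1 cells, one multiplier each
clock_hz = 100_000_000      # 100 MHz design target
peak_macs_per_second = cells * clock_hz
assert peak_macs_per_second == 1_600_000_000  # 1.6 GOPS
```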
`
`5 Discrete Cosine Transform
`
The discrete cosine transform (DCT) is used frequently in signal processing and graphics applications. For example, the 2-D DCT is used in JPEG/MPEG data compression to convert an image from the spatial domain to the frequency domain. A 2-D DCT can be decomposed into two sequential 1-D DCTs. We first describe how the 1-D DCT can be computed on RaPiD and then show how two 1-D DCTs can be composed to perform a 2-D DCT.
`
`5.1
`
`l-D DCT
`
`An N-point 1~D DCT partitions an input vector A into
`N-element sub-vectors, and for each resulting sub~
`vector A}, computes
`N—l
`
`.or
`(I)
`ya; = ”2:0 ahn COS ml?” +1)
`for 0 g i g N — 1, where ah“ is the n—th element of
`sub-vector A}, (and the (hN + n)—th element of vectOr
`ALI The N2 cosine terms are constant over all sub-
`vectors and hence can be read once as precomputed
`wr-rights W where w”; = cos u{1%(2n + 1). This reduces
`Equation 1 to
`
`(9)
`
`i 5 N - 1. By viewing input vector A and
`for 0 _<_
`weights W as matrices A and W, Equation 2 reduces
`to the matrix multiply Y = A x W. Thus, to compute
`a 1-D DCT, RaPiD performs a matrix multiply.
To implement an 8-point 1-D DCT on an 8 × 8 input matrix A (i.e. a 64-element vector), the entire 8 × 8 weight matrix W is stored in RaPiD's local memories, one column per cell. Each cell of the resulting pipeline is configured as shown in Figure 7. The A matrix is passed through the array in row-major order. Within each cell, the local memory address is incremented each cycle, and a register accumulates the dot product of the stored column and the incoming row. When a cell receives the last element of a row, the resulting product is passed down an output pipeline, the address is cleared, and the cell is ready to compute the product of the next row on the next cycle. This effectively computes the matrix multiply of A × W.
`
¹To produce the final DCT result each y_{hi} must be multiplied by a scaling factor E_i, where E_i = 1/√2 if i = 0 and E_i = 1 otherwise. For our purposes we ignore this scaling factor and focus on the computation of y_{hi}.
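The reduction of Equation 1 to the per-cell dot products of Equation 2 can be cross-checked numerically. This Python sketch is ours, not part of the paper, and ignores the same scaling factor the text does:

```python
import math

N = 8
# precomputed weights of Equation 2: W[n][i] = cos(i*pi*(2n+1) / 2N)
W = [[math.cos(i * math.pi * (2 * n + 1) / (2 * N)) for i in range(N)]
     for n in range(N)]

def dct_direct(a):
    """Equation 1 (unscaled): y_i = sum_n a_n cos(i*pi*(2n+1)/2N)."""
    return [sum(a[n] * math.cos(i * math.pi * (2 * n + 1) / (2 * N))
                for n in range(N)) for i in range(N)]

def dct_matmul(a):
    """Equation 2: one dot product of the input row per RaPiD cell."""
    return [sum(a[n] * W[n][i] for n in range(N)) for i in range(N)]

a = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
assert all(abs(p - q) < 1e-9 for p, q in zip(dct_direct(a), dct_matmul(a)))
```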
`
`
`
`
[Figure 7 labels: a row of matrix A streams in; a row of the resulting product is accumulated]

Figure 7: Netlist for one cell of a matrix multiply. The top pipelined bus streams in the A matrix (in row-major order) while the bottom bus streams out the resulting matrix product (also in row-major order). The top bus also streams the W columns into the local memories prior to the computation.
`
`5.2
`
`2-D DCT
`
`An N X N 2-D DCT partitions an input matrix into
`sub-matrices of size N x N, and for each resulting
`sub-matrix A, computes
`
`m.
`N—lN—l
`we 2 Z 2 arm, cos 2—fi(2m + 1)
`m=fl n=l3
`
`”3'
`1
`—— 2
`cos 2N( u+ )
`
`(3)
`_
`.
`.
`for O 5 uj S N— 1.3 As With the 1-D DOT, Equation
`3 is reduced using the N3 precomputed W weights,
`yielding
`N—iN—i
`
`yji = Z Z amnwmiwnj
`m=0 71:0
`
`(4)
`
`for 0 S i, j S N — 1. Extracting mm from the inner
`summation leaves
`
`and thus
`
`N—l
`
`zmj = Z “moi-“113:
`11:0
`
`N—l
`
`Stu = Z zmjwmé
`731:0
`
`(5)
`
`(5)
`
`forOSi,j_<_N~1.
As seen in Equation 5 and Equation 6, both z_{mj} and y_{ji} are equivalent to N × N matrix multiplies. However, since the z_{mj} values are produced in row-major order but required in column-major order, the results from the z_{mj} DCT must be transposed prior to computing y_{ji}, as illustrated in Figure 8. In addition, since both input streams are read in row-major order, it might be desirable to produce row-major output (potentially reducing memory stalls), requiring yet another transpose (i.e. output y_{ij} instead of y_{ji}). The resulting computation is ((A × W)^T × W)^T.

²To produce the final DCT result each y_{ji} must be multiplied by the scaling factors E_i and E_j. As with the 1-D DCT, we ignore this scaling factor and focus on the computation of y_{ji}.
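The identity behind this decomposition, two matrix multiplies with an intervening transpose, can likewise be checked in software. This is our sketch, not the paper's code; the final transpose for row-major output is noted in a comment:

```python
import math

N = 8
W = [[math.cos(i * math.pi * (2 * n + 1) / (2 * N)) for i in range(N)]
     for n in range(N)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

def dct2_direct(A):
    """Equation 4 (unscaled): element [j][i] is y_ji."""
    return [[sum(A[m][n] * W[m][i] * W[n][j]
                 for m in range(N) for n in range(N))
             for i in range(N)] for j in range(N)]

def dct2_pipeline(A):
    """(A x W)^T x W, i.e. Equations 5 and 6; one further transpose
    would give the row-major output y_ij discussed in the text."""
    return matmul(transpose(matmul(A, W)), W)

A = [[float((3 * i + j) % 7) for j in range(N)] for i in range(N)]
Yd, Yp = dct2_direct(A), dct2_pipeline(A)
assert all(abs(Yd[j][i] - Yp[j][i]) < 1e-6 for j in range(N) for i in range(N))
```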
`
`111
`
[Figure 8: a_{mn} → N-point 1-D DCT → z_{mj} → transpose → N-point 1-D DCT → y_{ji}]

Figure 8: 2-D N × N DCT
`
We show the implementation of an 8 × 8 2-D DCT on a 16-cell RaPiD array. Consider an M × N image and an 8 × 8 weight matrix W. First, the image is divided into MN/64 sub-images of size 8 × 8. The computation for each sub-image A is outlined in Figure 9.
`
`
`
[Figure 9: ((A × W)^T × W)^T — intermediate results are stored in RAM and transposed by control; the first 1-D DCT is computed by the first 8 stages and the second by the last 8 stages]

Figure 9: To compute 2-D DCT, an M × N image is partitioned into 8 × 8 sub-images. RaPiD computes each 1-D DCT by multiplying the sub-image by an 8 × 8 weight matrix.
`
Since a 2-D DCT performs two multiplies by the same weight matrix, W is loaded only once: one column per cell in both the first 8 cells and last 8 cells.
`The transpose in between matrix multiplies is per-
`formed with two local memories per cell: one to store
`products of the current sub-image and the other to
`store the products of the previous sub-image. During
`the computation of the current sub-image, the trans-
`pose of the previous sub-image