Proceedings

IEEE Symposium on
FPGAs for Custom Computing Machines

April 16-18, 1997
Napa Valley, California

Edited by Kenneth L. Pocek and Jeffrey Arnold

Sponsored by the IEEE Computer Society Technical Committee on Computer Architecture

IEEE Computer Society

FCCM'97
1997

Copyright © 1997 by The Institute of Electrical and Electronics Engineers, Inc.
All rights reserved.

Copyright and Reprint Permissions: Abstracting is permitted with credit to the source. Libraries may photocopy beyond the limits of US copyright law, for private use of patrons, those articles in this volume that carry a code at the bottom of the first page, provided that the per-copy fee indicated in the code is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923.

Other copying, reprint, or republication requests should be addressed to: IEEE Copyrights Manager, IEEE Service Center, 445 Hoes Lane, P.O. Box 1331, Piscataway, NJ 08855-1331.

The papers in this book comprise the proceedings of the meeting mentioned on the cover and title page. They reflect the authors' opinions and, in the interests of timely dissemination, are published as presented and without change. Their inclusion in this publication does not necessarily constitute endorsement by the editors, the IEEE Computer Society, or the Institute of Electrical and Electronics Engineers, Inc.

IEEE Computer Society Order Number PR08159
ISBN 0-8186-8159-4
ISBN 0-8186-8160-8 (case)
ISBN 0-8186-8161-6 (microfiche)
IEEE Order Plan Catalog Number 97TB100186
ISSN 1082-3409
`
`
IEEE Computer Society
Customer Service Center
10662 Los Vaqueros Circle
P.O. Box 3014
Los Alamitos, CA 90720-1314
Tel: +1-714-821-8380
Fax: +1-714-821-4641
E-mail: cs.books@computer.org

Additional copies may be ordered from:

IEEE Service Center
445 Hoes Lane
P.O. Box 1331
Piscataway, NJ 08855-1331
Tel: +1-908-981-1393
Fax: +1-908-981-9667
mis.custserv@computer.org

IEEE Computer Society
13, Avenue de l'Aquilon
B-1200 Brussels
BELGIUM
Tel: +32-2-770-2198
Fax: +32-2-770-8505
euro.ofc@computer.org

IEEE Computer Society
Ooshima Building
2-19-1 Minami-Aoyama
Minato-ku, Tokyo 107
JAPAN
Tel: +81-3-3408-3118
Fax: +81-3-3408-3553
tokyo.ofc@computer.org
`
Editorial production by Bob Werner

Cover art production by Joe Daigle/Studio Productions

Printed in the United States of America by Technical Communication Services
`
`
`
`

`

Table of Contents

Symposium on Field-Programmable Custom Computing Machines — FCCM'97

Introduction .......................................................... ix
Program Committee .......................................................... x
`
Session 1: Device Architecture

An FPGA Architecture for DRAM-based Systolic Computations .......................................................... 2
N. Margolus

Garp: A MIPS Processor with a Reconfigurable Coprocessor .......................................................... 12
J. Hauser, J. Wawrzynek

A Time-Multiplexed FPGA .......................................................... 22
S. Trimberger, D. Carberry, A. Johnson, J. Wong
`
Session 2: Communication Applications

An FPGA-Based Coprocessor for ATM Firewalls .......................................................... 30
J. McHenry, P. Dowd, T. Carrozzi, F. Pellegrino, W. Cocks

A Wireless LAN Demodulator in a Pamette: Design and Experience .......................................................... 40
T. McDermott, P. Ryan, M. Shand, D. Skellern, T. Percival, N. Weste
`
Session 3: Run Time Reconfiguration

Incremental Reconfiguration for Pipelined Applications .......................................................... 47
H. Schmit

Compilation Tools for Run-Time Reconfigurable Designs .......................................................... 56
W. Luk, N. Shirazi, P. Cheung

A Dynamic Reconfiguration Run-Time System .......................................................... 66
J. Burns, A. Donlin, J. Hogg, S. Singh, M. de Wit
`
Session 4: Architectures for Run Time Reconfiguration

The Swappable Logic Unit: A Paradigm for Virtual Hardware .......................................................... 77
G. Brebner
`
`

`

The Chimaera Reconfigurable Functional Unit .......................................................... 87
S. Hauck, T. Fry, M. Hosler, J. Kao
`
Session 5: Architecture

Computing Kernels Implemented with a Wormhole RTR CCM .......................................................... 98
R. Bittner Jr., P. Athanas

Mapping Applications to the RaPiD Configurable Architecture .......................................................... 106
C. Ebeling, D. Cronquist, P. Franklin, J. Secosky, S. Berg

Defect Tolerance on the Teramac Custom Computer .......................................................... 116
B. Culbertson, R. Amerson, R. Carter, P. Kuekes, G. Snider
`
Session 6: Performance

Systems Performance Measurement on PCI Pamette .......................................................... 125
L. Moll, M. Shand

The RAW Benchmark Suite: Computation Structures for General Purpose Computing ..........................................................
J. Babb, M. Frank, V. Lee, E. Waingold, R. Barua, M. Taylor, J. Kim, S. Devabhaktuni, A. Agarwal
`
`Session 7: Software Tools
`
`' Automated Field-Programmable Compute Accelerator Design using
`Partial Evaluation
`Q. Wang, D. Lewis
`
`145
`
`FPGA Synthesis on the X08200 using IRIS and TrianuSJI-Iades
`(Or, from Heaven to Hell and back again) ................................................................................................. 155
`R. Woods, 8. Ludwig, J. Heron, D. Trainer, S. Gehring
`
`High Level Compilation for Fine Grained FPGAs .................................................................................. 165
`M. Gokhole, E. Gomersaii
`
Session 8: CAD Applications

Acceleration of an FPGA Router .......................................................... 175
P. Chen, M. Schlag

Fault Simulation on Reconfigurable Hardware .......................................................... 182
M. Abramovici, P. Menon
`
`
`

`

Session 9: Image Processing Applications

Automated Target Recognition on SPLASH 2 .......................................................... 192
M. Rencher, B. Hutchings

Real-Time Stereo Vision on the PARTS Reconfigurable Computer .......................................................... 201
J. Woodfill, B. Von Herzen

Increased FPGA Capacity Enables Scalable, Flexible CCMs: An Example from Image Processing ..........................................................
J. Greenbaum, M. Baxter
`
Session 10: Arithmetic Applications

Comparison of Arithmetic Architectures for Reed-Solomon Decoders in Reconfigurable Hardware .......................................................... 219
C. Paar, M. Rosner

Floating Point Square Root on FPGAs .......................................................... 226
`
Poster Papers

Datapath-Oriented FPGA Mapping and Placement for Configurable Computing ..........................................................
T. Callahan, J. Wawrzynek

Mapping a Real-Time Video Algorithm to a Context-Switched FPGA .......................................................... 236
S. Kelem

A Parallel Hardware Evolvable Computer POLYP .......................................................... 238
U. Tangen, L. Schulte, J. McCaskill

Laser Defect Correction Applications to FPGA Based Custom Computers .......................................................... 240
G. Chapman, B. Dufort

Speech Recognition HMM Training on Reconfigurable Parallel Processor .......................................................... 242
H. Yun, A. Smith, H. Silverman

Efficient Implementation of the DCT on Custom Computers .......................................................... 244
N. Bergmann, Y. Chung, B. Gunther

On Acceleration of the Check Tautology Logic Synthesis Algorithm using an FPGA-based Reconfigurable Coprocessor .......................................................... 246
J. Cong, J. Peck

Index of Authors .......................................................... 249
`
`
`

`

`
Mapping Applications to the RaPiD Configurable Architecture*

Carl Ebeling, Darren C. Cronquist, Paul Franklin, Jason Secosky, and Stefan G. Berg
Department of Computer Science and Engineering
University of Washington
Box 352350
Seattle, WA 98195-2350
`
Abstract

The goal of the RaPiD (Reconfigurable Pipelined Datapath) architecture is to provide high performance configurable computing for a range of computationally-intensive applications that demand special-purpose hardware. This is accomplished by mapping the computation into a deep pipeline using a configurable array of coarse-grained computational units. A key feature of RaPiD is the combination of static and dynamic control. While the underlying computational pipelines are configured statically, a limited amount of dynamic control is provided which greatly increases the range and capability of applications that can be mapped to RaPiD. This paper illustrates this mapping and configuration for several important applications including a FIR filter, 2-D DCT, motion estimation, and parametric curve generation; it also shows how static and dynamic control are used to perform complex computations.
`
1 Introduction
`
Field-programmable custom computing machines have attracted a lot of attention recently because of their promise to deliver the high performance provided by special purpose hardware along with the flexibility of general purpose processors. Unfortunately, the promise of configurable computing has yet to be realized in spite of some very successful examples [1, 9]. There are two main reasons for this.

First, configurable computing platforms are currently implemented using commercial FPGAs. These FPGAs are necessarily very fine-grained so they can be used to implement arbitrary circuits, but the overhead of this generality exacts a high price in density and performance. Compared to general purpose processors (including digital signal processors), which use highly optimized functional units that operate in bit-parallel fashion on long data words, FPGAs are somewhat inefficient for performing logical operations and

*This work was supported in part by the Defense Advanced Research Projects Agency under Contract DAAH04-94-G-0272. D. Cronquist was supported in part by an IBM fellowship. P. Franklin was supported by an NSF fellowship.
`
even worse for ordinary arithmetic. FPGA-based computing has the advantage only on complex bit-oriented computations like count-ones, find-first-one, or complicated bit-level masking and filtering.

Second, programming an FPGA-based configurable computer is akin to designing an ASIC. The programmer either uses synthesis tools that deliver poor density and performance or designs the circuit manually, which requires both intimate knowledge of the FPGA architecture and substantial design time. Neither alternative is attractive, particularly for simple computations that can be described in a few lines of C code.

Our response to these two problems is RaPiD, a coarse-grained configurable architecture for constructing deep computational pipelines. RaPiD is aimed at regular, computation-intensive tasks like those found in digital signal processing (DSP). RaPiD provides a large number of ALUs, multipliers, registers and memory modules that can be configured into the appropriate pipelined datapath. The datapaths constructed in RaPiD are linear arrays of functional units communicating in mostly nearest-neighbor fashion. Systolic algorithms [4], for example, map very well into RaPiD datapaths, allowing us to take advantage of the considerable research on compiling computations to systolic arrays [5, 7]. However, RaPiD is not limited to implementing systolic algorithms; a pipeline can be constructed which comprises different computations at different stages and at different times.

We begin with an overview of the RaPiD architecture; for a more detailed description see [3]. We then give a general description of how computations map to RaPiD using a FIR filter as an example, and then present how the architectural features of RaPiD are used to perform more complex computations found in 2-D DCT, motion estimation, and parametric curve generation.
`
2 The RaPiD Datapath Architecture
`
RaPiD is a linear array of functional units which is configured to form a mostly linear computational pipeline. This array of functional units is divided into identical cells which are replicated to form a complete array. Figure 1 shows the cell used in RaPiD-1, the first version of the RaPiD architecture. This cell comprises an integer multiplier, three integer ALUs, six general-purpose "datapath registers" and three small local memories. A typical single-chip RaPiD array would contain between 8 and 32 of these cells.
`
Figure 1: A basic RaPiD cell which is replicated left to right to form a complete chip. RaPiD-1 contains 16 cells similar to this one, with fourteen 16-bit buses.
`
2.1 Datapath Composition

The functional units are interconnected using a set of segmented buses that run the length of the datapath. The functional units use an n:1 multiplexer to select their data inputs from one of the n-2 bus segments in the adjacent tracks. The additional inputs provide fixed zero or feedback lines, which can be used to clear and hold register values, or to use an ALU as an accumulator. Each functional unit output includes optional registers to accommodate pipeline delays and a set of tristate drivers to drive their output onto one or more bus segments.

The buses in different tracks are segmented into different lengths to make the most efficient use of the connection resources. In some tracks, adjacent bus segments can be connected together by a bus connector as shown in Figure 1. This connection can be programmed in either direction via a unidirectional buffer or pipelined with up to three register delays, allowing data pipelines to be built in the bus structure itself.

RaPiD's ALUs perform the usual logical and arithmetic operations on one word of data. The ALUs can be chained for wide-integer operations. The multiplier inputs two single-word numbers and produces a double-word result, shifted by a statically programmed amount to maintain a given fixed-point representation. Both words of the result are available as separate outputs.

The datapath registers serve a variety of purposes in RaPiD. These registers can be used to store constants loaded during initialization and temporary values. They can be used as additional multiplexers to simplify control; like any functional unit, the registers can be disabled. They are also used while routing RaPiD applications to connect bus segments in different tracks and/or for additional pipeline delays.

In many applications, the data is partitioned into blocks which are loaded once, saved locally, reused as needed, and then discarded. The local memories provided in each cell of the datapath serve this purpose. Each local memory has a specialized datapath register used as an address register; one of the bus inputs to this address register is replaced by an incrementing feedback path. Like the STLOs found in the Philips VSP [3], this supports the common case of sequential memory accesses. More complex addressing patterns can be generated using registers and ALUs in the datapath.

Input and output data enter and exit RaPiD via I/O streams at each end of the datapath. Each stream contains a FIFO filled with data required or with results produced by the computation. The datapath explicitly reads from an input stream to obtain the next input data value and writes to an output stream to store a result.

External memory operations are carried out independent of the RaPiD array via three I/O streams by placing FIFOs between the array and a memory controller. In addition to carrying out the memory operations, the memory controller generates statically determined sequences of addresses for each stream. If the datapath reads a value from an empty FIFO or writes a value to a full FIFO, the datapath is stalled until the FIFO is ready.
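
The incrementing feedback path is easy to picture with a small behavioral model. The following C sketch is illustrative only; the names, the load/hold/increment encoding, and the single read port are our assumptions rather than RaPiD-1 specifics:

#include <stdio.h>
#include <stdint.h>

/* Hypothetical per-cycle behavior of a local-memory address register.
 * One bus input is replaced by an incrementing feedback path, so a
 * statically configured "increment" select yields sequential access. */
enum addr_sel { ADDR_LOAD, ADDR_HOLD, ADDR_INCR };

typedef struct {
    uint16_t addr;       /* the specialized datapath register */
    uint16_t mem[32];    /* a 32-word local memory, as in RaPiD-1 */
} local_mem;

static uint16_t local_mem_step(local_mem *m, enum addr_sel sel,
                               uint16_t bus_value) {
    uint16_t out = m->mem[m->addr % 32];         /* read current word */
    switch (sel) {
    case ADDR_LOAD: m->addr = bus_value; break;  /* load from a bus */
    case ADDR_HOLD: break;                       /* register disabled */
    case ADDR_INCR: m->addr++; break;            /* feedback path */
    }
    return out;
}

int main(void) {
    local_mem m = { .addr = 0 };
    for (int i = 0; i < 32; i++) m.mem[i] = (uint16_t)(i * i);
    local_mem_step(&m, ADDR_LOAD, 5);            /* point at word 5 */
    for (int i = 0; i < 4; i++)                  /* stream words 5..8 */
        printf("%u ", local_mem_step(&m, ADDR_INCR, 0));
    printf("\n");                                /* prints: 25 36 49 64 */
    return 0;
}

More complex patterns (strides, block swaps) would be produced by driving the load input from an ALU in the datapath, as the text notes.
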
`
2.2 Control Path

For the most part, the structure of a pipeline is statically configured. However, there are almost always some pipeline control signals that must be dynamic. For example, constants are loaded into datapath registers during initialization but then remain unchanged. The load signals of the datapath registers thus take on different values during initialization and computation. More complex examples include double-buffering the local memories and performing data-dependent calculations.

The control signals are thus divided into static control signals provided by configuration memory as in ordinary FPGAs, and control signals which can be dynamically programmed on every cycle. RaPiD is programmed for a particular application by first mapping the computation onto a datapath pipeline. The static programming bits are used to construct this pipeline and the dynamic programming bits are used to schedule the datapath operations over time. These dynamic control bits are provided by a small pipelined control path, not by the more typical local microprogrammed, SIMD, or VLIW control.
`
`

`

Dynamic control is implemented by inserting a few "context" bits each cycle into a pipelined control path that parallels the datapath. This context contains all the information required by the various pipeline stages to compute their dynamic control signals. The control path contains 1-bit segmented buses similar in structure to the datapath buses, as shown in Figure 2. (Control signals that do not change during a particular computation are connected to the static zero line.) Control values are inserted by a global pipeline controller at one end of the control path and are passed from stage to stage where they are applied to the appropriate control signals. Since applications generally use only a few dynamic control signals and use similar pipeline stages, the number of control signals in the control path is relatively small.

Figure 2: Dynamic control generation for part of a RaPiD cell; these control buses are one bit wide.

Each dynamic control signal is derived from the information contained in the control path. Usually the signal is simply connected to one of the bits in the control path, but in more complex cases lookup-tables embedded in the control path are used to compute the control signal based on more information including bits in the control path, status from ALUs in the datapath, or feedback when implementing simple FSM controllers. The generation of dynamic control is illustrated in detail in the applications that follow.
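
The lookup-table mechanism can be sketched in a few lines of C. Everything in this sketch is an assumption made for illustration (the LUT width, its three particular inputs, and all names): a context bit from the 1-bit control bus, an ALU status flag, and the LUT's own registered output are combined through a statically configured truth table, here programmed as a tiny set/reset FSM.

#include <stdio.h>
#include <stdint.h>

/* Toy model of a dynamic control signal computed by a lookup table
 * embedded in the control path. Assumed inputs:
 *   bit 0: a context bit arriving on the 1-bit control bus
 *   bit 1: a status flag from a datapath ALU
 *   bit 2: the LUT's registered output (feedback -> a simple FSM) */
typedef struct {
    uint8_t table;   /* 8 entries x 1 bit: static configuration */
    uint8_t state;   /* registered output used as feedback */
} ctrl_lut;

static int ctrl_lut_step(ctrl_lut *l, int context_bit, int alu_status) {
    unsigned idx = (unsigned)(context_bit | (alu_status << 1) | (l->state << 2));
    l->state = (l->table >> idx) & 1u;
    return l->state;
}

int main(void) {
    /* Example configuration: assert the signal once the context bit has
     * been seen, and keep it asserted until the ALU status clears it. */
    ctrl_lut l = { .table = 0, .state = 0 };
    for (unsigned idx = 0; idx < 8; idx++) {
        int ctx = idx & 1, status = (idx >> 1) & 1, prev = (idx >> 2) & 1;
        int next = (ctx || prev) && !status;       /* set-dominant latch */
        l.table |= (uint8_t)(next << idx);
    }
    int ctx[]    = {0, 1, 0, 0, 0, 0};
    int status[] = {0, 0, 0, 0, 1, 0};
    for (int t = 0; t < 6; t++)
        printf("%d", ctrl_lut_step(&l, ctx[t], status[t]));
    printf("\n");   /* prints: 011100 */
    return 0;
}
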
`
2.3 RaPiD-1 Design Features

Most of the design and layout of the RaPiD-1 chip, the first implementation of the RaPiD architecture, is complete. This section presents those details of RaPiD-1 useful in understanding the performance results discussed for each application presented in the following sections.

RaPiD-1's datapath is based on 16-bit fixed-point integers; to accommodate this, the multipliers can be statically programmed to shift their 32-bit output arbitrarily. Each RaPiD-1 cell contains three ALUs, one multiplier, and three 32-word local memories. Fourteen tracks are provided for the segmented data buses, which are supplemented by the zero and feedback inputs available to each functional input. The 16 cells each have the functional units shown in Figure 1, in addition to control logic and up to 15 control buses. The RaPiD-1 array is designed to be clocked at 100 MHz, and reconfiguration time for the array is conservatively estimated to be 2000 cycles.

3 Programming Model

Mapping applications to RaPiD involves designing the underlying datapath and providing the dynamic control required for the different parts of the computation. The control design can be complicated because control signals are generated at different times and travel at different rates. We have designed the RaPiD B programming language to accommodate these control patterns. Our RaPiD B compiler, which produces a placed and routed implementation along with the dynamic control program, is nearly complete. This section first describes a FIR (Finite Impulse Response) filter, a simple application useful for illustrating some of the basic features of RaPiD. It then briefly presents the timing models used by RaPiD B and by the remainder of this paper.

3.1 FIR Filter Computation

Digital FIR filters are used in many signal processing applications, typically for eliminating unwanted frequency components from a signal. Figure 3a gives a specification for a FIR filter with NumTaps taps and NumX inputs. The filter weights are stored in the W array, the input in the X array, and the output in the Y array (starting at array location NumTaps - 1). Figure 3b shows the entire computation required for a single output of a 4-tap FIR filter.

Figure 3: FIR filter. (a) Algorithm. (b) Computation for NumTaps = 4 and i = 5.
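
Since the algorithm of Figure 3a does not reproduce legibly here, the following C sketch reconstructs the loop structure the text describes; the array sizes, integer data type, and test driver are our assumptions:

#include <stdio.h>

#define NUM_TAPS 4
#define NUM_X    10

/* FIR specification as described for Figure 3a: each output Y[i] is
 * the dot product of the weights with the NumTaps most recent inputs,
 * with outputs starting at array location NumTaps - 1. */
void fir(const int W[NUM_TAPS], const int X[NUM_X], int Y[NUM_X]) {
    for (int i = NUM_TAPS - 1; i < NUM_X; i++) {
        Y[i] = 0;
        for (int j = 0; j < NUM_TAPS; j++)
            Y[i] += W[j] * X[i - j];
    }
}

int main(void) {
    int W[NUM_TAPS] = {1, 2, 3, 4};
    int X[NUM_X]    = {1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
    int Y[NUM_X]    = {0};
    fir(W, X, Y);
    for (int i = NUM_TAPS - 1; i < NUM_X; i++)
        printf("Y[%d] = %d\n", i, Y[i]);   /* each output is 1+2+3+4 = 10 */
    return 0;
}

Figure 3b corresponds to the inner loop at i = 5: Y[5] = W[0]X[5] + W[1]X[4] + W[2]X[3] + W[3]X[2].
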
The circuit in Figure 4a performs the entire computation for one output value in a single cycle; it is easily obtained by unrolling the inner loop of the program in Figure 3a. Unfortunately, the circuit shown in Figure 4a has poor performance characteristics (note the combinational path through all of the adders, which scales linearly with the number of weights). A retimed version of this circuit is shown in Figure 4b; the retimed circuit performs substantially better than the original, particularly for larger computations.
`
Figure 4: Schematic diagrams for a four-tap FIR filter (a) as viewed in RaPiD B, grouping related computation, and (b) as a high-performance pipelined implementation.
`
Specifying this retimed circuit directly is difficult because of the complexity of the relative timing of the internal data and control signals. It is much easier to specify the computation somewhat naively as in Figure 4a, knowing that retiming can transform it into a high-performance, pipelined circuit. This becomes particularly evident in circuits with more complicated control, and when more aggressive steps, such as using the pipeline stage available in RaPiD's multiplier, are needed to achieve the desired performance. Therefore, the RaPiD B compiler retimes the resulting netlist based on [6].

All of the applications presented in the following sections have been specified in a preliminary version of RaPiD B and simulated to validate the implementations described and the accompanying cycle counts. For ease of explanation, the computations shown throughout this paper are shown before the full retiming performed by the RaPiD B compiler. A preliminary version of the RaPiD B toolset is nearly complete, including compilation, retiming, control synthesis, and full placement and routing of the resulting RaPiD circuit.
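
The effect of retiming can be checked with a small cycle-level model. The following C sketch is our own behavioral model of the pipelined structure of Figure 4b, not RaPiD B compiler output: each tap has two X registers and one Y register, so no combinational path spans more than one multiply-add, and outputs emerge after a fixed latency of TAPS cycles.

#include <stdio.h>

#define TAPS  4
#define NUM_X 16

int main(void) {
    int w[TAPS] = {1, 2, 3, 4};
    int x[NUM_X], y_ref[NUM_X];
    for (int i = 0; i < NUM_X; i++) x[i] = i + 1;

    /* Reference: direct evaluation, as in the circuit of Figure 4a. */
    for (int i = TAPS - 1; i < NUM_X; i++) {
        y_ref[i] = 0;
        for (int j = 0; j < TAPS; j++) y_ref[i] += w[j] * x[i - j];
    }

    /* Pipelined evaluation in the style of Figure 4b: X values advance
     * through two registers per tap, Y partial sums through one. */
    int xr[TAPS][2] = {{0}}, yr[TAPS] = {0};
    for (int t = 0; t < NUM_X + TAPS; t++) {
        int x_in = (t < NUM_X) ? x[t] : 0;
        int out = yr[TAPS - 1];                 /* equals y_ref[t - TAPS] */
        for (int j = TAPS - 1; j >= 0; j--) {   /* update back-to-front  */
            int xj = (j == 0) ? x_in : xr[j - 1][1];
            int yj = (j == 0) ? 0    : yr[j - 1];
            yr[j] = yj + w[j] * xj;             /* one multiply-add per stage */
            xr[j][1] = xr[j][0];
            xr[j][0] = xj;
        }
        if (t >= 2 * TAPS - 1)                  /* pipeline full: check output */
            printf("y[%2d]: pipelined %3d, direct %3d\n",
                   t - TAPS, out, y_ref[t - TAPS]);
    }
    return 0;
}

Both evaluations produce identical results; the pipelined version simply delivers them TAPS cycles later, which is exactly the transformation the retimer applies.
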
`
`4 FIR Filter Implementation
`
`4.1 Simple Case
`
As with most applications, there are a variety of ways to map a FIR filter to RaPiD. The choice of mapping is driven by the parameters of both the RaPiD array and the application. For example, if the number of taps is less than the number of RaPiD multipliers, then each multiplier is assigned to multiply a specific weight. The weights are first preloaded into datapath registers whose outputs drive the input of a specific multiplier. Pipeline registers are used to stream the X inputs and Y outputs. Since each Y output must see NumTaps inputs, the X and Y buses must be pipelined at different rates. Figure 5a shows one cell of the FIR filter (several stages are shown in Figure 4b) with the X input bus doubly pipelined and the Y input bus singly pipelined.
`
`
`
`
Figure 5: (a) Netlist for one cell of the simple FIR filter. (b) One tap of the FIR filter mapped to the RaPiD array (this is replicated to form more taps).
`
This implementation maps easily to the RaPiD array, as shown for one tap in Figure 5b. For clarity, all unused functional units are removed, and used buses are highlighted. The bus connectors from Figure 1 are left open to represent no connection and boxed to represent a register. The control for this mapping consists of two phases of execution: loading the weights and computing the output results. In the first phase, the weights are sent down the IN double pipeline along with a singly pipelined control bit which connects the input of each datapath register to the IN bus. When the final weight is inserted, the control bit is switched, and the input is connected to the feedback line. Since the control bit travels twice as fast as the weights, each datapath register will hold a unique weight. No special signals are required to begin the computation; the second phase implicitly starts when the control bit goes low.
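
This two-phase protocol can be sanity-checked with a toy model. In the C sketch below (our reconstruction; which register of the double pipeline each datapath register taps, and the exact cycle-level timing, are assumptions) the control bit advances one stage per cycle while the weights advance one stage per two cycles, and each stage's register ends up holding a distinct weight:

#include <stdio.h>

#define STAGES 4   /* one tap per stage */
#define CYCLES 16

int main(void) {
    int data[2 * STAGES] = {0};  /* two data registers per stage */
    int ctrl[STAGES]     = {0};  /* one control register per stage */
    int weight[STAGES]   = {0};  /* the datapath registers being loaded */

    /* The control bit is high exactly while weights are inserted. */
    int in_data[CYCLES] = {40, 30, 20, 10};   /* then zeros */
    int in_ctrl[CYCLES] = { 1,  1,  1,  1};

    for (int t = 0; t < CYCLES; t++) {
        /* Each datapath register loads from the bus while its local
         * control bit is high, and holds via feedback once it goes low. */
        for (int j = 0; j < STAGES; j++)
            if (ctrl[j]) weight[j] = data[2 * j];
        /* advance both pipelines, back-to-front */
        for (int j = 2 * STAGES - 1; j > 0; j--) data[j] = data[j - 1];
        data[0] = in_data[t];
        for (int j = STAGES - 1; j > 0; j--) ctrl[j] = ctrl[j - 1];
        ctrl[0] = in_ctrl[t];
    }
    for (int j = 0; j < STAGES; j++)          /* prints 10, 20, 30, 40:  */
        printf("stage %d weight = %d\n", j, weight[j]); /* all distinct */
    return 0;
}

Stage j ends up holding the (STAGES - 1 - j)-th value inserted, so inserting the weights in reverse order leaves weight j in stage j.
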
`
4.2 Increasing the Number of Taps
`
If the number of taps exceeds the number of RaPiD multipliers, the multipliers must be time-shared between several taps. This can be achieved in RaPiD by using a local memory to store several weights per stage. Figure 6 shows our implementation for this mapping. Unlike the simple case, we make the arbitrary choice of doubly pipelining the Y output values and singly pipelining the X input values.
`
Figure 6: Netlist for one cell of the extended FIR filter. The top pipelined bus streams in the X inputs (the weights during initialization) while the bottom bus streams out the intermediate Y values. (In the figure, the left RAM holds the weights to be multiplied with X inputs; the right RAM holds intermediate Y values until they are sent to the next stage.)
`
As a new X is read from external memory, the first stage replicates it and presents it to the input datapath for several cycles. Each stage can multiply this X by its weights in turn and add it to one of its stored intermediate values. At this point a new X value will be fetched from memory and the cycle repeats.

There are the same number of intermediate values as there are weights per stage. These intermediate values are stored in a second local memory. Let's examine the stage holding weights W55, W54, W53, and W52 (four taps per stage). A new input value X20 appears on the input datapath. In four cycles the partial sums for Y75, Y74, Y73, and Y72 will be computed. These are stored in that order in the local memory holding the intermediate values. At this point, X20 moves to the next pipeline stage followed by the intermediate value Y72. The next input, X21, appears on the input datapath along with the intermediate value Y76 from the previous stage. Now the partial sums for Y76, Y75, Y74, and Y73 are computed.
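
The bookkeeping is easier to see in a small model. The C sketch below collapses the scheme to a single stage holding all four weights (so its completed partials are final outputs); the circular-buffer indexing is our own, but the schedule is the one described above: each X is held for four cycles while one stored partial sum is updated per cycle.

#include <stdio.h>

#define TAPS_PER_STAGE 4
#define NUM_X 12

/* One time-shared stage: a local memory holds the stage's weights,
 * a second local memory holds TAPS_PER_STAGE partial sums. */
int main(void) {
    int w[TAPS_PER_STAGE] = {1, 2, 3, 4};
    int partial[TAPS_PER_STAGE] = {0};   /* intermediate-value memory */
    int x[NUM_X];
    for (int n = 0; n < NUM_X; n++) x[n] = n + 1;

    for (int n = 0; n < NUM_X; n++) {
        /* TAPS_PER_STAGE cycles with the same X on the input datapath:
         * Y[n+k] accumulates w[k] * x[n], one multiply-add per cycle. */
        for (int k = 0; k < TAPS_PER_STAGE; k++)
            partial[(n + k) % TAPS_PER_STAGE] += w[k] * x[n];
        /* the oldest partial sum is now complete: it is Y[n] */
        if (n >= TAPS_PER_STAGE - 1)
            printf("Y[%d] = %d\n", n, partial[n % TAPS_PER_STAGE]);
        partial[n % TAPS_PER_STAGE] = 0;  /* slot reused for Y[n + 4] */
    }
    return 0;   /* e.g. Y[3] = 1*4 + 2*3 + 3*2 + 4*1 = 20 */
}

The four slots of the intermediate-value memory cycle exactly as the Y75..Y72 example above cycles through its local memory.
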
`
4.3 FIR Performance

When the number of taps is a multiple of 16, the weights can be partitioned evenly across the stages and the allocated functional units are fully utilized. RaPiD-1 (Section 2.3) can therefore operate at very near its peak performance of 1.6 GOPS (where GOPS is a billion multiply-accumulates per second).
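
The peak number follows directly from the RaPiD-1 parameters of Section 2.3 (16 cells, one multiplier per cell, 100 MHz):

$$16 \;\text{multipliers} \times 10^{8}\;\tfrac{\text{cycles}}{\text{s}} \times 1\;\tfrac{\text{MAC}}{\text{multiplier}\cdot\text{cycle}} = 1.6 \times 10^{9}\;\tfrac{\text{MACs}}{\text{s}}.$$
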
`
5 Discrete Cosine Transform

The discrete cosine transform (DCT) is used frequently in signal processing and graphics applications. For example, the 2-D DCT is used in JPEG/MPEG data compression to convert an image from the spatial domain to the frequency domain. A 2-D DCT can be decomposed into two sequential 1-D DCTs. We first describe how the 1-D DCT can be computed on RaPiD and then show how two 1-D DCTs can be composed to perform a 2-D DCT.
`
5.1 1-D DCT
`
An N-point 1-D DCT partitions an input vector A into N-element sub-vectors, and for each resulting sub-vector $A_h$, computes

$$y_{hi} = \sum_{n=0}^{N-1} a_{hn}\,\cos\frac{i\pi}{2N}(2n+1) \qquad (1)$$

for $0 \le i \le N-1$, where $a_{hn}$ is the n-th element of sub-vector $A_h$ (and the $(hN+n)$-th element of vector A).¹ The $N^2$ cosine terms are constant over all sub-vectors and hence can be read once as precomputed weights W where $w_{ni} = \cos\frac{i\pi}{2N}(2n+1)$. This reduces Equation 1 to

$$y_{hi} = \sum_{n=0}^{N-1} a_{hn}\,w_{ni} \qquad (2)$$

for $0 \le i \le N-1$. By viewing input vector A and weights W as matrices A and W, Equation 2 reduces to the matrix multiply Y = A × W. Thus, to compute a 1-D DCT, RaPiD performs a matrix multiply.

To implement an 8-point 1-D DCT on an 8 × 8 input matrix A (i.e. a 64-element vector), the entire 8 × 8 weight matrix W is stored in RaPiD's local memories, one column per cell. Each cell of the resulting pipeline is configured as shown in Figure 7. The A matrix is passed through the array in row-major order. Within each cell, the local memory address is incremented each cycle, and a register accumulates the dot product of the stored column and the incoming row. When a cell receives the last element of a row, the resulting product is passed down an output pipeline, the address is cleared, and the cell is ready to compute the product of the next row on the next cycle. This effectively computes the matrix multiply of A × W.
`
¹To produce the final DCT result each $y_{hi}$ must be multiplied by $\sqrt{\tfrac{2}{N}}\,E_i$ where $E_i = \tfrac{1}{\sqrt{2}}$ if $i = 0$ and $E_i = 1$ otherwise. For our purposes we ignore this scaling factor and focus on the computation of $y_{hi}$.
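
The following C sketch mirrors this organization (the driver, the sample input block, and all names are ours): "cell" i holds column i of W, rows of A stream past, and an accumulator forms one dot product per row, implementing Equation 2 with the scaling factor omitted.

#include <stdio.h>
#include <math.h>

#define N 8

int main(void) {
    const double PI = acos(-1.0);
    double W[N][N], A[N][N], Y[N][N];

    for (int n = 0; n < N; n++)         /* w_ni = cos(i*pi/(2N)*(2n+1)) */
        for (int i = 0; i < N; i++)
            W[n][i] = cos(i * PI / (2.0 * N) * (2 * n + 1));

    for (int h = 0; h < N; h++)         /* a sample 8x8 input block */
        for (int n = 0; n < N; n++)
            A[h][n] = h + n;

    for (int h = 0; h < N; h++)         /* stream rows of A ...         */
        for (int i = 0; i < N; i++) {   /* ... past each cell i         */
            double acc = 0.0;           /* the accumulator register     */
            for (int n = 0; n < N; n++) /* local-memory address n       */
                acc += A[h][n] * W[n][i];
            Y[h][i] = acc;              /* sent down the output pipeline */
        }

    printf("Y[0][0] = %g\n", Y[0][0]);  /* 28: column 0 of W is all ones */
    return 0;
}
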
`
`
`

`

Figure 7: Netlist for one cell of a matrix multiply. The top pipelined bus streams in the A matrix (in row-major order) while the bottom bus streams out the resulting matrix product (also in row-major order). The top bus also streams the W columns into the local memories prior to the computation.
`
5.2 2-D DCT
`
An N × N 2-D DCT partitions an input matrix into sub-matrices of size N × N, and for each resulting sub-matrix A, computes

$$y_{ji} = \sum_{m=0}^{N-1}\sum_{n=0}^{N-1} a_{mn}\,\cos\frac{i\pi}{2N}(2m+1)\,\cos\frac{j\pi}{2N}(2n+1) \qquad (3)$$

for $0 \le i, j \le N-1$.² As with the 1-D DCT, Equation 3 is reduced using the $N^2$ precomputed W weights, yielding

$$y_{ji} = \sum_{m=0}^{N-1}\sum_{n=0}^{N-1} a_{mn}\,w_{mi}\,w_{nj} \qquad (4)$$

for $0 \le i, j \le N-1$. Extracting $w_{mi}$ from the inner summation leaves

$$z_{mj} = \sum_{n=0}^{N-1} a_{mn}\,w_{nj} \qquad (5)$$

and thus

$$y_{ji} = \sum_{m=0}^{N-1} z_{mj}\,w_{mi} \qquad (6)$$

for $0 \le i, j \le N-1$.

As seen in Equation 5 and Equation 6, both $z_{mj}$ and $y_{ji}$ are equivalent to N × N matrix multiplies. However, since the $z_{mj}$ values are produced in row-major order but required in column-major order, the results from the $z_{mj}$ DCT must be transposed prior to computing $y_{ji}$, as illustrated in Figure 8. In addition, since both input streams are read in row-major order, it might be desirable to produce row-major output (potentially reducing memory stalls), requiring yet another transpose (i.e. output $y_{ij}$ instead of $y_{ji}$). The resulting computation is $((A \times W)^T \times W)^T$.
`
²To produce the final DCT result each $y_{ji}$ must be multiplied by $\tfrac{2}{N}\,E_i E_j$. As with the 1-D DCT, we ignore this scaling factor and focus on the computation of $y_{ji}$.
`
`
Figure 8: 2-D N × N DCT: the $a_{mn}$ stream passes through an N-point 1-D DCT producing $z_{mj}$, a transpose, and a second N-point 1-D DCT producing $y_{ji}$.
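
Functionally, the whole 2-D transform is two matrix multiplies and two transposes. The C sketch below is a direct transcription of $((A \times W)^T \times W)^T$ for one 8 × 8 sub-image; it ignores the pipelining, double-buffering, and scaling discussed elsewhere, and the sample input is our own:

#include <stdio.h>
#include <math.h>

#define N 8

static void matmul(const double X[N][N], const double W[N][N],
                   double Z[N][N]) {
    for (int m = 0; m < N; m++)
        for (int j = 0; j < N; j++) {
            Z[m][j] = 0.0;
            for (int n = 0; n < N; n++)
                Z[m][j] += X[m][n] * W[n][j];   /* Equation 5 */
        }
}

static void transpose(const double X[N][N], double T[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            T[j][i] = X[i][j];
}

int main(void) {
    const double PI = acos(-1.0);
    double W[N][N], A[N][N], Z[N][N], ZT[N][N], Y[N][N], YT[N][N];
    for (int n = 0; n < N; n++)
        for (int i = 0; i < N; i++)
            W[n][i] = cos(i * PI / (2.0 * N) * (2 * n + 1));
    for (int m = 0; m < N; m++)
        for (int n = 0; n < N; n++)
            A[m][n] = (m * n) % 7;   /* sample 8x8 sub-image */

    matmul(A, W, Z);      /* first 1-D DCT over the rows of A    */
    transpose(Z, ZT);     /* row-major -> column-major fix-up    */
    matmul(ZT, W, Y);     /* second 1-D DCT                      */
    transpose(Y, YT);     /* restore row-major output y_ij       */

    printf("YT[0][0] = %g\n", YT[0][0]);
    return 0;
}
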
`
We show the implementation of an 8 × 8 2-D DCT on a 16-cell RaPiD array. Consider an M × N image and an 8 × 8 weight matrix W. First, the image is divided into $\frac{MN}{64}$ sub-images of size 8 × 8. The computation for each sub-image A is outlined in Figure 9.
`
`
`
Figure 9: To compute the 2-D DCT, an M × N image is partitioned into 8 × 8 sub-images. RaPiD computes each 1-D DCT by multiplying the sub-image by an 8 × 8 weight matrix. (In the figure, A × W is computed by the first 8 stages and $((A \times W)^T \times W)$ by the last 8 stages, with intermediate results stored in RAM and transposed by control.)
`
Since a 2-D DCT performs two multiplies by the same weight matrix, W is loaded only once: one column per cell in both the first 8 cells and last 8 cells. The transpose in between matrix multiplies is performed with two local memories per cell: one to store products of the current sub-image and the other to store the products of the previous sub-image. During the computation of the current sub-image, the transpose of the previous sub-image
