THE INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS — September 1987
ISSN 0018-9219
SPECIAL ISSUE ON HARDWARE AND SOFTWARE FOR DIGITAL SIGNAL PROCESSING
`Petitioner Microsoft Corporation - Ex. 1010, Cover 1
`
`
`
1139 SCANNING THE ISSUE
`
`
`
`
PROCEEDINGS OF THE IEEE
Published monthly by the Institute of Electrical and Electronics Engineers, Inc.
September 1987
`
`SPECIAL ISSUE ON
`
`HARDWARE AND SOFTWARE FOR DIGITAL SIGNAL PROCESSING
Edited by Sanjit K. Mitra and Kalyan Mondal
`
`PAPERS
`
`VLSI
`
1143 The TMS320 Family of Digital Signal Processors, K-S. Lin, G. A. Frantz, and R. Simar, Jr.
`1160 VLSI Processor for Image Processing, M. Sugai, A. Kanuma, K. Suzuki, and M. Kubo
1167 Digital Signal Processor for Test and Measurement Environment, A. Kareem, C. L. Saxe, F. Etheridge, and D. McKinney
1172 The Graph Search Machine (GSM): A VLSI Architecture for Connected Speech Recognition and Other Applications, S. C. Glinski, T. M. Lalumia, D. R. Cassiday, T. Koh, C. Gerveshi, G. A. Wilson, and J. Kumar
`1185 DSP56200: An Algorithm-Specific Digital Signal Processor Peripheral, G. D. Hillman
1192 Parallel Bit-Level Pipelined VLSI Designs for High-Speed Signal Processing, M. Hatamian and G. L. Cash
`
`Systems
`
1203 A 6 × 320-MHz 1024-Channel FFT Cross-Spectrum Analyzer for Radio Astronomy, Y. Chikada, M. Ishiguro, H. Hirabayashi, M. Morimoto, K-I. Morita, T. Kanzawa, H. Iwashita, K. Nakazima, S-I. Ishikawa, T. Takahashi, K. Handa, T. Kasuga, S. Okumura, T. Miyazawa, T. Nakazuru, K. Miura, and S. Nagasawa
`1211 MUPSI: A Multiprocessor for Signal Processing, G. Bolch, F. Hofmann, B. Hoppe, C-U. Linster,
R. Polzer, H. W. Schüssler, G. Wackersreuther, and F-X. Wurm
`1220 Data-Driven Multicomputers in Digital Signal Processing, J-L. Gaudiot
1235 Synchronous Data Flow, E. A. Lee and D. G. Messerschmitt
1246 Multiple Digital Signal Processor Environment for Intelligent Signal Processing, W. S. Gass, R. T. Tarrant, B. I. Pawate, M. Gammel, P. K. Rajasekaran, R. H. Wiggins, and C. D. Covington
`CAD
`
1260 Computer-Aided Design of VLSI FIR Filters, P. R. Cappello and C-W. Wu
1272 A Silicon Compiler for Digital Signal Processing: Methodology, Implementation, and Applications, F. F. Yassa, J. R. Jasica, R. I. Hartley, and S. E. Noujaim
`Algorithms
`
1283 Vectorized Mixed Radix Discrete Fourier Transform Algorithms, R. C. Agarwal and J. W. Cooley
1293 Implementation of Digital Filtering Algorithms Using Pipelined Vector Processors, W. Sung and S. K. Mitra
1304 Array Architectures for Iterative Algorithms, H. V. Jagadish, S. K. Rao, and T. Kailath
`Software
`
1322 SIG—A General-Purpose Signal Processing Program, D. L. Lager and S. G. Azevedo
`
`PROCEEDINGS LETTERS
`
1333 Cumulants: A Powerful Tool in Signal Processing, G. B. Giannakis
1335 A Novel Design of a Combinational Network to Facilitate Fault Detection, P. R. Bhattacharjee, S. K. Basu, and J. C. Paul
1336 Asynchronous Fiber Optic Local Area Network Using CDMA and Optical Correlation, M. A. Santoro and P. R. Prucnal
1338 On a Constrained LMS Dither Algorithm, K. Cho and N. Ahmed
`
`Petitioner Microsoft Corporation - Ex. 1010, Cover 2
`
`
`
BOOK REVIEWS
1341 Spread Spectrum Systems, 2nd ed., by R. C. Dixon, reviewed by L. H. Sibul
1342 RC Active Filter Design Handbook, by F. W. Stephenson, reviewed by S. Natarajan
1342 Book Alert
1344 FUTURE SPECIAL ISSUES/SPECIAL SECTIONS OF THE PROCEEDINGS
`
COVER A family of successive digital signal processors (overlaying a programmable bit-level pipelined MAC chip) representing an important development within the topic area of this special issue.
1987 PROCEEDINGS EDITORIAL BOARD
M. I. Skolnik, Editor
R. L. Abrams
J. L. Melsa
C. N. Berglund
R. M. Mersereau
G. M. Borsuk
J. D. Musa
K. R. Carver
T. C. Pilkington
R. E. Crochiere
M. B. Pursley
J. O. Dimmock
R. J. Ringlee
R. C. Dixon
A. C. Schell
Tse-yun Feng
Gene Strull
A. L. Frisiani
Yasuharu Suematsu
G. T. Hawley
Kiyo Tomiyasu
W. K. Kahn
Sven Treitel
Sherman Karp
Harry Urkowitz
S. S. Lam
A. S. Willsky
Ruey-wen Liu
J. C. Wiltse
R. D. Masiello
M. J. Wozny
`
`1987 IEEE PUBLICATIONS BOARD
`C. H. House, Chairman
`N. R. Dixon, Vice Chairman
`D. L. Staiger, Staff Secretary
`J. J. Baldini
`G. F. McClure
`Donald Christiansen
`H. P. Meisel
`Robert Cotellessa
`T. Pavlidis
`S. H. Durrani
`H. B. Rigas
`Bruce Eisenstein
`T. E. Sharp
`Irving Engelson
`D. H. Sheingold
`W. C. Farrell
`Marwan Simaan
`Sheldon Gruber
`M. L. Skolnik
`W. K. Kahn
`Jerome Suran
`Philip Lopresti
`R. C. Williamson
`
HEADQUARTERS STAFF
Eric Herz, Executive Director and General Manager
Elwood K. Gannett, Deputy General Manager
`
`
PUBLISHING SERVICES
David L. Staiger, Staff Director
Elizabeth Braham, Manager, Database/Information Services
W. R. Crone, Manager, IEEE PRESS/PROCEEDINGS OF THE IEEE
Patricia H. Penick, Manager, Publication Administrative Services
Otto W. Vathke, Publication Business Manager
Ann H. Burgmeyer, Gail S. Ferenc, Carolyne Tamney, Production Managers
ADVERTISING
William R. Saunders, Advertising Director
Ralph Obert, Advertising Production Manager
Change of address must be received by the 1st of a month to be effective for the following month's issue. Send new address, plus mailing label showing old address, to the IEEE Service Center. Advertising correspondence should be addressed to the Advertising Department at IEEE Headquarters.
`
PROCEEDINGS STAFF
Hans P. Leander, Technical Editor
Nela Rybowicz, Associate Editor
PROCEEDINGS OF THE IEEE is published monthly by The Institute of Electrical and Electronics Engineers, Inc. IEEE Headquarters: 345 East 47th Street, New York, NY 10017-2394. IEEE Service Center (for orders, subscriptions, and address changes): 445 Hoes Lane, Piscataway, NJ 08854-4150. Telephones: Technical Editor 212-705-7906; Publishing Services 212-705-7560; IEEE Service Center 201-981-0060; Advertising 212-705-7579. Copyright and Reprint Permissions: Abstracting is permitted with credit to the source. Libraries are permitted to photocopy beyond the limits of the U.S. Copyright Law for private use of patrons: (1) those post-1977 articles that carry a code at the bottom of the first page, provided the per-copy fee indicated in the code is paid through the Copyright Clearance Center, 29 Congress Street, Salem, MA 01970; (2) pre-1978 articles without fee. Instructors are permitted to photocopy isolated articles for noncommercial classroom use without fee. For all other copying, reprint or republication permission, write to Copyrights and Permissions Department, IEEE Publishing Services, 345 East 47th Street, New York, NY 10017-2394. Copyright © 1987 by The Institute of Electrical and Electronics Engineers, Inc. All rights reserved. Printed in U.S.A. Second-class postage paid at New York, NY and at additional mailing offices. Postmaster: Send address changes to PROCEEDINGS OF THE IEEE, IEEE Service Center, 445 Hoes Lane, Piscataway, NJ 08854-4150.

Manuscripts should be submitted in triplicate to the Editor at IEEE Headquarters. A summary of instructions for preparation is found in the most recent January issue of this journal. Detailed instructions are contained in "Information for IEEE Authors," available on request. See note at beginning of "Proceedings Letters" for special instructions for this section. After a manuscript has been accepted for publication, the author's organization will be requested to pay a voluntary charge of $110 per printed page to cover part of the publication cost. Responsibility for contents of papers rests upon the authors and not the IEEE or its members.

Copyright: It is the policy of the IEEE to own the copyright to the technical contributions it publishes on behalf of the interests of the IEEE, its authors and employers, and to facilitate the appropriate reuse of this material by others. To comply with the U.S. Copyright Law, authors are required to sign an IEEE copyright transfer form before publication. This form, a copy of which is found in the most recent January issue of this journal, returns to authors and their employers full rights to reuse their material for their own purposes. Authors must submit a signed copy of this form with their manuscripts.

Annual Subscription: Member and nonmember subscription prices available on request. Single Copies: IEEE members $10.00 (first copy only), nonmembers $20.00 per copy. (Note: Add $4.00 for postage and handling charge to any order from $1.00 to $50.00, including prepaid orders.) Other: Available in microfiche and microfilm.
CONTRIBUTED PAPERS

The PROCEEDINGS OF THE IEEE welcomes for consideration technical papers on topics of broad significance and long-range interest in all areas of electrical, electronics, and computer engineering. In-depth reviews and tutorials are particularly appropriate, although papers describing individual research will be considered if they meet the general criteria above. Papers that do not meet these criteria should be submitted to the appropriate IEEE Transactions and Journals.

It is suggested that a prospective author, before preparing a full-length manuscript, submit a proposal containing a description of the topic and how it meets the above criteria, an outline of the proposed paper, and a brief biography showing the author's qualifications to write the paper (including reference to previously published material as well as the author's relation to the topic). If the proposal receives a favorable review, the author will be encouraged to prepare the paper, which after submittal will go through the normal review process.

Please send papers and proposals to the Technical Editor, PROCEEDINGS OF THE IEEE, 345 East 47th Street, New York, NY 10017-2394, USA (Telephone: 212-705-7906).
`
`
`Petitioner Microsoft Corporation - Ex. 1010, Cover 3
`
`
`
Data-Driven Multicomputers in Digital Signal Processing

JEAN-LUC GAUDIOT, MEMBER, IEEE
`
New technologies of integration allow the design of powerful systems which may include several thousands of elementary processors. These multiprocessors may be used for a range of applications in signal and data processing. However, assuring the proper interaction of a large number of processors and the ultimate safe execution of the user programs presents a crucial scheduling problem. The scheduling of operations upon the availability of their operands has been termed the data-driven mode of execution and offers an elegant solution to the issue. This approach is described in this paper, and several architectures which have been proposed or implemented (systolic arrays, data-flow machines, etc.) are examined in detail. The problems associated with data-driven execution are also studied. A multi-level approach to high-speed digital signal processing is then evaluated.
`
I. INTRODUCTION
`
If we are to approach the computational throughputs equivalent to billions of instructions per second which will be required from the processing systems of the future, improvements on all levels of computer design must be made. Faster technology and better packaging methods can be applied to raise clock rates. However, a one billion instructions per second machine would require a clock period as low as a nanosecond. This approach is inevitably bounded by physical limits such as the speed of light. Therefore, instead of considering the technological approach to performance improvement, we emphasize here the architectural method. Indeed, instead of merely increasing the clock frequency for a corresponding increase in overall throughput, performance can also be improved by allowing multiple processing elements to collaborate on the same program. This inevitably introduces synchronization problems, and issues of resource allocation and sharing must be solved. Programmability is indeed the central problem. In one solution, a conventional language such as Fortran is used to program the application. A sophisticated compiler is relied upon to partition a sequential program for execution on a multiprocessor. This approach has the
`
Manuscript received September 4, 1986; revised January 23, 1987. This work was supported in part by the Department of Energy under Grant DE-FG03-87ER25043. The views expressed in this paper are not necessarily endorsed by the U.S. Department of Energy.
The author is with the Computer Research Institute, Department of Electrical Engineering—Systems, University of Southern California, Los Angeles, CA 90089, USA.
IEEE Log Number 8716208.
`
advantage of imposing no "software retooling." However, complex numerical applications will not be easily partitioned and much potential parallelism may remain undetected by the compiler.
Ada, CSP [26], and extended Fortran (e.g., HEP, Sequent), on the other hand, allow the programmer to deal with parallel processes by the use of primitives for parallel task spawning, synchronization, and message passing. However, while the programmer can express some of the parallelism characteristic of the application, much potential concurrency may never be uncovered because of the inherently sequential concepts of the language, which must be countered through the use of special "parallelism spawning" instructions. Also, development time becomes important since the programmer must "juggle" many parallel tasks to synchronize. In addition, debugging becomes correspondingly more difficult due to the sometimes nondeterministic appearance of errors.
For these reasons, an implicit approach must be devised. In the above two methods, instruction scheduling is based upon a central program counter. We propose to demonstrate here the data-driven approach to programming multiprocessors: instructions can be scheduled by the availability of their operands. This model of execution is a subset of the functional model of execution [9]. It provides a significant improvement to the programmability of multiprocessors by excluding the notion of global state and introducing the notion of values applied to functions, instead of instructions fetching the contents of memory cells as they do in the conventional "control-flow" model.
The overall objective of this paper is to demonstrate the applicability of data-driven principles of execution to the design of high-performance signal and data processing architectures. Several approaches will be demonstrated and their particular domains of application will be contrasted. The description of low-level processing systems is beyond the scope of this paper and the interested reader is referred to an excellent survey by Allen [3]. Instead, we will concentrate here on the issues related to building high-performance multiprocessors for signal processing applications.
In Section II, we show the type of problems considered in signal processing. The data-flow principles of execution as they relate to digital signal processing problems are described in detail in Section III, while several existing data-driven architectures are described in Section IV. In Section V, we analyze a multi-level data-driven architecture and examine its programming environment. Conclusions are drawn in Section VI.

0018-9219/87/0900-1220$01.00 © 1987 IEEE

1220

PROCEEDINGS OF THE IEEE, VOL. 75, NO. 9, SEPTEMBER 1987

Petitioner Microsoft Corporation - Ex. 1010, p. 1220
`
II. THE REQUIREMENTS OF SIGNAL PROCESSING
`
Digital signal processing techniques are applied to many different technical problems. These include radar and sonar systems, image processing, speech recognition, etc. The elementary building blocks of these were originally concentrated on such tasks as convolution, correlation, and the Fourier transform. More complex algorithms (matrix operations, linear systems solvers, etc.) are now considered. Higher order operations include not only simple problems such as elementary filtering (IIR, FIR, etc.), but also more complex functions such as adaptive and Kalman filtering [45]. Also, such complex problems as Computer-Aided Tomography or Synthetic Aperture Radar can be considered [39], [16]. Signal processing algorithms are very appropriate for description by functional languages. Indeed, a signal processing algorithm is often represented in a graph form [36] and can be decomposed in two levels:

* a regular level which can be implemented by a vector operation (i.e., a loop in which all iterations present no dependencies among themselves);
* a level which contains conditional operations and heuristic decision making.
`
This description shows that the lower operational levels can easily deliver parallelism (by compiler analysis or programmer inspection). This layer usually consists of simple constructs (arithmetic instructions, FFT butterfly networks, simple filters, etc.). However, the higher levels will require more complex problem insight and even runtime dependency detection in order to allow maximum parallelism. We will now describe principles of execution which will allow us to deliver this concurrency.
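The "regular level" above can be made concrete with a present-day sketch (ours, not the paper's): a direct-form FIR filter is a loop whose iterations carry no dependencies on one another, so every output sample could in principle be computed in parallel as a vector operation.

```python
# Illustrative sketch (not from the paper): the "regular level" of a DSP
# algorithm -- here a direct-form FIR filter -- is a loop whose iterations
# are mutually independent.

def fir(x, h):
    """y[n] = sum_k h[k] * x[n-k]; each output index n is independent."""
    y = []
    for n in range(len(x)):            # iterations present no dependencies
        acc = 0.0
        for k in range(len(h)):
            if n - k >= 0:
                acc += h[k] * x[n - k]
        y.append(acc)
    return y

print(fir([1.0, 0.0, 0.0, 1.0], [0.5, 0.5]))   # -> [0.5, 0.5, 0.0, 0.5]
```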
`
III. DATA-FLOW PRINCIPLES
`
The data-flow solution to the programmability problems of large-scale multiprocessors [5] has been pioneered by Adams [2], Chamberlin [11], and Rodriguez [43]. It is now described in detail in this section.
`
A. Basic Principles of Execution

In the conventional von Neumann model of execution, an instruction is declared executable when a Program Counter of the machine points to it. This event is usually under direct programmer control. While a control-flow program is a sequential listing of instructions, a data-flow program can be represented as a graph where the nodes are the instructions (actors) which communicate with other nodes over arcs (Fig. 1). An instruction is declared executable when it has all its operands. In the graph representation chosen above, this means that all the input arcs to an actor must carry data values (referred to as tokens) before this actor can be executed. Execution proceeds by first absorbing the input tokens, processing the input values according to the op-code of the actor, and accordingly producing result tokens on the output arcs. In summary, it can
`
GAUDIOT: DATA-DRIVEN MULTICOMPUTERS IN DSP
`
`
`
`Fig. 1. A simple data-flow graph.
`
be said that the data-flow model of execution obeys two fundamental principles:
* Asynchrony of operations: The executability of an instruction is decided by a local criterion only. The presence of the operands can be sensed "locally" by each instruction. This is an attractive property for an implementation in a distributed environment where no central controller should be used for global scheduling.
* Functionality of the operations: The effect of each operation is limited to the production of results to be consumed by a specific number of other actors. This precludes the existence of "side-effects." These side-effects may be long-ranging in that the execution of an instruction may affect the state of a cell of memory which will be used only much later by another unrelated operation.
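These two principles can be sketched in a few lines of present-day Python (our illustration; the actor and arc names are ours, not the paper's): an actor fires as soon as tokens are present on all of its input arcs, a purely local test with no program counter, and its only effect is to place result tokens on its output arc.

```python
# Minimal sketch of the data-flow firing rule (ours, not the paper's
# implementation): fire any actor whose input arcs all carry tokens.
import operator

class Actor:
    def __init__(self, op, inputs, output):
        self.op, self.inputs, self.output = op, inputs, output

def run(actors, arcs):
    """arcs maps arc-name -> FIFO list of queued tokens."""
    fired = True
    while fired:
        fired = False
        for a in actors:
            if all(arcs[i] for i in a.inputs):        # asynchrony: local test
                args = [arcs[i].pop(0) for i in a.inputs]
                arcs[a.output].append(a.op(*args))    # functionality: no side-effects
                fired = True
    return arcs

# the graph of Fig. 1 could compute, e.g., (x + y) * (x - y):
arcs = {"x1": [5], "y1": [3], "x2": [5], "y2": [3], "s": [], "d": [], "p": []}
actors = [Actor(operator.add, ["x1", "y1"], "s"),
          Actor(operator.sub, ["x2", "y2"], "d"),
          Actor(operator.mul, ["s", "d"], "p")]
print(run(actors, arcs)["p"])    # -> [16]
```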
`
B. Data-Flow Interpreters

When iterations are executed, the underlying principle of data-flow (single assignment of variables) must invariably be violated. Indeed, for an actor to be repeatedly evaluated as in an iteration, its input arcs must carry several tokens (from different iterations). Several solutions have been proposed to allow the controlled violation of these rules without compromising the safe execution of the program. Among these, the Acknowledgment scheme and the U-interpreter have been given the most consideration.
1) Acknowledgment Scheme [14]: Proper matching of the tokens can be observed by ordering the token production. This would be done by a careful design of the program graph so as to insure that tokens of two different iterations can never overtake each other. In addition, it must be guaranteed that no token pileup is encountered on any one arc. This condition can be verified by allowing the firing of an actor when tokens are on all input arcs and there are no tokens on any output arcs. In order to enforce this last condition, an acknowledgment must be sent by the successor(s) to the predecessor when the token has been consumed (Fig. 2). Note that an actor is executable when it has received its input arguments as well as all acknowledgments. The parallelism which can be exploited from this scheme is mostly pipelining between the actors of different iterations. Thus when the number of instructions in the body of an iteration is the same as the number of available processors, the speedup observed by this mechanism of execution is maximal. However, for small iterations (compared to the size of the machine), the exploited parallelism
`
`Petitioner Microsoft Corporation - Ex. 1010, p. 1221
`
`
`
`
Fig. 2. The acknowledgment scheme.
`
remains limited. Several machines based on these static principles have been designed: the MIT cell block architecture [13], the Hughes Data-Flow Machine [21], the DSFP [24], and the USC TX16 [19].
2) Unraveling Interpreter (U-Interpreter): The U-interpreter [4] provides the most asynchronous possible operation. In order to allow safe execution of actors in an iterative construct, tokens are tagged with information pertaining to the context in which they are to be found. This tag includes the iteration number. Indeed, the U-interpreter closely follows these principles: to each data token is attached a tag of the form u.P.s.i, where P identifies the procedure name of the destination actor, while s
identifies the destination actor within the procedure and i is the iteration number. In a typical iterative construct, the loop is entered through an L actor at the top of the graph. The function of this actor is to create a new context for the execution of the iteration: the input tokens are tagged by u.P.s.i while the output tokens are identical but are tagged with u'.P.s.1, where u' is itself u.i. Note that this mechanism is sufficient to create an entirely different set of tokens for two nested iterations. Indeed, assume that two tokens both belong to the same inner iteration i but belong to the j1 and j2 outer iterations, respectively. The first token
`
Fig. 3. A typical iterative construct in the U-interpreter.
`
would be tagged u1.P.s.i (u1 = u.j1) while the second is tagged with u2.P.s.i (u2 = u.j2). This shows that an appropriate differentiation has been made between the two instances. The original context u is retrieved by the L⁻¹ actor before exiting.
Contrary to the acknowledgment scheme, this dynamic data-flow scheme allows full asynchronous execution of the loop iterations without compiler-induced replication of the graph. Likewise, multiple function calls and more particularly recursions are allowed since each new actor instantiation receives a different tag. This means that the U-interpreter would be preferred to the static model when the ability for fast recursive calls is required. However, this flexibility comes at the expense of added hardware complexity. Indeed, it will be shown in Section IV-D that implementation of the U-interpreter requires an associative memory for fast tag matching. Several machines based on these principles have been studied: the MIT tagged token data-flow machine [6], the ESL DDSP [28], the University of Manchester machine [23], the ETL Sigma-1 [25], etc.
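The tag discipline can be sketched as follows (our present-day illustration; the tuple encoding is ours, not a machine's actual format): the loop-entry actor L folds the iteration count into the context so that tokens from distinct nested iterations carry distinct tags, and an associative store matches only tokens bearing identical tags.

```python
# Sketch (ours) of U-interpreter-style tagging: tokens are pairs
# (tag, value) with tag = (u, P, s, i); u is the context, P the procedure,
# s the destination actor, i the iteration number.

def L(token):
    """Loop entry: fold the iteration count into the context, restart i at 1."""
    (u, P, s, i), value = token
    return ((u + (i,), P, s, 1), value)

def L_inv(token):
    """Loop exit: recover the original context."""
    (u, P, s, i), value = token
    return ((u[:-1], P, s, u[-1]), value)

def match(store, token):
    """Associative matching: pair up tokens that carry the same tag."""
    tag, value = token
    if tag in store:
        return (tag, store.pop(tag), value)   # both operands ready
    store[tag] = value                        # wait for the mate
    return None

# Two tokens from the same inner iteration but outer iterations j=1 and
# j=4 receive distinct contexts after L, so they can never be confused:
t1 = L((((), "P", 7, 1), "a"))
t2 = L((((), "P", 7, 4), "b"))
store = {}
assert match(store, t1) is None and match(store, t2) is None
assert L_inv(L((((), "P", 7, 3), "v"))) == (((), "P", 7, 3), "v")
```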
`
C. Structure Handling

This is a crucial issue in signal processing, for this kind of application requires that many data elements which belong to the same structure be processed in a parallel or pipelined fashion. One of the basic premises of data-flow principles states that an output is a function of its inputs only, regardless of the state of the machine at the time of execution. When a structure of elementary elements must be processed, the absence of side-effects means that it may not be updated, for this would imply its transition through several states. Instead, if any updates are needed, a new array which contains the new elements must be created. Copying of all elements must be undertaken for the modification of a single one. This solution imposes an inordinate overhead. This is why the implementation schemes we will now describe can shortcut this complete copying while
`
`Petitioner Microsoft Corporation - Ex. 1010, p. 1222
`
`
`
`
`
of the possibility of parallelism that can be detected only at runtime. For instance, dependencies on A and B related by A(F(i)) = B(i) may be artificially imposed. However, the applications targeted by the HDFM include some amount of regularity which can be easily detected by the compiler and implemented as conventional arrays.
4) Token Relabeling [18]: In the U-Interpreter, the notion of array can be entirely ignored at the lowest level of execution. Instead, the tag associated with each token under the rules of the U-interpretation is used as identification of the index of the array element of the high-level language. In other words, it can be simply said that, when an array A is created, its A(i) element will be tagged with i (hereafter denoted A(i)ᵢ), if the elements are produced in the logical order. In the "production" of a unidimensional array, the iteration number can usually be directly interpreted as the index of the array element just produced by the iteration. Special token relabeling program graphs can be created to handle scatter and gather program constructs [27] (Fig. 5(a)). This figure shows that an inversion function F⁻¹
`
`Ay
`. (Relabeling}
`(en)
`. Ai)
`<=> A(F{EI}
`
`fey]
`
`[i]
`
`By
`*.
`\
`
`Clk
`
`| We
`
`[=1,100
`DO!
`J Cl} = BIL) + A{F(T))
`(a)
`
`=t
`— Fk),
`fe]
`
`treat]
`
`Aik),
`At Vay
`
`o pl
`Bh Alk)
`Fo> ALF)
`“
`(rte)
`
`i]
`
`[e
`
`i)
`
`-
`
`y(
`
`b)
`
`Fig. 5.
`
`(a) A gather operation. (b) Token Relabeling gather.
`
This demonstrates that, without recourse to the calculation of F⁻¹, the proper relabeling of the A elements has been effectively produced.
This algorithm requires no intermediary storage, does not need array operations, and imposes smaller hardware and execution overhead. This relabeling approach eliminates a large portion of the overhead associated with the production and consumption of array A. Pipelining between the source and the sink of a data structure is the goal of this
unknown at compile time would be needed to perform the relabeling of data-flow tokens. Such a calculation is not truly necessary. Instead, we introduce (Fig. 5(b)) a sequence generator which produces the F(j)'s, tagged by j. An exchanger actor (called x) swaps the tag and the data value and produces j's tagged by F(j). Both streams (the A's and the relabeled j's) are input to a special relabeling actor θ which only modifies the iteration portion of the tag. By the principles of the U-interpreter, only tokens which bear the same tag will be matched as proper inputs to the θ actor. In other words, the mate of
`
preserving the meaning of the program during array update operations.
1) Heaps: Dennis [14] has proposed to represent arrays by directed acyclic graphs (also called heaps). Each value is represented as the leaf of a graph tree. The modification of a single element in a heap is represented in Fig. 4. Note that the complexity of the modification of a single element of the array is O(n) for a copy operation, while it is O(log n) for the heap. Several instructions are exclusively devoted to the access of heaps [15]: SELECT receives a pointer to the root node and an index value, and returns a pointer to the substructure (which may be a leaf node) associated with the index; APPEND also needs the same two operands in addition to the value of the element to append to the structure.

Fig. 4. A heap update.
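The O(log n) bound can be illustrated with a small present-day sketch (ours, not the paper's code): storing the array as a binary tree, an update builds a new tree that copies only the root-to-leaf path and shares every untouched subtree with the old version, so no side-effect is ever visible to holders of the old root.

```python
# Sketch (ours) of the heap idea behind Fig. 4: functional update of an
# array-as-tree copies only O(log n) nodes instead of all n elements.

def build(values):
    """Pack a list into a binary tree of nested pairs; leaves are values."""
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    return (build(values[:mid]), build(values[mid:]))

def select(node, i, n):
    """SELECT: walk from the root down to element i of an n-element tree."""
    while n > 1:
        mid = n // 2
        if i < mid:
            node, n = node[0], mid
        else:
            node, n, i = node[1], n - mid, i - mid
    return node

def append(node, i, n, value):
    """APPEND in the heap sense: a new tree sharing all untouched subtrees."""
    if n == 1:
        return value
    mid = n // 2
    if i < mid:
        return (append(node[0], i, mid, value), node[1])
    return (node[0], append(node[1], i - mid, n - mid, value))

old = build([10, 11, 12, 13])
new = append(old, 2, 4, 99)
print(select(old, 2, 4), select(new, 2, 4))   # -> 12 99
assert new[0] is old[0]   # left subtree is shared, not copied
```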
2) I-Structures: A heap must be entirely ready before it can be consumed because no consumption (SELECT actors) can take place until the pointer token appears (i.e., the creation of the array is completed). In the I-structure scheme [7], constraints on the creation of arrays allow the selection of individual elements (or substructures) from the array before its complete production. One possible implementation of I-structures makes use of a "presence" bit which indicates when an element of an array has been calculated and is ready for consumption. An attempt to read an empty cell would cause the read to be deferred until such time that the cell presence bit is marked. Conversely, a write into a cell, the presence bit of which indicates valid stored data, could be cause for the generation of an error signal. The advantages of this scheme are:

* better performance, because pipelining is allowed between I-structure consumers and producers;
* less "serialization" of operations such as APPENDs, because they are allowed to occur independently on the same structure.
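One cell of such a structure can be sketched as follows (our present-day illustration, assuming a simple callback interface for deferred reads; the class and method names are ours): reads of an empty cell are queued, a write satisfies them, and a second write raises the error signal.

```python
# Sketch (ours) of one I-structure cell: a presence bit lets consumers
# start before the whole array is produced.

class IStructureCell:
    def __init__(self):
        self.present = False
        self.value = None
        self.deferred = []          # consumers waiting on this cell

    def read(self, consumer):
        if self.present:
            consumer(self.value)
        else:
            self.deferred.append(consumer)   # defer until written

    def write(self, value):
        if self.present:                     # presence bit already set
            raise RuntimeError("I-structure cell written twice")
        self.present, self.value = True, value
        for consumer in self.deferred:       # wake the deferred readers
            consumer(value)
        self.deferred.clear()

cell = IStructureCell()
log = []
cell.read(log.append)     # arrives early: deferred
cell.write(42)            # satisfies the pending read
cell.read(log.append)     # arrives late: served immediately
print(log)                # -> [42, 42]
```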
3) HDFM Arrays: A special scheme for handling arrays in a VAL high-level environment has been designed for the Hughes Data-Flow Machine (HDFM) [21]. It uses the fact that data-flow arrays as described above are overly "asynchronous," i.e., they do not take advantage of the data dependency information carried by the program graph. Safety of accesses is respected by not allowing the updating of an array before all the reads from the current version of the array have been performed. Only then can the array be directly modified. Safety and correct execution of WRITE operations are a compile-time task. This has the advantage of reducing the number of memory accesses (no complex graph of pointers must be traversed as in heaps) as well as of offering a better possibility of distribution of an array (no root node). However, spurious data dependencies may be introduced because the compiler is not necessarily aware
`
`
`Petitioner Microsoft Corporation - Ex. 1010, p. 1223
`
`
`
their domain of application to design or compile time and refer to them as data-driven systems. We thus initially examine multiprocessor systems where data dependencies have been frozen at design time (systolic arrays). We then consider programmable systolic arrays (the Wavefront Array Processor) and multiprocessors scheduled at compile time by the use of data-flow program graphs (the ESL polycyclic processor). Finally, we study systems where the data dependencies provide scheduling information at runtime (the Hughes Data-Flow Machine) and examine the influence of the level of resolution upon the performance (the USC TX16).

A. Systolic Arrays [32]

Systolic arrays have been applied to many signal processing problems: matrix operations, image processing, etc. In summary, a systolic array is simply a collection of interconnected Processing Elements (PEs). In order to incorporate as many processors as possible, the structure of the PEs themselves is kept to a maximum simplicity and usually includes only a few operation units. For design simplification, there are few types of PEs in the same system. By and large, the PEs are interconnected in a nearest-neighbor topology in order to minimize communication delays as well as power distribution issues. Note that topol-
`
the θ actor is the relabeling actor which takes the A input and relabels it with the data carried by the token on the other input arc. In other words, it outputs A(F(i))ᵢ. Since i is a dummy variable, and since F is bijective, it can be said that, from a global point of view, A(F(i))ᵢ = A(k) relabeled by F⁻¹(k).
scheme, just as it was the idea behind the design of the I-structures. However, the token relabeling approach brings a better runtime memory management: since tokens corresponding to the various elements of the array still exist and must still be temporarily stored, they need not go through an additional storage as a data structure. Also, there is no need for "requests" for data as would be the case in an I-structure environment: when an I-structure is created, actors which need data from it must demand the element in question until it has arrived. This may introduce heavy overheads as unsatisfied requests must be queued in the structure itself. Garbage collection is automatically handled since when the "array token" is matched, it is automatically removed from the arc. In other words, when it has been used, it is swallowed by applying data-flow principles.
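The relabeling gather of Fig. 5(b) can be sketched in present-day terms (our illustration, with dictionaries standing in for the tag-matching store; the function names are ours): F⁻¹ is never computed. A generator emits each F(i) tagged by i, the exchanger swaps tag and value, and tag matching pairs every A-element with the token bearing its own index as data.

```python
# Sketch (ours) of the token-relabeling gather: compute C(i) = B(i) + A(F(i))
# for a bijective F without ever inverting F.

def gather(A, B, F, n):
    # sequence generator + exchanger: token (tag=F(i), value=i)
    wanted = {F(i): i for i in range(n)}
    # relabeling actor: A(k), tagged k, meets the token whose tag is k
    # and takes over that token's value i -- it is now A(F(i)) tagged i
    relabeled = {}
    for k, a_k in enumerate(A):
        if k in wanted:
            relabeled[wanted[k]] = a_k
    return [B[i] + relabeled[i] for i in range(n)]

A = [100, 200, 300, 400]
B = [1, 2, 3, 4]
print(gather(A, B, lambda i: 3 - i, 4))   # C(i) = B(i) + A(3-i) -> [401, 302, 203, 104]
```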
`D. High-Level Data-Flow Languages
`In addition to the low-level mechanisms of execution
`which were described earlier, special high-level data-flow
`languages have been designed for €asier
`translation into
`data-flowgraphs. To be sure, these high-level languages are
`nota necessity: the Texas Instruments data-flow Project [31]
`relied upon Fortran Programming through the use of a
`modified version of a vectorizing compiler Originally des-
`tinedto the TI ASC. However, many high-level languages
`have been designedfor data-flow Prototypes. Most notable
`are VAL (Value