IEEE TRANSACTIONS ON COMPUTERS, VOL. C-31, NO. 11, NOVEMBER 1982

Wavefront Array Processor: Language, Architecture, and Applications

SUN-YUAN KUNG, MEMBER, IEEE, K. S. ARUN, STUDENT MEMBER, IEEE, RON J. GAL-EZER, STUDENT MEMBER, IEEE, AND D. V. BHASKAR RAO, STUDENT MEMBER, IEEE
`
Abstract—This paper describes the development of a wavefront-based language and architecture for a programmable special-purpose multiprocessor array. Based on the notion of computational wavefront, the hardware of the processor array is designed to provide a computing medium that preserves the key properties of the wavefront. In conjunction, a wavefront language (MDFL) is introduced that drastically reduces the complexity of the description of parallel algorithms and simulates the wavefront propagation across the computing network. Together, the hardware and the language lead to a programmable wavefront array processor (WAP). The WAP blends the advantages of the dedicated systolic array and the general-purpose data-flow machine, and provides a powerful tool for the high-speed execution of a large class of matrix operations and related algorithms which have widespread applications.
`
`Index Terms—Asynchrony, computational wavefront, concurrency,
`data-flow computing, matrix data-flow language, signal processing,
`systolic array, VLSI array processor, wavefront architecture.
`
I. INTRODUCTION

A. VLSI Signal Processing
`
WITH the rapidly growing microelectronics technology leading the way, modern signal processing is undergoing a major revolution. The past two decades have witnessed a steep increase in the complexity of computations, processing speed requirements, and the volume of data handled in various signal processing applications. The availability of low cost, high density, fast VLSI devices has opened a new avenue for implementing these increasingly sophisticated algorithms and systems [1], [2]. While a major improvement in device speed is foreseen, it is in no way comparable to the rate of increase of throughput required by modern real-time signal processing. In order to achieve such increases in throughput rate, the only effective solution appears to be highly concurrent processing. Consequently, it has become a major challenge to update the current signal processing techniques so as to maximally exploit their potential for concurrent execution.

In a broad sense, the answer to this challenge lies in a cross-disciplinary research encompassing the areas of algorithm analysis, parallel computer design, and system applications. In this paper, we therefore introduce an integrated research approach aimed at incorporating the vast VLSI computational capability into modern signal processing applications.

Manuscript received January 11, 1982; revised June 18, 1982. This work was supported in part by the Office of Naval Research under Contracts N00014-80-C-0457 and N00014-81-K-0191, by the National Science Foundation under Grant ECS-80-16581, and by the Defense Advanced Research Projects Agency under Contract MDA903-79-C-0680.
The authors are with the Department of Electrical Engineering-Systems, University of Southern California, Los Angeles, CA 90089.
The traditional design of parallel computers and languages is deemed unsuitable for our purposes. It usually suffers from heavy supervisory overhead incurred by synchronization, communication, and scheduling tasks, which severely hamper the throughput rate which is critical to real-time signal processing. In fact, these are the key barriers inherent in very large scale computing structure design. Moreover, although VLSI provides the capability of implementing a large array of processors on one chip, it imposes its own constraints on the system. Large design and layout costs [2] suggest the utilization of a repetitive modular structure. In addition, communication, which costs the most in VLSI chips in terms of area, time, and energy, has to be restricted (to localized communication) [1]. In general, highly concurrent systems require this locality property in order to reduce interdependence and ensuing waiting delays that result from excessive communication [1]. This locality constraint prevents the utilization of centralized control and global synchronization. The resulting use of asynchronous distributed control and localized data flow is an effective approach to the design of very large scale, highly concurrent computing structures.
`
B. A Special-Purpose VLSI Array Processor

The above restrictions imposed by VLSI will render the general-purpose array processor rather inefficient. We therefore restrict ourselves to a special class of applications, i.e., recursive¹ and local data-dependent algorithms, to conform with the constraints imposed by VLSI. This restriction, however, incurs little loss of generality, as a great majority of signal processing algorithms possess these properties. One typical example is a class of matrix algorithms. It has recently been indicated that a major portion of the computational needs for signal processing and applied mathematical problems can, in fact, be reduced to a basic set of matrix operations and other related algorithms [3], [4]. Therefore, a special-purpose parallel machine for processing these typical computational algorithms will be cost effective and attractive in VLSI system design.

¹ In a recursive algorithm, all processors do nearly identical tasks, and each processor repeats a fixed set of tasks on sequentially available data.
`
`0018-9340/82/1100-1054$00.75 © 1982 IEEE
`
`Petitioner Microsoft Corporation - Ex. 1021, p. 1054
`
Fig. 1. The WAP configuration. (Memory modules on the top and left of the array; first wave — — —; second wave - - - - -.)
`
Very significantly, these algorithms involve repeated application of relatively simple operations with regular localized data flow in a homogeneous computing network. This leads to an important notion of computational wavefront, which portrays the computation activities in a manner resembling a wave propagation phenomenon. More precisely, the recursive nature of the algorithm, in conjunction with the localized data dependency, points to a continuously advancing wave of data and computational activity. The computational sequence starts with one element and propagates through the processor array, closely resembling a physical wave phenomenon (cf. Fig. 1). In fact, all algorithms that have locality and recursivity (and are thus implementable on our VLSI array processor) will exhibit this wave phenomenon. Therefore, the notion of a computational wavefront has attracted the attention of many researchers [2], [5]-[13].

The wavefront concept provides a firm theoretical foundation for the design of highly parallel array processors and concurrent languages. In addition, this concept appears to have some distinct advantages.
First, the wavefront notion drastically reduces the complexity in the description of parallel algorithms. The mechanism provided for this description is a special-purpose, wavefront-oriented language [7], [9], [10]. Rather than requiring a program for each processor in the array, this language allows the programmer to address an entire front of processors.

Second, the wavefront notion leads to a wavefront-based architecture which preserves Huygens' principle [14], and ensures that wavefronts never intersect. Therefore, a wavefront architecture can provide asynchronous waiting capability, and consequently, can cope with timing uncertainties, such as local clocking, random delay in communications, and fluctuations of computing times. In short, the notion lends itself to an (asynchronous) data-flow computing structure that conforms well with the constraints of VLSI.

The integration of the wavefront concept, the wavefront language, and the wavefront architecture leads to a programmable computing network, which we will call the wavefront array processor (WAP). The WAP is, in a sense, an optimal tradeoff between the globally synchronized and dedicated systolic array [1], [15], [16] (that works on a similar set of algorithms), and the general-purpose data-flow multiprocessors [17]-[23]. It provides a powerful tool for the high-speed execution of a large class of algorithms which have widespread applications.
`
C. Organization

The organization of the rest of the paper is as follows. Section II elaborates on the computational wavefront. Section III proposes a wavefront-oriented language (MDFL) for programming the WAP, and provides a programming methodology. Section IV explains a possible hardware organization and architecture for the WAP. Section V illustrates some applications of the WAP by means of MDFL programs. Section VI compares the WAP to other array processors, such as the systolic array, data-flow multiprocessors, and conventional SIMD arrays like the Illiac IV.
`
II. CONCEPT OF COMPUTATIONAL WAVEFRONT

The wavefront array processor is configured in a square array of N × N processing elements with regular and local interconnections (cf. Fig. 1).

The computing network serves as a (data) wave-propagating medium. The notion of computational wavefront can be best explained by a simple example. To this end, we shall consider matrix multiplication as being representative. Let A = {a_ij}, B = {b_ij}, and C = A × B = {c_ij} all be N × N matrices. The matrix A can be decomposed into columns A_i and the matrix B into rows B_j, and therefore,

    C = A_1 * B_1 + A_2 * B_2 + ... + A_N * B_N.   (1)
`
The matrix multiplication can then be carried out in N recursions, executing

    C^(k) = C^(k-1) + A_k * B_k   (2)

recursively for k = 1, 2, ..., N.
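Each recursion in (2) is a rank-1 (outer-product) update of C. The structure can be traced with a short Python sketch; this is our own illustration (including the function name), not anything prescribed by the paper:

```python
def matmul_by_recursions(A, B):
    """Compute C = A x B as N rank-1 updates, one per recursion:
    C(k) = C(k-1) + A_k * B_k, where A_k is the kth column of A
    and B_k is the kth row of B (cf. (1) and (2))."""
    N = len(A)
    C = [[0.0] * N for _ in range(N)]
    for k in range(N):                        # kth recursion
        a_col = [A[i][k] for i in range(N)]   # column A_k
        b_row = B[k]                          # row B_k
        for i in range(N):
            for j in range(N):
                C[i][j] += a_col[i] * b_row[j]
    return C
```

For A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]] this yields [[19.0, 22.0], [43.0, 50.0]], matching ordinary matrix multiplication.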
The exploitation of parallelism is now evident with the availability of N × N processors. A parallel algorithm for matrix multiplication is fairly simple. However, most existing parallel programs would need global interconnections in the computing network, while localized interconnections and data flow are much more desirable in VLSI systems.

The topology of the matrix multiplication algorithm can be mapped naturally onto the square, orthogonal N × N matrix array of the WAP (cf. Fig. 1). To create a smooth data movement in a localized communication network, we make use of the computational wavefront concept. For the purpose of this example, a wavefront in the processing array will correspond to a mathematical recursion in the algorithm. Successive pipelining of the wavefronts will accomplish the computation of all recursions.
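To make the sweep concrete: if PE (1, 1) is taken as index (0, 0), the wavefront at front t activates exactly the PE's (i, j) with i + j = t, so an N × N array is swept in 2N − 1 fronts. A small helper (our own sketch, not part of the paper) enumerates them:

```python
def fronts(N):
    """Enumerate the 2N-1 wavefronts of an N x N array.

    Front t (t = 0 .. 2N-2) holds the anti-diagonal of PEs (i, j)
    with i + j = t, starting from the corner PE (0, 0).
    """
    return [[(i, t - i) for i in range(N) if 0 <= t - i < N]
            for t in range(2 * N - 1)]
```

For N = 3 this gives five fronts, the third being [(0, 2), (1, 1), (2, 0)]; all PEs on a front are independent and may fire concurrently.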
As an example, the computational wavefront for the first recursion in matrix multiplication will now be examined.
`
`
Suppose that the registers of all the processing elements (PE's) are initially set to zero:

    c_ij^(0) = 0   for all (i, j);

the entries of A are stored in the memory modules to the left (in columns), and those of B in the memory modules on the top (in rows). The process starts with PE (1, 1), where

    c_11^(1) = c_11^(0) + a_11 * b_11   (3)

is computed. The computational activity then propagates to the neighboring PE's (1, 2) and (2, 1), which will execute

    c_12^(1) = c_12^(0) + a_11 * b_12   (4)

and

    c_21^(1) = c_21^(0) + a_21 * b_11.   (5)
`
The next front of activity will be at PE's (3, 1), (2, 2), and (1, 3), thus creating a computation wavefront traveling down the processor array. This computational wavefront is similar to optical wavefronts (they both obey Huygens' principle) since each processor acts as a secondary source and is responsible for the propagation of the wavefront. It may be noted that wave propagation implies localized data flow. Once the wavefront sweeps through all the cells, the first recursion is over. As the first wave propagates, we can execute an identical second recursion in parallel by pipelining a second wavefront immediately after the first one. For example, the (1, 1) processor will execute

    c_11^(2) = c_11^(1) + a_12 * b_21   (6)

and so on. In general, the (i, j)th processor will execute the kth recursion and, after all N recursions, accumulate

    c_ij^(N) = a_i1 * b_1j + a_i2 * b_2j + ... + a_iN * b_Nj.   (7)

The pipelining is feasible because the wavefronts of two successive recursions will never intersect (Huygens' wavefront principle), as the processors executing the recursions at any given instant will be different, thus avoiding any contention problems.

Locality, regularity, recursivity, and concurrency lead to the wavefront phenomenon. Thus, all algorithms which possess these properties will exhibit computational wavefronts. The notion of computational wavefronts leads to a wavefront-based language (a modified data-flow language) to program the processor array. This language (called the matrix data-flow language) is especially powerful for matrix and other related algorithms. It will be developed in the next section.

III. A WAVEFRONT LANGUAGE: MATRIX DATA-FLOW LANGUAGE

In contrast to the heavy burden of scheduling, resource sharing, as well as the control of processor interactions encountered in programming a general-purpose multiprocessor, the computational wavefront notion can facilitate the description of parallel algorithms and drastically reduce the complexity of parallel programming. In this section, we introduce a wavefront language that is tailored towards the description of computational wavefronts and the corresponding data flow in a large class of algorithms (which exhibit the recursivity and locality mentioned earlier).

Since matrix algorithms are typical of this class, we call the language the matrix data-flow language (MDFL). This wavefront language permits the array processor to be programmable, broadens the range of applications, and also makes possible the simulation and verification of parallel algorithms.

A. Matrix Data-Flow Language

There exist two approaches to programming the WAP: a local approach describing the actions of each processing element, and a global approach describing the actions of each wavefront. To allow the user to program the WAP in both of these fashions, two versions of MDFL are proposed: global and local MDFL.

A global MDFL program describes the algorithm from the viewpoint of a wavefront, while a local MDFL program describes the operations of an individual processor. More precisely, the perspective of a global MDFL programmer is of one wavefront passing across all the processors, while the perspective of a local MDFL programmer is that of one processor encountering a series of wavefronts.

From the macroscopic point of view, a higher level language, closer to the algorithm, is desired for reducing the heavy burden on the programmer. Global MDFL provides such a tool, as it is easier to view the algorithm as a series of wavefronts. At the microscopic level, each PE executes its own set of instructions and performs localized interprocessor transactions. Implementing a program on such a system requires transforming the high level description (in global MDFL) into a set of lower level programs (in local MDFL) for the individual processors.² We have developed a compiler to do such a mapping of global wavefront-oriented MDFL programs into local programs for individual processors.
`
B. MDFL Instruction Set

At each front of activity, the computational wavefront performs similar tasks in all the processors involved (cf. the matrix multiplication example of Section II and Section III-D on programming methodology). Hence, global (wavefront) instructions and local (processor) instructions are nearly identical. Most of the MDFL instruction set is, therefore, common to both global and local MDFL.

Table I is a complete list of the MDFL instruction repertoire. For the complete semantics and more detailed syntax, the reader is referred to a recent publication [10]. Unless otherwise specified, the following instructions belong to both global and local MDFL. The proposed instruction set is functionally complete, but improvements and modifications are still underway to incorporate additional features such as double precision, etc. There also appears to be room for the
`
² It suffices, in general, to describe a wavefront algorithm at four locations of the array: Corner, FirstRow, FirstColumn, and Interior PE's. Consequently, every global MDFL program gets compiled into four slightly different local programs.
`
`
TABLE I
MDFL INSTRUCTION SET

Data Transfer Instructions

FLOW <SOURCE REGISTER>, <DIRECTION>;
FETCH <DESTINATION REGISTER>, <DIRECTION>;
READ;

Recursion Oriented Instructions

REPEAT ... UNTIL TERMINATED;
WHILE WAVEFRONT IN ARRAY DO
  BEGIN ... END;
SET COUNT <NUMBER OF WAVEFRONTS>;
DECREMENT COUNT;
ENDPROGRAM.

Conditional Instructions

IF EQUAL THEN <STATEMENT>;
IF NOT-EQUAL THEN <STATEMENT>;
IF GREATER THEN <STATEMENT>;
IF LESS-THAN THEN <STATEMENT>;
IF <DIRECTION> DISABLED THEN <STATEMENT>;
CASE KIND =
  (1,1) : <STATEMENT>;
  (1,*) : <STATEMENT>;
  (*,1) : <STATEMENT>;
  INT   : <STATEMENT>;
ENDCASE;
`
Internal Processor Instructions

TSR  <SOURCE>, <DESTINATION>;
ADD  <SOURCE #1>, <SOURCE #2>, <DESTINATION>;
SUB  <SOURCE #1>, <SOURCE #2>, <DESTINATION>;
MULT <SOURCE #1>, <SOURCE #2>, <DESTINATION>;
DIV  <SOURCE #1>, <SOURCE #2>, <DESTINATION>;
SQRT <SOURCE>, <DESTINATION>;
CMP  <SOURCE #1>, <SOURCE #2>;
TST  <SOURCE>;
STORE;
NOP;
RESET;
BEGIN ... END;
DISABLE-SELF;

development of a higher level language suitable for the nonspecialist programmer.

C. Programming Methodology

In this subsection, we provide guidelines for programming in global MDFL. The most straightforward method of programming the WAP would be to explicitly spell out the actions of each wavefront at each of its (2n − 1) positions (fronts) (cf. Fig. 1). Nevertheless, the regularity and recursivity in almost all matrix algorithms allow us to assume the following.

1) Space Invariance: The tasks performed by a wavefront in a particular kind of processor must be identical at all (2n − 1) fronts.

2) Time Invariance: Recursions are identical.

Accordingly, global MDFL provides two repetitive constructs, the space repetitive construct

    WHILE WAVEFRONT IN ARRAY DO
      BEGIN (TASK T) END

(so that T is repeated at all fronts), and the time repetitive construct

    REPEAT (ONE RECURSION) UNTIL TERMINATED

(so that the same recursion is repeated).

The REPEAT construct is inherently concurrent in that successive wavefronts are pipelined through the array. As soon as the kth wavefront is propagated, the (1, 1) processor initiates the (k + 1)st wavefront.

To allow for more than one wavefront per recursion, the complete global MDFL program will have the syntax

    BEGIN
      SET COUNT ( );
      REPEAT
        (TASKS A);
        WHILE WAVEFRONT IN ARRAY DO
          BEGIN
            (TASKS B);
          END;
        WHILE WAVEFRONT IN ARRAY DO
          BEGIN
            (TASKS C);
          END;
        (TASKS D);
        DECREMENT COUNT;
      UNTIL TERMINATED;
    ENDPROGRAM.

Each recursion will execute the instructions within the REPEAT ... UNTIL construct. The number of recursions is set by SET COUNT. In this example, a recursion consists of two wavefronts. At the start, tasks A are performed only at the (1, 1) processor. The first wavefront of each recursion will perform tasks B at each of its (2n − 1) fronts. The second wave will execute tasks C in each of these fronts immediately after tasks B have been concluded. It should be noted that the number of different wavefronts within a recursion may vary from one application to another. At the end of each recursion, COUNT is decremented. When it becomes zero, TERMINATED is set and a "phase" of identical recursions is over.

The corresponding local MDFL program for interior processors (cf. Section III-A) will be

    REPEAT
      (TASKS B')
      (TASKS C')
    UNTIL TERMINATED;

where B' and C' are the compiled versions of B and C, with only relevant portions of the CASE statement extracted. The conversion of a global program into its local versions is thus fairly straightforward.

Certain syntax rules [10] are needed to ensure that there is no circular waiting for data between adjacent PE's. They also ensure that neighbors will not contend for the interprocessor bus at the same time. Concisely, the rules dictate that FETCHes in one direction precede (and equal in number) the FLOWs in the opposite direction, and that the data FETCHing precede any computation (cf. data-driven computation in Section IV-A). The complete proof of the claim that these rules will prevent deadlock and bus contention is omitted here [24].

Based on the above guidelines, a complete global MDFL program for matrix multiplication follows. More programming examples will be provided in Section V.
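As an informal cross-check of these guidelines, the dataflow of the wavefront matrix-multiplication program can be traced sequentially in Python. This is our own sketch, not MDFL semantics: the buffers stand in for the FETCH/FLOW transfers, and the front-by-front loop stands in for WHILE WAVEFRONT IN ARRAY.

```python
def wavefront_matmul(A, B):
    """Sequential trace of the wavefront matrix multiply:
    N recursions, each sweeping one wavefront across the array.

    a_buf[i][j] / b_buf[i][j] model the operands PE (i, j) would
    FETCH from its LEFT / UP neighbour (or from the boundary
    memory modules when j == 0 or i == 0)."""
    N = len(A)
    C = [[0.0] * N for _ in range(N)]
    for k in range(N):                      # SET COUNT N; REPEAT ... UNTIL
        a_buf = [[None] * N for _ in range(N)]
        b_buf = [[None] * N for _ in range(N)]
        for t in range(2 * N - 1):          # WHILE WAVEFRONT IN ARRAY DO
            for i in range(max(0, t - N + 1), min(t, N - 1) + 1):
                j = t - i                   # PEs on front t satisfy i + j = t
                a = A[i][k] if j == 0 else a_buf[i][j]  # FETCH A, LEFT
                b = B[k][j] if i == 0 else b_buf[i][j]  # FETCH B, UP
                if j + 1 < N:
                    a_buf[i][j + 1] = a     # FLOW A, RIGHT
                if i + 1 < N:
                    b_buf[i + 1][j] = b     # FLOW B, DOWN
                C[i][j] += a * b            # MULT; ADD
    return C
```

Since the boundary PEs read a_ik and b_kj from the memory modules and relay them unchanged, every PE (i, j) accumulates a_ik * b_kj in recursion k, reproducing (7).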
`
`
Program 1. Matrix multiplication.

Array Size:   N × N
Computation:  C = A × B
              kth wavefront: c_ij^(k) = c_ij^(k−1) + a_ik * b_kj,
              k = 1, 2, ..., N
Initial:      Matrix A is stored in the Memory Module (MM) on the left (stored row by row). Matrix B is in the MM on the top and is stored column by column.
Final:        The result will be in the C registers.

    BEGIN
      SET COUNT N;
      REPEAT;
        WHILE WAVEFRONT IN ARRAY DO
          BEGIN
            FETCH B, UP;
            FETCH A, LEFT;
            FLOW A, RIGHT;
            FLOW B, DOWN;
            ! C <- C + A*B ;
            MULT A, B, D;
            ADD C, D, C;
          END;
        DECREMENT COUNT;
      UNTIL TERMINATED;
    ENDPROGRAM.

(Comments are enclosed between "!" and ";".)

D. Summary of MDFL Features

Future VLSI multiprocessors must support massive concurrency to achieve a significant increase in performance; consequently, a base language for parallel computers must allow expression of concurrency of program execution on a large scale [20]. However, few languages support the idea of large scale concurrency, and only weak notions of locality exist in most of them [25].

It must be noted that a parallel program is not just an ensemble of separate programs for individual processors. More importantly, it should also define coordination between PE's, including their interdependence for data and the sequencing of their tasks. As such, it is a tall order.

At present, data-flow languages appear to be the best candidate as the base language of parallel computers. Data-flow languages are asynchronous, data driven, and algorithmic. They avoid centralized control and shared memory to achieve asynchrony and maximal concurrency [20], [26]-[29].

MDFL basically possesses all these properties of data-flow languages. It shares the principle of data-activated computation and its consequent advantages. In addition, MDFL has a rather distinctive feature of regularity, and is built around the notion of locality. It is very close to the algorithms; hence, MDFL programs are modular and easy to understand. MDFL permits the programmer to address a front of processors at the same time, instead of programming individual processors separately. Being wavefront oriented, it permits viewing the algorithm as the repetition of a computational sequence (i.e., the wavefront) progressing through the array. In short, the wavefront notion makes it possible to program an array of asynchronous processors in a simple fashion, and leads to simple, readable MDFL programs.

IV. WAVEFRONT ARCHITECTURE

The hardware of the processing array is designed to support MDFL. The main architectural considerations include four general aspects: 1) interprocessor communications, based on wavefront requirements; 2) the basic PE configuration; 3) interfacing with the host computer; and finally, 4) the extendability and reconfigurability issues.
`
A. Interprocessor Communication

The configuration of the processor array is as shown in the schematic diagram of Fig. 1. It provides for data and control communications between orthogonally adjacent processors and links to memory modules through the first row and first column of processors.

To simulate the phenomenon of wavefront propagation, the processors in the array must wait for a primary wavefront (of data), then perform the computation and, finally, act as a secondary source of new wavefronts. To implement this wait, processors are provided with data transfer buffers. Hence, a FETCHing of data involves an inherent WAITing for the buffer to be filled (DATA READY) by the adjacent data sourcing processor. Thus, if the software ensures that the processor always performs a data FETCH before the computation (cf. syntax rules of Section III-C), the processing will not be initiated until the arrival of the data wavefront (this is similar to the concept of data-flow machines [17]-[23], [26]-[29]). Each processor can FLOW data to the input buffers of the neighboring PE's, thus acting as a secondary source of data wavefronts (Huygens' principle). To avoid the overrunning of data wavefronts (in conformation with Huygens' principle), the processor hardware ensures that a processor cannot send new data to the buffer unless the old data have been used by the neighbor. Thus, the wavefront concept suggests that interprocessor communication employ buffers and "DATA READY/DATA USED" flags between adjacent processors.

The WAITs for wavefronts of data allow for globally asynchronous operation of processors, i.e., there is no need for global synchronization. Synchronization is a very critical issue in parallel processing, especially when one considers large scale systems. Two opposite timing schemes come to mind, namely, the synchronous and the asynchronous timing approaches. In the synchronous scheme, there is a global clock network which distributes the clocking signals over the entire chip. The global clock beats out the rhythm to which all the PE's in the array execute their tasks. In the basic synchronous configuration, all the PE's operate in unison, all performing the same identical operation. In contrast, the asynchronous scheme involves no global clock, and information transfer is by mutual convenience between each PE and its immediate neighbors. Whenever the data are available, the transmitting PE informs the receiver of the fact, and the receiver accepts the data when it needs them. It then conveys to the sender the information that the data have been used. This scheme can be implemented by means of a simple handshaking protocol [10], [30] (cf. Fig. 2).
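In software terms, the DATA READY/DATA USED handshake behaves like a one-slot blocking buffer between neighbors. The following Python sketch is our own analogy (the paper's mechanism is a hardware flip-flop protocol, cf. Fig. 2): a bounded queue of capacity one makes the sender wait until the old datum is USED and the receiver wait until a datum is READY.

```python
import queue
import threading

def make_link():
    """One-slot buffer between adjacent PEs: put() blocks while the
    previous datum is still unused; get() blocks until DATA READY."""
    return queue.Queue(maxsize=1)

def sender(link, items):
    for x in items:
        link.put(x)              # waits for DATA USED before overwriting

def receiver(link, n, out):
    for _ in range(n):
        out.append(link.get())   # waits for DATA READY

link, out = make_link(), []
t1 = threading.Thread(target=sender, args=(link, [1, 2, 3]))
t2 = threading.Thread(target=receiver, args=(link, 3, out))
t1.start(); t2.start(); t1.join(); t2.join()
# out now holds the data delivered strictly in order, with no overruns
```

The capacity-1 queue is what rules out overrunning: no matter how fast the sender runs, it can never be more than one datum ahead of the receiver, mirroring the hardware rule that new data cannot be written until the old data are used.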
B. PE Configuration

The proposed hardware for the processing elements of the WAP involves a simple modular structure. A basic architecture is schematically shown in Fig. 3. Each processor is a hardware interpreter of local MDFL, and reflects a relatively conventional architecture consisting of an internal program memory, a control unit, an ALU, and a set of registers. The only exception is the PE's interfacing with its neighbors. Every processing element has a bidirectional buffer and independent status and control flags to each of its four adjacent elements (cf. Fig. 3). These buffers are supported by an appropriate multiplexing subsystem, under the control unit's supervision. Except for the extended capabilities of the (1, 1) processor and the memory links of the first row and first column of processors, all of the processing elements are essentially identical. Hence, from both the software (cf. footnote 2, Section III-A) and hardware viewpoints, it suffices to group the PE's into four types: Corner, FirstRow, FirstCol, and Interior processors.

Fig. 2. The handshaking scheme. (Data transfer control flip-flop between PE(i) and PE(i+1); DATA SENT, DATA READY, DATA PROCESSED, and DATA USED signals.)

Fig. 3. Architecture of an INTerior PE. (Control unit; memory block with absolute loader; register block; IR — instruction register; SF — status flags; PMB — program memory buffer for loading; data I/O multiplexers and buffers; directional data busses.)

For signal processing applications, the most important requirements are the accuracy and speed of the arithmetic units. These are determined by extensive numerical error analysis of the algorithms under consideration. Fortunately, most matrix algorithms have a rather complete error analysis documentation [31], [32]. Another important factor in the selection of the ALU is the speed/hardware tradeoff. The decision between serial versus parallel and fixed- versus floating-point arithmetic will be affected by this consideration. In addition, the potential for concurrent execution and internal pipelining within the processing elements may be exploited for further speedup of the processing rate.

C. Interfacing with the Host Computer and Extendability

The interaction between the host computer and the WAP includes loading of programs and I/O of data before and after processing. All these transactions may be carried out by means of a smart interface, the nature of which is currently under investigation. This interfacing will minimize the host computer's involvement in the transaction, permitting efficient transfer of large blocks of data.

The word "extendability" refers to the capability of handling matrix operations of larger order with a smaller sized array processor. Software solutions to the extendability problem are often adopted via efficient partitioning (decomposition) of the algorithms' tasks (e.g., the divide-and-conquer method for matrix inversion [16], [33]) and intensive utilization of the host computer's memory.

In order to handle a larger matrix without incurring the great delay involved in accessing the host computer frequently, we propose a globally parallel, locally sequential (GPLS) scheme, which is explained below.

Suppose that the array size is N × N, and the size of the operand matrix A is N' × N', where N' = mN and m is an integer. Then a partitioning of A into submatrices of size m × m will be

    A = [A_ij],   i, j = 1, ..., N

and

    A_ij = [a_((i−1)m+k, (j−1)m+p)],   k, p = 1, ..., m.

Each processor then works on an m × m submatrix of data stored in its local memory. While the processors can work in parallel, an individual processor works on its m × m data sequentially, one after the other; hence, the term GPLS. We assert that the inclusion of local memory [two-dimensional array M(I, J) and linear array G(I)] will not affect the global interconnection. More curiously, global MDFL also needs only a minimal modification for the extension. An extended MDFL syntax and semantics have been developed to effectively (i.e., transparently) handle larger size matrix operations. A sample program for multiplication of matrices of larger size (mN × mN) is listed here.

Program 2. Extended matrix multiplication.

    BEGIN
      EQUIVALENCE (B, G(I));
      EQUIVALENCE (C, M);
      SET COUNT mN;
      REPEAT
        WHILE WAVEFRONT IN ARRAY DO
          SCAN BY ROW 1 TO m DO
            BEGIN
              FETCH B, UP;
              FETCH A, LEFT;
              MULT A, B, R;
              ADD C, R, C;
              FLOW A, RIGHT;
              FLOW B, DOWN;
              MOVE RIGHT, DOWN;
            END;
        DECREMENT COUNT;
      UNTIL TERMINATED;
    ENDPROGRAM.

Here, each PE scans its submatrix in local memory M(i, j) row by row, and C refers to M(i, j), where i and j are the current values of the scan counters I and J. Similarly, B refers to G(i).
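The GPLS block-partition formula A_ij = [a_((i−1)m+k, (j−1)m+p)] can be sanity-checked in a few lines of Python. This is our own sketch (0-based storage, 1-based block indices i, j as in the text; the function name is ours):

```python
def gpls_block(A, m, i, j):
    """Return submatrix A_ij of the GPLS partition of A.

    i, j are 1-based block indices, so that (0-based)
    A_ij[k][p] = A[(i-1)*m + k][(j-1)*m + p], k, p = 0 .. m-1."""
    r0, c0 = (i - 1) * m, (j - 1) * m
    return [row[c0:c0 + m] for row in A[r0:r0 + m]]

# Example: an mN x mN matrix with m = 2, N = 2, entries 0..15 row by row.
Af = [[r * 4 + c for c in range(4)] for r in range(4)]
```

For this Af, gpls_block(Af, 2, 2, 1) returns [[8, 9], [12, 13]], i.e., the lower-left 2 × 2 block that PE (2, 1) would hold in its local memory M.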
spectively. Each of the above three categories will be separately discussed in Sections V-A, V-B, and V-C, respectively. Ultimately, real system applications will be the only objective criterion for justifying and reconfirming the usefulness of the WAP. Investigations are, therefore, being conducted into its use in such diverse fields as image analysis, speech recognition, Kalman filtering, nonlinear filtering, adaptive filtering, smoothing, prediction, optimization, detection, spectrum analysis, and other advanced signal processing applications [12].

In fact, some recent reports have indicated a promising and encouraging trend. For instance, a computing structure similar to the WAP has been applied for parallel processing (for dynamic time warping [34]) in speech recognition. It has also been recently reported [13] that the wavefront concept can be effectively utilized to solve the singular value decomposition (SVD) problem, which is crucial for image processing and beam-forming applications. We are currently looking into an MDFL program for such an SVD algorithm. Finally, for a system application in real-time signal processing, a case of adaptive array processing systems for high resolution directivity pattern analysis is studied in an earlier publication [10]. There, a combined application of the WAP, MDFL, and modern spectrum analysis is presented.
`
`A. Matrix-Based Algorithms
`
`Exam