IEEE TRANSACTIONS ON COMPUTERS, VOL. C-31, NO. 11, NOVEMBER 1982

Wavefront Array Processor: Language,
Architecture, and Applications

SUN-YUAN KUNG, MEMBER, IEEE, K. S. ARUN, STUDENT MEMBER, IEEE, RON J. GAL-EZER, STUDENT MEMBER, IEEE, AND D. V. BHASKAR RAO, STUDENT MEMBER, IEEE
`
Abstract—This paper describes the development of a wavefront-based language and architecture for a programmable special-purpose multiprocessor array. Based on the notion of computational wavefront, the hardware of the processor array is designed to provide a computing medium that preserves the key properties of the wavefront. In conjunction, a wavefront language (MDFL) is introduced that drastically reduces the complexity of the description of parallel algorithms and simulates the wavefront propagation across the computing network. Together, the hardware and the language lead to a programmable wavefront array processor (WAP). The WAP blends the advantages of the dedicated systolic array and the general-purpose data-flow machine, and provides a powerful tool for the high-speed execution of a large class of matrix operations and related algorithms which have widespread applications.

Index Terms—Asynchrony, computational wavefront, concurrency, data-flow computing, matrix data-flow language, signal processing, systolic array, VLSI array processor, wavefront architecture.
`
I. INTRODUCTION

A. VLSI Signal Processing
`
WITH the rapidly growing microelectronics technology leading the way, modern signal processing is undergoing a major revolution. The past two decades have witnessed a steep increase in the complexity of computations, processing speed requirements, and the volume of data handled in various signal processing applications. The availability of low cost, high density, fast VLSI devices has opened a new avenue for implementing these increasingly sophisticated algorithms and systems [1], [2]. While a major improvement in device speed is foreseen, it is in no way comparable to the rate of increase of throughput required by modern real-time signal processing. In order to achieve such increases in throughput rate, the only effective solution appears to be highly concurrent processing. Consequently, it has become a major challenge to update the current signal processing techniques so as to maximally exploit their potential for concurrent execution.
In a broad sense, the answer to this challenge lies in a cross-disciplinary research encompassing the areas of algorithm analysis, parallel computer design, and system applications. In this paper, we therefore introduce an integrated research approach aimed at incorporating the vast VLSI computational capability into modern signal processing applications.

Manuscript received January 11, 1982; revised June 18, 1982. This work was supported in part by the Office of Naval Research under Contracts N00014-80-C-0457 and N00014-81-K-0191, by the National Science Foundation under Grant ECS-8046531, and by the Defense Advanced Research Projects Agency under Contract MDA903-79-C-0630.
The authors are with the Department of Electrical Engineering—Systems, University of Southern California, Los Angeles, CA 90089.

The traditional design of parallel computers and languages is deemed unsuitable for our purposes. It usually suffers from heavy supervisory overhead incurred by synchronization, communication, and scheduling tasks, which severely hampers the throughput rate that is critical to real-time signal processing. In fact, these are the key barriers inherent in very large scale computing structure design. Moreover, although VLSI provides the capability of implementing a large array of processors on one chip, it imposes its own constraints on the system. Large design and layout costs [2] suggest the utilization of a repetitive modular structure. In addition, communication, which costs the most in VLSI chips in terms of area, time, and energy, has to be restricted (to localized communication) [1]. In general, highly concurrent systems require this locality property in order to reduce interdependence and ensuing waiting delays that result from excessive communication [1]. This locality constraint prevents the utilization of centralized control and global synchronization. The resulting use of asynchronous distributed control and localized data flow is an effective approach to the design of very large scale, highly concurrent computing structures.
`
B. A Special-Purpose VLSI Array Processor
`
The above restrictions imposed by VLSI will render the general-purpose array processor rather inefficient. We therefore restrict ourselves to a special class of applications, i.e., recursive¹ and local data-dependent algorithms, to conform with the constraints imposed by VLSI. This restriction, however, incurs little loss of generality, as a great majority of signal processing algorithms possess these properties. One typical example is a class of matrix algorithms. It has recently been indicated that a major portion of the computational needs for signal processing and applied mathematical problems can, in fact, be reduced to a basic set of matrix operations and other related algorithms [3], [4]. Therefore, a special-purpose parallel machine for processing these typical computational algorithms will be cost effective and attractive in VLSI system design.
`
¹ In a recursive algorithm, all processors do nearly identical tasks, and each processor repeats a fixed set of tasks on sequentially available data.
`
0018-9340/82/1100-1054 © 1982 IEEE
`
`
`
`
Very significantly, these algorithms involve repeated application of relatively simple operations with regular localized data flow in a homogeneous computing network. This leads to an important notion of computational wavefront, which portrays the computation activities in a manner resembling a wave propagation phenomenon. More precisely, the recursive nature of the algorithm, in conjunction with the localized data dependency, points to a continuously advancing wave of data and computational activity. The computational sequence starts with one element and propagates through the processor array, closely resembling a physical wave phenomenon (cf. Fig. 1). In fact, all algorithms that have locality and recursivity (and are thus implementable on our VLSI array processor) will exhibit this wave phenomenon. Therefore, the notion of a computational wavefront has attracted the attention of many researchers [2], [5]-[13].

The wavefront concept provides a firm theoretical foundation for the design of highly parallel array processors and concurrent languages. In addition, this concept appears to have some distinct advantages.

First, the wavefront notion drastically reduces the complexity in the description of parallel algorithms. The mechanism provided for this description is a special-purpose, wavefront-oriented language [7], [9], [10]. Rather than requiring a program for each processor in the array, this language allows the programmer to address an entire front of processors.

Second, the wavefront notion leads to a wavefront-based architecture which preserves Huygens' principle [14], and ensures that wavefronts never intersect. Therefore, a wavefront architecture can provide asynchronous waiting capability and, consequently, can cope with timing uncertainties, such as local clocking, random delay in communications, and fluctuations of computing times. In short, the notion lends itself to an (asynchronous) data-flow computing structure that conforms well with the constraints of VLSI.

The integration of the wavefront concept, the wavefront language, and the wavefront architecture leads to a programmable computing network, which we will call the wavefront array processor (WAP). The WAP is, in a sense, an optimal tradeoff between the globally synchronized and dedicated systolic array [1], [15], [16] (that works on a similar set of algorithms), and the general-purpose data-flow multiprocessors [17]-[23]. It provides a powerful tool for the high-speed execution of a large class of algorithms which have widespread applications.

C. Organization

The organization of the rest of the paper is as follows. Section II elaborates on the computational wavefront. Section III proposes a wavefront-oriented language (MDFL) for programming the WAP, and provides a programming methodology. Section IV explains a possible hardware organization and architecture for the WAP. Section V illustrates some applications of the WAP by means of MDFL programs. Section VI compares the WAP to other array processors, such as the systolic array, data-flow multiprocessors, and conventional SIMD arrays like the Illiac IV.

II. CONCEPT OF COMPUTATIONAL WAVEFRONT

The wavefront array processor is configured in a square array of N x N processing elements with regular and local interconnections (cf. Fig. 1).

Fig. 1. The WAP configuration. (N x N processor array with memory modules along the top and left edges; the legend marks the first and second wavefronts advancing along anti-diagonals of the array.)

The computing network serves as a (data) wave-propagating medium. The notion of computational wavefront can be best explained by a simple example. To this end, we shall consider matrix multiplication as being representative. Let A = {a_ij}, B = {b_ij}, and C = A x B = {c_ij} all be N x N matrices. The matrix A can be decomposed into columns A_i and matrix B into rows B_j, and therefore,

C = A_1 * B_1 + A_2 * B_2 + ... + A_N * B_N.    (1)

The matrix multiplication can then be carried out in N recursions, executing

C^(k) = C^(k-1) + A_k * B_k    (2)

recursively for k = 1, 2, ..., N.

The exploitation of parallelism is now evident with the availability of N x N processors. A parallel algorithm for matrix multiplication is fairly simple. However, most existing parallel programs would need global interconnections in the computing network, while localized interconnections and data flow are much more desirable in VLSI systems.

The topology of the matrix multiplication algorithm can be mapped naturally onto the square, orthogonal N x N matrix array of the WAP (cf. Fig. 1). To create a smooth data movement in a localized communication network, we make use of the computational wavefront concept. For the purpose of this example, a wavefront in the processing array will correspond to a mathematical recursion in the algorithm. Successive pipelining of the wavefronts will accomplish the computation of all recursions.

As an example, the computational wavefront for the first recursion in matrix multiplication will now be examined.
`
`
`
Suppose that the registers of all the processing elements (PE's) are initially set to zero:

c_ij^(0) = 0   for all (i, j);

the entries of A are stored in the memory modules to the left (in columns), and those of B in the memory modules on the top (in rows). The process starts with PE (1, 1), where

c_11^(1) = c_11^(0) + a_11 * b_11    (3)

is computed. The computational activity then propagates to the neighboring PE's (1, 2) and (2, 1), which will execute

c_12^(1) = c_12^(0) + a_11 * b_12    (4)

and

c_21^(1) = c_21^(0) + a_21 * b_11.    (5)

The next front of activity will be at PE's (3, 1), (2, 2), and (1, 3), thus creating a computation wavefront traveling down the processor array. This computational wavefront is similar to optical wavefronts (they both obey Huygens' principle) since each processor acts as a secondary source and is responsible for the propagation of the wavefront. It may be noted that wave propagation implies localized data flow. Once the wavefront sweeps through all the cells, the first recursion is over. As the first wave propagates, we can execute an identical second recursion in parallel by pipelining a second wavefront immediately after the first one. For example, the (1, 1) processor will execute

c_11^(2) = c_11^(1) + a_12 * b_21    (6)

and so on. In general, the (i, j)th processor will execute the kth recursion,

c_ij^(k) = a_i1 * b_1j + a_i2 * b_2j + ... + a_ik * b_kj.    (7)

The pipelining is feasible because the wavefronts of two successive recursions will never intersect (Huygens' wavefront principle), as the processors executing the recursions at any given instant will be different, thus avoiding any contention problems.

Locality, regularity, recursivity, and concurrency lead to the wavefront phenomenon. Thus, all algorithms which possess these properties will exhibit computational wavefronts. The notion of computational wavefronts leads to a wavefront-based language (a modified data-flow language) to program the processor array. This language (called the matrix data-flow language) is especially powerful for matrix and other related algorithms. It will be developed in the next section.

III. A WAVEFRONT LANGUAGE: MATRIX DATA-FLOW LANGUAGE

In contrast to the heavy burden of scheduling, resource sharing, as well as the control of processor interactions encountered in programming a general-purpose multiprocessor, the computational wavefront notion can facilitate the description of parallel algorithms and drastically reduce the complexity of parallel programming. In this section, we introduce a wavefront language that is tailored towards the description of computational wavefronts and the corresponding data flow in a large class of algorithms (which exhibit the recursivity and locality mentioned earlier). Since matrix algorithms are typical of this class, we call the language the matrix data-flow language (MDFL). This wavefront language permits the array processor to be programmable, broadens the range of applications, and also makes possible the simulation and verification of parallel algorithms.

A. Matrix Data-Flow Language

There exist two approaches to programming the WAP: a local approach describing the actions of each processing element, and a global approach describing the actions of each wavefront. To allow the user to program the WAP in both of these fashions, two versions of MDFL are proposed: global and local MDFL.

A global MDFL program describes the algorithm from the viewpoint of a wavefront, while a local MDFL program describes the operations of an individual processor. More precisely, the perspective of a global MDFL programmer is of one wavefront passing across all the processors, while the perspective of a local MDFL programmer is that of one processor encountering a series of wavefronts.

From the macroscopic point of view, a higher level language, closer to the algorithm, is desired for reducing the heavy burden on the programmer. Global MDFL provides such a tool, as it is easier to view the algorithm as a series of wavefronts. At the microscopic level, each PE executes its own set of instructions and performs localized interprocessor transactions. Implementing a program on such a system requires transforming the high level description (in global MDFL) into a set of lower level programs (in local MDFL) for the individual processors.² We have developed a compiler to do such a mapping of global wavefront-oriented MDFL programs into local programs for individual processors.
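Before turning to the instruction set, it may help to fix the computation that the wavefronts of Section II carry. The following Python sketch is our illustration (it is not part of the MDFL toolchain): it performs the N rank-one recursions C^(k) = C^(k-1) + A_k * B_k, one per wavefront, and checks the result against a direct product.

```python
import numpy as np

def wavefront_matmul(A, B):
    """C = A x B computed as N recursions C(k) = C(k-1) + A_k * B_k,
    where A_k is the k-th column of A and B_k is the k-th row of B.
    Each recursion is the work carried by one computational wavefront."""
    N = A.shape[0]
    C = np.zeros((N, N))
    for k in range(N):                    # k-th wavefront / recursion
        C += np.outer(A[:, k], B[k, :])   # rank-one update A_k * B_k
    return C

A = np.arange(9.0).reshape(3, 3)
B = np.arange(9.0).reshape(3, 3).T
assert np.allclose(wavefront_matmul(A, B), A @ B)
```

Within one recursion, PE (i, j) contributes only the single term a_ik * b_kj of the rank-one update; the sketch collapses the whole front into one array operation.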
`
B. MDFL Instruction Set
`
At each front of activity, the computational wavefront performs similar tasks in all the processors involved (cf. the Section II matrix multiplication example and Section III-D on programming methodology). Hence, global (wavefront) instructions and local (processor) instructions are nearly identical. Most of the MDFL instruction set is, therefore, common to both global and local MDFL.
Table I is a complete list of the MDFL instruction repertoire. For the complete semantics and more detailed syntax, the reader is referred to a recent publication [10]. Unless otherwise specified, the following instructions belong to both global and local MDFL. The proposed instruction set is functionally complete, but improvements and modifications are still underway to incorporate additional features such as double precision, etc. There also appears to be room for the
`
² It suffices, in general, to describe a wavefront algorithm at four locations of the array: Corner, FirstRow, FirstColumn, and Interior PE's. Consequently, every global MDFL program gets compiled into four slightly different local programs.
`
`
`
`
`
TABLE I
MDFL INSTRUCTION SET

Data Transfer Instructions

FLOW <SOURCE REGISTER>, <DIRECTION>;
FETCH <DESTINATION REGISTER>, <DIRECTION>;
READ;

Recursion Oriented Instructions

REPEAT ... UNTIL TERMINATED;
WHILE WAVEFRONT IN ARRAY DO BEGIN ... END;
SET COUNT <NUMBER OF WAVEFRONTS>;
DECREMENT COUNT;
ENDPROGRAM.

Conditional Instructions

IF EQUAL THEN <STATEMENT>;
IF NOT-EQUAL THEN <STATEMENT>;
IF GREATER THEN <STATEMENT>;
IF LESS-THAN THEN <STATEMENT>;
IF <DIRECTION> DISABLED THEN <STATEMENT>;
CASE KIND =
  (1,1) : <STATEMENT>;
  (1,*) : <STATEMENT>;
  (*,1) : <STATEMENT>;
  INT   : <STATEMENT>;
ENDCASE;

Internal Processor Instructions

TSR  <SOURCE>, <DESTINATION>;
ADD  <SOURCE #1>, <SOURCE #2>, <DESTINATION>;
SUB  <SOURCE #1>, <SOURCE #2>, <DESTINATION>;
MULT <SOURCE #1>, <SOURCE #2>, <DESTINATION>;
DIV  <SOURCE #1>, <SOURCE #2>, <DESTINATION>;
SQRT <SOURCE>, <DESTINATION>;
CMP  <SOURCE #1>, <SOURCE #2>;
TST  <SOURCE>;
STORE;
NOP;
RESET;
BEGIN ... END;
DISABLE-SELF;
`
development of a higher level language suitable for the nonspecialist programmer.
`
C. Programming Methodology

In this subsection, we provide guidelines for programming in global MDFL. The most straightforward method of programming the WAP would be to explicitly spell out the actions of each wavefront at each of its (2n - 1) positions (fronts) (cf. Fig. 1). Nevertheless, the regularity and recursivity in almost all matrix algorithms allow us to assume the following.
1) Space Invariance: The tasks performed by a wavefront in a particular kind of processor must be identical at all (2n - 1) fronts.
2) Time Invariance: Recursions are identical.
Accordingly, global MDFL provides two repetitive constructs, the space repetitive construct

WHILE WAVEFRONT IN ARRAY DO
BEGIN <TASK T> END

(so that T is repeated at all fronts), and the time repetitive construct

REPEAT <ONE RECURSION> UNTIL TERMINATED

(so that the same recursion is repeated).
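Read operationally, the two constructs nest: the time-repetitive REPEAT loop encloses the space-repetitive WHILE WAVEFRONT loop. The small Python sketch below is our illustration of that nesting (the function and names are ours, not MDFL); each front is taken as one anti-diagonal of an n x n array.

```python
def run_phase(n, num_wavefronts, task):
    """Sketch of the nesting of the two global-MDFL repetitive constructs.
    Outer loop: REPEAT ... UNTIL TERMINATED (time invariance: identical
    recursions). Inner loop: WHILE WAVEFRONT IN ARRAY DO (space invariance:
    the same task at each of the 2n-1 fronts)."""
    for recursion in range(num_wavefronts):   # REPEAT ... UNTIL TERMINATED
        for front in range(2 * n - 1):        # WHILE WAVEFRONT IN ARRAY DO
            # PEs (i, j) with i + j == front lie on this anti-diagonal front
            pes = [(i, front - i) for i in range(n) if 0 <= front - i < n]
            for pe in pes:
                task(recursion, front, pe)

log = []
run_phase(2, 1, lambda r, f, pe: log.append((f, pe)))
assert log == [(0, (0, 0)), (1, (0, 1)), (1, (1, 0)), (2, (1, 1))]
```

The inner loop makes the space invariance explicit: the same task body is invoked at every front, parameterized only by the PE position.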
The REPEAT construct is inherently concurrent in that successive wavefronts are pipelined through the array. As soon as the kth wavefront is propagated, the (1, 1) processor initiates the (k + 1)st wavefront.
To allow for more than one wavefront per recursion, the complete global MDFL program will have the syntax

BEGIN
SET COUNT < >;
REPEAT
  <TASKS A>;
  WHILE WAVEFRONT IN ARRAY DO
  BEGIN
    <TASKS B>;
  END;
  WHILE WAVEFRONT IN ARRAY DO
  BEGIN
    <TASKS C>;
  END;
  <TASKS D>;
  DECREMENT COUNT;
UNTIL TERMINATED;
ENDPROGRAM.
`
Each recursion will execute the instructions within the REPEAT ... UNTIL construct. The number of recursions is set by SET COUNT. In this example, a recursion consists of two wavefronts. At the start, tasks A are performed only at the (1, 1) processor. The first wavefront of each recursion will perform tasks B at each of its (2n - 1) fronts. The second wave will execute tasks C in each of these fronts immediately after tasks B have been concluded. It should be noted that the number of different wavefronts within a recursion may vary from one application to another. At the end of each recursion, COUNT is decremented. When it becomes zero, TERMINATED is set and a "phase" of identical recursions is over.
The corresponding local MDFL program for interior processors (cf. Section III-A) will be
`
REPEAT
  <TASKS B'>
  <TASKS C'>
UNTIL TERMINATED;
`
where B' and C' are the compiled versions of B and C, with only the relevant portions of the CASE statement extracted. The conversion of a global program into its local versions is thus fairly straightforward.
Certain syntax rules [10] are needed to ensure that there is no circular waiting for data between adjacent PE's. They also ensure that neighbors will not contend for the interprocessor bus at the same time. Concisely, the rules dictate that FETCHes in one direction precede (and equal in number) the FLOWs in the opposite direction, and that the data FETCHing precede any computation (cf. data-driven computation in Section IV-A). The complete proof of the claim that these rules will prevent deadlock and bus contention is omitted here [24].
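These ordering rules are mechanically checkable. The toy checker below is our illustration (the tuple encoding of instructions is assumed, not from the paper): it rejects task bodies in which a FLOW is not preceded by the matching number of FETCHes from the opposite direction, or in which a FETCH follows a compute instruction.

```python
# Toy checker (illustrative only) for the two syntax rules quoted above.
OPPOSITE = {"UP": "DOWN", "DOWN": "UP", "LEFT": "RIGHT", "RIGHT": "LEFT"}

def obeys_rules(task):
    """task is a list of tuples like ("FETCH", reg, direction),
    ("FLOW", reg, direction), or a compute op such as ("ADD", ...)."""
    seen_compute = False
    fetches, flows = {}, {}
    for op, *args in task:
        if op == "FETCH":
            if seen_compute:
                return False                 # FETCH must precede computation
            d = args[1]
            fetches[d] = fetches.get(d, 0) + 1
        elif op == "FLOW":
            d = args[1]
            flows[d] = flows.get(d, 0) + 1
            if flows[d] > fetches.get(OPPOSITE[d], 0):
                return False                 # FLOW lacks a matching prior FETCH
        else:
            seen_compute = True
    return True

ok_task = [("FETCH", "B", "UP"), ("FETCH", "A", "LEFT"),
           ("FLOW", "A", "RIGHT"), ("FLOW", "B", "DOWN"),
           ("MULT", "A", "B", "D"), ("ADD", "C", "D", "C")]
assert obeys_rules(ok_task)
assert not obeys_rules([("MULT", "A", "B", "D"), ("FETCH", "A", "LEFT")])
```

The accepted example is the body of the matrix multiplication wavefront: A is fetched from the LEFT before it flows RIGHT, B is fetched from UP before it flows DOWN, and all computation comes last.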
Based on the above guidelines, a complete global MDFL program for matrix multiplication follows. More programming examples will be provided in Section V.
`
`
`
`
`
Program 1. Matrix multiplication.

Array Size:    N x N
Computation:   C = A x B
kth wavefront: c_ij^(k) = c_ij^(k-1) + a_ik * b_kj,  k = 1, 2, ..., N
Initial:       Matrix A is stored in the Memory Modules (MM) on the left (stored row by row). Matrix B is in the MM on the top and is stored column by column.
Final:         The result will be in the C registers.

BEGIN
SET COUNT N;
REPEAT;
WHILE WAVEFRONT IN ARRAY DO
BEGIN
  FETCH B, UP;
  FETCH A, LEFT;
  FLOW A, RIGHT;
  FLOW B, DOWN;
  % C <- C + A*B; %
  MULT A, B, D;
  ADD C, D, C;
END;
DECREMENT COUNT;
UNTIL TERMINATED;
ENDPROGRAM.

* Comments are enclosed between '%' and ';'.
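The schedule that Program 1 induces can also be examined numerically. The Python sketch below is ours, under the idealized assumption that every front advances one anti-diagonal per time step and that wavefront k is launched k steps after wavefront 0; it lists the PEs occupied by each wavefront and confirms the no-intersection property of Section II.

```python
def active_pes(n, k, t):
    """PEs occupied by wavefront k (k = 0, 1, ...) at time step t on an
    n x n array, assuming each front advances one anti-diagonal per step
    and wavefront k starts k steps after wavefront 0 (idealized timing)."""
    front = t - k                          # anti-diagonal reached by wavefront k
    if not 0 <= front <= 2 * (n - 1):
        return set()                       # wavefront not yet launched, or done
    return {(i, front - i) for i in range(n) if 0 <= front - i < n}

n = 4
for t in range(2 * n + 2):
    fronts = [active_pes(n, k, t) for k in range(n)]
    for a in range(len(fronts)):
        for b in range(a + 1, len(fronts)):
            assert not (fronts[a] & fronts[b])   # wavefronts never intersect
```

Distinct wavefronts occupy distinct anti-diagonals at every instant, which is the discrete form of the Huygens-style non-intersection argument used to justify the pipelining.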
`
D. Summary of MDFL Features

Future VLSI multiprocessors must support massive concurrency to achieve a significant increase in performance; consequently, a base language for parallel computers must allow expression of concurrency of program execution on a large scale [20]. However, few languages support the idea of large scale concurrency, and only weak notions of locality exist in most of them [25].
It must be noted that a parallel program is not just an ensemble of separate programs for individual processors. More importantly, it should also define coordination between PE's, including their interdependence for data and the sequencing of their tasks. As such, it is a tall order.
At present, data-flow languages appear to be the best candidate as the base language of parallel computers. Data-flow languages are asynchronous, data driven, and algorithmic. They avoid centralized control and shared memory to achieve asynchrony and maximal concurrency [20], [26]-[29].
MDFL basically possesses all these properties of data-flow languages. It shares the principle of data-activated computation and its consequent advantages. In addition, MDFL has a rather distinctive feature of regularity, and is built around the notion of locality. It is very close to the algorithms; hence, MDFL programs are modular and easy to understand. MDFL permits the programmer to address a front of processors at the same time, instead of programming individual processors separately. Being wavefront oriented, it permits viewing the algorithm as the repetition of a computational sequence (i.e., the wavefront) progressing through the array. In short, the wavefront notion makes it possible to program an array of asynchronous processors in a simplistic fashion, and leads to simple, readable MDFL programs.
`
IV. WAVEFRONT ARCHITECTURE

The hardware of the processing array is designed to support MDFL. The main architectural considerations include four general aspects: 1) interprocessor communications, based on wavefront requirements; 2) the basic PE configuration; 3) interfacing with the host computer; and finally, 4) the extendability and reconfigurability issues.
`
A. Interprocessor Communication
`
The configuration of the processor array is as shown in the schematic diagram of Fig. 1. It provides for data and control communications between orthogonally adjacent processors and links to memory modules through the first row and first column of processors.
To simulate the phenomenon of wavefront propagation, the processors in the array must wait for a primary wavefront (of data), then perform their computation and, finally, act as a secondary source of new wavefronts. To implement this wait, processors are provided with data transfer buffers. Hence, a FETCHing of data involves an inherent WAITing for the buffer to be filled (DATA READY) by the adjacent data sourcing processor. Thus, if the software ensures that the processor always performs data FETCH before the computation (cf. syntax rules of Section III-C), the processing will not be initiated until the arrival of the data wavefront (this is similar to the concept of data-flow machines [17]-[23], [26]-[29]). Each processor can FLOW data to the input buffers of the neighboring PE's, thus acting as a secondary source of data wavefronts (Huygens' principle). To avoid the overrunning of data wavefronts (in conformance with Huygens' principle), the processor hardware ensures that a processor cannot send new data to the buffer unless the old data have been used by the neighbor. Thus, the wavefront concept suggests that interprocessor communication employ buffers and "DATA READY"/"DATA USED" flags between adjacent processors.
The WAITs for wavefronts of data allow for globally asynchronous operation of processors, i.e., there is no need for global synchronization. Synchronization is a very critical issue in parallel processing, especially when one considers large scale systems. Two opposite timing schemes come to mind, namely, the synchronous and the asynchronous timing approaches. In the synchronous scheme, there is a global clock network which distributes the clocking signals over the entire chip. The global clock beats out the rhythm to which all the PE's in the array execute their tasks. In the basic synchronous configuration, all the PE's operate in unison, all performing the same identical operation. In contrast, the asynchronous scheme involves no global clock, and information transfer is by mutual convenience between each PE and its immediate neighbors. Whenever the data are available, the transmitting PE informs the receiver of the fact, and the receiver accepts the data when it needs them. It then conveys to the sender the information that the data have been used. This scheme can be implemented by means of a simple handshaking protocol [10], [30] (cf. Fig. 2).
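The handshake of Fig. 2 is essentially a one-slot producer/consumer buffer. The following Python sketch is our illustration (the class and thread structure are assumed, not taken from [10] or [30]): it models the DATA READY and DATA USED flags with two semaphores, so the sender blocks until the old datum has been used and the receiver blocks until a new datum is ready.

```python
import threading

class Buffer:
    """One-slot buffer between two PEs with DATA READY / DATA USED flags."""
    def __init__(self):
        self.slot = None
        self.ready = threading.Semaphore(0)   # DATA READY: initially not ready
        self.used = threading.Semaphore(1)    # DATA USED: slot initially free

    def flow(self, value):          # sender side (FLOW)
        self.used.acquire()         # wait until neighbor has consumed old data
        self.slot = value
        self.ready.release()        # raise DATA READY

    def fetch(self):                # receiver side (FETCH)
        self.ready.acquire()        # wait for DATA READY
        value = self.slot
        self.used.release()         # raise DATA USED
        return value

buf, out = Buffer(), []
rx = threading.Thread(target=lambda: out.extend(buf.fetch() for _ in range(3)))
rx.start()
for v in (1, 2, 3):
    buf.flow(v)                     # blocks until the previous datum is used
rx.join()
assert out == [1, 2, 3]
```

Because flow blocks on the DATA USED flag, a fast sender can never overrun a slow receiver, which is exactly the property used above to keep wavefronts from overtaking one another.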
`
B. PE Configuration
`
`The proposed hardware for the processing elements of the
`
`
`
`
Fig. 2. The handshaking scheme. (PE(i) and PE(i+1) are coupled by a data buffer and a data-transfer-control flip-flop carrying the DATA READY and DATA USED flags.)

WAP involves a simple modular structure. A basic architecture is schematically shown in Fig. 3. Each processor is a hardware interpreter of local MDFL, and reflects a relatively conventional architecture consisting of an internal program memory, a control unit, an ALU, and a set of registers. The only exception is the PE's interfacing with its neighbors. Every processing element has a bidirectional buffer and independent status and control flags to each of its four adjacent elements (cf. Fig. 3). These buffers are supported by an appropriate multiplexing subsystem, under the control unit's supervision. Except for the extended capabilities of the (1, 1) processor and the memory links of the first row and first column of processors, all of the processing elements are essentially identical. Hence, from both the software (cf. footnote 2, Section III-A) and hardware viewpoints, it suffices to group the PE's into four types: Corner, FirstRow, FirstCol, and Interior processors.

Fig. 3. Architecture of an INTerior PE. (Control buses, control memory unit, block absolute loader, register block, bidirectional data buffers, and register multiplexers and buffers; IB = instruction/data I/O register, SF = status flags and buffers, PMB = program memory buffer for loading.)

For signal processing applications, the most important requirements are the accuracy and speed of the arithmetic units. These are determined by extensive numerical error analysis of the algorithms under consideration. Fortunately, most matrix algorithms have a rather complete error analysis documentation [31], [32]. Another important factor in the selection of the ALU is the speed/hardware tradeoff. The decision between serial versus parallel and fixed- versus floating-point arithmetic will be affected by this consideration. In addition, the potential for concurrent execution and internal pipelining within the processing elements may be exploited for further speedup of the processing rate.

C. Interfacing with the Host Computer and Extendability

The interaction between the host computer and the WAP includes loading of programs and I/O of data before and after processing. All these transactions may be carried out by means of a smart interface, the nature of which is currently under investigation. This interfacing will minimize the host computer's involvement in the transaction, permitting efficient transfer of large blocks of data.

The word "extendability" refers to the capability of handling matrix operations of larger order with a smaller sized array processor. Software solutions to the extendability problem are often adopted via efficient partitioning (decomposition) of the algorithms' tasks (e.g., the divide-and-conquer method for matrix inversion [16], [33]) and intensive utilization of the host computer's memory.

In order to handle a larger matrix without incurring the great delay involved in accessing the host computer frequently, we propose a globally parallel, locally sequential (GPLS) scheme, which is explained below.

Suppose that the array size is N x N, and the size of the operand matrix A is N' x N', where N' = mN and m is an integer. Then a partitioning of A into submatrices of size m will be

A = [A_ij]

and

A_ij = [a_((i-1)m+k, (j-1)m+p)],   k, p = 1, ..., m.

Each processor then works on an m x m submatrix of data stored in its local memory. While the processors can work in parallel, an individual processor works on its m x m data sequentially, one element after the other; hence, the term GPLS. We assert that the inclusion of local memory [two-dimensional array M(i, j) and linear array G(i)] will not affect the global interconnection. More crucially, global MDFL also needs only a minimal modification for the extension. An extended MDFL syntax and semantics have been developed to effectively (i.e., transparently) handle larger size matrix operations. A sample program for multiplication of matrices of larger size (mN x mN) is listed here.

Program 2. Extended matrix multiplication.

EQUIVALENCE (C, M(I, J));
EQUIVALENCE (B, G(I));
SET COUNT mN;
REPEAT
WHILE WAVEFRONT IN ARRAY DO
  SCAN BY ROW 1 TO m DO
  BEGIN
    FETCH B, UP;
    FETCH A, LEFT;
    MULT A, B, D;
    ADD C, D, C;
    FLOW A, RIGHT;
    FLOW B, DOWN;
    MOVE RIGHT, DOWN;
  END;
DECREMENT COUNT;
UNTIL TERMINATED;
ENDPROGRAM.

Here, each PE scans its submatrix in local memory M(I, J) row by row, and C refers to M(i, j) where i and j are the current values of the scan counters I and J. Similarly, B refers to G(I).
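The GPLS scheme amounts to ordinary block matrix multiplication, with the block loops distributed over PEs and the within-block work done sequentially. The Python sketch below is our illustration (the function name and looping order are ours): it partitions N' x N' operands into m x m blocks on an n x n array and checks the result against a direct product.

```python
import numpy as np

def gpls_matmul(A, B, n):
    """Globally parallel, locally sequential (GPLS) sketch: an n x n array of
    PEs multiplies N' x N' matrices (N' = m*n); PE (I, J) accumulates its
    m x m block C_IJ = sum over K of A_IK @ B_KJ from its local memory."""
    Np = A.shape[0]
    assert Np % n == 0
    m = Np // n
    s = lambda I: slice(I * m, (I + 1) * m)  # row/column range of block I
    C = np.zeros_like(A)
    for I in range(n):           # "globally parallel": one (I, J) pair per PE
        for J in range(n):
            for K in range(n):   # "locally sequential" block recursion per PE
                C[s(I), s(J)] += A[s(I), s(K)] @ B[s(K), s(J)]
    return C

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 6))
B = rng.normal(size=(6, 6))
assert np.allclose(gpls_matmul(A, B, 3), A @ B)
```

The inner K loop plays the role of the wavefront recursion at the block level, while the scan over the m x m entries of each block (implicit here in the dense block product) is the sequential local work.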
`
`
`
spectively. Each of the above three categories will be separately discussed in Sections V-A, V-B, and V-C, respectively. Ultimately, real system applications will be the only objective criterion for justifying and reconfirming the usefulness of the WAP. Investigations are, therefore, being conducted into its use in such diverse fields as image analysis, speech recognition, Kalman filtering, nonlinear filtering, adaptive filtering, smoothing, prediction, optimization, detection, spectrum analysis, and other advanced signal processing applications [12].
In fact, some recent reports have indicated a promising and encouraging trend. For instance, a computing structure similar to the WAP has been applied for parallel processing (for dynamic time warping [34]) in speech recognition. It has also been recently reported [13] that the wavefront concept can be effectively utilized to solve the singular value decomposition (SVD) problem, which is crucial for image processing and beam-forming applications. We are currentl